#nullranges

2018-08-14

Michael Love (09:09:38): > @Michael Love has joined the channel

Kasper D. Hansen (09:09:38): > @Kasper D. Hansen has joined the channel

Hervé Pagès (09:09:38): > @Hervé Pagès has joined the channel

Michael Love (09:11:58): > instead of copy pasting code, what do you think about starting a github repo where we each can work on different implementations?

Michael Love (09:13:00): > if this becomes a new Bioc package, what’s the umbrella purpose: shuffling ranges? bedtools has a shuffle command, so it seems that word is probably known to the bioinf community, whereas “block bootstrap” or “stationary bootstrap” is probably opaque to many

Ben Johnson (09:13:29): > @Ben Johnson has joined the channel

Kasper D. Hansen (10:21:32): > well the goal is to compute null distributions for various problems expressed in ranges

Kasper D. Hansen (10:22:15): > For example, my guess is that even for something like findOverlaps() you can have many different ways of computing nulls, depending on what the ranges represent (and perhaps your goal)

Kasper D. Hansen (10:22:51): > The hard thing here is that shuffling can be very specific to what the ranges represent. Which is good

Kasper D. Hansen (10:22:58): > And we want different approaches to this

Kasper D. Hansen (10:23:34): > I will need to read the block bootstrap, but I strongly doubt that it will solve everything

Kasper D. Hansen (10:24:03): > For example, for ranges coming from methylation, a shuffling needs to reflect that methylation ultimately only happens at CpGs, and CpGs are not randomly distributed

Kasper D. Hansen (10:24:24): > Or say you want “random” ranges which all overlap promoters.

Michael Love (10:28:46): > makes me think of ShuffleRanges or NullRanges

Michael Love (10:29:04): > is the output of interest basically the shuffled set of ranges?

Michael Love (10:29:48): > and then the user has to do for (i in seq_len(B)) findOverlaps(x, boots[[i]]) ...

Michael Love (10:30:32): > or do you want to consider a convenience function so that this is going on under the hood

Michael Love (10:31:55): > the block bootstrap won’t solve everything, yes. whitelisted and blacklisted sites are very important. the reason that Bickel and others focus on block is that the theory is built around the with-replacement potentially-overlapping blocks
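
To make the with-replacement, potentially-overlapping blocks idea concrete, here is a minimal base-R sketch of a block bootstrap on a binary sequence, in the spirit of the discussion above. This is illustrative only; the function name and details are made up, not nullranges code.

```r
# Minimal base-R sketch of a block bootstrap (illustrative, not nullranges code).
# Blocks are sampled with replacement and may overlap in the original sequence.
block_bootstrap <- function(x, block_size) {
  n <- length(x)
  n_blocks <- ceiling(n / block_size)
  # sample block start positions with replacement
  starts <- sample.int(n - block_size + 1, n_blocks, replace = TRUE)
  # lay the sampled blocks end to end and trim to the original length
  idx <- as.vector(outer(0:(block_size - 1), starts, "+"))
  x[idx[seq_len(n)]]
}

set.seed(1)
x <- rbinom(200, 1, 0.2)
xb <- block_bootstrap(x, block_size = 20)
length(xb) == length(x)  # TRUE: bootstrap sample has the same length
```

Because whole blocks are copied, local correlation within a block is preserved, which is the point of the block bootstrap versus naive per-feature permutation.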

Michael Love (12:50:55): > Bernat pointed to regioneR on the devel list

Michael Love (12:51:13): > For reference, here’s the existing function: > > randomizeRegions Randomize Regions > > Description > > Given a set of regions A and a genome, this function returns a new set of regions randomly distributed in the genome. > > Usage > > randomizeRegions(A, genome=“hg19”, mask=NULL, allow.overlaps=TRUE, per.chromosome=FALSE, …) > > Arguments > > A The set of regions to randomize. A region set in any of the formats accepted by toGRanges (GenomicRanges, data.frame, etc…) > > genome The reference genome to use. A valid genome object. Either a GenomicRanges or data.frame containing one region per whole chromosome, or a character uniquely identifying a genome in BSgenome (e.g. “hg19”, “mm10”,… but not “hg”). Internally it uses getGenomeAndMask. > > mask The set of regions specifying where a random region can not be (centromeres, repetitive regions, unmappable regions…). A region set in any of the formats accepted by toGRanges (GenomicRanges, data.frame, …). If NULL it will try to derive a mask from the genome (currently only works if the genome is a character string). If NA it gives, explicitly, an empty mask. > > allow.overlaps A boolean stating whether the random regions can overlap (TRUE) or not (FALSE). > > per.chromosome Boolean. If TRUE, the regions will be created in a per-chromosome manner - every region in A will be moved to a random position on the same chromosome where it was originally. > > … further arguments to be passed to or from methods. > > Details > > The new set of regions will be created with the same sizes as the original ones, and optionally placed on the same chromosomes. > > In addition, they can be made explicitly non-overlapping, and a mask can be provided so no regions fall in an undesirable part of the genome.
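
For intuition, the core idea behind this kind of randomization (keep each region's width, place it uniformly at random in the genome) can be sketched in a few lines of base R. This is not regioneR's implementation, and the function name and arguments here are invented for illustration; it also omits the mask handling described above.

```r
# Base-R sketch of the core idea behind region randomization (not regioneR):
# keep each region's width, draw a chromosome proportional to its length,
# then a uniform start within it.
randomize_regions <- function(widths, chrom_lengths) {
  chroms <- sample(names(chrom_lengths), length(widths),
                   replace = TRUE, prob = chrom_lengths)
  starts <- vapply(seq_along(widths), function(i) {
    # any start that keeps the region inside the chromosome
    sample.int(chrom_lengths[[chroms[i]]] - widths[i] + 1, 1)
  }, integer(1))
  data.frame(chrom = chroms, start = starts, end = starts + widths - 1)
}

set.seed(42)
rr <- randomize_regions(widths = c(500L, 1000L, 250L),
                        chrom_lengths = c(chr1 = 2e6, chr2 = 1e6))
all(rr$end - rr$start + 1 == c(500, 1000, 250))  # TRUE: widths preserved
```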

Tim Triche (15:01:55): > @Tim Triche has joined the channel

Michael Hoffman (15:01:55): > @Michael Hoffman has joined the channel

Michael Love (16:32:11): > Hervé posted a solution that works but uses lapply

Michael Love (16:32:33): > But data stays as ranges the whole time which is preferred for many reasons

2020-01-22

Stuart Lee (01:33:22): > @Stuart Lee has joined the channel

2020-07-17

Hervé Pagès (13:20:11): > @Hervé Pagès has left the channel

2020-07-31

Dr Awala Fortune O. (16:15:56): > @Dr Awala Fortune O. has joined the channel

2020-11-18

Stuart Lee (18:08:40): > @Michael Love and I are keen to start working on this again, so if anyone has made progress on this since 2018 let us know:slightly_smiling_face:

Stuart Lee (18:09:48): > hoping to have this as a pretty open development so happy to get people’s input on it

Michael Love (20:34:57): > I will put some code implementing simple block bootstrapping into a GitHub repo and post here, and we’ll just leave it totally open

Michael Love (20:35:06): > prob have time by this weekend to do this

2020-11-19

Stuart Lee (18:19:19): > great! your old blog comes up as one of the first hits on google for block bootstrap genomics https://mikelove.wordpress.com/2012/07/28/block-bootstrap/ but the code link doesn’t work for me:disappointed: - Attachment (Mike Love’s blog): Block bootstrap > In looking at sequential data (e.g. time-series or genomic data), any inference comparing different sequences needs to take into account local correlations within a sequence. For example, you might want to know how often it is raining in two cities at the same time, and if this is more than expected by chance. But it is more likely to rain on a given day if it was raining the day before, and this dependence will change the distribution of overlap expected by chance. In stochastics, this is a question of whether the process is ‘stationary’. > One way out of the problem of estimating the distribution of overlap of two processes by chance is the block bootstrap. Instead of randomly shifting features in the sequence (what I call naive permutation), you randomly build new sequences from large blocks of the original sequence. Then a distribution can be formed of overlap of features by chance. Here is a single bootstrap sample (top sequence) constructed in this manner from the data (bottom sequence). > > > Here are histograms demonstrating various ways of estimating the null distribution of overlaps between two sequences, with the true null on top (the clusters of features are of size 20). The block bootstrap can do a much better job of estimating the mean and variance of the null distribution. Knowing how large a block to define is another problem, and Politis and Romano (below) explore the effect of using randomly sized blocks over fixed-size blocks. > > a reference for this problem in genomic inference is: Peter Bickel, N. Boley, J.B. Brown, H. Huang and N.R. Zhang, Non-Parametric Methods for Genomic Inference, 2010, > and a more general reference is Dimitris N. Politis and Joseph P. Romano, The Stationary Bootstrap, Journal of the American Statistical Association, Vol. 89, No. 428 (Dec., 1994), pp. 1303-1313. > The R code for this example is here.

Stuart Lee (18:33:04): > Also I imagine the package API would basically have a generic, say shuffle or shuffle_ranges or generate, that takes as input a Ranges and a null generator function (of which we could have say the block bootstrap etc.) and returns a Ranges or RangesList back. I think we would want to abstract out the null generating mechanism (i.e. not using the words block or stationary bootstrap as they aren’t super expressive imo).
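
The generic-plus-generator design above can be sketched quickly. All names here are hypothetical (this API was never settled in this form), and plain data.frames stand in for Ranges objects so the sketch is self-contained.

```r
# Hypothetical sketch of the proposed API: one entry point that takes the
# ranges plus a pluggable null-generator function. Illustrative names only.
generate_null <- function(x, generator, times = 10) {
  lapply(seq_len(times), function(i) generator(x))
}

# one possible generator: shuffle the start positions, keeping each
# range's width attached to its row
permute_starts <- function(x) {
  w <- x$end - x$start
  x$start <- sample(x$start)
  x$end <- x$start + w
  x
}

set.seed(2)
x <- data.frame(start = c(10, 50, 200), end = c(20, 80, 250))
nulls <- generate_null(x, permute_starts, times = 5)
length(nulls)  # 5
```

The design choice being discussed is exactly this separation: the generic stays stable while block bootstrap, matching, etc. plug in as generators.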

Stuart Lee (18:33:16): > Should the white/blacklisting be left to the end user?

2020-11-20

Michael Love (07:13:47): > re: the old post, i think i can do it more efficiently with Views(), I’ll work on putting together an example this weekend > > re: null generator, that sounds like a good idea > > re: allowed/excluded regions (there is effort in genomics to move towards new language although not widely picked up yet), I think it would be good for the function to handle this in a fairly built-in way, and maybe we can get Anshul’s ENCODE blacklist into AHub if they are not there already https://doi.org/10.1038/s41598-019-45839-z

Michael Love (10:01:10): > Also in discussion with other collaborators at UNC, aside from allowed regions (e.g. open chromatin, CpGs, etc), various aspects of matching on covariates might be desired. I’m hoping that we can help create a general expressive framework that works for many domains

2020-11-23

Kasper D. Hansen (03:22:53): > Matching on covariates is critical, but it might be hard to come up with a fully general framework. In the spirit of not having the perfect be the enemy of the good, it might make sense to start with something easy like CpG density or more general nucleotide composition

Michael Love (08:56:30): > Yeah, so a collaborator at UNC, Doug Phanstiel, who will be interested to contribute, was mentioning matching open chromatin regions based on some continuous score

Kasper D. Hansen (15:44:26): > I would assume this is not crazy hard for a single 1-dim score, but much harder when you want to match on multiple things at the same time

Michael Love (19:41:25): > agree, matching in higher dim quickly becomes untenable
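
For the single 1-D score case Kasper mentions, the simplest approach is greedy nearest-neighbour matching without replacement. A minimal base-R sketch (illustrative only; not the matchRanges or MatchIt machinery that was later used):

```r
# Greedy nearest-neighbour matching on one covariate score, e.g. CpG
# density or an open-chromatin score. Sketch only, without replacement.
match_controls <- function(focal_score, pool_score) {
  available <- seq_along(pool_score)
  chosen <- integer(length(focal_score))
  for (i in seq_along(focal_score)) {
    # index (within the remaining pool) of the closest score
    j <- which.min(abs(pool_score[available] - focal_score[i]))
    chosen[i] <- available[j]
    available <- available[-j]   # without replacement
  }
  chosen
}

focal <- c(0.2, 0.8)
pool  <- c(0.05, 0.75, 0.22, 0.9)
match_controls(focal, pool)  # 3 2: pool indices closest to each focal score
```

This also shows why higher dimensions get hard: with several covariates "closest" needs a distance or propensity score, which is where packages like MatchIt come in.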

2020-12-18

Michael Love (16:54:25): > has renamed the channel from “blockboot” to “nullranges”

Eric Davis (17:19:45): > @Eric Davis has joined the channel

2020-12-20

Wancen Mu (23:06:32): > @Wancen Mu has joined the channel

2020-12-22

Michael Love (08:05:01): > set the channel topic: Generate null sets of genomic ranges - https://github.com/nullranges/nullranges

Michael Love (08:15:29): > I’ve created a repo for implementing approaches for null range set generation. We’d welcome any contributions or feedback from folks here. > > Right now it’s just a vignette with some sample code from the email thread from 2 years ago: https://github.com/nullranges/nullranges > > The goal of the package is not to do enrichment analysis… we are aiming for modularity and integrating with existing software for that. It will just create set(s) of null ranges either through block bootstrapping or sampling from a control set with covariate matching. @Wancen Mu will be looking into the block size and performing enrichment across a continuous score of features in the query set, and @Eric Davis will be looking into covariate matching from a control set.

Michael Love (09:01:18): > @Eric Davis in the vignette I pull down some open chromatin from AnnotationHub, which I don’t think has meaningful covariates for you to work with. You could assign a simulated set of covariates if you wanted to work on covariate matching while we find a good example, or if you’re already working with the dataset from fluentGenomics, that’s good too

Michael Love (10:34:28): > We’ll also be meeting every two weeks to discuss, and open to anyone who’s interested. tentatively these options for meeting times:

Michael Love (10:34:47): > 1st and 3rd Wednesdays 4pm US East

Michael Love (10:34:56): > 1st and 3rd Fridays 4pm US East (this would mean Stuart has to get up on Saturdays…not preferred)

2020-12-23

Stuart Lee (17:55:20): > My preference would be for Wednesdays so I don’t have to get up on Saturday:slightly_smiling_face:

Stuart Lee (17:55:48): > Looking forward to playing around with the code in the new year!

Michael Love (19:31:50) (in thread): > absolutely!

Michael Love (19:32:48): > it’s really very basic starting ground, just copying in Herve’s code from the email thread

2021-01-05

Michael Love (07:52:45): > working on finding a good regular time with Doug – will check back in soon

Michael Love (08:16:42): > 1st and 3rd Thurs 4pm US East works for Doug and me, @Eric Davis @Wancen Mu @Stuart Lee does it work for you?

Wancen Mu (09:03:48): > Works for me:ok_hand:

Eric Davis (09:30:44): > works for me too!

Michael Love (13:26:23): > ok sent a recurring meeting – this week we can have a quick chat, i can talk about the repo (which is very empty right now…)

2021-01-07

Michael Love (15:51:34): > hi so probably just a quick chat today, i just want to review the bootstrapping code from Herve a bit: https://github.com/nullranges/nullranges/blob/main/vignettes/nullranges.Rmd

Michael Love (16:18:29): > oops my laptop ran out of battery

Michael Love (16:22:44): > set the channel topic: Generate null sets of genomic ranges - https://github.com/nullranges/nullranges - https://docs.google.com/document/d/1-c5XVz-3wDVsK1Ysxh3q0ofQ2EFGJpH7bdNUMg24Few/edit?usp=sharing

Michael Love (16:22:56): > ok I added the scratchpad to the channel topic

2021-01-12

Michael Love (14:44:50): > shoot… my department just moved our department seminar to 3:30-4:30 on Thurs, which would conflict with our regular 4pm

Michael Love (14:45:00): > could we move to 4:30?

Stuart Lee (17:39:04): > that’s ok with me

Michael Love (20:10:02): > ok with Doug too, moving them now…

2021-01-21

Michael Love (16:17:57): > hi, so we have a meeting coming up in 15 min. I’ve been tremendously swamped with starting teaching this semester, so i don’t have any updates on the code i showed last time, so it’ll be whatever is new from Eric/Doug and from Wancen (although I think Wancen has been working on her scRNA-seq project full steam)

Wancen Mu (16:19:50): > Yes, sorry about that. No updates from me this week

Eric Davis (16:23:56): > I’ve got some updates!

Michael Love (16:24:18): > Great!

Michael Love (16:24:22): > looking forward

Michael Love (16:24:40) (in thread): > no worries…i know what you’ve been up to :)

Michael Love (16:31:20): > meet.google.com/fou-qhqe-qvo

Eric Davis (16:31:26): > thanks haha

Michael Love (17:29:56): > My notes to self today: > > results() should offer to bring along columns of rowData to the GRanges... > fix the formula issue for matchit()? > R infer package - a grammar for specifying null hypotheses - https://infer.netlify.app/ > Mike will do RNA-seq quantification >

Eric Davis (17:37:09): > I made a directory that you should have access to on our proj space on longleaf. The rna-seq files are located here: /proj/phanstiel_lab/Share_Love/rna

Eric Davis (17:37:26): > Let me know if you have trouble accessing

Stuart Lee (18:58:53): > I’ve added Mike’s notes to our google doc, also if anyone wants to do some pair programming over the next couple of weeks before the next meeting, I’d be happy to organise. I’ve found with some other collaborators it’s a good way to get the ball rolling on the code end of things to have a focused 50 minutes working on coding.

Michael Love (22:08:22): > that sounds delightful, i hope i can get the ball rolling with the course that i can join, but that might not happen :-{

2021-01-22

Annajiat Alim Rasel (15:44:48): > @Annajiat Alim Rasel has joined the channel

2021-01-23

Mikhail Dozmorov (20:29:50): > @Mikhail Dozmorov has joined the channel

2021-01-30

Stuart Lee (04:28:14): > hi everyone, I won’t be able to make next week’s meeting but would be keen to do some pair programming on Wednesday evening NC time / Thursday morning Melbourne time if anyone is free

Eric Davis (11:57:48) (in thread): > I’d like to do this, but I just realized that I won’t be available this coming Wednesday evening:pensive:

Mikhail Dozmorov (19:17:35): > Hi all. I learned about nullranges from @Michael Love, and would be interested in contributing. From my time working with genomic enrichments, I was thinking about randomizing regions preserving distance to the nearest downstream TSS and/or considering TAD boundaries. I’m new to the channel, not sure when the next meeting will be?

Stuart Lee (22:36:25) (in thread): > @Eric Davis I could do your Tuesday evening if that works

Stuart Lee (22:38:17): > Hi Mikhail, usually we meet every fortnight on Thursday at 4:30pm eastern time over google meet

2021-01-31

Michael Love (10:23:53): > Great @Mikhail Dozmorov, would love to have your contributions and thoughts. We do google meet and we were going to meet this week at 4:30, what is your preferred email (you can email me if you like at michaelisaiahlove at gmail) > > it’s informal, we just present on what we’ve done if we have something to share. Wancen and I have been looking into block bootstrap (resampling blocks of genome) and also threshold free inference using GAM. Eric and Doug have been looking into defining sets of “control” features while preserving covariate distributions with matching > > preserving TAD boundaries would be something I bet Doug and Eric would like to chat about also

Mikhail Dozmorov (11:23:27) (in thread): > Sounds good, I’ll plan to join, it’ll help to get on the same page. My e-mail is mikhail.dozmorov at gmail

2021-02-04

Michael Love (16:18:47): > hi everyone, so we’ll meet in ~12 min, I know Stuart can’t make it but we can see what updates there are from Wancen and Eric, and we can meet Mikhail for those that haven’t met him already

Michael Love (17:04:31): > Mike note to self: > > results() should offer to bring along columns of rowData to the GRanges... >

Eric Davis (17:38:30): > matchedControlExample02 and matchRanges have been uploaded to the repository

2021-02-05

Michael Love (09:34:56): > sketch of block bootstrapping from a meeting with Wancen - File (PNG): Screen Shot 2021-02-05 at 9.33.18 AM.png

2021-02-17

Michael Love (16:00:03): > hi @Wancen Mu @Eric Davis @Stuart Lee @Mikhail Dozmorov – Doug and I are both in a tough position re: tomorrow’s meeting – not much to share from our sides and we also have a conflict, so going to cancel. > > we may also need to find another time, as we are both supposed to be watching kids at the 4:30 time so can only give partial attention

Michael Love (16:01:29): > I could even do a 1-1 with Stuart PM in Melbourne and then another one with US East folks, depending if it works for Stuart

Michael Love (16:02:10): > also i’m going to take a shot at vectorizing the boot code that we have in the repo now

Wancen Mu (16:03:37): > Either works for me:ok_hand:. My time schedule is pretty flexible

Eric Davis (16:12:32): > :+1:same here, I am usually free whenever Doug is available

Michael Love (16:19:16): > cool thanks all

Michael Love (16:48:14): > we’re thinking about moving a biweekly meeting to 8am US east, and then maybe we could have a separate regular meeting to catch up with Stuart (but e.g. Doug won’t be able to join that one)

Michael Love (16:49:56): > Any preference here among 8am Tue,Thur,Fri?

Mikhail Dozmorov (16:51:50): > 8am Thursday will work, but Tue is ok also (for me). Which week though, if this week is cancelled?

Michael Love (16:58:18): > just this week, yes

Michael Love (16:58:37): > i cancelled the google meeting for tomorrow

Michael Love (16:59:27): > 8am won’t work for Stuart bc it’s midnight but i can try to find another time

Michael Love (17:01:30): > so i’m leaning toward 8am Thurs then if it works for@Wancen Muand@Eric Davis

Michael Love (17:02:03): > and we can start Thurs 2/25

Mikhail Dozmorov (17:03:26): > Sounds good for me

Michael Love (17:19:17): > ok vectorizing was just two lines of code:flushed:

Michael Love (17:19:42): > old > > user system elapsed > 26.446 0.466 27.030 > > new > > user system elapsed > 0.005 0.000 0.006 > > same: > > > all.equal(y_prime, y_prime2) > > [1] TRUE >

Michael Love (17:20:17): > https://github.com/nullranges/nullranges/blob/main/vignettes/nullranges.Rmd#L57-L58

Michael Love (17:20:41): > that is definitely fine in terms of speed, we can do 100s of bootstraps and it should take only seconds
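
The repo's actual two lines aren't reproduced here, but the general trick behind a speedup like this (replacing a per-block lapply with a single vectorized index) can be illustrated in base R:

```r
# General form of the vectorization: build all block indices at once
# instead of looping over blocks. Illustrative, not the repo's code.
n <- 1000; L_b <- 50            # sequence length and block length
n_blocks <- n / L_b
set.seed(4)
starts <- sample.int(n - L_b + 1, n_blocks, replace = TRUE)

# loop version: one index vector per block
idx_loop <- unlist(lapply(starts, function(s) s:(s + L_b - 1)))

# vectorized version: each start repeated L_b times, plus a recycled offset
idx_vec <- rep(starts, each = L_b) + (seq_len(L_b) - 1)

identical(idx_loop, idx_vec)  # TRUE
```

Subsetting a vector (or ranges) once with `idx_vec` avoids the per-block allocation that makes the lapply version slow.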

Wancen Mu (17:26:10) (in thread): > I usually have class at 8am Tue, Thur. But I can make it if it works for everyone.

Wancen Mu (17:29:48) (in thread): > That’s awesome!

Stuart Lee (18:50:03) (in thread): > I’m happy to do that - whatever is easiest for you!

Mikhail Dozmorov (20:24:37) (in thread): > Played with it - impressive! Reading the old thread about block bootstrap helps

Michael Love (21:10:06) (in thread): > yeah so now we’re considering how to deal with chromosomes, and how to deal with segmenting the genome into gene dense and sparse regions for bootstrapping within those

Michael Love (21:10:51) (in thread): > so you could do M W or F at 8am?

Michael Love (21:13:36) (in thread): > Mikhail what about Monday at 8am?

Mikhail Dozmorov (21:15:44) (in thread): > Less optimal. Teaching M/W mornings, typically preparing for class. Unless no other options..

Wancen Mu (21:17:25) (in thread): > Yeah, these three days work for me.

Michael Love (21:17:44) (in thread): > so there’s 10 pm VIC - 6 am US, or 8am VIC - 4 pm US…

Michael Love (21:18:06) (in thread): > got it, Friday 8am?

Mikhail Dozmorov (21:20:03) (in thread): > That’ll work, but what about Stuart?

Michael Love (21:21:25) (in thread): > so i’ll come up with another slot to keep Stuart in the loop, i just can’t really do 4pm very well bc i’m supposed to have kids

Michael Love (21:22:14) (in thread): > or 10am - 6pm, 11am - 7pm

Stuart Lee (21:22:16) (in thread): > 8am VIC would be my preference

Mikhail Dozmorov (21:22:32) (in thread): > Fully understand, Friday 8am will be fine

Michael Love (21:23:41) (in thread): > ok and then it would be T-F 8am VIC I guess, of those what works? and how frequent do you think, once a month or 2x a month?

Michael Love (21:24:18) (in thread): > this would be to check in with you on syntax of nullranges vis a vis plyranges workflows, aside from Slack updates

Stuart Lee (21:26:24) (in thread): > Thursday 8am would be good! And great, I’m happy to contribute there. I feel like I haven’t contributed much yet, sorry!

Michael Love (21:29:44) (in thread): > oh no worries, I think actually that once we have very basic functions in R/ for generating null ranges, then I think you could start to sketch out your desired integration

Michael Love (21:30:51) (in thread): > the thing i’ve been thinking about (and now i’m burying in a scheduling thread) is that null ranges / matching range generation has a lot of knobs and switches to play with, but then how to put that into a plyranges flow (also potentially there are diagnostic checks wrt the null ranges that one may want to pull out of the flow)

Michael Love (21:31:26) (in thread): > it may be something where the null / matching ranges have to break the flow

Michael Love (21:32:09) (in thread): > would once a month Thurs 8am VIC be too infrequent?

Stuart Lee (21:32:26) (in thread): > I’m happy to go fortnightly if that works

Michael Love (21:33:10) (in thread): > ok let’s do 2x and we can just see who can join

Michael Love (21:33:15) (in thread): > i’ll throw these up on Google

Stuart Lee (21:33:28) (in thread): > I have a few ideas around the integration, mainly thinking summarize/reduce_ranges for computing the boot stats, while diagnostics are worth another verb I think.

Stuart Lee (21:36:41) (in thread): > I am currently thinking we have all the null generators as special function objects that have the potential to take a BPPARAM so we could do the null set generation in parallel and over chromosomes / design. There would be a generic function that accepts ranges objects plus a null generator and spits out the bootstrapped / matched set of ranges either as a List or a GroupedRanges

Stuart Lee (21:40:15) (in thread): > So the flow would be > * invoke a generator that explains how you want to generate a null set > * “sample” from the null based on your ranges object and the generator > * “diagnose” the null set > * “summarise” or collapse the null set to generate summary stats via plyranges

Michael Love (21:41:32) (in thread): > this sounds perfect

Michael Love (21:41:50) (in thread): > we should start to put this here i thinkhttps://github.com/nullranges/nullranges/issues/new

Michael Love (21:44:01): > i sent two biweekly, one works for Stuart. i may not always be able to attend the 4pm one but invited everyone

Mikhail Dozmorov (21:47:58) (in thread): > Would it be better to do one Wednesday meeting, at 4pm? Seems like it works for all time zones

Michael Love (21:52:24) (in thread): > i can’t reliably attend that one bc i watch kids – the issue for me is that i want to make sure i’m giving Wancen and Eric undistracted feedback

Michael Love (21:52:32) (in thread): > i think people can attend whichever works or both

Doug Phanstiel (21:53:25): > @Doug Phanstiel has joined the channel

Michael Love (21:53:38): > :wave:

Stuart Lee (21:54:24) (in thread): > Great, I’ll write it up now!

Mikhail Dozmorov (21:59:12) (in thread): > Got it

2021-02-18

Doug Phanstiel (09:03:56): > :wave:

Michael Love (09:17:50): > Stuart has added thoughts here on what the function flow might look like: https://github.com/nullranges/nullranges/issues it’s a good idea to push concrete ideas to the issues so we can start keeping track there. > > i’m going to try to implement a bare bones bootstrapping function so we can start experimenting with workflow ideas, but then Wancen has more correct bootstrapping functionality she is working on

2021-02-20

Michael Love (08:36:59): > started to write very basic functions (see NEWS for short description): https://github.com/nullranges/nullranges/blob/main/R/bootstrap.R https://github.com/nullranges/nullranges/blob/main/vignettes/nullranges.Rmd#L31 https://github.com/nullranges/nullranges/blob/main/NEWS.md I’ll need to think about how these would interact with Stuart’s ideas about the null generator workflow

Michael Love (08:38:25): > i’m thinking for more sophisticated bootstrapping, one approach is to create a pseudo chromosome that concatenates all the chromosomes. This should simplify sampling ranges across chrom, and also sampling within segmentation regions (something Wancen is working on)
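
The pseudo-chromosome idea amounts to laying chromosomes end to end with cumulative offsets, sampling in the concatenated coordinate, then mapping back. A base-R sketch (names and details are illustrative, not the eventual nullranges implementation):

```r
# Sketch of the pseudo-chromosome trick: concatenate chromosomes with
# cumulative offsets, so sampling across chromosomes becomes sampling on
# a single interval. Illustrative only.
chrom_lengths <- c(chr1 = 100, chr2 = 60, chr3 = 40)
offsets <- cumsum(c(0, head(chrom_lengths, -1)))
names(offsets) <- names(chrom_lengths)

# chromosome coordinate -> concatenated coordinate
to_concat <- function(chrom, pos) offsets[chrom] + pos

# concatenated coordinate -> chromosome coordinate
from_concat <- function(gpos) {
  i <- findInterval(gpos, cumsum(chrom_lengths) + 1) + 1
  list(chrom = names(chrom_lengths)[i], pos = gpos - offsets[i])
}

to_concat("chr2", 10)    # chr2:10 maps to 110 on the concatenated axis
from_concat(110)$chrom   # and maps back to "chr2"
```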

2021-02-23

Michael Love (15:32:57): > i pushed some examples to help visualize

Michael Love (15:33:04): - File (PNG): Screen Shot 2021-02-23 at 3.31.48 PM.png

Michael Love (15:33:27): > for these input ranges, bootstrapping with L_b (length of boot window) = 100 gives:

Michael Love (15:34:05): - File (PNG): Screen Shot 2021-02-23 at 3.33.47 PM.png - File (PNG): Screen Shot 2021-02-23 at 3.33.53 PM.png

Michael Love (15:34:12): > these are 5 bootstraps

Michael Love (15:35:52): > Wancen and I are planning on how to code this so we will also distribute ranges across chromosome, and across chrom + within segmentation boundaries (e.g. if we segment genome into regions of gene density/sparsity)

2021-02-24

Mikhail Dozmorov (20:18:24): > I played with it more, visualization helps. The main question is how to define blocks. Hervé already suggested natural chromosome segmentation, by Giemsa bands - why not use it? Other ideas are: ChromHMM segmentation (simplified), consensus TAD boundaries, early/late replication timing regions.

Mikhail Dozmorov (20:20:03): > Gene density segmentation will be great. I’m playing with the gr.dist function to create a distance matrix among regions on a chromosome, and cluster them using DBSCAN. Can be applied to genes, clustering them into gene-dense regions. We can discuss Friday. - Attachment (rdrr.io): gr.dist: Pairwise distance between two ‘GRanges’ in mskilab/gUtils: R Package Providing Additional Capabilities and Speed for GenomicRanges Operations > Computes matrix of pairwise distance between elements of two GRanges objects of length n and m. Distances are computed as follows:
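
DBSCAN itself needs the dbscan package and a distance matrix, but for 1-D positions a simple gap rule gives the same flavour: start a new cluster whenever the gap to the previous position exceeds a threshold eps. A minimal stand-in sketch (not gr.dist/DBSCAN):

```r
# 1-D gap clustering as a stand-in for DBSCAN on gene positions:
# a new cluster begins whenever consecutive positions are > eps apart.
gap_cluster <- function(pos, eps) {
  pos <- sort(pos)
  cumsum(c(1, diff(pos) > eps))
}

genes <- c(100, 120, 150, 5000, 5100, 9000)
gap_cluster(genes, eps = 500)  # 1 1 1 2 2 3: three gene-dense clusters
```

Runs of cluster labels can then be turned into gene-dense segments for the bootstrap to sample within.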

Michael Love (20:29:18): > Thanks Mikhail! Wancen has tried segmentation by gene density using HMM or circular binary segmentation, both could be options. Giemsa bands / pre-baked ChromHMM / TAD boundaries / early vs late replication could definitely be supplied as well

Michael Love (20:29:42): > i think the segmentation should also be customizable

Mikhail Dozmorov (20:29:50): > Just saw how Wancen implemented it..

Michael Love (20:30:34): > it’s a step before the bootstrapping, and basically we can think about the format and syntax for providing the segmentation to the bootstrapping function

Michael Love (20:31:00): > both the segmentation and the bootstrapping will benefit from diagnostic plots and summary metrics

Michael Love (20:31:25): > also there are diagnostic plots for covariate matching, from Eric and Doug’s side

Mikhail Dozmorov (20:32:32): > Not sure I remember covariate matching. How do different genomic features relate to blocks?

Michael Love (20:33:05): > this is totally separate from bootstrapping – it’s Eric and Doug’s work on: you have genes of interest, and you want to compare with the same number of genes from {all genes setdiff interesting genes} that are “matching” across various covariates

Michael Love (20:33:30): > e.g. similar expression level, similar histone mods, etc.

Mikhail Dozmorov (20:34:10): > got it, and remember we discussed it

2021-02-26

Michael Love (09:57:12): > I’ve taken notes for today in the google docs (see the channel info at top of page)

Michael Love (09:57:25): > we’re talking about sending two abstracts to Bioc2021, which sounds great

Michael Love (09:57:58): > and we have a question for @Stuart Lee (when you get a chance) about whether plyranges will be able to work on Pairs objects out of the box?

Michael Love (10:07:25): > i’ve cleaned up the repo so it is only code now

Michael Love (10:07:45): > all data objects moved over to nullrangesData

Wancen Mu (10:08:11): > Thanks!:+1:

Michael Love (10:16:07): > i’m also going to clean the repository on GitHub so it stays small, as we only have code now, this way when people work on the project in the future they can avoid downloading the old data files in the clone step

Michael Love (10:17:05): > importantly @Wancen Mu and @Eric Davis, can you please now save any code that you haven’t pushed to GitHub (e.g. put it in a separate directory somewhere on your machine), delete your existing nullranges directory, and then re-clone from GitHub?

Michael Love (10:17:37): > everyone will need to re-clone, if they’ve already cloned the nullranges repo locally

Michael Love (12:25:12): > i’ve updated the meetings so that they will use Zoom instead of Google Meet for video. i think people can still share screen even if i do not join, but we can test that next week

Eric Davis (12:48:42): > For the BioC abstracts, should we work on them in this google doc:https://docs.google.com/document/d/1-c5XVz-3wDVsK1Ysxh3q0ofQ2EFGJpH7bdNUMg24Few/edit?usp=sharing? or should we create a separate one?

Eric Davis (12:59:43): > I have some scripts that generate the files in nullrangesData/data. Should I move these scripts from nullranges to nullrangesData/inst/scripts?

Michael Love (13:39:27): > sure that’s a good idea

Michael Love (13:39:37) (in thread): > same doc is fine with me

Michael Love (13:40:57): > a last thought – since we have a lot of people contributing code, it may be good to use consistent code style, so i’m going to start usingstylerto format my code after I write it, it’s very simple: > > > library(styler) > > style_file("bootstrap.R") >

Michael Love (13:41:12): > the first time you run it, it will ask if it can create a cache directory on your machine and you just say [y]

Michael Love (13:41:30): > it edits the file in place and you can review with git diff

Michael Love (13:41:42): > effortless coding:slightly_smiling_face:

Wancen Mu (13:54:51): > Sounds good!

Eric Davis (14:05:02): > Should we each be working on separate branches or do you think the work is different enough that it doesn’t matter for now?

Michael Love (14:09:44): > i think the latter

Michael Love (14:09:57): > i don’t see much collision since we’re doing diff things

Stuart Lee (19:15:07) (in thread): > No but it could be a possibility

Michael Love (20:11:07) (in thread): > cool, we’ll look into it

Michael Love (20:12:08) (in thread): > i may not be able to attend Wed/Thurs meeting (next week) depending on child care situation. but if i can’t attend i’ll try to formulate our plyranges / workflow related Qs

2021-02-27

Stuart Lee (00:02:03) (in thread): > sounds good, let me know if there is a better time! my schedule is pretty flexible

2021-02-28

Mikhail Dozmorov (21:37:37) (in thread): > Interesting. I used formatR::tidy_source() or tidy_app, but in-place file formatting is more convenient

Michael Love (21:38:02) (in thread): > i’m running these functions now to do some cleanup

Michael Love (21:40:13) (in thread): > it helps me to read code across files if they all have same style, a little bit obsessive maybe:open_mouth:

Mikhail Dozmorov (21:41:12) (in thread): > Clean code is always easier to understand:slightly_smiling_face:

Mikhail Dozmorov (21:41:43): > Played with `segment_density` - works as it should, and useful for many things besides gene segmentation!

Mikhail Dozmorov (21:42:07): > And with matchRanges - didn't know about the MatchIt package, reading their vignette.

2021-03-02

Mikhail Dozmorov (18:39:21): > Saw the March 12 meeting was cancelled due to UNC holiday. I’ll join the south one tomorrow.

Mikhail Dozmorov (18:39:42) (in thread): > Wish to have an extra holiday as well :)

Michael Love (19:14:33) (in thread): > i’ll join at 4 tomorrow but will have kids around so kind of partial attention

2021-03-03

Michael Love (17:01:52): > next meeting in 2 weeks then

Michael Love (17:02:04): > recommend everyone take the “wellness days”:smile:

Michael Love (17:02:35): > i am glad they did this – rather than one spring break (which we cant make use of anyway to go anywhere) they put 2 day buffers on weekends throughout semester

Mikhail Dozmorov (21:07:05) (in thread): > It’s a good practice. Our students have two “reading days”. No break for faculty :)

2021-03-04

Kasper D. Hansen (05:03:07): > I think you're the first one I've heard liking this approach. Clearly you're destined to become a dean

Michael Love (11:09:10): > BTW in addition to the abstract i might submit a birds-of-a-feather on this topic, e.g. infrastructure in Bioc for enrichment analysis https://twitter.com/Bioconductor/status/1367467661576327170 - Attachment (twitter): Attachment > Submit an abstract for a talk, demo, long workshop, digital poster, or Birds-of-a-feather session at #Bioc2021 on Aug 4-6! The conference will be virtual this year. > https://bioc2021.bioconductor.org/submissions/
> Deadline to submit is March 9. > #bioconductor #bioinformatics #conference

2021-03-07

Mikhail Dozmorov (13:46:30): > The abstract looks great! Found the Bickel 2010 paper, will read. And the old blog post with a good illustration of block bootstrapping. Added suggestion to use TAD boundaries as blocks to the Google doc. - Attachment (Mike Love's blog): Block bootstrap > In looking at sequential data (e.g. time-series or genomic data), any inference comparing different sequences needs to take into account local correlations within a sequence. For example, you might want to know how often is it raining in two cities at the same time, and if this is more than expected by chance. But it is more likely to rain on a given day if it was raining the day before, and this dependence will change the distribution of overlap expected by chance. In stochastics, this is a question of whether the process is 'stationary'. > One way out of the problem of estimating the distribution of overlap of two processes by chance is the block bootstrap. Instead of randomly shifting features in the sequence (what I call naive permutation), you randomly build new sequences from large blocks of the original sequence. Then a distribution can be formed of overlap of features by chance. Here is a single bootstrap sample (top sequence) constructed in this manner from the data (bottom sequence). > > > Here are histograms demonstrating various ways of estimating the null distribution of overlaps between two sequences, with the true null on top (the clusters of features are of size 20). The block bootstrap can do a much better job of estimating the mean and variance of the null distribution. Knowing how large of a block to define is another problem, and Politis and Romano (below) explore the effect of using randomly sized blocks over fixed size blocks. > > a reference for this problem in genomic inference is: Peter Bickel, Boley N, Brown JB, Huang H and Zhang NR, Non-Parametric Methods for Genomic Inference, 2010, > and a more general reference is Dimitris N. Politis and Joseph P.
Romano, The Stationary Bootstrap, Journal of the American Statistical Association, Vol. 89, No. 428 (Dec., 1994), pp. 1303-1313, > The R code for this example is here.
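The block bootstrap described in the post can be sketched in a few lines of base R. This is a toy illustration only (the function name and the 0/1 feature-track representation are my own, not nullranges code): it rebuilds a track of the same length by concatenating randomly chosen contiguous blocks of the original.

```r
## Toy block bootstrap for a 1D feature track on a single chromosome.
## Assumption: x is a 0/1 indicator vector of feature presence.
block_bootstrap <- function(x, block_size) {
  n <- length(x)
  n_blocks <- ceiling(n / block_size)
  # sample block start positions with replacement
  starts <- sample(seq_len(n - block_size + 1), n_blocks, replace = TRUE)
  # concatenate the sampled blocks and trim to the original length
  idx <- unlist(lapply(starts, function(s) s:(s + block_size - 1)))
  x[idx[seq_len(n)]]
}

set.seed(1)
x <- rbinom(100, 1, 0.2)  # a sparse feature track
b <- block_bootstrap(x, block_size = 10)
length(b) == length(x)    # TRUE: bootstrap sample has the same length
```

Choosing `block_size` is the hard part, as the post notes; Politis and Romano's stationary bootstrap randomizes the block length instead of fixing it.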

Michael Love (14:39:53): > sounds good, I think abstract went in yesterday but we will definitely include TAD as among things to allow segmentation on

2021-03-09

Michael Love (10:59:36): > Aaron Quinlan sent me this review paper which is relevant for the bootstrapping paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4271068/ (it references Bickel; Wancen, I added it to sciwheel already) - Attachment (PubMed Central (PMC)): The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment > Integrative analyses of genomic, epigenomic and transcriptomic features for human and various model organisms have revealed that many such features are nonrandomly distributed in the genome. Significant enrichment (or depletion) of genomic features is …

Michael Love (11:00:43): > i’m chatting with Aaron to hear what his plans are wrt bedtools shuffle, which is an independent software for computing and evaluating overlap stats

Wancen Mu (12:37:19) (in thread): > Great, will look into it.

Mikhail Dozmorov (18:57:19) (in thread): > Interesting, Brent Pedersen was involved. Didn’t know about it, will read.

2021-03-11

Michael Love (15:48:42): > I chatted with Aaron Quinlan, a couple things: > 1. we think it would be good to have some platform-agnostic documentation of the block bootstrapping methods – e.g. diagrams of what each mechanism does, what the choices are, and unified language (we should stick to Bickel I think) > 2. he is also interested in implementing block bootstrap in bedtools shuffle (he has been interested in this since the Bickel paper came out) and we can share notes about fast implementation

2021-03-16

Mikhail Dozmorov (21:43:22): > I have to skip the nullranges south meeting tomorrow, student’s committee meeting. Will join the March 26 one.

Mikhail Dozmorov (21:43:31): > Thought to share a couple references about DNA segmentation, learned from the review paper: https://doi.org/10.1186/1471-2105-8-171 and https://doi.org/10.1214/ss/1028905933 - reading now. - Attachment (EUCLID): Statistical methods for DNA sequence segmentation > This article examines methods, issues and controversies that have arisen over the last decade in the effort to organize sequences of DNA base information into homogeneous segments. An array of different models and techniques have been considered and applied. We demonstrate that most approaches can be embedded into a suitable version of the multiple change-point problem, and we review the various methods in this light. We also propose and discuss a promising local segmentation method, namely, the application of split local polynomial fitting. The genome of bacteriophage \(\lambda\) serves as an example sequence throughout the paper.

Michael Love (22:04:18) (in thread): > good luck to student:slightly_smiling_face:

2021-03-17

Michael Love (09:01:55): > Checking in with @Eric Davis and @Wancen Mu – things to share today at 4pm? I haven't had time to do much on my end

Wancen Mu (09:02:40): > Me too. Nothing to update yet.

Eric Davis (09:22:25): > Same, not much to report

Michael Love (10:48:48): > @Stuart Lee @Doug Phanstiel ok i propose we cancel the meeting so we'll have more for next week. Sorry for the late notice Stuart!

Doug Phanstiel (11:16:45): > Sounds great

Stuart Lee (20:15:17): > No problem, I am on vacation for the next two weeks so won’t be able to make it

Eric Davis (22:58:00): > Is the meeting tomorrow cancelled as well?

2021-03-18

Michael Love (08:08:42): > we have a meeting a week from tomorrow; the Wed / Fri ones alternate weeks

2021-03-24

Eric Davis (12:08:23): > I am looking for some general package writing best-practices advice. If I have a function and I want it to be able to operate on two different classes of objects (let's say a data.table or a GInteractions object), what is the best way to build the package to deal with that situation? Especially considering that the code will have to be handled differently for each case. I imagine the simplest (but probably messy) way would be to write if statements to check for input classes. Alternatively, I could make the function generic and use method dispatch to handle the different classes. Or perhaps there is another way that you all have found works best?

Kasper D. Hansen (12:13:54): > This sounds like method dispatch, although it breaks my informal rule of only using methods if you dispatch on at least 3 different classes. But anyway, methods or not, that is actually not that important for the majority of the code

Kasper D. Hansen (12:14:43): > You would want to think carefully about abstractions. You would want some part of the code to be shared and some part to be object specific. Where and how you set this boundary is very specific.

Kasper D. Hansen (12:15:10): > If this is HiC data for example you might need to handle in memory vs. out of memory stuff.

Kasper D. Hansen (12:16:17): > The easiest of course (but most likely not best) would be to write code which takes a basic matrix and then convert the two types of objects into basic matrices. Or write code which takes objects of one kind and then code which transforms the other object into the first kind.

Kasper D. Hansen (12:16:57): > actually, I have no idea why we’re having this discussion in this channel

Michael Love (12:17:14) (in thread): > really agree on this part. maximizing shared code is so important and worth thinking about for a long time. we can discuss in our Friday meeting

Michael Love (12:18:28) (in thread): > Eric is working on a method + software for coming up with sets of control ranges while controlling for covariates, he’ll present at BioC this summer. The relevance to HiC is that they are looking at enhancer promoter loops

Eric Davis (12:23:48): > Thanks for that advice! I want to accept data in multiple forms (i.e. data.table/frame or GInteractions object), but take advantage of the tools that have already been built - so it sounds like you are suggesting I should transform the data.table/frame into a GInteractions object and then dispatch code to process the data on that object.

Kasper D. Hansen (17:17:49): > well yeah in principle. Not sure if GInteractions at present scales well, but I'm out of touch
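The "convert, then dispatch" pattern discussed above can be sketched with a toy S4 class; everything here is hypothetical (a stub class stands in for GInteractions so the snippet is self-contained), and the generic name is made up:

```r
library(methods)

## toy stand-in for the "core" class (e.g. GInteractions)
setClass("ToyRanges", representation(starts = "numeric"))

setGeneric("widthSum", function(x, ...) standardGeneric("widthSum"))

## core method: all the real work happens on the core class
setMethod("widthSum", "ToyRanges", function(x, ...) {
  sum(x@starts)
})

## data.frame method: convert to the core class, then delegate
setMethod("widthSum", "data.frame", function(x, ...) {
  widthSum(new("ToyRanges", starts = x$start))
})

df <- data.frame(start = c(1, 2, 3))
widthSum(df)                                   # 6, via conversion
widthSum(new("ToyRanges", starts = c(1, 2, 3)))  # 6, direct dispatch
```

The design choice is Kasper's point exactly: only the thin conversion method is class-specific, and the shared logic lives in one place.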

2021-03-25

Michael Love (16:10:31): > so we're on for the nullranges "north" meeting tomorrow (8am US East), Wancen has been looking into a new dataset to model enrichment over LFC, and I bet Eric has updates as well – whoever is free, look forward to seeing you tomorrow

2021-03-26

Tim Triche (08:46:39): > This ties in nicely to some work my group has been doing with positional covariates, would be interested in participating in the next meeting (not sure how these things work). Curious if propensity scoring for ranges suffers from any of the latent confounding observed with use of DR in other observational contexts.

Tim Triche (08:48:11): > Neat abstracts and I confess to not realizing how useful this would be in other contexts until reading the abstracts. (Apologies if being a creeper in this respect, my code is public too – whenever someone forks it we tend to learn something, although that presumes people play nice)

Michael Love (08:57:49): > hi Tim! I'd be happy to add you in on the calls, we've currently been doing an alternating schedule (US East): 4pm Wed one week / 8am Friday the other week. We just had our Friday one so next is 4pm (US East) next week. The 4pm one is easier for Stuart to attend but next week he is not available I see, so I may cancel that (CC @Mikhail Dozmorov)

Michael Love (08:58:14) (in thread): > DR?

Michael Love (08:59:04) (in thread): > happy to have you here! the whole point of making this into a public channel was to catch attention of people doing similar things or interested in collaborating

Michael Love (08:59:39) (in thread): > we’re trying not to recreate any existing functionality in Bioc e.g. we aren’t writing a package to do enrichment analysis, just creating bootstrap ranges or matched control ranges

Michael Love (09:00:49): > So Tim if either or both of those work for you I can add you to the calls (i have two recurring google events with a Zoom link), just let me know which email and which time(s)

Michael Love (09:02:40): > @Eric Davis and @Wancen Mu from today I think it makes sense to think about two classes matchedRanges and bootRanges which would extend GRanges and have a few specific methods and accessors, we don't have to build the infrastructure just yet but maybe we can start brainstorming what information we want to store in the object and, as Mikhail pointed out, what methods would go into a vignette / use case

Tim Triche (09:12:07): > if it is at all useful I have been spending far too much time with sparse matrices lately and may be able to help if performance issues come up.

Michael Love (09:13:10): > Eric may get into sparse matrices with HiC examples for the vignette

Tim Triche (09:13:18) (in thread): > that would be great – tim.triche@vai.org and I'll add both to my calendar. Both times are usually open for me

Tim Triche (09:14:03): > very cool. We have a slightly different angle on this which I’d like to deposit the paper for first (although we wrote up some of it already as a BioC 2021 abstract).

Michael Love (09:14:25) (in thread): > sent

Tim Triche (09:14:57): > I think the background ranges idea will be tremendously handy in that context though and hope that the reverse may also be the case. Thanks for coordinating this.

Tim Triche (09:15:09) (in thread): > thanks!

Tim Triche (09:15:52) (in thread): > doubly robust estimators (usually sandwich + propensity scoring, whether by boosting or some other method). I have a sordid past in observational epidemiology that I don’t like to discuss

Tim Triche (09:16:21) (in thread): > I don’t imagine confounding by indication is much of an issue for genomic interactions though:slightly_smiling_face:

Michael Love (09:17:04) (in thread): > ahhh

Michael Love (09:17:17) (in thread): > right, we’re just doing matching

Tim Triche (09:17:33) (in thread): > matching on observed/known confounders? or… ?

Michael Love (09:17:39) (in thread): > “focus set” and then pulling matching ranges from a large universe based on PS

Michael Love (09:17:42) (in thread): > matching on PS

Michael Love (09:18:13) (in thread): > so e.g. focus set of ranges with a certain GC content, pull from universe of control ranges with matching distribution of GC

Michael Love (09:18:26) (in thread): > the PS idea is so Eric can match on multiple covariates easily

Tim Triche (09:18:27) (in thread): > makes sense. Very far down the road it will be of interest to know whether undersampling in that universe is relevant for arbitrary conditions. Not urgent

Tim Triche (09:52:09) (in thread): > the more I think about that the more useful this seems. having a simple sanity check on “would this be expected by chance” is… well, it’s not always popular to do the right thing but it matters

2021-03-28

Michael Love (16:21:22): > @Eric Davis I've made a minimal package that demonstrates building a class out from GRanges: https://github.com/mikelove/s4demo

Michael Love (16:21:59): > it has a method foobar that does something, a constructor fooRanges and an accessor function barbar

Michael Love (16:22:28): > it passes R CMD check, i didn’t yet look at BiocCheck

Eric Davis (16:23:05): > This is perfect, thanks!

Michael Love (17:12:40): > feel free to ping me here in the channel with any questions, probably there are more details

Michael Love (17:12:54): > oh and to try out the example, you could do devtools::load_all()

2021-03-29

Michael Love (09:21:49): > i updated the demo to include a setter function barbar(x) <- "hey"
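For anyone following along without the s4demo repo handy, the pattern (class + constructor + accessor + setter) looks roughly like this. Here "numeric" is a stand-in parent class instead of GRanges so the snippet runs with base R only, and the names only loosely mirror the demo:

```r
library(methods)

## class extending a parent class, with one extra slot
setClass("fooRanges", contains = "numeric",
         slots = c(bar = "character"))

## constructor: unnamed argument fills the inherited .Data part
fooRanges <- function(x, bar = "") new("fooRanges", x, bar = bar)

## accessor
setGeneric("barbar", function(x) standardGeneric("barbar"))
setMethod("barbar", "fooRanges", function(x) x@bar)

## setter, as in the updated demo: barbar(x) <- "hey"
setGeneric("barbar<-", function(x, value) standardGeneric("barbar<-"))
setMethod("barbar<-", "fooRanges", function(x, value) {
  x@bar <- value
  x
})

fr <- fooRanges(c(1, 2, 3), bar = "hi")
barbar(fr)          # "hi"
barbar(fr) <- "hey"
barbar(fr)          # "hey"
```

The real matchedRanges/bootRanges classes would use `contains = "GRanges"` instead, so all GRanges methods come along for free.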

Michael Love (14:54:19): > Looking to Wed, Stuart is not free, so i’d lean towards canceling unless someone has something specific to talk about

2021-04-03

Eric Davis (20:30:26): > Hey all, I've reorganized the matchRanges portion of the package into an S4 class-based package as we discussed at the last meeting. If you have some time, feel free to check it out and let me know if you have any suggestions (we could also discuss this at the next meeting). Here is some test code that should work with nullranges and nullrangesData:
> 
> library(nullranges)
> 
> ## Test data
> library(nullrangesData)
> data("enhPromContactFreqHg19")
> 
> ## Define parameters
> set.seed(123)
> s <- sample(1:length(enhPromContactFreqHg19), 6000)
> x <- enhPromContactFreqHg19[head(s, 1000)]
> u <- enhPromContactFreqHg19[tail(s, 5000)]
> covar <- ~ anchor1.peakStrength + contactFreq
> 
> ## Match GInteractions object
> mgi <- matchRanges(focal = x, pool = u, covar = covar)
> 
> ## Prints and behaves like a normal GInteractions object
> mgi
> 
> ## Visualize matching
> plot(mgi) # plot shows propensity scores
> plot(mgi, type = 'ridge')
> plotCovariates(mgi, covar = 'all', type = 'ridge', logTransform = TRUE)
> plotCovariates(mgi, covar = 'all', type = 'jitter', logTransform = TRUE)
> plotCovariates(mgi, covar = 'contactFreq', type = 'jitter', logTransform = TRUE)
> 
> ## Summary and access data
> overview(mgi)
> matchedData(mgi)
> covariates(mgi)
> 
> ## Get each type of result as GInteractions objects
> focal(mgi)
> pool(mgi)
> matched(mgi)
> unmatched(mgi)
> 
> ## Extract indices by group (default is 'matched')
> indices(mgi, group = "unmatched")
> indices(mgi)
> 
> ## Should be TRUE
> identical(matchedData(mgi)[group == "pool"][indices(mgi, group = "unmatched"), -c('group')],
>           matchedData(mgi)[group == "unmatched", -c('group')])

Eric Davis (20:32:41): > I need to improve the documentation, but this example should also work for DataFrame, data.frame, data.table, GRanges, and GInteractions inputs

2021-04-04

Michael Love (09:35:32): > great! we have an upcoming meeting on Friday – Eric do you want to throw together some (rough) slides to also introduce new folks e.g. Tim to the classes and methods you're thinking about > > doesn't have to be pretty, just an overview of the motivation of matching (eg what do you need these for in the Phanstiel lab project) and what you've implemented so far

2021-04-05

Tim Triche (10:32:32): > the code works with a couple of tweaks (pull requests submitted) – beautiful figures!

Michael Love (10:33:17): > thanks, PR accepted

Tim Triche (10:33:32): > I saw that. you're incredibly fast

Eric Davis (13:04:45): > :+1:Thanks for catching that!

Eric Davis (18:12:54): > Typically focal << pool (FYI, switching terminology from universe to pool). But currently, matchRanges() will work even if focal >> pool. It works by selecting the same ranges as many times as it takes to create the appropriate distribution. I could adjust the class validity to require that the pool of options is larger than the focal group, or I could just leave it as is to give users flexibility to do what they want. What are people’s thoughts on this?

Eric Davis (18:14:39) (in thread): > I suppose it could be useful to “upsample” a matched control if you happen to have a larger set of interest than options to choose from.

Michael Love (19:13:29) (in thread): > i vote flexibility, but print awarning()if pool < focal

Michael Love (19:13:46) (in thread): > users often react to warnings like errors anyway
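Michael's "warn, don't error" suggestion could be sketched like this (a hypothetical helper, not the actual matchRanges validity code):

```r
## Warn when the pool is smaller than the focal set, but allow it,
## since upsampling a matched control set can be a legitimate choice.
check_pool_size <- function(focal_n, pool_n) {
  if (pool_n < focal_n) {
    warning("pool (", pool_n, ") is smaller than focal (", focal_n,
            "); the matched set will necessarily reuse pool ranges")
  }
  invisible(pool_n >= focal_n)
}

## emits a warning but execution continues
suppressWarnings(check_pool_size(focal_n = 1000, pool_n = 500))  # FALSE
check_pool_size(focal_n = 100, pool_n = 5000)                    # TRUE
```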

2021-04-06

Doug Phanstiel (12:06:24): > Really slick code and nice plots. I think the jitter plots are probably the least useful though. They are really only informative for a very narrow range of sample sizes. I think simple density line plots are probably the best way to compare. I would consider making that the default.

Tim Triche (12:14:18): > the ridge plots? agreed, much easier to interpret than the jitters

Tim Triche (12:15:15): > i.e. these - File (PNG): ridges.png

Eric Davis (12:15:32) (in thread): > The jitter can be useful for visualizing sparsity at the tails of the distributions that isn't apparent in density plots - but in general I agree. I went with ridge plots over simple density plots because it gets hard to see 4 closely matching datasets when they are overlaid.

Eric Davis (12:15:43) (in thread): > I can add another plot type = “density” for this

Eric Davis (12:16:15) (in thread): > and I could provide an option for users to specify which to view (i.e. c(“focal”, “pool”) etc…

Doug Phanstiel (12:17:31): > yeah, i like the ridge plots. Plus they look just generally 'cool'. And are definitely more informative than the jitter. But boring old line plots are probably the best way to compare. But you are right, they might be busy with 4 lines, especially when the distributions are already well matched like in your current example

Doug Phanstiel (12:17:52) (in thread): > Yep

Doug Phanstiel (12:17:57) (in thread): > I think that would be good

Michael Love (12:18:02): > Eric, it looks like the above can be a vignette right?

Doug Phanstiel (12:18:07) (in thread): > default to all of them i guess

Michael Love (12:18:08): > it’s not folded in yet though?

Eric Davis (12:18:34) (in thread): > Sounds good, thanks!

Michael Love (12:18:56): > i'm moving nullranges.Rmd to boot_ranges.Rmd

Doug Phanstiel (12:20:00): > I think this is nice for a vignette but would definitely want to start with focal distributions that are not already representative of the pool

Doug Phanstiel (12:20:27): > I think this was just to test/show the functionality

Doug Phanstiel (12:20:37): > classes, plotting fxns, etc

Eric Davis (12:20:39): > Would you like the vignette to be very simple or have a more complex example (I am working on the complex example now)?

Doug Phanstiel (12:29:16): > I think really simple would be fine (even preferable) for the vignette. Basically just this but select a focal group that has different distributions than the pool

Doug Phanstiel (12:30:09): > But the more complicated examples where we try to better understand enhancer-gene correlations using Hi-C would be good for the eventual paper

Michael Love (12:35:48) (in thread): > ah got it

Michael Love (12:36:40) (in thread): > that is a really good question, and hard to answer… > > i think an ideal vignette covers the 80% use case in the first few pages (if you were to print) > > but vignettes are also valuable for the complex cases

Michael Love (12:37:21) (in thread): > i would say, you can add the complex case later, or as Doug says, that could be a companion to a paper

Michael Love (12:37:46) (in thread): > e.g. Rmd as “notebook” that gets attached as Supp to a paper is a really nice trend

Michael Love (12:38:01) (in thread): > or it can be the paper itself as with F1000R Bioc channel

Kasper D. Hansen (12:39:34): > I would absolutely also have the complex example in the vignette

2021-04-07

Doug Phanstiel (10:11:11): > Seems clear enough right@Eric Davis? Doug: no, Mike: maybe, Kasper: definitely:joy:

Doug Phanstiel (10:11:48): > Starting with a simple example is good. And then follow by the complex example we are working on.

2021-04-08

Doug Phanstiel (08:08:40): > Eric and I were chatting about nullranges on our Slack and I thought it would be good to bring them up here to get broader input. Currently matchRanges is fast and elegant but it does have some limitations, and we are deciding if each of these limitations is ok or not, and if not, how to deal with them

Doug Phanstiel (08:12:09): > Currently, it always allows resampling. The same item can be selected multiple times. I tend to prefer null ranges that don't allow resampling, because the result doesn't seem representative of the focal set if the focal set can't have repeated elements. I think depending on the question, sampling with or without replacement may be appropriate. From your points of view, how important is it to have an option to allow or disallow resampling?

Tim Triche (08:14:49): > important

Tim Triche (08:14:58): > consider if you’re doing CV

Tim Triche (08:15:08): > you really don’t want to have folds overlapping if you can help it

Tim Triche (08:16:59): > there’s nothing wrong with bootstrapping but it is harder to reason about due to resampling; it induces a sort of stochastic bias that is hard to quantify precisely as a result

Doug Phanstiel (08:17:35): > A second consideration is how it handles ties. It currently takes the nearest neighbor and if there is a tie it takes the 'first' one it finds. If you are matching continuous covariates this is not a major problem since ties are rare. But if you are matching discrete data (especially if there are few categories, for example "disease grade", an integer from 1 to 4) you will end up selecting the same 4 items over and over again. There is a simple fix, which is to grab all of the ties and randomly pick one. But for the "disease grade" example you are going to have a lot of ties every time, so it seems this approach might be slow. For examples like this it would be easier just to subsample by grade and randomly select n items than to do propensity score matching
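The "grab all the ties and randomly pick one" fix Doug describes could look like this in base R (a hypothetical helper; the actual matchRanges tie handling may differ):

```r
## Nearest-neighbor matching on a propensity score with random tie-breaking.
match_nearest <- function(focal_ps, pool_ps) {
  vapply(focal_ps, function(p) {
    d <- abs(pool_ps - p)
    ties <- which(d == min(d))  # all pool entries at the minimal distance
    # only call sample() when there really are ties, to avoid the
    # sample(<scalar>) pitfall of drawing from 1:n instead
    if (length(ties) == 1L) ties else sample(ties, 1L)
  }, integer(1))
}

set.seed(42)
pool_ps <- c(1, 2, 2, 2, 3)        # discrete scores, so many ties
idx <- match_nearest(rep(2, 6), pool_ps)
## "first match" would pick index 2 every time; random tie-breaking
## spreads the picks across the tied indices 2, 3, and 4
```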

Tim Triche (08:18:14): > also typically supported via an option

Tim Triche (08:18:50): > for example, suppose you want the column maxes of a sparse matrix

Doug Phanstiel (08:18:54): > But then there are in between cases. What if you are matching multiple covariates and they are all discrete. We could look at the data and see how many unique combinations of covariate values there are to assess how big of an issue this is

Tim Triche (08:19:08): > from personal experience I will say that supporting random instead of first or last is very annoying

Tim Triche (08:19:32): > but yeah the above is why it’s supported anyways.

Tim Triche (08:20:02): > this is almost certainly explored somewhere in the propensity scoring literature

Doug Phanstiel (08:24:17): > Thanks. Here are some possible solutions

Doug Phanstiel (08:27:28): > 1) Add a TRUE/FALSE argument for bootstrapping. I am not sure how to efficiently implement the code if it is FALSE. It could slow things down a bit but seems important

Tim Triche (08:28:00): > just use the same API assample

Tim Triche (08:28:24): > sample(x, size, replace = FALSE, prob = NULL)
> 
> Arguments:
> 
> x: either a vector of one or more elements from which to choose, or a positive integer. See 'Details.'
> 
> n: a positive number, the number of items to choose from. See 'Details.'
> 
> size: a non-negative integer giving the number of items to choose.
> 
> replace: should sampling be with replacement?
> 
> prob: a vector of probability weights for obtaining the elements of the vector being sampled.

Tim Triche (08:28:51): > whenever I find myself asking “should I just do the same thing as in base::foo()?” the answer is always “yes”

Tim Triche (08:29:50): > actually I left out maybe the most important argument for this one:

Tim Triche (08:30:10): > sample.int(n, size = n, replace = FALSE, prob = NULL,
>            useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7))
> 
> Arguments:
> 
> ...
> 
> useHash: 'logical' indicating if the hash-version of the algorithm should be used. Can only be used for 'replace = FALSE', 'prob = NULL', and 'size <= n/2', and really should be used for large 'n', as 'useHash=FALSE' will use memory proportional to 'n'.

Doug Phanstiel (08:30:27) (in thread): > Except for maybe stringsAsFactors and useDingbats :rolling_on_the_floor_laughing:

Tim Triche (08:30:29): > consider using an NN/ANN index instead of a hash (granted they’re basically the same thing, but…)

Tim Triche (08:30:48) (in thread): > didn’t the former get fixed in 4.0.x?

Tim Triche (08:31:02) (in thread): > the wheels of Ripley grind slowly, but exceedingly fine

Doug Phanstiel (08:31:07) (in thread): > yeah. but i couldn't resist mentioning it

Tim Triche (08:32:24): > I suppose for the hash / NN index, it may as well be implicit – if there’s one, use it; if not, warn the user that they’ll probably be happier when there is:smile:

Tim Triche (08:33:29) (in thread): > I have had a few settings in my .Rprofile since forever (tab-complete everything, strings are never factors, etc.) and kind of forgot about that stinker

Tim Triche (08:34:12) (in thread): > I looked the other day and | is still overloaded as a pipe, the way dog intended it

Tim Triche (08:34:28) (in thread): > never surrender:smile:

Doug Phanstiel (08:34:33) (in thread): > haha. nice

Tim Triche (08:35:33) (in thread): > https://github.com/ttriche/dotfiles/blob/master/.Rprofile - Attachment: .Rprofile > > # tim's .Rprofile > # > .First <- function() { > > # since I'm always on Linux this is a gimme > Sys.setenv("R_PDFVIEWER"="/usr/bin/evince") > > # use basilisk for this type of stuff going forward > # Sys.setenv(TENSORFLOW_PYTHON="/usr/bin/python3.6") > > # choosing a repo gets old in a hurry > options("repos" = c(CRAN = "http://cran.rstudio.com/"), > browserNLdisabled = TRUE, > deparse.max.lines = 2) > > # for new package automation: > options("skeletor.email"="trichelab@gmail.com") > options("skeletor.name"="Tim Triche, Jr.") > options("skeletor.github"="trichelab") > > # notes below on why this is done > if (dir.exists("~/.Rscripts")) { > for (i in list.files("~/.Rscripts")) { > # phelp, lsos, print.data.frame, ... > source(paste("~/.Rscripts/", i, sep="/")) > } > } > > # bootstrap functions for getting packages set up > req <- function(p) require(p, character.only=TRUE) > reqInstall <- function(p) { if (!req(p)) install.packages(p); req(p) } > > # if there's a user: > if (interactive()) { > > reqInstall("utils") > reqInstall("BiocManager") > > # bridge to system package management > library(bspm) # binary packages > suppressMessages(bspm::enable()) > > # "Packages I'd rather not work without" > pkgs <- c("tidyverse","knitr","useful","gtools","skeletor","S4Vectors") > BiocManager::install(setdiff(pkgs, unique(rownames(installed.packages())))) > > # fix shortcomings > for (p in pkgs) reqInstall(p) > > # color-code output > require("colorout") # BiocManager::install("jalvesaq/colorout") > > # all set > cat("\nWelcome to", R.version.string, "\n") > > # for roxygenise() > rox <<- roxygen2::roxygenise > dox <<- devtools::document > > # change some defaults > options("digits"=9) > options("max.print"=9999) > options("pdfviewer"="/usr/bin/evince") >
options("browser"="/opt/google/chrome/chrome") > options("scipen" = 9999) > options("prompt"="R> ") > > ## tab-complete libraries > rc.settings(ipck=TRUE) > > # should always be the case IMHO > options("stringsAsFactors" = FALSE) > options("useFancyQuotes" = FALSE) > > # many thanks to Duncan Murdoch and Ivo Welch > message("Set options('warn'=2) to stop on warnings...") > options("warn"=0) ## or =2 to stop on warnings > > # set up bigrquery > # library("bigrquery") > # billing_project <- is set in ~/.Rscripts/bigquery.R > > # set up plotly > # library("plotly") > # plotly_api_key <- is set in ~/.Rscripts/plotly.R > > library(BiocManager) > biocLite <- BiocManager::install > > } > } > > # I like syntax highlighting, too > if (interactive()) { > lrda <- function(...) list.files(pattern="rda$", ...) > lrds <- function(...) list.files(pattern="rds$", ...) > > setHook(packageEvent("grDevices", "onLoad"), > function(...) grDevices::X11.options(type='cairo')) > options(device='x11') > > local({ > options(editor="vim") > }) > # options(contrasts=c("contr.sum","contr.poly")) > # for RM-ANOVA and so forth > > ## like magrittr, but better: > `|` <- function(x, y) { > if(is.data.frame(x)) { > return(eval(call("%.%", substitute(x), substitute(y)), envir=parent.frame())) > } else { > or <- base::`|` > return( or(x, y) ) > } > } > > # get or set the DISPLAY environment variable > getDisp <- function() Sys.getenv("DISPLAY") > setDisp <- function(x) Sys.setenv("DISPLAY"=x) > > # quote words, like in perl > qw <- function(...) sapply(match.call()[-1], deparse) > > # Now in its own script in ~/.Rscripts/phelp.R > # > # phelp <- function(...)
{ # {{{ get help for a package > # help(package=as.character(sapply(match.call()[-1], deparse)[1])) > # } # }}} > # > > # Now in its own script in ~/.Rscripts/lsos.R > # > # lsos <- function(..., n=10) { > # .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n) > # } > > latexify <- function(filebase, pgf=TRUE) { # {{{ > filebase <- gsub(".Rnw","",filebase) > filebase <- gsub(".rnw","",filebase) > filebase <- gsub(".tex","",filebase) > if(pgf) { > require(pgfSweave) > pgfSweave(paste(filebase, "Rnw", sep=".")) > } else { > Sweave(paste(filebase, "Rnw", sep=".")) > } > system(paste("LaTeXify", paste(filebase, "tex", sep="."))) > } # }}} > > # Now in its own script in ~/.Rscripts/print.data.frame.R > # > # print.data.frame <- function(df) { # {{{ > # if (ncol(df) > 0 && require("IRanges")) { > # prev.max.print <- getOption("max.print") > # on.exit(options(max.print=prev.max.print)) > # options(max.print=ncol(df) * 20) > # x <- capture.output(print(as(df, "DataFrame"))) > # cat(sub("DataFrame", "data frame", x[[1]]), x[-1], sep="\n") > # } else { > # base::print.data.frame(df) > # } > # } # }}} > > # AWS stuff now in ~/.Rscripts/AWS.R > > host <- function() system2("hostname", stdout=T) > } > >

Doug Phanstiel (08:36:02): > Ok. I am not familiar enough with how one would implement the resample=FALSE situation but we will keep that in mind

Doug Phanstiel (08:38:47) (in thread): > Wow. That is great

Doug Phanstiel (08:39:20): > In terms of efficiently dealing with ties, Eric and I discussed some options:

Doug Phanstiel (08:39:43): > First you could check to see how many unique combinations of covariate values there are

Doug Phanstiel (08:40:36): > if the number is very small (for example a single boolean covariate), just filter the data for each covariate combination and usesampleto select random ones

Doug Phanstiel (08:41:11): > if the number of combinations is large (i.e. very few ties), do propensity score matching settling ties with a random selection

Tim Triche (08:41:59): > https://github.com/wch/r-source/blob/7cb7c00cf338acd56d8b24f4d4b080624b2f6b77/src/main/unique.c - Attachment: src/main/unique.c (truncated preview of R’s internal hash-table code behind unique() and duplicated())

Tim Triche (08:42:11): > me personally I’d do it in C or C++

Tim Triche (08:42:33): > I can ask one of the grad students who’s been doing a ton of Rcpp optimization if he’s up for it

Tim Triche (08:42:50): > you’d have to make him a coauthor but we’re all mercenaries here, right?:wink:

Tim Triche (08:45:10) (in thread): > students and interns kept asking for it and eventually I realized it would be useful for me when I nuke a laptop

Tim Triche (08:45:24) (in thread): > again, the clever bits are mostly from Dirk and Martin

Doug Phanstiel (08:46:06) (in thread): > I don’t understand all of it. But then again I don’t understand all of what is in my .bash_profile either and it is super helpful

Tim Triche (08:46:28): > if there are laggy bits in the package (either memory eaters or time eaters), those can become C/C++ too

Tim Triche (08:46:29): > https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-with-RStudio - Attachment (RStudio Support): Profiling with RStudio > Getting started Using the profiler Using the flame graph Using the data viewer Profiling examples Profiling time example Profiling memory example Frequently Asked Questions Additional Resou…

Tim Triche (08:47:02): > if not, don’t bother. or if you can use RcppAnnoy for the index, or BiocNeighbors, or whatever, then let someone else deal with the hassle of that

Tim Triche (08:47:34) (in thread): > well, if it’s helpful, then you’re welcome to it; and if not, do a Bruce Lee and discard what is not useful:slightly_smiling_face:

Doug Phanstiel (08:48:43): > I don’t want to speak for everyone but I don’t think there is any problem adding authors if we need to. I also have someone in the group who does all of our C coding

Tim Triche (08:49:11): > the trick with hashing for sampling without replacement is that you can use it to establish uniqueness in constant time once the hash is built. hence, you eat the cost of building it upfront, but then benefit every time you sample. If you’re a CS person you know all of this and also the locality sensitive kernel trick, but just in case
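
Tim’s point about paying the hash-building cost once can be sketched with R environments, which are hash-table backed when created with `hash = TRUE`. The function `drawUnique` and the key format are illustrative, not from the package:

```r
# build the hash once; every later membership test and insert is ~O(1)
seen <- new.env(hash = TRUE)

drawUnique <- function(keys, n, seen) {
  out <- character(0)
  while (length(out) < n) {
    k <- sample(keys, 1)
    if (!exists(k, envir = seen, inherits = FALSE)) {  # constant-time lookup
      assign(k, TRUE, envir = seen)                    # constant-time insert
      out <- c(out, k)
    }
  }
  out
}

# e.g. range coordinates (plus covariates) serialized as hash keys
keys <- paste0("chr1:", seq(1, 1e6, by = 1000))
picked <- drawUnique(keys, 10, seen)
```

Each draw is checked for uniqueness without rescanning previous draws, which is the benefit Tim describes.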

Doug Phanstiel (08:49:38): > But knowing what/how we should implement things is really helpful and it sounds like you have more experience with this than I do.

Doug Phanstiel (08:50:36): > I am not a CS person

Tim Triche (08:50:52): > It’s possible that I’ve simply made a larger universe of mistakes over the years. Most recently in the past week when we didn’t bite the bullet on caching ranges for sequence-aware spike-ins and OICR called me out on it (correctly)

Tim Triche (08:51:36): > so it’s fresh in my memory (if you’re going to do something expensive more than once, cache the result)

Tim Triche (08:52:08): > regardless, hash tables are very possibly the best thing CS ever created

Tim Triche (08:52:36): > (arguably compilers might be a little better for the progress of humanity)

Tim Triche (08:53:20): > I think it might be possible to brute-force this by using the coordinates plus whatever covariates make them unique as a hash key

Tim Triche (08:54:37): > yeah

Tim Triche (08:54:58): > looking at the code, this is something where Herve has probably invented a half dozen different solutions for this

Michael Love (12:25:24) (in thread): > i think this slowness may be avoided depending on implementation. > > i think the last thing you mention is stratified sampling which would be efficient and makes a lot of sense > > also, i’m realizing that, aside from matching on PS you could also fit a smooth density to the focal set PS distribution and then do rejection sampling of the pool based on its PS distribution to obtain a matched distribution. this would avoid picking the nearest each time. i can sketch this out on Friday

Michael Love (12:27:10) (in thread): > “stratified sampling”

Michael Love (12:31:14) (in thread): > yeah, and here i think we can also do PS-based rejection sampling > > then you can have: uniqCombinationThreshold = X, which would switch to PS-based matching when the number of unique combinations is >= X, and method, another argument with options "strata" or "PS" but no default value (default is instead determined by number of unique combinations of covariates)

Michael Love (12:32:58): > i’m fine with adding folks as we go along of course (hence we’re doing this in a public channel). > > Let me propose the rejection sampling approach tomorrow to see if that would solve the closest PS issue though, before we start implementing fast random matching
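
One way the PS-based rejection sampling Michael describes could look: estimate smooth densities of the focal and pool propensity scores, then accept pool elements with probability proportional to the density ratio. The simulated scores and the `scale` value here are made up; `scale` is the fiddle parameter discussed later:

```r
set.seed(42)
fps <- rnorm(500, mean = 0.6, sd = 0.1)   # focal propensity scores (toy)
pps <- rnorm(5000, mean = 0.5, sd = 0.2)  # pool propensity scores (toy)

# kernel density estimates, turned into functions evaluable at any point
df <- approxfun(density(fps), yleft = 0, yright = 0)
dg <- approxfun(density(pps), yleft = 0, yright = 0)

# scale must be large enough that scale * dg covers df over the support
scale <- 3
accept_rate <- pmin(df(pps) / (scale * dg(pps)), 1)

# binomial accept/reject: kept pool scores follow the focal distribution
keep <- runif(length(pps)) < accept_rate
matched <- pps[keep]
```

The accepted subset is an (approximately) distribution-matched sample without replacement from the pool.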

Tim Triche (13:12:28): > oddly enough when I looked for fast IPTW in marginal structural models, the first thing that came up was importance sampling. probably the way to go

Tim Triche (13:12:55): > (rejection sampling as complementary to importance sampling, presumably it’s the better fit here if you’re choosing it)

Michael Love (13:22:39): > oh yeah i think of them as same class, either RS or IS

Michael Love (17:45:12): > I’m adding Eric’s code as a vignette for now

Michael Love (17:46:49): - File (HTML): match_ranges.html

Michael Love (17:47:21): > just for ease of browsing… this is looking very slick

2021-04-09

Michael Love (11:05:06): > @Eric Davis this is not production code but just a sketch … we’ll need to figure out if this approach would be practical… - File (R): reject.R

Eric Davis (11:12:22) (in thread): > Thanks, I’ll give this a try. Was there not already a package that implemented rejection sampling, or is this for illustration purposes?

Michael Love (11:12:54) (in thread): > i didn’t find one actually that would let us do sampling w/ replacement on the original pool observations

Michael Love (11:13:11) (in thread): > i’ll keep looking but the above works for now, there is a fiddle parameter (scale)

Michael Love (11:13:28) (in thread): > it will work, depending on the coverage and numbers of focal and pool

Eric Davis (11:13:28) (in thread): > :+1:

Michael Love (11:15:09) (in thread): > if length(focal) ~= length(pool), and the densities are far apart (e.g. large earth mover distance or KS distance) this won’t work at all

Michael Love (11:15:57) (in thread): > so it would also need some feasibility checking, ideally warnings if we think we don’t do a good job matching

Eric Davis (11:16:06) (in thread): > yeah, isn’t that also true of any sampling without replacement method?

Michael Love (11:17:32) (in thread): > oh and, this code works as well for with replacement, … i think it’s just: use sample() with accept_rate as prob

Eric Davis (11:19:59) (in thread): > I can point out the relevant lines in the function if you want to make a branch to test the method in the function context? Or I can replace it later once it’s been fully fleshed out?

Michael Love (11:23:35) (in thread): > yeah we can plug this in later — maybe one day i’ll be able to play with this but not for some weeks at least

Michael Love (11:23:40) (in thread): > :sob:

Michael Love (11:24:43) (in thread): > i mean — i shouldn’t have been too worried about “production code” stuff, you can feel free to throw the above code in and play with it

Michael Love (11:24:53) (in thread): > we’re still deep in “beta” mode

Michael Love (11:25:16) (in thread): > or if it’s still not clear how this would plug in we can chat in our Tue meeting next week

Eric Davis (11:37:20): > Here is a first pass (not confident that this is correct) at using sample() with probabilities to draw from a distribution. Starting with replacement first for comparison to our existing approach. Here fps is a vector of focal propensity scores and pps is a vector of pool propensity scores. > > ## Use sample() to get distributionally matched set > set.seed(123) > system.time({ > s <- sample(seq_along(pps), length(fps), replace = T, prob = pps) > }) > > ## View sample approach > plot(density(fps), xlim = c(0, 0.025), ylim = c(0, 300), > main = "sample method (with replacement)") > lines(density(pps), col = 'green') > lines(density(pps[s]), col = 'blue', lty = 5, lwd = 2) > legend('topright', > text.col = c('black', 'blue', 'green'), > legend = c('focal', 'matched', 'pool')) > > ## View NN approach > plot(density(mdt$fps), xlim = c(0, 0.025), ylim = c(0, 300), > main = "NN method (with replacement)") > lines(density(pps), col = 'green') > lines(density(pps[mdt$ppsIndex]), col = 'blue', lty = 5, lwd = 2) > legend('topright', > text.col = c('black', 'blue', 'green'), > legend = c('focal', 'matched', 'pool')) > - File (PNG): image.png - File (PNG): image.png

Eric Davis (11:40:10): > For sampling with replacement the nearest neighbor approach seems to achieve a much better fitting distribution while only drawing about 9 more duplicate samples from the pool (also there is no time complexity difference): > > > length(table(s)[table(s)>1]) > [1] 63 > > length(table(mdt$ppsIndex)[table(mdt$ppsIndex)>1]) > [1] 72 >

Eric Davis (11:40:52): > although I could be implementing the sampling approach imperfectly

Eric Davis (11:41:22) (in thread): > :+1:that would be great

Tim Triche (11:49:04): > pretty cool result regardless

Tim Triche (11:52:11): > nb. just realized you can use a KS, Anderson-Darling, or chi-squared test result to warn the user if the target and sampling distributions are too far off

Tim Triche (11:52:43): > or make it an argument so that it’s the user’s responsibility:wink:
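
The check Tim suggests could be as small as a KS test on the matched versus focal scores, warning when they diverge. The `checkMatch` helper, the `alpha` cutoff, and the simulated data are illustrative:

```r
checkMatch <- function(focal, matched, alpha = 0.01) {
  # suppressWarnings() only silences ks.test's own tie warnings
  p <- suppressWarnings(ks.test(focal, matched)$p.value)
  if (p < alpha)
    warning("matched distribution differs from focal (KS p = ",
            signif(p, 2), ")")
  invisible(p)
}

set.seed(1)
fps  <- rnorm(300, 0.5, 0.1)   # focal scores (toy)
good <- rnorm(300, 0.5, 0.1)   # a well-matched sample
bad  <- rnorm(300, 0.9, 0.1)   # a badly matched sample

p_good <- checkMatch(fps, good)   # silent
```

Passing `alpha` through as an argument, per Tim’s follow-up, makes the strictness the user’s responsibility.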

Michael Love (11:52:47) (in thread): > oh so my sampling with replacement approach would be to use the accept_rate as the prob

Michael Love (11:53:21) (in thread): > and it involves a fiddle parameter so that the dist of PS for the pool covers the dist of PS for the focal group

Michael Love (11:54:00) (in thread): > not sure it would be better than NN tho!

Michael Love (11:54:21) (in thread): > good you’re keeping track of the duplicate rate and complexity

Eric Davis (11:55:08) (in thread): > We could take a complementary approach: use nearest neighbor to match most of the data, then use rejection sampling to handle these duplicates?

Michael Love (12:06:20) (in thread): > let me think so, let’s say focal is F and has a subset F’ that is easy to match without duplicates in pool

Michael Love (12:06:33) (in thread): > then F / F’ is the leftover ones

Michael Love (12:07:17) (in thread): > then make a KDE of F / F’ and use pool to produce a covering distribution, then sample from pool with accept_rate to fill out the rest of the sample

Michael Love (12:07:38) (in thread): > it seems like it would work and be fast, we can try it out

Eric Davis (12:07:58) (in thread): > Although, trying out your approach seems just as fast

Michael Love (12:08:12) (in thread): > we can compare

Michael Love (12:08:35) (in thread): > the thresh thing was to avoid some issues when the two densities are going to 0 but at different rates, then the ratio can be very bad

Michael Love (12:08:47) (in thread): > diagnostic plot will be critical

Eric Davis (12:10:55) (in thread): > Here is a first pass at applying your code with the “real” data: - File (PNG): image.png

Eric Davis (12:13:22) (in thread): > > user system elapsed > 1.405 0.181 1.609 >

Eric Davis (12:13:54) (in thread): > Comparing to sample(replace = FALSE) - File (PNG): image.png

Eric Davis (12:14:10) (in thread): > > user system elapsed > 9.533 0.015 9.556 >

Michael Love (12:14:52) (in thread): > nice!

Michael Love (12:15:22) (in thread): > i mean the beautiful thing about rejection sampling is that, if you have big enough pool and you cover the focal density, it just has to work

Michael Love (12:15:36) (in thread): > you are literally drawing the density with the rejections

Michael Love (12:16:05) (in thread): > i guess we don’t need a dedicated rejection sampling package then:stuck_out_tongue:

Eric Davis (12:16:55) (in thread): > that’s good haha:sweat_smile:

Eric Davis (12:20:19) (in thread): > To summarize (at a first pass): NN matching provides the best and fastest matching distribution with replacement. Rejection sampling provides the fastest and closest approximation without replacement. We could implement a combination approach to get a ‘better’ matching distribution by using NN to get a majority of matches and handle the duplicates with rejection sampling. If its worth the work then we could get the best of both worlds - quality distribution matching that is fast.

Michael Love (12:48:04) (in thread): > i can think of two ways to fit scale: one is more efficient but not perfect coverage: find the mode of dfocal and then scale up dpool so it equals that mode; the other is accurate but inefficient: choose scale such that scale x dpool > dfocal everywhere. the latter can be found with evaluation of the ratio of the two densities on a fine grid ~binary search~

Michael Love (12:48:45) (in thread): > or earth mover, i like that but dont know if its implemented in R yet

Michael Love (12:48:51) (in thread): > very intuitive

Tim Triche (13:15:11) (in thread): > I think EMD for single cell is implemented somewhere, which suggests it’s a fast implementation:smile:

Tim Triche (13:15:55) (in thread): > https://www.bioconductor.org/packages/release/bioc/vignettes/EMDomics/inst/doc/EMDomics.html

Tim Triche (13:15:55) (in thread): > yeah

Eric Davis (13:18:08) (in thread): > I can try this out, data.table has got some awesome binary search infrastructure. That’s how the nearest matching is so fast

Michael Love (13:31:39) (in thread): > oh i saw that, i think it’s too specialized for searching for a row

Michael Love (13:34:12) (in thread): > oh wait, i’m overthinking this, we can do it with grid evaluation of the densities

Michael Love (13:35:02) (in thread): > calculate df at grid of x, calculate dg at grid of x, then pick scale based on the ratio of those. scale = max df / dg

Michael Love (13:37:50) (in thread): > duhh… that took me too long to realize:neutral_face:
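
Michael’s grid approach above can be sketched directly: evaluate both kernel density estimates on a shared grid and take `scale = max(df / dg)`, so `scale * dpool` covers `dfocal` everywhere on the grid. The data and the tail-guard cutoff are illustrative:

```r
set.seed(7)
fps <- rnorm(500, 0.6, 0.1)    # focal propensity scores (toy)
pps <- rnorm(5000, 0.5, 0.2)   # pool propensity scores (toy)

# evaluate both KDEs on the same fine grid
grid <- seq(min(fps, pps), max(fps, pps), length.out = 512)
df <- approx(density(fps), xout = grid, yleft = 0, yright = 0)$y
dg <- approx(density(pps), xout = grid, yleft = 0, yright = 0)$y

# guard against dividing by a vanishing pool density in the tails
# (the "thresh thing" mentioned earlier)
ok <- dg > 1e-6 * max(dg)
scale <- max(df[ok] / dg[ok])
```

By construction `scale * dg >= df` at every guarded grid point, giving the accurate-but-conservative estimator of the two Michael mentions.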

Doug Phanstiel (14:33:21): > I am not following all of the details here. But it seems like there might be some tradeoffs for different methods. sample() performed worse in this case but it seems like it might be more robust in other cases (reliably mediocre).

Doug Phanstiel (14:35:36): > Can we include all of the different methods and allow the user to choose?

Doug Phanstiel (14:35:58): > Maybe that is what tim just suggested

Doug Phanstiel (14:48:54): > Also, is it possible that sample() is not working as well because there was more smoothing done when generating the probabilities?

Doug Phanstiel (14:49:45): > the ones that glm is providing that is

Doug Phanstiel (15:13:54): > One reason I suspect this is the edge cases in this plot, especially the matched data on the far right

Doug Phanstiel (15:13:59): - File (PNG): image.png

Doug Phanstiel (15:15:26): > it seems to be pulling from regions that should have a prob of close to 0. But if there was more smoothing in the generation of probabilities, then it would think those were good regions to pull from

Doug Phanstiel (15:15:59): > there isn’t even a lot of data there in the pool, so the only way it would pull from that region is intentionally

Eric Davis (15:31:46) (in thread): > One thing that might be an issue is that the rejection sampling approach (as it currently is) doesn’t return the same number of matched samples as focal

Eric Davis (15:32:12) (in thread): > The distribution looks good, but it would be nice to have the same number of focal and matched

Eric Davis (15:33:22) (in thread): > I guess that is determined partially by the scale parameter?

Michael Love (15:37:29) (in thread): > yes so there is some extra work to be done

Michael Love (15:37:56) (in thread): > one sec, first, here are the two scale estimators

Michael Love (15:38:01) (in thread): - File (R): reject.R

Michael Love (15:39:07) (in thread): > yes, depending on accept_ratio and the binomial rejection process being a random process, the exact number you get is hard to predict ahead of time. the good news is that if you have too many, it’s a random sample — just take the first length(focal) of them

Michael Love (15:40:08): > if by smoothing you are referring to the thing i was talking about at 8, that’s not the plot to look at

Michael Love (15:40:29): > it’s this one:https://community-bioc.slack.com/files/U01H9L26J8J/F01U0TFC9SQ/image.png - File (PNG): image.png

Michael Love (15:41:08): > here, we create a smooth density estimate of the PS distribution, and then sample from the pool to match using the ratio of the density estimates

Michael Love (15:42:23) (in thread): > so you want to overshoot

Eric Davis (15:42:42) (in thread): > but you still want scale to be high enough to cover?

Michael Love (15:42:57) (in thread): > yeah so, i think safest would be the grid approach

Michael Love (15:43:59) (in thread): > we can put both in as options? it’s hard to guess in practice what these distributions and the sample sizes would look like and what will give most reasonable (approx) matched sets

2021-04-13

Michael Love (12:32:36): > hi all, > > Not sure we need tomorrow’s meeting. @Eric Davis is working on some matching implementations but has a conflict tomorrow, @Wancen Mu and I have been focused a lot on a different Bioc package being submitted, so not sure we have updates from our side for tomorrow at least. I propose to cancel then (also I have kids at the 4pm slot…)

Tim Triche (12:35:45): > you’ll hear no protests from me:slightly_smiling_face:

Michael Love (12:48:49): > ok

Kasper D. Hansen (14:40:36): > You’ll never get to associate if you keep cancelling those meetings

2021-04-16

Eric Davis (19:00:17): > So Mike’s rejection sampling approach (sampling without replacement) works very well for continuous distributions of data to match, but it doesn’t perform very well (or in some cases at all) when the data is discrete. Does anyone have suggestions for how to handle these situations?

Eric Davis (19:00:53) (in thread): > Or how to detect if a distribution is “too discrete”?

2021-04-17

Michael Love (06:53:13) (in thread): > fraction of unique to total observations?

Eric Davis (11:33:01) (in thread): > possible solution: what if I introduce a small amount of random noise to both focal and pool propensity scores? Since it is random it shouldn’t have any effect on the selection, but still will allow us to estimate a kernel density?

Mikhail Dozmorov (12:02:14) (in thread): > Still catching up, semester is crazy. Reading about rejection sampling and played with Mike’s code - yes, it seems to work well when there are a lot of data. The fraction of unique to total, or the level of noise, likely needs to be optimized in each case. Perhaps make sampling with replacement the default for non-continuous data with length(unique(x))/length(x) < 1?

Michael Love (14:14:32) (in thread): > @Eric Davis I think if the distribution is discrete and there are, say, ~10 unique values (of combinations), you’d want to do the stratified approach, so not the add-noise approach

Michael Love (14:16:35) (in thread): > have to think about the in between cases

Michael Love (14:16:55) (in thread): > adding noise is a good solution i think
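
Eric’s jitter idea, combined with Mike’s fraction-of-unique heuristic for detecting a “too discrete” distribution, could look like this. The cutoff value and the noise scale are both fiddle parameters, not settled choices:

```r
set.seed(3)
fps <- sample(c(0.2, 0.5, 0.8), 300, replace = TRUE)  # near-discrete scores

# Mike's heuristic: fraction of unique to total observations
tooDiscrete <- function(x, cutoff = 0.1) {
  length(unique(x)) / length(x) < cutoff
}

if (tooDiscrete(fps)) {
  # add a little random noise so a kernel density estimate is usable;
  # the sd here (a fraction of the score spread) is an arbitrary choice
  fps_smooth <- fps + rnorm(length(fps), sd = 0.01 * diff(range(fps)))
}
```

Since the noise is symmetric and small relative to the gaps between values, it should not systematically change which pool elements are favored.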

2021-04-19

Tim Triche (10:24:49): > can you write up an example @Eric Davis? I suspect that Laplace smoothing of the probabilities for each categorical outcome would finesse this issue – it is the standard approach in NLP

Tim Triche (10:27:12): > i.e. suppose you have a vector of category/factor-combination counts (0, 100, 11, 53, 17, 0, 1, 2). Add a pseudocount and sample from the corresponding universe with probability (count + pseudo)/(total + totalpseudo). Pseudocount can be arbitrary or calibrated like Aaron Lun suggests.
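
Tim’s Laplace-smoothing recipe, using the counts from his example (the pseudocount value is arbitrary, as he notes):

```r
counts <- c(0, 100, 11, 53, 17, 0, 1, 2)  # category/factor-combination counts
pseudo <- 1                               # arbitrary, or calibrated

# (count + pseudo) / (total + total pseudo)
prob <- (counts + pseudo) / (sum(counts) + length(counts) * pseudo)

# every category is now sampleable, including the zero-count ones
set.seed(2)
draw <- sample(seq_along(counts), 1000, replace = TRUE, prob = prob)
```

The zero-count categories get a small but nonzero probability, which is exactly what degenerates in an unsmoothed density estimate of a discrete distribution.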

Tim Triche (10:36:49) (in thread): > Laplace smoothing should do both fwiw

Wancen Mu (11:06:18): > Hi all, there are three biology questions that I am not sure about and would like some comments on. > 1. When we do the block bootstrap, we want to exclude blacklist regions. However, there are 910 pieces for hg38, and they lower the bootstrap speed because we have to generate a different number of random starts for each segmentation length. Much of the blacklist is short Satellite DNA. Can we exclude only the longer blacklist regions (>500bp) during the block bootstrap step, and exclude the shorter blacklist regions after generating the block-bootstrapped genome? Does that make sense? It would leave ~260 pieces and roughly double the speed.

Wancen Mu (11:09:21): > 2. Another question: the same gene may be sampled multiple times during the random-start generation step and moved to different genome locations. Do we want to keep it only once?

Wancen Mu (11:27:19): > 3. When we move a gene to another location, should we keep the original gene length, or only the overlapping length? (1 or 2 in the sketch below.) If we keep the original length, genes have a higher chance of exceeding the chromosome boundary, or simply the segmentation region boundary, and will be discarded in the fast filtering step. - File (PNG): image.png
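
The long/short blacklist split from question 1 can be sketched in plain base R; a GRanges-based version would use `width(blacklist)` instead, and the data frame here is purely illustrative:

```r
# toy blacklist as start/end coordinates (1-based, inclusive)
blacklist <- data.frame(start = c(1, 500, 10000),
                        end   = c(100, 5000, 10050))
width <- blacklist$end - blacklist$start + 1

long  <- blacklist[width > 500, ]   # exclude during block placement (slow path)
short <- blacklist[width <= 500, ]  # filter overlaps post hoc (cheap path)
```

Only the long regions constrain block placement; features landing in the short regions are dropped in a single overlap pass afterward, which is where the ~2x speedup would come from.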

Michael Love (11:40:34) (in thread): > i don’t follow this question, what kind of multiple generations are you referring to? as part of the bootstrap, i would expect that the boot sample will contain zero, one, or multiple copies of original features

Michael Love (11:46:28) (in thread): > my first thought would be to include features in the random block if start is within, and to propagate the full feature into the bootstrap sample. this is mostly motivated by concern for speed and code simplicity, and that blocks will typically be >> feature size

Michael Love (11:47:02) (in thread): > e.g. if we are shuffling DHS, those are ~100-1000 bp and the blocks are ~500k-1Mb

Michael Love (11:47:10) (in thread): > but curious others thoughts

Wancen Mu (11:48:27) (in thread): > I mean multiple copies of original features. Just want to make sure it won’t influence the downstream analysis, like enrichment analysis. Yeah, it may not be an issue!

Wancen Mu (11:53:54) (in thread): > If we only consider whether the start is within, there are many cases where the start is near the end of a block, so the feature will still exceed the end boundary. Actually it depends a lot on the first question: there will be many such cases if those short blacklist regions are included.

Michael Love (12:25:49) (in thread): > you are shuffling genes right?

Michael Love (12:25:58) (in thread): > i wonder if we should change to shuffling peaks because of this problem

Michael Love (12:26:50) (in thread): > i talked with Aaron about this, and he didn’t have any strong preference, but computationally it’s much cleaner if we are bootstrapping features that much more often will fall within blocks (sometimes over but rarely)

Michael Love (12:27:25) (in thread): > i think it’s all good

Wancen Mu (14:57:18): > The BioC 2021 conference has accepted the block-bootstrap method as a 10-min short talk! :tada:

Michael Love (15:06:00): > that’s great. saw that matchRanges got a poster instead of a demo, too bad! i’d recommend we record both as talks and post them online to get broad viewing (i will promote on Twitter as well)

Eric Davis (17:44:25): > I’ve pushed the latest changes on the matchRanges side of things. Here is a summary of the updates: > * There are now two more arguments: method and replace, which users can set to choose nearest-neighbor matching or rejection sampling for generating the matched distributions. See vignette below… > * overview() now provides meaningful information for factors - such as the number of each category selected among focal, matched, pool, and unmatched groups. > * plot(…, type = 'lines') will plot propensity scores as density lines (still need to update this for the plotCovariates() function) > I wasn’t sure of the best way to demo these changes, so I wrote a vignette that tests out the different methods in different use cases, such as when the matching covariates are continuous, discrete (binary), categorical (discrete non-binary), or when focal > pool (upsampling). Happy to walk through this on Friday as well. - File (HTML): matchRanges_methods.html

Eric Davis (17:45:59) (in thread): > I’ve pushed the most recent code - I think this is similar to what I am doing, but let me know what you think or if you know of a better way to implement it

Eric Davis (17:52:41): > Here is the vignette as a pdf - File (PDF): matchRanges_methods_vignette.pdf

Tim Triche (18:05:39): > neat – do you have any concerns about referring to bootstrapping as upsampling?

Eric Davis (18:08:08) (in thread): > haha I suppose bootstrapping is better terminology

Eric Davis (18:08:47) (in thread): > so I guess it’s subset selection vs bootstrapping

Tim Triche (18:19:53) (in thread): > I think that will be more intuitive for people:slightly_smiling_face:

Tim Triche (18:20:08) (in thread): > Upsampling suggests “sequence more deeply”.

Tim Triche (18:20:16) (in thread): > Or something like that.

Doug Phanstiel (22:24:09): > It looks great except for the binary and categorical covariates.

Doug Phanstiel (22:24:53): > But that is fine. We just need a separate approach for those. Just using base R to split and sample should be good

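Doug’s base-R “split and sample” idea for categorical covariates could look like this sketch (toy data, illustrative names only):

```r
## Sample pool rows within each category so the matched set reproduces
## the focal category counts exactly.
set.seed(1)
focal_cat <- factor(sample(c("A", "B"), 40, replace = TRUE))
pool_cat  <- factor(sample(c("A", "B"), 400, replace = TRUE))
need <- table(focal_cat)                # category counts to reproduce
picked <- unlist(lapply(names(need), function(lv) {
  idx <- which(pool_cat == lv)
  idx[sample(length(idx), need[[lv]])]  # index-by-position avoids the sample(n) pitfall
}))
## table(pool_cat[picked]) now equals table(focal_cat)
```
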
Doug Phanstiel (22:26:57): > One minor question about this code

Doug Phanstiel (22:27:00): > > ## Cut enh-prom distance into groups > epp$epDistanceGrouped <- > cut(epp$epDistance, > breaks = c(-Inf, 20e03, 40e03, 60e03, 80e03, Inf), > labels = c("0-20Kb", "20-40Kb", "40-60Kb", "60-80Kb", "80Kb-100Kb")) >

Doug Phanstiel (22:28:06): > Have you already filtered for enhancer-promoter pairs that are 100Kb or shorter? Or should the last label be “80Kb-1Mb” or something like that?

2021-04-20

Eric Davis (00:26:03) (in thread): > Interestingly it looks like the longest E-P distance is ~1Mb (not 100kb). These are also gap distances. > > > max(epp$epDistance) > [1] 998681 >

Eric Davis (00:26:54) (in thread): > I can double check that the distances are being calculated correctly to confirm

Eric Davis (11:23:10) (in thread): > Wow haha clearly I don’t know numbers - yes that should be 80Kb - 1Mb. 998,681 is not 100,000:rolling_on_the_floor_laughing:

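The corrected labels from this thread can be sanity-checked with a tiny cut() example (toy distances, not the project’s data):

```r
## The last bin is open-ended, so its label should reflect the ~1Mb upper range.
epDistance <- c(5e3, 35e3, 70e3, 9.5e5)
grp <- cut(epDistance,
           breaks = c(-Inf, 20e3, 40e3, 60e3, 80e3, Inf),
           labels = c("0-20Kb", "20-40Kb", "40-60Kb", "60-80Kb", "80Kb-1Mb"))
as.character(grp)  # "0-20Kb" "20-40Kb" "60-80Kb" "80Kb-1Mb"
```
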
Eric Davis (11:25:29) (in thread): > However, I was curious as to why we aren’t getting longer distances (i.e. closer to ~2Mb). And it looks like it has to do with which point we use to define “2Mb around gene promoters”. Currently, I am taking 2Mb around the center of promoters that are 2kb upstream and 200bp downstream of the TSS. Instead I could take 2Mb around the TSS and this seems to make a difference > > > max(enhPromPairs$epDistance) > [1] 1999592 >

Eric Davis (11:25:58) (in thread): > Do you think I should change this?

Doug Phanstiel (11:26:36) (in thread): > I just wanted to make sure we were looking at distances longer than 100kb

Eric Davis (11:27:49) (in thread): > yes my bad for the bad labels:rolling_on_the_floor_laughing:

2021-04-22

Mikhail Dozmorov (20:26:02): > Looks like something happened with the matchRanges function - it cannot find methods for the corresponding class signatures. E.g., in the match_ranges.Rmd vignette, mgi <- matchRanges(focal = x, pool = u, covar = covar) gives: > > Error in (function (classes, fdef, mtable) : > unable to find an inherited method for function 'matchRanges' for signature '"GInteractions", "GInteractions", "formula", "missing", "missing"' > > Same in matchedControlExample02.R, it errors on regular GRanges. The code seems OK, but something is not working. Has anyone encountered this? @Eric Davis

Eric Davis (20:46:18): > Ah yes, I added two more arguments to the function without providing defaults so some of the old vignettes won’t work without setting them

Mikhail Dozmorov (20:50:48): > :slightly_smiling_face:I was staring at the methods, and overlooked the new arguments.

Eric Davis (21:02:07): > Thanks for catching it! It should be fixed now

2021-04-23

Michael Love (07:40:21): > We’ve got a meeting coming up at 8am US East (~25 min) – from the block bootstrap side @Wancen Mu has some timing updates comparing the segmented boot to un-segmented (I still need to fix my code from the “linear chromosome” idea to sampling chromosomes instead; now that the semester is wrapping up I will have more time for this task) > > Looks like @Eric Davis will have updates re: the above. > > I can talk briefly about plans going forward at the end

Michael Love (08:55:32): > Next meeting would be Wed 4/28 at 4pm — that’s the one i can sit in on but not take good notes

Michael Love (08:55:56): > but it looks like Stuart can join so we can ask about his wish list for development/integration of plyranges

2021-04-25

Stuart Lee (21:27:55): > great! I think daylight savings change has messed up times for meeting, so it’s now 6am for me. I’ll try and make it. Looking forward to seeing all of the new features!

2021-04-26

Tim Triche (10:25:29): > not sure if compartmap-the-paper will have a live DOI by then, but as long as it’s close I’ll be happy to send it around and see what people think in terms of “plays nicely with nullranges” or alternatively “what on god’s green earth were you thinking”

Michael Love (11:19:07): > haha

Michael Love (11:19:10): > share away!

Tim Triche (11:26:04): > need one more author (@Kasper D. Hansen) to sign off on it:slightly_smiling_face:

Kasper D. Hansen (11:27:41): > Yes, yes, I have started reading it but I have not finished. I am lazy and not dependable unlike the solid midwestern co-authors.

Tim Triche (11:30:30): > JPF moved to the Midwest?!

Tim Triche (11:31:00): > I mean I guess the rent is cheaper than in the Bay Area but I had no idea:wink:

Kasper D. Hansen (11:36:27): > I was thinking of you and the rest of the crew in Grand Rapids

Kasper D. Hansen (11:52:25): > Instead of us coastal elites

2021-04-27

Michael Love (07:50:19): > Doug had a good idea that, as part of submitting to extend funding support, it would be good to have a pkgdown site, so I’m starting to move in that direction. @Wancen Mu could you move this file to nullrangesData? You can do save(deny, file="deny.rda") and then put that in the /data directory of the nullrangesData package; then you can do library(nullrangesData) and data("deny") > > deny <- import("C:/Users/wancenmu/OneDrive - University of North Carolina at Chapel Hill/Lab/project2/data/ENCFF356LFX.bed.gz") >

Michael Love (07:54:11): > @Eric Davis data.table is required as a Depends - or could it be an Imports?

Eric Davis (08:10:37): > I would think depends as many of the methods require it to work (Like the nearest neighbor matching method), but I know imports is generally preferred. What would you suggest?

Michael Love (08:42:34): > Imports is preferred by me bc otherwise we get the loading messages which are also loud right now (about Macs not working)

Michael Love (08:43:06): > can you walk through at some point (no rush) how much extra work it would be to move it to imports?

Wancen Mu (10:05:22) (in thread): > Will let you know once I’m done. :ok_hand:

Eric Davis (10:22:24) (in thread): > I don’t think it would be any extra work. All of the methods import data.table as needed, so I think it should work fine if moved to imports instead of depends.

Eric Davis (10:22:46): > I am getting this error when trying to build the package with the two latest commits: > > Rebuilding evaluation_segment.Rmd > Quitting from lines 27-32 (evaluation_segment.Rmd) > Error in parseURI(uri) : > cannot parse URI C:/Users/wancenmu/OneDrive - University of North Carolina at Chapel Hill/Lab/project2/data/ENCFF356LFX.bed.gz > Calls: suppressPackageStartupMessages ... ungzip -> ungzip -> .local -> .parseURI -> parseURI > In addition: There were 13 warnings (use warnings() to see them) > Execution halted > > Exited with status 1. >

Wancen Mu (10:30:13): > Oh, it’s related to the deny data that Mike mentioned above. I will add the data later!

Eric Davis (10:31:02) (in thread): > Oh thanks! Sorry I should’ve seen that

Michael Love (10:46:50) (in thread): > cool

Eric Davis (10:51:40) (in thread): > Here is a document about importing data.table. I vaguely remember running into an issue using some of the key-setting functionality if it’s not in Depends, but I can’t remember. https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html

Michael Love (12:30:59) (in thread): > if you hit a problem, wanna post here and we can try to figure it out

Michael Love (12:31:21) (in thread): > worst case, we could inform the user to load data.table pkg via library()

Michael Love (12:31:54) (in thread): > but i’d prefer that to always loading data.table upon library(nullranges) bc not all analyses require data.table loaded (or installed for that matter)

Michael Love (12:32:09) (in thread): > i’m fine with it as an imports for now

Michael Love (22:02:55): > finally had a chance to re-implement the un-segmented block bootstrap, haven’t done much testing yet, but it’s vectorized so should be fast, can walk through it tomorrow

Michael Love (22:03:25): > and no longer doing the “map to one long chromosome” thing, which causes integer overflow

2021-04-28

Michael Love (07:52:55): > i’m moving RcppHMM and DNAcopy to Suggests and following a paradigm i use often in other packages: https://github.com/nullranges/nullranges/blob/main/R/segment_density.R#L38-L40 - Attachment: R/segment_density.R:38-40 > > if (!requireNamespace("DNAcopy", quietly=TRUE)) { > stop("type='cbs' requires installing the Bioconductor package 'DNAcopy'") > } >

Michael Love (07:57:58): > i’m just going over the imports and depends to try to reduce our dependencies > > Imports: > InteractionSet, > ggplot2, > MatchIt, > plyranges, > ks, > speedglm, > pbapply > Depends: > data.table > > So data.table can move to Imports; what about MatchIt? Could that be a Suggests, as it is an optional method? Do we still use it? ks has a lot of dependencies itself, so we could also consider moving that to Suggests, but then I suppose a reasonable number of users will want to do matching via sampling with replacement using the rejection method, so maybe ks has to stay as a dependency. speedglm has minimal dependencies so not a big deal, and pbapply i’m not worried about either

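Under that plan, the DESCRIPTION fields might end up looking something like this (an illustrative sketch, not the package’s actual file; MatchIt is left out here pending the question above, and the optional segmentation engines move to Suggests):

```
Imports:
    InteractionSet,
    ggplot2,
    plyranges,
    ks,
    speedglm,
    pbapply,
    data.table
Suggests:
    RcppHMM,
    DNAcopy
```
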
Michael Love (07:59:07) (in thread): > this is bc not all users will necessarily do their own segmentation, and then even so they don’t need both packages, they may choose one or the other to install. not a big hassle to install one more pkg for some functionality (at least this has been my logic in DESeq2 and tximport)

Eric Davis (08:14:31) (in thread): > We don’t use MatchIt anymore, so that can be removed

Tim Triche (09:11:11) (in thread): > this is a great way to do things, I don’t remember whether Mike Lawrence or Aaron Lun showed it to me, but I habitually do this now as well

Tim Triche (09:12:38) (in thread): > TIL what pbapply does. Neat.

Michael Love (09:19:44) (in thread): > yeah, for the impatient:slightly_smiling_face:

Michael Love (09:19:55) (in thread): > ok removing MatchIt now

Tim Triche (11:28:54): > only just now realized why HMM and DNAcopy are in there in the first place. Seems specialized to segment based on gene density per megabase – having a more general input (e.g. CpG density, CTCF motif density, whatever) seems like it makes the package far more broadly applicable, at which point I have to assume someone already implemented this elsewhere (vince maybe?) and therefore it stops being your problem. No?

Tim Triche (11:29:37): > oh, wait a minute, “gene” can be any flavor of features

Tim Triche (11:30:56): > remark about “somebody did this somewhere else for sure” stands although I don’t off the top of my head know who or where. But e.g. density of chromHMM active enhancers or Michael H/Bill Noble’s predicted contacts or whatever could equally be the input. So now I understand why to keep at least a vestigial version in the package. Neat

Michael Love (11:31:55) (in thread): > soxcan be any features, doesn’t have to be genes

Michael Love (11:32:48) (in thread): > the segmented block bootstrap can take any pre-existing genome segmentation, or we have this function to compute it based on density of features, just flexibility there

Michael Love (11:33:24) (in thread): > Yeah, happy to point to existing Bioc segmentations of genomes in addition

Michael Love (11:34:26) (in thread): > yup. exactly. i’d love to spend some time to bring these into BioC as Ahub resources, or to help others in that effort

Michael Love (11:34:51) (in thread): > we should have: various segmentations, and also the ENCODE deny list, which isn’t in AHub yet

Michael Love (11:36:25): > I can’t remember why these warnings appear > > > warnings() > Warning messages: > 1: MatchedDataFrame.Rd is missing name/title. Skipping > 2: MatchedGRanges.Rd is missing name/title. Skipping > 3: MatchedGInteractions.Rd is missing name/title. Skipping > 4: overview.Rd is missing name/title. Skipping > 5: matchedData.Rd is missing name/title. Skipping > 6: covariates.Rd is missing name/title. Skipping > 7: indices.Rd is missing name/title. Skipping > 8: plot.Rd is missing name/title. Skipping > 9: plotCovariates.Rd is missing name/title. Skipping > 10: focal.Rd is missing name/title. Skipping > 11: pool.Rd is missing name/title. Skipping > 12: matched.Rd is missing name/title. Skipping > 13: unmatched.Rd is missing name/title. Skipping >

Michael Love (11:41:19): > so we have > > #' @rdname Matched > #' @import ggplot2 ggridges > #' @export > setMethod("plot", signature(x="Matched", y="missing"), plot.propensity) >

Michael Love (11:42:24): > which correctly adds \alias{plot,Matched,missing-method} and \S4method{plot}{Matched,missing}(x, type) etc. to Matched.Rd

Michael Love (11:43:57): > Just noticing, plotCovariates needs its arguments documented somewhere

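The “missing name/title” warnings typically mean roxygen2 generated an Rd file with no \name/\title, e.g. for a setGeneric documented only with tags. A hedged sketch of one conventional fix (illustrative names, not the package’s actual code) is to document the generic in a titled block on a shared page:

```r
library(methods)

#' Matched object accessors
#'
#' Accessor generics for Matched objects.
#'
#' @param x a Matched object
#' @param ... additional arguments
#' @rdname Matched-accessors
#' @export
setGeneric("overview", function(x, ...) standardGeneric("overview"))
```

The titled block gives the generated Matched-accessors.Rd a complete \name/\title, so no stub pages are skipped.
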
Eric Davis (11:51:49) (in thread): > :+1: I can add the documentation for plotCovariates. Not sure what is causing those other warnings though…

Michael Love (11:52:20) (in thread): > i’m looking into it now, something about the roxygen2 tags i had recommended for you …

Michael Love (12:02:40) (in thread): > oh i see, it’s the generics

Michael Love (12:06:51) (in thread): > i’m fixing

Tim Triche (12:08:16) (in thread): > the deny list really really ought to be in AHub now that you mention it

Tim Triche (12:09:19) (in thread): > is there a procedure for getting things into AHub or does one just ask Lori nicely and hope time exists

Michael Love (12:09:38) (in thread): > ok fixed

Tim Triche (12:09:45) (in thread): > (and also upload the thing to an appropriate S3 bucket)

Michael Love (12:09:54) (in thread): > one issue is plot, i think we can’t export a generic for that,

Michael Love (12:11:12) (in thread): > i think there we want to import it

Michael Love (12:11:51) (in thread): > @importFrom stats4 plot i think means that you can avoid creating and exporting a plot generic

Michael Love (12:13:43) (in thread): > the rest of your generics seem fine, e.g. none are in BiocGenerics https://www.bioconductor.org/packages/release/bioc/manuals/BiocGenerics/man/BiocGenerics.pdf

Michael Love (12:14:38) (in thread): > i’ve got it on my todo list, there is a procedure and i’ve been reading up on it

Michael Love (12:15:14) (in thread): > From my understanding, core team is not in the business of adding new things to Ahub

Michael Love (12:16:02) (in thread): > after the release, i plan to put together some useful resources for nullranges incl ENCODE deny regions and ChromHMM etc, and put it into a package

Michael Love (12:16:10) (in thread): > they want Ahub resources tied to a package

Tim Triche (12:29:00) (in thread): > that’s awesome

Tim Triche (12:29:25) (in thread): > I know Vince worked on rounding up chromHMM tracks in erma

Tim Triche (12:30:13) (in thread): > I noticed that at one point, EnsemblDb builds were going into AHub, but then that stopped happening. That sort of sucks although I guess tximeta is kind of the logical handler for those?

Wancen Mu (12:52:43) (in thread): > Added the deny data into nullranges :ok_hand:. Still haven’t had a chance to make changes to segment_density.R to resolve the short-gap issue.

Wancen Mu (12:53:48) (in thread): > What do you mean here? > > # TODO: why not seqlengths(x)? > query <- tileGenome(seqlengths(x)[seqnames(x)@values], tilewidth = Ls, cut.last.tile.in.chrom = TRUE) >

Michael Love (13:06:36) (in thread): > deny data -> great thanks!!

Michael Love (13:07:47) (in thread): > oh i think i get it now, i’ll remove the comment, you’re only sampling from chromosomes that have a single feature

Michael Love (13:36:16): > nice! package builds:slightly_smiling_face:

Michael Love (13:37:20): > i’ll work on the pkgdown so we can have a landing page

Tim Triche (14:02:44): > I have to duck out of today’s call I think (sorry!). @Kasper D. Hansen can present the paper though :wink:

Michael Love (15:29:08): > no prob

Michael Love (15:29:29): > i’ll be able to take notes — swapped time with my wife

Michael Love (15:55:21): > playing around with a logo - File (PNG): Screen Shot 2021-04-28 at 3.55.13 PM.png

Michael Love (15:56:04): > feedback welcome:slightly_smiling_face:

Michael Love (15:56:36): > I was also thinking about { }, but Doug pointed out, if we use { } and then put something inside it, it’s no longer the null set

Michael Love (15:57:08): > H_0 points to the idea that the pkg is hopefully useful for generating a distribution under a null hypothesis

Mikhail Dozmorov (15:58:03) (in thread): > Looks great. Why mountains? Instead of some genomic ranges sketch?

Michael Love (15:58:24) (in thread): > it’s the “range”

Michael Love (15:58:51) (in thread): > https://en.wikipedia.org/wiki/Rangeland - Attachment: Rangeland > Rangelands are grasslands, shrublands, woodlands, wetlands, and deserts that are grazed by domestic livestock or wild animals.

Mikhail Dozmorov (15:59:22) (in thread): > Great analogy!

Eric Davis (17:06:37): > I made a new branch for developing that stratified method and put the testing code in inst/script/dev-strat_method.R (which I will delete once incorporated into the package). The new section starts at “Develop new stratified matching method”. Here is a link https://github.com/nullranges/nullranges/blob/dev-strat_method/inst/script/dev-strat_method.R - Attachment: inst/script/dev-strat_method.R > > ## Load packages and data ---------------------------------------------------------------- > > library(nullranges) > library(nullrangesData) > library(hictoolsr) > library(magrittr) > library(GenomicRanges) > library(speedglm) > > ## Load enh-prom pairs (and shorten the name) > data("enhPromContactFreqHg19") > epp <- enhPromContactFreqHg19 > > ## Load loops > loops <- fread(system.file("extdata/hic/MOMA_SIP_10kbLoops_Merged.txt", package = 'nullrangesData')) %>% as_ginteractions() > > ## Annotate looped and unlooped enh-prom pairs > epp$loopedEP <- FALSE > epp$loopedEP[countOverlaps(epp, loops) > 0] <- TRUE > > ## Prepare data for new method ----------------------------------------------------------- > > ## Define sample.vec to handle vectors of varying length > sample.vec <- function(x, ...) x[sample(length(x), ...)]
> ## Define focal, pool and covar > # focal <- mcols(epp[epp$epDistance <= 40e03]) > # pool <- mcols(epp[epp$epDistance >= 40e03]) > # covar <- ~loopedEP > # covar <- ~contactFreq > # covar <- ~anchor1.peakStrength > # covar <- ~loopedEP + contactFreq > # covar <- ~loopedEP + contactFreq + anchor1.peakStrength > > focal <- mcols(epp[epp$loopedEP == TRUE]) > pool <- mcols(epp[epp$loopedEP == FALSE]) > # covar <- ~epDistance > # covar <- ~contactFreq > # covar <- ~anchor1.peakStrength > # covar <- ~epDistance + contactFreq > covar <- ~epDistance + contactFreq + anchor1.peakStrength > > method <- 'stratified' > replace <- FALSE > > ## Extract covariates from formula as character vector > covars <- nullranges:::parseFormula(covar) > > ## Check that all covariates are in both focal and pool > if (!(all(covars %in% colnames(focal)) & all(covars %in% colnames(pool)))) { > stop("All variables in covar must be columns in both focal and pool.") > } > > ## Check method and replace arguments > method <- match.arg(method, choices = c('nearest', 'rejection', 'stratified')) > > if (isFALSE(replace) & nrow(focal) >= nrow(pool)) > stop("focal must be <= pool when replace = FALSE.") > > if (method == 'nearest' & isFALSE(replace)) > stop("nearest neighbor matching without replacement not available.") > > ## Create data table with covariate data > covarData <- as.data.table(cbind(id = factor(c(rep(1, nrow(focal)), rep(0, nrow(pool)))), rbind(focal[covars], pool[covars]))) > > ## Assemble covariate formula > f <- as.formula(paste("id ~", paste(covars, collapse = "+"))) > > ## Run glm model > model <- speedglm(formula = f, data = covarData, family = binomial("logit"), fitted = TRUE, model = TRUE) > > ## Get propensity scores of focal and pool groups as vectors > psData <- data.table(ps = predict(model, type = "response"), id = model$model$id) > fps <- psData[id == 1, ps] > pps <- psData[id == 0, ps] > > ## Add propensity scores to covarData > covarData$ps <- psData$ps
> ## Develop new stratified matching method ------------------------------------------------ > > ## Helper function to assign fps and pps to n bins > stratify <- function(fm, pm, n) { > > ## Define breaks using fps and pps > mn <- min(c(fm$fps, pm$pps)) > mx <- max(c(fm$fps, pm$pps)) > br <- seq(from=mn, to=mx, by=(mx-mn)/n) > > ## Assign fps and pps to bin > fm$bin <- cut(fm$fps, breaks = br, include.lowest = TRUE) > pm$bin <- cut(pm$pps, breaks = br, include.lowest = TRUE) > > ## Assign indices to bins > fpsBins <- fm[, .(fpsN = .N, fpsIndices = list(fpsIndex)), by = bin] > ppsBins <- pm[, .(ppsN = .N, ppsIndices = list(ppsIndex)), by = bin] > > ## Define strata by joining fps and pps on bins > strata <- fpsBins[ppsBins, on = 'bin'] > > return(strata) > > } > > ## Initialize result, fpsOptions and ppsOptions > results <- data.table(bin=integer(), fpsIndex=integer(), ppsIndex=integer()) > fpsOptions <- data.table(fps, val = fps, fpsIndex = seq_along(fps)) > ppsOptions <- data.table(pps, val = pps, ppsIndex = seq_along(pps)) > i <- 1 > > while (nrow(results) != nrow(focal)) { > > ## Update n definition > n <- length(unique(c(fpsOptions$fps, ppsOptions$pps))) > > ## Stratify ps by bins and match focal and pool > strata <- stratify(fpsOptions, ppsOptions, n) > > ## If all fpsN > ppsN, set binsize to 1 > if (nrow(strata[!is.na(fpsN) & fpsN <= ppsN]) == 0) > strata <- stratify(fpsOptions, ppsOptions, 1) > > ## Assign indices that can be sampled > set.seed(123) > result <- strata[!is.na(fpsN) & fpsN <= ppsN, .(fpsIndex = unlist(fpsIndices), ppsIndex = sample.vec(unlist(ppsIndices), fpsN, replace = replace)), by = bin] > > ## Append to results > results <- rbind(results, result) > > ## Remove assigned indices from options > fpsOptions <- fpsOptions[!fpsIndex %in% result$fpsIndex] > ppsOptions <- ppsOptions[!ppsIndex %in% result$ppsIndex] > > print(sprintf("iteration %s: %s %% complete, %s bin(s)", i, round(nrow(results)/nrow(focal) * 100, 2), n)) > i <- i + 1 > # if (nrow(results) == nrow(focal)) break > > }
> ## Reorder by fpsIndex > results <- results[order(results$fpsIndex)] > > ## Assemble matched data table > mdt <- data.table(fps = fps[results$fpsIndex], ppsIndex = results$ppsIndex, fpsIndex = results$fpsIndex) > > ## Assemble information by group > matchedData <- rbind( > covarData[id == 1, c(.SD, group = 'focal')], > covarData[id == 0][mdt$ppsIndex, c(.SD, group = 'matched')], > covarData[id == 0, c(.SD, group = 'pool')], > covarData[id == 0][!mdt$ppsIndex, c(.SD, group = 'unmatched')] > ) > > ## Matched indices > matchedIndex <- mdt$ppsIndex > > ## Combine information in Matched object > obj <- nullranges:::Matched(matchedData = matchedData, matchedIndex = matchedIndex, covar = covars) > > ## Look at results ----------------------------------------------------------------------- > > overview(obj) > covariates(obj) > > table(matchedData(obj)[group == 'focal']$loopedEP) > table(matchedData(obj)[group == 'matched']$loopedEP) > > plot(obj, type = 'lines') > plot(obj, type = 'lines') + ggplot2::xlim(c(0, 0.02)) > plot(obj, type = 'lines') + ggplot2::scale_x_log10() > > plotCovariates(obj, type = 'lines') > plotCovariates(obj, type = 'lines', logTransform = TRUE) > > plotCovariates(obj, covar = "epDistance", type = 'lines', logTransform = FALSE) > plotCovariates(obj, covar = "contactFreq", type = 'lines', logTransform = TRUE) > plotCovariates(obj, covar = "anchor1.peakStrength", type = 'lines', logTransform = TRUE) >

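The core of the stratified method in the script above, distilled to a base-R toy (uniform random scores stand in for the model-based propensity scores; all names are illustrative):

```r
## Bin focal and pool scores on a shared grid, then sample pool indices within
## each bin so the matched scores mirror the focal score distribution.
set.seed(123)
fps <- runif(50, 0.4, 0.8)    # focal scores (toy)
pps <- runif(500, 0.1, 0.9)   # pool scores (toy)
br   <- seq(min(c(fps, pps)), max(c(fps, pps)), length.out = 21)
fbin <- cut(fps, breaks = br, include.lowest = TRUE)
pbin <- cut(pps, breaks = br, include.lowest = TRUE)
matched <- unlist(lapply(levels(fbin), function(b) {
  need <- sum(fbin == b)
  pool_idx <- which(pbin == b)
  if (need == 0 || length(pool_idx) < need) return(integer(0))
  pool_idx[sample(length(pool_idx), need)]  # without replacement within the bin
}))
## matched indexes the pool; bins with too few pool members are skipped here,
## whereas the full script retries them with wider bins in its while loop
```
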
Michael Love (17:13:24): - File (PNG): Screen Shot 2021-04-28 at 5.13.13 PM.png

Michael Love (17:14:10): > one step classier:slightly_smiling_face:i think i’ll add little ranges stacking up in the background

Michael Love (17:17:24): > a little drop shadow on the H_0? - File (PNG): Screen Shot 2021-04-28 at 5.17.13 PM.png

Wancen Mu (17:18:13) (in thread): > Definitely!

2021-04-29

Kasper D. Hansen (04:06:07): > make the mountain range an outline

Kasper D. Hansen (04:06:13): > Photos don’t work well

Michael Love (07:18:18): > yeah, agree, was thinking to do more outline filter in GIMP and then also de-saturate and lighten

Michael Love (08:14:42): > Two threads from yesterday: > * thanks @Eric Davis for the BentoBox code, happy to use that instead! looks way better already > * Discussion on twitter with Spencer and how they select control regions https://twitter.com/NystromSpencer/status/1387737327791915011 - Attachment (twitter): Attachment > @mikelove @anshulkundaje Absolutely. Both sound great. In some cases, TSS distance may be less good than % promoter. For example, most Drosophila genes are very close together and while there are long distance contacts, they’re not on the same scale as mammals, in this case % promoters may be ideal.

Eric Davis (08:20:54) (in thread): > It would be interesting to test out selecting background regions w/nullranges for Spencer’s package. We could try out a variety of Mike’s suggestions. Could be a neat vignette/use case. What do others think?

Michael Love (08:39:51) (in thread): > agree:slightly_smiling_face:

Michael Love (11:49:04) (in thread): > making it more cartoon-ish

Michael Love (11:49:09) (in thread): - File (PNG): Screen Shot 2021-04-29 at 11.47.48 AM.png

Michael Love (11:49:29) (in thread): > most important now is making Kasper hate the logo

Doug Phanstiel (12:43:50) (in thread): > yep. focal could be differential ATAC peaks

Doug Phanstiel (12:44:03) (in thread): > first just pull random regions from the genome (very bad)

Doug Phanstiel (12:44:14) (in thread): > then pull from all ATAC peaks

Doug Phanstiel (12:45:08) (in thread): > then pull from ATAC peaks matched for %promoter, or GC content, or K27ac, or ATAC signal, etc

Michael Love (15:50:45): > with some little null ranges floating in the background? - File (PNG): Screen Shot 2021-04-29 at 3.50.19 PM.png

Michael Love (15:55:17): > opaque ranges - File (PNG): Screen Shot 2021-04-29 at 3.55.09 PM.png

Michael Love (16:28:02): - File (PNG): Screen Shot 2021-04-29 at 4.22.46 PM.png

Michael Love (16:28:15): > thanks to Eric’s help in using BentoBox, this looks waaaay better than autoplot

Michael Love (16:28:24): - File (PNG): Screen Shot 2021-04-29 at 4.28.20 PM.png

Michael Love (16:28:39): > coloring by chrom in the original dataset helps to show what the bootstrap is doing

Michael Love (16:29:45): > @Wancen Mu we can also use this for coloring by segment x chrom: i have an idea of how to show this, will work on some demo code this weekend

Wancen Mu (16:34:13) (in thread): > That would be great! Wow, lots of progress going on, need to catch up!

Doug Phanstiel (18:02:50): > nice

Doug Phanstiel (18:03:07): > you could still plot them horizontally too if that is preferable

Michael Love (20:18:57): > this is good i think, better use of space

Michael Love (20:19:24): > i’m gonna keep tweaking it, but it’s much better than before

Michael Love (21:15:46): > more centered ranges - File (PNG): Screen Shot 2021-04-29 at 9.15.35 PM.png

Michael Love (21:16:06): > this is at least good enough for pre-Bioc submission

Doug Phanstiel (21:17:06): > Sounds good. Also, super minor but I think if you set spaceWidth to 0, all of the ranges will be plotted at the same y value. In the 2nd plot on chr2 one of the red ranges is higher than the others. Obviously, not a big deal but I notice these things

Doug Phanstiel (21:18:01) (in thread): > Looks great

Michael Love (21:18:03): > oh but sometimes the ranges are overlapping and then the shifting is a good thing

Doug Phanstiel (21:18:09): > oh yeah

Michael Love (21:18:13): > some of the bootstraps had nicely spaced ranges

Doug Phanstiel (21:18:15): > you’ll want it in that case

Stuart Lee (21:27:10): > the plots look really great!

Stuart Lee (21:27:44): > also @Michael Love not sure if you saw my message earlier, but it’s hard for me to make the south meeting now with the timezone differences, could we update it?

Michael Love (23:54:39) (in thread): > yeah, so from May 10 onward i’ll be in Europe, and then there aren’t any good meeting times for all three time zones. > > i could meet with you 16:00 Melbourne / 8:00 Europe every now and then to catch up?

Michael Love (23:55:55) (in thread): > i didn’t have a chance to poll the channel yet, but i’ll see if people want to keep a morning US East meeting every ~2 weeks

2021-04-30

Stuart Lee (00:40:28) (in thread): > That would be good thanks:smile:

Michael Love (08:07:24): > Re: meeting times for summer, I’ll be in European time zone, and I’ll be trying to meet with Stuart ~ every 2 weeks at a new time that won’t work for US time zones (16:00 Melbourne / 8:00 Europe). So I’ll just cancel the current “nullranges south” schedule > > For the “nullranges north” the next meeting was Friday May 21 at 8am. But as we said on the call, schedules are more open now. For me 8-11am US East, Mon-Thur preferred. Any other notes/requests for “nullranges north”?

Doug Phanstiel (08:25:44): > i think these times would work for me > Mon 9-11 > Tues 8-9, 10-11 > Thur 8-11

Wancen Mu (08:53:46): > I am pretty open to any time.

Tim Triche (09:41:17): > these times also happen to be perfect for me. At some point in the near future (Kasper kindly having edited the compartmap paper) I would like to present, or if possible have Ben (the first author) present, compartmap and see what you plural think of its fit (or lack thereof) with nullranges.

Tim Triche (09:41:26): > @Michael Love thanks for pointing out the memes/universalmotif thread on Twitter, by the way.

Tim Triche (09:42:25) (in thread): > this will motivate a comparison with chromVAR (https://www.nature.com/articles/nmeth.4401#Sec3) - Attachment (Nature Methods): chromVAR: inferring transcription-factor-associated accessibility from single-cell epigenomic data > ChromVar infers transcription-factor-associated accessibility from low-coverage or single-cell chromatin-accessibility data, thus enabling the clustering of cells and analysis of regulatory sequence motifs from sparse data sets.

Kasper D. Hansen (09:42:43): > I forgot to write this, and I did not think about your bootstrapping procedure, but in our old paper JP used some tricks to get really fast bootstraps of a principal component.

Tim Triche (09:43:00): > we should probably borrow those:slightly_smiling_face:

Tim Triche (09:43:37): > especially given the way that they help to perform a multiscale “damping” trick (the “unsharp mask” approach to finding boundaries)

Lauren Harmon (09:46:57): > @Lauren Harmon has joined the channel

Tim Triche (10:06:23) (in thread): > @Ben Johnson please note :smile:

Michael Love (10:07:51) (in thread): > is this bootstrapping of data or locations? i’m thinking nullranges will have limited scope of generation of ranges

Michael Love (10:22:32) (in thread): > hmm that seems a lot fancier than what we are doing/showing

Michael Love (10:23:18) (in thread): > my aim is to fly under the radar and avoid comparison with fancy methods by pointing out we have a limited scope: helping people define what they consider to be “control” in whatever fancy model they are running or building

Kasper D. Hansen (10:28:10) (in thread): > This is purely math about getting confidence intervals of svd(X) by bootstrapping X. It turns out that there’s a trick so you only have to compute svd(X) one time and not once for every bootstrap sample. It’s not really nullranges related, but it’s related to Tim’s compartmap

Ben Johnson (10:45:05) (in thread): > thanks@Kasper D. Hansen! will follow up with the paper from you and JP

Michael Love (10:57:12) (in thread): > Mathematicians hate him! One weird trick to reduce computation of CI of PCs with bootstrapping

Michael Love (10:58:08) (in thread): > there i wrote your title

Tim Triche (11:08:45) (in thread): > 10 simple rules for green eigencomputation

Ben Johnson (11:11:25) (in thread): > @Kasper D. Hansen is this the approach JP used? https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2015.1062383#.YIwd931Kj5A

Kasper D. Hansen (11:31:18) (in thread): > Yes. Was done by a fellow grad student about the time we wrote the paper.

2021-05-01

Michael Love (13:13:07): > heads up: I’m moving docs and vignettes around today, incl Eric and Wancen’s pieces

Michael Love (13:13:15): > i’m trying to get a pkgdown ready now

Michael Love (19:39:06): > https://nullranges.github.io/nullranges/ > > Note: I’ve added Stuart, Mikhail and Tim as contributors, which also populates the authors.html page. Feel free to add/change your person info, ORCID, href tag (in _pkgdown.yaml) etc. https://github.com/nullranges/nullranges/blob/main/DESCRIPTION > > As we go forward, code develops, people join project, etc. we can move things around as needed of course. I just had to start putting down Authors@R as we have a website now :slightly_smiling_face: - Attachment (nullranges.github.io): Generation of null ranges via bootstrapping or covariate matching > Modular package for generation of sets of ranges representing the null hypothesis. These can take the form of bootstrap samples of ranges (using the block bootstrap framework of Bickel et al 2010), or set of control ranges that are matched across one or more covariates. nullranges is designed to be inter-operable with other packages for analysis of genomic overlap enrichment, including the plyranges Bioconductor package.

Michael Love (19:41:32): > the website is good because it helps reveal the holes and the need to do more bridging across the branches of the project (this is primarily my job). while i like functions_with_underscore() and i think these are ok to remain like this internally, for the user facing functions of a BioC package we need to move to bootstrapRanges() and segBootstrapRanges() etc.

Mikhail Dozmorov (20:02:04): > The website seems to lack the hex sticker? I wish to have contributed more. Been playing with the vignettes - the new plotGRanges is awesome, may be a separate wrapper function in its own right. Btw, individual vignettes compile, but devtools::build_vignettes() fails with a non-descriptive error.

Mikhail Dozmorov (20:02:55) (in thread): > I tried to clone the repo again to double-check, and got the following: > > 'docs/reference/Matched.html' > 'docs/reference/matched.html' > 'man/Matched.Rd' > 'man/matched.Rd' > It is unlikely to be the case, but something seems off. >

Mikhail Dozmorov (20:04:44): > A couple of suggestions for the vignettes. More details and interpretation would help. E.g., what the datasets/objects are, X-Y axes on the plots, plot interpretations. And, better definition of the results, like this is the “null” GRanges object that can be taken for downstream analysis.

Doug Phanstiel (20:10:10): > Yeah, I think the vignettes are not in their final state. I think they were written as internal docs that were really helpful for us to see how the code was working, organized, etc. And a great starting point. But a little more text describing what is going on is definitely needed.

Doug Phanstiel (20:10:26): > But the page overall is looking great

Michael Love (20:20:22) (in thread): > oh yeah, i haven’t fully kicked the tires on pkg vignette building; i’ll take a look

Michael Love (20:20:45) (in thread): > will add hex this week for sure – adding CZI funding now!

Michael Love (20:21:53): > Absolutely. Putting things up as a pkgdown website is a good way to show the underbelly of the package :arrow_right: incentive to get things looking nice

Michael Love (20:22:50): > i mostly worked today on getting BentoBox diagrams working in the two bootstrap vignettes, they are really nice for showing what is going on, much more than text can do

2021-05-02

Doug Phanstiel (13:55:16): > They look nice!

2021-05-03

Eric Davis (10:27:19): > Some more package building errors > > Quitting from lines 69-74 (segmented_boot_ranges.Rmd) > Error in smooth.CNA(cna) : could not find function "smooth.CNA" > Calls: suppressPackageStartupMessages ... withCallingHandlers -> withVisible -> eval -> eval -> segment_density > Execution halted > > Exited with status 1. > > I think we are missing a call to library(DNAcopy) somewhere in the segmented_boot_ranges vignette?

Wancen Mu (10:30:10): > Ah, I edited the segment_density function yesterday, will fix it!

Michael Love (11:16:16) (in thread): > the strategy we employ is to use package::function so we don’t need to load the package. we were just missing the package:: part

Michael Love (11:17:34) (in thread): > i’m on the fence about doing this for ks as well, to winnow down our Imports. do you have an idea how many users you think will be wanting the “with replacement” analysis?

Michael Love (11:18:08): > We have progress and pbapply which kind of accomplish the same thing, can we pick just one?

Eric Davis (11:19:14) (in thread): > I think ks is used for the entire rejection sampling approach

Eric Davis (11:20:57) (in thread): > Can pbapply be used for a while loop?

Michael Love (11:21:07) (in thread): > i’m looking:slightly_smiling_face:

Michael Love (11:21:31) (in thread): > progress has more imports it looks like, so if that was possible it would help trim imports

Michael Love (11:25:06) (in thread): > so in ssMatch do you know in advance how many iterations you will proceed through? if so it’s preferable to avoid the while loop which grows results with results <- rbind(results, result)

Michael Love (11:25:53) (in thread): > the Bioc package checkers will look for this and recommend to reformulate into a single lapply followed by a single do.call(rbind, list)

Eric Davis (11:26:23) (in thread): > Right, I don’t know ahead of time or I would never resort to a while loop:sweat_smile:

Michael Love (11:26:31) (in thread): > ok got it, let me take a look

Eric Davis (11:27:49) (in thread): > I guess I could reformulate because I know the maximum number of iterations

Michael Love (11:28:03) (in thread): > well but then you don’t have access to the changes from the previous iteration

Eric Davis (11:28:12) (in thread): > ah right

Eric Davis (11:29:26) (in thread): > Its a little complicated to discuss on slack, but I wrote 3 alternative versions of ssMatch. One of them does have a fixed step size, but it produces suboptimal matching

Eric Davis (11:29:58) (in thread): > but I think the same problem would apply with not having changes from previous interations

Michael Love (11:30:45) (in thread): > is the rbind a non-trivial operation here in terms of memory? if so it may be more efficient i think to do res_list <- list(), res_list[[i]] <- ..., then results <- do.call(rbind, res_list)

Michael Love (11:31:08) (in thread): > even if we keep the while
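[editor's note] A minimal base-R sketch of the pattern being suggested here: keep the while loop, but accumulate results in a list and bind once at the end. The names (res_list, the stopping rule, the toy data.frame) are illustrative, not from the package.

```r
# Grow a list inside the while loop instead of rbind-ing a data.frame each
# iteration; a single do.call(rbind, ...) at the end avoids repeated copying.
res_list <- list()
i <- 0
while (i < 5) {                                  # stand-in stopping rule
  i <- i + 1
  result <- data.frame(iter = i, value = i^2)    # stand-in for one pass's rows
  res_list[[i]] <- result
}
results <- do.call(rbind, res_list)
nrow(results)  # 5
```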

Eric Davis (11:31:45) (in thread): > I don’t know memory-wise, but time-wise it is certainly not the bottleneck

Michael Love (11:32:00) (in thread): > ok

Michael Love (11:32:15) (in thread): > so how does the progress bar work if you don’t know how many iterations will happen? you just end early?

Michael Love (11:33:10) (in thread): > ok we can see when we submit how high we are on the # dependencies scoreboard :laughing:

Eric Davis (11:34:22) (in thread): > you can set the progress as a ratio - so with each iteration the number of rows in results grows until it approaches the length of fps.

Eric Davis (11:35:02) (in thread): > The while loop breaks when nrow(results) == length(fps) and that ratio is 1
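[editor's note] The ratio idea can be sketched with base R's txtProgressBar (the actual code uses the progress package; target here stands in for length(fps) and is purely illustrative):

```r
# Progress for a while loop of unknown iteration count: track the ratio of
# matched rows to the target size rather than an iteration counter.
target <- 100                                    # stand-in for length(fps)
pb <- txtProgressBar(min = 0, max = 1, style = 3)
n_matched <- 0
while (n_matched < target) {
  n_matched <- n_matched + sample(1:10, 1)       # stand-in for rows added this pass
  setTxtProgressBar(pb, min(n_matched / target, 1))
}
close(pb)
```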

Eric Davis (11:36:05) (in thread): > There might be similar options for pbapply, although I do like how you can customize each step with additional information - for example it also reports how many bins the data is stratified by

Michael Love (11:37:52) (in thread): > ok i’m ok with keeping both, or maybe we can even swap out pbapply

Michael Love (11:38:26) (in thread): > oh we only use it once, so I’ll recommend we just use ~progress~ BiocParallel there

Michael Love (11:38:38) (in thread): > I can take a shot at that

Michael Love (11:40:07): > Ah, so I think we may want to use BiocParallel instead of pbapply in segmented_bootstrap, I can look into making this change later this week

2021-05-04

Michael Love (07:23:30): > @Kasper D. Hansen i thought of you - File (PNG): Screenshot from 2021-05-04 07-22-12.png

Kasper D. Hansen (09:39:32): > Wait what, why does this remind you of me?

Michael Love (09:46:42) (in thread): > Reconstructing A/B compartments

Tim Triche (10:08:30) (in thread): > @Ben Johnson this is like a bat signal for you to present

Tim Triche (10:09:21): > Now we are morally obligated to break out the capture Hi-C kits to avoid… wait… damn it!

Kasper D. Hansen (10:11:37) (in thread): > ok. I think of some crazy polymer simulations when I read this headline, but I think that’s fair

Tim Triche (10:12:18) (in thread): > The irony of this is that doing Hi-C never saved anyone from computational modeling anyhow

Tim Triche (10:13:08) (in thread): > the DIPC and in-situ genome sequencing papers are pretty amazing though.

Michael Love (10:21:16) (in thread): > oh, i don’t even look at papers that attempt to recreate chromatin interactions from MD sims :scream:

Eric Davis (10:27:02): > @Michael Love the pkgdown site is really coming together but is also revealing some documentation issues - particularly with the matchRanges side of things. It seems like there is a lot of active development going on, so I am a bit wary of making breaking changes or causing conflicts. It would be great if we could meet and make a plan for what needs to be done/who will do it. What do you think?

Kasper D. Hansen (10:28:17) (in thread): > But I do think that’s the reference here. There is a whole community of biophysicists doing this

Michael Love (10:28:46): > yeah — i’m crazy busy the rest of this week but Monday next week?

Michael Love (10:29:24) (in thread): > for right now, don’t worry about breaking things — go ahead and break at will

Michael Love (10:29:48) (in thread): > i’ll rebuild the pkgdown next week

Eric Davis (10:30:41) (in thread): > I am averse to breaking things because I’d like to use these functions for parallel projects:sweat_smile:

Eric Davis (10:31:18) (in thread): > But i’ll just keep a stable build locally:+1:

Michael Love (10:32:02): > let me loop back here to get some times down

Tim Triche (10:34:50) (in thread): > We have breaking news: an entire community of physicists and chemists have been secretly modeling biopolymers for years prior to the advent of Hi-C. Be sure to catch 4DN On Your Side at 11pm for complete updates

Tim Triche (10:36:41) (in thread): > @Eric Davis fork or branch the project and track that? we use the CRAM branch of Rsamtools all the time, for example. Not sure whether a branch, tag, or fork is best but this problem can be overcome :slightly_smiling_face:

Eric Davis (10:40:29) (in thread): > @Tim Triche good point - I might keep my own stable branch for other analyses

Tim Triche (10:41:21) (in thread): > this works for forks as well, PR from the dev branch and use stable/release for other projects

Michael Love (10:45:32) (in thread): > ah i forgot that nullranges is already part of Phanstiel lab workflows :laughing: yeah i’d recommend making a stable branch that you can maintain if you want; you don’t need to fork

Michael Love (10:45:38) (in thread): > up to you

Michael Love (11:39:37) (in thread): > @Eric Davis Monday at 10, Tuesday 8 or 9?

Eric Davis (11:40:52) (in thread): > All of those times work for me. @Doug Phanstiel what works best for you?

Doug Phanstiel (12:22:40) (in thread): > mon at 10 is best for me

Michael Love (12:36:34) (in thread): > ok let’s do it, we can use https://uncsph.zoom.us/j/4133532783?pwd=VHl6dlNXMk5NYStCODN6S1IwaVliQT09

Michael Love (12:45:23) (in thread): > one more thing, Eric, you can update the docs if you want but don’t feel like you are required to do it every time. > > just running document() is sufficient, and then we can do sweeps every now and then to get the website to reflect the code base; again up to you if you want to show things to others via updating the docs with pkgdown > > once we are a bit more stable i can set up GitHub Actions to do the pkgdown

Stuart Lee (20:06:05) (in thread): > @Eric Davislet me know if you want help with anything - I have time this week to look over stuff

Mikhail Dozmorov (20:24:22) (in thread): > If that helps, the “3. Build {pkgdown} site” section at https://www.rostrum.blog/2020/08/09/ghactions-pkgs/ has simple steps for setting up pkgdown GitHub actions with usethis::use_github_action("pkgdown"). Works great.

Stuart Lee (20:24:51) (in thread): > Agreed @Mikhail Dozmorov! also the biocthis package is great for setting up gha

Mikhail Dozmorov (20:25:35) (in thread): > I’m yet to explore it!

Eric Davis (20:28:26) (in thread): > @Stuart Lee if you have some time to look at the class structure that would be great! You might have some suggestions for a better way to implement it

Stuart Lee (20:30:33) (in thread): > No problem! Ok I’ll take a look over it later today! Is this for the matchRanges stuff?

Eric Davis (20:32:49) (in thread): > Thanks! Yeah the relevant files are AllClasses.R, AllGenerics.R, methods-Matched.R, and methods-matchRanges.R

Stuart Lee (20:37:01) (in thread): > got it thanks!

Stuart Lee (23:48:34): > what style are we using for empty arguments in functions? I’ve noticed some use of missing(arg) in the package. I personally prefer setting arg = NULL as the default but it’s fine as long as we are consistent.

2021-05-05

Stuart Lee (00:01:44): > @Eric Davisthe code is looking good! Could you talk me through your thinking behind the class structure? It seems quite complicated at the moment and I’m not sure how it all fits.

Michael Love (08:17:48) (in thread): > agree with consistency. i think i’ve used both in the past; what’s your preference for NULL?

Michael Love (08:23:17) (in thread): > meanwhile we will also be thinking about whether or not to have a bootRanges class as well. The argument in favor would be, so that the bootstrapped sets of ranges can have associated metadata (e.g. the segmentation and length of blocks, settings etc)

Eric Davis (09:33:52) (in thread): > Thanks! The pdf shows a diagram of how the super and subclass systems relate and give access to the methods. Essentially we created a superclass ‘Matched’ that has methods for all of the ‘matching’ related functions and combined this with other existing classes (‘DFrame’, ‘DelegatingGenomicRanges’, and ‘GInteractions’) to create subclasses for each data.type of interest (i.e. ‘MatchedDataFrame’, ‘MatchedGRanges’, or ‘MatchedGInteractions’). This allow us to have methods for specific to these classes, but also allow them to have access to and behave like GRanges, GInteractions, etc… > > Let me know if you’d like to discuss this further! - File (PDF): class_structure.pdf

Tim Triche (12:07:30) (in thread): > this was nice when you first showed it and it holds up very well. v. helpful

Stuart Lee (18:59:56) (in thread): > My preference is for NULL mainly because I think it broadcasts that the argument is optional, but it’s not a big deal imo

Stuart Lee (19:00:38) (in thread): > That is great, and a lot clearer now, thanks. I reckon you should put that in the intro vignette

Michael Love (20:52:49) (in thread): > ok this makes sense to me too

Michael Love (20:54:29) (in thread): > Agree w/ Stuart that we use arg = NULL and then if (is.null(arg)) … for consistency across the package, instead of arg without a default value and then missing(arg). I’ll make this change throughout when i get a chance
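[editor's note] For reference, the two styles under discussion side by side; f_null and f_missing are toy functions, not from the package:

```r
# Preferred style: optional arguments default to NULL, checked with is.null().
f_null <- function(x, weights = NULL) {
  if (is.null(weights)) weights <- rep(1, length(x))
  sum(x * weights)
}

# Alternative style: no default, checked with missing().
f_missing <- function(x, weights) {
  if (missing(weights)) weights <- rep(1, length(x))
  sum(x * weights)
}

f_null(1:3)        # 6
f_null(1:3, 2:4)   # 20
```

A practical difference: with the NULL style, callers (and wrapper functions) can pass the default explicitly as `weights = NULL`, which `missing()` would not treat as absent.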

2021-05-06

Doug Phanstiel (06:48:04) (in thread): > Def put in the vignette

2021-05-09

Mikhail Dozmorov (22:15:48): > Finally, we have the AnnotationHub package example, https://github.com/mdozmorov/CTCF. This is following our conversation about creating one for the deny regions, or we can rename and expand this one. The gist is to have make-data.R and make-metadata.R. Used biocthis - it has a few scripts and functions to speed things up and set up BioC-compatible GH actions, thanks, @Stuart Lee. Any suggestions?

2021-05-10

Michael Love (00:53:47) (in thread): > recommend adding seqlengths: > > > seqlengths(CTCF_hg38) > chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 > NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA > chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY > NA NA NA NA NA NA NA NA >
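[editor's note] One way to fill those in, sketched under the assumption that GenomeInfoDb is available and the session has network access to UCSC (the object name CTCF_hg38 is from the package above; this is not necessarily how the package does it):

```r
library(GenomeInfoDb)
# Fetch hg38 chromosome lengths from UCSC and attach them to the GRanges;
# subsetting the Seqinfo keeps only the seqlevels present in CTCF_hg38.
si <- Seqinfo(genome = "hg38")
seqinfo(CTCF_hg38) <- si[seqlevels(CTCF_hg38)]
seqlengths(CTCF_hg38)
```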

Michael Love (00:53:57) (in thread): > looks great though

Michael Love (00:54:28) (in thread): > i’ll go through these steps to create a deny regions AHub package

Michael Love (01:00:41) (in thread): > maybe “Genomic coordinates of predicted CTCF binding sites” ?

Mikhail Dozmorov (07:07:46) (in thread): > Good points, will implement.

Mikhail Dozmorov (07:08:36) (in thread): > The package seems to be just halfway through - there’s a process of informing the BioC team and uploading to AWS. Will try as well.

Doug Phanstiel (07:40:21) (in thread): > Perfect timing! We would like to use this for the matchRanges vignette

Mikhail Dozmorov (11:40:50) (in thread): > After fixing seqinfo and minor glitches, upload to the cloud was easy; the instructions are straightforward.

Michael Love (12:00:29) (in thread): > nice, i will work on deny (after CZI grant submitted)

2021-05-11

Michael Love (05:26:19): > logo is up :slightly_smiling_face: i’ll spend some time this weekend tidying the vignettes as needed, bc on next Wed (5/19) we will submit a proposal to extend funding for nullranges development, and i want to include the link and also screenshots of the website and vignettes as part of the progress report > > i think it’s not necessary that the man pages are finalized, but as long as the package builds w/o error it will help me in tidying. I will also add Travis CI for now (bc i know how to do this, don’t want to fiddle with GitHub actions over the weekend)

Michael Love (05:36:49): - File (PNG): Screen Shot 2021-05-11 at 11.36.39 AM.png

Doug Phanstiel (07:22:44): > Looks great

Megha Lal (16:45:17): > @Megha Lal has joined the channel

Eric Davis (16:57:32): > I am working on building an example for the matchRanges vignette that uses CTCF peak calls. Does anyone have a preference between these files from AnnotationHub? > > > query(ah, c("CTCF", "GM12878", "narrowPeak")) > AnnotationHub with 8 records > # snapshotDate(): 2020-10-27 > # $dataprovider: UCSC > # $species: Homo sapiens > # $rdataclass: GRanges > # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, > # rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["AH22521"]]' > > title > AH22521 | wgEncodeAwgTfbsBroadGm12878CtcfUniPk.narrowPeak.gz > AH22809 | wgEncodeAwgTfbsSydhGm12878Ctcfsc15914c20UniPk.narrowPeak.gz > AH23120 | wgEncodeAwgTfbsUtaGm12878CtcfUniPk.narrowPeak.gz > AH23175 | wgEncodeAwgTfbsUwGm12878CtcfUniPk.narrowPeak.gz > AH25444 | wgEncodeOpenChromChipGm12878CtcfPkRep1.narrowPeak.gz > AH25876 | wgEncodeSydhTfbsGm12878Ctcfsc15914c20StdPk.narrowPeak.gz > AH27499 | wgEncodeUwTfbsGm12878CtcfStdPkRep1.narrowPeak.gz > AH27500 | wgEncodeUwTfbsGm12878CtcfStdPkRep2.narrowPeak.gz >

Mikhail Dozmorov (17:04:54) (in thread): > These are cell type-specific. I’m not sure what’s the closest cell type to monocytes/macrophages data in the enhPromContactFreqHg19 dataset. The predicted CTCF sites are generic.

Eric Davis (17:07:12) (in thread): > For the new example we are using data from GM12878 - I want to use ChIP data to identify peak locations in that cell type and intersect them with your CTCF motif predictions

Mikhail Dozmorov (17:09:41) (in thread): > Then, wgEncodeAwgTfbsBroadGm12878CtcfUniPk is best.

Eric Davis (17:10:15) (in thread): > thanks!:slightly_smiling_face:

2021-05-12

Michael Love (07:53:08) (in thread): > oh i just saw that Travis CI ended free offering for public repos in 11/2020, guess that means i should go straight for GitHub Actions

Tim Triche (09:58:39): > so this is interesting, since one of the verification figures in the compartmap paper is “how well do the detected boundaries line up with properly oriented CTCF sites”. when is the next scheduled nullranges call?

Michael Love (10:17:47): > just fixed the calendar — Tues next week at 8am East

Tim Triche (10:17:55): > you’re putting in your proposal friday?

Michael Love (10:18:05): > it’s Wed deadline 5/19

Michael Love (10:18:23): > things are looking good i’m just touching up the vignettes now, and putting in Actions

Tim Triche (10:19:10): > I’ll get the paper deposited today (Ben fixed up the codebase in BioC) and you plural can decide whether compartmap is a useful hook for general audience expansion of nullranges’ application. I don’t think much is required (if anything) in terms of code on your end, and the discussion re: CTCF site specification is relevant there too.

Ben Johnson (10:23:46): > @Michael Loveplease let me know if you need figures or additional background if compartmap is of interest to incorporate into the grant

Ben Johnson (10:24:11): > also still trying to rig up github actions for R CMD check

Michael Love (10:24:24): > so the issue is that there are very few words in the proposal

Michael Love (10:25:12): > many sections are 250-500 words, so you really have to make your case in few words

Michael Love (10:25:34): > that said — very interested in collaborating, just that i may not have words to put it in this proposal

Ben Johnson (10:26:51): > whoa, that’s a crazy word limit!

Michael Love (10:26:54): > i think compartmap + nullranges have a lot of shared interest

Michael Love (10:27:43): > yeah, also they want to fund broad, general purpose proposals [e.g. like bedtools], so i’m keeping an eye that we are useful for many types of analysis, not focused only on chromatin organization

Ben Johnson (10:28:03): > I wonder if something regarding compartmap + nullranges would be useful to spin in a way that enables exploration of higher-order chromatin inference in existing CZI funded scRNA-seq projects?

Ben Johnson (10:29:04): > ah, that makes sense

Michael Love (11:17:18): > yeah:confused:i’m trying to balance the text so it’s showing that nullranges is for various enrichment questions, but i’m definitely interested in seeing how to combine our pkgs in workflows

Michael Love (16:09:33): > ok now have GitHub Actions + badge — we have warnings right now due to docs issues, but i just changed those to a “pass”, and i’m not building vignettes right now (i’ll work on those tomorrow and Friday) - File (PNG): Screen Shot 2021-05-12 at 10.07.59 PM.png

Michael Love (16:11:01): > the build warnings are mostly the same as what you’d get locally withR CMD check

Mikhail Dozmorov (19:17:09): > Nathan Sheffield just published a preprint with a similar idea of randomized regions, the Bedshift tool https://doi.org/10.1101/2020.11.11.378554. As the name suggests, it only shifts and adds/drops regions.

2021-05-13

Michael Love (00:25:51) (in thread): > yeah I think I’ve spoken with Nathan before about the need for more tools in this space. I think as it’s command line it’s a useful complement to shuffleBed in a bedtools call, but hopefully doesn’t eclipse what we’ve got

Michael Love (00:28:47) (in thread): - File (PNG): Screen Shot 2021-05-13 at 6.28.41 AM.png

Michael Love (00:28:57) (in thread): - File (PNG): Screen Shot 2021-05-13 at 6.28.52 AM.png

Michael Love (00:29:44) (in thread): > The Bickel motivation for the bootstrap is that, if the features are clumpy, the blocks will be needed to get the right null distribution

Michael Love (00:41:42) (in thread): > there are some interesting references here. i hadn’t seen GAT:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3722528/ - Attachment (PubMed Central (PMC)): GAT: a simulation framework for testing the association of genomic intervals > Motivation: A common question in genomic analysis is whether two sets of genomic intervals overlap significantly. This question arises, for example, when interpreting ChIP-Seq or RNA-Seq data in functional terms. Because genome organization is complex, …

Michael Love (00:42:25) (in thread): > > GAT implements a null model that the two sets of intervals are placed independently of one another, but allows each set’s density to depend on external variables, for example, isochore structure or chromosome identity. > this is getting at the segmentation, but then, if there is additionally clumpiness within regions of high density, block is still needed

Michael Love (00:43:17) (in thread): > another nice reference i hadn’t seen:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6499241/ - Attachment (PubMed Central (PMC)): Colocalization analyses of genomic elements: approaches, recommendations and challenges > Many high-throughput methods produce sets of genomic regions as one of their main outputs. Scientists often use genomic colocalization analysis to interpret such region sets, for example to identify interesting enrichments and to understand the interplay …

Michael Love (05:34:12): > I’ve winnowed the package check WARNINGS down to the following: > > Undocumented S4 methods: > generic 'matchRanges' and siglist > 'DF_OR_df_OR_dt,DF_OR_df_OR_dt,formula,character,logical' > generic 'matchedData' and siglist 'Matched' > > … > > Objects in \usage without \alias in documentation object 'matchRanges': > '\S4method{matchRanges}{DF_OR_df_OR_dt,DF_OR_df_OR_dt,formula,character,logical}' > > Documented arguments not in \usage in documentation object 'Matched-class': > 'y' > Objects in \usage without \alias in documentation object 'Matched-class': > '\S4method{matchedData}{Matched}' > > … > I can keep looking but if anyone has any insight how to populate these missing S4 methods (via roxygen2 tags) that would help

Michael Love (05:36:52): > re: plot i think we shouldn’t export an S4 generic but instead use the following strategy: https://github.com/mikelove/DESeq2/blob/master/R/core.R#L244 and https://github.com/mikelove/DESeq2/blob/master/R/methods.R#L855 If @Eric Davis agrees, I can make this change

Eric Davis (07:54:04) (in thread): > Sounds good to me!
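[editor's note] The DESeq2-style strategy referenced above amounts to keeping a plain function that carries the implementation, and registering it as a method on an existing generic (here base plot) rather than exporting a new S4 generic. A self-contained sketch with a toy class standing in for Matched:

```r
library(methods)

# Toy class standing in for the package's Matched class.
setClass("Matched2", representation(ps = "numeric"))

# Plain function carrying the implementation (and the documentation).
plotMatched <- function(x, ...) {
  plot(x@ps, ylab = "propensity score", ...)
}

# Register it on the existing 'plot' generic instead of exporting a new one.
setMethod("plot", signature(x = "Matched2", y = "missing"),
          function(x, y, ...) plotMatched(x, ...))

m <- new("Matched2", ps = runif(10))
existsMethod("plot", signature("Matched2", "missing"))  # TRUE
```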

Mikhail Dozmorov (08:06:17) (in thread): > :slightly_smiling_face: I remember reviewing the latter. And yes, GAT is good, it accounts for length and GC content. Maybe we can use more references about different tools from here: https://doi.org/10.1093/bioinformatics/btx414

Mikhail Dozmorov (08:09:10) (in thread): > Yes, this is the issue with Roxygen tags. Struggling with related issues in another case, at the stage of editing .Rd files. Don’t know the solution yet, also keep looking.

Michael Love (08:56:26) (in thread): > saved

Michael Love (08:57:22) (in thread): > yeah… i’m thinking also to maybe split some of these Rd’s up to see if that solves things. while it’s nice to group them together I think the issue may be some of the overloaded Rd’s

Michael Love (08:58:16) (in thread): > https://github.com/nullranges/nullranges/blob/main/man/Matched.Rd#L6-L13

2021-05-14

Michael Love (09:00:56): > i’ve nearly solved the docs warnings…

Mikhail Dozmorov (10:16:09) (in thread): > Curious to see. Haven’t done much on documenting methods..

Michael Love (11:09:01) (in thread): > ok solved, here was the trick: https://github.com/nullranges/nullranges/blob/main/R/methods-Matched.R#L3-L20

Michael Love (11:09:24) (in thread): > i had to add the NULL and then break out the S4 method in order for the docs to pass check

Michael Love (11:10:17) (in thread): > i noticed the pattern was that the S4 method that was always causing the problem was the one with the roxygen2 docs above it, while the other S4 methods which were piggybacking on the Rd file were fine. now they all piggyback on the Rd file in the same way

Michael Love (11:10:34) (in thread): > no more WARNINGS
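[editor's note] Roughly, the trick looks like this: the roxygen2 docs sit above a bare NULL, and every S4 method (including the one that used to carry the docs) piggybacks on the same Rd file via @rdname. Class, generic, and slot names below are illustrative, not copied from the package:

```r
#' Matched class methods
#'
#' @param x a Matched object
#' @param ... additional arguments
#'
#' @rdname Matched
NULL

#' @rdname Matched
#' @export
setMethod("matchedData", "Matched", function(x, ...) {
  x@matchedData   # slot name illustrative
})
```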

Mikhail Dozmorov (11:32:18) (in thread): > Interesting, @rdname is not the most frequently used tag; had to read about it. They have an example of how NULL is used; still, it looks like an extra layer of complexity.

Michael Love (12:03:56): > Ok now we have no warnings, and i’ve added in vignettes to the check as well. I’ve changed it so future commits that lead to warnings will break the PASS badge (and you’ll get an email from GitHub I think). This just means, you should run a quick check on your end before pushing changes that you think would break the build/tests/vignettes/examples. I’d like to cut the vignettes down to 30 seconds each also to make the checking less of a delay. - File (PNG): Screen Shot 2021-05-14 at 6.01.22 PM.png

Michael Love (12:08:02): > the NOTES are about non-standard evaluation stuff: > > combnCovariates: no visible binding for global variable '.' > nnMatch: no visible global function definition for '.' > nnMatch: no visible binding for global variable 'ppsIndex' > nnMatch: no visible binding for global variable 'N' > nnMatch: no visible binding for global variable 'fpsIndex' > overviewMatched: no visible binding for global variable 'group' > plot_covariates: no visible binding for global variable 'group' > plot_covariates: no visible binding for global variable 'data' > plot_covariates: no visible binding for global variable 'value' > plot_propensity: no visible binding for global variable 'group' > plot_propensity: no visible binding for global variable 'ps' > propensityMatch: no visible binding for global variable 'id' > propensityMatch: no visible binding for global variable 'ps' > set_matched_plot: no visible binding for global variable 'group' > ssMatch: no visible binding for global variable 'fpsN' > ssMatch: no visible binding for global variable 'ppsN' > ssMatch: no visible global function definition for '.' > ssMatch: no visible binding for global variable 'fpsIndices' > ssMatch: no visible binding for global variable 'ppsIndices' > ssMatch: no visible binding for global variable 'bin' > ssMatch: no visible binding for global variable 'fpsIndex' > ssMatch: no visible binding for global variable 'ppsIndex' > stratify: no visible global function definition for '.' > stratify: no visible binding for global variable 'fpsIndex' > stratify: no visible binding for global variable 'bin' > stratify: no visible binding for global variable 'ppsIndex' > combnCov,character: no visible binding for global variable '.' > overview,Matched: no visible binding for global variable 'group' > plot,Matched-missing: no visible binding for global variable 'group' > plot,Matched-missing: no visible binding for global variable 'ps' > Undefined global functions or variables: > . N bin data fpsIndex fpsIndices fpsN group id ppsIndex ppsIndices ppsN ps value >

Michael Love (12:08:36): > it’s not a big deal now, just some noise during the check. don’t know how to get around it, maybe @Stuart Lee has thoughts

Michael Love (12:09:42): > e.g. here: https://github.com/nullranges/nullranges/blob/main/R/methods-Matched.R#L183-L188

Stuart Lee (21:20:54) (in thread): > no probs, I can take a look at this, usually it’s just a matter of using rlang quosures or setting the variable to NULL in the scope of the function
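The two fixes Stuart mentions can be sketched like this (variable names taken from the check output above; `plot_example` is a hypothetical stand-in, not a nullranges function):

```r
# Silencing "no visible binding for global variable" NOTEs caused by
# non-standard evaluation (data.table / ggplot2).

# Option 1: declare the names package-wide, e.g. in an R/globals.R file
utils::globalVariables(c(".", "group", "ps", "bin", "ppsIndex", "fpsIndex"))

# Option 2: bind the names to NULL in the scope of the function that uses
# them; the NSE machinery rebinds them at evaluation time
plot_example <- function(df) {
  group <- value <- NULL
  # ... e.g. ggplot(df, aes(x = value, fill = group)) + geom_density()
  invisible(df)
}
```

Option 1 is a single declaration per package; Option 2 keeps the fix local to each function that uses NSE.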

Stuart Lee (21:23:09) (in thread): > also is BentoBox on cran? if we want to build vignettes on gha you will need to set the Remotes: field in the description.

2021-05-15

Michael Love (00:03:48) (in thread): > yeah i should do that, i hacked it into the GHA, but should do the proper way

Michael Love (07:25:36) (in thread): > my hack for now: https://github.com/nullranges/nullranges/blob/main/.github/workflows/check-bioc.yml#L180 The plan is that BentoBox and hictoolsr will be submitted either before or at the same time as nullranges

Michael Love (07:26:12) (in thread): > nullrangesData will go away (or be repurposed) > > we will instead use AHub for example data

2021-05-16

Mikhail Dozmorov (10:53:08): > Plotting methods for MatchedGInteractions seem not to be working. In the match_ranges and match_ranges_methods vignettes, the plot functions (plot() and plotCovariates()) error with Error in switch(type, jitter = ggplot(data, aes = aes(x = !!x, y = group, : EXPR must be a length 1 vector. I see the plot_propensity and plot_covariates methods defined, but am not sure why they are not working. Those code chunks are currently eval=FALSE, so they may not be ready?

Michael Love (13:28:44) (in thread): > yeah I think I saw those errors. > > I did eval=FALSE on the vignette because I was doing a lot of checking wrt the docs and GitHub Actions, and the vignette takes 100 seconds on my side. It would be good to get each vignette down to ~30 seconds for ease of package checking which we will want to do now before pushing to GH

Michael Love (13:29:50) (in thread): > thanks for the PR by the way, I’ll check it out tomorrow morning. I’m also doing some re-writing of code in the segmented boot R file, just re-organization and streamlining variable names a bit, i imagine i will push tomorrow morning

Mikhail Dozmorov (13:43:32) (in thread): > Makes perfect sense to aim for minimal time for vignettes. Catching up with the commits, the package is shaping up nicely. The PR is trivial, minor fixes while trying the code.

Mikhail Dozmorov (13:45:07) (in thread): > Btw, happy to make a PR for the “deny regions” package. Don’t know if it’s been started… I’ve been collecting notes on questionable regions, https://github.com/mdozmorov/ChIP-seq_notes#blacklisted. Can systematically organize/liftover this data, together with the canonical deny regions.

Michael Love (15:07:02) (in thread): > oh wow, that’s great. i think you’re quite ahead of me. If you want to put it together and be maintainer entirely up to you. i wouldn’t be able to start anything until last week of May.

Mikhail Dozmorov (16:23:47) (in thread): > I’ll take the first stab at it, and we all can add what’s necessary. Any suggestion for the package name? Perhaps denyranges (paralleling nullranges)?

2021-05-17

Michael Love (01:01:03) (in thread): > sounds good to me!

Stuart Lee (01:25:53) (in thread): > this was my mistake when I edited the plot functions. should be fixed now.

Stuart Lee (01:26:02) (in thread): > the globals check is hopefully fixed now

Michael Love (04:03:36) (in thread): > cool, i’ll be pulling things together today and tidying vignettes and docs

Michael Love (07:19:42): > toy example for block bootstrapping within segmentation states, as implemented by Wancen:slightly_smiling_face: - File (PNG): Screen Shot 2021-05-17 at 1.18.53 PM.png - File (PNG): Screen Shot 2021-05-17 at 1.19.02 PM.png

Michael Love (07:25:15) (in thread): > some details: > > it’s quite fast it seems. takes ~0.13 seconds for 1 bootstrap of all 170k DHS in A549 (with mcols) across three segmentation states. thanks to Wancen for line profiling of the core functions > > there is a little bug in BentoBox right now (I told Nicole thru an Issue), so i have some extra code in the vignette to fix the plot [that little red blip in the second plot is part of the hack, not from the bootstrapping] > > I was reworking the segmented block boot code, and ended up dropping the implementation for within chromosome — just bc it was a lot of extra code that I’m not sure we want to support (the leaner the better). it’s a few times slower (seg or unseg) than across chrom, and I don’t see people wanting to do within chrom bootstrapping > > i didn’t have a chance to look at how we handle trimming of bootstrapped ranges over the deny regions, i’ll have time after grant submission > > because the blocks go over their segments, it is possible to pull a range from another state into a different state in the block bootstrapped sample. as the segmentation is not “real” i’m not so worried about that, but if someone can convince me otherwise, we can address this with extra code
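As a point of reference, the block bootstrap being discussed can be caricatured in base R for a single chromosome and a single segmentation state (all names here are hypothetical; the real implementation additionally handles segmentation states, mcols, range widths, and deny regions): tile the chromosome into blocks, sample source blocks with replacement, and shift the features in each sampled block into place.

```r
# Toy block bootstrap on one chromosome (positions only, one segment).
# Editor's sketch, not the nullranges implementation.
block_bootstrap <- function(pos, chrom_len, block_len, seed = 1) {
  set.seed(seed)
  n_blocks <- ceiling(chrom_len / block_len)
  # for each output block, sample a source block with replacement
  src <- sample.int(n_blocks, n_blocks, replace = TRUE)
  out <- unlist(lapply(seq_len(n_blocks), function(b) {
    lo <- (src[b] - 1) * block_len
    hits <- pos[pos >= lo & pos < lo + block_len]
    hits - lo + (b - 1) * block_len  # shift into output block b
  }))
  sort(out)
}

set.seed(2)
pos <- sort(sample.int(1e4, 200))
boot <- block_bootstrap(pos, chrom_len = 1e4, block_len = 1e3)
# the bootstrap preserves local clustering; the sample size varies run to run
```

Sampling blocks rather than individual features is what preserves the local clustering structure of the input, which is the point of the method.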

Kasper D. Hansen (07:32:10) (in thread): > Without knowing the details, within chromosome bootstrapping should be possible by supplying a GRanges object with a single chromosome

Michael Love (07:37:45): > i’m now working on the pkgdown site deployed by GHA

Michael Love (07:38:07) (in thread): > correct, that will work

Michael Love (07:38:37) (in thread): > well it may not work ATM but minimal changes needed to code to make it work

Michael Love (07:59:00) (in thread): > ok done, just had to run deploy… locally first

Eric Davis (14:28:52): > @Stuart Lee When you have a chance, can you take a look at these errors with the new versions of plotPropensity and plotCovariates? It seems like ridges works, but not lines: > > > nullranges::plotPropensity(mgi, type = 'lines') > Error in set_matched_plot(md, type, cols, x = "ps") : > object 'ans' not found > > nullranges::plotCovariates(mgi, type = 'lines') > Error in set_matched_plot(mmd, type, cols = cols[names(cols) %in% sets], : > object 'ans' not found > In addition: Warning message: > In melt.data.table(md, measure.vars = covar) : > 'measure.vars' [totalSignal, n_sites, n_intervening_sites, ...] are not all of the same type. By order of hierarchy, the molten data value column will be of type 'double'. All measure variables not of type 'double' will be coerced too. Check DETAILS in ?melt.data.table for more on coercion. >

Eric Davis (15:01:30) (in thread): > It might be that the type argument is not working

Michael Love (15:29:27): > I’m starting to structure the reference pages, feel free to move things around as you like: https://github.com/nullranges/nullranges/blob/main/_pkgdown.yaml#L33-L55 (you can use multiple sections if you want) > > you can use pkgdown::build_reference() to test it locally, and then when you like it, just commit your changes to the yaml to main. Don’t push the docs folder from main to GitHub (it’s ignored anyway). Now we use the gh-pages branch for the website, and the docs are built automatically by GHA upon commits to main. The references page does look a little funny when you build it (bc the rest of the pkgdown site is missing, so the alignment is a bit off). You could even do pkgdown::build_site() if you want to test things locally, just don’t push that from main to GH

Mikhail Dozmorov (20:00:24): > The prototype of the denyranges package is ready. There are many definitions of deny regions, and they are all different. https://github.com/mdozmorov/denyranges/blob/main/README.md. Any suggestions for improvement?

Stuart Lee (20:59:52) (in thread): > sure i’ll take a look now

Stuart Lee (21:09:36) (in thread): > ah i thought it was “line” not “lines”

Stuart Lee (21:09:40) (in thread): > i’ll add a check in for that

Eric Davis (21:10:11) (in thread): > thanks! Shouldn’t match.arg() catch both?

Eric Davis (21:10:51) (in thread): > > > match.arg(arg = c('line'), choices = c('lines', 'ridges')) > [1] "lines" >

Stuart Lee (21:12:40) (in thread): > yep that works
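For the record, the behavior Eric demonstrates comes from match.arg()'s partial matching in base R: an unambiguous prefix of a choice (like "line" for "lines") resolves to the full choice, while anything else errors. A minimal sketch, with a hypothetical wrapper:

```r
# match.arg() partially matches the supplied value against the choices,
# so an unambiguous prefix like "line" resolves to "lines".
pick_type <- function(type = c("lines", "ridges")) {
  match.arg(type)
}

pick_type()         # no argument: defaults to the first choice, "lines"
pick_type("line")   # partial match resolves to "lines"
pick_type("ridge")  # partial match resolves to "ridges"
```

So once the argument goes through match.arg() with "lines" among the choices, a separate check for "line" isn't strictly needed.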

Stuart Lee (21:15:12) (in thread): > i just pushed up the changes, which should fix that issue

Stuart Lee (21:15:20) (in thread): > let me know if there’s anything else

Eric Davis (21:15:36) (in thread): > will do, thanks Stuart!

Stuart Lee (21:18:03) (in thread): > looking good! How much of the internals of the Matched objects are user facing? For example, the class union “DF_OR_df_OR_dt” probably won’t be directly touched by a user

Stuart Lee (21:39:46): > One very minor thing but could we make the organisation logo on github the nullranges hex sticker

Stuart Lee (21:48:06) (in thread): > Looking good Mikhail! One suggestion if you want to compute the jaccard statistic in a pure R way here’s a function adapted from the HelloRanges vignette > > jaccard <- function(gr_a, gr_b) { > intersects <- intersect(gr_a, gr_b, ignore.strand = TRUE) > intersection <- sum(width(intersects)) > union <- sum(width(union(gr_a, gr_b, ignore.strand = TRUE))) > DataFrame(intersection, union, > jaccard = intersection/union, > n_intersections = length(intersects)) > } >

Mikhail Dozmorov (22:02:36) (in thread): > This is great, will implement! And I learned about HelloRanges, didn’t know about it.

2021-05-18

Michael Love (01:23:08): > oh sure!

Michael Love (01:26:43): - File (PNG): Screen Shot 2021-05-18 at 7.26.37 AM.png

Michael Love (01:27:55) (in thread): > yeah, any thoughts on hiding that but still passing check? we could put a heading “Internal stuff users don’t need to see”

Michael Love (02:55:21) (in thread): > this is great Mikhail, will be very helpful for many people on Bioconductor. > > recommendation for the vignette / README: the total width of the deny regions (log10 scale instead of transform) for comparison of the sources - File (PNG): Screen Shot 2021-05-18 at 8.54.14 AM.png

Michael Love (02:56:02) (in thread): > > mtx_to_plot <- data.frame(TotalWidth = c(sum(width(denyGR.hg38.Bernstein)), sum(width(denyGR.hg38.Kundaje.1)), sum(width(denyGR.hg38.Kundaje.2)), sum(width(denyGR.hg38.Reddy)), sum(width(denyGR.hg38.Wold)), sum(width(denyGR.hg38.Yeo))), Source = c("Bernstein.Mint_Blacklist_GRCh38", "Kundaje.GRCh38_unified_blacklist", "Kundaje.GRCh38.blacklist", "Reddy.wgEncodeDacMapabilityConsensusExcludable", "Wold.hg38mitoblack", "Yeo.eCLIP_blacklistregions.hg38liftover.bed")) > > ggplot(mtx_to_plot, aes(x = TotalWidth, y = Source, fill = Source)) + geom_bar(stat="identity") + scale_x_log10() >

Michael Love (02:57:02) (in thread): > here you see that Yeo, Wold, and Kundaje (non-unified) are actually pretty close in the extent of denied regions

Michael Love (02:58:55) (in thread): - File (PNG): Screen Shot 2021-05-18 at 8.58.43 AM.png

Michael Love (02:59:10) (in thread): > here with scale_y_discrete(label=abbreviate)

Michael Love (07:32:21): > hi - nullranges north meeting in half an hour for whoever can make it, https://uncsph.zoom.us/j/94855145668?pwd=Yms1ekxJUjRMZWJRRGlEVVJkMnRQUT09 (Doug can’t but I think Eric has some new things to show?) > > Agenda: > > * Eric updates > * Mikhail’s denyranges package > * Mike’s updates to seg block boot + vignette > * CZI proposal items

Mikhail Dozmorov (07:54:58) (in thread): > Indeed, very informative, will add. And, didn’t know about label=abbreviate, very convenient!

Michael Love (08:02:26): > @Wancen Mu are you free for the meeting — wanted to ask you about some changes I made this weekend (maybe the meeting invite missed your email?)

Wancen Mu (08:40:26) (in thread): > Sorry, my alarm didn’t wake me up. Did I miss it?

Tim Triche (09:05:06): > @Eric Davis is the CTCF vignette posted somewhere? I was going to point @Ben Johnson at it based on our conversation this AM:slightly_smiling_face:

Eric Davis (09:06:16): > Here is the one from this morning - I’ll do some additional cleaning before committing to the repo - File (HTML): Using_matchRanges.html

Michael Love (09:20:28) (in thread): > no problem !:slightly_smiling_face:

Doug Phanstiel (10:17:24): > Perhaps not the most important thing. But what about adding a favicon to the pkgdown site?

Michael Love (10:34:08) (in thread): > working on it

Michael Love (10:48:23) (in thread): > in theory it’s there now

Doug Phanstiel (10:49:56) (in thread): > Is it a grey N?

Michael Love (11:23:33) (in thread): > it’s supposed to be the logo, smaller

Michael Love (11:23:48) (in thread): > https://github.com/nullranges/nullranges/blob/gh-pages/favicon.ico

Michael Love (11:24:02) (in thread): > https://github.com/nullranges/nullranges/blob/gh-pages/favicon-32x32.png

Stuart Lee (18:18:08) (in thread): > I’ve added these into main with pkgdown::build_favicons(). I think the favicons need to be on the main branch, otherwise gha will overwrite them when it builds the site

Stuart Lee (18:35:51) (in thread): > huzzah - File (PNG): Screen Shot 2021-05-19 at 8.35.13 am.png

2021-05-19

Michael Love (02:13:31) (in thread): > this will surely convince the funders

Michael Love (02:13:35) (in thread): > :wink:

Michael Love (03:39:47): > CZI app submitted:crossed_fingers:thanks all. I know it will be competitive to renew but think we’ve got a pretty useful toolbox here:hammer_and_wrench:

Eric Davis (14:49:40): > Currently, the plotCovariates function works well for continuous data, but it behaves very inconsistently with categorical data or a mixture of the two. I propose reformatting these functions to only accept a single covariate at a time (currently you can do multiple) and adding some alternate visualizations for categorical data (like stacked bar plots). While we would lose the ability to plot all covariates at once, we could gain the ability to sense the data type and provide better looking defaults. I think this will result in a much cleaner solution. Does anyone have thoughts about this?

Doug Phanstiel (15:09:00): > Can you just detect what kind of data is provided without users having to specify or add separately?

Eric Davis (16:35:42) (in thread): > Should be simple if users enter categorical data as factors, but more complicated if the data is numeric. We would have to make a judgement call about how many unique values justify showing data as categorical or continuous.
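The judgement call Eric describes could be sketched like this (a hypothetical helper, not a nullranges function; the unique-value threshold is an arbitrary assumption):

```r
# Guess whether a covariate should be plotted as categorical or continuous:
# factors/characters/logicals are categorical, numerics are categorical only
# when they have few unique values (threshold is a judgement call).
guess_type <- function(x, max_levels = 5) {
  if (is.factor(x) || is.character(x) || is.logical(x)) {
    "categorical"
  } else if (is.numeric(x) && length(unique(x)) <= max_levels) {
    "categorical"
  } else {
    "continuous"
  }
}

guess_type(factor(c("a", "b")))  # "categorical"
guess_type(c(0, 1, 1, 0))        # few unique values: "categorical"
guess_type(rnorm(100))           # "continuous"
```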

Doug Phanstiel (18:55:14) (in thread): > Yeah. I don’t think that would be the end of the world

Doug Phanstiel (18:55:43) (in thread): > you can show the example in question either here or in the next meeting and we can get input from others

Stuart Lee (19:50:29) (in thread): > I guess one question is what do you want the user to get out of this plot when seeing categorical variables, and do you want them to be able to compare across variables in a single view? I think currently we facet with free scales, which implies you want to look at one variable at a time. I think it simplifies the approach a bit if you take one variable at a time.

Stuart Lee (19:54:46) (in thread): > For multiple categorical variables another approach would be to use mosaics if interested in seeing if there’s independence between sets

Stuart Lee (20:01:25) (in thread): > I like this. One thought is you could do plotCovariate for a single var at a time and plotCovariateSets or something similar if you want to look at multiple variables. Another thought is that I think we should use colour palettes from ggplot2 instead of scale colour manual, i.e. scale_color_brewer(palette = “Dark2”) is colour blind friendly.

Eric Davis (20:11:37) (in thread): > I like the idea of plotCovariate because it simplifies things and we could use patchwork to view multiple at the same time

Stuart Lee (20:13:15) (in thread): > Yeah I think leaving the user to stitch plots together is a good idea. With plotCovariateSets I was thinking more of multivariate plots, if there was an interest in seeing how matching performs on variables jointly (not sure if that is a thing that is interesting?)

Eric Davis (20:15:28) (in thread): > That’s a good idea - what kind of visualization would you suggest for that?

Stuart Lee (20:19:37) (in thread): > i think a scatterplot matrix could work for numeric variables, where you do like focal_var1 vs matched_var1 and focal_var1 vs matched_var2 and so on… but that doesn’t scale to large numbers of variables

Stuart Lee (20:21:19) (in thread): > Another approach would be to use parallel coordinates <https://en.wikipedia.org/wiki/Parallel_coordinates> or even a heatmap - Attachment: Parallel coordinates > Parallel coordinates are a common way of visualizing and analyzing high-dimensional datasets. > To show a set of points in an n-dimensional space, a backdrop is drawn consisting of n parallel lines, typically vertical and equally spaced. A point in n-dimensional space is represented as a polyline with vertices on the parallel axes; the position of the vertex on the i-th axis corresponds to the i-th coordinate of the point. > This visualization is closely related to time series visualization, except that it is applied to data where the axes do not correspond to points in time, and therefore do not have a natural order. Therefore, different axis arrangements may be of interest.

Stuart Lee (20:22:16) (in thread): > Would be interesting to see if there’s already a literature on this

Stuart Lee (20:39:26) (in thread): > something like this

Stuart Lee (21:08:55) (in thread): - File (PNG): image.png

2021-05-20

Michael Love (01:47:11) (in thread): > I think in the propensity score literature it’s called a “balance plot”

Michael Love (01:47:41) (in thread): - File (PNG): image.png

Stuart Lee (02:08:30) (in thread): > why are the lines connected?

Michael Love (02:20:24) (in thread): > no reason:slightly_smiling_face:there are more examples i just grabbed the first one

Michael Love (02:20:53) (in thread): > it does seem common - File (PNG): image.png

Michael Love (02:22:12) (in thread): - File (PNG): image.png

Eric Davis (06:35:44) (in thread): > Noah wrote a package called cobalt for assessing covariate balance https://cran.r-project.org/web/packages/cobalt/index.html - Attachment (cran.r-project.org): cobalt: Covariate Balance Tables and Plots > Generate balance tables and plots for covariates of groups preprocessed through matching, weighting or subclassification, for example, using propensity scores. Includes integration with ‘MatchIt’, ‘twang’, ‘Matching’, ‘optmatch’, ‘CBPS’, ‘ebal’, ‘WeightIt’, ‘cem’, ‘sbw’, and ‘designmatch’ for assessing balance on the output of their preprocessing functions. Users can also specify data for balance assessment not generated through the above packages. Also included are methods for assessing balance in clustered or multiply imputed data sets or data sets with longitudinal treatments.

Michael Love (07:51:26) (in thread): > could we hook into that somehow? maybe we could output a table that can be used directly into one of those tables/plots?

Michael Love (07:51:43) (in thread): > it could be a Suggests

Michael Love (11:17:25): > moving things around / breaking things: going for a single function bootRanges, which will output a bootRanges object, a GRangesList with small validity checks. Will be moving towards a plyranges enrichment analysis in the vignette to start to think about how we should choose block length

Stuart Lee (18:34:26) (in thread): > One thought is we could output a GroupedGenomicRanges from plyranges for downstream analysis; otherwise it is fairly straightforward to convert from a GRangesList

2021-05-21

Michael Love (00:35:03) (in thread): > oh i hadn’t seen GGR, what do you think are the considerations? Is there a memory difference? We could have e.g. 100k features x 100 bootstraps

Stuart Lee (01:02:32) (in thread): > There’s definitely a memory difference right now, but I guess per our grant text I could start working on that:stuck_out_tongue:. I think we could leave it as GRangesList and it should be straightforward to plugin groups later

Stuart Lee (02:07:54) (in thread): > I can have a go of adding in a wrapper to this package next week

Michael Love (02:48:15) (in thread): > awesome, can you walk through how GGR would be preferred for plyranges analysis? one consideration i’m thinking about is that, if we have R bootstraps total, say R=100, we may want to distribute a subset of those to different sub-processes, and then re-assemble later. E.g. send 20 bootRanges to 5 different processes.

2021-05-23

Michael Love (16:16:15): > I just noticed Aaron has http://bioconductor.org/packages/BiocNeighbors for fast exact/approx nearest neighbor matching - Attachment (Bioconductor): BiocNeighbors > Implements exact and approximate methods for nearest neighbor detection, in a framework that allows them to be easily switched within Bioconductor packages or workflows. Exact searches can be performed using the k-means for k-nearest neighbors algorithm or with vantage point trees. Approximate searches can be performed using the Annoy or HNSW libraries. Searching on either Euclidean or Manhattan distances is supported. Parallelization is achieved for all methods by using BiocParallel. Functions are also provided to search for all neighbors within a given distance.

2021-05-24

Tim Triche (11:20:20): > that reminds me, I think the indices can be cached (as one might wish when bootstrapping etc.)

Michael Love (14:59:58) (in thread): > you mean the random seeds?

Stuart Lee (21:37:58) (in thread): > is neighbour finding done for matching or on the bootstraps too, and is it done on covariates or range coordinates

Stuart Lee (23:46:39) (in thread): > I figured if you want to do summaries, like estimating a mean but over say chromosomes within groups plyranges grouping would be helpful

2021-05-25

Michael Love (02:40:44) (in thread): > so far only neighbor matching is for covariates (not for boot or by range)

Stuart Lee (03:09:55) (in thread): > ok, good to know. it looks like the current implementation is quite tied to data.table but I reckon it would be feasible. Would finding the nearest ranges for bootstraps be of interest?

Michael Love (04:14:57) (in thread): > i dont think we need that, but haven’t thought deeply about it — we can brainstorm tomorrow during “nullranges south”?

Eric Davis (10:53:17): > Does anyone think it would be useful to have a function that would pull a random set (of a given length) of genomic ranges/interactions from a specified genome? Or maybe a function like this already exists?

Doug Phanstiel (10:56:47): > Yeah, I do think that could make sense to include. Not sure if it is necessary or not. You could do that with sample but having it all integrated into your workflow with plotting functions etc could be useful.

Michael Love (11:05:42): > that’s interesting. so are these just random locations? or with respect to allowed regions / segmentation?

Doug Phanstiel (11:07:50): > I thought you meant just pull random indices from your pool

Doug Phanstiel (11:08:58): > pulling random ranges from the genome is almost always a bad idea

Eric Davis (11:12:09): > I was thinking something like this, with arguments for specifying chromosomes and denyRegions: > > randomGRanges(length = 100, widths = 100, genome = 'hg19', chroms = 'all', denyRegions = 'default') > > Mostly just for grabbing some regions to test functions

Michael Love (11:14:45): > that seems like it would be generally useful for testing (maybe we can point out in the man pages it’s not going to be good for inference)

Michael Love (11:15:01): > we use deny in bootRanges to specify a GRanges object of deny regions

Kasper D. Hansen (11:17:01): > I’m clearly confused, but I would have thought this is a special case of what you have already. Perhaps with the difference of whether random regions could overlap or not

Doug Phanstiel (11:19:10): > yeah, it does seem like something you could hopefully get with certain parameters from bootRanges

Eric Davis (11:20:36): > probably true, although bootRanges() requires input ranges, right?

Michael Love (11:44:52): > bootRanges requires input ranges, because the output has the local clustering structure of the input

Michael Love (11:45:10): > it sounds like Eric is talking about just “randomly plop down some features, i have no input ranges”

Doug Phanstiel (12:20:00): > > randomlyPlopDownSomeFeaturesIHaveNoInputRanges <- function(length, widths , genome , chroms, denyRegions) >

Eric Davis (13:00:22) (in thread): > I couldn’t have written it better myself

Mikhail Dozmorov (13:10:33): > There’s the createRandomRegions function in regioneR, https://rdrr.io/bioc/regioneR/man/createRandomRegions.html. Doesn’t have denyRegions though. - Attachment (rdrr.io): createRandomRegions: Create Random Regions in regioneR: Association analysis of genomic regions based on permutation tests > Creates a set of random regions with a given mean size and standard deviation.

Michael Love (14:01:39): > that + removing overlaps works
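In base R the "random regions, then remove overlaps" idea amounts to something like the following sketch (a plain data.frame instead of a GRanges, fixed widths, one chromosome; all names hypothetical — a real implementation would use regioneR::createRandomRegions() and GenomicRanges):

```r
# Sketch: draw random fixed-width regions on one chromosome, then keep a
# greedy non-overlapping subset. Editor's sketch for testing-style use only,
# not for inference.
random_regions <- function(n, width, chrom_len, seed = 42) {
  set.seed(seed)
  start <- sort(sample.int(chrom_len - width, n))
  df <- data.frame(start = start, end = start + width - 1)
  # keep a region only if it starts after the previously kept region ends
  keep <- logical(n)
  last_end <- -1L
  for (i in seq_len(n)) {
    if (df$start[i] > last_end) {
      keep[i] <- TRUE
      last_end <- df$end[i]
    }
  }
  df[keep, ]
}

regs <- random_regions(n = 50, width = 100, chrom_len = 1e5)
all(diff(regs$start) >= 100)  # fixed widths, so this implies disjointness
```

Dropping regions that fall in deny regions would be one more filtering pass of the same shape.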

Tim Triche (18:33:00) (in thread): > I meant the indices themselves, although for Annoy that implies keeping the random seed too

2021-05-26

Michael Love (01:14:02) (in thread): > oh I see, the order of the resample from the original data, you mean to include that as a metadata column?

Michael Love (02:27:47) (in thread): > Stuart points out what we have already is probably fastest

Michael Love (08:03:58): > some nullranges changes: i looked into some issues with rejection sampling with Eric, added some stopifnot there, changed the arguments in bootRanges to camel case, as this is Bioc style. E.g. L_b is now blockLength for the user-facing functions. I reviewed @Wancen Mu’s trimming code and it looks good to me. i think we should insist that deny ranges are unstranded before being passed to bootRanges, unless someone can think of why not. I added a test file just to see how it looks: https://github.com/nullranges/nullranges/blob/main/tests/testthat/test_trim_deny.R

Eric Davis (13:06:49): > General R S4 question: does anyone know how to use setGeneric and setMethod in a way that will show named arguments that are not part of the signature definition? For example, if I set the signature as myFun(x, y, …) but I have arguments myFun(x, y, z), is there a way to show that z belongs to myFun. This could be an RStudio problem I suppose…

Eric Davis (14:44:00) (in thread): > I think I found a way around this, but it involves making a lot of class unions > > setClassUnion("character_OR_missing", c("character", "missing")) > > and defining the signature with these > > setMethod("myFUN", signature(x = 'integer', y = 'integer', z = 'character_OR_missing'), myFun) >

Eric Davis (14:45:24) (in thread): > We could use “character_OR_NULL” but the functions will throw an error if optional arguments (like z) are left out.

Eric Davis (14:45:32) (in thread): > Does anyone have strong opinions about this?

Eric Davis (15:19:36) (in thread): > Defaults can be set for missing arguments in the generic: > > setGeneric('myFun', function(x = 1, y = 2, z = 'a') standardGeneric('myFun')) >

Kasper D. Hansen (15:31:02) (in thread): > I don’t think the question is very clear. What do you mean by “show”? And what is an example of code that doesn’t achieve what you want? I’ll note that your solution actually names z in the call to setGeneric(), so it doesn’t satisfy your original requirement

Eric Davis (16:53:49): > Here are 4 methods (Examples 0-3) for doing method dispatch on a generic function with S4. I was trying to figure out how to set default arguments and have them appear in RStudio’s autocomplete while still allowing the function to be missing some arguments that have defaults. It looks like Example 3 does everything I was looking for, but it requires creating a bunch of class unions. > > Example 0, defining z = ‘a’ as default with internal function but leaving z out of generic definition > > ## Example 0 ------------------------------------------------ > > ## Define example function > my_Fun <- function(x, y, z = 'a') { > print(paste0('x = ', x,', y = ', y, ', z = ', z)) > } > > ## Set generic > #' @export > setGeneric('myFun', function(x, y, ...) standardGeneric('myFun')) > > ## Set methods > #' @export > setMethod('myFun', > signature = signature(x = 'integer', y = 'integer'), > definition = my_Fun) > > Testing example 0 (works but doesn’t show z in RStudio autocomplete) > > > myFun(x = 1L, y = 2L) > [1] "x = 1, y = 2, z = a" > > myFun(x = 1L, y = 2L, z = 'a') > [1] "x = 1, y = 2, z = a" > > Example 1, defining z = ‘a’ as default with internal function doesn’t correctly set default when z is in generic: > > ## Example 1 ------------------------------------------------- > > ## Define example function > my_Fun <- function(x, y, z = 'a') { > print(paste0('x = ', x,', y = ', y, ', z = ', z)) > } > > ## Set generic > #' @export > setGeneric('myFun', function(x, y, z) standardGeneric('myFun')) > > ## Set methods > #' @export > setMethod('myFun', > signature = signature(x = 'integer', > y = 'integer', > z = 'missing'), > definition = my_Fun) > > #' @export > setMethod('myFun', > signature = signature(x = 'integer', > y = 'integer', > z = 'character'), > definition = my_Fun) > > Testing example 1 (doesn’t work when z is undefined, but shows z in RStudio autocomplete) > > > myFun(x = 1L, y = 2L) > Error in paste0("x = ", x, ", y = ", y, ", z = ", z) : > argument "z" is missing, with no default > > myFun(x = 1L, y = 2L, z = 'a') > [1] "x = 1, y = 2, z = a" > > Example 2, defining z = ‘a’ as default with setGeneric works > > ## Example 2 ------------------------------------------------- > > ## Define example function > my_Fun <- function(x, y, z) { > print(paste0('x = ', x,', y = ', y, ', z = ', z)) > } > > ## Set generic > #' @export > setGeneric('myFun', function(x, y, z = 'a') standardGeneric('myFun')) > > ## Set methods > #' @export > setMethod('myFun', > signature = signature(x = 'integer', > y = 'integer', > z = 'missing'), > definition = my_Fun) > > #' @export > setMethod('myFun', > signature = signature(x = 'integer', > y = 'integer', > z = 'character'), > definition = my_Fun) > > Testing example 2 (works and shows defaults with RStudio autocomplete) > > > myFun(x = 1L, y = 2L) > [1] "x = 1, y = 2, z = a" > > myFun(x = 1L, y = 2L, z = 'a') > [1] "x = 1, y = 2, z = a" > > Example 3, creating class unions decreases the number of setMethod calls > > ## Set generic > #' @export > setGeneric('myFun', function(x, y, z = 'a') standardGeneric('myFun')) > > ## Define example function > my_Fun <- function(x, y, z) { > print(paste0('x = ', x,', y = ', y, ', z = ', z)) > } > > ## Set class unions > #' @export > setClassUnion('character_OR_missing', > c('character', 'missing')) > > ## Set methods > #' @export > setMethod('myFun', > signature = signature(x = 'integer', > y = 'integer', > z = 'character_OR_missing'), > definition = my_Fun) > > Testing example 3 (works and shows defaults with RStudio autocomplete) > > > myFun(x = 1L, y = 2L) > [1] "x = 1, y = 2, z = a" > > myFun(x = 1L, y = 2L, z = 'a') > [1] "x = 1, y = 2, z = a" > - File (PNG): image.png - File (PNG): image.png - File (PNG): image.png - File (PNG): image.png

Eric Davis (16:53:51) (in thread): > You are right, it's a hard problem to describe - I've moved the discussion to a new thread with some example code.

Eric Davis (16:54:46) (in thread): > To test this you have to copy the code and build an R package, unfortunately.

Wancen Mu (21:56:11) (in thread): > Good to know, thank you!

Stuart Lee (22:35:19) (in thread): > I’m not sure if this is helpful, but take a look at IRanges::findOverlaps and GenomicRanges::findOverlaps. For GRanges it adds the ignore.strand argument, but that doesn’t come up because it is captured in the ... > > getMethod("findOverlaps", signature = c("GenomicRanges", "GenomicRanges")) > > getMethod("findOverlaps", signature = c("IntegerRanges", "IntegerRanges")) >

Stuart Lee (22:36:19) (in thread): > I think the problem in your case is that you don’t need to dispatch on z, but then you also won’t get tab completion (unless someone looks at the docs)

Stuart Lee (22:36:44) (in thread): > Instead you only need to define methods on x,y pairs and then for certain pairs the method could gain an additional arg
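Stuart's suggestion can be sketched with a small self-contained example (hypothetical generic and method names, not nullranges code): dispatch only on the (x, y) pair, and let the specific method pick up an extra defaulted argument through `...`.

```r
library(methods)

## dispatch only on (x, y); z is absorbed by ... in the generic
setGeneric("myFun", function(x, y, ...) standardGeneric("myFun"))

## the integer,integer method gains z with a default of "a"
setMethod("myFun", signature(x = "integer", y = "integer"),
          function(x, y, z = "a", ...) {
            paste0("x = ", x, ", y = ", y, ", z = ", z)
          })

myFun(1L, 2L)           # "x = 1, y = 2, z = a" (default used)
myFun(1L, 2L, z = "b")  # "x = 1, y = 2, z = b" (override)
```

This needs no class unions or `missing` methods, at the cost of z not appearing in `args(myFun)`.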

2021-05-27

Kasper D. Hansen (03:28:37): > Your main issue seems to have to do with how one specific IDE (RStudio) chooses to parse and display autocomplete. I am not sure you should consider that authoritative in any way.

Kasper D. Hansen (03:30:27): > Traditionally, the main RStudio developers have not used S4 that much, so I would not be surprised if the tools (and documentation) are semi-lacking here.

Kasper D. Hansen (03:31:51): > Having said that, the standard way of doing this (Example 0) means - if you think about it - that you can only know about “z” once the specific method has been selected, i.e. you need to know what the input object is so you can figure out dispatching. (This applies if you want 100% true dispatch. For RStudio, the IDE *could* identify the signatures of all the specific methods and do something with those.)

Stuart Lee (03:34:50): > Out of curiosity do any other IDEs display this “correctly” as per Eric’s comments above? Like ESS or nvim-R or VScode?

Kasper D. Hansen (03:35:05): > I don’t think so, but that’s an IDE issue

Stuart Lee (03:36:18): > Yep, it’s not a problem with S4 method dispatch; in the example I gave in the thread above with findOverlaps, the IDE would have to infer that your inputs are GRanges in order to tab-complete the ignore.strand argument

Kasper D. Hansen (03:36:26): > I am surprised that Example 1 doesn’t throw an error

Kasper D. Hansen (03:36:53): > Yes, and of course the IDE has a problem with identifying the class of the argument

Kasper D. Hansen (03:37:18): > That requires a substantial level of analysis and might only be achievable if you run the code

Michael Love (03:39:55): > Bouncing across packages to find what arguments are possible and where they are documented has always been a downside of S4

Kasper D. Hansen (03:40:22): > In general, having ... in the generic is really irritating. I can understand why it’s done, but most of the time it has these really irritating consequences for looking at code. Like not being able to do args(func) (which is really what @Eric Davis is complaining about)

Kasper D. Hansen (03:41:33): > yes, and method-specific help pages etc. But using ... takes it to the next level. Especially in GenomicRanges, where the generic sometimes has two (named) arguments but the method you use 99% of the time has like 20.

Kasper D. Hansen (03:43:03): > The reason why they do this - in my understanding - is when you have a very abstract generic and the arguments might not make sense all the time. To take a hypothetical example, let’s say we make plot a generic. There are tons of plot-type-specific arguments which only make sense in the context of a specific type of plot.

Stuart Lee (03:48:32): > Yeah, I’m curious to see how they handle this in the new OOP paradigm that the R Consortium is working on. It would be useful to know what arguments are allowed in ...

Kasper D. Hansen (04:58:43): > My guess would be essentially nothing. I can see you want this to be unspecified, like the plot example I have above. However, what S4 desperately needs is a much better way to explore the specific methods for a generic, in terms of (a) discoverability, (b) finding relevant help pages, and (c) finding arguments (this conversation).

Kasper D. Hansen (04:59:42): > The current system works conceptually, like we can get help and arguments for a specific method for a generic if we know the signature, but in practice it doesn’t work.
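Kasper's "in practice it doesn't work" point can be seen with a small self-contained example (hypothetical generic, not from any package): the method-specific argument works at the call site, but args() on the generic doesn't surface it, and because the method's formals differ from the generic's, setMethod wraps the real body in a .local() closure.

```r
library(methods)

setGeneric("score2", function(x, ...) standardGeneric("score2"))
setMethod("score2", "numeric",
          function(x, na.rm = TRUE, ...) mean(x, na.rm = na.rm))

## the extra argument works when called...
score2(c(1, 2, NA))            # 1.5, thanks to na.rm = TRUE
## ...but the generic's formals don't show it
names(formals(score2))         # "x" "..."
## and even the selected method's printed definition hides it
## inside a .local() wrapper:
getMethod("score2", "numeric")
```

So even knowing the signature, recovering a method's full argument list requires reading the method's body, which is the discoverability gap being discussed.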

Michael Love (05:29:11): > BTW in case anyone has the old meetings in their schedule, nullranges north meeting is 6/1 and every two weeks (no meeting today)

Tim Triche (09:33:31) (in thread): > I just meant that annoy & friends can cache the index if it’s needed for another round of resampling or similar. need not recompute all neighbors just to rerun

Eric Davis (12:23:01): > Thanks all for the discussion! Unless anyone has strong objections, I’ll implement the functions in the style of “Example 3”. It seems to be the best available solution - providing multiple dispatch while still listing arguments with defaults. It also seems easy enough to change later

Kasper D. Hansen (14:19:19): > The fact that you have a default for z in the call to setGeneric is kind of weird.

Eric Davis (14:22:09) (in thread): > Is there a way to set default values like that without doing it in setGeneric?

Kasper D. Hansen (14:26:21) (in thread): > I’m surprised you’re allowed to do this. I don’t recall seeing it, but perhaps I’m just not remembering this correctly

2021-05-28

Michael Love (01:11:33) (in thread): > oh this is specific to a nearest neighbors analysis. got it

Eric Davis (10:21:16): > Update: I’m working through a list of changes to improve the matchRanges side of things (filling out documentation, making functions behave consistently, etc.). I’ve made a function makeExampleMatchedDataSet for use in example documentation (https://nullranges.github.io/nullranges/reference/makeExampleMatchedDataSet.html), as with plotCovariate (https://nullranges.github.io/nullranges/reference/plotCovariate.html). The make-example function is what the S4 discussion was about - feel free to check out how it’s implemented and make some suggestions if you think there is a better way. > > Other changes that are coming soon include: > 1. Fully documented functions with usage examples > 2. Replacing the “Intro to matchRanges” vignette with an example from makeExampleMatchedDataSet > 3. A cohesive “Using matchRanges with GRanges” vignette > 4. A cohesive “Using matchRanges with GInteractions” vignette. > I hope to have these updates ready for next week’s meeting. - Attachment (nullranges.github.io): Function for generating an example matchRanges or Matched dataset — makeExampleMatchedDataSet > This function will generate an example dataset as either 1) input for matchRanges() (when matched = TRUE) or 2) a Matched Object (when matched = FALSE). - Attachment (nullranges.github.io): Covariate plotting for Matched objects — plotCovariate > This function plots the distributions of a covariate from each matched set of a Matched object.

Doug Phanstiel (10:34:35): > This looks great. Minor comment on colors and line types. For plotCovariate with type == ‘lines’ it can often be hard to distinguish the matched/focal and pool/unmatched lines. Perhaps make matched and unmatched dashed lines (or bring them to the front) so that you can more easily see when they overlap.

Doug Phanstiel (10:37:31): > And the default color palette when type == ‘bars’ is so saturated it hurts my eyes. Maybe we can pick a different one

Eric Davis (10:38:54) (in thread): > haha, I think Stuart suggested including an argument to change these colors by specifying an RColorBrewer palette (like “Dark2” or something)

Eric Davis (10:39:58) (in thread): > I like having consistent colors for the line/ridges/jitter plots because the 4 categories are fixed. But for the bar plots it would make sense to specify a palette since there can be an arbitrary number of categories

Eric Davis (10:40:39) (in thread): > Alternatively we could just pick a default we like and let the user make their own custom plots by pulling out the data themselves withmatchedData()

Doug Phanstiel (11:12:43) (in thread): > yep, i think allowing the user to specify a different palette is good. I would just change the default one to something more muted

Eric Davis (13:50:02): > How much documentation is typical for internal (non-exported) functions?

Stuart Lee (22:38:02) (in thread): > why not use a sensible ggplot2 palette to start and not hard-code colors? since the output is a ggplot2 object, the user could modify the colors with their own scale_color_* function

Stuart Lee (22:39:12) (in thread): > really up to you I would say. some people document like ordinary functions and just add an @noRd tag at the bottom

Eric Davis (22:49:17) (in thread): > My thoughts for using the 4 fixed colors for focal, pool, matched, and unmatched is to have some consistency, even when viewing 2 or 3 sets at a time (instead of all 4). That way dark blue always corresponds to the focal set and so on. The colors are actually from the “Paired” color palette, but swapped around so that the darker colors correspond to focal and pool.

Eric Davis (22:50:13) (in thread): > I will try to err on the side of more docs:slightly_smiling_face:

Stuart Lee (22:58:57) (in thread): > cool - that makes sense to me. I agree with Doug about the saturation too. I don’t think it’s necessary to add a palette arg, since a ggplot2 object is returned and the user could customize it themselves if they wish.

Eric Davis (23:06:01) (in thread): > good point! I’ll just lighten up the default plot a bit with scale_fill_hue(l = 70, c = 50) - unless you know of a better method?

Stuart Lee (23:11:58) (in thread): > I’ve used this before to play around with changing a palette: https://projects.susielu.com/viz-palette

2021-05-29

Michael Love (01:01:03) (in thread): > all sounds great

Michael Love (01:03:21) (in thread): > yes! this is an example of what i did for a non-exported function: https://github.com/nullranges/nullranges/blob/main/R/unseg_bootstrap.R#L44-L49

Michael Love (01:03:25) (in thread): > pretty minimal

2021-05-31

Eric Davis (00:34:47): > I updated the reference section of the _pkgdown.yaml file but don’t see the changes on the site after GitHub actions ran. However, I do see changes to other documentation. Is there something special I need to do to get the reference section to update?

Stuart Lee (01:46:44): > I think the spacing is off?

Michael Love (02:12:06): > and BTW, it’s easier to do this kind of pkgdown debugging locally — you can do e.g.pkgdown::build_site()and skip the extra time it takes to spin up the GHA machines

Mikhail Dozmorov (08:36:07): > AnnotationHub package updates: > * CTCF data is on AHub, instructions updated: https://github.com/mdozmorov/CTCF. The package is under review. > * denyranges now has GRanges for centromeres/telomeres and other gaps. Those regions are also to be avoided. The “gap” UCSC tables are available on AHub, but without metadata. So I added more granular data; each gap type has its own GRanges object. https://github.com/mdozmorov/denyranges. A bit hesitant whether to keep this or comment it out - thoughts?

Mikhail Dozmorov (08:37:44): > In parallel, I’ve been thinking about allowranges - regions that are likely biologically active. One ENCODE3 paper (https://doi.org/10.1038/s41586-020-2559-3) provides a systematic collection of DNase hypersensitive sites. So https://github.com/mdozmorov/allowranges has DNase sites as GRanges for different tissues/cells. The primary idea is to use it for Hi-C data. Hi-C bins are large, and intersecting them with DNase sites will give a better idea of where activity might happen. Also planning to add other such data; I collected many references in the Excel file. Would it be useful? - File (Excel Spreadsheet): GenomeRunner.xlsx

Eric Davis (08:53:39) (in thread): > It’s strange because it did work locally for me

Michael Love (09:27:13) (in thread): > comment what out?

Michael Love (09:27:23) (in thread): > this is awesome, will be incredibly useful

Michael Love (09:28:06) (in thread): > I think this would be incredibly useful

Mikhail Dozmorov (09:28:59) (in thread): > Whether to include gaps. They are in principle available on AHub, but not very useful. Hence the thought to include them here

Michael Love (09:44:49) (in thread): > include here as well i think, bc more metadata is useful

Eric Davis (15:27:36): > This is probably not worth the time to construct, but we could re-imagine the matchRanges workflow to follow a more plyranges-style syntax. For example, these commands could be used to do matchRanges: > > gr %>% > focalFeature(feature1 == TRUE) %>% > covarFeatures(feature2, feature3) %>% > matchingMethod(method = 'stratified', replace = FALSE) %>% > matchRanges() > > I imagine something like this could work well for bootRanges as well.

Mikhail Dozmorov (19:56:15): > rejection sampling uses ks::kde. After macOS and R updates, installing the ks package generates the error: X11 library is missing. X11 is required for the plot3D dependency package. Installing it on a Mac without admin access is no small feat. Is it possible to use another kernel density estimation function?

2021-06-01

Michael Love (01:54:32): > oh bummer, ks::kde was the one i found that allowed predict() - i’ll look around for a replacement

Michael Love (01:58:57): > this is an option: https://cran.r-project.org/web/packages/kdensity/index.html - Attachment (cran.r-project.org): kdensity: Kernel Density Estimation with Parametric Starts and Asymmetric Kernels > Handles univariate non-parametric density estimation with parametric starts and asymmetric kernels in a simple and flexible way. Kernel density estimation with parametric starts involves fitting a parametric density to the data before making a correction with kernel density estimation, see Hjort & Glad (1995) (doi:10.1214/aos/1176324627). Asymmetric kernels make kernel density estimation more efficient on bounded intervals such as (0, 1) and the positive half-line. Supported asymmetric kernels are the gamma kernel of Chen (2000) (doi:10.1023/A:1004165218295), the beta kernel of Chen (1999) (doi:10.1016/S0167-9473(99)00010-9), and the copula kernel of Jones & Henderson (2007) (doi:10.1093/biomet/asm068). User-supplied kernels, parametric starts, and bandwidths are supported.

Stuart Lee (02:04:56): > There might be options in the multivariate task view on CRAN too: https://cran.r-project.org/web/views/Multivariate.html

Stuart Lee (02:21:39) (in thread): > and this oldish paper: https://vita.had.co.nz/papers/density-estimation.pdf

Michael Love (02:36:18) (in thread): - File (PNG): Screen Shot 2021-06-01 at 8.36.11 AM.png

Michael Love (02:36:32) (in thread): > we could try ash also

Michael Love (02:37:19) (in thread): > or kernsmooth

Michael Love (02:39:09) (in thread): > we don’t need the predict() method technically, we just need to provide the range of x as here: https://github.com/nullranges/nullranges/blob/main/R/methods-matchRanges.R#L97

Michael Love (02:40:20) (in thread): > so we could use KernSmooth::bkde(x, gridsize = 1001L, range.x = quantile(pps, c(.001, .999)))

Michael Love (02:41:49) (in thread): > or ash::bin1(x, ab = quantile(pps, c(.001, .999)), nbin = 1001L) followed by ash::ash1()
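The grid-plus-interpolation idea being discussed can be sketched in base R alone (stats::density standing in for ks::kde; the fps/pps names for focal/pool propensity scores follow the snippets in this thread, and the data are simulated stand-ins, not nullranges code):

```r
## Sketch: estimate each density on a fixed grid with base
## stats::density(), then "predict" at new points by linear
## interpolation with approx(). Illustrative only.
set.seed(1)
fps <- rnorm(1000, mean = 0.6, sd = 0.1)  # simulated focal scores
pps <- rnorm(5000, mean = 0.5, sd = 0.2)  # simulated pool scores

bounds <- quantile(pps, c(0.001, 0.999))
df <- density(fps, n = 1024, from = bounds[1], to = bounds[2])
dp <- density(pps, n = 1024, from = bounds[1], to = bounds[2])

## evaluate focal and pool densities at each pool score
## (rule = 2 extends the boundary value beyond the grid)
pred_f <- approx(df$x, df$y, xout = pps, rule = 2)$y
pred_p <- approx(dp$x, dp$y, xout = pps, rule = 2)$y

## rejection-sampling acceptance probabilities, capped at 1
scale  <- max(df$y / pmax(dp$y, 1e-6))
accept <- pmin(pred_f / (scale * pmax(pred_p, 1e-6)), 1)
```

Whether the interpolated "predict" matches ks::kde's quality is exactly the open question in this thread; the sketch only shows that no X11-dependent package is needed for the mechanics.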

Michael Love (04:07:01) (in thread): > great point, the flow is fundamental. this is actually the whole point behind starting the package, is so that we could have an analysis that would fit into plyranges workflow smoothly

Michael Love (04:18:07) (in thread): > but there is something to be said for bundling up the details of the null/control feature specification into a single function call with arguments, because there are still many other “verbs” that need to happen > > e.g. in boot (and perhaps also in match) we iterate over multiple rounds of sampling and perhaps across a grid of parameters to determine e.g. what block length, so we need to do something like: > > boots <- y %>% bootRanges(…) > boots %>% diagnose() # QC on bootstraps alone, e.g. inter-range distances, distribution per segment, per chrom, across boot parameters > # plyranges overlap calls btwn x and y, x and boots, potentially with parallelization > res %>% diagnose() # look at null distributions across boot parameters > res %>% summarize() # calculate bootstrap p-value > > so i’m thinking that bundling the specification into a single bootRanges call will make the flow more readable

Michael Love (06:58:01): > Ok, we have a “north” meeting coming up in 1 hour; I’m going to take a shot at replacing ks. @Wancen Mu it looks like the segmented vignette is not building correctly: https://github.com/nullranges/nullranges/runs/2715752495?check_suite_focus=true#step:22:141

Michael Love (07:39:30) (in thread): > I started to work on using KernSmooth instead but it would def be a regression. The matching was visibly worse on my simple test case (test_reject.R)

Michael Love (07:40:13) (in thread): > my code (which i tossed bc it wasn’t working as well as current): > > tail_p <- .001 # how much of the tail on either side of PS > bounds <- quantile(pps, c(tail_p, 1 - tail_p)) > gridsize <- 1e5 # 100,000 grid points, KernSmooth is fast > out <- KernSmooth::bkde(fps, gridsize = gridsize, range.x=bounds) > grid <- out$x > fgrid <- out$y > pgrid <- KernSmooth::bkde(pps, gridsize = gridsize, range.x=bounds)$y > ## Set scale by finding the highest point of density ratios (focal/pool) > ## This ensures that pool covers focal at all points > if (any(fgrid < 0 | pgrid < 0)) { > stop("kernel density estimates are negative, cannot perform rejection sampling") > } > scale <- max(fgrid/pgrid) > if (scale > 1e3) { > stop("scaling factor for density of the PS for pool is > 1e3, could lead to instability") > } > ## Calculate the probability of accepting each pool > thresh_lo <- function(x) ifelse(x > 1e-3, x, 0) > thresh_hi <- function(x) ifelse(x < 1e-6, 1e-6, x) > # linear interpolation of the density > full_grid <- c(min(pps), grid, max(pps)) > pred_df <- approx(full_grid, c(0,fgrid,0), xout=pps)$y > pred_dp <- approx(full_grid, c(0,pgrid,0), xout=pps)$y > accept_prob <- pmin( thresh_lo(pred_df)/(scale * thresh_hi(pred_dp)), 1) >

Michael Love (07:41:06) (in thread): > my guesses as to why it was worse: the KDE was maybe less high quality? the linear approximation of the density evaluated at the pool PS was bad?

Michael Love (07:42:06) (in thread): > before we go down this path, why does the XQuartz installation not work? https://www.xquartz.org/

Michael Love (07:42:19) (in thread): > is this because you can’t install any software on your machine w/o admin?

Mikhail Dozmorov (07:49:33) (in thread): > Yes, I have to request elevated privileges. Following the discussions, looks like there are alternatives.

Michael Love (07:57:50) (in thread): > yeah, i tried one but it was def worse than ks, didn’t work well for a trivial matching problem, I can keep trying though

Michael Love (07:58:18) (in thread): > the predict() is very useful, and not many of the other packages provide this. hacking a predict() with linear interpolation may be the cause of the bad performance

Wancen Mu (08:00:30) (in thread): > Oh, working on that

Wancen Mu (08:32:20) (in thread): > It’s good now:ok_hand:

Michael Love (08:55:36) (in thread): > thanks!

Michael Love (08:55:51) (in thread): > i’ll take a shot at this other plot type after our lab meeting

Wancen Mu (08:57:25) (in thread): > Thanks, if it’s the state percentage plot on each chr, I can try it too.

Michael Love (11:00:22) (in thread): > yeah, ok you can go for it

Michael Love (11:01:02) (in thread): > exactly — that’s what i was thinking

Wancen Mu (13:02:13) (in thread): > Is this plot ok? I will change the ranges plot to have the same color palette as this one. I was trying to keep the same palette as Eric, but I found he manually set the colors, while here the state number is not fixed. - File (PNG): bar plot.png

Michael Love (13:56:41) (in thread): > this is perfect:ok_hand:

Michael Love (13:57:00) (in thread): > so this can be a different plot, type=c("genome","bars") maybe, as an argument within plotSegments?

Wancen Mu (14:13:30) (in thread): > Now I am setting type = c("ranges","barplot","boxplot"), where the default is to plot all three figures, while users can specify any one or two figures.

2021-06-02

Michael Love (02:50:10) (in thread): > since these are ggplot2 figures, we could take the approach that Eric and Stuart have worked on in plotCovariate, which is to enable creation of a list of the ggplot2 objects so they can be either plotted individually or together with patchwork

Michael Love (02:50:33) (in thread): > i forget if we are currently putting them all to the plotting device at once or returning the list?

Wancen Mu (08:56:23) (in thread): > Oh! Right now it directly puts them all to the device like the fig below, using the code if ("barplot" %in% type) { plot } - File (PNG): image.png

Michael Love (08:59:49) (in thread): > ok i’ll take a look thanks

Wancen Mu (09:01:11) (in thread): > I saw plotCovariate’s reference now puts one figure to the device at a time. Are they currently trying to make a list in the function and give an argument, individually or patchwork?

Michael Love (09:05:08) (in thread): > i think we should have plotSegments return a single ggplot2 object

Michael Love (09:06:33) (in thread): > and then if a user wants all three they can do: > > plots <- lapply(c("ranges","barplot","boxplot"), function(t) plotSegments(seg, type=t)) > patchwork::wrap_plots(plots) > > it’s just a lot cleaner this way?

Wancen Mu (09:14:26) (in thread): > Yep, easier to understand. Previously I thought generating one plot each time was repetitive in calculating countOverlaps for “ranges” and “boxplot”.

Michael Love (09:32:56) (in thread): > we should store the count in the mcols… i can sketch that out

Wancen Mu (09:36:38) (in thread): > Oh, right!!

Wancen Mu (09:38:31) (in thread): > But what if users want to draw their own segmentation GRanges? Then either they have mcols, or we can only draw the “barplot”

Wancen Mu (09:39:49) (in thread): > I think we can also provide the “ranges” plot if there are no mcols. Like all segments in one horizontal line. What do you think?

Vince Carey (10:11:33): > @Vince Carey has joined the channel

Michael Love (10:14:00) (in thread): > yes exactly

Michael Love (10:14:16) (in thread): > i would just switch the plot whether they have “count” in the mcols or not

Michael Love (10:14:35) (in thread): > and i think we should have sqrt(count) on the y-axis to be explicit about what is plotted

Wancen Mu (10:18:45) (in thread): > Gotcha. Because I thought users would still wonder what count is and why we use it. So I mentioned in the docs: The y axis \code{"density"} represents the square root of overlap counts within the segment length.

Wancen Mu (10:19:44) (in thread): > Okay, I will add mcols to the output of segmentDensity and fix the plots:ok_hand:

Aedin Culhane (11:03:06): > @Aedin Culhane has joined the channel

2021-06-03

Laurent Gatto (12:05:49): > @Laurent Gatto has joined the channel

Stephanie Hicks (12:06:26): > @Stephanie Hicks has joined the channel

2021-06-14

Michael Love (14:30:56): > Hi@Eric Davis@Wancen Mu@Doug Phanstiel@Mikhail Dozmorov@Tim Triche— I don’t have anything for agenda tomorrow, I know Eric and Doug are working on BentoBox and Wancen is working on a single cell paper. I’ve been busy with some papers on desk and study section. Should we cancel the meeting for this week?

Wancen Mu (14:36:31): > Yeah, canceling will be good with me.

Michael Love (14:44:32): > Ok looks like a yes. I hope to have more time to catch up on Eric’s pieces and thinking more about the boot workflow after this week is done

Doug Phanstiel (14:44:40): > Works for us

Michael Love (14:50:04): > Ok I tried to cancel the event just this once - hopefully that worked

2021-06-15

Eric Davis (19:02:07): > Here are some visual examples of the matchRanges() workflow. The first uses a random dataset, selecting circles matched by color and size - a general example of the matching concept. The second does the same thing, except using GRanges matched by color and length. Both examples use the actual matching functions, and the figure was built with BentoBox. Let me know if anyone has feedback or suggestions to improve these! - File (PNG): image.png - File (PNG): image.png

Mikhail Dozmorov (19:43:45): > :+1:Both seem worth using, perhaps the circle one as a gentle introduction, the GRanges one as a real case also introducing function names. Hard to think what could be improved; both look great and intuitive!

2021-06-16

Michael Love (04:07:54): > Yes, agree with Mikhail, these are both great. Circles is very easy for anyone to understand. We will have to explain propensity score in the package docs and the paper, but that’s not too hard, i think it’s fairly intuitive. > > The only caveat about the second one is that someone coming to the package may think that we are suggesting it is important to control for feature size, that is width(gr), ~e.g. that this is one of the main covariates we think you should control for. So besides that possible misinterpretation (which could be handled maybe in the text) i think it’s great~ Doug points out you *do* want to control for this

Kasper D. Hansen (04:23:57): > The font for ~ is bad in the second figure, it looks like a dash to me. I suggest removing it

Doug Phanstiel (07:12:14) (in thread): > I think a lot of times you do want to control for width. If you focal set is genes or ChIP-seq peaks, the width of these ranges can vary and will definitely effect likelihood of overlapping other features. I think width is actually a pretty reasonable example.

Michael Love (07:16:33) (in thread): > Ok fair enough!

Doug Phanstiel (07:17:49): > I do think it is debatable whether or not to show the propensity score plot. That is sort of under-the-hood info on how it works, and might be confusing to some people. I am interested in your thoughts on this

Michael Love (07:17:54) (in thread): > I guess I usually do either TSS focused analysis, or for peaks I often narrow and do predefined windows around the summit of the peak

Michael Love (07:18:23) (in thread): > But in general then you’re right, why not control for width

Doug Phanstiel (07:22:55): > We were thinking that the 2nd figure might be a good paper figure since it captures both the GRanges and covariate matching concepts. Also interested to see what you all think of that

Michael Love (07:59:36) (in thread): > I think it is not going to be understood at face value by most users, but we will definitely want to explain the concept in docs/paper. > > i think people can handle about one new concept in a paper, and then they come away happy bc they learned something. if we make PS the “one new thing” then i think it’s ok in the overview figure

Michael Love (08:00:05) (in thread): > now that you’ve convinced me width *is* something to control for, i’m on board with the second figure bc it has ranges

Michael Love (08:00:27) (in thread): > for a talk you can start with the first and then show the second

Kasper D. Hansen (08:01:51): > In contrast, I think the first figure is essential in showing what is happening, but it *fails* in showing how the matched set becomes close to the focal set and not the pool

Doug Phanstiel (08:17:28) (in thread): > Yeah, Eric has an even simpler version of the dot figure where all dots are the same size and the only matching is for color. In case we want to walk through it even more slowly

Mikhail Dozmorov (09:49:03) (in thread): > The first figure with circles may be supplementary. It shows the intuition very well. But the second is the main one showing the functionality of the package.

2021-06-22

Michael Love (06:15:14): > @Stuart Lee if you have time I have a short Q for you tomorrow at our typical meeting time; it’s thinking through how to do bootRanges + plyranges: https://gist.github.com/mikelove/56e00143d486348c59885aed0d9cc4d2 Otherwise, I’m going to find time to review Eric’s vignettes and latest

Michael Love (06:16:26): > Q for@Eric Davis, I’m doing some work on bootRanges, and when I randocument()I got: > > deleted: man/MatchedDataFrame.Rd > deleted: man/MatchedGInteractions.Rd > deleted: man/MatchedGRanges.Rd >

Michael Love (08:30:48): > i’m going to git checkout those files so they won’t be deleted, but maybe Eric can figure out why document() removes them

Michael Love (08:31:54): > also on my todo is to update the BentoBox plots given Nicole’s new code fixing colorby

Eric Davis (08:37:45) (in thread): > Do you think it’s a case-sensitive issue with git?

Eric Davis (08:39:01) (in thread): > All of my local files in the man folder begin with lowercase letters

Eric Davis (08:48:53) (in thread): > These might need to be changed to lowercase (or change the code to document them in uppercase) and then moved in a way that git will track them correctly, like this: https://stackoverflow.com/questions/17683458/how-do-i-commit-case-sensitive-only-filename-changes-in-git - Attachment (Stack Overflow): How do I commit case-sensitive only filename changes in Git? > I have changed a few files name by de-capitalize the first letter, as in Name.jpg to name.jpg. Git does not recognize this changes and I had to delete the files and upload them again. Is there a way

Michael Love (08:59:56) (in thread): > ah i see, yeah i’ve had issues before with capitalization of man pages

Michael Love (09:00:37) (in thread): > i’ll try to figure something out - so on your end there is no issue when you run document()?

Eric Davis (09:02:04) (in thread): > No issues here

Wancen Mu (09:31:39) (in thread): > Just want to say that every time, I have those man files deleted also. It would be great if Eric could help figure it out~

Michael Love (10:26:58) (in thread): > i’ll try to figure this out later and post if i find anything useful

Michael Love (13:09:04) (in thread): > @Eric Davis what do you have on your machine in terms of capitalization? - File (PNG): Screen Shot 2021-06-22 at 7.08.54 PM.png

Michael Love (13:09:48) (in thread): > i’m inclined to just move this to uppercase here: https://github.com/nullranges/nullranges/blob/main/R/AllClasses.R#L135

Michael Love (13:10:02) (in thread): > given the class is uppercase

Stuart Lee (20:56:55) (in thread): > hi Mike, I won’t be able to make the meeting today but I’ll have a look through the gist and make some comments

2021-06-23

Michael Love (01:08:03) (in thread): > ok sounds good

Michael Love (02:40:51): > ok@Eric Davis, I’ve changed the Roxygen tags so that these man pages will be capitalized now similarly to the class name

Eric Davis (14:31:50) (in thread): > sounds good! I’ll try to change the capitalization on my computer, but let me know if any commits I make bring this problem back again

2021-06-24

Mikhail Dozmorov (08:06:53): > At the end of the gh-pages vignette https://nullranges.github.io/nullranges/articles/matching_ginteractions.html, there are warning messages and NAs instead of means on n_sites. devtools::build_vignettes() builds it correctly locally. Not sure why the gh-pages rendering fails. - Attachment (nullranges.github.io): Case study II: CTCF orientation > nullranges

Eric Davis (08:19:21) (in thread): > Ah yes, I am planning to remove that section. I’ve updated the data object in nullrangesData so that n_sites is a factor instead of numeric. If you rebuild nullrangesData locally it should produce the same output as ghpages

Mikhail Dozmorov (08:26:14) (in thread): > That was indeed the case. I updated BentoBox and hictoolsr, but not nullrangesData. Yes, now the outputs agree. Why is the number of sites a factor? It’ll probably be explained; will wait.

Eric Davis (08:36:39) (in thread): > Not the most biologically-sound reason, but we saw that most bins contained a few CTCF sites with a small number of outliers. This made the covariate plot look bad, so we converted this to a factor for better visualization. It didn’t really impact the results, but the plots look nicer.

Mikhail Dozmorov (08:37:42) (in thread): > Ah, makes sense.

Tim Triche (15:07:36): > congrats @Doug Phanstiel on your Nature paper today :slightly_smiling_face:

Doug Phanstiel (15:13:47): > Thanks @Tim Triche. @Eric Davis is actually 2nd author on that one! He did a great job. But in many ways, we were really just lucky to be in the right place at the right time for that project. The story is cool though

Tim Triche (15:14:11): > it is very cool. hits both NUP98 and FET-ETS fusions

Tim Triche (15:14:58): > we have a NUP98-x paper coming out before too long, I kind of want to revisit the DNAme data and see if there are obvious H3K27ac / CTCF-independent regions that open up amongst the hypermethylated/compacted regions

Tim Triche (15:16:03): > within individual NUP fusions (you guys used NUP98-HOXA9 I saw, but there are dozens), we often see cell-of-origin-specific (or seemingly so) impacts. Wondering if the same type of tissue-of-origin impacts for phase separation seen in FUS-ERG or EWS-FEV are operating here.

Tim Triche (15:17:04): > in the compartmap paper we focused on inv(3)(q21;q26) because it was 1) “easy” and 2) on a single chrom arm, but the original motivation for compartmap was NUP fusions, all the way back on a COG call in 2015

Tim Triche (15:18:11): > pretty cool that you (UNC folks) beat St. Jude to the punch on this publication, tbh. Nothing against St. Jude, but it’s a feather in Gang Wang’s cap to have got this result out first.

2021-07-16

Michael Love (02:57:23): > @Stuart Lee with Wancen we are coming back to the question of parallelism, and the idea of a generator. two options we are considering: > 1. generate bootRanges using multiple cores –> the GRangesList goes back to the parent process –> plyranges uses multiple cores to operate on the GRL > 2. bootRanges are generated within the plyranges section of code, so that the bootstrapped ranges are generated on the fly and operated on, all within the child process (we would only send out the seed, for example, from the parent process) > #2 was the idea of specifying a “generator” as you called it. I think #2 will actually have a big speed benefit as sending big bootstrap ranges across cores wastes a lot of time. so should we even bother with the #1 implementation of parallelism, or jump straight to designing #2?

Michael Love (02:58:24): > if this isn’t clear i can also diagram it

2021-07-18

Stuart Lee (22:33:28): > That makes sense, #2 was exactly how I imagined it working

2021-07-19

Michael Love (03:21:31) (in thread): > would you have time to chat about implementation on Wednesday?

Michael Love (03:22:11) (in thread): > I think it’s the 16:00 Melbourne time slot

Michael Love (03:23:38) (in thread): > and i’m going to drop this here also, I have two broad points to ask about, one about implementation of the generator idea and then the other is, without parallelization, how to perform grouped-by-bootstrap overlaps in plyranges https://gist.github.com/mikelove/56e00143d486348c59885aed0d9cc4d2

Stuart Lee (03:47:27) (in thread): > Yep definitely have time to chat on Wednesday

Michael Love (04:07:42) (in thread): > :ok_hand:

2021-07-21

Michael Love (02:37:06): > Notes from call w/ Stuart > * Two questions: > > * Thoughts on implementing the generator idea, “lazy bootstrapping” > > * Code lives in nullranges > * bpiterate, bpvec operating on the random seeds, stored as a metadata column in the GRL > * Piping the output into a summarize call > * The specification of the bootstrap? A param object? A function that returns a function. See the memoise package on CRAN. > > * How to perform group-by-bootstrap iteration. Even if we do the generator, such that bootstrap samples are generated in child processes, it won’t be one bootstrap sample per child, but many, because R >> num cores > > * Unlist, bind_ranges, iranges_stack > We’ll give implementing this a shot. As far as future nullranges meetings, I think after BioC let’s set up a new schedule that will work for people for the Fall
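The “function that returns a function” generator idea from these notes can be sketched in base R. This is a hypothetical illustration, not nullranges code: make_boot_generator and the simple resampling stand in for the real bootRanges machinery, and the point is just that a closure over the seeds lets each worker draw its own sample lazily.

```r
# Hypothetical sketch of the lazy-bootstrap generator: a closure captures the
# data and the per-iteration seeds, so each child process draws its own sample
# instead of receiving large bootstrapped objects from the parent.
make_boot_generator <- function(x, seeds) {
  force(x)
  force(seeds)
  function(i) {
    set.seed(seeds[i])
    # stand-in for the real block bootstrap: resample with replacement
    x[sample(length(x), replace = TRUE)]
  }
}

gen <- make_boot_generator(x = 1:1000, seeds = c(101L, 202L, 303L))
b1 <- gen(1)            # each call is cheap and reproducible from its seed
identical(b1, gen(1))   # TRUE: same seed, same draw
```

With BiocParallel, something like bpiterate() or bplapply() over seq_along(seeds) could then call gen(i) inside each worker, so only the integer seeds cross the process boundary.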

2021-07-28

Michael Love (06:10:35): > @Eric Davis we want to put a matchRanges promo slide in Wancen’s short talk, to point to your poster location and time, could you make a google slide we can copy (you can do this anytime in the next week, no rush)

Michael Love (06:11:02): > we are using widescreen 16:9 page setup

2021-07-29

Eric Davis (18:33:08): > @Mikhail Dozmorov and @Michael Love Do you know the proper size/dimensions for the BioC2021 online posters?

Mikhail Dozmorov (19:30:58): > Good point, we didn’t discuss it. I’d say, the regular or widescreen PowerPoint slide will work.

Mikhail Dozmorov (19:34:53): > Instructions from last year, but they don’t specify dimensions. Fairly open-ended, “feel free to be creative”: https://bioc2020.bioconductor.org/instructions-for-presenters - Attachment (bioc2020.bioconductor.org): BioC 2020 > Where Software and Biology Connect. July 27 - 31, Boston, USA.

Eric Davis (19:38:56): > Thanks! I’ll go with the 16:9 format then

2021-07-31

Michael Love (11:52:40): > I’ve been playing with bootstrap-plyranges code. I think the creation and later binding of a GRangesList may be an unnecessary bottleneck. What if bootRanges directly returns a GRanges where the iteration is recorded as a metadata column?

Wancen Mu (12:04:02) (in thread): > I think it’s ok to return a GRanges if it’s troublesome to make plyranges work with bootRanges. Anyway, users probably won’t save a large bootRanges object, so a workflow that can directly generate test statistics is good enough?

Michael Love (18:03:33) (in thread): > Yes I think this is a good approach, bootRanges -> “stacked” GRanges -> test stat only. This is what people will want 95% of the time
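The “stacked” design being discussed can be illustrated with a base-R analogue. This is only a sketch: the real implementation uses GRanges with plyranges grouping, while here plain vectors, a data.frame, and a toy within-3-bp overlap statistic stand in for ranges and findOverlaps. The idea is that tagging every bootstrapped range with an `iter` column lets one grouped computation produce the whole null distribution, with no GRangesList to build and unlist.

```r
# Toy "stacked" bootstrap: all iterations live in one table, tagged by `iter`
boot <- data.frame(
  start = c(8, 52, 30, 48),
  iter  = rep(1:2, each = 2)
)
focal_starts <- c(10, 50)  # the observed features we test against

# per-iteration test statistic: boot starts within 3 bp of any focal start
null_dist <- tapply(boot$start, boot$iter, function(s) {
  sum(vapply(s, function(p) any(abs(p - focal_starts) <= 3), logical(1)))
})
as.integer(null_dist)  # one statistic per bootstrap iteration: 2 1
```

In the real workflow, the same shape would be a group_by(iter) followed by a summarize over overlap counts, directly on the GRanges that bootRanges returns.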

2021-08-01

Michael Love (03:41:01) (in thread): > I will investigate bottlenecks this week with some testing code. The bootRanges object is just a lightweight class on top of a standard class, that’s likely not the problem. I think there may be an issue with creation and then unlist-ing of a GRangesList. I’ll keep working on it and report back here

2021-08-02

Eric Davis (10:07:52): > Here is my poster for BioC2021. Please let me know if there are any changes I should make (names/affiliations/content). I’ve made a gold version and a blue version - which do people think looks best? - File (PDF): 2021_BioC_matchRanges_gold.pdf - File (PDF): 2021_BioC_matchRanges.pdf

Eric Davis (10:09:08) (in thread): > Is this sufficient for the promo slide in Wancen’s talk or should I make something different for that?

Doug Phanstiel (10:21:30) (in thread): > Generally i think a promo slide would be much simpler than this. But this one is so beautiful i think it serves as a great enticement. I think this is great

Doug Phanstiel (10:23:15) (in thread): > My affiliations should only be 1, 6-9

Doug Phanstiel (10:39:51) (in thread): > Maybe add a star for me and mike for equal contribution or co-corresponding.

Wancen Mu (10:39:59) (in thread): > Yellow is wonderful! > I also expect something simpler than this for the promo slide because I may only have 1 min left to mention that. But it’s ok if we stick with that and I can pick the important points to talk about!

Doug Phanstiel (10:40:56) (in thread): > Overall though, it is really really nice

Mikhail Dozmorov (10:49:16) (in thread): > I’d vote for yellow. The information content is superb. Suggesting keeping one QR code, or moving them apart; my QR reader is confused.

Mikhail Dozmorov (10:51:59) (in thread): > One conceptual suggestion - simplify the introductory paragraph. To answer the question - why would I choose nullranges over something like bedtools random. Currently, it jumps immediately into the details. A more gentle variant can be: > > When performing statistical analysis on a set of genomic regions with different properties (e.g., region length, signal strength), it is often important to compare it to null (random) sets that preserve these properties. > And then define focal/null sets, covariates, etc.

Stuart Lee (22:00:25) (in thread): > The yellow slide looks so good! I reckon you could also use it as a cheatsheet for the package.

2021-08-03

Michael Love (10:48:37) (in thread): > I like yellow/gold also. This is fantastic, yeah once this is up on F1000, we can link from the nullranges README and from matchRanges vignette to the DOI

Michael Love (10:49:05) (in thread): > i’ll have a close look today and will post if i have any detailed suggestions

Eric Davis (11:16:56) (in thread): > Thanks all! Would this be a more gentle introduction? > > When performing statistical analysis on a set of genomic regions with different features (e.g., region length, signal strength), it is often important to compare it to null sets that preserve these potential covariate features. > > > > To address this need, the nullranges package implements matchRanges(), an efficient and convenient tool for selecting a subset of covariate-matched null hypothesis ranges from a pool of background ranges. > > > > The package provides a host of functions for assessing, visualizing, and extracting matched data that integrates seamlessly into existing Bioconductor workflows.

Mikhail Dozmorov (13:05:21) (in thread): > Reads very clear.

Michael Love (15:03:01): > @Eric Davis we will link to this URL from the slides: https://bioc2021.bioconductor.org/posters/paper41/ and can link to an F1000R link once you’re ready > Remind me (I should know this but i’ve been distracted): do you show up at the Thurs 6pm poster session and the Friday 9am, or do you pick one?

Eric Davis (15:08:11) (in thread): > I think we get to choose which session we attend

Michael Love (15:08:55) (in thread): > ok which should we point people to?

Michael Love (15:08:58) (in thread): > from the short talk

Eric Davis (15:13:04) (in thread): > I listed this one “Moderna Poster Session, Thursday August 5th at 3:00pm Pacific Time” on the Airmeet booth

Eric Davis (15:13:17) (in thread): > I think that’s the Thursday at 6pm EST

Michael Love (15:33:29) (in thread): > ok great we will mention that

Michael Love (15:35:39) (in thread): - File (PNG): Screen Shot 2021-08-03 at 3.35.33 PM.png

Michael Love (15:36:12) (in thread): > it’ll just be a quick mention, @Wancen Mu doesn’t walk thru the slide

Eric Davis (15:36:37) (in thread): > perfect:+1:

Eric Davis (15:38:21) (in thread): > Here is the updated version with the correct affiliations and text changes: - File (PDF): 2021_BioC_matchRanges_gold.pdf

Michael Love (15:40:06) (in thread): > ok we can replace

Wancen Mu (16:00:52) (in thread): > Cool, I will replace it

2021-08-04

Michael Love (08:53:27): > Really great work on the poster @Eric Davis and on the slides for the short talk @Wancen Mu! It’s amazing how much you have both gotten accomplished in terms of code, tests, and docs since January. :tada: As a reminder, Wancen’s talk is coming up today at the 1:30pm Methodology short talk slot and then Eric’s poster will be available in the Thursday 6pm slot. - File (PNG): Screen Shot 2021-08-04 at 8.50.50 AM.png - File (PNG): Screen Shot 2021-08-04 at 8.52.40 AM.png

Eric Davis (13:46:55): > Awesome presentation@Wancen Mu!!:tada:

Michael Love (13:48:20): > we’ll see if some brave souls make their way to this corner of the bioc slack:laughing:

2021-08-05

Eric Davis (18:44:03): > Posting some resources suggested by poster visitors: https://github.com/leekgroup/enrichedRanges https://www.biorxiv.org/content/10.1101/2020.11.11.378554v2 http://code.databio.org/GenomicDistributions/articles/intro.html - Attachment (code.databio.org): Getting started with GenomicDistributions > GenomicDistributions

Michael Love (18:55:58): > cool, I’d seen Bedshift and GenomicDistributions — i think we can actually integrate with the latter one in a useful way

Michael Love (18:56:37): > hadn’t seen enrichedRanges yet, is there a paper tho?

Michael Love (18:57:13): > looks like it’s uniform start, so similar to shuffleBed?

Michael Love (18:57:14): > > Given a set of regions, generate genomic intervals uniformly at random from those regions. These can be strand specific or not.

Eric Davis (18:57:35): > Leonardo mentioned it was something he wrote a while ago

Michael Love (18:59:10): > hmm, maybe I can just ask @Leonardo Collado Torres himself — Leo can you sketch out what happens in enrichedRanges? Is it similar conceptually to shuffleBed (uniform start positions)? Or is it block sampling?

Leonardo Collado Torres (18:59:15): > @Leonardo Collado Torres has joined the channel

Leonardo Collado Torres (19:15:17): > err, I hadn’t looked inside the package since well, 2014 I think:stuck_out_tongue:

Leonardo Collado Torres (19:15:28): > https://github.com/leekgroup/enrichedRanges/blob/master/R/randomInterval.R#L50-L59 is indeed uniform

Leonardo Collado Torres (19:17:04): > you query across a set of the genomic coordinates, define a set of tiles that exclude some regions of the genome (gaps) across a search space https://github.com/leekgroup/enrichedRanges/blob/master/R/enrichRanges.R#L55

Leonardo Collado Torres (19:17:31): > @Andrew Jaffe wrote most of it back in the day and I put it into R package form

Andrew Jaffe (19:17:35): > @Andrew Jaffe has joined the channel

Leonardo Collado Torres (19:18:50): > Like I told @Eric Davis, I’m looking forward to a tweet with the poster and making a little joke about the RStudio cheatsheets :wink: I really liked the poster format!! ^^

Leonardo Collado Torres (19:20:19): > we did use enrichedRanges in a few papers, I think most notably in https://pubmed.ncbi.nlm.nih.gov/25501035/ - Attachment (PubMed): Developmental regulation of human cortex transcription and its clinical relevance at single base resolution - PubMed > Transcriptome analysis of human brain provides fundamental insight into development and disease, but it largely relies on existing annotation. We sequenced transcriptomes of 72 prefrontal cortex samples across six life stages and identified 50,650 differentially expression regions (DERs) associated …

Leonardo Collado Torres (19:20:28): > but we never published the R package independently

Leonardo Collado Torres (19:20:52): > and well, it seems to me that nullranges will allow us to deprecate enrichedRanges ^^

Michael Love (19:25:04): > well it doesn’t offer an all-in-one enrichment analysis

Michael Love (19:25:11): > we just do the null features

2021-08-06

Michael Love (17:07:05): > @Eric Davis I’ve begged off child duties for 3 days now but i’ll have to take care of kids during the lightning talks, but i’ll be eager to hear how it goes :slightly_smiling_face:

Wancen Mu (17:12:11): > Oh, will Eric give the lightning talk? Will be there to moral support!

Wancen Mu (18:21:57): > Great talk! Perfect timing! @Eric Davis :tada::dealwithit-parrot:

Michael Love (18:44:03): > Any questions?

Eric Davis (18:45:52): > mostly about BentoBox it seems

Michael Love (18:50:14): > There is a lot of nullranges enthusiasm on Twitter, I need to just finish up some benchmarking on how it plays with plyranges before we can submit to Bioc

Doug Phanstiel (20:44:05): > Yeah, i burned all of my non-parenting capital on the bentobox talk. Far shorter than Mike’s three days but it was during a family vacation at bed time so it was pricey

Doug Phanstiel (20:45:21): > Are the lightning talks recorded?

Leonardo Collado Torres (20:56:40): > on Airmeet I think that all talks are recorded

Leonardo Collado Torres (20:56:47): > but they’ll only be there for a week or less

Leonardo Collado Torres (20:57:01): > I’m not sure that they’ll upload them to YouTube, but we could ask Erica if you want

2021-08-13

Michael Love (08:12:06): > @Mikhail Dozmorov I was just following up on the denyranges submission, is there something i can help with? it looks like you’ve decided to rework the resources?

Mikhail Dozmorov (08:42:30): > It still raised concerns. I’m planning to redo it, just need time.

Michael Love (09:07:30): > got it, let me know if i can help

Michael Love (09:07:59): > no rush really i guess, we will hopefully submit nullranges to Bioc this cycle, but we can leave off the deny argument in the vignette

Michael Love (09:08:09): > and then add it back in once denyranges is up

Kasper D. Hansen (09:20:23): > how about exclude instead of deny

Michael Love (09:37:09): > I think we used deny bc of precedent lemme see

Mikhail Dozmorov (09:41:21): > I’ve been thinking about exclude, it is more neutral while conveying the meaning of these regions. The GRanges objects can also be renamed. On my long todo..

Michael Love (09:43:23): > Yeah I don’t see biological precedent

Michael Love (09:43:52): > In server lingo, whitelist and blacklist have become allowlist and denylist, but we don’t have to follow that

Michael Love (09:44:16): > Exclude sounds fine to me, I can also ask Anshul what he prefers as he has spent a lot of time on this project

Michael Love (09:44:38): > Single-cell still universally uses whitelist, it seems

Mikhail Dozmorov (09:46:45): > Asking Anshul would be great. He may be able to rename the file on ENCODE. I don’t think ENCODE will rename all files, but if he can rename his, that’ll help.

Kasper D. Hansen (09:50:37): > This can and should be used with ranges other than the ENCODE lists, so I am not sure whether following ENCODE nomenclature is helpful or confusing

Michael Love (11:02:36): > Exclude is fine with me. It sounds natural. Are there other examples of excluded lists btw?

Mikhail Dozmorov (11:08:18): > Gaps, centromeres and telomeres fit the “exclude regions” umbrella

Michael Love (13:52:42): > Anshul says they have changed to “exclude”

Michael Love (13:52:48): > > The DAC Exclusion List Regions (previously named “DAC Blacklisted Regions”)

Kasper D. Hansen (13:56:18): > Since he agrees with me, he is obviously right

Michael Love (14:01:27): > We’re gonna need a special package role designation for Kasper

Michael Love (14:01:40): > “opi” for opinions

Mikhail Dozmorov (14:09:14): > :slightly_smiling_face: Good news

Doug Phanstiel (14:34:59): > just curious. Why is exclude better than deny?

2021-08-19

Tim Triche (10:11:02): > because it doesn’t presuppose guilt:wink:

Tim Triche (10:11:19): > nuh_uh_list and yuh_huh_list

2021-08-20

Michael Love (07:42:24): > Congrats to @Eric Davis on winning a best innovative poster at BioC2021!

Michael Love (07:42:59): - File (JPEG): Image from iOS

Eric Davis (09:01:52) (in thread): > Thanks! I appreciate everyone’s helpful feedback!

Michael Love (10:17:55): > Now that I’m back in the states and things have settled down a bit, I’m going to try to put things in order for Bioc pkg submission. Might need to comment out the range plots in the boot ranges vignette while plotgardener is in submission, but will add those back in later. I’ll also review closely the matchRanges vignettes. I don’t think we need any meetings for the next month or so, but probably one before submission will help

Tim Triche (10:54:05): > nice job @Eric Davis

2021-08-23

Mikhail Dozmorov (10:18:50): > The excluderanges package is reborn. Please review; you already provided lots of important comments, but maybe something still needs to be fixed. If anyone is OK with the current content, I’ll submit to BioC. https://github.com/mdozmorov/excluderanges

Michael Love (11:11:53) (in thread): > awesome, thanks Mikhail. I’ll make sure to try it this week

Michael Love (15:19:28): > @Wancen Mu and @Eric Davis note there is a CZI event on Tuesday, November 2 - Thursday, November 4 between 08:00-13:00 Pacific Time (PT) / 16:00-21:00 UTC.

Michael Love (15:19:54) (in thread): > “It will be structured as a combination of presentations, software demos, lightning talks, and unstructured time. A detailed agenda will be shared at a later date. “

Michael Love (15:20:21) (in thread): > I’ll share the registration link. But this would be a good chance to share the work. You could use same figures as your talk/poster

Michael Love (15:20:53) (in thread): > If we have a chance, I’d love for you both to present. I could give a brief intro and then hand off to each in turn

Wancen Mu (15:30:45) (in thread): > Sounds good, thanks for sharing!

Michael Love (16:04:04) (in thread): > update, it’s Tue - Thu 11/2-11/4

Michael Love (16:07:32) (in thread): > @Doug Phanstiel if you’d like to attend just let me know

2021-08-31

Michael Love (10:25:05): > this looks great @Mikhail Dozmorov, as before, this will be incredibly useful to have in Bioc

Michael Love (10:26:07): > two notes: > 1. the heatmap image isn’t showing up on the github README? it has a diff plot there > 2. can this table be available elsewhere, more programmatically accessible: https://github.com/mdozmorov/excluderanges#source-data-for-the-excludable-regions

Mikhail Dozmorov (21:14:30): > Thanks so much, @Michael Love, fixed. The tables are now available as .csv files. Everything checks out, the pkgdown gh-action works, will submit to BioC soon.

Michael Love (21:14:43): > nice

2021-09-03

Michael Love (15:45:31): > I’m changing our deny arguments into exclude in bootRanges… I’ll be checking up on the package this next week for submission to Bioc (hopefully in Sept)

Michael Love (15:46:01): > this will involve switching to plotgardener functions also

Mikhail Dozmorov (15:53:59): > excluderanges has been submitted, forgot to share the link. It may take a couple of weeks https://github.com/Bioconductor/Contributions/issues/2269

2021-09-09

Michael Love (10:12:15): > @Eric Davis can I remove hictoolsr from the nullranges suggested packages? I’m currently switching us from BentoBox to plotgardener for submission to Bioc

Michael Love (10:51:24): > I think i’ve made all necessary fixes to BB -> plotgardener, and changing deny -> exclude, i’ll start tagging issues for final fixes before submission

Wancen Mu (11:01:55) (in thread): > Oh, the reference page of the nullranges website is missing plotSegment(); I will add it later today.

Eric Davis (11:16:46) (in thread): > nullranges itself doesn’t use hictoolsr, but some of the scripts in nullrangesData use it to make data objects which are used in the vignettes.

Eric Davis (11:17:24) (in thread): > Do you think it’s okay to remove it from suggests in this case?

Eric Davis (11:18:49) (in thread): > Also the three “matchedControlExample” scripts under nullranges/inst/script can be removed since we no longer make use of them.

Michael Love (11:25:05) (in thread): > i’m doing final touches now to make the build pass

Michael Love (11:25:12) (in thread): > i’ll ping here when it’s passing

Michael Love (11:25:36) (in thread): > yeah i think we can remove from nullranges suggests

Michael Love (11:26:08) (in thread): > go for it on removing anything not used anymore, i’ll ping here when the build is passing, then feel free to commit and push changes

Wancen Mu (11:27:04) (in thread): > Cool:ok_hand:

Michael Love (12:26:05) (in thread): > ok we are back to passing, make sure to pull first to get the latest

Eric Davis (13:54:46) (in thread): > Same for thenullrangesData? There are lots of objects that we decided ultimately not to use

Michael Love (19:01:18) (in thread): > Yes, if you and @Wancen Mu can remove any old datasets from there (and rename deny to exclude where you see it), that would help me get us ready to submit

Michael Love (19:01:26) (in thread): > nullrangesData will be submitted alongside

2021-09-10

Eric Davis (12:42:08): > I’ve removed all of the unused datasets/scripts from nullranges and nullrangesData so I believe everything is good to go:+1:

Wancen Mu (14:23:05) (in thread): > I haven’t done it yet because I am preparing a presentation for Yun’s deep learning journal club this afternoon.

Wancen Mu (14:25:20) (in thread): > By the way, I found out the memory footprint of bootRanges for the current GWAS data may not be that huge. Could we potentially add the GAM and diagnostics functions to the package in the future?

Michael Love (14:32:11) (in thread): > No rush, by end of next week is fine > > Yes, but let’s add GAM and diagnostics in October (e.g. just after release), just so we can get the package into this release

2021-09-12

Wancen Mu (16:58:26) (in thread): > In segment_bootranges.Rmd, we previously selected exclude2 by width(exclude) >= 500 and did the segmentation because we worried about too many segments. But exclude was still used in bootRanges to be consistent. Now bootRanges also uses exclude2; I don’t know, was it deleted by accident? @Michael Love

Wancen Mu (17:33:32): > I also updated all deny to exclude in nullrangesData and added something to the nullranges reference sections. Everything looks great!

Michael Love (20:12:46) (in thread): > I’ll take a look, can you point me to a line in code? What do you mean by bootRanges uses exclude2?

Michael Love (20:13:12): > Thanks Wancen and Eric!

Wancen Mu (20:14:37) (in thread): > https://github.com/nullranges/nullranges/blob/f1519ecddc0822c89581f0de35535adc72dfe5f6/vignettes/segmented_boot_ranges.Rmd#L100-L104

Wancen Mu (20:14:57) (in thread): > previously we named it deny2, and gave deny2 to the segmentDensity function

Wancen Mu (20:15:35) (in thread): > While here it still uses the original deny: prop = bootRanges(x, seg, blockLength, 1, exclude, proportionLength = TRUE)

Wancen Mu (20:16:52) (in thread): > By “bootRanges uses exclude2” I mean using the one with the filtered ranges.

Michael Love (21:16:53) (in thread): > oh, I see, I will explain this in the vignette text tomorrow

Michael Love (21:17:55) (in thread): > i just wanted to simplify the code, I will explain that they should use with bootRanges the ranges they don’t want to allow overlaps with, whether the large ones or all of them (going back to the original exclude ranges)

2021-09-14

Michael Love (11:22:31): > Paper of interest for @Wancen Mu and perhaps also for @Eric Davis (in terms of how to write up a methods paper that examines how choices in analysis setup will affect results) https://pubmed.ncbi.nlm.nih.gov/33259518/ - Attachment (PubMed): The impact of different negative training data on regulatory sequence predictions - PubMed > Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in …

Michael Love (11:22:57): > > By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements’ relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity.

Michael Love (11:23:32): > (added to Wancen’s sciwheel)

Wancen Mu (11:24:09) (in thread): > Thanks!!

Michael Love (21:59:27): > A heads up: I’m doing some large scale cleanup to get in shape for submission, so just make sure you git pull before doing any work or you’ll likely hit a git merge error.

2021-09-15

Michael Love (11:28:16): > ok, i’m pausing on cleanup for a bit (again, make sure to git pull before starting work on your end) > > Could @Eric Davis try to address these pieces: 1) avoiding 1:n, 2) don’t use T for TRUE etc., 3) avoid set.seed() in functions (I can discuss this if you want), 4) adding @return to all function man pages > > * Checking coding practice... > * NOTE: Avoid 1:...; use seq_len() or seq_along() > Found in files: > methods-Matched.R (line 75, column 12) > methods-Matched.R (line 81, column 20) > methods-Matched.R (line 84, column 12) > methods-matchRanges.R (line 48, column 47) > methods-matchRanges.R (line 138, column 14) > * WARNING: Avoid T/F variables; If logical, use TRUE/FALSE (found 1 times) > F in R/methods-matchRanges.R (line 233, column 13) > * WARNING: Remove set.seed usage (found 1 times) > set.seed() in R/methods-utils.R (line 51, column 3) > > and > > * Checking man page documentation... > * WARNING: Add non-empty \value sections to the following man pages: man/focal.Rd, > man/matched.Rd, man/overview-methods.Rd, man/pool.Rd, man/unmatched.Rd >
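For context on the `1:...` note, a generic base-R illustration (not code from nullranges): `1:n` silently counts down when `n` is 0, while `seq_len()` and `seq_along()` handle the empty case correctly, which is why BiocCheck flags it.

```r
n <- 0
1:n         # c(1, 0): a loop over this iterates twice on "empty" input, a classic bug
seq_len(n)  # integer(0): zero iterations, as intended

x <- character(0)
1:length(x)   # c(1, 0) again
seq_along(x)  # integer(0)
```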

Eric Davis (11:52:20) (in thread): > Oops! I thought I had corrected these - will do!

Michael Love (12:56:11) (in thread): > great, and no rush, I’ll probably loop back to the package over the weekend to cleanup any man pages / vignette missing text

Michael Love (12:56:14) (in thread): > thanks!

Eric Davis (15:44:40) (in thread): > I kept that set.seed() in the makeExampleMatchedSet() function so it would produce the same data every time. Should I instead move the set.seed() to the vignettes? Although I like that it only produces the same set…

Michael Love (19:47:53) (in thread): > yes, it should be in the vignette and not in the example creation function

Michael Love (19:48:31) (in thread): > https://github.com/mikelove/fishpond/blob/master/R/swish.R#L152-L153
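The pattern being suggested here, sketched with a hypothetical example function (make_example_data is illustrative, not the actual makeExampleMatchedSet()): drop set.seed() from the function body and let the caller, e.g. the vignette, set it.

```r
# Hypothetical example-data generator: no set.seed() inside, so repeated calls
# give fresh data, and reproducibility is the caller's choice.
make_example_data <- function(n = 5) {
  data.frame(id = seq_len(n), score = rnorm(n))
}

# In the vignette, the seed is set at the call site instead:
set.seed(123)
ex1 <- make_example_data()
set.seed(123)
ex2 <- make_example_data()
identical(ex1, ex2)  # TRUE: reproducible because the caller set the seed
```

This keeps the function honest about its randomness while still letting a vignette show the same data on every build.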

2021-09-17

Michael Love (12:02:32) (in thread): > @Eric Davis it looked like you have addressed these? should i pull and continue with final touchups?

Eric Davis (12:20:00) (in thread): > Yes! Let me know if you find any additional issues that I missed

Michael Love (14:33:54) (in thread): > great, thanks

2021-09-21

Michael Love (11:06:29): > @Eric Davis looking over our nullrangesData objects, the CTCF data package is in Bioc devel, so we can remove CTCF_hg19 right? https://bioconductor.org/packages/devel/data/annotation/vignettes/CTCF/inst/doc/CTCF.html @Wancen Mu also I think once excluderanges is accepted into the devel branch we can remove the exclude object and instead do this in a hidden chunk in the nullranges vignette https://github.com/Bioconductor/Contributions/issues/2269

Michael Love (11:08:44): > and i can remove the chain file from extdata it seems like

Wancen Mu (11:09:51) (in thread): > Yeah, sure!

Eric Davis (11:16:33) (in thread): > I think that’s correct!

Michael Love (19:49:19): > Bioc strongly recommends ExperimentHub, so I’ll be moving nullrangesData to an EHub package. I’ll rename the old data package to nullrangesOldData

Michael Love (20:28:02): > Ok, started this process, the new EHub skeleton is: https://github.com/nullranges/nullrangesData The old data package is https://github.com/nullranges/nullrangesOldData How this works is that I will then submit the EHub package and add nullranges to that issue once the resources can be accessed

2021-09-22

Michael Love (10:49:09): > @Mikhail Dozmorov any news on excluderanges? I’m removing our exclude data object from nullrangesData as I think we can use your new package in our vignette

Mikhail Dozmorov (12:58:10): > Not yet. Just got to answer, teaching today.

Michael Love (21:20:49): > I’m going thru the matching vignettes again, the part explaining how the classes work is actually pretty cool :grin: pulling back the S4 curtain @Eric Davis. A few notes: > > In overview() mention the frequency columns? > > Can you explain briefly how it could happen that a range was present many times with NN and why this might be of concern? Maybe this can also explain why without replacement isn’t implemented (it’s difficult to achieve balance, right?) > > You say NN and stratified are both fastest. Wasn’t sure what you meant. > > What’s a propensity score? Maybe just link to Wiki. > > Can you be a bit more verbose about how stratified works in the intro to the match vignette? I think users will appreciate the work put into it. > > In case study 1: > > I think you mean non-uniform, not non-random? Maybe “controlling for open chromatin status” rather than “independently”? > > Can you link to the plyranges package landing page and workflow at the first use here, just to send more users to Stuart ;-) can we also link to Mikhail’s CTCF package just as it’s related and users interested in loops/CTCF may find it useful > > Case study 2: “asymmetric” SP, “ctcf-bound” should be uppercase? > > Link to the GInteractions package landing page > > Do you comment on the final result?

2021-09-23

Michael Love (12:12:12): > Heads up, I just pushed updates to the nullranges vignettes; temporarily they use nullrangesOldData (this will be the case until nullrangesData has been submitted as an EHub pkg)

Eric Davis (16:23:08) (in thread): > Thanks for these comments! I will try to address them and submit a pull request. I think @Doug Phanstiel also had some thoughts on how to frame case study 2

Michael Love (18:36:09) (in thread): > why PR? you can put these straight into nullranges

Michael Love (18:36:14) (in thread): > i’m not working on it now BTW

Michael Love (18:36:27) (in thread): > tonight i’m submitting nullrangesData (new one)

2021-09-26

Michael Love (07:36:35) (in thread): > Thanks Eric! I made this one change: https://github.com/nullranges/nullranges/blob/main/vignettes/matching_ranges.Rmd#L408-L411. Looks like we’re set; I just need to finalize the nullrangesData package and then I think I can submit this week

2021-09-28

Michael Love (18:23:39): > Data is posted to AWS just waiting on the insertion into EHub so I can submit pkgs

2021-09-30

Michael Love (08:34:14): > @Mikhail Dozmorov — do you have EHub entries, even if excluderanges isn’t accepted yet? I want to use your objects in the nullranges vignette

Michael Love (08:34:23): > i can just call them directly for now

Mikhail Dozmorov (08:38:13) (in thread): > No, I believe the AHub part comes after acceptance. excluderanges is 29 days under review, I’ll ask for updates this weekend.

Mikhail Dozmorov (08:42:50) (in thread): > I’ve been also thinking of publishing excluderanges in F1000. Together with all nullranges authors. Expanding it with excludable regions detected from the latest T2T genome assembly. Planning to get to it after acceptance.

Michael Love (08:45:34) (in thread): > happy to help with that as needed, I think it’s a critical piece of Bioc annotation data

Michael Love (08:46:49): > I’m starting submission right now just to get us in the queue, bc I submitted the data to Ehub on Monday and just waiting on acceptance, and Mikhail submitted to Ehub 29 days ago and that’s waiting also… that’s all we are waiting on so I want to make sure we don’t miss the deadline bc of a hub backlog

Michael Love (14:04:17) (in thread): > it’s possible that they aren’t reviewing bc you have a warning flag? > > Warning seems trivial: > > * checking for unstated dependencies in examples … WARNING > > no parsed files found

Michael Love (14:04:30) (in thread): > http://bioconductor.org/spb_reports/excluderanges_buildreport_20210910092916.html

Mikhail Dozmorov (14:07:48) (in thread): > Unlikely. It was the same with CTCF package submission, I hacked around it by making a dummy file but was asked to remove. CTCF was approved with this warning.

Mikhail Dozmorov (14:09:14) (in thread): > But that warning will be a reason to ask.

2021-10-01

Michael Love (08:44:51): > well it’s submission deadline, so i’m going through the package to set all chunks depending on the Ehub data to FALSE, i don’t think we should miss the deadline bc Ehub backlog…

Michael Love (08:45:29): > @Wancen Mu I’m changing the segmentation and plotting code so exclude can be NULL (previously the code assumed it would always be provided); I noticed this as we don’t yet have access to excluderanges

2021-10-03

Mikhail Dozmorov (16:13:19): > Is there a workaround to install nullrangesData? As of now, devtools::install_github("nullranges/nullrangesData") results in error: > > Error: package or namespace load failed for 'nullrangesData': > .onLoad failed in loadNamespace() for 'nullrangesData', details: > call: FUN(X[[i]], ...) > error: 'DHSA549Hg38' not found in ExperimentHub > Error: loading failed > Execution halted > > I’m able to create that file via make-dhs-data.R. Btw, other make-… files complain that > > source("inst/script/util.R") > Error in file(filename, "r", encoding = encoding) : > cannot open the connection >

Michael Love (19:32:03) (in thread): > No because of the EHub backlog. If it hasn’t progressed by Monday I’ll email. They acknowledged being backlogged with package submission

Mikhail Dozmorov (19:47:09) (in thread): > Looks like all packages are delayed. OK, I had thought to use it in class; at least I’ll refer to the code for how to work with S4 classes/methods.

Vince Carey (19:47:39) (in thread): > Hi – I don’t think this will be resolved Monday as Lori is out until Tuesday. I am really sorry about this. We have a lot to catch up on.

Michael Love (19:50:11) (in thread): > Understand there’s a lot of work and appreciate the time of the core team, I’ve been following the contributions page and a lot came thru this week!

Mikhail Dozmorov (19:54:04) (in thread): > Second that

2021-10-04

Michael Love (12:12:25) (in thread): > Ok, Lori looped back to the thread, and it was bc I had version 0.0.4 instead of 0.99.x for the data package, now I’ve also bumped the Ehub submission email so hopefully can get things building asap on Bioc machines

2021-10-05

Michael Love (12:21:42): > Ok, nullrangesData is now available: > > > eh = ExperimentHub() > snapshotDate(): 2021-10-05 > > query(eh, "nullrangesData") > ExperimentHub with 3 records > # snapshotDate(): 2021-10-05 > # $dataprovider: Aiden Lab, UCSC > # $species: Homo sapiens > # $rdataclass: GenomicRanges, InteractionSet > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH7082"]]' > > title > EH7082 | DHSA549Hg38 > EH7083 | hg19_10kb_bins > EH7084 | hg19_10kb_ctcfBoundBinPairs >

Michael Love (12:21:52): > I’m updating nullranges so vignettes are live again

2021-10-06

Michael Love (10:07:22): > successful builds on Bioc for both nullranges and excluderanges :-D - File (PNG): Screenshot from 2021-10-06 10-05-43.png

Michael Love (12:15:58): > excluderanges is now accepted. very cool. after we get the reviews, I’ll make sure we put it back in to the nullranges vignette

2021-10-08

Tim Triche (11:59:04): > this is minor, but the topic of blacklisting regions in CUT&RUN came up recently and I blame you (plural) for my piping up and suggesting that we (us & collaborators) call it an excludelist :slightly_smiling_face:

Mikhail Dozmorov (14:48:17): > @Michael Love, the excluderanges data are available. I need to modify the excluderanges vignette, will catch up on weekend. > > > ah = AnnotationHub() > snapshotDate(): 2021-10-08 > > query(ah, "excluderanges") > AnnotationHub with 42 records >

Michael Love (14:49:29): > awesome, thanks!

Michael Love (14:49:41): > i’ll add these in after Bioc review (looks like should be done soon)

2021-10-09

Michael Love (10:34:38): > We’re in :tada: so @Wancen Mu feel free to push your new work to GitHub now - File (PNG): Screenshot from 2021-10-09 10-33-42.png

Mikhail Dozmorov (21:58:18): > Any suggestions for the excluderanges’ hex sticker? I’m not particular, but why not have something? - File (PNG): imgfile.png

2021-10-10

Wancen Mu (20:02:32) (in thread): > I just uploaded 2 R scripts. But I failed on devtools::check() because of Error: Can't find template 'matched-class-slots'. Could you help me generate the Rd files for those two?

2021-10-11

Michael Love (08:33:38): > I like it !

Michael Love (08:33:55): > if you don’t like the hexagons, I can make that smooth in gimpshop

Michael Love (08:34:15): > as in - File (PNG): image.png

Michael Love (08:35:01): > do you have the DNA pieces as a separate layer?

Mikhail Dozmorov (08:40:31): > The tximport sticker really shines! Here’s the script and the image I used for excluderanges - if you can improve, would be great! - File (PNG): Genome_cut.png - File (R): sticker.R

Michael Love (09:05:21): > it is very easy to modify — i have it as a gimp file from the Bioc template - File (PNG): Screen Shot 2021-10-11 at 9.05.08 AM.png

Michael Love (09:05:41): > i matched the border more to the fill as it already has green / red colors

Michael Love (09:05:59): > i like green / red bc it connotes stop light:traffic_light:

Mikhail Dozmorov (09:06:59): > This is excellent, and perfect associations!

Michael Love (09:07:02): > easy to change colors or take suggestions from anyone

2021-10-14

Mikhail Dozmorov (15:38:50) (in thread): > @Michael Love, can you make a PR with the transparent background version? I tried to add this screenshot, it looks a bit suboptimal: https://github.com/mdozmorov/excluderanges

2021-10-15

Michael Love (10:18:09) (in thread): > for n in reduceSegment, can I remove that argument and set > > n <- max(mcols(x)[,col]) > > ?

Wancen Mu (10:19:54) (in thread): > Yeah, sure!

2021-10-19

Michael Love (11:28:21): > nullranges/plyranges workflow question for @Stuart Lee when he has time: > * suppose we are doing bootRanges -> bind_ranges -> join_overlap_left -> group_by(iter) -> … > * even if I put mutate(iter=factor(iter, levels=seq_len(R))) following bind_ranges, if one of the bootstrap iterations had length 0, then it won’t show up following the join > * right now I have a hack where I add back in the 0s for no overlaps when the bootstrap had length 0, but any cleaner thoughts on doing this?

Michael Love (11:28:37): > here is some code where this happened to arise: https://gist.github.com/mikelove/607b3dd2eb8a70f21046fc06629055a2#file-issues_seg_wrt_features-r-L33-L42

Stuart Lee (20:07:53) (in thread): > oh yes I think this is expected behaviour, in the sense that empty groups are dropped by default. that is one of the issues on the backlog I have, to make it so group_by doesn’t drop empties (dplyr has supported this for a while).

Michael Love (20:40:15) (in thread): > Ok, if you also have any random pointers on how I can fix this more elegantly I can do reading

Michael Love (20:40:36) (in thread): > I guess I should look up how it works in dplyr

Michael Love (21:04:23) (in thread): > Oh there’s like complete() which is running expand()… I see now

Michael Love (21:05:37) (in thread): > And .drop=FALSE

2021-10-20

Michael Love (11:02:05): > Re: bootRanges, currently the code has x as the first argument, but I’m going to change its name to y, because typically we are bootstrapping y when we are doing overlaps of two sets, x and y, e.g. count_overlaps(x, y)

Michael Love (11:13:56) (in thread): > actually, tidyr complete solves this with no extra code needed: > > rate_unseg <- x %>% join_overlap_left(stack_unseg) %>% > group_by(iter) %>% > summarize(rateOverlaps=sum(!is.na(id))/length(x)) %>% > as.data.frame() %>% > complete(iter, fill=list(rateOverlaps=0)) %>% > pull(rateOverlaps) >
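For readers coming from outside R, here is a minimal sketch of what `complete()` accomplishes in the snippet above (pure Python; the iteration numbers and rates are invented for illustration):

```python
# Overlap rates keyed by bootstrap iteration. Iterations whose bootstrap
# sample produced zero overlaps are simply absent after the join/group_by.
observed = {1: 0.12, 2: 0.08, 4: 0.10}  # iteration 3 had no overlaps

R = 4  # total number of bootstrap iterations

# "Completing" re-inserts the missing iterations with an explicit 0 rate,
# mirroring complete(iter, fill=list(rateOverlaps=0)) in tidyr.
completed = {i: observed.get(i, 0.0) for i in range(1, R + 1)}

print(completed)  # {1: 0.12, 2: 0.08, 3: 0.0, 4: 0.1}
```

Without this step the null distribution would silently omit the zero-overlap iterations, biasing it upward.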

Michael Love (13:29:13): > @Wancen Mu I’m also moving seg to the fourth spot in the function, bc blockLength is a required numeric (R will nearly always be specified also), while seg is optional. Putting blockLength earlier in the argument list allows users to not have to write out the argument names, e.g. now bootRanges(y, 5e5, 10, seg) and bootRanges(y, 5e5, 10) are both possible without throwing an error. Previously you would have to write bootRanges(y, blockLength=5e5, R=10) for the unsegmented bootstrap > > I think it makes sense to have the two numerics, then the two ranges arguments together: > > bootRanges <- function(y, blockLength, R = 1, > seg = NULL, > exclude = NULL, ... >

Wancen Mu (13:30:32): > Yeah, that sounds reasonable!

Michael Love (20:20:55): > ok cool, i think i’m nearly done tweaking the package before release… > > Maybe one more tweak to improve the boot workflow. Currently we have to list then unlist the GRL which doesn’t make sense I think

Michael Love (20:22:14): > Also the pkgdown site is building again. I re-arranged these sections so it’s clearer that both functionalities are listed :slightly_smiling_face: - File (PNG): Screenshot from 2021-10-20 20-20-32.png

2021-10-21

Michael Love (10:41:45): > i just noticed excluderanges was accepted on Tuesday! Congrats @Mikhail Dozmorov, I’ll try to swap it into the nullranges vignette today (I’m doing some last minute work anyway)

Mikhail Dozmorov (11:22:48): > Yes, congratulations to all! Accepted a couple of days ago, things are a bit crazy, not catching up with messages. The data are on AHub, can be used in the vignette.

Michael Love (13:57:48): > i put it in and everything looks great

Michael Love (13:57:53): > thanks so much for putting that together:smile:

Michael Love (16:31:23): > with the release looming, I really wanted to clean up the bootstrap workflow. > > a lot was clarified once I decided that bootRanges should just output a simple GRanges. > > Thanks to @Stuart Lee’s design choices, the whole analysis ends up looking like: > > all_pks <- bind_ranges(boot=pks_prime, unif=pks_unif, .id="type") > > dat <- tss %>% > join_overlap_left(all_pks, maxgap=30) %>% > mutate(iter=factor(iter, levels=seq_len(R))) %>% # this is needed for now so we know R... > group_by(type, iter) %>% > summarize(rate=calculateRate(id, n)) %>% > as.data.frame() %>% # needed bc tidyr::complete operates on df > complete(type, iter, fill=list(rate=0)) > > this outputs dat which can go directly to ggplot: https://gist.github.com/mikelove/56e00143d486348c59885aed0d9cc4d2

Eric Davis (16:51:56): > There are a few wording changes I would like to make (per Doug’s suggestions) to one of the case studies. Should I make these changes to the github repo or is there a bioconductor repo that I should push to?

Michael Love (16:53:09): > If you push to GH I can push to Bioc. Note that Friday is kind of “last day” wrt the new release

Michael Love (16:53:25): > So we just have to be extra careful not to break the build or check

Michael Love (16:53:44): > Eg you should R CMD build and check the pkg locally etc

Stuart Lee (17:26:53) (in thread): > oh nice! I really should add these to plyranges when I have a moment

Michael Love (17:29:03) (in thread): > happy to chat sometime in Nov, i think it would be nice to remove the mutate and the as.data.frame to make it a few lines shorter

Michael Love (17:29:29) (in thread): > as you can probably tell i’m getting really excited about pushing this out to the community:slightly_smiling_face:

Michael Love (17:30:40) (in thread): > we’ve got two manuscripts in the works, both Eric and Wancen have nice real data examples, and if i can do a little more toodling on parallel i think boot will be ready for people to deploy on large datasets

Stuart Lee (17:36:36) (in thread): > happy to chat in Nov. I need to reserve a bit of time of plyranges planning and maintenance so that would be great.

Michael Love (17:39:28) (in thread): > wanna throw out a random date?

Michael Love (17:39:37) (in thread): > i like to just put things down on calendar, we can always cancel

Michael Love (17:40:14) (in thread): > i’m free 10am your time (i think 7pm here) or 9pm your time (i think 6am here), esp if we just do half hour

Michael Love (17:40:59) (in thread): > DST ends Nov 7 here so that may throw things off

Stuart Lee (18:48:23) (in thread): > how about 7pm your time on the 3rd of Nov?

2021-10-22

Michael Love (12:38:35) (in thread): > I changed the listing titles of the vignettes here: https://github.com/nullranges/nullranges/commit/99ca4e9ead4729dc8a64f9e0319e4840bc36f1b9

Michael Love (12:38:50) (in thread): > I’d like to push any final pre-release changes before 5pm today

Michael Love (12:39:51) (in thread): > just so we can see everything builds correctly on Saturday

Eric Davis (13:19:26) (in thread): > Sounds good, I’ll make those vignette changes shortly. They are really just wording changes so hopefully they won’t affect the build process, but I’ll make sure they are finished before 5 today just in case :+1:

Eric Davis (13:27:56): > It looks like there is both a “main” and “master” branch for nullranges. Is this intentional?

Michael Love (13:38:55): > yes, Bioconductor only uses master

Michael Love (13:38:58): > for now

Michael Love (13:39:08): > so i switched the default GitHub branch to master

Michael Love (13:39:18): > otherwise pushing gets really confusing/painful

Michael Love (13:39:46) (in thread): > cool thanks!!

Eric Davis (14:09:56) (in thread): > I’ve pushed the changes and am letting actions run the checks now. I was unable to run the checks on my computer (I don’t think I am up to date with all of the ExperimentHub changes since it can’t find some of the example datasets)

Michael Love (14:15:42) (in thread): > i can check on my side now, thanks!

Eric Davis (14:16:10) (in thread): > Thank you!

Michael Love (14:28:32) (in thread): > builds and checks fine on my side and on GH, pushing to Bioc now…

Michael Love (14:30:36) (in thread): > after release, 1) I should read Eric’s draft and 2) we should have a meeting to plan the two papers, e.g. what the main figures are etc. Wancen is aiming for an application note I think

Michael Love (14:31:54) (in thread): > she has some nice figures where she shows that the bootstrap p-value changes for various levels of a score threshold (e.g. GWAS p-value)

2021-10-24

Michael Love (17:01:01): > Everything builds fine on Bioc , we’re set for release this week

2021-11-03

Michael Love (19:30:08): > Notes from Nov-3-2021 meeting with Stuart: > 1. Make iter a factor Rle within bootRanges to avoid doing it later; shouldn’t be as slow as factor (test this) > 2. Parallelization and reproducibility: either store the chunk seed in metadata(x) [because we can’t store it reliably in mcols(x), as ranges may be missing in the bootstrap sample, or after the overlap join], or the user can just keep track of their own starting seeds. Maybe I will make the storage of chunk seeds in metadata a non-default option.
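The “user keeps track of their own starting seeds” option from point 2 can be sketched as follows (pure Python, not the nullranges implementation; boot_chunk and the chunk sizes are hypothetical stand-ins):

```python
import random

def boot_chunk(seed, n_boot):
    """Run one worker's share of bootstrap iterations with its own RNG,
    so a chunk is reproducible from its starting seed alone."""
    rng = random.Random(seed)  # independent generator per chunk
    # stand-in for drawing random block starts in a block bootstrap
    return [rng.randint(0, 1_000_000) for _ in range(n_boot)]

seeds = [101, 202, 303]  # one recorded seed per parallel worker
run1 = [boot_chunk(s, 5) for s in seeds]
run2 = [boot_chunk(s, 5) for s in seeds]

# Re-running with the same per-chunk seeds reproduces the draws exactly,
# without the package itself ever storing seeds in the returned object.
assert run1 == run2
```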

Kasper D. Hansen (20:34:33): > don’t store seed

Michael Love (20:36:17) (in thread): > because users should do that?

Kasper D. Hansen (20:41:12) (in thread): > yes, stay away from interfering with the number generator. This is a place where - IMO - reproducibility has done more harm than good

Michael Love (20:43:04) (in thread): > so the proposal was that users would be runningset.seed(xyz)on their own, within a parallel loop, but should we facilitate/encourage the saving of those seeds from the child process

Michael Love (20:44:02) (in thread): > I guess we just don’t facilitate the saving of the seed but just tell users it’s really important they do it, and provide example code how to do it

Kasper D. Hansen (20:45:33) (in thread): > I am way behind on the strategy here but (1) if you use a good parallel generator, you can just set the seed and reproduce it and (2) why is the data splitting sampling not just done outside the loop

Michael Love (20:48:58) (in thread): > it’s the bootstrap sampling, not data splitting

Michael Love (20:49:47) (in thread): > i shouldn’t have said chunk, it’s like a “chunk” of bootstraps

Michael Love (20:50:09) (in thread): > it’s like some fraction of the total R bootstrap samples

Stephanie Hicks (22:35:24): > @Stephanie Hicks has left the channel

2021-11-05

Michael Love (22:44:27): > cut out one line of code thanks to Stuart’s suggestion to output iter as a factor-Rle: > > dat <- tss %>% > join_overlap_left(all_pks, maxgap=30) %>% > group_by(type, iter) %>% > summarize(rate=calculateRate(id, n)) %>% > as.data.frame() %>% > complete(type, iter, fill=list(rate=0)) > > As Stuart pointed out when we were chatting, the as.data.frame() is needed anyway if we are going to plot the bootstrap densities with ggplot2. And complete() is needed if there is a chance of a zero overlap. The fill=list(rate=0) is a little clunky; i wouldn’t want to type that out again and again. > > There’s another line above which defines the function: > > calculateRate <- function(id, n) sum(!is.na(id))/n > n <- length(tss) > > this is because a range from tss is included with NA for id if it overlaps no peaks. should we export this small helper function? I’m really trying to make the code as readable and concise as possible.
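A sketch of what the calculateRate helper computes (Python rendering; None plays the role of R’s NA for a tss range that the left join matched to no peak, and calculate_rate is a hypothetical name):

```python
def calculate_rate(ids, n):
    """Fraction of the n tss ranges that overlapped at least one peak.
    ids holds one entry per joined row; None marks rows where the left
    join found no overlapping peak (R's NA)."""
    return sum(i is not None for i in ids) / n

ids = ["pk1", None, "pk3", None]  # 2 of 4 tss ranges hit a peak
print(calculate_rate(ids, n=4))   # 0.5
```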

2021-11-06

Michael Love (07:37:50): > Here is what the parallel case will look like: > > library(BiocParallel) > ncores <- 4 > bp <- MulticoreParam(workers=ncores) > seeds <- sample(1e4, ncores) > res <- bplapply(seq_along(seeds), function(i) { > set.seed(seeds[i]) > R <- 50 > pks_prime <- pks %>% > bootRanges(blockLength=chrlen/10, R=R, > type="bootstrap", withinChrom=TRUE) > dat <- tss %>% > join_overlap_left(pks_prime, maxgap=30) %>% > group_by(iter) %>% # group over bootstrap iterations > summarize(rate=calculateRate(id, n)) %>% # mean overlap > as.data.frame() %>% # GRanges -> df > complete(iter, fill=list(rate=0)) %>% # complete empties > mutate(iter=as.numeric(iter) + (i-1)*R) # shift iter > dat > }) > dat <- do.call(rbind, res) >

2021-11-07

Michael Love (17:15:55): > I’m thinking of writing some potential workshop/learning material, maybe as a bookdown, in the spirit of https://combine-lab.github.io/alevin-tutorial/ (the idea being the “chapters” could be written by different people and be somewhat independent) > > How is this different from our current vignettes? > * simplistic – like someone new to Bioc, e.g. a bedtools user, could potentially read these > * not necessarily all of these involving nullranges. E.g. some are just intro plyranges material > * don’t need to worry about breaking the build, or where data is hosted > * can get into thorny issues around parallel > Where should these go? > * nullranges.github.io/ranges-examples > * …tidy-ranges-examples > * …plyranges-examples > * … above but with tutorial instead of examples > * different base url?

Mikhail Dozmorov (18:54:21) (in thread): > It is a good idea, although I must say the current vignettes are pretty excellent tutorials by themselves. But making smaller/simpler/more focused examples of the different functionalities employed in nullranges will help lower the entry barrier. Planning for BioC2022?

Michael Love (19:00:27) (in thread): > Maybe for BioC yes — I’m actually thinking to target for someone who doesn’t know Bioconductor at all

Michael Love (19:01:08) (in thread): > So super slow pace

Michael Love (19:01:45) (in thread): > The current vignettes are the right pace for vignettes

Michael Love (19:06:28): > set the channel topic: Development and support channel for {nullranges} and associated packages

Mikhail Dozmorov (19:07:16) (in thread): > Then, it may be more general introduction to modern GenomicRanges. I’m still referring students to Kasper’s course material,https://kasperdanielhansen.github.io/genbioconductor/, but tutorials on the updated, tidy handling of GRanges are lacking. Making them will be very useful.

Michael Love (19:07:34): > set the channel description: {nullranges} is a modular package to generate feature sets representing the null hypothesis (either through matching on covariates or block bootstrapping of features)

Michael Love (19:08:13) (in thread): > Yes Kasper’s material is great

Michael Love (19:09:13) (in thread): > I’m thinking about very slowly going from the idea of joins, which is the key idea behind how we will combine ply- and null- and exclude-ranges

2021-11-08

Michael Love (06:38:20) (in thread): > I’m leaning now toward tidy-ranges-tutorial

Michael Love (08:28:49): > starting things off here -https://nullranges.github.io/tidy-ranges-tutorial/ - Attachment (nullranges.github.io): Tidy Ranges Tutorial > Basic examples of computing tidy operations on ranges.

Michael Love (08:29:09) (in thread): > ok, made a hello world bookdown there

Mikhail Dozmorov (09:23:41) (in thread): > It may be best to make a summary of what topics to include and who would like to contribute.

Mikhail Dozmorov (09:24:39) (in thread): > I’m wondering if it may be worth following the Carpentries template? I’ve recently enrolled in their course.

Michael Love (10:03:57) (in thread): > i can put in the README what i’m planning to write… I plan to write a few files this week and that will help give a better idea. I don’t need volunteers from the group at the moment but of course welcome any feedback – i just copy-pasted the minimal bookdown, which is all I need to make these look decent with minimum effort, e.g. render_book() + push to GH.

Michael Love (10:09:17) (in thread): > oh you mean like this: https://carpentries-incubator.github.io/bioc-project/03-installing-bioconductor/index.html. Just to keep things simple I think I’ll stick with the minimal bookdown for now

Mikhail Dozmorov (10:11:08) (in thread): > Bookdown is fine, technically, it is similar and conveys information well

Michael Love (17:09:23) (in thread): > here’s what i’m planning to write, one of these with help from Wancen: https://github.com/nullranges/tidy-ranges-tutorial/blob/main/README.md

Mikhail Dozmorov (17:20:05) (in thread): > Great plan. I would add a permutation-based enrichment analysis example, especially parallelized.

Michael Love (18:55:42) (in thread): > Oh yeah it’s not super clear but that’s #4

2021-11-10

Michael Love (08:03:35): > @Wancen Mu I’m thinking for the manuscript we should save a genome segmentation for hg38 and mm10 based on gene density and put them on AHub (via nullrangesData) in the devel branch so people can skip the segmentation step. So on your side that would mean creating an R script like you did for the DHS that generates our preferred segmentation. Thoughts?

Wancen Mu (09:33:33) (in thread): > Yeah, that would save time for users to query excluderanges and genes! > Both our datasets use hg38; is mm10 for ease of use?

Michael Love (09:39:25) (in thread): > Yes, lots of people in genetics/genomics work in mouse so it’s just to expand the set of people that can straight away use bootRanges without having to do segmentation

Michael Love (09:39:39) (in thread): > the code should be identical for the two genomes, just pulling a different gene set

Michael Love (09:40:14) (in thread): > if you start writing the script here, you can ping me and I’ll look it over: https://github.com/nullranges/nullrangesData/tree/master/inst/scripts

Michael Love (09:40:28) (in thread): > it can be one script for hg38 and then i can adapt it for mm10 actually

Michael Love (09:40:45) (in thread): > and i will push the resources to Ahub once we finalize

Wancen Mu (14:44:07) (in thread): > I have 2 questions about the new excluderanges, since I used deny previously. > 1. Following the excluderanges vignette, it suggests using the excludeGR.hg38.Kundaje.1 object. Besides those ENCODE-produced regions, centromeres, telomeres, and other gap locations are stored in the UCSC gap table; should we combine those together? > 2. We said to only select width > 500 when doing the segmentation, to avoid generating too many small segmentation ranges. So even if we upload the segmentation to AHub, users still need to generate excluderanges to drop small-width pieces in bootRanges, right?

Michael Love (14:51:55) (in thread): > I would preprocess the segmentation so it’s ready to use — essentially the one that we use directly in our analyses

Michael Love (14:53:01) (in thread): > question (1) is maybe one for @Mikhail Dozmorov — opinion on whether we should just use the ENCODE regions or also centromeres/telomeres etc as excluded ranges so we don’t place features there during bootstrapping?

Wancen Mu (15:57:34): > Hi @Mikhail Dozmorov, do you have an opinion on whether to use the ENCODE regions alone or also include UCSC centromeres/telomeres and other gap regions as excluded ranges, so we don’t place features there during bootstrapping?

Mikhail Dozmorov (17:36:31): > Centromeres and telomeres - yes, and I think it is important to make others aware that one should account for them. I’m not so sure about gaps. For simplicity, suggesting excludable regions plus centromeres/telomeres.

Mikhail Dozmorov (17:37:01) (in thread): > Just got to it - replied in the main thread. And open to hear other opinions!

Michael Love (18:38:04): > thanks Mikhail

Wancen Mu (20:10:56) (in thread): > Mike, here is the script for hg38! I wrote the code to save the excluderanges, HMM, and CBS segmentations, respectively. Let me know if there are any questions about the code! @Michael Love https://github.com/nullranges/nullrangesData/blob/a3c7c86c6a709d106e2700d0b3b1f59d3dfae5e5/inst/scripts/make-segmentation-hg38.R#L62-L64

Wancen Mu (20:11:51) (in thread): > Thank you, Mikhail!

Michael Love (20:16:44) (in thread): > Great I’ll work on adding this + mm10 to devel branch

2021-11-29

Michael Love (08:30:56): > @Eric Davis when you get a chance can you poke around to see why the match* vignettes don’t build on Windows? (for some reason it’s just in devel; in release it’s fine) http://bioconductor.org/checkResults/devel/bioc-LATEST/nullranges/riesling1-buildsrc.html > > Quitting from lines 40-216 (matching_ranges.Rmd) > Error: processing vignette 'matching_ranges.Rmd' failed with diagnostics: > must specify only one of 'font' and 'fontface' >

Eric Davis (11:15:13) (in thread): > Hmm, that is strange. It looks like the error comes from the way the font parameters are interpreted in grid.text (https://github.com/thomasp85/grid/blob/master/R/gpar.R#L154). I don’t have a windows machine and I can’t replicate the error, but this bit of code should be able to test the problem on a windows machine: > > library(plotgardener) > library(grid) > > pageCreate(width = 8.5, height = 6.5, showGuides = FALSE, xgrid = 0, ygrid = 0) > > plotText(label = "Pool Set", > x = 2.25, y = 0.9, > just = c("center", "bottom"), > fontcolor = "#33A02C", > fontface = "bold", > fontfamily = 'mono') > > Either way this seems to be a plotgardener/grid issue rather than a nullranges issue

Michael Love (12:01:26) (in thread): > oh and that made me look at plotgardener —http://bioconductor.org/checkResults/devel/bioc-LATEST/plotgardener/

Michael Love (12:01:32) (in thread): > might want to forward that along to Nicole

Doug Phanstiel (12:10:49) (in thread): > huh

Doug Phanstiel (12:10:51) (in thread): > thanks

2021-12-10

Michael Love (08:38:16) (in thread): > i know ya’ll are busy with the ABCD, but any update on fixing plotgardener? nullranges is still failing the build in devel. we could just comment out the fontface calls ~relevant chunks~ in the vignettes for now?

Michael Love (08:38:32) (in thread): > http://bioconductor.org/checkResults/devel/bioc-LATEST/nullranges/

2021-12-14

Michael Love (07:43:05): > Been playing around with some tutorial material; just posted this to twitter to see what people have to say: https://nullranges.github.io/tidy-ranges-tutorial/bootstrap-overlap.html - Attachment (nullranges.github.io): Chapter 4 Bootstrap overlap | Tidy Ranges Tutorial > Basic examples of computing operations on genomic ranges using the tidy data philosophy.

2021-12-15

Michael Love (08:19:18) (in thread): > @Eric Davis can you remove all the fontface = "bold" in the three vignettes so we can pass the build for now?

Eric Davis (10:35:21) (in thread): > Will do!

Eric Davis (15:26:28) (in thread): > I think I bothered Nicole enough to get her to fix the issue :sweat_smile:. If the build is still failing by Monday I will remove fontface = "bold" until we figure out the cause.

Michael Love (15:37:50) (in thread): > sounds good, thanks for looking into it

Eric Davis (23:59:10): > Updated documentation around matching terminology and propensity scores - via this PR (https://github.com/nullranges/nullranges/pull/16). Feel free to adjust the language as needed!

2021-12-16

Michael Love (08:01:00): > perfect, thanks@Eric Davis!!

Michael Love (08:02:54): > I’ve pushed to Bioc devel

Wancen Mu (08:03:41) (in thread): > Hey @Michael Love, do you want to put this on the devel branch too?

Michael Love (08:03:43): > we are using matchRanges on a project with a group at Duke and I was trying to explain to them how the matching is done:slightly_smiling_face:

Michael Love (08:04:07) (in thread): > yes I will I’ve just been looking into the segmentation a bit

Michael Love (08:04:21) (in thread): > i plan to push this to devel in the next week or so

Wancen Mu (08:04:53) (in thread): > No worries! Just in case you forget:joy:

Michael Love (08:04:57) (in thread): > I was playing around with bootstrapping DHS peaks in the CBS segmentation of hg38

Michael Love (08:05:29) (in thread): > i noticed it was clumpier than i thought, but the reason is that the DHS peaks should probably be merged

Michael Love (08:06:19) (in thread): > i’ll have examples in a week or so, but yeah will plan to send the segmentation to devel branch

Wancen Mu (08:10:14) (in thread): > Oh, are you saying currently CBS segmentation doesn’t fit DHS peaks? Anyway, maybe I can wait for your examples to know better!

Michael Love (08:15:00) (in thread): > the CBS is probably ok – i think the issue i found is that the DHS peaks need to be merged, i’ll work on an example to show this

Michael Love (16:34:38) (in thread): > oh interesting. I wasn’t getting great matching with nearest or stratified, then i just downsampled the tss = 0 case and got better results with nearest

Michael Love (16:35:01) (in thread): > maybe of interest to @Eric Davis: have you tried this strategy? we could discuss it in the vignettes if it makes sense

Michael Love (16:35:30) (in thread): > before (~3 million pool), focal set is ~500 - File (PNG): Screenshot from 2021-12-16 16-27-29.png

Michael Love (16:35:50) (in thread): > after (40k pool) - File (PNG): Screenshot from 2021-12-16 16-33-37.png

Michael Love (16:36:05) (in thread): > i just randomly downsampled tss=0 from millions to 10k

Michael Love (16:38:49) (in thread): > also this is quite easy to work with, bravo Eric

Michael Love (16:42:06) (in thread): > the plyranges part of this is nice and simple IMO
> 
> focal <- pks %>%
>   mutate(id = seq_along(.)) %>%
>   join_overlap_inner(bins) %>%
>   group_by(id) %>%
>   reduce_ranges(GC = mean(GC), map = mean(map), tss = ceiling(mean(tss)))
> 
> pool <- bins %>%
>   filter_by_non_overlaps(pks) %>%
>   filter(tss <= max(focal$tss))
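The downsampling idea discussed in this thread can be sketched roughly as below. This is illustrative only, not the exact code used in the analysis: it assumes `pool` is a GRanges with GC, map, and tss metadata columns, and thins the overwhelming tss = 0 stratum before matching so the propensity model isn't dominated by that one category.

```r
# Sketch only: randomly downsample the tss = 0 bins (millions) to 10k
# before passing the pool to matchRanges
set.seed(5)
zero    <- which(pool$tss == 0)
nonzero <- setdiff(seq_along(pool), zero)
keep    <- sample(zero, 1e4)
pool_ds <- pool[sort(c(keep, nonzero))]

# nearest-neighbor matching on the thinned pool
mgr <- matchRanges(focal = focal, pool = pool_ds,
                   covar = ~ GC + map + tss, method = "nearest")
```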

Doug Phanstiel (16:48:17) (in thread): > Are you matching with replacement?

Doug Phanstiel (16:49:35) (in thread): > It seems like downsampling the pool should make matching worse for nearest without replacement

Eric Davis (16:54:33) (in thread): > That is interesting. We don’t support nearest without replacement, so maybe it is selecting the same sets as many times as it needs to build a better distribution?

Eric Davis (16:54:51) (in thread): > Can you try checking how many indices are duplicated?

Eric Davis (16:55:32) (in thread): > Code for this is in the vignette: https://nullranges.github.io/nullranges/articles/matching_ranges.html#nearest-neighbor-matching - Attachment (nullranges.github.io): Overview of matchRanges > nullranges

Michael Love (17:07:43) (in thread): > I was doing nearest with replacement, then tried stratified w/o replacement (same problem), then finally back to nearest with the downsampling idea

Michael Love (17:08:22) (in thread): > my guess as to why it gets better is that the logistic regression was sub optimal, because the tss=0 cases overwhelmed the model. tss=0 likely different in GC and mappability than tss > 0 (these are genomic bins with number of TSS as a covariate)

Michael Love (17:10:58) (in thread): > only 2 indices were duplicated out of the 500

2021-12-18

Mikhail Dozmorov (09:18:56): > @Eric Davis, nullrangesData has the hg19_10kb_ctcfBoundBinPairs object. Are there plans to create the same but for hg38? Different resolutions? The rationale is that it allows a user to plug in his/her loop data in the looped column and test for CTCF convergence. Many work with hg38. If hg38_10kb_ctcfBoundBinPairs does not exist, I’m thinking of creating one - do you think it will be useful to add to nullrangesData?

Michael Love (09:51:32) (in thread): > It shouldn’t be too onerous to update these — I’m going to also add hg38 segmentations. I can coordinate the AWS upload altogether

Mikhail Dozmorov (09:59:33) (in thread): > I thought you may have some code to create it - that’ll be great. The algorithm is straightforward, but requires some work. If you create it, would you mind sharing? With an empty looping column, for testing.

Michael Love (10:27:28) (in thread): > sorry I mean, if Eric can make a new object, I can include it also with other new objects when I update nullrangesData

Mikhail Dozmorov (10:29:51) (in thread): > or I can help. Let’s hear from Eric, and coordinate to avoid duplicate work

Eric Davis (11:29:07) (in thread): > Another possibility is the way “nearest” is implemented. It uses data.table’s rolling-join function, which might take the nearest in the forward direction.

Eric Davis (11:36:58) (in thread): > It should be pretty simple, assuming that AnnotationHub has the appropriate data in hg38. I suppose if it doesn’t I could always lift over the ranges too. Here are the scripts that make the objects for hg19: https://github.com/nullranges/nullrangesData/blob/master/inst/scripts/make-hg19-bins-data.R and https://github.com/nullranges/nullrangesData/blob/master/inst/scripts/make-hg19-ctcf-pairs-data.R

Eric Davis (11:38:11) (in thread): > Would you like me to generate these objects for hg38?

Eric Davis (11:39:10) (in thread): > These were also meant to be example datasets for the vignettes

Mikhail Dozmorov (12:14:38) (in thread): > If you can create this object, that’ll be great. I lifted over the Rao 2014 loops using Doug’s liftOverBedpe tool, here are the results: GSE63525_GM12878_primary+replicate_HiCCUPS_looplist_hg38.txt.gz. And the code:
> 
> cat GSE63525_GM12878_primary+replicate_HiCCUPS_looplist.txt | awk 'BEGIN {OFS="\t"} {print "chr"$1,$2,$3,"chr"$4,$5,$6,"chr"$1":"$2"-"$3"_chr"$4":"$5"-"$6,".",".","."}' > GSE63525_GM12878_primary+replicate_HiCCUPS_looplist_chr.txt
> # https://github.com/dphansti/liftOverBedpe
> python2.7 liftOverBedpe.py --chain hg19ToHg38.over.chain --i GSE63525_GM12878_primary+replicate_HiCCUPS_looplist_chr.txt --o GSE63525_GM12878_primary+replicate_HiCCUPS_looplist_hg38.txt --h T --lift ./liftOver --v T

Eric Davis (12:16:16) (in thread): > Thanks for the code! What resolutions were you thinking?

Mikhail Dozmorov (12:16:36) (in thread): > 10kb

Michael Love (19:11:33) (in thread): > oh, interesting. I’ll assess this next week. If this is the case we should also mention in vignette to randomize the order of the data?

2021-12-21

Michael Love (14:38:48) (in thread): > i looked into this, and shuffling didn’t help. What about a section like “Tips for matching” at the end of one of the matching vignettes? because i don’t think everyone needs to downsample super large categories (e.g. pool having way more than focal) but it did help in this case

Michael Love (14:41:30) (in thread): > also it would be good to mention in this extra section that there is a directionality to the “nearest” method – and if users don’t want that they should just shuffle the ranges before using matchRanges
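The shuffle suggested here is a one-liner; a minimal sketch, assuming `pool` is the GRanges passed to matchRanges:

```r
# remove any positional ordering before nearest-neighbor matching,
# since the rolling join can break ties directionally
set.seed(123)
pool <- pool[sample(length(pool))]
```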

2021-12-23

Eric Davis (23:20:31) (in thread): > Not sure where to put this, but I created a new branch hg38_datasets in nullranges/nullrangesData with both datasets for hg38: https://github.com/nullranges/nullrangesData/tree/hg38_datasets/data This can either be merged into main or you can take the data and not merge the branch - whichever is easiest

2021-12-24

Michael Love (06:54:04) (in thread): > thanks! bc it’s an ExperimentHub package, I’ll upload the data to AWS after the break and then not merge the branch – the ExperimentHub packages are “shells” without data in them

Michael Love (06:54:18) (in thread): > i can delete or leave the branch and just not push it to Bioc

Mikhail Dozmorov (08:59:54) (in thread): > Thanks, Eric, Mike, that’s awesome! The new object works perfectly, as a drop-in replacement in the matching_ginteractions.Rmd vignette.

2022-01-06

Michael Love (14:41:44) (in thread): > hi all, I’m looking to add the new objects to Bioc but i can’t work with the GitHub repo bc it seems to have 100Mb of data in its git history, and that doesn’t agree with the Bioc repo (which is a slim 100k)
> 
> I’m going to rename the GitHub repository to nullrangeDataArchived and just deprecate it

Michael Love (14:46:12) (in thread): > i’m now going to add in the scripts from Wancen and Eric, and contact Bioc core about adding new objects to AWS

Michael Love (15:03:49): > @Mikhail Dozmorov just noticed this, when using excluderanges in Wancen’s script:
> 
> telomere <- query_data2[["AH95938"]]
> !> telomere
> GRanges object with 48 ranges and 6 metadata columns:
>     seqnames    ranges strand |       bin        ix           n      size        type
>        <Rle> <IRanges>  <Rle> | <numeric> <numeric> <character> <numeric> <character>
>   1     chr1   0-10000      * |       585         1           N     10000    telomere
> 
> the 0 will make noise so requires a trim(telomere) to trim it
> 
> !> trim(telomere)
> GRanges object with 48 ranges and 6 metadata columns:
>     seqnames    ranges strand |       bin        ix           n      size        type
>        <Rle> <IRanges>  <Rle> | <numeric> <numeric> <character> <numeric> <character>
>   1     chr1   1-10000      * |       585         1           N     10000    telomere
> 
> no big deal though

Mikhail Dozmorov (16:50:53) (in thread): > Good point, note taken. Also saw your biochubs question - will follow. There will be some updates to theexcluderangesandctcfpackages, so will do the trimming then.

2022-01-14

Michael Love (10:58:33): > For bootRanges diagram, something like this arrangement? I can clean up more, e.g. “compute statistic” is a bit bunched up - File (JPEG): Image from iOS

Michael Love (10:59:38): > @Wancen Mu :point_up: I’m trying to make it very simple and we can describe segmented in text maybe. Just don’t want readers to be lost at the diagram

Michael Love (11:06:54): > i’ll do a version 2 to then pass to Doug’s art wizard:slightly_smiling_face:

Michael Love (11:07:17): > maybe also put the bootstraps (like a stack) in A also

Wancen Mu (11:11:11) (in thread): > No problem! This looks tidier and much easier to read! Just some small casual details: maybe add a block length annotation, different lengths of ranges in A, and make the mcols in B a dotted line~

Michael Love (11:13:16) (in thread): > will do

Michael Love (11:13:20) (in thread): > bc optional right

Wancen Mu (11:14:07) (in thread): > Thank you!

2022-01-16

Michael Love (20:25:48): > Here’s iteration 3 - File (JPEG): Image from iOS

2022-01-17

Wancen Mu (06:24:20) (in thread): > That looks perfect!! Do u think it is helpful if we add “Ranges” or “count” in panel B? - File (JPEG): Image from iOS

Michael Love (07:56:14) (in thread): > Yes, that’s better, I’ll pass this on to Doug’s team

Wancen Mu (07:59:07) (in thread): > Thank you, Mike:pray:! And maybe a histogram of bootstrap statistics overlapped with the line plot!:relaxed:

Michael Love (09:34:46) (in thread): > I don’t follow the last part, histogram overlapped with line plot?

Wancen Mu (21:05:38) (in thread): > Oh, I mean also show the histogram of bootstrap statistics under the line plot, maybe clearer on how the line plot is constructed? - File (JPEG): Image from iOS

2022-01-18

Michael Love (09:33:39) (in thread): > oh i see, let’s see how the simple density looks first, we can keep iterating with Erika perhaps

Wancen Mu (10:51:16) (in thread): > Thanks! For the two tries in the email, I think putting x on the top makes the arrows look tidier although it doesn’t follow the join_overlap workflow position. Looking forward to Erika’s figure!

2022-01-24

Michael Love (08:12:33): > > Submissions to present at Bioc2022 are now open. The deadline to submit is Wednesday March 9th, 2022. Please see https://bioc2022.bioconductor.org/submissions/ for more details.
> Maybe a nullranges software demo (30-45 min)? We could combine efforts instead of splitting up like last time - Attachment (bioc2022.bioconductor.org): Submissions > Submissions

Mikhail Dozmorov (09:31:04) (in thread): > Presenting - certainly. Not sure what’s the difference btw splitting or combining efforts.

Mikhail Dozmorov (09:33:16) (in thread): > Looks like hg38_10kb_ctcfBoundBinPairs was deleted? Can’t find the branch https://github.com/nullranges/nullrangesData/tree/hg38_datasets/data or scripts to generate it. I have the object on my computer, but that’s fragile.

Michael Love (09:39:52) (in thread): > here’s the story on that: https://community-bioc.slack.com/archives/CC88GP2F4/p1641498104000100?thread_ts=1636549415.042200&cid=CC88GP2F4 - Attachment: Attachment > hi all, I’m looking to add the new objects to Bioc but i can’t work with the GitHub repo bc it seems to have 100Mb of data in its git history, and that doesn’t agree with the Bioc repo (which is a slim 100k) > > I’m going to rename the GitHub repository to nullrangeDataArchived and just deprecate it

Michael Love (09:41:13) (in thread): > we shouldn’t really put data objects into GH bc it makes it hard to sync with Bioc where repos are not supposed to have data objects
> 
> i asked in early Jan to submit the objects to EHub but haven’t heard back from hubs — i just pinged again last week so waiting to hear back

Michael Love (09:41:37) (in thread): > we’re in limbo now but we have scripts to make the object and 3 local copies of these objects

Mikhail Dozmorov (09:55:36) (in thread): > Ah, recall now. Well, let’s hope EHub will help.

Michael Love (18:20:42) (in thread): > Well last time nullranges got a talk but missed out on a workshop. Maybe asking for both hurt the chances

Michael Love (18:23:49) (in thread): > We could do a workshop that could involve Wancen, Eric, maybe Stuart if he’s free and interested, maybe Mikhail you could speak about excluderanges resources … just throwing out ideas

Mikhail Dozmorov (19:33:38) (in thread): > The format would likely be dictated by the conference format. Last year’s presentation format was dictated by the online venue. I’d vote for a workshop, if feasible - the CTCF orientation vignette is a great demo. And others, of course, I’m just using that one the most.

2022-01-28

Megha Lal (11:13:52): > @Megha Lal has left the channel

2022-02-04

Michael Love (12:24:11) (in thread): > @Wancen Mu see this which is also in the segmented vignette on Bioc

Michael Love (12:24:16) (in thread): > https://nullranges.github.io/nullranges/articles/segmented_boot_ranges.html#use-with-plyranges - Attachment (nullranges.github.io): Segmented block bootstrap > nullranges

Michael Love (12:24:27) (in thread): > this is how we solved the 0 overlap case

Michael Love (12:25:17) (in thread): > another code example here https://nullranges.github.io/tidy-ranges-tutorial/bootstrap-overlap.html - Attachment (nullranges.github.io): Chapter 4 Bootstrap overlap | Tidy Ranges Tutorial > Basic examples of computing operations on genomic ranges using the tidy data philosophy.
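For readers without the vignette open, the zero-overlap issue comes down to this: when you count overlaps per bootstrap iteration with a grouped summary, iterations with zero overlaps silently drop out of the result. A generic pattern for filling those back in (illustrative, not necessarily the vignette's exact code) uses tidyr's complete():

```r
library(dplyr)
library(tidyr)

# toy data: iteration 2 produced no overlaps, so it has no rows at all
hits <- data.frame(iter = c(1, 1, 3), hit = 1)

stats <- hits %>%
  group_by(iter) %>%
  summarize(n = sum(hit)) %>%
  # re-insert the missing iteration with an explicit count of 0
  complete(iter = 1:3, fill = list(n = 0))
```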

Wancen Mu (12:28:20) (in thread): > Aha, thank u! Didn’t notice the vignette has been updated!

2022-02-22

Michael Love (14:56:24): > Congrats @Eric Davis on passing oral exam today!:tada:

Michael Love (14:56:42): > matchRanges made an appearance along with lots of interesting biology

Michael Love (14:56:56): > In other news, the new objects are now in EHub:
> 
> > eh = ExperimentHub()
> snapshotDate(): 2022-02-22
> > query(eh, "nullrangesdata")
> ExperimentHub with 8 records
> # snapshotDate(): 2022-02-22
> # $dataprovider: Aiden Lab, Love lab, UCSC, Kundaje lab
> # $species: Homo sapiens
> # $rdataclass: GenomicRanges, InteractionSet
> # additional mcols(): taxonomyid, genome, description,
> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
> #   rdatapath, sourceurl, sourcetype
> # retrieve records with, e.g., 'object[["EH7082"]]'
> 
>            title
>   EH7082 | DHSA549Hg38
>   EH7083 | hg19_10kb_bins
>   EH7084 | hg19_10kb_ctcfBoundBinPairs
>   EH7306 | exclude_hg38_all
>   EH7307 | seg_cbs
>   EH7308 | seg_hmm
>   EH7309 | hg38_10kb_bins
>   EH7310 | hg38_10kb_ctcfBoundBinPairs
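Pulling any of these down follows the standard ExperimentHub accessor pattern noted in the listing ('retrieve records with, e.g., object[["EH7082"]]'); a short sketch using the IDs above:

```r
library(ExperimentHub)
eh <- ExperimentHub()

seg_cbs  <- eh[["EH7307"]]  # CBS segmentation
exclude  <- eh[["EH7306"]]  # exclude_hg38_all
binpairs <- eh[["EH7310"]]  # hg38_10kb_ctcfBoundBinPairs
```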

Michael Love (15:27:57) (in thread): > @Wancen Mu I recommend you modify the segmented vignette so that you add a section above here https://github.com/nullranges/nullranges/blob/master/vignettes/segmented_boot_ranges.Rmd#L63 with title “Pre-built segmentations”. 99% of users will want to just do this because 1) it’s easier 2) the package authors have decided this is a good segmentation, so they will be doing the right thing

Michael Love (15:28:40) (in thread): > and then here we should use the pre-built ones: https://github.com/nullranges/nullranges/blob/master/vignettes/segmented_boot_ranges.Rmd#L215

Wancen Mu (15:33:23) (in thread): > Wow finally released! Good plan. Will let you know once I’ve finished

Michael Love (15:35:37) (in thread): > Thanks, we have plenty of time before release

Michael Love (15:35:54) (in thread): > yeah – it did take a while, not sure why exactly

Michael Love (15:36:45) (in thread): > and we’ve made a lot of changes in the devel branch so make sure you start with 1.1.9

Wancen Mu (15:37:24) (in thread): > Got it. Thanks for the reminder!

Eric Davis (15:53:21) (in thread): > Thanks!! Glad it is behind me now:sweat_smile:

2022-03-04

Michael Love (16:50:53): > @Eric Davis @Wancen Mu want to submit an abstract for nullranges? I’d be happy to help draft https://community-bioc.slack.com/archives/CC88GP2F4/p1643029953016000 - Attachment: Attachment > > Submissions to present at Bioc2022 are now open. The deadline to submit is Wednesday March 9th, 2022. Please see https://bioc2022.bioconductor.org/submissions/ for more details. > Maybe a nullranges software demo (30-45 min)? We could combine efforts instead of splitting up like last time

Michael Love (16:57:46): > just starting to draft here, please anyone feel free to edit as you like

Michael Love (16:57:47): > https://hackmd.io/rvjGSXTyRS2JNz3B7Hd3SA - Attachment (hackmd.io): nullranges: modular workflow for overlap enrichment - HackMD

2022-03-07

Michael Love (07:17:27): > @Wancen Mu@Eric Davisfeel free to edit the above demo description, and then we’ll submit sometime tomorrow

Wancen Mu (10:07:16) (in thread): > Thanks Mike, I will do it today~

Wancen Mu (16:49:53) (in thread): > @Michael Love I’m done editing it. There are 1686 characters excluding spaces right now. I think there is plenty of space left for Eric! Thanks!

Wancen Mu (17:01:02) (in thread): > Oops, just learned there is a 5000 character limit, do I need to add some results on that? This time, I mainly wrote down package functionalities. Is there any difference between submitting that as a workshop vs. as a short talk?

Michael Love (20:27:23) (in thread): > you don’t need to aim for the limit! I review these and appreciate brevity

Michael Love (20:27:24) (in thread): > :smile:

2022-03-08

Michael Love (12:23:22) (in thread): > ok we have a little more time

Eric Davis (15:14:20) (in thread): > I’ve added some text for the matching portion - feel free to edit and let me know if I should add more detail!

Michael Love (16:13:07) (in thread): > thanks!

Michael Love (16:26:18): > awesome thanks, I tweaked some language bc i noticed “features” was being used both to talk about ranges and covariates, here I use “ranges” and “characteristics” instead of saying features

Michael Love (16:27:15): > > There are many well-established packages for overlap enrichment in R/Bioconductor: these can be used to establish if two sets of genomic ranges are distributed closer to each other than expected under a particular null hypothesis. In this software demo we will focus on two branches of specification of null hypothesis for distribution of genomic ranges, where we find it is beneficial to separate generation of null ranges from the enrichment analysis steps. These are cases where the specification of the null hypothesis is complex in itself and deserves its own multiple steps, diagnostic considerations, and plots – all covered in our workshop. Finally, we will demonstrate how nullranges plays a role in a tidy data workflow tying together multiple Bioconductor and tidyverse packages. > > The first branch of null hypothesis specification allows users to generate matched ranges that control for specific confounding characteristics. Since the distribution of these characteristics often differs between the set of interest (the focal set) and the pool of candidate ranges, an appropriate null set must be matched to the characteristics of the focal set. We have implemented a propensity score-based method for performing covariate-matched subset selection. Our implementation is efficient for operating on genome-scale data and tightly integrated with existing Bioconductor classes. Additionally, we have provided accessor methods and plotting functions for visualizing and assessing matching quality and covariate balance. > > Another branch is to perform bootstrap resampling on blocks of the genome containing an original set of ranges, preserving the ranges' clustering properties, possibly considering an exclusion list of regions where ranges should not be located. The algorithm follows the genomic block bootstrap (Bickel et al 2010). Our implementation uses efficient vectorized code for generating bootstrap ranges from input GRanges objects. 
We have implemented options for bootstrapping with respect to a segmented genome, to deal with highly heterogeneous range distributions. We will discuss considerations of segmentation, block length, and their impact on the hypothesis testing, in comparison to shuffling start positions of ranges. > > After generation of a set of ranges representing the null hypothesis, we will demonstrate use of plyranges as the engine for downstream overlap enrichment analysis or other analyses. Other possible downstream analyses made possible with nullranges + plyranges include computing correlations of sample data for all overlapping pairs of ranges, and optimizing an effect size threshold for differential analysis with the use of penalized splines. For the former we will also demonstrate complementary analysis with tidySE. >

Wancen Mu (16:32:41): > That is perfect! Thanks for all the work!

Michael Love (16:32:55): > excited and:crossed_fingers:

Michael Love (16:33:03): > i’ll submit tonight i think

Michael Love (16:34:00): > oh wait, i should not submit all these abstracts bc it will hurt the chances — Wancen could you submit sometime (there’s no fast deadline now)

Wancen Mu (16:35:06) (in thread): > For sure, I can do it tonight if there are no more changes!

Michael Love (16:36:51) (in thread): > ok yeah and I would suggest, given the scope of the demo, to have co-authors:
> 
> Eric Davis esdavis@live.unc.edu
> Mikhail Dozmorov mikhail.dozmorov@vcuhealth.org
> Stuart Lee stuart.andrew.lee@gmail.com
> Michael Love michaelisaiahlove@gmail.com
> Douglas Phanstiel douglas_phanstiel@med.unc.edu

Wancen Mu (16:38:00) (in thread): > Gotcha, thanks for those information!

Michael Love (16:39:02) (in thread): > i don’t know what to put for virtual vs in person… maybe unsure for now

2022-04-29

Michael Love (10:30:27): > @Wancen Mu and @Eric Davis congrats, the software demo is approved. Eric, do you want to attend BioC in person? I think I can cover it with some CZI extra funds. Wancen will be virtual I think

Eric Davis (11:28:45) (in thread): > I think it would be very neat to attend in person! Will you be going also?

Michael Love (13:28:45) (in thread): > I’ll have to be virtual — Euphy and Ji-Eun from my lab are planning to attend in person though

Michael Love (13:29:08) (in thread): > I’m going to do a hybrid demo with Euphy, and I think you could consider that too

2022-05-03

Tim Triche (18:35:59): > Congratulations Wancen!

Wancen Mu (19:11:12): > Thanks, Tim!:laughing:

Michael Love (21:14:46) (in thread): > So @Eric Davis do you think you would be able to travel? I think they want confirmation of in person attendance this week

2022-05-04

Eric Davis (11:10:03) (in thread): > Yes! Is there any information I need to include when registering?

Wancen Mu (11:40:02) (in thread): > Eric, did you see below?
> 
> - Email Erica Feick (efeick@ds.dfci.harvard.edu) no later than May 7 indicating your name, that you are a speaker, the number of your submission in OpenReview, and whether you would like to attend in-person or virtually. You will be notified on May 16th if you have an in-person ticket. Please note the registration fees for in-person attendance are 400 for faculty and staff and 250 for students and postdocs. If you need travel assistance, please go to this form to apply (https://forms.gle/YotNvDzx8qWjnQ3D7).
> - If you are attending virtually, please register now at the following link: https://bioc2022.eventbrite.com/

Michael Love (11:46:38) (in thread): > so I should pay for registration for everyone — but email Erica to mention that you are an in person speaker

Michael Love (11:47:15) (in thread): > I’ll try to take care of registration today

2022-05-13

Michael Love (07:51:00) (in thread): > @Eric Davis I’m going to start setting up travel for BioC, you’re still on for in person?

Eric Davis (09:18:28) (in thread): > Yes! Though I haven’t received confirmation that I have an in-person ticket yet (I think that happens may 16th?)

Michael Love (13:58:29) (in thread): > ok i’ll wait a few days

2022-05-16

Michael Love (13:26:01) (in thread): > let me know when you find out!

Michael Love (13:26:46) (in thread): > Eric, as a presenter, don’t you automatically have an in-person slot?

Michael Love (13:27:21) (in thread): > Can you send a note here: https://bioc2022.bioconductor.org/contact/

Wancen Mu (13:29:18) (in thread): > I received an email that there is an in-person ticket. I am confused because I said I’d attend virtually, but I did mention in the email that Eric will present in-person. Should he use this link?
> 
> Thank you for participating in BioC2022 in Seattle, Washington from July 27-29th. You have received an in-person ticket for the conference.
> 
> Please register for your in-person ticket at https://bioc2022.eventbrite.com no later than Saturday May 21st. After this date, tickets may be released to those on the waiting list.
> 
> If you requested financial assistance for travel, we will send that information separately.

Michael Love (13:30:11) (in thread): > I think Eric can just contact the organizers and mention 1) he is presenting a workshop 2) he would like to be in-person

Michael Love (13:30:56) (in thread): > and then I can deal with the registration this week

Wancen Mu (13:32:38) (in thread): > Gotcha, so can I ignore this email? Or do I have to emphasize “I will do virtually”?

Michael Love (13:36:29) (in thread): > I guess Eric can email to say “hi, do I currently have a spot, and also BTW Wancen will be virtual”

Eric Davis (14:37:10) (in thread): > I got an email confirmation that I do have an in-person spot! Sorry for the delay, I am in California right now and haven’t had consistent access to internet.

Michael Love (15:18:03) (in thread): > gotcha — i’ll take it from here

2022-05-19

Michael Love (11:25:48) (in thread): > @Eric Davis “There is a waitlist for in person, so I placed Eric Davis on the waitlist.” can you email organizers and point out that you are leading a workshop?

2022-06-28

Wancen Mu (17:44:59) (in thread): > @Eric Davis The organizer emailed me about making a small change to the schedule. Do you mind speaking at an earlier time on July 27th - they would move our 3:30pm (Pacific time) talk to 2:15pm (PT). Please let me know if that works for you!

Eric Davis (17:49:10) (in thread): > Sounds good to me, thanks Wancen!

2022-07-11

Michael Love (09:00:03): > @Eric Davis and @Wancen Mu — wanna find a time to meet this week? I can show you how to set up the software demo. It’s pretty simple and you can port over some content from our vignettes

2022-07-12

Wancen Mu (19:46:49) (in thread): > Sounds good

Michael Love (20:02:29) (in thread): > Thurs 11am (East)? or Friday morning?

Eric Davis (21:35:19) (in thread): > Either of those work for me!

2022-07-13

Doug Phanstiel (09:51:44) (in thread): > If you can push to 11:30 on Thursday I can attend

Michael Love (10:32:36) (in thread): > i can

2022-07-14

Michael Love (10:51:12) (in thread): > btw i don’t expect to take one hour — i will just walk through how to make a software demo package. it’s really not very complicated and you end up with a nice product

Michael Love (11:30:32) (in thread): > zoom updates…

2022-07-24

Michael Love (18:33:46) (in thread): > Feel free to ping me if you want proofing of the vignette, I’ll be available all day M and T

Eric Davis (18:44:53) (in thread): > Thanks Mike! I was considering swapping out the first figure in the matching section to more clearly explain the sets in the example data, but other than that I think it is ready to be looked over.

Wancen Mu (18:46:33) (in thread): > Thanks Mike. I am still adding things to the vignettes! It would be ready on Monday. And maybe we could consider practice once on Tuesday, Eric?

Michael Love (18:49:56) (in thread): > Ok I’ll read over Eric’s tomorrow morning and then Wancen’s EOD tomorrow? I’m happy to watch a practice anytime Tuesday

2022-07-25

Wancen Mu (16:21:44) (in thread): > @Eric Davis, do you want to do the first 5 mins introduction part, or should I do the introduction and hand over to you?

Eric Davis (16:49:56) (in thread): > You can do the introduction part if you’d like. I am sure you would do a much better job of describing bootstrapping than I:sweat_smile:

Wancen Mu (17:34:53) (in thread): > Haha, we can practice once tomorrow and see how it goes. When are you available tomorrow?

Eric Davis (17:35:53) (in thread): > I am free any time before 3pm tomorrow

Michael Love (17:36:59) (in thread): > I’m free 2-3

Wancen Mu (17:42:32) (in thread): > Great, we could do 2-3 then?

Michael Love (18:39:03) (in thread): > We can use my zoom room:https://unc.zoom.us/j/4133532783?pwd=VHl6dlNXMk5NYStCODN6S1IwaVliQT09

2022-07-26

Michael Love (20:53:13): > haven’t solved the bug
> 
> Exit status: 128
> Stderr:
> fatal: detected dubious ownership in repository at '/__w/nullranges/nullranges'
> To add an exception for this directory, call:
> 
>     git config --global --add safe.directory /__w/nullranges/nullranges
> 
> but at least I found the thread:

Michael Love (20:53:14): > https://github.com/actions/checkout/issues/760

Michael Love (20:58:24): > ok trying the solution from the thread

2022-07-27

Michael Love (08:59:43): > worked

Leonardo Collado Torres (17:48:17): > 
> > ## split sparse count matrix into NumericList
> > rna <- rna_Granges[-which(rna.sd==0)] %>%
> +   mutate(counts1 = NumericList(asplit(rna.scaled, 1))) %>% sort()
> Error in h(simpleError(msg, call)) : 
>   error in evaluating the argument 'x' in selecting a method for function 'sort': object 'rna_Granges' not found
> > traceback()
> 5: h(simpleError(msg, call))
> 4: .handleSimpleError(function (cond) 
>    .Internal(C_tryCatchHelper(addr, 1L, cond)), "object 'rna_Granges' not found", 
>    base::quote(mutate(., counts1 = NumericList(asplit(rna.scaled, 1)))))
> 3: mutate(., counts1 = NumericList(asplit(rna.scaled, 1)))
> 2: sort(.)
> 1: rna_Granges[-which(rna.sd == 0)] %>% mutate(counts1 = NumericList(asplit(rna.scaled, 1))) %>% sort()
> > options(width = 120)
> > sessioninfo::session_info()
> ─ Session info ───────────────────────────────────────
>  setting  value
>  version  R version 4.2.0 (2022-04-22)
>  os       Ubuntu 20.04.4 LTS
>  system   x86_64, linux-gnu
>  ui       RStudio
>  language (EN)
>  collate  en_US.UTF-8
>  ctype    en_US.UTF-8
>  tz       Etc/UTC
>  date     2022-07-27
>  rstudio  2022.02.3+492 Prairie Trillium (server)
>  pandoc   2.17.1.1 @ /usr/local/bin/pandoc
> ─ Packages ───────────────────────────────────────────
>  package                * version  date (UTC) lib source
>  AnnotationDbi            1.59.1   2022-05-19 [1] Bioconductor
>  AnnotationHub          * 3.5.0    2022-04-26 [1] Bioconductor
>  assertthat               0.2.1    2019-03-21 [1] RSPM (R 4.2.0)
>  Biobase                * 2.57.1   2022-05-19 [1] Bioconductor
>  BiocFileCache          * 2.5.0    2022-04-26 [1] Bioconductor
>  BiocGenerics           * 0.43.0   2022-04-26 [1] Bioconductor
>  BiocIO                   1.7.1    2022-05-06 [1] Bioconductor
>  BiocManager              1.30.18  2022-05-18 [1] CRAN (R 4.2.0)
>  BiocParallel             1.31.10  2022-07-07 [1] Bioconductor
>  BiocVersion              3.16.0   2022-04-26 [1] Bioconductor
>  Biostrings               2.65.1   2022-06-09 [1] Bioconductor
>  bit                      4.0.4    2020-08-04 [1] RSPM (R 4.2.0)
>  bit64                    4.0.5    2020-08-30 [1] RSPM (R 4.2.0)
>  bitops                   1.0-7    2021-04-24 [1] RSPM (R 4.2.0)
>  blob                     1.2.3    2022-04-10 [1] RSPM (R 4.2.0)
>  cachem                   1.0.6    2021-08-19 [1] RSPM (R 4.2.0)
>  cli                      3.3.0    2022-04-25 [1] RSPM (R 4.2.0)
>  codetools                0.2-18   2020-11-04 [2] CRAN (R 4.2.0)
>  colorspace               2.0-3    2022-02-21 [1] RSPM (R 4.2.0)
>  crayon                   1.5.1    2022-03-26 [1] RSPM (R 4.2.0)
>  curl                     4.3.2    2021-06-23 [1] RSPM (R 4.2.0)
>  data.table               1.14.2   2021-09-27 [1] RSPM (R 4.2.0)
>  DBI                      1.1.3    2022-06-18 [1] RSPM (R 4.2.0)
>  dbplyr                 * 2.2.1    2022-06-27 [1] RSPM (R 4.2.0)
>  DelayedArray             0.23.0   2022-04-26 [1] Bioconductor
>  digest                   0.6.29   2021-12-01 [1] RSPM (R 4.2.0)
>  dplyr                    1.0.9    2022-04-28 [1] RSPM (R 4.2.0)
>  ellipsis                 0.3.2    2021-04-29 [1] RSPM (R 4.2.0)
>  ExperimentHub          * 2.5.0    2022-04-26 [1] Bioconductor
>  fansi                    1.0.3    2022-03-24 [1] RSPM (R 4.2.0)
>  farver                   2.1.1    2022-07-06 [1] CRAN (R 4.2.0)
>  fastmap                  1.1.0    2021-01-25 [1] RSPM (R 4.2.0)
>  filelock                 1.0.2    2018-10-05 [1] RSPM (R 4.2.0)
>  generics                 0.1.3    2022-07-05 [1] CRAN (R 4.2.0)
>  GenomeInfoDb           * 1.33.3   2022-05-10 [1] Bioconductor
>  GenomeInfoDbData         1.2.8    2022-05-03 [1] Bioconductor
>  GenomicAlignments        1.33.1   2022-07-22 [1] Bioconductor
>  GenomicRanges          * 1.49.0   2022-04-26 [1] Bioconductor
>  ggplot2                * 3.3.6    2022-05-03 [1] RSPM (R 4.2.0)
>  ggridges               * 0.5.3    2021-01-08 [1] RSPM (R 4.2.0)
>  glue                     1.6.2    2022-02-24 [1] RSPM (R 4.2.0)
>  gtable                   0.3.0    2019-03-25 [1] RSPM (R 4.2.0)
>  htmltools                0.5.3    2022-07-18 [1] CRAN (R 4.2.0)
>  httpuv                   1.6.5    2022-01-05 [1] RSPM (R 4.2.0)
>  httr                     1.4.3    2022-05-04 [1] RSPM (R 4.2.0)
>  InteractionSet         * 1.25.0   2022-04-26 [1] Bioconductor
>  interactiveDisplayBase   1.35.0   2022-04-26 [1] Bioconductor
>  IRanges                * 2.31.0   2022-04-26 [1] Bioconductor
>  KEGGREST                 1.37.3   2022-07-08 [1] Bioconductor
>  KernSmooth               2.23-20  2021-05-03 [2] CRAN (R 4.2.0)
>  knitr                    1.39     2022-04-26 [1] RSPM (R 4.2.0)
>  ks                       1.13.5   2022-04-14 [1] RSPM (R 4.2.0)
>  labeling                 0.4.2    2020-10-20 [1] RSPM (R 4.2.0)
>  later                    1.3.0    2021-08-18 [1] RSPM (R 4.2.0)
>  lattice                  0.20-45  2021-09-22 [2] CRAN (R 4.2.0)
>  lifecycle                1.0.1    2021-09-24 [1] RSPM (R 4.2.0)
>  magrittr                 2.0.3    2022-03-30 [1] RSPM (R 4.2.0)
>  MASS                     7.3-58   2022-07-14 [1] CRAN (R 4.2.0)
>  Matrix                   1.4-1    2022-03-23 [2] CRAN (R 4.2.0)
>  MatrixGenerics         * 1.9.1    2022-06-24 [1] Bioconductor
>  matrixStats            * 0.62.0   2022-04-19 [1] RSPM (R 4.2.0)
>  mclust                   5.4.10   2022-05-20 [1] RSPM (R 4.2.0)
>  memoise                  2.0.1    2021-11-26 [1] RSPM (R 4.2.0)
>  mime                     0.12     2021-09-28 [1] RSPM (R 4.2.0)
>  munsell                  0.5.0    2018-06-12 [1] RSPM (R 4.2.0)
>  mvtnorm                  1.1-3    2021-10-08 [1] RSPM (R 4.2.0)
>  nullranges             * 1.3.0    2022-04-26 [1] Bioconductor
>  nullrangesData         * 1.3.0    2022-04-27 [1] Bioconductor
>  patchwork              * 1.1.1    2020-12-17 [1] RSPM (R 4.2.0)
>  pillar                   1.8.0    2022-07-18 [1] CRAN (R 4.2.0)
>  pkgconfig                2.0.3    2019-09-22 [1] RSPM (R 4.2.0)
>  plyr                     1.8.7    2022-03-24 [1] RSPM (R 4.2.0)
>  plyranges              * 1.17.0   2022-04-26 [1] Bioconductor
>  png                      0.1-7    2013-12-03 [1] RSPM (R 4.2.0)
>  pracma                   2.3.8    2022-03-04 [1] RSPM (R 4.2.0)
>  promises                 1.2.0.1  2021-02-11 [1] RSPM (R 4.2.0)
>  purrr                  * 0.3.4    2020-04-17 [1] RSPM (R 4.2.0)
>  R6                       2.5.1    2021-08-19 [1] RSPM (R 4.2.0)
>  rappdirs                 0.3.3    2021-01-31 [1] RSPM (R 4.2.0)
>  Rcpp                     1.0.9    2022-07-08 [1] CRAN (R 4.2.0)
>  RCurl                    1.98-1.7 2022-06-09 [1] RSPM (R 4.2.0)
>  restfulr                 0.0.15   2022-06-16 [1] RSPM (R 4.2.0)
>  rjson                    0.2.21   2022-01-09 [1] RSPM (R 4.2.0)
>  rlang                    1.0.4    2022-07-12 [1] CRAN (R 4.2.0)
>  Rsamtools                2.13.3   2022-05-25 [1] Bioconductor
>  RSQLite                  2.2.15   2022-07-17 [1] CRAN (R 4.2.0)
>  rstudioapi               0.13     2020-11-12 [1] RSPM (R 4.2.0)
>  rtracklayer              1.57.0   2022-04-26 [1] Bioconductor
>  S4Vectors              * 0.35.1   2022-06-08 [1] Bioconductor
>  scales                   1.2.0    2022-04-13 [1] RSPM (R 4.2.0)
>  sessioninfo              1.2.2    2021-12-06 [1] RSPM (R 4.2.0)
>  shiny                    1.7.2    2022-07-19 [1] CRAN (R 4.2.0)
>  speedglm                 0.3-4    2022-02-24 [1] RSPM (R 4.2.0)
>  SummarizedExperiment   * 1.27.1   2022-04-29 [1] Bioconductor
>  tibble                   3.1.8    2022-07-22 [1] CRAN (R 4.2.0)
>  tidyr                  * 1.2.0    2022-02-01 [1] RSPM (R 4.2.0)
>  tidyselect               1.1.2    2022-02-21 [1] RSPM (R 4.2.0)
>  utf8                     1.2.2    2021-07-24 [1] RSPM (R 4.2.0)
>  vctrs                    0.4.1    2022-04-13 [1] RSPM (R 4.2.0)
>  withr                    2.5.0    2022-03-03 [1] RSPM (R 4.2.0)
>  xfun                     0.31     2022-05-10 [1] RSPM (R 4.2.0)
>  XML                      3.99-0.10 2022-06-09 [1] RSPM (R 4.2.0)
>  xtable                   1.8-4    2019-04-21 [1] RSPM (R 4.2.0)
>  XVector                  0.37.0   2022-04-26 [1] Bioconductor
>  yaml                     2.3.5    2022-02-21 [1] RSPM (R 4.2.0)
>  zlibbioc                 1.43.0   2022-04-26 [1] Bioconductor
> 
>  [1] /usr/local/lib/R/site-library
>  [2] /usr/local/lib/R/library
> ──────────────────────────────────────────────────────

Leonardo Collado Torres (17:49:28): > Ran into this error @Eric Davis @Wancen Mu following the #bioc2022 vignette you made, just an FYI

Tim Triche (17:49:31): > I can reproduce this bug instantly

Michael Love (17:50:33): > i believe that code below is not for live coding

Michael Love (17:50:40): > i think it’s some eval=FALSE

Michael Love (17:50:50): > yeah she’s explaining now

Michael Love (17:51:12): - File (PNG): Screen Shot 2022-07-27 at 5.51.09 PM.png

Sean Davis (17:51:22): > @Sean Davis has joined the channel

Michael Love (17:51:40): > but feel free to ask her more about it … ok Tim is now:slightly_smiling_face:

Leonardo Collado Torres (17:51:49): > yeah, my bad Mike. I didn’t read the “pseudo code” part: > * Plyranges pseudo code saving the count matrix in GRanges’s metadata column as a NumericList() format and use Plyranges in downstream analysis.

Leonardo Collado Torres (17:52:31): > thanks for asking Tim!

Leonardo Collado Torres (17:53:17): > I was just reading the rendered HTML vignette, well, blindly running the code. I didn’t open the Rmd file :stuck_out_tongue: https://community-bioc.slack.com/archives/CC88GP2F4/p1658958640309079 - Attachment: Attachment > i think it’s some eval=FALSE

Michael Love (17:54:05): > BTW, this last bit of code is doing some interesting stuff — she is computing correlations across pseudo-bulk cell types for enhancer ATAC and promoter RNA, based on proximity of peak to gene

Leonardo Collado Torres (18:32:24): > :open_mouth:Thanks!

Wancen Mu (18:44:10): > Yeah, that part of the workflow has been used for a paper we’re writing now. Seems like it deserves to be uploaded to ExperimentHub and run in the vignette in the future! :smile:

Tim Triche (18:45:59): > Great demo today!

2022-08-07

Doug Phanstiel (08:01:30): > matchRanges preprint is uphttps://www.biorxiv.org/content/10.1101/2022.08.05.502985v1 - Attachment (bioRxiv): matchRanges: Generating null hypothesis genomic ranges via covariate-matched sampling > Deriving biological insights from genomic data commonly requires comparing attributes of selected genomic loci to a null set of loci. The selection of this null set is non trivial, as it requires careful consideration of potential covariates, a problem that is exacerbated by the non-uniform distribution of genomic features including genes, enhancers, and transcription factor binding sites. Propensity score-based covariate matching methods allow selection of null sets from a pool of possible items while controlling for multiple covariates; however, existing packages do not operate on genomic data classes and can be slow for large data sets making them difficult to integrate into genomic workflows. To address this, we developed matchRanges, a propensity score-based covariate matching method for the efficient and convenient generation of matched null ranges from a set of background ranges within the Bioconductor framework. ### Competing Interest Statement The authors have declared no competing interest.

Doug Phanstiel (08:04:33) (in thread): > Two suggestions (that I should have thought of prior to submission :rolling_on_the_floor_laughing:) > 1. We should acknowledge Erika for typesetting and figure editing > 2. Would this title be slightly more clear? “matchRanges: Generating null sets of genomic ranges via covariate-matched sampling”

2022-08-09

Michael Love (10:37:55): > @Eric DavisI was gonna change this code in CTCF vignette #1 but wanted to check in first

Michael Love (10:39:43): > You have this code which works fine, but it has to define the overlap operation 3x, and we also disconnect the statistic (occupied %) and the label between the top and bottom of the code (so in theory someone could mix up their g’s and their labels) - File (PNG): Screen Shot 2022-08-09 at 10.38.13 AM.png - File (PNG): Screen Shot 2022-08-09 at 10.38.26 AM.png

Michael Love (10:40:19): > I propose binding the ranges: - File (PNG): Screen Shot 2022-08-09 at 10.40.11 AM.png

Michael Love (10:40:59): > Followed by a group_by and summarize, piped to ggplot2 - File (PNG): Screen Shot 2022-08-09 at 10.40.30 AM.png - File (PNG): Screen Shot 2022-08-09 at 10.40.55 AM.png

Michael Love (10:41:34): > i’ll push this to github if that’s ok, and feel free to tweak it, e.g. make labels the way you like it

Eric Davis (10:48:24): > Looks great to me!

Michael Love (10:48:33): > ok pushed

2022-08-10

Michael Love (09:11:04) (in thread): > one thing I noticed when tweeting was, Fig 1A has a centering “+” in the middle (it’s light grey so hard to see)

Michael Love (09:11:16) (in thread): > I agree with suggestion (2)

Michael Love (09:12:05): > A tweetorial for matchRanges preprint:https://twitter.com/mikelove/status/1557334149459021826 - Attachment (twitter): Attachment > New preprint led by @ericscottdavis1 of @dphansti lab describing matchRanges, a tool for efficiently generating covariate-matched sets of genomic ranges from a pool of background ranges. > > :thread: below: > > https://www.biorxiv.org/content/10.1101/2022.08.05.502985v1 > > Funded by @cziscience #EOSS https://pbs.twimg.com/media/FZy-N_5XgAAspHQ.jpg

2022-08-11

Rene Welch (17:15:46): > @Rene Welch has joined the channel

2022-09-07

Michael Love (08:22:02): > Wancen on :hiking_boot: ranges: https://twitter.com/wancenm/status/1567486685562417153?s=46&t=upDsOKD9YHMz9nTJ-dp2Cg - Attachment (twitter): Attachment > Announcing a new preprint! We developed bootRanges to generate bootstrapped genomic ranges for hypothesis testing in enrichment analysis. Bootstrap preserves typical clumping of ranges and provides reliable null set. Biggest thank you to @mikelove :thread: below https://www.biorxiv.org/content/10.1101/2022.09.02.506382v1

Michael Love (09:49:27): > I’m working on fixing the class name to BootRanges, but im stuck bc DNAcopy doesn’t want to run on my Mac (rosetta install). It breaks as binary or when i install from source code… i’ll have to continue on a different machine

Tim Triche (13:58:39): > maybe you should re…. boot? /ducks

Michael Love (14:51:35): > i don’t want to bother anyone at Bioc about it, but the weird thing is that DNAcopy installs as a binary, but then when you go to load it errors out

Michael Love (14:51:52): > i blame Mac and its tendency to wipe out gfortran during self-update

2022-09-08

Tim Triche (13:07:57): > this just happened to Zach’s students. my suggested solution is to wipe the machine & put linux on it, but reinstalling Xcode is another possibility

Michael Love (13:31:56): > haha

Michael Love (13:32:00): > i’ll go with 2

Kasper D. Hansen (13:41:20): > If its gfortran, you should just install gfortran

Kasper D. Hansen (13:41:35): > But really, the lack of M1 binaries is not good right now

Michael Love (14:24:28): > there was a slack post from Herve saying it shouldn’t matter much re speed, but i’ve noticed the slow down from Rosetta is substantial

Michael Love (14:25:02): > https://community-bioc.slack.com/archives/C34NC134G/p1659017032375929?thread_ts=1628275010.047100&cid=C34NC134G - Attachment: Attachment > I’ve been building a workshop for BioC on my M1 mac using Rosetta and had a speed quoted for one step at 40 seconds. I was testing on the conference machines and the same version of code runs in ~2 seconds on Linux. I just reproduced this on my cluster as well.. i’m pretty surprised…

Kasper D. Hansen (14:35:05): > In contrast to this, Brian Ripley reported that all standard tests in R executed under Rosetta are faster than on Intel.

Kasper D. Hansen (14:35:33): > Would be nice to know what the code is doing

Michael Love (14:46:08): > my code is primarily running matrixStats rowRanks

Michael Love (14:47:44): > but i guess the point is that, non-negligible differences do exist

Michael Love (14:48:49): > i haven’t gone and tried to compile this code under an arm version of R on my M1

Kasper D. Hansen (15:02:53): > I would expect something like that to run well under Rosetta

Kasper D. Hansen (15:02:59): > Not that Im an expert

Michael Love (15:03:53): > I also don’t spend a lot of time making sure i’ve installed R on mac optimally. I just install until it runs and often forget what steps i took

Kasper D. Hansen (15:05:36): > Well, you clearly haven’t optimized but that’s not the point. Most reports suggest it should be well performant out of the box

Kasper D. Hansen (15:06:25): > I am wondering if the rowRanks code uses some specific CPU optimization, but I wouldn’t expect Henrik to do that kind of stuff intentionally

Kasper D. Hansen (15:16:53): > I don’t see any clear potential issues with rowRanks when I glance at the code

2022-09-22

Michael Love (07:44:25): > FYI i’ve added the two ms to the vignettes, README, etc.

2022-10-05

Mikhail Dozmorov (21:45:10): > Nullranges BioC video is uphttps://youtu.be/VGSDzUrvE38 - Attachment (YouTube): Nullranges: Modular Workflow For Overlap Enrichment

2022-10-15

Mikhail Dozmorov (15:29:54): > Does anybody know of a data R package published in Bioinformatics? A reviewer of CTCF questions whether a data package should be published in a methodological journal. I cannot find a good example - any suggestions?

2022-10-17

Michael Love (08:19:56): > I mean, NAR commonly publishes resources

2022-10-18

Mikhail Dozmorov (10:10:02) (in thread): > Thanks, Mike, that’s the best answer we can give. I’ll be in touch soon, large grant submission, it takes its time.

2022-11-22

Michael Love (08:28:09): > hi nullrangers, I’m going to run a short online tutorial session with Stefano Mangiola on tidy-in-Bioc:https://community-bioc.slack.com/archives/CEQD45CHK/p1669123640377819 - Attachment: Attachment > @stefano mangiola and I have been discussing doing an tidy-in-Bioc online tutorial in mid-December. Details cross-posted to birdsite and oldelephantsite > • https://twitter.com/mikelove/status/1595038783371706369 > • https://genomic.social/@mikelove/109387583769396088

2022-11-28

Michael Love (08:54:50): > @Mikhail Dozmorov I tried to address some of your comments in this commit. I tried to remove references to “spatial” as this is likely confusing given Bioc does a lot of work with spatial imaging and txomics: https://github.com/nullranges/nullranges/commit/f99357dc9003da3ff6a3605592a2e78f828d1891

Mikhail Dozmorov (10:41:47) (in thread): > :+1:I think it looks great. Everything is clearly described, hard to think of any additions. The new references, including fluent genomics, are also perfect.

Michael Love (14:01:17) (in thread): > thanks Mikhail

2022-12-02

Michael Love (12:56:47): > Edit: solved~has anyone seen this before on arm:~ > > > library(nullranges) > Error: package or namespace load failed for 'nullranges' in dyn.load(file, DLLpath = DLLpath, ...): > unable to load shared object '/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/InteractionSet/libs/InteractionSet.so': > dlopen(/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/InteractionSet/libs/InteractionSet.so, 6): Symbol not found: __ZNKSt3__115basic_stringbufIcNS_11char_traitsIcEENS_9allocatorIcEEE3strEv > Referenced from: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/InteractionSet/libs/InteractionSet.so (which was built for Mac OS X 13.0) > Expected in: /usr/lib/libc++.1.dylib >

Michael Love (13:39:49): > @Eric Davis I don’t think you can move InteractionSet to Suggests bc it’s in a class definition

Michael Love (13:43:09): > oh, somewhere i’m seeing it’s bc my mac is out of date…

Michael Love (13:43:22): > ooook i’m gonna try to update mac later and see if that helps

Wancen Mu (13:44:23): > yeah, I feel I encountered it too when trying to load data for the Lima project

Wancen Mu (13:46:11): > I fixed it by using R version>4.2 in linux. But not a problem on my mac

Michael Love (14:00:51): > lemme just try and fix this later today i’ll report back

Michael Love (14:01:04): > thanks all for the quick meeting:smile:I think we can definitely send these back before end of year

Michael Love (19:54:34): > the installation thing above was solved by my upgrading to Mac OS X 13.0

Michael Love (20:07:50): > I’ve taped this and uploaded a video to youtube. I’m gonna do the same for linux just to rub it in, binaries are the way to go > > complete de novo installation in 1 minute

Michael Love (20:08:27): > @Mikhail Dozmorov this also allowed for DNAcopy from binary with no issues (and i have homebrew on this machine)

Mikhail Dozmorov (20:24:19) (in thread): > Great, it may be easier on Apple silicon. I have the latest macOS version on Intel. Tried removing homebrew-installed gcc with gfortran and reinstalling, still “unable to load shared object…” Continue looking.

2022-12-03

Michael Love (12:03:50) (in thread): > yeah ARM works now

Kasper D. Hansen (15:41:19): > I had a similar error on Rgraphviz which was caused by a missing -lm. However the error only appeared with -O3, which was super weird and next to impossible to track down. Perhaps you need something similar for this symbol.

2022-12-04

Michael Love (15:27:17): > my Mac setup is fine now, i’m using ARM on OS X 13, so far so good

Mikhail Dozmorov (21:28:52): > With GitHub Actions and Bioconductor checks passing, it is safe to say the package won’t be a problem to install. I still cannot fix the gfortran problem, but it looks like a special case.

2022-12-06

Wancen Mu (10:56:36): > @Eric Davis Have you noticed that setting a different seed in matchRanges can result in very different statistics of interest in the matched set, even when both the covariate and propensity score plots look good? For example, in the lima project: when I set seed = 2022, I got mean(hic) in focal set = 1771 and mean(hic) in matched set = 1707. But when I set seed = 2253, I got mean(hic) in matched set = 1840. My statistic of interest is mean(hic) in focal set - mean(hic) in matched set, so one seed gives a positive value 64 while the other gives a negative value -68.

Wancen Mu (10:58:34): > 
> > summary(Perm.test.stat)
>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
>  -68.98   23.39   40.75   41.03   60.04  111.92 
> 
> Here are the statistics of interest from running matchRanges 1000 times with different seeds. Do you think we should use a permutation test in downstream analysis, like (# times the statistic of interest > 0) / 1000?
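A language-agnostic sketch of the empirical test proposed here (Python, with made-up numbers standing in for the hic signal; the values, set sizes, and the uniform re-draw of matched sets are illustration-only assumptions, not the matchRanges algorithm): repeat the matching, recompute the statistic of interest, and report the fraction of repeats on either side of zero.

```python
import random

random.seed(1)

# Hypothetical 'hic' values for a focal set (n=912) and a larger pool (n=5538)
focal = [random.gauss(1771, 400) for _ in range(912)]
pool = [random.gauss(1750, 400) for _ in range(5538)]

rng = random.Random(2022)

def matched_stat():
    """Stand-in for one matchRanges draw: sample a matched set of the
    focal set's size from the pool, return mean(focal) - mean(matched)."""
    matched = rng.sample(pool, len(focal))
    return sum(focal) / len(focal) - sum(matched) / len(matched)

stats = [matched_stat() for _ in range(1000)]

# Empirical two-sided p-value: how often does the statistic cross zero?
prop_pos = sum(s > 0 for s in stats) / len(stats)
p_two_sided = 2 * min(prop_pos, 1 - prop_pos)
print(f"P(stat > 0) = {prop_pos:.3f}, two-sided p = {p_two_sided:.3f}")
```

As the following messages point out, this is only valid if repeated matching draws vary enough to stand in for sampling variability.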

Wancen Mu (11:00:19): > Here is the matching quality. It’s the same when running with different seeds.
> 
> MatchedGInteractions object:
>        set    N distance.mean distance.sd ps.mean  ps.sd
>      focal  912         93000       71000    0.14 0.0099
>    matched  912         93000       71000    0.14 0.0099
>       pool 5538         99000       73000    0.14 0.0100
>  unmatched 4626        100000       73000    0.14 0.0100
> --------
> focal - matched:
>  distance.mean distance.sd ps.mean    ps.sd
>            -11         -22 1.4e-06 -2.7e-06

Wancen Mu (11:02:54): > I am wondering if the reason is that the focal set is heterogeneous.

Eric Davis (11:16:46): > Yeah it looks like in this case there are multiple subsets of the pool that can form a matched set. A permutation test seems like a reasonable way to address this to me

Wancen Mu (11:21:35): > Cool. Then I believe there is no need to perform a chi-square test for categorical covariates or a t-test for continuous covariates between the focal set and each matched set anymore.

Michael Love (11:26:32): > not disagreeing but just probing: how do we know if there is enough variance in the multiple-matching procedure that this is a safe way to perform inference

Michael Love (11:27:26): > e.g. in some cases matching will give the same set each time, if for example there is a single nearest in pool per element of focal

Michael Love (11:28:04): > then relying on variance in multiple draws would give you p-value of 0 regardless of the focal vs matched comparison (even under the null) bc of underestimation of variance

Wancen Mu (11:29:52): > That’s true… Can we look at the variance of the statistic of interest?

Wancen Mu (11:31:15): > I feel in this case the reason there are multiple subsets of the pool that can be matched is that the focal set is too heterogeneous and we only match one covariate, distance.

Michael Love (11:32:18): > what criteria would we use to determine if there is enough variance to trust inference? e.g. if population is infinite I totally agree that multiple draws would work to perform inference

Michael Love (11:33:33): > but with finite population and greedy matching, e.g.: > focal element has x=1.5, and in pool we have elements with x=1.59, 1.6, 1.6, 1.6, 1.61, …. with some algorithms we will always pick the 1.59 to be in matched

Wancen Mu (11:40:40): > Could we compare the variance of the covariates to be controlled? Like in this example, sd(distance) in the focal set = 71326.86 and sd(distance) in the pool set = 72465.25. These two numbers are quite close, so when we derive the matched set based only on distance, it is quite possible to get different sets. But if the sd in the focal set were smaller than the sd in the pool set by an order of magnitude, then it’s likely there would be only a single nearest pool element per element of focal.

Michael Love (11:43:31): > i wonder if there is some literature from causal inference on this topic? e.g. when is matching sufficiently like sampling

2022-12-07

Wancen Mu (13:04:03): > Maybe we could incorporate this in the package? Derive the SE after matching. This is the standard way after using MatchIt: https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html#estimating-treatment-effects-and-standard-errors-after-matching
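The bootstrap idea from that MatchIt vignette, sketched generically (Python, hypothetical paired outcomes; whether focal/matched units should be resampled as pairs or independently is exactly the open question in this thread): resample the matched data with replacement, recompute the mean difference each time, and take the SD of those recomputed estimates as the SE.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical outcomes for a focal set and its matched set (n=500 each)
focal = [rng.gauss(10.0, 2.0) for _ in range(500)]
matched = [rng.gauss(9.5, 2.0) for _ in range(500)]
pairs = list(zip(focal, matched))

def mean_diff(pairs):
    return statistics.mean(f - m for f, m in pairs)

# Cluster bootstrap over focal/matched pairs (keeps the pairing intact)
boot = [mean_diff(rng.choices(pairs, k=len(pairs))) for _ in range(1000)]
se = statistics.stdev(boot)
print(f"estimate = {mean_diff(pairs):.3f}, bootstrap SE = {se:.3f}")
```

As Mike notes next, this only gives honest SEs if the bootstrap accounts for overlap/correlation between successive draws of the matched set from the pool.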

Michael Love (13:47:47): > > for the bootstrap approach, see the section “Using Bootstrapping to Estimate Confidence Intervals” below. > this is great, if they account for the number of overlap btwn bootstraps (that is, successive draws of matched from pool)

Michael Love (13:48:22): > if overlap = 0 fine, but i don’t know what to do when overlap is between 0 and 1

Michael Love (13:57:16): > Exactly! this is just what i was worried about - File (PNG): Screenshot 2022-12-07 at 1.56.51 PM.png

Michael Love (13:58:55): > The correlations are something i was pondering, e.g. say we have an element and we then split it into 3 ranges, and those go into the pool. But they may be highly correlated in their covariates AND in their outcome value, so they are pseudo-duplicates when we draw them into different matched sets

Wancen Mu (14:05:58): > Yeah, totally agree.. But they found in simulations that the bootstrap SEs are adequate? The boxplot you asked about seems to show no difference in range between the matched set and the focal set, even though the mean values are different. - File (PNG): 截屏2022-12-07 下午1.59.59.png

Michael Love (21:05:50): > i’d be worried about calling these sig right?

Wancen Mu (23:34:35) (in thread): > Yeah…Sad to admit we couldn’t find a good correlation cutoff that shows significance.

2022-12-08

Michael Love (08:33:08) (in thread): > ok, that’s fine if this is the result, let’s chat next week about next steps?

Michael Love (08:34:01) (in thread): > do you and @Eric Davis have a time when we could both meet? Wancen will you be EST or PST next week?

Wancen Mu (14:19:19) (in thread): > I will still be at Chapel Hill next week, so feel free to schedule a time slot that works best for you.

Michael Love (15:22:48) (in thread): > Monday morning?

Wancen Mu (15:23:47) (in thread): > Yep, that works for me.

2022-12-09

Eric Davis (13:11:10): > Many of the comments from reviewers want more comparison to MatchIt (or alternative matching packages). We are responding to reviewers reinforcing that we are not claiming “better” matching than existing packages, rather comparable matching that is compatible with existing Bioconductor packages. To show that the matching is comparable we are plotting covariate distributions, and I wanted to incorporate balance plots (like the love plot) from cobalt. > > Stuart added a section to the vignette a long time ago showing how to use the matchedData() accessor to compute balance statistics with bal.tab() and visualize them with love.plot(). I must have missed this when I was updating the vignettes - sorry Stuart! I think we should add this section back in and extend it to show comparable matching. What do others think? > > Here is a link to the code Stuart added previously: https://github.com/nullranges/nullranges/blob/86d4ae317f12239030efac2b181fc1960c1e38d7/vignettes/match_ranges.Rmd#L94-L109

Eric Davis (13:12:11): > Here is a toy example comparing matching between MatchIt (first plot) and matchRanges (second plot): - File (PNG): image.png - File (PNG): image.png

Eric Davis (13:13:48): > It would be nice if we could get the plot on the right to look more like the one on the left (using blue points instead of two separate comparisons), but I am not exactly sure how to achieve this without recreating the plot…

Michael Love (16:39:52): > so love.plot makes the right one? (awesome name!)

Michael Love (16:40:09): > > popularized by Dr. Thomas E. Love

Michael Love (16:41:27): > for the paper you can try to make it look like the one from MatchIt, but for the vignette you can keep it as is IMO

Michael Love (16:41:56): > e.g. for the paper, take the output of bal.tab and make a ggplot that is just like the one you get from MatchIt so they are visually comparable

Eric Davis (17:17:02) (in thread): > Yes, love.plot() makes both of these

2022-12-13

Michael Love (08:45:48): > Here’s what i did to the bootstrapping unit tests: mostly just adding comments and being creative with what the expect_* calls are looking for in the output: https://github.com/nullranges/nullranges/blob/master/tests/testthat/test_boot.R

Wancen Mu (10:37:09) (in thread): > Beautiful!

Eric Davis (18:02:25) (in thread): > I will try to replicate this with the matching function:+1:

Eric Davis (18:03:16): > Any suggestions for this figure - “Assessing covariate balance with matchRanges and cobalt”? - File (PDF): supplementaryFigureX.pdf

Eric Davis (18:04:29) (in thread): > Purpose is to show that matchRanges achieves comparable matching to MatchIt and this can be assessed by looking at the distributions of the covariates and with “love plots” comparing mean differences between covariates after matching.

Doug Phanstiel (21:13:00) (in thread): > So are you matching two covariates simultaneously? f2 is continuous and f3 is categorical?

Doug Phanstiel (21:14:26) (in thread): > i am not a huge fan of panel C but I guess that is a cobalt output?

Doug Phanstiel (21:15:59) (in thread): > For panel A, it is a little hard to decode what each line is. Can do it but takes some time. I would not use a legend, and would just label the lines directly with the same color as the line

Doug Phanstiel (21:16:14) (in thread): > But since this is a sup fig that is prob not necessary

Doug Phanstiel (21:18:04) (in thread): > I forgot to mention that overall this looks awesome and should def address the reviewer’s request

Eric Davis (21:59:00) (in thread): > Normally I’d agree about the direct labelling, but it might be hard to see with the lines so close together: - File (PNG): image.png

Eric Davis (22:16:01) (in thread): > Alternatively I could move the legend and color the text like this: - File (PNG): image.png

Eric Davis (22:20:38) (in thread): > Or move the labels a little bit closer to their target lines: - File (PNG): image.png

2022-12-14

Michael Love (07:48:44) (in thread): > this is great and addresses the reviewer critique

2022-12-15

Assa (08:07:56): > @Assa has joined the channel

2022-12-19

Doug Phanstiel (19:15:59) (in thread): > I like the last option. That is what i was suggesting

2022-12-21

Eric Davis (23:20:47): > In response to this reviewer’s comment: > > The input of matchRanges requires a set of ranges as the “pool”, but the generation of such a pool is not an easy task. For example, most people may start with only the focal set. In this case, could the authors provide tutorials or guidance as to how to select covariates and how to generate the pool? Moreover, in some situations, the focal set may consist of ranges with variable lengths, but all examples included on the website contain only fixed-length ranges. Finding an adequate pool for such variable-length sets can be even more difficult for many users. > I’ve created the following vignette. It goes into a little more detail about how a pool could be created by using data from AnnotationHub (per Mikhail’s suggestion) to match ranges of variable length - which is a comment made by this and several other reviewers. I’ve also tried to incorporate as many “tidy” genomics practices as possible (as per Mike’s recent talk). Please let me know if you have any suggestions, thanks! - File (PDF): matching_poolSet.pdf

2022-12-22

Michael Love (07:59:27) (in thread): > i’ll proof this in the coming days

Michael Love (08:00:45) (in thread): > @Wancen Mu i’m also still looking over your supplement figs and text. > > there’s no rush to resubmit next week, no one will look at the revision until after New Year’s Eve anyway

Mikhail Dozmorov (08:41:43) (in thread): > Looks great! Currently, the vignette starts with why a pool is important, followed by how to make it. Suggesting adding “Prerequisites/Definitions” subheader. To define a pool of ranges annotated by various properties (covariates, must be at least one, give examples like signal, length, distance to smth, can be derived from own or additional (public) data). Can make a PR. And, adding a missing word in “how to use cobalt” on page 5.

Eric Davis (10:34:37) (in thread): > Thanks for catching that Mikhail! If you want to modify/add content here is the open PR:https://github.com/nullranges/nullranges/pull/22 - Attachment: #22 Vignette for creating a pool for matchRanges > More detailed vignette using data from AnnotationHub to show an example of creating a pool set and performing matching & covariate balance.

Eric Davis (10:34:46) (in thread): > If not I can try to incorporate your suggestions

Mikhail Dozmorov (10:36:15) (in thread): > Yes, I’ll give it a shot, today.

Wancen Mu (11:49:59) (in thread): > Yeah sure. Totally understand that. @Michael Love BTW, do you think it is better to add an overview of the bootRanges vignette which includes the new simulation figure showing that the bootstrap statistics are similar to the original statistics? I feel the current vignettes are a little messy, with all the schematics explained in the segmented vignette. And we haven’t included a metadata example in the vignette yet. I kind of want to merge the current two vignettes into one, except the overview one. What do you think?

Michael Love (13:52:51) (in thread): > I don’t think the vignettes need to replicate paper results, e.g. that shuffling is not a good null model. People will be able to read the paper for that demonstration. From my work with DESeq2, I’ve learned that users will just blindly copy the code from the vignette when they do their own analysis, so it’s actually a better idea to leave out anything that isn’t relevant for regular users, such as repeating the shuffling vs bootstrapping analysis. > > I agree with you that we could use a metadata example, and I also agree that it might be easier to fold the unsegmented into the end of the segmented vignette, and just have a single vignette, bootstrap.Rmd

Michael Love (13:53:25) (in thread): > Feel free to make that change @Wancen Mu, and to edit as you like to streamline things. Nothing is precious, you can make it how you like

Wancen Mu (16:47:34) (in thread): > Sounds good. I will work on merging them into one vignette and adding a metadata example in the coming days!

2022-12-28

Michael Love (05:31:38) (in thread): > pkg was failing to build bc the new vignette was missing from the pkgdown vignette index; you can more or less quickly check things like this locally with pkgdown::build_site(): https://github.com/nullranges/nullranges/commit/536ef980dcb2596d8164a24c9eee195a1cc18aaa

2023-01-13

Michael Love (15:39:50): > I’m going to do some proofing of the new bootRanges vignette, while i’m at it, the ordering got a little funky: > > bootRanges.Rmd: %\VignetteIndexEntry{4. Introduction to bootRanges} > matching_ginteractions.Rmd: %\VignetteIndexEntry{3. Case study II: CTCF orientation} > matching_granges.Rmd: %\VignetteIndexEntry{2. Case study I: CTCF occupancy} > matching_pool_set.Rmd: %\VignetteIndexEntry{1. Creating a pool set for matchRanges} > matching_ranges.Rmd: %\VignetteIndexEntry{1. Overview of matchRanges} > nullranges.Rmd: %\VignetteIndexEntry{0. Introduction to nullranges} > > @Eric Davis what order do you want the pool set one to be? And I may put bootRanges to come after Introduction just for ease of viewing

Eric Davis (15:49:19): > For matchRanges I was thinking this order? > 1. Overview of matchRanges > 2. Creating a pool set for matchRanges > 3. Case study I: CTCF occupancy > 4. Case study II: CTCF orientation > What do you think?

Michael Love (15:49:40): > :thumbsup:

Michael Love (15:50:08): > i dont know how it happened that pool was also 1 but apparently it wipes out display of all vignettes on Bioc, I’ll fix now

Michael Love (15:54:30): > I may also do this if you don’t mind: git mv matching_ranges.Rmd matchRanges.Rmd

Michael Love (15:55:30): > > nullranges.Rmd: %\VignetteIndexEntry{0. Introduction to nullranges} > bootRanges.Rmd: %\VignetteIndexEntry{1. Introduction to bootRanges} > matchRanges.Rmd: %\VignetteIndexEntry{2. Introduction to matchRanges} > matching_granges.Rmd: %\VignetteIndexEntry{3. Matching case study I: CTCF occupancy} > matching_ginteractions.Rmd: %\VignetteIndexEntry{4. Matching case study II: CTCF orientation} > matching_pool_set.Rmd: %\VignetteIndexEntry{5. Creating a pool set for matchRanges} >

2023-01-21

Hien (16:03:37): > @Hien has joined the channel

2023-02-06

Michael Love (12:11:45): > patchwork appears to be breaking all of the vignettes that use it, in devel branch > > build report - https://master.bioconductor.org/checkResults/3.17/bioc-LATEST/nullranges/ > > Error: processing vignette 'bootRanges.Rmd' failed with diagnostics: > Cannot create zero-length unit vector ("unit" subsetting) > > I googled this and found ggplot2 threads about combining plots / facetting > > SUMMARY: processing the following files failed: > 'bootRanges.Rmd' 'matchRanges.Rmd' 'matching_ginteractions.Rmd' > 'matching_granges.Rmd' > > ~~~~~ > > ➜ vignettes git:(master) grep patchwork *.Rmd > bootRanges.Rmd:We load the **nullranges** and **plyranges** packages, and **patchwork** in > bootRanges.Rmd:library(patchwork) > matchRanges.Rmd:Since these functions return ggplots, `patchwork` can be used to visualize all covariates like this: > matchRanges.Rmd:library(patchwork) > matching_ginteractions.Rmd:covariates along with `patchwork` and `plotCovarite` to visualize all > matching_ginteractions.Rmd:library(patchwork) > matching_granges.Rmd:covariates along with `patchwork` and `plotCovarite` to visualize all > matching_granges.Rmd:library(patchwork) > matching_pool_set.Rmd:Now let's use the `plotCovariate()` function with `patchwork` to > matching_pool_set.Rmd:library(patchwork) >

Eric Davis (12:25:24): > The matching vignettes give a different error: > > --- re-building 'matchRanges.Rmd' using rmarkdown > Quitting from lines 42-218 (matchRanges.Rmd) > Error: processing vignette 'matchRanges.Rmd' failed with diagnostics: > cannot mtfrm > --- failed re-building 'matchRanges.Rmd' >

Eric Davis (12:26:02): > Those lines (42-218) are for drawing the example figure - and everything runs fine locally…

Eric Davis (12:27:48) (in thread): > I guess I should use the correct settings/versions to reproduce the error

Michael Love (12:31:14): > oh i see

Michael Love (12:32:04): > i’m gonna try to push some new code to nullranges now so i can play around locally

Eric Davis (14:53:11): > I ran nullranges in the Bioc docker container and was able to replicate this error: > > Error in mtfrm.default(list(path = NULL, name = "page", n = 1L)) : > cannot mtfrm > > It happens when I try to run a plotgardener plotting function, but I also get this warning when I load nullranges: > > > devtools::load_all(".") > ℹ Loading nullranges > Warning message: > R graphics engine version 16 is not supported by this version of RStudio. The Plots tab will be disabled until a newer version of RStudio is installed. > > So it doesn’t look like a plotgardener or nullranges issue to me…

Eric Davis (15:26:51) (in thread): > Been chatting with Nicole and think we figured out that the default match method (viamtfrm) changed and we need to explicitly convert a symbol to a character to avoid this error.
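The fix Eric describes can be sketched in R (a hedged toy illustration, not the actual nullranges/plotgardener code; the symbol name `page` is taken from the error message above):

```r
## Sketch of the "cannot mtfrm" failure mode: recent versions of match()
## transform non-atomic arguments via mtfrm(), and objects with no mtfrm
## method (e.g. a symbol stored in a list) raise "cannot mtfrm".
## Explicitly converting the symbol to a character avoids the dispatch.
sym <- as.symbol("page")

## match(list(sym), list("page", "plot"))  # would error: cannot mtfrm

match(as.character(sym), c("page", "plot"))  # coerce first; returns 1
```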

Michael Love (16:15:35) (in thread): > thanks for quickly identifying this

Michael Love (16:15:52): > gotcha so i was off, it just happened that we generated two unrelated errors at once

Michael Love (16:17:08): > i’ll see why Bioc is throwing an error on bootRanges but not our GitHub version

2023-02-08

Michael Love (16:02:08) (in thread): > I know ya’ll are busy :smile: so i’ve set eval=FALSE on the plotgardener chunks in devel for now

Michael Love (16:02:13) (in thread): > we can get back to this in 2 weeks

Michael Love (16:02:45) (in thread): > i realized the chunk failing in bootRanges.Rmd was also the plotgardener plot

2023-03-07

Michael Love (15:16:05): > just wanted to loop back here ahead of April release, as far as i remember: > * this following PR depends on mariner, if i remember correctly? or it was breaking build for some reason, I can’t remember. let’s address this before pushing to Bioc https://github.com/nullranges/nullranges/pull/24 > * also the default match method issue in plotgardener –> these vignette chunks are still eval=FALSE - Attachment: #24 matchitToMatched function > Function to convert matchit objects to Matched objects. Combines the power of MatchIt with the genomics workflow benefits of matchRanges().

2023-03-08

Unyime William (01:16:20): > @Unyime William has joined the channel

2023-03-14

Michael Love (08:43:59): > nullranges has switched todevelas default branch

Michael Love (08:44:39) (in thread): > @Eric Davis wanted to ask about the PR, whether it could be merged ahead of April release

Eric Davis (09:12:18) (in thread): > Hey Mike! I think it just depends on a single function from mariner. I’d like to have mariner as part of the spring release as well, but maybe we should just replicate that function to avoid an extra dependency anyway. What do you think?

Michael Love (09:40:08) (in thread): > i lean in favor of replicating functions to reduce dependency burden, which function btw?

Michael Love (09:41:16) (in thread): > also if you have any update on the plotgardener issue, bc we’ve got a bunch of chunks that are now un-evaluated so we pass check

Eric Davis (20:21:34) (in thread): > The function should be easy to duplicate - it’s the function for converting data frames to GInteractions objects. > > I think Nicole fixed the issue but we are away at a conference now and can check next week when we are back

2023-03-15

Michael Love (09:29:32) (in thread): > ok sounds good! thanks for quick reply:slightly_smiling_face:

2023-03-21

Eric Davis (13:36:13) (in thread): > Checked with Nicole and the plotgardener issue has been resolved! So those chunks can be re-incorporated:+1:

2023-03-22

Michael Love (08:18:04) (in thread): > awesome, turning plotgardener back on

2023-04-12

Michael Love (12:14:24): > nullranges depends on speedglm which has been removed from CRAN. I’m emailing the developer to figure out why and the timeline for getting it back online

Michael Love (13:29:42): > > I’m trying to fix the problem. I hope the speedglm package will be on CRAN by the end of this week.

2023-04-13

Michael Love (16:12:59): > https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btad190/7115835

Michael Love (16:13:10): > :its-happening-gif:

Wancen Mu (16:14:10): > Yea hey :tada:

Michael Love (16:16:40): > Wancen do you wanna announce on Monday?

Wancen Mu (16:19:47) (in thread): > Certainly! I can either create the announcement myself or retweet your announcement to reach a wider audience. I’m open to both options, whichever you think will be more effective.

Michael Love (16:21:17) (in thread): > you can do it and i’ll RT or QT

Michael Love (16:21:26) (in thread): > so it points to your account

Wancen Mu (16:22:00) (in thread): > Sure! Will draft it over the weekend~

2023-04-16

Wancen Mu (18:34:23) (in thread): > Is it better to wait until matchRanges has an official link and the two labs announce together?

2023-04-17

Michael Love (08:13:12) (in thread): > sure thats a good idea

2023-04-23

Michael Love (07:40:12): > speedglm author told me that he would have it back on CRAN soon but i don’t think it’s gonna happen in time.

Michael Love (07:41:16): > the other option would be to remove the dependency and replicate its functionality

Michael Love (07:48:04): > ok i just made this change (1.5.19)

Michael Love (09:41:41): > Eric, I had to comment out this example bc it was throwing error, don’t know why:https://github.com/nullranges/nullranges/commit/43e9e1f3165351e2e9fa7231b0209e1b3b8c8b6b#diff-1aaf0db35945b06b026ccb0[…]04f2bc6628d2d32850aee21036e2

Michael Love (09:42:13): > Also Wancen note that if you use a package in examples or vignette, it needs to be a Suggests (tidySummarizedExperiment)
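The Suggests convention above can be sketched as a DESCRIPTION fragment (illustrative only; aside from tidySummarizedExperiment, which is named in the message, the listed packages are hypothetical examples):

```
# DESCRIPTION (sketch): packages used only in examples, tests, or
# vignettes belong in Suggests rather than Imports/Depends, so regular
# users are not forced to install them.
Suggests:
    tidySummarizedExperiment,
    patchwork,
    knitr,
    rmarkdown
```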

Eric Davis (10:04:40) (in thread): > That’s weird, I can run that example without error in the bioconductor_docker:devel container

Eric Davis (10:07:06) (in thread): > Oh I get it if I run the entire example

Eric Davis (10:07:59) (in thread): > but if you add set.seed(123) before it, it works fine. So it must be that the randomly chosen dataset doesn’t work for the rejection sampling method.

Eric Davis (10:08:33) (in thread): > > # throwing error (April 2023) > set.seed(123) > makeExampleMatchedDataSet(type = 'GInteractions', matched = TRUE, > method = 'rejection', > replace = FALSE) >

Michael Love (19:40:21) (in thread): > weird, we can add it back i guess after release this week

Michael Love (19:40:35) (in thread): > i’m skittish about changing things so close to the release

Michael Love (19:40:47) (in thread): > i was hoping speedglm would come through in time but it didn’t

2023-04-24

Michael Love (09:27:23): > Ok, everything appears to be fine, once the new Bioc check results are posted we should be fine for release

Wancen Mu (09:29:49) (in thread): > Thanks for taking care of this!

Michael Love (12:17:27): > had to make another change — I set eval=FALSE on the new bootRanges chunks about the single cell multi-omics, these were taking over 10 minutes and so giving an ERROR on the Bioc servers for the whole package. after the release, we can go back and include them again after cutting these down to reasonable build times (each vignette should really be <30 seconds to build). you can for example just work with a subset of the genome or restrict in some other way

Wancen Mu (14:07:24) (in thread): > Oh, I remember it didn’t take that much time on my laptop. Definitely, we could use a small subset of the genome in the bootRanges data to run it faster, or in a new vignette.

Michael Love (14:15:56) (in thread): > yeah let’s take a look after release

Michael Love (14:15:57) (in thread): > :ok_hand:
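The subsetting idea discussed above can be sketched in R (toy data, not the vignette's actual dataset): restricting a GRanges to one chromosome is a simple way to keep vignette chunks fast to build.

```r
## Hedged sketch: subset a GRanges to a single chromosome to speed up
## vignette chunks. keepSeqlevels() is from GenomeInfoDb (attached by
## GenomicRanges); pruning.mode = "coarse" drops ranges on other seqlevels.
library(GenomicRanges)
gr <- GRanges(seqnames = rep(c("chr1", "chr2"), each = 3),
              ranges = IRanges(start = (1:6) * 100, width = 50))
small <- keepSeqlevels(gr, "chr1", pruning.mode = "coarse")
length(small)  # 3 ranges remain, all on chr1
```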

2023-06-02

Michael Love (09:33:52): > @Eric Davis ks has some tricky upstream dependencies (misc3d -> tcltk) which are sometimes hard for users to install. > > how would you feel about me making this a Suggests that the user is prompted to install when asking for rejection sampling?

Eric Davis (09:39:11) (in thread): > That seems reasonable to me. Should we change the default method to something else so it doesn’t immediately prompt install?https://github.com/nullranges/nullranges/blob/e7b9f675194879d01a9ca4f93f291fc22d4a0603/R/AllGenerics.R#L72

Michael Love (10:17:10) (in thread): > yeah how about NN?

Michael Love (10:17:28) (in thread): > btw what is your current preferred method?

Eric Davis (10:48:04) (in thread): > NN works but would also need to change the default to replace=TRUE. stratified is slower, but should work in all cases and can accept replace=FALSE

Michael Love (10:52:37) (in thread): > i’ll switch defaults to replace=TRUE and NN if that’s ok. just don’t want people to not be able to install the whole shebang bc of ks. > > i remember Mikhail brought this up earlier. it’s just that it is the only kernel density package we found that can predict the density for any given x
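The property Mike mentions — evaluating the fitted kernel density at arbitrary points — can be sketched with ks (a hedged toy example on simulated data, not the nullranges rejection-sampling code):

```r
## Hedged sketch: ks::kde() accepts eval.points, so the fitted kernel
## density can be evaluated at any user-supplied x values, which is what
## rejection sampling needs.
library(ks)
set.seed(1)
x <- rnorm(500)
fit <- kde(x, eval.points = c(-1, 0, 1))
fit$estimate  # estimated density at x = -1, 0, 1
```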

Michael Love (10:53:41) (in thread): > thanks for quick replies :smile: i’m teaching nullranges at this course in two weeks — excited to share the love of the null

Eric Davis (10:54:34) (in thread): > That’s exciting! Is it the Bioc pre-conference workshop?

Michael Love (10:57:06) (in thread): > this is CSAMA organized by Wolfgang

Michael Love (15:46:49) (in thread): > seems fine https://nullranges.github.io/nullranges/ - Attachment (nullranges.github.io): Generation of null ranges via bootstrapping or covariate matching > Modular package for generation of sets of ranges > representing the null hypothesis. These can take the form > of bootstrap samples of ranges (using the block bootstrap > framework of Bickel et al 2010), or sets of control ranges > that are matched across one or more covariates. nullranges > is designed to be inter-operable with other packages for > analysis of genomic overlap enrichment, including the > plyranges Bioconductor package.

2023-06-19

Pierre-Paul Axisa (05:11:42): > @Pierre-Paul Axisa has joined the channel

2023-07-28

Benjamin Yang (15:58:06): > @Benjamin Yang has joined the channel

2024-05-14

Lori Shepherd (10:39:05): > archived the channel