#salmon2bioc
2018-11-05
Michael Love (15:31:50): > @Michael Love has joined the channel
Michael Love (15:31:50): > set the channel description: Import alevin quantification into Bioconductor efficiently
Charlotte Soneson (15:31:50): > @Charlotte Soneson has joined the channel
Peter Hickey (15:31:50): > @Peter Hickey has joined the channel
Rob Patro (15:31:50): > @Rob Patro has joined the channel
Michael Love (15:32:20): > i made this a public channel
Michael Love (15:33:18): > hi all! I’m going to add people here who are interested in importing alevin quantification into Bioconductor
Michael Love (15:34:38): > • Peter posted this issue in August: https://github.com/mikelove/tximport/issues/24 • A related thread from Koen about sparse import: https://github.com/mikelove/tximport/issues/25 - Attachment (GitHub): Idea: Supporting alevin output · Issue #24 · mikelove/tximport > (Unsure if this idea belongs here, in the tximeta repo, or in the salmon repo. Apologies if I guessed wrong. Tagging @rob-p and @k3yavi for their thoughts). I'm looking into using the alevin to… - Attachment (GitHub): sparse matrix support? · Issue #25 · mikelove/tximport > I was wondering whether it would possible to support sparse matrices as output of tximport. This might make tximport more attractive for people working with scRNA-seq data. I'm currently runnin…
Michael Love (15:38:20): > I just invited Koen, Rory, and Avi to the Slack, then i’ll add them if they join
Michael Love (15:38:55): > I’m going to add some alevin data to tximportData to have some test data. Avi provided me with some data that looks to be 299 cells and 107k genes (human + mouse). The binary matrix is 1.6 Mb
Michael Love (15:39:28): > something to think about is whether we use sparse matrices in memory or just do it all with reading/writing hdf5 from R?
Charlotte Soneson (15:46:03): > Hello! Just to add a bit to the scope (albeit slightly tangential), we have also started sketching out a small R package for generating a nice “CellRanger-like” summary report from the Alevin output - in case someone else would be interested in that too.
Michael Love (15:46:32): > set the channel description: Import alevin quantification into Bioc and making nice summary report
Michael Love (15:46:51): > yes! oops, forgot to add that
Michael Love (15:47:02): > i’m interested in the summary report as well:smile:
Michael Love (15:50:58): > Avi sent me the following, which I will start putting into tximport with tests running on tximportData - File (R): readAlevin.R
Michael Love (15:56:10): > ok i just pushed alevin test data to tximportData. it’s in `extdata/alevin`
Koen Van den Berge (16:16:29): > @Koen Van den Berge has joined the channel
Michael Love (16:17:36): > hi Koen!
Koen Van den Berge (16:20:02): > hey!:slightly_smiling_face:thanks for organizing this channel.
Michael Love (16:31:23): > my main question for now is whether in-memory sparse matrices are a better way to go vs reading/writing hdf5
Michael Love (16:32:12): > and the design concerns around these options
Peter Hickey (16:36:20): > for `bsseq`, i’ve added a `BACKEND` argument with default value `NULL` to the data import function. `BACKEND = NULL` gets you an ordinary matrix (I imagine it’d be a sparse matrix for this function/pkg); `BACKEND = HDF5Array` gets you an HDF5-backed DelayedArray > I could potentially support other DelayedArray backends (e.g., another disk-backed format) but this is a low priority for me. > Note that this is in slight conflict with `DelayedArray::setRealizationBackend(BACKEND = NULL)`, which means “give me a matrix-backed DelayedArray”, but I have come to believe that matrix-backed DelayedArrays should mostly not be used (instead, just use an ordinary matrix or sparse matrix)
Peter Hickey (16:38:29): > that way my data import function can return a matrix-backed or HDF5Array-backed SummarizedExperiment-ish object
Peter Hickey (16:40:49): > I know `scran` and `scater` support matrix / sparse matrix / hdf5-backed `SingleCellExperiment`, but i guess a complication for this is that other downstream pkgs may not yet support HDF5? > with `bsseq` we’re ‘lucky’ because most/all downstream functionality lives in `bsseq`
Michael Love (16:41:26): > right
Michael Love (16:41:44): > also, good morning
Michael Love (16:43:40): > so my standard approach would be that tximport would create either a matrix / sparse matrix, and then tximeta will build an SE (or SCE as appropriate). from scanning `alevin` output, it looks like I’ll need the txome hash in order to run tximeta operations on it
Michael Love (16:44:21): > sparse matrix saves you for a little while, but then eventually you hit the memory limit again
Rob Patro (21:19:36): > Hi all, thanks for the invite @Michael Love!
Rob Patro (21:20:08): > I’m obviously interested in the alevin-based html report @Charlotte Soneson:slightly_smiling_face: and also in the appropriate storage format
Rob Patro (21:20:25): > we’d be happy to help output in a more convenient format, but it’s not clear to me what that is yet
Michael Love (21:23:26): > Yeah I think first need to think about scale and whether sparse in memory will work and for how long
Michael Love (21:23:45): > @Rob Patro does alevin output the hash of the txome?
Michael Love (21:24:09): > Can tximeta do its thang
Rob Patro (21:24:21): > I believe so — lemme check
Rob Patro (21:27:21): > @Michael Love: yes
Michael Love (21:28:08): > I couldn’t find it in https://github.com/COMBINE-lab/salmon/blob/master/tests/alevin_test_data.tar.gz - Attachment (GitHub): COMBINE-lab/salmon > :fish: :sushi: :bento: Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using lightweight alignments - COMBINE-lab/salmon
Michael Love (21:28:13): > But maybe I missed it
Michael Love (21:28:23): > Is that typical output?
Michael Love (21:28:40): > Also feel free to snooze this until tomorrow!
Rob Patro (21:29:24): > ahh, in that data, avi just gave you the alevin directory
Rob Patro (21:29:31): > so alevin output looks like this
Rob Patro (21:30:10): > you have the normal quant directory > > ├── aux_info > ├── logs > ├── libParams > ├── lib_format_counts.json > ├── cmd_info.json > └── alevin >
Rob Patro (21:30:55): > so, there, avi just handed off the alevin subdirectory
Michael Love (21:31:04): > Oh ok maybe I can get the other ones and I’ll add them tomorrow to tximportData
Michael Love (21:31:48): > No time pressure actually
Michael Love (21:32:02): > I’m gonna head to bed but work more on alevin import tomorrow
Rob Patro (21:42:37): > cool!
Rob Patro (21:42:40): > thanks @Michael Love
Rob Patro (21:43:01): > I’ll ping Avi about it
2018-11-06
Avi Srivastava (10:50:34): > @Avi Srivastava has joined the channel
Avi Srivastava (10:54:38): > Thanks @Michael Love for creating the channel and inviting me.
Avi Srivastava (10:56:52): > It’s really helpful for keeping all the discussion for improving Alevin in one channel.
Avi Srivastava (10:58:42): > Ah, I missed the full output of alevin; I’ll share that in a bit.
Michael Love (11:03:16): > ok great, and i’ll update tximportData
Michael Love (11:07:30): > so Koen had mentioned that he was hitting a memory wall in R with 400 cells, and we’ve now added sparse import to `tximport`. but that only saves you by the sparsity factor: if the data is 0.01 nonzero, it allows you 100 times more cells before you hit the same wall. as a design decision, the other option is to not work in memory and do things on disk, but this slows you down bc of reading and writing. if anyone has thoughts on whether sparse is a good medium-term solution, good to discuss
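The sparsity-factor arithmetic above, as a quick back-of-envelope sketch (Python for illustration; the 12-bytes-per-nonzero figure approximates a CSC layout like R's `dgCMatrix` and is an assumption, not a measurement):

```python
# Rough memory footprint of a genes x cells count matrix.
# Dense: 8 bytes (one double) for every entry.
# Sparse CSC (assumed layout): 8-byte value + 4-byte row index per
# nonzero, plus one 8-byte column pointer per column.

def dense_bytes(genes, cells):
    return genes * cells * 8

def sparse_bytes(genes, cells, density):
    nnz = int(genes * cells * density)
    return nnz * (8 + 4) + (cells + 1) * 8

genes, cells, density = 107_450, 400, 0.01
print(round(dense_bytes(genes, cells) / 1e6, 1))            # 343.8 MB dense
print(round(sparse_bytes(genes, cells, density) / 1e6, 1))  # 5.2 MB sparse
```

At 1% density the sparse form is roughly the density factor smaller, which is the "saves you by the sparsity factor" point: the wall moves out, but it is still there.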
Michael Love (11:10:11): > also, if we do go with sparse import, i have an alternative implementation I want to test for speed, but i need someone with e.g. 400 cells to test it
Michael Love (11:11:42): > oh looks like Rory just offered to test it
Rory Kirchner (11:12:02): > @Rory Kirchner has joined the channel
Michael Love (11:12:13): > hi Rory, I’ve thrown you in here too:slightly_smiling_face:
Michael Love (11:13:08): > @Rory Kirchner what kind of large-n + sparse quants do you have that i could try out the tximport sparse import implementations on?
Avi Srivastava (11:14:32): > @Michael Love if it’s helpful, I can share the output of Alevin for various cell-size experiments used in the preprint for testing: 900, 2k, 4k, 8k cells.
Rory Kirchner (11:17:43): > Heya– I have sparse matrices of 700k cells or so I can point you to if you want a big one.
Michael Love (11:25:58): > @Avi Srivastava that sounds good, maybe 900 cells
Michael Love (11:26:25): > @Rory Kirchner the testing would be making the matrix from 100s of files
Michael Love (11:26:38): > sounds like i can do it with Avi’s alevin data
Michael Love (11:27:09): > Rory, how do you build the matrices?
Rory Kirchner (11:28:47): > Gotcha– we build them by sample, so all cells from the same sample are quantified simultaneously, and then the per-sample matrices are cbinded at the end
Michael Love (11:29:37): > so you build a list of sparse matrices in R
Michael Love (11:29:49): > i’m doing something similar now
Michael Love (11:30:21): > i’m thinking that a first pass through the files where # of nonzero is counted may be faster than this
Michael Love (11:30:39): > it’s read time vs memory allocation time
Michael Love (11:31:22): > because on the second pass, I can preallocate vectors of the right size where I store the nonzero value and row position
Rory Kirchner (11:32:24): > we do it in python here:https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/rnaseq/umi.py#L382 - Attachment (GitHub): bcbio/bcbio-nextgen > Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis - bcbio/bcbio-nextgen
Rory Kirchner (11:33:36): > Does Alevin not output sparse matrices?
Rory Kirchner (11:37:10): > here’s the R implementation of something similar, just doing what you suggested reading a list of sparse matrices in and then cbinding them:https://github.com/hbc/bcbioSingleCell/blob/master/R/counts-internal.R#L146 - Attachment (GitHub): hbc/bcbioSingleCell > R package for bcbio single-cell RNA-seq analysis. Contribute to hbc/bcbioSingleCell development by creating an account on GitHub.
Rory Kirchner (11:37:55): > its not that memory intensive since there is only ever one matrix that isnt sparse
Michael Love (11:40:38): > got it
Michael Love (11:40:47): > the memory pinch point is growing a list
Michael Love (11:40:59): > because R will move the memory around
Michael Love (11:41:19): > it may be the fastest way to go though
Michael Love (11:51:39): > Maybe lapply/mapply is better than `list <- c(list, x)`
2018-11-08
Charlotte Soneson (10:48:05): > A small logistics question: as I mentioned, I have put together a small R package (I called it `alevinQC`) that currently contains only two functions: (1) Avi’s function to read alevin output, and (2) a function to create a summary report of the outputs of an alevin run. Now, (1) fits better in `tximport` eventually, so I was wondering what would be the best way forward. Is there anything else that someone can think of that would fit in the package? Would the report be something that should rather be generated by alevin at runtime (I guess this is a question for @Rob Patro and @Avi Srivastava)? Or, would it still be useful to keep it as an R package with a single function?
2018-11-09
Michael Love (18:55:52): > Sorry I forgot to reply to this
Michael Love (18:56:37): > I think (2) as an R package with one function is definitely still useful even if (1) is in tximport
2018-11-12
Michael Love (07:32:57): > @Avi Srivastava if you get a chance, I’ll add the full sample dir output to tximportData so we can process e.g. the metadata
Michael Love (07:33:37): > I have the alevin dir but not the files in the dir one higher
Avi Srivastava (07:39:08): > Apologies for the slow response. I was out of the country for a little while and out of touch. I was thinking of uploading all the Alevin preprint datasets to zenodo. In the meantime I’ll forward the 900-cell results of Alevin, definitely within an hour.
Michael Love (07:57:32): > Ok !
Michael Love (08:02:36): > No rush
Avi Srivastava (08:29:49): - File (Gzip): alevin.tar.gz
Avi Srivastava (08:30:51): > @Michael Love, these are the Alevin results for the 900-cell mouse dataset from the 10x website.
Michael Love (08:31:13): > great. I’ll work on the sparse runtime comparison
Avi Srivastava (08:31:21): > lovely !
Michael Love (08:31:47): > and for tximportData, i should wait for the zenodo files, correct?
Avi Srivastava (08:33:05): > Do you need multiple experiments?
Avi Srivastava (08:33:18): > The above file has the full sample dir
Michael Love (08:33:31): > let me see
Michael Love (08:33:57): > previously I put this in tximportData > > ➜ alevin git:(master) ls -1 > MappedUmi.txt > alevin.log > barcodeSoftMaps.txt > cell_eq_info.txt.gz > cell_eq_mat.gz > cell_eq_order.txt > featureDump.txt > predictions.txt > quants_mat.gz > quants_mat_cols.txt > quants_mat_rows.txt > raw_cb_frequency.txt > transcripts.txt.gz > whitelist.txt >
Michael Love (08:34:20): > these are human transcripts
Avi Srivastava (08:35:01): > This was from the github right ?
Michael Love (08:36:19): > yes
Avi Srivastava (08:36:51): > yes so the dataset on github was a different one, it was human+mouse 100 cells data.
Michael Love (08:36:51): > https://github.com/mikelove/tximport/issues/24#issuecomment-414163236 - Attachment (GitHub): Idea: Supporting alevin output · Issue #24 · mikelove/tximport > (Unsure if this idea belongs here, in the tximeta repo, or in the salmon repo. Apologies if I guessed wrong. Tagging @rob-p and @k3yavi for their thoughts). I'm looking into using the alevin to…
Avi Srivastava (08:37:09): > The current one is only mouse with more cells.
Michael Love (08:37:11): > ok, I can swap it out for this one instead
Avi Srivastava (08:37:49): > Is there a limit on the tximportData size, just wondering if the 900 cell data bloats up something.
Michael Love (08:38:18): > it was just 18 Mb
Avi Srivastava (08:38:24): > If so, I can forward the 100-cell dataset too, the caveat being it’s human+mouse data.
Michael Love (08:38:24): > that’s ok
Avi Srivastava (08:38:29): > sounds good.
Michael Love (08:38:59): > oh i see so this isn’t the dataset to test sparse on
Michael Love (08:39:02): > i was confused
Michael Love (08:39:07): > alevin outputs a matrix
Avi Srivastava (08:40:08): > huh? In the zipped file there should be a folder with the name `alevin`, which contains the full (binary) matrix.
Michael Love (08:40:25): > yeah, ok so i was confused about two features being developed
Michael Love (08:40:28): > 1) alevin support
Michael Love (08:40:33): > 2) sparse import
Avi Srivastava (08:40:44): > ah ! Yep gotcha!
Michael Love (08:41:09): > sparse import is tricky when quants are separated over 1000s of files
Michael Love (08:41:21): > alevin is much easier bc it outputs the full matrix
Avi Srivastava (08:41:50): > Interesting, does any tool dump separate files?
Michael Love (08:42:23): > Koen had hundreds of Salmon files for some project
Michael Love (08:42:37): > > I’m currently running into problems when importing >400 single cells because of memory limits.
Michael Love (08:43:18): > https://github.com/mikelove/tximport/issues/25 - Attachment (GitHub): sparse matrix support? · Issue #25 · mikelove/tximport > I was wondering whether it would possible to support sparse matrices as output of tximport. This might make tximport more attractive for people working with scRNA-seq data. I'm currently runnin…
Michael Love (08:43:26): > i’m actually not sure what quant he was referring to
Avi Srivastava (08:44:00): > One reason could be smart-seq2
Avi Srivastava (08:44:09): > but unlikely.
Michael Love (08:44:12): > so my plan will be to put the new mouse files in tximportData, incorporate your alevin reader into tximport() and report back here
Avi Srivastava (08:45:35): > that sounds reasonable .
Charlotte Soneson (08:48:18): > Just to chime in here: I was briefly discussing with Avi last week about the output format of alevin. A colleague of mine pointed out that the current binary format may not be fully portable between platforms (or, rather, readable with the same function regardless of where it was generated - you need to specify the endian type etc), and he was wondering whether something like hdf5 output would be more portable. Admittedly I don’t know too much about this topic:disappointed:but I am sure others know more, so I wanted to open up the discussion anyway in parallel to the design of the import function…
Michael Love (08:50:25): > that’s a good point
Michael Love (08:51:35): > so this line of R code: `con <- gzcon(file(matrix.loc, "rb"))`
Charlotte Soneson (08:52:00): > And one line in the loop, where the connection is read
Michael Love (08:53:06): > readBin(con, double(), endian = "little", n=num.genes)
Charlotte Soneson (08:53:06): > And I guess anything that stores all the zeros may run into memory issues at some point.
Charlotte Soneson (08:53:10): > yes
Michael Love (08:54:25): > we’ve been using the paradigm in tximport for reading in Gibbs or bootstrap replicates from Salmon
Michael Love (08:54:39): > but maybe no one is using inf reps in the wild so it hasn’t been exposed much
Michael Love (08:54:48): > https://github.com/mikelove/tximport/blob/master/R/infReps.R#L57-L67
Charlotte Soneson (08:55:56): > Yeah - I guess at this point it’s more of a potential issue - reading like this worked for us as well
Avi Srivastava (08:56:28): > yes I was also thinking about that, also I wonder how BAM files handle this issue.
Avi Srivastava (09:00:18): - File (PNG): Screen Shot 2018-11-12 at 19.29.50.png
Avi Srivastava (09:01:27): > I was reading more about that; it looks like Gibbs/bootstrap didn’t raise issues, possibly because most of the computation is done on x86 architectures..
Avi Srivastava (09:02:50): > But as @Charlotte Soneson was saying, it definitely can create cross-platform issues.
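A tiny illustration of the endianness concern (Python's `struct` module standing in for R's `readBin`): the same 8 bytes decode to different doubles depending on the assumed byte order, which is why the reader has to hard-code or be told `endian = "little"`.

```python
import struct

# Pack 42.0 using explicit little-endian ("<d"), then decode the same
# bytes both ways. Only the matching byte order recovers the value.
payload = struct.pack("<d", 42.0)

little = struct.unpack("<d", payload)[0]  # correct byte order
big = struct.unpack(">d", payload)[0]     # wrong byte order: nonsense value

print(little)          # 42.0
print(little == big)   # False
```

A writer on a big-endian platform that did not fix the byte order in the file format would produce bytes that a little-endian reader silently mis-decodes, hence the suggestion of a self-describing container like hdf5.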
Charlotte Soneson (09:04:39): > I don’t know whether CellRanger-like output would be an option, I think it’s just text (.mtx matrix + row/column names) (?)
Avi Srivastava (09:06:45): > I think it can be done, and I guess .mtx has a sparse format too.
Charlotte Soneson (09:07:22): > https://math.nist.gov/MatrixMarket/formats.html
Charlotte Soneson (09:07:28): > (I think this is the one)
Avi Srivastava (09:11:26): > There is a slight difference in size, but since the matrix is very sparse the overall size is not big > > 18M neurons_2k/alevin/quants_mat.gz > 11M neurons_900/alevin/quants_mat.gz > 71M neurons_9k/alevin/quants_mat.gz >
Avi Srivastava (09:11:33): > > 65M neurons_2k/matrix.mtx > 34M neurons_900/matrix.mtx > 299M neurons_9k/matrix.mtx >
Avi Srivastava (09:12:29): > The first one is Alevin and the second one is cellranger
Charlotte Soneson (09:13:21): > Oh, interesting. I hadn’t actually compared the output sizes - I wouldn’t have guessed CellRanger output was that much bigger
Avi Srivastava (09:14:19): > I just realized alevin files are gzipped, compressing mtx now…
Avi Srivastava (09:15:11): > > 17M neurons_2k/matrix.mtx.gz > 9.0M neurons_900/matrix.mtx.gz > 74M neurons_9k/matrix.mtx.gz >
Avi Srivastava (09:17:17): > so compressed alevin matrix output looks similar-ish to compressedmtx
. Just a point to note here is that the alevin output might have more non-zero, possibly whymtx
are smaller in size.
Charlotte Soneson (09:17:53): > Yes, seems reasonable
Avi Srivastava (09:38:49): > One more thought was that we can make the endianness an external parameter too. But it has its own caveats.
Charlotte Soneson (09:41:07): > Yeah, I guess it does. Also, I don’t mean that the current output format necessarily has to be changed - maybe it’s never going to cause any issues in practice - I just wanted to bring up the discussion before we write the import functions:slightly_smiling_face:
Avi Srivastava (09:42:07): > Yep, appreciate that !:slightly_smiling_face:
Michael Love (09:48:51): > Good point
Michael Love (09:49:05): > We should put it to#generalor Bioc devel maybe
Michael Love (12:30:58): > ok I posted to general
2018-11-13
Michael Love (09:56:54): > so reporting back here, it sounds like we continue with the binary storage as implemented, although we may change in the future if we hit problems
Michael Love (10:03:30): > ok I just pushed neuron 900 to tximportData and will start folding in readAlevin in tximport this week
Avi Srivastava (10:05:18): > Awesome, thanks a lot @Michael Love for working things out!
Michael Love (10:05:58): > for simplicity i’m thinking `txi <- tximport(files, type="alevin")` is the interface
Michael Love (10:06:04): > is that what others were imagining?
Michael Love (10:07:02): > for all other methods, `files=c("dir/quant.sf", ...)`
Michael Love (10:07:18): > here files should be length 1…
Charlotte Soneson (10:07:35): > That looks good to me, assuming that the text files with row and column names are in the same directory.
Michael Love (10:08:10): > we could just do `files="dir"` (only for alevin) because in alevin there is no quant file in the top dir
Michael Love (10:08:48): > or `files="dir/alevin/quants_mat.gz"` (again just for alevin)
Charlotte Soneson (10:09:37): > The latter would be more consistent with the current interface
Michael Love (10:09:42): > the first one is simpler, the second one is more similar to what we’ve been doing, yes
Michael Love (10:09:48): > consistency is big
Charlotte Soneson (10:09:57): > And remove the ambiguity as to whether the Salmon or alevin directory should be specified
Charlotte Soneson (10:10:09): > But assumes that people don’t take the quants_mat.gz file out of there…
Michael Love (10:10:17): > which we really don’t want them to do
Charlotte Soneson (10:10:23): > Exactly.
Michael Love (10:10:36): > bad design decision on my part going back to 2015:slightly_smiling_face:
Michael Love (10:10:46): > i should have had tximport point to dirs, but too late for that now
Michael Love (10:12:10): > it would be pretty bad if people start taking quants_mat.gz and renaming it experiment1.gz and putting it in a folder with other experiments
Michael Love (10:12:31): > it’s much more loss of information than with salmon
Charlotte Soneson (10:12:36): > Indeed
Michael Love (10:12:40): > so maybe that justifies changing alevin to `files="dir"`
Charlotte Soneson (10:13:06): > Meaning the top Salmon directory?
Michael Love (10:13:11): > yeah
Michael Love (10:13:32): - File (PNG): Screen Shot 2018-11-13 at 10.13.20 AM.png
Michael Love (10:13:50): > point to the dir w/ this structure
Charlotte Soneson (10:15:08): > But in principle, people need only the alevin directory, right (I mean, to lose equally much information as they would do if they move the quants.sf file from the Salmon output directory)?
Michael Love (10:16:23): > I guess if we’re going to break with pointing to quant file, I’d go for pointing to top dir so no one gets the idea to move things around
Charlotte Soneson (10:16:23): > For the inf. reps, we assume that there is a file in a folder relative to the quants.sf file, right?
Michael Love (10:16:49): > Yeah but that’s all downstream from a decision i would now change if I could
Charlotte Soneson (10:17:00): > Understood:slightly_smiling_face:
Michael Love (10:17:01): > :man-facepalming:
Charlotte Soneson (10:17:13): > But maybe it’s still better to stay consistent…
Charlotte Soneson (10:17:21): > Or make `tximport2`:wink:
Michael Love (10:17:33): > Haha classic Bioc
Michael Love (10:18:28): > @Avi Srivastava or @Rob Patro, opinions?
Michael Love (10:18:46): > I could go either way I think
Michael Love (10:19:31): > So @Charlotte Soneson you prefer “dir/alevin”?
Michael Love (10:19:44): > Or point to quants matrix
Charlotte Soneson (10:19:46): > I think my preference would be pointing to the file
Charlotte Soneson (10:20:05): > And check that the other required files are in the directory.
Charlotte Soneson (10:20:13): > But I would be fine with either solution.
Avi Srivastava (10:20:53): > It might be a little more work, but why can’t we do both, and tell the user in the console what files we read?
Avi Srivastava (10:22:17): > On the flip side it might create more confusion. But I was thinking if the user specifies a directory then we look for alevin folder and the relevant files inside it and if they specify the exact gz file then we look for meta data file in the same folder.
Avi Srivastava (10:24:26): > But in my mind these are all add-ons which we can work on once the skeleton is ready…
Michael Love (10:24:28): > i’d go for one correct way (otherwise it can be confusing for users, yes)
Michael Love (10:25:08): > pointing to the `quants_mat.gz` file and checking that the required files are there sounds good to me; that way no one will get very far if they start moving things around
Charlotte Soneson (10:26:14): > :+1:
Michael Love (10:26:41): > ok i think i’m set then, i’ll start folding readAlevin into `tximport` once tximportData propagates so i can have tests
Avi Srivastava (10:28:00): > Sounds good .
Michael Love (15:45:31): > Ok just pushed tximport anyway, maybe it will be too soon but that’s ok. Then I’ll work on tximeta going for the hash
2018-11-14
Rob Amezquita (12:36:29): > @Rob Amezquita has joined the channel
2018-11-15
Vince Carey (14:14:35): > @Vince Carey has joined the channel
2018-11-16
Avi Srivastava (18:28:53): > @Charlotte Soneson Sorry for the late response, and thanks for the awesome function! It looks really slick; I personally am on board with including this by default in Alevin. @Rob Patro what do you think? The only caveat being we might have to add R as an external dependency for salmon/alevin? or is there separate R magic to do this? > > Also I’d like to suggest a couple more plots if you don’t mind adding them. I’ll compile a list after discussing them with @Rob Patro. - Attachment: Attachment > A small logistics question: as I mentioned, I have put together a small R package (I called it `alevinQC`) that currently contains only two functions: (1) Avi’s function to read alevin output, and (2) a function to create a summary report of the outputs of an alevin run. Now, (1) fits better in `tximport` eventually, so I was wondering what would be the best way forward. Is there anything else that someone can think of that would fit in the package? Would the report be something that should rather be generated by alevin at runtime (I guess this is a question for @Rob Patro and @Avi Srivastava)? Or, would it still be useful to keep it as an R package with a single function?
Michael Love (18:30:00): > I can help out as well, testing code etc
Avi Srivastava (18:30:39): > Thanks @Michael Love, that would be great!
Michael Love (20:36:44): > just pushing this now: > > > se <- tximeta(coldata, type="alevin") > importing quantifications > reading in alevin gene-level counts across cells > found matching transcriptome: > [ Gencode - Mus musculus - release M16 ] > loading existing TxDb created: 2018-11-17 01:09:27 > generating gene ranges > fetching genome info >
2018-11-17
Charlotte Soneson (02:00:42): > Thanks @Avi Srivastava! Yes, please let me know what other plots you would like to include. One point that I was thinking about: currently I think it requires that alevin was run in `--debug` mode, to have all the necessary output files. Perhaps some of these files could be returned also in default mode (or we adapt the report according to the available output files)?
2018-11-19
Charlotte Soneson (12:00:36): > @Avi Srivastava, @Michael Love: FYI, it’s now possible to generate a shiny app instead of a html report with `alevinQC`:slightly_smiling_face: Not sure how useful this is really, but still…
Avi Srivastava (12:01:23): > lovely ! I have heard about shiny but never used it ..
Michael Love (12:12:42): > Cool! I will try it out (at some point)
Rob Patro (15:36:46): > @Avi Srivastava: do you have a run of alevin with bootstraps that you could share with @Michael Love?
Avi Srivastava (15:37:06): > yep !
Avi Srivastava (15:38:38): > Do we need it for a particular sample, or would any be fine?
Michael Love (15:39:24): > Anything is fine
Michael Love (15:39:57): > I’ll implement tximport and tximeta to bring these in. Want to make a nice platform for other groups to build on top of
Avi Srivastava (15:41:21): > Sounds good.
Michael Love (15:42:41): > thx, no rush
Avi Srivastava (15:45:12): - File (Gzip): alevin.tar.gz
Avi Srivastava (15:45:33): > Note this is not the same data as you were using previously for tximport.
Avi Srivastava (15:46:04): > If you need the bootstrap estimates for the same data i can run it for that too .
Michael Love (15:48:15): > diff data is fine
Michael Love (15:49:47): > so does one of these matrices have the complete 3d array? - File (PNG): Screen Shot 2018-11-19 at 3.49.16 PM.png
Avi Srivastava (15:50:59): > ah sorry.
Avi Srivastava (15:51:32): > Yes, so `mean_mat.gz` and `var_mat.gz` are the mean and the variance of the bootstrap estimates
Avi Srivastava (15:51:49): > `quants_mat.gz` is the regular MLE estimate.
Avi Srivastava (15:52:26): > Currently like salmon we don’t dump the full estimates of each bootstrap, just the mean and the variance.
Avi Srivastava (15:52:50): > Parsing all the matrices should be the same.
Michael Love (15:53:18): > you mean _unlike_ salmon?
Avi Srivastava (15:53:46): > duh! yes again my bad.
Michael Love (15:53:52): > ok no prob, just checking:wink:
Avi Srivastava (15:54:04): > :thumbsup:
Michael Love (15:54:20): > ok i’ll bring in the var at least bc i have a clear slot for that
Michael Love (15:54:54): > i can have the mean replace the MLE with an existing tximport arg
Avi Srivastava (15:55:38): > I guess so… Also fyi, it could be a little more of a PITA, but `tier_mat.gz` can be useful. We can add that later though.
Michael Love (15:56:29): > i’ll brb in a minute but say more about tier_mat
Avi Srivastava (16:00:29): > yea so basically it’s yet another, although a little crude, way of estimating the confidence in our estimates. Based on the evidence of UMI deduplication network we classify per cell-gene estimates into four tiers 0-1-2-3, in the decreasing order of the confidence in their estimation.
Koen Van den Berge (16:06:05): > this sounds very interesting for filtering or weighted FDR:slightly_smiling_face:
Michael Love (16:33:12): > yes, should be straightforward for IHW
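A sketch of how the tier matrix could feed a filter along the lines Koen and Michael mention (Python for illustration; the tier semantics follow Avi's description above, lower tier = higher confidence, and the `max_tier` cutoff is an arbitrary assumption, not a recommended default):

```python
# Zero out counts whose tier signals low confidence. Per the chat,
# tiers run 0-3 in decreasing order of confidence, so we keep only
# entries with tier <= max_tier. Real use would operate on the
# parsed tier_mat.gz / quants_mat.gz matrices.

def filter_by_tier(counts, tiers, max_tier=1):
    return [
        [c if t <= max_tier else 0.0 for c, t in zip(crow, trow)]
        for crow, trow in zip(counts, tiers)
    ]

counts = [[5.0, 2.0], [1.0, 7.0]]
tiers = [[0, 3], [1, 2]]
print(filter_by_tier(counts, tiers))  # [[5.0, 0.0], [1.0, 0.0]]
```

For weighted-FDR approaches like IHW, the tier could instead be passed through as a per-hypothesis covariate rather than used as a hard mask.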
Michael Love (17:00:50): > ok, `tximport` now brings in var, working on tximeta and thinking about other imports…
Michael Love (17:06:40): > > > txi <- tximport("alevin_w_boots/alevin/quants_mat.gz", type="alevin") > reading in alevin gene-level counts and inferential variance across cells > > names(txi) > [1] "abundance" "counts" "variance" "length" > [5] "countsFromAbundance" > > dim(txi$variance) > [1] 107450 298 > > dim(txi$counts) > [1] 107450 298 >
Michael Love (17:07:00): > this is version 0.11.3
Michael Love (20:24:54): > and for tximeta: > > > se <- tximeta("~/Downloads/alevin_w_boots/alevin/quants_mat.gz", type="alevin") > importing quantifications > reading in alevin gene-level counts and inferential variance across cells > couldn't find matching transcriptome, returning un-ranged SummarizedExperiment > !> assayNames(se) > [1] "counts" "variance" >
Michael Love (20:25:39): > that’s v 1.1.4
Michael Love (20:26:05): > i guess the transcripts for this dataset aren’t a Gencode or Ensembl release?
Michael Love (20:26:12): > (that’s ok just double checking)
2018-11-20
Avi Srivastava (06:19:33): > It’s gencode but human+mouse .
Avi Srivastava (06:20:21): > maybe we can add the SHA512 for human+mouse too in tximeta? since it’s very common in the single-cell world.
Michael Love (08:32:43): > Interesting
Michael Love (09:00:28): > These might be unranged for a while. The Bioc classes (GRanges, SummarizedExperiment) only support one genome at a time… have to think about if we can do something clever here, but tximeta recognizing the combo of organisms wouldn’t help right now
Michael Love (09:00:48): > Bc I can’t build an object with two genomes
Charlotte Soneson (09:03:46): > I guess the question is whether you want them in the same object in the end… I think CellRanger, for example, spits out two different matrices, split by organism.
Michael Love (09:06:16): > Makes sense. I can put this on the wish list, because it would be a big-ish effort to do something special when paired txomes are detected (thinking even just restricting to Gencode human + mouse pairs)
Michael Love (09:07:07): > There are some low-hanging todos for me, like swapping haplotype chromosome txps out when there are txp dups
Charlotte Soneson (09:07:17): > :+1:
Michael Love (09:19:48): > https://github.com/mikelove/tximeta/issues/16 - Attachment (GitHub): Paired txomes (human + mouse Gencode) · Issue #16 · mikelove/tximeta > For single cell experiments, we sometimes use pairs of txomes as the reference. With some extra work, we could output two objects, where we pick apart the human and mouse txomes. This would probabl…
Avi Srivastava (09:20:34): - Attachment: Attachment > I guess the question is whether you want them in the same object in the end…I think CellRanger, for example, spits out two different matrices, split by organism.
Avi Srivastava (09:22:45): > I never got my head around this idea - how did they really handle species-ambiguous reads in that case? Not sure how easy it would be to make two disjoint matrices based on the species.
Charlotte Soneson (09:24:58): > I never looked into the details, but I guess they use the total human/mouse count to decide whether a cell is human or mouse, and then only use those reads that map to that species (?)
Charlotte Soneson (09:25:53): > I mean, they don’t really do much with “within-species” ambiguous reads either.
Avi Srivastava (09:26:59): > I guess so, that would be reasonable.
Michael Love (09:27:44): > But for QC don’t people want to see the off-block-diagonal counts?
Charlotte Soneson (09:28:40): > Yeah, so this information is provided (see e.g. “Multiple species” here: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/summary)
Avi Srivastava (09:29:16): - File (PNG): Screen Shot 2018-11-20 at 09.28.53.png
Avi Srivastava (09:29:38): > yep, people use these kinds of cases to check the validity of the single-cell protocols.
Michael Love (09:30:18): > So this would argue to output all cells for both objects
Michael Love (09:30:28): > Not to make a cut based on counts
Michael Love (09:31:00): > I see now
Charlotte Soneson (09:33:56): > yes, I think that would make sense
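The species-assignment heuristic guessed at above can be sketched as follows. This is an assumed, minimal version of the logic (not CellRanger's actual implementation, and the function name and purity threshold are invented here): call each cell by whichever genome soaks up most of its counts, while keeping the raw per-species totals so the off-block-diagonal "barnyard" QC plot remains possible for all cells.

```python
# Hypothetical sketch of species assignment from per-cell totals.
# A cell is labeled by a species only if that species holds at least
# `purity` of its counts; otherwise it is a candidate cross-species doublet.

def classify_cells(human_counts, mouse_counts, purity=0.9):
    """Return one label per cell given total human/mouse counts."""
    labels = []
    for h, m in zip(human_counts, mouse_counts):
        total = h + m
        if total == 0:
            labels.append("empty")
        elif h / total >= purity:
            labels.append("human")
        elif m / total >= purity:
            labels.append("mouse")
        else:
            labels.append("doublet")
    return labels

print(classify_cells([950, 20, 500], [50, 980, 500]))
# ['human', 'mouse', 'doublet']
```

Outputting all cells for both per-species matrices, as suggested above, keeps the doublet and off-diagonal information available for QC instead of discarding it at import time.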
Peter Hickey (14:24:20) (in thread): > minor point that probably doesn’t really help: they do support multiple genomes but not duplicated seqlevels. so it only really helps if you’re willing to use something like chr1_mouse, chr1_human, or to use GenBank or RefSeq accession IDs
Charlotte Soneson (15:41:36): > On the topic of example data: @Avi Srivastava - do you also happen to have a small data set run with --dumpFeatures lying around that we can use for alevinQC? :upside_down_face:
Avi Srivastava (15:42:04): > Definitely, you got it !
Avi Srivastava (15:47:55): - File (Gzip): alevin.tar.gz
Charlotte Soneson (15:48:56): > Awesome, thanks!
Avi Srivastava (15:49:28): > No problem, the zipped file might have all the files like csv and eqclasses.
Avi Srivastava (15:49:36): > feel free to ignore non relevant ones.
Charlotte Soneson (15:50:13): > yep
2018-11-21
Michael Love (08:28:57) (in thread): > Aha. Also, the genome slot would have to be some combination so it would prevent any range overlaps using species specific ranges
Michael Love (10:07:12) (in thread): > Or would it work if the unique chroms have different genome strings, I’ve never tried that… anyway I’ll probably work on the low hanging tximeta fruit first
Michael Love (10:08:00) (in thread): > Like duplicate txp tools and GRangesList swapping
2018-11-28
Charlotte Soneson (05:55:11): > alevinQC now uses tximport to read the alevin matrix - works like a charm :slightly_smiling_face: except for a complaint about a connection left open @Michael Love, I guess the connections (con) should probably be closed after generating the respective matrices in tximport::readAlevin().
Michael Love (06:41:50): > Got it, I’ll add that
Michael Love (11:01:10): > done:https://github.com/mikelove/tximport/commit/045fc4a3266f913de95b86d01e4a6bd451542255 - Attachment (GitHub): closing alevin connection · mikelove/tximport@045fc4a > Import and summarize transcript-level estimates for gene-level analysis - mikelove/tximport
Avi Srivastava (11:02:39): > ah that was my bad, thanks @Charlotte Soneson and @Michael Love for correcting it.
Michael Love (11:02:52): > no problem
Michael Love (11:02:59): > i forgot to add it also
Charlotte Soneson (11:04:13): > yeah, I didn’t notice either until R started complaining while running the examples during check
Michael Love (11:19:40): > I’m trying out alevinQC on the tximportData example and it looks like files are missing
Michael Love (11:19:44): > > > alevinQCReport("/Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900", sampleID="neurons", outputFile="neurons.html") > Error in checkAlevinInputFiles(baseDir) : > The following required file(s) are missing: > /Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900/alevin/raw_cb_frequency.txt > /Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900/alevin/filtered_cb_frequency.txt > /Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900/alevin/featureDump.txt > /Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900/alevin/MappedUmi.txt > Calls: alevinQCReport -> checkAlevinInputFiles > > dir <- "/Users/love/bioc/tximportData/inst/extdata/alevin/neurons_900" > > list.files(dir) > [1] "alevin" "aux_info" "cmd_info.json" > [4] "lib_format_counts.json" "libParams" "logs" > > list.files(file.path(dir,"alevin")) > [1] "alevin.log" "predictions.txt" "quants_mat_cols.txt" "quants_mat_rows.txt" > [5] "quants_mat.gz" "quants_tier_mat.gz" "whitelist.txt" >
Michael Love (11:20:26): > should we put those required files into tximportData?
Charlotte Soneson (11:21:44): > yes - alevinQC assumes that alevin was run in --debug mode (or with --dumpFeatures, or both, I can’t remember which files come from where) - the example data in alevinQC should work, but yes, it would probably make sense to keep all the data in tximportData.
Michael Love (11:24:37): > ok i’ll try it out on alevinQC for now
Avi Srivastava (11:26:07): > yes, so all the missing files are dumped with the --dumpFeatures flag. We are considering dumping the feature files by default, but that is not in the current release.
Michael Love (11:26:52): > works!:thumbsup:
Charlotte Soneson (11:27:05): > :nerd_face:
Michael Love (11:27:26): > i know you’re busy with GB submission, but if you send me the extra files sometime, I’ll add them to tximportData. again, no rush
Michael Love (11:28:23): > @Charlotte Soneson one comment, I guess it is expected in Quantification that the barcode freq and UMI count would be roughly linear, while both of these vs # genes would be log-ish
Michael Love (11:28:40): > (bc some genes soak up all the UMIs)
Michael Love (11:29:11): > you have “degree of saturation”, but maybe state this more explicitly
Michael Love (11:29:38): > i can just imagine you will get repetitive emails that say “one of the scatter plots is linear but the other two look like log(x)”
Avi Srivastava (11:29:47): > @Michael Love This might already have all the files, but I’ll double check. - Attachment: Attachment
Michael Love (11:29:56): > i get this all the time with “my dispersion plot isn’t exactly like the one in the vignette”
Charlotte Soneson (11:30:20): > Yes, good point.
Michael Love (11:31:50): > ok, that has it
Michael Love (11:32:09): > i guess that’s not the mouse neurons though
Michael Love (11:32:59): > i should actually take a crack at running alevin locally anyway
Michael Love (11:33:15): > i’ll run on the mouse neurons 900 to re-generate
Avi Srivastava (11:34:31): > ah yes that might be human+mouse. > sounds good, let me know if you face any problem. Otherwise, I’ll forward the output of the alevin run.
2018-12-07
Michael Love (15:22:09): > dropping this here in case anyone is interested: https://gist.github.com/mikelove/7b1537b6e35f031668d460552bb35d5a
Michael Love (15:22:23): > it’s an importer for alevin w/ bootstrap inferential replicates
Michael Love (15:22:43): > the file format is temporary, so this won’t go into tximport
2018-12-20
Vince Carey (14:37:48): > hi – i don’t see which of the .tar.gz that have been posted above include inferential replicates. i looked at two of them and i see no quants_boot_mat.gz. thanks for any pointers
Michael Love (15:03:34): > hi Vince
Michael Love (15:03:49): > I think Avi may have sent them to me in another channel, i think perhaps on the Alevin slack
Michael Love (15:03:52): > let me look up
Michael Love (15:04:45): > ok, found them and downloading, i can post them here in a minute
Michael Love (15:05:02): > ~190 Mb
Avi Srivastava (15:06:08): > Thanks@Michael Love!
Michael Love (15:07:55): > 44% uploaded…
Michael Love (15:08:11): > :hourglass_flowing_sand:
Michael Love (15:11:11): - File (Gzip): mouse_900.tar.gz
BJ Stubbs (15:33:58): > @BJ Stubbs has joined the channel
2019-01-04
Charlotte Soneson (11:41:37): > Happy new year! For those of you who may be interested - the alevinQC package for creating a QC report summarizing the alevin output is now public at https://github.com/csoneson/alevinQC. Feedback, suggestions and additions welcome :slightly_smiling_face: - Attachment (GitHub): csoneson/alevinQC > Create QC and summary reports for Alevin output. Contribute to csoneson/alevinQC development by creating an account on GitHub.
Avi Srivastava (12:09:40): > :thumbsup:
Rob Patro (12:10:15): > :+1::slightly_smiling_face:
Michael Love (13:03:56): > great!
Michael Love (13:05:09): > maybe a screenshot for the README.md?
Charlotte Soneson (13:05:32): > good idea, I’ll add that
Avi Srivastava (13:19:13): > @Charlotte Soneson wrote an awesome tutorial for using iSEE with alevin-generated quants too.
Avi Srivastava (13:19:17): > it’s up here: https://combine-lab.github.io/alevin-tutorial/2018/alevin-isee/ - Attachment (combine-lab.github.io): How to use alevin with iSEE > Preparing alevin output for exploration with iSEE
Charlotte Soneson (13:19:41): > Screenshot added to https://github.com/csoneson/alevinQC (of the Shiny app, it was easier that way)
Michael Love (13:39:48): > great
2019-01-05
Michael Love (20:21:05): > posting here, bc there are lots of tximeta-interested folks… > I’m working on cleanDuplicateTxps to deal with Ensembl having lots of transcripts with duplicate sequence (which Salmon collapses by taking the first transcript that appears in the FASTA). Now, with this argument set to TRUE, tximeta will try to swap out haplotype chromosome txps for ones on standard chromosomes: > > > se <- tximeta(coldata, cleanDuplicateTxps=TRUE) > importing quantifications > reading in files with read_tsv > 1 > found matching transcriptome: > [ Ensembl - Homo sapiens - release 94 ] > loading existing EnsDb created: 2019-01-04 19:45:15 > generating transcript ranges > cleaning 1981 duplicate transcript names > Warning message: > In checkAssays2Txps(assays, txps) : missing some transcripts! > 7107 out of 211939 are missing from the GTF and dropped from SummarizedExperiment output > Calls: tximeta -> checkAssays2Txps >
Michael Love (20:21:34): > here i can rescue 1981 txps by swapping for the standard chromosome ENST
Michael Love (20:21:53): > the other 7107 are txps that don’t have sequence on the standard chromosomes (I think, need to confirm more closely)
Michael Love (20:22:43): > i still need to do some close testing but FYI it’s in the devel branch
Michael Love (20:23:34): > And a general comment: Gencode is much much better for most users, bypassing all of this mess
2019-01-06
Charlotte Soneson (05:10:14): > @Michael Lovethis is great!
Charlotte Soneson (05:12:49): > I am wondering about the exclusion of transcripts that are not present in the gtf - would it make sense to provide an option of not removing these (but, obviously, not providing any location information)? Or would setting skipMeta=TRUE be the way to go? I am just thinking that the matrix would look different from what you would get with e.g. tximport (and the TPMs may not sum to 1M, etc), which may cause some confusion unless it is done with an explicit argument.
Charlotte Soneson (05:18:23): > And just a note, since I was bitten by this myself: It’s not just Ensembl transcripts that have the .XX in the end - StringTie transcript IDs are of the form STRG.x.y (x - gene ID, y - transcript number within gene), and the CHESS catalog similarly names transcripts CHS.x.y, so in these cases, excluding everything after the dot doesn’t leave much information…
Michael Love (06:34:37): > Thx @Charlotte Soneson. For the second I think I should add an if statement to only chop for Ensembl. Good point. For the first one, yes skipMeta, bc otherwise I’d have to fill in “dummy” ranges with no chromosome. Seems strange
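The "only chop for Ensembl" guard mentioned above can be sketched like this. This is a hypothetical helper, not tximeta's actual code, and the exact regex is an assumption: strip a trailing `.N` version only when the ID matches the Ensembl accession shape, so StringTie's `STRG.x.y` and CHESS's `CHS.x.y` identifiers pass through untouched.

```python
# Hypothetical sketch: version-strip Ensembl-style IDs only.
import re

# Ensembl transcript/gene accessions: ENS + optional species code +
# G/T + 11 digits + .version (e.g. ENST00000456328.2, ENSMUSG00000051951.5)
ENSEMBL_ID = re.compile(r"^ENS[A-Z]*[GT]\d{11}\.\d+$")

def strip_version(tx_id):
    """Drop the .N suffix only for IDs that look like Ensembl accessions."""
    if ENSEMBL_ID.match(tx_id):
        return tx_id.rsplit(".", 1)[0]
    return tx_id

print(strip_version("ENST00000456328.2"))  # ENST00000456328
print(strip_version("STRG.4.1"))           # STRG.4.1 (unchanged)
print(strip_version("CHS.10.2"))           # CHS.10.2 (unchanged)
```

Keying the behavior on the ID shape (or on the detected transcriptome source) avoids mangling non-Ensembl catalogs where everything after the first dot is meaningful.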
Charlotte Soneson (06:35:44): > Yep, makes sense
Michael Love (06:38:39): > I guess I could have another option to get metadata but still do unranged SE
Michael Love (06:39:45): > But you can get this by running once with skipMeta F and once with T
Michael Love (06:40:18): > So maybe doesn’t deserve another dedicated argument
Charlotte Soneson (06:42:12): > So in that case one would just copy over the metadata slot from the skipMeta=FALSE SE to the skipMeta=TRUE SE?
Michael Love (07:06:07): > Yeah and that way you also get the ranges for the txps that are in the GTF, that’s a bit better than just getting the metadata alone
Michael Love (07:06:51): > I’ll look into these 7000 txps to see if we’re losing a lot
Michael Love (07:07:25): > Ensembl is quite a pain that they don’t provide ranges for FASTA entries
Michael Love (07:07:49): > The solution is something like a graph genome I suppose but in the meantime it’s quite a mess to deal with
Charlotte Soneson (07:09:36): > Yes, definitely agree
Charlotte Soneson (07:12:38) (in thread): > But you wouldn’t get this with skipMeta=TRUE, right? So you can’t have both the full matrix (all transcripts in the fasta file) and range information.
Michael Love (07:30:01) (in thread): > Yeah you can’t get it in one go
Michael Love (07:30:42) (in thread): > Unless there is some SE magic I don’t know about where some rows can be unranged
Charlotte Soneson (07:33:01) (in thread): > Yup. Sorry, I misunderstood, I thought you meant that you could also transfer the ranges from the “small” SE to the larger one with all the transcripts. But you’d need to retain both SEs in parallel if you want both the ranges and the full set of txs.
Michael Love (08:35:26) (in thread): > Down the line, @Rob Patro and I are interested in being more specific about which transcript is expressed and i think we’ll want to be more formal about these haplotype transcripts. But I think it will involve preprocessing the Ensembl files, unless they make them more useful between now and then
Avi Srivastava (11:13:24) (in thread): > good to know !!
Rob Patro (11:23:19): > This is awesome, @Michael Love!
Avi Srivastava (11:24:02): > exactly I was thinking about that, we had this problem so many times and had to run things with --keepDuplicates
Avi Srivastava (11:41:48): > Just a thought - maybe the use case is different, or the current version may already handle this, but I was thinking: if there are marked duplicates, we could group them together and report a summary (maybe the sum) of the quantification counts of these txps instead of reporting them all.
Michael Love (12:23:52): > That’s what’s done now I think
Michael Love (12:24:16): > It reports the sum and picks one name out of the pool of txps with duplicate sequence
Michael Love (12:24:30): > And this new argument just picks a new name
Avi Srivastava (12:25:11): > Awesome ! thanks for the clarification .
2019-01-07
Michael Love (08:46:11): - File (PNG): Screenshot from 2019-01-07 08-45-57.png
Michael Love (08:46:26): > more than 500 transcripts recovered from chr6: the MHC locus is on chr6
Charlotte Soneson (08:46:57): > :+1:
Michael Love (11:58:13): > ok a little more hijacking of this channel
Michael Love (11:58:36): - File (PNG): Screenshot from 2019-01-07 11-54-29.png
Michael Love (11:58:40): - File (PNG): Screenshot from 2019-01-07 11-54-53.png
Michael Love (11:59:21): > log counts of the ~7000 transcripts that are in the FASTA, not in the GTF, and not fixable by cleanDuplicateTxps
Michael Love (11:59:27): > top is density, bottom is histogram
Michael Love (11:59:40): > log10 counts is shown
Michael Love (12:00:32): > so these are deemed by Salmon to be expressed just as much as any other transcripts FWIW
Avi Srivastava (12:19:50): > One more very hand-wavy argument: I’ve observed that with the new patches of an Ensembl release, they deprecate some of the transcript/gene names. If there is a way to find that out, that could be useful too.
Michael Love (13:37:10): > oh, to find out which are new/old across releases?
Avi Srivastava (16:13:25): > yep, and potentially tell the user the new one (provided it’s available), if given the relevant flag?
Michael Love (16:15:02): > i think ensembldb might help with this
Michael Love (16:15:36): > at least, to map between versions
Michael Love (16:17:05): > but Ensembl is tough to deal with right now - they aren’t even packaging a single FASTA that can be used for quantification, and there are the FASTA/GTF issues. i’m thinking that i’ll develop Ensembl-specific features as they are demanded by users, but other flashy/shiny features might get us more users quicker
Michael Love (16:17:40): > e.g. quick and easy visualization of quants in genomic coordinates where other assays can be visualized as well
Michael Love (16:18:29): > also relevant for Alevin: would be cool to quickly visualize scRNA and scATAC in genomic coordinates. this is the killer app for tximeta i think, also then leading into plyranges manipulation of quants + other assays
Avi Srivastava (16:20:31): > This sounds interesting - what kind of quant visualization do you have in mind, like pileups across the genome?
Avi Srivastava (16:22:55): > oh wow, just got to know plyranges: https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html - Attachment (bioconductor.github.io): The Bioconductor 2018 Workshop Compilation > This book contains all the workshops presented at the Bioconductor 2018 Conference
Avi Srivastava (16:23:03): > @Rob Patro might love it :stuck_out_tongue_winking_eye:
Michael Love (16:46:11): > plyranges is very cool. i want to incorporate it into the tximeta workflow
Avi Srivastava (16:47:33): > it indeed is, although I am curious about the visualization of quants - or maybe I am missing a specific use case in the vignette?
Michael Love (16:47:34): > re: what kind of visualization, yeah something like expression levels along the genome, accessibility at peaks
Michael Love (16:48:18): > plyranges doesn’t do any visualization just manipulation. like bedtools
Michael Love (16:48:39): > it would be cool to show things like isoform switching on the genome
Avi Srivastava (16:48:57): > ah I remember some plots of scATAC peaks back at 10x, although I am not sure about their biological interpretation.
Michael Love (16:49:13): > if you have diff promoters for isoforms you could visualize isoform switching + promoter differential accessibility
Michael Love (16:49:48): > for sc i’m always imagining detecting populations and then merging the signal from all cells in the sub pop
Michael Love (16:50:59): > the visualization would be per locus, and then you can use plyranges to compute global overlap or correlations
2019-01-08
Avi Srivastava (11:03:07): > That sounds really Awesome !
Hirak (13:46:56): > @Hirak has joined the channel
2019-01-09
Avi Srivastava (17:26:08): > I wonder what’s the best place/way to store a salmon index for a widely used txome (say Gencode) online?
Rob Patro (19:35:54): > You mean have like a prebuilt index?
Avi Srivastava (19:42:59): > Right, so the proposal is to put the pre-built index in a shared repo for, say, mouse and human, with maybe a couple of variations in the k-mer size.
Avi Srivastava (19:43:46): > Changes to the index are relatively infrequent, and we can set up our Drone CI to automatically update it if there is one.
Rob Patro (19:43:53): > Why not zenodo?
Rob Patro (19:44:00): > Then the index gets a doi
Rob Patro (19:44:03): > Etc
Avi Srivastava (19:46:00): > Actually that’s also a good point, we can try Zenodo. I just wanted to start a discussion on whether we’d like to maintain a central, say Gencode-like, website for Salmon distributables like indices :stuck_out_tongue_winking_eye:
Avi Srivastava (19:46:07): > but may be too much work …
Rob Patro (19:49:45): > We’d need a dedicated host etc.
Rob Patro (19:50:02): > We can link to such things from the salmon homepage etc.
Avi Srivastava (19:50:20): > cool:thumbsup:
Michael Love (22:03:45): > Should I change this channel to more general salmon2bioc?
Michael Love (22:03:55): > I seem to be missing such a channel
Michael Love (22:04:46): > And there is so much overlap in the people involved in Salmon/tximport/alevin and so on
2019-01-10
Avi Srivastava (11:30:43): > sure, fine by me .
Michael Love (11:31:38): > has renamed the channel from “alevin2bioc” to “salmon2bioc”
Michael Love (11:33:53): - File (GIF): salmon.gif
Avi Srivastava (11:34:15): > :slightly_smiling_face:
Avi Srivastava (11:34:57): > we need a swim downstream one:stuck_out_tongue_winking_eye:
Michael Love (11:53:46): > this is what google gives me
Michael Love (11:53:54): - File (JPEG): animals-trout-swimming_downstream-gps-sat_nav-satellite_navigation-llan1409_low.jpg
Avi Srivastava (11:54:12): > hahaha:joy:
Rob Amezquita (12:05:56): > just a quick salmon-related question/comment - it took quite a bit of searching to find how exactly to work with V1 chemistry (where you have I1 + R1 + R2 reads, and a failed run where i didn’t use the I1 reads). i found this page (https://combine-lab.github.io/alevin-tutorial/2018/running-alevin/) which seems to still be current (is it?) based on the repo structure, but it would definitely be helpful to add a note (or even a link to this) in the salmon docs - Attachment (combine-lab.github.io): How to Run Alevin > Running Alevin:
Avi Srivastava (13:05:11): > Thanks for the heads up @Rob Amezquita, we will put a link to the tutorial for using alevin with V1 chemistry in the salmon docs.
Rob Amezquita (13:06:19): > thanks Avi!
2019-01-15
Michael Love (10:14:37): > working on some new datasets for Bioc courses and workflows
Michael Love (10:14:40): > [2019-01-12 13:58:38.453] [jointLog] [info] Mapping rate = 10.2516%
Michael Love (10:14:59): > helps to map to mouse instead of human when the experiment is mouse
Charlotte Soneson (10:15:29): > 10% is not bad for mouse to human though!:slightly_smiling_face:
Michael Love (10:15:50): > i tried to do some work over the weekend while also watching two kids
Michael Love (10:16:10): > ok, so also, the PCA plot separated the conditions (mostly)
Michael Love (10:16:11): > haha
2019-03-11
Michael Love (09:07:24): > hi folks, I changed the behavior of tximeta::addIds slightly over the weekend: if you do addIds(se, gene=TRUE) when rownames(se) are ENST, it will use the gene_id column to map to IDs, e.g. gene symbol
Michael Love (09:07:40): > this new mechanism gene=TRUE gives a lot more successful symbol mappings
Charlotte Soneson (09:10:14): > awesome:+1:We were actually just thinking here - is this the way you envision that gene symbols will/should be added in the future, or were you thinking about adding them automatically (for Gencode, that is)?
2019-03-13
Michael Love (03:18:20): > Great work on ARMOR
Michael Love (03:18:32): > Maybe promote on #general?
Michael Love (03:19:35) (in thread): > Yes I’m going to leave it to users to run addIds(), bc not all organisms will have SYMBOL I think
Charlotte Soneson (03:20:55) (in thread): > Ok, cool.
Charlotte Soneson (03:21:26): > Thanks! Yes, I could do that - it’s strongly Bioc-centric:blush:
2019-06-20
Sanjeev Sariya (17:36:47): > @Sanjeev Sariya has joined the channel
2019-08-07
Michael Love (13:23:07): > @Avi Srivastava and I have been working on the importer for Alevin, and to speed it up we’ve moved from R to C++. Avi has written an importer that is 40x faster than our R version and ~3x more memory-efficient
Michael Love (13:23:41): > the importer lives in fishpond, which is our general-purpose Salmon2Bioc package, containing swish and other future utilities
Michael Love (13:25:10): > i wanted to see if our Bioc experts such as @Martin Morgan might have any suggestions as to avoiding copies of objects when passing btwn C++/R
Martin Morgan (13:25:13): > @Martin Morgan has joined the channel
Michael Love (13:26:03): > so the reader code that we just added to devel branch can be seen here:https://github.com/mikelove/fishpond/blob/master/src/readEDS.cpp
Michael Love (13:26:33): > Avi and I can also share a dataset we’ve been using to benchmark performance if anyone’s interested in kicking the tires (or we can benchmark ideas ourselves)
Avi Srivastava (13:32:18): > Just to add to @Michael Love’s point, the specific problem we are facing wrt the C++ -> R communication is the copy of the Eigen::SparseMatrix object here: https://github.com/mikelove/fishpond/blob/master/src/readEDS.cpp#L111. > > In one case, if the spMat object of class Eigen::SparseMatrix has 3.3G of data, as soon as we wrap it into an S4 object to share with R using S4 Xout(wrap(spMat)); the memory usage doubles to 7G. My guess is that the data gets copied instead of moved into the Xout object. I was wondering if there is any memory-frugal way to forward the sparse matrix from C++ to R.
Michael Love (14:22:33): > i’m gonna pull one more person in - so @Aaron Lun pointed out we may be stuck here with that copy because of Eigen vs R memory management
Aaron Lun (14:22:36): > @Aaron Lun has joined the channel
Martin Morgan (14:24:20): > The memory isn’t owned by R so the data has to be copied. A different strategy would be to allocate R memory and fill it, rather than Eigen::SparseMatrix. If you were representing this as a dense R matrix() this would be ‘easy’, but I think you’re aiming for a sparse matrix from the Matrix package, and I don’t think there is a clean way to directly access the memory used there.
Aaron Lun (14:24:59): > It can be done, depending on exactly how you want to do it - beachmat directly accesses dgCMatrix slots, for example.
Aaron Lun (14:25:15): > Bit messy, but well, that’s what we gotta do.
Michael Love (14:25:20): > ah maybe we should look at beachmat
Martin Morgan (14:25:28): > (Matrix::sparseMatrix() looks like a really inefficient implementation)
Michael Love (14:26:14): > our initial R-based attempt looks over the data and preallocates vectors, which get fed to sparseMatrix
Aaron Lun (14:26:41): > Check out the sparse matrix writers in beachmat.
Aaron Lun (14:26:58): > If you know what the vector length is beforehand, you can skip the bits where I use deques.
Aaron Lun (14:27:45): > If you’re not already aware of it, there’s a difference between dgTMatrix and dgCMatrix; the former is easier to fill but the latter is more efficient to use.
Michael Love (14:28:07): > yup
Michael Love (14:28:55): > we went with dgTMatrix at one point, now i think we’re doing dgCMatrix as part of the conversion from Eigen
Michael Love (14:28:57): > we’re relatively happy with our implementation, but always looking to improve. what we have now is smaller on disk than the matrix market exchange sparse format (which Alevin can also export optionally), a little less memory to read in the data compared to Matrix::readMM, and also ~2x faster
Michael Love (14:30:29): > it’s taking something like 7 s to read in 30k cells with Avi’s reader compared to ~18 s with readMM
Michael Love (14:31:45): > Avi is now profiling 300k and 3M
Michael Love (14:32:27): > thanks for feedback@Martin Morganand@Aaron Lun, we’ll keep working on it
Aaron Lun (14:32:58): > Oh, 3M cells will probably blow up the dgCMatrix.
Avi Srivastava (14:33:14): > Yep thanks@Martin Morganand@Aaron Lun!
Michael Love (14:33:17): > (once these packages propagate to the build machines, this is what you’ll get with tximport(files, type="alevin") in the devel branch)
Avi Srivastava (14:33:25) (in thread): > Yep it is !
Aaron Lun (14:34:04) (in thread): > https://support.bioconductor.org/p/123265/#123275. Don’t have a real solution if Matrix maintainers don’t care.
Michael Love (14:34:34): > and likewise, tximeta will wrap them into a SE with gene ranges, which can be made into an SCE
Avi Srivastava (14:37:04) (in thread): > haha that was crazy !
Martin Morgan (14:38:45) (in thread): > Do the Matrix maintainers not care?
Aaron Lun (14:40:14) (in thread): > Not entirely sure, actually. I remember emailing them about a few things, to which I didn’t get a response, can’t remember whether this was one of them.
Aaron Lun (14:42:18) (in thread): > But I can kind of empathize if this was placed in the “too hard” basket, because p would need to be numeric rather than integer, and that could break a whole lot of things.
Aaron Lun (14:50:15) (in thread): > And I should also add that MMaechler did respond to a few of my questions, so I really don’t have a handle on what was fixed and what was not.
Aaron Lun (14:50:25) (in thread): > Well, not from memory, anyway.
2019-08-08
Aaron Lun (14:20:26): > Just listened to a talk from Laraib, who’s interning here.
Aaron Lun (14:20:37): > Was thinking, “This sounds familiar.”
2019-12-11
Aaron Lun (11:38:12): > @Aaron Lun has left the channel
2020-02-14
Andrew Skelton (05:09:17): > @Andrew Skelton has joined the channel
2020-04-04
Michael Love (09:58:36): > note of changes to tximport in devel (and tximeta on the way): alevinArgs is a new list argument that will summarize all the alevin-specific args, e.g. filterBarcodes, tierImport, and forceSlow => want @Charlotte Soneson to see this most of all…
Charlotte Soneson (09:59:43): > Noted!:slightly_smiling_face:
Michael Love (10:02:42): > yeah, just wanted to clean up the docs a bit and now the alevinArgs can grow as needed without inflating the man page
Avi Srivastava (11:04:24): > :grin:
Michael Love (13:15:00): > ok now tximeta also updated in devel: > > se <- tximeta(coldata, type="alevin", dropInfReps=TRUE, alevinArgs=list(tierImport=TRUE)) >
Michael Love (13:19:14): > here are the new args described in ?tximport
- File (PNG): Screen Shot 2020-04-04 at 1.18.07 PM.png
Michael Love (13:19:19): - File (PNG): Screen Shot 2020-04-04 at 1.18.30 PM.png
Michael Love (13:21:03): > i’m gonna make it more clear you can specify any number of them, don’t need to specify all
2020-04-17
Avi Srivastava (13:59:14): > @Michael Love how do I enforce not loading the inf mean/var matrices, even if the matrices are present in the output folder?
Michael Love (14:58:56): > i don’t have an argument for that, since it only slows things a bit (nothing like inf reps)
Michael Love (14:59:13): > is the reason just to make it fast
Michael Love (14:59:14): > ?
Michael Love (14:59:33): > i feel like we want people to see the mean and var matrices… it’s kind of a feature not a bug?
Avi Srivastava (15:10:38): > Yes, I agree it’s a feature. However, for example, there was an alevin run with bootstraps but for some quick analysis I just needed the MLE estimates. It loads all three matrices: mean, var and the count. It’s fast, no doubt :slightly_smiling_face: but it can become painful with big matrices. How does having an argument in alevinArgs sound? like ImportBootstraps
Avi Srivastava (15:11:29): > default to TRUE
Michael Love (15:11:40): > i can put it in, but want to also test and dont have time before release I think
Avi Srivastava (15:11:53): > Not a problem. no rush on this one
Michael Love (15:12:20): > ok, wanna make a tximport issue on GH?
Michael Love (15:12:42): > commits are halted this weekend
Michael Love (15:12:50): > Release is Tuesday
Avi Srivastava (15:12:53): > yep ! I can make a PR as well, if it helps.
Michael Love (15:13:04): > issue is fine with me
Avi Srivastava (15:13:08): > Lovely !
Michael Love (15:13:09): > i’ll just get to it in next cycle
Michael Love (15:13:17): > :thumbsup:
Avi Srivastava (15:13:23): > Yea, not a high priority
2020-05-14
Tim Triche (14:23:47): > @Tim Triche has joined the channel
Ben Johnson (14:29:50): > @Ben Johnson has joined the channel
Ben Johnson (14:30:36): > :wave:
Michael Love (14:31:26): > :wave:
Michael Love (14:32:02): > (i’ve got kids this afternoon, but will check in later today, and see if we can think of a nice generalizable solution)
2020-05-18
Ben Johnson (09:16:25): > hi @Michael Love - hope you’re hanging in there through all of this
Ben Johnson (09:17:11): > one kink I’m running into with tximeta is that if we have a failed cell using something like Smart-seq, generating the final object will also fail
Ben Johnson (09:17:27): > the work around I’ve been using is to import individual cells and then bind ’em up at the end
Ben Johnson (09:17:55): > which, as the number of cells goes up, becomes somewhat memory intensive
Ben Johnson (09:18:50): > do you think it would be worth having some sort of prefilter in place to make sure we don’t have outright failures when generating the final SummarizedExperiment or leave that to the user to make sure things are clean prior to import?
Michael Love (09:24:16): > so help me think through this: how would software know that the cell failed until it tries to import it?
Michael Love (09:24:23): > is there another way to know this?
Ben Johnson (09:26:57): > so I think that was my question: when tximeta is importing a bunch of cells, is there a way to check each cell has a non-zero length before adding it into the counts, abundance, etc. assay slots when generating the SummarizedExperiment?
Michael Love (09:27:48): > so can you describe the file? it’s empty?
Michael Love (09:28:03): > it’s an empty file written by Salmon?
Ben Johnson (09:29:23): > yeah maybe a little more context - I’m using eisaR to generate spliced and unspliced reference txome, align/quant with salmon (100 bootstraps included), and then import with tximeta
Ben Johnson (09:32:45): > a handful of cells have too few tx counts, import okay, then ultimately have incorrect numbers of columns relative to the other cells leading to a “list object cannot be coerced…” error
Ben Johnson (09:35:56): > also, rereading my initial question, good grief… that was as clear as mud. my apologies!
Michael Love (09:39:48): > so the idea would be to pre-scan all files but not import them?
Michael Love (09:39:52): > like scan the hash?
Michael Love (09:41:06): > the hash would work i think, this could be an option
Ben Johnson (09:44:55): > :thumbsup:
Ben Johnson (09:45:12): > trying to put together a reproducible example
Michael Love (09:46:18): > i can pretty easily add an argument to pre-scan the hashes across all files first
Michael Love (09:46:41): > but i’m wondering if this should just be default
Ben Johnson (09:47:33): > would the hash also cover things if the bootstraps failed (very naive question)?
Ben Johnson (09:51:40): > I guess even more naive question that’s more salmon related, if a cell has very few reads, can the bootstrapping fail?
Ben Johnson (09:56:19): > as in, the initial point estimates are okay, but the bootstrapping gets weird if the number of reads is sufficiently low/low numbers of transcripts with non-zero counts?
Rob Patro (09:58:59): > @Avi Srivastava:point_up:any thoughts. I don’t think salmon should “fail” in the bootstrap. Any thoughts about alevin bootstrap with low reads / cell?
Michael Love (10:03:32): > yeah i’m still confused about the reason a sample is not importing, the hash value won’t help here
Michael Love (10:03:54): > the hash is a signature of the transcripts.fa
Michael Love (10:04:52): > i was confused earlier, i mis-read you, thinking that a sample had too few transcripts, e.g. a different txome was used. now I see it’s a boot issue
Michael Love (10:05:58): > i’d also like to point out this new functionality to Ben, which we’ve been exploring very recently in the lab
Michael Love (10:07:09): > this may or may not work to resolve this particular bug, but we have found that if you just need marginal uncertainty information, you can save a lot of time and memory just using the mean and variance of boots
Michael Love (10:07:10): > https://github.com/mikelove/fishpond/blob/master/R/helper.R#L229-L246
Michael Love (10:07:23): > marginal meaning, not the covariance across transcripts
Michael Love (10:07:34): > but just the mean and variance of inf reps for a single txp at a time (or gene at a time for alevin)
Michael Love (10:08:25): > if you want to take transcript data and summarize to gene, you should use the true inf reps though, just adding this to docs actually
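[Editor's note: a sketch of the marginal summary being described, assuming 'se' is a SummarizedExperiment whose inferential replicates were imported as assays named infRep1, infRep2, … (tximport's naming convention); the fishpond helper linked above is the real implementation, this is only an illustration of the idea.]

```r
library(SummarizedExperiment)
# collect the inferential replicate assays
reps <- as.list(assays(se)[grep("^infRep", assayNames(se))])
# element-wise mean across replicates
assays(se)[["mean"]] <- Reduce(`+`, reps) / length(reps)
# element-wise sample variance across replicates
sq <- lapply(reps, function(m) (m - assays(se)[["mean"]])^2)
assays(se)[["variance"]] <- Reduce(`+`, sq) / (length(reps) - 1)
```

With the mean and variance stored, the (often much larger) set of per-replicate matrices can be dropped if only marginal uncertainty is needed.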
Ben Johnson (10:18:09): > my sincerest apologies for the confusion
Ben Johnson (10:18:59): > totally recognize that this is new functionality and it is incredible to be able to do this sort of thing
Ben Johnson (10:20:20): > the data we are working on are deeply sequenced total scRNA-seq (save a couple cells) - 378 cells total
Avi Srivastava (10:21:18): > I have a couple of thoughts: > 1.) I think Ben is using Smart-seq data, which is a full-length technology. Assuming he is using a non-V3 protocol, he ran salmon on each cell independently, unlike alevin. > 2.) Salmon or alevin, I think the bootstrapping should not fail, unless we are talking about some cells with a super low number of reads. One way to counter that is to first perform some sort of filter, i.e. there is no point in quantifying if the # reads is below some threshold. We can still end up having no quants if no reads mapped; the thing to check there would be: are we using the right reference / index? > 3.) As Ben mentioned he is trying to quantify spliced/unspliced reads, I wonder how many unspliced reads we expect in the SMART-seq tech. Ideally the fraction of reads should be similar to droplet-based techs, but these are different techs, which makes me curious if we expect a similar fraction of unspliced, in case we want to do RNA velocity.
Ben Johnson (10:21:21): > we are finding that if you use the inferential replicates, it can make a big difference when computing RNA velocity vs the original gene counts in the absence of the inf reps
Ben Johnson (10:21:56): > @Avi Srivastava yes, we are almost 50/50 in unspliced and spliced fractions for passing cells
Ben Johnson (10:22:05): > that is, non-zero counts
Ben Johnson (10:22:24): > no lower bound on that (unlike velocyto with I think 80 counts being the floor?)
Ben Johnson (10:23:53): > yes, I ran salmon on each cell independently with 100 bootstraps, but it’s also sounding like I should use Gibbs after talking with Tim/Rob
Ben Johnson (10:25:12): > I’m reimporting each cell individually into a list object right now using tximeta and will report back which cells “fail” - the main symptom being that the assay slots are not equivalent across all cells, leading to the possible bootstrapping question
Ben Johnson (10:26:08): > thus, if I try and import all cells up front, the error occurs and am speculating that it is because the matrices are not the same dims to generate the final SummarizedExperiment object
Avi Srivastava (10:29:12): > Yep, thanks for reporting this. It’s hard to pinpoint right now, but it’d be interesting to know what’s going wrong here. It’s possible that salmon failed and we can add checks for that, but if you can figure out which cells failed and share the salmon logs for a couple of them, it’ll help solve things faster!
Avi Srivastava (10:29:51): > Off the top of my head, the number of rows i.e. number of txps and the number of columns i.e. # bootstraps should be the same for all cells, so the behavior is indeed a bit wonky
Avi Srivastava (10:30:11): > Unless we somehow remove all zero rows and columns
Ben Johnson (10:31:25): > thanks so much for everyone’s help and again my apologies for the confusion/frustration due to my lack of clarity!
Ben Johnson (10:31:45): > will definitely send salmon logs, code used to import, other details shortly
Ben Johnson (10:31:54): > looks like it’s only 2 cells
Michael Love (10:32:35): > sounds good
Rob Patro (10:32:41): > If you can share the data for the cells that fail, we can def dig into that
Michael Love (10:32:50): > thanks for report Ben:slightly_smiling_face:
Ben Johnson (10:38:36): > okay, so the short version is that there are 0 counts for these two cells (possible contaminating cells?! or maybe just had crappy libs…) and thus, the assay slots for these cells are: counts, abundance, and length
Ben Johnson (10:39:06): > while the others have the inferential rep slots
Ben Johnson (10:39:13): > now to pull the logs and forward them on
Ben Johnson (10:39:21): > better to open an issue on github?
Michael Love (10:44:24): > so the short fix on the tximeta side would be to scan to make sure there are bootstraps for all samples if dropInfReps=FALSE
Michael Love (10:44:59): > yeah if you can throw up the sample dir for one of these that would help
Michael Love (10:45:27): > i just need to figure out where tximeta could figure out that these samples won’t merge properly
Ben Johnson (10:46:04): > putting together a dropbox link to share
Rob Patro (10:57:28): > great!
Ben Johnson (12:10:43): > let me know if this doesn’t work:https://www.dropbox.com/sh/4dq9wwavi412pbv/AACCGtmz_o83atm-t3urpOrXa?dl=0 - Attachment (Dropbox): salmon_tximeta_smartseq_bkj > Shared with Dropbox
Ben Johnson (12:11:27): > there are two cells that failed and I also included 2 that ‘passed’ - along with the code used to prep the txome with spliced/unspliced
Ben Johnson (12:11:36): > and then import into R via tximeta
Ben Johnson (12:13:53): > I can make a tarball of all of it too now that I think about it…
Avi Srivastava (12:14:21): > > [2020-04-18 22:29:04.666] [jointLog] [warning] salmon was only able to assign 1 fragments to transcripts in the index, but the minimum number of required assigned fragments (--minAssignedFrags) was 10. This could be indicative of a mismatch between the reference and sample, or a very bad sample. You can change the --minAssignedFrags parameter to force salmon to quantify with fewer assigned fragments (must have at least 1). > Ah! one obvious way to fix this is to use --minAssignedFrags 1, although something is really off for this cell as the fastq file had only one read; can you check if the fastq was downloaded properly for these cells?
Ben Johnson (12:15:10): > yeah I’m now worried it may be even further upstream since we have to trim these due to Takara’s library prep
Ben Johnson (12:15:43): > yes, the checksums did pass when I downloaded the files
Avi Srivastava (12:16:43): > Yea, then it’s worth checking the size of the fastq pre/post filtering.
Avi Srivastava (12:17:01): > @Michael Love one obvious way to tackle this on tximeta is through the json
Ben Johnson (12:18:02): > and I also promised @Charlotte Soneson and @Michael Stadler and @Avi Srivastava processed R objects for this
Ben Johnson (12:18:14): > I’ll add those to the dropbox too
Charlotte Soneson (12:18:46): > Not related to the import issues, but just to rule out…we saw similar things (0 or 1 assigned fragments) for some of our Smart-Seq2 data, since some of the wells were deliberately left empty for control purposes.
Michael Love (12:27:52) (in thread): > e.g. num_bootstraps is 0 in meta_info.json
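[Editor's note: a sketch of the pre-scan being proposed, assuming salmon's usual output layout where num_bootstraps lives in aux_info/meta_info.json next to quant.sf; this is not tximeta's actual implementation, just an illustration of the check.]

```r
library(jsonlite)
# 'files' = paths to quant.sf files, one per cell/sample
nboot <- vapply(files, function(f) {
  meta <- fromJSON(file.path(dirname(f), "aux_info", "meta_info.json"))
  meta$num_bootstraps
}, numeric(1))
# samples with 0 bootstraps would break inf-rep import downstream
which(nboot == 0)
```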
Ben Johnson (12:29:39) (in thread): > looks like there were a few thousand reads there and then got tossed after trimming for being too short.
Ben Johnson (12:30:39) (in thread): > so that was definitely on me to be more careful with which cells I actually send on for analysis
Ben Johnson (12:30:56) (in thread): > lesson learned here:slightly_smiling_face:
Michael Love (12:50:57): > trying out something in 1.7.4
Michael Love (12:51:02): > > > se <- tximeta(files, type="salmon") > > NOTE: inferential replicate number not equal across files, > may lead to errors in object construction, unless 'dropInfReps=TRUE' > > NOTE: the following file numbers have 0 bootstraps: > 3,4 > > importing quantifications > reading in files with read_tsv > 1 2 3 4 Error in tximport(files, type = type, txOut = TRUE, ...) : > Note: not all samples contain inferential replicates. > tximport can only import data when either all or no samples > contain inferential replicates. Instead first subset to the > set of samples that all contain inferential replicates. > Calls: tximeta -> tximport >
Michael Love (12:51:18): > can you try on your end:https://github.com/mikelove/tximeta
Michael Love (12:51:37): > for various reasons I use a message here
Michael Love (12:51:52): > (warning holds until end of run, so defeating the point)
Michael Love (12:52:08): > (and dropInfReps is in ellipses, so easiest to just print a message at the start)
Ben Johnson (15:23:32): > appears to work for me! > > > txi <- tximeta::tximeta(coldata = data.frame( > + names = cell_names, > + files = quant_files, > + stringsAsFactors = FALSE > + ), type = "salmon") > > NOTE: inferential replicate number not equal across files, > may lead to errors in object construction, unless 'dropInfReps=TRUE' > > NOTE: the following files (by #) have 0 bootstraps: > 33,40 > > importing quantifications > reading in files with read_tsv > 1 2 3 4 … 378 > found matching linked transcriptome: > [ GENCODE - Homo sapiens - release 33 ] > building TxDb with 'GenomicFeatures' package 
> Import genomic features from the file as a GRanges object ... OK > Prepare the 'metadata' data frame ... OK > Make the TxDb object ... OK > generating transcript ranges > Error in stopifnot(all(sapply(infReps, ncol) == nreps)) : > 'list' object cannot be coerced to type 'integer' >
Ben Johnson (15:23:50): > will follow-up seeing what happens if we dropInfReps=T
Ben Johnson (15:28:54): > also looks good and builds the object, no problem! > > > txi <- tximeta::tximeta(coldata = data.frame( > + names = cell_names, > + files = quant_files, > + stringsAsFactors = FALSE > + ), type = "salmon", dropInfReps = T) > > NOTE: inferential replicate number not equal across files, > may lead to errors in object construction, unless 'dropInfReps=TRUE' > > NOTE: the following files (by #) have 0 bootstraps: > 33,40 > > importing quantifications > reading in files with read_tsv > 1 2 3 4 … 378 > found matching linked transcriptome: > [ GENCODE - Homo sapiens - release 33 ] > 
loading existing TxDb created: 2020-05-18 19:20:51 > [package startup and masking messages omitted] > loading existing transcript ranges created: 2020-05-18 19:20:53 > fetching genome info for GENCODE >
Ben Johnson (15:29:09): > > txi >
Ben Johnson (15:29:12): > > class: RangedSummarizedExperiment > dim: 424579 378 > metadata(6): tximetaInfo quantInfo ... txomeInfo txdbInfo > assays(3): counts abundance length > rownames(424579): ENST00000456328.2 ENST00000450305.2 ... > ENST00000435945.1-U ENST00000387347.2-U > rowData names(3): tx_id gene_id tx_name > colnames(378): A10 A11 ... P8_S320 P9_S232 > colData names(1): names >
Michael Love (16:01:36): > i wonder why i got an informative message and you did not
Michael Love (16:01:40): > the error that is
Michael Love (16:01:58): > > Error in tximport(files, type = type, txOut = TRUE, ...) : > Note: not all samples contain inferential replicates >
Michael Love (16:02:52): > this code was added in devel branch around Jan it looks like
Ben Johnson (16:02:52): > hm, that might be my fault if I need to update tximport as well (I only grabbed tximeta)
Michael Love (16:02:57): > yup
Michael Love (16:03:02): > ok solved
Michael Love (16:03:32): > so now, message up front, and informative error at the end
Michael Love (16:03:37): > save people some headaches
Ben Johnson (16:03:58): > this is awesome. you’re awesome. thanks so much!
Michael Love (16:04:10): > thanks for the report
2020-06-16
jessi elderkin (09:53:59): > @jessi elderkin has joined the channel
Jordan Veldboom (10:31:32): > @Jordan Veldboom has joined the channel
2020-06-18
pamela himadewi (09:54:44): > @pamela himadewi has joined the channel
2020-07-18
Michael Love (07:42:09): > @Charlotte Soneson @Avi Srivastava I almost forgot to talk about alevinQC:face_palm:… just included it now: https://github.com/mikelove/alevin2bioc/commit/4ed1bbb9e24bfe095b39486924624ae0bd25c4aa - let me know if you think of anything more I should say here
Charlotte Soneson (08:48:35): > @Michael Lovelooks good to me:slightly_smiling_face:
2020-07-19
Vince Carey (06:48:11): > I had a quick look at this document – I think “EM” is used without definition … could use a sentence of explanation.
Michael Love (07:00:21): > Thanks. Good point, I’ll take a shot Vince
2020-07-21
James MacDonald (15:48:05): > @James MacDonald has joined the channel
2020-07-31
bogdan tanasa (14:05:24): > @bogdan tanasa has joined the channel
2020-09-01
Michael Love (13:59:55): > Added two new functions to tximeta: addCDS and retrieveCDNA: https://github.com/mikelove/tximeta/blob/master/NEWS#L1-L12
Charlotte Soneson (14:22:01): > :tada:
2020-09-10
Michael Love (10:13:29): > Another new feature: a fromDb argument for addIds(): > > lets you pull from the TxDb/EnsDb for adding IDs (rather than from the org package) https://github.com/mikelove/tximeta/blob/master/man/addIds.Rd#L15-L18
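[Editor's note: a quick usage sketch of the new argument; the column name here is illustrative, see ?addIds for the actual interface.]

```r
library(tximeta)
# pull SYMBOL from the TxDb/EnsDb backing 'se',
# rather than from the organism annotation package
se <- addIds(se, column = "SYMBOL", fromDb = TRUE)
```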
Rob Patro (21:36:44): > w00h00!
2020-09-11
Tim Triche (15:21:36): > hey just wanted to add that I wrote some wrappers to do species-mixed velocity indexing
Tim Triche (15:22:13): > it is utterly dependent upon prepending the assembly name to the seqnames but it seems to work OK. Curious if this belongs in eisaR or tximeta
Michael Love (15:58:57): > so, tximeta has a lot of stuff tied to chrom lengths, what do you do there?
Michael Love (15:59:07): > do you frankenstein the output from UCSC?
Michael Love (16:01:57): > i feel like Charlotte will know best about the infrastructure and the bridge from eisaR to tximeta, so i’ll be interested in her feedback on where species-mixed velocity work should go. another thing is we have test code in tximeta for all the functions, and would want to be able to have test coverage for something funky like species mixing + velocity
Michael Love (16:02:21): > not to say your science is funky:stuck_out_tongue:
Charlotte Soneson (16:38:09): > To say where it would belong I think it would be helpful to see how these wrappers are constructed/what is wrapped. Do you have an example?
Tim Triche (17:14:20): > sure, let me finish testing it on gencode 35+M25
2020-09-16
Tim Triche (16:20:32): > @Hervé Pagès this is new: > > seqs <- GenomicFeatures::extractTranscriptSeqs(x = genome, transcripts = grl) > Error during wrapup: unable to find an inherited method for function 'getSeq' for signature '"DNAStringSet"' > packageVersion("GenomicFeatures") > [1] '1.41.3' >
Hervé Pagès (16:20:36): > @Hervé Pagès has joined the channel
Tim Triche (16:22:30): > nb. @Michael Love not to say your science is funky – oh, but it is
Hervé Pagès (16:58:38) (in thread): > mmh I don’t remember touching the imports in GenomicFeatures or Biostrings or BSgenome. Anyway, I can’t reproduce, and if I can’t reproduce I can’t help you. Please file an issue on GitHub with a reproducible example, sessionInfo(), etc. Thx
2020-09-17
Tim Triche (08:45:00) (in thread): > will do – this is from a call within eisaR. Looking into whether the problem exists when I use a BSgenome for the same purpose, since that would presumably be a lot easier reprex to handle:slightly_smiling_face:
2020-10-02
Tim Triche (08:47:29): > nb. I got a bit delayed working on the above due to the issue above in eisaR/GenomicFeatures (will patch post-new-package-deadline)
Tim Triche (08:50:11): > however I was wondering, is there a tool for salmon/alevin similar to rsem-tbam2gbam? I realized that RSEM does transcriptome-BAM to genome-BAM mapping, and in principle any selectively-aligned salmon/alevin BAM should do the same (especially since @Avi Srivastava added --writeMappings and --dumpFeatures to the code). I asked @Rob Patro about this but then realized, hey wait a minute, maybe there’s code out there already that could be used for this purpose (without creating a bunch of liftOver chains or similar nonsense). Alternatively, Caleb Lareau’s easyLift package could potentially be used to merge this with tximeta’s internal gearing w/r/t coordinates and such, which might be just as well.
Tim Triche (08:50:54): > I’ve been using SnapATAC far too much lately (scRNA/scATAC from 10X/BioRad via Alevin/Bap2, which means that nothing works without fiddling)
Tim Triche (08:52:05): > Given that tximeta may need to be modified for cross-species velocity anyways, I’m thinking about whether it could also be modified to go from txome <-> genome based on the metadata it knows already.
Tim Triche (08:52:34) (in thread): > followup: the same problem existed with BSgenome
Michael Love (09:50:27): > I don’t follow the last part, you want the ranges tximeta provides moved from genome to a diff coordinate?
Michael Love (09:51:42): > I think that Bioc needs to be modified for cross species basically, right? What is the container that holds chromosomes with same name across organisms with different genome?
Tim Triche (09:55:04): > the kludge is simple, you use either Genbank identifiers or genome_contig (e.g. GRCh38_1)
Tim Triche (09:56:04): > 1:1 for GA4GH as it happens once you fingerprint them
Tim Triche (09:56:52): > recall that genome() can take different values within a GRanges etc. – this is not nearly as nasty as you might expect
Tim Triche (09:57:42): > but my point above was just that, within a transcriptome, tximeta has mappings to the sequence and splice junctions within the source genome anyways
Tim Triche (09:57:58): > I wasn’t really thinking of cross species txome->genome, that was just loose ends from before
Michael Love (09:58:25): > but what is the ask there for tximeta?
Michael Love (09:58:46): > it doesn’t have access to mappings, and processing those in R would be super slow…?
Tim Triche (09:58:50): > I guess I didn’t think through the ask – it may not be an ask at all
Michael Love (09:58:54): > :smile:
Tim Triche (09:59:01): > tximeta doesn’t know about genomic coordinates?
Michael Love (09:59:21): > it does, but are you asking if tximeta can translate txome mappings to genomic?
Michael Love (09:59:24): > it doesn’t have mappings
Michael Love (09:59:28): > i think i’m lost:stuck_out_tongue:
Tim Triche (09:59:41): > yeah, and I think it’s an Rhtslib kludge to do it – will document later
Michael Love (09:59:55): > for cross species, can this be handled within linkedTxome?
Michael Love (10:00:02): > you have to generate a new GTF anyway
Tim Triche (10:00:10): > (re: mapping: this isn’t nearly as slow as you think, I’ve been doing it with bioawk)
Tim Triche (10:00:28): > anyways I have some bullshit admissions committee meeting, will poke head in tomorrow after the BioC-3.12 deadline:slightly_smiling_face:
Michael Love (10:00:32): > I believe that’s the path that@Charlotte Sonesontook with exon + intron
Michael Love (10:00:37): > ok:slightly_smiling_face:
Tim Triche (10:01:23): > ?tximeta::linkedTxome > > linkedTxome package:tximeta R Documentation > > Make and load linked transcriptomes ("linkedTxome") > > Description: > > For now, for details please see the vignette > 'inst/script/linked.Rmd' >
> :joy:
Tim Triche (10:01:45): > you’re more detailed even with your nondocumentation than I am:slightly_smiling_face:
Tim Triche (10:02:38): > the examples are enough that I think I can gin up a cross-species linkedTxome for both purposes (presuming eisaR will play nice regarding construction)
Charlotte Soneson (10:04:14): > https://community-bioc.slack.com/archives/CDW69P8UT/p1601647232021800. Indeed - search for makeLinkedTxome here for example: https://combine-lab.github.io/alevin-tutorial/2020/alevin-velocity/ - Attachment: Attachment > I believe that’s the path that @Charlotte Soneson took with exon + intron - Attachment (combine-lab.github.io): Alevin Velocity > RNA Velocity with alevin
Michael Love (10:05:14) (in thread): > fixing!
Michael Love (10:12:02) (in thread): > thanks for pointing this out!
Michael Love (10:27:05): > ok new docs
Michael Love (10:27:07): > > #' Make and load linked transcriptomes ("linkedTxome") > #' > #' \code{makeLinkedTxome} reads the checksum associated with a Salmon > #' index at \code{indexDir}, and links it to key information > #' about the transcriptome, including the \code{source}, \code{organism}, > #' \code{release}, and \code{genome} (these are custom character strings), > #' as well as the locations (e.g. local, HTTP, or FTP) for one or more \code{fasta} > #' files and one \code{gtf} file. \code{loadLinkedTxome} loads this > #' information from a JSON file. See Details. > #' > #' \code{makeLinkedTxome} links the information about the transcriptome > #' used for quantification in two ways: > #' 1) the function will store a record in tximeta's cache such that > #' future import of quantification data will automatically access and > #' parse the GTF as if the transcriptome were one of those automatically > #' detected by tximeta. Then all features of tximeta (e.g. summarization > #' to gene, programmatic adding of IDs or metadata) will be available; > #' 2) it will by default write out a JSON file > #' that can be shared, or posted online, and which can be read by > #' \code{loadLinkedTxome} which will store the information in tximeta's > #' cache. This should make the full quantification-import pipeline > #' computationally reproducible / auditable even for transcriptomes > #' which differ from those provided by references (GENCODE, Ensembl, > #' RefSeq). >
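[Editor's note: a usage sketch matching the docstring above; the paths and metadata strings here are placeholders, not values from the thread.]

```r
library(tximeta)
makeLinkedTxome(
  indexDir = "my_salmon_index",      # Salmon index dir; its checksum is read
  source   = "GENCODE",              # custom character strings describing
  organism = "Homo sapiens",         # the transcriptome
  release  = "33",
  genome   = "GRCh38",
  fasta    = "my.transcripts.fa.gz", # one or more fasta locations
  gtf      = "my.annotation.gtf.gz", # one gtf location
  write    = TRUE                    # also write the shareable JSON
)
# a collaborator can then load the shared record from the JSON:
# loadLinkedTxome("my_txome.json")
```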
Tim Triche (12:34:24): > nb @Charlotte Soneson the errors that eisaR/GenomicFeatures were throwing were from my following the Alevin Velocity tutorial
Tim Triche (12:34:47): > I will have more time tomorrow onward to make a reprex or discover that the problem is fixed in devel (will report back)
Charlotte Soneson (12:35:04): > :+1:
2020-10-04
Hervé Pagès (20:15:46) (in thread): > you still didn’t tell me how I can reproduce this
2020-10-05
Tim Triche (13:35:41) (in thread): > currently with a small GTF and a BSgenome it works, looking to see whether the issue is the GTF or the FA (in the alevin velocity example) or whether it has been fixed at some point in the past couple weeks
Tim Triche (14:42:41) (in thread): > the simplest example that can possibly work does not, at present, succeed in provoking the error. I’ll try provoking it a bit harder:wink:and see if it really is a size- or memory-dependent issue - File (R): velo_reprex.R - File (Plain Text): gencode.v35.annotation.sub.gtf - File (Binary): chr1.sub.fa
Hervé Pagès (15:11:13) (in thread): > An unable to find an inherited method for function 'getSeq' for signature '"DNAStringSet"' error is unlikely to have anything to do with the size of the data, but what do I know. Anyway, when you have a reproducible example, please open an issue on GitHub and provide all the details there. Thx!
Tim Triche (15:51:15) (in thread): > Will do. I thought it would be easier than this:confused:
2020-10-06
Tim Triche (16:29:52): > working from a known-good (single-species) transcriptome, is there a way to provide a GTF in the event one is not found?
Tim Triche (16:31:38): > e.g. suppose I want to brute-force import and merge a few Alevin quants against a splice/unspliced transcriptome. A little function like this helps: > > import_one_txi <- function(cdat) { > txi_cdat <- data.frame(names=cdat['names'], > files=cdat['files'], > stringsAsFactors=FALSE) > tximeta::tximeta(coldata=txi_cdat, type="alevin") > } >
Tim Triche (16:32:36): > now, outside of that, I’ve already run > > txome <- file.path(txpath, paste(txstub, "json", sep=".")) > tximeta::loadLinkedTxome(jsonFile=txome) >
> and it does indeed indicate that the linked txome is loaded into the local cache.
Tim Triche (16:34:19): > Attempting to load the quants throws a rather unhelpful error: > > R> import_one_txi(cdat[1,]) > importing quantifications > reading in alevin gene-level counts across cells with fishpond > found matching linked transcriptome: > [ GENCODE - Homo sapiens - release 33 ] > useHub=TRUE: checking for TxDb via 'AnnotationHub' > snapshotDate(): 2020-09-03 > did not find matching TxDb via 'AnnotationHub' > building TxDb with 'GenomicFeatures' package > Import genomic features from the file as a GRanges object ... Error in open.connection(con, open) : cannot open the connection > In addition: Warning messages: > 1: In for (i in seq_len(n)) { : > closing unused connection 3 (gencode.v33.annotation.withintrons.expanded.gtf) > 2: In open.connection(con, open) : > cannot open file 'gencode.v33.annotation.withintrons.expanded.gtf': No such file or directory > > Enter a frame number, or 0 to exit > > 1: import_one_txi(cdat[1, ]) > 2: #5: tximeta::tximeta(coldata = txi_cdat, type = "alevin") > 3: getTxDb(txomeInfo, useHub = useHub) > 4: makeTxDbFromGFF(txomeInfo$gtf) > 5: import(file, format = format, colnames = colnames, feature.type = GFF_FEATU > 6: import(file, format = format, colnames = colnames, feature.type = GFF_FEATU > 7: import(FileForFormat(con, format), ...) > 8: import(FileForFormat(con, format), ...) > 9: .local(con, format, text, ...) > 10: connection(m, con, "r") > 11: connectionForResource(manager, resource(x), open = open) > 12: open(con, open) > 13: open.connection(con, open) > > Selection: >
Tim Triche (16:34:28): > This seems like perhaps not the desired behavior?
Tim Triche (16:35:13): > I have a GTF generated as part of the prep per https://combine-lab.github.io/alevin-tutorial/2020/alevin-velocity/ - Attachment (combine-lab.github.io): Alevin Velocity > RNA Velocity with alevin
Tim Triche (16:35:34): > But it doesn’t seem like tximeta has an option to provide that during the import.
Tim Triche (16:37:02): > (The appropriate GTF file, as it happens, is in the working directory; but > * I would have thought GENCODE might be a supported transcriptome, and > * There doesn’t seem to be an option to provide the annotations for transcriptomes that aren’t already in AnnotationHub. > So this seems like a bit of a problem.)
Tim Triche (16:43:12): > Nevermind, figured it out: > > R> file.exists(rjson::fromJSON(file=txome)[[1]][["gtf"]]) > [1] TRUE >
> The GTF file has to be in the original working directory, it appears that pointing at a .json file under a path for the linkedTxome does not communicate this to tximeta. Or, alternatively, perhaps there’s an argument/flag that allows this but I haven’t found it yet. Seems like it would be handy:slightly_smiling_face:
Tim Triche (16:43:36): > > R> txis <- apply(cdat, 1, process_velo_txi) > importing quantifications > reading in alevin gene-level counts across cells with fishpond > found matching linked transcriptome: > [ GENCODE - Homo sapiens - release 33 ] > useHub=TRUE: checking for TxDb via 'AnnotationHub' > snapshotDate(): 2020-09-03 > did not find matching TxDb via 'AnnotationHub' > building TxDb with 'GenomicFeatures' package > Import genomic features from the file as a GRanges object ... OK > Prepare the 'metadata' data frame ... OK > Make the TxDb object ... OK > generating gene ranges > generating gene ranges > fetching genome info for GENCODE >
Tim Triche (16:43:54): > More in a bit. Working on a two-genome reprex and threw this error today for the first time.
Tim Triche (16:45:44): > This at least avoids it: > > gtf <- rjson::fromJSON(file=txome)[[1]][["gtf"]] > stopifnot(file.exists(gtf)) >
Michael Love (17:04:54): > I’m lost; what’s the issue with the location of the GTF? When specifying a GTF in a JSON, you should use the full path. Was that it?
Michael Love (17:11:43): > The original idea of the JSON was that it would point to eg a Zenodo URL that would work for everyone
Michael Love (17:11:53): > Local paths work, but relative paths won’t.
Michael Love (17:13:43): > tximeta wants to be able to recreate the TxDb if needed, I don’t have a concept of a one-time source GTF… tximeta thinks it will always have access to this file
Tim Triche (17:25:10): > right, I figured out why it doesn’t know about the relative path – it can’t
Tim Triche (17:25:55): > so stopifnot(file.exists(rjson::fromJSON(file=txome)[[1]][["gtf"]])) avoids wasting time importing a bunch of un-annotate-able quants
Tim Triche (17:26:13): > also, is splitSE typically very very slow?
Michael Love (17:42:05): > Ping CS on speed of splitSE; I’m a novice to velocity
Tim Triche (18:06:28): > @Charlotte Soneson is this fairly typical for splitSE? > > system.time(txis <- tximeta::splitSE(txi, cg, assayName = assayName)) > user system elapsed > 1146.965 1.048 1148.055 >
2020-10-07
Charlotte Soneson (01:49:32): > Depends on the size of the data set, but yes, it’s not exceedingly fast…
Tim Triche (08:45:00): > :thumbsup:
Tim Triche (08:45:06): > OK I’m not insane after all:slightly_smiling_face:
2020-10-10
Hervé Pagès (04:09:29): > @Hervé Pagès has left the channel
2020-10-11
Kozo Nishida (21:42:42): > @Kozo Nishida has joined the channel
2020-10-26
Tim Triche (09:16:56): > @Michael Love @Charlotte Soneson what does this mean? > > importing quantifications > reading in alevin gene-level counts across cells with fishpond > found matching linked transcriptome: > [ Ensembl - Mus musculus - release 101 ] > Error: BiocParallel errors > element index: 1, 2, 3, 4 > first error: not all 'rnames' found or unique. >
> It’s rather a bit of a mystery even after digging through the BiocFileCache codebase.
Tim Triche (09:18:05): > same thing with SerialParam() for the loader: > > linkedTxome is same as already in bfc > Processing CITE_SLN111_D1... > Importing... > importing quantifications > reading in alevin gene-level counts across cells with fishpond > found matching linked transcriptome: > [ Ensembl - Mus musculus - release 101 ] > Error in bfcrpath(bfc, txdbName) : not all 'rnames' found or unique. > In addition: Warning message: > In FUN(X[[i]], ...) : 'rnames' exact pattern > 'mm10.ens101.annotation.expanded.gtf' > is not unique; use 'bfcquery()' to see matches. >
Tim Triche (09:19:23): > I may just blow away my BiocFileCache directory and see if it persists. Which would be annoying, but oh well. Eventually I need this to work for merged cross-species transcriptomes (a fairly standard test of any novel single-cell assay, after all).
Charlotte Soneson (09:23:07): > Looks like you have multiple entries in your BiocFileCache with the name ‘mm10.ens101.annotation.expanded.gtf’ - could you check the BFC manually to confirm if that’s the case? https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/TroubleshootingTheCache.html#resource-id
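[Editor's note: one way to inspect and clean up duplicate rnames in tximeta's own cache, sketched from the rname in the error above. This assumes getTximetaBFC() returns a path; the bfcremove call is left commented out since removing the wrong rid would discard a valid entry.]

```r
library(BiocFileCache)
library(tximeta)

# tximeta's cache, which need not be the same as AnnotationHub's cache
bfc  <- BiocFileCache(getTximetaBFC())

# look for duplicated entries under the name from the warning message
hits <- bfcquery(bfc, "mm10.ens101.annotation.expanded.gtf", field = "rname")
hits[, c("rid", "rname", "create_time")]

# if more than one rid comes back, remove the stale duplicates, e.g.:
# bfcremove(bfc, rids = hits$rid[-which.max(hits$create_time)])
```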
Tim Triche (09:27:22): > > R> bfc <- BiocFileCache("/home/tim/.cache/AnnotationHub/") > R> bfcinfo(bfc) %>% dplyr::select(rname) > # A tibble: 11 x 1 > rname > <chr> > 1 annotationhub.sqlite3 > 2 annotationhub.index.rds > 3 AH5040 : 5040 > 4 AH5122 : 5122 > 5 AH53345 : 60083 > 6 AH53347 : 60085 > 7 AH78783 : 85529 > 8 AH75193 : 81939 > 9 AH75189 : 81935 > 10 AH83247 : 89993 > 11 AH83216 : 89962 >
Tim Triche (09:28:05): > > R> bfcquery(bfc, "mm10") > # A tibble: 0 x 10 > # … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>, > # access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>, > # last_modified_time <dbl>, etag <chr>, expires <dbl> >
Tim Triche (09:28:48): > Just in case a 1:1 match is required: > > R> bfcquery(bfc, query="mm10.ens101.annotation.expanded.gtf") > # A tibble: 0 x 10 > # … with 10 variables: rid <chr>, rname <chr>, create_time <dbl>, > # access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>, > # last_modified_time <dbl>, etag <chr>, expires <dbl> >
Charlotte Soneson (09:33:43): > Is that the cache directory used by tximeta? I.e., the one given by getTximetaBFC()?
Tim Triche (09:35:35): > ugh, no: > > R> getTximetaBFC() > [1] "/home/tim/.cache/BiocFileCache" >
Tim Triche (09:35:42): > Time to blow that away too, I guess
Tim Triche (09:36:25): > Ugh! There are a bunch of Conda environments in there
Tim Triche (09:36:43): > how do I figure out which parts to delete without having to rebuild huge conda envs
Tim Triche (09:37:01): > > [1] "1cc0b83f6caa36_Miniconda3-py37_4.8.3-Linux-x86_64.sh" > [2] "1d1012653256fa_Miniconda3-py37_4.8.2-Linux-x86_64.sh" > [3] "a911040945cc4_a911040945cc4.sqlite" > [4] "a911056021c79_a911056021c79.rds" > [5] "a911071f7a4d_a911071f7a4d.rds" > [6] "BiocFileCache.sqlite" > [7] "e8bde403f38a1_e75615b21756b_89993" > [8] "e8bde6027e94b_e8bde6027e94b.rds" > [9] "e8bdf403f38a1_e75615b21756b_89993" > [10] "e8bdf6027e94b_e8bdf6027e94b.rds" >
Tim Triche (09:38:30): > maybe I’ll just change it to a dedicated tximeta BFC directory
Tim Triche (09:40:15): > > importing quantifications > reading in alevin gene-level counts across cells with fishpond > found matching linked transcriptome: > [ Ensembl - Mus musculus - release 101 ] > useHub=TRUE: checking for EnsDb via 'AnnotationHub' > snapshotDate(): 2020-10-20 > found matching EnsDb via 'AnnotationHub' > loading from cache > generating gene ranges > generating gene ranges > Renaming... > Splitting... > Timing stopped at: 0.009 0 0.009 > Error in tximeta::splitSE(txi, cg, assayName = "counts") : > None of the IDs in the unspliced column are present in rownames(se) > In addition: Warning message: > In checkAssays2Txps(assays, txps) : > > Warning: the annotation is missing some transcripts that were quantified. > 33005 out of 87421 txps were missing from GTF/GFF but were in the indexed FASTA. > (This occurs sometimes with Ensembl txps on haplotype chromosomes.) > In order to build a ranged SummarizedExperiment, these txps were removed. > To keep these txps, and to skip adding ranges, use skipMeta=TRUE > > Example missing txps: [ENSMUSG00000089699-I, ENSMUSG00000104238-I, ENSMUSG00000102269-I, ...] > > > Enter a frame number, or 0 to exit > > 1: NvimR.selection() > 2: NvimR.source(..., local = local) > 3: base::source(getOption("nvimcom.source.path"), ..., print.eval = print.eval > 4: withVisible(eval(ei, envir)) > 5: eval(ei, envir) > 6: eval(ei, envir) > 7: Rsource-947508#13: process_velo_txis(CITEs, txstub, anno = mm_ens101_anno, > 8: do.call(cbind, bplapply(by_sample, import_velo_txis, cg = cg, QC = QC, BPPA > 9: bplapply(by_sample, import_velo_txis, cg = cg, QC = QC, BPPARAM = BPPARAM) > 10: bplapply(by_sample, import_velo_txis, cg = cg, QC = QC, BPPARAM = BPPARAM) > 11: lapply(X, FUN_, ...) > 12: FUN(X[[i]], ...) > 13: FUN(...) > 14: withCallingHandlers({ > tryCatch({ > FUN(...) > }, error = handle > 15: tryCatch({ > FUN(...) 
> }, error = handle_error) > 16: tryCatchList(expr, classes, parentenv, handlers) > 17: tryCatchOne(expr, names, parentenv, handlers[[1]]) > 18: value[[3]](cond) > > Selection: >
Tim Triche (09:40:21): > nah, that didn’t work either
Tim Triche (09:40:46): > tximeta REALLY does not like the idea of a velocity aware transcriptome with anything resembling the name of an unaware transcriptome
Tim Triche (09:41:00): > I may have to poison the hashes for velocity-aware transcriptomes if this is how it’s going to work
Tim Triche (09:41:53): > I started automating some of this: https://github.com/trichelab/velocessor
Tim Triche (09:42:35): > However, if tximeta is going to override it and then complain about not having the appropriate ENSEMBL IDs, I foresee problems.
Tim Triche (09:43:41): > (The steps for index creation are simply a refactored version of your examples with eisaR, and they appear to work just fine, at least up to now:confused:)
Tim Triche (09:44:32): > Oh, wait
Tim Triche (09:44:39): > this is really bizarre
Tim Triche (09:44:47): > it’s splitSE that is having issues
Michael Love (09:53:00): > so just to check, there’s no tximeta bug above? you’re calling a number of functions in a row? it may help to print the commands so we know what step is associated with the errors you’re reporting
Michael Love (09:55:07): > If you add an entry for your new transcripts to the GTF then tximeta will know them. the idea is that the GTF remains the source of all information about transcript locations and metadata
Tim Triche (11:13:14): > nah, they’re in there:
Tim Triche (11:13:16): > > [tim.triche@submit raw_FASTQs]$ grep "ENSMUSG00000089699-I" mm10.ens101.annotation.expanded.gtf > 1 rtracklayer exon 3466598 3513494 . + . exon_id "ENSMUST00000161581-I"; exon_rank 1; transcript_id "ENSMUST00000161581-I"; gene_id "ENSMUSG00000089699-I"; ID "ENSMUST00000161581-I" > 1 rtracklayer transcript 3466598 3513494 . + . transcript_id "ENSMUST00000161581-I"; gene_id "ENSMUSG00000089699-I"; ID "ENSMUST00000161581-I" > 1 rtracklayer gene 3466598 3513494 . + . gene_id "ENSMUSG00000089699-I"; ID "ENSMUSG00000089699-I" >
Tim Triche (11:13:52): > per > > Warning: the annotation is missing some transcripts that were quantified. > 33005 out of 87421 txps were missing from GTF/GFF but were in the indexed FASTA. > (This occurs sometimes with Ensembl txps on haplotype chromosomes.) > In order to build a ranged SummarizedExperiment, these txps were removed. > To keep these txps, and to skip adding ranges, use skipMeta=TRUE > > Example missing txps: [ENSMUSG00000089699-I, ENSMUSG00000104238-I, ENSMUSG00000102269-I, ...] >
Tim Triche (11:14:24): > they’re probably not in the annotation on AnnotationHub though… I’m not sure how exactly this is occurring
Michael Love (11:51:09): > oooh
Michael Love (11:51:26): > ~useHub=FALSE~
Michael Love (11:51:40): > or…wait
Michael Love (11:52:39): > if you want tximeta to not go for the Ensembl GTF (either via AHub or via ftp) then you need to add your GTF as an entry in the cache via make or loadlinkedTxome
Michael Love (11:52:58): > linkedTxome has priority
Michael Love (11:53:13): > but if you don’t tell tximeta about your GTF it won’t know about these new txps
Tim Triche (13:53:53): > I think that’s what’s going on; however, process_velo_txis attempts to force it to do exactly that
Tim Triche (13:53:56): > > tim@thinkpad-P1:~/Dropbox/ARMSTRONG_TCELL_SCRNASEQ/Alevin/CITE_SLN111_D1_alevin/alevin$ tail quants_mat_cols.txt > ENSMUSG00000094791-I > ENSMUSG00000096550-I > ENSMUSG00000094172-I > ENSMUSG00000094887-I > ENSMUSG00000091585-I > ENSMUSG00000095763-I > ENSMUSG00000095523-I > ENSMUSG00000094855-I > ENSMUSG00000095475-I > ENSMUSG00000095041-I >
Tim Triche (13:54:18): > however, I’m going to see if forcing linkedTxome fixes it
Tim Triche (13:54:40): > because it’s definitely throwing away the real mm10.ens101 velo GTF in favor of something else
Tim Triche (13:54:46): > this seems like a bug to me, fwiw
Tim Triche (13:55:24): > (why would a spliced/unspliced index have the same hash as a spliced-only index? That doesn’t seem like a good hash function:slightly_smiling_face:)
Michael Love (18:01:58): > Wait, I’m a bit confused… if you want Salmon/tximeta to recognize this as an independent txome and not to pull metadata from canonical sources, then don’t you just use the hash from the index including the introns?
Michael Love (18:02:19): > Why is the Salmon index hash the same with and without introns
Michael Love (18:02:51): > I’m lost about the description of the bug, which package it lies in
2020-10-27
Tim Triche (09:05:49): > It appears to be in txome RA
Tim Triche (09:06:02): > Err, it appears to be in tximeta
Tim Triche (09:06:30): > I don’t get this problem with gencode33 (where there is no index or annotations in AnnotationHub)
Tim Triche (09:07:40): > I have more time to debug today. Will keep picking at it. Velocessor is essentially one gigantic reprex as a package (per yours and Hervé’s requests, and common courtesy)
Tim Triche (09:08:12): > Thanks for your feedback so far (and Charlotte’s, thanks both of you)
Michael Love (09:31:59): > Absolutely – thanks for kicking the tires on this
Michael Love (09:35:51): > if you see a case where a linkedTxome is not getting priority (e.g. there is a linkedTxome with the correct hash, lining up with the hash in the quant files, and pointing to the GTF you want to use, whether local or Zenodo-deposited) and it is not being used, that would definitely be a tximeta bug. But it may be that something got lost in the wrapper creation, so if you can show the bug with only tximeta function calls, that would help me squash it.
Michael Love (09:37:23): > the other thing is, you say that when quantifying with or without introns, Salmon/alevin is outputting the same sequence hash? I didn’t work on the velocity piece, so I don’t know exactly what’s happening there. We can ask Avi whether that is the case, and what he thinks should be happening.
Charlotte Soneson (09:39:11): > I don’t think this should give the same hash: alevin (or Salmon, for the indexing) doesn’t know which sequences are transcripts or introns, so from that point of view it’s just a fasta file with a lot more sequences.
Michael Love (10:12:57): > Tim, this seems to be the crux of the downstream issues; can you confirm that the Salmon index with the introns is producing the same hash as without the introns: https://community-bioc.slack.com/archives/CDW69P8UT/p1603734924072500 - Attachment: Attachment > (why would a spliced/unspliced index have the same hash as a spliced-only index? That doesn’t seem like a good hash function :slightly_smiling_face: )
Rob Patro (10:47:00): > Are the spliced/unspliced transcripts indexed separately?
Rob Patro (10:47:20): > the hash function used is just the SHA256 of the concatenated input sequences
Rob Patro (10:47:54): > so probability of a collision between different input sequences is vanishingly small
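[Editor's note: Rob's point can be sanity-checked generically outside of Salmon. This toy shell sketch uses `sha256sum` on made-up FASTA files; Salmon's exact hashing details (e.g. header handling) may differ, but the illustration is the same: appending intron sequences to the input changes the digest.]

```shell
# toy FASTA contents; any change to the concatenated input changes the SHA-256
printf '>tx1\nACGTACGT\n'   > spliced.fa
printf '>tx1-I\nGGCCGGCC\n' > introns.fa

cat spliced.fa            | sha256sum | cut -d' ' -f1   # spliced-only digest
cat spliced.fa introns.fa | sha256sum | cut -d' ' -f1   # spliced+introns digest differs
```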
Tim Triche (12:00:56): > they’re indexed together, which is why I don’t understand it. I will try just indexing mm10.ens101 without any unspliced transcripts and see if the hash is the same
Tim Triche (12:01:09): > it oughtn’t be, I wouldn’t think!
Tim Triche (12:01:22): > I need to go get a flu shot but will check in with results later
Michael Love (12:12:12): > Also check the linkedTxome table in your cache
Michael Love (12:13:14): > It will have the hash and the GTF path
Michael Love (12:17:13): > We can then figure out which hash is which and why you’re getting the canonical GTF instead of your custom one
Rob Patro (12:55:57): > @Tim Triche: correct, if you just index the unspliced by itself, it won’t have the same hash.
Tim Triche (12:56:13): > Ima check this now:grin:
Tim Triche (12:56:36): > Intuitively I assume that (thought exercise: MD5 it)
Tim Triche (12:56:52): > Not sure why it would happen otherwise. Investigating
Rob Patro (12:57:15): > oh cool, got my flu shot last week:stuck_out_tongue_winking_eye:— feels more important now than ever even though I’m at home like 99% of the time.
2020-12-12
Huipeng Li (00:38:41): > @Huipeng Li has joined the channel
2021-01-11
Kozo Nishida (21:08:18): > @Kozo Nishida has left the channel
2021-01-22
Annajiat Alim Rasel (15:46:21): > @Annajiat Alim Rasel has joined the channel
2021-05-11
Megha Lal (16:45:35): > @Megha Lal has joined the channel
2021-07-23
Batool Almarzouq (15:53:58): > @Batool Almarzouq has joined the channel
2021-10-10
Michael Love (20:40:11): > @Wes Where is a good place to chat re fishpond and loadFry
Michael Love (20:40:26): > CC@Steve Lianoglou
Steve Lianoglou (20:41:35): > @Steve Lianoglou has joined the channel
Wes W (20:41:36): > @Wes W has joined the channel
Wes W (20:47:21): > thanks
Michael Love (21:05:27): > Feel free to PR the new feature and we can have it in the release
Michael Love (21:06:20): > Also we should highlight all these awesome contributions re loadFry() in NEWS: https://github.com/mikelove/fishpond/blob/master/NEWS
Michael Love (21:06:47): > I’ll tweet about it late October after release
Michael Love (21:07:08): > @Dongze He:wave:
Dongze He (21:07:10): > @Dongze He has joined the channel
Michael Love (21:07:42): > I’m gonna call it a night but will check in tomorrow
Dongze He (21:07:52): > Okay
Dongze He (21:08:37): > G’nite
Michael Love (21:08:48): > > Friday October 22: Deadline for packages passing ‘‘R CMD build’’ and ‘‘R CMD check’’ without errors or warnings. This includes software, data experiment and workflow packages. Some warnings will be accepted, clarification on the bioc-devel mailing list. Bioconductor release candidate. Package maintainers should limit changes to “show-stopper” bugs and documentation improvements.
Michael Love (21:09:15): > I think we can get in Wes’ proposed changes w plenty of time
2021-10-11
Wes W (08:46:46): > Good morning everyone!
Wes W (08:50:19): > have a few meetings this morning and will add my feature to the function this afternoon and make a PR. happy to make all your acquaintances.
Michael Love (08:55:02): > and if @Dongze He and @Steve Lianoglou have brief bullet points for the NEWS file, you can either add them as a PR or just paste them directly here and I can add them > > fishpond is bumping up to version 2.0 this cycle because it has had quite a lot of development in the past year (loadFry(), importing allelic data, a lot of development of single cell analysis, dealing with bootstrap data efficiently, plotting improvements, correlation-based tests, support for continuous nuisance factors like RUV/SVA)
Wes W (23:06:37): > okay all, I am about ready to do my PR, but here is the thing. The current test data set fry-usa-basic for my purposes needs to be a smidgen bigger… scVelo needs at least 4 points for fitting a non-linear curve, so achieving a test case where you can go from loadFry() directly into velociraptor in R with the provided ExampleFryData will mean I need to add another cell… is that okay? Otherwise it works with my clinical patient data and a sample data set I downloaded online
Michael Love (23:36:00): > I think that’s fine, its still small altogether I think right?
Michael Love (23:36:12): > Eg we have more code than toy data :-)
Wes W (23:59:31): > haha totally
Wes W (23:59:45): > yeah we have a 2 gene 3 cell dataset
2021-10-12
Michael Love (10:55:18): > Thanks for the PR @Wes W (so feel free to also expand the test data to include your test). @Dongze He if you have time could you look over the changes here: https://github.com/mikelove/fishpond/pull/10/files (just noting that I didn’t write the function, so I want to make sure it makes sense to its author) > > Furthermore, I’d like to include one or more bullets in the NEWS regarding all this hard work. These actually get read! I will post this to twitter, and it’s a good way to get user buy-in to packages, so they know things are under active development. So feel free to propose NEWS items here and I’ll collate
Wes W (10:56:43): > perfect will expand the test data tonight to include an additional cell:smiley:
Michael Love (10:57:26) (in thread): > that is true single cell work
Michael Love (10:57:57) (in thread): > our pipeline includes single cell
Dongze He (12:24:00): > I will look at it today!:partying_face:
Dongze He (16:42:27): > Hi@Michael Love, I realize that we can make loadFry more flexible so that we don’t need to have an argument for every possible output format. I will reformat the implementation of loadFry and get back to you soon.
Michael Love (17:10:23): > Awesome, yeah we have plenty of time to try things, ideally settled by ~Wed Oct 20
Wes W (18:20:09): > @Dongze He for sure; in the future I imagine a format attribute and it could = “scVelo” or “eisaR” etc etc. I just added a quick and dirty version that I am actually using live right now in my work
Wes W (18:45:14): > I can give it some more robust changes tomorrow
Wes W (18:45:24): > make it more future proof
Dongze He (19:05:01): > Thanks so much! Actually that is exactly what I want to do; basically the function signature will be something like
Dongze He (19:05:04): > > loadFry( ..., output_slots = list( c('spliced', c('S', 'A')), c('unspliced', c('U')) ) .... ) >
Dongze He (19:06:04): > And the output_slots arg can take either a string that represents a pre-defined set of slots, like scVelo, or a list
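[Editor's note: a sketch of how calls under that proposal might look. The argument name and accepted shapes follow the snippet above as it stood mid-discussion, not a final API; a named list reads a bit more clearly than the nested c() form.]

```r
# Sketch only: argument name/shape follow the in-thread proposal, not a final API.
# Each output assay maps to the USA-mode statuses (S/U/A) it sums over.
sce <- loadFry(fryDir = "quants/sample1",
               output_slots = list(spliced   = c("S", "A"),
                                   unspliced = c("U")))

# or a pre-defined keyword for a common downstream tool:
sce <- loadFry(fryDir = "quants/sample1", output_slots = "scVelo")
```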
Dongze He (19:09:14): > Hi @Wes W could you please point me to the version you added? I didn’t find it
Wes W (23:06:00) (in thread): > found it then? cool
2021-10-13
Dongze He (03:01:13): > I made a new PR for the redesigned loadFry(), please check and let me know if you have any suggestions
Michael Love (06:42:25): > @Dongze Heshould we bring in your PR first then see how@Wes WPR will need to change wrt the newer loadFry()?
Michael Love (09:55:49): > hi@Dongze HeI had to change the example code and comment out the test code, bc it was giving errors
Michael Love (09:56:08): > > > test_file("test_loadFry.R") > > ══ Testing test_loadFry.R ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════ > [ FAIL 2 | WARN 0 | SKIP 0 | PASS 1 ] > > ── Error (test_loadFry.R:11:3): Reading in Alevin-fry USA count matrix works ───────────────────── > Error: unused argument (which_counts = c("S", "A")) > > ── Error (test_loadFry.R:33:3): Main gene-level quantiation is same when velocity = TRUE or FALSE ───────────────────── > Error: unused argument (which_counts = c("S", "A")) > > [ FAIL 2 | WARN 0 | SKIP 0 | PASS 1 ] >
Michael Love (09:56:30): > https://github.com/mikelove/fishpond/commit/85b4a688190c56d2fd84d8f95f25f47baf61b646
Michael Love (10:00:14): > then a question for @Wes W: as Dongze re-organized loadFry(), I don’t know where that puts your original PR; were there additional pieces that would help you downstream?
Wes W (10:23:41): > @Dongze He’s new PR has successfully integrated my code into his and it looks pretty smooth. Hopefully those comments I put in made it easier to understand what each code block was for. @Michael Love the only thing you could do now is pull out some of the comments that were more for Dongze than the user; lines 174 to 177 of the new loadFry.R were just me explaining the purpose of the block. We could probably make that just one line now that it’s been implemented, maybe just like > > # velociraptor w/ defaults requires size factors to be positive >
Wes W (10:26:28): > and then remove that little bit on line 175 where I ramble about my hacky solution. But glad to see my hacky solution has made its way into Dongze’s new update on line 271, and I can say I have left my mark on history
Michael Love (10:31:35): > haha, ok I will address this now
Michael Love (10:35:37): > https://github.com/mikelove/fishpond/commit/26795da8f981789e81b72fbddcb79a646524b0cb
Michael Love (10:37:53): > I’m adding Wes and Euphy to ctb as we welcome these additional ways to better integrate Salmon/alevin output with Bioc > > Just to update this channel, Euphy’s been working on and evaluating methods for testing when fold changes within a sample pair are correlated across a secondary covariate. E.g. we see this in dynamic AI datasets when we have two alleles + time or spatial location
Dongze He (10:44:46) (in thread): > Oh I’m sorry I forgot to change the code there. But thanks so much for fixing it.
Dongze He (10:49:52): > Great! @Wes W thanks so much for your help and excellent implementation. I appreciate it. Please let me know if you have any questions
Michael Love (10:51:16) (in thread): > oh sure, note, I didn’t fix the second test, I just commented it out, should I remove that test entirely?
Michael Love (10:51:36) (in thread): > https://github.com/mikelove/fishpond/blob/master/tests/testthat/test_loadFry.R#L30
Michael Love (12:48:22): > I’m taking care of Steve’s notes on the recent PR
Michael Love (12:55:51): > well i took care of the first two, he’s still got more coming
Michael Love (12:56:14): > here is what NEWS will say on release day: > > Changes in version 2.0.0 > * New loadFry() function, written by Dongze He with > contributions from Steve Lianoglou and Wes Wilson. > loadFry() helps users to import and process > alevin-fry quantification results. Can process > spliced, unspliced and ambiguous counts separately > and flexibly. Has specific output formats designed > for downstream use with scVelo or velocity analysis. > See ?loadFry for more details. >
Michael Love (12:56:23): > suggestions welcome
Wes W (19:15:38): > after a long and emotional day, PR sent that adds a single cell + 2 genes to give enough data points for non-linear curve fitting. Thanks to @Avi Srivastava for helping me troubleshoot what SHOULD have been an easy task. Works in Ubuntu 20.04 and Windows 10
Dongze He (19:43:31) (in thread): > As we don’t have this argument anymore, I think we don’t need this test
Michael Love (19:43:37): > i will pull and check on my machine here
Michael Love (19:44:52): > > This SHOULD have taken 30 minutes but I ended up spending all day on it , I can’t believe its 7:03pm right now. > :scream:
Michael Love (19:46:26): > > fishpond now works out of the box as an entry point for alevin-fry processed data into R for velocity analysis by velociraptor > :velociraptor:
Michael Love (19:46:34): > yes i just added that emoji
Michael Love (19:49:40) (in thread): > ok will remove now, thanks
Michael Love (20:42:22): > all looks good, everything is pushed to Bioc
2021-10-14
Rob Patro (10:15:19): > This is fantastic@Wes W! Thank you:slightly_smiling_face:!!
Michael Love (10:34:31): > @Rob Patroas I’ll be tweeting out the NEWS updates, anything you want to put in that bullet about alevin-fry?
Michael Love (10:35:01): > Like do you have an elevator pitch for people that are like “ooh what’s that, how can i try that out”
Wes W (11:55:02) (in thread): > love giving back where I can!
2021-10-19
Michael Love (09:07:20): > @Wes W question about filtering: previously you had filtering for cells with all zero in any assay
Michael Love (09:07:37): > is this because downstream methods perform size factor estimation on the assays separately?
Michael Love (09:07:59): > I can see not wanting to drop a cell even though it has say, no ambiguous reads
Wes W (09:10:26): > sorta exactly
Michael Love (09:14:31): > if size factor is for dealing with seq depth, why don’t they just use the size factor from the assay with most counts
Michael Love (09:14:36): > i don’t know the scope of the loss of cells, but if U or A are << S, you could be tossing too many cells (which have decent counts in S)…
Michael Love (09:15:30): > I’m just not that close to the analysis so don’t know the typical ratios
Wes W (09:15:50): > so no ambiguous reads shouldn’t be a problem, but when S+A is not positive (say, zero) then the standard SCE steps fail hard. > > scVelo by itself has no problem with this; it uses its own normalization and feature selection in python that can handle non-positive values. But since most people who use velociraptor stay in the Bioconductor family, and want to use their same embeddings and standard normalization and feature selection, VC by default uses the standard SCE format, with normalization handled by scran and scuttle
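[Editor's note: in base-R terms, the guard being discussed only needs to ensure positive per-cell totals in the assay (or assay sum) that downstream size-factor estimation will see. A toy sketch with made-up matrices, not code from the thread:]

```r
# Toy genes x cells counts; all names and values are made up for illustration.
spliced   <- matrix(c(5, 0, 0,
                      2, 0, 1), nrow = 2, byrow = TRUE)
ambiguous <- matrix(0, nrow = 2, ncol = 3)
unspliced <- matrix(1, nrow = 2, ncol = 3)

# keep cells (columns) whose spliced + ambiguous total is positive,
# since default scran/scuttle size factors fail on zero column sums
keep <- colSums(spliced + ambiguous) > 0

spliced   <- spliced[,   keep, drop = FALSE]
unspliced <- unspliced[, keep, drop = FALSE]
ncol(spliced)  # the all-zero S+A cell (column 2) is dropped, leaving 2 cells
```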
Michael Love (09:16:36): > if you don’t mind me with more questions:slightly_smiling_face:scran and scuttle by default scale each assay separately?
Wes W (09:16:53): > that is correct , by default anyway
Michael Love (09:17:21): > that seems suboptimal, bc this is just a technical parameter that is per cell not per assay (in my head)
Michael Love (09:17:49): > maybe I can bother @Charlotte Soneson to get her opinion on this detail about size factor estimation for velociraptor
Wes W (09:19:44): > totally. again, velociraptor has no issues if you pass the use.theirs=TRUE flag to scVelo, but the standard workflow and test pass case using defaults doesn't do this… and arguably the majority of users that are using velociraptor in downstream analysis as part of a larger SCE analysis aren't either
Michael Love (09:20:42): > got it. thanks for hunting this down Wes. While I don’t want to break people’s default pipelines, I kind of quibble about this detail:slightly_smiling_face:
Michael Love (09:20:43): > I could also carry this over to #sc-velocity if we want more people's opinions
Wes W (09:21:05): > i will double check the new code, but it shouldnt be tossing out cells… it should be tossing out genes that arent in any cells… let me have another look…
Wes W (09:21:41): > i mean yes you are still leaving data on the table either way
Michael Love (09:21:46): > yeah so the new code was tossing columns but the docs and comments said “genes” which is why i even started looking into this
Michael Love (09:22:13): > bc columns are cells in an SCE
Wes W (09:22:23): > they sure are
Wes W (09:22:47): > maybe something was lost in translation between my script and the new script, i'll take a peek after lunch and get back to you
Michael Love (09:24:15): > cool thanks:slightly_smiling_face:
2021-11-06
Michael Love (08:49:09): > Note that tximeta has a build error in release and devel, but both build fine on my machine with the latest packages – i think this will just resolve once the latest EnsDb in Ahub are load-able
Michael Love (14:39:09): > oh, I cleared my caches and now i have the same error as on the build machines. I’ve reported it here –https://support.bioconductor.org/p/9140240/#9140479
Michael Love (14:54:29): > I’m going to comment out the problematic part of the vignette to clear the error
Rob Patro (15:14:59): > So what can be done?
Michael Love (16:56:21): > I think Johannes is fixing it, but our error should clear by Monday – i missed that there were also sections of examples and tests that would attempt to download these EnsDb from Ahub
2022-01-03
Kurt Showmaker (17:05:17): > @Kurt Showmaker has joined the channel
2022-01-28
Megha Lal (11:14:29): > @Megha Lal has left the channel
2022-02-24
Jeroen Gilis (03:22:50): > @Jeroen Gilis has joined the channel
2022-02-25
Michael Love (06:48:51): > :wave:@Jeroen Gilis
Michael Love (07:05:18): > If you could use this paradigm for any extra packages you need (e.g. data.table) at the top of the function, and add these to Suggests: https://github.com/mikelove/fishpond/blob/master/R/helper-allelic.R#L75-L77
Michael Love (07:08:43): > we chose svMisc::progress (which is an import) because it seemed to have less overhead than pbapply, i think… I remember doing some benchmarking – but if you want pbapply just add that as a Suggests as well
Michael Love (07:08:46): > https://github.com/mikelove/fishpond/blob/master/R/swish.R#L364
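For context, the "Suggests paradigm" referred to above is the standard R pattern of guarding an optional dependency with requireNamespace() at the top of the function; a minimal sketch (the function name is made up, and data.table is used only as an example):

```r
# Sketch of the Suggests-guard pattern: the package is listed under
# Suggests in DESCRIPTION, and availability is checked at run time.
myHelper <- function(x) {
  if (!requireNamespace("data.table", quietly = TRUE)) {
    stop("this function requires the 'data.table' package, please install it")
  }
  data.table::as.data.table(x)
}
```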
Michael Love (07:09:23): > maybe you could put your new code into helper-EC.R, or alevin-EC.R and salmon-EC.R? I've tried to reorganize lately so it's easier to find what lives where: https://github.com/mikelove/fishpond/tree/master/R
Michael Love (07:10:01): > make sure to add yourself/selves as authors here in DESCRIPTION, and use @author on the exported functions; feel free also to put @references: https://github.com/mikelove/fishpond/blob/master/DESCRIPTION#L4-L17
Jeroen Gilis (08:44:58): > Alright, thanks for the guidelines!
2022-03-08
Jeroen Gilis (03:37:27): > I have a question that I thought would fit this channel. I have the following output from running alevin and alevinQC, without input whitelist: > > - From the alevin_meta_info.json: initial_whitelist = 1419, low_conf_cbs = 709, final_num_cbs = 1090 > - From the alevin quants_mat.gz as imported with tximport/fishpond: ncol(mat) = 2128 > - From the alevinQC report: Number of barcodes (initial whitelist) = 1419, Number of barcodes (final whitelist) = 1090 > > This is not what I expected. I expected ncol(mat) = 1090, since these are the barcodes of the final whitelist. I did discover that you can enforce this by setting alevinArgs = list("filterBarcodes" = TRUE) in the tximport call; however, this is not default behavior. What also struck me is that the number 2128 does not appear in the alevinQC report directly (only the 1419 and 709). It also doesn't align well with the https://salmon.readthedocs.io/en/latest/alevin.html#output docs, where it is stated that "final_num_cbs – Total number of cellular barcodes present in the output quant matrix.", which is not what users see under default settings. > > I guess my two questions are: am I missing something here? And what would you suggest by default for a user; continuing with the 2128 cells or the 1090 cells?
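For reference, the non-default option mentioned in the question (as described in the message above; the `files` vector of alevin quant paths is assumed) would look roughly like:

```r
library(tximport)

# Import only the barcodes in alevin's final whitelist (1090 cells in the
# example above) rather than all quantified barcodes (2128).
txi <- tximport(files, type = "alevin",
                alevinArgs = list("filterBarcodes" = TRUE))
```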
Charlotte Soneson (04:02:25) (in thread): > @Rob Patro or @Avi Srivastava will probably give a more informed response (and correct me if I'm wrong :slightly_smiling_face:), but I don't think you're missing anything. 2128 = 1419 + 709 is the total number of barcodes used for the second round of whitelisting (for which the quantification is required, which is why they're all in the count matrix). On the question of what to use, I usually import the full matrix and then do additional QC filtering afterwards. And for alevinQC, we can certainly add the 2128 as well to the output tables. However, I think the low_conf_cbs number is more or less independent of the data (~1/2 of the initial whitelist size unless there are lots of barcodes, in which case it's a constant number, just to get enough data for the classifier), so it's mostly informative for knowing how many barcodes were input to the second round of whitelisting, not really for the inferred number of cells in the data set.
2022-03-15
Michael Love (11:17:08): > @Charlotte Soneson wanted to get your thoughts on a proposal for moving things around, related to alevin: the fishpond package has some alevin-related code including loadFry (written by a team in this channel), readEDS, and the recent contribution alevinEC from Jeroen. > > As readEDS is the only C++ code, I've been discussing with @Avi Srivastava moving it over to its own Bioconductor package, eds, in the next devel cycle (I would transfer it to Avi). This would make fishpond an R-only package, avoiding C++ compilation and trimming its dependencies. https://github.com/mikelove/eds > > In other news, Jeroen is preparing new test data that will go into tximportData and replace the current alevin dataset. I will probably remove the 0.12 legacy alevin test data. > > Finally, I've been thinking of adding an alevin vignette to fishpond in the next devel cycle that points to the alevinQC package and shows examples of loadFry and alevinEC. It doesn't have to be long but would highlight these related functions/packages.
Charlotte Soneson (19:24:15) (in thread): > I think that makes sense to me (if I recall correctly, readEDS is not used by any other fishpond function at the moment?) In any case alevinQC is mostly/only concerned with the metadata files, so that should be ok also after the restructuring. Having test data in tximportData would be cool :slightly_smiling_face: and I also like the idea of the alevin vignette - let me know if you'd like input.
2022-03-16
Michael Love (07:10:42) (in thread): > great. yes, readEDS is alone in fishpond and only used by tximport, so it should be fine to move to a standalone package > > it would be great to get your input on an alevin vignette in fishpond, that would mostly showcase the different work in Bioc, and could also mention e.g. this channel for developer discussion. I'm thinking it would be mostly links out to, e.g., other package vignettes or man pages.
2022-03-23
Michael Love (12:14:25): > @Dongze He would you be ok with me changing load_fry_raw to loadFryRaw? I'm trying to standardize all exported functions (previously I didn't notice, thinking the underscores were for an unexported function)
Dongze He (12:20:32): > Hi @Michael Love, good point! Actually we can just make this function unexported, as nobody is going to call it directly. Thank you.
Michael Love (12:20:54): > ok will do
Michael Love (12:34:06): > @Dongze He also updating: > > #' @references > #' > #' alevin-fry publication: > #' > #' He, D., Zakeri, M., Sarkar, H. et al. "Alevin-fry unlocks rapid, accurate and > #' memory-frugal quantification of single-cell RNA-seq data." > #' Nature Methods 19, 316–322 (2022). > #' \url{https://doi.org/10.1038/s41592-022-01408-3} >
> :tada:
2022-03-28
Dongze He (21:42:35): > Hi @Michael Love, I found a bug in loadFry for non-USA mode. I made a PR, could you please check? Thank you so much!
2022-03-29
Michael Love (09:35:04): > yes will check — thanks for the PR
Martin Morgan (10:48:29): > @Martin Morgan has left the channel
2022-05-26
Michael Love (09:46:34): > Developer heads up: we are moving readEDS() from fishpond to a soon-to-be-submitted Bioc package, eds — this will help with the dependency graph, as fishpond doesn't necessarily need to be compiling C++ code unless someone wants to read in EDS-format data. How this affects you: if you are indirectly using fishpond to import EDS (alevin output) via tximport or tximeta, you will switch from fishpond to eds as a Suggests for your package. We haven't made the change in devel yet; we are just at the stage of submitting eds, and I'm pre-emptively letting devels know. :point_up: relevant info for @Charlotte Soneson @Stephanie Hicks
Michael Love (09:46:54): > @Avi Srivastava can you go to support.bioconductor.org and add "eds" to your watched tags?
Michael Love (09:47:09): > also let me know your Bioc support site email address
Avi Srivastava (09:48:32) (in thread): > asrivastava@cs.stonybrook.edu
Michael Love (09:48:51) (in thread): > alumni email… nice ;)
Avi Srivastava (09:51:22) (in thread): > Yea, Stony CS is nice enough to not disable email id after graduation. My current institution might change hence sticking to the old email id.
2022-07-08
Michael Love (09:56:41): > @Avi Srivastava congrats on eds being accepted
Michael Love (09:59:17): > for anyone browsing this channel, see the thread immediately above :point_up: to understand how this will affect development. hopefully it will be seamless for downstream devels/users, but next release, fishpond won't be used by tximport/tximeta for importing alevin EDS files; instead it will use eds::readEDS(). I guess one change that will be needed is for packages that import fishpond for readEDS: they will need to import eds now, but that's a minor hiccup. @Charlotte Soneson can you think of any other consequences, or devels I should notify?
Michael Love (10:00:18): > i see one reverse import:http://bioconductor.org/packages/release/bioc/html/singleCellTK.html - Attachment (Bioconductor): singleCellTK > The Single Cell Toolkit (SCTK) in the singleCellTK package provides an interface to popular tools for importing, quality control, analysis, and visualization of single cell RNA-seq data. SCTK allows users to seamlessly integrate tools from various packages at different stages of the analysis workflow. A general
Michael Love (10:03:54): > ok so it looks like they have a function which builds a SCE using tximport, and just imports fishpond so that the reading is fast. I’ll write a PR for them so they import eds instead, once the move of the function is complete
Charlotte Soneson (10:04:04) (in thread): > Nothing that immediately comes to mind - will post here if I think of something! Looks like singleCellTK uses tximport, so I guess no code changes, just changing a dependency.
Charlotte Soneson (10:04:29) (in thread): > Haha, ok, you just wrote that too.
Michael Love (10:04:30) (in thread): > yup yup
Michael Love (10:04:31) (in thread): > haha
Michael Love (10:11:23): > @Albert Kuo @Stephanie Hicks in a few days we will move readEDS from fishpond to eds. You could add something here: https://github.com/stephaniehicks/quantify-snrna/blob/fab920e1ec048b8fe26012a7b95cb610fd3c80c6/mouse_cortex/code/run-tximeta.R#L12 e.g. > > if (packageVersion("tximeta") >= "1.15.1") { > library(eds) > } else { > library(fishpond) > } >
Albert Kuo (10:11:26): > @Albert Kuo has joined the channel
Stephanie Hicks (10:11:26): > @Stephanie Hicks has joined the channel
Stephanie Hicks (10:26:35): > hey! thanks@Michael Love
Stephanie Hicks (10:27:36): > Albert is likely to be slow to respond to this. He’s now started his new position at Amazon. I’ll work on updating that.
Stephanie Hicks (10:33:49): > updated !
2022-12-15
Mercilena Benjamin (09:39:20): > @Mercilena Benjamin has joined the channel
2023-01-10
Vince Carey (10:47:48): > @Vince Carey has left the channel
2023-07-28
Benjamin Yang (15:58:37): > @Benjamin Yang has joined the channel
2023-08-04
Lucio Queiroz (14:23:44): > @Lucio Queiroz has joined the channel
Michael Love (16:07:28): > Just showed this at BioC2023, a new argument (in devel branch) for tximeta::summarizeToGene. The data (counts, abundance, etc.) is not changed; this only concerns the rowRanges. Also the new option is non-default, so it shouldn't change anyone's work unless they choose the non-default option below. - File (PNG): Screenshot 2023-08-04 at 4.06.52 PM.png
2023-09-07
Tim Triche (11:48:27): > this is interesting for (e.g.) using alevin-fry/bustools to feed tree-shaped inference. Looking forward to the junction of those two. Swish/ASE was great too!
2023-09-12
Ilaria Billato (10:24:01): > @Ilaria Billato has joined the channel
2023-12-15
Jeroen Gilis (04:04:34): - File (PNG): Screen Shot 2023-12-15 at 10.03.33.png
Jeroen Gilis (04:05:17): > Some of you must have noticed this ages ago, but still worth a post imo :p
Michael Love (08:08:30): > HA!
Michael Love (08:08:46): > tagging@Rob Patro