#singlecell-queries

2020-05-03

Nitin Sharma (11:06:38): > @Nitin Sharma has joined the channel

Nitin Sharma (11:06:39): > set the channel description: Discuss general queries regarding Single Cell Data Analysis

Chiaowen Joyce Hsiao (13:04:50): > @Chiaowen Joyce Hsiao has joined the channel

Dan Bunis (14:27:48): > @Dan Bunis has joined the channel

Nitin Sharma (15:12:18): > hello all, > > I want to ask a basic question and hopefully you can shed some light on it. > > In scRNA-seq, in the Seurat pipeline, QC filtering is generally done on the following criteria: > * We filter cells that have unique feature counts over 2,500 or less than 200 > * We filter cells that have >5% mitochondrial counts > > Now, I am working on single-nuclei RNA-seq (snRNA-seq) and wanted to know what cutoffs should be used for the QC filtering step, and why?
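
For reference, the two fixed cutoffs quoted in the question can be written as a simple filter. This is a minimal sketch in plain Python, not Seurat code; the function name, the cell names, and the toy values are hypothetical, and only the 200 / 2,500 / 5% numbers come from the message above.

```python
def passes_fixed_qc(n_features, pct_mito,
                    min_features=200, max_features=2500, max_pct_mito=5.0):
    """Return True if a cell survives the fixed cutoffs quoted in the question."""
    return min_features <= n_features <= max_features and pct_mito <= max_pct_mito


# Hypothetical cells: (number of detected genes, percent mitochondrial counts)
cells = {
    "cell_a": (1500, 2.1),   # within all limits
    "cell_b": (150, 1.0),    # too few genes
    "cell_c": (3200, 3.0),   # too many genes
    "cell_d": (1800, 12.5),  # too much mitochondrial signal
}

kept = [name for name, (n, pct) in cells.items() if passes_fixed_qc(n, pct)]
print(kept)
```

Whether these defaults make sense for nuclei, or for a given tissue, is exactly what the replies below question.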

Vivek Das (15:12:48): > @Vivek Das has joined the channel

Vivek Das (15:19:41): > This should be tissue- and biology-dependent. Generalizing that threshold is not correct, judging by the evidence across varied tissue types. If I may ask, what is your tissue of origin? How many cells and genes were actually quantified? Before sequencing there must have been a specific checklist, based on read depth, overall cell numbers and genes, that set the goal. This should guide the query in addition to the biology. Does your tissue source have mitochondrial function as a key component? If not, then anything between 2-5% is fine, but if it does you might have to think it through. In cardiomyocytes & kidney, the general cutoff is in theory 20-30% for mito genes, and upon clustering and projection mito genes are generally found in all clusters. Take a look at the recent single-nuc RNA-seq papers in those tissues and their accompanying code on GitHub. You will see that it is high. However, try to see what kind of evidence has been published so far for your tissue of interest. This will serve as a primer as well. Another easy way is to filter initially only for minimum genes and cells. Plot the mitochondrial % & cluster. See via tSNE or UMAP whether mito genes are restricted to a single cluster. If so, then that’s not due to biology and is most probably a sequencing artifact or due to the mechanical dissociation process during single cell isolation. Then just remove the entire cluster of mitochondrial genes. These are common hacks. Again, in snRNA-seq one should not have a lot of it, given it is not a biological component. But if it does, then discount it based on how it’s distributed across all clusters. > > My two cents. Hope this helps.
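
The "check whether the mito signal sits in one cluster" idea can be sketched as a per-cluster summary. This is an illustration only: the input format (pairs of cluster label and mitochondrial fraction), the function name, and the toy values are assumptions, and in practice the labels would come from whatever clustering was run after the minimal gene/cell filter.

```python
from collections import defaultdict
from statistics import median


def mito_by_cluster(cells):
    """Median mitochondrial fraction per cluster.

    `cells` is a list of (cluster_label, mito_fraction) pairs
    (a hypothetical input format chosen for this sketch).
    """
    grouped = defaultdict(list)
    for cluster, frac in cells:
        grouped[cluster].append(frac)
    return {cluster: median(fracs) for cluster, fracs in grouped.items()}


# Toy data: clusters A and B look ordinary, cluster C is dominated by mito reads
cells = [
    ("A", 0.03), ("A", 0.04), ("A", 0.05),
    ("B", 0.02), ("B", 0.03), ("B", 0.04),
    ("C", 0.45), ("C", 0.50), ("C", 0.55),
]

summary = mito_by_cluster(cells)
for cluster, med in sorted(summary.items()):
    print(cluster, med)
```

A single cluster standing far above the rest is the pattern described above as a likely artifact; a high but even level across clusters is the pattern attributed to tissue biology.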

Keith Connolly (18:41:51): > @Keith Connolly has joined the channel

Mikhael Manurung (18:43:08): > @Mikhael Manurung has joined the channel

Kelly Eckenrode (18:45:24): > @Kelly Eckenrode has joined the channel

Anamaria Elek (19:04:11): > @Anamaria Elek has joined the channel

Vince Carey (19:13:52): > @Vince Carey has joined the channel

Peter Hickey (19:47:19): > @Peter Hickey has joined the channel

2020-05-04

Charlotte Soneson (01:55:43): > @Charlotte Soneson has joined the channel

Laurent Gatto (03:13:44): > @Laurent Gatto has joined the channel

Charlotte Rich-Griffin (04:13:42): > @Charlotte Rich-Griffin has joined the channel

Nadine Bestard-Cuche (05:39:38): > @Nadine Bestard-Cuche has joined the channel

Devika Agarwal (05:46:09): > @Devika Agarwal has joined the channel

Giuseppe D’Agostino (06:17:37): > @Giuseppe D’Agostino has joined the channel

Almut (07:13:53): > @Almut has joined the channel

Nils Eling (07:31:19): > @Nils Eling has joined the channel

Robert Ivánek (07:48:11): > @Robert Ivánek has joined the channel

Kevin Blighe (10:22:29): > @Kevin Blighe has joined the channel

Tim Triche (11:23:58): > @Tim Triche has joined the channel

Tim Triche (11:24:14): > here’s a thought

Tim Triche (11:24:23): > suppose you’re working on cardiomyocyte development

Tim Triche (11:24:42): > what happens when you choose a static cutoff for mitochondrial transcripts?

Tim Triche (11:24:45): > another thought

Tim Triche (11:25:21): > suppose you’re working on naive or memory B and T cells, or some other random immune cell from which it’s a pain to extract mRNA

Tim Triche (11:25:46): > now suppose you superload your droplet prep so that you get a WHOLE LOTTA CELLS!!1

Tim Triche (11:26:11): > and then you toss out the cells that have less than some number of fragments

Tim Triche (11:26:34): > what do you suppose happens to the cells that potentiate the adaptive immune response?

Alan O’C (11:43:49): > @Alan O’C has joined the channel

Tim Triche (12:21:06): > I would be extremely careful with these types of cutoff based “QC” protocols. Refer to e.g. https://www.cell.com/fulltext/S0092-8674(12)01226-3 for $60M+ worth of cautionary tales about assumptions :slightly_smiling_face:

Nicholas Knoblauch (12:40:14): > @Nicholas Knoblauch has joined the channel

Jared Andrews (13:51:23): > @Jared Andrews has joined the channel

Jared Andrews (13:55:30): > Indeed, the pipeComp paper found arbitrary QC thresholds or even MADs-thresholds of whole samples may remove populations of interest. A per-cluster removal of outliers with manual removal of clusters composed of clearly dead cells may be the “safest” option.
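
A rough sketch of the difference between a blanket cutoff and per-cluster outlier removal, assuming a median + 3 MAD rule inside each cluster. The function name, the cluster labels, and the toy percentages are hypothetical; this is not the pipeComp procedure, just one simple way to express "outlier relative to its own cluster".

```python
from collections import defaultdict
from statistics import median


def mad_outliers_per_cluster(values, clusters, nmads=3.0):
    """Flag cells lying more than `nmads` MADs above their own cluster's median."""
    by_cluster = defaultdict(list)
    for value, cluster in zip(values, clusters):
        by_cluster[cluster].append(value)

    limits = {}
    for cluster, vals in by_cluster.items():
        med = median(vals)
        mad = median(abs(v - med) for v in vals)
        # 1.4826 rescales the MAD so it is comparable to a standard deviation
        # for normally distributed data
        limits[cluster] = med + nmads * 1.4826 * mad

    return [value > limits[cluster] for value, cluster in zip(values, clusters)]


# Toy percent-mitochondrial values for two hypothetical clusters
pct_mito = [2, 3, 3, 4, 20, 25, 28, 30, 32, 35]
clusters = ["blood"] * 5 + ["cardio"] * 5

per_cluster_flags = mad_outliers_per_cluster(pct_mito, clusters)
fixed_flags = [v > 5 for v in pct_mito]  # blanket 5% cutoff for comparison

print(sum(per_cluster_flags), "cell(s) flagged per cluster")
print(sum(fixed_flags), "cell(s) flagged by the fixed cutoff")
```

With these toy numbers the per-cluster rule flags only the one odd cell in the low-mito cluster, while the fixed cutoff also removes the whole high-mito population.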

Stephanie Hicks (14:02:44): > @Stephanie Hicks has joined the channel

2020-05-05

Al J Abadi (03:37:26): > @Al J Abadi has joined the channel

2020-05-06

Davide Risso (10:16:26): > @Davide Risso has joined the channel

Keegan Korthauer (14:14:05): > @Keegan Korthauer has joined the channel

Kelly Eckenrode (15:01:22): > (Context: I am new to analysis) Can you get away without thresholding? I understand that thresholding allows the reduction of noisy signal, but isn’t preserving heterogeneity (even if noisy) ideal? Thresholding also allows for binning, which is great for defining a cell type. But what if cell types don’t fit into a nice box..? @Tim Triche @Jared Andrews

Alan O’C (15:02:17): > What do you mean by thresholding? Simply removing low-quality cells?

Kelly Eckenrode (15:04:59): > Ah, so thresholding is used for quality control?

Jared Andrews (15:07:22): > I’d be wary of going without it completely. A cell with 10 genes expressed is not useful information. However, determining where those cutoffs should lie is more challenging, as it’s true that you don’t want to be overzealous and remove populations that may normally have a slightly high mitochondrial read rate (or fewer genes expressed or more reads or whatever). However, with no filtering, you will likely retain a lot of fairly useless information that may affect downstream analyses (dimensionality reduction and differential expression chief among them) in a significant way.

Jared Andrews (15:08:25): > And there are a number of papers showing just that, so some sort of filtering is good practice, but blanket applying arbitrary thresholds is not the best way to go about it.

Alan O’C (15:11:36): > Yes, typically the reason for removing cells is that you don’t believe that these are representative observations of the true cells. As Jared said, the idea is that these outlying samples may bias downstream analyses. For example, if you have a bunch of cells with close to zero counts, they will mess with the estimation of normalisation factors, and the 1st (possibly the first several) principal components will simply explain the difference between low quality observations and the rest. This is also true for bulk analyses

Alan O’C (15:13:11): > It’s not unheard of to have bulk RNAseq samples with far fewer mapped reads than all other samples, and these will heavily skew all downstream analyses including normalisation factors and PCs
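
A small numeric illustration of why a near-empty cell distorts scaling normalisation, assuming the simplest library-size factors (each cell's total divided by the mean total). The totals are made up, and real pipelines use more robust estimators, but the effect has the same direction.

```python
def size_factors(totals):
    """Library-size factors: each cell's total count divided by the mean total."""
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


# Hypothetical total counts per cell; the last cell is nearly empty
totals = [10000, 12000, 11000, 50]
factors = size_factors(totals)

# The same single raw count of some gene, observed in the first cell and in
# the nearly empty cell, after dividing by the cell's size factor
norm_good = 1 / factors[0]
norm_empty = 1 / factors[3]
ratio = norm_empty / norm_good

print(f"size factors: {[round(f, 4) for f in factors]}")
print(f"one count is inflated about {ratio:.0f}-fold in the empty cell")
```

One stray count in the empty cell ends up looking like strong expression after scaling, which is the kind of signal that then dominates the leading principal components.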

Tim Triche (15:21:36): > plan on fitting and plotting a mixture model – I’ll see if Eric will make some of his cardiomyocyte differentiation slides available – they’re the best illustration I’ve seen of this.
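
The slides mentioned here are not part of this log, so the following is only a generic sketch of what "fitting a mixture model" to a QC metric can look like: a two-component 1D Gaussian mixture fitted by EM on a log-scale metric. The function names, the initialisation from the sorted halves, and the toy data are all assumptions made for illustration.

```python
import math
from statistics import pstdev


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(xs, n_iter=100):
    """EM for a two-component 1D Gaussian mixture; returns (weights, means, sds)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Initialise the means from the lower and upper halves of the sorted data
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    sds = [pstdev(xs), pstdev(xs)]
    ws = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation
        resp = []
        for x in xs:
            p = [ws[k] * normal_pdf(x, mus[k], sds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: update weights, means and standard deviations
        for k in range(2):
            nk = sum(r[k] for r in resp)
            ws[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)  # floor avoids a collapsing component
    return ws, mus, sds


# Toy log10(genes detected): a low group near 1.0 and a high group near 3.0
log_genes = [0.8, 0.9, 1.0, 1.1, 1.2, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3]
weights, means, sds = fit_two_gaussians(log_genes)
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Plotting the fitted components over a histogram of the metric shows whether a "low quality" mode exists at all, instead of assuming a cutoff in advance.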

Tim Triche (15:22:43): > the scVelo paper is incredibly interesting w/r/t dim reduction, plotting, and velocity estimation – the comparison of “top 30 genes” vs “everything but the top 30 genes” is of interest especially

Tim Triche (15:23:29): > if your model sucks, or your data sucks, your results will probably suck; but the more robust either of the two is, the more leeway you have to exercise some judgment in picking away at the underlying biology

Tim Triche (15:24:24): > representativeness is a particularly interesting concept – what does a “true” pre-exhausted T cell or ciliated follicular progenitor look like? or a “true” long-term engrafting multilineage HSC ?

Tim Triche (15:25:21): > it’s bad enough for normal cells, when you start looking at malignant progenitors it gets really fun (cf. https://www.nature.com/articles/s41586-018-0436-0) - Attachment (Nature): The genetic basis and cell of origin of mixed phenotype acute leukaemi > A large-scale genomics study shows that the cell of origin and founding mutations determine disease subtype and lead to the expression of multiple haematopoietic lineage-defining antigens in mixed phenotype acute leukaemia.

Tim Triche (15:25:50): > I submit that you need to plan on at least one, and preferably two or three, orthogonal functional validation experiments for any exploratory scRNA result. Maybe a lot more.

Tim Triche (15:26:42): > I don’t really understand how solid tissue people deal with some of this complexity; it’s bad enough teasing it apart in systems without a ton of static physical structure.

Tim Triche (15:29:13): > we have done some ultra-deep ribozero sequencing of single cells where we get tens of millions of unique fragments and upwards of 80-90% detectable gene expression in cells; comparing to 10X or sci-RNAseq data from “equivalent” specimens is interesting to say the least. Similarly, the cardiomyocyte and FTE experiments get at “gee, what happens when certain types of normal cells really do have 20,000 mitochondria active?”

Tim Triche (15:30:17): > so, IMHO, 1) plot your data and 2) don’t blindly follow someone else’s recommendations for e.g. peripheral blood, especially if you’re studying muscle tissue or germinal center activation.

Tim Triche (15:30:53): > Note that 2) also applies to anything I suggest, because I am probably not studying the same types of cells as you are:slightly_smiling_face:

Kelly Eckenrode (16:28:19) (in thread): > What do you mean by this? Physical structure makes it more difficult because you have to separate the cells? I’m so leery of nuclei scRNA-seq because of missing the exported mRNAs.

Tim Triche (17:59:11) (in thread): > That’s part of it, but another aspect is that the instant you either fix a cell (for e.g. visium) or dissociate it, you are already changing its milieu. At least with blood cells we don’t have to fix or dissect (much)

Wanding Zhou (22:25:36): > @Wanding Zhou has joined the channel

2020-05-07

Vivek Das (17:55:24) (in thread): > Solid tissue people here@Tim Triche:wink:

Vivek Das (18:00:51) (in thread): > This is why in my reply I wrote that it’s not best to use static cutoffs. In cardiac or kidney tissue this is a moving goalpost. Mitochondrial biogenesis is a key process in development. I am only seeing a broad range of 20-30% in such tissues. Again, Davide Cittaro very nicely suggested that I check which clusters the mito genes are expressed in. If it’s an artifact, ideally it should not be in all clusters. When I look, without discarding, at certain tissues where it plays a role, I see the mito genes spread across every cell cluster, but that’s not always the case in tissues where its role is not clearly demonstrated.

Vivek Das (18:08:06): > One quick question: apart from the Human Cell Atlas, is there any effort covering multiple solid tissues for single cell? Just at the level of single cell or single nuc? Any interesting consortium that I should keep an eye on apart from HCA & HTAN? What do you all suggest?

Tim Triche (18:22:32) (in thread): > Good call — fwiw we see up to 60% in some stages of cardiomyocyte development — looking into whether this is even possible (via high speed microscopy) to try and disambiguate

Tim Triche (18:23:42) (in thread): > Technically bone marrow has some solid bits as do lymph nodes and myeloid sarcomas:grin:

Tim Triche (18:24:20): > Have you looked at the hSCL data?

Vivek Das (18:24:44): > No

Vivek Das (18:24:52): > Can you share the link?

Tim Triche (18:26:26): > https://www.nature.com/articles/s41586-020-2157-4 - Attachment (Nature): Construction of a human cell landscape at single-cell level > Single-cell RNA sequencing is used to generate a dataset covering all major human organs in both adult and fetal stages, allowing comparison to similar datasets for mouse tissues.

Tim Triche (18:27:03): > Prying out the raw reads is taking a while (we look at velocity on basically everything)

Vivek Das (18:27:34): > That’s HCL. It’s the one I shared with you over Twitter about the datasets.:grimacing:

Vivek Das (18:29:06): > I know about this. I am using it at my end. But it is its own study, right? Not a consortium, if I remember the paper correctly. I do use it, apart from the ones I have access to via BROAD, cellxgene and EBI.

Vivek Das (18:30:44): > But yes, I have not downloaded the raw data at my end yet. I did check the GitHub but I don’t recall the preprocessing steps right now for this work. Will have to go back to it. However, this is again a different single cell technology from what is mostly being used. Like that chromium or smartseq2. Isn’t that so?

Vivek Das (18:34:11): > Also, if I recall (I need to download this paper on my phone), the single cell platform here is not a part of the benchmarking work. :disappointed:

Upasna Srivastava (18:38:00): > @Upasna Srivastava has joined the channel

Paul Harrison (19:56:03): > @Paul Harrison has joined the channel

Tim Triche (20:35:58): > It’s seqwell (or maybe microwell-seq) IIRC. On my phone, will update later. But droplet crap is way overrated imho

Tim Triche (20:36:44): > HCA is definitely best for benchmarking. The SmartSeq3 data is amazing

Tim Triche (20:37:02): > We need to get or make an HCA sample for the Takara prep

Tim Triche (20:37:21): > In my incredibly biased opinion, it blows all the others away

Tim Triche (20:38:48): > We do need to get them to swap out the UDIs for a random hexamer UMI with huge edit distance so we can run the same libraries on the minION

Tim Triche (20:39:54): > But SmartSeq3 is the only thing I’d be tempted by, and my lab manager said she’d rather eat glass than run their home brewed prep

Tim Triche (20:40:04): > Their paper is amazing though

Tim Triche (20:40:59): > If you care about the assay side of things (eg spliced/unspliced spike-ins for velocity controls) it may be the best scRNA paper ever written

Vivek Das (20:58:04): > Indeed. I just read it a day ago and enjoyed it. You know, it is not rocket science to figure out the wave & trends in the market of single cell. :wink: I hope SMARTSeq3 will catch up. I have hopes for it, but it has to scale up in a lot of ways, not just kits. The success of a product also depends on a lot of factors other than scientific evidence. Isn’t it? :blush: But you also know who has more market penetration in terms of single cell technology & platform. That’s pretty easy to find out. Isn’t that so? Like the sequencing company. :grimacing: However, things are getting pretty interesting. I will be watching out more for case-control works with large-arm experiments. The fun begins there. :blush:

2020-05-08

Federico Marini (03:50:23): > @Federico Marini has joined the channel

Kellie Kravarik (07:02:48): > @Kellie Kravarik has joined the channel

Lucy (07:13:49): > @Lucy has joined the channel

Dan Bunis (14:23:39): > Question from a colleague of mine: > Does anyone have thoughts about integrating a scRNA-seq library and a snRNA-seq library, especially if the single cell and single nuclei samples are from different timepoints? If integrating to look for shared cell populations from these two different library preps isn’t problematic (they recognize that this might be one major hurdle), are there specific workflows for batch corrections that are most helpful when integrating single nucleus and single cell libraries?

Tim Triche (14:40:42): > look at the HCA paper and/or the Broad preprint, it looks like it should be doable, although it will depend on how/what you’re combining and with regards to what (i.e. if you’re combining deep plate-seq with droplet snRNAseq and looking to compute velocity… you’re gonna have a rough time). Aaron Lun’s paper on using a big pseudocount is pretty amazing in the “simplest thing that could possibly work” department, fwiw; we found it to be almost magical for ratiometric “normalization”. Obviously if you can avoid normalization with scaling, you should; see for example https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02006-2 and the spike-in section of the SmartSeq3 paper and the HCA comparisons by Holger Heyn’s group. We will be submitting a paper on multi-omic single-cell spike-ins presently (because not enough big data people hate us yet) as well. - Attachment (Genome Biology): Single-cell RNA-seq with spike-in cells enables accurate quantification of cell-specific drug effects in pancreatic islets > Single-cell RNA-seq (scRNA-seq) is emerging as a powerful tool to dissect cell-specific effects of drug treatment in complex tissues. This application requires high levels of precision, robustness, and quantitative accuracy, beyond those achievable with existing methods for mainly qualitative single-cell analysis. Here, we establish the use of standardized reference cells as spike-in controls for accurate and robust dissection of single-cell drug responses. We find that contamination by cell-free RNA can constitute up to 20% of reads in human primary tissue samples, and we show that the ensuing biases can be removed effectively using a novel bioinformatics algorithm. Applying our method to both human and mouse pancreatic islets treated ex vivo, we obtain an accurate and quantitative assessment of cell-specific drug effects on the transcriptome. We observe that FOXO inhibition induces dedifferentiation of both alpha and beta cells, while artemether treatment upregulates insulin and other beta cell marker genes in a subset of alpha cells. In beta cells, dedifferentiation and insulin repression upon artemether treatment occurs predominantly in mouse but not in human samples. This new method for quantitative, error-correcting, scRNA-seq data normalization using spike-in reference cells helps clarify complex cell-specific effects of pharmacological perturbations with single-cell resolution and high quantitative accuracy.

Tim Triche (14:41:37): > just found this one too: https://www.biorxiv.org/content/10.1101/832444v2

Tim Triche (14:42:04): > I have to say, I’m beginning to greatly enjoy the switch from qualitative to quantitative single-cell sequencing foci:slightly_smiling_face:

Tim Triche (14:42:22): > revenge of the statistical curmudgeons:slightly_smiling_face:

Dan Bunis (14:48:46): > Thanks so much! I’ve passed your suggestions along and will read up myself too when I have a chance:slightly_smiling_face:

Tim Triche (15:11:11): > Oh I was mostly hoping for corrections/updates from you w/r/t the above:slightly_smiling_face:

Tim Triche (15:11:39): > Sean @ Agios wrote decontX with WEJ, will be interesting to see if we can apply that and gold-standard check it given the above

Tim Triche (15:13:36): > but “single” cell is probably a euphemism for many droplet protocols (we have some people at VAI who have studied this extensively:

Tim Triche (15:13:36): > https://github.com/VanAndelInstitute/intent

Tim Triche (15:14:23): > Shiyi & Sean’s paper: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-1950-6) - Attachment (Genome Biology): Decontamination of ambient RNA in single-cell RNA-seq with DecontX > Droplet-based microfluidic devices have become widely used to perform single-cell RNA sequencing (scRNA-seq). However, ambient RNA present in the cell suspension can be aberrantly counted along with a cell’s native mRNA and result in cross-contamination of transcripts between different cell populations. DecontX is a novel Bayesian method to estimate and remove contamination in individual cells. DecontX accurately predicts contamination levels in a mouse-human mixture dataset and removes aberrant expression of marker genes in PBMC datasets. We also compare the contamination levels between four different scRNA-seq protocols. Overall, DecontX can be incorporated into scRNA-seq workflows to improve downstream analyses.

Tim Triche (15:15:01): > I’m having a hell of a time finding the Broad scRNA/snRNA pipeline comparison, you’ll have to forgive me, but if anyone else finds it, please post the link here.

Dan Bunis (15:33:44): > For updates/corrections, not sure we have the knowledge-base to correct, but i can add that the particular data in question was made with 10X for both the nuclei and whole-cell data. So probably not great for velocity, and the big goal, at least initially, is just “simple” differential gene expression across the time points in question.

Jared Andrews (15:34:50): > So is 10X data generally regarded as crap for velocity?

Dan Bunis (16:14:42): > > I’m having a hell of a time finding the Broad scRNA/snRNA pipeline comparison, you’ll have to forgive me, but if anyone else finds it, please post the link here. > I imagine it’s this: https://www.biorxiv.org/content/10.1101/632216v2 (Not sure where to find the scumi tool they apparently developed for normalizing across experimental methods… I don’t think it’s released yet but it could be quite useful!)

Vivek Das (16:25:17) (in thread): > The gods will be angry:rolling_on_the_floor_laughing:

Vivek Das (16:30:05) (in thread): > This was already published recently. Well, I have not seen much of a difference in the pipelines at my end. But I am also skeptical about the motivation for integrating single-cell and snRNA-seq. Why do we want to do so? Isn’t it dependent on the tissue biology, organ complexity, nature of cell types & associated markers we want to profile & understand? The underlying complexity of the tissue biology should guide us, right, unless we have multi-tissue, multi-organ datasets from both & make systematic comparative assessments for certain biological questions. We don’t want to do it just for the sake of it, or do we? :wink:

Tim Triche (19:44:07) (in thread): > I don’t know about “crap” but you have to smooth more cells together to get a decent vector field if each cell is sequenced more shallowly. On the other hand, for the same number of reads, you could shallowly phenotype more cells (see the sci-PLEX and sci-fi-RNAseq papers for this taken to the far extreme). So I think it’s a question of “right tool for the job”. We and others tried 10X on FTE, it didn’t work, but Takara worked like a charm and we realized that sometimes olde skoole plate-seq is not only “good enough” but can be better.
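
The depth-versus-cells trade-off here is simple arithmetic, sketched below with made-up numbers. The read budget, the cell counts, and the per-estimate target depth are all hypothetical, and "pooling" is a crude stand-in for the neighbour smoothing that velocity tools perform.

```python
import math


def reads_per_cell(total_reads, n_cells):
    """Average sequencing depth per cell for a fixed read budget."""
    return total_reads / n_cells


def cells_to_pool(target_reads, per_cell):
    """Neighbouring cells to pool so the pooled depth reaches target_reads."""
    return math.ceil(target_reads / per_cell)


total_reads = 400_000_000   # hypothetical sequencing budget
target = 200_000            # hypothetical depth wanted per smoothed estimate

deep = reads_per_cell(total_reads, 1_000)       # few cells, sequenced deeply
shallow = reads_per_cell(total_reads, 20_000)   # many cells, sequenced shallowly

print(deep, cells_to_pool(target, deep))
print(shallow, cells_to_pool(target, shallow))
```

The same budget either resolves each cell on its own or phenotypes twenty times as many cells at the cost of averaging over neighbours, which is the "right tool for the job" point.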

Tim Triche (19:45:20) (in thread): > Ido Amit made a great point at ASH, though: it cannot be overstated how much 10X contributed to the uptake of scRNAseq by making a fairly standardized, comparable, straightforward method that most anyone can use. Ido estimated 2-3 years to roll your own, and the only reason we did that (his estimate is right, FWIW) is that we couldn’t do the experiments we wanted to with 10X or other droplet preps.

Tim Triche (19:45:34) (in thread): > For an awful lot of people, 10X is perfect.

Tim Triche (19:46:08) (in thread): > My lab and our collaborators aren’t among those people, but that doesn’t mean 10X is “bad”. It’s just a different tool for different experimental questions.

Jared Andrews (19:46:21) (in thread): > That’s fair. For context, just trying to determine if it’s worth my time or not. I’m mostly hoping for it to confirm trends observed from pseudotime analyses. Otherwise, I am not going to dig any deeper into the results.

Tim Triche (19:46:28) (in thread): > use PAGA

Tim Triche (19:46:33) (in thread): > that’s an easy one

Tim Triche (19:47:11) (in thread): > https://theislab.github.io/scanpy-in-R/

Tim Triche (19:48:17) (in thread): > if you want a method that can be informed by velocity, or not, you want PAGA. If you want to try and normalize a bunch of datasets together and THEN handle that quantitatively… well, that’s an open research question. But we are working with Rob Patro and Charlotte Soneson and others on closing it a little bit:wink:

Vivek Das (19:51:15) (in thread): > Also Roman wrote it last year in R. https://romanhaa.github.io/blog/paga_to_r/

Jared Andrews (19:54:34) (in thread): > Guess I’ll have to toy with it. Mixing between Seurat, SCE, and AnnData objects is not something I’m looking forward to, but no pain, no gain.

2020-05-09

Tim Triche (16:08:01) (in thread): > both of those links are great

Tim Triche (16:08:15) (in thread): > although Roman’s is a little out of date and some of the objects have moved around IIRC

Tim Triche (16:08:56) (in thread): > the going back and forth part is the big PITA

Tim Triche (16:09:51) (in thread): > once Ben deposits the compartmap MS, he wants to write up a complementary tutorial. Since I am a godawful human being, I’m going to ask him to turf it to an intern (or collaborator grad student) instead

Jared Andrews (16:09:54) (in thread): > Yes, I haven’t dug deep, but if I can run PAGA on my already processed Seurat/SCE object, using their dim reducs and all, then that’s fine.

Tim Triche (16:10:02) (in thread): > yes. you can

Tim Triche (16:10:17) (in thread): > and if the datasets aren’t too radically different you can merge them with Harmony

Tim Triche (16:10:26) (in thread): > although the velocity vector fields will not be properly preserved

Tim Triche (16:10:32) (in thread): > which is its own rabbit hole

Jared Andrews (16:10:35) (in thread): > I already have a big ol object that was merged via fastMNN.

Tim Triche (16:10:56) (in thread): > good enough:slightly_smiling_face:

Tim Triche (16:11:16) (in thread): > still won’t preserve the vectors, but at least the “6 hours ago” positions will be sensible:wink:

Jared Andrews (16:12:15) (in thread): > Yeah, I should probably really read about what goes into this so that my logic isn’t just “I want arrows to go in this direction pls” with zero knowledge of what’s actually being done.

Tim Triche (16:12:36) (in thread): > the La Manno paper is really, really good, as is the scVelo paper

Jared Andrews (16:13:19) (in thread): > Excellent. Maybe I’ll throw my code on biostars as a tutorial or something after the fact if it works out.

Tim Triche (16:13:37) (in thread): > we have some idea how to do it but I want to use spikes and a defined shift to make sure we’re not just circlejerking like with imputation “benchmarks”

2020-05-11

John Hutchinson (14:28:34): > @John Hutchinson has joined the channel

Tim Triche (17:50:21): > Fwiw, the pipeline paper is published today: https://www.nature.com/articles/s41591-020-0844-1 - Attachment (Nature Medicine): A single-cell and single-nucleus RNA-Seq toolbox for fresh and frozen > A set of ready-to-use tools for profiling fresh and frozen clinical tumor samples using scRNA-Seq and snRNA-Seq facilitates the implementation of single-cell technologies in clinical settings and the construction of single-cell tumor atlases.

2020-05-13

Aedin Culhane (17:09:58): > @Aedin Culhane has joined the channel

2020-05-17

Goutham Atla (19:09:03): > @Goutham Atla has joined the channel

2020-05-19

Tobias Hoch (11:18:48): > @Tobias Hoch has joined the channel

2020-05-20

dylan (01:34:10): > @dylan has joined the channel

2020-05-23

Mikhael Manurung (14:59:02): > In light of this recent pre-print by Lior Pachter (https://www.biorxiv.org/content/10.1101/2020.05.19.100214v1), do we still consider log1p transformation as the best default transformation for scRNA-seq instead of square-root or arcsinh?

Jared Andrews (15:13:23): > Are there any studies showing that they perform any better for low count genes? Genuinely curious, I have not looked for any. Lior’s commentary is all well and good, but doesn’t seem to offer any practical solution.

2020-05-24

Tim Triche (00:01:08): > asinh is handy, we used it for years, but perhaps better is to use a larger pseudocount: https://www.biorxiv.org/content/10.1101/404962v1.full

Tim Triche (00:01:43): > the “fat pseudocount” approach seems to solve a bunch of other issues as well, at least in our hands
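
To make the options in this exchange concrete, here is a small comparison of the transformations mentioned (log with pseudocount 1, square root, asinh, and a log with a larger pseudocount). The pseudocount of 10 is an arbitrary value chosen for illustration, not one taken from the linked preprint, and the inputs are assumed to be already scaled counts.

```python
import math


def log_pseudo(x, pseudocount=1.0):
    """log2 of a (scaled) count plus a pseudocount; pseudocount=1 is the log1p-style default."""
    return math.log2(x + pseudocount)


transforms = {
    "log2(x + 1)": lambda x: log_pseudo(x, 1.0),
    "log2(x + 10)": lambda x: log_pseudo(x, 10.0),  # "fat" pseudocount, arbitrary value
    "sqrt(x)": math.sqrt,
    "asinh(x)": math.asinh,
}

# How large is the step from 0 to 1 count, and from 100 to 101, on each scale?
for name, f in transforms.items():
    low_step = f(1) - f(0)
    high_step = f(101) - f(100)
    print(f"{name:>13}: 0->1 = {low_step:.3f}, 100->101 = {high_step:.3f}")

step_small_pc = log_pseudo(1, 1.0) - log_pseudo(0, 1.0)
step_large_pc = log_pseudo(1, 10.0) - log_pseudo(0, 10.0)
```

The larger pseudocount shrinks the jump between 0 and 1 count, so differences driven by a single stray read carry less weight; the other transforms differ mainly in how hard they compress high counts.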

2020-05-26

Somesh (13:52:31): > @Somesh has joined the channel

Somesh (14:20:29): > I am having issues recovering genes while doing cluster-wise differential expression testing between two conditions - The testing mostly produces genes up-regulated for one condition while very few genes (< 6) seem to be up-regulated for the other condition. Read-counts and other technical variates look okay for both conditions - Could it be because of the test I am employing (wilcox) or other parameters used? I am working with Seurat v3

Alan O’C (14:30:14): > Seurat is not a bioconductor package :wink: You say “recovering genes” - do you have prior knowledge of the genes that ought to be differentially expressed between conditions? Asymmetric DE is not terribly uncommon

Somesh (14:40:32) (in thread): > Haha …. I figured that. But, my thinking was that this issue might be irrespective of Seurat. Regarding genes, to a fair extent yes. For most of the clusters, I get the same DE genes (mostly chaperones) for each cluster for that condition which could indicate no up-regulation is actually present in that condition.

Alan O’C (15:02:10) (in thread): > I don’t think that last point really follows from the observation. You could quite easily have genes that are increasing in expression across all clusters in one condition, which would make them useless as marker genes for clusters, while having a large LFC when comparing conditions.
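
A toy example of the distinction drawn here, using made-up mean expression values for one hypothetical gene in two clusters and two conditions. The cluster and condition names are placeholders.

```python
import math

# Mean expression of one hypothetical gene, keyed by (cluster, condition)
means = {
    ("c1", "ctrl"): 10.0, ("c1", "treated"): 40.0,
    ("c2", "ctrl"): 10.0, ("c2", "treated"): 40.0,
}


def lfc(a, b):
    """log2 fold change of a over b."""
    return math.log2(a / b)


# Comparing conditions within each cluster: large and consistent
condition_lfc = {c: lfc(means[(c, "treated")], means[(c, "ctrl")]) for c in ("c1", "c2")}

# Comparing clusters within each condition: no difference, so useless as a marker
marker_lfc = {cond: lfc(means[("c1", cond)], means[("c2", cond)]) for cond in ("ctrl", "treated")}

print("treated vs ctrl:", condition_lfc)
print("c1 vs c2:", marker_lfc)
```

Seeing the same gene come up in every cluster's condition comparison is therefore compatible with a real, global up-regulation in that condition.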

2020-05-27

Mikhael Manurung (14:51:11) (in thread): > The lack of a practical solution and the wording of the last paragraphs made me think that the article reads like a lengthy subtweet.

Jared Andrews (14:53:44) (in thread): > It is fitting with his Twitter style, admittedly.

2020-05-29

Shuyu Zheng (08:53:51): > @Shuyu Zheng has joined the channel

2020-06-06

Olagunju Abdulrahman (19:58:17): > @Olagunju Abdulrahman has joined the channel

2020-06-07

Juan Ojeda-Garcia (12:21:16): > @Juan Ojeda-Garcia has joined the channel

2020-06-08

MounikaGoruganthu (08:13:07): > @MounikaGoruganthu has joined the channel

2020-06-09

Peter Hickey (19:42:18): > anyone know off hand a paper/preprint where they used CITE-seq antibodies, genetic variation, and hashtag labelling all on the same cells?

Tim Triche (19:51:50): > https://www.nature.com/articles/s41592-019-0392-0 - Attachment (Nature Methods): Multiplexed detection of proteins, transcriptomes, clonotypes and CRIS > ECCITE-seq combines the single-cell analysis of multiple modalities, for example transcriptome, immune cell receptors, cell surface proteins and single-guide RNAs.

Stephanie Hicks (20:35:28): > @Peter Hickey— by “genetic variation”, do you mean to demultiplex pooled cells?

Peter Hickey (20:36:29): > yeah like samples from 3 donors (‘genetic variation’ labelled) and 2 treatments (hashtag labelled) run in 1 capture where all cells were also assayed with CITE-seq antibodies
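
The design described here (donor from genotype, treatment from hashtag, both in one capture) amounts to combining two per-cell labels. The sketch below assumes the donor call is already available from a genotype-based tool and uses a deliberately naive hashtag rule (top count at least three times the runner-up); the cell names, counts, and the rule itself are hypothetical, and real hashtag demultiplexers use proper statistical models.

```python
def assign_hashtag(hto_counts, min_ratio=3.0):
    """Return the dominant hashtag, or "ambiguous" if it does not clearly win.

    `hto_counts` maps hashtag name -> count for one cell (at least two hashtags).
    """
    ranked = sorted(hto_counts.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_n), (_, second_n) = ranked[0], ranked[1]
    if top_n == 0 or top_n < min_ratio * max(second_n, 1):
        return "ambiguous"
    return top


# Hypothetical cells: donor call from genotypes plus raw hashtag counts
cells = {
    "cell1": {"donor": "donor1", "hto": {"treatA": 120, "treatB": 4}},
    "cell2": {"donor": "donor2", "hto": {"treatA": 30, "treatB": 25}},
    "cell3": {"donor": "donor3", "hto": {"treatA": 2, "treatB": 90}},
}

labels = {
    name: (info["donor"], assign_hashtag(info["hto"]))
    for name, info in cells.items()
}
for name, label in labels.items():
    print(name, label)
```

With 3 donors and 2 treatments this yields up to six sample labels per capture, plus an "ambiguous" bin for cells such as likely doublets.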

Stephanie Hicks (20:39:35): > I’d be very interested in that data too. Tagging @Lukas Weber to join in the convo too. I’ll definitely check out that paper @Tim Triche, thanks!

Peter Hickey (20:40:12): > we’ve done this but not published:slightly_smiling_face:looking for something to cite in a grant app

Lukas Weber (20:40:16): > @Lukas Weber has joined the channel

Lukas Weber (20:40:25): > oh I wasn’t in this channel:joy:

Stephanie Hicks (20:40:51): > @Peter Hickey— nice!!

Stephanie Hicks (20:41:15): > please do let us know when you can share that data. :slightly_smiling_face:

Tim Triche (20:41:17): > thought occurs to just use VarTrix on the samples to assign genotypes to cells

Tim Triche (20:41:36): > (was bugging Chris Miller about this earlier since I don’t care about genotypes, only SVs :wink:)

Stephanie Hicks (20:41:56): > yeah, @Lukas Weber and I are using cellSNP to assign genotypes and Vireo to demultiplex

Tim Triche (20:42:03): > the rest (i.e. ADTs, hashing, etc.) is handled by ECCITE

Tim Triche (20:42:16): > I know CellSee is adapting ECCITE to do up to 250K cells at a time

Tim Triche (20:42:38): > the droplet version kind of sucks in terms of throughput, although the fact that it can be done at all is amazing :slightly_smiling_face:

Tim Triche (20:43:26): > we’re lazy so we plate-seq and pull aside the nucleus and organelles. It turns out that organelles mostly survive the 19.6 m/s^2 drop into the lysis buffer

Tim Triche (20:43:44): > with enough depth, genotyping becomes something of a non-issue :slightly_smiling_face:

Tim Triche (20:44:27): > similar to your situation, though, haven’t published the whole thing yet – hopefully two of the three pieces will go up on bioRxiv this week

Tim Triche (20:45:15): > @Stephanie Hicks what does Vireo do?

Stephanie Hicks (20:46:02): > it’s basically a computational demultiplexer using genetic variants. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1865-2 - Attachment (Genome Biology): Vireo: Bayesian demultiplexing of pooled single-cell RNA-seq data without genotype reference > Multiplexed single-cell RNA-seq analysis of multiple samples using pooling is a promising experimental design, offering increased throughput while allowing to overcome batch variation. To reconstruct the sample identity of each cell, genetic variants that segregate between the samples in the pool have been proposed as natural barcode for cell demultiplexing. Existing demultiplexing strategies rely on availability of complete genotype data from the pooled samples, which limits the applicability of such methods, in particular when genetic variation is not the primary object of study. To address this, we here present Vireo, a computationally efficient Bayesian model to demultiplex single-cell data from pooled experimental designs. Uniquely, our model can be applied in settings when only partial or no genotype information is available. Using pools based on synthetic mixtures and results on real data, we demonstrate the robustness of Vireo and illustrate the utility of multiplexed experimental designs for common expression analyses.

Stephanie Hicks (20:46:13): > uses a bayesian algorithm

Tim Triche (20:46:26): > oh that is slick

Lukas Weber (20:46:26): > yep, and cellSNP is their associated tool for the cell-level genotyping, e.g. using a list of known SNPs from 1000 Genomes Project

Stephanie Hicks (20:46:28): > @Lukas Weber has become somewhat of an expert on this topic as of late

Tim Triche (20:46:33): > very cool

Tim Triche (20:46:38): > does it leverage UMIs

Stephanie Hicks (20:46:39): > so i’ll let him explain details :slightly_smiling_face:

Lukas Weber (20:46:57): > hmm not sure right now

Tim Triche (20:48:04): > that’s a great paper! Thanks for pointing it out!

Stephanie Hicks (20:48:10): > by leverage UMIs do you mean, can it apply to UMI scRNAseq data as opposed to plate-based only?

Peter Hickey (20:48:15) (in thread): > In case you’re still experiencing the pain, can recommend first subsetting the list of known SNPs to those in 3’ UTRs (presuming you are using it on 10X 3’ data). so. much. faster.

Stephanie Hicks (20:48:16): > if so, then yes, it can use UMI data

Peter Hickey (20:48:32) (in thread): > learnt that one the hard way

Lukas Weber (20:48:34): > yep

Tim Triche (20:48:39): > nah I mean can it take advantage of PCR dupes to distinguish low-freq variants from noise

Lukas Weber (20:48:52) (in thread): > oh, thanks! yes I have noticed it is extremely slow

Tim Triche (20:48:56): > one of our pet projects is to graft UMIs onto our plate-based total RNAseq prep

Stephanie Hicks (20:49:01): > oh that i have no idea

Stephanie Hicks (20:49:06): > (i don’t think so)

Stephanie Hicks (20:49:11): > but I could totally be wrong

Lukas Weber (20:49:18): > yeah I’m not sure about that either

Tim Triche (20:49:19): > no worries, in droplet protocols it’s not very important so nobody does it

Stephanie Hicks (20:49:24): > but what you are describing is also a very cool idea
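One way to read the question above about PCR duplicates: reads that share a cell barcode and a UMI are copies of one original molecule, so they can be collapsed to a consensus base, and an allele is then counted once per molecule rather than once per read. The sketch below assumes that reading. It is not a description of how Vireo or cellSNP work internally (nobody in the thread knew), and the function name and the `(cell_barcode, umi, base)` input layout are hypothetical.

```python
from collections import Counter, defaultdict


def umi_supported_alleles(reads):
    """Count independent molecules (UMIs) supporting each allele, per cell.

    `reads` is an iterable of (cell_barcode, umi, base) tuples, one tuple per
    read covering the site of interest (hypothetical input layout).
    """
    # Step 1: tally the bases seen within each (cell, UMI) family of duplicates.
    per_molecule = defaultdict(Counter)
    for cell, umi, base in reads:
        per_molecule[(cell, umi)][base] += 1

    # Step 2: collapse each family to its most frequent base, so a PCR or
    # sequencing error inside one family cannot masquerade as a variant.
    support = defaultdict(Counter)
    for (cell, _umi), bases in per_molecule.items():
        consensus, _count = bases.most_common(1)[0]
        support[cell][consensus] += 1
    return support
```

An allele backed by several UMIs in a cell is better supported than one seen in many reads that all share a single UMI.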

Tim Triche (20:49:31): > in clinical MRD, ArcherDx is about to make $100M off of it

Tim Triche (20:49:56): > can send Todd’s 2016 presentation from ASH or you can just watch for the IPO :wink:

Tim Triche (20:50:31): > but yeah, single cell protocols have so many cool bits and many days it seems like people are obsessed with using them inefficiently or just to burn money

Peter Hickey (20:50:47) (in thread): > i feel like it shouldn’t be that slow but also never looked at the code so what do i know :slightly_smiling_face:

Lukas Weber (20:51:11) (in thread): > last time I ran it I think it took 4 days with 10 cores on our cluster

Tim Triche (20:51:43) (in thread): > not ONLY that technology. But it’s not a secret, at least in that application

Peter Hickey (20:52:37) (in thread): > yep > 1 week for me. here’s a script you might adapt to create the 3’ UTR SNP set > > # Construct a VCF file for use cellSNP by intersecting the cellSNP-recommended > # VCF file with 3' UTR regions. > # Initially developed for SCORE project 'C081_Arandjelovic_Pellegrini'. > # Peter Hickey > # 2020-03-04 > > # Sort and index the cellSNP-recommended VCF file ------------------------------ > > cmd <- paste0( > "module load bcftools/1.9", > "\n", > "gunzip ", > "-c /wehisan/home/allstaff/h/hickey/grpu_mritchie_0/tools/cellSNP/genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.gz ", > "| ", > "bcftools sort ", > "-o /wehisan/home/allstaff/h/hickey/grpu_mritchie_0/tools/cellSNP/genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.bgz -O z", > "\n", > "bcftools index /wehisan/home/allstaff/h/hickey/grpu_mritchie_0/tools/cellSNP/genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.bgz") > > # NOTE: This isn't the correct way to do this, but it works. > system(writeLines(cmd)) > > # Construct 3' UTRs ------------------------------------------------------------ > > library(AnnotationHub) > library(ensembldb) > > ah <- AnnotationHub() > EnsDb.Hsapiens.v94 <- ah[["AH64923"]] > three_utrs <- threeUTRsByTranscript(EnsDb.Hsapiens.v94) > three_utrs <- keepSeqlevels(three_utrs, c(1:22, "X"), pruning.mode = "coarse") > reduced_three_utrs <- reduce(unlist(three_utrs)) > > # Load VCF for loci in 3' UTRs ------------------------------------------------- > > library(VariantAnnotation) > library(here) > vcf <- VcfFile( > here("extdata", "cellSNP", "genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.bgz"), > here("extdata", "cellSNP", "genome1K.phase3.SNP_AF5e4.chr1toX.hg38.vcf.bgz.csi")) > > vars <- readVcf( > vcf, > param = ScanVcfParam( > fixed = names(fixed(scanVcfHeader(vcf))), > info = names(info(scanVcfHeader(vcf))), > geno = names(geno(scanVcfHeader(vcf))), > which = reduced_three_utrs)) > > # Write to disk as VCF file ---------------------------------------------------- > > 
writeVcf( > vars, > here( > "extdata", > "cellSNP", > "genome1K.phase3.SNP_AF5e4.chr1toX.hg38.threeUTRs.vcf"), > index = TRUE) >
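The core step of the script above, keeping only the known SNPs that fall inside 3' UTRs before handing the list to cellSNP, can also be sketched without Bioconductor. This is a minimal Python illustration, not part of cellSNP or of the script above. The function names (`merge_intervals`, `filter_vcf_lines`) are hypothetical, intervals are assumed to be 1-based and inclusive like VCF positions, and chromosome names are assumed to match between the intervals and the VCF (for example both `1`, not `1` vs `chr1`).

```python
from bisect import bisect_right
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) intervals, per chromosome.

    Returns {chrom: (sorted_starts, matching_ends)}.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, spans in by_chrom.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1] + 1:  # overlapping or directly adjacent
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def filter_vcf_lines(lines, merged):
    """Yield header lines plus records whose POS lies inside an interval."""
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if chrom not in merged:
            continue
        starts, ends = merged[chrom]
        i = bisect_right(starts, int(pos)) - 1  # last interval starting <= pos
        if i >= 0 and int(pos) <= ends[i]:
            yield line
```

Merging first keeps the per-record lookup to a single binary search, which matters when the SNP list has millions of entries.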

Peter Hickey (20:52:51) (in thread): > (would share git link but it’s part of a private repo)

Lukas Weber (20:52:57) (in thread): > awesome, thanks a lot

Tim Triche (20:54:27): > I do love the insane people at NYGC pushing the envelope and seeing how far the technology can really go :smile:

Lukas Weber (20:54:44) (in thread): > @Stephanie Hicks :point_up:

Lukas Weber (20:55:00) (in thread): > seems we can speed up cellSNP significantly

Tim Triche (20:55:10): > Thanks for exposing me to Vireo. With the Shearwater algorithm in deepSNV, I think I have a solution to something that’s been bugging me for a while now.

Stephanie Hicks (20:56:10) (in thread): > THANK YOU @Peter Hickey!

Stephanie Hicks (20:56:38) (in thread): > this is really great news.

Lukas Weber (20:56:48): > yeah Vireo seems to be great. There is also another one published at around the same time called scSplit, which seems very similar: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1852-7 - Attachment (Genome Biology): Genotype-free demultiplexing of pooled single-cell RNA-seq > A variety of methods have been developed to demultiplex pooled samples in a single cell RNA sequencing (scRNA-seq) experiment which either require hashtag barcodes or sample genotypes prior to pooling. We introduce scSplit which utilizes genetic differences inferred from scRNA-seq data alone to demultiplex pooled samples. scSplit also enables mapping clusters to original samples. Using simulated, merged, and pooled multi-individual datasets, we show that scSplit prediction is highly concordant with demuxlet predictions and is highly consistent with the known truth in cell-hashing dataset. scSplit is ideally suited to samples without external genotype information and is available at: https://github.com/jon-xu/scSplit

Lukas Weber (20:57:00): > haven’t tried this one, since Vireo was already working well for us

Peter Hickey (20:57:31): > https://github.com/wheaton5/souporcell also worth a look (i’ve not used)

Stephanie Hicks (20:57:44): > so @Peter Hickey can I ask, are you seeing researchers in your core using these computational demultiplexers and avoiding experimental demultiplexers?

Peter Hickey (20:58:50) (in thread): > welcome. i spoke with @Davis McCarthy about adding something like this to the docs (and getting their tick of approval that i’m not doing something dumb) but i imagine it got lost amongst the million other things he has going on.

Stephanie Hicks (20:59:01): > @Lukas Weber and I are trying to convince ourselves that this is a reasonable thing to do with some tissues for one of our projects.

Stephanie Hicks (20:59:31) (in thread): > yeah, it would significantly enhance our life. :wink:

Tim Triche (20:59:51) (in thread): > I love love love the name of it but the code was a bit of a mess when last I tried to use it

Peter Hickey (20:59:52) (in thread): > we’re frequently recommending genotype demultiplexing for human samples, supplemented by hashtags to distinguish treatments. but we also do a lot of stuff on inbred mice where we can’t make use of genotypes :disappointed:

Davis McCarthy (21:00:06): > @Davis McCarthy has joined the channel

Stephanie Hicks (21:00:42) (in thread): > ah makes sense.

Peter Hickey (21:00:54) (in thread): > also got more R&D projects with other labelling strategies such as MULTI-seq

Tim Triche (21:01:21) (in thread): > the Marra lab was playing with MULTI-seq and canned it. that’s my only data point for that

Lukas Weber (21:01:34) (in thread): > our collaborators had some difficulties with MULTI-seq, which is why Stephanie suggested the genetic demultiplexing for us

Lukas Weber (21:01:51) (in thread): > so now we are trying to figure out if Vireo is good enough for us

Stephanie Hicks (21:02:01) (in thread): > yeah, MULTI-seq apparently was not working well. @Lukas Weber and I never actually saw any results, but they were only telling us it was not working in their lab

Peter Hickey (21:02:19) (in thread): > https://www.biorxiv.org/content/10.1101/2020.02.12.946509v2 is on my to-read list

Peter Hickey (21:02:40) (in thread): > important for us since we do a lot of immune and PBMC stuff.

Stephanie Hicks (21:03:33) (in thread): > ah thanks

Peter Hickey (21:03:54) (in thread): > gotta get back to COVID work … ah gees, this convo is really making me pre-emptively miss in-person BioC2020 :disappointed: :cry:

Tim Triche (21:03:54) (in thread): > yeah! Wow that’s handy

Tim Triche (21:04:04) (in thread): > the Trima paper

Tim Triche (21:04:13) (in thread): > or the “don’t use Trima” paper

Tim Triche (21:04:27) (in thread): > a lot cheaper than a bone marrow aspirate though

Tim Triche (21:05:14) (in thread): > see also https://www.pnas.org/content/111/47/16802 - Attachment (PNAS): Sample processing obscures cancer-specific alterations in leukemic transcriptomes > An important goal of cancer biology is to identify molecular differences between normal and cancer cells. Accordingly, many large-scale initiatives to characterize both solid and liquid tumor samples with genomics technologies are currently underway. Here, we show that standard blood collection procedures cause rapid changes to the transcriptomes of hematopoietic cells. The resulting transcriptional and posttranscriptional artifacts are visible in most published leukemia genomics datasets and hinder the identification and interpretation of cancer-specific alterations.

Tim Triche (21:14:10): > @Stephanie Hicks @Lukas Weber are you using the (older, R) version of Vireo or the (newer, Python) version?

Lukas Weber (21:14:26): > latest from GitHub (Python)

Lukas Weber (21:15:23): > I didn’t realize there was a previous R version

Lukas Weber (21:17:33): > what is the difference between the versions?

Dan Bunis (21:19:49): > lots of messages here now but I don’t see demuxlet mentioned. It’s a tool specifically built for genotype-based sample-origin calling. Is it not well known or have people found it not to work well? I used it in my own work and have some functions for pulling the output into SCEs currently built into dittoSeq.

Lukas Weber (21:20:45): > ah yes I am aware of demuxlet – but it requires separate sample genotyping. I don’t think it can do it all from the scRNA-seq reads

Dan Bunis (21:21:11): > There’s a genotype-free version too.

Dan Bunis (21:21:27): > They call it “freemuxlet”

Lukas Weber (21:21:40): > oh, I did not know that. Thanks, I’ll look into it. Is there a paper?

Dan Bunis (21:22:23): > Not sure tbh

Lukas Weber (21:22:35): > looks like it is part of this GitHub repo, no paper: https://github.com/statgen/popscle/

Tim Triche (21:28:45) (in thread): > I mentioned it off the bat above! But scSplit and vireo can deal with partial or missing genotype information unlike demuxlet

Lukas Weber (21:29:52): > same authors though

Dan Bunis (21:29:57) (in thread): > ahh okay. figured it must have come up!

Dan Bunis (21:31:29): > Oh yea it is the same people. I knew about it I suppose just cuz I work closely with Jimmie Ye at UCSF.

Dan Bunis (21:33:59): > perhaps there’s no paper cuz vireo and scSplit beat them to it :man-shrugging:. I have no idea how they compare in performance.

Stephanie Hicks (21:37:14): > nice, thanks for pointing freemuxlet out, @Dan Bunis!

Stephanie Hicks (21:39:19): > btw does anyone know of a genetic demultiplex computational software that takes in alevin (vs cell ranger) as input?

Stephanie Hicks (21:39:34): > it seems most expect cell ranger

Lukas Weber (21:39:55): > yes was just thinking about that again too

Lukas Weber (21:40:17): > @Peter Hickey have you used cellSNP with alevin input?

Stephanie Hicks (21:40:32): > i’m not too keen on cell ranger as it tosses multi-map reads.

Lukas Weber (21:40:35): > maybe I got it wrong and it is actually possible

Stephanie Hicks (21:41:05): > but seems like a heavy lift to modify one of the genetic demultiplexers to allow for a different input file

Peter Hickey (21:44:02) (in thread): > nah only cellranger. but it just requires a BAM doesn’t it (or does alevin not give you this?)

Dan Bunis (21:45:13): > I know for Demuxlet, you provide the BAM and tell it which identifier contains the cell barcode info, and freemuxlet likely works similarly. I used it with cellranger data myself, but perhaps that makes them amenable to other types?

Lukas Weber (21:46:20) (in thread): > no, it has its own output format

Stephanie Hicks (21:46:35) (in thread): > https://salmon.readthedocs.io/en/latest/alevin.html#output

Stephanie Hicks (21:46:49) (in thread): > oh but apparently “Alevin can also dump the count-matrix in a human readable – matrix-market-exchange (mtx) format, if given flag --dumpMtx which generates a new output file called quants_mat.mtx.”

Stephanie Hicks (21:46:56) (in thread): > that’s the cell ranger format, no?

Lukas Weber (21:47:19) (in thread): > hmm maybe there is a way to then turn that into a BAM?

Stephanie Hicks (21:47:45) (in thread): > yeah, let’s chat about that. but thanks @Peter Hickey!

Stephanie Hicks (21:49:41): > hmm @Dan Bunis re-reading alevin docs (https://salmon.readthedocs.io/en/latest/alevin.html#output), it seems like theoretically it may be possible to convert one type of output from alevin (.mtx) into bam. “Alevin can also dump the count-matrix in a human readable – matrix-market-exchange (mtx) format, if given flag --dumpMtx which generates a new output file called quants_mat.mtx.”

Stephanie Hicks (21:49:47): > anyways, thanks everyone

Lukas Weber (21:51:54): > but that is just the count matrix right, not all the reads etc that are in the BAM/SAM. I’m not sure how that information could be recovered from a count matrix

Lukas Weber (21:52:24): > will read some more though, thanks

Lukas Weber (21:53:40): > there was also STARsolo, which I looked into initially when Cell Ranger wasn’t working on our cluster: https://github.com/alexdobin/STAR/blob/master/docs/STARsolo.md

Lukas Weber (21:53:59): > maybe worth trying this one again

Peter Hickey (22:08:11) (in thread): > AFAIU STARsolo is very much pitched as a faster drop-in replacement for CellRanger (whereas alevin and kallisto+bustools take a different approach that may not necessarily produce a BAM file when run ‘optimally’)

Lukas Weber (22:08:35) (in thread): > yes, that is my understanding too

Lukas Weber (22:09:55) (in thread): > STARsolo was much faster than Cell Ranger when I tried it, although I think I had some issue with it too

Lukas Weber (22:11:05) (in thread): > while alevin is doing pseudoalignment so not sure if there is any way to then convert that back into a BAM

Lukas Weber (22:11:52) (in thread): > STARsolo doesn’t have a paper / benchmarking etc (but then neither does Cell Ranger :joy:)

2020-06-10

Charlotte Soneson (01:35:49) (in thread): > Maybe @Avi Srivastava can provide some more insights here

Avi Srivastava (01:35:54): > @Avi Srivastava has joined the channel

Avi Srivastava (09:01:20): > Hey guys! Thanks @Charlotte Soneson for adding me to the channel! > It sounds like a very interesting discussion. @Lukas Weber @Stephanie Hicks, if I understand correctly, the motivation is to call cell-level SNPs using the alignments from the BAM? > In theory, you have to provide alevin with the --writeMappings flag and it will dump the BAM file; however, I think the requirement here is somewhat different. If I understand correctly, cellSNP consumes a 10x-format BAM? I have to read about it, but if it does then there are multiple questions in my mind. First, like Stephanie already mentioned, 10x indeed drops the multi-mapping reads during quantification, but it does report their alignments in the BAM; does cellSNP consume those or ignore them? Second, 10x just tags one of the alignments as the primary alignment. In that case, since fragmentation happens randomly, if you consume only the primary alignment then there is a chance that you miss a SNP that was present in the secondary alignment and was randomly ignored. > > I think these are very interesting questions and I’d be happy to contribute and add / edit / modify alevin in any required way. Currently, alevin will dump the BAM, but the cellular barcode and UMI information/tags are not propagated; if we are interested I can add that over the weekend.

Tim Triche (11:51:44): > that would be handy – I can easily go from 10X BAM back to FASTQ (attached) but getting the flags into a BAM that doesn’t have them is more trouble :wink:

Tim Triche (11:52:06): > god bless bioawk - File (Plain Text): 10xR1 - File (Plain Text): 10xR2 - File (Plain Text): 10xSE

Tim Triche (11:52:32): > (the reason for those was to grab a remote BAM, open it up, and stream into kbus/salmon/alevin)

Tim Triche (11:53:07): > I know that we’ve talked about UMI disambiguation for ASE and velocity @Avi Srivastava – this would be kind of useful especially if parts can be incorporated into the Salmon streaming pass up-front

Tim Triche (11:54:13): > so for example with the above, if you use htslib/samtools to samtools view a remote BAM that SRA is hosting on AWS S3, you can pipe the output through to get R1, R2, or a fake single-end with CB/UB tags

Tim Triche (11:54:30): > I use this a lot to fiddle with UMI-aware variant calling in high coverage transcripts/regions

Tim Triche (11:56:06): > with plate-seq, we’ve been using Salmon (obviously?) but in an ideal world a selective-alignment BAM with all reads tagged for CB/UB would be great. I’m not super concerned about UMI or CB error correction schemes.

Tim Triche (11:57:08): > The “problem” of having 250K or 1M “cells” that may or may not be informative is… less my interest than having a few hundred/thousand that are very informative for a process of interest.

Rob Patro (11:57:48): > @Rob Patro has joined the channel

Tim Triche (11:57:58): > oh hi Rob

Lukas Weber (11:58:40): > awesome, thanks @Avi Srivastava and @Charlotte Soneson. Yes, we are running cellSNP to call cell-level SNPs, using the 10x format BAM from Cell Ranger, together with a preprocessed human variant list from the 1000 Genomes Project that is provided along with cellSNP: https://sourceforge.net/projects/cellsnp/files/SNPlist/ I didn’t know about the --writeMappings flag, so this is very useful, thanks. I also didn’t really know what Cell Ranger was doing internally with the multi-mapping reads, so yes, if it works the way you describe (selecting one at random) then it seems plausible that it will miss the SNPs from the secondary alignment. > > When you mention the cellular barcodes, do you mean the alevin BAM does not include cell barcodes currently? This would indeed be very useful for us to add (in fact probably crucial for what we are doing, since we are using the cell barcodes with unique sample IDs to demultiplex, e.g. AGAATGGTCTGCAT-X2 where X2 is the sample ID)

Tim Triche (11:59:01): > interested if others are doing these things out in public (https://github.com/trichelab/single_cell_analyses)

Tim Triche (11:59:45): > my lab has been working on an in-house unified pipeline for inDrops (Jovinge lab), 10X (various metabolomics people), CelSeq2 (Pospisilik lab), and Takara SmartERseq + UMIs (Triche/Shen labs)

Tim Triche (12:01:11): > @Rob Patro @Charlotte Soneson @Avi Srivastava this is us slowly reconciling RNA velocity measurements – you might get a kick out of the revisions to https://www.nature.com/articles/s41591-020-0944-y/figures/4 where they ended up re-running everything with the dynamical model – looks like we’re not the only ones who see this :slightly_smiling_face:

Tim Triche (12:02:09): > obviously @Charlotte Soneson has done a spectacular job of this: https://github.com/csoneson/rna_velocity_quant (for anyone who doesn’t know)

Avi Srivastava (12:07:19): > That’s awesome @Tim Triche, I wasn’t aware of this.

Tim Triche (12:15:36): > well, it turns out that your paper matters :slightly_smiling_face:

Avi Srivastava (12:16:42): > All right, it looks like it’d be helpful for people to have the CB / UMI in the BAM. 10x generally has two flags for each CB / UMI, I think, representing raw and corrected sequences. I’d most probably add only the corrected ones for starters.

Tim Triche (12:17:13): > three tags

Tim Triche (12:17:42): > UR/UY/UB and CR/CY/CB

Tim Triche (12:17:50): > raw/qual/corrected for each
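For reference, the six tags named above sit in the optional fields of each SAM/BAM record (column 12 onward, as `TAG:TYPE:VALUE`), and the primary-vs-secondary distinction raised earlier is the `0x100` bit of the FLAG column. A minimal sketch of pulling both out of one text SAM line; the function name and the returned dictionary keys are hypothetical, and real pipelines would normally use a BAM library rather than string splitting.

```python
def parse_10x_tags(sam_line):
    """Return the barcode/UMI tags and alignment status of one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    tags = {}
    for field in fields[11:]:  # optional fields start at column 12
        # maxsplit=2 keeps any ':' inside the value (e.g. in quality strings)
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return {
        "read": fields[0],
        "secondary": bool(flag & 0x100),  # SAM flag bit for secondary alignments
        "cell_raw": tags.get("CR"),
        "cell_qual": tags.get("CY"),
        "cell_corrected": tags.get("CB"),
        "umi_raw": tags.get("UR"),
        "umi_qual": tags.get("UY"),
        "umi_corrected": tags.get("UB"),
    }
```

Records that lack a tag come back with `None` for that key, which covers a BAM where the corrected barcode or UMI was never assigned.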

Avi Srivastava (12:18:12): > Oh right, quality is another variable, probably skip that as well ?

Tim Triche (12:18:31): > so either all three, or just CB/UB

Tim Triche (12:19:17): > if you want to futz around with inDrops, error correction, whatever then it’s probably good to have the “standard” (hahahaha) 10X BAM format to unify things

Tim Triche (12:20:06): > you can use these bioawk scripts as a fake unit test if you want :-) - File (Plain Text): 10xR1 - File (Plain Text): 10xR2 - File (Plain Text): 10xSE

Dan Bunis (12:20:18): > I’d throw into this discussion that missing a few SNPs-per-cell is not a terrible thing for the purpose of demultiplexing — so the multi-map assignment issue may be an unimportant one for this goal. For other purposes, and especially if one wants to investigate a specific SNP, it can matter more. But getting accurate sample calls takes only 50 SNPs even when there are 64 samples in the droplet lane. Often that is even overkill cuz there are far fewer samples. (This number comes from some simulations within the original demuxlet paper.)
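The intuition that a modest number of SNPs is enough can be seen in a toy version of genotype-based assignment: each SNP contributes a little evidence, and the log-likelihoods add up across SNPs. The sketch below is a deliberately simplified illustration and not the demuxlet, freemuxlet or Vireo model. It assumes donor genotypes are known at every SNP in `cell_counts`, ignores doublets and base qualities, and uses a made-up fixed `error_rate`; all names are hypothetical.

```python
import math


def assign_donor(cell_counts, donor_genotypes, error_rate=0.01):
    """Pick the donor whose genotypes best explain a cell's allele counts.

    cell_counts: {snp_id: (ref_reads, alt_reads)} for one cell.
    donor_genotypes: {donor: {snp_id: alt_dosage}} with dosage 0, 1 or 2,
        assumed to cover every SNP present in cell_counts.
    Returns (best_donor, {donor: log_likelihood}).
    """
    # Probability that a read shows the ALT allele, given the donor's dosage.
    p_alt = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}
    scores = {}
    for donor, genotypes in donor_genotypes.items():
        loglik = 0.0
        for snp, (ref, alt) in cell_counts.items():
            p = p_alt[genotypes[snp]]
            loglik += alt * math.log(p) + ref * math.log(1.0 - p)
        scores[donor] = loglik
    best = max(scores, key=scores.get)
    return best, scores
```

A single read contradicting a homozygous genotype costs roughly log(0.01) under these assumptions, so a few dozen informative SNPs separate donors quickly, while heterozygous sites contribute the same amount to every donor that shares them.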

Avi Srivastava (12:20:40) (in thread): > Thanks Tim !

Tim Triche (12:20:55) (in thread): > enlightened self interest makes the world go round :wink:

Tim Triche (12:21:33): > sure, but if you want to look at aneuploid or circular regions, maybe don’t rely solely upon diploid sims :slightly_smiling_face:

Dan Bunis (12:22:38) (in thread): > Valid. I’ll try to chuck off my singular human focus lol.

Tim Triche (12:23:40) (in thread): > *healthy human focus :wink:

Tim Triche (12:25:01) (in thread): > ps. all healthy human cells have non-diploid genomes. Mostly these get thrown away, until someone needs a Science or Nature paper: https://www.the-scientist.com/magazine-issue/the-two-genomes-in-every-eukaryotic-cell-66623 - Attachment (The Scientist Magazine®): The Two Genomes in Every Eukaryotic Cell > Interactions between mitochondrial and nuclear genomes have further-reaching effects on physiological function, adaptation, and speciation than previously appreciated.

Lukas Weber (12:30:36): > yeah I was getting very good demultiplexing performance with the Cell Ranger BAM and cellSNP/Vireo, so the multimapping issue wasn’t creating huge problems in this case. However there was a small proportion of cells (1-2%) that Vireo couldn’t assign correctly to samples (it ended up calling them as “unassigned” or incorrect doublets), so it is possible that some of these were affected by multimapping reads

Tim Triche (12:36:28): > my suspicion is that for non-allele-specific expression of autosomal diploid genes, this is fine; however Lana showed increased performance with better assignment overall: https://www.nature.com/articles/s41467-018-07170-5 - Attachment (Nature Communications): Using single nucleotide variations in single-cell RNA-seq to identify > Identification of cell subpopulations using transcript abundance is noisy. Here, the authors developed a linear modeling framework, SSrGE, which utilizes effective and expressed nucleotide variations from single-cell RNA-seq to identify tumor subpopulations.

Tim Triche (12:37:46): > caveat: as usual, it probably depends on the experiment, the prep, the depth, and the degree of diversity in the cells

2020-06-12

Paul Hoffman (13:40:27): > @Paul Hoffman has joined the channel

2020-06-15

Avi Srivastava (12:15:18): > @Tim Triche @Lukas Weber I’ve added the code for propagating the Cellular Barcode & UMI tags to the alevin generated BAM. Currently the changes are in the develop branch of salmon and have to be compiled from source; they will be available by default from the next salmon release.

Lukas Weber (12:18:41): > awesome, thanks a lot. will try this out in our pipeline!

Lukas Weber (12:18:44): > @Stephanie Hicks

Tim Triche (12:19:10): > that’s awesome! Thanks so much! We were just discussing a use case for this about 30 seconds ago!

Avi Srivastava (15:00:59): > Glad to hear that, let me know if you need help with things.

Tim Triche (15:09:00): > nvm I see it

Tim Triche (15:10:29): > extraBAMtags is where all the action is, correct? this could allow for reunification of a lot of different approaches, thanks for doing this

Avi Srivastava (15:12:30): > That was fast :slightly_smiling_face:. Yes, extraBAMtags has the extra BAM tags which got propagated.

Stephanie Hicks (21:51:40) (in thread): > thank you @Avi Srivastava! (i’m so sorry for my late response. just catching up with bioc slack messages)

Avi Srivastava (21:52:24) (in thread): > No worries, happy to help !

Stephanie Hicks (21:56:35): > so awesome, thank you @Avi Srivastava!

2020-06-16

Alexander Toenges (05:36:41): > @Alexander Toenges has joined the channel

Alexander Toenges (05:44:14): > @Charlotte Soneson Quick question, if I may, about velocity, especially your Alevin/scvelo tutorial: Do you know if it is appropriate to run velocity on integrated data? I have two genotypes with n=2 each, basically followed OSCA, integrated and clustered them, and now aim to run velocity analysis. Is there any assumption of either scvelo or velocyto that forbids using the integrated PCA (so the “corrected” reducedDim)? Since some clusters are unique to one condition I would run velocity separately on each condition but using the integrated reducedDims, is that possible?

Charlotte Soneson (07:07:54): > @Alexander Toenges Yes, that’s where things get complicated…I don’t have a terribly informed answer, in the sense that I haven’t done very thorough investigations on this front, and I think there are clear challenges. > I do think it can make sense in some cases to provide, e.g., an integrated reduced dimension representation, which will then be used to get the nearest neighbors to calculate moments across. However, if the batch effect is strong you may end up in trouble by averaging spliced or unspliced abundances across batches (if the abundances are not also properly integrated). > For the velocity calculations, scVelo will want normalized counts (or raw counts, which will be internally normalized). Thus, to account for batch effects on that level you’d need to get batch corrected (normalized) “counts” in one way or another. Moreover, you would likely want to adjust the spliced and unspliced counts together somehow, to avoid destroying the relationship between the two count matrices. This gets difficult when the batch correction is done gene-wise (so just concatenating the spliced and unspliced matrices is likely not enough). > In your specific setup, if you’re running the velocity analysis separately for each condition anyway I don’t know if it makes a big difference to use the within-condition PCA or the integrated one, the nearest neighbors within the same condition may not be that different (the main difference may be that you also have neighbors from the other condition in the integrated PCA).

Alexander Toenges (07:27:08): > I see, thank you for your answer. I guess I will end up exploring how the different strategies behave and compare.

jessi elderkin (09:53:59): > @jessi elderkin has joined the channel

Nicholas Knoblauch (10:12:12): > @Nicholas Knoblauch has left the channel

Jordan Veldboom (10:31:32): > @Jordan Veldboom has joined the channel

2020-06-18

pamela himadewi (09:54:44): > @pamela himadewi has joined the channel

2020-06-23

Steve Lianoglou (19:33:42): > @Steve Lianoglou has joined the channel

2020-06-25

CristinaChe (16:00:23): > @CristinaChe has joined the channel

2020-06-30

Frank Rühle (06:21:08): > @Frank Rühle has joined the channel

2020-07-05

Giuseppe D’Agostino (22:08:46): > I was reading the latest preprint from the Linnarsson lab (https://www.biorxiv.org/content/10.1101/2020.07.02.184051v1.full.pdf) and found their discussion of the new cytograph pipeline interesting, in particular this point: “euclidean distance in PCA space” has no intrinsic meaning that can be interpreted as a meaningful distance between cells. They then argue for using a fixed threshold on the (square root of the) Jensen-Shannon divergence between any pair of cells as a better way to draw the graph edges. I was under the naive impression that euclidean distance in PCA space was a “good enough” approximation of biologically meaningful differences, but I’d be interested in anybody else’s take on this
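For concreteness, the square root of the Jensen-Shannon divergence between two cells can be computed directly from their count vectors once each is normalized to a probability distribution. This is a minimal sketch of the metric only; which features the preprint feeds into it and what threshold it uses are not covered here, and the function name is hypothetical. With base-2 logarithms the value lies between 0 (identical proportions) and 1 (no shared non-zero features), which is what makes a fixed threshold interpretable.

```python
import math


def js_distance(counts_p, counts_q):
    """Square root of the Jensen-Shannon divergence (base 2) of two count vectors.

    Both vectors must have the same length and a non-zero total.
    """
    total_p, total_q = sum(counts_p), sum(counts_q)
    jsd = 0.0
    for cp, cq in zip(counts_p, counts_q):
        p, q = cp / total_p, cq / total_q
        m = 0.5 * (p + q)  # mixture distribution
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding error
```

Unlike Euclidean distance on raw counts, the result does not change when one cell is simply sequenced more deeply than the other, because only proportions enter the calculation.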

2020-07-06

Vivek Das (11:53:35): > My opinionated 2 cents @Giuseppe D’Agostino. Glad to see you here. :wink: There is more to it. Even though we are using ED for PCA, the inherent assumption of normality in PCA is a big problem and, in my opinion, one of the bottlenecks we should get rid of when using it for single cell, or when tapping more intrinsic information in a biological space that accounts not just for gene-to-gene relationships but also cell-to-cell ones. We did a lot of PCA while profiling bulk expression, be it RNA or protein, but it is not always the best metric, as we have a normality assumption. For single cell, our subspace of mRNA & protein with its dimensionality is beyond a linear space. A representation in a lower subspace in 2D is not the same as clustering & embedding all the dimensions of cells, genes, samples, states, etc. In my experience, with more and more data we should be moving beyond linear PCA unless we strictly mention it’s for denoising. Otherwise it’s challenging and also has issues. Cell-to-cell variation & stochasticity are not linear in nature, so why do we approximate them with a linear measure? Hence graph-based methods, manifolds, etc. are getting popular. It is also why autoencoders and various other neural-network-based methods have become popular lately. I have often questioned this but haven’t always got a great response, and I still don’t buy that using linear PCA to understand a non-linear phenomenon is the best approach, especially when it’s [cell * genes * samples], which can be in a healthy vs disease setting or, for that matter, a developmental trajectory. Here we are treading entirely in function-of-function territory. From what I have learned, linear PCA isn’t helpful here. But then again, I say this because I don’t accept the normality assumptions. If we still assume normality then one can still defend using linear PCA. There are a few reads from Casey Greene’s lab around it that might also be insightful.

Vivek Das (11:56:07): > I would also encourage reading work from the Smita Krishnaswamy lab, which will give more insights.

Vivek Das (11:57:40): > My argument is always about the dynamic and stochastic process we want to understand from transcription and translation. Generalizations of linearity have helped us all along to date, but that doesn’t mean it’s always the best and the correct way when our information abundance is non-linear in nature & the relationship we study is a non-linear function.

Vivek Das (12:00:56): > I will be happy to learn what others in the #singlecell-queries channel think about it. :blush:

Tim Triche (12:05:54): > this: https://academic.oup.com/bioinformatics/article/36/11/3418/5807606 is a fascinating approach to using a nonlinear model, linearly decoded, to produce something resembling a traditional factor analysis in terms of interpretability, while retaining some of the universal-function-approximator power of a VAE

Vivek Das (12:15:00): > Indeed @Tim Triche :blush:

Vivek Das (12:15:52): > It’s all about factor and matrix manipulation isn’t it?

Vivek Das (12:16:24): > And you know I have a thing about VAE. :grimacing:

Tim Triche (12:17:57): > learning the embedding for a VAE seems to be the sticking point – I can learn a UMAP embedding and project new data onto it, I’m less confident that a VAE will behave as expected going both ways. So LD-VAE seems to offer a useful compromise in terms of learning on massive data but also being able to project new data into the embedded space in a useful manner. Whether it’s as straightforward in production I don’t yet know.

Tim Triche (12:21:30): > also regressing on arbitrary factors, etc. – yeah things are nonlinear processes but we have 200 years of statistical shenanigans to help retrieve a locally or high-dimensionally linearized representation. so.

Alan O’C (12:21:42): > I would have slightly more confidence in the reliability of a VAE embedding rather than UMAP (for new data). What’s the reasoning there?

Vivek Das (12:21:43): > I did some testing around it but I have to dig deeper. There are some underground projects I am doing with some of my collaborators as curiosity-driven side gigs. Let’s see how far we get and how successful we are, and then maybe publish if this kind of material is of interest to others. For this and other related projects, we are trying out multiple datasets profiled at the single cell level, importantly thinking along the lines of the biological process rather than just sequencing the heck out of it and letting the data do the talking. Again, I don’t buy that. Biological domain knowledge is important when studying the data’s behavior, properties & shape. We make assumptions & approximate things based on those, whether it’s Frequentist, Bayesian, etc.

Tim Triche (12:22:35): > @Alan O’C for UMAP? mostly that it’s familiar and at least with iterative LSI, seems to behave roughly as expected

Vivek Das (12:22:43) (in thread): > When information abundance is so huge it’s beyond linear statistics and this is where deep mathematics comes into play. :blush:

Tim Triche (12:23:47): > VAEs have been the single greatest source of contention in comparing results between single-cell methods developers that I have ever seen. and I mean this literally, as in, actual humans arguing in person about it

Vivek Das (12:24:14): > Well, statistics is also about generalizations and approximations isn’t it? So why not high level mathematics coupled with statistics?

Tim Triche (12:25:12): > not opposed to either of the above, just noting that nonlinear regression (for example) or GAMs allow dealing with structured data in a way that can be both interpretable and powerful. once things get too “deep” it seems to be hard to elicit useful feedback from domain experts

Tim Triche (12:25:43): > visualizing what’s going on inside the black box is not trivial, although GANs and generative models certainly help

Vivek Das (12:25:52): > Again, it’s not just VAE. I am more invested in VAE, but that doesn’t mean nothing else will work. Manifold learning is actually something I am pretty amazed by. I took a workshop to understand the underlying math. It is fascinating.

Tim Triche (12:26:04): > just MHO though. For prediction, best model wins. For inference, I feel like it is thornier. But, just MHO.

Vivek Das (12:26:19): > Of course I need to dig a bit more and now create evaluation around it.

Tim Triche (12:26:35): > noneuclidean geometries and differential forms are fascinating

Vivek Das (12:27:49): > I agree @Tim Triche, it gets trickier to explain and interpret those details. But again, if we reached that level in image and signal processing, I don’t understand why in the upcoming years we can’t educate biology subject matter folks in the single cell space.

Tim Triche (12:27:53): > Tamara Munzner showed decades ago that this can be turned to advantage by using hyperbolic space to “lens” over enormous trees

Tim Triche (12:28:41): > so I’m not opposed to it (Aviv Regev and Bonnie Berger, I think, revived it recently for single-cell visualization), just uncertain of how much additional insight vs. confusion is generated.

Vivek Das (12:29:13) (in thread): > Interpretation and inference go beyond prediction, as I understand it. There is so much. The more I learn, the more fascinating it gets, especially with multi-dimensional data from single cell.

Tim Triche (12:29:30): > good point about signal processing – nobody needs to know how a Fourier transform or HMM works to use their cellphone

Tim Triche (12:29:57): > although it’s probably a good idea for the engineers to :wink:

Vivek Das (12:31:13) (in thread): > Yes, in the preprint. There was also another preprint from a Facebook group on Poincaré maps. Very fascinating. > 1. https://www.biorxiv.org/content/10.1101/689547v1 > 2. https://ai.facebook.com/blog/poincare-maps-hyperbolic-embeddings-to-understand-how-cells-develop/ - Attachment (ai.facebook.com): Poincaré maps: Hyperbolic embeddings to understand how cells develop

Vivek Das (12:36:47): > Well, a pretty naive generalization on my end from signal processing. But my point was more about how we understand imaging science and the methods we employ around it when dealing with pixels across stacks for any kind of interpretation. This isn’t restricted to linear approximations, since we take so many factors into account: geometry, phenomena, distance, etc. With single cell biology reaching such a state of high-abundance, complex information, I fail to understand why we can’t take learnings from imaging science. If we can, then those methods will also come into use and we can probably get comfortable with them eventually. Of course the black box arguments will be there, but there are smarter folks who write the underlying mathematical equations that led to such methods, and they can probably teach and educate us.

Vivek Das (12:56:43) (in thread): > I should have clarified it better. Again, regressing on arbitrary factors is a problem. In statistics, we often approximate non-linearity via additive or multiplicative models. Isn’t that what we do for interactions? To an extent this definitely feeds into non-linearity, as I understand it, but the premise still comes from linear processing of the data, along with visualization & representation based on linear methods. This is where it often gets challenging for me, and I often don’t find the reasons convincing enough when it comes to single cell data & its underlying biology, especially when transcription and translation aren’t entirely linear phenomena. Again, my assumptions are based on the fact that I don’t consider the molecular central dogma an entirely linear process or phenomenon (whatever we want to term it). This is due to my training in and liking for systems biology and gene regulatory relationships. > > Again, this is not entirely correct and can’t be generalized across all forms. I would like to mention that as well. This is one way of thinking. As I often like to say, and as others around me have said too, it’s never one gene to one protein. If it indeed holds true that one gene is not one protein, or if we agree on that, then I am afraid we should think about how we employ and use our methods to understand such things accordingly. :blush:

Tim Triche (12:59:30): > looking at e.g. the Poincare maps paper, it’s elegant and beautiful but at the same time it’s not immediately apparent from the selection of applications they used that the approach offers light-bulb type insights that the other methods don’t. JMHO. I like the math but am not sold on the novel insight aspect of the technique.

Vivek Das (13:04:01): > Of course @Tim Triche. The word “novel” is often an overstatement. In biology, any data-driven insights, measurements, methods, etc. are all applied from some other field. I don’t think they are novel, but their application may be. But still, to reach that statement, rigorous comparison, evaluation and validation are needed. In short, a benchmark. If there isn’t one, it’s never “novel” for me, nor is it something I will be sold on. I might find it fascinating though, but that’s it. :wink:

Vivek Das (13:05:58): > I will stop here and not bore others in the #singlecell-queries channel. Will be happy to learn from others. Thanks for the engagement and discussion @Tim Triche. :blush:

Tim Triche (14:10:40): > question for e.g. @Aaron Lun and others who have spent some time on this – what do people who have benchmarked single-cell plate-seq approaches with and without spike-in standards think of sctransform and its regularized negative binomial regression model? https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1 - Attachment (Genome Biology): Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression > Single-cell RNA-seq (scRNA-seq) data exhibits significant cell-to-cell variation due to technical factors, including the number of molecules detected in each cell, which can confound biological heterogeneity with technical effects. To address this, we present a modeling framework for the normalization and variance stabilization of molecular count data from scRNA-seq experiments. We propose that the Pearson residuals from “regularized negative binomial regression,” where cellular sequencing depth is utilized as a covariate in a generalized linear model, successfully remove the influence of technical characteristics from downstream analyses while preserving biological heterogeneity. Importantly, we show that an unconstrained negative binomial model may overfit scRNA-seq data, and overcome this by pooling information across genes with similar abundances to obtain stable parameter estimates. Our procedure omits the need for heuristic steps including pseudocount addition or log-transformation and improves common downstream analytical tasks such as variable gene selection, dimensional reduction, and differential expression. Our approach can be applied to any UMI-based scRNA-seq dataset and is freely available as part of the R package sctransform, with a direct interface to our single-cell toolkit Seurat.

Vivek Das (16:24:24): > #singlecell-queries, to add to our morning discussions, this preprint seems pretty interesting to me. Just came across it today. https://www.biorxiv.org/content/10.1101/2020.06.24.169136v1

Vivek Das (16:26:59): > Interestingly cells are defined as samples here. :wink:

2020-07-13

Lukas Weber (09:10:04): > Hi @Avi Srivastava, I have another follow-up question from our discussion a few weeks ago about the alevin generated BAM/SAM. I have been having some trouble getting cellSNP to accept the alevin generated BAM for cell genotyping, and I think I have tracked down the issue to cellSNP having difficulty figuring out the chromosome for each read - it recognizes the read, but gives an error when trying to match it to chromosomes. > > If I understand the SAM format correctly, there is the RNAME field in the 3rd column, which is usually where the chromosome information is stored. In my previous Cell Ranger SAM, this column just had a simple number, e.g. 1 (I assume for chromosome 1; or a star if it is unmapped). However in the alevin SAM, there is a transcript name here instead, e.g. ENST00000377813.6. I’m not super familiar with SAM/BAM formats etc, so just wanted to check here - am I interpreting this correctly, and is this field supposed to have a transcript name in the alevin output? And if so, do you know if there is a way to easily recover the chromosome info from this? I might also get in touch with the cellSNP authors again about this too, just in case I am completely misinterpreting things. > > Below is also a screenshot of a few lines from my alevin SAM, in case this helps. The field I am referring to is where it says ENST00000377813.6, where the Cell Ranger SAM would just have a chromosome number instead. Thanks!

Lukas Weber (09:10:16): - File (PNG): read.png

Avi Srivastava (09:27:57): > Yes, the transcript names are expected in the alevin BAM. > Cell Ranger (which uses STAR internally) performs genome alignment, while alevin performs transcriptome alignment. > Extracting the chromosome mapping of a transcript should be relatively straightforward; however, I’m assuming cellSNP might need the read mappings in genomic coordinates, so the SAM has to be converted from transcriptomic coordinates to genomic coordinates, unless cellSNP has a way to utilize the transcriptomic BAM directly.
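
A rough sketch of the "relatively straightforward" part, i.e. recovering only the chromosome for a transcript named in the SAM RNAME column by looking it up in a GTF annotation. The GTF records, transcript IDs and SAM line below are synthetic and made up for illustration, and the helper names are hypothetical. It assumes the transcript IDs in the GTF match the RNAME strings exactly (including any version suffix), and it does not convert read positions, which is the harder step discussed in the thread.

```python
import re

# Synthetic GTF records for illustration only; IDs and coordinates are made up.
GTF_TEXT = (
    'chr1\tEXAMPLE\ttranscript\t1000\t5000\t.\t+\t.\t'
    'gene_id "GENE0001.1"; transcript_id "TX0001.2";\n'
    'chrX\tEXAMPLE\ttranscript\t200\t900\t.\t-\t.\t'
    'gene_id "GENE0002.1"; transcript_id "TX0002.1";\n'
)

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def transcript_to_chrom(gtf_text):
    """Build a {transcript_id: chromosome} lookup from 'transcript' rows of a GTF."""
    mapping = {}
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue                      # keep one row per transcript
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            mapping[match.group(1)] = fields[0]   # column 1 is the chromosome
    return mapping

def rname_of(sam_line):
    """Return the RNAME field (third column) of a SAM alignment line."""
    return sam_line.split("\t")[2]

tx2chrom = transcript_to_chrom(GTF_TEXT)
sam_line = "read1\t0\tTX0001.2\t15\t255\t50M\t*\t0\t0\t*\t*"   # synthetic alignment
print(tx2chrom[rname_of(sam_line)])  # chr1
```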

Lukas Weber (09:37:05): > Thanks - yes that makes sense, and also that the read mappings would need to be in genomic coordinates. I’ll see if there is a way to convert this and/or check if cellSNP can accept the transcriptomic BAM directly

Rob Patro (09:39:53): > I haven’t used it in a while, @Lukas Weber, but sam-xlate is theoretically able to convert between transcriptomic and genomic coordinates (https://github.com/mozack/ubu/wiki). There may be other tools as well (@Avi Srivastava was working on one, but I don’t think it’s production-ready yet).

Lukas Weber (09:40:55): > oh cool, thanks for the reference @Rob Patro, I’ll try that one to start with

Rob Patro (09:41:12): > sure thing!

Tim Triche (10:52:19): > this is interesting (any time I see something in Java I immediately look to see if the same tool exists in a sensible language): https://www.biorxiv.org/content/10.1101/160085v1.full

Tim Triche (10:53:23): > > Mapping to transcriptomes > GraphMap can now accept a GTF file to internally construct a transcriptome sequence from a given reference genome, and map RNA-seq data to it. The final alignments are converted back to genome space by placing N operations in the CIGAR strings. >

Tim Triche (10:53:24): > https://github.com/isovic/graphmap

Tim Triche (10:53:43): > this looks cool and maybe fast enough to solve some problems (e.g. wut iff we link it into a package)

Rob Patro (11:17:58): > Interesting find, @Tim Triche! This is always a tool that I thought would be generally useful.

Rob Patro (11:18:16): > sort of like “liftover”, but within a genome, between genomic and transcriptomic coordinates, and for alignments.

Tim Triche (11:21:24): > exactly. also sort of like liftoff if liftoff didn’t take literally weeks

Tim Triche (11:22:16): > plus with a given GTF (esp. ENSEMBL or, I think, Gencode) the coordinates are already there so the operation always seemed like it would be trivial
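
To make the "coordinates are already there" point concrete, here is a small sketch of the core arithmetic: walking a transcript's exon blocks to turn a position within the transcript into a genomic position. It is not how GraphMap or sam-xlate are implemented. The function name and exon coordinates are hypothetical, and it assumes a forward-strand transcript with 1-based inclusive exon coordinates sorted by start, as in a GTF.

```python
def transcript_to_genome(exons, offset):
    """Map a 0-based offset within a transcript to a 1-based genomic position.

    exons: list of (start, end) tuples, 1-based inclusive, forward strand,
           sorted by start (hypothetical values in the example below).
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    remaining = offset
    for start, end in exons:
        length = end - start + 1
        if remaining < length:
            return start + remaining      # position falls inside this exon
        remaining -= length               # skip this exon and keep walking
    raise ValueError("offset lies beyond the end of the transcript")

exons = [(100, 199), (300, 399)]          # two 100 bp exons with a 100 bp intron
print(transcript_to_genome(exons, 0))     # 100
print(transcript_to_genome(exons, 99))    # 199
print(transcript_to_genome(exons, 100))   # 300
```

The jump from 199 to 300 is the intron; an alignment spanning it is what the quoted GraphMap description represents with N operations in the CIGAR string. Reverse-strand transcripts would need the exon order and offsets flipped, which this sketch leaves out.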

Tim Triche (11:24:17): > this will also come in handy: > > # Process reads from a circular genome: > ./graphmap align -C -r escherichia_coli.fa -d reads.fastq -o alignments.sam >

2020-07-14

Stephanie Hicks (13:18:21): > Is there a Bioc ExperimentHub scRNA-seq data package using nanopore long-read sequencers? I couldn’t find one, but this seems like it might be useful to have for developers to play with. Here is a preprint with some immune cells combining 10X Chromium to generate cDNA pools and then a Nanopore long-read sequencer for full isoform-level transcriptomes: https://www.biorxiv.org/content/10.1101/2020.01.10.902361v1

Stephanie Hicks (13:19:06): > if it’s of interest, I could work on pulling together a data package, but wanted to know if others already had one in the works

Tim Triche (13:25:53): > The r2c2 paper! That would be a nice resource. We have some direct RNA and capture DNA on the same cell lines but the r2c2 approach is far more fun

Luyi Tian (20:04:44): > @Luyi Tian has joined the channel

2020-07-21

James MacDonald (15:48:33): > @James MacDonald has joined the channel

2020-07-27

Arun Chavan (12:10:08): > @Arun Chavan has joined the channel

2020-07-29

Brianna Barry (21:32:37): > @Brianna Barry has joined the channel

2020-07-30

Ayush Aggarwal (01:43:15): > @Ayush Aggarwal has joined the channel

Ludwig Geistlinger (10:44:35): > @Ludwig Geistlinger has joined the channel

Nur-Taz Rahman (10:46:48): > @Nur-Taz Rahman has joined the channel

koki (10:54:57): > @koki has joined the channel

Markus Schroeder (10:56:25): > @Markus Schroeder has joined the channel

sani (10:58:23): > @sani has joined the channel

sani (10:59:32): > Hi everyone, does anyone know how to draw something similar to this: https://github.com/satijalab/seurat/issues/962

Jared Andrews (11:02:10): > Ah, you want dittoSeq: https://bioconductor.org/packages/release/bioc/html/dittoSeq.html

shr19818 (11:02:17): > @shr19818 has joined the channel

Jared Andrews (11:02:24): > Can do that with ease.

Dan Bunis (11:09:35): > @sani dittoBarPlot is all about making these plots =)

sani (11:11:17) (in thread): > I actually don’t know how to represent my clusters on the x-axis.

Dan Bunis (11:28:29) (in thread): > You should just need to put group.by = "ident" or = "name_of_clustering_metadata" for that. Feel free to DM though if it’s still not working.

sani (11:29:03) (in thread): > Thank you, I will try this and will let you know :slightly_smiling_face:

Bharati Mehani (11:35:39): > @Bharati Mehani has joined the channel

sani (14:38:02): > I was wondering if anyone knows how I can order the cluster number in pic below. Thank you. - File (PNG): 2.PNG

Lucy (14:41:12) (in thread): > Yes, change the levels of the factor to the order that you would like

Jared Andrews (14:41:45) (in thread): > Or if using dittoBarPlot, use the x.reorder parameter: > > Integer vector. A sequence of numbers, from 1 to the number of groupings, for rearranging the order of x-axis groupings. > > > > Method: Make a first plot without this input. Then, treating the leftmost grouping as index 1, and the rightmost as index n. Values of x.reorder should be these indices, but in the order that you would like them rearranged to be.

sani (14:42:23) (in thread): > I used dittoBarPlot.

Selvi Guharaj (17:14:25): > @Selvi Guharaj has joined the channel

2020-07-31

sani (14:47:33): > Would you mind letting me know whether it is usual for SCTransform to take a long time (more than an hour) to run? My file size is 19 GB.

Jared Andrews (15:01:34): > How many cells? Seurat bloats the hell out of its data objects, I really don’t know why. SCTransform can take quite a while with a lot of cells, certainly over an hour.

2020-08-02

sani (10:21:06) (in thread): > My problem is solved by osvobod command here: https://github.com/satijalab/seurat/issues/1426

2020-08-03

sani (09:32:46): > I was wondering if anyone knows why my program stops working after running the SCTransform function. Here is also the screen after running SCTransform. - File (PNG): Screen Shot 2020-08-02 at 2.12.12 PM.png

Alan O’C (09:38:22): > You might be running out of memory

sani (09:46:34): > My file is 19 GB and I assigned 64 GB

Alan O’C (09:48:59): > I’ve not used Seurat for datasets of that size, but I’ve heard it’s not particularly memory efficient so that may not be enough. e.g., a non-sparse representation of 100k cells x 22k genes would be ~ 130Gb
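
The arithmetic behind that estimate can be checked in a few lines. This is a back-of-the-envelope sketch only: it assumes one dense matrix of 8-byte double-precision values and ignores any object overhead or extra copies a pipeline may hold.

```python
n_cells = 100_000
n_genes = 22_000
bytes_per_value = 8   # assumption: double-precision (64-bit) values

total_bytes = n_cells * n_genes * bytes_per_value

print(f"{total_bytes / 1e9:.1f} GB")            # 17.6 GB  (decimal gigabytes)
print(f"{total_bytes / 2**30:.1f} GiB")         # 16.4 GiB (binary gigabytes)
print(f"{total_bytes * 8 / 2**30:.0f} Gibit")   # 131 Gibit (binary gigabits)
```

Under these assumptions the "~130Gb" figure lines up with gigabits, i.e. roughly 16 to 18 gigabytes for a single dense copy. A workflow that keeps several dense matrices of this shape in memory at once would need a multiple of that.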

Tim Triche (11:56:15): > goodness gracious, scanpy would blow through that in a matter of minutes

Alan O’C (12:09:00): > I’m just basing that on what people have said to me so take with appropriate salt levels. EG a colleague last year couldn’t find a way to run a Seurat pipeline with the 1.2M mouse brain dataset on our cluster - scanpy and Bioc equivalents were more memory efficient

sani (13:14:34): > It seems the problem was memory. Thanks @Alan O’C!

2020-08-04

sani (18:20:24): > I have tried to run RunUMAP but it gives me this error. I was wondering if anyone knows how I can install the required package? > s = RunUMAP(s, dims = 1:30, verbose = FALSE) > Error in py_get_attr_impl(x, name, silent) : > AttributeError: module 'umap' has no attribute 'UMAP'

Peter Hickey (18:30:57): > is this using Seurat? Seurat is not a Bioconductor package, so you’ll probably have more luck asking the Seurat developers for help

sani (18:35:28) (in thread): > Yes it is Seurat, may I know where?

Peter Hickey (19:30:38) (in thread): > i don’t use it. i’m guessing https://github.com/satijalab/seurat/issues

2020-08-05

Hans-Rudolf Hotz (03:23:25): > @Hans-Rudolf Hotz has joined the channel

Frederick Tan (13:39:32): > @Frederick Tan has joined the channel

Aaron Lun (13:42:02): > @Aaron Lun has joined the channel

Aaron Lun (13:42:13): > my god, this is where all the questions are.

Aaron Lun (13:42:45) (in thread): > I don’t think it’s necessary. Probably doesn’t hurt, but I don’t think it’s necessary.

Aaron Lun (13:49:05) (in thread): > Sounds like what Sten was talking about in 2018. I was like meh, whatever floats his boat.

Elana Fertig (13:50:42): > @Elana Fertig has joined the channel

Aaron Lun (13:53:33) (in thread): > The practical issue is that, if you use a fixed radius for linking cells, the size of your graph may be n^2 rather than linear to the number of cells (as is the case for existing NN graphs). Which results in larger graphs and slower clustering. I don’t care enough about theoretical principles to be able to swallow a speed drop.
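
A toy illustration of the scaling argument above, assuming a directed k-nearest-neighbour graph and a worst case in which all cells sit in one tight cluster, so every pair falls within the fixed radius. The brute-force functions are hypothetical helpers written for clarity, not how any clustering package builds its graphs.

```python
import math
import random

def knn_edge_count(points, k):
    """Directed kNN graph: each point contributes edges to its k closest other points."""
    count = 0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        count += len(dists[:k])
    return count

def radius_edge_count(points, radius):
    """Fixed-radius graph: one undirected edge per pair closer than `radius`."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= radius:
                count += 1
    return count

random.seed(0)
# All points lie in a 0.1 x 0.1 square (diagonal about 0.14), so a radius of 0.2
# links every pair: the dense case where a fixed radius hurts most.
for n in (50, 100, 200):
    pts = [(random.uniform(0, 0.1), random.uniform(0, 0.1)) for _ in range(n)]
    print(n, knn_edge_count(pts, k=10), radius_edge_count(pts, radius=0.2))
# 50 500 1225
# 100 1000 4950
# 200 2000 19900
```

Doubling the number of cells doubles the kNN edge count but roughly quadruples the fixed-radius edge count, which is the n^2 growth mentioned above. Real data are rarely one uniform blob, so the actual cost depends on the radius and on local density.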

Aaron Lun (13:54:00) (in thread): > I had a link to a few documents where I think this process through, but I don’t remember all of it.

Aaron Lun (14:09:36) (in thread): > yeah, enjoy debugging pipelines with inherently mutable objects

2020-08-09

Tim Triche (11:55:11) (in thread): > lots of trade-offs – I never said that I liked the way scanpy mutates things in place without a trace :slightly_smiling_face:

2020-08-10

Bharati Mehani (00:31:06): > Hello everyone, I am having trouble with AverageExpression() in Seurat. I am trying to calculate the average expression using > cluster.averages <- AverageExpression(test) > and referring to the RNA values to export its raw counts, but getting “Inf” as its value for most of the genes. > > Example: > > cluster.averages$RNA['EGFR','Tumor Cells'] > [1] Inf > cluster.averages$RNA['SEC61G','macrophage'] > [1] Inf > cluster.averages$RNA['SPP1',] > T cell macrophage microglia Oligodendrocyte Tumor Cells > SPP1 Inf Inf Inf Inf Inf > > Can any of you please explain why I am getting “Inf” instead of an average count for some of the genes in the RNA method? How can I correct this if I am doing something wrong? > > Thanks in advance!

Dan Bunis (03:07:27): > I would expect some improper transformation put a bunch of Infs into your data. Inf / #cells will always yield Inf. > > However, Seurat is not a Bioconductor package and its creators/maintainers are not here. This question is better directed as an issue on their Github, as a question on Stack Overflow, or something else of similar sort.
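
The point about Inf propagating through an average can be seen with plain floating-point arithmetic. This is a language-agnostic illustration in Python with made-up values; it says nothing about where the infinite values in the Seurat object came from.

```python
import math

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-cell values for one gene; a single infinite entry is enough.
expression = [0.0, 1.2, float("inf"), 3.4]
print(mean(expression))   # inf

# Locating the offending entries is usually the first debugging step.
non_finite = [i for i, v in enumerate(expression) if not math.isfinite(v)]
print(non_finite)         # [2]
```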

Jared Andrews (09:09:51): > Or head over to Biostars and ask there if you want some folks well acquainted with single cell sequencing to take a look.

Lucy (09:35:47) (in thread): > Yes I would highly recommend Biostars

Nur-Taz Rahman (10:15:44): > Are you using NormalizeData -> FindVariableFeatures -> Scaledata or sctransform?

Tim Triche (10:19:17): > set the channel topic: Please note: Seurat is not a bioconductor package and its authors are not here

2020-08-11

sani (12:24:07): > I was wondering if anyone knows why, when I ran dittoBarPlot 2 times, I got two different numbers of clusters for the same data?

Dan Bunis (12:56:05): > That shouldn’t happen.

Dan Bunis (12:57:59) (in thread): > Was it with the same code both times? Is it reproducible? > > This would be a bug if true that I’d want to eliminate.

sani (13:39:53) (in thread): > Yes, same code once I got 39 clusters and second time 44 clusters

Dan Bunis (13:44:07) (in thread): > That sounds upstream of dittoSeq to me. > > dittoBarPlot doesn’t perform the clustering. It just summarizes the metadata that you point it to. Might that metadata be changing between runs? > > Perhaps you could send the chunk of code that you are running from dittoBarPlot call #1 to dittoBarPlot call #2?

sani (13:51:15) (in thread): > The umaps are the same.

sani (13:52:02) (in thread): > I ran it once and I ran it the second… I will try it again. Because the umap looks the same.

Dan Bunis (13:52:30) (in thread): > The umaps are not relevant to this issue. > > I’ll add that if you are providing var = "ident" when object is a Seurat, Idents(object) are used. So in this case, if any code between your dittoBarPlot calls changes the internally stored clustering, the changing results would be correct & expected.

Bharati Mehani (17:04:55) (in thread): > Also try to use same seed

Bharati Mehani (17:05:13) (in thread): > Thanks

sani (17:05:47) (in thread): > May I know what you mean by seed?

Bharati Mehani (17:08:58) (in thread): > I am first using sctransform on the raw read counts, followed by SelectIntegrationFeatures > PrepSCTIntegration > FindIntegrationAnchors > IntegrateData

Bharati Mehani (18:12:12) (in thread): > There is set.seed() in R; it is used to reproduce the exact results.

Bharati Mehani (18:13:15) (in thread): > Especially while clustering and performing PCA or UMAP

Dan Bunis (18:13:35) (in thread): > A bit more info: Random numbers aren’t actually random and are actually generated from a precompiled set of near-random numbers. set.seed() can be used to pick a set point within that set of numbers to start from at the time of its call. This can then ensure that when random numbers are used, they will be reliably the same.
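
The same idea shown with Python's standard library, as an analogue of R's set.seed() rather than R itself. In practice the "near-random numbers" come from a deterministic algorithm whose starting state is fixed by the seed, so the same seed gives the same sequence. The seed values below are arbitrary.

```python
import random

random.seed(42)                       # fix the generator's starting state
first = [random.random() for _ in range(3)]

random.seed(42)                       # same seed again
second = [random.random() for _ in range(3)]

random.seed(7)                        # a different seed
third = [random.random() for _ in range(3)]

print(first == second)   # True: identical seed, identical "random" draws
print(first == third)    # False: different seed, different draws
```

This is why stochastic steps such as clustering or UMAP can differ between runs unless a seed is fixed, either explicitly or through a function's own default seed argument.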

Dan Bunis (18:17:17) (in thread): > But I think Seurat’s relevant functions all have a seed.use or random.seed input, each with default values. So good call@Bharati Mehani, but I doubt that this is the issue here.

sani (19:13:47) (in thread): > Thanks… But the umaps are same… I mean I can’t see any differences.

2020-08-12

Nur-Taz Rahman (01:39:19) (in thread): > You might need to get your RNA expression data from the $SCT assay. See if you have this slot in the metadata.

2020-08-13

Loyal (14:25:03): > @Loyal has joined the channel

2020-08-14

Roye Rozov (04:44:00): > @Roye Rozov has joined the channel

Kasper D. Hansen (05:28:09): > @Kasper D. Hansen has joined the channel

Shijie C. Zheng (15:06:20): > @Shijie C. Zheng has joined the channel

2020-08-17

sani (18:53:02) (in thread): > I still have a problem: while the umap shows I have 44 clusters, the dittoBarPlot shows only 39 clusters.

Dan Bunis (18:54:32) (in thread): > I would ask again for you to share the code you are using.

sani (18:55:01) (in thread): > Sure, it is > dittoBarPlot(seur.immune, var = "genotype_diet", group.by = "RNA_snn_res.1.5")

Dan Bunis (18:56:10) (in thread): > Specifically, your UMAPPlot() (as I’m assuming this is what you are using for the umap plot? otherwise, whatever other Seurat plotter) and dittoBarPlot() code

sani (18:57:21) (in thread): > DimPlot(seur.immune, groupby=“cluster”, label=TRUE)

Dan Bunis (19:04:08) (in thread): > Right, okay. So here’s what I suspect: > > There is a potential disconnect here in that for the Seurat::DimPlot code, by providing group.by="cluster" you are either allowing Seurat to obtain its internally stored clustering, or you are grabbing data from a “cluster” metadata. > > For your dittoBarPlot code, you are specifying group.by = "RNA_snn_res.1.5" which is a metadata that holds the clustering from when res was set to 1.5. BUT if this resolution is not what Seurat is currently considering to be your clusters, then the DimPlot code could be expected to show something different.

sani (19:07:23) (in thread): > I use cluster to get number of clusters and make sure the total number of clusters I have. Also to see which bar chart of the dittoBarPlot is correct and compare the distribution of the data by each cluster.

Dan Bunis (19:09:28) (in thread): > Sure, but what I’m saying is that this is not linked on the dittoBarPlot side. To have dittoBarPlot automatically utilize the cluster data, you need to give group.by = "ident" to dittoBarPlot

Dan Bunis (19:10:58) (in thread): > With your code above, dittoBarPlot is looking to “RNA_snn_res.1.5”, regardless of whether or not that is the same as your clusters.

sani (19:13:32) (in thread): > It works! Why does this happen?

sani (19:14:08) (in thread): > Also, may I know how I can reorder the numbers? I checked but I don’t know which value I should assign.

Dan Bunis (19:15:44) (in thread): > I would guess that you either changed your clustering manually, Idents(object) <- "name_of_any_discrete_metadata", or else you ran multiple clustering resolutions and 1.5 was not the last one that you ran, because whatever resolution was last would be the stored one.

Dan Bunis (19:17:39) (in thread): > For reordering, dittoBarPlot input for that would be x.reorder. See ?dittoBarPlot.

sani (19:18:06) (in thread): > Yes but what should I give as a value. Should I give a vector from 1 to 44, in the order?

Dan Bunis (19:21:16) (in thread): > The data is being ordered alphabetically, which is default for ggplot even if it’s not what we necessarily want. Anyway, it looks a bit weird, but try order(order(as.character(1:44))).

sani (19:24:49) (in thread): > dittoBarPlot(seur.immune, var = "genotype_diet", group.by = "ident", order(order(as.character(1:44)))) > Error in match.arg(scale) : 'arg' must be NULL or a character vector

Dan Bunis (19:25:26) (in thread): > You didn’t specify the input name.

Dan Bunis (19:25:33) (in thread): > x.reorder =

sani (19:25:49) (in thread): > I did this one it said unused argument

sani (19:26:28) (in thread): > dittoBarPlot(seur.immune, var="genotype_diet", group.by="ident", x.order=order(order(as.character(1:44)))) > Error in dittoBarPlot(seur.immune, var = "genotype_diet", group.by = "ident", :  >  unused argument (x.order = order(order(as.character(1:44))))

sani (19:26:36) (in thread): > I made a mistake

Dan Bunis (19:26:47) (in thread): > x.**re**order

sani (19:28:06) (in thread): > dittoBarPlot(seur.immune, var="genotype_diet", group.by="ident", x.reorder=order(order(as.character(1:44)))) > Error in .rename_and_or_reorder(data$grouping, x.reorder, x.labels) :  >  incorrect number of indices provided to ‘reorder’ input

Dan Bunis (19:29:11) (in thread): > 0:44 then

sani (19:31:39) (in thread): > Thanks! It works and reordered
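
A small base R sketch of why `order(order(as.character(0:44)))` is the value that worked in the thread above. It assumes, as Dan describes, that the default x-axis order is the alphabetical order of the cluster labels and that `x.reorder` takes positions within that default order. The variable names are illustrative only and are not part of dittoSeq.

```r
# Cluster labels as Seurat names them: 0, 1, ..., 44. That is 45 clusters,
# which is why a vector built from 1:44 had the wrong length.
labels <- as.character(0:44)

# Default plotting order: alphabetical, so "10" sorts before "2".
default_order <- sort(labels)
head(default_order, 5)

# order(order(x)) gives the rank of each label in the alphabetical sort,
# i.e. the default position of cluster 0, then cluster 1, then cluster 2, ...
x_reorder <- order(order(labels))

# Indexing the default order by those positions recovers numeric order.
reordered <- default_order[x_reorder]
stopifnot(identical(reordered, labels))
```

The same pattern should apply to any set of numeric cluster names, as long as the vector passed to `as.character()` covers every cluster exactly once.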

2020-08-18

Daniel Baker (09:00:54): > @Daniel Baker has joined the channel

Will Macnair (09:08:36): > @Will Macnair has joined the channel

Stephany Orjuela (09:14:58): > @Stephany Orjuela has joined the channel

2020-08-19

Stephanie Hicks (09:32:33): > @Kevin Blighe i haven’t read it, but i saw this come across my feed in Jan https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 - Attachment (Genome Biology): A benchmark of batch-effect correction methods for single-cell RNA sequencing data > Large-scale single-cell transcriptomic datasets generated using different technologies contain batch-specific systematic variations that present a challenge to batch-effect removal and data integration. With continued growth expected in scRNA-seq data, achieving effective batch integration with available computational resources is crucial. Here, we perform an in-depth benchmark study on available batch correction methods to determine the most suitable method for batch-effect removal. We compare 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity. Five scenarios are designed for the study: identical cell types with different technologies, non-identical cell types, multiple batches, big data, and simulated data. Performance is evaluated using four benchmarking metrics including kBET, LISI, ASW, and ARI. We also investigate the use of batch-corrected data to study differential gene expression. Based on our results, Harmony, LIGER, and Seurat 3 are the recommended methods for batch integration. Due to its significantly shorter runtime, Harmony is recommended as the first method to try, with the other methods as viable alternatives.

Tim Triche (10:16:54): > see also @Kasper D. Hansen’s recent post on batch effect correction in rna velocity estimation

Tim Triche (10:17:29): > Harmony works pretty well in our hands for large integration exercises and has a fairly principled approach to how it does things

Kasper D. Hansen (10:18:11): > I think it’s fair to say there are a ton of solutions. But probably not a single universally accepted one.

Tim Triche (10:18:11): > it will be interesting to see how approaches like that are (or are not) used in the context of spliced/unspliced matrices whose relationship is unclear

Tim Triche (10:18:27): > I like that you reached for ComBat first

Tim Triche (10:18:40): > simplest thing that could possibly work, at least to a first approximation

Tim Triche (10:19:30): > validating approaches that seek to preserve/correct inputs to velocity estimates, etc. will be interesting

Aaron Lun (11:14:12): > it’s a shame that the harmony folks abandoned their Bioconductor submission. I am reluctant to consider random GitHub packages as production-ready.

Kasper D. Hansen (11:46:24): > For sure

Aaron Lun (11:51:50): > In general, this problem is pretty messy. The diversity of methods reflects the differences in assumptions to deal with the lack of actual information. > > When I wrote fastMNN, I threw principles to the wind and just wrote the fastest method that would give the nicest looking t-SNE.

Kasper D. Hansen (11:59:23): > I think that part is pretty well understood and I think that observation has been made. Not to say that there are not enough methods for a review

Kasper D. Hansen (12:02:04): > I do think however that some people say that clustering by batch on a tSNE/UMAP/PCA plot is identical to batch effects, as in lack of clustering => batch effects have been fully and completely removed

Aaron Lun (12:03:36): > there’s an easy way to game that system. Just mush cells from all samples into a giant ball.

Tim Triche (12:11:20): > @Aaron Lun hey, don’t give away the TCGA batch effect amelioration “strategy,” the Big Ball of Mud approach worked for a decade

Aaron Lun (12:12:00): > just move all cells to the origin, no batch effects there.

Tim Triche (12:12:06): > :100:
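
The point above, that a batch-mixing score alone can be gamed, can be illustrated with a toy base R simulation. Everything below is a hypothetical sketch: the two simulated cell types and batches, the helper `knn_same_fraction()`, and the choice of k = 10 are assumptions for illustration, not a published metric such as kBET or LISI.

```r
set.seed(42)
n_per <- 50  # cells per (batch, cell type) combination

# Two cell types separated along dimension 1, two batches shifted along dimension 2.
meta <- expand.grid(cell = seq_len(n_per), type = c("A", "B"), batch = c("b1", "b2"))
coords <- cbind(
  ifelse(meta$type == "A", 0, 10) + rnorm(nrow(meta), sd = 0.5),
  ifelse(meta$batch == "b1", 0, 3) + rnorm(nrow(meta), sd = 0.5)
)

# For each cell, the fraction of its k nearest neighbours sharing its label,
# averaged over all cells.
knn_same_fraction <- function(x, label, k = 10) {
  d <- as.matrix(dist(x))
  diag(d) <- Inf  # a cell is not its own neighbour
  mean(vapply(seq_len(nrow(d)), function(i) {
    nn <- order(d[i, ])[seq_len(k)]
    mean(label[nn] == label[i])
  }, numeric(1)))
}

# "Correction" that moves every cell to the origin (tiny jitter to break ties).
collapsed <- matrix(rnorm(length(coords), sd = 1e-6), ncol = 2)

scores <- rbind(
  original  = c(same_batch = knn_same_fraction(coords, meta$batch),
                same_type  = knn_same_fraction(coords, meta$type)),
  collapsed = c(same_batch = knn_same_fraction(collapsed, meta$batch),
                same_type  = knn_same_fraction(collapsed, meta$type))
)
print(round(scores, 2))
```

A lower `same_batch` fraction means better batch mixing. The collapsed coordinates look ideal on that column, but their `same_type` fraction falls to roughly chance as well, so any check of batch removal needs a companion check that biological structure survived.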

Yi Wang (12:25:20): > @Yi Wang has joined the channel

2020-08-20

Vince Carey (09:28:52): > Superficial question: does anyone know of putative signatures for T-reg cells?

Vince Carey (09:32:32): > The real problem is to have a deconvolution procedure that can be applied to whole blood bulk rna-seq that will help discriminate T-reg abundance.

Nur-Taz Rahman (09:46:37): > If you are looking for annotation markers, I have used these for a collaboration: CD3E+ CD4+ FoxP3+. Did help us distinguish a small cluster in scrnaseq

Tim Triche (10:06:16): > @Vince Carey Haniffa & Teichmann investigated this with PB validation in https://linkinghub.elsevier.com/retrieve/pii/S1074761319300019

Tim Triche (10:11:42): > CD4+ FOXP3+ IL17- is canonical and discriminates reasonably well from Th17 cells, but Teichmann’s signature is probably more useful given they validated it in blood

Tim Triche (10:11:52): > better link for @Vince Carey: https://www.cell.com/immunity/fulltext/S1074-7613(19)30001-9

Vince Carey (11:28:57): > thank you all!

Dan Bunis (12:05:27): > CD25/IL2RA gene is another canonical marker used for FACS sorting of Tregs. > And if it helps, a full differential expression signature from bulk RNAseq comparison of human fetal & adult Tregs to adult naive T cells is in the supplement of this paper: https://pubmed.ncbi.nlm.nih.gov/31757834/

Jared Andrews (12:06:47): > For single cell, FOXP3 has been the best marker in my hands.

Daniel Baker (13:44:42): > @Daniel Baker has left the channel

2020-08-24

Jose Alquicira (07:55:44): > @Jose Alquicira has joined the channel

Nur-Taz Rahman (10:12:39): > Is there any tool that is tailored for gene set enrichment or pathway analysis for single cell transcriptomics?

Nur-Taz Rahman (10:14:14): > I have come across iDEA - does anybody have experience with it?

Ludwig Geistlinger (10:32:08) (in thread): > Did you check AUCell, MAST, or slalom yet (those are Bioc packages)?

Nur-Taz Rahman (10:53:14) (in thread): > I will check them out. Thank-you! Hope you are doing well.

Ludwig Geistlinger (10:58:23) (in thread): > Great. I’d also read about iDEA, but I didn’t try it out yet. They have an R implementation (accompanying the manuscript) but it’s neither on CRAN nor Bioconductor, so I’d expect getting it to run might be a bit more difficult. Would be interested in your experience though if you give it a try.

Jared Andrews (11:08:35) (in thread): > New bioconductor package coming in the next release: http://bioconductor.org/packages/devel/bioc/html/escape.html

Aaron Lun (11:19:36) (in thread): > 227 dependencies! Holy crap.

Aaron Lun (11:19:46) (in thread): > Ah, one sees why.

Jared Andrews (11:23:21) (in thread): > Yes, working on that.

Jared Andrews (11:26:00) (in thread): > @Nick Borcherding We should get Seurat out of the dependencies list if possible. Can get away with moving it to Suggests for the vignette. Or we could change the example to use any of the datasets from the scRNAseq package.

Nick Borcherding (11:26:05): > @Nick Borcherding has joined the channel

Nur-Taz Rahman (11:38:02) (in thread): > @Ludwig Geistlinger I will let you know how it goes.

2020-08-28

sani (21:46:44): > Hi, happy Friday, I was wondering if anyone knows whether it is possible to visualize the interaction gene lists.

sani (21:57:01) (in thread): > Not specifically. Any suggestions in R programming?

sani (22:55:18) (in thread): > I got the list of common genes from a couple of datasets and now I want to visualize them with something like a Venn diagram. Are there any tools that can plot this?

2020-08-29

sani (06:46:12) (in thread): > Thanks but I was looking for something in R.

2020-08-30

Alan O’C (05:05:29) (in thread): > I like UpSetR for visualising set interactions https://github.com/hms-dbmi/UpSetR
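
Before plotting, the overlap structure that a Venn diagram or an UpSet plot displays can be tabulated in base R. This is a minimal sketch with three made-up gene lists; the list names `a`, `b`, `c` and their contents are hypothetical and only illustrate the bookkeeping, not any dataset from the thread.

```r
# Hypothetical gene lists from three datasets.
gene_lists <- list(
  a = c("CD3E", "CD4", "FOXP3"),
  b = c("CD4", "FOXP3", "IL2RA"),
  c = c("FOXP3", "CD8A", "CD4")
)

# Genes common to every list.
common <- Reduce(intersect, gene_lists)

# Logical membership matrix: one row per gene, one column per list.
all_genes <- sort(unique(unlist(gene_lists)))
membership <- sapply(gene_lists, function(g) all_genes %in% g)
rownames(membership) <- all_genes

# Label each gene by the exact combination of lists containing it, then count
# genes per combination. These counts are the bars of an UpSet plot.
pattern <- apply(membership, 1, function(r) paste(names(gene_lists)[r], collapse = "&"))
table(pattern)
```

The 0/1 version of `membership` (for example `membership + 0`) is the kind of presence/absence table that set-visualisation tools generally expect as input.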

2020-08-31

Jared Andrews (15:19:06): > This channel is specifically for Bioconductor-related single cell questions. You are likely to get quicker help for general R questions from biostars or stackoverflow.

2020-09-06

Bob Policastro (08:36:28): > @Bob Policastro has joined the channel

2020-09-08

Elana Fertig (09:13:59): > Question for folks here - let’s say you want to compute gene gene correlations in single cell. It would seem a negative binomial model that accounts for the dropouts would be the best option, but I haven’t seen anything that accounts for the negative binomial nature of the covariate. I’ve been doing this through scale factor normalization and imputation but I’m not sure if someone has a better solution and suggestions for specific packages that implement this.

Bob Policastro (09:24:35): > The Seurat SCTransform library normalizes UMI counts based on a negative binomial regression.

Bob Policastro (09:25:04): > Perhaps that can be of use.

Alan O’C (09:47:13): > Is sctransform used for normalisation proper? The last Seurat integration paper implies that they use sctransform for HVG selection but still use log-normalisation. > > glm-pca is a similar approach, I’ve not seen an in-depth comparison yet

Tim Triche (09:55:14): > Isn’t glm-PCA multinomial?

Alan O’C (09:57:33): > Based on a multinomial model, but using Poisson as an approximation of the marginal for multinomial and negative binomial as an approximation of the marginal of Dirichlet-Multinomial

Alan O’C (09:59:23) (in thread): > This review by Will is a really nice summary of the concepts, though they’re described in the glmpca paper too https://arxiv.org/pdf/2001.04343v1.pdf

Elana Fertig (10:13:35): > I’ve seen a lot of the transforms and modeling

Elana Fertig (10:13:44): > but this is a little bit of a different question because it’s the gene gene correlation

Elana Fertig (10:14:07): > to my understanding most of the regression models still assume the covariates are normal or factors

Elana Fertig (10:14:19): > but if you have a gene itself it’s negative binomial

Elana Fertig (10:14:25): > so I’m not sure if that’s been considered

Tim Triche (10:45:10): > What’s the inverse CDF to map a negbin to Gaussian? Just use that

Tim Triche (10:45:30): > Bonus: no more annoying marginals

Tim Triche (10:47:02): > (It’s not entirely clear that zero inflation matters but if so, interpolating into a phi coefficient near the limits of detection would help)
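
One reading of the suggestion above is to push each gene's counts through a fitted negative binomial CDF and then through the standard normal quantile function, and to correlate the resulting Gaussian scores. The base R sketch below makes several assumptions that are not in the thread: counts are simulated, the NB parameters come from a crude method-of-moments fit (`fit_nb()` is a hypothetical helper), no size factors or zero inflation are modelled, and a mid-point of the discrete CDF is used so that no probability lands exactly on 0 or 1.

```r
set.seed(1)
n_cells <- 500

# Simulated counts for two genes that share a latent per-cell activity.
activity <- rgamma(n_cells, shape = 2, rate = 2)
gene_a <- rnbinom(n_cells, size = 2, mu = 5 * activity)
gene_b <- rnbinom(n_cells, size = 2, mu = 3 * activity)

# Method-of-moments NB fit (hypothetical helper; a real analysis would more
# likely use a GLM fit with per-cell size factors).
fit_nb <- function(x) {
  mu <- mean(x)
  v <- var(x)
  size <- if (v > mu) mu^2 / (v - mu) else 1e6  # close to Poisson if not overdispersed
  list(mu = mu, size = size)
}

# Map counts to Gaussian scores: NB CDF followed by the normal quantile function.
# Averaging P(X < x) and P(X <= x) keeps probabilities strictly inside (0, 1).
nb_to_gaussian <- function(x) {
  p <- fit_nb(x)
  upper <- pnbinom(x, size = p$size, mu = p$mu)
  lower <- pnbinom(x - 1, size = p$size, mu = p$mu)
  qnorm((lower + upper) / 2)
}

z_a <- nb_to_gaussian(gene_a)
z_b <- nb_to_gaussian(gene_b)
cor(z_a, z_b)
```

Because counts are discrete, all cells with the same count receive the same score, so genes dominated by zeros will still produce heavily tied values. That is where the caveat about zero inflation in the last message would apply.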

2020-09-11

Talha (00:23:47): > @Talha has joined the channel

Bharati Mehani (12:22:34): > Hi, I am new to single cell analysis and want to do reference based clustering. It is human cancer data generated from 10 human cancers by 10x chemistry. I am using Seurat v3 to integrate all these samples with SCTransform. I have a list of genes specific to tumor cells and TME, but I am not sure how I can integrate it and perform reference based clustering. Can anybody please share any workflow for the same? > Thanks in adavance.

Bharati Mehani (12:22:44): > Advance*

Dan Bunis (12:32:03): > if by “reference-based clustering” you might mean “reference-based cell type annotation”, then check out SingleR. (If you use it, you’d want to provide counts from the “RNA” assay rather than from your “integrated” or “SCT” assays.)

Bharati Mehani (13:50:15): > Hi @Dan Bunis I used singleR and scCATCH to annotate my clusters and could predict the cell type annotation for only 6 out of 17 clusters. I am worried about what the remaining 11 clusters are. Thus I wanted to cluster them in a supervised manner if it is possible.

Bharati Mehani (13:52:20): > This could be a naive question. Sorry about that.

Aaron Lun (13:53:22): > SingleR will give an assignment for every cell in every cluster, so you will have to be clearer about how you decided to ignore 11 of the 17 classifications.

Bharati Mehani (19:49:33): > Hi @Aaron Lun Can singleR assign a label to each cell? i only know it can annotate clusters.

Aaron Lun (19:50:01): > sure it can, check out the book: https://ltla.github.io/SingleRBook/ - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Bharati Mehani (21:35:45): > Thanks for sharing this. Will check it for sure.

2020-09-12

Aaron Lun (13:47:13): > Anyone know of a CITE-seq study where they used the ADTs for something more interesting than surface markers?

Aaron Lun (13:47:42): > e.g., conjugation to influenza peptides to measure MHC binding, etc.

Tim Triche (17:18:52) (in thread): > https://www.nature.com/articles/s41467-020-15710-1 - Attachment (Nature Communications): High throughput pMHC-I tetramer library production using chaperone-mediated peptide exchange > Peptide-MHC (pMHC) tetramers are important tools for probing T cell repertoire and adaptive immune responses. Here the authors use a molecular chaperone, TAPBPR, to develop a high-throughput, multiplexible platform for pMHC tetramer generation to facilitate simultaneous assessments of T cell repertoire/antigen specificity and transcriptome.

Aaron Lun (17:19:20) (in thread): > excellent, thanks.

Tim Triche (17:19:35) (in thread): > technically ECCITE but w/e

Aaron Lun (17:21:41) (in thread): > ah, looks like it doesn’t really have gene expression data.

2020-09-14

Tim Triche (14:00:13) (in thread): > good point, that is somewhat of an issue :disappointed:

2020-09-23

Peter Hickey (02:11:38): > anyone seen DE analysis of CITE-seq markers?

Aaron Lun (02:12:31): > can’t imagine it’s all that hard, once you figure out the normalization.

Peter Hickey (02:13:27): > i guess that’s mostly what i’m asking about. some set of ‘control’ markers needed you think?

Aaron Lun (02:14:05): > does the book say anything about this? I forget.

Peter Hickey (02:16:01): > https://osca.bioconductor.org/integrating-with-protein-abundance.html#by-differential-testing? - Attachment (osca.bioconductor.org): Chapter 18 Integrating with protein abundance | Orchestrating Single-Cell Analysis with Bioconductor > Online companion to ‘Orchestrating Single-Cell Analysis with Bioconductor’ manuscript by the Bioconductor team.

Aaron Lun (02:26:17): > is that what you were looking for? It’s just the usual stuff on the log-abundance values.

Aaron Lun (02:27:32): > i’ve been trying to find a better example for that.

Peter Hickey (02:27:33): > i guess so (was thinking more count-based models, but that can be done from what i remember) > counts are counts, right

Aaron Lun (02:27:42): > damn straight.

Aaron Lun (02:28:01): > the tricky part is the norm, but if you can do that, it’s smooth sailing from there on in.

Peter Hickey (02:28:30): > yeah i don’t know how competition affects composition and how much that matters

Aaron Lun (02:34:04): > I think the chapter also has some advice on that.

Peter Hickey (02:38:00): > cheers, it does

Aaron Lun (02:41:32): > Oh good, can’t remember whether it was dev or release.

2020-09-25

Anna Liza Kretzschmar (22:56:13): > @Anna Liza Kretzschmar has joined the channel

2020-09-28

Aedin Culhane (08:43:47) (in thread): > PCA variance shows the max distance between cells/features, not necessarily correlation. Correspondence analysis (chi sq transform) shows strength of association between genes/cells and is often a better metric. Legendre & Legendre have a great table that shows the difference between transforms in dim red space. I’ll see if I can grab a screen shot

Aedin Culhane (09:17:40) (in thread): > We have compared these, yet they are similar.@Lauren Hsu

Aedin Culhane (09:18:02) (in thread): > Not in practise. They use poisson

Aedin Culhane (09:20:08): > I have demultiplexed fastq read pair data, that has HTO tags from the Broad. Struggling with Cell Ranger, not sure if I can use another approach… Anyone familiar with the Broad output, or getting from unmapped.1.fastq.gz, unmapped.2.fastq.gz pairs to counts?

Tim Triche (12:42:33) (in thread): > as a marginal approximation like for loglinear regression?

2020-09-29

Jared Andrews (13:32:17): > @Aaron Lun do you have any thoughts on integration w.r.t. pseudotime analyses? Reading https://osca.bioconductor.org/multi-sample-comparisons.html#sacrificing-differences and was curious. In such cases, is it appropriate not to integrate and just recognize that any resulting DE tests from tradeSeq, etc, may include batch/sample-specific differences? Or do you just live with the assumption that anything that’s truly a different lineage will still separate even after integration?

Jared Andrews (13:37:26): > Or anyone else.

Aaron Lun (13:39:32): > Probably not much beyond my thoughts for clustering. I think the assumptions/problems would be the same. Possibly a little worse, because it’s a bit harder to tell for these continuous things, especially if the integration “stretches” one batch or sample to fit the other. > > Probably the big question is: are you happy with cells from different batches/conditions being forced into the same trajectory? If that makes sense to do, then the subsequent DE analysis between samples should be able to recover any lost differences.

Jared Andrews (13:41:37): > Yeah, I suppose that’s the real question. I don’t know if my endpoints between conditions truly represent different lineages.

Jared Andrews (13:42:33): > Well, may as well try both ways and see how different the end results are. Thanks for the input.

Aaron Lun (13:43:57): > Indeed. I don’t know that I can give you a data-derived answer to that question, at least not without some constraints on the direction/magnitude of the non-biological effects.

Aaron Lun (13:44:31): > Fortunately, even if integration merges two different lineages, you should be able to pick up some differences based on the tradeSeq comparison between samples within each lineage

Aaron Lun (13:52:27): > I would guess that most merging algorithms are hyper-aggressive, so if you put in, e.g., a B cell lineage in one batch and a T cell lineage in another without anything else, they’d probably get stuck together. At least for fastMNN, this is by design so that people wouldn’t complain to me about their batches not merging.

2020-10-07

Stephanie Hicks (14:26:00): > https://twitter.com/m_hemberg/status/1313897724954251267 - Attachment (twitter): Attachment > #Brexodus news: I will join https://evergrande.hms.harvard.edu/home and @BrighamWomens in Q1 2021 after 6.5 great yrs at @sangerinstitute and @GurdonInstitute. Re-booting lab => post-doc positions available. If you are interested contact me at my http://sanger.ac.uk email (mh26)

2020-10-08

Batuhan Cakir (13:00:13): > @Batuhan Cakir has joined the channel

Koen Van den Berge (13:46:03): > @Koen Van den Berge has joined the channel

2020-10-09

Aedin Culhane (17:07:37) (in thread): > I use CITE-seq-Count for some of data prep. With HTO citeseq you have a large background population so can estimate this pretty significant expression. I used supervised PCA (bga) to assign tags

2020-10-16

Jared Andrews (11:35:50): > In iSEE, how can lasso select be used? I’ve seen it mentioned multiple places, but have yet to figure out how to actually do it after running through the tutorial and reading the vignette.

Aaron Lun (11:39:58): > basically just click anywhere on the plot. This will lay down a lasso waypoint. And then you keep on clicking, and the final click should be near the first one to close the lasso.

Aaron Lun (11:40:32): > It’s a bit clunky because we had to work around the lack of native shiny support for this. It’s on the list of things to clean up next release.

Jared Andrews (12:05:08): > Got it. Second q, is there a way to get cell barcodes/names out of said selection?

Aaron Lun (12:06:14): > Sure, just link a colDataTable to receive the selection from the panel where you made the lasso, and you’ll get a table with that subset. Then if you go to the top right and click the download-looking dropdown, you should be able to export that table.

Aaron Lun (12:06:30): > Hovering over will give you names for specific barcodes, but that’s probably not what you want.

Jared Andrews (12:07:25): > Ahh, okay, got it. Thanks.

2020-10-17

Kevin Blighe (08:25:23): > @Kevin Blighe has joined the channel

Alexander Toenges (19:49:14): > Towards https://support.bioconductor.org/p/134696/ where I think Gordon got me a bit wrong. Given we use alevin for quantification of 10x data and use its bootstrapping procedure to quantify mapping uncertainty: Is there a meaningful way of aggregating this information to the pseudobulk level together with the raw counts? Basically, when using alevin and then importing it with tximeta you get a SummarizedExperiment with the gene level counts, the mean and variance of the bootstrapping that alevin performs. Is it even possible to use this information on the pseudobulk level?

Aaron Lun (20:09:42): > Sounds like you could just pseudo-bulk your bootstrap replicates and use those to compute the mean and variance for downstream applications.

Aaron Lun (20:09:49): > Sounds like a bother, though.
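
The suggestion to pseudo-bulk each bootstrap replicate and then summarise across replicates can be sketched in base R as follows. The `boot` array here is simulated and hypothetical (genes x cells x replicates); it does not reproduce how alevin or tximeta actually store inferential replicates, and `sample_id` is an assumed cell-to-sample assignment.

```r
set.seed(1)
n_genes <- 5; n_cells <- 12; n_boot <- 20

# Assumed assignment of cells to samples (4 cells per sample).
sample_id <- factor(rep(c("s1", "s2", "s3"), each = 4))

# Hypothetical bootstrap replicates: genes x cells x replicates.
boot <- array(
  rpois(n_genes * n_cells * n_boot, lambda = 3),
  dim = c(n_genes, n_cells, n_boot),
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("cell", seq_len(n_cells)), NULL)
)

# Step 1: pseudo-bulk within each replicate by summing cells of the same sample.
pb_list <- lapply(seq_len(n_boot), function(b) {
  t(rowsum(t(boot[, , b]), group = sample_id))  # genes x samples
})
pb <- simplify2array(pb_list)  # genes x samples x replicates

# Step 2: mean and variance across replicates at the pseudo-bulk level.
pb_mean <- apply(pb, c(1, 2), mean)
pb_var  <- apply(pb, c(1, 2), var)
```

Summing first and summarising second matters: the variance of a sum across replicates is not in general the sum of the per-cell variances unless the per-cell bootstrap draws are independent, so the per-cell mean and variance alone would not be enough to reconstruct `pb_var`.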

Aaron Lun (23:45:29): > Man, reading the comments in https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1850-9 on fastMNN’s speed is a bit rough when their code converts everything into an ordinary matrix.

Aaron Lun (23:47:30): > I was going to have a look at their datasets to see if I could reproduce some of fastMNN’s failure states, but they exceeded the LFS bandwidth so GitHub isn’t giving me access. :(

2020-10-18

Aaron Lun (00:00:20): > Argh. Their megacontainer takes ages to mount, too.

Aaron Lun (00:21:56): > FYI, this is what I get for their dataset 6 with fastMNN. Quite different from the paper. - File (PNG): dataset6.png

Aaron Lun (00:22:20): > (Context: batch 1 is 293T, batch 2 is Jurkat, and batch 3 is a 50:50 of the two.)

Aaron Lun (00:22:41): > Code is as follows:
>
> library(scuttle)
> b1 <- readSparseCounts("b1_exprs.txt.gz")
> b2 <- readSparseCounts("b2_exprs.txt.gz")
> b3 <- readSparseCounts("b3_exprs.txt.gz")
>
> combined <- cbind(b1, b2, b3)
> sce <- SingleCellExperiment(list(counts=combined))
> sce$batch <- rep(1:3, c(ncol(b1), ncol(b2), ncol(b3)))
>
> # Standard analysis
> library(scran)
> qc <- perCellQCMetrics(sce, subsets=list(Mt=grepl("^MT-", rownames(sce))))
> out <- quickPerCellQC(qc, sub.fields=TRUE)
> sce <- sce[,!out$discard]
>
> sce <- logNormCounts(sce)
> dec <- modelGeneVar(sce, block=sce$batch)
> hvgs <- getTopHVGs(dec, n=5000)
>
> # Only real manual intervention needed here, to stop b1 and b2 being merged
> # first (as these are just batches of unique cell lines).
> set.seed(100)
> library(batchelor)
> merged <- fastMNN(sce, batch=sce$batch, subset.row=hvgs, merge.order=c(3,1,2))
>
> library(scater)
> merged <- runTSNE(merged, dimred="corrected")
>
> png("dataset6.png", units='in', height=8, width=8, res=150)
> plotTSNE(merged, colour_by="batch")
> dev.off()

Aaron Lun (00:24:33): > That’s made me grumpy all week. Going off to watch some calming anime now.

Peter Hickey (00:26:55): > did they take the auto merge?

Aaron Lun (00:28:03): > I do not think they did.

Aaron Lun (00:40:20): > I have already repro’d most of their pancreatic example in depth in http://bioconductor.org/books/devel/OSCA/merged-pancreas.html, sans the Xin dataset. - Attachment (bioconductor.org): Chapter 34 Merged human pancreas datasets | Orchestrating Single-Cell Analysis with Bioconductor > Or: how I learned to stop worrying and love the t-SNEs.

Aaron Lun (00:41:02): > And this is what I get from their HSC example, which is not too bad - File (PNG): dataset10.png

Aaron Lun (00:50:47): > alright, seriously, going off for some intense slice-of-life anime now

Aaron Lun (01:03:20): > Oh wait, forgot to run multiBatchNorm beforehand. And kicked up the merge aggressiveness for your viewing pleasure: - File (PNG): dataset10.png

Adele Barugahare (20:26:14): > @Adele Barugahare has joined the channel

2020-10-23

ImranF (11:40:57): > @ImranF has joined the channel

2020-11-02

RGentleman (16:07:41): > @RGentleman has joined the channel

2020-11-04

Regina Reynolds (15:56:28): > @Regina Reynolds has joined the channel

2020-11-07

Jonathan Griffiths (10:34:13): > @Jonathan Griffiths has joined the channel

2020-11-11

Joshua Shapiro (09:09:39): > @Joshua Shapiro has joined the channel

2020-11-13

brian capaldo (14:41:50): > @brian capaldo has joined the channel

2020-11-16

brian capaldo (15:12:53): > what would be the best way to use batchelor to integrate scRNA with scATAC? I have inferred gene expression data from my ATAC experiment, I’m just not sure how to best pass that into batchelor. I am fully aware that Seurat and Signac exist and do this already, but I have a very nice SingleCellExperiment object right now, and I’d rather not spend half my time trying to figure out what seurat clusters map to my single cell experiment clusters. If the answer is DON’T! then so be it

Alan O’C (15:17:48): > Do you mean you want to integrate the inferred gene expression with the ATACseq data that was used to generate it?

brian capaldo (15:19:25): > I want to integrate the inferred gene expression data from the ATAC seq with RNA seq

brian capaldo (15:19:45): > so I think yes

Aaron Lun (15:21:36): > I guess you could just stuff it in as a separate batch and hope for the best. I don’t have much experience with that so I don’t even know that’s a generally sensible thing to do, but you could give it a shot and report back.

Aaron Lun (15:21:50): > I mean, it can’t be any worse than integrating 10X with Smart-seq data.

brian capaldo (15:22:32): > well my question was where do I put the inferred gene expression data into batchelor?

brian capaldo (15:23:01): > cause it’s technically not normalized, but it’s not raw counts either

Aaron Lun (15:24:09): > I assume that the scATAC is not for the same cells, right?

Aaron Lun (15:24:16): > It’s just an entirely different expt

brian capaldo (15:24:25): > yep

Aaron Lun (15:25:33): > multiBatchNorm should work even if it’s not raw cell counts, provided your values are at least non-negative.

brian capaldo (15:25:48): > noted

Aaron Lun (15:26:07): > well, it shouldn’t fail horribly, at the very least

brian capaldo (15:26:20): > i’ll let you know if the hpc dies

2020-11-19

Pierre-Luc Germain (05:23:34): > @Pierre-Luc Germain has joined the channel

Kevin Blighe (08:30:46): > @Kevin Blighe has joined the channel

Pierre-Luc Germain (09:38:53): > Would anybody be interested in a Bioconductor docker image pre-loaded with single-cell analysis packages? > Some people (e.g. @Aaron Lun) already use images to speed up the automation of package build & check (i.e. github actions running much more rapidly than if you have to reinstall all your dependencies on a generic BioC image every time), and it seems a lot of people could rely on the same basic image for such purposes. I guess there could be other uses, like providing a comparable/reproducible environment for benchmarks, teaching, and such. > If you’d be interested in having that please raise a hand, and if you’d be interested in discussing how to do it (e.g. what to include) please answer in a thread!

2020-11-20

ImranF (00:24:49): > @ImranF has joined the channel

Jonathan Griffiths (04:24:43) (in thread): > Do you mean going beyond Bioc packages? I have found it very easy to work from the bioconductor-provided images and simply install the packages I needed - so I’m not sure what this would add except for saving some install time? (which is commendable, still)

Pierre-Luc Germain (05:27:17) (in thread): > not necessarily beyond BioC (though that’s a possibility), and yes install time is the chief motivation for me at least. I just started doing that for my packages to speed up github build/check actions, and thought it might save people time if we could all rely on the same image rather than each make (and maintain) our own. But maybe that’s not a genuine need.

James MacDonald (12:28:38): > Anybody have thoughts re biological replicates for scRNA-Seq? Are we still at the ‘too expensive, just do one’ stage, or are people (meaning granting bodies primarily) starting to expect some small amount of replication?

Aaron Lun (12:31:24): > I think everyone’s doing them now. It’s the only reason people would talk about datasets with >100k cells, you wouldn’t spend all that money on sequencing a single patient, for example.

Peter Hickey (16:24:56): > in our core we’re routinely doing n>=3

Dan Bunis (16:32:51): > Costs of including reps are coming down due to now fairly standardized hashing/genotype-based methods that allow multiple reps to be put into a single well. So I’d say we’re past the point of having a single replicate be sufficient. I’ll say that differently… I personally no longer see n=1 as acceptable.

James MacDonald (18:10:29): > Thanks for the feedback!

2020-11-22

Peter Hickey (02:54:16) (in thread): > Yep, for us, antibody-based and lipid-based hashing, along with genotype-based methods, really have been key.

2020-11-23

Jenny Drnevich (23:34:53): > @Jenny Drnevich has joined the channel

Jenny Drnevich (23:39:25): > Wasn’t sure whether I should post this here or on #random due to its general coolness, but I came across this great interactive site explaining UMAP and tSNE. My favorite is the 3D pixelated woolly mammoth skeleton reduced down to a 2D UMAP projection https://pair-code.github.io/understanding-umap/ - Attachment (pair-code.github.io): Understanding UMAP > UMAP is a new dimensionality reduction technique that offers increased speed and better preservation of global structure.

2020-12-11

Bharati Mehani (10:37:02): > Hello all, I am looking for a workflow where I can score cells based on a list of genes in order to assign a label to each cell. Can any of you share your experience, or a link to any such vignette or workflow?

Aaron Lun (11:25:34): > AUCell works fairly well.

Lucy (11:25:50): > Yes I would also recommend AUCell

2020-12-12

Huipeng Li (00:38:19): > @Huipeng Li has joined the channel

2020-12-13

Bharati Mehani (14:12:29): > Hi @Aaron Lun and @Lucy thanks to you both for your answer. I have a follow up question, does AUCell also accept a custom list of genes?

Aaron Lun (14:37:28): > yes, that’s sort of the point.

2020-12-14

Lucy (04:11:00): > :+1:

Bharati Mehani (08:11:30): > Thanks, will check it.
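
To make the idea of "scoring cells on a custom gene list, then labelling each cell" concrete, here is a deliberately simplified base R sketch. It is not AUCell's algorithm (which scores the area under a recovery curve over the top-ranked genes); it uses a plain mean of within-cell expression ranks as a stand-in. The expression matrix and the `tumor` and `tme` gene sets are simulated and hypothetical.

```r
set.seed(7)
genes <- paste0("g", 1:100)
cells <- paste0("c", 1:20)
expr <- matrix(rpois(100 * 20, lambda = 2), nrow = 100, dimnames = list(genes, cells))

# Custom (hypothetical) marker lists; cells c1-c10 over-express the first set,
# cells c11-c20 the second.
gene_sets <- list(tumor = paste0("g", 1:10), tme = paste0("g", 11:20))
expr[gene_sets$tumor, 1:10] <- expr[gene_sets$tumor, 1:10] + 10
expr[gene_sets$tme, 11:20] <- expr[gene_sets$tme, 11:20] + 10

# Within-cell ranks scaled to (0, 1]: values near 1 are that cell's top genes.
ranks <- apply(expr, 2, rank, ties.method = "average") / nrow(expr)

# Score = mean scaled rank of the set's genes in each cell (cells x sets).
scores <- sapply(gene_sets, function(gs) {
  colMeans(ranks[intersect(gs, rownames(ranks)), , drop = FALSE])
})

# Label each cell by its best-scoring set.
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
table(labels, truth = rep(c("tumor", "tme"), each = 10))
```

Rank-based scores of this kind depend only on the ordering of genes within a cell, which is one reason such approaches tend to be fairly insensitive to normalisation. A real analysis would also need a threshold for cells that match none of the sets, which this sketch omits.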

2020-12-21

Harithaa Anand (04:11:30): > @Harithaa Anand has joined the channel

2020-12-22

Nur-Taz Rahman (10:23:48): > Any seasoned bioinformaticians around for a conversation about finding SNPs in sc-RNAseq data? SCmut pipeline seems to need DNA-seq data for good performance, and now I’m exploring bcftools mpileup. Any obvious/new tools I’m missing? I don’t have DNA-seq data, sc or bulk.

Jared Andrews (10:26:29): > Data sparsity means you’re going to miss most. Definitely easier to use the scRNA-seq as verification/identifying mutant populations rather than calling directly on it. This paper has some nice info: https://www.nature.com/articles/s41467-019-11591-1

Stephanie Hicks (10:27:00) (in thread): > Tagging @Lukas Weber who knows a lot about this topic.

Lukas Weber (10:30:24) (in thread): > yes we have recently used bcftools and also cellSNP https://github.com/single-cell-genetics/cellSNP (from the authors of Vireo for genetic-based demultiplexing of samples: https://vireosnp.readthedocs.io/en/latest/)

Lukas Weber (10:30:36) (in thread): > if you have bulk RNA-seq samples, then bcftools works very well

Lukas Weber (10:30:50) (in thread): > alternatively, if you only have single-cell, then cellSNP is a great tool to try

Lukas Weber (10:31:31) (in thread): > however it requires some long runtimes

Nur-Taz Rahman (10:32:03): > I completely agree, @Jared Andrews. But collaborators want to exhaust all options. They are offering to do bulk exome sequencing on a different population of cells, which is great, but totally not sure how that would inform my pipeline for calling variants.

Jared Andrews (10:32:39): > It’d tell you variants to look for in your scRNA-seq data so that they could be tied to specific populations.

Jared Andrews (10:33:04): > Or whether they are pervasive throughout the sample.

Nur-Taz Rahman (10:47:04): > But this is patient data, with scRNA seq already showing us cancer vs non-cancer populations within each patient’s blood, and may be a “transition population”. Is it possible that potentially important/oncogenic mutations from the exome-seq will: > (1) be missed by scRNAseq data (because of sparsity)? > (2) be present in all populations of the scRNAseq data? > If the question I’m trying to answer is how is the “transition” population different from the cancer and non-cancer populations by way of SNPs, does exome seq still help?

Nur-Taz Rahman (10:50:27) (in thread): > Seems like cellSNP is the way to go. Thank-you!

Jared Andrews (11:13:50) (in thread): > 1. Yes, this very much depends on expression level, sequencing saturation, platform, etc. > 2. Also yes, but that would at least inform you that said mutation may be germline rather than somatic. > I imagine the exome-seq would still be helpful, as it’d at least give you a list of variants to look at more closely in the scRNA-seq data. I expect finding SNPs associated only with your “transition” population will be a difficult task, but you may be able to identify SNPs that arise in that population that are carried forward in your malignant population.

Nur-Taz Rahman (15:57:07) (in thread): > Thank-you, Jared! I’m very grateful to you for sharing your thoughts.

Alexander Toenges (17:09:13): > @Aaron Lun Can you clarify how the de.n argument works in SingleR? I am providing lists of marker genes from bulk comparisons via genes= and it seems that no matter what de.n is (1 or 200) the results are the same, but when I manually subset the lists passed to genes= to different numbers of genes, the results do change. Is de.n even considered when genes= is set? > > Manual says (de.n): “An integer scalar specifying the number of DE genes to use when genes="de". If de.method="classic", defaults to 500 * (2/3) ^ log2(N) where N is the number of unique labels. Otherwise, defaults to 10.” So “to use when genes="de"” means it is ignored when genes != "de", is that correct?

Aaron Lun (17:11:12): > That’s right. It only affects the internal marker detection. If a marker gene set is manually supplied, we don’t know whether the genes are supplied in ranked order, or just arbitrarily arranged (e.g., from existing manually defined lists); picking the first de.n wouldn’t make sense in the latter.

Alexander Toenges (17:12:27): > Ok, I see, thanks.
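
The behaviour described above can be sketched in a few lines of base R. This is only an illustration of the logic, not SingleR's code: `choose_markers`, `ranked.de`, and `supplied` are hypothetical names invented here, and the real function handles markers per label pair rather than as one flat vector.

```r
# Hypothetical helper, NOT part of SingleR: it mimics only the rule that
# de.n is applied to internally ranked DE genes and ignored for supplied lists.
choose_markers <- function(ranked.de, supplied = NULL, de.n = 10) {
  if (is.null(supplied)) {
    # Internal detection: the genes are ranked, so taking the top de.n is meaningful.
    head(ranked.de, de.n)
  } else {
    # User-supplied list: the order is unknown, so every gene is used as given.
    supplied
  }
}

ranked <- paste0("gene", 1:50)          # toy ranked DE result
custom <- c("geneX", "geneY", "geneZ")  # toy manually defined marker list

internal.5   <- choose_markers(ranked, de.n = 5)
custom.1     <- choose_markers(ranked, supplied = custom, de.n = 1)
custom.200   <- choose_markers(ranked, supplied = custom, de.n = 200)
```

Under this rule, changing `de.n` alters the result only when no list is supplied, which matches what was observed: the output changes only when the supplied lists themselves are subset.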

2020-12-24

Lucy (10:32:36): > Hi all, I have a question about the use of MAST for identifying cluster markers (genes increased in a cluster relative to all other cells). I know MAST includes cellular detection rate as a covariate and I also want to include donor. I thought that it was not possible to use non-continuous covariates with MAST, but I have seen people in papers using batch or donor. Do people just recode these as numeric variables (and is that a sensible thing to do) or is there a way to include categorical variables as covariates? Thanks!

2021-01-01

Bernd (14:06:49): > @Bernd has joined the channel

Alexander Toenges (15:10:40): > Regarding the offset argument in fitGAM from tradeSeq, I see in the source code that by default the edgeR size factors it calculates are used like log(colSums(counts) * sizeFactor). Can one do the exact same thing with the factors obtained from scran::calculateSumFactors()?

Aaron Lun (15:12:15): > No, as normalization factors != size factors. Normalization factors need to be multiplied by the library sizes to get a value proportional to the size factor.

Aaron Lun (15:13:56): > In this context, the log-size factors could be used directly as the GLM offsets.

Alexander Toenges (15:16:09): > So simply log(scran::calculateSumFactors()), right? Natural log I guess?

Aaron Lun (15:18:05): > I would assume so.

Alexander Toenges (15:19:38): > Will try, thanks!
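
A minimal base-R sketch of the distinction drawn above, under toy assumptions: the counts, the normalization factors, and the stand-in size factors are all made up for illustration, and no edgeR, scran, or tradeSeq function is called. It only shows that library size times normalization factor is proportional to a size factor, so the two log-offsets differ by a constant.

```r
set.seed(1)

# Toy count matrix: 10 genes x 4 cells (illustrative data only).
counts <- matrix(rpois(40, lambda = 5), nrow = 10)
lib.sizes <- colSums(counts)

# Hypothetical edgeR-style normalization factors (they multiply to 1).
norm.factors <- c(0.9, 1.1, 1.0, 1 / (0.9 * 1.1))

# Normalization factors must be multiplied by the library sizes first.
eff.lib <- lib.sizes * norm.factors
offset.from.norm <- log(eff.lib)

# Stand-in for size factors: the same quantity scaled to a mean of 1.
size.factors <- eff.lib / mean(eff.lib)
offset.from.sf <- log(size.factors)  # natural log, used directly as the offset

# The two offsets differ only by a constant shared across cells.
shift <- offset.from.norm - offset.from.sf
print(shift)
```

In a log-link model, a constant shift in the offset should be absorbed by the intercept, which is why log-size factors can be passed directly while raw normalization factors cannot.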

2021-01-06

Aaron Lun (03:43:34): > @here If anyone has any public datasets that they’d like to see in scRNAseq, let me know. I’m doing my half-yearly vacuum of interesting datasets into the package and am looking to queue up requests.

Alexander Toenges (05:51:51): > Definitely https://pubmed.ncbi.nlm.nih.gov/29915358/ :slightly_smiling_face: - Attachment (PubMed): Single-cell characterization of haematopoietic progenitors and their trajectories in homeostasis and perturbed haematopoiesis - PubMed > The dynamics of haematopoietic stem cell differentiation and the hierarchy of oligopotent stem cells in the bone marrow remain controversial. Here we dissect haematopoietic progenitor populations at single cell resolution, deriving an unbiased reference model of transcriptional states in normal and …

Alexander Toenges (05:52:38): > MARS-seq of hematopoietic almost-everything plus CRISP-seq

Aaron Lun (12:22:30): > it shall be done

Jared Andrews (12:41:15): > https://pubmed.ncbi.nlm.nih.gov/26060301/ > https://www.nature.com/articles/nature25980 > http://science.sciencemag.org/content/358/6368/1318.long > https://www.sciencedirect.com/science/article/pii/S009286741830789X > https://www.sciencedirect.com/science/article/pii/S0092867415011241?via%3Dihub > Any of those.

Jared Andrews (12:41:55): > Some may be larger than wanted though and may warrant their own data package.

Aaron Lun (12:44:18): > how large are we talking?

Aaron Lun (12:44:34): > well, guess I’ll find out.

Jared Andrews (12:49:29): > Like 160k cells.

Jared Andrews (12:49:55): > That’s the largest of those linked, I think.

Aaron Lun (12:49:56): > ah, that’s probably okay. As long as it fits on my computer.

Jared Andrews (12:51:35): > There’s also all the Allen Brain map data (1M+ cells). > > Oh, and this one: https://www.nature.com/articles/s41586-020-1962-0

Dan Bunis (12:53:26): > https://www.cell.com/cell-reports/fulltext/S2211-1247(20)31562-X, specifically the HSPCs data here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE158490

Jared Andrews (12:54:18): > This guy, putting forth his own paper.

Jared Andrews (12:54:27): > I kid, I kid.

Aaron Lun (12:54:27): > can you believe it

Dan Bunis (12:54:40): > =p

Aaron Lun (12:54:46): > probably uses his own package for the plots as well.

Jared Andrews (12:55:27): > Missed out on a chance to be the first citation for it too.

Dan Bunis (12:56:03): > nah, didn’t miss it!

Dan Bunis (12:56:20): > The dittoSeq paper technically came out first and is cited.

Jared Andrews (12:56:26): > ohoho, I stand corrected.

Jared Andrews (12:57:21): > I didn’t realize the package was itself an author as well: - File (PNG): image.png

Dan Bunis (12:58:27): > Oh no…. adding refs in their proof system was weird, but I’d pointed that out to them!

Jared Andrews (12:58:46): > Classic publishers.

Dan Bunis (12:59:08): > :man-facepalming: thanks for catching… I’ll pass it along to the journal.

Dan Bunis (13:02:20): > Anyway, datasets for @Aaron Lun… The HSPCs data is what I’d used when I was doing SingleR benchmarking. The labels aren’t super precise, mostly cuz of the nature of the dataset, but I think it’s a good one for developer testing. Probably easier to start with the Seurat :speak_no_evil: object here and convert / pare down. https://figshare.com/articles/dataset/Processed_HSPCs_single-cell_RNA-seq_Seurat_object/11894691

Aaron Lun (13:03:33): > As a general rule, the ingestion scripts try to work from whatever’s on GEO

Dan Bunis (13:04:42): > Ahhh okay. Labels aren’t available there, but I can give/add them

2021-01-07

Aaron Lun (03:43:24) (in thread): > And so it is. I give this one a 4/10 for ease of ingestion.

2021-01-08

Aaron Lun (03:06:42) (in thread): > First one is done

Jared Andrews (08:56:02) (in thread): > :+1:

2021-01-09

Aaron Lun (04:35:25) (in thread): > second is done

Aaron Lun (04:44:32) (in thread): > Third is shuttered behind dbGAP, and is basically ungettable AFAICT.

Aaron Lun (04:47:43) (in thread): > as is the last one

Jared Andrews (04:49:13) (in thread): > Ah, yeah, raw counts are hard to find. 3rd has semi-processed data here, but it may be a better fit for celldex: http://cells.ucsc.edu/?ds=cortex-dev

Jared Andrews (04:50:17) (in thread): > Raw counts and metadata for the last one are also available here: https://www.pollenlab.org/datasets

2021-01-12

Aaron Lun (02:14:10) (in thread): > 3rd is done.

2021-01-13

Aaron Lun (02:54:46) (in thread): > last is done.

2021-01-14

Aaron Lun (02:37:03) (in thread): > What the! a 17 GB loom file!

Jared Andrews (02:43:03) (in thread): > If you really wanna get wild, there’s also the Tabula Muris datasets: https://www.biorxiv.org/content/10.1101/661728v3 > http://cells.ucsc.edu/?ds=tabula-muris-senis%2Ffacs%2Fall > http://cells.ucsc.edu/?ds=tabula-muris-senis%2Fdroplet%2Fall

Aaron Lun (02:43:13) (in thread): > oh, those are definitely their own package.

Jared Andrews (02:44:15) (in thread): > Yes, they are quite beefy.

Aaron Lun (02:44:24) (in thread): > anyway, had to make some space on my hard drive to download this damn loom file

Jared Andrews (02:46:01) (in thread): > What’d ya clear? Failed package hex stickers? Anime reaction gifs? Cyberpunk2077?

Aaron Lun (02:47:58) (in thread): > Much more boring; some old BAM files floating around from my ChIP-seq book

Jared Andrews (02:53:57) (in thread): > Actually, that sparks a question regarding ChIP-seq (potentially for another venue/time). Did you ever contemplate/attempt more nuanced deduplication of ChIP-seq samples by trying to estimate technical duplicates versus true signal? Or do you expect this really isn’t worth the hassle - either duplications aren’t enough of an issue to remove or the data is poor enough quality that it isn’t likely to make much of a difference?

Aaron Lun (03:03:21) (in thread): > … and with compression, it goes down to 500 MB.

Aaron Lun (03:04:56) (in thread): > Anyway. yes, back in my PhD I did think about some more careful strategies for dedupping.

Aaron Lun (03:05:09) (in thread): > But your final comment is more or less correct.

Aaron Lun (03:06:21) (in thread): > The problem with PCR duplicates is usually not with the duplicates themselves; the bigger culprits are those read stacks that form in repeat units like microsatellites. Blacklisting those problematic regions (e.g., using the ENCODE blacklist or RepeatMasker regions) can often filter out these bad places.

Aaron Lun (03:06:57) (in thread): > I usually don’t remove duplicates anywhere, just trusting the downstream statistical model to handle the extra variability.

Jared Andrews (03:08:10) (in thread): > Okay, that was kind of my expectation. Thanks for entertaining.

Aaron Lun (03:09:04) (in thread): > I also expect that dup removal for paired end reads is reasonably safe, chances of getting both ends at the same position seem pretty low. But even then, I would only do that if something was wrong with the regular analysis.

Jared Andrews (03:10:28) (in thread): > Yeah, I really only expected it to maybe be worth considering for single-end, non-UMI TF samples.

Jared Andrews (03:12:02) (in thread): > So much for my use case for learning Rust.

Aaron Lun (03:14:48) (in thread): > well, if you wrote something to handle UMI dedupping in ChIP-seq datasets, I’d be interested in using it.

Jared Andrews (03:17:12) (in thread): > I thought UMI-tools handled that pretty squarely already.

Aaron Lun (03:18:39) (in thread): > I used that in the past but it was pretty sluggish.

Jared Andrews (03:23:18) (in thread): > There’s also this: https://github.com/Daniel-Liu-c0deb0t/UMICollapse Though I have no idea if performance is any better. Java, and seems to support parallelization.

2021-01-19

Aaron Lun (02:52:04): > Right. Who was next.

Aaron Lun (03:36:14): > Right bunis. You’re next!

Aaron Lun (03:36:26): > Let’s see how good your metadata is!

2021-01-20

Aaron Lun (02:58:50): > Okay, I didn’t get around to dealing with Dan’s data, but @Jared Andrews all your suggested datasets should now be in scRNAseq v2.5.2.

2021-01-21

Aaron Lun (02:24:46): > naive t cells… my arch nemesis.

Aaron Lun (02:30:32): > woah. Pulled down the microarray data by accident, wondered what the hell was going on.

Aaron Lun (02:51:52): > Well. that was pretty easy going from GEO resources. @Dan Bunis if you want to add more metadata from the downstream analysis, make a PR into the bunix branch at https://github.com/LTLA/scRNAseq.

Aaron Lun (02:54:11): > Might also want to clean up the demuxlet output, e.g., put it into a nested DataFrame so it doesn’t bloat up the colData.

Dan Bunis (11:48:52): > :raised_hands: I’ll take a look at what’s there and see about adding / trimming / reorganizing. lots of the Demuxlet stuff has low utility, so hiding inside a nested DF seems a good plan.

Alexander Toenges (19:46:19): > Is anyone aware of a way to test with glmTreat in edgeR for fold changes being below an absolute threshold, so significantly not changing, like DESeq2 with “lessAbs” altHypothesis?

Aaron Lun (19:50:36): > https://support.bioconductor.org/p/66283/#66287

Aaron Lun (19:50:46): > It’s not pretty, in either programming or statistical terms.

2021-01-22

Annajiat Alim Rasel (15:45:40): > @Annajiat Alim Rasel has joined the channel

2021-01-26

Alexander Toenges (09:41:06) (in thread): > Out of curiosity, given that there is no function for this, is this something that you/edgeR team find to be generally unreliable or is it something that simply has never been requested by a broader audience?

2021-01-27

Aaron Lun (01:45:52) (in thread): > Probably the latter.

Aaron Lun (01:55:17) (in thread): > I don’t think I’ve ever had to do that as a matter of direct interest. The only situation I can think of is when I want to find genes that are DE in one comparison and not DE in another comparison. One could debate whether “not DE” (no evidence of DE) is better/worse than “significantly constant” (from TOST) for the purposes of obtaining a gene list for follow-up work. I would suspect that TOST would favor high-abundance housekeeping genes with low variance.

Tim Triche (11:00:58): > anyone else had this issue recently? > > Adding velocity... > sh: 1: .: Can't open ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh >

Tim Triche (11:01:24): > which is bizarre because: > > tim@thinkpad-P1:~$ ls -l ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh > -rwxrwxr-x 1 tim tim 3739 Dec 22 11:38 /home/tim/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh >

Tim Triche (11:17:11): > not limited to velociraptor: > > > reducedDim(x, "DENSMAP") <- densvis::densmap(x=reducedDim(x, rdn), > + n_components=6L) > sh: 1: .: Can't open ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh > Error in system(paste(act.cmd, collapse = " "), intern = TRUE) : > error in running command >

Tim Triche (11:18:15): > > R> ?densmap > R> set.seed(42) > R> x <- matrix(rnorm(200), ncol=2) > R> densmap(x) > sh: 1: .: Can't open ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh > Error in system(paste(act.cmd, collapse = " "), intern = TRUE) : > error in running command >

Aaron Lun (11:26:55): > Hm. I suppose that manually running . ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh has no error on the command line.

Tim Triche (12:07:51): > it returns nothing, but it does not return an error

Tim Triche (12:08:07): > I manually adjusted it to be +x just to be sure

Aaron Lun (12:09:22): > Hm… okay, what does system()’ing that command do?

Alan O’C (12:12:30): > I also see this actually

Alan O’C (12:13:02): > system-ing it doesn’t complain

Aaron Lun (12:13:54): > HMMM. Most puzzling. What’s the environment? Fresh R session, no virtualenv?

Alan O’C (12:14:35): > The freshest. Just plain R

Alan O’C (12:16:02): > Mine didn’t have x perms but changing that doesn’t help as Tim says

Aaron Lun (12:17:58): > can only assume that $HOME has been changed

Tim Triche (12:26:11): > I’ve tried this in a tmux’ed session and also from a terminal. both break

Aaron Lun (12:27:00): > Suggest debug(basilisk:::.activate_environment) and stepping through to see what act.cmd is, and then seeing if it breaks inside and outside of basilisk, and then inside and outside of R.

Tim Triche (12:27:06): > > R> Sys.getenv("HOME") > [1] "/home/tim" > R> system("echo $HOME") > /home/tim > R> system("cd ~; pwd") > /home/tim >

Alan O’C (12:27:22): > Yeah stepping through now

Tim Triche (12:27:42): > > R> debug(basilisk:::.activate_environment) > Error in get(name, envir = asNamespace(pkg), inherits = FALSE) : > object '.activate_environment' not found > > Enter a frame number, or 0 to exit > > 1: debug(basilisk:::.activate_environment) > 2: basilisk:::.activate_environment > 3: get(name, envir = asNamespace(pkg), inherits = FALSE) > > Selection: 0 >

Tim Triche (12:27:46): > it didn’t like that at all

Aaron Lun (12:29:11): > can’t remember what the function was called in release. Maybe it was .activate_env.

Alan O’C (12:29:26): > debugonce(basilisk:::.activate_condaenv)

Alan O’C (12:29:37): > Presumably

Aaron Lun (12:29:40): > that sounds right.

Alan O’C (12:30:00): > For me act.cmd is ". '~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh' && conda activate && /usr/lib/R/bin/Rscript --default-packages=NULL -e \"con <- socketConnection(port=11957, open='wb', blocking=TRUE);serialize(Sys.getenv(), con);close(con)\""

Aaron Lun (12:30:48): > looks sensible.

Aaron Lun (12:31:01): > This fails if you system() it?

Alan O’C (12:31:14): > Yeah: sh: 1: .: Can't open ~/.cache/basilisk/1.2.1/0/etc/profile.d/conda.sh

Aaron Lun (12:31:36): > What happens if you replace ~ with the absolute path?

Alan O’C (12:31:54): > ’s fine

Aaron Lun (12:32:00): > Oh, that might be why. Because single-quotes prevent shell expansion.

Alan O’C (12:32:17): > That’ll do it aye

Aaron Lun (12:32:25): > I wonder why you guys get ~ as your home, though. I get my actual home.

Tim Triche (12:32:48): > thought occurs to replace it with Sys.getenv("HOME")?

Aaron Lun (12:33:06): > normalizePath should be sufficient.

Tim Triche (12:33:20): > oh, that would be good too.

Tim Triche (12:33:34): > > R> normalizePath("~") > [1] "/home/tim" >

Tim Triche (12:33:36): > TIL
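
The quoting issue diagnosed above can be reproduced in base R without basilisk. This is a minimal sketch assuming a Unix-like system where `system()` hands the command to `sh`; the `~/.cache` path is just an example and does not need to exist.

```r
# Inside single quotes the shell does not expand ~, so it stays literal.
quoted <- system("echo '~'", intern = TRUE)

# Unquoted, sh performs tilde expansion and prints the home directory.
unquoted <- system("echo ~", intern = TRUE)

# Safer pattern: expand the path in R first, then quote the absolute path.
expanded <- path.expand("~/.cache")
safe.cmd <- paste("ls -d", shQuote(expanded))

print(quoted)
print(unquoted)
print(safe.cmd)
```

Expanding before quoting keeps the protection that quoting gives against spaces in the path, while removing the dependence on the shell to interpret `~`.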

Aaron Lun (12:34:20): > Following the daisy chain of calls, it suggests that user_cache_dir is actually the one responsible for including ~.

Aaron Lun (12:35:15): > Oh. It changed today.

Tim Triche (12:36:04): > OK I feel better now. I distinctly remember things working until today :slightly_smiling_face:

Aaron Lun (12:41:52): > Right. Pull basilisk 1.2.2 from GitHub and see if it’s better.

Tim Triche (12:42:41): > as opposed to 1.3.6 from bioc-git?

Alan O’C (12:43:33): > On the 3.12 branch?

Aaron Lun (12:43:36): > Yes

Aaron Lun (12:43:42): > I assume this is all on release.

Aaron Lun (12:44:03): > If you’re on devel, then you’ll want to update basilisk.utils instead.

Alan O’C (12:44:27): > Yep seems to work

Tim Triche (12:44:36): > ah crap. looks like I am on release but installed devel. derp. fixing.

Tim Triche (12:44:38): > thanks

Alan O’C (12:44:44): > Hmmm… plot thickens

Tim Triche (12:45:10): > no I mean just now – I was at 1.2.1 then installed 1.3.6 from bioc-git. Pulling 1.2.2 from github

Tim Triche (12:47:46): > argh.

Tim Triche (12:47:50): > > Preparing transaction: done > Executing transaction: done > installation finished. > Error in system2(python.cmd, c("-E", "-c", shQuote("print(1)")), stdout = TRUE, : > error in running command >

Alan O’C (12:48:13): > Beat me to it. Same problem with ~ in basilisk.utils:::installConda

Tim Triche (12:48:13): > > install("LTLA/basilisk", ref="RELEASE_3_12") > library(densvis) > ?densmap > set.seed(42) > x <- matrix(rnorm(200), ncol=2) > densmap(x) > > provocation

Tim Triche (12:48:31): > (simplest reprex I’ve found so far)

Aaron Lun (12:48:32): > yes, more of the same. Need to protect all system calls now.

Alan O’C (12:48:37): > python.cmd is ~/.cache/basilisk/1.2.2/0/bin/python or something to that effect

Aaron Lun (12:55:51): > Right. Try installing RELEASE_3_12 of basilisk.utils.

Tim Triche (12:57:07): > > Preparing transaction: done > Executing transaction: done > installation finished. > Error in system2(python.cmd, c("-E", "-c", shQuote("print(1)")), stdout = TRUE, : > error in running command >

Tim Triche (12:57:26): > after > > R> install("LTLA/basilisk.utils", ref="RELEASE_3_12") # successful >

Aaron Lun (12:58:25): > that had better be on a fresh R session

Alan O’C (12:59:09): > Works for me now

Tim Triche (13:04:12): > @Aaron Lun yeah I realized that, looks like kicking the session and reinstalling basilisk and basilisk.utils 1.2.2 has solved it (waiting for Conda to finish downloading the entire internet)

Tim Triche (13:04:40): > > R> head(densmap(x)) > [,1] [,2] > [1,] 0.603827775 8.52678585 > [2,] 1.711574554 6.15337420 > [3,] 3.801946163 9.26100063 > [4,] 0.485207111 7.66318989 > [5,] 3.512055874 8.99942493 > [6,] 2.816894293 7.41515160 >

Tim Triche (13:05:05): > victory! Thanks for the quick patches @Aaron Lun and verifying that I’m not completely incompetent @Alan O’C

Alan O’C (13:05:57): > Hooray! We are united in partial competence Tim :slightly_smiling_face: Thanks for the quick fix Aaron

Aaron Lun (13:06:32): > ur welcome. Consider flipping basilisk back to 1.2.1 to avoid confusion when we debug the next problem.

2021-01-28

Aaron Lun (03:24:11): > ARE THERE NO MORE DATASETS TO INGEST?

Jared Andrews (09:15:06): > I mean, I can give you more if you enjoy the torture.

Jared Andrews (09:39:32): > Don’t think I sent these before: https://www.nature.com/articles/s41588-019-0531-7?draft=collection#data-availability

Jared Andrews (09:40:05) (in thread): > This one has some tumor data as well, but a pretty nice cell atlas component too.

Jared Andrews (09:40:10): > https://www.sciencedirect.com/science/article/pii/S0960982218309928?via%3Dihub

2021-01-29

Davide Corso (08:46:18): > @Davide Corso has joined the channel

2021-02-03

Aaron Lun (02:29:32): > alright, bring it

2021-02-04

Aaron Lun (03:02:39) (in thread): > do you know where the count matrix lives for this one?

Jared Andrews (20:02:31) (in thread): > Hm, can’t find it, so I guess not. Only the ENA project. My bad.

2021-02-08

Jared Andrews (22:25:48): > Another decent set: https://www.kidneycellatlas.org/ May deserve its own package though.

2021-02-09

Aaron Lun (01:19:01): > I would like to finish my alphabet soup.

Aaron Lun (01:19:44): > I need datasets where the first author’s last name starts with F, I, O, Q, V, or Y.

Davide Risso (01:24:17): > Well, there’s the Fletcher et al. data on my GitHub… it’s not a ton of cells, but it’s a good dataset for trajectory analysis

Davide Risso (01:24:31): > https://github.com/drisso/fletcher2017data

Davide Risso (01:25:29): > The original matrices are on GEO

Aaron Lun (01:28:52): > sounds like the second-to-last author would be a good candidate for preparing a PR.

Aaron Lun (01:29:11): > One wonders why you didn’t just stick it in scRNAseq in the first place.

Davide Risso (01:37:37): > Ahaha yeah I know… GitHub was supposed to be a temporary stop when I needed it in a hurry for my class :grimacing:

Davide Risso (01:37:56): > I can work on a PR, consider the F taken!

2021-02-11

Aedin Culhane (10:05:17): > anyone know what software/function was used to generate these figures? My collaborator likes them and wants me to reproduce them with their data - File (PNG): Fig_Yuifei_lines.png

Lucy (10:22:35) (in thread): > The original Monocle for the trajectory/heatmap?

Alexander Toenges (13:23:38) (in thread): > The middle could be approximated with https://github.com/PoisonAlien/trackplot

Avi Srivastava (20:21:40) (in thread): > Just want to throw Signac in there for the second plot if it helps. https://satijalab.org/signac/articles/visualization.html

2021-02-12

Aaron Lun (03:10:05): > @Davide Risso why are there NAs in the count matrix?

Davide Risso (03:12:47): > ah… not sure…

Davide Risso (03:15:40): > but I can confirm that they are present in the GEO count tables as well

Aaron Lun (03:50:30): > seems like you should figure out whether they’re zeroes or not.

Aaron Lun (03:50:38): > I mean, are they zeros? What are they?

Davide Risso (04:09:54): > I have no idea, but I can try to figure it out with the person who did the preprocessing

Alexander Toenges (10:53:20) (in thread): > @Aaron Lun Since there is a guy on Twitter asking about this paper, do you think they did a proper benchmark in terms of using fastMNN accurately? Just out of interest, I will not quote you.

2021-02-13

Wes W (12:34:15): > @Wes W has joined the channel

2021-02-15

Aaron Lun (00:13:45): > Is there no hero who will take the multimodal chapter off my hands?

Aaron Lun (00:17:40) (in thread): > I’ll hold you to that, then.

Wes W (12:05:14) (in thread): > @Aaron Lun like integration of CITE-seq and ATAC-seq? or multimodal analysis like mixed model linear approaches? if the former I could step up and be that hero :smiley: if the latter, I am afraid my experience is limited to tumours and I am not sure how universal those methods are…

Aaron Lun (12:05:28) (in thread): > yes, the former

Aaron Lun (12:06:01) (in thread): > basically, I want someone to take the existing https://github.com/OSCA-source/OSCA.advanced/blob/master/inst/book/protein-abundance.Rmd and convert it into a full book.

Aaron Lun (12:06:46) (in thread): > The main requirement is that (as the book title suggests) we must use BioC packages to do this.

Wes W (12:07:49) (in thread): > Happy to try my hand at it Aaron.

Aaron Lun (12:09:36) (in thread): > okay, great. Let’s see if we can build a small team to do this.

Wes W (12:10:41) (in thread): > Perfect

Wes W (12:10:59) (in thread): > do you prefer to communicate by email or github?

Aaron Lun (12:11:25) (in thread): > ah, github, probably, my email gets washed out with lots of crap.

Wes W (12:11:35) (in thread): > fair

Davide Risso (17:13:19) (in thread): > Ok, got a response! > “Yes, NA is equivalent to 0 in these matrices. I always produced matrices that had all the annotated genes in the reference transcriptome, and if no reads were detected then it was set to NA. Originally, I thought that distinguishing NAs and 0s for two slightly different cases (no reads in the current sample / no reads in any of the samples) could be useful, but this turned out not to be the case. So in the downstream analyses, I just overwrote the NA’s with 0’s right after reading the matrix.”

Davide Risso (17:14:37) (in thread): > what do we want to do? We could turn them into 0’s in FletcherOlfactoryData() or just get rid of those genes…

Aaron Lun (17:14:45) (in thread): > thanks, yes, I’ll turn them into zeroes
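
The overwrite described in the quoted response is a one-liner in base R. A minimal sketch on a made-up matrix (the gene and cell names are placeholders, and this is not the actual ingestion script):

```r
# Toy count matrix where NA marks "no reads detected" (illustrative only).
counts <- matrix(c(3, NA, 0, 5, NA, NA), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("cellA", "cellB")))

# Record how many entries were missing before overwriting them.
n.missing <- sum(is.na(counts))

# Treat NA as zero, as the original preprocessing did downstream.
counts[is.na(counts)] <- 0

print(n.missing)
print(counts)
```

Keeping the count of replaced entries is optional, but it gives a cheap check that the replacement touched only what was expected.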

2021-02-17

Wes W (14:10:30) (in thread): > Can I clone and fork and start setting stuff up, or should we wait till we have our team?

Aaron Lun (14:10:43) (in thread): > Just go ahead

Aaron Lun (14:10:52) (in thread): > probably make a new repo entirely, no need to fork.

Aaron Lun (14:11:16) (in thread): > Just copy-paste the package structure.

Wes W (15:02:40) (in thread): > will do!

2021-02-23

Wynn Cheung (10:33:04): > @Wynn Cheung has joined the channel

2021-02-24

Jenny Drnevich (16:55:12): > @Aaron Lun what kind of single cell data sets have you been harvesting? For the Carpentries-style intro to R workshop we are developing, we are trying to find a 10K+ cell data set with libraries across a time series (or at least 3 different treatments…). Any suggestions?

Aaron Lun (16:55:41): > three different treatments, huh.

Aaron Lun (16:57:19): > there’s a whole list at https://github.com/LTLA/scRNAseq/blob/master/inst/extdata/manifest.csv

Aaron Lun (16:57:48): > Oh wait, BachMammaryData() is a classic. Mammary gland, 25k cells total, lactating and non-lactating and one other pregnancy-related condition but I can’t remember what it was.

Dan Bunis (16:59:36): > The upcoming BunisHSPCData is heterogeneous cells (hematopoietic stem & progenitor cells) from fetal, newborn, and adult timepoints… But it’s not even ready in devel yet, so doesn’t help in near term. Also only 5k cells…. I’ll shut up lol

Aaron Lun (17:00:50): > NP = not pregnant? G = gestation, L = lactating, PI = post involution, IIRC.

Jenny Drnevich (17:01:47) (in thread): > Two replicates each from 4 different development stages published in 2017?! Way ahead of their time! Looks very promising - thanks!

Jenny Drnevich (17:02:56) (in thread): > Thanks Dan - we appreciate the thought!

2021-03-09

Peter Hickey (19:13:13): > For a dataset with both HTOs (hashtag oligos) and ADTs (antibody derived tags), should the HTOs be included or excluded when computing size factors to normalize the ADTs? > Following from that, if using DropletUtils::read10xCounts() you’ll end up with altExp(sce, "Antibody Capture"), so perhaps as a user I should then split those into altExp(sce, "HTO") and altExp(sce, "ADT") to simplify things?

Aaron Lun (19:16:00): > You could make a case either way. In theory, the HTOs would provide more information for normalization, especially since they help satisfy the assumption that most tags are not actually present in the droplet.

Aaron Lun (19:16:51): > However, that assumes that the HTOs and ADTs are subject to the same biases, which may not be true, e.g., if the HTOs use a tagging mechanism that’s not antibody binding, like cholesterol-based.

Peter Hickey (19:19:25): > good point about different tagging mechanisms.

Peter Hickey (19:19:26): > We may also sequence HTOs and ADTs to different depths, e.g., shallow sequencing of small number of HTOs and deeper sequencing of large number of ADTs - would that pose any problems for normalization?

Aaron Lun (19:20:00): > In principle, no, not for between-cell normalization.

Aaron Lun (19:20:02): > In practice…

Aaron Lun (19:20:07): > ¯\_(ツ)_/¯

Peter Hickey (19:22:21): > cheers. i’ll explore this in a couple of datasets i have with 2-7 HTOs and 250+ ADTs and let you know anything interesting

Aaron Lun (19:22:45): > oh, if those are the sort of numbers we’re talking about, then it doesn’t matter.

Aaron Lun (19:23:14): > Might as well burn the HTOs once you’re done with them.

Peter Hickey (19:24:46): > fair enough. got another one with 2 HTOs and only 9 ADTs, so perhaps that’s more relevant

Aaron Lun (19:26:18): > possibly, though with 2 HTOs, you’re always going to have 1 of them present, so it doesn’t really help with the assumption mentioned above.

Peter Hickey (19:26:50): > thanks aaron

2021-03-23

Lambda Moses (23:06:08): > @Lambda Moses has joined the channel

2021-03-29

Federico Marini (07:41:21): > FYI: https://cannoodt.dev/2021/03/anndata-for-r-has-a-new-home/ - Attachment (Robrecht Cannoodt): anndata for R has a new home! | Robrecht Cannoodt > Welcome to the dynverse project :)

Aaron Lun (11:24:56): > What’s new? Looks same-old, same-old to me.

Kasper D. Hansen (11:38:12): > But it’s a new URL

2021-04-06

Lindsay Hayes (00:09:39): > @Lindsay Hayes has joined the channel

2021-04-15

Alexander Toenges (11:33:24): > Is there a way to get data from the scRNAseq package (GiladiHSCData) given that I am on Bioc 3.12 with R 4.0.3 and would rather not upgrade, since software versions within the RProject should stay as they are right now?

Alexander Toenges (11:34:02): > I get the error Error in .local(x, i, j = j, ...) : 'i' must be length 1 when using the GitHub version from LTLA/scRNAseq, and BiocManager::valid() indicates it is “too new”.

Aaron Lun (11:39:14): > no, is the short answer.

Alexander Toenges (11:46:46): > So what would people do who have projects going on that take years to complete, and at some point need newer packages, yet for reproducibility cannot upgrade older stuff?

Aaron Lun (11:48:35): > containers? Packrat? any environment thingy?

Alexander Toenges (11:48:57): > sorry for asking…

Aaron Lun (11:49:07): > i mean, if you need newer packages and you can’t upgrade older stuff, you’re kind of screwed by definition.

Alan O’C (11:49:39): > If you can’t upgrade but need to upgrade you’re always going to be snookered

Alan O’C (11:49:48): > :shrug:

Davide Risso (12:00:40): > I guess one way would be to use a container to get the data from the scRNAseq package using the latest Bioc version, save a local copy of the resulting SCE and then load it into the older R/Bioc session

Peter Hickey (18:44:32): > like @Davide Risso, i’d also just fire up a new-enough R version (container, other system, wherever) to get the data and then save it and load into other R version. but you have to acknowledge you’re on your own in terms of support because of stepping outside the bioc version cycle. but for a data object (as opposed to function stuff) you’re probably on safe ground doing things this way
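
A minimal sketch of the fetch-in-new-R, load-in-old-R route suggested above. The helper name `save_dataset` and the file name are hypothetical, and whether an object serialized under a newer Bioconductor release loads cleanly under an older one is not guaranteed (as noted above, this is outside the supported version cycle).

```r
# Generic helper: call a dataset getter and write the result to disk.
save_dataset <- function(getter, path) {
  obj <- getter()
  saveRDS(obj, file = path)
  invisible(path)
}

# In the NEW R/Bioconductor session (e.g. inside a container), something like:
#   save_dataset(scRNAseq::GiladiHSCData, "giladi_hsc_sce.rds")  # hypothetical file name
#
# In the OLD, frozen session, load the class definitions first, then the object:
#   library(SingleCellExperiment)
#   sce <- readRDS("giladi_hsc_sce.rds")
```

Passing the getter as a function keeps the helper free of any package dependency, so only the session that actually downloads the data needs the newer scRNAseq.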

2021-04-16

Nadine Bestard-Cuche (05:33:52): > Hi, I recently posted in Biostars (here) a behavior I do not understand after applying a linear regression to a scRNAseq dataset. It looks less integrated than before! I explain in the post that I am not especially interested in this specific case, I think it is fine to carry on with no batch correction at all. But I am just very surprised by this and would like to understand how results like this one are possible, or if I was violating some requirements or assumptions I am not aware of. > The post led to a nice exchange about batch correction / integration methods, however I still have no answer to my very first question: How is this possible? Maybe someone here has an insight?

Aaron Lun (11:47:48): > This is known.

Aaron Lun (11:48:24): > linear regression with just the batch factor assumes that “everything else is equal” between the two batches - in particular, the population composition.

Aaron Lun (11:50:08): > For example, think about a situation where one batch has 90% T cells and 10% B cells, and another batch has 10% T cells and 90% B cells. Linear regression tries to equalize the means of the two batches… which is not sensible in this case, because the means shouldn’t be the same, given that the two batches have different ratios of their cell types.

Aaron Lun (11:50:46): > hence the need for other batch correction methods that I would otherwise consider overly complicated if they weren’t actually necessary.
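
The 90/10 versus 10/90 example above can be sketched with made-up numbers. This is only an illustration in base R: one gene, no batch effect at all, and the "correction" is the residuals of `lm(expr ~ batch)`; the expression values 0 and 10 are assumptions chosen to make the arithmetic obvious.

```r
# Toy data: one gene, T cells express 0 and B cells express 10 in BOTH batches,
# i.e. there is no batch effect, only a difference in composition.
batch     <- factor(rep(c("b1", "b2"), each = 100))
cell_type <- c(rep(c("T", "B"), times = c(90, 10)),   # batch 1: 90% T, 10% B
               rep(c("T", "B"), times = c(10, 90)))   # batch 2: 10% T, 90% B
expr      <- ifelse(cell_type == "T", 0, 10)

# "Correct" by regressing out the batch factor and keeping the residuals.
corrected <- residuals(lm(expr ~ batch))

# Mean of each cell type in each batch after the correction.
after <- tapply(corrected, list(cell_type, batch), mean)
print(round(after, 2))
```

Before correction the T cells were identical across batches; afterwards they sit at -1 in one batch and -9 in the other, because each batch was shifted by its own overall mean (1 and 9). The regression has introduced a batch difference within each cell type rather than removed one.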

Tim Triche (11:58:37): > @Aaron Lun how do you feel about harmony vs. MNN based on the above observation

Aaron Lun (12:00:30): > It’s been a long time since I played with harmony, but IIRC the key point was the use of clusters; this allows for faster correction, and possibly more accurate correction, but that would be dependent on the quality of the clustering.

Aaron Lun (12:00:42): > Maybe they were using kmeans to aggregate the data. Can’t remember.

Aaron Lun (12:01:16): > They had a BioC submission but they dropped it and I was like well, whatever.

Jared Andrews (12:01:30): > Shame a lot of those methods aren’t easier to apply to SCEs.

Alan O’C (12:05:59): > Is Harmony just on github then?

Jared Andrews (12:08:14): > Believe so. Most of the immunogenomics stuff is github only.

Alan O’C (12:12:00): > Huh. “Interesting choice”

Tim Triche (13:01:15): > harmony was in BiocManager::install("harmony", version = "3.8")

Tim Triche (13:02:10): > must have got deprecated. it works well on SCEs and short of Zhimin Zhang (sp?)’s burn-all-the-rainforests VAEGAN or pythonic bbknn, it seems to be about as good of a general purpose solution as any. shame about the BioC situation :confused:

2021-04-19

Wes W (10:21:45): > Harmony vs scMerge vs LIGER? what are people using in the wild? Most people in my building are using CCA from Seurat, I am the only one using Bioc here…

Tim Triche (10:37:12): > Harmony (for everyone I know) or MNN (for corner cases)

Tim Triche (10:37:55): > @Aaron Lun where’s that amazing GitHub issue that describes why e.g. SeuratDisk is unlikely to have a long shelf life and trusting Seurat API stability might not be the path to serenity?

Tim Triche (10:39:16): > I need to bookmark it for anytime someone asks me why one might aim at SCE rather than straight at Seurat for user-facing examples. Nothing wrong with Seurat per se, but from a 3rd party developer perspective, the “move fast and break stuff” approach can create challenges.

Alan O’C (10:46:00): > This I think? https://github.com/LTLA/scRNAseq/issues/15 - Attachment: #15 Seurat versions? > If I convert the datasets present in this package to Seurat format, with metadata retained between versions, would the authors allow a pull request? I know the community is somewhat split over SingleCellExperiment vs. Seurat, but my and many other labs use Seurat, and whenever I pull a dataset from this package I need to convert it to Seurat format. I think it’d be great for the package to include a compressed version of a Seurat object for each experiment, and I’m willing to do the legwork to convert the files myself.

Alan O’C (10:47:10) (in thread): > Worth noting that “move fast and break stuff” works in other languages with good package version management (eg, ruby) but not with the “all as one” model of R and CRAN in particular

Wes W (10:50:50) (in thread): > thanks

Tim Triche (12:28:32) (in thread): > It also works a lot better when infinite VC money can be pointed at maintenance, as opposed to NIH/NSF “maintenance is not innovative” funding

Tim Triche (12:32:16) (in thread): > I greatly enjoy having conda nightmares handled by basilisk instead of by shaving years off my (or my trainees’) life. “dev is not prod” and all that.

Alan O’C (12:46:09) (in thread): > It’s great yeah. Astonishing to see people even recently saying “just install X program from github into your path somewhere” as a solution for an R package hosted on a central repository

Tim Triche (12:56:36) (in thread): > “just pipe this thing off the Interwebs into sudo”

Tim Triche (12:56:49) (in thread): > a close competitor for Break Things Fastest

ImranF (14:33:06): > @Wes W, I’m using Conos.

Wes W (17:48:25) (in thread): > thanks, checking it out

Alan O’C (19:13:43) (in thread): > I saw some issues relating to the R package version mismatching the compiled code it was calling and thought “Wow that’s horrible and confusing, and yet completely expected”

Tim Triche (19:19:06) (in thread): > dear god

2021-04-20

Vince Carey (06:36:50) (in thread): > It might be worth noting that Rstudio pro includes features that allow users to work with multiple versions of R. You can emulate this if you don’t go to pro, using renv: https://rstudio.github.io/renv/articles/renv.html - Attachment (rstudio.github.io): Introduction to renv > renv

Alan O’C (06:59:21) (in thread): > Yeah using more than one version of R with renv is probably the cleanest way to manage a mix of old and new packages. Can guarantee some confusion if half the scripts are run with one version of R and the other half another though

Vince Carey (08:48:36) (in thread): > I’ve not had time to explore the notion of a sessionInfo-based signature for a script or session. I noticed that there’s a sessioninfo package on CRAN but I don’t know if they have a diff concept or functions to restore a given session state on demand. Seems important to be able to compute on the sessionInfo for provenance and reproducibility. This is probably not the best channel for such speculations …
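
One way the "compute on the sessionInfo" idea above could look, as a rough base R sketch. The function names `pkg_versions` and `diff_versions` are hypothetical and are not taken from the sessioninfo package or any other package; the sketch only compares package versions and does not restore a session.

```r
# Collect "package -> version" for everything attached or loaded in a session.
pkg_versions <- function(si = sessionInfo()) {
  pkgs <- c(si$otherPkgs, si$loadedOnly)
  vapply(pkgs, function(p) as.character(p$Version), character(1))
}

# Compare two snapshots (named character vectors) and report what differs.
diff_versions <- function(old, new) {
  all_pkgs <- union(names(old), names(new))
  out <- data.frame(package = all_pkgs,
                    old = unname(old[all_pkgs]),
                    new = unname(new[all_pkgs]),
                    stringsAsFactors = FALSE)
  # A package counts as changed if it is missing on one side or the versions differ.
  changed <- is.na(out$old) | is.na(out$new) | out$old != out$new
  out[changed, , drop = FALSE]
}
```

A snapshot saved next to each script's output (for example with `saveRDS(pkg_versions(), ...)`) could then be diffed against the current session before rerunning the script.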

2021-04-21

Nadine Bestard-Cuche (11:11:24) (in thread): > thank you for your reply! that was helpful

Nadine Bestard-Cuche (11:21:04) (in thread): > I know this is not the place to discuss Seurat stuff, but I got a bit confused recently. Everyone talks about “CCA from Seurat”, but I am not sure what it refers to looking at their webpage. Is this CCA integration? In the paper cited this seems something different, also inspired from MNN and “improved” (even if we know how all changes are sold as better than the previous version) - Attachment (satijalab.org): Introduction to scRNA-seq integration > Seurat

Alan O’C (11:24:43) (in thread): > The original Seurat integration paper was “just” CCA IIRC, while the 2019 paper is an update that borrows ideas from MNN etc. I don’t remember if the 2019 method still uses primarily CCA? I recall a lot of talk about “integration anchors”

Nadine Bestard-Cuche (11:39:17) (in thread): > I see, thanks Alan. I just had a look at the 2019 paper and it is not clear to me if it is still CCA. So I’ll assume that when people say “Seurat CCA integration” presumably they are talking about the initial workflow from the original paper. And as far as I could see that is not in a nice vignette anymore, probably it has been replaced by this newer one. I might ask them at some point to be sure.

Alan O’C (11:40:45) (in thread): > While that sounds sensible, I would assume that if somebody says “Seurat CCA integration” they mean “I followed one of the Seurat integration workflows” unless I know they are likely to be specifically referring to one method over another

Alan O’C (11:41:41) (in thread): > I would say it’d be easiest to ask which version of Seurat but I have a bad feeling that the different methods don’t map neatly to v2, v3 (v4 now?)

Nadine Bestard-Cuche (11:43:30) (in thread): > I don’t think they do! I think the functions from the previous CCA workflow still work on the newest versions, so someone could be following a saved script of the old CCA method on a more recent Seurat version

Alan O’C (11:46:57) (in thread): > Indeed, endless confusion. If I remember well it’s also not totally clear whether they use the sctransform residuals as normalised expression values or simply use them to select HVGs

Lucy (13:28:28): > https://www.cell.com/cell/fulltext/S0092-8674%2819%2930559-8 See figure 1 of this paper

Alan O’C (13:35:46): > Ah, was too lazy to go re-read the paper. In this case it’s CCA into MNN. So I guess if somebody says “CCA with Seurat” you hope they mean the more recent version

Aaron Lun (13:36:49): > IIRC the CCA bit is not really doing anything special, it’s just dimensionality reduction. The original implementation’s magic sauce was really the dynamic time warping step along each dimension to align major populations.

2021-04-22

Stephanie Hicks (12:02:07): > Random question. Are there published standards or guidance on which metadata elements should be required for scRNAseq or scATACseq data? I know of this https://www.nature.com/articles/s41587-020-00744-z.pdf?origin=ppub and some metadata types from HCA (e.g. https://data.humancellatlas.org/metadata). - Attachment (HCA Data Portal): Metadata Types > An overview of the HCA metadata schema types and structure.

Stephanie Hicks (12:02:35): > Are there other guidelines for metadata?

2021-04-23

hcorrada (19:13:28): > @hcorrada has joined the channel

2021-04-24

Jayaram Kancherla (08:34:57): > @Jayaram Kancherla has joined the channel

2021-04-26

rohitsatyam102 (11:19:49): > @rohitsatyam102 has joined the channel

2021-04-27

Chris Vanderaa (17:18:13): > @Chris Vanderaa has joined the channel

2021-04-29

rohitsatyam102 (03:51:53): > I found this post really useful as I begin single-cell preprocessing, especially the way @Aaron Lun puts it. It’s humorous in many ways :joy: esp. the expression “throwing out the baby with the bathwater”. https://support.bioconductor.org/p/118877/

2021-04-30

Tim Triche (10:04:39): > one day people will look back at the single cell literature and wonder whether Aaron T. Lun was a pseudonym for a large group of extremely productive investigators. The Bourbaki of single cells, as it were

Davide Risso (10:14:20) (in thread): > maybe it is… I hear there are infinite monkeys involved somehow…

Federico Marini (10:21:04) (in thread): > I’d ask “why one day” actually

Tim Triche (10:56:56) (in thread): > as the LLS likes to say, “someday is today”

2021-05-02

rohitsatyam102 (09:15:18): > Can someone help me solve this: https://support.bioconductor.org/p/9136786/. I have been sitting on it all day long but couldn’t work it out. Please Please…

Alan O’C (09:46:07) (in thread): > What’s the traceback?

rohitsatyam102 (10:19:47) (in thread): > No Traceback. It just spits out this error.

Alan O’C (10:20:22) (in thread): > traceback() returns absolutely nothing?

rohitsatyam102 (10:29:58) (in thread): > Oops!! That traceback. Sorry I misunderstood it:

rohitsatyam102 (10:30:04) (in thread): > > traceback() > 9: stop("invalid character indexing") > 8: intI(j, n = d[2], dn[[2]], give.dn = FALSE) > 7: subCsp_ij(x, i, j, drop = drop) > 6: new.data[new.features, colnames(x = object), drop = FALSE] > 5: new.data[new.features, colnames(x = object), drop = FALSE] > 4: SetAssayData.Assay(object = a, slot = "data", new.data = mats$data) > 3: SetAssayData(object = a, slot = "data", new.data = mats$data) > 2: as.Seurat.SingleCellExperiment(sce_clean) > 1: as.Seurat(sce_clean) >

rohitsatyam102 (10:34:13) (in thread): > Also, for the time being I tried doing it in a naive way: > Also, I wish to understand the difference between DFrame and data.frame objects, because functions like subset that work on a data.frame don’t work on DFrame objects. > > counts <- assays(sce_clean)[[1]] > seurat <- CreateSeuratObject(counts = counts, project = "Harmony_All", min.cells = 5) > t2 <- seurat@meta.data > t <- data.frame(colData(sce_clean)) > t3 <- data.frame(t,t2) > seurat@meta.data <- t3 >

Alan O’C (10:35:57) (in thread): > Seems like > > Warning: Non-unique cell names (colnames) present in the input matrix, making unique > Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-') > > and invalid character indexing might be connected
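
A small base R sketch of checking for the two conditions named in those warnings before converting. The matrix and all names in it are made up, and whether cleaning the names actually resolves the `as.Seurat()` error in this thread is an assumption, not something confirmed here.

```r
# Toy count matrix with a duplicated cell name and an underscore in a gene name.
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("GENE_A", "GENEB"),
                                 c("cell1", "cell2", "cell1")))

# Which names would trigger the two warnings?
dup_cells   <- colnames(counts)[duplicated(colnames(counts))]
under_genes <- grep("_", rownames(counts), value = TRUE)

# Make the names conversion-safe up front, so nothing is renamed silently later.
colnames(counts) <- make.unique(colnames(counts))     # cell1, cell2, cell1.1
rownames(counts) <- gsub("_", "-", rownames(counts))  # GENE-A, GENEB
```

For a SingleCellExperiment the same calls would apply to `colnames(sce)` and `rownames(sce)`; renaming before conversion keeps the object's names and any downstream metadata in step.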

Alan O’C (10:36:37) (in thread): > As for DFrame particulars, no idea. I guess the S4Vectors package documentation should be able to help there

Alan O’C (10:38:56) (in thread): > In general I’ve found the conversion functions supplied by Seurat to be… well, given polite company, I’ll just say “very bad”.

Alan O’C (10:40:18) (in thread): > This https://github.com/cellgeni/sceasy might be… less bad

rohitsatyam102 (11:13:54) (in thread): > It broke during installation.: Even though I have R 3.6.3 and installed all other dependencies > > Encountered problems while solving: > - package r-sceasy-0.0.3-r36_0 requires r-base >=3.6,<3.7.0a0, but none of the providers can be installed >

Wes W (15:36:40): > what is your fav package for cell to cell communication with single cell data for working with SCE? I have used celltalker in the past when I was using Seurat, but now that I am trying to move to all Bioconductor + Python packages I want to see what everyone else likes and doesn’t like. In the past I have experimented with CellPhoneDB (can’t remember why I didn’t like it) and cellchat. I haven’t used SingleCellSignalR, but it is in Bioconductor. I remember looking at it once upon a time, but don’t know how plug and play it is with SCE or if I have to use the matrix and use edgeR in the pipeline separately

2021-05-04

Tim Triche (10:21:19): > have you looked at CITEfuse

rohitsatyam102 (11:30:04): > I am writing a package and using SummarizedExperiment to store the data and metadata. I also saw SingleCellExperiment that has an additional slot to store reduced dimensionality representations of the primary data obtained by methods such as PCA and t-SNE. How do I borrow just that functionality (or that slot) and use it in my SummarizedExperiment object?

Alan O’C (12:35:33): > Possibly displaying my pampered compbio upbringing where I rarely have to muck with (pseudo-)aligners and such, but if you quantify with Salmon and summarise to gene level, you’d expect the “counts” to be integers, right?

Davide Risso (12:44:37) (in thread): > Why not simply use the SingleCellExperiment class? SingleCellExperiment extends SummarizedExperiment so everything that works on a SE should work on a SCE too

Alan O’C (13:04:53) (in thread): > If you don’t want it to contain the name “SingleCell” you could create a new class that just inherits from SCE, e.g. setClass("MyExperiment", contains="SingleCellExperiment")

rohitsatyam102 (14:53:36) (in thread): > Yes that works. Wow!! Didn’t occur to me.

Tim Triche (15:23:48): > only if the reads for each gene (e.g. protocadherins) can be completely assigned to one gene or another

Alan O’C (15:45:22): > So what’s the done thing? round(x)?

Jenny Drnevich (15:46:20): > Also depends on how you adjust for transcript effective lengths. What are you using down-stream? Most “count”-based packages (DESeq2, edgeR) can handle non-integer counts. So I don’t round.

Alan O’C (15:48:18): > I’ve rolled my own model which I suppose should handle non-integers

Alan O’C (15:48:46): > Thanks; wasn’t sure, though I was nearly sure that DESeq2 didn’t like them

Charlotte Soneson (15:49:52): > DESeq2 will round internally if you get your counts from tximport/tximeta - otherwise you should provide integers.
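
For the "otherwise you should provide integers" case, a minimal base R sketch with made-up numbers. It only shows rounding and integer storage; it is not what tximport or DESeq2 do internally, and whether rounding is appropriate at all depends on the downstream model, as discussed above.

```r
# Toy gene-level estimates as they might come out of a pseudo-aligner
# (values are made up; rows are genes, columns are samples).
est_counts <- matrix(c(10.3, 0.0, 250.7, 3.2, 1.8, 99.9), nrow = 3,
                     dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# Round and store as integers for tools that insist on integer counts.
int_counts <- round(est_counts)
storage.mode(int_counts) <- "integer"
```

Setting `storage.mode` rather than calling `as.integer()` keeps the matrix dimensions and the gene and sample names intact.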

Alan O’C (15:52:18): > Ah, I wasn’t going mad! I knew I’d seen it refuse non-integers. Thanks

Aaron Lun (15:53:13): > incidentally, here is a post from Gordon on this topic: https://stats.stackexchange.com/questions/310676/continuous-generalization-of-the-negative-binomial-distribution/311927 - Attachment (Cross Validated): Continuous generalization of the negative binomial distribution > Negative binomial (NB) distribution is defined on non-negative integers and has probability mass function\[f(k;r,p)={\binom {k+r-1}{k}}p^{k}(1-p)^{r}.\] Does it make sense to consider a continuous

Alan O’C (15:59:55) (in thread): > Unless you’re defining extra behaviour for your object it might be easier/less confusing/less work to just use SCE and explain to your users that it’s just a container; the name doesn’t matter

2021-05-05

Chris Vanderaa (07:53:57) (in thread): > I’m using nichenetr (https://github.com/saeyslab/nichenetr). Although the ligand-target database looks great and the vignettes are very clear, I have the feeling that the functions are a bit simplistic and require the user to take several arbitrary decisions. I have no experience with other tools, so I cannot contrast it to other implementations.

2021-05-06

Alan O’C (05:08:41): > Any ideas why I get this updating DropletUtils? > > unable to load shared object '/home/alan/R/x86_64-pc-linux-gnu-library/4.0/00LOCK-DropletUtils/00new/DropletUtils/libs/DropletUtils.so': > /home/alan/R/x86_64-pc-linux-gnu-library/4.0/00LOCK-DropletUtils/00new/DropletUtils/libs/DropletUtils.so: undefined symbol: __ubsan_vptr_type_cache >

Aaron Lun (11:18:49): > ¯*(ツ)*/¯

Aaron Lun (11:18:53): > nothing new on my end

Alan O’C (11:19:38): > It’s definitely not new, I just haven’t been bothered to dig into it for some time. Was just vainly hoping it would ring a bell for somebody

Aaron Lun (11:21:01): > does this affect other Rhdf5lib-based packages?

Alan O’C (11:24:09): > Not mbkmeans at least

Aaron Lun (11:24:45): > hm.

Aaron Lun (11:25:45): > I don’t remember doing anything special for DropletUtils re. UBSAN.

Alan O’C (11:33:53): > I reinstalled a heap of stuff and it goes fine, prob should have tried the stupid method before posting sorry

2021-05-11

brian capaldo (15:18:38): > aside from computational burden, is there a reason to not use all genes when performing dimensionality reduction? I find that I end up with better resolution between clusters when using all genes rather than a top subset (which I feel like shouldn’t be too surprising), but my expertise and training is not in these higher level maths.

Megha Lal (16:45:53): > @Megha Lal has joined the channel

2021-05-12

Lucy (14:26:49): > Great question, my initial thought is that the lowly expressed genes might introduce some noise to the data?

Aaron Lun (14:31:48): > the computational burden is the big one. But including the lowly variable genes also adds more noise without contributing much biological signal. Normally this wouldn’t be much of a problem because they’ve got low variance anyway, so by themselves they don’t change the distance calculations that much. But if you include the majority of the transcriptome that’s lowly variable, it starts to add up. > > Of course, it’s debatable where you want to draw the line here. It’s similar to the usual variance-bias trade-off.
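
The "it starts to add up" point can be illustrated with a small simulation in base R. All numbers here (50 informative genes, 20000 flat genes, the shift of 2, the noise SD of 0.5) are assumptions picked for illustration, not values from any real dataset.

```r
set.seed(42)
n_cells <- 20      # cells per population
n_hvg   <- 50      # genes that truly differ between the two populations
n_flat  <- 20000   # genes with no biological signal, only small noise

# Informative genes: population B is shifted by 2 relative to population A.
hvg <- rbind(matrix(rnorm(n_cells * n_hvg, mean = 0), n_cells),
             matrix(rnorm(n_cells * n_hvg, mean = 2), n_cells))
# Uninformative genes: same distribution in every cell, low variance.
flat  <- matrix(rnorm(2 * n_cells * n_flat, sd = 0.5), 2 * n_cells)
group <- rep(c("A", "B"), each = n_cells)

# Ratio of mean between-population to mean within-population distance.
separation <- function(x) {
  d    <- as.matrix(dist(x))
  same <- outer(group, group, "==")
  mean(d[!same]) / mean(d[same & upper.tri(d)])
}

sep_hvg <- separation(hvg)               # informative genes only
sep_all <- separation(cbind(hvg, flat))  # informative + uninformative genes
```

Each flat gene contributes very little to a cell-to-cell distance, but 20000 of them together dominate it, so `sep_all` ends up much closer to 1 than `sep_hvg`. In a dataset with strong, discrete lineages the informative signal may be large enough that this matters less, which would be consistent with the observation above.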

Lucy (14:32:15): > :+1:

brian capaldo (14:39:37): > Might be my samples specifically. Only tested so far with highly heterogeneous organoid models that contain discrete lineages confirmed by lineage tracing. Will have to try it on some of my more homogeneous samples

Lucy (14:40:40): > Would be interested to know what you find

brian capaldo (14:50:27): > absolutely, I have half a dozen or so single cell models in this study, and some groups just published additional models that fit in nicely with my work. So I’ll have close to 10 models or so to test this on

2021-05-14

rohitsatyam102 (15:42:03): > Does 10X also count multimapping reads towards the read counts? Or, like STAR, does it only use uniquely mapping reads?

Jared Andrews (15:53:30): > cellranger uses STAR still AFAIK, and no, it doesn’t. I expect them to add it eventually given the latest STARsolo preprint though (which does account for multimappers): https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full.pdf

rohitsatyam102 (17:06:48) (in thread): > This is useful. Thanks.

2021-05-18

rohitsatyam102 (04:29:26) (in thread): > I went back and did a tiny experiment. I used some python code written by my colleague to produce a count matrix from the BAM file. I then randomly sampled 500 cells from that count matrix, extracted the same cells from the Seurat object made from Cell Ranger’s count matrix (using the read10xCounts function) and plotted the correlation. If Cell Ranger were using only uniquely mapping reads, the correlation coefficients between the 500 columns of the two matrices should be 1. However, I see that’s not the case: not for all 500 cells. Am I expecting too much (like 100% of cells must fully correlate if Cell Ranger is using the NH:1 flag for unique reads)? Here is the plot - File (PNG): image.png

rohitsatyam102 (04:40:03): > Also the correlation calculation code: > Where mat_new will be the entire matrix I got from my python script and t is raw count matrix obtained from seurat object by data.frame(sobj@assays$RNA@counts) > > correlation <- function(mat_new){ > mat_new <- mat_new[,intersect(colnames(mat_new),colnames(t))] > mat_new <- mat_new[,sample(ncol(mat_new), size = 500), drop = FALSE] > common_row <- intersect(rownames(mat_new), rownames(t)) > common_col <- intersect(colnames(mat_new), colnames(t)) > mat.subset <- as.matrix(mat_new[common_row,common_col]) > t.subset <- as.matrix(t[common_row,common_col]) > ## Column wise correlation > col.cor <- sapply(seq.int(dim(mat.subset)[2]), function(i) cor(mat.subset[,i],t.subset[,i])) > return(col.cor) > } >
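
A variant of the function above, written as a sketch that takes both matrices as arguments instead of reading `t` from the global environment (where it also shadows base R's `t()`). The name `column_cor` and its arguments are hypothetical; it computes the same per-cell Pearson correlation over shared genes and cells.

```r
# Column-wise correlation between two count matrices (genes x cells).
# `a` and `b` are hypothetical names for the two matrices being compared.
column_cor <- function(a, b, n_cells = 500) {
  genes <- intersect(rownames(a), rownames(b))
  cells <- intersect(colnames(a), colnames(b))
  # Sample after intersecting, and never ask for more cells than are shared.
  cells <- sample(cells, size = min(n_cells, length(cells)))
  a <- as.matrix(a[genes, cells, drop = FALSE])
  b <- as.matrix(b[genes, cells, drop = FALSE])
  # Named vector: one correlation per sampled cell barcode.
  vapply(cells, function(cell) cor(a[, cell], b[, cell]), numeric(1))
}
```

Returning a vector named by barcode makes it easier to look up which cells fall below 1 and inspect them directly.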

2021-05-31

Davide Risso (05:07:20): > Hi all, are there guidelines/recommendations on how to best analyze single-cell ATAC such as those from the 10X multiome platform? I know that 10X cellranger has a dedicated pipeline, but I was wondering if there are alternatives (e.g., in the salmon-alevin world @Rob Patro?) Do people usually call peaks from these data or summarize ATAC reads at the gene and/or promoter level?

Rob Patro (09:53:13): > Tagging @Avi Srivastava here, as I believe he’s done some scATAC using alevin on the scRNA side. He may have some recommendations / practices.

Avi Srivastava (11:52:32): > Hey @Davide Risso, yea so unfortunately the pipelines for genome-wide coverage measuring technologies (like scATAC-seq, scCUT&Tag) are not as structured as the feature counting based ones (scRNA-seq, CITE-seq) but they are improving. The read-alignment part and the downstream preprocessing are usually handled by different tools. Theoretically salmon should be able to align reads to the genome reference, however personally I have not tested things much on that side, I’d be happy to help if you want to give it a try. In my personal experience working on genome-wide coverage measuring technologies, MACS2-defined peaks work relatively well for ATAC-seq data, however, multiple recent papers have stuck to coverage in genome-wide 5kb binned windows. The fundamental unit (file) for ATAC-seq data processing is a fragment file, which is basically a bed file with CB information attached to the read-alignment; Cellranger dumps it post BWA-mem based read alignment. We have been putting a good effort into a processing tool called Signac (I am a co-author, https://satijalab.org/signac/), but there are other pipelines also, like https://github.com/GreenleafLab/ArchR. > > Regarding summarizing ATAC reads at gene-level, I think it basically depends on the underlying question, for example if we are interested in say clustering of the atac-seq data, then I guess summarizing the reads at promoter level and generating the GeneActivity for clustering makes sense. However, if the question is to link distal regulatory elements (like Cicero does https://pubmed.ncbi.nlm.nih.gov/30078726/) then it’s better to allow gene-distal peaks. Hope it helps, and I am happy to answer any other question you may have.
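
To make the "fragment file plus 5kb bins" idea concrete, a toy base R sketch. The column layout (chrom, start, end, barcode, count) is assumed to follow the usual fragment file order, all values are made up, and assigning each fragment to the window containing its start is a simplification; this is not how Signac or ArchR implement it.

```r
# Toy fragment table: chrom, start, end, cell barcode, duplicate count.
fragments <- data.frame(
  chrom   = c("chr1", "chr1", "chr1", "chr1", "chr2"),
  start   = c(100L, 4900L, 5200L, 12000L, 300L),
  end     = c(350L, 5100L, 5400L, 12300L, 520L),
  barcode = c("AAAC-1", "AAAC-1", "TTTG-1", "AAAC-1", "TTTG-1"),
  count   = c(1L, 2L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

bin_size <- 5000L
# Assign each fragment to the 5 kb window containing its start coordinate.
bin_start <- (fragments$start %/% bin_size) * bin_size
fragments$bin <- sprintf("%s:%d-%d", fragments$chrom, bin_start, bin_start + bin_size)

# Windows x cells table of fragment counts (one row per window, one column per barcode).
bin_counts <- xtabs(~ bin + barcode, data = fragments)
```

The resulting windows-by-cells table plays the same role as a peak-by-cell matrix and could be fed to the same downstream steps; real data would need a sparse representation rather than `xtabs`.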

Jared Andrews (11:52:35): > ArchR is one package worth a look post initial processing.

Jared Andrews (11:52:47): > Oh. Beat to it.

Davide Risso (11:59:47): > thanks @Avi Srivastava for the detailed answer! I think I have more than enough to get started, but I’m sure I will have more questions going forward, I’ll keep you posted if you don’t mind!

Avi Srivastava (12:00:23): > Absolutely, happy to help :thumbsup:

Aaron Lun (18:32:09): > I dread the day when I need to somehow integrate ArchR into our internal pipelines. Seems like we’re going to be spending thousands of core-hours converting stuff to/from their custom HDF5 format.

2021-06-04

Izaskun Mallona (04:21:28): > @Izaskun Mallona has joined the channel

Flavio Lombardo (05:52:35): > @Flavio Lombardo has joined the channel

Tim Triche (11:32:03) (in thread): > look on the bright side, rows are samples and columns are who knows what

2021-06-09

Jenny Drnevich (13:02:48): > Anyone analyzed any 10X Genomics CellPlex (https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cellplex) yet? Seems like the demultiplexing is not done with cellranger mkfastq but has instead been pushed to cellranger multi? Has anyone (Seurat, Salmon/Alevin, BioC packages?) developed any 3rd party software yet to handle this?

Aaron Lun (13:06:36): > DropletUtils::hashedDrops provides a simple way to demux, assuming you get the HTO count matrix.

Jenny Drnevich (13:12:19): > Can Alevin provide the HTO count matrix? But it seems like there isn’t a way to demux to different samples without first aligning to the genome, which my sequencing center is unable to provide (too many genomes to manage). I guess that means more work coming my way in the bioinformatics center…

Dan Bunis (13:37:24): > new kid on the block that might help: https://pypi.org/project/fba/ not sure if their paper is out yet but it at least claims to be (*I haven’t actually tested myself.) a flexible barcode aligner that can make the counts matrix and demultiplexing calls

Dan Bunis (13:40:53) (in thread): > https://doi.org/10.1093/bioinformatics/btab375

Alexander Toenges (13:43:13) (in thread): > this maybe? https://combine-lab.github.io/alevin-tutorial/2020/alevin-features/ - Attachment (combine-lab.github.io): Alevin w/ Feature Barcodes > Feature Barcoding based Single-Cell Quantification with alevin

Vince Carey (15:29:10) (in thread): > Nice looking set of examples: https://github.com/jlduan/fba#workflow-example

Jenny Drnevich (16:37:49) (in thread): > Promising - thanks @Alexander Toenges!

2021-06-14

Rob Patro (13:46:29): > You can also do this pretty easily with alevin-fry https://combine-lab.github.io/alevin-fry-tutorials/2021/af-feature-bc/ - Attachment (combine-lab.github.io): Processing feature barcoding data with alevin-fry > In this tutorial we will look at how to process a CITE-seq experiment (a type of feature barcoding experiment) using an alevin-fry based pipeline. Note : This tutorial is meant to mimic the original tutorial for feature barcode analysis with alevin written by Avi Srivastava and Yuhan Hao. Thus, most of the descriptive text and commands are taken directly from that tutorial. However, here we will be analyzing the data using the alevin-fry pipeline instead of alevin.

Rob Patro (13:47:02): > Happy to answer questions or add other use cases as it makes sense

2021-06-17

Peter Hickey (00:29:25): > can anyone point to a 10x study where they incorporated some sort of control sample across batches and/or comment on the expected utility of this for estimating/correcting batch effects?

Almut (03:40:36) (in thread): > Not really, but here (https://pubmed.ncbi.nlm.nih.gov/33758076/) we looked into different batch effects and found them partly to be cell type-specific, which makes across-batch controls difficult. For another project we tried anyway, using spike-in cells from a different organism’s cell line, but the batch differences in the spike-in cells differed from those in the real sample (probably also because we chose very different cells) and we ended up filtering those spike-in cells out.

Lucy (04:18:47): > I would be interested in this too - I had a number of people suggesting that I did this when I set up my previous 10x experiment

Peter Hickey (17:12:04) (in thread): > Thanks, Almut!

Peter Hickey (17:14:00) (in thread): > That sounds like my experience to date when we’ve included a control sample, but found that the control sample was distinct (and usually more homogeneous because it was a cell line in our cases) from the real samples. This meant the control sample could be used as a diagnostic (i.e. these batches are different) but was less useful as a remedy (i.e. to estimate/correct the batch effect) and we ultimately filtered those control cells out

2021-06-24

Tim Triche (15:41:41) (in thread): > anyone compare againsthttps://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02006-2? It sounds like our experience with scATAC, where we could use a control cell line to compare others, but a control cell line was not really sufficient to control for confounders in complex tissues. - Attachment (Genome Biology): Single-cell RNA-seq with spike-in cells enables accurate quantification of cell-specific drug effects in pancreatic islets > Background Single-cell RNA-seq (scRNA-seq) is emerging as a powerful tool to dissect cell-specific effects of drug treatment in complex tissues. This application requires high levels of precision, robustness, and quantitative accuracy—beyond those achievable with existing methods for mainly qualitative single-cell analysis. Here, we establish the use of standardized reference cells as spike-in controls for accurate and robust dissection of single-cell drug responses. Results We find that contamination by cell-free RNA can constitute up to 20% of reads in human primary tissue samples, and we show that the ensuing biases can be removed effectively using a novel bioinformatics algorithm. Applying our method to both human and mouse pancreatic islets treated ex vivo, we obtain an accurate and quantitative assessment of cell-specific drug effects on the transcriptome. We observe that FOXO inhibition induces dedifferentiation of both alpha and beta cells, while artemether treatment upregulates insulin and other beta cell marker genes in a subset of alpha cells. In beta cells, dedifferentiation and insulin repression upon artemether treatment occurs predominantly in mouse but not in human samples. Conclusions This new method for quantitative, error-correcting, scRNA-seq data normalization using spike-in reference cells helps clarify complex cell-specific effects of pharmacological perturbations with single-cell resolution and high quantitative accuracy.

2021-07-10

Kent Johnson (11:21:47): > @Kent Johnson has joined the channel

2021-07-23

Batool Almarzouq (15:54:04): > @Batool Almarzouq has joined the channel

2021-08-04

shristi shrestha (13:58:30): > @shristi shrestha has joined the channel

2021-08-05

Ambarish S. Ghatpande (07:11:01): > @Ambarish S. Ghatpande has joined the channel

Assa (08:48:59): > @Assa has joined the channel

Assa (08:50:30): - Attachment: Attachment > @Assa great question! yes this is a widely discussed topic in the field. Briefly, the choice of how many zeros we “expect” to see does seem to vary across technologies. This difference is important in how not only we preprocess the data (e.g. imputation), but also how we model the data (e.g. dimensionality reduction or differential expression). Here are a few papers you might find relevant related to just evaluating “how many zeros we expect to see in 10x data”. specifically, they show 10x data are not zero inflated > • https://www.nature.com/articles/s41587-019-0379-5 > • https://academic.oup.com/bioinformatics/article/33/21/3486/3952669 > • https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6 > In terms of just imputation, here is a paper on evaluating and benchmarking 18 scRNA-seq imputation methods where both 10x and smart-seq data are discussed: > • https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02132-x > Two other more recent papers on this topic that you might find relevant: > • https://www.biorxiv.org/content/10.1101/2020.12.28.424633v1 > • https://www.biorxiv.org/content/10.1101/477794v2 > Happy single-cell data analysis-ing!

Manojkumar Selvaraju (17:58:06): > @Manojkumar Selvaraju has joined the channel

2021-08-12

koki (21:35:28): > Thank you to everyone involved in Bioconductor 2021. @Aedin Culhane This talk was interesting for me. https://www.youtube.com/watch?v=dTNMmBpizGA I wonder why we can avoid batch effect with Correspondence Analysis as opposed to Principal Component Analysis? > Maybe the normalization in the row and column direction was effective? - Attachment (YouTube): Dimension Reduction for Beginners

2021-08-13

Jenny Drnevich (10:55:30): > Anyone know of any mouse or human ovary single cell data sets? None of the references in celldex or scRNAseq packages are useful for calling cell types in a mouse ovarian follicle single cell data set I have. Or any references for marker genes for granulosa cell subtypes?

Frederick Tan (11:24:32) (in thread): > Does this work? https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE136441

Kevin Blighe (11:26:21) (in thread): > This is one that I came across recently: https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-8381/ If my notes are correct: > * normal human female ovarian cortex from a gender reassignment individual, and C-section donors > * ~12000 cells > * Cell Ranger 2.1.1 or 3.0.1; aligned to hg19 using STAR > * authors seem to have identified cell-types, but no way to link these back to original cell IDs

Jenny Drnevich (11:36:58) (in thread): > Thanks, @Frederick Tan and @Kevin Blighe, I will check those out. I also came across this one: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE118127 that the PI said might be better because prenatal/neonatal does not have some of the cell types that develop in adulthood. - Attachment (ncbi.nlm.nih.gov): GEO Accession viewer > NCBI’s Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.

2021-08-17

Jared Andrews (16:43:56): > @Aaron Lun are you trying to completely reimplement every step of single cell analysis in Cpp? What’s the deal with all those new repos, hmm?

Aaron Lun (16:50:48): > aw you got me

Aaron Lun (16:52:32): > we should have an interesting prototype in the next few weeks

Jared Andrews (16:56:23): > I look forward to probably immediately breaking it.

Kasper D. Hansen (18:03:42): > Shouldn’t all the new tools be in Rust?

Rob Patro (18:06:13): > Yup!

Aaron Lun (18:22:03): > no thank you

2021-08-19

mariadermit (02:01:33): > @mariadermit has joined the channel

2021-09-07

Andrew Jaffe (14:52:03): > @Andrew Jaffe has joined the channel

2021-09-16

Henry Miller (18:35:10): > @Henry Miller has joined the channel

2021-09-24

ChiaSin (14:18:59): > @ChiaSin has joined the channel

2021-09-25

Mikey C (19:04:52): > @Mikey C has joined the channel

2021-10-07

Alan O’C (18:12:20): > I vaguely recall reading a comparison of VSTs for scRNAseq, including sqrt transform. So far I’ve only dug up Aaron’s and Valentine’s blogs, and ConstAE’s paper. Any ideas, or did I just dream about it?

2021-10-08

Wancen Mu (00:44:25): > @Wancen Mu has joined the channel

Federico Marini (05:22:21): > The recent work from @Constantin Ahlmann-Eltze maybe? –> https://www.biorxiv.org/content/10.1101/2021.06.24.449781v2 - Attachment (bioRxiv): Transformation and Preprocessing of Single-Cell RNA-Seq Data > The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-seq data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe three transformations (based on the delta method, model residuals, or inferred latent expression state) and compare their strengths and weaknesses. We find that although the residuals and latent expression state-based models have appealing theoretical properties, in benchmarks using simulated and real-world data the simple shifted logarithm in combination with principal component analysis performs surprisingly well. Software: An R package implementing the delta method and residuals-based variance-stabilizing transformations is available on https://github.com/const-ae/transformGamPoi. Contact: constantin.ahlmann{at}embl.de. Competing Interest Statement: The authors have declared no competing interest.

Constantin Ahlmann-Eltze (06:02:26): > @Constantin Ahlmann-Eltze has joined the channel

Constantin Ahlmann-Eltze (06:03:47): > Thanks @Federico Marini for tagging me :slightly_smiling_face: Yeah, that is pretty much exactly what we did there. Or is there some other specific aspect of VST for single-cell that you were interested in, @Alan O’C? :slightly_smiling_face:

Constantin Ahlmann-Eltze (06:04:57): > The sqrt transform was maybe a bit more prominent in the first version of the preprint.

Alan O’C (06:12:45): > Yeah that’s the paper I meant, it’s a very good one. Hmm I may be slightly confused as you mention the sqrt transform in v1 but I don’t see it represented in the results

Alan O’C (06:14:07): > I vaguely remember reading something else which discussed sqrt in more detail, though, which is really what I’m after

Constantin Ahlmann-Eltze (06:29:15): > Thanks :)) And thanks for the feedback :slightly_smiling_face: Hm, sorry I am also not sure which other paper that might be, but if I remember something I will get back to you :slightly_smiling_face:

Alan O’C (06:36:03): > Cheers! Just trying to round out some thoughts based on yours and the Lause paper and a few others. I may end up ignoring sqrt since I think nobody really uses it :smile:

2021-10-21

Emily Collins (14:28:35): > @Emily Collins has joined the channel

2021-10-29

Enrico Ferrero (13:22:24): > @Enrico Ferrero has joined the channel

2021-11-03

Koen Van den Berge (09:54:12): > Which tools would be recommended for demultiplexing sample-level FASTQ files based on cell barcode for, e.g., Drop-seq data?

Koen Van den Berge (09:59:44): > We know where the cellular barcodes are within the FASTQ-files, and are wondering about tools that can split up the reads efficiently.

Jared Andrews (10:05:03): > I think people use Kraken for this: https://www.ebi.ac.uk/research/enright/software/kraken

Jeroen Gilis (10:09:13): > @Jeroen Gilis has joined the channel

2021-11-08

Paula Nieto García (03:29:38): > @Paula Nieto García has joined the channel

2021-11-19

brian capaldo (12:34:28): > I’m trying to run velociraptor on a batch corrected single cell experiment, but keep running into this error: Error in py_module_import(module, convert = convert) : ModuleNotFoundError: No module named 'typing_extensions'

brian capaldo (12:35:16): > velo_out <- scvelo(sce_glm_pca, use.dimred = "corrected")

brian capaldo (12:35:51): > I don’t see any GitHub issues regarding this, not sure how to modify the conda env to manually add that library

Charlotte Soneson (12:40:11): > Which version of velociraptor are you using? I think we fixed this before the release (in v1.3.1).

brian capaldo (12:42:20): > 1.0.0, let me force an update

brian capaldo (12:44:17): > and my R version needs to be updated apparently, this is not good…

brian capaldo (13:50:23): > and fixed after updating everything

2021-12-01

Alexander Bender (11:21:14): > @Alexander Bender has joined the channel

2021-12-06

Nadine Bestard-Cuche (07:28:08): > Hello, > They’ve sent me this plot obtained with the data from the embryo atlas (Pijuan-Sala, Marioni, Göttgens). They were asking if there are known artifacts that could explain these lines. We had a look with @Alan O’C and the only potential culprit we could see looking at their methods was MNN, but we are still not very convinced and were wondering if someone here has a better explanation. - File (PNG): GetAttachmentThumbnail.png

Jonathan Griffiths (07:47:31) (in thread): > ah wow, that’s me

Jonathan Griffiths (07:48:40) (in thread): > I think it’s just library size effects. Each line is a ratio of counts (e.g., 3 Sox2 vs 2 Nanog), but the cells have a range of normalisation factors that is influenced by the expression of all other genes in that cell. Those factors slide your points along a line of fixed gradient for a given Nanog and Sox2 count level

Jonathan Griffiths (07:49:23) (in thread): > These effects are very visible for low counts, and should disappear as you get much higher

Jonathan Griffiths (07:50:08) (in thread): > Here you’re looking at TFs which are infamously painful to analyse due to low RNA levels

Jonathan Griffiths (07:51:53) (in thread): > You can see that one of the lines (the 4th down, with the most points in it) follows the identity line - so matching count values for each gene

Jonathan Griffiths (07:52:21) (in thread): > In any case I’m glad to see the dataset getting use!

Jonathan Griffiths (08:08:07) (in thread): > Also it’s certainly not MNN at play - I used what is now called fastMNN, which operates only on the PCs. The expression data is left as “standard” logcounts

Nadine Bestard-Cuche (12:50:05) (in thread): > Thank you very much for coming back to me! It is reassuring seeing there is an explanation for this behavior.

Nadine Bestard-Cuche (12:50:09) (in thread): > I am unsure I understand how “Those factors slide your points along a line of fixed gradient for a given Nanog and Sox2 count level”. I thought that all the counts in each cell were divided by the same size factor, so how can it affect the expression of one gene differently from another and thus affect the correlations?

2021-12-07

Jonathan Griffiths (03:08:10) (in thread): > What I mean is that there are many cells with the same ratio of Nanog/Sox2 counts. However, they differ in their normalisation factor. So what you are seeing along these lines is a set of cells that share a Nanog/Sox2 count ratio, but differ in their normalisation factor. As a result, that identical Nanog/Sox2 count ratio gets squished up or down the line depending on how it is scaled by that normalisation factor, which depends on the expression of all the other genes
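
The library-size explanation in this thread can be sketched in a few lines. This is only an illustration, not the atlas authors' code: it assumes the usual "logcounts" definition log2(count / size factor + 1), and the size-factor values and the helper names `logcount` and `trace` are made up for the example. Each fixed pair of raw counts traces its own band as the size factor changes, and equal counts fall exactly on the identity line.

```python
import math

def logcount(count, size_factor):
    """Log-normalised expression, assumed here to be log2(count / size_factor + 1)."""
    return math.log2(count / size_factor + 1)

def trace(count_x, count_y, size_factors):
    """Points produced by one fixed raw-count pair as the size factor varies."""
    return [(logcount(count_x, s), logcount(count_y, s)) for s in size_factors]

# Hypothetical size factors for cells that all share the same raw counts.
size_factors = [0.5, 0.75, 1.0, 1.5, 2.0, 3.0]

# Equal raw counts (e.g. 2 Sox2 and 2 Nanog) always land on the identity line.
identity_band = trace(2, 2, size_factors)

# An unequal pair (e.g. 3 Sox2 vs 2 Nanog) forms its own separate band.
band_3_2 = trace(3, 2, size_factors)

for s, (x, y) in zip(size_factors, band_3_2):
    print(f"size factor {s:>4}: Sox2 = {x:.3f}, Nanog = {y:.3f}")
```

Because only a handful of small integer count pairs occur for lowly expressed genes, the scatter plot shows a handful of such bands. With higher counts the number of distinct pairs grows and the bands blur into a cloud.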

2021-12-08

Nadine Bestard-Cuche (06:17:17) (in thread): > Ok! I get it now. Thank you:star-struck:

2022-01-03

Kurt Showmaker (17:05:25): > @Kurt Showmaker has joined the channel

2022-01-10

Aaron Lun (12:59:50): > https://twitter.com/jayaram/status/1480599647039016962 - Attachment (twitter): Attachment > Today Aaron and I are excited to announce Kana https://www.jkanche.com/kana, an app to perform #SingleCell RNA-seq analysis in the browser. Yes you read that right, the calculations are performed client-side, by your browser, on your laptop! #webassembly > > Want to analyze your data?:thread: https://pbs.twimg.com/media/FIwJKVqUYAU4eXi.jpg

Rob Patro (15:16:47): > very cool!

Stephanie Hicks (19:07:45): > Congratulations@Aaron Lun@Jayaram Kancherla!

2022-01-12

rohitsatyam102 (06:42:36) (in thread): > I liked playing ping pong, but the screen was also moving up and down along with the ping pong paddles.

Levi Waldron (09:41:29): > @Levi Waldron has joined the channel

Jayaram Kancherla (10:49:47) (in thread): > its a new type of ping pong:stuck_out_tongue:

Vince Carey (15:16:25) (in thread): > Can you provide pointers to a matrixmarket dataset and metadata example that will be a good illustration?

Jayaram Kancherla (15:22:49) (in thread): > @Vince Carey This is the pbmc 3K downloaded from 10x website - File (Gzip): pbmc3k_filtered_gene_bc_matrices (1).tar.gz

2022-01-15

Alexander Bender (04:48:52): > Does alevinQC have a built-in way of parsing QC reports of individual samples into an overall summary report for multiple samples, so basically the content of the summary tables?

Charlotte Soneson (04:52:12) (in thread): > Not really. However, you can get the summary tables with the readAlevinQC() function and combine them manually (https://csoneson.github.io/alevinQC/articles/alevinqc.html#generate-individual-plots). - Attachment (csoneson.github.io): alevinQC > alevinQC

Alexander Bender (05:08:28) (in thread): > That’s all I need. Thanks!

2022-01-19

Stephany Orjuela (10:11:21): > @Stephany Orjuela has left the channel

2022-02-01

Wes W (18:25:39): > Hey everyone. Today I wanted to check the “STEMness” of the cells in each condition of a sc experiment I have been working on. I have seen this done a couple of ways at conferences but hadn’t done it myself. > > I know the Bioconductor package TCGAbiolinks had support for this, but it seems to have been removed or the permalinks changed, so I couldn’t find it or its vignette, and before I could do more digging I found SCENT and LandSCENT from Andrew Teschendorff’s lab… the vignette is outdated and calls a lot of deprecated dependency functions… so I spent the day updating it and getting it to work and added some custom code to handle WAY more cells than it was meant to (I needed 240K cells and it craps out hard on the math by default) … THEN I thought, hey maybe this would be useful to other people and I should make it into a Bioconductor package… BUT then I was like, surely I’m not the first person to need this or do this… so before I go down the rabbit hole of putting this new code, old code, and updated code into a package, is there already someone who has updated it OR has no one updated it because there is a way better way to do this already and I just don’t know it…. > > Thanks for advice!

Alan O’C (18:36:00) (in thread): > ?https://www.bioconductor.org/packages/release/bioc/manuals/TCGAbiolinks/man/TCGAbiolinks.pdf#page=46

Alan O’C (18:36:16) (in thread): > TCGAbiolinks seems to still have it in release

Wes W (18:39:32) (in thread): > thats great! a lot of google links to it seem to point to dead places

Wes W (18:39:47) (in thread): > also the Bioconductor official search points to dead links too =(

Alan O’C (18:45:42) (in thread): > Where’s the dead links on bioconductor? I’m sure folks would like to fix those if possible

Alan O’C (18:47:16) (in thread): > Also sorry I have no idea if anybody’s gone through this before etc. Sounds like there may be some lost code that needs updating though so it may well be worth putting the updated code out there if nobody else has :slightly_smiling_face: and I imagine the TCGA code will struggle with 200k+ observations as well

Wes W (18:47:44) (in thread): > http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/tcgaBiolinks.html

Alan O’C (18:48:12) (in thread): > Where did you find that link, though?

Wes W (18:49:08) (in thread): > that one… from the TCGA page , but when i did search bioconductor this morning i found more that were similar… i will have to recreate my search again to find them….

Wes W (18:50:41) (in thread): > I will say, what I like about my implementation of SCENT is that it just takes an SCE in … while TCGA requires you to put your SCE first through their preprocessing steps… although I have never actually used the TCGA package so for all I know if you have already done your pre-processing, maybe you can jump into the middle of their pipeline. but their github doesn’t imply that…

2022-02-15

Gene Cutler (12:01:38): > @Gene Cutler has joined the channel

2022-02-17

Alan O’C (09:46:02): > Has anybody else encountered these “Warning in read.table(file = file, header = header, sep = sep, quote = quote, : line 2 appears to contain embedded nulls” errors? I think it’s related to the scRNAseq package but I’ve not managed to reproduce just yet: http://bioconductor.org/checkResults/devel/bioc-LATEST/scater/nebbiolo1-buildsrc.html

rohitsatyam102 (11:07:53): > Hi!! Has anybody used MELD (https://github.com/KrishnaswamyLab/MELD) with Seurat or scanpy to resolve clusters into sub-clusters?

ImranF (18:25:31): > Hi Rohit, its been awhile, but yes, why?

2022-02-18

rohitsatyam102 (03:12:53) (in thread): > Thanks for being a ray of hope!! I will DM you my query so as not to spam here with load of messages.

rohitsatyam102 (06:27:51) (in thread): > I am posting my github issue here for wider reach: https://github.com/KrishnaswamyLab/MELD/issues/56

2022-02-21

Enrico Ferrero (12:46:34): > Hello everyone, I recently came across this interesting paper on covarying neighbourhood analysis (CNA), which establishes a framework to test for associations between groups of cells (e.g., cell clusters or “neighbourhoods”) and phenotypes (e.g., clinical variables, experimental conditions): https://www.nature.com/articles/s41587-021-01066-4 The Python cna package is built on top of scanpy: https://github.com/immunogenomics/cna Are there options to perform a similar kind of analysis within the SingleCellExperiment framework? - Attachment (Nature): Co-varying neighborhood analysis identifies cell populations associated with phenotypes of interest from single-cell transcriptomics > Nature Biotechnology - Inter-sample variability reveals disease-associated cell subpopulations in single-cell RNA sequencing.

2022-02-22

Alexander Bender (10:40:40): > @Charlotte Soneson Short question on the conquer paper. Is there a particular reason why the cellular detection rate was included for some (e.g. edgeR but not limma) testing frameworks but not all? As I understand it the idea is that DetRate often drives the first principal component but is purely technical, so correcting for it would benefit any testing framework. Could you comment on that if you have a minute?

Charlotte Soneson (11:12:55) (in thread): > Yeah, good question - digging in my memory I think it was mostly a question of not creating an unwieldy set of methods (trying all combinations), and at the same time we wanted to see whether including the detection rate was helpful also outside of MAST (where it was recommended in their paper). For other methods we tried other variations from the default (e.g. robust dispersion and different normalizations for edgeR/LRT, different priors for DESeq2).
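
For readers unfamiliar with the term, the cellular detection rate discussed above is commonly defined as the fraction of genes with at least one count in a given cell. A minimal sketch under that assumption follows; the toy counts and the function name `detection_rate` are hypothetical and not taken from the conquer code.

```python
def detection_rate(cell_counts):
    """Fraction of genes with at least one count in a single cell."""
    detected = sum(1 for count in cell_counts if count > 0)
    return detected / len(cell_counts)

# Hypothetical toy data: per-gene raw counts for two cells (8 genes each).
cells = {
    "cell_a": [0, 3, 0, 1, 0, 0, 7, 0],
    "cell_b": [2, 5, 1, 0, 4, 0, 9, 1],
}

det_rate = {name: detection_rate(counts) for name, counts in cells.items()}
print(det_rate)
```

The resulting per-cell value would then be supplied as a covariate in the design of whichever testing framework is used, alongside the condition of interest.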

2022-02-26

Enrico Ferrero (12:52:07) (in thread): > Just to add a visual example of why such an analysis can yield interesting results: in this case the authors are looking for cells which correlate the most with age (but it could be any sample-level variable): - File (PNG): image.png

2022-02-28

Nadine Bestard-Cuche (14:59:35) (in thread): > I see that one of the problems that this method is trying to overcome is that “Current statistical approaches typically map cells to clusters and then assess differences in cluster abundance”. MILO also tries to find an alternative for this same step, using neighborhoods of cells (not exactly the same, but might be useful) - Attachment (Nature): Differential abundance testing on single-cell data using k-nearest neighbor graphs > Nature Biotechnology - Milo identifies differentially abundant populations of cells in scRNA-seq data without clustering.

2022-03-04

rohitsatyam102 (14:04:51): > Hi Everyone. I have a tiny query regarding scRNASeq analysis that I posted on Biostars here. Can anyone please advise me if I am thinking in the right direction?

2022-03-21

Pedro Sanchez (05:02:47): > @Pedro Sanchez has joined the channel

2022-03-23

Doortje Theunissen (04:40:42): > @Doortje Theunissen has joined the channel

2022-03-29

rohitsatyam102 (03:03:06): > Hi everyone. I have started a discussion thread on Biostars related to scRNAseq analysis regimens. Please share your insights if time allows: https://www.biostars.org/p/9516696/

2022-04-09

rohitsatyam102 (21:11:48): > Hi everyone. If we have control and treatment scRNAseq samples, and we wish to transfer cell type labels from an atlas to them should I use Seurat’s TransferLabels separately on them and then integrate the control and treatment samples?

Alan O’C (21:31:01): > Seurat is not a bioconductor package

2022-04-10

rohitsatyam102 (05:36:16) (in thread): > Can we achieve the same using SingleR? I mean, do I have to run SingleR separately on the control and treatment samples?

2022-04-11

Tim Triche (12:44:46) (in thread): > depends on how the controls and treatment samples were run. Multiplexed with HTOs in the same 10X run? It would be astounding if you couldn’t 1:1 transfer labels. Different technology 2 years apart? Probably not so much

2022-04-12

rohitsatyam102 (15:24:08) (in thread): > Yes, they were multiplexed. I am able to run SingleR but there is an issue that I have now posted here: https://support.bioconductor.org/p/9143363/

2022-04-14

Davide Risso (06:09:07) (in thread): > Running SingleR one sample at a time has worked well for us in a regular (i.e. not multiplexed) 10x dataset

rohitsatyam102 (07:08:46) (in thread): > Sorry!! I confirmed with the person who prepared the library. The samples were not multiplexed. Pardon the confusion

rohitsatyam102 (07:09:56) (in thread): > I am not facing issues in label transfer as such. The only problem is that all of my labels are being assigned a score of 1. And though the final label assigned is correct, I wish to understand whether I should be worried about all the predicted scores per sample turning out to be 1.

2022-05-03

Ray Su (06:55:49): > @Ray Su has joined the channel

2022-05-11

Jenny Drnevich (10:07:13): > I need to get some data from a bioRxiv preprint and the data availability statement just says “The scRNAseq count matrix dataset was deposited in OSF”. Anyone know what this is? https://www.biorxiv.org/content/10.1101/2022.02.08.479522v1.full - Attachment (bioRxiv): A single cell atlas of the cycling murine ovary > The estrous cycle is regulated by rhythmic endocrine interactions of the nervous and reproductive systems, which coordinate the hormonal and ovulatory functions of the ovary. Folliculogenesis and follicle progression require the orchestrated response of a variety of cell types to allow the maturation of the follicle and its sequela, ovulation, corpus luteum (CL) formation, and ovulatory wound repair. Little is known about the cell state dynamics of the ovary during the estrous cycle, and the paracrine factors that help coordinate this process. Herein we used single-cell RNA sequencing to evaluate the transcriptome of > 34,000 cells of the adult mouse ovary and describe the transcriptional changes that occur across the normal estrous cycle and other reproductive states to build a comprehensive dynamic atlas of murine ovarian cell types and states. Competing Interest Statement: The authors have declared no competing interest.

Hans-Rudolf Hotz (10:19:07) (in thread): > OSF is probably: https://osf.io/ After a quick search, I found this: https://osf.io/9cvym/

Jenny Drnevich (10:20:28) (in thread): > Thank you,@Hans-Rudolf Hotz!

Jenny Drnevich (10:22:09) (in thread): > It seems the data I want is not public yet. I was worried about that for a pre-print

Tim Triche (11:47:32) (in thread): > people are getting wise to the fact that, if you name a GSE or SRP in your preprint, GEO will release it upon request

Tim Triche (11:48:28) (in thread): > I tried asking a relevant senior investigator about a similar dataset from 2 years ago and you can predict what happened next:cricket:

Jenny Drnevich (15:08:21) (in thread): > Interesting - I hadn’t thought about GEO considering bioRxiv “published” in terms of keeping the submission private.

Tim Triche (15:26:28) (in thread): > It is the case that, if a publicly available manuscript (e.g. a preprint) references a GSE or SRP, then a request to GEO or SRA to release the data will be honored.

Tim Triche (15:27:40) (in thread): > The BSD labs have got wise to this and on the rare occasions that entire methods sections are not dropped, many will at least prevent others from reanalyzing the NIH-funded data. Whether this practice is good for science and the public that funds it is left as an exercise for the reader.

2022-06-08

Aedin Culhane (23:06:26) (in thread): > Sounds like it might be time for an audit of access to scRNASeq. With the HCA and cellxgene, SRA, GEO, OSF, etc. it seems like data are being deposited in a wide array of locations and the annotation/metadata is often inconsistent. And access to data can be a problem. Also, data in HCA says it’s available after the main paper is published. But what is the main paper… @Kasper D. Hansen @Martin Morgan

Aedin Culhane (23:08:42) (in thread): > In days of old microarray times, such papers (“only 30% of papers provide data”, etc.) were impactful in developing publication requirements and enforcing open data access. I feel with scRNAseq and with spatial, things are more disparate. Such an audit-type study is warranted

2022-06-28

GuandongShang (00:48:22): > @GuandongShang has joined the channel

GuandongShang (00:56:58): > Hi, everyone. I want to select a package that can do cell type deconvolution: I have 10x scRNA-seq data and many bulk RNA-seq samples from the same tissue (mutant or wt), and I want to know the cell-type fractions in these bulk RNA-seq samples. I have seen AutoGeneS, but it is a Python package. I am wondering whether someone can recommend an R package that can handle a SingleCellExperiment or Seurat object.

GuandongShang (01:09:03): > I also found ENIGMA: https://github.com/WWXkenmo/ENIGMA

Pedro Sanchez (03:36:28): > Recently a really nice paper by John Dick’s lab was published about AML deconvolution. Regarding your question they used CIBERSORTx although it was developed as a “web framework with its back-end based on R and PHP”

GuandongShang (08:39:47) (in thread): > Thanks :)

2022-06-30

rohitsatyam102 (09:21:45) (in thread): > Try MuSiC2: https://github.com/Jiaxin-Fan/MuSiC2

2022-07-01

GuandongShang (02:51:34) (in thread): > Thanks :)

2022-07-04

Andrew J. Rech (19:45:25): > @Andrew J. Rech has joined the channel

2022-07-07

Clara Pereira (14:27:34): > @Clara Pereira has joined the channel

2022-07-11

Andrew McDavid (00:15:23): > @Andrew McDavid has joined the channel

Andrew McDavid (00:27:17): > has anyone had luck with the kallisto kite workflow for cite seq quantitation on the chromium v3 chemistry? I’m getting some odd behavior with the cell barcodes that are recovered there (ie essentially no overlap with the ones reported from cellranger), although the protein expression and number of recovered cells is plausible. I think there’s something arcane with the chemistry in the 10x feature barcoding kits (beyond that the middle two bases in the barcode are reverse complemented in feature barcoded libraries compared to the GEX libraries)…

Andrew McDavid (00:30:59): > i realize that this isn’t bioconductor related, but i figured there are so many brilliant bioinformaticians here that someone must have fought this dragon before:sunglasses:

Stephanie Hicks (04:06:04) (in thread): > I haven’t tried the kallisto workflow. However, broadly my go-to move in these situations is to try out a third quant tool (e.g. https://combine-lab.github.io/alevin-tutorial/2020/alevin-features/) to figure out if there are at least two that (mostly) agree with each other and isolate the third to focus my energy on which one I might need to debug. Good luck!

Andrew McDavid (12:07:11) (in thread): > oh great suggestion to look at alevin’s implementation of this.

2022-07-13

Alan Aw (14:58:59): > @Alan Aw has joined the channel

Alan Aw (15:01:41): > Hi, I have a question about normalization of raw counts data. These are the methods I am aware of: > * log-normalization (link) > * sctransform (link) – with the v2 option and without > I am not particularly interested in what each method is doing (I can read and understand them on my own). I am however wondering what an agreeable (or a general consensus) normalization approach is if > * my data at hand contains many cell types but for a single tissue > * my goal is to remove technical noise from the counts, in order to perform DE analysis > Finally, you may assume that I will run Wilcoxon tests on the normalized data. Appreciate any advice on this :slightly_smiling_face: - Attachment (satijalab.org): Normalize Data — NormalizeData > Normalize the count data present in a given assay. - Attachment (satijalab.org): Introduction to SCTransform, v2 regularization > Seurat

Steve Lianoglou (15:53:04) (in thread): > In broad strokes: > 1. I’d see if it’s necessary to use something like SCTransform (or MNN, or whatever) to integrate data from different samples to control for some technical artifact (it may not be). > 2. I’d do the downstream cell identification and clustering steps (left as an exercise to the reader, but you’ll find how to do it in many tutorials, including the ones you linked to). > 3. I’d then perform DE analysis by pseudo-bulking the data at the appropriate level of resolution (sample:celltype or sample:cluster) as explained here (as opposed to using the Wilcoxon approach) > That all assumes that you have biological replicates, though. - Attachment (bioconductor.org): Chapter 4 DE analyses between conditions | Multi-Sample Single-Cell Analyses with Bioconductor > Chapter 4 DE analyses between conditions | Multi-Sample Single-Cell Analyses with Bioconductor
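
The pseudo-bulking step in point 3 amounts to summing raw counts over all cells that share a sample and a cell type (or cluster). A minimal, library-free sketch of that aggregation is below. The cell records, sample names, gene names, and the function name `pseudobulk` are all made up for illustration; in practice the linked Bioconductor chapter shows the real workflow.

```python
from collections import defaultdict

def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (sample, cell type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["celltype"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}

# Hypothetical cells: each carries its sample label, cell type label, and raw counts.
cells = [
    {"sample": "ctrl_1", "celltype": "T", "counts": {"GeneA": 2, "GeneB": 0}},
    {"sample": "ctrl_1", "celltype": "T", "counts": {"GeneA": 1, "GeneB": 4}},
    {"sample": "ctrl_1", "celltype": "B", "counts": {"GeneA": 0, "GeneB": 3}},
    {"sample": "trt_1", "celltype": "T", "counts": {"GeneA": 5, "GeneB": 1}},
]

bulk = pseudobulk(cells)
for key, genes in sorted(bulk.items()):
    print(key, genes)
```

Each (sample, cell type) profile then behaves like one bulk RNA-seq library, so the DE test compares samples rather than individual cells, which is why biological replicates are needed.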

Steve Lianoglou (15:57:38) (in thread): > You may also want to have a look at the new countsplit package for DE testing (you’ll find a link to their preprint on that github page as well). Even if you don’t end up using it, it does point out some things to be aware of when doing standard DE approaches on scRNAseq data.

Alan Aw (19:42:59) (in thread): > Thanks for the suggestions, Steve.

2022-07-15

Ashley Robbins (15:18:27): > @Ashley Robbins has joined the channel

2022-07-17

ImranF (15:59:25) (in thread): > There are several ideas that are a tad conflated above. > * Instead of “how many”, I’d ask “how heterogeneous are the cell types” > * I’d start with lognorm and sctransform (be aware of the recent paper that critiqued it however, I have yet to vet it) > * Normalisation is not the same as Integration. You haven’t mentioned how many samples you have. Prior to Integration, I would think about whether you want to normalize all samples together or individually (the latter would erase baseline differences in gex). > * Either way, I would create (multiple) lists of HVGs (one for each normalisation technique) and just study them (how much overlap, which HVGs are missing, is there an overrepresentation of certain gene sets, etc.)
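
The HVG comparison suggested in the last bullet can start with plain set arithmetic. The sketch below assumes each normalisation method has already produced a list of HVG names. The gene lists, the method labels, and the function name `jaccard` are placeholders rather than real results.

```python
def jaccard(a, b):
    """Jaccard index of two gene lists: shared genes divided by all genes seen."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Hypothetical HVG lists, one per normalisation technique.
hvgs = {
    "lognorm": ["CD3E", "MS4A1", "LYZ", "NKG7"],
    "sctransform": ["CD3E", "LYZ", "NKG7", "PPBP", "GNLY"],
}

overlap = jaccard(hvgs["lognorm"], hvgs["sctransform"])
only_lognorm = set(hvgs["lognorm"]) - set(hvgs["sctransform"])
only_sctransform = set(hvgs["sctransform"]) - set(hvgs["lognorm"])

print(f"Jaccard overlap: {overlap:.2f}")
print("Only in lognorm:", sorted(only_lognorm))
print("Only in sctransform:", sorted(only_sctransform))
```

The method-specific genes are the ones worth inspecting by hand or passing to a gene set over-representation test.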

2022-07-18

Alan Aw (20:23:18) (in thread): > Thank you Imran. I had the same thought regarding comparing lists of HVGs generated via different methods.

2022-07-19

Tim Triche (11:15:31): > https://www.biorxiv.org/content/10.1101/2021.06.24.449781v3 re: transformations - Attachment (bioRxiv): Comparison of Transformations for Single-Cell RNA-Seq Data > The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-seq data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state, and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties. However, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal component analysis, performs as well or better than the more sophisticated alternatives. Competing Interest Statement: The authors have declared no competing interest.

Pedro Sanchez (11:16:58) (in thread): > I want to tell my PI I won’t come the next month to the lab because I want to read all the sc normalization literature:sweat_smile:

Tim Triche (11:20:44) (in thread): > it will perhaps come as a surprise to no one that using residuals as if they were data is… not generally a great idea

Tim Triche (11:21:37) (in thread): > if a person is running Wilcoxon tests for DE, why bother normalizing at all?

ImranF (11:24:41) (in thread): > You’ll need normalized data to identify cell subpopulations, which I presume will be used for downstream DE

Tim Triche (11:25:42) (in thread): > this assumes that cell subpops are discrete and that a linear model is the most useful one (which, tbh, it usually is; but if that’s the case why use Wilcoxon for testing)

Tim Triche (11:26:49) (in thread): > log1p is almost always the right answer, fwiw (cf. https://www.biorxiv.org/content/10.1101/2021.06.24.449781v3)

Tim Triche (11:27:05) (in thread): > asinh works fine too

Tim Triche (11:27:17) (in thread): > using residuals as data is almost always a bad idea
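Editor's sketch of the two transformations named above, applied to a toy genes × cells count matrix. The library-size scaling (size factors with mean 1) is an assumption added for illustration; the thread only names log1p and asinh, and the counts are made up.

```python
import numpy as np


def size_factors(counts):
    """Library-size factors scaled to mean 1 (counts is genes x cells)."""
    lib = counts.sum(axis=0).astype(float)
    return lib / lib.mean()


def log1p_transform(counts):
    """log(1 + count / size_factor), computed per cell."""
    return np.log1p(counts / size_factors(counts))


def asinh_transform(counts):
    """asinh(count / size_factor); close to log(2x) for large x, linear near 0."""
    return np.arcsinh(counts / size_factors(counts))


# Toy counts: 3 genes x 3 cells, with very different library sizes.
counts = np.array([[0, 2, 10],
                   [5, 0, 40],
                   [1, 4, 0]])

logged = log1p_transform(counts)
arcsinhed = asinh_transform(counts)
print(np.round(logged, 3))
print(np.round(arcsinhed, 3))
```

Both functions map zero counts to exactly zero, so sparsity is preserved, and neither produces model residuals.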

ImranF (11:31:03) (in thread): > Pardon me, you’re touching on several fairly disparate points here. First of all, what exactly are non-discrete cell-subpopulations? I’m aware of the biological nuance , but I’m talking computationally

Tim Triche (11:32:30) (in thread): > That’s kind of the problem, isn’t it? Dendritic cells come to mind as an example of cells where “type” is difficult to disambiguate from (e.g.) “activation”

Tim Triche (11:32:58) (in thread): > with regards to normalization, the presumed goal is to ensure that uninformative variation is Gaussian. that’s it.

Tim Triche (11:33:23) (in thread): > if this goal is accomplished, one might ask why a person would squander statistical power using tests on ranks

Tim Triche (11:33:56) (in thread): > especially given that a two-group test is equivalent to a test for a unit difference

Tim Triche (11:41:25) (in thread): > in any event, it’s always possible to turn a linear model into a nonparametric model, usually at substantial cost to statistical power and computational efficiency. Depending on the spec, this may or may not be desirable, but if you want to decouple biology from the discussion, then you inevitably have to work from a spec instead. And consensus is a really crappy spec

Tim Triche (14:59:02): > nb. strongly recommend reading this recent paper on count splitting for more on the duality of discrete vs. continuous states and testing for both without double dipping. it’s also implemented already in R and Python: https://arxiv.org/pdf/2207.00554.pdf

Anna Reisetter (17:21:31): > @Anna Reisetter has joined the channel

2022-07-20

Alan Aw (02:42:41) (in thread): > Thanks for the thoughtful comments, Tim. From what you’ve written, it seems like the ‘right’ normalization ultimately depends on what we know or assume about the assay and the experimental conditions/protocol? As long as we cannot completely verify the mechanisms resulting in the observed data (variance of counts is quadratic in mean, too many zeros, splicing differences, fragment length differences, etc.), we can only rely on (1) principled normalizations, such as log(x+c) and other approaches, or (2) computing and seeing what the results look like before coming up with an explanation.

Alan Aw (02:44:31) (in thread): > Also, it seems like if one obtains a gene from running DE analyses using many different choices of normalization, that gene ought to be somewhat meaningful?

Tim Triche (07:45:35) (in thread): > The latter is reasonable (strong signal will overcome change of scale) but is not optimal for statistical power. I’d certainly trust something that is obviously different on every scale (and we have done exactly this for EZH2 between fusion groups in AML, there is no overlap on any scale); the only tradeoff is power.

Tim Triche (07:46:28) (in thread): > Strongly recommend reading the count splitting paper though. It’s wonderful and it lays out some serious confounding effects from double dipping (clustering then testing) due to the spec for clustering methods being “find clusters”

Tim Triche (07:47:22) (in thread): > https://arxiv.org/abs/2207.00554 - Attachment (arXiv.org): Inference after latent variable estimation for single-cell RNA… > In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, representing some…

Tim Triche (07:47:40) (in thread): > It is quite tricky to control the error rate otherwise!

Tim Triche (08:03:01): > Count splitting site: https://anna-neufeld.github.io/countsplit/ - Attachment (anna-neufeld.github.io): Splitting a Poisson count matrix into independent training and testing matrices. > Implements the count splitting methodology from the paper “Inference after latent variable estimation for single cell RNA sequencing data” (Neufeld et al., 2022).
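Editor's sketch of the idea behind count splitting, assuming Poisson counts as in the attachment title: each count is thinned binomially into a training part and a testing part, so one can cluster on the first and test on the second. This is a numpy illustration of the thinning step only, not the countsplit package's API; the function name, `eps` and `seed` are hypothetical.

```python
import numpy as np


def count_split(counts, eps=0.5, seed=0):
    """Binomial thinning: train ~ Binomial(counts, eps), test = counts - train.

    If the counts are Poisson, train and test are independent Poisson
    matrices with means eps * mu and (1 - eps) * mu.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    train = rng.binomial(counts, eps)  # elementwise thinning
    test = counts - train
    return train, test


# Toy counts, for illustration only.
counts = np.array([[0, 3, 10],
                   [7, 0, 1]])
train, test = count_split(counts, eps=0.5, seed=1)
print(train)
print(test)
```

The latent variable (clusters, pseudotime) would be estimated from `train` and the differential expression test run on `test`, which is what avoids the double dipping described in the thread.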

2022-07-28

Mervin Fansler (17:21:13): > @Mervin Fansler has joined the channel

Erick Cuevas (18:43:36): > @Erick Cuevas has joined the channel

Erick Cuevas (18:50:18): > Hello, is there any known software or tool to do meta analysis of multiple scRNA data?

Tim Triche (18:54:46): > dreamlet was demonstrated yesterday along those lines, but it’s an interesting question in terms of what joint or meta analysis should mean in this context

2022-08-02

Alvaro Sanchez (05:09:49): > @Alvaro Sanchez has joined the channel

2022-08-04

Peter Hickey (19:25:05): > Cross-posting a (deep-in-the-weeds) problem I’m facing with calling non-empty droplets using DropletUtils::emptyDrops() in a large 10x 5’ dataset: https://support.bioconductor.org/p/9145822/

2022-08-15

Michael Kaufman (13:15:52): > @Michael Kaufman has joined the channel

2022-09-02

Jenny Drnevich (11:50:13): > What you get when Cellranger does a tSNE “reduction” on only 2 antibodies: :rolling_on_the_floor_laughing: - File (PNG): Antibody_tSNE.png

2022-09-13

Alex Schrader (17:48:22): > @Alex Schrader has joined the channel

2022-09-19

Ryan Williams (16:51:32): > @Ryan Williams has joined the channel

2022-09-27

Jennifer Holmes (16:14:59): > @Jennifer Holmes has joined the channel

vin (22:14:37): > @vin has joined the channel

2022-10-06

Devika Agarwal (05:38:02): > @Devika Agarwal has joined the channel

2022-10-20

Connie Li Wai Suen (01:25:40): > @Connie Li Wai Suen has joined the channel

2022-10-28

Vandenbulcke Stijn (12:29:41): > @Vandenbulcke Stijn has joined the channel

2022-10-31

Chenyue Lu (10:05:40): > @Chenyue Lu has joined the channel

2022-11-06

Sherine Khalafalla Saber (11:21:31): > @Sherine Khalafalla Saber has joined the channel

2022-11-29

Jenny Drnevich (09:05:24): > Hi all. I’ve analyzed lots of single cell data sets, and it always seems like for some cells the results of the SNN clustering and UMAP algorithms disagree on which cells they are most similar to (see attached example for clusters 0, 1 and especially 2). I know that even though both algorithms use the same input data (usually 30 PCs), they weight things differently. But what to do about the cells that do not co-locate with the rest of their cluster? Just ignore them and leave them in? Filter them out because they might be GEMs with two cell types? I don’t usually do much in terms of doublet detection other than filtering out cells with high UMI counts/genes detected. In this particular data set, cluster 2 does have much lower number of UMI counts/genes detected and higher percentage of MT counts, and it does seem to be a frequent occurrence that the cluster most dispersed in UMAP space has sub-optimal QC values. Maybe these are “not happy” cells that get clustered regardless of cell type? Should I use this as justification to further remove all cells in cluster 2 (it seems to be the same cell type as cluster 0) or go back and re-adjust my filtering thresholds? Just wondering what others do… - File (JPEG): UMAP_seurat_res0.07_splitCluster.jpeg - File (JPEG): Cluster_QCchecks_2022-11-28.jpeg

Tim Triche (10:53:41): > are these cardiomyocytes or CMSCs?

Jenny Drnevich (11:17:16): > Mouse ovarian cells

Gene Cutler (11:26:10) (in thread): > It does seem that cluster 2 are potentially dying cells. I would probably readjust filtering and recluster.

ImranF (12:53:05) (in thread): > Could be that cluster 2 are either (a) doublets with a mix of cluster 0 and cluster 1 cells; (b) dying versions of cluster 0 and cluster 1 cells. > Alternatively, I would also do some ambient RNA profiling

2022-12-07

Tim Triche (12:59:48) (in thread): > some of which are also loaded up with mitos? MiQC helped for our CMSCs

2022-12-12

Umran (17:58:20): > @Umran has joined the channel

Lexi Bounds (17:59:50): > @Lexi Bounds has joined the channel

2022-12-13

Lea Seep (08:58:37): > @Lea Seep has joined the channel

Ana Cristina Guerra de Souza (09:01:29): > @Ana Cristina Guerra de Souza has joined the channel

2022-12-20

Elana Fertig (10:10:26): > For color blind friendly single-cell visualization

Elana Fertig (10:10:27): > https://elifesciences.org/articles/82128

Jared Andrews (10:34:15): > The patterning is an interesting idea, how does it look with mixed populations though?

Jennifer Foltz (10:41:37): > @Jennifer Foltz has joined the channel

Jennifer Foltz (11:10:21): > Hi all! I recently started digging into NMF for scRNAseq analysis! :tada: I’ve been playing around with CoGAPS and really enjoying its ease of use. I’ve come across a couple of questions about it; would anyone here be willing to share their knowledge on CoGAPS and answer a few questions? I would be so grateful!

Tim Triche (11:26:42) (in thread): > you want to talk to @Elana Fertig since her lab implemented and maintains it

Elana Fertig (11:32:25) (in thread): > yup

Elana Fertig (11:34:29) (in thread): > test it out!

Jennifer Foltz (11:36:26) (in thread): > Thank you! I have two questions:1)I recently have run into large changes in the atoms (A) and (P) matrix values when running across multiple nSets, which seems less than ideal since they should be stabilizing. Should I run for more iterations or is there some other reason for these large changes in the matrix values? (Include the full parameters to help) > > -- Standard Parameters -- > nPatterns 25 > nIterations 1e+05 > seed 891 > sparseOptimization TRUE > distributed single-cell > > -- Sparsity Parameters -- > alpha 0.01 > maxGibbsMass 100 > > -- Distributed CoGAPS Parameters -- > nSets 6 > cut 25 > minNS 3 > maxNS 9 > [1] 33650 21914 > [1] 14160 21914 > > This is CoGAPS version 3.14.0 > Running single-cell CoGAPS on /tmp/90251.tmpdir/Rtmp31EVRF/file62dc263e1.mtx (14160 genes and 21914 samples) with parameters: > > -- Standard Parameters -- > nPatterns 25 > nIterations 1e+05 > seed 891 > sparseOptimization TRUE > distributed single-cell > > -- Sparsity Parameters -- > alpha 0.01 > maxGibbsMass 100 > > -- Distributed CoGAPS Parameters -- > nSets 6 > cut 25 > minNS 3 > maxNS 9 > > 14160 gene names provided > first gene name: FAM87B > > 21914 sample names provided > first sample name: 56546_tube1_AAACCTGCACGGCGTT-1 > > Creating subsets... > set sizes (min, mean, max): (3652, 3652.333, 3654) > Running Across Subsets... > > worker 2 is starting! > worker 4 is starting! > Data Model: Sparse, Normal > Sampler Type: Sequential > Loading Data...Done! (00:01:26) > worker 1 is starting! > worker 6 is starting! > worker 3 is starting! > worker 5 is starting! 
> -- Equilibration Phase -- > 5000 of 100000, Atoms: 100971(A), 52232(P), ChiSq: 468311360, Time: 03:28:08 / 206:51:55 > 10000 of 100000, Atoms: 132274(A), 61924(P), ChiSq: 466190848, Time: 09:57:29 / 271:50:33 > 15000 of 100000, Atoms: 147622(A), 64281(P), ChiSq: 465477408, Time: 17:28:09 / 302:57:29 > 20000 of 100000, Atoms: 157316(A), 65815(P), ChiSq: 465079392, Time: 25:23:51 / 319:39:54 > 25000 of 100000, Atoms: 165574(A), 67174(P), ChiSq: 464867552, Time: 33:37:39 / 330:19:17 > 30000 of 100000, Atoms: 172295(A), 68477(P), ChiSq: 464668896, Time: 42:07:54 / 338:07:31 > 35000 of 100000, Atoms: 179143(A), 69047(P), ChiSq: 464584608, Time: 50:52:01 / 344:12:38 > 40000 of 100000, Atoms: 184012(A), 69617(P), ChiSq: 464515392, Time: 59:49:42 / 349:18:55 > 45000 of 100000, Atoms: 189512(A), 69805(P), ChiSq: 464441792, Time: 68:54:57 / 353:19:43 > 50000 of 100000, Atoms: 196919(A), 68410(P), ChiSq: 464404192, Time: 78:14:55 / 357:11:07 > 55000 of 100000, Atoms: 206964(A), 64601(P), ChiSq: 464468192, Time: 87:40:43 / 360:20:54 > 60000 of 100000, Atoms: 808401(A), 12023(P), ChiSq: 500546240, Time: 96:09:45 / 359:07:47 > 65000 of 100000, Atoms: 1055385(A), 8298(P), ChiSq: 527779296, Time: 103:43:00 / 354:42:26 > 70000 of 100000, Atoms: 950700(A), 7337(P), ChiSq: 533742400, Time: 112:06:23 / 353:24:58 > 75000 of 100000, Atoms: 924196(A), 7082(P), ChiSq: 539779776, Time: 120:40:25 / 352:40:01 > 80000 of 100000, Atoms: 948739(A), 6947(P), ChiSq: 541298112, Time: 129:33:54 / 352:45:34 > 85000 of 100000, Atoms: 1000041(A), 6826(P), ChiSq: 543738112, Time: 138:59:54 / 354:05:43 > 90000 of 100000, Atoms: 1064006(A), 6708(P), ChiSq: 546889728, Time: 148:56:49 / 356:23:28 > 95000 of 100000, Atoms: 1251780(A), 8249(P), ChiSq: 534668288, Time: 159:05:50 / 358:46:42 >

Jennifer Foltz (11:37:51) (in thread): > 2) On a separate run, I specified nPatterns=25 for distributed genome-wide CoGAPS, but the output was 33 patterns. Is this an expected outcome? I recall fewer is ok, but I was not sure about more patterns. Thank you @Elana Fertig for following up with my message!!

Elana Fertig (11:42:18) (in thread): > I saw your email

Elana Fertig (11:42:29) (in thread): > the reason it’s less is because it does a consensus across the patterns from the runs across sets

Elana Fertig (11:42:51) (in thread): > that can be less than the total number that was asked for so at the end of the parallelization it gives the best set of consensus patterns

Elana Fertig (11:42:53) (in thread): > if that makes sense

Elana Fertig (11:43:09) (in thread): > re the atoms, I’m not sure what’s going on there – by any chance do you have rows that are all zero?

Jennifer Foltz (11:53:24) (in thread): > Thanks, in this case, the CoGAPS result was more patterns than I asked for, returning 33 when I asked for 25. Is it possible for the consensus match to increase the number of final patterns?

Jennifer Foltz (11:56:08) (in thread): > Re. the atoms, I removed genes that had a sd of 0, so I don’t think there should be rows of all 0, although I could be overlooking something: `keepIndex = (apply(dat,1,sd)!=0); dat = dat[keepIndex,]` where dat is my gene x cell matrix. It seems to only happen when I split across nSets. Happy to check the data in another way for other discrepancies if you advise :slightly_smiling_face:

Jennifer Foltz (12:17:53) (in thread): > Following up, I confirmed that none of the rows of the log2(dat+1) matrix going into RunCoGAPS() are all 0s using rowSums()
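Editor's sketch of the checks described in the two messages above, in numpy rather than R: drop zero-variance genes, log-transform, and look for all-zero or non-finite rows. The last helper goes one step beyond what the thread checked and is an assumption on my part, not a diagnosis: because the distributed run splits the cells into nSets subsets, a gene that is non-zero overall can still be all-zero within one subset. The random partition below is hypothetical and does not reproduce how CoGAPS assigns cells to sets.

```python
import numpy as np


def filter_constant_genes(dat):
    """Keep rows (genes) whose sample standard deviation is non-zero."""
    dat = np.asarray(dat, dtype=float)
    keep = dat.std(axis=1, ddof=1) != 0
    return dat[keep], keep


def degenerate_rows(mat):
    """Indices of rows that are entirely zero or contain non-finite values."""
    return {
        "all_zero": np.flatnonzero(~mat.any(axis=1)),
        "non_finite": np.flatnonzero(~np.isfinite(mat).all(axis=1)),
    }


def all_zero_rows_per_subset(mat, n_sets, seed=0):
    """All-zero rows within each of n_sets random column (cell) subsets.

    Hypothetical partition for illustration; not the CoGAPS subsetting.
    """
    rng = np.random.default_rng(seed)
    cols = rng.permutation(mat.shape[1])
    return [np.flatnonzero(~mat[:, idx].any(axis=1))
            for idx in np.array_split(cols, n_sets)]


# Toy gene x cell matrix: rows 0 and 1 are constant, row 3 is non-zero in one cell.
dat = np.array([[0, 0, 0, 0],
                [3, 3, 3, 3],
                [0, 1, 0, 5],
                [2, 0, 0, 0]])
filtered, keep = filter_constant_genes(dat)
logged = np.log2(filtered + 1)
report = degenerate_rows(logged)
per_subset = all_zero_rows_per_subset(logged, n_sets=2)
print(keep, report, per_subset)
```

In the toy data the globally filtered matrix has no all-zero rows, yet the gene expressed in a single cell is all-zero in whichever subset lacks that cell.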

2022-12-21

Elana Fertig (08:51:00) (in thread): > yes – consensus can increase too because it’s based on clustering across the 25 across all sets and then cutting

Elana Fertig (08:51:18) (in thread): > but, if you are increasing my guess is you are underdimensionalized in the sets and would be increasing

Elana Fertig (08:52:01) (in thread): > I’ll have to check back on the atoms increasing with nSets – my guess is that there’s something funky with the number of dimensions you are getting, especially given that it’s increasing

Jennifer Foltz (10:12:08) (in thread): > thank you! Would you suggest for the increasing patterns that I rerun requesting higher nPatterns? Or decrease the number of nSets to split the data?

Jennifer Foltz (10:13:45) (in thread): > For the run with atoms increasing, these are the settings I’m using: > > -- Standard Parameters -- > nPatterns 25 > nIterations 1e+05 > seed 891 > sparseOptimization TRUE > distributed single-cell > > -- Sparsity Parameters -- > alpha 0.01 > maxGibbsMass 100 > > -- Distributed CoGAPS Parameters -- > nSets 6 > cut 25 > minNS 3 > maxNS 9 > > 14160 gene names provided > first gene name: FAM87B > > 21914 sample names provided > > Thank you very much for looking into it. Is there anything I can try that would be helpful on my end to troubleshoot, or any parameter that I incorrectly set?

Elana Fertig (10:48:01) (in thread): > I would recommend increasing the number of patterns and seeing if that fixes it

Elana Fertig (10:48:36) (in thread): > it would be easier if you can pull out one of the sets for which it’s occurring and optimize parameters there

Elana Fertig (10:48:41) (in thread): > then pull it together

Jennifer Foltz (11:02:33) (in thread): > To clarify, is pulling out the set for the run where the atoms are behaving weirdly?

Jennifer Foltz (11:02:54) (in thread): > How do I pull out the set and proceed with optimizing, and pulling together?

2022-12-22

Elana Fertig (13:21:22) (in thread): > @Jennifer Foltz why don’t you email me and I can set up a meeting with the team after the new year?

Jennifer Foltz (14:56:25) (in thread): > Sounds great, thank you!

2023-01-08

Pageneck Chikondowa (05:31:20): > @Pageneck Chikondowa has joined the channel

2023-01-18

Jenea Adams (22:32:28): > @Jenea Adams has joined the channel

2023-01-26

Yu Zhang (12:32:59): > @Yu Zhang has joined the channel

Aedin Culhane (14:07:19): > HCA Biological Network Seminar > Thursday 09 February 2023 > 10:30-12:00 EDT | 15:30-17:00 BST > Please join us for the upcoming Biological Networks Seminar on Thursday 9th February 2023, featuring updates on Atlas Integration and CAP (Cell Annotation Platform). > > The aims of this series are to increase biological network visibility and spark new opportunities for collaboration and engagement. Further seminars will be announced soon via the HCA Biological Seminar Series webpage. > > Please see below for additional information on this seminar, and feel free to contact meetings@humancellatlas.org with any questions. > > Register here. We hope you can join us. - Attachment (humancellatlas.org): Biological Network Seminar Series > To create comprehensive reference maps of all human cells—the fundamental units of life—as a basis for both understanding human health and diagnosing, monitoring, and treating disease. - Attachment (Zoom): Welcome! You are invited to join a webinar: HCA Bionetworks Seminar: Updates on Atlas Integration and CAP (Cell Annotation Platform). After registering, you will receive a confirmation email about joining the webinar. > Please use an email address linked to your Zoom account, and make sure you are logged into that account when registering.

Tim Triche (17:04:22): > I’ll be there…

2023-01-31

Ahmad Al Ajami (09:11:17): > @Ahmad Al Ajami has joined the channel

brian capaldo (13:05:27): > is it possible to update/override the rownames in a SingleCellExperiment/SummarizedExperiment object? I realize this is probably really poor OOD if it was easy to do so

Alan O’C (13:15:46): > You can change the col/rownames normally, yes

brian capaldo (13:16:49): > hmm, I swear that didn’t work just a second ago

brian capaldo (13:17:01): > I’m going to go think about my life choices now

brian capaldo (13:17:51): > ah, it was because I was trying to do it inside the constructor

Alan O’C (15:14:21): > Happens to the best of us

2023-02-02

rohitsatyam102 (14:24:02): > Hi Everyone. I was looking for some Bioconductor R packages for rare cell identification in single cell. I came across several CRAN packages such as FiRE, GapClust, CellSIUS and sc-Syno, but all of them lack proper documentation.

rohitsatyam102 (15:48:57) (in thread): > my data has control-knockout design and there is a cluster that’s new in the knockout and absent in control.

Peter Hickey (16:30:41) (in thread): > It sounds like you’ve found the cluster. So what do you want to do next? > * Identify marker genes of that cluster? > * Test whether its presence/absence is statistically significant in your experiment? > * Something else?

2023-02-03

rohitsatyam102 (02:31:55) (in thread): > Umm I have cluster-wise DE genes and marker genes as well. I have been asked by a reviewer if these could be new cells and could tell us something interesting. So yes the second point you mentioned. This cluster has 22 cells from the single cell reference atlas (not a large number though) but have like 1000 cells from knockout and around 300 from control.

Peter Hickey (16:37:31) (in thread): > If you haven’t already, take a read of https://bioconductor.org/books/3.16/OSCA.multisample/differential-abundance.html for methods to test for changes in cluster abundance. > Note that this requires you have biological replicates, as would any sensible statistical method for this type of analysis. - Attachment (bioconductor.org): Chapter 6 Changes in cluster abundance | Multi-Sample Single-Cell Analyses with Bioconductor > Chapter 6 Changes in cluster abundance | Multi-Sample Single-Cell Analyses with Bioconductor

2023-02-06

rohitsatyam102 (08:48:06) (in thread): > Alas!! We don’t have biological replicates.

2023-02-19

Vince Carey (15:23:44): > scviR has now been submitted to Bioconductor. It addresses interfacing to scvi-tools with a focus on CITE-seq.

2023-02-22

michaelkleymn (01:44:33): > @michaelkleymn has joined the channel

2023-03-01

Pedro Sanchez (06:16:11): > Hi all! I am using scDblFinder() and was wondering how one can select the values for the following arguments: nfeatures, dims, includePCs

Pedro Sanchez (06:16:15): > Also, in the case of having known inter-sample doublets, is it better to run scDblFinder() with knownDoublets or would you run recoverDoublets()? Thank you!

jeremymchacón (12:14:34): > @jeremymchacón has joined the channel

2023-03-02

Pierre-Luc Germain (15:34:06) (in thread): > For the dims, this should be similar to the dimensionality you use for other analyses (not that this is so simple to establish…), and the same can be appropriate for includePCs as well. > In the newest version, some of these parameters have been set according to this parameter search (it’s based on just the 16 benchmark datasets, but clearly better than guesswork).

Pierre-Luc Germain (15:44:43) (in thread): > I haven’t actually compared the two face to face… > But the following I copy-pasted from an exchange with a user is perhaps relevant: > > yes there is an argument `knownDoublets` where these can be passed. How > they are used will depend on the `knownUse` argument: "discard" means > that the known doublets won't be used to train the classifier, while > "positive" means that the classifier will consider the known doublets as > true doublets also during training. > Using known doublets as positives intuitively sounds like a good idea, > but it can lead to worse results when there is a high proportion of > homotypic doublets (because those are often virtually indistinguishable > from real cells, they can mislead the classifier). This is why the > default is "discard", as the safest procedure is to exclude heterotypic > doublets using scDblFinder, and exclude further homotypic doublets known > from multiplexing. > However, if the rate of homotypic doublets isn't high, i.e. if your > samples are complex enough (many cell types) and if the multiplexing is > high (many patients in one capture), then it's very plausible that a > slightly better performance could be achieved by treating the known > doublets as positives for the learning. To be honest I haven't dealt > with such cases enough to have a feel for how much of an improvement (if > any) can really be obtained in this way... > In benchmark datasets I tried giving scDblFinder half of the known > doublets, and checking the AUPRC for detecting the other half, and it > either didn't change performance or reduced it (for the reasons just > mentioned). But most of the benchmark datasets tend to have a lot of > homotypic doublets, so if you feel like trying the same on yours I'd be > very curious to hear about the result :) >

Pierre-Luc Germain (15:46:50) (in thread): > the same problem applies to the recoverDoublets() approach: if a lot of the known doublets are homotypic (i.e. combining cells of the same type), then this will also recover singlets, because the homotypic doublets look very much like real cells.

Pierre-Luc Germain (15:49:00) (in thread): > otherwise, I’d suppose that using scDblFinder() with knownDoublets and knownUse="positive" gets the same doublets correctly identified by recoverDoublets, but I haven’t actually investigated that.
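Editor's sketch of the hold-out check described at the end of the quoted exchange: split the known doublets in half, give one half to the classifier, and score how well the other half is recovered. The doublet scores are assumed to come from a separate scDblFinder run and are not computed here; the function names are hypothetical, average precision is used as one common estimate of AUPRC, and treating every cell outside the known set as a negative is a simplification (some of those cells are undetected doublets).

```python
import numpy as np


def split_known_doublets(known_idx, seed=0):
    """Randomly split known doublet indices into (provided, heldout) halves."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(known_idx))
    half = len(shuffled) // 2
    return np.sort(shuffled[:half]), np.sort(shuffled[half:])


def average_precision(scores, is_positive):
    """Mean of precision@rank over the ranks at which a positive occurs."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    if not is_positive.any():
        raise ValueError("need at least one positive")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = is_positive[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_rank[hits].mean()


def heldout_auprc(scores, provided, heldout):
    """AUPRC estimate for the held-out doublets, ignoring the provided ones."""
    scores = np.asarray(scores, dtype=float)
    evaluate = np.ones(len(scores), dtype=bool)
    evaluate[provided] = False  # cells the classifier was told about
    labels = np.zeros(len(scores), dtype=bool)
    labels[heldout] = True
    return average_precision(scores[evaluate], labels[evaluate])


# Made-up scores for four cells; cell 0 was provided, cell 2 is held out.
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auprc = heldout_auprc(toy_scores, provided=[0], heldout=[2])
provided_half, heldout_half = split_known_doublets([1, 4, 6, 9, 12], seed=3)
print(toy_auprc, provided_half, heldout_half)
```

Running this once with the scores from a "discard" run and once with the scores from a "positive" run on the same split would give the comparison described in the message.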

2023-03-03

Pedro Sanchez (02:51:55) (in thread): > Thank you, Pierre-Luc. Pretty useful! > In my case, I’ve decided to use knownUse="positive", but the concordance with the doublets detected when "discard" is used is 98%

Pierre-Luc Germain (02:59:18) (in thread): > well I guess that’s a good sign!

Meeta Mistry (11:34:40): > @Meeta Mistry has joined the channel

2023-03-10

Edel Aron (15:28:02): > @Edel Aron has joined the channel

2023-03-14

Jean Yang (04:05:06): > @Jean Yang has joined the channel

2023-03-24

H. Emre (07:27:44): > @H. Emre has joined the channel

2023-03-30

Ludwig Geistlinger (11:01:22): > Mark your calendars for a CCB seminar special with Aaron Lun, > the mastermind behind the Orchestrating Single-Cell Analysis with Bioconductor (OSCA) online book! > > Aaron will speak about the journey that led to the OSCA book > from a developer’s perspective in his talk: > > Code, sweat, and tears: how the OSCA sausage was made > > When: April 03, 2023, 3 PM ET > Where: https://harvard.zoom.us/j/97173440183?pwd=eHI1ODRub0p5NGNEZncwU0lURlJjdz09

2023-04-21

Kozo Nishida (14:22:10): > @Kozo Nishida has joined the channel

2023-05-03

Rebecca Butler (16:18:49): > @Rebecca Butler has joined the channel

2023-05-18

Oluwafemi Oyedele (05:54:06): > @Oluwafemi Oyedele has joined the channel

2023-06-01

Pedro Sanchez (09:11:18): > Hi everyone! I’d like to read your opinions on tools for deconvoluting spots of cells without spatial information, but obviously with a single cell transcriptomics reference. Many thanks:hugging_face:

Kasper D. Hansen (10:47:34): > This could be useful:https://arxiv.org/abs/2305.06501 - Attachment (arXiv.org): Challenges and opportunities to computationally deconvolve heterogeneous tissue with varying cell sizes using single cell RNA-sequencing datasets > Deconvolution of cell mixtures in “bulk” transcriptomic samples from > homogenate human tissue is important for understanding the pathologies of > diseases. However, several experimental and computational challenges remain in > developing and implementing transcriptomics-based deconvolution approaches, > especially those using a single cell/nuclei RNA-seq reference atlas, which are > becoming rapidly available across many tissues. Notably, deconvolution > algorithms are frequently developed using samples from tissues with similar > cell sizes. However, brain tissue or immune cell populations have cell types > with substantially different cell sizes, total mRNA expression, and > transcriptional activity. When existing deconvolution approaches are applied to > these tissues, these systematic differences in cell sizes and transcriptomic > activity confound accurate cell proportion estimates and instead may quantify > total mRNA content. Furthermore, there is a lack of standard reference atlases > and computational approaches to facilitate integrative analyses, including not > only bulk and single cell/nuclei RNA-seq data, but also new data modalities > from spatial -omic or imaging approaches. New multi-assay datasets need to be > collected with orthogonal data types generated from the same tissue block and > the same individual, to serve as a “gold standard” for evaluating new and > existing deconvolution methods. Below, we discuss these key challenges and how > they can be addressed with the acquisition of new datasets and approaches to > analysis.

Pedro Sanchez (11:01:06) (in thread): > Indeed. Many thanks!!

2023-06-12

Carmen Navarron (02:56:40): > @Carmen Navarron has joined the channel

2023-06-22

Peter Hickey (20:43:27): > @Aaron Lun any general thoughts on Seurat’s WNN method for integrative analysis of RNA + ADT (https://satijalab.org/seurat/articles/weighted_nearest_neighbor_analysis.html) and comparison to things available in mumosa?

2023-06-23

Aaron Lun (02:35:30): > haven’t thought about it for a very long time. last I remember was thinking that their method was pretty complicated, and I didn’t want to implement that, so I just slapped the two PC matrices together and called it a day. (with some finessing in rescaleByNeighbors). Seems to work well enough for RNA + ADT/CRISPR, the same approach is used by kana and gives me structure from the supplied modalities. Plus I get a nice low-dimensional matrix back out for further use in different stuff that don’t operate on graphs.

2023-06-25

Peter Hickey (18:40:39): > Thanks. The Seurat documentation (https://satijalab.org/seurat/reference/findmultimodalneighbors) is terse and really doesn’t explain anything. I’m trying to read the code but that’s a slog too.

Peter Hickey (18:41:09): > At the moment the main difference I have is that Seurat tries to estimate per-cell weights for each modality whereas for mumosa the user provides a per-modality weight that is the same for all cells (and can handle > 2 modalities)
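Editor's sketch of the "slap the PC matrices together" approach with one weight per modality, as contrasted with per-cell weights above. It is my reading of the general idea (put each embedding on a comparable scale using the median distance to the k-th nearest neighbour, apply a per-modality weight, then concatenate columns), written in numpy with brute-force distances. It is not the mumosa implementation, and applying the weight directly to distances is an assumption.

```python
import numpy as np


def median_kth_neighbor_distance(emb, k):
    """Median over cells of the Euclidean distance to the k-th nearest neighbour."""
    if not 0 < k < emb.shape[0]:
        raise ValueError("k must be between 1 and n_cells - 1")
    diff = emb[:, None, :] - emb[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    dist_sorted = np.sort(dist, axis=1)  # column 0 is the zero self-distance
    return np.median(dist_sorted[:, k])


def combine_modalities(embeddings, k=20, weights=None):
    """Rescale each cells x dims embedding, weight it, and column-bind."""
    if weights is None:
        weights = [1.0] * len(embeddings)
    scaled = []
    for emb, w in zip(embeddings, weights):
        emb = np.asarray(emb, dtype=float)
        scale = median_kth_neighbor_distance(emb, k)
        if scale == 0:
            raise ValueError("degenerate embedding: median neighbour distance is 0")
        scaled.append(emb * (w / scale))  # same weight for every cell
    return np.hstack(scaled)


# Simulated embeddings on very different scales (50 cells), for illustration only.
rng = np.random.default_rng(0)
rna = rng.normal(size=(50, 10)) * 100.0
adt = rng.normal(size=(50, 5))
k = 5
combined = combine_modalities([rna, adt], k=k)
print(combined.shape)
```

With equal weights, each block of the combined matrix has a median k-th neighbour distance of 1, so neither modality dominates merely because of its units. The result is an ordinary low-dimensional matrix that can go into clustering, t-SNE or UMAP.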

Peter Hickey (18:42:05): > Trying Seurat’s method on a subset of the dataset I’m analysing gave very same-ish weights for all cells anyway

2023-06-26

Alan O’C (06:48:51) (in thread): > > Other parameters are listed for debugging, but can be left as default values. > is… an interesting way to document your function.

Alexander Bender (08:58:04): > Anyone with ideas for https://support.bioconductor.org/p/9152951/? Different results between machines despite full containerization and identical input data? Here for fastMNN, but I had this before as well with other software, see comment. Not sure whether this is machine precision or anything else.

Nils Eling (09:10:19) (in thread): > Thanks for posting it here. If it boils down to different OS versions: locally I work on MacOS Big Sur 11.7.4 and the GA runner uses ubuntu-latest which should default to Ubuntu 22.04. I highly appreciate any input on this.

Alan O’C (21:37:20) (in thread): > If it’s a newish Mac, then it could well be an arm64 vs x86_64 issue that may persist if you run on the mac github runner

2023-07-07

rohitsatyam102 (04:18:41): > Hi everyone. I need some advice on using ALRA for imputation. I have control-drug treated single cells from two different time points (so T1 ctrl, T1 treated, T2 ctrl and T2 treated where T1 and T2 have 1hr difference only) and I want to use ALRA to perform imputation. In a recent publication, the ALRA authors wrote “ALRA was applied on the merged and normalized data matrix.” I wish to understand if I should perform ALRA imputation by taking all the data together or separately (all T1 imputed separately and T2 separately) or should impute control together and treatment together? - Attachment (Nature): Zero-preserving imputation of single-cell RNA-seq data > Nature Communications - Missing values in scRNA-seq datasets can bias their analysis. Here, the authors threshold the low rank approximation of the expression matrix, so false zeros can be imputed…

Vince Carey (11:41:43): > Working with a student at CSHL summer course. > > > ##### SCTransform Each Object Individually ##### > > > > # run for increased number of variable features, 3000-10000); 10,000 used when you're planning .... [TRUNCATED] > Calculating cell attributes from input UMI matrix: log_umi > Error: C stack usage 7971780 is too close to the limit > > Seen before? > > R version 4.3.0 Patched (2023-04-24 r84317) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 20.04.6 LTS > > Matrix products: default > BLAS: /home/stvjc/R-430-dist/lib/R/lib/libRblas.so > LAPACK: /home/stvjc/R-430-dist/lib/R/lib/libRlapack.so; LAPACK version 3.11.0 > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > time zone: America/New_York > tzcode source: system (glibc) > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] beepr_1.3 patchwork_1.1.2 scCustomize_1.1.1 glmGamPoi_1.12.2 > [5] sctransform_0.3.5 Matrix_1.5-4.1 ggplot2_3.4.2 SeuratObject_4.1.3 > [9] Seurat_4.3.0.1 rmarkdown_2.23 > > loaded via a namespace (and not attached): > [1] RColorBrewer_1.1-3 audio_0.1-10 > [3] shape_1.4.6 jsonlite_1.8.7 > [5] startup_0.20.0 magrittr_2.0.3 > [7] ggbeeswarm_0.7.2 spatstat.utils_3.0-3 > [9] GlobalOptions_0.1.2 zlibbioc_1.46.0 > [11] vctrs_0.6.3 ROCR_1.0-11 > [13] spatstat.explore_3.2-1 paletteer_1.5.0 > [15] RCurl_1.98-1.12 janitor_2.2.0 > [17] forcats_1.0.0 S4Arrays_1.0.4 > [19] htmltools_0.5.5 parallelly_1.36.0 > [21] KernSmooth_2.23-20 htmlwidgets_1.6.2 > [23] ica_1.0-3 plyr_1.8.8 > [25] lubridate_1.9.2 plotly_4.10.2 > [27] zoo_1.8-12 igraph_1.5.0 > [29] mime_0.12 lifecycle_1.0.3 > [31] pkgconfig_2.0.3 R6_2.5.1 > [33] fastmap_1.1.1 snakecase_0.11.0 > [35] 
GenomeInfoDbData_1.2.10 MatrixGenerics_1.12.2 > [37] fitdistrplus_1.1-11 future_1.33.0 > [39] shiny_1.7.4 digest_0.6.32 > [41] colorspace_2.1-0 rematch2_2.1.2 > [43] S4Vectors_0.38.1 tensor_1.5 > [45] irlba_2.3.5.1 GenomicRanges_1.52.0 > [47] progressr_0.13.0 timechange_0.2.0 > [49] fansi_1.0.4 spatstat.sparse_3.0-2 > [51] httr_1.4.6 polyclip_1.10-4 > [53] abind_1.4-5 compiler_4.3.0 > [55] withr_2.5.0 MASS_7.3-60 > [57] DelayedArray_0.26.6 tools_4.3.0 > [59] vipor_0.4.5 lmtest_0.9-40 > [61] beeswarm_0.4.0 httpuv_1.6.11 > [63] future.apply_1.11.0 goftest_1.2-3 > [65] glue_1.6.2 nlme_3.1-162 > [67] promises_1.2.0.1 grid_4.3.0 > [69] Rtsne_0.16 cluster_2.1.4 > [71] reshape2_1.4.4 generics_0.1.3 > [73] gtable_0.3.3 spatstat.data_3.0-1 > [75] tidyr_1.3.0 data.table_1.14.8 > [77] sp_2.0-0 utf8_1.2.3 > [79] XVector_0.40.0 BiocGenerics_0.46.0 > [81] spatstat.geom_3.2-1 RcppAnnoy_0.0.21 > [83] ggrepel_0.9.3 RANN_2.6.1 > [85] pillar_1.9.0 stringr_1.5.0 > [87] ggprism_1.0.4 later_1.3.1 > [89] circlize_0.4.15 splines_4.3.0 > [91] dplyr_1.1.2 lattice_0.21-8 > [93] survival_3.5-5 deldir_1.0-9 > [95] tidyselect_1.2.0 miniUI_0.1.1.1 > [97] pbapply_1.7-2 knitr_1.43 > [99] gridExtra_2.3 IRanges_2.34.1 > [101] SummarizedExperiment_1.30.2 scattermore_1.2 > [103] stats4_4.3.0 xfun_0.39 > [105] Biobase_2.60.0 matrixStats_1.0.0 > [107] stringi_1.7.12 lazyeval_0.2.2 > [109] evaluate_0.21 codetools_0.2-19 > [111] tibble_3.2.1 cli_3.6.1 > [113] uwot_0.1.16 xtable_1.8-4 > [115] reticulate_1.30 munsell_0.5.0 > [117] Rcpp_1.0.10 GenomeInfoDb_1.36.1 > [119] globals_0.16.2 spatstat.random_3.1-5 > [121] png_0.1-8 ggrastr_1.0.2 > [123] parallel_4.3.0 ellipsis_0.3.2 > [125] bitops_1.0-7 listenv_0.9.0 > [127] viridisLite_0.4.2 scales_1.2.1 > [129] ggridges_0.5.4 crayon_1.5.2 > [131] leiden_0.4.3 purrr_1.0.1 > [133] rlang_1.1.1 cowplot_1.1.1 >

Stephanie Hicks (11:44:29) (in thread): > ah, sorry no i’m not familiar with sctransform in the Seurat family.

Vince Carey (11:45:30) (in thread): > could be a good learning experience for reprex production and bug filing. stay tuned.

Aaron Lun (11:46:06) (in thread): > typical error from an infinite recursion.

Vince Carey (11:49:40) (in thread): > true but it isn’t always easy to smoke it out with such a big stack. and i am concerned that the student did not hit it so there may be a subtle platform- or version-dependence of the event.

2023-07-10

Wes W (14:25:52) (in thread): > what type of single cell data are you working with? > > also what are you trying to do with the imputed data? annotations? or are you trying to actually do analysis? > > be careful with imputation methods for downstream analysis on those genes in datasets with very poor coverage. it will be hard to tell if what is there is real or if you are just amplifying noise. > > in response to your question more directly, if you expect differences in your treatment or your time, you obviously don’t want to impute genes that aren’t turned on in one condition and then amplify them in the other conditions… which with some types of data isn’t such a problem BUT with something like 10X data that has poor coverage and higher dropout, could be a real problem. I have seen my students accidentally convert male patient data into female data and create some pretty wacky double positive populations of immune cells by running this blindly on a merged set of samples. > > so if you think your data is good for imputation or if you are just using it for better defined clusters like the paper (and not for DE etc), I would recommend running the imputation on the cells that you expect to be similar to be more confident and just store it as an altExp in your sce object… > > there are of course some other cool use cases for imputation but I won’t get into them as it might take your question off topic

Sean Davis (19:18:13): > @Sean Davis has joined the channel

Sean Davis (19:19:25): > Is there an equivalent vignette/tutorial to this from SatijaLab in Bioconductor? I’m showing my lack of SC experience here, I know - Attachment (satijalab.org): Integrating scRNA-seq and scATAC-seq data > Seurat

2023-07-11

Wes W (11:27:37) (in thread): > it’s certainly possible the matrix isn’t formatted correctly or as expected, which is causing the recursive error… do you have the upstream code used to generate and format the object? > > but i have gotten this error in the past, and if you are on a shared server you might be smashing the resource limits of your system. try double checking that and running gc() etc… if you are doing this on multiple objects individually and you have a lot, perhaps try it single threaded instead of with mclapply in a multicore function… > > lastly, 90% of the time this is fixed for me by restarting RStudio/R … i don’t know why…

Vince Carey (15:32:13) (in thread): > thanks @wes i should have reported that the problem seems machine-specific and i am leaving it alone for now

2023-07-12

Kishori (19:12:53): > @Kishori has joined the channel

Kishori (19:13:20): > I am doing some 10x single-cell RNA-seq analysis for mouse tumor data. I noticed a huge number of neutrophils (after processing through SingleR), but I was hoping to see some MDSCs (myeloid-derived suppressor cells) as well. I don’t even see MDSC among the labels. Has anyone encountered this situation or has any idea how to detect MDSCs? Any comments or suggestions would be much appreciated.

Vivek Sharma (19:13:31): > @Vivek Sharma has joined the channel

Vivek Sharma (19:14:33) (in thread): > I’m struggling with similar issues. Any pointers will be appreciated

Peter Hickey (19:28:20) (in thread): > It’s very unlikely there are MDSC in the reference dataset you are using, so there’s no way any cells in your dataset will get that label by SingleR (which is a reference-based annotation tool). > What reference are you supplying to SingleR?

Axel Klenk (19:33:51): > @Axel Klenk has joined the channel

Augustine (19:43:10): > @Augustine has joined the channel

Kishori (20:07:23) (in thread): > @Peter Hickey I am using MouseRNAseqData, ImmGenData as references.

Peter Hickey (20:12:10) (in thread): > Neither of those reference datasets contains MDSCs (https://bioconductor.org/packages/release/data/experiment/vignettes/celldex/inst/doc/userguide.html#23_Mouse_RNA-seq; https://bioconductor.org/packages/release/data/experiment/vignettes/celldex/inst/doc/userguide.html#31_Immunological_Genome_Project_(ImmGen)). > You would need to provide SingleR with a reference containing MDSCs for it to be able to annotate your datasets with such a label. > I don’t know if such a reference dataset exists

Peter Hickey (20:12:48) (in thread): > See https://bioconductor.org/books/release/SingleRBook/classic-mode.html#reference-choice for a discussion of the choice of reference dataset and why it’s critical - Attachment (bioconductor.org): Chapter 2 Using the classic mode | Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.
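
The point made in this thread, that a reference-based annotator can only hand out labels that exist in its reference, can be checked before running the annotation. Below is a minimal sketch in base R. The `ref_labels` vector is toy data standing in for the label column of a reference (with a celldex reference this would typically be something like `ref$label.main`, which is an assumption about the object in use).

```r
# Toy stand-in for the label column of a reference dataset (hypothetical values).
ref_labels <- c("Neutrophils", "Monocytes", "Macrophages", "T cells", "Neutrophils")

# Cell types we hope to find in the query data.
wanted <- c("Neutrophils", "MDSC")

# A label can only be assigned if it is present in the reference.
available <- wanted %in% unique(ref_labels)
names(available) <- wanted
print(available)
```

Any wanted label that comes back `FALSE` here cannot appear in the annotation output, whatever the query data look like.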

Noorul (20:19:53): > @Noorul has joined the channel

John Scanlan (20:27:16): > @John Scanlan has joined the channel

Kishori (20:34:01) (in thread): > @Peter Hickey Thank you! This is very useful information!

Marc Elosua (20:44:48): > @Marc Elosua has joined the channel

2023-07-13

Chris Chiu (04:01:45): > @Chris Chiu has joined the channel

Liz Ing-Simmons (05:36:14): > @Liz Ing-Simmons has joined the channel

Jacques SERIZAY (07:59:26): > @Jacques SERIZAY has joined the channel

kent riemondy (08:54:42): > @kent riemondy has joined the channel

Robert Castelo (11:12:13): > @Robert Castelo has joined the channel

Jared Slosberg (11:46:41): > @Jared Slosberg has joined the channel

2023-07-15

rohitsatyam102 (01:24:52) (in thread): > Hi @Wes W. Sorry for replying so late. I was figuring out this question and yes, I now impute the data separately (4 samples, 4 individual imputations). Yes, this is 10X data, and parasite data at that, where cells were harvested at early stages of the parasite life cycle and low gene expression is expected (I saw a similar paper from another lab, and on average 50-100 out of 5000 genes are expressed). Our average number of genes expressed per cell is far lower (both in normal and treatment) than the published one, even when we harvest at a later time point; we repeated the experiment to see if something went wrong with library prep or sequencing depth and ruled that out, so I thought imputation could help. Obviously, we don’t wish to over-impute and so decided to use ALRA, which is conservative; there is, however, limited documentation on how to use it correctly. My observations: when I impute all samples together the fraction of non-zero values increases up to ~17%, but when I do the same thing separately, it goes up to ~50%. I tried to search for a method to gauge whether I am over-imputing, but couldn’t find any. I am thinking of using it both for “better defined clusters like the paper” as well as DE analysis and trajectory analysis if possible. I see in several issues on the Seurat GitHub that people have used these imputation results to run FindMarkers, but again, I am not claiming that it’s 100% legit to do so. I am still exploring possibilities.
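
The before/after comparison of non-zero fractions described above can be made explicit with a few lines of base R. This is only a sketch on toy data: `raw` is a simulated genes x cells matrix, and `imputed` is a hypothetical stand-in produced by filling some zeros by hand. It does not call ALRA or reproduce its behaviour.

```r
# Fraction of entries in a matrix that are non-zero.
nonzero_fraction <- function(mat) mean(mat != 0)

set.seed(1)
# Toy genes x cells count matrix standing in for one sample (hypothetical data).
raw <- matrix(rpois(200 * 50, lambda = 0.2), nrow = 200, ncol = 50)

# Hypothetical stand-in for an imputed matrix: fill 10% of the zeros with 0.5.
imputed <- raw
zero_idx <- which(raw == 0)
filled <- sample(zero_idx, size = floor(0.1 * length(zero_idx)))
imputed[filled] <- 0.5

before <- nonzero_fraction(raw)
after <- nonzero_fraction(imputed)

# Share of originally-zero entries that became non-zero after "imputation".
newly_filled <- mean(imputed[raw == 0] != 0)

print(c(before = before, after = after, newly_filled = newly_filled))
```

Computing `newly_filled` per sample and per gene, rather than one global number, may make it easier to see where a jump such as ~17% versus ~50% comes from.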

2023-07-19

rohitsatyam102 (04:51:47): > Can I tempt some single cell experts here to answer this small Seurat-related query here. I know Seurat is not a Bioconductor package but several packages are built upon it, so maybe they can help.

rohitsatyam102 (04:56:01) (in thread): > I tried looking at the function itself and I can see that at least for SCT, the GetResidual() function uses the “RNA” assay, but I wasn’t able to figure it out for LogNormalize.

Alan O’C (05:15:33) (in thread): > Just from browsing source I see https://github.com/satijalab/seurat/blob/763259d05991d40721dee99c9919ec6d4491d15e/R/integration.R#L1376

rohitsatyam102 (05:27:49) (in thread): > Yes, that is within the if block for SCT. I am looking for LogNormalize

Alan O’C (05:33:08) (in thread): > https://github.com/satijalab/seurat/blob/763259d05991d40721dee99c9919ec6d4491d15e/R/integration.R#L4582

Alan O’C (05:34:07) (in thread): > Anyways I’d suggest stepping through the functions using debug(Seurat::myFun) or by downloading the package source and setting breakpoints

rohitsatyam102 (05:35:04) (in thread): > Thanks. Will do that

2023-07-21

rohitsatyam102 (18:20:35): > Hi !! If I impute my samples and later wish to integrate this imputed data with a reference atlas, do I need to impute the atlas too? My imputed data currently look like this (the upper two panels are samples: ctrl and treatment, and the bottom one is the reference atlas without imputation). I can see that my reference data follows the NB distribution, but I see some weird tail in my imputed data (I can’t explain why). I can see weird clustering in the UMAP and the reference cells from the same cell type are everywhere, so I thought to check the data distribution. Sorry, but there are no guidelines on how to integrate imputed data with a reference atlas. - File (PNG): image.png - File (PNG): image.png

2023-07-26

Wes W (13:52:57): > Hello all, would love some feedback on an experiment I was handed. I was not involved in the experiment design (which has a lot of flaws and limits); the person who did the wet lab work tried to analyze the data and hit some road blocks, and I was asked to help out if I could. (10X scRNA) > > Here is the question: to integrate or not integrate the data at the sample level. the experiment consists of 16 samples with no experimental replicates or biological replicates for any sample (we can ignore for now the limitations on the power of downstream analysis and how to tell what differences are sampling variance vs what is biological signal), but there are so many sources of potential batch effects I almost want to just run harmony or fastMNN() at the sample level. here is the silliness of the samples I was given: > 1. the first 4 samples were sorted with a dump channel for CD16/32 and then sorted on CD45.2. I expect that, while depleted for other CD45+ immune cells, the T cells left over shouldn’t be any different, other than in frequency, from the next 12 samples, where there was no dump channel first > 2. the first 4 samples were taken at day 14 and the next 12 samples were taken at day 7; it’s perfectly reasonable to suspect differences in gene expression with time > 3. 10 samples are from tumors, the other 6 are from lymph nodes; perfectly reasonable to suspect different types of T cells in different states between the two > 4. 12 of the samples were sorted on CD45.2 and 4 were sorted on CD45.1. for those that don’t do cellular therapy studies: the CD45.1/2 system is used for separating out the progeny of transplanted cells from those of the recipient. There is surely true biological signal here and differences would not all be from technical noise. > 5. Library prep days and sequencing days are the same for the first 4 samples, which means there is no way to separate technical noise from library prep day differences from the biological signal that could come from the different day the mice were harvested on or from the depletion dump channel. the last 12 samples share the same library prep & sequencing days. > 6. they pooled, without hashing, different numbers of mice, ranging from 4-12 mice per sample sequenced. > Attached is the output of plotExplanatoryVariables(vars) and some UMAPs to visualize Samples, Dump (yes/no, which is also the same grouping as Day 14 vs 7, and the same grouping as the differences in library & sequencing), CD45.1/2 sort, and Tissue (LN / Tumor). > > I am currently thinking just to leave the data as it is and annotate them all as being different, as there are no two samples that have identical treatment + sorting. which is fine for looking at transcriptome differences between treatment and regions but makes it trickier to compare frequencies of subtypes, which the group would also like done, in which case maybe I just throw it at fastMNN or maybe do it both ways? > > Love to hear your thoughts on the path forward on this ridiculous experiment. thanks. - File (PNG): image.png - File (PNG): image.png

Jenny Drnevich (14:59:47) (in thread): > Not much to say other than :scream: and :sob:. Maybe we can find some time to chat at BioC if you are in person?

Wes W (15:35:37) (in thread): > Yep I’ll be there in person @Jenny Drnevich :smiley:

Jenny Drnevich (15:40:52) (in thread): > Coincidentally I’ve had 4 consults in the last week about parsing out technical/batch effects from biological effects, but none were nearly this confounded!

Wes W (17:46:07) (in thread): > I’m normally the one here they come to with these issues… and once i looked at the data I knew my only salvation was coming here to ask the community if I was going to be confident moving forward one way or another :melting_face:

2023-07-28

Konstantinos Daniilidis (13:47:06): > @Konstantinos Daniilidis has joined the channel

Benjamin Yang (15:59:00): > @Benjamin Yang has joined the channel

2023-07-30

Dipanjan Dey (10:42:04): > @Dipanjan Dey has joined the channel

Kasper D. Hansen (16:17:09) (in thread): > This seems crazy

2023-07-31

Wes W (10:15:49) (in thread): > Agreed @Kasper D. Hansen, what would you do? I think I honestly have to do it both ways to address the different questions, which seems silly and takes twice as long =(

Wes W (10:16:09) (in thread): > but I am looking forward to chatting with @Jenny Drnevich this week in person and getting her thoughts

2023-08-02

Jamin Liu (14:43:41): > @Jamin Liu has joined the channel

2023-08-03

Ritika Giri (15:59:36): > @Ritika Giri has joined the channel

tho nguyen (23:55:27): > @tho nguyen has joined the channel

2023-08-04

Trisha Timpug (09:36:24): > @Trisha Timpug has joined the channel

Aedin Culhane (14:15:34) (in thread): > Only seeing this now. I wouldn’t do harmony. It tends to over align. In this study you’ll expect diff cells in each “batch”. FastMNN might do the same. The alignment method we did in corral is more subtle. Sometimes it’s not forceful enough but it might remove some effects

Rob Patro (14:26:54) (in thread): > When I teach my grad computational genomics class, I often have a question on the final where the student (posing as their recently graduated self) has to respond to a fictional colleague who has severely botched an experiment, both in terms of data collection and subsequent processing. This is much more of a study “design” nightmare than the obvious batch effect that I include in that question. I am sorry you have to attempt to analyze this data. On the flip side, I’m glad to know that the examples I use in my exam aren’t outrageous or over the top…

2023-08-20

rohitsatyam102 (08:46:25): > Hi Everyone. I was trying to perform trajectory analysis on my data and after running Velocyto, there is insufficient unspliced mRNA, as can be seen in the image below. Should I proceed with this data or is it useless? What is the minimal percentage of unspliced mRNA needed to infer a trajectory? - File (PNG): image.png

2023-08-22

rohitsatyam102 (03:26:03) (in thread): > I checked the publicly available data as well and I get similar results.

rohitsatyam102 (03:26:41) (in thread): > Is it because it’s UMI data? I ask because splicing is observed in Plasmodium.

Wes W (11:05:47) (in thread): > link to the data you are using from the public source? how did you generate your splicy reference?

rohitsatyam102 (11:23:53) (in thread): > The data is from PRJNA560557. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8552519/ - Attachment (PubMed Central (PMC)): Single-Cell RNA Sequencing Reveals Cellular Heterogeneity and Stage Transition under Temperature Stress in Synchronized Plasmodium falciparum Cells > The malaria parasite has a complex life cycle exhibiting phenotypic and morphogenic variations in two different hosts by existing in heterogeneous developmental states. To investigate this cellular heterogeneity of the parasite within the human host, …

rohitsatyam102 (11:25:23) (in thread): > The cell ranger’s intronic read percentage is also low i.e. 1.2 percent when I check the cell ranger summary report

rohitsatyam102 (11:25:59) (in thread): > So maybe that explains why I am getting a very small percentage of unspliced mRNA

Wes W (11:26:43) (in thread): > how did you generate your splicy reference?

rohitsatyam102 (11:27:30) (in thread): > I generated indexes using cellranger mkref subcommand

rohitsatyam102 (11:30:07) (in thread): > The alignment was produced by cellranger and the results were fed to velocyto run10x subcommand to generate spliced and unspliced read count matrix

Wes W (11:37:56) (in thread): > my first suggestion would be to skip cellranger here. I don’t think mkref is creating a splici reference correctly for you. > > you can use STARsolo with correct splice tags to generate spliced/unspliced counts, or I highly recommend using the alevin-fry method of generating a splici reference and spliced/unspliced counts https://combine-lab.github.io/alevin-fry-tutorials/2021/alevin-fry-velocity/ if this doesn’t change your output, it could be the data. - Attachment (combine-lab.github.io): An introduction to RNA-velocity using alevin-fry > Recently, RNA-velocity estimation has becomes increasingly popular tool in single-cell RNA seq analysis. In this post, we will discuss an additional advantage brought by the Unspliced-Spliced-Ambiguous (USA) mode introduced in alevin-fry 0.3.0 and later. That is, the solution presented in that approach for controlling the spurious mapping to spliced transcripts of sequenced fragments arising from introns (in the absence of full decoy) basically gives us the preprocessing results we need to perform an RNA-velocity analysis “for free”. Here we provide an end-to-end tutorial describing how to perform an RNA-velocity analysis for a 10x Chromium dataset. In this tutorial, we will show the whole analysis pipeline, starting from the raw FASTQ files to the gorgeous velocity plots (generated by scVelo) that you may like to include in your next analysis or paper.

2023-09-13

Christopher Chin (17:05:07): > @Christopher Chin has joined the channel

2023-09-20

Alik Huseynov (04:50:58): > @Alik Huseynov has joined the channel

Jaykishan (05:30:06): > @Jaykishan has joined the channel

2023-09-25

Kerim Secener (09:15:59): > @Kerim Secener has joined the channel

2023-09-28

Vince Carey (07:13:46): > Checking in on anyone working with/developing with SIMBA (https://simba-bio.readthedocs.io/en/latest/)? I have been working on some basilisk interfacing, want to be sure I am not duplicating effort.

2023-10-05

Chris Chiu (23:38:52): > Hi, my question is how do people add VDJ information for cells that are called by emptyDrops (but not cellranger vdj)? > > The number of cells called by cellranger vdj was much lower than we expected, so I ran emptyDrops with FDR < 0.1% and got about 2x more (~350 vs ~700). For our analysis, we like to keep cells that contain 1 TRB and 1-2 TRA chains. How does one get the number of TRA and TRB chains in those droplets called as cells by emptyDrops (but not cellranger vdj)? From the Cellranger VDJ output, filtered_contig_annotations.csv would have this information, but this is restricted to droplets that have “is_cell = TRUE” (whereas the cells called only by emptyDrops presumably have “is_cell = FALSE”). > > Thanks.

2023-10-06

Jenny Drnevich (09:56:16) (in thread): > My first thought was to suggest the --force-cells option, but it looks like cellranger vdj doesn’t have that one. It also doesn’t seem to have any options to tweak the cell calling?

Lambda Moses (18:08:06): > Definitely can’t make it to Bioc 3.18. For Bioc 3.19, I’d like to submit a package vendoring @Aaron Lun’s scran-related headers to Bioc as part of the scran refactoring effort. The headers can be seen here: https://pypi.org/project/assorthead/ Here are some names I thought of. Please use the number emoji to vote for a name you prefer for the package or suggest other names: :one: scranheaders :two: Rlibscran :three: SH (inspired by BH for Boost headers) :four: scHeaderverse (or contracted as scHeaverse?)

2023-10-08

Peter Hickey (17:55:22) (in thread): > is what you need in the outs/multi/vdj_b/all_contig_annotations.csv file produced by Cell Ranger (this may assume you ran cellranger multi)? This is for BCR sequencing, I think for TCR you can replace vdj_b by vdj_t
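
Once a table of all contigs is available, counting TRA and TRB chains per emptyDrops-called barcode is a small tabulation. The sketch below uses base R on a toy data frame. The column names `barcode`, `chain` and `productive` are assumed to match the Cell Ranger contig annotation file and should be checked against the real header. Restricting to productive contigs is also an assumption rather than something stated in the thread, and `emptydrops_cells` is a hypothetical vector of barcodes.

```r
# Toy stand-in for all_contig_annotations.csv (hypothetical rows).
contigs <- data.frame(
  barcode = c("AAAC-1", "AAAC-1", "AAAG-1", "AAAG-1", "AAAG-1",
              "AAAT-1", "AAAT-1", "AACC-1", "AACC-1"),
  chain = c("TRA", "TRB", "TRA", "TRA", "TRB", "TRB", "TRA", "TRB", "TRB"),
  productive = c("true", "true", "true", "true", "true",
                 "true", "false", "true", "true"),
  stringsAsFactors = FALSE
)

# Barcodes called as cells by emptyDrops (hypothetical).
emptydrops_cells <- c("AAAC-1", "AAAG-1", "AAAT-1", "AACC-1")

# Keep productive contigs from emptyDrops-called barcodes only.
keep <- contigs$barcode %in% emptydrops_cells &
  tolower(contigs$productive) == "true"
kept <- contigs[keep, ]

# Tabulate chains per barcode; fixing the levels guarantees both columns exist.
# Barcodes without any kept contig do not appear in this table.
kept$chain <- factor(kept$chain, levels = c("TRA", "TRB"))
chain_counts <- table(kept$barcode, kept$chain)

# Select barcodes with exactly 1 TRB and 1-2 TRA chains.
pass <- chain_counts[, "TRB"] == 1 & chain_counts[, "TRA"] %in% c(1, 2)
selected <- rownames(chain_counts)[pass]
print(selected)
```

With the real file, `contigs` would come from `read.csv()` on the path mentioned above, and the barcode format (for example the `-1` suffix) would need to match between the two tools.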

2023-10-09

Steve Lianoglou (10:00:01) (in thread): > How about “schead” and the hex sticker can be an old wooden shed? > > Would anyone ever look at “schead” and think to pronounce it as “shed”? No. > > Does it make sense to represent something that serves as the core of a bleeding edge tech with an old wooden shed? Also no. > > Just having some fun here … :zany_face: But a genuine thank you for taking on this effort!

2023-10-11

Alan O’C (03:23:19) (in thread): > > Would anyone ever look at “schead” and think to pronounce it as “shed”? > Yes

2023-11-02

Sunil Poudel (10:51:57): > @Sunil Poudel has joined the channel

Sunil Poudel (10:59:35): > Our next seminar on Mon, Nov 06, 3-4 AM ET, will feature Ricard Argelaguet, who will discuss principles and challenges in single-cell data integration. Join us under the zoom link provided in the flyer below! - Attachment (Nature): Computational principles and challenges in single-cell data integration > Nature Biotechnology - As the number of single-cell experiments with multiple data modalities increases, Argelaguet and colleagues review the concepts and challenges of data integration. - File (PDF): CCB_SeminarFlyer_Ricard.pdf

2023-11-08

Wes W (08:39:44): > what would be the smallest number of cells in a population you would feel confident defining in a single cell RNAseq dataset? I have looked at some small populations over the years (and published), especially biopsies from cancer patients who are responding to treatment, where sometimes out of 8000 cells recovered, only 25 are tumor cells from a responding patient at the 3 month post-treatment time point. > > I had a conversation with someone recently who suggested that you couldn’t possibly do an analysis on a cell number this small and trust the results. I told them my reasons for why I had in the past, explaining that as long as you understand the limitations of any data set when interpreting the results, the data is what it is, only up to those limitations… and also maybe an anecdote about “How back in my day, we did single cell genomics in 384 well plates, we never had 8000 cells…” but it got me thinking, maybe there are recent accepted practices I could be behind on. am I living in the dark ages? > > What is the smallest population of cells you would try and do an analysis of? what model would you use? (I normally double check with a second model that is less sensitive to cell number and also do a pass after downsampling to compare the cancers; is this an invalid approach?)

Alik Huseynov (09:27:15): > 8K might be ok, depends on what that data represents and how many patients contribute to that small sample. I have heard some say that at least 30 cells per patient should be present for DEG analysis or similar. > Within atlas-level data you might get clusters with 5-8K cells or fewer than that; those could sometimes be rare cell populations. Doing stats/comparisons with small sample sizes is not advisable.

Wes W (10:08:00): > it would be 25 cells per patient per time point at the low end, with active disease patients having a lot more of course to compare… > > Raph has a paper from WAY back in the day where they looked at small cell number stuff and I remember he or his post-doc said the p values are mostly trustworthy as long as the frequency of the gene multiplied by the number of cells was over 16; in that case they didn’t have a problem. > > the question is, how outdated is this thought process. > > rare cell subtypes exist, and without a meaningful way to enrich for the population, we are limited by our detection technology. the great part about single cell (flow, RNA, CITE, ATAC, etc) is the resolution to find those cells, and hopefully characterize them.
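
The rule of thumb recalled in this message (detection frequency of a gene multiplied by the number of cells should exceed 16) is easy to write down. The sketch below simply encodes the heuristic as stated in the chat. It has not been checked against the paper being remembered, and the gene names and frequencies are made up for illustration.

```r
# Heuristic as recalled in the message above: frequency x cells > threshold.
passes_rule <- function(detect_freq, n_cells, threshold = 16) {
  detect_freq * n_cells > threshold
}

# Hypothetical detection frequencies for three genes in a 25-cell population.
detect_freq <- c(geneA = 0.9, geneB = 0.5, geneC = 0.7)
n_cells <- 25

result <- passes_rule(detect_freq, n_cells)
print(result)

# Smallest detection frequency that would pass for this population size.
min_freq <- 16 / n_cells
print(min_freq)
```

Under this heuristic, a 25-cell population would only support genes detected in more than 64% of the cells.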

Kasper D. Hansen (13:29:27): > Respectfully, this is a meaningless question by itself. You need to describe the kind of conclusions or statement you’re trying to establish. For example, having 25 cells from a cell population in an individual and attempting to describe its average expression profile will obviously lead to some errors (which you can try to get a handle on by downsampling bigger cell populations).
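
The downsampling idea mentioned here can be sketched in base R: repeatedly draw 25 cells from a larger population and look at how far the subsample mean lands from the full-population mean. The expression values are simulated from a normal distribution purely for illustration; real data would replace `big_pop`, and the population size and number of repeats are arbitrary choices.

```r
set.seed(42)

# Hypothetical log-expression of one gene in a large population of 2000 cells.
big_pop <- rnorm(2000, mean = 2, sd = 1)
full_mean <- mean(big_pop)

# Draw 25 cells without replacement, many times, and record the subsample mean.
n_small <- 25
n_rep <- 500
sub_means <- replicate(n_rep, mean(sample(big_pop, n_small)))

# Error of the 25-cell estimate relative to the full population.
errors <- sub_means - full_mean
summary_err <- c(sd = sd(errors), max_abs = max(abs(errors)))
print(summary_err)
```

The spread of `errors` gives a rough sense of how much an average expression profile from 25 cells can move by sampling alone, before any between-patient variation is considered.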

Kasper D. Hansen (13:30:24): > The bigger question is usually how this cell population varies between individuals and if you can even identify the same population. Here, I usually find it pretty insightful to do some thought experiment, like what would perfect data look like.

Kasper D. Hansen (13:31:06): > Finally, as always in science, you have replicates and controls.

Wes W (20:07:16): > in this example the population is easy to find, very distinct. just limited cells. well defined and specific markers. just comparing them between patients. not finding novel populations or anything in this example. I tried to mention in the example I gave that all data is limited in the context of it and limitations of the collection and sample number. All interpretations of the data would be in the context of those limitations.

2023-12-01

Tram Nguyen (10:16:31): > @Tram Nguyen has joined the channel

2023-12-27

Cindy Reichel (14:38:58): > @Cindy Reichel has joined the channel

2024-01-10

Bernie Mulvey (15:04:10): > @Bernie Mulvey has joined the channel

2024-01-11

Nilesh Kumar (12:01:22): > @Nilesh Kumar has joined the channel

2024-01-27

atongsa miyamoto (04:52:59): > @atongsa miyamoto has joined the channel

2024-02-02

brian capaldo (10:31:33): > I have what should be a very easy question to answer, but google is failing me. Is there a convenience function for converting cell_data_set to SingleCellExperiment? (long explanation: taking over an analysis that was done in seurat, seurat conversion to SCE does not carry over everything, seurat to cds does though, but casting cds as sce doesn’t seem to do the trick either :crossed_fingers: I can convince the project lead to do everything in sce going forward)

brian capaldo (11:10:42) (in thread): > ended up just manually filling in the constructor, a bit obnoxious, but it worked

Jared Andrews (11:34:27) (in thread): > For the future, there’s also sceasy and zellkonverter, the latter of which isn’t useful in this case. I can’t remember if sceasy does a better job carrying stuff over than the native seurat method (which has been mildly to majorly broken for about 3 years now).

brian capaldo (11:38:48) (in thread): > i used to know about those too! been avoiding seurat for so long, completely forgot about those

2024-02-06

brian capaldo (10:36:46): > Is there a protocol for installing velociraptor inside an existing conda env?

Alexander Bender (10:46:13) (in thread): > Hope I do not offend anyone by saying this but I would really run scvelo via reticulate. It’s just a few lines of code that is required and sce-anndata conversion is seamless with zellkonverter. Velociraptor brings in quite some overhead with this heavy conda environment that comes with it.

brian capaldo (10:56:25) (in thread): > I’m not opposed to that, though I am still interested in knowing if I can run R from a conda env and use velociraptor or other basilisk type of things

2024-02-07

Vince Carey (06:33:11) (in thread): > @brian capaldo I would conjecture that if the R used within your conda env is current and has BiocManager installed, you could install velociraptor with it. You would want to be sure BiocManager::valid() returns TRUE. If this is not correct please provide some detailed error traces. The biocpython channel might be a good place to discuss this.

brian capaldo (09:04:52) (in thread): > yeah, so the major issue is that I can’t figure out how to get it to use velociraptor’s conda env, instead of my conda env where the rest of R is installed

brian capaldo (09:05:36) (in thread): > it keeps defaulting to mine, and running into missing modules or commands that have changed

2024-02-16

Jovana Maksimovic (03:00:18): > @Jovana Maksimovic has joined the channel

2024-02-28

Jenny Drnevich (11:00:57): > Anyone want to try this? https://www.nature.com/articles/s41592-024-02201-0 - Attachment (Nature): scGPT: toward building a foundation model for single-cell multi-omics using generative AI > Nature Methods - Pretrained using over 33 million single-cell RNA-sequencing profiles, scGPT is a foundation model facilitating a broad spectrum of downstream single-cell analysis tasks by transfer…

Ludwig Geistlinger (11:04:19) (in thread): > we actually did start to try this @Tyrone Lee

Jenny Drnevich (11:07:52) (in thread): > Are you installing it yourself or trying the webapps? One thing you don’t see a lot of with all the new AI models is discussion of compute/memory requirements

Tyrone Lee (11:09:55): > @Tyrone Lee has joined the channel

Ludwig Geistlinger (11:11:00) (in thread): > we’ve been installing it on our HPC cluster with GPU access. as soon as you do model training and fine-tuning things get intense and you’ll need GPUs to get somewhere.

Ludwig Geistlinger (11:11:14) (in thread): > Also @Andrew Ghazi

Andrew Ghazi (11:11:17): > @Andrew Ghazi has joined the channel

Jenny Drnevich (11:12:02) (in thread): > I can see that being worthwhile if you’re at a place doing lots of human single cell. Me, not so much!

Ludwig Geistlinger (11:14:07) (in thread): > yes exactly. it’s human only. at least currently.

Tyrone Lee (11:14:18) (in thread): > As an approximation, geneformer took 16 hours to fine-tune with 4 A100s on the core HLCA (584.9K cells). > > scGPT already comes with a pre-trained model on 2.1 million lung cells. If your dataset shares similar cell type context with the pretrained organ-specific models, these models can usually demonstrate competitive performance as well

Jenny Drnevich (11:22:50) (in thread): > I still wonder how much all this AI is going to lock us into what we think we know now. “Expertly curated cell types” - my guess is that some of the data in the atlas is annotated incorrectly. I asked a LLM person who recently gave a talk here about accounting for uncertainty in the training data and they had never thought about it!

Kasper D. Hansen (11:26:25): > @Kasper D. Hansen has left the channel

2024-03-04

Tim Triche (12:44:56): > as expected, neither scGPT nor Geneformer offers a significant advance over scVI or other simpler methods in actual practice: https://www.biorxiv.org/content/10.1101/2023.10.16.561085v2.full

Tim Triche (12:45:23): > you will not see this paper in Glam Methods because if you didn’t have to set the Amazon on fire to train a model, what would the point be, and where would the indirects come from?

Tim Triche (12:45:46): > I say this as someone who contributed to the embeddings/foundation models hoohah with CZI.

2024-03-05

Pratibha Panwar (01:33:48): > @Pratibha Panwar has joined the channel

2024-03-06

Luke Zappia (04:03:00): > @Luke Zappia has joined the channel

2024-03-11

Dania Machlab (11:40:59): > @Dania Machlab has joined the channel

2024-03-12

Peter Hickey (00:11:49): > @Aaron Lun quoted in Nature. You’ve hit the big time. https://www.nature.com/articles/d41586-024-00725-1 - Attachment (Nature): No installation required: how WebAssembly is changing scientific computing > Enabling code execution in the web browser, the multilanguage tool is powerful but complicated.

Jonathan Griffiths (04:18:36) (in thread): > And the quotes are about as stereotypically Aaron as you could imagine

2024-03-27

abhich (05:45:12): > @abhich has joined the channel

2024-03-28

Ivan Osinnii (02:41:10): > @Ivan Osinnii has joined the channel

2024-04-01

Ivan Osinnii (14:54:49): > Hi everyone! I am trying to annotate cells in a query dataset stored as an sce object whose rownames are in Ensembl format. I use the SingleR package, and my reference dataset was made in Seurat with gene symbol rownames. Since the sce object already contains a second set of row names as symbols, I reassign all the Ensembl names with a blunt approach like this: Lett.liver.sce@assays@data@listData[["counts"]]@Dimnames[[1]] <- Lett.liver.sce@rowRanges@elementMetadata@listData[["SYMBOL"]] > Lett.liver.sce@assays@data@listData[["logcounts"]]@Dimnames[[1]] <- Lett.liver.sce@rowRanges@elementMetadata@listData[["SYMBOL"]]

Ivan Osinnii (14:56:23): > After I do this, format ref dataset and run SingleR function for annotation it returns an error that there are no common feature names. What could be a problem? I thought since there are multiple layers containing rownames, I missed one and this one was used by SingleR for matching, but it does not seem to be the case

Aaron Lun (17:56:34): > SCE row names take priority over assay names; just do rownames(Lett.liver.sce) <- rowData(Lett.liver.sce)$SYMBOL

Aaron Lun (17:57:05): > reaching inside the SCE and modifying the assay names won’t work if the SCE’s rownames are unchanged.

Aaron Lun (17:57:22): > (technically you can force it to work by extracting the assays with withDimnames=FALSE, but SingleR won’t do that.)
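A minimal sketch of the accessor-based fix described above, on a toy object. The gene IDs and symbols are placeholders, not real identifiers, and the toy SCE stands in for Lett.liver.sce.

```r
library(SingleCellExperiment)

# Toy SCE with Ensembl-style row names and a SYMBOL column in rowData.
# IDs and symbols are illustrative placeholders.
counts <- matrix(1:12, nrow = 3,
                 dimnames = list(c("ENSG_A", "ENSG_B", "ENSG_C"),
                                 paste0("cell", 1:4)))
sce <- SingleCellExperiment(
    assays = list(counts = counts),
    rowData = DataFrame(SYMBOL = c("GeneA", "GeneB", "GeneC"))
)

# Change the row names through the accessor, not through internal slots.
rownames(sce) <- rowData(sce)$SYMBOL

# Assays extracted with the default settings carry the object's row names.
rownames(counts(sce))
```

On real data the SYMBOL column may contain NA or duplicated values, which make poor row names; scuttle::uniquifyFeatureNames() is one option for handling that before the assignment.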

2024-04-02

Ivan Osinnii (03:27:39) (in thread): > Thanks!

Ivan Osinnii (16:41:06): > I have successfully annotated an immune cell subtype in a dataset of lymphoid cells from liver using SingleR and the MonacoImmuneData reference. Now, when I try to annotate the same query dataset with a recently published dataset in which this immune cell subtype was FACS-sorted (and is hence a pure population), I have a problem. All my query cells are labeled with this cell subtype, although I know that only a subset of cells truly match. I could imagine that SingleR tries to force labels on all cells, even if the scores are low. But when I look at the scores, they are all equal among all cells (I do not know why). I assume that there is a batch effect between the two datasets and SingleR goes crazy. Does it mean that I need to correct the batch effect first and then proceed to SingleR?

2024-04-03

Aaron Lun (13:42:46): > no need to correct for a batch effect, see https://github.com/LTLA/SingleR/issues/129#issuecomment-638542261 - Attachment: Comment on #129 Dealing with Batch effect corrected data > > What’s the input of SingleR (or other automatic cell annotation tools) if data were batch-corrected? > > Counts, counts, counts. (Or monotonic transformations thereof.) > > Don’t give SingleR() batch-corrected values, there is no guarantee that the correction preserves the ordering of expression values within each cell. In fact, there is no guarantee that the corrected expression values retain relevant biological information, see my comments here (https://osca.bioconductor.org/integrating-datasets.html#using-corrected-values). > > To me, the only purpose for corrected expression values is to (i) generate a common set of clusters across all batches for easier interpretation and (ii) make a pretty t-SNE plot. All other analyses that can use the original counts should do so. For example, DE analyses should use the original counts (or log-expression values) and block on the batch to account for the batch effect. > > > But I wonder what if the expression indeed affected by batch effect? Why wouldn’t we apply the corrected data in this situation? > > Now, that’s the thing. SingleR is already correcting for a batch effect - the difference between your single-cell dataset and the references! These are big technical differences: if you’re using the in-built references, then we’re comparing single-cell data with microarrays. That’s ancient history right there. I mean, that’s a technology almost as old as me (or more, depending on how you count it). > > If SingleR can deal with these massive technological differences, your little between-batch effects are nothing in comparison. So just let SingleR figure it out. 
As mentioned above, giving it corrected data may actually make the situation worse because the correction imposes many assumptions about which subpopulations should match up across batches. > > > The batch correction only influences the tSNE/UMAP clustering result > > I have to chip in here. Clustering is an independent process from visualization by t-SNE/UMAP. It is generally not a good idea to cluster on the t-SNE/UMAP coordinates, see arguments here (https://osca.bioconductor.org/dimensionality-reduction.html#visualization-interpretation). > > > and thus cells annotated with the same cell type may be separated in different clusters. > > Happens on occasion. For example, annotation separates T cells by CD4+ and CD8+, but the clustering might partition them into naive/stimulated grouping because that’s the bigger axis of variation. No one’s doing anything wrong here, both procedures are doing their job but the annotation has access to prior biological knowledge (through the sweat and tears of whoever generated and annotated the reference dataset) and the clustering does not. > > Personally, I think it’s more interesting when the two don’t match up, as this tells me that there is something novel in the dataset that isn’t captured by the existing annotation. If it’s exactly the same… well, congratulations, you’ve just recapitulated known biology.
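The "monotonic transformations" remark in the quoted comment can be illustrated with a small base R sketch. It assumes, as the comment's emphasis on ordering suggests, that the scores rest on rank (Spearman) correlations. All values are simulated, and the "adjusted" vector is a made-up stand-in for a batch correction, not the output of any real method.

```r
set.seed(42)

# Toy expression for one query cell and one reference profile over 50 genes.
query_counts <- rpois(50, lambda = 10)
ref_profile  <- query_counts + rnorm(50, sd = 3)

# Rank correlation on raw counts and on log-transformed counts.
rho_raw <- cor(query_counts, ref_profile, method = "spearman")
rho_log <- cor(log1p(query_counts), ref_profile, method = "spearman")

# log1p is strictly increasing, so the within-cell ranks (and rho) are unchanged.
isTRUE(all.equal(rho_raw, rho_log))

# A per-gene adjustment can reorder genes within the cell, so nothing
# guarantees that the rank correlation survives it.
adjusted <- log1p(query_counts) + rnorm(50, sd = 1)
rho_adj  <- cor(adjusted, ref_profile, method = "spearman")
```

This is why raw counts and log-counts are interchangeable as input here, while corrected values are not.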

Aaron Lun (13:56:17): > cells should not have the same scores. suggest you check the intersection of genes between the two datasets.
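One way to run the suggested intersection check, sketched in base R. The helper name report_overlap is hypothetical and the gene identifiers are placeholders; with real objects the two inputs would be rownames() of the query and of the reference.

```r
# Hypothetical helper: summarise how many feature names two datasets share.
report_overlap <- function(query_genes, ref_genes) {
    shared <- intersect(query_genes, ref_genes)
    list(
        n_shared       = length(shared),
        frac_of_query  = length(shared) / length(unique(query_genes)),
        frac_of_ref    = length(shared) / length(unique(ref_genes)),
        example_shared = head(shared)
    )
}

# Placeholder names: a query still carrying Ensembl-style IDs shares nothing
# with a symbol-based reference.
query_ids     <- c("ENSG_A", "ENSG_B", "ENSG_C")
query_symbols <- c("GeneA", "GeneB", "GeneC")
ref_symbols   <- c("GeneA", "GeneB", "GeneD", "GeneE")

report_overlap(query_ids, ref_symbols)$n_shared
report_overlap(query_symbols, ref_symbols)$n_shared
```

A very small overlap (or a mix of ID types) is worth ruling out before interpreting the scores themselves.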

2024-04-05

Ivan Osinnii (22:30:21) (in thread): > Thanks. I checked and selected only intersected genes for SingleR. If I subset my ref dataset and leave only 1 cell type there, then after SingleR analysis all cells in my query dataset get a score of 1 and are, logically, uniformly labeled. I think this is total nonsense. I tried leaving 2 cell types in the ref and repeating the SingleR protocol, and this is what I get: around 5% of query cells are labeled as the first cell type, around 80% as the second cell type and the remaining 15% as “Other”. I like that there is now at least that “Other”, but it again goes against common sense. I see that the clusters which were labeled as NK cells (after a good projection on MonacoImmuneData) are now labeled as T-cells. So it is still not optimal. And this is the score distribution I get: it is all very tightly distributed around 0.55. I expect that MAIT and Tmem cells are similar, but these “Others” should get a very different score, since there are NK cells and even a few B cells there - File (JPEG): Rplot.jpeg - File (JPEG): Rplot02.jpeg - File (JPEG): Rplot03 scores.jpeg

2024-04-16

Vince Carey (11:01:27): > Anyone interested in the UCSC browser for SC data? e.g., https://cells.ucsc.edu/?ds=neuro-degen-atac+genes In the overview it is noted that browsers can be devised for Seurat or scanpy objects….

2024-04-18

Frederick Tan (06:46:36) (in thread): > Several years ago, we “deployed” an older version > * https://cmo.carnegiescience.edu/cb > At that time, it was definitely easier to deploy than cellxgene, but the functionality wasn’t as developed (or intuitive). If that’s still the state, then which one depends on need.

2024-04-25

Mercedes Guerrero (05:02:21): > @Mercedes Guerrero has joined the channel

2024-04-29

Amarinder Singh Thind (08:29:25): > @Amarinder Singh Thind has joined the channel

Jacqui Thompson (19:00:14): > @Jacqui Thompson has joined the channel

2024-05-01

Peter Hickey (18:04:53): > Anyone who’s upgraded to Bioc 3.19, would you please try this and let me know if it works: > > library(celldex) > ImmGenData() > > Context: https://github.com/LTLA/celldex/issues/23

Charlotte Soneson (23:23:03) (in thread): > Works fine for me: > > > library(celldex) > > ImmGenData() > class: SummarizedExperiment > dim: 22134 830 > metadata(0): > assays(1): logcounts > rownames(22134): Zglp1 Vmn2r65 ... Tiparp Kdm1a > rowData names(0): > colnames(830): > GSM1136119_EA07068_260297_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_1.CEL > GSM1136120_EA07068_260298_MOGENE-1_0-ST-V1_MF.11C-11B+.LU_2.CEL ... > GSM920654_EA07068_201214_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_1.CEL > GSM920655_EA07068_201215_MOGENE-1_0-ST-V1_TGD.VG4+24ALO.E17.TH_2.CEL > colData names(3): label.main label.fine label.ont > > BiocManager::version() > [1] '3.19' >

2024-05-02

Peter Hickey (00:22:25) (in thread): > Thanks, Charlotte! Seems like it’s something weird with my setup. What’s strange is that it’s occurring at home on my personal laptop and when using the server on the work network

Charlotte Soneson (02:08:38) (in thread): > Mysterious. We are still in the process of installing the new release at work, but in our “still-devel” 3.19 installation, it works nicely there as well. And the same from my laptop (with 3.19 release installed) on the work network.

Peter Hickey (02:17:53) (in thread): > Mysterious is right. The same ArtifactDB infrastructure also now serves the scRNAseq datasets and that was working for me in recent weeks when 3.19 was still ‘devel’. But it too is borked for me :upside_down_face: I’ll keep digging

Federico Marini (03:57:51) (in thread): > also on R-devel + 3.19, no error here. > A colleague had some issue but it got solved with a fresh install to have the cache newly built

Vince Carey (07:53:47): > with BiocManager::valid() returning TRUE, it does produce an SCE

Vince Carey (09:34:34) (in thread): > It would be great to collect the details on these problems so that we can work on cache diagnostics and robustness.

Aaron Lun (14:22:06) (in thread): > just updated R to 4.4, no problems.

2024-05-06

Michal Kolář (11:56:57): > @Michal Kolář has joined the channel

2024-05-07

Carlo Pecoraro (12:12:35): > @Carlo Pecoraro has joined the channel

2024-05-08

Aaron Lun (02:19:38) (in thread): > while this is fixed, the new direct HTTP request method will be subject to cloudflare’s (generous) worker limits on the free plan - 100k requests a day, with a burst of 1000 requests/minute. we’re very much below this right now but will need to monitor this to avoid DoS’ing ourselves.

Vince Carey (06:34:25) (in thread): > Thanks for the tip Aaron. Can you give me a pointer on monitoring? Once people learn about it they will write scripts that download lots of things unnecessarily (we are fighting this for packages) and throttling methods may be needed?

Aaron Lun (10:59:45) (in thread): > i just use cloudflare’s dashboard, with some automated notifications as I approach some percentage of the limit

2024-05-09

Philippe Laffont (07:40:06): > @Philippe Laffont has joined the channel

2024-05-15

Sunil Nahata (08:30:58): > @Sunil Nahata has joined the channel

2024-05-31

Aaron Lun (11:30:30): > FYI most singleR-related repos have been moved to https://github.com/SingleR-inc

2024-06-09

Saejeong Park (10:24:08): > @Saejeong Park has joined the channel

2024-06-10

Harshitha (09:35:54): > @Harshitha has joined the channel

2024-06-11

Ziru Chen (04:36:44): > @Ziru Chen has joined the channel

2024-06-19

Maria Doyle (13:26:58): > @Maria Doyle has joined the channel

2024-06-30

Nicolas Peterson (13:09:18): > @Nicolas Peterson has joined the channel

2024-07-11

Hothri Moka (07:20:25): > @Hothri Moka has joined the channel

2024-07-17

Michael Lynch (07:41:13): > @Michael Lynch has joined the channel

2024-08-04

sunnyday (14:58:43): > @sunnyday has joined the channel

2024-08-19

Rema Gesaka (09:38:52): > @Rema Gesaka has joined the channel

2024-08-31

Zahraa W Alsafwani (23:12:23): > @Zahraa W Alsafwani has joined the channel

2024-09-20

Camille Guillermin (09:30:34): > @Camille Guillermin has joined the channel

2024-09-24

Zhu Yujia (11:33:11): > @Zhu Yujia has joined the channel

2024-11-19

Jenny Drnevich (10:11:43): > I’ve been struggling with many difficult single cell projects lately. I had a dream last night that there were haunted/possessed single cells floating around in the air that I needed to capture and destroy before they infected the good cells! :rolling_on_the_floor_laughing: @Wes W - not quite a single cell monster but definitely the same vibe!

Maria Doyle (10:32:27): > Something like this? :sweat_smile: - File (WebP): high-tech laboratory battling.webp

Jenny Drnevich (11:14:43): > Now I am going to have real nightmares :grinning: They were much smaller - like glowing dust motes floating around

Wes W (14:08:46) (in thread): > haha!!! > > i feel you. i have single cell spatial transcriptomics project right now that wont leave my brain at night

2024-11-22

Meg Urisko (11:50:43): > @Meg Urisko has joined the channel

Meg Urisko (11:51:33): > :wave: Hi! I’m Meg, a user researcher at the Chan Zuckerberg Initiative :wave: I am running a survey to understand how folks use the CZ CELLxGENE Discover Census. We want to learn more about what data you’re interested in and how you download it. :arrow_right: :link: We would appreciate it if you could take our survey :link: :arrow_left: :point_up_2: It should take about 10 minutes to complete, and we can offer you a $10 Amazon eGift card as a thank-you for filling it out. We’d love to hear from as many Census users as possible, so please feel free to share this link with anyone you know who is using the Census :gratitude-thank-you:

2024-11-27

brian capaldo (10:12:37): > Maybe I missed it, but is there a way to ensure plotUMAP and its cohorts plot low to high when coloring by continuous values? I’m running into issues where cells with lower expression values are washing out the cells with higher expression values. Obviously a violin plot is an alternative for this particular issue, but the collaborator wants the UMAP feature plots.

Jared Andrews (10:20:40): > can use dittoDimPlot with order = "increasing" as an alternative

Tim Triche (11:36:24) (in thread): > can we stick this on Bluesky and see if it sticks better
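The idea behind plotting "low to high" can be sketched directly in ggplot2, which draws points in row order: sorting the data frame by the colour variable puts the highest values on top. The coordinates and expression values below are simulated, and whether a particular plotting wrapper preserves the row order it is given is an assumption that should be checked for that wrapper.

```r
library(ggplot2)

set.seed(1)
# Simulated embedding with a skewed "expression" value per cell.
df <- data.frame(
    UMAP1 = rnorm(200),
    UMAP2 = rnorm(200),
    expr  = rexp(200)
)

# Sort so that low-expression cells are drawn first and high-expression
# cells last, which leaves the high values visible on top.
df <- df[order(df$expr), ]

p <- ggplot(df, aes(x = UMAP1, y = UMAP2, colour = expr)) +
    geom_point()
```

Printing p renders the plot; reversing the sort (decreasing = TRUE) would instead bury the high values.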

Meg Urisko (12:38:36) (in thread): > Yes!

Meg Urisko (12:48:13) (in thread): > Thanks, Tim - you’re the best!

Tim Triche (12:48:25) (in thread): > What did I do?

2024-12-28

Pascal-Onaho (07:55:51): > @Pascal-Onaho has joined the channel

2025-01-09

Ammar Sabir Cheema (11:40:44): > @Ammar Sabir Cheema has joined the channel

2025-01-22

Alan O’C (17:52:26) (in thread): > Possibly of use, several months too late https://github.com/alanocallaghan/scater/pull/215 - Attachment: #215 Add min and max value (#197)

2025-02-06

Aaron Lun (22:41:27): > see discussion at https://github.com/MarioniLab/DropletUtils/pull/119 re spinning off the 10X-specific bits and pieces from DropletUtils. if anyone is interested in joining maintenance of this new package, please respond on the PR.

2025-02-10

Mark (00:38:51): > @Mark has joined the channel

2025-02-20

António Domingues (14:39:34): > @António Domingues has joined the channel

António Domingues (15:29:42): > I have a general question about experimental design and avoidance of confounding batch effects in single-cell, specifically in longitudinal experiments (time series). I looked up quite a few resources / papers, but none was particularly clear. > > There seems to be little to no mention in methods sections of whether samples collected at different time points were put straight through library prep (and sequencing), or stored until the experiment was done so that library prep and sequencing could be performed simultaneously (to avoid library / sequencing run as a confounder). > > I understand there is a body of work in the field on batch correction and data integration, but very little reference to experimental designs that avoid these confounders in the first place. Is this unfair? And does the field deal with experiments where a library needs to be prepared immediately after cell collection? (timepoint 1 -> library prep 1, timepoint 2 -> library prep 2)

2025-02-21

Pedro Sanchez (04:54:38) (in thread): > Hi :smile: IDK which technology you’re planning to use, but for longitudinal profiling I would recommend fixation at each individual time point and then processing all samples in the same run. There are several fixatives out there but this recent publication is a good one. Also, I think 10X Genomics commercialise some others, in case you don’t want to make it in-house

António Domingues (11:48:04) (in thread): > 10x :) but I don’t think we can fix the cells. I’ll ask the wetlab. Good point, thanks

2025-02-24

Pedro Sanchez (04:32:02) (in thread): > Just in case this helps: https://www.10xgenomics.com/support/universal-three-prime-gene-expression/documentatio[…]rep/cell-fixation-protocol-for-gem-x-single-cell-3-5-assays

2025-02-25

Tim Triche (08:44:09) (in thread): > fixing is one great approach, blocking is another (i.e. batched randomization and use of Vireo or similar to demultiplex individuals based on genome; hashtagging but cheaper, only works for unrelated humans and requires a bit of thought)

Tim Triche (08:46:33) (in thread): > spike-ins are rarely discussed but probably ought to be: https://www.nature.com/articles/s41592-022-01446-x#Sec8 - Attachment (Nature): Molecular spikes: a gold standard for single-cell RNA counting > Nature Methods - This work presents an RNA spike-in that can be used to improve RNA counting in single-cell RNA-sequencing (scRNA-seq) analysis, as well as to report the performance of scRNA-seq…

Tim Triche (08:53:14) (in thread): > the joint UKBB/Technopole project is dealing with a lot of these issues, but I’m having trouble finding a link to it. I think the PI is at HT in Milan, she presented at CZI a couple years in a row and her talk was memorable for discussing these issues.

António Domingues (16:15:27) (in thread): > @Tim Triche, demultiplex by individuals we will be doing :slightly_smiling_face: Cool to know it’s something fairly common. And yes, I am putting a lot of thought into blocking and experimental design. I think I have it now.

António Domingues (16:18:13) (in thread): > As for molecular spike-ins, this is something we are not considering. What the core facility does routinely though, and this was also news to me, is to spike in a reference cell suspension in all libraries, cells close to the tissue(s) in the study, to detect potential issues (batch effects amongst them)

Tim Triche (17:22:25) (in thread): > Any control is a good control

2025-02-26

Pierre-Luc Germain (06:10:33): > just a general comment for people using scDblFinder in their studies: in the scDblFinder version (1.20) initially shipped with Bioconductor 3.20 (current release), a mistake was introduced in the default doublet rate argument (leading to the thresholding procedure calling more doublets). This has been fixed in the Bioconductor release, but you should update your package.

Peter Hickey (16:48:27) (in thread): > Thanks for the note, Pierre-Luc

2025-03-03

Peter Hickey (00:10:43): > I’m looking for a mouse CITE-seq (i.e. gene expression and antibody panel, ideally using TotalSeq-A Mouse Universal Cocktail but at this point I’ll take anything) dataset that includes immune cells with cell type annotations to use as a reference with SingleR. SingleR/celldex have the gene expression-only ImmGen and MouseRNASeq reference datasets. > Can’t find anything suitable in cellxgene, UCSC cell browser, 10x’s own public datasets, or Azimuth. > Any pointers?

2025-03-04

Zahraa W Alsafwani (14:26:30) (in thread): > I would suggest looking at CellMarker 2.0 https://academic.oup.com/nar/article/51/D1/D870/6775381

Peter Hickey (14:43:18) (in thread): > Thanks, Zahraa, but unfortunately that doesn’t match what I’m looking for. As far as I can tell, CellMarker 2.0 provides a database of marker genes for various cell types and tissues. > I’m looking for a dataset with: > 1. The count matrix > 2. The cell type annotations > 3. Gene expression and antibody derived tag (ADT) data

2025-03-06

Ammar Sabir Cheema (08:44:55): > @Peter Hickey you can check for the dataset from this paper, https://pubmed.ncbi.nlm.nih.gov/32269339/, > which was used to validate this pipeline (https://link.springer.com/protocol/10.1007/978-1-0716-2938-3_22) - Attachment (PubMed): A conserved dendritic-cell regulatory program limits antitumour immunity - PubMed > Checkpoint blockade therapies have improved cancer treatment, but such immunotherapy regimens fail in a large subset of patients. Conventional type 1 dendritic cells (DC1s) control the response to checkpoint blockade in preclinical models and are associated with better overall survival in patients w … - Attachment (SpringerLink): Harnessing Single-Cell RNA Sequencing to Identify Dendritic Cell Types > Dendritic cells (DCs) orchestrate innate and adaptive immunity, by translating the sensing of distinct danger signals into the induction of different effector lymphocyte responses, to induce the defense mechanisms the best suited to face the threat. Hence, DCs are…

Peter Hickey (17:22:29) (in thread): > As far as I can see, they only measured DCs from mice for the CITE-seq. Unfortunately I need a broader sampling of immune cell types than this, but thank you.

2025-03-11

Sanika Menkudale (07:12:09): > @Sanika Menkudale has joined the channel

2025-03-13

Mihai Todor (06:59:44): > @Mihai Todor has joined the channel

2025-04-03

Sean Davis (08:47:57): > This falls into the category of “do my homework for me.” How do people find, manage, store, use, and reuse cell type marker genes and marker gene sets?

Jared Andrews (09:01:56): > Store em in a big ol GMT file and filter out sets as needed. We do it for cell markers, gene sets for GSEA or enrichment analysis, etc.

Tim Triche (09:04:47): > I see someone decided to kick the “factorize vs compare to published lists” hornets nest this morning

Sean Davis (09:07:41) (in thread): > I’m not really worried about the hornet nest today. Just the single hornet (or just a few hornets) representing marker gene set-adjacent management and use.

Tim Triche (09:08:48) (in thread): > GMT it is, then! We find GSEA type analyses on factors to be super handy, so it’s not like I am opposed to them.

Tim Triche (09:09:55) (in thread): > I just wonder a lot about the differences between cell states and fates, and I’ve seen way too many arbitrarily gated flow datasets to trust academic markers

Sean Davis (09:10:37) (in thread): > Do you do anything tricky to manage gene set metadata like overloading the description column or something like that?

Tim Triche (09:11:44) (in thread): > msigdbr has been a handy alternative to the “great big GMT file” approach but it too has its drawbacks.

Jared Andrews (09:12:19) (in thread): > Nothing quite so complex, though it is a frequent issue. We typically slap the source (publication, analysis, whatever) in the description column and keep the name specific enough to know what the biology entails. It’d be nice to have a more robust system for managing (and easily filtering) them though.
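A rough base R sketch of the "big GMT file, source in the description column, filter as needed" workflow described above. It relies on the usual GMT layout (one set per line, tab-separated: name, description, then genes). The helper read_gmt, the set names and the source strings are all made up for illustration.

```r
# Write a tiny GMT file to a temporary path. Set names and sources are invented.
gmt_path <- tempfile(fileext = ".gmt")
writeLines(c(
    "NK_CELL_MARKERS\tsource: hypothetical paper A\tNKG7\tGNLY\tKLRD1",
    "B_CELL_MARKERS\tsource: hypothetical paper B\tCD79A\tMS4A1",
    "HYPOXIA_STATE\tsource: in-house analysis\tVEGFA\tCA9\tSLC2A1"
), gmt_path)

# Hypothetical helper: parse a GMT file into a named list of gene vectors,
# keeping the description column as an attribute.
read_gmt <- function(path) {
    fields <- strsplit(readLines(path), "\t", fixed = TRUE)
    sets <- lapply(fields, function(x) x[-c(1, 2)])
    names(sets) <- vapply(fields, function(x) x[1], character(1))
    attr(sets, "description") <- vapply(fields, function(x) x[2], character(1))
    sets
}

gene_sets <- read_gmt(gmt_path)

# Filter sets by what was recorded in the description column.
keep <- grepl("paper", attr(gene_sets, "description"), fixed = TRUE)
published_sets <- gene_sets[keep]
names(published_sets)
```

Free-text descriptions are easy to write but awkward to filter reliably, which matches the wish in the thread for a more structured system.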

Jared Andrews (10:11:24) (in thread): > We use msigdbr on top of our own lists, but the C8 collection for cell types is too broad and annoying to filter. It’s still our go-to for GO/KEGG/Reactome

Tim Triche (10:15:34) (in thread): > this is one of those things where a more flexible annotationhub architecture could be super helpful

Tim Triche (10:15:44) (in thread): > and oddly aligned with the post-apocalypse NIH priorities

Tim Triche (10:16:20) (in thread): > I don’t just want someone else’s tibble, I want a crowdsourced tibble that I can add to. Also I don’t care for the Broad’s licensing terms

Tim Triche (10:17:38) (in thread): > I just walked a student through EnrichmentBrowser (awesome package btw) yesterday and remembered why this process can be a drag (in her case I wanted a consensus list of genes and interactions for melphalan metabolism and clearance to map population AF diffs onto)

Tim Triche (10:18:53) (in thread): > so yeah we use this process to “gate” cells (see also the Human Cell Atlas CAP, aka the Cell Annotation Platform, for relevant developments in that respect) but “gate” and “cell” are evolving terms over time and it’s useful to have the resource evolve too

Tim Triche (10:20:25) (in thread): > https://celltype.info/search/cell-labels

Tim Triche (10:20:30) (in thread): > CAP not CAB, sorry.

Tim Triche (10:22:44) (in thread): > I still can’t believe I did this to COVID-era 1st-year grad students, but in my defense, they literally asked for it. https://trichelab.github.io/lab_use_content/project2.html

Tim Triche (10:24:33) (in thread): > my “answer key” (no right answers, I structured it for students to extend into their own projects, it’s just an example) https://trichelab.github.io/lab_use_content/project2_chunks/project2_tim.html

Tim Triche (10:25:04) (in thread): > I did use it again yesterday with plyranges and… well, we have so much to offer students that could make their lives and science better.

Vince Carey (10:33:54) (in thread): > blurting things out: any chance of connecting these concepts to celldex? any specific role of cell ontology? systematic approach to sharing identities of “new” cell types via PRs to the ontology? (we could write code to merge information from a PR that we liked with a current cell ontology image, increasing flexibility and capturing provenance)

2025-04-04

Jayaram Kancherla (11:21:59) (in thread): > also about to suggest celldex, references available through both R and Python (https://github.com/SingleR-inc). > > Also ran into this the other day, but more around metadata standards - https://github.com/cellannotation

Jared Andrews (11:24:02) (in thread): > celldex isn’t really a fitting place for sets of published genesets scraped from supplementary tables or whatnot though imo, particularly if they’re based on some sort of induced cell state.

2025-04-17

Juan Henao (17:31:54): > @Juan Henao has joined the channel

2025-04-25

Marisa Loach (04:58:04): > @Marisa Loach has joined the channel