#sc-signature

2018-11-20

Kevin Rue-Albrecht (14:52:35): > @Kevin Rue-Albrecht has joined the channel

Kevin Rue-Albrecht (14:52:36): > set the channel description: Extraction and application of cell “type” signatures

Aaron Lun (14:52:36): > @Aaron Lun has joined the channel

Vince Carey (14:52:36): > @Vince Carey has joined the channel

Raphael Gottardo (14:52:36): > @Raphael Gottardo has joined the channel

Peter Hickey (14:52:37): > @Peter Hickey has joined the channel

Rob Amezquita (14:52:37): > @Rob Amezquita has joined the channel

Kevin Rue-Albrecht (14:52:42): > set the channel topic: Extraction and application of cell “type” signatures

Aaron Lun (14:53:33): > YEAH FIRST MESSAGE

Rob Amezquita (14:55:13): > was reading through your comments Kevin, I definitely think an orthogonal container to SCE would be a good way to go. I also like your idea of leveraging Immgen/etc. to create DE signatures, it would be nice to get away from just lists and retain rank in defining gene lists

Valentin Voillet (14:55:19): > @Valentin Voillet has joined the channel

Kevin Rue-Albrecht (14:57:13): > I’ve been using this for my training set (mouse https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE109125)
>
> url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE109nnn/GSE109125/suppl/GSE109125%5FGene%5Fcount%5Ftable%2Ecsv%2Egz"
> if (! "GSE109125_Gene_count_table.csv.gz" %in% bfcinfo()$rname) {
>     bfcadd(bfc, "GSE109125_Gene_count_table.csv.gz", fpath=url)
> }
>
> - Attachment (ncbi.nlm.nih.gov): GEO Accession viewer > NCBI’s Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.

Peter Hickey (15:02:01): > yeah for the immgen data that gets you the counts but not the metadata. below is some code i wrote to get the metadata from GEO and tidy it up some. but i want to just write query(ExperimentHub(), "ImmGen") and get something immediately useful :slightly_smiling_face: #lazybioinf
>
> library(BiocFileCache)
> library(GEOquery)
> library(edgeR)
> library(dplyr)  # provides %>%, which the last step relies on
> bfc <- BiocFileCache()
> immgen_counts_path <- bfcpath(
>     x = bfc,
>     rids = bfcrid(bfcquery(bfc, "GSE109125_Gene_count_table.csv.gz")))
> immgen_counts <- read.csv(immgen_counts_path, check.names = FALSE)
> immgen_metadata <- as(phenoData(getGEO("GSE109125")[[1L]]), "data.frame")
> dgelist <- DGEList(immgen_counts[, -1])
> rownames(dgelist) <- immgen_counts[, 1]
> idx <- match(
>     colnames(dgelist),
>     gsub("_RNA-seq", "", as.character(immgen_metadata$title)))
> dgelist$samples <- cbind(dgelist$samples, immgen_metadata[idx, ])
> dgelist$samples <- dgelist$samples %>%
>     dplyr::rename(broad_cell_type = characteristics_ch1.4) %>%
>     dplyr::mutate(
>         broad_cell_type = gsub("cell type: ", "", broad_cell_type),
>         narrow_cell_type = gsub("#[0-9]_RNA-seq", "", title))

Rob Amezquita (15:03:00): > so is the thought to basically have some way to more simply query a repo like ImmGen to get training data for classification?

Aaron Lun (15:04:15): > Probably want to drag some web API people into this discussion.

Kevin Rue-Albrecht (15:05:10): > thanks @Peter Hickey, much cleaner than my own solution
>
> gz <- gzfile(bfcrpath(bfc, rnames = "GSE109125_Gene_count_table.csv.gz"), open = "r")
> countMatrix_train <- read.csv(gz, row.names = 1, check.names = FALSE)
> close(gz)
> countMatrix_train <- as.matrix(countMatrix_train)
> storage.mode(countMatrix_train) <- "integer"
> colDataTable <- DataFrame(
>     SampleFullName=colnames(countMatrix_train),
>     CellType=factor(gsub("^([[:alnum:]]+)\\..*", "\\1", colnames(countMatrix_train))),
>     Replicate=gsub(".*#([[:digit:]]+)", "\\1", colnames(countMatrix_train)),
>     row.names=colnames(countMatrix_train)
> )
> sce.train <- SingleCellExperiment(
>     assays=list(counts=countMatrix_train),
>     colData=colDataTable)
> rm(countMatrix_train, colDataTable)

Kevin Rue-Albrecht (15:06:24): > (I just realized that you also had the “broad” and “narrow” approach (!) great minds think alike :stuck_out_tongue_winking_eye:)

Kevin Rue-Albrecht (15:09:06): > (broad=celltype; narrow=replicate)

Peter Hickey (15:09:34): > FWIW i sit next to the people that made https://www.haemosphere.org/, which includes ImmGen data
> we spoke last week about making this resource available through BioC
> I don’t think they have a formal web API but they do have bucketloads of experience in wrangling of mouse & human haematopoietic bulk RNA-seq & microarray data and making it into a resource. and the count matrices + metadata already exist, so getting a DGEList/SE-type thing should be straightforward
- Attachment (haemosphere.org): Haemosphere > Gene Expression Analysis Tool

Rob Amezquita (15:12:12): > thats awesome peter! thanks for the link

Aaron Lun (15:12:17) (in thread): > Unusually polished for a WEHI web interface. What about the classic look in statsci.org?

Rob Amezquita (15:12:37): > so it seems like theres a few different concepts going around here..

Rob Amezquita (15:15:07): > 1) pulling down data from repo (ImmGen, Haemo, etc.) -> essentially a counts matrix with proper cell labels; labels can be parseable to be the major cell type (broad) ~~to narrow cell type (minor) [drawing these distinctions is not super easy, as theres lots of overlap in minor cell types but oh well]~~
>
> 2) running the classification - how we actually predict the cell type from an SCE object, e.g. predict(sce, cell_type_signature_training_data). Some methods details here need to be worked out, for example might be better to work with clusters of cells vs classifying single cells; outputting confidence estimates etc.

Kevin Rue-Albrecht (15:16:52) (in thread): > let’s (not?) mention cell “state” (e.g. activated/resting macrophage/T cell/…)

Kevin Rue-Albrecht (15:17:41) (in thread): > as different cell types don’t turn on/off the same genes when they activate

Rob Amezquita (15:18:09) (in thread): > right, which goes with the problem of minor cell types (states) are not comparable across different major cell types

Rob Amezquita (15:18:27) (in thread): > one level below major for cell type A != one level below major for cell type B

Rob Amezquita (15:18:57) (in thread): > nonetheless though i think that information is valuable since its already a part of most of the experiments..most data focuses on comparing say a T cell +/- stim

Rob Amezquita (15:19:35) (in thread): > (not saying that considering that should be a top priority, but could be valuable to keep in mind)

Kevin Rue-Albrecht (15:22:31) (in thread): > Yup. Smartest thing would probably be either a semi-supervised or (ideally) an ‘unsupervised’ “Google maps” approach, whereby 1) either the user provides some information about which “kind of sample” he’s working with (and thereby which subset of cell types are expected), or 2) the algorithm will first ‘broadly’ guess the type of sample (does it look like haematopoietic, neuronal, …), and then refine the cell type prediction

Rob Amezquita (15:23:48) (in thread): > great idea, i like that sort of hierarchical approach to narrow the space of classification

Peter Hickey (15:24:41): > i can help with (1). i know essentially zero about proper ontologies - I think @Vince Carey may be able to advise. i’m inclined to just make available whatever annotations the authors have. narrow/broad etc are fuzzy
> i have no time to work on (2) and i agree with aaron that it is likely a large undertaking.

Peter Hickey (15:26:48): > Frankly, much of the time all i want to do is overlay a tSNE with expression of marker gene(s) or gene signatures or do a more formal GSEA.
> And so the questions for me are typically:
> (A) Where can I get a relevant gene list(s)?
> (B) if (A) doesn’t exist (:cry:), where can i get a large, relevant dataset to construct such a list?
> and i want to do both within BioC when possible

Rob Amezquita (15:30:49): > yeah, that was more of what i was initially thinking as well @Peter Hickey - gene lists, maybe ways to define signatures on the fly as well. re: A) ive mostly turned to published data/papers…so lots of manual curation

Rob Amezquita (15:36:48): > i think both having the “rawer” continuums of expression data to use for prediction as well as more binary lists are complementary approaches

Rob Amezquita (15:42:42): > what would be a good way to organize our thoughts and sketch out what a package might look like/to-dos?

Kevin Rue-Albrecht (16:28:46): > shortest answer: the README of iSEE still very much resembles what we wrote on commit #1: a list of desired functionalities (see the green ticks acting as bullet points) https://github.com/csoneson/iSEE - Attachment (GitHub): csoneson/iSEE > R/shiny interface for interactive visualization of data in objects derived from the SummarizedExperiment class - csoneson/iSEE

Kevin Rue-Albrecht (16:29:15): > they only became green ticks as we progressively completed them

Kevin Rue-Albrecht (16:30:28): > i.e., it could all just start with a blank repo (https://github.com/csoneson/iSEE/commit/2550b701088add87235d004ce4c42bb94e955264) that a bunch of enthusiasts start filling up and refactoring over the next 11 months :wink: - Attachment (GitHub): Initial commit · csoneson/iSEE@2550b70 > R/shiny interface for interactive visualization of data in objects derived from the SummarizedExperiment class - csoneson/iSEE

Kevin Rue-Albrecht (16:31:35): > uh. i forgot that #isee’s first name was just SEE

Kevin Rue-Albrecht (16:37:09): > about point 2) as part of my efforts this summer, I wrote some basic makePseudoBulk aggregating count data that I was using to compare clusters of cells to my bulk reference data set. It gave reasonable results, but I never got to polish the idea
>
> #' @rdname makePseudoBulks-methods
> #' @aliases makePseudoBulks,SummarizedExperiment,factor-method
> setMethod("makePseudoBulks", c("SummarizedExperiment", "factor"),
>     function(x, f, assay_name="counts")
> {
>     .makePseudoBulks(x, f, assay_name)
> })
>
> #' @rdname makePseudoBulks-methods
> #' @aliases makePseudoBulks,SummarizedExperiment,character-method
> setMethod("makePseudoBulks", c("SummarizedExperiment", "character"),
>     function(x, f, assay_name="counts")
> {
>     stopifnot(length(f) == 1)
>
>     colDataLabels <- factor(colData(x)[, f, drop=TRUE])
>
>     makePseudoBulks(x, colDataLabels, assay_name)
> })
>
> #' Make pseudobulks (INTERNAL)
> #'
> #' @param x A \code{\link{SummarizedExperiment}} object.
> #' @param f A grouping factor that defines pseudobulks.
> #' @param assay_name The name of an assay in \code{assayNames(x)} that contains count data.
> #'
> #' @rdname INTERNAL_makePseudoBulks
> #'
> #' @importFrom S4Vectors DataFrame
> #' @importFrom SummarizedExperiment SummarizedExperiment assay
> #' rowData rowData<- rowRanges rowRanges<-
> #' @importFrom SingleCellExperiment SingleCellExperiment
> #'
> #' @return A new \code{SummarizedExperiment} with
> #' \itemize{
> #' \item pseudobulk count data stored in the \code{"counts"} assay
> #' \item pseudobulk type and replicate information stored in \code{colData(x)[,"Cloner"]}
> #' }
> #'
> #' @seealso \code{\link{makePseudoBulks}}
> .makePseudoBulks <- function(x, f, assay_name="counts") {
>
>     # Aggregate samples into pseudobulks
>     new_assay <- t(rowsum(t(as.matrix(assay(x, assay_name))), f))
>     storage.mode(new_assay) <- "integer"
>
>     Cloner_colData <- DataFrame(
>         pseudobulk=factor(levels(f))
>     )
>     rownames(Cloner_colData) <- colnames(new_assay)
>
>     objectOut <- SingleCellExperiment(
>         assays=list(counts=new_assay),
>         colData=Cloner_colData)
>
>     # Transfer rowData and/or rowRanges
>     rowData(objectOut) <- rowData(x)
>     if (!is.null(rowRanges(x))) {
>         rowRanges(objectOut) <- rowRanges(x)
>     }
>
>     objectOut
> }
>
- Attachment: Attachment
> 1) pulling down data from repo (ImmGen, Haemo, etc.) -> essentially a counts matrix with proper cell labels; labels can be parseable to be the major cell type (broad) ~~to narrow cell type (minor) [drawing these distinctions is not super easy, as theres lots of overlap in minor cell types but oh well]~~
>
> 2) running the classification - how we actually predict the cell type from an SCE object, e.g. predict(sce, cell_type_signature_training_data). Some methods details here need to be worked out, for example might be better to work with clusters of cells vs classifying single cells; outputting confidence estimates etc.

Kevin Rue-Albrecht (16:43:46): > my naive idea (working to a certain extent) being that one could then throw both pseudobulk and reference bulks into a common PCA and look for nearest neighbours.

Kevin Rue-Albrecht (16:43:59): > Worked fine to some extent, but full of caveats too
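
The pseudobulk-then-PCA idea described above can be sketched in base R on simulated counts. This is only an illustration under stated assumptions, not the Cloner code: the object names (`ref_bulk`, `sc_counts`, `clusters`), the log-CPM transform, the use of two principal components, and the simulated expression profiles are all invented for the example.

```r
set.seed(1)
n_genes <- 50
genes <- paste0("gene", seq_len(n_genes))

# Two hypothetical reference cell types with distinct expression profiles
mu <- cbind(typeA = rep(c(20, 2), each = n_genes / 2),
            typeB = rep(c(2, 20), each = n_genes / 2))

# Simulated bulk reference: 3 replicates per type, at bulk-like depth
ref_types <- rep(colnames(mu), each = 3)
ref_bulk <- sapply(ref_types, function(ct) rpois(n_genes, mu[, ct] * 50))
dimnames(ref_bulk) <- list(genes, paste0(ref_types, "#", 1:3))

# Simulated single cells: 40 cells per cluster, at much shallower depth
clusters <- factor(rep(c("c1", "c2"), each = 40))
truth <- c(c1 = "typeA", c2 = "typeB")
sc_counts <- sapply(as.character(clusters),
                    function(cl) rpois(n_genes, mu[, truth[cl]] * 0.5))
rownames(sc_counts) <- genes

# Aggregate cells into one pseudobulk per cluster (same rowsum() idea as above)
pseudo <- t(rowsum(t(sc_counts), clusters))

# Log-CPM so that library size does not dominate the PCA
logcpm <- function(m) log2(t(t(m) / colSums(m)) * 1e6 + 1)
combined <- cbind(logcpm(ref_bulk), logcpm(pseudo))

# Common PCA on all samples, then the nearest reference sample per pseudobulk
pcs <- prcomp(t(combined))$x[, 1:2]
d <- as.matrix(dist(pcs))[colnames(pseudo), colnames(ref_bulk)]
nearest <- colnames(d)[apply(d, 1, which.min)]
predicted <- setNames(sub("#.*", "", nearest), rownames(d))
predicted
```

On real data the caveats mentioned above apply: the shared gene set, the normalisation, and the number of components retained would all need checking before trusting the nearest neighbour.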

Federico Marini (16:44:18): > @Federico Marini has joined the channel

Kevin Rue-Albrecht (16:53:00): > here, found the first time we fleshed out the README (commit 3) https://github.com/csoneson/iSEE/tree/021e3e20cfdb194e511f8097ed544329bd46bcd6

Kevin Rue-Albrecht (16:53:30): > now the only thing would be to brainstorm a basic repo name

Rob Amezquita (17:01:27): > awesome, i like that idea to start off the repo and go from there…re: names…hmm.

Rob Amezquita (17:03:51): > classc (“classy”), hancock (play on signatures), autograph..

Kevin Rue-Albrecht (17:04:30): > i don’t get hancock, what’s the thing there?

Rob Amezquita (17:05:01): > when someone asks for your “John Hancock” it refers to your signature, John Hancock (in)famously signed his name on the Declaration of Independence in very very large fashion

Rob Amezquita (17:05:30): - File (PNG): Pasted image at 2018-11-20, 2:04 PM

Kevin Rue-Albrecht (17:06:52): > oww

Kevin Rue-Albrecht (17:06:55): > me and history

Kevin Rue-Albrecht (17:06:57): > nice one tho

Kevin Rue-Albrecht (17:07:51): > my pet project was called Cloner, for “Classification On Expression Reference(s)”

Kevin Rue-Albrecht (17:08:20): > and the idea of cloning knowledge from reference data set(s) to new ones

Rob Amezquita (17:08:39): > that might be more accurate if what we work on is more similar to a transfer learning approach

Rob Amezquita (17:08:54): > rather than a repo of curated gene lists and the like

Aaron Lun (17:10:42): > I find hancock amusing, but probably most non-Americans won’t get it.

Kevin Rue-Albrecht (17:10:44): > at this point, I don’t mind passing on the name to another project like this one, which has better chances of success as a community effort

Rob Amezquita (17:12:32): > i dont have a strong preference, although i will say i equally find hancock to be the most non-vanilla of the proposed names so far

Rob Amezquita (17:12:43): > but i also enjoy vanilla, so…

Rob Amezquita (17:13:27): > todo: add shruggie to slack

Kevin Rue-Albrecht (17:13:29): > hancock is cool with me, I learned something at least

Federico Marini (17:13:37): > ¯\_(ツ)_/¯

Kevin Rue-Albrecht (17:13:45): > aaargh you beat me to it

Rob Amezquita (17:13:46): > todo: learn how to shruggie

Kevin Rue-Albrecht (17:13:54): > /shrug

Federico Marini (17:14:03): > slash-shrug, at the beginning of a sentence

Rob Amezquita (17:14:12): > ahaha ¯\_(ツ)_/¯

Rob Amezquita (17:14:18): > wrong slash :slightly_smiling_face:

Federico Marini (17:14:19): > learned it from the master commander @Aaron Lun

Kevin Rue-Albrecht (17:14:20): > ¯\_(ツ)_/¯

Kevin Rue-Albrecht (17:14:26): > ah got it now

Rob Amezquita (17:16:37): > what if we start from a google doc to spec out what a minimum viable product might look like for this? theres so many ways to cut this apple with a lot of in between steps, it would be very easy to get super bloated

Rob Amezquita (17:17:05): > and might also be good to figure out who would be interested in working on making this

Rob Amezquita (17:22:34): > once we figure out a sketch of an MVP, we can put it into the github readme and start working on that! im sure @Valentin Voillet and others from @Raphael Gottardo’s lab will be contributing to discussion/code work

Kevin Rue-Albrecht (17:22:51): > True. Though can i leave that with you and I’ll catch up tomorrow? I need to hang up for the night :wink:

Rob Amezquita (17:23:13): > happy to get that started!

Aaron Lun (17:24:01): > I don’t plan to be particularly involved until you get into the interaction with SCE re. data containers and whatnot. Happy to chip in with opinions about anything but otherwise I’ve got enough on my plate single-cell coding-wise.

Rob Amezquita (17:24:32): > @Aaron Lun what would you recommend looking at with regards to interacting with SCE?

Kevin Rue-Albrecht (17:24:39): > I also need to swing the whole update by my PI tomorrow. We’re pretty oversubscribed on our side, so he’ll wanna be aware and perhaps contribute ideas too

Aaron Lun (17:25:14): > I don’t know specifically for your use case, but there’s a vignette on extending SCEs that might be worth a look.

Aaron Lun (17:25:44): > I suspect you’ll want a dedicated separate data structure for gene lists, though. That’s not something that follows the SE data model.

Kevin Rue-Albrecht (17:25:46): > AFAI’m concerned, we’ve tried doing it within our group and now it’s on the indefinite backburner, but perhaps he (my PI)’ll be keen to revisit with some extra hands on deck

Rob Amezquita (17:26:49): > yeah, i have to sketch this out a bit more, but im imagining containing the lists or data separate from SCE, and interacting with SCE only for predict() style functions to use the data inherent/append new metadata

Aaron Lun (17:27:01): > That would make sense to me.

Rob Amezquita (17:27:40): > okay so extending SCE vignette..are there any minimalistic packages that you might recommend for extending/working with SCE? (also curious for a separate project that works more with epigenetic data)

Aaron Lun (17:28:41): > As examples? I think clusterExperiment extends it. So does singleCellTK and CellTrails, maybe also SC3.

Aaron Lun (17:29:01): > Note that they all probably extended it before I wrote the vignette, so I guess they might not follow my advice.

Kevin Rue-Albrecht (17:29:29): > Alright, last message of the day for me: Please just ping me an invitation at kevinrue67@gmail.com if there’s a Google Docs started :slightly_smiling_face:

Aaron Lun (17:29:39): > It should work off-the-shelf with epigenetic data, as we derive from a Ranged SE. Maybe @Peter Hickey has some thoughts.

Rob Amezquita (17:29:40): > thanks @Kevin Rue-Albrecht for getting this kickstarted!

Rob Amezquita (17:30:20): > yeah this might be for a separate discussion, i need to formulate my thoughts more before diving deep but ill take a look at those packages to do my homework :slightly_smiling_face: thanks for the tips!

Aaron Lun (17:30:41): > Okay. Now, that discussion would be for #singlecellexperiment

2018-11-21

Vince Carey (07:20:45): > I noticed a query on ontology relevant to this topic. I think there is a serious disconnect between current “cell ontology” terminology and what is used in the metadata imported by Pete above.

Vince Carey (07:21:19): > I have just pushed ontoProc 1.5.1 into devel, which includes the July 2018 edition of Cell Ontology.

Vince Carey (07:23:09): > Using code like
>
> library(ontoProc)
> co = getCellOnto()
> nn = names(grep("CD8",
>     co$name, value=TRUE))
> library(ontologyPlot)
> onto_plot(co, nn)
>
> we can get a view of the conceptual hierarchy of terms with CD8 as substring

Vince Carey (07:24:07): - File (PNG): littree.png

Vince Carey (07:30:23): > after a bit of magnification you get something like the display above. Details on cell surface markers can be found in the ‘def’ fields.

Kevin Rue-Albrecht (07:39:25): > neat

Vince Carey (09:18:40): > after running Kevin’s and Pete’s code above, we can get ontology tags for most of the organs in play via
>
> > petesamp = dgelist$samples
> > ts = unique(petesamp$"tissue:ch1")
> > ts = paste("^", tolower(ts), "$", sep="")
> > ef = ontoProc::getEFOOnto()
> > ontoProc::liberalMap(ts, ef)
>                        input                    ontoid              term
> 1  ^subcutaneous lymph node$ ^subcutaneous lymph node$              <NA>
> 2              ^bone marrow$            UBERON:0002371       bone marrow
> 3                   ^spleen$            UBERON:0002106            spleen
> 4            ^lamina propia$           ^lamina propia$              <NA>
> 5        ^peritoneal cavity$            UBERON:0001179 peritoneal cavity
> 6                   ^thymus$            UBERON:0002370            thymus
> 7                     ^lung$            UBERON:0002048              lung
> 8               ^lymph node$            UBERON:0000029        lymph node
> 9                    ^colon$            UBERON:0001155             colon
> 10                 ^splenic$                 ^splenic$              <NA>
> 11                     ^gut$                     ^gut$              <NA>
> 12         ^small intestine$            UBERON:0002108   small intestine

Vince Carey (09:19:11): > Note that “lamina propia” is misspelled in the source data

Vince Carey (09:23:15): > A similar sort of fiddling can get us
>
> > liberalMap(paste("^", tolower(brc), "$", sep=""), co)
>                  input              ontoid           term
> 1       ^stromal cell$          CL:0000499   stromal cell
> 2             ^b cell$            ^b cell$           <NA>
> 3          ^ab t cell$         ^ab t cell$           <NA>
> 4  ^innate lymphocyte$ ^innate lymphocyte$           <NA>
> 5          ^stem cell$          CL:0000034      stem cell
> 6         ^macrophage$          CL:0000235     macrophage
> 7        ^granulocyte$          CL:0000094    granulocyte
> 8          ^gd t cell$         ^gd t cell$           <NA>
> 9          ^mast cell$          CL:0000097      mast cell
> 10    ^dendritic cell$          CL:0000451 dendritic cell

Vince Carey (09:24:37): > We will need to capitalize T and B. Getting deeper into subtypes, using information on sorting markers, will take more work. Do we want to do this with these resources, essentially syntactically?

Vince Carey (09:28:28): > BTW the way liberalMap works, if you don’t use the regexp tokens, you can learn a lot more about how the phrases are used in the ontology.

Kevin Rue-Albrecht (09:33:39): > > Note that “lamina propia” is misspelled in the source data
> I’ve recently discovered the adist function, for approximate string matches (based on edit distance): https://www.rdocumentation.org/packages/utils/versions/3.5.1/topics/adist

Kevin Rue-Albrecht (09:35:04): > although on 2nd thought, it’s a bad idea; i’d rather manually curate “typos” than blindly trust a pattern matching

Vince Carey (10:02:08): > well, liberalMap has an option to use agrep … it is sometimes useful
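
As a small illustration of the edit-distance idea, `utils::adist` can be used to suggest the closest known term for each label while leaving the final call to a human curator, in keeping with the concern about blind pattern matching. The `vocabulary` vector below is a made-up stand-in for real ontology terms, not output from ontoProc.

```r
# Tissue labels as they appear in the sample metadata (including the typo)
observed <- c("lamina propia", "bone marrow", "splenic")

# Hypothetical curated vocabulary to compare against
vocabulary <- c("lamina propria", "bone marrow", "spleen", "thymus")

# Generalized Levenshtein distance between every observed/vocabulary pair
d <- adist(observed, vocabulary)
dimnames(d) <- list(observed, vocabulary)

# Report the closest term and its distance, for manual review
suggestions <- data.frame(
    observed = observed,
    closest  = vocabulary[apply(d, 1, which.min)],
    distance = apply(d, 1, min)
)
suggestions
```

A distance of 0 is an exact match; small non-zero distances flag likely typos worth a manual look, and large ones flag labels that probably have no counterpart in the vocabulary.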

Kevin Rue-Albrecht (10:06:07): > nice, I’ve starred the repo (everyone: https://github.com/vjcitn/ontoProc). I’ll have to play with the examples to get a better sense of functionalities, but I like the idea of having a toolkit like this (e.g. stopWords) - Attachment (GitHub): vjcitn/ontoProc > RDF ontology processing for Bioconductor. Contribute to vjcitn/ontoProc development by creating an account on GitHub.

Kevin Rue-Albrecht (10:06:52): > I’ll just drop out Slack again, as I’ve got some md5sum to run and a journal club article to skim

Kevin Rue-Albrecht (10:08:05): > if no one gets to it first, I might kick off tonight the Google Docs that we mentioned yesterday

Aaron Lun (10:19:54): > @Aaron Lun has left the channel

Kevin Rue-Albrecht (15:30:50): > Started a GoogleDocs here: https://docs.google.com/document/d/1pxUJ0OipoRPglWA8KPMo4Zoa5w3KOH08Tc5xZVOnZdo/edit?usp=sharing
> My plan is currently just to record our conversation so far, if anyone feels like scrolling up and adding to the doc - File (Google Docs): Hancock: MVP

Rob Amezquita (16:12:00): > just requested edit access Kevin, thanks for kickstarting it!

Kevin Rue-Albrecht (16:17:56): > thanks for pointing it out, I just found how to give edit access to anyone with the link from now on

Rob Amezquita (17:51:38): > @Kevin Rue-Albrecht and I had a great video call conversation about what a minimum viable product might look like - more details are in the Google Doc, but I think we managed to distill it down to:
>
> Prediction of Cell Types - MVP!
>
> Simplest method - write a package that wraps around existing approaches using an example reference dataset (a matrix) to annotate an experimental dataset with the columns of reference (basically a wrapper/platform benchmarking existing methods, and later our own) that utilizes the SingleCellExperiment container

Rob Amezquita (17:52:41): > it might be best to split out certain aspects into separate efforts/packages - for example, manual/curated gene lists could be a more “crowdsourced” approach, and pulling down data via APIs and munging could be its own thing altogether

Rob Amezquita (17:54:27): > would love other folks’ input on this though - next steps will be to sketch out a package skeleton, small example reference dataset and experimental dataset, the prediction methods, and then the interaction with the SingleCellExperiment object to actually add the annotation metadata

2018-11-22

Kevin Rue-Albrecht (03:58:42): > For those like me, rediscovering flow cytometry as part of the conversation, here is an awesome resource for (protein) markers of immune cell types: https://www.abcam.com/protocols/flow-cytometry-immunophenotyping - Attachment (abcam.com): Flow cytometry immunophenotyping | Abcam > A detailed description of flow cytometry immunophenotyping, including links to information on CD markers.

Kevin Rue-Albrecht (04:40:34): > @Vince Carey following on the chat with @Rob Amezquita yesterday, I just have the following points buzzing in my mind:
> - a qualitative set of marker genes may already perform well (it’s basically the gold standard of FACS-sorting anyway). Presence/absence of a given marker could be estimated ‘absolutely’ (using the distribution of its expression data to estimate background and then signal), or ‘relatively’ (differential expression between a cluster and all other cells)
> - negative markers may be just as useful as positive markers (again, parallel with FACS-sorting, in contrast to NNMF methods that focus on positive markers)

Kevin Rue-Albrecht (04:45:21): > With that in mind, I was wondering whether an ontology already exists for the following scenario, illustrated by a simple list here,
>
> hancocks <- list(
>     thymocyte = c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE),
>     th_cell = c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE, "CD4"=TRUE)
> )
>
> Namely, representing sets of signatures for cell types with a relationship <th_cell> is a <thymocyte> (thus, inherits all its markers) additionally characterized by the presence of CD4, something akin to:
>
> hancocks <- list(
>     thymocyte = c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE),
>     th_cell = c(thymocyte, "CD4"=TRUE)
> )

Kevin Rue-Albrecht (04:46:29): > AFA I’m aware, this “top-down” ontology differs from Gene Ontology, which is “bottom-up” (i.e., parent categories inherit all the genes in child categories)

Kevin Rue-Albrecht (12:53:08): > Another way to put it is:
> - the more specific the GO category, the smaller the gene set
> - the more specific the cell type, the larger the set of markers (ie gene set)
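
One way to make the “is a” relationship executable is to store only each cell type’s own markers plus a pointer to its parent, and resolve the full signature by walking up the hierarchy. The sketch below uses plain lists, in the spirit of the toy example above; `own_markers`, `parent`, and `full_signature` are hypothetical names, and the rule that a child’s marker overrides its parent’s is an assumption.

```r
# Each cell type lists only its own markers (TRUE = present, FALSE = absent)
own_markers <- list(
    thymocyte = c(CD3 = TRUE, CD2 = TRUE, CD19 = FALSE),
    th_cell   = c(CD4 = TRUE)
)

# Name of the parent type; NA marks a root of the hierarchy
parent <- c(thymocyte = NA, th_cell = "thymocyte")

# Walk up the hierarchy, collecting inherited markers; markers defined by
# the child take precedence over those of the parent
full_signature <- function(type) {
    sig <- own_markers[[type]]
    p <- parent[[type]]
    if (!is.na(p)) {
        inherited <- full_signature(p)
        sig <- c(inherited[setdiff(names(inherited), names(sig))], sig)
    }
    sig
}

types <- setNames(names(own_markers), names(own_markers))
hancocks <- lapply(types, full_signature)
hancocks$th_cell
lengths(hancocks)
```

The resolved `th_cell` signature carries the three thymocyte markers plus CD4, so the more specific type ends up with the larger marker set, as described above.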

Kevin Rue-Albrecht (16:24:19): > has renamed the channel from “sc-cell-signature” to “sc-signature”

2018-11-24

Kevin Rue-Albrecht (07:22:33): > Hi all,
> To provide a common toy data set that is single-cell, well-known, and provides some kind of ground-truth (i.e., both cell type marker genes and cell type assignment to each cluster), I have distilled the Seurat PBMC 3k tutorial to an R script that caches both the 10x raw data and the output Seurat and equivalent SingleCellExperiment object. The script is available as Gist here: https://gist.github.com/kevinrue/82035181128ca5fd71a197c439bd83d5

Vince Carey (08:14:16): > thanks for this kevin – I thought I would mention that installing current sources of hdf5r and Seurat fails for me on macosx; using available binaries on CRAN leads to
>
> dyn.load("/Library/Frameworks/R.framework/Versions/3.6/Resources/library/hdf5r/libs/hdf5r.so") ...
> ***** caught segfault *****
> address 0x18, cause 'memory not mapped'
>
> Traceback:
> 1: fun(libname, pkgname)
> 2: doTryCatch(return(expr), name, parentenv, handler)

Kevin Rue-Albrecht (08:15:32): > Uh

Kevin Rue-Albrecht (08:15:54): > I just installed Seurat on two macs in the last two days.

Kevin Rue-Albrecht (08:16:44): > I had to fiddle a bit with brew install and my .R/Makevars to install hdf5r though

Vince Carey (08:22:24): > getting there now… maybe i am oversharing

Kevin Rue-Albrecht (08:25:00): > Anyway, my idea being that we could start small/modest, and start with trying to program what the Satija lab seems to have done manually, namely defining signatures as follows:
>
> list(
>     "CD4 T cells" = c("IL7R"),
>     "CD14+ Monocytes" = c("CD14", "LYZ"),
>     "B cells" = c("MS4A1"),
>     "CD8 T cells" = c("CD8A"),
>     "FCGR3A+ Monocytes" = c("FCGR3A", "MS4A7"),
>     "NK cells" = c("GNLY", "NKG7"),
>     "Dendritic Cells" = c("FCER1A", "CST3"),
>     "Megakaryocytes" = c("PPBP")
> )
>
> and assigning a signature to each cluster (with the possibility of assigning the same signature to multiple clusters!)

Kevin Rue-Albrecht (08:28:24): > We could then extend to include positive/negative markers, e.g.
>
> list(
>     "CD4 T cells" = c("IL7R"=TRUE, something=FALSE),
>     "CD14+ Monocytes" = c("CD14"=TRUE, "LYZ"=TRUE, something_else=FALSE),
>     ...
> )
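
A minimal sketch of assigning such positive/negative signatures to clusters might look like the following. It is not the code from the gist: the detection rates are invented, the negative markers are placeholders chosen for the example, and the 0.5 presence threshold and the matching-fraction score are arbitrary assumptions.

```r
# Signatures with positive (TRUE) and negative (FALSE) markers;
# the negative markers are invented for this example
signatures <- list(
    "B cells"     = c(MS4A1 = TRUE, CD8A = FALSE),
    "CD8 T cells" = c(CD8A = TRUE, MS4A1 = FALSE),
    "NK cells"    = c(GNLY = TRUE, NKG7 = TRUE, CD8A = FALSE, MS4A1 = FALSE)
)

# Hypothetical per-cluster detection rates (fraction of cells with >0 counts)
detection <- rbind(
    MS4A1 = c(0.90, 0.05, 0.02, 0.85),
    CD8A  = c(0.03, 0.80, 0.10, 0.05),
    GNLY  = c(0.05, 0.30, 0.95, 0.02),
    NKG7  = c(0.04, 0.60, 0.97, 0.06)
)
colnames(detection) <- paste0("cluster", 1:4)

# Call a marker "present" in a cluster above an arbitrary threshold
present <- detection > 0.5

# Score = fraction of markers whose observed state matches the signature
score <- sapply(signatures, function(sig) {
    colMeans(present[names(sig), , drop = FALSE] == sig)
})

# Best-scoring signature per cluster; several clusters may share one
best <- max.col(score, ties.method = "first")
assignment <- setNames(colnames(score)[best], rownames(score))
assignment
```

Here two clusters receive the “B cells” signature, which is the many-clusters-to-one-signature case mentioned above. Keeping the full `score` matrix rather than only the best hit would leave room for confidence estimates later.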

Vince Carey (08:29:21): > i don’t think i am going to get Seurat installed any time soon. if you can make the demo dataset and put it in a bucket or provide some other conveyance that would be great

Kevin Rue-Albrecht (08:29:51): > and later on, even more quantitative signatures (e.g. gene expression values or ranks)

Kevin Rue-Albrecht (08:29:55): > sure

Kevin Rue-Albrecht (08:30:41): > I was wondering whether anyone around had experience/time setting up data packages, or data sets on the ExperimentHub. I don’t.

Kevin Rue-Albrecht (08:33:42): > It’s still synchronizing, but you should soon be able to download the SCE from https://www.dropbox.com/s/tzw14boipt9kkd5/14c55457b4556_file14c55fb15da6.rds?dl=0 (11.5MB)

Vince Carey (08:33:58): > apropos the expression signature/surface marker connection – yes, we should capitalize on what exists … CIBERSORT seems a relevant resource

Vince Carey (08:35:20): > apropos ExperimentHub: Shweta Gopaulakrishnan @Shweta Gopal in my group has some experience with the ExperimentHub and can help … I think you have to put it in a bucket and some documentation/provenance forms have to be filled out

Shweta Gopal (08:35:24): > @Shweta Gopal has joined the channel

Kevin Rue-Albrecht (08:44:21): > cool - just having lunch now, but feel free to use the gist that I linked above. i’ll chime back in later today

Kevin Rue-Albrecht (09:04:01): > @Vince Carey, the Dropbox link is synchronized now; you should be able to pull down the SingleCellExperiment as an RDS file

Kevin Rue-Albrecht (09:49:44): > btw, (I’m preparing slides for my lab meeting right now but) I’ve implemented the dummiest approach (~20 lines of code) to use the markers declared by Satija to emulate what they seem to have done manually. I’ll share as soon as I reasonably can, to provide a baseline and matter for discussion.

2018-11-25

Vince Carey (08:58:51): > I finally had a look at flowCL – it issues queries against an RDF database representing a version of Cell Ontology. The vignette produces

Vince Carey (08:59:19): - File (PNG): flowCLpic.png

Vince Carey (09:00:49): > I am not clear on how the RDF is managed at cell.inference.me … but have an email in to the flowCL developer

Kevin Rue-Albrecht (10:30:15): > I like (gene1+)(gene2+) -> (cell type). Handles ‘inheritance’ of markers and the n:n relationship between genes and cell types

Charlotte Soneson (10:39:45): > @Charlotte Soneson has joined the channel

Kevin Rue-Albrecht (12:23:22): > I’ve kicked off a Travis build with basic method, vignette, and unit tests. At least that’ll help identify the technical glitches that need fixing before we move into the real thing https://travis-ci.org/kevinrue/Hancock
> Also, here’s my dummy approach to using the signatures declared in the Seurat tutorial: https://gist.github.com/kevinrue/f0f26e4585b075caf6420f5453976d01

Kevin Rue-Albrecht (12:36:47): > what the hell, that just never happens: build #1 passed https://travis-ci.org/kevinrue/Hancock/builds/459445356?utm_source=github_status&utm_medium=notification

Kevin Rue-Albrecht (12:37:51): > PS: as a general software development rule, code coverage should never decrease :wink:

2018-11-26

Kevin Rue-Albrecht (05:00:34): > @Vince Carey I’ve started a stub of vignette in the package. No code yet, purely a couple of concepts that I wanted written somewhere. That said, I’ve included an example list representation of the “signatures” declared in the Seurat tutorial. We can start proof-of-concept code using lists, but maybe we should integrate GSEABase code already? Is there anything more than gsc <- GeneSetCollection(list(gs1, gs2)) (?GSEABase)

Vince Carey (06:26:03): > Two aspects of GSEABase that I find compelling are a) the capacity to import from existing ‘list-like’ set sources, and b) the capacity to annotate each set or set-collection with significant metadata. We should look around for other set annotation approaches, incorporate advances if such are found, and consider whether GSEABase needs an extension to incorporate gene networks. Interoperation with NDEx could be a plus for certain NCI projects supporting Bioconductor (www.ndexbio.org). @Ludwig Geistlinger @Levi Waldron worked on the orchestrating single cell draft at the GSEA level and may have comments.

Ludwig Geistlinger (06:26:09): > @Ludwig Geistlinger has joined the channel

Levi Waldron (06:26:09): > @Levi Waldron has joined the channel

Kevin Rue-Albrecht (06:42:22): > I’m keen to have a discussion with anyone interested in agreeing on a “core” (yet generalisable) representation of gene signatures, before we dive too far into algorithms to learn/apply said signatures. I’d like to avoid heavy refactoring later on (i.e. changing the representation of signatures multiple times).

Fabiola Curion (06:44:24): > @Fabiola Curion has joined the channel

Charlotte Soneson (09:22:43): > This looks like a great initiative! @Kevin Rue-Albrecht, I’d be happy to join at least the conceptual discussion of the signature representation if that would be ok. A combinatorial approach with a set of markers whose absence/presence identify a cell type, as you have discussed above, seems like a good idea (I like to think of it as a kind of decision tree, with possibly multiple alternative/equivalent/redundant genes in each node - such “at least one of” relationships may be useful to be able to encode, in particular for sparse single-cell data).
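
The “at least one of” idea above can be sketched in base R. This is only an illustration under assumptions: the `positive`/`negative` layout of the signature, the helper names `any_detected` and `matches_signature`, and the toy detection matrix are all hypothetical, not part of Hancock or any existing package.

```r
# Hypothetical representation: each signature is a list of marker groups.
# A positive group is satisfied when at least one of its genes is detected;
# a negative group is satisfied when none of its genes is detected.
signature_tcell <- list(
  positive = list(c("CD3E", "CD3D", "CD3G")),
  negative = list(c("CD19", "MS4A1"))
)

# Toy detection matrix (genes x cells); TRUE means the gene was detected.
detected <- matrix(
  c(TRUE,  FALSE, FALSE,
    FALSE, FALSE, FALSE,
    FALSE, FALSE, FALSE,
    FALSE, TRUE,  FALSE,
    FALSE, FALSE, FALSE),
  nrow = 5, byrow = TRUE,
  dimnames = list(c("CD3E", "CD3D", "CD3G", "CD19", "MS4A1"),
                  c("cell1", "cell2", "cell3"))
)

# TRUE for each cell where any gene of the group is detected.
any_detected <- function(detected, genes) {
  colSums(detected[genes, , drop = FALSE]) > 0
}

# A cell matches when every positive group is hit and no negative group is.
matches_signature <- function(detected, signature) {
  ok <- rep(TRUE, ncol(detected))
  for (group in signature$positive) ok <- ok & any_detected(detected, group)
  for (group in signature$negative) ok <- ok & !any_detected(detected, group)
  setNames(ok, colnames(detected))
}

matches_signature(detected, signature_tcell)
```

Grouping redundant genes this way means a dropout in one gene of a group does not by itself exclude a cell, which is the motivation given for sparse single-cell data.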

Charlotte Soneson (09:22:51): > A kind of related point is that since most marker genes are obtained via differential expression, they are all highly expressed in a group of cells “compared to something else”, but that something is typically not recorded. With current methods, I think one of the most challenging cases will be a data set where all (or most) cells are of the same type (i.e., where there is nothing to compare the expression levels to), or equivalently, to classify a single cell consistently regardless of what other cells are included.

Kevin Rue-Albrecht (09:25:35): > Happy to see you interested in the effort @Charlotte Soneson! > Check out the draft section on “relative and absolute cell type markers” in the draft of vignette that I started this weekend (yes, there’s going to be a lot of “draft” in the conversation for a while, I expect) https://github.com/kevinrue/Hancock/blob/master/vignettes/concepts.Rmd - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.

Charlotte Soneson (09:26:26): > Ah, sorry, I missed the vignette :slightly_smiling_face: thanks!

Kevin Rue-Albrecht (09:26:27): > I’ve pinned a Google Doc to this channel for discussion, but I’m happy to see people add to the vignette.

Kevin Rue-Albrecht (09:28:23): > While both are open to anyone for editing, let’s just use the Google Doc for “draftier” (warned ya!) discussions, and the vignette for more “official” points.

Kevin Rue-Albrecht (09:31:59): > Also, I agree with the use of positive and negative markers in signatures. - Attachment: Attachment > With that in mind, I was wondering whether an ontology already exists for the following scenario, illustrated by a simple list here, > > hancocks <- list( > thymocyte = c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE), > th_cell = c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE, "CD4"=TRUE) > ) > > Namely, representing sets of signatures for cell types with a relationship <th_cell> is a <thymocyte> (thus, inherits all its markers) additionally characterized by the presence of CD4, something akin to: > > thymocyte <- c("CD3"=TRUE, "CD2"=TRUE, "CD19"=FALSE) > hancocks <- list( > thymocyte = thymocyte, > th_cell = c(thymocyte, "CD4"=TRUE) > ) >
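
A minimal sketch of the “is a” relationship above, in base R. The `parent` field and the `resolve_signature` helper are hypothetical names chosen for illustration; this is not an existing ontology format or Hancock function.

```r
# Hypothetical format: each cell type declares only its own markers plus an
# optional parent; TRUE marks a positive marker, FALSE a negative one.
hancocks <- list(
  thymocyte = list(parent = NULL,
                   markers = c(CD3 = TRUE, CD2 = TRUE, CD19 = FALSE)),
  th_cell   = list(parent = "thymocyte",
                   markers = c(CD4 = TRUE))
)

# Walk up the parent chain and collect inherited markers.
resolve_signature <- function(signatures, name) {
  node <- signatures[[name]]
  if (is.null(node$parent)) {
    return(node$markers)
  }
  inherited <- resolve_signature(signatures, node$parent)
  # Markers declared on the child override inherited ones of the same name.
  inherited <- inherited[!names(inherited) %in% names(node$markers)]
  c(inherited, node$markers)
}

resolve_signature(hancocks, "th_cell")
```

Storing only the child-specific markers avoids repeating the parent's markers in every descendant, at the cost of needing a resolution step before the signature is applied.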

Charlotte Soneson (09:34:23): > Yeah, that was the post I was thinking of - I guess my point was that depending on how you define a “marker” gene, it becomes important to see these as a set, rather than a list of genes that are all, separately, indicative of a given cell type.

Kevin Rue-Albrecht (09:36:05): > Definitely. Have you had time to check out my “dummy” application of the Seurat tutorial signatures? It’s using their “signatures” (1-2 genes per cell type) as a set. https://gist.github.com/kevinrue/f0f26e4585b075caf6420f5453976d01

Charlotte Soneson (09:36:28): > Not yet…but it’s open in my browser:slightly_smiling_face:

Kevin Rue-Albrecht (09:36:53): > I’m always super proud when I remember to use S4Vectors::FilterRules

Kevin Rue-Albrecht (09:37:23): > they’re awesome, I had “fun” extending them for VCF data a couple of years ago in TVTB

Kevin Rue-Albrecht (10:37:38): > btw, for everyone (including me), I’ve applied the same settings to the Hancock repo as the ones that worked so nicely for iSEE: > - the master branch is protected, which means no one can push directly to it. That’ll avoid havoc originating from merge conflicts. > - thus, contributions work by pull requests from side branches. Don’t forget to delete the side branch when it’s merged.

Rob Amezquita (11:47:51): > food coma’ed hard for the thanksgiving holiday and missed out on all the slack convo so just catching up now! > > i think for the cell-type signatures/lists approach, i really like the +/- selection, as well as the “at least one of..” approaches. Like @Charlotte Soneson mentioned, biggest challenge will be identifying the “gold standard” reference dataset. in this regard, CITE-Seq/Ab-seq based data might be best because it integrates well known protein markers and allows us to come up with the reference RNA-based markers

Kevin Rue-Albrecht (11:48:30): > I’m very much looking forward to CITE-seq indeed

Rob Amezquita (11:48:45): > barring that however, i think a compendium of sorted bulk-RNAseq datasets mentioned much earlier in the slack would be good, but would also require some thinking re: applicability of bulk-RNA-seq to single-cell data

Rob Amezquita (11:50:10): > an alternative approach could also be compiling gene lists into groups based on publications into a data package as a “here’s just what other groups reported” sort of thing, e.g. authorname2018 <- list('T cell' = c('CD3E', 'CD3D'), ...)

Rob Amezquita (11:50:24): > author2name2017 <- list(...)

Rob Amezquita (11:50:25): > etc

Kevin Rue-Albrecht (11:50:35): > If I understood correctly Fabiola’s comment on the Google doc, I think she suggested having a dedicated section of the document to list candidate sources of signatures. We could split them as 1. databases of signatures 2. data sets from which signatures can be extracted

Charlotte Soneson (11:50:42) (in thread): > I’m missing something - where are the signatures used here (I can see them printed at the bottom of the page):thinking_face:

Rob Amezquita (11:53:20): > great idea, i think that would make the most sense, and also i think that this effort should be a separate package entirely, with a more “community driven” sort of model behind it - as long as we supply the format, anyone can put up their proposed signatures to use. (call it autographs or something), and then a more methods-based package like hancock can apply said autographs (either datasets or gene-lists approach)

Kevin Rue-Albrecht (12:06:43) (in thread): > Oh sorry. They’re not used in this notebook. I just pasted them from the tutorial for safe keeping

Kevin Rue-Albrecht (12:07:28) (in thread): > Their actual use is here: https://gist.github.com/kevinrue/a9ffb8c5e2eeb65b4f05c0d9b0cb45ec

Kevin Rue-Albrecht (12:08:11) (in thread): > The first notebook is just to import and preprocess the data, similar to data.R in iSEE2018

Charlotte Soneson (12:09:14) (in thread): > Ok, thanks!

Charlotte Soneson (12:11:26): > I think such a division would make sense. Would it be worth considering splitting the definition of relative markers into those that were derived as DE between a cluster and all other cells, and those DE between a cluster and each other cluster (specific genes)?

Kevin Rue-Albrecht (12:11:58): > Good point! Didn’t think that far yet.

Rob Amezquita (12:14:08): > not sure about that specific split @Charlotte Soneson, but broadly yes, i think if we define signatures it should be split based on methods - e.g. maybe something like: > > - publications based - gene lists from curated sources > - datasets - reference datasets > - signatures from datasets via method 1 > - signatures from datasets via method 2

Charlotte Soneson (12:14:12): > As long as a marker is defined by expression level rather than by “presence”, it will rely to some extent on normalization etc in any given data set, so I guess there is a special class of “absolute” markers that should basically only be observed in one cell type.

Kevin Rue-Albrecht (12:15:35): > I’d expect those “absolutely specific” markers to be pretty rare, but we can save space for them.

Charlotte Soneson (12:17:50): > @Rob Amezquita Yes, I agree that it should also be linked to the method used to derive the signature. My point was more that both of the marker types I mentioned could be considered “signatures of data set X with method Y”, but they can be quite different and may need to be treated differently for prediction (or at least we can accommodate methods that would treat them differently).

Rob Amezquita (12:20:14): > sorry just to clarify/follow-up, so say we have dataset X and define signatures with methods Y and Z. you’re saying that we might have different downstream methods to leverage signatures Y and Z?

Kevin Rue-Albrecht (12:20:42): > I’d say yes

Charlotte Soneson (12:22:48): > Oh, what I meant was that with the same method X we can find multiple different signatures on the same datasets (genes DE between cluster A and all other cells, vs genes DE between cluster A and each other cluster, for example).

Charlotte Soneson (12:23:28): > If you consider a “method” a DE tool, that is

Kevin Rue-Albrecht (12:24:09): > Not-completely-thought-through example: > - Say method Y defines a signature as a ranked list of marker genes by decreasing expression. > - Say method Z defines a signature by fold-change against all other clusters in the sample > - Say another method W defines a signature by minimal fold-change against each other cluster > All of those methods would need to “tag” their set of signatures so that the prediction method downstream knows how to use them
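
The “tagging” idea above could look something like the following base R sketch. Everything here is an assumption made for illustration: the `new_signature` constructor, the `tagged_signature` class, the field names, and the example values are hypothetical and do not describe how Hancock or GSEABase store provenance.

```r
# Hypothetical constructor: a signature is a plain list carrying the genes
# plus the provenance a downstream prediction method would need.
new_signature <- function(genes, dataset, method, params = list(), usage) {
  stopifnot(is.character(genes), length(genes) > 0)
  structure(
    list(genes = genes, dataset = dataset, method = method,
         params = params, usage = usage),
    class = "tagged_signature"
  )
}

# Two signatures learned from the same data set with different methods.
sig_z <- new_signature(
  genes = c("CD3E", "CD3D"),
  dataset = "dataset X",
  method = "fold-change vs all other clusters",
  params = list(min_fold_change = 1.5),
  usage = "unranked"
)
sig_w <- new_signature(
  genes = "CD3E",
  dataset = "dataset X",
  method = "minimal fold-change vs each other cluster",
  params = list(min_fold_change = 2.5),
  usage = "unranked"
)

# A predictor can compare or dispatch on the tags instead of guessing how a
# gene list was derived.
same_provenance <- function(a, b) {
  identical(a$method, b$method) && identical(a$params, b$params)
}
same_provenance(sig_z, sig_w)
```

Recording the parameters alongside the method name matches the point made in the discussion that the same tool with a different fold-change cutoff is effectively a different “method”.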

Rob Amezquita (12:24:54): > yes, and in your case @Charlotte Soneson i think the two sides of the vs. would be different methods if we were to be thorough

Rob Amezquita (12:25:35): > even if under the hood it was using the same method - simplest example might be we define DE with a fold change cutoff of 1.5 vs 2.5 - that might be two different “methods”

Kevin Rue-Albrecht (12:25:43): > Can’t say that my example is exactly what Charlotte meant, but bottom line is that signatures need to be “tagged” with information about how they were generated, and how they should be used

Rob Amezquita (12:26:05): > right, ideally with enough info to allow reproducibility

Rob Amezquita (12:26:28): > shoot, this is a lot more complicated than i initially thought haha

Rob Amezquita (12:26:36): > or rather, has that potential:smile:

Kevin Rue-Albrecht (12:29:02): > Yeah. I don’t want to throw all my confusion on you, but it’s kinda where I lost my head this summer, when tackling that by myself.

Kevin Rue-Albrecht (12:29:43): > Point is, let’s start small, and iteratively grow/attach new methods.

Rob Amezquita (12:30:01): > totally agreed!

Kevin Rue-Albrecht (12:31:02): > which is why I just picked up the Seurat PBMC 3k tutorial, and their signature (however arbitrary), as a toy example of automating their manual annotation process

Charlotte Soneson (12:34:14) (in thread): > Yes, fair enough

Kevin Rue-Albrecht (12:34:50): > At the moment, my Gist code is just reporting the proportion of cells that are positive for each signature in each cluster. > If I can propose a challenge to the channel, it would be to come up with our individual solutions of using that set of signatures (until we start using “official” ones) to assign an identity to each cluster. We can write our separate Rmarkdown notebooks, and from there identify a set of functions that will come in handy to automate those approaches.
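
One possible baseline for the challenge above, sketched in base R on toy data: compute the proportion of positive cells per cluster and signature, then give each cluster the signature with the highest proportion. The toy matrix, the cluster labels, and the “highest proportion wins” rule are assumptions for illustration; this is not the Gist code and not how `positiveForMarker` is implemented.

```r
# Toy input: logical matrix (cells x signatures), TRUE when a cell is
# positive for a signature, plus one cluster label per cell.
positive <- matrix(
  c(TRUE,  FALSE,
    TRUE,  FALSE,
    FALSE, TRUE,
    FALSE, TRUE,
    TRUE,  TRUE),
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("cell", 1:5), c("T cell", "B cell"))
)
cluster <- factor(c("1", "1", "2", "2", "2"))

# Count positive cells per cluster, then divide by cluster size to get the
# proportion of positive cells (rows: clusters, columns: signatures).
counts <- rowsum(positive * 1, cluster)
proportions <- counts / as.vector(table(cluster))

# Assign to each cluster the signature with the highest proportion;
# ties go to the first signature in column order.
assigned <- colnames(proportions)[max.col(proportions, ties.method = "first")]
names(assigned) <- rownames(proportions)
assigned
```

A real method would likely also need a minimum-proportion threshold so that clusters matching no signature well are left unassigned rather than forced into the nearest label.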

Kevin Rue-Albrecht (12:35:34): > I’ve started the process with positiveForMarker, for instance.

Rob Amezquita (12:35:35): > can you pin the gist code?

Kevin Rue-Albrecht (12:36:14): > Done

Rob Amezquita (12:36:26): > :thumbsup:

Kevin Rue-Albrecht (12:36:53): > I’ve also pinned the preliminary script that downloads and preprocesses the data set following the Seurat tutorial

Kevin Rue-Albrecht (12:37:19): > You know what, I’ll put both links in the Google doc as well

Rob Amezquita (12:37:56): > i think that’s a great idea, along with the “challenge” above! that might start to make the basis for methods in hancock

Rob Amezquita (12:39:00): > and then for autographs (or whatever) we can start to think about how to properly annotate the signature with metadata on how it was derived/citation, as well as “+”, “-”, “at least one of” sort of info too

Rob Amezquita (12:39:48): > is anyone familiar with a similar-ish construct that is BioC/S4 that we could adapt for signatures?

Kevin Rue-Albrecht (12:41:14): > @Rob Amezquita make sure you get yourself familiar with GSEABase. Vince made several good points earlier in the channel - Attachment: Attachment > Two aspects of GSEABase that I find compelling are a) the capacity to import from existing ‘list-like’ set sources, and b) the capacity to annotate each set or set-collection with significant metadata. We should look around for other set annotation approaches, incorporate advances if such are found, and consider whether GSEABase needs an extension to incorporate gene networks. Interoperation with NDEx could be a plus for certain NCI projects supporting Bioconductor (www.ndexbio.org). @Ludwig Geistlinger @Levi Waldron worked on the orchestrating single cell draft at the GSEA level and may have comments.

Kevin Rue-Albrecht (12:45:16): > not sure how many things we can pin to the channel before it becomes counter-efficient:stuck_out_tongue_winking_eye:

Rob Amezquita (12:46:18): > duly noted, i have moved some of the pins to the google doc for longer term storage

Kevin Rue-Albrecht (12:46:42): > Call the section “-80”

Kevin Rue-Albrecht (12:47:19): > (nerd alert… too late)

Rob Amezquita (12:50:32) (in thread): > no idea what this refers to…a quick google search of “section -80” brought me to a kendrick lamar album called “Section.80”, but not sure if this is what you mean..:upside_down_face:

Kevin Rue-Albrecht (12:50:56) (in thread): > https://www.thermofisher.com/us/en/home/life-science/lab-equipment/cold-storage/tsx-freezers-refrigerators.html?gclid=CjwKCAiA0O7fBRASEiwAYI9QAu5fGkSON_Bfqkt2D4Ylqq91JQX7k67N8sLIicdS4BNr3xiFqalNthoCaAcQAvD_BwE&cid=lpd_ctt_cs_lrf_ULT-neg-80-exact_adwords&s_kwcid=AL!3652!3!203373626821!e!!g!!minus%2080%20freezer&ef_id=W-wyegAABdvzsdI5:20181126175050:s&s_kwcid=AL!3652!3!203373626821!e!!g!!minus%2080%20freezer - Attachment (thermofisher.com): TSX Freezers & Refrigerators | Thermo Fisher Scientific - US > Thermo Scientific TSX Series ultra-low freezers, high-performance refrigerators and freezers are designed with features that support sustainability objectives without compromising performance.

Rob Amezquita (12:53:57) (in thread): > d’oh. obviously have been away from the wet lab too long!

Vince Carey (13:17:39) (in thread): > There are a couple of papers on ontology and marker and signature concepts well worth a look. I am trying to digest them for the single-cell paper by Stephanie and Rob. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1977-1 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5946857/ - Attachment (BMC Bioinformatics): Cell type discovery and representation in the era of high-content single cell phenotyping > A fundamental characteristic of multicellular organisms is the specialization of functional cell types through the process of differentiation. These specialized cell types not only characterize the normal functioning of different organs and tissues, they can also be used as cellular biomarkers of a variety of different disease states and therapeutic/vaccine responses. In order to serve as a reference for cell type representation, the Cell Ontology has been developed to provide a standard nomenclature of defined cell types for comparative analysis and biomarker discovery. Historically, these cell types have been defined based on unique cellular shapes and structures, anatomic locations, and marker protein expression. However, we are now experiencing a revolution in cellular characterization resulting from the application of new high-throughput, high-content cytometry and sequencing technologies. The resulting explosion in the number of distinct cell types being identified is challenging the current paradigm for cell type definition in the Cell Ontology.
In this paper, we provide examples of state-of-the-art cellular biomarker characterization using high-content cytometry and single cell RNA sequencing, and present strategies for standardized cell type representations based on the data outputs from these cutting-edge technologies, including “context annotations” in the form of standardized experiment metadata about the specimen source analyzed and marker genes that serve as the most useful features in machine learning-based cell type classification models. We also propose a statistical strategy for comparing new experiment data to these standardized cell type representations. The advent of high-throughput/high-content single cell technologies is leading to an explosion in the number of distinct cell types being identified. It will be critical for the bioinformatics community to develop and adopt data standard conventions that will be compatible with these new technologies and support the data representation needs of the research community. The proposals enumerated here will serve as a useful starting point to address these challenges. - Attachment (PubMed Central (PMC)): Cell type discovery using single-cell transcriptomics: implications for ontological representation > Cells are fundamental function units of multicellular organisms, with different cell types playing distinct physiological roles in the body. The recent advent of single-cell transcriptional profiling using RNA sequencing is producing ‘big data’, …

Kevin Rue-Albrecht (13:22:34) (in thread): > thanks for the references: they look really interesting!

Lada Koneva (13:24:23): > @Lada Koneva has joined the channel

Rob Amezquita (16:41:08) (in thread): > i wonder if it would be possible to nest different results within a given method…that would be interesting!

2018-11-27

Kevin Rue-Albrecht (17:06:52): > Hey all, I’ve updated the scripts to set up the “challenge” data set (Seurat PBMC 3k tutorial). Please refer to the links listed in “Toy examples” section of the Google Doc pinned to this channel. > Thanks to some advice from @Rob Amezquita and @Stephanie Hicks the setup script now uses the TENxPBMCData package to cache the data set (instead of my homemade use of BiocFileCache). Results are exactly the same, so no biggie.

Stephanie Hicks (17:06:56): > @Stephanie Hicks has joined the channel

2018-11-28

Kevin Rue-Albrecht (02:50:24): > Hi all, > Just to let you know that I’m moving away from the Gist method and I set up a “proper” GitHub repository for test notebooks: https://github.com/kevinrue/Hancock2018 In addition, thanks to @Vince Carey’s feedback on GSEABase, my “proportion of signatures by cluster” notebook is now using GeneSetCollection to store the signatures. > See https://github.com/kevinrue/Hancock2018/blob/master/1-proportion_signature.Rmd - Attachment (GitHub): kevinrue/Hancock2018 > Test notebooks. Contribute to kevinrue/Hancock2018 development by creating an account on GitHub.

Federico Marini (15:55:57): > :wave: signature peeps - I guess some of you might have seen this one already…

Federico Marini (15:56:06): > but for the sake of completeness: http://biorxiv.org/cgi/content/short/456129v1?rss=1

Federico Marini (15:56:40): > moana, a robust and scalable cell type classification framework for single-cell RNA-Seq data

Federico Marini (15:56:59): > python implementation

Kevin Rue-Albrecht (15:57:04): > I’ll add it to the GDoc (pinned item)

Federico Marini (15:58:57): > :thumbsup:

2018-11-29

Kevin Rue-Albrecht (07:24:42): > Hi @Peter Hickey I’ve just added a small code chunk in the vignette to show how simple lists of gene names (identifiers, whatever) can be packaged into GSEABase::GeneSetCollection. > I’m not sure what web queries to Haemo, ImmGen, … would exactly return (JSON, …?) but if it’s anything like a list, would that help you get started? https://github.com/kevinrue/Hancock/blob/master/vignettes/concepts.Rmd#L106 - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.
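
For readers without the vignette at hand, packaging a named list into a GeneSetCollection might look roughly like this. It is a sketch, not the vignette's code chunk: the toy gene lists are illustrative, the code assumes the GSEABase package is installed (it is skipped otherwise), and the `GeneSet()` call follows the same argument pattern as the man page example quoted later in this channel.

```r
# Toy signatures as a named list of gene symbols (illustrative values only).
signatures <- list(
  "T cell" = c("CD3E", "CD3D"),
  "B cell" = c("MS4A1", "CD79A")
)

if (requireNamespace("GSEABase", quietly = TRUE)) {
  library(GSEABase)
  # One GeneSet per list element, declared as gene symbols.
  gene_sets <- Map(
    function(genes, name) {
      GeneSet(SymbolIdentifier(), setName = name, geneIds = genes)
    },
    signatures, names(signatures)
  )
  # Collect the individual sets into a single GeneSetCollection.
  gsc <- GeneSetCollection(unname(gene_sets))
  # Round trip: gene identifiers of each set in the collection.
  round_trip <- geneIds(gsc)
}
```

If a web query returned something list-like, the same `Map()` step would apply once the response has been parsed into a named list of identifiers.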

Kevin Rue-Albrecht (07:57:08): > That said, I wonder whether such web-query code would live in Hancock, GSEABase, or its own package. I’d suggest the latter; my reasoning: > - GSEABase only provides containers, does not currently contain web-related code, and probably shouldn’t in the future > - As Hancock aims to focus on methods for learning/applying signatures, it would not be optimal to force other packages to depend on Hancock just for the sake of accessing those web-query functions. > Bottom line is that I’d imagine an independent lightweight package (like TENxPBMCData) that both Hancock and other signature-related packages could Depends: on

Vince Carey (11:24:52) (in thread): > @Peter Hickey Does haemosphere have a web API? I do not see anything in the available pages.

Vince Carey (13:18:08): > should we use https://dice-database.org/downloads to build putative signatures for immune cell types? https://www.sciencedirect.com/science/article/pii/S009286741831331X - Attachment (sciencedirect.com): Impact of Genetic Polymorphisms on Human Immune Cell Gene Expression > While many genetic variants have been associated with risk for human diseases, how these variants affect gene expression in various cell types remains…

Vince Carey (13:19:43): > perhaps considering the possibility of “eQTL-free” genes as an element of robustness

Peter Hickey (13:22:34) (in thread): > I don’t think so. I’ll ask about any plans for one

Kevin Rue-Albrecht (14:30:22): > That looks like a cool resource, thanks! Apparently only tenure-track level permanent university employees can request access to the raw data, but I guess one can start working with the CSV files of TPM.

Levi Waldron (23:40:13): > FYI in case it is useful to you, http://bioconductor.org/packages/CellMapper/. It uses some tens of thousands of human and mouse microarrays and one or more known cell type-specific “query” genes to identify additional cell type-specific genes by correlation, and it works well. - Attachment (Bioconductor): CellMapper > Infers cell type-specific expression based on co-expression similarity with known cell type marker genes. Can make accurate predictions using publicly available expression data, even when a cell type has not been isolated before.

2018-11-30

Kevin Rue-Albrecht (02:40:12): > Thanks @Ludwig Geistlinger! Good reminder for me, as I even picked the paper for journal club a while back. One more for the “to consider” list (currently, the pinned GDoc)

Kevin Rue-Albrecht (02:47:38): > I’ve put CellMapper under “Current literature”. Thanks again.

Kevin Rue-Albrecht (03:09:40): > https://twitter.com/mritchieau/status/1068297808988450816?s=12 Reminder to self. Add to GDoc - Attachment (twitter): Attachment > ⁦@trashystats⁩ tells us about the dtangle #rstats package for deconvolving cell type mixing proportions in bulk samples #BioCAsia #abacbs2018 https://pbs.twimg.com/media/DtNbMiEVYAAzmvb.jpg

Rob Amezquita (17:22:00): > hi all, i’ve started a companion package that splits out the reference/curated genesets aspect so that Hancock can focus on methods/applications of pre-defined signatures. happy to accept contributions of manual genesets and start looking at expanding the scope to including web-api/online databases as discussed above (e.g. dice, etc.)

Rob Amezquita (17:22:01): > https://github.com/robertamezquita/Inkwell - Attachment (GitHub): robertamezquita/Inkwell > Curated cell-type specific markers for manual classification. - robertamezquita/Inkwell

Rob Amezquita (18:05:18): > @Vince Carey @Martin Morgan would love to hear your thoughts on how we might extend GeneSetCollections/GeneSets to facilitate such usage!

Martin Morgan (19:14:11): > @Martin Morgan has joined the channel

2018-12-01

Kevin Rue-Albrecht (09:32:31): > FWIW > I’ll push a first draft of “predict” method to “Hancock” later today, which should provide matter for discussion

Kevin Rue-Albrecht (12:04:58): > darn that involved a few more internal functions than I expected

Kevin Rue-Albrecht (12:06:01): > checking locally and then pushing. Feedback welcome, but please remember that I have feelings (and that I wrote hasty for loops for the proof-of-concept) :yum:

Kevin Rue-Albrecht (12:11:01): > Available at: https://github.com/kevinrue/Hancock/tree/predict See runnable example at ?predict.GeneSetCollection In particular, have a look at metadata(...) and colData(...)[, "Hancock"] of the output object - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.

Kevin Rue-Albrecht (14:40:54): > Alright, I’m about to merge predict to master. I’m pretty satisfied with the result, especially considering how streamlined the demo notebook looks now: https://github.com/kevinrue/Hancock2018/blob/master/1-proportion_signature.Rmd - Attachment (GitHub): kevinrue/Hancock2018 > Companion notebooks for the Hancock package. Contribute to kevinrue/Hancock2018 development by creating an account on GitHub.

Kevin Rue-Albrecht (15:14:34): > The updated package code is now on the master branch. It is compatible with the notebook above.

Rob Amezquita (18:22:39): > Wow! Solid work!!

Kevin Rue-Albrecht (18:37:40): > Thanks!

Kevin Rue-Albrecht (18:37:45): > First plotting function incoming

Kevin Rue-Albrecht (18:38:26): > and with that, I think we have a fair template to add new, fancier, methods

Kevin Rue-Albrecht (18:55:32): > Alright, off to sleep. Anyone feeling like merging my PR, don’t hold back

2018-12-02

Kevin Rue-Albrecht (05:05:34): > Not sure if GitHub sends notification about that kind of stuff, but I’ve extended the CONTRIBUTING.md file: https://github.com/kevinrue/Hancock/blob/master/CONTRIBUTING.md - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.

Vince Carey (06:58:37): > I filed an issue at https://github.com/Bioconductor/GSEABase/issues/3 to see about adding metadata flexibility for GeneSet and top level metadata concept for GeneSetCollection. I could start a PR but maybe we should discuss a little bit. My thought is that we should allow longDescription in GeneSet to be type ANY (for now) so that something like Biobase’s MIAME could be used to give detailed provenance of any given signature assertion. How a GeneSetCollection will be used is not totally clear to me, but some top-level metadata component seems unobjectionable and useful. - Attachment (GitHub): increasing metadata flexibility/availability? · Issue #3 · Bioconductor/GSEABase > https://github.com/kevinrue/Hancock/blob/master/vignettes/concepts.Rmd by @kevinrue shows how GeneSet and GeneSetCollection can play roles in definition of cell type signatures. It might be desirab…

Kevin Rue-Albrecht (08:24:54): > Thanks for that @Vince Carey! > I’ll shortly add an example of GeneSetCollection that also includes negative markers. I had an example of that using list somewhere. I think that’s the next natural step, before moving into the more challenging (semi-)quantitative signatures

Kevin Rue-Albrecht (14:39:04): > Hi @Vince Carey You’ll probably appreciate the extension of the concepts vignette to include GeneColorSet https://github.com/kevinrue/Hancock/pull/12 - Attachment (GitHub): Extend concept vignette to discuss negative markers by kevinrue · Pull Request #12 · kevinrue/Hancock > demonstrate the use of the GeneColorSet class.

Kevin Rue-Albrecht (14:43:14): > I must say that it gets a bit confusing when dealing with phenotype, geneColor, and phenotypeColor. > It feels easy to drift into double-negatives, e.g. for a “gene G down-regulated in differentiated cell type A” > - setName = “cell type A” > - geneId = “G” > - phenotype = “differentiated” (one could create the same, opposite, gene set with “undifferentiated” here) > - geneColor = “down-regulated” > - phenotypeColor = “complete”

Kevin Rue-Albrecht (14:56:31): > I guess it’s just a matter of agreeing on, and writing down somewhere, some guidelines for creating signature gene sets using this GeneColorSet structure, to help everyone work in a consistent way

2018-12-03

Frederick Tan (09:23:51): > @Frederick Tan has joined the channel

2018-12-04

Kayla Interdonato (10:52:04): > @Kayla Interdonato has joined the channel

Kayla Interdonato (10:56:50): > We are working on expanding the gene set representation idea at https://github.com/Kayla-Morrell/GeneSet. The main motivation is more efficient representation of large data sets and a more familiar tibble-like framework. From the conversation above, it seems like 3 useful features we could add include Entrez (and similar) identifiers, color gene sets, and adding gene and gene set metadata. Any feedback welcome! - Attachment (GitHub): Kayla-Morrell/GeneSet > Contribute to Kayla-Morrell/GeneSet development by creating an account on GitHub.

Kevin Rue-Albrecht (11:03:47): > Hi @Kayla Interdonato. Thanks for the message! > I’m just looking at https://github.com/Kayla-Morrell/GeneSet/compare/master…tibble_implement and I already like seeing things like import.gmt <- function(path) :grin: Did you already have any discussion with the GSEABase team? This looks like a replacement solution rather than an extension, and I’m just curious about what to expect if both packages are meant to co-exist. - Attachment (GitHub): Kayla-Morrell/GeneSet > Contribute to Kayla-Morrell/GeneSet development by creating an account on GitHub.

Kevin Rue-Albrecht (11:17:29): > Also, re: Entrez and other identifiers, I do like the GSEABase sub-classes that implicitly declare the type of gene identifier they contain. See: > > ?`GeneIdentifierType-class` > > Not sure if tibble can handle that. Otherwise DataFrame is probably the way to go

Kevin Rue-Albrecht (11:21:37): > Slightly adapted from the example section of the man page above: > > library(GSEABase) > library(annotate) # provides getEG(); needs the hgu95av2.db annotation package > data(sample.ExpressionSet) # example data shipped with Biobase > ## Another way to change annotation to Entrez (or other) ids > probeIds <- featureNames(sample.ExpressionSet)[100:109] > geneIds <- getEG(probeIds, "hgu95av2") > gs <- GeneSet(EntrezIdentifier(), # <-- that's what I'm talking about > setName="sample.GeneSet2", setIdentifier="101", > geneIds=geneIds) > geneIdType(gs) # <-- and here too >

Ludwig Geistlinger (11:31:13) (in thread): > It seems worth discussing in a dedicated channel #GSEABase

Kevin Rue-Albrecht (11:31:58) (in thread): > Can you invite me when you create it? (EDIT: please)

2018-12-05

Kevin Rue-Albrecht (04:03:56): > While I don’t like the implementation of theSingleRpackage, their preprint (see pinned Google Doc) attracts some interesting discussion points in their Github issues (e.g.https://github.com/dviraran/SingleR/issues/11) > Point is, there is something to learn from the questions of prospective users with respect to what and how package functionality should be documented - Attachment (GitHub): Single cells’ CellType prediction (Signature genes for the cell types) · Issue #11 · dviraran/SingleR > Hi, I am working on the Single cell analysis using Seurat. I am new to the SingleR, It is really very useful for the single cell level cell type prediction. I have read the SingleR documentation, b…

Ludwig Geistlinger (05:04:32): > BTW: just out of interest, how much proof-of-concept is there actually that these gene signatures allow distinction of cell types? It reminds me a bit of the times when people were hunting for cancer signatures, eventually finding that those gene signatures often imply more informativeness than they actually have:https://www.ncbi.nlm.nih.gov/pubmed/22028643 - Attachment (ncbi.nlm.nih.gov): Most random gene expression signatures are significantly associated with breast cancer outcome. - PubMed - NCBI > PLoS Comput Biol. 2011 Oct;7(10):e1002240. doi: 10.1371/journal.pcbi.1002240. Epub 2011 Oct 20. Research Support, Non-U.S. Gov’t

Ludwig Geistlinger (05:08:17) (in thread): > It also seems surprising to me to build this upon tibble instead of an S4Vectors-derivative @Kayla Interdonato?

Kevin Rue-Albrecht (05:20:52): > Just to clarify: I’m not into claiming that Hancock will « distinguish cell types ». > The goal here is to provide software, not signatures. > It’ll be up to users to decide which signatures they use and for what purpose.

Kevin Rue-Albrecht (05:25:45): > That said, we can always suggest sources of signatures, or offer APIs to fetch them from online databases, as suggested by @Peter Hickey, but that’s all an independent question from the use of those signatures

Ludwig Geistlinger (05:54:46): > Ah I thought collecting cell-type specific signatures and then using them to classify the cell type of cells in the context of scRNA-seq data was the purpose of this channel?

Kevin Rue-Albrecht (06:14:09): > Well, I created this channel from a discussion that was touching on both subjects (software and signatures). > I think we can still accommodate both subjects on this channel for now, as they are tightly related subjects. > As discussions grow more active on either subject, perhaps I can rename this channel #sc-signature-software, and we can create another #sc-signature-data

Ludwig Geistlinger (07:18:15): > This clarification helps already. When it comes to #sc-signature-data, I think a lot of inspiration can be drawn from existing work on how to extract such signatures from data and literature: https://www.ncbi.nlm.nih.gov/pubmed/26771021 https://www.ncbi.nlm.nih.gov/pubmed/21546393 https://www.ncbi.nlm.nih.gov/pubmed/22110038 When it comes to #sc-signature-software it seems to currently mostly focus around how to best represent gene signatures (in Bioc) - if I understood correctly? - Attachment (ncbi.nlm.nih.gov): The Molecular Signatures Database (MSigDB) hallmark gene set collection. - PubMed - NCBI > Cell Syst. 2015 Dec 23;1(6):417-425. - Attachment (ncbi.nlm.nih.gov): Molecular signatures database (MSigDB) 3.0. - PubMed - NCBI > Bioinformatics. 2011 Jun 15;27(12):1739-40. doi: 10.1093/bioinformatics/btr260. Epub 2011 May 5. Research Support, N.I.H., Extramural - Attachment (ncbi.nlm.nih.gov): GeneSigDB: a manually curated database and resource for analysis of gene expression signatures. - PubMed - NCBI > Nucleic Acids Res. 2012 Jan;40(Database issue):D1060-6. doi: 10.1093/nar/gkr901. Epub 2011 Nov 21. Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov’t

Kevin Rue-Albrecht (07:20:53): > Absolutely. Good references, and thanks for sharing them. If you feel like it, feel free to add them in the pinned Google Doc in the relevant section “Sources of curated signatures”. Otherwise I’ll try and do it later.

Kevin Rue-Albrecht (12:02:28): > @Rob Amezquita I think Hancock has got a shot at this one!!! :smile: - Attachment: Attachment > Please vote for your favorite BioC 2019 sticker here: https://github.com/Bioconductor/BiocStickers/issues/74

Rob Amezquita (12:03:15): > aha i clicked on it literally thinking “best sticker” but turns out its the Bioc2019 meeting sticker vote!

Rob Amezquita (12:03:22): > i was so ready to submit:stuck_out_tongue:

Kevin Rue-Albrecht (12:03:33): > oh noooo i didn’t click, and i just thought

Kevin Rue-Albrecht (12:03:36): > damn

Rob Amezquita (12:16:21): > i thought the same exact thing haha, great minds think alike!

Rob Amezquita (12:16:31): > there should be a contest tho on best sticker full stop

Rob Amezquita (12:16:33): > :stuck_out_tongue:

2018-12-10

Mark Robinson (15:21:44): > @Mark Robinson has joined the channel

2018-12-13

Matt Ritchie (12:52:56): > @Matt Ritchie has joined the channel

2018-12-18

Kevin Rue-Albrecht (16:22:59): > FYI, extended support incoming for Kayla-Morrell/GeneSet in Hancock https://github.com/kevinrue/Hancock/pull/18 - Attachment (GitHub): Extended support for the Kayla-Morrell/GeneSet package; by kevinrue · Pull Request #18 · kevinrue/Hancock

Kevin Rue-Albrecht (16:27:14): > Still not sure how much effort I should expect to put in for supporting both GeneSetCollection and tbl_geneset in the future, but so far it’s still manageable for my simple use case (only gene sets, no accompanying qualitative nor quantitative information)

Davis McCarthy (18:36:45): > @Davis McCarthy has joined the channel

2018-12-19

Kevin Rue-Albrecht (07:21:57): > Here’s a fun parallel:https://github.com/longhowlam/flowermodel - Attachment (GitHub): longhowlam/flowermodel > shiny app to predict flower species. Contribute to longhowlam/flowermodel development by creating an account on GitHub.

Kevin Rue-Albrecht (07:23:55): > Source:https://www.r-bloggers.com/an-r-shiny-app-to-recognize-flower-species/ - Attachment (R-bloggers): An R Shiny app to recognize flower species > Introduction Playing around with PyTorch and R Shiny resulted in a simple Shiny app where the user can upload a flower image, the system will then predict the flower species. Steps that I took Download labeled flower data from the … Continue reading →

Peter Hickey (18:31:10) (in thread): > (delayed follow up). there’s no web API or plans for one. it’s unclear to me how much easier this makes things since i don’t have any real web programming/resource packaging experience

Kevin Rue-Albrecht (18:43:14): > For anyone curious, the first “learning method” is pushed on branch learn (https://github.com/kevinrue/Hancock/tree/learn) and demonstrated here (https://github.com/kevinrue/Hancock2018/blob/master/2-learn-signatures.Rmd) - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub. - Attachment (GitHub): kevinrue/Hancock2018 > Companion notebooks for the Hancock package. Contribute to kevinrue/Hancock2018 development by creating an account on GitHub.

Kevin Rue-Albrecht (18:52:02): > (merged to master now)

2018-12-20

Kevin Rue-Albrecht (05:00:34): > Updated “Contributing.md” guidelines with a new section on methods to learn signatures: https://github.com/kevinrue/Hancock/blob/master/CONTRIBUTING.md#new-learning-methods - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.

2019-01-02

Kevin Rue-Albrecht (09:27:38): > Does this kind of Bioconductor-related Travis CI error ring a bell to anyone? https://travis-ci.org/kevinrue/Hancock/builds/474378942?utm_medium=notification&utm_source=email > > * installing **source** package 'BiocManager' ... > **** package 'BiocManager' successfully unpacked and MD5 sums checked > **** R > **** inst > **** byte-compile and prepare package for lazy loading > **** help > ***** installing help indices > **** building package indices > **** installing vignettes > **** testing if installed package can be loaded > * DONE (BiocManager) > The downloaded source packages are in > '/tmp/RtmptHsMEs/downloaded_packages' > Updating HTML index of packages in '.Library' > Making 'packages.html' ... done > Error: invalid version specification 'c(3, 9)' > Execution halted > The command "eval Rscript -e 'if (!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager");if (TRUE) BiocManager::install(version = "devel");cat(append = TRUE, file = "~/.Rprofile.site", "options(repos = BiocManager::repositories());")' " failed. Retrying, 2 of 3. > Error: invalid version specification 'c(3, 9)' > Execution halted >

Kevin Rue-Albrecht (09:29:35): > (I’ve Googled and the invalid version specification error seems to have cropped up for other non-Bioc packages here and there, but nothing that I could relate to any recent change in Hancock)

Federico Marini (18:09:21): > looks rather related to travis configs

Federico Marini (18:09:31): > ping maybe jim hester?

Kevin Rue-Albrecht (18:16:00): > Yup. I’ve even relaunched an old Travis build that used to fail R CMD check for a silly missing namespace import, and now it doesn’t even get that far and fails during the installation of dependencies as above, so the issue is clearly not in the package, but rather in Travis

Kevin Rue-Albrecht (18:19:30): > On a somewhat unrelated note, specifically @Federico Marini, if you’re missing our iSEE entertainment, I’ve added a mini Shiny app to Hancock. Best enjoyed with https://github.com/kevinrue/Hancock2018/blob/master/2-learn-signatures.Rmd - Attachment (GitHub): kevinrue/Hancock2018 > Companion notebooks for the Hancock package. Contribute to kevinrue/Hancock2018 development by creating an account on GitHub.

Kevin Rue-Albrecht (18:25:45): > For the record, I’ve tracked down the Travis issue to line 51 of R/version.R in the utils package (the one distributed with base R, not R.utils) > > stop(gettextf("invalid version specification %s", > paste(sQuote(unique(x[!ok])), collapse = ", ")), > call. = FALSE, domain = NA) >

Martin Morgan (20:14:26) (in thread): > This seems to be a change in R-devel; under R-3.8 we have > > > head(BiocManager:::.version_map_get(), 3) > Bioc R BiocStatus > 1 1.6 2.1 out-of-date > 2 1.7 2.2 out-of-date > 3 1.8 2.3 out-of-date > > whereas under devel we have > > > head(BiocManager:::.version_map_get(), 3) > Bioc R BiocStatus > 1 1, 6 2, 1 out-of-date > 2 1, 7 2, 2 out-of-date > 3 1, 8 2, 3 out-of-date > > where the package version columns have lost their class… I’ll follow up on this…
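
A small base R sketch of what a version column "losing its class" can look like. The values are illustrative and are not read from BiocManager's version map; the point is only that a deparsed integer vector such as `c(3, 9)` is not a valid version string.

```r
# A classed version object prints and compares as a version.
v <- package_version("3.9")
class(v)      # "package_version" "numeric_version"
v > "3.10"    # FALSE: components are compared as integers (9 < 10)

# Underneath, the object is a list holding one integer vector per version.
parts <- unclass(v)
str(parts)

# Feeding the deparsed form back in fails with the same wording as the
# Travis log above.
msg <- tryCatch(
    numeric_version("c(3, 9)"),
    error = function(e) conditionMessage(e)
)
msg
```

Once the class attribute is dropped, downstream code that expects a version object receives a plain list, which is consistent with the `1, 6` style of printing shown for R-devel.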

2019-01-03

Kevin Rue-Albrecht (03:49:04) (in thread): > Thanks!:+1:

Lluís Revilla (08:56:29): > @Lluís Revilla has joined the channel

2019-01-04

Martin Morgan (05:41:05) (in thread): > This is fixed in r75946 https://stat.ethz.ch/pipermail/r-devel/2019-January/077155.html

Kevin Rue-Albrecht (05:46:11) (in thread): > Awesome, thanks!

Kevin Rue-Albrecht (05:56:01): > FYI, I’ve fairly extensively cleaned up the “wishlist” in the Google Doc pinned to this channel.

Kevin Rue-Albrecht (06:22:51): > (i.e. simple re-writing of the original bullet points into more descriptive subsections)

2019-01-06

Aaron Lun (03:03:48): > @Aaron Lun has joined the channel

Aaron Lun (03:03:57): > Hit me up with some truth@Kevin Rue-Albrecht

Kevin Rue-Albrecht (04:08:09): > On my quest for a simple « marker identification » template workflow (input/output), I’ve sorted candidate markers from the most frequently to the least frequently detected

Kevin Rue-Albrecht (04:08:26): > That, in a logical matrix that declares whether each marker (row) is detected in each sample (column)

Kevin Rue-Albrecht (04:09:44): > Now, what I’ve (brutally) implemented in R, is to walk down that matrix and return the number/proportion of samples that are positive for the first N markers, with N going from 1 to nrow(x)

Kevin Rue-Albrecht (04:11:14): > This creates a « scree » that I’m using to trim the candidate markers to the top n that are simultaneously detected in at least p% of the samples

Kevin Rue-Albrecht (04:12:53): > The « truth » behind this being that I’d like to « trim out » markers that do look specific to a cluster but are not co-detected with other markers
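
The scree idea described in the messages above can be sketched in base R as follows. This is a toy illustration under stated assumptions, not the hancock implementation: the matrix values are made up, and `scree_proportion()` and `trim_n()` are hypothetical helper names.

```r
# Toy logical detection matrix: markers in rows, already sorted from most to
# least frequently detected; samples in columns.
detected <- rbind(
    gene1 = c(TRUE,  TRUE,  TRUE, TRUE),
    gene2 = c(TRUE,  TRUE,  TRUE, FALSE),
    gene3 = c(TRUE,  FALSE, TRUE, FALSE),
    gene4 = c(FALSE, FALSE, TRUE, FALSE)
)
colnames(detected) <- paste0("sample", 1:4)

# For n = 1..nrow(x): proportion of samples in which the top n markers are
# all detected simultaneously.
scree_proportion <- function(x) {
    vapply(seq_len(nrow(x)), function(n) {
        mean(colSums(x[seq_len(n), , drop = FALSE]) == n)
    }, numeric(1))
}

# Largest n whose top-n markers are co-detected in at least a fraction p of
# the samples (0 if no n qualifies).
trim_n <- function(prop, p) {
    keep <- which(prop >= p)
    if (length(keep) == 0L) 0L else max(keep)
}

prop <- scree_proportion(detected)   # 1.00 0.75 0.50 0.25
n_keep <- trim_n(prop, p = 0.5)      # 3
```

With the user-defined threshold `p = 0.5`, the candidate list would be trimmed to its first three markers, since adding the fourth drops co-detection to a quarter of the samples.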

Kevin Rue-Albrecht (04:15:18): > (If you’ve got time to install the ‘Hancock’ package from my GitHub, I wrote the example code and unit tests that could give you an idea. The function is exported.)

Aaron Lun (04:20:15): > Should lose the capital “H”.

Aaron Lun (04:20:31): > Okay, I think I understand what you want.

Kevin Rue-Albrecht (04:20:34): > Uh. My bad the example code presents only the other two functions documented on that page. Well the ‘scree’ function simply runs on the matrix that comes out of ‘makeMarkerDetectionMatrix’

Aaron Lun (04:20:39): > Basically, for each cell, you want the maximum “N”.

Kevin Rue-Albrecht (04:26:37): > Hm.. there is no “for each cell” in my idea. > I would say the maximum ‘n’ (genes) between 1 and N, simultaneously detected in at least ‘P’% of the samples

Aaron Lun (04:27:43): > What.

Aaron Lun (04:27:48): > What about the cells, then

Aaron Lun (04:27:49): > ?

Aaron Lun (04:28:19): > where did the P come from?

Kevin Rue-Albrecht (04:28:28): > They’re just here to give the proportion of cells positive for each marker, and each set of markers

Aaron Lun (04:28:48): > You’re making less and less sense as we go on.

Kevin Rue-Albrecht (04:28:51): > P is a user-defined threshold. I should have explicitly said that I guess

Aaron Lun (04:29:54): > For your scree plot, the x-axis is the number of genes in the ordered list of markers, and the y-axis is the percentage of cells.

Kevin Rue-Albrecht (04:30:00): > Perhaps code makes more sense:https://github.com/kevinrue/Hancock/blob/master/R/markerDetection-methods.R#L74 - Attachment (GitHub): kevinrue/Hancock > Cell signatures, with confidence. Contribute to kevinrue/Hancock development by creating an account on GitHub.

Aaron Lun (04:30:19): > I’m looking at the code now. I know what it does.

Aaron Lun (04:30:35): > Is my interpretation of the scree plot correct?

Kevin Rue-Albrecht (04:30:38): > X and y axes are correct

Aaron Lun (04:30:44): > Good.

Aaron Lun (04:30:58): > Then all you need is a vector of the max n for each cell.

Kevin Rue-Albrecht (04:32:29): > It’s just at this ‘max n’ that you lose me

Aaron Lun (04:33:22): > each cell has a different max n corresponding to the top number of marker genes that are expressed in that cell.
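
A sketch of the "max n per cell" formulation, again in base R on made-up data. It assumes the markers are already ordered; the variable names are illustrative only. The scree curve then follows directly from the per-cell vector.

```r
# Toy logical detection matrix: ordered markers in rows, cells in columns.
detected <- rbind(
    gene1 = c(TRUE,  TRUE,  TRUE, TRUE),
    gene2 = c(TRUE,  TRUE,  TRUE, FALSE),
    gene3 = c(TRUE,  FALSE, TRUE, FALSE),
    gene4 = c(FALSE, FALSE, TRUE, FALSE)
)
colnames(detected) <- paste0("cell", 1:4)

# For each cell, the largest n such that the top n markers are all detected:
# the position of the first undetected marker, minus one.
max_n <- apply(detected, 2, function(col) {
    first_miss <- match(FALSE, col)
    if (is.na(first_miss)) length(col) else first_miss - 1L
})
max_n   # cell1 = 3, cell2 = 2, cell3 = 4, cell4 = 1

# The scree curve is the proportion of cells whose max n reaches each n.
prop <- vapply(seq_len(nrow(detected)),
               function(n) mean(max_n >= n), numeric(1))
prop    # 1.00 0.75 0.50 0.25
```

Computing one number per cell avoids re-scanning the matrix for every value of n, and the per-cell vector can be kept as cell-level metadata.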

Kevin Rue-Albrecht (04:33:40): > Ooooooooh

Kevin Rue-Albrecht (04:33:51): > :bulb:

Aaron Lun (04:50:25): > What’s this “GeneSet” package?

Kevin Rue-Albrecht (04:54:29): > https://github.com/Kayla-Morrell/GeneSet, see #gseabase - Attachment (GitHub): Kayla-Morrell/GeneSet > Contribute to Kayla-Morrell/GeneSet development by creating an account on GitHub.

Aaron Lun (04:55:43): > Great. More tidyverse stuff. Just what I need.

Kevin Rue-Albrecht (05:03:13): > Well, from what I gather, discussion is still fairly open with Kayla and Martin in terms of final implementation. > I’m not very well placed to support or criticize the tidyverse. I neither rely on it nor avoid it. - Attachment: Attachment > I had kind of hoped that the tibble column would suffice, but you’re right that the duplicate genes is problematic (but that’s just tidy data, right?). Similarly for set-level data. I think ‘externally’ the information could be represented as a single tibble, but internally maintain three tables – gene / set mapping; gene annotation; set annotation.

Aaron Lun (05:36:21): > Two general package comments: > - “Hancock” -> “hancock”, the capitalization is extraneous when you don’t actually have multiple words (a la camelCase). As we know, all the classy packages have all-lowercase names. scater, scran, csaw… I mean, compare to SingleCellExperiment - ugh. > - Do you really need to have Shiny in this package as well? It’s worth quarantining the computational/mathematical routines that are actually useful from the interactive fluff that’s only around to please point-and-clickers from the wet labs.

Kevin Rue-Albrecht (07:48:52): > The package doesn’t need a Shiny app (or a set of small apps, as I envisioned it). But while I’m playing with proof of concepts and waiting for metadata slots in the geneset objects, I imagined how users could conveniently visualise and rename geneset signatures. > That said, in practice I would probably use iSEE, to navigate the annotated while taking notes, followed by a renaming of the signature at the R prompt. > > Renaming the package is fair enough. I just thought to acknowledge the original person name, like Seurat.

Aaron Lun (07:50:40): > I thought that was what you were thinking. But if I were to say, “Hancock can find it for you”… am I referring to the package? Some guy named Hancock? Or what?

Kevin Rue-Albrecht (07:52:26): > —> The package, focused on finding/using gene signatures, which itself refers to a guy family-named Hancock, famous for his fancy signature

Aaron Lun (07:53:20): > But if I were to say, “You can get hancock to do it”, it’s clear that it’s not a person’s name.

Aaron Lun (07:53:45): > Contrast this to, “You can get Hancock to do it”, in which case people would be saying, “okay, what’s his email?”

Kevin Rue-Albrecht (07:54:13): > :joy:. Fair enough. I don’t have a problem with the renaming, to be honest.

Aaron Lun (07:54:24): > I just dont want to hold the shift key.

Aaron Lun (07:54:46): > The random H in diffHic is one of my greatest shames.

Kevin Rue-Albrecht (07:54:58): > You’re making me wonder how many people have searched for Seurat’s e-mail address

Kevin Rue-Albrecht (07:55:31): > Well… diffhic does look a bit weird to read and pronounce

Aaron Lun (07:56:49): > Yes. Yes it does.

Aaron Lun (07:57:07): > I wanted hiccup but it was taken already.

Kevin Rue-Albrecht (07:57:39): > Oh no:scream:that’s a shame! Now I’m curious what that one does

Aaron Lun (08:29:47): > Looking at your sticker. I assume the Hancock signature image is public domain.

Kevin Rue-Albrecht (08:55:44): > https://commons.wikimedia.org/wiki/File:JohnHancocksSignature.svg > > This signature is believed to be ineligible for copyright and therefore in the public domain because it falls below the required level of originality for copyright protection

2019-01-07

Federico Marini (05:18:08): > -> https://www.biorxiv.org/content/early/2019/01/04/512434?rss=1 might be of interest to the channel - Attachment (bioRxiv): A pitfall for machine learning methods aiming to predict across cell types > Machine learning models to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that, when the training set contains examples derived from the same genomic loci across multiple cell types, then the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.

Aaron Lun (05:33:44): > Don’t see the big deal from the abstract. Is it that surprising that the choice of training set determines performance?

Kevin Rue-Albrecht (05:35:07): > Thought so too. There’s probably (going to be) a gold rush for papers on every challenging aspect of machine learning for cell type, I suppose.

Federico Marini (05:37:33): > not reinventing the wheel, of course

Federico Marini (05:38:03): > I posted it here since Kev was trying to find some kind of benchmarking, and maybe they might have relevant points

Aaron Lun (05:38:35): > Trying to find good benchmarks for hancock is gonna suck.

Aaron Lun (05:38:55): > it’s going to be about trying to convince people that the chosen measure of signature activity is useful.

Aaron Lun (05:39:22): > I mean… it’ll be like, “oh yeah, B cells have high signature scores for B cell gene sets.”

Aaron Lun (05:39:28): > Real inspiring stuff.

Kevin Rue-Albrecht (05:40:56): > Aside from benchmarking, what I’d really like to have is a single API to multiple methods (think glm(method = ...)) for learning and applying signatures to new data sets
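
One way such a single entry point could look, sketched in base R. Everything here is hypothetical: `learn_signatures()`, its arguments, and the two toy scoring methods are made up for illustration and are not part of hancock.

```r
# Hypothetical front-end: one function, several interchangeable methods.
learn_signatures <- function(x, groups,
                             method = c("detection", "mean_diff"), n = 2L) {
    method <- match.arg(method)   # rejects unknown method names
    groups <- factor(groups)
    scorer <- switch(method,
        # Difference in detection rate between target cells and the rest.
        detection = function(target, rest) {
            rowMeans(target > 0) - rowMeans(rest > 0)
        },
        # Difference in mean expression between target cells and the rest.
        mean_diff = function(target, rest) {
            rowMeans(target) - rowMeans(rest)
        }
    )
    # One signature (top n gene names) per group.
    lapply(split(seq_along(groups), groups), function(idx) {
        score <- scorer(x[, idx, drop = FALSE], x[, -idx, drop = FALSE])
        names(sort(score, decreasing = TRUE))[seq_len(min(n, length(score)))]
    })
}

# Made-up expression matrix: genes in rows, cells in columns.
x <- rbind(
    geneA1 = c(5, 4, 6, 0, 0, 0),
    geneA2 = c(3, 2, 0, 0, 0, 0),
    geneB1 = c(0, 0, 0, 7, 8, 6),
    geneB2 = c(0, 1, 0, 2, 3, 0)
)
colnames(x) <- paste0("cell", 1:6)
groups <- rep(c("A", "B"), each = 3)

sig_detection <- learn_signatures(x, groups, method = "detection")
sig_mean_diff <- learn_signatures(x, groups, method = "mean_diff")
```

The caller's code stays the same when a new method is added; only the `switch()` gains a branch, which is also what would make side-by-side benchmarking straightforward.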

Kevin Rue-Albrecht (05:43:49): > Then, yes sure, that would open a nice door for benchmarking. But ultimately, I’d anticipate different use cases to arise where different methods perform better for each.

Kevin Rue-Albrecht (05:46:37): > (oversimplistic illustration: automating the ‘easy’ annotation of wildly different PBMCs vs identifying markers of closely related cell “subtypes”)

Kevin Rue-Albrecht (05:48:37): > (or classifying bulk samples vs single cells, where the first might use quantitative expression levels and the latter use binary detection)

2019-01-08

Stephanie Hicks (04:28:19): > This seems like it should be added to the google doc:https://github.com/irrationone/cellassign - Attachment (GitHub): Irrationone/cellassign > Automated, probabilistic assignment of cell types in scRNA-seq data - Irrationone/cellassign

Kevin Rue-Albrecht (04:57:05): > Done. Thanks!

2019-01-11

Charlotte Soneson (03:25:23): > another one:https://aekiz.shinyapps.io/Cell_identity_predictor/

Federico Marini (03:54:16): > :thumbsup:I saw that as well yesterday, forgot to post here right away

Federico Marini (03:54:24): > good catch Charlotte

Kevin Rue-Albrecht (04:14:11): > Thanks guys!~~~Quick poll: who doesn’t have edit access to the GDoc?~~~

Kevin Rue-Albrecht (04:16:44): > Quick poll: who would prefer the Gdoc: > - “comment only” for “anyone with the link” and “edit access” for explicitly invited people?

Kevin Rue-Albrecht (04:17:18): > - “edit access” for “anyone with the link” (initially shared only on this Slack, but then I can’t control where else the link is shared)

Kevin Rue-Albrecht (04:32:19) (in thread): > I’m a bit short on time for the next week, but I’m keen to do a survey of strategies out there to store the gene signatures, in all those “cell type prediction” packages. (In comparison to efforts by Kayla-Morrell/GeneSet and llrs/BaseSet)

Kevin Rue-Albrecht (04:34:03) (in thread): > In the absence of a clear consensus on the “common class and methods” beyond the classes currently available in GSEABase, I assume each of those packages implements its own data structure.

Charlotte Soneson (04:42:59) (in thread): > This one seems not to store signatures, it calculates gene-wise logFCs on-the-fly from a provided reference expression matrix and uses agreement based on all genes for the final score.

Kevin Rue-Albrecht (17:20:12): > Note that hancock (package and GitHub repo) and hancock2018 (GitHub repo) are now fully lowercase. Don’t forget to update your git remote

2019-01-13

Kevin Rue-Albrecht (12:17:05): > I don’t know why I didn’t do GitHub pages sooner –>https://kevinrue.github.io/hancock2018/index.html

Stephanie Hicks (15:09:34): > :tada:

2019-01-16

Rob Amezquita (18:04:56): > https://github.com/Irrationone/cellassign

Rob Amezquita (18:05:02): > comes with a preprint!

Kevin Rue-Albrecht (18:07:16): > Thanks. I’ve added the preprint link to the GDoc

2019-01-24

Steve Lianoglou (13:56:49): > @Steve Lianoglou has joined the channel

Stephanie Hicks (14:51:55): > not sure if this has already come across your desk @Kevin Rue-Albrecht, but just in case not https://www.biorxiv.org/content/10.1101/508085v1 - Attachment (bioRxiv): SingleCellNet: a computational tool to classify single cell RNA-Seq data across platforms and across species > Single cell RNA-Seq has emerged as a powerful tool in diverse applications, ranging from determining the cell-type composition of tissues to uncovering the regulators of developmental programs. A near-universal step in the analysis of single cell RNA-Seq data is to hypothesize the identity of each cell. Often, this is achieved by finding cells that express combinations of marker genes that had previously been implicated as being cell-type specific, an approach that is not quantitative and does not explicitly take advantage of other single cell RNA-Seq studies. Here, we describe our tool, SingleCellNet, which addresses these issues and enables the classification of query single cell RNA-Seq data in comparison to reference single cell RNA-Seq data. SingleCellNet compares favorably to other methods, and it is notably able to make sensitive and accurate classifications across platforms and species. We demonstrate how SingleCellNet can be used to classify previously undetermined cells, and how it can be used to assess the outcome of cell fate engineering experiments.

Kevin Rue-Albrecht (16:41:35): > Awesome thanks! I’m buried in so many things right now that I’ve fallen behind on reading

Stephanie Hicks (22:15:33): > :thumbsup:

2019-01-29

Stephanie Hicks (05:14:20): > https://www.biorxiv.org/content/10.1101/532093v1 - Attachment (bioRxiv): Automated identification of Cell Types in Single Cell RNA Sequencing > Cell type identification is one of the major goals in single cell RNA sequencing (scRNA-seq). Current methods for assigning cell types typically involve the use of unsupervised clustering, the identification of signature genes in each cluster, followed by a manual lookup of these genes in the literature and databases to assign cell types. However, there are several limitations associated with these approaches, such as unwanted sources of variation that influence clustering and a lack of canonical markers for certain cell types. Here, we present ACTINN (Automated Cell Type Identification using Neural Networks), which employs a neural network with 3 hidden layers, trains on datasets with predefined cell types, and predicts cell types for other datasets based on the trained parameters. We trained the neural network on a mouse cell type atlas (Tabula Muris Atlas) and a human immune cell dataset, and used it to predict cell types for mouse leukocytes, human PBMCs and human T cell sub types. The results showed that our neural network is fast and accurate, and should therefore be a useful tool to complement existing scRNA-seq pipelines.

Kevin Rue-Albrecht (05:15:53): > We know the 2019 fashion choice for new methods:sweat_smile:

Stephanie Hicks (05:53:14): > :joy:

Kevin Rue-Albrecht (06:17:45): > I’ve added the preprint and GH repo to the hancock MVP. Thanks!

2019-01-30

Davide Risso (10:44:25): > @Davide Risso has joined the channel

2019-02-05

Kevin Rue-Albrecht (03:17:17): > https://twitter.com/coletrapnell/status/1092577935838433280?s=12 They’re everywhere :space_invader::alien: - Attachment (twitter): Attachment > Very proud to announce “Garnett”, a software system for rapidly annotating cells according to type in single-cell RNA-seq. Written by @HPliner with @JShendure. Software: https://cole-trapnell-lab.github.io/garnett/ and preprint: https://www.biorxiv.org/content/10.1101/538652v1

Federico Marini (03:31:48): > :mushroom::mushroom::mushroom::mushroom::mushroom::mushroom::mushroom::mushroom::mushroom::mushroom:

Kevin Rue-Albrecht (05:16:42): > Naïve question: is there a « Bioconductor » equivalent of Seurat’s FindMarkers wrapper around multiple DE testing frameworks? Or is that just a recipe for infinite-dependencies disaster?

Aaron Lun (05:21:48): > You could try scran::combineMarkers.

Kevin Rue-Albrecht (05:22:31): > Will check out, thanks!

Aaron Lun (05:23:59): > Quite glad I’m not responsible for cell type assignment code.

Aaron Lun (05:24:39): > Seems like that would be an endless stream of complaints about misclassification.

Aaron Lun (05:25:15): > And I don’t understand why Cole sticks to ExpressionSet. He should switch over to SEs (or even better SCEs) already.

Kevin Rue-Albrecht (05:27:41): > Took me time to accept letting go of the ExpressionSet myself… but that is now years ago and I’m so glad I did!

Aaron Lun (05:28:42): > Admittedly I didn’t do a good job of selling it to him at the start, but the continued development in Eset is frustrating. At least the new stuff should use SEs.

Aaron Lun (05:32:42): > Continuing this discussion on #singlecellexperiment

Kevin Rue-Albrecht (05:33:39): > I can’t see a durable strategy for “cell type assignment codes”. From my limited perspective, the only stable thing that I can see would be FilterRules on the gene expression matrix (in that case reshape’d into long format)

Kevin Rue-Albrecht (05:34:51): > then, if downstream users want to name/alias their FilterRules as “CD4+ T cells” etc. it’s up to them
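
A base R analogue of that idea, for illustration only. It does not use the S4Vectors FilterRules class; the long-format table, the gene choices, the rule structure, and the helper `apply_rule()` are all assumptions made up for this sketch.

```r
# Long-format expression table: one row per (cell, gene) pair.
expr <- data.frame(
    cell  = rep(c("c1", "c2", "c3"), each = 2),
    gene  = rep(c("CD3E", "CD4"), times = 3),
    count = c(5, 2, 4, 0, 0, 0)
)

# Named rules: each pairs a gene with a predicate on its expression value.
rules <- list(
    "CD3E+" = list(gene = "CD3E", test = function(v) v > 0),
    "CD4+"  = list(gene = "CD4",  test = function(v) v > 0)
)

# Evaluate one rule for every cell; returns a logical vector named by cell.
apply_rule <- function(expr, rule) {
    sub <- expr[expr$gene == rule$gene, ]
    setNames(rule$test(sub$count), sub$cell)
}

# Cells in rows, rules in columns.
passes <- vapply(rules, function(r) apply_rule(expr, r), logical(3))

# Cells passing every rule; the alias attached to them is the user's choice.
labelled <- rownames(passes)[rowSums(passes) == ncol(passes)]
labelled   # "c1"
```

The rules themselves only state testable conditions on expression values; whether the cells that pass are called “CD4+ T cells” is left to whoever applies them.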

Aaron Lun (05:36:42): > I’m seeing “training” in their schematic, so that’s not so bad - provided your training set is like your test set. Good luck otherwise.

Kevin Rue-Albrecht (05:43:30): > Exactly. I’ve heard for months about NNMF and transfer learning, but as you point out that’s only useful if data sets were processed using the same pipeline, or perhaps if they have at least some features in common. Tough

Kevin Rue-Albrecht (05:46:22): > it’s still at the prototype stage, but I’ve reused the GSEABase subclass’ing pattern here https://github.com/kevinrue/unisets/blob/master/R/AllClasses.R#L380 with the idealistic idea in mind of as(entrezids, "ENSEMBL") on signatures. > obviously with the caveat of multimapping etc.

2019-02-11

Vince Carey (22:14:32): > There is a new app in ontoProc (devel branch) called ctmarks(). It is an attempt to aid in the derivation of “cell type signatures” from Cell Ontology. One selects from a Cell Ontology class, and then the ‘intersection_of’, ‘has/lacks_plasma_membrane_part’, and allied property types are traversed with the aim of finding genes useful for discriminating the query cell type. Here is a screen shot. You can pick a few cell types (sequentially checking the ‘tags’ panel to ensure search is conducted) and after pressing the ‘stop app’ button, a data.frame is returned with information about your selections. One has to read the ‘condition’ field (hasPMPart means ‘has plasma membrane part’, etc.) to decide whether the gene should be “expressed or not” in the expression-based signature.

Vince Carey (22:15:07): - File (PNG): ctmarks.png

2019-02-12

Vince Carey (08:33:13): > ontoProc now also includes sym2CellOnto, which will enumerate cell types for which a given gene product is mentioned as a part/plasma membrane part etc. It is somewhat disconcerting that genes mentioned as distinctive in the PBMC3k tutorials frequently do not have the claimed roles asserted in the ontologies. Probably more knowledge needs to be transferred into the ontologies.

2019-02-17

Kevin Rue-Albrecht (12:51:49): > Thanks Vince. I’m definitely curious to look into this kind of integration with learning/prediction methods. > Separately, the hancock vignette was updated to use the PBMC data set best known for featuring in the Seurat clustering vignette. I’m also showcasing a couple more features https://kevinrue.github.io/hancock/articles/hancock.html For the time being, the package supports containers for gene set signatures from the GSEABase, kevinrue/unisets, and Kayla-Morrell/GeneSet. > Until we settle on a durable gene set container, I selfishly “preferred” unisets as an output of learning methods, as I designed it myself to include all the metadata slots that I wish to use for signatures (relation, element, set).

2019-02-18

Lukas Weber (06:23:48): > @Lukas Weber has joined the channel

Diego Diez (23:16:58): > @Diego Diez has joined the channel

2019-04-10

Stephanie Hicks (15:09:06): > seems relevant for this channel, but not sure if you guys have already seen it https://www.ncbi.nlm.nih.gov/pubmed/30289549/ - Attachment (ncbi.nlm.nih.gov): CellMarker: a manually curated resource of cell markers in human and mouse. - PubMed - NCBI > Nucleic Acids Res. 2019 Jan 8;47(D1):D721-D728. doi: 10.1093/nar/gky900.

2019-05-21

Almut (06:37:31): > @Almut has joined the channel

2019-06-23

Ameya Kulkarni (22:09:20): > @Ameya Kulkarni has joined the channel

2019-06-24

Komal Rathi (09:23:42): > @Komal Rathi has joined the channel

2019-06-26

Junhao Li (13:30:30): > @Junhao Li has joined the channel

2019-06-28

Aedin Culhane (15:57:19): > @Aedin Culhane has joined the channel

2019-07-30

Friederike Dündar (09:34:08): > @Friederike Dündar has joined the channel

2019-08-03

Mikhael Manurung (13:54:43): > @Mikhael Manurung has joined the channel

2019-08-08

Aaron Lun (12:25:16): > For want of a better place to talk about this, let’s use this channel, because SingleR sort-of does annotation-y stuff and I don’t want to start a new channel just for it unless there’s a need to.

Dan Bunis (12:25:55): > @Dan Bunis has joined the channel

Jared Andrews (12:25:55): > @Jared Andrews has joined the channel

Aaron Lun (12:27:00): > Adding @Friederike Dündar manually, as slack doesn’t seem to want to add you.

Aaron Lun (12:27:09): > Oh, wait, you’re already here. Whoops.

Dvir Aran (12:27:11): > @Dvir Aran has joined the channel

Jared Andrews (12:34:10): > Are there any plans for additional changes beyond what’s described in: https://github.com/LTLA/SingleR/issues/1

Jared Andrews (12:37:33): > One more - why recommend logcounts for the training set but not for the test set?

Aaron Lun (12:38:39): > Yeah, this is pretty unintuitive. The default marker detection methods effectively operate on differences of the inputs, which is only the log-fold change if your inputs were logged.

Aaron Lun (12:38:51): > If you’re supplying your own gene sets, this doesn’t matter.
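
The point about differences only being log-fold changes on the log scale can be illustrated with a minimal sketch in base R. The gene names and values are made up for illustration, and the ranking by absolute difference is a simplification, not SingleR's actual marker detection code.

```r
# Hypothetical mean expression of two genes in two reference labels.
label_a <- c(geneX = 1000, geneY = 2)
label_b <- c(geneX = 900, geneY = 60)

# Difference of raw values: dominated by the highly expressed gene.
raw_diff <- label_a - label_b

# Difference of logged values: this is the log-fold change.
lfc <- log2(label_a + 1) - log2(label_b + 1)
lfc_from_ratio <- log2((label_a + 1) / (label_b + 1))

# The two rankings disagree about which gene is the stronger marker.
top_raw <- names(which.max(abs(raw_diff)))
top_log <- names(which.max(abs(lfc)))
```

Here the raw difference favours geneX (a modest relative change at high abundance), while the log-scale difference favours geneY (a large fold change at low abundance).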

Aaron Lun (12:39:45): > I have to say that this is probably confusing enough that we should make the default marker detection methods assume that the inputs are counts, but that would probably break people’s code. Happy to change it if people are fine with it.

Jared Andrews (12:42:23): > I don’t particularly care, was just wondering the reason behind it. It might be good to make it consistent one way or the other.

Aaron Lun (12:46:34): > Yes, that is probably true.

Dan Bunis (12:50:58): > I hadn’t actually realized that the defaults were different. Is it actually better to supply raw counts versus log-normalized for the test data? If so, a note about that could go into the details section of the documentation.

Aaron Lun (12:51:48): > It shouldn’t matter for the test, because the correlations will be the same for any monotonic transformation (give or take some numerical error).
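
A small base R check of the invariance claim, assuming the correlation in question is rank-based (Spearman); the simulated vectors are placeholders, not real expression data.

```r
set.seed(1)

# Hypothetical test cell (counts) and reference profile over 200 genes.
test_counts <- rpois(200, lambda = 5)
ref_profile <- rexp(200)

# Spearman correlation depends only on ranks, and log2(x + 1) is strictly
# increasing, so the ranks (including ties) are unchanged.
rho_raw <- cor(test_counts, ref_profile, method = "spearman")
rho_log <- cor(log2(test_counts + 1), ref_profile, method = "spearman")

same_ranks <- identical(rank(test_counts), rank(log2(test_counts + 1)))
```

A Pearson correlation would not share this property, which is why the choice of scale still matters for anything computed from differences.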

Aaron Lun (12:52:09): > But we really should make it easier for people and just require all-count inputs unless they know better.

Aaron Lun (12:52:36): > If this is the case, then the default marker detection scheme has to be changed slightly, because it doesn’t make sense to rank by differences of raw counts.

Jared Andrews (12:55:49): > Also, I think all of the reference datasets are already log-normalized.

Dan Bunis (14:12:39): > @Jared Andrews is correct, and I don’t think it’s worth changing to counts. We can make it clear in vignettes and doc-details sections that log-normalization for the reference is important (as Dvir did previously).

Aaron Lun (14:13:16): > Okay, in that case, all of our things should mention that we expect log-counts as input unless people know what they’re doing.

Aaron Lun (14:13:35): > I think the original docs (at least in the function documentation) just asked for counts or RPKMs. Can’t remember though.

Dan Bunis (14:31:11): > I’m not sure if it was in the docs tbh. But I remember it from the online vignette

Jared Andrews (14:37:02): > Original docs just asked for counts, but had an option to normalize to TPM for full-length methods.

Aaron Lun (14:37:09): > Right, so no logging was suggested.

Aaron Lun (14:37:33): > Well, a doc update is probably the most seamless change.

Aaron Lun (15:17:01): > Right, let’s do this.

Aaron Lun (15:19:12): > I’m proposing ref for the reference data, test for the test data. Nice and clean and short.

Aaron Lun (15:19:22): > I’ll also change all of the documentation to expect log-transformed values.

Aaron Lun (16:17:13): > Alright, things left to do. The first is to push ahead with the EHub upload (@Friederike Dündar). The second is to add tests for the visualization functions; we should probably also include their use in the vignette (@Dan Bunis).

Aaron Lun (16:17:56): > Re tests: they don’t have to be fancy, they just have to hit the code lines to make sure there’s no errors.

Aaron Lun (16:18:10): > check out scater’s test suite for plotting functions.

Friederike Dündar (16:31:13): > RE: ExpHub data > Can you take a look here: https://github.com/friedue/SingleR/tree/master/inst/scripts

Friederike Dündar (16:31:23): > to see whether this is going in the direction you’ve envisioned?

Friederike Dündar (16:32:00): > I’ve used Jared’s function to retrieve the RDA from the original github repo for now

Friederike Dündar (16:32:32): > I guess, eventually we should have the full documentation about how those RDA files came to be in there

Friederike Dündar (16:36:45): > I’ll push the .R file for retrieving the data and putting it into an SE object in a couple of minutes

Dan Bunis (16:54:09): > Adding on another thing left to do: There is one more visualization function that I need to update and add back (originally SingleR.DrawBoxPlot in Dvir’s code).

Aaron Lun (17:54:48): > The original generation of the RDA files is a rather tortuous process, according to @Dvir Aran. It would be nice to have this information but we’ll have to take it on faith for the time being.

Aaron Lun (17:56:27): > Anyway, review comments are back @Friederike Dündar.

Aaron Lun (20:04:30): > I guess anyone else who wants to help us get through the last mile (@Jared Andrews?) can read through the vignette and docs and point out any confusing/contradictory things.

Jared Andrews (20:05:28): > Sure, will get to this tonight.

Aaron Lun (20:06:31): > Just ignore my own TODOs, scran is a dumpster fire at the moment.

Jared Andrews (22:25:51): > @Aaron Lun Is logNormCounts preferred over normalize in scater? We have both, but deprecation warnings abound with normalize.

Aaron Lun (22:28:49): > Yes, the former is preferred. normalize was just too vague, and had problems with igraph::normalize.

Jared Andrews (22:38:14): > Okay, thought so. I am running into an error with the sceG data in the vignette though - size factors should be positive real numbers.

Jared Andrews (22:39:40): > That set doesn’t have sizeFactors already calculated, but the same error is thrown even after sceG <- computeLibraryFactors(sceG). There are no negative numbers in sizeFactors(sceG) after running, and as far as I can tell, they all look fine.

Aaron Lun (22:48:23): > Really. Hm. Will take a look after dinner.

Jared Andrews (22:51:28): > Okay - I’m not too familiar with scater, so hopefully I’m just missing something obvious. I’ve added basic visualization examples to the vignette and went ahead and swapped all normalize instances to logNormCounts despite the error.

Aaron Lun (23:38:10): > :+1:

Aaron Lun (23:38:17): > Let me finish my nightly anime and I will get back to this.

2019-08-09

Aaron Lun (01:28:13): > Gee, this package is looking good, guys.

Aaron Lun (01:28:22): > It’s definitely come a long way.

Friederike Dündar (09:23:48): > yes, it has! and it’s so useful, too!:slightly_smiling_face:

Friederike Dündar (09:24:13): > btw, vignette example works fine for me, but I’m using the devel version for everything

Jared Andrews (09:54:40): > Yes, that should be fixed. It’s definitely useful. And a bit more usable now.

Dvir Aran (17:37:06): > Hey, sorry for the late entry :slightly_smiling_face: super thankful to all of you!

Dvir Aran (17:40:49): > Also sorry that I had to take a step back. Overwhelmed currently by other things, and can’t find any free time. Regarding the references - as I told Aaron and Friederike, the generation of those references wasn’t systematic. I can provide explanations of what I did, but I don’t have code that can produce them.

Dvir Aran (17:42:24): > Immgen/hpca - just downloaded all the CEL files from GEO and used rma to normalize them using a custom cdf file from brainarrays.

Dvir Aran (17:44:35): > blueprint_encode - downloaded raw counts from the sorted samples from the blueprint and encode websites. Used edgeR to normalize and my TPM function (that can be found in the original SingleR) to do gene length normalization.

Dvir Aran (17:45:04): > mouse.rnaseq - got it from Berenice Benayoun.

Dvir Aran (17:46:57): > If any of you wants to recreate the immgen/hpca data, here is a list of the GSM ids used. - File (CSV): hpca.csv - File (CSV): immgen.csv

Dvir Aran (17:48:41): > The code will look something like this: > > library(affy) > library(annotate) > source("https://bioconductor.org/biocLite.R") > biocLite('org.Hs.eg.db') > biocLite('org.Mm.eg.db') > > install.packages('~/Documents/BrainArray/hgu133plus2hsentrezgcdf_22.0.0.tar.gz', repos=NULL, type='source') > install.packages('~/Documents/BrainArray/hugene10sthsentrezgcdf_22.0.0.tar.gz', repos=NULL, type='source') > install.packages('~/Documents/BrainArray/hugene11sthsentrezgcdf_22.0.0.tar.gz', repos=NULL, type='source') > install.packages('~/Documents/BrainArray/hgu133ahsentrezgcdf_22.0.0.tar.gz', repos=NULL, type='source') > > require(org.Hs.eg.db) > require(org.Mm.eg.db) > > setwd('~/cel_files') > CELs.entrezgene = ReadAffy(cdfname="hgu133ahsentrezgcdf") > > eset.entrezgene = rma(CELs.entrezgene) > expr = exprs(eset.entrezgene) > x = rownames(exprs(eset.entrezgene)) > x = substr(x,1,nchar(x)-3) > sym = getSYMBOL(x,data='org.Hs.eg') > rownames(expr) = sym

Friederike Dündar (18:05:23): > thanks, every bit of info helps! :slightly_smiling_face:

Friederike Dündar (18:07:53): > @Aaron Lun what would be the most elegant way of integrating those GSM-csv files into the documentation? Should we just provide them in inst/extdata?

Friederike Dündar (18:08:14): > and let people know they exist in the referenceDataSets.Rmd vignette?

Friederike Dündar (18:09:11): > (and the documentation of the respective data retrieval functions)

Aaron Lun (18:09:11): > Yes, I would stick them in inst/extdata.

Aaron Lun (18:09:28): > Possibly in a subdirectory to avoid cluttering things up.

Aaron Lun (18:09:42): > I’ll leave that to your aesthetics.

Friederike Dündar (18:10:16): > :+1:

Aaron Lun (19:27:26): > Probably in inst/scripts, actually, as inst/extdata is very specific for EHub use.

Friederike Dündar (20:31:57): > isn’t it the other way around?

Friederike Dündar (20:32:09): > I feel like I’ve used “extdata” in other packages, but never “scripts”

Aaron Lun (20:32:51): > The thing is, Lori will go looking for the metadata CSV files in inst/extdata, while scripts is pretty much left to our own devices.

Aaron Lun (20:33:09): > So perhaps a better way of saying this is that the structure of files in inst/extdata is very specific for EHub use.

Friederike Dündar (20:35:26): > will she also look in subfolders of extdata?

Friederike Dündar (20:36:15): > alternatively, we could add another folder at the level of scripts and extdata, named something like refinfo

Friederike Dündar (20:36:20): > or just info

Aaron Lun (20:37:41): > The reason why I suggest that they should be in inst/scripts is because they will only ever be read by a hypothetical script that re-creates the Immgen/HPCA data.

Aaron Lun (20:38:42): > I don’t anticipate that these files would be of interest anywhere else. If one wanted to use them, it would just be to fill out the SourceVersion in the metadata CSVs.

Aaron Lun (20:40:13): > I mean, it doesn’t matter, because we don’t have that hypothetical script anyway. So perhaps we should just put this aside.

Friederike Dündar (20:42:29): > for me it’s more about documentation than actually using them

Friederike Dündar (20:42:49): > the “blueprint-encode” dataset does not contain all of either blueprint or encode

Friederike Dündar (20:43:00): > it contains a subset of samples, which are specified in the csv

Dvir Aran (21:06:50): > those are all the samples that were available in early 2016

Dvir Aran (21:07:47): > it definitely makes sense to recreate this reference. I’m sure there are tools now to pull those samples.

Dvir Aran (21:10:02): > one more thing that I probably missed - after having the expression matrix and annotations, this is the code for creating the reference object: > > name = 'My_reference' > expr = as.matrix(expr) # the expression matrix > types = as.character(types) # a character list of the types. Samples from the same type should have the same name. > main_types = as.character(main_types) # a character list of the main types. > ref = list(name=name,data = expr, types=types, main_types=main_types) > > # if using the de method, we can predefine the variable genes > ref$de.genes = CreateVariableGeneSet(expr,types,200) > ref$de.genes.main = CreateVariableGeneSet(expr,main_types,300) > > # if using the sd method, we need to define an sd threshold > sd = matrixStats::rowSds(expr) # per-gene standard deviation > sd.thres = sort(sd, decreasing = TRUE)[4000] # or any other threshold > ref$sd.thres = sd.thres

Dvir Aran (21:13:01): > CreateVariableGeneSet was removed from the current version. Do we need to bring it back?

Aaron Lun (22:18:01): > If it’s just for the data preparation… probably not. The de.genes get created inside trainSingleR anyway, so there’s not much point having a statically saved version. The only advantage would be to save time, but in that case, a user can (in general) save even more time by saving the entire output of trainSingleR and applying it to new test data sets.

Aaron Lun (22:18:56): > We could certainly expose the existing code that is currently used inside trainSingleR for defining DE genes, and make that into a new function (effectively the same as CreateVariableGeneSet(), I would guess). But I don’t anticipate that it’ll get a lot of direct use, unless you have experience otherwise.

2019-08-11

Dvir Aran (00:14:49): > Oh ok. Doesn’t it take a long time to run this in trainSingleR?

Jared Andrews (01:55:55): > Nah, almost no time at all.

Jared Andrews (01:56:55): > Well, not no time, but <30s I think.

Jared Andrews (01:57:30): > And as Aaron said, it’s a one time expense if you want it to be.

Friederike Dündar (21:42:21): > Do I understand correctly that the Benayoun data set contains bulk RNA-seq of 4 tissues and primary cultures of neural stem cells from mice of different ages? @Dvir Aran How did you assign the cell type labels to these samples, i.e. “Adipocytes”, “Endothelial cells” etc.?

Dvir Aran (21:44:43): > No, what you are referring to is what she published in the paper. This is the reference she used in that paper, but for some reason it is not mentioned there at all. This is all from sorted cell types that she collected from GEO.

Friederike Dündar (21:51:02): > ahhh

Friederike Dündar (21:51:23): > can you give me a one-liner describing these data sets? or point me to the part in her paper that mentions this?

Aaron Lun (23:06:49): > Will have a look at your PR in a bit. Getting my ass kicked by scran’s refactoring.

2019-08-12

Dvir Aran (01:39:55) (in thread): > I don’t know what exactly she did. The samples, as you can see from the csv I sent you, are RNA-seq profiles from sorted cells of non-genetically modified and untreated mice. Read files were downloaded from GEO and ArrayExpress. I don’t know how they were aligned and processed.

Aaron Lun (02:33:34): > It builds!

Friederike Dündar (09:06:15) (in thread): > Maybe I’ll just reach out to her directly

Friederike Dündar (10:42:53) (in thread): > any objections?

Dvir Aran (17:17:14) (in thread): > No, that would be great

2019-08-13

Friederike Dündar (09:46:07): > should the vignette feature the use of getReferenceDataset or HumanPrimaryCellAtlasData()?

Friederike Dündar (09:46:56): > @Aaron Lun I assume one of your PR comments indicates to switch to the more recent function, but just wanted to make sure

Friederike Dündar (10:13:50) (in thread): > :+1:

Jared Andrews (10:38:08): > If you go back to the commit before my PR, it has something in the vignette about it (though the placeholder function name was likely different).

Aaron Lun (11:30:52): > The new functions supersede getReferenceDataset(), so just replace all instances.

Friederike Dündar (11:36:08): > will do

Friederike Dündar (11:40:20): > is there a way to link to the second vignette? (I still feel it’d be worth having a separate vignette just about the reference data sets and I’d like to point to it from the SingleR.Rmd)

Aaron Lun (11:45:09): > Yes, look at options for Biocpkg.

Aaron Lun (11:45:28): > Obviously these links will only work upon submission.

Friederike Dündar (12:05:41): > Biocpkg("SingleR", vignette = "ReferenceDataSets")?

Aaron Lun (12:05:56): > Yes, that’s right.

Friederike Dündar (12:08:21): > ok, done

Friederike Dündar (12:18:29): > so, what do we need to do to get the data onto ExpHub?

Friederike Dündar (12:19:52): > and, more importantly, is there a way to have my name spelled with the proper Umlaut (ü) in the vignette?

Aaron Lun (12:34:30): > I think you just copy and paste the character in. Don’t forget to add yourself to the DESCRIPTION as well.

Friederike Dündar (13:34:54): > cool, there must have been some update in the past 4 years, I remember Rmd’s not being able to handle Umlauts

Aaron Lun (13:39:37): > What’s with all this “indent to the opening parenthesis” business that I’m seeing everywhere?

Aaron Lun (13:40:45): > Sure, it looks nice… until you have to rename the function.

Aaron Lun (14:02:23): > And while grumpy ol’ aaron is complaining about things; text wrap is fine, but always start a new sentence on a new line.

Aaron Lun (14:02:46): > > # Good > Something some thing blah blah blah > blah blah. > More something something blah. > > # Bad. > Something some thing blah blah blah > blah blah. More something something > blah >

Aaron Lun (14:03:19): > Makes the git diffs a lot easier.

Friederike Dündar (14:15:14): > all I can say is: I try :slightly_smiling_face:

Friederike Dündar (14:18:08): > > indent to the opening parenthesis

Friederike Dündar (14:18:15): > what does that refer to? can you give an example?

Aaron Lun (14:18:41): > > blah <- SingleR(ref=asdasda > test=asdasda) >

Friederike Dündar (14:19:01): > that’s related to line break addiction

Friederike Dündar (14:19:06): > and RStudio makes it very easy

Friederike Dündar (14:19:22): > because all you do is hit return and it’ll plop you in the right spot

Aaron Lun (14:19:51): > That’s the problem, it’s no longer “right” once you change the name of blah.

Aaron Lun (14:20:08): > And frankly, it’s pretty gross. SingleR’s a short name, but consider something like makeSummarizedExperimentFromHDF5().

Friederike Dündar (14:20:38): > the longer the function name the more I’m inclined to split up the parameters on separate lines

Friederike Dündar (14:20:43): > for readability purposes

Aaron Lun (14:21:09): > That’s fine, as long as it’s not indented to the opening parenthesis.

Dan Bunis (14:21:10): > I think Aaron’s point is that he’d prefer a single 4 space tab on second lines (?)

Aaron Lun (14:21:12): > Yes.
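
A sketch of the layout being asked for here: arguments start on the line after the opening parenthesis with a fixed 4-space indent, so renaming the function or the assigned variable never forces re-indentation. The function below is a stand-in with a deliberately long, made-up name.

```r
# Hypothetical helper; only its name length matters for the illustration.
makeAnExampleWithAVeryLongFunctionName <- function(ref, test, labels) {
    list(ref = ref, test = test, labels = labels)
}

# Fixed 4-space indent: the argument lines do not depend on the length of
# 'result' or of the function name.
result <- makeAnExampleWithAVeryLongFunctionName(
    ref = "reference data",
    test = "test data",
    labels = c("A", "B")
)
```

With indent-to-parenthesis, every argument line would need to shift whenever either name changed, which is what pollutes the diffs.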

Aaron Lun (14:21:47): > I too used to indent to the opening (. But it was a pain to maintain.

Dan Bunis (14:21:49): > But RStudio does an auto tab to the parenthesis automatically when things are copied in, so it can be hard to manage

Aaron Lun (14:21:56): > Oh geez.

Aaron Lun (14:22:27): > Don’t know how @Kevin Rue-Albrecht manages it for iSEE.

Dan Bunis (14:23:32): > Looks like auto-indenting after paste can be turned off in RStudio settings.

Kevin Rue-Albrecht (14:36:54): > I set auto indent to 4 spaces, but I always leave the line with the opening bracket empty. The only exception is if all the arguments and the function name fit in the 80 character line limit. > Otherwise arguments go on the next line. This way RStudio respects the 4 space indent

Kevin Rue-Albrecht (14:38:21): > I really don’t like those functions that put some arguments on the same line as the function definition and then indent to the opening parenthesis. It makes the indent inconsistent between functions according to the length of their respective name.

Kevin Rue-Albrecht (14:44:52): > Ps: glad to see the channel put to good use since I ran out of free time to play with Hancock. And SingleR moving to Bioc is good news. That would have been my first choice to wrap in hancock.

Jared Andrews (15:34:07): > ~100k cells ran with HPCA’s broad labels done in 16 minutes - not bad! Mostly T cells/monocytes, so fine tuning was likely quicker than most instances, but still.

Aaron Lun (15:34:20): > Woah. 1 core?

Jared Andrews (15:34:23): > Yep.

Aaron Lun (15:34:27): > sweet

Aaron Lun (15:44:24): > @Friederike Dündar There’s some instructions here https://bioconductor.org/packages/devel/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html

Aaron Lun (15:44:38): > But I was thinking of getting it into the submission queue first.

Friederike Dündar (15:47:20): > BioC’s submission queue?

Aaron Lun (15:47:32): > Yes.

Friederike Dündar (15:47:39): > sure, let’s do it

Friederike Dündar (15:47:53): > ..if you think it’s ready..

Aaron Lun (17:45:43): > Think Dan has one more boxplot function to add?

Dan Bunis (17:50:27): > I do. I will prioritize that for this week since most other things seem done.

Aaron Lun (17:50:53): > :+1:

Dan Bunis (17:54:20): > There are also 2 tsne functions that we are leaving out as ‘better off elsewhere.’ We talked about suggesting DittoSeq for those… can/should we still do that even though DittoSeq is not in BioC?

Aaron Lun (17:55:30): > I don’t think that’s particularly critical ATM.

Aaron Lun (17:55:45): > IIRC, it was just plotting t-snes colored by the annotated labels, right?

Dan Bunis (17:56:22): > yup

Aaron Lun (17:56:28): > I think most people would understand that’s something that’s quite orthogonal to SingleR’s core business of getting the labels in the first place.

Aaron Lun (18:10:10): > Incidentally, whenever you’re doing package development, it is always a good idea to obsessively run R CMD check --preclean <package_dir>

Aaron Lun (19:22:30): > I have a few quick scripts for this (https://github.com/LTLA/OddsAndEnds/blob/master/bioctools/check_all.sh) which rebuild the package, check it, BiocCheck it, and reinstall it.

Jared Andrews (19:25:08): > Installs on Windows.

Jared Andrews (19:25:51): > Something in the last few updates fixed whatever compilation error I was running into when building.

Aaron Lun (19:26:58): > Oh, that was the SystemRequirements: C++11 thing.

Aaron Lun (19:27:09): > Every other system uses C++11 as its default… except for windows.

Aaron Lun (19:28:06): > Once you need to write C++ code that compiles portably, you will know the true meaning of pain.

Aaron Lun (19:44:40): > Right. The whole stack BUILDs, checks and BiocChecks. As of this point, it is ready to go.

Aaron Lun (19:50:34): > Oh, forgot about the boxplot. Okay, we’ll wait for that.

Dan Bunis (19:54:30): > I’ll start on that now.

Dan Bunis (19:55:00): > knew we were close, but didn’t realize we were THAT close. :blush:

Aaron Lun (20:05:32): > Pop quiz: why seq_len and not 1:n?

Dan Bunis (20:07:00): > for when n is 0
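
The n = 0 case can be checked directly in base R; the helper below is made up purely to count loop iterations.

```r
n <- 0

# 1:n counts DOWN when n is 0, giving c(1, 0) rather than an empty vector.
colon_idx <- 1:n
safe_idx <- seq_len(n)

# Hypothetical helper: how many times would a loop over 'idx' run?
count_iterations <- function(idx) {
    k <- 0
    for (i in idx) {
        k <- k + 1
    }
    k
}

bad <- count_iterations(colon_idx)   # runs twice on an "empty" input
good <- count_iterations(safe_idx)   # runs zero times
```

The same reasoning applies to seq_along(x) versus 1:length(x) when x may be empty.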

Aaron Lun (20:07:46): > And there’s plenty of other gems like this in the bioc coding style guide.

Aaron Lun (20:12:58): > incidentally, BiocCheck will help you pick these up, if you run R CMD BiocCheck on your package.

Friederike Dündar (21:05:25): > I also got a reply from Berenice Benayoun with more details about one of the ref data sets; I’ll try to work it in tomorrow

Dan Bunis (21:14:50): > Hmmm so I’m realizing that the boxplot visualization is something that we might not be able to produce from the new SingleR output.

Dan Bunis (21:17:06): > except if broad.labels are used for grouping after finer.labels are used for running SingleR

Dan Bunis (21:19:55): > But: that’s not how an example like the plot here was made… there are too many dots in T cells for those to have been distinctly defined celltypes (finer.labels) under the T cells broad.label. - File (PNG): image.png

Dan Bunis (21:24:32): > Unless we want to add the scores/correlations versus each individual reference cell to the SingleR/classifySingleR outputs, @Aaron Lun, I don’t think I can recreate this function.

Aaron Lun (22:29:44): > Hm.

Aaron Lun (22:29:44): > We don’t even calculate the correlations between each individual reference and each cell anymore.

Aaron Lun (22:29:45): > That’s the basis of one of the speed-ups.

Aaron Lun (22:30:55): > Is it all that important? Hard to see how you would use this for diagnostics if you have 10k input cells. Do you make a boxplot of each one?

Dan Bunis (23:32:01): > It is run on an individual cell, similar to the plotCellVsReference function I already converted, except that it compares to all the reference cells instead of just one. I believe the purpose is to check the scoring of individual cells, but I’ve always used the heatmap one for that. I’ve never been driven to use this function myself tbh, and I think it’s probably fine to leave out.

Dan Bunis (23:32:41): > @Dvir Aran is the boxplot visualization function one that you would strongly suggest we find a way to make work? In your experience, is it used often?

Aaron Lun (23:40:13): > It’s not even a matter of “making it work” with the current SingleR() output; that information just isn’t generated anymore. If we did need it, we would have to compute the correlations from scratch every time someone calls the plotting function. Which is doable, of course, but it would be nice to know whether it is useful enough to warrant that extra work (compared to, say, the heatmap plots).

Dvir Aran (23:44:25): > Probably nobody uses it… it’s just a way to understand how SingleR works.

2019-08-14

Aaron Lun (01:14:11): > TBH, I don’t find plotCellVsReference to be all that useful either. I mean… would anyone manually plot each cell against each reference? That must be, like, a million plots. The score heatmap seems like it would be the most routinely used diagnostic.

Dan Bunis (01:16:51): > The heatmap is what I’ve used mostly myself.

Dan Bunis (01:56:55): > But it can be nice to see the individual cells occasionally. Moreso even if the dots are colored by percent.captured among the reference cells? I could add that easily, in the vignette at least.

Jared Andrews (07:35:26): > Yeah, I really only use the heatmap, though pheatmap can’t cluster more than 65k elements or so apparently.

Friederike Dündar (09:09:09): > Can the legacy package generate that boxplot? We could always just refer users to the original package if they wanted to recreate every plot from the paper

Friederike Dündar (09:21:38): > off-topic: Germans replace their Umlauts for machine-readability (and crossword-puzzles) by removing the dots and adding an e after what’s now a common vowel. I.e., ü becomes ue, ä becomes ae etc. All my life I thought this was something all people with Umlauts in their alphabet did, but apparently, it’s a distinctly German habit. Anyway, my (Turkish) last name is correctly spelled Dündar.

Jared Andrews (09:22:45): > Should the README be updated?

Friederike Dündar (09:24:10): > my vote is “yes”

Friederike Dündar (09:35:42): > I’m still pondering about how to provide those tables of the different samples for the different reference data sets

Friederike Dündar (09:36:10): > I mostly want them to be linked and present so that people can browse them; I don’t think they will necessarily be needed for any scripts at this point

Friederike Dündar (09:36:21): > would figshare be an appropriate platform?

Friederike Dündar (09:36:23): > e.g. http://dx.doi.org/10.6084/m9.figshare.1416210 - Attachment (figshare): Metadata for a highly replicated two-condition yeast RNAseq experiment. > This datafile contains metadata associated with the European Nucleotide Archive (ENA) entry ERP004763 (see below for link). The ENA entry has 672 fastq files which are from a two-condition 48 biological and 7 technical replicate experiment. However, the metadata are a little unclear at ENA. This file should help disambiguate the information. The experiment is described in two pre-prints on arXiv linked below and has been published in Bioinformatics.

Friederike Dündar (09:37:21): > snapshot from the xls file related to the mouse “bulk” rna-seq ref data set – which, it turns out, is not just based on bulk, but also lots of scRNA-seq samples - File (PNG): image.png

Friederike Dündar (09:38:30): > if someone were to decide whether they needed to build their own reference data set, it might be helpful to know which data sets were used for the reference data sets (I know it is helpful for me because I don’t need to go looking up those GEO IDs for, say, adipocytes, by myself)

Jared Andrews (10:14:06): > I think that would be a good idea.

Friederike Dündar (10:59:31): > Sorting this all out will probably take me a little while; I don’t think it needs to hold up the queuing, though, do you?

Friederike Dündar (11:02:08) (in thread): > in the light of this: https://www.biostars.org/p/394302/ I would say we should definitely update the README to point the devtools::install routine to Aaron’s repo and have a look at the vignette

Friederike Dündar (11:02:40) (in thread): > Do we actually have an example for how to do it with a Seurat object?

Friederike Dündar (11:03:03) (in thread): > (we probably should)

Jared Andrews (11:08:31) (in thread): > I have example code for a Seurat object, but it’s really just converting it to SCE using Seurat’s as.SingleCellExperiment(). And then a bit of finagling to add the annotation labels as a metadata column. I also haven’t run it on the very latest version, so it could need some tweaking.

Jared Andrews (11:09:24) (in thread): > I can whip something up tonight though.

Aaron Lun (11:43:19): > You know you can pull them out with system.file(), right?
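
For reference, a minimal sketch of how system.file() locates files shipped with an installed package. It uses the base stats package so that it runs anywhere; the SingleR path in the comment is a hypothetical location, since the thread never settles where the CSV files would live.

```r
# Anything under inst/ in the package source ends up at the top level of the
# installed package, so system.file() paths omit the 'inst' prefix.
desc_path <- system.file("DESCRIPTION", package = "stats")

# A file that does not exist yields an empty string rather than an error.
missing_path <- system.file("extdata", "no_such_file.csv", package = "stats")

# Hypothetical equivalent for the sample tables discussed here:
# system.file("extdata", "hpca.csv", package = "SingleR")
```

Checking nzchar() on the result is a cheap way to detect a missing file before trying to read it.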

Aaron Lun (11:45:13) (in thread): > As long as we don’t have any code/test/doc dependency on Seurat in the package, I’m fine with that.

Aaron Lun (11:46:02) (in thread): > I don’t want any direct exposure to their bizarro world data structures.

Friederike Dündar (11:46:03): > “sorting” as in which samples are actually represented in the objects, which meta data makes sense etc.

Jared Andrews (11:46:05) (in thread): > Yeah, that’s kind of the issue.

Jared Andrews (11:46:49) (in thread): > At worst, we can just mention that it can easily be used with Seurat objects via as.SingleCellExperiment() without providing a direct example.

Aaron Lun (11:46:53) (in thread): > You can have this in the README, just not in the vignette.

Friederike Dündar (11:47:11) (in thread): > you don’t really need Seurat’s SingleCellExperiment function, you can just demonstrate how to pull out the corresponding data from the Seurat object

Aaron Lun (11:47:18) (in thread): > Make a PR

Friederike Dündar (11:47:30) (in thread): > > You can have this in the README, just not in the vignette. > that’s a great solution

Friederike Dündar (11:47:57) (in thread): > yes, still finagling

Jared Andrews (11:47:59) (in thread): > True, you can just as easily snag what you need by pulling the data slot.

Aaron Lun (11:48:21) (in thread): > In fact, it is for this reason that all of my packages (try to) follow the golden rule - anything that you can run on a SCE, you can also run on a raw matrix.

Aaron Lun (11:48:49) (in thread): > This means that you can do the same analyses (e.g., as part of some other workflow) even if you don’t want to buy into the whole BioC/SCE framework.
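
One way the "anything you can run on a SCE, you can also run on a raw matrix" rule could be sketched. Both function names are hypothetical and this is not how any of the packages mentioned actually implement it; the SummarizedExperiment branch is only reached if such an object is passed in.

```r
# Hypothetical helper: reduce either input type to a plain matrix.
get_expr_matrix <- function(x, assay.type = "logcounts") {
    if (is.matrix(x)) {
        return(x)
    }
    if (methods::is(x, "SummarizedExperiment")) {
        # A SingleCellExperiment is also a SummarizedExperiment.
        return(SummarizedExperiment::assay(x, assay.type))
    }
    stop("expected a matrix or a SummarizedExperiment")
}

# Hypothetical analysis step written against the matrix only.
top_gene_per_cell <- function(x, ...) {
    mat <- get_expr_matrix(x, ...)
    rownames(mat)[apply(mat, 2, which.max)]
}

# Works on a bare matrix, e.g. one pulled out of some other framework.
mat <- matrix(
    c(5, 1, 0, 2, 9, 3),
    nrow = 3,
    dimnames = list(c("g1", "g2", "g3"), c("c1", "c2"))
)
top <- top_gene_per_cell(mat)
```

Keeping the container handling in one small helper means the analysis code itself never depends on a particular object class.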

Aaron Lun (11:50:24): > Can’t you stuff this into the colData? Just have a field indicating the experiment of origin for each column.
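
A rough sketch of the per-column origin field, using a base data.frame as a stand-in for colData so that it runs without Bioconductor. All sample names, labels and study identifiers are placeholders.

```r
# Stand-in for colData(ref): one row per reference sample.
sample_info <- data.frame(
    label = c("T cells", "T cells", "B cells", "Monocytes"),
    origin = c("study_A", "study_B", "study_A", "study_C"),
    row.names = c("sample1", "sample2", "sample3", "sample4"),
    stringsAsFactors = FALSE
)

# Number of reference samples contributed by each experiment.
per_origin <- table(sample_info$origin)

# Which experiments supplied the samples for a given label?
t_cell_sources <- unique(sample_info$origin[sample_info$label == "T cells"])
```

The same table could be written out with write.csv() for anyone who prefers to browse it outside R.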

Friederike Dündar (11:51:36): > eventually, sure, that’s a good idea

Friederike Dündar (11:52:25): > but browsing colData within R is slightly less intuitive than just seeing an entire table

Friederike Dündar (11:52:35): > at least for mere mortals such as myself

Aaron Lun (11:53:06): > Perhaps you could nail down the proposed use case for this extra stuff, I’m having trouble visualizing it.

Friederike Dündar (11:53:50): > if I were trying to piece together my own set of reference data that I may want to download and process to get a suitable training set, I would start by copy and pasting GEO accession numbers into a spreadsheet

Friederike Dündar (11:54:06): > with some details about why I’d be interested in these particular samples

Friederike Dündar (11:54:34): > If I had a starting point such as the table snapshotted above, that’d make my life easier

Friederike Dündar (11:55:00): > in fact, I will probably just do it in the near future and whatever I find useful, I will try to supply, but it won’t have to be within the SingleR package itself

Friederike Dündar (11:56:35) (in thread): > we could just have pseudo-code in the vignette, i.e. “to extract the normalized expression values out of the Seurat object, do this, then run SingleR and then do that to add it back into the Seurat object”

Friederike Dündar (11:56:43) (in thread): > but the README is a great place for that, too

Jared Andrews (11:59:02) (in thread): > I’ll add something tonight. Likely just a note in the vignette and a short example in the README. Along with updating the README itself.

Aaron Lun (12:05:43): > Anyway, @Dan Bunis and @Dvir Aran, I think our consensus was to not worry about the boxplots for the time being?

Dvir Aran (12:11:25) (in thread): > Yes, we can remove it

Aaron Lun (12:12:01) (in thread): > Sweet, that makes life easier.

Dvir Aran (12:16:08): > @Dan Bunis can you share a link to your vignette?

Dan Bunis (12:21:44): > I started but never finished making a separate one. I think the main one within SingleR suffices.

Dan Bunis (12:24:31): > I stopped when I found the current vignette and became unsure what the goal of a separate vignette was ?

Dan Bunis (12:51:24): > But coming back, separate one can show full workflow with annotation of all cells in a whole blood sample, and demonstration of that being successful by utilizing outside packages (namely my DittoSeq) to show CD3 expression in T cells, CD19 in B cells, etc.

Dan Bunis (12:57:00) (in thread): > I’d made a private git repo (dtm2451/SingleR-Vignette) where I have the in progress version. I added you as a collaborator, so I think you should be able to see it?

Aaron Lun (14:33:24): > Lunch, and then let’s finish the fight.

Dan Bunis (14:47:09) (in thread): > Agreed. My added thought: We can add it back (with calculations made inside the boxplot-function) in the future if it’s requested by users.

Aaron Lun (15:43:31): > @Friederike Dündar Why is colnames(nrmcnts) <- paste(colnames(nrmcnts), seq_len(ncol(nrmcnts)), sep = ".") necessary?

Friederike Dündar (16:42:11): > strictly speaking, it’s only necessary for the ENCODE/blueprint data

Friederike Dündar (16:42:19): > because that contains non-unique colnames

Aaron Lun (16:43:05): > I don’t think non-unique colnames causes any problems, does it?

Friederike Dündar (16:43:35): > they do

Aaron Lun (16:43:50): > Where? In the SE constructor?

Friederike Dündar (16:43:50): > the coldata construction function complained

Aaron Lun (16:43:58): > Hm. That shouldn’t have happened.

Friederike Dündar (16:44:02): > yes, because there, they are becoming row.names

Friederike Dündar (16:44:17): > the colnames of the expr matrix become rownames of colData

Friederike Dündar (16:44:39): > unless there’s an automated uniquifying process under the hood, I thought it was a legitimate complaint

Aaron Lun (16:45:04): > That’s what I mean, DataFrames shouldn’t care if row names are not unique.

Aaron Lun (16:45:17): > Need to poke around in the SE machinery.

Friederike Dündar (16:45:26): > well, I’m glad it does

Friederike Dündar (16:45:33): > I’d rather have unique colnames/rownames

Dan Bunis (16:49:57): > Mini-additional reason for uniquefying: I made it so either index #s or cell/sample names can be used to call on a particular reference cell/sample in plotCellVsReference. It’s not terribly important, but it would be nice if all included datasets played well with that.

Aaron Lun (16:53:10): > I mean, that’s fine and all, but we shouldn’t just stick column numbers on like that.

Friederike Dündar (16:55:45): > I felt that’s what had basically been done before, so I just added on to it. But sure, those colnames do not win any beauty prizes

Friederike Dündar (16:56:31): > if you have a more ingenious way of distinguishing two samples named “macrophages” I’ll be happy to accommodate that (probably)

Aaron Lun (16:56:39): > > old.names <- sample(LETTERS, 40, replace=TRUE) > > new.names <- character(length(old.names)) > indices <- split(seq_along(old.names), old.names) > for (j in names(indices)) { > idx <- indices[[j]] > if (length(idx) > 1L) { > new.names[indices[[j]]] <- sprintf("%s (%i)", j, seq_along(idx)) > } else { > new.names[indices[[j]]] <- j > } > } > new.names > > This will only add uniquifying identifiers to the names that need them.

Aaron Lun (16:57:16): > You can change exactly how the uniquifying number is added, but that’s the general idea.

Friederike Dündar (16:58:20): > sure, I can add that as a helper function

Steve Lianoglou (16:59:11): > howdy – coming in from left, but why not just use make.names(x, unique = TRUE)?

Aaron Lun (16:59:34): > That forces things to be syntactically valid, which may not be necessary.

Aaron Lun (16:59:48): > E.g., epithelial cells becomes epithelial.cells
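
The trade-off discussed above can be sketched in base R. The helper name `uniquify_names` is hypothetical (it is not part of SingleR); it simply wraps the loop posted earlier so it can be compared with `make.names()` and `make.unique()` on a made-up set of sample names.

```r
# Hypothetical helper (not part of SingleR): append a counter only to
# names that occur more than once, leaving unique names untouched.
uniquify_names <- function(old.names) {
    new.names <- as.character(old.names)
    indices <- split(seq_along(old.names), old.names)
    for (j in names(indices)) {
        idx <- indices[[j]]
        if (length(idx) > 1L) {
            new.names[idx] <- sprintf("%s (%i)", j, seq_along(idx))
        }
    }
    new.names
}

# Invented sample names for illustration.
x <- c("macrophages", "epithelial cells", "macrophages")

uniquify_names(x)
# "macrophages (1)" "epithelial cells" "macrophages (2)"

# make.names() also rewrites characters that are not syntactically valid.
make.names(x, unique = TRUE)
# "macrophages" "epithelial.cells" "macrophages.1"

# make.unique() keeps the original characters, but the first duplicate
# is left without a suffix.
make.unique(x)
# "macrophages" "epithelial cells" "macrophages.1"
```

Which of the three is preferable depends on whether downstream code needs syntactically valid names; the loop-based version is the only one that numbers every member of a duplicated group.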

Steve Lianoglou (16:59:48): > also: thanks for working on SingleR – it’s been my preferred way to do celltype classification for some time, so happy to see it get some bioc love

Friederike Dündar (16:59:53): > yeah, I thought about that originally, can’t remember why I dropped it, probably because it did throw an error

Friederike Dündar (17:01:04): > I’ll address it with the other PR comments, probably tomorrow

Friederike Dündar (17:25:10): > @Steve Lianoglou out of curiosity: what’s your preferred way of using it? With a training set of your own or with the reference data supplied by Dvir?

Aaron Lun (18:14:37): > If there’s nothing else, I’m going to start the submission process.

Aaron Lun (18:18:12): > Oh, gotta check I’m not treading on any IP-related toes. Back in a bit.

Aaron Lun (18:20:05): > Right, let’s do it.

Aaron Lun (18:23:17): > Here’s a step by step account of the submission process.

Aaron Lun (18:29:29): > First, check that it builds, checks and BiocChecks locally.

Aaron Lun (18:30:12) (in thread): > In fact, there is actually only one duplicate in the Blueprint data, probably because a unique identifier was not correctly added.

Aaron Lun (18:30:29) (in thread): > It may be more effective to just change it in the data that we upload to EHub, rather than changing client-side every time.

Aaron Lun (18:47:50): > Second, follow the instructions at https://github.com/Bioconductor/Contributions/issues/new

Aaron Lun (18:48:26): > If you are the maintainer, you need to be signed up to the BioC support site and the BioC-devel mailing list.

Aaron Lun (18:49:11): > Incidentally, I have made myself the maintainer, until someone demonstrates that they understand enough C++ to fix the build errors.

Aaron Lun (18:49:25): > Third, submit the issue.

Aaron Lun (18:49:31): > https://github.com/Bioconductor/Contributions/issues/1206

Aaron Lun (18:50:24): > Fourth, set up a webhook so that any commits that we have at https://github.com/LTLA/SingleR will automatically trigger a new build on the BioC Single Package Builder.

Aaron Lun (18:50:46): > Follow instructions here: https://github.com/Bioconductor/Contributions#adding-a-web-hook

Aaron Lun (18:52:22): > Everyone should subscribe to that issue if you want to keep track of the progress of the submission.

Aaron Lun (18:52:42): > Note that, from now on, a version bump is required to trigger a rebuild in any Bioconductor system.

Friederike Dündar (22:50:09) (in thread): > yes, there’s one duplicated set of cells in the Blueprint/ENCODE data > thanks for adding the colname modifier

2019-08-15

Aaron Lun (00:29:02): > @Jared Andrews Are you getting CHECK failures on Windows?

Jared Andrews (00:30:06): > I have not tried recently. Give me a few and I can give it a try. About to submit a pull request for the updated README and vignette, so will check before I do that.

Aaron Lun (00:44:19): > If people haven’t already, subscribe to https://github.com/Bioconductor/Contributions/issues/1206

Aaron Lun (00:54:30): > And a reward for whoever’s been paying attention: https://github.com/LTLA/MoeDCPLogo

Jared Andrews (00:56:31): > Never really “got” anime.

Jared Andrews (00:57:02): > Currently checking, had to skip vignette building because I kept getting random errors with figures and didn’t feel like futzing with it.

Jared Andrews (00:57:26): > Seemingly built fine, running examples now.

Jared Andrews (01:03:20): > Yeah, examples and tests passed fine. 3 warnings, all related to vignette.

Jared Andrews (01:03:31): > > -- R CMD check results --------------------------------------------------------------------------------------------- SingleR 0.99.0 ---- > Duration: 9m 15.4s > > > checking files in 'vignettes' ... WARNING > Files in the 'vignettes' directory but no files in 'inst/doc': > 'SingleR.Rmd', 'datasets.Rmd', 'ref.bib' > > > checking package vignettes in 'inst/doc' ... WARNING > dir.exists(dir) is not TRUEdir.exists(dir) is not TRUE > Package vignettes without corresponding single PDF/HTML: > 'SingleR.Rmd' > 'datasets.Rmd' > > > checking re-building of vignette outputs ... WARNING > Error(s) in re-building vignettes: > --- re-building 'SingleR.Rmd' using rmarkdown >

Jared Andrews (01:04:07): > > ... > > Quitting from lines 51-54 (SingleR.Rmd) > Error: processing vignette 'SingleR.Rmd' failed with diagnostics: > 'i' must be length 1 >

Jared Andrews (01:11:23): > Looks like a moot point anyway - nice job on the fix.

Aaron Lun (01:19:28): > Oh yeah, all green.

Aaron Lun (01:20:45): > Fix Fred’s umlaut, and note the use of Biocpkg("SingleR").

Aaron Lun (01:39:30): > As in to say, no need for the e after the funny u.

Aaron Lun (02:36:22): > I’m curious why everyone keeps deleting their forks.

Dan Bunis (02:41:24): > I do it as a fast way to merge through the internet interface until I have another change to submit.

Dan Bunis (02:42:19): > Though I’m confident there’s a way I could be more efficient through command line git?

Aaron Lun (02:50:05): > Yes. It would look something like: > > # Done once. > git remote add upstream https://github.com/LTLA/SingleR > # Then, for merging with the canonical: > git fetch --all > git merge upstream/master >

Kevin Rue-Albrecht (05:15:00): > FWIW, to keep things even cleaner, I usually add “prune” > > git fetch --all --prune > > That removes the local remote-tracking branches for which the remote branch was deleted. A typical situation where this applies is after a side branch is merged to master, that side branch is deleted from GitHub, and as such there is no point keeping a local copy either.

Jared Andrews (10:52:10): > Dunno really, just force of habit. It’s relatively rare that I continually submit pull requests to a repo, so I just delete them to keep github from being a mess.

Friederike Dündar (10:53:43) (in thread): > thanks Ronny!

Aaron Lun (11:49:28): > Heads up, I will talk about this in 10 minutes at the Bioc dev forum.

Steve Lianoglou (12:36:38) (in thread): > I’ve been using the reference data that’s been included so far. On my todo list is to generate a new reference dataset from the Zeisel 2018 mouse brain architecture scRNAseq dataset, to see how well that might work. > > I may come back here to get some tips re: how best to pseudo-bulk scRNAseq data for that purpose, though :wink: Happy to contribute it back to the package as data or vignette if/when appropriate.

Friederike Dündar (13:14:44) (in thread): > @Aaron Lun recently summed up some pseudo-bulk considerations in this issue: https://github.com/LTLA/SingleR/issues/3#issuecomment-519391898

Steve Lianoglou (14:49:49) (in thread): > groovy, thanks for the heads up!

Aaron Lun (15:45:43): > That went well.

Jared Andrews (15:55:51): > No party parrot emoji in this slack is a shame.

Aaron Lun (15:56:33): > :party_parrot:

Dan Bunis (15:59:18): > Do you mean the dev forum? Because yes it did! And thanks for the shout out :smiley:

Friederike Dündar (17:09:29): > alright, so, consensus is that I should rename normcounts to logcounts in the ref data?

Jared Andrews (17:10:45): > Wait, there is party parrot? Emoji search function is a lie.

Jared Andrews (17:10:51): > And yes, I think so.

Dan Bunis (17:11:30): > I agree too. Those datasets would automatically work with the defaults then, which would be nice.

Friederike Dündar (17:11:58): > I’ll add it to my recent pull request once I emerge from the subway

Aaron Lun (17:12:12): > Makes sense to me. And MouseRNASeqData seems like the best of a bad bunch.

Friederike Dündar (17:12:21): > you got it

Friederike Dündar (17:13:02): > what was the response during the bioc devel call?

Friederike Dündar (17:32:10): > alright, pushed latest changes to standing PR; gotta be offline for a couple of hours now though

2019-08-16

Aaron Lun (00:35:03): > Stuff is on EHub, and metadata files are compiled. Had to fix a few things but it was mostly error free.

Jared Andrews (00:44:08): > ~100k cells from rare T cell lymphomas, with annotations that are pretty dang believable for the most part. - File (PNG): image.png

Jared Andrews (00:44:36): > Done using method = "cluster", of course.

Aaron Lun (00:46:47): > Hold on, if it’s method="cluster", that’s just doing it separately on each cluster, so the number of profiles that are getting searched is a lot less than 100k.

Jared Andrews (00:49:23): > Ya ya

Jared Andrews (00:49:46): > I been staring at these plots all day man

Jared Andrews (00:50:06): - File (PNG): image.png

Jared Andrews (00:50:26): > There, per-cell annotations.

Aaron Lun (00:50:48): > So this was 16 minutes?

Jared Andrews (00:59:57): > Nah, this was a different run that ran it like 20 times on different clustering resolutions and all. I didn’t time the individual runs. And it would have had access to 4 cores. Running both the HPCA and the Encode/Blueprint refs with method="single" and method="cluster" with 4 different sets of clusters and both broad/fine labels, so it was uhh, yeah, 20 times. And script took just over an hour, including reading/writing a very bloated SCE object.

Aaron Lun (01:00:45): > The number of cores wouldn’t matter unless you set BPPARAM.

Aaron Lun (01:01:39): > So… that’s about 5 minutes each average per run?

Aaron Lun (01:01:58): > Could be worth putting up some timings in the README.

Aaron Lun (01:02:28): > No rush, just collect the stats when you’re running things.

Jared Andrews (01:09:45): > Yeah, I’m also not super clear headed at the moment, so take that with a grain of salt. I will do a proper timing when I have a chance.

Dan Bunis (02:14:46): > This is beautiful. Also not super clear headed myself rn perhaps, but I recognize DittoSeq when I see it!!! The colors at least… damn cuz the 8 good colors for color blindness are semi useless when clusters are not all together / when there are way more than 8 labels. But… it still looks pretty at least =)

Aaron Lun (02:21:50): > I assume you have some option to overlay the label text in the plot itself.

Dan Bunis (02:30:06): > Yup! 3 ways: labels at the cluster/level medians (‘do.label = TRUE’) with or without a white background (‘highlight.labels = TRUE/FALSE’) or letters overlaid on top of the individual dots to try and make extra groupings still visible at single-cell res (‘do.letter = TRUE, or default NA with >8 discrete values’).

Dan Bunis (02:36:59): > I normally turn lettering off too as Jared seems to do as well, even though it is meant to help with color confusion, so I may adjust the default do.letter functionality in future versions… the lettering idea came from SingleR, but as Dvir has previously commented to me, it ends up less useful, due to being hard to read, when cell # is high (aka typical of modern experiments!)

Aaron Lun (02:43:06): > I’d like something like in Figure 2A here: https://www.ncbi.nlm.nih.gov/pubmed/28504682 - Attachment (ncbi.nlm.nih.gov): Testing for differential abundance in mass cytometry data. - PubMed - NCBI > Nat Methods. 2017 Jul;14(7):707-709. doi: 10.1038/nmeth.4295. Epub 2017 May 15.

Aaron Lun (02:43:14): > Yes, I drew those manually.

Aaron Lun (02:43:16): > Yes, it sucked.

Friederike Dündar (09:00:05) (in thread): > I have a color scheme that works reasonably well (and isn’t hideous) for up to 12-14 clusters (I am deuteranomalous)

Friederike Dündar (09:10:22) (in thread): > [1] “[color content]” “limegreen” “grey30” “[color content]” “[color content]” “[color content]” “[color content]” “[color content]” “grey83” “[color content]” “[color content]” “[color content]” > [13] “[color content]” “[color content]”

Friederike Dündar (09:10:42) (in thread): > wow, that’s awesome! didn’t know that slack showed the actual colors!:sunglasses:

Friederike Dündar (09:12:14) (in thread): > it depends a bit on luck where each color ends up on the plot, i.e.[color content]right next to[color content]is going to be difficult, so is[color content]next to[color content], but if they’re well separated, it usually works fine

Friederike Dündar (09:14:45) (in thread): > A huge factor is the size of the color dot in the legend, the bigger, the better

Friederike Dündar (09:17:32): > maybe the xkcd package could have helped with that

Friederike Dündar (09:17:32): > http://xkcd.r-forge.r-project.org/

Friederike Dündar (09:17:45): > never tried it, but it seems to be big on graph annotation

Friederike Dündar (09:20:13): > so what did you all think about Martin Morgan’s comment on the name of the package? I have to say, I agree with the general sentiment, but I can also see how Dvir probably had his reasons to pick that name

Jared Andrews (09:28:46): > I also think changing it may make it harder to find for people who use Dvir’s original version, which was fairly popular, it seems. Unless he plans to slap a big link on his github page and archive the repo, I think it might be best to leave it be.

Jared Andrews (09:31:29): > Also yeah, the lettering looks pretty hilarious with lots of cells. DittoSeq’s color scheme works pretty well up to about 15 colors, which is a lot better than most package’s defaults.

Dvir Aran (10:17:37): > SingleR is a brand name now :) like Seurat or scater… its supposed to be Single(cell) R(ecognition). SingleR is also like Tinder for matching single-cells (that one didn’t catch).

Dvir Aran (10:18:44): > The first name for this package, back in 2016, was scID, but that sounds like the bubble boy…

Dvir Aran (10:22:32): > I looked at Martin’s packages - Elbo, Tristly, lubridate, swift, socketeer… can you guess from the name what those packages are about? Maybe the field they are in, but not much more than that.

Dan Bunis (11:11:05) (in thread): > yes yes! what ends up next to each other is hugely important, and the legend size. I’m deuteranomalous as well, and I worked both of those into DittoSeq:smiley:

Dan Bunis (11:14:41) (in thread): > colors input allows easy reordering of how the colors in the color.panel are used (without forcing you to update the panel itself)

Dan Bunis (11:16:45) (in thread): > and legend.size (which I might rename to something like color.legend.size) allows adjustment of how big the dots are in the legend and it’s large by default

Dan Bunis (11:23:02) (in thread): > Also I like your colors a lot!

Aaron Lun (11:33:37): > The real issue is that the capital “R” at the end is commonly thought of as the language. So people seeing the package name mentally remove the “R”, which leaves us with “Single”. This is particularly uninformative as it isn’t an acronym for anything.

Aaron Lun (12:23:45): > Mind you, I’d rather not change either. But that’s something to keep in mind when you’re naming packages.

Aaron Lun (12:24:34): > For example, if I were to name this package fresh, I would probably call it “scarf” - Single Cell Annotation with a Reference.

Aaron Lun (12:25:11): > (tm aaron lun 2019).

Kevin Rue-Albrecht (12:27:08): > Well, if it helps, when I toyed with this concept, I wrote one called cloner, for “CLassification ON Expression Reference”. > It lived and died on GitHub, so feel free to reuse

Kevin Rue-Albrecht (12:27:46): > That said, the name may be a bit misleading, considering T cell clones

Friederike Dündar (12:50:09): > > Tinder for matching single-cells

Friederike Dündar (12:50:19): > I like that analogy, at least it’s a good explanation

Kevin Rue-Albrecht (12:52:17): > “scarface” - Single Cell Annotation with a ReFerence Annotated by Celltype Experts. (or more seriously, Expression)

Aaron Lun (12:52:58): > Well, actually,@Rob Amezquitaand I were thinking of making a TindR app. It would let you connect with people based on their R skill. So you could walk on the street and if there was another R user nearby, you could swipe right and talk to them. About R.

Aaron Lun (12:53:17): > That would have made us TENS OF DOLLARS.

Kevin Rue-Albrecht (12:54:24): > Now my weekend’s going to be about checking out how many Shiny apps include a chat interface

Kevin Rue-Albrecht (12:55:06): > damn: https://shiny.rstudio.com/gallery/chat-room.html

Friederike Dündar (12:55:07): > R users would demand to have the code on github

Friederike Dündar (12:55:08): > for free

Aaron Lun (13:04:37): > I was thinking advertising.

Friederike Dündar (13:11:36): > ah, of course.

Aaron Lun (13:12:28): > because y’know all those companies just desperate to get their product out to those R users.

Aaron Lun (13:12:31): > like rstudio

Aaron Lun (13:12:36): > and

Friederike Dündar (13:21:13): > yeah, was trying to come up with an indispensable item that R users would want, but I guess you could get some sympathy donations from local meetings/symposia

Friederike Dündar (13:22:49): > Snakemake would be the name of the same app but for python users, I guess

Friederike Dündar (13:29:34): > btw, I like this website in terms of information content about the data sets they’re providing: https://hemberg-lab.github.io/scRNA.seq.datasets/human/pancreas/ This is somewhat along the lines of what I’d envision for the reference data sets of SingleR

Aaron Lun (13:29:55): > see the scRNAseq package, we already dumped a whole pile in there.

Aaron Lun (13:30:05): > That’s what the vignette uses actually.

Friederike Dündar (13:30:34): > the scRNAseq package has a website like that?

Aaron Lun (13:30:48): > The vignette has a quick summary of everything.

Aaron Lun (13:31:07): > Could do with some clean-up, but it could be a one-stop-shop for pulling out info.

Aaron Lun (13:31:16): > You could even make a function that creates a table of this information.

Friederike Dündar (13:31:39): > do you mean this: https://bioconductor.org/packages/release/data/experiment/vignettes/scRNAseq/inst/doc/scRNAseq.html

Aaron Lun (13:32:10): > devel

Friederike Dündar (13:32:55): > gotcha

Martin Morgan (13:32:55) (in thread): > Meant the suggestion to be constructive, and didn’t know the back-story of the name, sorry if it came across the wrong way. FWIW tristyly is a (classical) genetic polymorphism; lubridate is a clone of someone else’s (Hadley Wickham, conjunction of ‘lubricate’ and ‘date’ to indicate making it easy to work with dates); swift is the name of an object store; socketeer implements socket connections. Elbo was some running humor in a workshop I was leading, and might have been funny at the time (the shiny app is definitely fun – stick figures!)

Aaron Lun (13:40:38): > I would prefer an R function that creates this table, and then you could do some fancy JS stuff in the vignette to make the table interactive.

Aaron Lun (13:40:55): > e.g., sort by number of cells, restrict to mouse, and so on.

Dvir Aran (14:32:38) (in thread): > Sorry for being over defensive. SingleR is like my baby. Now with this new package I see it growing up and leaving the nest… I’m happy for her, but its not easy to give up control, and changing the name is a step too much… :)

Friederike Dündar (14:41:53): > that’d be nice.

Aaron Lun (14:42:25): > Sounds like a volunteer.

Kevin Rue-Albrecht (14:43:14): > Sounds like something that pkgdown could help developing

Friederike Dündar (14:43:39): > meanwhile – what’s the consensus on how to optimize the training data set?

Friederike Dündar (14:43:53): > I have this publication: https://science.sciencemag.org/content/suppl/2017/12/06/358.6368.1318.DC1

Friederike Dündar (14:44:14): > I can only find a matrix of TPMs per single cell

Friederike Dündar (14:44:38): > which is fine, no need to realign etc.

Friederike Dündar (14:45:00): > would you use the cell labels and matrix as is, i.e. at the single cell level

Friederike Dündar (14:45:18): > or should I aggregate them somehow

Kevin Rue-Albrecht (14:46:32): > the code is ugly (i.e. not Bioc-style), but the metacell concept is nice (https://bitbucket.org/tanaylab/metacell/src/default/) to aggregate “minibulks” of similar cells

Friederike Dündar (14:47:04): > that’s kind of the theme of the Tanay group…

Friederike Dündar (14:47:35): > well, I already have the cell labels

Friederike Dündar (14:48:17): > I know which cell is supposed to represent an “early-born neuron” and a “late-born neuron” etc.

Friederike Dündar (14:48:34): > it’s mostly about what type of data will work best with the training function

Friederike Dündar (14:49:02): > since the TPMs aren’t log-transformed, I’d also have to go down the rabbithole of whether I’ll just add 1 or try something more sophisticated

Friederike Dündar (14:49:16): > (btw, how does scater’s logNormalize function do it these days?)

Friederike Dündar (15:15:11): > and, just to meddle with the conversation threads even more: what’s your take on this - https://github.com/dviraran/SingleR/issues/105

Jared Andrews (15:28:02): > scater’s logNormalize uses a pseudocount of 1.
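
A minimal sketch of the pseudocount-of-1 transformation mentioned here, applied to a genes-by-cells TPM matrix. The toy matrix `tpm` is invented for illustration, and the base-2 logarithm is an assumption; the thread only specifies the pseudocount.

```r
# Toy TPM matrix (genes x cells); values are invented for illustration.
tpm <- matrix(c(0, 3, 15, 1, 0, 7), nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("cell1", "cell2")))

# Log-transform with a pseudocount of 1 so that zeros stay at zero.
# Base 2 is an assumption here, not something stated in the thread.
logtpm <- log2(tpm + 1)
logtpm
```

A larger pseudocount shrinks differences between low-abundance genes more strongly, which is the "something more sophisticated" rabbit hole mentioned above.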

Steve Lianoglou (15:45:07) (in thread): > Looks like you folks have nice documentation for the new datasets, the simplest thing would be to just put the information in there. > > You could also just add these as more colData columns as you suggest, or: > > Another thing to consider is since all of the datasets are provided as a (Summarized|SingleCell)Experiment, you can stuff a data.frame in the metadata() list, which has name, label, description columns for each of the unique cell labels/abbreviations found in the colData

Jared Andrews (15:46:37): > I also may make a few other reference datasets over the weekend for my own use. If they prove useful, I may stick them in EHub as well.

Friederike Dündar (15:56:56) (in thread): > what’s the difference between colData and metadata?

Friederike Dündar (15:57:30): > based on single-cell or bulk?

Steve Lianoglou (15:59:11) (in thread): > metadata is just a list that you can stuff anything into and have it travel along with the SummarizedExperiment

Steve Lianoglou (15:59:37) (in thread): > so, for example, if you have several reference profiles in your dataset that are named/label “Teff”

Friederike Dündar (16:00:00) (in thread): > gotcha

Jared Andrews (16:00:26): > Purified bulk, most likely. It will be pretty specific for immune cells though. Even more-so than the HPCA/Blueprint sets.

Steve Lianoglou (16:00:32) (in thread): > you only have to describe “teff” in one row of a larger “cellinfo” data.frame (or whatever) that you can extract from the SE
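
The idea of describing each label once can be sketched with plain base R objects. Here a data.frame stands in for the `colData` and a list stands in for the `metadata()` list; the `cellinfo` name comes from the message above, while the sample names, labels and descriptions are invented for illustration.

```r
# Stand-in for colData: one row per reference sample (contents invented).
coldata <- data.frame(
    sample = c("s1", "s2", "s3"),
    label = c("Teff", "Teff", "Bnaive"),
    stringsAsFactors = FALSE)

# Stand-in for the metadata() list: one row per unique label.
meta <- list(cellinfo = data.frame(
    label = c("Teff", "Bnaive"),
    name = c("Effector T cell", "Naive B cell"),
    description = c("described once, shared by every Teff sample",
                    "described once, shared by every Bnaive sample"),
    stringsAsFactors = FALSE))

# Look up the full name for each sample without duplicating it per row.
idx <- match(coldata$label, meta$cellinfo$label)
data.frame(coldata, name = meta$cellinfo$name[idx])
```

With an actual SummarizedExperiment `se`, the same table would presumably be stored with something like `metadata(se)$cellinfo <- ...`, so that it travels along with the object.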

Aaron Lun (16:01:06): > FYI, purely single-cell data sets should go into scRNAseq.

Steve Lianoglou (16:01:19) (in thread): > maybe more trouble than it’s worth, just an option

Jared Andrews (16:01:19): > Specifically, the old DMAP database and maybe this study: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107011 - Attachment (ncbi.nlm.nih.gov): GEO Accession viewer > NCBI’s Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.

Friederike Dündar (16:01:31) (in thread): > yes, that seems cleaner. This would have to be done at the level of inst/scripts/make-data.Rmd, I guess

Friederike Dündar (16:03:28): > > purely single-cell data sets should go into scRNAseq > I can see that point in principle, but if the consensus were that pseudo-bulk is what we’re after, how would that impact the verdict?

Aaron Lun (16:03:50): > No, that’s fine. I’m just mentioning that if you did have single-cell data, it would go into scRNAseq.

Friederike Dündar (16:06:05) (in thread): > I like the metadata idea, since, presumably, there will be additional details asked by more users, which may or may not be specific to the individual reference data sets

Friederike Dündar (16:06:37) (in thread): > that way the colData is consistent across all ref samples while the metadata list contains whatever else we could unearth

Friederike Dündar (16:06:52) (in thread): > your take @Aaron Lun?

Aaron Lun (16:08:20) (in thread): > yes, the metadata would have been the way to go.

Aaron Lun (16:08:37) (in thread): > And yes, this would need to be done in the Rmd files.

Aaron Lun (16:08:44) (in thread): > But that’s okay, we can just modify them.

Friederike Dündar (16:09:17) (in thread): > sure, I’ll put it on my to-do list

Friederike Dündar (16:10:16) (in thread): > but I won’t touch this in the next days

Aaron Lun (16:16:32): > FYI scran now has a getTopMarkers() function which should replace that weird loop in the vignette.

Aaron Lun (16:16:37): > Just need to wait for it to build.

Friederike Dündar (16:36:51): > does SingleR ever not assign a label to a given cell?

Aaron Lun (16:37:00): > No.

Aaron Lun (16:37:32): > Well, previously, you could ask it for a p-value thing, and then it would NA any labels with large p-values.

Friederike Dündar (16:37:55): > I.e. let’s say Jared was using his immune cell training set on a kidney sample – all those kidney cells would then be labelled as immune cells?

Friederike Dündar (16:38:08): > right, I remember seeing NA’s

Friederike Dündar (16:38:18): > so that’s no longer the case?

Aaron Lun (16:38:20): > Well, they wouldn’t be NAs, dvir had them as “X”.

Friederike Dündar (16:38:34): > mh, ok, then I don’t know what I remember

Aaron Lun (16:38:36): > That wouldn’t help in the case you described anyway.

Aaron Lun (16:39:05): > The p-values were based on how “outlier” the best score was, compared to the rest.

Aaron Lun (16:39:22): > However, if the entire reference is a mismatch, then all the scores are going to be crap.

Friederike Dündar (16:39:26): > so in the absence of any immune cells…

Aaron Lun (16:39:55): > But as long as one score is noticeably less crap than the others, it would still be considered to be a good match.

Friederike Dündar (16:40:20): > yeah, got it. do you have a suggestion for how to judge the confidence of a label (assuming there’s just a slight mismatch, i.e. there’s kidney and immune cells in the test data but the training set has only immune cells)

Aaron Lun (16:41:42): > I don’t have anything that would reliably distinguish between no-real-matches and bad-real-matches.

Aaron Lun (16:42:17): > Especially if the “badness” of the match is variable across cell types. e.g., your immune cells are more variable across donors, so they get lower scores but are still assigned correctly.

Friederike Dündar (16:43:19): > so, maybe it’s best to start with a fairly broad reference data set that should cover the majority of one’s cell types

Aaron Lun (16:44:02): > Yes. That’s probably true of all classification algorithms, to one extent or another.

Aaron Lun (16:44:42): > There’s probably some diagnostics that you can do within your dataset, e.g., compute the variance within all cells assigned a particular label.

Aaron Lun (16:45:06): > If this is high, it suggests that you’ve got multiple true cell types assigned the same labels.

Aaron Lun (16:45:28): > It won’t help when you have a homogeneous population that gets entirely misassigned to some other label, but then you’re just stuffed in that case.
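The within-label variance diagnostic described in the last few messages could be sketched roughly as follows in base R. This is only an illustration on simulated data: `label_dispersion` is a hypothetical helper (not part of SingleR), and summarising each label by its mean per-gene variance is an assumption, since the discussion does not say how the variance should be computed.

```r
# Hypothetical helper: mean per-gene variance among cells sharing a label.
# 'expr' is a genes x cells matrix of log-expression values.
label_dispersion <- function(expr, labels) {
  vapply(split(seq_len(ncol(expr)), labels), function(idx) {
    if (length(idx) < 2L) return(NA_real_)
    mean(apply(expr[, idx, drop = FALSE], 1, var))
  }, numeric(1))
}

set.seed(1)
n_genes <- 50
# Label "A" is one homogeneous population; label "B" hides two populations
# whose means differ by 3 in every gene (simulated, not real data).
a  <- matrix(rnorm(n_genes * 40), n_genes)
b1 <- matrix(rnorm(n_genes * 20), n_genes)
b2 <- matrix(rnorm(n_genes * 20, mean = 3), n_genes)
expr <- cbind(a, b1, b2)
labels <- rep(c("A", "B"), each = 40)

disp <- label_dispersion(expr, labels)
print(round(disp, 2))
```

A label whose dispersion is much larger than the others is a candidate for hiding several true cell types. As noted above, this cannot catch a homogeneous population that is misassigned as a whole.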

Aaron Lun (16:56:07): > A simple approach would be to examine the distribution of max scores across all cells assigned with a single label, identify low outliers and ignore them.

Aaron Lun (16:59:22): > Dvir’s p-value would have removed cells that fail to achieve a high score relative to other labels, so it’s an orthogonal check.

Aaron Lun (17:03:19): > I guess it wouldn’t hurt to add an extra function to do these things, though it’s all fairly arbitrary. I suppose we could ask for “delta correlation > 0.05” and “no. MADs < 3”.
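A minimal sketch of the per-label outlier rule, assuming a cells x labels score matrix with label names as column names. `low_outlier_per_label` is a hypothetical name and this is not the function that ended up in SingleR; it only illustrates "ignore cells whose best score is more than 3 MADs below the median for their label".

```r
# Hypothetical helper: flag cells whose best score is a low outlier among
# the cells that were assigned the same label.
low_outlier_per_label <- function(scores, nmads = 3) {
  best <- apply(scores, 1, max)
  labels <- colnames(scores)[max.col(scores, ties.method = "first")]
  flagged <- logical(nrow(scores))
  for (lab in unique(labels)) {
    idx <- which(labels == lab)
    centre <- median(best[idx])
    spread <- mad(best[idx], center = centre)
    flagged[idx] <- best[idx] < centre - nmads * spread
  }
  flagged
}

# Toy scores: 19 cells with best scores between 0.60 and 0.70, plus one
# cell that still "wins" as T_cell but with a much weaker score.
scores <- cbind(
  T_cell = c(seq(0.60, 0.70, length.out = 19), 0.35),
  B_cell = rep(0.20, 20)
)
flagged <- low_outlier_per_label(scores)
print(which(flagged))
```

Only the low tail is tested here, because an unusually high score is not a reason to distrust a label.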

Friederike Dündar (17:03:44): > yeah, just as an optional follow-up function

Aaron Lun (17:09:25): > Can you put this as an Issue so we don’t forget? Just quote what I said above.

Dan Bunis (17:14:26): > Making the issue

Dan Bunis (17:14:31): > :raised_hands:

Dan Bunis (17:15:11): > this is something I’ve been thinking about for a while. My method with my own data has been through clustering with the heatmap function

Dan Bunis (17:17:07): > Like I’d set cutree_cols to something, and then throw out calls for clusters of cells where more than one celltype scores highly. but that’s kinda weird and arbitrary. I like this p-value idea a lot better!

Aaron Lun (17:19:19): > To be clear, I wouldn’t be creating a p-value; much more straightforward to say, “the top scoring label must be at least X above the median score across all labels for this cell”.
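The rule quoted in this message can be written in a few lines of base R. This is a sketch under stated assumptions: `small_delta_from_median` is a hypothetical helper, the scores are made up, and the 0.05 default simply reuses the figure floated earlier in the conversation rather than a validated threshold.

```r
# Hypothetical helper: flag cells whose top score is less than 'min_delta'
# above the median score across all labels for that cell.
small_delta_from_median <- function(scores, min_delta = 0.05) {
  best <- apply(scores, 1, max)
  med <- apply(scores, 1, median)
  (best - med) < min_delta
}

scores <- rbind(
  cell1 = c(T_cell = 0.70, B_cell = 0.40, NK = 0.45),  # clear winner
  cell2 = c(T_cell = 0.31, B_cell = 0.30, NK = 0.29)   # nothing stands out
)
pruned <- small_delta_from_median(scores)
print(pruned)
```

Because the comparison is made within each cell, a cell whose scores are uniformly low is only flagged when none of them stands out from the rest.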

Aaron Lun (17:20:06): > I couldn’t figure out how the original method was yielding properly-calibrated p-values, but we don’t need calibration anyway, so we might as well go with something simple.

Dan Bunis (17:22:24): > I’m copying a smattering of these thoughts into the issue and we can decide and make the function later.

Aedin Culhane (17:35:34): > Looking at SingleR: do the sc signature data (hpca, immgen, blueprint_encode) all have the format of a list with $types, $main_types, etc.? Some datasets (Kang) seem different.

Aaron Lun (17:36:06): > Sorry aedin, I have no idea what you were trying to say there.

Aedin Culhane (17:36:36): > Yeah.. neither did I. So I edited it

Aedin Culhane (17:37:25): > Is there going to be an sc signature format? I have collected cell type signatures and would like to know if there is a way to store them

Aaron Lun (17:37:54): > Well, that’s what this channel was originally for. @Kevin Rue-Albrecht made some headway and then lncRNAs got in the way.

Aaron Lun (17:37:55): > As they do.

Aedin Culhane (17:37:56): > I was going to put them in a GSEABase format, or the new geneset format that others are describing

Aedin Culhane (17:38:08): > (LincRNA…. typical)

Aaron Lun (17:38:14): > I hate them so much.

Aedin Culhane (17:39:39): > Also, the symbols in $de.genes are mostly in lower case… Is this because of human -> mouse or just a tolower effect?

Aedin Culhane (17:40:01): > (sorry for typos.. friday evening)

Kevin Rue-Albrecht (17:41:45): > > the new geneset format that others are describing > what’s that one?

Aedin Culhane (17:44:00): > Hi Kevin, I was thinking of unisets and the other new packages that you and others presented at the Bioc meeting

Dan Bunis (17:52:35): > about the lower case comment, can you clarify where you are looking? In the HPCA and blueprint/encode datasets, genes are capitalized as is the norm for human genes. And in our mouse datasets, the first letters are capitalized as is typical for mouse genes.

Kevin Rue-Albrecht (17:53:55): > @Aedin Culhane I’m still curious to see how https://github.com/Kayla-Morrell/BiocSet will be received by the community, given the tidy/S4 discussion that happened at bioc2019 > > As I’ve been saying for myself, unisets (https://github.com/kevinrue/unisets) was meant as a proof of concept that snowballed (that tends to happen to me a lot). I’ve paused it for various reasons, including: > - I am still “just a postdoc” / not master of my own time yet (#jobswelcome) > - waiting for feedback from volunteer testers before I invest more time into something that no one wants to use > - a discussion with Herve Pages at bioc2019, that he started a graph2 package (https://github.com/hpages/graph2) which happens to implement AnnotatedIDs as a very similar concept to my IdVector. Basically, we met at the concept where gene:set is a bipartite graph between two sets of entities

Kevin Rue-Albrecht (17:54:49): > If he makes headway in that direction, I’d seriously consider reimplementing unisets around that

Aaron Lun (17:55:52): > Y’know, ever since I started my industry job, I have never worked OT. Ever. Once I’m off the clock, I don’t touch work code. I don’t even think about it.

Aaron Lun (17:56:12): > I also get paid, like, 3-4 times more.

Aaron Lun (17:56:26): > Just sayin’.

Dan Bunis (17:56:46): > I want that so bad!

Aaron Lun (17:57:11): > And another thing - I will NEVER WORK ON LNCRNAS AGAIN.

Dan Bunis (17:57:12): > #boutReadyToGraduate

Aedin Culhane (17:59:28): > I will wait for someone else to define format then ;-))

Aaron Lun (17:59:57): > Yes, back to the original topic. The current objects will not hold any signatures at all, just the cell labels

Kevin Rue-Albrecht (18:00:18): > sorry, original topic it is

Aedin Culhane (18:00:34): > It might be handy to have a tool that maps free text (cell type names) to cell type ontologies, to find out if two different sc signatures are the same cell type. @Vince Carey and I played with zooma for auto-annotating free text (i.e. cell types) to cell ontologies. It didn’t work that well… Maybe someone else has a better idea

Aedin Culhane (18:01:18): > (again.. more apologies for friday typos)

Aedin Culhane (18:06:26): > Thanks for putting singleR together.

Kevin Rue-Albrecht (18:07:06): > Aedin, depends what you call the format. It’d be pretty fair to say that all 3 packages presented at bioc2019 converged to the triple table format. Then, with regard to the actual package and class, it will come down to how users like to interact with the container and, as @Vince Carey pointed out during the BOF, which container manages to support the features that users of GSEABase and Category are familiar with

Aedin Culhane (18:08:06): > Maybe the format doesn’t matter, if it’s easy to convert between them. The simplest one with the most connections to everything else will win :wink:

Aedin Culhane (18:08:55): > In the meantime… I was looking for kidney-related sc markers. I have some that I had compiled and was trying to find others :wink:

Tim Triche (18:22:48): > @Tim Triche has joined the channel

2019-08-17

Aaron Lun (00:55:48): > For those of you who can spare the time: some testing required on https://github.com/LTLA/SingleR/pull/19/

Aaron Lun (00:56:16): > Just plug in the scores matrix that you get out of SingleR and see how much stuff you lose and whether the losses are sensible.

Aaron Lun (00:56:34): > @Dan Bunis, it may also be worth creating some plots to visualize the per-label and per-cell checks.

Aaron Lun (00:57:11): > e.g., create violin plots of scores per label, or deltas from median per cell across labels.

Dan Bunis (02:22:18): > I’ll add prune calls to heatmaps as well. Could we store the prune calls in the results DataFrame?

Jared Andrews (02:24:58): > Makes sense to me, though I haven’t looked at the PR, so not sure what info Aaron’s new code yields. If it’s just binary pruning, then seems reasonable.

Jared Andrews (02:25:17): > I won’t have time to really test till next week.

Jared Andrews (02:25:30): > Maybe a bit Sunday.

Aaron Lun (02:25:53): > @Dan Bunis Yes, if it works out, there would be an extra column in the classifySingleR output named pruned.

Aaron Lun (02:26:49): > For the time being, though, just call it on the scores matrix until we figure out what the defaults should be.

Aaron Lun (02:29:02): > Any other SingleR users on this @channel should feel free to test out the pruning as well. It’s as simple as: > > pred <- SingleR(test=test, ref=ref) > bad.labels <- pruneScores(pred$scores) > # then ignore all labels in bad.labels >

Dan Bunis (03:00:14): > To view on the plotScoreHeatmap: > > plotScoreHeatmap(results = pred, clusters = as.character(bad.labels)) > # OR > # To show calls as well: > plotScoreHeatmap(results = pred, > annotation_col = data.frame( > labels = pred$labels, > bad.labels = as.character(bad.labels), > row.names = row.names(pred))) >

Dan Bunis (03:17:39) (in thread): > Between your color.panel above and mine from DittoSeq, > “[color content]” > which should I add to plotScoreHeatmap? I want to add color control for others like us!

Laurent Gatto (07:43:17): > @Laurent Gatto has joined the channel

Dan Bunis (14:57:43): > PR pushed to make annotation of pruneScores in plotScoreHeatmap easier + combinable with clustering and other annotations. I’ll update the code provided above once merged in to the main prune branch.

Aaron Lun (14:59:08): > Right. Probably table that for now, let’s make sure the pruning makes scientific sense in the first place.

Aaron Lun (15:00:08): > It would be useful to think of assembling a repo containing a few test cases that can be routinely used to check for sensibility upon algorithm updates.

Aaron Lun (15:00:52): > I have a few things like that in my code, e.g., https://github.com/LTLA/HVGDetection2018/tree/master/real will run through my HVG modelling functions and apply them on a whole bunch of different datasets to make sure that they are giving sensible and robust results.

Aaron Lun (15:02:52): > A kind of “SingleR gallery”, if that gets anyone excited.

Aaron Lun (15:03:41): > There’s probably a BioC workflow and a paper in that, if someone could be motivated to do it.

Dan Bunis (15:11:46): > @Dvir Aran and I had discussed some other ideas as well surrounding the SingleR update and potential paper a while back, so there’s most definitely space to make a paper out of something like this. > I’ve gathered a list of datasets for such a repo already :smiley:

Dan Bunis (15:17:23): > I think my updates to the heatmap function are worth pushing to the prune branch even before the utility of scoring is determined because it’s a good immediate check of the pruneScores. Also it doesn’t change the default nature of the function. - Attachment: Attachment > Right. Probably table that for now, let’s make sure the pruning makes scientific sense in the first place.

Friederike Dündar (20:34:23) (in thread): > I’ll leave you free rein here, I think we’re pretty much on the same page anyway :slightly_smiling_face:

Dvir Aran (23:42:56): > A couple of comments regarding the pruning - I played with that for days. There is a full section about this in supp info 2 in the paper, that is also available through the github page. The problem with a threshold is that the scores are correlated with the number of non-zero genes. In theory you can develop a method that corrects for that, but just a threshold doesn’t work.

2019-08-18

Aaron Lun (00:04:45): > You mean a fixed threshold?

Aaron Lun (00:05:16): > like “keep all correlations above X”?

Aaron Lun (00:22:36): > Right. So the per-cell check in pruneScores should be analogous in purpose to your original p-value, because it looks to see if the max score is large relative to the median within a cell. > > I don’t mean to be a party pooper, but TBH, when I was reading outliers::chisq.out.test, I couldn’t understand how it was well-calibrated; the statistic doesn’t correct for the number of observations, so as the number of labels increases, you will always get outliers. (Try, for example, chisq.out.test(rnorm(100)) and see if you manage to not detect an outlier!) Hence why I use a delta correlation in pruneScores, which is simpler to interpret. If we had a good p-value, that would be better, but we don’t, so oh well.

Dvir Aran (00:58:05): > I agree, it’s a bad way to do that. I have never used it and removed it in the first submission, but the reviewers insisted… :slightly_smiling_face:
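The calibration concern can be illustrated without installing the outliers package. The sketch below assumes that the test statistic is the squared z-score of the most extreme observation, referred to a chi-squared distribution with 1 degree of freedom; `extreme_chisq_p` and `detect_rate` are hypothetical helpers written for this illustration, not the package’s code.

```r
# Assumption: statistic = squared z-score of the most extreme value,
# compared against a chi-squared distribution with 1 degree of freedom.
extreme_chisq_p <- function(x) {
  stat <- max((x - mean(x))^2) / var(x)
  pchisq(stat, df = 1, lower.tail = FALSE)
}

# Fraction of pure-noise vectors of length n in which an "outlier" is
# declared at the given significance level.
detect_rate <- function(n, reps = 2000, alpha = 0.05) {
  mean(replicate(reps, extreme_chisq_p(rnorm(n)) < alpha))
}

set.seed(42)
rates <- vapply(c(5, 20, 100), detect_rate, numeric(1))
names(rates) <- c("n=5", "n=20", "n=100")
print(rates)
```

Under this assumption the false-positive rate depends strongly on the number of values: with 5 values the statistic cannot exceed (n-1)^2/n = 3.2, so nothing is ever flagged at the 5% level, whereas with 100 values almost every pure-noise vector yields an "outlier".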

Aaron Lun (00:58:59): > ha!

Aaron Lun (00:59:25): > been there, got the t-shirt.

Dvir Aran (01:01:03): > the problem is, of course, very real, but quite complicated, and really depends on context - let’s say there are three subsets of cell type X, but the reference only has 2; you don’t really want to prune the subset that ‘can’t’ be annotated. This is where the heatmap is very helpful, as we’ve shown in our paper, where we identified a new macrophage subset

Aaron Lun (01:01:28): > Yes.

Dvir Aran (01:01:50): > My approach is therefore to always start with a very broad and not specific reference. Identify the major cell types.

Dvir Aran (01:04:21): > Then, go deeper for each cell type to identify the subsets. This was the project I presented when I visited Genentech and met you… unfortunately (or luckily) I don’t have time to work on it at the moment, but if anyone is interested to finish up this project and write a paper, let me know :slightly_smiling_face:

Aaron Lun (01:07:10): > There should be plenty of PhD students on this channel with free time.

Dan Bunis (01:07:43): > :rolling_on_the_floor_laughing:

Dan Bunis (01:08:25): > I’m open to contributing, but don’t have quite enough free time to take the lead.

Dvir Aran (01:12:24) (in thread): > Problem is that you keep telling them how industry is better for them…

Aaron Lun (01:13:24) (in thread): > whoops

Jared Andrews (01:29:32): > > PhD students > > free time

Jared Andrews (01:29:51): > Excellent joke tbh.

Aaron Lun (01:30:01) (in thread): > My overly circuitous plan to increase the hiring pool has been exposed.

Aaron Lun (01:31:01): > Back in my day, I had loads of free time. I didn’t have internet or TV in my apartment, so I just sat on my bed and stared at the wall.

Aaron Lun (01:31:30): > Got a lot of deep thinking done. Like, “why am I here” and “what am I”

Aaron Lun (01:31:38): > Truly earned my “Doctor of Philosophy” title.

Aaron Lun (01:32:15): > On the flipside, now I’m a bit weird, but you guys will get there soon enough.

Jared Andrews (01:32:24): > Neat. I spend most of my time doing the same. 4 am on a Tuesday morning, “do I really want this?”

Jared Andrews (01:33:11): > “Do I deserve people at family parties saying, ‘whoa, that sounds complicated?’”

Dvir Aran (01:33:34): > I have three kids… I don’t remember what free time looks like..

Jared Andrews (01:33:41): > Dunno, someone lend me validation.

Jared Andrews (01:37:08): > Dan, interested in tag teaming an F1000 paper or something? Seems like a workflow via Aaron’s test repository might be enough for that.

Dvir Aran (01:37:09) (in thread): > If anyone is reading this - we are hiring… if you want to do really cool clinical informatics analytics with the biggest health records data in the world in downtown Palo Alto, let me know…

Aaron Lun (02:48:05) (in thread): > Try #jobs

Dan Bunis (02:57:37): > Would love to contribute, but my plate really is pretty full rn.

Kevin Rue-Albrecht (05:18:34) (in thread): > More than “postdocs with free time”?:yum:

Aaron Lun (13:42:08): > Don’t forget to tick off the requests in https://github.com/LTLA/SingleR/pull/20

Aaron Lun (14:04:25): > huh, you can’t tick off the box? I thought you would be able to.

Dan Bunis (14:08:13): > Yea… silly

Dan Bunis (14:08:20): > I can tick off my own

Dan Bunis (14:09:00): > But like… what’s the point then?

Dan Bunis (17:21:43): > New thought/question added to https://github.com/LTLA/SingleR/pull/20. I’m with you now, @Aaron Lun, that the min.diff.next cutoff is not ideally useful in its current implementation because it blasts away fine tuning. I had an improper assumption before about how fine.tuning worked. New question: Would it be possible to run min.diff.next on just the fine.tuning scores?
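For readers following along, the quantity behind a min.diff.next-style cutoff is simply the gap between the best and second-best score for each cell. The sketch below shows only that arithmetic on made-up numbers: `diff_to_next` is a hypothetical helper, not the code in the pull request, and which scores it should be applied to (before or after fine-tuning) is exactly the open question raised here.

```r
# Hypothetical helper: difference between the best and second-best score
# for each cell; small values mean the top two labels are hard to separate.
diff_to_next <- function(scores) {
  apply(scores, 1, function(s) {
    top2 <- sort(s, decreasing = TRUE)[1:2]
    unname(top2[1] - top2[2])
  })
}

scores <- rbind(
  cell1 = c(CD4_T = 0.62, CD8_T = 0.60, B_cell = 0.20),  # close call
  cell2 = c(CD4_T = 0.70, CD8_T = 0.45, B_cell = 0.25)   # clear winner
)
delta <- diff_to_next(scores)
ambiguous <- delta < 0.05
print(ambiguous)
```

Closely related labels (such as the two T cell subsets in this toy matrix) naturally have near-identical scores before fine-tuning, which is why a fixed gap applied to those scores can discard calls that fine-tuning would have resolved.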

Aaron Lun (17:27:13): > currently in round 2 of batchelor refactoring.

Aaron Lun (17:27:28): > It physically hurts, Dan.

2019-08-19

Friederike Dündar (08:01:48): > I’ve lost it somewhere in that thread what that paper would be about? Basically a workflow advertising the recent move of SingleR to BioC? - Attachment: Attachment > Dan, interested in tag teaming an F1000 paper or something? Seems like a workflow via Aaron’s test repository might be enough for that.

Friederike Dündar (08:17:36): > I can contribute, but it’s difficult for me to predict how much I’ll be able to do after my next child’s born at the end of September

Jared Andrews (09:49:48): > Basically. And a more formal display of its accuracy/power in alleviating one of the more annoying parts of sc analysis. I’m not sure it’s something I want to tango with solo, given that I’m supposed to graduate in ~6 months and still have another paper and a dissertation to write.

Friederike Dündar (10:08:37): > @Dan Bunis You mentioned you already had collected some data sets – which ones did you pick and why?

Friederike Dündar (10:09:37): > We could just start putting something together, worst outcome would be an extensive vignette with possibly more reference data sets

Friederike Dündar (10:10:14): > Would an F1000 submission require that we show differences in performance compared to other contenders?

Jared Andrews (10:24:05): > I don’t think so, and are there other contenders? I think just showing how easily it slots into a standard workflow and that the results are generally believable is enough. Especially if we can show it can identify closely related cell subsets given a good reference set.

Friederike Dündar (11:06:25): > I guess “results are generally believable” is the term that we’d have to flesh out the most

Friederike Dündar (11:09:12): > those immune cell data sets you’ve been staring at, would those be up for publication or are they still embargoed?

Friederike Dündar (11:11:40): > as for contenders: I recently saw there’s a tool from the Trapnell Lab that tries something similar: https://cole-trapnell-lab.github.io/garnett/ - Attachment (cole-trapnell-lab.github.io): Garnett > Garnett - Automated cell type identification

Friederike Dündar (11:16:05): > https://www.biorxiv.org/content/10.1101/538652v1

Friederike Dündar (11:16:43): > but it seems to rely much more on the user to identify marker genes of interest first

Friederike Dündar (11:18:26): > > First, Garnett defines a markup language for specifying cell types using the genes that they specifically express. The markup language is hierarchical in that a cell type can have subtypes.

Friederike Dündar (11:18:47): > Yikes. Exactly the step I want to get rid of :slightly_smiling_face:

Aaron Lun (11:19:57): > I didn’t have a great time with garnett. Putting aside the actual results, I had to fight it every step of the way to get it to run.

Friederike Dündar (11:20:52): > did you ever do a direct comparison of the results? are you aware of other contenders?

Friederike Dündar (11:22:16): > should we tell this guy to just use SingleR: https://github.com/cole-trapnell-lab/garnett/issues/19 :smiling_imp:

Jared Andrews (11:44:44): > My stuff is embargoed, but we could just use the 10X PBMC sets and show that detailed T cell subsets can be identified.

Aaron Lun (11:45:32): > I did do some comparisons, but nothing I’d be confident in releasing.

Jared Andrews (11:46:04): > Yeah, I also tried garnett and basically ran into the same issues. Tough to run, seemed like you might as well use marker genes and annotate manually and it’d be just as quick.

Friederike Dündar (12:27:20) (in thread): > sure, I was mostly thinking about how to minimize the effort since you’re probably more intimately familiar with your own data sets

Friederike Dündar (12:27:53) (in thread): > Ideally, we should probably have data sets from different types of research backgrounds, but I’m assuming that’s exactly what @Dan Bunis had in mind when he started collecting them, right?

Jared Andrews (12:28:11) (in thread): > Oh yeah, definitely. Mine are all disease sets too though, so they’re kinda screwy in their own way.

Friederike Dündar (12:30:19) (in thread): > it would actually be super interesting to see how it performs in a cancer setting

Dan Bunis (12:32:26): > A list of publicly available (and typically processed already) single cell datasets from different tissues. I was going for breadth, because at that point, my conversations with @Dvir Aran about a SingleR update paper included talks of an updated and broader reference set. > > I’ll add better descriptions and link it here some time in the next few days. - Attachment: Attachment > @Dan Bunis You mentioned you already had collected some data sets – which ones did you pick and why?

Jared Andrews (12:34:12) (in thread): > It seems to do fairly well from what I’ve been seeing. This cancer has relatively well-defined late-disease cell type signatures that are being pulled out in patients where we expect as much. > > I am trying to get the paper submitted on it by the end of the year. Then again, the reference datasets for T cells are quite good, and there are even more specific ones that could be made. The current reference datasets definitely have a heavy immune cell tilt.

Friederike Dündar (12:36:27) (in thread): > my boss has been whining about how his tumor biopsy scRNA-seq samples seem to not work well with any of the alignment methods nor with SingleR although I’ve yet to see the results

Dan Bunis (12:36:31): > Datasets in here are probably also useful @Friederike Dündar :grin: - Attachment: Attachment > I have a few things like that in my code, e.g., https://github.com/LTLA/HVGDetection2018/tree/master/real will run through my HVG modelling functions and apply them on a whole bunch of different datasets to make sure that they are giving sensible and robust results.

Jared Andrews (12:37:43) (in thread): > :man-shrugging:

Friederike Dündar (12:38:08) (in thread): > what cancer types are you working on?

Jared Andrews (12:38:24) (in thread): > Cutaneous T cell lymphomas

Friederike Dündar (12:40:38): > shall we have some sort of brainstorming session about this hypothetical paper (e.g. via video chat) and then decide whether we feel we can take it on/find others to assign tasks to/give it up?

Dan Bunis (12:46:49): > Sounds like a good idea to me! Perhaps after I get a chance to poke through this list I haven’t looked at in a couple months…

Dan Bunis (12:48:33) (in thread): > I’m in SF, so I’d much prefer after noon ET

Friederike Dündar (12:49:57) (in thread): > Sure; this week I could do tomorrow or Wednesday afternoon (is there a doodle-like app in slack?)

Dan Bunis (12:52:03) (in thread): > I think there’s a way to make a poll at least. I have 10min to make it to my desk and give it a go rn…

Friederike Dündar (12:52:27) (in thread): > :+1:

Dan Bunis (14:25:15): > /poll “What times are good for a paper brainstorm session? (Times in ET)” “Tues 12pm” “Tues 1pm” “Tues 2pm” “Tues 3pm” “Tues 4pm” “Wed 12pm” “Wed 1pm” “Wed 2pm” “Wed 3pm” “Wed 4pm”

USLACKBOT (14:25:15): > This message was deleted.

Aaron Lun (14:25:35): > woah, WTF

Aaron Lun (14:25:44): > Didn’t even know that was possible.

Dan Bunis (14:27:32): > quite simple too. > > /poll "question" "opt1" … >

Aedin Culhane (14:28:06): > The Broad has a sc RNAseq Gtex dataset on a subset of Gtex individuals. I think it was 6 or 7 individuals. For Gtex, they did up to 16 tissues per individual. So that might be a good test dataset

Aedin Culhane (14:28:36): > It’s behind the dbGaP protection if you have access to it

Dan Bunis (14:32:48): > oh wonderful, thanks! Adding it to my list for now. I’ll poke into that before the brainstorm. Might be all we’d need, though we’d have to get access to use it.

Ben Johnson (17:12:06): > @Ben Johnson has joined the channel

Ben Johnson (17:14:21) (in thread): > are there additional identifying features to pick out these datasets in (for instance) the SRA run selector?

Kin Lau (17:53:18): > @Kin Lau has joined the channel

Friederike Dündar (17:54:41): > if people are more comfortable with another way of brainstorming, I’m totally up for it – does anyone have a preferred platform?

Dan Bunis (20:33:49) (in thread): > @Jared Andrews would any of these times work for you for a brainstorm?

Jared Andrews (21:19:41) (in thread): > Yeah, sorry, been busy. Answering now.

Dan Bunis (21:27:07) (in thread): > no worries! but cool all those times work for me too. I’d prefer wednesday if that’s alright with you both.

Jared Andrews (22:08:42) (in thread): > Good with me.

Friederike Dündar (22:19:16) (in thread): > Kiddo is sick, hopefully not seriously, but that means Wednesdays would also get my vote hoping that he’ll have recovered by then

Dan Bunis (22:20:12) (in thread): > also hoping not seriously and that they feel better quickly!

Friederike Dündar (22:21:52) (in thread): > thanks!

Friederike Dündar (22:22:51) (in thread): > alternatively/in addition, we could think about a text-based platform for brainstorming/collecting ideas

Friederike Dündar (22:23:25) (in thread): > google drive? or are there more exclusive/geekier options?

Jared Andrews (22:24:08) (in thread): > Drive is probably the easiest to get everyone in on and invite people easily.

Dan Bunis (22:24:15) (in thread): > lol google drive is what I know, but I’m not sure on the 2nd question

Friederike Dündar (22:25:39) (in thread): > let’s just do drive then

Friederike Dündar (22:26:09) (in thread): > just to keep track of the various ideas that have been floating around

Dan Bunis (23:34:38): > FYI, I’ve made a bunch of different branches of SingleR at this point (in order to make it easiest for Aaron to pick and choose what to take and when!) But if anyone wants to test with everything I’ve added (pruneScores visualizations, additional pruneScore cutoff based on minimum distance between best and next best fine-tuning scores, and annotation of prune-scores in plotScoreHeatmaps) install from this branch: > > BiocManager::install("dtm2451/SingleR@prune-dan-merged") > > Then you can use: > > pred <- SingleR(...) > pruneScores(results = pred) >

Aaron Lun (23:54:26): > We should probably nip and tuck some of the PRs. @Dan Bunis try out 20 and I’ll merge it if it’s okay.

Dan Bunis (23:56:13): > seems to work

Aaron Lun (23:56:50): > what threshold are you using on min.diff.next this time?

Aaron Lun (23:57:10): > Even if we don’t set it as the default, it would be good to give some guidance in the docs.

Dan Bunis (23:57:42): > 0.05 gets a lot for me, but I’m testing on a “meh” quality dataset.

2019-08-20

Dan Bunis (00:00:37) (in thread): > 0.05 was correct

Dan Bunis (00:00:46) (in thread): > http://127.0.0.1:35186/chunk_output/3F851EFE29C1784B/5E2F74A6/cxq68lvp900k7/000012.png - File (PNG): image.png

Dan Bunis (00:01:50) (in thread): > http://127.0.0.1:35186/chunk_output/3F851EFE29C1784B/5E2F74A6/cxq68lvp900k7/000014.png - File (PNG): image.png

Dan Bunis (00:03:55) (in thread): > http://127.0.0.1:35186/chunk_output/3F851EFE29C1784B/5E2F74A6/cxq68lvp900k7/000020.png - File (PNG): image.png

Aaron Lun (00:19:02) (in thread): > can you talk me through this?

Aaron Lun (00:19:21) (in thread): > ah, right. looks like 0.05 wipes out a lot.

Dan Bunis (00:19:42) (in thread): > Yes, but the quality of the dataset is….

Dan Bunis (00:20:10) (in thread): > I’m about to test on 10X pbmcs instead

Aaron Lun (00:20:14) (in thread): > Well, they’re not B cells, that’s for sure!

Aaron Lun (00:20:55) (in thread): > And besides, even the field can’t agree on all these damn precursors. John M. used to joke that we’d get a new haematopoietic tree every conference.

Dan Bunis (00:21:15) (in thread): > lol yes… CD34+ hematopoietic progenitors

Dan Bunis (00:22:07) (in thread): > not fun

Aaron Lun (00:26:36) (in thread): > Oh yeah, add a paragraph in the docs about this new arg and I’ll merge 20.

Dan Bunis (00:27:54) (in thread): > will do tonight! after I use the 10X pbmc dataset to decide what a more normal min.diff.next could be

Aaron Lun (00:30:30) (in thread): > :+1: a few sentences should suffice, see the surrounding docs for guidance.

Tim Triche (01:19:47) (in thread): > is that the DRoNC-seq data? I was looking through the datasets in ANViL and it is NOT easy to navigate… more accurately, it’s quite difficult:confused:

Tim Triche (08:28:39): > @Stephanie Hicks maybe of interest: DroNc-seq, Drop-Seq, and 10X on the same mouse kidneys: https://jasn.asnjournals.org/content/30/1/23 - Attachment (American Society of Nephrology): Advantages of Single-Nucleus over Single-Cell RNA Sequencing of Adult Kidney: Rare Cell Types and Novel Cell States Revealed in Fibrosis > Background A challenge for single-cell genomic studies in kidney and other solid tissues is generating a high-quality single-cell suspension that contains rare or difficult-to-dissociate cell types and is free of both RNA degradation and artifactual transcriptional stress responses. Methods We compared single-cell RNA sequencing (scRNA-seq) using the DropSeq platform with single-nucleus RNA sequencing (snRNA-seq) using sNuc-DropSeq, DroNc-seq, and 10X Chromium platforms on adult mouse kidney. We validated snRNA-seq on fibrotic kidney from mice 14 days after unilateral ureteral obstruction (UUO) surgery. Results A total of 11,391 transcriptomes were generated in the comparison phase. We identified ten clusters in the scRNA-seq dataset, but glomerular cell types were absent, and one cluster consisted primarily of artifactual dissociation – induced stress response genes. By contrast, snRNA-seq from all three platforms captured a diversity of kidney cell types that were not represented in the scRNA-seq dataset, including glomerular podocytes, mesangial cells, and endothelial cells. No stress response genes were detected. Our snRNA-seq protocol yielded 20-fold more podocytes compared with published scRNA-seq datasets (2.4% versus 0.12%, respectively). Unexpectedly, single-cell and single-nucleus platforms had equivalent gene detection sensitivity. For validation, analysis of frozen day 14 UUO kidney revealed rare juxtaglomerular cells, novel activated proximal tubule and fibroblast cell states, and previously unidentified tubulointerstitial signaling pathways. 
Conclusions snRNA-seq achieves comparable gene detection to scRNA-seq in adult kidney, and it also has substantial advantages, including reduced dissociation bias, compatibility with frozen samples, elimination of dissociation-induced transcriptional stress responses, and successful performance on inflamed fibrotic kidney.

Stephanie Hicks (08:30:04): > Thanks @Tim Triche

Tim Triche (08:30:52): > now if anyone knows where that GTex scRNAseq data is, I’d be quite obliged (it’s quite interesting trying to find it in ANViL)

Friederike Dündar (09:04:12) (in thread): > off-topic: has anyone figured out a way to add a label to the color bar in pheatmap? It’s so annoying to either have to guess or stick it in the title to define what the colors of the heatmap are representing…

Aedin Culhane (11:22:23) (in thread): > The Gtex folks mentioned it. I looked up my notes: it’s 3 individuals, 8 tissues, >300,000 cells (sorry, smaller than I recalled). They plan to expand to 25 tissues, and more individuals.

Aedin Culhane (11:22:30): > https://www-nature-com/articles/s41586-019-0969-x

Aedin Culhane (11:23:33): > scRNAseq of 2 million cells of 61 mouse embryos 9.5- 13.5 days gestation, ‘mouse organogenesis cell atlas’ . Used Monocle 3 to identify cell types and 56 trajectories

Kevin Rue-Albrecht (11:23:59): > Broken link? :cry:

Aaron Lun (11:24:13): > Need to replace the dashes with dots.

Kevin Rue-Albrecht (11:24:25): > gsub it is then
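The dash-to-dot repair joked about above can be sketched in base R. This is a minimal illustration that assumes the proxy only rewrote the host name (dots became dashes) and left the path alone; `fix_proxy_host` is a hypothetical helper name, not something from the thread.

```r
# Hypothetical helper: restore the dots in a proxy-mangled host name.
# Assumes only the host (between "//" and the next "/") was altered,
# so dashes in the path are left untouched.
fix_proxy_host <- function(url) {
    parts <- regmatches(url, regexec("^(https?://)([^/]+)(/.*)?$", url))[[1]]
    if (length(parts) == 0L) stop("not an http(s) URL: ", url)
    host <- gsub("-", ".", parts[3], fixed = TRUE)
    paste0(parts[2], host, parts[4])
}

broken <- "https://www-nature-com/articles/s41586-019-0969-x"
fixed <- fix_proxy_host(broken)
print(fixed)
```

A blanket `gsub("-", ".", url)` would also rewrite the dashes in the article identifier, which is why the sketch restricts the substitution to the host.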

Aaron Lun (11:24:31): > Anyway, that’s Cole’s super sparse data. Average UMI count of 600 per cell, IIRC.

Aedin Culhane (11:26:59): > Yes, it’s the Cole dataset

Aedin Culhane (11:27:22): > Sorry for the broken link, I edited the link to remove the proxy.

Aedin Culhane (11:27:46): > https://www.nature.com/articles/s41586-019-0969-x - Attachment (Nature): The single-cell transcriptional landscape of mammalian organogenesis > Data from single-cell combinatorial-indexing RNA-sequencing analysis of 2 million cells from mouse embryos between embryonic days 9.5 and 13.5 are compiled in a cell atlas of mouse organogenesis, which provides a global view of developmental processes occurring during this critical period.

Tim Triche (11:28:54): > cool beans – do you know if the GTEx scRNA data is similarly published somewhere?

Tim Triche (11:29:29): > or indexed somehow. 300,000 cells is not a small dataset!

Aedin Culhane (11:29:52): > @Tim Triche I don’t think the GTEx pilot is published. Anyone here in the GTEx network?

Tim Triche (11:30:28): > @Kasper D. Hansen was, I thought, but I am senile and do not know

Tim Triche (11:31:10): > thanks for pointing out Cole’s data though. That will be really handy for some other projects.

Aedin Culhane (11:33:25): > Anyone played with the scRNA data in the GXA at EBI? https://www.ebi.ac.uk/gxa/sc/home - Attachment (ebi.ac.uk): Home > EMBL-EBI Single Cell Expression Atlas, an open public repository of single cell gene expression data

Aedin Culhane (11:34:50): > They have a tool called “Marker Genes”; what are they using? https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-6701/results/marker-genes?markerGeneK=21&k=21&colourBy=clusters - Attachment (ebi.ac.uk): Experiment > EMBL-EBI Single Cell Expression Atlas, an open public repository of single cell gene expression data

Aedin Culhane (11:35:57) (in thread): > Kristin Ardlie and Orit Rosen at the Broad should know

Aedin Culhane (11:37:03) (in thread): > As for the platform, yes, it is nuclei profiling, and I think they tried 3 different dissociation protocols. They found no one protocol was optimal for all tissues

Aedin Culhane (11:38:15) (in thread): > So the pilot paper will compare the dissociation protocols across tissues within an individual, I think. (Basically, if the RNA is good for bulk, it’s good for scRNAseq, but dissociation tends to have some tissue-specific effects)

Aedin Culhane (11:39:52): > The biggest human dataset in scGXA is Vento-Tormo 2018. 48K cells on 10X, on human first trimester fetal-maternal interface

Aedin Culhane (11:42:07): > The biggest mouse ones are Delile et al., 2019, developing mouse spinal cord, 62K cells; and Ernst et al., 167K cells of mouse spermatogenesis

Tim Triche (11:51:58): > oh wow. I guess it’s good that we invited Vento-Tormo to swing by the Institute. Although the new fetal liver dataset is the coolest thing I have ever seen

Tim Triche (11:52:18): > I scraped the latter off of their website when the preprint went up. Information wants to be free :wink:

Tim Triche (11:58:10) (in thread): > very cool, thanks!

Friederike Dündar (13:03:59): > if I were a reviewer of Garnett, I’d ask them to integrate that database of cell markers to make it useful: https://academic.oup.com/nar/article/47/D1/D721/5115823 :wink:

Aaron Lun (14:31:20): > They have a method to help you choose markers, but it’s pretty awkward to use.

Aaron Lun (14:31:36): > It relies on file parsing *shudder*

Dan Bunis (15:35:01) (in thread): > Do you mean how to add a new annotation bar? I have somewhat gotten the hang of pheatmap at this point, but I’m not totally sure what you’re asking.

Dan Bunis (15:37:14) (in thread): > OH you mean the color scale for the data in the heatmap! No I haven’t even tried that actually. My plan has been to just leave that out / put that in my figure legends

Friederike Dündar (17:16:42) (in thread): > yes, I mean the color scale for the data that’s in the matrix being displayed as the actual heatmap. So annoying that it’s not immediately there

Dan Bunis (17:20:31) (in thread): > Agreed! There is def no input for it. And if there even is a manual way, the plot components are buried somewhere deep inside the output so I haven’t been able to figure out how to edit anything manually at all.

Friederike Dündar (17:21:31) (in thread): > yeah, I’ve also given up at least twice, but the whole grid/lattice stuff isn’t my favorite mess to deal with anyway

Friederike Dündar (17:37:08) (in thread): > I can set up the file; can you share the email addresses you want me to use for granting you access via PM?

Jared Andrews (17:45:22) (in thread): > Have we picked a time yet? I will have to find a spot that’s not noisy as can be. And how do we plan to do this? Skype?

Friederike Dündar (17:46:25) (in thread): > I’d rather not use Skype, but anything else works

Friederike Dündar (17:47:25) (in thread): > I’ve had good results with the app formerly known as appear.in

Friederike Dündar (17:47:25) (in thread): > https://getstarted.whereby.com/information/pricing/

Dan Bunis (17:47:48) (in thread): > we could use zoom? www.zoom.us

Friederike Dündar (17:48:09) (in thread): > that’d work for me, too

Dan Bunis (17:48:29) (in thread): > zoom isn’t limited to 4 people, in case we have others wanting to join

Friederike Dündar (17:50:51) (in thread): > sounds good, we have some sort of institutional login for that anyway

Friederike Dündar (17:50:59) (in thread): > I didn’t know it was free

Friederike Dündar (17:52:29) (in thread): > as for the time – I still don’t know whether I’ll have to entertain my son tomorrow or not; but in any case the later, the better, I think (for me)

Dan Bunis (17:52:44) (in thread): > And I didn’t realize there was a limit on the length of group meetings with the free option… but I also have an institutional login, so that probably doesn’t matter!

Friederike Dündar (17:53:00) (in thread): > I should know by tonight – if he goes to bed without a fever, he’ll most likely go to school tomorrow

Friederike Dündar (17:53:47) (in thread): > so, you guys decide on the time and whether zoom would work

Dan Bunis (17:54:43) (in thread): > :pray:. I’ll need to leave at 5, but I don’t think we’ll go more than an hour anyway, so any of the times are fine with me.@Jared Andrews?@Aaron Lun?

Friederike Dündar (17:55:42) (in thread): > btw, https://www.uctoday.com/collaboration/video-conferencing/zoom-security-issue-uc-unicorns-arent-invincible/ - Attachment (UC Today): Zoom Security Issue: UC Unicorns Aren’t Invincible - UC Today > UC Today reports on the latest technology news from around the globe. Read similar Video Conferencing news to ‘Zoom Security Issue: UC Unicorns Aren’t Invincible’ here

Aaron Lun (17:55:44) (in thread): > I’m out for two hours starting now, but otherwise should make every other time.

Jared Andrews (17:56:39) (in thread): > I will be out by like 5:30, ET, otherwise doesn’t matter to me.

Aaron Lun (17:56:44) (in thread): > Actually, I’m no longer sure; my calendar’s filled up since I filled out the poll

Aaron Lun (17:57:01) (in thread): > Just do whatever, and I’ll pop in if I’m there.

Friederike Dündar (17:59:05) (in thread): > zoom would work for you?

Aaron Lun (20:07:53) (in thread): > I suppose so, for work hours at least.

Jared Andrews (23:54:47): > Do we want to create a repo for additional reference datasets?

2019-08-21

Aaron Lun (00:01:18): > Depends on how many there are. We could just keep on shoving them into SingleR for the time being, and once it becomes annoying, it can split off into its own package.

Aaron Lun (00:01:32): > Not a big deal, we’d just update all the ExperimentHub paths to point to this new location.

Jared Andrews (00:11:13): > Okay. I should have two more to add soon.

Jared Andrews (00:11:37): > Also, this is currently in the Bioconductor queue: https://github.com/kieranrcampbell/cellassign

Aaron Lun (00:12:22): > yes, I know about that one. It’s a real fight to get tensorflow to work.

Jared Andrews (00:12:59): > It also requires known marker inputs.

Aaron Lun (00:17:43): > well, that’s not such a big deal, it’s fairly easy to just throw a few DE analyses in to get those guys. I mean… that’s what SingleR does under the hood anyway.

Aaron Lun (00:18:14): > Though if you do have marker genes, SingleR now provides a lot of different specification options.

Vince Carey (00:18:31) (in thread): > {tools} > SCXA-Workflows > A flexible pipeline for Single Cell RNA-seq analysis that integrates many existing tools for filtering and mapping reads, quantifying expression, clustering, finding marker genes and variable genes. The workflows maximize reproducibility by making use of Bioconda, Biocontainers, NextFlow and Galaxy. They can be run on the cloud, local machines or local premises. > > Thus far, as I poke into the scxa-workflow folders, there is no use of Bioconductor, although I see scater in the toolshed referenced.

Aaron Lun (00:18:36): > You can either give one marker list per label, or one marker list per pairwise comparison between labels.

Aaron Lun (00:19:04): > The latter is better for fine-tuning, but we support the former because most people just publish the cell type markers on their own.
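To make the two shapes concrete, here is a minimal base-R sketch. The labels, the gene symbols, and the `to_pairwise` helper are all hypothetical and chosen only for illustration; how SingleR actually consumes or expands such lists is described in `?trainSingleR` and is not reproduced here.

```r
# Hypothetical labels and gene symbols, for illustration only.
# Shape 1: one marker vector per label.
per_label <- list(
    B_cell   = c("CD19", "MS4A1", "CD79A"),
    T_cell   = c("CD3D", "CD3E", "TRAC"),
    Monocyte = c("CD14", "LYZ", "FCGR3A")
)

# Shape 2: one marker vector per pairwise comparison, stored as a list of
# lists where pairwise[[A]][[B]] holds genes taken to be up in A versus B.
# With only per-label markers available, every comparison for A simply
# reuses A's markers; the A-versus-A entry is left empty.
to_pairwise <- function(markers) {
    labels <- names(markers)
    out <- lapply(labels, function(a) {
        inner <- lapply(labels, function(b) {
            if (a == b) character(0) else markers[[a]]
        })
        names(inner) <- labels
        inner
    })
    names(out) <- labels
    out
}

pairwise <- to_pairwise(per_label)
str(pairwise$B_cell)

# The pairwise shape can then be refined by hand for one specific
# comparison, which is where the fine-tuning advantage comes from.
pairwise$T_cell$B_cell <- c("CD3D", "CD3E")
```

The per-label shape is what most publications provide; the pairwise shape costs more bookkeeping but lets a single hard-to-separate pair get its own markers.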

Jared Andrews (00:24:39): > Yeah, you’re right. Maybe I’ll give it a shot. I’ve personally struggled a bit finding good markers canonically associated with more rare subtypes, but then again, I’m pretty inexperienced. I’m sure others have an easier time with it. SingleR abstracts a lot of that away, which erases the lion’s share of the headache for me at least.

Friederike Dündar (10:45:05): > Today, 4pm EST, zoom-chat about what people would envision for a paper and whether that’s feasible?

Friederike Dündar (10:46:47): > Google Doc for various notes: https://docs.google.com/document/d/1uH6V3SvqnLi9H_5yBlVhn7qdsBS0ykh2fMu4BS0U2v8/edit?usp=sharing

Friederike Dündar (10:47:17): > if you want to edit it, send me an email address for adding you

Tim Triche (11:19:57): > blood cells are homogeneous?!? :crossed_swords:

Tim Triche (11:20:12): > Berenice would likely have interest in advising on feasibility

Friederike Dündar (11:38:08): > they are all blood cells :stuck_out_tongue:

Friederike Dündar (11:38:52): > > Berenice would likely have interest in advising on feasibility > Berenice Benayoun?

Friederike Dündar (12:15:19): > @Dvir Aran did you ever compare the consistency between the labels when using different reference data sets? I.e., Blueprint/Encode and HPCA have some of the same or very similar labels – did you look at the agreement between those when applied to the same data set?

Dvir Aran (12:20:01): > Not systematically. HPCA is not a good reference. I always prefer the Blueprint/Encode reference. I only use HPCA where Blueprint is not deep enough, for example to distinguish CD16+ and CD16- monocytes.

Jared Andrews (12:23:31): > They generally agree with a few exceptions - cytotoxic CD8+ T cells/NK cells for example. I agree that Blueprint/Encode is generally better.

Friederike Dündar (12:51:46): > why do you feel that HPCA is inferior?

Jared Andrews (13:18:27): > I guess I really shouldn’t say one is better/worse - Blueprint/Encode was just a better fit for me, as it had some important subtypes that HPCA didn’t. And HPCA seemed to do a poor-er job with differentiating between those subtypes. But as Dvir said, it could have just as easily been the opposite, say, if I was interested in monocytes. > > One of the benefits of SingleR is easily being able to run against multiple reference sets to help discern these discrepancies though.
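The agreement check asked about above can be sketched with a plain cross-tabulation. Everything below is hypothetical: the per-cell labels are made up, and the `shared` mapping between the two label vocabularies is an assumption that would have to be curated by hand for real references.

```r
# Hypothetical per-cell labels from two references on the same six cells.
labels_bpe  <- c("B-cells", "CD8+ T-cells", "NK cells",
                 "Monocytes", "Monocytes", "CD8+ T-cells")
labels_hpca <- c("B_cell", "T_cells", "T_cells",
                 "Monocyte", "Monocyte", "NK_cell")

# Hypothetical hand-curated mapping onto a shared vocabulary.
shared <- c("B-cells" = "B", "B_cell" = "B",
            "CD8+ T-cells" = "T", "T_cells" = "T",
            "NK cells" = "NK", "NK_cell" = "NK",
            "Monocytes" = "Mono", "Monocyte" = "Mono")

a <- unname(shared[labels_bpe])
b <- unname(shared[labels_hpca])

# Rows are the Blueprint/Encode calls, columns are the HPCA calls.
confusion <- table(BlueprintEncode = a, HPCA = b)
print(confusion)

# Fraction of cells on which the two references agree.
agreement <- mean(a == b)
print(agreement)
```

Off-diagonal cells in the table point at the label pairs worth a closer look, such as the cytotoxic T cell versus NK cell exception mentioned above.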

Friederike Dündar (13:21:29): > so, I have an example where Blueprint sees tons of adipocytes while HPCA thinks those are neurons

Friederike Dündar (13:21:38): > they do agree on some neurons

Friederike Dündar (13:22:14): > is there a way to find out which genes are probably contributing the most to the final assignment?

Dan Bunis (13:27:41): > hmmmm I don’t think there’s anything built in for that. Maybe you could pull the differences between ref$labels from the trainSingleR output?

Dan Bunis (13:32:02): > Actually, it looks like there aren’t many neuronal samples in the blueprint/encode reference. The only cell type I recognize as neuronal in blueprint.encode$labels.fine is astrocytes. And that’s more likely the cause.

Jared Andrews (13:33:56): > Yeah, I am trying to find more broad/varied reference sets, so if anyone stumbles upon any, let me know.

Jared Andrews (13:35:28): > GTEx is on my list.

Friederike Dündar (13:35:37): > there’s “Neurons” in both

Dan Bunis (13:38:44) (in thread): > Should we say 3pm ET?

Friederike Dündar (13:39:17) (in thread): > I announced 4pm in the main slack, but I’m fine with either one

Dan Bunis (13:39:41) (in thread): > oh! I missed that. 4pm works!

Jared Andrews (13:39:49) (in thread): > I have to help someone with flow at 2pm ET, so I’d prefer 4.

Friederike Dündar (13:39:57) (in thread): > :+1:

Aaron Lun (14:07:13) (in thread): > 4pm ET should be 1pm here, so I’ll pop in once I’m free.

Tim Triche (14:23:57) (in thread): > yes

Jared Andrews (14:43:09) (in thread): > Actually, nevermind, doesn’t make sense to use tissue-level data.

Dvir Aran (15:53:06): > The best way to resolve such issues is to look at the scores heatmap. If many different cell types have similar scores, it means that SingleR is not really able to discern the correct cell type, which probably means this cell type is not in the reference.
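A numeric companion to eyeballing the heatmap is the gap between the best and second-best score for each cell. The sketch below uses a made-up score matrix, a hypothetical helper name (`top_two_gap`), and an arbitrary 0.05 cutoff; none of these come from SingleR itself.

```r
# Hypothetical score matrix: rows are cells, columns are reference labels.
scores <- rbind(
    cell1 = c(Neurons = 0.62, Adipocytes = 0.31, Astrocytes = 0.28),
    cell2 = c(Neurons = 0.41, Adipocytes = 0.40, Astrocytes = 0.39)
)

# Gap between the best and second-best score for each cell.
top_two_gap <- function(scores) {
    apply(scores, 1, function(x) {
        s <- sort(x, decreasing = TRUE)
        unname(s[1] - s[2])
    })
}

gap <- top_two_gap(scores)
best <- colnames(scores)[max.col(scores, ties.method = "first")]

# The 0.05 cutoff is arbitrary and only for illustration.
summary_df <- data.frame(best = best, gap = gap, ambiguous = gap < 0.05)
print(summary_df)
```

Here both cells get the same top label, but only the first has a clear winner; the second is the "similar scores across many cell types" situation described above.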

Dan Bunis (15:54:06): > Zoom link for our brainstorming session:https://ucsf.zoom.us/j/9367688791

Dvir Aran (15:54:54): > And both references will do very bad for brain tissue.

Dvir Aran (15:55:31): > Regarding a reference - I have created a gigantic reference with >30K samples of sorted cell types

Dvir Aran (15:57:52): > I worked on a pipeline to use that reference with SingleR, but it’s a bit complicated.

Jared Andrews (15:58:26): > I have a list of other references I’m making/plan to make, including some brain sets.

Jared Andrews (16:00:14): > Dvir, is it okay if I throw them in your repo for adding them to ExpHub? I have 3 done, but they are all immune/hematopoietic.

Friederike Dündar (16:00:55): > why do you want to throw them into Dvir’s repo?

Friederike Dündar (16:01:30): > may as well just put it on your own page – anyway, let’s discuss :slightly_smiling_face:

Jared Andrews (16:01:40): > Keep everything in one place? Bioc won’t allow >5 Mb in data in the repo, right?

Jared Andrews (16:01:57): > Anyway, setting up, will be there in a sec.

Dan Bunis (16:02:30) (in thread): > in case this is easier to access as more messages are sent to general #sc-signature: https://ucsf.zoom.us/j/9367688791

Tim Triche (16:51:03): > https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=SRP183515

Friederike Dündar (17:05:57): > @Jared Andrews you could have a GitHub repo (not necessarily BioC; just as a place to download the data from to put it into ExpHub eventually) of your own dedicated to immune cell reference data sets

Jared Andrews (17:06:21): > Very true.

Friederike Dündar (17:08:19): > so, @Dvir Aran what about that massive data set you just mentioned?

Friederike Dündar (17:09:21): > Is it available anywhere so that we could try it out?

Jared Andrews (17:24:23) (in thread): > Unrelated to SingleR, but I thought you might find this paper interesting given what you were talking about: https://www.nature.com/articles/s41467-019-11591-1 - Attachment (Nature Communications): A general approach for detecting expressed mutations in AML cells usin > The advent of single-cell RNA sequencing has revealed significant transcriptional heterogeneity in cancer, but its relationship to genomic heterogeneity remains unclear. Focusing on acute myeloid leukemia samples, the authors describe a general approach for linking mutation-containing cells to their transcriptional phenotypes using single-cell RNA sequencing data.

Tim Triche (17:35:09) (in thread): > yep, packaged up that one on Sunday afternoon. Tidbit: 3 of those 5 patients are TCGA LAML and have corresponding bulk RNA, WGS, and miRNA.

Dvir Aran (17:52:12): > Not available yet. I’m looking for the right project to use it in. Since I took a ‘sabbatical’ from academia it is stalled now… but if anyone wants to continue that project I’ll be more than happy to discuss.

Tim Triche (18:34:51) (in thread): > actually maybe that would be a good dataset for scRNAseq @Aaron Lun? what do I need to do to toss an arbitrary dataset into it?

Tim Triche (18:35:17) (in thread): > (scRNAseq-the-package)

Jared Andrews (18:36:54) (in thread): > https://bioconductor.org/packages/devel/data/experiment/vignettes/scRNAseq/inst/doc/scRNAseq.html#adding-new-data-sets

Jared Andrews (18:37:27) (in thread): > Seems fairly straightforward.

Jared Andrews (18:38:32) (in thread): > Oh, doesn’t want 10X data though.

Tim Triche (18:41:26) (in thread): > ah well. Bernstein’s seqWell/ONT data might be of greater interest then

Tim Triche (18:42:34) (in thread): > loading the data is easy, merging it is less so. That experiment has some “interesting” additions that aren’t really discussed in the paper (specifically, sorted normal marrows as controls)

Aaron Lun (18:42:46) (in thread): > Re 10X; depends.

Aaron Lun (18:43:01) (in thread): > Usually DropletUtils::read10xCounts() is pretty good with loading stuff.

Aaron Lun (18:43:18) (in thread): > If the format is more… um, imaginative, then it’s a candidate for scRNAseq.

Tim Triche (18:43:47) (in thread): > I just used DropletUtils to load, and then annotated based on mutation calls (FLT3-ITD, since we’re interested in why FLT3-ITD + NUP98-NSD1 is so much worse than FLT3-ITD + NPM1 in kids, and one of the adults is a NUP98-NSD1 vs an NPM1c)

Tim Triche (18:43:58) (in thread): > Bernstein’s data is “more imaginative” for sure.

Tim Triche (18:44:14) (in thread): > Also single cell, also AML, also genotyping, but smartseq2+ONT

Tim Triche (18:45:06) (in thread): > per usual, all of these are marrows or ficoll’ed PBs with a healthy “constellation” of normal cell types alongside the leukemic cells.

Aaron Lun (18:47:15) (in thread): > Happy to take a PR. Might need some discussion about how the relevant interface is arranged, but in principle, should be fine.

Dan Bunis (21:25:06) (in thread): > FYI, I asked @Dvir Aran if he would share his new reference with us for testing purposes with the planned kmeans pseudobulk code. He said that he would, but needs to find a proper way since the files are large.

2019-08-22

Dvir Aran (00:12:56): > http://www.nxn.se/single-cell-studies

Laurent Gatto (05:10:52) (in thread): > Link to the google sheet: https://docs.google.com/spreadsheets/d/1En7-UV0k0laDiIfjFkdn7dggyR7jIk3WH8QgXaMOZF0/edit#gid=0

Laurent Gatto (05:11:20) (in thread): > Link to the pre-print: https://www.biorxiv.org/content/10.1101/742304v1

Friederike Dündar (09:09:58) (in thread): > Can you give a couple more details? I.e. what was your intent when you started it and at what point is it?

Friederike Dündar (09:11:09) (in thread): > ideally, we could “just” re-run the code he used to generate the data set… :innocent:

Jared Andrews (10:10:08) (in thread): > Also, my PI gave me the green light to go ahead with contributing to the paper. :+1:

Dvir Aran (11:52:03) (in thread): > The intent was to create a generic way of identifying all known cell types and, as a corollary, new cell types. I’ll set a time early next week to present the work I’ve done to you.

Friederike Dündar (11:53:01) (in thread): > that’d be great!

Aaron Lun (12:57:04): > https://github.com/LTLA/SingleR/pull/25

Aaron Lun (13:05:33) (in thread): > > As a particular example, it is demonstrated that the number of cell types identified in single cell RNA sequencing studies is directly proportional to the number of cells analyzed

Aaron Lun (13:05:36) (in thread): > That’s pretty funny.

Jared Andrews (13:54:26) (in thread): > Related, but might be a stupid question: it’s fairly easy to create reference datasets from pure bulk populations with lots of samples that are pretty large. Like over the 100 MB github limit (and git repos are capped at 1 GB). While I assume this isn’t an issue for ExpHub storage, nobody likes waiting for downloads. > > Would the preferred way to limit such files be to use this function or is just taking the median by label to collapse the sample number valid?

Aaron Lun (13:57:24) (in thread): > Not sure what you mean. It seems like everyone would have to download the reference regardless to use this function in the first place.

Friederike Dündar (14:15:43) (in thread): > I didn’t think we were going to have to use it for bulk

Jared Andrews (14:16:19) (in thread): > Yeah, ignore me, it becomes a moot point once they’re on ExpHub.

Friederike Dündar (14:16:33) (in thread): > how important are the replicate numbers for the DE determination? I.e., does it make sense to have more than, say, 10 replicates per bulk-derived label?

Aaron Lun (14:16:57) (in thread): > Internally? SingleR doesn’t even care, it just takes the median.

Jared Andrews (14:17:05) (in thread): > That was my real question.

Aaron Lun (14:17:11) (in thread): > Externally, it would be good to have the replicates.

Friederike Dündar (14:17:21) (in thread): > what do you mean?

Aaron Lun (14:17:35) (in thread): > Simply to pick genes that are not highly variable within a label, since highly variable ones are less useful for distinguishing between labels.

Aaron Lun (14:17:57) (in thread): > If you want to do marker detection more carefully, you would do so outside the function and pass the result into genes=.

Aaron Lun (14:18:12) (in thread): > Check out the possibilities in ?trainSingleR.
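The trade-off discussed in this thread (collapsing replicates to a median versus keeping them to see within-label variability) can be illustrated in base R. The matrix, the labels, and the `collapse_by_label` helper are hypothetical; this is a sketch of the idea, not SingleR's internal code.

```r
# Hypothetical log-expression matrix: 2 genes x 6 reference samples,
# with three replicates for each of two labels.
expr <- rbind(
    gene1 = c(1, 2, 3, 7, 8, 9),
    gene2 = c(5, 5, 5, 1, 9, 20)
)
colnames(expr) <- paste0("s", 1:6)
labels <- c("A", "A", "A", "B", "B", "B")

# Apply a per-gene summary (median by default) within each label.
collapse_by_label <- function(expr, labels, FUN = median) {
    groups <- split(seq_along(labels), labels)
    out <- vapply(groups, function(idx) {
        apply(expr[, idx, drop = FALSE], 1, FUN)
    }, numeric(nrow(expr)))
    rownames(out) <- rownames(expr)
    out
}

# One median profile per label (genes x labels).
med <- collapse_by_label(expr, labels)

# Within-label variance, which is only available while replicates are kept.
within_var <- collapse_by_label(expr, labels, var)

print(med)
print(within_var)
```

In this toy case both genes differ in median between A and B, but gene2 is very noisy within B, so it would be the weaker marker. That information is lost once only the medians are stored.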

Friederike Dündar (14:18:19) (in thread): > > pick genes that are not highly variable within a label

Friederike Dündar (14:18:21) (in thread): > that’s a good point

Friederike Dündar (14:18:40) (in thread): > ideally, SingleR could do that internally…?

Jared Andrews (14:19:04) (in thread): > Yeah, I was just trying to think of ways to shrink file size for hosting prior to ExpHub, but I’m not gonna worry about it.

Aaron Lun (14:19:41) (in thread): > @Friederike Dündar Yes, it could, but I’m reluctant to introduce changes like that. Given that the current thing seems to work, we’d have to build up a body of evidence that the new approach works as well.

Friederike Dündar (14:20:27) (in thread): > right

Aaron Lun (14:20:40) (in thread): > Also, many of the references don’t have replicates. So you’d have to bake in two different approaches depending on whether ref is detected to have replicates or not. Gets a bit painful to have to describe multiple implicit behaviors.

Friederike Dündar (14:20:55) (in thread): > also true

Aaron Lun (14:21:23) (in thread): > On the other hand, detecting your own stuff should be fairly easy - the vignette has an example that should be massively shortened with the new scran::getTopMarkers function.

Friederike Dündar (14:21:51) (in thread): > I guess the message for Jared then is to not try to amass 100s of samples representing the same cell type when file size is an issue

Aaron Lun (14:21:52) (in thread): > Wilcox, t-test, binomial; with or without a log-fold change threshold; varying numbers of markers. Lots of options there for easy tuning.

Jared Andrews (14:26:29) (in thread): > Sure, I just don’t want to arbitrarily remove replicates. The set in question has 1500+ samples, but only 15 labels. Hence why I was wondering if just taking the median, since SingleR does that internally anyway, was valid. > > But given Aaron’s points, it doesn’t seem to make sense to do so, presuming the file doesn’t have to be on GitHub for upload to ExpHub (I am still somewhat unclear on that process, but will start a PR copying @Friederike Dündar’s dataset/metadata scripts tonight).

Dan Bunis (14:28:44) (in thread): > Hey guys, turns out @Friederike Dündar was right that there’s a limit to the # of people a private repo can be shared with. So: To make it public but “less public”, I’ve removed SingleR from the name. It’s now WIP-vignettes (for work in progress). Just search for that or dtm2451/WIP-vignettes to find it.

Friederike Dündar (14:28:55) (in thread): > As long as you can put the download/data wrangling process in R code, you don’t have to externally store the data anywhere else

Friederike Dündar (14:29:41) (in thread): > :+1:

Jared Andrews (14:29:44) (in thread): > That’s what I figured, so I will just throw them into a drive or box folder or something.

Friederike Dündar (14:30:44) (in thread): > can you add us as collaborators via the settings? since we agreed to mostly work on separate files anyway, this would obliterate the forking/PR requests

Friederike Dündar (14:31:02) (in thread): > my user name: friedue

Dan Bunis (14:31:11) (in thread): > oh right, yes

Friederike Dündar (14:31:41) (in thread): > where are you downloading it from in the first place?

Jared Andrews (14:33:16) (in thread): > https://dice-database.org/downloads

Jared Andrews (14:34:16) (in thread): > I did most of my mangling it in python though.

Friederike Dündar (14:34:30) (in thread): > the All cell types (Mean TPM)?

Jared Andrews (14:34:49) (in thread): > Nope, that didn’t have replicates, so grabbed each individually and mashed em together.

Friederike Dündar (14:34:57) (in thread): > gotcha

Jared Andrews (14:36:03) (in thread): > Have SummarizedExperiment objects for it and 2 other sets, all I did was remove genes with no reads and transform to log2.
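The two preprocessing steps mentioned here (dropping genes with no reads, then a log2 transform) look roughly like this on a plain matrix. The values are made up, and the pseudocount of 1 is an assumption, since the message does not say how zeros were handled before taking logs.

```r
# Hypothetical TPM-like matrix: 3 genes x 3 samples.
tpm <- rbind(
    geneA = c(0, 0, 0),
    geneB = c(3, 0, 7),
    geneC = c(1, 1, 1)
)
colnames(tpm) <- c("s1", "s2", "s3")

# Step 1: drop genes with no signal in any sample.
keep <- rowSums(tpm) > 0
filtered <- tpm[keep, , drop = FALSE]

# Step 2: log2 transform. The pseudocount of 1 is an assumption made here
# so that remaining zeros map to 0 instead of -Inf.
logged <- log2(filtered + 1)
print(logged)
```

The same matrix could then be wrapped in a SummarizedExperiment with the labels as column metadata; that step is omitted to keep the sketch free of package dependencies.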

Friederike Dündar (14:46:38) (in thread): > this way we can also have a to do list in the issues part of that repo, that’s neat

Dan Bunis (16:52:57) (in thread): > My PI seems onboard too.

Aaron Lun (17:02:38): > @Dan Bunis Spaces?! In your file names!? :nauseated_face: :face_vomiting: :flag-nz:

Dan Bunis (17:06:11): > they’re gone now.

Aedin Culhane (22:01:06): > A curated database reveals trends in single cell transcriptomics | bioRxiv https://www.biorxiv.org/content/10.1101/742304v1

2019-08-23

Tim Triche (08:57:42): > from the above (and per Aaron’s note), > > the number of cell types identified in single cell RNA sequencing studies is directly proportional to the number of cells analyzed.

Aaron Lun (13:49:47): > Does anyone bother to regress out cell cycle here?

Aaron Lun (13:50:05): > Wondering whether I should spend time in overhauling scran’s capabilities.

Aaron Lun (13:50:26): > Currently we’ve only got cyclone, which is only really maintained for historical reasons.

Aaron Lun (13:51:11): > Cell cycle has never really been an issue for me. At best it’s like, “oh, those cells are actively dividing. That’s nice to know.”

Friederike Dündar (14:23:31): > I like to have the info from cyclone about which cells are actively dividing, but usually I don’t regress it out unless there’s a very specific reason

Jared Andrews (14:40:56): > I never regress it out. Never seems to bias things.

Dan Bunis (14:55:41): > I don’t either. Never seems to bias anything for me either.

Friederike Dündar (15:07:54): > but I do miss the cyclone ease-of-use when I have to deal with non-human or non-mouse data

Dvir Aran (21:18:55): > Seurat has some simple functions for cell cycle scores: https://satijalab.org/seurat/v3.0/cell_cycle_vignette.html - Attachment (satijalab.org): Satija Lab > Lab Webpage —

Aaron Lun (22:24:00): > Yes, I was wondering whether I should invest some time in making something quick and simple that works directly off an SCE. But if no one’s going to use it… I wouldn’t bother.

Aaron Lun (22:25:49): > Like what I talk about here: https://support.bioconductor.org/p/122362/#122369

Aaron Lun (22:28:03): > In principle, this should just be a specific application of a competitive signature-based method, so I’m not even sure I would write a separate package for that.
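As a rough illustration of what a competitive signature score means here, the sketch below ranks genes within each cell and averages the ranks of the signature genes, so the signature is judged against everything else expressed in the same cell. This is a toy stand-in, not AUCell's algorithm or anything in scran; the expression values are made up, and the two-gene signature (MKI67, TOP2A) is used only because those are familiar proliferation markers.

```r
# Hypothetical expression matrix: genes x cells.
expr <- rbind(
    MKI67 = c(9, 0, 1),
    TOP2A = c(8, 1, 0),
    ACTB  = c(5, 5, 5),
    GAPDH = c(4, 6, 4),
    CD3E  = c(1, 3, 7)
)
colnames(expr) <- c("cycling", "resting1", "resting2")

# Illustrative signature only.
signature <- c("MKI67", "TOP2A")

# Mean within-cell rank of the signature genes, scaled to (0, 1].
# Higher means the signature sits near the top of that cell's ranking.
signature_score <- function(expr, genes) {
    genes <- intersect(genes, rownames(expr))
    if (length(genes) == 0L) stop("no signature genes found in 'expr'")
    ranks <- apply(expr, 2, rank, ties.method = "average")
    colMeans(ranks[genes, , drop = FALSE]) / nrow(expr)
}

score <- signature_score(expr, signature)
print(score)
```

Because only within-cell ranks are used, the score does not depend on per-cell scaling such as library size, which is part of the appeal of rank-based approaches for this kind of task.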

Aaron Lun (22:32:07): > Maybe I could get the AUCell authors to work on it. But their GitHub repo hardly overflows with commit activity, and I don’t want to be stuck dragging the tire again.

Peter Hickey (22:34:26): > i apply cyclone routinely and use the results sometimes, mostly to block on

Aaron Lun (22:35:14): > Really. Does the blocking do anything useful?

Peter Hickey (22:36:23): > i’ve had some really wonky datasets where it’s at least enabled me to get something out to assuage a collaborator

Peter Hickey (22:37:35): > the current function/workflow works fine for me, so I don’t personally have a priority for something more refined but would appreciate still having it around in the toolbox

2019-08-24

Tim Triche (16:19:13): > Seurat’s objects are piggy (IMHO and IME) so anything that allows me to avoid them, I’ll use. That said, proliferative T cells are different from exhausted or naive T cells, so regressing out cycle seems sort of dumb for those; and long-term stem cells are practically defined by their quiescence, so regressing out cycle for those seems dumb too. If I worked on something that wasn’t blood or in blood, I’d use these functions a lot more. Instead I focus on how to bolt on V(D)J sequences, haplogroups, or what have you.

Jared Andrews (17:45:18) (in thread): > I wrote something basic to slap V(D)J sequences from 10X sets onto the metadata for Seurat objects. It’d likely be pretty easy to adapt to SCEs if you have any interest. Only takes sequences though, haven’t put in the energy to determine how to parse out and carry genes through in a good way.

Tim Triche (20:41:08) (in thread): > heh, I see we have the same angle here

Tim Triche (20:41:24) (in thread): > 10X’s TCRA/TCRB sensibilities are horrible

Tim Triche (20:41:37) (in thread): > but I guess at least you can use them as a doublet finder :wink:

Tim Triche (20:42:52) (in thread): > originally I thought the VDJ results could go into an altExp or whatever, but really, all you need is TCRA/TCRB (maybe allow a secondary if you’re going to try and impute or compute edit distances) and it quickly becomes obvious which epitope targets are over-represented

Tim Triche (20:44:56) (in thread): > mitochondrial variants are a little more fun though

Jared Andrews (21:05:17) (in thread): > I have done the bare minimum with it, so yeah, all I use it for is to look for/track obvious clones over time. Which is pretty easy given that most of my samples have high disease burden.

2019-08-26

Aaron Lun (01:37:04) (in thread): > I was meant to create some standardized S4 classes for repertoire sequencing. But I got sidetracked by this damn book.

Friederike Dündar (08:39:04) (in thread): > I concur. I’d hate to lose the info, if just for the single reason that every.single.biologist wants to know about it (because they’ve read it’s a massive confounder or whatever)

Friederike Dündar (10:10:46) (in thread): > What type of information from V(D)J sequencing cannot be contained within colData?

Aaron Lun (12:45:15) (in thread): > Some care required for 1:zero-many cell:VDJ sequence mappings. Easy enough to handle but requires some work.

Tim Triche (15:14:22) (in thread): > what Aaron said

Tim Triche (15:14:44) (in thread): > IME, if they’re filtered reasonably well, they fit into a few colData columns at most
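The "one cell to zero-or-many sequences" issue can be seen with a small base-R example. The barcodes and CDR3 strings below are made up. The point of the sketch is only that a per-cell list keeps cells with no contigs, and that it can be flattened to one value per cell when a plain colData-style column is enough.

```r
# Hypothetical cells and contig table (one row per V(D)J contig).
cells <- c("cell1", "cell2", "cell3")
contigs <- data.frame(
    barcode = c("cell1", "cell1", "cell3"),
    chain   = c("TRA", "TRB", "TRB"),
    cdr3    = c("CAVRDSNYQLIW", "CASSLGQGNTEAFF", "CASSPTGGYEQYF"),
    stringsAsFactors = FALSE
)

# One list element per cell. Using a factor with all cells as levels keeps
# cells that have zero contigs as empty entries instead of dropping them.
by_cell <- split(contigs$cdr3, factor(contigs$barcode, levels = cells))
n_contigs <- lengths(by_cell)

# Flattened to a single string per cell, the kind of value that fits in an
# ordinary per-cell metadata column.
flat <- vapply(by_cell, paste, character(1), collapse = ";")

print(n_contigs)
print(flat)
```

Flattening is convenient but lossy (the chain information is dropped here), which is the "some care required" part of the mapping.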

Tim Triche (15:15:28) (in thread): > I will get on this later this week if it’s of interest. Not hard to pull in, but different from the publicly available version.

Aaron Lun (23:08:05): > @Jared Andrews might as well add Dan to the vignette author list and complete the set; I think he wrote the vis parts.

Peter Hickey (23:55:39): > thanks for everyone’s work on SingleR. I tried it out today and found it incredibly useful!

Peter Hickey (23:58:01): > a quick visualisation function I wrote and found useful to overlay the scores on a t-SNE/UMAP, based on SingleR::plotScoreHeatmap(): > > plotScoreReducedDim <- function(results, sce, use_dimred = "TSNE", > max.labels = 20, normalize = TRUE, ncol = 5) { > scores <- results$scores > rownames(scores) <- rownames(results) > m <- rowMaxs(scale(t(scores))) > to.keep <- head(order(m, decreasing = TRUE), max.labels) > if (normalize) { > mmax <- rowMaxs(scores) > mmin <- rowMins(scores) > scores <- (scores - mmin) / (mmax - mmin) > scores <- scores ^ 3 > } > scores <- scores[, to.keep, drop = FALSE] > cns <- colnames(scores) > p <- lapply(cns, function(cn) { > plotReducedDim( > sce, > use_dimred = use_dimred, > colour_by = data.frame(Score = scores[, cn])) + > ggtitle(cn) + > scale_fill_viridis_c(limits = force(if(normalize) c(0, 1) else NULL)) + > guides(fill = guide_colourbar(title = "Score")) > }) > cowplot::plot_grid(plotlist = p, ncol = ncol) > } > 

2019-08-27

Dan Bunis (12:42:26): > Seems like a plot of the scores for a set of the labels… kinda like individual rows of the heatmap, but on a tSNE? I like!

Aaron Lun (13:29:47): > Need to wrap up some of our PRs.

Jared Andrews (14:32:53): > Should have mine done by the end of the day, keep getting sidetracked.

Dan Bunis (14:41:25): > I’m wrapping up the plotScoresHeatmap PR now. Just ran BiocCheck to make sure it’s good to go.

Aaron Lun (16:51:10): > minor comments, will merge once done.

Aaron Lun (17:36:31): > Are they selling computers without return keys these days?

Aaron Lun (17:36:41): > Because everyone’s so stingy with their newlines

2019-08-28

Aaron Lun (12:00:04): > The PRs look done?

Jared Andrews (12:26:06): > I don’t plan on editing mine any further unless anyone has complaints. I didn’t list all the “fine” cell types for the sets with like 100+ of them because it was pretty obnoxious looking.

Aaron Lun (12:28:00): > Sure, okey dokey. Will merge once my computer decides to fi

Aaron Lun (12:28:04): > nish updating

Aaron Lun (12:33:54): > okay, am back.

Dan Bunis (12:34:13): > Mine too.

Dan Bunis (12:34:44): > No plans for new changes unless there are complaints or suggestions.

Friederike Dündar (12:35:22): > :tada:

Aaron Lun (12:40:02): > @Jared Andrews The -data.Rmd scripts should really contain the code to get it from the original data source. E.g., for monaco immune, put in the code to pull and transform GSE107011

Aaron Lun (12:40:22): > We didn’t do it for the original SingleR references because that code has long disappeared, but we should at least try to do it for new datasets.

Jared Andrews (12:42:02): > Yeah, I didn’t really realize that till halfway through, and I did most of my wrangling in python. And adding the labels is fairly annoying.

Jared Andrews (12:42:22): > I can go back and redo it, but it will take some time.

Aaron Lun (12:42:25): > You can stick the Python scripts in there and just have the Rmarkdown call the python.

Aaron Lun (12:42:46): > I have python scripts in a few packages, e.g., https://github.com/LTLA/diffHic

Aaron Lun (12:43:14): > Not really intended for routine usage, but that’s okay.

Jared Andrews (12:46:43): > My code is a mess. I will just redo in R once I get a chance and am in the mood to quickly achieve a bad mood. I should have been more organized putting it together.

Aaron Lun (12:47:21): > Okay, good.

Aaron Lun (13:02:27): > @Dan Bunis not sure why I would care about the scores that weren’t assigned to a label and were pruned away. Shouldn’t we just throw the not-assigned-to-current-label points into a single category?

Dan Bunis (14:09:45): > I think that’s a good point. I don’t particularly love the current format… suggestions are welcome.

Dan Bunis (14:09:52): > I can combine those.

Dan Bunis (14:10:15): > Alternatively, I could have it show “All cells” (including the “this label” and “this label - pruned”), + “this label” + “this label - pruned”. Would just require a bit extra manipulation of the dataframe.

Jared Andrews (17:46:32) (in thread): > Any idea what the SourceType for ExpHub should be if it’s just a tab-delimited count/expression matrix? It says to use something from getValidSourceTypes(), but that function doesn’t exist so I’m not quite sure what’s valid.

Aaron Lun (17:47:16) (in thread): > That function floats around in another package. Think the package is called AnnotationHubData. Was quite annoying to have to find it. Might put in an issue about how unfindable it is.

Jared Andrews (17:47:41) (in thread): > Ah, I tried AnnotationHub but not that one. Thanks.

Aaron Lun (20:13:51): > Makes nsees

Aaron Lun (20:13:52): > sense

Dvir Aran (23:54:29): > I received this email- > > I tried the new version, but it seems not to be working. Even copy/paste of the example in the help page doesn’t run well (logNormCounts, for example, doesn’t exist in the available scater package, and updating with the github link also doesn’t work because an error exporting a function from SingleCellExperiment). >
> I tried to run with the mouse databases, but that didn’t work either. >
> Should I go back to the previous version as these kinks are worked out, or am I doing something wrong? Was there any fundamental difference in classification in the new version, or should the results be the same?

Dvir Aran (23:55:05): > ig=ImmGenData()
> ms=MouseRNAseqData() ### the function listed in the help page is MouseBulkData() which apparently doesn’t exist
> pred=SingleR(test=ig,ref=ms,labels=ms$label.main,assay.type.ref='logcounts')
> Error in nn.d^2 : non-numeric argument to binary operator

Aaron Lun (23:56:35): > Sounds like they should be operating on BioC-devel versions of packages.

Aaron Lun (23:57:02): > This will hopefully be a lot less confusing when it actually comes out on BioC, as all BioC packages are kept in sync with each other.

Aaron Lun (23:57:30): > Right now we have the awkward situation where people are pulling it off GitHub and trying to use it with BioC-release packages.

Aaron Lun (23:58:27): > Probably could add a version requirement for BiocNeighbors for the time being.

Aaron Lun (23:58:45): > Anyone want to do that?

2019-08-29

Aaron Lun (00:00:00): > Probably roll it into one of the other PRs.

Jared Andrews (00:01:39): > I assume that is the fellow you had email me about the reference? I helped him get his reference data into shape but forgot to tell him to upgrade to Bioc dev.

Jared Andrews (00:02:56): > And by you, I mean Dvir.

Aaron Lun (00:04:06): > Maybe some instructions on the README would also help for the time being. This is the second query I’ve had on this topic, and I’ve had my share of enjoyment watching people get confused, so we should probably help them out.

Jared Andrews (00:04:29): > My PR to fix the new reference data won’t be done till next week, but I don’t mind throwing that in with it if nobody else gets to it first.

Aaron Lun (00:04:53): > ¯\_(ツ)_/¯

Aaron Lun (00:05:01): > go for it

Dvir Aran (00:06:42): > @Jared Andrews no, this is a different one. I get 3-5 emails a day about SingleR…

Aaron Lun (00:09:15): > Hopefully when it gets onto BioC you can just point them to the Bioc support site, and the number of unique questions should drop off.

Dvir Aran (00:11:04): > Well, already most of the issues go into the github repo - 103 issues to date… the emails are mostly about interpretation.

Aaron Lun (00:11:48): > gee, you’d think with 103 issues, some of them would be more-or-less the same.

Aaron Lun (00:12:15): > IMO, issues are even harder to search for the same topic.

Dvir Aran (00:13:13): > I don’t complain - it’s better to get noticed than not… The most annoying thing in academia is to publish a paper that nobody gives a shit about.

Aaron Lun (00:13:40): > Been there, got the T-shirt.

Dvir Aran (00:14:11): > Yes, many issues are similar. If I remember I refer them.

Dvir Aran (00:17:03): > I had a similar experience with xCell, still getting some emails (I think it’s down to 1-2 a week). I guess this is the good thing in publishing a tool - there is much more interaction with fellow researchers after publication than with a non-tool paper.

Aaron Lun (17:01:02): > @Dan Bunis PR looks good. Add some plots to the vignette and we’re good to go. I guess we should also have a section about the pruning in the vignette as well.

2019-08-30

Aaron Lun (02:15:54): > should probably also get some sleep and stop committing at 1-2 am.

Aaron Lun (02:16:32): > unless you’re waking up at lunchtime, in which case I guess that’s okay.

Aaron Lun (02:16:39): > Man, I wish I could wake up at lunchtime.

Jared Andrews (02:21:38): > Sleep is for graduated people or something

Dan Bunis (02:22:20): > :man-shrugging:I just couldn’t sleep even if I wanted to last night, so I decided to crank it out.

Dan Bunis (17:23:29): > I’ve started adding pruning and the new visualizations associated with assessing the prune cutoffs to the vignette in my PR.

Aaron Lun (17:35:57): > :+1:

Aaron Lun (17:36:45): > No need to get too detailed in the vignette. One thing that I’ve learnt is to move as much stuff into ?<FUNCTION> as possible, because that tends to be the first port of call for anyone who’s got problems.

Rob Amezquita (20:06:06): > Is there any documentation on making your own reference dataset? Eg what’d y’all do with the bulk RNAseq to make it usable

Aaron Lun (20:06:38): > Bulk RNA-seq, just stick it in.

Rob Amezquita (20:06:48): > That’s it?

Aaron Lun (20:06:56): > But you probably want to adjust it by transcript length if you want to compare to UMI counts

Rob Amezquita (20:07:11): > Yeah for sure

Aaron Lun (20:07:12): > just feed us log-RPKMs, I guess.

Rob Amezquita (20:07:38): > Perfect, that’s easier than I expected!
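
For reference, a minimal base-R sketch of what "log-RPKMs" means for a bulk count matrix. The counts and gene lengths below are made up, and the pseudo-count of 1 is an arbitrary choice for illustration; in practice a dedicated function such as edgeR's `rpkm()` would normally be used instead of hand-rolled arithmetic.

```r
# Toy bulk counts: 3 genes x 2 samples (made-up numbers).
counts <- matrix(c(100, 200, 700,
                   300, 300, 400),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)  # in bp, hypothetical

# RPKM = reads per kilobase of transcript per million mapped reads.
lib_size <- colSums(counts)
per_million <- t(t(counts) / lib_size) * 1e6   # scale each sample (column)
rpkm <- per_million / (gene_length / 1000)     # scale each gene (row)

# The pseudo-count keeps the log finite for zero counts.
log_rpkm <- log2(rpkm + 1)
```

The length adjustment is the part that matters when the test data are UMI counts, since those are not expected to scale with transcript length.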

Aaron Lun (20:07:57): > The more interesting question - and something I’ve been thinking about since yesterday - is whether it is better to keep single-cell references as single cells or convert them into pseudo-bulk samples.

Aaron Lun (20:08:41): > In theory, it should be an accuracy/speed trade-off, with pseudo-bulk being faster but single-cell being nominally more accurate (as it preserves the “shape” of the distribution).

Aaron Lun (20:09:12): > In practice, well, I don’t know. I can imagine that the single-cell could do worse if the shape of the test doesn’t match up with the shape of the reference, such that knowing the shape does more harm than good.
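
To make the pseudo-bulk option concrete, here is a minimal base-R sketch that collapses a toy single-cell reference to one summed profile per label. The data and label names are invented, and summing everything in a label is only the simplest possible aggregation, not necessarily what the package does.

```r
set.seed(1)
# Toy single-cell reference: 4 genes x 6 cells, with one label per cell.
sc_counts <- matrix(rpois(24, lambda = 2), nrow = 4,
                    dimnames = list(paste0("g", 1:4), paste0("cell", 1:6)))
labels <- c("A", "A", "A", "B", "B", "B")

# rowsum() sums rows by group, so transpose to put cells in rows, then
# transpose back to get a genes x labels matrix.
pseudo_bulk <- t(rowsum(t(sc_counts), group = labels))
```

With a single column per label the reference is much smaller, which is where the speed gain comes from, but the within-label distribution is no longer available to the classifier.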

Aaron Lun (20:12:01): > Oh - regardless of that, if people are using sparse single-cell references, they REALLY SHOULD be computing the genes and passing them to genes=. This is because most of the interesting genes will have >50% zeroes, and trainSingleR’s default is to compute the log-FC between medians (for historical reasons), which means that almost every gene that might be useful will have a log-FC of zero. Unless you have really strong markers, you just get left with highly expressed genes that might have a non-zero median, e.g., ribosomal genes, histones; not very useful.

Aaron Lun (20:13:39): > Need to think of a better way to protect people here; otherwise people would try out SingleR with their single-cell references and it would seem to suck, just because the default marker detection throws up when the input data is sparse.
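
The median problem is easy to reproduce in base R. This toy example (invented values for a single gene) has a marker detected in 4 of 10 cells of one label and in none of the other; the mean is used only as a contrast and is an assumption here, not the package default.

```r
# Toy log-expression for one marker gene in two labels, 10 cells each.
# The gene is detected in 4/10 cells of label A and 0/10 cells of label B.
label_a <- c(0, 0, 0, 0, 0, 0, 2.5, 3.0, 3.5, 4.0)
label_b <- rep(0, 10)

# Both medians are zero because more than half of each group is zero,
# so the difference of medians cannot see the marker.
lfc_median <- median(label_a) - median(label_b)

# The difference of means still reflects the detected cells.
lfc_mean <- mean(label_a) - mean(label_b)
```

Here `lfc_median` is 0 while `lfc_mean` is 1.3, so a ranking on median log-FC would never select this gene even though it is specific to label A.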

2019-09-02

Friederike Dündar (17:05:27): > What would be a good way of testing it? I.e. how well can we simulate bulk-RNA-seq data from single-cell data?

Friederike Dündar (17:06:05): > And what would you suggest for computing genes? Just the most variable ones + robust expression?

Friederike Dündar (17:07:36) (in thread): > Lol, that’s contrary to my experience, but I guess there are also different types of problems people can run into

Friederike Dündar (17:07:56) (in thread): > 1. The function is throwing an error –> this is the classical case for immediately pulling up the help

Friederike Dündar (17:08:19) (in thread): > 2. The function is returning an object I don’t understand –> again, this is where I’d typically go to the help

Friederike Dündar (17:09:00) (in thread): > 3. The function is returning something without an error, but I have no idea what the values mean –> I’ll probably peek into the help, but I’ll also hope for the vignette to have an illustrated example

Friederike Dündar (17:10:10) (in thread): > I.e. anything that will help me judge the validity of the results I would expect to be detailed in the vignette

Aaron Lun (17:26:11) (in thread): > I only consider the vignette to tell me how different functions should be stitched together. Past the first reading, I rarely go back to it. Any function-specific concerns should go into ?<FUNCTION>.

Aaron Lun (22:49:29): > 1) I assume you’re referring to testing whether single-cell or pseudo-bulk references are better. This should be pretty straightforward - there’s loads of datasets in scRNAseq that contain labels for the same cell types. Should be a simple matter of using one as a test and the other as a reference, seeing how accurate it is, and then repeating after pseudo-bulking the reference.

Aaron Lun (22:50:05): > Mind you, these cell types are super obvious (as they should be, otherwise probably multiple people wouldn’t have agreed on them!) so maybe it’s not the most challenging test of performance.

Aaron Lun (22:50:52): > 2) There’s an example in the vignette. It’s basically pairwise DE between all labels. This is, in principle, the same as what SingleR does already, and we could also make it do it under the hood, but… meh.

2019-09-03

Friederike Dündar (09:59:51): > > There’s an example in the vignette. It’s basically pairwise DE between all labels. This is, in principle, the same as what SingleR does already, > Yeah, I thought SingleR was doing it anyway so what’s the advantage of having the user compute them separately beforehand?

Aaron Lun (11:20:12): > Because SingleR uses medians to do so, which fails if the genes have more than 50% zeroes in both groups (not uncommon). Also it doesn’t consider the variability of expression within groups.

Dvir Aran (23:55:26): > How about take the mean instead? Several papers showed the mean of single-cell is well correlated with bulk

2019-09-04

Tim Triche (10:05:24) (in thread): > hey so, have people compared using (say) TPMs and UMI-computed TPMs for bulk vs. single-cell of the same specimens?

Tim Triche (10:06:42): > has anyone else looked at (e.g.) SCRABBLE or URMS as a means to impute events that clock in below the limit of detection for scRNA (esp. 10X or sci-RNA) measurements?

Tim Triche (10:08:10): > We have been poking at this (particularly using constrained low-dimensional reconstruction), both to try and tighten up composition estimates and to overcome limits of detection for e.g. 3’ focused or ultra-sparse platforms.

Tim Triche (10:09:22): > We have an in-house protocol for total RNAseq alongside nuclear and MT extraction from each cell, but it’s about $8/cell plus sequencing, and while complexity is much better than expected (dirty tricks are involved), it’s still not as cheap as imputing from other people’s matched samples:smile:

Tim Triche (10:10:41): > One of the issues is whether converting UMI counts to TPM (in order to match up sc and bulk) is valid. It seems like it, but I’m never sure whether something that is right “in theory” consistently produces optimal results in practice.

Aaron Lun (11:29:29) (in thread): > Certainly that would be safer. Still not the most informative, as it won’t consider the variance for single-cell data, but better than what we have now.

Jared Andrews (16:20:37): > A potential competitor, though it looks more complicated to use:https://github.com/pcahan1/singleCellNet

Aaron Lun (17:08:58): > I think I’ve tested that one. Random forests, IIRC.

2019-09-05

Aaron Lun (11:48:08): > It is done, SingleR is into BioC-devel.

Aaron Lun (11:48:21): > We should see the first build reports over the next few days or so.

Federico Marini (11:48:30): > :party_parrot:

Federico Marini (11:48:38): > Congrats to you all!

Federico Marini (11:49:27): > and as I said to Aaron in a PM: best experience one can have with a new package,:white_check_mark:Tried it on a dataset where some hints of annotation were there, worked out of the box

Ludwig Geistlinger (13:20:28): > Also from my side: very helpful package! Applied it to ovarian cancer single-cell data of ours, and contrasted it with marker gene expression-based annotation. Strong agreements + a lot of insights. The score heatmap is great to get an impression of how confident the calls are, and where there is potential to confuse cell types. I think the resulting tuned.scores containing the best and next-best scores could be turned into an overall confidence/margin score per cell by just subtracting the two.

Aaron Lun (13:20:56): > Yes, that’s part of how pruneScores operates.
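
A minimal base-R sketch of the margin idea, computed from a plain cells-by-labels score matrix. The matrix, cell names and label names are made up for illustration and are not SingleR output; this is also not the pruning procedure itself, only the "best minus next-best" subtraction described above.

```r
# Toy score matrix: 3 cells x 4 labels (made-up numbers).
scores <- matrix(c(0.80, 0.30, 0.20, 0.10,
                   0.50, 0.48, 0.10, 0.05,
                   0.20, 0.25, 0.60, 0.55),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("cell", 1:3), c("B", "T", "NK", "Mono")))

# Margin = best score minus next-best score, per cell.
margin <- apply(scores, 1, function(s) {
  top2 <- sort(s, decreasing = TRUE)[1:2]
  unname(top2[1] - top2[2])
})

# Label with the highest score for each cell.
best_label <- colnames(scores)[max.col(scores, ties.method = "first")]
```

In this toy case cell1 has a wide margin (0.5) while cell2 and cell3 are close calls (0.02 and 0.05), which is the kind of cell a confidence filter would flag.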

Rob Amezquita (13:26:10): > seriously, SingleR has changed my life

Rob Amezquita (13:26:31): > SOOOO much better than manual annotation, and now whenever i feel like playing around with clustering settings or what have you i dont have to repeat my annotations

Rob Amezquita (13:26:35): > rockstars all of you!

Friederike Dündar (15:35:46): > maybe we should just collect a couple of testimonials instead of bothering with a publication?:wink:

Unknown User (15:40:27):

Rob Amezquita (16:06:07): > so far ive been using the BlueprintEncodeData() reference mostly, as it has pretty much every major cell type i want, but, im mentoring a student who has a mouse prostate cancer model, so we’re actually gonna try taking a human scRNA-seq dataset that has been manually labeled and using it to classify the mouse prostate cancer

Aaron Lun (16:06:36): > Do read the vignette instructions on single-cell references.

Rob Amezquita (16:13:25): > now, the interesting part - comparing across the various immune references that are in the package will be interesting..much more interesting than anything having to do with manual labeling

Jared Andrews (16:49:09): > They are of varying usefulness depending on what cell types you’re interested in. Looking at them all together typically gives a pretty comprehensive view.

Friederike Dündar (17:10:39): > do you mean, you run the annotation multiple times? or do you somehow combine the different references?

Jared Andrews (18:26:39): > I run multiple times.

2019-09-06

Friederike Dündar (08:59:01): > have you used Aaron’s matchReferences() in that context?

Jared Andrews (10:00:34): > Nope. I haven’t gotten to play around with that or the score pruning much.

Dvir Aran (12:31:07): > This could be a great resource for reference datasets:http://www.proteinatlas.org/about/download, specifically the blood cell sample gene data sets

Jared Andrews (15:23:02): > We have pretty good blood/immune cell coverage, imo, but I suppose more is never a bad thing. Prepping the reference sets made me want to off myself though, so I’m done adding more for a while at least.

Aaron Lun (15:53:08): > Now you know how the scRNAseq package made me feel.

Jared Andrews (16:13:24): > Yeah, I don’t know how you put up with doing that many.

Martin Morgan (16:30:25): > https://bioconductor.org/packages/SingleR - Attachment (Bioconductor): SingleR (development version) > Performs unbiased cell type recognition from single-cell RNA sequencing data, by leveraging reference transcriptomic datasets of pure cell types to infer the cell of origin of each single cell independently.

Aaron Lun (17:04:02): > Quick, everyone download it to bump up our rank.

Kevin Rue-Albrecht (17:41:25): > Bonus point: Don’t forget the VPN for the unique IP downloads

Dvir Aran (18:10:15): > I’ll tweet about it, and Atul will retweet, so we will get that bump :)

Aaron Lun (19:24:06): > Okay, now that we’re in, the only remaining items on the package development side are:
> - More feedback about the pruning. Good? Bad? Indifferent?
> - Some feedback on the aggregation functions. This probably requires actual formal testing, possibly a section for the workflow.

Aaron Lun (19:24:47): > Meanwhile, I’ll get my botnet running to download SingleR

Tim Triche (20:18:56): > Is it bad that we get different labels between old-SingleR and new-SingleR when using the same reference markers?

Aaron Lun (20:19:09): > how different?

Aaron Lun (20:19:46): > there are some small changes in the calculations that affect things. But I would be surprised if they had a big effect.

2019-09-09

Aaron Lun (11:59:06): > 0.99.11 will begin pulling the references from ExperimentHub. This should not result in any major changes, except for a small correction for “neutrophil(s)” in HPCA data.

Aaron Lun (11:59:51): > This includes the original references converted by@Friederike Dündar, and a few new ones added by@Jared Andrews. Lots of immune/haem stuff.

Jared Andrews (12:00:00): > Did we ever figure out the deal with BiocParallel? I get a slowdown every time I try to set BPPARAM to MulticoreParam when testing with small sets.

Aaron Lun (12:00:19): > Yes, that’s because the set-up cost > parallelization benefit.

Jared Andrews (12:00:27): > Ah, was gonna ask if there’s much overhead.

Aaron Lun (12:00:41): > In the future, one would use openMP to parallelize the C++ for loop, but… meh.

Jared Andrews (12:01:02): > So for larger sets, it should still provide a benefit?

Aaron Lun (12:01:23): > I would hope so. Your 1 hour thing should be cut down considerably.

Jared Andrews (12:01:48): > :+1:

Dvir Aran (12:11:17): > We are still on the main page of bioC! How often does the ranking get updated? - File (JPEG): Image from iOS

Aaron Lun (12:12:10): > Depends on how many people mention BioC in tweets, I suppose.

Kevin Rue-Albrecht (16:20:56): > Spreading the Twitter love … > > An important note for everyone on here discussing of all the flashy new cell type prediction tools today: > > a general SVM still outperforms literally all of them in terms of both speed and accuracy @ahmedElkoussy > > “A comparison of automatic cell identification methods for single-cell RNA-sequencing data” > https://www.biorxiv.org/content/10.1101/644435v1

Aaron Lun (16:21:23): > I saw that paper and I would dispute their conclusions.

Aaron Lun (16:21:54): > Of all the standard ML methods I tried, the SVM was the most recalcitrant of the lot.

Aaron Lun (16:22:17): > Their feature selection strategy also leaves much to be desired.

Kevin Rue-Albrecht (16:23:42): > Well, aside from that, I know tweets are (generally) all about catching attention, but in this case I would dispute the “haters gonna hate” tone itself of their “flashy new”.

Aaron Lun (16:24:11): > Don’t understand your comment.

Kevin Rue-Albrecht (16:24:59): > Just pointing out that their tweet is unnecessarily harsh against the effort put in what they call “flashy new tools”

Aaron Lun (16:26:24): > Y’know, funnily enough I had the same attitude. And perhaps they might be right from a theoretical perspective. However, an average joe won’t be able to get a random forest up and running, so whatever you do, you’ll have to throw a coat of paint on it to make it palatable to a user. Which puts it into the category of “flashy new”.

Dvir Aran (18:18:34): > This tweet comes in regards to the two papers out today in Nature Methods on scRNA-seq annotation - Garnett and CellAssign. Both methods require a set of marker genes, and they show really elementary annotations, very similar to what you will get by just visualizing those markers.

Dvir Aran (18:21:05): > Haven’t read thoroughly, but have seen those papers on biorxiv - really surprised that it got in to nature methods, since I don’t understand what makes these methods better than previous methods. I guess the senior author names helped with that…

Aaron Lun (19:14:02): > If anyone is seeing weird behavior from SingleR - make sure matrixStats is the latest version, and then forcibly reinstall DelayedMatrixStats.

Aaron Lun (19:14:37): > This is because the former changed the order of arguments in ties.method, but the change does not register with the latter (which is used by SingleR) in the precomputed S4 method tables.

Aaron Lun (19:16:19): > A DMS version bump has been made to prompt a reinstallation on user machines, but until then, you may get poor behavior as SingleR will be computing the max (or min, can’t remember) rank instead of the average. This results in incorrect Spearman correlation calculations and all sorts of wrongness down the line.
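
Why the tie-handling matters can be shown in base R alone. Spearman's correlation is Pearson's correlation on ranks, and with tied values (common in sparse expression data) the result depends on how the ties are ranked. The vectors below are toy values; this only illustrates the effect and does not reproduce the matrixStats/DelayedMatrixStats code path.

```r
# Two toy expression vectors with ties, as is typical for sparse counts.
x <- c(0, 0, 0, 1, 2)
y <- c(0, 1, 0, 0, 3)

spearman_from_ranks <- function(x, y, ties.method) {
  # Spearman's rho is Pearson's correlation computed on the ranks.
  cor(rank(x, ties.method = ties.method), rank(y, ties.method = ties.method))
}

rho_average <- spearman_from_ranks(x, y, "average")  # matches cor(method = "spearman")
rho_max     <- spearman_from_ranks(x, y, "max")      # ties all get the highest rank
```

For these vectors the average-rank version gives 0.5 and the max-rank version gives 0.6875, so a silent switch in tie handling changes every correlation computed on tied data.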

Tim Triche (19:25:10): > that may well be the issue, thanks for catching this

Aaron Lun (19:27:25): > Not 100% sure in your case because your message above was on the 6th and matrixStats updated on the 7th according to the CRAN landing page (but perhaps source came out a few days earlier than the binaries).

Aaron Lun (19:43:57): > Henrik’s GH history suggests that he made the version bump on the 5th, so maybe it does fit in.

2019-09-10

Tim Triche (09:49:30): > Kin will find out

Friederike Dündar (11:35:52): > Same data annotated with reference scRNA-seq data set (“nowa”) using aggregation function (default settings) vs. using the 10-pairwise-marker-gene-strategy as recommended in the vignette
>
> > table(pred_nowa_markers$labels) %>% sort
>
>      EN-PFC2     EN-V1-1     EN-V1-2    IPC-nEN1    MGE-IPC2     MGE-RG1
>            1           1           1           1           1           2
>         nIN1        nIN3        nIN5    IPC-nEN2        nIN2     EN-V1-3
>            2           2          11          13          30          37
>    Microglia      IN-STR  nEN-early1    IPC-div2        nIN4    IPC-div1
>           38          40          63          84         216         319
>      MGE-div     MGE-RG2        Glyc    MGE-IPC3       Mural    IPC-nEN3
>          392         510         525         623         904        1161
>  Endothelial    MGE-IPC1    RG-early     Choroid
>         1702        2510        4176       12970
>
> > table(pred_nowa_aggregated$labels) %>% sort
>
>  IN-CTX-CGE1        nIN3         OPC   Astrocyte    IPC-nEN1    MGE-IPC2
>            1           1           1           2           2           2
>         nIN2     EN-PFC3      IN-STR        nIN4    IPC-nEN2     EN-PFC2
>            2           3           3           4           5           6
>          vRG     EN-V1-1     EN-V1-3     MGE-RG1    IPC-div1     EN-V1-2
>            6          13          13          39          44          57
>          tRG   Microglia  nEN-early1     RG-div2    nEN-late    IPC-div2
>           86          96          97         103         205         253
>     MGE-IPC3     MGE-RG2     MGE-div        Glyc       Mural    IPC-nEN3
>          323         450         851         993        1311        1452
>  Endothelial     Choroid    MGE-IPC1    RG-early
>         1670        1882        3227       13132

Rob Amezquita (11:37:47): > can you send a table(pred_nowa_aggregated$labels, pred_nowa_markers$labels) to compare?
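
A small base-R sketch of what such a two-way table shows, using six invented cells and label names borrowed from the tables above (the assignments themselves are made up). Naming the two arguments of `table()` labels the rows and columns, which makes the output easier to read.

```r
# Toy label vectors for the same six cells from two annotation runs.
labels_aggregated <- c("RG-early", "RG-early", "Mural", "Choroid", "RG-early", "Mural")
labels_markers    <- c("Choroid",  "RG-early", "Choroid", "Choroid", "Choroid", "Mural")

# Rows: aggregated run; columns: marker-based run.
confusion <- table(aggregated = labels_aggregated, markers = labels_markers)

# Fraction of cells given the same label by both runs.
agreement <- mean(labels_aggregated == labels_markers)
```

Off-diagonal cells such as `confusion["RG-early", "Choroid"]` count the cells that switch label between the two runs, which is exactly the group worth inspecting further.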

Aaron Lun (11:41:37): > whoah choroid goes nuts.

Friederike Dündar (11:42:08): > yes

Friederike Dündar (11:46:52): > table(pred_nowa_aggregated$labels, pred_nowa_markers$labels) - File (Plain Text): tables.txt

Aaron Lun (11:47:50): > I should probably throw in pairwiseWilcox and lfc=1 in the vignette.

Friederike Dündar (11:48:27): > good point

Aaron Lun (11:52:45): > Do you have any idea on which one is “right”? Both approaches have failure modes consistent with these observations.

Aaron Lun (11:53:22): > pairwise markers -> not enough or not the right markers selected, resulting in failure to distinguish populations

Aaron Lun (11:53:50): > aggregation + default markers -> too many markers and noise, resulting in a distribution over more labels

Aaron Lun (11:55:48): > It would be instructive to look at the choroid/endothelial “markers” to see if they’re actually doing their job.

Aaron Lun (11:56:16): > genes$Choroid$Endothelial and genes$Endothelial$Choroid should give you markers in both directions.
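
The nested indexing above implies a list of lists of marker names, one entry per ordered pair of labels. A minimal base-R sketch of building such a structure from per-label mean expression follows; the matrix values, the placeholder gene names and `n` are invented, and ranking by a simple difference of means is an assumption standing in for a proper pairwise DE test.

```r
# Toy per-label mean log-expression: 5 genes x 3 labels (made-up numbers).
means <- matrix(c(5, 1, 1,
                  1, 4, 1,
                  1, 1, 6,
                  3, 3, 0,
                  2, 2, 2),
                nrow = 5, byrow = TRUE,
                dimnames = list(c("TTR", "g2", "g3", "g4", "g5"),
                                c("Choroid", "Endothelial", "Mural")))
n <- 2  # markers kept per pairwise comparison (hypothetical)

labs <- colnames(means)
genes <- lapply(setNames(labs, labs), function(first) {
  others <- setdiff(labs, first)
  lapply(setNames(others, others), function(second) {
    # Keep the top n genes that are higher in `first` than in `second`.
    lfc <- means[, first] - means[, second]
    up <- sort(lfc[lfc > 0], decreasing = TRUE)
    names(head(up, n))
  })
})

genes$Choroid$Endothelial  # genes up in Choroid relative to Endothelial
genes$Endothelial$Choroid  # genes up in Endothelial relative to Choroid
```

Because each direction is stored separately, the two lookups generally return different genes, which is what makes them useful for checking a specific pair of labels.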

Friederike Dündar (12:00:19): > the biologist hopes that the one based on aggregation is right

Friederike Dündar (12:01:26): > and based on other markers, I also doubt that choroid should be as dominant as it is implied by the marker gene approach

Friederike Dündar (12:01:43): > I’m re-running it now with lfc=1, let’s see if that helps already

Aaron Lun (12:02:36): > Is this aggregateReferences?

Friederike Dündar (12:04:33): > yes

Friederike Dündar (12:04:50): > results are similar to a previous run where I just used sum of counts

Aaron Lun (12:04:53): > Right, so it’s doing the k-means thing. At least that doesn’t do anything crazy.

Aaron Lun (12:05:19): > So the methodological difference is between the top-10 markers vs the default (which is something like 500 markers per pairwise comparison).

Friederike Dündar (12:06:01): > yes

Aaron Lun (12:06:31): > There’s one way to check the self-consistency of the marker approach. Let’s pick the cells that change from mural->choroid in the single-cell reference, which seems fairly drastic.

Aaron Lun (12:08:16): > The question is - do these actually upregulate the identified mural vs choroid markers that we identified with pairwise*?

Aaron Lun (12:09:19): > If yes, then the problem lies within SingleR(). If not, and we’re sure that these are really mural cells, then the problem lies in the marker detection.

Friederike Dündar (12:12:29): > I’ll check out the RG-early <–> choroid candidates, it’s easier for me to judge the “validity” of individual markers

Friederike Dündar (12:12:43): > mural and choroid are both notoriously under-characterized by the biologists

Aaron Lun (12:12:56): > Okay, I don’t even know what these are.

Friederike Dündar (12:13:02): > neither do they:smile:

Aaron Lun (12:13:02): > Just picked them based on the alphabetical order.

Aaron Lun (12:13:09): > C <-> M. Seems pretty far away to me.

Aaron Lun (12:13:27): > Astrocytes and B cells, now those are closely related.

Friederike Dündar (12:14:20): > absolutely

Friederike Dündar (12:16:11): > except that the meanings of choroid and mural are fairly similar (layer, membrane vs. wall)

Friederike Dündar (12:16:32): > anyway, I’ll report back

Aaron Lun (12:21:09): > :+1:

Aaron Lun (12:21:32): > Important to make sure our vignette isn’t talking shit

Aaron Lun (12:22:09): > I’ve only tested the marker approach on well-separated cell types so it’s not surprising that it works.

Friederike Dündar (14:41:58): > right.

Friederike Dündar (14:42:04): > lfc=1 doesn’t change the outcome

Friederike Dündar (14:42:35): > (choroid still dominating)

Aaron Lun (14:43:17): > Did you have a look at the identified markers?

Friederike Dündar (14:43:28): > the data I have is definitely more challenging than, say, pancreas

Friederike Dündar (14:43:37): > I haven’t dug deeper yet, no

Aaron Lun (14:45:02): > kay, looking forward to seeing what these markers are.

Aaron Lun (14:45:29): > By the sounds of it, they must be pretty good if they’re still standing up at lfc=1.

Friederike Dündar (15:07:21): - File (PNG): image.png

Friederike Dündar (15:08:32): > some are fairly sparsely expressed though

Friederike Dündar (15:28:01): - File (PNG): image.png

Friederike Dündar (15:28:46): > “switched” = choroid with markers, RG with aggregated values

Aaron Lun (15:29:40): > Hm. Not very conclusive either way, it must be said.

Friederike Dündar (15:29:56): > yep

Aaron Lun (15:30:04): > The most obvious cause is that:
> 1. the test dataset is not much like the reference
> 2. this causes the top 10 markers to not be very useful

Aaron Lun (15:30:41): > For example, FOXG1 is basically useless for defining RG. I can’t imagine it has a similar expression profile in the reference, otherwise it wouldn’t have been picked in the top 10.

Friederike Dündar (15:33:10): > right

Friederike Dündar (15:33:52): > that’s a good point. Just because a marker is abundantly present in the reference, doesn’t mean it’s equally well picked up in the test data set

Friederike Dündar (15:34:17): > The one that seems to hold up is TTR

Friederike Dündar (15:34:35): > what about TUBB2B though?

Friederike Dündar (15:34:49): > well, I guess in the test data it’s just not a good marker…

Aaron Lun (15:34:55): > What does the same heatmap look like in the reference?

Friederike Dündar (15:46:50): - File (PNG): image.png

Aaron Lun (15:47:09): > Lol.

Aaron Lun (15:47:48): > Look at how good TTR is here.

Friederike Dündar (15:48:05): > yes, it is the only marker the literature can agree on, too:slightly_smiling_face:

Aaron Lun (15:48:06): > It’s a wonder that anything got assigned as choroid in the test.

Friederike Dündar (15:48:32): > it’s supposed to be a rare cell type

Friederike Dündar (15:48:53): > which is why I lean more towards the aggregation-based results here

Aaron Lun (15:50:01): > The actual difference is probably not the aggregation/no-aggregation, it’s the choice of n.

Friederike Dündar (15:50:06): > yeah

Aaron Lun (15:50:49): > Try kicking it out to 50-100.

Friederike Dündar (15:51:01): > I tried it with 100 and it seemed to never finish

Friederike Dündar (15:51:08): > is that possible?

Friederike Dündar (15:51:25): > or was it probably due to something else on our side, e.g. temporary overload of the servers by others

Aaron Lun (15:51:40): > It will be at least about 10x slower.

Friederike Dündar (15:52:01): > but would it be slower than the “normal” SingleR run, without defining the genes?

Aaron Lun (15:52:38): > It won’t be slower than a normal SingleR run if you gave single-cell values as the reference.

Aaron Lun (15:53:00): > Of course, it will be slower than a run using aggregated values.

Aaron Lun (15:53:34): > For testing purposes, you can go in the other direction - feed aggregated values and reduce n.

Aaron Lun (15:53:55): > Ifnis the culprit, you should converge to the single-cell results atn ~ 10.

Aaron Lun (15:54:31): > Either way… it does not inspire confidence in the assignments when almost all of the defining markers in the ref are not expressed in the test.

Friederike Dündar (15:57:08): > does SingleR() have an n parameter?

Aaron Lun (15:57:22): > think it gets passed to trainSingleR. Think it’s called de.n.

Friederike Dündar (16:00:06): > yeah, no passing, will use trainSingleR()

Aaron Lun (16:02:24): > in any case, while n might be the immediate problem, the deeper problem is why the test is so different from the ref.

Friederike Dündar (16:05:37): > but that’s not SingleR’s fault

Aaron Lun (16:06:58): > That’s right. And a discrepancy of this magnitude is very hard to defend against.

Aaron Lun (16:07:36): > If we have a look at the cells that SingleR+aggr called as RG… they barely express any RG markers as defined in the reference.

Friederike Dündar (16:07:36): > let’s put it this way: it’s the best reference set so far

Friederike Dündar (16:07:53): > for this data, in my hands

Friederike Dündar (16:08:27): > I mean, the whole reason I turned to SingleR was that my usual source of annotation (= biologist) was coming up with nil

Friederike Dündar (16:08:44): > probably because his mind expects similar patterns as the reference set I’m now using

Friederike Dündar (16:09:10): > except that SingleR allows me to look at 500 markers of myriad comparisons (there are 20+ labels in the reference)

Friederike Dündar (16:09:27): > which my biologist is not able to hold in his RAM

Friederike Dündar (16:09:44): > his annotation was most likely based on n ~ 3

Friederike Dündar (16:10:34): > the question is: what would be a sensible recommendation in the vignette?

Friederike Dündar (16:10:58): > just randomly select a couple of labels and markers and do the two heatmaps (ref vs. test)?

Friederike Dündar (16:12:10): > > > table(pred_nowa_aggregated_n10$labels) > > Astrocyte Choroid EN-PFC2 EN-PFC3 EN-V1-1 EN-V1-2 > 53 2800 9 17 5 11 > EN-V1-3 Endothelial Glyc IN-CTX-CGE1 IN-CTX-CGE2 IN-CTX-MGE1 > 9 1880 1624 18 2 2 > IN-CTX-MGE2 IN-STR IPC-div1 IPC-div2 IPC-nEN1 IPC-nEN2 > 1 120 205 362 21 27 > IPC-nEN3 MGE-div MGE-IPC1 MGE-IPC2 MGE-IPC3 MGE-RG1 > 795 903 3143 28 344 131 > MGE-RG2 Microglia Mural nEN-early1 nEN-early2 nEN-late > 457 168 2140 188 1 19 > nIN1 nIN2 nIN3 nIN4 nIN5 OPC > 2 34 23 38 51 11 > oRG RG-div1 RG-div2 RG-early tRG vRG > 2 70 310 10166 105 40 >

Friederike Dündar (16:12:19): > so, n alone does not explain it

Friederike Dündar (16:13:12): > > ## just for refreshing memory -- results for 10 markers selected with the scran routine > > table(pred_nowa_markers$labels) > > Astrocyte Choroid EN-PFC2 EN-PFC3 EN-V1-1 EN-V1-2 > 4 11831 1 6 1 5 > EN-V1-3 Endothelial Glyc IN-CTX-CGE1 IN-STR IPC-div1 > 50 2043 1049 2 69 283 > IPC-div2 IPC-nEN1 IPC-nEN2 IPC-nEN3 MGE-div MGE-IPC1 > 223 15 10 835 685 3147 > MGE-IPC2 MGE-IPC3 MGE-RG1 MGE-RG2 Microglia Mural > 6 597 17 663 76 1322 > nEN-early1 nIN1 nIN2 nIN4 nIN5 OPC > 150 3 31 77 4 1 > RG-div1 RG-div2 RG-early tRG vRG > 3 24 3086 15 1 >

Aaron Lun (16:14:07): > WEIRD

Friederike Dündar (16:15:36): > how can I access the markers that trainSingleR() identifies?

Aaron Lun (16:15:50): > should be in $extra

Friederike Dündar (16:16:25): > of trained?

Aaron Lun (16:17:59): > yes.

Friederike Dündar (16:18:10): > > > names(trained) > [1] "common.genes" "original.exprs" "nn.indices" "search" >

Aaron Lun (16:18:23): > oh right, I renamed it to search.

Aaron Lun (16:18:29): > Maybe it’s search$extra

Friederike Dündar (16:25:37): > markers following aggregate –> trainSingleR(..., de.n = 10) for the reference data set - File (PNG): image.png

Aaron Lun (16:27:17): > Does it look like that in the test as well?

Friederike Dündar (16:32:39): > nope

Friederike Dündar (16:32:45): > test data - File (PNG): image.png

Aaron Lun (16:32:57): > Aw geez.

Friederike Dündar (16:33:08): > yes, that’s been my state of mind for months

Friederike Dündar (16:34:56): > but why are the final decisions so fundamentally different?

Aaron Lun (16:36:23): > One could perhaps give a more specific answer, but I would say that, in the absence of any clear signal, the algorithm just wets the bed.

Friederike Dündar (16:36:45): > I don’t understand what that means:smile:

Aaron Lun (16:37:20): > highly technical term

Friederike Dündar (16:37:43): > C++ I assume

Friederike Dündar (16:43:23): > to be fair, the trainSingleR(…, de.n = 10) markers do look slightly less bad than the ones just picked with the scran routine

Friederike Dündar (16:43:39): > scran-based markers supplied via genes= - File (PNG): image.png

Friederike Dündar (16:43:56): > well, no, never mind

Friederike Dündar (16:49:25): > for SingleR’s vignette/paper, we should at least mention it – I agree that it all comes down to assessing how similar/appropriate a reference set is for a given test data set

Friederike Dündar (16:49:46): > if I use the blueprint annotation, all these cells are labeled as adipocytes, btw

Aaron Lun (16:50:58): > Well, it could be worse.

Aaron Lun (16:51:18): > Okay, time for some real talk. Having seen all of this, I don’t have much advice to give. SingleR+aggregation does “better”, but only in a highly superficial sense that doesn’t have any solid basis when we look at the genes involved. It’s true that the SingleR defaults use 500 genes, and one could say that it uses weak signal across many genes to improve performance, but at some point you’d want to look at specific genes (e.g., to validate or stain the population of interest). And the heatmaps are pretty depressing for that.

Friederike Dündar (16:51:53): > > uses weak signal across many genes to improve performance > except that it “works” with n = 10, too

Aaron Lun (16:52:31): > Right, which makes it even more sketchy.

Friederike Dündar (16:52:45): > but what does it tell us about the algorithm?

Aaron Lun (16:53:19): > That GIGO?

Friederike Dündar (16:53:27): > i.e. genes=10_scran_markers returns a very different result from de.n = 10, genes = NULL

Friederike Dündar (16:54:21): > where de.n = 10, genes = NULL ~ de.n = 500, genes = NULL

Aaron Lun (16:55:25): > There are two key differences at this point: the identity of the genes, and whether the use of the single cells as the reference values is different from the use of the aggregated values.

Aaron Lun (16:57:40): > Easy to tell by just using genes=trained$search$extra (or whatever it was called) from the aggregated data, while passing the single cell profiles as ref.

Aaron Lun (16:57:54): > Or vice versa, using the scran markers in genes= while passing the aggregated profiles as ref.

Friederike Dündar (16:58:25): > well, you can see the genes up there

Friederike Dündar (16:58:51): - File (PNG): image.png

Aaron Lun (16:59:26): > That’s a pretty big difference when the signal is so weak in the test.

Friederike Dündar (16:59:50): > > aggn10 scran > TTR 1 1 > TRPM3 1 1 > SERINC5 1 1 > PIFO 1 0 > CA2 1 1 > SULF1 1 0 > EFHC1 1 0 > SLIT2 1 0 > LAMB1 1 0 > MEST 1 1 > PRDX3 0 1 > PPIB 0 1 > CLU 0 1 > PHACTR2 0 1 > TMBIM6 0 1 >
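The membership table above can be rebuilt with base R set operations. This is a minimal sketch, assuming the two marker sets are available as plain character vectors. The vector names `aggn10` and `scran` simply mirror the column headers, and the gene lists are copied from the table rather than extracted from any SingleR object.

```r
# Marker sets copied from the table above (illustrative only).
aggn10 <- c("TTR", "TRPM3", "SERINC5", "PIFO", "CA2",
            "SULF1", "EFHC1", "SLIT2", "LAMB1", "MEST")
scran <- c("TTR", "TRPM3", "SERINC5", "CA2", "MEST",
           "PRDX3", "PPIB", "CLU", "PHACTR2", "TMBIM6")

# One row per gene seen in either set, 1 = gene is in that set.
all.genes <- union(aggn10, scran)
membership <- cbind(
    aggn10 = as.integer(all.genes %in% aggn10),
    scran = as.integer(all.genes %in% scran)
)
rownames(membership) <- all.genes

# Genes chosen by both approaches, and genes unique to each.
shared <- intersect(aggn10, scran)
only.aggn10 <- setdiff(aggn10, scran)
only.scran <- setdiff(scran, aggn10)

print(membership)
print(shared)
```

Half of each 10-gene set is shared here, which is consistent with the remark that the gene identity differs substantially between the two runs.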

Friederike Dündar (17:01:09): > If I supply the scran-defined genes, SingleR never sees the actual values of the reference data set, right?

Aaron Lun (17:01:40): > It will still use the reference values to do the correlation calculations on the genes.

Friederike Dündar (17:01:56): > ah, right

Friederike Dündar (17:09:14): > gotta scoot now, will report back later

Aaron Lun (17:10:48): > :+1:

Aaron Lun (17:47:20): > immgen has a NA row name.

Aaron Lun (19:14:41): > https://github.com/LTLA/SingleR/issues/41

Aaron Lun (19:15:14): > https://github.com/LTLA/SingleR/issues/42

Aaron Lun (19:20:20): > Especially 41. People who use fine labels should let me know if this is causing problems for them, in which case I will deal with it sooner rather than later.

Friederike Dündar (21:19:41): > https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1795-z/tables/1 - Attachment (Genome Biology): A comparison of automatic cell identification methods for single-cell RNA sequencing data > Single-cell transcriptomics is rapidly advancing our understanding of the cellular composition of complex tissues and organisms. A major limitation in most analysis pipelines is the reliance on manual annotations to determine cell identities, which are time-consuming and irreproducible. The exponential growth in the number of cells and samples has prompted the adaptation and development of supervised classification methods for automatic cell identification. Here, we benchmarked 22 classification methods that automatically assign cell identities including single-cell-specific and general-purpose classifiers. The performance of the methods is evaluated using 27 publicly available single-cell RNA sequencing datasets of different sizes, technologies, species, and levels of complexity. We use 2 experimental setups to evaluate the performance of each method for within dataset predictions (intra-dataset) and across datasets (inter-dataset) based on accuracy, percentage of unclassified cells, and computation time. We further evaluate the methods’ sensitivity to the input features, number of cells per population, and their performance across different annotation levels and datasets. We find that most classifiers perform well on a variety of datasets with decreased accuracy for complex datasets with overlapping classes or deep annotations. The general-purpose support vector machine classifier has overall the best performance across the different experiments. We present a comprehensive evaluation of automatic cell identification methods for single-cell RNA sequencing data. All the code used for the evaluation is available on GitHub ( https://github.com/tabdelaal/scRNAseq_Benchmark ). 
Additionally, we provide a Snakemake workflow to facilitate the benchmarking and to support the extension of new methods and new datasets.

Aaron Lun (22:30:44): > These are the SVM guys, right?

Aaron Lun (22:31:00): > Surprised that the paper came out so quickly, it was only a couple of months ago that I heard about the preprint.

Friederike Dündar (22:31:59): > yes; surprised Garnett still made it to Nature Methods after the (lack of) performance shown there…

Friederike Dündar (22:32:30): > anyway, the legacy SingleR wins the prize for longest computation time:party_parrot:

Aaron Lun (22:33:14): > Oh yeah. Gotta get first in something.

Friederike Dündar (22:33:18): > which is a great segue for our paper, i.e. “SingleR is awesome, the only complaint was the computation time, well it’s been taken care of”

2019-09-11

Federico Marini (03:20:44): > > Surprised that the paper came out so quickly, it was only a couple of months ago that I heard about the preprint. > Well Genome Biology’s hounds felt it was a hot topic now:slightly_smiling_face:

Federico Marini (07:55:48): > Although, now that I think of it, this is maybe one of the benchmarking papers, there was a call recently?

Aaron Lun (11:27:54): > Literally every man and his dog has some benchmarking code. I don’t want to write yet another paper.

Jared Andrews (13:10:56): > Wish I had a coding dog.

Federico Marini (13:12:21): > :heavy_plus_sign:1

Dvir Aran (17:04:44) (in thread): > They leave bugs everywhere

2019-09-12

Aaron Lun (20:00:56): > @Friederike Dündar and @Dan Bunis we should think about how to make it easy to make those heatmaps that we were using to check choroid/RG similarity.

Aaron Lun (20:01:14): > Those were pretty convincing w.r.t. something going wrong.

Dan Bunis (20:02:00): > :+1:they were. that’s a good idea.

Aaron Lun (20:02:16): > It doesn’t even have to be a SingleR function - e.g., scater has plotHeatmap. So we just really need to be able to get the marker identities out.

Dan Bunis (20:02:55): > dittoSeq too. whichever we want to use.

Dan Bunis (20:03:35): > I guess we just need a function for outputting the genes used in traindata?

Aaron Lun (20:03:40): > Right, so maybe it’s a fairly simple case of having some demo code that iterates through all marker sets and shows that all assigned labels actually do express the top ~50 markers.

Aaron Lun (20:03:50): > Yes, and this is easy enough if you call trainSingleR explicitly.

Aaron Lun (20:04:05): > Oh. But only with genes="de".

Aaron Lun (20:04:08): > Which is the default anyway.
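The check described above (loop over the marker sets and confirm that cells assigned to each label actually express that label's markers) can be sketched numerically before drawing any heatmaps. The block below is a toy illustration in base R only: the matrices, labels and marker lists are invented, `marker_means()` is a hypothetical helper rather than a SingleR or scater function, and the 0.5 cutoff is an arbitrary assumption.

```r
# Toy log-expression matrices (genes x cells); all values are made up.
markers <- list(A = c("gene1", "gene2"), B = c("gene3", "gene4"))

ref <- rbind(gene1 = c(5, 5, 0, 0), gene2 = c(4, 4, 0, 0),
             gene3 = c(0, 0, 6, 6), gene4 = c(0, 0, 5, 5))
ref.labels <- c("A", "A", "B", "B")

test <- rbind(gene1 = c(4, 5, 0, 0), gene2 = c(5, 4, 0, 0),
              gene3 = c(0, 0, 0, 1), gene4 = c(0, 0, 0, 0))
test.labels <- c("A", "A", "B", "B")  # stand-in for assigned labels

# Hypothetical helper: for each label, average its own markers
# over the cells carrying that label.
marker_means <- function(expr, labels, markers) {
    vapply(names(markers), function(lab) {
        cells <- labels == lab
        mean(expr[markers[[lab]], cells, drop = FALSE])
    }, numeric(1))
}

ref.means <- marker_means(ref, ref.labels, markers)
test.means <- marker_means(test, test.labels, markers)

# Flag labels whose markers are much weaker in the test than in the
# reference (the 0.5 fraction is an arbitrary choice for this sketch).
weak <- names(test.means)[test.means < 0.5 * ref.means]
print(rbind(ref = ref.means, test = test.means))
print(weak)
```

In this toy data, label B is flagged because its markers are nearly absent from the test cells assigned to it, which is the same pattern the choroid/RG heatmaps showed. Labels flagged this way would be the natural candidates for a per-label heatmap, which keeps the number of plots manageable.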

Dan Bunis (20:08:11): > yea perhaps we can just make a wrapper for that?

Dan Bunis (20:08:56): > I can draft up example code for dittoHeatmap sometime soon.

Dan Bunis (20:09:31): > Unless it’s better to use scater::plotHeatmap?

Aaron Lun (20:10:42): > Well, until your thing gets into BioC, we basically have to use scater.

Aaron Lun (20:11:05): > Anyway, I’m not sure that this loop is complex enough to warrant a dedicated wrapper.

Aaron Lun (20:11:18): > I would imagine a three liner that we could throw into the vignette.

Aaron Lun (20:11:34): > The main issue is that the loop will spawn many plots, and this traditionally has not been a good thing to wrap.

Dan Bunis (20:12:15): > > Well, until your thing gets into BiOC, we basically have to use scater. > Next cycle:crossed_fingers:

Dan Bunis (20:13:08): > We might want to leave it as a suggestion for specific labels that a user is uncertain about.

Dan Bunis (20:13:35): > I think you are definitely right and it can otherwise spin quite out of control.

Friederike Dündar (22:37:26): > yeah, I can share the code once I actually get back to look at the “final” results – was swamped today, maybe I’ll get to it tomorrow

Friederike Dündar (22:37:34): > I need to clean it up for my future self’s sake anyway

2019-09-17

Aaron Lun (01:49:45): > SingleR and BiocNeighbors got out of sync; this should be fixed after the rebuild in a few days.

2019-09-18

Dvir Aran (12:42:05): > Hi, can we please add some code to the vignette in how to use SingleR with Seurat objects? I am getting so many emails by angry users that are having troubles with it…

Aaron Lun (12:42:19): > I thought the README had something.

Dvir Aran (12:42:41): > Also a line that will show how to produce a tSNE plot colored by cell types

Dvir Aran (12:43:51): > I see, too complex for those that are sending me those emails… :(

Aaron Lun (12:43:56): > I don’t mind adding a few lines in an eval=FALSE chunk, but I really don’t want any dependency on Seurat, because that will pull in half of CRAN at every check.

Dvir Aran (12:45:07): > Of course no dependency. Just code for non-computational people that learned to follow the Seurat tutorials but can’t do anything beyond that

Aaron Lun (12:50:53): > I’m actually more surprised that these people managed to even install Bioc-devel packages.

Friederike Dündar (12:52:12): > tbf, I’ve been hounding Jared for Seurat pointers, too

Friederike Dündar (12:52:18): > it’s just such a messy object

Jared Andrews (13:18:33): > I can expand on the README code if wanted. And show how to color by cell type for plotting. Or add it to the vignette with eval=FALSE.

Jared Andrews (13:22:12): > Oh whoops, looking at the README, there’s actually a typo there anyway.

Friederike Dündar (13:46:04): > :smile:

Friederike Dündar (13:46:55): > why not put it in the README and point to it in the vignette?

Dan Bunis (13:51:23): > Should be just a few lines, right? > > pred <- SingleR(test = as.SingleCellExperiment(object), …) > object$labels <- pred$labels > TSNEplot("labels", object)

Dvir Aran (13:52:24): > But people that install it from bioC don’t see the README, only the vignette

Kevin Rue-Albrecht (13:54:29): > In #isee we added a FAQ that answers common questions. How about putting a link to your README there? > e.g. https://bioconductor.org/packages/release/bioc/vignettes/iSEE/inst/doc/basic.html#6_faq

Jared Andrews (14:00:36): > Yeah, that’s probably a good idea. Will fix it tonight.

Aaron Lun (14:20:26): > While everyone’s here, I guess we should make a decision on aggregateReferences. Was this helpful or not?

Aaron Lun (14:21:08): > And maybe someone can add those diagnostic heatmaps? Just call plotHeatmap() from scater if that helps.

Aaron Lun (14:21:30): > I need to go put out fires somewhere else, but I can chip in if someone gets the ball rolling.

Dan Bunis (14:23:02): > On aggregateReferences, I haven’t used it so:man-shrugging:

Aaron Lun (14:23:16): > Well, I guess we could just stick it in and hope for the best.

Dan Bunis (14:26:36): > Have you tried aggregateReferences @Jared Andrews?

Jared Andrews (14:27:16): > I also haven’t used it. I might be able to give it a shot with the immune reference sets (in conjunction with matchReferences) if it isn’t crazy time consuming.

Friederike Dündar (14:28:24): > I’ve used it and I haven’t forgotten about the heatmaps etc.

Friederike Dündar (14:29:05): > but I could go into labor any minute, so people around here are a bit antsy in terms of getting their analyses “publication-ready”

Aaron Lun (14:29:11): > geez

Friederike Dündar (14:29:12): > i.e. tending to a couple of fires, too

Aaron Lun (14:29:13): > whoah

Aaron Lun (14:29:21): > good luck.

Friederike Dündar (14:29:39): > I haven’t run into any issues with the function itself, so I vote to put it in for now

Friederike Dündar (14:30:07): > and looking into the troubles I had with the brain annotation is definitely on my post-delivery to do list:slightly_smiling_face:

Aaron Lun (14:30:24): > took me a while to parse “go into labor”

Aaron Lun (14:30:30): > thought that was another technical term.

Dan Bunis (14:30:37): > If you send me some base code I can try and scratch it together

Friederike Dündar (14:30:51): > I was thinking about that, Dan:slightly_smiling_face:

Dan Bunis (14:31:08): > one less thing to worry about when you have a baby occupying your time!

Jared Andrews (14:32:16): > I should also have a bit more time now. Been kinda AWOL prepping for an update I had Monday, but it went badly anyway so ¯\_(ツ)_/¯

Friederike Dündar (14:32:16): > alright, I’ll just send you an Rmd/html with what I see and my code and you can try to emulate it

Friederike Dündar (14:33:04): > the main question for me right now is what do we want to show? > the data that’s been giving me headaches is unpublished brain organoid stuff

Friederike Dündar (14:33:42): > if we want to delve into the problems, we could pull out another brain organoid data set and see if the same issues are true

Friederike Dündar (14:34:13): > if we want to do something more light-weight, we can just use any other old scRNA-seq data set that Aaron’s put into the scRNAseq package

Dan Bunis (14:35:48): > I can try to find another scRNAseq dataset that might be less well annotated.

Dan Bunis (14:36:03): > I’ll plan to generalize your code, and then…

Dan Bunis (14:36:25): > It would probably work to just make a mock version of ill-matched data versus reference

Aaron Lun (14:36:35): > Note that we don’t need everything for the vignette, just demonstrate the use of heatmaps to check markers.

Friederike Dündar (14:37:00): > right, so basically just pulling out the labels and throwing it into a heatmap

Dan Bunis (14:37:19): > Like a T cells dataset, but remove all T cells labels from the reference, then compare Tcell expression to whatever labels the T cells do get or something like that

Dan Bunis (14:37:33): > okay. even simpler

Aaron Lun (14:37:57): > A negative example is fine too but keep in mind we have time limits on the vignette

Dan Bunis (14:38:05): > But if I can mock up something to show a bad match in just a few lines, I will!

Dan Bunis (14:38:18): > :+1:and if it is quick

Jared Andrews (23:10:50): > Are there any other FAQs I should add to the vignette while I’m editing it?

Jared Andrews (23:17:13): > I’m just going to point people to the README.

Aaron Lun (23:31:00): > Just coordinate with Dan on the heatmaps.

2019-09-19

Friederike Dündar (08:53:38): > Maybe another FAQ about “where to find (sc)RNA-seq reference data” –> the scRNAseq package

Federico Marini (08:56:08): > > Maybe another FAQ about “where to find (sc)RNA-seq reference data” –> the scRNAseq package > Hope the doctors did not tell you to use Slack for relaxing, pre-birth:slightly_smiling_face:

Federico Marini (08:56:38): > (Alles gute, otherwise, whenever the whole procedure will start!)

Friederike Dündar (09:21:04): > Well, for what it’s worth, it could still be another two weeks, I guess. It’s mind-boggling to me how little concrete knowledge there is floating around about when a birth is actually going to start

Dvir Aran (11:34:15): > The median length of pregnancy for first spontaneous birth is 40+5, second+ is 40+4

Dvir Aran (11:35:02): > All 3 of ours were 41+

Dvir Aran (11:37:42): > My data is accurate to 2012 (our first) so don’t know if newer studies have different numbers.

Aaron Lun (12:04:05): > Dev forum happening NOW.

Aaron Lun (12:04:33): > Lori talking about data resources and serialization.

Friederike Dündar (12:40:06) (in thread): > Well, average early labor for first deliveries are around 7-8 hrs plus another 8hrs active labor. I had about 4hrs from being admitted to the hospital (with nothing but very, very faint cramps) to holding baby #1 in my arms. Thus, I don’t trust the stats for being particularly good predictors in my case. I figure, it’s safer for me to keep showing up in the office, which is literally in the hospital, than staying home at this point:sweat_smile:

Federico Marini (18:12:02) (in thread): > Well. I had (not me, my girlfriend) twins in the first round

Federico Marini (18:12:17) (in thread): > they came 4 weeks in advance

Federico Marini (18:13:10) (in thread): > anyway: Wish you the “possibly pleasant” version of it:wink:

Friederike Dündar (20:45:41) (in thread): > I don’t think they’ve found the right drug cocktail for that yet:wink:

2019-09-20

Federico Marini (11:27:14): > Just ran into this now: https://www.cell.com/cell-reports/fulltext/S2211-1247(19)30059-2?sf207295678=1

Federico Marini (11:27:26): > don’t know if it can be useful for additional ref data?

Aaron Lun (11:28:08): > oh please, no more immune cells.

Aaron Lun (11:28:21): > We have so many immune references right now.

Federico Marini (11:28:44): > You’re immune to that as of now:slightly_smiling_face:

2019-09-29

Aaron Lun (21:31:47): > Heatmaps added.

Aaron Lun (21:32:35): > Also a small bugfix for some UB.

2019-10-01

Matteo Calgaro (04:40:07): > @Matteo Calgaro has joined the channel

2019-10-07

Tim Triche (16:10:57): > quick silly question: does classifySingleR default to multicore if it finds them?

Aaron Lun (16:13:31): > No, you have to set BPPARAM=MulticoreParam().

Tim Triche (16:13:37): > doh

Aaron Lun (16:14:51): > Overhead is pretty high, though, so for small-ish datasets it’s often not worth it.

Tim Triche (16:15:05): > how small is small

Tim Triche (16:15:11): > 60K cells worthwhile?

Aaron Lun (16:15:37): > ¯\_(ツ)_/¯

Aaron Lun (16:15:48): > I was thinking <10k cells.

Tim Triche (16:16:07): > I’ll let you know in a few minutes:slightly_smiling_face:

Aaron Lun (19:05:25): > I guess it took more than a few minutes, then.

2019-10-08

Tim Triche (13:14:39): > They completed ~ the same time but my kid had to go to soccer

Tim Triche (13:15:04): > I should go back and kick off a serial and parallel job at the same time on a node and find out for sure.
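The side-by-side comparison suggested above can be sketched with BiocParallel alone. This is a rough illustration under stated assumptions: `work()` is a made-up placeholder and not anything SingleR runs, the task count and worker count are arbitrary, and MulticoreParam relies on forking, so it is not expected to give a speed-up on Windows.

```r
library(BiocParallel)

# Placeholder workload; stands in for whatever is done per chunk of cells.
work <- function(i) sum(sqrt(seq_len(1e5))) + i

# Same tasks, once serially and once across two forked workers.
serial.time <- system.time(
    res.serial <- bplapply(1:20, work, BPPARAM = SerialParam())
)
multi.time <- system.time(
    res.multi <- bplapply(1:20, work, BPPARAM = MulticoreParam(2))
)

# Elapsed wall-clock seconds; with cheap tasks the parallel run
# may well be no faster because of the overhead.
timings <- c(serial = serial.time[["elapsed"]],
             multicore = multi.time[["elapsed"]])
print(timings)
```

Checking that both runs return the same results before comparing timings guards against measuring two different computations.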

2019-10-13

Aaron Lun (02:15:24): > It seems that our parallelization strategy has been largely disappointing, and I’m a bit bemused why this is the case.

2019-10-14

Kathy Sivils (15:32:58): > @Kathy Sivils has joined the channel

Aaron Lun (18:27:06): > Next release is aiming to include OpenMP functionality for close-to-the-metal parallelization with minimal overhead. But in the meantime, people should just do the best they can with BiocParallel.

2019-10-18

Aaron Lun (20:36:01): > @Martin Morgan @Tim Triche As promised, we have a MWE of the behavior at https://gist.github.com/LTLA/e3a10e18cd32994bac0281a839ae47a8, using the latest BioC-devel. This is completely attributable to the bpstart and bpstop; removing them causes the 10-fold slowdown to disappear. I vaguely remember previous discussions about this - these commands cause MulticoreParam to send stuff over sockets rather than using a fork? - and I thought that this was fixed, but apparently not. > > I was putting in bpstart and bpstop because I do multiple BiocParallel operations within the same function, and I thought that I could be more efficient (at least for Snow and BatchTools) by avoiding redundant worker setup. However, this approach seems to be heavily penalized when we use a MulticoreParam, for reasons that are not entirely clear to me. Presumably there is some reason for this, but would it be better to make bpstart,MulticoreParam-method a no-op anyway?

2019-10-20

Martin Morgan (05:10:07): > The first thing is to understand the value of having a persistent worker. Consider, in an R session with only BiocParallel > > f = function(i) { requireNamespace("GenomicRanges"); i } > > ## case 1 > p <- MulticoreParam(2) > system.time(res1 <- bplapply(1:2, f, BPPARAM = p)) > system.time(res2 <- bplapply(3:4, f, BPPARAM = p)) > > Each call to bplapply() takes the same amount of time (about 1.5s for me) because each call to bplapply loads GenomicRanges in a new process. On the other hand, with > > ## case 2 > bpstart(p) > system.time(res1 <- bplapply(1:2, f, BPPARAM = p)) > system.time(res2 <- bplapply(3:4, f, BPPARAM = p)) > bpstop(p) > > The first bplapply is expensive, because the process has to load GenomicRanges. But on the second call, the process is being re-used and GenomicRanges is already loaded – the call to requireNamespace() is essentially a no-op. The parent process (our interactive session) does not have GenomicRanges loaded. This is the use case for an explicit call to bpstart(). > > Maybe the next step is to understand the cost of having a persistent worker. In case 1, the indexes 1, 2 and 3, 4 seem like they can just be captured during the fork – they’re defined in the parent process, and hence available to the child process. In case 2, the worker has already been created, so the indexes can’t be captured by the process of forking, the indexes have to be SENT (e.g., serialized) TO THE PERSISTENT WORKER. Prior to v.1.19.2, BiocParallel would always serialize data to the worker; this meant simpler and more consistent code across invocations and across back-ends. > > In v.1.19.2, though, when BiocParallel knows that the worker is not persistent (i.e., when it is not bpisup() at the time something like bplapply() is invoked), it relies on forking to make data available to workers. 
So > > > Q = list(rnorm(1e8), rnorm(1e8)) > > p <- MulticoreParam(2) > > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > > user system elapsed > 0.009 0.016 0.016 > > is fast (Q is available to the worker via forking) but > > > bpstart(p) > > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > > user system elapsed > 0.147 0.383 2.781 > > bpstop(p) > > is slow (Q has to be sent to the worker explicitly). > > In your gist, you compare MulticoreParam() to SerialParam(). There is no data transfer cost necessary, ever, with SerialParam(), so it is always (when the work of FUN is trivial) fast. MulticoreParam() is slow when the worker is persistent because of serialization (as in your gist), but is now fast when forking allows data inheritance (without explicit calls to bpstart()). A useful comparison is with SnowParam(), where serialization is always necessary, so the difference between persistent and non-persistent variations is the cost of starting the workers. > > > p <- SnowParam(2) > > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > > user system elapsed > 3.101 0.312 6.414 > > bpstart(p) > > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > > user system elapsed > 3.048 0.305 5.684 > > bpstop(p) > > All of these examples point to the need to reflect on the amount of work assigned to workers – here the work (length()) is trivial and the cost and complexity of forking / communication make this moot – lapply(Q, length) is much faster. > > Maybe one final thing to mention, which is quite fun and a little experimental, is the SharedObject (https://bioconductor.org/packages/SharedObject) package, which allows objects to be shared across processes on the same physical computer. 
So > > library(SharedObject) > Q = lapply(Q, share) > p <- MulticoreParam(2) > bpstart(p) > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > bpstop(p) > p <- SnowParam(2) > system.time(res1 <- bplapply(Q, length, BPPARAM = p)) > > Are all ‘fast’ because a reference to the shared object is being serialized, rather than object itself. - Attachment (Bioconductor): SharedObject (development version) > This package is developed for facilitating parallel computing in R. It is capable to create an R object in the shared memory space and share the data across multiple R processes. It avoids the overhead of memory dulplication and data transfer, which make sharing big data object across many clusters possible.

Martin Morgan (05:10:26): > Turns out there’s a limit to the size of a slack post, and the above is just there!

Aaron Lun (11:13:47): > Right. So, the question becomes, should I not use bpstart and bpstop inside my functions? Seems like I’m caught in an awkward place if I have to sacrifice efficiency for MulticoreParam in order to improve efficiency for the others, or vice versa.

Aaron Lun (11:24:53): > In the functions I’m talking about, I do have worker re-use within the function, hence why I was using bpstart and bpstop. But for certain jobs, the cost of serialization outweighs the cost of worker setup.

Aaron Lun (11:25:26): > Actually, I’d say that for most jobs, the cost of serialization outweighs that of worker setup.

Aaron Lun (11:26:13): > Perhaps the TransientParam idea isn’t too bad, to make it explicit that the user wants to fork and ignore the cost of set-up.

Martin Morgan (17:38:40): > it wouldn’t be impossible for package developers to write if (!is(bpparam(), "MulticoreParam")) { bpstart(); on.exit(bpstop()) }. I’m not sure whether it’s helpful to expose TransientParam() (it’s defined in the package only) and force even naive users to grapple with these issues…

2019-10-21

Aaron Lun (01:13:05): > That seems like a reinvention of S4 dispatch…

Aaron Lun (01:15:37): > The bpstart/stop/isup code is complicated enough, especially combined with some register code as well. This is boilerplate that I have in all my functions, so I’d rather not have to think about it every time. Normally I would wrap this in a function but the on.exit doesn’t allow me to.

Aaron Lun (01:20:52): > What I don’t understand is: if bpstart’ing a MCParam involves serialization, isn’t this just the same as SnowParam? In which case, if a user wanted serialization, why shouldn’t they just use SnowParam instead, and just have MulticoreParam to be a dedicated forking param that no-op’s on bpstart?

Aaron Lun (01:57:06): > Well. I’m not happy about it, but I added an is(BPPARAM, "MulticoreParam") clause to the bpstart/bpstop block in SingleR::classifySingleR. But if I have to decide whether or not to set up a worker, I would like this decision to be abstracted away (e.g., with bpsharedmemory(BPPARAM)) rather than hard-coding a class name into the function. This would also allow DoparParam to just work rather than just erroring out when it hits bpstart.

Martin Morgan (05:35:08) (in thread): > bpstart(MulticoreParam()) is still (much) faster and more memory efficient than bpstart(SnowParam()). It’s not that the user wants serialization, but rather worker persistence, and that has the implementation cost of serialization. The user has to choose their poison; sometimes, e.g., when data transfer is minimal, the costs of serialization are trivial.

Aaron Lun (14:50:00) (in thread): > I guess the developer (i.e., me) is choosing on behalf of the user when I need to decide whether or not tobpstartor not. I’ll grudgingly agree that this is my choice to make because I know whether or not the cost of setting up workers outweighs the cost of serializing data; but in that case, I would like the entire procedure to be streamlined over my current: > > if (!bpisup(BPPARAM) && !is(BPPARAM, "MulticoreParam")) { > bpstart(BPPARAM) > on.exit(bpstop(BPPARAM)) > } >

Aaron Lun (14:50:39) (in thread): > And as I said before, that doesn’t include the code to deal with register for functions that implicitly use BPPARAM, e.g., in %*%.

2019-10-22

Tim Triche (12:16:40): > oh for heaven’s sake, I’m trying to use geometric sketching on a reasonable (60k cells) dataset and now I’m blowing up BiocSingular: > > R> PCA <- runPCA(logcounts(sce), rank=10) > > ***** caught segfault ***** > address 0x7f3ec5764648, cause 'invalid permissions' > > Traceback: > 1: La.svd(x, nu, nv) > 2: svd(x, nu = nu, nv = nv) > 3: safe_svd(as.matrix(x), nu = nu, nv = nv) > 4: (function (x, k = min(dim(x)), nu = k, nv = k, center = FALSE, scale = FALSE, deferred = FALSE, fold = Inf, BPPARAM = SerialParam()) ... > 5: do.call(FUN, c(list(x = x, k = k, nu = nu, nv = nv, center = center, scale = scale, BPPARAM = BPPARAM, ...), ARGS(BSPARAM))) > 6: (new("standardGeneric", .Data = function (x, k, nu = k, nv = k, center = FALSE, scale = FALSE, BPPARAM = SerialParam(), ..., ... > 7: (new("standardGeneric", .Data = function (x, k, nu = k, nv = k, center = FALSE, scale = FALSE, BPPARAM = SerialParam(), ..., ... > 8: do.call(FUN, c(list(x = x, k = k, nu = nu, nv = nv, center = center, scale = scale, BPPARAM = BPPARAM, ...), ARGS(BSPARAM))) > 9: runSVD(x, k = rank, nu = ifelse(get.pcs, rank, 0), nv = ifelse(get.rotation, rank, 0), center = center, scale = scale, ...) > 10: runSVD(x, k = rank, nu = ifelse(get.pcs, rank, 0), nv = ifelse(get.rotation, rank, 0), center = center, scale = scale, ...) > 11: .local(x, ...) > 12: runPCA(logcounts(sce), rank = 10) > 13: runPCA(logcounts(sce), rank = 10) >

Tim Triche (12:17:20): > that machine has a half terabyte of RAM :disappointed:

Tim Triche (12:17:52): > this dataset (whence I keep trying to generate reprex’es) seems to blow everything to smithereens

Aaron Lun (12:39:46): > Exact SVD won’t be happy.

Aaron Lun (12:40:04): > I’m not sure you want to do that anyway.

2019-10-24

Tim Triche (12:09:45): > Is there a more useful default?

Aaron Lun (13:33:35): > The default was changed in devel.

2019-10-25

Tim Triche (11:56:01): > > tim@tim-ThinkPad-T470:~/bioc-git/BiocSingular$ grep -i version DESCRIPTION > Version: 1.1.7 > > > > R> packageVersion("BiocSingular") > [1] '1.1.7' > > I’ll give it another shot:slightly_smiling_face:

Aaron Lun (11:56:34): > Whoops, the default was changed in the identically named scater::runPCA.

Aaron Lun (11:56:54): > Being an infrastructure package, BiocSingular always defaults to an exact PCA.

Tim Triche (12:00:46): > ugh

Tim Triche (12:00:57): > this is an important detail!

Aaron Lun (12:01:30): > I’m sure ?runPCA would have indicated this.

Tim Triche (12:02:09): > It seems like a risky choice to have the default be “segfault even with 512GB of RAM”

Aaron Lun (12:02:48): > ¯\_(ツ)_/¯

Aaron Lun (12:02:56): > Complain to the LAPACK devs.

Tim Triche (12:03:01): > fair point

Tim Triche (12:03:50): > it appears that the default is documented in runSVD

Tim Triche (12:04:04): > > a <- matrix(rnorm(100000), ncol=20) > > out.exact0 <- runSVD(a, k=4) > str(out.exact0) > > out.exact <- runSVD(a, k=4, BSPARAM=ExactParam()) > str(out.exact) >

Tim Triche (12:04:30): > Perhaps it makes sense to note this in ?BiocSingular::runPCA

Tim Triche (12:04:37): > I can send a PR but you will be faster:slightly_smiling_face:

Tim Triche (12:05:58): > it looks like I should use scater::calculatePCA most of the time.

Tim Triche (12:06:19): > (thanks for walking me through this btw)

Tim Triche (16:56:52): > update: calculatePCA with the top 2000 features did great, and I have fed the result to UMAP and that worked great too. Will put up a plot.ly of the results with cell type annotations when I get a chance, after committing more MTseeker changes.
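
The exact-versus-approximate distinction in this exchange can be illustrated in base R. The following is a minimal sketch of a randomized truncated SVD (random projection plus a few power iterations); it is not BiocSingular's implementation, and the function name `approx_svd`, its arguments and the simulated data are assumptions for illustration only. On a matrix with a few dominant components, it recovers the leading singular values closely while only decomposing a small projected matrix.

```r
# Sketch of a randomized truncated SVD; `approx_svd` is a hypothetical name.
approx_svd <- function(x, k, oversample = 10, n_iter = 2) {
    l <- min(ncol(x), k + oversample)              # width of the random sketch
    omega <- matrix(rnorm(ncol(x) * l), ncol = l)
    q <- qr.Q(qr(x %*% omega))                     # orthonormal basis for the sketch
    for (i in seq_len(n_iter)) {                   # power iterations sharpen the basis
        q <- qr.Q(qr(crossprod(x, q)))
        q <- qr.Q(qr(x %*% q))
    }
    small <- svd(crossprod(q, x), nu = k, nv = k)  # exact SVD of an l x ncol(x) matrix
    list(d = small$d[seq_len(k)], u = q %*% small$u, v = small$v)
}

set.seed(100)
# Simulated data: 2000 observations, 50 features, 4 strong components plus noise.
signal <- matrix(rnorm(2000 * 4), ncol = 4) %*% matrix(rnorm(4 * 50), nrow = 4)
x <- signal + matrix(rnorm(2000 * 50, sd = 0.1), ncol = 50)

exact <- svd(x, nu = 4, nv = 4)    # exact decomposition; all singular values are computed
approx <- approx_svd(x, k = 4)

# Largest relative difference among the four leading singular values.
rel_err <- max(abs(approx$d - exact$d[1:4]) / exact$d[1:4])
print(rel_err)
```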

2019-10-30

Aaron Lun (20:50:23): > <!channel>SingleR is finally in release!

Aaron Lun (20:50:35): > https://bioconductor.org/packages/release/bioc/html/SingleR.html - Attachment (Bioconductor): SingleR > Performs unbiased cell type recognition from single-cell RNA sequencing data, by leveraging reference transcriptomic datasets of pure cell types to infer the cell of origin of each single cell independently.

Aaron Lun (20:51:07): > Pretty good download stats for what had previously been a devel-only package.

Dan Bunis (20:53:04): > I still see the In Bioc: devel only badge. And no download stats :thinking_face:

Aaron Lun (20:53:26): > Those badges take a while to get updated, the important thing is the 1.0.0 version number.

Aaron Lun (20:53:44): > if you click on rank, you’ll get taken to the download stats, even though the rank is “unknown”.

Dan Bunis (20:54:27): > There’s also a link at the bottom of the page which I just found. http://bioconductor.org/packages/stats/bioc/SingleR/

2019-11-01

Dan Bunis (17:30:17): > A new evaluation of Cell Type Deconvolution R Packages was added today to bioRxiv: https://www.biorxiv.org/content/biorxiv/early/2019/11/01/827139.full.pdf > - They use the old SingleR, womp. > - They also use Seurat::FindTransferAnchors & Seurat::TransferData (which transfers labels from an annotated Seurat object to a new one, based on what I assume to be a similar ‘anchors’ determination to what’s behind their scTransform(?)) as a cell type deconvolution tool. > - They call Seurat the best method at annotating major cell types, but SingleR better for distinguishing between highly similar cell types. > > Has anyone else used these Seurat functions in this way?

Jared Andrews (17:31:58): > I toyed with the Seurat integration/anchor methods and found them very heavy handed.

Aaron Lun (17:32:34): > Any batch correction method can be effectively adapted to do this, e.g., correct the test to the training dataset and then see which is the closest training cluster to each test cell.

Aaron Lun (17:38:28): > They were probably using SingleR with the default settings, i.e., median expression for each cluster?

Dan Bunis (17:39:43): > They have code on github… lemme check.

Aaron Lun (17:41:08): > I should also add that one reason why I don’t use batch correction methods to do classification is because each cell’s classification should be independent, and this is not usually the case for batch correction methods.

Dan Bunis (17:41:08): > > CreateSinglerObject(counts = query, cluster = NULL, annot = NULL, project.name = "example singler", > min.genes = 200, technology = "", > species = "", citation = "", ref.list = list(example_ref), > normalize.gene.length = F, variable.genes = "de", fine.tune = T, > reduce.file.size = T, do.signatures = F, do.main.types = F, > temp.dir = TMP_DIR, numCores = SingleR.numCores) >

Aaron Lun (17:42:24): > Guh. That command was one of the first that I deleted, so I don’t know.

Dan Bunis (17:43:09): > I do hope their reviewers point out some of these concerns. Though I haven’t read the whole thing so they may have something like that in their discussion :man-shrugging:

Aaron Lun (17:47:19): > Might be worth pulling down one of their datasets and seeing if the new version + new recommendations do better.

Dan Bunis (17:49:19): > I think the old ‘de’ method uses gene #s set by the user beforehand with the method for picking those genes being generally through another function that you’d removed

Dan Bunis (17:54:39): > > Might be worth pulling down one of their datasets and seeing if the new version + new recommendations do better. > Agreed, cuz it’d be nice to have the newer version benchmarked instead! They also checked runtime and memory usage, both of which should be greatly improved! Anyone have the bandwidth for this? I don’t til at least next week.

Aaron Lun (17:54:58): > I don’t, at all.

Aaron Lun (18:01:31): > Or we could just ask them to try it out.

Jared Andrews (18:01:56): > Yeah, it should take them no time to run.

Aaron Lun (18:08:36): > Can someone do that? Just point out that: > - There’s a faster version on Bioconductor, and > - We have some more focused recommendations for picking marker genes when dealing with a single-cell reference. > And just that we’d be interested in knowing whether this makes things any better.

Dan Bunis (18:12:30): > Sure. I’m already drafting an email. I’ll address both these points.

Aaron Lun (18:13:48): > In the meantime, I guess we should confirm that our more focused recommendations do actually do better than the default setting.

Aaron Lun (18:14:22): > I’m pretty sure they do based on some tests I did in the past with various scRNAseq datasets.

Aaron Lun (18:16:37): > I’ll throw up an example later.

Aaron Lun (22:14:45): > Were they already pseudo-bulking?

Aaron Lun (22:14:57): > @Dan Bunis?

Aaron Lun (22:28:18): > Do you know whether they were doing it using the inbuilt median method or if they were doing it manually?

Dan Bunis (22:47:30): > They were indeed pseudobulking. I didn’t check exactly how, but probably not with SingleR functions… they used the same pseudobulk refs for all the annotation methods built for bulk refs.

Aaron Lun (23:13:11): > With pseudo-bulking and no replicate studies, SingleR basically collapses down to nearest neighbors in rank space.

Aaron Lun (23:13:30): > With k=1, I might add.
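
The "nearest neighbors in rank space with k=1" description can be made concrete with a small base R sketch: with a single pseudo-bulk profile per label, each test cell simply receives the label whose profile has the highest Spearman correlation with it. The data are simulated and the label names are placeholders; marker gene selection and fine-tuning, which SingleR also performs, are left out here.

```r
set.seed(42)
n_genes <- 50
genes <- paste0("gene", seq_len(n_genes))

# One pseudo-bulk profile per label (simulated log-expression values).
ref <- cbind(
    Tcell = c(rnorm(25, mean = 8), rnorm(25, mean = 2)),
    Bcell = c(rnorm(25, mean = 2), rnorm(25, mean = 8))
)
rownames(ref) <- genes

# Two noisy test cells, one derived from each reference profile.
test <- cbind(
    cell1 = ref[, "Tcell"] + rnorm(n_genes, sd = 0.5),
    cell2 = ref[, "Bcell"] + rnorm(n_genes, sd = 0.5)
)

# Spearman correlation is Pearson correlation computed on ranks, so the
# best-scoring profile is the nearest neighbour in rank space (k = 1).
scores <- cor(test, ref, method = "spearman")    # cells x labels
assigned <- colnames(ref)[max.col(scores, ties.method = "first")]
names(assigned) <- colnames(test)
print(assigned)
```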

Dan Bunis (23:13:43): > They used the mean of normalized expression, so no reps I think.

Dan Bunis (23:14:18): > It’s just a 1 or 2 sentence section in the methods

Aaron Lun (23:14:28): > Sure, that’s typical.

Aaron Lun (23:14:55): > I should add that, for major cell types, you can achieve excellent performance by clustering the test dataset and assigning cluster centroids to labels.

Aaron Lun (23:15:12): > And then propagating the label for each cluster to all of its constituent cells.

Aaron Lun (23:16:55): > This kicks the can towards the clustering algorithm in terms of where the main errors would occur, but if the major cell types are pretty distinct, the chance of error is low and you avoid the minor misclassification from rogue cells in each cluster.
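
The cluster-then-annotate approach described above can be sketched in base R as follows. This is an illustration on simulated data, assuming k-means for the clustering step and Spearman correlation against one reference profile per label for the assignment step; neither choice is prescribed by the discussion, and all object names are made up.

```r
set.seed(1)
n_genes <- 40
genes <- paste0("gene", seq_len(n_genes))

# Reference: one profile per major cell type (simulated).
ref <- cbind(
    Tcell = rep(c(8, 2), each = n_genes / 2),
    Bcell = rep(c(2, 8), each = n_genes / 2)
)
rownames(ref) <- genes

# Test dataset: 30 noisy cells of each type.
truth <- rep(c("Tcell", "Bcell"), each = 30)
cells <- ref[, truth] + matrix(rnorm(n_genes * length(truth)), nrow = n_genes)
colnames(cells) <- paste0("cell", seq_along(truth))

# Step 1: cluster the test cells.
clusters <- kmeans(t(cells), centers = 2, nstart = 10)$cluster

# Step 2: compute one centroid per cluster and assign a label to each centroid.
centroids <- sapply(split(seq_len(ncol(cells)), clusters),
    function(i) rowMeans(cells[, i, drop = FALSE]))
scores <- cor(centroids, ref, method = "spearman")    # clusters x labels
cluster_label <- colnames(ref)[max.col(scores, ties.method = "first")]
names(cluster_label) <- rownames(scores)

# Step 3: propagate each cluster's label to all of its constituent cells.
cell_label <- unname(cluster_label[as.character(clusters)])
print(table(cell_label, truth))
```

Any cell placed in the wrong cluster inherits the wrong label, which is the sense in which the errors move to the clustering step.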

Aaron Lun (23:55:11): > But then again, I would say that, if the cell types are pretty distinct, the few misclassified cells don’t really matter.

2019-11-03

Tim Triche (12:24:41): > Hey so I kicked this over to Lana and she’d be happy to chime in, but slack is requesting either a @fredhutch.org, @roswellpark.org, or @bedatadriven.org email

Tim Triche (12:25:07): > How can a person send an invite to this slack? I’m not entirely sure as a non admin

Tim Triche (12:26:03): > Oh never mind. “Invite people” seems to work even for lumpenproletariat like me

Tim Triche (12:27:10): > It won’t let me create an invite link, though.

Aaron Lun (12:34:38): > ¯\_(ツ)_/¯

Tim Triche (12:35:28): > Ok I figured it out. Lana will drop by when she has time

Tim Triche (12:40:17): > This is a useful community. I presented some SingleR and trajectory work on Friday and got quite a lot of usable feedback. The physicists were interested in how quantitative one could make predictions from a dynamical system, i.e., if we perturb cell type A in system X with Y nmol of compound J, how far along the trajectory to cell type B will it move? Is the impulse function linear? Is it reasonable to view stochastic reprogramming as an activation energy type of affair? Does incorporating RNA velocity into the model and/or denoising it help? When and why? Is it better to iteratively optimize cell markers once a framework is tested and provides repeatable results?

Tim Triche (12:41:06): > I found this a pleasant change from the usual “which cells are in the purple cluster” type nonsense

Tim Triche (12:42:30): > (The Osca chapter on mito filtering was great to stir discussion too — yeah we are throwing away some interesting cells that are dying, next time maybe think about that in the isolation and prep if it’s a major goal to keep ’em!)

Tim Triche (12:43:36): > Anyways. The SingleR comments are particularly actionable for Lana since the preprint is still a preprint AFAIK. And obviously a topic of some interest to many :grin:

Aaron Lun (12:49:02): > @Dan Bunis sent an email as well.

Dan Bunis (13:32:50): > I did, I did. My email overviewed the major updates in bioc SingleR, mentioned us devs would be quite interested to see how the new version fares in their benchmarking, and provided replacement code for switching to bioc SingleR within their current pipeline =)

Tim Triche (13:52:06): > Yep

Tim Triche (13:53:04): > Benchmarking these types of pipelines is fugging hard, moreso given that usually the “big” papers are by groups with an ulterior motivation for benchmarketing

Dan Bunis (14:09:53): > Benchmarketing lol:rolling_on_the_floor_laughing::man-facepalming:

2019-11-04

Izaskun Mallona (07:57:51): > @Izaskun Mallona has joined the channel

Jared Andrews (09:33:16): > @Aaron Lun I will write this tonight. https://github.com/LTLA/SingleR/issues/54#issuecomment-549245527

Aaron Lun (11:48:46): > @Jared Andrews Make sure to work off the postrelease branch; this will allow me to easily push the same set of commits to both release and devel and thus make things available to users of the current release. But make sure your changes don’t modify any existing functionality (which they shouldn’t, but just to be clear).

Jared Andrews (11:50:06): > Will do. Won’t touch anything other than a new function and will be sure all tests pass.

Jared Andrews (12:44:34): > In relation to this, was your plan to harmonize labels between the different results sets prior to the final assignment? Or are we content with just leaving it up to the user to unify the final labels (seems the safer approach)?

Aaron Lun (12:45:04): > the latter, yes.

Aaron Lun (12:45:24): > So sometimes you’ll get T cell, CD4 T cell, CD8 T cell (or some obscure subset thereof).

Aaron Lun (12:45:45): > And that’s fine. In fact, that’s sort of nice if it falls back to “It’s a T cell, but I don’t know exactly what subtype, but there you go.”

Jared Andrews (12:47:06): > Great, just wanted to be sure.

Jared Andrews (23:00:30): > Does label pruning make sense post-combining results?

Aaron Lun (23:38:07): > Hm.

Aaron Lun (23:38:44): > I would say that if it got pruned in any given result, then you ignore it when collating across results.

Aaron Lun (23:39:27): > So you’d basically have two output columns, label and pruned.label, where the first set is obtained by collating across the individual label, and the second set is obtained by collating across the non-NA pruned.label.

2019-11-05

Jared Andrews (06:11:53): > Okay, easy enough.

2019-11-06

Jared Andrews (11:55:17) (in thread): > What about cases where there are no non-NA pruned.labels?

Aaron Lun (11:56:54) (in thread): > Then stick in a NA value, if it can’t get assigned to anything.
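
The collation rule discussed in this thread can be sketched in base R. This is not the SingleR implementation: the function name `combine_by_best_score`, the per-reference data frames, and the choice of "highest score wins" as the way of picking a reference for each cell are all assumptions made for illustration. The part taken from the discussion is that `label` is collated across all results, `pruned.label` only across the non-NA pruned labels, and a cell with no non-NA pruned label gets NA.

```r
# Hypothetical stand-in for combining per-reference results. Each element of
# `results` is a data.frame with columns `score`, `label` and `pruned.label`,
# with one row per cell and the cells in the same order.
combine_by_best_score <- function(results) {
    scores <- do.call(cbind, lapply(results, function(r) r$score))
    labels <- do.call(cbind, lapply(results, function(r) r$label))
    pruned <- do.call(cbind, lapply(results, function(r) r$pruned.label))

    # For one cell, take the value from the best-scoring reference, considering
    # only references where the value is not NA; return NA if none is left.
    pick <- function(score_row, value_row) {
        keep <- !is.na(value_row)
        if (!any(keep)) return(NA_character_)
        unname(value_row[keep][which.max(score_row[keep])])
    }

    cells <- seq_len(nrow(scores))
    data.frame(
        label = vapply(cells, function(i) pick(scores[i, ], labels[i, ]), character(1)),
        pruned.label = vapply(cells, function(i) pick(scores[i, ], pruned[i, ]), character(1)),
        stringsAsFactors = FALSE
    )
}

# Made-up results for three cells against two references.
ref_a <- data.frame(score = c(0.9, 0.4, 0.7),
    label = c("T cell", "B cell", "NK cell"),
    pruned.label = c("T cell", NA, NA), stringsAsFactors = FALSE)
ref_b <- data.frame(score = c(0.6, 0.8, 0.5),
    label = c("CD4 T cell", "B cell", "Monocyte"),
    pruned.label = c("CD4 T cell", "B cell", NA), stringsAsFactors = FALSE)

combined <- combine_by_best_score(list(A = ref_a, B = ref_b))
print(combined)
```

In this example the third cell keeps a `label` but gets an NA `pruned.label`, because it was pruned in both references.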

Aaron Lun (15:52:33): > @Jared Andrews @Dan Bunis @Friederike Dündar BTW I answered Pedro’s email but forgot to reply all. Gmail’s interface is pretty bad with that.

2019-11-07

Kevin Blighe (11:24:31): > @Kevin Blighe has joined the channel

2019-11-08

Brendan Innes (11:59:43): > @Brendan Innes has joined the channel

2019-11-09

Aaron Lun (18:46:32): > ugh reviewing the most boring paper in the world

Aaron Lun (18:46:40): > @Jared Andrews you here?

Aaron Lun (18:47:29): > Can you make combineResults give the same set of fields as classifySingleR? i.e., also combine first.label and prune.label.

Aaron Lun (18:48:03): > as in just take the values from the reference that you chose for each cell.

Jared Andrews (21:10:56): > Yeah, sure, that makes sense.

2019-11-10

Jared Andrews (18:51:08) (in thread): > I’m working this in now, btw. Taking a look at your new PR now too.

Aaron Lun (18:53:53) (in thread): > off to buy some gloves to clean my toilet bowl. hasn’t been cleaned in months, it’s so gross.

Aaron Lun (20:20:22): > My toilet bowl is so clean now.

Aaron Lun (20:20:27): > it SPARKLES

Aaron Lun (20:20:37): > And these gloves are so cool

Aaron Lun (20:20:40): > feel like a surgeon

Jared Andrews (20:24:49): > Cleaning is definitely cathartic.

Jared Andrews (22:53:30) (in thread): > Okay, this is mostly done, minus tests and one niggling thing, but I have a killer migraine, so I’m done for the night.

Aaron Lun (23:22:52) (in thread): > :parrot_aussie:

Aaron Lun (23:23:13) (in thread): > Not entirely sure why that’s meant to be an australian parrot.

2019-11-14

Aedin Culhane (16:55:34) (in thread): > Hi Aaron, can I view the relationship (or hierarchy) between these cell labels? Or are they just a bunch of unstructured “tags”?

Aedin Culhane (16:57:32): > Hi. We are trying different dataset aligners, and find some align better than other. Could throw this into the mix… See if cells that have >1 labels are those that don’t align??? Any interest?

Aaron Lun (17:04:37): > I have no idea what you’re saying?

Aaron Lun (17:05:29) (in thread): > Not across studies unless you harmonize them to the Cell Ontology.

Dan Bunis (18:24:56): > We do have interest in seeing how SingleR compares to other annotation methods.

Dan Bunis (18:25:12): > @Aedin Culhane When you say that you find “some align better than other” do you mean “some cells align better than others across various alignment methods”?

2019-11-19

Peter Hickey (01:15:42): > Having run SingleR() using one of the built-in references (e.g., HumanPrimaryCellAtlasData), what’s the easiest way to extract/identify the genes that are driving the label? Is there a recommended visualisation of this information?

Aaron Lun (02:30:32): > metadata()$de.genes

Aaron Lun (02:30:48): > and look at one of the heatmaps in the vignette.

Aaron Lun (02:31:08): > section 4.3

Peter Hickey (04:51:42): > Thanks!

2019-11-23

Aaron Lun (02:34:12): > @Jared Andrews Let’s finish the fight.

Aaron Lun (02:37:11): > Also, the new default heatmap colors sort of suck. I think we need a better unidirectional color scale.

Aaron Lun (02:38:03): > Also, the bidirectional color scale could probably afford to be capped at max(abs(score)) rather than at c(-1, 1).
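
The capping idea for the bidirectional scale can be sketched in base R: make the colour breaks symmetric around zero, but end them at `max(abs(score))` instead of at fixed limits of -1 and 1, so that the full colour range is used. The scores and the blue/white/red palette below are made up for illustration and are not the package's defaults.

```r
scores <- c(-0.12, 0.05, 0.31, -0.27, 0.18)    # made-up scores

limit <- max(abs(scores))                      # cap taken from the data
n_colors <- 100
palette <- colorRampPalette(c("blue", "white", "red"))(n_colors)
breaks <- seq(-limit, limit, length.out = n_colors + 1)

# Map each score to a colour bin; all.inside = TRUE keeps the extreme value
# in the last bin instead of pushing it past the end of the palette.
bin <- findInterval(scores, breaks, all.inside = TRUE)
score_colors <- palette[bin]
print(data.frame(score = scores, bin = bin, color = score_colors))
```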

Aaron Lun (02:43:33): > Made an issue. https://github.com/LTLA/SingleR/issues/64

Aaron Lun (02:43:54): > Argh, none of you guys are listed as collaborators, so I can’t assign you to do it. Bum.

Aaron Lun (02:45:10): > Here’s another one: https://github.com/LTLA/SingleR/issues/65

Aaron Lun (02:52:16): > And another one: https://github.com/LTLA/SingleR/issues/66

Aaron Lun (02:54:02): > These are pretty nice and easy things, so PRs welcome from the usual suspects. I’ll be working on this epic iSEE refactoring over thanksgiving so I won’t be in the way.

Jared Andrews (07:37:52) (in thread): > Yeah, sorry, I’ve been super busy and am currently out of town. I changed the function to clean it up, it’s probably ready for another once over. If you think it’s okay, I will write some tests next week. Will also update the readme, which is pretty out of date.

2019-11-24

Aaron Lun (02:15:56): > Right, ok, enough of watching cod playthrough videos on youtube.

2019-11-27

Aaron Lun (01:49:52): > Woah, it’s like watching a movie. Games nowadays are insane.

Aaron Lun (01:50:17): > Anyway, @Jared Andrews I added some comments, but nothing major; can proceed with tests.

Jared Andrews (01:51:24): > Saw that, thanks. Will get to it over the weekend, trying to finish a paper draft by the end of the week. Never actually written an R test, so new things are a-happening.

Aaron Lun (01:53:08): > I would normally say to have a look at some of the examples in the other test files, but some of them are either too intense or too trivial… probably test-aggregate.R and test-prune.R are your best bets.

Aaron Lun (01:53:52): > @Dan Bunis you want to do 64?

Jared Andrews (01:57:15): > Yeah, doesn’t look too tough, I will stumble my way through as usual.

Aaron Lun (01:58:39): > :+1:

Aaron Lun (01:59:14): > SingleR’s test load is actually pretty light. For comparison, scran has ~5000 test statements, and beachmat has ~40,000.

Dan Bunis (17:30:25): > I’ll take on 64. Shouldn’t be too hard to do.

Aaron Lun (17:31:47): > sweet

Dan Bunis (17:33:06): > I’d seen and been planning on it, just got distracted from replying by pokemon lol sorry!

Aaron Lun (17:33:24): > sword and shield?

Dan Bunis (17:33:52): > mhmmmmm

Dan Bunis (17:35:13): > they’re pretty good imo. different ways, but only as flawed as every other pokemon game

Aaron Lun (17:48:10): > I’m more of a gold/silver guy

Jared Andrews (17:59:02): > Nothing’s captured the magic of the originals tbh.

Dan Bunis (18:03:49): > those were great. i’m not saying that sword and shield are better really… just that I like em too.

Dan Bunis (18:04:30): > :man-shrugging:

Aaron Lun (18:10:00): > I’m particularly attached to gold/silver because i got this pirated version… that was still in japanese.

Aaron Lun (18:10:16): > Fortunately, my friend had given me a game guide for christmas, so I studied it hard.

Aaron Lun (18:10:37): > Figured out what moves each pokemon learns at each level so as to make sure I chose the right moves.

Aaron Lun (18:10:54): > Sometimes it was hard because some pokemon learn two moves at the same time, so I just had to take a chance.

Dan Bunis (18:15:28): > oh wow. are you saying you had no idea what was going on except because of the guide?

Aaron Lun (18:15:43): > pretty much

Aaron Lun (18:16:08): > I mean, i sort of figured that the totodile was the water type

Dan Bunis (18:16:10): > ha! love it.

Dan Bunis (18:16:50): > and those games actually were hard too.

Dan Bunis (18:17:16): > that woulda made for quite a bit more challenge

2019-12-03

Aaron Lun (00:17:32): > time for some more finishing of the fight.

Komal Rathi (09:09:18): > @Komal Rathi has left the channel

Jared Andrews (11:07:23): > Yeah, added some tests. Let me know if you think anything else should be covered.

Aaron Lun (11:32:54): > As if I had written them myself.

Aaron Lun (11:33:27): > Only suggestion is to use pruned.labels=tolower(labs) to check that the labels and pruned labels are handled separately.

Aaron Lun (11:33:59): > Also a cheap test is to try combineResults(list(A=results, B=results)) to check that you get the same results back.

Jared Andrews (23:20:08): > Donezo.

2019-12-04

Aaron Lun (02:22:33): > I’ll keep pushing on the multi ref PR, will someone be available for testing it IRL?

Dan Bunis (02:31:06): > I should have a bit of time next week.

Aaron Lun (02:45:14): > Great. And now, time for my favorite part of the day.

Aaron Lun (02:45:18): > :sleeping:

Jared Andrews (10:35:55): > I will also probably run it on my large dataset with all the immune refs because I want a giant mess of labels.

2019-12-05

Aaron Lun (02:56:55): > SingleR() now accepts multiple references, see the multiref branch.

Ludwig Geistlinger (16:14:54): > Maybe relevant for this channel: http://software.broadinstitute.org/gsea/msigdb/supplementary_genesets.jsp#SCSig

Aaron Lun (17:01:40): > Database for Immune Cell Expression(/eQTLs/Epigenomics)

Aaron Lun (17:01:58): > @Jared Andrews is the weird formatting in the vignette intentional?

Jared Andrews (17:05:15): > That’s just what the database is called, feel free to change. Especially since we only use expression. I’m just a sucker for exactness. See header here: https://dice-database.org/

Aaron Lun (17:05:48): > oh god.

Aaron Lun (17:06:00): > oh well, okay.

Aaron Lun (17:06:22): > Bit too cute for their own good there.

Jared Andrews (17:07:51): > :man-shrugging:

Aaron Lun (17:09:00) (in thread): > Interesting. Looks like it’s all in gene symbols again, rather than a proper ID. grumble

2019-12-06

Aaron Lun (01:43:13) (in thread): > Playing with this now. They couldn’t have released in Ensembl IDs? They started off with Ensembl, but they deliberately mapped it to Entrez/Symbols? Jesus.

Aaron Lun (02:36:05) (in thread): > Threw in some examples in the book.

Aedin Culhane (11:13:13) (in thread): > Sorry, I am too intermittent on slack. We are preparing a review on this for a Frontiers issue on integrative analysis. Should hopefully have it ready in a week. Currently CCA is the approach used for cross-study/batch/platform (in Seurat and other tools). We compared CCA to other matrix factorization approaches. The comparison was for a short review so it’s not comprehensive, but performances vary.

Ludwig Geistlinger (11:52:51) (in thread): > I agree on symbols, but what’s problematic about Entrez IDs?

Aaron Lun (11:54:05) (in thread): > Nothing, but if they were starting from Ensembl annotation, they could have released a version in Ensembl as well.

Ludwig Geistlinger (11:57:06) (in thread): > Agreed, I think the GSEA community is pretty used to EntrezIds back from the good old microarray days, whereas ENSEMBL took over in the RNA-seq realm

Aaron Lun (11:57:39) (in thread): > Also, direct use of those gene sets for annotation seems a bit tricky; even though the book is taking ages to build, you can look at my comments here: https://github.com/Bioconductor/OSCABase/commit/96e0388dab8fb07ef7fe09ed1d749fb453da480f

Aaron Lun (11:59:33) (in thread): > I’m going to guess that they primarily defined each gene set within a study, though I’ll need to dig into that.

2019-12-09

Aaron Lun (12:00:41): > Anyway, one good way to check that the combining works well is to take a single reference and split it into multiple parts, where each part should have a few unique labels; hopefully we should see similar results, treating the assignment to the original reference as the gold standard.
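
The split-reference check described above can be sketched on simulated data in base R. The `classify` helper is a hypothetical, stripped-down stand-in for the real classification (best Spearman correlation per cell, no marker selection or fine-tuning), and combining by highest score is an assumption. In this simplified setting the combined labels must match the full-reference labels exactly; with the real method, any disagreement would point at the combining step or at reference-specific marker choices.

```r
set.seed(7)
n_genes <- 60
labels <- c("A", "B", "C", "D")

# Simulated reference with one profile per label.
ref <- matrix(rnorm(n_genes * length(labels), mean = 5, sd = 2), nrow = n_genes,
    dimnames = list(paste0("gene", seq_len(n_genes)), labels))

# Test cells are noisy copies of randomly chosen reference profiles.
truth <- sample(labels, 20, replace = TRUE)
test <- ref[, truth] + matrix(rnorm(n_genes * 20, sd = 0.5), nrow = n_genes)
colnames(test) <- paste0("cell", seq_len(20))

# Hypothetical minimal classifier: best Spearman correlation per cell.
classify <- function(test, ref) {
    scores <- cor(test, ref, method = "spearman")
    best <- max.col(scores, ties.method = "first")
    data.frame(label = colnames(ref)[best],
        score = scores[cbind(seq_len(nrow(scores)), best)],
        stringsAsFactors = FALSE)
}

# Gold standard: assignment against the full reference.
full <- classify(test, ref)

# Split the reference into parts with unique labels, classify against each
# part separately, then keep the label from the best-scoring part.
parts <- list(ref[, c("A", "B")], ref[, c("C", "D")])
per_part <- lapply(parts, classify, test = test)
part_scores <- sapply(per_part, function(r) r$score)    # cells x parts
chosen <- max.col(part_scores, ties.method = "first")
combined <- vapply(seq_along(chosen),
    function(i) per_part[[chosen[i]]]$label[i], character(1))

agreement <- mean(combined == full$label)
print(agreement)
```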

2019-12-10

Robert Ivánek (05:41:40): > @Robert Ivánek has joined the channel

Chris Vanderaa (09:34:04): > @Chris Vanderaa has joined the channel

2019-12-11

Aaron Lun (00:09:04): > Doth no one want to finish the fight?

Jared Andrews (01:29:59): > I am in the final push to finish a manuscript, so I’m pretty unavailable till the end of the year. If I get really desperate for a procrastination method, maybe I’ll take a crack at #66, but that’s a pretty big maybe.

2019-12-13

Aaron Lun (01:22:29): > Crowd sourcing: got some HSCs. Best reference is…?

Dan Bunis (03:21:07): > BlueprintEncode is what I’ve found best for my HSPCs.

Tim Triche (10:39:00): > what age are your HSCs and what organ(s) did they come from

Tim Triche (10:39:27): > Hourigan is a pretty solid reference for adult marrow HSPCs (has bulk + 10X + CyTOF for the same n=8 donors)

Tim Triche (10:42:40): > But Greenleaf & co. found that most of the references seem to be consistent in the large so:man-shrugging:

Aaron Lun (11:31:37): > ¯\_(ツ)_/¯

Aaron Lun (11:31:56): > Just writing this book with a HSC example from Nestorowa (2016) and I thought I’d throw in an example of annotation.

Aaron Lun (11:32:06): > Consider biological knowledge of the system to be zero.

2019-12-14

Aaron Lun (02:45:48): > Trying this out and wondering why I was getting gibberish annotations. Then I realized that blueprint was human and my data was mouse… whoops.

Aaron Lun (02:48:45): > Distribution of cells for labels vs clusters. Color is log10(x+10) where x is the number of cells for that combination of label/cluster. - File (PNG): image.png

Aaron Lun (02:49:25): > Okay, so the above is with MouseRNAseqData(). Hard to say whether it makes sense, I guess HSCs wouldn’t really have clear attributes of any single lineage.
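
The values behind such a plot can be computed in base R along these lines. The labels and cluster assignments are made up for illustration; the only part taken from the message above is the transformation log10(x+10) of the cell counts, which maps empty label/cluster combinations to exactly 1 and compresses the range of the large counts.

```r
# Made-up per-cell labels and cluster assignments.
labels <- c("HSC", "HSC", "MPP", "HSC", "MPP", "LMPP", "HSC")
clusters <- c(1, 1, 2, 1, 2, 3, 2)

# Number of cells for each combination of label and cluster.
tab <- table(Label = labels, Cluster = clusters)

# Values to be mapped to colour: log10 of the count plus a pseudo-count of 10.
color_value <- log10(tab + 10)
print(round(color_value, 3))
```

The resulting matrix could then be passed to a heatmap function of choice.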

Aaron Lun (05:44:57): > Ugh. Picked a random cluster to annotate from another dataset. Saw a bunch of weird genes and googled them: most of them were neuronal. So I thought, “okay, these are neurons”. Turns out they’re meant to be microdissected bone marrow cells. :face_palm_star_trek:

Aaron Lun (06:03:16): > Real twilight zone stuff. Usually a good bet that it’s low-quality cells… or even just empty wells.

2019-12-15

Aaron Lun (04:56:35): > And after all that, I realized I was analyzing the wrong dataset.:face_palm_star_trek::face_palm_star_trek::face_palm_star_trek:

Dan Bunis (10:00:22): > :man-facepalming::man-facepalming::rolling_on_the_floor_laughing:Hopefully it looks better with the right data?

Aaron Lun (14:12:54): > Haven’t gotten around to using the right dataset yet, but I threw what I had done into the book as another workflow: https://github.com/Bioconductor/OSCABase/blob/master/analysis/workflows/grun-hsc.Rmd

Aaron Lun (14:13:33): > This was actually kind of instructive as it shows how SingleR will happily (and incorrectly) annotate mouse data with human references based on the few shared gene symbols (e.g., H19 and a bunch of other weird all-caps genes).

Aaron Lun (14:13:54): > That’s the cost of using gene symbols instead of the proper stuff, i.e., Ensembl or Entrez IDs.
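
A cheap guard against this kind of species or identifier mismatch is to check how many of the test dataset's gene names appear in the reference before annotating. The sketch below uses base R only; the helper name `gene_overlap`, the 50% threshold and the tiny gene lists are assumptions for illustration.

```r
# Hypothetical helper: which test genes are shared with the reference, and
# what fraction of the (unique) test genes they represent.
gene_overlap <- function(test_genes, ref_genes) {
    shared <- intersect(test_genes, ref_genes)
    list(shared = shared, fraction = length(shared) / length(unique(test_genes)))
}

# Mouse symbols are conventionally title-case and human symbols upper-case,
# so only names that are written identically in both (here H19) will match.
mouse_genes <- c("Cd3e", "Cd4", "Ms4a1", "H19")
human_genes <- c("CD3E", "CD4", "MS4A1", "H19")

overlap <- gene_overlap(mouse_genes, human_genes)
if (overlap$fraction < 0.5) {    # arbitrary threshold, for illustration only
    message(sprintf("Only %.0f%% of test genes are in the reference; check species and ID type.",
        100 * overlap$fraction))
}
print(overlap$shared)
```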

2019-12-16

Federico Agostinis (09:48:29): > @Federico Agostinis has joined the channel

Dan Bunis (13:14:56): > @Aaron Lun is there an explanation of the differences between Ensembl / Symbol / Entrez IDs and how to convert between them in the OSCA book? Or a link to something with that? If not, it might be nice to include. (I can repost this in #osca-book if you’d like.)

Aaron Lun (16:32:19): > You can write a chapter if you like. I don’t think I have anything explicit about that, though there are lots of cases where I’ve done that in the workflows.

Aaron Lun (16:32:26): > Anyway, merging in the PR, it looks good enough.

Aaron Lun (16:36:20) (in thread): > I guess this would go into the “feature selection” chapter, which is probably the closest.

Dan Bunis (17:36:51) (in thread): > I count myself more as someone who would benefit from reading said chapter currently, rather than as someone qualified to write it. But I suppose writing something would encourage me to learn the intricacies.

Aaron Lun (18:43:19) (in thread): > Well, I’ve been looking for some ~~suckers~~ volunteers to help me write some of the chapters, and you’re one of the top of my list.

Dan Bunis (18:46:33) (in thread): > :rolling_on_the_floor_laughing:

Aaron Lun (19:13:41): > @Dan Bunis Did the plotHeatmap changes break anything in release? If not, I’m just going to merge them in.

Dan Bunis (19:14:38): > I don’t think so but idr rn what the changes were.

Dan Bunis (19:14:43): > gimme a sec to refresh

Dan Bunis (19:19:13): > They shouldn’t. There’s just the color change & a bug FIX.

Aaron Lun (19:19:21): > okay, sweet.

Jared Andrews (20:35:20): > Also, we should probably update the beginning of the README considering it’s in Bioconductor now.

Aaron Lun (23:30:14) (in thread): > Another important thing is to update the vignette to describe how to use multiple references, which we forgot to do. If you’re doing this, you may also try to tackle #66 as well.

Aaron Lun (23:30:46) (in thread): > Damn forgot to send this to the channel.

2019-12-17

Aaron Lun (20:56:19): > Just added a mode for doing single-cell references. Looking for testers and also someone to update the vignette for the various changes so far.

Aaron Lun (20:56:40): > To incentivize people, I’m willing to offer a prize in the Christmas spirit.

Aaron Lun (20:58:27): > First prize is this Genentech T-shirt. Limited edition, and only slightly worn. - File (JPEG): IMG_20191217_153105.jpg

Aaron Lun (20:59:16): > Second prize is this Genentech beanie. Also only mildly worn. - File (JPEG): IMG_20191217_153159.jpg

2019-12-18

Lluís Revilla (03:40:52) (in thread): > Which repository are you talking about? SingleR (https://github.com/LTLA/SingleR)? I can give it a shot

Jared Andrews (09:19:42) (in thread): > Yes, that’s the repo. You’ll want to go off of the postrelease branch.

Jared Andrews (14:08:00): > “Ignite the possibilities” is a pretty weak catch phrase though, Genentech has to up their game.

Aaron Lun (16:12:53): > There’s more where that came from.

2019-12-19

Aaron Lun (16:47:32): > Was there a reason why we have show.labels=FALSE? Seems like it should be TRUE.

Aaron Lun (16:54:41): > @Dan Bunis

Dan Bunis (16:55:41): > historical reasons :man-shrugging:. It’s not a part of the pre-BioC function.

Dan Bunis (16:57:42): > We can update that. Another thing I remember talking about adding was an option to show either the labels or pruned.labels. If we do want that, make that an issue on the github and then I’ll remember and get to it when I can.

Aaron Lun (17:00:37): > I’ll just do it now.

Aaron Lun (17:12:34): > Ugh headphones are broken and music only comes out of one side now. Really disorientating when I take them off.

Aaron Lun (17:30:50): > PR created. Basically just needs some fleshing out of the combining results, and maybe some documentation about aggregateReferences.

Aaron Lun (23:20:40): > Marco?

2019-12-20

Aaron Lun (02:36:20): > It is done.

Jared Andrews (09:36:45): > On a completely unrelated note, how the hell have you contributed to something on Github every single day for the last year? Almost the last 2 years. That is nuts.

Aaron Lun (16:02:29): > Meth.

Jared Andrews (16:23:23): > The industry secret, I knew it.

2019-12-21

Aaron Lun (00:41:39): > Vignette updated with more deets, look forward to 1.1.5.

2019-12-29

Aaron Lun (21:13:16): > So cold can’t type properly

2019-12-30

Jared Andrews (09:25:50): > Don’t you live in California?

Aaron Lun (13:49:47): > in the city. my thermostat says it’s 65, but it feels a lot lower than that.

Aaron Lun (13:50:09): > Anyway, make sure you read the new vignette and test out the multi-reference mode if you have a chance. Also some other neat stuff to make life easier.

2020-01-06

Aaron Lun (22:35:29): > Is everyone back?

2020-01-07

Jared Andrews (09:26:08): > In theory, yes.

2020-01-12

Aaron Lun (23:37:06): > Anyway, I have some thoughts about how to take our annotation to the next level.

Aaron Lun (23:40:08): > It involves a bit of grinding by folks with some biological knowledge.

Aaron Lun (23:40:31): > But I have a T-shirt to reward a submitter of any PR that comes my way.

2020-01-13

Jared Andrews (00:42:49): > I have next to no time, but what’s the ask?

Friederike Dündar (11:23:21): > sounds intriguing! Could the T-shirt also be a baby onesie?

Aaron Lun (11:30:47): > Welcome back @Friederike Dündar. I take it the deployment went well?

Friederike Dündar (11:31:15): > yes, as well as these things can go

Federico Marini (14:32:05): > Congrats on the F1 generation @Friederike Dündar :)

Friederike Dündar (16:58:15): > thanks! :slightly_smiling_face:

Aaron Lun (19:14:11): > So, the task is to map all our “colloquial” names in the annotation to formal entities in the Experimental Factor Ontology. e.g., https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_0000576. This will allow us to easily combine information from multiple references as well as adjust the granularity of the labels by moving up and down the ontology. > > For example, maybe https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_0000934&viewMode=All&siblings=false is too specific for you, but we can just ratchet it up to https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_0000084&viewMode=All&siblings=false. - Attachment (ebi.ac.uk): monocyte > [Myeloid mononuclear recirculating leukocyte that can act as a precursor of tissue macrophages, osteoclasts and some populations of tissue dendritic cells., A mononuclear phagocytic leukocyte, 13 to 25 mm in diameter, with an ovoid or kidney-shaped nucleus, containing lacy, linear chromatin and abundant gray-blue cytoplasm filled with fine reddish and azurophilic granules. Formed in the bone marrow from promonocytes, monocytes are transported to tissues such as the lung and liver, where they develop into macrophages.] - Attachment (ebi.ac.uk): T cell > [A type of lymphocyte whose defining characteristic is the expression of a T cell receptor complex.]

Aaron Lun (19:15:42): > This requires a bit of skill to know which of our informally named cell types match up with the things in the ontology, but it would allow us to replace our “broad” and “fine” labels with a single suite of labels that can be dynamically adjusted by the user; this would provide a solution to problems such as those discussed in https://github.com/LTLA/SingleR/issues/68.

Vince Carey (23:12:14): > There is some material in the ontoProc vignette (https://www.bioconductor.org/packages/release/bioc/vignettes/ontoProc/inst/doc/ontoProc.html) that could be relevant here. I had a few interactions with the ontology group at HCA (Angela Pisco, Richard Scheuermann) on this topic, along with the EBI group led by Helen Parkinson. Difficult topic. Haven’t updated references since 2018…

2020-01-14

Friederike Dündar (10:14:50): > not sure if it’s worth it – many people that have confessed to using SingleR said they were using their own reference sets anyway

Aaron Lun (11:36:58): > This is but a larger effort to assign standard vocabulary to all datasets in the scRNAseq package, so that people can seamlessly combine inferences from multiple references without the harmonization problems discussed in ?combineResults.

Aaron Lun (11:40:58): > The quality of the references in the package is almost as big an asset as the method itself. The “on-label” use of this package involves using these references - that’s what we have in the vignette - so it seems worth it to me.

Aaron Lun (11:42:37): > At the very least, it provides a framework in which other people are encouraged to use standard terms. The alternative is pretty crap, when everyone calls their B cells in different ways, e.g., “B cells”, “bcells”, “B”, “B_cells” (this is a real example, and was a real pain to deal with).

Aaron Lun (11:42:57): > It’s like the gene symbol problem times 10.

Friederike Dündar (14:41:42): > yes, I follow in principle, but cannot come up with someone who’d actually do the work…

Friederike Dündar (14:42:11): > Actually, I guess, it’s not so horrible since we have somewhat limited cell types

Aaron Lun (14:43:29): > SingleR datasets are easy. Imagine trying to wade through all of Linarsson’s cell annotations.

Aaron Lun (14:43:38): > What’s Ast? Astrocytes, maybe? Who knows?

Friederike Dündar (14:44:14): > Linarsson being one of the scRNAseq packages?

Aaron Lun (14:44:42): > I imported many of Sten’s datasets into scRNAseq.

Aaron Lun (14:44:47): > Let me tell you, it was an adventure.

Friederike Dündar (14:44:49): > but you actually want someone to do just that, i.e. to wade through all of the cell annotations of all of the scRNAseq data?

Aaron Lun (14:44:56): > No, I’ll do that.

Aaron Lun (14:45:08): > I just want people here to go through SingleR datasets.

Friederike Dündar (14:45:12): > got it

Aaron Lun (14:45:27): > But if you want…

Friederike Dündar (14:45:35): > if you find someone who’s in NYC and will sit down with me for a day I’d do it, I just couldn’t stand doing it on my own

Friederike Dündar (14:45:43): > it = SingleR

Aaron Lun (14:47:06): > maybe I will do one to kick things off, show how easy it would be.

Friederike Dündar (14:56:35): > :party_parrot:

Friederike Dündar (14:56:38): > go for it

2020-01-18

Aaron Lun (01:27:33): > IT HAS BEGUN

Aaron Lun (01:27:34): > https://github.com/LTLA/SingleR/pull/84

Aaron Lun (01:28:15): > This was only somewhat painful. Cell Ontology is pretty comprehensive and the EBI search tool is pretty good, so it’s not too hard to find matches.

Aaron Lun (01:28:43): > Sometimes there are difficulties, e.g., the closest I could find to “neuronal progenitors” was “neuroblast”, not sure if that’s really close enough.

Aaron Lun (01:29:09): > I just didn’t bother with iPS cells. Didn’t know what I should call them.

Aaron Lun (01:29:31): > Don’t know what “BM” means, either. Guessing it was “bone marrow” but that’s not really a cell type so I left that out as well.

Aaron Lun (01:29:53): > The immuno-focused refs should be easier, at least you’re not jumping around the ontology tree too much.

Aaron Lun (01:33:19): > I would say that it took me just 1 hour to do it, so it’s definitely not a big burden.

Aaron Lun (01:33:35): > Especially if you know what the cell types actually are.

2020-01-20

Aaron Lun (22:25:09): > LOL https://www.ebi.ac.uk/ols/ontologies/cl/terms?iri=http://purl.obolibrary.org/obo/CL_0000325 - Attachment (ebi.ac.uk): stuff accumulating cell > [A cell that is specialised to accumulate a particular substance(s).]

Aaron Lun (22:27:57): > The comprehensiveness of these ontologies is pretty insane.

Aaron Lun (22:28:35): > It’s like, “class-switched B cell” or “foreskin fibroblast”. I was like, gee, those sound pretty specific, not sure if they’ll have terms.

Aaron Lun (22:28:38): > BUT THEY DO.

Aaron Lun (22:29:06): > https://www.ebi.ac.uk/ols/ontologies/cl/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_1001608 - Attachment (ebi.ac.uk): foreskin fibroblast > [Fibroblast from foreskin.]

Aaron Lun (22:29:30): > https://www.ebi.ac.uk/ols/ontologies/cl/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCL_0000972 - Attachment (ebi.ac.uk): class switched memory B cell > [A class switched memory B cell is a memory B cell that has undergone Ig class switching and therefore is IgM-negative on the cell surface. These cells are CD27-positive and have either IgG, IgE, or IgA on the cell surface.]

Aaron Lun (22:30:09): > Anyway, blueprint ENCODE was a pleasure to map; lots of unambiguous assignments and it only took me 30 minutes.

Aaron Lun (22:30:35): > Probably downhill from here, if other people can chip in.

2020-01-21

Martin Morgan (06:08:27): > sounds like a great exercise, Aaron, in the long term adherence to ontologies will be a big win.

Jared Andrews (10:38:07): > I’ll do a few in the next day or two. Say, DICE, Monaco, and the mouse RNAseq. Immgen is probably going to be the most annoying given all of the experimental conditions.

Aaron Lun (11:33:54): > For the time being, I think we don’t need to worry about the experimental conditions, just the cell types.

2020-01-22

Vince Carey (11:55:00): > I’d like to follow the ontology use development more closely – is there a branch of singleR where the code is emerging? Specifically I’d like to see a reuse of the ontology serializations/support of ontoProc if such is advantageous.

Aaron Lun (11:55:57): > https://github.com/LTLA/SingleR/pull/84

Aaron Lun (11:56:42): > Annotated with real intelligence. Accept no substitutes, none of that artificial stuff.

Vince Carey (12:09:37): > OK, I see how to look at a PR – and you have a tsv with some mapping. Now looking at the instance of cell ontology in ontoProc to see what is going on with CL:0000222, which has neighbor in the tsv in PR ‘Tissue_stem_cells:CD326-CD56+’, and the monocyte tag as a control … we have

Vince Carey (12:10:09): - File (PNG): onto.png

Vince Carey (12:11:36): > One of the aims of ontoProc vignette is to propose structures for adding possibly conjectural molecular details into structures denoting cell type and state.

Aaron Lun (12:11:50): > What I really want is a way to easily traverse these trees to synchronize the “granularity” of labels across reference datasets. SingleR currently cannot handle classification to nested terms.

Vince Carey (12:12:31): > I think that traversal concept is important – can we get somewhere with the aggregation facilities of treeSummarizedExperiment?

Vince Carey (12:14:46): > But I think having a protocol for introducing, formally, synonyms like the ones I believe you have in the TSV, into OBO-consistent structures, could be worth thinking about. TSV is of course easy to process and link with – but some way of uniting this information with standard structures in the ontology world will probably pay off.

Aaron Lun (12:16:24): > Well, the TSV contains a mapping from uncontrolled to controlled vocab. I don’t know that it needs to be more complicated than that. AFAIK the OBO stuff is for storing the ontology structure itself.
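
A minimal sketch of how such a TSV could be applied, assuming a two-column layout with the uncontrolled and controlled headers that appear in the hpca table quoted later in this channel. The rows are copied from that table; iPS_cells is a hypothetical label included only to show what an unmapped entry looks like, and none of this is SingleR's actual loading code.

```r
# Toy mapping in TSV form: uncontrolled label -> Cell Ontology ID.
tsv <- paste(
  "uncontrolled\tcontrolled",
  "DC:monocyte-derived:immature\tCL:0000840",
  "DC:monocyte-derived:LPS\tCL:0000451",
  "Smooth_muscle_cells:bronchial:vit_D\tCL:0002598",
  sep = "\n"
)
mapping <- read.delim(text = tsv, stringsAsFactors = FALSE)

# Named vector so that labels can be translated by name lookup.
lookup <- setNames(mapping$controlled, mapping$uncontrolled)

# "iPS_cells" is a hypothetical label with no entry in the mapping.
labels <- c("DC:monocyte-derived:LPS",
            "Smooth_muscle_cells:bronchial:vit_D",
            "iPS_cells")
mapped <- unname(lookup[labels])
print(data.frame(label = labels, ontology = mapped))
```

Labels without a match come back as NA, which mirrors the choice above to leave a label unmapped rather than force it onto a poorly fitting term.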

Vince Carey (12:16:31): > For example, you have ‘CD326-CD56+’ in the tissue stem cell entry – from the ontoProc vignette perspective this might be read ‘lacks plasma membrane protein CD326’, ‘has plasma membrane protein CD56’ and these clauses are important for discriminating a specific kind of – mesodermal cell?

Aaron Lun (12:17:17): > Well, possibly, but I decided against mapping to anything that was not a cell ontology term. For the sake of sanity.

Aaron Lun (12:17:31): > You are of course more than welcome to submit a PR with additional terms.

Vince Carey (12:18:23): > Is the token ‘CD326-CD56+’ used in other resources? I don’t want to add terms but I do want to have a systematic approach to managing the relevant information.

Aaron Lun (12:18:42): > ¯\_(ツ)_/¯

Aaron Lun (12:18:56): > AFAIK no, but I don’t look at the fine labels very closely.

Vince Carey (12:26:11): > Just for fun: https://www.ncbi.nlm.nih.gov/pubmed/20643952 - Attachment (ncbi.nlm.nih.gov): Mapping the first stages of mesoderm commitment during differentiation of human embryonic stem cells. - PubMed - NCBI > Proc Natl Acad Sci U S A. 2010 Aug 3;107(31):13742-7. doi: 10.1073/pnas.1002077107. Epub 2010 Jul 19. Research Support, Non-U.S. Gov’t

Aaron Lun (12:28:52): > Well, yes, I would have assumed that it was important to somebody, otherwise the sample wouldn’t have been generated. But clearly it wasn’t important enough to warrant its own ontology term, or I wasn’t able to find its correct synonym.

Vince Carey (12:35:35): > When your pull request goes live, I will add some material to the ontoProc vignette to look at the diversity of uncontrolled terms that you’ve assembled and mapped. Perhaps at some point we can add some value to singleR with an approach to tree traversal and aggregation using the CO relationships.

Aaron Lun (12:36:56): > Yes. The ontology terms are useless to us unless we have these tree manipulation algorithms. If you can do that for us, it would help a lot. Otherwise I will go and do it myself.

Vince Carey (12:38:54): > I’ll have a look and report back.

Jared Andrews (21:05:10): > T cells, CD4+, Th1_17 is kind of a tossup. Can go with Th1 or Th17. Or just CD4+ helper T cell. Thoughts?

Aaron Lun (22:38:04): > Sounds like CD4+ helper would make sense here. If you’re uncertain, best to fall back; reliability is better than resolution.

2020-01-23

Jared Andrews (00:41:10): > Two more done. Exhausted B cells didn’t seem to have a good match, so they got relegated to “B cells”. All the specific CL pages are 404-ing for me right now, so done for the evening.

Aaron Lun (00:41:48): > Yeah, give me some of those sweet sweet ontologies. I’ll wait for the rest before merging the PR.

Aaron Lun (00:42:47): > What am I doing right now that makes ontology mapping seem fun by comparison? That’s right! I’m reviewing a paper!

Jared Andrews (00:43:06): > You were right, it’s not bad at all for the immune sets.

Aaron Lun (00:43:21): > Very satisfying, isn’t it?

Aaron Lun (00:43:31): > Finding that perfect match

Jared Andrews (00:44:06): > Mostly, except when they don’t and the tree thingy is broken. Their site is having issues, me thinks. Search works fine, but all the pages 404.

Jared Andrews (00:44:34): > Think I’d enjoy reviewing papers, tbh. Well, to an extent.

Aaron Lun (00:44:36): > Are you talking about the tree that you can click on to expand terms?

Jared Andrews (00:45:04): > Yes, that. And clicking on terms themselves doesn’t lead to the term page currently. For me at least.

Aaron Lun (00:45:11): > Hm. Well, for me, it’s a bit annoying how I can’t click on it to open a new tab.

Aaron Lun (00:45:56) (in thread): > EXCELLENT. Another person to redirect reviews to.

Jared Andrews (00:46:20): > NOPE, I am not qualified in the eyes of the community. I am just a lowly grad student.

Jared Andrews (00:46:30): > For at least a few more months.

Aaron Lun (00:46:59): > Once you become a corresponding author on anything, you can just watch the review requests roll in.

Aaron Lun (00:47:08): > For example, I did one review a day over the long weekend.

Aaron Lun (00:47:17): > Just woke up, reviewed a paper before lunch, and repeat.

Aaron Lun (00:47:22): > And you know what I got as a reward?

Aaron Lun (00:47:31): > That’s right! ANOTHER 3 reviews to do this weekend!

Jared Andrews (00:47:45): > A sense of self-satisfaction and the knowledge of an essential job well done?

Jared Andrews (00:48:00): > Great joke, I know.

Aaron Lun (00:48:11): > No, mostly an increasing sense of bitterness and unchecked sadism.

Jared Andrews (00:48:58): > So what you’re saying is that I should explicitly request to never have you as a reviewer?

Aaron Lun (00:51:04): > Depends. If you manage to get your manuscript to me at the same time you put in a PR that solves a bug/adds a new feature/does something cool, I’ll take that into consideration.

Aaron Lun (00:51:22): > (Converse is also true.)

Jared Andrews (00:55:17): > Good to know, good to know.

Vince Carey (06:13:53): > Vignette of ontoProc 1.9.1 now includes demonstration of subset_descendants – Aaron’s hpca table (as of yesterday) was copied into ontoProc (I’ll remove it when PR goes in) to help with bind_formal_tags and subset_descendants … what other aggregation functions are of interest? - File (PNG): desc.png

Jared Andrews (10:34:03): > That’s pretty nifty.

Aaron Lun (11:37:17): > My desired user-level experience would be to (i) give a function two sets of terms at potentially different resolutions and (ii) get one set of terms aligned at the same resolution.

Aaron Lun (11:37:25): > That’s it, really.

Vince Carey (12:11:13): > What’s the definition of “resolution” here? Would you be using the expression data to carry out this task?

Aaron Lun (12:12:00): > “resolution” would be T-cell -> CD4 T cell -> helper CD4 T cell -> Th1.

Aaron Lun (12:12:33): > Left is low-res, right is highest res. I think I got the lineage right here, but basically we’re talking about the progression along the tree.
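
A rough sketch of that "align to the same resolution" experience, assuming a toy is_a parent map built from the lineage named above rather than the real Cell Ontology. The functions lineage and coarsen are hypothetical names, not part of SingleR or ontoProc.

```r
# Toy is_a hierarchy (assumption): names are child terms, values are parents.
parent <- c(
  "CD4 T cell"        = "T cell",
  "helper CD4 T cell" = "CD4 T cell",
  "Th1"               = "helper CD4 T cell"
)

# Return a term followed by all of its ancestors, from finest to coarsest.
lineage <- function(term, parent) {
  out <- term
  while (term %in% names(parent)) {
    term <- parent[[term]]
    out <- c(out, term)
  }
  out
}

# Move each term up the hierarchy until it hits one of the target terms;
# terms with no ancestor among the targets become NA.
coarsen <- function(terms, targets, parent) {
  vapply(terms, function(term) {
    lin <- lineage(term, parent)
    lin[lin %in% targets][1]
  }, character(1), USE.NAMES = FALSE)
}

fine_labels   <- c("Th1", "helper CD4 T cell")  # high-resolution reference
coarse_labels <- c("CD4 T cell")                # low-resolution reference
aligned <- coarsen(fine_labels, coarse_labels, parent)
print(aligned)
```

Here both fine labels are ratcheted up to "CD4 T cell", so the two references end up speaking at the same level of the tree.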

Aaron Lun (19:12:40): > If any of them are willing to help out in the OSS space, they should join the party.

Aaron Lun (19:15:29): > Just be like… “PRESTIGE”.

2020-01-24

Jared Andrews (13:28:58): > We’re sticking solely to the cell ontology right?

Jared Andrews (13:37:58): > Not really sure how to deal with the “colony forming unit” variants from the Novershtern dataset.

Aaron Lun (13:50:52): > Yes, for the time being.

Aaron Lun (15:13:05): > @Jared Andrews Excellent. Most excellent. Your hate has made you strong.

Aaron Lun (15:13:21): > Now, strike down Immgen, and your journey towards the dark side will be complete.

Jared Andrews (15:14:09): > That one will take a bit longer, so I can’t justify doing it right now since I used the others as productive procrastination to avoid finishing writing a discussion section.

Jared Andrews (15:14:12): > Maybe tonight.

Aaron Lun (15:14:29): > lol

2020-01-25

Jared Andrews (02:25:32): > There are way too many damn t cell subsets

Aaron Lun (02:26:13): > I think that every day.

Jared Andrews (04:32:09): > Alright, well, that was painful.

Aaron Lun (04:37:15): > sheesh. yeah, that looks nasty.

Aaron Lun (04:37:52): > What on earth is even “T.DN3-4”?

Aaron Lun (04:39:00): > geez, these t cell subtypes go on forever.

Aaron Lun (04:39:26): > Anyway, good work. If you’re done I’ll merge it in. Hopefully people will be able to tell us if we’re wrong.

Jared Andrews (05:08:41): > Yeah, I’m done. I’m sure I screwed up a few, a lot of the immgen labels were pretty useless, but I did my best. I don’t know jack about most myeloid lineages, so lots of macrophage labels just got the base macrophage ontology.

2020-01-27

Vince Carey (05:32:45): > Just a little more on the Cell Ontology features exposed in ontoProc. After cl = getCellOnto(), we can do
>
> > cl$name[["CL:0000928"]]
> [1] "activated CD4-negative, CD8-negative type I NK T cell"
> > cl$has_soma_location[["CL:0000928"]]
> character(0)
> > cl$located_in[["CL:0000928"]]
> character(0)
> > cl$has_plasma_membrane_part[["CL:0000928"]]
> [1] "PR:000001343"
> > cl$develops_from[["CL:0000928"]]
> [1] "CL:0000924"
> > pr = getPROnto()
> > pr$name[["PR:000001343"]]
> [1] "CD69 molecule"
> > cl$lacks_plasma_membrane_part[["CL:0000928"]]
> [1] "PR:000001004" "PR:000001084"
> > pr$name[.Last.value]
>                                  PR:000001004
>                                "CD4 molecule"
>                                  PR:000001084
> "T-cell surface glycoprotein CD8 alpha chain"

Vince Carey (05:36:02): > This allows the ctmarks app in ontoProc to produce the ‘lineage’ graphs, but also, under the tags tab - File (PNG): Screen Shot 2020-01-27 at 5.33.57 AM.png

Vince Carey (05:39:52): > This is based on a 2018 edition of Cell Ontology. I’ve checked the 2020 version and there are very few new cell types added, but there are evidently many more properties catalogued. What I’ve shown here is based on rudimentary chasing down of a couple of predicates that seem relevant to expected expression patterns.

Raphael Gottardo (10:49:33): > @Greg Finak @Ju Yeong Kim Look at the discussion above.

Greg Finak (10:49:36): > @Greg Finak has joined the channel

Ju Yeong Kim (10:49:36): > @Ju Yeong Kim has joined the channel

2020-01-29

Aaron Lun (16:47:55): > I’m going to give someone more patient a chance to deal with the latest SingleR issue before I blow my top.

Aaron Lun (16:56:50): > Anyway, back to the ontologies; @Vince Carey I will prep SingleR so we can get the new terms in the reference data, what’s the shortest series of calls from your package to harmonize the labels?

Jared Andrews (17:28:08): > I’m not even really sure what’s being asked (yet again), but hopefully answered whatever question they had.

Aaron Lun (20:16:19): > I would have already collapsed into full sarcasm mode at this point, so you’re already ahead.

Vince Carey (20:43:12): > @Aaron Lun is this what you have in mind?
>
> > library(ontoProc)
> > z = read.csv(system.file("extdata/hpca.csv", package="ontoProc"), stringsAsFactors=FALSE)
> > z[,2][1:5]
> [1] "CL:0000840" "CL:0000451" "CL:0000451" "CL:0000451" "CL:0002598"
> > cl = getCellOnto()
> > cbind(z[1:5,], clname=cl$name[z[1:5,2]])
>                          uncontrolled controlled
> 1        DC:monocyte-derived:immature CL:0000840
> 2      DC:monocyte-derived:Galectin-1 CL:0000451
> 3             DC:monocyte-derived:LPS CL:0000451
> 4                 DC:monocyte-derived CL:0000451
> 5 Smooth_muscle_cells:bronchial:vit_D CL:0002598
>                                 clname
> 1 immature conventional dendritic cell
> 2                       dendritic cell
> 3                       dendritic cell
> 4                       dendritic cell
> 5         bronchial smooth muscle cell

Vince Carey (20:46:39): > The key steps in getting conventional names, given CL tags, are cl = getCellOnto() and using cl$name, which is a named vector mapping from CL tags to endorsed name. The real star of the show is the (CRAN) ontologyIndex package of Daniel Greene, who imports and structures OBO so nicely. ontoProc stores some serializations and has some helper functions to combine information across ontologies – of some interest as CL uses both GO and PR (protein ontology).

2020-01-30

Aaron Lun (02:54:30): >
> library(SingleR)
> out <- ImmGenData()
> out$label.ont

Aaron Lun (02:56:07): > Having thought about this a bit more: now that we have ontologies, we could actually refine SingleR by having the fine-tuning step follow the ontology hierarchy.

Aaron Lun (02:58:38): > We start at the top level of whatever reference we have where none of the terms are parents of each other, and we work our way down the hierarchy using the same fine-tuning step on the children of the best-scoring term at each step.

Aaron Lun (03:00:22): > For example, a match to B cells would suggest that we perform a comparison between immature, mature, precursor and transitional stage B cells (or whichever of these are available in the reference or multiple references); if the best match at this step was a mature B cell, then we would go on to fine-tune to compare GC B cells, plasmablasts, etc.
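
The root-first descent described above could look roughly like this. It is only a sketch under loud assumptions: the children list is a toy hierarchy loosely based on the B cell example (not the Cell Ontology), the scores are made up, and descend and score_fun are hypothetical names. In a real implementation the scoring function would rerun the fine-tuning step on markers specific to the current candidates.

```r
# Toy hierarchy (assumption): each entry lists the children of a term.
children <- list(
  "B cell"        = c("immature B cell", "mature B cell"),
  "mature B cell" = c("GC B cell", "plasmablast")
)

# Made-up per-label scores for a single cell.
toy_scores <- c(
  "B cell" = 0.8, "T cell" = 0.3,
  "immature B cell" = 0.4, "mature B cell" = 0.7,
  "GC B cell" = 0.6, "plasmablast" = 0.5
)
score_fun <- function(candidates) toy_scores[candidates]

# Start from the top-level terms and repeatedly descend into the children
# of the best-scoring candidate until a term without children is reached.
descend <- function(start, children, score_fun) {
  path <- character(0)
  candidates <- start
  repeat {
    scores <- score_fun(candidates)
    best <- names(which.max(scores))
    path <- c(path, best)
    if (is.null(children[[best]])) break
    candidates <- children[[best]]
  }
  path
}

path <- descend(c("B cell", "T cell"), children, score_fun)
print(path)
```

The returned path records the label chosen at each level, so a user could stop at whichever depth suits the granularity they want.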

Aaron Lun (03:02:21): > In theory, this would reduce noise at broader labels by focusing on big differences between classes, while automatically adjusting itself to look for finer differences at the tips of the ontology. I suspect it would also be more biologically sensible in some respects because the nature of the search is constrained by known relationships between cell types.

Aaron Lun (03:02:49): > Now, I know you guys are mercenaries, so I’ll say that there’s probably a paper in that idea above.

Jared Andrews (11:02:18): > That sounds neat.

Jared Andrews (11:42:52): > I also don’t have the brainpower available to actually think about the best way to implement that, but if you give me strict dumb dumb instructions, I will help how I can.

2020-01-31

Vince Carey (14:51:15): > Testing using hierarchical annotation has some history. https://academic.oup.com/bioinformatics/article/23/22/3024/208216 … https://academic.oup.com/nar/article/38/11/3523/3100635. I don’t sense that a clear standard approach has emerged.

Dvir Aran (19:42:37): > Hey, sorry to chime in late. I explored hierarchical analysis for a while. The main problem is that the transcriptional profile doesn’t work like that - two cells from different lineages can have relatively similar transcriptional profile, while their parents are very different. Trying to go along the tree from the root to the leaves is doomed to fail. The approach I found to work is the opposite - going up the tree. This does not change how SingleR works, but you can still use CL hierarchies to explore relationships between cells.

Dvir Aran (19:44:53): > @Aaron Lun I think I presented this approach when we met. I can share my slides from back then if that helps.

Dvir Aran (19:47:48): > Essentially, what I think might be nice is automatic construction of the developmental tree of a scRNA-seq dataset. I can see several datasets where this could have been valuable and provided cool insights

Aaron Lun (19:49:38): > Hm. Okay. I was hoping that root-first search would work, as it’s a lot more amenable to scaling the method.

Dvir Aran (19:51:47): > I agree, makes a lot of sense, unfortunately, life doesn’t…

Aaron Lun (19:51:54): > Curse that damn biology!

Aaron Lun (19:52:45): > Though all is not lost. If we have enough reference data, we can forcibly learn new connections that are not lineage-related but help us get to the right answer faster.

Aaron Lun (19:54:11): > And of course, it doesn’t have to be a tree, it just has to be a reasonably sparse DAG to help with scalability.

Dvir Aran (19:56:17): > Yes, there are many possibilities here.

Aaron Lun (19:58:24): > Internally we have over a hundred reference datasets mapped to the Cell Ontology, so this would be an immediate use case for some high-throughput SingleR for general-purpose cell labelling.

Aaron Lun (19:59:23): > Imagine a world where people don’t think much about cell type assignment in the same way that they don’t think much about read alignment.

Aaron Lun (19:59:27): > It would be beautiful.

2020-02-01

Aaron Lun (03:33:10): > A few BOTE calculations. Let’s consider a situation with N labels, S cells per label and a choice of de.n=X. > > Let’s assume the most optimistic case that each label is defined by a unique set of markers, i.e., the pairwise comparison between each label and every other label yields the same set of genes with 100% redundancy. In this case, the total number of genes used in the initial search is N*X, so the run-time complexity of the initial search is f(SN)*N*X where f() is a function that is less than linear due to the nearest neighbor algorithm. > > Conversely, in the worst case, each label is distinguished from each other label by a different set of markers, i.e., there is no redundancy in the genes identified for all pairwise comparisons involving a set of labels. In this case, the total number of genes used in the initial search is N*N*X, so the runtime complexity becomes f(SN)*N*N*X. > > On top of that, there is of course the cost of the fine-tuning. This doesn’t have predictable complexity but seems to be the main time sink in practical usage. Given K labels that survive the initial search and T fine-tuning iterations, complexity becomes S*K*K*X*T. It’s hard to say what K and T might be; at the very worst, this could be S*N*N*N*X complexity though this is unlikely to ever occur.
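
To put numbers on the estimates above, here is a small worked example. The values of N, X, S, K and the iteration count are arbitrary illustrations, the helper names are hypothetical, and the f(SN) factor is left out because its form is not specified.

```r
# Number of genes used in the initial search under the two extremes:
# full marker redundancy (best) versus no redundancy (worst).
initial_search_genes <- function(N, X) {
  c(best = N * X, worst = N * N * X)
}

# Fine-tuning cost S*K*K*X*T from the estimate above; the iteration
# count is called n_iter here to avoid masking R's T shorthand for TRUE.
fine_tuning_cost <- function(S, K, X, n_iter) {
  S * K * K * X * n_iter
}

genes <- initial_search_genes(N = 20, X = 10)
print(genes)
print(unname(genes["worst"] / genes["best"]))  # gap between extremes equals N

cost <- fine_tuning_cost(S = 5, K = 3, X = 10, n_iter = 2)
print(cost)
```

With 20 labels and de.n of 10, the initial search would use somewhere between 200 and 4000 genes, so the gap between the two extremes grows linearly with the number of labels.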

Aaron Lun (03:49:44): > We can write an alternative implementation of SingleR() that gives N*N*X*f(2S) performance predictably; so better than the worst-case of the current implementation, but worse than the best-case. Nonetheless, it has a chance at being more efficient in practice because it is more cache-friendly and amenable to parallelization. The idea is to perform pairwise comparisons as before, but then perform a NN search for each pair of labels and their smaller subset of markers; the label that outscores the greatest number of other labels is the final assignment. Effectively the fine-tuning step taken to its logical extreme. > > In terms of results, I’m hoping that it might be able to fill in one of the existing implementation’s weak spots, namely the assumption that the true label is close enough to the top score to survive the initial search and progress to fine-tuning. There have been a few times that I poked around in the output scores and noted that the final fine-tuned label, while correct in my case, was uncomfortably close to the 0.05 limit from the top score. This could get problematic as the number of labels increases and the size of the common.genes of the initial search increases; noise in irrelevant genes can then affect the final score, especially if you only have a few markers defining your cell type out of the total number of available genes (and Spearman doesn’t care about their strength).
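
The pairwise idea above can be sketched roughly as follows. This is a simplification and not the proposed SingleR implementation: each label is reduced to a single reference profile (so the nearest-neighbor search disappears), the markers for a pair are simply the X genes with the largest absolute difference between the two profiles, and pairwise_assign and the toy data are hypothetical.

```r
# For every pair of labels, correlate the cell with both reference profiles
# using only that pair's markers; the better-correlated label wins the pair.
# The final label is the one with the most pairwise wins.
pairwise_assign <- function(cell, ref, X = 4) {
  labels <- colnames(ref)
  wins <- setNames(integer(length(labels)), labels)
  pairs <- combn(labels, 2)
  for (p in seq_len(ncol(pairs))) {
    a <- pairs[1, p]
    b <- pairs[2, p]
    # Pair-specific markers: top X genes by absolute difference (assumption).
    markers <- order(abs(ref[, a] - ref[, b]), decreasing = TRUE)[seq_len(X)]
    score_a <- cor(cell[markers], ref[markers, a], method = "spearman")
    score_b <- cor(cell[markers], ref[markers, b], method = "spearman")
    if (score_a > score_b) wins[a] <- wins[a] + 1L
    if (score_b > score_a) wins[b] <- wins[b] + 1L
  }
  list(label = names(which.max(wins)), wins = wins)
}

# Toy reference: 6 genes, 3 labels, each label marked by two genes.
ref <- cbind(
  A = c(10, 9, 1, 1, 1, 1),
  B = c(1, 1, 10, 9, 1, 1),
  C = c(1, 1, 1, 1, 10, 9)
)
rownames(ref) <- paste0("g", 1:6)

# Toy cell that resembles label B.
cell <- c(1.2, 0.8, 9.5, 10.5, 1.1, 0.9)

res <- pairwise_assign(cell, ref, X = 4)
print(res$label)
print(res$wins)
```

Because every comparison uses only the markers for that pair, genes irrelevant to the pair never enter the correlation, which is the property hoped for in the message above.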

Aaron Lun (04:06:10): > But, anyway. All that is for another day, because right now we need to figure out what fun we can do with our ontologies.

Vince Carey (09:15:38): > @Aaron Lun do you think we should formalize the role of ontology-based labeling with methods on a specific metadata component – in pseudo-code you’ve used SE$label.ont – I think it would be useful to agree on a method for assigning/retrieving ontologic annotation, and a place where it can be found.

2020-02-02

Aaron Lun (03:36:48): > @Vince Carey Well I don’t know that it would be easy to formalize the name of the field of the SE containing ontologies, that seems pretty specific to our use case.

Aaron Lun (03:42:47) (in thread): > I don’t think that this was what I was thinking of. I just want to “sync”, say, ImmGenData()$label.ont and BlueprintEncodeData()$label.ont, in terms of the granularity of the terms, so that the output is easier to interpret.

Aaron Lun (03:44:57) (in thread): > Or just generally make the cell type assignment results easier to interpret when I have terms at different levels of the ontology hierarchy. Some kind of graph or summary that reflects the relationships between terms, if I gave you just a whole bag of terms.

Aaron Lun (03:46:37) (in thread): > Perhaps a plot of a graph layout where each node is a term and each edge is an ontological relationship (ideally “is a”); but only containing edges between nodes that appear in my vector of terms (sized according to the frequency of the term).

Aaron Lun (03:47:02) (in thread): > Seems like making pretty plots of such relationships would fall in the remit of an ontology-wrangling package.

Vince Carey (08:47:22): > OK – this is a pretty tall order and I don’t think I can do too much more in the short term. But I have added some code to start this ‘synchronization’ process off. Using the ‘standardized’ branch of SingleR and the 1.9.5 of ontoProc (in git),connect_classescan be used to trace is_a relationships between classes in SEs that use ‘label.ont’ in colData: > > > suppressMessages({ > + library(SingleR) # 'standardized' branch > + library(ontoProc) > + imm = ImmGenData() > + blu = BlueprintEncodeData() > + }) > > > cl = getCellOnto() > > > cc = connect_classes(cl, imm, blu) > > > map2prose( cc[["blu->imm"]][1:4], cl ) > $`macrophage (CL:0000235)` > CL:0000583 CL:0000129 > "alveolar macrophage" "microglial cell" > > $`CD4-positive, alpha-beta T cell (CL:0000624)` > CL:0000792 > "CD4-positive, CD25-positive, alpha-beta regulatory T cell" > CL:0000895 > "naive thymus-derived CD4-positive, alpha-beta T cell" > CL:0000896 > "activated CD4-positive, alpha-beta T cell" > CL:0000897 > "CD4-positive, alpha-beta memory T cell" > CL:0001044 > "effector CD4-positive, alpha-beta T cell" > > $`CD8-positive, alpha-beta T cell (CL:0000625)` > CL:0000900 > "naive thymus-derived CD8-positive, alpha-beta T cell" > CL:0000906 > "activated CD8-positive, alpha-beta T cell" > CL:0000909 > ... >

Vince Carey (08:49:31): > This is telling us that BlueprintEncodeData has some coarse categories that have finer-grained constituents in ImmGenData. There is also a component of `cc` named `imm->blu` that provides information on the other direction.

Vince Carey (08:54:43) (in thread): > All I mean here @Aaron Lun is that if you are committing to label.ont in the colData of SE (as in the ‘standardized’ branch of SingleR) then we might write a method to check for it and use it. As it is I don’t know where to get the tags associated with the samples.

Aaron Lun (14:25:24) (in thread): > Is that really necessary? I don’t know why I would pass an SE to your functions, I’d just pass the vector of ontology terms.

Aaron Lun (14:26:36): > Hm…

Aaron Lun (14:26:46): > Have to think about this.

2020-02-03

Vince Carey (08:45:21): > Moving a little closer to the synchronization aim, ontoProc 1.9.7 has a `common_classes` function, and the example (which requires that you have the `standardized` branch of SingleR installed) produces > > clname imm blu > CL:0000235 macrophage 76 18 > CL:0000576 monocyte 10 16 > CL:0000787 memory B cell 2 1 > CL:0000771 eosinophil 4 1 > CL:0000057 fibroblast 21 20 > CL:0000775 neutrophil 23 23 > CL:0000115 endothelial cell 20 18 > CL:0000624 CD4-positive, alpha-beta T cell 41 11 > CL:0000625 CD8-positive, alpha-beta T cell 17 3 > CL:0000066 epithelial cell 25 18 > > We could use this with `connect_classes` to “coarsen” labeling of samples in SEs to common levels.

Vince Carey (08:46:42): > BTW if we used the metadata components of the SingleR datasets to include short identifying tags, we would have an opportunity to label reports better than I do in these examples.

2020-02-04

Matt N Tran (08:28:06): > @Matt N Tran has joined the channel

Peter Hickey (22:48:56): > https://www.biorxiv.org/content/10.1101/810234v2

Peter Hickey (22:49:09): > > Unifying single-cell annotations based on the Cell Ontology > > Single cell technologies have rapidly generated an unprecedented amount of data that enables us to understand biological systems at single-cell resolution. However, joint analysis of datasets generated by independent labs remains challenging due to a lack of consistent terminology to describe cell types. Here, we present OnClass, an algorithm and accompanying software for automatically classifying cells into cell types part of the controlled vocabulary that forms the Cell Ontology. A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. Furthermore, OnClass can be used to identify marker genes for all the cell ontology categories, independently of whether the cells types are present or absent in the training data, suggesting that OnClass can be used not only as an annotation tool for single cell datasets but also as an algorithm to identify marker genes specific to each term of the Cell Ontology, offering the possibility of refining the Cell Ontology using a data-centric approach. >

Aaron Lun (22:53:05): > Scooped before we even got started.

Peter Hickey (22:53:29): > i hadn’t heard of it, but the first version of preprint is from oct last year

Peter Hickey (22:53:37): > software: https://github.com/wangshenguiuc/OnClass

Aaron Lun (22:53:57): > I mean, that’s fine and all, but you still need some ~~sucker~~ person to do the manual mapping to the CO.

Aaron Lun (22:55:12): > > A key advantage of OnClass is its capability to classify cells into cell types not present in the training data because it uses the Cell Ontology graph to infer cell type relationships. > I dunno, that seems to defy logic to me.

Aaron Lun (22:55:57): > Or in other words: that’s a pretty bold claim and I’ll believe it when I see it in my own hands.

Aaron Lun (23:40:27): > I was wondering why I was feeling so uncomfortable.

Aaron Lun (23:40:30): > And then I realized.

Aaron Lun (23:40:35): > I put my pants on backwards.

2020-02-05

Kevin Rue-Albrecht (02:59:05) (in thread): > I thought those stories belonged in #isee :cry:

Aaron Lun (03:12:29) (in thread): > get to work, albrecht!

Kevin Rue-Albrecht (03:13:57) (in thread): > :stuck_out_tongue:

Vince Carey (06:03:48): > to accelerate investigations: https://onclass.readthedocs.io/en/latest/howtouse.html

Vince Carey (06:04:37): > “If your training labels are not mapped to cell ontology ID, please use our natural language processing tool to map them to existing cell ontology terms.”

Jared Andrews (10:12:09): > Wonder how good it is.

Aaron Lun (11:46:50): > Yeah, it would only really be useful if it can deal with the HPCA’s gibberish terms.

Aaron Lun (11:47:54): > The other ones were a breeze to manually map.

Aaron Lun (12:42:00): > Anyway. Looking at our own backyard, we should figure out how to improve scalability. I put down some thoughts but it would be a good idea for people to do some profiling to actually nail down what is currently the most time-consuming part. I’m going to guess it is the fine-tuning, and perhaps we should put some more effort there to make it faster.

2020-02-12

Dan Bunis (18:52:31): > Is there a good reference dataset that anyone has used for brain / neuronal cell types?

Aaron Lun (18:58:43): > Human or mouse?

Aaron Lun (18:58:56): > Because scRNAseq has LOADS of mouse brain datasets.

Aaron Lun (18:59:12): > And by that I mean Linnarsson datasets

Dan Bunis (18:59:38): > human “unfortunately”

Aaron Lun (19:00:31): > There might be one or two human datasets in there. Don’t know if they’re labelled, though.

Dan Bunis (19:02:28): > I’ve found http://www.brainrnaseq.org/ and I believe their data comes from here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE73721 - Attachment (ncbi.nlm.nih.gov): GEO Accession viewer > NCBI’s Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.

Dan Bunis (19:04:16): > I’ll look through scRNAseq. I think my labmate will probably fall to some non-SingleR gene-signature based method if there isn’t something ready-made.

Aaron Lun (19:08:32): > well, it’s just mostly astrocytes AFAICT.

Aaron Lun (19:08:56): > “Download” - unavailable feature at this moment.

Dan Bunis (19:40:02): > yup and yup. Thus my having to resort to GEO, and then not convincing my labmate to annotate their own reference data anyway…:man-facepalming:

Steve Lianoglou (20:20:18): > @Dan Bunis there are a few datasets that come to mind that may be helpful > > (1) The Mathys et al. snRNA-seq Alzheimer’s dataset: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6865822/#!po=59.3750 (2) The single cell genesets from MSigDB: https://www.gsea-msigdb.org/gsea/msigdb/supplementary_genesets.jsp#SCSig - Attachment (PubMed Central (PMC)): Single-cell transcriptomic analysis of Alzheimer’s disease > Alzheimer’s disease (AD) is a pervasive neurodegenerative disorder, the molecular and cellular complexity of which remains poorly understood. Here, we profiled and analysed 80,660 single-nucleus transcriptomes from prefrontal cortex of 48 individuals …

Dan Bunis (20:23:06): > (1) happens to be the dataset that we are hoping to re-analyze & re-annotate.

Steve Lianoglou (20:23:19): > The Mathys dataset has labeled cell clusters for astrocytes, endothelial cells, excitatory neurons, inhibitory neurons, microglia, oligodendrocytes, oligodendrocyte precursors, and pericytes

Steve Lianoglou (20:25:15): > ahh, nice – well, looking forward to your re-analysis:wink:

Dan Bunis (20:26:03): > It does! And their method “works”. But it relies on Seurat’s AddModuleScore which is very lightly documented, and quite odd at the actual code level.

Steve Lianoglou (20:30:21): > It’s been a while since I peeked at Seurat internals, but if I recall correctly it was something “simple” like an aggregate geneset score normalized by an aggregate score of a random set of genes of similar size as the module, no?

Aaron Lun (20:38:44): > Hence the oddness.

Dan Bunis (20:39:56): > not entirely sure. There are zero comments in the code (https://github.com/satijalab/seurat/blob/87e2454817ed1d5d5aa2e9c949b9231f2231802f/R/utilities.R) I’ve given up on figuring it out for now.

Aaron Lun (20:40:22): > I’m curious to know what bias that normalization is expected to be correcting for.

Aaron Lun (20:48:02): > Or in other words: if you take a random set of genes, the expectation of the mean for those genes is simply the average of the expression values of all genes. So you might as well scale by the average across all genes - possibly weighted inversely by the number of genes of similar abundance, if you want to be fancy. But in that case, the relative difference in the scaling factor between cells is simply the coverage of each cell, which should already be handled by a previous normalization procedure.
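
A quick way to check the argument above is to simulate it. The following is a minimal sketch in base R on made-up data (the matrix, the control set size of 100 and the number of control sets are all assumptions, and this is not the Seurat `AddModuleScore` code). It compares the per-cell mean over all genes with the per-cell mean of random control sets, averaged over many draws.

```r
set.seed(42)
n_genes <- 2000
n_cells <- 50

# Simulated log-expression matrix (genes in rows, cells in columns).
# 'depth' is an assumed per-cell shift standing in for coverage differences.
depth <- rnorm(n_cells, sd = 0.5)
expr <- matrix(rnorm(n_genes * n_cells, mean = 2), n_genes, n_cells)
expr <- sweep(expr, 2, depth, "+")

# Per-cell average across all genes.
all_gene_mean <- colMeans(expr)

# Per-cell mean of a random control set of 100 genes, repeated many times.
n_sets <- 500
control_means <- replicate(
  n_sets,
  colMeans(expr[sample(n_genes, 100), , drop = FALSE])
)

# Averaging over the random draws approximates the expectation.
expected_control <- rowMeans(control_means)
max_abs_diff <- max(abs(expected_control - all_gene_mean))
```

If `expected_control` tracks `all_gene_mean` closely, then scaling by a random control set mostly removes a per-cell shift that an earlier normalization step should already have handled.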

Aaron Lun (20:49:31): > There are reasonable arguments for adjusting the weight of each gene in computing the aggregate, but whatever you decide, I will bet there is a better way of doing it than picking a random gene set.

Aaron Lun (21:02:32): > I will add that the concept of a single-sample gene set score has been around for a long time and has generally been pretty unconvincing to me.

2020-02-14

Andrew Skelton (05:10:03): > @Andrew Skelton has joined the channel

2020-02-19

Aaron Lun (18:34:17): > Just tried using the multiple-reference mode to try to handle a 200k reference from 10 different studies. Wasn’t great.

Aaron Lun (18:34:41): > Need to reflect … again … on how we’re combining these references.

2020-02-26

Dan Bunis (13:18:44): > I’ll be teaching SingleR (to novice coders) during the second day of a workshop on Friday! I’m putting a few overview slides together now.

Jared Andrews (13:24:45): > I’m starting to use the multiple-reference mode with basically all of the immune references, and it’s doing a decent job, imo. Harmonizing the labels afterwards is annoying, but I think I’m getting better resolution than I was using any of them individually. Then again, basically all of my cells are T cells, so I don’t know how it works for more heterogeneous samples.

Aaron Lun (13:49:42): > Oh, when I said “wasn’t great”, I meant in terms of speed.

Aaron Lun (13:49:53): > Improving speed may be a simple task of inverting how we’re combining stuff.

Aaron Lun (13:50:48): > So, annotate on a per-reference basis (WITHOUT union’ing the gene sets) and then, only after the best label is found for each reference, compute the score for the union of marker sets. This should be much faster.
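
The inverted order described above can be sketched in base R. This is a toy illustration only, assuming Spearman correlation as the score and arbitrary marker sets; `make_ref`, `score` and the simulated references are hypothetical names and not SingleR internals.

```r
set.seed(1)
genes <- paste0("g", 1:200)

# Hypothetical helper: simulate one reference with a profile per label.
make_ref <- function(labels) {
  matrix(rnorm(length(genes) * length(labels)),
         nrow = length(genes),
         dimnames = list(genes, labels))
}

refs <- list(immune = make_ref(c("T", "B")),
             brain  = make_ref(c("neuron", "astro")))

# Assumed marker sets, one per reference (arbitrary here).
markers <- list(immune = genes[1:50], brain = genes[31:90])

# A test cell that resembles the "astro" profile plus noise.
test_cell <- refs$brain[, "astro"] + rnorm(length(genes), sd = 0.3)

score <- function(cell, profile, use) {
  cor(cell[use], profile[use], method = "spearman")
}

# Step 1: best label within each reference, using only that
# reference's own markers (no union of gene sets yet).
best <- lapply(names(refs), function(r) {
  labs <- colnames(refs[[r]])
  s <- vapply(labs, function(l) {
    score(test_cell, refs[[r]][, l], markers[[r]])
  }, numeric(1))
  names(which.max(s))
})
names(best) <- names(refs)

# Step 2: recompute scores on the union of markers, but only for
# the single winning label from each reference.
union_markers <- Reduce(union, markers)
recomputed <- vapply(names(refs), function(r) {
  score(test_cell, refs[[r]][, best[[r]]], union_markers)
}, numeric(1))

final_label <- best[[names(which.max(recomputed))]]
```

The expensive step on the union of genes then runs once per reference rather than once per label, which is where the speed-up would come from.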

Jared Andrews (15:48:32): > Ohh, okay. Yeah, I did notice it’s a fair amount slower.

2020-02-29

Aaron Lun (13:18:12): > Just realized Windows was erroring out. Because it can’t find ï. Because of course it can’t.

2020-03-01

Aaron Lun (18:44:13): > @Jared Andrews re. combining, can you test out https://github.com/LTLA/SingleR/issues/99? See comments in https://github.com/LTLA/SingleR/issues/94.

Aaron Lun (18:44:19): > Also @Dan Bunis.

Jared Andrews (19:04:42): > I’m at a conference this week, but yeah, I’m back to single cell stuff and have been playing with multiple references. I will test it thoroughly next week.

Aaron Lun (19:12:49): > Just comment on the issue if you have results that can be shown, a la the same style as https://github.com/MarioniLab/DropletUtils/issues/36

Jared Andrews (20:03:12): > Will do.

2020-03-02

Dan Bunis (14:03:10): > Will do too

2020-03-03

Aaron Lun (20:09:28): > Just realized that even the regular fine tuning is pretty slow.

Aaron Lun (20:13:04): > took around 5-6 minutes for 20k cells against a 5k reference.

Jared Andrews (20:44:23): > I’ve only used against bulk samples. Wasn’t the pseudobulk stuff meant to help speed up using the single cell references?

Aaron Lun (20:45:40): > Yeah, it was. But I don’t have a good idea of how much is lost when you pseudobulk. Maybe not a lot. Who knows.

Dan Bunis (20:45:59): > Still way faster than before we (aka you) took over the calculation steps.

Dan Bunis (20:46:48): > It ran in ~30sec for a 1k cell test vs 5k cell ref for me.

Dan Bunis (20:46:56): > But that is admittedly TINY

Aaron Lun (20:47:01): > For me, any step that takes more than 1 minute but less than 10 minutes is defined as “slow”.

Aaron Lun (20:47:21): > If it takes more than 10 minutes, then I’m probably doing something else anyway so I don’t notice or care how long it takes.

Aaron Lun (20:47:42): > So paradoxically, garnett is “not slow”.

Aaron Lun (20:48:03): > ho ho ho

Dan Bunis (20:48:10): > Agreed. But when things take that long, it’s annoying as hell if there’s any need for iterative improvement

Dan Bunis (20:48:14): > LOL

Dan Bunis (20:51:00): > Is the pseudo-bulk something that happens automatically? Or does it require manual turn-on? I can’t remember and I have the devel version on a different computer than I’m currently on.

Aaron Lun (20:51:28): > no, it doesn’t happen automatically, you need to explicitly call `aggregateReference`. This hasn’t seen much use AFAIK.
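
To make the idea of pseudobulking concrete, here is a simplified base R sketch. It assumes the aggregation amounts to clustering the cells within each label and averaging each cluster, with the number of clusters set to the square root of the number of cells; `pseudobulk` is a hypothetical helper written for illustration and not the SingleR implementation.

```r
set.seed(7)

# Simulated log-expression reference: 100 genes x 300 cells, three labels.
labels <- rep(c("astro", "neuron", "microglia"), times = c(150, 100, 50))
ref <- matrix(rnorm(100 * 300), nrow = 100,
              dimnames = list(paste0("g", 1:100), NULL))

# Hypothetical helper: k-means within each label, then average per cluster.
pseudobulk <- function(mat, labels) {
  out <- list()
  for (lab in unique(labels)) {
    sub <- mat[, labels == lab, drop = FALSE]
    k <- max(1L, floor(sqrt(ncol(sub))))       # assumed number of centers
    cl <- kmeans(t(sub), centers = k)$cluster  # cluster the cells, not genes
    prof <- vapply(seq_len(k), function(i) {
      rowMeans(sub[, cl == i, drop = FALSE])
    }, numeric(nrow(sub)))
    colnames(prof) <- rep(lab, k)              # each profile keeps its label
    out[[lab]] <- prof
  }
  do.call(cbind, out)
}

aggr <- pseudobulk(ref, labels)
aggr_labels <- colnames(aggr)
```

Here 300 cells collapse to a few dozen profiles, so a classifier that compares each test cell against every reference column has far fewer comparisons to make. How much resolution is lost in the process is exactly the open question raised in this conversation.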

Aaron Lun (20:51:43): > Would be nice to get some real-world experience of how it handles.

Dan Bunis (20:54:15): > :+1:I will include that in my testing this week. Likely Thursday or Friday

Aaron Lun (20:54:22): > cool

2020-03-04

Aaron Lun (01:37:10): > BTW do either of you want to help me out with SingleR maintenance? I don’t like to blow my own trumpet, but this would be a good opportunity to learn package development from one of the best.

Dan Bunis (03:20:53): > :rolling_on_the_floor_laughing:

Dan Bunis (03:21:54): > not sure exactly what the responsibilities are, but I am definitely interested in learning.

Aaron Lun (03:30:36): > It’ll be sort of like the rule of 2 that the sith use. Basically, you make an increasing number of contributions, and once you understand enough about the package, you challenge me for the maintainer rights. If you win, you become the maintainer. And then the cycle continues.

Aaron Lun (03:30:43): > For example, I did this for scater.

Aaron Lun (03:30:59): > and scRNAseq.

Aaron Lun (03:32:02): > I think the main difference from the sith is that I didn’t kill the previous maintainers.

Davide Risso (03:34:29) (in thread): > That was very kind of you Aaron!

Federico Marini (03:37:16): > @Davis McCarthy still alive and rocking?

Dan Bunis (04:01:20): > bahaha. Idk that I foresee wresting it from you cuz I know literally zero C at the moment. But I like the sith reference and the feeling of power from knowing that you already gave me the ability to commit the SingleR pull #99

Jared Andrews (10:02:05): > I am definitely interested but am completely tapped until at least July. Once I get a “normal” position, I really want to contribute more to the Bioc ecosystem.

Dan Bunis (12:01:02): > same

Alan O’C (12:58:38): > @Alan O’C has joined the channel

Dan Bunis (15:17:47): > One of my colleagues just asked me about a new potential reference set that she wants to use, and that we might want to add to SingleR. The main “drawback” may be that it’s HUGE. But I put that in quotes because really this also means we have a potential test set for `aggregateReference` - File (PNG): image (2).png

Dan Bunis (15:18:35): > Comes from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6434952/ - Attachment (PubMed Central (PMC)): The single cell transcriptional landscape of mammalian organogenesis > Mammalian organogenesis is an astonishing process. Within a short window of time, the cells of the three germ layers transform into an embryo that includes most major internal and external organs. Here we set out to investigate the transcriptional dynamics …

Aaron Lun (15:22:02): > If it’s a sc dataset, seems like it should go into scRNAseq

Dan Bunis (15:24:02): > I was thinking that too. But it really is HUGE compared to what’s in scRNAseq now AFAIK. Is that a problem? 2,072,011 cells

Aaron Lun (15:24:52): > Not for scRNAseq itself. Obviously what you want to use it for is another matter.

Aaron Lun (15:26:05): > If it’s really that big, one might also consider creating an entirely new EHub package for its retrieval, e.g., https://bioconductor.org/packages/release/data/experiment/html/MouseGastrulationData.html - Attachment (Bioconductor): MouseGastrulationData > Provides processed and raw count matrices for single-cell RNA sequencing data from a timecourse of mouse gastrulation and early organogenesis.

Aaron Lun (15:31:40): > Oh wait. I just recognized what dataset you were referring to.

Aaron Lun (15:32:04): > It’s that one with a median UMI count per cell of…:drum_with_drumsticks:… 600.

Aaron Lun (15:32:43): > I guess they did get it published in Nature, so… win?

Dan Bunis (15:37:28): > :rolling_on_the_floor_laughing:that is tiny! Another reason thataggregateReferencemight be very useful. The dataset would become a reference-set for a field that currently lacks a good ref-set, so I do think it’s worth pursuing (and my colleague expressed interest in helping to make that happen!)

Dan Bunis (15:42:14): > pipeline though: > 1. start with counts and labels > 2. Make SCE & add labels > 3. scater::logNormCounts > 4. Ready for scRNAseq? (and SingleR testing…)

Aaron Lun (15:45:04): > If you can, I would say that the upload to EHub would be right at the top, around about 1&2. Basically as raw as possible.

Dan Bunis (15:50:19): > okay. I think that should be quite possible through some simple transfer of labels between objects here, https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads. Though having the logNormCounts included, and the data pre-aggregated, would be super helpful for SingleR users because of the size.

Aaron Lun (15:54:48): > We can store the aggregated forms as separate EHub entries under the control of SingleR. But the raw data should also be made available as itself.

Dan Bunis (15:55:14): > :ok_hand:

2020-03-05

Aaron Lun (16:28:05): > Working on making the fine tuning faster, preliminary work in `tester`. Note that it has more overhead so it will actually be slower for small datasets, but who cares about that.

Dan Bunis (16:43:19): > FWIW, I certainly agree that reducing the calc time for larger datasets is worth a bit more calc time for small datasets.

Aaron Lun (16:44:52): > At least I hope it’ll be faster. From a theoretical perspective, I avoid recomputing ranks for the references for each cell, so that should be a clear win. But there is quite a lot of overhead so I don’t know where or if the benefit is realized.

Aaron Lun (16:45:31): > On the plus side, it will be “simpler” in the sense that I can discard all of the C++ code.

Aaron Lun (16:46:15): > On the minus side, it will use more memory - it’s literally a memory/speed trade-off as I cache the ranks across cells to avoid recomputing them.

Dan Bunis (16:52:08): > Refreshing myself on the current method now

Aaron Lun (16:52:52): > There’s a lot of C++ code in there.

Aaron Lun (16:53:02): > But if you can read it, I will be impressed.

Dan Bunis (16:56:14): > That’s probably why I can’t remember the previous method very well lol

Dan Bunis (16:56:55): > and I currently just know generally what is done instead. But I’ll try

Aaron Lun (17:01:14): > If you can read someone else’s C code without flinching, you have reached true mastery.

Aaron Lun (17:01:28): > C++ is not so bad but can get pretty nasty with loads of templating.

Dan Bunis (17:11:09): > (now has become after a meeting I’d forgotten about)

Jared Andrews (19:10:21): > Yeah, my experience with C++ is writing a python wrapper around some and hunting memory leaks with valgrind.

Dan Bunis (19:14:19): > My experience amounts to reading your code today… so anything I can add here may be naive.

Dan Bunis (19:15:37): > But one worry with increasing the memory need is that it might decrease the limit to the number of cells (in test and/or ref) that someone can run through SingleR on their machine.

Jared Andrews (19:17:43): > Depends how great the increase is. I haven’t been running OOM with the bulk references on like 25k cells with 16 GB, but I also haven’t been tracking usage at all.

Dan Bunis (19:31:42): > For sure. I think 16GB RAM is pretty common. Maybe we should first just make sure the new method runs on our systems.

Dan Bunis (19:31:44): > But if the added memory demand is just the ranks, then that’s a markers by ref_samples matrix (correct?). That’s likely only in GB size scale for references that can and should be pseudo-bulked with aggregateReference.

2020-03-06

Aaron Lun (00:41:27): > Well. Just tested it, and it sucked.

Aaron Lun (00:41:55): > It chews through a lot of memory. Too much, I would say.

Aaron Lun (00:43:19): > There are bits that I could probably use to shave off some time compared to the existing implementation. But it’s less of a slam dunk than before.

Aaron Lun (12:26:30): > Got some shade thrown at SingleR by the scCATCH manuscript.

Aaron Lun (13:56:29): > aggregation options now in `trainSingleR`.

Jared Andrews (14:19:48): > The new version or the legacy one? I haven’t seen a paper that’s compared the new version with other options yet.

Aaron Lun (14:19:57): > ¯\_(ツ)_/¯

Aaron Lun (14:20:08): > Well, performance (in terms of results) should be the same.

Jared Andrews (14:21:01): > Yeah, performance (in terms of accuracy) has typically been good, speed was the main drawback previously.

Dan Bunis (14:31:40): > By starting post-clustering, scCATCH kicks the cell groupings can to the user. Perhaps that’s wise, but personally, I always run SingleR by cell so that I have an independent check of whether cluster lines were valid. (And I often then give clusters their max per-cell labels.)

Jared Andrews (14:39:41): > Same. For really fine-grained subsets, like T cells, clusters can bleed pretty hard, so per cell info is really helpful.

Pierre-Luc Germain (14:54:10): > @Pierre-Luc Germain has joined the channel

2020-03-08

Jared Andrews (22:38:45): > @Aaron Lun What’s the easiest way to use the `combineRecomputedResults` function? Not seeing any new parameters to set it in `SingleR` or `classifySingleR`, but I might just be missing it.

Aaron Lun (23:30:09): > It’s meant to be called directly. The selling point is that you can use the results computed from each reference directly; i.e., you don’t have to know that you want to combine things when you train on each reference, unlike with our previous `combineCommonResults` (or whatever I renamed it).

Aaron Lun (23:32:46): > Currently it requires a list of references and a list of labels, which is a bit arduous.

2020-03-09

Jared Andrews (00:07:14): > Ah okay.

Jared Andrews (00:07:28): > Will give it a test in the next day or two.

Aaron Lun (13:14:30): > aggregation capabilities should be in BioC-devel. Note the new `aggr.ref` args in `SingleR`.

Dan Bunis (13:30:41): > I had stumbled figuring this out too, and ended up with results that were exactly the same. But I’ll update my code for the new method and hopefully have some actual comparison together today/tomorrow.

Jared Andrews (17:21:05): > There’s an issue with the ontology stuff for the Novershtern dataset, seems to be a missing value. > > dmap <- NovershternHematopoieticData() > > Error in .add_ontology(se, "novershtern", match.arg(cell.ont)): all(!is.na(m)) is not TRUE > Traceback: > > 1. NovershternHematopoieticData() > 2. .add_ontology(se, "novershtern", match.arg(cell.ont)) > 3. stopifnot(all(!is.na(m))) >

Aaron Lun (17:21:36): > Should have been fixed in the last push.

Aaron Lun (17:21:47): > It’s those damn umlauts

Jared Andrews (17:21:48): > Ah, okay.

Aaron Lun (17:22:29): > try reinstalling from GH and see if that helps. Well, it should, otherwise it would have never passed CHECK.

Jared Andrews (17:23:02): > Can you push to the `combined` branch? Trying to test the new combine functions side by side.

Aaron Lun (17:24:14): > done

Jared Andrews (17:25:51): > :+1:

Dan Bunis (19:05:23): > I’m getting an error when I try to download the BlueprintEncodeData now with the LTLA/SingleR@combined version. This had been working for me Friday / before the above merge.

Dan Bunis (19:07:24): > HPCA, ImmGen, and MouseRNAseq give no error. But `BE <- BlueprintEncodeData()` gives this - File (PNG): image.png

Dan Bunis (19:09:05): > I’m going to reinstall from master and grab and save an .rds for my testing. But this will need to be remedied before any final merge.

Aaron Lun (19:09:36): > Are you on release or devel?

Aaron Lun (19:10:21): > Because `combined` is operating on BioC-devel (specifically, pulling from the devel version of EHub). So if you installed it into a BioC-release environment, it’s probably not finding version 1.2.0 of the new coldata.

Aaron Lun (19:10:25): > I would think.

Dan Bunis (19:11:17): > :+1: probably that

Dan Bunis (19:13:34): > yup

Jared Andrews (19:23:39): > Ah damn, mine will probably hit that too. Bah, been putting off installing R 4.0 and packages

Jared Andrews (19:24:54): > Don’t know if it’s on conda anywhere yet. Installing on our cluster without it is pretty annoying.

Jared Andrews (19:25:46): > Might just download them locally tbh

Dan Bunis (19:30:51): > FWIW, the devel version of SingleR is working for me even with everything else at release v3.10. I just had to download and save the problematic ref from release before (re)updating.

Dan Bunis (19:31:20): > Soooo you can probably continue putting that off a bit longer :man-shrugging:

Aaron Lun (19:31:55): > If you’re a package developer (or involved in it), it is always a good idea to have two parallel versions of R running on your machine.

Aaron Lun (19:32:11): > One for regular use (or debugging packages in release), and another for package development.

Aaron Lun (19:32:26): > I have this setup on all of my machines.

Aaron Lun (19:32:57): > The “overhead” involves running `BiocManager::install()` once in a while and pressing `a` to keep everything up to date.

Aaron Lun (19:33:28): > If you’re installing `SingleR` from source, you already have the necessary C(++) toolchain, so that’s the hardest part taken care of already.

Jared Andrews (19:42:44): > Yeah, I do that locally too. I should just run locally, I just can’t run side by side then.

2020-03-10

Aaron Lun (00:18:35): > Hm. Landed myself in a bit of a pickle regarding optimization of fine-tuning.

Aaron Lun (00:20:17): > First attempt was to sacrifice memory efficiency for speed, by basically caching the ranked references computed for one cell for re-use when fine-tuning another cell. This was far too memory intensive - you could have one cached set of references for every combination of labels! - so I tossed it.

Aaron Lun (00:21:02): > Second attempt was to only cache the last ranked reference, and order the cells so that all cells with the same combination of labels above the tuning threshold were processed together. This was… better… but it was still slower than the current implementation, probably due to the looping overhead.
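
The second attempt described above can be sketched in base R. This is only an illustration of the caching pattern under stated assumptions: the candidate label sets are made up, `rank` on whole columns stands in for whatever ranking the fine-tuning step really performs, and none of the names come from SingleR.

```r
set.seed(3)

# Toy reference: 50 genes x 6 label profiles.
ref <- matrix(rnorm(50 * 6), 50, 6,
              dimnames = list(NULL, c("a", "b", "c", "d", "e", "f")))

# Hypothetical candidate labels per test cell (those above the tuning threshold).
candidates <- list(c("a", "b"), c("c", "d"), c("a", "b"),
                   c("b", "a"), c("c", "d"), c("e", "f"))

# One key per label combination, independent of the order of the labels.
keys <- vapply(candidates, function(x) paste(sort(x), collapse = "|"),
               character(1))

# Reorder cells so that identical combinations are processed consecutively.
ord <- order(keys)

n_rankings <- 0L
last_key <- NULL
last_ranked <- NULL
for (i in ord) {
  if (!identical(keys[i], last_key)) {
    # Cache miss: rank the reference subset for this combination only.
    last_ranked <- apply(ref[, sort(candidates[[i]]), drop = FALSE], 2, rank)
    last_key <- keys[i]
    n_rankings <- n_rankings + 1L
  }
  # Fine-tuning of cell i would reuse 'last_ranked' here.
}
```

Only one ranked reference is held in memory at a time, and the ranking runs once per distinct combination rather than once per cell. The cost is that `ord` no longer visits cells in their stored order, which is the random-access penalty mentioned later for file-backed matrices.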

Aaron Lun (00:23:28): > Well, okay, so I moved that entire loop into C++. But while I was trying to solve all the segmentation faults, I realized that, because we need to reorder cells, we don’t process consecutive cells, and each cell has to be potentially pulled out of file multiple times, which means that we pay a speed penalty for random access (especially for file-backed matrices).

Aaron Lun (00:23:31): > So. Bum.

Dan Bunis (00:27:50): > =/

Dan Bunis (00:29:57): > that’s kinda dead on arrival as an option

Aaron Lun (01:14:38): > Well. After several days of effort, I shaved off 0.6 seconds from a fine-tuning runtime of ~27 seconds. I guess I give up.

Jared Andrews (11:51:34): > I’m seeing about 30% runtime improvements with the recompute method over the common one for combining results, which is pretty nice.

Aaron Lun (11:52:20): > Excellent, excellent.

Jared Andrews (12:40:51): > Results are similar, but it seems to assign to the finer grained labels slightly more frequently, which is exactly what I was hoping for personally. > > Are we still trying to determine the best way to use ontologies to harmonize? I know there was some discussion with ontoProc above, but it seems unresolved still.

Aaron Lun (12:41:48): > I have some ideas that I posted at https://github.com/LTLA/SingleR/issues/68

Aaron Lun (12:54:29): > Perhaps @Vince Carey would have some thoughts.

Dan Bunis (13:24:13): > I didn’t have the same success myself. I found run time better by ~30% for the old method for one of my datasets, and by >80% for another (1176 sec new versus 206 sec old).

Dan Bunis (13:25:28): > In terms of accuracy, I got mixed results.

Jared Andrews (13:27:36): > I am getting odd results when trying to use it with clusters.

Jared Andrews (13:28:40): > Scratch that, bug in my code.

Jared Andrews (13:36:03): > Should also note that I’m using lots of cells (~25k) and all of the immune reference sets plus one custom one, so lots of labels.

Dan Bunis (13:36:21): > I should add that in the test that I ran with the gigantic time difference, the new method far outperformed the old in terms of accuracy. Old: 15% match to published. New: 49%. This was in annotation of a brain dataset with 70634 cells using refs = HPCA + BlueprintEncode + custom brain scRNAseq data.

Aaron Lun (13:37:14): > Woah. 49% still sounds pretty bad.

Dan Bunis (13:37:49): > The references are great for it. T cells are being called way more than they should.

Dan Bunis (13:39:20): > Also, I expect the differential gene capture issue is likely a problem here. I haven’t done anything to screen the refs before running

Jared Andrews (13:39:46): > I screened all the refs (and test) in mine for the recomp method.

Aaron Lun (13:45:37): > what do you guys mean by screening? Like, picking sensible references?

Dan Bunis (13:46:16): > I was about to ask that. How do we suggest that users make corrections after screening for gene mismatches? I’ve been trying to figure that out.

Dan Bunis (13:47:57): > I simply had 2 annotated brain datasets, so I used the smaller one as ref, and then threw in the built in human refs as well

Aaron Lun (13:52:56): > Right - so you’re assuming that the original assignment between the two annotated brain datasets is the “truth”? I’d like to know what causes cells to deviate from the truth upon recomputing the scores; what do the recomputed scores look like for those cells?

Vince Carey (13:53:49) (in thread): > ontoProc::onto_plot2 (1.9.9 in bioc git) now invisibly returns graphNEL … we want to operate on the graph based on node degree, to address issue #68 in SingleR … I will put some time into this. I agree that a plot of a large dag is not very useful.

Aaron Lun (13:54:56) (in thread): > thx vince

Dan Bunis (14:07:46) (in thread): > I think the answer is no, but it’s where your “recomputing” takes me… Would you like me to re-score the smaller dataset based just on itself? (perhaps after pseudobulk so there are no/fewer correlations of 1)

Dan Bunis (14:08:25) (in thread): > Short answer is that 42% are called as T cells but there should be very few T cells in the dataset

Aaron Lun (14:08:47) (in thread): > Hold on, hold on.

Aaron Lun (14:09:55) (in thread): > What happens if you just use one of your brain datasets as the reference and the other one as the test? Is it sensible to treat this as a “ground truth” or at least a reasonable thing?

Dan Bunis (14:10:21) (in thread): > Sure!

Aaron Lun (14:10:27) (in thread): > Okay, good.

Dan Bunis (14:10:56) (in thread): > I was honestly trying to give our combine fxns a hard time.

Aaron Lun (14:11:45) (in thread): > So, if you assign test cells on a per-reference basis (i.e., running it as if you didn’t know about any other references), then the brain2brain results should be sensible prior to recomputing the scores.

Aaron Lun (14:12:21) (in thread): > As the recomputation takes the per-reference assignments, it should be easy to see how the recomputation deviates from the sensible baseline provided by the brain2brain results.

Dan Bunis (14:12:24) (in thread): > should be.

Dan Bunis (14:12:48) (in thread): > I do expect some users to want to be able to use a single reference set for everything.

Aaron Lun (14:13:13) (in thread): > The recomputed scores should give some indication. Have a look at all the T cells and see what happened to the recomputed scores.

Dan Bunis (14:13:37) (in thread): > But the half assed grouping I’d been using (HPCA + Blueprint-ENCODE + brain) is probably not a good stand in for a comprehensive ref set

Aaron Lun (14:13:54) (in thread): > What do you mean? The brain2brain assignment?

Dan Bunis (14:14:30) (in thread): > edited above

Aaron Lun (14:14:51) (in thread): > If the brain2brain assignment is the most sensible, the ideal situation with the recomputation is that those per-reference assignments are recognized as the best and preserved in the combined output.

Aaron Lun (14:14:52) (in thread): > If that is not the case, we should figure out why. Might be a bug.

Aaron Lun (14:15:17) (in thread): > Anyway lunching brb

Dan Bunis (14:15:20) (in thread): > I haven’t run just brain2brain, but am now.

Dan Bunis (14:40:18) (in thread): > 99% accuracy versus published calls when I just annotate brain2 (70k cells) with brain1 (8k cells) without throwing in additional mismatched refs

Jared Andrews (14:42:53): > By screening I just meant restricting to common genes across all sets.

Aaron Lun (15:09:56) (in thread): > I should mention that if you’re using the recomputed strategy, you should have run brain2brain at some point.

Dan Bunis (15:10:19) (in thread): > :man-facepalming:

Dan Bunis (15:10:46) (in thread): > yes, you are right. I had, and it was just buried.

Aaron Lun (15:38:11) (in thread): > Right. Now, the question becomes what causes those guys to change to a T cell label? What do the scores look like?

Jared Andrews (16:01:22): > Using cluster mode yields a bit of trouble: > > bp <- readRDS("./ref/bp.rds") > mona <- MonacoImmuneData() > dice <- DatabaseImmuneCellExpressionData() > dmap <- readRDS("./ref/dmap.rds") > hpca <- HumanPrimaryCellAtlasData() > trm <- readRDS("./ref/GSE131770_SingleR.rds") > > refs <- list(BP=bp, Monaco=mona, DICE=dice, DMAP=dmap, HPCA=hpca, TRM=trm) > labels <- list(bp$label.fine, mona$label.fine, dice$label.fine, dmap$label.fine, hpca$label.fine, trm$label.fine) > clusters = c("SCT_snn_res.0.8", "SCT_snn_res.1", "SCT_snn_res.1.2", "SCT_snn_res.1.5") > > sce <- readRDS("./ref/test.rds") > > for (clust in clusters) { > out.suf <- "recomp" > preds <- list() > > all.genes <- lapply(refs, rownames) > common <- Reduce(f = intersect, x = all.genes) > common <- intersect(common, rownames(sce)) > > for (i in seq_along(refs)) { > refs[[i]] <- refs[[i]][common, ] > pred <- SingleR(test = sce[common, ], ref = refs[[i]], labels = labels[[i]], method = "cluster", clusters = sce[[clust]]) > preds[[i]] <- pred > } > > pred <- combineRecomputedResults(preds, test = sce[common, ], ref = refs, labels = labels) > } > 

Jared Andrews (16:01:26): > > Error in combineRecomputedResults(preds, test = sce[common, ], ref = refs, : cell/cluster names in 'results' are not identical > Traceback: > > 1. combineRecomputedResults(preds, test = sce[common, ], ref = refs, > . labels = labels) > 2. stop("cell/cluster names in 'results' are not identical") >

Jared Andrews (16:06:43): > Seems like I should be providing the cluster column for test, but not sure the function supports that yet.

Aaron Lun (16:09:00): > That’s right. You’d need to take the row means of each cluster outside the function, otherwise combineRecomputedResults gets confused.

Aaron Lun (16:16:20): > Again, this will not be a problem if we decide to integrate it, etc. etc.

Jared Andrews (16:16:49): > Right, just trying to get around it to test.

Jared Andrews (16:48:15): > For reference, way around this currently is something like: > > clust.avg <- sapply(rownames(preds[[1]]), function(x) rowMeans(assays(sce)$logcounts[, colData(sce)[[clust]] == x])) > pred <- combineRecomputedResults(preds, test = clust.avg[common, ], ref = refs, labels = labels) >
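
The per-cluster averaging in the workaround above can be sketched on toy data in base R. This is only an illustration: `average_by_cluster` is a hypothetical helper, the matrix values are invented, and it is not what SingleR does internally. It uses `drop = FALSE` so that a cluster containing a single cell still yields a matrix for `rowMeans`.

```r
# Toy log-expression matrix: 4 genes x 6 cells (values are made up).
logcounts <- matrix(
    c(1, 2, 3, 4,
      3, 4, 5, 6,
      10, 10, 10, 10,
      12, 12, 12, 12,
      0, 1, 0, 1,
      2, 1, 2, 1),
    nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)
clusters <- c("a", "a", "b", "b", "c", "c")

# Hypothetical helper: one column of mean expression per cluster.
# drop = FALSE keeps a matrix even when a cluster has a single cell.
average_by_cluster <- function(mat, clusters) {
    ids <- sort(unique(clusters))
    out <- vapply(
        ids,
        function(id) rowMeans(mat[, clusters == id, drop = FALSE]),
        numeric(nrow(mat))
    )
    rownames(out) <- rownames(mat)
    out
}

clust.avg <- average_by_cluster(logcounts, clusters)
print(clust.avg)
```

The resulting genes-by-clusters matrix has one column per cluster name, which is the shape the workaround passes as `test`.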

Jared Andrews (16:48:43): > I’m sure there are probably less gross ways to do it.

Aaron Lun (17:29:53): > Alright, I’ll fix it when I get home. Sheesh.

Aaron Lun (17:30:19) (in thread): > nag

Dan Bunis (17:32:59) (in thread): > yes yes. Had to switch gears for a bit, but I am looking into it.

Aaron Lun (17:34:06): - File (GIF): lwa.gif

Aaron Lun (17:34:38): > on the plus side, those business class upgrade offers just keep rolling in.

Dan Bunis (17:35:36) (in thread): > single-ref scores, grouped by final call, faceted by whether they were called correctly in the multi-ref calling vs not (with “not” split into called as T cells vs other) - File (PDF): vsBrain.pdf

Dan Bunis (17:37:02) (in thread): > combineRecomputed scores (lots of “missing” data due to scores not being calculated except for top labels), same description as above otherwise. - File (PDF): vsBrain.HSPC.BE.pdf

Jared Andrews (17:37:42): > Hey, I just needed to get it working for my own stuff. No rush.

Aaron Lun (17:38:35) (in thread): > hm.

Dan Bunis (17:38:57) (in thread): > It’s surprising to me that the scores seem to mostly be 0 or 1 with few in between in the recomputed. And especially that there are some 0s in the “correct” column?

Aaron Lun (17:39:23): > And I just wanted an excuse to put in a gif.

Aaron Lun (17:41:14) (in thread): > hmmm…

Aaron Lun (17:41:35) (in thread): > Don’t suppose you could do me a favor and try to figure out what happened. Do you know enough about the algorithm?

Aaron Lun (17:41:59): > I shall be doing that more often now that I know that slack supports gifs.

Aaron Lun (17:42:25): > :alisa:

Dan Bunis (17:42:51) (in thread): > I can normally figure things out, but I won’t turn down an overview to help me get started!

Aaron Lun (17:42:56): > As in, uploaded gifs, I know that these emoticon things have been around forever.:konata:

Dan Bunis (17:43:17) (in thread): > jk yes, I think I do know it.

Dan Bunis (17:43:48) (in thread): > I have read the issues and everything. I’ll dig in more.

Aaron Lun (17:44:03) (in thread): > Okay. More deets in ?combineRecomputedResults, and you can always RTFC like a pro.

Aaron Lun (17:44:30): > Whoops, don’t know why that ended up here.

Dan Bunis (17:44:58): > so #rude

Aaron Lun (17:45:13): > The F is fantastic, obviously.

Dan Bunis (17:49:21): > sure Jan

Dan Bunis (17:50:45) (in thread): > I’ve already grossly checked that ~90% of the genes from the brain ref would be retained.

Dan Bunis (17:51:19) (in thread): > I’m going to check into whether any of the dropped genes were original markers.

Dan Bunis (17:52:13) (in thread): > Then I’ll dig in further with the actual code.

Aaron Lun (19:05:57): > One option is to use all markers during recomputation. The current system only uses the markers for the best label, which has some obvious blind spots.

Aaron Lun (19:07:27): > For example, if the only genes are the markers that are upregulated in the right label, you will never have anything that is not upregulated, which means that the correlation will be poor, because you’ll be looking at the correlation across genes that are all upregulated!

Aaron Lun (19:08:07): > It’s like you’re drawing a line; you can imagine that if you have points at the origin and points at, say, (1, 1), you can draw a decent line.

Aaron Lun (19:08:27): > But if you don’t have points at the origin, then the line just kind of floats around at (1,1) and isn’t particularly helpful.

Aaron Lun (19:08:43): > TBH I thought that wouldn’t be so much of a problem because the markers for the other (incorrect) labels should have provided those points at the origin, but let’s have a look at what the markers actually are.
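
The line-drawing argument above can be checked with a few invented numbers. This is a sketch only: the expression values are made up, and using Spearman correlation here is an assumption about the scoring rather than something stated in the thread.

```r
# Reference and test expression for five marker genes that are all
# upregulated in the label of interest (values are invented).
ref_up  <- c(1.00, 1.10, 0.90, 1.05, 0.95)
test_up <- c(1.05, 0.90, 1.00, 0.95, 1.10)

# Five further genes that are not upregulated ("points at the origin").
ref_off  <- c(0.00, 0.10, -0.10, 0.05, -0.05)
test_off <- c(0.05, -0.10, 0.00, -0.05, 0.10)

# Only upregulated genes: the ranking is driven by noise around (1, 1).
cor_up <- cor(ref_up, test_up, method = "spearman")

# Adding the non-upregulated genes anchors the line at the origin.
cor_all <- cor(c(ref_up, ref_off), c(test_up, test_off), method = "spearman")

print(c(up_only = cor_up, with_origin = cor_all))
```

With these values the correlation over the upregulated genes alone is negative, while adding the genes near the origin makes it clearly positive, even though the within-group noise is unchanged.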

Aaron Lun (19:09:02): > Side note: Okay, that gif was incredibly distracting so I removed it.

Aaron Lun (20:09:57): > BTW @Dan Bunis you should have gotten a reinvitation to be a collaborator on the repo.

Dan Bunis (20:16:31): > I haven’t gotten anything. Should this come by email?

Aaron Lun (20:16:47): > No, you should get a letter with a wax seal.

Aaron Lun (20:16:56): > By pigeon.

Aaron Lun (20:17:07): > Hm. Will try resending it when I get home.

Aaron Lun (20:18:13): > I wonder why it does that.

Aaron Lun (20:34:42): > I do remember it saying “copied” but I thought that meant it copied the invite.

Aaron Lun (22:16:49): > @Dan Bunis it is done

Aaron Lun (22:17:03): > gee you have such a hard username to remember

Aaron Lun (22:17:14): > should have called yourself “danbun” or something cool.

Aaron Lun (22:17:44): > oh wait “da-bun”. That would be pretty cool.

Aaron Lun (22:18:18): > I was going to suggest “bunda” as well, but a quick google indicated that was not a good idea.

2020-03-11

Aaron Lun (00:15:48): > @Jared Andrews It is done.

Aaron Lun (00:16:39): > Also @Dan Bunis as SingleR now auto-handles the intersection of multiple references, the warning should no longer be relevant unless you’re doing trainSingleR, etc. manually (in which case one is presumed to know what one is doing).

Aaron Lun (00:17:56): > (Also, existing code calling combineRecomputedResults will no longer work, as I switched it to taking the output of trainSingleR instead of the references. This is a bit friendlier to use inside SingleR and it also cuts out one required argument.)

Jared Andrews (09:25:26): > Nice, thanks. Having fixed the cluster stuff, it is still quicker for me by 15-20%, though I’m running “single” mode and “cluster” mode sequentially.

Aaron Wolen (09:49:04): > @Aaron Wolen has joined the channel

Aaron Lun (16:08:38): > Not sure what that last bit meant.

Jared Andrews (16:34:04): > I’m running multiple times for different clustering resolutions.

Jared Andrews (16:34:39): > In addition to running on a per-cell basis.

Aaron Lun (16:36:22): > Right, and in each of those runs, you’re getting that speed boost.

Jared Andrews (16:49:32): > Wait

Jared Andrews (16:52:39): > Alright, so apparently I can’t read >_> > > recomp method wall time: 06:15:07 > common method wall time: 02:13:36

Aaron Lun (16:52:51): > guh.

Aaron Lun (16:53:01): > man that sucks so bad

Jared Andrews (16:53:20): > tbf, it does seem more accurate. Or at least more fine-grained.

Jared Andrews (16:53:26): > Based on eyeball test.

Jared Andrews (16:54:26): > I am rushing to get things together for a job interview next week, but will perform more straightforward comparisons after that.

Jared Andrews (16:54:33): > With actual logs and images.

Jared Andrews (16:54:54): > And not being all over the place.

Aaron Lun (16:55:02): > 4 minutes. I wonder how it takes 4 extra minutes.

Aaron Lun (16:55:05): > Hm.

Aaron Lun (16:55:07): > Hmmm.

Aaron Lun (16:55:10): > Hmmmmmmm.

Jared Andrews (16:55:12): > So

Jared Andrews (16:55:14): > um

Jared Andrews (16:55:17): > that’s

Jared Andrews (16:55:24): > hours

Aaron Lun (16:55:29): > oh shit

Aaron Lun (16:56:09): > How is this taking hours?

Jared Andrews (16:56:38): > Single processor, “single” method + combine, then ~4 different cluster resolutions + combine for each. All immune refs + 1 (small) custom ref.

Jared Andrews (16:56:49): > ~25k cells.

Jared Andrews (16:57:35): > All fine labels.

Jared Andrews (16:58:17): > I will post full code and examples once I have time to test properly. My guess is fine tuning. I will do some testing with that as well.

Jared Andrews (16:58:36): > There are 9000 T cell labels and my populations are mostly T cells.

Aaron Lun (17:37:39): > 2 hours. Geez. I guess that’s okay given how many times you’re running it. Though the cluster ones should be pretty fast.

Jared Andrews (17:40:15): > The cluster ones are fairly quick.

Jared Andrews (17:41:34): > I haven’t read the details about the fine-tuning recently, but if it’s going the max number of iterations for the majority of cells, that might explain some of it.

Jared Andrews (17:43:06): > If that’s the case, I will spend the time to manually harmonize the immune references into 1 giant one at some point, which should help.

Dvir Aran (20:16:25): > Hey. Hope you’re all doing well. One way to not lose too much granularity and speed up calculations is to use the cluster method, but cluster with super high resolution

Dvir Aran (20:17:03): > You get clusters of 10-20 cells, which will essentially speed up things 10-20x

Dvir Aran (20:18:33): > If you have 25K cells I believe this makes a lot of sense. It might also get you more accurate results

Aaron Lun (20:36:03): > That’s definitely an option

Aaron Lun (20:36:25): > Though for my use cases, the single-cell aspect is very appealing.

Aaron Lun (20:36:56): > Just logistically; I can dice and slice my dataset in any way, and the results are guaranteed to be the same, given that each cell’s classification is independent of the others.
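
The independence property described above can be illustrated with a toy per-cell classifier. This is a sketch under stated assumptions: `assign_labels` is a hypothetical stand-in that picks the label with the highest Spearman correlation, the data are random, and it is not SingleR's implementation (there is no marker selection or fine-tuning).

```r
set.seed(1)

# Toy reference: 50 genes x 3 labels; toy test set: 50 genes x 12 cells.
genes <- paste0("gene", 1:50)
ref <- matrix(rnorm(50 * 3), nrow = 50,
              dimnames = list(genes, c("T", "B", "Mono")))
test <- matrix(rnorm(50 * 12), nrow = 50,
               dimnames = list(genes, paste0("cell", 1:12)))

# Hypothetical per-cell classifier: pick the reference label with the
# highest Spearman correlation. Each cell is scored on its own.
assign_labels <- function(test, ref) {
    scores <- cor(test, ref, method = "spearman")  # cells x labels
    colnames(ref)[apply(scores, 1, which.max)]
}

all_at_once <- assign_labels(test, ref)

# Classify the same cells in three arbitrary chunks and stitch them together.
chunks <- split(seq_len(ncol(test)), rep(1:3, each = 4))
in_chunks <- unlist(
    lapply(chunks, function(idx) assign_labels(test[, idx, drop = FALSE], ref)),
    use.names = FALSE
)

print(identical(all_at_once, in_chunks))
```

Because each cell's score depends only on that cell and the reference, the chunked and all-at-once results agree, which is also what makes splitting the work across jobs straightforward.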

Aaron Lun (20:37:52): > I can literally just shovel data in without worrying too much. (Or specifically, someone else worries about it, just not me.)

Aaron Lun (22:08:41): > @Jared Andrews Actually, now that I dwell on it, 2 hours is not bad.

Aaron Lun (22:09:40): > Assuming all of the time is taken up by the single-cell classification, that’s… actually alright. Especially if it’s easily parallelized.

Jared Andrews (22:12:31): > Yeah, probably about 75% of that is the single cell classification. Again, this is with no parallelization, as I parallelize by each sample instead because I’m lazy and typing 8 numbers to submit new jobs is too much for me.

Aaron Lun (22:20:11): > I guess there just remains the question of why the recomputed scores are doing so badly in @Dan Bunis’s example relative to a pure brain2brain classification.

Aaron Lun (22:35:03): > Hold on, I just blew my own mind

Aaron Lun (22:35:42): > Cut down fine-tuning from 24 seconds to 6 seconds.

Aaron Lun (22:35:51): > I think I stuffed something up. Will check after dinner.

Aaron Lun (23:51:40): > OMG

Aaron Lun (23:51:41): > it’s real

2020-03-12

Jared Andrews (00:06:10): > :100:

Aaron Lun (00:08:34): > All about cutting down the number of cache misses.

Aaron Lun (00:16:14): > Finally.

Aaron Lun (00:16:26): > After several nights of banging my head.

Aaron Lun (00:16:28): > It was so simple!

Aaron Lun (00:17:51): > The goodness has now been merged into combined. Hopefully you will see an improvement on your end as well.

Aaron Lun (01:05:32): > Also caught a bug. Trying to figure out what it was, I tested pretty hard.

Aaron Lun (02:40:52): > Oh thank god. Just ended up being a problem with my test code rather than a bug.

Aaron Lun (02:44:39): > For some weird reason, I have to do R CMD INSTALL --preclean <package> twice before the speed-up appears. If I only do it once, for some reason, it takes 50 seconds instead of 8. I have no idea why. I can only guess there’s some compiled object that is cached somewhere from a previous installation that didn’t get removed by --preclean, but that’s a stretch. Anyway, I can’t easily reproduce it, but if you notice your code is slower with the new branch, first try reinstalling (again) from a clean repository.

Dan Bunis (15:38:50): > Noted! I won’t be able to come back to this until tomorrow afternoon at the earliest, but I will try to pinpoint the cause of the bad performance in my brain+immune2brain relative to brain2brain by Monday.

Aaron Lun (22:36:35): > Well, my vacation starts Monday, so I might drop out. But you guys have push access and I’ll watch your PRs.

Aaron Lun (22:37:11): > And in the meantime, I will do my holiday coding project, which will be a Western Blot Simulator.

2020-03-13

Jared Andrews (00:51:15): > @Aaron Lun Were your fine-tuning tests with recomputation or the common method or both? I’m getting speed up for common, but not recomp: > > recomp method wall time: 06:15:07 -> 06:45:54 > common method wall time: 02:13:36 -> 01:39:12 > > The recomputation method was also switched to being run in a SingleR call rather than calling directly.

Jared Andrews (00:52:11): > I guess it could also be due to installation issues described above, I just used devtools to install.

Aaron Lun (00:52:50): > They shouldn’t have affected recomp, though everything should have been a little faster.

Aaron Lun (00:54:08): > the extra 30 minutes on the recomp is probably +/- load elsewhere in the system. Probably best to use a faster example that you can run a few times to get some measure of the variance.
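
One way to get the measure of variance suggested above is to repeat a small run several times and look at the spread of elapsed times. A minimal base-R sketch follows; `workload` is a hypothetical stand-in for a downsampled annotation call, and the repetition count of 5 is arbitrary.

```r
# Stand-in workload; replace with the (downsampled) call being benchmarked.
workload <- function() {
    x <- matrix(runif(200 * 200), nrow = 200)
    invisible(crossprod(x))
}

# Repeat the run and keep the elapsed wall time of each repetition.
elapsed <- replicate(5, system.time(workload())[["elapsed"]])

# Mean and standard deviation give a rough idea of run-to-run noise.
timing_summary <- c(mean = mean(elapsed), sd = sd(elapsed))
print(timing_summary)
```

A difference between two methods is only worth interpreting if it is large relative to the standard deviation seen across repeats.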

Aaron Lun (00:56:25): > I’m wondering what makes recomp so slow. Maybe it’s just the overhead of doing the loops in R. Hm.

Jared Andrews (01:16:13): > kk, just making sure. I will do more proper testing next week.

Jared Andrews (01:16:25): > Why are you building a WB simulator?

Aaron Lun (01:22:21): > For fun.

Jared Andrews (01:29:29): > Fair enough.

2020-03-14

Aaron Lun (19:05:27): > Or specifically, I’m reading forbetterscience.com and I’m thinking of how amateur some of these forgers are.

Jared Andrews (20:50:46): > My side project right now is making animated sankey diagrams for basketball box scores. Since March Madness is cancelled, it’s kind of taken the fun out of it though.

2020-03-16

Dan Bunis (18:20:26): > I just updated dittoSeq’s description on https://github.com/seandavi/awesome-single-cell but noticed that the SingleR note still points to @Dvir Aran’s repo. Do we want to update that to the Bioconductor page https://bioconductor.org/packages/release/bioc/html/SingleR.html? I can make the PR.

Aaron Lun (18:25:18): > sure

Jared Andrews (21:34:34): > May also want to remove some of the language about it being an in-progress thing that will soon be in Bioconductor

Dan Bunis (21:54:34): > Where do you mean? For SingleR?

Jared Andrews (22:27:38): > Yeah, in the readme.

Jared Andrews (22:28:01): > Sorry, I didn’t fully read your message that was talking about Sean’s list.

2020-03-17

Dan Bunis (13:22:39): > PR’d @Dvir Aran with an update of the Bioconductor link from the devel SingleR page to the release SingleR page.

Aaron Lun (14:36:49): > Well, I didn’t end up making that WB simulator. But that’s mostly because I realized there were much better ways to fabricate results.

Aaron Lun (14:37:27): > Probably not the right channel, but for anyone who’s interested: https://ltla.github.io/SingleCellThoughts/general/integrity.html

Jared Andrews (15:16:22): > > Quite simply, either we clean up our mess or the politicians will do it for us. > I question that statement (at least in the US), but agree with the overall sentiment.

Jared Andrews (15:17:18): > WBs are probably the easiest assays to simulate.

Jared Andrews (15:21:04): > I have enjoyed nothing more in grad school than reporting obviously fraudulent data though. Multiple exactly identical IHC images used in different papers with different captions.

Aaron Lun (17:55:22): > I would just wonder when the revolution comes.

Aaron Lun (17:57:45): > Ugh. Got an invitation to review. Need to turn this down.

Aaron Lun (17:58:07): > I was almost going to do it, and then I remembered I wasn’t going to.

Aaron Lun (17:58:14): > Old habits, etc.

Jared Andrews (18:54:33): > Would you change your opinion if it was a non-profit, open-access journal?

Giuseppe D’Agostino (22:01:17): > @Giuseppe D’Agostino has joined the channel

Giuseppe D’Agostino (22:14:25): > hi all, regarding human brain “type” annotation in scRNAseq data: it’s something I’ve been working on these days in a very naive way (bear with me for I have a wet lab background). I have collected markers, however they were identified, from these publications: Lake (the 2018 “update”), Schirmer 2019 (Multiple Sclerosis snRNAseq), Jäkel 2019 (same as Schirmer but with a focus on oligodendrocytes) and the BRETIGEA markers. If you do hierarchical clustering on the pairwise Jaccard index you can define broad sets - up to you to define where to cut the tree obviously - but they seem to make sense most of the time if you take the union. Obviously you lose granularity in terms of, for instance in excitatory neurons, what sort of cortical layer they belong to, but I would say it’s a good starting point.
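
The Jaccard-based grouping described above can be sketched in base R as follows. The marker sets are small illustrative lists, not the ones collected from the cited publications, and the choice of average linkage and `k = 2` is an assumption made only for this example.

```r
# Toy marker sets standing in for lists collected from different papers.
marker_sets <- list(
    astro_A = c("GFAP", "AQP4", "SLC1A2"),
    astro_B = c("GFAP", "AQP4", "ALDH1L1"),
    oligo_A = c("MBP", "PLP1", "MOG"),
    oligo_B = c("MBP", "PLP1", "MAG")
)

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise Jaccard index between all sets.
n <- length(marker_sets)
jac <- matrix(0, n, n, dimnames = list(names(marker_sets), names(marker_sets)))
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        jac[i, j] <- jaccard(marker_sets[[i]], marker_sets[[j]])
    }
}

# Cluster on Jaccard distance and cut the tree; where to cut is a judgment call.
tree <- hclust(as.dist(1 - jac), method = "average")
groups <- cutree(tree, k = 2)

# Union of markers within each broad group.
broad_sets <- lapply(split(marker_sets, groups), function(x) unique(unlist(x)))
print(broad_sets)
```

Taking the union within each group gives the broad sets; the finer distinctions between the original lists are lost at that point, as noted in the message.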

Giuseppe D’Agostino (22:19:21): > this is the annotated heatmap - File (PDF): jaccardbrain.pdf

2020-03-18

Aaron Lun (03:30:59) (in thread): > Nope. For the three hours it takes for me to review a paper, I could clear a decent part of my anime backlog. So… y’know, go figure.

Aaron Lun (03:31:45): > Can you massage this into a reference dataset(s)?

Giuseppe D’Agostino (09:27:33): > yes. so far I have used them as genesets to do automatic type annotation via fgsea, but I could create the references from the original snRNAseq datasets for use with SingleR

Dan Bunis (16:43:02): > Aaron’s question was mine as well. Which snRNAseq datasets are you referring to btw?

Dan Bunis (16:51:45): > I do think your markers / method might be useful in other ways as well… “disclaimer: Aaron and I are probably quite biased towards SingleR” lol

Giuseppe D’Agostino (20:04:58): > Lake Nat biotech 2018 (PMID 29227469) > Schirmer Nature 2019 (PMID 31316211) > Jäkel Nature 2019 (PMID 30747918) > BRETIGEA markers (no seq data to be used here tho) > > The Mathys markers were not very clearly divided by cell type in their study (they show DE genes for late-early but I did not understand whether they were uniquely assigned to cell type) so I did not add them yet. > There is another human brain snRNAseq study, for Alzheimer’s disease (Grubman Nat neuro 2019) but they use BRETIGEA for annotation.

2020-03-19

Somesh (13:29:07): > @Somesh has joined the channel

2020-03-23

Peter Allen (12:07:02): > @Peter Allen has joined the channel

2020-03-24

Ludwig Geistlinger (18:07:06): > One question: I want to annotate cell cycle phase to my human ovarian cancer scRNA-seq dataset (10X). I am reading https://osca.bioconductor.org/cell-cycle-assignment.html#using-reference-profiles for the SingleR approach, that demonstrates the functionality using the Buettner15 dataset as reference (mouse ESCs with known cell cycle phases). Can you recommend a specific reference dataset when working with human data instead? - Attachment (osca.bioconductor.org): Chapter 16 Cell cycle assignment | Orchestrating Single-Cell Analysis with Bioconductor > Online companion to ‘Orchestrating Single-Cell Analysis with Bioconductor’ manuscript by the Bioconductor team.

Aaron Lun (18:08:01): > Think there’s a dataset named “Leng Something something” in scRNAseq.

Aaron Lun (18:08:04): > Or maybe it’s Ling.

Aaron Lun (18:08:23): > I don’t remember. But I’m pretty sure I added something like that.

Aaron Lun (18:08:44): > Ah, what do you know. LengESCData. BioC-devel.

Ludwig Geistlinger (18:09:18) (in thread): > Thanks, Aaron!

2020-03-25

Aaron Lun (03:08:14): > @Jared Andrews @Dan Bunis give me some news on combined

Jared Andrews (03:13:09): > Sorry, got sidetracked trying to get CNV inference to work and helping our lab shut down. combined I’m kinda back and forth on. It seems to help avoid label pruning (sometimes) and also provide more granularity (sometimes). I will post plots to the github issue tomorrow. In general, I think it’s a worthy addition that will be helpful in certain cases. It’s still slower for me, but I’ll downsample a few of my samples and re-run tomorrow to get better timings.

Dan Bunis (14:41:53): > I’ve also been sidetracked, but I’ll try to get back to investigating the combined brain&immune2brain wonkiness today.

Aaron Lun (15:15:45): > Excellent.

Aaron Lun (15:16:36) (in thread): > You might want to make a PR with a comment about this in OSCABase.

Ludwig Geistlinger (17:16:17) (in thread): > Alright

Dan Bunis (19:41:01): > So looking back, the reason my brain&immune2brain results were shit was because I’d mistakenly used the built-in mouse references for some bonkers incorrect reason :man-facepalming:. (Refresher: test = human brain data, ref = a different annotated human brain scData & 2 blood-focused SingleR-refs.) I’m kinda shocked that there were ANY equivalent gene names to utilize, but now the 0 and 1 Pearson scores make more sense at least.

Aaron Lun (19:41:33): > lol. I’ve done that myself as well. Probably linc’s.

Aaron Lun (19:41:40): > they are all-caps in both human and mouse.

Dan Bunis (19:42:03): > ahhhh

Dan Bunis (19:42:46): > probs. I’m re-running now with the proper SingleR-refs and I’ll report back on the likely much better results.

Aaron Lun (19:42:55): > fingers crossed.

2020-03-26

Dan Bunis (01:56:21): > Took a while, but 99+% accuracy for both recompute = TRUE and recompute = FALSE.

Aaron Lun (01:57:00): > That’s reassuring.

Aaron Lun (01:57:08): > Hm. I might merge it in, then.

Aaron Lun (01:57:33): > If you guys could throw some plots and code in, we could have a nice record of what we did to check that it was okay.

Aaron Lun (01:58:02): > There’s also the new aggregation functionality, but let’s deal with this first.

Dan Bunis (02:00:55): > I’m going to rerun with my other datasets overnight. Then I can throw something together tomorrow. I haven’t been doing robust testing for speed, but these are diverse datasets at least.

Jared Andrews (22:53:29): > Who wants the fun process of turning this into a reference :smiling_imp: https://www.nature.com/articles/s41586-020-2157-4 - Attachment (Nature): Construction of a human cell landscape at single-cell level > Construction of a human cell landscape at single-cell level

Jared Andrews (22:54:06): > 700k cells from 60 tissues, fetal and adult. All human. A few cultured stem cell populations as well.

2020-03-27

Aaron Lun (00:12:05): > I don’t suppose we know any of these people?

Aaron Lun (00:18:46): > Also @Jared Andrews you know I love you man, but be careful with throwing paywalled articles here. This is a public channel and I’d hate to get our wrists slapped by Springer Nature over copyright.

Jared Andrews (00:20:24): > Fine, fine. Also, I have timings and all for you that are actually reproducible, I am just trying to clean it up and harmonize labels between the ref sets to make differences more clear.

Aaron Lun (00:20:57): > sweet.

Jared Andrews (00:21:37): > Recomp is faster, is the short of it. For the 10x PBMC dataset I’m using at least. Will let you be the judge of other differences.

Aaron Lun (00:22:24): > hm. interesting. I’m going to look at it tonight, see if I can shave off a few bits and pieces.

Aaron Lun (02:14:33): > Okay. I’ve trimmed down some of the overhead, so it should be slightly faster. More speed improvements are possible but require other pieces to get into play on the BiocNeighbors side, so it’s probably as fast as it’s going to be for the time being.

Dan Bunis (02:36:46): > My last dataset is still running, womp womp. I won’t have anything to share until tomorrow.

Aaron Lun (02:37:33): > :+1:

Aaron Lun (03:13:16): > Started a call.

Aaron Lun (03:18:05): > okay, so this does work.

Dan Bunis (15:03:20): > (summary of my combine results posted to the SingleR issue)

Aaron Lun (15:03:34): > yes, very nice.

Aaron Lun (15:03:45): > Did you get some timings?

Dan Bunis (15:08:55): > Yes, but I’m not sure if we should trust my times. I was using my own system, but I was running other things at the same time so the available memory was probably not equivalent for all the runs.

Dan Bunis (15:09:35): > Sounded like @Jared Andrews did a more robust job at getting timings?

Jared Andrews (15:09:52): > Adding my stuff to same thread now.

Jared Andrews (15:09:55): > I’m all over the place.

Jared Andrews (15:35:17): > Adding the timings, still have additional things to add once I get a chance.

Dan Bunis (15:41:36): > I added my timings for mine, along with the noted caveats.

2020-04-01

Aaron Lun (02:35:16): > Food for thought: a SingleR book. The package now has sufficient features to fill up at least 4 chapters of worked examples.

Aaron Lun (02:54:51): > This would take pressure off (i) the vignette and (ii) the OSCA book.

Jared Andrews (04:15:34): > @Dan Bunis How does using the majority method for cluster assignment compare to doing the cluster assignment via SingleR directly? Have you looked at that?

Dan Bunis (12:58:24): > I did run SingleR in cluster mode with my hematopoietic dataset, but only once and I don’t remember exactly how well it did. I can regenerate a side-by-side comparison for us to figure out if it’s worth throwing into a vignette / chapter. What I liked about the majority method was that it gave me a secondary way of determining how well a cluster’s cells aligned with the particular signature. If 90% of cells are the majority type, that’s pretty good. Instead if 55% are the called type, while 45% are a similar type, the cluster might actually represent a transitionary state. Of course, this type of thing is something we might expect to be able to investigate a cluster method’s ScoresHeatmap too. I had stopped tweaking my method once I was satisfied with its results.

Aaron Lun (12:59:49): > I’m personally a big fan of the majority method, percentage assigned is a natural way of gauging confidence. And also of picking up within-cluster heterogeneity that you hadn’t seen before.
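
The majority method discussed above comes down to tabulating per-cell labels within each cluster. A minimal sketch on invented labels follows; the 90% and 55% splits mirror the figures in the message above, and the variable names are illustrative only.

```r
# Invented per-cell labels and cluster assignments.
cell_labels <- c(rep("T cell", 9), "B cell",
                 rep("Monocyte", 11), rep("DC", 9))
clusters <- c(rep("c1", 10), rep("c2", 20))

# Fraction of each label within each cluster (rows sum to 1).
frac <- prop.table(table(clusters, cell_labels), margin = 1)

# Majority label per cluster and the share of cells supporting it.
majority <- data.frame(
    cluster = rownames(frac),
    label = colnames(frac)[apply(frac, 1, which.max)],
    fraction = unname(apply(frac, 1, max)),
    row.names = NULL,
    stringsAsFactors = FALSE
)
print(majority)
```

A high fraction suggests a homogeneous cluster, while a near-even split flags either a transitional state or a cluster that is too coarse, which a single cluster-level call would hide.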

Dan Bunis (13:00:48): > Yea, that’s definitely another perk. If clusters are too coarse, you can see that with the single method, but not cluster.

Jared Andrews (13:14:40): > Yeah, I haven’t done a comparison either, but the benefits are pretty clear. A vignette comparing the two would probably be worthwhile.

Dan Bunis (13:19:19): > :+1:

Dan Bunis (13:29:02): > We probably want to use a stem cell differentiation dataset to exemplify the benefits. I might be able to directly share mine in the next month, but it’s a bit complicated (due to having three very distinct ages of samples), and possibly more fitting of a book chapter than of a brisk vignette. I do think the full work through with mine is worth displaying, but maybe we can throw something quicker together for our current vignette using a simpler dataset already in SingleR?

Jared Andrews (13:52:31): > I can dork around with the PBMC set over the weekend. Can swap my code from jupyter to Rmd and just share the vignette. Or are there any decent stem cell sets in scRNAseq?

Aaron Lun (13:57:45): > There are many stem cell datasets. Dunno whether they’re decent or not, but they’re there.

Dan Bunis (14:01:23): > I’m checking the NestorowaHSCData now.

Aaron Lun (14:01:41): > that is probably the “most decent” of the bunch.

Dan Bunis (14:01:57): > :crossed_fingers: that it isn’t just HSCs

Dan Bunis (14:08:57): > took a while to download, and then I got an error. But based on the paper’s title, “A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation” it does indeed span differentiation and should work similar to my dataset. I’d recommend scRNAseq::NestorowaHSCData() over PBMCs for this.

Aaron Lun (14:09:21): > EHub is probably broken.

Aaron Lun (14:09:28): > I’m getting breaks on other datasets.

Dan Bunis (14:10:58): > But :man-shrugging:, PBMC clusters should be more cut-and-dry. The need for this method may not become apparent, but it would make SingleR look better lol.

Dan Bunis (14:11:27): > @Aaron Lun my error was that I didn’t have ensembldb so maybe not?

Aaron Lun (14:11:38): > oh, that’s a different thing then.

Dan Bunis (14:13:21): > yup, worked after install

2020-04-02

Jared Andrews (12:37:53): > Another huge scRNA dataset to keep an eye on as a potential reference: https://www.biorxiv.org/content/10.1101/2020.03.31.016972v1.full.pdf+html

Aaron Lun (12:38:18): > ye gods

Aaron Lun (12:38:53): > I can smell the grant money steaming off that one

2020-04-06

Anna Lorenc (13:54:52): > @Anna Lorenc has joined the channel

2020-04-07

Aaron Lun (01:04:53): > It’s time to finish the fight.

Aaron Lun (01:04:59): > Oh wait. Was there anything left?

Aaron Lun (01:05:11): > Guess we had the book, and Dan’s thing, and…?

Jared Andrews (07:12:11): > I still plan to compare cluster assignment methods. That is, using SingleR in cluster mode versus majority assignment.

Jared Andrews (07:13:57): > My dissertation is also due in a month though, so that’s going to eat most of my time until I get a few chapters written.

Dan Bunis (15:06:40): > I have one last tweak to plotScoreHeatmap before I think both plotScore functions will be ready for QC’ing multi-ref runs.

Dan Bunis (15:07:04): > I should have that done today.

Dan Bunis (15:14:31): > I’m a bit iffy on their full utility though, so further testing is definitely required. I’d love some help with testing, and I’d also love any suggestions for additional tweaks which might be born of such testing.

Dan Bunis (16:53:10): > The code is probably a tad dirty for Aaron’s standards (I’ll fix that soon), but multi-ref updates to plotScoreHeatmap and plotScoreDistribution are ready for testing.

2020-04-08

Aaron Lun (00:39:20): > Finally looking at it now.

Aaron Lun (00:46:24): > Is there a reason we couldn’t just expect people to do combined$orig.results?

Aaron Lun (00:57:11): > Now thinking about it more. Of greater value would be the ability to spawn all individual result heatmaps, all at once. (Same applies for plotScoreDistribution, but bear with me here). You can adapt the work you’ve done in the current branch so that you have a show.originals= option in the top-level plotScoreHeatmap function; if set, this simply calls plotScoreHeatmap on the orig.results like you’ve already done. Importantly, you set pheatmap to silent=TRUE to return the plot object that can then be fed into gridExtra::grid.arrange to create a multi-panel plot. Or we can just return the list directly and people can arrange the plots as they wish.

Aaron Lun (00:58:48): > The nice thing is that this code can be quite stylishly architected, as plotScoreHeatmap just ends up calling itself to generate plots for all the individual original results. Pretty neat.
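
The recursive design described above might look roughly like the following. This is only a sketch: `plot_score_heatmap()` is a hypothetical stand-in, not SingleR's actual implementation, and the toy `combined` list only mimics the `scores` and `orig.results` fields mentioned in the discussion. It relies on the `silent = TRUE` option of `pheatmap::pheatmap()` and on `gridExtra::arrangeGrob()`, both of which need to be installed.

```r
library(pheatmap)
library(gridExtra)

# Hypothetical stand-in for plotScoreHeatmap(); not the SingleR implementation.
plot_score_heatmap <- function(results, show.originals = FALSE, ...) {
    if (show.originals) {
        # Recurse: call ourselves once per individual reference result.
        return(lapply(results$orig.results, plot_score_heatmap, ...))
    }
    # silent = TRUE stops pheatmap from drawing; keep the gtable for later.
    pheatmap(t(results$scores), cluster_rows = FALSE, cluster_cols = FALSE,
        silent = TRUE, ...)$gtable
}

# Toy combined result holding two "original" per-reference results.
set.seed(1)
make_scores <- function() {
    matrix(runif(12), nrow = 4, dimnames = list(NULL, c("B", "T", "NK")))
}
combined <- list(
    scores = make_scores(),
    orig.results = list(
        ref1 = list(scores = make_scores()),
        ref2 = list(scores = make_scores())
    )
)

panels <- plot_score_heatmap(combined, show.originals = TRUE)

# Assemble a multi-panel layout; draw it with grid::grid.draw(multi).
multi <- arrangeGrob(grobs = unname(panels), ncol = 2)
```

Returning the list of panels, rather than drawing immediately, leaves the arrangement up to the caller, which is the second option raised in the message.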

Dan Bunis (12:55:37) (in thread): > cuz that’s not best practice?

Dan Bunis (12:55:39): > I’d been thinking about such recursive multi-plotting as being an enhancement, but, also thinking more, I agree that a show.all.originals option would actually be the go-to option. I will work it in for both.

Aaron Lun (12:56:05) (in thread): > why not?

Dan Bunis (12:56:53) (in thread): > guess I’m wrong…

Dan Bunis (12:57:06) (in thread): > This doesn’t count as direct slot access then?

Aaron Lun (12:57:27) (in thread): > Nope, this is a DF, and it’s just another column of a DF.

Aaron Lun (12:57:38) (in thread): > Basically, it comes down to what you’re promising.

Aaron Lun (12:57:59) (in thread): > With S4 classes, you’re only promising consistency of the external behavior; how things work inside is a black box.

Aaron Lun (12:58:28) (in thread): > With the combined output, we’ve already documented that we’re returning the original results in the orig.results field, so we have to promise that’s stable.

Aaron Lun (12:59:07) (in thread): > People can expect to use that, just like they can expect to use$labels; it’s just another field.

Aaron Lun (12:59:46) (in thread): > In this case, it’s a bit more complicated because it’s a DF inside a DF inside a DF (!) but at the end of the day it’s just like a nested list.
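
As a rough illustration of the "nested list" point, the sketch below uses base R data frames as a stand-in for the Bioconductor DataFrame objects under discussion. The reference names (`ref1`, `ref2`) and label values are made up for the example; only the `labels` and `orig.results` field names come from the conversation.

```r
# Base-R stand-in for the nested DataFrame; names and values are illustrative.
ref1 <- data.frame(labels = c("B cell", "T cell"), stringsAsFactors = FALSE)
ref2 <- data.frame(labels = c("B_cell", "NK cell"), stringsAsFactors = FALSE)

# A data frame with two rows and no columns yet, to hold the per-reference results.
orig <- data.frame(row.names = 1:2)
orig$ref1 <- ref1  # each per-reference data frame is stored as one column
orig$ref2 <- ref2

combined <- data.frame(labels = c("B cell", "T cell"), stringsAsFactors = FALSE)
combined$orig.results <- orig

# Reaching in is ordinary field access, just as with a nested list.
combined$orig.results$ref2$labels
```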

Dan Bunis (13:00:17) (in thread): > Gotcha. This all makes sense to me.

Aaron Lun (13:01:32) (in thread): > I can understand people not wanting to reach inside the object so deeply, but new arguments aren’t free either, with respect to complexity of the code and the need to get users to understand what the argument does.

Dan Bunis (14:13:05) (in thread): > Oh! Additional reason: for checking scores with ref y mixed with label calls from final or ref x. clusters or annotation_col inputs can be hijacked/used in plotScoreHeatmap for this, but there’s nothing similar for plotScoreDistribution

Dan Bunis (16:47:05): > show.all.originals is now added (with additional new arg grid.vars for controlling grid.arrange arrangement) to both viz functions.

2020-04-10

Aaron Lun (03:03:12): > @Dan Bunis just to say that I will look at this soon.

Aaron Lun (03:03:46): > But with the fact that I have my work computer at home, my work/life boundaries have completely collapsed, so it’s been harder to find time recently.

Aaron Lun (22:27:59): > God this iterator bit at the end really sucks.

Aaron Lun (22:37:59): > Re. combineRecomputedResults.

Aaron Lun (22:38:05): > That was really not a good idea in hindsight.

Dan Bunis (22:41:34): > > my work/life boundaries have completely collapsed, so it’s been harder to find time recently > @Aaron Lun SAME. Not that I have a work computer, but just always being home, which is now where I work, my time management has gotten sooo much worse.

2020-04-11

Aaron Lun (14:03:30): > @Dan Bunis Almost there. Just a few suggestions for neater code organization.

Jared Andrews (14:11:15) (in thread): > Care to elaborate?

Aaron Lun (14:26:31) (in thread): > Just the way I wrote the parallelization was not very efficient.

Aaron Lun (14:26:50) (in thread): > I thought it would be more efficient than the way I was going to do it, but I was wrong.

Aaron Lun (23:02:07): > Phew. That was as much work as I had feared. Let’s see if it actually improved recompute speed. It should definitely be faster for batchtools-parallelized jobs; harder to say for other backends or for serial processing.

2020-04-12

Dvir Aran (00:47:59): > Is there a way to automatically transfer issues from dviraran/SingleR to LTLA/SingleR? I keep getting q there, and just asking to repost in LTLA…

Aaron Lun (00:49:13): > I don’t think so - I recall trying on some other project, and I think transfer only works for repos in the same organization, if I remember correctly.

Aaron Lun (00:49:32): > However, perhaps it would be sufficient to create an issue template that simply says, “repost to LTLA/SingleR”.

Aaron Lun (00:50:20): > So if they make an issue, they will have to actively delete the text that tells them to repost to LTLA/SingleR; which is just willful ignorance at that point.

Aaron Lun (02:08:49): > oh god it is so much faster. 10k cells against ~20 human references (aggregated) takes 15 minutes on 10 cores.

2020-04-13

Aaron Lun (11:58:48): > @Jared Andrews @Dan Bunis could you take faster-recompute out for a spin sometime this week?

Jared Andrews (11:59:45): > Yep. Have a grant due the 15th but will run my same shtuff as last time.

Aaron Lun (12:00:18): > Great. Just see if it (i) gives the same results as and (ii) is not slower than the current master.

Aaron Lun (12:00:34): > It’s definitely faster than the current master when parallelizing, so that’s a definite win.

Dan Bunis (14:48:01): > Can do too. Very likely not today, but sometime in the next few days.

Aaron Lun (14:48:28): > :+1:

2020-04-15

Dvir Aran (23:54:30): > @Aaron Lun I did what you suggested, yet someone erased it and posted an issue… - File (JPEG): Image from iOS

Dvir Aran (23:55:17): - File (JPEG): Image from iOS

2020-04-16

Aaron Lun (00:26:20): > LOL

Aaron Lun (00:26:37): > Well, another choice is to archive the repository.

Aaron Lun (00:26:54): > You might be able to turn off issues as well, if you don’t want to archive it.

Aaron Lun (00:27:06): > Then people LITERALLY can’t post issues.

Aaron Lun (00:27:44): > Users are like water; they will flow in the directions that are available. And also seep through any cracks in your API and mess with your function internals.

Al J Abadi (02:05:31): > @Al J Abadi has joined the channel

2020-04-17

Aaron Lun (00:30:35): > @Dan Bunis let’s finish your PR.

Dan Bunis (00:49:15): > Let’s! Tomorrow :sleeping:

Aaron Lun (23:07:30): > Good, good.

Aaron Lun (23:07:31): > Let your hate guide you.

Aaron Lun (23:07:50): > It works for me!

Jared Andrews (23:23:10): > Also, I’m working on my testing now, but finally updated to R 4.0 and broke everything, so probably won’t get it done till tomorrow.

Dan Bunis (23:40:12): > Right! I meant to get my testing running while I worked on the plotScores… I also won’t have that until tomorrow.

Jared Andrews (23:44:35): > Also broke page 100 on my dissertation today so :tada:

2020-04-18

Dan Bunis (00:03:28): > That’s so longggg. Not looking forward to that part of the process… very much hoping that I can get by with my papers + just a bit extra intro :crossed_fingers:

Jared Andrews (01:10:44): > Yeah, that’s really just 2 papers. I still need to do my intro chapter, conclusions/future chapter, and one more data chapter or maybe an appendix depending how much I get done.

Aaron Lun (03:51:05): > I would tell you to go to sleep, @Dan Bunis, but that would be hypocritical.

Dan Bunis (03:54:51): > I think it’s a perfect time for some Fallen Order… then bed.

Aaron Lun (03:55:30): > oh yeah, some of the level design is pretty amazing in that game.

Dan Bunis (03:57:12): > Yea, almost feels like a Zelda game sometimes.

Dan Bunis (03:57:44): > I’m not actually that far in yet.

Aaron Lun (16:06:11): > @Dan Bunis I’m going to get lunch, but we’re almost there.

Aaron Lun (16:06:22): > Added some comments, see if you agree or not.

Dan Bunis (17:03:41): > Commented back. I see where you are coming from and generally agree with the main suggestion. Not sure on the “If you are ambitious” part, but I should be able to better assess the utility after I do the first half.

Dan Bunis (21:57:15): > What’s the status of our reference labels alignment? It’d be great to be able to color similar labels similarly in a future plotScoreHeatmap. Maybe in 3.13/3.14. - File (PNG): image.png

Aaron Lun (22:00:29): > Oh. Probably next release.

Aaron Lun (22:01:28): > I actually have an idea for that. I’d like to spin out the Data() functions into a separate package, maybe CellTypeReferences. Or a better name, if you have one.

Aaron Lun (22:02:02): > This would allow us to separate the computation from the references, hopefully giving us more room to do stuff like adding ontology-related functionality.

Aaron Lun (22:02:22): > Say NO to feature creep!

Aaron Lun (22:03:04): > The tricky part is, as always, finding a good name.

Aaron Lun (22:03:25): > It’s no exaggeration to say that’s what I spend most of my new package dev time thinking about.

Aaron Lun (22:06:42): > Anyway, if you’re working on the PR tonight, let’s finish it together.

Dan Bunis (22:16:06): > Oh! That sounds like a great plan. Possibly CellTypeRefs instead for brevity.

Aaron Lun (22:19:43): > One would need to think of a jazzy name.

Aaron Lun (22:19:58): > @Kevin Rue-Albrecht had a great one for cell type signatures - hancock.

Dan Bunis (22:38:27): > I… should I get the reference? Was there an obscure reference in the movie?

Dan Bunis (22:39:34): > Or something else… my pop culture knowledge is iffy at best.

Aaron Lun (22:40:57): > Hancock? Y’know, the guy with the big signature on the constitution?

Dan Bunis (22:46:33): > had a feeling I was missing something stupid :man-facepalming: lol

Aaron Lun (22:51:43): > and neither @Kevin Rue-Albrecht nor I are US citizens… for shame…

Dan Bunis (23:37:52): > ’Murica! …

Dan Bunis (23:40:08): > Turns out I am getting the plotScore functions into a shape where at least more of the data extraction, than now, will be done the same way for both.

Dan Bunis (23:43:27): > Still haven’t updated my code to start the time & call testing for faster-recompute. How is the R-4.0 update muck coming, @Jared Andrews?

Jared Andrews (23:52:03): > Should be done running, haven’t checked it today. Will take a look in a few

2020-04-19

Jared Andrews (00:26:50): > Hancock is pretty damn good, gotta say. Also had the “name that pokemon” bit come to mind, but no actual name related to it. And guess who, that stupid board game.

Jared Andrews (00:28:09): > Oh, my code is halfway done because I did the package switching in a super stupid way and had a prompt to click through. Should be done in an hour or two, then I just have to compare between the new and old results.

Dan Bunis (01:10:49): > Oops! But sounds good @Jared Andrews. I finally pulled myself away from updating the Viz-code long enough to make the 2min of updating req’d to get my own testing code ready.

Aaron Lun (01:23:11): > In a fight with scran’s overly fragile tests. Changed the way that variances were computed and this broke a few tests in a rather head-scratching way.

Aaron Lun (01:25:19): > After that, and a shower, will help with pushing Dan’s thing through.

Aaron Lun (01:25:35): > @Jared Andrews also help us think of a name (above).

Aaron Lun (01:28:48): > And thanks for handling that Q.

Jared Andrews (01:30:10): > Yeah I’m terrible at thinking of names. Especially for a data package where you at least kinda have to be descriptive and can’t go totally wild.

Aaron Lun (01:43:26): > My god. Found a nasty bug. Going to take my shower now.

Aaron Lun (01:44:35): > I think the name can be a bit on the fun side. I’ll throw in mugshots for some cell type profiling.

Jared Andrews (01:56:37): > NameThatCell

Dan Bunis (01:57:01): > Call me boring, but I like the simplicity of scRNAseq -> contains scRNAseq data :man-shrugging:

Jared Andrews (01:57:10): > It’s not really doing the naming though so eehh

Jared Andrews (01:57:19): > I will call you Boring.

Dan Bunis (01:57:33): > That said, I will probably never turn down a Pokemon reference :joy:

Dan Bunis (02:00:30): > cellDex?

Dan Bunis (02:02:12): > instantly know it holds info about different cells

Dan Bunis (02:04:46): > taken :sob:

Dan Bunis (02:12:44): > jk, cellDex is available.

Jared Andrews (02:24:04): > That’s pretty good too.

Jared Andrews (03:03:50): > @Aaron Lun Seeing ~15% speedup with no parallelization with fine labels, but a ~50% slowdown with main labels. I actually think the weekly virus scan that I have no control over kicked on at some point during it, so that may be affecting performance. Am re-running on my personal computer. Results are almost exactly identical, but there are very slight variations. Like 10-20 cells might hop around between T cell labels given how many of them there are.

Aaron Lun (03:04:32): > Hm.

Aaron Lun (03:06:51): > This may or may not make sense, let’s see the re-run.

Dan Bunis (03:14:48): > I think that my all 40k naive T cell dataset may have run faster, but the 70k brain2 dataset is still running and has been for a while. Hopefully done soon.

Aaron Lun (04:01:24): > People still awake?

Jared Andrews (04:02:12): > Yeah. Trying to get my notebook to knit so I can just upload that when it’s done rather than formatting all this garbage, but it doesn’t seem to want to so whatever

Aaron Lun (04:02:41): > What are you trying to knit?

Jared Andrews (04:03:05): > Good notes on Dan’s PR. I haven’t used the heatmap functions since it’s been changed for multiple references, will be good to play with.

Jared Andrews (04:03:14): > Just an Rmd doc.

Aaron Lun (04:03:49): > Did that main labels re-run do any better? What are the absolute times we’re talking about here?

Jared Andrews (04:07:44): > It’s still running. Fine labels was like 529 seconds but down to 448 with new method. Main labels went from 190 s to 298s.
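
For reference, the relative changes implied by the times quoted above can be worked out directly. The numbers are taken from the message; the variable names are just for illustration.

```r
# Elapsed times in seconds, as quoted in the message above.
old <- c(fine = 529, main = 190)
new <- c(fine = 448, main = 298)

# Percent change relative to the old time; negative means the new code is faster.
pct_change <- 100 * (new - old) / old
round(pct_change, 1)
#>  fine  main
#> -15.3  56.8
```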

Aaron Lun (04:08:16): > Hm.

Jared Andrews (04:08:40): > I would not put a huge amount of faith in that currently though.

Aaron Lun (04:08:44): > k

Aaron Lun (04:11:06): > Well, the change that I made is not a slam-dunk in terms of efficiency. In fact, in purely algorithmic terms, it’s actually slower because it does a lot of redundant work - specifically, it recomputes the ranks of the references for each cell in the test. However, it is very much faster for (i) file-backed matrices and (ii) parallelization in the absence of shared memory. So, basically if you want to annotate 1 million cells on a cluster, this would crush the previous implementation.

Jared Andrews (04:12:36): > Dan’s numbers will probably be more illuminating then.

Aaron Lun (04:12:40): > Though I wonder whether I could thread the needle to have my cake and eat it too.

Dan Bunis (04:13:14): > Mine is still running.

Aaron Lun (04:18:14): > Y’know what, having thought about it, I actually do have a way to get the best of both worlds. It might even be applicable to the fine-tuning if it works… which would have the AMAZING plus of being able to strip out all of the C++ code.

Dan Bunis (04:22:28): > I’d ask how, but my brain power is def starting to wane. I’ll have to parse through your heatmap PR in the morning.

Aaron Lun (04:23:00): > yes, that’s probably for the best.

Dan Bunis (04:25:36): > ~20 mins and this run will have taken longer overall than the previous. (only with fine.labels.)

Aaron Lun (04:33:03): > Woah. How long was it taking in the first place?

Dan Bunis (04:36:10): > Previous time for all 4 with this “new” method adds up to ~3hrs 15min - File (PNG): image.png

Dan Bunis (04:37:35): > So actually, we’re there now. But also it finished.

Dan Bunis (04:38:17): > Too bad I forgot to create the new folder that I’d wanted to save into… I’ll have to rerun overnight to check the individual times and results.

Jared Andrews (04:39:04): > Is that with fine labels?

Dan Bunis (04:39:42): > Yup.

Dan Bunis (04:40:23): > example code: > > B2pred <- SingleR( > test = GetAssayData(brain2, assay = "RNA", slot = "counts"), > ref = list(HPCA, BE, GetAssayData(grub, assay = "RNA", slot = "counts")), > labels = list(HPCA$label.fine, BE$label.fine, grub$m1_cell_type2), > recompute = TRUE) >

Jared Andrews (04:41:12): > I’m still getting the same-ish numbers (~15% faster with fine labels, ~50% slower with main).

Jared Andrews (04:42:20): > Will put my actual numbers into the PR tomorrow.

Dan Bunis (04:47:52): > 4k cells, right?

Aaron Lun (04:51:17): > Hmmm… Okay. Well, guess I’ll deal with that tomorrow.

Aaron Lun (04:51:23): > Or today. Ho ho ho.

Dan Bunis (04:53:23): > time to go back to Fallen Order for a bit for me, then back to this in the actual morning.

Jared Andrews (11:14:40): > Yeah, 4k.

Dan Bunis (12:56:58): > My results are posted. Call comparison: Only 1 cell out of ~120k had a different call. Time comparison: general time decrease; +/- <6% for 3, but -57% for one of them. Not sure if that -57 is a fluke.
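
A call comparison of this kind can be done in a few lines of base R. The sketch below uses made-up label vectors standing in for the predictions from the old and new code; it is not the script that was actually run.

```r
# Toy label vectors standing in for predictions from the old and new code.
old_calls <- c("T cell", "B cell", "NK cell", "T cell", "B cell")
new_calls <- c("T cell", "B cell", "T cell",  "T cell", "B cell")

n_diff <- sum(old_calls != new_calls)      # number of cells whose call changed
agreement <- mean(old_calls == new_calls)  # fraction of identical calls

# Cross-tabulate to see which labels the discordant cells moved between.
table(old = old_calls, new = new_calls)
```

The cross-tabulation is useful when differences cluster among closely related labels, since off-diagonal counts show which pairs of labels are being swapped.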

Aaron Lun (13:14:32): > Hm… okay. I’m going to have one last roll of the dice, hope you guys held onto those scripts.

Dvir Aran (13:24:20): > cellXriv

Aaron Lun (13:37:10): > That reminds me, we probably need some advice for which references to use.

Aaron Lun (13:37:29): > Also thanks @Jared Andrews. Give me a few hours and we’ll have one more crack at it before I give up.

Jared Andrews (13:44:12): > We also need to update the README, it’s out of date and probably confusing some people. I’ve considered harmonizing all the immune references by hand so that we don’t have “B cells”, “B-cells”, “B_cell”, etc, but if we’re going to turn to the ontologies at some point I’ll just save my time.
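
A by-hand harmonization along these lines could be as simple as a lookup table. In this sketch the `synonyms` table and the `harmonize()` helper are hypothetical; the three B cell spellings are the ones quoted above, and unrecognized labels are passed through unchanged.

```r
# Hypothetical lookup from label variants to a single spelling.
synonyms <- c("B cells" = "B cells", "B-cells" = "B cells", "B_cell" = "B cells")

# Replace known variants; leave any label not in the lookup as it is.
harmonize <- function(labels, lookup) {
    hit <- labels %in% names(lookup)
    out <- labels
    out[hit] <- unname(lookup[labels[hit]])
    out
}

original <- c("B-cells", "T_cells", "B_cell", "B cells")
harmonized <- harmonize(original, synonyms)

# Keep the original labels alongside the harmonized ones.
data.frame(original = original, harmonized = harmonized)
```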

Aaron Lun (13:48:54): > Probably the ontologies are indeed the better approach, then we have both the “original” labels plus our harmonized ones. Otherwise we’d be taking responsibility for changing the originals to the “correct” values. Probably not too hard in those cases above but you can imagine situations where it might be more controversial. If a user then challenges me on it, for example, I’d be all like

Aaron Lun (13:49:13): > ¯*(ツ)*/¯

Jared Andrews (14:01:08): > Yeah, very true. Looks like the issue diversion from Dvir’s repo is starting to happen. With people not updating or even reading about the new package, naturally. We might want to create an issue template to tell them to check the vignette/manual if they’re coming from his repo to help with those if they start occurring more frequently.

Aaron Lun (14:38:33): > Woah. That was easier than I expected. Thought it would take the whole day.

Dan Bunis (14:49:35): > not mad at cellrXiv as a cellTypeReference package name, but still prefer cellDex myself ¯*(ツ)*/¯

Aaron Lun (14:50:34): > I’ll only mention that one should try to avoid random capitals in the middle of the package name. I learnt my lesson with diffHic, for example. Both of the proposals work if you’re willing to accept, e.g., cellrxiv or celldex.

Aaron Lun (14:52:32): > It might not seem like a lot, but it sure is a lot easier to type.

Aaron Lun (14:52:57): > Though having tried to type both of them very quickly, I’ll say that celldex is a clear winner here.

Aaron Lun (14:53:21): > Just because it’s so hard to remember “rxiv”. I always stumble on this for the bio and a versions as well.

Dan Bunis (14:54:30): > For the vignette, might be fine just to do something like this for harmonizing: > > ref <- refData() > ref2 <- ref2Data() > ref2$labels.fine_harmonized <- sub("B_cell", "B cells", ref2$labels.fine) > pred <- SingleR(..., > labels = list( > ref$labels.fine, > ref2$labels.fine_harmonized)) >

Dan Bunis (14:55:16): > agreed. I get rxiv wrong a lot too, and so did Dvir (I’m assuming) in making the suggestion in the first place.

Aaron Lun (14:55:53): > wait, i thought dvir’s one was fine.

Aaron Lun (14:56:12): > “biorXiv”. that is how it’s spelt, right?

Aaron Lun (14:56:16): > Man, I don’t even know anymore.

Aaron Lun (14:56:34): > Wait, the “R” is capitalized? Oh man.

Dan Bunis (14:59:59): > lol def r-x-i-v. I never remember the X vs R. Luckily that doesn’t matter for urls, but cellrxiv or celldex are certainly the clear winners over cellRxiv or cellDex.

Dan Bunis (15:01:40): > Me more often than typing it right & Dvir-above: > > cellXriv

Jared Andrews (15:02:09): > cellRchive if we really want to rile some folks up

Jared Andrews (15:02:33): > Makes me angry just reading it

Dan Bunis (15:03:19): > makes me hungry

Jared Andrews (15:03:52): > Get yerself a baked potato with a bit of butter. Few chives. Solid lunch option.

Aaron Lun (15:08:59): > And with that, I am going to subway for lunch. Will be back in 30 minutes - should be able to knock off the final change by then.

Aaron Lun (15:09:51): > I will leave you with a parting gift; did you know that max.col’s default was ties.method="random"? Probably the cause of the differences in the previous version of this algorithm.
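
The tie-breaking behavior of base R's `max.col()` is easy to see on a small matrix. The `scores` matrix below is invented for the example and is not SingleR output.

```r
# Two rows, each with a tie for the row maximum.
scores <- rbind(c(0.9, 0.9, 0.1),
                c(0.2, 0.7, 0.7))

# Default ties.method = "random": the chosen column can differ between runs.
max.col(scores)

# Deterministic alternatives always pick the same column among ties.
first <- max.col(scores, ties.method = "first")  # 1 2
last  <- max.col(scores, ties.method = "last")   # 2 3
```

With the default, two otherwise identical runs can assign a tied cell to different columns unless the random seed is fixed, which would be consistent with a handful of cells changing labels between runs.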

Dvir Aran (15:41:27): > I don’t understand celldex, sorry for my ignorance.

Dvir Aran (15:44:19): > I see there is a company celldex therapeutics. No mention of cellrxiv, will make it easy to search for it on google

Dvir Aran (15:46:20): > I wanted the X capital for expression, but agree it doesn’t work

Dvir Aran (15:48:57): > But when I see cellrx I think of drug prescription, idk

Dan Bunis (16:02:44): > celldex references the Pokédex, the in-game dictionary for all the different Pokémon

Federico Marini (16:06:12): > my 2 cents as a user of singleR - what about cellorama (for the whole panorama of cells)?

Jared Andrews (16:16:47): > Isn’t that already used?

Federico Marini (16:25:10): > I knew scanorama

Federico Marini (16:25:18): > for integrating datasets

Dan Bunis (16:27:09): > seems available. “package ‘cellorama’ is not available (for R version 4.0.0 alpha)”

Dan Bunis (16:27:50): > But probably should be checking from a non-dev R

Federico Marini (16:36:44): > did you try available::available on that?

Dvir Aran (16:36:50): > If it’s a subpackage of SingleR, it can be SingleRef

Dvir Aran (16:37:07): > Or SingleReferences

Dan Bunis (16:41:02): > I think having it come off as sc-matching-method agnostic would help the ref-package gain more use.

Dan Bunis (16:42:37): > But with SingleR still plastered throughout the package as suggested usage of course

Federico Marini (16:45:00): > btw, scanorama is also a film festival

Federico Marini (16:45:14): > probably, not this year..

Jared Andrews (17:28:35): > Oh, cellorama is a magazine put out by Merck that I think I got at my last conference that I’m thinking of. I still like celldex personally.

Aaron Lun (18:16:10): > My god, I’ve got it.

Aaron Lun (18:16:23): > This should be the best of both worlds right here.

Dan Bunis (18:50:50): > Sounds ready to start testing? Starting mine up.

Aaron Lun (18:55:57): > I was for a second, but then I had another flash of inspiration.

Aaron Lun (18:56:15): > My god there’s so much C++ code.

Aaron Lun (19:00:27): > okay, I finally got it to compile. Let’s see if it runs.

Aaron Lun (19:04:25): > Jesus it’s fast. Had to double-check I was running the right test suite.

Dan Bunis (19:05:32): > That’s a good sign. o_O

Aaron Lun (19:06:17): > I’m going to check it and get it up there. Then I will take a break.

Aaron Lun (19:06:31): > I mean, it’s already 4. Where the hell did my weekend go?

Dan Bunis (19:07:01): > This is corona-time, what is a weekend?

Dan Bunis (19:12:46): > I was planning on doing the same though. I’m satisfied with both plotScores again, so will be running this test once I see your commit, then likely popping away until after it runs / another review from you.

Aaron Lun (20:03:26): > Right! it’s done.

Jared Andrews (20:13:47): > Will re-run tests.

Jared Andrews (20:28:02): > Very good, further improvements and main labels are faster now too. Results are similarly about the same (99.99%). Will update my timing table.

Jared Andrews (20:33:25): > Okay, updated my comment.

Aaron Lun (20:36:34): > Nice, very nice.

2020-04-20

Aaron Lun (02:14:40): > @Dan Bunis is it over?

Dan Bunis (02:25:12): > For now.. I still need to update the docs and add tests. Adding tests may reveal bugs :man-shrugging:.

Aaron Lun (02:33:47): > Okay. Kick off the timing script, let’s see if we can’t merge the other PR first.

Dan Bunis (02:44:52): > That just finished actually.

Dan Bunis (02:45:51): > Adding now.

Dan Bunis (02:56:25): > Slower across the board vs last time. But also I was on my laptop the whole time vs last time I’d been asleep.

Dan Bunis (02:56:56): > So to compare those accurately should I run it overnight again?

Aaron Lun (02:57:28): > Hm.

Aaron Lun (02:57:53): > That’s kind of unexpected.

Aaron Lun (02:58:22): > I guess you weren’t playing fallen order on your laptop while you were waiting.

Dan Bunis (02:58:57): > I’ll double-check that the commit was right.

Dan Bunis (03:01:01): > Each time I decided to run check, that took 10+min. I did run that a bunch in the interim :man-shrugging:.

Dan Bunis (03:02:56): > def was the right commit

Aaron Lun (03:03:21): > Well, I have to do some timings tomorrow on my end anyway. Prb just let it run overnight and see what pops up.

Dan Bunis (03:03:52): > will do

Aaron Lun (03:04:06): > In any case, what is the % slowdown?

Dan Bunis (03:06:09): > 10%, 6%, 8%, 7% compared to try#1

Aaron Lun (03:06:26): > okay, that’s not too bad.

Aaron Lun (03:07:18): > The SingleR CHECK is also pretty intensive, so it’s not entirely impossible that was also affecting timings.

Dan Bunis (03:11:19): > That’s what I was thinking.

Dan Bunis (03:12:20): > I’ll redo the recompute = FALSE this time too

Dan Bunis (12:34:27): > Overnight rerun numbers are entered. Better across the board.

Dan Bunis (12:35:03): > Has recompute = FALSE changed at all?

Jared Andrews (12:36:38): > Yeah, I didn’t rerun those to see if it had changed at all. I think some of the fine-tuning code changed for both though.

Aaron Lun (12:38:16): > That PR’s been around for long enough that I can’t remember.

Dan Bunis (12:38:32): > Okay. Well is that method staying? It got worse by a lot in some cases.

Dan Bunis (12:39:00): > But it’s certainly easier for us to maintain and vignette just the one method.

Aaron Lun (12:39:14): > Which method - recompute=FALSE?

Aaron Lun (12:39:40): > It shouldn’t have slowed down due to the changes I introduced - that would be a surprise…

Dan Bunis (12:43:55): > An alternative explanation?: If memory chunks were not released after recompute = TRUE runs, this rerun might have had a large chunk of my memory inaccessible? I ran all the TRUE before the FALSE. But R-4.0 is supposed to be better at that…

Aaron Lun (12:46:32): > The recompute=FALSE combining doesn’t do anything itself, the burden is mostly on the neighbor search and fine-tuning step for that mode of operation.

Dan Bunis (12:56:56): > Sure. But it was slower for 3/4 :man-shrugging:. Basically, my code was just… > > # This, essentially 4 times with different data > test <- SingleR( > test = raw_counts_matrix, > ref = list(HPCA, BE), > labels = list(HPCA$label.fine, BE$label.fine), > recompute = TRUE) > > # THEN this, 4 times with different data > test <- SingleR( > test = raw_counts_matrix, > ref = list(HPCA, BE), > labels = list(HPCA$label.fine, BE$label.fine), > recompute = FALSE) >

Aaron Lun (12:58:04): > I can only say

Aaron Lun (12:58:17): > ¯*(ツ)*/¯

Dan Bunis (13:00:21): > Welp, I’ll rerun tonight with just the recompute = FALSE runs to check if it was some memory bug.

Aaron Lun (21:04:05): > JESUS IT’S FAST.

Aaron Lun (21:04:49): > Dunno whether the cluster was just in a good mood today or what, but ~12000 cells against ~25 references on 10 cores was done in 6 minutes.

Aaron Lun (21:05:39): > Now, they were aggregated references, so make of that what you will, but that’s pretty good. It’s BLASTing against 100 cell types right there.

Aaron Lun (23:37:04): > @Dan Bunis let’s finish the fight. Tests look done?

Dan Bunis (23:37:30): > They are! Working on the docs now.

Aaron Lun (23:37:53): > Just reading through your PR remarks.

Aaron Lun (23:38:06): > Yes to the setup-multiref.R, if you haven’t done it already.

Aaron Lun (23:38:59): > though one should be sure that it doesn’t clash with the setup.R contents. Might be worth plowing it into setup.R instead.

Aaron Lun (23:40:14): > And mention the argument changes in NEWS. The visual changes don’t really matter as much, they won’t cause people’s code to crash so it’s okay.

2020-04-21

Dan Bunis (00:00:22): > I haven’t set up setup-multiref.R. I’ve never had multiple before, so I’m not versed on how to set which test files use what… is that easy? Or is that why you suggest putting it in the main setup.R?

Aaron Lun (00:01:21): > Well, it’s not hard, but if you accidentally overlap variable names with contents of setup.R, we’d never know until we started seeing incomprehensible errors.

Aaron Lun (00:02:02): > Your current setup is probably the safest. If you want to be ultra safe, you define a function in setup.R that generates a list containing the variables to be used. Then you can just call the function multiple times in each of the individual test- files.

Aaron Lun (00:02:16): > A poor man’s namespacing, so to speak.
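
The "function that returns a list" idea might look something like this. The helper name `make_multiref_setup()` and the toy matrices are hypothetical; nothing here is taken from the package's actual test suite.

```r
# In setup.R: a hypothetical helper that builds the shared test fixtures.
# Everything lives inside the returned list, so no free-floating variable
# names can collide with other objects defined in setup.R.
make_multiref_setup <- function(seed = 100) {
    set.seed(seed)
    list(
        test = matrix(rpois(200, 5), ncol = 10),
        ref1 = matrix(rpois(200, 5), ncol = 10),
        ref2 = matrix(rpois(200, 5), ncol = 10)
    )
}

# In each individual test file: fetch the fixtures under one local name.
multiref <- make_multiref_setup()
dim(multiref$ref1)
```

Because the seed is set inside the function, every test file that calls it gets the same fixtures, regardless of the order in which the files run.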

Dan Bunis (00:06:34): > I thought there was a way to have certain test- files use certain setup- files?

Dan Bunis (00:06:44): > If not, I think the current method is best…

Aaron Lun (00:06:59): > Possibly ¯*(ツ)*/¯

Aaron Lun (00:07:11): > If you can find it, then that would also be acceptable.

Dan Bunis (00:07:51): > That’s not on my docket for tonight, and it’d be nice to finish this tonight.

Aaron Lun (00:08:09): > yeah, punch that clock.

Dan Bunis (00:08:10): > ¯*(ツ)*/¯

Dan Bunis (01:00:46): > > mention the argument changes in NEWS > Just to confirm, I SHOULD add the input name change from labels to labels.use, but NOT the defaulting changes to show and show.nmads, correct?

Aaron Lun (01:01:07): > Yes, that’s right.

Aaron Lun (01:01:22): > I mean, it doesn’t really matter. You can mention them if you like.

Dan Bunis (01:06:03): > ¯*(ツ)*/¯

Aaron Lun (01:07:03): > Indeed

Aaron Lun (01:07:48): > I’d err on the side of “what people don’t know don’t hurt them”. As long as it actually doesn’t hurt them.

Aaron Lun (01:08:01): > And plotting changes is definitely something that hurts no one.

Dan Bunis (01:14:17): > I certainly agree for a package with a main purpose distinct from visualization.

Dan Bunis (01:14:45): > But, I’d be more inclined to include them if we were discussing dittoSeq.

Aaron Lun (01:15:02): > I suppose that would make sense.

Dan Bunis (01:21:54): > Running the final check now :smiley:

Dan Bunis (01:34:40): > committed

Aaron Lun (02:01:18): > guess I should take a break from watching halo 5 videos.

Aaron Lun (02:02:24): > I’m going to do a squash-and-merge commit, just like I did with the other one.

Aaron Lun (02:02:39): > Hopefully it preserves the authorship.

Aaron Lun (02:07:00): > Actually, can you do the squash-and-merge?

Aaron Lun (02:07:25): > Just prune back the commit message to the important bits, like in: https://github.com/LTLA/SingleR/commit/b81fe3e403d02f8d554cf4b1a642a4fcb111ab93

Aaron Lun (02:09:16): > Admittedly that was not the best squash-and-merge example, I had a bunch of changes in there that shouldn’t have been compressed down into a single commit… oh well. But yours is more contained so you should be able to break it down into a short description, especially given that we spent a lot of time fiddling with the reorganization - we don’t need multiple commits to show that we did that.

Aaron Lun (02:15:19): > Just to be clear we’re talking about the same thing: - File (PNG): github.com_LTLA_SingleR_pull_107 (2).png

Aaron Lun (02:15:43): > Okay, you’re not an admin, but you should be able to do it once you approve the PR.

Aaron Lun (02:16:47): > Hm. Guess he’s asleep.

Aaron Lun (02:16:52): > Well, he’s earnt it.

Dan Bunis (02:21:14): > Not asleep, just watching a movie. I’ll do it after =)

Dan Bunis (04:01:47): > :tada:

Aaron Lun (04:08:06): > thx

Aaron Lun (04:08:31): > It turns out that squash-and-merge sets the author to whoever made the PR, which kind of sucks.

Aaron Lun (04:08:47): > Oh well. Live and learn, I suppose.

Dan Bunis (04:18:34): > But all those commits!

Dan Bunis (04:20:10): > But meh, I don’t really care. :man-tipping-hand: I don’t have a commit-every-day streak going that could potentially have been broken by this.

Aaron Lun (04:21:43): > I care deeply, now git blame doesn’t work!

Aaron Lun (04:21:52): > Who to blame… but myself.

Aaron Lun (04:32:21): > 1.1.14 is up.

Jared Andrews (08:46:26): > Neat.

Martin Morgan (10:21:52): > @Martin Morgan has left the channel

2020-04-22

Aaron Lun (00:17:25): > haha! Now that Martin’s gone, I can go back to swearing.

Aaron Lun (01:04:11): > Download stats are going to pop 800 UIPs this month. This should move us towards the top 150.

Dan Bunis (01:25:04): > that’s awesome!

Dan Bunis (01:25:22): > but… how do you know that without it already auto-updating?

Jared Andrews (08:25:55): > You can tell just from the trends on the stats page.

Dan Bunis (21:06:36): > Is this something to be concerned about in the Windows check report? > > Rd warning: C:/Users/biocbuild/bbs-3.11-bioc/tmpdir/Rtmpw9fZ11/R.INSTALL158816c654dd/SingleR/man/plotScoreDistribution.Rd:67: file link 'grid.arrange' in package 'gridExtra' does not exist and so has been treated as a topic > > The examples and tests are fine, so the issue is not a question of gridExtra availability.

Aaron Lun (21:07:30): > That’s okay.

Aaron Lun (21:07:35): > All windows tests have that.

Dan Bunis (21:08:43): > okay kinda figured, but then it wasn’t built and propagated so I thought it might actually matter… - File (PNG): image.png

Aaron Lun (21:10:46): > That’s also the case for every Windows package, I guess it’s some teething problems with Rtools 4.0.

Dan Bunis (21:11:12): > oh oh okay

Dan Bunis (21:11:25): > I hadn’t realized how widespread

Jared Andrews (21:48:24): > Yeah, building from source on Windows is hit and miss currently.

2020-04-27

Jared Andrews (21:31:41): > Huge image incoming with a question: - File (PNG): image.png

Jared Andrews (21:32:32): > For the combined plot, is there a way to prioritize labels so that only those with at least 1 cell containing them are shown and labels assigned to no cells are not?

Jared Andrews (21:33:50): > Many of the cells on the right-side of the combined plot are Tregs, but that label isn’t present. I think showing the top x labels in terms of assignment frequency may be a good way to go about it. @Dan Bunis, any thoughts?

Aaron Lun (21:34:44): > True art right there.

Jared Andrews (21:35:34): > I’m starting to go down the rabbit hole of ultra-specialized T cell subsets, so we’ll see how fine-grained I can get or if it just gets too muddy.

Dan Bunis (21:37:21): > =) The max.labels input can tackle the assignment frequency purpose. It actually operates a bit differently from the direct # of calls, but should work.

Jared Andrews (21:39:59): > Ah, I see. I can get it to show, but it’d be nice to hide some of the less important ones.

Dan Bunis (21:42:39): > For the annotations… I’m actually surprised if labels that are non-existent would show. The results$labels column is not a factor is it?

Jared Andrews (21:43:52): > Nope, character.

Dan Bunis (21:45:00): > I think maybe I’d just misunderstood your goal.. Do you want an “other” color used for labels that are only present at a low frequency?

Jared Andrews (21:45:36): > No, I just don’t want them to push out labels that are actually present at a high frequency.

Aaron Lun (21:46:02): > In other words, get rid of all the all-NA rows in the top-left plot.

Jared Andrews (21:46:22): > Yep.

Dan Bunis (21:47:42): > Ohhhkay. That should be easy.

Jared Andrews (21:48:02): > Bumping max.labels up to 50 gets me what I want, but there’s still a lot of relatively not useful info in there with those all-NA rows: - File (PNG): image.png

Jared Andrews (21:48:49): > Thanks for making it work out of the box with the combined results though, extremely convenient.

Dan Bunis (21:51:11): > Actually, not totally sure I get what’s different now. What was missing before you changed max.labels from 40 to 50?

Dan Bunis (21:51:38): > > In other words, get rid of all the all-NA rows in the top-left plot. > This task is what should be easy.

Dan Bunis (21:52:15): > @Aaron Lun, PR to master?

Aaron Lun (21:52:52): > Sounds like a plan.

Aaron Lun (21:53:43): > Hm.

Aaron Lun (21:54:12): > Actually, this could be tricky, we’re right in the middle of the BioC release. Let me check whether RELEASE_3_11 has been created yet.

Dan Bunis (21:56:09): > RELEASE_3_11 has been created, for dittoSeq at least, and I think everything else too. I’ve been keeping up pretty close seeing as it’s my first Bioc-release cycle :smiley: (as a package’s LEAD maintainer)

Aaron Lun (21:57:15): > Indeed.

Aaron Lun (21:58:07): > Right, here’s what we’ll do. I’ve made a TEST branch that is intended for changes that will go both into master and RELEASE_3_11, e.g., bugfixes and non-breaking improvements. Do your PRs against that.

Aaron Lun (21:58:56): > I’ve used a separate branch so as to avoid problems when master and RELEASE_3_11 diverge with breaking changes such that we can’t simply do all our changes in one and merge them into the other.

Aaron Lun (22:00:52): > (One might be tempted to do all the non-breaking changes on RELEASE_3_11 but this still causes merge conflicts if we make a bugfix in RELEASE_3_11 in a region of code that has been removed in master.)

Dan Bunis (22:07:36): > gotcha, I’ll commit in a new branch and PR to TEST. Unless you’d rather I simply commit this one directly to TEST? The all-NA row removal may just be 2 lines :man-shrugging:.

Aaron Lun (22:08:10): > If you’re happy, I’m happy.

Aaron Lun (22:08:14): > Well, sometimes.

Aaron Lun (22:08:41): > Just do it, I can always wipe that one. I should probably protect the RELEASE_ branches, though.

Dan Bunis (22:12:56): > Does the RELEASE_ branch sync directly?

Aaron Lun (22:15:07): > What do you mean by that?

Dan Bunis (22:19:42): > Just wondering where the need to protect it comes from.

Dan Bunis (22:20:38): > I won’t commit to it, but I suppose you may not trust that haha.

Aaron Lun (22:21:54): > Oh right.

Dan Bunis (22:22:11): > Q was just whether it’s set up in a way that a commit to that branch would automatically be pushed to the Bioc branch versus whether a git push upstream by you would be required to actually affect the released package.

Aaron Lun (22:22:36): > The latter, just to provide an element of control and oversight.

Aaron Lun (22:22:58): > More generally, the protection gives everyone a chance to have a look at the changes and make sure they’re okay.

Aaron Lun (22:23:15): > They don’t necessarily have to look at it, but at least they don’t get ambushed by unexpected changes either.

Dan Bunis (22:25:11): > :thumbsup: would love to hear how you actually set that up, perhaps in dm. I made a detached RELEASE_ branch for dittoSeq too, but perhaps in a more roundabout way than is actually required.

Aaron Lun (22:25:55): > Mind you, if it’s just yourself, protecting stuff is probably overkill.

Aaron Lun (22:26:12): > I don’t bother to set it up on packages where I’m the only committer.

Dan Bunis (22:47:52): > Makes sense. I’m not gonna protect, but I’m still glad I figured out how to detach the commits.

Dan Bunis (22:47:59): > Pushed to TEST

Jared Andrews (22:48:53): > Sorry, got sucked up in other stuff. The difference is on the right side of the combined heatmap where the Treg label is present in the second heatmap (which quite a few cells have), that is absent in the first.

Jared Andrews (22:49:01): > Regardless, will test it out.

Aaron Lun (22:49:39): > I will also look at that later tonight, but in the meantime, we should start planning for the other three items for this development cycle: > * book > * celldex (or whatever we decided to call it) > * ontologies

Aaron Lun (22:50:23): > @Vince Carey and I can probably take the ontologies. I’d like volunteers to lead the other two.

Jared Andrews (22:52:33): > I am out until mid June on more SingleR stuff, already overcommitted. Will gladly contribute after that.

Dan Bunis (22:53:35): > Not sure I can commit much extra time at the moment. I have lots to do before graduation.

Jared Andrews (22:53:49): > ^^

Dan Bunis (22:54:11): > That’s probably mid-July for me, but I don’t have a date yet.

Aaron Lun (22:59:20): > Bum.

Aaron Lun (22:59:52): > It’s probably less work than what you’re imagining.

Dan Bunis (23:00:08): > Back to @Jared Andrews’s issue at hand, I see the mysterious disappearance of the Tregs now. I hadn’t before. Thatttt issue highlights that the current way max.labels works (based on max values after labels’ scores are scale()ed) is NOT valid when most scores are NA. The current fix should postpone the need to address the issue, but a calculation method fix would be ideal.

Jared Andrews (23:00:41): > Yeah, the TEST branch works well: - File (PNG): image.png

Jared Andrews (23:01:58): > I defend June 4th and will probably take a week-ish off after. I can probably take care of celldex or whatever then. It’s just throwing the data code into another package right? So I can just rip off the scRNAseq format for the most part?

Dan Bunis (23:07:49): > Do you think these ultimate calls are valid @Jared Andrews? Kinda expected with T cells for sure, but it does look like the scores are pretty muddy.

Jared Andrews (23:12:33): > I just don’t think there is a whole lot that differentiates many of the subsets. I can still track developmental trajectories that make sense based on these assignments. I’m sure they aren’t perfect, but they’re decent as a first pass. I think once we get the ontology stuff rolling and the labels can be harmonized to a degree, it will be easier to discern accuracy. I’m sure there are some single cell T cell sets out there with manual annotations that we could compare to as well.

2020-04-28

Aaron Lun (00:27:38): > @Dan Bunis I ended up changing my mind… can you make a PR into master from TEST?

Aaron Lun (02:48:09): > Let’s get the ball rolling. https://ltla.github.io/SingleRBook/ - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Dan Bunis (13:43:24): > ^^^ Pretty great already! For anyone that might be able to contribute, is the raw form hosted somewhere?

Aaron Lun (13:44:15): > Thought I sent you an invite already.

Dan Bunis (13:54:56): > Ahh, perhaps. Been having internet issues :sob: so I probably just haven’t seen.

Aaron Lun (17:54:02): > ugh.

Dan Bunis (18:03:07): > You can add more details if you’d like, Aaron, but I’ve given a short answer.

Aaron Lun (18:04:52): > good, good.

Aaron Lun (18:22:42): > holy shit we cracked 1k downloads.

Dan Bunis (18:36:25): > well over, that’s a big jump!

2020-04-29

Anthony Sonrel (03:50:58): > @Anthony Sonrel has joined the channel

2020-04-30

Aaron Lun (03:34:20): > LOL: https://github.com/LTLA/SingleRBook/network/alert/docs/libs/jquery-2.2.3/jquery.min.js/jquery/open

Jared Andrews (10:43:14): > Link is broken, I assume it was a security alert for jquery? I just got one for another project.

Alan O’C (10:48:10): > Presumably github doesn’t want people to advertise security vulnerabilities in their projects:smile:

Alan O’C (10:49:19): > 2.2.3 seems to have an XSS vulnerability, I assume that’s what it was related to

2020-05-01

Charlotte Rich-Griffin (05:26:09): > @Charlotte Rich-Griffin has joined the channel

Dan Bunis (13:19:34) (in thread): > thanks for the reminder!

Dvir Aran (16:19:24): > @a@

Dvir Aran (16:26:51): > Kids playing with my phone…

Aaron Lun (16:27:27): > well, at least they didn’t call someone.

Aaron Lun (16:27:50): > I used to get a few calls from one of my supervisor’s children in that manner.

Aaron Lun (16:29:09): > Was so excited to talk science with my boss on the weekend. Bitterly disappointed when I heard toddler feeding sounds instead.

Dvir Aran (21:43:54): > I broke my foot a week ago, so now when my kids take my phone I can’t even get it back

Dan Bunis (21:58:52): > How rude of them haha! Hope it heals quick, Dvir! But “good” timing perhaps to be off your feet while we’re all stuck at home anyway I suppose.

Dvir Aran (22:32:02): > Yeah, told my boss that because I broke my foot, I will be working from home in the next few weeks… :)

2020-05-02

Aaron Lun (01:39:35): > Holy crap, 1.2k downloads this month. Really moving up in the world.

2020-05-03

Nitin Sharma (11:01:40): > @Nitin Sharma has joined the channel

Nitin Sharma (12:18:10): > Hello everyone, I have created a channel #singlecell-queries for more general queries regarding single-cell analysis.

2020-05-05

Jared Andrews (11:49:59): > > Third, have you compared 10x data to Smartseq data. I was trying to use data from SmartSeq as my reference data but the correlations are fairly low and are about the same regardless of the cell type, so I was wondering if due to the nature of Smartseq being full length transcripts, you can’t do a direct comparison to 10x data. Do you have any recommendations for ways to compare these 2 types of single cell data in SingleR? > Any thoughts on this? Hadn’t really thought about it previously. Asked for more info as to whether label assignment still makes sense despite low scores.

Alan O’C (11:54:00): > @Alan O’C has left the channel

Aaron Lun (12:05:29): > Smart-seq2 is pretty close to bulk and we’re doing it all the time.

Dan Bunis (12:17:13): > Yup yup

Dan Bunis (12:27:28): > Re Jared’s question, there’s no indication inside that quote of whether the labels given by SingleR still make sense. If they do, I wonder if they’re overthinking / being picky by looking for “high” absolute correlations whereas what matters for SingleR are relative correlations between labels?

Jared Andrews (12:29:30): > Yeah, I made that point to them - I think they’re mostly troubled by low scores/small differences between scores.

Jared Andrews (12:30:25): > I explained that often the differences between scores aren’t so great as one might expect, but that the scores are really irrelevant so long as the appropriate label is assigned.
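
A toy illustration of the point about relative versus absolute scores. The score values are made up, and the sketch ignores SingleR’s fine-tuning step; it only shows that the call depends on which label scores highest, not on how high that score is.

```r
# Made-up correlation-like scores: all are low in absolute terms.
scores <- matrix(
    c(0.21, 0.18, 0.15,
      0.12, 0.19, 0.16),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("cell1", "cell2"), c("T", "B", "NK"))
)

# The assigned label is simply the top-scoring column for each cell.
labels <- colnames(scores)[max.col(scores, ties.method = "first")]

# Margin between the best and second-best score, a rough confidence measure.
margin <- apply(scores, 1, function(x) {
    s <- sort(x, decreasing = TRUE)
    s[1] - s[2]
})
```

A small margin is a better reason for concern than a low top score by itself.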

Dan Bunis (12:32:04) (in thread): > Does this mean you think it will make it into BioC after all too? :grin: When you first brought it up, that was a question

Dan Bunis (12:37:34) (in thread): > Glad that headache is gone! (Or at least… “no longer limiting”)

Dan Bunis (12:41:41) (in thread): > Sounds good. But I’ll say that I do like when users have options or can stick to one package if it fits their data. So when it makes sense, from the cytometry/CyTOF vs RNAseq perspective, for ours to both do certain things, so be it :man-shrugging:.

Dan Bunis (12:41:57) (in thread): > Sounds like what you were going for anyway lol

Dvir Aran (14:39:06): > @Jared Andrews need to normalize to gene length

Dvir Aran (14:39:20): > I did that internally in my version

Dan Bunis (14:50:01): > Right, right! Perhaps using log normalized FPKMs, or the like, for the ref matrix would improve the marker selection and correlations.
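
A minimal sketch of the gene-length normalization suggested above, assuming a raw count matrix and a vector of gene lengths in base pairs are available for the full-length (e.g. Smart-seq) reference. All values and object names here are hypothetical.

```r
# Hypothetical reference counts (genes x samples) and gene lengths in bp.
counts <- matrix(
    c(100, 200, 300,
      400,  50, 150),
    nrow = 3,
    dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))
)
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 500)

# FPKM = counts / (gene length in kb) / (library size in millions).
lib_size <- colSums(counts)
per_kb <- counts / (gene_length / 1e3)    # divides each row by its gene length
fpkm <- t(t(per_kb) / (lib_size / 1e6))   # divides each column by its library size

# Log-transform with a pseudo-count before using as a reference matrix.
log_fpkm <- log2(fpkm + 1)
```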

Aaron Lun (15:06:43): > Probably worth mentioning this in the vignette. Or better, the book.

2020-05-06

Jared Andrews (16:48:35): > Getting an error with the plotScoreHeatmap changes in 1.3.1 that was not present in 1.2.0. > > pdf("tester.pdf", height = 20, width = 24) > p <- plotScoreHeatmap(pred) > grid.draw(p) > dev.off() >

Jared Andrews (16:48:51): > Trace: > > Error in `levels<-`(`**tmp**`, value = as.character(levels)) : factor level [24] is duplicated > 6. factor(scores.labels, levels = colnames(scores)) > 5. table(factor(scores.labels, levels = colnames(scores))) > 4. .trim_byLabel_and_normalize_scores(scores, labels.use, max.labels, normalize, scores.title, scores.labels) > 3. .trim_normalize_reorder_scores(scores = scores, scores.title = scores.title, labels.use = labels.use, max.labels = max.labels, cells.use = cells.use, normalize = normalize, cluster_cols = cluster_cols, order.by = order.by, cells.order = cells.order, labels = labels, clusters = clusters, ... > 2. .plot_score_heatmap(scores = scores, labels = labels, prune.calls = prune.calls, cells.use = cells.use, labels.use = labels.use, max.labels = max.labels, clusters = clusters, cells.order, order.by = order.by, show.labels = show.labels, show.pruned = show.pruned, scores.title = scores.title, labels.title = labels.title, ... > 1. plotScoreHeatmap(pred) >

Jared Andrews (16:49:38): > @Dan Bunis, any insight on what might be going on off the top of your head?

Aaron Lun (16:50:11): > Oh. That’s easy. It’s just that some columns have duplicated names. A unique and re-expansion should be sufficient.

Aaron Lun (16:50:30): > Something like > > table(factor(scores.labels, levels = unique(colnames(scores))))[colnames(scores)] >

Jared Andrews (16:50:33): > Okay, that’s what I expected.

Jared Andrews (17:03:50): > And yup, changing line 426 in plotScoreHeatmap.R to that fixed it.

Dan Bunis (17:18:32): > Sorry, was on a call. Sounds like a fix was found, but I think there may be a bigger potential issue here to address

Dan Bunis (17:19:22): > Index numbers, not column names, are used for the max.levels trimming. ~~~Thus, if a label name was initially repeated, this fix would lead to only the first instance’s scores ever being retained.~~~ @Aaron Lun, when label names are the same across individual-refs, the scores end up in separate columns of the scores matrix, but with duplicate names, correct? ~~~Or possibly to all indices “to the right” of the second instance being shifted, and unwanted labels’ data being retained.~~~ (crossed out cuz they were wrong!)

Dan Bunis (17:20:16): > If so, I can build in a uniquefy’ing step for colnames for the max.levels trim.

Aaron Lun (17:21:48): > I don’t have the code in front of me, but from memory, shouldn’t both duplicates be retained unless they straddled the max.levels threshold?

Aaron Lun (17:22:06): > I don’t recall a match or duplicated step that would only take the first.

Dan Bunis (17:26:40): > There isn’t.

Dan Bunis (17:27:14): > Sorry for confusion, many distractions persisted, but I think your code suggestion is fine…

Dan Bunis (17:29:57): > > table(factor(scores.labels, levels = unique(colnames(scores))))[colnames(scores)] > > the [colnames(scores)] in ^^^ duplicates that combined level’s counts properly for the later ranking step, so theoretical :100: from me.

Aaron Lun (17:35:30): > @Jared Andrews if you can make a PR with a test that hits that case, I’ll merge it in.
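
The failure and the suggested fix can be reproduced on toy data. The matrix below is a made-up stand-in for a combined scores matrix in which two references both contribute a "T cells" column.

```r
# Toy scores with a duplicated column name.
scores <- matrix(
    c(0.9, 0.1, 0.2,
      0.2, 0.8, 0.3,
      0.1, 0.2, 0.7,
      0.3, 0.1, 0.6),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, c("T cells", "B cells", "T cells"))
)
scores.labels <- c("T cells", "B cells", "T cells", "T cells")

# Original form: factor() rejects duplicated levels, so this errors.
err <- tryCatch(
    table(factor(scores.labels, levels = colnames(scores))),
    error = function(e) conditionMessage(e)
)

# Suggested form: tabulate on the unique levels, then re-expand by column
# name so that each duplicated column receives the combined count.
counts <- table(factor(scores.labels, levels = unique(colnames(scores))))[colnames(scores)]
```

After re-expansion, counts has one entry per column of scores, which is what the later ranking step expects.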

Dan Bunis (17:49:31): > I’m pondering whether it’s worth solving the issue that these labels’ counts would be combined so their ranking would be artificially boosted. But is such a fringe case worth the hassle of going back to a scores matrix-based table(apply(scores, 1, which.max)) instead of the labels-based table(scores.labels)?
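
A toy comparison of the two counting approaches mentioned above, using made-up scores with a duplicated "T cells" column. The labels-based count gives both duplicated columns the combined total, while the column-index count keeps them separate.

```r
scores <- matrix(
    c(0.9, 0.1, 0.2,
      0.2, 0.8, 0.3,
      0.1, 0.2, 0.7,
      0.3, 0.1, 0.6),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, c("T cells", "B cells", "T cells"))
)
best <- apply(scores, 1, which.max)
scores.labels <- colnames(scores)[best]

# Labels-based: duplicated names share one combined count.
by_label <- table(scores.labels)[colnames(scores)]

# Column-based: one count per column, regardless of its name.
by_column <- tabulate(best, nbins = ncol(scores))
```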

Aaron Lun (17:50:05): > IMO don’t worry about it.

Dan Bunis (17:50:11): > Once we build out the ontoProc reference matching, will such instances become less fringe and more common?

Aaron Lun (17:54:27): > There are two levels here: > * Getting an inflated ranking. I don’t think that’s an issue, the ranking isn’t meant to be quantitative anyway, it’s just a way to pick interesting labels. > * Disambiguating the labels on the combined heatmap. This is slightly more of a concern, but only because having three copies of the same cell type label is kind of confusing. We could look into adding disambiguation of the visual labels, but this would be independent of the ranking.

Aaron Lun (17:55:54): > There is a super easy way of doing both, though.

Dan Bunis (17:58:14): > Bullet one: True. So it’s nbd.

Dan Bunis (18:00:14): > Bullet two: In the annotation bar, $labels is a character vector, not factor, so ref-duplicates are already together with a single color. It’s in the heatmaps rows/rownames where “duplication” and confusion may happen. But yup, that’d be an easy fix which we can worry about later when better reference-matching methods necessitate it.

Aaron Lun (18:03:25): > For future reference, the easy way I mentioned above is to modify scores and labels inside the top-level plotScoreHeatmap function by paste-ing on the reference identifier. Then the rest of the function can proceed without knowing about any difference.

Dan Bunis (18:09:25): > For future reference, the labels paste-ing would need to be an entire re-labeling based on max call location after the scores paste-ing happens to retrieve the proper ref-label.

Aaron Lun (18:09:55): > Yes, both of them would be done at the same time.
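
A rough sketch of the disambiguation idea discussed above, on toy data. The refs vector (which reference each score column came from) is a hypothetical input, and re-deriving labels from the highest-scoring column is a simplification of how combined calls are actually made.

```r
scores <- matrix(
    c(0.9, 0.1, 0.2,
      0.2, 0.8, 0.3,
      0.1, 0.2, 0.7,
      0.3, 0.1, 0.6),
    nrow = 4, byrow = TRUE,
    dimnames = list(NULL, c("T cells", "B cells", "T cells"))
)
refs <- c("ref1", "ref1", "ref2")  # hypothetical: source reference per column

# Paste the reference identifier onto each column name so names are unique.
colnames(scores) <- paste0(refs, ".", colnames(scores))

# Re-label every cell from the location of its maximum score, so the label
# carries the same reference prefix as the column it came from.
labels <- colnames(scores)[max.col(scores, ties.method = "first")]
```

With unique column names, downstream steps such as factor() and table() no longer need special handling for duplicates.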

Aaron Lun (18:10:12): > Let’s just do it in Jared’s PR.

Jared Andrews (18:18:36): > Okay, not at PC atm, will start PR later though. I might not get to tests for a few days, I am very much in crunch time

Dan Bunis (18:19:54): > same here. As of today, I need to cut back on procrastinating via package-dev, so my time is limited until next week

Aaron Lun (18:20:17): > > procrastinating via package-dev > That’s my entire day to day.

Jared Andrews (18:20:32): > You’re living the dream then.

Aaron Lun (18:20:38): > Damn straight

Jared Andrews (18:21:21): > So is that actually your role at Genentech? Internal/open source package dev?

Aaron Lun (18:21:33): > Pretty much.

Jared Andrews (18:21:37): > Neat.

Dan Bunis (18:23:36): > Man, I’d thought so. Seriously, the dream.

2020-05-07

Jared Andrews (13:09:25): > Thanks for handling that test @Dan Bunis

Jared Andrews (13:10:15): > I was definitely overthinking that

Dan Bunis (13:10:53): > np! Was mostly copy/paste since I know that script! It also may be ready in this form for the extra fixes to come.

Jared Andrews (13:21:25): > @Aaron Lun have you found Github Actions generally easier/better than Travis-CI for R stuff? I’ve had a number of struggles with Travis for R packages.

Aaron Lun (14:08:31): > I dunno, I just used @Kevin Rue-Albrecht’s scripts.

Charlotte Soneson (14:50:43) (in thread): > I generally find GitHub Actions more convenient than travis - no ssh keys, generous build time, multiple operating systems + containers in parallel

Kevin Rue-Albrecht (15:03:27): > @Jared Andrews the main caveat that I’ve seen is also its strength: the arbitrarily large configuration matrix tends to have 1-2 configurations fail on average, often related to glitches with the EHub.

Kevin Rue-Albrecht (15:04:04): > Aside from that, Charlotte pretty much sold all the nice features in the thread

Charlotte Soneson (15:04:15) (in thread): > I haven’t seen these for some time now actually:slightly_smiling_face:

Jared Andrews (15:05:24): > Will have to give it a shot, juggling travis and appveyor is a pain, and longer build times would be very nice.

Kevin Rue-Albrecht (15:05:38) (in thread): > True. Plus I’m pretty sure that with a bit of time/effort, it should be possible to cache the ehub folder between builds

Kevin Rue-Albrecht (15:06:09): > I’ve tried my hand at writing a post recently about it, let me dig that up

Kevin Rue-Albrecht (15:06:21): > https://kevinrue.github.io/post/transiton-from-travis-ci-to-github-actions/ - Attachment (Kevin Rue-Albrecht): Transiton from Travis CI to GitHub Actions | Kevin Rue-Albrecht > Overview The recent introduction of GitHub Actions makes a lot of our Travis CI build configurations redundant. In particular, the examples actions include: R-CMD-check using rcmdcheck::rcmdcheck() to check the package.

Kevin Rue-Albrecht (15:06:34): > Still working on my writing and formatting skills ;)

Dan Bunis (15:15:17): > Because of this useful discussion, I’m a bit less sorry to have committed test code with a bug, thus likely instigating the Github Actions addition to SingleR:innocent:

Charlotte Soneson (15:19:31): > If you want some more inspiration, there’s a reasonably extensive example e.g. here (two setups currently fail since the new Bioc containers are not yet available): https://github.com/csoneson/ExploreModelMatrix/actions/runs/97533106/workflow

Aaron Lun (16:41:32): > Need a SingleR GIF for the book.

Aaron Lun (17:21:46): > Wrote a pretty neat preface for the book.

Federico Marini (17:32:31): > John Aaron Lennon

Federico Marini (17:32:49): > Imagine there’s no genome:musical_note:

2020-05-08

Kevin Rue-Albrecht (10:38:47): > FWIW, the blog post I linked above is outdated. I recommend looking at Charlotte’s ExploreModelMatrix. I’ve also just updated the action in iSEE using ExploreModelMatrix, with a little addition to manually install Rtsne because it’s a suggests of a suggests, which doesn’t get installed as part of the dependencies.

2020-05-14

Aaron Lun (12:21:16): > It has begun: https://github.com/LTLA/CellTypeReferences

Aaron Lun (12:25:02): > I forgot what the other names were, so if you want to change it, come and get it.

Jared Andrews (12:58:36): > Is there anything else that actually needs to be done with that?

Aaron Lun (12:59:05): > FANCY YOU ASK.

Jared Andrews (12:59:09): > :anguished:

Aaron Lun (12:59:32): > I do need you and others to refill in the description of each dataset and provide some recommendations about when they should be used.

Jared Andrews (13:00:15): > June 5th-13th is a side-project/off week for me, so if you can wait till then, happy to do it.

Jared Andrews (13:00:38): > Can also help contribute to the book then if there is anything that is lacking.

Aaron Lun (13:01:25): > A week, huh.

Aaron Lun (13:01:43): > Well, in that case

Aaron Lun (13:01:52): > the other thing that we need is more non-immune references.

Jared Andrews (13:01:57): > A week where I also plan to do a significant amount of non-work things, so I’m not gonna go nuts, but yeah.

Jared Andrews (13:02:02): > Can def do that.

Aaron Lun (13:02:33): > Also possibly a good opportunity for you to try submitting CellTypeReferences to BioC.

Aaron Lun (13:02:46): > get your feet wet with the package submission process.

Jared Andrews (13:03:05): > Sounds good, I may have another package to submit for the next cycle anyway.

Jared Andrews (13:03:18): > Could use the practice.

Dan Bunis (13:05:32): > One rec was cellDex

Jared Andrews (13:06:37): > Yes, that is also the one I remember.

Dan Bunis (13:08:46): > I can go through some of the ref descriptions next week or the week after. And I’ll poke my classmates who were going to help wrangle the brain and… was it a whole-developing-mouse dataset? idr… references that I’ve mentioned before.

Jared Andrews (13:09:11): > Yeah, there are some huge mouse scRNA-seq datasets around for developmental brain stuff.

Aaron Lun (13:12:13): > Remember, any of the scRNA-seq datasets probably have a separate “original” home elsewhere (e.g., MouseGastrulationData) that we will then collate into a more amenable reference in PACKAGE_NAME_HERE.

Jared Andrews (13:13:00): > Yes, would likely stick them into scRNAseq, no?

Aaron Lun (13:13:04): > Also happy to host it in scRNAseq

Aaron Lun (13:13:06): > Yes.

Aaron Lun (13:13:11): > Was just going to say that.

Aaron Lun (13:13:35): > Kind of depends on how complex they are, if it’s got multiple levels and batches and etc. then it may be worth separating into a different package.

Aaron Lun (13:13:46): > See for example http://bioconductor.org/packages/release/data/experiment/html/TabulaMurisData.html - Attachment (Bioconductor): TabulaMurisData > Access to processed 10x (droplet) and SmartSeq2 (on FACS-sorted cells) single-cell RNA-seq data from the Tabula Muris consortium (http://tabula-muris.ds.czbiohub.org/).

Aaron Lun (13:14:02): > Now THAT would be a good place to start.

Jared Andrews (13:14:38): > If I collect additional bulk sets from GEO, etc, are they good to go straight into PACKAGE_NAME_HERE?

Aaron Lun (13:14:55): > yep.

Jared Andrews (13:15:07): > Perfect.

Jared Andrews (13:18:55): > Huge number of single cell sets here, though labels are likely quite hit and miss: https://panglaodb.se/samples.html - Attachment (panglaodb.se): Samples | PanglaoDB > Single cell RNA sequencing data.

Jared Andrews (13:19:03): > Lots of “Unknown” labels.

Aaron Lun (13:20:30): > I don’t know that those are the author-derived labels, I had thought they were from another round of annotation.

Jared Andrews (13:21:36): > Correct, likely have to go back to the original studies for better labels.

Aaron Lun (13:25:34): > TBH the PanglaoDB stuff really deserves an entirely separate package that provides an interface to the DB.

Aaron Lun (13:25:53): > … which is best maintained by the original authors.

Dan Bunis (13:32:15): > … except that with the general lack, at most institutions, of quality bioinformatics training, most authors would have no idea how to do something like that.

Aaron Lun (13:35:48): > I was referring to the PanglaoDB authors, who presumably know their way around computers.

Dan Bunis (13:41:18): > oh oh ohhhhh. They might be enticed to make that if presented in a way that stresses how much extra use the DB might get with a direct R interface:man-shrugging:

Aaron Lun (13:42:57): > Indeed, one could present that argument to them.

Dvir Aran (18:11:50): > https://www.biorxiv.org/cgi/content/short/2020.05.13.094953v1

Dvir Aran (18:12:24): > Looks interesting, but really need to test it

Dvir Aran (18:14:00): > It’s kind of weird to call it reference-free, because there is a reference of course, just not used when making the prediction. The model is still trained on a reference…

Dvir Aran (18:14:46): > But other than that, the idea seems solid to me and at least according to the manuscript they did a lot of testing

Dvir Aran (18:18:39): > My other concern is that I don’t really see the point of learning across tissues. Just going to add noise. However, they might have used the tissue as a feature.

Aaron Lun (18:19:14): > I’m still traumatized from trying to get cellassign (specifically tensorflow) to run reliably on my cluster.

Aaron Lun (18:19:37): > Getting triggered by “neural networks” and “cell type annotation” so I’m not going to revisit that experience anytime soon.

2020-05-18

Aaron Lun (17:32:20): > Multiple reference chapter is done.

Aaron Lun (17:32:54): > Just need to do the diagnostics chapter and the ontology chapter.

Aaron Lun (18:55:39): > Just cleaned up the CellTypeReferences vignette. It is ready to launch, though I would like some more information about when each reference should be used. (Technical details about the reference should live in?).

2020-05-19

Aaron Lun (16:55:28): > My god. I hadn’t realized how long it took to do fine tuning against ImmGen’s fine labels. 1000 cells took almost 15 minutes, sheesh.

Dan Bunis (17:08:29): > ontoProc might prove especially useful there! Lots of ImmGen cell definitions are ultimately kinda similar.

Dan Bunis (17:16:29): > I also have a large run going right now that is taking a pretty long time. 60k cells (mostly from Covid-patient bronchoalveolar lavage fluid, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145926) with ref = HPCA+BlueprintEncode, main.labels then fine.labels, is going on 3 hours now. Running on an AWS EC2 instance, but I’m not sure if I set up utilization of all cores. - Attachment (ncbi.nlm.nih.gov): GEO Accession viewer > NCBI’s Gene Expression Omnibus (GEO) is a public archive and resource for gene expression data.

Aaron Lun (17:18:56): > What BPPARAM did you use?

Aaron Lun (17:20:40): > 3 hours. That’s 2.9 hours too long for my levels of patience.

Dan Bunis (17:31:03): > …whatever the default would be

Aaron Lun (17:31:16): > Oh. Then definitely it’s not using more than 1 core.

Aaron Lun (17:32:03): > The default is always serial, as the auto-core detection does silly things on some HPCs. Like thinking that it has access to all 50 threads on a node.

Dan Bunis (17:33:50): > makes sense. I’d never actually used the multi-core functionality before so I was using an old pipeline for this new “quick” side-project.

Dan Bunis (17:34:18): > I’ll re-run after and compare. But I’m also not certain if the knit is actually still going versus crapped out so :man-shrugging:

Aaron Lun (17:34:40): > let me know when you get to that, there’s some gotchas on the choice of parallelization backend.

Dan Bunis (19:21:38): > welp, turns out the knitting process had indeed bugged out and I don’t know when. I have no idea if the serial method would have run quickly or not. Either way, my instance has 4 cores and 32gb RAM. So… SingleR(..., BPPARAM = MulticoreParam(3))?

Aaron Lun (21:19:08): > Sorry, fugued off for a bit while getting yelled at by a colleague.

Aaron Lun (21:19:14): > Right. Where was I?

Aaron Lun (21:21:23): > Ah, yes. You could probably go to MulticoreParam(4). Major caveat is that forking can become very memory inefficient if the workers are doing memory-intensive rather than CPU-intensive operations. The reason is that the fork makes a shallow copy of the memory of the parent, but if the garbage collector is triggered in the child, it proceeds to make an actual copy of all of the parent process’ objects. So memory usage of your entire process literally quadruples rather than just increasing according to the extra amount that the child used.

Aaron Lun (21:21:45): > SnowParam(4) has more initial overhead and is less efficient, but avoids this major jump in memory usage.
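
A minimal sketch of the backend choice discussed above, offered as an illustration rather than anything from SingleR itself. `pick_backend()` is a hypothetical helper; `SerialParam`, `MulticoreParam`, `SnowParam` and `bplapply` are from BiocParallel. It stays serial unless more workers are requested, and only forks when explicitly asked to.

```r
library(BiocParallel)

# Hypothetical helper: choose a BiocParallel backend explicitly instead of
# relying on auto-detection of cores.
pick_backend <- function(workers = 1L, fork = FALSE) {
    if (workers <= 1L) {
        SerialParam()             # one core, matching the serial default
    } else if (fork && .Platform$OS.type != "windows") {
        MulticoreParam(workers)   # forked workers: cheap start-up, memory may balloon
    } else {
        SnowParam(workers)        # separate R processes: more overhead, no fork copies
    }
}

# Small demonstration on the serial backend.
param <- pick_backend(workers = 1L)
squares <- bplapply(1:4, function(i) i^2, BPPARAM = param)

# The same kind of object is what gets passed to SingleR, e.g.
# SingleR(test, ref = ref, labels = ref$label.main, BPPARAM = pick_backend(4L))
```

Whether `SnowParam` or `MulticoreParam` is the better choice depends on how much memory each worker touches, so it is worth trying a small subset of cells before launching a long run.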

Aaron Lun (21:22:29): > Having said that, I don’t know if there’s some weird stuff on AWS that prevents proper socket communication. I’ve had some issues like that in the past when I tried to do parallelized compute within an EC2 instance. Never bothered to look into it, though.

Dan Bunis (22:34:46): > Neither seems to work. SnowParam(4) causes a crash, so I don’t get to see an error. MulticoreParam(4) gives: > > Error in result[[njob]] <- value : attempt to select less than one element in OneIndex > > 8. bploop.lapply(cls, X, lapply, ARGFUN, BPPARAM) > 7. bploop(cls, X, lapply, ARGFUN, BPPARAM) > 6. bplapply(all.indices, FUN = .find_nearest_quantile, ranked = ranked, quantile = quantile, BPPARAM = BPPARAM) > 5. bplapply(all.indices, FUN = .find_nearest_quantile, ranked = ranked, quantile = quantile, BPPARAM = BPPARAM) > 4. FUN(X[[i]], ...) > 3. lapply(trained, FUN = .classify_internals, test = test, quantile = quantile, fine.tune = fine.tune, tune.thresh = tune.thresh, sd.thresh = sd.thresh, prune = prune, BPPARAM = BPPARAM) > 2. classifySingleR(test, trained, quantile = quantile, fine.tune = fine.tune, tune.thresh = tune.thresh, prune = prune, check.missing = FALSE, BPPARAM = BPPARAM) > 1. SingleR(GetAssayData(balfsc), ref = list(blueprintEncode = BE, HPCA = HPCA), labels = list(BE$label.main, HPCA$label.main), BPPARAM = MulticoreParam(4)) > > But :man-shrugging:, diagnosing how to make this work isn’t all that important as I’m in no real rush for this data… I’ll just use the serial method.

Aaron Lun (22:35:10): > That’s another way of saying that it ran out of memory.

Aaron Lun (22:35:30): > Not particularly surprised, AWS never played nice with BiocParallel for some reason.

2020-05-20

Aaron Lun (00:07:36): > Can you try repro with 2 cores?

Dan Bunis (01:58:35): > > MulticoreParam(2)) ==> > Error in mcfork(detached) : > unable to fork, possible reason: Cannot allocate memory >

Aaron Lun (01:59:35): > Nasty. Maybe make a MWE and see if anyone on #general has more general advice.

Dan Bunis (02:08:07): > Maybe. But It’s realllly not high priority, so I don’t feel like spamming everyone rn

2020-05-21

Aaron Lun (00:19:59): > quick @Dan Bunis what’s your heatmap function?

Dan Bunis (00:20:48): > ditto-tab to all lol. dittoHeatmap

Aaron Lun (00:24:50): > Excellent.

Aaron Lun (00:26:46): > Right, time to watch some anime while the book rebuilds.

Dan Bunis (00:27:51): > … Is there a chapter I should check out?

Aaron Lun (01:42:57): > The DICE reference is deceptively big. I thought it only had a handful of labels but there are actually many samples per label, so it ended up stalling the entire book. Easily solved by slapping an aggr.ref=TRUE on top.

Aaron Lun (02:47:24): > Latest book is up.

Jared Andrews (03:05:07): > is it rendered anywhere

Jared Andrews (03:05:28): > Or hosted somewhere in readable form, I guess.

Aaron Lun (04:15:06): > https://ltla.github.io/SingleRBook/ - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Aaron Lun (04:16:55): > Some work to do all over the place, but most of the bits have been transferred.

Jared Andrews (05:14:17): > Does ontoProc have a way to convert the ontology IDs to their actual labels?

Aaron Lun (12:20:04) (in thread): > Not directly; I do so using ontologyIndex and the OBO file from Cell Ontology.

Aaron Lun (12:27:09) (in thread): > Highest priority is to get CellTypeReferences into BioC - this is almost ready to go, people just need some commentary about each reference (e.g., when and why you should use it) in the vignettes.

Dan Bunis (13:50:51) (in thread): > Hmmm, I hadn’t thought about doing that, but I think you are correct! By “advertise” you mean by suggesting dittoSeq-plotters as answers to people’s visualization questions, correct?

Dan Bunis (13:59:27) (in thread): > It seems @Jared Andrews has already done some of this =) https://www.biostars.org/p/416701/#416775

Jared Andrews (14:00:43) (in thread): > Plug it every chance I get, it is just the best, most flexible solution for common viz tasks.

Jared Andrews (14:08:22) (in thread): > Almost guarantee it’ll be a top 500 package on Bioc by the next release. It’s too convenient to go unused.

Jared Andrews (14:15:44) (in thread): > Will pop once it starts getting play in other peoples’ vignettes and such too.

Dan Bunis (14:30:30) (in thread): > Was about to share it myself lol. Obviously? this was me just now. Thanks for the upvote.

Dan Bunis (14:54:43) (in thread): > https://www.biostars.org/p/439368/#439527

Dvir Aran (19:24:18): > Is there a channel here for ads for recruiting postdocs and grads?

Dvir Aran (19:25:18): > Also, if anyone is interested - https://aran-lab.com - Attachment (aran-lab.com): Aran Lab @ Technion > Assistant Professor

Dvir Aran (19:26:38): > Send to your friends. Great place, the Israel version of MIT. Terrible PI, but what can I do…

Dan Bunis (19:30:27): > Looks like there is a #jobs channel, but as a non-follower, I’m not sure how used it is.

Aaron Lun (19:37:06): > That’s because it needs more gifs.

Dvir Aran (19:45:56): - File (GIF): File from iOS

Aaron Lun (21:54:18): > When is everyone’s happy week?

Aaron Lun (23:48:21): > Reading this book. It’s actually pretty nice.

Jared Andrews (23:50:11): > Mine is the week of June 7th or whatever. Not gonna devote all my time to this stuff because I need to become a real human being again, but will find references and add descriptions and do whatever is needed for the book when I have time.

2020-05-22

Aaron Lun (00:13:59): > @Vince Carey it’s time to take the ontology stuff to the next level.

Aaron Lun (00:14:03): > Or at least, to some level.

Aaron Lun (00:16:36): > I’m going to write the chapter with what I would like to do, and then we can discuss about what we can actually do.

Vince Carey (07:25:49): > ontoProc got a little attention from a person at UCSC … I plan to update its resources (e.g., serialized Cell Ontology) soon but this should be done through AnnotationHub

Aedin Culhane (12:35:48) (in thread): > Yes. There is a #jobs channel

Dvir Aran (14:08:59) (in thread): > Thanks. I’ll post there.

2020-05-24

Aaron Lun (03:46:58): > Alright @Vince Carey. The main focus of the chapter will be about (i) basic querying of the cell ontology (name, description, children, parents) and (ii) using the ontology to dynamically adjust the resolution of the reference labels. > > For (i), I can imagine having a helper function that downloads the ontology with BiocFileCache and loads it into memory via ontologyIndex. This can then return a DataFrame containing the details about each specific term, and/or it can return an igraph containing the relationships between terms. I guess you could serialize the ontology but it seems more sustainable to just pull ontologies in from an external source. > > For (ii), this is a bit tricky; the graph thing we tried earlier was less than satisfactory. I think we could start by - given a vector of terms - reporting all MRCAs and the terms associated with them. Then people can immediately get an idea of what terms they could “roll up” to. We could then add an option to extrapolate to multiple vectors of terms (e.g., for multiple reference datasets), to see how different datasets compare in terms of resolution.

Vince Carey (05:52:44): > OK @Aaron Lun, we start at https://github.com/obophenotype/cell-ontology. This is pretty active. It has been a while since I worked on this, and my recollection is that the only satisfactory representation was the OWL, which was converted to obo using the python pronto package.

Vince Carey (05:56:36): > Is this still the case? I used wget on https://raw.githubusercontent.com/obophenotype/cell-ontology/master/cl.obo and then > > > library(ontologyIndex) > > n1 = get_OBO("cl.obo", extract_tags="everything") > > n1 > Ontology with 2236 terms > > format-version: 1.2 > data-version: cl/2020-05-21/cl-simple.owl > ontology: cl/cl-simple > > And because this is based on cl-simple.owl it doesn’t agree with what I have used in ontoProc, which was derived from the full owl serialization. From the point of view of pure cell taxonomy maybe we don’t need the larger ontology. I will report back on some comparisons shortly.

Vince Carey (06:00:43): > Oh, there’s now a cl-full.obo, but > > > library(ontologyIndex) > 1/0 packages newly attached/loaded, see sessionInfo() for details. > > r1 = get_OBO("cl-full.obo", extract_tags="everything") > Error in ancs_from_pars(int.pars, int.chld) : > Can't get ancestors for items 174, 175, 177, 179, 180, 181, 182, 184, 185, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 236, 237, 238, 239, 240, 241, 243, 244, 245, 247, 248, 249, 250, 251, 252, 253, 254, >

Vince Carey (06:10:17): > The new cl-simple.obo has 38 cell types not present in current ontoProc serialization:

Vince Carey (06:10:17): > > > simp$name[.Last.value] > CL:0001200 > "lymphocyte of B lineage, CD19-positive" > CL:0001201 > "B cell, CD19-positive" > CL:0001202 > "CD86-positive plasmablast" > CL:0001203 > "CD8-positive, alpha-beta memory T cell, CD45RO-positive" > CL:0001204 > "CD4-positive, alpha-beta memory T cell, CD45RO-positive" > CL:0008027 > "rod bipolar cell (sensu Mus)" > CL:0008028 > "visual system neuron" > CL:0008029 > "inhibitory neuron" > CL:0008030 > "excitatory neuron" > CL:0008031 > "cortical interneuron" > CL:0008032 > "rosehip neuron" > CL:0008033 > "decidual pericyte" > CL:0008034 > "mural cell" > CL:0008035 > "microcirculation associated smooth muscle cell" > CL:0008036 > "extravillous trophoblast" > CL:0011013 > "motile sperm cell" > CL:0011014 > "non-motile sperm cell" > CL:0011015 > "amoeboid sperm cell" > CL:0011016 > "flagellated sperm cell" > CL:0011017 > "vagal neural crest cell" > CL:0011018 > "lymphoid tissue–inducer cell" > CL:0011019 > "mesothelial cell of epicardium" > CL:0011020 > "neural progenitor cell" > CL:0011021 > "fibroblast of upper back skin" > CL:0011022 > "fibroblast of skin of back" > CL:0011023 > "CD25+ mast cell" > CL:0011024 > "double negative T regulatory cell" > CL:0011025 > "exhausted T cell" > CL:0017000 > "pulmonary ionocyte" > CL:0017001 > "splanchnic mesodermal cell" > CL:0019001 > "tracheobronchial serous cell" > CL:0019002 > "tracheobronchial chondrocyte" > CL:0019003 > "tracheobronchial goblet cell" > CL:3000000 > "ciliated epithelial cell of esophagus" > CL:3000001 > "Hofbauer cell" > CL:3000002 > "sympathetic noradrenergic neuron" > CL:3000003 > "sympathetic cholinergic neuron" > CL:3000004 > "peripheral sensory neuron" >

Vince Carey (07:01:17): > ontoProc 1.11.1 in git now has an option newest in getCellOnto which, if set to TRUE, will use BiocFileCache to retrieve cl-simple.obo and import using ontologyIndex::get_OBO with extract_tags="everything". The source of cl-simple.obo has an etag so bfcupdate may work but I did not introduce that.

Vince Carey (07:10:52): > For MRCA you can use code like > > > gg = getCellOnto(newest=TRUE) > Warning message: > In (function (parents, id = names(parents), name = id, obsolete = setNames(nm = id, : > Some parent terms not found: RO:0000301, RO:0000302, RO:0002258 (3 more) > > sapply(gg, function(x) x[["CL:0008029"]])[1:6] > $id > [1] "CL:0008029" > > $name > [1] "inhibitory neuron" > > $parents > [1] "CL:0000151" "CL:0000540" > > $children > character(0) > > $ancestors > [1] "CL:0000000" "CL:0000003" "CL:0000151" "CL:0000211" "CL:0000393" > [6] "CL:0000404" "CL:0000255" "CL:0000548" "CL:0002371" "CL:0002319" > [11] "CL:0000540" "CL:0008029" > > to get ancestors for a specific term.
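
A rough sketch of how the per-term ancestors could be combined into an MRCA lookup. This is an assumption about one possible approach, not code from ontoProc or the book. The four-term ontology is a toy subset built with `ontologyIndex::ontology_index()` so the example runs offline (its `parents` entries are simplified), and `find_mrca()` is a hypothetical helper name. With the real ontology, `cl` would come from `get_OBO()` or `getCellOnto()` instead.

```r
library(ontologyIndex)

# Toy ontology with simplified parent links (illustration only).
parents <- list(
    "CL:0000000" = character(0),
    "CL:0000540" = "CL:0000000",
    "CL:0008029" = "CL:0000540",
    "CL:0008030" = "CL:0000540"
)
term_names <- c(
    "CL:0000000" = "cell",
    "CL:0000540" = "neuron",
    "CL:0008029" = "inhibitory neuron",
    "CL:0008030" = "excitatory neuron"
)
cl <- ontology_index(parents = parents, name = term_names)

# Hypothetical helper: the most recent common ancestor(s) of a set of terms.
find_mrca <- function(ontology, terms) {
    # Ancestors shared by every term.
    common <- Reduce(intersect, ontology$ancestors[terms])
    # Drop any shared ancestor that is itself an ancestor of another shared one.
    is_redundant <- vapply(common, function(a) {
        others <- setdiff(common, a)
        any(vapply(others, function(b) a %in% ontology$ancestors[[b]], logical(1)))
    }, logical(1))
    common[!is_redundant]
}

mrca <- find_mrca(cl, c("CL:0008029", "CL:0008030"))
cl$name[mrca]   # translate the ID back to a readable label
```

The same `cl$name[...]` lookup is the ID-to-label conversion asked about earlier in the channel.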

Aaron Lun (18:24:18): > @Vince Carey check out the latest chapter at https://ltla.github.io/SingleRBook/exploiting-the-cell-ontology.html. Lots of functions there that could find homes in ontoProc. - Attachment (ltla.github.io): Chapter 6 Exploiting the cell ontology | Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Aaron Lun (18:25:16): > Also, maybe it would be worth caching these files in ExperimentHub so that we have a guaranteed version of each set of ontologies to go with the package version. Otherwise if those remote resources update, our scripts will not be reproducible.

Vince Carey (18:51:42): > I think it would be AnnotationHub. I will look into it.

Vince Carey (19:02:20): > @Aaron Lun, i think there is a glitch in book > > library(SingleR) > ref <- MouseRNAseqData(cell.ont="nonna") > translated <- cl$name[ref$label.ont] > head(translated) > ## <NA> <NA> <NA> <NA> <NA> <NA> > ## NA NA NA NA NA NA >

Aaron Lun (19:08:10): > That’s a glitch in SingleR’s mapping, we put a CL_ instead of CL:.
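
Until the mapping is corrected, a workaround along these lines might do. It is a sketch in base R with made-up input values; `fix_cl_ids()` is a hypothetical helper, not part of SingleR.

```r
# Hypothetical helper: rewrite a leading "CL_" prefix to the "CL:" form used
# by the Cell Ontology, leaving missing values untouched.
fix_cl_ids <- function(ids) {
    sub("^CL_", "CL:", ids)
}

broken <- c("CL_0000540", "CL:0008029", NA)   # made-up example input
fixed <- fix_cl_ids(broken)

# With a loaded ontology `cl`, the lookup from the book would then become
# cl$name[fix_cl_ids(ref$label.ont)]
```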

Jared Andrews (23:55:35): > Woops, that was probably me.

2020-06-02

Peter Hickey (00:18:59): > T-cell nerds: favourite reference for human T cell subsets? I think I remember seeing some discussion on this channel but I can’t find it

Aaron Lun (00:24:43): > From trying to answer your question, I realized that the CellTypeReferences vignette doesn’t specify what’s human and mouse.

Aaron Lun (00:25:02): > I’m a big fan of DICE because it’s a short acronym.

Jared Andrews (03:04:15): > DICE and Monaco are the best in my experiences.

Aaron Lun (03:05:48): > Actually, DICE isn’t that short because it expands out to Database blah blah blah.

Aaron Lun (03:05:59): > Hm. Guess it’s Monaco for me, then.

Aaron Lun (03:06:10): > I struggle with the Novo one because I can never spell it properly.

Jared Andrews (03:34:37): > Novo is basically the old DMAP database, but they killed that, so had to go for the GEO directly.

Jared Andrews (03:35:09): > Mappings on that one are weird though, I don’t care much for it. Included it because colleagues use it for myeloid stuff.

Helena L. Crowell (06:43:39): > @Helena L. Crowell has joined the channel

Peter Hickey (19:20:29): > Thanks, I appreciate the advice

2020-06-05

Aaron Lun (17:09:31): > @Jared Andrews was this your happy week?

Jared Andrews (17:16:22): > Yes, I am a real doctor as of yesterday. Gimme the list and I’ll start on it Sunday or Monday.

Aaron Lun (17:16:36): > join the club

Aaron Lun (19:52:45): > Anyway, there’s two obvious things to do. The first is to get celldex ready for submission to BioC. The second is to add some examples in Part 3 of the book, which is looking pretty lonely at the moment.

2020-06-08

Jared Andrews (11:47:50): > Okay, so celldex needs what? Better descriptions of when to use which in docstrings? Additional non-immune datasets?

Aaron Lun (11:48:45): > yes, in the vignette, I would say.

Aaron Lun (11:48:57): > Additional non-immune would be a help but only if you already have some obvious candidates.

Aaron Lun (11:49:20): > Maybe the man pages need some tidying up, I haven’t looked at them in a while.

Jared Andrews (11:50:00): > Okay. Will take a look around and see what’s what.

Aaron Lun (11:51:13): > I think you already have push access, though if you’re going to do something big, put it in a PR.

Aaron Lun (11:51:22): > etc. etc. You know the drill.

Stephanie Hicks (12:59:38): > https://twitter.com/tangming2005/status/1270037566910332929 - Attachment (twitter): Attachment > is there an easy way to get the marker genes for each cell type used by SingleR https://bioconductor.org/packages/devel/bioc/vignettes/SingleR/inst/doc/SingleR.html e.g., Monaco dataset.

Aaron Lun (13:00:17): > I don’t have a twitter account.

Aaron Lun (13:00:45): > Because if I did I’d just troll people.

Stephanie Hicks (13:00:47): > I’m going to direct him to the support site

Aaron Lun (13:01:03): > in fact I would consider it my calling to troll people.

Stephanie Hicks (13:02:15): > lol

Dvir Aran (13:49:42): > Just answered him

Stephanie Hicks (13:50:45): > thanks@Dvir Aran!

Dvir Aran (13:50:47) (in thread): > Isn’t that what everyone on twitter does?

Aaron Lun (13:51:02) (in thread): > Indeed!

Dvir Aran (13:51:25) (in thread): > The community would enjoy having you on twitter

Dvir Aran (13:53:23) (in thread): > I want to see you trolling Lior Pachter

Aaron Lun (13:53:43) (in thread): > yes, that would be one of my pastimes

Aaron Lun (13:54:02) (in thread): > I don’t even have anything against the guy, it just seems like it would be fun.

Dvir Aran (13:54:56) (in thread): > Different people give different definitions of what “fun” is

Aaron Lun (13:55:59) (in thread): > For example, I’d like to try out some of these “ad hominem” attacks that none of the codes of conduct let me do.

Dvir Aran (13:57:24) (in thread): > That will make you popular on twitter, less so in the real world

Aaron Lun (14:01:19) (in thread): > yeah, I might even get elected to be the president of the united states.

Aaron Lun (14:01:25) (in thread): > Damn that would be such a pain.

Giuseppe D’Agostino (20:29:10): > @Jared Andrews regarding celldex: I’ve been making sce objects for some human single nucleus brain datasets. If I had to choose one I’d go with the latest Allen Brain Atlas data since it has a clear type ontology and (for smartseq) sources from different cortical areas.

2020-06-09

Jared Andrews (22:43:44): > I will keep that in mind.

2020-06-10

Aaron Lun (00:37:22): > Alright! Time to add some more links from the OSCA book to the SingleR book.

Aaron Lun (00:37:37): > Or actually just a couple of links.

Dvir Aran (16:06:24): > Can you send again the link for the book?

Aaron Lun (16:15:13): > https://ltla.github.io/SingleRBook/ Hopefully we can arrange to get something under the bioconductor.org domain. - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Dvir Aran (17:59:24) (in thread): > This is turning out really great. So many gems here.

Aaron Lun (17:59:35) (in thread): > :+1:

Dvir Aran (18:00:27) (in thread): > I have a few small comments, I’ll try to organize them tonight.

Dvir Aran (18:00:39) (in thread): > Is it ready for sharing with the world?

Aaron Lun (18:01:01) (in thread): > As ready as it’ll ever be.

Dvir Aran (18:01:19) (in thread): > :+1:

Dvir Aran (18:25:29) (in thread): > Figures on chapter 5 are missing

Aaron Lun (18:26:03) (in thread): > yes, i keep on forgetting to fix that.

Aaron Lun (18:44:10) (in thread): > rebuilding now.

Dvir Aran (18:55:55) (in thread): > :+1:

Dvir Aran (18:56:46) (in thread): > I’m so excited to refer every question I get about SingleR to the book. That’s gonna free a great chunk of my time

Aaron Lun (18:57:22) (in thread): > excellent. that’s the plan.

Dvir Aran (18:58:49) (in thread): > Only thing missing is a vignette with Seurat. Too many questions are with misunderstandings about SingleCellExperiment.

Aaron Lun (19:33:51) (in thread): > I. Am. Stumped.

Aaron Lun (19:34:00) (in thread): > I don’t know why those figures aren’t appearing, they’re definitely in the repo.

Aaron Lun (19:34:22) (in thread): > Maybe I need to clear my browser cache, but I doubt that’s the problem.

Dvir Aran (19:35:04) (in thread): > I see them now

Dvir Aran (19:35:58) (in thread): - File (PNG): Image from iOS

Aaron Lun (19:36:36) (in thread): > Hmmm. Well, okay. Maybe it’s just me then.

Aaron Lun (19:36:55) (in thread): > Probably a browser thing. Doesn’t really make sense to me, but… WHATEVER.

2020-06-11

Aaron Lun (19:22:44): > I realized we have a lot of issues. @Dan Bunis can you see how many of them we can shut down?

Aaron Lun (19:25:37): > And also, have you finished your PhD?

Dan Bunis (19:30:35): > I have not :sob:. Should happen in August.

Dan Bunis (19:30:58): > I can look through them tomorrow.

Dan Bunis (19:32:06): > Pretty sure the newest one is from a collaborator of Jared’s, so we should def be able to figure out what it was about.

Aaron Lun (19:33:08): > small world.

Jared Andrews (20:14:28): > Huh, didn’t see that one, it is indeed a close collaborator. I’ll look into it. Sorry I haven’t gotten to stuff yet, this week hasn’t turned out the way I expected and haven’t had much time. Will at least add to the vignette tonight.

Aaron Lun (20:15:06): > WILD PARTYING?

Aaron Lun (20:15:27): > yeah, those post-defense hangovers seem to last forever.

Jared Andrews (20:16:08): > Lmao, try a type 1 diabetes diagnosis. Admittedly much less fun.

Aaron Lun (20:16:25): > hm, that does sound less fun.

Aaron Lun (20:18:11): > Damn, reading the CDC guidance on T1D now. Didn’t realize that it could develop at any age, I thought it was a kid thing.

Jared Andrews (20:19:37): > Me too. AIN’T GREAT. > > I’ll be back to work next week anyway, so should be more productive then.

Aaron Lun (20:25:42): > I do recall during our company ~~indoctrination~~ on-boarding that we were the first to make synthetic human insulin back in the 80’s.

Aaron Lun (20:25:55): > So I guess we are doing something worthwhile after all.

Jared Andrews (20:48:08): > So long as ya aren’t upcharging it by 3000%.

Aaron Lun (20:48:53): > but my penthouse apartment!

2020-06-12

Jared Andrews (01:09:10): > So the table in the celldex vignette stating organism and focus along with the lists of cell types in each isn’t sufficient for determining which might be useful?

Jared Andrews (01:10:24): > I am unsure how much value blurbs about each are going to add when users can just look at the cells in each. Maybe if we collect many more references there might be more of a need there.

Aaron Lun (01:14:19): > Do we have organism for all of them? I was looking through them some time ago and I thought we were missing a few.

Aaron Lun (01:15:18): > I was mostly just thinking that you and others could just add your experiences. Just something to help guide the user. Currently they just have to pick one randomly.

Aaron Lun (01:15:47): > Something like: JA’s comments: I really like this reference for blah blah blah.

Aaron Lun (01:15:51): > You know, like in food blogs.

Aaron Lun (01:16:18): > Not that I read food blogs, otherwise I wouldn’t be eating the same dinner for 6 years.

Jared Andrews (01:21:24): > What is your nightly dinner?

Jared Andrews (01:21:35): > And okay, that’s fair. I will add some comments.

Jared Andrews (01:23:09): > Organism is in the summary table, but I’ll add a few comments and more explicitly state it.

Jared Andrews (01:28:36): > Will PR it tomorrow.

Aaron Lun (01:29:09): > :+1:

Jared Andrews (01:38:54): > Oh duh, I was still looking at the SingleR vignette. jfc my brain is putty at the moment.

Jared Andrews (01:39:06): > Well, regardless. Will add.

Federico Marini (08:42:22): > Is there a way to get my hands on the pdf version of the singleR book?

Federico Marini (08:42:53): > I’m a tree-killer, sorry :disappointed: still prefer to read “larger texts” on paper, and pdf would be one doc only

Aaron Lun (11:20:37): > ¯\_(ツ)_/¯

Dvir Aran (12:38:00) (in thread): > https://www.eddjberry.com/post/writing-your-thesis-with-bookdown/ - Attachment (Ed Berry): Writing your thesis with bookdown > This post details some tips and tricks for writing a thesis/dissertation using the bookdown R package by Yihui Xie. The idea of this post is to supplement the fantastic book that Xie has written about bookdown, which can be found here. I will assume that readers know a bit about R Markdown; a decent knowledge of R Markdown is going to be essential to using bookdown. The first thing to highlight is that I’m not a pandoc or LaTeX expert.

Aaron Lun (21:44:18): > Nice. HDF5Array with 30k cells run against an aggregated Tabula Muris in 15 minutes on 10 cores. Quite happy with that, all things considered.

2020-06-13

Jared Andrews (02:31:36): > Can’t remember, do I PR to master or RELEASE?

Aaron Lun (02:32:35): > does celldex even have a RELEASE?

Jared Andrews (02:32:42): > Oh. Duh, right.

Aaron Lun (04:10:04): > I think it’s time.

Aaron Lun (04:12:04): > https://github.com/Bioconductor/Contributions/issues/1515

Aaron Lun (15:18:16): > Wow, that was quick.

Dan Bunis (15:33:18): > Even faster than my ability to simply check through SingleR’s issues! I will get to that this weekend, but yesterday ended up being a harder day than it should have been… Pulse & :trans:.

2020-06-14

Aaron Lun (21:21:32): > If anyone has any idea for a SingleR or celldex sticker…

Aaron Lun (22:11:46): > book is now up to date with celldex.

Dvir Aran (23:27:16): > Can use this as inspiration - File (JPEG): Image from iOS

Dvir Aran (23:28:23): > Or this - File (JPEG): Image from iOS

Dan Bunis (23:30:29): > Did I miss something? Did we turn SingleR into a dating app?

Aaron Lun (23:47:40): > I wonder how many bioinformaticians use these apps to get the reference.

Aaron Lun (23:47:43): > But it would be funny.

Dvir Aran (23:58:50): > Tinder was the original inspiration for singler, as a matching tool for single cells

2020-06-15

Dvir Aran (00:00:32): > Singler morphed quite a bit since its conception, but people still laugh at talks when I mention that

Dvir Aran (00:01:33): > Just to note - I got married long before tinder, so don’t know anything about how it works

Dan Bunis (00:02:02): > Now that you mention… I seem to remember you telling me this before.:man-shrugging:. Had totally forgotten. I also love the dating app inspired idea.

Dvir Aran (00:04:23): > Maybe something like this, just cells instead of hearts :hearts: - File (JPEG): Image from iOS

Dvir Aran (00:05:13): - File (JPEG): Image from iOS

Dvir Aran (00:06:27): > Singler - matching single cells

Dan Bunis (00:06:39): > Separate note: Issues are down to just 4 now. All have recent activity aside from #68. Seems like this one serves as a reminder to us of future plans, so I’ll leave it to@Aaron Lun’s discretion whether to close that one or not.

Dan Bunis (00:08:46): > Hmmm I wonder if the intention would become lost if the icons are changed from hearts. But if we can fit the “matching single cells” in too, it might still work

Dvir Aran (01:03:18): - File (JPEG): Image from iOS

Dvir Aran (01:04:36): > I don’t have any illustration tools on my work computer, so playing on powerpoint

Aaron Lun (02:39:28): > I think we can do something with a play on the tinder icon, just swapping the fire for, e.g., a macrophage with two arms.

Giuseppe D’Agostino (03:00:25): > coffee break brand design - File (PNG): singler_logo.png

Aaron Lun (03:05:55): > oh, that is nice.

Federico Marini (03:21:52): > Agree :slightly_smiling_face: oh wait

Federico Marini (03:22:20): > you’re the exploded-brain-Giuseppe?

Federico Marini (03:22:27): > This Giuseppe? https://twitter.com/gdagstn/status/1271844248535445504 - Attachment (twitter): Attachment > @AllenInstitute’s brain voxel data + Duncan Murdoch’s rgl + @zarquon42b’s Rvcg + @tylermorganwall’s rayrender + some tinkering, tweaking and learning = a not so useful but admittedly cool brain rendering, all in #RStats

Giuseppe D’Agostino (03:23:53): > yes that’s me

Giuseppe D’Agostino (03:26:50): > as you can see most of my time goes into doing useful stuff for the scientific community

Federico Marini (03:28:45): > Ain’t that a beauty, huh

Dvir Aran (03:29:14): > Cool

2020-06-17

Dan Bunis (02:02:25): > I really like the tinder-phage (and the brain)!!

Andrew Jaffe (16:09:51): > @Andrew Jaffe has joined the channel

2020-06-19

Aaron Lun (22:31:38): > https://bioconductor.org/packages/devel/data/experiment/html/celldex.html - Attachment (Bioconductor): celldex (development version) > Provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.

2020-06-26

Aaron Lun (15:34:47): > BTW there remains an open call for more chapters in the “workflow” section of the book. These should probably be called “case studies”.

Aaron Lun (15:35:08): > I’ve only got one half-finished one in there: https://ltla.github.io/SingleRBook/cross-annotating-pancreas.html - Attachment (ltla.github.io): Chapter 8 Cross-annotating pancreas | Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Aaron Lun (15:41:14): > If you have a dataset that you ran through SingleR and did some interpretation of the results, then we can stick it in here.

2020-06-30

Aaron Lun (03:19:32): > Just looking at the book now, I realize that this is one of the very rare times I’ve written something where I didn’t cite myself.

Aaron Lun (03:19:42): > The other being my first paper, by definition.

Jared Andrews (10:21:56): > Time to write up some thoughts on cell type predictions and toss it on bioRxiv then.

2020-07-01

Aaron Lun (14:10:40): > One more case study added. Come on, give me something guys.

Dan Bunis (14:38:06): > I will add my HSPCs case-study once I can more freely share the data. (:crossed_fingers: that it’ll be in just the next month or so!)

Jared Andrews (15:11:45): > Sorry, I am trying to find a house in a short time frame, and it’s proving a challenge. I start a new job August 3rd and am pretty busy trying to wrap things up.

2020-07-02

Peter Hickey (00:19:20): > what do the column names of DICE mean (and why are they duplicated)? > > > table(colnames(SingleR::DatabaseImmuneCellExpressionData())) > snapshotDate(): 2020-04-27 > see ?SingleR and browseVignettes('SingleR') for documentation > loading from cache > see ?SingleR and browseVignettes('SingleR') for documentation > loading from cache > > TPM_1 TPM_10 TPM_100 TPM_101 TPM_102 TPM_103 TPM_104 TPM_105 TPM_106 TPM_11 TPM_12 TPM_13 > 15 15 15 15 15 13 12 4 2 15 15 15 > TPM_14 TPM_15 TPM_16 TPM_17 TPM_18 TPM_19 TPM_2 TPM_20 TPM_21 TPM_22 TPM_23 TPM_24 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_25 TPM_26 TPM_27 TPM_28 TPM_29 TPM_3 TPM_30 TPM_31 TPM_32 TPM_33 TPM_34 TPM_35 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_36 TPM_37 TPM_38 TPM_39 TPM_4 TPM_40 TPM_41 TPM_42 TPM_43 TPM_44 TPM_45 TPM_46 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_47 TPM_48 TPM_49 TPM_5 TPM_50 TPM_51 TPM_52 TPM_53 TPM_54 TPM_55 TPM_56 TPM_57 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_58 TPM_59 TPM_6 TPM_60 TPM_61 TPM_62 TPM_63 TPM_64 TPM_65 TPM_66 TPM_67 TPM_68 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_69 TPM_7 TPM_70 TPM_71 TPM_72 TPM_73 TPM_74 TPM_75 TPM_76 TPM_77 TPM_78 TPM_79 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_8 TPM_80 TPM_81 TPM_82 TPM_83 TPM_84 TPM_85 TPM_86 TPM_87 TPM_88 TPM_89 TPM_9 > 15 15 15 15 15 15 15 15 15 15 15 15 > TPM_90 TPM_91 TPM_92 TPM_93 TPM_94 TPM_95 TPM_96 TPM_97 TPM_98 TPM_99 > 15 15 15 15 15 15 15 15 15 15 >

Aaron Lun (00:36:45): > If I had to guess, it’s probably because @Jared Andrews just cbind’d a whole bunch of files together, where there was one file per cell type. Probably each file just numbered its columns from 1 to 100-ish.

Peter Hickey (00:39:10): > i wondered if it might be donors? each colname is found 0 or 1 time for each fine label > > > table(table(colnames(dice), dice$label.fine)) > > 0 1 > 29 1561 >
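
A minimal base R sketch of the check above, on made-up data rather than DICE itself: three hypothetical per-cell-type "files" each number their own columns `TPM_1`, `TPM_2`, ..., and are then `cbind`'d. The object names (`make_file`, `files`, `label_fine`) are illustrative assumptions, not anything from SingleR or celldex.

```r
# Toy stand-in for one-file-per-cell-type data (hypothetical, not DICE).
set.seed(1)
make_file <- function(n_samples, n_genes = 5) {
    m <- matrix(runif(n_genes * n_samples), nrow = n_genes)
    colnames(m) <- paste0("TPM_", seq_len(n_samples))
    m
}
files <- list(B_naive = make_file(4), T_CD4 = make_file(4), NK = make_file(2))

# cbind() keeps each matrix's own column names, so names repeat across files.
combined <- do.call(cbind, unname(files))
label_fine <- rep(names(files), vapply(files, ncol, integer(1)))

# How often does each column name occur overall?
name_counts <- table(colnames(combined))

# Same diagnostic as in the message above: every (column name, fine label)
# pair occurs either 0 or 1 times.
pair_counts <- table(table(colnames(combined), label_fine))
print(name_counts)
print(pair_counts)
```

A 0/1 pattern like this is what plain per-file numbering would produce, so on its own it cannot tell whether the indices also correspond to donors.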

Peter Hickey (00:40:47): > also end up with some funky patterns if i make a heatmap ordered by main then fine label of a subset of DICE (i.e. the heatmap is the dice data not a test dataset) - File (PNG): image.png

Aaron Lun (00:46:54): > are you referring to the gradients within each cell?

Peter Hickey (00:49:10): > yep

Aaron Lun (02:36:51): > ¯\_(ツ)_/¯

Jared Andrews (08:06:09): > Yeah, I just cbind’d everything. Don’t remember if each was from a different donor.

Aaron Lun (13:00:58): > happy to incorporate any findings from @Peter Hickey into the docs.

2020-07-03

Davide Risso (10:37:07): > Sorry if this has been discussed already, there are too many messages in here for me to follow closely. First, congrats on the celldex package, very useful! I’ve noticed that it’s focused on immune cell types and I was wondering if this is its scope or if you would be open to host e.g. brain references.

Davide Risso (10:38:13): > Specifically, I have often mouse brain single-cell data that I would like to map to references, but usually I need much finer resolution than what I find in general purpose references

Davide Risso (10:38:54): > I’m thinking specifically about the large collection of Allen Brain datasets that might be useful for this, in particular this huge single-cell dataset: https://portal.brain-map.org/atlases-and-data/rnaseq/mouse-whole-cortex-and-hippocampus-10x

Davide Risso (10:40:21): > My two Q’s: > 1. if I wanted to host a version of these data in ExperimentHub, would celldex be the right package or is it more of a scRNAseq candidate or neither? (It’s also unclear if we can distribute/reuse the data but that’s a different story…) > 2. Is this too big to use as a reference in SingleR? I had a look at the pseudo-bulk aggregation in the book but I haven’t tried it yet and was wondering if it’s going to take forever with a million cells

Aaron Lun (14:55:29): > 1. I think all of the Allen brain datasets deserve their own package. scRNAseq is a bit of a swamp of random bits and pieces, so if you have a common theme, it’s best to make another package there.

Aaron Lun (14:56:51): > 2. The idea would be to use pseudo-bulk aggregation to create a reference that would then be hosted in celldex. Then the pseudo-bulk would only be done once and everyone else could run annotation quickly.
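
As a rough illustration of the pseudo-bulk idea, here is a base R sketch that collapses a toy count matrix to one summed profile per label. It is an assumption-laden simplification: the data, labels and object names are invented, and the aggregation described in the SingleR book is, as far as I understand it, more refined than a single sum per label.

```r
# Toy single-cell counts: 6 genes x 12 cells (hypothetical data).
set.seed(42)
n_genes <- 6
n_cells <- 12
counts <- matrix(
    rpois(n_genes * n_cells, lambda = 5), nrow = n_genes,
    dimnames = list(paste0("gene", seq_len(n_genes)),
                    paste0("cell", seq_len(n_cells)))
)
labels <- rep(c("astrocyte", "neuron", "microglia"), each = 4)

# rowsum() sums rows by group, so transpose to group over cells,
# then transpose back to genes x labels.
pseudo_bulk <- t(rowsum(t(counts), group = labels))

# Scale each pseudo-bulk profile to counts per million and log-transform,
# giving a small reference matrix with one column per label.
cpm <- t(t(pseudo_bulk) / colSums(pseudo_bulk)) * 1e6
log_ref <- log2(cpm + 1)
print(dim(log_ref))
```

The point of doing this once and hosting the result is that downstream users only ever touch the small genes-by-labels matrix, not the million-cell dataset.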

Jared Andrews (21:17:33): > Additional potential mouse brain reference: https://www.sciencedirect.com/science/article/pii/S0960982218309928 Cerebellum-centric though.

Aaron Lun (21:19:18): > Hey, it’s st jude. I get ads from them.

2020-07-04

Jared Andrews (02:13:35): > Yes, I am going to be working with one of the authors there.

Aaron Lun (03:05:38): > I wonder why youtube targets me with st judes advertising, I don’t have any children.

Jared Andrews (03:38:11): > Cuz they want ya money since they run almost exclusively off donations. At least on the treatment side of things.

Aaron Lun (05:08:39): > Google should know me better than to think I would ever donate money for anything.

2020-07-07

Dan Bunis (12:37:44): > I went perhaps too literal here, but what do yall think? (pokedex reference: https://img.rankedboost.com/wp-content/uploads/2016/07/Pokemon-Go-Pok%C3%A9dex-500x382.png) - File (PNG): celldex.png

Aaron Lun (12:38:01): > ho ho

Aaron Lun (12:38:23): > even better if you could make a pokeball-looking cell

Aaron Lun (12:38:37): > dunno how that would fit in tho.

Aaron Lun (12:39:12): > What font does the pokedex use?

Aaron Lun (12:39:34): > you could just have the screen spanning the width and the “celldex” in the pokedex’s font.

Dan Bunis (12:40:08): > this font and the pokedex font are both free-ware:smiley:

Dan Bunis (12:42:16): - File (PNG): celldex.png

Aaron Lun (12:42:40): > I’m just trying to remember what the font is from the anime

Dan Bunis (12:43:51): > We could also shrink down the celldex name and have a cell or two below it in the screen.

Aaron Lun (12:44:05): > that’s okay, the title does need to be big.

Aaron Lun (12:44:19): > If people don’t get it, it’s their loss.

Dan Bunis (12:44:20): > This is the pokedex font I remember from the gameboy… - File (PNG): image.png

Aaron Lun (12:44:24): > OH YES

Aaron Lun (12:44:38): > That brings back memories

Aaron Lun (12:44:57): > though I mostly played the japanese version.

Aaron Lun (12:45:19): > And no, I can’t read japanese. It was a gift from relatives overseas.

Dan Bunis (12:47:02): > lol I think I remember you mentioning that. And you used a guide cuz otherwise, how would you know the moves your pokes were learning

Dan Bunis (12:47:25): > It’s not really memories for me… I still play sword and shield.

Aaron Lun (12:48:16): > Yes, I must have mentioned that to you at some point.

Aaron Lun (12:48:39): > anyway, get some of this juicy font in there and make a PR into BiocStickers.

Dan Bunis (13:18:10): - File (PNG): celldex_load.png

Jared Andrews (13:23:11): > that’s pretty awesome

Aaron Lun (13:34:33): > even comes with the ExperimentHub loading bar experience.

Aaron Lun (13:34:58): > How big can you make the screen? Keep in mind that the sticker ends up being pretty small.

Aaron Lun (13:37:37): > hm. I guess if you make it any bigger, it doesn’t really look the same anymore.

Jared Andrews (13:38:38): > Could probably blow it up a bit and it’d still be okay

Aaron Lun (13:39:28): > While I’m complaining, the edges of the screen actually have a few bits and pieces.

Aaron Lun (13:40:06): > But yes, this is going to be so awesome.

Dan Bunis (13:44:28) (in thread): > whatcha mean?

Aaron Lun (13:45:06) (in thread): > can you see the little buttons/thingies on the margins of the screen? - File (PNG): image.png

Aaron Lun (13:45:17) (in thread): > on the grey parts

Dan Bunis (13:45:25) (in thread): > LOL

Dan Bunis (13:45:32) (in thread): > I can add those.

Dan Bunis (13:57:48): > larger screen & extra accents added: - File (PNG): celldex_load.png

Aaron Lun (13:57:58): > oh yeah

Aaron Lun (13:59:00): > I think it looks pretty good. The only remaining thing is the weird overlap between letters and how you can see the border of one letter through the overlapping letter.

Dan Bunis (13:59:07): > celldex font-size is 28 now, so bigger than most other packages. It may stand out:laughing:

Aaron Lun (13:59:26): > oh, this is a font that you’re using?

Dan Bunis (13:59:29): > Yea… I haven’t messed with the font much.

Dan Bunis (13:59:36): > It is.

Dan Bunis (14:00:41): > I could probably fix that manually in the final version.

Dan Bunis (14:02:03): > But I guess that’s now?

Aaron Lun (14:02:23): > whenever you want, it looks pretty finished to me.

Jared Andrews (14:02:47): > Agreed

Aaron Lun (14:02:47): > (If you want to take it to the next level you could even make the fill yellow and the border blue and really mimic the pokemon font.)

Dan Bunis (14:55:49): > PR’d this - File (PNG): celldes_sticker.png

Aaron Lun (14:55:56): > OMG

Dan Bunis (14:58:00): > Thanks to @Laurent Gatto for the quick approval, it’s in :smiley:

Aaron Lun (14:58:21): > outstanding

Dvir Aran (15:24:57): > Nice

Dvir Aran (15:25:32): > Now do one for SingleR…

Aaron Lun (15:26:35): > we could probably start from @Giuseppe D’Agostino’s one.

Dan Bunis (15:59:50): > If you can share a pdf or svg version, Giuseppe, I can try and make a hex version sometime soon.

Federico Marini (16:04:46): > If you do it before July 17, you can get 10 for 1 Euro by StickerMule. There’s a special offer we negotiated for eRum2020

Federico Marini (16:04:54): > I can send you the code for that:wink:

Kevin Rue-Albrecht (16:12:14): > are we getting more #isee ones for that!?!??! :smiley:

Federico Marini (16:18:31): > not that I need them

Federico Marini (16:18:53): > wish I was done with the iSEEu design, but rn it is only in powerpoint draft format:stuck_out_tongue:

Federico Marini (16:19:37): > Still: I was only longing for a pdf of SingleR & CellDex when they will be ready!

Dan Bunis (18:49:53): > There’s an svg version on BiocStickers for celldex. Come to think of it, that may be useless without the fonts, so perhaps I should add those too in my next PR.

Dan Bunis (18:51:04): > Can I get the code @Federico Marini? I’d love to print some dittoSeq stickers!

Giuseppe D’Agostino (20:17:24): > @Dan Bunis let me wake up properly and I’ll send you the pdf

Giuseppe D’Agostino (20:17:37): > Awesome celldex logo by the way

Dan Bunis (20:18:23): > Thanks!

Dan Bunis (20:18:54): > No rush tho, I’m not gonna get to it today.

Giuseppe D’Agostino (20:19:48): > Yeah it’s as fast as “save as” in illustrator so no worries

Giuseppe D’Agostino (20:20:09): > Maybe you need the tinder fonts too

2020-07-08

Giuseppe D’Agostino (00:01:53): > btw is there a systematic way to judge how “good” a specific reference is? let’s say I want to use as a reference a newly published brain dataset that (supposedly) has done a good job at identifying new populations of astrocytes. I’d be inclined to trust this study and its markers if they did a good job at consistently identifying other “benchmark” populations, e.g. cortical layer-specific neurons - unless that newly published dataset just isolated astrocytes, in which case I guess there’s not much to do. I usually look at relationships between annotations using clustree, since it’s an easily interpretable way to see how consistent labels are across references, but it still requires manual curation and interpretation. also I reckon it’s kind of a chicken and egg question - the published dataset may be perfectly fine while my dataset may have depleted specific populations and the result would be similar.

Aaron Lun (00:06:35): > Define “good”.

Giuseppe D’Agostino (00:10:51): > yeah that’s the issue

Giuseppe D’Agostino (00:11:19): > by “good” I mean how trustworthy it is in allowing me to say “gee, that’s really an astrocyte I got in my dataset”

Giuseppe D’Agostino (00:11:44): > especially since I’m fishing for “new” “cell types”

Aaron Lun (00:14:04): > All of these annotation tools only do one thing: tell you how similar your unknown cells are when compared to labelled cells. The labels are taken as truth in this process. Some protection is provided by using multiple sets of references to bridge gaps and to limit the damage from mislabelled cells, but this will not be of help to you if you’re working so close to the edge that no one else has identified these cell types before.
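
To make the "how similar are unknown cells to labelled cells" point concrete, here is a toy base R sketch that scores each unlabelled cell against two reference profiles by Spearman correlation and takes the best-scoring label. Everything here (data, label names, the bare argmax rule) is an illustrative assumption; it is not SingleR's actual algorithm, which does considerably more than this.

```r
# Toy reference: 20 genes x 2 labelled profiles (hypothetical data).
set.seed(7)
genes <- paste0("gene", 1:20)
ref <- matrix(runif(40), nrow = 20,
              dimnames = list(genes, c("B_cell", "T_cell")))
ref[1:10, "B_cell"] <- ref[1:10, "B_cell"] + 10    # genes 1-10 mark B cells
ref[11:20, "T_cell"] <- ref[11:20, "T_cell"] + 10  # genes 11-20 mark T cells

# Two unlabelled cells: noisy copies of each reference profile.
test <- cbind(
    cell1 = ref[, "B_cell"] + rnorm(20, sd = 0.5),
    cell2 = ref[, "T_cell"] + rnorm(20, sd = 0.5)
)

# Rank-based similarity of every cell to every labelled profile
# (rows = cells, columns = reference labels).
scores <- cor(test, ref, method = "spearman")

# The reference labels are taken as truth: each cell simply inherits the
# label of the profile it resembles most.
assigned <- colnames(scores)[max.col(scores, ties.method = "first")]
names(assigned) <- rownames(scores)
print(assigned)
```

Note that a cell of a type absent from `ref` would still receive one of the two labels, which is the limitation being described in the message above.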

Aaron Lun (00:14:44): > And with my cynical hat on, if those cell types only show up in one reference, do they really exist?

Giuseppe D’Agostino (00:14:59): > actually my case study was that these cell types were identified before by that specific reference

Giuseppe D’Agostino (00:15:37): > I was contending that if that reference does a good job at identifying also other cell types that are known or shared by several studies

Giuseppe D’Agostino (00:16:01): > I may be inclined to trust that those “new cell types” exist

Giuseppe D’Agostino (00:16:26): > but eh.

Aaron Lun (00:16:30): > Well, that really just tells you that the authors are good at recognizing existing cell types. Nothing more, nothing less.

Aaron Lun (00:17:01): > I mean, I could probably annotate B and T cells pretty well but you shouldn’t trust my immunology.

Aaron Lun (00:17:44): > I would take it as, “SingleR tells me that my cluster X is similar to Bob’s cluster Y that they called memory blah blah cells.”

Aaron Lun (00:17:53): > Are they memory blah blah cells? Are they something else? Who knows?

Aaron Lun (00:18:29): > But at least you know that your X cells are similar to what Bob thinks are memory blah blah cells. And you can make your own judgements as to Bob’s expertise at identifying memory blah blah cells.

Giuseppe D’Agostino (00:19:11): > right. my judgment would be that if Bob does a questionable job at identifying what Alice Carl and Daniel all think are X cells, I would not trust Bob.

Giuseppe D’Agostino (00:19:34): > on memory cells or anything else really.

Giuseppe D’Agostino (00:20:25): > then of course that does not tell me that memory cells are real

Aaron Lun (00:22:21): > I would be surprised if you found major disagreements that lost trust in that manner.

Aaron Lun (00:22:54): > Most disagreements I see are always debatable. Or at least it is not obvious to me who is right or wrong there.

Aaron Lun (00:23:40): > Add in the usual noise in single-cell technologies and that’s enough fuzz to prevent you from unequivocally saying that someone’s annotations are rubbish.

Aaron Lun (00:24:31): > Unless, like, they’re really wrong. Like, they forgot to order their colData with their assays and you just end up with a scrambled mess. I haven’t seen that in a while.

Giuseppe D’Agostino (00:24:58): > I did see a bad annotation from a paper that did not account for patient effects and got patient-specific cell types

Giuseppe D’Agostino (00:25:32): > plugged it into my data and compared to the others - many cells were labelled as something different from the consensus

Aaron Lun (00:27:04): > well, if you want to do it, SingleR’s multiple reference capability will report both the individual and combined results for each cell.

Aaron Lun (00:28:19): > Though the final comparison will always be a bit manual unless you have standardized annotations.

Giuseppe D’Agostino (00:29:11): > yeah and at least in human brain data some studies find immune cells, some don’t, so t cells become muscle become something else according to the annotation

Giuseppe D’Agostino (00:29:19): > lots of harmonization to be done yet

Aaron Lun (00:30:42): > the matchReferences function will also do pairwise comparisons between references to see which labels match to each other.

Giuseppe D’Agostino (00:31:21): > great!

Aaron Lun (00:40:11): > https://ltla.github.io/SingleRBook/using-multiple-references.html#manual-label-harmonization - Attachment (ltla.github.io): Chapter 5 Using multiple references | Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Giuseppe D’Agostino (00:42:58): > I reckon that with low probabilities one has to dig in further to see what’s up

Giuseppe D’Agostino (00:43:17): > if the labels are supposed to match

Aaron Lun (00:43:30): > yes, that’s the idea.

Aaron Lun (00:50:52): > well actually, the more specific idea is that, if you don’t know which labels are supposed to match, the heatmap guides you through it.

Aaron Lun (00:52:59): > And if there’s a mismatch… hard to say who’s wrong, even with a voting system. You’d need a slam-dunk for a known marker being mislabelled to really make any clear statement.
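
A hedged base R sketch of the kind of pairwise label comparison being discussed: cross-tabulate the labels that two references assign to the same cells and look at row-wise proportions. The label vectors are invented, and this is only a stand-in for the idea; it does not reproduce what matchReferences computes.

```r
# Hypothetical labels assigned to the same 8 cells by two references.
labels_ref1 <- c("T cell", "T cell", "T cell", "B cell", "B cell",
                 "Astro", "Astro", "Astro")
labels_ref2 <- c("T", "T", "muscle", "B", "B",
                 "astrocyte", "astrocyte", "astrocyte")

# Cross-tabulate and convert to row proportions: for each ref1 label,
# the fraction of its cells that received each ref2 label.
tab <- table(ref1 = labels_ref1, ref2 = labels_ref2)
prop <- prop.table(tab, margin = 1)

# Best-matching ref2 label per ref1 label, and how strong that match is.
best_match <- colnames(prop)[apply(prop, 1, which.max)]
names(best_match) <- rownames(prop)
best_prop <- apply(prop, 1, max)

print(round(prop, 2))
print(best_match)
```

A low `best_prop` flags a label pair worth digging into, but as noted above it does not say which reference is the one that is wrong.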

Giuseppe D’Agostino (00:55:12): > yeah I get it. the nice thing about looking at different assignments in clustree is that it becomes evident when something’s systematically off, because the branches connecting that layer become very scrambled. but in that case it’s an issue affecting the whole reference and not just one specific population

Giuseppe D’Agostino (00:55:39): > but then in that specific case I knew that the clustering was “wrong” so there was a probable cause

Aaron Lun (00:59:55): > yep, plenty of ways to quantify disagreement (scran alone has at least 2 ways of doing it, from memory) - much harder to make a call on correctness.
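
One common way to quantify disagreement between two labelings is the adjusted Rand index (ARI). The sketch below is a plain base R implementation written for illustration; `adjusted_rand` is a hypothetical helper name, not one of the scran functions alluded to above.

```r
# Adjusted Rand index between two label vectors of equal length.
# Undefined (division by zero) when both labelings are trivial, e.g. every
# item is in a single cluster in both.
adjusted_rand <- function(x, y) {
    stopifnot(length(x) == length(y))
    tab <- table(x, y)
    comb2 <- function(n) n * (n - 1) / 2   # number of pairs among n items

    sum_cells <- sum(comb2(tab))            # pairs together in both labelings
    sum_rows <- sum(comb2(rowSums(tab)))    # pairs together in x
    sum_cols <- sum(comb2(colSums(tab)))    # pairs together in y
    total <- comb2(length(x))

    expected <- sum_rows * sum_cols / total
    max_index <- (sum_rows + sum_cols) / 2
    (sum_cells - expected) / (max_index - expected)
}

# Identical partitions under different names agree perfectly.
same <- adjusted_rand(c(1, 1, 2, 2), c("a", "a", "b", "b"))

# Partially overlapping partitions give an intermediate value.
partial <- adjusted_rand(c(1, 1, 1, 2, 2, 2), c(1, 1, 2, 2, 3, 3))
print(c(same = same, partial = partial))
```

The index only measures agreement between the two labelings; a low value says they differ, not which one is correct.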

Aaron Lun (01:25:42): > I mean, what I really don’t understand is why clustree is on CRAN and not on BioC.@Luke Zappia

Luke Zappia (01:25:46): > @Luke Zappia has joined the channel

Aaron Lun (01:37:11): > I was going to suggest that clustree could auto-determine the sequence based on a MST from pairwise rand indices rather than relying on the sequence of user-specified resolutions. But then that would require a Suggests into scran, and then you’re well on your way to just being a Bioc package in all but name.

Luke Zappia (03:49:03): > I’m not really sure what has been discussed (if someone wants to give a summary that would be nice :smile_cat:). I put clustree on CRAN because the method can be used with any kind of clustering for any kind of data (not just scRNA-seq/genomics data) and I wanted it to be available to people outside Bioconductor. I think the only Bioconductor dependency it has is on SingleCellExperiment and SummarizedExperiment to provide the SCE interface. It might be possible to make a Bioc package extending clustree depending on what you want to do. I want to rewrite clustree to make a bunch of things easier but it’s not likely to happen any time soon.

Giuseppe D’Agostino (05:03:37): > @Luke Zappia the summary of the discussion is: there is no surefire way to know whether any one reference set is “good enough” to label cell types, i.e. if you can trust the labels assigned in one particular study for your own dataset. I brought up clustree as it’s my method of choice to visualize consistency in labelling among different annotations. that’s all :)

Davide Risso (11:15:40) (in thread): > It seems that what you proposed could be done with the Dune package by @Hector Roux de Bézieux: https://bioconductor.org/packages/devel/bioc/html/Dune.html - Attachment (Bioconductor): Dune (development version) > Given a set of clustering labels, Dune merges pairs of clusters to increase mean ARI between labels, improving replicability.

Hector Roux de Bézieux (11:15:44): > @Hector Roux de Bézieux has joined the channel

Aaron Lun (11:16:23) (in thread): > not quite the same, but related.

Aaron Lun (11:33:20) (in thread): > You would need to intercept it at the ARI matrix.

Aaron Lun (11:37:56) (in thread): > But let us continue this discussion in #hca_clustering.

Federico Marini (15:44:00) (in thread): > Sure - Attachment (Sticker Mule): e-Rum 2020 > Special Offer, 10 stickers 2” x 2” for only €1

Federico Marini (15:44:15) (in thread): > it is just for an appetizer

Federico Marini (15:44:26) (in thread): > i.e. does not scale with higher numbers

2020-07-09

Dan Bunis (11:06:30): > :man-shrugging: - File (PNG): singleR_sticker.png

Jared Andrews (11:52:30): > I was thinking the “What’s that pokemon?” silhouette with a question mark with an arrow pointing to the macrophage.

Aaron Lun (11:53:39): > old times

Aaron Lun (11:53:55): > let me have a look at the tinder logo.

Aaron Lun (11:54:01): > Well, maybe not on my work computer.

Aaron Lun (11:54:05): > Or maybe yes.

Aaron Lun (11:54:52): > So, this is what we’re up against. - File (PNG): image.png

Aaron Lun (11:55:31): > or - File (PNG): image.png

Aaron Lun (11:56:26): > Seems like if we use the latter, we keep the cell next to “SingleR”: how would that look on the sticker? Does it fill enough space?

Dan Bunis (12:13:44): > like this? - File (PNG): singleR_sticker_inline_2.png

Aaron Lun (12:14:06): > yes, that’s looking good

Aaron Lun (12:14:22): > Might need to fiddle with some of the sizes to try to fill up more space, but I’ll leave that to you

Aaron Lun (12:15:11): > guess there’s not a lot of space to work with

Dan Bunis (12:16:21): > a bit bigger here - File (PNG): singleR_sticker_inline.png

Aaron Lun (12:16:37): > YES.

Aaron Lun (12:17:07): > that’s definitely better, for some reason.

Aaron Lun (12:17:43): > (You could also make an alternative version with the inverted colors.)

Aaron Lun (12:18:12): > e.g. white border, pink background and white text and cell.

Aaron Lun (12:18:22): > like the away jersey

Dan Bunis (12:29:23): > that’s not too hard in Illustrator.

Dan Bunis (12:32:26): - File (PNG): singleR_sticker_invert.png

Aaron Lun (12:36:09): > nice

Aaron Lun (12:36:46): > maybe we need some shading on the nucleus, like a light grey or something. Otherwise the identity of the cell is somewhat lost.

Dan Bunis (12:42:04): > This is with the darkest color of its original gradient. - File (PNG): singleR_sticker_invert.png

Aaron Lun (12:42:15): > oh yes

Dan Bunis (12:44:46): > I’ll remove the outer black line before I PR. And I’ll PR both, but probably only one should go on the main BiocSticker page.

Dan Bunis (12:44:47): > I’m leaning towards the silhouette version for that.

Giuseppe D’Agostino (13:23:42): > Looking good

Dan Bunis (13:24:10): > And PR’d :smiley:

Giuseppe D’Agostino (13:27:29): > I did not pay attention to the kerning in the original logo, if we want to be philologically correct

Dan Bunis (13:35:58): > Do you mean the size ratio between the macrophage vs the letters? Cuz it’s fine to me.

Dan Bunis (13:36:40): > But you could PR an edit to that if you want to.

Giuseppe D’Agostino (13:41:09): > The spacing between letters in the singler vs tinder logo

Dvir Aran (14:49:13): > Love it:blush:

Aaron Lun (23:24:36): > behold the sweet new color scheme at https://ltla.github.io/SingleRBook/, courtesy of @Kevin Rue-Albrecht. - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

2020-07-10

Aaron Lun (01:06:55): > when that sticker gets merged, it’s going to be on the front page of the book.

Aaron Lun (01:47:56): > @Kevin Rue-Albrecht we probably need to make the figure caption font grey to distinguish it from the main text, check out OSCA’s style.css.

Kevin Rue-Albrecht (02:52:16): > Sure. I’m expecting minor tweaks as we spot them, keep ’em coming

Kevin Rue-Albrecht (03:39:20): > @Aaron Lun that’s because you wiped out the original style.css content > > p.caption { > color: #777; > margin-top: 10px; > } > p code { > white-space: inherit; > } > pre { > word-break: normal; > word-wrap: normal; > } > pre code { > white-space: inherit; > } >

Kevin Rue-Albrecht (03:39:44): > namely p.caption

Kevin Rue-Albrecht (03:40:32): > that’s the reason why I wrote a separate file biocstyle.css; to avoid touching the base style file (for now)

Kevin Rue-Albrecht (03:41:20): > anyway, hold on, I’ll PR what I mean

Aaron Lun (03:53:27): > yes, I know I did, but you might as well stick it in the same file.

Aaron Lun (03:53:56): > you just want one CSS file defining the “Bioc” style.

Kevin Rue-Albrecht (03:55:48): > fine, but in that case I’d rather use a single file biocstyle.css and remove the original style.css

Kevin Rue-Albrecht (03:56:48): > PR updated

Aaron Lun (03:57:02): > leave that choice of file name to the book authors, just edit the rebook version

Kevin Rue-Albrecht (03:57:33): > argh ok, I was messing with https://github.com/LTLA/SingleRBook-base/pull/2

Kevin Rue-Albrecht (04:02:25): > see https://github.com/LTLA/rebook/pull/2

Federico Marini (15:39:47) (in thread): > Very nice work on the sticker! @Dan Bunis can you please ping me when it is finalized so I can use my stickermule promo to get a small bunch?

Aaron Lun (15:40:00): > it got merged

Federico Marini (15:41:11): > Cool! Both are “official”?

Dan Bunis (15:41:54): > Yup! They’re all “official” and merged into the BiocStickers github.

Federico Marini (15:42:07): > I like the away jersey a tiny tiny bit more

Federico Marini (15:42:30): > also because it would stand out more with the existing rest on the laptop

Dan Bunis (15:43:43): > Same here:smiley:. They’re both up, but that’s the one I put on the main BiocStickers README page.

Federico Marini (15:58:49): > I’ll be curious to see the reactions of people just smirking at the tinder ref

Federico Marini (16:04:08): > Is the pdf of the away one still with black border?

Dan Bunis (16:08:57): > It is =/. But the png is without it.

Dan Bunis (16:09:20): > The pngs are 600 dpi, so should be plenty.

Federico Marini (17:33:41): > oh that’ll do it

Aaron Lun (17:34:13): > I just realized that the CSS for the book should also follow the pink color scheme.

Dvir Aran (20:06:29) (in thread): > Next up - spinoff of pornhub

Dvir Aran (20:06:49) (in thread): > Maybe instead of celldex call it cellhub :)

Dvir Aran (20:07:51) (in thread): > I know how the logo will look

2020-07-11

Aaron Lun (00:58:42): > oh yes

Aaron Lun (00:58:44): > https://ltla.github.io/SingleRBook/ - Attachment (ltla.github.io): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Kevin Rue-Albrecht (03:38:47): > i guess you’re just missing ads on the right now

Aaron Lun (04:40:51): > I can’t figure out how to get the favicon to stick.

Kevin Rue-Albrecht (08:47:17): > you mean like this? https://isee.github.io/iSEE-book/ - Attachment (isee.github.io): Extending iSEE > This book describes how to use the Bioconductor iSEE package to create web-applications for exploring data stored in SummarizedExperiment objects.

Kevin Rue-Albrecht (08:52:24) (in thread): > 1. Go to https://realfavicongenerator.net/ > 2. Upload the image of your choice (e.g. Bioc sticker) > 3. Download the folder of favicons > 4. Edit index.Rmd as here: https://github.com/iSEE/iSEE-book/pull/19/files > > > favicon: "favicon_package_v0.16/favicon.ico" > - Attachment (RealFaviconGenerator.net): Favicon Generator for perfect icons on all browsers > The ultimate favicon generator. Design your icons platform per platform and make them look great everywhere. Including in Google results pages.

Aaron Lun (13:45:06) (in thread): > I was hoping to not have to add more PNGs to the repo, but oh well.

Federico Marini (14:42:18) (in thread): > eheheheh

Federico Marini (14:42:37) (in thread): > xBamster for aligned files?

2020-07-12

Dvir Aran (00:09:41) (in thread): > Idk what this refers to ;)

2020-07-14

Aaron Lun (00:15:02): > gotta say. Really liking these colors.

Kevin Rue-Albrecht (04:37:27): > yeah - the bioc vibe is almost hypnotic

Kevin Rue-Albrecht (04:37:43): > I just realised that there is a menu to switch between color schemes - File (PNG): image.png

Kevin Rue-Albrecht (04:37:54): > basically, my CSS affects the “White” mode

Kevin Rue-Albrecht (04:38:10): > Makes me wonder whether it’s possible to add a “Bioc” mode

Kevin Rue-Albrecht (04:40:05): > anyway - for now it works like this, a question for another day

Aaron Lun (11:10:11): > oh. I changed the SingleR colors to something a bit different.

Russ Bainer (19:34:44): > @Russ Bainer has joined the channel

2020-07-22

Aaron Lun (01:59:31): > now with a favicon.

Aaron Lun (12:18:52): > doth no one want to contribute some case studies?

Jared Andrews (12:31:37): > I will at some point, but wouldn’t hold your breath, as the data’s not going to be public for a while.

Dan Bunis (12:41:30): > I will too! But in ~1month after I graduate. My PI is trying to squeeze an extra paper out of me before then lol.

2020-07-26

Federico Marini (16:20:58): - File (JPEG): Photo on 26.07.20 at 22.20.jpg

Federico Marini (16:21:08): > just came in from stickermule with the special promo

Aaron Lun (17:04:13): > damn

2020-07-28

Dan Bunis (21:06:19): > Mine finally arrived too! - File (JPEG): 20200728_180531.jpg

Aaron Lun (21:06:34): > oh yeah

2020-07-29

Jared Andrews (00:10:00): > Damn that celldex one is good

Aaron Lun (00:11:06): > @Dan Bunis getting a strong pokemon theme going on there

Dan Bunis (00:14:00): > Right!!? celldex one turned out so great!

Dan Bunis (00:16:20): > I needa update the colors in the dittoSeq one though… I need it to pop more.

Aaron Lun (00:17:02): > Why didn’t you have a tSNE within the ditto outline?

Aaron Lun (00:17:53): > dunno if you’ve ever seen some of those multicellular orgs eat dyed yeast, sort of like that

Aaron Lun (00:18:14): > ah, paramecium, that’s it.

Aaron Lun (00:19:13): > https://www.youtube.com/watch?v=l9ymaSzcsdY - Attachment (YouTube): Paramecium eating pigmented yeast

Aaron Lun (00:22:14): > wait, paramecium isn’t multicellular.

Dan Bunis (00:22:15): > I’m not that good of an artist? lol

Dan Bunis (00:23:12): > celldex is a bunch of simple shapes. @Giuseppe D’Agostino made the hard part of SingleR lol

Aaron Lun (00:23:37): > well, it’ll be better than the flower you’ll see on Friday.

Dan Bunis (00:25:13): > Lol not sure for what. Exciting!

Giuseppe D’Agostino (00:25:18): > to be fair it was a 10 minute affair of which 8 minutes were spent looking for the Tinder font

Dan Bunis (00:25:45): > You made the macrophage in 2 minutes then?

Giuseppe D’Agostino (00:26:10): > it’s just a bunch of pulls and pushes of points on a circle

Dan Bunis (00:27:30): > Those are some skills. Noted.

Giuseppe D’Agostino (00:31:25): > I do love the celldex logo tho, it’s such a throwback

Aaron Lun (00:32:51): > you see, the real purpose of the Bioc conference is a sticker meet-and-trade event.

Aaron Lun (00:33:20): > can you collect them all?

Dan Bunis (00:33:30): > Seeing the ditto used in this, makes me want to go back to something like my old literal ditto-tsne. https://jonkeane.com/blog/introducing_dittodb/

Giuseppe D’Agostino (00:33:30): > with laptops as expensive sticker albums

Dan Bunis (00:34:29): > what else are the apple logos good for if not being covered up by awesome stickers?

Aaron Lun (00:35:05): > I count 16 * 8 stickers currently

Dan Bunis (00:35:34): > Speaking of tho, serious question, are we not getting a stickermule promo? I saw they are BioC2020 sponsors…

Aaron Lun (00:37:54): > I thought there was a promo?

Dan Bunis (00:38:56): > Oh? I think the previously shared one ended. I haven’t seen one through BioC.

Aaron Lun (00:53:34): > just looking over all the stickers, the celldex one definitely stands out.

Aaron Lun (00:53:53): > my aesthetic favorite is still the MultiAssayExperiment one.

Aaron Lun (00:54:03): > which reminds me, I have to file an issue there.

Dan Bunis (00:55:25): > That one is great.

Dan Bunis (00:57:24): > Man I really wanna go back to something like this… I’d made some fun emojis out of it. - File (GIF): eyeroll_new.gif

Aaron Lun (00:57:31): > holy crap

Aaron Lun (00:57:36): > that was startling

Dan Bunis (00:57:45): > lol oops sorry

Dan Bunis (00:58:01): > ended up rather large.

Aaron Lun (00:58:29): > there’s probably something super artistic that could be done with the ditto shape

Aaron Lun (00:59:23): > like you have a whole bunch of points in grey but only the points overlapping the ditto are in color

Aaron Lun (00:59:39): > like those colorblindness tests with the dots and the colors and the number written in differently colored dots

Dan Bunis (00:59:47): > yes… I may just remove the outline and see if I can’t just make it look plausibly like a tsne that just happens to be ditto-shaped.

Dan Bunis (01:00:58) (in thread): > this is something I could do in R and then copy over.
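
A small base R sketch of the grey-dots-with-a-colored-shape idea, assuming an ellipse as a stand-in for the ditto outline (a real outline would need a polygon or an image mask). The palette and all object names are illustrative choices.

```r
# Scatter random points over a square canvas.
set.seed(123)
n <- 2000
pts <- data.frame(x = runif(n, -1, 1), y = runif(n, -1, 1))

# Stand-in shape: an ellipse centred at the origin.
inside <- (pts$x / 0.7)^2 + (pts$y / 0.45)^2 <= 1

# Grey everywhere, random "cluster" colors only for points inside the shape.
palette_cols <- c("#E69F00", "#56B4E9", "#009E73")
pts$col <- "grey80"
pts$col[inside] <- sample(palette_cols, sum(inside), replace = TRUE)

plot(pts$x, pts$y, col = pts$col, pch = 16, cex = 0.6, asp = 1,
     axes = FALSE, xlab = "", ylab = "")
```

The resulting plot can be exported as a PDF or SVG and brought into Illustrator from there.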

Aaron Lun (01:02:25): > I must say, when I first saw those colorblindness tests, I couldn’t even comprehend that some people couldn’t see colors. (I was <10.)

Aaron Lun (01:02:41): > And then that got me thinking about the nature of our observed reality.

Aaron Lun (01:02:51): > And then I became a solipsist.

Dan Bunis (01:05:13): > I regularly end up discussing with people that no one actually knows if what a color vision-typical person sees as red actually looks the same to them as what red looks like to another color vision-typical person.

Dan Bunis (01:07:08): > Like understanding that I certainly register it differently… just because yall can tell colors apart doesn’t mean your minds actually project them to your conscious similarly:man-shrugging:

Dan Bunis (01:09:43): > I had to google solipsist. Definitely can see how you got there! > > For anyone else who needs: > solipsist = noun. Philosophy. the theory that only the self exists, or can be proved to exist. extreme preoccupation with and indulgence of one’s feelings, desires, etc.; egoistic self-absorption

Kevin Rue-Albrecht (04:22:29) (in thread): > I did make a simple Shiny app that was converting images into fake tSNE plots: https://github.com/kevinrue/magick-profile

Kevin Rue-Albrecht (05:45:41) (in thread): - File (PNG): image.png - File (PNG): image.png

Kevin Rue-Albrecht (05:46:04) (in thread): > https://kevinrue.shinyapps.io/magick-profile/ + the ditto image above on the right

Kevin Rue-Albrecht (05:46:38) (in thread): - File (PNG): image.png

Kevin Rue-Albrecht (05:51:10) (in thread): > actually.. 7 clusters fits better to edges + smile - File (PNG): image.png

Kevin Rue-Albrecht (05:51:46) (in thread): > more like this - File (PNG): image.png

Jared Andrews (09:56:42): > Sounds like narcissism with extra steps.

Dan Bunis (11:15:25) (in thread): > This applet is pretty cool.

Dan Bunis (11:16:26) (in thread): > So is iSEE… I’ve finally used it now. Even if it was just to grab - File (PNG): image.png

Kevin Rue-Albrecht (11:17:33) (in thread): > yeah :slightly_smiling_face: unfortunately, this applet doesn’t give you the script like iSEE does, so you’ll have to extract the code from the repo if you want to do things in a console: https://github.com/kevinrue/magick-profile

Sonali (14:47:50): > @Sonali has joined the channel

Dvir Aran (20:09:27): > Congrats @Aaron Lun for the BioC community award. This award should have your name on it.

Aaron Lun (20:10:19): > I think I would die of embarrassment

Aaron Lun (20:10:43): > and also, people pronounce my name in a way that I don’t expect

Aaron Lun (20:10:57): > to the point that I’m no longer sure if I’m pronouncing it correctly

Dvir Aran (21:19:10): > How many ways is it possible to pronounce a three-letter name?

Brianna Barry (21:34:17): > @Brianna Barry has joined the channel

Peter Hickey (21:39:17): > I’m now worried I buggered up your surname @Aaron Lun in my recorded workshop (I definitely butchered Hervé Pagès, to my shame)

Aaron Lun (21:40:39): > The thing is that I always thought it was pronounced “lun(g)“. But people pronounce it like “lun(ar)“, which makes me think “… oh, that’s me.”

Aaron Lun (21:40:58): > But the twist is that the second approach is actually the correct Chinese pronunciation, so I can’t even say it’s wrong.

Aaron Lun (21:41:29): > So I was all like WHATEVER and just roll with it.

Aaron Lun (21:42:24): > I think the first pronunciation must have been adopted from a primary school teacher who just cut the “ch” off “lunch”.

Aaron Lun (21:42:39): > I probably even have a certificate that says “Aaron Lunch” somewhere.

Peter Hickey (21:48:57): > now i want to come up with yet another pronunciation

2020-07-30

Federico Marini (03:15:47): > What’s the correct one @Aaron Lun? air-run loon?

Federico Marini (03:17:04): > For you @Kevin Blighe I’d go with “Bly”. But in that corner of earth I’ve seen strange things happen with syllables :smile:

Giuseppe D’Agostino (05:12:21): > I gave up teaching the correct pronunciation of my first name 2 weeks in

Kevin Rue-Albrecht (05:18:25): > Easter egg in iSEE(voice=TRUE), say “Lun” correctly, and it’ll open a panel with anime GIFs to keep you company while working :stuck_out_tongue:

Kevin Rue-Albrecht (05:19:17) (in thread): > … can’t wait to see iSEE downloads peaking after that message …:laughing::stuck_out_tongue:

Federico Marini (08:27:50) (in thread): > guess most would say bly-ghee?

Aaron Lun (10:57:33) (in thread): > WHAT

2020-07-31

bogdan tanasa (14:06:29): > @bogdan tanasa has joined the channel

2020-08-05

shr19818 (13:46:32): > @shr19818 has joined the channel

2020-08-12

Jared Andrews (11:00:09): > Okay, so. The Allen Brain Map datasets. Were I to add them to cellDex, would using the provided aggregate gene expression values per cell type be acceptable or are we pretty set on the whole shebang being accessible?

Aaron Lun (11:07:06): > Yes, we should add some kind of aggregated value.

Aaron Lun (11:07:39): > However, the whole shebang could be made available via a different package, e.g., for AllenBrainMapData.

Aaron Lun (11:08:09): > If we’re going to pull down and clean up the count matrices anyway, it’s actually not that much extra effort.

Aaron Lun (11:08:12): > I can help.

Jared Andrews (11:09:14): > They provide the aggregate values already. So that part is trivial.

Aaron Lun (11:09:25): > oh, that’s even better.

Aaron Lun (11:09:36): > What are these aggregated across?

Aaron Lun (11:09:50): > cell type + ?

Aaron Lun (11:09:57): > batch, replicate, etc.?

Jared Andrews (11:11:01): > Dataset, basically. I haven’t dug into the metadata much. They provide it by either median or trimmed means, e.g.: https://portal.brain-map.org/atlases-and-data/rnaseq/human-multiple-cortical-areas-smart-seq - Attachment (portal.brain-map.org): Human Multiple Cortical Areas SMART-seq - brain-map.org

Aaron Lun (11:11:34): > good enough for me.

Jared Andrews (11:13:41): > Same. Okay, I’ll get those together. We can discuss the whole shebang later.

Aaron Lun (11:14:17): > this is great, if anyone complains about the aggregated values, we can just shrug and point to the Allen guys.
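For readers unfamiliar with what "aggregated per cell type by median or trimmed mean" looks like in practice, here is a minimal base-R sketch. The matrix, the cell type names and the 20% trim fraction are all made up for illustration; this is not the Allen Institute's pipeline, and their actual trimming settings are not stated in this thread.

```r
# Toy expression matrix: 4 genes x 10 cells (values are simulated).
set.seed(1)
genes <- paste0("gene", 1:4)
cells <- paste0("cell", 1:10)
expr <- matrix(rexp(40, rate = 0.5), nrow = 4, dimnames = list(genes, cells))

# Hypothetical cell type assignment: 5 cells per type.
cell_type <- rep(c("Exc", "Inh"), each = 5)

# Summarise every gene within every cell type using the supplied function.
aggregate_by_type <- function(x, groups, fun) {
  sapply(split(seq_len(ncol(x)), groups), function(idx) {
    apply(x[, idx, drop = FALSE], 1, fun)
  })
}

# Genes x cell types matrices: one using the median, one a trimmed mean.
agg_median  <- aggregate_by_type(expr, cell_type, median)
agg_trimmed <- aggregate_by_type(expr, cell_type,
                                 function(v) mean(v, trim = 0.2))
```

Either result is a compact genes-by-cell-types matrix, which is the kind of object that would serve as a pre-aggregated reference.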

Jared Andrews (11:15:53): > Yes, I have a few other datasets I may add too, but they are more messy and niche.

Aaron Lun (11:18:22): > i’m pretty sure you’re already a collab on celldex, so just branch and PR, no need to fork.

Jared Andrews (11:21:27): > yep, will do

Jared Andrews (17:52:14): > Hmm, should a B cell from liver be labeled differently from a bone marrow B cell? I have found a few datasets that would have both fine and main labels already without taking that into account.

Aaron Lun (17:52:52): > your call.

Jared Andrews (17:52:52): > My inclination is very much “yes”, but we need to determine how such cases should be handled.

Aaron Lun (17:52:59): > didn’t even know that livers had B cells.

Jared Andrews (17:53:34): > Everywhere has B cells if you look hard enough. But example holds for other cell types as well.

Jared Andrews (17:54:02): > I guess tacking it on to the “fine” label is probably most appropriate.

Aaron Lun (17:54:17): > sounds sensible.

Jared Andrews (17:56:05): > I just now realized your blob avatar has a large mustache and is not in fact despondently weeping.

Aaron Lun (17:56:20): > ha

Dan Bunis (18:28:41): > Tacking it onto fine sounds right to me.

Dan Bunis (18:31:07): > For T cells, I know that the ones characterized in tissues tend to show differences in expression compared to circulating T cells (what many would consider the canonical version). But I agree that’s probably fine-grain and not main-level necessary.
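The "tack the tissue onto the fine label" convention agreed above can be sketched in a couple of lines of base R. The column names (main, tissue, fine) and the label format are hypothetical choices for illustration, not a celldex standard.

```r
# Hypothetical sample annotation: broad label plus tissue of origin.
annot <- data.frame(
  main   = c("B cell", "B cell", "T cell", "T cell"),
  tissue = c("liver", "bone marrow", "blood", "lung"),
  stringsAsFactors = FALSE
)

# Main labels stay tissue-agnostic; fine labels carry the tissue.
annot$fine <- paste0(annot$main, " (", annot$tissue, ")")
```

With this layout, annotation at the main level still pools all B cells, while the fine level keeps liver and bone marrow B cells apart.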

2020-08-13

Jared Andrews (11:09:38): > Should we provide pre-aggregated references for other enormous datasets?

Aaron Lun (11:20:04): > I think that would make sense.

2020-08-14

Roye Rozov (04:44:17): > @Roye Rozov has joined the channel

Kasper D. Hansen (05:28:24): > @Kasper D. Hansen has joined the channel

2020-08-15

Aaron Lun (22:02:27): > Was reviewing a paper and looking at CIBERSORT

Aaron Lun (22:02:39): > and then I realized that you had to get a license to use CIBERSORT

Aaron Lun (22:02:51): > and I fell out of my chair laughing

Dan Bunis (23:46:04): > Right!!:man-facepalming:

Dan Bunis (23:48:27): > xCell ftw

2020-08-16

Aaron Lun (00:12:33): > One could consider getting xCell onto BioC. Might be a fun summer project for someone.

Dan Bunis (01:54:14): > I’ve talked about future plans for xCell with@Dvir Aran. Upgrades to xCell are certainly on his mind.

Dvir Aran (02:57:02): > Yes, setting up my lab now, and one of first projects is xCell 2.0

Dvir Aran (14:52:00): > Re cibersort - I never understood it. It’s one line of code. Why would someone pay for it?

Federico Marini (15:22:43): > because Stanford?

Federico Marini (15:23:48): > From the perspective of someone that sometimes gets asked to run “some” de-convolution approach on their data, it is impressive how everyone has at least heard of that. Yet have no clue it is the only sheriff in town.

Dan Bunis (16:11:43): > If only Atul (Dvir’s mentor at the time both xCell and SingleR were first published) hadn’t moved then lol. xCell and SingleR would both have come from Stanford instead of UCSF.

Dan Bunis (16:20:42): > Also DeconRNASeq - Attachment (Bioconductor): DeconRNASeq > DeconSeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. It modeled expression levels from heterogeneous cell populations in mRNA-Seq as the weighted average of expression from different constituting cell types and predicted cell type proportions of single expression profiles.

Federico Marini (16:25:43): > and quantiseq

2020-08-17

Jared Andrews (11:12:47): > @Aaron Lun Should new celldex dataset scripts go into the 1.2.0 folder? How does the versioning work? > > I also have a pretty large set, some of which may be better added to scRNAseq. Is there any set criteria for whether a dataset is a better fit for scRNAseq (or their own package) or celldex?

Jared Andrews (11:13:05): > Current list of datasets is here: https://docs.google.com/spreadsheets/d/1eU83gy_1SaD2d2S3qxy0J5DvJIT0N9Lu8s_jFkZyM8M/edit?usp=sharing

Jared Andrews (11:13:26): > First batch is what I plan to add, second is for my own personal use.

Jared Andrews (11:14:44): > I will aggregate the huge ones for entry in celldex.

Aaron Lun (11:19:09): > To answer the second question: the key distinction between scRNAseq and celldex is that the raw single-cell counts go in the former while the processed values (aggregated, log-transformed + normalized, possibly FPKMs) go in the latter. Scanning the spreadsheet, I daresay that all of the datasets in the first block are good candidates to go into scRNAseq; then the celldex scripts can just pull them down and process them to create suitably compact references. > > That’s not to say that people can’t use the raw single-cell counts directly for annotation, but if they want to do so, they can get the stuff directly from scRNAseq. Similarly, if people don’t care about annotation and they just want to look at the single-cell data, they can use scRNAseq.

Aaron Lun (11:20:36): > As for the versioning: you will want to add new scripts to 2.4.0 for scRNAseq and 1.0.0 for celldex. The idea is to add them to the number corresponding to the next release, given that the average user won’t really be knowledgeable of the BioC-devel version numbers anyway.

Jared Andrews (11:23:57): > Okay, got it. Hmm, some of these sets are semi-processed already, and I sure as hell am not going to request raw data for all of them.

Aaron Lun (11:25:23): > how processed are we talking? If they’re still “single-cell”, then that’s fine.

Aaron Lun (11:26:09): > if they’re already agg’d, then they should go into celldex like that example we discussed earlier.

Jared Andrews (11:26:54): > Still single cell, but how they’ve been processed varies from TPM-normalized to Seurat-scaled.

Jared Andrews (11:27:45): > Some lack good info about what the expression matrix actually is.

Jared Andrews (11:28:30): > None are already agg’d except the BrainMap sets, which also have counts available. So agg’d ones for that will go into celldex.

Aaron Lun (11:35:13): > let’s knock off the low-hanging fruit, get the agg’d brainmap ones in.

Jared Andrews (11:35:33): > Will do.

Aaron Lun (11:35:35): > That seems like a clear win.

Aaron Lun (11:36:03): > The normalized single-cell values probably still belong in scRNAseq, I think there is already precedent for having some normalized values in there.

Aaron Lun (11:36:34): > think the Grun pancreas data was already normalized, but I don’t recall.

2020-08-18

Will Macnair (09:08:52): > @Will Macnair has joined the channel

2020-08-19

Yi Wang (12:25:33): > @Yi Wang has joined the channel

2020-08-24

Jose Alquicira (07:57:00): > @Jose Alquicira has joined the channel

2020-08-26

Aaron Lun (02:08:29): > Note that I’m deprecating the method= argument - NOT the cluster-based functionality, just the method= argument. This is because if you provide SingleR() with a non-NULL clusters=, there’s no need to do the extra typing to specify method="cluster", because it’s pretty clear that you want to do cluster-based annotation. This should save some typing for everyone.

Jared Andrews (09:00:18): > :+1:

2020-09-02

Aaron Lun (11:57:19): > Does anyone have any insight on https://support.bioconductor.org/p/133626/, other than “HPCA kind of sucks?”

Dan Bunis (16:43:54): > I wonder if the issue may be doublets vs singlets or, for the B_cells specifically, maybe some very distinct fine-level subsets that then get lumped together in the broad-labels…. for their DE-based QC comment, they might be using too-strict cutoffs. Idk though, kinda just spitballing here.

2020-09-03

Aaron Lun (17:58:34): > We should probably think of some new ideas for pruning bad assigned labels. I’ve been thinking for a while and I can’t think of much better than what we’re already doing with pruneScores.

Aaron Lun (18:01:45): > maybe some kind of hard threshold on the delta-from-median, but the question is how to choose this number.

Aaron Lun (18:13:37): > Could probably derive a “reasonable” value from the reference dataset, by seeing what the delta would be if each cell was annotated against the other labels (and choosing an appropriate threshold based on a low outlier).
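The idea floated above (derive a hard delta-from-median cutoff from the reference, using a low-outlier rule) might look roughly like the base-R sketch below. Everything here is an assumption for illustration: the scores are simulated, the helper names are hypothetical, and the "median minus 3 MADs" rule is just one common way to define a low outlier. It is not the pruneScores implementation.

```r
set.seed(42)
labels <- c("B", "T", "NK")

# Simulate a cells-by-labels score matrix where one label per cell gets a boost.
simulate_scores <- function(n, boost) {
  s <- matrix(runif(n * length(labels), 0.1, 0.3), nrow = n,
              dimnames = list(NULL, labels))
  best <- cbind(seq_len(n), sample(length(labels), n, replace = TRUE))
  s[best] <- s[best] + boost
  s
}

# Delta = best score minus the median score across labels, per cell.
delta_from_median <- function(scores) {
  apply(scores, 1, max) - apply(scores, 1, median)
}

# Step 1: compute deltas on (simulated) reference cells and pick a
# hard threshold as a low outlier: median - 3 * MAD.
ref_delta <- delta_from_median(simulate_scores(200, boost = 0.4))
threshold <- median(ref_delta) - 3 * mad(ref_delta)

# Step 2: apply the fixed threshold to test cells. The last 5 cells
# score identically for every label, i.e. they are ambiguous.
ambiguous   <- matrix(0.2, nrow = 5, ncol = length(labels),
                      dimnames = list(NULL, labels))
test_scores <- rbind(simulate_scores(95, boost = 0.4), ambiguous)
prune <- delta_from_median(test_scores) < threshold
```

The difference from a per-dataset outlier rule is that the threshold is fixed ahead of time, so a test dataset made up entirely of poorly matching cells would still be pruned.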

2020-09-04

Goutham Atla (08:24:50): > @Goutham Atla has joined the channel

2020-09-16

Friederike Dündar (09:27:26): > Whoosh, and the year is almost over (can’t wait for this year to be over). Just playing catch up on all the amazing things you guys pulled off over the summer. Is there a way to get a PDF version of the SingleR book?

Aaron Lun (11:04:07): > That’s a good question

Friederike Dündar (14:39:02): > Thank you. Is there a good answer, too?

Aaron Lun (14:41:10): > The current answer is “no”, but I don’t know whether it’s “not yet” or “never”.

Aaron Lun (14:41:31): > It should actually be fairly easy to generate the PDF version, but I don’t know how the various interactive HTML bits will interact with the LaTeX code.

Federico Marini (14:56:01): > I would not mind “missing” the interactive part, for me it is just more practical to have all the book in one file (and print)

Aaron Lun (14:56:46): > also let me add “it should actually be fairly easy in theory”

Friederike Dündar (16:21:35): > If one were to suggest to include, for example, the visual workflow summary of SingleR provided in Dvir’s original paper, where would that PNG ideally go in the context of the book’s repo? - File (PNG): image.png

Aaron Lun (16:22:45): > I think there’s a section where I talk about the algorithm, could fit in there. Should be able to make a knitr::include_graphics call to the PNG’s URL there.

Aaron Lun (16:24:00): > might need to do some fancy magick::image_crop()’ing if you just want a subsection of it.

Friederike Dündar (16:27:58): > not sure how that’ll pan out with a paywalled article? I was hoping for the lower tech solution of sticking the PNG as a file into the repo to get away with a relative link and no magick

Friederike Dündar (16:30:09): > I should have specified that, content-wise, I’d have a suggestion where to place it, but I was wondering how to construct the relative link to the PNG of my liking (I may want to cook up another illustration at one point, i.e. go to the drawing board myself)

Aaron Lun (17:17:42): > I’ve typically handled this by having an entirely separate branch for images, see for example https://github.com/Bioconductor/OrchestratingSingleCellAnalysis/tree/images/images. These are then referenced by their raw.github.com URLs in the master of the book itself, see for example the iSEE chapter. > > I’ve been loath to commit images to master given how quickly they inflate the git blobs if they need updating. When they’re in a separate branch, I don’t really have to care all too much, and the URL references still work.

2020-09-17

Friederike Dündar (04:13:59): > will check it out; the separate branch is nice

Dvir Aran (11:23:44): > See this file that accompanies the paper https://github.com/dviraran/SingleR/blob/master/vignettes/SupplementaryInformation1.html

Dvir Aran (11:25:15): > My attempt to explain the algorithm. The markdown to create it is here https://github.com/dviraran/SingleR/blob/master/vignettes/SupplementaryInformation1.Rmd

Dvir Aran (11:25:31): > But it uses my version of SingleR

Dvir Aran (11:26:18): > I also have a much more colorful version of figure 1a. I’ll try to find it

2020-10-07

Aaron Lun (04:31:59): > FYI http://bioconductor.org/books/devel/SingleRBook/ - Attachment (bioconductor.org): Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Aaron Lun (04:32:54): > old links will redirect here for a while, probably until 3.13.

2020-10-19

Dvir Aran (08:47:56): > Hi, I’ve been running into issues with SingleR. I get an error “no method found for function rowMedians and signature ANY”

Dvir Aran (08:48:32): > Any idea what is going on?

Dvir Aran (08:56:31): > Getting this error while trying to run the example in SingleR

Dvir Aran (08:58:29): > After detaching all packages and removing all objects

Aaron Lun (11:25:11): > Should be fixed with the latest DMS, patched yesterday.

Dvir Aran (15:23:35): > Hmm. Just updated all packages, including delayedmatrixstats

Dvir Aran (15:24:07): > Still same issue

Federico Marini (15:24:18): > could it be that it is still not built on Bioc?

Federico Marini (15:24:33): > i.e. if you pick it from GitHub’s repo it would

Dvir Aran (15:25:52): > I get an error installing from github

Federico Marini (15:26:52): > Error or warning-to-error? I am asking as this often comes up just for using remotes

Aaron Lun (15:29:33): > Pete pushed the change to DMS yesterday, so it’ll be a day until it becomes available on BioC-devel.

Aaron Lun (15:29:54): > 1.11.5 is what you should be looking out for.

Dvir Aran (15:33:11): > Thanks. I don’t understand what is going on. I tried installing DMS v1.10, but still getting same error. Where is the issue?

Aaron Lun (15:33:49): > Okay, hold on. The first thing is that you shouldn’t be affected by any of this if you’re using BioC-release packages (i.e., even middle numbers).

Aaron Lun (15:34:06): > Those are still functioning as expected, and nothing in this discussion pertains to them.

Aaron Lun (15:34:23): > A few of us are probably using SingleR from GitHub, which corresponds to the BioC-devel version.

Aaron Lun (15:34:36): > Well, I know I am.

Aaron Lun (15:35:09): > If that’s the case, there was a recent bug introduced in DMS 1.11.4 that broke rowMedians. Long story, I won’t go into it.

Dvir Aran (15:35:11): > I am using the devel version. Tried rolling back to older version but didn’t help. I guess I should try rolling back to release

Aaron Lun (15:35:39): > Probably best to try to roll forward to DMS 1.11.5; you said you tried this and it didn’t work?

Dvir Aran (15:35:51): > Yup

Aaron Lun (15:35:57): > What was the error?

Dvir Aran (15:36:44): > Not giving anything useful

Aaron Lun (15:37:53): > usually there’s something in the error messages: I’m doing BiocManager::install("PeteHaitch/DelayedMatrixStats"), for example, and it’s vomiting out all sorts of stuff.

Dvir Aran (15:39:02): > Just - Error: (converted from warning) package MatrixGenerics was built under R v4.0.3

Dvir Aran (15:39:14): > And then execution halted

Aaron Lun (15:39:23): > Ugh. I wonder why it converted it from a warning?

Aaron Lun (15:40:52): > Well, let’s try doing it piece by piece. What about: > > BiocManager::install("MatrixGenerics") > BiocManager::install("sparseMatrixStats") > BiocManager::install("PeteHaitch/DelayedMatrixStats") >

Aaron Lun (15:41:17): > assuming you didn’t set your options(warn=2) globally, which would be kind of crazy.

Dvir Aran (15:43:10): > It’s the default. Anyhow, I suppressed this with R_REMOTES_NO_ERRORS_FROM_WARNINGS

Dvir Aran (15:43:17): > And it works

Aaron Lun (15:43:26): > Hm, must be a remotes thing.

Aaron Lun (15:43:47): > I wonder why they do that by default. Oh well.

Federico Marini (15:43:50): > Yes it is

Dvir Aran (15:43:59): > Damn it, SingleR still not working

Federico Marini (15:44:00): > This is what I meant with the above

Federico Marini (15:44:03): > https://github.com/federicomarini/ideal/blob/master/README.md#installation-troubleshooting

Federico Marini (15:44:19): > Sys.setenv(R_REMOTES_NO_ERRORS_FROM_WARNINGS="true"), no clue why it is set to false actually..

Dvir Aran (15:44:33): > No, tried devtools and bioc, both same

Aaron Lun (15:44:33): > what’s the error?

Dvir Aran (15:44:54): > Same error in singler

Aaron Lun (15:45:03): > hm. Let me have a look at it.

Aaron Lun (15:45:51): > might take a few minutes. Need to update my local install.

Dvir Aran (15:46:01): > Restarted r and it works!

Dvir Aran (15:46:05): > Thanks!

Aaron Lun (15:46:12): > oh, yeah, usually need to do that for package installs.

Aaron Lun (15:46:19): > Okay, great. That’s a relief.

Dvir Aran (15:47:45): > Yeah, well, the good and bad in open source development
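The two lessons from this exchange (check that the installed version is at least the fixed one, and restart R after reinstalling a package that is already loaded) can be captured in a small base-R sketch. The helper names meets_min_version and needs_restart are hypothetical, and the restart check assumes that packageVersion() reads the on-disk DESCRIPTION while getNamespaceVersion() reports what was loaded into the session, which should hold for a reinstall into the same library.

```r
# TRUE if 'pkg' is installed and at least 'min_version'.
meets_min_version <- function(pkg, min_version) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(FALSE)
  }
  utils::packageVersion(pkg) >= package_version(min_version)
}

# TRUE if the namespace loaded in this session differs from the
# version currently installed on disk (i.e. a restart is advisable).
needs_restart <- function(pkg) {
  isNamespaceLoaded(pkg) &&
    package_version(getNamespaceVersion(pkg)) != utils::packageVersion(pkg)
}

# The rowMedians fix discussed above is in DelayedMatrixStats 1.11.5.
if (!meets_min_version("DelayedMatrixStats", "1.11.5")) {
  message("DelayedMatrixStats is older than 1.11.5 or not installed.")
}

# Version comparison is done per component, not as plain strings.
v_fixed  <- package_version("1.11.5")
v_broken <- package_version("1.11.4")
v_fixed > v_broken
```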

2020-10-28

Dan Bunis (16:39:49): > From the New Books section of the Bioc 3.12 release notes: > > SingleRBook: This book covers the use of SingleR, one implementation of an automated annotation method. If you want a survey of different annotation methods - this book is not for you. If you want to create hand-crafted cluster definitions - this book is not for you. (Read the other one instead.) If you want to use the pre-Bioconductor version of the package - this book is not for you. But if you’re tired of manually annotating your single-cell data and you want to do something better with your life, then read on. > Thanks, Aaron, for the hearty laugh!

Aaron Lun (16:40:33): > my pleasure, as always

Aaron Lun (16:40:54): > I also take liberties with the “contributors” section of every book I’m in.

Dan Bunis (16:44:26): > yes yes!:rolling_on_the_floor_laughing:, also great.

Dan Bunis (16:44:45): > (https://bioconductor.org/books/3.12/SingleRBook/contributors.html) - Attachment (bioconductor.org): Chapter 10 Contributors | Assigning cell types with SingleR > The SingleR book. Because sometimes, a vignette just isn’t enough.

Jared Andrews (16:53:48): > I’ll take eldritch horrorscape to square off the year, sounds like an interesting demise as far as apocalyptic events go.

2020-11-04

ImranF (11:57:45): > @ImranF has joined the channel

2020-11-05

Aaron Lun (14:21:09): > I’m looking through Seurat’s module score code, and I don’t understand why they feel the need to do this sampling thing.

Aaron Lun (14:23:54): > I would be very curious to know what the difference is from just taking the average of the log-expression values across all genes in the set.
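To make the comparison concrete, here is a base-R sketch of the two scores being contrasted: the plain per-cell average of log-expression over the gene set, and a simplified control-adjusted variant that samples control genes from the same average-expression bins and subtracts their mean. The second part is only a stand-in for the general "sampling thing"; it is not Seurat's code, and the bin count and number of controls per gene are arbitrary assumptions.

```r
set.seed(7)
# Simulated log-expression matrix: 50 genes x 20 cells.
logexpr <- matrix(rexp(50 * 20), nrow = 50,
                  dimnames = list(paste0("gene", 1:50), paste0("cell", 1:20)))
gene_set <- paste0("gene", 1:5)

# Option 1: plain average of log-expression across the genes in the set.
simple_score <- colMeans(logexpr[gene_set, , drop = FALSE])

# Option 2 (simplified sketch): bin all genes by average expression,
# sample control genes from the same bin as each set gene, and
# subtract the control average from the set average.
avg  <- rowMeans(logexpr)
bins <- setNames(cut(rank(avg), breaks = 5, labels = FALSE), rownames(logexpr))
pool <- setdiff(rownames(logexpr), gene_set)

controls <- unlist(lapply(gene_set, function(g) {
  same_bin <- pool[bins[pool] == bins[g]]
  sample(same_bin, size = min(3, length(same_bin)))
}))

control_score  <- colMeans(logexpr[unique(controls), , drop = FALSE])
adjusted_score <- simple_score - control_score

# How similar are the two scores across cells?
agreement <- cor(simple_score, adjusted_score)
```

On real data, comparing the two scores per cell in this way would be one direct route to answering the question of what the sampling step actually changes.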

2020-11-06

Friederike Dündar (05:40:57): > maybe an SCTransform side-effect?

2020-11-13

brian capaldo (14:42:12): > @brian capaldo has joined the channel

2020-11-19

Kevin Blighe (08:31:26): > @Kevin Blighe has joined the channel

2020-12-07

Aaron Lun (17:21:44): > do we have some timings on typical SingleR runs with a reasonably sized dataset?

Aaron Lun (17:21:59): > say, 100k cells. Are we talking minutes, hours, etc.?

Aaron Lun (18:44:52): > really? I didn’t think it was that fast.

Jared Andrews (22:12:57): > Really really depends on reference(s), in my experience. If you cram a bunch of references with closely related labels together, you get into hours.

Aaron Lun (22:15:30): > Does parallelization help?

Aaron Lun (22:16:23): > I suspect I can’t make it any faster… well, other than to suggest setting fine.tune=FALSE, but that’s a big loss.

Aaron Lun (22:16:34): > (especially for closely related labels).

2020-12-12

Huipeng Li (00:38:29): > @Huipeng Li has joined the channel

2020-12-13

Kelly Eckenrode (13:42:09): > @Kelly Eckenrode has joined the channel

2020-12-14

Nick Owen (13:21:48): > @Nick Owen has joined the channel

2021-01-01

Bernd (14:06:40): > @Bernd has joined the channel

2021-01-22

Annajiat Alim Rasel (15:45:24): > @Annajiat Alim Rasel has joined the channel

2021-02-28

Alexander Toenges (17:45:07): > @Alexander Toenges has joined the channel

2021-05-11

Megha Lal (16:45:49): > @Megha Lal has joined the channel

2021-05-14

Aaron Lun (17:33:22): > Heads-up, aggregateReference will now do some quick and dirty feature selection before its PCA step. It’s also parallelized in a more sane way so it should be faster as well.

2021-06-03

Federico Marini (08:46:05): > FYI -> http://www.nature.com/articles/s41596-021-00534-0

Federico Marini (08:47:03): > is this even so?

Federico Marini (08:47:06): - File (PNG): image.png

Federico Marini (08:47:56): > the name is also misspelled :disappointed:

Davide Risso (08:56:58): > that’s weird… they say scmap is scalable and SingleR is not, but my experience is the exact opposite… it makes me wonder if I’ve been using scmap wrong…

Davide Risso (09:08:49): > FWIW we used SingleR on a dataset with ~50,000 cells with no issues

Jared Andrews (09:41:00): > That could have been the original version, which did have some scaling issues IIRC, but the bioc version can handle 100k+ cells relatively easily.

Jared Andrews (09:41:34): > But given that paper was just published and the bioc version has been out like ~2 years, I hope that’s not the case.

Federico Marini (17:03:39) (in thread): > same

Federico Marini (17:04:23): > I have the impression they miss out on all the cool features that got implemented since

Federico Marini (17:04:46): > plus: did I miss the web interface to that?

2021-06-24

Kevin Blighe (21:05:41): > With SingleR, as I understand, the reference datasets just comprise lists of genes (are they ranked in some way?), and then there is some number-crunching happening to infer the expression of each reference dataset cell-type in your own data, right? > Anybody aware of any mouse foetal (fetal) cell datasets?

Dan Bunis (23:12:55): > SingleR refs are full expression sets, as literal expression matrices or SCEs. Initial scoring is a Spearman correlation of marker genes between each cell and the ref-label samples/cells (and marker genes are learned in the trainSingleR step, not embedded in the ref)
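A toy base-R version of the scoring step described above: Spearman correlation between one test cell and every reference sample over the marker genes, then one score per label. The data are simulated, the markers are hard-coded rather than learned, and summarising each label by the 0.8 quantile of its correlations is an assumption about the default; treat this as a sketch of the idea, not SingleR's implementation.

```r
set.seed(3)
genes <- paste0("g", 1:30)

# Simulated reference: 30 genes x 6 samples, 3 samples per label.
ref <- matrix(rnorm(30 * 6), nrow = 30,
              dimnames = list(genes, paste0("ref", 1:6)))
ref_labels <- rep(c("B", "T"), each = 3)

# Plant marker signal: g1-g10 are high in B, g11-g20 are high in T.
ref[1:10,  ref_labels == "B"] <- ref[1:10,  ref_labels == "B"] + 3
ref[11:20, ref_labels == "T"] <- ref[11:20, ref_labels == "T"] + 3
markers <- genes[1:20]  # hard-coded here; SingleR learns these in training

# A test cell that resembles the B profile.
cell <- setNames(rnorm(30), genes)
cell[1:10] <- cell[1:10] + 3

# Spearman correlation of the cell against each reference sample,
# restricted to the marker genes.
rho <- drop(cor(cell[markers], ref[markers, ], method = "spearman"))

# One score per label: a high quantile of that label's correlations.
scores <- sapply(split(rho, ref_labels), quantile, probs = 0.8, names = FALSE)
assigned <- names(which.max(scores))
```

Because only ranks within the marker set matter, the reference and test data do not need to be on the same scale, which is why processed references (log-normalized, FPKM, aggregated) can still be used.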

Dan Bunis (23:14:40): > And I believe there is one of those I’ve seen published, but I don’t remember where. I don’t think it’s part of celldex. Not sure about scRNAseq either.

Dan Bunis (23:14:58): > sorry I’m not helpful on that side!

2021-06-25

Kevin Blighe (06:49:06): > Thanks Dan! - interesting how the number crunching works

Friederike Dündar (15:09:53): > I think I used this one before for human fetal brain scRNA-seq reference data: Nowakowski2017

Friederike Dündar (15:10:04): > https://doi.org/10.1016/j.stem.2016.03.012 - Attachment (sciencedirect.com): Expression Analysis Highlights AXL as a Candidate Zika Virus Entry Receptor in Neural Stem Cells > The recent outbreak of Zika virus (ZIKV) in Brazil has been linked to substantial increases in fetal abnormalities and microcephaly. However, informat…

Friederike Dündar (15:11:11) (in thread): > are you looking for all tissues of the mouse foetus?

Kevin Blighe (15:45:15) (in thread): > Biopsies are from foetal hearts, but it’s white blood cells / leukocytes that are being profiled from these

Friederike Dündar (16:32:42) (in thread): > mh, fetal blood cells, nope, never worked with that

2021-07-04

Kevin Blighe (09:26:09): > The study is actually based on the “resident leukocyte population in the foetal heart, and specifically the macrophage population” > I cannot imagine any reference dataset out there for this. Although, if it’s literally just macrophages, then surely any common immune cell signature should suffice?

Jared Andrews (11:41:47): > Macrophages have pretty drastic tissue-specific signatures. I’m unsure how well a typical immune signature from PBMCs or such would pick them up. I imagine they’d probably still be the top score, but who knows.

2021-07-05

Friederike Dündar (05:49:27): > Yeah, agree with Jared. If the task is to differentiate Macrophages from other non-immune cells, then probably most references with macrophages in them should do

Friederike Dündar (05:49:42): > If you want to distinguish different immune cell types, it’ll be more difficult for sure

2021-07-13

Kevin Blighe (23:26:04): > In relation to the above (foetal mouse heart macrophages), GSEA picked up this signature in a few of the clusters in the data: https://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?geneSetName=CUI_DEVELOPING_HEART_C8_MACROPHAGE Relating to: https://pubmed.ncbi.nlm.nih.gov/30759401/ - Attachment (PubMed): Single-Cell Transcriptome Analysis Maps the Developmental Track of the Human Heart - PubMed > The heart is the central organ of the circulatory system, and its proper development is vital for maintaining human life. Here, we used single-cell RNA sequencing to profile the gene expression landscapes of ∼4,000 cardiac cells from human embryos and identified four major types of cells: cardiomyoc …

Kevin Blighe (23:26:45): > That’s pretty cool the way that GSEA led me to a possible reference set for SingleR, courtesy of Broad Inst’s manual curation team

2021-08-05

Manojkumar Selvaraju (17:58:49): > @Manojkumar Selvaraju has joined the channel

2021-09-06

Eddie (08:23:31): > @Eddie has joined the channel

2021-09-09

Julien Roux (01:59:25): > @Julien Roux has joined the channel

2021-10-15

Wes W (11:01:21): > @Wes W has joined the channel

2021-10-18

Wes W (14:14:26): > in response to a very old message by @Federico Marini and @Davide Risso, we sequenced 25 patients’ T cells by 10X and had just shy of 240,000 cells after clean up. SingleR had no issues… that paper is wack yo

Qirong Lin (19:34:54): > @Qirong Lin has joined the channel

2021-10-19

Federico Marini (07:48:01): > I think that comparison was done with a not-up-to-date version of the SingleR code, which got a significant boost in all directions after that

2021-10-29

Enrico Ferrero (13:22:21): > @Enrico Ferrero has joined the channel

2021-11-08

Paula Nieto García (03:29:33): > @Paula Nieto García has joined the channel

2021-11-09

Aedin Culhane (02:35:47) (in thread): > Completely agree. We have purified T cells and most pipelines call macrophages and other myeloid cells in these if you have a mixed ref.

2021-11-30

Friederike Dündar (07:04:18): > > The fastest method appears to be scmap-cluster, and the other fast methods include SCINA, SingleCellNet, SciBet, SingleR and scHPL > https://www.sciencedirect.com/science/article/pii/S2001037021004499?via%3Dihub - Attachment (sciencedirect.com): Automatic cell type identification methods for single-cell RNA sequencing > Single-cell RNA sequencing (scRNA-seq) has become a powerful tool for scientists of many research disciplines due to its ability to elucidate the hete…

2022-01-28

Megha Lal (11:14:38): > @Megha Lal has left the channel

2022-02-01

Stephanie Hicks (20:24:58): > @Stephanie Hicks has left the channel

2022-02-15

Gene Cutler (12:01:32): > @Gene Cutler has joined the channel

2022-03-21

Pedro Sanchez (05:02:41): > @Pedro Sanchez has joined the channel

2022-05-18

Vince Carey (06:23:40): > @Vince Carey has left the channel

2022-07-28

Mervin Fansler (17:21:02): > @Mervin Fansler has joined the channel

2022-08-11

Rene Welch (17:16:14): > @Rene Welch has joined the channel

2022-08-15

Michael Kaufman (13:15:46): > @Michael Kaufman has joined the channel

2022-10-20

Connie Li Wai Suen (01:24:01): > @Connie Li Wai Suen has joined the channel

2022-11-06

Sherine Khalafalla Saber (11:21:21): > @Sherine Khalafalla Saber has joined the channel

2022-12-13

Ana Cristina Guerra de Souza (09:01:20): > @Ana Cristina Guerra de Souza has joined the channel

2022-12-20

Jennifer Foltz (10:41:27): > @Jennifer Foltz has joined the channel

2023-01-08

Aedin Culhane (06:19:09): > We have been using celltypist and manual scoring for immune cell classification. What are others using to subclassify immune cells (exhausted, tissue resident, CD4, CD8 T cells, Mast cells, DC etc)?

2023-01-10

Wes W (10:03:13): > different environments make a big difference I find. we do a lot of manual scoring as no approach has come close to getting any of the cells right in our tumour microenvironments. we also have CAR-T cells whose manufacturing process is not a normal environment and their artificial activation and stimulation has a bunch of pathways activated that make traditional signature tools get the cells wrong. the tool I ended up making works great for our case but I would wager if tested on a different tissue would prob fail like the others fail on mine.

2023-01-27

Yu Zhang (11:21:36): > @Yu Zhang has joined the channel

2023-02-28

Ramin (15:30:42): > @Ramin has joined the channel

2023-03-01

jeremymchacón (12:14:26): > @jeremymchacón has joined the channel

2023-05-12

Aaron Lun (13:33:15): > @Aaron Lun has left the channel

2023-06-08

Pierre-Luc Germain (04:46:25): > @Pierre-Luc Germain has left the channel

2023-06-19

Pierre-Paul Axisa (05:12:19): > @Pierre-Paul Axisa has joined the channel

2023-07-12

Axel Klenk (19:33:46): > @Axel Klenk has joined the channel

2023-07-28

Konstantinos Daniilidis (13:47:47): > @Konstantinos Daniilidis has joined the channel

Benjamin Yang (15:58:55): > @Benjamin Yang has joined the channel

2023-07-31

Chenyue Lu (17:51:13): > @Chenyue Lu has joined the channel

2023-09-01

Chris Vanderaa (09:34:56): > @Chris Vanderaa has left the channel

2023-09-03

Lea Seep (09:52:44): > @Lea Seep has joined the channel

2023-09-13

Christopher Chin (17:05:01): > @Christopher Chin has joined the channel

2023-12-13

Paul Myers (09:45:43): > @Paul Myers has joined the channel

2023-12-14

Marc Elosua (15:40:00): > @Marc Elosua has joined the channel

2023-12-27

Cindy Reichel (14:37:37): > @Cindy Reichel has joined the channel

2024-03-27

abhich (05:46:24): > @abhich has joined the channel

2024-04-25

Mercedes Guerrero (05:02:45): > @Mercedes Guerrero has joined the channel

2024-05-06

Michal Kolář (11:58:05): > @Michal Kolář has joined the channel

2024-07-02

Diána Pejtsik (10:59:39): > @Diána Pejtsik has joined the channel

2024-07-31

Zahraa W Alsafwani (17:24:57): > @Zahraa W Alsafwani has joined the channel

2024-08-19

Rema Gesaka (09:41:07): > @Rema Gesaka has joined the channel

2024-11-23

Umar Ahmad (18:00:47): > @Umar Ahmad has joined the channel