#tidiness_in_bioc

2018-12-10

Vince Carey (15:14:41): > @Vince Carey has joined the channel

Vince Carey (15:14:42): > set the channel description: discuss relationship of Bioc structures to tidyverse

Martin Morgan (15:14:42): > @Martin Morgan has joined the channel

Stuart Lee (15:14:42): > @Stuart Lee has joined the channel

Michael Lawrence (15:14:42): > @Michael Lawrence has joined the channel

Laurent Gatto (15:14:42): > @Laurent Gatto has joined the channel

Vince Carey (15:19:36): > Welcome to the tidiness channel. To cut to the chase: at Bioc Europe we had a brief discussion of options for using tidyverse concepts (linear command chains, small number of manipulation verbs, tidy data concept) with SummarizedExperiments. @Stuart Lee and @Michael Lawrence have probably given this some thought. I was particularly impressed by the plyranges talk given at Bioc2018 and felt sure that I should try to incorporate the key pedagogic themes there into work I try to do with SummarizedExperiments. It has not proven all that easy, and before I go into more details I thought I would just see what kinds of work may already be going on in this direction.

Martin Morgan (20:39:21): > Hmm… also https://github.com/sa-lee/plyexperiment/issues/1 - Attachment (GitHub): Design notes · Issue #1 · sa-lee/plyexperiment > @lawremi's notes moved over from the plyranges wiki Design Goals We have basically the same goals as for (G)Ranges: make SummarizedExperiment easier to use. Data Structures The trouble with SE …

Sean Davis (21:40:21): > @Sean Davis has joined the channel

Rob Amezquita (22:13:58): > @Rob Amezquita has joined the channel

2018-12-11

Charlotte Soneson (00:07:51): > @Charlotte Soneson has joined the channel

Ludwig Geistlinger (00:51:06): > @Ludwig Geistlinger has joined the channel

Kevin Rue-Albrecht (04:14:45): > @Kevin Rue-Albrecht has joined the channel

Malte Thodberg (06:49:44): > @Malte Thodberg has joined the channel

Dror Berel (10:52:19): > @Dror Berel has joined the channel

Daniel Van Twisk (15:48:44): > @Daniel Van Twisk has joined the channel

Frederick Tan (16:59:56): > @Frederick Tan has joined the channel

2018-12-12

James Taylor (11:43:48): > @James Taylor has joined the channel

Stephanie Hicks (15:08:36): > @Stephanie Hicks has joined the channel

Michael Love (22:21:42): > @Michael Love has joined the channel

2018-12-13

Michael Love (07:34:45): > See also @Laurent Gatto’s work on https://github.com/lgatto/tidies - Attachment (GitHub): lgatto/tidies > A Grammar of Data Manipulation for Omics Data. Contribute to lgatto/tidies development by creating an account on GitHub.

Tim Triche (09:25:20): > @Tim Triche has joined the channel

Michael Lawrence (12:42:26): > @Michael Lawrence has joined the channel

Matt Ritchie (13:00:30): > @Matt Ritchie has joined the channel

Martin Morgan (13:46:14): > @Michael Lawrence is it worthwhile to think about ‘tidy’ at the S4Vectors level, e.g., if there were a ‘tidy’ interface to Vector and / or List would we automatically have tidy DataFrame / GRanges / SummarizedExperiment ?

Michael Lawrence (13:48:48): > If by “tidy” you mean a dplyr interface, yes for DataFrame and GRanges. Stuart said that might be his first step, because base implementations of dplyr (e.g. noplyr) lack efficient aggregation due to deficiencies in base R, while S4Vectors has some fast paths.

Martin Morgan (13:50:09): > (separate thought) I was wondering about the importance of non-standard evaluation for the tidyverse, e.g., if subsetting the airway SummarizedExperiment `se[, colData(se)$dex == "trt"]` (which is already `se[, se$dex == "trt"]`) were replaced by `se[, dex == "trt"]`.

Martin Morgan (13:53:58): > (and another separate thought) How important are the visual display and naming convention? For instance, if SummarizedExperiment were to show a few rows of its three main components (rowData, colData, assay(s)) as tibble-like structures, would that make the data more accessible? It quickly takes up more than a screenful of real estate, though…

Martin Morgan (13:59:11): > (and again, just to get these ideas out there…) it seems like the biobroom approach (whack everything into one big tibble) isn’t a good idea – it loses the constraints implicit in the SummarizedExperiment, and is inefficient in storage and computation. I think Laurent’s approach is some intermediate ground, where, e.g., `filter(dex == "trt")` figures out that `dex` is metadata on the column, then does the appropriate subsetting but returns an ExpressionSet. My understanding is that plyexperiment introduces new verbs `filter_rows()` / `filter_columns()` or similar, increasing cognitive load but avoiding inevitable corner cases and also conflicting generics; I don’t know how this is implemented, but I guess by imposing a kind of ‘view’ (Michael said something about a ‘DPI’ in a recent conversation?) on the underlying S4 object…

Nitesh Turaga (14:13:38): > @Nitesh Turaga has joined the channel

Michael Lawrence (14:26:56): > I went with the `_rows` and `_columns` suffixes because it makes the code more explicit. Personally, it’s never made sense to me that SummarizedExperiment is a Vector, nor should it be treated as a long table. The simple structure of SummarizedExperiment makes it easy to reason about the data. We’ve seen the SE design arise again and again. Loom, Hail, etc. We shouldn’t throw the baby out with the bath water.
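
A minimal base-R sketch of the explicit `_rows` / `_columns` idea described above. Everything here is an assumption made for illustration: `toy_se` is a plain list standing in for a SummarizedExperiment (which is really an S4 object), and the two helpers are hypothetical reimplementations, not the plyexperiment code.

```r
# Toy stand-in for a SummarizedExperiment: an assay matrix with
# row metadata (one row per feature) and column metadata (one row per sample).
toy_se <- list(
  assay   = matrix(1:12, nrow = 3,
                   dimnames = list(paste0("gene", 1:3), paste0("s", 1:4))),
  rowData = data.frame(biotype = c("coding", "lncRNA", "coding"),
                       row.names = paste0("gene", 1:3)),
  colData = data.frame(dex = c("trt", "untrt", "trt", "untrt"),
                       row.names = paste0("s", 1:4))
)

# Hypothetical verb: the predicate is evaluated against rowData only,
# so there is no ambiguity about which metadata table `expr` refers to.
filter_rows <- function(se, expr) {
  keep <- eval(substitute(expr), se$rowData, parent.frame())
  keep <- !is.na(keep) & keep
  se$assay   <- se$assay[keep, , drop = FALSE]
  se$rowData <- se$rowData[keep, , drop = FALSE]
  se
}

# Hypothetical verb: same idea, but the predicate sees colData only.
filter_columns <- function(se, expr) {
  keep <- eval(substitute(expr), se$colData, parent.frame())
  keep <- !is.na(keep) & keep
  se$assay   <- se$assay[, keep, drop = FALSE]
  se$colData <- se$colData[keep, , drop = FALSE]
  se
}

# Object in, object of the same shape out; assay and metadata stay in sync.
res <- filter_columns(filter_rows(toy_se, biotype == "coding"), dex == "trt")
dim(res$assay)
```

The design point this tries to show is the one made in the message: naming the margin in the verb removes the need to guess whether `dex` or `biotype` lives in the row or column metadata.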

Michael Lawrence (14:29:17): > Whatever we do, we need to keep the API separate from the DPI. DPIs tend to be DSLs that work only through non-standard evaluation. An API should not introduce that type of complexity.

Michael Lawrence (14:31:44): > Btw, here is an example of treating the data as one long table and querying it with an extension of SQL. https://app.dimensions.ai/details/publication/pub.1034559870 - Attachment (app.dimensions.ai): GenoMetric Query Language: a novel approach to large-scale genomic data management - Dimensions > MOTIVATION: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities. RESULTS: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. AVAILABILITY AND IMPLEMENTATION: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/.

Nicholas Knoblauch (14:32:11): > @Nicholas Knoblauch has joined the channel

Stuart Lee (20:40:31) (in thread): > I think NSE seems central to the tidyverse approach (see their efforts with rlang), however under the constraint that NSE occurs only wrt the grammar. For example, tibble does not implement any NSE for base ops like `[`, so your example would only work with a call to `filter()`.
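
The distinction drawn in this thread, standard evaluation in `[` versus non-standard evaluation confined to a verb, can be sketched with base R alone. This is only an illustration on an ordinary `data.frame` standing in for column metadata (the `col_data` object is made up); base R's `subset()` plays the role that `filter()` plays in dplyr.

```r
# Made-up column metadata, loosely modelled on the airway example.
col_data <- data.frame(dex = c("trt", "untrt", "trt", "untrt"),
                       row.names = paste0("s", 1:4))

# Standard evaluation: the column has to be reached through the object.
std <- col_data[col_data$dex == "trt", , drop = FALSE]

# `[` does not look inside the data frame, so a bare `dex` is resolved in the
# calling environment and fails unless a variable of that name exists there.
bare <- tryCatch(col_data[dex == "trt", , drop = FALSE],
                 error = function(e) e)

# Non-standard evaluation: subset() evaluates the expression with the
# columns of `col_data` in scope, so the bare name works.
nse <- subset(col_data, dex == "trt")

rownames(nse)
```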

Stuart Lee (20:52:06) (in thread): > I think a lot of beginner users mostly work in rstudio - I actually think a lot of mileage for a new user could be had if `SummarizedExperiment` had a less intimidating output from `utils::View`. Being able to see what’s been done to the object is important to the user. I don’t see how that’s possible without being overly verbose, unless maybe the last modified slot gets shown?

Stuart Lee (20:58:07): > What does DPI stand for? I think it’s right to try and port dplyr via S4Vectors, then both plyranges and plyexperiment/tidies can piggyback on that. I probably won’t get around to starting this until the new year though but happy to collaborate if someone else would like to take the lead.

Peter Hickey (22:51:11): > @Peter Hickey has joined the channel

2018-12-14

Davide Risso (04:22:08): > @Davide Risso has joined the channel

Michael Lawrence (12:28:38): > DPI: Data Programming Interface

Michael Love (12:41:04): > anyone against `SummarizedExperiment %>% summarize` just returning the sum of all the elements in all the assays?

Michael Love (12:56:47): > maybe `summarize(frobenius=FALSE)`

Sean Davis (13:02:31): > `summarize` should implement ARTIFICIAL INTELLIGENCE, I think. - File (PNG): Pasted image at 2018-12-14, 1:02 PM

Laurent Gatto (13:04:31): > What about `se %>% group_by(frobenius=FALSE) %>% summarize(...)`

Michael Love (13:04:51): > that’s more straightforward

Michael Love (13:05:27): > and then in line with @Sean Davis’s suggestion, it can produce a word doc suitable for submission to journal

Michael Love (13:06:21): > it goes a bit beyond the scope of the grant proposal but who’s complaining

Rob Amezquita (13:06:33): > a complementary function, `summarise()`, could be added to produce british-english spellings

Laurent Gatto (13:10:08): > You mean `write_paper() %>% submit()`

Laurent Gatto (13:10:10): > The US and GB spellings are supported in the tidyverse.

Kevin Rue-Albrecht (13:13:03): > I think you missed `%T>% upload_ExperimentHub()`

2018-12-18

Levi Waldron (22:48:30): > @Levi Waldron has joined the channel

Levi Waldron (22:50:55): > Did no one make it far enough through the GMQL paper to read that “Also packages of R/ Bioconductor (https://www.bioconductor.org/) have been proposed for tertiary analysis (Huber et al., 2015); they facilitate typical specific operations, but require to perform them through scripts and are not suitable for big data processing”? That, after starting the paper with “We previously proposed a paradigm shift in genomic data management…”

Levi Waldron (22:55:36): > I’m fiddling with their use case. If I can figure out what it actually means, it seems quite manageable using `curatedTCGAData`, `MultiAssayExperiment`, `RaggedExperiment`, and `RangedSummarizedExperiment`. “In TCGA data of BRCA patients, find the DNA somatic mutations within the first 2000 bp outside of the genes that are both expressed with FPKM > 3 and have at least a methylation in the same patient biospecimen, and extract these mutations of the top 5% patients with the highest number of such mutations.”

Levi Waldron (22:59:56): > But those sound like fighting words for me :smile:. They set the bar fairly low by performing this operation in 57 minutes on a server with 120 cores and 375 GB of RAM.

Sean Davis (23:05:41): > This query can also probably be done in BigQuery with one SQL statement and I suspect it would run in seconds to a minute. Of course, the SQL would take some thought, but….

Levi Waldron (23:06:39): > I’d love to write a cheeky blog post providing a couple alternative Bioconductor solutions.

Levi Waldron (23:10:51): > I feel extra-offended since Bioinformatics gave an editorial rejection to @Vince Carey’s BiocOncoTK shortly after publishing this slanderous statement. If you work on the BQ statement I’ll figure out the curatedTCGAData solution…

2018-12-19

Sean Davis (00:24:44): > I have no idea what their “at least a methylation” means.

Sean Davis (00:25:00): > In case someone hasn’t found the newest GMQL iteration: https://doi.org/10.1093/bioinformatics/bty688

Sean Davis (00:42:28): > Ignoring the “methylation” piece (since I do not know what that means), here is a quick SQL run on the available TCGA BigQuery data. Note that this was run on ALL of TCGA, not just breast cancer. I was off a bit in my estimate of query time, but 3min 45sec is not too bad.

Sean Davis (00:43:03): - File (PNG): Pasted image at 2018-12-19, 12:42 AM

Sean Davis (00:44:26): > BigQuery SQL query to go along with @Levi Waldron’s challenge. - File (SQL): BigQuery.sql

Levi Waldron (00:45:46): > Wow, nice, @Sean Davis!!!

Sean Davis (00:46:52): > Needs some sanity checking, but it is late….

Levi Waldron (00:47:36): > Indeed! I have some idea about “has at least a methylation”, will post later

Levi Waldron (00:53:26): > There’s clearly a lot of TCGA-specific implementation underneath their hood, but the 450k methylation dataset has either numeric or NA values, and is associated with gene symbols. I’m guessing “has at least a methylation” just means methylation is non-missing.

Levi Waldron (00:55:54): > Testing here on ACC because it’s more wieldy:
>
>     > system.time(mae <- curatedTCGAData("ACC", c("Mutation", "RNASeq2GeneNorm", "Methylation"), dry.run = FALSE))
>        user  system elapsed
>      64.204   4.184  70.786
>     > rowData(mae[[1]])
>     DataFrame with 485577 rows and 3 columns
>                Gene_Symbol  Chromosome Genomic_Coordinate
>                <character> <character>        <character>
>     cg00000029        RBL2          16           53468112
>     cg00000108     C3orf35           3           37459206
>     cg00000109      FNDC3B           3          171916037
>     cg00000165          NA           1           91194674
>     cg00000236       VDAC3           8           42263294
>     ...                ...         ...                ...
>     rs9363764           NA          NA                  0
>     rs939290            NA          NA                  0
>     rs951295            NA          NA                  0
>     rs966367            NA          NA                  0
>     rs9839873           NA          NA                  0
>     > assay(mae)[1:5, 1:5]
>                TCGA-OR-A5J1-01A-11D-A29J-05 TCGA-OR-A5J2-01A-11D-A29J-05
>     cg00000029 "0.119877013723081"          "0.107120474727399"
>     cg00000108 NA                           NA
>     cg00000109 NA                           NA
>     cg00000165 "0.903199918565705"          "0.818523203916931"
>     cg00000236 "0.879703842366322"          "0.256478816347749"
>                TCGA-OR-A5J3-01A-11D-A29J-05 TCGA-OR-A5J4-01A-11D-A29J-05
>     cg00000029 "0.0607523034615359"         "0.157004810973011"
>     cg00000108 NA                           NA
>     cg00000109 NA                           NA
>     cg00000165 "0.093013644233452"          "0.856304366091803"
>     cg00000236 "0.25363881716578"           "0.940340992918748"
>                TCGA-OR-A5J5-01A-11D-A29J-05
>     cg00000029 "0.534426091413176"
>     cg00000108 NA
>     cg00000109 NA
>     cg00000165 "0.928811628299566"
>     cg00000236 "0.931213787726484"
>
> (the character methylation data is a bug needing to be fixed)
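
Until the character-matrix bug is fixed upstream, one possible local workaround is to change the storage mode in place, which keeps the dimensions and dimnames. The sketch below uses a tiny hand-typed matrix (three probes and two values each copied from the output above, with made-up sample names `s1` and `s2`), and it assumes the reading suggested earlier in the channel that "has at least a methylation" means "has a non-missing beta value"; neither is a statement about how curatedTCGAData or GMQL actually behave.

```r
# Tiny character matrix mimicking the mis-typed methylation assay.
# Sample names are invented; values are copied from the printed output.
beta_chr <- matrix(
  c("0.119877013723081", NA, "0.903199918565705",
    "0.107120474727399", NA, "0.818523203916931"),
  nrow = 3,
  dimnames = list(c("cg00000029", "cg00000108", "cg00000165"),
                  c("s1", "s2"))
)

# Convert to double without dropping dim/dimnames
# (as.numeric() alone would flatten the matrix to a plain vector).
beta_num <- beta_chr
storage.mode(beta_num) <- "double"

# Assumed interpretation: a probe "has a methylation" in a sample
# when its beta value is not missing.
has_meth <- !is.na(beta_num)

# Probes measured in at least one sample.
probe_measured <- rowSums(has_meth) > 0
probe_measured
```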

Levi Waldron (00:58:03): > It would be simpler if the `Gene_Symbol` column didn’t sometimes contain multiple semicolon-separated symbols. I’m not sure what the single genomic coordinate means, but it seems they used the gene symbols.

Sean Davis (01:05:31): > Probably better to use the CG position information than the assigned gene symbol, but….

Sean Davis (01:07:32): > Adding the methylation data is just another join, in any case, so can be added if it makes sense to do so.

Ludwig Geistlinger (03:03:57) (in thread): > It refers to the specific genomic position this somatic mutation occurs on the chromosome given in the 2nd col, I’d say (like a SNP).

Sean Davis (05:48:24) (in thread): > That location is a specific genomic position, but does not represent a mutation. It represents a CG dinucleotide position at which methylation status is measured. But, yes, the annotation process is not a gene annotation process, but a positional annotation process. The assignment of gene names is convenient, but it may not be correct for all applications.

Ludwig Geistlinger (05:55:08) (in thread): > I thought we are looking here at `mae[[1]]` which is the `Mutation` assay?

Sean Davis (06:04:30) (in thread): > No, mae[[1]] definitely contains methylation data. Perhaps that also explains @Levi Waldron’s observation that the parsed data in assayData are incorrect (character).

Ludwig Geistlinger (06:05:22) (in thread): > Right, just downloaded it myself.
>
>     > mae
>     A MultiAssayExperiment object of 3 listed
>      experiments with user-defined names and respective classes.
>      Containing an ExperimentList class object of length 3:
>      [1] ACC_Methylation-20160128: SummarizedExperiment with 485577 rows and 80 columns
>      [2] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns
>      [3] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns
>     Features:
>      experiments() - obtain the ExperimentList instance
>      colData() - the primary/phenotype DataFrame
>      sampleMap() - the sample availability DataFrame
>      `$`, `[`, `[[` - extract colData columns, subset, or experiment
>      *Format() - convert into a long or wide DataFrame
>      assays() - convert ExperimentList to a SimpleList of matrices

Vince Carey (09:24:56): > Note that example(buildPancanSE) in BiocOncoTK will construct a (delayed) RangedSummarizedExperiment for 450k data for a specific tumor (I think it is BLCA). It will require the Fdb.* to construct the range+ metadata. The object could well be defined for all tumors by passing the full vector of tumor codes as ‘acronym’. In this hybrid approach there is no need to create new tables of annotation in BigQuery because we have it in Bioc. But if we wanted to use BigQuery to do gene/range computations for us we would have to add and join in BigQuery.

Michael Love (09:29:46) (in thread): > there is so much variance in what gets past the editor’s desk. i’m always surprised what they think is worth sending out and what isn’t, seems like the really new stuff gets more desk rejection

Vince Carey (10:15:27): > Is the GMQL example biologically meaningful (modulo some ambiguity about “have a methylation”)? If it is, it seems worthwhile to have operations on a DelayedMAE with either local or BQ storage that carry it out concisely, or even tidily.

Sean Davis (10:18:41): > The annotation data are in BigQuery for this use case. But if we think of BigQuery not as a database but as a cloud data integration engine, it begs the question of whether we could or should have methods for instantiating bioc data and annotation into bigquery for these types of analyses. I wouldn’t think of it as a high priority, but BigQuery is a big hammer looking for nails.

Vince Carey (10:24:38): > Oh yes – I see the platform_reference dataset now. Your screenshot was useful. I would agree that we should not duplicate any annotation available in BigQuery. I was under the incorrect impression that cgnnnnnn were used without any anchors to location.

Vince Carey (10:28:55): > I don’t think we have a DelayedRanges concept, do we? Do you think it would be fruitful to map the IRanges algebra to (BQ) SQL operations? Perhaps there are answers in plyranges.

Tim Triche (10:53:20): > Somewhat unrelated, but is the restfulSE use case, where assays[[1]] ended up hosed, resolved? We have a bunch of new TARGET data to disseminate and would just as soon make it interoperable

Levi Waldron (11:14:41) (in thread): > I don’t think the specific example is biologically meaningful, although those types of combined operations seem potentially useful.

Levi Waldron (11:16:33) (in thread): > You mean assays(SE)[[1]] or assay(SE)?

Tim Triche (11:16:51) (in thread): > assays(aRestfulSE)[[1]]

Levi Waldron (11:17:23) (in thread): > Not that I know of…

Tim Triche (11:18:07) (in thread): > ugh, I knew a GRCh38/hg38 Fdb would eventually inhabit my future

Tim Triche (11:20:16) (in thread): > I have some HDF5 SEs kicking around ready to load, and reimplemented some common exploratory analyses (e.g. DMRcate) to be less-hideously-slow for DelayedArray-backed SEs, but hit a snag where assays(aRestfulSE)[[1]] ended up as 0/1 rather than continuous values

Tim Triche (11:20:57) (in thread): > there are probably issues with realize() that I am overlooking, but that can come later, I figure

Vince Carey (11:36:12): > @Tim Triche i am sorry to have dropped the ball on this. i will revisit that transfer today. we do not know why one component failed to behave.

Tim Triche (11:36:19): > no worries

Tim Triche (11:36:36): > thanks for looking into it!

Levi Waldron (15:16:24): > Got started on a gist for the Masseroli et al use case (but using ACC) at https://gist.github.com/lwaldron/9d77ce85030b9d1ec24474c1b2c0a99b. Currently it just loads the data and adds ranges to the methylation dataset.

Tim Triche (15:53:05): > anyone played with variantKey and if so, any opinions on its utility for such things

Michael Lawrence (16:56:49): > In theory one could implement the ranges API on top of a BQ-backed or otherwise deferred derivative of IntegerRanges. There are already many different representations of integer ranges, such as Views and Partitioning.

Michael Lawrence (16:57:25): > I’ll be exploring a Hail-backed IntegerRanges soon.

Sean Davis (18:06:46): > In thinking about BigQuery, there are no indexes, so the implementation optimizations might be a little less informed by clever ranges approaches. Anything that minimizes the size of the column(s) scanned will be a win, but columns are going to be scanned.

2018-12-20

Shian Su (22:00:01): > @Shian Su has joined the channel

2018-12-28

Rene Welch (12:47:02): > @Rene Welch has joined the channel

2019-01-03

Kasper D. Hansen (12:58:49): > @Kasper D. Hansen has joined the channel

Aedin Culhane (13:03:57): > @Aedin Culhane has joined the channel

Sean Davis (13:08:15): > @Michael Lawrence, do you have any good references on relational algebra? It seems like the tidiness ideas are “old” in the sense that there is a well-established literature on relational algebra. SQL is one common implementation of those ideas, but dplyr and other tidy data approaches lean heavily on the same ideas.

Tim Triche (13:09:53): > relational algebra or relational calculus? The latter is typical for describing normal forms etc.

Tim Triche (13:11:15): > nevermind, I see that this is an algebraic representation (projection -> selection -> mutation)

Sean Davis (13:13:17): > I guess you are right, Tim. You need both, technically.

Sean Davis (13:13:47): > For SQL speakers, this slide deck might be of interest. http://www.cs.cornell.edu/projects/btr/bioinformaticsschool/slides/gehrke.pdf

Aedin Culhane (13:14:42): > What’s the best intro doc/tutorial for tidyverse for someone who is completely new to it?

Tim Triche (13:17:31): > @Sean Davis I just realized something – the tidyverse makes the composability of relational algebra explicit – something that is not always apparent from e.g. SQL queries. This is part of its appeal to people, with or without formal training; there is something appealing about the simplicity of composition, even if (in the large) it isn’t a good idea to represent mappable functional atoms as pieces of NSE pipelines. (see also Nextflow, CWL, etc.)

Tim Triche (13:18:46): > (there is a reason that abstraction becomes more appealing for layering explicit representations in finite space)

Tim Triche (13:19:38): > (e.g. a lot of pipelines are really better off as DAGs)

Tim Triche (13:20:25): > @Aedin Culhane probably https://r4ds.had.co.nz/ since it is canonical - Attachment (r4ds.had.co.nz): R for Data Science > This book will teach you how to do data science with R: You’ll learn how to get your data into R, get it into the most useful structure, transform it, visualise it and model it. In this book, you will find a practicum of skills for data science. Just as a chemist learns how to clean test tubes and stock a lab, you’ll learn how to clean data and draw plots—and many other things besides. These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R. You’ll learn how to use the grammar of graphics, literate programming, and reproducible research to save time. You’ll also learn how to manage cognitive resources to facilitate discoveries when wrangling, visualising, and exploring data.

Tim Triche (13:20:59): > the blurb says a bunch of stuff about data science, but really it’s “data science in the tidyverse” by its authors

Tim Triche (13:21:39): > “this book will teach you how to do data science in R” [IN THE WAY THAT HADLEY PREFERS]

Aedin Culhane (14:48:56): > thanks

Martin Morgan (15:30:17): > From the classics, I really liked http://shop.oreilly.com/product/9780596100124.do - Attachment (shop.oreilly.com): Database in Depth > This concise guide sheds light on the principles behind the relational model, which underlies all database products in wide use today. It goes beyond the hype to give you a clear view of the techn…

Shian Su (18:20:28): > Personally I find tidyverse composition with magrittr piping syntax appealing because it makes it extremely easy to observe intermediate states and reason about the logic

Shian Su (18:31:05): > I skimmed through Database Design for Mere Mortals a while back, and my impression of it is that dplyr is almost a reinterpretation akin to ggplot2’s interpretation of graphical grammar.

Shian Su (18:38:54): > It’s borrowed database table concepts to apply to R’s data.frames, but data.frames are not databases nor are they necessarily valid database tables. A data.frame is just one table, whereas a database can be a collection of tables. You usually wouldn’t be mutating tables in a database but it’s routine to do it in dplyr. Database tables require a unique primary key, data.frames do not.

Shian Su (18:52:18): > I see databases as primarily motivated by data integrity, storage and query efficiency. Whereas dplyr is providing explicit verbs for manipulating data.frames, that took operations from relational algebra because it’s battle-tested to provide sufficient flexibility for most use-cases. This for me replaced what used to be less explicit operations performed through the `[` operator.

2019-01-04

Levi Waldron (08:34:32): > Wouldn’t you say the `data.frame` does have a “unique primary key”, ie the `rownames`? Not that you have to use it, but it’s always there and always unique.

Kasper D. Hansen (09:16:58): > Levi is right, a data.frame needs to have a unique rowname which can serve as a primary key
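
The uniqueness point can be checked directly in base R. In this small made-up example, a `data.frame` gets automatic row names when none are supplied, and an attempt to install duplicated row names is rejected, while ordinary columns may hold duplicates freely. The gene symbols are borrowed from the output earlier in the channel; the values are invented.

```r
# Made-up table in which the `symbol` column is deliberately not unique.
df <- data.frame(symbol = c("RBL2", "VDAC3", "RBL2"),
                 value  = c(1.5, 2.0, 3.5))

# Row names always exist; here they are the automatic "1", "2", "3".
auto_names <- rownames(df)

# Trying to use the non-unique column as row names fails
# ("duplicate 'row.names' are not allowed"), leaving `df` unchanged.
dup_attempt <- tryCatch(
  suppressWarnings({ rownames(df) <- df$symbol; "accepted" }),
  error = function(e) "rejected"
)

# An ordinary column carries no such constraint.
column_has_duplicates <- anyDuplicated(df$symbol) > 0

dup_attempt
```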

Kasper D. Hansen (09:17:10): > data frames are absolutely valid database tables

Kasper D. Hansen (09:18:04): > data.frames in the tidyverse are however not the “typical” relational database which is often spread across multiple tables to ensure that data is not unnecessarily repeated.

Kasper D. Hansen (09:19:09): > But tibbles are different here

Martin Morgan (09:33:54): > I think tibbles don’t generally have row names, probably to provide a simpler uniform view of ‘data’ as a column. Also the use of unique keys is primarily beneficial for joining tables, and the tidy representation is very much single-table de-normalized. Not really speaking with any authority here…

Vince Carey (10:38:21): > `rownames` do not exist in the tidyverse

Kasper D. Hansen (10:40:04): > tidy representation is clearly not normalized and the claim in Hadley’s paper that the tidy data representation is 3NF is clearly wrong

Kasper D. Hansen (10:40:27): > I agree with Vince re. rownames

Kasper D. Hansen (10:40:38): > but there is the idea and then the implementation

Kasper D. Hansen (10:41:41): > For example
>
>     > class(as_tibble(mtcars))
>     [1] "tbl_df"     "tbl"        "data.frame"

Kasper D. Hansen (10:41:56): > says that a `tibble` is a `data.frame` which is also clearly wrong

Kasper D. Hansen (10:43:02): > No doubt that the idea of the tidyverse is to have a data representation which is similar to a single (partially normalized) table in database lingo. This implies for example that it should not have rownames and should not be thought of as having an intrinsic ordering

Kasper D. Hansen (10:43:28): > What complicates this is that data.frames are said to be exemplars of this, and they violate these things.

Kasper D. Hansen (10:43:54): > Of course this shows that the abstract tidy representation is not fully specified

Vince Carey (10:44:31): > Skimming the slideset pinned by Sean on Relational Algebra reminded me of some recent reading on the history of logic, specifically footnote 13 on page 291 of https://epdf.tips/the-evolution-of-logic.html … we will need a new channel to pursue this I think, but I really wonder if we are at a turning point – do we need query optimization in genomic applications, and can we get it outside of formal database architecture/deployments? BigQuery for TCGA has been around for a while and does not seem to have caught fire. Can we say why? Does it boil down to the risk of a very costly unintended computation? - Attachment (epdf.tips): The evolution of logic - PDF Free Download > The Evolution of Logic The Evolution of Logic examines the relations between logic and philosophy over the last 150 yea…

Kasper D. Hansen (10:44:39): > Btw. a database table does not need to have a primary key

Kasper D. Hansen (10:45:36): > I have the idea that tidyverse stuff is closer to NoSQL, but I need to investigate this a bit more

Kasper D. Hansen (10:46:50): > @Vince Carey means PDF page 291, not actual book page

Vince Carey (10:47:25): > thanks kasper it is book page 279

Kasper D. Hansen (10:48:36): > I don’t know much about query optimization but I know that many smart people have spent many years working on that. I would be cautious in abandoning this

Aedin Culhane (10:49:16): > There is a chapter on relational data in Hadley’s “bible”… https://r4ds.had.co.nz/relational-data.html - Attachment (r4ds.had.co.nz): 13 Relational data | R for Data Science

Vince Carey (10:49:45): > it seems to me that a key attraction of the tidyverse is that the same “things that work” with data.frame can work with external data architectures including RDBMS. when you go down that road, the idea of an intrinsic or physical ordering of records is gone, and that means you abandon that for data.frame too.

Kasper D. Hansen (10:50:28): > agreed

Aedin Culhane (10:50:31): > Whilst tibble (or tribble, transposed tibbles) omit row.names or any primary key, inner/outer joins are achieved by matching on columns (of course in genomics this is problematic, where EntrezID and rank order of ID could be joined accidentally)
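
One way an accidental join of this kind can arise is when two tables share more column names than intended. The sketch below uses base R's `merge()`, whose default `by` is every column name the two tables have in common; the two tables and their `rank` columns are invented purely for illustration and do not come from any TCGA resource.

```r
# Two made-up per-gene tables. Both carry an integer-like `rank` column
# that means something different in each table.
expr_tbl <- data.frame(entrez_id = c(1, 2, 3),
                       rank      = c(1, 2, 3),
                       expr      = c(5.1, 2.3, 8.8))
meth_tbl <- data.frame(entrez_id = c(1, 2, 3),
                       rank      = c(1, 3, 2),
                       beta      = c(0.1, 0.9, 0.5))

# Default merge: joins on ALL shared names (entrez_id AND rank),
# so genes whose ranks happen to differ are silently dropped.
natural <- merge(expr_tbl, meth_tbl)

# Explicit key: join on the identifier only and keep both rank columns,
# disambiguated by suffix.
explicit <- merge(expr_tbl, meth_tbl, by = "entrez_id",
                  suffixes = c(".expr", ".meth"))

c(natural = nrow(natural), explicit = nrow(explicit))
```

Naming the key explicitly is the cheap safeguard here; it also documents which column is being treated as the identifier.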

Kasper D. Hansen (10:50:59): > Also, the piping, in that operations are endomorphisms in the sense of tibble in, tibble out

Aedin Culhane (10:51:18): > Vince, however for genomics, where we have an order (genomic ranges Chr position etc).. Indexing is much faster

Kasper D. Hansen (10:51:35): > only for some things

Aedin Culhane (10:51:45): > I love the piping aspect of things, as it brings me back to unix scripting days of yore :wink:

Kasper D. Hansen (10:51:50): > I think it is important to remember that not all objects are linked to a genomic location

Kasper D. Hansen (10:51:58): > Say metabolites.

Aedin Culhane (10:51:59): > True Kasper

Vince Carey (10:52:04): > Great point Aedin. I thought the absence of “index” from the slide set that Sean posted was noteworthy. Indexing is a costly aspect of going there but it is often a one time charge.

Kasper D. Hansen (10:52:24): > Also I will claim that some objects can be linked to genomic location in multiple ways and it is application dependent how you want to do this

Aedin Culhane (10:52:42): > Searching through large data is also faster if there is an index (that is why blast etc works) So maybe a space for both?

Aedin Culhane (10:53:52): > However we have other indices for metabolites (there is relational structure info, ontologies etc). genomics data has a lot of structure if we think about it. However that’s probably a tangent

Kasper D. Hansen (10:54:26): > I agree on the structure. My point is that there are multiple (equally valid) relationships

Aedin Culhane (10:56:10): > However these indices could be tied together. In genomics we have a lot of static indices. Then any new data point could reference it.

Aedin Culhane (10:57:12): > In Hadley’s “bible” he talks about primary, foreign and surrogate keys

Aedin Culhane (10:58:39): > @Vince Carey @Sean Davis can you send the link to the slides again?

Vince Carey (11:00:07): > Claim: genomic data are too heterogeneous to be efficiently managed in RDBMS. So leveraging RDBMS for genomic work generally is likely to require exceptions and interoperation efforts for things that don’t fit. NoSQL seems to fit the bill better, and I think that MongoDB is behind much of FireCloud.

Vince Carey (11:00:41): > @Aedin Culhane http://www.cs.cornell.edu/projects/btr/bioinformaticsschool/slides/gehrke.pdf

Aedin Culhane (11:11:51): > I agree Vince, whilst most genomic annotation is heterogeneous and rapidly changing with 1:many matches, there are some constants: genome locus, chemical composition

Aedin Culhane (11:15:11): > Metabolites, glycoproteins, lipids etc. have specific MS/MS profiles (this is how most are identified). I don’t know enough about MS/MS to really go down that road, but weight/charge seem to be something static

Aedin Culhane (11:18:13): > http://www.ericrscott.com/2018/06/28/webchem/ - Attachment (Eric R. Scott): Retrieve chemical retention indices from NIST with {webchem}! | Eric R. Scott > PhD Candidate in Biology

Aedin Culhane (11:18:47): > Anyway… that’s a tangent for another time

Vince Carey (11:21:57): > We have a package called hmdbQuery that addresses an aspect of metabolite data organization. There is an XML schema for information that HMDB considers relevant. The package has rank 1477/1649 so I conclude the topic is not that important for project users. Our proteomics subcommunity should surely be brought into this discussion – @Laurent Gatto is one of the germinators of the tidiness channel.

Vince Carey (11:27:44): > I feel that the bible blurb component “These are the skills that allow data science to happen, and here you will find the best practices for doing each of these things with R.” is overblown, but we are experiencing/accepting pressure to adopt these practices in the absence of data that show their actual value relative to practices that have worked well for Bioc thus far. Channel material of Dec 18 considers how SQL/MAE and (perhaps) tidyverse work on a specific use case in a contentious paper, and I would like to bring that to closure before too long. @Levi Waldron do we have a repository where the code for that use case should be contributed?

Sean Davis (11:46:45): > Just a note on the NoSQL front. Most relational database systems (including data.frames) now have strong capabilities for storing, querying, and mutating semi-structured data (JSON, JSONB, XML, list columns, etc.), so I wouldn’t hold that out as a reason to use self-described NoSQL systems. Where NoSQL systems do excel relative to relational databases is in horizontal scalability and, potentially, durability; this comes at the cost of lack of joins (largely) and, in some cases, ACID compliance. Cloud-native systems like BigQuery offer relational mechanics and scalability, but they do lack the formal relational guarantees of traditional relational database systems.

Michael Lawrence (12:14:08): > The tidy tools are all about reducing complex data (like a relational schema) into a simple table for data analysis. There’s no support for representing complex data, because the user should be applying joins, aggregations and other transformations to get out of that world as quickly as possible. Unfortunately, that typically discards most semantics of the data, which makes it difficult to write code in the language of the domain, and thus code clarity suffers.

Dror Berel (12:45:37): > The following poster demonstrates a proof of concept of using the tidy approach to manage S4 classes (SummarizedExperiment, MAE, or any other) at the meta level, and enjoy both approaches: complex data with a designated S4 class, managed at the ‘study’ level with a ‘tidy’ approach. https://www.bioconductor.org/help/course-materials/2017/BioC2017/DDay/LightningTalk/SessionII/ImmuneSpaceR.pdf

Workast (12:58:04): > @Levi Waldron created a space in Workast for this channel.

Unknown User (12:58:57): > New task by @Levi Waldron

Aedin Culhane (13:06:40): > What’s Workast?

Aedin Culhane (13:19:27) (in thread): > Hi Levi I looked at the gene..
> # “In TCGA data of BRCA patients, find the DNA somatic mutations
> # within the first 2000 bp outside of the genes that are both
> # expressed with FPKM > 3 and have at least a methylation in the same patient
> # biospecimen, and extract these mutations of the top 5% patients
> # with the highest number of such mutations.”

Levi Waldron (13:22:19): > It seems to be a project management system like GitHub’s “Projects”, was trying it out as a way of collecting GMQL-related conversation from this channel, but I doubt it’s going to be useful here.

Levi Waldron (13:22:27) (in thread): > Working on it.

Aedin Culhane (13:23:43): > What’s the goal? Is it to replicate the GMQL example queries using queries of MAE/SE objects?

Levi Waldron (13:36:24): > More or less, and to write a response to their slander of R/Bioconductor. https://github.com/waldronlab/GMQLvsBioc - Attachment (GitHub): waldronlab/GMQLvsBioc > Alternative approaches to https://doi.org/10.1093/bioinformatics/bty688 analysis - waldronlab/GMQLvsBioc

Aedin Culhane (13:37:45): > For anyone without access to Bioinformatics, the PDF is on https://www.researchgate.net/publication/327020526_Processing_of_big_heterogeneous_genomic_datasets_for_tertiary_analysis_of_Next_Generation_Sequencing_data - Attachment (ResearchGate): (PDF) Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data > PDF | Motivation: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most…

Aedin Culhane (13:52:16) (in thread): > So they want “DNA somatic mutations occurring within the first 2,000 bp upstream or downstream of any of the expressed methylated genes”

Tim Triche (15:07:21) (in thread): > after a fashion, row.names is primary for indexing a data.frame; however now that Hadley encourages everyone to use tibbles/tribbles, that’s out the window as well

Tim Triche (15:07:50) (in thread): > I’m going to guess you’re not a huge fan of data.table indexing, then :wink:

Tim Triche (15:11:37) (in thread): > as a former db0 maintainer, +1000

Tim Triche (15:14:50) (in thread): > this is great, thanks for posting it!

Vince Carey (16:04:00) (in thread): > Sifting through these high-level concepts seems essential to do on a regular basis as technology evolves. It seems a bit harder to make decisions because we always have a hybrid: R plus whatever else we want to entertain. R + SQLite seems to have done well for many aspects of annotation. R + HDF5 seems favorable for large-scale numerical data but more benchmarking may be in order, and some limitations of HDF5 per se have been remarked elsewhere. SRAdbV2 is an interesting amalgam of newer technologies that seems very promising for large-scale sample metadata. Can we implement the colData for a SummarizedExperiment as a delayed interface to SRAdbV2?

Shian Su (16:14:56) (in thread): > I’m interested by the efficiency but terrified by people writing chained index expressions which resemble hieroglyphics after a few months.

Tim Triche (18:18:07) (in thread): > sounds exactly like my experience. I am looking forward to transitioning things to vroom

Tim Triche (18:18:55): > that’s a really good idea, especially if it can be coupled to RESTfulSE

Martin Morgan (18:36:15): > I think colData() should be made like assays(), where anything that provides a particular API is good enough.

Shian Su (18:38:40): > Just as a personal opinion, I attribute the success of tidyverse to endomorphism on a simple data type rather than its API model. I think the biggest attraction for new users at least is that you are basically always thinking in terms of a data.frame, and you can always take a look at it to find out what you’ve done.

Tim Triche (18:39:05): > then tidyRestfulSE should be a big hit :wink:

Tim Triche (18:39:12): > or tidyRestfulMAE

Tim Triche (18:39:13): > whatever

Shian Su (18:46:38): > I contrast this with my experience with learning SingleCellExperiment, not singling it out, it’s just the first BioC object that I had to work extensively with. It took me a long time to establish a mental model of the data structure, particularly since str() essentially just blows up my console. It was only after months of using it that I randomly came across a diagram of the structure, and that was when it actually solidified in my mind. But now when I run functions, it’s always a bit of a guessing game for me whether colData, rowData, assay or metadata have changed.

Martin Morgan (18:48:43): > So one question is, what would make the mental model solidify sooner? I don’t really think presenting an SE as a single flat tibble (whatever its implementation) would be that helpful, for instance, because the structure and relations are completely lost…

Tim Triche (18:48:59): > but it is a schema of sorts

Tim Triche (18:49:09): > and the primary keys for SEs are samples and features no?

Martin Morgan (18:50:01): > Basically two tables & a matrix, with the joins & constraints on the row and column names (aka keys) of the matrix
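
A minimal base R sketch of the "two tables & a matrix" picture above, assuming a toy 3 x 2 assay. The names `counts`, `row_tbl`, `col_tbl`, `check_keys` and `subset_by_key` are hypothetical and are not SummarizedExperiment API; the sketch only illustrates how shared row and column names can act as keys and constraints.

```r
# Hypothetical toy data; none of these names come from the discussion.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
row_tbl <- data.frame(symbol = c("A", "B", "C"),
                      row.names = c("g1", "g2", "g3"),
                      stringsAsFactors = FALSE)
col_tbl <- data.frame(treatment = c("ctl", "dex"),
                      row.names = c("s1", "s2"),
                      stringsAsFactors = FALSE)

# The "constraints": matrix dimnames must equal the table keys, in order.
check_keys <- function(m, rows, cols) {
  stopifnot(identical(rownames(m), rownames(rows)),
            identical(colnames(m), rownames(cols)))
  invisible(TRUE)
}

# Subsetting by key touches all three pieces at once, so they stay in sync.
subset_by_key <- function(m, rows, cols, features, samples) {
  check_keys(m, rows, cols)
  list(assay   = m[features, samples, drop = FALSE],
       rowData = rows[features, , drop = FALSE],
       colData = cols[samples, , drop = FALSE])
}

sub <- subset_by_key(counts, row_tbl, col_tbl,
                     features = c("g3", "g1"), samples = "s2")
```

Because the subset is always taken by key on all three pieces together, reordering features cannot silently detach a row of the matrix from its annotation.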

Shian Su (18:50:04): > In this instance, probably just having that diagram in the vignette. I think it’s in a slide at BioC Toronto currently.

Tim Triche (18:50:27): > I think there was a glitter journal paper on this a while back :wink:

Tim Triche (18:51:05): > in principle, the user never needs to know whether the SE is really just a VIEW of a bunch of columns/rows that are materialized right before calculation

Tim Triche (18:51:28): > in SQL, a VIEW can look identical to a TABLE for a user; only the RDBMS need know the difference

Shian Su (18:51:44): > Still, the existence of metadata makes life hard, because packages can stick whole new data objects in there.

Tim Triche (18:52:00): > metadata(foo) is meant to be more of a NoSQL type of affair :wink: (i.e. a hidey hole, for good and bad)

Tim Triche (18:52:57): > it’s not even really keyed, which can be a big problem

Tim Triche (18:53:22): > if metadata(foo) at least shared keys, then it could be stored as keyed indexed BLOBs of some sort

Tim Triche (18:53:35): > (I consider JSON/JSONB to be Just Another BLOB Type)

Tim Triche (18:54:27): > the thing is, metadata() is an “out”. Maybe it doesn’t need to be supported, or maybe it’s just One Big BLOB. ?

Shian Su (18:55:41): > One observation for me is the existence of a myriad of custom plotting functions that exist in various packages. From things like plotting dim reduced representations to just scatter plots of two columns from one of the tables. It’s in most situations a trivial operation where developers don’t have faith users can navigate to the appropriate data on their own, that to me is a symptom of lack of transparency.

Shian Su (19:01:46): > The other scary thing is that even with the knowledge of the model, an actual str(sce_object, 2) reveals many undescribed slots, each of which contains a non-primitive object.

Tim Triche (19:02:26): > both of these are valid points and maybe worth revisiting

Tim Triche (19:02:48): > especially if the goal is to have SE be more of an API than just a data structure

Tim Triche (19:03:10): > essentially the authors are memoizing a computation against the raw data

Tim Triche (19:03:17): > this could be done more elegantly I think

Tim Triche (19:04:31): > on the other hand, it would involve trading off some flexibility to demand more than the bare minimum from subclasses of SE

Tim Triche (19:05:06): > in terms of API rigidity, that is. if an API is too rigid, developers that need more flexibility will roll their own alternative

Martin Morgan (19:05:42): > I don’t understand why str() is problematic – you’re peeking into the implementation details, when what you’re interested in is the interface; I could understand concerns about limitations from methods(class="SingleCellExperiment") (e.g., it doesn’t show any non-method functions). The separation of implementation from interface is a classic paradigm and is hugely advantageous, e.g., in representing colData() not as a literal data.frame but as a connection to a remote service.

Tim Triche (19:06:50): > I wonder if something conceptually like str(), but which peeks at the large moving pieces instead, might help

Tim Triche (19:07:49): > i.e. a minimalist version of str() that instead lists off the names of assays(), colData(), rowRanges/rowData(), metadata(), more detail than typically comes from show() but less than from the real version of str()
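
A rough sketch of what such a "minimalist str()" might look like. A plain list stands in for the SE-like object, and `peek` is a hypothetical helper, not an existing function; for a real SummarizedExperiment the same names would presumably come from assayNames(), colnames(colData()), colnames(rowData()) and names(metadata()).

```r
# A plain list stands in for an SE-like object here (an assumption for
# illustration); a real version would call the accessors instead.
toy <- list(
  assays   = list(counts = matrix(0, 2, 3), tpm = matrix(0, 2, 3)),
  colData  = data.frame(treatment = c("a", "b", "a")),
  rowData  = data.frame(symbol = c("X", "Y")),
  metadata = list(note = "anything goes")
)

# Hypothetical helper: report only the names of the large moving pieces.
peek <- function(x) {
  parts <- list(
    dim      = dim(x$assays[[1]]),
    assays   = names(x$assays),
    colData  = names(x$colData),
    rowData  = names(x$rowData),
    metadata = names(x$metadata)
  )
  for (p in names(parts)) {
    cat(sprintf("%-9s %s\n", paste0(p, ":"),
                paste(parts[[p]], collapse = ", ")))
  }
  invisible(parts)
}

parts <- peek(toy)
```

The helper prints one line per component and returns the same information invisibly, so it can be inspected before and after running a function to see which component changed.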

Tim Triche (19:08:26): > :100: regarding abstraction of interface from implementation regardless

Shian Su (19:12:44): > Probably because I’m a weird intermediate user who knows just enough to be dangerous. But without str() I feel like I’m reciting magical incantations and hoping for the best. Abstracting an interface from implementation also requires a clear (and hopefully simple) abstract data type for the user to reason with, and in my circumstance not feel like there are 4-5 moving parts shifting under my feet.

Shian Su (19:16:50): > I haven’t given it much thought, but I hope there’s at least a subset of routine tasks that don’t require the full flexibility of the SE interfaces and can be reduced to a more flat representation and be more transparent. plyranges certainly managed it well for GRanges, but it probably had a more compatible starting point.

Shian Su (19:21:36) (in thread): > Thanks for teaching me about methods, I’ve been using showMethods until now and it’s been blowing up my console. But the problem still exists that SCE has 176 methods listed that apply to it, it’s still very hard for me to understand a structure through this list.

Tim Triche (19:27:20): > the fundamental problem (IMHO) is that tidy data representations assume the bits are not linked together by e.g. foreign keys. But for an SE, particularly if it is Ranged, the foreign keys (range names x sample names x assay => data, sample names x colData names => covariates, range names x rowData names => features) are a feature rather than a bug (i.e. the original ExpressionSet had as its huge advantage the near-inability to screw up and create off-by-one-row type errors, or botch mappings, or what have you)

Tim Triche (19:28:06): > so the same structure that makes it more difficult to reason about SEs (and ExpressionSet-like objects) “tidily” is also the source of its greatest strength, which is delegating tedious bookkeeping to the computer.

Tim Triche (19:28:56): > I assume this is why Sean brought up relational algebra in the first place. RDBMS deal with foreign-keyed tables all day every day.

Tim Triche (19:29:33): > (although, they are keyed row-wise, whereas e.g. SE assays are doubly indexed: sample X feature)

Shian Su (19:35:13): > At the very least converting any assays to a ~~flat~~ long-form structure is unacceptable for numerical efficiency.

Tim Triche (19:37:21): > as long as it’s only “flat” instantaneously during operations (e.g. in the sense of a VIEW), I don’t see that as a showstopper. having it stored and loaded as one big flat structure, that’s a nonstarter. But having it “look like” assays() is an array of rectangular “things” (not materialized, just respecting a rectangular API) is no big deal
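
One way to picture the "flat only instantaneously" idea in base R, under the assumption of a tiny in-memory matrix. `long_view` is a hypothetical helper: it returns a function that builds the long-form table only when called, so nothing flat is stored next to the matrix. It is a sketch of the VIEW analogy, not of how any out-of-core backend actually works.

```r
counts <- matrix(c(5, 0, 2, 7), nrow = 2,
                 dimnames = list(c("g1", "g2"), c("s1", "s2")))
col_tbl <- data.frame(sample = c("s1", "s2"),
                      treatment = c("ctl", "dex"),
                      stringsAsFactors = FALSE)

# Hypothetical helper: the returned function materializes the long-form
# table on demand; the matrix remains the stored representation.
long_view <- function(m, cols) {
  function() {
    long <- data.frame(
      feature = rep(rownames(m), times = ncol(m)),
      sample  = rep(colnames(m), each = nrow(m)),
      value   = as.vector(m),  # column-major, matching the two rep() calls
      stringsAsFactors = FALSE
    )
    merge(long, cols, by = "sample")
  }
}

view <- long_view(counts, col_tbl)
flat <- view()  # exists only for as long as this result is kept
dex_total <- sum(flat$value[flat$treatment == "dex"])
```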

Tim Triche (19:37:47): > all out-of-core storage schemes sooner or later diverge between the interface and implementation

Tim Triche (19:38:15): > as long as the atomic operations respect this difference it does not become a problem

Shian Su (19:42:37): > I’m interested in patterns in R to achieve this behaviour.

Tim Triche (19:43:31): > anSE %>% umap("tpm") %>% plot(1:2) ?

Tim Triche (19:44:12): > trying to think which actions would be sensible for this and which would not

Tim Triche (19:44:36): > in the case of something like a loom object, where everything is prespecified, it’s easier because it’s fully spec’d

Tim Triche (19:45:01): > some actions (especially ones without major side effects) make sense, mostly for EDA

Tim Triche (19:45:10): > also subsetting

Tim Triche (19:45:31): > but a lot of other operations, I’m not sure how practical it is to write NSE-respecting atomic operations to be chained

2019-01-05

Vince Carey (08:39:52) (in thread): > You are right, a list of 176 methods is daunting. I have written a little code to try to get at the question of which of these methods are actually defined, in a package called DrS4 at github.com/vjcitn
>
> library(DrS4)
> require("SingleCellExperiment")
> scem = defdMethods("SingleCellExperiment")
> ## there are probably more signatures per method than listed here
> ## see FIXME in source
> scem
> ## DataFrame with 41 rows and 2 columns
> ##               mnames                             sigs
> ##          <character>                           <List>
> ## 1                  [ SingleCellExperiment,ANY,ANY,...
> ## 2                [<- SingleCellExperiment,ANY,ANY,...
> ## 3              cbind             SingleCellExperiment
> ## 4   clearSizeFactors             SingleCellExperiment
> ## 5        clearSpikes             SingleCellExperiment
> ## ...              ...                              ...
> ## 37        spikeNames             SingleCellExperiment
> ## 38               tpm             SingleCellExperiment
> ## 39             tpm<-             SingleCellExperiment
> ## 40           weights             SingleCellExperiment
> ## 41         weights<-             SingleCellExperiment

Vince Carey (08:40:55): >
> > scem[[2]][[1]]
>                      x                      i                      j
> "SingleCellExperiment"                  "ANY"                  "ANY"
>                   drop
>                  "ANY"
>
> so details on the signature are present in sigs

Vince Carey (08:42:34): > This shows that the package really defines 41 as opposed to 176 methods. However there are possibly multiple signatures allowed per method, defdMethods currently only lists one. I do not plan to go much further with this. However, it is not hard to get relevant metadata from S4.

Martin Morgan (10:08:24): > Usually methods() is too conservative rather than too liberal in what can operate on an object, e.g., it has no way of knowing about plain-old-functions that accept a SingleCellExperiment. Looks like DrS4 is being too conservative, missing e.g., assay() and many other methods that are defined on base classes.

Vince Carey (13:39:21): > Yes, good point. We’ll have to see how to get a more reasonable survey.

Levi Waldron (15:44:00): > A while ago @Martin Morgan posted something about Information Hiding (https://en.wikipedia.org/wiki/Information_hiding) which helped clarify the Bioconductor S4 approach to me. Details of data structure and methods implementation are hidden from users to provide relatively simple workflows (“magical incantations”) that I think are what most Bioconductor end-users come for, but which are still in many cases able to be inspected and extended by developers and power users, unlike predecessors of Bioconductor like GenePattern. The information hiding has the added advantage of enabling new technologies like REST or HDF5 to occur without distracting the user. With complex semantic data and without the information hiding, you instead can get tidy but horribly complicated workflows like this one, to pick on a vignette from my own lab that stems from an inadequate S4 data structure and methods. This kind of workflow is not what has hooked in tens of thousands of Bioconductor users… - Attachment: Information hiding > In computer science, information hiding is the principle of segregation of the design decisions in a computer program that are most likely to change, thus protecting other parts of the program from extensive modification if the design decision is changed. The protection involves providing a stable interface which protects the remainder of the program from the implementation (the details that are most likely to change). > Written another way, information hiding is the ability to prevent certain aspects of a class or software component from being accessible to its clients, using either programming language features (like private variables) or an explicit exporting policy.

Levi Waldron (15:45:15): > Just a part of the code from https://bioconductor.org/packages/release/data/experiment/vignettes/HMP16SData/inst/doc/HMP16SData.html#phylum-level-comparison-to-metagenomic-shotgun-sequencing - File (R): tidy analysis of microbiome data

Vince Carey (16:05:07): > FWIW I have updated DrS4 to a) collect all relevant signatures in queried class/package combination, and b) issue a message concerning relevant superclasses.

Vince Carey (16:13:21): > it still doesn’t do what is needed
>
> > showMethods("subsetByOverlaps")
> Function: subsetByOverlaps (package IRanges)
> x="Vector", ranges="Vector"

Shian Su (17:19:25): > @Levi Waldron I agree that most end users are after magical incantations. But as you experienced, not every S4 object comes with the complete spellbook. In your case, by breaking out a part of the complex data structure and operating on it with familiar tools you got what you wanted; if you felt this were a useful enough operation you could contribute it back into the package. Imagine instead the frustration if you knew an object should have been capable of something, but the incantation was not provided for it and you could not crack open the internals.

Shian Su (17:36:17): > I don’t have a concise description of my issue, but with SingleCellExperiment, knowing that it is a SingleCellExperiment from “Experiment 42914” is not nearly enough information to give me an idea of what’s in the object. This is because every single cell package tacks on their own bits of information to the data structure, so two SingleCellExperiments of the same Single Cell Experiment can contain wildly different information. This also affects what methods can be run. I am reluctant to criticise because I don’t have a proposal for an alternative, but if a function is a method for SingleCellExperiment then I feel like it should run on a SingleCellExperiment, not a SingleCellExperimentAfterRunningAnotherMethod.

Vince Carey (17:48:24): > My conclusion is that it is the obligation of the developer of a class to define its usage in documentation. The class system has much relevant information but it is hard to do a better job than methods(class=…) to find relevant method names. Man pages with examples and vignettes are what we need.

Vince Carey (17:50:32): > @Shian Su it would be good to have a worked example of what you are describing. Is the problem that SingleCellExperiment has optional information, or that extensions to SingleCellExperiment are used in ways that are confusing?

Shian Su (17:54:14): > I don’t want to single out any packages, certainly my own scPipe has such behaviours, but in general it’s related to adding to metadata, colData or rowData information that is necessary for certain methods.

Shian Su (17:56:32): > It’s not specifically a Bioconductor issue, Seurat has even deeper dependencies where 2-3 methods have to be run before an object is ready for some method.

Shian Su (18:02:37): > I don’t see a clear way to resolve this, it’s impractical to have a new class for each state mutation and it’s impractical to prevent mutation.

Shian Su (18:04:55): > I imagine it’s more a community effort of consensus and documentation rather than technical solution.

Vince Carey (18:08:47): > I agree. All hands on deck.

2019-01-06

Kevin Rue-Albrecht (05:00:40): > One approach that we’ve probably not invented but that we’ve used in iSEE was to store package-specific information in a nested DataFrame added to “metadata, colData or rowData” under the name of the package. > That, with “community effort of consensus”, can at least partially address the issue and avoid conflicts between packages. > It still leaves it up to the developers to document, stopifnot, and raise informative errors if they implement methods that rely on other methods having been run before
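
A small base R sketch of the namespacing idea, assuming a plain list in place of metadata() and a list rather than a nested DataFrame. `set_pkg_info`, `get_pkg_info` and the package name "myPkg" are hypothetical; this is not the iSEE implementation.

```r
# A plain list stands in for metadata(); "myPkg" is a hypothetical package.
md <- list()

# Store results under the package's own name to avoid clashing with others.
set_pkg_info <- function(md, pkg, key, value) {
  if (is.null(md[[pkg]])) md[[pkg]] <- list()
  md[[pkg]][[key]] <- value
  md
}

# Fail with an informative message when a prerequisite has not been run.
get_pkg_info <- function(md, pkg, key) {
  value <- md[[pkg]][[key]]
  if (is.null(value)) {
    stop(sprintf("'%s' not found under metadata '%s'; run the upstream step first",
                 key, pkg), call. = FALSE)
  }
  value
}

md  <- set_pkg_info(md, "myPkg", "sizeFactors", c(s1 = 1.2, s2 = 0.8))
sf  <- get_pkg_info(md, "myPkg", "sizeFactors")
err <- tryCatch(get_pkg_info(md, "otherPkg", "clusters"),
                error = function(e) conditionMessage(e))
```

The error message names both the missing key and the package namespace, which is the "informative error" part of the suggestion.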

Shian Su (18:16:49): > Upon further reflection, I think one useful principle would be to enforce that all methods for a class should run on objects of the class as-is. If there are necessary upstream methods then those will be run, with documentation stating that results COULD be saved for efficiency and maybe messaging to recommend it. This has issues when multiple choices exist for upstream methods, but I think it’s ok to just pick out a default and report it through messaging/documentation.
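
The principle could be sketched roughly as follows in base R. The object is a plain list, and `normalize`, `col_entropy` and the cached `norm` element are hypothetical stand-ins for an upstream and a downstream method.

```r
# Plain-list stand-in for an object; "norm" is a hypothetical cached result.
obj <- list(counts = matrix(c(10, 20, 30, 40), nrow = 2))

# Upstream step: scale each column to proportions and cache the result.
normalize <- function(x) {
  x$norm <- sweep(x$counts, 2, colSums(x$counts), "/")
  x
}

# Downstream step: works on the object as-is, running the default
# upstream step itself (and saying so) when its result is absent.
col_entropy <- function(x) {
  if (is.null(x$norm)) {
    message("no cached normalization found; running normalize() with defaults. ",
            "Save the result of normalize() to avoid recomputation.")
    x <- normalize(x)
  }
  -colSums(x$norm * log(x$norm))
}

e_raw    <- suppressMessages(col_entropy(obj))  # upstream run implicitly
e_cached <- col_entropy(normalize(obj))         # upstream result reused
```

Both calls give the same answer; the only difference is whether the upstream work is repeated and whether the user is told about the default that was chosen.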

Shian Su (18:23:55): > I think I’ve also determined a more concise statement of my issue with colData, rowData and metadata flexibility. These slots have no specification of their own, so very few assumptions can be made and interoperability suffers. For example gene annotations have no standard governing column names, so entrez-ids may be GeneID, ENTREZ, entrez_id or any other alias. This is mostly due to a lack of standards among annotation organisations, but it would be interesting to see further standardisation of things, in the same way that GRanges standardised a representation of genomic features.

Martin Morgan (19:26:31): > I’m not so sure (my 2 cents) about standardizing names; I think it’s analogous to expecting the user to name their data.frame column ‘Treatment’, rather than to allow them to specify a model where the column that contains the semantic information about treatment is identified – lm(count ~ cell + dex, df)

Shian Su (19:39:28): > I agree there’s a flexibility cost to my proposal, I don’t think the whole table should be totally locked down. But at least for parts that make sense, they should be made consistent. For example BAM files allow arbitrary tags for each record, with some reserved and some left for developers to use. A while after UMIs became ubiquitous in single cell sequencing a new tag was specified in the specs for UMI sequences. This canonical specification I think is immensely helpful for making UMI processing software interoperable. Similarly GRanges provides a canonical representation of genomic regions, with optional metadata for flexibility, but at the very least we don’t have to deal with having to guess whether we should be looking for start, Start, begin, first, etc…

Shian Su (19:43:20): > In the example of Treatment, it would not make sense to force all modelling through such a column, but it may make sense to allow such a column to be reserved for something that can be modelled on. i.e. Treatment, if it exists, must be a column of values such that lm(count ~ Treatment, df) makes sense, and not for example a column of nested DataFrames.

2019-01-07

Levi Waldron (07:51:52): > @Shian Su I share your frustration about the lack of standardized vocabularies used in data reporting and in software. It’s not the fault or role of SummarizedExperiment, but I think there is a role for topic-specific classes derived from base classes like DataFrame and character that contain names and entries from controlled vocabularies, such as genes, experimental factors, and taxonomies. This work can be done in stand-alone packages that enforce domain-specific vocabularies and provide derived classes and simplified methods - in your Entrez Gene example, a relevant class could guarantee that a vector contains only relevant Entrez IDs as of a certain NCBI version, and provide convenient lookup and mapping. The thing is it requires a significant up-front effort to select those controlled vocabularies, define relevant domain-specific classes, and provide enough convenient methods enabled by them to make other developers and users take the effort to adopt them. Of course, some developers here have been thinking about the value of ontologies long before me (thinking of @Vince Carey’s https://bioconductor.org/packages/ontoProc, https://bioconductor.org/packages/pogos/, https://bioconductor.org/packages/tenXplore/, and even going way back to Gene Ontology https://www.sciencedirect.com/science/article/pii/S0047259X04000223). @Sean Davis has also helped convince me of the value of formal ontologies. - Attachment (Bioconductor): ontoProc > Support harvesting of diverse bioinformatic ontologies, making particular use of the ontologyIndex package on CRAN. We provide snapshots of key ontologies for terms about cells, cell lines, chemical compounds, and anatomy, to help analyze genome-scale experiments, particularly cell x compound screens. Another purpose is to strengthen development of compelling use cases for richer interfaces to emerging ontologies.
- Attachment (Bioconductor): pogos > Provide simple utilities for querying bhklab PharmacoDB, modeling API outputs, and integrating to cell and compound ontologies. - Attachment (Bioconductor): tenXplore > Perform ontological exploration of scRNA-seq of 1.3 million mouse neurons from 10x genomics. - Attachment (sciencedirect.com): Ontology concepts and tools for statistical genomics > In computer science, an ontology is any formally structured vocabulary covering a conceptual domain. Gene Ontology (GO) is a structured collection of …
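
A minimal S4 sketch of a vocabulary-constrained class derived from character. The class name `EntrezId` is hypothetical, and the validity check only tests that entries look like Entrez Gene IDs (positive integers stored as strings); guaranteeing membership in a particular NCBI release, as suggested above, would need a reference table that is not shown here.

```r
library(methods)

# Hypothetical class: a character vector whose entries must look like
# Entrez Gene IDs. This checks format only, not membership in any release.
setClass("EntrezId", contains = "character",
         validity = function(object) {
           bad <- !grepl("^[1-9][0-9]*$", object@.Data)
           if (any(bad)) {
             paste("not Entrez-like:",
                   paste(object@.Data[bad], collapse = ", "))
           } else TRUE
         })

# Constructor: coerces to character, then lets validity accept or reject.
EntrezId <- function(x) new("EntrezId", as.character(x))

ok  <- EntrezId(c("7157", "672"))
msg <- tryCatch(EntrezId(c("7157", "TP53")),
                error = function(e) conditionMessage(e))
```

Because the class extends character, existing code that expects a character vector should keep working, while construction fails early when a symbol is supplied where an ID is expected.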

Tim Triche (08:25:04): > @Sean Davis might have some input re: MESH terms, disease/phenotype ontologies, etc. as NIH plowed a lot of money into this a long time ago. Things like clinical data elements, etc. expose some of these (although not always consistently, i.e. not all such instances are keyed as ID:Name pairs such that updating the Name cascades to all instances of the ID, which is how these things are supposed to work, abstracting away misspellings etc.)

Tim Triche (08:27:53) (in thread): > also HUGO & friends change symbols seemingly weekly, so unless these things are keyed against immutable primary identifiers, it breaks in a hurry

Tim Triche (08:30:15) (in thread): > if you look at (for example) makeGRangesFromDataFrame, the only reason users don’t have to care about that is the implementation where the function tries just about every possible synonym for these things before giving up and asking the user what’s what. “Standards” like GTF/GFF are another example of how this sounds good but in practice is handled with lots of duct tape, baling wire, and bubblegum

Tim Triche (08:31:24): > to some extent, when enforcing ontologies, standardization, etc. “this way lies madness”. Look at caBIG and ask yourself if that’s what you want

Martin Morgan (08:35:27): > There’s a tension between a complete ontology and the 1/2 dozen terms that are actually useful in any one application.

Tim Triche (08:50:54): > unfortunately the 1/2 dozen terms seem to change over time and applications (or both)

Sean Davis (09:04:23): > Ontology mapping can be quite valuable, but the process does not lend itself to automation as far as I can tell. I recently did some mapping for recount-brain from hand-curated terms to ontology terms and 90% of the work was done by hand-written term matching code (semi-automated). The other 10% required direct “maps” done by hand (literally typing terms). See, for example: https://github.com/LieberInstitute/recount-brain/blob/19d0a1c7462e64d471ef4891b060d9fd36fe4328/cross_studies_metadata/recount_brain_ontologies.Rmd - Attachment (GitHub): LieberInstitute/recount-brain > Code and analyses for the recount-brain project. Contribute to LieberInstitute/recount-brain development by creating an account on GitHub.

Sean Davis (09:08:02): > That said, when contributing to or creating “curation” projects, it seems fairly natural to choose existing terms when possible. The naming thing seems like a wash, but queries of data can be much more powerful (if they leverage the ontology relationships). Reasoning over ontologies offers real potential, but I have to admit that I have not seen many killer apps in this space yet.

Sean Davis (09:11:51): > You can see from my code (above) what I did to do some of the mappings from terms to ontology terms. In terms of “tidy” datasets, I chose to keep the ontologies in a single lookup table: https://github.com/LieberInstitute/recount-brain/blob/19d0a1c7462e64d471ef4891b060d9fd36fe4328/cross_studies_metadata/recount_brain_ontologies.Rmd#L16-L33 - Attachment (GitHub): LieberInstitute/recount-brain > Code and analyses for the recount-brain project. Contribute to LieberInstitute/recount-brain development by creating an account on GitHub.

Sean Davis (09:17:02): > Finally, for those interested, BioPortal is one of the key resources, with dozens of ontologies mapped to each other and searchable. http://bioportal.bioontology.org/

Laurent Gatto (09:30:10): > On the subject of ontologies, there’s also EBI’s Ontology Lookup Service (https://www.ebi.ac.uk/ols/index) and the associated rols package (https://www.bioconductor.org/packages/release/bioc/html/rols.html). - Attachment (ebi.ac.uk): Home < Ontology Lookup Service < EMBL-EBI > EMBL-EBI Ontology Lookup Service is a repository of bio-medical ontologies and provides a powerful REST API for searching and accessing ontology term information - Attachment (Bioconductor): rols > The rols package is an interface to the Ontology Lookup Service (OLS) to access and query hundred of ontolgies directly from R.

Laurent Gatto (09:43:10): > In MS and proteomics, the efforts to use controlled vocabulary and ontologies to document experiments (such as data submitted to the main data repositories such as PRIDE) have, as far as I can tell, largely failed and are used mainly (only?) within the developer communities, mostly (only?) when writing/reading data.

Ruizhu HUANG (10:49:47): > @Ruizhu HUANG has joined the channel

Vince Carey (15:18:04): > One of the reasons for starting this channel was to collect energy for a document on tidiness in genomics/bioconductor. It seems to me that a document on obstacles to interoperability related to ontology non-adoption would also be useful. An estimate of the fraction of annotation tokens that could have been derived from a standard but are not is of course hard to get objectively, but maybe there is scope for a good-enough estimate? In addition to token selection there is the logistical problem of linking the standard/ontological terms and their identifiers to usable data. Perhaps another channel is needed? Is “tidiness” a semantic concept?

Michael Lawrence (15:32:24): > I’d vote for a separate channel. Btw, we fooled around with constraints on GRanges columns a long time ago. See constraints.R in the GenomicRanges package. We never actually added constraint storage to the representation, but the original idea was that specific columns could be constrained to conform to specific ontologies. The initial motivation was that many of the GFF3 attribute columns are specified to conform to the Sequence Ontology.

Shian Su (17:50:25): > I think of tidiness as a semantic concept, but I think I’ve also dragged things a bit too off-track. Personally I think the transparency of data structures, consistent interface, verb-names for functions, and composability of said verbs is the source of tidyverse’s popularity. The flatness of a “tidy” table just happens to induce the desirable qualities but is not intrinsic to it.

Shian Su (17:57:12): > I think a good example would be HTML, for those who’ve used the rvest package. The fundamental data structure is the tree of HTML elements, and you navigate through it by composing operations that lead you down the desired branches. It’s easy to reason about and use, most things are very composable and I would consider it very “tidy” despite having no tables in sight.

Michael Lawrence (19:18:14): > There is a formal definition of tidy, as described in Hadley’s paper, but you’re right, the popularity is due to many factors, which as a whole produce a positive user experience.

Martin Morgan (19:27:37) (in thread): > Not sure about how transparent the rvest data structure is, e.g., I’d challenge you to tell me the cast of the movie at http://www.imdb.com/title/tt1490017/ without looking at the README at https://github.com/hadley/rvest. For a hint, here’s the first step > > library(rvest) > lego_movie <- read_html("http://www.imdb.com/title/tt1490017/") > - Attachment (IMDb): The Lego Movie (2014) > Directed by Phil Lord, Christopher Miller. With Chris Pratt, Will Ferrell, Elizabeth Banks, Will Arnett. An ordinary LEGO construction worker, thought to be the prophesied as “special”, is recruited to join a quest to stop an evil tyrant from gluing the LEGO universe into eternal stasis. - Attachment (GitHub): hadley/rvest > Simple web scraping for R. Contribute to hadley/rvest development by creating an account on GitHub.

Shian Su (19:31:17) (in thread): > The rvest data structure is not transparent at all, but it is an HTML structure and easy to reason about inside, say, a web browser.

Martin Morgan (19:34:04) (in thread): > So you’re taking up the challenge?

Levi Waldron (19:35:06): > BTW I’ve finished a draft solution of this silly GMQL challenge (did it for ACC to keep things smaller for testing) at https://github.com/waldronlab/GMQLvsBioc. - Attachment (GitHub): waldronlab/GMQLvsBioc > Alternative approaches to https://doi.org/10.1093/bioinformatics/bty688 analysis - waldronlab/GMQLvsBioc

Martin Morgan (19:38:18) (in thread): > (FWIW I was totally enamored of xpath back in the day as a way to navigate XML / HTML documents; one definitely needed additional tools to figure out the document structure…)

Levi Waldron (19:39:46): > It’s maybe an interesting case study right now as it blends some parts that I find elegant (where RaggedExperiment and MultiAssayExperiment can be used as designed), and one part I find messy (where I moved things in and out of dplyr). - File (R): R/loadACC.R

Levi Waldron (19:43:09): > It makes me think there are a few operations (like group_by, summarise_all, transmute_all) that would be nifty to have operate directly and efficiently on SummarizedExperiment (said without having thought through if that’s really possible)…

Shian Su (19:43:13) (in thread): > Well I find Chrome and Safari’s inspector quite adequate.

Shian Su (19:43:24) (in thread): > > library(rvest) > library(magrittr) > library(purrr) > library(stringr) > lego_movie <- read_html("http://www.imdb.com/title/tt1490017/") > > lego_movie %>% > html_nodes("table.cast_list") %>% > html_nodes("tr") %>% > extract(-1) %>% > html_node("td:nth-child(2)") %>% > html_text() %>% > str_trim() >

Shian Su (19:44:20) (in thread): > In theory RStudio’s viewer could be used in interesting ways.

Michael Lawrence (19:48:31): > Thanks for doing that. I’ll take a closer look later. I agree that SummarizedExperiment should support more intuitive aggregation, and the plyexperiment project should address that. In this case, I bet you could get away with something like rowsum(is.na(x), symbols) > 0 on the matrix. Not to be confused with rowSums(), of course.

Shian Su (19:50:55) (in thread): > I think technically HTML/XML shouldn’t be that hard to navigate as structures, but it’s all the information stored that makes the whole operation unmanageable.

Shian Su (19:51:11) (in thread): > Also the way the information is stored.

Shian Su (20:02:25) (in thread): > I don’t know whether this is an xpath thing, but I find the model shared by BeautifulSoup (Python), D3 (Javascript) and rvest very interesting, in the way that operations are mapped over the current selection: when I ran html_node(), it selected one node per element of the current selection. I’m surprised at how often this is what I actually want to do versus the alternative.

Martin Morgan (20:05:21) (in thread): > Bravo!

Stuart Lee (22:36:34): > One sticking point for me in the current API is how to properly put the assay matrices in context with NSE. How should a user manipulate the assay slots? I’m not sure if we have properly addressed this with things like mutate_cols(). Perhaps there’s a middle ground between our proposed approach and Laurent’s. Current uses of SummarizedExperiment suggest that we expect a user to think of the assay as a whole entity, in slices, and colwise and rowwise… In some cases there are obvious constraints, which could mean we could get the user to skip having to think about the structure of the assay, i.e. adding gene averages as a column to rowData implies rowMeans. If one facet of NSE (at least in the way the tidyverse uses it) is about putting the data (SummarizedExperiment) in the context of the computation, this implies (to me) that names representing parts (rows, cols, the names of the assay slots) are needed, so something à la tidygraph > > {r} > # a SummarizedExperiment with a named assay called counts > # this api implies default verbs manipulate entire assay > # but not sure if this makes sense > # create a new assay called logcounts > se %>% > mutate(logcounts = log2(counts + 1)) > > # feature aggregations, query over rows/cols of an SE via activate > se %>% > activate(rows) %>% > mutate(gene_avg = mean(counts)) # this is rowMeans > > This feels like it could make things a little clunky; what would something like a grouped filter by expression look like? > > The other option (as we currently have) is to use scoped variants, which as Martin mentioned blows out the number of functions in the API. > > But perhaps there needs to be more thought on what the API over arrays/matrices should look like. Would it make sense to look at array query languages like those in tiledb? I know there’s been a bit of experimentation with array structures in dplyr with the tbl_cube approaches too.

Stuart Lee (22:37:21): > See https://dplyr.tidyverse.org/reference/tbl_cube.html - Attachment (dplyr.tidyverse.org): A data cube tbl — tbl_cube > A cube tbl stores data in a compact array format where dimension names are not needlessly repeated. They are particularly appropriate for experimental data where all combinations of factors are tried (e.g. complete factorial designs), or for storing the result of aggregations. Compared to data frames, they will occupy much less memory when variables are crossed, not nested.

Stuart Lee (22:38:03): > And also https://github.com/TileDB-Inc/TileDB-Presto - Attachment (GitHub): TileDB-Inc/TileDB-Presto > TileDB Connector for PrestoDB. Contribute to TileDB-Inc/TileDB-Presto development by creating an account on GitHub.

2019-01-08

Peter Hickey (07:10:08): > https://github.com/DavisVaughan/rray may be of interest - Attachment (GitHub): DavisVaughan/rray > Simple Arrays and Matrices. Contribute to DavisVaughan/rray development by creating an account on GitHub.

Tim Triche (11:21:48) (in thread): > at least for RNAseq, one obvious grouping/filtering example could be grouping transcripts by gene, and filtering by functional annotation (coding, NMD, antisense, lncRNA, etc.)

Tim Triche (11:23:26) (in thread): > when we wrote arkas and TxDbLite, it was in an attempt to forcibly encode all of the appropriate metadata for a Kallisto run (from the FASTA being indexed, all the way through to the counts and runinfo) so that a user could “just add BioC” and do exactly these sorts of operations, without needing to know how to synchronize all the annotations.

Tim Triche (11:28:08): > particularly https://davisvaughan.github.io/rray/articles/broadcasting.html, which makes the tidy “rules” explicit for arrays

Michael Lawrence (14:07:53): > I still prefer mutate_rows(gene_avg = mean(counts)). The activate() mechanism makes everything context dependent, and thus harder to read. I like the idea (new to me at least) that row-wise or column-wise operations implicitly group the assays by the respective dimension. I guess that grouping would be layered on top of any existing grouping.
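To make the row-wise reading concrete, here is a toy sketch in base R. Everything in it is an assumption made for illustration: the list-based se is a stand-in and not the SummarizedExperiment class, and mutate_rows() is a hypothetical helper written here, not the plyexperiment API. It only shows how an expression such as mean(counts) could be evaluated once per row, with assay names bound to that row's values, so that the result agrees with rowMeans().

```r
# Stand-in for a SummarizedExperiment: a plain list, NOT the Bioconductor class.
se <- list(
  assays = list(
    counts = matrix(c(0, 3, 7,
                      1, 1, 15),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("gene1", "gene2"), c("s1", "s2", "s3")))
  ),
  rowData = data.frame(row.names = c("gene1", "gene2"))
)

# Hypothetical mutate_rows(): evaluates each named expression once per row,
# with every assay name bound to that row's values, and stores the result
# as a new column of rowData.
mutate_rows <- function(se, ...) {
  exprs  <- eval(substitute(alist(...)))  # capture the expressions unevaluated
  caller <- parent.frame()                # where to look up functions like mean()
  n_rows <- nrow(se$assays[[1]])
  for (nm in names(exprs)) {
    se$rowData[[nm]] <- vapply(seq_len(n_rows), function(i) {
      row_slice <- lapply(se$assays, function(a) a[i, ])
      eval(exprs[[nm]], row_slice, caller)
    }, numeric(1))
  }
  se
}

# Whole-assay operation: no row/column context is needed.
se$assays$logcounts <- log2(se$assays$counts + 1)

# Row-wise operation: mean(counts) is implicitly grouped by row.
se <- mutate_rows(se, gene_avg = mean(counts))
```

In this toy, the row context comes from the verb name rather than from a preceding activate() call, which is the readability point made above. A real implementation would presumably vectorise instead of looping over rows.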

Michael Lawrence (14:36:22): > @Levi Waldron Shouldn’t this line https://github.com/waldronlab/GMQLvsBioc/blob/b29d165d94080192509bbbdd0a0b3eeac21b4610/R/loadACC.R#L19 be any(!is.na(x))? - Attachment (GitHub): waldronlab/GMQLvsBioc > Alternative approaches to https://doi.org/10.1093/bioinformatics/bty688 analysis - waldronlab/GMQLvsBioc

Levi Waldron (14:37:29): > Oops yes, fixing that now. Thanks @Michael Lawrence

Levi Waldron (15:09:53): > Fixing that bug changes the number of genes found per patient, but not the rank of the top four tumors - I guess those tumors are just the extreme hypermutators. The ACC example runs on a laptop without using much of the 16GB memory, in about 6 minutes (half of which is just loading and constructing the initial MAE because of the big methylation dataset). BRCA will probably need 32 or 64GB without further optimization (even just cleaning up the workspace), but trying it as is on a high memory server.

Levi Waldron (15:13:23) (in thread): > Ah I see what you’re getting at now, yes that’s a good idea.

Levi Waldron (15:25:51) (in thread): > BTW I am fiddling with this, leaving it for today though.

Stuart Lee (18:51:53) (in thread): > I agree - it feels more natural to me too.

2019-01-09

Levi Waldron (05:07:05) (in thread): > @Michael Lawrence your suggestion of rowsum takes a couple seconds instead of a couple minutes with dplyr, and produces the same result. I’ve pushed the change (https://github.com/waldronlab/GMQLvsBioc/commit/9042f4f1b98bf937fd4c3622b9df45b6a846616b) - Attachment (GitHub): use rowsum instead of dplyr · waldronlab/GMQLvsBioc@9042f4f > Alternative approaches to https://doi.org/10.1093/bioinformatics/bty688 analysis - waldronlab/GMQLvsBioc

Levi Waldron (05:14:10) (in thread): > (and does it with fewer characters of more readable code - the only hackish part is multiplying a logical matrix by 1L to make it integer so that rowsum will work on it)
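A minimal base R sketch of the rowsum() idea from this thread, on made-up data. The matrix x, the probe-to-gene vector symbols, and the gene names are assumptions for illustration; this is not the ACC data or the code in GMQLvsBioc. It sums a "value is present" indicator within each gene symbol, which is different from rowSums(), which sums across the columns of each row.

```r
# Toy methylation-like matrix: 5 probes x 3 samples, with missing values.
x <- matrix(
  c(0.1, NA,  0.3,
    NA,  NA,  0.5,
    NA,  NA,  NA,
    0.7, NA,  NA,
    NA,  NA,  0.9),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)

# Hypothetical probe-to-gene mapping, one symbol per row of x.
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C")

# rowsum() requires a numeric matrix, so the logical indicator is multiplied
# by 1L to turn it into an integer matrix (dimnames are kept).
observed <- (!is.na(x)) * 1L

# Sum the indicator over the probes of each gene (one row per gene), then ask
# whether at least one probe of that gene was observed in each sample.
gene_has_data <- rowsum(observed, group = symbols) > 0

# For contrast, rowSums() collapses across samples and returns one value per probe.
probes_observed <- rowSums(observed)
```

The result gene_has_data has one row per gene symbol and one column per sample, so it plays the role of a grouped any(!is.na(x)) without leaving matrix land.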

Aedin Culhane (11:32:50): > Hi Levi

Aedin Culhane (11:37:18): > Is it possible to pre-filter the Meth data as a delayed array, prior to loading, based on mutation/RNAseq? Would this make it quicker?

Tim Triche (11:38:40): > related to the above, is there a tutorial somewhere about block size vs. realization time… I was playing around with HDF5-backed, large HM450/EPIC datasets (1000-2000 cases) and it is slow as all hell. I rewrote DMRcate’s internals to play nicely with HDF5 backing, for example

Kasper D. Hansen (11:39:05): > @Tim Triche It is super finicky to get performant

Tim Triche (11:39:14): > that’s been my impression

Kasper D. Hansen (11:39:21): > It took me forever to just get parsing (IDATs -> RGChannelSet) to work well

Kasper D. Hansen (11:39:28): > And conceptually that’s easy

Tim Triche (11:39:45): > for large datasets, or direct to HDF5, or … ?

Tim Triche (11:39:58): > I just read everything into memory and save the SE as HDF5

Tim Triche (11:40:04): > I think

Tim Triche (11:40:10): > (goes to look … been a while)

Kasper D. Hansen (11:40:12): > I mean reading in bounded memory

Kasper D. Hansen (11:40:31): > so the conversion can be done on almost any piece of hardware

Tim Triche (11:40:39): > oh, yeah, that’s horrible

Tim Triche (11:40:47): > split-combine was about as well as I could do for that

Kasper D. Hansen (11:40:50): > Reading it all in memory … kind of defeats the purpose

Kasper D. Hansen (11:40:57): > I think it works for minfi now

Kasper D. Hansen (11:41:12): > It’s slower than reading it all in and saving, but it scales well in my testing

Tim Triche (11:42:45): > what I had been doing was reading in the files on a server then backing them with HDF5 for processing on machines/nodes with less RAM

Kasper D. Hansen (11:43:22): > ok

Kasper D. Hansen (11:43:40): > Anyway, for sure it is a big project to make performant code

Kasper D. Hansen (11:43:51): > Hopefully that will improve as we learn

Tim Triche (11:43:52): > not elegant, but I didn’t really have a lot of time to get it spun up, and… what you said above.

Tim Triche (11:44:36): > my original goal was simply to send everything to Vince for remote HDF5 servicing :smile:

Levi Waldron (15:35:52): > Hot off the press from @Marcel Ramos Pérez: waldronlab/curatedTCGAData is now successfully providing methylation data from a local HDF5 file (in the ExperimentHub file cache, so you have to download the big file once). The command mae <- curatedTCGAData("ACC", c("Mutation", "RNASeq2GeneNorm", "Methylation"), dry.run = FALSE) now runs (from cache) in 5 seconds instead of 3 minutes…

Levi Waldron (15:38:29): > In this case I doubt it will speed things up much overall because most of the matrix has to be checked, but in many cases it’ll be a big improvement and it’s great not having a big wait before you can do anything.

Marcel Ramos Pérez (16:10:07): > @Marcel Ramos Pérez has joined the channel

Levi Waldron (17:26:01) (in thread): > BRCA has a small 27K methylation dataset, I could cheat and use that :smile:. Otherwise, sadly I don’t think there’s much time to be gained from filtering the meth data here, although I can remove the probes that aren’t mapped to a gene in the rowData…

2019-01-10

Tim Triche (07:46:32): > so the goal I mentioned above is similar to that, but backed by restfulSE service

Tim Triche (07:47:00): > any plans to spin up that as an example in the F1000R paper?

Tim Triche (07:54:58): > I just read the link that @Sean Davis posted in #random regarding “teaching data science”. The emphasis on tidiness and on finding good example data (*cough* MAE via restfulSE *cough*) resonated on several levels

Levi Waldron (09:13:37): > Even though I convert the DelayedArray to matrix (as.matrix) to make rowsum() work, the whole program is almost 5x faster than just the time it took to load the MAE before! It does give the same result. Code at https://github.com/waldronlab/GMQLvsBioc. - File (R): speed-up with HDF5 methylation

Tim Triche (09:29:20): > try DelayedMatrixStats::rowSums2 ?

Tim Triche (09:30:03): > some of these steps could maybe benefit from “tidiness” that hides some of these rough edges :wink:

Levi Waldron (09:55:52) (in thread): > Indeed, like mae[["ACC_Methylation-20160128"]] <- mae[["ACC_Methylation-20160128"]][!is.na(rowData(mae[["ACC_Methylation-20160128"]])$Gene_Symbol), ].

Levi Waldron (09:59:34) (in thread): > I guess that will have to involve a loop where the “rows” argument is set differently for each gene?

Tim Triche (10:09:55) (in thread): > oh, nevermind, I just realized you did rowsum() not rowSums(). my bad!

Levi Waldron (10:10:04) (in thread): > yep

Shian Su (17:45:41) (in thread): > Is there any argument for the alternative besides having shorter function names and fewer functions? *_rows and *_cols seem much clearer to me.

2019-01-14

Di Cook (06:58:47): > @Di Cook has joined the channel

2019-01-17

Lluís Revilla (07:23:50): > @Lluís Revilla has joined the channel

Kayla Interdonato (08:28:45): > @Kayla Interdonato has joined the channel

2019-01-22

Hervé Pagès (13:19:18): > @Hervé Pagès has joined the channel

2019-01-23

Hervé Pagès (03:34:06) (in thread): > @Levi Waldron I added a rowsum method for DelayedMatrix objects that uses block processing (it’s in DelayedArray 0.9.7). You should be able to call it directly on your DelayedMatrix object meth so no need to do as.matrix(meth) first (which defeats the purpose of using a DelayedMatrix object in the 1st place).
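The idea behind block processing can be illustrated without DelayedArray. The base R sketch below is a toy under stated assumptions, not the DelayedArray implementation: block_rowsum() and its block_ncol argument are hypothetical names invented here, and the data are simulated. It computes rowsum() on a few columns at a time and binds the pieces together, so only one block of columns needs to be held as an ordinary matrix at any moment.

```r
# Hypothetical helper: grouped row sums computed one block of columns at a time.
block_rowsum <- function(x, group, block_ncol = 2L) {
  starts <- seq(1L, ncol(x), by = block_ncol)
  pieces <- lapply(starts, function(s) {
    cols <- s:min(s + block_ncol - 1L, ncol(x))
    # In a disk-backed setting, this subset is the only part that would be
    # realized in memory for the current step.
    rowsum(x[, cols, drop = FALSE], group)
  })
  do.call(cbind, pieces)
}

# Simulated integer matrix: 6 features x 5 samples.
set.seed(1)
x <- matrix(rpois(30, lambda = 5), nrow = 6,
            dimnames = list(NULL, paste0("s", 1:5)))
group <- c("a", "b", "a", "c", "b", "c")

blockwise <- block_rowsum(x, group)
in_memory <- rowsum(x, group)
```

Because the grouping is over rows, splitting by columns leaves each block's result independent of the others, which is why the pieces can simply be column-bound.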

2019-01-24

Kylie Bemis (19:47:22): > @Kylie Bemis has joined the channel

2019-01-30

Dror Berel (11:59:08): > Maybe not tidy per se, but I would like to get your (constructive) comments about my new package: Bioc2mlr. > Not much documentation yet, but mostly a simple, straightforward, proof-of-concept demonstration. https://github.com/drorberel/Bioc2mlr

2019-01-31

Dror Berel (09:58:16): - File (JPEG): vision.jpg

2019-02-05

Laurent Gatto (06:04:07) (in thread): > Very useful initiative. Something I’m sure @Vince Carey is also interested in.

2019-02-07

Charity Law (20:21:12): > @Charity Law has joined the channel

2019-02-18

Diego Diez (21:22:50): > @Diego Diez has joined the channel

2019-02-24

Levi Waldron (15:36:05) (in thread): > @Hervé Pagès I’m sorry I somehow missed this message! I tried DelayedMatrixStats::rowsum which was too slow as of a month ago, but I will try it again now with @Mike Smith’s optimizations and I will try DelayedArray::rowsum(). FYI my use case is at https://github.com/waldronlab/GMQLvsBioc/blob/master/R/loadACC.R.

2019-04-15

Jon Bråte (12:26:42): > @Jon Bråte has joined the channel

2019-06-20

Marko Zecevic (19:39:57): > @Marko Zecevic has joined the channel

2019-06-25

Aaron Wolen (10:17:57): > @Aaron Wolen has joined the channel

2019-06-26

Junhao Li (13:29:58): > @Junhao Li has joined the channel

2019-06-28

Andrew McDavid (12:42:27): > @Andrew McDavid has joined the channel

2019-07-09

Stevie Pederson (21:31:54): > @Stevie Pederson has joined the channel

2019-07-12

Jannik Buhr (07:44:50): > @Jannik Buhr has joined the channel

2019-08-15

Constantin Ahlmann-Eltze (11:23:04): > @Constantin Ahlmann-Eltze has joined the channel

2019-10-05

John Hutchinson (12:07:17): > @John Hutchinson has joined the channel

2019-11-29

David Mas-Ponte (14:46:31): > @David Mas-Ponte has joined the channel

koki (15:43:11): > @koki has joined the channel

2019-12-04

Benjamin Reisman (09:51:59): > @Benjamin Reisman has joined the channel

Jonathan Carroll (17:39:39): > @Jonathan Carroll has joined the channel

2019-12-12

Tim Triche (17:50:58): > @Tim Triche has left the channel

2020-02-07

Michael Lawrence (16:55:49): > On the one year anniversary of this channel going quiet, I’d like to invite anyone interested to join a working group on building bridges (technical, social and philosophical) between Bioconductor and the tidyverse. This is sponsored by the Bioc Technical Advisory Board. Stuart Lee and Mike Love have already expressed their support.

Nicholas Knoblauch (17:03:29): > sounds great!

Michael Lawrence (19:12:46): > So yea just let me know if you’re interested in joining.

Nicholas Knoblauch (19:19:33): > count me in!

2020-02-08

Vince Carey (05:23:52): > keep me posted

2020-02-09

Shian Su (06:38:35): > Also interested

2020-02-25

Laurent Gatto (02:41:36) (in thread): > Yes, I’m definitely interested.

Teun van den Brand (03:08:06): > @Teun van den Brand has joined the channel

Teun van den Brand (03:51:31): > I too am interested

Kevin Rue-Albrecht (05:32:56): > Interested too, starting from 1 April: I’ll be teaching biomedical data science including R and Python, with components about Bioc and tidy. A good excuse to spend time assembling materials and reviewing the current state of plyranges and others that I’ve left on my reading list for far too long

Vince Carey (11:07:06): > To get a glimpse of one relevant approach, see https://github.com/Bioconductor/Contributions/issues/1330

Vince Carey (11:09:20): > which has become https://github.com/stemangiola/tidyBulk

2020-03-02

Jonathan Carroll (04:50:08): > DFplyr, standing by (count me in)

2020-03-05

Michael Lawrence (12:05:14): > Should we put together a poll to schedule a brainstorming meeting to get things started? Anyone know how to make one of those polls?

Kevin Rue-Albrecht (12:58:45): > There’s https://slack.com/intl/en-gb/help/articles/229002507-Create-a-poll- but I’m sensing a doodle might be more efficient for more than 3 choices :sweat_smile: - File (JPEG): Image from iOS

Stuart Lee (19:14:46): > given the number of Australians in this group, a friendly time would be really appreciated :smile:

Vince Carey (20:10:01): > A written agenda will help. I just skimmed the plyranges paper and had a look at its vignette and the vignette > of fluentGenomics. Developers may be able to take advantage of plyranges for certain processes – is this happening? > Do we want to promote this? Should we use plyranges in training? Would it make sense to rewrite HelloRanges with > plyranges operations?

Nicholas Knoblauch (20:37:13): > Something I’d be interested in knowing more about is what the barriers are (technical, social, philosophical etc.) between tidy and bioconductor

Nicholas Knoblauch (20:38:07): > beyond the namespace turf war

2020-03-24

Nitesh Turaga (09:05:40): > @Nitesh Turaga has left the channel

2020-06-06

Olagunju Abdulrahman (19:57:59): > @Olagunju Abdulrahman has joined the channel

2020-06-10

Hervé Pagès (19:07:46): > @Hervé Pagès has left the channel

2020-06-17

Marco Chiapello (08:37:56): > @Marco Chiapello has joined the channel

Andrew Jaffe (16:10:36): > @Andrew Jaffe has joined the channel

2020-06-24

Stephanie Hicks (14:57:39): > @Stephanie Hicks has left the channel

2020-07-15

Spencer Nystrom (08:56:18): > @Spencer Nystrom has joined the channel

2020-07-31

bogdan tanasa (14:07:10): > @bogdan tanasa has joined the channel

2020-08-05

shr19818 (13:48:06): > @shr19818 has joined the channel

2020-10-23

Rebecca Howard (08:18:42): > @Rebecca Howard has joined the channel

2020-11-23

Dominique Paul (08:38:55): > @Dominique Paul has joined the channel

2020-12-02

Konstantinos Geles (Constantinos Yeles) (05:44:09): > @Konstantinos Geles (Constantinos Yeles) has joined the channel

2020-12-12

Huipeng Li (00:37:57): > @Huipeng Li has joined the channel

2020-12-13

Kelly Eckenrode (13:42:28): > @Kelly Eckenrode has joined the channel

2021-01-01

Bernd (14:07:12): > @Bernd has joined the channel

2021-01-22

Annajiat Alim Rasel (15:46:36): > @Annajiat Alim Rasel has joined the channel

2021-02-12

Janani Ravi (15:53:21): > @Janani Ravi has joined the channel

2021-03-23

Lambda Moses (23:06:27): > @Lambda Moses has joined the channel

2021-05-11

Megha Lal (16:46:06): > @Megha Lal has joined the channel

2021-06-02

Levi Waldron (15:59:45): > @Levi Waldron has left the channel

2021-06-05

KP (17:22:29): > @KP has joined the channel

2021-06-07

Stuart Lee (00:45:46): > @Stuart Lee has left the channel

2021-06-11

Sebastian Worms (07:10:18): > @Sebastian Worms has joined the channel

2021-08-06

ChiaSin (20:02:06): > @ChiaSin has joined the channel

2021-09-01

Katie Saund (14:43:58): > @Katie Saund has joined the channel

2021-09-25

Haichao Wang (07:20:50): > @Haichao Wang has joined the channel

2021-10-29

Michael Love (13:42:21): > Hitting up this old channel for some new questions :slightly_smiling_face: @Wancen Mu and I are interested in computing correlations across modality. Suppose we have RNA and ATAC for matched samples, and want to compute correlation of RNA and ATAC for every pair of features that are within X bp. > > AFAIK @Stuart Lee plyranges is just for GRanges and not easily adaptable to SummarizedExperiment. Not sure if this computation is currently possible with @stefano mangiola’s tidybulk? Or maybe @Marcel Ramos Pérez has thought about this with MultiAssayExperiment?

Wancen Mu (13:42:24): > @Wancen Mu has joined the channel

Stuart Lee (13:42:24): > @Stuart Lee has joined the channel

stefano mangiola (13:42:24): > @stefano mangiola has joined the channel

Malte Thodberg (15:25:23) (in thread): > CAGEfightR implements something like this (promoter-enhancer correlation within a certain distance) with findLinks(), and returns a GInteractions.

Michael Love (17:27:43) (in thread): > oh, does this work with two objects though?

Michael Love (17:28:25) (in thread): > i want to compute correlations on the bipartite graph where an edge connects a gene to a peak, but not gene-gene or peak-peak

2021-10-30

Malte Thodberg (01:36:19) (in thread): > No, it works by first rbinding the two RSEs with combineClusters and then indicating promoter/enhancer with a factor in rowData.

Malte Thodberg (01:37:43) (in thread): > There might have been updates since I wrote it, but this was to make it work with GInteraction(), > which only has a single slot for the ranges.

Malte Thodberg (01:39:44) (in thread): > I didn’t find any option for storing “asymmetric” interactions between two sets of features, e.g. eQTLs or PCHiC.

2021-11-08

Paula Nieto García (03:30:19): > @Paula Nieto García has joined the channel

Michael Love (07:11:08): > @stefano mangiola just wanted to check before we reproduce this functionality: do you have something to compute correlation across two assays in tidybulk?

stefano mangiola (18:43:54): > Hello @Michael Love, sorry I’m not great in slack. I will give a proper answer today!

Michael Love (18:53:50): > Oh no worries — also would you prefer I post to support or GitHub?

stefano mangiola (21:11:07): > Hello @Michael Love, sure, GitHub is the best for receiving notifications. I will try to be active here as well :) > > I thought about it, this feature would be more for tidySummarizedExperiment, as we abstracted *_join() methods from dplyr, and once joined it is easy to do correlation between assays (see attached) by gene. > > Unfortunately it is possible to inner_join() (for example) only with exact combinations of columns. However it would be a nice addition to generalise this concept to any match allowed by GRanges/plyranges. > > From a quick brainstorm, in the backend (the only missing bit would be 1., the rest is possible with the tidySE machinery) > 1. determine the matches > 2. index the ranges based on their match > 3. add those indexes to the two SE > 4. *_join() the two - File (R): correlation_of_assays.R
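The four steps above can be sketched in base R on made-up data. This is only an illustration under stated assumptions: the matrices rna and atac, the coordinate tables genes and peaks, and the max_gap value are invented here, and none of this is tidySummarizedExperiment or plyranges code. Matches are restricted to gene-peak pairs (no gene-gene or peak-peak pairs), and the gap is counted as the number of positions separating two ranges, with overlapping ranges giving a negative value.

```r
set.seed(42)
n_samples <- 10

# Two assays measured on the same (matched) samples.
rna  <- matrix(rnorm(2 * n_samples), nrow = 2,
               dimnames = list(c("geneA", "geneB"), NULL))
atac <- matrix(rnorm(3 * n_samples), nrow = 3,
               dimnames = list(c("peak1", "peak2", "peak3"), NULL))

# Feature coordinates on a single chromosome (made up).
genes <- data.frame(id = c("geneA", "geneB"),
                    start = c(100, 5000), end = c(200, 5100),
                    stringsAsFactors = FALSE)
peaks <- data.frame(id = c("peak1", "peak2", "peak3"),
                    start = c(150, 260, 9000), end = c(180, 300, 9100),
                    stringsAsFactors = FALSE)

max_gap <- 100

# 1. Determine the matches: consider every gene-peak pair and keep those
#    whose ranges overlap or lie within max_gap positions of each other.
idx <- expand.grid(gene = seq_len(nrow(genes)), peak = seq_len(nrow(peaks)))
gap <- pmax(genes$start[idx$gene], peaks$start[idx$peak]) -
       pmin(genes$end[idx$gene],   peaks$end[idx$peak]) - 1

# 2-3. Index both feature sets by their match (one row per matched pair).
keep <- idx[gap <= max_gap, ]

# 4. "Join" the two assays on that index and correlate across samples.
result <- data.frame(
  gene = genes$id[keep$gene],
  peak = peaks$id[keep$peak],
  cor  = mapply(function(g, p) cor(rna[g, ], atac[p, ]), keep$gene, keep$peak),
  stringsAsFactors = FALSE
)
```

Enumerating all pairs with expand.grid() is quadratic and only sensible for a toy; with real data, step 1 is the part an overlap-finding tool would take over.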

stefano mangiola (21:14:20): > it would be a nice little project, but definitely now it is not possible to do what you are looking for with the current framework

Michael Love (22:17:40) (in thread): > Oh cool, thanks for the R code and detailed answer. I’ll take a look and see what can be done

2021-11-09

Maria Doyle (16:53:42): > @Maria Doyle has joined the channel

2021-11-12

Michael Love (06:22:22) (in thread): > I mocked up the plyranges example just to compare, this involves putting the data as *List object in the metadata columns. So definitely breaking the pattern > > library(plyranges) > library(GenomicRanges) > x <- rnorm(10) > y <- rnorm(10) > dat1 <- NumericList(x, y) > dat2 <- NumericList(y+rnorm(10), x+rnorm(10)) > gr1 <- GRanges("chr1", IRanges(c(1,21),width=10), id1=1:2, data1=dat1) > gr2 <- GRanges("chr1", IRanges(c(5,26),width=10), id2=c("a","b"), data2=dat2) > gr1 %>% join_overlap_left(gr2, maxgap=20) %>% > mutate(cor=cor(data1, data2)) %>% > select(id1, id2, cor) >

Michael Love (06:22:25) (in thread): > > GRanges object with 4 ranges and 3 metadata columns: > seqnames ranges strand | id1 id2 cor > <Rle> <IRanges> <Rle> | <integer> <character> <numeric> > [1] chr1 1-10 * | 1 a 0.226608 > [2] chr1 1-10 * | 1 b 0.769355 > [3] chr1 21-30 * | 2 a 0.826418 > [4] chr1 21-30 * | 2 b 0.597861 >

Michael Love (06:24:16) (in thread): > This is what I’d like in terms of simplicity of code but it breaks apart the SE where people will have their data

Michael Love (06:27:19) (in thread): > your examples 2 and 3 above are very interesting, I need to learn about future_map but yeah, I see how if we make new SEs with repeated observations for every overlap, and in the order of a findOverlaps or join_overlap result table, we can do it with your above code without breaking apart the SE

Michael Love (08:15:53) (in thread): > @Wancen Mu :point_up: some ideas on how to compute your gene to peak or gene to CpG correlations in tidy manner, we can discuss later

Wancen Mu (08:40:00) (in thread): > Thanks, Mike and Stefano!

2021-11-24

Helge Hecht (13:15:57): > @Helge Hecht has joined the channel

2021-12-14

Megha Lal (08:23:36): > @Megha Lal has left the channel

Liz Ing-Simmons (09:18:53): > @Liz Ing-Simmons has joined the channel

2021-12-29

Michael Love (09:07:13) (in thread): > Just for fun… I’ve been playing around with a toy example of Stefano’s code of computing correlations within tidySE; this was a good opportunity for me to learn a bit more about tidySE syntax. I like how extensible this approach would be. Maybe I could work more on hiding away the code where we make an SE where each row is an overlap. https://gist.github.com/mikelove/40b6882ff88f31b7ccb847530022b058#file-overlap_correlation-r-L79-L87

2022-01-03

Michael Love (09:09:38) (in thread): > @Wancen Mu in the above example I split a matrix into a NumericList for the plyranges example, and for the tidySE example I make an SE which consists of overlaps with the data in the assays. as far as testing out our nullranges code I think the plyranges case is easier to adapt, but we can mention that it would also be possible with tidySE. I think for the tidySE case, one approach would be to move this type of code into a function, but before doing that I’d want to check in with @stefano mangiola on his thoughts. Is this a use case that others would be interested in? > > ## edited -- this had a bug, fixed on github... >

Kurt Showmaker (17:05:38): > @Kurt Showmaker has joined the channel

2022-01-04

Wancen Mu (08:57:04) (in thread): > Oh sorry @Michael Love, I just saw the message! Thanks for trying this out! :+1: I also have tried it on 10x multi-omics. But I will look over your code tomorrow and see if I can optimize mine.

2022-01-11

stefano mangiola (11:52:47) (in thread): > Hello,

stefano mangiola (11:53:46) (in thread): > I simplified the tidySE code a little > > # I am using the github repo > rho <- > se %>% > nest(data = -.feature) %>% > mutate(rho = map_dbl(data, ~cor(pull(.x, x), pull(.x, y)))) %>% > select(-data) >

Michael Love (11:54:49) (in thread): > oh cool, i’ll try this out and modify the gist

Michael Love (11:56:23) (in thread): > i think it’s natural to have the data as SE, whereas with plyranges we have to asplit() and put it in mcols. however, once it’s in mcols we can avoid building a new SE that has the matches as rows. it’s been fun to think about how to get this done within various frameworks. tidySE is elegant for sure

stefano mangiola (12:35:06) (in thread): > Yes true, in tidySE that will not be necessary in 15 minutes. You will be able to do > > se %>% > nest(data = -.feature) %>% > mutate(rho = map_dbl(data, ~cor(pull(.x, x), pull(.x, y)))) %>% > unnest(data) >

stefano mangiola (12:35:30) (in thread): > (I was surprised, there is a little bug because it is a ranged..)

Michael Love (12:37:06) (in thread): > oh cool, so when i try this out, use devel branch tidySE?

stefano mangiola (12:38:42) (in thread): > I will point you to the version :slightly_smiling_face:

stefano mangiola (14:27:29) (in thread): > the 3-line code is possible with the branch > > tidySummarizedExperiment@adapt_to_rangedSE > > the unnest is very unoptimised (more than I remember) but I will work on the speedup tomorrow

stefano mangiola (14:27:45) (in thread): > *remembered

Michael Love (14:28:19) (in thread): > cool sounds good and thanks for looking into this example Stefano!:pray:

stefano mangiola (14:29:07) (in thread): > all good, it helps to improve the code!

stefano mangiola (14:29:17) (in thread): > helps me

2022-01-12

stefano mangiola (14:53:47) (in thread): > I have optimised the unnest as well. Now you can avoid ever leaving SE > > se %>% > nest(data = -.feature) %>% > mutate(rho = map_dbl(data, ~cor(pull(.x, x), pull(.x, y)))) %>% > unnest(data) >
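For readers following along without the tidySE branch installed, here is a minimal base-R sketch of what the nest / map_dbl / unnest chain above computes. It is not the tidySE implementation: the long-format data frame `long` and its columns `feature`, `x` and `y` are made up for illustration, and split() plus vapply() stand in for nest() and map_dbl().

```r
# Toy long-format data: one row per (feature, sample) with two measurements.
set.seed(42)
long <- data.frame(
  feature = rep(c("f1", "f2", "f3"), each = 8),
  x = rnorm(24)
)
long$y <- long$x + rnorm(24, sd = 0.5)

# "nest" by feature, then compute one correlation per group.
rho <- vapply(
  split(long, long$feature),
  function(d) cor(d$x, d$y),
  numeric(1)
)

# "unnest": attach each feature's rho back onto its rows.
long$rho <- unname(rho[long$feature])
```

The result keeps one row per observation with the per-feature correlation repeated, which is roughly the shape the unnest(data) step returns, minus the SE container.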

stefano mangiola (14:54:06) (in thread): > Let me know if you need opinions on anything else

Michael Love (15:28:02) (in thread): > awesome, trying out now

Michael Love (15:42:58) (in thread): > that’s very nice

Michael Love (15:49:54) (in thread): > I updated the gist, the overlap SE could hypothetically be tucked away as a function but there’s not a great place for it

Michael Love (15:50:24) (in thread): > https://gist.github.com/mikelove/40b6882ff88f31b7ccb847530022b058#file-overlap_correlation-r-L59-L71

Michael Love (15:52:20) (in thread): > This relates to GInteractions and also maybe MultiAssayExperiment where there are row-based mappings across experiments

2022-04-25

Antoine de Weck (19:59:11): > @Antoine de Weck has joined the channel

2022-05-05

Flavio Lombardo (05:58:00): > @Flavio Lombardo has joined the channel

2022-05-07

Kozo Nishida (04:21:26): > @Kozo Nishida has joined the channel

2022-05-20

Simon Pearce (03:16:57): > @Simon Pearce has joined the channel

2022-07-13

Sarah Pierce (11:52:56): > @Sarah Pierce has joined the channel

Brenda Pardo (11:52:57): > @Brenda Pardo has joined the channel

2022-07-31

Tim Howes (14:06:59): > @Tim Howes has joined the channel

2022-08-01

Marc Elosua (11:30:30): > @Marc Elosua has joined the channel

Michael Love (13:11:48): > post-BioC, Stefano and i were interested to see what interest there would be in a beginners workshop re: tidy genomic analysis. Here’s a poll about timing: https://twitter.com/mikelove/status/1554132629720539136 - Attachment (twitter): Attachment > For #rstats bioinformatics folks, esp. students and postdocs: > > Would you be interested in a “tidy genomic analysis” in @Bioconductor zoom workshop? > > Should we do late day in Melbourne (GMT+10) / early in US (GMT-5), or early in Melbourne and late in US?

2022-08-02

Mikhael Manurung (07:18:19): > @Mikhael Manurung has joined the channel

2022-08-04

László Kupcsik (21:56:06): > @László Kupcsik has joined the channel

2022-08-10

Mae Woods (11:00:20): > @Mae Woods has joined the channel

2022-08-15

Michael Kaufman (13:13:19): > @Michael Kaufman has joined the channel

2022-08-29

Margaret Turner (19:12:16): > @Margaret Turner has joined the channel

2022-09-04

ChiaSin (22:04:07) (in thread): > Saw this message a month late, is the workshop happening?

2022-09-06

Michael Love (08:24:35) (in thread): > i got caught up in the Fall semester starting, Stefano and i had discussed > * 15 of Sept to the 10 Oct > * 8AM Melbourne / 6PM US East (compared to 8PM Melbourne / 6AM US East). Evening in US East allows West coast to join (but not Europe). Anyway nothing is perfect and we can record it

2022-09-07

ChiaSin (18:07:16) (in thread): > Hope the semester is going well :slightly_smiling_face: Sounds good. I will be keen to catch the workshop live/recorded. Thank you for the effort!

2022-09-12

Samuel Gamboa (13:30:58): > @Samuel Gamboa has joined the channel

2022-11-22

Michael Love (08:27:20): > @stefano mangiola and I have been discussing doing a tidy-in-Bioc online tutorial in mid-December. Details cross-posted to birdsite and oldelephantsite > * https://twitter.com/mikelove/status/1595038783371706369 > * https://genomic.social/@mikelove/109387583769396088 - Attachment (twitter): Attachment > Stefano Mangiola @steman_research & I are planning a tutorial on “Tidy Genomic Analysis” > > Dec 12, 4:30 PM EST > Dec 13, 8:30 AM AEDT > 60 min + 30 min Q/A > > - Tidy single-cell and bulk transcriptomics > - Tidy enrichment with plyranges and nullranges > > Sign up: > https://unc.zoom.us/meeting/register/tJIoduypqzkoEtCKxHGyW01DOdaIUrGPxmro - Attachment: Attachment - Attachment (genomic.social): Michael Love (@mikelove@genomic.social) > Attached: 2 images > > Stefano Mangiola and I are planning a zoom tutorial on “Tidy Genomic Analysis”: > > Dec 12, 4:30 PM EST > Dec 13, 8:30 AM AEDT > 60 min + 30 min Q/A > > - Tidy single-cell and bulk transcriptomics > - Tidy enrichment with plyranges and nullranges > > Sign up: > https://unc.zoom.us/meeting/register/tJIoduypqzkoEtCKxHGyW01DOdaIUrGPxmro > > #tidy #RStats #Bioconductor #genomics

Mikhail Dozmorov (09:58:37): > @Mikhail Dozmorov has joined the channel

2022-11-28

Joshua Shapiro (09:55:21): > @Joshua Shapiro has joined the channel

2022-11-29

Assa (09:09:17): > @Assa has joined the channel

2022-12-12

Jenny Drnevich (17:53:38): > @Jenny Drnevich has joined the channel

Mervin Fansler (17:53:44): > @Mervin Fansler has joined the channel

Melysssa Minto (17:53:45): > @Melysssa Minto has joined the channel

Jennifer Foltz (17:56:48): > @Jennifer Foltz has joined the channel

Umran (17:56:54): > @Umran has joined the channel

Kartik (17:56:57): > @Kartik has joined the channel

Lexi Bounds (17:57:29): > @Lexi Bounds has joined the channel

Jenea Adams (17:57:42): > @Jenea Adams has joined the channel

Sofya Marchenko (18:04:25): > @Sofya Marchenko has joined the channel

2022-12-13

Michael Love (08:21:38) (in thread): > Video recording from this workshop: https://www.youtube.com/watch?v=nXxTGoBJYHM - Attachment (YouTube): Tidy Genomics Analysis Workshop: Tidy Enrichment and Tidy Transcriptomics

Ana Cristina Guerra de Souza (09:02:06): > @Ana Cristina Guerra de Souza has joined the channel

Sarah Djeddi (09:16:53): > @Sarah Djeddi has joined the channel

2022-12-14

Michael Love (15:07:41): > :wave: hi to new channel members. Please feel free to post questions / challenges here, I personally like to puzzle on ways to rework genomic analyses into tidy formats

2022-12-15

Vince Carey (05:16:17): > I am watching the tutorial now and I see %>% instead of |> … are there benefits to sticking with magrittr’s %>%? (Maybe it is addressed later in the video…?)

Michael Love (07:48:09): > Not that i know of, just that some of my material is pre 4.1, and i haven’t had the time to go around and replace all these

Michael Love (07:50:23): > the new pipe is a little more cumbersome for me to type? i have to move my pinky from the right Shift

Spencer Nystrom (08:42:58): > I have not switched because I often use . multiple times inside a single function call. Or at least often enough to where it’s relevant to not use base. I’ve given |> a try, and it is great for sure though. > > As for Mike’s typing pain, I’ve long embraced the Ctrl + Shift + M pipe alias originally introduced in RStudio. RStudio has a setting to make that add either a Magrittr or a base R pipe symbol.

Michael Love (09:26:54) (in thread): > oh we actually do that in this workshop: https://github.com/mikelove/tidy-genomics-talk/blob/main/boot_and_match_examples.R#L215-L220
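To make the placeholder point concrete, here is a small base-R sketch (R >= 4.1, no packages assumed; the vector `x` is a made-up example). The native pipe has no magrittr-style dot that can be reused several times in one call, but wrapping the step in an anonymous function is one common workaround.

```r
x <- c(2, 4, 6, 8)

# magrittr style, shown only as a comment for comparison:
#   x %>% { (. - mean(.)) / sd(.) }

# Native pipe: the piped value is bound to `v`, which can be reused freely.
z <- x |> (\(v) (v - mean(v)) / sd(v))()
```

Whether this reads better than the magrittr dot is a matter of taste, which is roughly the trade-off described above.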

Michael Love (09:27:27) (in thread): > i use emacs mostly (just to annoy everyone)

Spencer Nystrom (09:31:58) (in thread): > (I also use emacs, and vim cause I am a monster, but I alias Ctrl + Shift + M = %>% cause it’s too good)

Mercilena Benjamin (09:39:48): > @Mercilena Benjamin has joined the channel

2022-12-17

Michael Love (08:09:55) (in thread): > ooh, in emacs the keybinding is even niftier, bc it can auto newline and indent

2022-12-20

Mervin Fansler (13:19:57): > I haven’t switched for similar reasons, but also because I frequently use the other pipes implemented in magrittr, like the exposition pipe (%$%) to reference columns directly or the tee pipe (%T>%) which is helpful for dumping or inspecting intermediate objects.

Michael Love (13:51:05): > oh I also like %<>%

Michael Love (13:54:00) (in thread): > oh I get it, “T” pipe

Michael Love (13:54:22) (in thread): - File (PNG): image.png

Mervin Fansler (14:00:17) (in thread): > Same, but I try to use it sparingly. Easy to mistake as %>% when rereading code, and think the <- assignment is a better visual cue to highlight something is being mutated.

2022-12-23

Pierre Gestraud (08:36:09): > @Pierre Gestraud has joined the channel

Michael Love (11:23:01): > spreading the tidy on support site :slightly_smiling_face: https://support.bioconductor.org/p/9148540/

2022-12-25

Antonio Cembellin Prieto (12:53:44): > @Antonio Cembellin Prieto has joined the channel

2023-01-18

José Basílio (13:10:24): > @José Basílio has joined the channel

2023-03-10

Edel Aron (15:28:45): > @Edel Aron has joined the channel

2023-05-03

Rebecca Butler (16:19:12): > @Rebecca Butler has joined the channel

2023-05-18

Oluwafemi Oyedele (05:54:16): > @Oluwafemi Oyedele has joined the channel

2023-05-25

Jacob Krol (17:14:24): > @Jacob Krol has joined the channel

2023-06-15

Michael Love (03:04:23) (in thread): > i was talking to someone at a summer course and realized we don’t need to split up the data and attach to the GRanges, we can just use purrr: > > x_no_dat %>% join_overlap_inner(y_no_dat) %>% > mutate( > rho = map2_dbl(id.x, id.y, \(.x,.y) { > cor(dat_x[.x,], dat_y[.y,]) > })) >

Michael Love (03:04:41) (in thread): > x_no_dat is the GRanges without the data as mcols

Michael Love (03:05:09) (in thread): > now it seems very obvious, but this approach means you can keep the data in the original format, which may be a large performance difference
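A minimal base-R sketch of the index-pair idea above, under stated assumptions: `dat_x` and `dat_y` are random toy matrices (rows are features, columns are shared samples), and the `pairs` data frame is a hypothetical stand-in for the table that join_overlap_inner() would produce, with `id.x` and `id.y` as row indices. mapply() is used here in place of purrr's map2_dbl().

```r
set.seed(1)
n_samples <- 10
dat_x <- matrix(rnorm(3 * n_samples), nrow = 3)  # e.g. 3 genes
dat_y <- matrix(rnorm(4 * n_samples), nrow = 4)  # e.g. 4 peaks

# Hypothetical overlap table: each row pairs one row of dat_x with one of dat_y.
pairs <- data.frame(id.x = c(1, 1, 2, 3), id.y = c(2, 4, 1, 3))

# Look up the two rows by index and correlate them; the matrices are never
# split up or copied into the ranges object.
pairs$rho <- mapply(
  function(i, j) cor(dat_x[i, ], dat_y[j, ]),
  pairs$id.x, pairs$id.y
)
```

The data stay in their original matrices and only integer ids travel with the overlap table, which is where the possible performance gain mentioned above would come from.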

2023-06-19

Pierre-Paul Axisa (05:12:34): > @Pierre-Paul Axisa has joined the channel

2023-06-28

Andrew Ghazi (11:00:04): > @Andrew Ghazi has joined the channel

2023-07-06

Assa (02:54:08): > @Assa has left the channel

2023-07-26

Guillaume Devailly (08:15:57): > @Guillaume Devailly has joined the channel

Helena L. Crowell (08:19:14): > @Helena L. Crowell has joined the channel

Michael Love (09:04:10): > @stefano mangiola and I are compiling a list of open challenges for “tidyomics” — tidy-data paradigm for Bioconductor objects and other biological datasets. An open project, we invite anyone interested to assign themselves to an issue, and then we will include you in the writing of a paper describing this joint effort. Happy to discuss here or in person at BioC > * https://twitter.com/mikelove/status/1684171715457454082 > * https://github.com/orgs/tidybiology/projects/1/views/1 - Attachment (Twitter): Michael Love on Twitter > Be part of the #tidyomics community! Take on one of the Tidyomics Open Challenges & become part of our upcoming paper, “Enhancing omic data analyses with the tidyomics ecosystem” > > Challenges: > https://t.co/5BYzVjNwLm > > tidiness_in_bioc on bioc slack > @steman_research @Bioconductor

Guillaume Devailly (09:11:15): > I might be interested in writing a little something for the blog. Is the blog live somewhere or not yet? I was looking for previous posts for inspiration, but all I could find was this: https://tidybiology.github.io/tidyomicsBlog/ - Attachment (tidybiology.github.io): Tidy transcriptomics > The blog about tidy transcriptomics.

Michael Love (09:15:05) (in thread): > that would be great, i think @stefano mangiola is planning to bring the blog back to life soon. if you want to start drafting in Rmd or md I’m sure that could be brought in easily

Laurent Gatto (10:25:43): > Hi Mike - FYI, we also apply tidy principles into our MS and proteomics approaches. We don’t necessarily aim for the same dplyr verbs, but we follow the principle of having functions/operations that use the default data structure as input and output so as to combine them with pipes.

Kevin Rue-Albrecht (10:27:57) (in thread): > nice! I didn’t know that github feature! Reminds me of BiocChallenges, which didn’t age well xD

Michael Love (10:28:18) (in thread): > @stefano mangiola is also leading this — can we recruit you to the project? > > you could add some integration issues around MS and proteomics to the GH Project link above?

Michael Love (10:30:15) (in thread): > here is the key part of the abstract: > > …Bioconductor emphasises the linkage between metadata and standardised data formats, ensuring interpretability and reproducibility. On the other hand, Tidy programming in R has standardised data organisation and manipulation, with intuitive natural-language-like grammar. However, native Bioconductor data formats with rich metadata are not readily compatible with tidyverse packages. Recent advancements have bridged this gap by abstracting data formats into tidy structures, enabling seamless interaction through an API. This abstraction unlocks the potential for transforming genomics and transcriptomics analyses and facilitates a technology-agnostic exploration paradigm for diverse omic data. Integration with popular packages like dplyr, tidyr and ggplot2 enhances accessibility and encourages cross-disciplinary collaborations. Here, we introduce the tidyomics ecosystem—a suite of interoperable software that accelerates omic data analysis, enhances reproducibility, promotes transparency, and fosters cross-disciplinary collaborations, driving scientific discovery forward.

Michael Love (10:31:50) (in thread): > what Stefano and I are trying to recruit based on is, helping to build documentation and packages/functionality that allows users to come in to Bioconductor and operate on objects with familiar tidyverse verbs, easily make ggplot2 figures, etc.

Michael Love (10:32:49) (in thread): > e.g. we have a draft figure showing what we mean by common language applied to Bioc objects: - File (PNG): Screenshot 2023-07-26 at 4.32.31 PM.png

Michael Love (10:34:04) (in thread): > yeah, I didn’t know either, Stefano started this one and showed me how

Michael Love (10:34:42) (in thread): > i think we have a community hopefully with a vested interest in solving some of these / adding new ones

Kevin Rue-Albrecht (10:35:19) (in thread): > lesson learned from BiocChallenges (and other failed ideas): anything that requires git commits and github action for rendering has a tougher time getting traction:wink:

Laurent Gatto (10:35:26) (in thread): > I think MS/proteomics verbs/approaches as implemented in RforMS could be a good fit. I’m at a workshop this week and then on holiday, but I have set a reminder to get back to you about this thereafter.

Michael Love (10:35:56) (in thread): > great! thanks Laurent

Laurent Gatto (10:38:56) (in thread): > Having said that, there’s also a tidyproteomics package/paper (https://github.com/jeffsocal/tidyproteomics) for quantitative data, outside of the Bioc ecosystem.

Michael Love (10:40:46): > As a preview of what we mean by integration, in this upcoming workshop we work across SingleCellExperiment <-> SummarizedExperiment <-> ranges using Stefano’s tidy* packages and plyranges with familiar tidyverse verbs https://tidybiology.github.io/tidyomicsWorkshopBioc2023/articles/tidyGenomicsTranscriptomics.html - Attachment (tidybiology.github.io): Tidy genomic and transcriptomic single-cell analyses > tidyomicsWorkshopBioc2023

Martin Morgan (11:03:39): > mentioninghttps://github.com/Bioconductor/S4Vectors/pull/116in case there is some insight or more general solution@Kevin Rue-Albrecht - Attachment: #116 add fortify.DataFrame > Hi there > > TL;DR: This PR makes the following code possible: > > > library(S4Vectors) > > df <- DataFrame( > x = 1:10, > y = rnorm(10) > ) > > library(ggplot2) > > ggplot(df, aes(x, y)) + > geom_point() > > # see also: > fortify(df) > > > Without the need for as.data.frame(df) in the ggplot() call. > > * * * > > Context: > > As per https://community-bioc.slack.com/archives/C6KJHH0M9/p1690235172607369 > > I’ve messed around S4vectors a bit to test feasibility, and somehow landed on my feet with something that seems to work. > > I’ll be honest, I’m not even sure why R allows me to do it, but it seems that I can importFrom a package that is listed in Suggests (i.e., not in Depends). > > I added ggplot2 to Suggests because I don’t like the idea of having it under Imports. It just feels wrong to automatically install ggplot2 and its own dependencies as a dependency of S4Vectors. S4Vectors should remain a lightweight package. > > I suppose that if users have ggplot2 installed, the import statement “just works”, and if they don’t have ggplot2 installed.. well… they don’t have any reason to call ggplot() on a DataFrame object :D > > I’m aware that this PR is unlikely to be the final fix (if any is possible at all). I just aim to give a starting point to the discussion. > > Also, I’ve considered other approaches, but run into chicken-and-egg issues: > > • I suspect ggplot2 will not accept to Suggests: S4Vectors, as I don’t see any Bioconductor package in its existing Imports/Suggests (https://cran.r-project.org/web/packages/ggplot2/index.html) and install.packages() cannot see Bioconductor packages without messing with options(repos). > • I suspect S4Vectors will not accept to Imports: ggplot2, to justify more cleanly importFrom(ggplot2, fortify). 
Same reason as above: keep S4Vectors dependencies to a minimum > • I noted that ?ggplot2::fortify states “Rather than using this function, I now recommend using the broom package, which implements a much wider range of methods. fortify() may be deprecated in the future.” However, it is not clear to me what needs to be done in broom (or biobroom)

Gurpreet Kaur (11:11:39): > @Gurpreet Kaur has joined the channel

Christian Brueffer (11:20:24): > @Christian Brueffer has joined the channel

stefano mangiola (18:47:44) (in thread): > great. @William Hutchison maybe prioritise the blog transfer a little so members of the community will be able to contribute. Please @Guillaume Devailly add yourself to the github issue so we can start planning there.

William Hutchison (18:47:51): > @William Hutchison has joined the channel

William Hutchison (20:26:17) (in thread): > Sure, I will work to get the blog up as soon as possible. Thank you for your interest Guillaume

stefano mangiola (22:49:21) (in thread): > @Guillaume Devailly some ideas for (needed) posts are here https://github.com/tidybiology/tidyomicsBlog/issues But feel free to propose something by opening your own issue in the repo

2023-07-27

Michael Love (01:01:42): > Re: these Open Challenges, there’s been a lot of great discussion on ideas, what’s missing, how to get started, what is the scope. We didn’t want to limit the scope from the outset, but to see what people were interested in building. Feel free to assign yourself (multiple people can assign as well to one issue) or suggest a new open challenge. If you have questions about what this is about, or about one of the challenges feel free to ask here or DM or start discussion on the issue itself. Maybe you already have ideas, or if you need more information or help thinking through solutions, happy to give guidance. https://github.com/orgs/tidybiology/projects/1

William Hutchison (02:25:36) (in thread): > Hello again, the blog is up!

Helena L. Crowell (04:54:29): > Sorry if I’m missing something, but could it be that one cannot create/assign challenges w/o being a project member? Or maybe this is only true for tasks that are linked to package-issues?

Clemens Kohl (07:33:30): > @Clemens Kohl has joined the channel

Louise Morlot (07:35:14): > @Louise Morlot has joined the channel

stefano mangiola (08:28:21) (in thread): > It’s maybe the case that you cannot create issues on the project if you’re not a member. You should be able to assign yourself to an issue though. This requires the same rights as assigning yourself to an issue of the public repository. > > depending on how this experience goes, in the future, we will for sure think about a way to allow the community to add challenges!

stefano mangiola (08:32:20) (in thread): > I think the key is to invite a broader community to the github organisation. We are pretty new to this github infrastructure

Michael Love (11:10:44) (in thread): > I think we should just add interested people as organization members. This allows them to read other members’ repos (but not write), create repos, and create/edit issues.

Michael Love (11:13:35) (in thread): > I believe that all members can create/edit issues, and we can just add everyone who lists their github here as members

Michael Love (11:13:44) (in thread): > I’ve added @Helena L. Crowell

Michael Love (11:15:30) (in thread): > added @Abdullah Al Nahid

Abdullah Al Nahid (11:15:45): > @Abdullah Al Nahid has joined the channel

Michael Love (11:22:52) (in thread): > I’m still learning how to do this, but some of the challenges may be writing new packages, and i think the Issue should link to the new package in that case. I just did this for: https://github.com/orgs/tidybiology/projects/1/views/1?pane=issue&itemId=34385041

stefano mangiola (19:37:09) (in thread): > Amazing! @Guillaume Devailly have a look at the proposed topics, otherwise let us know what you would like to write about.

2023-07-28

Helena L. Crowell (08:17:57): > Alright @stefano mangiola, big PR on tidySingleCellExperiment awaiting :wink: What I think could be next (after discussing): > * Clean the R CMD CHECK and BiocCheck reports. E.g., on the current R/Bioc devel, I am getting > 10 NOTES (noting >500 lines), a couple warnings, and one error. These are perhaps not high-priority for some, but I think it would be nice to keep things clean from the start (also to better spot potentially worrisome issues). This will mainly involve adhering to the guidelines outlined here (https://contributions.bioconductor.org/r-code.html#r-code); so should be simple enough. > * Expand on unit tests. Current coverage is 74%, but hits per line are 0-10 in most cases. > * Expand on (transcriptomics-specific) examples. By inheriting man pages, documentation comes down to making these as useful as we can. > * Not sure I’m missing something, but a few “low fruit” seem to be missing, e.g., slice_-helpers, add_count and (add_)tally? Nothing important, it only occurred to me as they are documented with slice/count at the moment, but don’t actually exist (I think?). > * I haven’t really benchmarked this, but aggregate_cells seems to be a lot less efficient than scuttle’s aggregateAcrossCells. Perhaps worth considering how we could improve here, as this is something a lot of people might want to do. > * Once we’re happy with things, transfer what we can to tidySummarizedExperiment and …? > Let me know what you (*anyone interested) think! I think all these could be done fairly quickly, and would bring us into a more stable state; then the “hard stuff” can be tackled (efficiency, grouping) :muscle: - Attachment (contributions.bioconductor.org): Chapter 15 R code | Bioconductor Packages: Development, Maintenance, and Peer Review > Everyone has their own coding style and formats. There are however some best practice guidelines that Bioconductor reviewers will look for. can be a robust, fast and efficient programming language…

Christian Brueffer (15:00:25) (in thread): > Sounds reasonable to me, particularly cleaning up lint issues early on.

Benjamin Yang (15:59:08): > @Benjamin Yang has joined the channel

Noriaki Sato (19:19:46): > @Noriaki Sato has joined the channel

stefano mangiola (22:19:04) (in thread): > Impressive! I left a few comments on the PR. > > Agree on every point. > > Making aggregate_cells more efficient is a good project. We offer a very nice API, that preserves metadata etc., in an automatic seamless way. So using a more efficient backend preserving our interface would be amazing. > > importing scuttle is a bit too much, so we could use a similar strategy, citing the authors in our function (simpler approach) or seeing if we can straight borrow some code from scuttle, inviting Lun and McCarthy to the paper/community. > > You could create a new “challenge” about this. If you don’t have the rights, I can do it with no problem.

2023-07-29

Helena L. Crowell (02:13:26) (in thread): > Cool. So once the dox PR goes through, would you be ok that I get started on 1.)? Closely linked with more unit tests :sparkles: Agree re avoiding another import. Sure, I can open a challenge/issue for this! Not sure how active Aaron is these days, but maybe tagging him on GH will catch his attention.

stefano mangiola (03:08:35) (in thread): > Great. I would prefer this formal list of tasks was tracked at the project level, as already I am getting a bit confused :slightly_smiling_face: between, > * slack > * github project > * github PR > If any proposal is not documentation etc.. feel free to open another challenge and assign yourself.

stefano mangiola (03:09:33) (in thread): > for example > * more examples would be still documentation challenge > * BiocCheck might be another challenge > * unit tests might be another challenge > * new function might be another challenge

Helena L. Crowell (03:11:16) (in thread): > You’re correct, sorry for the confusion. I’ll try and organize the above into challenges with some description on the GH project page

stefano mangiola (03:11:31) (in thread): > (some edits above)

2023-08-01

Michael Love (15:24:13) (in thread): > (this is fantastic) > > I’ve been tagging everything to open challenges, it’s a great way to organize this collective effort. Any member’s repo is fair game for the issue. I’ve already had a case where an open challenge solution became a new repo, and then i just moved the issue to that repo

Michael Love (15:28:15): > Michael Schubert asks: > > Do you have a design document (for goals and scope) in addition to the GitHub issues you link? > I’ve added the drafted abstract to the open challenges project page: - File (PNG): Screenshot 2023-08-01 at 3.27.00 PM.png

2023-08-03

Jacques SERIZAY (08:32:05): > @Jacques SERIZAY has joined the channel

Michael Love (09:57:46) (in thread): > @Laurent Gatto, I wanted to ask: is there someone from your group or a suggestion of someone from the Bioc proteomics community that may want to work on an API similar to tidySE etc. for MsExperiment? > > I think @stefano mangiola has a formula that can be applied to allow for tidy operations, e.g. > > lmse |> filter(injection_idx == 1) > > # instead of > > lmse[ , sampleData(lmse)$injection_idx == 1] >
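As a toy illustration of the contrast above (pipe-friendly filter versus bracket subsetting), here is a base-R sketch. Everything in it is an assumption for illustration: `toy` is a plain list, not an MsExperiment, and `filter_samples()` is a hypothetical helper, not part of tidySE, MsExperiment or any other package.

```r
# Toy container: an assay matrix plus a per-sample table with matching names.
toy <- list(
  assay = matrix(1:12, nrow = 3, dimnames = list(NULL, paste0("s", 1:4))),
  sample_data = data.frame(
    injection_idx = c(1, 2, 1, 2),
    row.names = paste0("s", 1:4)
  )
)

# Hypothetical helper: evaluate the condition inside the sample table, then
# subset assay columns and sample rows together so they stay in sync.
filter_samples <- function(obj, condition) {
  keep <- eval(substitute(condition), obj$sample_data, parent.frame())
  list(
    assay = obj$assay[, keep, drop = FALSE],
    sample_data = obj$sample_data[keep, , drop = FALSE]
  )
}

# Pipe-friendly form, analogous in spirit to lmse |> filter(injection_idx == 1)
first_injections <- toy |> filter_samples(injection_idx == 1)
```

The helper takes the container in and returns the same kind of container, so calls can be chained with pipes; a real implementation would also have to deal with backends and efficiency.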

Michael Love (09:59:01) (in thread): > even just a small movement towards tidy operations on the Bioc MS objects would look good i think, to show we aren’t just focused on transcriptomics

Laurent Gatto (12:03:33) (in thread): > 1. We have plenty of such filtering functions readily available. > 2. Implementing a general filter() is trivial, but can become inefficient. We have different backends that should take care of the different implementations. > Can discuss more in 2-3 weeks, when back from holidays.

Abdullah Al Nahid (14:31:57): > tidyomics logo voting thread: > > I landed on this font for the new tidyomics logo. > Let’s vote: :balloon: for sentence case :star: for lower case - File (PNG): tidyomics.png

Ritika Giri (15:59:49): > @Ritika Giri has joined the channel

stefano mangiola (21:25:25) (in thread): > Amazing! as logo I was also thinking about a graphical hex logo. > > But this would be great in many cases as well.

2023-08-04

Abdullah Al Nahid (02:50:49) (in thread): > @stefano mangiola I haven’t tried out with any symbols yet. Just playing with the typography so far

Abdullah Al Nahid (04:52:10) (in thread): > lowercase text is settled then :white_check_mark:

Michael Love (06:46:00) (in thread): > just dropping this into the xcf template, we can take iterations on colors/design, etc. etc.

Michael Love (06:46:09) (in thread): - File (Binary): tidyomics.xcf

Michael Love (06:47:04) (in thread): - File (PNG): tidyomics.png

Abdullah Al Nahid (06:49:58) (in thread): > Thanks @Michael Love

Michael Love (06:52:49) (in thread): > for the border, I recommend just adjusting with hue/saturation bc otherwise the edges become jagged (if you use color selection)

Abdullah Al Nahid (06:53:18) (in thread): > Noted.

Abdullah Al Nahid (06:57:44) (in thread): > lots of inspiration here: https://github.com/Bioconductor/BiocStickers

stefano mangiola (07:08:59) (in thread): > There are some folks (I don’t remember who) who do these amazing graphical stickers, with flat and elegant style. > > There are some hex stickers at bioc2023 we could ask also for their help

Abdullah Al Nahid (07:36:23) (in thread): > I thought the issue was about tidyomics logo, not sticker. Sticker and logo are two different things for me. By the way, I found Giotto suite stickers presented at bioc2023 here: https://giottosuite.readthedocs.io/en/latest/_images/GiottoSuiteWebsite-07.svg Kindly provide any reference that I can follow for such styles. Thanks @stefano mangiola - File (PNG): image.png

Michael Love (08:17:25) (in thread): > the xcf i sent is just the template, you can make it flat style also

Michael Love (08:17:43) (in thread): > I tried to make it gaudy colors to emphasize this is not a design suggestion :slightly_smiling_face:

Abdullah Al Nahid (09:03:43) (in thread): > Haha, it’s fine. I will try flat, gradient, anything I feel might look good. Will then ask for feedback and finally, if required, voting

Abiud Cantu (10:04:59): > @Abiud Cantu has joined the channel

Joost Groot (10:05:24): > @Joost Groot has joined the channel

Jianhong (10:09:12): > @Jianhong has joined the channel

Ray Su (10:48:09): > @Ray Su has joined the channel

Scott Norton (10:50:11): > @Scott Norton has joined the channel

Brian Gural (10:50:13): > @Brian Gural has joined the channel

Sowmya Parthiban (10:50:51): > @Sowmya Parthiban has joined the channel

Cindy Fang (10:51:13): > @Cindy Fang has joined the channel

Francesc Català-Moll (13:48:49): > @Francesc Català-Moll has joined the channel

Lucio Queiroz (14:23:22): > @Lucio Queiroz has joined the channel

2023-08-06

Michael Love (11:25:12): > Exciting news: @Ming Tang has offered us the tidyomics name on GitHub! :tada: Thanks Ming! > > Stefano and I will move over projects/links etc., just to explain that some links may be moving tidybiology -> tidyomics in the next few weeks

Ming Tang (11:25:16): > @Ming Tang has joined the channel

Michael Love (11:38:01): > So at this moment, open issues are in two places. That’s ok, bc the Issues are anyway tied to repos. So Stefano and I just need to move these over one by one (the Assignees, Labels, etc. will come over automatically). - File (PNG): Screenshot 2023-08-06 at 11.21.51 AM.png

Spencer Nystrom (14:21:49) (in thread): > I think you *should* be able to move these with the GitHub API.

Spencer Nystrom (14:23:05) (in thread): > https://stackoverflow.com/questions/57024087/github-api-how-to-move-an-issue-to-a-project - Attachment (Stack Overflow): GitHub API - how to move an issue to a project? > There are several ways to move a Github issue to a Project board through the GitHub user interface, but there doesn’t seem to be any way to do this via the API (either v3 or v4). Is this missing

Michael Love (16:51:49) (in thread): > all the items are issues, it’s pretty quick to just start typing, they auto-complete

Michael Love (16:58:02) (in thread): > I’ve moved all of mine over - File (PNG): Screenshot 2023-08-06 at 4.57.42 PM.png

stefano mangiola (17:39:36): > @Ming Tang in tha house!

2023-08-07

Kevin Rue-Albrecht (11:09:26): > FYI stay tuned with the CAB for Hacktoberfest. We’ve been talking about organising a Bioconductor event this year. The open challenges would be a great headline for it!

2023-08-09

Charlotte Soneson (03:40:42): > :wave: I wonder if the issue identified in https://github.com/stemangiola/tidySummarizedExperiment/issues/70 is another candidate for a challenge (I couldn’t see it in the list) :slightly_smiling_face: - I think get_count_datasets() might also need a check for the row names being consistent between assays, and a decision of what to do if the dimnames of the assays don’t agree with those of the SE. - Attachment: #70 Can’t convert my TreeSummarizedExperiment to a tibble > I used to be able to transform my tse into tidy format by calling tidySummarizedExperiment::as_tibble (don’t remember the version), but it doesn’t work anymore. Wondering if this could be a bug or just some formatting I need to do in my data. I’d appreciate any help. Thanks. > > ``> library(curatedMetagenomicData) > #> Loading required package: SummarizedExperiment > #> Loading required package: MatrixGenerics > #> Loading required package: matrixStats > #> > #> Attaching package: 'MatrixGenerics' > #> The following objects are masked from 'package:matrixStats': > #> > #> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, > #> colCounts, colCummaxs, colCummins, colCumprods, colCumsums, > #> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, > #> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, > #> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, > #> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, > #> colWeightedMeans, colWeightedMedians, colWeightedSds, > #> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, > #> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, > #> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, > #> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, > #> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, > #> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, > #> rowWeightedMads, rowWeightedMeans, rowWeightedMedians, 
> #> rowWeightedSds, rowWeightedVars > #> Loading required package: GenomicRanges > #> Loading required package: stats4 > #> Loading required package: BiocGenerics > #> > #> Attaching package: 'BiocGenerics' > #> The following objects are masked from 'package:stats': > #> > #> IQR, mad, sd, var, xtabs > #> The following objects are masked from 'package:base': > #> > #> anyDuplicated, aperm, append, as.data.frame, basename, cbind, > #> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, > #> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, > #> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, > #> Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, > #> table, tapply, union, unique, unsplit, which.max, which.min > #> Loading required package: S4Vectors > #> > #> Attaching package: 'S4Vectors' > #> The following object is masked from 'package:utils': > #> > #> findMatches > #> The following objects are masked from 'package:base': > #> > #> expand.grid, I, unname > #> Loading required package: IRanges > #> Loading required package: GenomeInfoDb > #> Loading required package: Biobase > #> Welcome to Bioconductor > #> > #> Vignettes contain introductory material; view with > #> 'browseVignettes()'. To cite Bioconductor, see > #> 'citation("Biobase")', and for packages 'citation("pkgname")'. 
> #> > #> Attaching package: 'Biobase' > #> The following object is masked from 'package:MatrixGenerics': > #> > #> rowMedians > #> The following objects are masked from 'package:matrixStats': > #> > #> anyMissing, rowMedians > #> Warning: replacing previous import 'S4Arrays::read_block' by > #> 'DelayedArray::read_block' when loading 'SummarizedExperiment' > #> Loading required package: TreeSummarizedExperiment > #> Loading required package: SingleCellExperiment > #> Loading required package: Biostrings > #> Loading required package: XVector > #> > #> Attaching package: 'Biostrings' > #> The following object is masked from 'package:base': > #> > #> strsplit > library(tidySummarizedExperiment) > #> > #> Attaching package: 'tidySummarizedExperiment' > #> The following object is masked from 'package:XVector': > #> > #> slice > #> The following object is masked from 'package:IRanges': > #> > #> slice > #> The following object is masked from 'package:S4Vectors': > #> > #> rename > #> The following object is masked from 'package:matrixStats': > #> > #> count > #> The following object is masked from 'package:stats': > #> > #> filter > dataset_name <- "HallAB_2017.relative_abundance" > tse <- curatedMetagenomicData( > pattern = dataset_name, > dryrun = FALSE, rownames = 'NCBI', > counts = TRUE > )[[1]] > #> > #> $2021-10-14.HallAB_2017.relative_abundance> #> dropping rows without rowTree matches: > #> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Atopobiaceae|g__Olsenella|s__Olsenella_profusa > #> k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Collinsella|s__Collinsella_stercoris > #> k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Carnobacteriaceae|g__Granulicatella|s__Granulicatella_elegans > #> k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_champanellensis > #> 
k__Bacteria|p__Firmicutes|c__Erysipelotrichia|o__Erysipelotrichales|f__Erysipelotrichaceae|g__Bulleidia|s__Bulleidia_extructa > #> k__Bacteria|p__Proteobacteria|c__Betaproteobacteria|o__Burkholderiales|f__Sutterellaceae|g__Sutterella|s__Sutterella_parvirubra > #> k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Cloacibacillus|s__Cloacibacillus_evryensis > tse > #> class: TreeSummarizedExperiment > #> dim: 503 259 > #> metadata(1): agglomerated_by_rank > #> assays(1): relative_abundance > #> rownames(503): 853 820 ... 172901 1262744 > #> rowData names(7): superkingdom phylum ... genus species > #> colnames(259): p8582_mo1 p8582_mo10 ... SKST041_2_G103027 > #> SKST041_3_G103028 > #> colData names(24): study_name subject_id ... HBI SCCAI > #> reducedDimNames(0): > #> mainExpName: NULL > #> altExpNames(0): > #> rowLinks: a LinkDataFrame (503 rows) > #> rowTree: 1 phylo tree(s) (10430 leaves) > #> colLinks: NULL > #> colTree: NULL > class(tse) > #> [1] "TreeSummarizedExperiment" > #> attr(,"package") > #> [1] "TreeSummarizedExperiment" > tidy_tse <- tidySummarizedExperiment::as_tibble(tse) > #> Error in `map2()`: > #> :information_source: In index: 1. > #> :information_source: With name: relative_abundance. > #> Caused by error in `.x[rownames(se), , drop = FALSE]`: > #> ! subscript out of bounds > #> Backtrace: > #> ▆ > #> 1. ├─tidySummarizedExperiment::as_tibble(tse) > #> 2. ├─tidySummarizedExperiment:::as_tibble.SummarizedExperiment(tse) > #> 3. │ └─tidySummarizedExperiment:::.as_tibble_optimised(...) > #> 4. │ └─tidySummarizedExperiment:::get_count_datasets(x) > #> 5. │ ├─... %>% ... > #> 6. │ └─purrr::map2(...) > #> 7. │ └─purrr:::map2_("list", .x, .y, .f, ..., .progress = .progress) > #> 8. │ ├─purrr:::with_indexed_errors(...) > #> 9. │ │ └─base::withCallingHandlers(...) > #> 10. │ ├─purrr:::call_with_cleanup(...) > #> 11. │ └─tidySummarizedExperiment (local) .f(.x[[i]], .y[[i]], ...) > #> 12. ├─purrr::when(...) > #> 13. 
├─purrr::when(...) > #> 14. └─purrr (local)(`) > #> 15. └─cli::cli_abort(…) > #> 16. └─rlang::abort(…) > sessioninfo::session_info() > #> ─ Session info ─────────────────────────────────────────────────────────────── > #> setting value > #> version R version 4.3.0 (2023-04-21) > #> os Pop!_OS 22.04 LTS > #> system x86_64, linux-gnu > #> ui X11 > #> language (EN) > #> collate en_US.UTF-8 > #> ctype en_US.UTF-8 > #> tz America/New_York > #> date 2023-05-15 > #> pandoc 2.19.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown) > #> > #> ─ Packages ─────────────────────────────────────────────────────────────────── > #> package * version date (UTC) lib source > #> AnnotationDbi 1.62.1 2023-05-02 [1] Bioconductor > #> AnnotationHub 3.8.0 2023-04-25 [1] Bioconductor > #> ape 5.7-1 2023-03-13 [1] CRAN (R 4.3.0) > #> beachmat 2.16.0 2023-04-25 [1] Bioconductor > #> beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.3.0) > #> Biobase * 2.60.0 2023-04-25 [1] Bioconductor > #> BiocFileCache 2.8.0 2023-04-25 [1] Bioc…

stefano mangiola (07:00:34) (in thread): > That’s quite a new issue :slightly_smiling_face: Doing a check and throwing an informative error might be a clean and simple idea. > > Avoiding reliance on rowname matching could be another idea. But do we really want to merge assays whose feature IDs do not match the SE feature IDs or other assays’?

Charlotte Soneson (07:27:19) (in thread): > Good question :sweat_smile: Throwing an error if the dimnames of any assay differ from those of the SE, and leaving it to the user to fix it (and thus make the decision), may indeed be the safest.
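
The check discussed in this thread can be sketched in base R. This is only an illustration under stated assumptions: `check_assay_dimnames()` is a hypothetical name, it works on a plain list of matrices rather than a real SummarizedExperiment, and it is not the fix that was merged into tidySummarizedExperiment. It is also deliberately strict, so an assay without dimnames fails as well, which may not be what the package should do.

```r
# Hypothetical helper (not part of tidySummarizedExperiment): stop with an
# informative error when any assay's dimnames differ from the container's.
check_assay_dimnames <- function(assays, se_rownames, se_colnames) {
  for (i in seq_along(assays)) {
    # Use the assay name when available, otherwise its position
    label <- if (is.null(names(assays))) as.character(i) else names(assays)[i]
    a <- assays[[i]]
    if (!identical(rownames(a), se_rownames) ||
        !identical(colnames(a), se_colnames)) {
      stop(
        "dimnames of assay '", label, "' differ from those of the container; ",
        "please make them consistent before converting to a tibble",
        call. = FALSE
      )
    }
  }
  invisible(TRUE)
}

# Toy data: one consistent assay and one with a mismatching feature ID
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
bad <- counts
rownames(bad) <- c("g1", "g2", "gX")

ok <- check_assay_dimnames(list(counts = counts),
                           rownames(counts), colnames(counts))
msg <- tryCatch(
  check_assay_dimnames(list(counts = counts, bad = bad),
                       rownames(counts), colnames(counts)),
  error = function(e) conditionMessage(e)
)
```

Here `ok` is `TRUE` and `msg` holds the error text naming the offending assay, so the user decides how to reconcile the IDs.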

stefano mangiola (08:23:33) (in thread): > Now this issue is indeed part of the challenges!

Jenny Drnevich (10:38:24) (in thread): > How do you generally handle the discrepancy between tidyverse’s “rownames should not exist!” and SummarizedExperiment’s “we use rownames as the keys”?

stefano mangiola (18:08:30) (in thread): > if rownames exist they are mapped to the .feature column of the tibble representation. > > if no rownames are found anywhere (SE or assay), we assign an incremental numeric ID (the row number) in the .feature column. > > the SE is never touched.
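
The fallback rule described above can be sketched in base R. `feature_ids()` is a hypothetical name, and the order of preference (container rownames first, then the assay's) is an assumption made for illustration, not a description of the package internals.

```r
# Hypothetical helper: choose the values for a `.feature`-like column.
# Assumed order: container rownames, then assay rownames, then row numbers.
feature_ids <- function(se_rownames, assay) {
  if (!is.null(se_rownames)) return(se_rownames)
  if (!is.null(rownames(assay))) return(rownames(assay))
  seq_len(nrow(assay))  # incremental numeric ID, i.e. the row number
}

unnamed <- matrix(0, nrow = 3, ncol = 2)
named <- unnamed
rownames(named) <- c("g1", "g2", "g3")

ids_from_se    <- feature_ids(c("a", "b", "c"), named)  # container names win
ids_from_assay <- feature_ids(NULL, named)              # falls back to the assay
ids_numbered   <- feature_ids(NULL, unnamed)            # falls back to 1, 2, 3
```

The input objects are only read, never modified, which mirrors the statement that the SE itself is never touched.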

2023-08-15

stefano mangiola (01:07:39): > @Helena L. Crowell and @William Hutchison are on fire! - File (PNG): image.png

stefano mangiola (01:08:16): > And so much in progress! > > Together we go! - File (PNG): image.png

Michael Love (09:06:29): > I’ll work on moving these over to https://github.com/orgs/tidyomics/projects/1/views/1

Michael Love (09:08:01): > We should start to make sure we’ve got all interested parties as Assignees so we can then make a list with affiliations etc. I’ll also start working on this today

Kevin Rue-Albrecht (11:10:06): > Keep some for Hacktoberfest:wink:

Michael Love (12:00:10): > I’ve moved all the ones I have the ability to: tidyomics <-- tidybiology - File (PNG): Screenshot 2023-08-15 at 11.59.42 AM.png

Michael Love (12:00:37): > the rest need to be moved by @stefano mangiola or @William Hutchison

Michael Love (12:00:51): > you just go to https://github.com/orgs/tidyomics/projects/1/views/1 and start typing the # then the repo name

stefano mangiola (15:59:12) (in thread): > there will be plenty!

stefano mangiola (19:20:11): > I moved them all. Now tidybiology is almost empty except for the bioc2023 workshop, which we should move, and then we can probably delete the organisation (?)

2023-08-16

Michael Love (09:03:17): > yes I’ll move the workshop as well

Michael Love (09:06:55): > yes i think we can release the organization now

Michael Love (09:07:01): > otherwise it will just cause confusion

Andres Wokaty (14:37:36): > @Andres Wokaty has joined the channel

Alex Mahmoud (14:37:52): > @Alex Mahmoud has joined the channel

Robert Shear (14:40:11): > @Robert Shear has joined the channel

2023-08-18

Victor Yuan (12:35:40): > @Victor Yuan has joined the channel

2023-08-21

stefano mangiola (03:13:12): > Congrats to @Charlotte Soneson on her first contribution! Amazing work - Attachment: #78 Bugfix nonmatching dimnames > Here’s a first attempt to address #70, as well as increasing the consistency of utilities.R in terms of indentation, spaces and assignment operators.
> Currently the situation where assays are unnamed is not handled well by get_count_datasets(). Do we want to require named assays? At the moment, adding a check for this makes nest() fail (even if the assays are in fact named).

2023-08-24

Lachlan Baer (01:20:20): > @Lachlan Baer has joined the channel

Leo Lahti (17:14:21): > @Leo Lahti has joined the channel

Leo Lahti (17:15:17) (in thread): > Hi! I only saw this now. People on the #miaverse channel might also be able to help.

2023-08-26

chilam (11:47:47): > @chilam has joined the channel

Michael Love (16:22:57): > As part of one of the challenges, @Abdullah Al Nahid has submitted easylift to Bioconductor :tada: way to go! > > This is a function that helps liftover ranges from one build to another and also pulls down seqinfo using GenomeInfoDb. https://github.com/nahid18/easylift

2023-08-27

Abdullah Al Nahid (02:53:46) (in thread): > Thank you so much! This has been such a feel-good project for me. I loved writing my first R/Bioconductor library.

2023-08-29

Jacques SERIZAY (12:51:11): > @Eric Davis and I are thinking of getting started with plyinteractions to port tidy operations to GInteractions. Any place recommended? Would the tidyomics organization be ok hosting the repo on GitHub? Or should we have it on one of our personal accounts?

Michael Love (12:59:58): > tidyomics fine with us, Jacques what’s your github handle

Jacques SERIZAY (13:00:36): > I’m js2264 on github

Jacques SERIZAY (13:47:23): > Thx!

2023-08-31

stefano mangiola (20:46:27): > #tidyomics friends! There are 20+ open challenges still waiting for a champion. Join our effort, solve one (or many), and join our upcoming publication! - File (PNG): image.png

2023-09-01

Kwangwoon (Jon) Lee (09:39:12): > @Kwangwoon (Jon) Lee has joined the channel

2023-09-05

Richard White (10:27:31): > @Richard White has joined the channel

2023-09-08

Jacques SERIZAY (08:25:18): > Hi all, I have been working on adapting dplyr verbs to GInteractions objects as part of the tidyomics project: https://github.com/tidyomics/plyinteractions. > As it is, it fully supports the dplyr core verbs: mutate, group_by, count/tally, summarize, select, filter, slice, rename, with tidy evals, so people can manipulate/modify them just like they would operate with tabular data. > > Compared to plyranges, it still lacks the join_* and anchor_* + arithmetic function families. To support these, an idea I had would be to implement a new GInteractions flavor to specify which anchors are “hooked” up (anchors1 or anchors2), and then forward plyranges operations (e.g. stretch, shift, …, join_*) to that set of hooked up anchors. This would allow all the arithmetic functions from plyranges and many of the overlap methods to work seamlessly with GInteractions in chained operations. Does this approach make sense? I am looking for a name for this type of class (and associated setter), do you have any suggestions? I am thinking of: > 1. :hook: HookedGInteractions, hook and unhook > 2. :lock: LockedGInteractions, lock and unlock > 3. :dart: FocusedGInteractions, focus and unfocus > 4. :round_pushpin: PinnedGInteractions, pin and unpin > Not sure which one would be the most intuitive. If you have an opinion, would you mind voting by emoji? Any other suggestion is welcome of course :slightly_smiling_face: Thanks!

2023-09-09

Michael Love (06:42:15) (in thread): > Nice approach!

2023-09-10

Michael Love (12:00:00): > @Stuart Lee in case you have thoughts, see Jacques’ approach to extending functionality for GInteractions :point_up:

2023-09-13

Jacques SERIZAY (08:07:31): > Hi all, me again with plyinteractions. The package now ports the most important dplyr and plyranges functions (IMHO) to GInteractions, everything is referenced in the pkgdown website: https://tidyomics.github.io/plyinteractions/, with extensive examples and a rather long vignette. I’m happy with the state of it as it currently is and would be ready to submit it to Bioc, but I first wanted to see if #tidiness_in_bioc peeps have comments/suggestions prior to formal review. Please do let me know if you feel like some functionalities are missing! :hugging_face: - Attachment (tidyomics.github.io): Extending tidy verbs to genomic interactions > A dplyr-like interface for interacting with the common Bioconductor > class GInteractions. By providing a grammatical > and consistent way of manipulating these classes their accessibility for new > Bioconductor users is hopefully increased.

2023-09-14

stefano mangiola (07:37:10): > #tidyomics is officially live! Thanks to the whole community, we are doing great! - Attachment (X (formerly Twitter)): Stefano Mangiola on X > :tada::broom: The #tidyomics ecosystem is official! > > Into #omic data analysis? Spanning #Seurat @Bioconductor #SCE, #SE, #GRanges, #Citometry? > Now, just use #tidyverse ! > > Co-led with @mikelove and @TonyPapenfuss @WEHI_research #singlecell > > The preprint https://t.co/jGEy2BVj5C

2023-09-15

Leo Lahti (04:56:48): > @Leo Lahti has joined the channel

2023-09-16

Laurent Gatto (00:22:18): > Hello - Following up from some interactions with @Michael Love on slack and the tidyomics pre-print, don’t the following examples address the desire for tidy proteomics? This code chunk is taken from the sager vignette, but there are many other examples, for (raw) mass spectrometry data processing or quantitative data processing. > > qf |> > filterFeatures(~ label > 0) |> ## 1 > filterFeatures(~ rank == 1) |> ## 2 > filterFeatures(~ spectrum_fdr < 0.05) |> ## 3 > zeroIsNA(1:3) |> ## 4 > logTransform(i = 1:3, ## 5 > name = paste0("log_", names(qf))) |> > aggregateFeaturesOverAssays(i = 4:6, ## 6 > fcol = "peptide", > name = sub("psm", "peptide", names(qf)), > fun = colMedians, > na.rm = TRUE) |> > joinAssays(i = 7:9, ## 7 > name = "peptides") |> > normalize(i = 10, ## 8 > name = "norm_peptides", > method = "center.median") |> > aggregateFeatures(i = "norm_peptides", ## 9 > name = "proteins", > fcol = "proteins", > fun = colMedians, > na.rm = TRUE) > > Here, the qf variable is a QFeatures object (a specialised MultiAssayExperiment) composed of multiple SummarizedExperiments (or SingleCellExperiment, if dealing with single cell proteomics), and each of these functions updates that qf variable, i.e. takes a QFeatures as input and returns an updated one. We typically use specialised vocabulary, but some more general verbs could be used in some instances. > > Note also that more generally, use cases using SE or SCE also apply to proteomics, given that these data structures can also be applied to quantitative proteomics. - Attachment (uclouvain-cbio.github.io): Using the sager package to import and analyse sage results in R > sager - Attachment (rformassspectrometry.github.io): Spectra Infrastructure for Mass Spectrometry Data > The Spectra package defines an efficient infrastructure for storing and handling mass spectrometry spectra and functionality to subset, process, visualize and compare spectra data. 
It provides different implementations (backends) to store mass spectrometry data. These comprise backends tuned for fast data access and processing and backends for very large data sets ensuring a small memory footprint. - Attachment (rformassspectrometry.github.io): Quantitative features for mass spectrometry data > The QFeatures infrastructure enables the management and processing of quantitative features for high-throughput mass spectrometry assays. It provides a familiar Bioconductor user experience to manage quantitative data across different assay levels (such as peptide spectrum matches, peptides and proteins) in a coherent and tractable format. - Attachment (uclouvain-cbio.github.io): Mass Spectrometry-Based Single-Cell Proteomics Data Analysis > Utility functions for manipulating, processing, and analyzing mass spectrometry-based single-cell proteomics data. The package is an extension to the QFeatures package and relies on SingleCellExperiment to enable single-cell proteomics analyses. The package offers the user the functionality to process quantitative tables (as generated by MaxQuant, Proteome Discoverer, and more) into data tables ready for downstream analysis and data visualization.

Stephanie Hicks (17:45:07): > @Stephanie Hicks has joined the channel

2023-09-17

Stephanie Hicks (05:58:04): > hello there! I apologize if I missed it, but does tidyomics have its own hex sticker (similar to tidyverse)?

Michael Love (12:10:38) (in thread): > https://github.com/orgs/tidyomics/projects/1?pane=issue&itemId=35945742

Michael Love (12:15:15) (in thread): > this is awesome, and it would be great to have pointers across these projects that provide similar functionality. > > one thing that is tying together the packages we describe in the preprint is common grammar and a common abstraction which is in “tidy” format - File (PNG): Screenshot 2023-09-17 at 12.14.23 PM.png

Michael Love (12:16:17) (in thread): > with the packages you list, do you / can you apply mutate to add new column variables, group_by for aggregation by column variables, etc.?

stefano mangiola (18:05:06) (in thread): > We still need to define a design for the sticker. There was a sticker at Bioc2023 with an amazing design, and I am wondering who the designer was. > > I could not find that sticker online :disappointed:

stefano mangiola (18:07:03) (in thread): > Hello All, > > yes the concept we are pushing, is > same data interface (tibble, but keeping the original object intact) > and same manipulation functions, extending the existing tidyverse methods

Stephanie Hicks (20:51:46) (in thread): > Ah, thank you for linking to that. I should have looked there. No worries, I just wanted to make sure I wasn’t missing something obvious. it’s been a long weekend :upside_down_face:

2023-09-18

Michael Love (08:34:19) (in thread): > I think we can interface with existing packages that promote clean, easily-readable code by linking to and from and calling these “related work” etc., but for the open challenges and the body of software described in the preprint, this is what we are trying to build out (what Stefano said).

Michael Love (08:47:54) (in thread): > this looks fantastic, i love the diagrams in the vignette

Michael Love (08:49:19) (in thread): > two suggestions, here you can maybe add more context: > > This is different from the anchor term used in plyranges. This is due to the fact that “anchor” is used in the chromatin interaction field to refer to the ends of a potential chromatin loop.

Michael Love (08:51:08) (in thread): > other suggestion: > on the README, maybe a bit more about the motivation of the package, something along the lines of: plyinteractions provides a consistent interface for importing and wrangling genomic interactions from pairs and bedpe files into GInteractions in R. (copied verbatim) > > While plyranges operates on genomic ranges (GRanges objects), and allows for application of a tidy grammar for manipulation, GInteractions objects are more complex in that each observation (row) corresponds to a pair of two GRanges… etc.

Michael Love (08:51:57) (in thread): > great! - File (PNG): image.png

stefano mangiola (18:21:30) (in thread): > I’m thinking that pushing harder for tidyMultiAssayExperiment would be very relevant here, as it might take care of data representation and manipulation, while very specialised operations could be done outside

stefano mangiola (19:57:13) (in thread): > (e.g. tidySummarizedExperiment vs tidybulk)

2023-09-19

Jacques SERIZAY (04:20:49) (in thread): > Thanks, I’ve added more content when needed, following your suggestions! I’ll submit the package for review by Bioc in the coming days

Artur Sannikov (08:28:24): > @Artur Sannikov has joined the channel

2023-09-20

Artur Sannikov (08:00:52): > Hi, > I’m trying to merge a dataset I have into the colData DFrame of a TreeSummarizedExperiment (which in my case is more like a SummarizedExperiment with 0 tree data). To do this, I need to create a new column with mutate in colData, by which I’ll then join the two datasets. DFrame does not support mutate, and after using various conversions to make it behave, when reassigning the dataset back to colData, I get this error: > > Error in h(simpleError(msg, call)) : > error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': subscript is a NSBS object that is incompatible with the current > subsetting operation > > I opted for tidySummarizedExperiment to solve the issue. The package does have a SummarizedExperiment-tibble abstraction, the pasilla object, which is nice to work with. For example, I can manipulate colData in pasilla: > > library(dplyr) > library(tidySummarizedExperiment) > > pasilla <- tidySummarizedExperiment::pasilla > treated_subset <- pasilla |> mutate("end" = substr(type, 8, 10)) > > colData(treated_subset) > #> DataFrame with 7 rows and 3 columns > #> condition type end > #> <character> <character> <character> > #> untrt1 untreated single_end end > #> untrt2 untreated single_end end > #> untrt3 untreated paired_end end > #> untrt4 untreated paired_end end > #> trt1 treated single_end end > #> trt2 treated paired_end end > #> trt3 treated paired_end end > > However, in the docs, I can only see that we can convert an object to a tibble and manipulate it, but then it’s not a SummarizedExperiment-tibble abstraction anymore, so it does not have any SE functions. > > Is there a way to create an object similar to pasilla from my TSE and work with that?

stefano mangiola (08:08:40): > could you do (?) > > TreeSummarizedExperiment |> mutate(new_column…) |> left_join(other_dataset |> as_tibble())

Artur Sannikov (08:29:29): > Thanks, Stefano! I’ll have to test the merged dataset but the merging step worked:slightly_smiling_face:

stefano mangiola (08:33:17): > you don’t need columns with the same name, you can do > > TreeSummarizedExperiment |> left_join(other_dataset |> as_tibble(), by=join_by(column_a == column_b))

Artur Sannikov (08:35:05): > Sure, but before I needed to create a new column with substr to join the two dataframes

Vince Carey (12:55:07): > @Michael Love @Stuart Lee @Michael Lawrence is there anything like a delayed GRanges available? A really large GRanges might take too long to load … I am thinking of a duckdb/parquet back end with all the nice GRanges operations supported

Michael Lawrence (13:20:07): > @Sanchit Saini is working on something like this, carrying on from work done by @Stuart Lee (the “query” framework). And I did something like that many years ago. I had it in bioc svn but it was never released as a package and therefore might be gone forever. I think @Martin Morgan was working with someone on a DelayedGRanges, which I think is what stimulated the factoring out of the GenomicRanges class to enable this sort of thing. I should also note that @Aaron Lun is working on a parquet-backed DataFrame right now.

Sanchit Saini (13:20:18): > @Sanchit Saini has joined the channel

Aaron Lun (13:20:18): > @Aaron Lun has joined the channel

Vince Carey (13:22:22): > Good to know. Providing performant access to the AlphaMissense results is a motivating use case.

Aaron Lun (14:20:23): > https://github.com/LTLA/ParquetDataFrame/blob/master/R/ParquetColumnSeed.R doesn’t actually have the DF class itself yet.

Aaron Lun (14:20:32): > but the example above should give you an idea of the concept.

Ludwig Geistlinger (14:23:13): > I think this is also of interest for @Jiaji George Chen and @Ruben Dries, who also work on a duckdb/parquet backend for data frames, e.g. for large molecule data from spatial platforms.

Jiaji George Chen (14:23:17): > @Jiaji George Chen has joined the channel

Ruben Dries (14:23:18): > @Ruben Dries has joined the channel

Timothy Keyes (18:16:10): > @Timothy Keyes has joined the channel

Sean Davis (18:21:41): > There has been quite a bit of work done in genomic file indexing and performant access. Tabix is one approach based on text files: https://rdrr.io/bioc/Rsamtools/man/TabixFile-class.html. If the data contain information that can be stored as VCF, then the binary equivalent, BCF, may be relevant. Again, there is Bioconductor tooling for this: https://rdrr.io/bioc/Rsamtools/man/BcfFile-class.html. - Attachment (rdrr.io): TabixFile-class: Manipulate tabix indexed tab-delimited files. in Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import > Use TabixFile() to create a reference to a Tabix file (and its > index). Once opened, the reference remains open across calls to > methods, avoiding costly index re-loading. > TabixFileList() provides a convenient way of managing a list of > TabixFile instances. - Attachment (rdrr.io): BcfFile-class: Manipulate BCF files. in Rsamtools: Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import > Use BcfFile() to create a reference to a BCF (and optionally > its index). The reference remains open across calls to methods, > avoiding costly index re-loading. > BcfFileList() provides a convenient way of managing a list of > BcfFile instances.

Sean Davis (18:27:27): > Parquet files are columnar (ie., each column is stored together, rather than having rows stored together). They support relatively easy “projection pushdown,” a fancy term meaning that we can limit reading parquet files to only those COLUMNS that we want. However, achieving “predicate pushdown,” a fancy term for passing along the “where” or “filter” clause of a query, is not as straightforward since parquet files don’t include “indexes” for filtering “rows.” Furthermore, the performance of “predicate pushdown” will vary based on the filter applied and its relation to the “ordering” of the file on disk.
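
The point about filter performance depending on the on-disk ordering can be illustrated with a toy model in base R. It assumes a reader that keeps only a minimum and maximum per chunk of rows and skips chunks whose range cannot match the filter, which is roughly what Parquet row-group statistics make possible. `chunks_to_scan()` is a hypothetical function, and this is not how DuckDB or any Parquet reader is actually implemented.

```r
# Toy model: count how many fixed-size chunks a range filter must read when
# only per-chunk min/max summaries are available (no row-level index).
chunks_to_scan <- function(pos, chunk_size, lo, hi) {
  chunk_id  <- ceiling(seq_along(pos) / chunk_size)
  chunk_min <- tapply(pos, chunk_id, min)
  chunk_max <- tapply(pos, chunk_id, max)
  # A chunk can be skipped only if its [min, max] range misses [lo, hi]
  sum(chunk_max >= lo & chunk_min <= hi)
}

pos_sorted <- 1:1000               # positions stored in sorted order
set.seed(1)
pos_shuffled <- sample(pos_sorted) # same positions, arbitrary order on disk

n_sorted   <- chunks_to_scan(pos_sorted,   chunk_size = 100, lo = 450, hi = 460)
n_shuffled <- chunks_to_scan(pos_shuffled, chunk_size = 100, lo = 450, hi = 460)
```

With sorted positions only the single chunk covering 401 to 500 has to be read (`n_sorted` is 1). With shuffled positions almost every chunk spans the queried range, so few or none can be skipped.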

Sean Davis (18:33:23): > So, depending on the use case, it may be enough to use tabix on the positional data if the data are generic chromosomal regions. If the data are genomic variants, there may be an option to use BCF.

Sean Davis (18:35:43): > The methods I described above are pretty good for reading subsets of data into GenomicRanges in memory or processing ranges in chunks (even in parallel).

Sean Davis (18:36:55): > Finally, I’ll mention that for the alphamissense data, the largest datasets expand to only a few GB, so a modest server will allow them to be loaded to memory.

Vince Carey (19:33:19): > Yes, good to remember tabix. I will do some comparisons.

Vince Carey (19:33:32): > > > tt > class: TabixFile > path: AlphaMissense_hg19.tsv.gz > index: AlphaMissense_hg19.tsv.gz.tbi > isOpen: FALSE > yieldSize: NA > > p > GRanges object with 1 range and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] chr21 19000000-20000000 * > ------- > seqinfo: 1 sequence from an unspecified genome; no seqlengths > > microbenchmark(scanTabix(tt, param=p) |> unlist() |> (\(x)read.delim(text = x))(), times=20) > Unit: milliseconds > expr min > (function(x) read.delim(text = x))(unlist(scanTabix(tt, param = p))) 32.05145 > lq mean median uq max neval > 32.85717 33.63268 33.03526 33.28713 39.33622 20 > > microbenchmark(tbl(con, "alpmis_hg19.parquet") |> filter(`#CHROM`=="chr21", POS >= 19e6 , POS <= 20e6), times=20) > Unit: milliseconds > expr > filter(tbl(con, "alpmis_hg19.parquet"), `#CHROM` == "chr21", POS >= 1.9e+07, POS <= 2e+07) > min lq mean median uq max neval > 10.06742 10.29056 10.85911 10.99688 11.22671 12.36846 20 >

Vince Carey (19:35:05): > > > tbl(con, "alpmis_hg19.parquet") |> filter(`#CHROM`=="chr21", POS >= 19e6 , POS <= 20e6) |> count() > # Source: SQL [1 x 1] > # Database: DuckDB 0.8.1 [vincent@Linux 6.0.0-1020-oem:R 4.3.1/:memory:] > n > <dbl> > 1 10531 > > scanTabix(tt, param=p) |> unlist() |> (\(x)read.delim(text = x))() |> nrow() > [1] 10530 > > so something is amiss … but the excursion seems pointful.

Tyrone Lee (19:51:16): > @Tyrone Lee has joined the channel

Eddie Ruiz (20:22:06): > @Eddie Ruiz has joined the channel

Vince Carey (21:32:57): > ok, needed h=FALSE. also the parquet example may be unfair as it does not produce a data.frame > > > microbenchmark(tbl(con, "alpmis_hg19.parquet") |> filter(`#CHROM`=="chr21", POS >= 19e6 , POS <= 20e6) |> as.data.frame(), times=20) > Unit: milliseconds > expr > as.data.frame(filter(tbl(con, "alpmis_hg19.parquet"), `#CHROM` == "chr21", POS >= 1.9e+07, POS <= 2e+07)) > min lq mean median uq max neval > 69.27372 70.29432 72.11533 71.23101 72.33904 77.67609 20 >

2023-09-21

Sean Davis (00:30:01): > Thanks, Vince. Defining the use cases clearly will be important here. Having documentation about how to deal with large range sets (tabix, BCF, sqlite, duckdb, even fread in chunks) would be useful to reduce from a variant/range “catalog” to an in-memory GenomicRanges.

Michael Love (07:49:35): > for posterity, might want to move some of this into a channel about performance / backends?

Vince Carey (09:46:35): > yes, hope i didn’t hijack the channel.

Michael Love (10:23:35): > no prob!

2023-09-22

Artur Sannikov (08:31:46): > Hi, in tidySE, once you convert SE into a tibble, is it possible to convert it back into SE?

Michael Love (09:07:47): > it’s not actually converted

Michael Love (09:08:28): > See here: https://tidyomics.github.io/tidyomicsWorkshopBioc2023/articles/tidyGenomicsTranscriptomics.html#part-1-introduction-to-tidysinglecellexperiment - Attachment (tidyomics.github.io): Tidy genomic and transcriptomic single-cell analyses > tidyomicsWorkshopBioc2023

Michael Love (09:08:46): > > It creates an invisible layer that enables viewing the SingleCellExperiment object as a tidyverse tibble…

Artur Sannikov (09:19:46): > I’m also trying to remove duplicates for a specific column of a colData data frame with the distinct() function. Here’s the code. I get a one-column (“animal_id”) data frame with this warning: tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis. > > What’s the issue? > > library(TreeSummarizedExperiment) > library(tidySummarizedExperiment) > library(dplyr) > > # Generate assay data > set.seed(42) > assay_data <- matrix(rpois(200, lambda = 10), nrow = 20, ncol = 10) > > # Generate sample data > sample_data <- data.frame( > animal_id = factor(rep(1:5, each = 2)), > treated = factor(rep(c("yes", "no"), times = 5)), > disease = factor(rep(c("disease1", "disease2"), times = 5)) > ) > > # Create TSE > tse <- TreeSummarizedExperiment( > assays = list(counts = assay_data), > colData = sample_data > ) > > tse |> distinct(animal_id, .keep_all = TRUE) > #> tidySummarizedExperiment says: Key columns are missing. A data frame is returned for independent data analysis. >

Aaron Lun (09:49:21): > @Aaron Lun has left the channel

Martin Morgan (09:52:30) (in thread): > @Michael Lawrence mentioned my name a little upstream, probably referring to packages by @Qian Liu, maybe VCFArray or GDSArray (uses ‘gds’ format, which is quite performant). @Vince Carey A key reservation about a database implementation is that one would like efficient operations on ranges (findOverlaps, etc); naive implementations (like in Rsamtools::scanVcf() where the param= argument is more than one GRanges) are probably quadratic in the number of ranges in the reference and the number of ranges queried. I think one wants ‘spatial’ indexes (https://duckdb.org/docs/extensions/spatial.html ?) which I haven’t explored… - Attachment (DuckDB): Spatial > DuckDB is an in-process database management system focused on analytical query processing. It is designed to be easy to install and easy to use. DuckDB has no external dependencies. DuckDB has bindings for C/C++, Python and R.

Michael Love (11:11:39) (in thread): > I think this is the expected behavior: https://stemangiola.github.io/tidySummarizedExperiment/articles/introduction.html#tidyverse-commands - Attachment (stemangiola.github.io): Overview of the tidySummarizedExperiment package > tidySummarizedExperiment

Vince Carey (11:37:33) (in thread): > Thanks Martin! The spatial extensions are surely worth investigating…. I hope someone listening has some bandwidth.

Sean Davis (12:22:08) (in thread): > You can actually get away with a pretty simple index for genomic data. Spatial indexes will also work, but an advantage of using a simple implementation is that it is agnostic to the database backend. http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
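
The binning idea behind that link can be sketched in a few lines of base R. This is a minimal illustration of the standard five-level UCSC scheme (128 kb smallest bins, each level 8x larger), not code from any package mentioned in this thread; the function name `ucsc_bin` is made up for the example, and coordinates are assumed to be 0-based, half-open.

```r
# Standard UCSC binning scheme: five levels, smallest bins span 128 kb (2^17),
# each level up is 8x (2^3) larger. Offsets are where each level's bin
# numbers begin.
bin_offsets <- c(585, 73, 9, 1, 0)   # 512+64+8+1, 64+8+1, 8+1, 1, 0
first_shift <- 17
next_shift  <- 3

# Smallest bin that fully contains the 0-based, half-open interval [start, end).
# `ucsc_bin` is a hypothetical helper name used only for this sketch.
ucsc_bin <- function(start, end) {
  start_bin <- start %/% 2^first_shift
  end_bin   <- (end - 1) %/% 2^first_shift
  for (offset in bin_offsets) {
    if (start_bin == end_bin) return(offset + start_bin)
    start_bin <- start_bin %/% 2^next_shift
    end_bin   <- end_bin %/% 2^next_shift
  }
  stop("interval does not fit the standard scheme (limit is 512 Mb)")
}

ucsc_bin(0, 1000)        # small feature at the start of a chromosome
ucsc_bin(0, 2^17 + 1)    # crosses a 128 kb boundary, so it lands one level up
```

Storing this bin number as an ordinary indexed integer column is what makes the approach backend-agnostic: a range query first restricts to the handful of bins that could contain overlapping features, then applies the exact start/end comparison.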

2023-09-24

Martin Morgan (20:05:57) (in thread): > It turns out that it’s pretty fast to stick this stuff in duckdb (<20s for me) > > db_path <- tempfile() > hg38 <- dbConnect(duckdb(db_path)) > > sql <- r"{ > CREATE TABLE AlphaMissense AS > SELECT * FROM read_csv_auto('AlphaMissense_hg38.tsv.gz'); > }" > dbExecute(hg38, sql) > > It’s then available as dplyr::tbl(hg38, "AlphaMissense"). > > This post touts the efficiency of range-based queries in DuckDB. So I put EnsemblDB annotations into another table > > edb <- AnnotationHub::AnnotationHub()[["AH113665"]] > > annotation <- > edb |> > ensembldb::genes() |> > GenomeInfoDb::keepStandardChromosomes(pruning.mode = "coarse") |> > dplyr::as_tibble() |> > dplyr::mutate(`#CHROM` = paste0("chr", as.character(seqnames))) |> > dplyr::select(`#CHROM`, everything(), -seqnames, -entrezid) > > dbWriteTable(hg38, "annotation", annotation) > > created a subset of 1000 genes > > sql <- r"{ > CREATE TABLE annotation_1000 AS > SELECT * FROM annotation USING SAMPLE 1000; > }" > dbExecute(hg38, sql) > > and evaluated a join (71M AlphaMissense x 1000 EnsemblDb) into a temporary table olap > > sql <- r"{ > CREATE TEMP TABLE olap AS > SELECT > am.* EXCLUDE ("#CHROM"), > an.* > FROM AlphaMissense am > JOIN annotation_1000 an > ON am."#CHROM" = an."#CHROM" > AND am.POS >= an.start > AND am.POS <= an.end > }" > > system.time({ > DBI::dbExecute(hg38, sql) > }) > > this generated 1.5 million records in > > user system elapsed > 23.837 0.126 3.343 > > It is accessible in a lazy fashion with tbl(hg38, "olap")

2023-09-25

Artur Sannikov (09:12:49) (in thread): > Thank you! How would I then remove a duplicate with using distinct?

Michael Love (10:19:20) (in thread): > you want to remove a sample?

Michael Love (10:19:47) (in thread): > you can do two steps: find the samples you want to keep, then filter columns

Michael Love (10:20:57) (in thread): > e.g. in pseudocode, suppose foo is the variable with duplicates, and you just want one sample per unique level of foo: > > ids_to_keep <- se |> distinct(foo, bar) |> distinct(foo, .keep_all=TRUE) |> pull(bar) > se |> filter(bar %in% ids_to_keep) >
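
The same two-step logic (find the samples to keep, then filter columns) can be sketched on a plain data frame in base R. This swaps the tidy verbs for `duplicated()` and `%in%`, so it illustrates the idea rather than the tidySummarizedExperiment code path; the `sample_id` column is a hypothetical unique identifier playing the role of `bar` above.

```r
# Sample table modelled on the one in the question (10 samples, 5 animals).
sample_data <- data.frame(
  sample_id = paste0("s", 1:10),            # hypothetical unique sample id
  animal_id = factor(rep(1:5, each = 2)),   # the variable with duplicates
  treated   = factor(rep(c("yes", "no"), times = 5))
)

# Step 1: pick one sample per animal (here, the first occurrence).
ids_to_keep <- sample_data$sample_id[!duplicated(sample_data$animal_id)]

# Step 2: keep only those samples.
keep <- sample_data$sample_id %in% ids_to_keep
deduplicated <- sample_data[keep, ]
```

For a SummarizedExperiment-like object, the logical vector `keep` would be used as a column index, e.g. `se[, keep]`, which retains the assays alongside the filtered colData.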

Michael Love (10:21:20) (in thread): > Stefano here is using distinct to show the colData

2023-09-26

Vince Carey (08:56:28) (in thread): > FWIW there are parquet files for the hg38 scores so that > > parq_url = "https://mghp.osn.xsede.org/bir190004-bucket01/BiocAlphaMissense/alpmis_hg38.parquet" > parq_url = "https://biocfound-alphamissense.s3.amazonaws.com/alpmis_hg38.parquet" > library(duckdb) > library("DBI") > library("dplyr") > library(dbplyr) > con <- dbConnect(duckdb::duckdb()) > dbExecute(con, "install 'httpfs'") > dbExecute(con, "load 'httpfs'") > tbl(con, parq_url) > > with either of the parq_url settings

2023-09-27

Martin Morgan (16:33:36) (in thread): > I know @Vince Carey you have a proof of concept too, but AlphaMissense provides functions to download and store data locally (via BiocFileCache, both the tsv.gz files and a DuckDB database). There are also a couple of helper functions to facilitate writing R data to temporary tables, performing range-based joins within the database, and a little use of GenomicRanges::GPos to represent the single-nucleotide variants.

Vince Carey (17:07:05) (in thread): > yes, that’s way ahead of mine. please go forward with publicity/submission?

Martin Morgan (17:25:26) (in thread): > yes I’ll try to pursue; I see that I’ve created some problems getting off the ground (creating the first download…)

2023-09-29

Jacques SERIZAY (17:29:01): > plyinteractions is submitted today, last minute :grimacing: let’s see if it can make it in time for the coming Bioc release!

2023-10-01

stefano mangiola (09:03:29) (in thread): > Hello @Laurent Gatto would it be ok to tag you in a tidyomics issue for discussing the plans for the extension of tidyomics for mass spec? > > Other interested people are https://github.com/tidymass @Michael Love was there anyone else interested?

Michael Love (10:14:16) (in thread): > i feel like one other group tagged you on Twitter/X?

Laurent Gatto (10:17:49) (in thread): > Yes, definitely, thank you.

Laurent Gatto (10:19:19) (in thread): > Re tidyMultiAssayExperiment, the higher order proteomics class QFeatures is a specialisation of MAE.

2023-10-02

stefano mangiola (10:49:54) (in thread): > then tidyMAE can well become a priority, I wanted for long time to tackle that but there was not enough support. We should be able to find support now.

2023-10-03

Artur Sannikov (03:30:05): > Hi, I’m getting this warning when trying to add a new column to a TSE object: > > Warning: tidySummarizedExperiment says: the assays in your SummarizedExperiment have row names, but they don't agree with the row names of the SummarizedExperiment object itself. It is strongly recommended to make the assays consistent, to avoid erroneous matching of features. > > For example, > > animal_tse <- > tse |> > mutate(animal_id = substr(sample_name, 1, 10)) > > And then it tells me that Error: 67 specified rows can't be found. I have 67 species in rowData. I checked if the rownames of the assay and the TSE object itself are identical > > rownames(tse) == rownames(assay(tse, "counts")) > > And got only TRUEs, so they are identical I think. > > I cannot reproduce the issue on test data. > > What can be the problem here?

Helena L. Crowell (03:59:39) (in thread): > Could u quickly run sth like > > . <- lapply(assays(tse, withDimnames=FALSE), rownames) > sapply(.[-1], identical, .[[1]]) > > and also > > sapply(., identical, rownames(tse)) >

Helena L. Crowell (04:12:24) (in thread): > Actually, the exact check being run is https://github.com/stemangiola/tidySummarizedExperiment/blob/cf66bf1810f814b885e922d5f8ef688156331a33/R/utilities.R#L726C1-L733C4 … so you could give that a try: > > if (!is.null(rownames(se)) && > length(assays(se)) > 0 && > !is.null(rownames(assays(se, withDimnames = FALSE)[[1]])) && > !all(rownames(assays(se, withDimnames = FALSE)[[1]]) %in% rownames(se))) { > warning( > "tidySummarizedExperiment says: the assays in your SummarizedExperiment have row names, but they don't agree with the row names of the SummarizedExperiment object itself. It is strongly recommended to make the assays consistent, to avoid erroneous matching of features." > ) > } >

Artur Sannikov (04:42:08) (in thread): > I get this: > > sapply(.[-1], identical, .[[1]]) > #> named list() > > sapply(., identical, rownames(tse)) > #> counts > FALSE >

Artur Sannikov (04:43:45) (in thread): > If I run the full check, I get the warning

Helena L. Crowell (04:45:24) (in thread): > Yeah, sure, but you should be able to run each of the checks in the if-statement (one by one) and figure out what’s off / is not passing the check.

Artur Sannikov (04:56:37) (in thread): > But all of them should be TRUE to get this warning. And indeed, they’re all TRUE. What do you mean?

Helena L. Crowell (05:28:06) (in thread): > If they are all TRUE, I guess the problematic line is all(rownames(assays(se, withDimnames = FALSE)[[1]]) %in% rownames(se)) … in your original post, you did not specify withDimnames, so this went unnoticed. withDimnames=TRUE (default) will basically set these on the fly, so they always appear to match. Simply overwriting the assay might fix it, e.g., assay(tse) <- assay(tse) (since the accessor sets the row names according to rownames(tse)) … but I’m not sure

Helena L. Crowell (05:30:01) (in thread): > *As a side note: I’d suggest posting software-related issues such as this on GitHub as opposed to Slack (including the output of your sessionInfo() and the code, as you did) so that other package users can pitch in and share in the discussion/answers etc. :pray:

Artur Sannikov (09:13:34) (in thread): > Thank you, the solution assay(tse) <- assay(tse) worked! Do you want me to post the issue on GitHub and you reply with the solution and I can mention you and your solution in the issue itself?

2023-10-05

Artur Sannikov (03:18:54) (in thread): > Hi Helena, what solution do you prefer?

Helena L. Crowell (03:28:38) (in thread): > Either is fine by me, I just meant as a more general comment to post software-related issues to the respective GH repo in the future:pray:

Artur Sannikov (10:20:33) (in thread): > https://github.com/stemangiola/tidySummarizedExperiment/issues/84 - Attachment: #84 Warning:The assays in your SummarizedExperiment have row names, but they don’t agree with the row names of the SummarizedExperiment object itself > Reposting from Slack. > > When creating a new column in a tse object, I get Warning: tidySummarizedExperiment says: the assays in your SummarizedExperiment have row names, but they don't agree with the row names of the SummarizedExperiment object itself. It is strongly recommended to make the assays consistent, to avoid erroneous matching of features. > > For example, running > > > animal_tse <- > tse |> > mutate(animal_id = substr(sample_name, 1, 10)) > > > get me Error: 67 specified rows can’t be found. I have 67 species in rowData. I confirmed that the rownames of the assay and the tse object itself are identical/ > > I was not able to reproduce the issue on the test pasilla data. > > @HelenaLC suggested running > > > . <- lapply(assays(tse, withDimnames=FALSE), rownames) > sapply(.[-1], identical, .[[1]]) > > > That gave me > > > sapply(.[-1], identical, .[[1]]) > #> named list() > > sapply(., identical, rownames(tse)) > #> counts > FALSE > > > The code that checks for warning is https://github.com/stemangiola/tidySummarizedExperiment/blob/cf66bf1810f814b885e922d5f8ef688156331a33/R/utilities.R#L726|here. > > If I run if checks individually, they all give me TRUE, thus the warning. > > @HelenaLC said that > > > If they are all TRUE, I guess the problematic line is all(rownames(assays(se, withDimnames = FALSE)[[1]]) %in% rownames(se)) … in your original post, you did not specify withDimnames , so this went unnoticed. withDimnames=TRUE (default) will basically set these on the fly, so they always appear to match. 
Simply overwriting the assay might fix it, e.g., assay(tsne) <- assay(tsne) (since the accessor sets the row names according to rownames(tsne)) …but I’m not sure > > So, doing > > > assay(tse) <- assay(tse) > > > solved the problem.

2023-10-11

Romane Libouban (05:41:38): > @Romane Libouban has joined the channel

2023-10-15

Boyd Tarlinton (05:19:06): > @Boyd Tarlinton has joined the channel

2023-10-26

Michael Love (08:11:06): > easylift is released in 3.18 :slightly_smiling_face: https://twitter.com/mikelove/status/1717514050974978394 - Attachment (X (formerly Twitter)): Michael Love on X > Abdullah’s first @Bioconductor package! :tada: > > Facilitates genomic liftover using existing Bioc tools: > > ranges |> easylift(“hg38”) > > Abdullah is one of the group that signed up for #tidyomics open challenges with @steman_research et al: > > https://t.co/BkJ3k1KXLh

2023-11-09

Simon Pearce (07:09:37): > This might not be the right channel to ask, but does anyone know of any resources for tidy analysis of methylation arrays?

Michael Love (08:51:00): > right place to ask! @Kasper D. Hansen have you thought about this?

Kasper D. Hansen (21:11:13): > No

Kasper D. Hansen (21:11:42): > But since the minfi classes are essentially SummarizedExperiment’s, shouldn’t this work out of the box?

2023-11-10

Michael Love (08:53:24): > what type of operations are you interested in, @Simon Pearce?

Simon Pearce (08:55:22): > Well, I haven’t worked with methylation arrays much, but mostly calling differentially methylated sites/regions is my first plan. I see tutorials using a whole range of packages, and don’t really know which to use. And they are base R :neutral_face:

Michael Love (09:34:14): > I would just use those packages with the pipelines outlined in the vignette. you can then look at the results with tidyomics packages

2023-11-13

Michael Love (13:57:38): > I often perform overlap joins with maxgap so the ranges don’t typically overlap. You get the metadata from x and y and can then do computation on that. Often it’s interesting to stratify this analysis by the x-y distance, but AFAIK there’s no easy way for that to come along. I’ve been playing with a function to provide that with minimal effort. Thoughts/suggestions? (I would next write this for the directed version, and inner/left, but not for ‘within’ bc it’s not relevant) https://github.com/mikelove/plyrangesMLmisc/blob/main/join_with_distance.R

Spencer Nystrom (15:35:23) (in thread): > We do something similar (ish) in join_nearest with a distance argument. https://github.com/sa-lee/plyranges/blob/master/R/ranges-join-nearest.R

Spencer Nystrom (15:36:06) (in thread): > Bizarre choice to render that whole file inline, Slack.

Michael Love (17:09:25) (in thread): > @Spencer Nystrom good point, i can ask Stuart if he wants this as an argument or a new set of functions

Michael Love (17:13:27) (in thread): > or neither haha

Spencer Nystrom (17:45:43) (in thread): > It’d be great to have a consistent API. These can be optional codepaths if the compute is heavy, so feel like they could be incorporated easily.

Michael Love (17:54:03) (in thread): > the compute should be minimal

Michael Love (17:54:52) (in thread): > vector < vector and vector - vector and it happens once
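
To make the "vector < vector and vector - vector" point concrete, here is a minimal base R sketch of attaching an x-y distance to already-matched range pairs. It is not the code in join_with_distance.R or plyranges; `range_gap` and the `pairs` table are hypothetical, and the gap convention (0 for overlapping or adjacent ranges) is an assumption chosen to mirror the usual IRanges notion of distance.

```r
# Gap between closed integer intervals [start, end];
# 0 if the two ranges overlap or are directly adjacent.
# `range_gap` is a hypothetical helper name used only for this sketch.
range_gap <- function(start_x, end_x, start_y, end_y) {
  pmax(pmax(start_x, start_y) - pmin(end_x, end_y) - 1L, 0L)
}

# Hypothetical matched pairs, as if returned by an overlap join with maxgap.
pairs <- data.frame(
  start_x = c(1L, 1L, 20L),
  end_x   = c(5L, 5L, 30L),
  start_y = c(8L, 3L, 10L),
  end_y   = c(10L, 10L, 15L)
)

# One vectorized subtraction gives the distance for every pair at once.
pairs$distance <- with(pairs, range_gap(start_x, end_x, start_y, end_y))

# One vectorized comparison says whether x lies entirely before y,
# which is what a directed version would need.
pairs$x_first <- pairs$end_x < pairs$start_y
```

Because both steps are single vectorized operations over the matched pairs, the added cost on top of the join itself should be small, consistent with the comment above.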

2023-11-26

Izabela Mamede (13:43:06): > @Izabela Mamede has joined the channel

2023-11-27

Jacques SERIZAY (08:39:12): > Hey all! A bit of self-advertising for tidy lovers: I have been working on tidyCoverage to generate this type of plot. There are already several packages for this purpose, but tidyCoverage relies on new classes built on top of SummarizedExperiment, and it can directly make use of tidySummarizedExperiment methods :tada:, and of course coerce with as_tibble(). > * CoverageExperiment contains sets of features in rowRanges and sets of BigWigFiles/RleLists in colData. It has a single assay: coverage, a matrix containing matrices of coverage for each track over each set of features > * AggregatedCoverage has the same structure, but its assays are mean, se, median, …, computed for all the matrices in assay(ce, "coverage"). > It is still in development and currently only on my GitHub (https://github.com/js2264/tidyCoverage/), but any feedback/contribution/suggestion would be welcome! - File (PNG): image.png

Simon Pearce (09:11:22) (in thread): > Can you default to something that isn’t rainbow? like hues::scale_colour_iwanthue?

Jacques SERIZAY (09:27:14) (in thread): > The tidyCoverage package in itself does not provide any plotting function. The previous plot is generated using the following code: > > ## See https://jserizay.com/tidyCoverage/articles/tidyCoverage.html#example-use-case-annotationhub-and-txdb-resources > ## for how `TSSs` and `bws` are created > > CoverageExperiment(bws, TSSs, width = 4000, scale = TRUE, center = TRUE) |> > aggregate(50) |> > mutate( > histone = case_when( > stringr::str_detect(track, 'H2A') ~ "H2A", > stringr::str_detect(track, 'H2B') ~ "H2B", > stringr::str_detect(track, 'H3') ~ "H3" > ) > ) |> > ggplot(aes(x = coord, y = mean)) + > geom_ribbon(aes(ymin = ci_low, ymax = ci_high, fill = track), alpha = 0.2) + > geom_line(aes(col = track)) + > facet_grid(~histone) + > labs(x = 'Distance from TSSs', y = 'Mean histone PTM coverage') + > theme_bw() + > theme(legend.position = 'top') > > So all the plotting is done through ggplot2, and it will be the responsibility of the end user to choose an appropriate color. I have added scale_colour_iwanthue to the code used to plot this figure, thanks for the suggestion :slightly_smiling_face:

Simon Pearce (09:27:53) (in thread): > Ok, that makes sense, I just looked at the plot rather than the package itself:wink:

2023-11-28

Michael Love (17:03:06) (in thread): > this is great Jacques, recommend putting a mini version of the tidy grammar example into the README.md so people can quickly scan what’s going on, what they can do (even if it’s a pre-baked example where the plot is pre-made)

Michael Love (17:03:35): > e.g. - File (PNG): Screenshot 2023-11-28 at 5.03.29 PM.png

Michael Love (17:03:44): > also we should link out to this from the tidyomics page

Michael Love (17:04:56) (in thread): > @Jacques SERIZAY you can add a link out here: https://github.com/tidyomics/.github/blob/main/profile/README.md?plain=1#L34-L41

Kasper D. Hansen (17:28:42): > With the caveat that I have not looked carefully at the package, I think a “Coverage” experiment is a kind of weird name, since - to me - this is post-coverage analysis, where you have summarized a coverage dataset across specific features of interest

Kasper D. Hansen (17:29:41): > However, I do think we have plenty of room for better work on coverage / bigWig files

Kasper D. Hansen (17:30:20): > I would think of a CoverageExperiment as (mainly) a link out to coverage storage and not contain specific features

Michael Love (17:33:14) (in thread): > i guess the benefit of “CoverageExperiment” is that people know what kind of thing they are working with: “oh this will work like an SE, i know how to operate on these”

Kasper D. Hansen (19:15:42) (in thread): > Ah ok, the “Experiment” part. Ok, I can see that

2023-11-29

Jacques SERIZAY (03:40:03): > As I see it, the purpose of CoverageExperiment is to store genome coverage extracted over specific sets of features. And as Michael highlighted, because it directly builds on SummarizedExperiment, I used an *Experiment name, but I would be happy to consider something better suited, e.g. FeatureCoverage? > > Also, this raises a tricky point. Let’s assume I have 2 tracks (RNA fwd and rev), and a single set of features (614 Scc1 peaks) (this is the content of data(ce)). Running CoverageExperiment(tracks, features, width = 3000) will return a CoverageExperiment (i.e. an SE object) of dim 1x2 with an assay named coverage. Now, contrary to canonical SE objects, each cell of the assay matrix is a list (of length one) containing a matrix itself, of dim 614x5000 (the GRanges length x the specified width). > 1. Could storing lists in the scores arrays be a problem in itself (independent from tidy considerations)? > 2. This makes many methods implemented in tidySummarizedExperiment fail (including mutate :disappointed:). The error comes from tidySummarizedExperiment:::update_SE_from_tibble, I think because it attempts to rebuild an SE object from the ce object I provided and this fails due to point 1. That means that any operation would require a call to as_tibble() first, which is not ideal… > I appreciate your feedback, thanks for taking some time to share it! - File (PNG): image.png - File (PNG): image.png - File (PNG): image.png

stefano mangiola (06:48:47) (in thread): > any scope for a PR to tidySummarizedExperiment?

Jacques SERIZAY (09:15:21) (in thread): > I’m thinking about it, but not sure whether it is the best thing to do. What are your thoughts on storing a vector/matrix (contained as a 1-element list) in each cell of an assay? I don’t think I’ve seen this in any *Experiment-derived package, so not sure it’s worth doing this

Jacques SERIZAY (15:49:05) (in thread): > After giving it some thought, I think there is no way this issue could be solved easily. It arises from the fact that I am storing vectors/matrices in each cell of the matrices from assay(x, ...). According to the documentation, this is allowed: A RangedSummarizedExperiment contains one or more assays, each represented by a matrix-like object of numeric or other mode. But AFAICT, tidySummarizedExperiment is not covering the cases where the matrices store lists. What are your thoughts @stefano mangiola? Did you ever encounter such a use case? - File (PNG): image.png

stefano mangiola (20:01:55) (in thread): > nested structures seem troublesome. In brief, why each mean cell includes many values?

2023-11-30

Jacques SERIZAY (02:36:19) (in thread): > Each row refers to a resized GRanges object (to let’s say 1kb, with multiple entries, e.g. all TSSs) and each column refers to a coverage track (e.g. bigwigFile). Each cell of the mean assay (1 GRanges x 1 bigwig) contains the average coverage along 1kb, so it is a 1kb-long vector.

2023-12-01

stefano mangiola (00:55:42) (in thread): > Could this object (or its representation) not be flattened out? After all, that’s the beauty of tidyverse, it makes working with redundant objects easy.

stefano mangiola (00:56:23) (in thread): > it seems you are bending the tidy representation to a 3d matrix, while you could bend the 3d matrix to a tidy representation

Jacques SERIZAY (03:34:50) (in thread): > The object can easily be coerced to a tibble using as_tibble, then all the methods for tibbles would work no problem. However, what I wanted is to have all the methods from tidySummarizedExperiment work OOTB with CoverageExperiment objects (since they are SE objects). But because CoverageExperiment objects have nested matrices in their assays, tidySummarizedExperiment methods fail in several places.

Jacques SERIZAY (03:40:03) (in thread): > > it seems you are bending the tidy representation to a 3d matrix, while you could bend the 3d matrix to a tidy representation > Not sure what you mean here. Would you mind clarifying this point? Otherwise if you don’t want to spend more time on this, I would totally understand. > > In any case, coercing a CoverageExperiment to a tibble with an as_tibble method is trivial, so I think I will simply recommend that the end user explicitly coerce to a tibble if any filtering/mutating is needed.

stefano mangiola (08:30:18) (in thread): > maybe try to give me 2,3 examples where the tidy SE fails. so we can think if you need to abstract further.

2023-12-05

Malte Thodberg (02:28:28) (in thread): > I don’t know if they apply here, but there are some flavors of SummarizedExperiments with “ragged” content: https://bioconductor.org/packages/release/bioc/html/BumpyMatrix.html and https://bioconductor.org/packages/release/bioc/html/RaggedExperiment.html - Attachment (Bioconductor): BumpyMatrix > Implements the BumpyMatrix class and several subclasses for holding non-scalar objects in each entry of the matrix. This is akin to a ragged array but the raggedness is in the third dimension, much like a bumpy surface - hence the name. Of particular interest is the BumpyDataFrameMatrix, where each entry is a Bioconductor data frame. This allows us to naturally represent multivariate data in a format that is compatible with two-dimensional containers like the SummarizedExperiment and MultiAssayExperiment objects. - Attachment (Bioconductor): RaggedExperiment > This package provides a flexible representation of copy number, mutation, and other data that fit into the ragged array schema for genomic location data. The basic representation of such data provides a rectangular flat table interface to the user with range information in the rows and samples/specimen in the columns. The RaggedExperiment class derives from a GRangesList representation and provides a semblance of a rectangular dataset.

Jacques SERIZAY (03:57:48) (in thread): > @stefano mangiola here is a report comparing tidyverse operations on SE/CoverageExperiment (CE) objects. mutate and summarize are tricky to get to work with CoverageExperiment because of the nature of the data stored in each cell of the assay matrix. > I will have a look at @Malte Thodberg’s suggestion. I did have a look at RaggedExperiment but it wasn’t what I needed, and I wasn’t aware of BumpyMatrix! Thanks for the tip, I’ll look into it!

Jacques SERIZAY (09:31:04) (in thread): - File (HTML): tidySummarizedExperiment-methods.html

2023-12-08

Eric Waltari (17:48:18): > @Eric Waltari has joined the channel

2023-12-09

Michael Love (09:15:25): > Just noticed this from 2017 - bedtools-in-R + dplyr, similar in scope to some things here https://f1000research.com/articles/6-1025 - Attachment (f1000research.com): F1000Research Article: valr: Reproducible genome interval analysis in R. > Read the latest article version by Kent A. Riemondy, Ryan M. Sheridan, Austin Gillen, Yinni Yu, Christopher G. Bennett, Jay R. Hesselberth, at F1000Research.

Vince Carey (10:11:44) (in thread): > good catch. lots of c++ infrastructure. also potentially peculiar handling of RNG seed in e.g., bed_random.

Michael Love (14:15:35) (in thread): > yeah i mean, lots of stuff here is superseded by plyranges and so on

2023-12-11

stefano mangiola (18:32:56) (in thread): > Thanks, I will have a look soon. > > But your case sounds very similar to mine, where things “are tricky to get to work [..] because of the nature of the data stored”. The successful solution for me, which started tidy transcriptomics, was to think out of the box and represent the data drastically differently from how the data is stored and how people thought about the data for 15 years (matrix vs. long format). > > So, further unpacking (virtually) the data (a leap of abstraction from what tidySE does) might be the key. For example, not having nested statistics but representing them as a flat table. In this case, I could help design the display of data and the tidy manipulation adapters.
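
The "flat table instead of nested statistics" idea can be sketched in base R. The object below is a hypothetical stand-in for a nested assay (a list-matrix with one coverage vector per cell), not an actual CoverageExperiment, and the column names `features`, `track`, `coord` and `coverage` are chosen for illustration only.

```r
# Hypothetical nested assay: 2 feature sets x 2 tracks, each cell holding a
# coverage vector of length 3 (a list-matrix, filled column by column).
cells <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9), c(10, 11, 12))
nested <- matrix(
  cells, nrow = 2,
  dimnames = list(c("TSS", "peaks"), c("H3", "H2A"))
)

# Flatten: one row per (feature set, track, coordinate) combination.
idx <- expand.grid(i = seq_len(nrow(nested)), j = seq_len(ncol(nested)))
long <- do.call(rbind, lapply(seq_len(nrow(idx)), function(k) {
  i <- idx$i[k]
  j <- idx$j[k]
  values <- nested[[i, j]]          # the vector stored in this cell
  data.frame(
    features = rownames(nested)[i],
    track    = colnames(nested)[j],
    coord    = seq_along(values),
    coverage = values
  )
}))
```

In the long table every value sits in its own row, so ordinary filter/mutate/summarise-style operations apply without any special handling of list cells; the cost is redundancy in the `features` and `track` columns.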

2023-12-27

Cindy Reichel (14:38:18): > @Cindy Reichel has joined the channel

2023-12-28

David Rach (09:16:26): > @David Rach has joined the channel

2024-01-06

stefano mangiola (04:56:34): > As this tidy community is very close to my heart and includes like-minded researchers, I would like to share (apologies if off-scope) a Postdoc opportunity in the laboratory I am opening @ SAiGENCI. > — > > Only 1 week to apply for this great #Postdoc opportunity and #SingleCell Computational #Immunogenomics applied to cancer (Adelaide), a world-class institute equipped with cutting-edge technologies and world leaders in #multiomic profiling. > > We encourage candidates from Europe and the US. I came to Australia 13 years ago and never regretted great science and a great life quality. > > Please consider applying and spread the word to talents you know https://careers.adelaide.edu.au/cw/en/job/513184/grantfunded-researcher-a

2024-01-11

Michael Love (10:57:49): > hi tidy folks, > > Stefano, William and I are trying to better motivate on the GitHub pages why someone should consider the tidiness-in-Bioc style of data manipulation. Here is an attempt, and I’m happy to take feedback. Does this look like a fair and honest comparison? https://github.com/tidyomics#comparison-to-base-r

Lluís Revilla (11:52:48) (in thread): > While I’m interested in the tidiness I haven’t followed much the evolution, improvements or how it is implemented. It looks like the better readability is achieved via new methods. In this case, could these new methods be added to the SummarizedExperiment class? Or is the new tidybulk class needed? It seems like the gain in readability comes with a heavy cost in dependencies that is not mentioned.

Vince Carey (11:53:49): > It looks pretty honest to me. Good point by @Lluís Revilla about hidden costs. In addition to possible increased dependencies, “debugging” when things go wrong in some way seems worthy of attention.

Vince Carey (11:54:25): > Also I wonder about %>% vs |> … habit or necessity?

Michael Love (14:19:58) (in thread): > thanks Lluis, it depends on what data is; if it’s an SE you would load the tidySummarizedExperiment package. Importantly, no new classes are defined as part of this effort

Michael Love (14:20:51) (in thread): > if it were an SE that would require the following imports > > dplyr, tibble (>= 3.0.4), magrittr, tidyr, ggplot2, rlang, purrr, lifecycle, methods, plotly, utils, S4Vectors, tidyselect, ellipsis, vctrs, pillar, stringr, cli, fansi, stats, pkgconfig > however many of these are already installed for someone that uses tidyverse

Michael Love (14:21:19) (in thread): > so the marginal install cost is minimal if you use tidyverse

Lluís Revilla (15:12:05) (in thread): > Oh, I see I misunderstood the tidybulk package https://github.com/stemangiola/tidybulk/blob/915117b885f81144605504427492283c9ad9bddd/R/tidyr_methods.R#L70.

Lluís Revilla (15:12:46) (in thread): > If users do not have the tidyverse the impact will be huge. Recently the FDA accepted a submission with R code, but package dependencies seem to be a friction point; if we want to make them more accessible, fewer dependencies seems the way forward. If you want to provide verbs like select, filter, mutate you can do that without depending on the tidyverse, and there are examples of doing this (poorman, for example).

Lluís Revilla (15:14:40) (in thread): > Considering the experimental features and the several deprecated functions and arguments the “tidyverse” packages have, I am a bit wary of depending too much on them in packages

Michael Love (15:20:23) (in thread): > Many of the tidyomics packages i would say are for end users more than developers. I’m just speaking from my own perspective. It wouldn’t make sense necessarily to have a tidyomics dependency unless you are working on another tidyomics package. Again this is my opinion > > And I suppose we are targeting users who are already on board with using, say, dplyr and ggplot2

Michael Love (15:20:45) (in thread): > I don’t think we can bring folks over to tidyomics who have no interest in installing dplyr or ggplot2 or other tidyverse packages

Michael Love (15:22:15): > %>% is necessary to use dot notation inside of a function. E.g. count_overlaps(., other_ranges) within a mutate call

Michael Love (15:22:49): > I use |> otherwise but there are a few places where they are not equivalent in important ways, so it’s good to teach both still i think
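A minimal sketch of the non-equivalence mentioned above, assuming R >= 4.1. Everything here is a toy stand-in: `count_overlaps_toy()` is a hypothetical helper operating on plain data frames, not the plyranges `count_overlaps()`. With magrittr, the `.` placeholder can sit inside a nested call; with the native pipe the usual workaround is to wrap the step in an anonymous function.

```r
# Toy interval tables standing in for GRanges objects (assumption for illustration).
x <- data.frame(start = c(1, 10, 20), end = c(5, 15, 30))
y <- data.frame(start = c(4, 12, 13), end = c(6, 14, 25))

# Hypothetical helper: for each interval in `a`, count intervals in `b` that overlap it.
count_overlaps_toy <- function(a, b) {
  vapply(seq_len(nrow(a)), function(i) {
    sum(a$start[i] <= b$end & a$end[i] >= b$start)
  }, integer(1))
}

# magrittr pattern from the chat, where `.` refers to the left-hand side
# even inside a nested call:
#   x %>% mutate(num_ovrlps = count_overlaps(., y))

# Native pipe: the left-hand side is needed twice (as the data and inside the
# nested call), so the step is wrapped in an anonymous function.
res <- x |> (\(d) transform(d, num_ovrlps = count_overlaps_toy(d, y)))()
res
```

The anonymous function makes the second use of the left-hand side explicit, which is close in spirit to creating a named intermediate variable.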

Michael Love (15:23:07) (in thread): > I think we need a “what does this mean for developers” as a separate document.

Michael Love (15:24:13) (in thread): > I’ve been thinking of drafting this, in addition to the documentation above which is predominantly user focused. But you’re right we are also recruiting new developers and they may need instruction on when to use base R vs “tidy” > > I only use base R for packages and tidy for scripting. But I can’t speak for everyone on that point

Michael Love (15:25:01) (in thread): > we should somewhere write about it from the developer perspective, and include a diversity of viewpoints there bc there is not a single correct way to go about this. > > and while i’m at it, I should note that, depending on scale of data, one should consider leveraging things like rowMeans or rowsum in the provided example, especially if you were writing code for a package or code that will be run many times as part of a pipeline

Lluís Revilla (15:59:31) (in thread): > If the goal is to improve the readability to make it similar to them and you can have compatible packages without those dependencies, why do you need them to install tidyverse or ggplot2?

Michael Love (16:02:45) (in thread): > Sorry I don’t follow your last point. > > The goal of the tidyomics effort is to enable tidy operations on Bioconductor objects, without redefining classes or functionality that exists in either tidyverse or Bioconductor.

Michael Love (16:03:02) (in thread): > It can be thought of as an API layer between the two

stefano mangiola (17:22:26) (in thread): > @Lluís Revilla tidybulk natively works with SE. But it also works with tibble. For tibble it adds a temporary class for nest/unnest.

Lluís Revilla (17:41:51) (in thread): > I think I need to try it before I understand more about how it works. Because with these last comments I am lost. Thanks for the answers

Martin Morgan (19:07:45) (in thread): > I’ll mention that I didn’t really understand what filter() was doing until looking at the old-school code mentioning rowData, which is a bit ironic – brevity may not always be a benefit. > > I’m pretty much a convert to dplyr on the principle that one gets a good distance on knowing just a few things. So the edge cases that require use of %>% instead of |> are a flag that too much is expected of the user – I really have no idea when I could want or need to use “.”. > > If playing code golf on a reproducible example > > library(SummarizedExperiment) > example(SummarizedExperiment) # get 'se' object > rowData(se)$score = runif(nrow(se)) > rowData(se)$gene_classes = sample(letters[1:4], nrow(se), TRUE) > > I would use subset() for subsetting, and rowsum() or aggregate() for aggregations, leading to two lines of not too inscrutable code > > subdata = subset(se, score > .5) > aggregate( > as.vector(assay(subdata)), > list(rep(rowData(subdata)$gene_classes, ncol(subdata))), > mean > ) > > And yes, this (especially aggregate()) would illustrate the consistency and clarity of the dplyr approach.

Michael Love (21:00:32) (in thread): > Martin would you be ok if I added this in addition to the existing base R example? Because your average R user doesn’t use aggregate. I also agree that rowsum is a beautiful secret and would be my choice for package source code (I found rowsum when writing tximport code in 2015)
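For readers who have not met rowsum(), here is a small base R sketch of the same per-class mean computed two ways. The matrix and class labels are made up for illustration (they stand in for assay(se) and rowData(se)$gene_classes); this is not the example from the README.

```r
# Toy count matrix: 4 genes x 3 samples (illustrative values only).
counts <- matrix(
  1:12, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:3))
)
gene_classes <- c("a", "b", "a", "b")

# rowsum() adds up the rows that share a group label: one row per class,
# one column per sample.
class_sums <- rowsum(counts, group = gene_classes)

# Mean over all values in a class = class total / (genes in class * samples).
n_genes <- as.vector(table(gene_classes)[rownames(class_sums)])
class_means <- rowSums(class_sums) / (n_genes * ncol(counts))

# Same quantity via aggregate(), following the pattern in the message above.
agg <- aggregate(
  as.vector(counts),
  list(rep(gene_classes, ncol(counts))),
  mean
)

class_means
agg
```

rowsum() stays in matrix form throughout, which is one reason it tends to suit package code that runs many times.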

Michael Love (21:03:48) (in thread): > filter makes more sense if you see the tibble representation of the hypothetical SE data. if this is going to be a prominent example I should probably make it a reprex instead of a hypothetical

2024-01-12

Michael Love (12:29:26) (in thread): > I’ve updated the example to include reprex and Martin’s cleaner base R code in addition https://github.com/tidyomics#comparison-to-base-r

Michael Love (12:30:15) (in thread): > i will switch this example to |>

Michael Love (21:19:00) (in thread): > i’ve removed all %>% examples from this main page… too distracting

2024-01-14

stefano mangiola (17:36:09) (in thread): > @Martin Morgan @Michael Love for discussion: personally, I am moving away from %>% and the use of “.”. > > To me, now, when a “.” is needed, it is a possible indication that a new variable should be created. Also, as tidyverse and the surrounding ecosystem are evolving, there is less and less need for this.

Michael Love (19:52:01) (in thread): > this is convenient though, for x and y GRanges > > x %>% mutate(num_ovrlps = count_overlaps(., y)) >

stefano mangiola (21:10:45) (in thread): > From a redesign perspective, I would see count_overlaps more as a join |> count operation. Because join takes two datasets, while mutate has been designed to take one > > These two operations could be possible natively with dplyr now > > x |> > left_join(y, by = join_by(chromosome, overlaps(x_lower, x_upper, y_lower, y_upper))) |> > count(chromosome, x_lower, x_upper, name = "number_of_overlaps") > > https://stackoverflow.com/questions/76404727/the-usage-of-the-key-word-within-and-overlaps-in-join-by > > If this interface was the case (doesn’t matter what happens in the backend), then designing the convenience function count_overlaps would be easy, i.e. it would take three arguments (if the names of the columns were fixed) > * x > * y > * name = "n" > But would be used as a left_join rather than a mutate > > x |> count_overlaps(y, "number_of_overlaps")
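A runnable sketch of the join-then-count idea above, assuming dplyr >= 1.1.0 (for join_by() and overlaps()) and plain tibbles rather than GRanges. The wrapper name `count_overlaps_tbl()` and the toy data are hypothetical. One detail differs from the snippet in the message: after a left_join, a range with no overlaps still contributes one row, so a plain count() would report 1 for it; the sketch counts only rows that actually found a partner.

```r
library(dplyr)  # assumes dplyr >= 1.1.0

# Toy interval tables; column names follow the message above.
x <- tibble(
  chromosome = c("chr1", "chr1", "chr2"),
  x_lower = c(1, 50, 10),
  x_upper = c(20, 60, 30)
)
y <- tibble(
  chromosome = c("chr1", "chr1", "chr2"),
  y_lower = c(5, 15, 100),
  y_upper = c(10, 55, 200)
)

# Hypothetical convenience wrapper (not an existing dplyr or plyranges function).
count_overlaps_tbl <- function(x, y, name = "n") {
  x |>
    left_join(
      y,
      by = join_by(chromosome, overlaps(x_lower, x_upper, y_lower, y_upper))
    ) |>
    group_by(chromosome, x_lower, x_upper) |>
    # Unmatched x rows carry NA in y_lower, so they count as zero overlaps.
    summarise("{name}" := sum(!is.na(y_lower)), .groups = "drop")
}

res <- count_overlaps_tbl(x, y, name = "number_of_overlaps")
res
```

Grouping on the coordinates would merge duplicated ranges in x; a real implementation would probably carry a row identifier instead.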

Michael Love (22:03:35) (in thread): > convenience function makes sense, there’s already add_nearest_distance

2024-01-17

Michael Love (08:45:29): > added to the README as part of revision, feel free to edit directly or provide comments and i can update. source for the project landing page README is here https://github.com/tidyomics/.github/blob/main/profile/README.md - File (PNG): Screenshot 2024-01-17 at 8.44.34 AM.png

2024-01-19

Frederick Tan (13:22:10) (in thread): > Worth adding a comment to users about scale, perhaps as a footnote that tidy does have its limits? At a recent CoFest! (hackathon) a group hit a wall with tibble and couldn’t optimize so moved to data.table.

Frederick Tan (13:27:22) (in thread): > More for a “Why learn/use this?” section, but perhaps worth noting that Bioconductor’s carpentries-incubator.github.io/bioc-intro, which if successful will bring even more new users into Bioconductor, takes a tidy approach.

stefano mangiola (18:39:40) (in thread): > Maybe in the future they will edit > > BiocManager::install(c("tidyverse", "SummarizedExperiment", "hexbin", "patchwork", "gridExtra", "lubridate")) > > for this :) > > BiocManager::install(c("tidyverse", "tidyomics", "SummarizedExperiment", "hexbin", "patchwork", "gridExtra", "lubridate")) >

2024-01-20

Frederick Tan (10:05:24) (in thread): > Huh … interesting given carpentries-incubator.github.io/bioc-intro/60-next-steps.html#tidysummarizedexperiment

2024-01-21

stefano mangiola (06:50:52) (in thread): > for the record: https://bioinformatics.ccr.cancer.gov/docs/data-wrangle-with-r/pdf/combined.pdf

2024-01-25

Francesc Català (05:03:00): > @Francesc Català has joined the channel

2024-01-27

atongsa miyamoto (04:52:28): > @atongsa miyamoto has joined the channel

2024-02-15

Jacques SERIZAY (13:13:13): > This made me realize that cross-package documentation covering the whole tidyomics ecosystem is kinda lacking… (I think this was mentioned by one of the reviewers?). Is there any plan on writing a compendium describing tidyomics with advanced examples, use cases, etc.? Could it be mostly done by aggregating examples from the different packages together? - Attachment: Attachment > Not as well organized as Hadley’s books, but Computational Genomics with R from Altuna Akalin covers most aspects of bioinfo analysis for bulk assays. It has not been updated for quite some time though, and it doesn’t cover at all the tidyomics principles :scream:.

Michael Love (13:30:24): > go for it! in the revision, we did update the docs a bit, i don’t know if you’ve seen them in 2024

Michael Love (13:31:04): > (this README at github.com/tidyomics is editable by anyone in the org) - File (PNG): Screenshot 2024-02-15 at 1.30.52 PM.png

Michael Love (13:32:36): > I think the way to do it is to have individuals maintain tutorial content themselves (it can have a /tidyomics/ url if you like). I maintain tidyomics/tidy-ranges-tutorial which does show off examples from many packages including tidybulk, tidySE, plyranges, etc. Also our workshop from this past BioC showed tidySCE and plyranges combined. But we need more tutorials for sure! I can add it as a project challenge

Michael Love (13:35:21): > https://github.com/orgs/tidyomics/projects/1/views/1?pane=issue&itemId=53419320

Jacques SERIZAY (13:36:26): > Thanks Mike the tutorials are exactly what I had in mind! I would love to have a general handbook at hand, with (opinionated) workflows for bulk/sc analyses… A somewhat larger scope I would envision would be focusing on multi-omics data integration using tidyomics. To me that’s one of the biggest strengths of tidyomics

Michael Love (13:36:31): > it takes some time to come up with a good example that spans many omics. I spent like a week on the example from BioC workshop

Michael Love (13:37:14): > i love it. i think a good place to start is to brainstorm the right dataset that will enable interesting cross-assay comparisons

Michael Love (13:38:36): > feel free to assign yourself to that issue and anyone who is interested can pitch in, maybe it would lead to a BioC or EuroBioc workshop?

Jacques SERIZAY (13:38:43): > Totally agree! it sounds straightforward on paper, but then the reality of low-quality datasets hits you back…

Michael Love (13:40:06): > i have made a bunch of RNA-seq datasets: airway, oct4, parathyroidSE, macrophage, spatialDmelxsim, fission (these are all Bioc pkgs)

Michael Love (13:40:36): > but i don’t have any epigenomic, proteomic data. > > there are a lot of PBMC-type datasets in Bioc

Michael Love (13:41:11): > you’d want something that ties things together

Jacques SERIZAY (13:41:19): > Also lots of epigenomics from ExperimentHub

Michael Love (13:42:04): > yeah so we did this contrived thing here: https://tidyomics.github.io/tidyomicsWorkshopBioc2023/articles/tidyGenomicsTranscriptomics.html#part-3-genomic-and-transcriptomic-data-integration - Attachment (tidyomics.github.io): Tidy genomic and transcriptomic single-cell analyses > tidyomicsWorkshopBioc2023

Jacques SERIZAY (13:42:07): > All the ENCODE data is there I think. It has ChIP, RNA, accessibility, even Hi-C I think. And functional annotations (whatever they are worth)

Michael Love (13:42:51): > yup, that’s what we use, unfortunately it’s hg19 so you have to liftover. (i’ve been wanting to have time to get better chip / atac for hg38 in Bioc)

Michael Love (13:43:01): > > Now let’s do something interesting with the gene ranges: let’s see if genes near peaks of active chromatin marks (H3K4me3 measured with ChIP-seq) in another experiment involving PBMC have a difference in their expression level compared to other genes.

Michael Love (13:43:32): > later i point out the limitations: > > we are comparing cell-type-specific expression to aggregate ChIP-seq peaks in PBMC > > the two experiments, while both PBMC, come from different projects, different labs, and one is from cancer patients, while the other is from the ENCODE project

Michael Love (13:44:12): > i ended up just doing something for demo purposes because getting a big dataset with all the right omics is a ton of work, and I had to do this on a tight schedule :slightly_smiling_face:

Jacques SERIZAY (13:46:23): > Well now we could use easylift :wink:. But yeah I agree, it would require good organization/communication. I’ll try to do more reading over the next few days to see if something would be doable.

Michael Love (13:52:49): > i’ve noticed that you lose about half the people when you start having to do chores during a tutorial

Michael Love (13:53:16): > we should just throw hg38 filtered peaks into the tutorial, it’s gonna be 1 Mb which is less than the graphics

Michael Love (13:53:52): > easylift is great, but even just conceptually introducing the idea of genome builds is a pedagogical no-no

2024-03-05

Vamika Mendiratta (02:09:05): > @Vamika Mendiratta has joined the channel

2024-03-15

Peace Sandy (08:36:43): > @Peace Sandy has joined the channel

2024-03-28

Laura Symul (08:39:44): > @Laura Symul has joined the channel

2024-04-01

Michael Love (09:40:22): > to tidyomics-interested folks, we are converging on a hex sticker here, if you are interested: https://github.com/tidyomics/tidyomics/issues/4 - Attachment: #4 Create tidyomics logo and sticker

Michael Love (11:38:35) (in thread): > pinging some of the github members to get their eyes on it @Charlotte Soneson @Helena L. Crowell @Jacques SERIZAY @Timothy Keyes @Abdullah Al Nahid @Pierre-Paul Axisa @Stephanie Hicks @William Hutchison

Pierre-Paul Axisa (12:31:40) (in thread): > Oh neat! Are there any vector images of the drafts somewhere to play around with?

Michael Love (13:40:15) (in thread): > we are mostly just sketching at this point but maybe @stefano mangiola has vector versions of his?

Michael Love (13:40:35) (in thread): > ideally we can finalize on a logo this week to send to CZI

Michael Love (13:40:47) (in thread): > tidyomics was selected for EOSS round 6 :tada:

Michael Love (13:40:54) (in thread): > they want project logos by 4/5

Stephanie Hicks (13:41:34) (in thread): > oh huge congratulations!!

Michael Love (13:43:03) (in thread): > Stefano and I hope to use the support to help with project organization tasks, & run workshops to promote everyone’s contributions and packages. it’s a community effort

Stephanie Hicks (14:44:10) (in thread): > well deserved!

2024-04-02

stefano mangiola (00:59:42) (in thread): > (sorry I missed this thread) I added the svg here https://github.com/tidyomics/tidyomics/issues/4#issuecomment-2031084988

2024-04-12

Michael Love (15:47:32): > i noticed tidySE prints like this in rmd/quarto output: > > # A SummarizedExperiment-tibble abstraction: 16 × 8 > # [90mFeatures=4 | Samples=4 | Assays=counts[0m > ... > > i think this can be fixed by using cli instead of printing the ANSI color code directly. I made an open challenge: https://github.com/orgs/tidyomics/projects/1/views/1?pane=issue&itemId=59607835

2024-04-13

stefano mangiola (23:41:35) (in thread): > Yes I always wondered. good catch!

2024-04-15

Michael Love (08:44:28) (in thread): > added a note: this also prevents qmd -> pdf because the latex command will not allow the ANSI codes.
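A small sketch of the difference being described, assuming the cli package is installed. The header string is copied from the output above; how tidySE builds it internally is not shown in this thread, so the hard-coded variant below is only an illustration of the failure mode, not the package's actual code.

```r
library(cli)

header <- "Features=4 | Samples=4 | Assays=counts"

# Hard-coded escape sequence: always emitted, so Rmd/Quarto output shows the
# raw "[90m ... [0m" codes and LaTeX may choke on the escape character.
raw_header <- paste0("\033[90m", header, "\033[0m")

# cli styling function: cli checks at run time whether the output device
# supports colour and falls back to plain text when it does not.
cli_header <- col_grey(header)

cat(raw_header, "\n")
cat(cli_header, "\n")

# ansi_strip() removes any styling, which is handy when testing printed output.
plain_from_raw <- ansi_strip(raw_header)
plain_from_cli <- ansi_strip(cli_header)
```

In an interactive terminal both lines print grey; in a knitted document only the cli version should degrade to plain text.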

Michael Love (10:17:59): > @Jacques SERIZAY @Eric Davis and others: there is interest in extending the plyinteractions demo into a longer workshop at BioC, which would also include a (brief) intro to the tidyomics project. Eric are you going? Anyone else in this channel going to BioC in person who could do an intro to the project? Or Jacques you could also do the whole time but it’s a bit exhausting to talk for that long, I myself prefer to split it up into two presenters…

Eric Davis (10:18:11): > @Eric Davis has joined the channel

Jacques SERIZAY (10:53:23): > Happy to talk about plyinteractions of course, alone or with @Eric Davis or anybody else who wants to join in :slightly_smiling_face: Also happy to include a brief intro to tidyomics if needed, although I agree it’d be nice to split things, otherwise it will indeed get exhausting!

2024-04-17

Chenyue Lu (10:58:52): > @Chenyue Lu has joined the channel

2024-04-18

Michael Love (08:31:13): > tidyomics developers, if you are willing, switch your visibility on the team to public: https://github.com/orgs/tidyomics/people This is an open GitHub development team, so other folks here interested in contributing to the challenges (incl hosting workshops), just reply here and i’ll add you - File (PNG): Screenshot 2024-04-18 at 8.19.57 AM.png

2024-04-22

Lluís Revilla (11:28:50): > Is there a way to apply dplyr verbs to DFrame, DataFrame, SimpleList, RectangularData? I see a way for several classes in https://github.com/tidyomics/tidyomics but none for more basic structures, and perhaps I am missing something obvious …

Michael Love (11:51:11): > I think @Lambda Moses worked on this perhaps

Michael Love (11:53:21): > https://github.com/orgs/tidyomics/projects/1/views/1?pane=issue&itemId=35975564

Michael Love (11:55:16): > this is in particular for SCE and left_join a DF, but it might be useful to have additional generic functionality. > > I don’t think we’ve implemented GRanges |> left_join(more metadata)

Lluís Revilla (16:21:26): > Thanks for the pointers! I’ll continue the discussion there.

stefano mangiola (19:46:44): > it should be incredibly straightforward to implement tidyDataFrame, virtually 2 lines of code per tidy method (as an inefficient implementation). > > However I am thinking that a more high-level API that does not expose DataFrame could be more elegant, rather than expanding the ecosystem for very specific (and backend-oriented) tasks like this. > > Unless I am missing an important and recurrent application.

stefano mangiola (19:47:35): > for example tidySCE could be expanded to interact with feature-data
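To make the "virtually 2 lines of code per tidy method" remark concrete, here is a rough sketch of what such an inefficient round-trip could look like for filter. It assumes S4Vectors and dplyr are installed; `filter_DF_sketch()` is a hypothetical name, and no tidyDataFrame package is implied to exist with this interface.

```r
library(S4Vectors)

# Hypothetical round-trip: DataFrame -> data.frame -> dplyr verb -> DataFrame.
# Inefficient by design: it copies the data, and columns with special S4 types
# (for example Rle) are expanded to ordinary vectors along the way.
filter_DF_sketch <- function(.data, ...) {
  .data |>
    as.data.frame() |>
    dplyr::filter(...) |>
    DataFrame()
}

DF <- DataFrame(gene = c("a", "b", "c"), score = c(0.2, 0.7, 0.9))
res <- filter_DF_sketch(DF, score > 0.5)
res
```

Grouped operations are where this pattern gets fiddly, since the grouping state has no natural home on a DataFrame.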

Jonathan Carroll (20:14:31) (in thread): > I played with this idea a while ago https://github.com/jonocarroll/DFplyr One of the fiddliest bits to get working is groups.

2024-04-23

Lluís Revilla (03:55:11) (in thread): > Thanks! It looks nice and might work for my use case

Lluís Revilla (03:58:35): > @stefano mangiola Indeed a general high-level API might work well, but for classes outside that high-level API having a low-level tidy API will work wonders. tidyomics doesn’t cover a quarter of the classes in Bioconductor (of which I think there are too many, without much order, but that is the remit of other working groups). When working with Bioconductor it is impossible to avoid classes outside tidyomics that I’d like to use with tidy verbs, but I can keep using them and mixing code syntax styles.

stefano mangiola (04:20:56): > That makes sense; I was just thinking about the design of the ecosystem (packages within tidyomics). Of course more general purpose tools make sense. > > When working with Bioconductor it is impossible to avoid classes, outside tidyomics > Can you give examples beyond DataFrame?

2024-04-24

Lluís Revilla (08:10:46): > There are many; the difficult part is finding a package in Bioconductor that doesn’t define a new class: BSseq objects for methylation from {bsseq}, all the classes from {clusterProfiler} ({DOSE}, and others), from {GOSemSim} the class GOSemSimDATA, the classes defined by {mixOmics}: plsda, pca, spls, sgcca. The ones from microbiome analysis like those from {phyloseq} …

Michael Love (09:23:22): > I’d recommend that, at least for the newcomer developers, we focus the open challenges on classes supported by the core team

Michael Love (09:24:31): > i think limiting the scope of the challenges will help things. which is not to say “don’t address these other classes” but just that when we list the TODOs (which new-comers look at when they consider contributing to the tidyomics project) we focus on classes that are “main” ones supported by core (or like SCE which is so widely adopted that it’s nearly so)

2024-04-28

Danielle Callan (08:43:12): > @Danielle Callan has joined the channel

2024-04-29

Michael Love (09:55:18): > do we have something for manipulating sequences (DNA)? tidyfasta?

Tyrone Lee (12:24:56) (in thread): > Bioseq is compatible with dplyr grammar and other tidyverse tools.

Michael Love (14:49:13) (in thread): > in that it works on data embedded in a tibble - File (PNG): Screenshot 2024-04-29 at 2.49.02 PM.png

Michael Love (14:49:34) (in thread): > sorry, i should have been specific, I meant operating natively on the S4 object

2024-05-07

Michael Love (08:36:08): > some random tidySE questions (i can post elsewhere if that is preferred): > * would it be possible to have SE -> group_by -> SE for some operations? E.g. group_by -> mutate (so not changing dimensions at all) or group_by -> slice (picking one row/col per group) > * would it be possible to add to rowData or colData with mutate without reference to existing columns? e.g. could it be inferred by the length of the vector, or maybe specify functions mutate_row_data, mutate_col_data? > * I have an idea for virtual nesting of SE, where instead of actually breaking up the SE, we nest indices of rows/cols while keeping the original object unmodified

2024-05-09

stefano mangiola (06:12:20) (in thread): > 1. is one of the challenges, for the moment nest is one option. > 2. mutate recognises which are you referring to, unless you do trivial (not useful?) operation such as mutate(a = 1). > 3. not sure I get this. Would with_groups() or group by achieve this?

Michael Love (06:39:03) (in thread): > 1. got it > 2. i’m often interested in adding things like rowSums() or colSums() to rowData or colData. do you see a way to support this? > 3. let me work on a concrete example

Michael Love (06:40:32) (in thread): > For 2, just in general it might be nice to add rowData or colData columns, besides > > rowData(x)$new_col <- z > > something like > > x |> > mutate_row_data(new_col = z) >

Philippe Laffont (07:39:52): > @Philippe Laffont has joined the channel

stefano mangiola (07:55:03) (in thread): > I think there is scope to create mutate_row_data and col_data. But I would like to understand whether it would be very inconvenient doing it with the current framework before creating ad hoc functions > > for rowSums() yes would be a bit more cumbersome, something like > > se |> with_groups(.feature, ~.x |> mutate(s = sum(counts))) > > for new_col = z what would z be? If you make a concrete example it would help.

stefano mangiola (07:55:51) (in thread): > (with_groups has not been implemented yet :slightly_smiling_face:)

Michael Love (08:55:30) (in thread): > let me mock an example

Michael Love (09:07:23) (in thread): > > library(tidySummarizedExperiment) > library(airway) > data(airway) > airway |> group_by(.feature) |> summarize(rowsum = sum(counts)) > rowSums(assay(airway)) |> head() > # I would like to do this: > airway %>% > mutate_row_data(rowsum = rowSums(assay(.))) > > library(microbenchmark) > microbenchmark(airway |> group_by(.feature) |> summarize(rowsum = sum(counts)), rowSums(assay(airway)), times=5) > # on my machine: ~700 ms vs 3 ms >
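As a rough illustration of the requested behaviour using only SummarizedExperiment, the sketch below adds pre-computed columns to rowData without unspooling the object into a tibble. `add_row_data()` is a hypothetical helper written for this note; it is not the mutate_row_data() proposed above nor any function in tidySummarizedExperiment, and the toy matrix stands in for the airway counts.

```r
library(SummarizedExperiment)

# Hypothetical helper: attach named vectors as new rowData columns.
add_row_data <- function(se, ...) {
  new_cols <- list(...)
  stopifnot(
    length(new_cols) > 0,
    !is.null(names(new_cols)),
    all(names(new_cols) != "")
  )
  for (nm in names(new_cols)) {
    # Each new column must have one value per feature (row).
    stopifnot(length(new_cols[[nm]]) == nrow(se))
    rowData(se)[[nm]] <- new_cols[[nm]]
  }
  se
}

# Toy data: 4 genes x 3 samples (illustrative values only).
counts <- matrix(
  1:12, nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("s", 1:3))
)
se <- SummarizedExperiment(assays = list(counts = counts))

# rowSums() works on the matrix directly, which is where the speed comes from.
se <- se |> add_row_data(rowsum = rowSums(assay(se)))
rowData(se)
```

The length check is what would let a verb of this kind infer that a vector belongs to the features rather than the samples.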

stefano mangiola (21:41:51) (in thread): > Yes for efficiency it makes sense to implement those functions. Please feel free to open an issue, > > Another way (still comparatively slow) is > > airway |> left_join(airway |> assay() |> rowSums() |> enframe(name = ".feature", value = "sum_counts"))

2024-05-10

Michael Love (06:21:35) (in thread): > awesome, I may open a PR in the next month :handshake:

stefano mangiola (06:24:32) (in thread): > Michael Love contributor of tidySummarizedExperiment.. now we are in business!

stefano mangiola (06:26:50) (in thread): > Just one thing I thought > > I would prefer referring to feature and sample, rather than row and column (which is a backend detail)

stefano mangiola (06:27:21) (in thread): > (we have already join_features) so maybe mutate_features?

Michael Love (06:45:35) (in thread): > sure!

2024-05-15

Sunil Nahata (08:31:25): > @Sunil Nahata has joined the channel

2024-05-23

Michael Love (11:16:43) (in thread): > i’m looking over mutate.SummarizedExperiment and update_SE_from_tibble, is my understanding correct that you as_tibble.SummarizedExperiment the SE, then perform the mutate, and then scan the SE to see what kind of update is needed? I.e. while you do unspool the SE for performing the mutate operation, the object that is updated remains SE throughout?

Michael Love (17:51:36) (in thread): > here’s an example of what i’m thinking for a new function mutate_features: https://github.com/mikelove/tidySummarizedExperiment/commit/bc563fce72b834c581df058c32c307ed29cff09e

stefano mangiola (21:51:55) (in thread): > The mutate was born simple but inefficient: SE |> tibble |> mutate |> SE. But I have been developing it more and more to understand what needs to be mutated and avoid SE |> as_tibble. I have been mutating rowData and colData without a specific mutate_features. So mutate_features will integrate nicely not only with the front end but also with the back end

stefano mangiola (21:53:57) (in thread): > > here’s an example > Great! left a little comment

2024-05-24

Michael Love (09:12:07) (in thread): > great, thanks i’ll keep working on it, and develop mutate_samples too

Michael Love (22:28:46) (in thread): > ok now has mutate_samples, added tests and man pages https://github.com/mikelove/tidySummarizedExperiment/

Michael Love (22:29:01) (in thread): > let me know if you think it’s ready for PR

2024-05-26

stefano mangiola (04:35:29) (in thread): > please do

Michael Love (09:38:38) (in thread): > I can work on cli based coloring this week

stefano mangiola (19:37:18) (in thread): > You mean special columns vs mutable columns? > > That would be huge! :heart: We have a half-PR here, not sure if it is a good starting point https://github.com/qclayssen/tidySingleCellExperiment/commit/3b740c83f3220e1baa3f0d5dca50bbb2a62086ec But feel free to start from scratch

2024-05-27

Michael Love (07:51:31) (in thread): > By cli I mean using this package for coloring https://github.com/r-lib/cli

Michael Love (07:51:45) (in thread): > Bc current code breaks in Rmd or Quarto

Michael Love (07:52:10) (in thread): > Breaks = renders the Ansi code instead of coloring

stefano mangiola (08:25:26) (in thread): > I think we use it for some rendering. But using it for colouring would be great

Michael Love (08:33:31) (in thread): > It should fix this thing on line 2

Michael Love (08:33:47) (in thread): > > ## # A SummarizedExperiment-tibble abstraction: 102,193 × 5 > ## # [90mFeatures=14599 | Samples=7 | Assays=counts[0m > ## .feature .sample counts condition type >

Michael Love (08:34:06) (in thread): > [90m

2024-06-05

Adrian Hirt (05:47:09): > @Adrian Hirt has joined the channel

2024-06-10

Justin Landis (13:18:09): > @Justin Landis has joined the channel

2024-06-11

Michael Love (06:07:22): > hi @stefano mangiola, could we find a time to meet with you to ask some development questions about tidySE and tidySCE? @Justin Landis just joined my lab and is interested in applying data mask concepts from rlang. Would early or late your time be preferred?

stefano mangiola (20:11:36): > Hi Michael, and welcome Justin. Adelaide 8.30AM UNC 7pm would suit?

2024-06-12

Justin Landis (09:45:20): > I think that time would work for me depending on the day. The earliest I could meet would be Thursday. Unfortunately I will be traveling this weekend and I likely will not be able to meet until next week.

Justin Landis (10:02:44): > To clarify, I mean 7 pm on the 13th for UNC, 8:30 am on the 14th for Adelaide :sweat_smile:

2024-06-16

stefano mangiola (21:31:23): > Hello Community! I am glad to announce that the #tidyomics ecosystem has made it to Nature Methods. :tada: :tada: Thanks to all the dev and user community! @Michael Love co-led the study, and congrats to @William Hutchison and @Timothy Keyes for making this happen. The work has just started :) https://www.nature.com/articles/s41592-024-02299-2?utm_source=twitter&utm_medium=social&utm_campaign=nmeth Tweet: https://x.com/steman_research/status/1802505141045833731 Bluesky: https://bsky.app/profile/stemang.bsky.social/post/3kv3j4kr6l327 We will organise a Zoom party to celebrate with the community and plan for the future! - Attachment (Nature): The tidyomics ecosystem: enhancing omic data analyses > Nature Methods - tidyomics offers a software ecosystem for omic data manipulation and analysis that bridges Bioconductor with the tidyverse framework. - Attachment (X (formerly Twitter)): Stefano Mangiola (@steman_research) on X > :broom::star-struck: #tidyomics made it to @naturemethods! > “The tidyomics ecosystem: Enhancing omic data analyses”, :clap::skin-tone-4:, @hutchisonwj0, @timothykeyes, and Team! > > 32 researchers | 26 institutes | 10 countries | 4 continents > > A #crowd/#community-research success > https://t.co/koiFWBVygM @cziscience - Attachment (Bluesky Social): Stefano Mangiola (@stemang.bsky.social) > #tidyomics made it to #NatureMethods! “The tidyomics ecosystem: Enhancing omic data analyses”, :clap::skin-tone-4: Will, Timothy, and Team! > > 32 researchers | 26 institutes | 10 countries | 4 continents > > A #crowd/#community-research success > > https://nature.com/articles/s41592-024-02299-2?utm_source=twitter&utm_medium=social&utm_campaign=nmeth… > > #CZI, Thanks @mikelove.bsky.social !

Michael Love (21:35:38) (in thread): > Thanks to all the contributors and co-authors. It’s been so exciting and fun to work on such a large open project within the Bioc community. And thanks to Stefano for spearheading at every turn:grin:

2024-06-17

Justin Landis (19:53:40): > Hello everyone! I am working on a project at the moment to create more efficient abstractions of Bioconductor objects to work in conjunction with dplyr verbs. > So far I have a working example of mutate() for SummarizedExperiment objects. > > Please have a look at https://github.com/jtlandis/biocmask Feel free to open an issue or start a discussion as well!

2024-06-18

stefano mangiola (01:20:35) (in thread): > To celebrate the publication and discuss future plans of #tidyomics, we are organising a #Zoom party/get-together. > > This is #open to everyone! So please join, and let’s walk together. > > When would you prefer? Please reply to the #poll below. > > :star: Option 1 > > 5.30PM US time (New York, EDT) | > 10.30PM Europe time (London, BST) | > 7:30AM Australia time (Sydney, AEST) > > :balloon: Option 2 > > 8AM US time (New York, EDT) | > 1PM Europe time (London, BST) | > 10PM Australia time (Sydney, AEST) - File (PNG): image.png

shristi shrestha (12:22:05): > @shristi shrestha has joined the channel

2024-06-19

Izabela Mamede (08:04:43): > Hello all, while giving a workshop I found a weird bug involving tidyseurat with Seurat 5.0, merging tidyseurat objects, and the FindMarkers function. Should I report it here, or is it better on GitHub?

Vince Carey (14:03:49): > Hi Izabela. Seurat is not a Bioconductor package, so use their github or support forum.

Michael Love (22:11:13) (in thread): > I think the Q may be about tidyseurat which is CRAN but Stefano maintains and is in the tidyomics family

Michael Love (22:13:32) (in thread): > Maybe tidyseurat GH so you can provide lots of code output etc

Michael Love (22:13:56) (in thread): > And gets a link in case we need to pull in Seurat devels

2024-06-20

Izabela Mamede (08:31:34) (in thread): > Yep exactly, will clarify there I was talking about tidyseurat

Izabela Mamede (08:31:49) (in thread): > Will do, thanks!

Alyssa Columbus (19:09:51): > @Alyssa Columbus has joined the channel

stefano mangiola (23:43:18) (in thread): > Yes Iza, feel free to open an issue to tidyseurat

stefano mangiola (23:57:43): > <!channel> tidyomics is looking for a developer! Postdoc and PhD positions available. Please spread the word. :pray: The position will be hosted in my lab and in collaboration with @Michael Love. https://careers.adelaide.edu.au/cw/en/job/514194/grantfunded-researcher-a Great scientific environment and great salary! You will also be exposed to and part of large-scale single-cell and spatial data analyses and AI for single-cell cancer immunology. > > Please consider applying! - File (PNG): JOB OPPORTUNITIES.png

2024-06-30

Nicolas Peterson (13:08:52): > @Nicolas Peterson has joined the channel

2024-07-02

stefano mangiola (08:40:12): > Join us Tuesday 16th of JULY for a #tidyomics Zoom celebration and open discussion about #future plans! > > 8.30AM US time (New York, EDT) > 1.30PM Europe time (London, BST) > 10.30PM Australia time (Sydney, AEST) > > Zoom: https://unimelb.zoom.us/j/88334039518?pwd=EU0W9q7OsMbSO27693Ka5raygBet99.1 Password: 542990 > > #everyone welcome! - File (PNG): image.png

stefano mangiola (08:41:00) (in thread): > Calendar invite - File (Binary): tidyomics celebration of future plans.ics

Juan Henao (09:49:28): > @Juan Henao has joined the channel

Diána Pejtsik (10:56:35): > @Diána Pejtsik has joined the channel

2024-07-05

Margherita (12:28:50): > @Margherita has joined the channel

2024-07-11

Hothri Moka (07:21:07): > @Hothri Moka has joined the channel

2024-07-14

stefano mangiola (21:55:08) (in thread): > Hello Community! Just a reminder to join our open Zoom discussion this Tuesday about tidyomics, how to grow our community, and the priorities for the future!

2024-07-16

stefano mangiola (08:17:06): > Join us in 15 minutes! :slightly_smiling_face: tidyomics community meeting > > Zoom: https://unimelb.zoom.us/j/88334039518?pwd=EU0W9q7OsMbSO27693Ka5raygBet99.1 Password: 542990

Michael Love (09:13:06): > Notes from this meeting: https://docs.google.com/document/d/1tlUsMvKbdHdM1n_MDphD9sTsyLoMX2EG7dvEso-8dCY/edit

stefano mangiola (10:29:59): > @Jonathan Carroll

stefano mangiola (23:16:27) (in thread): > Exciting first #tidyomics meeting! > > We talked about community, development plans, and things to improve. Join us for the next one in two weeks. Details soon! - File (PNG): image.png

2024-07-19

stefano mangiola (04:07:08): > The next #zoom #meetup is on TUESDAY 30th of July. > > 8.30AM US time (New York, EDT) > 1.30PM Europe time (London, BST) > 10.30PM Australia time (Sydney, AEST) > > We will talk about the #community and the #software ecosystem. Please share… Password: 542990

2024-07-24

Michael Love (09:40:12): > Great session right now at BioC2024 from Jacques:raised_hands: - File (PNG): Screenshot 2024-07-24 at 9.39.24 AM.png

Michael Love (09:47:31): > question from audience about grouped GRanges vs GRangesList

Michael Love (10:46:28): > introducing tidyCoverage - would be great to hear about this at the next tidyomics meeting - File (PNG): Screenshot 2024-07-24 at 10.45.57 AM.png

Jacques SERIZAY (11:29:52) (in thread): > Thanks Mike. The question was whether there is a way to convert GRangesLists to GroupedGRanges (not that I know of)? I realised I never asked myself this (I typically don’t use GRangesLists as such though…). It would be good to look into it, would be a nice way to directly tie tidyCoverage and plyranges together :heart_eyes:

Jacques SERIZAY (11:33:04) (in thread): > Happy to! I was hoping to be faster and cover more of it, but I think the audience was not that familiar with tidyomics in general (which I guess I wrongly assumed:man-facepalming:). > Also, I’m at the conference until the end, if anyone wants to directly chat!

Michael Love (11:49:13): > my thought is that, where does mcols(grl) go?

Michael Love (11:50:09): > in the chat I had said > > GroupedGRanges is to Grouped dataframe as GRangesList is to nested table

Marcel Ramos Pérez (11:53:00) (in thread): > mcols(grl) is the actual grouped DataFrame, right?

Michael Love (11:55:44) (in thread): > > > grl <- GRangesList(GRanges("1",IRanges(start=1:3,end=10:12)), GRanges("2",IRanges(start=1:5,end=10:14))) > > grl > GRangesList object of length 2: > [[1]] > GRanges object with 3 ranges and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] 1 1-10 * > [2] 1 2-11 * > [3] 1 3-12 * > ------- > seqinfo: 2 sequences from an unspecified genome; no seqlengths > > [[2]] > GRanges object with 5 ranges and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] 2 1-10 * > [2] 2 2-11 * > [3] 2 3-12 * > [4] 2 4-13 * > [5] 2 5-14 * > ------- > seqinfo: 2 sequences from an unspecified genome; no seqlengths > > > mcols(grl) <- DataFrame(foo = 1:2) > > grl > GRangesList object of length 2: > [[1]] > GRanges object with 3 ranges and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] 1 1-10 * > [2] 1 2-11 * > [3] 1 3-12 * > ------- > seqinfo: 2 sequences from an unspecified genome; no seqlengths > > [[2]] > GRanges object with 5 ranges and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] 2 1-10 * > [2] 2 2-11 * > [3] 2 3-12 * > [4] 2 4-13 * > [5] 2 5-14 * > ------- > seqinfo: 2 sequences from an unspecified genome; no seqlengths > > > mcols(grl) > DataFrame with 2 rows and 1 column > foo > <integer> > 1 1 > 2 2 >

Michael Love (11:56:24) (in thread): > it is metadata along length(grl) not sum(lengths(grl))

Michael Love (11:57:24) (in thread): > mcols(grl) != do.call(rbind, lapply(grl, mcols))

Marcel Ramos Pérez (12:03:59) (in thread): > right, there are multiple levels of mcols. Could the mcols(GroupedGRanges) == mcols(grl)? or maybe I don’t understand the use case well

Michael Love (12:26:34) (in thread): > me neither exactly

Michael Love (12:26:57) (in thread): > GroupedGRanges is really analogous to GroupedDF, the grouping is just an indicator of how future operations should behave
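
The distinction drawn in this thread can be sketched in R. This is only an illustration, assuming the GenomicRanges and plyranges packages are installed; the `group` column name and the flatten-then-group strategy are assumptions made for the example, not an established GRangesList-to-GroupedGRanges conversion.

```r
suppressPackageStartupMessages({
  library(GenomicRanges)
  library(plyranges)
})

# Same toy object as in the thread
grl <- GRangesList(
  GRanges("1", IRanges(start = 1:3, end = 10:12)),
  GRanges("2", IRanges(start = 1:5, end = 10:14))
)
mcols(grl) <- DataFrame(foo = 1:2)

# List-level metadata has one row per list element (2), not per range (8)
nrow(mcols(grl)) == length(grl)

# One possible flattening: unlist, record the originating element in a
# column (name `group` is arbitrary), and repeat the list-level metadata
# once per range so it is not lost
gr <- unlist(grl, use.names = FALSE)
gr$group <- rep(seq_along(grl), lengths(grl))
gr$foo <- rep(mcols(grl)$foo, lengths(grl))

# The grouping only marks how later verbs should behave
grouped <- group_by(gr, group)
class(grouped)
```

Note the cost of this approach: `foo` is now stored once per range rather than once per group, which is the "where does mcols(grl) go?" concern raised above.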

stefano mangiola (19:13:43) (in thread): > Let me add this to the agenda

2024-07-25

Jacques SERIZAY (12:17:28) (in thread): > Fair point. The question was raised in the audience without a specific use case. Personally, when I use GRangesList, it really is just to have a list of GRanges, so I typically don’t have anything stored in mcols(grl), but I definitely see how coercing GRangesList to GroupedGRanges would be a problem. Beyond this limitation, I agree that the use case for GroupedGRanges <-> GRangesList seems pretty limited.

Michael Love (12:29:55) (in thread): > One common case when this comes up is the GRL from exonsBy. There is often metadata about genes, e.g. gene symbols or other mappings

2024-07-26

Michael Love (10:34:17) (in thread): > sorry i just realized i have to drop off kids at 8:45am for the next two weeks, so i won’t be able to attend the meeting. If you record i can catch up afterward

Jacques SERIZAY (16:50:26): > Not sure how much we want this for now, but since I was looking at the different pieces of tidyomics workshops (trying to gather them all), I’ve thrown some of them in a biocbook. I’m thinking about keeping it on my GH for now, but let me know if you guys think it’d be best to move it to the tidyomics organization. https://js2264.github.io/BiocBook.tidyomics/devel/

Michael Love (16:53:15): > This is great, we should definitely have a book like this. I’m happy to contribute > > (And I’m in favor of it being under the org, let’s think of a book title)

Jacques SERIZAY (17:42:35): > Let’s brainstorm a title, then I can update the content and push to tidyomics as origin. The way BiocBooks work requires the repo name and the book title to be the same, in order to have a working Docker deployment

Qiwen Octavia Huang (19:48:09): > @Qiwen Octavia Huang has joined the channel

2024-07-27

stefano mangiola (04:47:01) (in thread): > @Jacques SERIZAY are you still presenting at the meeting or do you want to postpone?

Jacques SERIZAY (10:57:36) (in thread): > @stefano mangiola not sure I understand your question. I ran the first workshop of Bioc2024 on Wednesday, the conference was wrapped up yesterday, I’m flying back to France today and so I’d still be happy to chat on Tuesday 30, 1:30pm London time. Did I miss something?

2024-07-28

stefano mangiola (23:51:55) (in thread): > I was just wondering if you are OK mentioning your work at tomorrow’s meeting

2024-07-29

Jacques SERIZAY (04:12:20) (in thread): > Yes of course!

stefano mangiola (04:58:03) (in thread): > This is awesome. You should definitely mention this tomorrow

stefano mangiola (05:01:24): > <!channel> allow me to give a gentle reminder, about: > > the #tidyomics meeting tomorrow! > (Tuesday, details below) @Jacques SERIZAY will present #tidyCoverage + we’ll discuss how to improve the ecosystem and community! > > 8.30AM US (New York, EDT) > 1.30PM EU (London, BST) > 10.30PM AU (Sydney, AEST) https://unimelb.zoom.us/j/88334039518?pwd=EU0W9q7OsMbSO27693Ka5raygBet99.1 …… Password: 542990

Stephanie Hicks (06:57:11) (in thread): > Thank you! Will it be recorded by chance?

stefano mangiola (07:26:30) (in thread): > We can, I believe. I have to remember :slightly_smiling_face: Btw, we will rotate meeting time to accommodate all zones.

Michael Love (09:34:24) (in thread): > I can join, just a little late tomorrow

Maximilian (10:27:57): > @Maximilian has joined the channel

Jenny Drnevich (10:35:57) (in thread): > I will try to join about halfway through. @Jacques SERIZAY and I had some good discussion at the conference about tidy-izing the bioc-rnaseq Carpentries workshop (https://carpentries-incubator.github.io/bioc-rnaseq/) as it’s all in base R. We want to make a few more improvements based on last week’s teaching of it, and then the idea is you all fork it off and tidyize it. Jacques said a side-by-side comparison would be beneficial for your paper as well.

Michael Love (10:37:01) (in thread): > it would be cool if there were some exploration of the counts directly from a SummarizedExperiment, before going into DE world

Michael Love (10:37:35) (in thread): > i have a little EDA of RNA-seq here using tidyomics, you are free to copy whatever you like: https://tidyomics.github.io/tidy-ranges-tutorial/rna-seq-eda.html

Jenny Drnevich (10:38:54) (in thread): > We have some EDA in https://carpentries-incubator.github.io/bioc-rnaseq/04-exploratory-qc.html. If there is anything critical you think we are missing, let me know!

Michael Love (10:39:16) (in thread): > OSCA and OSTA are the two big Bioc online books: “Orchestrating…” for sc and spatial > > Tidying Up Bioconductor Analysis (TUBA) :wink:

Michael Love (10:40:11) (in thread): > oh so nothing different probably. > > just that you can do some quick plots with tidySummarizedExperiment

Michael Love (10:40:22) (in thread): > i think that EDA doc is:ok_hand:

Michael Love (10:41:18) (in thread): > for example, I have this at the bottom > > We can also make more interesting plots. E.g. for the genes involved in pluripotency, make a line plot, highlighting OCT4. In addition, center the log counts for each gene (subtract the mean of log counts across samples). - File (PNG): image.png

Michael Love (10:41:39) (in thread): > it shows a Bioc beginner who knows ggplot2 and dplyr how they can get info out of an SE

Michael Love (10:42:41) (in thread): > but for the EDA basics like library size, PCA i think the carpentries docs are:ok_hand:

Jenny Drnevich (10:42:56) (in thread): > That’s a cool spidery looking plot! I think that might fit better after DE detection because in many projects you might not know the interesting genes ahead of time (and because beginners could get the idea “it is important for me to always check the OCT4 gene”).

Michael Love (10:43:30) (in thread): > i’m happy to lend a hand if there are questions about translating between base R/Bioc and tidySE or tidybulk

Jacques SERIZAY (12:46:13) (in thread): > You forgot OHCA :disappointed: Haha, love TUBA! It does not convey the “omics”, but not a big problem for me… Otherwise what came to me was: > * Tidy Omics Data Analysis (TODA) > * Tidy Analysis of Omics Data (TAOD) > * Tidy Multi Omics Analysis (TMOA) > * Tidy Analysis of Multi Omics (TAMO) > Or any of these words in any order :wink:

Michael Lawrence (13:29:53) (in thread): > Realistically I will have to watch the recordings of these, because they’re at 5:30 AM for me.

Michael Love (13:36:39) (in thread): > Tidy Analysis of Data Analysis:tada:

Michael Love (13:37:37) (in thread): > Forgot about OHCA! > > Also OSTA isn’t here actually… https://www.bioconductor.org/help/bioconductor-books/ - Attachment (bioconductor.org): Bioconductor - Bioconductor Books > The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.

Michael Love (13:38:23) (in thread): > Some of these are mega-vignettes actually (just one package). Not that I mind.

Michael Love (13:38:44) (in thread): > We also can shift them around the clock

stefano mangiola (18:22:24) (in thread): > Yes we will rotate the time. Tomorrow we will discuss that as well.

2024-07-30

Susan Holmes (08:45:41): > @Susan Holmes has joined the channel

Juan Henao (12:57:23) (in thread): > Should we pin the link for the document? or fix it in the channel description? Otherwise, it will get lost, or you have to send the link every time

Michael Love (13:33:33) (in thread): > adding now

Michael Love (13:47:04): > Added bookmarks to planning docs / website

stefano mangiola (19:32:17) (in thread): > Thanks for joining! > > Here is the meeting recording https://unimelb.zoom.us/rec/share/yzeVTd2x8R8RDfksWDV7ChElg3Q0yhC0vBUG0_YAFxkmMHq6JLcgoCCxjIeSttM3.vREGMXsClJ8bMEfU Passcode: Yn5#301E - File (PNG): Screenshot 2024-07-30 at 11.25.53 pm.png

2024-07-31

Michael Love (06:55:04): > For thinking about a time to include US West coast: - File (PNG): Screenshot 2024-07-31 at 6.54.10 AM.png

Michael Love (06:56:02): > Maybe 8-10 AM in Melbourne?

stefano mangiola (19:09:41): > Cool, maybe we can do this time next time, and the one after that I can skip.

2024-08-04

stefano mangiola (22:47:07): > Next week @Jonathan Carroll will present #DFplyr, and we’ll chat community and ecosystem! Join us! > > This will be US, AU friendly (we will rotate to US/EU/Middle-east zones the week after) > - 9AM Sydney (wed) > - 4PM San Franc. US (tue) > - 7PM NY (tue) https://unimelb.zoom.us/j/88334039518?pwd=EU0W9q7OsMbSO27693Ka5raygBet99.1 … Pswrd: 542990 - File (PNG): Screenshot 2024-07-30 at 11.25.53 pm.png

Jonathan Carroll (22:48:02) (in thread): > https://github.com/jonocarroll/DFplyr for anyone who wants to find breaking edge cases before I present it :stuck_out_tongue_winking_eye:

Stevie Pederson (22:49:38) (in thread): > Ooh! Nice! Just confirming this is Wed 14th?

Jonathan Carroll (22:50:19) (in thread): > Wednesday morning ACST, 14th August

2024-08-05

Jacques SERIZAY (09:03:50): > Self-advertisement: tidyCoverage got published as an Application Note in Bioinformatics: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btae487/7723482?rss=1 :slightly_smiling_face:

Michael Love (09:44:00): > Congrats!!

2024-08-08

Michael Lawrence (13:17:25) (in thread): > Looks interesting!

2024-08-11

Jonathan Carroll (01:35:34): > I wanted to write a pre-read post for presenting {DFplyr} and it got long… https://jcarroll.com.au/2024/08/11/tidy-dataframes-but-not-tibbles/ - Attachment (Irregularly Scheduled Programming): Tidy DataFrames but not Tibbles > A while ago (2019 seems so long ago now) I started working on something I > thought was interesting but which never really got any traction. It has > potential once more, so it’s about time I wrote up what it does and why I think > it’s a useful idea. I’m going to talk about using the {dplyr} package on some > data with rows and columns, but we’re not talking about data.frames or > tibbles…

2024-08-12

Malte Thodberg (16:51:44) (in thread): > Super interesting :eyes: Could the SplitDataFrameList class from IRanges be used to mimic the tibble grouping function?

Jonathan Carroll (19:26:43) (in thread): > Possibly - I think I explored that at one point. I suppose it comes down to whether it “feels like” a DataFrame and can be used as if it is one in downstream processing. I would guess that a split on a grouped DataFrame producing a SplitDataFrameList could be really useful.

Jonathan Carroll (19:28:42) (in thread): > There’s also a possibility of more closely aligning to what {tibble} does and having a proper GroupedDataFrame class (c.f. grouped_tbl) to assist with dispatching on grouped/non-grouped, but I started simple and just check for group info.

2024-08-13

Michael Love (09:54:50) (in thread): > This is really interesting, thanks for posting before the meeting @Jonathan Carroll. Something I’ve wanted to implement is a join for mcols(GRanges) to add additional columns (incoming from a data.frame or DataFrame or tibble) to the existing DataFrame. Would I be able to use your package? > > Is there a plan to submit to Bioc?

Robert M Flight (16:19:43): > @Robert M Flight has joined the channel

Jonathan Carroll (17:54:15) (in thread): > I haven’t ventured quite as far as join but that would be a natural extension, and I don’t immediately see a reason it wouldn’t work. > > In terms of formalising the approach, I’d love to discuss whether this is the “right” way or if it needs something different in {S4Vectors}, but I’m open to submitting to Bioc once it’s ready.
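
The kind of join asked about in this thread can be approximated today without any new package. The following is a minimal sketch, assuming only GenomicRanges; the key column `gene_id`, the `anno` table and its `symbol` values are hypothetical and exist only for this example. It uses base R's `match()` as a left join that keeps the ranges in their original order, and is not how DFplyr or plyranges would implement a `join` verb.

```r
suppressPackageStartupMessages(library(GenomicRanges))

# Ranges carrying a key column in their mcols
gr <- GRanges(
  "1",
  IRanges(start = c(1, 20, 40), width = 5),
  gene_id = c("g1", "g2", "g3")
)

# Hypothetical incoming annotation: different row order, and no row for g2
anno <- data.frame(
  gene_id = c("g3", "g1"),
  symbol  = c("SYM3", "SYM1")
)

# Left join by key: for each range, find its row in `anno` (NA if absent)
idx <- match(gr$gene_id, anno$gene_id)
gr$symbol <- anno$symbol[idx]

mcols(gr)
```

Because `match()` returns the first hit only, this assumes the key is unique in `anno`; a one-to-many join would change the number of ranges and needs a different strategy.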

Jacques SERIZAY (18:02:34) (in thread): > Hi all, unfortunately I won’t be able to attend today’s meeting, but let me know if it is recorded, I’d be interested!

Michael Love (19:35:09) (in thread): > this is great. we should definitely list this in the core tidyomics list once you’re ready to submit to Bioc, we will figure out over time how to merge/make use of DFplyr

stefano mangiola (19:48:02) (in thread): > Yes we should record every time! We forgot this time :disappointed: But the blogpost from Jon Carroll above is a good summary

Arash Bagherabadi (20:09:25): > @Arash Bagherabadi has joined the channel

Arash Bagherabadi (20:12:41): > :wave:Hello, team! Glad to see you all here and thanks for the meeting.

William Hutchison (20:38:39): > Hi all! It was nice to see you at today’s meeting. Here are the plots I shared tracking git commits to some of the tidyomics packages: - File (PNG): commits_combined_unique_authors.png - File (PNG): commits_combined.png - File (PNG): commits_per_packages.png - File (PNG): commits_per_author.png - File (PNG): commits_per_author_and_package.png

stefano mangiola (20:57:45) (in thread): > Thanks Jon, for the presentation!

Jonathan Carroll (21:05:10) (in thread): > No worries, I hope the package can be useful

stefano mangiola (21:49:12) (in thread): > I think so! Do you see yourself working toward a possible integration, or do you wish someone from the community would pick that up?

Jonathan Carroll (21:51:06) (in thread): > I’ll do some work on it, but additional contributors are most welcome.

Jonathan Carroll (22:00:42) (in thread): > > is there plan to submit to Bioc? > How ‘complete’ should it be before adding to Bioc? filter, select, mutate, and co. need a bit of polish, but join might take a bit longer.

2024-08-14

stefano mangiola (20:58:07): > @William Hutchison, I remember you worked to finalise the vectorial tidyomics logo. Could you please share it here? And maybe also to the tidyomics repository

stefano mangiola (21:55:15): > Thanks to Everyone. It was a great meeting as usual. <!channel>, is someone interested in presenting anything in two weeks’ time? It’s all very casual. - File (PNG): image.png

Stevie Pederson (22:03:25) (in thread): > Sorry. Was too slow turning my camera on…:disappointed:

Stevie Pederson (22:03:37) (in thread): > Great talk though!!!

stefano mangiola (23:37:27): > We encourage a balanced and diverse representation! Join us!

2024-08-15

Michael Love (07:03:13): > @Justin Landis mentioned that at some point he would like to show his work on biocmask - it would be good to try to have Michael Lawrence available for this one, so maybe similar to the time we just did (4pm West coast)?

stefano mangiola (07:33:51) (in thread): > why don’t we do a calendar already with the cycling time zones. So we can plan when people want to speak

Michael Love (10:42:37) (in thread): > lemme sketch it in the doc

Michael Love (10:46:14) (in thread): - File (PNG): Screenshot 2024-08-15 at 10.46.04 AM.png

Maria Doyle (11:17:05) (in thread): > If you have a schedule, and if you’d like, we can add these events to the Bioconductor events calendar. Additionally, we could schedule posts from the Bioconductor accounts (Mastodon, LinkedIn, Slack) to advertise and remind everyone about them. > Let me know if that works for you!

Michael Love (11:22:11) (in thread): > that would be awesome! > > i’ve been thinking, maybe we can post a zoom link with a password “tidy” and then we can broadly disperse the password, but not use a direct link (pwd included) > > i don’t know if this still happens but it would be annoying if we had bots showing up

2024-08-19

Rema Gesaka (09:37:19): > @Rema Gesaka has joined the channel

2024-08-20

stefano mangiola (20:53:12): > Hello Folks! Please share the call to participate in the next tidyomics meeting. Tuesday 27th of August. https://x.com/steman_research/status/1826059124091793867 https://bsky.app/profile/stemang.bsky.social/post/3l26vts4aak26 - Attachment (X (formerly Twitter)): Stefano Mangiola (Hiring!) (@steman_research) on X > Hello Community! > > Don’t miss the #tidyomics meeting in one week! All welcome! plz share :pray: > > We’ll present #HPCell, a pipe-to-HPC :package: for popular single-cell and spatial tools. > + we’ll discuss ecosystem and community! > > 8.30AM (New York) > 1.30PM (London) > 10.30PM (Sydney) - Attachment (Bluesky Social): Stefano Mangiola (Hiring!) (@stemang.bsky.social) > Hello Community! Don’t miss the #tidyomics meeting in one week! All welcome! plz share :pray: > > We’ll present #HPCell, a pipe-to-HPC :package: for popular single-cell and spatial tools. > + we’ll discuss ecosystem and community! > > 8.30AM (New York) > 1.30PM (London) > 10.30PM (Sydney)

2024-08-21

Michael Love (07:37:03) (in thread): > I would submit sooner rather than later

Michael Love (07:37:15) (in thread): > it’s easy to add functionality later

Michael Love (07:37:29) (in thread): > also simpler, core functionality makes it easier on reviewers

Jonathan Carroll (07:38:39) (in thread): > Cool. I’ll do some tidying (pun intended) and see about submitting it with the basic functionality it has now. Cheers.

Michael Love (07:46:58) (in thread): > let us know if you need any :eyes: on it, happy to help

Jenny Drnevich (09:09:14) (in thread): > What’s the date?

stefano mangiola (19:11:10) (in thread): > It’s Tuesday 27th. I will update the post

2024-08-22

Maria Doyle (16:59:12): > Hi Stefano, > I DM’d you about the posts for that, and I’ve added the meeting to the Events Google calendar here: https://bioconductor.org/help/events/. I’ve also made a PR to get it on the Events webpage: https://github.com/Bioconductor/bioconductor.org/pull/282/files#diff-407109866fb33b2b0abfe3d17d23b035a63147b607218992a8e8b5babc78b966 - Attachment (bioconductor.org): Bioconductor - Events > The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.

2024-08-23

stefano mangiola (02:43:52): > Thanks Maria!

Alexandra Emmons (10:13:01): > @Alexandra Emmons has joined the channel

2024-08-27

Assa (02:01:10): > @Assa has joined the channel

stefano mangiola (04:12:38): > Hello All, > > tidyomics meeting is on in ~4 hours! Please join! > > Agenda + notes https://docs.google.com/document/d/1tlUsMvKbdHdM1n_MDphD9sTsyLoMX2EG7dvEso-8dCY/edit Join from PC, Mac, iOS or Android: https://unimelb.zoom.us/j/88334039518?pwd=EU0W9q7OsMbSO27693Ka5raygBet99.1 Password: 542990 - File (Google Docs): tidyomics community planning meeting

Michael Love (06:17:51): > Tidy workshop from Bioc is up: https://www.youtube.com/watch?v=ky-N_IEe6cw&list=PLdl4u5ZRDMQQzhPLvgOl2O6iWe_EGTywz - Attachment (YouTube): Workshop: Applying tidy principles to investigating chromatin composition and architecture

Peter Hickey (18:37:09): > @Peter Hickey has left the channel

2024-09-04

Jonathan Carroll (03:27:54): > I’m hoping to leverage the momentum I have for {DFplyr} and submit to Bioconductor for the next release - if anyone is familiar with using S4Vectors::DataFrame and {dplyr} and would like to see how they play together, please take https://github.com/jonocarroll/DFplyr for a spin and submit any issues you spot - I’m sure there are some bits broken/not working.

Jonathan Carroll (03:31:20) (in thread): > Implemented so far: > > arrange > count > distinct > filter > format > group_by > mutate > pull > rename(2) > select > slice > summarise > summarize > tally > ungroup
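
For readers who want to try the verbs listed above, here is a minimal sketch of what "taking it for a spin" might look like. It assumes DFplyr is installed from the GitHub repository linked above alongside S4Vectors and dplyr, and that `filter()` and `mutate()` behave on a `DataFrame` as the list of implemented verbs suggests; the package was a work in progress at the time, so exact behaviour may differ. The `ratio` column is made up for the example.

```r
suppressPackageStartupMessages({
  library(S4Vectors)
  library(dplyr)
  library(DFplyr)
})

# Coerce a base data.frame to an S4Vectors DataFrame
d <- as(mtcars[, c("cyl", "hp", "am")], "DataFrame")

# dplyr verbs applied directly to the DataFrame; the intent of DFplyr is
# that the result stays a DataFrame instead of being converted to a tibble
d2 <- d |>
  filter(am == 1) |>
  mutate(ratio = hp / cyl)

class(d2)
colnames(d2)
```

The native pipe `|>` needs R 4.1 or later; `%>%` from dplyr should work equally well.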

Lluís Revilla (13:15:13) (in thread): > Cool! I’ll try it soon! Next release as for 3.20, submitting before September 20, or for 3.21?

stefano mangiola (18:17:41) (in thread): > Amazing! if you have opportunity to check consistency with tidyomics in terms of arguments, generics etc.. this would be a great time to do it.

2024-09-05

Jonathan Carroll (00:02:47) (in thread): > If there aren’t glaring issues I could aim for 3.20 but this is my first try so I don’t yet know what I’m overlooking.

Dario Righelli (11:34:32): > @Dario Righelli has joined the channel

Michael Love (17:39:33) (in thread): > Jonathan – would you want to spend some minutes on the Sept 10/11 getting additional feedback? > > Justin will present a package but we can have some spare time after his presentation

Michael Love (17:45:58): > Hello All, > > tidyomics community meeting planned next week for Sept 10/11: > * September 10 – 23:00 UTC @Justin Landis of UNC will present a tidyomics work in progress, biocmask, with a data masking approach to SummarizedExperiment manipulation > Agenda + notes + time zone details https://docs.google.com/document/d/1tlUsMvKbdHdM1n_MDphD9sTsyLoMX2EG7dvEso-8dCY Zoom in thread :point_down: - Attachment (dplyr.tidyverse.org): Data-masking — dplyr_data_masking > This page is now located at > ?rlang::args_data_masking.
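
To make "data masking" concrete, the sketch below evaluates a user-supplied expression with the assays of a SummarizedExperiment acting as the data mask, so that `counts` resolves to the assay of that name. It assumes SummarizedExperiment and rlang are installed. The helper `mutate_assays()` is hypothetical and written only for this illustration; it is not the biocmask API or implementation.

```r
suppressPackageStartupMessages({
  library(SummarizedExperiment)
  library(rlang)
})

se <- SummarizedExperiment(
  assays = list(
    counts = matrix(
      1:6, nrow = 3,
      dimnames = list(paste0("g", 1:3), c("s1", "s2"))
    )
  )
)

# Hypothetical helper: each named argument becomes a new (or replaced)
# assay, computed with the existing assays visible by name
mutate_assays <- function(se, ...) {
  quos <- enquos(...)
  for (nm in names(quos)) {
    mask <- as.list(assays(se))   # named list of matrices = the data mask
    assay(se, nm) <- eval_tidy(quos[[nm]], data = mask)
  }
  se
}

se2 <- mutate_assays(se, logcounts = log2(counts + 1))
assayNames(se2)
```

Rebuilding the mask inside the loop lets a later expression refer to an assay created by an earlier one, mirroring how `dplyr::mutate()` treats sequential arguments.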

Michael Love (17:46:19) (in thread): > Michael Love (he/him) is inviting you to a scheduled Zoom meeting. > > Topic: tidyomics community meeting > Time: Sep 10, 2024 07:00 PM Eastern Time (US and Canada) > > Join Zoom Meeting https://zoom.us/j/97540639601?pwd=NQz3UC9yrapAwU6t5vgeSqnPwzbCxk.1 Meeting ID: 975 4063 9601 > Passcode: tidyomics > > --- > > One tap mobile +13052241968,,97540639601#,,,,*750327968# US +13092053325,,97540639601#,,,,*750327968# US > > --- > > Dial by your location > • +1 305 224 1968 US > • +1 309 205 3325 US > • +1 312 626 6799 US (Chicago) > • +1 646 558 8656 US (New York) > • +1 646 931 3860 US > • +1 301 715 8592 US (Washington DC) > • +1 346 248 7799 US (Houston) > • +1 360 209 5623 US > • +1 386 347 5053 US > • +1 507 473 4847 US > • +1 564 217 2000 US > • +1 669 444 9171 US > • +1 669 900 9128 US (San Jose) > • +1 689 278 1000 US > • +1 719 359 4580 US > • +1 253 205 0468 US > • +1 253 215 8782 US (Tacoma) > > Meeting ID: 975 4063 9601 > Passcode: 750327968 > > Find your local number: https://zoom.us/u/aX8URyv4e > > --- > > Join by SIP > • 97540639601@zoomcrc.com > > --- > > Join by H.323 > • 162.255.37.11 (US West) > • 162.255.36.11 (US East) > > Meeting ID: 975 4063 9601 > Passcode: 750327968

Michael Love (17:47:01) (in thread): > If anyone wants to be added to the calendar event just post your email here

2024-09-06

Jonathan Carroll (05:04:16) (in thread): > That would be great, particularly as it sounds like there’s overlap with Justin’s approach.

2024-09-10

Alex Qin (03:45:57): > @Alex Qin has joined the channel

Michael Love (18:02:00): > Reminder: Justin Landis will show biocmask in one hour. > > Also Jonathan will talk about submission of DFplyr for this upcoming release > > Zoom link: https://zoom.us/j/97540639601?pwd=NQz3UC9yrapAwU6t5vgeSqnPwzbCxk.1

Michael Love (19:48:27): > looking to submit both biocmask and Jonathan likely submitting DFplyr to this upcoming release. > > I made a video of Justin’s presentation if anyone wants to catch up on that just DM me

2024-09-11

stefano mangiola (05:46:11): > For the record, @Dario Righelli mentioned some participation initiatives for tidyomics (at EuroBioC2024), and would be happy to share them in one of our next meetings

Dario Righelli (08:01:38) (in thread): > Thanks Stefano! I’ll try to take part in one of the meetings to better explain the initiative.

2024-10-01

Caroline Schreiber (04:09:56): > @Caroline Schreiber has joined the channel

2024-10-23

Abdullah Al Nahid (03:57:42): > is there a companion tidyomics data package that will load/download toy datasets for all types of omic data? (rna-seq, scrna-seq, atac-seq, chip-seq, tcr-seq, bcr-seq, spatial etc.)

Abdullah Al Nahid (03:58:15): > I feel like it would be a fun package and I can work on it if not available already

Michael Love (07:56:04): > Like simulated? There is ExperimentHub which has thousands of (real) datasets ready to go

Kasper D. Hansen (09:20:48): > What do you mean by toy data? A strength of our field IMO is that learning can be done on real data

Abdullah Al Nahid (16:36:01): > Apologies for my bad wording. I meant to say, some curated datasets for education/workshop purposes like https://archive.ics.uci.edu - Attachment (archive.ics.uci.edu): UCI Machine Learning Repository > Discover datasets around the world!

Abdullah Al Nahid (16:36:36): > I will explore ExperimentHub

Vince Carey (16:52:29): > @Abdullah Al Nahid for rna-seq, ExperimentHub will have many offerings. For sc-rna-seq, the package scRNAseq is of interest for several reasons. First, it has a considerable number of datasets with reasonably clear provenance in influential publications. Second, it uses “language-agnostic representations” of experiments, based on the alabaster.* and gypsum packages. The aim here is to provide well-curated data to analysts using other languages. See the #biocpython channel for more information on relations to python. I know some recent packages address tcr-seq but I cannot name them now. Maybe we need some biocViews to simplify their discovery? Keep us posted on how you leverage tidyomics to improve accessibility of these resources.
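
As a starting point for exploring ExperimentHub as suggested above, the following sketch searches the hub metadata. It assumes the ExperimentHub package is installed and that a network connection is available on first use (the metadata is downloaded and cached locally). The search terms are only an example, and no specific record ID is given because which records match depends on the hub snapshot.

```r
suppressPackageStartupMessages(library(ExperimentHub))

# Create the hub object; the first call downloads and caches the metadata
eh <- ExperimentHub()

# Keep records whose metadata matches all of the given terms
hits <- query(eh, c("RNA-seq", "Homo sapiens"))

length(hits)
head(mcols(hits)[, c("title", "dataprovider")])

# A single record is then retrieved with hits[["<ID>"]], where <ID> is
# one of names(hits); this downloads and caches the data itself
```

The retrieved objects are typically standard Bioconductor containers such as SummarizedExperiment, so they can be passed to tidySummarizedExperiment or similar tidyomics packages.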

Abdullah Al Nahid (16:53:28) (in thread): > Thank you!

2024-10-29

Travis Blimkie (11:28:01): > @Travis Blimkie has joined the channel

2024-11-17

Michael Love (15:45:06): > Re: tidyomics meeting @ Tuesday 11/19 AM in US/Europe – may not be able to make this meeting bc kid duty. Do we have a sense for attendance or topics people want to cover?

Jacques SERIZAY (16:04:19) (in thread): > I’ll be here, and don’t have a particular topic in mind. > Also, while I’m at it, I lost access to the meetings schedule in my calendar (or perhaps it doesn’t exist anymore?) Could anyone share a link again please? Thanks :slightly_smiling_face:

stefano mangiola (21:52:15) (in thread): > I had dropped the calendar invite series. > > > Not sure why it is still around for some. My understanding was that we would re-establish the tidyomics meetings in the near future

2024-11-18

Michael Love (09:54:20) (in thread): > ok sounds good

Michael Love (09:55:05): > In the meantime, if there are any topics of interest here, we can slot them for upcoming months. It can be software or education/workshop related etc.

2024-11-21

Jacques SERIZAY (07:44:45): > Potential issue with tidySingleCellExperiment: is it normal that the reducedDimNames, mainExpName, and altExpNames do not show up when restoring the traditional SingleCellExperiment show method? :thinking_face: > > R version 4.4.1 (2024-06-14) -- "Race for Your Life" > Copyright (C) 2024 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu > > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > > Natural language support but running in an English locale > > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > > > library(tidySingleCellExperiment) > Loading required package: SingleCellExperiment > Loading required package: SummarizedExperiment > Loading required package: MatrixGenerics > Loading required package: matrixStats > > .... 
> > > data(pbmc_small, package="tidySingleCellExperiment") > > reducedDimNames(pbmc_small) > [1] "PCA" "TSNE" > > pbmc_small > # A SingleCellExperiment-tibble abstraction: 80 × 17 > # Features=230 | Cells=80 | Assays=counts, logcounts > .cell orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8 letter.idents groups RNA_snn_res.1 file ident PC_1 PC_2 PC_3 PC_4 PC_5 tSNE_1 tSNE_2 > <chr> <fct> <dbl> <int> <fct> <fct> <chr> <fct> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> > 1 ATGCCAGAACGACT SeuratProject 70 47 0 A g2 0 ../data/sample2/outs/filtered_feature_bc_matrix/ 0 -0.774 -0.900 -0.249 0.559 0.465 0.868 -8.10 > 2 CATGGCCTGTGCAT SeuratProject 85 52 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.0260 -0.347 0.665 0.418 0.585 -7.39 -8.77 > 3 GAACCTGATGAACC SeuratProject 87 50 1 B g2 0 ../data/sample2/outs/filtered_feature_bc_matrix/ 0 -0.457 0.180 1.32 2.01 -0.482 -28.2 0.241 > 4 TGACTGGATTCTCA SeuratProject 127 56 0 A g2 0 ../data/sample2/outs/filtered_feature_bc_matrix/ 0 -0.812 -1.38 -1.00 0.139 -1.60 16.3 -11.2 > 5 AGTCAGACTGCACA SeuratProject 173 53 0 A g2 0 ../data/sample2/outs/filtered_feature_bc_matrix/ 0 -0.774 -0.900 -0.249 0.559 0.465 1.91 -11.2 > 6 TCTGATACACGTGT SeuratProject 70 48 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.774 -0.900 -0.249 0.559 0.465 3.15 -9.94 > 7 TGGTATCTAAACAG SeuratProject 64 36 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.460 -1.19 -0.312 0.716 -1.65 17.9 -9.90 > 8 GCAGCTCTGTTTCT SeuratProject 72 45 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.900 -0.388 0.693 0.404 0.536 -6.49 -8.39 > 9 GATATAACACGCAT SeuratProject 52 36 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.774 -0.900 -0.249 0.559 0.465 1.33 -9.68 > 10 AATGTTGACAGTCA SeuratProject 100 41 0 A g1 0 ../data/sample1/outs/filtered_feature_bc_matrix/ 0 -0.488 -1.16 -0.306 0.702 -1.47 17.0 -9.43 > # ℹ 70 more rows > # ℹ Use `print(n = ...)` to see more rows > > 
options("restore_SingleCellExperiment_show" = TRUE) > > pbmc_small > class: SingleCellExperiment > dim: 230 80 > metadata(0): > assays(2): counts logcounts > rownames(230): MS4A1 CD79B ... SPON2 S100B > rowData names(5): vst.mean vst.variance vst.variance.expected vst.variance.standardized vst.variable > colnames(80): ATGCCAGAACGACT CATGGCCTGTGCAT ... GGAACACTTCAGAC CTTGATTGATCTTC > colData names(9): orig.ident nCount_RNA ... file ident > > Whereas, if I only load the SingleCellExperiment package: > > R version 4.4.1 (2024-06-14) -- "Race for Your Life" > Copyright (C) 2024 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu > > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > > Natural language support but running in an English locale > > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > > > library(SingleCellExperiment) > > ... > > > data(pbmc_small, package="tidySingleCellExperiment") > > pbmc_small > class: SingleCellExperiment > dim: 230 80 > metadata(0): > assays(2): counts logcounts > rownames(230): MS4A1 CD79B ... SPON2 S100B > rowData names(5): vst.mean vst.variance vst.variance.expected vst.variance.standardized vst.variable > colnames(80): ATGCCAGAACGACT CATGGCCTGTGCAT ... GGAACACTTCAGAC CTTGATTGATCTTC > colData names(9): orig.ident nCount_RNA ... file ident > reducedDimNames(2): PCA TSNE > mainExpName: NULL > altExpNames(0): >

Jacques SERIZAY (07:47:51): > With R 4.4.1, bioc 3.19, SingleCellExperiment_1.26.0 and tidySingleCellExperiment_1.14.0 :grimacing:

Jacques SERIZAY (07:50:23): > Happy to open a GH issue @stefano mangiola, if this is indeed a bug

Helena L. Crowell (11:23:49) (in thread): > Could it be that the int_metadata is getting lost? I believe SCE adapts some of its methods (including show()) to the object version, which is stored there. altExps are relatively new.

stefano mangiola (18:30:10) (in thread): > Interesting. > > if after restoring, you do > > reducedDimNames(pbmc_small) > > Are they in the SCE?

stefano mangiola (18:31:05) (in thread): > in other words, are the objects identical, if you load or not tidySCE, ignoring the printout?

2024-11-22

Jacques SERIZAY (05:48:34) (in thread): > Yes they are completely identical, including reducedDimNames(pbmc_small) which returns the expected [1] "PCA" "TSNE", even though the slot name does not show up in the print.

Helena L. Crowell (05:49:06) (in thread): > what’s int_metadata()

Jacques SERIZAY (07:28:27) (in thread): > > $version > [1] '1.22.0' > > $spike_names > character(0) > > $size_factor_names > character(0) > > Regardless of whether I load the data after loading tidySCE or SCE only in independent vanilla R sessions

2024-11-24

stefano mangiola (23:35:14) (in thread): > Please go ahead

stefano mangiola (23:35:17) (in thread): > Thanks!

stefano mangiola (23:36:50) (in thread): > anyway it is a mystery! As we don’t touch Bioc methods, and our printing is completely skipped if the option “print original” is chosen.

2024-11-25

Izabela Mamede (14:07:47): > Hello all, was there a final decision on the hex sticker of tidyomics? I plan to print some this semester to give to some undergrads in my class

Michael Love (14:20:01): > we’ve been using this one: - File (JPEG): tidyomics-logo.jpg

Michael Love (14:20:24): > I think @stefano mangiola may have a higher-res copy

Michael Love (14:20:45): > we can post that to biocstickers

Izabela Mamede (17:09:09) (in thread): > Would be great if possible!

stefano mangiola (18:17:05) (in thread): > Here are the few versions I did. @William Hutchison could you please add/replace in this file your polished version, and re-attach? - File (SVG): logo.svg

stefano mangiola (18:18:36) (in thread): > Also the font is slightly different from the one in PNG. @William Hutchison if you have this font (don’t remember who replaced it), please replace it as well.

William Hutchison (18:33:22) (in thread): > This is the version of the logo I had been working on. Feel free to request any changes or make any further modifications yourself - File (SVG): tidyomics_logo_3_v2.svg

stefano mangiola (18:36:02) (in thread): > Thanks! Yes this is the logo we can use.

stefano mangiola (18:42:02) (in thread): > I don’t seem to have the font that we have used

stefano mangiola (18:42:15) (in thread): > So if I open in Illustrator I lose the exact font

William Hutchison (18:49:28) (in thread): > I don’t think I have the font either. There was some discussion on this topic on the Github issue https://github.com/orgs/tidyomics/projects/1?pane=issue&itemId=35945742&issue=tidyomics%7Ctidyomics%7C4 and @Abdullah Al Nahid had put some work into finding a nice font

Abdullah Al Nahid (19:07:02) (in thread): > I will report back the font name asap. I think I found it in google fonts

Abdullah Al Nahid (19:12:06) (in thread): > This is the font https://fonts.google.com/specimen/Farro - Attachment (Google Fonts): Farro - Google Fonts > Farro is an artsy, four-weighted, display typeface that has a peculiar personality flowing through its European humanist silhouette. To contribute, see github.c

stefano mangiola (19:14:19) (in thread): > Abdullah, would you like to write a section on the logo, meaning + font name?

stefano mangiola (19:14:42) (in thread): > in the tidyomics README? We should participate in some competition! :wink:

Abdullah Al Nahid (19:18:06) (in thread): > Sure I can. Although the meaning is a bit abstract to me. DNA, molecules, bioconductor, it’s all combined.

stefano mangiola (19:19:17) (in thread): > I’m curious what ChatGPT thinks about the meaning. (I will give my version)

stefano mangiola (19:20:28) (in thread): > I attach here the SVG again, I have edited a little detail about the black outer separation (as you see in the PNG) > > !! Please let’s use this from now on - File (SVG): logo.svg

stefano mangiola (19:23:35) (in thread): > If someone wants to fix the little black spaces in the DNA molecules and other shapes, feel free and repost.

stefano mangiola (19:23:57) (in thread): > In the tidyomics github we have William’s alternative version

stefano mangiola (19:24:54) (in thread): > We have all the versions in SVG, but we should definitely converge on one to avoid confusion.

2024-12-11

Chris Fields (20:19:47): > @Chris Fields has joined the channel

2024-12-12

Janetta Top (05:07:22): > @Janetta Top has joined the channel

2025-01-02

Jason Laird (12:48:13): > @Jason Laird has joined the channel

2025-01-19

Michael Love (14:13:07): > This looks solvable with tidySingleCellExperiment? https://support.bioconductor.org/p/9161096/

2025-02-27

stefano mangiola (02:15:31): > Hello, tidy community, (allow me to tag <!channel> for this announcement) > > We are back with the tidyomics meetings! :tada: We would like to organise a first meeting around improving the way we abstract and display the SummarizedExperiment. We would like to discuss with all of you and get feedback about our recent package https://github.com/tidyomics/tidyprint and know which display you think is best among the ones we propose. This package will also aim to standardise the messaging of tidyomics packages. > > Please react to this poll so we can choose a decent time for everyone. :pray: > * :balloon: Monday 10th of March 4PM NY, 9PM London, 8AM (Tuesday) Melbourne > * :rocket: Tuesday 11th of March 4PM NY, 9PM London, 8AM (Wednesday) Melbourne > * :star: Wednesday 12th of March 4PM NY, 9PM London, 8AM (Thursday) Melbourne > * :sunny: Thursday 13th of March 4PM NY, 9PM London, 8AM (Friday) Melbourne > Stefano and Mike

Chen Zhan (02:16:43): > @Chen Zhan has joined the channel

stefano mangiola (02:42:59) (in thread): > I gave a tidyomics solution ;)

Michael Love (08:15:02) (in thread): > love it. nice and clean

Michael Love (08:15:21) (in thread): > i’ve been encouraging people to post to support so we have a nice public record of Q&A

Michael Lawrence (16:14:19) (in thread): > Btw, there are number emojis you can use for this. Perhaps a bit more intuitive.

2025-02-28

Kateřina Matějková (09:22:37): > @Kateřina Matějková has joined the channel

2025-03-02

stefano mangiola (19:43:14) (in thread): > Hello All, > > thanks for voting. The meeting will be Wednesday 12th of March 4PM NY, 9PM London, 8AM (Thursday) Melbourne (If you want to be invited to the calendar event, please message me your email address). See you there! > > Join from a PC, Mac, iPad, iPhone or Android device: > Please click this URL to start or join: https://adelaide.zoom.us/j/4427895326?pwd=vAJIUl3pjeHpC2UrCVhacSRk4IQD1S.1 Password: 247186 > > Join from dial-in phone line: > Dial: +61 8 7150 1149 Meeting ID: 442 789 5326 International numbers available: https://adelaide.zoom.us/u/kd4MrrG8hU Join from a H.323/SIP room system: > Dial: 4427895326@global.zoomcrc.com or SIP: 4427895326@zmau.us or 103.122.166.55 (Australia) > Meeting ID: 442 789 5326 H323/SIP Password: 247186

2025-03-04

Himel Mallick (11:46:34): > @Himel Mallick has joined the channel

Ammar Sabir Cheema (13:39:02): > @Ammar Sabir Cheema has joined the channel

Artür Manukyan (16:08:22): > @Artür Manukyan has joined the channel

2025-03-05

Najla Abassi (06:21:26): > @Najla Abassi has joined the channel

Benjamin Hernandez Rodriguez (22:30:22): > @Benjamin Hernandez Rodriguez has joined the channel

2025-03-11

Rüçhan Ekren (10:23:42): > @Rüçhan Ekren has joined the channel

stefano mangiola (18:08:47) (in thread): > NY time is 5PM (not 4PM). Apologies! > > To confirm these are the times of the meeting > > - 5PM NY > - 9PM London > - 8AM (Thursday 13th) Melbourne

Michael Love (19:48:14): > Thanks Stefano!!

2025-03-12

Rafael (09:24:28): > @Rafael has joined the channel

Michael Love (17:03:05): > <!channel> meeting now: https://adelaide.zoom.us/j/4427895326?pwd=vAJIUl3pjeHpC2UrCVhacSRk4IQD1S.1

Chris Fields (17:38:45): > Ended up having some weird connectivity issues here. Mainly joined b/c I’m interested in tidyomics for microbiome/metagenome and other complex data :slightly_smiling_face:

Michael Love (17:41:49) (in thread): > Awesome, happy to talk about ideas here anytime. For project planning we use GH Projects. A good way to post an idea, tag it to a repo, get feedback or collaboration

Michael Love (17:42:07) (in thread): > See the channel bookmarks

Chris Fields (17:52:21) (in thread): > Thanks! Looking these over now

Michael Love (18:01:17) (in thread): > I don’t do much micro meta but it’s always more fun to develop with partners so maybe someone else in the channel will jump on if you post about an idea

stefano mangiola (18:51:56): > Thanks everyone for the meeting today! We spanned 3 continents :) > The next meeting will be in 2 weeks. > > We encourage gender balance, the meetings are completely open and fun! - File (PNG): image.png

Michael Love (18:56:02): > Thanks Stefano for organizing – I can help organize one for the week of 3/24 perhaps?

Michael Love (18:57:50): > If anyone is interested in running a workshop this year let me know, would be great to promote and coordinate

Sisi Wang (19:42:21): > @Sisi Wang has joined the channel

2025-03-13

Mihai Todor (06:59:53): > @Mihai Todor has joined the channel

2025-03-17

stefano mangiola (02:36:48) (in thread): > Amazing, feel free to start communications!

stefano mangiola (02:38:50) (in thread): > I love Mike’s idea of getting a tidy stream in the three Bioc conferences. > > If we get at least 1,2,+ presenters per conference

Michael Love (08:29:24) (in thread): > as a reviewer, my advice is don’t put “tidyomics” in the title. sometimes we try to make sure there is distribution across projects, so it can count against you if all the titles and abstracts look similar. just put your own topic, e.g. “Analysis of large single cell datasets using tidySingleCellExperiment”

Michael Love (08:33:39): > I’m going to propose we try to get some of the European developers, what about noon in CET next week (if you have interest/availability, see poll below)? CC @Helena L. Crowell @Charlotte Soneson @Jacques SERIZAY. Agenda will be figuring out what projects/issues we should focus on, what workshops are planned in 2025 for each of the three major BioC conferences. > 1. Tue 3/25 > 2. Wed 3/26 > 3. Thur 3/27 - File (PNG): image.png

stefano mangiola (21:22:01) (in thread): > Mike, some people prefer to be included in the calendar invite, so I was putting together an email list to paste into the Zoom invite > > here the details > > William Hutchison <hutchison.w@wehi.edu.au>; Chen Zhan <chen.zhan@adelaide.edu.au>; michaelisaiahlove@gmail.com <michaelisaiahlove@gmail.com>; justin_landis@med.unc.edu <justin_landis@med.unc.edu>; juanhenao.sanchez@gmail.com <juanhenao.sanchez@gmail.com>; jono@jcarroll.com.au <jono@jcarroll.com.au>; > xiaotao.shen@outlook.com > lamamedhat333@gmail.com; >

2025-03-19

Jacques SERIZAY (05:37:53) (in thread): > noon CET would be much easier for me to attend :slightly_smiling_face: I’m free any of these days.

Michael Love (08:47:28) (in thread): > @Pierre-Paul Axisa would you be interested in attending / have any interest in tidyomics workshops/presentations this year?

Pierre-Paul Axisa (15:09:34) (in thread): > Yes! Though I have to look at the calendar. That would be during bioconductor conferences?

Pierre-Paul Axisa (15:12:57) (in thread): > I can attend on thursday

Michael Love (15:24:49) (in thread): > Invite sent, if you didn’t receive please let me know. I’ll send an agenda next week

2025-03-24

Michael Love (09:29:45): > Dear all, we will have a tidyomics developer/teacher meeting this Thursday March 27 with a Europe friendly meeting time (12:00 CET) > > Eager to hear from anyone who is planning any teaching/workshops/talks this year so we can coordinate promotion of events and sharing of materials as needed. Anyone is qualified to hold a workshop. If you plan to attend please give a :thumbsup: so we can gauge attendance. > > Zoom link: https://zoom.us/j/4133532783?pwd=VHl6dlNXMk5NYStCODN6S1IwaVliQT09 Time info: https://www.worldtimebuddy.com/?qm=1&lid=4460162,2158177,2078025,2950159&h=4460162&date=2025-3-27&sln=7-8&hf=0

stefano mangiola (16:59:03) (in thread): > I addedlamamedhat333@gmail.comto the email list

Lama Salem (18:11:38): > @Lama Salem has joined the channel

2025-03-25

Pierre-Paul Axisa (06:56:45): > Something I’m interested in is testing for enrichment between different genomic annotations in a continuous manner (rather than defining significant peaks), e.g. enrichment of epigenetic signal for GWAS loci. I’ve written a bare-bones package to do that based on ranks, reusing GSEA-type approaches: https://github.com/ImmuneAxisa/PSEA. I’m curious to discuss that on Thursday, especially comparing to things like bootRanges

Michael Love (08:24:56) (in thread): > Also think about comparing to LDSC which is a standard tool for this type of question

Michael Love (08:25:52) (in thread): > Btw you could use block bootstrap to shuffle a continuous signal, by chopping signals into blocks

Michael Love (08:26:12) (in thread): > If you wanted to preserve local behavior but shuffle globally
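
A minimal base-R sketch of the block-shuffling idea in this thread, under stated assumptions: the signal is a plain numeric vector, blocks have a fixed length, and `block_shuffle` is a hypothetical helper written for illustration (it is not part of bootRanges or any other package mentioned here). With `replace = FALSE` it permutes blocks; a block bootstrap in the usual sense would resample blocks with replacement (`replace = TRUE`).

```r
# Chop a signal into consecutive blocks and reorder the blocks.
# Values inside a block keep their original order (local behavior preserved);
# only the order of the blocks changes (global shuffle).
block_shuffle <- function(signal, block_size, replace = FALSE) {
  stopifnot(block_size >= 1, length(signal) >= 1)
  n <- length(signal)
  # Block index per position: 1,1,...,2,2,... (the last block may be shorter)
  block_id <- ceiling(seq_len(n) / block_size)
  blocks <- split(signal, block_id)
  # Draw block indices; replace = TRUE resamples blocks (bootstrap-style)
  new_order <- sample.int(length(blocks), replace = replace)
  unlist(blocks[new_order], use.names = FALSE)
}

set.seed(1)
signal <- cumsum(rnorm(20))   # toy autocorrelated signal, made up
shuffled <- block_shuffle(signal, block_size = 5)
```

Note that with `replace = TRUE` and a shorter final block, the output length can differ from the input length.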

Pierre-Paul Axisa (08:36:43) (in thread): > yes LDSC is on my TODO list ^^

Pierre-Paul Axisa (08:40:43) (in thread): > struggling a bit with the interface. Do you know of any implementations other than the original one in Python?

Michael Love (08:42:06) (in thread): > I don’t, but I could ask someone

Michael Love (10:36:19) (in thread): > added to event

Michael Love (10:36:56) (in thread): > is there a good open/free mailing list management where people could add/remove themselves?

2025-03-27

Jacqueline Murphy (05:42:27): > @Jacqueline Murphy has joined the channel

Michael Love (09:48:20) (in thread): > I have some developers on an email thread – Pierre Paul I can add you?

Pierre-Paul Axisa (09:49:04) (in thread): > yes! can you usepp.axisa@gmail.complease?

2025-04-01

stefano mangiola (08:20:48): > If someone would like to lead a tidyomics workshop at the Galaxy Bioconductor Community Conference (GBCC), e.g. about bulk RNA/singlecell/spatial multiomics, please submit today. > > I would have proposed it, but apparently, only one abstract per person can be submitted (@Maria Doyle?), and I believe the abstract form is now inaccessible to me. There will be plenty of tidyomics members who can support you at the event. > > Don’t feel you need a lot of experience.

Michael Love (08:48:38): > ~~~Justin and I are planning a long workshop to briefly demo tidyomics in general (tidySE, plyranges) and then get into some of plyxp’s unique aspects like group_by summarize.~~~~~~I think there is plenty of space for another workshop from tidyomics. E.g. if you have a particular domain / dataset of interest. I’m happy to help craft workshop material with anyone.~~~

stefano mangiola (08:52:13): > Amazing! The issue is that the deadline is today, hopefully we can find someone from the community. I have already submitted an abstract for cellNexus:disappointed:

Michael Love (08:52:38): > they extended to next week

Michael Love (08:52:58): > “we will be extending the deadline by one week due to requests received”

Michael Love (08:53:15): > Stefano you will be in person?:raised_hands:

Jenny Drnevich (10:08:06): > The abstract deadlines are for 10 min talks and posters only. We sent out a survey for the longer 45 min workshops for what people wanted to see and who was interested in teaching a workshop. The “deadline” was March 21, but it looks like it’s still open: https://forms.gle/i34Ag8SHMtBAmpPh7. We haven’t made any decisions yet, but there were already a few more than the number of slots. However, the submissions were heavily Galaxy-biased (this is how they do workshops) so Bioc ones might get more preference (can’t promise anything). - Attachment (Google Docs): Expression of interest for GBCC2025 training topics > GBCC2025 is coming up soon and we’d like to hear from you! This year, we’re doing things a little differently with our joint conference between Galaxy and Bioconductor. There will be 10 training sessions, each lasting 1.5 hours. > > Please fill out this form to: > Indicate the topics you are interested in learning about at GBCC this year. > Express your interest in leading a training session (optional). > Note: This year, there are no abstract submissions for package demos or workshops, only talks and posters. If you want to lead a training session, please use this form. If you’re interested in giving a talk or presenting a poster, please submit through the conference abstract submission page. > Fill out this form by Friday, March 21st, so that we can create a training schedule that better serves all of us in the Galaxy and Bioconductor community. > > Thank you for your input! > The GBCC2025 Organizing Committee

Jenny Drnevich (10:10:45): > Galaxy usually pushes “posters”, even those selected for a talk are encouraged to also do a poster. They often have people with laptops giving demos at the poster sessions. Just another avenue for getting your tidyomics stuff out…

Jenny Drnevich (10:12:58): > @Michael Love I don’t see Justin’s or your name on the survey for leading a training session…

Michael Love (10:32:31): > We were going to submit yesterday but then with the extension we are honing the submission

Michael Love (10:33:26): > I got tripped up with the registration bc that means I have to submit a complete expense request and wait for approval, but then i realized there is a way to not pay by card right now

Michael Love (10:34:33): > oh i see. well maybe we will go for a talk then

Michael Love (10:35:50): > this is what made me think abstracts would be considered for workshops - File (PNG): Screenshot 2025-04-01 at 10.35.37 AM.png

Michael Love (10:37:20): > i think we will just go for a talk this year

Jenny Drnevich (10:37:22): > That’s CSHL’s boilerplate abstract text. We haven’t had a lot of luck getting them to change their standard text, like including you don’t need to pay a down payment to register.

Jenny Drnevich (10:39:02): > [conference planning jointly between Galaxy, BioC and CSHL has been a lot more complicated this year as each one has a different standard way of doing things!]

Michael Love (10:39:58): > understand and thank you for organizing!

Jenny Drnevich (10:41:28) (in thread): > What page was this on?

Michael Love (10:42:33) (in thread): > scientific program

stefano mangiola (19:25:03) (in thread): > Yes! Looking forward to meeting you all!

2025-04-03

Kozo Nishida (11:17:59): > I’m planning to visualize data using tidyplots for each rowData group in SummarizedExperiment. > The x-axis represents each colData group, and the y-axis corresponds to the assays values. > In this case, do you think it’s best to convert SummarizedExperiment into a tibble using tidySummarizedExperiment? > I’d appreciate your thoughts, such as whether it might be better to use a different package that doesn’t require converting SummarizedExperiment into a tibble.

stefano mangiola (19:16:51) (in thread): > tidySummarizedExperiment can be piped directly into ggplot. If you are interested in specific features/genes, you can do > > se |> filter(…) |> ggplot(…)
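
For readers new to the idea, the sketch below shows in base R what a long, plot-ready layout looks like: one row per feature-sample pair with sample metadata attached. It does not call tidySummarizedExperiment; the matrix and metadata are made-up stand-ins for `assay(se)` and `colData(se)`, and the `.feature` / `.sample` column names are assumed here to mirror the ones used in the tibble abstraction.

```r
# Toy stand-ins for assay(se) and colData(se); all values are made up.
counts <- matrix(
  c(10, 0, 5, 3, 8, 1),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3"))
)
col_data <- data.frame(
  .sample = c("s1", "s2", "s3"),
  condition = c("ctrl", "ctrl", "treated")
)

# One row per feature-sample pair (matrices are stored column by column)
long <- data.frame(
  .feature = rep(rownames(counts), times = ncol(counts)),
  .sample = rep(colnames(counts), each = nrow(counts)),
  counts = as.vector(counts)
)

# Attach sample-level metadata to every row
long <- merge(long, col_data, by = ".sample")

# Keep one feature of interest, then summarise per colData group
gene_a <- long[long$.feature == "geneA", ]
mean_by_condition <- tapply(gene_a$counts, gene_a$condition, mean)
```

A table shaped like `long` is what plotting functions expect: the group column goes on the x-axis and the assay values on the y-axis.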

2025-04-04

Pierre-Paul Axisa (11:01:00): > I wonder if there is an interest in backing up this channel if the slack gets downgraded. Is the channel history useful?

Kozo Nishida (11:01:32) (in thread): > Thanks for the information! I confirmed that replacing the example ggplot with tidyplot also works well. That was exactly what I was looking for!

Michael Love (11:02:26): > certainly for posterity. is there a hack to download to text format?

Pierre-Paul Axisa (11:25:42): > I used in the past a little command line software that downloads it to json. I can try to find it

2025-04-11

stefano mangiola (00:59:23): > Congrats @Izabela Mamede on her first post to our tidyomics Blog. It will be published very soon! > > We have a few more PRs (blog post proposals) here https://github.com/tidyomics/tidyomicsBlog. Anyone interested in acting as a reviewer?

Izabela Mamede (12:21:19): > thanks Stefano!

2025-04-14

Saad Farooq (07:07:38): > @Saad Farooq has joined the channel

2025-04-25

stefano mangiola (03:20:27): > Should we propose a tidyomics challenges hackathon? - Attachment: Attachment > If you have any ideas for the collaboration fest / hackathon post-conference (GBCC25 joint Galaxy-Bioconductor conference in the US), please submit any and all ideas to https://forms.gle/RzrT75DpRh9Vb8Z16

Michael Love (09:25:36): > I’d love to see it — I couldn’t manage to include Co-Fest in my travel, but I’m also happy to find time during the meeting to hack with tidyomics folks!

Michael Love (09:26:33): > given we’re there for two half days and two full days, seems like plenty of time to meet up and plan 2025 development, tutorials, future workshops, etc.

2025-04-26

Carlos Mata-Machado (09:30:49): > @Carlos Mata-Machado has joined the channel

2025-04-29

Vince Carey (12:58:44): > has there been any attention to tidyomics for “annotation”? I am thinking of how one might simplify/unify processes related to querying/joining with org.Hs.eg.db, GO.db, etc.

Michael Love (13:02:16) (in thread): > Good point, to the extent that you can do select() |> as_tibble() we have a decent baseline > > But maybe there are more things we could do for usability

stefano mangiola (20:36:59) (in thread): > annotation can be easily embedded in a mutate command, I will paste the code here when in front of my laptop

stefano mangiola (21:52:53) (in thread): > Having said that, surely a harmonised framework would help

stefano mangiola (22:19:02) (in thread): > This is one example > > SE = SE |> > mutate(entrez = mapIds(org.Hs.eg.db, > keys = .feature, > keytype = "SYMBOL", > column = "ENTREZID", > multiVals = "first" > )) > > This function adds the entrez to the rowData and displays it in the tibble abstraction, with all other sample and gene metadata. > > Do we see the need to build a proper annotation routine? If so, how many functions should we aim to target?
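
A small base-R sketch of the first-match behaviour that `multiVals = "first"` asks for in the example above: each key gets at most one ID, so the result has one value per feature and fits in a single new column. The lookup table and its IDs are hypothetical and made up for illustration; this does not query org.Hs.eg.db.

```r
# Hypothetical symbol-to-ID table; "GENE2" maps to two IDs on purpose.
lookup <- data.frame(
  SYMBOL = c("GENE1", "GENE2", "GENE2", "GENE3"),
  ENTREZID = c("101", "202", "203", "304"),
  stringsAsFactors = FALSE
)

# Feature names to annotate; "GENE9" is absent from the table.
features <- c("GENE2", "GENE9", "GENE1")

# match() returns the position of the FIRST hit, or NA when there is none,
# so the result always has exactly one entry per feature.
first_id <- lookup$ENTREZID[match(features, lookup$SYMBOL)]
names(first_id) <- features
```

Keeping only the first hit silently drops the other IDs for multi-mapping symbols, which is worth bearing in mind when deciding how a dedicated annotation routine should behave.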

2025-04-30

Martin Morgan (01:11:41) (in thread): > The Organism.dplyr package presents annotation resources as tibbles for use in tidyverse https://bioconductor.org/packages/3.21/bioc/vignettes/Organism.dplyr/inst/doc/Organism.dplyr.html

Vince Carey (06:15:07) (in thread): > Great to be reminded of Organism.dplyr. I have been wondering about the special character of the org.****.****.db packages and underlying database structure. There is little visible motivation for changing them, their construction is complex and potentially fragile, and compelling use cases in analysis workflows could provide motivation and specific goals for adopting a new approach. Our use of SQLite for annotation tasks seems particularly inefficient in terms of storage consumed (thus bandwidth for content shipped to requesters). This is not directly connected to tidyomics principles but, given the energy in this subcommunity, I thought it would be good to start a discussion. The mutate example given above is very nice and could serve as a nice baseline for benchmarking alternatives. Another example is providing focus, perhaps selecting only those features mapped to a given GO category.

stefano mangiola (22:36:22) (in thread): > just a comment on sqlite, if complex relational structure is not needed, parquet is probably the way to go. fast and efficient

Haroon (23:42:00): > @Haroon has joined the channel