#bigdata-rep
2016-11-21
Marcel Ramos Pérez (15:23:08): > @Marcel Ramos Pérez has joined the channel
2016-11-26
Sean Davis (22:23:49): > @Sean Davis has joined the channel
2016-11-28
Marcel Ramos Pérez (13:08:09): > set the channel topic: Off-disk/on-disk representations in Bioconductor
Phil Chapman (13:13:56): > @Phil Chapman has joined the channel
Levi Waldron (13:13:56): > @Levi Waldron has joined the channel
Peter Hickey (13:13:56): > @Peter Hickey has joined the channel
Kasper D. Hansen (13:13:56): > @Kasper D. Hansen has joined the channel
Aedin Culhane (13:13:56): > @Aedin Culhane has joined the channel
Vince Carey (13:13:56): > @Vince Carey has joined the channel
Michael Lawrence (13:13:56): > @Michael Lawrence has joined the channel
Tim Triche (13:13:56): > @Tim Triche has joined the channel
Martin Morgan (13:13:57): > @Martin Morgan has joined the channel
Jack Zhu (13:13:57): > @Jack Zhu has joined the channel
Benjamin Haibe-Kains (13:13:57): > @Benjamin Haibe-Kains has joined the channel
Lucas Schiffer (13:13:57): > @Lucas Schiffer has joined the channel
Nitesh Turaga (13:48:28): > @Nitesh Turaga has joined the channel
2016-12-13
Marcel Ramos Pérez (15:54:49): > <!channel> Hervé is currently working on separating the HDF5 backend from the delayed array representation and allowing for additional backends to be implemented in the future.
Tim Triche (18:29:28): > ok
Tim Triche (18:29:34): > should I be bothering him then?
Marcel Ramos Pérez (18:30:15): > I have sent him the scripts you sent me for resolving the dimnames issue.
Marcel Ramos Pérez (18:31:57): > He said he will have a look at it. I think he wants to implement them using rhdf5's API. I can ask him how far he's gotten.
Tim Triche (18:34:12): > OK. I was thinking that it might be handy to have “adapters” for e.g. kallisto hdf5 files, other raw data types (e.g. Rle), but that requires some thought and/or (de facto) standardization
Marcel Ramos Pérez (18:35:48): > I will try inviting him to the slack team here.
2017-01-06
Sean Davis (11:20:31): > I’ve also invited Ted Habermann from the HDF5 group.
2017-01-18
Hervé Pagès (14:17:53): > @Hervé Pagès has joined the channel
Ted Habermann (14:18:09): > @Ted Habermann has joined the channel
2017-04-28
Andrew McDavid (16:50:42): > @Andrew McDavid has joined the channel
Marcel Ramos Pérez (16:55:00): > Peter Hickey: BOF proposal https://github.com/LTLA/beachmat/issues/1 - Attachment (GitHub): BOF application · Issue #1 · LTLA/beachmat > Assuming Davide is the BOF session leader, I’ll fill everyone else in, in the order of my memory: Additional collaborators – First Name, Last Name (Affiliation) Aaron Lun (CRUK Cambridge Institute…
Marcel Ramos Pérez (16:55:06): > @Marcel Ramos Pérez pinned a message to this channel.
Marcel Ramos Pérez (16:55:32): > Peter Hickey: Ideas for a single-cell container and a C++ API for all matrices https://github.com/LTLA/beachmat - Attachment (GitHub): LTLA/beachmat > beachmat - Ideas for a single-cell container and a C++ API for all matrices
Marcel Ramos Pérez (16:55:38): > @Marcel Ramos Pérez pinned a message to this channel.
Davide Risso (17:06:13): > @Davide Risso has joined the channel
Andrew McDavid (17:16:18): > So the thought was to talk again on say 5/12?
Peter Hickey (17:17:54): > yep at 9am pacific time
Stephanie Hicks (21:43:57): > @Stephanie Hicks has joined the channel
2017-04-30
Raphael Gottardo (02:25:38): > @Raphael Gottardo has joined the channel
Raphael Gottardo (02:26:02): > A quick benchmark run by Mike in my group: http://rpubs.com/wjiang2/271647
Raphael Gottardo (02:27:04): > Looks like we might be able to store single-cell matrices as simple matrix and save on space with HDF5’s built-in compression. I have asked him to look at a few more things, but at least it’s a start.
Peter Hickey (12:10:42): > interesting. so i guess 10x software produces the sparse representation by default? is that changeable by the user?
Peter Hickey (12:11:55): > also, is there a way to convert from a 10x-style .h5 to a compressed, simple matrix .h5 (without reading everything into R)?
Aaron Lun (16:52:20): > @Aaron Lun has joined the channel
Aaron Lun (16:52:30): > Ah, this is where everyone is.
Aaron Lun (16:53:24): > @Peter Hickey The cellranger software produces a file in MatrixMarket format
Aaron Lun (16:53:41): > Convertible to a dgCMatrix with readMM from Matrix
Aaron Lun (16:54:37): > I guess they just put the MM format into HDF5, in this case.
Aaron Lun (16:55:44): > Good to hear we can use a simple matrix in HDF5, which makes C++-level access easier
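A minimal sketch of that round trip, assuming the Matrix and HDF5Array packages and hypothetical file names (writeHDF5Array() argument names and sparse-seed support vary across HDF5Array versions):

    library(Matrix)
    library(HDF5Array)

    # Read the cellranger MatrixMarket output into a sparse in-memory matrix.
    counts <- readMM("filtered_gene_bc_matrices/matrix.mtx")   # dgTMatrix
    counts <- as(counts, "CsparseMatrix")                      # dgCMatrix

    # Realize it as a simple, compressed HDF5 dataset (written block by block).
    hcounts <- writeHDF5Array(counts, "counts.h5", "counts")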
Peter Hickey (17:52:26): > yay, another format
2017-05-01
Raphael Gottardo (02:58:58): > Yes, I think that if we can use a “simple” matrix format, then that would make everyone’s life easier as we wouldn’t have to write special code to decode on the fly or code that relies on sparse matrices, etc.
Raphael Gottardo (03:01:16): > Also, I wouldn’t worry too much about 10X’s format. If we have a better format, we could push that onto them. Also, 10X is only one of the platform people use, so it makes sense to think about a “standard” for storing single-cell genomics data.
Stephanie Hicks (07:57:39): > In terms of use cases, I asked Peter Kharchenko at the end of his talk this morning his opinion on the data format he uses to read/write/work with millions of cells and whether he chunks across genes vs cells. He made the case for chunking by cells when you are initially looking at the data, e.g. it may make sense to look at all genes at first and then decide which genes are most informative
Aaron Lun (10:36:13): > FYI, MatrixMarket was developed by NIST: http://math.nist.gov/MatrixMarket/formats.html
Aaron Lun (10:37:17): > so 10X didn’t entirely make it up
Aaron Lun (10:37:18): > We should be fine if we can get to/from formats easily.
Aaron Lun (10:40:23): > Should be easy to write high-performance converters in C/C++.
Aaron Lun (10:46:54): > Though the simpler the better, of course.
Kasper D. Hansen (11:11:11): > based on skimming the documentation, it doesn’t look like MatrixMarket supports random access. One should be able to build an access index based on the file, but it does not appear to be part of the approach detailed in the link
Aaron Lun (11:16:42): > It would have to be converted to a dgCMatrix, I think
Aaron Lun (11:17:01): > where completely random access would have a log(nrow) cost
Aaron Lun (11:17:32): > or just to a HDF5 format, as Raphael suggests.
Aaron Lun (11:18:49): > Choice would depend on how slow the disk I/O is compared to processor speed
Kasper D. Hansen (11:29:52): > It doesn’t seem like we should go down the in-memory route only. That means either (1) everything is on disk or (2) choice of disk or memory is user-dependent. Obviously (2) is more flexible, but in that case you probably need to write two versions of each function in C++ if you want direct access. We already have support for both at the R level, using DelayedArray, but that assumes you use R to access data. I certainly favour 2, at the expense of more work if you want C/C++ access.
Aaron Lun (11:33:30): > Actually, regarding (2); that’s exactly what the beachmat API was designed for
Aaron Lun (11:34:42): > C++ class polymorphism can support row/column/random access to a variety of matrix classes
Aaron Lun (11:36:45): > The aim is that if you can put it in a SE object, the API can read it.
Aaron Lun (11:37:00): > So package developers can abstract it all away
Kasper D. Hansen (11:53:57): > seems interesting. It is probably a good idea to talk to Herve about this, if you haven’t done so already.
Aedin Culhane (12:05:24): > BTW here is the link to Martin’s package to read the 10X data: https://github.com/mtmorgan/TENxGenomics - Attachment (GitHub): mtmorgan/TENxGenomics > TENxGenomics - Interface to 10x Genomics’ 1.3 m single cell data set
Hervé Pagès (16:27:03): > @Aedin Culhane Thx for the link. I was finally able to take a quick look at the 10X data. @Raphael Gottardo Thanks for running those benchmarks. I was kind of skeptical about the overall benefits of this on-top-of-hdf5 sparse matrix representation. Sparse data tends to compress well so I’m not too surprised that you get a file of reasonable size. Random access is much faster with the compressed matrix and, yes, this format makes life much easier. Anyway, it should be simple and fun to implement a DelayedArray “driver” for the 10X format (e.g. TENxMatrix class, will extend DelayedArray). I’ll add this to Martin’s TENxGenomics package in the next couple of days (hope that’s ok with you Martin). Once we have this, converting from 10X format to simple hdf5 compressed matrix will just be a matter of calling writeHDF5Array() on a TENxMatrix object (will process by block, so only 1 block at a time in memory).
2017-05-02
Raphael Gottardo (05:07:40): > @Hervé Pagès Thanks. We should sit down and chat about this with @Greg Finak and Mike so that we better understand what DelayedArray does
Hervé Pagès (13:38:19): > TENxMatrix added to TENxGenomics 0.0.16 (thx Martin for the merge). Another disadvantage of the 10X format is that it doesn’t seem possible to slice the matrix horizontally without actually loading the entire data, unless I missed something. Not that this should be a common use case but still. To convert to simple hdf5 compressed matrix: tenxmat <- TENxMatrix("1M_neurons_neuron20k.h5"); se <- SummarizedExperiment(tenxmat); saveHDF5SummarizedExperiment(se, verbose=TRUE).
Peter Hickey (13:41:09): > @Hervé Pagès: experimented with this this morning. seemed to have an issue with loading TENxMatrix("1M_neurons_filtered_gene_bc_matrices_h5.h5"). did you try with the full dataset? could be an issue with me mixing-and-matching a bunch of different pkgs at the moment, but could also be an integer/double issue with the full dataset. will try to reproduce
Hervé Pagès (13:45:37): > @Peter Hickey Yeah, I tried this too and my code is currently choking on those 64-bit integers. I’ll work on this today.
Peter Hickey (13:46:50): > thanks!
2017-05-03
Hervé Pagès (13:59:18): > @Peter Hickey Fixed. @Raphael Gottardo Sure. Let me know when you want to do this. FWIW I’ll be in my office at the Hutch tomorrow.
Peter Hickey (14:07:45): > thanks
Martin Morgan (21:59:58): > It would be interesting to see a more performance-oriented implementation of TENxGenomics:::as.matrix.TENxGenomics before taking Mike’s benchmark http://rpubs.com/wjiang2/271647 too seriously
2017-05-04
Andrew McDavid (12:59:23): > Some folks met at Ascona, I took some notes. Feel free to edit parts that I mischaracterized. https://docs.google.com/document/d/1IicivH30pDDtOOLIp5qlFWN5oRUbOF6dFlfX-4XLdJo/edit?usp=sharing @Stephanie Hicks
Andrew McDavid (12:59:34): > @Davis McCarthy
Davis McCarthy (12:59:37): > @Davis McCarthy has joined the channel
Aaron Lun (13:26:10): > @Andrew McDavid I wasn’t there, but may I add some comments?
Peter Hickey (13:26:48): > can anyone point me to a good guide on memory mapped files?
Peter Hickey (13:27:05): > at the level of concepts rather than implementation
Stephanie Hicks (15:57:58): > @Aaron Lun of course!
Aaron Lun (17:24:04): > Cool. I’ll put up some thoughts tomorrow.
2017-05-05
Aaron Lun (05:10:13): > @Hervé Pagès Is it safe to assume that HDF5Matrix objects are always fully realized?
Aaron Lun (05:12:24): > i.e., no pending operations or subsetting to be done.
2017-05-06
Hervé Pagès (15:20:35): > @Aaron Lun Yes. An HDF5Matrix/HDF5Array object is a DelayedMatrix/DelayedArray object in a pristine state i.e. no pending operations on it yet. This is checked by its validity method. As soon as you start operating on it (e.g. subsetting, transpose, or any other delayed operation), it becomes a DelayedMatrix/DelayedArray instance.
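A small illustration of that distinction (toy dimensions):

    library(HDF5Array)

    h <- as(matrix(runif(200), 20, 10), "HDF5Array")
    class(h)          # HDF5Matrix: pristine, fully realized on disk
    h2 <- t(h[1:5, ])
    class(h2)         # DelayedMatrix: carries pending (delayed) operations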
2017-05-07
Aaron Lun (06:29:03): > Cool, thanks.
Aaron Lun (09:53:14): > While we’re on this topic; what is the motivation for storing data from multiple HDF5Matrix instances as different data sets in the same file? It occurs to me that if two threads were to try to create a HDF5Matrix simultaneously, they could end up with the same auto_inc_ID. There might also be problems with concurrent writing on the HDF5 side, but I’m less sure about that.
Aaron Lun (09:54:08): > Would it be safer to perform a new tempfile call for every instance of a HDF5Matrix? Admittedly, there’s still a non-zero probability of a name clash with this strategy.
Aaron Lun (09:55:40): > I guess a separate file for each HDF5Matrix instance would also make it easier to copy the data associated with that instance, in case you wanted to transfer it somewhere.
2017-05-08
Hervé Pagès (02:39:48): > Good point. Note that the user has full control over where realization happens via setHDF5DumpFile() and setHDF5DumpName(). At the time I came up with the default scheme, I felt that having all the realizations happen in the same file was kind of convenient (then I can do lsHDF5DumpFile() to see the history of all realizations for the current session) but I have to admit that I didn’t think about the risk of clash in case of concurrent realizations. As you said, performing a new tempfile() call for each realization might still have a non-zero probability of clash. My feeling is that most applications should control where realization happens anyway, and not rely on the default scheme. I’ll add a note to the doc about this. Thx!
Aaron Lun (04:20:28): > Perhaps switching to a tempfile might be a good idea, just to provide some protection. On my system, this causes chaos:
Aaron Lun (04:21:29): > @Aaron Lun uploaded a file: Untitled - File (Plain Text): Untitled
Aaron Lun (04:21:59): > Most of the time it crashes with Can not create dataset. Object with name '/HDF5ArrayAUTO00001' already exists.
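A sketch of the kind of concurrent realization that triggers this, assuming BiocParallel: with the default dump scheme every worker inherits the same counter and tries to create the same auto-named dataset in the shared dump file.

    library(BiocParallel)
    library(HDF5Array)

    # Each worker realizes an in-memory matrix to the shared HDF5 dump file.
    res <- bplapply(1:4, function(i) {
        as(matrix(runif(1e4), 100, 100), "HDF5Array")
    })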
Aaron Lun (04:22:11): > But sometimes it gives me something like:
Aaron Lun (04:22:52): > @Aaron Lun uploaded a file: Untitled - File (Plain Text): Untitled
Aaron Lun (04:26:54): > I can imagine instances where realization needs to be performed in each thread separately, so this might not always be avoidable by the caller.
Aaron Lun (04:30:18): > tempfile provides a bit more protection here - and it seems like different names are tried in each child thread (http://stackoverflow.com/questions/5262332/parallel-processing-and-temporary-files) - Attachment (stackoverflow.com): Parallel processing and temporary files > I’m using the mclapply function in the multicore package to do parallel processing. It seems that all child processes started produce the same names for temporary files given by the tempfile functi…
Aaron Lun (04:32:28): > In this approach, I guess you could also get the history of realizations by doing a list.files on tempdir()…
Aaron Lun (04:38:18): > If pattern is set to something like hdf5array in tempfile, it should be unique enough.
Peter Hickey (08:07:50): > @Aaron Lun I ran into the exact same issue. I think it will be a common pattern when reading in data from multiple samples in parallel to create a SE, so it would be good to have a common solution. I experimented with a setHDF5DumpDir() function (cf. setHDF5DumpFile()) and using tempfile(). it’s basic and now outdated by changes in HDF5Array/DelayedArray: https://github.com/PeteHaitch/bsseq/blob/HDF5Array/R/hdf5-utils.R - Attachment (GitHub): PeteHaitch/bsseq > Devel repository for bsseq
Peter Hickey (08:10:18): > i didn’t investigate the performance of parallel reading from multiple files vs. parallel reading from a single file
Peter Hickey (08:11:12): > also didn’t investigate what, if any, options the HDF5 library provides for parallel/threaded ops
Peter Hickey (08:12:42): > the ‘downside’ of the tempfile() approach is you can end up with a lot of intermediate files, but saveHDF5SummarizedExperiment() at least provides an option for tidying it all up at the end (albeit by re-writing data)
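A sketch of that tidy-up step (directory name hypothetical): re-saving rewrites the assay data into a single self-contained directory that later sessions can reload.

    library(SummarizedExperiment)
    library(HDF5Array)

    # A toy SE; its assay data gets rewritten into one self-contained directory.
    se <- SummarizedExperiment(assays = list(counts = matrix(rpois(200, 5), 20, 10)))
    saveHDF5SummarizedExperiment(se, dir = "my_dataset_h5", replace = TRUE)
    se2 <- loadHDF5SummarizedExperiment("my_dataset_h5")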
Aaron Lun (08:22:46): > If intermediate files are a problem, some kind of invalidate function could be provided to delete the file corresponding to a HDF5Matrix instance (followed by a rm on the instance itself) for freeing up /tmp disk space on long-running sessions.
2017-05-10
Stephanie Hicks (09:58:26): > Hey everyone, sadly I’m going to have to miss the call Friday with the demos. Will someone be taking notes or will code for the demos be available afterwards?
Peter Hickey (14:30:32): > @Stephanie Hicks: happy to share my demo (when i write it!). i think there’s often a google doc for notes that can be shared
Peter Hickey (14:30:42): > @Aaron Lun See this Aaron? https://github.com/Bioconductor-mirror/rhdf5/commit/fa5b49c0162cfebd20d9c4f83164a9670993cbe9 - Attachment (GitHub): Merge branch ‘master’ into devel · Bioconductor-mirror/rhdf5@fa5b49c
Stephanie Hicks (14:34:53): > @Peter Hickey thanks!
Aaron Lun (16:02:41): > @Peter Hickey Hm… looks like it just exposes the C interface. I was hoping for the C++ API, as this makes life a fair bit easier.
Aaron Lun (16:04:21): > I’d imagine that a lot of package developers would also want the C++ interface, especially if they’re using Rcpp already.
Aaron Lun (16:06:37): > I probably should have mentioned that at the last hook-up.
Aaron Lun (16:12:51): > Also, I can’t say for sure, but it seems like rhdf5 only has a subset of the C source files that are present in the original API.
Aaron Lun (16:13:33): > Someone who links to the rhdf5 library and isn’t able to get a particular function is going to be surprised.
Aaron Lun (16:14:50): > It should be possible to just dump all of the HDF5 source files - C and C++ - into a subdirectory, like Rhtslib does. I’m looking at the files in the tarball I got from the HDF5 website; after compression, the source files alone are just 2.1 MB, so it’s definitely manageable.
Aaron Lun (16:16:32): > I don’t know whether Mike S. is on this channel, so I’ll just bump @Martin Morgan…
Peter Hickey (16:16:34): > i’ve not looked into this at all, hopefully Mike or someone more familiar with rhdf5 can give an overview of how/why rhdf5 is structured as it is
Aaron Lun (16:17:56): > I remember Mike J. saying something about C++ and HDF5; @Raphael Gottardo, do you know if he’s linking to the C++ API?
Marcel Ramos Pérez (16:20:20): > @Aaron Lun I don’t think Mike S. is on this slack team. I could add him with an email.
Aaron Lun (16:20:33): > @Marcel Ramos Pérez sounds like a good idea.
Davis McCarthy (16:22:55): > Hi folks - Vlad Kiselev (vk6@sanger.ac.uk; author of SC3) is interested in this topic. Can we add him to the group?
Marcel Ramos Pérez (16:25:49): > Certainly
Raphael Gottardo (20:57:44) (in thread): > @Aaron Lun Yes, he is currently looking at doing this for our prototype.
Raphael Gottardo (20:58:20): > Can someone also add Mike Jiang (wjiang2@fhcrc.org) to this group? Thanks.
Marcel Ramos Pérez (21:37:15): > Invited
Kasper D. Hansen (21:40:21): > Putting code into an R package makes the author essentially responsible for getting it to compile on all platforms using a wide variety of compilers. Not saying this is a reason, but it could be.
2017-05-11
Aaron Lun (04:04:22) (in thread): > Great. In that case, it seems like making the C++ API available in rhdf5 would be useful to other people as well.
Martin Morgan (04:07:49) (in thread): > I think it’s better to put the library code in a ‘pure’ library package Rhdf5lib rather than mixing it up with user-oriented code; rhdf5 would depend / import / link to Rhdf5lib / Rhdf5lib++, and would be free to create lots of useful user-facing functionality (maybe with dependencies on more narrowly useful packages) without being burdened with maintaining the library code
Mike Smith (04:08:03): > @Mike Smith has joined the channel
Martin Morgan (04:10:55): > I started a google doc for our conference call on Friday; it’s editable so feel free to add elements: https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing
Aaron Lun (04:11:42) (in thread): > That would also be good.
Mike Smith (04:31:11) (in thread): > @Peter Hickey Although I’m now listed as the rhdf5 maintainer, I’m afraid I’m not very familiar with its history, so I can’t really shed any light on why Bernd chose to include the parts that he did. My assumption was that he had cherry picked some ‘minimal working version’ to include in src/hdf5source/hdf5small.tgz that satisfied the features he was interested in. I’ve no idea if using the C++ interface was ever considered.
Mike Smith (04:34:59) (in thread): > This is essentially the conclusion I came to when thinking about this yesterday. We probably want to separate the R interface for reading, writing, etc hdf5 files from the C library and have a package that fulfils a similar role to Rhtslib or zlibbioc. At the moment rhdf5 sort of does both.
Mike Smith (04:43:35) (in thread): > @Kasper D. Hansen Although this is something the current structure of rhdf5 is also affected by. I had Brian Ripley contact me to fix some issues compiling on Solaris. I’ve also so far failed to figure out what tool-chain Bernd used to build the Windows dlls that are shipped with the package.
Peter Hickey (08:39:54) (in thread): > @Mike Smith: That was my guess for the history. Thanks for taking over rhdf5, Mike.
Mike Smith (12:19:03): > I’ve started an experimental ‘Rhdf5lib’ library at https://github.com/grimbough/Rhdf5lib I took the latest version of the hdf5 source tarball, and chopped it down to get rid of tests, fortran code, etc. At the moment it creates static libraries for both C and C++, and copies all the header files for both. You can access them in $R_LIBRARY_PATH/Rhdf5lib/lib and /include respectively. Absolutely zero support for Windows right now. Maybe not the most elegant solution in the world, but let me know if it looks like something worth pursuing. - Attachment (GitHub): grimbough/Rhdf5lib > Contribute to Rhdf5lib development by creating an account on GitHub.
Aaron Lun (12:32:13): > So far so good - it installs successfully on my machine…
Aaron Lun (12:41:46): > Should we be using #include "c++/H5cpp.h"? It’s hard to tell on my machine because HDF5 is already installed; I’ll try on the server.
Andrew McDavid (12:46:17): > I also will be unable to make the call tomorrow, but will check in at this bat channel early next week for updates
Aaron Lun (14:05:53): > Further on Rhdf5lib: the good news is that I can compile beachmat using the provided header files. The bad news is that I haven’t been able to link successfully; even after including /path/to/lib/libhdf5_cpp.a in the link line, running dyn.load on the resulting library gives undefined symbol: H5T_NATIVE_INT32_g. It’s probable that I’m doing it wrong, but my only other guess would be that maybe the static library is depending on something in the C API; I don’t see a H5T_NATIVE_INT32 definition in the HDF5 C++ source files.
Raphael Gottardo (15:08:40): > @Mike Jiang Will you be able to give a quick demo of what you’ve done tomorrow?
Mike Jiang (15:08:44): > @Mike Jiang has joined the channel
Raphael Gottardo (15:09:25) (in thread): > Perhaps add a short description to the Google Doc
Mike Jiang (15:52:44): > @Aaron Lun 10x uses Compressed sparse column format for H5 (see https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/advanced/h5_matrices)
Mike Jiang (15:54:44): > and uses Matrix Market for the text based version
Mike Jiang (15:55:45): > Regarding linking to rhdf5, I currently only use its C API, and YES, it would be great if it could expose the C++ header as well
Mike Smith (17:56:12) (in thread): > @Aaron Lun Taking a look at this in a bit more depth, it seems the C++ interface is an extension of the C one, rather than something totally separate. You’ll find a definition for H5T_NATIVE_INT32 in H5Tpublic.h. As such you’ll need to include both folders with the header files and link against both libhdf5.a and libhdf5_cpp.a. > > This probably explains why the c++ stuff is always in the subfolders in the hdf5 source, so maybe I should maintain that hierarchy in the R package? Happy to take suggestions. > > I’ve managed to compile and run a ‘hello world’ style thing linking against the libraries it generates, so I’ll try to put together something slightly more helpful in the morning.
Martin Morgan (18:00:12): > The working group will meet at http://huntercollege.adobeconnect.com/singlecell/ tomorrow 12pm Eastern; google doc with an agenda up-slack of here. See you then!
2017-05-12
Aaron Lun (04:17:17) (in thread): > @Mike Smith Sounds good; let me know how you got it to work! Currently, my attempt to link to both libhdf5.a and libhdf5_cpp.a results in a segfault upon dyn.load (if I put libhdf5_cpp.a before libhdf5.a in the link line) or undefined symbol: H5Tget_native_type (if I use the opposite order).
Vladimir Kiselev (04:36:48): > @Vladimir Kiselev has joined the channel
Wolfgang Huber (05:32:29): > @Wolfgang Huber has joined the channel
Wolfgang Huber (05:33:17): > Sorry I won’t be able to make the May 12 call.
Mike Smith (05:49:29) (in thread): > I have a feeling this may be related to the relative location of the C and C++ headers, and the fact I’ve changed them in Rhdf5lib compared to the standard hdf5 source. I’m not sure the C++ files know where to look. I’ve pushed a change to Rhdf5lib to put them back how I think it expects them to be, so try version 0.0.2 > > I’ve also put a crude demo package at https://github.com/grimbough/usingRhdf5lib so you can check out the arguments in my src/Makevars there. - Attachment (GitHub): grimbough/usingRhdf5lib > usingRhdf5lib - Testing linking against Rhdf5lib
Aaron Lun (06:33:41) (in thread): > Thanks @Mike Smith. Narrowed down the problem to something on my side; PredType and templates don’t play nice in the latest version of HDF5.
Mike Smith (06:56:48) (in thread): > On that note, I’ve just realised that the site I got my ‘latest version’ from (https://support.hdfgroup.org/HDF5/release/obtainsrc518.html) is not the latest version at all! I don’t know how much has changed between them, but I’ll replace it with hdf5 v 1.10.1 later today. - Attachment (support.hdfgroup.org): HDF5 Source Code > The HDF Group is a not-for-profit corporation with the mission of sustaining the HDF technologies and supporting HDF user communities worldwide with production-quality software and services.
Aaron Lun (08:45:26) (in thread): > Success - had to move PredType out of the template parameter, because it wasn’t a const object in the newer versions of HDF5, but otherwise everything works. Probably.
Aaron Lun (08:46:20) (in thread): > Could you put together a pkgconfig function in Rhdf5lib that I can put in my Makevars?
Mike Smith (10:21:41) (in thread): > Yep, that’s on my todo list. Just documenting the steps to get from the original hdf5 tarball to the trimmed version in the package, so I can do it again in the future. I’ll add a pkgconfig function after that’s done.
Mike Smith (11:37:09) (in thread): > I’ve updated the bundled version of hdf5 to 1.10.1 and added a pkgconfig function. It only works on Linux at the moment, but it seems to do the job in my usingRhdf5lib example package.
Shweta Gopal (11:57:25): > @Shweta Gopal has joined the channel
Peter Hickey (12:04:22): > fyi chatting now at http://huntercollege.adobeconnect.com/singlecell/
Aaron Lun (14:22:28) (in thread): > Did you commit the pkgconfig source file? I’m not seeing an R subdirectory on the repo.
Aaron Lun (14:23:27) (in thread): > Also, on my system -L/path/to/Rhdf5lib/lib -lhdf5_cpp -lhdf5 causes segfaults; I have to instead use /path/to/Rhdf5lib/libhdf5_cpp.a /path/to/Rhdf5lib/libhdf5.a on the link line.
Aaron Lun (14:24:05) (in thread): > I suspect this is because I have HDF5 system libraries installed, and the linker is picking the wrong ones to link against. Or something like that.
Aaron Lun (14:32:54): > @Davide Risso If we’re splitting up the package, would you like to own the SingleCellExperiment repository? I’m already a maintainer for plenty of BioC packages, so I thought I might spread the fun.
Aaron Lun (14:34:49): > Incidentally - it took a lot of work, but I reverse-engineered “beachmat” to be a reasonable acronym: “compiling Bioconductor to handle Each Matrix”.
Peter Hickey (14:38:59): > if the JABBA (Just Another Bogus Bioinformatics Acronym) awards still existed, Aaron, I might have to nominate you (http://www.acgt.me/blog?tag=jabba) - Attachment (ACGT): ACGT blog > ACGT: a blog that discusses various issues relating to genomics and bioinformatics by Keith Bradnam
Davide Risso (15:47:09) (in thread): > @Aaron Lun sure!
2017-05-13
Aaron Lun (07:10:30) (in thread): > Okay, I’ve cleaned it up and requested a transfer to you. It passes CHECK and BiocCheck but needs a vignette.
Greg Finak (12:15:18): > @Greg Finak has joined the channel
2017-05-15
Aaron Lun (12:53:12) (in thread): > It is done. beachmat now only contains the C++ API, while the new SingleCellExperiment package (https://github.com/drisso/SingleCellExperiment) contains the definition for the SingleCellExperiment class (duh). - Attachment (GitHub): drisso/SingleCellExperiment > SingleCellExperiment - S4 classes for single cell experiment data
2017-05-17
Mike Smith (10:45:45) (in thread): > Sorry, I forgot to commit the R directory and hence the code. > > I think I’ve managed to get a Windows library compiled from the same source, so I’m going to do a bit more testing with that and then we can finalise the paths so it links to the version we want. @Mike Jiang reported/requested something similar in the Github commit thread
2017-05-18
Aaron Lun (12:06:28) (in thread): > Sweet, thanks.
Hervé Pagès (12:26:31): > An update on HDF5Array: 1) by default now each automatic HDF5 dataset is written to its own file so bplapply(1:3, function(i) as(matrix(i, 1, 1), "HDF5Array")) works. 2) also added showHDF5DumpLog() to display the log of all the HDF5 datasets created in the current session.
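The same check written out, assuming BiocParallel:

    library(BiocParallel)
    library(HDF5Array)

    # Each automatic dataset now goes to its own file, so this no longer clashes.
    res <- bplapply(1:3, function(i) as(matrix(i, 1, 1), "HDF5Array"))

    # Log of every HDF5 dataset written during the current session.
    showHDF5DumpLog()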
Aaron Lun (12:46:16): > beachmat has some functionality where it creates a HDF5Matrix object in C++ for storing matrix output. Currently I’ve been using tempfile to get the dump file name for the new object; can I now switch to getHDF5DumpName?
Hervé Pagès (13:07:31): > getHDF5DumpFile and getHDF5DumpName are for the end user i.e. they don’t increment the global counters used internally to generate the automatic file and dataset names. In your code, you need to use HDF5Array:::get_dump_file_for_use and HDF5Array:::get_dump_name_for_use (they do increment the global counters). Sorry, these things are not exported for now (I’m not quite ready to commit to the current design yet).
Kasper D. Hansen (13:10:42): > Herve: not sure if you do it yet, but it would be good to think about package settings for this. I do agree that sometimes you want one file per sample and sometimes multiple and that it is dataset dependent. But I postulate that it may be application area dependent.
Kasper D. Hansen (13:11:36): > For some of our usage we have WGBS which has few (<100 typically) samples, but each sample has 28M numbers. Compared to the methylation arrays where we have 100-1000s of samples, but each sample is much smaller
Kasper D. Hansen (13:12:09): > So what I am trying to say is that I would like control over this as a package writer, and I am (sometimes) confident to make choices on behalf of my users
Hervé Pagès (13:25:15): > You still have full control over the location of automatically created HDF5 datasets (via setHDF5DumpFile and setHDF5DumpName). I didn’t change that. What I changed is that now by default each automatic dataset goes to its own file. But if you do setHDF5DumpFile("path/to/my.h5"), then from now on, everything is going to that file (so parallel realization will break). If you do setHDF5DumpFile("path/to/some/dir/") (trailing slash is important), then from now on it’s going to be 1 file per dataset again but the files will be created in path/to/some/dir/. So in your application, you should be able to control exactly where automatic datasets are created.
Peter Hickey (14:11:47): > Having a setHDF5DumpDir("dir") as an alias for or instead of setHDF5DumpFile("path/to/some/dir/") makes sense to me. What do you think, @Hervé Pagès?
Hervé Pagès (14:17:43): > Agreed. The current situation is a little lame. Didn’t even document that “feature” because I was considering adding setHDF5DumpDir. Didn’t do it yet because I wanted to take the time to think about the interaction between setHDF5DumpDir and setHDF5DumpFile.
Peter Hickey (14:20:17): > 1 step ahead as ever:+1:
Hervé Pagès (16:32:25): > @Aaron Lun Just added a for.use arg to getHDF5DumpFile() and getHDF5DumpName() to let the caller specify whether or not he intends to use the returned file and dataset names (default is FALSE). This replaces HDF5Array:::get_dump_file_for_use() and HDF5Array:::get_dump_name_for_use().
Hervé Pagès (17:47:39): > Also, it would be good if you called HDF5Array:::append_dataset_creation_to_dump_log(...) in beachmat when you create an HDF5Matrix object. It should be called right after creating the dataset on disk with h5createDataset(). Sorry, this is still very new and not documented yet.
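A sketch of that sequence for a manually created dataset (dimensions and chunking arbitrary; the log helper was unexported and its exact arguments undocumented at this point):

    library(rhdf5)
    library(HDF5Array)

    f <- getHDF5DumpFile(for.use = TRUE)   # increments the internal counters
    d <- getHDF5DumpName(for.use = TRUE)

    if (!file.exists(f)) h5createFile(f)
    h5createDataset(f, d, dims = c(10000, 100), chunk = c(10000, 1))
    # ...write blocks with h5write(), then record the dataset in the dump log
    # (later exported as appendDatasetCreationToHDF5DumpLog, see below).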
John Readey (20:44:58): > @John Readey has joined the channel
2017-05-19
Aaron Lun (05:46:46): > @Hervé Pagès Thanks for the info. I’ve switched to getHDF5DumpFile/Name, and things seem to work. However, it seems I can’t access unexported functions via Rcpp, so I can’t call append_dataset_creation_to_dump_log.
Aaron Lun (06:07:34): > On a completely different note; perhaps someone might explain to me why the following code doesn’t work:
Aaron Lun (06:10:47): > @Aaron Lun uploaded a file: Untitled and commented: This gives me HDF5. Invalid arguments to routine. Bad value. I would have thought I would be allowed to truncate a closed file, especially given that H5F_ACC_TRUNC is one of the “possible arguments” in h5const("H5F_ACC"). - File (Plain Text): Untitled
Martin Morgan (09:16:17) (in thread): > @Aaron Lun That’s probably not correct – you’ll ‘just’ need to formulate the call HDF5Array:::foo() in Rcpp, maybe via the equivalent of get("foo", envir=getNamespace("HDF5Array")) though that could be tedious; an alternative and better solution (keeping the inner workings more publicly accessible) would be to provide an R-level function in your package that wraps the ::: call; then your Rcpp code would invoke your own R function.
Peter Hickey (11:34:21): > for those interested in such things, another R interface to HDF5: https://github.com/Novartis/hdf5r - Attachment (GitHub): Novartis/hdf5r > Contribute to hdf5r development by creating an account on GitHub.
Hervé Pagès (13:25:35) (in thread): > I guess I should export this anyway. Now that the user can do getHDF5DumpFile(for.use=TRUE) and getHDF5DumpName(for.use=TRUE), and these things are documented, I should also export and document append_dataset_creation_to_dump_log, especially since I would really like to see all automatic datasets show up in the dump log.
Aaron Lun (13:30:43) (in thread): > Yes, that’s what I was thinking as well.
Sean Davis (15:40:03): > Thanks, @John Readey, for joining us.
Hervé Pagès (20:10:21) (in thread): > This is now exported (and documented) as appendDatasetCreationToHDF5DumpLog in HDF5Array 1.5.6.
Hervé Pagès (20:46:05): > @Peter Hickey Just added and documented setHDF5DumpDir and getHDF5DumpDir to HDF5Array 1.5.6
2017-05-20
Sean Davis (09:02:44): > @John Readey, just as background, the dataset of interest here is generated by a company called TenX genomics. The specific largish dataset is here: https://s3-us-west-2.amazonaws.com/10x.files/samples/cell/1M_neurons/1M_neurons_filtered_gene_bc_matrices_h5.h5
Sean Davis (09:04:32): > Following up on your email to me, yes, having the hdf5 server read directly from s3 would be both convenient and would probably have some nice performance benefits, particularly for frequently-accessed or concurrently-accessed datasets.
Sean Davis (09:08:29): > And this README gives a little sense of the types of operations that folks might want to perform. https://github.com/mtmorgan/TENxGenomics - Attachment (GitHub): mtmorgan/TENxGenomics > TENxGenomics - Interface to 10x Genomics’ 1.3 m single cell data set
Aaron Lun (10:02:37) (in thread): > Thanks @Hervé Pagès. Works like a charm.
John Readey (14:08:56): > Thx for the background @Sean Davis. Having the data on S3 (rather than say EBS) has some advantages:
John Readey (14:09:02): > - Data is automatically replicated
John Readey (14:09:21): > - Cost is low
John Readey (14:10:56): > - Data can be accessed by multiple instances (EBS volumes can only be attached to one machine at a time)
John Readey (14:12:23): > AWS recently came out with an NFS-like service (AWS EFS), but it costs 10x what S3 does
John Readey (14:14:00): > I’ll try loading up the 1M_neurons dataset onto the HDF Server
Sean Davis (16:24:47): > @Vince Carey has been working on an HDF5 Server-backed class, here: https://github.com/vjcitn/restfulSE - Attachment (GitHub): vjcitn/restfulSE > restfulSE - demonstration of SummarizedExperiment with assay component retrieved from remote HDF5 server
Sean Davis (16:26:10): > The SE part of the name is related to the class described here: https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html
Sean Davis (17:31:42): > Oh, and for the Bioc folks, the hdf5 server @John Readey is developing is here:
Sean Davis (17:31:52): > https://github.com/HDFGroup/h5serv - Attachment (GitHub): HDFGroup/h5serv > h5serv - Reference service implementation of the HDF5 REST API
2017-05-23
Mike Smith (08:20:41): > I’ve made some updates to Rhdf5lib, mostly related to including precompiled windows libraries. This was less straightforward than I’d hoped as MinGW is no longer supported for compiling HDF5, but I think it’s working now. I ended up reverting back to HDF5 version 1.8.18 as I was unable to compile 1.10.1 in a way that didn’t error as soon as I ran it. This seemed to be related to their SWMR feature and file locking, which was introduced in 1.10. If anyone feels those features are important or crucial to them let me know and we can try to get it running successfully. The package also ships with a version of libdl which isn’t included in Rtools, but seems to be required for compilation. > > I also updated pkgconfig() to return absolute paths to the compiled libraries on Linux and Mac, to try and avoid using system versions instead. Let me know how badly it breaks for you.
Unknown User (08:56:22): > @Mike Smith commented on @Aaron Lun’s file https://community-bioc.slack.com/files/U34P8RS3B/F5FJW8A77/-.txt: I’ll admit the man page is a bit ambiguous there, but I think for H5Fopen() the “possible arguments” are actually found by h5const("H5F_ACC_RD") > > And in the C source for H5Fopen it explicitly states it will reject "H5F_ACC_TRUNC": > > > /* Reject undefined flags (~H5F_ACC_PUBLIC_FLAGS) and the H5F_ACC_TRUNC & H5F_ACC_EXCL flags */ > if((flags & ~H5F_ACC_PUBLIC_FLAGS) || > (flags & H5F_ACC_TRUNC) || (flags & H5F_ACC_EXCL)) > HGOTO_ERROR(H5E_ARGS, H5E_BADVALUE, FAIL, "invalid file open flags") > - File (Plain Text): Untitled
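In other words, a minimal sketch of the workaround: creation/truncation goes through h5createFile()/H5Fcreate(), while H5Fopen() only takes the read/write access flags.

    library(rhdf5)

    h5createFile("new.h5")                            # creation handled here
    fid <- H5Fopen("new.h5", flags = "H5F_ACC_RDWR")  # then open read/write
    H5Fclose(fid)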
Martin Morgan (10:42:57): > @Mike Smith out of curiosity I was wondering what compiler(s) HDF5 does support on Windows?
Mike Smith (12:44:50): > Principally they list Visual Studio 2012 through to 2015, and also gcc via Cygwin. They also say using CMake is mandatory. You can see a list of tested / supported platforms in https://support.hdfgroup.org/ftp/HDF5/releases/ReleaseFiles/hdf5-1.8.18-RELEASE.txt My version is built with configure, MSYS2 and MinGW, so we’ll see if it has any immediately obvious flaws.
Aaron Lun (13:47:27) (in thread): > @Mike Smith Thanks for that. I’ve tested the latest version of Rhdf5lib with pkgconfig("PKG_CXX_LIBS") and I can confirm that it works fine with all of beachmat’s functionality.
Mike Jiang (19:44:15) (in thread): > It worked beautifully with the ncdfFlow package on both linux and windows. When could Rhdf5lib make it to Bioc devel?
Hervé Pagès (21:24:48): > It would probably make sense to factor out the code from Vince’s restfulSE package in charge of talking with the HDF5 server, and put it in its own package. Something that would become a more general purpose HDF5 client for R. The API could be designed to mimic that of rhdf5. I believe this is the approach John took with h5pyd i.e. to mimic h5py’s API. Are there plans to do something like this?
2017-05-24
Hervé Pagès (02:33:43) (in thread): > I’m also eager to see Rhdf5lib make its way to BioC devel.
Aaron Lun (05:07:46) (in thread): > Agreed.
Mike Smith (05:34:53) (in thread): > Thanks for the positive feedback, I’m glad to hear it seems to be working well. > > Unfortunately I’m on holiday for 10 days from tomorrow morning, in a not particularly compute friendly environment. I’ll see if I can get some documentation done this afternoon so it is passing R CMD check and submit a contribution issue to get the ball rolling. > > If anyone wants to be added as a contributor on the github repo so they can be more responsive than I will be over the next week then I’m happy to do that.
Vince Carey (07:35:15): > i would agree that more work should be done on streamlining interaction between R/rhdf5 and the server. restfulSE is very preliminary and i invite variations on its approach to communicating with the server
Vince Carey (07:36:24): > @John Readey are there prospects for server-side statistical computing? it would be very useful to be able to get column sums and row sums from a matrix on the server
Peter Hickey (08:21:45): > server-side computing would be very cool
Sean Davis (08:30:30): > For something like restfulSE, why not implement methods for colsums, rowsums, etc., that read from a precomputed hdf5 column? It seems that would be a good first step. In a sense, the SummarizedExperiment interface could be mirrored for an RestfulHdfSummarizedExperiment, capitalizing on precomputation where possible. A “loader” for the hdf5-backed SummarizedExperiment would create the necessary data objects at upload time–or they could be created by hand–to support the RestfulHdfSummarizedExperiment interface.
Sean Davis (08:52:48): > Of course, if@John Readeyis up for building server-side methods, that would be a really nice alternative if one can come up with the right interface/API.
John Readey (13:40:38): > Hey @Vince Carey, one possible optimization for rhdf5 is to cache data on the client side. E.g. if you’ve already gotten the dimensions for a dataset and it’s not extensible, you know it’s not going to change. Or if you suspect the source data is not changing, you can get all the links for a group in one go and just keep those around. You have to be a bit careful that your cache doesn’t blow up, but generally it’s pretty safe if you are just talking about metadata. I did something like this with h5pyd recently - you’ll note a “use_cache” hint in the File constructor: https://github.com/HDFGroup/h5pyd/blob/develop/h5pyd/_hl/files.py. - Attachment (GitHub): HDFGroup/h5pyd > h5pyd - h5py distributed - Python client library for HDF Rest API
John Readey (13:46:26): > The server side methods would be an interesting extension to the REST API. I’m thinking of something analogous to the SQL methods like SUM or AVG functions. The client would provide a selection region and operator and the server would do the math and return the result.
John Readey (13:46:37): > Is there a list somewhere of useful methods?
Vince Carey (14:20:28): > @John Readey – Thanks for these comments. I am wondering whether our approach to retrieval from the server can be improved – we receive JSON as a binary stream, run readBin on that, then fromJSON on the result. rhdf5 plays no role. I have wondered whether the server could produce something that would more immediately transform to R data via rhdf5. Metadata seems to come pretty cheaply, but the acquisition of the numerical payload from the select operation via fromJSON might be susceptible of substantial improvement.
Vince Carey (14:23:16): > Apropos server-side operations, a real concern of mine is that we may want to verify that the data on the server are as we expect. It seemed to me that hashes of rowsums and columnsums could be known to the client and checked in real time when querying.
John Readey (16:07:39): > @Vince Carey - the HDF REST API supports reading/writing dataset data as binary - i.e. no binary to JSON to binary translation needed. In my preliminary performance testing, using binary was up to 10x faster than reading JSON. No binary protocol for reading metadata, but typically these are much smaller than dataset requests.
John Readey (16:19:10): > Some instructions for using binary transfer are here: http://h5serv.readthedocs.io/en/latest/DatasetOps/GET_Value.html.
Martin Morgan (18:21:25): > The next call is Friday May 26 at 12pm Eastern; please add agenda items to the google doc https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing
Hervé Pagès (18:56:42): > I agree that caching is an important feature of an HDF5 client for R (rhdf5d?), possibly implemented via the new BiocFileCache package so the cached data persists across sessions (an interesting new use case for BiocFileCache). A good start for the server-side methods are the first 5 methods from the Summary group in R (max, min, range, prod, sum, see ?methods::Summary in an R session) plus their row*/col* analogs (e.g. rowSums, colSums). The matrixStats package on CRAN supports a much more extensive set of summarization methods. The ability to restrict the operation to a selected region would be great.
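For comparison, the client-side equivalents already work block-wise on a DelayedMatrix, with delayed subsetting restricting the region before summarization (file and dataset names hypothetical):

    library(HDF5Array)

    M <- HDF5Array("counts.h5", "counts")   # HDF5-backed DelayedMatrix
    colSums(M[, 1:100])                     # summarize a selected region only
    max(M[1:500, ])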
2017-05-25
Martin Morgan (22:13:20): > set up a reminder on “conference call May 26 at 12pm Eastern” in this channel at 9AM tomorrow, Eastern Daylight Time.
Raphael Gottardo (23:28:35): > I won’t be able to join tomorrow due to a conflict. @Mike Jiang is also out of the office tomorrow.
2017-05-26
USLACKBOT (09:00:00): > Reminder: conference call May 26 at 12pm Eastern.
Martin Morgan (12:01:22): > The meeting is at http://huntercollege.adobeconnect.com/singlecell/
Aaron Lun (12:04:23): > Will be there in a second; just waiting for the mac to boot up
Stephanie Hicks (13:04:46): > @Davide Risso @Aaron Lun @Peter Hickey I apologize for missing the meeting two weeks ago, but did you discuss whether SingleCellExperiment was meant for only expression data or whether it can work with scRNA-seq/scBS-Seq data too? did we ever decide on a data set to play around with?
Peter Hickey (13:06:02) (in thread): > @Stephanie Hicks no, don’t think we’ve discussed that. off top of my head, spike-ins may be unique to rna-seq and we have dedicated stuff for that in SingleCellExperiment right now I think
Stephanie Hicks (13:20:18) (in thread): > Ah, ok. @Davis McCarthy, @Andrew McDavid, @Vladimir Kiselev, Tallulah and I were discussing this at a workshop a few weeks back (whether SingleCellExperiment should be specific to expression data or general enough for any single cell data or multiple types of data). It was clear from the talks at the workshop that one of the big interests/challenges was jointly analyzing multiple types of data measured on the same cells. So, we were trying to think about how we would handle that type of data with the proposed SingleCellExperiment class. The concern was if we make the class specific for only expression, then how would we want to handle other types of data also measured on the same cells. Making it as general as possible & extending the class for different settings seemed to be the consensus. But I also understand the argument for making it only for expression since that is so popular.
Peter Hickey (13:23:08) (in thread): > with papers like http://www.biorxiv.org/content/early/2017/05/17/138685 already out, i agree that having a container for these multi-assay single-cell data will be an important consideration. The MultiAssayExperiment (https://bioconductor.org/packages/release/bioc/html/MultiAssayExperiment.html) may work for wrapping up a bunch of SingleCellExperiment (or similar) - Attachment (Bioconductor): MultiAssayExperiment > Develop an integrative environment where multiple assays are managed and preprocessed for genomic data analysis.
Aaron Lun (13:53:30) (in thread): > I haven’t thought about that. Spike-ins are probably expression-only, but given that RNA-seq is currently the bread and butter of single-cell analyses, it’s worth keeping it in the base class, especially as there’s no extra overhead in terms of slots. The other data structures should be generally applicable.
Peter Hickey (13:58:31) (in thread): > for bisulfite you typically have an unmethylated spike-in (lambda phage genome, typically) to estimate bisulfite conversion efficiency. I don’t always include that in my SummarizedExperiment-derived object (if I do it’s kind of tacked on via its own seqlevel and genome in the seqinfo slot). So this could be in the spike-in slot provided that slot allowed for [I|G]Ranges-based objects and not just ‘name’-indexed objects
Peter Hickey (13:59:01) (in thread): > For scATAC-seq there’s no obvious spike-in AFAIK so this slot would be empty
Aaron Lun (14:04:57) (in thread): > Note that there is no explicit “spike-in slot”; there’s just internal rowData, one of the fields of which is a logical vector indicating whether the corresponding row of the SE object corresponds to a spike-in feature. For BS data (lol), if you can annotate each row with a genomic location in a GRanges object, then you can simply use isSpike<- to indicate which rows are spike-ins.
Peter Hickey (14:06:05) (in thread): > sounds like it’ll work perfectly. awesome
Peter Hickey (14:06:30) (in thread): > (calling my software that simulated bisulfite-sequencing data BSBS is still one of my crowning achievements)
Aaron Lun (15:12:58): > There seem to be some performance considerations with rhdf5 with respect to compression and chunking. Currently, h5createDataset sets the chunk dimensions to the data set dimensions, i.e, the entire data set is a single chunk. This is fine when reading and writing entire matrices, but it might make taking slices rather expensive, as the HDF5 library needs to operate on each chunk at once (see https://support.hdfgroup.org/HDF5/Tutor/layout.html#lo-define). Ideally we would know what access patterns to expect and chunk accordingly; but this is not possible for general applications, where both gene/row and cell/column access are desirable. I guess we could just make square chunks that are “small but not too small”, e.g., 50 x 50, which should be sufficiently granular. Any ideas? - Attachment (support.hdfgroup.org): The HDF Group - Information, Support, and Software > The HDF Group is a not-for-profit corporation with the mission of sustaining the HDF technologies and supporting HDF user communities worldwide with production-quality software and services.
Aaron Lun (15:17:04): > Possibly we could have a HDF5Array function that sets global variables to control the chunking if the user knows that they’ll be doing mainly row or column accesses; otherwise the defaults could be the squares.
Peter Hickey (15:22:51): > @Aaron Lun HDF5Array() is a function in the HDF5Array pkg, although not doing what you’re talking about
Peter Hickey (15:22:56): > check out how it chunks data too
Aaron Lun (15:27:28) (in thread): > @Peter Hickey Yes, I did actually mean a function in the HDF5Array package, so that every call to h5createDataset during realization will use the user-defined chunk sizes. I see a HDF5Array:::.chunk_as_hypercube that sets the chunk sizes to something like 1000000^0.5 (for matrices) - is that the “least suboptimal” for general access?
Aaron Lun (15:36:43) (in thread): > Having dug into it further, I see .chunk_as_subblock as well. I guess what I’m looking for is a more user-visible way to get and set the chunk sizes, much like how @Hervé Pagès exposed getHDF5DumpFileName, etc. For example, if you set the chunk dimensions to something like c(1, NA), it means that each chunk consists of 1 row and all columns when calling h5createDataset2.
Hervé Pagès (16:04:06): > @Aaron Lun @Peter Hickey Choosing “good” chunk dimensions is indeed challenging since it really depends on how the dataset will be accessed, which is hard to know at write-time. I don’t use .chunk_as_hypercube() in HDF5Array to automatically choose the chunk dimensions, this is only something I played with at some point to make some comparisons. I use .chunk_as_subblock() which chooses the chunk dimensions in a way that seems to play well with the block-processing mechanism used during on-disk realization and by other operations that are not delayed (like the summarization methods i.e. max, mean, sum, rowSums, colSums, etc…). Ultimately the user who knows what s/he’s doing should have a way to control the chunk dimensions so I’ll try to come up with something for this in the HDF5Array package.
John Readey (16:36:11): > h5py has some code that can create a reasonable chunk layout if the user doesn’t supply one. See guess_chunk() in this file: https://github.com/h5py/h5py/blob/master/h5py/_hl/filters.py - Attachment (GitHub): h5py/h5py > HDF5 for Python – The h5py package is a Pythonic interface to the HDF5 binary data format.
Hervé Pagès (19:24:23): > In the mean time, I exported and documented .chunk_as_subblock as getHDF5DumpChunkDim (see ?getHDF5DumpChunkDim). Also added the chunk_dim arg to writeHDF5Array to let the user control the dimensions of the chunks (see ?writeHDF5Array).
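A quick sketch of the two new controls (argument spellings as described here; they changed in later HDF5Array versions):

    library(HDF5Array)

    # Inspect the automatic chunk geometry chosen for a 19000 x 3000 dataset.
    getHDF5DumpChunkDim(c(19000L, 3000L))

    # Override it explicitly when writing.
    x <- matrix(runif(1000 * 500), nrow = 1000)
    writeHDF5Array(x, "counts.h5", "counts", chunk_dim = c(100, 100))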
2017-05-27
Aaron Lun (06:52:59) (in thread): > Thanks @Hervé Pagès. It is quite timely that we’re discussing this, as I’m currently running into access issues on the Zeisel brain data set (19000 genes x 3000 cells). As it is currently implemented, getHDF5DumpChunkDim seems to prefer column-wise chunks; c(7500, 1) for this data set. This means that column access is fast (1.2 seconds to compute column sums) but row access is painfully slow (no timing available, I just stopped it after a while). If I may, I’d like to make some requests for the next version of HDF5Array: > > • More balanced default chunk dimensions that allow for fast row or column access. All things being equal, cutting it up into 100x100 squares would probably be a good compromise. Perhaps people with more experience with these things may have some ideas here, e.g., as suggested by @John Readey. If not, I can try different block sizes and see which gives the best result, which leads me to my second point… > > • Global chunk dimensions, and functions to get/set them. Currently, I make my HDF5Matrix objects via as(x, "HDF5Matrix"), where the chunk dimension arguments cannot be directly supplied. I imagine this would also be the case for implicit realizations of DelayedMatrix objects. A similar situation arises for HDF5Matrix instances created in C++ by beachmat. Some way to change the chunking behaviour in these functions would be nice.
Aaron Lun (06:56:06) (in thread): > For example, a user could set the global chunking dimensions to an integer vector v of any length. This would then be interpreted by some getChunkDim function as follows: > > i) For a dataset of N dimensions, the first N elements of v are used. If v is too short, it is recycled as typically done in R. > ii) If any elements are NA, this indicates to use the entire length of that dimension, e.g., all rows or all columns. > iii) The chunk size is set to the length of the dataset for any dimension where the former exceeds the latter. > > Regarding my first point in the previous message, the default might be to set the global chunk to something like 100, which implies a 100x100x… chunk size.
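(A sketch of the proposed rules in plain R; getChunkDim is Aaron's hypothetical interface here, not an existing HDF5Array function:)
> getChunkDim <- function(dims, v = 100L) {
>     v <- rep(as.integer(v), length.out = length(dims))  # (i) take/recycle the first N elements
>     v[is.na(v)] <- dims[is.na(v)]                        # (ii) NA means the whole extent of that dimension
>     pmin(v, dims)                                        # (iii) cap chunk sizes at the dataset dimensions
> }
> getChunkDim(c(19000L, 3000L))            # 100 100
> getChunkDim(c(19000L, 3000L), c(1, NA))  # 1 3000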
Hervé Pagès (14:50:50) (in thread): > Thanks @Aaron Lun for the feedback. One caveat with this approach is that the chunk length will grow exponentially with the nb of dimensions (e.g. will be 100^N). This is something I tried to remedy with my initial .chunk_as_hypercube strategy where the chunk length was set to some value L and the chunk dimensions were set to L^(1/N). Another difficulty that .chunk_as_hypercube was trying to deal with is that all the chunk dimensions need to be <= the dataset dimensions (i.e. all(dim(chunk) <= dim(dataset)) must be TRUE), otherwise h5createDataset is not happy. Unfortunately, trimming the chunks with something like pmin(dim(chunk), dim(dataset)) can lead to a chunk length that is much smaller than what the user would have wanted for datasets with very uneven dimensions (e.g. a 10000x5 matrix). So in that case .chunk_as_hypercube was modifying the dimensions of the chunks by squeezing them along some dimensions and stretching them along other dimensions to make the chunks fit in the dataset but at the same time preserve their length. But after playing with this a little I realized that I was getting better results by simply choosing chunk dimensions that are compatible with the block-processing mechanism, and this meant defining chunks that are linear subsets running along the fastest-moving dimension (the block-processing mechanism itself runs along the fastest-moving dimension, and changing that would be a big change). So I abandoned the whole .chunk_as_hypercube idea. This is why getHDF5DumpChunkDim gives you chunks of dim 7500x1 for your data. I still need to think about a mechanism that allows the user to control the chunk dimensions. Anyway, before that, I’ll add something to let him/her choose the compression level (6 by default). If you only care about speed and if size on disk is not really an issue, you might just want to set this to 0 (i.e. no compression at all). Then I guess chunk dimensions become irrelevant. This one should be an easy change.:smiley:
Aaron Lun (14:57:28) (in thread): > Sounds good. I was thinking about this over lunch; one possibility would be to define some rechunking functions that make a new h5 dataset (in the same file) with the most extreme chunking schemes available (i.e., all rows, 1 column vs 1 row, all columns). Users and functions could then switch between chunking schemes as desired, which would be faster than having to continually work with a single suboptimal chunking scheme.
Hervé Pagès (15:07:36) (in thread): > Fast row and fast column access, at the cost of duplicating the data on disk. That’s an interesting idea. Definitely worth exploring! Sounds like this could even be a built-in feature of the HDF5 format itself…
2017-05-28
Hervé Pagès (00:54:58) (in thread): > Just added set/getHDF5DumpCompressionLevel to HDF5Array. Also writeHDF5Array now has a level argument to control the compression level.
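(A quick sketch of these compression knobs in use; function and argument names are as given in this thread, and m again stands for any matrix you want to write:)
> library(HDF5Array)
> setHDF5DumpCompressionLevel(0)   # no compression: faster writes/reads, larger files
> getHDF5DumpCompressionLevel()
> writeHDF5Array(m, file="counts.h5", name="counts", level=6)   # per-call override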
2017-05-30
Aaron Lun (08:17:44) (in thread): > Some feedback on the feasibility of the double-chunking strategy. Using the Zeisel brain data set as an example (https://storage.googleapis.com/linnarsson-lab-www-blobs/blobs/cortex/expression_mRNA_17-Aug-2014.txt), I stored the count matrix in contiguous or row/column chunked form:
> infile <- "expression_mRNA_17-Aug-2014.txt"
> counts <- read.delim(infile, stringsAsFactors=FALSE, header=FALSE, row.names=1, skip=11)[,-1] # First column after row names is some useless filler.
> dense.counts <- as.matrix(counts)
> storage.mode(dense.counts) <- "double" # for proper comparison with sparse matrix.
>
> library(rhdf5)
> fname <- "contiguous.h5"
> h5createFile(fname)
> h5createDataset(fname, "cont", dims=dim(dense.counts), level=0)
> h5write(dense.counts, file=fname, name="cont")
>
> fname <- "double_chunk.h5"
> h5createFile(fname)
> h5createDataset(fname, "bycol", dims=dim(dense.counts), chunk=c(1, ncol(dense.counts)))
> h5write(dense.counts, file=fname, name="bycol")
> h5createDataset(fname, "byrow", dims=dim(dense.counts), chunk=c(nrow(dense.counts), 1))
> h5write(dense.counts, file=fname, name="byrow")
Aaron Lun (08:18:32) (in thread): > The uncompressed contiguous file is 480 MB in size, while the double-chunked file is still only 36 MB.
Aaron Lun (08:24:08) (in thread): > Access times from the double-chunked file are slightly slower for column access (due to decompression overhead) but much faster for row access, using timings based on beachmat row/column accessor functions.
> library(HDF5Array)
> contig <- HDF5Array("contiguous.h5", "cont")
> byrow <- HDF5Array("double_chunk.h5", "byrow")
> bycol <- HDF5Array("double_chunk.h5", "bycol")
>
> library(beachtime)
> system.time(BeachmatColSum(byrow)) # flipped around because 'bycol' chunk is actually for a single row.
> system.time(BeachmatRowSum(bycol))
> system.time(BeachmatColSum(contig))
> # system.time(BeachmatRowSum(contig)) # Takes forever, don't bother running.
Aaron Lun (08:37:05) (in thread): > The only remaining question is whether there is an efficient way to “rechunk” an HDF5 dataset. This would allow you to do all your processing on the column-wise chunks and then call the rechunker to synchronise the row-wise chunks.
Aaron Lun (08:41:16) (in thread): > Ah, the BeachmatRowSum(contig) finished; it took 555.135 seconds. In comparison, BeachmatRowSum(bycol) takes about 6 seconds. Both column sum functions take less than a second.
Aaron Lun (12:46:48) (in thread): > We’d be looking for an R version of the h5repack utility.
Hervé Pagès (20:11:57) (in thread): > Interesting results, thanks! Whether there is an efficient way to “rechunk” a given dataset depends on its current chunking, I guess. The worst-case scenario is a big dataset that is given to you with “byrow” (or “bycol”) chunks and you want to rechunk it to “bycol” (or “byrow”) chunks. Might take hours or even days on a powerful server! Unless you have enough memory to load the full dataset in memory. A much more favorable scenario is if the dataset is given to you uncompressed and with no chunks (like your contig dataset). I guess rechunking should be a one-time operation anyway and not one typically run by the end user (s/he should have access to datasets that are already dual-chunked). If we want to go that route, I will start working on a HDF5Matrix2 container in HDF5Array. It will complicate things significantly. I won’t try to support realization of a DelayedArray object as a HDF5Matrix2 object, at least not until someone convinces me we need that, as this might be really complicated to support and will probably be too inefficient. Besides, it’s not something that seems to be needed for the typical workflow. My understanding is that this would typically be done outside the HDF5Array framework by running h5repack (or h5rechunk?) on a powerful server.
2017-05-31
Aaron Lun (09:13:12) (in thread): > Thanks @Hervé Pagès. Rechunking may not just be a one-time operation, though. Even if I had a dual-chunked HDF5Array, if I do some operations and realize them on disk, I would have to update both of the layouts sometime during my analysis. I guess this would indeed be complicated to support. > > If efficient access to both rows and columns is to be provided, we’ll have to deal with the chunking issue in one way or the other. A compromise might be for the user (or functions) to rechunk as necessary if the downstream operations are known to be row-access. This could probably be done reasonably efficiently if you can control the available memory; I will try writing something for this on the weekend.
Aaron Lun (14:28:49) (in thread): > I just tried out what I talked about in my last message. Starting with purely column-wise chunks for double-precision data, I wrote some C++ code to load in as many chunks as I could fit in 100 MB of memory. I then wrote the loaded data into a new file using row-wise chunks. This approach takes 16 seconds on my computer with a 19000 x 3000 matrix, which is pretty good; assuming linear scaling, it would take 1.5 hours for a 1 million cell data set, which is… acceptable.
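(For illustration, an R/rhdf5 sketch of the same idea; Aaron's actual implementation is C++. The file and dataset names and the 100 MB budget are assumptions, and without tuning the HDF5 chunk cache the writes may be slower than his timings:)
> library(rhdf5)
> rechunk_by_row <- function(infile, inname, outfile, outname, dims, block_mem=100e6) {
>     # 'dims' = c(nrow, ncol) of the input dataset; read blocks of columns that
>     # fit in ~block_mem bytes (as doubles), write them into a row-chunked copy.
>     ncol_per_block <- max(1L, floor(block_mem / (8 * dims[1])))
>     h5createFile(outfile)
>     h5createDataset(outfile, outname, dims=dims, chunk=c(1, dims[2]))   # row-wise chunks
>     for (start in seq(1L, dims[2], by=ncol_per_block)) {
>         cols <- start:min(start + ncol_per_block - 1L, dims[2])
>         block <- h5read(infile, inname, index=list(NULL, cols))   # cheap for column-wise chunks
>         h5write(block, outfile, outname, index=list(NULL, cols))
>     }
>     H5close()
> }
> rechunk_by_row("colchunked.h5", "counts", "rowchunked.h5", "counts", dims=c(19000L, 3000L))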
Hervé Pagès (15:07:41) (in thread): > Not too bad! Would be good to try the same thing with 10x more columns (19000 x 30000) and confirm that it scales linearly. More on preserving the dual-chunking layout at realization time: Assuming that the typical analysis reduces the size of the original data, the end user will have other options available e.g. realizing as HDF5Matrix (i.e. single layout) with some well chosen chunk size/compression level, or as RleMatrix, or as sparse matrix. Will be worth exploring these before tackling realization as HDF5Matrix2 object (i.e. dual layout).
Aaron Lun (15:15:37) (in thread): > I was thinking something similar; even with the large 10X data sets, the density may be low enough to hold in memory completely as a sparse matrix (1% density for 1 million cells & 20000 genes should be around 1.6 GB). Most initial operations in data analysis are fairly simple; we can then delay conversion to a HDF5Array until the later sparsity-destroying operations (e.g., batch effect correction, centering), whereupon the user will have a better idea of whether we need row or column chunking. @Peter Hickey @Martin Morgan do you have a feel for the density of the 10X data sets?
Peter Hickey (15:19:41) (in thread): > Not off hand
Aaron Lun (15:22:32) (in thread): > @Hervé Pagès 10-fold would probably make my computer scream, but it does scale linearly at 2-fold, at least.
Aaron Lun (15:22:54) (in thread): > @Peter Hickey Ah, I thought you might have had a look at the 1M neuron data set.
Peter Hickey (15:23:48) (in thread): > i have but don’t recall sparsity (do you mean % non-zero?). I can get that if you don’t have the data handy
Aaron Lun (15:25:29) (in thread): > Yes, that’s it. I haven’t downloaded the data yet; but once we’ve sorted out the HDF5 chunking issue, I’ll convert scran and scater over to be able to accept HDF5Array objects, and I’ll use them to run through an entire analysis of the 10X data.
Peter Hickey (15:27:22) (in thread): > 7% non-zero (2,624,828,308 out of 1,306,127 x 27,998)
Aaron Lun (15:28:37) (in thread): > Aw crap. Well, that won’t fit on my computer as a sparse matrix.
Peter Hickey (15:29:40) (in thread): > i tried a naive conversion to sparse matrix and it sucked…killed it
Aaron Lun (15:40:30) (in thread): > @Hervé Pagès I should note that my example was for a very dense matrix. If we simulate data with 7% non-zero entries, it seems compression gets a lot better and there are fewer disk writes, so rechunking time drops to 8 seconds for the 3000-cell example. I presume it will get even better if someone who knows what they’re doing writes the C++ code.
Aedin Culhane (16:30:41): > Single cell meeting https://www.eventbrite.com/e/annual-single-cell-analysis-investigators-meeting-2017-registration-29448592533 - Attachment (Eventbrite): Annual Single Cell Analysis Investigators Meeting 2017 > The Single Cell Analysis Program (SCAP), supported by the NIH Common Fund, will host its 5th and final Annual Investigators Meeting on June 29-30, 2017, at the Clinical Center on the NIH campus in Bethesda, Maryland.
2017-06-01
Aedin Culhane (12:37:44): > @Michael Lawrence this is the bioc channel
Martin Morgan (13:25:25): > if I try h5read("1M_neurons_filtered_gene_bc_matrices_h5.h5", "mm10/data"), R says “the dims contain negative values” because rhdf5 is interpreting this as a 2624828308 x 1 array (and failing, because array dims are each restricted to 2^31 - 1) even though as a vector R would be happy to read it. Can I make rhdf5 treat this as a plain-old-vector?
Lori Shepherd (14:24:28): > @Lori Shepherd has joined the channel
John Readey (19:17:28): > Has anyone tried the 1M_neurons file with Python and h5py? I get an error reading the TITLE attribute of the root group.
2017-06-02
Hervé Pagès (05:26:27): > @Martin Morgan One way to address this would be to add a drop argument to h5read. When set to TRUE it would do what it does when using [ on an array, i.e. coerce the returned array to the lowest possible dimension. Another possible semantic for drop=TRUE is to just drop all dimensions (not only those equal to 1). Maybe don’t make this the default though. Would probably break many things.
Davide Risso (10:22:22): > @John Readey I don’t know if you saw this, or if it’s of any help (I’m no python expert), but this is the tutorial by 10x on how to analyze the data in python: https://s3-us-west-2.amazonaws.com/10x.files/supp/cell-exp/megacell_tutorial.html
Davide Risso (10:26:56): > they don’t seem to use h5py though
Kasper D. Hansen (10:27:13): > So Qs on this: if I understand it correctly, they just “analyze” 20k cells right?
Kasper D. Hansen (10:27:37): > And this code
> tsne = pd.read_csv("analysis/tsne/2_components/projection.csv")
> clusters = pd.read_csv("analysis/clustering/graphclust/clusters.csv")
> basically tells us that tsne was done outside of python?
Kasper D. Hansen (10:27:42): > right?
Kasper D. Hansen (10:27:50): > at least not in their example code
Davide Risso (10:28:08): > yes, for “easier analysis”
Davide Risso (10:28:48): > I would imagine someone somewhere analyzed the whole data
Kasper D. Hansen (10:29:52): > Of course this script is not a full analysis. I am just trying to see what they claim can easily be done in python according to their script and I am not impressed
Kasper D. Hansen (10:31:15): > They do have all 1M cells in memory in some sparse matrix though
Kasper D. Hansen (10:32:04): > or is the GeneBCMatrix just a pointer to an HDF5 object?
Kasper D. Hansen (10:32:46): > I am just asking the (admitted somewhat narrow and less interesting) question: can we do the same in R as they are doing in pythin
Kasper D. Hansen (10:32:48): > pythin
Kasper D. Hansen (10:32:55): > python damnit
Davide Risso (10:33:30): > If their GeneBCMatrix is indeed just a pointer to the HDF5 object, then their script does not do more than Martin’s package vignette
Davide Risso (10:33:47): > https://github.com/mtmorgan/TENxGenomics - Attachment (GitHub): mtmorgan/TENxGenomics > TENxGenomics - Interface to 10x Genomics’ 1.3 m single cell data set
Kasper D. Hansen (10:35:36): > I don’t know about the pointer; does anyone know enough python to check or know from reading code?
Peter Hickey (10:38:18): > my read is that GeneBCMatrix contains an in-memory sparse matrix representation of the 1M cells
Sean Davis (10:38:26): > I think the matrix here is actually a csc_matrix: https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csc_matrix.html#scipy.sparse.csc_matrix, so it is in memory. Methods on the matrix are given on the linked page.
Peter Hickey (10:40:32): > csc_matrix (Compressed Sparse Column matrix) is like a dgCMatrix (compressed, sparse, column-oriented numeric matrix) from the Matrix R package
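(A small illustration of the analogy using the Matrix package; the toy values are arbitrary:)
> library(Matrix)
> m <- sparseMatrix(i=c(1, 3, 2), j=c(1, 1, 2), x=c(5, 2, 7), dims=c(3, 2))
> class(m)   # "dgCMatrix" -- R's CSC format, like scipy's csc_matrix
> str(m)     # @i: 0-based row indices, @p: column pointers, @x: non-zero values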
Sean Davis (10:40:51): > Exactly.
Sean Davis (10:54:30): > A quick discussion of approaches to large matrix PCA that can be read in 2 minutes or less:https://datascience.stackexchange.com/a/1169 - Attachment (datascience.stackexchange.com): How to do SVD and PCA with big data? > I have a large set of data, about 8GB. I want to use machine learning to analyze it. So I think I should do SVD then PCA to reduce the data dimension for efficiency. But MATLAB and Octave cannot load
Sean Davis (10:55:16): > An algorithm for the principal component analysis of large data sets:https://arxiv.org/abs/1007.5510
Sean Davis (10:58:09): > And some code for the random projection approach: http://homes.di.unimi.it/valenti/SW/clusterv-source/rp.R And the package home page: http://homes.di.unimi.it/valenti/SW/clusterv/
Aaron Lun (11:43:18): > @Kasper D. Hansen Agree regarding the triviality of the Python script. Seems like they just do a bit of subsetting and plotting. We wouldn’t be having problems either if t-SNE coordinates and clusters came down to us like manna from heaven. They also require >32 GB of RAM, which is on the limit of what PCs can handle.
Kasper D. Hansen (11:46:32): > Thanks for those links Sean, I was not aware of the paper on sparse projections, which I’ll have a look at
Kasper D. Hansen (11:48:45): > Otherwise, this is the stuff we have been thinking about
Sean Davis (12:05:24): > @Kasper D. Hansen, just putting it out there to recap our discussion yesterday since most folks here didn’t hear.
Sean Davis (12:08:13): > And just another reference point: scikit-learn Incremental_PCAhttp://scikit-learn.org/stable/auto_examples/decomposition/plot_incremental_pca.html
2017-06-03
Hervé Pagès (00:31:16) (in thread): > FWIW it looks like the dgCMatrix implementation is broken if the nb of non-zero values is >= 2^31. I tried 2 strategies. The first is to call the sparseMatrix() constructor directly on the long vector of non-zero values. The second is to break the long vector into chunks (I used 4 chunks), call sparseMatrix() on each chunk (that went fine), and add the 4 dgCMatrix objects. Both strategies failed. Interestingly the 2nd strategy failed in the last step when I tried to add the 4th matrix to the sum of the first 3 ones. This last addition is what would have made the nb of non-zero values in the sum go over 2^31. So it seems that basic arithmetic operations are broken on dgCMatrix objects with a nb of non-zero values >= 2^31. I’m giving up on dgCMatrix and will focus on RleMatrix. In the last couple of days I’ve improved support for long Rle in S4Vectors so my hope is that it will be possible to represent the full 1M neuron data set as an RleMatrix object (will be between 35Gb and 40Gb in memory though, but will support delayed operations and block processing). Clearly not a satisfying solution but I’m curious to give this a shot.
2017-06-04
Martin Morgan (07:24:36): > @Hervé Pagès @Vince Carey I added a drop argument (default FALSE) to h5read() and H5Dread(), which when TRUE ignores the dimensions entirely. An existing workaround would be to provide an argument buf=integer(2624828308) to either function, but that has reference semantics (buf isn’t duplicated at the C level). So
> xx = h5read(BiocFileCache()[["BFC10"]], "/mm10/data", drop=TRUE)
> str(xx)
> int [1:2624828308] 1 19 14 40 29 1 17 26 5 7 ...
> This is in rhdf5 v. 2.21.1.1 at https://github.com/mtmorgan/rhdf5; I’ll submit a pull request to Mike in the not too distant future. - Attachment (GitHub): mtmorgan/rhdf5 > Package Homepage: http://bioconductor.org/packages/devel/bioc/html/rhdf5.html Bug Reports: https://support.bioconductor.org/p/new/post/?tag_val=rhdf5.
Peter Hickey (22:01:44) (in thread): > That’s frustrating about the Matrix package
2017-06-07
Hervé Pagès (05:38:17) (in thread): > Representation as RleMatrix is not that bad: the full 1M neuron data set can be put in an RleMatrix object that is only 19Gb in memory. The serialized object is 2.4G on disk, and takes 2 min to load. Producing that object was expensive though: the whole process took about 6 h and required 85Gb of memory! Right now walking on the full object (i.e. block processing) is slow (about 1h30) but there is A LOT of room for optimizing this mechanism specifically for this kind of RleMatrix object. The data in the RleMatrix object is chunked and the chunks are Rle objects stored in an environment. By using a block processing strategy that takes advantage of these chunks, walking on the full object could take less than 2 min! The block processing mechanism has been hidden in the DelayedArray package so far (it’s used internally when doing things like max(), sum(), rowSums(), etc…). At some point I will expose and document the utilities behind it (block_APPLY(), block_APPLY_and_COMBINE(), etc…)
Aaron Lun (13:36:07) (in thread): > Yes, a block_APPLY function would be pretty useful. I’ll also add support for RleMatrix to beachmat; this seems like a useful format for count data.
Aaron Lun (13:37:16): > @Martin Morgan Are we having a meet-up this Friday?
Martin Morgan (13:56:04): > @Aaron Lun yes friday 12 noon eastern
Martin Morgan (14:23:38) (in thread): > FWIW the tenxiterate() example in the TENxGenomics vignette https://github.com/mtmorgan/TENxGenomics/blob/master/vignettes/TENxGenomics.Rmd#iterative takes about 8 minutes and <16G of memory to calculate row and column non-zero counts and sums, using 6 of my 8 laptop cores. Also if one goes for the all-in-memory approach then do you really need to chunk, or just get more memory? - Attachment (GitHub): mtmorgan/TENxGenomics > TENxGenomics - Interface to 10x Genomics’ 1.3 m single cell data set
2017-06-08
Martin Morgan (11:32:37): > set up a reminder “single cell meeting at 12 noon EST (connect: http://huntercollege.adobeconnect.com/singlecell/ ; doc: https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)” in this channel at 9AM tomorrow, Eastern Daylight Time.
2017-06-09
Aaron Lun (08:10:57) (in thread): > @Hervé Pagès Do you know whether setting level=0 in h5createDataset disables all chunking, even if chunk is specified?
USLACKBOT (09:00:00): > Reminder: single cell meeting at 12 noon EST (connect:http://huntercollege.adobeconnect.com/singlecell/; doc:https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)
Vince Carey (11:47:26): > is it possible to join by phone? i will be on a train at noon
Marcel Ramos Pérez (11:48:13): > I don’t think you can with Adobe Connect or it has to be set up somehow
Vince Carey (11:48:52): > @John Readey we have obtained a dense hdf5 representation of the 10x data in 13 files of 28k by 100k … is there any advantage to binding them together as a single hdf5 matrix before putting on server?
Hervé Pagès (11:53:49): > I’ll have to leave the meeting at 9:20 PST to take my daughter to school, sorry
Vince Carey (11:57:29) (in thread): > @Marcel Ramos Pérez ok i have the app, may work
Kasper D. Hansen (12:00:17): > @Vince Carey could you put those 13 files somewhere accessible?
John Readey (12:03:34): > @Vince Carey - having one aggregate file is convenient for users - they can do a single request that has access to the entire data space.
Vince Carey (12:05:23): > ok is there a preferred way to join such arrays with h5py ops?
John Readey (12:19:08): > Yes, that’s how I’ve typically done it. Create an initial array with an unlimited dimension. For each array you want to concatenate, extend the target array and copy in the source data.
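(The same recipe sketched in R with rhdf5 rather than h5py, pre-allocating the full dimensions instead of using an unlimited dimension. The file/dataset names, the 28k x 100k-per-file shape, and the 100x100 chunking are assumptions for illustration:)
> library(rhdf5)
> sources <- sprintf("dense_part_%02d.h5", 1:13)   # hypothetical input files, one per 100k-cell slab
> nrow_total <- 27998L
> ncol_part  <- 100000L
> h5createFile("combined.h5")
> h5createDataset("combined.h5", "counts",
>                 dims=c(nrow_total, ncol_part * length(sources)),
>                 chunk=c(100, 100), level=6)
> slab <- 5000L   # copy 5000 columns at a time to bound memory use
> for (i in seq_along(sources)) {
>     for (start in seq(1L, ncol_part, by=slab)) {
>         cols_in  <- start:min(start + slab - 1L, ncol_part)
>         block    <- h5read(sources[i], "counts", index=list(NULL, cols_in))
>         cols_out <- (i - 1L) * ncol_part + cols_in
>         h5write(block, "combined.h5", "counts", index=list(NULL, cols_out))
>     }
> }
> H5close()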
John Readey (12:32:50): > BTW, I’ve loaded the 1M_neurons file to the object-storage based HDF Server (HSDS)
John Readey (12:33:31): > Unfortunately, I don’t have compression working yet so the 4 GB file became 29 GB!
John Readey (12:34:50): > If anyone is interested in trying out accessing it, let me know and I can provide the connect info.
John Readey (12:36:10): > This is what the layout looks like:
John Readey (12:36:11): > $ hsls -v -H -r /home/john/sample/1M_neurons_filtered_gene_bc_matrices_h5.h5
> / Group
> UUID: g-5f9a0b78-436f-11e7-a127-0242ac110008
> /mm10 Group
> UUID: g-601f554e-436f-11e7-a127-0242ac110008
> /mm10/data Dataset {2624828308/Inf}
> UUID: d-60423c44-436f-11e7-a127-0242ac110008
> Chunks: [131072] 524,288 bytes, 20,026 allocated chunks
> Storage: 10,499,313,232 logical bytes, 10,499,391,488 allocated bytes, 100.00% utilization
> /mm10/indices Dataset {2624828308/Inf}
> UUID: d-611fe572-4373-11e7-a127-0242ac110008
> Chunks: [65536] 524,288 bytes, 40,052 allocated chunks
> Storage: 20,998,626,464 logical bytes, 20,998,782,976 allocated bytes, 100.00% utilization
> /mm10/indptr Dataset {1306128/1306128}
> UUID: d-73ac3e4a-437b-11e7-a127-0242ac110008
> Chunks: [65536] 524,288 bytes, 20 allocated chunks
> Storage: 10,449,024 logical bytes, 10,485,760 allocated bytes, 99.65% utilization
> /mm10/shape Dataset {2/16384}
> UUID: d-74e90f04-437b-11e7-a127-0242ac110008
> Chunks: [16384] 65,536 bytes, 1 allocated chunks
> Storage: 8 logical bytes, 65,536 allocated bytes, 0.01% utilization
John Readey (12:36:51): > Each “allocated chunk” is stored as a separate S3 object.
Davide Risso (12:43:30): > Link to the SingleCellExperiment repo:https://github.com/drisso/SingleCellExperiment - Attachment (GitHub): drisso/SingleCellExperiment > SingleCellExperiment - S4 classes for single cell experiment data
Aaron Lun (13:24:26): > Link to the beachmat evaluation repo: https://github.com/LTLA/MatrixEval2017 - Attachment (GitHub): LTLA/MatrixEval2017 > MatrixEval2017 - Evaluating access and memory usage of different matrix types
Peter Hickey (13:25:07): > have you got a pdf of the tex document you could upload?
Aaron Lun (13:27:16): > @Aaron Lun uploaded a file: description.pdf - File (PDF): description.pdf
Vince Carey (13:28:00): > yes i will try to do this next week - Attachment: Attachment > @Vince Carey could you put those 13 files somewhere accessible?
Unknown User (13:29:38): > @Peter Hickey commented on @Aaron Lun’s file https://community-bioc.slack.com/files/U34P8RS3B/F5SA70ZFY/description.pdf: thanks! - File (PDF): description.pdf
Andrew McDavid (14:39:15): > @Aaron Lun: I think this was discussed, but sailed over my head on the call. Should we be considering square or rectangular chunks as a compromise between row chunks or column chunks for HDF5 storage? Would that avoid the poor worst-case performance properties of HDF5?
Kasper D. Hansen (14:41:15): > I was not on the call, but in general things will be sliced in either direction. It will be assay specific which is the big dimension. We are working more on HDF5 stuff for methylation assays. For sequencing we need 28M x 100 (up to 1000) and for arrays we need 1M x 1,000-10,000 (up to 100,000)
Aaron Lun (14:43:59) (in thread): > @Andrew McDavid It would probably be suboptimal, but how suboptimal it is would depend on the size of the data set, whether the chunk cache is exceeded, etc. My guess is that for large data sets, it would probably take less time to rechunk than to persevere with a suboptimal layout. This is based on the assumption that, if you’re looking for row/column access, you’ll probably exceed the chunk cache by the time you get to the end of a row/column in a large data set, requiring you to reload the chunks at the start when you get to the next row/column.
Andrew McDavid (14:49:51): > Do you mean in general things will be sliced in only one direction (that is assay dependent) or in both directions?
Kasper D. Hansen (14:51:45): > both directions, I think
Kasper D. Hansen (14:52:01): > it depends on what you’re trying to compute
Andrew McDavid (15:07:09): > So rectangular chunks might make sense then.
Aaron Lun (15:09:10): > Depends on whether you need row/column access at the same time. If they’re in separate steps, e.g., my pipeline contains a bunch of steps requiring column access (e.g., cell-based quality control), followed by a bunch of steps requiring row access (e.g., DE/HVG/whatever genes), it makes more sense to completely rechunk the file between the two sets of operations.
Andrew McDavid (15:12:34): > probably not at the same time. How long does rechunking take?
Aaron Lun (15:12:55): > For the 1M data set, probably on the order of an hour.
Aaron Lun (15:13:01): > For Zeisel, a couple of seconds.
Andrew McDavid (15:13:03): > Oh, wow.
Andrew McDavid (15:16:35): > 1. we could store both row-major and column-major formats. 2. I would be curious to see timings on rectangular chunks, say of size 1000 X 100.
Andrew McDavid (15:18:27): > or 100 X 1000, I guess if we are storing genes X cells.
Aaron Lun (15:18:47): > Herve and I discussed option 1; it would be a real pain to keep both of them updated during realizations.
Andrew McDavid (15:19:06): > True, yeah that does sound painful.
Andrew McDavid (15:19:59): > Also, there was commentary on PCA on large data sets. FWIW, randomized PCA algos are very fast on fat or tall data sets (n>>m or n << m). Essentially you take a random projection of the data in q << min(m, n) dimensions, then calculate the svd on that random projection.
Andrew McDavid (15:23:14): > rsvd contains a pure-R implementation that works, with a minor mod, with a CSparseMatrix
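(To make the idea concrete, a hand-written sketch of the randomized projection step Andrew describes — not the rsvd package internals; the matrix here is simulated:)
> library(Matrix)
> set.seed(1)
> A <- rsparsematrix(3000, 1e5, density=0.05)   # genes x cells, ~5% non-zero
> q <- 20
> Omega <- matrix(rnorm(ncol(A) * q), ncol=q)   # random test matrix
> Y <- as.matrix(A %*% Omega)                   # project: 3000 x q, cheap for sparse A
> Q <- qr.Q(qr(Y))                              # orthonormal basis for the projected range
> B <- as.matrix(crossprod(Q, A))               # small q x ncol(A) matrix
> s <- svd(B, nu=q, nv=0)
> U_approx <- Q %*% s$u                         # approximate top-q left singular vectors
> d_approx <- s$d                               # approximate singular values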
Aaron Lun (15:27:13): > I’ll test square chunks sometime next week, but my feeling is that 1 hour isn’t that bad, given that it would probably take days of pure computational processing to get through the 1M cell data set anyway.
Aaron Lun (15:32:04): > Now that I think about it, it’s also fairly straightforward to determine the theoretical rectangular chunk dimensions that ~optimize~ minimize suboptimality in consecutive row/column accesses for a given size of the chunk cache.
Andrew McDavid (15:42:58) (in thread): > A 3000 x 1 million CSparseMatrix (5% sparsity) takes ~15 seconds to decompose on my laptop.
Andrew McDavid (15:45:43): > @Andrew McDavid uploaded a file: rsvd on a large matrix and commented: The magic happens by taking the cross product of the random projection and the data, so a sparse representation is possibly nice here. But hdf5 could easily drop in here as well, though maybe not as fast. - File (Plain Text): rsvd on a large matrix
Sean Davis (15:53:25): > Nice,@Andrew McDavid.
John Readey (16:08:27): > Re: chunking - in general the more chunks that a selection intersects the slower the call will be. Each chunk will need to be fetched from disk and decompressed. So if you are only reading one element per chunk, things will be slow. It’s certainly worthwhile trying out a square chunk shape and seeing how that performs. Also you can play around with the size of the chunk (say from 256KB to 8MB).
John Readey (16:10:17): > With the HSDS server the performance behavior is a bit different since multiple chunks can be processed in parallel (on different nodes of the server). @Aaron Lun if you can point me to the data files and access code, I could do some performance runs with HSDS.
Andrew McDavid (16:14:18) (in thread): > And here’s a very nicely written reference on randomized SVD algorithms:https://arxiv.org/pdf/0909.4061.pdf
Mike Jiang (19:00:38): > @John Readey I agree. Also, the union operations of multiple hyperslab selections at the H5 level are extremely expensive, so ideally we want to restrict the H5 access pattern to the chunked size, and do the subsetting of the chunked array in memory (based on my recent benchmark http://rpubs.com/wjiang2/282793 for the simulated data of up to 500k cells).
Kasper D. Hansen (21:19:42) (in thread): > That is what I intend to use
Fanny Perraudeau (21:22:19): > @Fanny Perraudeau has joined the channel
2017-06-10
Aaron Lun (07:09:02): > @John Readey My timings were done on very small data sets (the suboptimal layouts just took too long on any reasonably sized data set), so I don’t know how well they’ll translate to server performance. But check out the timings/simulations/hdf5_matrix folder in the MatrixEval2017 repo if you want to have a look at what I did.
Aaron Lun (07:10:30): > @Andrew McDavid @Hervé Pagès Also kept on thinking about how to optimize rectangular chunking for both row and column access, in a manner that can outperform row- and column-level chunks - see additional.pdf in https://www.dropbox.com/sh/rfasf3mxjs65ac9/AADNbMGqo4AGPrj1l26D6fU3a?dl=0 - Attachment (Dropbox): Bioconductor > Shared with Dropbox
Aaron Lun (07:11:16): > This comes at the cost of some memory usage for the chunk cache, but this should be tolerable (my guestimates are around 1.6 GB for the 1M neuron data set).
2017-06-11
Peter Hickey (17:16:45): > @Hervé Pagès: Say I’m wanting to compute some row-wise summary stat of a “long” DelayedMatrix: I might be willing to realize into memory nn rows at a time where nn < nrow(x) but nn >> 1. Is this something that can be achieved using existing functionality in DelayedArray? The ArrayBlocks and ArrayGrid classes look like they might be what I want, but I admit I don’t yet understand these fully or how they relate to one another. My impression is that ArrayBlocks operate in a column-wise manner, is that correct? I guess I’m looking for their row-wise partner.
Peter Hickey (17:19:05) (in thread): > I mentioned on the call last Friday (unfortunately after you’d had to leave) that I’m working on a DelayedMatrixStats package (https://github.com/PeteHaitch/DelayedMatrixStats) to support the full matrixStats API for DelayedMatrix objects, with some optimisations based on the backend. That’s what motivates this question - Attachment (GitHub): PeteHaitch/DelayedMatrixStats > DelayedMatrixStats - A port of the matrixStats API to work with DelayedMatrix objects from the DelayedArray package
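(In the meantime, a workaround sketch of the row-block walk Pete describes, using only plain DelayedArray subsetting and as.matrix() realization; nn = 1000 is an arbitrary choice and this is not DelayedArray's own block-processing API:)
> library(DelayedArray)
> rowSums_by_block <- function(x, nn=1000L) {
>     out <- numeric(nrow(x))
>     for (start in seq(1L, nrow(x), by=nn)) {
>         rows <- start:min(start + nn - 1L, nrow(x))
>         out[rows] <- rowSums(as.matrix(x[rows, , drop=FALSE]))   # realize nn rows at a time
>     }
>     out
> }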
2017-06-12
Aaron Lun (04:11:12): > Here are some timings for row- and column-level access on a 10000-by-10000 HDF5Matrix with optimized chunk cache settings in beachmat and C++:
> - Pure column chunks, column access = 4566 ms
> - 100x100 chunks, column access = 6016 ms
> - Pure row chunks, row access = 13367 ms
> - 100x100 chunks, row access = 14851 ms
> Pretty damn good, given we’re operating on a single file. The optimized cache size is only 8 MB, which is easily tolerable. Incidentally, the optimal size of the HDF5 chunk cache in additional.pdf is equivalent to the acceptable amount of data that you might want to load into R memory during block processing of a HDF5Matrix.
John Readey (11:47:40): > @Aaron Lun so it looks like using the square chunk layout adds only a modest penalty & works equally well for both column or row access. Did you try something like pure row chunks with column access? I’d guess this would be really slow, hence the square chunks offer the best overall performance if you expect to need both row and column access.
Aaron Lun (12:32:03): > @John ReadeyAs a certain president might say, it’s “Bad!!!” (or maybe “Shocking!!!”). So bad, in fact, I gave up trying to run it for anything larger than a 1000 x 1000 matrix.
Hervé Pagès (15:28:28) (in thread): > @Peter Hickey I’m currently revisiting completely the block-processing mechanism in DelayedArray. Right now the mechanism is column-oriented. Our recent discussions convinced me that we need a more flexible mechanism that doesn’t privilege any dimension a priori. The new classes you saw (now called ArrayViewport and ArrayGrid) will support the new block-processing mechanism I’m working on. It’s a big change and still very much a work-in-progress. Because of other priorities that are showing up this week and next week, I won’t be able to progress as fast as I’d like on this. Will probably take me about 3 more weeks to complete that transition. It’s on the agenda to export a blockApply function that will let you walk on blocks of arbitrary shape (e.g. nn rows). The interaction between blocks and chunks (h5 chunks or RleArray chunks) will have a critical impact on performance so will play a role in how I design and implement the whole thing.
Peter Hickey (15:32:46) (in thread): > Thanks, @Hervé Pagès, that’s helpful to know. It’ll be great to have a general blockApply. I’ll resume work on DelayedMatrixStats in a few weeks time once things have settled down
2017-06-13
Aaron Lun (14:11:44) (in thread): > I also rejigged beachmat to handle ChunkedRleArraySeed objects, but it occurred to me that my code could have been easily generalized if a SolidRleArraySeed object was just implemented as a ChunkedRleArraySeed with a single chunk. This would also apply to more general rectangular chunks in RleMatrix where each chunk is its own RLE. A bit more difficult to break down for column access, but allows for more effective block row access - perhaps you were already thinking about it.
2017-06-14
John Readey (11:23:38): > I’ll try some similar experiments using the HDF Server. What’s the largest matrix that would come up in practice in the biodata world? 10^5 x 10^5 would be ~10GB. In earth science it’s common to have TB-sized arrays but they are typically three-dimensional (time + lat + lon).
Peter Hickey (11:31:17): > I’m already playing with two datasets with ~10^10 elements. Dimensions vary widely though, e.g., 2*10^4 x 10^6 vs. 3*10^8 x 2*10^2
Mike Jiang (12:53:45): > Due to the sparsity, it will be smaller after the default H5 compression
Peter Hickey (12:56:27): > True but some of my data are dense, albeit with some likely compressibility due to runs of constant values
2017-06-15
Hervé Pagès (16:50:24) (in thread): > @Aaron Lun Yes, SolidRleArraySeed is not strictly needed and could in theory go away (and it will at some point). I only kept it for testing/comparing purposes. All this is still very much work in progress so expect things to change anytime. For example the internals of ChunkedRleArraySeed will change soon to support arbitrary “grid chunking” (right now it only supports chunking along the columns).
Hervé Pagès (18:49:38) (in thread): > And the plan is that each chunk in the grid of chunks will be an RLE that runs along the columns or along the rows (either user-controlled or automatically decided based on what gives better compression). The idea is to have a flexible/controllable chunking scheme that adapts to the kind of data (e.g. 1000x1000 chunks encoded along the rows for the 1M neurons data set). Plus the ability to switch between integer- and raw-RLE depending on the max value in the chunk; overall this should lead to good compression while maintaining reasonably fast row and col access. Also block-processing will be fast because it will automatically adjust the size of the blocks to the chunking grid (typically the grid used for block-processing will be coarser than the chunking grid i.e. each block will cover exactly a given number of chunks e.g. 100 or 1000). And block-processing will be able to work in parallel. So I still have a lot of work ahead. Today I completed implementing ArrayGrid objects. They formalize the notion of grid and will be used internally to support all this. They are fun to play with so I exported them:smiley:. They even have a man page: ?ArrayGrid (still somewhat minimalist though).
2017-06-16
Aaron Lun (05:22:27) (in thread): > Woah, sounds intense. But that’s okay; where would I get the excitement in my life if I didn’t get sudden changes to BioC-devel code:slightly_smiling_face:? Also, will all chunks in the same RleMatrix be either row-major or column-major, or are they allowed to vary within the matrix?
2017-06-17
Hervé Pagès (07:28:42) (in thread): > About stability of the internals: it’s not a solution in the long run anyway that beachmat needs to know about the gory details of RleArray internals. The DelayedArray package will need to provide a low-level C API that you can call directly from the C++ code in beachmat. Then I will be able to make all the crazy changes I want in the RleArray internals without breaking beachmat. Boring, I know:wink:I would still break serialized RleArray instances though, which is always a source of excitement! Good question about “all chunks have the same orientation i.e. either row- or col-oriented” vs “each chunk can have its own orientation”. I was first going to go for the latter (in the spirit of maximum flexibility), then changed my mind (sounded overkill maybe, do we really expect to see data that would take advantage of this?) so was going to go for the former. What do you think? Even if the latter sounds overkill, I don’t expect it to add much complexity to the overall thing.
Aaron Lun (07:51:29) (in thread): > Regarding the row/column-majorness: I was thinking about it too, and it seemed like overkill to me. But once the code (R or C++) can handle chunks, then I suspect the extra detail doesn’t really matter. Maybe best to go with the simpler design first, which would make development a bit easier, and then generalize it as deemed necessary.
Aaron Lun (07:58:12) (in thread): > Also happy to chat about the API when you get around to it. From my end, all I would really need is a way to expand each chunk into a dense matrix in C/C++. I can then draw some inspiration from the HDF5 chunk cache to handle consecutive row/column accesses.
Hervé Pagès (10:03:18) (in thread): > An extract_chunk() approach means that the client code still needs to know about the chunking grid and still has some non-trivial work to do for performing the general subsetting. It should also support extraction of a chunk as an RLE (e.g. via an as_rle argument), otherwise general subsetting will pay the cost of expanding a full chunk even for extracting only 1 value from that chunk. This can be avoided because a set of positions can be efficiently extracted from an RLE without expanding it. So something like extract_chunk() would still be very low-level and wouldn’t provide much isolation to protect beachmat from important changes in the RleArray internals. I was thinking of an API that provides the general subsetting and isolates completely the client code from the notion of chunks/RLEs. Something like extract_subarray(x, index, dest) where x is the SEXP, index a list of integer vectors, one per dimension in x (some of them possibly NULL, I take inspiration from h5read() here), and dest a pointer to pre-allocated memory where the extracted dense data will be written. Note that this would be a C implementation of the subset_seed_as_array method for ChunkedRleArraySeed objects. Happy to discuss the API in more details (maybe somewhere else?). I’m not here yet and won’t be before at least 2 or 3 weeks…
Aaron Lun (11:17:31): > Trying to convert a TENxMatrix object into a HDF5Matrix and failing pretty hard, with:
> > library(TENxGenomics)
> > path <- "1M_neurons_filtered_gene_bc_matrices_h5.h5"
> > tenxmat <- TENxMatrix(path)
> > library(HDF5Array)
> > writeHDF5Array(tenxmat, file="processed.h5", name="neurons")
> Error in validObject(.Object) : invalid class "ArrayViewport" object:
> a viewport cannot be longer than .Machine$integer.max
> Any thoughts @Martin Morgan?
Aaron Lun (11:35:06) (in thread): > @Hervé Pagès Sure, happy to talk off-line about this when you get around to it. FWIW: I was of the understanding that position look-up in a RLE involved taking a cumulative sum of the run lengths until the requested position was reached. While expanding the chunk would indeed be more expensive than a look-up for any single location, it should (if I’m understanding the look-up correctly) be faster for the entire set of consecutive rows/columns spanned by the chunk, as it would avoid having to compute the cumulative sum for each row/column request.
Hervé Pagès (12:03:52) (in thread): > Position look-up in a RLE is a whole topic per se. Right now 3 different algos are implemented and the fastest is picked based on the nb of positions to extract / nb of runs ratio. When this ratio is >= some threshold, the cum sum of the run lengths is computed and a binary search is used for each position. This is pretty fast. See _positions_mapper() in S4Vectors/src/map_ranges_to_runs.c for the details. I improved this code a couple of weeks ago and there is still room for improvement. It’s a pity really that Rle objects store the run lengths and not their cumulative lengths. The latter would have made everything much simpler and faster but that’s another story. Yes, expanding the RLE would probably be slightly faster than the binary search when the nb of positions to extract from the RLE is really big (i.e. of the same order of magnitude as the length of the RLE). Maybe this could be added as a 4th algorithm in the switch statement in _positions_mapper(). This is not a typical situation though. Most of the time, expanding the RLE can and should be avoided.
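(A toy R illustration of the cumulative-length + binary-search lookup described above; the real code is C, in S4Vectors/src/map_ranges_to_runs.c:)
> rle_lookup <- function(r, pos) {
>     ends <- cumsum(r$lengths)                  # cumulative run ends
>     run  <- findInterval(pos - 1L, ends) + 1L  # binary search for the run containing each position
>     r$values[run]
> }
> r <- rle(c(0, 0, 0, 5, 5, 2))
> rle_lookup(r, c(1, 4, 6))   # 0 5 2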
Hervé Pagès (13:12:58): > oops… blame me for this. I introduced this regression recently. Fixed now in svn. This takes about 3h for me with blocks of 250M integers (options(DelayedArray.block.size=1e9), sorry this option is not documented yet) and 500x500 chunks (chunk_dim=c(500, 500)). Call writeHDF5Array() with verbose=TRUE to see progress.
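(Putting the pieces from this thread together, the conversion call would look roughly like this; the block size and chunk geometry are the ones Hervé quotes above:)
> library(TENxGenomics)
> library(HDF5Array)
> options(DelayedArray.block.size=1e9)   # ~250M-integer blocks during realization
> tenxmat <- TENxMatrix("1M_neurons_filtered_gene_bc_matrices_h5.h5")
> writeHDF5Array(tenxmat, file="processed.h5", name="neurons",
>                chunk_dim=c(500, 500), verbose=TRUE)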
Aaron Lun (13:13:51) (in thread): > I was thinking more of the beachmat use case where someone requests a row or column, and then requests the immediately following row/column, and so on. For any single request, the lookups would be the fastest, but the second request would have to recompute the cumulative sums. (The beachmat API doesn’t handle multi-row or column extractions, so the two requests can’t be collapsed into one.) > > Anyway, if I expanded each RLE chunk overlapped by the first requested row/column and cached the dense version of the chunk, I could quickly pull out the next row/column upon request - a situation analogous to how beachmat speeds up consecutive HDF5Matrix accesses by controlling the size of the HDF5 chunk cache, especially when the chunk dimensions are not chosen by us. > > Currently, beachmat doesn’t do any expansion, but instead caches some indices for RleMatrix so that a consecutive access pattern won’t involve a new binary search. Might be something to consider down the line.
Aaron Lun (13:14:31) (in thread): > Thanks @Hervé Pagès. I’ll try this tomorrow and let you know how it goes.
Hervé Pagès (13:36:08) (in thread): > I see. Yep, it seems that for that use case, caching the expanded versions of the chunks that are intersected by the current row or column is probably the best. So your initial extract_chunk() proposal would be all that is needed. Thanks for clarifying!
Hervé Pagès (14:21:21) (in thread): > I meant the best in terms of speed. The size of the cache could get really big though, if the size of the chunks was not set carefully. E.g. the cache would be 2.6 Gb when walking on the rows of the 1M neurons data set if the chunks are 500x500 (you would need to cache 2600 chunks of 1 Mb each).
Aaron Lun (16:17:44) (in thread): > Yeah, it could get pretty gruesome if we’re not careful. To get around this for HDF5 access, beachmat currently caps the cache at 2 GB, and throws an error if this is exceeded (to avoid machines freezing upon running out of memory, and to give people a chance to rechunk to a more memory-friendly layout). Ideally, the upper limit could be set by users, especially if they’re on machines with lots of memory.
Aaron Lun (16:25:55) (in thread): > In fact, I would imagine that chunk-wise processing in R (via block_apply or friends) would run into similar problems. For general operations based on rows or columns, you’d need to load all chunks overlapping a row or column into memory at once. This would gobble up memory for poorly chosen layouts.
2017-06-18
Hervé Pagès (03:43:11) (in thread): > What can help reduce the cache size is to use chunks that preserve the nrow/ncol ratio e.g. with 280 x 13000 chunks for the 1M neurons data set (one 100th of the full dimensions), you would need to cache 100 chunks whether you walk on the rows or on the columns. Then the cache would use only 1.4 Gb of RAM and that memory would be optimally used. With respect to block processing: all the block-processing operations currently supported in DelayedArray (summarization, matrix mult, etc…) will only need to load one block at a time. For example rowSums and colSums will be applied to each block (via bplapply()) and the results returned in a list of length the nb of blocks. Then all the results are combined together before being returned to the user. I guess that’s what Martin is doing with the tenxiterate example in the TENxGenomics vignette. A similar strategy can be applied to matrix multiplication. So by choosing the block size the user actually chooses to cap the memory used by block-processing. Very roughly though: this doesn’t account for the memory footprint of the intermediate list of individual block results, or for the memory needed to compute the result on a given block. Also, if processing in parallel, this memory cap needs to be multiplied by the number of cores.
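(A rough illustration of the apply-per-block-then-combine pattern described above, using BiocParallel; this is a hand-rolled sketch, not DelayedArray's internal mechanism, and the 1000-column block size is arbitrary:)
> library(DelayedArray)
> library(BiocParallel)
> colSums_blockwise <- function(x, block_ncol=1000L, BPPARAM=MulticoreParam(2)) {
>     starts <- seq(1L, ncol(x), by=block_ncol)
>     res <- bplapply(starts, function(start) {
>         cols <- start:min(start + block_ncol - 1L, ncol(x))
>         colSums(as.matrix(x[, cols, drop=FALSE]))   # realize and summarize one block at a time
>     }, BPPARAM=BPPARAM)
>     unlist(res, use.names=FALSE)                    # combine the per-block results
> }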
Aaron Lun (06:02:14) (in thread): > Yep, the nrow/ncol preservation is exactly the strategy that I arrived at when I was playing around with HDF5 chunks. (The exact scaling depends on the acceptable size of the cache and whether you want to outperform pure row/column chunks in terms of access speed; see additional.pdf in my Dropbox link.) > > As for the block-wise processing; I was actually thinking of @Peter Hickey’s aim of replicating all of the matrixStats functions. Many of these can indeed be handled block-wise, and then summarized across blocks as necessary. However, one of them is colRanks, for which there seems to be no other choice than to load all chunks for a given column.
Aaron Lun (06:05:54) (in thread): > While we’re on this topic:
> > library(TENxGenomics)
> > path <- "1M_neurons_filtered_gene_bc_matrices_h5.h5"
> > tenx.se <- tenxSummarizedExperiment(path)
> > tenx <- assay(tenx.se)
> Error in dimnames(assay)[1:2] <- dimnames(x) :
> 'dimnames' applied to non-array
> Just posted this as a Github issue.
Hervé Pagès (09:23:49) (in thread): > Yeah, colRanks is a good example where block-processing is probably not the good approach and where one might want to just loop on the columns. Although that doesn’t necessarily mean that all the chunks traversed by a column need to be “loaded” (i.e. expanded and cached). I don’t know for HDF5Matrix objects, but at least for RleMatrix objects I have some hope that extraction of individual columns or rows will be reasonably fast without caching. Caching will certainly bring a nice boost to it though so people with enough memory should be able to turn caching on. It might be the case that extraction of individual columns or rows of an HDF5Matrix object without caching will be painfully slow though… An interesting and not trivial question is how caching and parallel evaluation will play together. Ideally we’d want to be able to have both so they add up. Anyway, one thing at a time. My roadmap for refactoring block-processing in DelayedArray already feels long and bumpy enough so I’ll leave caching aside for now. So it seems that we’re heading towards chunks that preserve the original ncol/nrow ratio and are small enough so that even people with little memory will be able to benefit from the caching in beachmat. But not too small either because (1) that would hurt compression and (2) there could be a significant overhead in storing hundreds of thousands of S4 objects in an environment (like RleArray does).
Hervé Pagès (09:43:11) (in thread): > Thinking more about colRanks(). It’s not a summarization method like colSums(), i.e. it returns a matrix of the same size as the original matrix. So typically doing colRanks() on a big DelayedMatrix object will produce a matrix that doesn’t fit in memory. The result will need to be written to disk as we go (i.e. inside the apply(m, 2, rank) loop). This is similar to matrix multiplication, where the result is also written to disk as we go (using the current realization backend).
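A minimal sketch of the “write the result to disk as we go” idea for a colRanks-style operation, using rhdf5 directly rather than whatever realization mechanism DelayedArray ends up providing; the file name, dataset name, toy input and block size are all made up:

    library(rhdf5)

    # Toy stand-in for a column-subsettable DelayedMatrix.
    x <- matrix(rpois(1e4, 5), nrow = 100)

    out <- "colranks.h5"
    h5createFile(out)
    h5createDataset(out, "ranks", dims = dim(x),
                    storage.mode = "double", chunk = c(nrow(x), 50))

    block <- 100  # columns processed per iteration (made-up number)
    for (s in seq(1, ncol(x), by = block)) {
        j <- s:min(s + block - 1, ncol(x))
        blk <- as.matrix(x[, j, drop = FALSE])
        r <- apply(blk, 2, rank)                          # per-column ranks for this block
        h5write(r, out, "ranks", index = list(NULL, j))   # written to disk as we go
    }
    H5close()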
Hervé Pagès (09:52:26) (in thread): > My feeling on this is that there is no need for a tenxSummarizedExperiment class. Just stick a TENxMatrix object in a SummarizedExperiment object and you should be good to go: tenx <- TENxMatrix(path); tenxse <- SummarizedExperiment(tenx).
Aaron Lun (09:54:10) (in thread): > It seems the tenxSummarizedExperiment class also loads in some cell metadata (from the HDF5 file, presumably), such as the sequencing library and the mouse of origin. Directly using a TENxMatrix doesn’t seem to make this metadata available. Not 100% sure, though.
Aaron Lun (09:58:56) (in thread): > Regarding caching and parallel evaluation: yeah, I just ran into that issue myself. I was hoping to split my cell cycle phase assignment task across the four cores on my machine via bplapply. But then I realized that the .Call call in each core would open its own HDF5 chunk cache, which is 1.6 GB in size. That would be 6.4 GB in total, which would be pushing the limits of my machine. It would probably run - it just wouldn’t be able to do anything else.
Hervé Pagès (10:14:36) (in thread): > Ah ok. I see now that there is no tenxSummarizedExperiment class, so I take that back (tenxSummarizedExperiment() actually returns a SummarizedExperiment instance). The TENxMatrix object only contains the counts, no metadata, so my suggested workaround won’t give you the metadata. Here is another suggestion: modify tenxSummarizedExperiment() to store the assay data in a TENxMatrix object instead of a TENxGenomics object (the TENxGenomics and TENxMatrix classes are actually redundant). You can try this with: tenx <- TENxMatrix(path); dimnames(tenx) <- NULL; tenx.se@assays[[1]] <- tenx.
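A sketch of this suggestion with the cell and gene metadata pulled in by hand; it assumes TENxMatrix() is available (in current releases it ships in HDF5Array) and uses the mm10 group layout shown in the h5ls() listing later in this channel. The extra per-cell annotation mentioned above (sequencing library, mouse of origin) is not reconstructed here.

    library(HDF5Array)              # TENxMatrix() lives here in current releases
    library(SummarizedExperiment)
    library(rhdf5)

    path <- "1M_neurons_filtered_gene_bc_matrices_h5.h5"

    counts <- TENxMatrix(path, group = "mm10")   # sparse, HDF5-backed DelayedMatrix

    # Cell and gene metadata straight from the 10x HDF5 file.
    barcodes   <- as.character(h5read(path, "mm10/barcodes"))
    gene_ids   <- as.character(h5read(path, "mm10/genes"))
    gene_names <- as.character(h5read(path, "mm10/gene_names"))

    tenx.se <- SummarizedExperiment(
        assays  = list(counts = counts),
        rowData = DataFrame(gene_id = gene_ids, symbol = gene_names),
        colData = DataFrame(barcode = barcodes)
    )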
2017-06-19
Aaron Lun (09:34:38) (in thread): > Bit under four hours to write to an HDF5Array for me; final size of 4.9 GB. Pretty good. For some reason, though, the rownames in the TENxMatrix didn’t show up when I called HDF5Array on the HDF5 file in which I saved the results.
Vince Carey (13:22:06): > @John Readey would you be willing to host our dense HDF5 representation of the 10X genomics data? we are spending $300/month for a suitable EC2 server and it seems too much for the proof of concept to keep this going indefinitely.
Vince Carey (13:23:17): > @Shweta Gopal can you have a look at these notes? it would be nice to implement and reproduce locally - Attachment: Attachment > @John Readey My timings were done on very small data sets (the suboptimal layouts just took too long on any reasonably sized data set), so I don’t know how well they’ll translate to server performance. But check out the timings/simulations/hdf5_matrix folder in the MatrixEval2017 repo if you want to have a look at what I did.
John Readey (14:12:12): > @Vince Carey what are you running on the EC2 server? I’m guessing you have data files stored on S3? How much is that costing?
Shweta Gopal (14:13:50): > @Vince Carey yes I would!
Shweta Gopal (14:26:15): > @John Readey We are running the HDF server on the EC2 instance, with the .h5 files in the /data folder.
John Readey (14:44:19): > Ok - got it. How much local disk do you need?
Sean Davis (14:46:52): > And what are the instance specs, just out of curiosity?
Shweta Gopal (15:33:53): > @Sean Davis We are currently using an m4.2xlarge.
Shweta Gopal (15:34:47): > 8 vCPU and 32 GiB!
John Readey (16:53:17): > I already have the 1M neuron dataset loaded on my development EC2 instance. I’d be happy to let you guys try out accessing the data through the service with the usual caveats:
John Readey (16:53:35): > * Service will go down without notice at random times
John Readey (16:53:47): > * code is experimental
John Readey (16:54:21): > * can’t guarantee I’ll be available to help with any unforeseen issues
John Readey (16:54:26): > * etc.
John Readey (16:55:26): > How about for the next Friday meeting I run through accessing the server using the Python sdk?
2017-06-21
Michael Stadler (02:04:51): > @Michael Stadler has joined the channel
Panagiotis Papasaikas (03:13:48): > @Panagiotis Papasaikas has joined the channel
Aaron Lun (08:51:41): > @Hervé Pagès Do HDF5Matrix objects support row/column names? I can’t assign row or column names without changing it to a DelayedMatrix object. > > a <- matrix(1:100, 10, 10) > library(HDF5Array) > x <- as(a, "HDF5Array") # HDF5Matrix - okay. > > y <- x > rownames(y) <- LETTERS[1:10] > y # Now a DelayedMatrix. > > z <- realize(y, "HDF5Array") # Still a DelayedMatrix! > > I can understand it becoming a DelayedMatrix after assignment, but I would have thought that realization would convert it back to an HDF5Matrix.
Peter Hickey (08:58:51): > @Aaron Lun Currently, dimnames are stored as names on the elements in @index. In order to have these, you have to have a non-NULL @index (check z@index), but as it stands an HDF5Array with a non-NULL @index has to be promoted(?) to a DelayedArray. I think there are plans to try storing the dimnames in the .h5 file, in which case it should be possible to have an “HDF5Matrix with dimnames”
Aaron Lun (09:03:49): > The fact that it gets promoted doesn’t bother me so much; it’s the fact that y is not considered pristine via DelayedArray:::is_pristine. In fact, even after realization, z is not pristine, which opens the possibility for infinite loops in beachmat (as it attempts to realize any non-pristine DelayedMatrix object it encounters). > > Even if z were pristine, the realization would do an unnecessary rewrite of the actual data in the HDF5 file, which is quite galling for the 1M neuron data set. Another 3 hours - just to add row names!
Peter Hickey (09:07:01): > yeah that’s not ideal. so the issue for you is that “pristine-ness” is too strict? if is_pristine() considered an @index of list(NULL, NULL) equivalent to list(1:nrow, 1:ncol), then the issue might be avoided?
Aaron Lun (09:11:32): > Yeah, I guess so. Intuitively, I would expect y in the above example to be pristine; I didn’t really do anything to the data itself, just added some row names (which I consider to be metadata).
Peter Hickey (09:15:01): > fwiw i also think of y as pristine. i feel like it should be possible to remove the dependency between dimnames and pristineness
2017-06-22
Aaron Lun (07:43:27): > Just called cell cycle phases on the 1M neuron data set. > > > table(blah$phases) > > G1 G2M S > 1190769 82244 13440 > > Took 1 day on 3 cores.
Sean Davis (12:22:15): > A little tangential, but @Shweta Gopal and I have been discussing server ops for h5serv. I suggested using Docker to help with maintenance and with portability. The HDF Group maintains a repo of Dockerfiles for their Docker containers here: https://github.com/HDFGroup/hdf-docker - Attachment (GitHub): HDFGroup/hdf-docker > hdf-docker - Dockerfiles for HDF related containers
Sean Davis (12:22:56): > And the instructions for quickly getting a Docker container for h5serv running are here: https://github.com/HDFGroup/h5serv#running-with-docker - Attachment (GitHub): HDFGroup/h5serv > h5serv - Reference service implementation of the HDF5 REST API
Sean Davis (12:24:38): > To run h5serv as a Docker container you just need to install Docker (no Python, h5py, etc. needed). > > * Install Docker: https://docs.docker.com/installation/#installation. > * Run the h5serv image: docker run -p 5000:5000 -d -v <mydata>:/data hdfgroup/h5serv, where <mydata> is the local directory containing your .h5 files.
Sean Davis (12:36:45): > Running on AWS, you might want to try an image with Docker pre-installed, like ami-30d49826
Shweta Gopal (12:47:41): > I did try using docker. I created an image using the Dockerfile in the h5serv github repo and it works fine.
Shweta Gopal (12:48:04): > docker run -p 5000:5000 -d -v
Vince Carey (13:09:59): > Hi – sorry for the delay in responding. If I understand correctly you have the 10x 1M neuron dataset in the form provided by 10x genomics – which is > h5ls(“1M_neurons_filtered_gene_bc_matrices_h5.h5”) > group name otype dclass dim > 0 / mm10 H5I_GROUP > 1 /mm10 barcodes H5I_DATASET STRING 1306127 > 2 /mm10 data H5I_DATASET INTEGER 2624828308 > 3 /mm10 gene_names H5I_DATASET STRING 27998 > 4 /mm10 genes H5I_DATASET STRING 27998 > 5 /mm10 indices H5I_DATASET INTEGER 2624828308 > 6 /mm10 indptr H5I_DATASET INTEGER 1306128 > 7 /mm10 shape H5I_DATASET INTEGER 2 > What we are talking about is a dense matrix representation allowing natural slicing that consumes 5.3 GB of disk, with gzip compression in its construction in h5py… Hosting anything on our instance is costing us about 350 USD per month, but we have not optimized in any way. We have found that smaller hosts than the one we are using do not have enough CPU/RAM to process substantial queries. m4.2xlarge is what we are using at the moment; it is public, and the restfulSE examples run against it, but we are thinking of taking it down as the concept is proven. We are thinking of making some kind of utility to allow spinning it up as needed. The whole strategy of making a Bioc-enabled cloud resource that employs HDF5 server to resolve targeted data queries needs careful thought. - Attachment: Attachment > I already have the 1M neuron dataset loaded on my development EC2 instance. I’d be happy to let you guys try out accessing the data through the service with the usual caveats:
Sean Davis (13:19:58): > @Vince Carey, that was where I was heading with the Docker discussion. @John Readey, are there docs for how to specify the S3 bucket location(s) to h5serv (sorry – being lazy)? If so, we can document and simplify the process of spinning up the server locally or on an AWS instance from R.
John Readey (13:24:04): > hey @Sean Davis there are two HDF Servers (both with the same REST API): h5serv and hsds.
John Readey (13:24:35): > h5serv is the older one, but doesn’t support S3; HDF5 files are stored on local disk.
John Readey (13:25:14): > hsds is the newer (in-progress) project; it is S3-native and supports higher request loads.
John Readey (13:26:12): > NASA is sponsoring the hsds project with the intent to support multi-TB collections of earth science data in the cloud.
John Readey (13:26:59): > HSDS will be open source at the conclusion of the project (summer ’18), but we’d prefer to keep it closed source for now.
John Readey (13:27:50): > But I can provide access to the HSDS instance we have on AWS.
Sean Davis (13:33:51): > Thanks, @John Readey. That clarifies (perhaps again) for me.
Martin Morgan (16:18:29): > set up a reminder “single cell meeting at 12 noon EST (connect: http://huntercollege.adobeconnect.com/singlecell/ ; doc: https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)” in this channel at 9AM tomorrow, Eastern Daylight Time.
2017-06-23
USLACKBOT (09:00:00): > Reminder: single cell meeting at 12 noon EST (connect:http://huntercollege.adobeconnect.com/singlecell/; doc:https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)
Davide Risso (09:51:12): > Hi all, unfortunately I won’t be able to make it for today’s meeting. I will check back here for updates!
Aaron Lun (10:40:12): > Just used the deconvolution method to normalize the 1M neuron data set: 5 hours.
Andrew McDavid (11:50:47): > I am also not going to be able to make the meeting today. sorry!
John Readey (11:59:27): > I won’t be able to attend either
Sean Davis (12:14:14): > I’m going to miss, also.
Marcel Ramos Pérez (12:15:14): > I’m having issues connecting with my browser
2017-06-24
Unknown User (08:00:54): > @Aaron Lun commented on @Andrew McDavid’s file https://community-bioc.slack.com/files/U34P8RS3B/F5SCN9TQE/rsvd_on_a_large_matrix.txt: @Andrew McDavid I’m starting to run PCA on the 10X data set now, so I’m hitting this issue. A couple of thoughts: (i) for crossprod_help, does the use of Matrix methods help when A is non-sparse after mean-centring? (ii) For dense matrices, C(++) code calling LAPACK’s DORMQR routine may (not 100% sure) provide some speed boosts for calls to crossprod_help with Q, as it avoids constructing Q explicitly. > > If you have a repository for this, I’m happy to chip in with some code. At the very least, we could get it operational with HDF5Array objects. Perhaps we should make something like DelayedArrayAlgorithms, containing an implementation of randomized SVD, nearest-neighbour finding, etc. Also possibly including the matrixStats-like stuff from @Peter Hickey. - File (Plain Text): rsvd on a large matrix
Aaron Lun (10:18:29): > Ugh. Just finished another round of wrestling with the HDF5 C++ API. This time the problem was that the FileAccPropList copy constructor, despite claiming to make a copy of an existing object, actually just makes a reference to it; so modifying the “copy” actually modifies the original object (and any other “copies” that were “constructed” from the original). I stumbled on this nice little surprise when I was setting chunk cache sizes, and found that the cache sizes for row and column accesses were identical across different FileAccPropList instances (which they shouldn’t be, for non-square matrices). @Mike Jiang have you had similar experiences?
Unknown User (11:15:00): > @Peter Hickey commented on @Andrew McDavid’s file https://community-bioc.slack.com/files/U34P8RS3B/F5SCN9TQE/rsvd_on_a_large_matrix.txt: Yeah, it’d be really sweet to be able to call things like prcomp(DelayedMatrix), svd(DelayedMatrix), tsne(DelayedMatrix), svdr(DelayedMatrix), even lmFit(x, DelayedMatrix), and have them do the work using a strategy that is good enough for most use cases. However, I might be wishing for too many ponies - File (Plain Text): rsvd on a large matrix
2017-06-25
Vince Carey (00:15:22): > Here is a scalable, audited, restartable approach to computing > column sums for the full 10x data. batchtools is used to manage > multicore processing. > > library(HDF5Array) > library(DelayedMatrixStats) > # our dense image of the full dataset > #> h5ls("tenx_full.h5") > # group name otype dclass dim > #0 / *_db_* H5I_GROUP > #1 /*_db_* {addr} H5I_GROUP > #2 /*_db_* {ctime} H5I_GROUP > #3 /*_db_* {datasets} H5I_GROUP > #4 /*_db_* {datatypes} H5I_GROUP > #5 /*_db_* {groups} H5I_GROUP > #6 /*_db_* {mtime} H5I_GROUP > #7 / newassay001 H5I_DATASET INTEGER 27998 x 1306127 > # > tf = HDF5Array("tenx_full.h5", "newassay001") > library(batchtools) # need to set cluster.functions = makeClusterFunctionsMulticore... > library(BBmisc) > chk10 = chunk(1:ncol(tf), chunk.size=5000) # we used 9 cores to run 262 jobs > rr = makeRegistry("fullColsums") > batchMap(reg=rr, x=chk10, fun=function(x)colSums2(tf, cols=x)) > submitJobs(reg=rr) > # > # getJobTable result: done in about 17 minutes > #> Syncing 262 files ... > # job.id submitted started done error > # 1: 1 2017-06-24 23:36:17 2017-06-24 23:36:17 2017-06-24 23:36:53 NA > # 2: 2 2017-06-24 23:36:17 2017-06-24 23:36:17 2017-06-24 23:36:54 NA > # 3: 3 2017-06-24 23:36:17 2017-06-24 23:36:17 2017-06-24 23:36:54 NA > # 4: 4 2017-06-24 23:36:17 2017-06-24 23:36:17 2017-06-24 23:36:51 NA > # 5: 5 2017-06-24 23:36:17 2017-06-24 23:36:17 2017-06-24 23:36:54 NA > # --- > #258: 258 2017-06-24 23:51:35 2017-06-24 23:51:40 2017-06-24 23:53:08 NA > #259: 259 2017-06-24 23:51:40 2017-06-24 23:51:44 2017-06-24 23:53:13 NA > #260: 260 2017-06-24 23:51:44 2017-06-24 23:51:47 2017-06-24 23:53:12 NA > #261: 261 2017-06-24 23:51:47 2017-06-24 23:51:51 2017-06-24 23:53:15 NA > #262: 262 2017-06-24 23:51:51 2017-06-24 23:51:51 2017-06-24 23:53:18 NA >
Vince Carey (13:01:48): > to verify this approach to processing the data, we have the following total count based on the original data from 10x: > > > library(rhdf5) > > > > barco = h5read("1M_neurons_filtered_gene_bc_matrices_h5.h5", "mm10/barcodes") > > dat = h5read("1M_neurons_filtered_gene_bc_matrices_h5.h5", "mm10/data", drop=TRUE) > > print(sum(as.numeric(dat))) > [1] 6388703090 > # and the following >
> > > > kk = loadRegistry("fullColsums") > Sourcing configuration file '/home/stvjc/batchtools.conf.R' ... > > kk > Job Registry > Name : Multicore > File dir: /home/stvjc/fullColsums > Work dir: /home/stvjc > Jobs : 262 > Seed : 617139788 > > cs = reduceResults(reg=kk, fun=c) > > sum(cs) > [1] 6388703090 > > rr = loadRegistry("fullRowsums") > Sourcing configuration file '/home/stvjc/batchtools.conf.R' ... > > tt = reduceResults(reg=rr, fun=cbind) > > sum(tt) > [1] 6388703090 > > sum(apply(tt,1,sum)==0) > [1] 872 >
> it seems strange to me that there are 872 genes that have zero reads according to this rowsum approach. does anyone else have the total sum and row and column sums for the data?
Martin Morgan (21:50:22): > > library(BiocParallel) > register(bpstart(MulticoreParam(progressbar=TRUE))) > > library(TENxGenomics) > tenx = TENxGenomics("1M_neurons_filtered_gene_bc_matrices_h5.h5") > > MAP <- function(x, nrow) > tabulate(x$ridx, nrow) > system.time({ > result <- tenxiterate(tenx, MAP, nrow=nrow(tenx)) > }) > all <- Reduce(`+`, result) >
> gives after about 3 minutes of 6 cores > > > table(all == 0) > > FALSE TRUE > 24015 3983 >
> So there appear to be 3983 genes with zero counts. (There is a more complete MAP function in the TENxGenomics vignette https://github.com/mtmorgan/TENxGenomics/blob/master/vignettes/TENxGenomics.Rmd#iterative that calculates row and column sums for quasi-independent verification)
Martin Morgan (22:24:32): > Also > > > library(rhdf5) > > ridx = h5read("1M_neurons_filtered_gene_bc_matrices_h5.h5", "/mm10/indices", drop=TRUE) > > tbx = tabulate(ridx) > > length(tbx) > [1] 27997 > > sum(tbx == 0) > [1] 3983 >
> (I only get h5read to work from my repository github.com/mtmorgan/rhdf5; seems like there is something buggy about this code still…)
2017-06-26
Vince Carey (01:11:27): > thanks. looks like we don’t have the dense version right yet, or i am not calculating sums well…
2017-06-28
Vince Carey (06:41:34): > We have the full dense version now, and margins are consistent with yours. Moving to cloud today.
Unknown User (09:41:56): > @Andrew McDavid commented on @Andrew McDavid’s file https://community-bioc.slack.com/files/U34P8RS3B/F5SCN9TQE/rsvd_on_a_large_matrix.txt: @Aaron Lun Excellent point about the centering destroying sparsity. 1. We can delay the centering and represent it as a rank-one update to our data Y, eg Y_center = Y - one %*% t(mu), where one = rep(1, nrow(Y)), and mu is the column means (assuming cells X genes here like a normal statistician). There appear to be efficient ways to perform low-rank updates to SVD, eg: https://mathoverflow.net/questions/143375/efficient-rank-two-updates-of-an-eigenvalue-decomposition-or-more-genearlly-svd. 2. This delay of low-rank, sparsity-destroying ops might be a good policy in general, since it might maximize HDF5 compression? - File (Plain Text): rsvd on a large matrix
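A toy check of the rank-one-update identity behind point 1 above: cross-products against the centred matrix can be formed without ever materialising it, since crossprod(Y - one %*% t(mu), Q) = crossprod(Y, Q) - outer(mu, colSums(Q)). Y, mu and Q follow the naming in the file comments; the dimensions are arbitrary.

    set.seed(1)
    n <- 500; p <- 200; k <- 5
    Y  <- matrix(rpois(n * p, 1), n, p)     # cells x genes, mostly zeros
    mu <- colMeans(Y)
    Q  <- matrix(rnorm(n * k), n, k)        # e.g. the Q factor inside rsvd

    # Naive: densify the centred matrix, destroying sparsity.
    dense <- crossprod(Y - rep(1, n) %*% t(mu), Q)

    # Rank-one update: never form the centred matrix.
    lazy <- crossprod(Y, Q) - outer(mu, colSums(Q))

    all.equal(dense, lazy)                  # TRUE up to floating point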
Hervé Pagès (15:25:06) (in thread): > Right now realization as HDF5 doesn’t write the dimnames to the file. Fixing this is on the TODO list (has been for a while though, never really got the time to work on it).
Hervé Pagès (16:04:15): > @Aaron Lun @Peter Hickey About pristineness (sorry for the late reply on this). The problem is that HDF5Array doesn’t know how to pick up the dimnames from the HDF5 file yet, so an HDF5Array object never has dimnames. As soon as you try to put some on it, it’s downgraded to a DelayedArray instance because it’s no longer in sync with its seed (i.e. with the HDF5ArraySeed object that it contains). Realization as HDF5Array propagates the dimnames anyway, but they go directly from the source to the destination without going through the file. As a consequence the resulting object is not an HDF5Array object, but a DelayedArray instance. This is not good, but it’s just temporary. Once realization as HDF5Array knows how to store the dimnames in the file, everything will behave as expected. And yes, I think that changing the dimnames should not preserve pristineness: the modified object is not in sync with its seed anymore. You can resync it with its seed by realizing it (you can think of this as replacing the seed with a new seed). One way the user can achieve this is with as(x, "RleArray") or as(x, "HDF5Array"). If changing the dimnames didn’t downgrade the object to a DelayedArray instance, then these coercions would be no-ops. That wouldn’t be good. Hope this makes sense.
2017-06-29
Aaron Lun (05:41:39): > @Andrew McDavid Sounds sensible. It’ll definitely reduce the HDF5 file size, but I’m not sure it speeds things up much. Should we set up a repository like BigDataAlgorithms to get some code committed? Better to use a different name to distinguish it from DelayedMatrixStats, which does a very specific thing.
Aaron Lun (05:44:34) (in thread): > @Hervé Pagès Okay. Looking forward to the fix; as long as calling realize(x) eventually gets me an HDF5Matrix or a pristine DelayedMatrix, I’m happy.
Martin Morgan (08:41:37): > @Aaron Lun I created a BigDataAlgorithms repository at https://github.com/Bioconductor/BigDataAlgorithms; you have admin rights. It’s currently completely empty. - Attachment (GitHub): Bioconductor/BigDataAlgorithms > BigDataAlgorithms - Algorithms for working with large, especially disk-based or sparse, data
Aaron Lun (08:42:10) (in thread): > @Martin Morgan Thanks, will get started on this later today.
2017-06-30
Aaron Lun (09:48:31) (in thread): > @Andrew McDavid BigDataAlgorithms::rsvd now supports DelayedMatrix instances. I had to define a crossprod method for DelayedMatrix instances, but it probably doesn’t belong in this package; and in any case, it could probably be made more efficient. Also cleaned up the R code to make it more idiomatic.
2017-07-01
Aaron Lun (09:28:32) (in thread): > @Peter Hickey The bigLmFit method has been added. Currently it doesn’t choose its chunks very sensibly, though.
2017-07-03
Aaron Lun (14:31:10) (in thread): > The rsvd implementation could probably do with some optimizing for DelayedMatrix objects (or maybe just the matrix multiplications). It was taking too long for the 10X data set, so I just gave up on it. > > In the meantime, I’m taking a random subset of 10000 cells, doing the PCA on that, and projecting all cells onto the first two PCs. A bit simple and stupid, but it seems to do the job.
Aaron Lun (14:36:05): > Is @Mike Smith back? Any news on the Rhdf5lib submission? Looking through the GitHub issues suggests it’s pretty tough going… a bit ominous for beachmat…
Mike Smith (16:56:08) (in thread): > Currently cycling from EMBL Heidelberg to EMBL Grenoble for charity, but I’ll be back in the office Thursday. Most of the warnings were the result of using single vs double quotes on the Windows single package builder, and are not inherent to the library itself. I was hoping the remaining warning could be ignored, as @Martin Morgan suggested, and the package would be accepted. I’ll nudge the GitHub issue soon.
2017-07-05
Aaron Lun (05:45:46) (in thread): > Great.
Aaron Lun (06:58:13) (in thread): > While we’re talking about BioC builds, I have a few questions for @Martin Morgan about how to set up a package that gets linked to by other packages. This came up as I was fiddling with beachmat, but I’ll use Rhtslib as an example: > > 1. I notice that Unix machines link to Rhtslib as a shared library, while Macs link to Rhtslib as a static library. Does the distributed BioC binary for Macs include both the shared and static libraries? If so, would there be some benefit from writing some code (e.g., pkgconfig("TARGETS") that returns only libhts.so or libhts.a as a build target) to avoid needing to distribute/store both libraries when only one gets used? > 2. It seems that Windows packages also link to the shared library, based on the code of Rhtslib::pkgconfig. How does this end up working when you distribute the Windows binaries? I would have thought that the installation location of Rhtslib could change between machines, such that a pre-built binary containing a fixed path would not be guaranteed to find the library. (And indeed, I thought that this was the whole point of using static libraries for Macs.)
Martin Morgan (17:11:51) (in thread): > Yes, basically, we’d rather link to a shared than a static library, from an aesthetic perspective. > > On Linux, specifying -Wl,-rpath is good enough to guarantee the correct .so is found, at least on common operating systems. On Mac I think we cannot / could not rely on a similar technique; the use of a static lib was in svn commit -c117850, and quite intentional. Yes, the binary distribution contains both dynamic and static versions of the library, I guess for completeness and because space is cheap? On Windows I think use of a DLL can be problematic in general because Windows finds the first named DLL on its search path, but in the htslib case there are (were?) no other Windows-based hts libraries so…
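For reference, the flags being discussed can be inspected from R: the pkgconfig() helper cats the compile/link line that a client package splices into its Makevars(.win). The comments below summarise the per-platform behaviour as described in this thread; the exact output varies by installation.

    library(Rhtslib)

    # Linker flags a client package would use (shared lib + rpath on Linux,
    # static libhts.a on Mac, per the discussion above).
    Rhtslib::pkgconfig("PKG_LIBS")

    # Include path for the bundled htslib headers.
    Rhtslib::pkgconfig("PKG_CPPFLAGS")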
2017-07-06
Vince Carey (07:46:44): > The full dense representation of the 1M neurons data is now in our HDF5 server instance; more details are available from @Shweta Gopal. Selected row and column sums have been verified using restfulSE, but documentation is currently limited. We have finally grokked binary transfer.
Vince Carey (07:49:35): > @Samuela Pollack is working on a new package, rhdf5client, that will extract the interface methods from restfulSE; a binary transfer interface is planned using Rhdf5lib.
Samuela Pollack (07:49:42): > @Samuela Pollack has joined the channel
Vince Carey (07:52:19): > @John Readey we are ready to try the object store for the full 1M neurons; let us know how to proceed.
Martin Morgan (09:38:19): > set up a reminder “single cell meeting at 12 noon EST (connect: http://huntercollege.adobeconnect.com/singlecell/ ; doc: https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)” in this channel at 9AM tomorrow, Eastern Daylight Time.
Aaron Lun (10:35:01) (in thread): > Okay. Well, as soon as Rhdf5lib gets accepted, I’ll submit beachmat to the BioC contributions. It’ll definitely build on Unix; builds on Mac, though I’m not sure what ends up in the binary; and may or may not work on Windows.
2017-07-07
USLACKBOT (09:00:00): > Reminder: single cell meeting at 12 noon EST (connect:http://huntercollege.adobeconnect.com/singlecell/; doc:https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)
2017-07-09
Aaron Lun (06:52:48) (in thread): > beachmat has been submitted to Bioconductor: https://github.com/Bioconductor/Contributions/issues/413 - Attachment (GitHub): beachmat · Issue #413 · Bioconductor/Contributions > Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor Repository: https://github.com/LTLA/beachmat Confirm the following by editing each che…
Aaron Lun (06:53:37) (in thread): > Let’s see if it blows up on the Windows machines.
Aaron Lun (08:46:46) (in thread): > Well, it blew up as expected. In particular, the linking of the object files into libbeachmat.dll is missing the flags: > > -LC:/local323/lib/x64 -LC:/local323/lib -LC:/Users/BIOCBU~1/BBS-3~1.6-B/R/bin/x64 > > It’s also missing the -s -static-libgcc flags, and there’s a tmp.def file that gets included into the shared library.
2017-07-19
Mike Smith (05:30:04) (in thread): > So I think my issues with the beachtest package are related to linking against libbeachmat.dll. I can link to it fine at compile time via Makevars.win, but since it’s a dynamic library we also need to know its location when loading beachmat (and subsequently beachtest.dll). I haven’t investigated why this seems OK on Linux, but on Windows the installed version of libbeachmat.dll isn’t in the search path and I get the error: > > LoadLibrary Failure: The specified module could not be found > > I can get it to load if I subsequently do something like Sys.setenv(PATH = system.file("lib/x64", package = "beachmat")), but I’m not sure 1) whether you can do that before a package DLL is loaded and 2) whether it’s a horrid hack anyway. > > I also tried an alternative strategy of building a static version of libbeachmat.a in beachmat, and then using that in beachtest. That seems to work, although you have to include the Rhdf5lib flags too. However, I’m not sure it’s very portable, as there doesn’t seem to be an equivalent to R CMD config CC for finding the archive tool, and if you don’t use a version matched to the C compiler this approach breaks too. > > I’m not sure what the best strategy is here. I’m hoping there’s something I don’t know about for identifying the locations of linked libraries, but maybe we also want to consider shipping beachmat with pre-compiled Windows libraries, in the same vein as the other system library packages?
Aaron Lun (05:36:28) (in thread): > Hm. Precompiled might be the only solution then. @Martin Morgan, any thoughts on this?
Aaron Lun (05:56:29) (in thread): > I presume you’re referring to a precompiled static library, as there’d be no point distributing a precompiled shared library if Windows can’t find it anyway.
Aaron Lun (06:15:31) (in thread): > Okay, I just checked out the Rhtslib repo and it has static Windows libraries as well. So I guess we should just do that, then.
Martin Morgan (08:13:42) (in thread): > yes static libraries on Windows
Aaron Lun (09:12:04) (in thread): > Okay. @Mike Smith, can you dump the instructions used to make the static libraries somewhere? For example, the beachmat repo has a Wiki page that can be edited.
Aaron Lun (09:47:54) (in thread): > More generally, I wonder if the BioC build system can be configured to build the Windows static library for us.
Aaron Lun (09:48:08) (in thread): > This would save Mike from having to manually rebuild them every time I make a change to the API.
Mike Smith (09:54:09) (in thread): > I’m pretty sure I’ve got a way to do this now.
Mike Smith (09:55:23) (in thread): > Just trying to get beachtest to pass R CMD check, then I’ll send a couple of pull requests your way
Aaron Lun (10:07:45) (in thread): > Sweet.
Aaron Lun (10:24:38) (in thread): > Just did a bit of a monster commit to standardize the header files, it shouldn’t break anything, but worth checking.
Mike Smith (10:50:05) (in thread): > Thanks, glad I checked that out before doing anything more. It would have presented some fun conflict resolution exercises otherwise!
Aaron Lun (11:03:30) (in thread): > I also have some small functions to add, will be finished by the end of the day.
Aaron Lun (11:04:06) (in thread): > Just to be clear; we’re going for static libraries, right?
Aaron Lun (11:04:29) (in thread): > Unless you managed to get it to work with shared libraries, in which I case I’ll bring you a trophy.
Mike Smith (11:07:56) (in thread): > A static library, but built during package installation, so it isn’t shipped with the source. I haven’t figured a sensible way to use the shared version.
Aaron Lun (11:11:54) (in thread): > Oooh, nice.
Aaron Lun (12:42:59) (in thread): > Okay, I’ve done all of the changes on my end. Feel free to make a PR when you’re ready.
Aaron Lun (13:12:28) (in thread): > I can’t bring a trophy on the plane, so here’s one instead. > > (*v*) > *|* > | | > |-----+-----| > | MIKE | > | SMITH | > '---------' > \ / > '. .' > | | > .' '. > *|__*|_ > [#######] >
2017-07-20
Aaron Lun (04:15:18): > Boom.http://bioconductor.org/packages/devel/bioc/html/beachmat.html - Attachment (Bioconductor): beachmat (development version) > Provides a consistent C++ class interface for a variety of commonly used matrix types, including sparse and HDF5-backed matrices.
Peter Hickey (07:52:33): > Great work
2017-07-21
Martin Morgan (02:40:18): > set up a reminder “single cell meeting at 12 noon EST (connect: http://huntercollege.adobeconnect.com/singlecell/ ; doc: https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)” in this channel at 9AM today, Eastern Daylight Time.
USLACKBOT (09:00:28): > Reminder: single cell meeting at 12 noon EST (connect:http://huntercollege.adobeconnect.com/singlecell/; doc:https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharing)
2017-07-23
Aaron Lun (05:41:23) (in thread): > @Mike Smith Are we missing some linker instructions for the Rhdf5lib libraries in beachmat::pkgconfig? scran just failed on the BioC build machines (http://bioconductor.org/checkResults/devel/bioc-LATEST/scran/tokay1-install.html), complaining about being unable to find the Rhdf5lib functions.
Mike Smith (05:49:09) (in thread): > The build reports are a bit hard to read on my phone, but it looks like that isn’t linking to -lhdf5 at all. In beachtest we also pass Rhdf5lib::pkgconfig to PKG_LIBS in Makevars.win.
Aaron Lun (05:56:22) (in thread): > Ah, okay. I didn’t notice that. I would move the Rhdf5lib::pkgconfig call into beachmat::pkgconfig, to avoid users having to put two things into their Makevars.win - do you think that would be okay?
Aaron Lun (05:56:54) (in thread): > This would mimic what I’m currently doing for the Mac static library.
Aaron Lun (09:04:44) (in thread): > I just made a commit doing what I said above (v0.99.11). It should work - but I guess I’ve been wrong before.
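A hypothetical sketch (not the actual beachmat code) of what folding the Rhdf5lib flags into a single pkgconfig() helper could look like; the library file names, install locations and the Rhdf5lib option string are all assumptions:

    # Hypothetical: emit beachmat's own linker flags, then append Rhdf5lib's,
    # so client packages need only one pkgconfig call in Makevars(.win).
    pkgconfig <- function(opt = c("PKG_LIBS", "PKG_CPPFLAGS")) {
        opt <- match.arg(opt)
        if (opt == "PKG_LIBS") {
            lib <- system.file("lib", package = "beachmat")   # assumed location
            cat(sprintf('"%s/libbeachmat.a" ', lib))          # assumed static lib name
            Rhdf5lib::pkgconfig("PKG_CXX_LIBS")               # exact option name depends on Rhdf5lib version
        } else {
            cat(sprintf('-I"%s"', system.file("include", package = "beachmat")))
        }
    }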
2017-07-24
Mike Smith (04:33:22) (in thread): > beachtest still seems to pass R CMD check with those changes. > > I get a few notes along the lines of: > > Registration problems: > symbol 'cxxfun' not in namespace: > .Call(cxxfun, test.mat) > > but they are only notes.
Aaron Lun (05:19:24) (in thread): > Thanks Mike. I’ve seen these registration problems before when I try to get fancy with how I call .Call; we can probably ignore them.
hcorrada (07:57:40): > @hcorrada has joined the channel
2017-07-25
Aaron Lun (16:43:33): > Hey, anyone else around the Marriott who’s up for dinner?
Aaron Lun (17:42:36): > Also,http://www.biorxiv.org/content/early/2017/07/24/167445 - Attachment (bioRxiv): beachmat: a Bioconductor C++ API for accessing single-cell genomics data from a variety of R matrix types > Recent advances in single-cell RNA sequencing have dramatically increased the number of cells that can be profiled in a single experiment. This provides unparalleled resolution to study cellular heterogeneity within biological processes such as differentiation. However, the explosion of data that are generated from such experiments poses a challenge to the existing computational infrastructure for statistical data analysis. In particular, large matrices holding expression values for each gene in each cell require sparse or file-backed representations for manipulation with the popular R programming language. Here, we describe a C++ interface named beachmat, which enables agnostic data access from various matrix representations. This allows package developers to write efficient C++ code that is interoperable with simple, sparse and HDF5-backed matrices, amongst others. We perform simulations to examine the performance of beachmat on each matrix representation, and we demonstrate how beachmat can be incorporated into the code of other packages to drive analyses of a very large single-cell data set.
Aaron Lun (18:37:51): > Ah, I’m too tired to eat, so I’ll just go to sleep; see you guys tomorrow.
2017-07-26
Ju Yeong Kim (14:37:36): > @Ju Yeong Kim has joined the channel
2017-07-28
Leonardo Collado Torres (09:25:51): > @Leonardo Collado Torres has joined the channel
2017-07-30
Kevin Rue-Albrecht (16:12:00): > @Kevin Rue-Albrecht has joined the channel
2017-07-31
Lorena Pantano (12:34:22): > @Lorena Pantano has joined the channel
Davide Risso (14:05:19): > Hi all, we will have another group call next Monday (8/7) at 12noon EST.
Davide Risso (14:05:50): > The tentative agenda is summary of Bioc2017 discussions (Pete) + summary of the HCA hackathon (Aaron)
Davide Risso (14:06:54): > And discussion of potential ideas and room for collaboration for the CZI Human Cell Atlas RFA
Davide Risso (14:09:21): > We will not be able to use Adobe Connect anymore, so I’m thinking of simply using Google Hangout this time, but since it’s limited to 10 participants, we’ll need to find an alternative if we’re more than that. I will set up an RSVP sheet soon.
Vince Carey (15:48:18): > I have had good success with bluejeans in other organizations. If someone wants to get a quote I think the foundation can reimburse or take over financing it as long as it is not too exorbitant.
2017-08-03
Davide Risso (14:59:36): > Hi everyone, > just a gentle reminder that we will have a group call next Monday (8/7) at 12noon EST. > The Bioconductor foundation offered to pay for the use of the Bluejeans conference software (up to 50 people in the call) so there’s room for everybody! > I will send more info and links to join the call on Monday.
Martin Morgan (21:44:42): > I’d like to draft a ‘white paper’ analysis of the 10x genomics data; maybe we could spend a few minutes on Monday talking about what that might look like?
Davide Risso (22:31:27): > That would be great!
2017-08-04
Davide Risso (12:16:02): > Here’s a tentative agenda for Monday’s meeting with info on how to join the call:https://docs.google.com/document/d/120MAhngbIe_EGi2ObnKyBMfuFpGBqK_3inOmRdHZnmU/edit?usp=sharingSee you all on Monday!
Raphael Gottardo (13:10:44): > Unfortunately I won’t be able to make the call on Monday as I am traveling. However, I am interested in the HCA RFA. The first bullet point is well in line with our HDF5 work. > > “Developing standard formats and analysis pipelines for genomic, proteomic, and imaging data, in forms that enable consistent use of these pipelines by numerous experimental labs”
Raphael Gottardo (13:11:35): > They encourage large collaborations, so it would be easy to include a large group of people.
Raphael Gottardo (13:27:41): > I think we could also include something on > > Developing user tools that allow scientists and physicians to extract and analyze data organized by genes, cells, or tissues of interest
Andrew McDavid (14:04:52): > I also will be unable to make the call Monday, but I agree with Martin that we should try to write up the work this group has done this spring
2017-08-07
Davide Risso (09:29:09): > Just a quick reminder that we are meeting today at 12noon EST
Aaron Lun (09:45:22): > that should be 5pm here, if I’m not mistaken.
Aaron Lun (09:48:44): > I’ll definitely be attending, so just ping me via Slack if I’m not there.
Peter Hickey (12:49:40): > A set of packages for computing on distributed matriceshttps://github.com/RBigDatahttps://rbigdata.github.io/ - Attachment (GitHub): pbdR > GitHub is where people build software. More than 23 million people use GitHub to discover, fork, and contribute to over 64 million projects. - Attachment (rbigdata.github.io): pbdR - Programming with Big Data in R > Rbigdata.github.io :
2017-08-08
Aaron Lun (04:34:58): > @Davide Risso It seems like there is moderate demand for a cell-cell distance matrix in the SCE class. This would not be too hard to add - thoughts?
Aaron Lun (04:53:39): > I also plan to add some common assay accessors - counts, normcounts, logcounts, cpm, tpm and fpkm - which should have fairly universal meanings. This will assist interoperability, as different packages are encouraged to conform to the same meaning of the assays. (Obviously each package can still do its own thing, as long as the relevant assay gets filled in at the end.)
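A sketch of the kind of thin wrapper being proposed, written here as standalone functions rather than the generics/methods the package would actually export; the function names are made up to avoid clashing with the real API:

    library(SingleCellExperiment)

    # Made-up names; the real accessors would be proper generics with methods.
    counts_of <- function(object) assay(object, "counts")
    logcounts_of <- function(object) assay(object, "logcounts")
    `logcounts_of<-` <- function(object, value) {
        assay(object, "logcounts") <- value
        object
    }

    sce <- SingleCellExperiment(assays = list(counts = matrix(rpois(20, 5), 4)))
    logcounts_of(sce) <- log2(counts_of(sce) + 1)
    logcounts_of(sce)[1:2, 1:3]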
Aaron Lun (04:54:38): > Finally, I will make it so that reducedDim without any second argument will simply return the first element, if it doesn’t do so already.
Aaron Lun (06:52:27): > Whoops, turned out you did the last thing already.
Aaron Lun (06:53:37): > Well, I’ve added some convenience accessors for named assays to the namedassay branch of SingleCellExperiment. Check it out - in particular, the documentation - and see if the suggested interpretations of the fields are too dictatorial. If it’s okay, I suggest mentioning the convenience wrappers in the vignette as well.
Aaron Lun (08:03:50): > Also, I got rid of the heretical two-space indenting that Rstudio automatically puts in.
Sean Davis (08:05:14): > @Aaron Lun WRT the cell-cell distance matrix, it seems like a subclass of SCE might be a useful approach? Also, instantiating a cell-cell distance matrix might be problematic for some datasets that would otherwise be in-memory datasets. Just thinking out loud.
Aaron Lun (08:06:23): > @Sean Davis Yes, that was our original plan as well, in which SC3 would implement its own subclass. But if enough people need it, we could provide a slot in the SCE class directly. It needn’t be filled, in which case it would just be left empty and would not use up any more memory.
Vladimir Kiselev (09:16:06): > @Aaron Lun thanks for suggesting this! Though in SC3 I would need 3 such slots containing data from different distance metrics. So I think I will still go for implementing my own subclass. But maybe it makes more sense for other packages.
Davide Risso (09:18:38): > I’m thinking whether this could be a useful feature for clusterExperiment too, and it seems that it might be. Let me think a bit more, but if already two packages would benefit from it, it makes sense to add a slot to the basic class…
Aaron Lun (09:30:45): > @Vladimir Kiselev We can implement it in the same way as reducedDims, such that multiple metrics can be stored.
Vladimir Kiselev (09:57:04): > great, I didn’t know it can be done
Vladimir Kiselev (09:59:16): > Actually, I am more interested in a slot specific to cell type data. In my day to day practice I more and more deal with matrices that have cell types in columns and genes in rows. Does such a thing already exist in SingleCellExperiment? Or is it possible to implement such a storage format? Any Ideas?
Vladimir Kiselev (10:01:43): > I would think that this can be the same as the expression matrix but containing cell types instead of cell ids
Davide Risso (10:02:24): > can it be stored in rowData()?
Davide Risso (10:02:48): > can you give a specific example of the type of info that this matrix contains?
Vladimir Kiselev (10:04:31): > theoretically it can be stored in rowData() but accessing it won’t be straightforward. An example is cell type signatures of any kind
Davide Risso (10:06:50): > @Aaron Lun I’m looking at the namedass (interesting name :)) branch and I think it makes sense to reserve slots for normalized counts, cpm, tpm, etc. But do we really need the log-counts? I would imagine that it’s pretty straightforward to compute them on the fly?
Aaron Lun (10:07:09): > Not for the 10X data set, it wasn’t.
Davide Risso (10:07:35): > Fair enough.. but would DelayedArray delayed operations help?
Davide Risso (10:08:05): > probably not because they still need to be computed at some point…
Davide Risso (10:08:24): > should we open a SingleCellExperiment channel instead of polluting this one?
Aaron Lun (10:08:48): > Yes.
Vladimir Kiselev (10:08:49): > makes sense
Davide Risso (10:09:50): > Done!
Sean Davis (12:53:15): > For those following along, just jump over to#singlecellexperiment.
2017-08-10
Aaron Lun (11:24:23): > @Mike Smith Just saw https://support.bioconductor.org/p/99048/. Is this relevant to us? Do people need to do Imports: beachmat in their packages, and does beachmat need to import Rhdf5lib?
Mike Smith (11:27:57) (in thread): > Quite possibly. I’m looking into it at the moment, and I’ll let you know. zlibbioc works a bit differently, since its DLL ends up in libs rather than lib (or wherever else we choose). It might solve our static vs shared queries, but I’m not sure at the moment.
2017-08-13
Aaron Lun (11:53:11): > @Martin Morgan The devel vignette for Rhtslib mentions an RHTSLIB_RPATH variable for defining the link path. However, I don’t see this variable anywhere in the package code - does R automatically figure it out? I’d like to describe an RBEACHMAT_RPATH in the beachmat vignette, but I’m not sure what I actually need to do.
Aaron Lun (11:57:21): > Same probably applies to @Mike Smith for Rhdf5lib with an RHDF5LIB_RPATH.
Vince Carey (12:36:00): > i guess you would only find it in a client package
Aaron Lun (12:42:24): > I was thinking that Rhtslib::pkgconfig should check whether the environment variable has been set, and if so, tell the linker to use that path when installing client packages. This shouldn’t be too hard (I think); a call to Sys.getenv should allow pkgconfig to see whether RHTSLIB_RPATH has been set or not, and to change the linker flags appropriately. It would probably be easier than requiring each client package to fiddle with the linker flags in their Makevars(.win).
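A hypothetical sketch of the Sys.getenv() check being described; the default install subdirectory and library file name are assumptions, not necessarily the real Rhtslib layout:

    # Hypothetical: let an environment variable override where the linker
    # flags point, falling back to the installed copy of the library.
    pkgconfig_libs <- function() {
        rpath <- Sys.getenv("RHTSLIB_RPATH", unset = NA)
        if (is.na(rpath))
            rpath <- system.file("usrlib", package = "Rhtslib")  # assumed default location
        cat(sprintf('"%s/libhts.a"', rpath))   # or the -L/-l/-Wl,-rpath form on Linux
    }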
Davide Risso (18:20:30): > For those of you who are not in the other channel but are interested in joining our meeting tomorrow about the CZI RFA, please see the #hca_rfa channel for details
2017-08-14
Martin Morgan (07:42:34): > @Aaron Lun Are you looking in svn? The public git repo lags (it’ll have to be updated in the next day or so!)
Aaron Lun (07:45:44): > Ah, okay.
Martin Morgan (08:54:56): > @Aaron Lun actually, I see now that it’s a change I never committed, and have done so now.
Aaron Lun (08:55:19): > okey dokey.
Aaron Lun (13:30:10): > @Raphael Gottardo Talk about optimal HDF5 chunking reminds me that we did a bit of it in beachmat’s Supplementary Materials, making use of the HDF5 chunk cache to speed up consecutive row/column access. It’s not applicable to random access and it still doesn’t beat pure row/column chunking for a read-only data set, but is competitive and useful if you only want to write a single data set (e.g., as intermediate HDF5Matrix objects for normalized values, etc.).
Aaron Lun (13:32:04): > Will ask you more about how we coordinate these efforts once I get the draft underway
Raphael Gottardo (14:49:20) (in thread): > Sounds good. I’ll share a google doc in a couple of days.
2017-08-15
Aaron Lun (13:29:32): > @Mike Smith Just saw the RHTSLIB_RPATH code in Rhtslib, and I’m currently updating beachmat; probably a good idea to update Rhdf5lib with an RHDF5LIB_RPATH in the pkgconfig as well.
Aaron Lun (13:39:42): > Obviously after the git transition, though.
Sean Davis (20:25:38): > Bioconductor is on vacation for a couple of days, it seems….
2017-08-16
Aaron Lun (05:29:31): > @Martin Morgan Got an interesting suggestion from someone on the SingleCellExperiment GitHub repository, asking for partitioning of colData for large metadata. The idea would be to provide Assays-like behaviour for colData, but involving DataFrames and only requiring consistency in the number of rows. This would allow users, if they so desired, to break up large metadata for easier examination. For example, you could have one colData DataFrame for QC metrics, another for phenotype values, another for genotype values, and so on, with each one accessible by specifying the name or number to colData(). The default colData() would return the first DataFrame, so it would be compatible with existing usage.
Martin Morgan (05:54:12): > @Aaron Lun This is like Biobase::NChannelSet, implemented with a special channel column in the phenoData. That implementation didn’t really take off, and it would be useful to have a discussion about why. A different implementation might make use of metadata on the columns: > > df = DataFrame(a=1:3, b=1:3, c=1:3) > mcols(df) = DataFrame(group = factor(c(1, 1, 2))) >
Aaron Lun (06:06:27): > The second approach is probably more relevant, as the colData doesn’t need to be specific to a particular entry of assays. However, we could have problems if we want to use the group metadata to mimic an actual list of DataFrames. For example, you wouldn’t be able to assign two df columns with the same column name but different group levels.
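A small sketch of the mcols() idea above and how it could mimic a list of colData DataFrames on top of one flat DataFrame; the column and group names are invented:

    library(S4Vectors)

    df <- DataFrame(nGene = 1:3, libSize = 4:6, genotype = c("wt", "wt", "ko"))
    mcols(df) <- DataFrame(group = factor(c("QC", "QC", "pheno")))

    # Pull out one "virtual" colData DataFrame by group.
    colDataGroup <- function(df, which) df[, mcols(df)$group == which, drop = FALSE]
    colDataGroup(df, "QC")

As noted above, duplicate column names across groups would still be awkward in this flat layout.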
Kasper D. Hansen (09:18:24): > @Martin Morgan one thing to think about for a large number of cells is that the colData slot will be very big, because for each covariate you’ll have one value per cell. So if we think millions of cells… Also, while I am brainstorming, we are going to have sample-specific (not cell-specific) covariates, which will be of much lower dimension.
Davide Risso (09:43:13): > I agree with @Kasper D. Hansen. I think that some sort of hierarchical structure with sample colData and cell colData would be very beneficial
Davide Risso (09:44:16): > perhaps just re-using MultiAssayExperiment when you have cells from multiple samples could be enough?
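One way to keep the per-cell cost down without a new container, sketched below: keep a compact one-row-per-sample table and expand to cells on demand, and store the highly repetitive per-cell columns run-length encoded (this assumes cells are stored grouped by sample; all names are invented):

    library(S4Vectors)

    n_cells <- 1e6
    samples <- sprintf("mouse%02d", 1:8)

    # Compact sample-level covariates: one row per sample, not per cell.
    sample_info <- DataFrame(sample = samples, sex = rep(c("F", "M"), 4))

    # Cells grouped by sample, as they usually are after concatenating runs.
    sample_of_cell <- rep(samples, each = n_cells / 8)

    # Expand a sample-level covariate to cell level only when needed...
    cell_sex <- sample_info$sex[match(sample_of_cell, sample_info$sample)]

    # ...or keep the per-cell columns as Rle, which collapses to a few runs.
    cell_cov <- DataFrame(sample = Rle(sample_of_cell), sex = Rle(cell_sex))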
Aaron Lun (09:48:39): > Yes, I talked to Levi about that, it should be supported.
Martin Morgan (11:24:44): > I think the lesson learned from the tidyverse, though, is that this kind of complicated structure confuses users. And if we’re dealing with large data, the margins aren’t that large, or the software needs to support a million rows without blinking.
Martin Morgan (11:26:10): > My comment was about organizing colData hierarchically.
Raphael Gottardo (13:40:39): > I like the idea of organizing colData hierarchically, it makes a lot of sense scientifically.
Tim Triche (13:41:53): > if it works for multi-sample multi-channel flow it ought to work for scRNAseq etc
Davide Risso (13:55:42): > I agree. I think that the reason the simple long format works in the tidyverse is that they assume that you have one table per observational unit type and that you will join tables, etc.
Davide Risso (13:56:04): > which is pretty much the opposite of our approach where one container should have all the data
Tim Triche (13:57:17): > the tidyverse is (to some degree) RDBMS for people some of whom don’t understand RDBMS. ExpressionSet and its descendants have always been more of a “make it impossible to have off-by-one errors” affair and they do that very well.
Davide Risso (13:57:57): > exactly
Tim Triche (13:59:10): > Another interesting feature of hierarchically structured per-sample/per-specimen covariates is that this feeds naturally and directly into hierarchical mixed models, which I have to imagine will become more popular as editors become more statistically literate.
Martin Morgan (18:02:42): > isn’t it the case though that if you were to write a linear model that included nesting, you would use a long-form data frame rather than two tables with a relational link? And that individual is just one (albeit common) grouping of single-cell data? Would you have three tables for samples, tissues, and cells?
Aedin Culhane (18:33:07): > I like this idea. I was thinking about an approach whereby, given a large number of cols, we compute a simple score between each pair of cols, allowing one to either rank cols, for example by the median of a gene (to extract the extremes with high/low expression), or by distance or similarity. The latter might be better for imputation of data, or for nested clustering. For example, many ‘old-school’ bioinformatic algorithms (e.g. clustal) created a crude index tree as a first step. I think MAE would be useful for this
Aedin Culhane (18:34:21): > If we use an out-of-memory solution (eg restfulSE), you would just need a col index for subsetting, you wouldn’t need to replicate data.
2017-08-17
Aaron Lun (09:53:08): > @Mike Smith Note that I’ve archived the old beachmat and cloned the BioC version of the repository; this is what is now on my GitHub account at https://github.com/LTLA/beachmat. - Attachment (GitHub): LTLA/beachmat > Clone of the Bioconductor repository for the beachmat package, see http://bioconductor.org/packages/devel/bioc/html/beachmat.html for the official development version.
Tim Triche (10:52:13): > re: @Martin Morgan @Davide Risso @Aedin Culhane: there’s no particular reason why a slot holding a long skinny or fat wide data.frame-like object couldn’t have accessors to make it look like a hierarchical structure (like when dealing with trees in SQL, for example). But having accessors for this type of data structure makes it easier to plot, table-ize, feed to pooled/shrunken mixed models, etc.
Tim Triche (10:55:58): > the “hide it behind an accessor” would potentially require some gymnastics, so it might be just as well to specify a normalized schema for the [super]class[es] to instantiate, but my suspicion is that users would find it more natural to think of covariates as specific to a subject, a sample, or a specimen. I could be wrong
Aaron Lun (14:18:17): > Has anyone used irlba with a non-NULL value for center?
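For reference, a minimal example of irlba’s center argument (implicit mean-centring, so a sparse or file-backed input is never densified); the toy dimensions are arbitrary:

    library(irlba)

    set.seed(42)
    A <- matrix(rpois(2000, 1), 100, 20)

    # Implicit column centring: approximates the SVD of sweep(A, 2, colMeans(A))
    # without ever forming the dense centred matrix.
    fit <- irlba(A, nv = 5, center = colMeans(A))
    fit$d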
2017-09-15
Martin Morgan (10:10:30): > A little tangent, but this https://github.com/bedatadriven/renjin-hdf5 is about random access to HDF5 files, rather than chunked access. The solution partly relies on an adaptive cache, and de-emphasizes compression, etc. - Attachment (GitHub): bedatadriven/renjin-hdf5 > renjin-hdf5 - Renjin Package for reading HDF5 files
Alexander Bertram (10:19:58): > @Alexander Bertram has joined the channel
Alexander Bertram (10:45:28): > Thanks Martin. Still working on parallelizing the subsetting, but already the addition of an LRU cache for compressed chunks and memory mapping for uncompressed chunks has been promising
2017-09-28
Vince Carey (12:22:12): > Should we have a monthly update in this group? I would like to know whether there are other representation ideas to be explored. Also, we have had an introduction to the HDF object store (only available on XSEDE Jetstream at this time) that is very promising. I think John Readey would welcome some additional testers. Basically you don’t really have HDF5 files in the back end, but a distributed object that looks like HDF5 – the Ceph object store is the underlying infrastructure.
2017-10-02
Raphael Gottardo (10:59:08): > @Vince Carey Yes, that would be good. I think we’d like to hear more about the HDF object store, and would certainly be interested in testing.
2017-10-06
Vince Carey (14:46:02): > @Raphael Gottardo i missed this – do you have an account at portal.xsede.org? if not, set one up and we can add you to the current Jetstream project that provides access to the object store. we should discuss further offline.
2017-10-16
Raphael Gottardo (11:36:41): > @Vince Carey Thanks, we’ll set up an account. @Mike Jiang
2017-10-17
John Readey (16:33:35): > Yes, I’d welcome additional testers.
John Readey (16:34:04): > BTW - has anyone tried out this new hdf5r package:https://github.com/hhoeflin/hdf5r? - Attachment (GitHub): hhoeflin/hdf5r > Contribute to hdf5r development by creating an account on GitHub.
Vince Carey (23:08:17): > I was able to build and test on Linux. It would be good to know what aspects of rhdf5 are improved upon by hdf5r … the vignette is vague about this. It may be noteworthy that Bioconductor has a package, Rhdf5lib, that fosters direct use of the C and C++ APIs in other packages. https://bioconductor.org/packages/devel/bioc/vignettes/Rhdf5lib/inst/doc/Rhdf5lib.html
2017-10-18
Vladimir Kiselev (05:23:16): > I second Vince. I think Rhdf5lib is now used in beachmat and SingleCellExperiment, so it would be nice to make sure that we are up to date on why hdf5r is better.
2017-10-27
Guangchuang Yu (05:40:23): > @Guangchuang Yu has joined the channel
2017-10-30
Aaron Lun (09:19:10) (in thread): > @Hervé Pagès Ran into this issue again when testing https://github.com/LTLA/TENxBrainData. It seems that HDF5Matrix objects get converted to DelayedMatrix objects upon calling assay, presumably because row names get slapped on them when the function returns. This becomes a little inconvenient because beachmat automatically tries to realize any non-pristine DelayedMatrix object; but no matter how much realization it does, it can never get a pristine object when dimnames are around, resulting in an infinite loop! - Attachment (GitHub): LTLA/TENxBrainData > TENxBrainData - An ExperimentHub package for the 1.3 million brain cell 10X single-cell RNA-seq data set.
Aaron Lun (09:19:53) (in thread): > Perhaps the simplest solution would be to add a NAMES slot to the HDF5ArraySeed class? This would behave in a manner consistent with the dimnames of other backends, without necessitating coercion to Delayed-ness upon a change to the row/column names (which would in turn ensure beachmat does not waste time realizing it when the data haven’t changed).
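A small illustration of the behaviour described in this thread, plus the withDimnames=FALSE escape hatch on assay() that returns the matrix without the dimnames being slapped on; the class comments reflect the behaviour discussed above:

    library(SummarizedExperiment)
    library(HDF5Array)

    a  <- as(matrix(rpois(100, 5), 10, 10), "HDF5Array")
    se <- SummarizedExperiment(list(counts = a),
                               rowData = DataFrame(row.names = paste0("G", 1:10)))

    class(assay(se, "counts"))                        # DelayedMatrix: dimnames get added
    class(assay(se, "counts", withDimnames = FALSE))  # still an HDF5Matrix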
2017-11-09
Raphael Gottardo (16:40:19): > I am sure many of you have seen this already: https://github.com/mojaveazure/loomR - Attachment (GitHub): mojaveazure/loomR > loomR - An R-based interface for loom files
Raphael Gottardo (16:40:23): > Thoughts?
2017-11-10
Martin Morgan (10:40:08): > it’s very immature and doesn’t fit with bioc; I don’t know why there’s yet another rhdf5 parser, maybe because it’s not in Bioconductor so has less baggage (but then communicating with the maintainer might have moved rhdf5 to CRAN…). With the HDF5Array and DelayedMatrixStats packages you can already do many high-level computations on this very large data; presumably some infrastructure will be added to loomR at some point (there’s a non-exported map() function, for instance). > > I don’t think loom is particularly well thought-out, e.g., naming the ‘row_attrs’ when an ‘attribute’ has a specific meaning in hdf5, and the notion of storing row and column metadata in the file when it can be quite complicated (suited to a relational database, for instance) and frequently updated (so, when one is interested in versions of the metadata, e.g., the canonical copy for ‘the lab’ and then the annotations for your particular workflow), small changes require large data copying. Also I don’t know whether there was any consideration given to the various concerns you’ve mentioned, e.g., about one monolithic file versus several. > > I think the DelayedArray infrastructure needs some significant improvement in terms of performance, and if loomR has performance (I don’t know whether it does or not…) then it’ll win. > > I like the idea of a ‘matrix’ interface on top of loom, dim(), dimnames() and subsetting; that would make loomR fit with Bioc, but I’m not sure whether it’s in the cards.
Peter Hickey (10:45:10): > > I think the DelayedArray infrastructure needs some significant improvement in terms of performance > agreed and i hope to return to this (in DelayedMatrixStats) soon. some prelim stuff shows much better performance via integration with beachmat
Stian Lågstad (16:53:15): > @Stian Lågstad has joined the channel
2017-11-14
Raphael Gottardo (15:38:00) (in thread): > Thanks Martin. That was my impression as well.
2017-11-22
Mike Smith (03:15:30): > For those who don’t follow Wolfgang on Twitter, here’s a summary of some benchmarking I’ve been doing running colSums() under various scenarios on the TENxBrainData package: http://www.msmith.de/2017/11/17/10x-1/
Martin Morgan (12:00:08): > Maybe worth pointing out that, since the I/O cost is so high, one would typically write algorithms that try to make a single pass through data; colSums(tenx); rowSums(tenx)
would take about twice as long as something like > > MAP1 <- function(i, tenx) { > ## input > assay <- assay(tenx, withDimnames = FALSE) > tenx <- as.matrix(assay[, i, drop=FALSE]) > ## process > list( > colSums = colSums(tenx), > rowSums = rowSums(tenx) > ) > } > > REDUCE1 <- function(x, y) { > list( > colSums = c(x$colSums, y$colSums), > rowSums = x$rowSums + y$rowSums > ) > } > > init <- list(rowSums = numeric(nrow(tenx.sub)), colSums = numeric(0)) > > chunksize <- 10000 > cidx <- snow::splitIndices(ncol(tenx.sub), ncol(tenx.sub) / chunksize) > > system.time({ > result0 <- lapply(cidx, MAP1, tenx.sub) > result1 <- Reduce(REDUCE1, result0, init = init) > }) >
> Also, my understanding is that hdf5 supports only a single reader per process, so for instance MulticoreParam() would require use of BiocParallel::ipclock()/unlock() to enforce single-threaded reading, and wouldn’t provide any speedup. Separate processes, e.g., SnowParam(), would do the trick, though one would want to reduce data transfer (e.g., by using assay() rather than the SummarizedExperiment with relatively large colData). > > library(BiocParallel) > MAP2 <- function(i, assay) { > suppressPackageStartupMessages({ > library(TENxBrainData) > }) > ## input > tenx <- as.matrix(assay[, i, drop=FALSE]) > ## process > list( > colSums = colSums(tenx), > rowSums = rowSums(tenx) > ) > } > > system.time({ > assay <- assay(tenx.sub, withDimnames = FALSE) > result0 <- bplapply(cidx, MAP2, assay, BPPARAM=SnowParam(4)) > result1 <- Reduce(REDUCE1, result0, init = init) > }) >
2017-11-27
Aaron Lun (12:46:14): > @Mike Smith Nice work. I had been wondering whether rhdf5 should allow easier interrogation of the chunk dimensions so that DelayedArray’s processing mechanisms can adapt the block size appropriately. Also, interesting results for the uncompressed file, though my computer doesn’t even have 145 GB of free disk space!
2017-11-29
Matthew McCall (09:31:15): > @Matthew McCall has joined the channel
2017-12-01
Sean Davis (06:53:06): > Just FYI….https://www.biorxiv.org/content/early/2017/11/30/227041 - Attachment (bioRxiv): The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and annotations > Motivation: Computational analysis of datasets generated by treating cells with pharmacological and genetic perturbagens has proven useful for the discovery of functional relationships. Facilitated by technological improvements, perturbational datasets have grown in recent years to include millions of experiments. While initial studies, such as our work on Connectivity Map, used gene expression readouts, recent studies from the NIH LINCS consortium have expanded to a more diverse set of molecular readouts, including proteomic and cell morphological signatures. Sharing these diverse data creates many opportunities for research and discovery, but the unprecedented size of data generated and the complex metadata associated with experiments have also created fundamental technical challenges regarding data storage and cross-assay integration. Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization, and analysis of dense two-dimensional matrices. The utility of this format is not just theoretical; we have extensively used the format in the Connectivity Map to assemble and share massive data sets comprising 1.7 million experiments. We anticipate that the generalizability of the GCTx format, paired with code libraries that we provide, will stimulate wider adoption and lower barriers for integrated cross-assay analysis and algorithm development. Availability: Software packages (available in Matlab, Python, and R) are freely available at https://github.com/cmap
2017-12-08
Aaron Lun (08:07:07): > @Martin Morgan The data set license seems to be CC BY 4.0, see https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.3.0/1M_neurons, so we should be okay. I guess we should use the same licence in the package.
2017-12-11
Ricard Argelaguet (16:47:34): > @Ricard Argelaguet has joined the channel
2017-12-14
Aaron Lun (06:06:18): > @Peter Hickey There are a couple of matrixStats-like functions in https://github.com/davismcc/scater/pull/42 that are looking for a better home. Implemented in C++ via beachmat so they should be pretty fast, though some further optimization is possible in cases where cols or rows are supplied. Also need NA protection. - Attachment (GitHub): Further fixes to plotting by LTLA · Pull Request #42 · davismcc/scater > Clone of the Bioconductor repository for the scater package, see https://bioconductor.org/packages/devel/bioc/html/scater.html for the official development version.
Peter Hickey (13:27:23): > thanks, @Aaron Lun i’m gonna spend some time on this after xmas.
2017-12-15
Ricard Argelaguet (12:46:49): > Hi! I am trying to build a bioconductor package for single-cell methylation data. Just to give you a brief overview, the data has a relatively small number of cells (>1000) with sparse coverage (~5 million randomly sampled CpG sites per cell). Clearly, i need to work using an on-disk representation, which is fine once the user has created the matrix and stored it. However, I am struggling to build the matrix, as it can easily reach >50GB (more than 50% missing values) and the computer runs out of memory. What suggestions do you have?
Sean Davis (13:53:21): > Do you have pseudocode or code for how you are trying to build your matrix?
Peter Hickey (14:40:29): > one option is to read batches of samples (with batch size as small as 1), construct the matrix, and then write that batch to disk using HDF5Array. Then the batches’ HDF5Array representations can be cbind()-ed together into a DelayedArray, which has very low overhead
Peter Hickey (14:41:53): > it may even be possible to read the data in one of the ‘sparse’ formats supported by the Matrix package, wrap the result in a DelayedArray, then realize() it on disk with the HDF5Array backend.
Peter Hickey (14:42:10): > unsure if that will have any real benefit in terms of peak memory usage
Peter Hickey (14:42:34): > as Sean says, if you can share some code we may be able to help more
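A minimal sketch of the batch-then-cbind() idea Pete describes (read_cell() is a hypothetical loader returning one cell’s column of calls; file names are placeholders):
> library(HDF5Array)
> library(DelayedArray)
> cell_files <- c("cell1.tsv.gz", "cell2.tsv.gz")   # placeholder inputs, one per cell
> per_cell <- lapply(cell_files, function(f) {
>   m <- read_cell(f)                # hypothetical: one-column matrix of +1/-1/NA
>   writeHDF5Array(m)                # write this batch straight to disk
> })
> combined <- do.call(cbind, per_cell)   # delayed cbind of the on-disk pieces, very cheap
> final <- writeHDF5Array(combined)      # optionally realize into a single HDF5 dataset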
Ricard Argelaguet (16:23:04): > Thanks for your feedback. My current attempt is to load each cell in a data.table format with columns (sample, cpg, meth), where meth is +1 if the CpG site is methylated and is -1 if the CpG site is unmethylated: > > dt <- lapply(opts$cells, function(x) { > data <- fread(sprintf("zcat < %s/%s.tsv.gz",io$in.data,x)) %>% .[,sample:=x] > }) %>% rbindlist >
> then i convert the sample IDs and the CpG IDs to integer values to pass to SparseMatrix: > > dt[,cpg:=as.numeric(as.factor(cpg))] > dt[,sample:=as.numeric(as.factor(sample))] >
> finally i create the sparse matrix: > > sparse_dmatrix <- sparseMatrix(i = dt$sample, j = dt$cpg, x = dt$rate) >
Ricard Argelaguet (16:24:36): > this actually works better as the sparseMatrix will ignore the missing values, but I still have the memory burden of loading all data.frames (each cell is easily 100mb)
Aaron Lun (16:24:36): > +1 on Pete’s suggestion.
Aaron Lun (16:25:03): > cbind-ing the cell-specific HDF5Arrays should be simplest.
Aaron Lun (16:25:26): > Followed by realization into a complete HDF5Array.
Ricard Argelaguet (16:26:28): > so I save each cell in the same hdf5 file as an independent data set, and then when I query I cbind every time into a DelayedArray?
Aaron Lun (16:27:11): > Not quite.
Ricard Argelaguet (16:27:14): > i just said bullshit
Ricard Argelaguet (16:27:16): > i got it now
Ricard Argelaguet (16:27:23): > thanks both!
Peter Hickey (18:17:42): > one minor thing: Annoyingly, fread("zcat < file.gz") isn’t portable (doesn’t work on a standard windows machine) and can be a little fragile; see https://github.com/Rdatatable/data.table/issues/717 for some discussion > fwiw readr::read_tsv() can read natively from gzip and plain text. it’s not as fast (especially with the recent multithreading support added to fread()) but it may be simpler - Attachment (GitHub): Support .gz file format for fread · Issue #717 · Rdatatable/data.table > I have several thousands of .gz files containing data in csv format - about 60GB in total in terms of .gz files. Decompressing them and load some pieces via fread turns out a huge pain in the first…
Ricard Argelaguet (18:19:31): > i will keep this in mind when writing the package, thanks again!
2017-12-24
Aaron Lun (12:20:33): > @Hervé Pagès I noticed that the DelayedArray internals have been modified to use SeedDimPicker upon transposition. In the context of matrices, would the sole purpose of SeedDimPicker be to support transposition? If so, I will modify beachmat accordingly, given that the is_transposed field I was previously using to accommodate transposed DelayedMatrix objects no longer exists. (I assume that the switch was motivated by the desire to have more general permutations of dimensions for higher-dimensional arrays, but that’s not so relevant to me.)
2018-01-06
Aaron Lun (14:23:24) (in thread): > @Mike Smith Could you also give scater:::.colSums(counts(tenx)) a try? This uses beachmat under the hood; I’d like to see how it compares to DelayedArray::colSums.
2018-01-09
Mike Smith (06:59:44) (in thread): > Let me update things to reflect that the changes I identified to DelayedArray and rhdf5 are now incorporated, and then I’ll add that to the comparison. I’ll give you a shout when it’s done.
Aaron Lun (12:26:47) (in thread): > Thanks. I’m not sure whether devel scater is up to date, you may have to install from https://github.com/davismcc/scater.
2018-01-16
Aaron Lun (13:50:38): > Looks like TENxBrainData is out on BioC-devel: http://bioconductor.org/packages/devel/data/experiment/html/TENxBrainData.html. - Attachment (Bioconductor): TENxBrainData (development version) > Single-cell RNA-seq data for 1.3 million brain cells from E18 mice, generated by 10X Genomics.
2018-01-29
Vince Carey (21:09:07): > the devel version of restfulSE (1.1.4) now includes classes H5S_Array and BQ3_Array that implement the DelayedArray protocol for content provided by HDF Server and BigQuery respectively. The BQ3 refers to the ISB-CGC TCGA model which is similar to a triple store. BQM_Array is in development for BigQuery content in “matrix” form. example(H5S_Array) should work out of the box; use of BQ3_Array requires a billing relationship with ISB-CGC. Thanks to @Hervé Pagès for the clear documentation on how to do this.
Vince Carey (21:12:16): > > colSums(H5S_Array("http://h5s.channingremotedata.org:5000", "tenx_full")[,c(1:5, 1306123:1306127)]) > analyzing groups for their links… > done > [1] 4046 2087 4654 3193 8444 4885 2554 3080 3849 5833
Peter Hickey (21:39:20): > very nice, Vince!
2018-02-07
Aaron Lun (12:58:35): > FYI, beachmat reviews came back. A bit of work to do, but more-or-less reasonable - nothing I hadn’t wondered myself.
Aaron Lun (13:03:38): > I mean, wondered after submission. Otherwise I would have put it in.
2018-02-14
Aaron Lun (16:10:44): > Has anyone ever thought about integrating Python into BioC’s build systems? From what I understand, the main problem has been that we can’t control the system version of Python, nor the versions of various Python packages. I was wondering whether we could follow the example of Rhdf5lib and friends, and provide a package that installs a local version of Python solely for use by R. This would give us a controlled framework in which developers could put python scripts in their packages and call them from R with .pyCall or something, analogous to C/C++ and .Call.
Peter Hickey (16:13:21): > unsure what you’re looking for, Aaron, but perhaps reticulate can help? https://github.com/rstudio/reticulate - Attachment (GitHub): rstudio/reticulate > reticulate - R Interface to Python
Aaron Lun (16:21:30): > Yes; the idea would be to control the version of python being used, along with all its packages. So one could imagine doing reticulate::use_python() on a python installation where the versions of everything are fully under the control of biocLite. Obviously, you wouldn’t want to enslave the system python because people want to do their own things on that. Hence the need for an alternative installation.
Aaron Lun (16:24:50): > I mean, one could just use whatever python happened to be on the PATH, and call Python scripts from R using that. But you can imagine that sometimes it would work and sometimes it won’t, depending on the latest setting of PYTHONPATH, whether some packages got upgraded/downgraded by the user, whether they changed the python binary, etc.
Aaron Lun (16:26:57): > And honestly, I think pip hates me.
2018-02-15
Ricard Argelaguet (03:58:38): > +1 on Aaron’s suggestion. Right now people do either system("python …") or reticulate, which will depend on the python version that is locally installed. Would be great if some version control could be done within Bioconductor
Sean Davis (05:27:37): > The python ecosystem has some pretty nice tooling around versions and package environments. Does one of these (use_python, use_virtualenv, or use_condaenv) work for the use case? https://rstudio.github.io/reticulate/articles/versions.html
Sean Davis (05:31:05): > For the use case that you describe, @Aaron Lun, that case is pretty much fully covered by a python virtualenv or condaenv.
Sean Davis (05:33:39): > Using these python systems is likely to require a bit of setup (installing conda, for example), but I suspect that is going to be preferable to having bioc manage python packages directly.
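For what it’s worth, a minimal sketch of the virtualenv route via reticulate (the environment name and numpy are arbitrary examples, and this assumes a system python is available to seed the virtualenv):
> library(reticulate)
> virtualenv_create("bioc-py")                   # isolated environment under our control
> virtualenv_install("bioc-py", "numpy")         # pin whatever python packages are needed
> use_virtualenv("bioc-py", required = TRUE)     # point this R session at it
> np <- import("numpy")
> np$mean(c(1, 2, 3))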
Martin Morgan (11:52:21): > while python version is one concern, what about managing python package version dependencies? I guess this is virtualenv or condaenv, but the latter has in the past been the source of numerous problems.
Sean Davis (11:54:53): > I guess it comes down to a use case and balancing the need to create something “new” versus leveraging a less-than-optimal solution that exists.
Sean Davis (11:55:49): > I don’t know the details here. My comments were general. And my mileage has varied in using the python package managers of all flavors.
Sean Davis (11:56:17): > Our HPC group has settled on conda, so that has been the flavor-of-the-month.
Martin Morgan (12:00:38): > Also it is an R / Bioc repository rather than a python repository after all (not nixing the idea just trying to find the right balance). Are there lessons learned from the way other languages are handled? C from source with limited use of external libraries, or (a few) packages with bundled libraries (including source, and implying open licensing). Java is sometimes bundled in R packages, often in marginally satisfactory ways (a static and quickly stale snapshot of large jar files typically containing much superfluous functionality, often without easy transparency to underlying source and little regard to actual licensing).
Kasper D. Hansen (12:01:59): > Basically we would be trying to fix the fucked up situation with python 2 and 3
Kasper D. Hansen (12:02:52): > Of course, if anyone can do it, its the R community which after all seems to have the only really working system across languages (perhaps with the exception of Emacs lisp)
Martin Morgan (12:03:20): > Is it possible with current setup to include a Makefile that populates a virtualenv during package build & check, so that the build system (and users) only need to have a supported version of python?
Aaron Lun (12:07:59): > Apparently it is possible to compile python files to obtain a stand-alone binary (http://www.pyinstaller.org/) though I’ve never used it. If this works, it would take us closer to how R currently handles C/C++/Fortran code.
2018-02-16
Laurent Gatto (14:56:07): > @Laurent Gatto has joined the channel
2018-02-27
Vince Carey (09:44:48): > Apropos python: 1) Maybe send a query to JJ Allaire who has probably considered the question? 2) BiocSklearn is in the top 50% of downloads and depends on reticulate as its approach to assuring sufficient python. I haven’t had any complaints.
2018-02-28
Daniel Van Twisk (15:18:37): > @Daniel Van Twisk has joined the channel
Daniel Van Twisk (15:21:27): > Just an fyi, we’ve started work on a package called LoomExperiment whose purpose is to import/export .loom files. https://github.com/Bioconductor/LoomExperiment - Attachment (GitHub): Bioconductor/LoomExperiment > Contribute to LoomExperiment development by creating an account on GitHub.
2018-03-01
Daniel Van Twisk (14:38:25): > Can anyone link me to publicly available .loom files that have row_graphs and/or col_graphs? I’ve been searching for examples and found some here http://loom.linnarssonlab.org/, however, none of these appear to feature row_graphs or col_graphs.
Sean Davis (14:40:12): > Does it make sense to reach out to the Linnarsson Lab directly (if not already done)?
Aaron Lun (18:20:31): > @Daniel Van Twisk If it helps, SingleCellExperiment already contains dedicated slots for reduced dimensions. It may be possible to save yourself some work by deriving LoomExperiment from the SCE. We can’t help much with the graphs, though; you’ll need your own slots for that.
2018-03-02
Daniel Van Twisk (10:02:51): > @Aaron Lun I’ll look into it, thanks!
2018-03-06
Aedin Culhane (13:15:42): > Hi, did you see this: https://grants.nih.gov/grants/rfi/NIH-Strategic-Plan-for-Data-Science.pdf
Aedin Culhane (13:15:52): > RFI open
Aedin Culhane (13:16:17): > One of the comments is “There is currently no general system to transform, or harden, innovative algorithms and tools created by academic scientists into enterprise-ready resources that meet industry standards of ease of use and efficiency of operation.”
Aedin Culhane (13:16:42): > Should Bioc try to be part of this ?
Vince Carey (13:19:35): > we should probably send in comments as a group. i started to take notes. i wonder what “industry standards” they are talking about? it is slanted towards commercial solutions, it would seem.
Sean Davis (13:19:49): > I would argue that that statement is a little naive.
Sean Davis (13:24:51): > And,@Vince Carey, I agree that this statement is slanted toward one view of what a software product for biological science is.
Aedin Culhane (13:30:27): > @Vince Carey @Sean Davis I don’t know how the RFI review process works. Would multiple comments (from different PIs) saying the same thing be more valuable, or would a single comment from a respected organization carry more weight?
2018-03-07
Vince Carey (14:31:52): > I think we should just make some constructive comments as a project that is capable of influencing practice. Various faculties and individuals will also have comments but I would hope that Bioc folks will be able to agree on a few points and will sign. When are comments due?
2018-03-18
Aaron Lun (19:36:02): > DropletUtils now has a read10xMatrix function, which behaves like readMM but avoids limitations with scan() when the function tries to allocate >2GB strings. This is achieved via chunked reading of the matrix.mtx file produced by CellRanger, and was motivated by the failure of readMM to read in datasets that should have fitted in memory. It also supports output as an HDF5Matrix, for the datasets that don’t fit in memory or cause the dgCMatrix class to have an integer overflow.
2018-03-19
Aaron Lun (17:23:41): > … and it turns out that the problem with scan() was because the file was corrupted, resulting in an unclosed quote symbol. Not sure how it managed to do that, though the matrix.mtx was pretty big (1.7GB) so there was plenty of scope to stuff up somewhere. Oh well. Anyway, read10xMatrix() still allows you to read it in as an HDF5Matrix, so it wasn’t a complete waste of time.
2018-04-18
Elizabeth Purdom (09:45:32): > @Elizabeth Purdom has joined the channel
Vince Carey (14:13:32): > The (alleged) complete row and column sums for the 10x 1.3 million neurons are in a list with named numeric components at https://s3.us-east-2.amazonaws.com/biocfound-scrna/fullsums.rda
Vince Carey (14:14:05): > > > str(fullsums) > List of 2 > $ colsums: Named num [1:1306127] 4046 2087 4654 3193 8444 ... > ..- attr(*, "names")= chr [1:1306127] "AAACCTGAGATAGGAG-1" "AAACCTGAGCGGCTTC-1" "AAACCTGAGGAATCGC-1" "AAACCTGAGGACACCA-1" ... > $ rowsums: Named num [1:27998] 16368 51 0 1195 0 ... > ..- attr(*, "names")= chr [1:27998] "ENSMUSG00000051951" "ENSMUSG00000089699" "ENSMUSG00000102343" "ENSMUSG00000025900" ... >
Vince Carey (14:16:38): > I will describe how to get these in a scalable way with batchtools and restfulSE … does anyone have the sums computed in some other manner for confirmation?
Martin Morgan (14:50:05): > biocLite("mtmorgan/hdf5tenx")
and then the code at https://github.com/mtmorgan/hdf5tenx/blob/master/inst/script/rle_matrix.R (replacing fname = with fname = ExperimentHub()[["EH1039"]])
produces > in about a minute (once the data is local) with the following output > > > str(res) > List of 2 > $ row :List of 3 > ..$ n : int [1:27998] 15912 51 0 1164 0 13628 115 516456 188273 197 ... > ..$ sum : num [1:27998] 16368 51 0 1195 0 ... > ..$ sumsq: num [1:27998] 17312 51 0 1259 0 ... > $ column:List of 3 > ..$ n : int [1:1306127] 1807 1249 2206 1655 3326 3866 1420 1769 1672 2241 ... > ..$ sum : num [1:1306127] 4046 2087 4654 3193 8444 ... > ..$ sumsq: num [1:1306127] 35338 14913 31136 21619 106780 ... > > stopifnot( > + identical(sum(res$column$n != 0), 1306127L), > + identical(sum(as.numeric(res$column$n)), 2624828308), > + identical(sum(as.numeric(res$column$sum)), 6388703090), > + identical(sum(as.numeric(res$column$sumsq)), 270395442858), > + all(mapply(function(x, y) { > + identical(sum(as.numeric(x)), sum(as.numeric(y))) > + }, res$row, res$column)) > + ) >
> unnamed but in the order of the original data - Attachment (GitHub): mtmorgan/hdf5tenx > Contribute to hdf5tenx development by creating an account on GitHub.
Martin Morgan (14:52:58): > tenx = TENxBrainData::TENxBrainData(); rows = rowSums(assay(tenx)) (and likewise for columns, after biocLite("TENxBrainData")) should produce named margin sums.
Vince Carey (14:54:34): > thanks – is this > > BUG -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rhdf5lib/include" -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include -fPIC -Wall -g -O2 -c margins_slab.cpp -o margins_slab.o > margins_slab.cpp:3:10: fatal error: 'c++/H5Cpp.h' file not found > #include "c++/H5Cpp.h" > ^~~~~~~~~~~~~~~~~~~~~~~~~ > 1 error generated. >
> my problem?
Vince Carey (14:54:48): > upon trying to install the hdf5tenx
Martin Morgan (15:15:25): > sorry, try again (the package dates to the early days of Rhdf5lib)
Vince Carey (16:03:07): > yes and our sums agree
2018-04-25
Vince Carey (10:59:35): > There have been some questions about throughput with restfulSE. The following timings are based on runs in an academic network. > > > suppressPackageStartupMessages({ > + library(restfulSE) > + library(DelayedMatrixStats) > + }) > > se10x = se1.3M() > analyzing groups for their links... > done > snapshotDate(): 2018-04-25 > see ?restfulSEData and browseVignettes('restfulSEData') for documentation > downloading 0 resources > loading from cache > '/udd/stvjc//.ExperimentHub/554' > > register(SerialParam()) # parallel block processing/REST conflict > > # needs investigation > > system.time(cs <- colSums(assay(se10x[,1:10]))) > user system elapsed > 0.408 0.038 0.688 > > cs > AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 AAACCTGAGGAATCGC-1 AAACCTGAGGACACCA-1 > 4046 2087 4654 3193 > AAACCTGAGGCCCGTT-1 AAACCTGAGTCCGGTC-1 AAACCTGCAACACGCC-1 AAACCTGCACAGCGTC-1 > 8444 11178 2375 3672 > AAACCTGCAGCCACCA-1 AAACCTGCAGGATTGG-1 > 3115 4592 > > system.time(cs <- colSums(assay(se10x[,1:100]))) > user system elapsed > 2.419 0.113 3.183 > > system.time(cs <- colSums(assay(se10x[,1:200]))) > user system elapsed > 5.616 0.068 6.881 > > system.time(cs <- colSums(assay(se10x[,1:1000]))) > user system elapsed > 33.458 0.238 40.465 > > system.time(cs <- colSums(assay(se10x[,1:2000]))) > user system elapsed > 68.575 0.459 83.353 > > sessionInfo() > R version 3.5.0 beta (2018-04-10 r74581) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: CentOS release 6.9 (Final) > > Matrix products: default > BLAS/LAPACK: /app/intelMKL-2017.0.098_i86-rhel6.0/intelMKL/compilers_and_libraries_2017.0.098/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so > > locale: > [1] C > > attached base packages: > [1] parallel stats4 stats graphics grDevices utils datasets > [8] methods base > > other attached packages: > [1] restfulSEData_1.1.0 ExperimentHub_1.5.2 > [3] AnnotationHub_2.11.4 DelayedMatrixStats_1.1.12 > [5] restfulSE_1.1.9 SummarizedExperiment_1.9.17 > [7] DelayedArray_0.5.31 BiocParallel_1.13.3 > [9] matrixStats_0.53.1 Biobase_2.39.2 > [11] GenomicRanges_1.31.23 GenomeInfoDb_1.15.5 > [13] IRanges_2.13.28 S4Vectors_0.17.42 > [15] BiocGenerics_0.25.3 rmarkdown_1.9 > ~ >
Kasper D. Hansen (12:19:26): > So this is running on a server backend right? How much is allocated (how many CPUs etc)
Vince Carey (13:17:36): > Yes, the machine is an m4.2xlarge i think and @Shweta Gopal will know the details. That would be 8 cores, 32GB at the server. We do not do any load profiling on that server at this time. restfulSE+rhdf5client translate the [i,j] to GET request(s), the server gets the HDF5 requested and by default ships back JSON. rhdf5client translates the JSON; a binary transmission is available but for the 10x data it does not seem beneficial to use it.
Kasper D. Hansen (13:26:23): > thanks, thats useful
Shweta Gopal (14:13:22): > Hi@Kasper D. HansenYes, it is a m4.2xlarge 8 vCPU and 32 GiB !
2018-04-27
Aaron Lun (13:36:26): > @Martin Morganhttps://github.com/MarioniLab/DropletUtils - Attachment (GitHub): MarioniLab/DropletUtils > Clone of the Bioconductor repository for the DropletUtils package, see https://bioconductor.org/packages/devel/bioc/html/DropletUtils.html for the official development version.
2018-05-01
Mike Smith (15:56:05): > I did a quick write up of the work @Kasper D. Hansen and I did at the CZI meeting on testing the feasibility/performance of parallel reading from HDF5Arrays - http://www.msmith.de/2018/05/01/parallel-r-hdf5/
Raphael Gottardo (16:01:02): > Great, thanks @Mike Smith; @Mike Jiang and I will have a look at it.
Peter Hickey (16:25:48): > thanks, @Mike Smith!
Peter Hickey (16:26:07): > on a related topic, I’m going to write up some blogposts/tutorials on using DelayedArray (especially for developers). So please hit me up with any questions you might have!
2018-05-03
Peter Hickey (16:28:28): > anyone familiar with TileDB? https://tiledb.io/ - Attachment (tiledb.io): TileDB - Home > Array data management made fast and easy
Raphael Gottardo (17:04:14) (in thread): > No but that looks interesting.
Raphael Gottardo (17:15:45): > Interesting discussion here: https://news.ycombinator.com/item?id=15547749.
Raphael Gottardo (17:16:10): > @Mike Jiang Perhaps we should look at it too?
Raphael Gottardo (23:19:03): > And there is already an R interface: https://github.com/TileDB-Inc/TileDB-R - Attachment (GitHub): TileDB-Inc/TileDB-R > TileDB-R - R interface to the TileDB storage manager
2018-05-04
Peter Hickey (07:34:47): > forgot to mention that!
2018-05-08
Mike Jiang (13:28:50): > Looks like the R binding is not quite ready yet: https://github.com/TileDB-Inc/TileDB-R/issues/10#issuecomment-387253011 - Attachment (GitHub): package installation error · Issue #10 · TileDB-Inc/TileDB-R > libtiledb.cpp:227:24: error: ‘tiledb::Version’ has not been declared auto ver = tiledb::Version::version(); ^ libtiledb.cpp: In function ’Rcpp::List tiledb_array_schema…
Vince Carey (14:48:52): > For convenience I am posting a link to the tiledb paper here. There is some comparison to HDF5. In the ycombinator link given above there is indication that the tiledb group admires HDF5 and will consider interoperability: https://people.csail.mit.edu/stavrosp/papers/vldb2017/VLDB17_TileDB.pdf
Raphael Gottardo (14:54:28) (in thread): > Mike, as they have indicated in their response, it would be good to email them to follow up.
Peter Hickey (14:56:49): > i also ran into problems installing on macos. will wait and see
2018-05-09
Mike Smith (07:06:15): > Might also be worth paying attention to the news that HDF5 is being split in two, with a free and a paid-for edition (https://www.hdfgroup.org/2018/05/announcing-development-of-the-enterprise-support-edition-of-hdf5/). It’s not clear to me what the impact of this will be, but at first glance it looks to me like we may not be able to distribute a copy of the HDF5 source any more. - Attachment (The HDF Group): Announcing the launch of the Enterprise Support Edition of HDF5 - The HDF Group > Dear Friends, Community Members, and Colleagues, Today, I am writing to all of you to announce the launch of both a Community Edition (CE) and a subscription-based Enterprise Support Edition (ESE) for HDF5. This model is similar to Red Hat, Lustre, and other open source projects. We are moving down this pathway to address the challenges that continuously face us in achieving sustainability and increasing community involvement. Since I joined The HDF Group in April 2016, I made it my…
Vince Carey (07:15:10): > I mentioned this concern to John Readey. I will let you know his response.
Aaron Lun (07:41:00): > Hm. Sounds ominous.
Kasper D. Hansen (07:47:24): > Not a good sign. Perhaps the community will fork it.
Aaron Lun (17:37:10): > We should have patented it. “Analyzing single-cell data with HDF5”.
Mike Jiang (17:51:44): > Just briefly looked at copyright file in the latest CE, it doesn’t seem to prohibit the redistributions of the source
2018-05-10
Vince Carey (09:24:01): > https://forum.hdfgroup.org/t/announcing-development-of-the-enterprise-support-edition-of-hdf5/4374… John Readey said the main restriction he envisioned was against distribution of enterprise edition binaries. I do not think our methods have to change. But there are issues raised on the forum concerning speed of bugfix process. Anyone want to post concerns to forum? - Attachment (forum.hdfgroup.org): Announcing development of the Enterprise Support Edition of HDF5 - News and Announcements from The HDF Group - HDF Forum > Dear Friends, Community Members, and Colleagues, Today, I am writing to all of you to announce the launch of both a Community Edition (CE) and a subscription-based, Enterprise Support Edition (ESE) for HDF5. This model …
2018-05-11
Mike Smith (02:45:52): > Cool, some of the replies in that thread seem more encouraging than the initial announcement. I was a bit concerned by the entry “Who can access the source? - Open to anyone registered at HDF” in https://www.hdfgroup.org/solutions/what-does-the-enterprise-support-edition-mean-for-the-community-edition/ since we obviously don’t require registration at HDF to download rhdf5, but it sounds like the licence isn’t changing so the status quo is hopefully maintained.
Aaron Lun (05:49:59): > Incidentally, most of the relevant scran functions used in the 10X brain data analysis have been parallelized in version 1.9.3. So it is now possible to see how well the HDF5 read/write process scales with the number of cores in the context of an actual analysis pipeline. My cluster has pretty rubbish IO, though.
2018-05-18
Vince Carey (14:08:01): > I am hitting > > HDF5. Dataset. Read failed. >
> and the analogous Write failed event. This is using h5read – upon setting h5errorHandling(“verbose”), i get > > Error in H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem, : > libhdf5 > error #000: _? in ??(): line 223 > class: HDF5 > major: Dataset > minor: Read failed > error #-05: ?l' in ???(): line 605 > class: HDF5 > major: Dataset > minor: Read failed > error #-04: ?? in ?l'(): line 2092 > class: HDF5 > major: Low-level I/O > minor: Read failed > error #-03: ??? in 0??(): line 3122 > class: HDF5 > major: Data filters > minor: Filter operation failed > error #-02: ?< in ???(): line 1370 > class: HDF5 > major: Data filters > minor: Read failed > error #-01: ?< in 0?C(): line 123 > class: HDF5 > major: Data filters > minor: Unable to initialize object >
> which is not that much more informative. Are there other approaches to diagnosis here? Specifically I wonder about the fact that some columns in the target file look good, so perhaps I can just rewrite the ones for which I get failure?
Martin Morgan (14:53:18): > I’d guess that those ???() would resolve to function names if compiled without optimization and with compiler symbols, but that wouldn’t be much more helpful…
2018-05-22
Mike Smith (05:31:43): > How reproducible is this? Does it happen every time you run a particular block of code? The ‘Data filters’ references in the bottom 3 messages are almost certainly the gzip compression. Maybe try turning off compression and see if this still occurs; it might help narrow things down a bit.
Vince Carey (11:21:26): > Good question on reproducibility. As it happens I just tried again and again and got the file I needed. So the next time I hit it I will try to get more context information. Note that I was attempting to write several hdf5 files from a given function – sequentially, but a lot was going on. I inserted sleeps between the write attempts and that seemed to help a bit.
2018-05-29
Aaron Lun (14:17:58): > @Kasper D. Hansen Having played around with randomized SVD (via rsvd), I’m no longer sure that it’s the way to go as a general SVD replacement. For one thing, it assumes that k + l is greater than the true rank of the matrix, and when this is not true, its accuracy suffers quite a bit. One might say that this is a reasonable assumption, but it’s hard to tell the true rank (and thus the appropriate k and l) a priori - for example, I often need the variance explained per component to figure that out, which requires an SVD in the first place. rsvd also only really returns accurate U/D/V for dimensions <= true rank; beyond that, its accuracy again drops off. One might say that lack of accuracy doesn’t matter for those higher dimensions consisting of random noise, which is fair enough; nonetheless it is somewhat annoying as it makes it difficult to write unit tests to check rsvd against the standard svd. We can get around this by increasing the number of power iterations, but this is not entirely cheap either, involving a series of QR decompositions of reasonably sized matrices. > > I wonder whether an adaptation of irlba may be more appropriate, combining your idea of quickly computing XtX (or XXt) and doing the decomposition on that, with adjustments to irlba to support non-ordinary matrices?
Aaron Lun (16:25:41): > The last thing seems to work pretty well, actually. Need to really drop tol to get the same result as a direct application of irlba, but otherwise it seems to do the job.
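For reference, the XtX route in its simplest dense form (a toy sketch, not the scran/irlba code; it just shows how the top singular vectors fall out of an eigendecomposition of the small cross-product):
> set.seed(42)
> X <- matrix(rnorm(10000 * 50), nrow = 10000)   # cells x genes, tall and skinny
> XtX <- crossprod(X)                            # 50 x 50, cheap to decompose
> ev <- eigen(XtX, symmetric = TRUE)
> d <- sqrt(ev$values[1:5])                      # top singular values of X
> V <- ev$vectors[, 1:5]                         # right singular vectors
> U <- X %*% V %*% diag(1 / d)                   # left singular vectors, recovered afterwards
> # svd(X)$d[1:5] matches d; the vectors agree up to sign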
2018-05-30
Kasper D. Hansen (05:39:36): > > rsvd also only really returns accurate U/D/V for dimensions <= true rank; beyond that, its accuracy again drops off. One might say that lack of accuracy doesn’t matter for those higher dimensions consisting of random noise, which is fair enough; nonetheless it is somewhat annoying as it makes it difficult to write unit tests to check rsvd against the standard svd. We can get around this by increasing the number of power iterations, but this is not entirely cheap either, involving a series of QR decompositions of reasonably sized matrices.
Kasper D. Hansen (05:40:14): > This is because the output of SVD is not uniquely defined. You might very well suffer the same with svd on different architectures or BLAS implementations
Kasper D. Hansen (05:41:04): > I am very surprised to hear that you say rsvd() (which I don’t have much experience with - I have rolled my own) is not precise when k + 1 > rank(mat)
Kasper D. Hansen (05:41:48): > Finally I have learned that randomized SVD can mean many different things, so I probably should take a look at the rsvd() implementation
Aaron Lun (06:53:44): > Hm, I should have been clearer with what I meant by “true rank”. Specifically, I considered data with points distributed on a 2D subspace; projected into 100 dimensions; and with some random noise added to each observation. As a result, the rank of this matrix would technically be 100, but I only really care about the two components corresponding to my original (biological) subspace - i.e., the true rank minus the noise. > > set.seed(12345) > ncells <- 1000 > ndim <- 2 > truth <- matrix(rnorm(ncells*ndim), ncol=ndim) > > ngenes <- 100 > proj <- truth %*% matrix(rnorm(ndim * ngenes), ncol=ngenes) > data <- proj + rnorm(length(proj)) >
> So if we run svd and rsvd, requesting only two components, all is well: > > s.out <- svd(data, nv=2) > head(s.out$v) > > library(rsvd) > r.out <- rsvd(data, k=2) > head(r.out$v) >
> But ask for the third component, and the results are no longer the same, or even stable across rsvd runs: > > s.out <- svd(data, nv=3) > head(s.out$v) > > r.out <- rsvd(data, k=3) > head(r.out$v) >
Kasper D. Hansen (07:19:22): > Interesting. I should look at the rsvd arXiv paper
Aedin Culhane (12:12:11): > Hi Kasper, If you only want the first component, we implemented NIPALS, which is faster than decomposing the entire matrix. it is also ok with NA
Aaron Lun (13:46:12): > There are probably also some issues regarding the numerical stability of decomposing XtX compared to SVD on X itself; don’t know if anyone’s explored that.
2018-05-31
Mike Jiang (17:02:40): > Is parallel block-processing (or concurrent read) available for DelayedArray?
Peter Hickey (17:05:33): > DelayedArray::blockApply() takes a BiocParallelParam
Peter Hickey (17:06:03): > You can use that or (as I’ve done) adapt this idea
Mike Jiang (17:06:55): > but it is not out-of-the-box for subsetting like as.matrix(h5array[i,j])?
Peter Hickey (17:07:59): > No. I’m not quite sure what you’re looking for, example?
Mike Jiang (17:09:29): > I was hoping for faster subsetting of a DelayedMatrix through concurrent IO natively supported by the DelayedArray package
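For what it’s worth, a rough sketch of the block-wise route with a parallel backend (a sketch only; the grid helper is colAutoGrid() in current DelayedArray and colGrid() in older versions, and the BPPARAM argument is as Pete describes):
> library(HDF5Array)
> library(DelayedArray)
> library(BiocParallel)
> M <- writeHDF5Array(matrix(rpois(1e6, 5), nrow = 1000))   # toy on-disk matrix
> blocks <- blockApply(M, colSums, grid = colAutoGrid(M), BPPARAM = SnowParam(2))
> cs <- unlist(blocks)                                      # per-block colSums, concatenated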
2018-06-08
Vince Carey (11:57:55): > Is anyone experienced with using H5Sunlimited() in maxdims? It is possible to create an unlimited dimension this way, but beyond setting it at dataset creation, are additional steps needed to get an extensible dataset? I do not yet know if a resize operation is needed to append to an existing dataset.
2018-06-11
Vince Carey (12:01:37): > h5py has a resize method that is useful for concatenating hdf5 datasets, or appending to one … i don’t see this in rhdf5. hdf5r seems to have a concatenation method available. has no one hit this use case of concatenating two hdf5 matrices?
Aaron Lun (12:05:49): > For me, not really, I’ve mostly treated the HDF5 matrix as read-only, and creation is expensive enough that I do it sparingly with all dimensions known in advance.
Aaron Lun (12:06:15): > Though I can see how it might be useful.
Vince Carey (12:08:10): > It doesn’t seem to be a use case in high demand – there is a thread on stackoverflow with a nice solution in python … it is a bit tedious, not a high level operation. it is surprising to me.
Vince Carey (12:09:11): > i am just somewhat cowardly when it comes to creating really big matrices in one go … i go piece by piece and then when it looks like everything is ok, i have to put them together somehow
Peter Hickey (12:09:29): > like aaron, i’m currently treating HDF5 matrices as read-only, but can see the use of appending/binding
Peter Hickey (12:10:05): > i’ve not yet considered in-place alterations…are these even possible?
Vince Carey (12:10:09): > read-only mindset is fine – but when an isolated error is found, it is nice to be able to fix it.
Vince Carey (12:10:46): > yes, with h5py you can make element-level edits to a file. i assume you could do likewise with rhdf5
Vince Carey (12:11:14): > and this is one reason to think about checksums for content that you want to regard as read only
2018-06-12
Mike Jiang (13:22:43): > By concatenating, do you mean merging into a single dataset or just copying the dataset from one file to another (or through the external link mechanism)? If the former, it is essentially creating a new h5 with one big dataset.
2018-06-13
Vince Carey (12:31:21): > My question concerned putting the contents of two compatible hdf5 datasets together – essentially an analog of rbind in R. Whether it is accomplished by creating a new dataset or appending to one of them I did not specify. I think appending can be done manually with rhdf5 provided H5Sunlimited() is used to set the maxdim that will be extended.
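For the record, a rough sketch of manual appending with rhdf5 (a sketch under the assumption that an unlimited maxdims plus h5set_extent() is all that is needed; not tested on huge data):
> library(rhdf5)
> h5createFile("big.h5")
> ## create with an unlimited first dimension so the dataset can grow later
> h5createDataset("big.h5", "mat", dims = c(100, 50),
>                 maxdims = c(H5Sunlimited(), 50),
>                 chunk = c(100, 50), storage.mode = "double")
> h5write(matrix(rnorm(100 * 50), 100, 50), "big.h5", "mat")
> ## append 100 more rows: extend the dataset, then write into the new slab
> h5set_extent("big.h5", "mat", c(200, 50))
> h5write(matrix(rnorm(100 * 50), 100, 50), "big.h5", "mat",
>         index = list(101:200, 1:50))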
Kasper D. Hansen (14:40:24): > @Vince Careydoing that efficiently is cleary useful
Kasper D. Hansen (14:40:43): > Especially when large data are processed. You can do stuff on small batches and then combine
Aaron Lun (14:44:28): > I can see the utility of this, but mostly in a similar vein to writeLines or write.table with append=TRUE. For interactive rbind-ing of HDF5Matrix objects, it would make more sense to create a new dataset entirely. Otherwise, if you append to dataset X, other HDF5Matrix objects pointing to X will implicitly point to the appended dataset, rather than the pre-appended X as one might expect from R’s usual semantics. This would probably cause chaos.
Kasper D. Hansen (14:51:20): > We are doing extensive constructions like Vince is describing: piece by piece.
Vince Carey (15:04:05): > Maybe the dominant model for rhdf5 is as a tool for interacting with HDF5 that was made elsewhere and is regarded as read-only. The ramifications of treating HDF5 datasets like R matrices for constructive manipulations certainly need to be thought through. Checking whether the HDF5 has been modified since the R reference was created may be useful given that other applications may be doing things to the data. This is a cost of interoperability. Similar disciplines might be relevant with the SQLite resources.
Aaron Lun (15:14:55): > To be clear, I think it’s fine to construct things piece by piece. I’m just saying that trying to make it super-convenient via automatic extension with rbind could have unintended consequences. A proper copy-on-write model would require us to do some reference counting to keep track of all HDF5Matrix objects that exist in the current session (or even outside of the session!) - sounds tough.
Peter Hickey (15:22:27) (in thread): > Yes, although we’re doing this by pre allocating and filling piece by piece
Kasper D. Hansen (15:27:44): > I understood Vince’s question as asking if the C-level function was exposed. Not about what the R semantics should be
Kasper D. Hansen (15:27:54): > But perhaps I read things a bit fast
Vince Carey (15:32:29): > I don’t think there is a C level function to do this.
Kasper D. Hansen (15:33:19): > oh h5py sounds like a python interface
Kasper D. Hansen (15:34:17): > so does it do anything “smart” or does it create a new HDF5 file?
Vince Carey (16:06:05): > there isn’t a concatenation method in h5py either. there are examples of doing it with extensible maxdims settings. but concatenation does not seem to be part of any HDF5 API. i “need” it because of the way i produced my datasets, and so i believe i will write it using rhdf5 facilities. unless you already have one kasper!
Kasper D. Hansen (16:28:22): > We fill in a preallocated big matrix
Kasper D. Hansen (16:28:37): > But@Peter Hickeyis the expert here
Kasper D. Hansen (16:28:59): > Not sure we do it in the best possible way, but it seems to work reasonably well
Kasper D. Hansen (16:29:12): > Actually I am pretty sure we don’t know if it is the best way
2018-06-15
Mike Jiang (17:38:39): > Anyone have some experience measuring the memory usage of a function call? It involves an externalptr, thus pryr::mem_change or Rprofmem won’t be helpful. I am currently computing the difference of the total memory (RSS) used by rsession via the ps command before and after the function call. Even though I called gc() at both checkpoints, I guess that still won’t guarantee the unused memory gets returned from R to the OS (otherwise I wouldn’t observe some negative differences). (nudge @Andrew McDavid since he was wondering about that too). I wonder if there is a better way to get more accurate measurements. Here is my example of the matrix benchmarking: https://github.com/RGLab/mbenchmark/blob/master/README.md. Disclaimer: this was just a demo case for the package usage; by no means does it provide any conclusive evidence of the superiority of any data format. - Attachment (GitHub): RGLab/mbenchmark > mbenchmark - benchmarking the common matrix operations
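A bare-bones version of that RSS-diff approach (Linux/macOS only; ps reports kilobytes, and as noted the numbers are only approximate; the measured function is a hypothetical stand-in):
> rss_kb <- function() {
>   as.numeric(system(paste("ps -o rss= -p", Sys.getpid()), intern = TRUE))
> }
> invisible(gc()); before <- rss_kb()
> x <- some_function_under_test()          # hypothetical call being measured
> invisible(gc()); after <- rss_kb()
> (after - before) / 1024                  # approximate change in MB; can be negative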
Peter Hickey (18:10:47): > Is this helpful? https://github.com/r-prof/jointprof - Attachment (GitHub): r-prof/jointprof > jointprof - Joint profiling of native and R code
Davide Risso (18:14:21): > @Elizabeth Purdom can maybe help
2018-06-17
Aaron Lun (10:54:22): > @Kasper D. Hansen Took 17 minutes on my computer to compute the tcross-product of a 68K cells * 5000 genes DelayedArray matrix. Sounds bad but is probably comparable to the speed of tcrossprod anyway; and the remaining steps via irlba were effectively negligible by comparison.
2018-06-18
Kasper D. Hansen (04:12:31): > what did you do for this?
Aaron Lun (04:27:11): > Glad you asked. I tried a number of things, but the best solution was to subset the DelayedArray by column; realize it to a dense array; run tcrossprod on the dense array (producing a 5000*5000 matrix); repeat with the next set of columns, and add the results together. This seems to give the best performance as it allows one to use the speed-ups in tcrossprod (by comparison, subsetting by row would require general matrix multiplication, which is slower).
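A stripped-down sketch of that column-block scheme (assuming a genes x cells DelayedMatrix X; the block size and the function name are arbitrary):
> library(DelayedArray)
> block_tcrossprod <- function(X, block_size = 1000) {
>   result <- matrix(0, nrow(X), nrow(X))
>   for (start in seq(1, ncol(X), by = block_size)) {
>     cols <- start:min(start + block_size - 1, ncol(X))
>     dense <- as.matrix(X[, cols, drop = FALSE])   # realize one block of columns
>     result <- result + tcrossprod(dense)          # accumulate its contribution
>   }
>   result                                          # equals tcrossprod(as.matrix(X))
> }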
Kasper D. Hansen (04:30:48): > Ok, I’ll think about this. I want to have something for Bioc2018. My first thought was to use beachmat to grab subsets of the matrix and then pass it to BLAS or RcppArmadillo or just[t]crossprod()
Aaron Lun (04:40:04): > Technically the underlying matrix was sparse, so it is tempting to think that there are some further speed-ups there; but there is some centering involved prior to computing the crossproduct, and this makes things tricky. One could still compute the sparse cross-product and subtract the outer product of the mean vector, but this has issues with numerical stability.
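The identity behind “sparse crossproduct minus the outer product of the mean vector” (rows are genes, columns are cells; a quick numerical check, not anyone’s package code):
> X <- matrix(rnorm(20), nrow = 4)
> m <- rowMeans(X); n <- ncol(X)
> all.equal(tcrossprod(sweep(X, 1, m)),            # crossproduct of the row-centred matrix
>           tcrossprod(X) - n * outer(m, m))       # sparse-friendly form: TRUE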
Aaron Lun (04:40:25): > I also tried home-brewing my own crossprod in C++, which sucked. Very much non-cache-optimal.
Kasper D. Hansen (04:41:11): > Could you share your code for your recent attempt?
Kasper D. Hansen (04:41:46): > And I agree that an important point seems to be thinking about centering (and scaling) prior as well
Kasper D. Hansen (04:42:22): > I see your point of numerical stability. Was mostly thinking about speed
Martin Morgan (04:44:38): > is this ‘subset and realize as dense’ what DelayedArray::blockApply() does?
Aaron Lun (04:45:45): > Yes, though I didn’t use blockApply directly. I wanted to avoid having to hold X objects in memory at once (where X is the number of blocks).
Aaron Lun (04:46:55) (in thread): > Re code: I deleted it in a rage quit. But I can tell you what I did. The best approach I got was to iterate across columns, extracting one column of the matrix at a time; subtract the mean vector, and apply the scaling vector to this column; and then compute the outer product in a column-major manner, adding the values to the output matrix. This was competitive with tcrossprod at small matrices but fell away quite quickly, presumably due to cache misses.
Martin Morgan (04:52:56): > It doesn’t sound like blockApply would be useful if it simply realized the data in memory, which I understand is what you say it does?
Kasper D. Hansen (05:00:21) (in thread): > I meant code for your 17m run, not the stuff you hated (although I would have liked to see that as well)
Aaron Lun (05:01:51): > No, that’s not so much the problem. Obviously, blockApply only realizes the current block that it’s working on, so (AFAIK) it won’t realize the full matrix, which would be bad. (There would be problems if we parallelized to the point that the entire matrix was effectively realized at once, but that’s for another day.) The “problem” is that each iteration of blockApply will produce a square matrix representing the cross-product of the current chunk. Over X iterations, this requires memory for X matrices, while we only really need space for 2 as each new square matrix gets added to the existing square matrix.
Aaron Lun (05:02:50): > As problems go, this is not so bad as the square matrices are currently fairly small. However, the ideal solution for this particular use-case would be to have a mutex or something that allows each core to add their new matrix to the existing matrix, and avoid us having to generate the full list of X matrices that then get added together.
Aaron Lun (05:03:22) (in thread): > https://github.com/MarioniLab/scran/commit/2cfd90c1c94844330e2bb4ceb97577a857d15f3f - Attachment (GitHub): Switched to delayed crossprod for faster multi-sample PCA. · MarioniLab/scran@2cfd90c > Clone of the Bioconductor repository for the scran package, see https://bioconductor.org/packages/devel/bioc/html/scran.html for the official development version.
Aaron Lun (05:03:38) (in thread): > I put in BiocParallel support but it doesn’t actually do anything yet.
Kasper D. Hansen (05:26:01): > Just to make this easier to understand: what@Aaron Lunis talking about is the return value of the function.
Kasper D. Hansen (05:26:39): > straight up it would be a list of square matrices which should then be combined using elemetwise addition.
Kasper D. Hansen (05:27:44): > But if you do it sequentially you can update the return matrix by the previous one, essentially like Reduce("+", listOfSquareMatrices)
Kasper D. Hansen (05:28:34): > This is a special case of the apply() approach: when the return value is big.
Martin Morgan (05:29:22): > ok thanks for the clarification. There is a mutex (single computer) in BiocParallel; also in GenomicFiles there is reduceByYield(), which allows one to extract a block, compute on it, and then reduce it iteratively; something better than > > yieldDelayedArray <- function() { > ## instead: provide `grid` argument like `DelayedArray::blockApply()` > block <- 0L > function(m) { > ## signal 'done' or realize a block > if (block == ncol(m)) > return(NULL) > ## instead: implement like `DelayedArray::blockApply()` > block <<- block + 1L > as.matrix(m[,block, drop=FALSE]) > } > } > > setMethod("isOpen", "DelayedArray", function(con, rw = "") TRUE) # hack -- DelayedArray is not a file > > reduceByYield(X = m, YIELD = yieldDelayedArray(), MAP = function(block) { > message(colSums(block)) > colSums(block) > }, REDUCE = `+`) >
> would work, including with the MAP function evaluated in parallel.
Kasper D. Hansen (05:29:26): > I have been thinking about the case when even a single matrix is too big, but have decided to postpone that for now (it wont happen in the gene expression single cell world)
Kasper D. Hansen (05:32:21): > Yeah @Martin Morgan that’s the idea, assuming that blockApply() can be made to do the computation one block at a time, do the reduce and then continue. I find it even harder to reason about the delayed operations.
Peter Hickey (08:54:48): > aaron’s approach is one i have been toying with but haven’t had time to pursue
Peter Hickey (08:56:16): > looking at the source for [t]crossprod() made it pretty clear i didn’t want to implement this myself. there are 3 versions: a ‘naive’ R, a SIMD-parallelised R, and one that uses the BLAS routine dgemm
Aaron Lun (08:57:29): > good grief
Peter Hickey (08:58:56): > the non-BLAS ones were added somewhat recently by Tomas Kalibera. I think BLAS is the default, but I haven’t gone down the path of figuring out (A) which is called or (B) how to specify the choice (it might not even be available as a runtime option, perhaps only at compile time?)
Vince Carey (15:10:57): > > cellinds = seq_len(68000) > alit = rhdf5client::H5S_Array( > filepath="http://h5s.channingremotedata.org:5000", > host="tenx_full")[seq_len(5000), cellinds] > save(alit, file="alit.rda") > > library(BBmisc) > chs = chunk(cellinds, n.chunks=20) > library(BiocParallel) > register(SnowParam(20)) > doit = function(x) { > library(DelayedArray) > library(rhdf5client) > load("alit.rda") # local DelayedArray with remote data > tcrossprod(as.matrix(alit[,x])) > } > system.time(xxt <- bplapply(chs, doit)) >
> runs in under 2 min
Vince Carey (15:13:10): > can we consider XXt of 5000 genes x 68000 tenx cells as a canonical example? we would want to address centering and scaling and correctness.
Vince Carey (15:18:46): > > user system elapsed > 18.040 7.272 113.960 > > > sum(as.numeric(xxt[[20]])) > [1] 1760328388 > > system.time(XXT <- Reduce("+", xxt)) > user system elapsed > 2.661 0.609 3.276 > > sum(as.numeric(XXT)) > [1] 38245091981 >
Vince Carey (15:31:39): > I understand that this may look flabby in the sense that you have 20 cores (each job taking maybe .5g) and you have to keep the various chunk-specific matrices around until you sum. They could be computed into hdf5 for use later, to allow removal from RAM, or added into an hdf5 dataset when they emerge? I just use the remote hdf5 because it is convenient for me – I can’t remember where the local one is….
Aaron Lun (16:13:07): > Interesting.
Aaron Lun (16:14:04): > So if I’m reading this correctly, each core retrieves a little bit from the HDF5 server, and then realizes it to a full matrix and computes the crossproduct.
Aaron Lun (16:14:39): > 2 minutes sounds about right, given the overhead of remote retrieval.
Aaron Lun (16:14:50): > It should be linear w.r.t. the number of columns.
Vince Carey (16:17:09): > yes
Aaron Lun (16:19:05): > So it would probably be even faster with a fully in-memory matrix (which is what I’m currently working with).
Vince Carey (16:20:10): > indeed …
Aaron Lun (16:20:45): > I’ll give that a shot tomorrow. Tired of waiting 17 minutes for my crossproduct… though I still have to wait 40 minutes for my t-sne…
Vince Carey (16:22:00): > we need to know that the answer is correct. is the centering to column-mean zero and scaling to unit SD, or something else?
Aaron Lun (16:31:03): > hold on - let me untranspose things in my head.
Aaron Lun (16:31:20): > Each gene should have mean zero across cells.
Aaron Lun (16:32:20): > However, in my application, the mean is computed across multiple matrices (batches). But the computation shouldn’t be affected, as we’re just subtracting a vector regardless.
Aaron Lun (16:33:17): > Scaling is again not standard - I’m dividing each matrix by the square root of the number of cells, so that each batch effectively contributes the same amount of information to the gene-gene covariance matrix.
Aaron Lun (16:33:28): > The idea being to avoid one batch from dominating the identification of the axes of variation in the PCA.
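A minimal sketch of the centring and scaling just described, assuming a list of gene-by-cell batch matrices with matching genes (all names here are illustrative, not an existing API): subtract the common gene means, scale each batch by the square root of its number of cells, and accumulate the gene-gene cross-product.
> ## 'batches' is a list of gene-by-cell matrices sharing the same genes (rows)
> gene.means <- rowMeans(do.call(cbind, batches))   # means across all batches
>
> xxt <- 0
> for (b in batches) {
>     centred <- b - gene.means          # recycles down each column, per-gene centring
>     scaled <- centred / sqrt(ncol(b))  # each batch contributes equally
>     xxt <- xxt + tcrossprod(scaled)    # running gene-gene cross-product
> }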
2018-06-19
Aaron Lun (15:25:49): > Also, a BioC-friendly parallelized NN-search algorithm at https://github.com/LTLA/kmknn - Attachment (GitHub): LTLA/kmknn > kmknn - Bioconductor-friendly implementation of the K-means for K-nearest neighbors algorithm
2018-06-20
Martin Morgan (15:07:47): > It’s a little bit up-thread now, but Herve mentioned blockReduce() to iterate through a delayed array and accumulate the reduction.
Aaron Lun (15:11:45): > Looking at the source for blockReduce - it doesn’t seem to use BiocParallel?
Martin Morgan (15:15:11): > that’s right, no parallel evaluation…:slightly_frowning_face:
Aaron Lun (15:18:45): > I’m actually not sure how a mutex would work in R - what should I be looking at in BiocParallel?
Martin Morgan (15:20:42): > ipclock() (inter-process lock) and related functions; there’s an example on the man page, and I think also in the vignette.
Peter Hickey (15:22:05): > @Aaron Lun I’ve got some examples I can point you to
Aaron Lun (15:23:53): > Hit me up
Aaron Lun (15:24:55): > The vignette’s pretty unhelpful, it just mentions that ipclock exists.
Peter Hickey (15:26:53): > i use them in https://github.com/hansenlab/bsseq/blob/refactor/R/BSmooth.R - Attachment (GitHub): hansenlab/bsseq > Devel repository for bsseq
Peter Hickey (15:27:02): > and https://github.com/hansenlab/bsseq/blob/refactor/R/read.bismark.R - Attachment (GitHub): hansenlab/bsseq > Devel repository for bsseq
Peter Hickey (15:27:15): > both are rather long …
Peter Hickey (15:29:01): > will write up a succinct example for BioC2018 workshop (WIP: https://github.com/PeteHaitch/BiocWorkshops) - Attachment (GitHub): PeteHaitch/BiocWorkshops > BiocWorkshops - Workshops for learning Bioconductor
Peter Hickey (15:31:01): > basic idea is to create a lock with lock <- ipcid() in foo(). Internally, foo() breaks up the object and passes it to .foo() via a bplapply()/bpmapply(), etc. Include lock as an argument to .foo(), the function you are using in bplapply()/bpmapply()/etc. When I write to the HDF5RealizationSink I do:
> ipclock(lock)
> write_block_to_sink(x, sink, viewport)
> ipcunlock(lock)
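For anyone following along, a stripped-down version of that locking pattern (the file name and worker function below are placeholders, not code from bsseq):
> library(BiocParallel)
>
> lock <- ipcid()                         # shared lock identifier
> .foo <- function(i, lock) {
>     ipclock(lock)                       # only one worker writes at a time
>     cat(i, "\n", file = "results.txt", append = TRUE)
>     ipcunlock(lock)
>     i
> }
> res <- bplapply(1:8, .foo, lock = lock, BPPARAM = MulticoreParam(2))
> ipcremove(lock)                         # clean up the lock when done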
Aaron Lun (15:32:04): > Is that a write to file here?
Peter Hickey (15:32:16): > yeh
Aaron Lun (15:32:25): > Wondering how it could be applied to modify an in-memory resource.
Aaron Lun (15:32:37): > <<- ?
Aaron Lun (15:32:51): > Don’t even know if that would be respected in separate memory.
Peter Hickey (15:33:15): > https://github.com/Bioconductor/DelayedArray/issues/20 - Attachment (GitHub): Help implementing ‘parallel’ writing to a RealizationSink with BiocParallel · Issue #20 · Bioconductor/DelayedArray > I'm trying to implement writing to an arbitrary RealizationSink backend via BiocParallel::bplapply() with an arbitrary BiocParallelParam backend. That is, I want to be able to construct blocks …
Peter Hickey (15:33:38): > same thing? if so, see @Martin Morgan’s reply https://github.com/Bioconductor/DelayedArray/issues/20#issuecomment-390747227 - Attachment (GitHub): Help implementing ‘parallel’ writing to a RealizationSink with BiocParallel · Issue #20 · Bioconductor/DelayedArray > I'm trying to implement writing to an arbitrary RealizationSink backend via BiocParallel::bplapply() with an arbitrary BiocParallelParam backend. That is, I want to be able to construct blocks …
Peter Hickey (15:33:42): > seems rather hard
Aaron Lun (15:34:01): > Hm.
Aaron Lun (15:35:43): > Okay.
Aaron Lun (15:36:33): > Perhaps we can take a step back in ambition here.
Martin Morgan (15:37:52): > ‘writing to file’ might be interpreted a bit loosely, like to a memory-mapped region shared between processes; this is in effect what ipcyield() is doing – storing a numerical representation and managing access so that read + update is effectively atomic. You couldn’t do this with an R object, or if you did you’d serialize / unserialize it to the shared memory.
Aaron Lun (15:41:44): > For the cross-product example above, there are two motivations for block processing. One is to save memory and avoid realizing the entire Delayed array at once (this is my main motivation, actually). The second is to parallelize across blocks. However, the two considerations need not involve the same number of blocks. One might process the cross-product in 20 blocks to save memory but have 2 cores available for processing. Now ideally, each core would just add its results to a locked/unlocked common resource, but this seems difficult (?). However, it would be just as good for the 2 cores to return their cross-product result independently, and for the managing process to add the values together, before dispatching the next set of jobs. This would avoid the worst scenario whereby you have to hold 20 results in memory at once before adding them all together to get the final cross-product.
Aaron Lun (15:43:35): > And yes, if you had 20 cores then you would have to deal with 20 results regardless, but a system with 20 cores should have more than enough memory for that.
Martin Morgan (15:54:31): >
> library(BiocParallel)
>
> ITER <- function(n) {
>     function() {
>         ## count down from n -- 'yield' the next number
>         if (n <= 0L)
>             return(NULL)
>         res <- n
>         n <<- n - 1L
>         res
>     }
> }
>
> FUN <- function(i)
>     ## do something with the current yield
>     c(i, Sys.getpid())
>
> REDUCE <- function(x, y)
>     ## combine successive FUNs
>     list(x[[1]] + y[[1]], unique(c(x[[2]], y[[2]])))
>
> bpiterate(ITER(5), FUN, REDUCE = REDUCE, BPPARAM = MulticoreParam(2))
> This is essentially bpiterate(), where you’d like to implement ITER() above to yield a chunk of the delayed array. REDUCE() is run on the master, so can be implemented to store state between calls, for instance. The memory-management part of the problem is the domain of ITER; the number of processes is the domain of BPPARAM.
Martin Morgan (16:00:45): > I guess ITER is something like
> ITER <- function(x, grid = NULL) {
>     grid <- DelayedArray:::normarg_grid(grid, x)
>     b <- 0L
>     function() {
>         if (b == length(grid))    # all blocks have been yielded
>             return(NULL)
>         b <<- b + 1L
>         viewport <- grid[[b]]
>         block <- DelayedArray:::extract_block(x, viewport)
>         if (!is.array(block))
>             block <- DelayedArray:::.as_array_or_matrix(block)
>         attr(block, "from_grid") <- grid
>         attr(block, "block_id") <- b
>         block
>     }
> }
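Putting the two snippets together for the cross-product case discussed above, a hedged sketch of the accumulation (this assumes the grid is chosen so that each block holds full columns, i.e. all genes for a subset of cells):
> ## FUN computes the gene-gene cross-product of one block;
> ## REDUCE keeps only the running sum on the master process
> FUN <- function(block, ...) tcrossprod(block)
> REDUCE <- `+`
>
> res <- bpiterate(ITER(x), FUN, REDUCE = REDUCE, BPPARAM = MulticoreParam(2))
> ## depending on the backend, the result may come back wrapped in a list
> xxt <- if (is.list(res)) res[[1]] else res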
Kevin Rue-Albrecht (16:47:54): > @Kevin Rue-Albrecht has left the channel
2018-06-22
Aaron Lun (05:43:58) (in thread): > If anyone’s interested in this, see https://github.com/LTLA/OkNN2018 for some benchmarking on simulated data. There are some pretty nice performance gains over kd-tree implementations for high-dimensional data - about 2-5 times faster, depending on the distribution of points. At low dimensions, kmknn is slower than kd-trees but that’s probably not a big deal as everyone is pretty fast. - Attachment (GitHub): LTLA/OkNN2018 > OkNN2018 - Code for performance testing of the kmknn package at https://github.com/LTLA/kmknn.
Aaron Lun (10:50:48) (in thread): > Woah sweet.
Aaron Lun (11:01:55) (in thread): > This seems to work pretty nicely: https://gist.github.com/LTLA/7cf5d6231a9616084803429348760760
Albert Kuo (13:47:29): > @Albert Kuo has joined the channel
Peter Hickey (14:02:32): > really cool, Martin. I’ve not used bpiterate() much, but i get the feeling it’s what I should be using for a bunch of problems instead of bplapply()
Peter Hickey (15:11:31) (in thread): > perhaps a question for @Martin Morgan: Aaron’s code returns a list if BPPARAM = SerialParam(). Is this a bug?
Mike Jiang (16:12:05): > http://rpubs.com/wjiang2/399331
Mike Jiang (16:12:40): > Some benchmark results for different on-disk solutions
Mike Jiang (16:14:15): > Memory usage isn’t very accurate. As we see, region selection is not measured properly because it ran after the random slicing task, which leaves some cache effects (apparently gc() didn’t take care of it)
Mike Jiang (16:15:35): > We do see bigmemory uses more RAM (due to its memory-mapping nature), but it is still in proportion to the size of the requested subset and much smaller than loading the entire dataset into memory.
Mike Jiang (16:18:06): > h5 has the advantage of file size due to its compression support. But IO speed is a little disappointing. Any comments?
Peter Hickey (16:19:39): > my experience is h5 can feel frustratingly slow for random subsets. > is one of the benchmarks the baseline of bringing the entire dataset into memory? there, at least, h5 feels fast in my experience
Peter Hickey (16:20:52): > with h5, i’ve resorted to loading big contiguous chunks into memory, subsetting in memory, then loading another chunk. i’ve wondered if that can be formalised, i.e. trade off intermediate memory usage for speed
Raphael Gottardo (16:25:12) (in thread): > Thanks @Mike Jiang for sharing. Has anyone considered leveraging bigmemory in BioC, for some of the work we’re doing? There are other associated packages that could also be useful, e.g. biganalytics, etc.
Mike Jiang (16:29:00) (in thread): > There is bigmemoryExtras in bioc, supporting eSet, but it doesn’t have Windows support for some reason
Mike Jiang (16:48:04) (in thread): > No, it was the subsetting with long shape, i.e. closer to the columnar-oriented chunking shape.
Peter Hickey (16:53:33) (in thread): > if it’s not much trouble i’d be interested in the ‘load it all’ baseline
Mike Jiang (17:10:49) (in thread): > The region selection should give you an idea of how each performs in terms of loading contiguous blocks, no matter whether it is all blocks or just some portion of them.
Mike Jiang (17:12:24) (in thread): > Besides, the purpose is to do partial IO; for this dataset it is probably ok, but for the original 1M data the load it all approach won’t be feasible
2018-06-23
Aaron Lun (05:44:57): > @Mike Jiang Is the default type="random"? HDF5 performance might be better for adjacent rows/columns, depending on how the file was chunked.
Mike Jiang (12:49:17) (in thread): > default type = 'subsetting', which includes both random_slicing and region_selection (i.e. adjacent rows/columns).
Mike Jiang (12:50:39) (in thread): > and h5 doesn’t show better performance either compared to the rest
Aaron Lun (12:52:44) (in thread): > right - I didn’t notice the facets
Aaron Lun (12:53:30) (in thread): > how was the HDF5 file chunked? If at all?
Aaron Lun (13:07:18) (in thread): > It would be surprising if decompression was taking up all of the extra time for HDF5, but I guess it’s not impossible.
Mike Jiang (13:28:13) (in thread): > it was chunked by column. I can re-chunk it to different shape to see how much it will improve.
Aaron Lun (13:29:51) (in thread): > :+1:
Aaron Lun (13:46:05) (in thread): > or even just test column-only access from an uncompressed file (which should be effectively col-major storage due to rhdf5’s internal transposition). That would rule out decompression and chunking overhead as factors.
Mike Jiang (14:03:36) (in thread): > row-wise and col-wise access is to be added to the testing tasks once DelayedArray operation is fully supported by mbenchmark. Currently there is some issue with DelayedArray: https://github.com/Bioconductor/DelayedArray/issues/21 - Attachment (GitHub): Error in extract_array when print the extended backend · Issue #21 · Bioconductor/DelayedArray > source code of extension is here https://github.com/RGLab/mbenchmark/blob/master/R/bmarrayseed.R To reproduce, install mbenchmark and run library(mbenchmark) mat <- matrix(seq_len(2e4), nrow = 1…
Mike Jiang (14:04:31) (in thread): > and I can certainly add uncompressed h5 to the test list
Aaron Lun (14:07:03) (in thread): > great.
Mike Jiang (14:53:49) (in thread): >
> > ridx <- sample(1e4, 1e2)
> Unit: milliseconds
>                              expr        min         lq       mean
>        as.matrix(bm[ridx, ridx])   7.872726   8.223395   8.706102
> as.matrix(hm.uncomp[ridx, ridx])  94.537694 104.117021 114.062108
>        as.matrix(hm[ridx, ridx])  87.594667  88.604313 102.453633
Mike Jiang (14:56:14) (in thread): > uncompressed is no better. Also chunking by squared chunks (hm.1k_by_1k) seems to be worse than column-wise chunking (hm) in random slicing
Mike Jiang (14:56:45) (in thread): >
> > ridx <- sample(1e4, 1e2)
> Unit: milliseconds
>                                expr        min         lq       mean     median         uq        max neval cld
>           as.matrix(bm[ridx, ridx])   5.880989   5.981789   6.139067    6.08259   6.268105   6.453621     3  a
>  as.matrix(hm.1k_by_1k[ridx, ridx]) 738.967623 772.511860 787.707618  806.05610 812.077615 818.099134     3    c
>           as.matrix(hm[ridx, ridx])  81.337151  84.107653  85.513872   86.87815  87.602233  88.326312     3   b
Aaron Lun (15:01:38) (in thread): > hm.
Aaron Lun (15:02:05) (in thread): > the uncompressed result is a surprise.
Aaron Lun (15:05:30) (in thread): > I suppose the next question is, whose fault is it? There’s a lot of R code in between DelayedArray’s [ and the actual call(s) to the HDF5 C library.
Aaron Lun (15:07:17) (in thread): > I suppose it could also be the HDF5 library that’s inherently slow, but that would be a worst-case scenario, as there’s nothing we can do about that.
Mike Jiang (15:10:24) (in thread): > all formats are used as DelayedArray backends, so it should be an rhdf5 or HDF5 C library issue
2018-06-24
Aaron Lun (05:20:44) (in thread): > Yes, there’s actually quite a bit of R code even in rhdf5. I’d be surprised if it were the cause, but who knows. There’s also the additional question of whether rhdf5 itself is calling the HDF5 C library in the “optimal manner”; I don’t know enough about that to be sure. @Mike Smith?
2018-06-25
Mike Jiang (12:42:45): > @Mike Jiang uploaded a file: image.png - File (PNG): image.png
Mike Jiang (12:42:59): > requested by @Aaron Lun, I added h5.uncomp (no compression) and h5.100_x_200 (chunking dims (100, 200))
2018-06-26
Peter Hickey (10:48:10) (in thread): > gentle ping, @Martin Morgan, should I post this to the github issue tracker?
Martin Morgan (10:50:33) (in thread): > yes on the github issue tracker would be good @Peter Hickey
Vince Carey (11:49:31): > @Mike Jiang will you be at developer day for Bioc2018 and would you consider presenting some of this there?
Mike Jiang (12:55:36) (in thread): > I’d love to, but unfortunately I won’t be back to US until July 27
Raphael Gottardo (13:41:45): > @Vince Carey I won’t be able to make it either but let me see if someone from my group can make it. Otherwise, we’re happy to make slides for you guys to present.
2018-06-28
Mike Jiang (19:12:31): > @Mike Jiang uploaded a file: image.png and commented: @Aaron Lun Here are the row/col-wise access results (thanks to @Hervé Pagès for helping resolve the indexing issue of the DelayedArray backend for bigmemory, so that all the row/col stats work out-of-the-box for different backends/formats) - File (PNG): image.png
Mike Jiang (19:14:31): > Since the data set is wide (1k by 30k), we see that rowSums takes relatively less time
Mike Jiang (19:16:10): > Not sure why the h5.by_col layout performs particularly worse than the rest; it could be artifacts from the page cache, which I have yet to deal with in my next run
Mike Jiang (19:18:32): > Also I set options(mc.cores = 1L) so that parallel block processing from DelayedArray is disabled
2018-06-29
Raphael Gottardo (16:04:04): > @Mike Jiang Something that I have come across that could be relevant: https://privefl.github.io/bigstatsr/articles/bigstatsr-and-bigmemory.html Basically iterating over blocks to gain efficiency.
Mike Jiang (16:30:42) (in thread): > block processing is already applied automatically (at least for row/col stats computing) by DelayedArray, since I wrapped the bm as the backend of a DelayedArray
Raphael Gottardo (16:32:29) (in thread): > ok, cool.
Mike Jiang (16:37:58) (in thread): > This is one advantage of leveraging DelayedArray, since it provides some very useful common paradigms like block and parallel processing as well as matrix stats. So we don’t need to reinvent the wheel
2018-06-30
Martin Morgan (09:25:57): > hdf5 offers interoperability with other programming languages, which shouldn’t be discounted
2018-07-02
Hervé Pagès (17:10:31): > @Aaron Lun @Mike Jiang Starting with DelayedArray 0.7.12, the block-processing mechanism is “chunk aware”, i.e. it tries to choose a block geometry that is compatible with the physical chunks. So if an HDF5 file is chunked by column then it will choose blocks that contain full columns. It will do this whether you’re calling rowSums() or colSums(), it doesn’t matter.
Aaron Lun (17:37:06): > Woah sweet.
2018-07-03
Aaron Lun (08:26:17): > @Kasper D. Hansen I was thinking about fast SVD again. It seems to me that the current crossproduct strategy is O(mn^2) for the cross-product itself (where n <= m), plus whatever time it takes to do an approximate SVD (linear in ‘n’, presumably). However, this may not always be faster than directly doing the approximate SVD on the original matrix - I don’t know irlba’s time complexity, but for square-ish matrices, computing the cross-product may actually be slower.
Kasper D. Hansen (08:42:34): > Oh yeah, this is not something to do for square matrices
Kasper D. Hansen (08:43:45): > So basically the various random matrix methods (with the caveat that there is one method I still need to fully understand) are ways of making appr. square matrices faster. If the matrix has very different numbers of cols/rows, do the crossproduct first to get a square matrix
Kasper D. Hansen (08:43:57): > then do random methods on the resulting square matrix
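For reference, a minimal sketch of the cross-product trick being described (the function name is illustrative, not an existing API): for a tall m x n matrix X with n << m, the right singular vectors of X are the eigenvectors of crossprod(X), so only an n x n matrix ever gets decomposed.
> ## sketch: SVD of a tall matrix X via its n x n cross-product
> crossprod_svd <- function(X, k = 10) {
>     C <- crossprod(X)                            # n x n, costs O(m n^2)
>     ev <- eigen(C, symmetric = TRUE)
>     d <- sqrt(pmax(ev$values[seq_len(k)], 0))    # top k singular values
>     V <- ev$vectors[, seq_len(k), drop = FALSE]  # right singular vectors
>     U <- X %*% sweep(V, 2, d, "/")               # recover left singular vectors
>     list(d = d, u = U, v = V)
> }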
Aaron Lun (08:44:05): > It would be nice to have a quickSVD wrapper to make this choice for us.
Kasper D. Hansen (08:44:11): > Yes
Kasper D. Hansen (08:44:27): > The good news is that this is something I want to make major progress on in July and August
Aaron Lun (08:44:32): > Oh sweet.
Aaron Lun (08:44:46): > Good timing then. I have some packages that would benefit from it.
Aaron Lun (08:45:13): > Especially if the resultant package comes out before the next release.
2018-07-20
Davide Risso (16:19:38): > @Aaron Lun quick question about beachmat: is there a way to automagically extract a random subset of rows from a matrix or am I limited to extracting either one row at a time or a set of contiguous rows?
Aaron Lun (16:40:56): > Currently, the whole thing is built around one row or column at a time. I could probably generalize it somewhat to obtain sets of rows or columns at a time; you can put in an issue on the repo and I’ll get around to it next month.
Davide Risso (16:46:20): > OK, thanks! So is the current way of getting a random subset of rows to just loop on a set of random indexes?
Aaron Lun (16:47:38): > yes, basically.
Aaron Lun (16:48:09): > Preferably sorted random indices to take advantage of caching in sparse/HDF5 matrices.
Davide Risso (16:49:04): > Good point about the sorting thanks!
Aaron Lun (16:49:35): > For rows, native support for a random subset will be more efficient, due to speed-ups with the microprocessor cache when you have multiple elements from one column. So, it’s probably worth adding. But I will have to think deeply about how to support this in a general fashion.
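As a small illustration of the looping pattern being suggested (mat here is a placeholder for an HDF5Matrix or other DelayedMatrix), sorting the sampled indices first so that chunk caching is reused:
> ridx <- sort(sample(nrow(mat), 1000))        # sorted random row indices
> sub <- as.matrix(mat[ridx, ])                # one extraction, in sorted order
> ## or one row at a time, still in sorted order:
> for (r in ridx) {
>     row <- as.matrix(mat[r, , drop = FALSE]) # realize one row
>     ## ... use 'row' ...
> }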
Aaron Lun (16:54:16): > I also need to catch up on the latest state of affairs with the chunk-detection infrastructure in DelayedArray; especially with respect to what user-settable global options I need to respect during access. @Hervé Pagès I may bug you soon about this.
Aaron Lun (16:56:12) (in thread): > out of curiosity, what is this for?
2018-07-23
Kasper D. Hansen (12:01:15): > Depending on storage format you’ll find it slow
Kasper D. Hansen (12:01:28): > If you access many chunks
Kasper D. Hansen (12:01:35): > (this is to@Davide Risso)
Davide Risso (12:05:13): > So would shuffling the rows first and then access contiguous rows be faster?
Davide Risso (12:05:35): > Or is it a matter of optimizing the storage format?
Davide Risso (12:06:51) (in thread): > We are implementing mini-batch k-means for the czi proposal
Davide Risso (12:07:44) (in thread): > That’s what Python suggests you do when you want to run kmeans on large datasets
Davide Risso (12:07:56) (in thread): > But we’re still at the level of feasibility
Kasper D. Hansen (12:12:36): > That is probably always true
Kasper D. Hansen (12:13:01): > For HDF5 you have the issues of chunks in the file. To read an entry, you need to read the entire chunk containing the entry
Kasper D. Hansen (12:13:30): > So if your chunks are (all genes, a single sample) there are no issues with random accesss to samples
Aaron Lun (12:13:44): > We have beachmat::rechunkByRow. Or margins, can’t remember the name.
Kasper D. Hansen (12:13:46): > If your chunks are (single gene, all samples) you need to read the entire file
Kasper D. Hansen (12:14:07): > But doesn’t that require writing a new file?
Aaron Lun (12:14:10): > Yes.
Aaron Lun (12:14:19): > But it may still be faster than trying to access a poorly chunked file.
Kasper D. Hansen (12:14:30): > Yes, true. Also depends on data size
Kasper D. Hansen (12:14:39): > Remember the scRNA data is not that big still
Kasper D. Hansen (12:16:07): > All I am saying is that random access to rows and/or columns …. the ability to sample things at random …. depends a lot on the storage format (+ options set at file creation) and can go from “no penalty at all” to “it’s the same time as reading everything” depending on choices
Davide Risso (12:17:03): > Thanks! This is a very good point to keep in mind
Kasper D. Hansen (12:17:30): > We will likely choose chunksizes based on access patterns. And we have discussed having multiple versions of the same file (since storage is cheap) to allow for different patterns
Kasper D. Hansen (12:18:05): > This seems likely to be useful for scRNA data, which I again remind people is not that big
Kasper D. Hansen (12:18:22): > Just keep it in mind when doing algorithms
Kasper D. Hansen (12:18:27): > Thats all I am saying
Kasper D. Hansen (12:18:47): > Worse are algorithms which sample individual entries in the data matrix
2018-07-26
Matthew Oldach (12:56:26): > @Matthew Oldach has joined the channel
2018-07-27
Aaron Lun (11:10:08): > @Kasper D. Hansen How’s the faster SVD going?
Kasper D. Hansen (16:38:49): > Forward, but far from done. I was so happy to not see you here since I hadn’t gotten as far as I wanted
Kasper D. Hansen (16:39:12): > On the plus side my knowledge of block processing in DelayedArray is 1000x greater now
Aaron Lun (21:37:23): > Ha lol
2018-08-06
Aaron Lun (06:02:49): > I wonder, how many people are using RleArrays?
Aaron Lun (06:03:08): > I currently have specialized support for this class in beachmat, but I’m not sure if that’s worth maintaining.
Aaron Lun (06:03:44): > That is to say, removing this specialized support will mean that handling of RleArrays would fall back to the general block-processing mechanism.
Kasper D. Hansen (13:59:19): > I am not, but I think of it as a potentially very interesting special case
Kasper D. Hansen (14:00:22): > I don’t know if beachmat access makes “sense” based on the data structure, but if I could vote I would say retain it. I mean, one promise of beachmat is essentially bindings to everything. Although I recognize it is easier to say when it doesn’t cost me anything.
Aaron Lun (14:08:31): > The problem is that the current version of beachmat only supports the most basic type of RleArray; the more exotic types are not handled (e.g., using raw vectors, or using chunks). If I declare I’m supporting RleMatrix objects, I’m obliged to support all of them, which is a pain given that I don’t know how many people are using it.
Aaron Lun (14:08:59): > I mean, RleMatrix objects will still be accepted by beachmat, but the access scheme will involve block processing via R, rather than native C++ extraction.
Aaron Lun (14:09:12): > So all code will still work with RleMatrix objects, just not in the most efficient manner.
Aaron Lun (14:09:43): > Speaking of ill-considered ideas, I’ve thrown out native support for Psym matrices, which was a good idea at the time but sees no actual use.
Kasper D. Hansen (14:21:46): > Well, you’re the developer, but “no actual use” seems a bit early
Aaron Lun (14:24:45): > Well, for me, at least.
Aaron Lun (14:25:52): > Packed symmetric matrices are a real pain because you need to reflect around the diagonal. Took ages to write and debug - at one point I was just guessing what the offsets should be.
Aaron Lun (14:26:57): > But even then you only get a 2-fold saving in memory, which isn’t really good enough when the matrix is large.
Aaron Lun (14:28:01): > So in short, why would you want to compute a distance matrix that is so large that (i) you need to save memory and (ii) a two-fold saving is good enough? A very tight goldilocks zone.
Aaron Lun (14:29:36): > Keep in mind that a fully dense matrix will be faster to access as well, so it’s not like the memory saving is costless.
Aaron Lun (14:29:45): > Probably somewhat relevant for a cross-product SVD.
Kasper D. Hansen (15:15:41): > Knowing your matrix is symmetric could speed up some algorithms
Kasper D. Hansen (15:16:05): > I agree effect would not be amazing (>10x) for access and storage
Kasper D. Hansen (15:17:04): > They are used for other things besides distances …:slightly_smiling_face:But of course usability requires other methods on these data types are supported
Aaron Lun (15:20:30): > No doubt about the speed-up, but you don’t need the symmetric matrix to be stored in a packed form.
Kasper D. Hansen (15:21:11): > fair enough
Aaron Lun (15:21:17): > The only advantage from a packed form (access-wise) is that you reduce cache misses
Aaron Lun (15:21:27): > but that assumes you don’t need to access the other side of the diagonal
Aaron Lun (15:21:35): > and if you do, then you’ll get cache misses like crazy
Kasper D. Hansen (15:21:47): > Ok, I am convinced, but will add one final thing, which is that a selling point of beachmat is “use the same interface to everything”
Kasper D. Hansen (15:21:57): > But I would probably not do it
Kasper D. Hansen (15:22:08): > (with the caveat that if you have figured it out - why remove it?)
Kasper D. Hansen (15:22:26): > Ok, that was hard to understand
Aaron Lun (15:22:31): > So your comment still holds - beachmat will accept any matrix-like object.
Aaron Lun (15:22:34): > even now.
Aaron Lun (15:22:46): > The only difference is how it does the access
Kasper D. Hansen (15:22:54): > I think I can see why you would not implement it. I am less clear on why you want to remove functionality
Aaron Lun (15:23:04): > for simple, dense, sparse and HDF5 matrices, it uses native C++ methods for max speed.
Aaron Lun (15:23:24): > For everything else, it reverts to a block processing strategy (silently under the hood - the user experience is the same).
Aaron Lun (15:24:24): > Re. removing functionality; more of a maintenance thing, as the nature of the class structure means that whenever I add a new feature, I need to implement it for every representation, otherwise the entire thing doesn’t compile.
Kasper D. Hansen (15:30:22): > ah
Aaron Lun (15:34:12): > so every time I add a new feature, I have to spend a day to do it, and then another day fixing the compilation errors, because I can’t do them bit by bit.
Kasper D. Hansen (15:39:07): > sounds painful
Aaron Lun (15:42:15): > Well, at least the compiler does give me useful messages. I spent yesterday debugging iSEE; shiny is useless when bugs occur in observers.
2018-08-10
Aaron Lun (05:47:53): > Finally, the refactored beachmat passes its test suite: - File (PNG): Pasted image at 2018-08-10, 10:47 AM
Kasper D. Hansen (08:26:41): > 40112 tests! Impressive!
Sean Davis (13:36:40): > :fast_parrot:
2018-08-11
Aaron Lun (11:51:46): > … and as a result, it times out on the Bioc windows machines. Does anyone remember how to specify long tests that don’t get run every build cycle?
2018-08-13
Aaron Lun (06:25:49): > Anyone? @Hervé Pagès? @Martin Morgan? I remember seeing something about long tests on the mailing list but I can’t find it anymore.
Martin Morgan (07:26:59): > https://stat.ethz.ch/pipermail/bioc-devel/2017-November/012326.html I don’t know whether this is still current; from http://bioconductor.org/checkResults/ there’s a link to release longtests http://bioconductor.org/checkResults/3.7/bioc-longtests-LATEST/. All packages are failing.
Aaron Lun (07:44:48): > Thanks Martin. Are these enabled for 3.8?
2018-08-14
Aaron Lun (12:49:46): > So - uh - any suggestions on using long tests for beachmat? Should I just move them and hope for the best?
Hervé Pagès (13:53:48): > @Aaron Lun I repaired the long tests builds in release. They stopped running on June 9 because of some changes we made to the build system code that broke them. Also set them up in devel. We should get new reports for release and devel next Saturday. It’s easy for me to forget to check the reports on Saturday so let us know if you notice anything wrong (preferably by sending an email to bioc-devel). Thanks!
2018-08-15
Aaron Lun (06:30:23): > Thanks @Hervé Pagès. Are these tests to be run on all systems? Currently only malbec results are shown.
Aaron Lun (06:40:19): > Also, what’s the reason for using a top-level longtests/ rather than inst/longtests/? CHECK complains about Non-standard file/directory found at top level: 'longtests' with the former.
Aaron Lun (06:43:11): > I guess that anything in inst also gets put into the installation directory, but we could use .Rinstignore if we’re concerned about space.
Hervé Pagès (15:11:15): > @Aaron Lun They run on Linux and Mac in release but only on Linux in devel at the moment. We could add Mac in the near future. About the CHECK warning (or note?) you got: I’ve never seen it and can’t reproduce it. Don’t see it on the CHECK report for HDF5Array either: https://bioconductor.org/checkResults/3.8/bioc-LATEST/HDF5Array/malbec1-checksrc.html Can you provide more details? Thanks!
Aaron Lun (15:12:32): > Hm. That’s funny. It occurs for me at * checking top-level files ... NOTE. Maybe because I do --as-cran.
Hervé Pagès (15:14:30): > Yes, that must be it. Many Bioconductor packages have arbitrary top-level folders with various things inside and I don’t think we should worry about this or consider it bad practice.
Aaron Lun (15:14:49): > Okay.
2018-08-16
Aaron Lun (13:02:17): > @Hervé Pagès Will there be Windows support for the long tests at any point? I ask because - despite my distaste for it - builds on Windows are remarkably good at picking up problems in the C++ code.
Aaron Lun (13:02:51): > i.e., if there’s something non-standard or a segfault, Windows will often throw up.
Hervé Pagès (13:56:01): > I don’t know about running the long tests on Windows. Compared to the other platforms, the Windows builds are harder to set up, fail often (i.e. don’t even finish), tend to report more false positives, and require constant monitoring. So we’re spending a lot of time and effort to keep them up and running. For this reason we don’t run the data experiment builds on Windows either. I’m not sure it makes sense at this point to spend even more resources to run the long tests builds on Windows, at least not until more people implement long tests in their package (so we maximize the return on investment). Martin? Yes Windows is picky and good at picking up problems in C/C++ code but is this really the purpose of the long tests? Can’t those problems be picked up by the normal tests?
Hervé Pagès (14:24:28): > I forgot to ping Martin so here you go@Martin Morgan
Martin Morgan (14:38:45): > I’m in the same camp as HERVE (accents are apparently optional if using caps), that the long tests are designed for qualitatively different diagnostics.
Aaron Lun (18:14:59): > okay.
Aaron Lun (18:17:41): > On another note @Hervé Pagès, a question about DelayedArray performance. Let’s say I have an HDF5Matrix y and I create a DelayedMatrix x = t(t(y)/v) for some vector v of length equal to ncol(y). If I want to realize, say, x[1:100,], what happens under the hood?
Aaron Lun (18:20:12): > Currently, the 1M cell analysis pipeline writes a new HDF5Matrix containing the normalized expression values (effectively log2(x+1)). This takes about 3 hours on my machine, which obviously is a high upfront cost. On the other hand, it avoids repeating the calculation later, which might be cheaper in the long run if the realization (to a dense array in memory) is not trivial.
Aaron Lun (18:21:13): > To complete the story, I’ve asked for x[1:100,] because beachmat switches to a block-processing mechanism when it encounters a matrix that it does not understand. That is, a block is realized in R and then passed back to C++ under the hood. This involves a call to R from C++, so is probably less efficient than direct access from an HDF5 file containing the already-computed normalized expression values. (I say probably because the normalized values are doubles while the originals are ints, so I’m not sure whether the 2x data size would offset the cost of the R call.)
Aaron Lun (18:35:35): > I guess now that I’ve written it out, 3 hours is quite a long time… but if block realization is expensive, it might still be worth it, especially for multi-pass algorithms.
Hervé Pagès (19:46:25): > It depends how you realize it. If you realize it in memory e.g. with as.matrix(x[1:100, ]) then no block-processing is used: the first 100 rows are loaded in memory, then transposed, then divided by v, then transposed again. If you realize it to disk with as(x[1:100, ], "HDF5Array") or writeHDF5Array(x[1:100, ]) then block-processing is used, that is, blocks of the object to write (x[1:100, ] in this case) are realized in memory one at a time and written to disk. By default the grid of blocks generated by blockGrid(x[1:100, ]) will be used. I recently added many ways to control the default grid (see ?blockGrid and ?setDefaultGridMaker). With the default block size being 100 Mb (recently increased from 45 Mb in devel), blockGrid(x[1:100, ]) will define a grid of 11 blocks of 100 x 125000 each (except for the last block, which will be only 100 x 56127) if x is t(t(counts(TENxBrainData())) / v).
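For reference, the two realization paths just described, written out (TENxBrainData() is used only because it is the running example here; v is a stand-in for per-cell size factors):
> library(TENxBrainData)
> library(HDF5Array)
>
> y <- counts(TENxBrainData())        # HDF5Matrix of counts
> v <- runif(ncol(y), 1, 2)           # placeholder scaling vector
> x <- t(t(y) / v)                    # DelayedMatrix with delayed ops
>
> m <- as.matrix(x[1:100, ])          # in-memory realization, no block processing
> sink <- writeHDF5Array(x[1:100, ])  # on-disk realization, block by block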
Hervé Pagès (20:24:25): > Yep, I confirm. writeHDF5Array(x[1:100, ]) takes about 3.5 h on my laptop :thinking_face: I did some naive profiling and it seems that about 46% of this time is spent loading and realizing the blocks in memory and 54% is spent writing the realized blocks to disk. When is HDF5 going to support concurrent write access so I can parallelize writeHDF5Array()? Even if concurrent writing turns out to not be significantly faster, at least the reading part would be parallelized and we know that this would make the whole reading part faster. In the meantime I will take a closer look at rhdf5::h5write() (which writeHDF5Array() is based on) and see whether there is room for optimization. I’ve wanted to do this for a while…
2018-08-17
Aaron Lun (04:54:14): > thanks herve
Aaron Lun (10:09:01): > I’ll just mention that beachmat would require parallelization to occur at the C++ level for it to help the C++ code “as is” (i.e., with no further modification). Otherwise the parallelization would have to occur via R using BiocParallel, which would require some code to be rewritten for multiple .Call statements, e.g., via bplapply. I’d hope that any parallelization from the HDF5 group would be in the HDF5 C library itself…
Martin Morgan (10:13:17): > probably compression is a big cost here? also is this one of the ‘hidden’ costs of writing a full matrix rather than something like the tenx format, where it just takes a long time to realize all those zeros?
Aaron Lun (10:15:46): > Compression + chunk management; with a lot of smaller chunks it takes a while to aggregate information for a single row/column across chunks, even when the chunk cache is large enough to contain the entire matrix in memory.
Aaron Lun (10:16:04): > Avoiding the reading/writing of zeroes would probably help a lot.
Aaron Lun (10:16:49): > Probably not that much for the final file size, but at least it would take the strain off (de)compressing.
Kasper D. Hansen (10:20:20): > seems like 3.5h for a log-transform (log2(x+1)) is too much
Kasper D. Hansen (10:20:25): > I will time it in memory
Kasper D. Hansen (11:11:12): > Ok, why am I having problems sucking all of tenx into memory, using
> counts_in_memory = as.matrix(counts(tenx))
> I get a memory allocation error, but I highly suspect this is due to 32-bit length limits, because I can easily do 100,000 cells but not 300,000 and the number of entries in the 100,000-cell matrix is 2.7998e9
Kasper D. Hansen (11:12:14): > For 100,000 cells the dense matrix is 11.2Gb and it takes 132s (single core) to compute log2(x+1) on this matrix
Kasper D. Hansen (11:12:52): > Assuming linear scaling that comes to 29m for the entire matrix
Kasper D. Hansen (11:13:17): > That makes the 3.5h look not crazy bad, I must say
Kasper D. Hansen (11:14:20): > That is a factor of 7, but the 29m does not include reading the matrix into memory, which the 3.5h does
Kasper D. Hansen (11:43:41): > > Ok, why am I having problems sucking all of tenx into memory, > I think I made a memory request error, forgetting about hard vs soft memory limits
Kasper D. Hansen (12:01:20): > seems to be roughly linear in time: 300,000 takes 376s and takes up 33.6Gb
Martin Morgan (13:00:23): > If I process all the tenx brain data in the original format, writing to a temporary file without compression (it’s temporary, anyway…)
> library(rhdf5)
>
> fname <- "/home/mtmorgan/.ExperimentHub/1039"
> group <- "mm10"
>
> offset <- ceiling(seq(0, 2624828308, length.out=50))
> start <- head(offset, -1) + 1L
> count <- diff(offset)
> grp <- paste0(group, "/data")
>
> fout <- tempfile(fileext=".h5")
> h5createFile(fout)
> h5createDataset(
>     fout, "log2(x+1)", 2624828308, storage.mode = "double", chunk = max(count),
>     level = 0L
> )
>
> system.time({
>     for (i in seq_along(start)) {
>         message(i)
>         x <- h5read(fname, grp, start = start[i], count = count[i])
>         h5write(
>             log2(x + 1L), fout, "log2(x+1)", start = start[i], count = length(x)
>         )
>     }
> })
> it takes about 264s (i.e., 4 1/2 minutes) and modest memory consumption. Reading and transforming without saving takes about 180s so 3.5 hours seems very long.
Kasper D. Hansen (13:40:03): > oh, thats interesting
Kasper D. Hansen (13:41:05): > why is this faster than realizing the entire matrix and just computing the log2(x+1)? That seems weird. I am wondering if I am messing up something.
Aaron Lun (14:08:33): > It seems like the above code only involves, in effect, the non-zero values?
Aaron Lun (14:09:00): > Am I reading this right?
Hervé Pagès (14:11:02): > Right. If you don’t store the information about the row and col indices along with the non-zero values, you end up with a file that is not standalone i.e. you won’t be able to read and make sense of it without grabbing the row and col indices from somewhere else.
Hervé Pagès (14:21:33): > I recently added the TENxMatrix class and constructor to the HDF5Array package to make it easy to operate directly on the original format (sparse). See ?TENxMatrix. I’m currently working on a writeTENxMatrix() function so it will be easy to write temporary/intermediate datasets in this format too, e.g. writeTENxMatrix(log2(TENxMatrix(fname, "mm10") + 1)). Hopefully using this intermediate format will speed up things a little.
Kasper D. Hansen (14:28:44): > It would be great to be able to experiment with sparsity
Kasper D. Hansen (14:29:07): > The issue with sparsity is that most of the things we want to do to the matrix immediately destroy the sparsity
Kasper D. Hansen (14:29:33): > A side question: do we have the 68k PBMC data in a convenient bioconductor package form?
Aaron Lun (14:30:30): > It’s not particularly hard to read it in as an SCE, see https://github.com/MarioniLab/FurtherMNN2018/blob/master/droplet/prepareData.R - Attachment (GitHub): MarioniLab/FurtherMNN2018 > FurtherMNN2018 - Code for further development of the mutual nearest neighbours batch correction method, as implemented in the scran package.
Aaron Lun (14:30:41): > lines 12-16.
Kasper D. Hansen (14:31:17): > thanks
Kasper D. Hansen (14:31:48): > Still, might be worthwhile to consider to have in a package for benchmarking / testing / etc
Kasper D. Hansen (14:31:58): > but this is super helpful
Davide Risso (15:00:39): > +1 for Kasper’s request
Davide Risso (15:00:59): > Having a “standard” medium-size dataset for our benchmarking would be nice
Davide Risso (15:01:21): > putting the 68k PBMCs into a package would make it the default dataset for most of us
Aaron Lun (15:10:04): > I should mention that my 3 hour estimate comes from beachmat via the HDF5 C++ API, so it’s not like R itself is soaking up the time either.
Aaron Lun (16:39:31): > In any case, I will parallelize the remaining scater and scran functions in the pipeline, and we can start burning through CZI’s Amazon budget. I suspect that not writing the HDF5 file will be the simplest approach; we’ll have to put up with 2 re-reads (once for modelling gene-wise variation, and another for the PCA), but even that should only take ~1 hour total based on @Kasper D. Hansen’s timings. So all in all it will be cheaper - and once we have the first 50 PCs, it’s game over basically. I can do batch correction on the PCs with fastMNN, and we can just throw some scalable clustering and visualization method at it.
Aaron Lun (16:42:37): > Note that the 1 hour read is also much more easily parallelizable than the 1.5 hour write, assuming that the file system is amenable to parallel reads.
Aaron Lun (16:44:52): > Sorry, that’s a 1.5 hour read (based on Herve’s timings) + 0.5 hour calculation (based on Kasper’s timings), so I guess it’s 2 hours per full re-read of the data set. And I guess the PCA will need to read the data multiple times as well… nonetheless, the parallelization should still be a net benefit.
Aaron Lun (16:46:36): > Alternatively we could do the same thing from a TENxMatrix… we don’t need to break the sparsity until we get to the PCA, and even then it should be possible to either “factor out” the subtraction or do it “on the fly” without saving the de-sparsified results back to an HDF5 file.
2018-08-20
Aaron Lun (05:26:29): > beachmat now supports something that’s a bit magical: if you write C++ getters/setters for your matrix class in your package, beachmat can link to those functions and directly use them without going through R. See http://bioconductor.org/packages/3.8/bioc/vignettes/beachmat/inst/doc/external.html for details.
Aaron Lun (05:57:10): > Well… at least it would, if the longtests didn’t segfault on the Bioc machines.
Kasper D. Hansen (12:29:39): > this discussion about sparsity and the associated timings has changed my mind. While I still think that we will eventually lose sparsity, it seems natural to exploit it when we have it (like in the log2(x+1) transformation timings above).
Kasper D. Hansen (12:30:09): > What are our options for reading in (part of) an HDF5 file in some sparse matrix format?
Kasper D. Hansen (12:30:33): > Is that possible? Does it require the HDF5 file to be in a specific format
Kasper D. Hansen (12:30:41): > Pinging @Peter Hickey and @Hervé Pagès
Martin Morgan (13:21:52) (in thread): > just a small comment about log2(x + n) – it does actually result in a dense matrix (not in my snippet / implementation above, but e.g., in Matrix)! But obviously can be represented as sparse + offset
Kasper D. Hansen (13:27:17) (in thread): > we used n=1 which makes the result sparse. But yes, the use case of sparse + constant, or sparse + column constants might be important
Aaron Lun (13:30:37): > If we start with a TENxMatrix and load parts of the file into memory, we should get an immediate speed boost from reducing file IO, even if the in-memory matrix was dense. Of course, the best approach would be for the in-memory chunks to be themselves sparse, but this may end up being lost anyway.
Kasper D. Hansen (13:37:29): > Yeah, so this is what I thought was not worth thinking about, but the 30m->4m is compelling
Kasper D. Hansen (13:37:44): > (and remember my 30m was without IO)
Aaron Lun (15:18:33): > So we’d basically need a method that, upon seeing a TENxMatrix, will yield the requested submatrix in dgCMatrix form. I guess we could set the realization backend manually, but it would be nice for the proposed method to know that it’s a TENxMatrix and return a sparse matrix automatically.
Aaron Lun (15:22:03): > Of course, this trusts that all downstream procedures are capable of taking both sparse and non-sparse inputs.
Aaron Lun (15:23:20): > And not only that, procedures must be able to meaningfully exploit sparsity, e.g., by skipping zeroes. Otherwise you’d have to just expand things to a dense array, e.g., if you were using LAPACK libraries.
Kasper D. Hansen (15:26:11): > Supporting this through a stack is clearly not trivial. But I am expecting that eventually we will settle on a standard “pipeline” from raw counts to something and that pipeline will start with a sparse matrix and eventually escape to a dense matrix. First step in all of this is to think about returning (parts of) the data matrix as a sparse object. Without that we cannot experiment
Kasper D. Hansen (15:26:58): > Its just that if we always start with log-transforming counts, going from 30m (no IO) -> 4m (including IO) is a pretty impressive time saver
Kasper D. Hansen (15:27:20): > Also, what is TENxMatrix? A new class?
Aaron Lun (15:27:38): > It’s been around for a while - provides IO directly from the 10X HDF5 file. I think it now lives in HDF5Array.
Kasper D. Hansen (15:27:57): > oh
Aaron Lun (15:28:04): > I guess it is “new”, as it used to be in TENxGenomics, which isn’t in BioC.
Kasper D. Hansen (15:28:07): > so this is the 10X sparse file right
Kasper D. Hansen (15:28:19): > as opposed to a standard HDF5 dense file
Aaron Lun (15:28:38): > Yes, that’s right. AFAIK it doesn’t use any of the HDF5 library’s facilities for data set chunking
Aaron Lun (15:28:55): > because it’s just stored as a stream of counts and row/column indices.
Kasper D. Hansen (15:29:00): > ok
Kasper D. Hansen (15:29:23): > So I know nothing about HDF5s capacity for sparse matrices
Aaron Lun (15:29:34): > I don’t think it really has any.
Kasper D. Hansen (15:29:51): > oh, so it is TENx which is making their own circumvention
Kasper D. Hansen (15:29:53): > weird
Aaron Lun (15:30:58): > googling “HDF5 sparse matrix” gives you the 10x website on the first page of results…
Aaron Lun (15:31:06): > so, yeah.
Kasper D. Hansen (15:31:30): > If thats the case I am not sure I am making sense. Pretty clear to me that if it is just TENx rolling their own sparse format into HDF5, that everything you do with it is likely to be suboptimal
Aaron Lun (15:32:16): > Nonetheless, it is still reading less data than a fully dense matrix that would have to be stored in the standard HDF5 format. And compression can only do so much.
Kasper D. Hansen (15:33:03): > fair enough, but can you say “I want this sub matrix of data”
Aaron Lun (15:33:09): > AFAIK no.
Aaron Lun (15:33:18): > you can get columns easily enough
Aaron Lun (15:33:22): > assuming it’s stored as CSC
Aaron Lun (15:33:25): > which it seems to be.
Kasper D. Hansen (15:33:46): > hmm ok
Aaron Lun (15:33:50): > But per-row access would be tortuous.
Kasper D. Hansen (15:33:58): > compression can be magic though
Kasper D. Hansen (15:34:06): > as Ive just learned yet again
Aaron Lun (15:34:32): > Sure, but that probably accounts for a fair bit of time in I/O.
Aaron Lun (15:34:51): > And chunk management isn’t free.
Kasper D. Hansen (15:34:54): > depends. I find it hard to reason about
Aaron Lun (15:36:28): > Anyway, if you want to get row-level data from a CSC matrix, you’ll need to hop along doing a binary search in each column segment - very painful. The only sensible way would be to realize chunks at a time.
Martin Morgan (16:13:33): > I don’t think the binary search (or even linear scan, equivalent of x[row_idx %in% rows_I_want]) would be that painful computationally, if one were interested in a sub-matrix. If whole-row slices are a common use case, an option would be to store row-oriented ‘10x’-style data beside the current column-oriented data.
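For concreteness, a rough sketch of that kind of scan, assuming the usual 10x HDF5 layout (a group containing data, indices and indptr datasets in CSC order); the helper name is hypothetical, and it pulls one block of columns while keeping only the requested rows:
> library(rhdf5)
> library(Matrix)
>
> ## read columns j1..j2 and keep nonzeros whose row is in 'rows_I_want' (1-based)
> read_rows_from_csc <- function(fname, group, j1, j2, rows_I_want, nrow) {
>     indptr <- h5read(fname, paste0(group, "/indptr"))  # 0-based column offsets
>     first <- indptr[j1] + 1                            # first nonzero of column j1
>     last <- indptr[j2 + 1]                             # last nonzero of column j2
>     n <- last - first + 1
>     x <- h5read(fname, paste0(group, "/data"), start = first, count = n)
>     i <- h5read(fname, paste0(group, "/indices"), start = first, count = n) + 1
>     j <- rep(j1:j2, diff(indptr[j1:(j2 + 1)]))         # column of each nonzero
>     keep <- i %in% rows_I_want                         # the linear scan
>     sparseMatrix(i = i[keep], j = j[keep] - j1 + 1, x = as.numeric(x[keep]),
>                  dims = c(nrow, j2 - j1 + 1))
> }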
Kasper D. Hansen (16:17:05): > before doing that, do we have any benchmark for storage / data retrieval of this “custom” sparse format vs. compressed HDF5
Kasper D. Hansen (16:17:17): > Especially for compression for different chunkdims.
Aaron Lun (16:34:39): > In any case, parallelization of the relevant scater and scran functions is complete. I reckon it’s time to start throwing some cores at the TenX analysis. Does anyone have any particular advice on using BiocParallel on a SLURM cluster, before I hit the docs re. BatchJobsParam()?
Martin Morgan (17:21:42): > BatchtoolsParam(); there’s a vignette http://bioconductor.org/packages/devel/bioc/vignettes/BiocParallel/inst/doc/BiocParallel_BatchtoolsParam.pdf
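For the record, a minimal sketch of running a toy job through BatchtoolsParam on SLURM (the template file name is a placeholder; its contents and the available resources are site-specific):
> library(BiocParallel)
>
> piApprox <- function(n) {
>     ## Monte Carlo estimate of pi
>     inside <- rowSums(matrix(runif(2 * n), ncol = 2)^2) <= 1
>     4 * mean(inside)
> }
>
> param <- BatchtoolsParam(workers = 10, cluster = "slurm",
>                          template = "slurm-simple.tmpl")
> register(param)
> res <- bplapply(rep(1e6, 10), piApprox)
> mean(unlist(res))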
Aaron Lun (18:08:00): > Oh. What’s the distinction from BatchJobsParam() then (Section 3 of the vignette)?
Aaron Lun (18:10:36): > Also - like in SnowParam - can I assume that if I pass a package function to FUN for bplapply, the package namespace is automatically loaded in each worker? Or do I have to manually specify it in the registryargs?
Martin Morgan (18:16:45): > BatchJobs / batchtools is an upstream change – the author of BatchJobs wants new users to use batchtools, I think because of race conditions that they were unable to solve using the BatchJobs implementation. I think the answer to the FUN part of the question is yes, but honestly I am not sure.
2018-08-21
Kasper D. Hansen (11:10:50): > basic Q: How do I create a matrix in R which utilizes a long vector.
Kasper D. Hansen (11:12:39): >
> > mat = matrix(0, ncol = 69000, nrow = 33000)
> Error: vector memory exhausted (limit reached?)
Kasper D. Hansen (11:14:01): > This must be because 33000*69000 > (2^31 - 1) is TRUE
Aaron Lun (11:15:09): > That shouldn’t be a problem - I think as long as the dimensions are not > 2^31-1, it should be create-able. Trying to dig up the docs now…
Kasper D. Hansen (11:15:37): > Well, as you can see it is kind of a problem:slightly_smiling_face:
Kasper D. Hansen (11:15:44): > I am surprised as well
Kasper D. Hansen (11:16:27): > but mat = matrix(ncol = 69000, nrow = 33000) works
Aaron Lun (11:16:30): > https://stat.ethz.ch/R-manual/R-devel/library/base/html/Memory-limits.html
Aaron Lun (11:17:10): > I would say that it’s because if you don’t specify the type, it assumes logical.
Aaron Lun (11:17:32): > Which is 4 bytes per entry, so some math suggests your second mat would be 9GB in size.
Aaron Lun (11:17:51): > Your double mat would be twice as large, so I guess you just don’t have enough addressable memory on your system?
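Spelling out the arithmetic behind those estimates:
> 33000 * 69000             # 2.277e9 entries, more than 2^31 - 1
> 33000 * 69000 * 4 / 1e9   # ~9.1 GB as a logical matrix (4 bytes per entry)
> 33000 * 69000 * 8 / 1e9   # ~18.2 GB as a double matrix (8 bytes per entry)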
Kasper D. Hansen (11:18:09): > shouldn’t that just swap
Kasper D. Hansen (11:18:18): > Ok, Im trying this on bigger memory machine
Kasper D. Hansen (11:19:31): > it works
Kasper D. Hansen (11:19:40): > weird, this is not the usual out-of-memory error I get
Kasper D. Hansen (11:19:55): > and it ends up being 18.2Gb
Kasper D. Hansen (11:19:57): > Thanks
Kasper D. Hansen (22:23:01): > @Martin Morgan We have been preparing a TENxPBMCData package mimicking the BrainData, but with smaller dataset(s). If we want to put it on ExperimentHub, should we follow the style of preparing the data which is present in inst/scripts/makeData.R in the BrainData package? We are basically just following stuff @Aaron Lun and you did.
2018-08-22
Aaron Lun (05:43:01): > @Peter Hickey Do DelayedMatrixStats::colSums2 and friends respond to choices of bpparam()? I can’t really notice a difference in speed upon registering different BiocParallelParam objects.
Martin Morgan (06:03:31) (in thread): > @Kasper D. Hansen yes, Aaron did a nice job of the 10xBrainData package; the script is meant to provide reproducibility.
Aaron Lun (06:52:38): > Also, @Martin Morgan; a few releases ago, I recall there being some issues with BiocParallel on Macs, e.g., https://support.bioconductor.org/p/88307/. Was your fix implemented, and have you had any more reports of odd behaviour? From previous releases (i.e., ~2 years ago), I have seen stalling of MulticoreParam-parallelized code; I have also seen random crashes in worker nodes for code that works with SerialParam and everywhere else.
Aaron Lun (06:52:57): > I ask because all of my functions currently require a BPPARAM= argument; I’d like to transition this to use bpparam(). The problem was that it kept on stalling/crashing on some (but not all!) macs with the default MulticoreParam.
Kasper D. Hansen (08:38:45) (in thread): > And who should we get in touch with re. moving the files into ExperimentHub when we’re done in a day or two?
Martin Morgan (10:09:48) (in thread): > @Lori Shepherd
Martin Morgan (10:13:07): > There are two problems in the thread you point to. One is use of localhost vs. 127.0.0.1 and I believe that is ‘fixed’. The other is that users may have arbitrary ports blocked, and there is no real solution to that other than figuring out what ports are / can be opened. But either way the back-end returned by bpparam() can be determined by the most recent call to register(), so is under the user’s control. Not an entirely satisfactory answer.
Kasper D. Hansen (13:35:20) (in thread): > thanks
Peter Hickey (22:21:44) (in thread): > Not yet
2018-08-23
Aaron Lun (12:39:12): > I’m having some trouble understanding the syntax of the BatchtoolsParam template files - currently looking at the SLURM template.
Aaron Lun (12:39:56): > For example, what are the <%= %> markers?
Vince Carey (12:41:09): > I am no expert but look at the ?brew in the brew package
Aaron Lun (12:41:43): > Ah, I see.
Aaron Lun (12:45:02): > Right. Am I meant to change anything with respect to the expressions to be evaluated inside <%= ... %>?
Aaron Lun (12:45:42): > I presume that the “modifications” referred to in the BiocParallel vignette refer to stuff like adding modules to set up the appropriate environment.
Vince Carey (12:48:14): > I have only used it with SGE. The <%… stuff was left alone.
Aaron Lun (12:51:13): > Okay. Do you know whether we can/should specify the resources that these templates refer to? I’m curious to know what the default resources$memory is for the SLURM template.
Vince Carey (12:54:48): > I commented out the only reference to ‘resources’ in the SGE template. If you have a stock installation of SLURM maybe you can do nothing. In our case, the assumptions made in the supplied template for SGE did not pan out, so the resources reference caused an error, but removing it made it work. There are issues of moving targets – but I have found the developers to be quite responsive when things need to be fixed, so … try it, and file issues when needed.
Vince Carey (12:57:02): > We should have some community work on this matter … a recipe for a SLURM cluster in AWS that works with BatchTools should be available… maybe @Sean Davis can provide more guidance.
Vince Carey (12:57:24): > In other words, do not go it alone.
Aaron Lun (12:57:34): > Looks like some template editing is required: > > Error: Fatal error occurred: 101. Command 'sbatch' produced exit code 1. Output: 'sbatch: error: Invalid numeric value "" for cpus-per-task.' >
Aaron Lun (12:59:30): > Either that, or I need to figure out how to set resources.
Vince Carey (12:59:46): > Are you using slurm-simple or slurm-dortmund
Aaron Lun (13:00:08): > simple, though I don’t really know the difference.
Aaron Lun (13:06:10): > Staring at the batchtools source code indicates that resources are set via the resources= argument (duh) in submitJobs. Looking at the corresponding BiocParallel code that calls submitJobs, it seems that resources=list(). Shouldn’t it be possible to specify this via BatchtoolsParam()? Especially walltime and memory, as these are definite job killers if they are too small (I don’t know the default values, if there are any).
Aaron Lun (13:07:53): > Right, I see that it’s possible to pass these via some .batchtools.conf.R. But this would require saving the parameters to a file, which gets inconvenient, e.g., if you need to change the parameters halfway through a script.
Aaron Lun (13:17:18): > Okay, had some luck after moving .batchtools.conf.R to ~/.
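For anyone hitting the same wall, the kind of thing that goes into ~/.batchtools.conf.R (the field names follow batchtools’ default.resources convention; the values are placeholders, and which ones matter depends on the template in use):
> ## ~/.batchtools.conf.R -- resources picked up by the SLURM template
> default.resources <- list(
>     walltime = 3600,   # seconds
>     memory = 4096,     # MB per job
>     ncpus = 1
> )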
Aaron Lun (13:17:54): > Well, at least it’s a different error now.
> Adding 10 jobs ...
> Submitting 10 jobs in 10 chunks using cluster functions 'Slurm' ...
> Waiting (Q:0 R:0 D:0 E:0 ?:10) [-------------------------------] 0% eta: ?s
> Error in .reduceResultsList(ids, fun, ..., missing.val = missing.val, :
>   All jobs must be have been successfully computed
> cleaning registry...
Aaron Lun (13:18:30): > FYI using the piApprox example in ?BatchtoolsParam.
Aaron Lun (13:26:38): > Got it working for this example - forgot that I had two R installations, so I had to modify the template file.
Aaron Lun (13:28:07): > But it sure wasn’t easy to figure that out. Setting logging options on: > > param <- BatchtoolsParam(10, cluster="slurm", template="parallel/slurm-aaron.tmpl", resultdir="parallel", logdir="parallel", log=TRUE) >
> …didn’t actually produce any logs when the thing was failing.
Aaron Lun (13:50:03): > One thing after another. Anyone have any experience with this: > > > sce <- TENxBrainData() > updating metadata: retrieving 1 resource > |======================================================================| 100% > > Error: database is corrupt; remove it and try again > database: '/mnt/scratchb/jmlab/lun01/bioconductor/TENxBrainAnalysis/rawdata/experimenthub.sqlite3' > reason: disk I/O error > In addition: Warning message: > Couldn't set synchronous mode: disk I/O error > Use `synchronous` = NULL to turn off this warning. >
Aaron Lun (13:51:24): > It works when I try to save it to the home directory (which is on a different file system - or at least, it would work if there were enough disk space, which there isn’t). But it doesn’t like the scratch filesystem, and I can’t quite understand why.
Sean Davis (14:12:08): > https://www.sqlite.org/faq.html#q5
Sean Davis (14:13:22): > Not sure if that is the issue. Can you read the file with sqlite3?
Aaron Lun (14:30:22): > I guess so, if by that you mean sqlite3 rawdata/experimenthub.sqlite3.
Aaron Lun (14:30:42): > That pops up a prompt without further errors.
Aaron Lun (14:32:17): > Though .import rawdata/experimenthub.sqlite3 WHEE gives me Error: disk I/O error
Aaron Lun (14:33:13): > Hm. Maybe there’s no disk space?
Sean Davis (14:37:36): > Sounds like you might want to touch base with your local support staff just to get a lay-of-the-land.
Sean Davis (14:38:09): > You could also just try removing the sqlite file and start over. Perhaps the file is, indeed, corrupt.
Aaron Lun (14:38:17): > Yeah.
Aaron Lun (14:38:42): > No problems with the file. Shouldn’t be a problem with disk space either, unless it needs >5 GB just to open it.
Sean Davis (14:38:43): > Figuring out if there is disk space can be system-specific. They may also have some experience with sqlite on their system.
Aaron Lun (14:43:35): > It’s definitely some weird shit with sqlite on this particular file system; moving the same file to my home directory works fine.
Sean Davis (14:44:08): > Are you accessing the file concurrently? Just curious.
Aaron Lun (14:44:32): > Not that I know of. There’s no other processes that should be touching this file.
Aaron Lun (14:45:25): > Probably some weird virtualization set-up that frustrates SQLite.
Sean Davis (14:45:55): > Definitely some local expertise on the cluster….
Aaron Lun (14:46:59): > I’ll submit a ticket locally, but it’s going to take a while. Bummer - I was hoping to be able to get some timings before the SOUND meeting.
Sean Davis (14:49:58): > Do the nodes have local scratch? That is, do they have a local disk on the node that is accessible? If so, you might be able to use that in the short run. Each node would then need to get its own copy of experimenthub, etc., since local scratch is not shared, but that might be OK for your timings.
Aaron Lun (14:51:15): > Not 100% sure, but I don’t think so. Note that I’d still be downloading this ~6 GB file (from TENxBrainData()) on each node, which will inflate the timings considerably.
Aaron Lun (15:02:02): > Well, ticket submitted. Let’s see what they say. Probably something that’s not easily fixable, but who knows.
Kasper D. Hansen (16:26:26): > Following my surprise at the time it takes to log2(x+1) transform the BrainData, I have been thinking / benchmarking approaches to PCA utilizing sparsity. I am not done with this, but my current gut feeling is that this will speed up PCA quite a bit. This is more food for thought; I am going to eventually provide some numbers
Kasper D. Hansen (16:27:24): > I guess what I am trying to say is that even if we store data as dense matrices in HDF5, having a method which reads a block and directly constructs a sparse matrix might be very useful
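A rough sketch of that idea, assuming plain HDF5Array subsetting plus a Matrix coercion (the function name and block choice are illustrative, not an existing API):
> library(HDF5Array)
> library(Matrix)
> # read one column block from a dense HDF5-backed matrix and sparsify it
> # immediately, so downstream products can exploit the zeros
> sparsify_block <- function(h5mat, cols) {
>     block <- as.matrix(h5mat[, cols, drop = FALSE])  # dense block in memory
>     as(block, "dgCMatrix")                           # keep only the non-zero entries
> }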
Aaron Lun (17:26:30): > I thought the I/O was the major problem with a dense matrix.
Aaron Lun (17:27:08): > Not just the dense operations (though you’re right that sparsifying it would probably speed up further processing).
Martin Morgan (18:52:25) (in thread): > I just added resources=list() to BatchtoolsParam() in 1.15.9; see ?batchtools::submitJobs, but basically key-value pairs for substitution into the template. Install from github with BiocManager::install("Bioconductor/BiocParallel")
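Usage then looks something like this (the resource names follow the template’s resources$... references; the values are made up):
> library(BiocParallel)
> param <- BatchtoolsParam(
>     workers  = 10,
>     cluster  = "slurm",
>     template = "parallel/slurm-aaron.tmpl",
>     resources = list(walltime = 7200, memory = 8000, ncpus = 1)
> )
> register(param)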
Aaron Lun (19:09:28) (in thread): > :+1::party_parrot:
Kasper D. Hansen (21:49:11): > Well, I benchmarked just doing the dense crossproduct for example.
Kasper D. Hansen (21:49:34): > This is 100% parallelizable, but it’s slow
2018-08-24
Aaron Lun (04:33:18) (in thread): > > The scratch filesystem is Lustre, which is a network filesystem designed for high performance. In order to meet its performance goals it has to drop a few POSIX constraints, one of which is the way POSIX file locking works. SQLite uses these file locks quite a lot and hence freaks out on the scratch filesystems. > > > > This is a fundamental property of the filesystem and not something we can change. How much IO do you expect to be directed at the SQLite database? If it isn’t massive then you can keep it on the home partition.
Aaron Lun (04:35:40) (in thread): > So I guess that’s that. @Martin Morgan is there a workaround where we can dump the sqlite file in one location and the actual data files in another? I was thinking of just manually moving the sqlite file to my /home and then linking to it on the lustre system.
Martin Morgan (05:26:31) (in thread): > for the one-off I’d be tempted in the setup on the head node to pull the files from the hub to the lustre file system, fls <- cache(hub[paste0("EH", 1040:1042)]); file.copy(fls, ...), and refactor TENxBrainData() to use them (basically the last two lines)
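Spelled out, that one-off workaround looks roughly like this (the EH IDs come from the message above; the destination path is a placeholder):
> library(ExperimentHub)
> hub <- ExperimentHub()                      # run once on the head node
> fls <- cache(hub[paste0("EH", 1040:1042)])  # download the resources into the hub cache
> file.copy(fls, "/path/on/lustre/rawdata")   # then copy them onto the lustre scratch by hand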
Martin Morgan (05:32:48) (in thread): > I think we’ve had a similar problem with our org packages on NFS file systems; @Hervé Pagès might recall a workaround.
Sean Davis (06:46:56) (in thread): > @Martin Morgan – that sounds like a good workaround so Aaron can get his benchmarking for SOUND done.
Sean Davis (06:49:39) (in thread): > A longer-term approach might be to have ExperimentHub split out the location of the file cache and the experimenthub sqlite file; the latter is small, so it might be convenient to be able to locate it independently of the actual cache. It sounds like that would, at least in practice, deal with Aaron’s situation. Not sure how general that would be, though.
Martin Morgan (07:01:43) (in thread): > Is the problem because we’re using SQLite, or is this more general? Lori & I have been thinking a little along these lines, e.g., BiocFileCache can remember the location of a file without actually managing it in its cache; this is a little problematic because some mechanism other than BiocFileCache can easily remove the original resource, invalidating the cache. I think there will be an iteration in the future where this kind of distributed data management is implemented.
Aaron Lun (07:55:05) (in thread): > My attempted workaround (soft-linking to a sqlite file located in ~
) still didn’t work: > > > sce <- TENxBrainData() > snapshotDate(): 2018-08-20 > see ?TENxBrainData and browseVignettes('TENxBrainData') for documentation > downloading 1 resources > retrieving 1 resource > |======================================================================| 100% > > loading from cache > '/mnt/scratchb/jmlab/lun01/bioconductor/TENxBrainAnalysis/rawdata/1042' > |======================================================================| 100% > > |================================================== | 71%Error: failed to load resource > name: EH1040 > title: Brain scRNA-seq data, 'dense matrix' format > reason: 1 resources failed to download > In addition: Warning message: > download failed > hub path: '[https://experimenthub.bioconductor.org/fetch/1040](https://experimenthub.bioconductor.org/fetch/1040)' > cache path: '/mnt/scratchb/jmlab/lun01/bioconductor/TENxBrainAnalysis/rawdata/1040' > reason: Failed writing body (10087 != 15120) >
> so I guess I’ll just download the files locally and push it up onto lustre manually… bit of a pain, as the scripts need to be rewritten…
Aaron Lun (07:55:56) (in thread): > I mean, I guess the error this time is unrelated to sqlite - but I don’t know why it couldn’t finish the write - a bit bemusing.
Aaron Lun (10:48:42): > Okay, my BatchtoolsParam job setups are now chugging along. Currently working on preprocess.Rmd only.
Aaron Lun (11:00:43): > woah, insane. QC calculations done in 4 minutes!
Aaron Lun (11:00:47): > 10 cores.
Aaron Lun (13:03:41): > BTW, the walltime in the SLURM script probably refers to seconds, not minutes.
Aaron Lun (13:04:00): > Template probably needs some fixing.
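If walltime really is being supplied in seconds, one possible fix is to convert it in the template before it reaches sbatch, which treats a bare number as minutes (a sketch only; whether the stock slurm-simple template already handles this differently is not checked here):
> #SBATCH --time=<%= ceiling(resources$walltime / 60) %>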
2018-08-25
Aaron Lun (07:04:45): > Hm. Somewhat bemusing - subsetting the HDF5Matrix to form a DelayedMatrix causes a 5-fold slowdown in beachmat, despite the fact that beachmat is still directly using HDF5 C++ methods to pull out data from the underlying seed…
Aaron Lun (07:08:17): > Oh wait, I know why.
Aaron Lun (07:08:40): > It’s because calcAverage is not optimized for column-based chunk access.
Aaron Lun (07:10:51): > sigh
Aaron Lun (09:05:58): > Hm. Getting a weird error. > > > h5ls("counts.h5") > Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) : > HDF5. File accessibilty. Unable to open file. >
Aaron Lun (09:06:25): > Notwithstanding the typo in “accessibilty”, I’ll note that the same thing was working yesterday.
Aaron Lun (09:06:52): > The same thing works on release - I assume that it’s got something to do with the HDF5lib update.
Aaron Lun (09:10:52): > Funnily enough, both release and devel work on my desktop. The error above refers to the cluster.
Aaron Lun (09:13:11): > @Mike Smith any ideas?
Aaron Lun (09:17:54): > Oh, this gets better. The following code works fine: > > a <- matrix(runif(1000), 100, 10) > b <- as(a, "HDF5Matrix") > b >
> so it would seem to be some incompatibility with the TENxBrainData HDF5 file on particular systems. MD5 sums agree on both the cluster and desktop, so the file itself is the same.
Aaron Lun (09:29:41): > Ah. Interesting. Even trying setHDF5DumpDir(dir=".") doesn’t work. So it’s a general incompatibility between the new HDF5 library and the Lustre file system.
Aaron Lun (09:30:02): > I wonder if this is related to the (lack of) POSIX locks for SQLite that we discussed earlier.
Aaron Lun (09:31:40): > Looking at https://support.hdfgroup.org/HDF5/hdf5-quest.html, it seems I should be able to turn off thread-safe mode to avoid the requirements for POSIX locks. - Attachment (support.hdfgroup.org): HDF5 FAQ – Questions About the Software > The HDF Group is a not-for-profit corporation with the mission of sustaining the HDF technologies and supporting HDF user communities worldwide with production-quality software and services.
Aaron Lun (09:34:11): > Specifically to turn off SWMR during Rhdf5lib compilation.
Aaron Lun (18:21:45): > Perhaps this is relevant: http://hdf-forum.184993.n3.nabble.com/HDF5-1-10-0-and-flock-td4028856.html. I’ll need to strace it to confirm. - Attachment (hdf-forum.184993.n3.nabble.com): hdf-forum - HDF5-1.10.0 and flock() > HDF5-1.10.0 and flock(). Hi all, I was wondering if HDF5 was going to be keep the 1.8.x branch going? Or is it recommend to move to the 1.10.x? I’m asking as we all know for SWMR you need flock()…
Aaron Lun (18:34:35): > … and this might do it: http://hdf-forum.184993.n3.nabble.com/HDF5-files-on-NFS-td4029577.html. Something to test tomorrow. - Attachment (hdf-forum.184993.n3.nabble.com): hdf-forum - HDF5 files on NFS? > HDF5 files on NFS?. Hi All, In case my previous question was lost between other issues I sent before, I’d like to ask again with hope someone knows the answer here: Is is possible to use HDF5…
Mike Smith (23:54:12): > I think you can also try setting the environment variable HDF5_USE_FILE_LOCKING to FALSE to see if lack of flock() is the issue on your lustre system.
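For example, something like the following before any HDF5 file is opened (setting the variable in the shell before starting R should work too; exactly when the library reads it may depend on the HDF5 version, so treat this as a sketch):
> # disable HDF5's flock()-based file locking, which Lustre does not provide
> Sys.setenv(HDF5_USE_FILE_LOCKING = "FALSE")
> library(rhdf5)
> h5ls("counts.h5")  # should now open without the 'Unable to open file' error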
2018-08-26
Aaron Lun (06:31:27): > yeah, that’s what the last post suggested as well.
Aaron Lun (06:32:38): > sweet, works like a charm.
Aaron Lun (06:32:48): > Probably something to throw in a FAQ somewhere?
Aaron Lun (07:11:51): > geez, ggplot is slow though.
Aaron Lun (10:07:37): > Finally! preprocess.html is built! Total runtime for calculateQCMetrics is 700 seconds across 10 workers; pretty damn good.
Aaron Lun (10:10:19): > From memory, that’s a better than 10-fold speed-up, probably due to efficiency improvements in calculateQCMetrics itself.
Aaron Lun (10:14:54): > https://www.dropbox.com/sh/x5dpaa02xvj4352/AACBAfQRJyft3cijHmywqKRoa?dl=0 - Attachment (Dropbox): BigBioc > Shared with Dropbox
Aaron Lun (11:19:36): > Running normalize.Rmd now… It’s pretty sweet, this batchtoolsparam stuff.
Aaron Lun (11:47:23): > Clustering done in 30 minutes, though this won’t be representative of actual clustering because it was clustered within each level of Library.
Aaron Lun (11:52:15): > Size factor calculation will probably be finished in another 20-30 minutes. So a 1 hour execution time for this part of the analysis, which I deem acceptable.
Aaron Lun (11:57:15): > @Martin Morgan I’ve set RNGseed=10000L in my BatchtoolsParam
call, but looking at the logs gives me these messages: > > ### [bt]: Starting job [batchtools job.id=5] > ### [bt]: Setting seed to 5 ... > > ### [bt]: Job terminated successfully [batchtools job.id=5] > ### [bt]: Starting job [batchtools job.id=13] > ### [bt]: Setting seed to 13 ... > > ### [bt]: Job terminated successfully [batchtools job.id=13] > ### [bt]: Starting job [batchtools job.id=47] > ### [bt]: Setting seed to 47 ... > > ### [bt]: Job terminated successfully [batchtools job.id=47] >
> … so is the RNGseed= argument being respected, or is batchtools just setting its own seed regardless?
Aaron Lun (11:58:13): > Even if it’s the latter, I wouldn’t mind if the seeds are deterministically set, though I don’t know whether the job ID will always be the same for the same inputs.
2018-08-27
Aaron Lun (08:24:03): > normalize.html is up on the Dropbox folder. Calculations take just over an hour.
Aaron Lun (09:57:37): > @Hervé Pagès I recall you saying that the block sizes chosen by DelayedArray are now aware of the chunk dimensions in the underlying HDF5 file. Does this apply to blockGrid?
Aaron Lun (10:03:59): > And if so, I presume that this is still the case even if the HDF5Matrix has become a DelayedMatrix object?
Aaron Lun (10:08:27): > Ah. Wait. I’ve realized why it doesn’t work. It’s because beachmat will try to take an entire “row” of blocks when access is requested for an overlapping row. This means that the required memory is equal to getDefaultBlockSize()*(number of blocks). For the 10X data set, this would be 1e8 * 200, which is 20 GB - no wonder my workers were failing.
Aaron Lun (10:15:41): > It’s not clear to me that there is an easy way to control the overall memory taken up by a series of blocks overlapping a single row or a column. Or is there?
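For the record, the back-of-the-envelope arithmetic, and the blunt workaround of shrinking the auto block size so that a full “row of blocks” fits into a worker’s memory (the 2 GB budget below is an arbitrary example, not a recommendation):
> library(DelayedArray)
> n_blocks <- 200                      # blocks spanning the columns of the 10X matrix
> budget   <- 2e9                      # total bytes we are willing to hold at once
> setAutoBlockSize(budget / n_blocks)  # each block is now ~10 MB instead of the 1e8 default
> getAutoBlockSize()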
2018-08-28
Martin Morgan (02:32:12) (in thread): > it seems like an issue with batchtools: https://github.com/mllg/batchtools/issues/203 – the seed is set, but reported incorrectly. - Attachment (GitHub): Wrong random number seed reported on execJob. · Issue #203 · mllg/batchtools > This line batchtools/R/execJob.R Line 38 in 3b0b1a9 messagef("### [bt%s]: Setting seed to %i …", now(), job\(id, job\)seed) reports the random number seed used by each job as , but accord…
Lucas Schiffer (22:34:05): > @Lucas Schiffer has left the channel
2018-08-30
Kasper D. Hansen (09:18:40): > We submitted TENxPBMCData yesterday which contains (amongst other things) the 68k dataset and a number of 3-6k datasets, for testing
Kasper D. Hansen (09:19:10): > Can grab it from here: https://github.com/kasperdanielhansen/TENxPBMCData
Kasper D. Hansen (09:19:18): > Data is hdf5
Vince Carey (09:21:08): > :+1:
Kasper D. Hansen (09:23:54): > Usage is like TENxBrainData but the function now has a dataset argument which gives the specific PBMC data to download. The default is a 4k (small) dataset.
Kasper D. Hansen (09:24:43): > to get 68k do something like > > tenx <- TENxPBMCData(dataset = "pbmc68k") >
Kasper D. Hansen (09:29:08): > See vignette (which doesn’t really say much more to be honest)
Stephanie Hicks (10:27:31): > minimal is ok for the moment.:joy:
2018-09-07
Aaron Lun (12:34:31) (in thread): > @Hervé Pagès Any thoughts on this?
2018-09-08
Aaron Lun (07:14:57): > @Martin Morgan Does tenxSummarizedExperiment have a more permanent home? Happy to move it into DropletUtils.
Martin Morgan (09:11:23) (in thread): > Feel free to use it as you wish; I don’t have plans for it to arrive elsewhere. There is HDF5Array::TENxMatrix, which takes a different approach compared with the TENxGenomics representation; not sure of your goals but maybe you’d rather implement tenxSummarizedExperiment to use that instead?
Aaron Lun (09:20:07) (in thread): > Yes, I’ve already adapted the make-data.R script to use the HDF5Array version instead.
Aaron Lun (09:55:31) (in thread): > Do you know if there’s a formal description of the 10x HDF5 matrix format on their website anywhere?
Aaron Lun (11:11:41) (in thread): > got it.
Aaron Lun (12:16:56): > Hm. 2 hours to estimate variances.
Aaron Lun (12:16:59): > That can’t be right.
Aaron Lun (13:08:06) (in thread): > Done: DropletUtils::read10xCounts now supports reading from the 10x sparse HDF5 format and producing a SCE object with a TENxMatrix count matrix.
2018-09-10
Aaron Lun (05:35:41): > batchtools takes an awfully long time to clean the registry. 14 minutes - almost as long as the calculations themselves!
Aaron Lun (06:08:52): > okay, I can guess what the problem is now - it’s this 100 iteration loess fit.
Aaron Lun (15:19:41): > Hm. Still confused about this.
Aaron Lun (15:20:06): > Currently running this chunk: > > library(TENxBrainData) > sce <- TENxBrainData() > sf <- runif(ncol(sce)) > lcounts <- log2(t(t(counts(sce))/sf) + 1) > > library(scran) > setAutoBlockSize(100*100*8) > system.time(fit <- trendVar(lcounts, design=sce$Library, subset.row=1:1000)) >
Aaron Lun (15:20:16): > This takes about 60 seconds on my desktop - pretty good.
Aaron Lun (15:21:01): > But when I try to run it with BatchtoolsParam across 10 workers, each worker takes > 30 minutes to run, and the whole thing crashes.
Aaron Lun (15:21:18): > Having thought about it… does setAutoBlockSize even propagate to the workers?
Kasper D. Hansen (15:22:34): > I assume lcounts is delayed? How does that propagate over a batch job?
Kasper D. Hansen (15:22:43): > That needs to be saved, right?
Kasper D. Hansen (15:22:53): > And then loaded?
Aaron Lun (15:23:20): > The DelayedMatrix object itself is probably fine, there shouldn’t be much overhead there.
Kasper D. Hansen (15:23:22): > Not really familiar with this package but the communication must be different
Kasper D. Hansen (15:23:34): > It might have to realize it to save it
Kasper D. Hansen (15:23:45): > oh wait a minute
Kasper D. Hansen (15:23:55): > this is a chunk you run on a worker
Kasper D. Hansen (15:23:58): > ok, that is different
Kasper D. Hansen (15:24:33): > But my guess is still something with communication
Aaron Lun (15:24:34): > My suspicion is that the global variable set by setAutoBlockSize is not propagating to the workers. This means that each worker tries to load an entire “row of blocks”, where each block is the default 1e8 bytes in size.
Kasper D. Hansen (15:24:58): > I don’t really get how you run this chunk
Aaron Lun (15:25:55): > On the cluster, the last line is set to: > > system.time(fit <- trendVar(lcounts, design=sce$Library, subset.row=1:1000, BPPARAM=BPPARAM)) >
> where BPPARAM is defined as in https://github.com/Bioconductor/TENxBrainAnalysis, see parallel/slurm.R. - Attachment (GitHub): Bioconductor/TENxBrainAnalysis > R scripts for analyzing the 1.3 million brain cell data set from 10X Genomics - Bioconductor/TENxBrainAnalysis
Aaron Lun (15:26:29): > Yep, seeing the job logs confirms that they all ran out of memory.
Kasper D. Hansen (15:26:53): > So how is lcounts communicated out to the workers?
Aaron Lun (15:27:17): > Gets serialized and then loaded again, presumably. This is light, so that’s not the problem.
Kasper D. Hansen (15:27:37): > might not be, but why is that light?
Kasper D. Hansen (15:27:44): > It needs to write the entire file
Aaron Lun (15:27:47): > Remember, it’s still a DA on the worker-side.
Aaron Lun (15:27:56): > The HDF5 file doesn’t get altered.
Kasper D. Hansen (15:28:29): > what happens if you serialize a DA with delayed operations?
Kasper D. Hansen (15:28:33): > Don’t they get realized?
Aaron Lun (15:29:00): > Why would they be? If you saveRDS a DA, you should just get the DA.
Kasper D. Hansen (15:29:43): > perhaps I am overthinking this a bit
Aaron Lun (15:30:18): > saveRDS(lcounts, file="whee.rds") gives me a file that’s 37 MB, so clearly no realization occurred.
Aaron Lun (15:30:54): > Anyway, I’m pretty sure of the problem now. The workers are not seeing the setAutoBlockSize() setting. I will confirm this in just a bit…
Kasper D. Hansen (15:30:55): > ok, I am wrong:slightly_smiling_face:
Kasper D. Hansen (15:31:18): > It’s not weird; how would R know that this setting should be communicated?
Kasper D. Hansen (15:31:34): > Somehow BatchJobs would know how to recreate something like library()
Kasper D. Hansen (15:31:40): > not sure about options
Aaron Lun (15:32:17): > non-multicore parallelization is pretty bad at carrying over the surrounding environment…
Kasper D. Hansen (15:32:30): > yeah that is a hard problem
Kasper D. Hansen (15:32:39): > seems like it handles library, which is better than nothing
Aaron Lun (15:47:50): > Okay, here it is. > > library(DelayedArray) > source("parallel/slurm.R") # see TENxBrainAnalysis/parallel/slurm.R > setAutoBlockSize(100*100*8) > bplapply(1:10, BPPARAM=BPPARAM, FUN=function(i) getAutoBlockSize()) >
> … and yep, they’re all still 1e+08. Tagging @Martin Morgan and @Hervé Pagès: any thoughts for how I can propagate the block size setting to the workers in a general manner? I’d rather not add another parameter to all of my functions to call setAutoBlockSize; is there something that can be done by telling BatchtoolsParam which global options should be preserved?
Aaron Lun (15:51:42): > I could imagine wrapping FUN in another function that sets a specified list of global options…
Aaron Lun (15:54:14): > Well, I’m going home, that’s enough debugging for today.
Kasper D. Hansen (15:54:36): > there must be a way to execute a common piece of code on the workers
Martin Morgan (17:14:42): > generally, and without looking at the details, set the block size on the worker as part of whatever the function is that is doing the ‘apply’.
Aaron Lun (18:26:08): > That’s the problem, I’m afraid. The bplapply call lies within package-defined functions, so they’re not user-modifiable. Of course, it is possible for me to modify the functions being bplapply’d, but I would have to do so for every such function, which would be an unsustainable pain for maintenance (e.g., if the setAutoBlockSize() function name changes again, or if I have to add different functions to change the block shape).
Aaron Lun (18:32:20): > It would seem more natural to have a built-in facility for propagating global variables as part of the BatchtoolsParam object, so that it automatically gets applied to every bplapply call. I can imagine a number of strategies for achieving this. The simplest would be for the BTP constructor to store a user-supplied list of global variables that need to be set in each worker, and then re-define the function to be bplapply’d as: > > .local <- function(FUN, ..., global.list) { > options(global.list) > FUN(...) > } >
Aaron Lun (18:32:36): > … and then execute .local in each worker.
Aaron Lun (18:34:00): > A more automated approach would be for each package to “register”, on attach, a set of important global variables that must be propagated to workers, and for BatchtoolsParam to automatically recognize and use these global variables whenever a BTP instance is used in a bplapply call.
Aaron Lun (18:41:22): > In fact, there aren’t even that many options() in default R, so one could just throw the whole lot in and not bother with trying to be refined.
Kasper D. Hansen (20:09:54): > it seems sensible to be able to write a function and get it executed on each worker as part of the BPPARAM
2018-09-11
Aaron Lun (04:25:49): > Perhaps. But regardless of whether or not a user wants to execute a common function across all bplapply calls, it would be “expected behaviour” that any settings in the global environment would propagate to the workers. In fact, that’s probably 99% of the use cases of this common function anyway, as there are few other ways that an operation limited to the scope of the common function can change the behaviour of the actual FUN to be executed.
Martin Morgan (11:07:38): > It’s not just BatchtoolsParam, but also SnowParam (separate processes) and in general the problem is that DelayedArray is not playing by BiocParallel rules (or that BiocParallel’s rules need to be updated). It wouldn’t be a ‘fix’ that was BatchtoolsParam-specific, but BiocParallel-wide. One significant problem is that one can bpstart(<param>) for most params, separate from using it; this has the advantage of allowing a cluster to be re-used, so saving expensive startup costs. But it also leads to problems with global variables – are they synced before bpstart(), or before bplapply() (in which case global variables need to be synced at each invocation). My feeling is that DelayedArray should avoid global variables, or if it uses these then make sure that they are available on the workers. For instance, the functionality in DelayedArray could accept a parameter that defaulted to the global variable; the value of the parameter would be sent to the worker. Happy to work through this with @Hervé Pagès
Aaron Lun (11:21:30): > Explicit passing of arguments would be the most direct solution but also causes some difficulties for developers. To give a concrete example: scran::trendVar calls bplapply with FUN set to a wrapper around .Call() to some C++ code, which, via beachmat, calls as.matrix() on a DelayedArray object. If I wanted as.matrix() to respond to, say, a new setting for the block size argument, I would have to pass the relevant argument all the way down this chain of calls - pretty ugly.
Aaron Lun (11:25:15): > I could modify the wrapper so that it does the same as .local above, though one could just as easily do that on the BiocParallel side.
Martin Morgan (11:30:38): > param arguments in BiocParallel and elsewhere are basically tackling this – create one object that holds all the details, and pass it. Avoiding global options is generally a very good thing in the programming world. A more clever approach is a factory pattern that creates local scope with desired parameter values set; one would pass the result of the factory call rather than as.matrix.
Aaron Lun (11:32:19): > I wonder if this could be done by storing these parameters in the DA object itself.
Aaron Lun (11:32:47): > This would avoid the need to pass any other argument down the chain of functions.
Aaron Lun (11:33:26): > Needing to pass a DA-specific parameter would compromise the representation-agnostic behaviour of beachmat, for example.
Hervé Pagès (12:23:15) (in thread): > @Aaron Lun @Martin Morgan @Sean Davis Sorry for the late answer, I’ve been away from this slack for a while. I don’t remember the exact details and don’t know if that would help here, but it seems that some obscure and hard-to-reproduce SQLite access problems on some exotic file systems can be avoided by opening connections in read-only mode (by default dbConnect(SQLite(), ...) connects in RW mode). In particular I think that this avoids the creation of the lock file and other temporary objects. Even if you are not on an exotic file system and if you only need read access, this is the right thing to do anyway as it will allow concurrent read access (SQLite blocks read access if there is already an open RW connection to the file). Another setting that seems to help when the SQLite file is on an NFS partition is the vfs="unix-none" setting. The dbFileConnect() utility in AnnotationDbi is a wrapper around dbConnect(SQLite(), ...) that uses those settings. See https://github.com/Bioconductor/AnnotationDbi/blob/master/R/utils.R (dbFileConnect() is used in hundreds of SQLite-based annotation packages to connect to the embedded SQLite file). - Attachment (GitHub): Bioconductor/AnnotationDbi > Annotation Database Interface. Contribute to Bioconductor/AnnotationDbi development by creating an account on GitHub.
Aaron Lun (12:24:09) (in thread): > Thanks Herve, will investigate.
Hervé Pagès (12:34:24) (in thread): > @Aaron Lun Yes. See ?blockGrid.
Hervé Pagès (12:54:43) (in thread): > Most of the time there is still a notion of chunk grid on the DelayedMatrix object after the original HDF5Matrix M0 has been transformed by some delayed operations. You can get this grid with chunkGrid() (or chunkdim() if the grid is still regular). This grid might have gone thru some transformations e.g. if the original object was permuted or subsetted. For example chunkdim(log(t(M0)[1:10, ] + 1)) will be 1x500 if chunkdim(M0) was 500x1. If the delayed operations carried by the object “break” the propagation of the chunkdim (e.g. random subsetting), then chunkdim() returns NULL. This is still work-in-progress and I’m planning to make the chunkdim propagate in more situations than those supported at the moment.
Aaron Lun (12:55:34) (in thread): > :+1:
Hervé Pagès (12:57:24) (in thread): > In particular, even if the chunkdim doesn’t propagate, the chunk grid can sometimes propagate but then the grid is not regular anymore (e.g. if you remove some rows and/or some cols with something like M[-2, -(5:12)]).
Aaron Lun (12:59:37) (in thread): > Arguably, if these are physical chunks, then the best access pattern would be independent of any subsetting that would be applied in memory anyway?
Aaron Lun (13:00:07) (in thread): > Well, at least for HDF5, you’d have to read in an entire chunk regardless of whether you wanted a subset.
Hervé Pagès (13:03:06) (in thread): > No easy way that I’m aware of. I guess this is in the context of parallelization i.e. process in parallel blocks that overlap a single row or column. So more generally speaking setting the auto block size really controls the size of the individual blocks. In the context of parallel evaluation, you need to do your own calculation i.e. adjust the auto block size taking into account the nb of blocks that you anticipate are going to be loaded in memory at any given time.
Aaron Lun (13:05:16) (in thread): > Yeah, that’s what I ended up doing - just setting the block size to be so small that a “row of blocks” would fit in memory. Of course, this didn’t work for other reasons - see the later discussion about propagation of global variables to the workers.
Aaron Lun (13:17:42) (in thread): > I’ll admit to not knowing what I’m doing, but is this what you mean: > > # Saving in ~/.ExperimentHub would exceed storage limits. > library(ExperimentHub) > setExperimentHubOption("CACHE", ".ExperimentHub") > > library(AnnotationDbi) > dbConnect(SQLite(), dbname = '.ExperimentHub/experimenthub.sqlite3', > synchronous = NULL, flags = SQLITE_RO, vfs = "unix-none") > > library(TENxBrainData) > sce <- TENxBrainData() # or we would, if it was working properly. >
Aaron Lun (13:20:23) (in thread): > There’s no obvious place to pass the output of dbConnect, unfortunately. I assume the appropriate place is somewhere deep in AnnotationHub::query.
Hervé Pagès (14:55:58): > I’m surprised that .Options doesn’t get passed around to the workers. This was my naive expectation (but I guess it would also be the expectation of many users). This would provide a convenient and intuitive way to pass around user-controlled global settings. If this is not a reasonable expectation (maybe because .Options is not just about user-controlled global settings so it would be undesirable to overwrite some system-specific settings on the workers, especially in the context of heterogeneous workers, I don’t know), then I could store the DelayedArray user-controlled global settings in a file (e.g. something like ~/.DelayedArray.conf). Would be nice to have a more generic/central mechanism for passing around user-controlled global settings though instead of having each package coming up with its own home-made mechanism. I was hoping .Options would be the right vehicle for that but maybe it’s not?
Kasper D. Hansen (15:01:27): > Perhaps I am not clear in my thinking, but I can see the use of global variables when setting up the compute. This is not too different from (say) Aaron’s SLURM settings in a text file, just technically a different approach.
Kasper D. Hansen (15:02:19): > I think the right way is to have all the information in the BPPARAM object. This is additional information about how big chunks to process essentially
Aaron Lun (15:59:46): > I’m fine with any solution as long as I don’t have to change any of my package code. (Which I feel is a reasonable request if we are to preserve interoperability between matrix representations - it shouldn’t be the responsibility of a package function to set a whole bunch of parameters specific to a representation that may or may not be used.)
Aaron Lun (18:03:03) (in thread): > It may also be the case that workers don’t have access to the same file system (or at least the same absolute paths) as the dispatching node - a similar situation was causing linkage problems for shared libraries (e.g., with Rhtslib) a while back.
Hervé Pagès (20:56:32) (in thread): > Good point. I’ll have to scratch my head a little bit longer about this until I come up with a good solution. BTW I only realize now that the non-transmission of the global options that you illustrated earlier with > > setAutoBlockSize(100*100*8) > bplapply(1:10, function(i) getAutoBlockSize(), BPPARAM=BPPARAM) >
> was “caused” by the use of the "slurm" cluster type. By default the BatchtoolsParam backend uses the "multicore" cluster type which doesn’t have that problem. So we have at least 2 types of bplapply() backends that don’t transmit the global options: SnowParam() and BatchtoolsParam(cluster="slurm"). I guess that this is just the consequence of those backends starting fresh R sessions on each node and not forking. So it would be super useful to have a mechanism for transmitting user-controlled global settings to the workers whatever the backend is. Turns out that the in-house mechanism I first had in mind for transmitting DelayedArray settings was going to rely on the workers being able to share stuff via tempdir(). However, with those backends that don’t fork, tempdir()
is not transmitted either (of course): > > > bplapply(1:3, function(i) tempdir(), BPPARAM=SnowParam()) > [[1]] > [1] "/tmp/Rtmp8l9oll" > > [[2]] > [1] "/tmp/RtmpymsI0y" > > [[3]] > [1] "/tmp/Rtmps8eY6M" >
> So back to more head scratching…
2018-09-12
Hervé Pagès (02:11:27) (in thread): > Maybe I was not clear. If M1 is a 50x50 HDF5Array object with a chunkdim() of 10x10, then chunkGrid(M1) returns a 5x5 RegularArrayGrid object where the grid elements are aligned with the physical chunks. So all grid elements contain 10x10 matrix elements. The question is how the grid of physical chunks gets “projected” on a DelayedMatrix object that was obtained by subsetting an HDF5Matrix object. For example, suppose that M2 is M1[-4, ], the plan is to have chunkGrid(M2) return a 5x5 ArrayGrid object where the grid elements are still aligned with the physical chunks. This is possible by returning a grid that is not regular anymore because the grid elements at the top edge of the grid should now contain 9x10 matrix elements instead of 10x10. So in this case chunkGrid(M2) would return an ArbitraryArrayGrid object instead of a RegularArrayGrid object. I consider that this grid is still good guidance for choosing a grid suitable for block processing of M2 (so blockGrid(M2) takes it into account) because it respects the boundaries of the physical chunks. So in the end blockGrid(M2) will be able to return a grid of blocks where no chunk overlaps with 2 blocks, so a chunk won’t be loaded/uncompressed twice when we walk on the blocks with lapply() or bplapply(). Right now blockGrid(M2) returns NULL but the plan is to have it return the ArbitraryArrayGrid object described above.
Aaron Lun (04:39:15) (in thread): > Interesting. I’ll have to adjust the beachmat handling of “unknown” matrices appropriately to respond to blockGrid more effectively - currently it just looks at @spacings.
Martin Morgan (09:34:38) (in thread): > This is a feature that we (i.e., @Lori Shepherd) can / will implement in BiocFileCache and the Hubs, some time next week probably.
Aaron Lun (09:52:30) (in thread): > Okay. I presume that TENxBrainData() will also require modification to pass the relevant arguments to the EHub/AHub commands.
Martin Morgan (16:39:44) (in thread): > I’m not sure how palatable this is, but with the package-local options something like the following works and is relatively easy to maintain / use > > #' @export > options <- local({ > env <- new.env(parent = emptyenv()) > list(set = function(key, value) { > env[[key]] <- value > }, get = function(key) { > env[[key]] > }) > }) > > .doit <- function(i, ..., options) { > list(Sys.getpid(), options$get("key")) > } > > #' @importFrom BiocParallel bplapply > #' @export > fun <- function(n = 2) { > bplapply(seq_len(n), .doit, options = options) > } >
> The test case is SnowParam(), and I have > > > options$set("key", 12) > > str(fun()) # MulticoreParam > List of 2 > $ :List of 2 > ..$ : int 10667 > ..$ : num 12 > $ :List of 2 > ..$ : int 10668 > ..$ : num 12 > > BiocParallel::register(BiocParallel::SnowParam()) > > str(fun()) # SnowParam() starts, sees options > List of 2 > $ :List of 2 > ..$ : int 10712 > ..$ : num 12 > $ :List of 2 > ..$ : int 10718 > ..$ : num 12 > > BiocParallel::bpstart() ## persistent registered (SnowParam) cluster > > str(fun()) # SnowParam() already up > List of 2 > $ :List of 2 > ..$ : int 10733 > ..$ : num 12 > $ :List of 2 > ..$ : int 10739 > ..$ : num 12 > > options$set("key", 13) > > str(fun()) # SnowParam() already up, sees new option > List of 2 > $ :List of 2 > ..$ : int 10733 > ..$ : num 13 > $ :List of 2 > ..$ : int 10739 > ..$ : num 13 >
Aaron Lun (17:11:51) (in thread): > Why would this have to be package-specific? Couldn’t options be in BiocParallel itself? It seems as if this definition would not change regardless of the use of bplapply.
Aaron Lun (17:30:00) (in thread): > Or maybe that’s what you already meant…
Martin Morgan (17:41:17) (in thread): > I meant that this would be a way for a package to manage its own options and to interoperate with BiocParallel, not for BiocParallel to manage another package’s options.
Martin Morgan (17:43:02) (in thread): > I guess BiocParallelParam could have another field options with the contract if-you-set-BiocParallelParam::options-we-will-set-base::options() but that seems artificial.
Aaron Lun (18:16:47) (in thread): > So your intention is for DelayedArray to define DelayedArray::options, which then gets passed to any BiocParallel calls involving DAs. But if I had a scran function that accepts a DA object and uses BiocParallel, then I would have to pass DelayedArray::options to all of my bplapply calls within scran as well. This seems like a lot of code to support just a single matrix representation - increases maintenance, breaks interoperability, requires even more care to include options for future packages implementing new representations, etc.
Aaron Lun (18:18:53) (in thread): > If re-setting of global options by default in the workers is not desirable, then can’t we embed the relevant parameters within the DA object itself? This would tie the parameters directly to the object for retrieval by the relevant DelayedArray functions, no matter what R session they’re in. It would also have the nice side-effect of being able to define different block sizes or shapes for different matrices (optimized for a particular application) at the same time.
Hervé Pagès (19:29:28) (in thread): > ok i came up with an approach that is a little bit different from Martin’s: https://github.com/Bioconductor/DelayedArray/commit/31cc4290fdd72acd1d9de796cbda4df16b43211e Tested on my laptop with SnowParam but I don’t have an easy way to test this with BatchtoolsParam(cluster="slurm"). @Aaron Lun Can you confirm that this works for you? For me: > > > library(DelayedArray) > > getAutoBlockSize() > [1] 1e+08 > > bplapply(1:2, function(i) DelayedArray::getAutoBlockSize(), BPPARAM=SnowParam()) > [[1]] > [1] 1e+08 > > [[2]] > [1] 1e+08 > > > setAutoBlockSize(25) > automatic block size set to 25 bytes (was 1e+08) > > bplapply(1:2, function(i) DelayedArray::getAutoBlockSize(), BPPARAM=SnowParam()) > [[1]] > [1] 25 > > [[2]] > [1] 25 > > > bplapply(1:2, function(i) {DelayedArray::setAutoBlockSize(99); DelayedArray::getAutoBlockSize()}, BPPARAM=SnowParam()) > automatic block size set to 99 bytes (was 25) > automatic block size set to 99 bytes (was 25) > [[1]] > [1] 99 > > [[2]] > [1] 99 > > > getAutoBlockSize() > [1] 25 >
- Attachment (GitHub): Workers now inherit DelayedArray user-controlled global options · Bioconductor/DelayedArray@31cc429 > In the context of BiocParallel::bplapply() and family, workers now inherit the user-controlled options defined on the master. Workers can also modify the inherited options or define new user-contro…
Martin Morgan (21:24:42) (in thread): > From my parsing of the code, I’d be concerned about the use of file paths; clusters are typically configured in ways that make no one place guaranteed to be shared by the head and worker nodes, which I think is assumed by your code (the workers try to read the manager file and discover that it is not their own, but they may not be able to read that file); the data needs to be serialized by the manager to the workers.
Hervé Pagès (23:44:58) (in thread): > This is a valid concern and is actually one of the 2 big assumptions that the current implementation is based on. More precisely the current implementation assumes that the tempdir() of the master is accessible by the workers: https://github.com/Bioconductor/DelayedArray/blob/31cc4290fdd72acd1d9de796cbda4df16b43211e/R/utils.R#L353-L358 This is why I’m curious to know whether this works on @Aaron Lun’s SLURM cluster. I understand that the only clean solution is to send the serialized options thru the manager. So my approach should only be considered as a temporary dirty hack for now, until we have a clean solution that doesn’t require people to add an extra argument to all the functions they pass to bplapply(). - Attachment (GitHub): Bioconductor/DelayedArray > Delayed operations on array-like objects. Contribute to Bioconductor/DelayedArray development by creating an account on GitHub.
2018-09-13
Hervé Pagès (00:18:10) (in thread): > @Aaron Lun Attaching user-controlled global options to a DelayedArray object assumes that the global options are object specific but in general they are not (e.g. the option that controls verbosity of block processing). It would also require adding an argument to things like getAutoBlockSize() and setAutoBlockSize() so they operate on a DelayedArray. So now these would be object getter and setter instead of global option getter and setter. However there are good reasons why the global options in DelayedArray (e.g. the auto block size) are global options and not object-specific settings. It would also assume that the workers are performing a unary operation (e.g. rowSums()) but there is no reason why they couldn’t be operating on more than 1 DelayedArray object at a time (e.g. matrix multiplication).
Aaron Lun (07:02:28) (in thread): > @Hervé Pagès I ran the following: > > library(DelayedArray) > source("parallel/slurm.R") > setAutoBlockSize(100*100*8) > bplapply(1:10, BPPARAM=BPPARAM, FUN=function(i) DelayedArray::getAutoBlockSize()) >
> and ended up with: > > Adding 10 jobs ... > Submitting 10 jobs in 10 chunks using cluster functions 'Slurm' ... > > Error: BiocParallel errors > element index: 1, 2, 3, 4, 5, 6, ... > first error: worker cannot access file /tmp/RtmpYPmOjA/DelayeArray-options.133253 > created by master > In addition: Warning message: > In dir.create(bplogdir(BPPARAM)) : 'parallel' already exists > cleaning registry... >
> … so I guess not.
Aaron Lun (07:09:01) (in thread): > The above was run on an R session on the headnode, so the tempfile would have been created on a headnode-visible path that might not be seen by the workers. I’ve also tried running the code in an R session on the workers (i.e., submitting a job containing the above code, which starts an R session on a worker node, which then further submits jobs to additional workers via the Batchtools framework). This leads to the same error.
Hervé Pagès (09:59:07) (in thread): > Well so I guess that means DelayedArray is not supported on this kind of cluster for now, sorry:disappointed:
Aaron Lun (10:03:47) (in thread): > :sad-parrot:. That’s unfortunate. Am I the only one with this issue? Surely other cluster types have similar behaviours.
Martin Morgan (10:50:45) (in thread): > It would fail on Roswell’s cluster, too
Aaron Lun (10:54:00) (in thread): > The current fallback would be to realize the DA on an HDF5 backend (+3 hours). This allows the next step to proceed with direct HDF5 access, avoiding the issues with DAs on the workers. Which is a shame, as the next step would only take 2 minutes on a DA (extrapolating from desktop timings).
Hervé Pagès (11:01:09) (in thread): > How about using something like the bplapply() wrapper below in your code, @Aaron Lun: > > bplapply2 <- function(X, FUN, ..., BPREDO=list(), BPPARAM=bpparam()) > { > FUN2 <- function(x, opts) > { > do.call(base::options, as.list(opts)) > FUN(x, ...) > } > bplapply(X, FUN2, opts=base::.Options, BPREDO=BPREDO, BPPARAM=BPPARAM) > } >
> If that works for you I will revert my last changes to DelayedArray so global options are stored again in base::.Options. @Martin Morgan Is this a reasonable feature request for bplapply() and family?
Aaron Lun (11:08:53) (in thread): > That sounds very sensible to me.
Hervé Pagès (11:10:05) (in thread): > From worst to best solution: (1) @Aaron Lun and I both define and use bplapply2 in our own code/packages; (2) bplapply2 is added to BiocParallel and people who care about transmission of global options (e.g. @Aaron Lun and I) use it instead of bplapply in their code. (3) bplapply() just does this.
Aaron Lun (11:10:40) (in thread): > IMO 3 FTW.
Martin Morgan (11:24:39) (in thread): > To back up, from the discussion I understand that this ‘global option’ needs to be set for performance reasons for a particular DelayedArray instance?
Aaron Lun (11:28:56) (in thread): > For me at least. This is because some functions (in this case trendVar, but also various others) require row-level access. Rather than realizing a single row at a time, which would be fairly inefficient, beachmat realizes a series of consecutive rows; returns the requested row; and caches the rest for future requests. The number of rows in the cache is defined according to the DelayedArray block size, which is a natural choice as it considers the physical chunking layout of the matrix representation. However, for this purpose, the default block size is pretty large when you consider that we are realizing blocks across all columns of the matrix. This means that the memory on the workers is frequently and unnecessarily exceeded.
Aaron Lun (11:30:20) (in thread): > I imagine that @Peter Hickey would have similar problems for his DMS functions that require access to an entire row of values at once, e.g., rowRanks?
Martin Morgan (11:34:14) (in thread): > but when you set this global option, doesn’t this affect processing of other DelayedArray objects, maybe in the user’s work flow unrelated to your needs?
Hervé Pagès (11:36:52) (in thread): > The global options in DelayedArray control not just performance of block-processing but other aspects e.g. level of verbosity. Generally speaking they don’t apply to a particular DelayedArray instance. For example when a block-processed operation walks on more than 1 object simultaneously (e.g. matrix multiplication) the level of verbosity applies to the whole operation. Also you could imagine that a global setting caps the total amount of memory used by this matrix multiplication (right now we don’t have such setting, only auto block size which caps the size of the individual blocks, but I might add something like this in the future).
Martin Morgan (11:41:17) (in thread): > That doesn’t seem to be Aaron’s use case (and this use case is driving the discussion)… > > Somehow I would be more amenable to something like > > bpfun <- function(FUN, options = base::options()) { > force(options) > function(...) { > base::options(options) > FUN(...) > } > } >
> used as > > bplapply(1:2, bpfun(function(i) getOption("FOO"))) >
Aaron Lun (11:47:19) (in thread): > Regarding the global option: yes, and I agree that’s not desirable. That’s part of why I was asking whether there was a way to control the total memory usage when extracting a series of blocks across an entire row of a DA. I’ll admit to being lazy when I was implementing beachmat’s caching system for DAs, and perhaps it would be better to define beachmat-specific parameters for this instead of relying on the DA block size. However, any such parameters would still be global, as there’s no way to easily pass the cache size to beachmat directly.
Hervé Pagès (12:00:45) (in thread): > Like there is no way to pass the block size when doing M1 %*% M2 other than via a global option. Using bpfun is more or less equivalent to using bplapply2 so either (or both) could go in BiocParallel. How do you feel Martin about adding that feature to bplapply() itself?
Aaron Lun (12:02:55) (in thread): > Ah, you beat me to it. To elaborate on why beachmat can’t easily accept a cache size parameter; the constructor for the C++ matrix classes accepts a single object (Rcpp::RObject containing the matrix representation). Let’s say that we allow the constructor to take another argument for the cache size. However, this would mean that all beachmat-dependent code must provide this argument to the constructor in order to respond to user-requested changes in the cache size. This would need to propagate up the call stack to a user-visible function, which is not insignificant given that we’re working at the bottom level here (in C++). Indeed, some developers of packages dependent on beachmat (mostly me, but who knows) would be bemused at having to provide such a DA-specific argument if they never anticipate usage of DAs in their functions - certainly I was not using DAs routinely within scran until now.
Martin Morgan (12:49:10) (in thread): > As another proposal, would adding an arg to BiocParallelParam() to export globals be ok?
Aaron Lun (13:19:20) (in thread): > Yes, I’d be happy with that.
Samuela Pollack (13:40:14): > @Samuela Pollack has joined the channel
Hervé Pagès (13:44:10) (in thread): > Sounds good to me. FWIW I just added bplapply2() to DelayedArray: https://github.com/Bioconductor/DelayedArray/commit/643acc55cae3ab9d04b66c80b8e86b5d34f1ce8c With a bpexportglobals() setter for BiocParallelParam objects, bplapply2 will just become: > > bplapply2 <- function(X, FUN, ..., BPREDO=list(), BPPARAM=bpparam()) > { > bpexportglobals(BPPARAM) <- TRUE > bplapply(X, FUN, ..., BPREDO=BPREDO, BPPARAM=BPPARAM) > } > > so a silly wrapper but still convenient for my internal needs where I always want globals to be transmitted. But if BiocParallelParam objects are set to export globals by default then I won’t need bplapply2() at all :wink: Thanks! - Attachment (GitHub): Add bplapply2() · Bioconductor/DelayedArray@643acc5 > bplapply2() is a simple wrapper to BiocParallel::bplapply() that propagates base::.Options to the workers. Not exported because DelayedArray is obviously the wrong place for this (a better place wo…
Martin Morgan (16:10:18) (in thread): > exportglobals should be available in all the params, default TRUE. https://github.com/Bioconductor/BiocParallel/commit/13c2187c2569ebbbe8a3853cd8bc8853c18923f9 - Attachment (GitHub): implement bpexportglobals() · Bioconductor/BiocParallel@13c2187 > - export options() to workers - default TRUE
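So the eventual usage is roughly as below, assuming DelayedArray keeps its user-visible settings in base options() as discussed above (exportglobals is the constructor argument, bpexportglobals() the accessor):
> library(BiocParallel)
> library(DelayedArray)
> setAutoBlockSize(100 * 100 * 8)
> p <- SnowParam(workers = 2)   # exportglobals defaults to TRUE
> bpexportglobals(p)            # TRUE
> bplapply(1:2, function(i) DelayedArray::getAutoBlockSize(), BPPARAM = p)
> # both workers should now report 80000 rather than the 1e+08 default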
Aaron Lun (16:32:35) (in thread): > Thanks guys. I’ll check it out first thing in the morning.
Aaron Lun (16:35:04): > As it so happens, I have C++ code to quickly fit linear models on a variety of matrices including HDF5Matrices. I’m wondering if there’s anybody else who might find this code useful, in which case I would consider wrapping it in a package and providing access to the C++ code in addition to R-level wrappers to return specified statistics of interest (e.g., coefficients, residual variances, residual effects, fitted values, residuals and so on).
Peter Hickey (18:26:33) (in thread): > Interested
Hervé Pagès (19:34:08) (in thread): > Excellent. Thanks Martin!
Kasper D. Hansen (19:59:57): > How does it compare to limma in speed?
Kasper D. Hansen (20:00:05): > and scalability?
Kasper D. Hansen (20:00:23): > If nothing else you should share it
Kasper D. Hansen (20:00:27): > sounds useful
2018-09-14
Aaron Lun (04:31:54): > Speed is probably about the same, it’s all QR decompositions. Maybe a bit slower due to cache issues with row-level access (as limma transposes before running lm.fit) but it would definitely be more memory efficient for anything involving sparse or HDF5 matrices. Depends on whether cache issues outweigh the cost of transposition.
Aaron Lun (04:42:41) (in thread): > Works like a charm.
Aaron Lun (09:18:10): > Anyway, those of you who are interested can check out https://github.com/LTLA/flmam. - Attachment (GitHub): LTLA/flmam > Test code for fitting linear models to any matrix representation. - LTLA/flmam
Aaron Lun (09:18:23): > Feedback is welcome, PRs are even more so.
Kasper D. Hansen (09:21:11): > So for sc/HDF5 this is fitting the same linear model (same design matrix) to a row of data?
Kasper D. Hansen (09:21:32): > So essentially like running lmFit()?
Aaron Lun (09:22:43): > Yes.
Aaron Lun (09:22:50): > Currently putting up the R backend.
Aaron Lun (12:49:28): > Backend is up. It’s a couple of shades faster than lm.fit, but we do nowhere near the same amount of work (e.g., we don’t calculate residuals), so that’s to be expected. Main advantages are related to memory and representation agnosticism.
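For anyone who wants the gist without reading the C++: fitting the same design matrix to every row reuses a single QR decomposition across all genes, and can be done block by block on a DelayedMatrix. A rough R-level sketch of the idea (this is not flmam’s actual API; blockApply()/rowAutoGrid() are assumed from DelayedArray):
> library(DelayedArray)
> fit_rows <- function(mat, design) {
>     QR <- qr(design)                         # one decomposition, shared by all rows
>     blocks <- blockApply(mat, function(block) {
>         t(qr.coef(QR, t(block)))             # coefficients for the rows in this block
>     }, grid = rowAutoGrid(mat))              # blocks keep whole rows together
>     do.call(rbind, blocks)                   # genes x coefficients
> }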
Aaron Lun (13:55:27): > @Hervé Pagès Looks like beachmat’s longtests are failing, though they work fine on my machine with the same check command. Any clues as to why?
Hervé Pagès (14:00:58): > Mmmh the error is weird. I’ll check. Note that the current results are still from 2018-09-08 (we run the long tests builds only once a week, on Saturdays), so it could be that when I go on malbec1 to try to reproduce this everything will be fine for me too.
Aaron Lun (14:20:28): > okay, thanks
Aaron Lun (14:25:30): > @Mike Smith Does rhdf5 automatically save datasets in extensible mode now?
Aaron Lun (14:28:03): > Not sure where I should be looking, but some of beachmat’s optimizations are breaking in a manner that is consistent with expansion of the matrix by 1 chunk on all sides.
Hervé Pagès (16:44:39) (in thread): > I can’t reproduce this in an interactive session on malbec1 either. But it’s still failing in the context of the builds (I kicked a new long tests build and the result is here: https://bioconductor.org/checkResults/3.8/bioc-longtests-LATEST/beachmat/malbec1-checksrc.html). This seems to actually affect all the packages in the long tests builds (and also in release on the 2 machines running the long tests builds there). Looks like something is awfully broken with the long tests builds :worried: I’ll keep investigating…
Aaron Lun (16:47:54) (in thread): > :+1:
Aaron Lun (17:19:01): > Yes, definitely something changed regarding the chunk hashing between versions 1.8.* and 1.10.2 of the HDF5 library.
Aaron Lun (17:24:13): > Dammit. Looking for documentation keeps on giving me our Plos paper!
Aaron Lun (17:43:17): > I RFTS’d the HDF5 source code. Looks like they changed their hashing strategy in 1.10, in a completely undocumented way. Dammit! Now I understand why the CZI guys weren’t keen on HDF5.
Aaron Lun (17:45:38): > The old hashing strategy was
> val % nslots
> and the new one is this piece of
> unsigned u; /* Local index variable */
> val = scaled[0];
> for (u = 1; u < ndims; u++) {
>     val <<= shared->cache.chunk.scaled_encode_bits[u];
>     val ^= scaled[u];
> } /* end for */
> val % nslots
> whatever the hell that’s doing.
Aaron Lun (17:47:27): > Good grief. I’ll just give everyone a unique hash index.
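For illustration only, a small R transcription of the two schemes above (this is not HDF5’s code; `encode_bits` is a stand-in assumption for the library’s `scaled_encode_bits`):
> # 1.8-style slot: linearized chunk index modulo the number of cache slots
> old_hash <- function(chunk_index, nslots) chunk_index %% nslots
> # 1.10-style slot: shift/XOR the per-dimension chunk coordinates, then take the modulo
> new_hash <- function(scaled, encode_bits, nslots) {
>     val <- scaled[1]
>     for (u in seq_along(scaled)[-1]) {
>         val <- bitwXor(bitwShiftL(val, encode_bits[u]), scaled[u])
>     }
>     val %% nslots
> }
> new_hash(c(3, 7), encode_bits = c(NA, 4), nslots = 521)  # e.g. chunk (3, 7) of a 2-D dataset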
Aaron Lun (19:25:47): > Think I’ve figured it out. @Mike Smith is there an easy way to compile Rhdf5lib while being able to edit the source code? I notice that the HDF5 source is still in a tarball - is this strictly necessary for compilation?
2018-09-15
Aaron Lun (16:12:32): > https://forum.hdfgroup.org/t/unintended-behaviour-for-hash-values-during-chunk-caching/4869 - Attachment (forum.hdfgroup.org): Unintended behaviour for hash values during chunk caching? - HDF5 - HDF Forum > It seems that there have been some changes in the calculation of the hash value for chunk caching in 1.10.3, and I’m not sure whether these were intentional or not. To motivate this discussion, let’s say that I have a 10…
Aaron Lun (16:12:43): > The HDF forum interface is pretty sweet.
Aaron Lun (16:13:00): > I think BioC support should have markdown by default. Certainly doesn’t hurt Github, and people who don’t format their code won’t do so anyway.
Aaron Lun (16:14:56): > Incidentally, `variance.Rmd` is done.
2018-09-16
Aaron Lun (18:17:58): > @Kasper D. Hansen is the PCA package going to be in the next BioC release?
2018-09-17
Mike Smith (03:43:12) (in thread): > The configure file has checks for a folder called `hdf5` in the `src` directory. If that is present it shouldn’t untar the source, so you can edit files in there and it’ll use them.
Aaron Lun (04:34:22) (in thread): > Thanks Mike; yes, I eventually just looked at `configure` to see what the build was expecting.
Aaron Lun (17:03:25): > Interesting. I thought Mike J. was talking about matrix formats, but I can’t seem to find the message… I must be on the wrong slack.
Raphael Gottardo (17:33:13): > I think he went on the CZI slack
2018-09-18
Hervé Pagès (20:18:55) (in thread): > Confirmed. Some leftovers from the previous builds were breaking the long tests builds. Problem is fixed now and things seem to be working fine: https://bioconductor.org/checkResults/3.8/bioc-longtests-LATEST/beachmat/malbec1-checksrc.html Let me know if you run into other problems with these builds.
2018-09-19
Aaron Lun (05:32:24) (in thread): > Thanks @Hervé Pagès, this is the first time that the longtests have OK’d for beachmat :party_parrot:
Aaron Lun (16:39:59): > @Martin Morgan Can we get the 20k subset up so that we can start timing a few of the scripts with @Marcus Kinsella?
Marcus Kinsella (16:40:04): > @Marcus Kinsella has joined the channel
Kasper D. Hansen (18:28:48): > in the meantime we do have the pbmc68k
Kasper D. Hansen (22:14:48): > @Aaron Lun So, TENxBrainAnalysis, in the `dimred.Rmd`: does `rsvd()` suck the entire data matrix into RAM? And then does a random PCA? And how long does it take?
Kasper D. Hansen (22:16:36): > I am prototyping another approach which does less flops, and can work with relatively small memory overhead, but reads the data many times. Not sure how that will work though re. parsing speed / memory etc
2018-09-20
Aaron Lun (04:27:21): > No, `rsvd()` goes through the HDF5Array multiple times as well, otherwise there wouldn’t be any point in saving memory. I think it needs to do several cross-products with the random matrix; at least twice for the power iterations.
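A bare-bones sketch of those passes (standard randomized SVD with q power iterations, using an in-memory stand-in for the HDF5Array; a careful version would re-orthonormalize between iterations and use DelayedArray-aware products):
> set.seed(1)
> A <- matrix(rnorm(2000 * 100), 2000, 100)              # stand-in for the on-disk matrix
> k <- 10; p <- 10; q <- 2
> Y <- A %*% matrix(rnorm(ncol(A) * (k + p)), ncol(A))   # first pass over A
> for (i in seq_len(q)) Y <- A %*% crossprod(A, Y)       # two more passes per power iteration
> Q <- qr.Q(qr(Y))
> B <- crossprod(Q, A)                                   # one final pass
> s <- svd(B, nu = k, nv = k)
> U <- Q %*% s$u                                         # approximate left singular vectors
> D <- s$d[seq_len(k)]                                   # approximate singular values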
Kasper D. Hansen (07:57:37): > Yeah, I know it does that in theory, and in theory these cross-products can be done in bounded memory. I’ve looked at the code and it uses an algorithm I am somewhat familiar with, having implemented it for in-memory matrices myself.
Kasper D. Hansen (07:59:08): > I am asking about the actual implementation right now. If I look at the master branch of the package I don’t see any code for doing the crossproduct in any efficient way exploiting DelayedArray infrastructure - it seems like it imports `crossprod()` from the Matrix package.
Aaron Lun (07:59:22): > I assume you’re talking about BigDataAlgorithms.
Kasper D. Hansen (07:59:25): > yes
Kasper D. Hansen (07:59:35): > Should I look elsewhere?
Aaron Lun (07:59:53): > No, that’s right.
Aaron Lun (08:00:10): > It’s just a hack from `rsvd::rsvd` to get it to work.
Kasper D. Hansen (08:00:20): > ok, so if I’m guessing (and I could of course test this), if I run the code right now it sucks the data into memory, right?
Kasper D. Hansen (08:00:32): > I mean all of the data
Aaron Lun (08:00:41): > No, it shouldn’t do that, otherwise it would never have run on my desktop.
Aaron Lun (08:00:51): > It should be using DA’s `%*%`
Kasper D. Hansen (08:01:40): > ah DA already has `%*%`?
Kasper D. Hansen (08:02:05): > But I also see `crossprod(A, Y)` where A is the big matrix
Kasper D. Hansen (08:02:21): > ah ok, I get it
Kasper D. Hansen (08:02:34): > yeah ok, if `%*%` is already DA-aware I agree
Kasper D. Hansen (08:02:45): > Interesting
Aaron Lun (08:03:06): > The last time I ran it, DA was not aware of the chunk dimensions.
Aaron Lun (08:03:17): > So there have probably been advances since.
Kasper D. Hansen (08:03:25): > So how long does it take right now on 1.3M? Was it 24h on how many cores?
Aaron Lun (08:03:30): > 1 core.
Kasper D. Hansen (08:03:40): > Do you have a mac?
Aaron Lun (08:03:55): > No, a dell optiplex 7050 running i5 7th gen.
Kasper D. Hansen (08:04:05): > What BLAS for R?
Kasper D. Hansen (08:04:22): > Eventually you need `%*%` for a small matrix and that will be BLAS
Aaron Lun (08:04:30): > … the usual? > > R version 3.5.0 Patched (2018-04-30 r74679) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: Ubuntu 16.04.5 LTS > > Matrix products: default > BLAS: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRblas.so > LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-5-branch/lib/libRlapack.so >
Aaron Lun (08:04:54): > It would probably be even faster if you used DA to load it in as a sparse matrix and do the cross product with that.
Aaron Lun (08:05:21): > Keeping in mind that you will have to do some additional work to handle the mean-centering.
Kasper D. Hansen (08:07:51): > I have solved the mean centering and scaling issues for sparse matrices
Kasper D. Hansen (08:08:24): > (which is why I am sparse matrix focused on#delayed_array)
Kasper D. Hansen (08:08:53): > ok, this is super useful thanks.
Kasper D. Hansen (08:09:01): > I have identified 3 ways of doing it
Kasper D. Hansen (08:09:18): > 1 way needs 1-2 passes over A but does the most multiplications by far
Kasper D. Hansen (08:10:20): > 1 way (essentially `rsvd`) does a few passes over A, might have relatively big intermediate products, does fewer multiplications, not sure how well it can exploit sparsity
Kasper D. Hansen (08:10:40): > 1 way does many passes over A, but has by far the fewest multiplications
Aaron Lun (08:11:12): > Regardless of exactly how it is done, is the core theoretical framework still based on randomized SVD?
Aaron Lun (08:11:24): > Or are you using another approximate SVD method?
Kasper D. Hansen (08:11:35): > The last way is the one I have gotten furthest with, because I thought it would be fastest. Now when I am facing down the barrel of data access in HDF5 I am not so sure
Kasper D. Hansen (08:11:45): > This is 3 different theoretical approaches
Kasper D. Hansen (08:11:51): > the first one actually does nothing
Kasper D. Hansen (08:12:22): > but can be sped up a bit by doing some low-k SVD method (like randomized PCA) on an in-memory matrix
Aaron Lun (08:13:17): > I’ve been less than happy with randomized SVD in practice, as you need to ensure that the `k` + the extra dimensions > true rank to get accurate results for the first `k` components in PCA.
Aaron Lun (08:13:36): > Other approximate methods like `irlba` don’t have that requirement.
Kasper D. Hansen (08:13:56): > the second one is randomized PCA (a term I am now disliking because I have found out there are multiple randomized PCAs)
Kasper D. Hansen (08:14:03): > the third one is the irlba algorithm
Kasper D. Hansen (08:14:27): > Hmm, I don’t remember that result
Aaron Lun (08:14:34): > I think I mentioned it a while ago
Aaron Lun (08:14:40): > (referring to `rsvd::rsvd`)
Kasper D. Hansen (08:14:47): > The randomized PCA I use (which is similar, but slightly different from yours) I am not sure. I will check
Kasper D. Hansen (08:15:05): > I thought for “my” version we just need `k=2` to get numerical accuracy
Kasper D. Hansen (08:15:19): > I thought it was independent of true rank. Will check. That is super important
Kasper D. Hansen (08:15:50): > But anyway, I have a lot more experience now. Substantial progress is being made right now
Kasper D. Hansen (08:16:12): > I will try to write down these thoughts a bit more formally
Kasper D. Hansen (08:16:27): > one `irlba` drawback is that it needs starting values which it picks at random
Kasper D. Hansen (08:16:44): > and for `pbmc68k` raw - that is, integer values - the starting values fail
Kasper D. Hansen (08:17:16): > which is not the problem we want to solve (we want to take something like logs etc) but not a great sign
Kasper D. Hansen (08:17:49): > ok, that is all super helpful
Kasper D. Hansen (08:17:50): > thanks
JiefeiWang (12:48:51): > @JiefeiWang has joined the channel
Martin Morgan (16:30:02) (in thread): > This is available now, sorry for the delay: `BiocManager::install("Bioconductor/TENxBrainData"); TENxBrainData::TENxBrainData20k()`, modulo installing dependencies, which aren’t automatically installed until the package propagates to the public Bioconductor repository (maybe not until next week?)
Aaron Lun (16:30:34) (in thread): > Thanks
Aaron Lun (18:19:47): > @Mike Jiang I notice you raised an issue about parallelization in the Rtsne github repo. Did it end up giving any speed-ups?
Aaron Lun (18:20:08): > I haven’t even tried running Rtsne on the 1.3 million data set, don’t know if it’ll be happy or not.
Aaron Lun (18:27:59): > If a lot of time is being spent searching for k nearest neighbors, we could try precomputing that a la https://github.com/jkrijthe/Rtsne/issues/32 - Attachment (GitHub): Usin kNN matrix directly in Barnes-Hut · Issue #32 · jkrijthe/Rtsne > It would be very useful to have a knn option, as an alternative to the experimental is_distance option for very large matrices, such that given n x k distance matrix can directly be plugged into bh…
Mike Jiang (18:34:28) (in thread): > Yes, it did when I tested it back then. You will need to install the openmp branch.
Aaron Lun (19:09:54) (in thread): > Will do.
2018-09-24
Aaron Lun (12:39:19): > FYIhttps://github.com/jkrijthe/Rtsne/pull/38
Aaron Lun (14:21:16): > Trying to compute exact k-NNs with kmknn on 1e6 points with 50 dimensions; taking ~2 hours across 10 cores! Which is actually not as bad as I expected, but I think I will add an “approximate neighbors” option to kmknn via RcppAnnoy; see https://github.com/LTLA/kmknn/issues/1. - Attachment (GitHub): Generalize package to support approximate nearest neighbors · Issue #1 · LTLA/kmknn > I'm thinking of generalizing the package to support approximate nearest neighbor matching via RcppAnnoy. This would provide users with the option of sacrificing accuracy for much greater speed….
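For reference, the kind of RcppAnnoy usage being considered (a toy-scale sketch under assumed sizes; the eventual kmknn interface would hide this behind its own API):
> library(RcppAnnoy)
> set.seed(0)
> X <- matrix(rnorm(10000 * 50), ncol = 50)               # toy stand-in for the 1e6 x 50 matrix
> ann <- new(AnnoyEuclidean, ncol(X))
> for (i in seq_len(nrow(X))) ann$addItem(i - 1L, X[i, ]) # items are 0-based
> ann$build(50)                                           # more trees = slower build, better accuracy
> ann$getNNsByItem(0L, 10L)                               # 10 approximate nearest neighbours of point 1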
2018-09-25
Lori Shepherd (12:03:28) (in thread): > I updated the AnnotationHub code to use AnnotationDbi::dbFileConnect instead of RSQLite::dbConnect - both Hubs should now open read-only with the vfs=“unix-none” setting - you will need version 2.13.7 of AnnotationHub
Juan R Gonzalez (13:14:09): > @Juan R Gonzalez has joined the channel
Stuart Lee (13:42:50): > @Stuart Lee has joined the channel
2018-09-26
Abbas Rizvi (01:34:09): > @Abbas Rizvi has joined the channel
2018-09-27
Aaron Lun (15:51:17): > FYIhttps://forum.hdfgroup.org/t/dimensionality-of-the-memory-dataspace-when-reading-from-a-dataset/4922 - Attachment (forum.hdfgroup.org): Dimensionality of the memory DataSpace when reading from a DataSet - HDF5 - HDF Forum > It seems that the dimensionality of the memory H5::DataSpace has a major effect on the performance of the HDF5 API. If I wanted to read a rectangular A x B block of values from a two-dimensional H5::DataSet, I could defi…
2018-09-28
Saad Khan (12:20:05): > @Saad Khan has joined the channel
Aaron Lun (14:08:09): > Holy shit, the NN detection via RcppAnnoy for a 1e6 x 50 matrix finishes in 5 minutes across 10 cores! Check out `LTLA/kmknn@annoyance`.
Aaron Lun (16:58:36): > FYI https://marionilab.cruk.cam.ac.uk/iSEE_tcga/ now uses HDF5-backed storage! `HDF5Array` in the wild!
Aaron Lun (16:59:04): > This cuts load times by 10-20 seconds as the RDS doesn’t contain the TCGA count matrix any more.
Aaron Lun (17:59:41): > Man, this is pretty amazing. I can interactively explore the TCGA data set on a crappy laptop that’s >8 years old.
2018-09-29
Raphael Gottardo (00:20:59): > This is really nice.
Aaron Lun (11:41:51): > @Martin Morgan I’m ready for the kmknn renaming to BiocNeighbors (see https://github.com/LTLA/kmknn/pull/2). Instructions to proceed? - Attachment (GitHub): Support approximate nearest neighbors by LTLA · Pull Request #2 · LTLA/kmknn > Easy interoperable support for approximate nearest neighbors via RcppAnnoy, closes #1.
Martin Morgan (12:30:42): > It’s not been released yet, so I think we should simply change the (bioc) repo name. If git.bioconductor.org/packages/kmknn is up-to-date, I can arrange to rename that to git.bioconductor.org/packages/BiocNeighbors. So just confirm that the git.bioconductor repository is current and I’ll do the rest.
Aaron Lun (12:40:18): > Thanks Martin; the bioconductor git is now up-to-date and ready to go.
Aaron Lun (14:28:26): > @Martin Morgan All packages that used to depend on kmknn now depend on BiocNeighbors and pass CHECK locally. So, we’re good to go.
2018-09-30
Aaron Lun (13:25:41): > I would also add that scran and cydar have crashed in a flaming heap as BiocNeighbors is not yet available. So, the sooner the better.
Martin Morgan (16:04:00): > they’ll fail again tonight but hopefully tomorrow night’s builds will be fine.
Aaron Lun (16:05:03): > Sweet, thanks.
2018-10-01
Stian Lågstad (04:50:35): > @Stian Lågstad has left the channel
Nitesh Turaga (10:52:49): > @Aaron Lun I will help you with the renaming of BiocNeighbors. I’ll let you know once the repo is up to date on the bioc git server.
Aaron Lun (10:53:09): > Thanks Nitesh. Was there anything else that I needed to do?
Nitesh Turaga (10:53:42): > Not right now. I just need to get things right on the server.
Nitesh Turaga (11:41:01): > Hi @Aaron Lun, please check now, and update your remotes on your local copy. Sync once more.
Aaron Lun (11:42:08): > Thanks Nitesh, this seems to have worked.
2018-10-04
Aaron Lun (17:52:03): > FYIhttp://bioconductor.org/packages/devel/workflows/vignettes/simpleSingleCell/inst/doc/xtra-5-bigdata.html
Aaron Lun (17:54:18): > @Kasper D. Hansen if you make a wrapper package that allows users to switch SVD algorithms with a `PARAM` argument, then I can add that too.
2018-10-09
BJ Stubbs (13:56:31): > @BJ Stubbs has joined the channel
2018-10-17
Vince Carey (12:21:49): > This is probably germane to our HCA channels but I thought I would start here. I just got cellxgene running and wondered about how to run it on something other than the example data. I came acrosshttps://anndata.readthedocs.io/en/latest/anndata.AnnData.html#anndata.AnnData… is anyone familiar with this, and contemplating interoperation opportunities?
Vince Carey (12:22:23): > cellxgene is athttps://github.com/chanzuckerberg/cellxgene - Attachment (GitHub): chanzuckerberg/cellxgene > React + Redux app for exploration of large scale scRNA-seq datasets - chanzuckerberg/cellxgene
Raphael Gottardo (23:52:18): > I don’t have any experience with this but we should look at it. Note that there are other similar efforts, e.g. the UCSC cell browser.
2018-10-18
Marcus Kinsella (12:45:26) (in thread): > i’m pretty familiar with the project if you have any questions. generally, the goal is to enable a number of different “backends” for the visualization tool. the first one is scanpy/anndata, but in the longer term they’d want SingleCellExperiment too
2018-10-19
Aaron Lun (13:36:20): > FYI https://github.com/LTLA/BiocSingular - Attachment (GitHub): LTLA/BiocSingular > Centralized Bioconductor resource for SVD algorithms. - LTLA/BiocSingular
Kasper D. Hansen (15:02:55): > I’ll try to add some stuff in here soon
Kasper D. Hansen (15:03:06): > I assume this is where you want the PCA to go
Aaron Lun (17:19:07): > Yes.
Aaron Lun (17:19:44): > @Hervé Pagès I think the parallelized DA `%*%` and `crossprod` should be a priority, it will enable us to get past our current roadblocks.
2018-10-20
Aaron Lun (11:38:51): > @Kasper D. HansenWe should talk a bit about the design of the package before you start throwing stuff in.
2018-10-21
Aaron Lun (10:38:13): > God, the DelayedArray `%*%` is slow. 15 minutes and counting on each of 10 cores to compute `x %*% y` for a 1.3 million `x` and a vector `y`? That can’t be right.
Aaron Lun (10:43:55): > Okay, 30 seconds for 5000 x 1000 matrix %*% 1000 vector.
Aaron Lun (10:46:57): > which is fully attributable to the realization time; `as.matrix` takes effectively as long.
Aaron Lun (10:48:16): > The funny thing is that `as.matrix(counts(sce)[,1:1000])` takes only 3 seconds!
Aaron Lun (10:48:26): > And `as.matrix(logcounts(sce)[,1:1000])` takes only 7 seconds.
Aaron Lun (10:50:58): > where `logcounts(sce)` is just a log-transformed version of `counts(sce)`.
Aaron Lun (10:55:56): > Well, here’s an interesting little annoyance. Assume I have a 1.3 million SingleCellExperiment `sce`, with a DA `logcounts`. I want to subset to a set of highly variable genes specified by the `dec` dataframe, so I do:
> library(BiocStyle)
> library(SingleCellExperiment)
> sce <- readRDS("objects/sce.rds")
> dec <- read.table("objects/hvg_output.txt", stringsAsFactors=FALSE, header=TRUE)
> chosen <- order(dec$bio, decreasing=TRUE)[1:5000]
> exprs.mat <- logcounts(sce)
> system.time(x1 <- as.matrix(exprs.mat[,1:1000]))
> exprs.mat <- exprs.mat[rowData(sce)$Ensembl %in% dec$Ensembl[chosen],]
> system.time(x1 <- as.matrix(exprs.mat[,1:1000]))
> The first `as.matrix` takes 6.5 seconds but the second `as.matrix` takes 38 seconds - despite the latter ostensibly requiring *less* information!
Aaron Lun (11:19:45): > It shouldn’t take more than 5 minutes on each of 10 workers to run through the entire `HDF5Matrix` and do some calculations, based on timings with `scater::calculateQCMetrics`.
Aaron Lun (11:21:59): > Currently it’s taking 20 minutes.
Aaron Lun (11:22:14): > So there’s something with the DA that’s introducing a 15 minute overhead.
Aaron Lun (11:30:10): > Even the log-transformation should only give a +3 minute overhead when split across 10 workers.
Aaron Lun (11:33:08): > Maybe it’s because `.BLOCK_mult_by_left_matrix` doesn’t respect the chunking?
Aaron Lun (11:34:42): > Having said that, even in the best case with a 5-minute (clock time) matrix multiplication, irlba requires about `nv * 2` such multiplications, so it would still take about 500 minutes to get the top 50 PCs.
Kasper D. Hansen (22:14:44): > There are essentially 3 ways to do the PCA
Kasper D. Hansen (22:16:39): > 1) you do the full crossproduct `A %*% t(A)` where A is the data matrix.
> 2) you use rsvd as in BigDataAlgorithms, which does a number of `A %*% B` with B a smaller matrix.
> 3) you use irlba, which does a decent number of `A %*% y` and `y %*% A` with y a vector.
Kasper D. Hansen (22:17:25): > from 1->3 you do less computation (in terms of number of multiplications) but you access the data more. It is very context dependent which of these is fastest, but I hope irlba can be made fast
Kasper D. Hansen (22:18:18): > Now, here I have ignored the issue of standardization (row-centering, transforming to unit variance). This can be addressed by building the standardization into the matrix products.
Kasper D. Hansen (22:19:43): > But for example for using irlba this means overwriting a standard matrix product (the `%*%` method) with a special method which does something else (i.e. includes standardization). It may also mean that we need the ability to switch out multiplication methods on the fly.
Kasper D. Hansen (22:20:52): > I have been having a lot of thoughts on how to design this. I think the best way is to take an SE and wrap it into a new class with a special `%*%` designed for this situation (plus precomputed row means and SDs)
Kasper D. Hansen (22:20:58): > This may be a bit rambling …
Kasper D. Hansen (22:21:48): > Finally, as Aaron is saying, sometimes DA has a ton of overhead. In my experience with the methylation arrays, overhead can be substantially reduced with a lot of effort.
Kasper D. Hansen (22:22:20): > Also, I think irlba is in theory the fastest way, but I am wondering if data access going through DA is a bottleneck
2018-10-22
Aaron Lun (04:54:29): > It’s not much hassle to implement all of them, actually, and have the user decide which is the fastest.
Aaron Lun (04:54:47): > We just need to talk to the rsvd developer and make them allow arbitrary multiplication and crossproduct functions.
Aaron Lun (04:55:19): > Much likehttps://github.com/bwlewis/irlba/issues/43 - Attachment (GitHub): Un-deprecate mult? · Issue #43 · bwlewis/irlba > Is it possible to un-deprecate the mult argument? I would like to use a custom matrix multiplication algorithm (in this case, parallelized with BiocParallel) on matrix representations that already …
Martin Morgan (06:59:46) (in thread): > naively following along without any real experience, but how do you use `%*%` with a DelayedArray? I have
> n <- 100; dim <- c(50, 10) * n
> m <- matrix(rnorm(prod(dim)), dim[1]); v <- rnorm(dim[2])
> md <- DelayedArray(m)
> vd <- DelayedArray(matrix(v, nrow = n))
> but then
> > md %*% v
> Error in md %*% v : requires numeric/complex matrix/vector arguments
> > md %*% vd
> Error in md %*% vd :
>   multiplication of 2 DelayedMatrix objects is not supported, only
>   multiplication of an ordinary matrix by a DelayedMatrix object at the
>   moment
Martin Morgan (07:17:25) (in thread): >
> n <- 100; dim <- c(50, 10) * n
> m <- matrix(rnorm(prod(dim)), dim[1])
> md <- DelayedArray(m)
> Base R performance is pretty revealing
> > ridx <- 1:50; cidx <- 1:10
> > microbenchmark(
> +   m[ridx,],
> +   m[,cidx],
> +   m[,cidx][ridx,],
> +   m[,ridx][cidx,],
> +   m[ridx, cidx]
> + )
> Unit: microseconds
>               expr     min       lq      mean   median       uq      max neval
>          m[ridx, ] 409.612 541.1585 586.98998 577.0755 618.1120  965.812   100
>          m[, cidx] 409.293 522.5260 547.54903 531.0155 551.9175  808.718   100
>  m[, cidx][ridx, ] 407.865 528.8330 593.51027 535.2340 552.2195 5858.278   100
>  m[ridx, ][, cidx] 458.763 546.5070 573.63234 556.8975 592.9695  732.605   100
>      m[ridx, cidx]   5.124   5.5275   6.12323   5.9185   6.2655   20.094   100
> where I guess almost all the cost is in the allocation – it pays to accumulate indexes, and then to perform the operation in one go.
Aaron Lun (07:45:44) (in thread): > The `md %*% v` requires a cbind’d `v` to work.
Aaron Lun (09:28:52): > Anyone: how would I force a package to use the S4 implicit `crossprod` rather than the default `crossprod`?
Aaron Lun (09:29:15): > I’m currently trying to modify rsvd to work on non-ordinary matrices, and this is one of the sticking points.
Martin Morgan (09:38:11) (in thread): > I’d guess that the implicit is already invoked, but by the time whatever you pass to user-facing rsvd functions has been coerced to a standard matrix (e.g., via `as.matrix()` in the first line of `rsvd:::rsvd.default()`).
Aaron Lun (09:38:24) (in thread): > Yeah, I’ve removed that line.
Aaron Lun (09:39:57) (in thread): > So for example:
> library(Matrix)  # for rsparsematrix()
> library(rsvd)    # compiled after removing the as.matrix
> a <- rsparsematrix(10, 10, 0.1)
> rsvd(a)
> ## Error in crossprod(x, y): ...
Kasper D. Hansen (09:52:11): > Wait a minute
Kasper D. Hansen (09:52:38): > I was not referring to the `rsvd` package but to the function you have in `BigDataAlgorithms`. That’s where I would start
Kasper D. Hansen (09:53:40): > Your github issue on reinstating `mult` is both right and wrong IMO. It is wrong because it all works if you write a specific `%*%` method (well, you need two methods). It is right because of the overloading thing
Kasper D. Hansen (09:53:46): > But perhaps its easier to talk
Aaron Lun (09:54:01): > The function in BigDataAlgorithms is just ripped from rsvd anyway.
Kasper D. Hansen (09:54:19): > yeah, we have another implementation which is similar but not identical
Kasper D. Hansen (09:54:24): > I would just use your own code
Kasper D. Hansen (09:54:34): > Then you have control as opposed to using a package
Aaron Lun (09:54:52): > But that increases maintenance.
Aaron Lun (09:55:27): > I’d hate to have to rewrite the thing just for a few changes.
Aaron Lun (09:56:00): > And if there are clear-cut improvements, we would have the greatest impact if we implemented those improvements to the reference package, so that everyone benefits.
Kasper D. Hansen (09:57:04): > It’s not clear to me that the rsvd package is the reference implementation
Kasper D. Hansen (09:57:29): > But it is kind of moot. Unless the rsvd package sets up the right imports, you’ll never get it to work
Kasper D. Hansen (10:02:10): > want to talk?
Martin Morgan (10:03:00) (in thread): > Is there an implicit generic? I think it’s a plain-old function, and the generic requires `Matrix::crossprod()`?
Aaron Lun (10:15:21): > Yeah
Aaron Lun (10:16:28): > I’ve got a talk at 5pm on my end, but can do afterwards - what about 7 pm my time, 2 pm yours (in Baltimore)?
Aaron Lun (10:16:47): > Or in other words, anytime after 1pm on your time.
Aaron Lun (10:43:03) (in thread): > > > Matrix::crossprod > standardGeneric for "crossprod" defined from package "base" > > function (x, y = NULL, ...) > standardGeneric("crossprod") > <bytecode: 0x64c4bb8> > <environment: 0x64be598> > Methods may be defined for arguments: x, y > Use showMethods("crossprod") for currently available ones. >
Aaron Lun (10:43:18) (in thread): > So I assumed it was a generic of some sort frombase
.
Martin Morgan (10:59:43) (in thread): > Yeah confusing, the promotion indicates where the default implementation is defined; in a new R session > > > crossprod > function (x, y = NULL) > .Internal(crossprod(x, y)) > <bytecode: 0x7fed61590228> > <environment: namespace:base> > > setGeneric("crossprod") > [1] "crossprod" > > crossprod > standardGeneric for "crossprod" defined from package "base" >
Aaron Lun (11:27:21) (in thread): > Okay. So, what’s the modification I can suggest to the rsvd maintainers so that their `crossprod` call uses the S4 generic?
Aaron Lun (11:27:42) (in thread): > Should I ask them to put `setGeneric("crossprod")` in their code somewhere?
Martin Morgan (11:34:43) (in thread): > That would probably be a bad idea – there’d be a generic in Matrix and a generic in rsvd; DelayedArray likely plays well with Matrix, so its methods are on Matrix’s generic. They should `importFrom(Matrix, crossprod)`, which will add a dependency (which sounds perfectly reasonable to me…)
Aaron Lun (11:35:07) (in thread): > Yep, that sounds sensible enough. I’ll suggest that to them.
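A quick illustration of why that import matters (illustrative only, not rsvd’s actual code): `base::crossprod()` is not generic, but `Matrix::crossprod()` is an S4 generic, so code that imports it dispatches on classes such as `dgCMatrix` and keeps the result sparse.
> library(Matrix)
> a <- rsparsematrix(10, 5, 0.2)
> crossprod(a)   # dispatches to Matrix's method and returns a sparse symmetric matrix
> # the suggested NAMESPACE change for rsvd: importFrom(Matrix, crossprod)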
Aaron Lun (12:40:12): > Hello?
Kasper D. Hansen (13:36:20): > Sorry. Still around?
Aaron Lun (13:40:22): > Yep
Aaron Lun (14:11:13): > Also, @Martin Morgan: I didn’t realize that the mere act of defining a `MulticoreParam` class would change the seed:
> library(BiocParallel)
> set.seed(100)
> runif(1) # 0.3077661
> set.seed(100)
> MulticoreParam(2)
> runif(1) # 0.2576725
> Is this meant to happen?
Kasper D. Hansen (14:16:17): > Does it change the random generator?
Kasper D. Hansen (14:16:25): > that is pretty weird
Aaron Lun (14:16:28): > Yes, the seed is changed.
Kasper D. Hansen (14:16:35): > the generator, not the seed
Aaron Lun (14:16:39): > Oh.
Aaron Lun (14:16:42): > Dunno.
Aaron Lun (14:17:11): > sounds like a pretty odd thing to do.
Aaron Lun (14:17:34): > Anyway, were we going to talk or what?
Kasper D. Hansen (14:17:48): > yes, skype, facetime, hangouts?
Aaron Lun (14:18:12): > Skype, let me just boot up my skyping computer.
Aaron Lun (14:18:51): > aaronlun110
Aaron Lun (14:19:02): > gotta move to a quiet office.
Hervé Pagès (14:29:57) (in thread): > @Martin Morgan You first need to turn `v` into a matrix at the moment: `md %*% as.matrix(v)`. Convenient indeed to be able to skip that coercion so I’ll make this work directly on a vector.
Hervé Pagès (14:42:21) (in thread): > Can’t remember whether `.BLOCK_mult_by_left_matrix` is smart enough to respect the chunking at the moment but that’s definitely something it should do. Improving `%*%` on DelayedArray objects has been on my list for a while and will be a top priority after the release: https://github.com/Bioconductor/DelayedArray/issues/33 - Attachment (GitHub): Implementing crossprod and dual-DA %*% · Issue #33 · Bioconductor/DelayedArray > I recently needed crossprod for DA objects, as well as a %*% that would accept two DAs. I've put the code I've used here, in the hope that it may help development of the methods within the …
Hervé Pagès (14:44:45) (in thread): > [my previous post was actually meant to be a reply to this, so posting again but now in the right thread] Can’t remember whether `.BLOCK_mult_by_left_matrix` is smart enough to respect the chunking at the moment but that’s definitely something it should do. Improving `%*%` on DelayedArray objects has been on my list for a while and will be a top priority after the release: https://github.com/Bioconductor/DelayedArray/issues/33
Martin Morgan (15:28:32) (in thread): > it (and other params) chooses a random port for workers to communicate with the manager; avoid the random number generation using `MulticoreParam(manager.port=12345L)` or another open port.
Aaron Lun (15:30:18) (in thread): > Ah, okay.
Aaron Lun (15:32:36): > Summing up my chat with @Kasper D. Hansen - optimizing matrix multiplication and cross products will be a priority as soon as the next release is finished. I will write BiocSingular so that it will be plug-and-play with different matrix multiplication schemes, insofar as that is possible.
Aaron Lun (15:41:23): > FYI https://github.com/erichson/rSVD/issues/4 - Attachment (GitHub): erichson/rSVD > Randomized Matrix Decompositions using R. Contribute to erichson/rSVD development by creating an account on GitHub.
Kasper D. Hansen (15:47:01): > nice
Aaron Lun (15:53:06) (in thread): > Thanks @Hervé Pagès. It would also help if pristine DAs could directly use the `%*%` and `crossprod` defined for the seeds, which would provide a great performance boost for sparse matrices.
Aaron Lun (17:14:44): > Alright. Not planning to sleep tonight. Will do this refactoring.
Aaron Lun (17:18:20): > Anyone else still awake?
Hervé Pagès (17:20:34) (in thread): > Explicit coercion of vector to ordinary matrix should not be needed anymore:https://github.com/Bioconductor/DelayedArray/commit/6f0b3bc7030637c7594c0590dde36594ec49552f - Attachment (GitHub): %*% now works between a DelayedMatrix object and an ordinary vector · Bioconductor/DelayedArray@6f0b3bc > Delayed operations on array-like objects. Contribute to Bioconductor/DelayedArray development by creating an account on GitHub.
Aaron Lun (17:21:23) (in thread): > Sweet.
Aaron Lun (17:21:45) (in thread): > I have some more thoughts about this that I’ll try to put into an editable document somewhere.
Aaron Lun (17:21:59) (in thread): > This channel is probably not the best place to reason over this.
Hervé Pagès (17:23:33): > yep, only 2:23 pm here in Seattle…:wink:
Aaron Lun (17:25:25): > sweet.
Aaron Lun (19:21:29): > Scaled/centered methods added for `%*%`, `crossprod`, `tcrossprod`.
Aaron Lun (19:53:38): > @Kasper D. Hansen `BiocSingular:::bs_matrix` now contains centering and scaling support for `%*%`, `crossprod` and `tcrossprod` that don’t involve two `bs_matrix` objects.
Aaron Lun (20:37:44): > Okay, BiocSingular is plug-and-play with whatever `%*%` or `crossprod` is defined for an input matrix class, without breaking sparsity due to centering.
Aaron Lun (20:39:23): > On the downside, all responsibility for parallelization has been shifted to those same `%*%` or `crossprod` functions.
Kasper D. Hansen (20:39:39): > sounds sweet. I’ll take a look tomorrow
Aaron Lun (20:39:58): > I think this is more natural anyway, because then I don’t have to second-guess @Hervé Pagès when deciding on the best way to distribute jobs across workers in a chunk-aware manner.
Aaron Lun (20:42:49): > The problem is that `%*%` doesn’t get parallelized for non-DA objects, e.g., `dgCMatrix`. You might say that I could wrap it into a DA, but then the block processing would unnecessarily destroy sparsity. Some modification probably required to distribute a sparse matrix in a DA seed.
Aaron Lun (20:44:10) (in thread): > Look at `bs_matrix.R`, which is the sparsity-preserving non-parallelized version. `bpmatrix.R` contains the parallelized sparsity-destroying version.
Aaron Lun (21:06:08): > Hm. Well, it’s 2am, but there’s not much point going to sleep now.
2018-10-23
Aaron Lun (00:20:24): > I’m going home.
Hervé Pagès (01:07:49): > @Aaron Lun The plan is to have block processing preserve sparsity. This is already enabled for things like `as(x, "TENxMatrix")` (if `is_sparse(x)` is TRUE) and will be enabled for other operations, including matrix multiplication. I spent a good amount of time in September putting in place the infrastructure for this. The idea is to use `read_sparse_block()` instead of `read_block()` to load blocks from `x`, but this can only be done when `is_sparse(x)` is TRUE (i.e. if sparsity of the seed propagated all the way thru the delayed operations carried by `x`).
Aaron Lun (04:07:16): > Right. That sounds sensible.
Kasper D. Hansen (08:23:46): > sweet
Kasper D. Hansen (08:24:00): > This will likely be critical for performant matrix operations
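A hedged sketch of the dispatch Hervé describes, using only the function names mentioned above (signatures assumed; the per-block work is left as a placeholder because the exact sparse block class may differ):
> library(DelayedArray)
> process_by_column_block <- function(x) {
>     grid <- colGrid(x)
>     for (i in seq_along(grid)) {
>         vp <- grid[[i]]
>         blk <- if (is_sparse(x)) {
>             read_sparse_block(x, vp)   # sparsity-preserving path
>         } else {
>             read_block(x, vp)          # dense fallback
>         }
>         ## ... process 'blk' for this viewport (e.g. partial column sums) ...
>     }
> }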
Aaron Lun (10:52:14): > Woah, here’s a blast from the past - `pcaMethods`.
Aaron Lun (10:53:51): > Though this seems more like imputation-related.
Aaron Lun (10:53:56): > Which is not really what we’re interested in.
Aaron Lun (10:55:35): > Does anyone have experience with NIPALS vs IRLBA vs RSVD?@Aedin Culhanewere you the one who was talking about NIPALS?
Kasper D. Hansen (10:57:02): > We used NIPALS a while ago, but just for standard matrices, in minfi.
Kasper D. Hansen (10:57:33): > I didn’t like it for stability reasons, but I didn’t evaluate speed. We switched to an internal random svd
Aedin Culhane (11:00:17): > Nipals works if you want to approx first Co moment but is slow others. I want to bring Chen meng into conversation he did more comparisons than I
Aaron Lun (11:27:39): > Hm, okay. Just wondering if it’s worth wrapping up in BiocSingular.
Aaron Lun (11:28:20): > “first” being “first few” or literally the first PC?
Aedin Culhane (12:32:21): > Sorry I was typing using phone. Horrific typos.
Aedin Culhane (12:33:29): > We used nipals to compute 1st compute to weight the datasets in multi dataset decomposition.
Kasper D. Hansen (12:47:06): > I don’t think this is more understandable:slightly_smiling_face:
Aaron Lun (13:26:17): > What I want to know is: can I use NIPALS as an approximate SVD to get the first few singular values/vectors?
Aaron Lun (13:26:32): > It sounds like this is not possible.
Aaron Lun (13:43:41): > dragging @Chen Meng into this channel.
Chen Meng (13:43:44): > @Chen Meng has joined the channel
Aaron Lun (14:15:53): > So, @Chen Meng: what I’d like to know is if NIPALS can be used as an approximate truncated SVD.
Aaron Lun (14:16:02): > Basically like `irlba::irlba` or `rsvd::rsvd`.
2018-10-24
Chen Meng (02:43:39): > NIPALS can do that, but how much faster it is than the regular SVD depends on the dimension of the matrix. I can benchmark it.
Chen Meng (03:49:47): > a dirty benchmark: the running time of NIPALS and SVD is roughly on the same scale for decomposing a 1000 x 500 matrix; NIPALS is roughly 3x faster than SVD for a 10000 x 5000 matrix
Chen Meng (03:51:21): > the nipals in mixOmics was used. NIPALS only computes 1 SV at a time, and SVD always computes all of them
Chen Meng (03:52:02): > also nipals seems to not converge within 1000 iterations for large matrices
Gabriele Sales (05:54:19): > @Gabriele Sales has joined the channel
Vince Carey (07:23:23): > Is there a method for coercing seurat objects to SingleCellExperiment?
Aaron Lun (07:38:59): > I thought they have one in Seurat.
Aaron Lun (07:43:14): > @Chen Meng Hm. So NIPALS probably won’t provide much advantage over IRLBA and rsvd for our applications.
Chen Meng (07:54:18): > No, NIPALS is slow, but the good thing is you can easily incorporate penalties. I am looking into RSVD, it’s like magic … interesting
Chen Meng (08:38:45): > @Aaron Lun interesting, it seems when both m and n are large, the solution of rsvd is not stable
Chen Meng (08:39:26): >
> library(rsvd)
> a <- matrix(rnorm(5e6), 5e4, 1e2)
> system.time(
>   res <- rsvd(A = a, k = 1)
> )
> system.time(
>   s <- svd(a)
> )
> plot(res$u, s$u[, 1])
> plot(res$v, s$v[, 1])
Chen Meng (08:39:53): > that’s what I tried
Aaron Lun (08:41:08): > Yes, this is what I observe as well. I think it is because we’re throwing random matrices in with no structure, it does better at recovering singular values/vectors when the first few SVs contribute to most of the variance.
Aaron Lun (08:41:39): > Or specifically; the rsvd approach assumes that the matrix is T + E where T is the true signal of low rank and E is a high-dimensional noise.
Aaron Lun (08:42:25): > It can then do the SVD of T to accurately recover the first k singular vectors where ‘k’ is the rank of T.
Chen Meng (08:55:29): > hmm, makes sense. Sounds like a method you can use to determine the number of meaningful components …
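A quick check of that framing under the stated T + E assumption (toy simulation, toy sizes): with a genuinely low-rank signal buried in noise, the randomized SVD recovers the leading singular vectors of the full SVD.
> library(rsvd)
> set.seed(1)
> n <- 5000; p <- 100; k <- 3
> signal <- matrix(rnorm(n * k), n, k) %*% matrix(rnorm(k * p), k, p) * 10  # low-rank T
> noise <- matrix(rnorm(n * p), n, p)                                       # high-dimensional E
> a <- signal + noise
> res <- rsvd(a, k = k, q = 2)
> full <- svd(a, nu = k, nv = k)
> round(abs(crossprod(res$v, full$v)), 3)  # near-identity up to sign when k covers the true rank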
Stephanie Hicks (09:14:29) (in thread): > https://github.com/kevinrue/BioSeurat - Attachment (GitHub): kevinrue/BioSeurat > Converter from Seurat to SingleCellExperiment. Contribute to kevinrue/BioSeurat development by creating an account on GitHub.
Vince Carey (09:53:05): > Thanks for this …there is a Convert method in Seurat also … but you need to use Convert(x, “sce”) … contrary to the doc. There is a big “switch” in the Convert.seurat function … it is not the usual approach to coercion for S4.
Kasper D. Hansen (10:25:10): > While I appreciate the conversion methods in Seurat I think there is a reason why we would want conversion methods within Bioc
Martin Morgan (11:22:37): > I wonder if ‘conversion’ is the right metaphor; is the Seurat representation fully defined and completed mapped to SCE, or is it more like we have concepts in Bioc (‘assay’, ‘rowData’) that map to concepts in the Seurat object (as well as additional concepts on both sides)?
Aaron Lun (11:26:37): > I would suggest migrating this discussion to the#singlecellexperimentchannel.
2018-10-26
Vince Carey (10:54:52): > There is a problem with SingleCellExperiment coercion from RangedSummarizedExperiment. After example(SummarizedExperiment), we have > > > se > class: RangedSummarizedExperiment > dim: 200 6 > metadata(0): > assays(1): counts > rownames: NULL > rowData names(1): feature_id > colnames(6): A B ... E F > colData names(1): Treatment > > sce = as(se, "SingleCellExperiment") > > sce > class: SingleCellExperiment > dim: 200 6 > metadata(0): > assays(1): counts > rownames: NULL > rowData names(1): feature_id > colnames(6): A B ... E F > colData names(1): Treatment > reducedDimNames(0): > spikeNames(0): > > validObject(sce) > Error in validObject(sce) : > invalid class "SingleCellExperiment" object: 1: 'nrow' of internal 'rowData' not equal to 'nrow(object)' > invalid class "SingleCellExperiment" object: 2: 'nrow' of internal 'colData' not equal to 'ncol(object)' >
Vince Carey (10:55:41): > if we downclass se to an ordinary SummarizedExperiment the coercion yields a valid sce
Aaron Lun (10:55:52): > Hm.
Aaron Lun (11:09:51): > That’s interesting. `setAs` is specified but doesn’t get called.
is specified but doesn’t get called.
Vince Carey (11:10:34): > Do I need to file an issue on github?
Aaron Lun (11:10:49): > You can, but I’m looking at it now.
Vince Carey (11:11:52): > OK. I think the “internal” rowData behave differently for RangedSE and SE as far as the validity check is concerned.
Aaron Lun (11:12:52): > The real problem is that there’s no setAs for RSEs, only for SEs.
Aaron Lun (11:27:06): > Dragging this convo to#singlecellexperimentagain.
2018-10-31
Jayaram Kancherla (17:13:33): > @Jayaram Kancherla has joined the channel
2018-11-01
Charlotte Soneson (14:31:09): > @Charlotte Soneson has joined the channel
Aedin Culhane (17:03:24): > Has anyone used kipoi,http://kipoi.org/?
Raphael Gottardo (17:07:32) (in thread): > Haven’t used it but it does look like a zoo:wink:.
Marcus Kinsella (18:29:44): > so@Aaron Lun, we do have a bit of a higher level, mostly aspirational doc as well:https://github.com/HumanCellAtlas/matrix-service/blob/4f8342d7a229f865fd91386b8ad6e67af76ba8c3/MOTIVATION.md - Attachment (GitHub): HumanCellAtlas/matrix-service > Contribute to HumanCellAtlas/matrix-service development by creating an account on GitHub.
Marcus Kinsella (18:30:13): > to at least give some context beyond just the api yaml
Aaron Lun (18:30:52): > Cool, yes, that little code chunk there would be the idealized usage.
Aaron Lun (18:33:47): > I notice that the matrix service offers loom and csv as well. I’ve assumed that these were dense representations; has there been any discussion about offering an explicit sparse matrix format?
Aaron Lun (18:34:08): > zarr aside, unless someone makes a convenient R port.
Aaron Lun (18:34:11): > … which we could.
Marcus Kinsella (18:34:18): > i think it outputs mtx as well
Marcus Kinsella (18:34:21): > but you mean like CSC?
Aaron Lun (18:34:26): > Yeah.
Marcus Kinsella (18:35:29): > yeah that’s come up. thus far, it appears that the difference between chunked, compressed, dense and CSC isn’t very large
Aaron Lun (18:35:47): > Oh, interesting.
Marcus Kinsella (18:36:14): > that said, adding more output formats is like the easiest thing we can do
Marcus Kinsella (18:36:26): > so if somebody wants csc, we can create that
Marcus Kinsella (18:37:34): > the harder questions are more like what information needs to be returned in a response immediately to initialize something like `filtered.data`
Marcus Kinsella (18:39:01): > assuming that `filtered.data` is too large for memory, and perhaps is too large to transfer and store in its entirety
Aaron Lun (18:42:24): > Hmm. I could…imagine… DelayedArray being able to create `filtered.data` immediately, once we get the 1D DelayedVectors class up and running to store the various metadata fields (and thus only having to retrieve the ones that get used). I don’t want to stress out @Hervé Pagès too much though!
Marcus Kinsella (18:46:42): > so it’s not like things need to be up and running by the end of the month. i think for a while you guys were waiting to be able to point at *something* to try to integrate with the DCP, and now there sort of is something. ideally, i’d like to have a vague roadmap, at least on the DCP side, that ends with that code snippet working, even if the time axis on said roadmap is unlabeled
Marcus Kinsella (18:48:14): > so, basically general ideas about the sorts of requests and responses that would needed would be useful
Aaron Lun (18:51:44): > Right. I’ll admit that I’m the worst guy to talk to about this, because my remote resource skills = 0. I’ll tag @Martin Morgan, @Vince Carey (and @BJ Stubbs?) onto this… I guess we (Marioni lab) did say that we were secondarily responsible for the R-side API, though I never get to meet “the other family” (i.e., the EBI DCP team that John manages) so I don’t really get any special insight. I’m happy to chip in with class design, doc’ing, testing, etc. to take the load off others, though.
Aaron Lun (18:54:57): > Same goes for the DA `%*%`, @Hervé Pagès - anything you need me to do to help, just let me know.
Hervé Pagès (18:56:41): > I will. Now that the BioC 3.8 release is behind us, I’ll be able to resume the DelayedArray fun:slightly_smiling_face:
Aaron Lun (18:59:05): > Great. If we can make good headway over the next month, I won’t have to cut into my Christmas coding time!
Aaron Lun (18:59:25): > Got a whole pile of unit tests to convert to using testthat… now that’s a proper holiday…
Aaron Lun (19:01:16): > And on that note, I really need to get some sleep, so g’night.
Kevin Rue-Albrecht (19:16:25): > @Kevin Rue-Albrecht has joined the channel
2018-11-02
Vince Carey (04:36:01): > @Aaron Lun has asked what is involved in making a RESTfulMatrix analogous to the classes in restfulSE that we have for working with HDF Scalable Data Service or BigQuery back ends. In the long run we want an expression X[G, S] to produce a query to the DCP that produces or refers to a matrix of results on features G from samples S. G and S should have the usual semantics for Bioconductor users, so G can be gene or transcript symbols, S can be cell identifiers or some meaningful way of identifying a group of cells. Therefore we need to know how to translate from user annotation of features and samples to the metadata elements at the DCP that select the data of interest. Once we have that down, we construct the query for the numerical data, acquire it, and translate from its native format to something usable in R, binding all relevant metadata to the R object. DelayedArray should be in the middle, but we might wire that in after we see ‘nonDelayed’ operations working. I have looked at https://github.com/HumanCellAtlas/data-store and wonder if that is already standing up at the HCA side, and I have looked at https://github.com/HumanCellAtlas/matrix-service/blob/4f8342d7a229f865fd91386b8ad6e67af76ba8c3/MOTIVATION.md and wonder if there are running examples for using the matrix service. Running examples, either as python code or RESTful queries, will be really helpful for determining what aspects of this long-run view should be worked on now. I hope @Marcus Kinsella will guide us to the best current examples. @Shweta Gopal is our API expert in Boston and should keep an eye on this dialogue. - Attachment (GitHub): HumanCellAtlas/data-store > Design specs and prototypes for the HCA Data Storage System (“blue box”) - HumanCellAtlas/data-store - Attachment (GitHub): HumanCellAtlas/matrix-service > Contribute to HumanCellAtlas/matrix-service development by creating an account on GitHub.
Aaron Lun (13:15:07): > Hooray - finally, a non-Aaron use of beachmat by bsseq! @Peter Hickey FTW.
Marcus Kinsella (14:21:55): > @Vince Carey: https://matrix.staging.data.humancellatlas.org/v0/swagger.json https://dss.data.humancellatlas.org/v1/swagger.json
Peter Hickey (16:16:02) (in thread): > Thanks for your help, Aaron. I’ve got a bunch more uses for it, just gotta find the time
2018-11-03
Vince Carey (07:34:03): > @Marcus Kinsella, is it too early to use the API to a) ask for the identifiers of available matrices/metadata and b) make a request for a matrix? The tabula muris data would be a good example as we have other representations of it that can be used for checking. My explorations of the slack channels andhttps://hca.readthedocs.io/en/latest/index.html, dcp-cli, dcp-diag are not turning up any concrete examples of requests.
Federico Marini (09:34:00): > @Federico Marini has joined the channel
2018-11-04
Marcus Kinsella (16:08:28): > @Vince Careylet me check in the status of tabula muris. It might not be in the data store yet. We actually have a whole session on creating vignettes and code examples at the DCP meeting this week.
Vince Carey (20:39:47): > :+1:
2018-11-05
Shian Su (00:53:21): > @Shian Su has joined the channel
Federico Marini (03:06:36): > Hi everyone. > Joined the channel to get a feeling how the efforts of “seamless” delivering the HCA datasets are going
Federico Marini (03:07:46): > Being one of the `iSEE` developers (#isee here), I wanted to come up with a SingleCellExperiment object to feed into that
Federico Marini (03:08:37): > Seeing the previous efforts of Charlotte (TabulaMurisData) and Aaron (TENxBrainData), I wanted to put together an interface to ExperimentHub data
Federico Marini (03:10:24): > Is the effort of interest for others?
Davide Risso (04:54:02): > @Stephanie Hicks@Kasper D. Hansenand I contributed the TENxPBMCData package with a collection of PBMC scRNA-seq datasets from 10X Genomics
Davide Risso (04:54:18): > They are all in the form of SCE with HDF5-backed data
Davide Risso (04:57:00): > See:https://bioconductor.org/packages/devel/data/experiment/vignettes/TENxPBMCData/inst/doc/TENxPBMCData.html
Aaron Lun (04:57:11): > That’s probably not enough without actual analysis.
Aaron Lun (04:57:18): > e.g., t-SNE, PCA, UMAP, whatever.
Aaron Lun (04:57:23): > Otherwise the app gets pretty boring.
Davide Risso (04:58:19): > Oh, I misunderstood what@Federico Mariniwas proposing… I thought he wanted to contribute a new dataset
Federico Marini (04:59:10): > SCE with HDF5-backed data would be the starting point
Davide Risso (04:59:20): > But I still don’t understand what is the actual proposal…
Federico Marini (04:59:37): > cherry on top would be then the dim red representation
Davide Risso (04:59:37): > Could the TENxPBMCData package be a good starting point?
Federico Marini (04:59:42): > most likely!
Federico Marini (05:00:19): > I wanted to check in here to understand whether the HCA peeps were close enough to the release of data “on-demand”
Federico Marini (05:00:58): > But say, packaging up the data, and running the Bioc single cell workflows on that, that’s an excellent start
Aaron Lun (06:46:42): > FYI BiocSingular is almost ready for submission, just waiting for `rSVD` to get through to CRAN.
Aaron Lun (06:47:22): > Add your comments now or hold your peace (well, until review anyway).
Aaron Lun (07:42:40): > I have to say, though, I’m not convinced about the safety of deferred centering.
Aaron Lun (08:02:31): > Well, deferring the centering is now an option rather than the default, because I wasn’t convinced that was the safest way to go.
Aaron Lun (08:34:33): > As a result, though, I’m also waiting for DA to support `crossprod` and `tcrossprod`.
Marcus Kinsella (09:46:24) (in thread): > kinda close! you want an endpoint to try out? i’m not 100% certain about what you mean by “on-demand” though
Federico Marini (09:49:08) (in thread): > Hi Marcus. My aim would be to get the matrix and construct the related SingleCellExperiment object
Federico Marini (09:49:27) (in thread): > As of now I am taking the h5 dataset and doing some munging on its format
Vince Carey (10:01:36) (in thread): > yes to an endpoint to try out
Peter Hickey (16:46:09) (in thread): > can you elaborate?
Aaron Lun (17:27:15) (in thread): > If you look at `BiocSingular/R/bs_matrix.R`, there’s a number of special matrix multiplication functions that avoid block processing when dealing with a centered and scaled matrix. That is to say, the multiplication (or crossprod) is applied to the original matrix, and the centering/scaling is “deferred”, i.e., the operations are factored out and used to modify (typically subtract from) the matrix product, rather than being applied to the original matrix prior to multiplication. In theory, this should give the same value, but on real world systems we have finite precision. The intermediate matrix product often contains fairly large values so there will be loss of precision, possibly resulting in catastrophic cancellation when the subtraction is performed.
Aaron Lun (17:29:55) (in thread): > It only takes a few genes with particularly large values to lose precision in the matrix product. And this is “large” in an absolute sense, not even a relative sense; so if I added an arbitrarily large constant value to all values for a gene, it would probably result in severe loss of precision. This would result in very unusual inconsistencies from an end-user’s perspective where the constant value should have little effect as it would just be removed upon centering.
Peter Hickey (20:43:09) (in thread): > right, that’s what i thought and agree end-users will find it confusing (and you’ll likely end up with many users asking why/complaining even if documented)
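A toy numeric illustration of that concern (not BiocSingular’s code): deferral uses the identity (X - c %*% t(rep(1, ncol(X)))) %*% y = X %*% y - c * sum(y), which is exact in real arithmetic but loses precision once one row carries large absolute values.
> set.seed(0)
> x <- matrix(rnorm(1000), 10, 100)
> x[1, ] <- x[1, ] + 1e12                 # one "gene" with a huge constant offset
> y <- rnorm(100)
> centers <- rowMeans(x)
> direct <- (x - centers) %*% y           # centre first, then multiply
> deferred <- x %*% y - centers * sum(y)  # multiply first, subtract afterwards
> range(direct - deferred)                # the discrepancy is concentrated in row 1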
2018-11-07
Martin Morgan (16:05:15) (in thread): > @Marcus Kinsella are there some examples of this in action, python or whatever, just a little demo of discovery & retrieval. Also @Vince Carey or others, are you exploring this actively? @Daniel Van Twisk has been working through the discovery angle in the DCP (c.f., https://github.com/Bioconductor/HCABrowser) but this is quite a slog to be made useful. Also tagging @Marcel Ramos Pérez - Attachment (GitHub): Bioconductor/HCABrowser > Browse the Human Cell Atlas data portal :warning: under development - Bioconductor/HCABrowser
Vince Carey (16:06:32) (in thread): > @Martin Morgan waiting for @Marcus Kinsella to comment on more concrete approaches to demonstrating the use of the APIs
Vince Carey (16:07:04) (in thread): > This seems to be a topic of active discussion at an ongoing meeting, so perhaps need to wait a few more days.
2018-11-08
Aaron Lun (10:14:00): > BEHOLD! Parallelized DelayedMatrix multiplication and crossproducts in https://github.com/Bioconductor/DelayedArray/pull/38. Automatically switches between parallelization schemes depending on the chunking layout of the underlying representation. Tagging @Hervé Pagès @Kasper D. Hansen.
Kasper D. Hansen (11:41:50): > nice
Aaron Lun (12:19:09): > I also tracked down why DelayedArray loading takes so long for my use case. Consider the following code using the HDF5 file from `TENxBrainData`:
> library(HDF5Array)
> mat <- HDF5Array("rawdata/counts.h5", "counts")
> system.time(X1 <- as.matrix(mat[1:100,])) # 2 seconds
> sf <- runif(ncol(mat)) # size factors
> mat <- t(t(mat)/sf)
> system.time(X2 <- as.matrix(mat[1:100,])) # 10 seconds, not great.
> system.time(X2b <- t(t(X1)/sf)) # reference is only 3 seconds, so something missing here.
> mat <- log2(mat + 1) # normalized log-expression
> system.time(X3 <- as.matrix(mat[1:100,])) # 13 seconds
> tossed <- rbinom(ncol(mat), 1, .95)==1
> mat <- mat[,tossed]
> system.time(X4 <- as.matrix(mat[1:100,])) # 106 seconds!
> This is fully attributable to `h5readDataset`, which seems particularly displeased with the supplied `index`. Any thoughts @Mike Smith?
Kasper D. Hansen (12:22:13): > A critical thing we need to do now is set up some (reproducible) timing examples which are not crazy long, but long enough to be insightful. I suggest the various PBMC data we put together recently.
Kasper D. Hansen (12:22:16): > How do we do that?
Kasper D. Hansen (12:22:48): > something in `scripts/timings` or ?
Kasper D. Hansen (12:22:53): > different package?
Aaron Lun (12:23:18): > Different package, I would think.
Kasper D. Hansen (12:23:47): > why?
Kasper D. Hansen (12:24:02): > I mean, it’s not suitable for /vignettes
Aaron Lun (12:24:55): > TENxBrainAnalysis isn’t even a package.
Aaron Lun (12:24:59): > It’s just a collection of stuff.
Aaron Lun (12:25:09): > There’s no reason that timings should be for the TENx Brain Data only.
Kasper D. Hansen (12:27:15): > I would not start with the brain data for timings. I was thinking some timing scripts in BiocSingular
Aaron Lun (12:30:36): > Then that’s definitely in a separate repo.
Kasper D. Hansen (12:32:28): > Ok, it sounds like a PR would not get accepted
Aaron Lun (12:34:33): > I’m happy to add a README pointing to your repo. But timing under different schemes and for different data sets would be complex enough to warrant a dedicated repo.
Aaron Lun (12:35:07): > For example, as it is now, there are already 3 * 2 * 2 different ways to run the SVD, not including parallelization options.
Aaron Lun (12:35:27): > Randomized, Exact, or IRLBA; deferred centering or not; with or without crossproduct.
Kasper D. Hansen (12:40:06): > Once we do timings we may quickly realize that some things don’t work well
Aaron Lun (12:58:23): > Yes.
Kasper D. Hansen (13:06:13): > So I mostly view this as a intermediate stuff to guide development. But ok, different repos, I get it
Vince Carey (16:00:11): > @Kasper D. Hansen, I recall that @Mike Jiang has been doing a fair amount of benchmarking, so perhaps he should be engaged on this?
2018-11-09
Mike Jiang (13:58:51) (in thread): > @Aaron Lun where can I find `rawdata/counts.h5`?
Aaron Lun (14:37:15) (in thread): > It’s the HDF5 file that’s saved by `TENxBrainData()`. I had to manually extract it because SQL didn’t work on my cluster.
Aaron Lun (14:37:32) (in thread): > If you’re on a machine that supports file locking, then you can just use the `counts` from `TENxBrainData()` instead.
2018-11-10
Aaron Lun (10:37:47) (in thread): > Main problem seems to be in `H5Sselect_index`.
Aaron Lun (10:42:24) (in thread): > … though `H5Dread` doesn’t seem much faster either.
Aaron Lun (11:30:55) (in thread): > In fact, it seems as if it would be faster to load an entire block and subset it in memory, compared to trying to combine hyperslab selections, which seems to kill performance in many ways.
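A hedged sketch of that workaround with rhdf5 (file and dataset names reused from the earlier timings; row count and selection sizes are toy assumptions), comparing a scattered index= against one contiguous read of whole columns followed by in-memory subsetting:
> library(rhdf5)
> fname <- "rawdata/counts.h5"
> keep <- sort(sample(20000, 2000))   # non-contiguous row selection (toy)
> cols <- 1:1000
> system.time(x1 <- h5read(fname, "counts", index = list(keep, cols)))
> system.time({
>     block <- h5read(fname, "counts", index = list(NULL, cols))  # contiguous read of whole columns
>     x2 <- block[keep, ]
> })
> identical(x1, x2)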
2018-11-11
Peter Hickey (16:08:48): > That’s been my conclusion, too. Anecdotally, non-contiguous reads from hdf5 feel way slower than they ought but I’ve not tried to figure out where or why
Aaron Lun (16:39:06): > Even contiguous reads suck under certain circumstances, e.g.,https://forum.hdfgroup.org/t/dimensionality-of-the-memory-dataspace-when-reading-from-a-dataset/4922 - Attachment (HDF Forum): Dimensionality of the memory DataSpace when reading from a DataSet > It seems that the dimensionality of the memory H5::DataSpace has a major effect on the performance of the HDF5 API. If I wanted to read a rectangular A x B block of values from a two-dimensional H5::DataSet, I could define the memory DataSpace as either: two-dimensional, with dimensions A x B. one-dimensional, with dimensions 1 x AB. As far as I can tell, these two choices yield identical results with respect to ordering of data values. However, use of the one-dimensional DataSpace requires t…
Aaron Lun (16:44:30): > In any case, if no one has any structural comments to make about BiocSingular, I think I will put it in the BioC submission queue, on the understanding that DA %*% and crossprod will be available some time this devel cycle.
Mike Jiang (17:33:18) (in thread): > hyperslab selection
has been the culprit ever since we started to use hdf5 7 years ago. And it is an inherent problem at the libhdf5 C library level (according to my gprof results). This essentially prevents me from using any random slicing (non-contiguous indexing) directly from h5
2018-11-12
Aaron Lun (12:06:33) (in thread): > Hm. We probably could rethink how h5read handles index=. @Mike Smith?
Mike Smith (12:09:01) (in thread): > Will take a look at this later…
Vince Carey (12:18:54) (in thread): > Pinging@John Readeyin case he has addressed such issues in h5py
Mike Smith (12:29:00) (in thread): > Do you have an example? Just wondering about the access pattern vs the chunk layout. It’s totally possible to inadvertently read an entire file multiple times
John Readey (13:47:39) (in thread): > @Vince Carey- I can attempt to translate to Python and do some timings. It would be interesting to compare h5py+HDF5Lib with h5pyd+HSDS.
Mike Jiang (14:01:32) (in thread): > @John Readey if you can recall, HSDS was much better than libhdf5 with regards to random slicing, based on my preliminary benchmarking. But @Aaron Lun's application relies on HDF5Array, which is based on libhdf5 (at least for now) if I understand it correctly
Aaron Lun (14:01:54) (in thread): > pretty much.
Vince Carey (14:03:12): > @Aaron Lun, quick question – the Rle sparse representation is talked up in the TENxBrainData vignette. Can this representation be used conveniently with DelayedArray, and do your matrix operations work naturally with that format?
Aaron Lun (14:04:18): > If it fits in a DA (and has supported methods for block processing, e.g., colGrid), it can be used directly in all applications in place of the standard HDF5Array.
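A minimal sketch of that, assuming DelayedArray's block-processing helpers of the time (RleArray(), blockApply() and a column-wise grid; the grid constructor has been called colGrid() and, in later releases, colAutoGrid()):
> library(DelayedArray)
> rle <- Rle(rpois(1e6, lambda = 0.1))       # mostly zeros, stored run-length encoded
> x <- RleArray(rle, dim = c(1000, 1000))    # an RleMatrix, which is a DelayedMatrix
> grid <- colGrid(x, ncol = 100)             # blocks of 100 columns
> block_sums <- blockApply(x, colSums, grid = grid)
> cs <- unlist(block_sums)                   # same result as colSums(x)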
Aaron Lun (14:06:23): > I understand @Hervé Pagès has some to-be-implemented sparse matrix trickery that would make everything go a lot faster as well.
John Readey (14:08:24) (in thread): > Has anyone worked on a Python/h5py version of HDF5Array?
Vince Carey (14:13:38) (in thread): > I would hazard that no one in this group has done so. The HDF5Array is distinguished by lazy computation, deferring loading of numeric data until the last possible moment and then unwinding a stack of deferred operations to produce a desired result in RAM. This is in contrast to the standard model of matrix computation in R which is entirely in memory.
Vince Carey (14:14:31) (in thread): > HDF5Array specializes this concept to the HDF5 representation of arrays, but the more general concept is implemented in DelayedArray, which can be deployed for various back ends.
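To make that concrete, a minimal sketch using a small in-memory seed so it runs anywhere (an HDF5Array seed behaves the same way):
> library(DelayedArray)
> seed <- matrix(rpois(2e4, lambda = 5), nrow = 100)
> x <- DelayedArray(seed)
> sf <- runif(ncol(x))
> y <- log2(t(t(x) / sf) + 1)   # nothing is computed yet; the operations are recorded
> showtree(y)                   # inspect the stack of delayed operations
> z <- as.matrix(y[1:5, ])      # only now are the needed values realized in RAM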
Aaron Lun (14:14:39) (in thread): > @John ReadeyFor timing purposes, would it be sufficient to give you the final call to the HDF5 C library?
Aaron Lun (14:15:31) (in thread): > We’re basically just doing hyperslab union selections and then a read with the resultingDataSpace
.
John Readey (14:15:44) (in thread): > Sure. Does it translate to just one call though?
Aaron Lun (14:15:57) (in thread): > Unfortunately not.
Aaron Lun (14:16:22) (in thread): > Let me think of a MWE for this that should only userhdf5, where the corresponding operations inh5pyshould hopefully be fairly similar.
Mike Jiang (14:18:23) (in thread): > I used C API directly for hyperslab union selections and concluded it was libhdf5 issue instead of rhdf5
John Readey (14:18:41) (in thread): > @Mike Jiangcreated this issue a while back:https://github.com/HDFGroup/h5pyd/issues/47. Is that related? - Attachment (GitHub): Fancy indexing is not supported · Issue #47 · HDFGroup/h5pyd > @jreadey , I think you were aware of it since you mentioned Coordinate list (dset[(x,y,z),:]) not being supported yet (not sure if you actually meant list). But I considered it as the one of most u…
Peter Hickey (14:20:10) (in thread): > Sorry, Mike, poor form by me complaining without an example. Next time I’m hit by it I’ll construct one to share
Mike Jiang (14:20:14) (in thread): > yes, the same type of indexing
Aaron Lun (14:22:52) (in thread): > From the rhdf5 side, I couldn't see anything wrong with how it was calling the C library with the H5S_SELECT_OR selections. So I also concluded it was an internal issue with libhdf5 (assuming nothing funny with our packaging of Rhdf5lib).
Aaron Lun (14:24:19) (in thread): > It’s hard to find complaints about this online though. One would think that taking union of hyperslabs would be a fairly common procedure…
Aaron Lun (14:48:28) (in thread): > Probably the best solution is for rhdf5 to forgo taking a union of hyperslabs, and instead extract each hyperslab as it goes along. This would require some care to take advantage of the chunk cache.
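A rough sketch of that idea from the user side (file name and sizes are placeholders): one h5read() call per requested column, with no hyperslab union at all.
> library(rhdf5)
> cols <- sample(10000, 500)                  # scattered columns to extract
> pieces <- lapply(cols, function(j)
>     h5read("counts.h5", "counts", index = list(NULL, j)))
> X <- do.call(cbind, pieces)                 # columns in the same order as 'cols'
> ## naive version: without grouping the requested columns by chunk first, the
> ## same chunk may be decompressed many times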
John Readey (15:07:29) (in thread): > I highly suspect that the HDF5 lib is iterating through the data spaces. With HSDS we should be able to do this in parallel to make it faster.
2018-11-13
Mike Smith (07:04:23) (in thread): > Looks like there’s a strong relationship between the time taken to select a set of hyperslabs and how far apart from each other they are. This is timings for selecting 10,000 single column slabs, ranging from 10,000 consecutive columns to a random selection of 10,000 from 100,000 - File (PNG): Rplot.png
Hervé Pagès (12:13:31): > On the inefficiency ofrhdf5::h5read()
, I was curious to compare read times with the hdf5r package from CRAN. I did very little testing so far but the latter seems to be quite faster. 25x faster in the following example: > > library(rhdf5) > library(h5vcData) > tally_file <- system.file("extdata", "example.tally.hfs5", package="h5vcData") > system.time(a1 <- h5read(tally_file, "/ExampleStudy/16/Coverages", index=list(1L, NULL, NULL))) > # user system elapsed > # 21.431 2.576 24.008 > > library(hdf5r) > cov_ds <- H5File$new(tally_file)[["ExampleStudy"]][["16"]][["Coverages"]] > system.time(a2 <- cov_ds[1L, , , drop=FALSE]) > # user system elapsed > # 0.785 0.148 0.933 > > identical(a1, a2) > # [1] TRUE >
> So maybe there’s some inefficiency in the hdf5 hyperslab selection mechanism but something is definitely going on in rhdf5 itself. This one should be a lower hanging fruit.
Hervé Pagès (12:40:02): > Note that hdf5r compiles against the local HDF5 system lib (1.8.16 on my Ubuntu 16.04.5 LTS laptop), while rhdf5 uses the HDF5 lib version bundled in Rhdf5lib (1.10.2 at the moment). So we can’t 100% exclude a possible performance regression between HDF5 1.8.16 and 1.10.2 even though it sounds very unlikely to explain the 25x slowdown between hdf5r and rhdf5.
Aaron Lun (12:43:10): > Well, there’s at least one regression in 1.10 that killedbeachmat’s chunk caching performance:https://forum.hdfgroup.org/t/unintended-behaviour-for-hash-values-during-chunk-caching/4869 - Attachment (HDF Forum): Unintended behaviour for hash values during chunk caching? > It seems that there have been some changes in the calculation of the hash value for chunk caching in 1.10.3, and I’m not sure whether these were intentional or not. To motivate this discussion, let’s say that I have a 100-by-200000 dataset (of doubles, but this isn’t important right now). The chunk dimensions are 22-by-44721, so in terms of chunks, the data set is 5-by-5 after rounding up. Now, in H5Dchunk.c, there is a calculation of “scaled dimensions”, which I presume to be the data set dime…
Aaron Lun (12:44:06): > Well, an undocumented and buggy change, if not a regression.
Hervé Pagès (12:48:22): > Interesting. I’ll compile hdf5r against HDF5 1.10.2 so we can compare [hdf5r + HDF5 1.8.16] vs [hdf5r + HDF5 1.10.2] vs rhdf5.
Hervé Pagès (13:52:18): > Unfortunately hdf5r doesn’t seem to support compilation against a user-specified HDF5 lib:disappointed:This makes it hard to remove possible performance regressions in the HDF5 lib from the equation to explain the 25x speed difference between hdf5r and rhdf5. FWIW I observe the same speed difference when using BioC 3.7 where rhdf5 uses HDF5 lib 1.8.19. Still not the same version as what hdf5r uses (1.8.16) but close. Still unlikely though that the 25x speed difference between hdf5r and rhdf5 can be explained by a possible performance regression in the HDF5 lib alone.
Aaron Lun (15:02:15): > Hm.
John Readey (15:04:19): > Hey @Hervé Pagès - would you mind posting to https://forum.hdfgroup.org/c/hdf5 about the performance issue? - Attachment (HDF Forum): HDF5 > All HDF5 (and HDF4) questions, potential bug reports, and other issues.
Aaron Lun (15:08:07): > Diagnosing @Hervé Pagès's use case, there's around 8 seconds taken up in the R code in H5Sselect_index. The read itself in H5Dread is fast - and then about 14 seconds in who-knows-where.
Aaron Lun (15:10:09): > Ah, the remaining time is a match call for non-NULL indices in h5readDataset.
Mike Smith (16:38:26): > H5Sselect_index seems to scale really badly as the sparsity of the index increases. Asking for a bunch of columns that are close to one another is much faster than the same number of columns that are spread out. This seems to be related to the number of calls to H5Sselect_hyperslab.
Mike Smith (16:39:24): > Some profiling I did earlier today suggests huge amounts of time being spent in calls to malloc and free by the libhdf5 code.
Mike Smith (16:44:15): > rhdf5 is definitely also considerably slower when reading huge numbers of small datasets, as there are sanity checks that get executed tonnes of times in that scenario (e.g. https://support.bioconductor.org/p/105729/), but that shouldn't be responsible for slow performance in the case above.
Hervé Pagès (16:46:29): > @John ReadeyThanks but unless I find some HDF5-specific issue, I don’t want to create noise on the HDF Forum about a performance issue that seems to be on our side.
John Readey (16:48:14): > No problem - but don’t hesitate to utilize the forum if you do suspect some HDF5-related problem.
Mike Jiang (16:48:23) (in thread): > The H5Sselect_hyperslab issue is definitely an HDF5 lib issue
Mike Jiang (16:49:44) (in thread): > It’s a long-standing issue of libhdf5 and not bioc’s
Mike Jiang (17:00:03) (in thread): > unions of H5Sselect_hyperslab are supposed to be a lightweight pre-filter or view creation on the different regions of the matrix before the actual disk IO occurs. But in reality, these non-disk operations often cause more overhead than the actual IO. This behavior is never mentioned or warned about anywhere in the hdf5 docs
Mike Smith (17:02:08) (in thread): > This is exactly what I'm observing; the more spread out the index used in h5readDataset, the more calls there are to H5Sselect_hyperslab, and performance dies exponentially.
Mike Jiang (17:05:23) (in thread): > I also echo @Aaron Lun's frustration (and puzzlement) at the lack of such issues or question posts on the websites in the past
Mike Smith (17:15:51): > To demonstrate, there are two functions at https://gist.github.com/grimbough/bc26824e35a07325c4244bff20b978fb. They each select n columns from the first m in the matrix. The first does this in a single call to H5Screate_simple and scales badly as m increases; the second works on a per-chunk basis and seems to be much more consistent in performance.
Hervé Pagès (17:29:03) (in thread): > These are fair points. Just to clarify: I’m not saying the HDF5-related issues don’t deserve attention. Thanks for identifying and reporting them. All I’m saying is that I don’t think the HDF Forum is the right place for me to report the 25x speed difference I observed between hdf5r and rhdf5. I will definitely use the forum if I find things that I suspect are HDF5-related. However for now I’m going to focus on trying to have HDF5Array load the data 25x faster.
Aaron Lun (17:31:15) (in thread): > Between the two Mikes and I, we could probably slap together a stand-alone C(++) source file that demonstrates the issue, for posting on the HDF5 forum. The HDF5 staff can be pretty quick sometimes, though it’ll probably just end up being a bug report for the next release (like everything else that I’ve reported).
Marcus Kinsella (19:27:22) (in thread): > alright, well here’s something:https://github.com/HumanCellAtlas/data-consumer-vignettes/blob/mckinsel-expression-matrix/expression_matrix/Retrieve%20Expression%20Matrix.ipynb - Attachment (GitHub): HumanCellAtlas/data-consumer-vignettes > Simple walk-throughs of interacting with the DCP as a downstream data consumer. - HumanCellAtlas/data-consumer-vignettes
2018-11-14
Mike Smith (04:00:35) (in thread): > Out of interest, can you run the same code immediately afterwards? I currently getHDF5. File accessibilty. Unable to open file.
on a second run - something more to look into.
Mike Smith (08:52:59) (in thread): > Now I see I need to runcov_ds$close()
after
Mike Smith (09:07:37) (in thread): > I made some tweaks to rhdf5 (version 2.27.1) that cut out some unnecessary matching and reordering of data when an index is NULL
. This puts Hervé’s example in a similar ball park tohdf5r > > system.time(a1 <- h5read(tally_file, "/ExampleStudy/16/Coverages", index=list(1L, NULL, NULL))) > ## user system elapsed > ## 1.197 0.196 1.455 > cov_ds <- H5File$new(tally_file)[["ExampleStudy"]][["16"]][["Coverages"]] > system.time(a2 <- cov_ds[1L, , , drop=FALSE]) > ## user system elapsed > ## 0.909 0.176 1.085 > cov_ds$close() >
> However it's still terrible if you specify a large number of entries in a particular dimension. > > system.time(a1 <- h5read(tally_file, "/ExampleStudy/16/Coverages", index=list(1L, NULL, 1:9e7))) > ## user system elapsed > ## 89.838 6.983 96.804 >
> All of the time difference here is spent tidying data in R rather than reading from the HDF5 file, so I’ll keep working on the logic to try and minimise this.
Hervé Pagès (12:39:28) (in thread): > Nice findings Mike. Yes it looks like H5Sselect_index() is spending a lot of time rearranging the indices in index. Thanks for the improvements.
Aaron Lun (13:04:08): > @Mike Jiangdid you end up posting on the HDF5 forum about the hyperslab unions?
Mike Jiang (13:19:53) (in thread): > No, I didn’t.
Mike Jiang (13:37:57) (in thread): > We can try to create a minimal reproducible example to post one
Aaron Lun (13:39:34) (in thread): > That would be a good idea. You mentioned you saw this behaviour working with the C API - do you have some kind of example case already?
Aaron Lun (14:05:29): > Alright, here we go, nice and short and standalone: > > #include "H5Cpp.h" > #include <vector> > #include <iostream> > #include <algorithm> > #include <ctime> > > int main (int argc, const char** argv) { > if (argc!=4) { > std::cout << argv[0] << " [FILE] [DATASET] [CONSEC]" << std::endl; > return 1; > } > const char* fname=argv[1]; > const char* dname=argv[2]; > const bool consec=(argv[3][0]=='1'); > > H5::H5File hfile(fname, H5F_ACC_RDONLY); > H5::DataSet hdata=hfile.openDataSet(dname); > H5::DataSpace hspace=hdata.getSpace(); > > hsize_t dims_out[2]; > hspace.getSimpleExtentDims(dims_out, NULL); > const size_t total_nrows=dims_out[0]; > const size_t total_ncols=dims_out[1]; > hspace.selectNone(); > > // Defining submatrices. > const size_t NR=10000, NC=100; > hsize_t h5_start[2], h5_count[2]; > h5_start[0]=0; > h5_start[1]=0; > h5_count[0]=1; > h5_count[1]=NC; > > { > clock_t start=std::clock(); > size_t counter=0; > for (size_t i=0; i<NR; ++i) { > counter += (consec ? 1 : 2); // non-consecutive. > h5_start[0]=counter; > hspace.selectHyperslab(H5S_SELECT_OR, h5_count, h5_start); > } > clock_t end=std::clock(); > std::cout << "Elapsed for union: " << double(end - start)/CLOCKS_PER_SEC << std::endl; > } > > std::vector<double> storage(NR * NC); > h5_count[0]=NR; > H5::DataSpace outspace(2, h5_count); > outspace.selectAll(); > > { > clock_t start=std::clock(); > hdata.read(storage.data(), H5::PredType::NATIVE_DOUBLE, outspace, hspace); > clock_t end=std::clock(); > std::cout << "Elapsed for read: " << double(end - start)/CLOCKS_PER_SEC << std::endl; > } > > double total = std::accumulate(storage.begin(), storage.end(), 0.0); > std::cout << "Total is: " << total << std::endl; > return 0; > } >
Aaron Lun (14:05:54): > Running on the 10X data set gives me: > > $ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 0 > Elapsed for union: 4.92112 > Elapsed for read: 0.229063 > Total is: 190563 > $ ./HDF5UnionTester ~/.ExperimentHub/1040 counts 1 > Elapsed for union: 0.011119 > Elapsed for read: 0.017441 > Total is: 198451 >
Aaron Lun (14:12:36): > The union clearly sucks. The longer read is somewhat understandable as more chunks need to be retrieved, though it’s 20 times longer rather than the 2-fold increase that we might expect.
Aaron Lun (14:14:47): > FYI I compile againstRhdf5lib; you can figure out the paths by running: > > cat(sprintf("g++ -std=c++11 -I%s union_test.cpp -o HDF5UnionTester %s -ldl\n", > system.file(package="Rhdf5lib", "include"), > capture.output(Rhdf5lib::pkgconfig()))) >
Martin Morgan (15:17:48): > And FWIW ./ExperimentHub/1040 can be retrieved (wget, etc) from wget https://experimenthub.bioconductor.org/fetch/1040 / ExperimentHub::ExperimentHub()[["EH1040"]]
Mike Smith (15:29:07): > Nice example Aaron - look pretty similar to what I see with anrhdf5implementation > > hslabUnionTest <- function(consec = TRUE) { > > fid <- H5Fopen('/media/Storage/Work/ExperimentHub/1040') > did <- H5Dopen(h5loc = fid, 'counts') > sid <- H5Dget_space(did) > > ## 100 rows, selection of columns > index <- list(1:100, > as.integer(seq(from = 1, to = 10000*(2^!consec), length.out = 10000))) > > ## create hyperslab union > t1 <- microbenchmark::get_nanotime() > size <- H5Sselect_index(h5space = sid, index = index) > message("Time for union: ", (microbenchmark::get_nanotime() - t1)/1e9) > > ## read dataset > h5spaceMem = H5Screate_simple(size, native = did@native) > t1 <- microbenchmark::get_nanotime() > obj <- H5Dread(h5dataset = did, h5spaceFile = sid, > h5spaceMem = h5spaceMem, > compoundAsDataFrame = TRUE, drop = FALSE) > message("Time for read: ", (microbenchmark::get_nanotime() - t1)/1e9) > > ## tidy up > H5Sclose(h5spaceMem) > H5Sclose(sid) > H5Dclose(did) > H5Fclose(fid) > } >
> > > > hslabUnionTest(consec = TRUE) > Time for union: 0.002536182 > Time for read: 0.015438965 > > hslabUnionTest(consec = FALSE) > Time for union: 14.765978321 > Time for read: 0.537314938 >
Mike Smith (15:41:54): > I wonder if we're just approaching this wrong. Set operations on the hyperslabs seem to scale like growing a vector in R, getting exponentially worse as the size grows. Is there actually any advantage to creating hyperslabs that are bigger than the chunk size? I guess it makes the code simpler if we make a huge union of slabs, but maybe if you 'apply' a hyperslab as soon as the next union would make it span a chunk, they will stay a sensible size. I find it odd there aren't more reports of this, which suggests our 'I want an arbitrary set of data points with no regular pattern between them' is an atypical access pattern for HDF5 users.
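A sketch of that chunk-aware idea, assuming (as a placeholder) a dataset chunked in blocks of 100 columns: split the requested columns by chunk and issue one read per group, so no selection ever grows beyond a chunk.
> library(rhdf5)
> cols <- sort(sample(10000, 1000))
> chunk_ncol <- 100                                    # assumed chunk width
> groups <- split(cols, (cols - 1) %/% chunk_ncol)     # columns falling in the same chunk
> pieces <- lapply(groups, function(j)
>     h5read("counts.h5", "counts", index = list(NULL, j)))
> X <- do.call(cbind, pieces)
> X <- X[, match(cols, unlist(groups)), drop = FALSE]  # restore the requested order
> ## each chunk is decompressed once and every hyperslab union stays tiny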
Aaron Lun (15:45:30): > There certainly aren't any warnings against taking an arbitrary hyperslab union… so it sounds like a totally reasonable thing to expect to be supported. But yes, the immediate solution would simply be to do multiple H5Dreads when index= is set in h5read, which I was alluding to in one of the previous threads.
Nicholas Knoblauch (15:59:01): > @Nicholas Knoblauch has joined the channel
Aaron Lun (16:00:57): > I mean, their examples involve some pretty weird unions, e.g. Figure 11 inhttp://davis.lbl.gov/Manuals/HDF5-1.8.7/UG/12_Dataspaces.html. Who would ever want to take that kind of union? I think our use case is much more sensible.
Mike Jiang (16:22:40) (in thread): > The point of the union operation on hyperslabs is to allow the user to achieve multi-spot data collection from a single H5Dread call. If this operation turns out to be much less efficient than dispatching multiple reads on the user side, then it defeats its very purpose. So I still consider it a bug or a flaw of libhdf5
Aaron Lun (17:30:12): > Agreed. I’ll post the above example on the HDF5 forum tomorrow (unless you have anything else to add to it).
Aaron Lun (17:31:44): > But in any case, any fix will only occur in the next release at the earliest, and we need a workaroundnow.
2018-11-15
Aaron Lun (04:37:23): > Oh sweet, HDF made their bug tracker public.
Aaron Lun (04:53:53): > Posted:https://forum.hdfgroup.org/t/union-of-non-consecutive-hyperslabs-is-very-slow/5062
Mike Smith (05:00:59): > Cool thanks for posting it. Hopefully they understand this is a problem in the more general case, rather than pointing out that here we could just select a single hyperslab with a stride of 2. For curiosity's sake I tested that myself; the hyperslab selection step is similar to the consecutive selection (but still slower), and reading takes exactly the same time as with the union approach.
Aaron Lun (05:03:00): > Yep, I added some words to clarify that.
Aaron Lun (09:14:49): > Got some replies already.
2018-11-16
Daniel Van Twisk (12:50:08) (in thread): > There’s still work to do but heres the search and get-fqids parts inHCABrowser
’slazyeval
branchhttps://github.com/Bioconductor/HCABrowser/tree/lazyeval: > > library(HCABrowser) > hca <- HumanCellAtlas(per_page=100) > res <- hca %>% filter(library_construction_approach == "EFO:0008931" & > paired_end == True & > ncbi_taxon_id == 9606 & > process_type == "analysis") > %>% results() > %>% as_tibble() > > res > > # A tibble: 831 x 9 > name size uuid bundle_fqid bundle_url process_type.te… biomaterial_cor… > <fct> <int> <fct> <fct> <fct> <fct> <int> > 1 2a3f… 2.95e4 f2c3… 1b0694b8-b… https://d… analysis 9606 > 2 2a3f… 4.34e3 4aa3… 1b0694b8-b… https://d… analysis 9606 > 3 2a3f… 4.57e2 0c98… 1b0694b8-b… https://d… analysis 9606 > 4 2a3f… 8.65e3 c355… 1b0694b8-b… https://d… analysis 9606 > 5 2a3f… 2.73e4 d1a7… 1b0694b8-b… https://d… analysis 9606 > 6 2a3f… 1.90e3 c3c7… 1b0694b8-b… https://d… analysis 9606 > 7 2a3f… 2.71e3 c765… 1b0694b8-b… https://d… analysis 9606 > 8 2a3f… 5.51e5 578f… 1b0694b8-b… https://d… analysis 9606 > 9 2a3f… 1.40e6 bd4d… 1b0694b8-b… https://d… analysis 9606 > 10 2a3f… 5.00e3 3086… 1b0694b8-b… https://d… analysis 9606 > > fqids <- res %>% pull('bundle_fqid') %>% unique() >
- Attachment (GitHub): Bioconductor/HCABrowser > Browse the Human Cell Atlas data portal :warning: under development - Bioconductor/HCABrowser
Marcus Kinsella (13:21:24) (in thread): > hey thanks@Daniel Van Twisk, looks super promising
Hervé Pagès (19:35:57): > “The issue has a very high priority on our to-do list”:+1:
2018-11-18
Aedin Culhane (03:15:06): > @Aaron Lunhave you played with the rTensor package?
Aedin Culhane (03:19:36): > It's built on S4 and has a class Tensor with functions modeMean and modeSum to get a sum/mean over any dimension of the array
Aedin Culhane (03:20:52): > Thus given a dataset, you can easily split it N-way and compute truncated SVD or multilinear PCA over the N subsets… Might be worth comparing.
Aedin Culhane (03:21:51): > PS… I lost track of which channel I should post to
Aedin Culhane (03:23:22): > oops
Aedin Culhane (03:23:49): > (tried to find the emoji and gave up… I need to brush up on my emoji speak)
Aaron Lun (06:46:00): > Nope, haven’t looked at it.
Aaron Lun (06:47:29): > I’m not sure I’m understanding the magic here.
Aaron Lun (06:52:08): > The vignette focuses a lot on the 3D array case; I guess it should collapse easily to work with 2D matrices, but I don’t know if it provides a speed boost over what we currently have for PCA.
Aaron Lun (06:52:22): > I’m assuming you’re suggesting it for use inBiocSingular.
Aedin Culhane (08:10:02): > Yup… there are a few recent methods that guide the selection of blocks to direct rSVD, rather than using random selection SVD. (I'll find refs for you…). These are more stable. The rTensor package has some nice functions to direct folding/unfolding of tensors -> arrays -> matrices.
Aedin Culhane (08:12:21): > For example could take something like the L1000 set of genes as a seed.
Aedin Culhane (08:13:12): > One paper is Group-sparse SVD Models and Their Applications in Biological Data Wenwen Min, Juan Liu and Shihua Zhang
Aaron Lun (10:29:29): > To be clear, I'm not interested - at least not at the moment - in providing a framework for anything other than vanilla SVD. The only purpose of BiocSingular is to provide different SVD algorithms that should converge to the same result (with differing speed/precision/accuracy, depending on which algorithm you use). This is necessary in order to allow different algorithms to be plug-and-play replacements within higher-level functions in dependent packages that "just want to do an SVD".
Aaron Lun (10:29:48): > I say this because the Min paper above seems to be doing something else, some kind of regularized SVD (for interpretation of the loadings, perhaps?). This is not the current focus of the BiocSingular package, though I am open to discussion on whether the package's scope can be extended to encompass such methods.
Thomas Girke (14:31:54): > @Thomas Girke has joined the channel
Marcus Kinsella (22:05:28) (in thread): > hey so what's the status of HCABrowser? like how close is it to being ready to share with other hca engineers or include in some of the hca dcp integration testing?
2018-11-20
Aaron Lun (11:37:27): > Y'know, I'm not sure NIPALS works all that well either. At least I'm not getting it to behave like prcomp.
Aaron Lun (11:42:52): > See, for example, this: > > library(nipals) > B <- matrix(c(50, 67, 90, 98, 120, > 55, 71, 93, 102, 129, > 65, 76, 95, 105, 134, > 50, 80, 102, 130, 138, > 60, 82, 97, 135, 151, > 65, 89, 106, 137, 153, > 75, 95, 117, 133, 155), ncol=5, byrow=TRUE) > rownames(B) <- c("G1","G2","G3","G4","G5","G6","G7") > colnames(B) <- c("E1","E2","E3","E4","E5") > p1 <- nipals(B) > p1$scores > ## PC1 PC2 PC3 PC4 PC5 > ## G1 -0.55949042 0.05162405 0.220551255 -0.2906543 -0.18715390 > ## G2 -0.36525385 0.15224701 0.009501308 0.7843769 0.26055809 > ## G3 -0.16119699 0.51225010 -0.307583245 -0.4533066 0.04126946 > ## G4 -0.03091054 -0.60078381 0.494445656 -0.1505239 -0.04562830 > ## G5 0.14080832 -0.38455116 -0.664000933 0.1374190 -0.47099841 > ## G6 0.36460628 -0.15409435 -0.141749252 -0.1762467 0.74442623 > ## G7 0.61143720 0.42330816 0.388835210 0.1489357 -0.34247317 > > prcomp(B, scale.=TRUE)$x > ## PC1 PC2 PC3 PC4 PC5 > ## [1,] -2.8089242 -0.09712098 0.24440012 0.05005907 -0.012986464 > ## [2,] -1.8337410 -0.28614482 0.01049935 -0.13510113 0.018085364 > ## [3,] -0.8092162 -0.96259290 -0.34095408 0.07808442 0.002856244 > ## [4,] -0.1552763 1.12890507 0.54804808 0.02592634 -0.003167254 > ## [5,] 0.7068734 0.72291917 -0.73575685 -0.02368734 -0.032669769 > ## [6,] 1.8304896 0.28965942 -0.15705359 0.03038533 0.051636075 > ## [7,] 3.0697948 -0.79562496 0.43081698 -0.02566669 -0.023754196 >
Aaron Lun (11:45:37): > Ah… and I guess that's because nipals's "scores" are not actually the PCs, because they haven't been imbued with the singular values.
Aaron Lun (11:48:39): > @Vince CareyWere you the one who put me in touch with Kylie? I notice that she’s using NIPALS for PCA in her Cardinal package, perhaps she could provide some insights as to why she did so.
Vince Carey (11:50:41): > yes
Aaron Lun (11:53:25) (in thread): > I guess she’s not on this slack. Let me see if I’ve got her email somewhere
Aaron Lun (11:54:35) (in thread): > Apparently not.
Aaron Lun (11:55:04) (in thread): > Ah, it’s on her website.
Aaron Lun (11:59:27) (in thread): > done.
Aaron Lun (14:55:32): > @Hervé Pagèsany feedback on my DA PR? Good? Bad? CRAAZY?
Aaron Lun (15:03:55) (in thread): > And now it works like a charm.
Aaron Lun (15:24:34): > @Aedin Culhane@Kasper D. HansenIn any case, I added a preliminary NIPALS option, which seems to have a different speed/accuracy trade-off to randomized SVD and IRLBA.
Aaron Lun (18:28:24): > Though the nipals package probably needs some work to accommodate arbitrary matrix representations.
Kevin Rue-Albrecht (22:54:36): > @Kevin Rue-Albrecht has left the channel
2018-11-21
Aaron Lun (13:20:09): > Having thought about it, I shifted runNipalsSVD to its own branch, as I'm not sure whether it has any niche to fill here. Also, I haven't tested nipals with non-standard matrix types, but it seems like it would need some work.
Hervé Pagès (14:27:51) (in thread): > Will do ASAP. Finishing my new HDF5Array::h5mread() today. For random selection it uses its own "load one chunk at a time" approach which should achieve better performance than the hyperslab union. I'll provide some benchmarks soon.
Aaron Lun (14:34:49) (in thread): > :+1:
Kasper D. Hansen (21:40:37): > We used nipals some years ago in minfi and I dropped it after they did an algorithm update which changed all outputs (and broke my tests) and I found out that our custom random SVD was as fast anyway.
2018-11-23
Aaron Lun (11:17:29): > @Leonardo Collado Torres It could be worth considering providing the recount TCGA data sets as HDF5Matrices. Sometimes I just want to get expression values for a single gene, but as it is now, I have no choice but to load the entire Rdata into memory.
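A hedged sketch of what that could look like with HDF5Array's save/load helpers; the tiny SummarizedExperiment below is only a stand-in for recount's rse_gene objects:
> library(SummarizedExperiment)
> library(HDF5Array)
> counts <- matrix(rpois(2000, 10), nrow = 200,
>                  dimnames = list(paste0("gene", 1:200), paste0("sample", 1:10)))
> rse <- SummarizedExperiment(assays = list(counts = counts))
> dir <- file.path(tempdir(), "rse_hdf5")
> saveHDF5SummarizedExperiment(rse, dir = dir, replace = TRUE)
> ## later, in another session: assays come back as HDF5-backed DelayedMatrix
> ## objects, so pulling one gene only touches the chunks that contain it
> rse2 <- loadHDF5SummarizedExperiment(dir)
> one_gene <- as.matrix(assay(rse2)["gene42", , drop = FALSE])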
Aaron Lun (11:23:32): > While I'm complaining, it would also be nice if download_study just used ExperimentHub so my scripts wouldn't constantly try to re-download the file upon re-execution.
2018-11-24
Martin Morgan (12:28:35): > (orBiocFileCache
if for some reason the data does not want to be served via ExperimentHub)
2018-11-25
Aedin Culhane (00:56:35) (in thread): > Aaron, I have an R script somewhere that calculates scores from multiple different PCA methods in R and shows they all compute the same thing. Do you want me to find it?
Aaron Lun (09:00:22) (in thread): > It’s okay, I got it to work. I just don’t know whether it’s worth putting in NIPALS as a separate SVD back-end. It doesn’t seem any faster than our current options.
2018-11-26
Vince Carey (06:27:16): > :+1:
Vince Carey (06:33:30) (in thread): > @Aaron Lunfor focused queries to TCGA (or possibly better, pancan_atlas) consider BiocOncoTK buildPancanSE
Vince Carey (06:36:15) (in thread): > > library(BiocOncoTK) > library(SummarizedExperiment) > bq = pancan_BQ() > ss = buildPancanSE(bq) > ss # defaults -> 450k > class: RangedSummarizedExperiment > dim: 396065 409 > metadata(3): acronym assay sampType > assays(1): assay > rownames(396065): cg00000029 cg00000165 ... rs966367 rs9839873 > rowData names(3): gene_id gene_name gene_biotype > colnames(409): TCGA-FD-A5BV TCGA-FD-A3B3 ... TCGA-UY-A9PH TCGA-UY-A78L > colData names(20): bcr_patient_uuid bcr_patient_barcode ... > radiation_therapy race > > which(rowData(ss)$gene_name == "ORMDL3") > [1] 25148 36071 68078 77670 104360 141050 141931 144159 162609 169085 > [11] 250328 277782 286716 326723 346389 357248 373282 > > assay(ss[25148,]) > <1 x 409> DelayedMatrix object of type "double": > Running job [-] 4s > Complete > Billed: 168.08 GB > Downloading 4 rows in 1 pages. > TCGA-FD-A5BV TCGA-FD-A3B3 ... TCGA-UY-A9PH TCGA-UY-A78L > cg01482279 0.0501831 0.0566330 . 0.0479807 0.0534672 >
Vince Carey (06:37:45) (in thread): > @Aaron Lunalso, curatedTCGAData project of@Marcel Ramos Pérezand@Levi Waldronhas produced HDF5 images of TCGA data
Aaron Lun (06:43:15) (in thread): > Interesting.
Leonardo Collado Torres (10:30:49): > Regarding ExperimentHub @Aaron Lun, do you imagine having a package per study RSE? (like 2000 of them?) Also, one for gene level, one for exon, one for transcript? The bulk of the data in recount2 is made up of BigWig files, which to my understanding don't fit in ExperimentHub. We have about 8-9 TB of data
Leonardo Collado Torres (10:32:04): > You could also edit your scripts with something like what we have athttp://leekgroup.github.io/recount-analyses/example_de/recount_SRP032789.html(and elsewhere): > > ## Download the gene level RangedSummarizedExperiment data > if(!file.exists(file.path('SRP032789', 'rse_gene.Rdata'))) { > download_study(project_info$project) > } > > ## Load the data > load(file.path(project_info$project, 'rse_gene.Rdata')) > rse_gene >
Leonardo Collado Torres (10:35:03): > As for HDF5Matrices, we’ve already been considering doing this. Would this be the best place to start?http://bioconductor.org/packages/3.8/bioc/html/HDF5Array.html - Attachment (Bioconductor): HDF5Array > Implements the HDF5Array and TENxMatrix classes, 2 convenient and memory-efficient array-like containers for on-disk representation of HDF5 datasets. HDF5Array is for datasets that use the conventional (i.e. dense) HDF5 representation. TENxMatrix is for datasets that use the HDF5-based sparse matrix representation from 10x Genomics (e.g. the 1.3 Million Brain Cell Dataset). Both containers being DelayedArray extensions, they support all operations supported by DelayedArray objects. These operations can be either delayed or block-processed.
Leonardo Collado Torres (10:35:21): > I see several HDF5-related packages
Vince Carey (10:47:29): > https://github.com/vjcitn/htxcompprovides access to 181000 RNA-seq samples in HDF5 … many probably overlap with recount2. The quantifications were generated using salmon, in Sean Davis’@Sean Davisbigrna project. I will update the htxcomp repo today to make sure more details are available at the face page. however, > > > library(htxcomp) > > library(SummarizedExperiment) > > hh = loadHtxcomp() > > hh[1000,1000] > class: RangedSummarizedExperiment > dim: 1 1 > metadata(1): rangeSource > assays(1): counts > rownames(1): ENSG00000067141.16 > rowData names(0): > colnames(1): DRX048550 > colData names(4): experiment_accession experiment_platform > study_accession study_title > > hh[,1000]$study_title > [1] "Active enhancer elements of human vascular endothelial cells" > > assay(hh[1000,1000]) > <1 x 1> DelayedMatrix object of type "double": > DRX048550 > ENSG00000067141.16 18.48119 >
- Attachment (GitHub): vjcitn/htxcomp > tools for manipulating human transcriptome compendium - vjcitn/htxcomp
Vince Carey (10:48:03): > hh is a RangedSummarizedExperiment; rowRanges(hh) provides the transcript metadata used to obtain the gene quantifications via tximport. Use Sean's SRAdbV2 package to find the accession numbers for experiments/samples of interest.
Vince Carey (10:53:26): > TCGA is not in the compendium. But a good fraction of SRA RNA-seq studies (up to 2018…) are present. There are many moving parts but a lot of checking has been done. The back end is the hsdshdflab.hdfgroup.org HDF Scalable Data Service, to which a number of Bioconductor resources have been added thanks to the help of John Readey of the HDF Group. You can work with these via jupyter notebooks. My kitanb repo has https://github.com/vjcitn/kitanb/blob/master/BigDataIntro.ipynb which lays out basic ideas; also https://github.com/vjcitn/kitanb/blob/master/restful1.ipynb… you need to subscribe to HDF Kita lab for notebook access. There is a 30 day trial period but you might need a credit card to get an account. If this is an issue, let me know, and we can try to get a couple of free entries. - Attachment (GitHub): vjcitn/kitanb > Jupyter notebooks for HDF Kita + Bioconductor. Contribute to vjcitn/kitanb development by creating an account on GitHub. - Attachment (GitHub): vjcitn/kitanb > Jupyter notebooks for HDF Kita + Bioconductor. Contribute to vjcitn/kitanb development by creating an account on GitHub.
Vince Carey (10:54:55): > Bottom line – before we go transforming substantial public data into HDF5, let’s discuss whether we have something already, and think about strategies for reducing redundant downloads.
Vince Carey (12:46:15) (in thread): > I have looked at the mckinsel vignette noted above. The essence seems to be developing the query culminating inhttps://s3.amazonaws.com/dcp-matrix-service-results-prod/81f3cb0f4a66074635d419df4f0bb47369e8b3211451303557b16b9732da0393.loom… The LoomExperiment package can handle the .loom file as a LoomFile, to which import() is then applied to generate > > > import(gg) > class: LoomExperiment > dim: 58347 100 > metadata(3): CreationDate LOOM_SPEC_VERSION last_modified > assays(1): matrix > rownames: NULL > rowData names(1): Gene > colnames: NULL > colData names(155): ACCUMULATION_LEVEL ALIGNED_READS ... > uncertain.reads unique.aligned > rowGraphs(0): NULL > colGraphs(0): NULL >
> from which the boxplots in the ipynb can be derived
Martin Morgan (12:52:20) (in thread): > One package with a way to choose which study to return, likehttps://bioconductor.org/packages/curatedTCGADataperhaps. - Attachment (Bioconductor): curatedTCGAData > This package provides publicly available data from The Cancer Genome Atlas (TCGA) Bioconductor MultiAssayExperiment class objects. These objects integrate multiple assays (e.g. RNA-seq, copy number, mutation, microRNA, protein, and others) with clinical / pathological data. The MultiAssayExperiment class links assay barcodes with patient IDs, enabling harmonized subsetting of rows (features) and columns (patients / samples) across the entire experiment.
Martin Morgan (12:53:28) (in thread): > @Marcel Ramos Pérezis working on the matrix-specific side of things athttps://github.com/Bioconductor/HCAMatrixBrowser - Attachment (GitHub): Bioconductor/HCAMatrixBrowser > Contribute to Bioconductor/HCAMatrixBrowser development by creating an account on GitHub.
Vince Carey (12:55:17) (in thread): > Good to know. I don’t see HumanCellAtlas() exported from HCABrowser but example above calls it.
Vince Carey (12:55:31) (in thread): > Am I in the wrong branch? I used lazyeval
Vince Carey (12:58:56): > Briefly – “experiments” are just one way of slicing up the compendium of transcriptomes, and requiring that we divide them up in this way is something of an obstacle to integrative thinking. So all of SRA RNA-seq was turned into a (delayed) HDF5 matrix to get beyond this.
Leonardo Collado Torres (13:21:14) (in thread): > thanks for the link Martin!
Vince Carey (13:21:53) (in thread): > I feel like we need a new channel here for the HCA APIs
Leonardo Collado Torres (13:22:08): > Thanks Vince! I relayed this info to the recount2 team.
Marcel Ramos Pérez (13:24:14) (in thread): > Hi Vince, I’m still working getting a functional package up on GitHub. I’ll update you when I have some runnable code.
Marcel Ramos Pérez (13:30:08) (in thread): > in the meantime, here is a working scripthttps://gist.github.com/LiNk-NY/02a476af8ecd9f88171b95ab250d0ecc
Martin Morgan (13:31:29) (in thread): > @Daniel Van Twiskwill set you straight about branch, but I think he’s using devtools, and as a workarounddevtools::load_all()
gets you there…
Aaron Lun (14:51:58) (in thread): > I just wanted the Rdata file returned bydownload_study
, but as a HDF5Matrix.
Vince Carey (15:20:58) (in thread): > Yes, load_all() helped, provided you have checked out lazyeval. thanks
2018-11-28
Hervé Pagès (12:30:30): > After a couple of weeks and 2600 lines of C code, I’ve made some progress on addressing the inefficiency of random hyperslab unions in hdf5. With the TENxBrainData dataset: > > library(HDF5Array) > library(ExperimentHub) > hub <- ExperimentHub() > fname1 <- hub[["EH1039"]] # sparse > > index1 <- list(77 * sample(34088679, 10000, replace=TRUE)) > system.time(aa1 <- h5read(fname1, "mm10/data", index=index1)) > # user system elapsed > # 29.114 0.134 30.368 > system.time(bb1 <- h5mread(fname1, "mm10/data", index1)) > # user system elapsed > # 1.628 0.102 2.210 > identical(aa1, bb1) >
> With the “dense” version of the dataset (100x100 chunks): > > fname2 <- hub[["EH1040"]] # dense > index2 <- list(sample(27998, 1000, replace=TRUE), > sample(1306127, 1000, replace=TRUE)) > system.time(aa2 <- h5read(fname2, "counts", index=index2)) > # user system elapsed > # 251.767 0.728 253.152 > system.time(bb2 <- h5mread(fname2, "counts", index2)) > # user system elapsed > # 16.208 0.616 17.376 > identical(aa2, bb2) >
> Memory usage is also reduced a lot in those cases: the code using h5read() uses 1.948g and the code using h5mread() only 0.52g (as reported by top; most of it is not R-controlled memory). The latest version of HDF5Array now uses h5mread() as the backend for HDF5Array objects.
Aaron Lun (12:32:00): > :+1::+1::+1::christmas-parrot::party_parrot::flag-au:
Peter Hickey (14:34:42): > :tada::champagne::smiley::+1:
Raphael Gottardo (14:38:33): > Well done@Hervé Pagès@Mike JiangCould you revisit your benchmarks with these changes?
Vince Carey (15:48:01) (in thread): > Dan’s and Marcel’s work represents good technical progress … I am wondering whether we can understand/access the metadata at a higher level. This > > filter(library_construction_approach == "EFO:0008931" >
> selects SmartSeq2 … I have tried to substitute the EFO tag for 10x sequencing and no records come back. Can we ‘browse’ the metadata to understand what options make sense? Ideally the bundle_fqids defined in Marcel’s gist would have some metadata bound to them that is more digestible by a human?
Mike Jiang (15:51:17) (in thread): > @Hervé PagèsShould I be using Bioc devel? github version fails to build somehowHDF5Array.so: undefined symbol: _get_trusted_elt
Hervé Pagès (16:31:12) (in thread): > Yes in BioC devel only. It's HDF5Array 1.11.2. Are you installing by running R CMD INSTALL on your local working copy? In that case make sure you start with a clean copy (i.e. no stale .o files). I've done little testing of h5mread(). Currently working on adding unit tests and a man page.
Hervé Pagès (16:41:48) (in thread): > Another fun one: > > library(HDF5Array) > library(ExperimentHub) > hub <- ExperimentHub() > fname1 <- hub[["EH1039"]] # sparse > > index3 <- list(sample(1306127, 25000, replace=TRUE)) > system.time(aa <- h5read(fname1, "mm10/barcodes", index=index3)) > # user system elapsed > # 42.554 0.000 42.555 > system.time(bb <- h5mread(fname1, "mm10/barcodes", index3)) > # user system elapsed > # 0.121 0.000 0.121 > identical(aa, bb) >
Mike Jiang (17:02:35) (in thread): > BioC devel has 1.11.1, which is the one I installed currently. The github version still doesn't build (with the clean copy). I see you defined _get_trusted_elt as inline, which causes the linking issue. You may want to use static inline? See https://stackoverflow.com/questions/16245521/c99-inline-function-in-c-file/16245669#16245669 - Attachment (Stack Overflow): C99 inline function in .c file > I defined my function in .c (without header declaration) as here: inline int func(int i) { return i+1; } Then in the same file below I use it: … i = func(i); And during the linking I got ”
Peter Hickey (17:07:17): > Thanks for all your work on this,@Hervé Pagès. Just in time for a talk I’m giving tomorrow on DelayedArray :)
Mike Jiang (18:01:21) (in thread): > I verified thatstatic inline
fixes the linking issue
Mike Jiang (18:03:57) (in thread): > Also the h5mread-based HDF5Array IO for random slicing on a dense matrix does have a 10-fold improvement over h5read
> > library(HDF5Array)#must load it first to avoid namespace conflicting > library(mbenchmark) > mat <- matrix(seq_len(2e6), nrow = 1e3, ncol =2e3) > dims <- dim(mat) > > #bigmemory > library(bigmemory) > bm.file <- tempfile() > suppressMessages(bm <- as.big.matrix(mat, backingfile = basename(bm.file), backingpath = dirname(bm.file))) > #wrap it into DelayedArray > library(DelayedArray) > bmseed <- BMArraySeed(bm) > bm <- DelayedArray(bmseed) > > #h5 > library(rhdf5) > h5.file <- tempfile() > h5createFile(h5.file) > h5createDataset(h5.file, "data", dims, storage.mode = "double", chunk=c(100,100), level=7) > h5write(mat, h5.file,"data") > #wrap it into DelayedArray > hm = HDF5Array(h5.file, "data") > > library(microbenchmark) > set.seed(1) > ridx <- sample(1e3, 500, replace = T) > cidx <- sample(2e3, 1000, replace = T) > microbenchmark(a <- as.matrix(bm[ridx, cidx]) > ,b <- as.matrix(hm[ridx, cidx]) > ,c <- h5read(h5.file, "data", list(ridx, cidx)) > # ,d <- h5mread(h5.file, "data", list(ridx, cidx)) > , times = 3) >
Mike Jiang (18:05:34) (in thread): > > Unit: milliseconds > expr mean > a <- as.matrix(bm[ridx, cidx]) 10.47707 > b <- as.matrix(hm[ridx, cidx]) 264.57109 > c <- h5read(h5.file, "data", list(ridx, cidx)) 2659.31401 > > all.equal(a,b,c) > [1] TRUE >
Hervé Pagès (19:44:09) (in thread): > Thx Mike for the feedback. I replaced inline with static inline in HDF5Array 1.11.3. Do you have compiler optimization turned off on your machine? I had to set the gcc flag to -O0 in order to be able to reproduce the linking issue. Defaults are -O2 on Linux and Mac, and -O3 on Windows (the build system uses that too: https://bioconductor.org/checkResults/3.9/bioc-LATEST/malbec2-NodeInfo.html).
2018-11-29
Mike Jiang (13:09:34) (in thread): > No, mine is-O2
too. Here is my compiler flagsGNU C99 7.3.0 -mtune=generic -march=x86-64 -g -O -O2 -std=c99 -finline-functions -fPIC -fstack-protector-strong
Hervé Pagès (14:03:47) (in thread): > mmh… weird! You can see the flags that R uses withR CMD config CFLAGS
: > > hpages@spectre:~$ R CMD config CFLAGS > -g -O2 -Wall >
Mike Jiang (14:59:35) (in thread): > my R CMD config CFLAGS returns the same as yours; the longer version I posted earlier was extracted from the actual compiled object, which I thought might give you more details for troubleshooting. Anyway, static inline should be the right way to go since the inline keyword has undefined compiler-specific behavior and thus isn't safe to use.
2018-11-30
Daniel Van Twisk (16:00:33) (in thread): > Hi @Vince Carey, > With regards to the metadata, I've currently only selected a few fields by default but am open to including more if needed. I need to further refine the select and filter methods to take JSON-specific fields and simplified fields (i.e. 'files.specimen_from_organism_json.organ.text' vs organ). > > With regards to library_construction_approach giving incorrect filters, I have noticed that some of the shorthands I just mentioned aren't all functioning correctly and they will need to be tested accordingly to ensure functionality. > > I have recently returned from a trip and have gotten back to working on HCABrowser. I will make updates to this thread when the issues that are being brought up are corrected.
2018-12-02
Vince Carey (06:34:41) (in thread): > :+1:
2018-12-06
Stevie Pederson (10:30:12): > @Stevie Pederson has joined the channel
Marcus Kinsella (17:43:03) (in thread): > @Vince Careyi’m not sure there’s a way to browse the metadata nicely except for what’s here:https://prod.data.humancellatlas.org/explore/projects
Marcus Kinsella (17:46:13) (in thread): > i do have a big json file of metadata fields and values that i made by just crawling the hca:https://hca-metadata-summaries.s3.amazonaws.com/metadata_22000.json.gz?AWSAccessKeyId=ASIASQQJ53RKOJY4QHZO&Signature=gH3z63DSQWDeMLvjxGt%2B%2FayNSzw%3D&x-amz-security-token=FQoGZXIvYXdzEBAaDE3%2Fv7Z3ObA3A9l4%2BCL%2FAdqL956DnJlnHCQuDwoCZ%2Blejiu671s5RAmjbL6K7%2Fa6pILBDipRNu83DU3V4CbAhi%2FXktkkOoFAqnxBC4GoJkpYLqC6EvlH2I7J75BNZyhchTTsZaGZ6izUcBatWj1BUFnnWGbRFXG9Q%2FL5X3atafFjV0n7%2FDsvVG1eDV31loBZpiNheiZegYDQ6Ta29ogqhc7T5Cr8aAYRd507SEkGUqiPyc19YzvgG7XVH1vAcLqTIMhhhNDGemCAmZ3ZkP3HY5CsRprnsPrazJTZbdgZSlmiydht2RtPhQHQeBmW6ycEukvHCWjknGR%2BGTWS61%2FdhQOc3J22BO%2BaFX7OR%2F4QKCjWyabgBQ%3D%3D&Expires=1544308876
Marcus Kinsella (17:46:47) (in thread): > that’s metadata from about 2/3 of bundles in prod, and gives the search string, existing values, and a count
Vince Carey (17:49:55) (in thread): > that’s helpful. i’ll get back to this in a week or so
Marcus Kinsella (17:50:30) (in thread): > okay sounds good, i’ll need to refresh the url then
2018-12-13
Michael Lawrence (12:44:02): > @Michael Lawrence has joined the channel
2018-12-14
Stephanie Hicks (13:40:05): > is there a reference/citation for the HDF5 file format?
Daniel Van Twisk (15:57:19) (in thread): > @Vince Carey@Marcus KinsellaI have further progress to show and also a few technical questions regarding the HCABrowser. > > First off, the master branch is up-to-date and the package can be built and installed. > > I’ve fixed issues with the previous example, things look correct now (they still may not be the same because the data at the urlhttps://dss.integration.data.humancellatlas.org/v1has changed and we are currently using filter syntax (i.e. filter, term) instead of query syntax (i.e. must, match): > > > hca <- HumanCellAtlas() > > res <- hca %>% filter(library_construction_approach.ontology == "EFO:0008931" & > paired_end == True & > ncbi_taxon_id == 9606 & > process_type == "analysis") > > > res > class: HumanCellAtlas > Using hca-dcp at:[https://dss.integration.data.humancellatlas.org/v1](https://dss.integration.data.humancellatlas.org/v1)EsQuery: > Query: > Bool: > Filter > Bool > Filter > Bool > Filter > Bool > Filter > Term : library_construction_approach.ontology == EFO:0008931 > Term : paired_end == True > Term : ncbi_taxon_id == 9606 > Term : process_type == analysis > Columns selected: > project_title > project_shortname > organ > library_construction_approach.text > specimen_from_organism_json.genus_species.text > files.donor_organism_json.diseases.text > library_construction_approach.ontology > paired_end > ncbi_taxon_id > process_type > > class: SearchResult > bundle 1 - 10 of 21 > link: TRUE > > Showing bundles with 10 results per page > # A tibble: 10 x 10 > bundle_fqid bundle_url process_type.te… biomaterial_cor… library_prepara… > <fct> <fct> <fct> <int> <fct> > 1 99bc97c7-2… https://d… analysis 9606 EFO:0008931 > 2 c731e25a-f… https://d… analysis 9606 EFO:0008931 > 3 2fb086f3-9… https://d… analysis 9606 EFO:0008931 > 4 40d88ae0-b… https://d… analysis 9606 EFO:0008931 > 5 32265c00-8… https://d… analysis 9606 EFO:0008931 > 6 8123b1de-6… https://d… analysis 9606 EFO:0008931 > 7 cf7ff717-f… https://d… analysis 9606 EFO:0008931 > 8 b723b224-3… https://d… analysis 9606 EFO:0008931 > 9 cf7832e6-c… https://d… analysis 9606 EFO:0008931 > 10 7868c53e-5… https://d… analysis 9606 EFO:0008931 > > > pullBundles(res) > ## The relevant bundle_fqids displayed here (too long!) >
> The EsQuery is a bit messy (this is to ensure further changes to the query can be arbitrarily added), so it might be better to display it differently. > > I’ve went through and corrected issues regarding incorrect names being returned and incorrect search results. The incorrect names were simply due to the fact that some fields had multiple end nodes. (e.g.library_construction_approach
may refer toontology
ortext
). These fields can now be specified withlibrary_construction_approach.ontology
orlibrary_construction_approach.text
, respectively. All fields can be accessed by giving a unique json identifier. (e.g.organ
can be accessed with any of the following:organ
,organ.text
,specimens_from_organism.organ.text
,files.specimens_from_organism.organ.text
). An error will be thrown if an identifier is not unique: > > > hca %>% select('library_construction_approach') > Error in FUN(X[[i]], ...) : > Field library_construction_approach matched more than one field. Please select one: > files.library_preparation_protocol_json.library_construction_approach.ontology > files.library_preparation_protocol_json.library_construction_approach.text >
Michael Lawrence (16:08:55): > @Vince CareyIt sounds like Kita is presenting an abstraction that stores data in separate chunks, but makes the data appear as a single HDF5 resource. Is that a feature unique to Kita (thus requiring S3) or does the h5server also support that?
Daniel Van Twisk (16:12:53) (in thread): > I’ve added further methods for user interaction with the data in theHumanCellAtlas
object.hca %>% activate()
will change whether bundles or files are displayed to the user (bundles is default) > > > hca <- hca %>% activate > Displaying results by file > > hca <- hca %>% activate > Displaying results by bundle >
> hca %>% downloadHCA will download the entire result of the search by bundles or files (whichever is activated). bundles <- hca %>% pullBundles gets all bundle_fqids from the HumanCellAtlas result. hca %>% showBundles(bundles) will show all results related to a character vector of bundle_fqids. > > The filter()
method should now be able to hand arbitrary complicated searches. For examples: > > res <- hca %>% filter(!(library_construction_approach.ontology == "EFO:0008931" | !((!process_type == analysis & ncbi_taxon_id == 9606)))) > > > res > class: HumanCellAtlas > Using hca-dcp at:[https://dss.integration.data.humancellatlas.org/v1](https://dss.integration.data.humancellatlas.org/v1)EsQuery: > Query: > Bool: > Filter > Bool > MustNot > Bool > Should > Term : library_construction_approach.ontology == EFO:0008931 > Bool > MustNot > Bool > Filter > Bool > Filter > Bool > MustNot > Term : process_type == analysis > Term : ncbi_taxon_id == 9606 > Columns selected: > project_title > project_shortname > organ > library_construction_approach.text > specimen_from_organism_json.genus_species.text > files.donor_organism_json.diseases.text > library_construction_approach.ontology > process_type > ncbi_taxon_id > > class: SearchResult > bundle 1 - 10 of 17 > link: TRUE > > Showing bundles with 10 results per page > # A tibble: 10 x 9 > bundle_fqid bundle_url donor_organism_… donor_organism_… library_prepara… > <fct> <fct> <fct> <fct> <fct> > 1 f17503d8-2… https://d… 9606 atrophic vulva EFO:0009310 > 2 23818067-b… https://d… 9606 atrophic vulva EFO:0009310 > 3 09d0e640-5… https://d… 9606 atrophic vulva EFO:0009310 > 4 3b89f0a9-8… https://d… 9606 atrophic vulva EFO:0009310 > 5 3c2f7450-4… https://d… 9606 atrophic vulva EFO:0009310 > 6 0f6dca6c-c… https://d… 9606 atrophic vulva EFO:0009310 > 7 fc480e7b-f… https://d… 9606 atrophic vulva EFO:0009310 > 8 7c5b997e-4… https://d… 9606 atrophic vulva EFO:0009310 > 9 a93ef55f-5… https://d… 9606 atrophic vulva EFO:0009310 > 10 367ae623-5… https://d… 9606 atrophic vulva EFO:0009310 >
> Questions for anyone interested: > > Is there a new server being used for the hca? Using the url https://dss.integration.data.humancellatlas.org/v1 yields very few results (it is being updated daily, though) > > What are default fields you would like to see? I've tried basing the current select fields (bundle_fqids, bundle_urls, species, organ, disease, project_name, project_shortname) on https://prod.data.humancellatlas.org/explore/projects. In general, do you think the HumanCellAtlas object looks friendly enough to a user? > > I'd like to add a method, findFieldValues(hca, <field>)
, that displays all values associated with a field. For example, using the function on organ should show c('brain', 'heart', 'kidney', <etc>). This would help the user discover values that they can search for, but I don't know how to do this with the hca-dcp. Googling suggests "aggregations" should do it: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html. Something like:
Daniel Van Twisk (16:16:15) (in thread): > > curl -XGET[https://dss.integration.data.humancellatlas.org/v1/_search](https://dss.integration.data.humancellatlas.org/v1/_search)-d ' > { > "aggs" : { > "whatever_you_like_here" : { > "terms" : { "field" : "organ" } > } > }, > "size" : 0 > }' >
> But the dcp-doesn’t allow GETs on_search
, so it seems that this functionality may be constrained in the dcp. > > In addition, I’d like to generate easily-searchable json endpoints (e.g., typingorgan
instead offiles.specimens_from_organism.organ.text
is better for the user; also I’d like to programmatically reveal all searchable fields). How can this be done? I saw that@Marcus Kinsellahad a post showing fields gleaned from the dcp, but the link is dead.
Marcus Kinsella (16:16:51) (in thread): > couple quick things: aggregations don’t work in the DSS
Marcus Kinsella (16:17:15) (in thread): > new link:https://s3.amazonaws.com/hca-metadata-summaries/metadata_subsample.json
Marcus Kinsella (16:18:07) (in thread): > that url you list points to the “integration” deployment, which gets wiped very now and then. just delete “integration.” and you’ll point to production
Daniel Van Twisk (16:18:36) (in thread): > The information on the link is exactly what I need. Do you have a way of generating it?
Marcus Kinsella (16:18:58) (in thread): > i have a shameful script that takes two days to run
Marcus Kinsella (16:19:08) (in thread): > happy to share it, but that’s what it is
Daniel Van Twisk (16:20:35) (in thread): > I’d still like to see it. However, two days may be too long for a user to wait to use the HCABrowser, so maybe we can find other ways.
Marcus Kinsella (16:23:03) (in thread): > i’ve been thinking we should more or less be running a cron job that updates some s3 object every so often
Marcus Kinsella (16:23:39) (in thread): > or otherwise just dumping metadata to some other database as bundles arrive
Marcus Kinsella (16:24:23) (in thread): - File (Python): gather_metadata.py
Marcus Kinsella (16:39:20) (in thread): > for default fields, i would look at some of the facets in these tables:https://prod.data.humancellatlas.org/explore/
Sean Davis (16:40:36): > The cloud-native version is Kita-specific. h5server can be deployed on AWS, but the backend is still just an HDF5 file.
Michael Lawrence (18:07:06): > The appealing thing about Kita is that we could in theory construct a matrix incrementally, adding samples over time. Of course, we could achieve that with an R-level abstraction, like the matter package.
Vince Carey (20:23:00): > A couple of comments. First, Kita is a notebook interface for HSDS, HDF Scalable Data Service. HSDS, a REST API for the HDF object store, has recently been open-sourced. One needs to have openstack implementing S3 to run HSDS decoupled from AWS. This was accomplished in the development of HSDS on XSEDE Jetstream. We are trying to do that “locally”, to decouple from AWS. Second, incremental construction of matrices in HDF5 is generally available when one of the dimensions is left “infinite”; there may be other ways. This is supported in HSDS. We have left HDF Server, a tornado-based REST API for HDF5 content managed as files (not as objects), because the scalability of HSDS is very appealing – requests can be handled in parallel by as many cores as you are willing to commit on your HSDS deployment. HDF Server does not scale in this way.
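A minimal sketch of the “infinite dimension” idea with plain rhdf5, for anyone curious (file name, dataset name and sizes below are made up):
> library(rhdf5)
> h5createFile("assays.h5")
> # leave the sample dimension unlimited so columns can be appended later
> h5createDataset("assays.h5", "counts",
>                 dims = c(25000, 1), maxdims = c(25000, H5Sunlimited()),
>                 chunk = c(25000, 1), storage.mode = "double")
> h5write(matrix(rnorm(25000), ncol = 1), "assays.h5", "counts", index = list(NULL, 1))
> # when a new sample arrives: grow the dataset by one column and write into it
> h5set_extent("assays.h5", "counts", c(25000, 2))
> h5write(matrix(rnorm(25000), ncol = 1), "assays.h5", "counts", index = list(NULL, 2))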
Vince Carey (20:24:41): > A nice thing about HDF Server, however, is that it is very easy to deploy.
Sean Davis (21:30:37): > There are other s3 implementations besides openstack that might be relevant, particularly for testing. See minio, for example.https://www.minio.io/ - Attachment (Minio): Minio: Private cloud storage > Minio is an open source object storage server with Amazon S3 compatible API. Build cloud-native applications portable across all major public and private clouds.
2018-12-15
Michael Lawrence (14:08:44): > @Vince Carey, do you have any resources on HSDS for me to read? Google just brings up fact sheets about Kita. Are you able to store metadata on the HDF5 objects in S3 or does the REST API hide that?
Vince Carey (14:11:30): > I think the best place at the moment ishttps://github.com/HDFGroup/hsds - Attachment (GitHub): HDFGroup/hsds > Cloud-native, service based access to HDF data. Contribute to HDFGroup/hsds development by creating an account on GitHub.
Vince Carey (14:12:38): > John Readey@John Readeyis in this channel – I understand he is working on a white paper. The metadata question should be posed to him.
Vince Carey (14:30:14): > I am not sure how accessiblehttps://support.hdfgroup.org/projects/hdfserver/#restis from the github page links, and I don’t know if it is up to date. But it seems potentially useful. - Attachment (support.hdfgroup.org): HDF Server > The HDF Group is a not-for-profit corporation with the mission of sustaining the HDF technologies and supporting HDF user communities worldwide with production-quality software and services.
2018-12-17
Vince Carey (06:38:44) (in thread): > @Daniel Van TwiskHCABrowser builds and installs but HumanCellAtlas is not exported and I do not see how to run the example. Can you update the github README? Thanks
Vince Carey (06:49:25) (in thread): > I tried running@Marcel Ramos Pérez’s gist noted above, even with the LiNk-NY fork of HCABrowser, but to no avail.
John Readey (11:35:23): > Hey@Michael Lawrencethe HSDS storage model stores metadata (e.g. dataset shape & type, attributes) as JSON objects in S3. Chunks are stored as binary objects. The schema is described here:https://github.com/HDFGroup/hsds/blob/master/docs/design/obj_store_schema/obj_store_schema_v2.rst. - Attachment (GitHub): HDFGroup/hsds > Cloud-native, service based access to HDF data. Contribute to HDFGroup/hsds development by creating an account on GitHub.
John Readey (11:39:07): > For anyone who would like to try it out we have a JupyterHub site you can sign up for here:https://www.hdfgroup.org/hdfkitalab/.@Vince Careyhas loaded up some relevant datasets at the path: /shared/bioconductor/.
Marcel Ramos Pérez (12:04:47) (in thread): > Hi Vince,@Vince CareyI’ve moved the script into a package and worked on it more. You can run the request with the same bundle IDs using theHCAMatrixBrowser
https://github.com/Bioconductor/HCAMatrixBrowser - Attachment (GitHub): Bioconductor/HCAMatrixBrowser > Contribute to Bioconductor/HCAMatrixBrowser development by creating an account on GitHub.
Vince Carey (19:23:47) (in thread): > @Marcel Ramos Pérezthanks, that works. Now how do I get valid bundle ids? I have tried requesting a manifest such ashttps://prod.data.humancellatlas.org/explore/projects?filter=%5B%7B%22facetName%22%3A%22organ%22%2C%22terms%22%3A%5B%22spleen%22%5D%7D%2C%7B%22facetName%22%3A%22fileFormat%22%2C%22terms%22%3A%5B%22h5%22%5D%7D%5D
Vince Carey (19:25:24) (in thread): > but the tsv returned has bundle id/version that when concatenated with . does not get handled by loadHCAMatrix.
> Browse[1]> readBin(response$content, what="character")
> [1] "{\n \"detail\": \"'81f920f3-8645-439c-868e-1bf16ab9b0a7.2018-12-11T013132.284148Z' is not of type 'array'\",\n \"status\": 400,\n \"title\": \"Bad Request\",\n \"type\": \"about:blank\"\n}\n"
Marcus Kinsella (19:33:49) (in thread): > not sure how HCAMatrixBrowser works exactly, theuuid.version
should work, i think you’re just sending a single string rather than an array
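A quick way to see the difference between the two JSON payloads (a minimal sketch with jsonlite; the bundle id is just the one from the error message above):
> library(jsonlite)
> fqid <- "81f920f3-8645-439c-868e-1bf16ab9b0a7.2018-12-11T013132.284148Z"
> # default (no auto-unboxing): a length-1 character vector serializes as a JSON array
> toJSON(list(bundle_fqids = fqid))
> # auto_unbox = TRUE collapses it to a bare string, which the endpoint rejects as "not of type 'array'"
> toJSON(list(bundle_fqids = fqid), auto_unbox = TRUE)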
2018-12-18
Daniel Van Twisk (11:42:09) (in thread): > @Vince CareySorry about that. I’ve been so focused on trimming the package’s code that I’ve neglected making sure the package functions in a stand-alone manner. I will update you once the package is patched up and will also get proper vignettes and man-pages out soon.
Vince Carey (14:57:10) (in thread): > :+1:
2018-12-20
Daniel Van Twisk (13:47:50) (in thread): > @Marcus KinsellaYou are correct that there’s not really a great way to obtain all fields and the values related to those fields at the moment without just iterating through the bundles of the hca. Since I’d still like to offer the user methods to find the available fields and values, I want to ask whether you have plans to continuously update your resource of data-mined bundles from the hca that you showed me previously. Basically, I want to access it to display this information to the user.
Daniel Van Twisk (14:26:31) (in thread): > @Marcus KinsellaAnd just one note about your resource onhttps://s3.amazonaws.com/hca-metadata-summaries/metadata_subsample.json, it’s currently not a valid json. It seems to have an extra":
in the first line. Removing that makes it valid.
Marcus Kinsella (15:13:58) (in thread): > ugh thanks for the heads up, i must have fat fingered something in vim
Marcus Kinsella (15:14:48) (in thread): > i do plan to continuously update the bundle crawling until we have a better solution in place
Martin Morgan (15:19:05) (in thread): > Would it be possible / helpful to include counts with the crawl, so that the user can easily know the relative frequency of, e.g., different files.sequencing_protocol_json.sequencing_approach.ontology?
Marcus Kinsella (15:30:00) (in thread): > yeah, i think the script above actually does create the counts, i stripped them out for simplicity
Marcus Kinsella (15:30:19) (in thread): > but out of curiosity, what would you assume the counts represent?
Marcus Kinsella (15:31:38) (in thread): > or i should say, what would you want them to represent?
Martin Morgan (15:52:42) (in thread): > document frequency? I guess I’m thinking of someone developing an algorithm and wondering whether it would be worth their while, in terms of additional data, to accommodate the unique features of sequencing approach x
Marcus Kinsella (15:59:18) (in thread): > i see, so it represents “bundles” right now, but bundles are extremely heterogeneous in terms of data content
2019-01-02
Mike Jiang (18:04:25): > @John Readeyc++
wrapper forh5
lib is not officially supported asthread-safe
, does that mean it is not safe even for calling read-only APIs on independent h5 files within multi-threaded applications?@Aaron Lun, do you have a multi-threaded implementation for beachmat?
Mike Jiang (18:15:52) (in thread): > Never mind, I found the answer on the HDF site: “(1) concurrent access to different datasets in a single HDF5 file and (2) concurrent access to different HDF5 files both require a thread-safe version of the HDF5 library”
Shian Su (18:50:07) (in thread): > https://portal.hdfgroup.org/display/knowledge/Questions+about+thread-safety+and+concurrent+access< “Please note that the threadsafe feature is only maintained and supported for C” sounds like it doesn’t supportC++
officially either way. - Attachment (Confluence): Questions about thread-safety and concurrent access > Thread-Safety Concurrent Access Troubleshooting Issues
Aaron Lun (21:44:02) (in thread): > I don’t use multiple threads per se, all my parallelization is done via BiocParallel. In the case of BatchtoolsParam, it involves starting a new R instance altogether, which avoids the “same process” part of the problem (because the separate R instance has its own global variables). I think this would be the case for all BiocParallel schemes - even MulticoreParam has a separate R process, technically speaking.
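For what it’s worth, a minimal sketch of that pattern (file names are hypothetical): because each BiocParallel worker below is a separate R process, the non-thread-safe HDF5 library is never entered concurrently from within one process.
> library(BiocParallel)
> library(rhdf5)
> files <- c("sample1.h5", "sample2.h5")             # hypothetical per-sample HDF5 files
> res <- bplapply(files,
>                 function(f) h5read(f, "assay"),    # read-only access, one file per worker
>                 BPPARAM = SnowParam(workers = 2))  # each worker is its own R process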
2019-01-03
Mike Jiang (13:14:45) (in thread): > Yeah, I’ve verified that h5c++ is not thread-safe: it crashed on me, and the issue went away once the h5 access was serialized in my code.
2019-01-04
Aaron Lun (00:23:07) (in thread): > I should probably remove my advice about concurrent access in the beachmat user’s guide.
2019-01-10
Tim Triche (14:53:05): > has anyone benchmarked DelayedMatrixStats methods against a restfulSE? as compared to as.matrix() and the default internal implementations (where possible)
Peter Hickey (14:57:12): > There are no specific methods for restfulSE in DMS, so it’ll just be block processing + default implementations.
Peter Hickey (14:58:11): > I’m not currently using restfulSE, hence the lack of specific methods :)
Tim Triche (14:58:21): > at some point I need to bother you about that, since I get REALLY slow performance working across HDF5-backed SEs (bsseq objects) for e.g. DMR calling
Tim Triche (14:58:35): > not so much “why is it slow with restfulSE” but “why is it slow”
John Readey (15:00:56): > Hey@Tim Triche- can you provide an equivalent Python script?
Tim Triche (15:01:04): > I’ll work on it!
Tim Triche (15:01:15): > although probably not for a few days minimum
Peter Hickey (15:02:22): > In my experience, unexpected and big slowdowns are due to subsetting/reordering of rows or columns that ‘degrade’ an HDF5Array to a DelayedArray. These trigger some ongoing challenges with getting good performance when reading non-contiguous chunks from an HDF5 file using the BioC HDF5 stack
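A small illustration of that “degrading” (a sketch with current HDF5Array/DelayedArray; the matrix sizes are arbitrary):
> library(HDF5Array)
> A <- writeHDF5Array(matrix(runif(1e6), nrow = 1e4))
> class(A)                        # pristine HDF5Matrix
> B <- A[sample(nrow(A), 5000), ] # a row shuffle adds a delayed subset
> class(B)                        # now a DelayedMatrix wrapping the HDF5 seed
> showtree(B)                     # the pending [ operation forces non-contiguous chunk reads
> B2 <- writeHDF5Array(B)         # one way out: realize to a fresh, contiguous HDF5 dataset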
John Readey (15:02:59): > @Mike Jiangwas looking for some enhancements to the Python drivers that would help performance (https://github.com/HDFGroup/h5pyd/issues/47)
John Readey (15:03:43): > And I was thinking we could build in some common operations into the server. e.g. return the sum of a column.
2019-01-11
Levi Waldron (09:51:19): > I limited this to just 100 rows because the sapply approach was so slow. Still waiting for results on the full dataset (365860 rows, no sapply). I’m a little baffled why the “traditional” method (as.matrix) is taking an order of magnitude longer than the 45 seconds it did yesterday without the function and system.time, andDelayedMatrixStats::rowsum
is similarly slow. gist athttps://gist.github.com/lwaldron/44c7142adb04de03255cc37b0264acf6. - File (JavaScript): rowsum benchmarking on TCGA ACC methylation data
Tim Triche (09:51:57): > ok, now this is starting to sound more like my experiences
Levi Waldron (10:33:01): > It seems to have something to do with unevaluated operations? The difference seems to have been a subset I did on the MAE before extracting and operating on methylation. Here’s a minimal example showing how a nested operation can add orders of magnitude to an HDF5 operation. Can one of the experts (@Peter Hickey@Hervé Pagès@Martin Morgan@Aaron Lun) help me understand? - File (JavaScript): Untitled
Aaron Lun (10:35:01): > Indeed. Looks like the old nemesis from our previous discussion (scroll up to Nov 8).
Aaron Lun (10:36:42): > There are various issues here - some of which are solvable, e.g.,rhdf5inefficiencies (perhaps already solved by@Mike Smith). Others less so due to problems with the HDF5 library itself (though@Hervé Pagèsdid some workarounds, last I saw).
Levi Waldron (11:40:30): > Ouch
Levi Waldron (11:43:53): > Do you have any user guidelines for avoiding the slowdowns?
Aaron Lun (19:33:00): > Not from me - fixes need to occur at a lower level, I think.
2019-01-13
Levi Waldron (10:54:14): > FWIW, here is a stripped-down benchmark that runs in a few seconds on a small 27K methylation dataset from breast cancer. Gist athttps://gist.github.com/lwaldron/44c7142adb04de03255cc37b0264acf6 - File (PNG): BRCAmeth27.png
Levi Waldron (10:59:06): > The best polynomial fit says that elapsed time increases with the cube of the number of rows.
Levi Waldron (11:12:02): > I found it interesting that if I just select the first n rows, theas.matrix()
time is constant (at the low end of the above scale), but if I sample the same number of rows either randomly or sequentially from all available rows, the time increases cubically.
Levi Waldron (11:21:31): > Benchmark showing how the cubic scaling doesn’t occur if you select the first n rows, only if you select sequentially or randomly from all rows.
> > summary(lm(res2["elapsed", ] ~ I(n^3)))$r.squared
> [1] 0.9908704
- File (PNG): BRCAmeth27.png
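Roughly, the pattern can be reproduced with a toy HDF5-backed matrix (a sketch; the data here are simulated rather than the 27K methylation assay):
> library(HDF5Array)
> A <- writeHDF5Array(matrix(rnorm(2e6), nrow = 2e4))
> n <- 5000
> system.time(as.matrix(A[seq_len(n), ]))                # first n rows: roughly constant
> system.time(as.matrix(A[sort(sample(nrow(A), n)), ]))  # n rows spread across the file: much slower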
Peter Hickey (17:03:51): > not surprised by random sample result but a bit surprised by sequential. > what doeschunkdim()
give you on the assay data (sorry, I don’t have bioc-devel installed currently)?
Tim Triche (18:03:47): > I meant to mention this earlier
Tim Triche (18:05:48): > at least when I was reimplementing DMRcate in miser (https://github.com/ttriche/miser), I found that things were a lot more predictable when I throttled the block size down:
Tim Triche (18:05:55):
> DelayedArray:::set_verbose_block_processing(verbose) # why so slow?
> setAutoBlockSize(1e6) # look at million entries at a time
Tim Triche (18:06:10): > verbatim from the last time I fiddled with the code:wink:
2019-01-14
Levi Waldron (05:02:29): - File (R): chunkdim()
Levi Waldron (05:46:49): > I submitted an issue athttps://github.com/grimbough/rhdf5/issues/31to get it onto the issue tracker. FYI the polynomial degree actually seems to be 2 with some more data, and the sequential sample has a greater growth constant (updated plot on the rhdf5 issue). The gist is updated quite a bit (https://gist.github.com/lwaldron/44c7142adb04de03255cc37b0264acf6).
Aaron Lun (06:31:05): > @Peter Hickey@Mike SmithI’m thinking of makingbeachmata header-only library. (Excepting the link to Rhdf5lib, which I can’t do anything about.) Thoughts?
Antonino Aparo (06:37:57): > @Antonino Aparo has joined the channel
Aaron Lun (06:41:05) (in thread): > Actually, forget it. It’s a lot more work for me and I still need the end-developer to modify theirMakevars
so there’s no real benefit.
Aaron Lun (06:42:58) (in thread): > In any case,beachmatgot a version bump, so make sure you bump bsseq otherwise BioC-devel users will probably get funny link problems.
Peter Hickey (06:49:33) (in thread): > I don’t know enough about C++ to have an opinion. > thanks for the heads up, will bump the version tomorrow
Aaron Lun (08:08:31) (in thread): > The relevant version is 1.5.2. No API changes but you should recompile anyway.
Kasper D. Hansen (08:45:49): > That chunkdim seems extreme
Kasper D. Hansen (08:46:22): > How does caching work in this case?
2019-01-16
Tim Triche (11:10:17): > nb.@Peter Hickeythe bsseq implementation of HDF5 backing works GREAT (I just summarized imprinting across 67 WGBS runs of tissues, tumors, and plasma at 10x-30x on a not-particularly-new laptop in 30s) … maybe I need to be digging into your code to see how you guys made it fly like that
Tim Triche (11:11:19): > at present, I haven’t managed to achieve those sorts of speeds out of minfi, so I must be doing something wrong (granted in that case, there are thousands of samples, but I don’t see why the differences should be so extreme)
Peter Hickey (14:51:19): > Thanks, Tim!@Kasper D. Hansenand I started work on refactoring minfi to support hdf5 but it’s a big job and we’re both overcommitted. But I hope to return to it at some point
Kasper D. Hansen (15:02:28): > No, code doesn’t exist in minfi
Kasper D. Hansen (15:03:06): > Taking it from “code runs” to “code runs well” is a really big task and we (ahem, Pete) have only done that for bsseq, where it is amazing what can be done
Kasper D. Hansen (15:03:22): > Pete has done like 300 WGBS with non-CpG methylation
Kasper D. Hansen (15:03:42): > In minfi we haveread.metharray2()
which might not be exported
Kasper D. Hansen (15:03:47): > That is optimized
Tim Triche (15:03:49): > ooooh
Kasper D. Hansen (15:03:52): > But that is just reading the data
Tim Triche (15:03:56): > still
Kasper D. Hansen (15:04:04): > everything post reading is unoptimized
Kasper D. Hansen (15:04:13): > But I am really itching to address this soon
Tim Triche (15:04:17): > hmm. well, miser attempts to do some of that
Tim Triche (15:04:26): > along with automatic metadata discovery, etc.
Tim Triche (15:04:35): > it’s a work in progress (and “progress” is a generous term)
Kasper D. Hansen (15:04:42): > partly because I have a research project with many samples so I need it myself
Tim Triche (15:04:55): > about 2500 for TARGET pAML phase II
Tim Triche (15:05:04): > so yeah, same issue here
Tim Triche (15:05:48): > still – bsseq is amazing. I just had no idea until the cluster was offline monday and I had to use HDF5
Kasper D. Hansen (15:06:17): > yeah documentation is (ahm) a weak point at the moment
Kasper D. Hansen (15:06:25): > It really makes me want to do the same for minfi
Kasper D. Hansen (15:06:35): > It just takes a lot to actually do it
Tim Triche (15:06:47): > yeah that was the other thing I noticed when coding up miser
Kasper D. Hansen (15:06:50): > The parsing - which in principle is ultra simple - was not exactly easy
Tim Triche (15:06:58): > I essentially rewrote DMRcate to run in finite time off of HDF5
Tim Triche (15:07:32): > it’s much more fiddly than I had hoped (similarly, imputation etc… everything I take for granted)
Kasper D. Hansen (15:07:51): > Part of it is recognizing how to think / write this
Kasper D. Hansen (15:08:03): > Minor changes in code can have HUGE impact
Tim Triche (15:08:11): > good god was that ever the case
Tim Triche (15:08:20): > I’m hoping there exists a catalog of best practices somewhere
Kasper D. Hansen (15:08:32): > recognizing the good patterns will be important
Kasper D. Hansen (15:08:44): > I think bsseq is the most advanced usage right now
Kasper D. Hansen (15:09:06): > And frankly, arcane incantations had to be deployed by Pete to make it work this well
Kasper D. Hansen (15:09:30): > So one thing we have on our “want to do list” is finish minfi and compile a (possible) long tutorial on our experience
Tim Triche (15:11:29): > looking over the contortions in > > minfi:::read.metharray2 >
> is educational
Kasper D. Hansen (15:13:29): > yes it has some quite explicit patterns
2019-01-24
Mike Jiang (13:04:02): > @Aaron Lun@Mike SmithNot sure if you guys attendedHDF5 C++ Webinar
today, have you guys thought about adding more recentH5CPP
toRhdf5lib
(the existingh5c++
wrapper is outdated and incomplete, in my experience)?https://github.com/steven-varga/h5cpp11
Aaron Lun (13:09:25): > Interesting. Certainly HDF5’s C++ API is a pain in the ass, so there’s much room for improvement there. The question is how well this thing will be supported - was there any indication of official uptake?
Aaron Lun (13:10:07): > I also wonder how it interfaces with the Linear algebra libraries. Looks like it just provides convenient wrappers for filling a matrix, but my initial thought was some crazy templating scheme where HDF5 readers were passed directly into armadillo etc.
Mike Jiang (13:18:00) (in thread): > The project was developed partly in collaboration with Gerd Heber from the HDF Group; I feel that it is going to replace the current C++ API sooner or later
Aaron Lun (13:22:23) (in thread): > Feels like we should just wait and see before committing.
Mike Jiang (13:25:10) (in thread): > I will post an issue and see how popular it will be among the user community
2019-02-02
Sean Davis (08:39:41): > Worth a quick look as a general approach for storing tabular data (not just numeric):https://www.biorxiv.org/content/10.1101/536979v1.abstract
Sean Davis (08:40:25): > > Motivation: Biologists commonly store data in tabular form with observations as rows, attributes as columns, and measurements as values. Due to advances in high-throughput technologies, the sizes of tabular datasets are increasing. Some datasets contain millions of rows or columns. To work effectively with such data, researchers must be able to efficiently extract subsets of the data (using filters to select specific rows and retrieving specific columns). However, existing methodologies for querying tabular data do not scale adequately to large datasets or require specialized tools for processing. We sought a methodology that would overcome these challenges and that could be applied to an existing, text-based format. Results: In a systematic benchmark, we tested 10 techniques for querying simulated, tabular datasets. These techniques included a delimiter-splitting method, the Python pandas module, regular expressions, object serialization, the awk utility, and string-based indexing. We found that storing the data in fixed-width formats provided excellent performance for extracting data subsets. Because columns have the same width on every row, we could pre-calculate column and row coordinates and quickly extract relevant data from the files. Memory mapping led to additional performance gains. A limitation of fixed-width files is the increased storage requirement of buffer characters. Compression algorithms help to mitigate this limitation at a cost of reduced query speeds. Lastly, we used this methodology to transpose tabular files that were hundreds of gigabytes in size, without creating temporary files. We propose coordinate-based, fixed-width storage as a fast, scalable methodology for querying tabular biological data.
2019-02-03
Tim Triche (12:05:34): > so character matrices – makes sense
Sean Davis (13:08:17): > Fixed-width character matrices, yep.
Tim Triche (18:00:31): > fixed stride => much faster indexing. as usual, ASCII rules the world:wink:
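For the curious, the coordinate arithmetic the paper relies on is simple enough to sketch in a few lines of base R (the file name, field width and column count below are made up):
> # with fixed-width fields, every cell's byte offset is pure arithmetic,
> # so a seek() + readChar() fetches it without scanning for delimiters
> path <- "matrix.fwf"                 # hypothetical fixed-width file
> cell_width <- 8L                     # bytes per value, padding included
> n_col <- 1000L
> row_bytes <- n_col * cell_width + 1L # +1 for the trailing newline
> get_cell <- function(i, j) {
>   con <- file(path, open = "rb")
>   on.exit(close(con))
>   seek(con, where = (i - 1L) * row_bytes + (j - 1L) * cell_width)
>   readChar(con, nchars = cell_width)
> }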
Shian Su (19:10:36): > I thought space-time trade-offs and memory access patterns were pretty well studied already. Is it particularly surprising for anyone that fixed strides beats stream parsing and looking for delimiters?
Shian Su (19:24:05): > It also seems like a missed opportunity to mention the ideas presented here:https://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html - Attachment (blastedbio.blogspot.com): BGZF - Blocked, Bigger & Better GZIP! > BAM files are compressed using a variant of GZIP (GNU ZIP) , called BGZF (Blocked GNU Zip Format). Anyone who has read the SAM/BAM Specifica…
Shian Su (19:27:32): > In-place transposition doesn’t sound like a useful operation to optimise for; surely you’d at most do it once and maybe keep a copy of the file in each orientation. Column/row/block access should be the main concerns.
Shian Su (19:43:33): > It’s also my impression that memory-mapping truly shines when used in conjunction with concurrency, allowing multiple threads to share the same memory and avoiding excess copies, would be interesting to see benchmarks on that.
Jayaram Kancherla (19:51:24): > something I’ve been reading up on recently is the vaex python package; it complements pandas and provides a fast and scalable way to load huge datasets.https://twitter.com/maartenbreddels/status/1091753644641447936?s=09 - Attachment (twitter): Attachment > My slides and notebook on vaex from my talk at #europandas are on GitHub, demonstrating you can work with a 135 GB / > 1 000 000 000 row dataframe in #Python on your laptop. #BigDataAnalytics https://github.com/maartenbreddels/talk-vaex-pandas-summit-2019 https://pbs.twimg.com/media/DyavulqX0AAveoY.jpg
Shian Su (20:01:02): > The blog post on Vaex:https://towardsdatascience.com/vaex-out-of-core-dataframes-for-python-and-fast-visualization-12c102db044a
Shian Su (20:01:22): > Leads to Apache Arrow:https://arrow.apache.org
Shian Su (20:01:53): > Which seems very promising
Jayaram Kancherla (21:06:14): > @Shian Suyes, it does support arrow and hdf5, and in another blog post they also recommend storing large data sets as arrow for the best performance with vaex. More info here -https://docs.vaex.io/en/latest/
2019-02-04
Kasper D. Hansen (10:14:05): > @Shian SuBut bgzip does not seem to allow for indexing on both rows and columns, only rows, as far as I understand
hcorrada (10:28:49) (in thread): > @Jayaram Kancherla
Sean Davis (11:23:00): > https://github.com/apache/arrow/tree/master/r
Sean Davis (11:23:12): > Apache arrow R package.
Shian Su (18:11:09) (in thread): > Technically it’s just indexing on blocks, so it depends on what you decided should be the memory-contiguous dimension. For SAM each “row” lives in contiguous memory, but for a column-major formats it would be the columns being indexed (but not really unless columns just happen to fit nicely into single blocks).
Shian Su (18:15:21) (in thread): > I’m actually wrong in that Dayton’s paper does propose row-number indexing, where each line is compressed into its own block. But this has the obvious issue when the format is “wide” with a large number of columns. Under this scheme querying a single column requires a full decompress of the entire dataset.
2019-02-05
Tim Triche (11:37:10): > is Stephen on this slack? could just ask him if so. alas seems not so
Aaron Lun (13:05:52): > @Peter Hickey@Kasper D. Hansen@Davide Rissoand others; BiocSingular is ready to roll. Latestirlbaupdates have cleared the remaining bugs from the dependencies, so expect an imminent submission to BioC once the CRAN binaries build.
2019-02-12
Davide Risso (04:54:32): > @Aaron Lunhave you seen/tried this?https://github.com/KlugerLab/FIt-SNE
Davide Risso (04:54:56): > Since we were talking of scaling up tsne
Aaron Lun (04:55:38): > Nope. It’s not in a R package so it was Too Much Effort.
Davide Risso (04:57:02): > I’m not in the office today, but I may want to try it later this week. They have a .R script so perhaps it’s worth the effort of building a small package around it if it’s truly scalable
Davide Risso (04:57:38): > You mentioned you were working on scaling tsne, are you working within the Rtsne package?
Aaron Lun (05:01:30): > Yes. It’s done and on CRAN.
Aaron Lun (05:01:53): > SeeRtsne_neighbors
, as used byscater::runTSNE
.
Davide Risso (05:02:29): > Awesome! So if I just use scater I’m good!:sunglasses:
Aaron Lun (05:03:47): > Yes. Withexternal_neighbors=TRUE
andBNPARAM=BiocNeighbors::AnnoyParam()
(orBiocNeighbors::HnswParam()
).
Aaron Lun (05:03:57): > Those two are approximate NN libraries.
Aaron Lun (05:04:08): > You might want to crank up thenthreads=
to be passed toRtsne
as well.
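Putting those pieces together, a minimal sketch (assuming sce is an existing, normalized SingleCellExperiment):
> library(scater)
> library(BiocNeighbors)
> sce <- runTSNE(sce,
>                external_neighbors = TRUE,  # compute the NNs outside Rtsne
>                BNPARAM = AnnoyParam(),     # or HnswParam()
>                nthreads = 4)               # passed through to Rtsne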
Davide Risso (05:04:18): > :+1:
2019-02-13
Mike Smith (08:19:46) (in thread): > Thought I’d highlight that I dug into Levi’s benchmark and tweaked the rhdf5 indexing code to optimise how it selects hyperslabs. The two blue lines represent the before and after for sequential index selection. Thanks@Levi Waldronfor the nice reproducible example to work with. - File (PNG): index_improvment.png
Tim Triche (09:53:36): > you guys are heroes
2019-02-19
Johnson Zhang (13:54:16): > @Johnson Zhang has joined the channel
2019-02-20
Martin Morgan (08:43:14): > @Aaron Lunwas wondering about your comment that implies static linking makes header-only libraries difficulthttps://github.com/Bioconductor/HDF5Array/issues/15? Kind of makes sense. CRAN is moving to ‘staged’ installshttps://developer.r-project.org/Blog/public/2019/02/14/staged-install/index.htmlwhere it can be challenging to figure out how to specify the path to the shared object (if, for instance, one is interested in a specific version of the library rather than ‘whatever the OS finds’). One can ‘opt out’ (in the DESCRIPTION file, StagedInstall: no) or use static linking, but I don’t really fully understand the shared / static tradeoff.
Aaron Lun (08:52:41): > The main problem that I’ve encountered with a shared library is that the hard-coded paths can be wrong if the file system changes between installation and execution (e.g., head vs cluster nodes). This is a pretty specific situation, but nonetheless, it happens. A more general user-level problem is that changes tobeachmat(which has loads of content in headers, due to templates) can sometimes result in ABI incompatibilities, requiring a re-compilation of client packages. This generally happens less as I try to bump all client packages whenever I updatebeachmat.
Kasper D. Hansen (08:54:39): > The head vs cluster node issue is a headache. In general, for this reason, I install my stuff on the cluster nodes instead of the head node. Woe of course if the cluster nodes themselves differ from one another
Kasper D. Hansen (08:56:19): > From an admin perspective (since I maintain a highly used installation on our local cluster), we are seeing it more and more often that updates to a package (say Rcpp) require re-installation of reverse dependencies. I think this is what Aaron is saying. We don’t really have a good way in R of signaling this (as far as I know) and since it is starting to happen more and more, this would be a really good thing to work out.
Martin Morgan (09:43:59): > Probably the head / node problem is tickled by staged installation; the package is installed and tested in one location, and then moved (in the file system sense of the word) to another, so if the path during installation is somehow important to the package then staged installs break. This makes it seem likeStagedInstall: no
should be a last resort, so that packages are more independent of installation directory. Probably there are additional problems on clusters…
2019-02-23
Levi Waldron (17:09:53): > Re: head / node problems, as a user, if you can get your sysadmin to install Singularity, and you use it, these problems all go away.
Levi Waldron (17:12:01): > There are brief instructions atGitHub.com/waldronlab/bioconductor_devel.@Ludwig Geistlingerand@Nitesh Turagahave also made some slurm-specific directions, I need to link to those still.
Ludwig Geistlinger (17:12:25): > @Ludwig Geistlinger has joined the channel
2019-03-11
Aaron Lun (02:31:08): > @Hervé PagèsLet me know if you want to talk about my proposal to move some C++ code into HDF5Array. The changes are relatively small for the amount of functionality to be added, but it does touch the codebase at multiple locations and I can appreciate that might require some justification/discussion.
2019-03-13
Kylie Bemis (23:27:08): > @Kylie Bemis has joined the channel
2019-03-18
Aaron Lun (03:06:37): > RGU reference!
2019-03-19
Aaron Lun (01:45:56) (in thread): > Nudging@Hervé Pagès.
2019-03-23
Stephanie Hicks (15:58:09): > :joy:@Aaron Lun - File (PNG): Screen Shot 2019-03-23 at 3.57.41 PM.png
Aaron Lun (18:59:38): > ?
2019-03-24
Stephanie Hicks (06:26:08): > It’s a funny email address
Tim Triche (15:19:06): > > For many years, it was believed that an infinite number of monkeys mashing an infinite number of keyboards would eventually recreate the genius of Shakespeare. Now, thanks to the World Wide Web, we know this to be false.
Aaron Lun (22:19:51): > ha lol
Aaron Lun (23:08:58): > <!channel>Beachmat v2 has been pushed to BioC. Packages that haven’t updated their linking instructions are liable to fail. These packages are listed below: > - bsseq
Aaron Lun (23:10:31): > Incidentally, until we decide how to deal withhttps://github.com/Bioconductor/HDF5Array/issues/15, performance of beachmat-driven C++ code on HDF5Matrix inputs will degrade to block processing… which probably is a bit worse than what we had before.
2019-03-29
Stephanie Hicks (13:39:45): > does anyone know if there can be dashes in the names of data objects inside of anExperimentHub
package?
Sean Davis (14:29:32): > ping,@Marcel Ramos Pérez,@Lucas Schiffer?
Lucas Schiffer (14:29:38): > @Lucas Schiffer has joined the channel
2019-03-30
Lori Shepherd (21:06:27) (in thread): > Sorry I didn’t see this earlier. What do you mean exactly? The name of the resources that gets pulled down from the hub or the object itself?
2019-03-31
Stephanie Hicks (08:15:04) (in thread): > No problem@Lori Shepherd! I’m asking about both. Looping in@Keegan Korthauer@Patrick Kimeswho are helping generate the objects and I’m helping put together in an ExperimentHub package.
Keegan Korthauer (08:15:07): > @Keegan Korthauer has joined the channel
Patrick Kimes (08:15:07): > @Patrick Kimes has joined the channel
Lori Shepherd (09:07:34) (in thread): > I don’t think it should matter - but I’ll do a few tests first thing tomorrow to verify
Stephanie Hicks (13:45:54) (in thread): > thanks!
2019-04-01
Lori Shepherd (13:33:38) (in thread): > I don’t think it should matter. In general for most file names I believe it is more common to use an underscore rather than a dash. Each object should have its own file and the naming of the object/file is somewhat irrelevant and would be limited based on the load/import function required for the file. As long as the operating system allows files with dash then it should be okay as a filename in the hub. The hub will try and load the object unless using a FilePath DispatchClass so again the naming of the object shouldn’t matter either.
2019-04-03
Levi Mangarin (15:45:40): > @Levi Mangarin has joined the channel
Tao Liu (17:24:33): > @Tao Liu has joined the channel
2019-04-22
Tom Gleeson (04:02:42): > @Tom Gleeson has joined the channel
2019-05-06
Firas (10:31:38): > @Firas has joined the channel
2019-05-14
Corina Lesseur (09:16:29): > @Corina Lesseur has joined the channel
2019-05-20
Assa (05:28:14): > @Assa has joined the channel
2019-05-26
Aaron Lun (01:21:41): > Every now and then, you want to regress out some factors from an expression matrix, and then run PCA on the residuals.
Aaron Lun (01:21:46): > But sometimes your matrix is too large to be represented as a dense matrix of residuals.
Aaron Lun (01:21:50): > Gee, if only there were a way to do the PCA on the residualswithout actually computing the residuals.
Aaron Lun (01:21:56): > Well, withBiocSingularand a healthy dose ofDelayedArray
magic, you can!
Aaron Lun (01:22:02):
> library(BiocSingular)
> design <- model.matrix(~gl(5, 1000))
>
> library(Matrix)
> y0 <- rsparsematrix(nrow(design), 30000, 0.01)
> y <- ResidualMatrix(y0, design)
>
> object.size(y)
> ## 19523952 bytes
> system.time(pc.out <- runPCA(y, 10, BSPARAM=IrlbaParam()))
> ## user system elapsed
> ## 5.146 0.240 5.386
>
> # For comparison:
> fit <- lm.fit(x=design, y=as.matrix(y0))
> object.size(fit$residuals)
> ## 1200000392 bytes
> system.time(pc.out <- runPCA(fit$residuals, 10, BSPARAM=IrlbaParam()))
> ## user system elapsed
> ## 105.118 0.244 105.377
Aaron Lun (01:22:22): > Gotta call up the SFPD bomb squad
Aaron Lun (01:22:25): > cause I just blew my mind
2019-05-27
Tim Triche (09:12:52): > that’s mighty slick
Tim Triche (09:12:58): > :exploding_head:
2019-06-23
Ameya Kulkarni (22:09:33): > @Ameya Kulkarni has joined the channel
2019-06-24
Komal Rathi (09:22:30): > @Komal Rathi has joined the channel
FeiZhao (21:21:27): > @FeiZhao has joined the channel
2019-07-10
Stephanie Hicks (16:19:52): > question. I have 44 WGBS samples from purified whole blood cell types (6 cell types). They are bigwig files. I’m trying to identify the best way to work with theseincrediblylarge files (>25 million rows per sample) . I can create individualbsseq
objects out of each one (either in memory or on disk with HDF5). I can combine thebsseq
files into onebsseq
object and keep the data in separate HDF5 files or resave into all 1 HDF5 file. But ultimately I want to do the usual and rundmrFinder()
across the 6 cell types. Any advice from the experts in the#bigdata-repchannel on best way to create the object in R?
Martin Morgan (16:29:44): > is@Peter Hickeythe expert on this ^^ ?
Stephanie Hicks (16:51:53): > Definitely. Maybe@Tim Triche@Kasper D. Hansenor others might have insights too?
2019-07-11
Sean Davis (10:25:09): > @Keegan Korthauer?
Keegan Korthauer (10:39:13): > @Stephanie Hicksit sounds like both of those approaches would work, but it might be more efficient to save the combined sample set as HDF5 (though technically, I believe it creates multiple h5 files - one for the M counts and one for the Cov counts) - that way if you want to rerun the code in later sessions, you wouldn’t have to do the combine step again.
Vince Carey (11:16:36): > The transformation from bigwig to HDF5 sounds onerous. The bigwig files are indexed. Can’t dmrFinder work with them directly?
Sean Davis (11:22:53) (in thread): > Seems like a really good question.
2019-07-13
Kasper D. Hansen (17:08:17): > Talk w/Pete
Kasper D. Hansen (17:08:27): > We have done this for much, much larger datasets
Kasper D. Hansen (17:08:51): > ie. non-CpG methylation (~1B rows instead of 25M) and 300 samples
2019-07-14
Vince Carey (09:36:21): > @Kasper D. Hansen@Peter Hickey– what is “this”? Can you work from the bigwig files directly or does the content need to be reformatted?
Peter Hickey (21:30:44): > i haven’t used bigWig files for analysis, only for visualisation
Peter Hickey (21:30:59): > creating a bigWig-backed DelayedArray is something i thought about
Peter Hickey (21:32:24): > @Stephanie Hicksmy advice is to stick them on disk in an HDF5 file. If you create a singleBSseqobject then runHDF5Array::saveHDF5SummarizedExperiment()
on it, it’ll create a directory with a.rds
file for theSummarizedExperimentbit and a single (I think) HDF5 file for the assays data
Peter Hickey (21:33:34): > I would then re-do this after QC-ing the data to remove any samples / loci before runningdmrFinder()
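In code, that save/re-load round trip looks roughly like this (a sketch; the directory name is arbitrary and bs stands for the combined BSseq object):
> library(HDF5Array)
> saveHDF5SummarizedExperiment(bs, dir = "wgbs_se", replace = TRUE)  # writes the .rds plus one HDF5 file under wgbs_se/
> bs <- loadHDF5SummarizedExperiment("wgbs_se")                      # later sessions: reattach without re-parsing anything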
2019-07-15
Stephanie Hicks (21:12:50) (in thread): > @Vince Careyyes, I can’t say I have found agreatsolution, but my currentworkingsolution is something like for each pair of bigwig files (acall
matrix — or percent methylation; and acov
matrix — or coverage):
> # create a granges object for the % methylation
> gr_call <- rtracklayer::import(file_name_call, format = "bigWig")
> colnames(mcols(gr_call)) <- "score_call"
>
> # create a granges object for the coverage
> gr_cov <- rtracklayer::import(file_name_cov, format = "bigWig")
> colnames(mcols(gr_cov)) <- "score_cov"
>
> # combine call and cov information into same granges object
> gr_call$score_cov <- gr_cov$score_cov
> genome(gr_call) <- "GRCh38"
> rm(gr_cov)
>
> # liftover to diff genomic coordinates (if necessary)
> genome(gr_call) <- "GRCh38"
> path = system.file(package="liftOver", "extdata", "hg38ToHg19.over.chain")
> ch = import.chain(path)
> gr_call_19 <- liftOver(gr_call, ch)
> gr_call_19 <- unlist(gr_call_19)
> genome(gr_call_19) <- "hg19"
> gr_call_19 <- unique(gr_call_19)
> rm(gr_call)
>
> # create bsseq objects
> methyl_mat <- round(as.matrix(mcols(gr_call_19)$score_call) *
>                     as.matrix(mcols(gr_call_19)$score_cov))
> cov_mat <- as.matrix(mcols(gr_call_19)$score_cov)
> bs1 <- BSseq(gr = gr_call_19, M = methyl_mat, Cov = cov_mat)
>
> # or using HDF5
> hdf5_m <- writeHDF5Array(methyl_mat)
> hdf5_cov <- writeHDF5Array(cov_mat)
> bs2 <- BSseq(gr=gr_call_19, M=hdf5_m, Cov=hdf5_cov)
Stephanie Hicks (21:29:58): > Right now I’m looping over all theN=44 samples to create individualbsseq
objects, but I want to know if there is a more optimal way of doing this…:shrug:
Peter Hickey (21:43:17): > you could parallelise the loop (assuming you’ve enough memory or as aqsub
job)
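For example (a sketch; make_bsseq() is a hypothetical wrapper around the per-sample import/liftover/BSseq steps above, and call_files / cov_files are hypothetical vectors of the paired bigWig paths):
> library(BiocParallel)
> bs_list <- bplapply(seq_len(44),
>                     function(i) make_bsseq(call_files[i], cov_files[i]),
>                     BPPARAM = MulticoreParam(workers = 8))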
Stephanie Hicks (21:48:22): > good point. But@Peter Hickeyto your point of “to create a singleBSseqobject”, I’m assuming I will need to read in all thecall
andcov
files into memory to be able to create one biggranges
object and store thecall
andcov
scores in themcols
component? so then I can store it all into one HDF5 file? That seems not possible, or is there a better way of doing this?
Stephanie Hicks (21:50:07): > or can you even rundmrFinder()
on a bunch of independentBSseqobjects? my understanding was it had to be 1BSseqobject?
Peter Hickey (22:51:35): > if eachBSseqobject you create has theexact same co-ordinates in the exact same orderthen you cancbind()
them and re-save withsaveHDF5SummarizedExperiment()
to put all the data in one HDF5 file
Peter Hickey (22:53:13): > thecbind()
is slow because it (unnecessarily, in this case) validates the returned object. a hack to speed it up is to temporarily disable validity checking:
> S4Vectors:::disableValidity(TRUE)
> bsseq <- do.call(cbind, list_of_bsseq)
> S4Vectors:::disableValidity(FALSE)
Stephanie Hicks (23:09:58): > the BSseq objects do not have the same co-ordinates and they are not in the same order. my current work around is > > bsList <- list(bs1, bs2) > bsCombined <- combineList(bsList) >
Stephanie Hicks (23:11:02): > but this feelsnot optimaleither:confused:
Peter Hickey (23:12:43): > combining BSseq objects (especially disk-backed ones) with different co-ordinates, frankly, sucks
Stephanie Hicks (23:13:25): > :sad-parrot:
Peter Hickey (23:14:14): > withread.bismark()
i made it possible to specify the coordinates up-front to avoid parsing the files twice (once to get the set of co-ordinates, the second time to parseM
andCov
). > I never start with bigWigs so i’ve never done the same for that
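For Bismark-style outputs that workflow looks roughly like this (a sketch, assuming the refactored read.bismark() with its loci= and BACKEND= arguments; cpg_loci is a GRanges of the shared CpG coordinates and cov_files the per-sample coverage files):
> library(bsseq)
> bs <- read.bismark(files = cov_files,
>                    loci = cpg_loci,        # coordinates supplied up-front: files parsed once
>                    BACKEND = "HDF5Array",  # write M / Cov straight to disk
>                    verbose = TRUE)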
Stephanie Hicks (23:14:37): > oh nice !
Stephanie Hicks (23:15:06): > yeah, this is the “raw” format provided from the consortia that I downloaded the files from…:disappointed:
Stephanie Hicks (23:15:25): > well in that case, is it not worth going the disk-backed route?
Peter Hickey (23:17:12): > 44 WGBS with CpGs isn’tthatbig. Should be < 4 GB for each ofM
andCov
> > > memuse::howbig(nrow = 20000000, ncol = 44, type = "int") > 3.278 GiB >
Peter Hickey (23:18:20): > as@Kasper D. Hansensaid, i’ve only really needed HDF5 when doing non-CpG
Stephanie Hicks (23:18:28): > totally fair point. but when I combine two WGBS samples today, it will be more than 25 million rows.
Stephanie Hicks (23:18:46): > could be e.g. 30 million. as I combine more and more WGBS samples, i’m assuming that will grow
Peter Hickey (23:19:16): > even with 40 million, thats “only” 8 GB for each assay
Stephanie Hicks (23:19:25): > but yes, maybe it’s worth just trying everything in memory
Peter Hickey (23:19:33): > i’d create the object in-memory and save an HDF5-backed version to disk
Peter Hickey (23:19:42): > best of both worlds
Stephanie Hicks (23:19:45): > ok, sold.
Stephanie Hicks (23:21:30): > Thanks@Peter Hickey. i’ll keep the channel posted on my progress:female-technologist:
2019-07-17
Vince Carey (20:46:44): > Is BigWig a typical ‘processed’ format for WGBS? Are there any exemplary publicly accessible example files? I might take a crack at a DelayedArray interface if I had some examples.
Peter Hickey (21:01:39): > i’d say it’s the best of a bad bunch. > you lose coverage information but it works with genome browsers. > unsure how common it is
Peter Hickey (21:02:59): > encode came up with some methylation genome track format that retains coverage info,bedMethyl
(https://www.encodeproject.org/data-standards/wgbs/) but I don’t think it has much buy in (e.g., don’t know if genome browsers support it well)
Stephanie Hicks (21:46:52): > I know@Keegan Korthauerhas worked with WGBS quite a bit too. She might have more insights on your question@Vince Carey.
Stephanie Hicks (21:47:50): > @Vince Careyalso, fwiw the WGBS data (bigwig) I’m working with is part of the BLUEPRINT consortia available here:http://dcc.blueprint-epigenome.eu/#/datasets(bigwig is the only option available)
Stephanie Hicks (21:49:25): > following up on@Peter Hickey’s point, their solution is to have two bigwig files: one for methylation signal and one for coverage of methylation signal.
Keegan Korthauer (22:48:36): > I have worked with quite a bit of wgbs, but I’ve never worked with it in BigWig format. Reason being that as@Peter Hickeypointed out, you lose coverage information, and that has turned out to be valuable in the small sample size settings I’ve worked in.
Keegan Korthauer (22:49:27): > (I didn’t know about@Stephanie Hicksclever solution of having a separate file for coverage - that sounds promising!)
2019-07-18
Simon Dirmeier (05:00:06): > @Simon Dirmeier has joined the channel
Kasper D. Hansen (09:37:41): > @Vince CareyNo, BigWig sucks for WGBS. But it can be used to display data on the UCSC browser. The WGBS world is not big enough that a standard file format has emerged
Kasper D. Hansen (09:38:46): > Also, some of the obvious advantages of BigWig are lost in WGBS data, since it is really a set of positions and not continuous along the genome
Vince Carey (09:51:20): > OK, thanks
Stephanie Hicks (10:26:35) (in thread): > can someone tell this to BLUEPRINT plz:grimacing:
Sean Davis (10:53:49): > I wonder what@Michael Lawrencethinks about Apache Arrow for large datasets like WGBS?
Michael Lawrence (11:24:54): > The Arrow serialization format (flat buffers) does not use any compression, so the files would be big. Not sure Arrow would bring many advantages over HDF5. BigWig is still efficient for non-contiguous data, because it can use the bedGraph model. It also stores statistical summaries of the data, which in theory could be used to store supplemental vectors. But, a bgzip’d and tabix’d extended BED file could work if you only need to compute block-wise or query specific ranges. As an aside, I wonder if any framework supports block-wise processing of tracks by their gzip blocks?
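As a sketch of the bgzip + tabix route for a per-CpG extended BED file (the file name and query range below are made up):
> library(Rsamtools)
> library(GenomicRanges)
> bed <- "meth_calls.bed"                  # hypothetical sorted, extended BED of per-CpG calls
> bgz <- bgzip(bed, overwrite = TRUE)      # block-gzip (BGZF)
> idx <- indexTabix(bgz, format = "bed")   # tabix index over the blocks
> tbx <- TabixFile(bgz, idx)
> # a range query touches only the blocks overlapping the region
> hits <- scanTabix(tbx, param = GRanges("chr1", IRanges(1e6, 2e6)))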
hcorrada (11:25:43): > @Jayaram Kancherla
2019-07-19
Jayaram Kancherla (08:35:23): > @Michael LawrenceI’ve been working on a library that can apply transformations (using a numpy or numpy-like function) over indexed file formats (bigbeds, bigwigs or any file that can be indexed with tabix). For a given query range, we only access and parse the blocks of the file and apply the function. This Jupyter Notebook describes some of the functionality of the library (https://epiviz.github.io/post/2019-02-04-epiviz-fileserver/). > > Code & documentation is available athttps://github.com/epiviz/epivizFileParser
Tim Triche (09:42:17) (in thread): > bigWig of covg + bigWig of % is sufficient to recreate C coverage and total (C+T) coverage; obviously for read-level inference you need reads
Tim Triche (09:43:09) (in thread): > we are covering this in the biscuit paper… a BED-like format for C/total (which is tabix’able) and VCF 4.2 for omnibus calls (C vs. T, SNVs, indels, SVs)
Tim Triche (09:43:47) (in thread): > there’s no particular reason that WGBS doesn’t have a standard format (like BED, VCF, 2 bigWigs) other than people inventing their own goofy formats all over the place
Tim Triche (09:44:02) (in thread): > sorry to hear that you’re having to deal with BLUEPRINT btw
Michael Lawrence (12:16:34) (in thread): > I see; it’s cool how it lazily computes for a given range. Similar to how plyranges defers reading a BAM file until after specifying filters and a summary (like coverage). What about computing in batch though? It would be cool if the framework knew where the blocks were and could align the processing to the file structure. Much faster than partial block reads, which require scanning.
Jayaram Kancherla (14:09:01) (in thread): > We use the Bigwig/Tabix index to find the data blocks in the given query range and only request for these blocks of the file (the file can be hosted on a public server). We then parse these blocks and compute the function.
Jayaram Kancherla (14:10:47) (in thread): > also can you clarify what you mean by batch processing ?
Michael Lawrence (16:44:35) (in thread): > Yes, that sounds right. I mean processing the data in its entirety, not just selected regions.
Michael Lawrence (16:45:32) (in thread): > So in other words, you loop over the blocks, processing each in turn. If you were to loop over, say, tiles of the genome, it would be less efficient, because if a range mapped to reads at the end of a block, you would have to scan the entire block to get to them.
2019-07-22
Gabriel Hoffman (12:23:23): > @Gabriel Hoffman has joined the channel
2019-07-24
Jayaram Kancherla (05:55:35) (in thread): > I don’t think this will be a problem, at least for bigwigs. From every data block, we know how many data points it contains and we also know how many bytes each data point takes, so we can directly skip to the end without having to scan the entire block
2019-07-25
Michael Lawrence (13:22:48) (in thread): > Is that true? I thought bigwig supported compression.
2019-07-27
Aaron Lun (18:06:49): > @Kasper D. HansenYour TENxPBMCDatadataset=
argument has 3 copies of"pbmc4k"
.
2019-07-30
Friederike Dündar (09:32:48): > @Friederike Dündar has joined the channel
Friederike Dündar (09:37:03): > Is there any web interface/PDF documentation about the data sets available through ExperimentHub? Or would one have to install the package and browse it the way it’s shown in the vignette (https://bioconductor.org/packages/release/bioc/vignettes/ExperimentHub/inst/doc/ExperimentHub.html)?
Martin Morgan (09:52:44): > https://experimenthub.bioconductor.orgbut it’s a very thin layer…
Friederike Dündar (09:53:45): > indeed:slightly_smiling_face:
Friederike Dündar (09:53:52): > but it’s a start, I guess
Friederike Dündar (09:54:33): > although there’s not much to see in terms of the data sets that are indeed available
Friederike Dündar (09:54:41): > so who’s the keeper of that information?
Friederike Dündar (09:57:28): > basically, I’m trying to figure out how many cell-type specific expression data sets are already available given that cell identity assignment is a key task for scRNA-seq, so it’d be nice if there was a way to quickly identify reference data sets that make sense for a given experiment at hand
Friederike Dündar (09:58:07): > then download that data in a fairly simple format (e.g. gene counts or normalized expression values) and supply it to whatever package one is using to assign the cell labels to one’s own data
Martin Morgan (10:26:11): > I think, personally, that that is too ambitious for the current implementation of ExperimentHub; the data are not curated to that level. Also, the hub is basically an object store with files the unit of storage – many of the single-cell data sets will be stored as components that can be assembled into a SummarizedExperiment / SingleCellExperiment, which is ‘a fairly simple format’ within the Bioc community, but not perhaps useful outside bioc
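For reference, the current query interface is about this thin (a sketch; the search terms are just examples):
> library(ExperimentHub)
> eh <- ExperimentHub()
> hits <- query(eh, c("single cell", "brain"))  # free-text match on title/description/tags
> hits                                          # browse the matching records
> # an individual record is then downloaded (and cached) with eh[["EH<id>"]]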
Lori Shepherd (10:31:56): > @Friederike DündarI also invited you to the biochubs channel - while the web API interface may not be comprehensive, as an alternative you might get some discussion points in that channel from other developers and contributors to the hub as well as discuss any feature requests
Friederike Dündar (10:53:57): > just to clarify here: I’m not necessarily saying I want scRNA-seq data for those reference data sets, in fact, I’d much prefer bulk RNA-seq/microarrays of very pure cells
Friederike Dündar (10:54:47): > but even for scRNA-seq all one would need is the matrix of counts
Friederike Dündar (10:55:21): > I’ll reignite the discussion in the#biochubschannel
Tim Triche (10:56:40): > @Martin Morgantoday at the Childhood Cancer Data Initiative, some serious frustrations with siloed data and distribution boiled over – I had written an abstract with Vince based on the HDF5-backed, restfulSE-distributed, HSDS approach but was in Zambia at submission time and the internet died before I could submit
Tim Triche (10:58:19): > @Martin MorganI am firmly convinced that some of the data harmonization work that is currently handled by e.g. GDC or Commonses could benefit from multiple sets of eyeballs. More importantly, and to my immense surprise, this appears to be a growing consensus. Elaine Mardis argued forcefully for “federated” data sharing, as opposed to the One Big Expensive Hub. It strikes me that ExperimentHub as a central switchboard with external pointers could be a very efficient initial implementation.
Tim Triche (11:00:59): > @Martin Morganinstead of mandating that a query in ExperimentHub or AnnotationHub return a RESTful URL within the existing schema, perhaps it could be extended to index resources that are hosted elsewhere (e.g. HSDS or CHOP’s platforms) and simply poll once an hour or once daily to see whether the link is live. If not, generate a warning to the host, flag as unreliable, and after a week (or however long) deprecate. It appears that everyone is starting to realize that the Great Big Commons Of Commonses approach is not a panacea for data sharing. ExperimentHub is one of several alternatives worth exploring (“critical unmet need”, “transformative”, blah blah)
Tim Triche (11:02:09): > @Martin Morganperhaps a proposal of this nature, if funded, could provide developers to implement a more agile search function and a pilot implementation of a genuinely federated, low-level counterpart to the slick-but-locked-down St. Jude’s cloud approach (“look at the pretty pictures”)
Tim Triche (11:03:05): > @Lori Shepherdbefore you say it, we’re working with Adam Resnick to brute-force test the changes in MTseeker and deploy it within Cavatica. I have 3 interns and a postdoc hacking away on the pileup implementation; I was shocked to find that this seems to be genuinely novel. Sorry for the holdup.
Lori Shepherd (11:09:04): > Data in the current ExperimentHub can, I believe, be stored elsewhere as long as the data is publicly accessible. There is an additional column in the metadata to handle this when submitting, but it is not highly utilized. Perhaps different methods for querying and accessing said collection of resources would be required instead of the current state and implementation - it would be worth thinking about desirable features and current limitations when designing a revamp.
Tim Triche (11:10:36): > @Lori Shepherdyeah I think this is a proposal that is very much in the interest of the BioC community. It was amazing to me how nearly-unanimous the emphasis on federation was/is at CCDI. Which reminds me, I need to get back to sessions. This is a grant worth writing (maybe ideally to a funder with quicker turnaround than NIH)
2019-07-31
Vince Carey (07:14:58): > @Tim Trichethanks for these comments. I had not heard about cavaticahttps://d3b.center/our-research/cavatica/… Should we try to set up a miniconference on this topic of siloing, commonsing, and agile solutions for cancer genomics access for later in the year? - Attachment (Children’s Hospital of Philadelphia® Center for Data-Driven Discovery in Biomedicine): CAVATICA - Children’s Hospital of Philadelphia® Center for Data-Driven Discovery in Biomedicine > Cavatica is a scalable, cloud-based platform designed to collaboratively access, share and analyze pediatric cancer data.
Martin Morgan (09:16:33) (in thread): > I like this idea. I’ll mention (a) current Human Cell Atlas data sets counts) will be made available via the hub in the near future (b) Gary Bader expressed strong interest in using ExperimentHub for contribution and use of pre-HCA single cell data sets available as SummarizedExperiments. The basics for doing this are likely identifying appropriate ontologies and use these as tags,query(hub, c("HumanCellAtlas", "Kidney", ...))
Tim Triche (11:17:07) (in thread): > I’m down. A LOT of people at CCDI were discussing the disconnect between “GDC will do that [sometime in the future, perhaps several years]” versus “we could use this for phenotyping and clinical studies right now”
Tim Triche (11:17:20) (in thread): > very cool regarding Bader
Tim Triche (11:18:59) (in thread): > also @Vince Carey I wrote up an abstract on using HDF5/HSDS for CCDI, but got stuck without internet in western Zambia for three days before the submission deadline. I suck. One of these days I’ll send it to you. I was actually pretty proud of it. But, so it goes. Malcolm Smith modified some verbiage to favor federation over centralization, and Charles Mullighan led right into the “sometimes big data is big because it takes a long time to move around” conundrum (which gets worse if NIH didn’t pay for it – off to EGA! thanks Charles for pointing that out)
Tim Triche (11:20:59) (in thread): > the whole thing was far more productive than I expected. I felt like Charles Roberts was a little bit paternalistic, but the absence of a profit motive seems to have encouraged genuine participation from everyone else. Also Soheil basically described what we’ve been trying to do with MAEs-in-HSDS so that was cool too. Having a working proof of concept locally never hurt anyone. I suspect a lot of people are unaware of the progress in making datasets easily available via BioC lately, and it could stand to be publicized a lot more widely. (Tiago is presenting TCGABiolinks demos as I write this. More later.)
Tim Triche (11:24:01) (in thread): > Adam Resnick (at CHOP) is pretty enthusiastic about piloting BioC-powered workflows right inside of Cavatica (&c), and Elaine Mardis really emphasized how having loosely coupled workflow components (whether dockerized or otherwise) has helped clinically at Nationwide Children’s. Maybe I’m just jaded, but I was really impressed and got the impression that she was dead serious about promoting that sort of approach as president of AACR.
2019-08-02
Stephanie Hicks (18:10:48): > @Tim Triche@Lori ShepherdI attended the Crazy 8s event from Alex’s Lemonade Stand Foundation last September with Casey Greene and the “federation approach” was essentially what came out of the “Big Data” team. I’m happy to talk more about it for anyone interested, but there was a sincere interest in making this happen.
Stephanie Hicks (18:12:20): > The childhood cancer world seems to be in a very different place than the rest of the cancer world…
Tim Triche (18:20:35): > The “Crazy 8s” RFAs are, to some degree, an attempt to do what NIH was unable to do with the FusOncC2 U54 RFAs. Nobody ever made any money in pediatrics, which cuts down on the potential for bad actors, but rest assured that any time you get a bunch of humans competing for money, there will be some self-serving rationales proposed…
Tim Triche (18:21:02): > (I spoke with Anna about this twice in the last month, FWIW)
Tim Triche (18:22:27): > It’s really, really hard to get competent reviewers for rare tumors to give unbiased opinions on proposals, because anyone competent will already be applying to the RFAs. Pediatric tumors are incredibly informative about biology and about adult tumors in a way that the converse can never be, but at the same time, there’s a running gag in peds onc:
Tim Triche (18:22:47): > “This is going to be a blockbuster pediatric drug. We could make HUNDREDS OF DOLLARS!”
Tim Triche (18:23:15): > I’m just happy that NIH and ALSF are insane enough to throw any money at the problem at all:smile:
Aaron Lun (18:40:06) (in thread): > @Tim TricheJust curious - why is this?
Tim Triche (19:02:11) (in thread): > how many kids get cancer? how many adults? that’s why:wink:
Aaron Lun (19:04:00) (in thread): > Huh, I would have thought that people would be pumping money into curing sick kids.
Aaron Lun (19:04:13) (in thread): > But I guess economics wins out.
Aaron Lun (19:08:16) (in thread): > It’s a cold, cold world.
Tim Triche (19:16:58) (in thread): > Philanthropically, there’s a ton more money in childhood cancer, because if you can’t raise money to cure bald kids, you can’t raise money. But the profit motive? Not so much.
Tim Triche (19:18:00) (in thread): > And yes, the profit motive does tend to move more money than the philanthropic motive. One must already have some degree of profits in order to indulge philanthropy.:confused:
Aaron Lun (19:19:18) (in thread): > For sure. Now that I’m an industry hack, I can finally do my “bathtub full of money” photo shoot.
Tim Triche (19:19:50) (in thread): > That said, modern chemotherapy is almost entirely based on Sidney Farber’s efforts in Boston treating kids with ALL (at the time, a death sentence). My contention is that some of the problems with chemo response in adults stem from the protocols originally (long ago) being most successful, and thus most emulated, in pediatric malignancies. Farber’s efforts were, as you might now expect, funded not by governmental agencies but rather the Variety Club of New England.
Lori Shepherd (20:03:11) (in thread): > @Stephanie Hicks. I would be interested in hearing more about your thoughts on it and what you got out of the meeting.
Stephanie Hicks (20:23:13) (in thread): > Sure happy to share my take. Will send you a longer email?
2019-08-03
Lori Shepherd (10:18:04) (in thread): > Yes that would be great!
Tim Triche (10:23:34) (in thread): > @Stephanie Hickswere you in DC this past week for the CCDI? Interested in your take on that, as well.
Tim Triche (10:26:07) (in thread): > @Stephanie Hicksthe quite remarkable thing is that, while a majority of participants did seem to favor a federated approach, there were some rather… “strategic”… presentations that attempted to downplay this in favor of “wait a few more years and GDC will have that”. It seemed like policy could go either way (remember caBIG?) but at least there was widespread support for federation and incentives to help enable easy submission of non-identifiable data. (The latter is where I feel like ExperimentHub and existing BioC infrastructure could play a huge role in making this happen.)
2019-08-04
Vince Carey (11:06:24): > What can we do to help create/improve scalable open access to public data relevant to these comments?@Tim TricheIIRC you have some epigenetics data that are extremely large and for which HSDS might be an easy first stab at a solution, in which it would be straightforward to produce jupyter notebooks in HDF Kita Lab, and eventually Rmd or workflow packages showing how to interrogate/retrieve the data. Is there still an embargo issue with this dataset? Let’s make a plan in a github repo?
John Readey (20:56:57): > Hey @Tim Triche - Let me know if you’d like my help in getting the data set up in Kita Lab….
2019-08-05
Tim Triche (10:12:14): > @John ReadeyI have it in S3! Can we use that instead of Kita Lab?
Tim Triche (10:12:48): > Also,@Aaron Lunhas brought up some points about distribution/access that maybe need thinking
Tim Triche (10:13:44): > Also also, I’ve done a lot of realize()’ing of matrices for e.g. the MOFA package and wonder if it might be possible to create efficient accessors for HSDS/restfulSE/bigQuery storage that finesse the impedance mismatch
Tim Triche (10:14:53): > there are “holes” in almost all real data (e.g. kids with miRNA/mRNA/DNAme/WGS but no clinical vs. clinical/miRNA/DNAme/WES and no mRNA, or no karyotype, or what have you… the obscene depth of COG CDEs is perfect for determining what factors may be amenable to imputation in a clinically relevant advisory context, and how to make that go)
Tim Triche (10:15:14): > there are certainly holes in the TALL, MPAL, and TpAML data
Tim Triche (10:16:04): > and there is extensive motivation for determining how well they can be “filled” as well as where the most informative “hole-filling” can only be accomplished by ascertaining more samples. Would be an interesting 1) demo and 2) get-the-kinks-out exercise.
Tim Triche (10:16:51): > @John Readeyin any event, I finally harmonized everything that I personally am capable of harmonizing, and can dump the real work on my computational biologist now that I’m just a useless PI.
Vince Carey (10:49:55): > @Tim Trichethe concept of “hole” you mention seems well-accommodated by the MultiAssayExperiment design. MAE with HSDS or hybrid back end is not a problem. Should we have a google hangout some time this week with screenshares to get a handle on this? CCing@Levi Waldron
Tim Triche (10:58:11): > yes
Tim Triche (10:58:17): > and the MAE is what I’ve been putting them in
Tim Triche (10:58:37): > but MOFA (for example) doesn’t seem to register what the sampleMap is showing w/r/t complete cases etc.
Tim Triche (10:58:44): > so maybe it’s more of a MOFA issue than anything else
Tim Triche (10:58:46): > regardless, yes
Tim Triche (10:59:28): > incidentally (and this came up with@Stephanie Hicksnot long ago) figuring out the best way to structure HDF5-backed MAEs and merges of HDF5-backed SEs is another interesting topic to me.
John Readey (13:47:27): > Re: “holes” - are we talking about multi-dimensional datasets where some regions are un-initialized (or zeroed-out)? HDF5 deals with this very well if compression is enabled. Sparse areas get compressed away to almost nothing. There are also various explicit sparse representations which have pros and cons vs just using compression.
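A rough, unbenchmarked sketch of the compression point, assuming HDF5Array's writeHDF5Array() (its level argument is the gzip level): a mostly-zero matrix written as a dense, chunked, compressed dataset occupies a tiny fraction of its in-memory size.

library(HDF5Array)
m <- matrix(0, nrow = 1e5, ncol = 50)
m[sample(length(m), 1000)] <- rpois(1000, lambda = 10)   # a few non-zero counts
h5file <- tempfile(fileext = ".h5")
writeHDF5Array(m, filepath = h5file, name = "counts",
               chunkdim = c(1e4, 50), level = 6)         # gzip-compressed chunks
file.size(h5file) / (8 * length(m))                      # far below 1 for sparse data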
John Readey (13:49:12): > For trying out some things with KitaLab/HSDS, if the data is already on S3 in HDF5, we can just link to it (rather than having to import all the data).
Tim Triche (14:01:53): > perfect, let’s do that. And yes the “holes” are both 1) NAs in the original and 2) probably predictable from non-holes in other cases
John Readey (14:24:54): > Cool, let’s try it out. If you can send me some s3 links to files you have now, I can add to KitaLab.
John Readey (14:25:13): > It’s easiest if the files are in a public read bucket.
Tim Triche (14:33:19): > OK. Let me clean up any nonpublic clinical covariates and I can just copy them over with the aws cli. May need to public-ify some bits.
John Readey (14:34:35): > If you are setting up a new bucket for this, using the us-west-2 region (same region as the KitaLab server) will be optimal.
Tim Triche (15:26:17): > OK will do
Stephanie Hicks (20:43:32): > Quick update on my processing 44 WGBS data sets (bigwig) and converting to one BSseq object (https://community-bioc.slack.com/archives/C35BSB5NF/p1562789992004800). This was successful! I wanted to ask, is anyone interested in the dataset? If so I can work on submitting to ExperimentHub if it would be useful to others. Data are from the BLUEPRINT consortia (https://community-bioc.slack.com/archives/C35BSB5NF/p1563414470004700). There are six purified whole blood cell types (similar to the FlowSorted.Blood.450k data measured on the 450K array, except measured on the WGBS platform and with uneven sample sizes): > > > table(pheno_table$cell_type) > > Bcell CD4T CD8T Gran Mono NK > 4 12 4 14 6 4 >
- Attachment: Attachment > question. I have 44 WGBS samples from purified whole blood cell types (6 cell types). They are bigwig files. I’m trying to identify the best way to work with these incredibly large files (>25 million rows per sample). I can create individual bsseq objects out of each one (either in memory or on disk with HDF5). I can combine the bsseq files into one bsseq object and keep the data in separate HDF5 files or resave into all 1 HDF5 file. But ultimately I want to do the usual and run dmrFinder() across the 6 cell types. Any advice from the experts in the #bigdata-rep channel on best way to create the object in R? - Attachment: Attachment > @Vince Carey also, fwiw the WGBS data (bigwig) I’m working with is part of the BLUEPRINT consortia available here: http://dcc.blueprint-epigenome.eu/#/datasets (bigwig is the only option available)
Kasper D. Hansen (22:06:20): > This might be useful
Kasper D. Hansen (22:06:29): > But thats a pretty vague statement
2019-08-06
Tim Triche (08:34:29): > yeah I’d use it
Tim Triche (08:34:46): > although I’d use it more if it was fetal liver vs. adult bone marrow (same project)
Tim Triche (08:34:54): > but then again I already have that one on disk, so
Tim Triche (08:35:06): > yours is probably better processed
Stephanie Hicks (11:33:49): > ha I wouldn’t necessarily say that.
Stephanie Hicks (11:34:07): > Actually @Tim Triche maybe you can help me figure out a problem i’m having with the qc
Stephanie Hicks (11:35:57): > I have one HDF5 file containing two datasets: cov (coverage) and meth (number of methylated reads). Then there is one GRanges object (gr) saved as .RDS. The gr object is just read into memory. From there, it’s easy to create the BSseq object: > > hdf5_cov <- HDF5Array(filepath = hdf5_bs_path, name = "cov") > hdf5_meth <- HDF5Array(filepath = hdf5_bs_path, name = "meth") > bs <- BSseq(gr = gr, M = hdf5_meth, Cov = hdf5_cov) >
> What I’m now having trouble with is trying to save the entire BSseq object with saveHDF5SummarizedExperiment(bs, hdf5_bs_se_path, verbose = TRUE). Or even more simply, I’m having trouble realizing a smaller hdf5_cov file on disk after filtering the rows, e.g. > > keep_ids <- which(DelayedMatrixStats::rowSums2(hdf5_cov==0) == 0) > hdf5_cov <- hdf5_cov[keep_ids,] > hdf5_cov_sub <- writeHDF5Array(x = hdf5_cov, filepath = hdf5_bs_sub_path, > name = "cov", verbose=TRUE) >
> There are 29 million rows in hdf5_cov. The keep_ids gets me down to about 6 million. But when I try to realize this on disk, it just gets stuck at “Realizing block 1/44 ...”. I’ve let it run all night and the process is still going, but I find it hard to imagine it takes 8+ hours to save one block? I tried to take just a random sample of 10e3 and 50e3 rows, and that took about 1 min and 6 mins respectively, so I know the code is working.
Stephanie Hicks (11:38:02): > @Keegan Korthauer@Kasper D. Hansen@Peter Hickey@Mike Smithif you have thoughts on what i’m doing wrong, I would welcome your input.:slightly_smiling_face:
Tim Triche (12:12:45): > oh gross – I’ll look and see how we finessed this in biscuiteer
Stephanie Hicks (12:14:12): > thanks@Tim Triche
Tim Triche (12:27:39): > I had this issue in the past when block size was too big – there are some (probably shitty) workarounds in the miser code for dealing with this in HDF5-backed GenomicRatioSet objects and so forth (for calling DMRs and crap like that)
Tim Triche (12:28:22): > hadn’t run into the issue recently with bsseq objects (even when summarizing 60 or so WGBS runs across about 10 million disjoint DMR-lets)
Stephanie Hicks (12:32:52): > What were the block sizes for the 60 WGBS samples with 10 million rows?
Tim Triche (12:45:38): > let me look (may still have that screen session up)
Tim Triche (12:50:04): > > R> POETIC <- loadHDF5SummarizedExperiment("~/POETIC/HDF5/POETIC.HDF5/") > R> DelayedArray:::set_verbose_block_processing(TRUE) > [1] TRUE > R> getAutoBlockSize() > [1] 100000000 >
Tim Triche (12:50:21): > that machine has 512GB of RAM though, so… ?
Tim Triche (12:50:44): > (had to reload the object but I ran defaults for that job)
Tim Triche (12:51:22): > > R> POETIC > An object of type 'BSseq' with > 26747934 methylation loci > 67 samples > has not been smoothed > Some assays are HDF5Array-backed >
Tim Triche (12:51:38): > I lied, there are 67 samples not 60. A bunch are cell-free though
Tim Triche (12:54:01): > Anyways, I ran the following to summarize it:
Tim Triche (12:58:03): > > R> DMRs <- readRDS("~/POETIC/DMRs/reprocessed_disjoint_DMRsBySubject.rds") > R> length(unlist(GRangesList(DMRs)))/1e6 > [1] 14.858232 > R> system.time(lapply(DMRs, function(x) biscuiteer::summarizeBsSeqOver(POETIC, x))) >
Tim Triche (12:58:15): > (will update in a few minutes ;-))
Kasper D. Hansen (13:10:16): > This one@Peter Hickeymight know the answer to immediately
Stephanie Hicks (13:12:39): > @Kasper D. Hansen — thanks! I’m hopeful this is an easy fix. On the other hand, the writeHDF5Array help file says > > “Please note that, depending on the size of the data to write to disk and the performance of the disk, writeHDF5Array can take a long time to complete” > … but >8 hours to write one block seems like I have to be doing something wrong?
Stephanie Hicks (13:12:55): > on the other hand, i’m completely stumped on what to try next
Tim Triche (13:13:13): > decrease the block size and turn on verbose processing?
Stephanie Hicks (13:17:04): > oh sorry — I just re-read your statement above@Tim Triche“I had this issue in the past when block size was too big”. The first time I read that i saw “chunk size”.
Stephanie Hicks (13:17:44): > so i’m familiar with chunk sizes, and vaguely familiar with block sizes. Could someone clarify the difference?
Stephanie Hicks (13:19:10): > I know you can change the block sizes > > DelayedArray:::set_verbose_block_processing(TRUE) > getAutoBlockSize() > setAutoBlockSize(1e6) >
> Is it that all the chunks in a given block are processed at the same time? So even if I make the chunk size smaller, it’s the block size that matters the most?
Kasper D. Hansen (13:19:42): > chunk is a HDF5 concept. Block is DelayedArray. In principle they are orthogonal
Stephanie Hicks (13:21:31): > @Kasper D. Hansenand in a given block size (e.g. 1e6), is that how many rows are being processed at a given time? or is it a two dimensional concept?
Kasper D. Hansen (13:22:39): > I think blocksize is univariate, and is meant to represent the total memory. Not sure how it translates to rows and cols
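For what it's worth, a rough sketch of the distinction being discussed, assuming Stephanie's hdf5_cov HDF5Array from above: the DelayedArray block size is a single byte budget for how much data is processed in memory at a time, while the HDF5 chunk is a fixed two-dimensional tile geometry baked into the file when it was written.

library(DelayedArray)
library(HDF5Array)
getAutoBlockSize()        # bytes of data loaded per block (1e8 by default)
setAutoBlockSize(1e7)     # smaller blocks: less memory per block, more blocks; the file is untouched
chunkdim(seed(hdf5_cov))  # the on-disk tile geometry, fixed when the dataset was written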
Tim Triche (13:35:32): > I definitely decreased the block size at various points in the past
Tim Triche (13:35:49): > from miser::fixNAs: > > message("Checking for NAs (this can take quite a while if HDF5-backed)...") > DelayedArray:::set_verbose_block_processing(verbose) > setAutoBlockSize(1000000) > t1 <- system.time(naFrac <- DelayedMatrixStats::rowSums2(is.na(getBeta(x)))/ncol(x))["elapsed"] > if (verbose) > message(sprintf("Computed NA fractions in %.1f seconds", > t1)) >
Tim Triche (13:36:15): > (and that’s not even the imputation, which ended up happening separately)
Tim Triche (13:36:38): > in biscuiteer we impute on-the-fly with Laplace smoothing, but miser is for big piles of arrays
Tim Triche (13:37:10): > anyways, try turning on verbose processing and seeing how long an arbitrary block of a million values takes to write out.
Stephanie Hicks (13:37:49): > i’m re-reading through @Peter Hickey’s wonderful Bioc workshop and I love how he says “The documentation on this topic is a little sparse, but some details can be found in help("block_processing", "DelayedArray")” (https://github.com/PeteHaitch/BioC2019_DelayedArray_workshop/blob/74e6acdb7789bdb2c5950faacb305f73cb8a1e31/vignettes/Effectively_using_the_DelayedArray_framework_for_users.Rmd#L543):joy:
Kasper D. Hansen (13:37:52): > One issue is multiple parsing of the same chunk in case block and chunk don’t match up, I would say
Tim Triche (13:38:01): > (The DMRs are still summarizing, FWIW; I remember this taking around 30 minutes last time)
Tim Triche (13:38:31): > it should probably bother me more that there are so many :::s in my code
Stephanie Hicks (13:38:37): > so should chunk and block be the same@Kasper D. Hansen?
Stephanie Hicks (13:38:53): > haha@Tim Triche
Stephanie Hicks (13:39:17): > also how can I turn on parallelization with block processing?
Kasper D. Hansen (13:39:20): > Isn’t chunk two dimensional and block is one dimensional? So block is something that gets added outside of HDF5
Kasper D. Hansen (13:39:41): > Think of a different file format without chunks, say bigWig
Stephanie Hicks (13:39:51): > the help file is in fact a bit sparse…:wink: - File (PNG): Screen Shot 2019-08-06 at 1.39.27 PM.png
Tim Triche (13:39:58): > hey, this reminds me, I need to bother Rob Scharpf about ff and VanillaICE (speaking of horrendous debugging and out-of-core)
Kasper D. Hansen (13:40:00): > You want to be able to say “Don’t read the entire dataset into memory”
Stephanie Hicks (13:40:16): > chunk is definitely two dimensional
Stephanie Hicks (13:40:22): > i’m less familiar with the concept of blocks
Kasper D. Hansen (13:41:25): > It’s pretty clear to me, in a vague sense, that the two things need to be matched in some integer-multiple way for everything to work as well as possible. But I am not sure how that can be enforced
Stephanie Hicks (13:41:33): > :disappointed:
Stephanie Hicks (13:41:45): > i agree, but i feel so clueless on how to make that happen.
Kasper D. Hansen (13:41:46): > You could for example have a block containing X (integer) chunks
Stephanie Hicks (13:42:06): > i’ve been going at this for 2 days and I don’t feel any closer to a working solution.
Kasper D. Hansen (13:43:00): > It is pretty frustrating. And it can be pretty fast, but there are a 1000 non-obvious ways to break it
Stephanie Hicks (13:43:43): > it doesn’t help i’m up against a deadline to respond to reviewers either…
Stephanie Hicks (13:43:59): > anyways, I’ll wait until@Peter Hickeyresponds. hopefully he has some insights
Tim Triche (13:44:04): > are there any other reasons to get things done?:wink:
Tim Triche (13:44:13): > (i.e. as opposed to deadlines)
Tim Triche (13:44:32): > rumor has it that 99% of hematology research gets done in the week before ASH abstracts are due
Stephanie Hicks (13:45:11): > lol
Tim Triche (14:05:18): > oh hey the overlaps finished. > > R> system.time(lapply(DMRs, function(x) biscuiteer::summarizeBsSeqOver(POETIC, x))) > user system elapsed > 3597.484 373.352 3976.580 >
> about an hour
Tim Triche (14:05:55): > granted that’s 24 matrices with several million DMRs apiece, but anyways. Not 8 hours.
Stephanie Hicks (14:06:16): > dang ok i’m going to reduce block size and see what happens
Tim Triche (14:06:29): > I wonder if it would have gone 24 times faster in parallel:wink:
Tim Triche (14:06:50): > Will end up asking you for the Blueprint BSseq object eventually for sure:slightly_smiling_face:
Stephanie Hicks (14:07:27): > i feel like i’ve got this great resource, but at the moment i’m incapable of doing almost anything with it
Stephanie Hicks (14:07:39): > happy to share the HDF5 files though
Stephanie Hicks (14:07:47): > @Vince Careyalso shared some interest
Stephanie Hicks (14:37:08) (in thread): > @Aaron Lunthis is fixed (almost). changes have been pushed to github, but I need@Kasper D. Hansenor@Davide Rissoto push to bioc
Tim Triche (14:52:32): > yeah if we can straighten out the usage patterns for HDF5 chunking/blocking so that it works over the wire, it will be sick
Tim Triche (14:52:55): > I have to recopy some HDF5-backed MultiAssayExperiments to wash off some COG CDEs and will put them up on Amazon S3 today
Tim Triche (14:53:36): > several people will go and turn them into stuff that we can regress on outcomes (baby steps, eventually would be nice to have a “do your own goddamned pediatric risk stratification within a complex trial design” vignette:wink:)
Jayaram Kancherla (16:06:32) (in thread): > Hi@Stephanie Hicks, I’m interested in using this dataset for a project I’m working on.@hcorradaand I are working on a system to interactively query and compute analysis directly over files. This would be a perfect use case for our system.
Stephanie Hicks (16:10:32) (in thread): > ok thanks@Jayaram Kancherla— I will work on getting this submitted to ExperimentHub then
Vince Carey (16:56:15): > @Stephanie Hicksif you can make the current hdf5 available in an S3 bucket i think we can make some progress.
Peter Hickey (19:07:56): > if you need something that works on a deadline and have enough memory: > > cov <- as.matrix(hdf5_cov) > keep_ids <- !matrixStats::rowAnys(cov, value = 0) > cov <- cov[keep_ids,] > cov_sub <- writeHDF5Array(x = cov, filepath = hdf5_bs_sub_path, > name = "cov", verbose=TRUE) >
Peter Hickey (19:10:15): > i’m writing a workflow on large BSseq analysis which will motivate me to try to actually fix these properly in bsseq/DelayedArray/DelayedMatrixStats
Stephanie Hicks (19:31:00): > thanks@Peter Hickey
Stephanie Hicks (19:31:10): > one question i had: does block processing apply to writeHDF5Array()?
Stephanie Hicks (19:32:30): > like if i use > > block_size <- 5e4 > setAutoBlockSize(block_size) > hdf5_cov <- writeHDF5Array(x = hdf5_cov, filepath = hdf5_bs_path, > name = "cov", verbose=TRUE) >
> does block processing modify the block size? or is setAutoBlockSize only for DelayedArray functions?
Stephanie Hicks (19:37:47): > another way of asking is how can I control the block size of specifically the writeHDF5Array function?
Peter Hickey (19:48:54): > block = DelayedArray-specific > chunk = HDF5-specific
Peter Hickey (19:49:11): > you can control the chunk size in writeHDF5Array using chunkdim
Stephanie Hicks (19:56:10): > ooooh so block size is completely independent to chunk size??
Peter Hickey (20:01:03): > yep
Stephanie Hicks (20:03:29): > but then why does it say e.g. “realizing block 1/44” when using writeHDF5Array?
Stephanie Hicks (20:03:54): > it sounds like it’s really chunks?
Stephanie Hicks (20:05:59): > also can you use BiocParallel with writeHDF5Array?
Stephanie Hicks (20:09:08): > (I’m sorry for all the questions too). I’m happy to move to Bioconductor Support if you prefer.
Stephanie Hicks (20:14:58): > YAY that worked. Changing the chunkdim size using > > hdf5_cov <- writeHDF5Array(x = hdf5_cov, filepath = hdf5_bs_path, > chunkdim = c(2e3, 44), > name = "cov", verbose=TRUE) >
> is now saving more blocks > > Realizing block 1/273 ... OK, writing it ... OK > Realizing block 2/273 ... OK, writing it ... OK > Realizing block 3/273 ... OK, writing it ... OK >
Stephanie Hicks (20:16:33): > btw, @Peter Hickey if you want help writing up a blogpost, i’ve literally spent the last three days digging into the tutorials / vignettes / bioc workshops. I’d be glad to help you write it up and give you my insights from a “newbie” perspective on what was most helpful / unhelpful or confusing / not confusing.
Stephanie Hicks (20:16:56): > I feel like I need to give back to the community somehow my newfound knowledge:joy::sob::joy:
Peter Hickey (21:49:46): > chunkdim controls how the data are compressed on disk in the HDF5 file. > with chunkdim = c(2e3, 44) you’ve effectively given yourself fast row-wise access using 2e3 rows at a time
Peter Hickey (21:51:16): > setting chunkdim = c(2e3, 1) would (theoretically) give you fast column-wise access using 2e3 rows at a time.
Peter Hickey (21:51:34): > then this all interplays with block size and how the underlying algorithm you’re running wants to access the data (e.g., the algorithm may require loading all columns for a subset of rows into memory)
Peter Hickey (21:51:40): > in short, it gets complicated fast:slightly_smiling_face:
Peter Hickey (21:53:41): > fwiw i go for column-wise chunking (e.g. chunkdim = c(nrow(x), 1) or chunkdim = c(1e6, 1)) for pre-QC data because most of the QC is on a per-sample basis. > Then, I re-write the QC-ed data using row-wise chunks (e.g., chunkdim = c(1e4, ncol(x))) for DMR calling because that can be done by fitting linear models to each CpG (row) of the data
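For concreteness, a minimal sketch of the two layouts Pete describes, with a hypothetical CpG-by-sample matrix x and made-up file names:

library(HDF5Array)
# column-wise chunks: fast per-sample access, suits QC done one sample at a time
pre_qc  <- writeHDF5Array(x, filepath = "pre_qc.h5",  name = "cov",
                          chunkdim = c(min(nrow(x), 1e6), 1))
# row-wise chunks: fast per-CpG access, suits DMR calling that fits a model per row
post_qc <- writeHDF5Array(x, filepath = "post_qc.h5", name = "cov",
                          chunkdim = c(min(nrow(x), 1e4), ncol(x)))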
Peter Hickey (21:57:44) (in thread): > Hmm yeah that may be confusing. > What it means is ’I’m loading block 1/44 into memory and running the delayed operations on it. Then, I’m writing it to disk (using whatever chunking strategy you asked for)”
Peter Hickey (21:59:41) (in thread): > I think you’ll find that you get the “realizing block X/Y” message even when the data aren’t in an HDF5 file nor being written to an HDF5 file. > it is just telling you that block-processing is happening, the result is being realized in-memory, and then written to whatever backend you have specified (which is explicit when using writeHDF5Array() but may be implicit when using realize() and is controlled by DelayedArray::getRealizationBackend())
Kasper D. Hansen (22:04:50): > Pete is the expert, but a few of these things I said above:slightly_smiling_face:
Kasper D. Hansen (22:05:04): > I am also very happy to help writing the workflow
Kasper D. Hansen (22:34:24): > Just goes to show I don’t give the same natural sense of authority
Stephanie Hicks (22:39:08): > ha, or my panic is starting to subside and I wasn’t carefully listening the first time:grimacing:
2019-08-14
Stephanie Hicks (22:50:55): > @Jayaram Kancherla @Tim Triche @Kasper D. Hansen I’m following up on my offer (https://community-bioc.slack.com/archives/C35BSB5NF/p1565052212069800) — I just started the process of submitting the dataset to Bioconductor (https://github.com/Bioconductor/Contributions/issues/1207). Will keep you posted once it is available. - Attachment: Attachment > Quick update on my processing 44 WGBS data sets (bigwig) and converting to one BSseq object (https://community-bioc.slack.com/archives/C35BSB5NF/p1562789992004800). This was successful! I wanted to ask, is anyone interested in the dataset? If so I can work on submitting to ExperimentHub if it would be useful to others. Data are from the BLUEPRINT consortia (https://community-bioc.slack.com/archives/C35BSB5NF/p1563414470004700). There are six purified whole blood cell types (similar to the FlowSorted.Blood.450k data measured on the 450K array, except measured on the WGBS platform and with uneven sample sizes): > > > table(pheno_table$cell_type) > > Bcell CD4T CD8T Gran Mono NK > 4 12 4 14 6 4 >
> https://community-bioc.slack.com/archives/C35BSB5NF/p1562789992004800
Kasper D. Hansen (23:02:56): > I would remove blueprint from the name
Stephanie Hicks (23:10:51): > Ah why’s that?
2019-08-15
Kasper D. Hansen (09:41:48): > Consistency with the other FlowSorted packages which are FlowSorted.TISSUE.PLATFORM
Kasper D. Hansen (09:42:22): > For CordBlood on the array platform we have two datasets (which are both generated by good groups), and I think we are using something like
Stephanie Hicks (09:43:23): > ok happy to change it. i asked on the github issue best way to make that happen. not sure if I should close current issue, change name, and open a new issue?
Kasper D. Hansen (09:43:44): > CordBlood vs CordBloodNorway and now we apparently have a CordBloodCombined
Kasper D. Hansen (09:44:15): > Which is not ideal, but I think if you put the BLUEPRINT in there, you should do it at the tissue level
Stephanie Hicks (09:44:15): > These samples contain both cord and venous blood
Kasper D. Hansen (09:44:47): > But these conventions are not written down anywhere
Kasper D. Hansen (09:45:09): > You mean blueprint has both?
Kasper D. Hansen (09:46:17): > If I was doing it for the arrays I would consider splitting them up, but in your case perhaps keep them. We still have the (again unwritten) convention that you then can select subsamples
Kasper D. Hansen (09:46:47): > So if you have a tissue with cell types A, B, C you might want to do deconvolution for a sample only containing A and B
Kasper D. Hansen (09:47:14): > In minfi::estimateCellTypes there is an argument to do this
Kasper D. Hansen (09:47:32): > Also, for example FlowSorted.Blood.450k has both FlowSorted and unsorted data
Kasper D. Hansen (09:47:52): > But anyway, we have not historically included the data generators in the package name
Stephanie Hicks (10:03:37): > that makes sense, but some more guidance (written down) somewhere might be helpful:upside_down_face:
Kasper D. Hansen (10:04:13): > yes, especially since now people are making packages and uploading them and I have no idea.
Kasper D. Hansen (10:04:24): > There is another data package which contains a fair amount of code
Kasper D. Hansen (10:04:28): > That should not happen
Tim Triche (10:04:39): > ok but more importantly can you merge in the fetal liver samples@Stephanie Hicks
Tim Triche (10:04:44): > because otherwise I have to package that one
Tim Triche (10:04:46): > :smile:
Kasper D. Hansen (10:04:56): > Perhaps liver is a different package no matter who does it
Tim Triche (10:05:12): > FL-HSPCs are the same lineage as bone marrow HSPCs
Tim Triche (10:05:16): > they migrate
Tim Triche (10:05:44): > which reminds me, perhaps packaging up the sorted HSPC fractions isn’t a bad idea either
Tim Triche (10:06:05): > from Majeti’s lab (and Feinberg’s, I guess?)
Kasper D. Hansen (10:06:06): > I think it sounds like two packages nevertheless
Tim Triche (10:06:20): > yeah as I’m thinking this through, we use them for different purposes anyways
Tim Triche (10:06:29): > other people care about B cells vs. T cells vs. NK cells
Tim Triche (10:06:47): > I care about how far along the way towards becoming any of the above a cell has made it
Kasper D. Hansen (10:07:44): > two packages doesn’t preclude combination
Stephanie Hicks (10:07:58): > just to confirm @Kasper D. Hansen, I’m going with FlowSorted.Blood.WGBS
Stephanie Hicks (10:07:59): > right?
Tim Triche (10:08:02): > of course. But naming them clearly will help, and it will reduce the size of each
Kasper D. Hansen (10:08:22): > @Stephanie HicksThat’s what I would do. Having said that, I am kind of making it up
Tim Triche (10:08:35): > suppose you go for semver like TxDbs
Stephanie Hicks (10:08:41): > @Lori Shepherdhas told me how to proceed with the github issue. I would prefer to not do this twice:grimacing:
Tim Triche (10:08:45): > Thing.What.How.Whence
Tim Triche (10:09:07): > TxDb.Hsapiens.UCSC.hg19.knownGene
Kasper D. Hansen (10:09:31): > Fair enough
Tim Triche (10:09:44): > actually
Kasper D. Hansen (10:09:48): > We don’t need the genome for the arrays since that is kind of stored in the assay
Tim Triche (10:09:50): > there’s the call today at noon
Tim Triche (10:10:09): > right, but my point is, would it be possible to have a semantically consistent standard for naming such packages?
Tim Triche (10:10:17): > so that it is easily searchable in ExperimentHub
Tim Triche (10:10:23): > (going to look at the GitHub issue now)
Kasper D. Hansen (10:10:53): > I don’t think there is a need to have the same syntax across different Things
Tim Triche (10:11:02): > I’m going to disagree, and here’s why
Kasper D. Hansen (10:11:18): > Ok, tell me and I’ll shoot it down
Tim Triche (10:11:27): > FlowSorted.Blood.WGBS.BLUEPRINT makes sense
Tim Triche (10:11:29): > because
Tim Triche (10:11:38): > you could just as easily add FlowSorted.Blood.WGBS.Hodges
Tim Triche (10:11:43): > FlowSorted.Blood.WGBS.WashU
Stephanie Hicks (10:11:56): > :point_up_2:that was why i originally added BLUEPRINT (thinking others might generate similar data)
Tim Triche (10:11:58): > YES
Kasper D. Hansen (10:12:27): > Ok, that point is hard to argue with
Tim Triche (10:12:28): > Others already have:slightly_smiling_face:
Tim Triche (10:12:31): > many many many of them
Tim Triche (10:12:43): > FlowSorted.HSPCs.WGBS.WashU
Kasper D. Hansen (10:13:08): > But then we should also try to change some of the existing FlowSorted package names, where we already have this issue
Tim Triche (10:13:27): > FlowSorted.HSPCs.HumanMethylation450.Stanford > FlowSorted.HSPCs.WGBS.Stanford > FlowSorted.Blood.WGBS.REMC > etc.
Tim Triche (10:13:53): > So
Tim Triche (10:14:11): > that’s why I brought up the question of “is there a readily agreeable way to name such packages” as a Q for today’s call
Tim Triche (10:14:18): > for findability, etc.
Tim Triche (10:14:26): > even if things are just stubbed into recount or whatever
Stephanie Hicks (10:14:41): > ok maybe I will wait to change the name of the package until after the call today?
Kasper D. Hansen (10:14:42): > But you were talking about across modalities to use a carey-ism
Tim Triche (10:15:07): > hmm. I suppose that is an issue. Vince would know better. Hence my impulse to ask people who know better than I do.
Tim Triche (10:15:32): > There are a LOT of WGBS datasets that I would dearly love to see in ExperimentHub
Tim Triche (10:15:57): > all of the rare tumors, for example (medullo, rhabdoid, DSRCT, etc. etc.) … needless to say I’d be delighted to contribute, coauthors willing
Tim Triche (10:17:22): > (that’s one reason why POETIC is already an HDF5-backed object:slightly_smiling_face:and most of REMC is in a similar object)
Tim Triche (10:19:00): > we have been trying to figure out if a compact read-level SNP-level representation exists that is broadly useful (e.g. for comparing with ONT results) but that’s future work IMHO. HDF5 matrices of M (T) and Cov (C+T) is plenty for now
Tim Triche (10:19:37): > also, for arbitrary HDF5-backed SEs that need to be “soft anonymized”, I implemented something recently
Tim Triche (10:20:25): > “soft anonymization” for large deidentified-but-only-by-NIH-standards datasets - File (R): rehydrateSummarizedExperiment.R - File (R): dehydrateSummarizedExperiment.R - File (R): indigestion.R
Tim Triche (10:21:41): > every time a new set of eyeballs looks at it, a different fix gets requested, but it is converging towards something useful as a data intermediary for e.g. HSDS and just dumping .h5 files onto S3
Tim Triche (10:22:34): > @Stephanie Hicksthe above came about partly as a response to the push for “federating” federable data sources at CCDI
Kasper D. Hansen (10:23:19): > Let me explain what my issue is, because I don’t think it is a good topic for a dev call
Kasper D. Hansen (10:24:11): > I fail to see the need to maintain a consistent naming convention across completely different resources, like say a genome vs a flowsorted data package
Kasper D. Hansen (10:24:36): > We could have a consistent order
Kasper D. Hansen (10:24:58): > But really inside of ExperimentHub the hope is that not everything is contained in the package name
Kasper D. Hansen (10:25:25): > I think for the FlowSorted suite of things, the FlowSorted.TISSUE.PLATFORM.SUPPLIER is a good idea
Kasper D. Hansen (10:25:59): > It’s going to require some work to fix up existing incompatibilities and I need a new argument in estimateCellTypes, but it is worth it
Kasper D. Hansen (10:26:24): > The other issue that is worth thinking about here, is the issue of file format
Kasper D. Hansen (10:27:08): > I guess we are going with the SE in HDF5 for the WGBS data?
Kasper D. Hansen (10:27:42): > or perhaps rather the BSseq-derived SE
Kasper D. Hansen (10:27:57): > I think that’s what @Stephanie Hicks is using?
Stephanie Hicks (10:28:12): > yes, I created a BSseq object and saved it with saveHDF5SummarizedExperiment
Tim Triche (10:28:22): > one other thing
Tim Triche (10:28:29): > that I’ve been dealing with lately
Kasper D. Hansen (10:28:48): > This goes a bit against the other discussion we have had in the project of moving away from serializing classes. Now HDF5-backed stuff is a bit different
Tim Triche (10:28:52): > if you default to HDF5 backing (which I think is reasonable), subclasses like GenomicRatioSet get squashed
Tim Triche (10:29:16): > this isn’t a show stopper, but has required some nasty hacks to make CpGcollapse and friends work
Kasper D. Hansen (10:29:33): > Somehow I think that the format should be PLATFORM dependent
Stephanie Hicks (10:29:36): > ok, i’ve got a meeting for the next hour. will check back in at 11:30
Tim Triche (10:29:46): > :wave:
Tim Triche (10:31:34): > @Kasper D. HansenI used to think that, but with so many of our MultiAssayExperiments already stored in HDF5-backed SE-like objects, I’ve started to change my mind. The reality of HSDS making ultra-light-weight access possible via restfulSE is also attractive – if you only need 50kb of one assay and 3 transcripts of another, across 5000 samples, does it make sense to pull the entire 20GB+ dataset for each?
Tim Triche (10:33:05): > The wrapper data structure (e.g. BSseq or some flavor of SE) is one thing, I totally agree with that, but I’ve started to feel like data storage going forward is converging upon HDF5 for big datasets.
Tim Triche (10:33:18): > I could be dead wrong though.
Kasper D. Hansen (10:33:22): > Im not convinced HDF5 is the future
Tim Triche (10:33:36): > Fair enough; do you see alternatives that make sense?
Kasper D. Hansen (10:33:36): > I see a lot of limitations with it
Tim Triche (10:33:45): > We went through this with e.g. GA4GH API design
Tim Triche (10:34:14): > first people wanted to use Parquet, then protobufs, etc.
Tim Triche (10:34:26): > Ultimately that got in the way of deploying something that worked
Tim Triche (10:34:48): > so I’m not opposed to shelving any format standardization
Kasper D. Hansen (10:35:12): > I have just been really surprised at some of the HDF5 limitations
Kasper D. Hansen (10:35:26): > For something I thought was a mature HPC format
Tim Triche (10:35:43): > I wish I were surprised. I worked with too many salty physicists in a past life, though:slightly_smiling_face:
Tim Triche (10:36:03): > Feather is one alternative
Kasper D. Hansen (10:36:20): > The problem is we need real-life use cases
Kasper D. Hansen (10:36:29): > which requires a lot of work
Tim Triche (10:37:15): > well, here’s one – coauthors would like to query across CNA, DNAme, RNA, and SVs to see the impact of certain things upon antigen expression in data from existing clinical trials, in order to plan new ones
Kasper D. Hansen (10:37:29): > usecase was the wrong word
Tim Triche (10:37:32): > (i.e. “here is why I was hacking on CpGcollapse to work with HDF5-backed SEs”)
Kasper D. Hansen (10:37:37): > We need actual use
Tim Triche (10:37:57): > see above. That particular one has been in use for years and I’m just updating it
Kasper D. Hansen (10:38:35): > It is pretty clear to me that performant code using HDF5 (and DelayedArray, which I am also not fully convinced about) requires a complete code rewrite and sometimes a slightly different algorithm
Tim Triche (10:38:41): > true
Kasper D. Hansen (10:38:44): > Thats a pretty big price to pay for a file format
Tim Triche (10:38:44): > on both counts
Tim Triche (10:39:00): > It is, although I have trouble thinking of an alternative that doesn’t exact the same price
Kasper D. Hansen (10:39:02): > It better be not just better, but heavenly
Kasper D. Hansen (10:39:13): > That is true
Kasper D. Hansen (10:39:24): > But that also means it is a really really big decision
Tim Triche (10:39:30): > I was debugging some ff-based code recently and wanted to kill myself
Kasper D. Hansen (10:39:52): > I have not used ff extensively, but debugging DelayedArray ….
Kasper D. Hansen (10:40:07): > Its amazing when it works though
Kasper D. Hansen (10:40:25): > But if we standardize on HDF5, do we need the DelayedArray layer?
Mike Smith (10:41:05): > Isn’t there still a cost saving if you have multiple operations?
Kasper D. Hansen (10:41:51): > DelayedArray is supposed to buy us backend-independence. But I am not convinced we get that. My current “feeling” (and I stress this is a feeling) is that switching backend will require another re-write
Tim Triche (10:41:53): > which cost and which ops
Kasper D. Hansen (10:41:55): > What do you mean, Mike?
Tim Triche (10:44:18): > (also@Mike Smithper my email, is there / are there agenda[s] for these calls?)
Mike Smith (10:44:26): > Perhaps I’ve jumped in without fully grasping the conversation, but I thought one of DelayedArray’s benefits is that we can stack operations, so when you work on a disk-based backend it can apply them all in one go to the block.
Kasper D. Hansen (10:44:45): > Yes, that is theory
Kasper D. Hansen (10:44:49): > Is it practice?
Mike Smith (10:44:51): > You’d lose that if working purely with HDF5 right?
Kasper D. Hansen (10:45:43): > So the issue I face in my experience trying to optimize, is that I still need to think quite deeply about order of stuff, how I iterate and etc etc
Kasper D. Hansen (10:46:03): > So I don’t get it in a “fire-and-forget” way where I can code and not worry about it
Kasper D. Hansen (10:46:49): > In fact, my experience is that I start with nonDA code. I switch it immediately to DA. It gets like 1000x slower. Then I optimize. And eventually I can make it very fast. But I need to be super careful
Tim Triche (10:47:11): > per Kasper – “yes but” – so many operations in R expect realized pass-by-value semantics
Tim Triche (10:47:19): > and I think that bites us in the ass on a regular basis
Kasper D. Hansen (10:47:20): > and two versions of the same line of code can look similar in principle but have 10-1000x speed difference
Tim Triche (10:47:34): > it’s like going back to C
Kasper D. Hansen (10:47:34): > So I really have to think and experiment
Tim Triche (10:48:07): > If there was a set of heuristics for making DA code performant, that would be super helpful
Kasper D. Hansen (10:48:13): > Yes, we need that
Kasper D. Hansen (10:48:18): > We need to share lessons learned
Tim Triche (10:48:18): > it’s so germinal right now (as the saga of this data package illustrates!)
Tim Triche (10:48:40): > you and Pete and Stephanie and I at least have some battle scars doing this
Kasper D. Hansen (10:48:42): > and we need to experiment further. Perhaps as we gain experience we will converge to a happy state
Kasper D. Hansen (10:48:50): > I am just not fully convinced about it
Tim Triche (10:48:57): > it sounds great in the grant proposals and papers but we know better w/r/t elbow grease:slightly_smiling_face:
Tim Triche (10:49:16): > hey, look on the bright side, at least it isn’t GDC:smiling_imp:
Tim Triche (10:49:49): > (previous code is my “ghetto HDF5-backed GDC” implementation, btw)
Kasper D. Hansen (10:50:05): > It just feels like a super black box when you code.
Kasper D. Hansen (10:50:32): > When you use DA there are a lot of layers between your code and what gets done. It makes it really hard to get
Tim Triche (10:50:39): > without a doubt. But, to some extent, a lot of the abstraction layer and shallow reference class usage in SEs is dark magic to anyone expecting regular R semantics.
Tim Triche (10:51:18): > abstraction exacts a price. (I still think it’s more transparent in many respects than, say, using environments, which itself was a hack to get around slow fat formats)
Kasper D. Hansen (10:51:35): > I mean, I think we need to continue to learn. But so far my experience has made me more sceptical, not less
Kasper D. Hansen (10:52:35): > I am however also beginning to appreciate how many moving parts are involved if you want scalable out-of-memory performance, no matter whether you use DA/HDF5 or not. Its not like any solution is going to be easy
Mike Smith (10:52:38) (in thread): > Today is going to be a bit freestyle since I only just got back from holiday & didn’t chase people while i was away. I’m going to present something on biomaRt in the style I envisioned for the calls, and then we’ll see what happens
Tim Triche (10:52:40): > Suppose we agree upon the following: > 1) HDF5 is a PITA to treat as if standard in-core R data. > 2) Just about any out-of-core format is likely to have similar issues.
Tim Triche (10:52:50): > well, it looks like we agreed upon that in realtime:slightly_smiling_face:
Tim Triche (10:53:24): > The only major advantage I see for HDF5 is its ecosystem of tools (libraries, Python support, HSDS support, etc.)
Kasper D. Hansen (10:53:49): > But there are also weird things in HDF5
Tim Triche (10:53:56): > But that’s a pretty major advantage. Also, for pure data access, Vince has implemented both HDF5/HSDS and BigQuery.
Kasper D. Hansen (10:53:58): > I think there are some concurrency issues
Kasper D. Hansen (10:54:06): > No real sparse format
Kasper D. Hansen (10:54:13): > (thats a big one IMO)
Tim Triche (10:54:20): > +1000 yes
Tim Triche (10:55:09): > when I asked whether SE could abstract away the data storage format (careful what you ask for, Martin can implement ANYTHING), my original intent was to use Matrix and bigMemory as backends
Tim Triche (10:55:27): > Now I realize that what I wanted, most of the time, was a bigSparseMatrixOnDisk
Tim Triche (10:55:56): > a bigSparseMatrixOnAmazon’sDisksThatICanQueryPiecemeal
Tim Triche (10:56:20) (in thread): > :thumbsup:
Tim Triche (10:56:38) (in thread): > you can perhaps see why I asked given the present conversation about HDF5 and ExperimentHub though:wink:
Mike Smith (10:57:14) (in thread): > This gives me the perfect ammunition to shoot that down after 3 minutes & point you back here!
Tim Triche (10:58:33) (in thread): > you magnificent bastard
Tim Triche (11:01:45): > another Stupid Question (I’m good at these), this time for@Levi Waldron– are there hooks for viewing MultiAssayExperiment objects in the style of iSEE? We are using iSEE for some scRNAseq data and it’s fine, but the trouble is that we also have a bunch of other piecemeal assays on some/all of the cells.
Tim Triche (11:02:09): > It’sBig Data(tm) because the raw versions are hard to move around and iSEE tends to choke on it:smile:
Tim Triche (11:03:12): > I feel like a simple interactively-filtered upSet plot of which cells/samples/timepoints have which assays would be dope, for example, but the MAE API docs are… terse
Tim Triche (11:03:49): > (this also impacts packages like MOFA that seem to misunderstand how sampleMap/experimentList work, btw)
Vince Carey (11:09:05): > just witnessing this discussion while in other meetings. there’s a lot flying around here and a document clarifying concepts and implementation should emerge. i will try to go over this in the next couple of days.
Vince Carey (11:27:30): > For@Kasper D. Hansenit would be helpful to get clearer on “concurrency issues” and intrinsic value of sparse format vs compressed dense format, and alternatives to HDF5 that show advantages in these areas, with actual examples in genomic applications.
Stephanie Hicks (12:21:49): > ok i’m back. so where do we stand with 1) naming conventions for ExperimentHub packages? 2) format for the object?
Tim Triche (12:22:15): > 1) brawl > 2) brawl
Stephanie Hicks (12:22:18): > i read@Kasper D. Hansen’s concerns about HDF5
Tim Triche (12:22:45): > are you on the call ?
Stephanie Hicks (12:22:52): > i am
Stephanie Hicks (12:22:59): > not sure if I can stay the whole time
Tim Triche (12:23:01): > a lot of the same concerns re: IDs, etc. arise
Stephanie Hicks (12:23:09): > but i’ll ask if i’m still on the call by the time the topic comes up
Tim Triche (12:23:11): > w/r/t caching and lazy evaluation
Tim Triche (12:23:14): > :thumbsup:
Tim Triche (12:23:46): > I’m not sure that I’m the right person to bring it up.@Mike Smithsaid as much:wink:
John Readey (12:47:30): > Hey@Kasper D. Hansen- can you describe the concerns with HDF5? (for someone who is not deeply involved with the bioconductor community).
John Readey (12:47:41): > What I’ve picked up so far:
John Readey (12:48:12): > 1) concurrency is not easy (but not an issue with HSDS)
John Readey (12:48:28): > 2) No native sparse format
Tim Triche (12:53:23): > oh hey@John ReadeyI set up the S3 bucket with replication and am “washing” the last few datasets (hashing the features/samples/assays to avoid leaking data, just in case)
Tim Triche (12:53:48): > they’re going up on S3 in dumpsterfire-west.trichelab.org once DNS catches the update.
Tim Triche (12:54:17): > don’t be alarmed by the dimnames of the arrays, in other words
Tim Triche (13:00:25): > also thanks@Stephanie Hicksfor bringing that up and@Mike Smithfor organizing the call
John Readey (13:01:22): > hi@Tim Triche- great; let me know when it’s ready.
John Readey (13:01:48): > Do you have some Python codes we can use to evaluate how it works with HSDS?
Tim Triche (13:01:52): > will email you and Vince once it’s accessible and responding to DNS
Tim Triche (13:02:01): > not really although I could try and rig one up as a unit test
Tim Triche (13:02:33): > which would be a good idea anyways (since the data gets shuffled and unshuffled upon anonymization/recovery, and I have these tests implemented in R already)
Stephanie Hicks (13:02:39): > ok so we did not get to discuss the naming convention
Stephanie Hicks (13:02:54): > I need to decide a name, otherwise the github issue remains open for a month until the next#developers-forumcall
Tim Triche (13:02:57): > so you and@Lori Shepherdget to decide
Tim Triche (13:03:08): > ultimate power
Stephanie Hicks (13:03:51): > Also, does it make sense to create a BSseq object with HDF5 files for the Cov and M matrices and save with saveHDF5SummarizedExperiment, which creates the assays.h5 and se.rds objects? I assume yes?
Tim Triche (13:07:41): > is there another way that offers any advantages?
Martin Morgan (13:07:48): > Not sure if you’re asking this, but I think you should save objects in simpler representations rather than derived classes, and assemble them ‘on the fly’, so if it were a SummarizedExperiment: csv for row + col data and csv or hdf5 for assays. One could save row and col data as a (simple) data.frame in rds, but somehow one holds out the hope that ExperimentHub / AnnotationHub will eventually be interesting to people outside the R community, where rds has no meaning
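A minimal sketch of the “store simple pieces, assemble on the fly” idea Martin describes; the file names here are hypothetical, not anything already in the hubs:

library(SummarizedExperiment)
library(HDF5Array)
row_df <- read.csv("rowdata.csv")               # plain csv, usable outside R
col_df <- read.csv("coldata.csv")
cov    <- HDF5Array("assays.h5", name = "cov")  # assay data stay on disk
se     <- SummarizedExperiment(assays  = list(cov = cov),
                               rowData = row_df,
                               colData = col_df)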
Mike Smith (13:08:21) (in thread): > Sorry we ran out of time. I don’t want to speak for @Kasper D. Hansen but my reading of https://community-bioc.slack.com/archives/C35BSB5NF/p1565879125041200 is that he was convinced by the original naming - Attachment: Attachment > I think for the FlowSorted suite of things, the FlowSorted.TISSUE.PLATFORM.SUPPLIER is a good idea
Tim Triche (13:08:30): > interesting point, I had cleaved off the sample covariates in the “washing” code for the HSDS proof of concept
Tim Triche (13:10:22): > @Martin Morganspecifically, for HSDS distribution of data, the rownames get “anonymized” and rows shuffled, the sample names get “anonymized” and columns shuffled, the covariates get dropped off in a separate CSV file with their original names and a function that pastes them back onto the object at “recovery” time, and the assays get hashed but neither reordered nor otherwise mangled (don’t want to mangle the HDF5 file).
Tim Triche (13:11:27): > the motivation for this exercise was to allow for distributing not-yet-published data over S3 without worrying that it would be trivial for someone else to reidentify in-flight somehow. But since collaborators are using Python etc., I wanted the data to be separated from clinical covs etc. and that’s what it does.
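This is not the attached dehydrateSummarizedExperiment.R, just a rough sketch of the covariate-separation step described above, with digest::digest standing in for whatever hashing the real code uses (the real code also handles feature names and shuffling):

library(SummarizedExperiment)
library(digest)
dehydrate_sketch <- function(se, covariate_csv) {
  hashed <- vapply(colnames(se), digest, character(1))
  # clinical covariates travel separately, keyed to original and hashed sample names
  write.csv(cbind(original = colnames(se), hashed = hashed,
                  as.data.frame(colData(se))),
            covariate_csv, row.names = FALSE)
  colData(se) <- S4Vectors::DataFrame(row.names = colnames(se))  # strip covariates from the object
  colnames(se) <- hashed                                         # anonymize sample names
  se
}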
Stephanie Hicks (13:18:21): > @Martin Morgan I completely agree with storing data in a simpler representation (vs derived classes). One concern I had with this was that it takes ~4-5 mins to create the derived class (BSseq) from the HDF5Matrix
objects vs ~15 seconds to load in the derived class. > {gr_complete} > GRanges object with 29039352 ranges and 0 metadata columns: > seqnames ranges strand > <Rle> <IRanges> <Rle> > [1] chr1 10469-10470 * > [2] chr1 10471-10472 * > [3] chr1 10484-10485 * > [4] chr1 10489-10490 * > [5] chr1 10493-10494 * > ... ... ... ... > [29039348] chrM 16451-16452 * > [29039349] chrM 16456-16457 * > [29039350] chrM 16497-16498 * > [29039351] chrM 16544-16545 * > [29039352] chrM 16567-16568 * > ------- > seqinfo: 25 sequences from hg19 genome; no seqlengths > > hdf5_cov > <29039352 x 44> HDF5Matrix object of type "double": > [,1] [,2] [,3] ... [,43] [,44] > [1,] 8 20 3 . 0 28 > [2,] 6 24 3 . 13 28 > [3,] 9 18 0 . 4 22 > [4,] 8 17 2 . 4 25 > [5,] 8 20 3 . 15 23 > ... . . . . . . > [29039348,] 2216 2270 2497 . 1912 1332 > [29039349,] 2195 2053 2463 . 1862 1233 > [29039350,] 1542 0 1509 . 870 347 > [29039351,] 587 251 631 . 336 132 > [29039352,] 97 0 104 . 51 0 > > hdf5_meth > <29039352 x 44> HDF5Matrix object of type "double": > [,1] [,2] [,3] ... [,43] [,44] > [1,] 7 15 2 . 0 13 > [2,] 5 12 3 . 9 22 > [3,] 9 13 0 . 3 13 > [4,] 8 13 2 . 4 24 > [5,] 4 14 2 . 13 20 > ... . . . . . . > [29039348,] 2 16 77 . 25 4 > [29039349,] 0 14 71 . 24 2 > [29039350,] 2 0 75 . 10 3 > [29039351,] 0 1 22 . 1 2 > [29039352,] 0 0 12 . 0 0 > > > > # creating in BSseq object with HDF5 matrices > > Sys.time() > [1] "2019-08-15 13:11:04 EDT" > > bs <- BSseq(gr = gr_complete, > + M = hdf5_meth, > + Cov = hdf5_cov, > + sampleNames = pheno_table$sample_name) > > Sys.time() > [1] "2019-08-15 13:16:26 EDT" > > > > # loading in BSseq object > > Sys.time() > [1] "2019-08-15 13:16:40 EDT" > > hdf5_bs_se_path <- file.path(dataPath, "files_bsseq_hdf5_se") > > bs <- loadHDF5SummarizedExperiment(hdf5_bs_se_path) > > Sys.time() > [1] "2019-08-15 13:16:43 EDT" >
Stephanie Hicks (13:21:27): > I felt like someone who would work with this data in Bioconductor would likely create a BSseq object, so it would add an additional 5 mins to analysis time every time they wanted to use the data. But I see your point about people who might want to use it outside of Bioc
Stephanie Hicks (13:22:55) (in thread): > :thumbsup:ok i’ll go with this then
Tim Triche (13:27:42): > it’s not that outrageously expensive to store both, and eventually my hope is that a restfulSE-backed version will eliminate that lag (or kick the can down the line at least:wink:)
Kasper D. Hansen (13:27:56): > @Stephanie HicksI would go with your solution for now
Kasper D. Hansen (13:28:15): > It is pretty clear to me that we eventually might want to distribute it in a different way
Tim Triche (13:28:22): > people-within-BioC > people-not-within-BioC ;-D
Kasper D. Hansen (13:28:34): > It is also clear to me that this is not going to be solved in the next couple of days
Tim Triche (13:28:37): > so +1 for store-as-bsseq
Kasper D. Hansen (13:28:50): > The issue here is that the WGBS is pretty big, so conversion is currently non-trivial
Tim Triche (13:28:58): > now: bsseq. later: whatever
Kasper D. Hansen (13:29:16): > yeah, I think high chance of wanting to address this later. But that’s later
Stephanie Hicks (13:29:23): > one idea is to include a function that returns either the BSseq object or the individual HDF5Matrix objects
Stephanie Hicks (13:29:40): > and put both on ExperimentHub
Tim Triche (13:29:43): > to some degree, though, why bother?
Kasper D. Hansen (13:29:50): > One thing to bear in mind (and Martin will suggest this) is that the assembly can happen at download time, not necessarily every time you load the object
Kasper D. Hansen (13:30:05): > That’ll still be 5m, but you pay it one time
Kasper D. Hansen (13:30:55): > There’s some work on this on the single cell front, I believe this is how Aaron serves up some single cell data
Stephanie Hicks (13:30:57): > that’s a good point
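A minimal sketch of the "assemble once at download time" idea, assuming a hypothetical ExperimentHub-style loader; the hub IDs, dataset names, and helper name are placeholders, not real records:
> # hypothetical loader: pay the BSseq() assembly/validation cost once per download
> library(ExperimentHub)
> library(HDF5Array)
> library(bsseq)
> fetchWGBS <- function() {
>     eh  <- ExperimentHub()
>     h5  <- eh[["EH0000"]]                  # placeholder: cached local copy of assays.h5
>     gr  <- eh[["EH0001"]]                  # placeholder: GRanges of CpG loci
>     M   <- HDF5Array(h5, "meth")
>     Cov <- HDF5Array(h5, "cov")
>     # the ~5 min validation happens here, once, rather than at every load
>     BSseq(gr = gr, M = M, Cov = Cov, sampleNames = colnames(M))
> }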
Tim Triche (13:30:58): > > tim@tim-ThinkPad-T470:~/POETIC/HDF5/POETIC.HDF5$ h5ls assays.h5 > assay001 Dataset {67, 26747934} > assay002 Dataset {67, 26747934} >
Kasper D. Hansen (13:31:06): > But what we need are “system”s level solutions
Kasper D. Hansen (13:31:14): > And we dont have that now, and you want to move on
Tim Triche (13:31:24): > I’m not necessarily seeing the point of breaking up the .h5 file
Tim Triche (13:31:51): > although it is cool that@Aaron Lun’s transpose trick fits 1:1
Kasper D. Hansen (13:32:06): > what trick?
Stephanie Hicks (13:32:43): > @Tim Triche oh I have the Cov and M HDF5Array objects in the same .h5 file > > hdf5_cov <- HDF5Array(filepath = hdf5_bs_path, name = "cov") > hdf5_meth <- HDF5Array(filepath = hdf5_bs_path, name = "meth") >
Stephanie Hicks (13:32:58): > is that what you meant?
Kasper D. Hansen (13:33:00): > yeah I don’t see a big issue with that
Kasper D. Hansen (13:33:18): > I think Tim thought that you were suggesting to serve up 2 files
Tim Triche (13:33:21): > interesting, I was lazy and just saved it however bsseq defaults to
Stephanie Hicks (13:33:43): > oh BSseq also creates one .h5 file
Stephanie Hicks (13:33:48): > assays.h5
Tim Triche (13:33:51): > but yeah the idea was, if someone wants to use it from Python or whatever, just give them a reference to the S3 bucket path and be done with it.
Tim Triche (13:34:15): > right, > > assays.h5 >
> and then the rest in > > se.rds >
Kasper D. Hansen (13:34:59): > Eventually we would want se.rds to be a text file as well
Tim Triche (13:35:02): > I believe@Martin Morganwas saying that breaking up the data structures “stapled” to the sides of the assays (colData and rowRanges/rowData) might be desirablein the “later” category
Tim Triche (13:35:13): > @Kasper D. Hansenget out of my head
Kasper D. Hansen (13:35:46): > Anyway@Stephanie Hicksforget this. It is a current discussion and is really much larger than your package
Kasper D. Hansen (13:36:04): > Do what’s simple now and then - if we decide to change this - a lot of packages would have to be changed
Stephanie Hicks (13:36:14): > ok so I should go with FlowSorted.Blood.WGBS.BLUEPRINT and i’m going to keep the Cov and M files as HDF5 and ask the user to pay the 1-time cost of 5 mins to build the BSseq object?
Kasper D. Hansen (13:36:37): > no, just serve what you’re doing now, ie the output of saveHDF5SummarizedExperiment
Kasper D. Hansen (13:36:58): > and I guess keep the name
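For reference, the round trip being suggested, with a placeholder directory name:
> library(HDF5Array)
> # bs is the BSseq object assembled above; this writes assays.h5 + se.rds into the directory
> saveHDF5SummarizedExperiment(bs, dir = "files_bsseq_hdf5_se", replace = TRUE)
> # users then reload it in seconds, with no re-validation
> bs <- loadHDF5SummarizedExperiment("files_bsseq_hdf5_se")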
Stephanie Hicks (13:37:01): > lol you guys are comical. ok
Tim Triche (13:37:02): > solve your own use case first, if someone else wants it a different way, they’re free to open an issue:slightly_smiling_face:
Kasper D. Hansen (13:37:10): > So after all this discussion we decided to do exactly nothing
Stephanie Hicks (13:37:15): > :face_palm_star_trek:
Tim Triche (13:37:16): > not nothing
Kasper D. Hansen (13:37:24): > exactly nothing
Kasper D. Hansen (13:37:32): > oh no
Tim Triche (13:37:37): > Adverb.Tissue.How.Whence and do the simplest thing that could possibly work
Kasper D. Hansen (13:37:45): > We decided to fix the existing FlowSorted packages names
Stephanie Hicks (13:37:58): > :joy:
Kasper D. Hansen (13:38:08): > How did all of this discussion end up putting a monkey on MY back
Tim Triche (13:38:18): > no good deed goes unpunished
Stephanie Hicks (13:39:19): > I greatly appreciate your contributions@Kasper D. Hansen— I can only do what I’m doing now because of work from you,@Peter Hickey@Hervé Pagès@Mike Smithand many others
Kasper D. Hansen (13:51:04): > @John ReadeyI have a problem articulating this clearly. Partly because it is hard to know if the limitations are from the file format or from other things on top.
Kasper D. Hansen (13:51:30): > To me, not having concurrent read access seems pretty strange for a supposedly high performance file format.
Kasper D. Hansen (13:51:43): > No native sparse array is also an issue
Kasper D. Hansen (13:52:38): > But I really don’t have enough experience to fully comment on this, it is all feelings.
Kasper D. Hansen (13:53:02): > A lot of the discussion above is also about the layer we put on top of HDF5 in Bioconductor
Nicholas Knoblauch (14:00:13): > @Kasper D. Hansen, I’m not@John Readey(by any stretch) but I do some part-time consulting for HDF (and lurk the bioconductor slack). As to your points: > 1) HDF5 has super high performance parallel read (and write!) capabilities, but relies on MPI and MPI-IO. > 2) The HDF group is in the process of developing native sparse array support (and I think they’re looking for use cases)
Kasper D. Hansen (14:04:02): > A lot of us want to read parts of a file in different “things” spawned by fork. I am saying “things” because I sometimes get confused by threads and processes
Kasper D. Hansen (14:04:53): > It is great to hear that sparse arrays are being developed. From the outside I am just surprised it is not there already given that - in my understanding - HDF5 has been around for a while
Kasper D. Hansen (14:05:21): > But it’s not there, that’s ok, it’s just a heavy limitation
Nicholas Knoblauch (14:06:26): > I think with the new version of HDF5 you are able to get the “address”/size of every chunk in a dataset. Once you have those, I don’t see why you couldn’t pass those to threads/processes and read them “outside” of HDF5
Kasper D. Hansen (14:06:49): > ok
Kasper D. Hansen (14:07:18): > But that’s not a mature solution obviously :slightly_smiling_face:
John Readey (14:07:20): > Yes - this is the approach I plan to use with the S3 data that@Tim Tricheis setting up.
John Readey (14:08:57): > Goals with HSDS are to provide a new paradigm for parallel processing (not MPI based) which works well with cloud data.
Kasper D. Hansen (14:09:51): > And that sounds very very interesting
John Readey (14:09:56): > Also it’s a bit easier to innovate with HSDS since we don’t have to worry about the 20 years worth of legacy code.:slightly_smiling_face:
Kasper D. Hansen (14:10:06): > But we are also interested in “local” performance
John Readey (14:10:43): > Local performance as in running on your workstation or local performance as in running on a on prem cluster?
Kasper D. Hansen (14:11:52): > prem?
Kasper D. Hansen (14:11:56): > I would say both
Kasper D. Hansen (14:12:05): > a HPC cluster or a laptop/workstation
Nicholas Knoblauch (14:12:28): > I also think there aren’t a ton of users who would benefit greatly from thread/process level concurrency. You can issue as many read calls to the OS as you want, but the disk is going to be slower than the CPU. It certainly makes things easier from the developer perspective though
Kasper D. Hansen (14:13:15): > well, every time you grab something you also do some compute on it
Kasper D. Hansen (14:13:29): > sometimes that compute is slow
Kasper D. Hansen (14:15:04): > But I think with (for example) solid state drives and the future of many cores in each workstation, saying that you don’t benefit from more than one reader is perhaps a bit premature. But ok, you could have a single reader which distributes to multiple workers but that is also hard(er) to code at least with the tools we have
John Readey (14:15:09): > For doing small/medium scale stuff on your workstation vs big data on HSDS, I think it’s a plus that you have the option of copying files from S3 and just using the regular HDF5 library with them. Goal is to keep the API the same for local files and HSDS (not quite there with R though!)
Kasper D. Hansen (14:16:27): > That sounds like a super nice feature
Nicholas Knoblauch (14:24:07): > Another way to help the IO/compute tradeoff is with compression. HDF5’s default compression (DEFLATE) is pretty slow (and single threaded). I’ve found switching to a more modern, multithreaded compression library like zstd or blosc can give a pretty insane performance increase (at the cost of somewhat larger files)
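The IO/compression trade-off is easy to probe from R even with the stock DEFLATE filter, via the level argument of writeHDF5Array(); the toy matrix and file names below are illustrative, and BLOSC/zstd would additionally need the filter plugins discussed next:
> library(HDF5Array)
> x <- matrix(rpois(1e6, lambda = 2), nrow = 1000)          # toy count-like data
> writeHDF5Array(x, "gzip6.h5", "counts", level = 6)        # default DEFLATE level
> writeHDF5Array(x, "gzip1.h5", "counts", level = 1)        # lighter, faster compression
> file.size(c("gzip6.h5", "gzip1.h5"))                      # on-disk size cost
> system.time(as.matrix(HDF5Array("gzip6.h5", "counts")))   # read cost at level 6
> system.time(as.matrix(HDF5Array("gzip1.h5", "counts")))   # read cost at level 1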
John Readey (14:26:08): > Yeah - I was asking before about using compression vs. a custom sparse representation. Don’t think there’s been any benchmarking done though.
John Readey (14:26:51): > It is possible to use add in compression filters like BLOSC with HDF5 lib, but it’s a bit of a pain.
Kasper D. Hansen (14:27:39): > Are you reading this@Mike Smith
Mike Smith (16:25:52): > This sounds like maybe I need to dust off my old attempts at including BLOSC etc in rhdf5. I seem to remember running into issues on Windows compiling and linking/knowing the location of libhdf5, rhdf5, and the dynamic filter libraries. I’m pretty sure I had something running on Linux though
Kasper D. Hansen (16:39:28): > Reading possibly completely outdated stuff somewhere on the internet suggests that windows support may be less good
Peter Hickey (19:53:04) (in thread): > The slowness is because BSseq() does validation of the arguments. Historically (@Kasper D. Hansen will correct me) this included checking that all values in M and Cov satisfy 0 <= M <= Cov < Inf. This is fast in memory and less fast in HDF5
Peter Hickey (19:53:35) (in thread): > There’s an internal .BSseq() constructor that does little or no validity checking
Peter Hickey (19:53:54) (in thread): > so should be as fast for matrix and HDF5Matrix
Stephanie Hicks (19:59:08) (in thread): > Oh nice! I’ll check that out tonight
Stephanie Hicks (19:59:14) (in thread): > Thanks@Peter Hickey!
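Roughly the kind of validity check being described, written against the HDF5-backed objects from earlier in the thread; this is an illustration of why it is slow on disk, not bsseq’s actual code — each summary below is a full block-processed pass over the HDF5 data:
> library(DelayedArray)
> ok <- all(hdf5_meth >= 0) &&         # one pass over the methylation dataset
>       all(hdf5_meth <= hdf5_cov) &&  # another pass over both datasets
>       max(hdf5_cov) < Inf            # and one more over the coverage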
2019-08-16
Michael Lawrence (13:52:01): > @Nicholas KnoblauchI’d be interested in working with the HDF group on sparse representation use cases. Who should I contact?
Nicholas Knoblauch (13:55:49): > I think Elena Pourmal (director of technical services and operations) would probably be the best person to talk to, you can reach her atepourmal@hdfgroup.org
Michael Lawrence (14:00:54): > Great, thanks.
2019-08-19
John Readey (16:17:40): > @Michael Lawrence- also if doing something on the server side would be of interest I’d be happy to discuss - I’m atjreadey@hdfgroup.org
Michael Lawrence (18:09:43): > In terms of the server, my main concern is the efficiency of the client/server protocol. Isn’t it JSON-based right now?
Jayaram Kancherla (20:02:52) (in thread): > yes you are right, sorry i misread your reply
Jayaram Kancherla (20:53:09) (in thread): > something i’m working on is a system for querying genomic data directly from file (hosted remotely or local). I was wondering can you store indexes inside hdf5 ? This would be very helpful for queries by genomic region. We currently use the tabix index for expression matrices but these matrices can get very long (in single cell), hence using a tabix would get us the rows faster but not help much with the columns. I’d also be interested to be part of this discussion and help out in any way I can.
Michael Lawrence (22:12:53) (in thread): > I’d also be very interested in that.
2019-08-20
Vince Carey (11:33:11): > @Michael Lawrencebinary transfers can be selected … the rhdf5client does not handle this particularly well and this needs to be addressed for next release
John Readey (11:37:05): > Right, JSON or binary can be selected by using the appropriate “Content-Type” in the http header. The h5pyd client does binary by default for all data reads and writes.
Vince Carey (11:42:45): > hi@John Readey– I’d love to create a notebook in HDF kitalab to review the current situation – but the fact that it is using R 3.5.0 is an obstacle. We also need to be able to establish a persistent collection of compiled R packages if possible.
John Readey (11:52:19): > Yes we had some issues updating the user environment in Kita Lab. We’ve got some help now from someone who actually understands JupyterLab who should be able to update the environment for the JupyterLab 1.0 release and refresh the user packages. So hoping to get this out soon.
John Readey (11:52:34): > Is R 3.5.0 still the version you need?
Tim Triche (11:53:36): > hey@Vince Carey@John Readeyis there a fast matrix-digest function in HDF5? is it wrapped for HDF5Array or rhdf5?
Tim Triche (11:53:59): > I realized that if I digest the data matrices in rehash (and do NOT shuffle rows/columns), it becomes the lazy man’s synapse
Tim Triche (11:54:31): > and that way everyone knows what data release they’re using just from the data structure (& can verify if needed)
Tim Triche (11:54:50): > https://github.com/trichelab/rehash
Tim Triche (11:55:11): > I keep finding things that I wish I’d done differently the first time, but I think this is the end of the line (finally)
Tim Triche (11:55:46): > it seems to “just work” with SingleCellExperiment as well. I’m not entirely surewhy, but I’m not complaining.
Vince Carey (13:56:51): > I don’t think this exists. We have discussed it a long time ago – there may be metadata at the HSDS level that would indicate last modification date but I don’t think the content is hashed in any way. We may have to do something like this at the R level when it seems necessary. You can set the HSDS access control to read only. That, in conjunction with the last-modified metadata, could be enough to provide reasonable confidence that we know what content we are dealing with.
Tim Triche (13:59:08): > OK. Well, I just implemented hashing by assay on the way in, so at least it shouldn’t cost “as much” to validate on the way out.
Tim Triche (13:59:36): > I can patch it to make verification optional if need be.
Vince Carey (15:43:43): > @John Readeysorry to miss your question – R 3.6 is the current release and that is what we need (current release is 3.6.1 but 3.6.0 is OK)
2019-08-21
John Readey (00:56:24): > @Tim Triche- are you talking about hashing as a security thing? or a hash as checksum to verify the data hasn’t changed?
Tim Triche (09:43:14): > Checksum and fingerprint – to make sure that people can objectively compare results from a given matrix of data
John Readey (12:47:48): > On AWS S3 each object has something called the “ETag” that’s basically a MD5 digest of the object. For a bucket, AWS can provide a CSV with each key, size, and ETag. With this I imagine you could setup a process that would trigger an alert if something changed unexpectedly.
Tim Triche (14:21:36): > oh! That’s pretty much perfect
Tim Triche (14:21:56): > if the same object is stored in different buckets, will it have an identical or 1:1 comparable ETag?
Tim Triche (14:22:47): > because then I can just compute row and column hashes instead, which is probably better for piecemeal access anyway. I just want people to be able to verify precisely what they ran their analyses upon. (Chatting with a collaborator about this in my lab slack just now, as it happens)
John Readey (14:52:05): > I don’t think it’s bucket dependent.
John Readey (14:53:46): > If you are storing HDF5 files in s3, then the ETag just tells you the file got modified, but not what in the file got changed.
John Readey (14:56:17): > On the other hand, with HSDS the data is sharded, so you can see if a particular chunk (dataset subset) has been changed, but then it’s not so easy to check for any change in the file. I guess it wouldn’t be hard to do a concatenation of MD5s to get a single checksum.
Tim Triche (16:32:30): > OK perfect. Also the sharding is an extremely desirable way to handle it but exacerbates the issue of “are you SURE we are working off the same data” so column-wise and row-wise hashes per assay are better anyways. I don’t care how the data gets realized as long as, from the user perspective, it’s always the same slice (row/column/block) of data
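A rough sketch of per-assay column digests along these lines, assuming the digest package; the helper name is made up and this is not the rehash implementation:
> library(HDF5Array)
> library(digest)
> col_digests <- function(mat) {
>     # realize one column at a time and fingerprint it; slow if chunks are row-oriented
>     vapply(seq_len(ncol(mat)), function(j) {
>         digest(as.numeric(as.matrix(mat[, j, drop = FALSE])), algo = "md5")
>     }, character(1))
> }
> # e.g. fingerprints <- lapply(assays(se), col_digests)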
2019-09-05
Andrew McDavid (15:27:40): > @Hervé Pagès @Aaron Lun is there a way to coerce a TENxMatrix to a sparseMatrix without making a stop in Denseville? Or maybe DropletUtils just needs another constructor. It’s like a 6-liner to get a sparseMatrix from the 10x h5 file.
Andrew McDavid (15:28:05): > So if that’s maybe the best option I can open a PR on DropletUtils
Aaron Lun (15:30:37): > It would be straightforward to do this from C++.
Aaron Lun (15:30:58): > But I would only do that if DropletUtils owns TENxMatrix, and currently it doesn’t.
Andrew McDavid (15:33:15): > what do you think about just adding an option about the output, ie, data were in the sparse h5 format, but are represented as a dgCMatrix
Hervé Pagès (15:34:03): > I’ll add a coercion method from TENxMatrix to dgCMatrix that preserves sparsity along the way. Should not be too hard.
Andrew McDavid (15:35:15) (in thread): > If you want, here’s some FREE CODE!! > > TENxToSparseMatrix = function(h5file){ > x = rhdf5::h5read(h5file, name = '/matrix/data/') > i = rhdf5::h5read(h5file, name = '/matrix/indices') > p = rhdf5::h5read(h5file, name = '/matrix/indptr') > shape = rhdf5::h5read(h5file, name = '/matrix/shape') > # m = sparseMatrix(i = i, p = p, x = x, dims = shape) > m = sparseMatrix(i = i + 1, p = p, x = x, dims = shape) #???? seems wrong but otherwise I'm off by one. > m > } >
Aaron Lun (15:35:23): > @Hervé Pagès Should TENxMatrix live in DropletUtils? It seems like a better home for it than HDF5Array. It’s weird to have such a specific data structure, for such a specific format, living in a much more generic package.
Aaron Lun (15:35:47) (in thread): > = instead of <- ? :face_vomiting:
Andrew McDavid (15:37:10) (in thread): > I no longer believe in <- except when I’m deliberately trying to assign in an if statement. ‘< -’ vs ‘<-’ has bitten me (and I no longer use ESS)…
Hervé Pagès (15:39:40): > I guess it could. Used to live in TENxGenomics (https://github.com/mtmorgan/TENxGenomics). When TENxGenomics got abandoned I moved it to HDF5Array because nobody seemed interested in adopting it.
Aaron Lun (15:46:41) (in thread): > heresy!
Andrew McDavid (15:48:30) (in thread): > Apostates get to have all the fun though.:smirk:
Andrew McDavid (16:09:39) (in thread): > Regarding my free code, it seems I need to to add 1 to i to get my object to match the TENxMatrix, which doesn’t make any sense to me….
Hervé Pagès (16:20:36) (in thread): > The row indices in the hdf5 file are 0-based but sparseMatrix() wants them 1-based.
Hervé Pagès (16:21:44) (in thread): > Almost there (currently running R CMD check HDF5Array_1.13.6.tar.gz, will take a while…)
Hervé Pagès (16:27:25) (in thread): > Done:https://github.com/Bioconductor/HDF5Array/commit/20d007359099fe6a7582c143b085834b5155eead
Andrew McDavid (16:47:01) (in thread): > :heart_eyes_cat:
Hervé Pagès (17:08:29) (in thread): > Forgot to say: don’t try this on the 10x brain data though (EH1039 dataset on ExperimentHub). It won’t work. That’s because the Matrix package doesn’t handle sparse matrices with more than 2^31 non-zero values.
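A usage sketch of the new coercion; the file name and group below are illustrative, following the older 10x HDF5 layout where the group is the genome name:
> library(HDF5Array)
> tenx   <- TENxMatrix("filtered_gene_bc_matrices.h5", group = "mm10")
> sparse <- as(tenx, "dgCMatrix")   # stays sparse, no dense intermediate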
2019-09-07
Aaron Lun (20:03:42): > Will prepare the transition tomorrow.
2019-09-15
Aaron Lun (23:55:36): > BiocSingular now has a FastAutoParam() mode, which will switch between IRLBA and RSVD automatically depending on the matrix representation.
Aaron Lun (23:56:03): > Specifically, it will use IRLBA unless the input matrix is a DelayedMatrix without a dedicated %*% method, in which case it will use RSVD.
Aaron Lun (23:58:12): > This reflects the fact that IRLBA gives more predictable convergence but RSVD uses fewer multiplications, so we switch to RSVD to reduce the cost of block processing during DelayedMatrix %*%. However, if your DM subclass has its own %*%, it is presumed to be efficient enough to allow the use of IRLBA.
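A usage sketch, where mat stands for any matrix-like input (in memory or file-backed):
> library(BiocSingular)
> # IRLBA for ordinary/sparse matrices, randomized SVD for a plain DelayedMatrix
> pcs <- runPCA(mat, rank = 20, BSPARAM = FastAutoParam())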
2019-09-16
Peter Hickey (00:33:20): > a nice touch. > was this motivated as a workaround to the surprising(?) slowness of running PCA on a TENxMatrix?
Aaron Lun (00:40:42): > Well, there are levels of slow. It wouldn’t even be possible to use base::svd on something that couldn’t fit into memory in the first place.
Aaron Lun (00:41:27): > The next option is IrlbaParam(), but this hits the hard drive so frequently (up to ncomponents * 2 times) that it’s pretty slow for file-backed matrices.
Aaron Lun (00:41:45): > That leaves RandomParam(), which touches the hard drive <10 times.
Aaron Lun (00:42:40): > It’s not really a workaround. Just a more acceptable speed/accuracy compromise.
Aaron Lun (00:45:42): > Obviously, you can easily circumvent the automatic choice by doing something like, e.g., defining your own %*%,HDF5Matrix-method.
Aaron Lun (00:46:26): > This would make the FastAutoParam() dispatcher think that you have some smart efficient way of matrix multiplication, and pick IRLBA instead.
Peter Hickey (00:50:03): > fair enough. i’ve resorted to as(x, "dgCMatrix") to compute PCA of disk-backed 10x data on a few occasions and know others have too and wondered if that’s what spurred it
Aaron Lun (00:50:44): > Well, the TENxMatrix should be better than the HDF5Matrix in that respect.
Aaron Lun (00:52:12): > But randomized SVD is much better than IRLBA for disk-backed matrices, we use it inhttps://osca.bioconductor.org/integrating-datasets.html - Attachment (osca.bioconductor.org): Chapter 13 Integrating Datasets | Orchestrating Single-Cell Analysis with Bioconductor > Online companion to ‘Orchestrating Single-Cell Analysis with Bioconductor’ manuscript by the Bioconductor team.
Stephanie Hicks (09:40:42): > nice@Aaron Lun! I have been running into a similar issue while benchmarking mbkmeans (https://github.com/stephaniehicks/benchmark-hdf5-clustering/blob/7f1579721b080a6bcdb4aade0e0322f178679b31/main/case_studies/03-dim-reduction.R#L81).@Davide Risso@Elizabeth Purdom@Ruoxi Liuand I were running into a parallelizing-ly slow PCA withDelayedMatrix
objects. > > time <- system.time(pca <- BiocSingular::runPCA(for_pca, rank = 30, > scale = TRUE, > BSPARAM = RandomParam(deferred = FALSE), #try deferred = TRUE > BPPARAM = MulticoreParam(10))) >
> With the TENxPBMC 68k cells and 1000 genes it would run, but with >3000 genes it wouldn’t. I realized we need to be using a deferred=TRUE object. Also, we’re going to try reducing the block size from the default to see if that helps.@Davide Risso also kindly pointed out last week that we should use a different parallelization framework instead of MulticoreParam(). Still need to update that too.
Tim Triche (09:41:11): > hey speaking of HDF5
Tim Triche (09:41:25): > this is an exciting new feature in minfi:
Tim Triche (09:41:31): > > [cpgCollapse] Collapsing data > Error in .local(x, Indexes, dataSummary, na.rm, verbose, ...) : > dim(x_grid) == dim(sink_grid) are not all TRUE >
Tim Triche (09:42:20): > > Browse[1]> dim(x_grid) > [1] 1 61 > Browse[1]> dim(sink_grid) > [1] 1 65 >
Tim Triche (09:43:03): > The use of a local function within the generic does not make this particularly easy to debug. Is there any reason that a local function (as opposed to calling out to a defined but non-exported function) is commonplace in many packages?
Tim Triche (09:44:59): > Background: I’m running permutation tests (lots of them) and would prefer to permute on clusters since that’s where the results in the mesothelin paper are easiest to interpret vis-a-vis experimental evidence. But since there are thousands of samples, it seemed like a good place to use HDF5 backing. The degree of fiddling required with HDF5 is… surprising, even after having worked with it directly in other packages (e.g. miser).
Kasper D. Hansen (09:49:19): > I seem to recall a pull request from Rafa regarding cpgCollapse. I was traveling, so it may have gotten dropped hard on the floor
Kasper D. Hansen (09:50:21): > hmm, there is an unmerged pull request, but it seems unrelated to this issue
Tim Triche (09:52:40): > It’s quite odd that the x_grid and sink_grid would end up with different sizes – this isn’t just a subsetting issue (originally I thought that was the problem, but it persists even when using the full dataset)
Kasper D. Hansen (09:56:19): > Yeah, I bet it has something to do with the fact that somehow the “last” grid box has to be different
Aaron Lun (22:32:41) (in thread): > I don’t think deferred=TRUE will help here, as the cost of centering and scaling is nothing compared to the cost of reading from disk. ¯\_(ツ)_/¯
2019-09-17
Stephanie Hicks (12:56:41) (in thread): > :thumbsup:
2019-09-18
Nick Eagles (16:38:01): > @Nick Eagles has joined the channel
2019-09-19
David JM (12:13:09): > @David JM has joined the channel
Aaron Lun (22:51:55): > @Hervé Pagès Currently in my BiocSingular package, I use DelayedArray’s definition of %*% to perform parallelized multiplication. This is fine, but it densifies sparse matrices via extract_array, which is not so good. It seems like we could benefit from another extract_native_array generic (yeah, not a great name), which applies all delayed operations on the seed and returns the result without coercing it to a dense array. This would allow me to simply wrap a sparse matrix in a DelayedArray and then exploit the parallelized %*% without losing sparsity during the calculations, which would make everyone happier.
Kasper D. Hansen (22:54:29): > We absolutely need sparsity preserved in matrix multiplication. I don’t know if@Aaron Lun’s suggestion is the way to go (but that’s due to ignorance not that I have doubts), but the goal is pretty important.
Aaron Lun (23:06:50): > Though I must say that for sparse matrices… doing the multiplication on a single core is often faster than waiting for the parallel backend to set up!
Stephanie Hicks (23:35:31): > woah, I didn’t know that about how BiocSingular densifies sparse matrices. Definitely agree with @Aaron Lun @Kasper D. Hansen. As someone currently benchmarking time and memory usage for PCA in BiocSingular with very large objects, both in-memory and DelayedArray, this would be incredibly helpful.
Stephanie Hicks (23:35:42): > tagging@Davide Rissotoo
Aaron Lun (23:44:57): > Consider that a 20000*50000 sparse matrix at 5% density takes 30 seconds to save to disk, 2 seconds to read from disk; but only 0.1 seconds to actually do the matrix multiplication with a single vector (typical of IRLBA). I don’t know how quick data transfer is between sockets, but if there’s any serialization involved, it seems like the cost of parallel back-end set-up likely outweighs any gains for sparse matrix %*%.
2019-09-20
Hervé Pagès (05:09:34): > @Aaron Lun FWIW DelayedArray provides extract_sparse_array(), read_sparse_block() and write_sparse_block(), the sparsity-preserving versions of extract_array(), read_block() and write_block(). > > Note that extract_sparse_array() and read_sparse_block() return a SparseArraySeed object, which represents a sparse array. Coercion back and forth between 2D SparseArraySeed objects and dgCMatrix is supported. > > Important: you should only use extract_sparse_array() or read_sparse_block() on objects for which is_sparse() is TRUE. You can still call these functions on an object for which is_sparse() is FALSE, and they might seem to work, but they’ll probably silently return something wrong. The reason extract_sparse_array() or read_sparse_block() don’t perform the is_sparse() check is that these functions are typically used inside a loop in the context of block processing so the same check would be performed again and again. This could add some significant cost to the overall block processing of the object (the check can be a little expensive depending on what delayed operations the object carries). So it makes more sense to perform the check upfront before entering the main block processing loop. > > See for example how is_sparse/read_sparse_block/write_sparse_block/read_block/write_block are used in BLOCK_write_to_sink. > > The sparsity-preserving capabilities in DelayedArray/HDF5Array are still a work-in-progress and very incomplete (e.g. the DelayedMatrix row/col summarization methods should use them but they don’t yet). Anyway maybe you want to give them a shot to boost your %*% implementation?
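A minimal sketch of the pattern described above (check is_sparse() once, then use the sparse readers inside the loop); blockGrid() is the grid constructor of this DelayedArray era, and FUN is whatever per-block computation you need:
> library(DelayedArray)
> block_apply_sparse <- function(x, FUN) {
>     sparse <- is_sparse(x)              # perform the check once, upfront
>     grid <- blockGrid(x)
>     lapply(seq_along(grid), function(b) {
>         vp <- grid[[b]]
>         block <- if (sparse) read_sparse_block(x, vp) else read_block(x, vp)
>         FUN(block)                      # a SparseArraySeed or an ordinary array
>     })
> }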
Princy Parsana (09:24:07): > @Princy Parsana has joined the channel
2019-10-03
Hervé Pagès (20:22:44): > Sparsity is now displayed by the show() method for DelayedArray objects:https://github.com/Bioconductor/DelayedArray/commit/545408564914dac1519428fbca2a81dca610cf62
2019-10-14
Aaron Lun (16:48:43): > @Peter Hickey @Davide Risso Is anyone using the raw_structure, const_column or get_const_col concepts in beachmat?
Aaron Lun (18:04:19): > I’m also thinking of getting rid of the get_rows() and get_cols(), which aren’t particularly useful.
Shian Su (18:35:17): > It doesn’t seem like a good idea to get rid of existing API unless it’s found to be detrimental in some way.
Aaron Lun (18:35:42): > Well, it’s detrimental to my attempts to refactor it.
Aaron Lun (18:36:06): > For example, adding OpenMP thread safety.
Shian Su (18:42:52): > Which level are you trying to get OpenMP threadsafety at? Do you want to be able to include beachmat and run it in an OpemMP section?
Aaron Lun (18:43:41): > Yes. This is currently not possible because of calls back to the R API, which need to be marked out in #pragma omp critical.
Aaron Lun (18:44:18): > Even after this, it is difficult as there are many uses of Rcpp::Vector that only shallow copy when you make a thread-private variable.
Aaron Lun (18:45:11): > So the first job is to protect the R calls, and the second is to switch to using raw int* and double* for the interface.
Aaron Lun (18:46:12): > That won’t require any changes user-side, but it does involve some work from my end, so the aim is to minimize this by discarding unused functionality.
Aaron Lun (18:57:58): > Technically, each beachmat::numeric_matrix instance has its own internal variables that are updated upon get_*. So the proposed contract is instead “beachmat is thread-safe if each thread possesses its own copy of an instance”. Which is not the case right now.
Peter Hickey (19:10:52) (in thread): > nope not using anywhere
Shian Su (19:11:55): > Fair enough. But does this cause performance regression for the existing users who requested the feature in the first place?
Aaron Lun (19:12:50): > No, and that’s the thing! It’s actually faster to just loop through with get_col() compared to a get_cols() call.
Aaron Lun (19:13:11): > HDF5 was very disappointing in this regard. Lots of things you’d expect to be faster were… not.
Shian Su (19:14:57): > Yes I’m having the same disappointment working with nanopore fast5 files…
Shian Su (19:16:14): > Is it feasible to reimplement the get_cols() as a loop through get_col()? Then you can retain the API and not have the issues?
Aaron Lun (19:16:40): > Sure, but I don’t see a use for that.
Aaron Lun (19:16:57): > In the absence of performance benefits, it seems easier for the user to just do it.
Kylie Bemis (19:18:58): > would switching to raw int* and double* mean extension packages no longer need to import Rcpp?
Shian Su (19:20:08): > Well a single function call is always syntactically nicer than a function call wrapped in a for loop.
Kylie Bemis (19:21:00): > Having get_cols() and get_rows() would be useful for future matter support, since iterating over minor dimensions has a real performance cost, so getting a chunk when iterating one way vs the other is useful. But I don’t know when I’ll get around to that, so don’t go by me.
Aaron Lun (19:22:23): > @Shian Su not if you have to force the array of indices to fit the accepted input format. I can’t template virtual functions so if you have indices as ints and I’m requesting size_ts… well, that’s too bad, you’ll have to make a copy of the array.
Shian Su (19:29:02): > Can’t just overload them either?
Aaron Lun (19:30:10): > I’m already overloading them to do a quiet double<->int conversion. To overload them again and cover all possible combinations of data with integer types in the indices (int, unsigned int, size_t) would be a chore.
Aaron Lun (19:31:15): > It’s not just a chore for me. Any extension also has to implement those same methods, and every extra method added means another thing that an extension is obliged to support.
Aaron Lun (19:33:10): > And extensions operate via R’s C API, so some poor sucker would have to write and register 6 separately named methods to do the same thing with different type combinations.
Shian Su (19:34:15): > get_col() does not have this problem?
Aaron Lun (19:35:25): > Well, you only have to overload it twice, for int and double. This is barely tolerable.
Shian Su (19:36:24): > Well at the end of the day you’re the implementer, so if it’s too much overhead then it’ll have to be scrapped.
Aaron Lun (19:37:54): > I would say that they were a mistake to begin with.
Shian Su (19:40:37): > I think it’s a natural extension, and people are clearly interested in the interface. Also making a copy of an integer-like vector is rarely the end of the world.
Shian Su (19:41:55): > Indices are on the order sqrt of the data you’re juggling around anyway.
Aaron Lun (19:42:45): > But if you’re going to copy the vector, then it’s just another line to loop over the indices and extract each row/col.
Shian Su (19:51:25): > That’s true, but I still think the interface is useful if it could be neatly implemented.
2019-10-15
Davide Risso (03:57:02) (in thread): > me neither
Hervé Pagès (04:41:54): > @Aaron Lun Remember that mbkmeans uses get_rows(). It used to use get_row() but this was very inefficient on HDF5Matrix objects so was changed in July to use get_rows() instead:https://github.com/drisso/mbkmeans/commit/8274c63aa04965bc77cd5059883524fe8951c6f8
2019-10-16
hcorrada (10:17:11): > Hello<!channel>. We’re trying out our recent work on segment-level analysis (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2947-6) on full transcript scRNA-seq. We have the output of rapmap pseudo-alignment as tsv files and would like to create a SingleCellExperiment object. Is current preferred practice for large assay matrices to use an hdf5-backed DelayedArray? If so, what is the recommended way of writing the hdf5 file to begin with? Thanks in advance! - Attachment (BMC Bioinformatics): Yanagi: Fast and interpretable segment-based alternative splicing and gene expression analysis > Ultra-fast pseudo-alignment approaches are the tool of choice in transcript-level RNA sequencing (RNA-seq) analyses. Unfortunately, these methods couple the tasks of pseudo-alignment and transcript quantification. This coupling precludes the direct usage of pseudo-alignment to other expression analyses, including alternative splicing or differential gene expression analysis, without including a non-essential transcript quantification step. In this paper, we introduce a transcriptome segmentation approach to decouple these two tasks. We propose an efficient algorithm to generate maximal disjoint segments given a transcriptome reference library on which ultra-fast pseudo-alignment can be used to produce per-sample segment counts. We show how to apply these maximally unambiguous count statistics in two specific expression analyses – alternative splicing and gene differential expression – without the need of a transcript quantification step. Our experiments based on simulated and experimental data showed that the use of segment counts, like other methods that rely on local coverage statistics, provides an advantage over approaches that rely on transcript quantification in detecting and correctly estimating local splicing in the case of incomplete transcript annotations. The transcriptome segmentation approach implemented in Yanagi exploits the computational and space efficiency of pseudo-alignment approaches. It significantly expands their applicability and interpretability in a variety of RNA-seq analyses by providing the means to model and capture local coverage variation in these analyses.
Michael Lawrence (12:53:47): > My first guess would be HDF5Array::writeHDF5Array()
Hervé Pagès (13:13:26): > Yep, and with particular attention to how you choose the chunk geometry (the chunkdim arg), which depends on what the typical access pattern will be during the downstream analysis.
Aaron Lun (13:13:36): > We find HDF5 arrays to be satisfactory for interactive visualization, see the TCGA example inhttps://f1000research.com/articles/7-741/v1 - Attachment (f1000research.com): F1000Research Article: iSEE: Interactive SummarizedExperiment Explorer. > Read the latest article version by Kevin Rue-Albrecht, Federico Marini, Charlotte Soneson, Aaron T.L. Lun, at F1000Research.
Aaron Lun (13:14:19): > The delay is not noticeable if you’re not pulling a lot of data out at any given step; and it definitely makes the app boot up much faster.
hcorrada (14:23:00): > Thanks, will give this a spin!
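A minimal sketch of that route; the file name, dataset name, and whole-column chunk geometry are illustrative and should be tuned to the expected access pattern, as suggested above:
> library(HDF5Array)
> library(SingleCellExperiment)
> # for matrices too big for memory, write in pieces instead of one read.delim()
> counts <- as.matrix(read.delim("segment_counts.tsv", row.names = 1))
> h5 <- writeHDF5Array(counts, filepath = "counts.h5", name = "counts",
>                      chunkdim = c(nrow(counts), 1))   # one cell per chunk
> sce <- SingleCellExperiment(assays = list(counts = h5))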
Aaron Lun (16:02:39) (in thread): > grumble. Fine.
2019-10-28
Sean Davis (21:14:11): > > Originally touted as part of our Enterprise product last year, we’re more excited to announce that the new S3 and HDFS Virtual File Drivers (VFDs) will be released into open source with HDF5 1.12.0, due out in about two months. If you’re interested in an earlier start, you can download our snapshot release now—we’d appreciate your help in testing and reporting back on this release. > > > If you’d like to learn more, we have created videos outlining how to set up and use these new VFDs. > > > Read-only S3 VFD:https://www.youtube.com/watch?v=P7AlG0lXQJ8HDFS VFD:https://www.youtube.com/watch?v=448mSCCzrnE - Attachment (YouTube): Learn about the read-only Cloud (Amazon S3) Storage HDF5 Connector - Attachment (YouTube): Learn about the Hadoop (HDFS) HDF5 Connector
Sean Davis (21:16:08): > The note above is from the HDF group.
2019-10-29
Michael Lawrence (14:20:56): > Impressive use of org-mode and ditaa.
Hervé Pagès (19:16:38): > Cool! Sounds like it will not be too hard to support something like counts <- HDF5Array("https://s3.amazonaws.com/pile-of-files/singlecell.h5", "counts") after the next update to Rhdf5lib.
Vince Carey (20:17:05): > FWIW rhdf5client does this already … but needs improvement > > > library(rhdf5client) > 9/6 packages newly attached/loaded, see sessionInfo() for details. > > example(HSDSArray) > > HSDSAr> HSDSArray(URL_hsds(), > HSDSAr+ "hsds", "/shared/bioconductor/darmgcls.h5", "/assay001") > <65218 x 3584> matrix of class HSDSMatrix and type "double": > [,1] [,2] [,3] ... [,3583] [,3584] > [1,] 0.000000 0.000000 112.394374 . 0.00000 0.00000 > [2,] 0.000000 0.000000 0.000000 . 0.00000 0.00000 > [3,] 0.000000 0.000000 0.000000 . 0.00000 0.00000 > [4,] 5.335452 11.685833 0.000000 . 0.00000 14.01612 >
2019-10-30
Hervé Pagès (00:29:25): > I think that what’s cool about an S3-backed HDF5Array object is that it doesn’t require setting up a server. Also, if I understood correctly, what they say towards the end of the video is that at some point in the near future they’ll support write access, so we should be able to use S3 as a realization backend. I don’t know if we actually need these things but they’re coming for free (almost) since low-level workhorses like rhdf5::h5read(), rhdf5::h5write() and HDF5Array::h5mread() will be able to transparently work with files on S3.
Hervé Pagès (00:42:12): > Also AnnotationHub and ExperimentHub use S3 so we could imagine that users will have the option to get HDF5Array objects from the hubs without the need to download anything. I’m not saying that’s going to boost their productivity but it’s nice to be able to take a quick look at a data set before deciding to download it.
Vince Carey (07:51:41): > My reading of II.A.ii ofhttps://bitbucket.hdfgroup.org/users/jhenderson/repos/rest-vol/browseis that~you will need a service~to mediate between C and HDF5 on S3. HSDS is one example. We could run our own instance of that, but the example I showed uses one with access donated by HDF group. I am getting clearer on the options here, I have overstated the requirement for HSDS. Will update later.
Hervé Pagès (11:37:35): > Makes sense. Another nice thing about the VFD move is that it won’t matter whether the S3 VFD or HDFS VFD is used. It should be transparent for the end user, and things like rhdf5::h5ls() and rhdf5::h5read() should work the same way.
Hervé Pagès (11:39:03): > Coming soon (in HDF5 1.12.0): > > Hyperslab selection code was optimized to improve performance by an order of magnitude (and in the cases when reading a regular selection from one dimensional array in a file into one-dimensional array in memory improvement was 6000 times.)
Kasper D. Hansen (12:14:25): > That’s averynice improvement. But it is kind of concerning that stuff like this wasn’t already optimized.
Hervé Pagès (12:24:58): > Another nice one they mentioned recently: > > We will conclude the Webinar with discussion of the third upcoming development, a new HDF5 storage method for sparse storage that we have been designing and prototyping.
Hervé Pagès (12:26:17): > This one won’t be in HDF5 1.12.0 though.
2019-10-31
Michael Lawrence (13:58:21): > I’d be interested in checking out the webinar; sounds like you have the info?
Hervé Pagès (14:17:45): > Materials for last Friday’s webinar is online:https://www.hdfgroup.org/2019/10/webinar-followup-new-hdf5-features-coming-in-2020-2021/ - Attachment (The HDF Group): Webinar followup: New HDF5 Features Coming in 2020-2021 - The HDF Group > As promised, here are the ancillary materials for Friday’s webinar on New HDF5 Features Coming in 2020-2021. The video recording can be seen in its entirety on youtube. We’ve also grabbed the time stamps for each of the three topics if you would like to jump to the section of video on those individual topics: Splitter and mirror VFD VFD SWMR Sparse data management Slides: New HDF5 Features Coming in 2020-2021 Sparse Data Management in HDF5 We also ran a poll…
2019-11-05
Izaskun Mallona (03:34:43): > @Izaskun Mallona has joined the channel
2019-11-07
Mike Smith (07:36:24) (in thread): > I plan to create a branch of Rhdf5lib with 1.12.0 soon, so we can test whether this sorts some of our issues with arbitrary column/row selection
Kevin Blighe (11:27:48): > @Kevin Blighe has joined the channel
2019-11-16
Luka (13:38:19): > @Luka has joined the channel
2019-11-20
Nolan Nichols (12:01:41): > @Nolan Nichols has joined the channel
Russ Bainer (12:02:42): > @Russ Bainer has joined the channel
2019-12-06
Somesh (12:21:17): > @Somesh has joined the channel
2019-12-07
Juan Ojeda-Garcia (18:44:54): > @Juan Ojeda-Garcia has joined the channel
2019-12-08
Shian Su (04:50:04): > https://github.com/facebookresearch/faiss
Shian Su (04:51:32): > > The optional GPU implementation provides what is likely (as of March 2017) the fastest exact and approximate (compressed-domain) nearest neighbor search implementation for high-dimensional vectors, fastest Lloyd’s k-means, and fastest small k-selection algorithm known.
Shian Su (04:51:45): > Might be of interest to some.
Aaron Lun (13:40:36): > If this can get into an R package, I’d happily add bindings to BiocNeighbors.
Tim Triche (15:59:35): > https://github.com/facebookresearch/faiss/blob/master/c_api/INSTALL.md
Tim Triche (15:59:54): > probably will be easier to hook against the pure C api if wanting to glue from R
Aaron Lun (16:18:37): > Well, that’s the thing. You’d want another R package to manage the installation so BioCneighbors doesn’t have to do that much.
2019-12-10
Camille BONAMY (11:58:01): > @Camille BONAMY has joined the channel
2019-12-11
Christine Choirat (12:07:28): > @Christine Choirat has joined the channel
2019-12-16
Federico Agostinis (09:25:32): > @Federico Agostinis has joined the channel
Federico Agostinis (09:25:32): > @Federico Agostinis has joined the channel
2019-12-18
Mike Jiang (16:06:51) (in thread): > A quick test on the random slicing of hdf5-1.12: > > Unit: milliseconds > expr min lq mean median uq max neval > a <- as.matrix(bm[ridx, cidx]) 5.318715 5.877797 6.178895 6.436878 6.608985 6.781091 3 > b <- as.matrix(hm[ridx, cidx]) 49.326155 50.281914 50.699314 51.237674 51.385893 51.534112 3 > c <- rhdf5::h5read(h5.file, "data", list(ridx, cidx)) 4266.242434 4315.764825 4431.287001 4365.287216 4513.809284 4662.331353 3 > d <- rhdf12::h5read(h5.file, "data", list(ridx, cidx)) 1755.276245 1763.458707 1781.881867 1771.641168 1795.184679 1818.728189 3 > > all.equal(a,b,c,d) > [1] TRUE > > packageVersion("HDF5Array") > [1] '1.14.1' > > packageVersion("rhdf5") > [1] '2.30.1' > > bm is a memory-mapped-backend DelayedArray, hm is an HDF5Array which presumably is based on the alternative implementation h5mread, and rhdf12 is the rhdf5 built against hdf5-1.12.0-alpha1. > Looks like hdf5-1.12 does improve a bit compared to the existing hdf5-1.10, but nowhere close to the performance achieved by@Hervé Pagès’s h5mread
Mike Smith (16:16:23): > Thanks for the benchmarks. Looks like an OK improvement, but nothing groundbreaking. I think I really need to understand what h5mread is doing, because if it’s that much quicker it might be worth pointing HDF5 themselves at it.
Hervé Pagès (18:58:47): > @Mike Jiang did you make rhdf12 available somewhere? @Mike Smith h5mread() uses different methods depending on the particular situation. In this case (random slicing) it walks on each chunk touched by the user selection, loads it in an intermediate buffer, and copies the user selection (restricted to the chunk) from the buffer to the final destination (i.e. to the SEXP representing the ordinary array to return to the user). So it completely avoids the inefficiency of a single H5Dread() call approach where the full selection needs to be computed upfront. In the random slicing situation the full selection is the union of millions of hyperslabs, each of them reduced to a single array element. As we know, the HDF5 low-level code spends more time computing this selection than reading data from disk.
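For reference, the two call styles being compared, using the objects from the benchmark above:
> # h5mread()'s chunk-walking reader vs the stock rhdf5 reader
> a <- HDF5Array::h5mread(h5.file, "data", starts = list(ridx, cidx))
> b <- rhdf5::h5read(h5.file, "data", index = list(ridx, cidx))
> all.equal(a, b)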
2019-12-19
Mike Jiang (12:19:38) (in thread): > @Hervé Pagèshttps://github.com/mikejiang/rhdf5
Mike Jiang (12:21:11) (in thread): > I hardcoded the path of hdf5 lib inMakevars
Hervé Pagès (12:23:50) (in thread): > ok, thanks (we’re having the developer forum right now, you should join, the other Mike is going to talk about HDF5)
Mike Jiang (12:28:45) (in thread): > also, besides the minor change to the rhdf5/src/H5constants.c you see in my commit, you will need to manually copy H5private.h and H5win32defs.h over to the hdf5 installation include folder from the source hdf5-1.12.0-alpha1/src
Mike Jiang (12:30:22) (in thread): > When, where?
Hervé Pagès (12:31:24) (in thread): > https://bluejeans.com/114067881(was announced on the developers-forum channel)
Hervé Pagès (12:33:02) (in thread): > And Mike’s slides are here:https://docs.google.com/presentation/d/1SjPB3yEenzFNWiwPLIFBBte6Et2v3WGwBLqVyq8JFZk/edit?usp=sharing
Mike Jiang (13:24:28): > Looks like blosc_blosclz/lz4 consistently outperform the default Gzip with just a moderate size increase. Should we consider making it the default for HDF5Array or even rhdf5?
2019-12-20
Domenick Braccia (08:36:24): > @Domenick Braccia has joined the channel
Nicholas Knoblauch (11:30:28) (in thread): > I’ve seen similar performance gains using zstd. I think making it the default is probably problematic given that blosc doesn’t ship with HDF5.
Paul Harrison (18:00:34): > @Paul Harrison has joined the channel
2019-12-22
Sara Fonseca Costa (16:11:18): > @Sara Fonseca Costa has joined the channel
2019-12-24
dylan (12:02:14): > @dylan has joined the channel
2019-12-25
Wendy Wong (12:03:11): > @Wendy Wong has joined the channel
2020-01-07
Robert Ivánek (05:17:23): > @Robert Ivánek has joined the channel
2020-01-11
Leandro Roser (14:11:42): > @Leandro Roser has joined the channel
Leandro Roser (14:12:58): > @Leandro Roser has left the channel
2020-01-15
olga tsiouri (16:27:51): > @olga tsiouri has joined the channel
2020-01-21
Aaron Wolen (10:51:03): > @Aaron Wolen has joined the channel
2020-02-07
Nitin Sharma (04:27:05): > @Nitin Sharma has joined the channel
2020-02-12
Thanh Le Viet (09:42:33): > @Thanh Le Viet has joined the channel
2020-02-17
Arshi Arora (12:29:19): > @Arshi Arora has joined the channel
2020-02-19
Paula Beati (13:22:45): > @Paula Beati has joined the channel
2020-02-28
Yi Wang (16:20:57): > @Yi Wang has joined the channel
2020-03-03
Sean Davis (17:26:48): > https://portal.hdfgroup.org/display/HDF5/New+Features+in+HDF5+Release+1.12 - Attachment (Confluence): New Features in HDF5 Release 1.12 > This release includes changes in the HDF5 storage format. PLEASE NOTE that HDF5-1.10 and earlier releases cannot read files created with the new features described below that are marked with a *.
2020-03-04
Tim Triche (12:03:31): > I was wondering why all my old HDF5-backed summarized experiments seemed to be breaking
Hervé Pagès (12:06:58): > mmh.. not good! New HDF5 versions should be able to read files created with earlier releases.
Kasper D. Hansen (12:11:08): > aren’t they saying the other way around: data created using new versions cannot be read using old versions
Aaron Lun (12:11:29): > that;s how I read it.
Tim Triche (12:11:50): > meh. brb switching to zarr
Nolan Nichols (12:12:53): > ha, nothttps://tiledb.com/? - Attachment (TileDB): Homepage – TileDB > A database for data scientists.
Aaron Lun (12:13:32): > I get all of Dirk’s 100 commits a day.
Tim Triche (12:14:43): > not helpful, I’m a statistician not a data scientist
John Readey (12:18:56): > The 1.12 library should be able to read files written with older versions. If you have an example where this is not the case, please contact The HDF Group (help@hdfgroup.org).
Aaron Lun (12:59:02): > Well, my HDF5 errors are along the lines of HDF5Array: > > Error in extract_array(x@seed, index) : > no slot of name "type" for this object of class "HDF5ArraySeed" > Error during wrapup: no slot of name "type" for this object of class "HDF5ArraySeed" >
Hervé Pagès (13:23:52): > Seems hardly related to HDF5 Release 1.12. Please open an issue for HDF5Array (I suspect it has something to do with some recent changes I made to the internals of HDF5ArraySeed objects:shushing_face:)
Tim Triche (13:57:02): > those were the same ones I was getting. I rolled back a few releases and presto, no more errors
Vince Carey (14:11:37): > I took the other approach and rebuilt an HDF5SummarizedExperiment that was failing, with current HDF5Array. This revision of the instance settled the matter. Would an updateObject method help?
Aaron Lun (14:11:56): > I suspect anupdateObject
is indeed required.
Tim Triche (14:14:04): > ooh! Probably so. I am so smart, S-M-R-T…:face_palm_star_trek:
Hervé Pagès (14:22:18): > Should be fixed in HDF5Array 1.15.9:https://github.com/Bioconductor/HDF5Array/commit/ea0d7df1a38f093e958a73014412a26dd966c34fSorry guys. Please report any further problem. > Just to clarify: HDF5 1.12 just got released and it will take a while before it makes it to Rhdf5lib/rhdf5/HDF5Array. So any problem you get while using these packages has nothing to do with this new HDF5 release. However, if you’ve already downloaded and compiled this new release and you are using it at the command line, then, as@John Readeysaid, you should be able to read your old files. If you can’t (and I originally thought that’s what was happening to@Tim Triche), please report to the HDF Group.
Jialin Ma (14:57:20): > @Jialin Ma has joined the channel
Shian Su (18:56:04): > I’ve got a 100GB CSV with about 1B rows and genomic coordinates as well as methylation statistics, I want to be able to query it quickly for certain regions, anyone know of anything in R that would help with that? I’m using sqlite3 but indexing on the chr and pos columns adds another 30GB to the size, I think sqlite3 stores a whole copy of the indexed columns. I coded up my own binary search with a fst table but it requires me to have somehow sorted the columns beforehand and I don’t know of any easy ways in R to sort out-of-memory (I sorted it on a machine with 500GB of ram but I want to make this usable with <16GB of ram).
Kasper D. Hansen (20:08:04): > isnt this what tabix is for?
Kasper D. Hansen (20:09:02): > but we have dealt with this size in bsseq using HDF5 btw
Kasper D. Hansen (20:09:15): > Pete is the expert and I know you’re in contact
Shian Su (20:48:52): > Haha yes he sits behind me.
Shian Su (20:52:35): > Tabix might be the right fit for this. GNU sort might be the best way to do out-of-memory sort, though it won’t be available for Windows.
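A minimal sketch of the tabix route with Rsamtools, assuming the table has already been sorted by chromosome and position (e.g. with GNU sort) and has one header line; the file name and column positions are illustrative:
> library(Rsamtools)   # also attaches GenomicRanges
> bgz <- bgzip("meth_stats.sorted.tsv")                          # writes .bgz
> idx <- indexTabix(bgz, seq = 1, start = 2, end = 2, skip = 1)  # chr/pos columns
> tf  <- TabixFile(bgz, index = idx)
> hits <- scanTabix(tf, param = GRanges("chr1", IRanges(1e6, 2e6)))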
Shian Su (21:13:59): > ~As a side note, was recently informed (and tested myself) that TSV read at least a magnitude faster than CSV. Super bizarre.~
Kasper D. Hansen (21:28:24): > using what reader code?
Shian Su (22:07:45): > Aha nevermind, bug in my code, I was reading the csv using tsv settings so it wasn’t actually breaking up the columns, which made it much faster.
Shian Su (22:09:06): > They are equivalent speed-wise using read.csv and read.table(sep = "\t"), as you would expect.
Shian Su (22:15:18) (in thread): > > x <- matrix(1, 1000, 1000) > > write_path_csv <- tempfile() > write_path_tsv <- tempfile() > > write.table(x, write_path_csv, sep = ",") > write.table(x, write_path_tsv, sep = "\t") > > system.time( > read.table(write_path_csv, sep = ",") > ) > > system.time( > read.table(write_path_tsv, sep = "\t") > ) >
2020-03-05
Davide Risso (07:31:21): > readr::read_csv should be faster, I think?
Federico Marini (08:00:03): > vroom::vroom even more
Tim Triche (08:41:49): > data.table was always fastest for me … readr had some issues
Shian Su (17:39:45): > For me it’s vroom > data.table > readr > base
Shian Su (17:40:46): > Except I hit vroom’s 2 billion row limit, somehow with less than 1 billion rows. This is fixed by a patch that’s not live on CRAN yet AFAIK.
Shian Su (17:44:04): > data.table also doesn’t do chunked reading, and going byhttps://github.com/Rdatatable/data.table/issues/1721it’s probably not going to happen any time soon.
2020-03-06
Tim Triche (09:06:09): > These data points are great. Presumably the multi-billion-row patch will go live soon-ish, but in the meantime I rarely need to process more than a few hundred million rows, so it looks like I’ll be switching over to vroom permanently. Thanks!
2020-03-08
Mike Jiang (00:40:35): > @Mike Smith We are currently experimenting with the S3-backed h5 offered by hdf5 1.12; it would be very helpful if we could have an Rhdf5lib with 1.12.0.
Shian Su (21:19:36) (in thread): > That being said, vroom simply indexes the data and doesn’t read anything until you need it, so if you happen to trigger a full read I imagine it’ll be slower than data.table. I believe it excels if you only need a subset of rows and/or columns.
2020-03-09
Mike Smith (05:11:07) (in thread): > You can find a branch with 1.12.0 athttps://github.com/grimbough/Rhdf5lib/tree/1-12-0and a matching version of rhdf5 athttps://github.com/grimbough/rhdf5/tree/1-12-0I’m not sure it’ll make it into the devel-branch before the release, I want to do some more extensive testing, but feel free to use either of those and let me know if there’s any issues.
Tim Triche (09:14:28) (in thread): > Well, what would be super awesome would be to (for example) read a chromosome at a time into HDF5 from a tabixed BED-like file, and it seems like vroom may be the best long-term option for that. Regardless, thanks for the pointers & tips!
Mike Jiang (16:09:03) (in thread): > Great! I’ve submitted PR for enabling s3 vfd.
2020-03-11
Michael Lawrence (11:46:55) (in thread): > Interested in hearing about the outcome of these experiments, as we are trying to make the move to cloud-based storage.
2020-03-16
cigdemak (13:23:08): > @cigdemak has joined the channel
2020-03-17
Mike Jiang (13:42:16) (in thread): > @Michael LawrenceSurely will post some results once we’ve got some working examples.
Michael Lawrence (13:42:28) (in thread): > Awesome, thanks
Mike Jiang (14:50:43) (in thread): > @Mike SmithI confirmed that ros3 vfd is also available in HDF5 1.10.6.https://www.hdfgroup.org/2019/12/release-of-hdf5-1-10-6-newsletter-170/ - Attachment (The HDF Group): Release of HDF5 1.10.6 (Newsletter #170) - The HDF Group > The HDF5-1.10.6 release is now available for download.
Mike Jiang (14:52:56) (in thread): > Maybe upgrading from the current 1.10.5 to 1.10.6 would be more straightforward? Anyway, it is up to you whether to go straight to 1.12.
Mike Smith (16:06:54) (in thread): > Thanks for the heads up. A minor point release might be safer this close to the release date. One current sticking point is that I haven’t managed to build libhdf5 on Windows with the R-4.0 toolchain, so for now I’m relying on the r-winlib version, which is 1.10.5. I’d rather not end up with different platforms on different versions, but I’ll take another look at getting the Windows build to complete tomorrow.
Mike Jiang (17:07:57) (in thread): > :+1:
2020-03-18
Crowy (12:18:15): > @Crowy has joined the channel
2020-03-22
Stephanie Hicks (15:38:42): > sanity check. The ExperimentHub site (https://bioconductor.org/packages/release/bioc/html/ExperimentHub.html) says the author is “Bioconductor Package Maintainer”, so the complete reference is
> Maintainer BP (2019). ExperimentHub: Client to access ExperimentHub resources. R package version 1.12.0.
> When citing this in a manuscript, it comes out as “Maintainer (2019)”. Somehow this feels strange to me? @Lori Shepherd — just confirming this is the correct reference? - Attachment (Bioconductor): ExperimentHub > This package provides a client for the Bioconductor ExperimentHub web resource. ExperimentHub provides a central location where curated data from experiments, publications or training courses can be accessed. Each resource has associated metadata, tags and date of modification. The client creates and manages a local cache of files retrieved enabling quick and reproducible access.
Vince Carey (16:06:46): > I agree that it is strange. We should do a quick f1000research paper on it so that this tremendous resource can be cited and appreciated. Anyone with bandwidth?
Vince Carey (16:07:01): > AnnotationHub too!
Lori Shepherd (21:34:48) (in thread): > Yes. Strange. I think it uses a default formula, so it is taking “Bioconductor Package” as the first and middle name and “Maintainer” as the surname. I’d use the complete reference, unless @Martin Morgan has any additional thoughts on it?
Lori Shepherd (21:35:26) (in thread): > Yes we should follow up on this.
2020-03-23
Martin Morgan (03:54:59) (in thread): > @Lori Shepherd the Author field should be updated to be like (an updated?) AnnotationHub using the Authors@R notation. It looks like a CITATION file should be manually created reflecting this information, so that citation("ExperimentHub") returns a citation where real people get credit… thanks Lori!
Stephanie Hicks (10:17:39) (in thread): > Thanks @Lori Shepherd @Martin Morgan! But I’m not completely sure I’m following what you want the updated citation to be?
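(An illustrative sketch of the two pieces discussed above: an Authors@R field in DESCRIPTION and a hand-written inst/CITATION file. The names and roles are placeholders, not the actual ExperimentHub author list.)
> # DESCRIPTION
> # Authors@R: c(
> #     person("Jane", "Doe", role = c("aut", "cre"), email = "jane.doe@example.org"),
> #     person("John", "Smith", role = "ctb"))
>
> # inst/CITATION
> bibentry(
>     bibtype = "Manual",
>     title   = "ExperimentHub: Client to access ExperimentHub resources",
>     author  = c(person("Jane", "Doe"), person("John", "Smith")),
>     year    = "2019",
>     note    = "R package version 1.12.0"
> )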
2020-03-24
Edgar (13:24:11): > @Edgar has joined the channel
2020-03-27
Hervé Pagès (06:06:49): > @Kasper D. Hansen Installing IlluminaHumanMethylationEPICanno.ilm10b4.hg19 takes about 4 min on my laptop and uses 2.5 GB of RAM! Almost all of that time is spent in
> ***** moving datasets to lazyload DB
> Is using lazy loading really needed? It’s been reported to be the cause of memory allocation errors and timeouts on 32-bit Windows on several occasions in the past. Thanks!
Kasper D. Hansen (08:42:27): > My plan was to redesign the annotation package for the next release. Given current events I am trying to do so, but unsure if I’ll make it
Kasper D. Hansen (08:44:41): > The issue at runtime (I think, it’s been a while since I discovered it) is actually - as far as I can see - a bug, but when I looked into it, I had to give up on the lazy loading. It comes into play because of the rather large SNP tables included in the package, where all of them get loaded into memory.
Kasper D. Hansen (08:45:35): > Anyway, when I looked at it last, I concluded that a redesign would be better. But right now, it is actually using lazy loading in a pretty nice way, and since the package is widely used it really needs a robust solution.
Vince Carey (14:34:52): > Would SQLite backing make sense? Also, do we have a SQLite-backed approach to DataFrame at this time?
Marcel Ramos Pérez (15:11:20): > I believe we do:https://github.com/Bioconductor/SQLDataFrame
Hervé Pagès (16:38:16): > My question was more about the benefits of using lazy loading. It seems to be doing more harm than good. Getting rid of it sounds like it could be a good start and hopefully an easy one.
Michael Lawrence (16:59:30): > Easy enough to achieve lazy loading “manually” with delayedAssign().
Hervé Pagès (17:56:54): > Thanks Michael. Didn’t know about delayedAssign(). In the core team we came up with our own makeCachedActiveBinding() (available in BiocFileCache). It’s a wrapper around base::makeActiveBinding that provides caching, i.e. the active binding only gets evaluated once and the result is “remembered”. We’ve started to use it in a couple of AnnotationHub-based annotation packages (see https://github.com/ju-mu/PANTHER.db/blob/master/R/zzz.R) but it can be used in any data package that wants to define and export symbols bound to on-disk data. It’s VERY easy to use.
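(A minimal, base-R-only sketch of the caching pattern described above; the real makeCachedActiveBinding() in BiocFileCache may differ in its interface and details.)
> makeCachedActiveBinding_sketch <- function(sym, fun, env = parent.frame()) {
>     cache <- new.env(parent = emptyenv())
>     makeActiveBinding(sym, function() {
>         if (!exists("value", envir = cache))
>             assign("value", fun(), envir = cache)  # evaluated on first access only
>         get("value", envir = cache)
>     }, env)
> }
>
> makeCachedActiveBinding_sketch("loaded_at", function() {
>     message("evaluated once")   # in a data package this would e.g. read from disk
>     Sys.time()
> })
> loaded_at   # first access evaluates the binding and caches the result
> loaded_at   # later accesses return the cached value, no message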
Michael Lawrence (18:06:16): > Sounds a lot like what S4Vectors:::makeEnvForNames() does.
Hervé Pagès (18:07:08): > wow, never heard of that one either, and it’s in a package I maintain :dizzy:
Hervé Pagès (18:10:39): > ?makeEnvForNames
> No documentation for 'makeEnvForNames' in specified packages and libraries:
> you could try '??makeEnvForNames'
> sad
Michael Lawrence (18:21:30): > Well, it is internal.
Michael Lawrence (18:21:41): > I wrote that code when I was still at the Hutch.
2020-03-30
Kasper D. Hansen (08:47:06): > The lazy loading is actually pretty key to how it works right now.
Kasper D. Hansen (08:48:02): > With lazy loading, the data object gets copied into a mysterious location (presumably an environment) which I could never figure out (it’s not the global environment) but which is visible everywhere.
Kasper D. Hansen (08:57:00): > @Vince Carey It seems natural to consider a SQLite backend, of course.
Kasper D. Hansen (08:57:12): > I have looked - very briefly - at AnnotationDbi this morning and I have several questions.
Kasper D. Hansen (08:59:12): > An alternative - which I am considering right now in light of the deadline - is to make a merge between the lazy loading design and SQLite.
Minoo (11:22:17): > @Minoo has joined the channel
2020-03-31
Yagoub Ali Ibrahim Adam (12:30:23): > @Yagoub Ali Ibrahim Adam has joined the channel
2020-04-03
Kasper D. Hansen (08:35:26): > @Michael Lawrence The feature of lazy loading I am using is my observation that lazy-loaded objects get put somewhere (and I don’t know where) where they appear to be present and visible from everywhere, but not in the global environment. Perhaps I am wrong about this - perhaps lazy-loaded objects get loaded into the current environment when they are accessed - but as I recall (and it’s been a while) my experiments suggested that wasn’t the case.
Vince Carey (08:59:19): > From help("lazyLoad"):
> The function ‘lazyLoad’ is the workhorse function called by the package loader to load the code for a package from a database. The database consists of two binary files, ‘filebase.rdb’ (the objects) and ‘filebase.rdx’ (an index).
> The objects are not themselves loaded into ‘envir’: rather promises are created that will load the object from the database on first access. (See ‘delayedAssign’.)
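(A tiny, runnable illustration of the promise behavior described in that help text, using delayedAssign() directly; the small data frame stands in for a large on-disk object.)
> e <- new.env()
> delayedAssign("snp_table",
>               { message("loading now..."); data.frame(pos = 1:3) },
>               assign.env = e)
> ls(e)         # the name is visible in 'e', but nothing has been loaded yet
> e$snp_table   # first access forces the promise and prints the message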
2020-04-07
Anna Lorenc (12:02:25): > @Anna Lorenc has joined the channel
2020-04-10
kipper fletez-brant (10:11:06): > @kipper fletez-brant has joined the channel
Shubham Gupta (15:10:19): > @Shubham Gupta has joined the channel
2020-04-14
Stephanie Hicks (22:50:46) (in thread): > any updates on this? @Davide Risso and I are in the final stages of writing up a manuscript for mbkmeans where we cite it, but I noticed it is still listed the same on the bioc website: “Maintainer BP (2019)”
Stephanie Hicks (22:54:33) (in thread): > any updates on this?
2020-04-16
Lori Shepherd (08:52:27) (in thread): > so I updated the AnnotationHub and ExperimentHub Authors@R lists yesterday – @Martin Morgan if you think the ctb vs aut should be updated for certain people let me know. It should propagate on tonight’s build.
2020-04-17
Stephanie Hicks (06:52:17) (in thread): > Thank you@Lori Shepherd!
2020-04-24
Anthony (20:09:28): > @Anthony has joined the channel
2020-04-27
Kelly Eckenrode (10:56:56): > @Kelly Eckenrode has joined the channel
2020-04-28
Jonathan Hicks (10:49:00): > @Jonathan Hicks has joined the channel
2020-04-29
Anamaria Elek (03:05:10): > @Anamaria Elek has joined the channel
2020-05-05
Tim Triche (17:09:35): > hey so I was requested to darken this channel’s door by@Hervé Pagès
Tim Triche (17:10:38): > earlier today @Kasper D. Hansen led a discussion with @Vince Carey and me (the latter mostly adding useless noise) about HDF5-backed thingies, DelayedArrays, practicalities of said, and GenomicFiles (because a bunch of us realized that we’d reimplemented the latter, poorly, 2-3 times independently)
Tim Triche (17:11:22): > the Google Doc is at https://docs.google.com/document/d/1esesDQo6AapDyZHpqcFat7JnbH35zoqbC7gz6ki1SHQ/edit?pli=1
Tim Triche (17:12:29): > but the boring detail that I brought up is: suppose you save a RangedSummarizedExperiment-looking thingy to an RDS using HDF5Array::saveHDF5SummarizedExperiment
Tim Triche (17:13:02): > such as, I don’t know, a boring GenomicRatioSet object that has certain mandatory slots describing things like what probes live where on what genome
Tim Triche (17:16:37): > (also who’s the Anonymous Lemur? all typoes are mine, fwiw)
Hervé Pagès (18:09:05): > what’s the question?
Tim Triche (18:25:39): > sorry got distracted there for a second. writing up an executable example
Tim Triche (18:37:21): > well, shit. as of 3.11, it looks like the magic for retaining the correct derived class and the relative filepointer for a given HDF5-backed summarized experiment is clean
Tim Triche (18:37:36): > apparently MAE is really where the plot is lost now?
Tim Triche (18:37:48): > I think I can fix up the issue we used to have much more easily now
Marcel Ramos Pérez (18:48:01): > Hi Tim, I’d like to help. Can you be specific about any requested features / issues for MAE athttps://github.com/waldronlab/MultiAssayExperiment/issues? Thanks.
Tim Triche (19:01:00): > I’ll make a reprex
Tim Triche (19:01:26): > it will take a little longer than I originally thought though since HDF5Array seems to have ensmartened:slightly_smiling_face:
Al J Abadi (19:12:11): > @Al J Abadi has joined the channel
Hervé Pagès (19:59:55): > @Tim Triche and also here, https://github.com/Bioconductor/HDF5Array/issues, for any request or bug report about HDF5Array objects or saveHDF5SummarizedExperiment/loadHDF5SummarizedExperiment. Thx!
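(A minimal, self-contained sketch of the save/load round trip being discussed; a derived class such as a GenomicRatioSet goes through the same two calls, which also preserve that class.)
> library(SummarizedExperiment)
> library(HDF5Array)
>
> se <- SummarizedExperiment(assays = list(counts = matrix(rpois(20, 5), nrow = 4)))
> dir <- file.path(tempdir(), "se_h5")
> saveHDF5SummarizedExperiment(se, dir = dir, replace = TRUE)
> se2 <- loadHDF5SummarizedExperiment(dir)
> class(assay(se2))   # the assay comes back as an HDF5-backed DelayedMatrix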
Tim Triche (21:50:05): > Thank you!
2020-05-07
FelixErnst (02:19:30): > @FelixErnst has joined the channel
2020-05-14
Aaron Lun (12:43:56): > @Mike JiangDo you already have a DelayedArray backend for TileDB? I have half a package (read on dense arrays only, read/write for sparse arrays) but if you have something more mature I will see what I can donate to yours.
Mike Jiang (13:34:05) (in thread): > I worked on it two years ago when tiledb didn’t have random slicing implemented, and I haven’t touched it since, so probably you are far ahead of me. What’s your impression of tiledb compared to h5 though? As far as I’ve experienced, it is not much better than h5 in terms of speed.
Aaron Lun (13:37:38) (in thread): > I haven’t done any benchmarking, I was actually looking at your RPubs to get a feel of whether it would be worth doing. (I ended up doing it anyway just out of curiosity.) Seems that dense scenarios don’t look that good, but perhaps their sparse support is worth the effort.
Mike Jiang (13:45:29) (in thread): > Right, I am currently working on extending cytolib to tiledb backend support. Mainly for its better support of cloud storage
Aaron Lun (13:46:24) (in thread): > Excellent. In that case I will continue working on this DA backend, looks like we’re not working on the same thing.
2020-05-15
Sean Davis (08:32:38) (in thread): > Just FYI,@Mike Jiang, you might want to pull@Vince Careyand@John Readeyin. HDF5 has some relatively new cloud capabilities. I’m not sure how they compare to tiledb, though.
Mike Jiang (12:31:03) (in thread): > Right, libhdf5 currently only supports read-only operations on S3. Also it has some issues with concurrency, which is why we are exploring the extension to TileDB. That said, libhdf5 has been working smoothly and efficiently on the local file system.
2020-05-18
Aaron Lun (02:46:39): > @Hervé PagèsThinking about working on a second round of improvements for DA matrix multiplication, current thoughts here:https://docs.google.com/document/d/1ONjkv6F0sMsfEIR1DzxCxmn8DrixFAvu5-7_j_38c5c/edit?usp=sharing
Shian Su (03:36:48): > Does this assume that a single row/column must fit in memory?
Aaron Lun (03:38:54): > Yes.
Aaron Lun (03:39:44): > Which we do already.
Shian Su (03:55:40): > Not sure if you are also considering numerical stability like inhttps://cran.r-project.org/web/packages/PreciseSums/index.html.
Shian Su (04:11:40): > I think if you want absolute numerical accuracy you will eventually get screwed by hardware. But this sounds like a good change to narrow down the points of failure.
B P Kailash (08:43:57): > @B P Kailash has joined the channel
Huipeng Li (09:23:20): > @Huipeng Li has joined the channel
Nicholas Knoblauch (11:22:15) (in thread): > Floating point arithmetic is not associative; I think holding up one blocking scheme as more “correct” is going to lead to bad times long term. When C++ added the parallel version of std::inner_product to the standard library, they came up with a whole new name for it (std::transform_reduce, https://en.cppreference.com/w/cpp/algorithm/transform_reduce) precisely because, unlike parallel versions of other algorithms (e.g. find, count), it has different affordances.
Aaron Lun (11:41:49) (in thread): > Consider this practical example. You want to run the same code, with the same HDF5-backed representation, on two different machines; but because the blocking scheme is different due to differences in the user’s getAutoBlockSize(), you get different results. This would be unacceptable.
Aaron Lun (11:48:36): > This is no different from our usual floating point discussions.
Aaron Lun (11:48:53): > It’s for this reason I don’t turn on AVX and friends in my C++ code.
Aaron Lun (11:52:47): > To be clear, I care not for accuracy, I just want to get the same result regardless of the matrix representation.
Aaron Lun (11:53:35): > Or even more specifically; for the same representation, I want to get the same result regardless of the DelayedArray blocking scheme.
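(A quick, DelayedArray-free illustration of the point at issue: the same sum computed with different block sizes need not be bitwise identical. The block sizes here are arbitrary.)
> set.seed(1)
> x <- runif(1e6)
>
> sum_in_blocks <- function(x, block_size) {
>     idx <- split(seq_along(x), ceiling(seq_along(x) / block_size))
>     sum(vapply(idx, function(i) sum(x[i]), numeric(1)))
> }
>
> s1 <- sum_in_blocks(x, 1e4)
> s2 <- sum_in_blocks(x, 3e4)
> identical(s1, s2)   # typically FALSE
> s1 - s2             # a tiny, but nonzero, difference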
Nicholas Knoblauch (12:16:50) (in thread): > I believe that is true (in general) without involving DA
Aaron Lun (12:17:40) (in thread): > Not sure what you mean, but I just want to get the same result regardless of what blocking scheme the user has decided to use.
Aaron Lun (12:18:28) (in thread): > Guarantees are limited by hardware but that shouldn’t stop us from doing the closest we can.
Nicholas Knoblauch (12:19:30) (in thread): > even if they have the same instruction set, even if they’re compiled with the same compiler, base R does not guarantee you will get the same results
Aaron Lun (12:20:45) (in thread): > Do you have an example?
Nicholas Knoblauch (12:21:13) (in thread): > I mean the classic is the matprod option
Aaron Lun (12:21:15) (in thread): > I know that openBLAS and MKL can give different results, but that’s because of a bug in the latter.
Aaron Lun (12:21:42) (in thread): > In any case, DA should not be responsible for introducing an additional point of discrepancy.
Aaron Lun (12:25:14) (in thread): > Moreover, this discussion is largely academic, because DelayedArray::getAutoMultParallelAgnostic() is TRUE by default, so we’re already doing it the way I suggested; it will be a simple matter of streamlining the cost calculations to ignore the other code path.
Hervé Pagès (12:28:55) (in thread): > This also affects things like rowSums(). It’s inherent to block processing in general. Maybe we should have an option in DelayedArray that the user can set to use block schemes that avoid these discrepancies, but that means schemes that tend to be less efficient in general.
Aaron Lun (12:29:35) (in thread): > Yes, we should.
Aaron Lun (12:30:06) (in thread): > Though matrix multiplication is currently the worst, because it’s used to create PCs, which is used to launch a t-SNE, which changes at the drop of a hat if your input values are not identical to the bit.
Aaron Lun (12:30:35) (in thread): > The others I don’t care so much about because the numerical differences don’t have any real practical consequences there.
Hervé Pagès (12:31:14) (in thread): > sure sure, but I’d rather have a general solution to a general problem
Aaron Lun (12:31:44) (in thread): > Well, if row sums are always computed using rowGrid chunking, we’ll be fine.
Nicholas Knoblauch (12:35:52) (in thread): > If you really need bitwise identical BLAS results, it’s probably worth looking into something like reproBLAS: http://bebop.cs.berkeley.edu/reproblas/. If you look at the list of reproducible functions they implement, it’s kind of a bummer: http://bebop.cs.berkeley.edu/reproblas/status.php - Attachment ( ): ReproBLAS - Reproducible Basic Linear Algebra Sub-programs > Reproducibility is the ability to obtain bitwise identical results. ReproBLAS provides linear algebra routines that remain reproducible regardless of the order of summation.
Hervé Pagès (12:36:20) (in thread): > Note that the block scheme has to deal with some serious constraints and be flexible enough to workaround them. For example, using row-wise blocking of the left matrix in matrix multiplication would lead to a disaster if the left matrix is a big 10x dataset where loading a single row will blow up your memory.
Hervé Pagès (12:38:51) (in thread): > (I’m talking about the sparse-on-top-of-HDF5 format from the 10x genomics people.)
Aaron Lun (12:38:53) (in thread): > Yes, that is known. But there is an assumption that there is a sane blocking scheme that provides reasonably efficient row- and column-level access.
Aaron Lun (12:39:37) (in thread): > I mean, if we can’t get that,beachmat’s row-level access fails.
Aaron Lun (12:40:39) (in thread): > Some operations just require row-level access. No way around it.
Hervé Pagès (12:40:55) (in thread): > I like to minimize assumptions. A naive user trying to do some matrix multiplication with the 10x dataset should get a decent behavior.
Hervé Pagès (12:41:57) (in thread): > > Some operations just require row-level access. > Maybe but matrix multiplication should not be one of them.
Aaron Lun (12:52:47) (in thread): > Without that assumption, it will be difficult to implement a reproducible column-wise scheme with the appropriate cost calculations. DelayedMatrix-utils.R is already pretty complicated.
Hervé Pagès (13:14:42) (in thread): > That’s why I want to move that complexity to multGrids(). This is not ready yet but the idea is that multGrids() will take care of choosing the optimal block scheme. Optimal means different things for different people. Seems like for you it means a block scheme that doesn’t affect the result. For other people it means “minimize memory usage”. For others that can use a lot of cores it might mean something else. Assuming row-level access in that context just sounds too restrictive and oriented to a specific use case.
Aaron Lun (13:19:16) (in thread): > Getting the same result regardless of the block size (not even the scheme, just the size) seems like a pretty reasonable expectation to me.
Hervé Pagès (13:22:55) (in thread): > FWIW getting rid of the row-level and col-level access schemes in things like rowSums(), rowsum(), colSums(), colsum(), etc. was a major milestone in DelayedArray. For example it allows something like rowsum() to be much faster by using a block scheme that plays much nicer with parallelization. I think the same ideas can benefit matrix multiplication a lot.
Aaron Lun (13:24:21) (in thread): > I’m up for anything as long as the default behavior is agnostic to the block size.
Hervé Pagès (13:30:31) (in thread): > What the default should be is another discussion and I leave it for another time. But first we want to make everybody happy by supporting their use case via some global options. Sounds like at least we’re able to agree on that.
Hervé Pagès (13:42:07) (in thread): > About using global options to control the behavior of block-processed operations: I’ve been thinking for a while now about making all block-processed operations available via BLOCK_*() functions (e.g. BLOCK_rowSums()). I already have these functions in DelayedArray but they are not exported. These BLOCK_*() functions have extra arguments that let the user control various aspects of block processing. I was considering exporting them. This would offer an alternative and more convenient way for the user to control block processing. It also makes the code more readable and self-contained because its behavior no longer depends on a bunch of obscure global options. The “classic” interface (i.e. calling rowSums() or %*%) would still be available of course. They would call BLOCK_rowSums() and BLOCK_matrix_mult() internally and use the global options to set the extra arguments. What do you think?
Aaron Lun (13:44:34) (in thread): > I guess you could, but as a downstream developer, I can only ever call rowSums() or %*% rather than the BLOCK equivalents. Well, I guess my code could call the latter, but third-party packages (e.g., from CRAN) will only be calling the expected generics and so that is out of my control.
Hervé Pagès (13:53:22) (in thread): > That wouldn’t make any difference for CRAN packages of course. I want to emphasize the fact that these BLOCK_*() functions work on any matrix-like objects, not just DelayedMatrix objects. So they can even be used in contexts that do not involve DelayedMatrix objects.
Hervé Pagès (14:06:49) (in thread): > I should also ping @Peter Hickey on this because providing the BLOCK_*() interface would only make sense if we do it consistently across DelayedArray and DelayedMatrixStats.
Peter Hickey (19:02:43) (in thread): > I’m actually keen to deprecate DelayedMatrixStats and put the functionality into DelayedArray. I think we discussed this at some point in the past. now that we have MatrixGenerics, that’s the home for the generics and any backend-specific ‘matrixStats’ I think belongs in the backend package (e.g., general DelayedArray stuff in DelayedArray, any HDF5 stuff in HDF5Array)
Shian Su (19:15:20) (in thread): > I vaguely remember you removing AVX from your code, did you end up explicitly disabling it? I think compilers auto-generate AVX/SSE code on certain CPUs, definitely at -O3 but maybe even at -O2.
Aaron Lun (19:17:56) (in thread): > Only if you ask to compile with a native architecture, I think. If you compile against a generic architecture, this should not happen.
Shian Su (19:31:55) (in thread): > Cool, good to know.
Aaron Lun (21:23:50) (in thread): > Coming back to this: it is unlikely that I’d use the BLOCK interface. Despite how I feel about what the most appropriate blocking scheme should be, I’d still respect the user’s wishes on what blocking scheme they want to use, so my code would still need to respond to their global variables. This would not be possible if I baked the most-correct blocking scheme into my code via the BLOCK interface.
Hervé Pagès (22:00:39) (in thread): > Even in highly specialized routines like your principal component use case? Sounded like a situation where you’d want to use a very particular block scheme for matrix multiplication (and this block scheme might not be the default one). Anyway, exposing the BLOCK interface is something I’ve had on my mind for a while. Even if it doesn’t have much appeal for being used in package code, I still like that the functions in the interface are pure functions and easier to document. Also, showing various examples that illustrate the impact of different block schemes becomes easier.
Aaron Lun (22:08:18) (in thread): > Yes, if the alternative is not to be able to run it at all because the data has a boneheaded chunking setup. > > Though I don’t think my application is all that specialized. One would expect that tweaking of efficiency parameters (block size and scheme) should not change the actual output of an operation. This is the least surprising behavior and should be the default. For example, all of my functions across all of my packages honor this expectation with respect to how they parallelize things.
Aaron Lun (22:11:43) (in thread): > And besides, I don’t think we’re losing much from having this as the default. It’s worked well so far.
Hervé Pagès (23:19:30) (in thread): > > One would expect that tweaking of efficiency parameters (block size and scheme) should not change the actual output of an operation. > I agree in general but with operations that involve floating point numbers that’s no longer a reasonable/realistic expectation.
2020-05-19
Aaron Lun (00:01:04) (in thread): > I don’t think it’s particularly hard, given that we’re already doing it.
Hervé Pagès (00:26:57) (in thread): > It only works right now if the 2 matrices have the chunking that allows it to work. As discussed earlier it will choke if the left matrix has chunks made of full columns. We need a general solution that does a good job for a wide range of use cases. Such a general solution doesn’t need to try to achieve the cute property that the results must be exactly the same whatever the block size or the order in which blocks are processed. There are many use cases where this doesn’t matter, for the same reason that it generally doesn’t matter that 2 + 1e-15 - 2 is not exactly the same as 2 - 2 + 1e-15.
2020-05-25
John Readey (12:47:45) (in thread): > Hey @Mike Jiang - right, hdf5lib doesn’t work well with cloud data, especially w.r.t. writing. HSDS should be a good solution though: it supports multiple clients updating the file simultaneously. I’d be curious to learn how it compares with TileDB.
Mike Jiang (13:26:55) (in thread): > Yes. But HSDS is a service, which cannot be easily deployed/shipped and installed along with our R packages by users, right?
2020-05-26
Sean Davis (14:41:55) (in thread): > @Mike Jiang I see the main use case for data services like HSDS to be in sharing (and potentially operating on) large, centrally-managed datasets with Bioconductor users, akin to ExperimentHub.
Sean Davis (14:43:06) (in thread): > I agree that having each user deploy HSDS personally is probably not a good use case to support.
2020-05-28
John Readey (14:02:26) (in thread): > NASA has been asking for the ability to use HSDS functionality without the service as well. I’m planning to add support for this over the summer. See https://github.com/HDFGroup/hsds/blob/master/docs/design/direct_access/direct_access.md.
John Readey (14:03:44) (in thread): > In this serverless mode, the client would only be able to utilize whatever cores are on the host, but the advantage is that it can be just used by a package.
John Readey (14:04:11) (in thread): > The app would need to have at least read access to the bucket of course.
John Readey (14:04:45) (in thread): > Also, it will be most efficient if the app is running in the same AWS_REGION as the bucket.
John Readey (14:05:10) (in thread): > But with those provisos, it should be handy for some use cases.
John Readey (14:05:50) (in thread): > My first target will be for Python apps, but we can talk about R support if there is interest.
Sean Davis (15:59:09) (in thread): > Cool,@John Readey!
2020-06-01
Shuyu Zheng (03:06:24): > @Shuyu Zheng has joined the channel
2020-06-04
Nick Eagles (17:51:03): > Hi all – first off, please let me know if this is not the right channel for questions like this. Does anyone know of helpful resources to understand how to effectively work with HDF5Array objects? I’m often finding that code I write either stalls or exceeds a reasonable amount of memory usage, because I’m not fully understanding what’s going on “under the hood”. Suppose I want to sort a few different HDF5-backed bsseq objects, then combine them with rbind. rbind is delayed, but what about sorting (in particular, I wanted to call the bsseq function strandCollapse)?
Mike Jiang (17:58:38): > @Mike Smith @Hervé Pagès @Aaron Lun I don’t know if this was asked before. According to https://github.com/grimbough/Rhdf5lib/blob/master/vignettes/Rhdf5lib.Rmd, this is the recommended way of retrieving the path to the HDF5 static libs:
> RHDF5_LIBS=$(shell echo 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")' | "${R_HOME}/bin/R" --vanilla --slave)
> Here is my use case, which confused the user and myself quite a bit and cost quite some troubleshooting effort. On the HPC server, this is the default system R lib path, which has Rhdf5lib 1.8 installed:
> .libPaths()
> [1] "/app/easybuild/software/R/3.6.2-foss-2016b/lib/R/library"
> And I installed Rhdf5lib 1.11 into my personal R lib path ~/mypath, and put it before the system path by setting .libPaths() in my .Rprofile file, i.e.
> .libPaths()
> [1] "/home/wjiang2/mypath/3.6"
> [2] "/app/easybuild/software/R/3.6.2-foss-2016b/lib/R/library"
> Then when I R CMD INSTALL ncdfFlow, it gives me an HDF5 lib mismatch error: "Headers are 1.10.6, library is 1.10.5". Because the Rhdf5lib::pkgconfig call was invoked with --vanilla, which essentially ignores the customized .libPaths() set in the .Rprofile file, the link goes to the older HDF5 lib from the system R lib path (i.e. 3.6.2-foss-2016b/lib/R/library) while the header comes from mypath/3.6, pulled in via the LinkingTo field in DESCRIPTION.
> My easy fix is to change Makevars to use Rscript -e 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")', which I thought might be more robust? But I’d like to double check with you guys, since the current form is how HDF5Array and many other Bioc packages link to Rhdf5lib, and there must be some deliberate considerations behind it.
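(For reference, a sketch of the two Makevars variants being compared above; treat it as illustrative rather than as the officially recommended form.)
> # Vignette form: R --vanilla skips ~/.Rprofile, so a custom .libPaths() set
> # there is not seen when resolving Rhdf5lib
> RHDF5_LIBS = $(shell echo 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")' | "${R_HOME}/bin/R" --vanilla --slave)
>
> # Alternative proposed above: Rscript runs the usual startup files, so a
> # .libPaths() customization in ~/.Rprofile is picked up
> RHDF5_LIBS = $(shell "${R_HOME}/bin/Rscript" -e 'Rhdf5lib::pkgconfig("PKG_CXX_LIBS")')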
Tim Triche (18:32:41) (in thread): > well lookit here,@Kasper D. Hansen@Peter Hickey@Vince Carey… maybe it does still make sense to write these up
Tim Triche (18:33:42): > @Nick Eagles as someone currently replacing a function called “unionize” with a resize-and-rewrite strategy for HDF5-backed bsseq (and other ranged) objects, I’m sorry, but there is not (yet) a universal answer
Tim Triche (18:34:10): > this question comes up a LOT and the “right” answer depends upon what you are trying to do and how often
Peter Hickey (18:48:32) (in thread): > You might find https://github.com/PeteHaitch/BioC2019_DelayedArray_workshop helpful. I’m updating this for BioC2020 (https://github.com/PeteHaitch/BioC2020_DelayedArray_workshop)
Peter Hickey (18:49:58) (in thread): > For questions about bsseq, please post them to http://support.bioconductor.org/ or perhaps https://github.com/hansenlab/bsseq (for more technical problems)
Vince Carey (23:23:22): > @Nick EaglesI think we would like to help with this – and it will be important to have the attention of@Hervé Pagès. Can you produce a gist or github repo with an example that shows poor behavior in some application, or illustrates the sorting task you are working on?
Shian Su (23:28:29): > I think it would go a long way to have a document listing which functions are likely to blow out your memory. Does this already exist?
Shian Su (23:35:34): > My impression of most “seamless” out-of-memory frameworks has been that I have to determine in my head whether an out-of-memory algorithm exists for a particular operation, then gamble on whether or not the developer has implemented the operation in that way. Sort is an example where an out-of-memory algorithm exists (as used in GNU sort) but most out-of-memory frameworks are likely to try a full data load instead.
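(One way to see exactly what stays out of memory is to do the block processing explicitly. A minimal sketch using DelayedArray’s blockApply(); the toy matrix and block size are arbitrary.)
> library(DelayedArray)
> library(HDF5Array)
>
> x <- writeHDF5Array(matrix(runif(1e6), nrow = 1e4))  # small on-disk matrix
> setAutoBlockSize(1e6)                                # cap the bytes realized per block
>
> # each block is realized as an ordinary (small) in-memory array
> block_maxima <- blockApply(x, function(block) max(block))
> max(unlist(block_maxima))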
2020-06-05
Hervé Pagès (01:36:11): > @Nick EaglesYes more details about what you’re trying to do would be helpful. In particular what are you trying to sort exactly? The rows? The columns? The array elements? (I’m afraid this last one will cause a disaster). Please produce a gist as@Vince Careysuggested, or ask on the support site, or open an issue under HDF5Array. Also Pete’s (@Peter Hickey) workshop document is a must read. Much better than any vignette I could ever write.
Peter Hickey (01:37:34) (in thread): > i’ll get it into DelayedArray as a vignette one of these days
Nick Eagles (12:56:31): > Thank you everyone for your comments so far, and I’ll try to form a more specific and detailed snippet of exactly what I’m trying. Below are the pieces of code most directly involved in and explanatory of the task I’m attempting (for succinctness some variable definitions are not shown):
> # Potentially relevant memory/threading details
> numCores = 2
> dtThreadsPerCore = 3
>
> BiocParallel::register(MulticoreParam(numCores))
> setAutoBPPARAM(MulticoreParam(numCores)) # this is necessary, despite above line
> h5disableFileLocking()
> data.table::setDTthreads(dtThreadsPerCore * numCores)
> setAutoBlockSize(1e9) # for speed- default is 1e8
>
> # Read the reports into a bsseq object ('reportFiles' are cytosine
> # reports from bismark for a single chromosome)
> print('Beginning read.bismark...')
> BSobj = read.bismark(reportFiles, colData = data.frame(row.names = ids),
>     strandCollapse = FALSE, nThread=1, BACKEND = "HDF5Array",
>     dir=testDir, BPPARAM = MulticoreParam(numCores),
>     verbose=TRUE)
> print('Read.bismark completed.')
> gc() # check memory usage
>
> # Manually read info from the first report, to construct GRanges
> print('Constructing GRanges from first report and attaching to the BSobj...')
> context <- fread(reportFiles[1], colClasses = c('factor', 'numeric',
>     'factor', 'integer', 'integer', 'factor', 'factor'))
> context_gr <- GRanges(seqnames = chrom, IRanges(start = context[[2]], width = 1), strand = context[[3]],
>     c_context = Rle(context[[6]]), trinucleotide_context = Rle(context[[7]]))
>
> # Re-order based on strand, as 'BSobj' is ordered
> context_gr <- c(context_gr[strand(context_gr) == '+'], context_gr[strand(context_gr) == '-'])
> stopifnot(identical(ranges(granges(BSobj)), ranges(context_gr)))
> rowRanges(BSobj) <- context_gr
> rm(context, context_gr)
> print('Done.')
>
> print('Splitting BSobj by cytosine context...')
> BS_CpG = strandCollapse(BSobj[which(rowRanges(BSobj)$c_context == 'CG'),])
> BS_nonCpG = BSobj[which(rowRanges(BSobj)$c_context != 'CG'),]
> It’s really the second-to-last line prompting my question: my script has been stuck at this line for more than 24 hours (note strandCollapse must do quite a bit of rearranging of ranges (rows) because the initial ordering puts all ‘+’ stranded loci before all ‘-’ loci). I have tried the same line without strandCollapse, and it runs fairly quickly.
Nicholas Knoblauch (13:06:17): > Pardon the generic advice, but when you run into a problem like this it usually helps to try the following:
> 1. come up with the smallest, fully representative test case that you can, so that everything runs in a reasonable amount of time
> 2. profile (I like profvis for R)
> 3. profit
Tim Triche (16:28:03): > @Nicholas Knoblauchyou may find that with HDF5-backed data stores, that sequence of events won’t address the problems of greatest interest. Most HDF5 issues manifest at some threshold where chunking goes to hell – 499 runs complete in 15 minutes, and 500 runs take multiple days to finish. This is (imho) a fundamental reason for the slow-ish embrace of general-purpose HDF5 backending of SE-like objects.
Tim Triche (16:30:36): > Other “interesting” quirks like parallel processing can also be excruciating, although they usually yield to the suggested approach more readily. I spent a while as a systems programmer, HDF5 optimization can often seem more like that than a straightforward “give us a reprex to debug” (though I put together a small package to ease the assembly of those as well).
Tim Triche (16:31:27): > oh wow@Nick Eaglesthis is the exact issue I was working on the other day – I may have a postable fix in the next day or so
Nick Eagles (16:37:13): > Thanks @Tim Triche, that’s great to hear (also I hadn’t heard of profvis, I’ll probably use that a lot)
Tim Triche (16:37:15): > you want to size the HDF5 matrix based on the union of the rows and sort that before reordering the array, fwiw
Tim Triche (16:37:59): > profvis is great but it can be very difficult to predict how HDF5 and parallel execution will interact as you scale up/out
Tim Triche (16:38:53): > That said,@Nicholas Knoblauchhas a point – a profiler & debugger is no substitute for clear thinking, nor is clear thinking a substitute for a profiler & debugger (an old internal IBM motto).
Nicholas Knoblauch (17:06:35): > @Tim TricheI totally agree that profiling toy benchmarks can be misleading, but when you’re in the “what part of this pipeline is taking so long” stage, or even the “why is step 4 of the pipeline behaving like it’s O(N^3) when it should be O(N^2)” stage of troubleshooting performance problems, toy problems are really useful. While it’s not uncommon to observe significant runtime differences coincide with small differences in problem size when using HDF5 (e.g the notorious dataset selection problem) it’s all polynomial complexity kind of stuff, and the slowest step in the 500MB-1GB dataset version of the problem is almost certainly going to be the slowest in the 1TB version of the problem
Tim Triche (17:31:14): > Point taken — in the past I have decreased chunk size so that I can at least follow along — more recently though we’ve seen some far more puzzling behaviors with parallel execution that smell like heisenbugs
Peter Hickey (20:35:18): > @Nick EaglesI’m on holiday until Monday week but can offer advice when back. may i please ask again that if it’s bsseq-specific you post to the support forum or the github issues - slack is great but things get buried quickly and it’s good to have this stuff in the public where it’s searchable
2020-06-06
Olagunju Abdulrahman (19:56:52): > @Olagunju Abdulrahman has joined the channel
2020-06-07
Yingxin Lin (19:47:04): > @Yingxin Lin has joined the channel
2020-06-08
Tanzeel Tagalsir (03:00:47): > @Tanzeel Tagalsir has joined the channel
NABISUBI PATRICIA (16:18:51): > @NABISUBI PATRICIA has joined the channel
2020-06-09
Shankar Shakya (10:05:22): > @Shankar Shakya has joined the channel
Mark Tefero Kivumbi (15:37:54): > @Mark Tefero Kivumbi has joined the channel
Mike Smith (16:46:52): > Hi<!channel>,@Sean Davisand I were thinking about focusing the next developer forum on disk-based ‘big data’ topics. Does anyone have experience with formats like TileDB, Zarr, Parquet, HDF5 etc that they’re be willing to share? I thought it’s be cool to get some perspective on things other than HDF5, but any opinions or experiences on the topic would be great.
Aaron Lun (16:47:16): > We are working on a TileDB DA backend.
Tim Triche (16:48:08): > I know@Martin Morganwas working on a Zarr SE backend, and we’ve been fiddling with HDF5 plenty lately
Tim Triche (16:48:51): > Zarr looks awesome but right now all the hooks are in Python… would be nice if there were more “R-like” C hooks for it. HCA is using it as a storage backend so that piqued my interest. Alas, for now HDF5 it is
Tim Triche (16:49:49): > @Mike Smith I would dearly love it if a devel call on the kicks & stings of DelayedArray / out-of-core development in Bioconductor was on the agenda. @Vince Carey reminded me of shared past trauma with that
Mike Smith (16:54:47): > I’d happily watch a 15 minute demo on what we’re missing out on with Zarr, or a prototype basilisk interface to it (if that’s possible).
Sehyun Oh (17:03:05): > @Sehyun Oh has joined the channel
Martin Morgan (17:04:38): > My ZarrExperiment proof of concept is athttps://github.com/Bioconductor/ZarrExperiment; would be happy to collaborate on further maturation, especially basilisk-ization (and DelayedArray-ization?)
Sean Davis (17:10:57): > Sounds like we have Zarr covered and tiledb covered as “new players”. I can cover parquet, avro, and ORC (should be only 10 minutes or so). Others to cover?
Taoyu Mei (17:16:30): > @Taoyu Mei has joined the channel
Shian Su (20:16:30): > I recall there was another backend presented at BioC 2019 for DelayedArray, was it ORC?
Vince Carey (22:31:29): > I don’t know if this is a back end for DelayedArray, but rhdf5client uses DelayedArray to relate to HDF Scalable Data Service resources > > > example(HSDSArray) > > HSDSAr> HSDSArray(URL_hsds(), > HSDSAr+ "hsds", "/shared/bioconductor/darmgcls.h5", "/assay001") > <65218 x 3584> matrix of class HSDSMatrix and type "double": > [,1] [,2] [,3] ... [,3583] [,3584] > [1,] 0.000000 0.000000 112.394374 . 0.00000 0.00000 > [2,] 0.000000 0.000000 0.000000 . 0.00000 0.00000 > [3,] 0.000000 0.000000 0.000000 . 0.00000 0.00000 > [4,] 5.335452 11.685833 0.000000 . 0.00000 14.01612 > [5,] 0.000000 0.000000 0.000000 . 0.00000 0.00000 > ... . . . . . . > [65214,] 0.00000 0.00000 0.00000 . 0.00000 0.00000 > [65215,] 480.68946 1228.13851 112.75566 . 0.00000 0.00000 > [65216,] 0.00000 0.00000 0.00000 . 0.00000 0.00000 > [65217,] 0.00000 610.82997 46.86639 . 0.00000 0.00000 > [65218,] 10155.80336 25366.30099 2068.63983 . 4.01555 2531.88862 > > >
Vince Carey (22:37:32): > At the time of development of rhdf5client, HSDS could only run on AWS S3 with a specific server. Now it can be used in many ways, targeting cloud or local storage – seehttps://www.hdfgroup.org/solutions/highly-scalable-data-service-hsds/
Aaron Lun (22:38:22): > Are you paying for your HSDS deployment?
Vince Carey (22:41:34): > No – HDF group has supplied space and server cycles as needed. I think one can now use it with HDF5 loaded via h5pyd to AWS S3 buckets, without a dedicated server. Using it with a dedicated server introduces the possibility of scalably parallel read/write.
Vince Carey (22:42:06): > And in fact if you have a local object store, the API can be run locally.
Aaron Lun (22:42:51): > Could be an interesting direction to explore for ExperimentHub, though I don’t know what happens with loss of connectivity.
Aaron Lun (22:43:37): > For example, having the full dataset remote on EHub would be pretty amazing for lightweight Shiny app deployments.
Vince Carey (22:45:51): > One thought I had was to try to have recount2 or CONQUER as a single unified summarized experiment with this back end. HumanTranscriptomeCompendium package illustrates the concept with 181000 uniformly processed RNA-seq studies from NCBI SRA. Now “refine.bio” has even more RNA-seq data through the pipeline (700000 is the number of samples) and this approach could be used readily.
Vince Carey (22:47:16): > > > library(HumanTranscriptomeCompendium) > 1/20 packages newly attached/loaded, see sessionInfo() for details. > > xx = htx_load() > Loading required namespace: BiocFileCache > > xx > class: RangedSummarizedExperiment > dim: 58288 181134 > metadata(1): rangeSource > assays(1): counts > rownames(58288): ENSG00000000003.14 ENSG00000000005.5 ... > ENSG00000284747.1 ENSG00000284748.1 > rowData names(0): > colnames(181134): DRX001125 DRX001126 ... SRX999990 SRX999991 > colData names(4): experiment_accession experiment_platform > study_accession study_title > > assay(xx) > Error in assay(xx) : could not find function "assay" > No suitable frames for recover() > > library(SummarizedExperiment) > 4/0 packages newly attached/loaded, see sessionInfo() for details. > > assay(xx) > <58288 x 181134> matrix of class DelayedMatrix and type "double": > DRX001125 DRX001126 DRX001127 ... SRX999990 > ENSG00000000003.14 40.001250 1322.844547 1528.257578 . 1149.0341 > ENSG00000000005.5 0.000000 9.999964 6.000006 . 0.0000 > ENSG00000000419.12 64.000031 1456.004418 2038.996875 . 1485.0003 > ENSG00000000457.13 31.814591 1583.504257 1715.041308 . 631.7751 > ENSG00000000460.16 12.430602 439.321234 529.280324 . 945.6903 > ... . . . . . > ENSG00000284744.1 1.05614505 24.81388079 32.29261298 . 7.316061 > ENSG00000284745.1 0.99999879 15.99996994 16.99999743 . 0.000000 > ENSG00000284746.1 0.00000000 0.00379458 0.00000000 . 0.000000 > ENSG00000284747.1 7.77564984 270.83296409 239.88056843 . 108.011633 > ENSG00000284748.1 1.00000768 22.23010514 37.73881938 . 11.278980 > SRX999991 > ENSG00000000003.14 1430.3955 > ENSG00000000005.5 0.0000 > ENSG00000000419.12 1970.0004 > ENSG00000000457.13 802.0563 > ENSG00000000460.16 1259.7648 > ... . > ENSG00000284744.1 3.268453 > ENSG00000284745.1 0.000000 > ENSG00000284746.1 0.000000 > ENSG00000284747.1 94.606851 > ENSG00000284748.1 5.240970 >
Vince Carey (22:48:54): > Creating the HDF5 to populate the store was tricky. The assembly of such data is a significant effort but the rewards seem substantial.
2020-06-10
MounikaGoruganthu (11:23:57): > @MounikaGoruganthu has joined the channel
Ye Zheng (12:05:02): > @Ye Zheng has joined the channel
Andrew Jaffe (15:31:04): > @Andrew Jaffe has joined the channel
Aaron Lun (15:37:40): > https://github.com/LTLA/TileDBArray
Vandhana (16:03:39): > @Vandhana has joined the channel
Tim Triche (16:33:00): > @Aaron Lundoes tiledb support sparse arrays cleanly as a DelayedArray backend? This seems really cool
Aaron Lun (16:34:18): > It will, pending a few pieces in both TileDB-R (@Aaron Wolen may give some thoughts) and also DelayedArray support for sparse operations.
Tim Triche (16:35:03): > Matrix(rpois(0.1, n=100), nrow=10)
gives > > 10 x 10 sparse Matrix of class "dgCMatrix" > > [1,] . . . . . . 1 . . . > [2,] 1 . . . . . . . . . > [3,] . . . . . . 2 . . . > [4,] . . . . . . . . . . > [5,] . . . 1 . . . . . 1 > [6,] . . 1 . . . . . . . > [7,] . . . . . . . . . . > [8,] . . . . . . . . 1 . > [9,] . . . . . . . . . . > [10,] . 1 . . 1 . . . . . >
Tim Triche (16:35:19): > but if I hand that to TileDBArray,
Tim Triche (16:36:01): > as(Matrix(rpois(0.1, n=100), nrow=10) , "TileDBArray")
, it says it’s sparse but doesn’t really seem to know it? > > <10 x 10> sparse matrix of class TileDBMatrix and type "double": > [,1] [,2] [,3] ... [,9] [,10] > [1,] 0 0 0 . 0 0 > [2,] 0 0 0 . 0 0 > [3,] 0 0 0 . 0 0 > [4,] 0 0 0 . 0 0 > [5,] 0 0 0 . 0 0 > [6,] 0 0 0 . 0 0 > [7,] 0 0 0 . 0 0 > [8,] 0 0 0 . 0 0 > [9,] 0 0 0 . 0 0 > [10,] 0 0 0 . 0 0 >
Aaron Lun (16:36:13): > see the sparse.
Tim Triche (16:36:17): > regardless I like this package a lot already
Tim Triche (16:36:28): > yes, it says sparse, but it prints a lot of zeros:wink:
Aaron Lun (16:36:37): > I don’t control the show method of DA, that’s just a show method.
Tim Triche (16:36:48): > also it looks like maybe it’s dropping some of the 1s?
Tim Triche (16:37:11): > no wait, I’m a dumbass, I need to use the same seeded dummy array
Aaron Lun (16:37:47): > Anyway, I would advise holding off from actual testing until the other Aaron and co. optimize many of the calls.
Tim Triche (16:38:08): > cool, it’s just less of a PITA than some alternatives and I like it
Tim Triche (16:38:11): > set.seed(123); fake <- matrix(rpois(0.1, n=100), nrow=10)
Tim Triche (16:38:38): > Matrix(fake)
> > 10 x 10 sparse Matrix of class "dgCMatrix" > > [1,] . 1 . 1 . . . . . . > [2,] . . . . . . . . . . > [3,] . . . . . . . . . . > [4,] . . 1 . . . . . . . > [5,] 1 . . . . . . . . . > [6,] . . . . . . . . . . > [7,] . . . . . . . . 1 . > [8,] . . . . . . . . . . > [9,] . . . . . . . . . . > [10,] . 1 . . . . . . . . >
Tim Triche (16:39:33): > as(Matrix(fake), "TileDBArray")
> > <10 x 10> sparse matrix of class TileDBMatrix and type "double": > [,1] [,2] [,3] ... [,9] [,10] > [1,] 0 1 0 . 0 0 > [2,] 0 0 0 . 0 0 > [3,] 0 0 0 . 0 0 > [4,] 0 0 1 . 0 0 > [5,] 1 0 0 . 0 0 > [6,] 0 0 0 . 0 0 > [7,] 0 0 0 . 1 0 > [8,] 0 0 0 . 0 0 > [9,] 0 0 0 . 0 0 > [10,] 0 1 0 . 0 0 >
Tim Triche (16:39:43): > looks good, PEBKAC previously from me.
Hervé Pagès (17:29:28): > @Aaron Lun Looks good. Finally a TileDB backend for DelayedArray! You need a .. cough.. extract_sparse_array() method for your… cough… TileDBArraySeed objects … cough
Aaron Lun (17:42:00): > Well, what I really need is for extract_array to, cough, be able to support multiple matrix representations, cough. But FINE, I’ll do it your way.
Hervé Pagès (17:42:57): > Except that this would break sparsity along the way, as discussed to great lengths somewhere else…
Aaron Wolen (17:43:03): > Thanks for posting,@Aaron Lun. The package looks great—I’m looking forward to kicking the tires.Cough… must be something in the air…
Hervé Pagès (17:43:44): > some bad virus floating around… cough
Dirk Eddelbuettel (17:44:15): > @Dirk Eddelbuettel has joined the channel
Dirk Eddelbuettel (17:44:52): > Hi all
Aaron Lun (17:45:11): > :party_parrot:
Dirk Eddelbuettel (17:46:55): > TileDBArray is mean to me. It keeps telling me it cannot load BiocParallel. A vanilla R session can. Newb error?
Aaron Lun (17:48:03): > Hm. What’s it say? TileDBArray doesn’t do anything special there, that’s a dependency via DelayedArray.
Hervé Pagès (17:49:04): > only a suggested one though. No more parallelization by default in DelayedArray.
Dirk Eddelbuettel (17:50:26): > I am still stuck at my (very reflexive) git clone ... followed by R CMD build ... (via littler script build.r) and then R CMD check ... (either directly or via littler script rcc.r).
Dirk Eddelbuettel (17:50:43): > Unhappy with examples and/or tests. Will triage a little.
Shian Su (17:52:32) (in thread): > I think you need BiocManager::install("LTLA/TileDBArray", build_vignettes = TRUE) to get vignettes when installing off GitHub.
Aaron Lun (17:53:16) (in thread): > Where is this coming from?
Aaron Lun (17:53:36) (in thread): > can’t you just read the Rmarkdown in the vignettes/ directory?
Dirk Eddelbuettel (17:53:49): > Weirdness. If I source the test file “by hand” (as in the 2nd line in comments) it works. Formal check barks:
Dirk Eddelbuettel (17:54:03): - File (Plain Text): Untitled
Shian Su (17:56:11) (in thread): > Yes, but don’t you want it to show up when you use vignette(package = "TileDBArray") or when you look inside the package index?
Aaron Lun (17:56:59) (in thread): > … I don’t mind either way.
Aaron Lun (17:57:36) (in thread): > I don’t use the vignette system like that, but you’re more than welcome to install it with build_vignettes=TRUE if you want.
Shian Su (17:58:59) (in thread): > You want users who install straight off github to read the unbuilt vignette inside the repo?
Dirk Eddelbuettel (17:59:03): > From examples:
Dirk Eddelbuettel (17:59:17): - File (Plain Text): Untitled
Aaron Lun (17:59:17) (in thread): > Yes.
Dirk Eddelbuettel (17:59:41): > Do I need to reinstall matrixStats or some other package I may not have/may not have had?
Shian Su (17:59:43) (in thread): > Ok as long as that’s what you want.
Aaron Lun (17:59:50): > hm.
Aaron Lun (18:00:06): > Just to be clear, BiocParallel is installed somewhere.
Dirk Eddelbuettel (18:00:17): > Yes.
Aaron Lun (18:00:21) (in thread): > well, to be more clear, I don’t really care.
Dirk Eddelbuettel (18:00:49): - File (Plain Text): Untitled
Aaron Lun (18:01:28): > HMM. I wonder why it does that.
Aaron Lun (18:01:48): > Got a meeting right now but will think about this during it.
Dirk Eddelbuettel (18:01:58): > Sounds good!
Dirk Eddelbuettel (18:07:13): > Needed to comment out two ops; after that it is just small NOTES I can ping you about.
Dirk Eddelbuettel (18:07:41): - File (Diff): Untitled
Aaron Lun (18:11:29): > I wonder if testthat is messing with the lib.loc.
Dirk Eddelbuettel (18:15:10): > I keep things very simple, no per-user lib, everything system-wide in /usr/local/lib/R/site-library which is spot 1 in .libPaths().
Dirk Eddelbuettel (18:17:24): > Just read the vignette, this is nice stuff, especially because you have an extra layer over it. So we should be able to use tiledb_array (which does all the newer things like non-contiguous subsets etc.) and it can take a toggle for sparse on/off, hopefully replacing tiledb_dense and tiledb_sparse. At some point…
Aaron Lun (18:17:49): > yep, that’s what I was hoping you would say.
Dirk Eddelbuettel (18:18:13): > I’m all for it. We can probably get that done easily.
Shian Su (19:28:31): > Right off the bat writeTileDBArray is 15X faster than writeHDF5. Very cool!
Dirk Eddelbuettel (19:49:10): > Very nice! And the package is not yet “released” so there is no real tuning or benchmarking. But that is very encouraging!.
Hervé Pagès (19:55:41): > Here is what I get with a dense 1e5 x 1e3 numeric matrix: > > > library(TileDBArray) > > library(HDF5Array) > > m <- matrix(runif(1e8), 1e5, 1e3) > > DelayedArray:::set_verbose_block_processing(TRUE) > > system.time(tdba <- writeTileDBArray(m)) > Realizing block 1/8 ... OK, writing it ... OK > Realizing block 2/8 ... OK, writing it ... OK > Realizing block 3/8 ... OK, writing it ... OK > Realizing block 4/8 ... OK, writing it ... OK > Realizing block 5/8 ... OK, writing it ... OK > Realizing block 6/8 ... OK, writing it ... OK > Realizing block 7/8 ... OK, writing it ... OK > Realizing block 8/8 ... OK, writing it ... OK > user system elapsed > 3.162 2.222 5.979 > > system.time(h5a <- writeHDF5Array(m)) > Realizing block 1/10 ... OK, writing it ... OK > Realizing block 2/10 ... OK, writing it ... OK > Realizing block 3/10 ... OK, writing it ... OK > Realizing block 4/10 ... OK, writing it ... OK > Realizing block 5/10 ... OK, writing it ... OK > Realizing block 6/10 ... OK, writing it ... OK > Realizing block 7/10 ... OK, writing it ... OK > Realizing block 8/10 ... OK, writing it ... OK > Realizing block 9/10 ... OK, writing it ... OK > Realizing block 10/10 ... OK, writing it ... OK > user system elapsed > 29.696 0.342 30.068 > > system.time(rs1 <- rowSums(tdba)) > Processing block 1/8 ... OK > Processing block 2/8 ... OK > Processing block 3/8 ... OK > Processing block 4/8 ... OK > Processing block 5/8 ... OK > Processing block 6/8 ... OK > Processing block 7/8 ... OK > Processing block 8/8 ... OK > user system elapsed > 6.016 2.762 4.922 > > system.time(rs2 <- rowSums(h5a)) > Processing block 1/12 ... OK > Processing block 2/12 ... OK > Processing block 3/12 ... OK > Processing block 4/12 ... OK > Processing block 5/12 ... OK > Processing block 6/12 ... OK > Processing block 7/12 ... OK > Processing block 8/12 ... OK > Processing block 9/12 ... OK > Processing block 10/12 ... OK > Processing block 11/12 ... OK > Processing block 12/12 ... OK > user system elapsed > 3.012 0.524 3.536 >
> I expect TileDB to really shine with sparse data reading though…
Hervé Pagès (19:56:24): > Oh, and let’s not forget: > > > identical(rs1, rs2) > [1] TRUE >
> Yeah!!
Martin Morgan (22:21:07) (in thread): > Check out BiocManager::version() (presumably 3.12, aka ‘Bioc devel’, expected) and BiocManager::valid(); I’d guess you have packages from different Bioconductor versions; I’d suggest a separate library for R-4.0/Bioc-3.12, distinct from R-4.0/Bioc-3.11
Dirk Eddelbuettel (22:23:04) (in thread): > Possible, but I doubt it as I needed a clean break after R 4.0.0. Tried to help myself a little with this wrapper:https://github.com/eddelbuettel/littler/blob/master/inst/examples/installBioc.r
Dirk Eddelbuettel (22:24:33) (in thread): > Hm:
Dirk Eddelbuettel (22:24:50) (in thread): - File (R): Untitled
Dirk Eddelbuettel (22:25:15) (in thread): > I am developer or co-developer on those.
Dirk Eddelbuettel (22:25:52) (in thread): > This is more like it:
Dirk Eddelbuettel (22:26:09) (in thread): - File (R): Untitled
Dirk Eddelbuettel (22:27:15) (in thread): > Could any of those 7 BioC packages have any bearing on the matrix multiplication going belly up?
Martin Morgan (22:36:48) (in thread): > You have Bioc 3.11, and outdated packages in that version. Irrelevant, because you want BiocManager::install(version=‘3.12’)
Dirk Eddelbuettel (22:39:15) (in thread): > How do you know what I want?
Dirk Eddelbuettel (22:39:49) (in thread): > Why am I not allowed to use CRAN package as released and BioC packages as released?
Dirk Eddelbuettel (22:41:35) (in thread): > (I also in the meantime did a check in a Docker container; everything passes fine there. It is apparently just my box here. But I am mystified as to why Aaron’s package would not want to do matrix multiplication here.)
Dirk Eddelbuettel (23:01:57) (in thread): > Just to cover all bases I did upgrade to 3.12. It makes no difference.
Dirk Eddelbuettel (23:02:18) (in thread): - File (Plain Text): Untitled
Martin Morgan (23:03:05) (in thread): > I guess Aaron is developing his package in Bioc-3.12 (the devel branch of Bioconductor). His package Depends: on DelayedArray (from Bioc-3.12). DelayedArray recently (from Bioc-3.11 to Bioc-3.12) stopped using BiocParallel. If you have Bioc-3.12 packages, then you have the stack that Aaron is developing against. But you don’t, you have Bioc-3.11, so you pick up a DelayedArray that depends on BiocParallel in a way that Aaron’s development did not anticipate
Martin Morgan (23:06:48) (in thread): > (I guess I was speculating on the problem, and if you’ve got BiocManager::version() 3.12 and BiocManager::valid() report TRUE, then I am not immediately sure what the problem is)
Dirk Eddelbuettel (23:12:03) (in thread): > (At least I follow along for the reasoning. But the counter is that i) I reinstalled all I have from BioC to be on 3.12, ii) I rebuilt the tar.gz for his package, iii) it still fails on %*%. I am puzzled.)
Martin Morgan (23:36:48) (in thread): > I guess debugonce(DelayedArray:::.super_BLOCK_mult) and then B %*% runif(ncol(B)) will get you to the browser right where requireNamespace("BiocParallel") fails, so maybe that provides a chance to understand why BiocParallel fails to load?
Dirk Eddelbuettel (23:52:28) (in thread): > I am not sure. Part of why this has been so “charming” is that it works in my normal session – so no debugonce. But once I switch to R CMD check ... it breaks. But no browser()… (And it doesn’t matter whether I run the check directly or in RStudio via devtools and whatnot.)
2020-06-11
Dirk Eddelbuettel (07:34:43) (in thread): > Some “better” (or worse?) news: I have this issue reproducible in a Docker container now. That points to a possible cause on the TileDB code side (we changed, for example, the threading default options) but I don’t understand how it could affect R CMD check ... versus an interactive R session. Same TileDB library in either case.
Dirk Eddelbuettel (10:05:41) (in thread): > We got it. Plain packaging bug over on Aaron’s side.
Dirk Eddelbuettel (10:06:23) (in thread): > @Aaron Wolen wins the “Sherlock Holmes of the day” prize, and I look like a goof for not realizing this.
Dirk Eddelbuettel (10:06:38) (in thread): > Ah well 2020 has been full of nasty turns so I add this “bug” to it.
Jake Wagner (13:25:59): > @Jake Wagner has joined the channel
Vince Carey (17:48:08): > @Aaron Wolen Is it possible to use the gcs:// URI as described in https://docs.tiledb.com/main/backends/gcs with the tiledb package?
Dirk Eddelbuettel (17:58:38): > “In theory”. The library has backend support for Azure, GCS and S3. In practice S3 is the most tested.
Dirk Eddelbuettel (18:02:19): > We are trying to get full-featured libraries out with support for all backends but we are currently not quite there. But if you build the library locally with whichever feature set you want and then use it for the R and/or Python package it should work. It is part of our unit testing setup as per the CI matrix at GitHub.
Vince Carey (19:09:53): > S3 works out of the box. This is Hervé’s example with set.seed(12345): > > > TileDBArray("s3://biocfound-tiledb/demotdb001/") > <100000 x 1000> matrix of class TileDBMatrix and type "double": > [,1] [,2] [,3] ... [,999] [,1000] > [1,] 0.11370341 0.68814550 0.07839776 . 0.11771783 0.72135397 > [2,] 0.62229940 0.20002794 0.83311604 . 0.80067009 0.32176241 > [3,] 0.60927473 0.50996910 0.62454521 . 0.27113518 0.94060051 > [4,] 0.62337944 0.98954334 0.47277772 . 0.09109424 0.69933508 > [5,] 0.86091538 0.90303722 0.76673427 . 0.12779631 0.65071573 > ... . . . . . . > [99996,] 0.3715216368 0.1858485718 0.2648862121 . 0.19065722 0.71833339 > [99997,] 0.1888288730 0.4238409030 0.6869190594 . 0.57777592 0.46098670 > [99998,] 0.4774170429 0.3340604741 0.6735544174 . 0.71760826 0.15940470 > [99999,] 0.8825738563 0.0002781234 0.2721453537 . 0.50385549 0.05919759 > [100000,] 0.8068715138 0.9980811779 0.7704871616 . 0.66994891 0.90878067 >
> I did no configuration. The retrieval seems sluggish relative to HSDS with a server, but I should try the unmediated approach.
Hervé Pagès (19:30:59): > (crowd chanting) ROW SUMS! ROW SUMS! ROW SUMS! …
Vince Carey (19:43:00): > > > system.time(rs <- rowSums(tt)) > Processing block 1/8 ... OK > Processing block 2/8 ... OK > Processing block 3/8 ... OK > Processing block 4/8 ... OK > Processing block 5/8 ... OK > Processing block 6/8 ... OK > Processing block 7/8 ... OK > Processing block 8/8 ... OK > user system elapsed > 18.875 15.995 278.381 >
Hervé Pagès (19:54:14): > mmh… I’ve no idea how fast downloading data from an S3 bucket is supposed to be but it would be interesting to compare with a direct download from S3. The TileDb file cannot be that big after all (< 800 MB).
Vince Carey (20:10:32): > actually there is 1.2 GB of tiledb data generated, while the .rda (with default compression) is .5GB. I was wondering if setAutoBPPARAM with a MulticoreParam could speed this up … but I haven’t made progress with that.
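A minimal sketch of the parallel block processing Vince mentions, assuming a DelayedArray version where setAutoBPPARAM() is available; whether it helps depends on whether the S3 reads or the per-block arithmetic dominate. 'tt' is the S3-backed TileDBArray from the message above.

    library(DelayedArray)
    library(BiocParallel)

    setAutoBPPARAM(MulticoreParam(workers = 4))  # process blocks in parallel
    system.time(rs <- rowSums(tt))
    setAutoBPPARAM()                             # back to serial block processing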
Shian Su (20:30:18): > There’s no compression on the tileDB data right? But it should be possible to compress?
Hervé Pagès (20:35:53): > Interesting. I mean if you disable compression with HDF5 then writing is about as fast as with TileDB: > > > m <- matrix(runif(1e8), 1e5, 1e3) > > DelayedArray:::set_verbose_block_processing(TRUE) > > system.time(tdba <- writeTileDBArray(m)) > Realizing block 1/8 ... OK, writing it ... OK > Realizing block 2/8 ... OK, writing it ... OK > Realizing block 3/8 ... OK, writing it ... OK > Realizing block 4/8 ... OK, writing it ... OK > Realizing block 5/8 ... OK, writing it ... OK > Realizing block 6/8 ... OK, writing it ... OK > Realizing block 7/8 ... OK, writing it ... OK > Realizing block 8/8 ... OK, writing it ... OK > user system elapsed > 3.319 1.898 4.919 > > system.time(h5a0 <- writeHDF5Array(m, level=0)) > Realizing block 1/10 ... OK, writing it ... OK > Realizing block 2/10 ... OK, writing it ... OK > Realizing block 3/10 ... OK, writing it ... OK > Realizing block 4/10 ... OK, writing it ... OK > Realizing block 5/10 ... OK, writing it ... OK > Realizing block 6/10 ... OK, writing it ... OK > Realizing block 7/10 ... OK, writing it ... OK > Realizing block 8/10 ... OK, writing it ... OK > Realizing block 9/10 ... OK, writing it ... OK > Realizing block 10/10 ... OK, writing it ... OK > user system elapsed > 5.105 0.000 5.115 >
> And the resulting file is still significantly smaller than the TileDB file (764M vs 1.2G). Also now reading is 2x faster: > > > system.time(rs1 <- rowSums(tdba)) > Processing block 1/8 ... OK > Processing block 2/8 ... OK > Processing block 3/8 ... OK > Processing block 4/8 ... OK > Processing block 5/8 ... OK > Processing block 6/8 ... OK > Processing block 7/8 ... OK > Processing block 8/8 ... OK > user system elapsed > 6.331 2.107 4.666 > > system.time(rs3 <- rowSums(h5a0)) > Processing block 1/12 ... OK > Processing block 2/12 ... OK > Processing block 3/12 ... OK > Processing block 4/12 ... OK > Processing block 5/12 ... OK > Processing block 6/12 ... OK > Processing block 7/12 ... OK > Processing block 8/12 ... OK > Processing block 9/12 ... OK > Processing block 10/12 ... OK > Processing block 11/12 ... OK > Processing block 12/12 ... OK > user system elapsed > 2.368 0.000 2.368 > > identical(rs1, rs3) > [1] TRUE >
Shian Su (20:38:18): > Is that evidence of reading being faster or just that rowSums is more optimised for HDF5Array?
Hervé Pagès (20:40:10): > No optimization for HDF5Array. Just the default block-processed rowSums() defined in DelayedArray.
Aaron Lun (20:40:50): > To be clear, TileDBArray is very unoptimized. I probably wouldn’t bother timing anything at the moment.
Aaron Lun (20:41:11): > If you want to see how hackish it currently is, look at lines 203+ of TileDBArray.R.
Aaron Lun (20:41:31): > Also marvel at the .pack64 and .unpack64 of utils.R.
Dirk Eddelbuettel (20:42:21): > There are three more-or-less independent layers and we haven’t tweaked any of it. Let alone tile layout and parameters. We’re at the “just started to make it work” stage.
Dirk Eddelbuettel (20:43:19): > But yes, and as I said earlier, if you build with S3 support then it is as simple as swapping an access URI. Should be the same for the other two. And yes, compression is available, including nested and whatnot, along with other filters.
Hervé Pagès (20:44:44): > Sounds good. Premature optimization is the root of all evil…
Shian Su (20:46:36): > Really looking forward to having zstd compressed datasets.
Hervé Pagès (20:49:51): > @Shian SuFWIW a better way to time block reading is to walk on the array without doing anything: > > > system.time(aa <- blockApply(tdba, identity)) > Processing block 1/8 ... OK > Processing block 2/8 ... OK > Processing block 3/8 ... OK > Processing block 4/8 ... OK > Processing block 5/8 ... OK > Processing block 6/8 ... OK > Processing block 7/8 ... OK > Processing block 8/8 ... OK > user system elapsed > 5.835 2.515 4.611 > > system.time(bb <- blockApply(h5a0, identity)) > Processing block 1/12 ... OK > Processing block 2/12 ... OK > Processing block 3/12 ... OK > Processing block 4/12 ... OK > Processing block 5/12 ... OK > Processing block 6/12 ... OK > Processing block 7/12 ... OK > Processing block 8/12 ... OK > Processing block 9/12 ... OK > Processing block 10/12 ... OK > Processing block 11/12 ... OK > Processing block 12/12 ... OK > user system elapsed > 1.646 0.639 2.285 >
Hervé Pagès (20:51:54): > or even better, use function(block) NULL instead of identity so you don’t keep the blocks in memory
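For example, reusing the tdba and h5a0 objects from the benchmark above, so that only read time is measured and no block is kept:

    system.time(blockApply(tdba, function(block) NULL))
    system.time(blockApply(h5a0, function(block) NULL))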
Aaron Lun (20:52:19): > Speaking of optimization opportunities, that also reminds me that chunkGrid doesn’t know about the tile extent yet. Need to make an issue to remind myself.
Hervé Pagès (20:58:09): > Note that if TileDB uses fixed-size tiles (except maybe on the right and bottom borders of the matrix) you just need to define a chunkdim() method for your TileDBArraySeed objects. The default chunkGrid() method will use this to produce the corresponding grid.
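A minimal sketch of such a method, assuming the seed can report fixed tile extents; extractTileExtent() below is a hypothetical stand-in for whatever TileDBArray uses internally to query the array schema.

    library(DelayedArray)
    library(TileDBArray)

    setMethod("chunkdim", "TileDBArraySeed", function(x) {
        ## return an integer vector of tile extents, one per dimension
        extractTileExtent(x)   # hypothetical helper
    })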
2020-06-12
Stephanie Hicks (00:24:28): > Is there a standard way of handling outer products with two DelayedArray
objects? It seems like the reprex below tries to put the result in memory? > > > DelayedArray(matrix(1:100000)) %*% DelayedArray(matrix(1:100000, nrow = 1)) > <2 x 1> matrix of class DelayedMatrix and type "list": > [,1] > message cannot allocate vect.. > call NULL >
> but if it’s small enough, it does what one might expect: > > > DelayedArray(matrix(1:10000)) %*% DelayedArray(matrix(1:10000, nrow = 1)) > <10000 x 10000> matrix of class DelayedMatrix and type "double": > [,1] [,2] [,3] ... [,9999] [,10000] > [1,] 1 2 3 . 9999 10000 > [2,] 2 4 6 . 19998 20000 > [3,] 3 6 9 . 29997 30000 > [4,] 4 8 12 . 39996 40000 > [5,] 5 10 15 . 49995 50000 > ... . . . . . . > [9996,] 9996 19992 29988 . 99950004 99960000 > [9997,] 9997 19994 29991 . 99960003 99970000 > [9998,] 9998 19996 29994 . 99970002 99980000 > [9999,] 9999 19998 29997 . 99980001 99990000 > [10000,] 10000 20000 30000 . 99990000 100000000 >
Aaron Lun (00:27:13): > you’ll probably have to change the realization sink.
Aaron Lun (00:27:36): > Though honestly, one wonders whether you want to do that at all. I use my LowRankMatrix to handle such cases.
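Two sketches of what Aaron is pointing at, assuming a DelayedArray version with setAutoRealizationBackend() (older releases used setRealizationBackend()) and BiocSingular for LowRankMatrix:

    library(DelayedArray)
    library(HDF5Array)
    library(BiocSingular)

    ## 1. Realize to disk instead of memory, so the full product gets written
    ##    block by block to an HDF5 file rather than allocated in RAM.
    setAutoRealizationBackend("HDF5Array")

    ## 2. Keep only the two factors and never materialize the outer product.
    u <- matrix(runif(1e5), ncol = 1)
    v <- matrix(runif(1e5), ncol = 1)
    lrm <- LowRankMatrix(u, v)   # represents u %*% t(v), 1e5 x 1e5, nothing realized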
Stephanie Hicks (00:39:22): > ah, great suggestion
Stephanie Hicks (00:39:32): > thanks@Aaron Lun
Aaron Lun (00:45:07): > Note that the LRM does not do much beyond the base DA operations. One could make its %*% tremendously more efficient, if one was so inclined.
Will Townes (10:57:31): > @Will Townes has joined the channel
Stephanie Hicks (10:58:58): > yeah, for the purpose needed, I think the LRM is exactly what’s needed. Thank you again@Aaron Lun!
Kasper D. Hansen (11:21:48): > Doesn’t this suggest that the %*% operator is not “delayed”, or am I misunderstanding something? Not sure that’s a problem, just trying to understand it
Aaron Lun (11:23:17): > no, the initial product is delayed to create the LRM. Further products on the LRM use block processing but could use the fact that the LRM is already a product to improve efficiency.
Kasper D. Hansen (11:23:22): > From an API perspective, I guess it’s not clear what the “type” of the return object should be. For the big matrix, it is intuitive it should be HDF5, but you’re multiplying 2 in-memory matrices, so one could argue that it is natural that the default is for the output to be in-memory as well
Kasper D. Hansen (11:23:56): > But if the initial product is delayed, why the memory error? Because it tries to allocate the full return object?
Aaron Lun (11:24:17): > that’s because she’s not using a LRM, just doing a DA.
Kasper D. Hansen (11:24:51): > ah ok you’re answering this based on your code?
Kasper D. Hansen (11:26:10): > I’m just saying that the printing of the return object implies that only a few of the matrix entries get computed (the rest is delayed), so I am trying to understand the memory error
Kasper D. Hansen (11:26:27): > So even if the return object is delayed, it still gets allocated?
Kasper D. Hansen (11:27:44): > or is there something special about the example because the return object couldn’t be allocated even given infinite RAM because of long doubles or something?
Aaron Lun (11:30:07): > The DA example is borked, ignore it. The LRM is the solution.
Stavros Papadopoulos (11:41:22): > @Stavros Papadopoulos has joined the channel
Mike Jiang (20:29:26): > Thinking about having an Rtiledblib package in Bioconductor that serves the same purpose as Rhdf5lib, which provides HDF5 C/C++ headers and static libs through an R package so that other R packages like rhdf5, HDF5Array, and cytolib can directly compile against and link to it instead of forcing users to install libhdf5 system-wide. It would be very useful for both tiledb downstream package developers and their end users.
Aaron Lun (20:30:53): > IIRC tiledb’s R interface should build the binaries as well during installation, though whether they are in a linkable state is another matter.
Mike Jiang (20:34:44) (in thread): > @Dirk Eddelbuettel I know you were not a big fan of this when we talked about the idea of RProtobufLib in the past. But making our package users’ (mostly non-tech-savvy lab guys) lives easy is our top priority, so I will probably do it if it is not something you are still suspicious about
Dirk Eddelbuettel (20:36:12) (in thread): > Oh, that warrants more context. I think in general the problem of shipping “system-style” libraries a package needs is mostly unsolved. Plus, duplication is never ideal. But at the end of the day … whatever works, works.
Dirk Eddelbuettel (20:37:37) (in thread): > Do you think you can provide protocol buffers functionality portably to other packages from a package? That is in essence the problem I have still filed as somewhere between “hard” and “unsolved”.
Dirk Eddelbuettel (20:39:50): > Yes. That is the approach taken by Simon with his s-u/recipes repo (for macOS) and by Jeroen with his repo for Windows in the GH org rwinlib.
Dirk Eddelbuettel (20:41:07): > Providing system-level libraries from user-level packages does not work in general, I think. We tried early on with Rcpp as well but moved away from it. Serving object code across an entire OS and its variants is … astonishingly hard.
Dirk Eddelbuettel (20:41:33): > Then again Python wheels do it so there may be a step we’re missing. Time to get to work, collectively, …
Dirk Eddelbuettel (20:44:04) (in thread): > We have something in the package that would be, AFAIK, “novel” to CRAN. But I’d prefer to talk more about it if/when CRAN accepts it as a package. We areveryclose to submitting it so we all may know more “soon”.
Mike Jiang (20:46:33) (in thread): > RProtobufLib has been working fine with other pkgs (cytolib, flowWorkspace, CytoML); I am the author of all of these though, so I can’t speak for other users’ experience. But Rhdf5lib has been around for many years and is smoothly used by various people in the Bioc community. @Mike Smith, @Hervé Pagès, and @Aaron Lun can also voice their opinions on this
Dirk Eddelbuettel (20:47:51) (in thread): > I guess I had been too distant from BioC for too long to say more. I know the CRAN landscape pretty well and I am not aware of anything similar. But hey there is so much now that it is easy to miss things…
Aaron Lun (20:50:17) (in thread): > I can confidently say that most of the pain felt from using Rhdf5lib was inherent to the HDF5 library itself and not from our deployment of it. It does require some fairly close coordination with respect to ensuring that the Makevars are set up properly, but @Mike Smith has been pretty good about that. Though from the outside looking in, it does seem that setting up the HDF5 library in a deployable state is quite a chore (repeated for every HDF5 release).
Dirk Eddelbuettel (20:52:21) (in thread): > Just glanced at the Rhdf5lib vignette. Yes, shipping static libs and code to generate -L args works. I can’t right now think of why that did not become more prevalent around CRAN – apart from the general taste of ewwww that is repeating code all over from static libs in distinct separate shared package libs.
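For reference, the flag generation being discussed can be previewed from R; Rhdf5lib exposes it via pkgconfig(), which client packages shell out to from their Makevars at build time (option names as documented in the Rhdf5lib vignette):

    library(Rhdf5lib)
    pkgconfig("PKG_C_LIBS")    # prints -L/-l flags for the bundled static HDF5 (C API)
    pkgconfig("PKG_CXX_LIBS")  # same for the C++ API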
Mike Jiang (21:00:11) (in thread): > Agree, that is the single biggest downside of it! But one advantage is avoiding the trouble of mismatches between header and lib when there are multiple instances of the same system lib (of different versions) on the same machine, which happens surprisingly more often than you would expect
Dirk Eddelbuettel (21:00:42) (in thread): > Yep.
2020-06-18
Ruben Dries (09:00:40): > @Ruben Dries has joined the channel
2020-06-19
Will Townes (12:43:56): > Is there any work being done on sparse array storage on-disk? From what I can tell, things like DelayedArray are expected to always be dense and the only sparse on-disk format is the TENxMatrix. For interoperability with python, it seems like having a basic sparse format like compressed column-oriented (what is used in memory by dgCMatrix) might be handy. If I load a huge dataset that wasn’t originally from 10x and want to save it to disk, is the most space efficient way to convert to TENxMatrix?
Dirk Eddelbuettel (12:46:10): > TileDB gives you sparse arrays on disk, and the TileDBArray package plugs it into the DelayedArray framework. Our TileDB package is in the incoming queue at CRAN, TileDBArray sits on top of it and is in Aaron’s repo at GH.
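A minimal sketch of that route, assuming TileDBArray is installed from Aaron's GitHub repo; whether the on-disk array ends up sparse depends on how TileDBArray handles a sparse seed:

    library(Matrix)
    library(TileDBArray)

    sm <- rsparsematrix(1e4, 1e3, density = 0.01)  # in-memory dgCMatrix
    tdb <- writeTileDBArray(sm)                    # written to a TileDB array on disk
    tdb[1:5, 1:5]                                  # usual DelayedArray-style access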
Tim Triche (13:30:00): > @Will Townessubmit your questions for the developer forum call on June 25th at noon ET!
Will Townes (13:31:09) (in thread): > thanks for the suggestion. Where do I go to do that? A slack channel or a URL somewhere?
Tim Triche (13:31:54): > @Will Townes #developers-forum, for @Mike Smith and @Aaron Lun I think. Also Dirk here. The TileDB remarks are from last week’s adventures in sparsity
Tim Triche (13:32:17): > your question is a good one, I have microwell-seq and seq-well and plate-seq data in my hands right now and thinking about the same Qs
Tim Triche (13:32:30): > while wearing out the fan in my machine
Hervé Pagès (13:32:47): > At the moment TENxMatrix is the only DelayedArray backend that uses a sparse representation, AFAIK. It’s notoriously inefficient though so hopefully TileDBArray will fill that gap.
Tim Triche (13:33:05): > @Hervé Pagès I was just looking at the definition in HDF5Array for this reason
Tim Triche (13:33:28): > why so inefficient and how can this be bypassed going forwards? are there things that you would do differently in hindsight?
Hervé Pagès (13:34:16): > TileDb/TileDBArray is the way to move forward.
Tim Triche (13:34:21): > I would dearly love an HTML document with figures that lays out why there are FooBarSeed and separate FooBar classes
Tim Triche (13:34:38): > will that remain a pattern going forwards or is the idea to just run towards TileDB
Tim Triche (13:34:53): > also@Will Townesare you finding TileDB slow?
Tim Triche (13:35:02): > it wasn’t when I tested it last week but… ?
Tim Triche (13:35:47): > ugh I’m seeing where a lot of the issues in TENxMatrix come from just looking at the comments around the datatypes that don’t immediately seem “right”
Will Townes (13:35:56) (in thread): > I haven’t tried it yet, only heard about it for the first time today! Will let you know if I get a chance to test out.
Tim Triche (13:36:55): > is this why you mentioned read_sparse_block last week as the one thing that made sense
Tim Triche (13:37:37): > reading the code for the Seed class is educational
Tim Triche (13:37:41): > thanks for commenting it extensively
Tim Triche (13:39:02) (in thread): > I wrote some examples in this channel last week if they help you get started
Tim Triche (13:39:51): > finally made it to the bottom of the TENxMatrixSeed class. wow that’s a lot of code to manipulate the block loading and scanning/adjacent bits
Hervé Pagès (13:41:00): > The TENxMatrix format is a hack from the 10X Genomics people for storing sparse data in a compressed column-oriented fashion on top of HDF5. It’s very inefficient if your access pattern requires you to load full rows (it’s impossible to load a single row of data in an efficient manner). It works OK if your access pattern is to load small groups of columns at a time.
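To make the access-pattern point concrete, a small sketch using HDF5Array's TENxMatrix writer; the file name and group below are arbitrary:

    library(Matrix)
    library(HDF5Array)

    sm <- rsparsematrix(1e4, 500, density = 0.01)
    tenx <- writeTENxMatrix(sm, "demo_tenx.h5", group = "matrix")
    as.matrix(tenx[, 1:10])   # realizing a few columns only touches those columns
    as.matrix(tenx[1:10, ])   # realizing rows means scanning across every column: slow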
Nicholas Knoblauch (13:53:41): > The user can’t control whether genes go along the rows or the columns in a TENxMatrix?
Hervé Pagès (13:56:16): > @Tim Triche > > is this why you mentioned read_sparse_block last week as the one thing that made sense > I was just teasing Aaron that he needed to implement an extract_sparse_array method for his TileDBArraySeed objects. This is the way to properly plug TileDBArray objects into the DelayedArray framework and to allow block-processing to preserve sparsity, at least for the moment. Aaron doesn’t like this approach, hence the teasing. Anyway, these are internal details.
Dirk Eddelbuettel (13:58:05) (in thread): > Note that the paint hasn’t dried on these components. We are before any and all benchmarking and tuning efforts so please take all measurements with a bucket of salt.
Hervé Pagès (13:59:15): > @Nicholas KnoblauchWell they could in theory. But that’s not the convention used by the 10X Genomics people and all the big 10X datasets that are available around follow that convention AFAIK.
Hervé Pagès (14:00:00): > (The convention being that the genes go along the rows.)
Will Townes (14:04:00) (in thread): > please forgive the naive questions if I missed this in the documentation, but 1. is the indexing always 1-based? This seems incompatible with eg numpy. 2. is the only sparsity format the row-compressed form?
Dirk Eddelbuettel (14:05:06) (in thread): > In the TileDB schema? User choice. If you like 101 … 110 better for the examples, use it. Under the hood it is all C and C++ so internally it is zero based.
Dirk Eddelbuettel (14:05:48) (in thread): > “Somehow” all examples always use 1 .. 10 (or 1 .. 100 or …). We should punk everybody now and then and run from, say, -5 to 4.
Tim Triche (14:09:14): > good to see that 10X employs the One True Way from Bioconductor (rows are features and columns are samples):wink:.
Tim Triche (14:09:34) (in thread): > wet cement was made to put hands in
Mike Jiang (17:36:36) (in thread): > @Dirk Eddelbuettel I see you are doing dynamic linking in TileDB-R through setting rpath to its own instance of libtiledb.so when tiledb isn’t available system-wide. Personally I really like it since it makes the process easy for end users without resorting to the inefficient static lib approach. > > But here is the discussion/opinion from @Hervé Pagès 2 months ago as I was asking about a similar idea for RProtoBufLib (or even Rhdf5lib potentially). > > Hervé Pagès [11:33 AM] > > FWIW I switched from dynamic to static linking for Rhtslib and its clients about 1 year ago: https://github.com/Bioconductor/Rhtslib/commit/db1d8e17ef5b8568fdae3fae0dc701fe2250c952 No regrets so far. It seems to be a lot more robust than dynamic linking.
Mike Jiang (17:37:54) (in thread): > also from @Mike Smith > > Mike Smith [7:58 AM] > > I have also run into problems with dynamic linking on cluster environments, where you can’t necessarily assume that the directory structure will be the same on the execution node compared to wherever the package was built.
Mike Jiang (17:40:59) (in thread): > I can see the potential of compiling/linking our cytolib to tiledb-R if it eventually gets accepted by CRAN and the rpath-based dynamic linking proves to be a robust and portable way.
Dirk Eddelbuettel (17:45:55) (in thread): > It is a super-fascinating and exciting topic that I am quite thrilled about. I was benefiting from something similar with another CRAN package of mine … and didn’t quite understand how they did it there. I think I do know now, and I will try to write something up. It should generalize to other libraries, but not all.
Dirk Eddelbuettel (17:47:11) (in thread): > Personally I do something different again and just detect /usr/local/lib/libtiledb.so.* during configure. Which works but does not generalize. Downloading working shared libs may well. Very exciting.
2020-06-23
Mike Smith (16:20:44) (in thread): > Old comment, but just wanted to point out you can use zstd compression in HDF5 too via https://www.bioconductor.org/packages/devel/bioc/vignettes/rhdf5filters/inst/doc/rhdf5filters.html
2020-06-24
Mike Smith (10:50:54): - Attachment: Attachment > A reminder that our next Developers’ Forum is scheduled for tomorrow (Thursday 25th June) at 09:00 PDT/ 12:00 EDT / 18:00 CEST (Check here!) > > We will be using BlueJeans and the meeting can be joined via: https://bluejeans.com/114067881 (Meeting ID: 114 067 881) > > The general topic will be technologies for accessing ‘big data’ on-disk i.e. HDF5, TileDB, Zarr, etc If you’d like to share you knowledge / experience with a particular on-disk technology or have some specific questions for the discussion please reply here and I’ll try to put a little structure to our discussion.
2020-06-25
Almut (08:02:50): > @Almut has joined the channel
Michael Lawrence (10:30:11) (in thread): > Hi Mike. Sounds like fun, unfortunately I’ll miss it. In the future, would you be able to send a calendar invite for the meeting to, say, bioc-devel?
2020-06-26
Dirk Eddelbuettel (12:29:44): > Saw this via Twitter before I noticed the mail in my (work) inbox ….https://twitter.com/CRANberriesFeed/status/1276546398675238914 - Attachment (twitter): Attachment > New CRAN package tiledb with initial version 0.7.0 http://goo.gl/pgljT #rstats
Aaron Lun (12:30:03): > YEAH
Dirk Eddelbuettel (12:30:21): > Thanks for the all the encouragement so far, and particularly the help and added set of eyes by@Aaron Lun!
Jenny Smith (16:00:03): > @Jenny Smith has joined the channel
2020-06-27
Dirk Eddelbuettel (08:30:10): > …. and what CRAN giveth CRAN taketh. Already thrown off. Maybe a future as a BioC package would be simpler.
Will Townes (13:25:52) (in thread): > Gosh I’m so sorry to hear that! I really hope they will accept it on CRAN again so that non-bioc packages like Seurat can also use tiledb.
Martin Morgan (14:19:19) (in thread): > not that it’s a good idea, but CRAN packages can Depend / Import / LinkingTo bioc packages; see the light-colored S4Vectors in the Depends of https://cran.r-project.org/package=iheatmapr for instance
Aaron Lun (16:57:23): > Crazy.
Aaron Lun (16:57:27): > What was the rationale?
Aaron Lun (16:57:31): > Too much novelty?
Federico Marini (16:59:02): > with the policy violation being…?
Dirk Eddelbuettel (16:59:49) (in thread): > I am rather upset and frustrated, but I can’t share details as that would mean quoting a private email, which is something he not infrequently went on about when he still posted to the lists.
Tim Triche (17:10:52) (in thread): > oh ffs, is it a Solaris issue (tm)
Tim Triche (17:11:07): > $20 says “did not compile on solaris”
Hervé Pagès (17:11:14): > 2-space indents?
Will Townes (18:48:46) (in thread): > Oh cool I didn’t know CRAN allowed that
2020-06-28
Dirk Eddelbuettel (14:35:39) (in thread): > So for now … and $deity knows how long, just use drat or a repos=... argument to install.packages() and install tiledb from the ghrr drat. - File (PNG): image.png
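For reference, the drat route in R; the ghrr drat URL below is an assumption:

    install.packages("tiledb",
                     repos = c("https://ghrr.github.io/drat",
                               "https://cloud.r-project.org"))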
Dirk Eddelbuettel (14:37:34) (in thread): > Demo from within rstudio.cloud, which seems to have broken its own RSPM repo – no Rcpp or nanotime. So I set options(repos="https://cloud.r-project.org") first, which should not be needed. Anybody know who to complain to for RSPM?
Martin Morgan (19:03:34) (in thread): > https://packagemanager.rstudio.com/client/#/repos/2/whats-new it says (not quite the right context, but…) “If you have feedback during the beta period, please open a question on RStudio Community in the Admins category.” ? - Attachment (community.rstudio.com): RStudio Community > A community for all things R and RStudio
Dirk Eddelbuettel (21:09:15) (in thread): > Thanks @Martin Morgan, I had pinged Gabor who had pointed me to the same resource. I don’t have an account at community.r.c though. Also, it works on 18.04 (for which I made an RSPM container for Rocker).
2020-06-29
Lukas Weber (14:21:33): > @Lukas Weber has joined the channel
Tim Stuart (18:11:56): > @Tim Stuart has joined the channel
Dirk Eddelbuettel (18:29:41) (in thread): > Reached out to them via the actual support we have as paying (academic) customers at U of I. And turns out I discovered a discontinuity in the space/time fabric that is … by now gone. May just have been a temporary fluke.
2020-07-01
Frank Rühle (04:22:00): > @Frank Rühle has joined the channel
2020-07-04
Umar Ahmad (08:19:41): > @Umar Ahmad has joined the channel
2020-07-07
Pablo Latorre-Doménech (03:16:10): > @Pablo Latorre-Doménech has joined the channel
Mehdi Pirooznia (09:25:29): > @Mehdi Pirooznia has joined the channel
2020-07-09
Ellis Patrick (18:02:06): > @Ellis Patrick has joined the channel
2020-07-10
Hervé Pagès (03:22:38): > @Mike JiangI’m currently reviewing a flow cyto package (from your group) that uses a flowSet object to store the data. The flowSet container uses the eSet approach with slots of type environment and AnnotatedDataFrame which seems a little bit outdated. Are there plans to move to something based on SummarizedExperiment for flow data? I know how to query a SummarizedExperiment object or derivative and I’d love to be able to query a flowSet object but it seems that I first need to learn a completely different API before I can do that. Which is kind of a bummer, not only for me as a package reviewer, but also for the end user in general.
Aaron Lun (03:24:27): > I have also wanted this for a long time.
Rajesh Shigdel (03:42:36): > @Rajesh Shigdel has joined the channel
2020-07-11
Umar Ahmad (16:44:41): > Great job@Kevin Blighe
2020-07-14
Mike Jiang (11:41:33) (in thread): > @Hervé Pagès It will involve quite a significant amount of work. Also, we’ve pretty much deprecated the flowSet with flowWorkspace::cytoset in the rest of the major flow tools. > That said, we will have an internal discussion to determine whether we want to go through this change for flowCore.
Hervé Pagès (11:59:26) (in thread): > Excellent. Glad that it’s on the radar! Maybe the easiest/smoothest path would be to go with a new class e.g. FlowExperiment.
2020-07-15
wmuehlhaeuser (09:19:13): > @wmuehlhaeuser has joined the channel
Jessica (09:44:04): > @Jessica has joined the channel
2020-07-16
Spencer Nystrom (09:24:10): > @Spencer Nystrom has joined the channel
2020-07-17
Chitrasen (12:39:15): > @Chitrasen has joined the channel
2020-07-20
Dr Awala Fortune O. (02:40:55): > @Dr Awala Fortune O. has joined the channel
Dr Awala Fortune O. (02:41:18): > Hello everyone
Jennifer Doering (09:59:42): > @Jennifer Doering has joined the channel
2020-07-21
Mike Jiang (20:11:02) (in thread): > @Hervé Pagès SummarizedExperiment doesn’t seem to exactly fit the flowSet data model: > 1. fs is a collection of flowFrames, where each flowFrame is a 2D matrix: each row is a cell, each col is an expression level > 2. for fs: all flowFrames have the same cols, but different rows > My understanding of SummarizedExperiment is that it expects multiple assays/matrices of the same dims (rows & cols). Is that accurate?
Hervé Pagès (20:35:57) (in thread): > Yes all the assays in a SummarizedExperiment object are expected to have the same number of rows and cols. If the flowFrames in a flowSet object can have different numbers of rows, trying to make them fit in a SummarizedExperiment object or derivative is not going to work, unfortunately:disappointed:Thanks for looking into this.
Mike Jiang (20:40:16) (in thread): > how about for flowFrame to be represented as SE?
Mike Jiang (20:40:28) (in thread): > Is there any value for doing it?
Hervé Pagès (20:44:44) (in thread): > This would add an additional level of wrapping that doesn’t seem to provide much value. It would only make the object more complicated. Would be interesting to know how @Kevin Blighe approached this in scDataviz. Also @Aaron Lun said he wanted this for a long time so maybe he has thought about this already?
2020-07-22
Dr Isha Goel (09:34:41): > @Dr Isha Goel has joined the channel
Laurent Gatto (11:46:07): > The typical needs for big data in the Bioconductor project have dealt with quantitative data in a single and very large (sparse) matrix and how to store/compute on it out-of-memory. I was wondering if anyone had suggestions/advice when the data consists of millions of smaller matrices. Could anyone comment on the appropriateness of the solutions that have been explored: hdf5, DelayedArrays, … and more recently tiledb.
Kurt Showmaker (11:47:35): > @Kurt Showmaker has joined the channel
Vince Carey (12:47:23): > @Laurent Gatto can you describe the distribution of dimensions among the millions of matrices? For experience (if any) in HDF5 it would be good to pose the question to @John Readey
Nicholas Knoblauch (12:57:47) (in thread): > I think this is a hard thing to advise on without knowing what you have tried, and what was unsatisfactory about that solution.
Hervé Pagès (13:10:22): > Exactly. If all your matrices have the same dimensions, or if they can be grouped by equal dimensions and the number of groups is small, then they could be stored as one or a small number of 3D arrays. Should not be a problem with HDF5. Don’t know with tiledb.
Stavros Papadopoulos (13:12:40): > In TileDB, I would try creating a single 3D sparse array, two dimensions for your matrices and the extra dimension could be an (arbitrary - even string) identifier for each matrix
Stavros Papadopoulos (13:13:37): > We could help with tile tuning and other configs, so please do not hesitate to shoot us some data and we’ll put it to the test
Stavros Papadopoulos (13:15:57) (in thread): > I suspect that if you try to access a big subset of those matrices at the same time, you will feel all the overheads around file opening and random IO. Making a 3D array may buy you some spatial locality on disk, so each slice could significantly minimize the IO overhead and take advantage of parallelism (that is true in TileDB)
Nicholas Knoblauch (13:38:33) (in thread): > I have no problem believing that. There’s that famous quote that goes something like “on properly structured data, algorithms are trivial”. In genomics-y environments, the data is almost certainly not “properly” structured, and you have to factor that in. If you are given a file of a certain format, and the end user expects a file of a certain format, whether it makes sense to make a full pass over the data to convert it to HDF5/TileDB (and maybe even convert back) is hard to say a priori
Stavros Papadopoulos (13:46:36) (in thread): > Yeap, I hear you! The benefits of converting to a new format should probably be orders of magnitude larger than choosing not to do so. What we (TileDB) have seen though is that this is often the case, especially with genomics data that are very large and some domain-specific formats incur non-trivial bottlenecks in analysis. We are very curious to see how the community reacts though, as we have seen some great organizations willing to take risks switching a great deal of data to a new format.
Hervé Pagès (13:59:16) (in thread): > @Kevin Blighe The question is how to use a SummarizedExperiment object (or derivative) to represent the data that is in an arbitrary flowSet object. The data in the SummarizedExperiment object should be the same as in the original flowSet. To be more precise: how can we coerce back and forth between flowSet and a SummarizedExperiment-based representation without losing any information? I have a feeling that processFCS() solves a different problem, e.g. it does something along the lines of processing the data in the original flowSet object and returning the result of that processing in a SingleCellExperiment object. But I could be wrong. (Hard to tell because the man page doesn’t really say and the examples in it don’t call the function. Also the example in the vignette is not runnable.)
Laurent Gatto (14:02:14) (in thread): > The matrices will always have the same columns (representing the same variables) but the number of rows will vary from 100s to 10000s. Cc @Vince Carey
Laurent Gatto (14:03:49) (in thread): > Note also that these matrices don’t share their rows - it’s not as if the smaller matrices had some rows missing.
Hervé Pagès (14:05:10) (in thread): > In that case you could rbind them together and use an index to keep track of the groups of rows. Both HDF5 and tiledb should be able to accommodate this layout efficiently.
Laurent Gatto (14:05:13) (in thread): > That would be great, thank you. What would be the best way to share some data? > NB: the data I am referring to is mass spectrometry data, typically used in proteomics and metabolomics.
Laurent Gatto (14:05:37) (in thread): > Thank you.
Dirk Eddelbuettel (14:05:54) (in thread): > Why rbind? It’s just an index. A sparse matrix representation seems suitable.
Laurent Gatto (14:08:01) (in thread): > Another feature I would be looking for is storing and accessing data locally or remotely transparently.
Stavros Papadopoulos (14:08:48) (in thread): > Just drop me an email at stavros@tiledb.com and we’ll set something up (perhaps an S3 bucket). We are mostly interested in the structure of the data and query patterns, as the rest we can synthesize to start with. We’ll then give you some code to try on your real data if that is not public. It goes without saying that we will share both code and results (with explanations) in this channel to hopefully get some more feedback.
Laurent Gatto (14:09:26) (in thread): > Maybe to provide additional information: operations on these matrices would include accessing individual ones or sets thereof, as well as variable slices of all of the matrices at once.
Hervé Pagès (14:10:47) (in thread): > Vertical slices I suppose?
Dirk Eddelbuettel (14:13:28) (in thread): > Also, are you aware of https://github.com/LTLA/TileDBArray giving you standard delayed array semantics? May make transitions easier…
Laurent Gatto (14:16:04) (in thread): > @Hervé Pagès- yes. Here’s an example: > > > dim(m) > [1] 25800 2 > > head(m) > [,1] [,2] > [1,] 399.9987 0 > [2,] 400.0002 0 > [3,] 400.0017 0 > [4,] 400.0032 0 > [5,] 400.2951 0 > [6,] 400.2966 0 > > m2 <- m[m[, 1] > 1000 & m[, 1] < 1500, ] > > head(m2) > [,1] [,2] > [1,] 1002.095 0.0000 > [2,] 1002.101 0.0000 > [3,] 1002.107 0.0000 > [4,] 1002.113 0.0000 > [5,] 1002.119 519.6337 > [6,] 1002.125 684.2267 >
> And this subsetting would need to be applied to all matrices.
Laurent Gatto (14:17:07) (in thread): > Thanks@Dirk Eddelbuettel- I have read about TileDB and TileDBArray, but haven’t been able to test/assess so far.
Dirk Eddelbuettel (14:18:23) (in thread): > Well it all exists, installs cleanly, can be run off containers too, … so maybe you just need to make some time to sit down and try it?
Mike Jiang (14:27:04) (in thread): > Based on my quick browsing of the code, besides its own preprocessing of the data, it rbinds multiple matrices, transposes the result, and wraps it into an SE
Nicholas Knoblauch (14:29:08) (in thread): > If the matrices really don’t share rows, then you may also want to consider a more “traditional” database. https://duckdb.org/ is columnar, embeddable, and has an R package. Because it’s SQL, you’d have a lot of choices - Attachment (duckdb.org): DuckDB - An embeddable SQL OLAP database management system > DuckDB is an embeddable SQL OLAP database management system. Simple, feature-rich, fast & open source.
Laurent Gatto (14:29:51) (in thread): > Yes, just make time. Thank you for the technical suggestions though.
Mike Jiang (14:30:12) (in thread): > The major effort there is not the coercion from the original flowFrame into SE, so having a flowFrame represented as an SE won’t be much gain for this particular use case
Dirk Eddelbuettel (14:30:59) (in thread): > Anytime. Just hit @Stavros Papadopoulos, @Aaron Wolen or myself with questions. We’ll all be happy to help.
Stavros Papadopoulos (14:41:05) (in thread): > @Nicholas Knoblauch one cool thing about TileDB is that it comes with efficient integration with (both embeddable and non-embeddable) MariaDB (https://docs.tiledb.com/mariadb/usage), so you get all the SQL magic too. And of course you retain fast direct access (from multiple APIs, on prem or in the cloud). Having said that, we really like CWI’s work on DuckDB and are considering integrating with it as well in the near future.
Nicholas Knoblauch (14:42:15) (in thread): > Wow I didn’t know that. I take it all back. Try TileDB first, ask questions later!
Nicholas Knoblauch (14:43:03) (in thread): > you guys weren’t joking about the whole “Universal Data Engine” thing
Stavros Papadopoulos (14:43:46) (in thread): > hehe we’ve been getting some push back on that, but we like a challenge:slightly_smiling_face:
Dirk Eddelbuettel (14:43:49) (in thread): > :laughing:
Laurent Gatto (14:45:50) (in thread): > Thanks - I’ll email you with some background info in the coming days.
Vince Carey (14:47:47) (in thread): > https://www.hdfgroup.org/2016/06/hdfql-new-hdf-tool-speaks-sql/ - Attachment (The HDF Group): HDFql – the new HDF tool that speaks SQL - The HDF Group > HDF guest blogger Rick: HDFql was recently released … handle HDF files with a simpler, cleaner, and faster interface for HDF across C/C++/Java/Python/C#…
Hervé Pagès (15:20:11) (in thread): > @Laurent GattoPursuing the cbind approach, FWIW here is what an HDF5-based solution could look like (I assume a tiledb-based solution to this approach would look similar): > > library(rhdf5) > library(HDF5Array) > > ## Write the matrices to the h5 file: > nrows <- c(10L, 5L, 20L) > bundle_dim <- c(sum(nrows), 2L) > h5file <- "laurent.h5" > h5createFile(h5file) > h5createDataset(h5file, "bundle", dims=bundle_dim) > row_offsets <- c(0L, cumsum(nrows)[-length(nrows)]) > for (i in seq_along(nrows)) { > m <- cbind(runif(nrows[[i]], max=1e3), runif(nrows[[i]], max=1.5)) > start <- c(row_offsets[[i]] + 1L, 1L) > h5write(m, h5file, "bundle", start=start, count=dim(m)) > } > > ## Write 'nrows': > h5write(nrows, h5file, "nrows") >
> Retrieve individual matrix: > > retrieveMatrix <- function(h5file, i, delayed=FALSE) { > nrows <- as.integer(h5read(h5file, "nrows")) > row_offsets <- c(0L, cumsum(nrows)[-length(nrows)]) > start <- c(row_offsets[[i]] + 1L, 1L) > count <- c(nrows[[i]], 2L) > if (!delayed) > return(h5read(h5file, "bundle", start=start, count=count)) > ans <- HDF5Array(h5file, "bundle") > ans[row_offsets[[i]] + seq_len(nrows[[i]]), ] > } > > retrieveMatrix(h5file, 2) > # [,1] [,2] > # [1,] 117.36891 0.85373362 > # [2,] 835.50000 1.42140054 > # [3,] 243.82226 0.83388132 > # [4,] 237.50518 0.06658511 > # [5,] 28.87756 1.47137842 > > retrieveMatrix(h5file, 2, delayed=TRUE) > # <5 x 2> matrix of class DelayedMatrix and type "double": > # [,1] [,2] > # [1,] 117.36891442 0.85373362 > # [2,] 835.50000028 1.42140054 > # [3,] 243.82226076 0.83388132 > # [4,] 237.50518169 0.06658511 > # [5,] 28.87755865 1.47137842 >
> Retrieve vertical slice: > > retrieveVerticalSlice <- function(h5file, j) { > stopifnot(length(j) == 1L) > bundle_slice <- as.numeric(h5read(h5file, "bundle", index=list(NULL, j))) > nrows <- as.integer(h5read(h5file, "nrows")) > relist(bundle_slice, PartitioningByWidth(nrows)) > } > > retrieveVerticalSlice(h5file, 2) > # NumericList of length 3 > # [[1]] 1.4967416098807 0.176848640316166 ... 0.495969901210628 0.191681716474704 > # [[2]] 0.853733623749577 1.42140053689945 ... 1.47137841861695 > # [[3]] 0.624939257511869 1.30643454159144 ... 0.499073923914693 >
> I hardcoded the number of cols to 2 but it should not be too hard to adjust for an arbitrary nb of cols.
Laurent Gatto (15:21:05) (in thread): > Thanks!
Hervé Pagès (15:23:36) (in thread): > still working on the formatting (it got all messed up when I copy/pasted, hate this stupid Slack editor)
Laurent Gatto (15:34:10) (in thread): > Thanks - it works and I understand it. > > I’m wondering to what extent using delayed arrays is useful in my case. Loading a full matrix or parts of a matrix in memory isn’t the issue for me - I fail to see (but then I’m not very familiar with delayed arrays) why delay the access to a small matrix. The challenge is rather in accessing some matrices and/or some vertical or horizontal slices of matrices.
Hervé Pagès (15:38:42) (in thread): > I was talking about coercion between flowSet (a collection of flowFrames) and SummarizedExperiment. I agree that coercion between a single flowFrame and SummarizedExperiment would have little value.
Hervé Pagès (15:54:05) (in thread): > To delay or not to delay, that is the question. Yes, I don’t see much value in delaying the retrieval of the small matrices either, since they are small. Just wanted to put this as an option to show how it can be done. Hard to tell to what extent using the DelayedArray framework is going to be useful in your case without knowing more about your typical workflow. Note that rbind itself is a delayed operation on DelayedArray objects so another approach maybe would be to store the matrices as separate HDF5 datasets, create one HDF5Array per dataset and rbind them together. However creating millions of HDF5Array instances is not cheap and rbind’ing them together, even if it’s delayed, involves some non-trivial internal bookkeeping that will add to the cost so overall I suspect this won’t fly. Would be interesting to try though. Could be that for a few thousand matrices things are not so bad.
Hervé Pagès (17:31:47) (in thread): > @Mike Jiang What do you think of the rbind + transpose + wrap in SE approach? Some bookkeeping via the colData to keep track of the original grouping by flowFrame would probably be required. How convenient/inconvenient would such a representation be for a typical flow cyto workflow?
Mike Jiang (17:50:00) (in thread): > a flowFrame typically contains 500k to 1M cells (i.e. rows), a flowSet/cytoset may have 100 to nK flowFrames, so converting to a single matrix and transposing it seems not ideal to me
Aaron Lun (17:51:00) (in thread): > can’t you just delay it all?
Mike Jiang (17:51:54) (in thread): > and each flowFrame is pretty independent in terms of its data IO in a cyto workflow, so it is not beneficial to merge them into a single matrix representation
Mike Jiang (17:52:34) (in thread): > or I can say we rarely need to combine them at matrix level
Hervé Pagès (18:06:19) (in thread): > mmh… so we would end up with 500M cols, typically? Not good indeed. IIUC the number of rows would typically be small (right now a typical flowSet object fits in memory so its flowFrames can’t have that many cols). So a few dozens at most? Still, the big 12x5e8 matrix is not going to be convenient to work with, especially if the typical workflow is to work on the individual flowFrames separately.:disappointed:
2020-07-23
Bishoy Wadie (03:01:02): > @Bishoy Wadie has joined the channel
Biljana Stankovic (05:10:46): > @Biljana Stankovic has joined the channel
Mindy (11:29:36): > @Mindy has joined the channel
2020-07-24
Will Townes (10:48:23): > Question about HDF5Array: if you create a new HDF5array object without specifying the file where it should be saved, I understand it automatically creates a temp file in a hidden directory somewhere. My question is, does this temp file get deleted when I close my R session? If not, does it ever get deleted or could I end up with creeping storage requirements from experimenting around with HDF5array objects in interactive sessions? Related- if I save an HDF5-backed SummarizedExperiment to an RDS file, does the HDF5 file get written into the RDS object, or does the RDS object just have a pointer to where the HDF5 file is on disk? I’m thinking of a scenario where I save a result to RDS on a cluster then want to download it, if I download the RDS but not the HDF5, then try to open on a different computer, it won’t be able to find the data.
Aaron Lun (10:49:18): > 1. yes > 2. no
Dirk Eddelbuettel (11:01:06) (in thread): > Working with tempfile() and friends is always a nice reminder of how R tends to all the little details. Every session has its own temporary directory, and it always gets wiped. No need for manual follow-ups, or even on.exit() or …
Vince Carey (11:29:32): > I think the question of creating HDF5Array instances “without specifying the file where it should be saved” is interesting. > > > new("HDF5Array") > Error in h(simpleError(msg, call)) : > error in evaluating the argument 'x' in selecting a method for function 'type': 'filepath' must be a single string > > Enter a frame number, or 0 to exit > > 1: (new("standardGeneric", .Data = function (object) > standardGeneric("show"), > 2: (new("standardGeneric", .Data = function (object) > standardGeneric("show"), > 3: show_compact_array(object) >
Aaron Lun (11:30:53): > well normally it would be handled by HDF5Array().
Vince Carey (11:33:37): > Which fails also, when run without an argument.
Aaron Lun (11:34:44): > well normally I would give a matrix or something to coerce into a HDF5Array.
Vince Carey (11:35:26): > > > HDF5Array(matrix(0)) > Error in normarg_path(filepath, "'filepath'", "HDF5 dataset") : > 'filepath' must be a single string specifying the path to the file > where the HDF5 dataset is located >
Aaron Lun (11:35:57): > oh, wait.
Aaron Lun (11:36:06): > I was usingas(mat, "HDF5Array")
.
Aaron Lun (11:36:32): > right. becauseHDF5Array
constructs from a HDF5 file!
Aaron Lun (11:36:37): > oop.
Vince Carey (11:36:54): > I guess one just has to know what one is doing. The ‘filepath’ argument implied to me that there must be an explicit file specified.
Aaron Lun (11:37:05): > Yes, I forgot that.
Vince Carey (11:37:33): > > > as(matrix(0), "HDF5Array") > <1 x 1> matrix of class HDF5Matrix and type "double": > [,1] > [1,] 0 > > z = .Last.value > > str(z) > Formal class 'HDF5Matrix' [package "HDF5Array"] with 1 slot > ..@ seed:Formal class 'HDF5ArraySeed' [package "HDF5Array"] with 6 slots > .. .. ..@ filepath : chr "/private/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T/RtmpeYOVNI/HDF5Array_dump/auto00001.h5" > .. .. ..@ name : chr "/HDF5ArrayAUTO00001" > .. .. ..@ type : chr "double" > .. .. ..@ dim : int [1:2] 1 1 > .. .. ..@ chunkdim : int [1:2] 1 1 > .. .. ..@ first_val: num 0 >
Vince Carey (11:39:02): > Now what is the durability and integrity of the content of that auto00001.h5? Could someone write to it with a separate program? Could they erase it?
Aaron Lun (11:40:02): > yes and yes, if they know where R’s tempdir is.
Vince Carey (11:40:21): > Well the permission is 644 which is pretty good.
Aaron Lun (11:40:42): > but if you have a hostile process running on the same user account, you’re screwed anyway.
Vince Carey (11:41:58): > Well, hostile is one thing, unwitting code behavior another. Probably unlikely enough under standard operating conditions.
Vince Carey (11:47:44): > But Will’s question about serialization is worth a little more discussion. If you aren’t explicit about file paths and serialize an HDF5Array instance, then reload in another session, it may or may not work … > > > class(z2) > [1] "HDF5Matrix" > attr(,"package") > [1] "HDF5Array" > > z2 > Error in h(simpleError(msg, call)) : > error in evaluating the argument 'x' in selecting a method for function 'type': failed to open file '/private/var/folders/5_/14ld0y7s0vbg_z0g2c9l8v300000gr/T/RtmpeYOVNI/HDF5Array_dump/auto00002.h5' > > Enter a frame number, or 0 to exit > > 1: (new("standardGeneric", .Data = function (object) > standardGeneric("show"), >
> because, as Dirk noted, the instance was created in a session that ended before the load(), and its temp file was wiped.
Vince Carey (11:48:04): > When the other session was alive, I was able to read the matrix.
Aaron Lun (11:48:36): > saveSummarizedExperimentAsHDF5 should do exactly what it says.
Aaron Lun (11:48:52): > Can’t remember whether that was its name. But it’ll create a single HDF5 file with all the stuff, IIUC.
Vince Carey (11:49:02): > Herve has done some work on making saved HDF5SummarizedExperiments portable. As you note
Vince Carey (11:49:08): > saveHDF5SummarizedExperiment
Vince Carey (11:59:53): > So to address@Will Townesa little more directly, the use case of a large HDF5 resource on a cluster is worth addressing fully. Rhdf5lib is now able to query HDF5 sitting in any AWS S3 bucket (use h5read with s3=TRUE). Thus delayed arrays with data in AWS S3 are immediately feasible. You can implement S3 storage locally, one approach for this is CEPH object store, supported in OpenStack. Thus latency to AWS is removable for this storage/interaction model. Your SummarizedExperiment RDS could live anywhere but would include the address of the delayed array which is accessible globally. A related approach doesn’t work with raw HDF5 in S3 but rather HDF5 served through HDF Scalable Data Service. This can also be implemented locally, but examples of it with AWS back end are present in the restfulSE package.
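A small sketch of the direct-from-S3 read Vince mentions, for recent rhdf5 versions that expose an s3= argument; the bucket URL and dataset name below are placeholders:

    library(rhdf5)

    url <- "https://example-bucket.s3.amazonaws.com/some_data.h5"  # placeholder
    h5ls(url, s3 = TRUE)                        # list datasets over HTTP
    x <- h5read(url, "/counts", s3 = TRUE,
                index = list(1:10, 1:5))        # read only a small slab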
Vince Carey (12:00:06): > > > library(restfulSE) > 5/55 packages newly attached/loaded, see sessionInfo() for details. > Warning messages: > 1: package ‘SummarizedExperiment’ was built under R version 4.0.2 > 2: package ‘GenomeInfoDb’ was built under R version 4.0.2 > > tenx = se1.3M() > snapshotDate(): 2020-07-10 > see ?restfulSEData and browseVignettes('restfulSEData') for documentation > loading from cache > > tenx > class: SummarizedExperiment > dim: 27998 1306127 > metadata(0): > assays(1): counts > rownames(27998): ENSMUSG00000051951 ENSMUSG00000089699 ... > ENSMUSG00000096730 ENSMUSG00000095742 > rowData names(12): ensid seqnames ... symbol entrezid > colnames(1306127): AAACCTGAGATAGGAG-1 AAACCTGAGCGGCTTC-1 ... > TTTGTCAGTTAAAGTG-133 TTTGTCATCTGAAAGA-133 > colData names(4): Barcode Sequence Library Mouse > > library(SummarizedExperiment) > 0/0 packages newly attached/loaded, see sessionInfo() for details. > > assay(tenx) > <27998 x 1306127> matrix of class DelayedMatrix and type "double": > AAACCTGAGATAGGAG-1 ... TTTGTCATCTGAAAGA-133 > ENSMUSG00000051951 0 . 0 > ENSMUSG00000089699 0 . 0 > ENSMUSG00000102343 0 . 0 > ENSMUSG00000025900 0 . 0 > ENSMUSG00000109048 0 . 0 > ... . . . > ENSMUSG00000079808 0 . 0 > ENSMUSG00000095041 1 . 0 > ENSMUSG00000063897 0 . 0 > ENSMUSG00000096730 0 . 0 > ENSMUSG00000095742 0 . 0 > > >
Vince Carey (12:01:18): > The RDS is in ExperimentHub. It includes these definitions, which allow global access: > > > str(assay(tenx)) > Formal class 'DelayedMatrix' [package "DelayedArray"] with 1 slot > ..@ seed:Formal class 'DelayedDimnames' [package "DelayedArray"] with 2 slots > .. .. ..@ dimnames:List of 2 > .. .. .. ..$ : chr [1:27998] "ENSMUSG00000051951" "ENSMUSG00000089699" "ENSMUSG00000102343" "ENSMUSG00000025900" ... > .. .. .. ..$ : chr [1:1306127] "AAACCTGAGATAGGAG-1" "AAACCTGAGCGGCTTC-1" "AAACCTGAGGAATCGC-1" "AAACCTGAGGACACCA-1" ... > .. .. ..@ seed :Formal class 'HSDSArraySeed' [package "rhdf5client"] with 5 slots > .. .. .. .. ..@ endpoint: chr "[http://hsdshdflab.hdfgroup.org](http://hsdshdflab.hdfgroup.org)" > .. .. .. .. ..@ svrtype : chr "hsds" > .. .. .. .. ..@ domain : chr "/shared/bioconductor/tenx_full.h5" > .. .. .. .. ..@ dsetname: chr "/newassay001" > .. .. .. .. ..@ dataset :Formal class 'HSDSDataset' [package "rhdf5client"] with 5 slots > .. .. .. .. .. .. ..@ file :Formal class 'HSDSFile' [package "rhdf5client"] with 3 slots > .. .. .. .. .. .. .. .. ..@ src :Formal class 'HSDSSource' [package "rhdf5client"] with 2 slots > .. .. .. .. .. .. .. .. .. .. ..@ endpoint: chr "[http://hsdshdflab.hdfgroup.org](http://hsdshdflab.hdfgroup.org)" > .. .. .. .. .. .. .. .. .. .. ..@ type : chr "hsds" > .. .. .. .. .. .. .. .. ..@ domain: chr "/shared/bioconductor/tenx_full.h5" > .. .. .. .. .. .. .. .. ..@ dsetdf:'data.frame': 1 obs. of 2 variables: > .. .. .. .. .. .. .. .. .. ..$ paths: chr "/newassay001" > .. .. .. .. .. .. .. .. .. ..$ uuids: chr "d-fbd1c486-c210-11e8-8805-0242ac120008" > .. .. .. .. .. .. ..@ path : chr "/newassay001" > .. .. .. .. .. .. ..@ uuid : chr "d-fbd1c486-c210-11e8-8805-0242ac120008" > .. .. .. .. .. .. ..@ shape: num [1:2] 1306127 27998 > .. .. .. .. .. .. ..@ type :List of 2 > .. .. .. .. .. .. .. ..$ class: chr "H5T_INTEGER" > .. .. .. .. .. .. .. ..$ base : chr "H5T_STD_I32LE" >
Vince Carey (12:02:48): > I am working on a little workshop instance (following Sean Davis’ protocol for container-based workshops) called remlarge which will go over the various issues with rem[ote] large data in Bioconductor. HDF5 in AWS, HSDS, tiledb, BigQuery all in play. Stay tuned.
Ludwig Geistlinger (15:00:56): > DelayedArray beginner’s question: > > What’s the preferred way of turning a DelayedMatrix into a 1D DelayedArray, > > i.e. the equivalent of > > v <- as.vector(m) > > with m being a matrix. > > Is there anything I am missing that will work more efficiently than: > > DelayedArray(array(as.vector(dm))) > > with dm being a DelayedMatrix?
Marcel Ramos Pérez (16:15:26): > Perhaps you're referring to having something like dim(dm) <- 200L supported (as it is for matrices)?
Shubham Gupta (23:12:44): > Can MulticoreParam() from BiocParallel help in reading multiple files? I have some mass-spectrometry files and I am fetching sub-data from these files using bplapply and lapply. I am using 8 workers and find no difference. Using Ubuntu 16.04, R 4.0
2020-07-25
Vince Carey (07:43:08): > The answer is yes but whether you will see a difference depends upon details. One illustration can be given with the RNAseqData.HNRNPC.bam.chr14 package. Run the example in the -package man page, and then > > > register(MulticoreParam(8)) > > system.time(bplapply(bamfiles, function(x) readGAlignmentPairs(x, use.names=TRUE, param=param)) ) > user system elapsed > 36.924 3.384 7.247 > > system.time(lapply(bamfiles, function(x) readGAlignmentPairs(x, use.names=TRUE, param=param)) ) > user system elapsed > 27.038 0.157 27.211 >
Vince Carey (07:53:59): > This would be a good question for support.bioconductor.org if you are doing things in the mass-spec space; folks who don't monitor this slack could provide more guidance.
2020-07-26
Subhajit Dutta (01:05:03): > @Subhajit Dutta has joined the channel
2020-07-27
Hervé Pagès (01:45:35) (in thread): > Kind of depends on various things. If dm is a 1-row or 1-col DelayedMatrix then just drop the ineffective dimension with drop(dm). Otherwise doing the reshaping with something more efficient than DelayedArray(array(as.vector(dm))) depends on where dm's data lives. More precisely, what's the nature of dm's seed (seed(dm))? Is it on-disk data or some kind of in-memory sparse data like a dgCMatrix?
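For the 1-row/1-col case, a quick illustration of the drop() and seed() points above (toy object, names made up):
> library(DelayedArray)
> dm <- DelayedArray(matrix(runif(10), ncol = 1))   # 10 x 1 DelayedMatrix
> v <- drop(dm)                                     # delayed drop of the ineffective dimension
> dim(v)                                            # 10, i.e. a 1D DelayedArray
> seed(dm)                                          # inspect where dm's data actually lives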
Hervé Pagès (02:00:06): > @Will Townes Use path() on your HDF5Array object to get the path to the h5 file. If the object was created by calling writeHDF5Array() without specifying the path or via coercion (e.g. as(m, "HDF5Array")), the data is normally written in a temp file so doesn't persist across sessions. HDF5 datasets created without explicit control of the user are called automatic datasets. The user can still indirectly control the location and physical properties of automatic datasets via a set of utility functions. See ?getHDF5DumpFile for the details. > Use saveHDF5SummarizedExperiment() to serialize a SummarizedExperiment object or derivative together with its assay data written to an HDF5 file. This produces a bundle that is relocatable.
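A minimal sketch of both points (toy data; file and directory names are made up):
> library(HDF5Array)
> library(SummarizedExperiment)
> m <- matrix(rpois(2000, lambda = 5), nrow = 100)
> M <- as(m, "HDF5Matrix")       # written to an "automatic" dataset in a temp .h5 file
> path(M)                        # where it landed; won't survive the R session
> se <- SummarizedExperiment(assays = list(counts = M))
> saveHDF5SummarizedExperiment(se, dir = "my_se_dir", replace = TRUE)
> se2 <- loadHDF5SummarizedExperiment("my_se_dir")   # relocatable bundle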
Ludwig Geistlinger (09:42:01) (in thread): > Thanks @Hervé Pagès! Let me be a bit more specific here. I have three 50k x 50k correlation matrices as computed with WGCNA::bicor. I have wrapped each one within a DelayedArray. Now I want to create a new matrix of the same dimensions, where each cell [i,j] stores the max correlation found in the corresponding cells of the three correlation matrices. > > In order to efficiently loop over it, I thought the best way of going about this would be to (i) convert each correlation matrix into a vector (or here a 1D DelayedArray?), (ii) cbind them together, and (iii) use rowMaxs on the resulting 3-col matrix, with each col storing the 50k * 50k = 2.5e9 correlations of the individual matrices. > > Now with this approach I have two problems: > a) how to efficiently convert a DelayedMatrix into something like a DelayedVector (aka a 1D DelayedArray = my original question), > b) I just noted that cbinding of DelayedArrays does not work for 1D DelayedArrays > > > cmats > [[1]] > <1335317764> array of class DelayedArray and type "double": > [1] [2] [3] . [1335317763] [1335317764] > 1.00000000 0.79994410 -0.05514658 . 0.8038488 1.0000000 > > [[2]] > <1335317764> array of class DelayedArray and type "double": > [1] [2] [3] . [1335317763] [1335317764] > 1.0000000 0.8701370 -0.2207364 . 0.8210861 1.0000000 > > [[3]] > <1335317764> array of class DelayedArray and type "double": > [1] [2] [3] . [1335317763] [1335317764] > 1.00000000 0.81277564 0.02168199 . 0.9131348 1.0000000 > > > cvec <- do.call(cbind, cmats) > Error in validObject(.Object) : invalid class "DelayedAbind" object: > the array-like objects to bind must have at least 2 dimensions for this > binding operation >
Ludwig Geistlinger (09:44:32) (in thread): > also @Marcel Ramos Pérez for a better idea of what I am trying to do here
Rowling (10:22:05): > @Rowling has joined the channel
Hervé Pagès (12:56:43) (in thread): > Not clear to me why you need to wrap your correlation matrices in DelayedArray objects in the first place. If you have sparse in-memory matrices (e.g. dgCMatrix objects), just use pmax() on them. If you really really want to wrap them in DelayedArray objects, use pmax2() on them. This is delayed.
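A toy sketch of the pmax2() route (object names made up); pmax2() is binary, so it is chained for three matrices, and the result stays delayed:
> library(DelayedArray)
> m1 <- DelayedArray(matrix(runif(9), nrow = 3))
> m2 <- DelayedArray(matrix(runif(9), nrow = 3))
> m3 <- DelayedArray(matrix(runif(9), nrow = 3))
> cellwise_max <- pmax2(pmax2(m1, m2), m3)   # delayed cell-wise max across the three
> showtree(cellwise_max)                     # inspect the stacked delayed operations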
Ludwig Geistlinger (13:17:27) (in thread): > Ah, I didn't know about the existence of the pmax command. That of course makes it straightforward. Many thanks!
2020-07-28
jackgisby (09:06:38): > @jackgisby has joined the channel
Ray Su (10:53:05): > @Ray Su has joined the channel
Rajiv Kumar Tripathi (16:33:37): > @Rajiv Kumar Tripathi has joined the channel
Mike Jiang (19:02:13) (in thread): > @Hervé Pagès I am currently working on the adapter wrappers that can convert the existing cyto data structures (not flowCore though, since it is pretty much replaced by cytolib/flowWorkspace for more scaled workflows) to SCE. Here are two examples (still work in progress); with some help from @Aaron Lun I was able to get it to work with scran/scater: https://rpubs.com/rglab/642636 and https://rpubs.com/rglab/643907. Basically I wrapped the flowWorkspace::cytoframe (equivalent to flowCore::flowFrame) into the seed of CytoArray, which is a DelayedArray extension, and extract all the cytoframes from a GatingSet and cbind these CytoArrays into a SingleCellExperiment. I really appreciate the flexibility and generic design of DelayedArray, which makes everything else pretty much work out of the box.
2020-07-29
Isha Goel (16:23:19): > @Isha Goel has joined the channel
Riyue Sunny Bao (17:36:04): > @Riyue Sunny Bao has joined the channel
2020-07-30
Charlotte Soneson (09:07:44): > Copying the poll questions from the DelayedArray workshop here for @Peter Hickey (when it's morning again in Australia:slightly_smiling_face:) or others: > > 1. Can users specify the maximum memory usage or should it be controlled by the chunk size of HDF5 in advance? > 2. If we want to analyze the raw count matrix of scRNAseq downloaded from scRNAseqdb, do we need to convert the matrix to HDF5 file? > 3. What are the best practices around parallelization for HDF5Arrays? For example reading from a file within a bpapply() loop? > 4. How do you mean by "realizing" a DelayedArray or HDF5Array dataset? > 5. Curious about huge datasets. Any limitations you can think about around, say, biobank-scale data (10s + of TB)? Any thoughts about also performance in cluster environments? Assuming performance scales both with the flavor of HDD + network overhead if using NFS or similar. > 6. Is there a saveHDF5SingleCellExperiment? > 7. What's the relationship between the chunk size on disk and the block size on input to R? Do they have to match? Is there an optimal choice for each? >
beyondpie (10:11:49): > @beyondpie has joined the channel
Hervé Pagès (11:03:25): > It’s morning again here on the Pacific Coast. I’ll try to answer some of them.
Hervé Pagès (11:16:28): > > 1. Can users specify the maximum memory usage or should it be controlled by the chunk size of HDF5 in advance? > > Maximum memory usage is controlled via the block size. See ?getAutoBlockSize. This controls the size of the blocks during block processing. Operations on DelayedArray objects or derivatives (e.g. HDF5Array, TileDbArray, RleArray, etc…) can be either delayed (e.g. [, t(), log(), cbind(), pmax2(), etc.) or block-processed (e.g. rowMeans(), max(), which(), writeHDF5Array(), %*%, etc.). The block size only matters for the latter. The chunk size is a physical property of how the HDF5 data is organized on disk. It's typically small (e.g. 100x100, 1000x1000, 200x8000) compared to the block size (by default the block size is set to 100MB).
Hervé Pagès (11:20:36): > > 2. If we want to analyze the raw count matrix of scRNAseq downloaded from scRNAseqdb, do we need to convert the matrix to HDF5 file? >
> Can’t think of any reason why you would need to do that but I’ll let the scRNAseqdb authors answer that one.
Hervé Pagès (11:25:23): > > 3. What are the best practices around parallelization for HDF5Arrays? For example reading from a file within a bpapply() loop? > > Or even better: blockApply(). Note that a lot of operations are already implemented as block-processed operations (see 1. above) and will use parallel evaluation if you set a parallelization backend via setAutoBPPARAM().
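A minimal sketch of both routes (toy data; worker count illustrative):
> library(DelayedArray)
> library(BiocParallel)
> setAutoBPPARAM(MulticoreParam(workers = 4))   # backend used by block-processed operations
> x <- DelayedArray(matrix(runif(1e6), ncol = 100))
> block_sums <- blockApply(x, sum)              # FUN gets one in-memory block at a time
> sum(unlist(block_sums))                       # same as sum(x)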
Hervé Pagès (11:40:31): > > 4. How do you mean by "realizing" a DelayedArray or HDF5Array dataset? > > It means executing the delayed operations carried by the object. This produces a new array-like object that is either written to disk or held in memory. It can be done via coercion, e.g. as(x, "HDF5Array") for on-disk realization as an HDF5 dataset, or as.array(x) for in-memory realization as an ordinary array (this one would only make sense if the object carries operations that have reduced its original size to something that can fit in memory, for example as.array(log(t(x[50, 20])))). Realization can also be done via an explicit writing function like writeHDF5Array() or writeTENxMatrix(), which allow explicit control of how the data should be organized on disk. Finally it can be done with realize(). This will use whatever DelayedArray backend is currently set as the default. See ?setRealizationBackend.
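The same point as a short sketch (toy data; file/dataset names made up):
> library(HDF5Array)
> x <- as(matrix(runif(1e6), ncol = 100), "HDF5Array")
> y <- log(t(x) + 1)                       # delayed operations, nothing computed yet
> Y1 <- as(y, "HDF5Array")                 # realize on disk as an HDF5 dataset
> Y2 <- writeHDF5Array(y, filepath = "y.h5", name = "logcounts")   # explicit control
> setRealizationBackend("HDF5Array")
> Y3 <- realize(y)                         # realize with the current default backend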
Hervé Pagès (11:45:18): > > 5. Curious about huge datasets. Any limitations you can think about around, say, biobank-scale data (10s + of TB)? Any thoughts about also performance in cluster environments? Assuming performance scales both with the flavor of HDD + network overhead if using NFS or similar. >
> Partial answer: the DelayedArray framework only supports array-like datasets for which all the dimensions are <= 2^31-1. Otherwise, it should deal with anything that the on-disk backend (e.g. HDF5, TileDb) can deal with. IO is the bottleneck for block-processed operations so having a good SSD is key for local datasets.
Hervé Pagès (11:48:17): > > 6. Is there a saveHDF5SingleCellExperiment? >
> saveHDF5SummarizedExperiment() works on any SummarizedExperiment object or derivative (see ?saveHDF5SummarizedExperiment) so should work on a SingleCellExperiment object.
Aaron Lun (11:48:45) (in thread): > I’m guessing that this won’t save the extra bits of the SCE, though.
Aaron Lun (11:49:18) (in thread): > I'll assume that this isn't referring to the scRNAseq package.
Hervé Pagès (11:49:42) (in thread): > Why wouldn’t it? Everything gets serialized except the assays which go to an h5 file.
Aaron Lun (11:50:19) (in thread): > Oh, that’s how it works.
Hervé Pagès (11:50:34) (in thread): > yep. It’s that simple.
Aaron Lun (11:50:45) (in thread): > Hm. The altexps could contain some pretty chunky data, but oh well.
Hervé Pagès (11:53:02) (in thread): > So if some components other than the assays in the object are big then they won't go to the H5 file and will just get serialized. This might not be optimal. Something that a saveHDF5SingleCellExperiment() would remedy.
Hervé Pagès (11:58:39): > > 7. What's the relationship between the chunk size on disk and the block size on input to R? Do they have to match? Is there an optimal choice for each? > > The blocks used by block processing are always made of a number of whole chunks. blockGrid() in BioC 3.11 (renamed to defaultAutoGrid() in BioC 3.12) takes care of defining a grid of blocks that is compatible with the chunk geometry and other constraints like the max block size.
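A quick way to see that relationship on a toy object (names illustrative; use blockGrid() instead on BioC 3.11):
> library(HDF5Array)
> X <- writeHDF5Array(matrix(runif(1e6), ncol = 200), chunkdim = c(100, 20))
> chunkdim(X)                  # on-disk chunk geometry
> grid <- defaultAutoGrid(X)   # blocks made of whole chunks, capped by the auto block size
> dims(grid)                   # geometry of each block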
Hervé Pagès (12:10:12) (in thread): > Here is my chance to complain about the name of the scRNAseq package: why so generic? At least a *Data suffix would have avoided some confusion.
Aaron Lun (12:14:50) (in thread): > I didn’t name it, I just inherited it.
Tim Howes (12:30:11): > @Tim Howes has joined the channel
Kelly Street (13:39:07): > @Kelly Street has joined the channel
Kelly Street (13:50:38): > Maybe related to Will's earlier questions: some of us are trying to get the null residuals approximation to GLM-PCA to work with DelayedArray objects. I think we have methods that do this (at least, they take them as input and return them as output), but when we try to do anything downstream (i.e. runPCA or realize), we get the following error message: > > Error in check_returned_array(ans, expected_dim, "extract_array", class(x)) : > The "extract_array" method for LowRankMatrixSeed objects returned an array with > incorrect dimensions. Please contact the author of the LowRankMatrixSeed class > (defined in the BiocSingular package) about this and point him/her to the man page > for extract_array() in the DelayedArray package (?extract_array). > > So, taking that suggestion, @Aaron Lun (or anyone else) do you know what might be going on here?
Aaron Lun (13:51:25): > Wut
Aaron Lun (13:51:37): > How did you construct the LRM?
Kelly Street (13:53:38): > > BiocSingular::LowRankMatrix( > DelayedArray(matrix(p)), > DelayedArray(matrix(n))) > > (where p and n are vectors)
Kelly Street (13:54:34): > (basically we needed the outer product of those vectors as a DelayedArray)
Aaron Lun (14:00:20): > > p <- runif(10) > n <- runif(20) > out <- BiocSingular::LowRankMatrix(DelayedArray(cbind(p)), DelayedArray(cbind(n))) >
> I’m struggling to reproduce the error on BioC-devel.
Kelly Street (14:07:03): > it seems to only happen with particularly large matrices. This did it for me: > > p <- runif(50000) > n <- runif(50000) > out <- BiocSingular::LowRankMatrix(DelayedArray(cbind(p)), DelayedArray(cbind(n))) > m <- realize(out) >
Rene Welch (14:07:36): > @Rene Welch has joined the channel
Aaron Lun (14:08:59): > Ah. The REAL error is somehow coerced into a character matrix: > > tcrossprod(get_rotation(x2), get_components(x2)) > <2 x 1> matrix of class DelayedMatrix and type "list": > [,1] > message vector memory exhaus.. > call NULL >
Aaron Lun (14:09:10): > So the actual error is that you don’t have enough memory to realize that object.
Aaron Lun (14:09:23): > It just somehow got wrapped into… whatever that was.
Aaron Lun (14:09:35): > Probably an error with the matrix multiplication’s error handling.
Kelly Street (14:10:42): > ok, that makes more sense. thanks very much!
Aaron Lun (14:16:43): > you should be realizing it in chunks anyway
Kelly Street (14:20:42): > the actual next step (rather than realizing it) is runPCA, which is how I first encountered the error. Is that doing something under the hood that realizes a large matrix?
Aaron Lun (14:21:34): > Hm. I didn’t think so.
Aaron Lun (14:22:51): > Oh wait. BiocSingular::runPCA defaults to an exact PCA. So yes, it will try to realize it because base::svd requires that.
Aaron Lun (14:22:56): > Try some of the approximate algs.
Kelly Street (15:15:24): > that seems to work! (at least, it works with FastAutoParam)
Peter Hickey (16:43:06) (in thread): > Thanks, Hervé!
Ayush Raman (16:59:46): > @Ayush Raman has joined the channel
2020-07-31
Sunil Nahata (09:10:32): > @Sunil Nahata has joined the channel
Constantin Ahlmann-Eltze (12:38:28): > @Constantin Ahlmann-Eltze has joined the channel
bogdan tanasa (13:54:32): > @bogdan tanasa has joined the channel
CristinaChe (17:59:30): > @CristinaChe has joined the channel
2020-08-01
Nick Borcherding (11:40:06): > @Nick Borcherding has joined the channel
2020-08-04
Lambda Moses (00:29:36): > @Lambda Moses has joined the channel
rohitsatyam102 (14:29:07): > @rohitsatyam102 has joined the channel
2020-08-05
shr19818 (13:46:56): > @shr19818 has joined the channel
2020-08-07
Mikhail Dozmorov (20:02:38): > @Mikhail Dozmorov has joined the channel
2020-08-08
Lukas Weber (17:39:46): > Is there any way to easily apply a function to rows of a sparse dgCMatrix while keeping everything sparse? Applying it naively (with apply(x, 1, fun(...))) seems to automatically convert to non-sparse. Or should I be using blockApply? (Not sure if this is the right channel to ask this) Thanks!
Dirk Eddelbuettel (17:46:49): > R has no "native" sparse representation so operators like apply() will likely densify. Not what you want. Can you (for a 1st pass) loop "by hand"?
Aaron Lun (17:47:28): > Is said function even capable of taking a sparse input? If not, you’ll have to collapse it into an ordinary numeric vector anyway.
Lukas Weber (17:48:49): > ok thanks, yep I’m currently doing it with loops by hand
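For reference, a rough sketch of the "by hand" route on a dgCMatrix, touching only the stored (nonzero) values column by column via the @p/@x slots; since dgCMatrix storage is column-oriented, row-wise work would typically transpose first (everything here is a made-up toy):
> library(Matrix)
> m <- rsparsematrix(1e4, 500, density = 0.01)    # toy dgCMatrix
> fun <- function(vals) sum(vals^2)               # toy function applied to nonzeros only
> res <- vapply(seq_len(ncol(m)), function(j) {
>   nnz <- m@p[j + 1L] - m@p[j]                   # number of nonzeros in column j
>   if (nnz == 0L) return(fun(numeric(0)))
>   fun(m@x[(m@p[j] + 1L):m@p[j + 1L]])           # column j's values, no densifying
> }, numeric(1))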
Tim Triche (18:27:25): > out of curiosity, is there functionality in BiocParallel for a parallel or recursive Reduce(fn, list, ...)?
Tim Triche (18:28:23): > similarly, I’m having some issues modifying matrices or HDF5Array objects column-by-column in parallel; is there a repository of examples of this somewhere? It seems like the sort of thing that people have probably done many many times before.
Tim Triche (18:29:15): > for a base::matrix I have to split-apply-combine, which is fine I suppose, but if I'm going to update 120 columns of a 3 million row matrix in parallel, I'd rather not
Tim Triche (18:30:06): > probably I can look in DropletUtils or beachmat or some such. Maybe I just answered my own question. Or maybe not. We’ll see soon
Aaron Lun (18:31:02): > modifying objects is probably tricky outside of basic operations.
Aaron Lun (18:31:34): > Split apply and combine is still a paradigm that works because it’s all delayed, but it really depends on the operation.
Tim Triche (18:32:00): > so should I switch to DelayedArray operations for split-apply-combine in parallel, and only materialize it at the end?
Tim Triche (18:32:55): > In this case I'm reading in a bunch of Tabix'ed files (weird formats, long story) and the Reduce comes about when deciding how many rows the matrix/Matrix/DelayedMatrix should be, while the parallel striping over the columns comes about when loading it.
Aaron Lun (18:33:27): > for example, scuttle has a .splitRowsPerWorker to split a matrix into equally sized chunks for parallel processing.
Tim Triche (18:33:58): > It seems like the sort of thing that happens often enough that someone likely implemented either an idiom or a library function for it. I could write some C/C++ code to do it, but I’d just as soon not (in this case I’m looking to work column-by-column instead of rowchunk-by-rowchunk or block-by-block)
Tim Triche (18:35:05): > Reduce seems like it could handle a recursive merge (especially since I initialize on the largest object, which will usually dominate the computation) and parallel writes “seem like” a common operation, but maybe not?
Aaron Lun (18:35:42): > ¯\_(ツ)_/¯
Aaron Lun (18:35:53): > Sounds like you have two problems here.
Tim Triche (18:35:57): > just curious if other people have done this before
Aaron Lun (18:36:18): > The first thing is the block processing of the HDF5Array, the second thing is this reduce question.
Tim Triche (18:36:30): > it works to bplapply() into the Reduce, it would just be faster if it were recursive about the Reduce. So that’s less important
Tim Triche (18:36:43): > for HDF5 (or a Matrix or matrix) the update-by-reference idea would be handy
Tim Triche (18:36:55): > that probably dominates in most use cases
Tim Triche (18:37:27): > perhaps I should consider using tiledb as a backing store for such situations? I vaguely remember it being optimized for parallel operations
Aaron Lun (18:37:51): > IMO update by reference is very dangerous in an interactive analysis environment.
Tim Triche (18:37:58): > you’re not wrong at all
Tim Triche (18:38:09): > and I don’t disagree
Tim Triche (18:38:27): > but the assays() slot of a SummarizedExperiment is a reference class
Aaron Lun (18:39:05): > it still doesn’t have pass-by-reference semantics.
Tim Triche (18:39:15): > or at least it was. now that I’m looking, maybe that has changed. and yes, I think that’s what I’m realizing
Aaron Lun (18:39:25): > unlike the madness from ESet’s unlocked environments, those were insane.
Tim Triche (18:39:44): > yeah that is not something I’m sad to forget, thank you for reminding me
Tim Triche (18:40:26): > so I guess if I want to stripe up and down the columns of a matrix-shaped object, reference or not, I’m going to have to hook directly into a C function and get ready to get cut.
Tim Triche (18:40:42): > maybe not worth the trouble given the expected benefit at this point in time.
Tim Triche (18:40:49): > thanks for the mental walkthrough.
Hervé Pagès (18:53:00): > FWIW a naive walk on the rows or columns of an HDF5Matrix object will generally be very inefficient and is something that has hit other people before. See https://github.com/Bioconductor/DelayedArray/issues/49 for how to do this in an efficient manner. Maybe relevant to your "recursive Reduce" (whatever that is) or maybe not.
Tim Triche (19:07:19): > so the Reduce is less of an issue – I can do that some other time – I realized that either right-left (smallest to largest) or left-right (largest to smallest) is likely to be most efficient in that case and I'll find out which in practice
Tim Triche (19:07:29): > but the chunk sizing for HDF5Arrays is about to become interesting
Tim Triche (19:08:36): > let’s say I have a matrix with M rows and N columns, and for each column 1 <= j <= N I want to update 1 <= i <= M rows
Tim Triche (19:09:10): > the number of rows (length of the index i) may be much smaller than the number of rows M
Tim Triche (19:09:22): > since a lot of columns will have “holes” (some may have millions of holes)
Tim Triche (19:10:05): > if I initialized my matrix as NA_real_ of dimension MxN, should I set chunksize to M?
Tim Triche (19:10:17): > (before attempting to stripe)
Tim Triche (19:10:32): > I saw a note from @Peter Hickey along these lines when I went searching for an answer earlier
Tim Triche (19:11:04): > in terms of parallel loading. At the moment, I’d be happy with getting an efficient serial load out of the exercise.
Tim Triche (19:12:50): > I see blockApply and blockReduce in DelayedArray
Tim Triche (19:13:07): > (edit to clarify) are blockApply and blockReduce useful for column-wise or row-wise operations as well? @Hervé Pagès
Tim Triche (19:14:32): > it looks like perhaps colAutoGrid is what I should use for my situation?
Tim Triche (19:16:02): > I guess I’ll find out shortly here:slightly_smiling_face:
Tim Triche (19:17:34): > err, colGrid
Tim Triche (19:18:11): > > R> colGrid(DelayedArray(bigmat)) > 1 x 3 RegularArrayGrid object on a 3459366 x 9 array: > [,1] [,2] [,3] > [1,] [ ,1-3] [ ,4-6] [ ,7-9] >
Tim Triche (19:18:14): > this looks promising!
Tim Triche (19:21:21): > it looks like I want to use write_block for each pass?
Will Townes (21:05:43) (in thread): > I looked into this recently but from the perspective of applying a function to columns, so you might have to transpose your matrix first. Since dgCMatrix has column-oriented sparsity, slicing by row is inefficient. If the function you want to apply needs both the zero and nonzero values of each row, I found the best approach is converting to a simple triplet matrix (package slam) then doing a colapply function (also from slam). This method doesn’t support parallelization. If your function can work with only the nonzero values, methods from either slam, data.table, or a custom solution based on converting to a list of columns (found here:https://github.com/willtownes/quminorm/blob/master/R/utils.R#L53) all have comparable performance. Both data.table and the custom approach support parallelization. Here’s my Rmd with a bunch of different attempts and timings etc:https://rpubs.com/will_townes/sparse-apply
Will Townes (21:25:03) (in thread): > Btw the parallelization only helps if the function being applied is slow on each column. If the function is fast but there are many columns it’s probably faster to process serially
Lukas Weber (22:30:18) (in thread): > awesome, thanks! will have a look at this
2020-08-09
Hervé Pagès (02:41:28): > Try this: > > library(DelayedArray) > library(HDF5Array) > x <- matrix(runif(5e6), ncol=1000) > X <- writeHDF5Array(x, chunkdim=c(50, 30)) > chunkdim(X) > #[1] 50 30 >
> Now we want to walk on X's columns and update a few values in each column. We'll write the new updated columns to a new on-disk matrix. We'll use a RealizationSink for this. This is a temporary structure that is used for writing the new matrix to disk as we walk on the original matrix. Once we are done, we'll turn it into a new DelayedArray object with as(sink, "DelayedArray"). > > We encapsulate all the logic in the iso_apply() function:
> Then: > > ## Set realization backend. Doesn't have to be the same as > ## X's backend e.g. it could be TileDBArray when this becomes > ## available: > setRealizationBackend("HDF5Array") > > ## A simple function that takes a numeric vector and modifies > ## some of its values (adds 99 to them): > FUN <- function(col) > { > i <- sample(length(col), 50) > col[i] <- col[i] + 99 > col > } > > ## For good performance 'by' should be a multiple of 30: > Y <- iso_apply(X, MARGIN=2, FUN, by=90) >
Tim Triche (11:05:22): > I may need to debug that first chunk: > > R> library(DelayedArray) > R> library(HDF5Array) > R> x <- matrix(runif(5e6), ncol=1000) > R> X <- writeHDF5Array(m, chunkdim=c(50, 30)) > Error in .normarg_chunkdim(chunkdim, dim) : > the chunk dimensions specified in 'chunkdim' exceed the dimensions of > the object to write > > Enter a frame number, or 0 to exit > > 1: writeHDF5Array(m, chunkdim = c(50, 30)) > 2: HDF5RealizationSink(dim(x), sink_dimnames, type(x), filepath = filepath, > 3: .normarg_chunkdim(chunkdim, dim) > > Selection: chunkdim(X) > Enter an item from the menu, or 0 to exit > Selection: 0 >
Tim Triche (11:06:06): > The peril of shooting from the hip in examples (not blaming you, I’ve had to update examples posted to bioc-devel some absurd number of times)
Tim Triche (11:09:51): > is colAutoGrid only in devel? I have a Singularity instance running devel on our HPC, and I'm running release (mostly) on my laptop, so I have been going back and forth with these notations
Hervé Pagès (11:19:42): > Replace m by x in the call to writeHDF5Array(). I corrected my original code too.
Hervé Pagès (11:24:04): > colAutoGrid is the same as colGrid, just its new name in devel.
Hervé Pagès (11:45:33) (in thread): > Sounds good. Thanks for the update.
Tim Triche (11:54:18): > earlier example revisited: > > library(DelayedArray) > library(HDF5Array) > > m <- matrix(runif(5e6), ncol=1000) > X <- writeHDF5Array(m, chunkdim=c(50, 30)) > chunkdim(X) > # [1] 50 30 >
> then with iso_apply as above, > > R> packageVersion("DelayedArray") > [1] '0.14.1' > R> Y <- iso_apply(X, MARGIN=2, FUN, by=30) > Error in colAutoGrid(X, ncol = by) : > could not find function "colAutoGrid" > > Enter a frame number, or 0 to exit > > 1: iso_apply(X, MARGIN = 2, FUN, by = 30) > > Selection: 0 >
> Per previous, I’m not grumbling about the examples, it mostly reflects the state of flux many of these things are in right now.
Tim Triche (11:55:49): > Ack, and you already beat me to it – I was writing out the result and you anticipated the issue:slightly_smiling_face:
Tim Triche (11:56:27): > Thanks again. This is very helpful, minor kicks and stings notwithstanding – getting this sorted out will also fix a lot of problems for us elsewhere.
Tim Triche (12:00:19): > I went ahead and made it so that iso_apply
will work the same on HPC (3.12) and my laptop (3.11, at the moment): > > iso_apply <- function(X, MARGIN, FUN, by=NULL) > { > FUN <- match.fun(FUN) > X_dim <- dim(X) > stopifnot(length(X_dim) == 2L, > isSingleNumber(MARGIN), MARGIN %in% 1:2) > sink <- RealizationSink(X_dim, > dimnames=dimnames(X), > type=type(X)) > ## Use row/col[Auto]Grid() to create a grid of blocks > ## where the blocks are made of full rows/columns. > if (packageVersion("DelayedArray") < 0.15) { > X_grid <- switch(MARGIN, > `1`=rowGrid(X, nrow=by), > `2`=colGrid(X, ncol=by)) > > } else { > X_grid <- switch(MARGIN, > `1`=rowAutoGrid(X, nrow=by), > `2`=colAutoGrid(X, ncol=by)) > } > nblock <- length(X_grid) > for (b in seq_len(nblock)) { > X_block <- read_block(X, X_grid[[b]]) > Y_block <- apply(X_block, MARGIN, FUN) > write_block(sink, X_grid[[b]], Y_block) > } > as(sink, "DelayedArray") > } >
> Now everything works fine and I can play with column striping. Thanks again for this.
Tim Triche (12:05:35): > I remembered that you (@Hervé Pagès) suggested reading @Peter Hickey's slides so I'm doing that as well (https://petehaitch.github.io/BioC2020_DelayedArray_workshop/articles/Effectively_using_the_DelayedArray_framework_for_users.html) - Attachment (petehaitch.github.io): Effectively using the DelayedArray framework to support the analysis of large datasets > DelayedArrayWorkshop
Hervé Pagès (12:07:14): > It’s a must read if you are serious about working with DelayedArray objects. So yeah, better late than never:wink:
Lukas Weber (12:23:59): > I can highly recommend @Peter Hickey's workshop too, watched it last week:+1:
Tim Triche (12:33:33): > I tidied up iso_apply so that it warns the user if they give a silly value for by
and also autodetects which version of DelayedArray is running. However, I’m not sure if I got the chunk height/width checking right: > > #' sensible row- or column- wise operations against a DelayedMatrix > #' > #' For good performance, entire chunks should be loaded; consequently, > #' if MARGIN == 1 (rows), `by` should be a multiple of chunkdim(X)[1], and > #' if MARGIN == 2 (columns), `by` should be a multiple of chunkdim(X)[2]. > #' (There is a check included for this, but the principle is worth noting.) > #' > #' Remember to setRealizationBackend("HDF5Array") or TileDBArray or whatever. > #' > #' @param X a DelayedArray (e.g. a DelayedMatrix backed by HDF5Array) > #' @param MARGIN apply across rows (1) or columns (2), as in base::apply > #' @param FUN the function to apply to each row or column (see above) > #' @param by the number of rows or columns to load at one time > #' > #' @return a new DelayedArray (presumably of the same dimensions as X) > #' > #' @import DelayedArray > #' > #' @export > iso_apply <- function(X, MARGIN, FUN, by=NULL) > { > FUN <- match.fun(FUN) > X_dim <- dim(X) > stopifnot(length(X_dim) == 2L, isSingleNumber(MARGIN), MARGIN %in% 1:2) > sink <- RealizationSink(X_dim, dimnames=dimnames(X), type=type(X)) > > # check for a sane value of `by` > if (!is.null(by)) { > chunksize <- chunkdim(X)[MARGIN] > if ((by %% chunksize) > 0) { > message("Warning: `by` (",by,") is not a multiple of ", > "chunkdim(X)[", MARGIN, "] (i.e., ", chunksize, ").") > } > } > > ## Use row/col[Auto]Grid() to create a grid of blocks > ## where the blocks are made of full rows/columns. > newerDelayedArray <- (packageVersion("DelayedArray") >= 0.15) > rowFn <- ifelse(newerDelayedArray, rowAutoGrid, rowGrid) > colFn <- ifelse(newerDelayedArray, colAutoGrid, colGrid) > X_grid <- switch(MARGIN, `1`=rowFn(X, nrow=by), `2`=colFn(X, ncol=by)) > > nblock <- length(X_grid) > for (b in seq_len(nblock)) { > X_block <- read_block(X, X_grid[[b]]) > Y_block <- apply(X_block, MARGIN, FUN) > write_block(sink, X_grid[[b]], Y_block) > } > as(sink, "DelayedArray") > } >
Tim Triche (12:36:17): > then
Tim Triche (12:38:49): > > R> Y <- iso_apply(X, MARGIN=2, FUN, by=90) > `by` (90) is not a multiple of chunk width (50). > For best performance, by %% chunk width should be 0. >
Tim Triche (12:39:39): > is chunkdim(X) returning (chunkwidth, chunkheight) or (chunkheight, chunkwidth)?
Tim Triche (12:46:25): > also @Hervé Pagès (or @Peter Hickey, whoever is willing to opine) – I just read this WARNING: "Be careful with delayed subassignment because you can end up with objects that are surprisingly large in-memory. This is because the subassigned values are kept in-memory until the data are realized." It looks like for a column-wise load, I will want to ensure that changes (subassignments) are realized on each change; does write_block ensure this?
Tim Triche (13:02:10): > hmmm. > > R> system.time(iso_apply(X, MARGIN=2, FUN, by=100) - X) > Warning: `by` (100) is not a multiple of chunkdim(X)[2] (i.e., 30). > user system elapsed > 10.637 0.000 10.643 > R> system.time(iso_apply(X, MARGIN=2, FUN, by=90) - X) > user system elapsed > 11.283 0.000 11.290 > R> system.time(iso_apply(X, MARGIN=2, FUN) - X) > user system elapsed > 3.810 0.000 3.808 >
Tim Triche (13:02:36): > Is this expected behavior?
Tim Triche (13:04:36): > in the above, > > function(col) { > i <- sample(length(col), round(length(col)/2)) > col[i] <- col[i] + 99 > col > } >
Hervé Pagès (14:33:05) (in thread): > oh yeah, the video of the workshop has some bonus material to keep you entertained. Highly recommended!:guitar:
Hervé Pagès (14:41:04) (in thread): > It returns the dimensions of a chunk so follows the same convention as dim(), i.e. returns c(nrow, ncol). nrow is the "height" and ncol the "width".
Hervé Pagès (14:46:40) (in thread): > The iso_apply() solution doesn't use delayed subassignments or delayed operations at all. The columns are loaded in memory, modified, and written to disk as you go. Piling hundreds or thousands of delayed subassignments (one per column) on top of the original DelayedArray object wouldn't be pretty. Might work if the number of columns is small (e.g. < 100) though. Maybe. (didn't try it)
Hervé Pagès (14:54:41): > Totally expected on a small matrix. Making this kind of comparison on such a small matrix is meaningless since the matrix fits in memory to start with. So of course loading all its columns at once (which is what by=NULL does) will be faster than anything else. This is Pete's first concluding remark: Don't use a DelayedArray if you don't need to!
Tim Triche (16:25:21): > duly noted:slightly_smiling_face:thanks!
2020-08-11
Tim Triche (12:30:12): > hey @Hervé Pagès I broke a couple of computers (BOOST issues in Singularity, flat-out crashing the machine outside of Singularity) when loading up a ~3M x 120 DelayedArray
Tim Triche (12:30:51): > I stole a lot of code from @Peter Hickey's read.bismark.R in bsseq once I realized what was going on, but then I started having… issues:slightly_smiling_face:
Tim Triche (12:32:47): > > # create a SummarizedExperiment to hold the data > message("Creating a SummarizedExperiment for metadata.") > cdata <- DataFrame(sample=tsvnames, file=tsvgzs) > se <- SummarizedExperiment(rowRanges=biggr, colData=cdata) > stopifnot(all(file.exists(se$file))) > colnames(se) <- cdata$sample > genome(se) <- gen > > # NOTE: a lot of the following is straight-up ripped off from bsseq! > ans_nrow <- length(biggr) > ans_ncol <- length(tsvgzs) > ans_dim <- c(ans_nrow, ans_ncol) > # NOTE: should we use h5writeDimnames to record the dimnames? > > dat_type <- "double" > mat_type <- paste(ifelse(HDF5, "HDF5", "in-memory"), dat_type, "matrix") > # allocate a big enough matrix (may switch to tiledb instead of HDF5) > message("Allocating a ", ans_nrow, " x ", ans_ncol, " ", mat_type, ".") > grid <- RegularArrayGrid(refdim = ans_dim, spacings = c(ans_nrow, 1L)) > DelayedArray:::set_verbose_block_processing(TRUE) > message("Creating a DelayedArray for the data.") > > if (HDF5) { > > if (!dir.exists(dir)) dir.create(dir) > h5_path <- file.path(dir, "assays.h5") > if (file.exists(h5_path)) unlink(h5_path) > asy_sink <- HDF5RealizationSink(dim = ans_dim, > type = "double", > filepath = h5_path, > name = "Beta") > on.exit(close(asy_sink), add = TRUE) > > sink_lock <- ipcid() > on.exit(ipcremove(sink_lock), add = TRUE) > > } else { > > asy_sink <- NULL > sink_lock <- NULL > > } > > # read in the Beta values from scNMT files > Beta <- bptry(bplapply(X = seq_along(grid), > FUN = .updateScNMT, > files = tsvgzs, > loci = rowRanges(se), > grid = grid, > asy_sink = asy_sink, > sink_lock = sink_lock, > gen = gen, > BPPARAM = BPPARAM)) > > # checkpoint: > if (!all(bpok(Beta))) { > stop(".updateScNMT() encountered errors for these files:\n ", > paste(files[!bpok], collapse = "\n ")) > } > > # write 'em > if (HDF5) { > > Beta <- as(asy_sink, "DelayedArray") > stopifnot(identical(dim(Beta), dim(se))) > assay(se, "Beta", withDimnames=FALSE) <- Beta > x <- se > x@assays <- HDF5Array:::.shorten_assay2h5_links(x@assays) > saveRDS(x, file = file.path(dir, "se.rds")) > > } else { > > Beta <- Reduce(cbind, Beta) > stopifnot(identical(attr(Beta, "dim"), ans_dim)) > rownames(Beta) <- names(biggr) > colnames(Beta) <- colnames(se) > assays(se)$Beta <- Beta > > } > > # done > return(se) >
Tim Triche (12:33:23): > the worker function (akin to .constructCountsFromSingleFile in bsseq) is like so:
Tim Triche (12:34:12): > > # utility fn, stolen from bsseq, more or less > .updateScNMT <- function(i, files, loci, grid, asy_sink, sink_lock, gen) { > > name <- names(files)[i] > message("[.updateScNMT] Extracting betas for ", name) > message(" from ", files[name]) > gr <- scanScNMT(files[i], gen = gen) > ol <- findOverlaps(gr, loci) # does this need to be `equal`?! > Beta <- matrix(rep(NA_real_, length(loci)), ncol = 1) > Beta[subjectHits(ol)] <- score(gr[queryHits(ol)]) > if (is.null(asy_sink)) return(Beta) > > # Write to asy_sink while respecting the IPC lock. > viewport <- grid[[i]] > ipclock(sink_lock) > write_block(x = asy_sink, viewport = viewport, block = Beta) > ipcunlock(sink_lock) > NULL > > } >
Tim Triche (12:34:36): > when I run this on a few (< 10) files, no problem, regardless of parallel or serial, HDF5 or in-memory
Tim Triche (12:34:51): > when I scale it up to 120 runs, problem (specifically, BOOST pitches a fit)
Tim Triche (12:36:24): > > Loading pre-saved biggr... > Creating a SummarizedExperiment for metadata. > Saving empty `se` to scNMT_meth/empty_se.rds ... OK. > Allocating a 13464893 x 120 HDF5 double matrix. > Creating a DelayedArray for the data. > terminate called after throwing an instance of 'boost::wrapexcept<boost::uuids::entropy_error>' > what(): getrandom > Aborted > Singularity> >
Tim Triche (12:44:40): > If I do this outside of a container, on a CentOS 7 machine, it crashes the machine and forces a reboot
Tim Triche (12:45:54): > this happens regardless of whether it’s devel (bioc-3.12) or release (bioc-3.11), although in both cases it’s CentOS that crashes (I have not yet had this problem on Ubuntu, but I also haven’t had the problem with smaller loads in CentOS)
Tim Triche (12:46:21): > I suppose I can see if my laptop survives it on bioc-release (3.11) and Ubuntu 20.04
Hervé Pagès (13:08:32): > Too much gory details for a slack channel. Can we discuss this somewhere else e.g. support site or open an issue on GH? Thanks
Tim Triche (13:32:33): > yes, thanks!
Tim Triche (13:32:52): > should I open an issue in the DelayedArray repository?
Hervé Pagès (14:01:16): > DelayedArray itself doesn't know anything about BOOST. This sounds more like an inter-process lock issue (ipcid, ipcremove, ipclock, ipcunlock; IIRC these functions are implemented in C++ on top of BOOST). Perhaps some weird interaction between inter-process locking and the HDF5 lib. I suggest you open an issue under BiocParallel. Ping @Peter Hickey, @Mike Smith and me on the issue. It might be a tough one so please do your best to provide something minimalist and self-contained that everybody can run. Thanks!
Tim Triche (14:09:05): > that’s what I figured too – why else would it be using UUIDs – I can run it in-memory in parallel or in HDF5 serially
Tim Triche (14:09:15): > I am packaging up a reprex:slightly_smiling_face:
Tim Triche (14:10:12): > but it does appear that I need to exceed the number of processors by a fair amount before the issue rears its head. I may just cower in fear and make HDF5 or parallel loading mutually exclusive:smile:
Nicholas Knoblauch (14:17:31) (in thread): > The HDF5 party line on parallelism used to be something along the lines of “for a single machine, parallelism probably won’t help, for multiple machines, use MPI”. Now that SWMR (single writer multiple reader) is a thing, (process level) parallel reading is officially supported, but writing still isn’t.
Tim Triche (14:21:44) (in thread): > OK, this is super helpful to know
Tim Triche (14:22:01) (in thread): > I will definitely make parallel || HDF5 a thing then:slightly_smiling_face:
Tim Triche (14:22:13) (in thread): > “dear user, choose your poison”:smile:
Nicholas Knoblauch (14:36:26) (in thread): > If HDF5 is compiled in thread-safe mode, then there is a macro H5_HAVE_THREADSAFE that's defined, and HDF5 will automatically disable parallelism within HDF5 code using a mutex. I'm not sure if there's a way to check that from the R side…
Hervé Pagès (14:49:01) (in thread): > From Pete's workshop: > * Parallelization is never as straightforward or provides as big an improvement as you think/hope. > * Parallel writing to files (e.g. HDF5 files) is a no go. > * Parallel reading from files is sometimes, maybe, perhaps okay … > You've been warned!
Nicholas Knoblauch (15:00:48) (in thread): > with the new(ish) HDF5 direct write operations you can get the offset within an HDF5 file of your dataset. If there isn’t chunking or compression enabled, you can read and write to it directly without any interference (or assistance) from the library. If you stay “inside” that dataset, you won’t compromise the integrity of the file
2020-08-13
Dr Awala Fortune O. (06:45:27): > Please take some time to give your vote on your best packages in R: https://www.menti.com/m374dqb6yt
Tim Triche (14:01:38): > what does this have to do with on-disk data representations, @Dr Awala Fortune O.?
Hervé Pagès (14:15:09): > Also I find it suspicious that the legit Mentimeter website is at https://www.mentimeter.com/, not https://www.menti.com/
Dirk Eddelbuettel (14:30:06) (in thread): > And why does every channel need to get that spam? It would have been borderline on the dedicated borderline channel #random
Martin Morgan (14:53:41): > The author has been contacted; I believe this was not meant to be malicious. > > We have also started discussing approaches to a more intentional moderation of slack. > > As a reminder, the current Code of Conduct associated with this slack (agreed to by those joining through the heroku app) is at https://www.contributor-covenant.org/version/1/0/0/code-of-conduct/; the Community Advisory Board (cab@bioconductor.org) has a committee working on a project-wide code of conduct.
Tim Triche (17:31:24): > I sought not to assume malice on the author’s part, while gently suggesting that perhaps it was a bit overeager… I hope I struck a reasonable balance per the above
Tim Triche (17:32:59): > I appreciate the forbearance of the many community members who have gently steered me in the right direction over the years (e.g. the recent “no, it’s not a good idea to write to HDF5 in parallel” learning experience above)
2020-08-14
Will Townes (09:49:06) (in thread): > @Kelly Street@Stephanie Hicks
Stephanie Hicks (12:49:46) (in thread): > ah thanks@Will Townes!
2020-08-17
Roye Rozov (02:08:35): > @Roye Rozov has joined the channel
Daniel Baker (12:20:50): > @Daniel Baker has joined the channel
2020-08-18
Will Macnair (09:08:21): > @Will Macnair has joined the channel
Daniel Baker (09:25:56): > Hi – > I know this channel is for out-of-core/disk-based processing, but I’m hoping that some people with more experience with Rcpp might be on this channel. (Or be able to recommend somewhere else to ask.) > > I’m primarily a C++ developer working on interfacing a C++ library (https://github.com/dnbaker/minocore) for clustering/nearest neighbors structures for use with R using Rcpp (https://github.com/dnbaker/Rfgc), where I’ve wrapped functional code successfully but have had difficulty in exposing object-oriented/class-based code. I have a lot of experience (5+ projects) using pybind11 (another Boost.Python fork) to wrap with Python, but I’m not very skilled with R and documentation/examples I’ve checked haven’t worked for me for whatever reason. > > Would someone be willing to answer some questions?
Dirk Eddelbuettel (09:29:14): > Yeah too bad that Rcpp doesn't have a) a mailing list dedicated to it b) 2k+ SO questions c) GitHub issues d) an example site in gallery.rcpp.org or e) a few example packages.
Dirk Eddelbuettel (09:29:59): > Kidding aside any venue is fine. Some of the "interfaces" and workflow are more or less fixed and e.g. predate pybind11 so there isn't much in terms of automated generation. Also, and maybe that is partly a source of your difficulties, R is in many ways "different" to C++ or Python or [… insert favourite other language here …] which can make transitions of existing approaches challenging too.
Nicholas Knoblauch (11:26:08) (in thread): > I would add to Dirk’s comment that if you aren’t much of an R programmer, and you’re trying to create an R package for a C++ library, you really should try to avoid exposing the object-oriented/class-based nature of the library as much as possible.
Tim Triche (12:38:24): > I used Rcpp a million years ago in a package I wrote as part of my dissertation. My adviser couldn’t run it on Windows so she asked me to remove the C++ version and since it only ran 1000x slower, I complied. Would echo previous remarks about minimizing the impedance mismatch between R and C++
Tim Triche (12:39:00): > so if you can write mostly-procedural hooks between C++ and R, it will be less painful than if you are exposing a great big complicated C++ class and its instances
Tim Triche (12:43:40): > @Daniel Baker your C++ libraries seem like they could be extremely useful – in a case such as the Rfgc library, would you be willing to just run skeletor to create a framework for hooking things up?
Tim Triche (12:43:45): > https://cran.r-project.org/web/packages/skeletor/README.html
Tim Triche (12:44:05): > that way all the supporting scaffolds are already in place and you don't have to worry about silly stuff getting in the way
Tim Triche (12:44:31): > can concentrate on your C++ / R hooks behaving as expected
Tim Triche (12:45:20): > once I have a working package on GitHub I start iterating with BiocManager::install("trichelab/packageName") with each significant change
Dirk Eddelbuettel (12:45:29) (in thread): > Yep. I actually really like both OO approaches, but sometimes it is just easiest to keep them at arm's length or longer. I.e. something I have done a few times is "just" a (singleton) instance of a C++ "object" and keep using Rcpp to add really simple init() or setFoo() or getFoo() accessors. I like incremental babysteps because I know I will fall on my nose at some point so keeping the height from which I fall somewhat minimal…
Tim Triche (12:45:47): > it also will make things like Travis easier to set up since a template for that gets pooped into place by skeletor
Tim Triche (12:46:17): > and of course you can aim the codebase at other people with a reprex when things blow up
Tim Triche (12:47:02): > Dirk is the author of Rcpp so he may have forgotten how daunting the ecosystem has grown
Dirk Eddelbuettel (12:47:03) (in thread): > There are approximately as many package generators as there are overeager R package authors. I would point to mine, namely pkgKitten, and keep it otherwise close to base R. But that's me…
Tim Triche (12:47:11) (in thread): > fair
Tim Triche (12:47:21) (in thread): > we just have good experience with skeletor
Dirk Eddelbuettel (12:47:26) (in thread): > Hence pruning clippers. See earlier comment.
Tim Triche (12:49:48) (in thread): > also fair – simplest thing that can possibly work first
Daniel Baker (12:51:57): > For instance, it’s easy with pybind11, and this interface I have no problem specifying – > > py::class_<JSDLSHasher<double>>(m, "JSDLSHasher") > .def(py::init<unsigned, unsigned, unsigned, double, uint64_t>(), py::arg("dim"), py::arg("k"), py::arg("l"), py::arg("r") = .1, py::arg("seed") = 0) > .def(py::init<LSHasherSettings, double, uint64_t>(), py::arg("settings"), py::arg("r") = .1, py::arg("seed") = 0) > .def("project", [](const JSDLSHasher<double> &hasher, py::object obj) -> py::object { > return project_array(hasher, obj); > }).def("settings", [](const JSDLSHasher<double> &hasher) -> LSHasherSettings {return hasher.settings_;}, "Get settings struct from hasher"); > py::class_<S2JSDLSHasher<double>>(m, "S2JSDLSHasher") > .def(py::init<unsigned, unsigned, unsigned, double, uint64_t>(), py::arg("dim"), py::arg("k"), py::arg("l"), py::arg("w") = .1, py::arg("seed") = 0) > .def(py::init<LSHasherSettings, double, uint64_t>(), py::arg("settings"), py::arg("w") = .1, py::arg("seed") = 0) > .def("project", [](const S2JSDLSHasher<double> &hasher, py::object obj) { > return project_array(hasher, obj); > }); >
> But simply accessing the module code was the problem. > > For instance, https://dirk.eddelbuettel.com/code/rcpp/Rcpp-modules.pdf has C++ code, but it seems to be made for interactive use within R rather than as an installed package. Specifically, the way that the module is instantiated before its attributes can be accessed. > > On page 15: > mod_vec <- Module( "mod_vec", getDynLib(fx_vec), mustStart = TRUE ) > where did fx_vec come from? (Where did fx_unif come from before that?) Whenever I've used either the name of the package, the name of the class (parenthesized argument to RCPP_MODULE) or anything else, I got segmentation faults rather than error messages. > > And adding the RCPP_MODULE section didn't add any items at all to the R package (at least as far as attributes can be accessed by double- or triple-colon operators). The extra files/tweaks/changes in https://github.com/r-pkg-examples/rcpp-modules-student didn't seem to work for my problem. > > So perhaps a skeleton generator is a solution; I'll come back to that. The other cheaper/easier option is to use a NumericVector to hold state and then access it from C++ code, so I may try that.
Tim Triche (12:53:11): > did you export your hooks/functions? if you document stuff with roxygen2 it will handle namespace annoyances
Tim Triche (12:53:50): > one of the nice things that skeletor poops into place is a makefile with a make doc target that will regenerate docs and update the namespace for you
Tim Triche (12:53:50): > that has bitten me in the past
Daniel Baker (12:53:57): > I exported the functions manually > > import(Rcpp) > import(methods) > useDynLib(Rfgc) > importClassesFrom(Matrix,dgCMatrix) > importClassesFrom(Matrix,dgRMatrix) > importClassesFrom(Matrix,lgCMatrix) > importClassesFrom(Matrix,lgRMatrix) > importFrom(Rcpp,Rcpp.plugin.maker) > importFrom(Rcpp,evalCpp) > importFrom(utils,package.skeleton) > importFrom(utils,packageDescription) > exportPattern("^[[:alpha:]]+") > export(dist_matrixdd) > export(dist_matrixdf) > export(... more functions) > export(display_samplers) > export(display_sse_info) >
Tim Triche (12:54:02): > looks like it
Daniel Baker (12:54:37): > But when I try to add the module name, it says it doesn't exist. When I add useDynLib(module_name, .registration=TRUE), it tells me there's no such module .so file. So maybe burning down my glue code and starting from a skeleton is the right way to go.
Tim Triche (12:55:22): > hard to say without something that I can light a fuse on and blow up in my own face
Tim Triche (12:56:00): > starting from a working skeleton and moving the code into that has, in my experience, made debugging easier than trying to clean everything up before packaging
Tim Triche (12:56:38): > make check and make doc will typically catch these things in my experience, often pointing at what has fallen off
Dirk Eddelbuettel (12:56:59) (in thread): > > but it seems to be made for interactive use within R rather than as an installed package > I would say it is the inverse. We often warn not to use Modules in the more interactive way via sourceCpp() and friends.
Tim Triche (12:57:04): > while they are running, stackOverflow or the rcpp mailing list might help determine if this is a common issue with a common solution
Vince Carey (13:00:32): > Coming in cold: If the python bindings are good, consider reusing them with basilisk interfaces?https://bioconductor.org/packages/release/bioc/html/basilisk.html - Attachment (Bioconductor): basilisk > Installs a self-contained Python instance that is managed by the R installation. This aims to provide a consistent Python version that can be used reliably by Bioconductor packages. Module versions are also controlled to guarantee consistent behavior on different user systems.
Dirk Eddelbuettel (13:00:38) (in thread): > Maybe these are much better – but based on experience with other such wrappers this is precisely why I am suspicious of generators as a "general class" of tools. You just pray and hope generic doc generation works across all packages. For starters, Rcpp needs a sequence of compileAttributes() followed by roxygenize(). And you no longer see who is running what for you. And then people come crying to SO and alike "thing no work". And we get to pick up the pieces. [ Legal disclaimer: skeletor may be the best thing ever, I have not tried it, it probably boils coffee for me too etc. Some tools clearly work. ] > > But these things are like religious wars so I will stay away and check myself out of it. Have at it.
Tim Triche (13:15:53) (in thread): > ohhhhhh… this is good to know
Tim Triche (13:16:31) (in thread): > > For starters, Rcpp needs a sequence of compileAttributes() followed by roxygenize(). >
> So it may fail mysteriously even if everything is otherwise OK, and vice versa
Dirk Eddelbuettel (13:17:50) (in thread): > Yes. RStudio of course also falls into this class of tools, but when you have an Rcpp package and click 'rebuild and install' (or whatever it is called) and look closely, the two calls are very explicit. Our docs say it too, but they say many things and nobody reads docs anyway…
Dirk Eddelbuettel (13:18:19) (in thread): > > So it may fail mysteriously > Predictably:slightly_smiling_face:
Nicholas Knoblauch (13:32:12) (in thread): > You and your users will be better off if the functions you export take base R objects as input and return base R objects. If you’re writing an R package, you will want to write some high-level functions that take and return base R objects anyways, (otherwise, why are you writing an R package?), so you might as well start with those. R, unlike python, has (idiomatically) copy on modify semantics, which is not easy to get right on the C++ side unless 1) you know a bit about R internals or 2) you write “self-contained” functions.
Daniel Baker (13:41:06) (in thread): > Then I’ll stick to functional interfacing and use a NumericVector or Matrix to hold state. (The impetus for the “class” interface was using LSH functions which require random matrices, which I wanted to avoid re-generating.)
Nicholas Knoblauch (15:46:12) (in thread): > You don't need to expose the class interface to have objects persist between function calls. You can heap allocate your foo and stick it in an Rcpp::XPtr<foo> and return that. It works more or less like a shared_ptr
Dirk Eddelbuettel (16:53:48) (in thread): > Yes, very common approach. Then contain the XPtr in an S4 or R6 object. Some people even wrap XPtr around shared_ptr<>.
2020-08-24
Sean Davis (10:02:31): > Another “standards” working group, but this one seems to be addressing a problem that overlaps a bit with discussions here.https://data-apis.org/blog/announcing_the_consortium/ - Attachment (data-apis.org): Announcing the Consortium for Python Data API Standards > An initiative to develop API standards for n-dimensional arrays and dataframes
Vince Carey (12:27:39): > apropos python-record-api: “This module is meant to help you understand how a Python module is being used by other modules.” Warms my heart.
Sean Davis (12:31:40): > I found that module very inspirational. I think this kind of code introspection is likely very doable and could enhance real-world understanding of bioc code reuse and lead to some data-driven development efforts.
2020-08-26
Iwona Belczacka (03:54:35): > @Iwona Belczacka has joined the channel
Saulius Lukauskas (07:47:40): > @Saulius Lukauskas has joined the channel
Will Townes (19:20:09): > How do I coerce an in-memory matrix to HDF5Matrix? I am trying to do this inside a testthat unit test with as(m, "HDF5Matrix"). When I run devtools::test everything works fine. When I run the build check it fails saying no method or default for coercing "matrix" to "HDF5Matrix". I have both HDF5Array and DelayedArray in DESCRIPTION as imports.
Aaron Lun (19:21:15): > testthat is a bit wonky sometimes with its imports, I find that I often have to do stuff like BiocGenerics::cbind to get it to work.
Aaron Lun (19:21:31): > You probably have to explicitly library(HDF5Array) in your unit test file.
Will Townes (19:21:41): > thanks I will try that!
Will Townes (19:24:06): > maybe in the future HDF5Array will have a constructor function similar to the way it works with Matrix(). It’s kind of annoying I can’t do something like HDF5Array::as(m, “HDF5Matrix”) or HDF5Array::HDF5Matrix(m)
Will Townes (19:25:41) (in thread): > @Constantin Ahlmann-Eltze curious to know how you solved this for glmGamPoi
Hervé Pagès (19:26:41): > It already has one. It's called writeHDF5Array(). as(m, "HDF5Array") and as(m, "HDF5Matrix") just call that.
Will Townes (19:28:45): > ohh cool, thanks for pointing that out! I must have missed it in the documentation.
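As a side note, a minimal sketch of the coercion being discussed (assuming HDF5Array is installed; the object names are arbitrary):
> library(HDF5Array)            # loading it registers the coercion methods
> m <- matrix(rnorm(20), nrow = 5)
> h1 <- as(m, "HDF5Matrix")     # coercion; calls writeHDF5Array() under the hood
> h2 <- writeHDF5Array(m)       # the explicit constructor-style call
Both forms produce an HDF5-backed object; the as() form is the one that needs library(HDF5Array) (or an explicit import) to be visible inside a unit test.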
2020-08-27
Constantin Ahlmann-Eltze (05:03:28): > writeHDF5Array() is also the way I am using it in glmGamPoi. For example https://github.com/const-ae/glmGamPoi/blob/master/tests/testthat/test-glm_gp.R#L146
2020-09-03
Dirk Eddelbuettel (19:07:48): > We’re back on. Fingers crossed, maybe it’ll last a full day this time:https://twitter.com/CRANberriesFeed/status/1301656913902546945 - Attachment (twitter): Attachment > New CRAN package tiledb with initial version 0.8.0 http://goo.gl/pgljT #rstats
Aaron Lun (19:08:13): > :+1:
Aaron Lun (19:08:57): > I’m going to give it a week and then I’ll submit TileDBArray to BioC.
Aaron Lun (19:09:11): > Oh wait, we should probably fix some of the hacks first.
Aaron Lun (19:10:13): > Think we fixed one of them with the selected_ranges<- option. There's still a wacky hack to store the dimnames properly.
Dirk Eddelbuettel (19:18:28): > I am game for any and all of it as well as some proper benchmarking and tuning.
Aaron Lun (19:19:01): > let me have a look at it, it’s been a while.
Dirk Eddelbuettel (19:24:15): > No rush. We made a few things better. Note, though, that per CRAN Policy the default now throttles cores. So for Yuge performance make sure you explicitly enable cores etc. Happy to work on more explicit docs.
2020-09-04
Goutham Atla (08:24:02): > @Goutham Atla has joined the channel
2020-09-07
Tim Triche (17:00:55): > this is cool. I’m tired of HDF5 blowing up Singularity or having to use absurd hacks to get around in-memory reps.:crossed_fingers:
Tyrone Chen (20:58:37): > @Tyrone Chen has joined the channel
2020-09-12
Aaron Lun (20:56:11): > Thinking of writing a much-streamlined version of beachmat to directly target the output of extract_array and extract_sparse_array within DelayedArray block-apply statements.
Aaron Lun (20:58:34): > This should make it much easier to write C++ code within the blockapply while reducing compilation times and complexity.
Aaron Lun (21:17:59): > Problem is that it’ll be all-in on the block processing, which would involve actually rewriting a lot of R code. Hm.
2020-09-13
Tim Triche (10:54:40): > Oh god
Nicholas Knoblauch (21:42:51) (in thread): > That sounds super interesting. Could you say more about what you have in mind?
2020-09-14
Aaron Lun (01:43:21) (in thread): > History lesson!
> 
> So, back in 2017, we had two broad strategies for how to make our various C++ functions operational with alternative matrices. The first was the one currently implemented in beachmat, which takes the Rcpp::RObject in C++, figures out what it is, and then provides methods to read a vector of values from a given row or column. This works reasonably well and allows people to write C++ code in a manner that is agnostic to the matrix representation. Most importantly, it was a plug-and-play replacement for the usual Rcpp .column() and .row() methods for matrix access, which was important to allow client packages to get up and running with their existing C++ code.
> 
> The second strategy was to perform block-wise looping in R, using DelayedArray::blockApply. This would read in a block of values as an ordinary matrix that would then be passed into the C++ code; this means that developers would only have to worry about handling ordinary (column-major, etc.) matrices in their C++ code. The problem was that realization to a dense matrix would sacrifice some of the efficiencies of a sparse matrix format. Also, getting this to work would involve a lot more refactoring as the looping would be shared across both C++ and R - one loop over blocks, and another loop within each block to do whatever you wanted to do.
> 
> However, the equation has changed now that we have proper sparse support in DelayedArray. This means that we can get both dense and sparse submatrices during the block processing, thus avoiding the previous loss of efficiency from coercion to dense. With R-based blockwise looping, we get easy access to BiocParallel-based parallelization; we avoid the funky calls back into R from C++ that I have to do in order to handle matrices that are not of a known type; and we get a more user-tunable method of handling large matrix outputs via RealizationSinks. There should no longer be an efficiency drop from using block processing (for small numbers of blocks, the R-based looping overhead is negligible) and so this should be the preferred mechanism for handling large matrices when rows/columns can be processed independently of each other.
> 
> So, how can beachmat be geared to serve this new paradigm? Well, inside the block processing loop, we may wish to call a C++ function that takes the block and operates on it. This block may be dense or sparse, but the C++ function may not care if it just wants to get, e.g., the individual row vectors out for further processing. In this context, beachmat can continue to provide agnostic data access from the realized blocks so that developers don't have to worry about handling different formats in their C++ code. Alternatively, if the sparse format lends itself to a more efficient algorithm, beachmat also provides dedicated sparse classes to only operate on the non-zero values.
> 
> As an aside, I also noticed the major pain point for beachmat-dependent C++ code lies in supporting both integer and numeric matrices, e.g., for counts. This is now natively supported so there is no longer any need to do crazy templating tricks (which should, in turn, cut down on compilation times and the size of the compiled libraries).
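To make the R-side looping concrete, here is a minimal sketch (not Aaron's actual code) of block processing with DelayedArray::blockApply(); the dimensions are arbitrary, and as.sparse is mentioned only as the knob for getting sparse blocks:
> library(DelayedArray)
> x <- DelayedArray(matrix(rpois(1e4, lambda = 2), nrow = 500))
> ## each block arrives as an ordinary matrix; pass as.sparse=TRUE to blockApply()
> ## to receive sparse blocks instead when the seed is sparse
> block_totals <- blockApply(x, function(block) sum(block))
> total <- sum(unlist(block_totals))
The per-block function is where a C++ kernel (e.g. via beachmat) would slot in.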
Constantin Ahlmann-Eltze (04:29:39) (in thread): > Wow, that sounds great, though the double/integer matrix stuff was a great opportunity to learn more about template programming :smile: Will this update break the C++ API of beachmat so that the depending packages will need to update their code?
Ilir Sheraj (06:23:57): > @Ilir Sheraj has joined the channel
Dirk Eddelbuettel (07:41:57) (in thread): > So to clarify, you now switch between int and double at runtime based on TYPEOF()?
Constantin Ahlmann-Eltze (07:43:11) (in thread): > Not quite, I use beachmat::find_sexp_type() https://github.com/const-ae/glmGamPoi/blob/master/src/beta_estimation.cpp#L340
Dirk Eddelbuettel (07:46:05) (in thread): > I was referring to Aaron’s new template free code.
Constantin Ahlmann-Eltze (07:46:31) (in thread): > Ah, okay:smile:Sorry:slightly_smiling_face:
Dirk Eddelbuettel (07:46:36) (in thread): > I understand how to set up a simple template, we do that all the time but yes it inflates compile time and object sizes
Dirk Eddelbuettel (07:47:06) (in thread): > No worries:laughing:
Nicholas Knoblauch (10:53:14) (in thread): > Ok I think I get the broad strokes. I think this is a dumb question, but if I have some SEXP x, the .column() and .row() route in Rcpp (after I do something like auto mat_x = Rcpp::as<Rcpp::NumericMatrix>(x)) is only faster than the R equivalent [ functions because they don't have to dispatch on type and maybe the compiler can do some inlining. In the scenario where the author of the S4 class hasn't provided a C/C++ API for row and column access, you aren't in general going to be able to do much better than calling out to [ anyways, right?
Dirk Eddelbuettel (10:55:52) (in thread): > Roughly, yes. And sometimes we're faster with these re-implementations because we are also sloppier and do fewer checks and assertions and NA probes and … than R. It all depends, as always.
Dirk Eddelbuettel (10:56:42) (in thread): > The logic of R-level dispatching is costly, so generally speaking function calls are expensive in R (also a lot of state to be kept). So avoiding those helps.
Nicholas Knoblauch (11:15:14) (in thread): > for sure, I guess my point is that there's kind of no free lunch here. You can write a templated function for some compute kernel you care about without knowing the primitive type of your object, whether it is row or column major, whether it is sparse, but you need that info at compile time so C++ can do its magic. This requires writing a C++ interface though. Your type has to be able to tell a calling function about itself, either through type traits (with c++20 there's concepts, which you can emulate more or less effectively in c++11-17). You can rely on R to give you all of that information, but you still need to go through the C++ interface eventually. Alternatively, if you don't want to write the C++ interface, you can use just S3/S4 dispatch to figure out everything for you, but then the type of everything in your compute kernel is SEXP, and you're more or less just writing R in C++, except you don't get R's byte compiler
Dirk Eddelbuettel (11:17:09) (in thread): > Sure. To me the main difference is much simpler. Templates can work on types at compile time yet R only has one type (SEXP) and everything has to be at run-time :cry:
Nicholas Knoblauch (11:25:24) (in thread): > Exactly. C++17 has std::variant which can be used to great effect if you have an enumerable list of things that the SEXP can be (which you often do because of TYPEOF). That kind of all goes out the window in S4 world
Aaron Lun (12:14:15) (in thread): > To clarify, the user-facing code will now be template-free w.r.t. integer/double handling… but the internal API code is, of course, template city.
> 
> My solution is to use inheritance from a lin_matrix class that handles integers, doubles and logicals together. Each subclass defines a matrix representation and a type (e.g., ordinary_integer_matrix, double_SparseArraySeed) and offers methods for int* and double* extraction. So the switch is done at runtime, by (i) the constructor function (e.g., read_block()) as it decides which actual subclass to materialize, and (ii) by each call as it goes and looks up the vtable of methods to find the correct subclass method.
> 
> I needed virtual methods anyway to support polymorphism across the different S4 classes, so this was a natural extension of that strategy. I'll also note that I use template specialization so that, e.g., requesting an int* column from an integer matrix will skip a copy operation and return a pointer directly, which is a slight improvement over the "copy everything" strategy that beachmat currently implements.
Aaron Lun (12:16:45) (in thread): > Also, I'll mention that beachmat actually does have a mechanism for supporting third-party native extraction methods for rows/columns in C++… and no one used it. Well, it seemed like a good idea at the time.
Hervé Pagès (12:17:02) (in thread): > Another thing that has changed the equation since 2017 is that we now have the sparseMatrixStats package by @Constantin Ahlmann-Eltze. This should reduce the need for developers to write their own C++ code to process the blocks returned by extract_sparse_array(). Constantin missed the opportunity to advertise his own package here so I'm doing it for him.
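For reference, a minimal sketch of using sparseMatrixStats on a dgCMatrix (assuming the package is installed; the matrix here is random):
> library(Matrix)
> library(sparseMatrixStats)
> m <- rsparsematrix(1000, 50, density = 0.05)   # a dgCMatrix
> colVars(m)        # sparse-aware method, no densification
> rowMedians(m)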
Hervé Pagès (12:17:53) (in thread): > Hope it’s ok with you Constantin:wink:
Aaron Lun (12:20:28) (in thread): > yes, I forgot about that
Aaron Lun (12:22:53) (in thread): > I’m not using that package yet - most of my C++ code is a bit too complex for its methods - but yes, it should cut down the need for C++ in the more mundane applications.
Hervé Pagès (12:26:49) (in thread): > Nobody uses it yet. It was released in May (BioC 3.11) so is still young but I have a feeling not a lot of people are aware of it.
Nicholas Knoblauch (12:28:08) (in thread): > Ok I think I get it now. The user can treat the matrix like it’s a column-major matrix of doubles, and when they ask for, say, column 4, if it is a column major matrix of doubles, great, pass that pointer to the user, if not, you 1) allocate some memory 2) pull out the elements in that column 3) cast to double and pass the pointer to the user
Aaron Lun (12:28:54): > Rediverting from thread:
> 
> > Nobody uses it yet. It was released in May (BioC 3.11) so is still young but I have a feeling not a lot of people are aware of it.
> 
> I remember talking to @Constantin Ahlmann-Eltze about it during BioC2020; there should be a single package to pull in both SparseMatrixStats and DelayedMatrixStats so that developers only have to depend on a single thing. Maybe something like BiocMatrixStats, which also pulls in MatrixGenerics.
Aaron Lun (12:29:08) (in thread): > yup, that’s it.
Nicholas Knoblauch (12:31:23) (in thread): > do you give me a const pointer?
Nicholas Knoblauch (12:31:38) (in thread): > or does it go the other way too?
Hervé Pagès (12:38:46) (in thread): > Sounds like it would be sensible to return a const pointer only when it's pointing to a column of the original column-major matrix. Otherwise, writing to the copy of the column should be allowed. Though a pointer that is sometimes const and sometimes not wouldn't make for an easy API.
Nicholas Knoblauch (12:40:39) (in thread): > Right but now you’re back where you started
Nicholas Knoblauch (12:41:31) (in thread): > you can’t overload a C++ function on return type only
Dirk Eddelbuettel (12:41:47) (in thread): > As Camus said: “we must imagine Sisyphus as a happy man”
Hervé Pagès (12:45:50) (in thread): > What I’m trying to say is that maybe it should always make a copy of the column to avoid that kind of situation. Doesn’t matter if that means making an unneeded copy in the case of a column-major matrix when you’re only going to read from it.
Dirk Eddelbuettel (12:45:55) (in thread): > He clearly was a C++ programmer:wink:
Hervé Pagès (12:47:09) (in thread): > Who? Camus or Sisyphus?
Dirk Eddelbuettel (12:47:51) (in thread): > That’s a take-home essay question!
Hervé Pagès (12:49:59) (in thread): > I would say Sisyphus. He’s working too hard to solve easy problems. But that’s just me:wink:
Nicholas Knoblauch (12:50:54) (in thread): > If you make a copy every time, aren’t I better off just writing an R function at that point? at least R 1) has garbage collection so it doesn’t need to go out to the OS every time it needs to allocate memory 2) has copy on modify
Nicholas Knoblauch (12:55:02) (in thread): > I feel like we're approaching a recursive extension of Greenspun's tenth rule
> 
> > Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.
> 
> And within that Common Lisp is a C extension mechanism, and within that C extension mechanism is another slow implementation of half of Common Lisp
Hervé Pagès (12:56:31) (in thread): > You could. You should at least start with this. And only switch to C/C++ if that's not fast enough. Note that the same allocated buffer can be used to copy the column during the looping on the columns of the matrix so cost is small and looping in C/C++ is going to be much faster anyway.
Hervé Pagès (12:58:31) (in thread): > But in some situations, not much faster. And in some other situations, even slightly slower:https://github.com/hansenlab/bsseq/issues/86#issuecomment-535352239
Aaron Lun (13:01:33) (in thread): > Yes, the pointers are const. The getters also expect a workspace pointer (a la LAPACK) that is populated by copy if a direct pointer cannot be returned. In such cases, the pointer is just that to the workspace; you can check whether a copy was made by comparing the return value pointer with the workspace pointer.
Nicholas Knoblauch (13:14:29) (in thread): > Ok cool. I guess that also helps with thread safety? (I’m guessing the class with the getter doesn’t own the allocated memory if the workspace pointer is being passed back by the caller?)
Nicholas Knoblauch (13:15:20) (in thread): > and the caller is in charge of cleaning up the workspace pointer?
Aaron Lun (13:26:40) (in thread): > that’s right.
Aaron Lun (13:28:43) (in thread): > The classes themselves aren’t guaranteed to be thread-safe, due to some internal variables that need to change to optimize sparse matrix row accesses; if multi-threading is needed, each thread generally makes a copy of the class pointer (which is light, as it doesn’t actually hold any data) and works on that.
Nicholas Knoblauch (13:36:27) (in thread): > cool!
2020-09-15
Constantin Ahlmann-Eltze (02:51:21): > Yes, good that you bring that up again. I wanted to ask @Peter Hickey what the current situation with the transition of DelayedMatrixStats to the MatrixGenerics package is? If that is done, I could sit down and create a BiocMatrixStats package (I like that name :slightly_smiling_face:)
Aaron Lun (02:54:23): > Would it even be necessary to transition DMS to MatGen? Seems perfectly okay to keep them separate.
Hervé Pagès (03:00:21): > Agreed. But for the use case we were discussing earlier today (or yesterday for those east of Aaron and me), which was the use of sparseMatrixStats in the context of block processing (that is, on the blocks returned by read_block()), only MatrixGenerics and sparseMatrixStats need to be imported. The refactoring of DelayedMatrixStats to make use of sparseMatrixStats internally is a separate story.
Hervé Pagès (03:03:37): > Or, to put this otherwise: whether DelayedMatrixStats imports sparseMatrixStats or not is DelayedMatrixStats internal business and not something that should matter to developers who want to use DelayedMatrixStats in their package.
Aaron Lun (03:39:20): > sure, I don’t think we were saying otherwise
Hervé Pagès (03:43:52): > ok, maybe, not sure what “a single package to pull in both SparseMatrixStats and DelayedMatrixStats” would be then
Aaron Lun (03:44:48): > it would just be a wrapper package that allows, say, the scran developers (i.e., me) to pull in both packages without having to list them all out in the Imports.
Aaron Lun (03:45:13): > So then I could get sparse colVars and Delayed colVars, etc. without even thinking about the need for the constituent packages.
Hervé Pagès (03:46:12): > so you suggest to create wrapper packages each time you need to import 2 packages?
Aaron Lun (03:48:26): > well, there’s three here (including MatrixGenerics) and downstream DelayedArray-compatible packages would always need to be importing all three packages. The nature of S4 means that I wouldn’t even know that I was missing an import from, say, SMS until someone tried to run a sparse matrix into my functions and it didn’t work.
Aaron Lun (03:50:11): > To be more precise, downstream “DelayedArray + sparse matrix compatible packages”, which is pretty much the entire single-cell stack.
Hervé Pagès (03:59:41): > I don’t know what package would be the wrapper package to rule them all. Not DelayedMatrixStats: it just implements DelayedMatrix methods for the generics in MatrixGenerics (these generics are the matrixStats verbs). Not sparseMatrixStats: same thing as DelayedMatrixStats except that here the methods are defined for dgCMatrix objects. Not DelayedArray. So a new package? e.g. matrixverse? Sounds weird to create a wrapper package for that.
Hervé Pagès (04:04:28): > Oh, and I was forgetting Matrix. So 6 packages for now: matrixStats, MatrixGenerics, Matrix, sparseMatrixStats, DelayedArray, DelayedMatrixStats.
Aaron Lun (04:05:16): > I thought BiocMatrixStats was a nice name. Fits in well with BiocNeighbors and BiocSingular, for example.
> 
> From a downstream developer standpoint, it's helpful to be able to point to a single package to import from, rather than 1 + 2. Where forgetting the 2 optional ones may not be easily picked up until a user runs into them and it all fails with "no colVars method for DelayedArray".
Hervé Pagès (04:15:21): > We'll also get this error if the new matrix type is not covered by the umbrella package. So someone has to keep track of new matrix-like things showing up in the ecosystem. I agree it's easier if we do it in a single/central place though. But for now… Also, how often are new matrix-like things going to show up?
> 
> Anyway all this is because we have workflows where the central object is a SummarizedExperiment object (or derivative) and the assays in an SE object can be any matrix-like or array-like object. We're giving too much freedom! Maybe one alternative could be to make the SummarizedExperiment package the umbrella package?
Hervé Pagès (04:21:19): > I don't know about the name. IIUC it's not just about being able to apply a matrixStats verb to an arbitrary matrix-like object. I guess there are other matrix operations like t(), [, dim(), log(), etc. that one would want to be able to apply in an agnostic way. So I'm not a big fan of having the "stats" thing in the name of the wrapper package.
Peter Hickey (06:43:19) (in thread): > It doesn’t yet use it. hope to do it before the next release
Aaron Lun (11:14:26): > well, BiocMatrix is also perfectly fine. And, in fact, was the name I originally suggested for beachmat… 3 years ago.
Aaron Lun (11:16:15): > I guess one could even imagine beachmat as the wrapper package, especially under this new paradigm. I would have to think through some of the implications, namely whether all client packages currently depending on beachmat would already have to import the cohort of 6 packages anyway.
Aaron Lun (11:17:34): > Hm. That’s not too bad an idea, actually. Will think about it on the train.
Aaron Lun (17:13:45): > Using beachmat as the umbrella really depends on whether DMS would ever want to depend on beachmat code. If it does, then beachmat can’t be the umbrella.
Vince Carey (19:03:27): > The umbrella package concept is fine, but I wonder if it is at least conceptually avoidable? It seems to me the best practice at present is to declare symbol imports in roxygen with importFrom … I for one am constantly failing to make the DESCRIPTION entries consistent with the autogenerated NAMESPACE. Is there any reason the DESCRIPTION Imports clauses cannot be autogenerated on the basis of the import directives? Would the need for an umbrella package then go away? The question of autogeneration has been posed https://stackoverflow.com/questions/37263848/is-there-a-way-to-automatically-generate-imports-section-in-the-description-fi but I am not sure the answers address the question directly. - Attachment (Stack Overflow): Is there a way to automatically generate Imports
section in the DESCRIPTION file? > When developing an R package, it is common for me to just turn on Packrat and use a localized repository, and develop as I explore in a session. But when publishing the package, it is a big headach…
Peter Hickey (19:04:36) (in thread): > It doesn’t currently, but may one day?
Aaron Lun (20:26:44): > The biggest problem here is not the import spec per se, it's the fact that missing one of the method imports will not trigger an error until a user attempts to run something with it. So for example, if you imported colVars from MatrixGenerics, everything might work perfectly well in your testing, and in BBS building, and so on… until someone gives your function a sparse matrix and you realize that you forgot to import the corresponding method from sparseMatrixStats. On the whole, I would say that is a really easy mistake to make.
Aaron Lun (20:28:05): > The ideal solution would be for every package that defines a matrix class to also define its matrixStats methods, which would guarantee that you could never have an instance of a class without also having its methods available. That probably isn't going to happen, for understandable reasons - there's a lot of matrixStats methods - so creating an umbrella that wraps the most commonly used methods is a pretty sensible approach to me.
2020-09-16
Hervé Pagès (12:31:36): > Note that if the colVars() generic (defined in MatrixGenerics) knew where to find the method for dgCMatrix or DelayedMatrix objects then it could load the corresponding package before dispatch. So it feels that MatrixGenerics could be the umbrella package but that would be via some kind of improved dispatch mechanism, not just via Imports.
Hervé Pagès (12:40:54): > Could be achieved via an extra argument to setGeneric() where one would specify a list of packages where to search for methods. I'll give setGeneric2() a shot and see how it goes.
Nicholas Knoblauch (12:53:31) (in thread): > I’m a S4 noob, but would that have the effect of “closing” the method to extension, or is the idea that it would be an “extra” place to look?
Hervé Pagès (12:54:08) (in thread): > An extra place to look.
Michael Lawrence (12:54:10) (in thread): > Lazy loading of methods is an interesting idea. Ideally method registration would populate a registry where the generic would look for packages to load.
Michael Lawrence (12:55:21) (in thread): > Then there would be no need to specify packages at generic registration, which is pretty strong coupling in the wrong direction.
Michael Lawrence (12:56:44) (in thread): > I guess the loading would need to be explicit. Like the calling code would call loadAllMethods(generic) because in general loading all method-defining namespaces may not be a desirable side effect.
Michael Lawrence (12:59:29) (in thread): > Will have to think about it. Could just get messy.
Nicholas Knoblauch (13:02:41) (in thread): > It seems like this could be solved by some kind of tool that has all of the S4 generics, and is able to work out (hypothetically) missing imports
Nicholas Knoblauch (13:08:12) (in thread): > like some kind of BiocConcepts check a package developer could run that would try out your functions (that you maybe flag somehow) with different matrix classes etc.
Michael Lawrence (13:10:55) (in thread): > Is all that is missing an importMethodsFrom() call?
Nicholas Knoblauch (13:13:32) (in thread): > yes
Michael Lawrence (13:14:20) (in thread): > So we’re OK with a hard dependency on the package defining the method?
Hervé Pagès (13:14:22): > So right now the colVars() generic is defined as:
> 
> setGeneric("colVars", signature="x",
>     function(x, rows=NULL, cols=NULL, na.rm=FALSE, center=NULL, ...)
>         standardGeneric("colVars")
> )
> 
> Could be defined as (1 line added):
> 
> setGeneric("colVars", signature="x",
>     function(x, rows=NULL, cols=NULL, na.rm=FALSE, center=NULL, ...)
>     {
>         import_packages_with_methods("colVars")
>         standardGeneric("colVars")
>     }
> )
> 
> where:
> 
> .PACKAGES_WITH_METHODS <- c(
>     "matrixStats",
>     "sparseMatrixStats",
>     "DelayedMatrixStats"
>     # add more here in the future
> )
> 
> import_packages_with_methods <- function(GENERIC) {
>     for (pkg in .PACKAGES_WITH_METHODS) {
>         if (!requireNamespace(pkg, quietly=TRUE))
>             stop(wmsg("Couldn't load the ", pkg, " package. Please install ",
>                       "the ", pkg, " package to get access to all ", GENERIC,
>                       " methods."))
>     }
> }
> 
> All packages in .PACKAGES_WITH_METHODS would need to be in MatrixGenerics' Suggests field. Then all of them would get imported the first time the colVars() generic is called. Not ideal, would be better to import only if actually needed. But still better than an umbrella package that uses brute Imports. Also the MatrixGenerics package feels like the right place to encode knowledge of where to find the methods for the generics defined in the package. Something more refined like what Michael is proposing would be nice.
Nicholas Knoblauch (13:16:01) (in thread): > I would say no, it’s more of a note
Nicholas Knoblauch (13:17:07) (in thread): > like whether the developer wants to depend on a package or not is up to them
Aaron Lun (13:18:20): > Perhaps the most expedient solution right now would be for MatrixGenerics to just import the two extra MatStats packages. I can’t see a situation where I would want MG and not the associated methods.
Hervé Pagès (13:19:53): > Of course we would do this if we could, but we can't because DMS and sparseMatrixStats need to import MatrixGenerics. The gymnastics I suggest above is exactly to work around that.
Hervé Pagès (13:20:40): > The import is delayed until the first generic is called.
Nicholas Knoblauch (13:20:42): > Why would you throw an error for not having DelayedMatrixStats before you've seen if I'm passing a DelayedArray?
Aaron Lun (13:24:10): > Hm.
Hervé Pagès (13:24:13): > @Nicholas Knoblauch Because that's the expedient way. If we could import DMS and sparseMatrixStats, we would just do that (this is the original idea of the umbrella package). With an umbrella package based on imports DMS would automatically get installed when the umbrella package gets installed. Here installation is postponed until the first time the colVars() generic is called.
Hervé Pagès (13:25:45): > Again, not fine-grained or subtle, but still better than an umbrella package based on Imports. Also it can be done now.
Aaron Lun (13:25:59): > Hm. I guess having an error like that would necessitate an explicit installation by Imports from downstream packages anyway, if I wanted my users to be sure that they wouldn't encounter errors.
Nicholas Knoblauch (13:27:22): > also able to be done now is 1) importing the generics if you want or 2) letting the user import them, right?
Hervé Pagès (13:28:26): > @Aaron Lun They get a useful error message. The umbrella package will probably sit low in the stack so at least we're not forcing everybody to install upfront a bunch of packages that they will never need.
Nicholas Knoblauch (13:28:43): > Like I would rather have a package that lets me list_generics("colVars") after I get an error about colVars
Aaron Lun (13:29:41): > I suspect that the most expedient solution for me will be to stick all imports into beachmat for the time being, given that DMS isn’t using beachmat and should deflect to SMS for its sparse needs anyway.
Aaron Lun (13:31:29): > actually, scuttle could be that umbrella package as well.
Aaron Lun (13:42:08): > Yes. I think a set of imports via scuttle should propagate to the entire sc-stack that I control. Other packages… oh well.
Hervé Pagès (14:17:22): > Another way to tackle this is to think of it as a 2-way Imports problem. If we want the MatrixGenerics package to be the umbrella, which IMO is conceptually the right package to play that role, we need to be able to specify "suggested" packages that actually get installed after MatrixGenerics, and also loaded after MatrixGenerics. So we avoid the circular dep problem. Can't think of a good name for this field, something like LoadAfterMe.
Hervé Pagès (14:20:32): > It would actually be enough that install.packages() installs these additional packages. Then a few lines in the .onLoad() hook could take care of importing their namespaces.
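A rough sketch of that kind of hook, assuming the hypothetical LoadAfterMe-style field took care of installation; note that requireNamespace() loads (rather than formally imports) the namespaces:
> ## hypothetical .onLoad() for the would-be umbrella package
> .onLoad <- function(libname, pkgname) {
>     ## load the method-providing namespaces so their S4 methods get registered
>     for (pkg in c("sparseMatrixStats", "DelayedMatrixStats"))
>         requireNamespace(pkg, quietly = TRUE)
> }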
Nicholas Knoblauch (14:37:45): > What you're describing is pretty similar to Enhances, no? https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Package-Dependencies - Attachment (cran.r-project.org): Writing R Extensions > Writing R Extensions
Hervé Pagès (14:44:27): > Yes, kind of. Except that by default install.packages() ignores the packages listed in Enhances.
Dirk Eddelbuettel (14:58:02) (in thread): > IIUIC because 'the arrow' is backwards. An Enhances: B means that when you install package B it should also consider the package A that lists the Enhances for B.
Aaron Lun (15:17:51): > Either way, I think we can agree the aim is to - somehow - ensure that packages installing MatGen also install SMS and DMS. Out of all strategies presented so far, the approach that involves the least innovation would be to make an umbrella, or repurpose an existing package to be such an umbrella. I will volunteer beachmat to be this umbrella; a brief survey suggests that almost all its hard downstream dependencies also depend on DMS (directly or indirectly) and will want to depend on SMS.
Hervé Pagès (15:57:14) (in thread): > Yep that's my understanding too. So the way I see it there's not much difference between having Enhances: B in package A vs having Suggests: A in package B, at least in semantics. Only difference is ownership of the field, i.e. Enhances is useful if, as the author of package A, you failed to convince the authors of package B to have you in their Suggests field.
Hervé Pagès (15:58:28) (in thread): > In both cases, this relationship is ignored by install.packages() by default.
Dirk Eddelbuettel (15:59:02) (in thread): > For pragmatic reasons. Other values for dependencies=... can be set.
Hervé Pagès (16:00:15) (in thread): > Right, and we could change the default in BiocManager::install(), but for many reasons we don't want to do that.
Michael Lawrence (16:01:05): > I agree that the umbrella is the way to go.
Dirk Eddelbuettel (16:03:54) (in thread): > Actually, I was just looking at that.
Dirk Eddelbuettel (16:04:01) (in thread): > Do you have a sec?
Hervé Pagès (16:04:39): > I disagree. Using beachmat as the umbrella "works", is "easy", and can be done now. Even though that's a lot of advantages, that still doesn't make it "the way to go".
Hervé Pagès (16:05:16) (in thread): > for a latte? sure
Aaron Lun (16:05:42): > I don’t see much alternative without a change to how dependency installation is performed.
Hervé Pagès (16:06:34): > So that only makes it “the way to go” until we can do something better.
Hervé Pagès (16:07:15): > Let’s just call it a convenient temporary hack.
Aaron Lun (16:15:16): > Sure, one could say that about every piece of software ever.
Aaron Lun (16:15:23): > Except maybe LAPACK, that thing’s still going strong.
Hervé Pagès (16:24:14): > or except maybe for the Linux kernel. Or maybe it’s another convenient temporary hack to get super computers going.
Hervé Pagès (20:21:36): > Small improvement upon earlier proposal: https://github.com/Bioconductor/MatrixGenerics/pull/13 @Nicholas Knoblauch This somehow addresses your concern about throwing an error for not having DelayedMatrixStats before seeing if you're passing a DelayedArray object.
Aaron Lun (20:23:30): > Insofar as we're forward-thinking, it would be a nice touch for a base BioC package (maybe BiocManager) to provide some kind of "install if needed" replacement for ::. Then stuff in Suggests could just be smoothly auto-installed instead of crashing out.
Hervé Pagès (21:23:03): > Not sure what you have in mind but if that means automatically triggering an install when :: fails, I think we want to stay away from this kind of business. There are still a number of packages that do this but we eventually want to get rid of them. We've even been considering adding some checks to the build system to make sure that packages contain no code that triggers installation of other packages (during R CMD build or R CMD check).
Aaron Lun (21:29:30): > That would have been a solution for the current issue, as MatGen could pull down packages with methods that it needs while reducing the risk of error. Though I appreciate that the potential for an unstable installation is not good.
Hervé Pagès (21:37:55): > Exactly, if we allowed this, my fallback mechanism would just do that instead of telling the user to install with BiocManager::install(). But I'm purposely not doing it. Nobody would like to see apps secretly install new apps on their smartphone.
Aaron Lun (21:38:29): > hey! That's the entire basilisk strategy!
Hervé Pagès (21:38:45): > I know:sweat_smile:
Aaron Lun (21:38:46): > haven’t you noticed the keyloggers getting installed?
Hervé Pagès (21:41:15): > As you remember I've taken the time to closely look at the basilisk auto-install feature when you submitted it. I was a little bit nervous to see this deployed on the build machines :sweat_smile: But I think in this case it's probably ok.
Aaron Lun (21:44:24): > I think the biggest difference is that the conda installation is completely isolated; it won’t break other conda installations, and it won’t break the R installation. If things don’t work out, you can just delete the cache directory and you’re back where you started.
Hervé Pagès (21:44:44): > Isolation. Exactly
2020-09-17
Aaron Lun (03:55:25): > umbrella branch has been created.
Aaron Lun (12:39:22): > hold on. If beachmat is the umbrella, SMS and MatGen get a free ride into the top 50!
Aaron Lun (12:39:24): > unacceptable.
2020-09-18
Kasper D. Hansen (07:37:34): > I am sure this is part of the discussion above that I just skimmed, that technically you can export methods which are imported
Kasper D. Hansen (07:38:26): > Like the NAMESPACE of package B can have
> 
> importFrom(A, thisMethod)
> export(thisMethod)
Aaron Lun (11:08:37): > yes, this is how beachmat’s umbrella works.
Aaron Lun (11:09:14): > it'll re-export all MatrixGenerics symbols. Basically it can be treated as MatGen with guaranteed imports of DMS + SMS.
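A hypothetical NAMESPACE fragment for such an umbrella, along the lines described above (the re-exported generic names are just examples):
> import(MatrixGenerics)
> import(DelayedMatrixStats)    # registers the DelayedMatrix methods
> import(sparseMatrixStats)     # registers the dgCMatrix methods
> export(colVars, colMedians, rowVars, rowMedians)   # re-export the imported generics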
Aaron Lun (11:19:06): > OMG, I didn’t even realize that the beachmat goes with an umbrella.
Sean Davis (11:20:40): > I guess beachmats go with beaches in Australia?
Federico Marini (11:51:59): > > OMG, I didn’t even realize that the beachmat goes with an umbrella. > I thought it was on purpose
Federico Marini (11:52:02): > but this is even better
Hervé Pagès (13:48:15): > I've even seen people sometimes use a beachmat as an umbrella
2020-09-19
Aaron Lun (03:44:44): > 3 kLoC’s later, and beachmat v3’s C++ API is done.
Aaron Lun (03:44:56): > Just need to write some wrappers around DA's blockApply and it's ready to go.
Aaron Lun (03:45:50): > It’s a good sign when the test code has more lines than the actual API.
Aaron Lun (14:36:01): > Almost done. Just need to add some tests for the R wrappers.
Aaron Lun (22:18:47): > Finishing vignette now.
Aaron Lun (22:22:46): > YOU ARE
2020-09-20
Aaron Lun (01:52:55): > my god, I finally figured out how to stop the text wrapping in the code in BiocStyle-formatted Rmd’s.
Aaron Lun (02:00:36): > And it is done. Beachmat v3 with dedicated DA processing is now available for testing.
2020-09-21
Aaron Lun (03:03:12): > https://ltla.github.io/beachmat/index.html
Chris Cheshire (03:38:11): > @Chris Cheshire has joined the channel
Aaron Lun (17:41:37): > https://bioconductor.org/packages/devel/bioc/vignettes/beachmat/inst/doc/linking.html
2020-09-22
Hervé Pagès (15:42:37): > @Aaron Lun Where does https://github.com/Bioconductor/HDF5Array/pull/19 fit with respect to this refactoring? Is it still relevant?
Aaron Lun (15:42:51): > No, it’s not, I was actually going to close it.
Aaron Lun (15:42:55): > But I forgot.
Aaron Lun (15:43:06): > but let me add some closing comments and I’ll do it now.
Hervé Pagès (15:44:12): > ok
Aaron Lun (15:52:06): > t’is done.
Aaron Lun (15:57:03): > what do you want to do about https://github.com/Bioconductor/DelayedArray/pull/67? Go, no-go, more unit tests?
Hervé Pagès (17:23:55): > Should be good to go. Not directly related to the PR but while I was testing the changes I ran into this:
> 
> > matrix(1:8, ncol=4) %*% RleArray(Rle(0, 12), dim=3:4)
> <2 x 1> matrix of class DelayedMatrix and type "list":
>         [,1]
> message non-conformable argu..
> call    MULT(x, block)
> 
> Seems to have been around for a while. FWIW with old (now dead) code this was giving:
> 
> > DelayedArray:::.BLOCK_mult_by_left_matrix(matrix(1:8, ncol=4), RleArray(Rle(0, 12), dim=3:4))
> Error in DelayedArray:::.BLOCK_mult_by_left_matrix(matrix(1:8, ncol = 4), :
>   ncol(x) == nrow(y) is not TRUE
> 
> So yes, more unit tests wouldn't hurt.
Aaron Lun (17:25:02): > yes, I too have noticed that every now and then. (It's probably why SingleR is failing on windows in release.)
Hervé Pagès (17:27:59): > any plan to fix?
Aaron Lun (17:28:36): > yes, if I can repro it locally.
Aaron Lun (17:29:22): > I never noticed it in person, only second-hand after people told me it happened. Probably because it was happening in situations where one thread was failing on their machines (e.g., ran out of memory or something).
Aaron Lun (17:29:46): > There's probably a whole suite of related problems where the error message is somehow caught in the cbind.
Hervé Pagès (17:32:38): > Should be easy to avoid by checking compatibility of the matrix dims early i.e. before starting the multi process machinery.
Aaron Lun (17:33:21): > I was thinking of also checking the return values from the bpiterate, as you never know when one child just fails for no discernable reason.
Hervé Pagès (17:34:03): > I don’t know, that sounds like a different issue. The issue with the incompatible dims is really a trivial one and has nothing to do with bpiterate
Aaron Lun (17:35:49): > Well, I was talking about the more general problem of errors being captured as character strings and cbinded as if they were valid matrix outputs. If bpiterate (or whatever we were using) actually threw upon encountering an error in its jobs, rather than returning a try-error object, then this current weirdness would not exist.
Aaron Lun (17:36:47): > Sure we can add an incompatible-dims check, but we'll want to fix the other issue anyway, because the errors that I see (and lead to strange character matrix output) lie outside the realm of incompatible dims.
Hervé Pagès (17:39:20): > Maybe but you can think of a solution to that general problem later. All I’m saying is that you don’t need to have a satisfying solution for this in order to address the current mishandling of matrices with incompatible dims.
Hervé Pagès (17:40:25): > Back to beachmat's refactoring. I guess this one can be closed too: https://github.com/Bioconductor/HDF5Array/issues/15?
Aaron Lun (17:40:39): > yes
Hervé Pagès (17:44:45): > Excellent. 2 less items on my plate. A productive day so far:blush:
Hervé Pagès (17:46:34): > I mean 3: https://github.com/Bioconductor/DelayedArray/pull/67
Aaron Lun (17:46:55): > oh, I was going to add some unit tests tonight.
Hervé Pagès (17:47:57): > oh well, send me another PR for the unit tests or just commit them directly
Aaron Lun (19:44:26): > I’ve solved the immediate problem and cleaned up the code at the same time, I’ll put a PR in later in the evening.
2020-09-23
Giuseppe D’Agostino (22:14:36): > @Giuseppe D’Agostino has joined the channel
2020-09-24
Aaron Lun (02:48:43): > scuttle has been flipped to use beachmat v3. All in all, it was pretty easy.
2020-09-25
Nicholas Knoblauch (16:13:51): > What's the simplest way to "un-delay" a DelayedArray-based object? Say I'd like to use HDF5Array to serialize my RangedSummarizedExperiment, but I want to be able to read it back as it was before I saved it. Naively I would think I could do SummarizedExperiment(loadHDF5SummarizedExperiment("my_data_dir")), or maybe as(loadHDF5SummarizedExperiment("my_data_dir"), "RangedSummarizedExperiment"), but those don't work.
Aaron Lun (16:22:14): > I think you could just do DelayedArray(seed(my_modified_hdf5array))
Aaron Lun (16:22:32): > oh wait, I didn’t figure out what you meant.
Aaron Lun (16:22:58): > okay, I’m not sure what you wanted anymore.
Nicholas Knoblauch (16:24:49): > yeah I think this is more a generic S4 question than a DelayedArray question, but if I have an S4 object that has DelayedArray members, is there a way to "visit" the DelayedArray members and get back their in-memory counterparts
Aaron Lun (16:25:49): > Do you just want an in-memory version of your DelayedArray object? Because then it's just as.array, assuming you have enough memory to represent it.
Nicholas Knoblauch (16:26:52): > I want an in-memory RangedSummarizedExperiment from a RangedSummarizedExperiment that has DelayedArrays in it
Aaron Lun (16:27:52): > Seems like you want to visit all assays and force them into memory.
Aaron Lun (16:30:20): > Would:
> 
> for (i in seq_along(assayNames(se))) {
>     current <- assay(se, i)
>     if (is(current, "DelayedMatrix")) {
>         assay(se, i) <- as.matrix(current)
>     }
> }
> 
> suffice?
Nicholas Knoblauch (16:43:14): > yeah for sure. I guess there’s no clean “generic” way to visit the components of an S4 object
Aaron Lun (16:44:25): > If you're talking in general, you could iterate over the slotNames but I can't imagine a sensible operation that would work for all slots.
2020-09-26
Hervé Pagès (01:29:31): > We have the verb realize for this. I made it a generic so maybe it would just be a matter of implementing a realize() method for SummarizedExperiment objects.
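A minimal sketch of what such a method might look like (not an actual method in DelayedArray or SummarizedExperiment; it assumes the realize() generic has signature realize(x, ...)):
> ## hypothetical realize() method for SummarizedExperiment
> setMethod("realize", "SummarizedExperiment", function(x, ...) {
>     ## realize each assay; e.g. passing BACKEND=NULL brings them back into memory
>     assays(x) <- lapply(assays(x), realize, ...)
>     x
> })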
Aaron Lun (02:04:26): > I have come to the realization that DMS is implicitly the umbrella package that I was looking for, given that I need to do, e.g., colVars(DelayedArray(x)) to guarantee that my code works for all matrix types anyway. The only thing that is missing is an import of SMS by DMS, which would be a good idea regardless of whether we want to use the umbrella concept or not - DMS already explicitly looks for a method for the seed, and for the most common non-ordinary seeds (i.e., sparse Matrix matrices), it seems prudent to cover that possibility ahead of time rather than relying on the user to load SMS themselves. It only costs an extra dependency so there's no reason to not do it; then everyone imports the generics from DMS and they cover all possible matrices in use right now.
Martin Morgan (07:37:53) (in thread): > I might have written this
> 
> assays(se) <- lapply(assays(se), as, "matrix")
> 
> skipping the check (because the coercion would be a no-op if it were a regular matrix; though maybe you'd like to avoid other coercions, then I'd use a more complicated lambda in lapply) and updating assays<- in one step to avoid potential additional copies that seem to still be associated with updating S4 objects.
RGentleman (21:20:23): > @RGentleman has joined the channel
2020-09-28
Qirong Lin (06:54:03): > @Qirong Lin has joined the channel
Tim Triche (12:43:09): > hi I have a stupid question about realize and how to not take a week per block for it (possibly @Hervé Pagès's wheelhouse)
Tim Triche (12:44:11): > > [.updateScNMT] Extracting scores for ESC_H09 > from GSM2936262_ESC_H09_GpC-acc_processed.tsv.gz > Merged chromosome Y, saving to HDF5... > Realizing block 1/8 ... OK, writing it ... OK > Realizing block 2/8 ... OK, writing it ... OK > Realizing block 3/8 ... OK, writing it ... OK > Realizing block 4/8 ... OK, writing it ... OK > Realizing block 5/8 ... OK, writing it ... OK > Realizing block 6/8 ... OK, writing it ... OK > Realizing block 7/8 ... OK, writing it ... OK > Realizing block 8/8 ... OK, writing it ... OK > OK. > # now we load all the chromosomes and try to rbind() the HDF5-backed SummarizedExperiments together, then save them: > Realizing block 1/960 ... >
> So that was where I left it last week and today I log in and it’s still working on block 1/960. I’m thinking this might take a while.
Aaron Lun (13:01:47): > without the surrounding context of the code, who knows.
Tim Triche (13:10:56): > chr1:22, X, Y, and M were read in and saved as individual HDF5-backed SummarizedExperiments with default block size.
Tim Triche (13:11:30): > Then I stacked them one on top of the other with do.call(rbind, list.of.HDF5backed.SEs)
Tim Triche (13:11:44): > again sticking with the default block size
Tim Triche (13:11:53): > and that’s where it has stayed since last week.
Tim Triche (13:12:48): > My assumption is that either changing the block or chunk size might lead to a less glacial pace.
Tim Triche (13:13:47): > Having saved all the individual SEs to HDF5, I can pull them in serially or all at once, and stack them; but eventually the goal is to have a workable HDF5-backed object for the accessibility data, which is folded into a MultiAssayExperiment for all of the scNMT runs (accessibility, DNAme, mRNA from each cell).
Nicholas Knoblauch (13:13:49): > My totally shoot-from-the-hip guess is that I think you're accidentally generating a ton of unnecessary intermediates because of what do.call expands to
Tim Triche (13:14:25): > I would have guessed that, although it appears that the "merge" is delayed and thus I don't face the consequences until it is realized
Aaron Lun (13:14:27): > This shouldn't be a DA problem, the rbind is also delayed.
Aaron Lun (13:14:37): > Where is the realize coming from?
Tim Triche (13:14:51): > saveHDF5SummarizedExperiment
Tim Triche (13:15:19): > so for example when the memory-backed chrY is saved as HDF5, the 8 blocks are realized and written
Tim Triche (13:15:46): > > # many more files > [.updateScNMT] Extracting scores for ESC_H07 > from GSM2936260_ESC_H07_GpC-acc_processed.tsv.gz > [.updateScNMT] Extracting scores for ESC_H08 > from GSM2936261_ESC_H08_GpC-acc_processed.tsv.gz > [.updateScNMT] Extracting scores for ESC_H09 > from GSM2936262_ESC_H09_GpC-acc_processed.tsv.gz > Merged chromosome Y, saving to HDF5... > Realizing block 1/8 ... OK, writing it ... OK > Realizing block 2/8 ... OK, writing it ... OK > Realizing block 3/8 ... OK, writing it ... OK > Realizing block 4/8 ... OK, writing it ... OK > Realizing block 5/8 ... OK, writing it ... OK > Realizing block 6/8 ... OK, writing it ... OK > Realizing block 7/8 ... OK, writing it ... OK > Realizing block 8/8 ... OK, writing it ... OK > OK. >
> That works fine.
Tim Triche (13:16:07): > But the much larger all-chromosomes merged object is… not fine. I’m not sure how to trace or debug this process.
Nicholas Knoblauch (13:16:25): > what about just 2 chromosomes ?
Aaron Lun (13:16:27): > I would guess that there is some deeply suboptimal chunking pattern in the underlying files.
Tim Triche (13:16:35): > 2-3 chromosomes works fine
Tim Triche (13:16:52): > e.g. I have successfully merged and read chr1 + chr11 + chr17 and saved it and read that.
Nicholas Knoblauch (13:17:01): > with the same syntax?
Tim Triche (13:17:11): > let me look and see:slightly_smiling_face:
Tim Triche (13:17:59): > yes, same syntax
Tim Triche (13:18:21): > I figured I had best test it out beforehand, and when it succeeded, I figured I would be able to scale it up by just waiting longer to write it all out.
Tim Triche (13:18:30): > Something must have exponential complexity in there?
Tim Triche (13:18:38): > Trouble is I don’t know where to start looking.
Aaron Lun (13:18:52): > What are the chunk dimensions for the individual files?
Tim Triche (13:19:02): > Let me grab a node and find out
Tim Triche (13:22:05): > > library(enmity) > library(HDF5Array) > chr5 <- loadHDF5SummarizedExperiment("scNMT_chr5_acc") > library(SummarizedExperiment) > chunkdim(assay(chr5, "Acc")) > # [1] 218341 4 >
> I’m guessing this is less than optimal
Tim Triche (13:22:18): > > dim(chr5) > [1] 5720751 120 >
Tim Triche (13:24:02): > is there a way (or, say, a shortcut using something like @Davide Risso and @Stephanie Hicks's mbkmeans::blocksize) to optimize this without a ton of additional effort?
Hervé Pagès (13:34:59): > The chunk dims of an individual SE are not a problem per se. Problem is that the chunk dims are likely to be different across the SE's, then reading blocks that have to be realized via the delayed rbind() could be a struggle. Let me check.
Tim Triche (13:36:35): > ohhhh… see this is the question I did not know how to ask :slightly_smiling_face: thanks @Hervé Pagès as always!
Tim Triche (13:36:56): > shall I pull in a few more and verify that this is the issue
Hervé Pagès (13:37:25): > Oh, but I forgot: saveHDF5SummarizedExperiment() has a chunkdim arg! So you could try to save all your SE's with the same chunk dims. Choose it small (e.g. 100x100 or 200x200).
Tim Triche (13:38:03): > > R> chr19 <- loadHDF5SummarizedExperiment("scNMT_chr19_acc") > R> chunkdim(assay(chr19, "Acc")) > [1] 139887 7 > R> dim(chr19) > [1] 2348232 120 >
> I think that's the problem. I'll try forcing them all to the same size. Any suggestions for a chunkdim?
Hervé Pagès (13:39:40): > Anything between 100x100 and 250x250 should be fine. Hopefully:wink:
Tim Triche (13:39:49): > Heh, stopping the save was “fun”:
Tim Triche (13:39:49): > > Realizing block 1/960 ... ^C > > Enter a frame number, or 0 to exit > > 1: source("acc_load.R") > 2: withVisible(eval(ei, envir)) > 3: eval(ei, envir) > 4: eval(ei, envir) > 5: acc_load.R#69: saveHDF5SummarizedExperiment(acc_se, dir = "scNMT_acc", repl > 6: .write_HDF5SummarizedExperiment(x, rds_path = rds_path, h5_path = h5_path, > 7: .write_h5_assays(x@assays, h5_path, chunkdim, level, verbose) > 8: writeHDF5Array(a, h5_path, h5_name, chunkdim, level, verbose = verbose) > 9: BLOCK_write_to_sink(x, sink) > 10: read_block(x, viewport, as.sparse = x_is_sparse) > 11: subset_dimnames_by_Nindex(dimnames(x), Nindex) > 12: dimnames(x) > 13: dimnames(x) > 14: callNextMethod() > 15: .nextMethod(x = x) > 16: dimnames(x@seed) > 17: dimnames(x@seed) > 18: combine_dimnames_along(x@seeds, dims, x@along) > 19: combine_dimnames(objects) > 20: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 21: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 22: FUN(X[[i]], ...) > 23: dimnames(x) > 24: dimnames(x) > 25: combine_dimnames_along(x@seeds, dims, x@along) > 26: combine_dimnames(objects) > 27: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 28: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 29: FUN(X[[i]], ...) > 30: dimnames(x) > 31: dimnames(x) > 32: combine_dimnames_along(x@seeds, dims, x@along) > 33: combine_dimnames(objects) > 34: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 35: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 36: FUN(X[[i]], ...) > 37: dimnames(x) > 38: dimnames(x) > 39: combine_dimnames_along(x@seeds, dims, x@along) > 40: combine_dimnames(objects) > 41: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 42: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 43: FUN(X[[i]], ...) > 44: dimnames(x) > 45: dimnames(x) > 46: combine_dimnames_along(x@seeds, dims, x@along) > 47: combine_dimnames(objects) > 48: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 49: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 50: FUN(X[[i]], ...) > 51: dimnames(x) > 52: dimnames(x) > 53: combine_dimnames_along(x@seeds, dims, x@along) > 54: combine_dimnames(objects) > 55: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 56: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 57: FUN(X[[i]], ...) > 58: dimnames(x) > 59: dimnames(x) > 60: combine_dimnames_along(x@seeds, dims, x@along) > 61: lapply(objects, function(x) dimnames(x)[[along]]) > 62: lapply(objects, function(x) dimnames(x)[[along]]) > 63: FUN(X[[i]], ...) > 64: dimnames(x) > 65: dimnames(x) > 66: combine_dimnames_along(x@seeds, dims, x@along) > 67: lapply(objects, function(x) dimnames(x)[[along]]) > 68: lapply(objects, function(x) dimnames(x)[[along]]) > 69: FUN(X[[i]], ...) > 70: dimnames(x) > 71: dimnames(x) > 72: combine_dimnames_along(x@seeds, dims, x@along) > 73: lapply(objects, function(x) dimnames(x)[[along]]) > 74: lapply(objects, function(x) dimnames(x)[[along]]) > 75: FUN(X[[i]], ...) > 76: dimnames(x) > 77: dimnames(x) > 78: combine_dimnames_along(x@seeds, dims, x@along) > 79: lapply(objects, function(x) dimnames(x)[[along]]) > 80: lapply(objects, function(x) dimnames(x)[[along]]) > 81: FUN(X[[i]], ...) 
> 82: dimnames(x) > 83: dimnames(x) > 84: combine_dimnames_along(x@seeds, dims, x@along) > 85: combine_dimnames(objects) > 86: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 87: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 88: FUN(X[[i]], ...) > 89: dimnames(x) > 90: dimnames(x) > 91: combine_dimnames_along(x@seeds, dims, x@along) > 92: combine_dimnames(objects) > 93: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 94: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 95: FUN(X[[i]], ...) > 96: dimnames(x) > 97: dimnames(x) > 98: combine_dimnames_along(x@seeds, dims, x@along) > 99: combine_dimnames(objects) > 100: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 101: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 102: FUN(X[[i]], ...) > 103: dimnames(x) > 104: dimnames(x) > 105: combine_dimnames_along(x@seeds, dims, x@along) > 106: combine_dimnames(objects) > 107: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 108: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 109: FUN(X[[i]], ...) > 110: dimnames(x) > 111: dimnames(x) > 112: combine_dimnames_along(x@seeds, dims, x@along) > 113: lapply(objects, function(x) dimnames(x)[[along]]) > 114: lapply(objects, function(x) dimnames(x)[[along]]) > 115: FUN(X[[i]], ...) > 116: dimnames(x) > 117: dimnames(x) > 118: combine_dimnames_along(x@seeds, dims, x@along) > 119: combine_dimnames(objects) > 120: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 121: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 122: FUN(X[[i]], ...) > 123: dimnames(x) > 124: dimnames(x) > 125: combine_dimnames_along(x@seeds, dims, x@along) > 126: lapply(objects, function(x) dimnames(x)[[along]]) > 127: lapply(objects, function(x) dimnames(x)[[along]]) > 128: FUN(X[[i]], ...) > 129: dimnames(x) > 130: dimnames(x) > 131: combine_dimnames_along(x@seeds, dims, x@along) > 132: lapply(objects, function(x) dimnames(x)[[along]]) > 133: lapply(objects, function(x) dimnames(x)[[along]]) > 134: FUN(X[[i]], ...) > 135: dimnames(x) > 136: dimnames(x) > 137: combine_dimnames_along(x@seeds, dims, x@along) > 138: combine_dimnames(objects) > 139: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 140: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 141: FUN(X[[i]], ...) > 142: dimnames(x) > 143: dimnames(x) > 144: combine_dimnames_along(x@seeds, dims, x@along) > 145: combine_dimnames(objects) > 146: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 147: lapply(seq_along(dim(objects[[1]])), function(n) { > for (x in objects) { > 148: FUN(X[[i]], ...) > 149: dimnames(x) > 150: dimnames(x) > 151: h5readDimnames(path(x), x@name, as.character = TRUE) > 152: get_h5dimnames(filepath, name) > 153: h5getdimscales(filepath, name, scalename = "dimnames") > 154: stopifnot(isSingleStringOrNA(scalename)) > > Selection:0 >
Tim Triche (13:40:03): > So whoever brought up the intermediate representations was perhaps also not wrong:slightly_smiling_face:
Tim Triche (13:41:16): > I don’t understand why it’s squawking, there are only 95 million ranges with 120 samples each: > {acc_se} > class: RangedSummarizedExperiment > dim: 95030342 120 > metadata(0): > assays(1): Acc > rownames(95030342): 1:3000204 1:3000275 ... Y:90844303 Y:90844358 > rowData names(0): > colnames(120): EBcontrol_P1D12 EBcontrol_P1E12 ... ESC_H08 ESC_H09 > colData names(2): sample file >
> :grin:
Tim Triche (13:43:31): > oh, also, I did not use do.call but rather Reduce, not that it necessarily makes any difference, but for transparency: > > loadSE <- function(chr) loadHDF5SummarizedExperiment(dir=hdf5dir(chr)) > acc_se <- Reduce(rbind, lapply(names(chroms), loadSE)) > acc_se <- saveHDF5SummarizedExperiment(acc_se, dir="scNMT_acc", replace=TRUE) > message("Brute force loading completed, result assigned to `acc_se`.") > show(acc_se) > > I will set chunkdim to 200x200 and let’s see what happens!
Tim Triche (13:45:40): > > getHDF5DumpChunkDim(dim(acc_se)) > # [1] 889898 1 >
> Dear god. Yeah I think I see the problem here.
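(For context: the automatic chunk geometry above comes from HDF5Array’s dump settings, which appear to scale the chunk to the shape of the full object, hence the long skinny 889898 x 1 chunks. A hedged sketch, assuming HDF5Array’s setHDF5DumpChunkShape() helper, of how a squarer automatic geometry could be requested instead:)

    library(HDF5Array)

    getHDF5DumpChunkShape()                   # inspect the current automatic chunk shape
    setHDF5DumpChunkShape("hypercube")        # ask for roughly square chunks instead
    getHDF5DumpChunkDim(c(95030342L, 120L))   # recompute the automatic chunk geometry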
Nicholas Knoblauch (13:45:56): > I think do.call might be better than Reduce here
Hervé Pagès (13:46:09): > absolutely!
Tim Triche (13:46:16): > eh, that part is done already (I can go back and see if it changes things though)
Nicholas Knoblauch (13:46:58): > I think DA is optimized for rbind(A,B,C) over rbind(rbind(A,B),C)
Tim Triche (13:47:22): > huh: > > acc_se <- saveHDF5SummarizedExperiment(acc_se, dir="scNMT_acc", chunkdim=c(200,200), replace=TRUE) > Error in .setDataType(H5type, storage.mode, size) : > Can not create dataset. H5type unknown. Check h5const('H5T') for valid types. > In addition: Warning message: > In h5checkConstants("H5T", H5type) : > H5 constant identifier has more than one value. Only the first value will be used. >
Hervé Pagès (13:47:42): > Reduce will stack 25 binary delayed rbind() operations while do.call will stack 1 n-ary delayed rbind(). This will probably make a big difference when realizing blocks.
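(A minimal sketch, using small made-up in-memory matrices, of the difference Hervé describes; showtree() from DelayedArray prints the tree of delayed operations, so the nesting of the two approaches can be compared directly.)

    library(DelayedArray)

    # four small in-memory seeds standing in for the per-chromosome objects
    blocks <- lapply(1:4, function(i) DelayedArray(matrix(runif(20), nrow = 5)))

    nested <- Reduce(rbind, blocks)   # stacks 3 binary delayed rbind() operations
    flat   <- do.call(rbind, blocks)  # stacks a single 4-ary delayed rbind()

    showtree(nested)  # deeply nested tree of delayed abind nodes
    showtree(flat)    # one delayed abind node holding all four seeds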
Tim Triche (13:47:56): > oh crap! OK I’ll try fixing that first.
Tim Triche (13:51:17): > should I attempt to writeHDF5Array() the assay directly so that saveHDF5SummarizedExperiment doesn’t squawk about HDF5 filetypes?
Hervé Pagès (13:53:35): > I don’t know. Would be good to understand what’s going on with the H5type. So when you did saveHDF5SummarizedExperiment() without specifying the chunkdim you didn’t get that error, and now that you specify chunkdim you get the H5type error? There must be something else going on.
Nicholas Knoblauch (13:53:48): > try c(200L,200L)?
Tim Triche (13:54:24): > just thought of that, @Nicholas Knoblauch
Tim Triche (13:54:26): > :slightly_smiling_face:
Hervé Pagès (13:54:27): > that shouldn’t make any difference but it doesn’t hurt to try
Tim Triche (13:54:52): > I reassembled the big h5se: > > acc_se <- do.call(rbind, lapply(names(chroms), loadSE)) > acc_se > class: RangedSummarizedExperiment > dim: 95030342 120 > metadata(0): > assays(1): Acc > rownames(95030342): 1:3000204 1:3000275 ... Y:90844303 Y:90844358 > rowData names(0): > colnames(120): EBcontrol_P1D12 EBcontrol_P1E12 ... ESC_H08 ESC_H09 > colData names(2): sample file >
Tim Triche (13:55:35): > Alas, no luck: > > acc_se <- saveHDF5SummarizedExperiment(acc_se, dir="scNMT_acc", chunkdim=c(200L, 200L), replace=TRUE) > Error in .setDataType(H5type, storage.mode, size) : > Can not create dataset. H5type unknown. Check h5const('H5T') for valid types. > In addition: Warning message: > In h5checkConstants("H5T", H5type) : > H5 constant identifier has more than one value. Only the first value will be used. > > Enter a frame number, or 0 to exit > > 1: saveHDF5SummarizedExperiment(acc_se, dir = "scNMT_acc", chunkdim = c(200, 2 > 2: .write_HDF5SummarizedExperiment(x, rds_path = rds_path, h5_path = h5_path, > 3: .write_h5_assays(x@assays, h5_path, chunkdim, level, verbose) > 4: writeHDF5Array(a, h5_path, h5_name, chunkdim, level, verbose = verbose) > 5: HDF5RealizationSink(dim(x), sink_dimnames, type(x), filepath = filepath, na > 6: create_and_log_HDF5_dataset(filepath, name, dim, type = type, H5type = H5ty > 7: h5createDataset2(filepath, name, dim, maxdim = maxdim, type = type, H5type > 8: h5createDataset(filepath, name, dim, maxdims = maxdim, storage.mode = type, > 9: .setDataType(H5type, storage.mode, size) > > Selection: >
> Any thoughts on how to debug?
Nicholas Knoblauch (13:57:56): > I think level 4 is where the action is
Hervé Pagès (13:58:04): > ok I can reproduce this. I think that somehow the chunkdim gets passed to the H5type arg down the pipe. Geez! Means I never tested saveHDF5SummarizedExperiment() with chunkdim! :face_with_rolling_eyes:
Tim Triche (13:58:04): > > assays(acc_se)$Acc[1:10, 1:4] > <10 x 4> matrix of class DelayedMatrix and type "double": > EBcontrol_P1D12 EBcontrol_P1E12 EBcontrol_P1F12 EBcontrol_P1G12 > 1:3000204 NA NA NA NA > 1:3000275 NA NA NA NA > 1:3000377 NA NA NA NA > 1:3000394 NA NA NA NA > 1:3000435 NA NA NA NA > 1:3000493 NA NA NA NA > 1:3000515 NA NA NA NA > 1:3000541 0 NA 0 NA > 1:3000559 0 NA 0 NA > 1:3000571 0 NA 0 NA > > assays(acc_se)$Acc[(nrow(acc_se)-100):nrow(acc_se), 1:4] > <101 x 4> matrix of class DelayedMatrix and type "double": > EBcontrol_P1D12 ... EBcontrol_P1G12 > Y:90826661-90826662 0 . NA > Y:90826665-90826666 0 . NA > Y:90826736-90826737 NA . NA > Y:90826739 NA . NA > Y:90826745-90826746 NA . NA > ... . . . > Y:90837322 NA . NA > Y:90842707 NA . NA > Y:90842735 NA . NA > Y:90844303 NA . NA > Y:90844358 NA . NA >
Tim Triche (13:58:10): > so the data is all there
Hervé Pagès (13:58:38): > I’m working on a fix (give me a few minutes)
Tim Triche (13:58:51): > you rule (everyone in this community is wonderful)
Hervé Pagès (14:05:38): > Should be fixed: https://github.com/Bioconductor/HDF5Array/commit/e4e6ec671c8c85e412f8c2211533669160945f0c
Tim Triche (14:06:32): > ugh now I have to update all these packages:wink:
Tim Triche (14:08:04): > crap: > > **** testing if installed package can be loaded from temporary location > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > Error: package or namespace load failed for 'HDF5Array' in dyn.load(file, DLLpath = DLLpath, ...): > unable to load shared object '/usr/local/lib/R/host-site-library/00LOCK-HDF5Array/00new/HDF5Array/libs/HDF5Array.so': > /usr/local/lib/R/host-site-library/00LOCK-HDF5Array/00new/HDF5Array/libs/HDF5Array.so: undefined symbol: H5T_NATIVE_DOUBLE_g > Error: loading failed > Execution halted > ERROR: loading failed > * removing '/usr/local/lib/R/host-site-library/HDF5Array' > * restoring previous '/usr/local/lib/R/host-site-library/HDF5Array' > Error: Failed to install 'HDF5Array' from GitHub: > (converted from warning) installation of package '/tmp/RtmpaZ9rPX/file29ff076c285fb/HDF5Array_1.17.10.tar.gz' had non-zero exit status >
Tim Triche (14:11:55): > > # YOLO > R> source("[https://github.com/Bioconductor/HDF5Array/raw/e4e6ec671c8c85e412f8c2211533669160945f0c/R/saveHDF5SummarizedExperiment.R](https://github.com/Bioconductor/HDF5Array/raw/e4e6ec671c8c85e412f8c2211533669160945f0c/R/saveHDF5SummarizedExperiment.R)") > > # Still stumped > R> acc_se <- saveHDF5SummarizedExperiment(acc_se, dir="scNMT_acc", chunkdim=c(200L,200L), replace=TRUE) > Error in .normarg_chunkdim(chunkdim, dim) : > the chunk dimensions specified in 'chunkdim' exceed the dimensions of > the object to write > > Enter a frame number, or 0 to exit > > 1: saveHDF5SummarizedExperiment(acc_se, dir = "scNMT_acc", chunkdim = c(200, 2 > 2: saveHDF5SummarizedExperiment.R#261: .write_HDF5SummarizedExperiment(x, rds_ > 3: saveHDF5SummarizedExperiment.R#158: .write_h5_assays(x@assays, h5_path, chu > 4: saveHDF5SummarizedExperiment.R#124: writeHDF5Array(a, h5_path, h5_name, chu > 5: HDF5RealizationSink(dim(x), sink_dimnames, type(x), filepath = filepath, na > 6: .normarg_chunkdim(chunkdim, dim) >
> Hmm
Hervé Pagès (14:13:33): > so you managed to install HDF5Array 1.17.10?
Hervé Pagès (14:14:15): > Oh you sourced the modified R file, I see. You like to live dangerously.
Hervé Pagès (14:17:09): > BiocManager::install("Bioconductor/HDF5Array") does work for me. I’ve done some work on DelayedArray/HDF5Array over the last few days and needed to make some additions to S4Vectors to support this work. Make sure you have the latest S4Vectors and DelayedArray. Also make sure you have the latest Rhdf5lib.
Tim Triche (14:20:28): > oh dear. I’m working inside a Singularity image, which will probably crap itself when I do this:confused:
FelixErnst (14:20:59): > well can you modify said singularity image?
Tim Triche (14:21:22): > sure, although this is rapidly becoming exactly the sort of tarpit I try to avoid
FelixErnst (14:21:46): > In for a penny, in for a pound?
Tim Triche (14:22:37): > sure is starting to look that way
Tim Triche (14:23:31): > uuuuuuuuuugh > > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > Error: package or namespace load failed for 'HDF5Array' in dyn.load(file, DLLpath = DLLpath, ...): > unable to load shared object '/usr/local/lib/R/host-site-library/00LOCK-HDF5Array/00new/HDF5Array/libs/HDF5Array.so': > /usr/local/lib/R/host-site-library/00LOCK-HDF5Array/00new/HDF5Array/libs/HDF5Array.so: undefined symbol: H5T_NATIVE_DOUBLE_g > Error: loading failed > Execution halted > ERROR: loading failed >
Nicholas Knoblauch (14:24:29): > you could also just scrap all of that and see if swapping out Reduce for do.call fixes it
Tim Triche (14:30:20): > already did that
Tim Triche (14:30:28): > that was a while ago:slightly_smiling_face:
2020-09-29
Aaron Lun (02:21:32): > Surprise! Biobase also has a rowMedians function! Took me ages to figure out it was getting called instead of the DMS version.
Aaron Lun (02:26:39): > not an uncommon problem when you’re using DMS with practically any other BioC package that depends on Biobase.
Stephanie Hicks (06:25:48): > Ah yes, I’ve run into similar problems for other functions. But yeah, I had no idea rowMedians was in biobase:grimacing:
Aaron Lun (11:19:00): > Did I mention https://bioconductor.org/packages/devel/bioc/html/TileDBArray.html here? - Attachment (Bioconductor): TileDBArray (development version) > Implements a DelayedArray backend for reading and writing dense or sparse arrays in the TileDB format. The resulting TileDBArrays are compatible with all Bioconductor pipelines that can accept DelayedArray instances.
Tim Triche (11:21:34): > it works now? awesome!
Aaron Lun (11:22:07): > Was it not working before? I was just going to say that it’s gotten in.
Dirk Eddelbuettel (11:25:20) (in thread): > There was no “not working”. There was a certain professor emeritus at a certain very old European university creating a certain amount of work to ensure the underlying package could be on CRAN.
Tim Triche (11:25:39) (in thread): > hi Prof Ripley!
Dirk Eddelbuettel (11:25:54) (in thread): > Don’t. Bloody. Jinx. It!
Dirk Eddelbuettel (11:26:23) (in thread): > Kidding aside we should be fine.
Dirk Eddelbuettel (11:27:28) (in thread): > Aforementioned professor already asked us for Apple Silicon support. Yes, for machines not yet available. He clearly is on a different planet. (Truth be told, he wanted the library download to be aware of maybe more than one CPU architecture, which is indeed a valid point.)
Kasper D. Hansen (13:10:47) (in thread): > All grumpiness aside, he usually has a point.
Dirk Eddelbuettel (13:11:06) (in thread): > I love the man. He is a treasure.
2020-09-30
Aaron Lun (00:41:15): > rowRanges is going to be nasty. The DA vignette breaks itself now.
Aaron Lun (00:49:20): > Wait. Why do we have a readKallisto function in the SE package?
Aaron Lun (01:02:14): > Anyway, got that fixed. Had to edit DA’s NAMESPACE to get rid of the rowRanges export.
Hervé Pagès (03:35:00): > It’s all fixed now in MatrixGenerics, DelayedArray, and SummarizedExperiment. Just make sure that you have the latest versions (all 3 versions have been bumped).
Hervé Pagès (03:36:35): > I agree that SE doesn’t seem like the best place for readKallisto()
Tim Triche (13:45:51): > @Hervé Pagès @Nicholas Knoblauch I used do.call(rbind, ...) and regular old saveHDF5SummarizedExperiment per your suggestions and it’s done. Now time to assemble the MultiAssayExperiment and see if I can break a bunch of other stuff for @Marcel Ramos Pérez to debug :wink:
Will Townes (13:51:45): > @Will Townes has left the channel
Martin Morgan (15:18:18) (in thread): > readKallisto() was introduced back in the day; it should be deprecated.
Hervé Pagès (15:29:28) (in thread): > want me to do it?
Martin Morgan (16:05:48) (in thread): > yes that would be great, thanks
Hervé Pagès (16:14:11) (in thread): > What’s the replacement so I can redirect the user to whatever the replacement is?
Aaron Lun (21:51:48) (in thread): > probably something in tximeta, I would guess
2020-10-01
Martin Morgan (06:20:50) (in thread): > tximport::tximport(..., type = "kallisto") is the workhorse; tximeta::tximeta() is probably what should be called.
Shila Ghazanfar (12:07:30): > @Shila Ghazanfar has joined the channel
Hervé Pagès (14:11:55) (in thread): > Done: https://github.com/Bioconductor/SummarizedExperiment/commit/cdb4c2e5f99f61ee8dbe5ca516cb2cc90435c5ea
2020-10-04
Alan O’C (07:28:36): > @Alan O’C has joined the channel
Aaron Lun (19:19:01): > Another fun fact with MatrixGenerics: passing table objects, which used to work with matrixStats::rowVars(), no longer works with MatrixGenerics::rowVars(). It’s effectively a matrix but the S4 dispatch doesn’t see it as such. Don’t see a lightweight way to keep that code working beyond unclass-ing the object.
Hervé Pagès (20:00:12): > “effectively a matrix” from an S3 point of view, but not for an S4 point of view: > > t <- table(aa=1:6, bb=11:16) > is.matrix(t) > # [1] TRUE > is(t, "matrix") > # [1] FALSE >
> Not an isolated case of S3/S4 disagreement (and not much hope for a reconciliation in the foreseeable future, unfortunately).
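(A minimal sketch of the unclass() workaround Aaron mentions, reusing Hervé’s t from above; it assumes MatrixGenerics already dispatches on plain matrices, which is what the stripped object becomes.)

    library(MatrixGenerics)

    t <- table(aa = 1:6, bb = 11:16)
    ## at the time of this discussion, rowVars(t) failed to dispatch because is(t, "matrix") is FALSE
    rowVars(unclass(t))   # dropping the "table" class leaves a plain integer matrix, which dispatches fine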
Hervé Pagès (20:23:27): > In the meantime we should probably add a method for table objects in MatrixGenerics, e.g.: > > setMethod("rowVars", signature = "matrix", .default_rowVars) > setMethod("rowVars", signature = "numeric", .default_rowVars) > setMethod("rowVars", signature = "array", .default_rowVars) > setMethod("rowVars", signature = "table", .default_rowVars) > > (or define a single method for array_OR_numeric_OR_table), and do this for all other generics.
2020-10-07
Anjali Silva (09:07:53): > @Anjali Silva has joined the channel
2020-10-08
Dirk Eddelbuettel (08:59:49): > Version 0.8.1 of the TileDB R package is now on CRAN, but with little to no user-facing changes. https://cran.r-project.org/package=tiledb - Attachment (cran.r-project.org): tiledb: Sparse and Dense Multidimensional Array Storage Engine for Data Science > The data science storage engine ‘TileDB’ introduces a powerful on-disk format for multi-dimensional arrays. It supports dense and sparse arrays, dataframes and key-values stores, cloud storage (‘S3’, ‘GCS’, ‘Azure’), chunked arrays, multiple compression, encryption and checksum filters, uses a fully multi-threaded implementation, supports parallel I/O, data versioning (‘time travel’), metadata and groups. It is implemented as an embeddable cross-platform C++ library with APIs from several languages, and integrations.
Kasper D. Hansen (11:18:49): > @Aaron Lun perhaps a contentious subject, but how do you feel about TileDB vs. HDF5 for our single cell stack? I mean, apart from the fact that loom/scanpy uses HDF5
Aaron Lun (11:19:42): > some comments at https://github.com/LTLA/TileDBArray/issues/12
Kasper D. Hansen (11:22:33): > Thanks, very useful.
Kasper D. Hansen (11:22:49): > TileDB looks very good I must say
Dirk Eddelbuettel (11:24:06): > Thank you. We are only just getting going!@Aaron Lunleft us a TODO with respect to some benchmarking that I should get to ‘real soon now’ after getting one more task out of the way first.
Kasper D. Hansen (11:26:17): > Any constraints on the dimensions of the matrix, @Dirk Eddelbuettel?
Kasper D. Hansen (11:26:38): > This is prob. somewhere in the docs, but … well, it’s easy to ask
Dirk Eddelbuettel (11:27:23): > No! And full dense/sparse, you name it. Metadata. Char dimensions…. Have a poke at the upcoming TileDBArray package by @Aaron Lun.
Kasper D. Hansen (11:28:40): > yeah, I have a 237M x 400k sparse matrix I need to look at somehow.
Kasper D. Hansen (11:29:03): > eventually
Dirk Eddelbuettel (11:34:45) (in thread): > Outside of genomics we are doing rather well with geospatial apps and “images” in the large sense. Think SAR, Lidar, … which can be 3-d (x, y, color) or 4-d (add time) over really large data sets. And … one representation we really like is just one large array where, say, time is just an index too. Reads and writes are naturally parallel, and it’s all cloud-native if you want it to be (or local on a shared FS, however you prefer).
Dirk Eddelbuettel (11:36:35) (in thread): > And some of those apps think more in terabytes or petabytes than the GB or MB we use on our local machines to prototype.
Kasper D. Hansen (11:41:06) (in thread): > That sounds pretty appealing. How well tested is the R interface?
Kasper D. Hansen (11:41:41) (in thread): > tested = optimized. I assume it’s somewhat new, which suggests that if we find bottlenecks, they should be fixable
Dirk Eddelbuettel (11:42:31) (in thread): > That is the TODO I mentioned. > > Also, Mike from Hutch uses it. Is he here?
Kasper D. Hansen (11:43:18) (in thread): > @Mike Jiang I’m guessing?
Stavros Papadopoulos (14:53:42): > @Kasper D. Hansen please see: https://github.com/LTLA/TileDBArray/issues/12#issuecomment-705758335
Stavros Papadopoulos (14:58:53): > For HDF5, I am wondering if there is any caching, as a constant time does not really make sense. Caching is also configurable in TileDB, so we can play with that. > > We owe you folks detailed comparisons to HDF5 (in multiple configurations with detailed information about the TileDB internals), which we will provide over the next few weeks.
RGentleman (20:20:14) (in thread): > I would like to suggest that it would be helpful to think of some benchmark datasets/operations that could be used going forward as a way to compare and identify bottlenecks
2020-10-10
Davide Corso (10:50:57): > @Davide Corso has joined the channel
2020-10-11
Alexander Toenges (10:47:16): > @Alexander Toenges has joined the channel
Nicholas Knoblauch (14:19:14) (in thread): > @Stavros Papadopoulos Check out chapter 4 of this document for some practical info on chunk caching in HDF5: https://portal.hdfgroup.org/display/HDF5/Improving+IO+Performance+When+Working+with+HDF5+Compressed+Datasets
Yuyao Song (23:41:53): > @Yuyao Song has joined the channel
2020-10-12
Stavros Papadopoulos (08:48:38) (in thread): > Thank you @Nicholas Knoblauch! TileDB does something very similar for both the dense and sparse case. We did some more digging over the weekend and we know the reason behind the perf observations in this discussion. We will follow up soon with a detailed explanation, as it may be interesting for anyone looking to store (dense or sparse) data efficiently.
2020-10-15
Pol Castellano (04:43:21): > @Pol Castellano has joined the channel
Aaron Lun (13:11:09): > @Hervé Pagès can we have DelayedArray:::block_APPLY switch over to using blockApply for the time being? All the DMS functions are currently single-threaded without any option to force them to use multiple workers.
Hervé Pagès (13:18:27): > DelayedArray:::block_APPLY() doesn’t have the BPPARAM arg so I would need to add it, and then all the calls to DelayedArray:::block_APPLY() in DelayedMatrixStats would need to be modified. Sounds like it wouldn’t be more work to just switch DelayedMatrixStats to blockApply().
Aaron Lun (13:19:37): > I was thinking of just ensuring that it responds to the global getAutoBPPARAM().
Hervé Pagès (13:20:11): > Oh right. I see. ~That’s easy to do.~ Edit: Actually it’s not easy to do (see below).
Hervé Pagès (13:24:54): > wait… DelayedArray:::block_APPLY() has a sink argument that I got rid of when I replaced it with blockApply(). I think Pete uses it in DMS. Unfortunately blockApply() cannot be used as a drop-in replacement for DelayedArray:::block_APPLY(). I kind of suspect that this is probably why Pete didn’t switch yet.
Aaron Lun (13:33:44): > Hm. This is going to suck. My 1 million cell analysis is clogged at this single-threaded section.
Hervé Pagès (13:45:24): > Actually Pete doesn’t call DelayedArray:::block_APPLY() directly. He calls DelayedArray:::colblock_APPLY(), which also has a sink arg and calls DelayedArray:::block_APPLY(). I wonder how many of the DelayedArray:::colblock_APPLY() calls in DMS effectively use the sink arg. Probably not many. If that’s the case then most calls should be easy to replace with something that is getAutoBPPARAM()-aware. I’ll take a closer look.
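(For context, a minimal sketch, under the current DelayedArray API, of what a getAutoBPPARAM()-aware block computation looks like; the matrix here is a made-up in-memory example.)

    library(DelayedArray)
    library(BiocParallel)

    setAutoBPPARAM(MulticoreParam(4))          # register workers globally
    x <- DelayedArray(matrix(runif(1e6), ncol = 100))
    res <- blockApply(x, colSums)              # blockApply() defaults to BPPARAM=getAutoBPPARAM()
    setAutoBPPARAM(SerialParam())              # restore the default afterwards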
Aaron Lun (13:46:34): > I would do it myself but there were already just so many problems with getting the tests to pass in https://github.com/PeteHaitch/DelayedMatrixStats/pull/62/.
Hervé Pagès (13:50:06): > We should probably keep these low-level technical conversations for #delayed_array
Peter Hickey (16:37:56) (in thread): > I’m pretty swamped for the next week, unfortunately. The tests are a source of frustration for me; I would design it differently if starting now
2020-10-16
Aaron Lun (13:23:25): > seriously, though, the memory usage of forking really breaks the advantage of using file-backed backends.
Aaron Lun (13:24:06): > On both server and laptop, I’m seeing forks use the same amount of memory as the main process. I thought this was all virtual and copy-on-write stuff, but apparently this is not the case
Nicholas Knoblauch (13:46:57) (in thread): > In the situation where you have C/C++ based libraries for file-backing (e.g. HDF5 and TileDB) and C/C++ based libraries for the block-level algorithms (beachmat), there’s potentially a large saving in memory by writing block_APPLY in C/C++ too, isn’t there?
Nicholas Knoblauch (13:51:41) (in thread): > or at least like a Bring Your Own Memory version where I can pass it a matrix that it overwrites for each block
Nicholas Knoblauch (13:59:57) (in thread): > Eigen is a great example of this. You can have Eigen allocate memory for you, or you can use the Map class to BYOM. There is also an EIGEN_RUNTIME_NO_MALLOC macro that you can define that creates a switch, set_is_malloc_allowed(bool), that, when on, will throw an exception if the library allocates. You can turn it on and off in blocks of code you are trying to optimize to get a sense of where memory allocation is happening.
Aaron Lun (14:04:19) (in thread): > The initial matrix allocation is not an issue, it’s pretty small by comparison. The problem is that the memory usage of the entire R session (even for objects not used in the parallel section) is effectively duplicated. So if I made a random 1 GB object in RAM and started using fork-based parallelization afterwards, that 1 GB object seems to be duplicated in every worker, even if I wasn’t using that object inside the parallel section. I would have expected there to be some virtual copy-on-write that doesn’t do an actual copy of everything in the R session, but I can see the free memory on my OS dropping when I hit the parallel section; the depletion is consistent with an actual copy being made for each worker.
Aaron Lun (14:05:13) (in thread): > This has some amusing implications - for example, my jobs are more likely to be killed near the end of their run instead of near the start, as the duplication of the session’s memory usage becomes more likely to pass the memory limit (and thus get killed by our cgroups monitor) as the number of objects in the R session accumulates throughout the course of the analysis.
Aaron Lun (14:05:47) (in thread): > So at the end of the analysis, any innocuous parallelization - even if it does nothing except return NULL - will break the session.
Nicholas Knoblauch (14:07:37) (in thread): > I guess it’s all virtual memory to the cgroup monitor?
Aaron Lun (14:08:39) (in thread): > I don’t really know how it works. Maybe it counts it separately, though the depletion of memory on my laptop is independent of what cgroups is seeing.
Nicholas Knoblauch (14:09:18) (in thread): > threads shouldn’t have this problem, right?
Hervé Pagès (14:09:55) (in thread): > Do you see this on Linux too Aaron?
Aaron Lun (14:10:25) (in thread): > My linux laptop? I think so, based on the fact that my laptop would brick at the end of the HCA chapter of the book.
Hervé Pagès (14:11:34) (in thread): > oh your laptop is Linux, I thought you were on Mac
Aaron Lun (14:11:54) (in thread): > I have two laptops - work is Mac, home is Linux
Aaron Lun (14:11:59) (in thread): > well, I guess they’re both at home now.
Aaron Lun (14:13:45) (in thread): > I think I can reproduce this effect consistently across all my machines, though I’ll have to see if I can make a convincing MRE beyond just hitting free -m repeatedly while a job is running.
Nicholas Knoblauch (14:14:55) (in thread): > I haven’t used it much yet but https://github.com/r-prof/jointprof/ is worth keeping an eye on for profiling mixed R/C/C++
Hervé Pagès (14:15:06) (in thread): > Anyways, yeah I was also assuming that forking on Linux would use a COW mechanism. Disappointing!
Nicholas Knoblauch (14:16:00) (in thread): > I think it does, but if you are using some sort of oom observer, you still have to “pay” for the memory even if you don’t use it
Aaron Lun (14:16:38) (in thread): > That may be true of the server, but I don’t have cgroups on my laptops, so that shouldn’t be an issue there.
Aaron Lun (14:17:41) (in thread): > Without knowing anything about the problem, I’m going to blame the garbage collector. Probably getting its dirty little fingers on each page and triggering a real copy.
Hervé Pagès (14:17:56) (in thread): > I think you’re right @Nicholas Knoblauch. Also the COW mechanism is at the memory page level, not at the R object level. Don’t know exactly what the concrete effects of that are when forking an R process.
Nicholas Knoblauch (14:17:57) (in thread): > yeah that makes sense
Hervé Pagès (14:19:56) (in thread): > how hard would it be to disable the garbage collector in the workers just to see how it goes?
Aaron Lun (14:21:22) (in thread): > can that be done at the R level? I thought it was a C-only option
Hervé Pagès (14:21:53) (in thread): > I think it can be done at the R level
Nicholas Knoblauch (14:23:36) (in thread): > if it’s the reference counter that’s the issue (i.e. if objects store their ref count), then “reading” is actually writing: https://cran.r-project.org/doc/manuals/r-release/R-ints.html#Rest-of-header - Attachment (cran.r-project.org): R Internals > R Internals
Hervé Pagès (14:27:50) (in thread): > … but I can’t find anything. I thought there was maybe an env variable that could be used to control this. Oh well, maybe it can’t.
Aaron Lun (14:31:17) (in thread): > I don’t mind making a copy of all objects that are read in the parallel section - it’s the copy of everything else in the session that bothers me. Maybe this is due to some unfortunate mapping between pages and objects, but I hope not.
Nicholas Knoblauch (14:31:35) (in thread): > again though, why not use threads?
Aaron Lun (14:32:22) (in thread): > At the R level? Doesn’t sound like that would be easy.
Nicholas Knoblauch (14:34:59) (in thread): > your other option is to wrap the parallel part in a callr block so you’re starting with a fresh session
Nicholas Knoblauch (14:35:17) (in thread): > gross though
Aaron Lun (14:41:32) (in thread): > That’s basically SnowParam.
Hervé Pagès (14:54:46) (in thread): > On my laptop (16GB of physical memory), when I run this: > > library(BiocParallel) > BPPARAM <- MulticoreParam(workers=8) > big <- matrix(runif(5e8), ncol=1000) > bplapply(1:8, function(i, big) Sys.sleep(10), big, BPPARAM=BPPARAM) >
> I see this (in top): > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 20702 hpages 20 0 7989984 7.514g 9952 S 0.0 48.3 0:16.05 R > 20718 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20719 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20720 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20721 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20722 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20723 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20724 hpages 20 0 7989984 7.508g 3688 S 0.0 48.3 0:00.00 R > 20717 hpages 20 0 7989984 7.508g 3624 S 0.0 48.3 0:00.00 R > > This was very fast, didn’t generate any swapping, didn’t freeze my screen. So even though top reports that the workers were using a lot more memory than I have, they didn’t. Seems like COW was working, at least in this example where I’m not doing anything with the big object. So top is not telling the truth. What did you use?
Hervé Pagès (15:03:18) (in thread): > No problem either if I call gc() in the callback function. Still goes smoothly. What’s interesting though is that now the memory used by each worker is half (as reported by top): > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND > 20702 hpages 20 0 7999160 7.523g 10000 S 0.0 48.4 0:23.71 R > 21083 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.14 R > 21084 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.14 R > 21085 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.14 R > 21086 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.15 R > 21087 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.15 R > 21088 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.14 R > 21089 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.14 R > 21090 hpages 20 0 4092908 3.791g 3652 S 0.0 24.4 0:00.13 R > > 3.791g is actually the size of the object.
Hervé Pagès (15:04:21) (in thread): > COW still seems to be happening
Aaron Lun (15:17:24) (in thread): > I think the special sauce comes from some interaction with HDF5.
Aaron Lun (15:19:00) (in thread): > I’m looking at top’s PhysMem entry, e.g., “7263M used (3694M wired), 9119M unused.” before entering the parallel section.
Aaron Lun (15:19:45) (in thread): > In this chunk of code, I have: > > library(HDF5Array) > mat <- HDF5Array("test.hdf5", "samp_data/data") > big <- matrix(runif(1e8), ncol=1000) > > library(BiocParallel) > BPPARAM <- MulticoreParam(8) > grid <- defaultAutoGrid(mat) > Y <- DelayedArray::blockApply(mat, grid=grid, BPPARAM=BPPARAM, FUN=colSums) > > and the unused drops down to double-digit numbers in the parallel section: basically nothing.
Aaron Lun (15:21:35) (in thread): > If I make big smaller (e.g., 1e7), it only drops down to ~5000 unused
Hervé Pagès (15:24:55) (in thread): > What if you make big bigger? Does it start swapping?
Martin Morgan (15:29:10) (in thread): > I think repeating what Hervé was saying – maybe the tool isn’t measuring actual usage, as > > > x <- 1:100000000000000 > > print(object.size(x), units = "auto") > 727.6 Tb > > lobstr::obj_size(x) > 680 B >
Aaron Lun (15:32:17) (in thread): > Hard to say re. swap, I do see an increase in the “swapin” number but I don’t really know what it means. I prefer using free but I don’t have it on my mac. On a less scientific measure, my computer’s fan does start up when I bump big up to 5e8.
Hervé Pagès (15:41:57) (in thread): > > library(HDF5Array) > mat <- writeHDF5Array(matrix(runif(2e6), ncol=1000), chunkdim=c(50, 50)) > big <- matrix(runif(5e8), ncol=1000) > > library(BiocParallel) > BPPARAM <- MulticoreParam(workers=8) > grid <- defaultAutoGrid(mat, block.length=4e5) # 8 blocks > FUN <- function(block) {colSums(block); Sys.sleep(10)} > Y <- DelayedArray::blockApply(mat, FUN, grid=grid, BPPARAM=BPPARAM) >
> No swapping, very smooth, even though top reports a crazy amount of cumulative memory usage (72GB) on my laptop.
Aaron Lun (15:45:22) (in thread): > Hm. sysctl vm.swapusage doesn’t report any swap either, though I’m not sure that does what it says it does.
Aaron Lun (15:52:42) (in thread): > Interesting. Maybe I was actually running out of memory in the later parts of the HCA chapter. I will have to check on my Linux machine.
Aaron Lun (20:23:45): > It’s friday afternoon and it’s time for another piece of absurdist theater. The following code fails reliably for me on a SLURM node with 8 GB of RAM and 20 allocated CPUs. (The process was presumably killed by the OOM; this is a signature death rattle from the child.) > > big <- matrix(runif(1e8), ncol=1000) > library(BiocParallel) > > BPPARAM <- MulticoreParam(20) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) > ## Error in result[[njob]] <- value : > ## attempt to select less than one element in OneIndex > ## In addition: Warning message: > ## In parallel::mccollect(wait = FALSE, timeout = 1) : > ## 1 parallel job did not deliver a result >
> However! The following works reliably for me on the same node: > > library(BiocParallel) > big <- matrix(runif(1e8), ncol=1000) > > BPPARAM <- MulticoreParam(20) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) >
> And when I say “reliably”, I mean I’ve tried each at least 5 times in independent Slurm jobs on different nodes, within the last half hour. Who knows what the future holds, but this is what I’m seeing right now.
Peter Hickey (20:25:54): > sounds like beer o’clock
Hervé Pagès (20:27:19): > 1:234 looks suspicious to me
Aaron Lun (20:27:57): > Oh, this example was derived from my previous HDF5Array example, which happened to have a grid length of about that much.
Hervé Pagès (20:28:52): > but it smells lazy and not very creative
Hervé Pagès (20:30:47): > no seriously, so IIUC the only difference is the order in which you create big and load the BiocParallel package?
Aaron Lun (20:31:11): > That’s correct
Aaron Lun (20:31:22): > And the plot thickens.
Aaron Lun (20:31:31): > Hold on while I just wait for the plot to thicken the right amount
Aaron Lun (20:32:34): > Okay, now this works: > > library(snow) > big <- matrix(runif(1e8), ncol=1000) > library(BiocParallel) > > BPPARAM <- MulticoreParam(20) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) >
Hervé Pagès (20:33:12): > let me try this on nebbiolo1
Aaron Lun (20:34:14): > If it helps: > > R version 4.0.0 (2020-04-24) > Platform: x86_64-pc-linux-gnu (64-bit) > Running under: CentOS Linux 7 (Core) > > Matrix products: default > BLAS: /apps/user/R/R_4.0.0_Bioc_3.11/R-4.0.0-Bioc-3.11-uat-20200830/lib64/R/lib/libRblas.so > LAPACK: /apps/user/R/R_4.0.0_Bioc_3.11/R-4.0.0-Bioc-3.11-uat-20200830/lib64/R/lib/libRlapack.so > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > attached base packages: > [1] stats graphics grDevices utils datasets methods base > > other attached packages: > [1] BiocParallel_1.22.0 > > loaded via a namespace (and not attached): > [1] compiler_4.0.0 parallel_4.0.0 >
Aaron Lun (20:35:45): > I just realized, though, that loading BiocParallel doesn’t actually load snow, because there’s nothing in the NAMESPACE, so it’s not like snow is really relevant to this discussion.
Aaron Lun (20:36:43): > In fact, replacing snow with edgeR is also sufficient to get it to work.
Hervé Pagès (20:40:26): > Works very reliably on nebbiolo1. I did this 5-6 times for both orders. I’m using devel + R 4.0.3. Some cluster settings maybe?
Aaron Lun (20:41:02): > I didn’t think nebbiolo had cgroups?
Aaron Lun (20:41:46): > The error is an OOM one for the 8GB constraint that I allocated the Slurm job with.
Hervé Pagès (20:43:05): > I don’t think nebbiolo1 has cgroups. It’s just a standard Ubuntu 20.04 install. So yes, sounds like something specific to your cluster settings.
Aaron Lun (20:43:16): > If I do the math, I’ve got ~1 GB for big and then another 20 * 80 MB = 1.6 GB for all the individual workers, so I should have been well under 8 GB.
Aaron Lun (20:44:48): > Ah. And my parallel 1 M cell job has now been killed at the very end, after 8 hours of compute. Bum.
Hervé Pagès (20:49:04): > I agree with @Peter Hickey, definitely time for happy hour
Aaron Lun (21:01:18): > Oh. but this is interesting. Pre-loading snow or edgeR works… but SummarizedExperiment does not.
Aaron Lun (22:09:25): > So. I was hoping to just empirically solve my problem by putting library(snow) at the top of all my scripts, which does rescue the situation as per my comments above. However, library(SummarizedExperiment) after library(snow) will cause the “did not deliver a result” error to reappear. In other words, this fails reliably: > > library(SummarizedExperiment) > library(snow) > big <- matrix(runif(1e8), ncol=1000) > library(BiocParallel) > > BPPARAM <- MulticoreParam(20) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) > ## Error in result[[njob]] <- value : > ## attempt to select less than one element in OneIndex > ## In addition: Warning message: > ## In parallel::mccollect(wait = FALSE, timeout = 1) : > ## 1 parallel job did not deliver a result >
> After much struggle, I narrowed it down to “something with S4”. I deleted all files in SE except for R/Assays-class.R, and I started commenting things out and reinstalling the package until the error disappeared. It seems that commenting out all setAs calls and the set(Replace)Method("[") calls will eliminate the error. Uncommenting any of those calls seems to re-introduce the error.
Aaron Lun (22:14:06): > In fact, inserting library(SummarizedExperiment) anywhere will trigger the error. > > library(BiocParallel) > big <- matrix(runif(1e8), ncol=1000) > BPPARAM <- MulticoreParam(20) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) # no problems, as BiocParallel is loaded before big is created. > > library(SummarizedExperiment) > ref <- bplapply(1:234, function(x) { out <- runif(1e7); sum(out) }, BPPARAM=BPPARAM) > ## Error in result[[njob]] <- value : > ## attempt to select less than one element in OneIndex > ## In addition: Warning message: > ## In parallel::mccollect(wait = FALSE, timeout = 1) : > ## 1 parallel job did not deliver a result >
Aaron Lun (22:18:10): > Of course, SE is hardly the only package with setAs calls floating around, so I would guess that it just happened to be the straw that broke some kind of S4-related camel’s back.
Aaron Lun (22:34:41) (in thread): > Now this, however, will reliably use swap on my mac: > > library(BiocParallel) > big <- runif(5e8) > BPPARAM <- MulticoreParam(workers=10) > bplapply(1:1000, function(i) { out <- sum(runif(1e7)); out }, BPPARAM=BPPARAM) >
Aaron Lun (22:37:52) (in thread): > 12 GB of swap and counting
Aaron Lun (22:45:40) (in thread): > Just tried it on my linux box, also hits swap.
Aaron Lun (22:47:38) (in thread): > Sticking a gc() into the function before it returns fixes the problem. Can it be? Is this the solution?
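(A minimal sketch of the gc()-in-the-worker workaround Aaron describes, using the toy job from earlier in this thread; whether gc() actually returns pages to the OS is platform-dependent, so treat this as an experiment rather than a guaranteed fix.)

    library(BiocParallel)

    big <- runif(5e8)                      # large unrelated object sitting in the session
    BPPARAM <- MulticoreParam(workers = 10)
    res <- bplapply(1:1000, function(i) {
        out <- sum(runif(1e7))
        gc()                               # free the worker's temporary allocations before returning
        out
    }, BPPARAM = BPPARAM)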
Aaron Lun (22:50:03) (in thread): > Well, I’ve had enough. If someone else can repro the problem and the fix, perhaps we can patch it for 3.12.
2020-10-17
Aaron Lun (01:31:35) (in thread): > Slides through the server without any problems.
2020-10-18
Noah Pieta (00:23:04): > @Noah Pieta has joined the channel
2020-10-19
FelixErnst (03:14:11) (in thread): > I certainly don’t know how top works internally, but it makes sense that top cannot distinguish address space allocated from parent to fork, since it is literally the same address space until (over)written. I don’t know who or why I got this stuck in my browser history, but it is a rant from Microsoft on fork (I hope you like your food salty, since I guess a grain of salt is advised): https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf
Aaron Lun (03:33:01) (in thread): > the name for section 7 is fun.
FelixErnst (03:39:51) (in thread): > I guess that this becomes more and more important for MS since most instances on Azure are running Linux, which causes problems like the ones you experienced, plus you cannot accurately calculate how many resources to allocate for certain jobs/instances. So this is probably about financial reasons as much as it is about valid ones
2020-10-21
Sudarshan (04:33:47): > @Sudarshan has joined the channel
2020-10-22
Tim Triche (10:17:23): > @Levi Waldron @Marcel Ramos Pérez I wrapped up the raw scNMT data into an HDF5-backed MultiAssayExperiment for some compartmap testing:
Tim Triche (10:17:36): > > show(scNMT_MAE) > # > # A MultiAssayExperiment object of 3 listed > # experiments with user-defined names and respective classes. > # Containing an ExperimentList class object of length 3: > # [1] rna: RangedSummarizedExperiment with 22084 rows and 116 columns > # [2] acc: RangedSummarizedExperiment with 95030342 rows and 116 columns > # [3] meth: RangedSummarizedExperiment with 13464893 rows and 116 columns > # Functionality: > # experiments() - obtain the ExperimentList instance > # colData() - the primary/phenotype DataFrame > # sampleMap() - the sample coordination DataFrame > # `$`, `[`, `[[` - extract colData columns, subset, or experiment > # *Format() - convert into a long or wide DataFrame > # assays() - convert ExperimentList to a SimpleList of matrices > # exportClass() - save all data to files > # > colData(scNMT_MAE) > # > # DataFrame with 116 rows and 4 columns > # sample rna acc meth > # <character> <logical> <logical> <logical> > # EB_P1D12 EB_P1D12 TRUE TRUE TRUE > # EB_P1E12 EB_P1E12 TRUE TRUE TRUE > # EB_P1F12 EB_P1F12 TRUE TRUE TRUE > # EB_P1G12 EB_P1G12 TRUE TRUE TRUE > # EB_P2B12 EB_P2B12 TRUE TRUE TRUE > # ... ... ... ... ... > # ESC_H05 ESC_H05 TRUE TRUE TRUE > # ESC_H06 ESC_H06 TRUE TRUE TRUE > # ESC_H07 ESC_H07 TRUE TRUE TRUE > # ESC_H08 ESC_H08 TRUE TRUE TRUE > # ESC_H09 ESC_H09 TRUE TRUE TRUE > # >
Tim Triche (10:17:53): > Needless to say it’s a big’un. However, I’d like to put it on ExperimentHub
Tim Triche (10:18:23): > What’s the best way to go about this? Eventually it would be a nice demo as a restfulSE (@Vince Carey might have opinions on this statement though)
Tim Triche (10:18:52): > The summarized / boiled-down version was useless for our purposes so I wrote a couple of loaders to pull in the full data instead.
Tim Triche (10:19:17): > It may also be useful for testing sparseMatrixStats, DelayedMatrixStats, and the like.
Tim Triche (10:19:47): > The accessibility data is 95 million loci, for example. A lot of them are NA. This is its own special “opportunity”, I suppose.
Kasper D. Hansen (10:22:30): > This could also be useful to convert into TileDB to test sparsity I think….
Hervé Pagès (11:01:48): > Speaking of sparsity, HDF5Array(), writeHDF5Array(), and saveHDF5SummarizedExperiment() have a new as.sparse argument that you can set to TRUE if your data is sparse. This won’t change how the data is stored (it will still be stored the usual dense way) but it will allow block processing to switch to sparse mode early, that is, h5mread(), the workhorse behind block processing, will read and load the block data directly in sparse format. This can make things slightly more efficient: https://github.com/Bioconductor/HDF5Array/issues/33#issuecomment-700788537
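(A minimal sketch of the as.sparse argument Hervé describes; the file and dataset names here are hypothetical.)

    library(HDF5Array)

    A <- HDF5Array("acc.h5", "Acc", as.sparse = TRUE)  # data stays dense on disk
    is_sparse(A)      # TRUE: blocks are loaded as sparse objects during block processing
    cs <- colSums(A)  # block-processed operations can now stay in sparse mode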
Kasper D. Hansen (13:09:37): > That’ll be nice. There is sparsity for storage and sparsity for computation and they are slightly different and this will give us sparsity for computation.
Kasper D. Hansen (13:10:44): > @Tim Triche Why is the accessibility data NA as opposed to 0? Perhaps I’m not familiar enough with scNMT
Tim Triche (13:12:00): > unmeasured == NA, 0=0
Tim Triche (13:14:14): > however, I have to think that smoothing could fix this
Tim Triche (13:14:20): > now that I’m thinking about it properly
Kasper D. Hansen (13:16:02): > So you can differentiate between unmeasured and one of the states. Interesting. Of course, reflecting on it, that’s possible for some chromatin assays
Kasper D. Hansen (13:16:22): > Smoothing is probably a good idea of course. But then, I have to say that
Tim Triche (14:12:13): > the thought occurred:slightly_smiling_face:
Tim Triche (14:13:39): > to some extent, compartmap gets away with single-cell chromatin conformation reconstruction by smoothing (shrinking) and then iteratively roughening – a Gaussian blur, if you will. You and JPF inadvertently forced me and Ben to learn what my committee had always wanted me to (how to implement empirical Bayes from first principles on a new problem)
2020-10-23
Rebecca Howard (08:18:02): > @Rebecca Howard has joined the channel
2020-10-27
Nicholas Knoblauch (20:37:18): > don’t know how I missed this, but HDF5 is on GitHub now: https://github.com/HDFGroup/hdf5
Nicholas Knoblauch (20:38:27) (in thread): > They’ve also started a new wiki-based documentation effort: https://hdf5.wiki/index.php/Main_Page
Aaron Lun (20:42:54) (in thread): > damn, 21k commits
Nicholas Knoblauch (20:47:15) (in thread): > yeah I think the library dates back to the mid 90’s
2020-10-29
Jordan L. (13:51:06): > @Jordan L. has joined the channel
2020-11-01
Aaron Lun (00:28:45): > It was, in hindsight, a mistake to watch the HDF5 repo.
Nicholas Knoblauch (00:42:51): > 11 pull requests a week is kind of a lot
2020-11-11
watanabe_st (19:25:09): > @watanabe_st has joined the channel
2020-11-13
Dania Machlab (05:02:59): > @Dania Machlab has joined the channel
2020-11-17
Dania Machlab (05:26:05): > Hello Big Data community! I have a question regarding saving a DelayedArray as an ‘.h5’ file. Is there a good way to use a chunk size that will result in an HDF5 file that isn’t too big, or at least reasonably big? I’ve tried a few chunk sizes now, and the resulting ‘.h5’ file is > 3G every time (the loom file I read into R as a DelayedArray is 2.8G). This is how I save the file: > > counts_processed <- HDF5Array::HDF5Array(filepath = "TabulaMurisSenisData_round3/processed/processed.loom", name = "matrix") > > options(DelayedArray.block.size=1e9) # 1GB block size. > # had to stop this: already >> than loom file > mat_processed <- writeHDF5Array(x = counts_processed, > filepath = file.path("TabulaMurisSenisData_round3", "processed", "processed_counts.h5"), > name = "processed_counts", > chunkdim = HDF5Array::getHDF5DumpChunkDim(dim(counts_processed))) > > The matrix I am trying to save is a processed count matrix from Tabula Muris Senis. The raw count matrix (same dimensions) saves as an ‘.h5’ file that is 790 MB big. I’m not sure if I should expect the processed version of this matrix to store as a bigger file. > > HDF5Array::getHDF5DumpChunkDim(dim(counts_processed)) gives me c(3490, 286), and I’ve tried the following chunk sizes: c(500, 250), c(2000, 250), c(5000, 286), c(10000, 300), c(10000, 2000), c(100, 100). I always stop the process once they exceed 3G. > > Does anyone know how/if chunk size affects compression?
Vince Carey (08:26:51): > Hi – what is the evidence that chunk size affects compression at all? I wonder if the problem is that the original tabula muris data is intrinsically smaller because a sparse matrix representation is used, while the DelayedArray representation that you are attempting is – possibly inadvertently – a dense representation. I have not followed closely the capacity for DelayedArray to work with sparse matrix formats. I think it is supported, but you may have to take some extra steps.
Dania Machlab (09:34:31): > Yeah, I guess I was surprised it exceeded the size of the loom file (for the dense processed matrix). To compare, the raw count matrix loom file is 1.3G, and as an .h5 file that went down to 790 MB (using HDF5Array::getHDF5DumpChunkDim to get the chunk size). I thought it could be chunk size related since, with the raw count matrix, I see the .h5 file size change with different chunk options. For example, if I set it to c(1000, 100), the size of the .h5 file becomes 867 MB (as opposed to 790 MB). It can even be bigger than 1G if I set a bad chunk size. But probably for the processed matrix, as you say (since it’s not sparse), I should expect a bigger HDF5 file; still, if chunk size does affect the size, I would like to not be doing too badly in terms of the final size.
Nicholas Knoblauch (09:45:49): > take the following matrix: > > x <- sample(-.Machine$integer.max:.Machine$integer.max,1000,replace=FALSE) > y <- matrix(x,nrow=1000,ncol=100) >
> If I have one row per chunk it will compress extremely well. If I have one column per chunk it won’t compress at all.
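(A rough sketch, reusing the y from Nicholas’ snippet above, of how to check that claim empirically with writeHDF5Array(); the file names are throwaway temp files.)

    library(HDF5Array)

    f_row <- tempfile(fileext = ".h5")
    f_col <- tempfile(fileext = ".h5")

    writeHDF5Array(y, f_row, "y", chunkdim = c(1, ncol(y)))  # one row per chunk: each chunk is constant, compresses well
    writeHDF5Array(y, f_col, "y", chunkdim = c(nrow(y), 1))  # one column per chunk: each chunk is random, compresses poorly

    file.size(f_row)
    file.size(f_col)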
Mike Smith (09:52:35): > Are the processed counts integer or numeric? That might be affecting your file size comparison, if the raw count matrix is integer and the processed is numeric. I can’t say for certain, but I’m pretty sure writeHDF5Array() will write an HDF5 dataset using the type equivalent to that of the input R matrix. > Setting the chunk dimensions to the size of the dataset should give the best compression possible (for the selected compression parameters), but whether it will end up smaller than a sparse representation depends on the dataset itself.
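(A toy illustration of Mike’s point about storage type, with made-up Poisson counts; assuming the integer matrix is written as an integer dataset and the double matrix as a 64-bit float dataset, the on-disk sizes can be compared directly.)

    library(HDF5Array)

    m_int <- matrix(rpois(1e6, lambda = 2), ncol = 100)  # integer counts
    m_dbl <- m_int * 1.0                                 # same values stored as doubles

    f_int <- tempfile(fileext = ".h5")
    f_dbl <- tempfile(fileext = ".h5")
    writeHDF5Array(m_int, f_int, "counts")
    writeHDF5Array(m_dbl, f_dbl, "counts")

    file.size(f_int)
    file.size(f_dbl)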
Dania Machlab (10:34:44): > I can read the raw matrix in as a sparse matrix (delayed and sparse), but storing that matrix with writeHDF5Array gives me the same compression as saving it as a dense matrix (delayed and not sparse). Anyway, from the description of the function, writeHDF5Array does not support sparse storage at the moment. As for the class of the values in the matrices, they are numeric for both the raw and processed. Thanks for all your inputs!
Kasper D. Hansen (11:40:59): > HDF5 does not support sparse matrices
Stephanie Hicks (11:41:22): > but tileDB does support sparse matrices
Kasper D. Hansen (11:41:37): > What we have with the 10X data format is the intrinsic ability of anything that stores a dense matrix to also store a sparse matrix by storing a 3 column “flat” matrix
Kasper D. Hansen (11:42:05): > It achieves compression, but you are now unable to use any of the HDF5 functions for “get me (i,j) element”
Kasper D. Hansen (11:42:14): > So I bet this is what happens
Kasper D. Hansen (11:42:44): > Yes, TileDB supports native sparse matrices, which is one of the reasons I am personally very excited about TileDB
Dirk Eddelbuettel (12:04:00): > We also have a number of compression algorithms (with zstd being the default and a very good compromise between write speed, read speed and file size), which of course leads to a number of new parameters to tweak … I set up a really really simple little benchmark once, but in real use this will of course depend on the nature of the data, its cardinality, … and as always a few other things. I also do not have comprehensive comparisons to HDF5 yet, but on the example @Aaron Lun once created we compress better. And now run faster too with the ‘Hilbert’ layout option…
Dirk Eddelbuettel (12:04:37): - File (R): Untitled
Dirk Eddelbuettel (12:05:04): > That’s really just an example I had hanging around using the NYC Flights dataframe…
Hervé Pagès (13:21:50): > @Dania Machlab Have you tried to play with the compression level? (the level argument in writeHDF5Array())
Dania Machlab (14:11:19): > Thanks for pointing that out! I’ve tried with level=5 and level=7, which gave me 3.2 and 3.1 G respectively. I’ll try to go even higher; I think 9 is the maximum value
Aaron Lun (14:12:33): > I’ve lost track of the sizes here. The “raw” loom file with raw counts and other stuff is 1.3 GB. If you convert the counts to a HDF5Matrix, the corresponding file is ~800 MB. Then when you process it (dunno what that means), you get a 3 GB file? Is that it?
Dania Machlab (14:14:57): > the processed version of the raw count matrix is also downloaded from Tabula Muris Senis (they have done some processing of their own). The loom file of the processed matrix is 2.8G, and saving that as an HDF5 file results in a file of size 3.2G.
Aaron Lun (14:16:19): > seems close enough to me
Aaron Lun (14:17:21): > especially if you’re not getting the same chunk dimensions
Dania Machlab (14:17:34): > so you wouldn’t expect more compression (like to the extent with the raw matrix: for loom vs h5)
Aaron Lun (14:18:28): > Unless they’ve changed something dramatically, loom is HDF5. Just the same wine in a new bottle.
Dania Machlab (14:18:42): > chunk dim wise I’m using the same ones, and both matrices have the same dimensions
Aaron Lun (14:18:56): > you don’t even need to do a conversion, just have HDF5Array() read off the loom file.
Dania Machlab (14:19:33): > yeah that works too
Dania Machlab (14:19:40): > oki thanks!
2020-11-18
Liliana Zięba (11:39:43): > @Liliana Zięba has joined the channel
2020-11-21
SM (05:38:35): > @SM has joined the channel
SM (05:39:12): > @SM has left the channel
2020-12-02
Konstantinos Geles (Constantinos Yeles) (05:42:06): > @Konstantinos Geles (Constantinos Yeles) has joined the channel
2020-12-03
cottamma (10:07:40): > @cottamma has joined the channel
2020-12-05
Aaron Lun (18:12:06): > Do we have any advice on using the separate RNG streams in BiocParallel?
Aaron Lun (18:12:55): > It seems like it could be very nice to have the ability for bplapply and friends to handle the construction of the different streams.
Aaron Lun (18:13:17): > I, for one, always forget how I’m meant to set them up. clusterRNGstream() or something like that.
Martin Morgan (18:30:52): > Setting a seed in the constructor (and then, because of an open issue in BiocParallel, starting the cluster) leads to reproducible streams, even across back-ends > > > p = bpstart(SnowParam(5, RNGseed=123)) > > unlist(bplapply(1:10, \(i) rnorm(1), BPPARAM=p)) > [1] -0.9685927 0.7061091 -0.4094454 0.8909694 -0.4890608 0.4330424 > [7] -1.0388664 1.5745125 0.7613014 2.2994158 > > p = bpstart(SnowParam(5, RNGseed=123)) > > unlist(bplapply(1:10, \(i) rnorm(1), BPPARAM=p)) > [1] -0.9685927 0.7061091 -0.4094454 0.8909694 -0.4890608 0.4330424 > [7] -1.0388664 1.5745125 0.7613014 2.2994158 > > p = bpstart(MulticoreParam(5, RNGseed=123)) > > unlist(bplapply(1:10, \(i) rnorm(1), BPPARAM=p)) > [1] -0.9685927 0.7061091 -0.4094454 0.8909694 -0.4890608 0.4330424 > [7] -1.0388664 1.5745125 0.7613014 2.2994158 >
> But there are caveats. > 1. The number of workers can’t change > > > > p = bpstart(MulticoreParam(6, RNGseed=123)) > > unlist(bplapply(1:10, \(i) rnorm(1), BPPARAM=p)) > [1] -0.9685927 0.7061091 -0.4094454 -0.4890608 0.4330424 -1.0388664 > [7] 1.5745125 0.7613014 -1.1488680 1.0644774 >
> 2. The default strategy of dividing work (sending a subset of X to each worker, with bpnworkers(p) jobs) is used (time for each task doesn’t matter to RNG; note sample() on the master doesn’t influence streams on workers) > > > p = bpstart(MulticoreParam(5, RNGseed=123)) > > unlist(bplapply(sample(10), \(i) { Sys.sleep(i/10); rnorm(1) } , BPPARAM=p)) > [1] -0.9685927 0.7061091 -0.4094454 0.8909694 -0.4890608 0.4330424 > [7] -1.0388664 1.5745125 0.7613014 2.2994158 > > rather than a more dynamic strategy (e.g., setting tasks= in the constructor to the length of X, resulting in length(X) jobs, where the scheduling of tasks to workers depends on the duration of each task) > > > p = bpstart(MulticoreParam(5, tasks = 10, RNGseed=123)) > > unlist(bplapply(sample(10), \(i) { Sys.sleep(i/10); rnorm(1) } , BPPARAM=p)) > [1] -0.96859273 -0.40944544 -0.48906078 -1.03886641 0.76130143 0.89096942 > [7] 2.29941580 0.43304237 0.70610908 -0.03195349 >
Aaron Lun (18:34:48): > Hm. I would like the output to be independent of the number of workers and the dividing strategies. One could imagine creating a different substream for each individual task.
Aaron Lun (18:42:29): > this is how DropletUtils handles it, though it’s all done in C++ anyway because I needed to do something else in C++ at the time.
Martin Morgan (18:55:36): > It seems like the only way to do that with current infrastructure is to generate the streams in the master and forward them to the workers in a predefined order. This seems palatable for a modest number of tasks…
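(A rough sketch of that idea, not BiocParallel’s eventual implementation: pre-generate one L’Ecuyer-CMRG substream per task in the master with parallel::nextRNGSubStream() and ship it along, so the result no longer depends on the number of workers or on scheduling.)

    library(parallel)
    library(BiocParallel)

    ntasks <- 10
    RNGkind("L'Ecuyer-CMRG")
    set.seed(123)

    # one substream per task, generated up front in the master
    seeds <- vector("list", ntasks)
    s <- .Random.seed
    for (i in seq_len(ntasks)) {
        s <- nextRNGSubStream(s)
        seeds[[i]] <- s
    }

    res <- bplapply(seq_len(ntasks), function(i, seeds) {
        assign(".Random.seed", seeds[[i]], envir = .GlobalEnv)  # task i always uses substream i
        rnorm(1)
    }, seeds, BPPARAM = MulticoreParam(4))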
Aaron Lun (18:59:08): > I can try to make a PR if it’s sufficiently non-trivial.
2020-12-08
Aaron Lun (20:59:08): > @Hervé Pagès any thoughts on https://github.com/Bioconductor/DelayedArray/pull/81 and https://github.com/Bioconductor/DelayedArray/pull/82?
Hervé Pagès (21:50:31): > will get to this ASAP
Aaron Lun (22:10:07): > I also suspect something in my own stack is doing some kind of operation to cause it to be treated as non-sparse… need to figure out what’s going on there.
2020-12-09
Aaron Lun (02:07:18): > hm. able to get a 2-fold speed up on the standard workflow on my laptop, but can’t get it on my cluster! head scratching time.
2020-12-13
Huipeng Li (20:36:14): > @Huipeng Li has joined the channel
2020-12-14
Thomas Naake (08:55:23): > @Thomas Naake has joined the channel
Bharati Mehani (20:03:44): > @Bharati Mehani has joined the channel
2020-12-15
Fredrick E. Kakembo (01:50:21): > @Fredrick E. Kakembo has joined the channel
Jenny Brown (03:04:13): > @Jenny Brown has joined the channel
2020-12-17
Aaron Lun (02:35:49): > @Mike SmithIs rhdf5 guaranteed to store its matrices in transposed form? I remember some talk about adding a flag to indicate whether it was transposed or not.
Mike Smith (02:49:04): > No. I can write something more comprehensive later, but for now take a look at the native argument.
Aaron Lun (02:54:07): > I’m guessing that we don’t have an auto-detection mechanism for whether it was saved natively or not.
Aaron Lun (02:54:42): > I mean, in this case, it’s quite convenient, because I needed to transpose anyway.
Aaron Lun (02:55:51): > okay, if there’s no flag, that’s one less thing for me to worry about.
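A small, hedged illustration of the native= flag mentioned above (assuming it is exposed by h5read()/h5write() as in recent rhdf5 versions); it controls whether data is transposed to match R’s column-major layout or kept in the file’s row-major (C) order:

    library(rhdf5)

    tmp <- tempfile(fileext = ".h5")
    h5createFile(tmp)
    h5write(matrix(1:6, nrow = 2), tmp, "mat")
    str(h5read(tmp, "mat"))                  # default: transposed back to the R layout
    str(h5read(tmp, "mat", native = TRUE))   # native: returned as stored, no transposition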
2020-12-21
Giacomo Antonello (04:21:09): > @Giacomo Antonello has joined the channel
Faris Naji (07:22:57): > @Faris Naji has joined the channel
Yue Pan (09:06:46): > @Yue Pan has joined the channel
2020-12-24
António Domingues (07:44:32): > @António Domingues has joined the channel
2021-01-04
Pablo Rodriguez (13:55:21): > @Pablo Rodriguez has joined the channel
Pablo Rodriguez (14:05:04): > Hi. I’m trying to perform an svd on a very large subset of a very large HDF5Array. Is there any hint on how to process it without huge amounts of RAM? Is there any block-processing approach to do this? > I saw in the 2019 biocWorkshop “Effectively using the DelayedArray…” a comment about using the scran function quickCluster(), which performs an svd using irlba, but since I don’t want to use any clustering, I tried to figure out how to use irlba in this case but miserably failed. Any hint will be much valued. Thanks!
Aaron Lun (15:03:41): > It should be fairly easy to do so withBiocSingular::runSVD()
andBSPARAM
set toRandomParam()
.
Aaron Lun (15:04:01): > For example, this is what is used to process the 350k cell dataset in the OSCA book.
Davide Risso (15:36:24): > Yes, it works with 1.3 million cells too! Note that if you want to use irlba you have to set BSPARAM to IrlbaParam()
Davide Risso (15:36:57): > But RandomParam(), which performs randomized PCA, is faster and similar in terms of accuracy
Davide Risso (15:37:54): > Faster with HDF5 as input that is
2021-01-05
Robert Castelo (02:02:56): > @Robert Castelo has joined the channel
Pablo Rodriguez (06:31:11): > Thanks, I’ll give it a try!!
Aaron Lun (12:25:57): > You’ll also want to ramp up the number of cores
Aaron Lun (13:01:35): > I wonder whether we should create a DelayedArray seed equivalent to the dgCMatrix that isn’t constrained by the 2e9 limit on the number of non-zero entries. This would be useful for people handling large datasets who do have enough memory to realize it all, while the DA framework would protect against unnecessary copying. (A dgCMatrix at that limit would be around ~25 GB, so if we had 100 GB to work in and could guarantee no copies, we would be pretty safe).
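For reference, the rough arithmetic behind that ~25 GB figure (an editorial estimate, not from the thread): each non-zero entry of a dgCMatrix costs 8 bytes for the double value plus 4 bytes for the integer row index, so

    nnz <- 2^31 - 1        # the ~2e9 limit on non-zero entries
    (8 + 4) * nnz / 1e9    # ~25.8 GB, plus the @p column pointers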
Hervé Pagès (13:14:10): > That kind of was the original purpose of RleMatrix but that project has stalled.
Aaron Lun (13:15:29): > Just realized that if we store ints that would go down to around 12 GB.
Hervé Pagès (13:24:37): > Yeah the fact that dgCMatrix always stores the non-zero values as numeric is not really optimal and something that RleMatrix intends to address.
2021-01-12
Federico Marini (11:11:17): > Hi there! > I have the following issue, for a dataset I had originally in h5ad format - which was then used to create the loom format. > I want to deploy an instance of iSEE to explore that set, and it all works nicely locally. > My files/folder structure (on a MBP) for the iSEE_ci
folder: > > ├── B_cells_final_annotated > │ ├── obs.csv > │ ├── obsm.csv > │ ├── processed.loom > │ ├── uns > │ │ ├── Time_colors.csv > │ │ ├── batch_colors.csv > ... > │ │ └── umap.csv > │ ├── var.csv > │ └── varm.csv > ├── B_cells_final_annotated.h5ad > ├── sce_Bcells.RDS >
> Now, thesce
file was created withsaveRDS
. > > Locally: no problem. > > But then I try to deploy that to the server, and from the log I can see that I get > > Warnung: Error in h5mread: failed to open file '/Users/fede/Development/iSEE_ci/B_cells_final_annotated/processed.loom' >
> … and accordingly I do not see the FeatureAssayPlot, for example (but for the Reduced Dimension Plot, no prob) > > Might be a beginner question for the experts of hdf5-based representations, so I apologize in advance if that’s the case > Tagging@Hervé Pagèsand@Aaron Lunon this, maybe you can be most helpful?
Hervé Pagès (15:01:22): > @Federico MariniTry to save the object withsaveHDF5SummarizedExperiment()
instead.saveRDS()
is only safe to use on an in-memory object but is unsafe on an object with on-disk data. See “Difference between saveHDF5SummarizedExperiment() and saveRDS()” in?saveHDF5SummarizedExperiment
for an explanation why.
Federico Marini (15:22:36): > Oh perfect, thank you Herve! > Don’t ask me why but I was looking for something SCE-specific
Federico Marini (15:23:01): > Will run that and give you feedback, but I guess it will solve it:wink:
Hervé Pagès (15:30:55): > He he, this is a major benefit of having a bunch of specialized SummarizedExperiment derivatives in Bioconductor. They share a lot of things.
Hervé Pagès (15:33:50): > Even though it can be tricky for the end user to know where to look for some functionality: in the documentation of the specialized derivative, or in the documentation of the base class, or in-between?
Federico Marini (16:17:47): > Yeah that was the catch in my case. Thanks for clarifying this! > BTW: works like a charm as expected - I did not set up any of the chunkdim options or so; do any of these options have a major impact on performance once it is re-loaded?
Hervé Pagès (17:15:57): > How to set the chunkdim optimally depends very much on the access pattern of your typical downstream analysis. If the access is going to be row-wise only, choose chunks that contain full rows; if column-wise only, then chunks that contain full columns; if not sure, then 100x100 or 250x250 chunks. As a general rule of thumb, I would suggest keeping the size of the chunks under 1e5 array values. If the data is sparse, chunks can be made bigger, e.g. 1e6 or more, depending on how sparse it is.
Hervé Pagès (17:18:25): > OTOH chunks shouldn’t be too small either because there’s a bookkeeping cost in having a crazy number of chunks e.g. tens of millions of chunks.
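A minimal sketch of applying that advice when saving, assuming a column-wise access pattern; the tiny SummarizedExperiment and the chunk geometry below are illustrative only:

    library(SummarizedExperiment)
    library(HDF5Array)

    se <- SummarizedExperiment(list(counts = matrix(rpois(5000, 5), nrow = 50)))
    saveHDF5SummarizedExperiment(se, dir = "se_h5",
                                 chunkdim = c(nrow(se), 1),  # full-column chunks for column-wise access
                                 replace = TRUE)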
Federico Marini (17:32:12): > Got it. The usage in this case is merely via iSEE for exploration, and the sets are not too large - speaking in this case of approx. 10k to 50k cells
2021-01-13
Davide Risso (02:06:12): > @Stephanie Hicksand I have studied this a little bit in the context of cell (column) clustering inhttps://www.biorxiv.org/content/10.1101/2020.05.27.119438v1
Davide Risso (02:07:35): > As expected column-wise chunks are much faster to access than row-wise in this case, but default chunk shape’s performance is surprisingly close to column-wise
Davide Risso (02:08:24): > (See Figure 4)
Davide Risso (02:09:12): > So my suggestion would be to stick to the default unless you have a very good reason to change
Tim Triche (09:35:47): > thanks for reminding me to steal that code for compartmap@Davide Risso:wink:
Espen Riskedal (10:25:13): > @Espen Riskedal has joined the channel
2021-01-15
Aaron Lun (02:07:17): > the garbage collector + MulticoreParam problems are back!
Aaron Lun (02:08:03): > though in this case, I am at least getting a friendlier “vector cannot be allocated” instead of the OOM killer wiping out my jobs
Aaron Lun (02:16:01): > Probably because I’m trying to realize the 1.3M dataset into memory.
Aaron Lun (02:16:39): > (Deliberately; I want to check performance in high-memory environments. Apparently RAM isn’t a constraint anymore.)
Aaron Lun (02:20:43): > Hm. I might try giving it 50 GB next time.
2021-01-18
Vince Carey (07:46:09): > Has anyone considered a DelayedArray interface to gdsfmt?
Martin Morgan (09:09:33): > Seehttps://github.com/Bioconductor/Contributions/issues/1809#issuecomment-747941363et seq., and alsohttp://bioconductor.org/packages/devel/bioc/html/GDSArray.html. I think there is recent discussion elsewhere, but am not sure exactly where - Attachment (Bioconductor): GDSArray (development version) > GDS files are widely used to represent genotyping or sequence data. The GDSArray package implements the GDSArray
class to represent nodes in GDS files in a matrix-like representation that allows easy manipulation (e.g., subsetting, mathematical transformation) in R. The data remains on disk until needed, so that very large files can be processed.
Qian Liu (12:39:19): > @Qian Liu has joined the channel
Aaron Lun (13:49:09): > It was in the SCArray Contributions thread.
Qian Liu (22:36:03): > Hi, just saw the SCArray thread. I will work with Xiuwen to update GDSArray to accommodate more general GDS formats, not only the current formats from SNPRelate and SeqArray, which correspond to GWAS and sequencing data respectively.
2021-01-22
Annajiat Alim Rasel (15:40:59): > @Annajiat Alim Rasel has joined the channel
2021-01-27
Dario Righelli (05:19:35): > @Dario Righelli has joined the channel
Dario Righelli (05:46:57): > Hi everyone, I’m trying to do some operations on a dataset of 1093036 cells (on the rows) and 31053 genes (on the columns) in csv format. To work with such a dataset I wanted to save the matrix as an HDF5 file, but I’m having trouble doing it. > Of course it’s not so easy to load the whole matrix in memory, so I started to split the file into smaller files, dividing it by columns (100 columns per file). > Then I loaded the small files in memory, transposed the small matrices (to have genes x cells matrices), and saved small HDF5 files by using HDF5Array::writeHDF5Array(small-matrix, name="myname"). > Then I start from these small-matrix HDF5 files, load them with HDF5Array(filepath=x, name="myname") and rbind them into a unique HDF5Array. > At this point I’d like to save this huge DelayedMatrix (that’s the class of the matrix after all the bindings) as an HDF5 file to work with it and do operations on the rows and on the columns, such as computing means per gene, library sizes, etc., but when I try, it takes forever and the file size on disk is always 1.4kb. > Do you have any suggestion for this process? > p.s. I’ve also tried to set this option: options(DelayedArray.block.size=1e9) # 1GB block size.
p.p.s. code in the thread
Dario Righelli (05:55:55) (in thread): > the script > > ngenes=31054 # cols > ncells=1093785 # rows > for (i in seq(1, ngenes, by=99)) > { > > if( i==30988 ) { endcol=ngenes } else { endcol=(i+98) } > cmd <- paste0("cut -d , -f", i, "-", endcol, " matrix.csv > splitmat/matrix_", i, "_", endcol, ".csv") > print(cmd) > system(cmd) > } > > library(HDF5Array) > > splitmats <- list.files("splitmat", pattern=".csv$", full.names=TRUE) > sampnames <- read.csv("sample_names.csv") > colnames(sampnames) <- NULL > lapply(splitmats, function(filemat) > { > mm <- read.csv(filemat) > mm <- t(mm) > # mm <- Matrix(mm) > colnames(mm) <- sampnames[[1]] > cat("Saving ", paste0("splitmat/HDF5/", basename(filemat), ".h5"), "\n") > # writeTENxMatrix(mm, paste0("splitmat/HDF5/", basename(filemat), ".h5"), group="Allen") > writeHDF5Array(mm, paste0("splitmat/HDF5/", basename(filemat), ".h5"), name="Allen") > }) > > system("ulimit -s unlimited") > library("HDF5Array") > hdf5s <- list.files("splitmat/HDF5/", pattern="h5$", full.names=TRUE) > AMM <- HDF5Array(filepath=hdf5s[1], name="Allen") > hdf5s <- hdf5s[-1] > lapply(hdf5s, function(x) > { > print(x) > mm <- HDF5Array(filepath=x, name="Allen") > print(which(hdf5s == x)) > AMM <<- rbind(AMM, mm) > print(dim(AMM)) > }) > options(DelayedArray.block.size=1e9) > writeHDF5Array(AMM, "splitmat/HDF5/whole_mat/Allen_Brain.h5", name="Allen") >
Pablo Rodriguez (05:59:52) (in thread): > If you enable verbose block processing with DelayedArray:::set_verbose_block_processing(TRUE), do you actually see the list of hdf5 files being processed?
Pablo Rodriguez (06:02:39) (in thread): > like this message: > > / Reading and realizing block 1/1 ... OK > \ Writing it ... OK >
Dario Righelli (06:04:36) (in thread): > that’s a good hint, thanks! > It seems stuck at: / Reading and realizing block 1/1480 ...
Pablo Rodriguez (06:10:20) (in thread): > I do something similar (I end up with a list of small hdf5 files that are each actually a column of a bigger matrix), but instead of creating a new hdf5 file from the list of smaller ones with writeHDF5Array(), once I have the list of smaller hdf5 files (I think in your script this list is hdf5s), I do > > AMM <- do.call(DelayedArray::cbind, hdf5s) # or rbind, if that's what you need >
> Then you could modify AMM as you wish
Dario Righelli (06:11:28) (in thread): > thanks I’m gonna try
Dario Righelli (06:14:57) (in thread): > hdf5s
is a list of file paths, are you binding them?
Pablo Rodriguez (06:16:32) (in thread): > No, I do cbind with a list of files, not with their paths. Sorry, I misinterpreted your list
Pablo Rodriguez (06:18:20) (in thread): > If I recall correctly, in this piece of your code: > > lapply(splitmats, function(filemat) > { > mm <- read.csv(filemat) > mm <- t(mm) > # mm <- Matrix(mm) > colnames(mm) <- sampnames[[1]] > cat("Saving ", paste0("splitmat/HDF5/", basename(filemat), ".h5"), "\n") > # writeTENxMatrix(mm, paste0("splitmat/HDF5/", basename(filemat), ".h5"), group="Allen") > writeHDF5Array(mm, paste0("splitmat/HDF5/", basename(filemat), ".h5"), name="Allen") > }) >
> You could save the result of thatlapply()
in a variable and then you would have the list of hdf5 files, then do therbind
Dania Machlab (06:48:26) (in thread): > Could this be useful? I haven’t used it myself but seems handy for manipulating large csv files without loading them into memory:https://diskframe.com/index.html - Attachment (diskframe.com): Larger-than-RAM Disk-Based Data Manipulation Framework > A disk-based data manipulation tool for working with large-than-RAM datasets. Aims to lower the barrier-to-entry for manipulating large datasets by adhering closely to popular and familiar data manipulation paradigms like dplyr verbs and data.table syntax.
Stephanie Hicks (07:17:32) (in thread): > I had a similar problem of the code chunks getting stuck at the first one in this project and the trick was to make block size smallerhttps://github.com/stephaniehicks/methylCCPaper/blob/f67fdfa77a38107ebb04a49ecb251f850248107a/case-studies/blueprint-wgbs/bp2019-05-apply-qc.R#L43
Dario Righelli (07:53:14) (in thread): > Thanks, I’ll try all these approaches!:smile:
Hervé Pagès (11:09:23) (in thread): > Using a binary delayedrbind()
in a loop means that you are piling 310 delayed operations on top of yourAMM
object. Don’t do this! > > I wouldn’t use any delayed operation for that. Just iterate on the small files and write them to the destination file as you go. No need to delay the writing, no need to bind anything. > > Also I suggest that you transpose the data. > > Something like this (untested): > > library(HDF5Array) > > hdf5s <- list.files("splitmat/HDF5/", pattern="h5$", full.names=TRUE) > dest_path <- "splitmat/HDF5/whole_mat/Allen_Brain.h5" > dest_name <- "Allen" > > sink <- HDF5RealizationSink( > dim=c(31053, 1093036), > type="integer", > filepath=dest_path, > name=dest_name, > chunkdim=c(100, 100)) > > sink_grid <- rowAutoGrid(sink, nrow=100) > nblock <- length(sink_grid) > for (bid in seq_len(nblock)) { > message("Loading ", hdf5s[[bid]]) > block <- t(h5mread(hdf5s[[bid]], "Allen")) > message("Writing it to ", dest_path) > viewport <- sink_grid[[bid]] > write_block(sink, viewport, block) > } > AMM <- as(sink, "HDF5Array") >
Pablo Rodriguez (11:19:25) (in thread): > I suggested using a delayedrbind()
because I thought that the list obtained on the first script was a list of hdf5 files, not hdf5 file paths. > Using a RealizationSink is a great option, though.
Dario Righelli (11:44:51) (in thread): > thanks@Hervé PagèsI didn’t know about the sink function!
Hervé Pagès (11:57:49) (in thread): > @Pablo Rodriguez A solution based on do.call(DelayedArray::cbind, list_of_h5objects) might work because it uses a single N-ary cbind(). This is different from piling hundreds of binary delayed cbind() or rbind() on a DelayedArray object. Generally speaking, having a big pile of hundreds of delayed operations on a DelayedArray will make realization of the individual blocks very expensive during block processing.
Tim Triche (12:41:03): > doesdim(HugeDelayedMatrix)
produce the expected result? Also the block size for writing can affect this,@Hervé Pagèsmay have more insight
Tim Triche (12:41:34): > oh woops good job ignoring the huge thread tim. sorry Dario
Espen Riskedal (12:55:13) (in thread): > Alternatively you can use a file-backed big matrix and write the .csv to it in batches. I’ve combined multiple .csv files together into a large big matrix this way.
Espen Riskedal (12:58:24) (in thread): > Essentially I used vroom to find the dimensions of the .csv files, used data.table to read in the files in batches, and wrote it all into a bigmatrix. - Attachment (cran.r-project.org): vroom: Read and Write Rectangular Text Data Quickly > The goal of ‘vroom’ is to read and write data (like ‘csv’, ‘tsv’ and ‘fwf’) quickly. When reading it uses a quick initial indexing step, then reads the values lazily, so only the data you actually use needs to be read. The writer formats the data in parallel and writes to disk asynchronously from formatting. - Attachment (cran.r-project.org): data.table: Extension of ‘data.frame’ > Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development. - Attachment (cran.r-project.org): bigmemory: Manage Massive Matrices with Shared Memory and Memory-Mapped Files > Create, store, access, and manipulate massive matrices. Matrices are allocated to shared memory and may use memory-mapped files. Packages ‘biganalytics’, ‘bigtabulate’, ‘synchronicity’, and ‘bigalgebra’ provide advanced functionality.
Hervé Pagès (13:07:29): > I don’t know whatHugeDelayedMatrix
is
Tim Triche (14:14:12): > Dario’s thing
Vince Carey (14:16:26) (in thread): > Nice@Espen Riskedal… do we have any information on relative performance of bigmatrix and hdf5 for submatrix retrieval?
Vince Carey (14:17:34) (in thread): > Also is bigmatrix R-specific? Are there readers in other languages?
Espen Riskedal (14:19:47) (in thread): > Bigmatrix is R only AFAIK. It can also only subset in contiguous parts. I don’t know its performance vs hdf5.
Espen Riskedal (14:20:21) (in thread): > (it can subset in more advanced ways, but then it returns a data frame, and the whole point goes away)
2021-01-28
Aaron Lun (03:51:40): > @Mike Smith anything I should know about writing to an extensible dataset in rhdf5? I plan to just try h5writeDataset with increasing start.
Dario Righelli (04:42:16) (in thread): > Thanks Tim, I solved with the solutions Pablo and Hervè provided me!:smile:
Mike Smith (05:49:08) (in thread): > Do you know the final size in advance and just want to write parts incrementally, or are you appending as needed? I haven’t looked at the code for a while, but I think h5createDataset will set the max dim sizes equal to the initial array you try to write. If you know in advance you can set them via the maxdims argument, otherwise you might need to look at using H5Screate and H5Sunlimited()
Mike Smith (05:49:49) (in thread): > Maybe that last part can actually be used with h5createDataset(maxdims = H5Sunlimited()) directly - I’m not sure I’ve tried.
Aaron Lun (11:33:08) (in thread): > the latter, appending. I’ll have a try tonight and see if anything sticks.
2021-01-29
Aaron Lun (03:59:06) (in thread): > I think it works if I repeatedly call h5set_extent to resize the dataset. Not sure if this is entirely legit.
Aaron Lun (04:05:15) (in thread): > if you’re curious you can look athttps://github.com/theislab/zellkonverter/pull/35/commits/38c007f128d62c131cc43940fbc29ee53a45ffd7
Aaron Lun (04:05:44) (in thread): > write_CSR_matrix
for the set up,blockwise_sparse_writer
for the appending.
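A minimal sketch of the append pattern discussed in this thread, assuming rhdf5; the dataset name, block contents and chunk size are made up for illustration:

    library(rhdf5)

    fname <- tempfile(fileext = ".h5")
    h5createFile(fname)

    ## create the dataset with the first block's size but an unlimited first dimension
    block1 <- matrix(1L, nrow = 50, ncol = 10)
    h5createDataset(fname, "counts",
                    dims = dim(block1),
                    maxdims = c(H5Sunlimited(), 10),
                    storage.mode = "integer",
                    chunk = c(50, 10))
    h5write(block1, fname, "counts")

    ## append another block: grow the first dimension, then write into the new rows
    block2 <- matrix(2L, nrow = 30, ncol = 10)
    h5set_extent(fname, "counts", c(nrow(block1) + nrow(block2), 10))
    h5write(block2, fname, "counts", start = c(nrow(block1) + 1, 1))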
mariadermit (04:48:36): > @mariadermit has joined the channel
Dario Righelli (07:21:31) (in thread): > @Hervé PagèsI’m having this error from theHDF5RealizationSink
function: > > Error in validObject(.Object) : > invalid class "HDF5RealizationSink" object: 1: invalid object for slot "dim" in class "HDF5RealizationSink": got class "numeric", should be or extend class "integer" > invalid class "HDF5RealizationSink" object: 2: invalid object for slot "chunkdim" in class "HDF5RealizationSink": got class "numeric", should be or extend class "integer_OR_NULL" >
Hervé Pagès (11:19:08) (in thread): > I guess it means what it says i.e. that you must supply integer vectors fordim
andchunkdim
to theHDF5RealizationSink()
constructor.
2021-02-02
Dario Righelli (08:51:53) (in thread): > Hi@Hervé Pagèssorry again, I’m still trying to execute the “sink” code, but I’m having another error that I’m not able to account for… > When I run this one: > > sink_grid <- rowAutoGrid(sink, nrow=99) > nblock <- length(sink_grid) > for (bid in seq_len(nblock)) { > message("Loading ", hdf5s[[bid]]) > block <- h5mread(hdf5s[[bid]], "Allen") > message("Writing it to ", dest_path) > viewport <- sink_grid[[bid]] > message("viewport dim: ", dim(viewport), " block dim: ", dim(block)) > write_block(sink, viewport, block) > } >
> I get this error: > > Loading splitmat/HDF5//001_matrix_1_99.csv.h5 > Writing it to splitmat/HDF5/whole_mat/Allen_Brain.h5 > viewport dim: 991093785 block dim: 991093036 > Error in write_block(sink, viewport, block) : > identical(dim(block), dim(viewport)) is not TRUE >
> What am I doing wrong? the h5 files are of 99 rows each, so I changed thenrow
in therowAutoGrid
and removed thetranspose
operation for the reading.
Pablo Rodriguez (09:56:11) (in thread): > The sink and the hdf5 file dimensions are not equal: the viewport dim is 99 x 1093785 and the block dim is 99 x 1093036 for that first hdf5 file. > Take note that the sink dimensions should be big enough to accommodate all the hdf5 files. > Example: if you have two hdf5 files of 99 rows x 100 cols, the sink file should have 198 rows x 100 cols (if you rbind) or 99 rows x 200 cols (if you cbind)
Hervé Pagès (12:14:11) (in thread): > Right, the geometry of the sink doesn’t seem to reflect the geometry of your files. In the code I showed earlier, I assumed that you have 31053 genes and 1093036 cells, and 100 genes per file, because that’s what you originally said you have. But now I see that later you also wrote > > ngenes=31054 # cols > ncells=1093785 # rows > for (i in seq(1, ngenes, by=99)) { > ... > } >
> so I don’t know. Note that generating the small CSV and HDF5 files in advance is not needed. You could instead cut the big CSV file as you go in thefor (bid in seq_len(nblock)) {...}
loop and immediately load the small CSV file as a matrix. Having the cutting logic close to the reading logic maybe would help making sure that they are in sync.
Dario Righelli (14:47:19) (in thread): > thanks guys! tomorrow I’ll try it again!
Dario Righelli (14:48:57) (in thread): > Just for completeness, I was already creating a sink with the correct dimensions after double checking them … > > sink <- HDF5RealizationSink( > dim=as.integer(c(31053, 1093785)), > type="integer", > filepath=dest_path, > name=dest_name, > chunkdim=as.integer(c(100, 100))) >
> EDIT: turned out I was using the wrong dimensions! Thanks again guys!
2021-02-03
Hervé Pagès (01:00:36) (in thread): > Also used internally bywriteTENxMatrix()
:https://github.com/Bioconductor/HDF5Array/blob/f10e77dae4ceb18a14e21c28266f412e3e188949/R/writeTENxMatrix.R#L81-L85h5append()
callsrhdf5::h5set_extent()
before writing the data. If there’s a better way to write data to an extensible dataset, I’d be curious to know.
2021-02-05
MARC SUBIRANA I GRANÉS (09:57:19): > @MARC SUBIRANA I GRANÉS has joined the channel
gargi (16:26:35): > @gargi has joined the channel
2021-02-11
Aaron Lun (01:46:37): > I wonder whether we could make a NumpyArray, which uses a reticulate binding to an ndarray as a DA seed. This would avoid making a copy of the entire array in R’s memory, allowing us to extract parts from the numpy array as needed.
2021-02-15
Pablo Rodriguez (04:28:46) (in thread): > I’m not very familiar with Python but could“this”be a solution or inspiration for your use case? An R wrapper for Python’sanndata
package
Aaron Lun (11:23:28) (in thread): > no, not really, we already have that in zellkonverter.
Michael Lawrence (15:47:02) (in thread): > I like this idea. Immediate access to all of the numpy implementations.
2021-02-19
abdullah hanta (01:27:20): > @abdullah hanta has joined the channel
2021-02-22
Dario Righelli (04:59:27): > Hi guys, > thanks to thisprevious thread on this channelI’ve been able to create aDelayedArray
based on aHDF5
file. > Now I need to do some computations on this object, such as computing (row)CPMs
androwMeans
per column blocks. > I discovered the functionsrowAvgsPerColSet
andcolAvgsPerRowSet
, but obviously they are not always applicable… > Indeed, when I need to compute trimmed means it cannot be done, because the rowMeans function doesn’t support the trim argument. > So I tried to do something like tmeans[,j] <- apply(AMM[1:10,jj], 1, mean, trim=0.25)
(wherejj
is the column set of interest), but this error message came up: > > Error in .local(x, ...) : > mean() method for DelayedArray objects does not support the 'trim' > argument yet >
> Do you know any way to avoid this problem?:slightly_smiling_face:Thanks!
Davide Risso (07:08:51): > Hey@Dario Righellithis (untested) should do what you want: > > mygrid <- RegularArrayGrid(dim(AMM), c(1, ncol(AMM))) > means <- blockApply(AMM, mean, trim=.25, grid=mygrid) >
Davide Risso (07:10:47): > I have the feeling that there might be a more efficient way (e.g., working with a bunch of rows at a time rather than row-by-row), but I’ll let@Peter Hickey@Mike Smithor@Hervé Pagèsand all the other hdf5 experts comment on that…
Hervé Pagès (13:06:53): > @Dario RighelliCan’t you trim the matrix yourself? Or maybe I don’t understand whattrim
is about. To process more than 1 row at eachblockApply
iteration, use a grid where the blocks are made of several rows e.g. with something likemygrid <- rowAutoGrid(AMM, nrow=50)
.
Hervé Pagès (13:09:49): > I should add that, for better performance, the number of rows for each block should preferably be a multiple of the number of rows of the physical HDF5 chunks. Usechunkdim(AMM)
to get the geometry of the chunks. The general idea is that the blocks should contain full chunks, so whenblockApply
walks on the blocks, each chunk is loaded only once. If you callrowAutoGrid()
without specifying the nb of rows, it will try to optimize things for you by choosing the greatest nb of rows that is a multiple ofchunkdim(AMM)[[1]]
and that keeps the size of the blocks belowgetAutoBlockSize()
.
Vince Carey (16:03:32): > Hi@Hervé Pagès”trim” in mean function is a method of robustifying the statistic to outlying observations. For trim=.25, data in highest and lowest quartiles of the dataset are ignored and the mean of the remaining “inner” data elements is returned.
Aaron Lun (16:07:15): > I would use beachmat::rowBlockApply and then do the tmeans thing above in the function passed to rowBlockApply.
Kasper D. Hansen (16:17:10): > any idea howbeachmat
would compare to theblockApply/rowAutoGrid
?
Aaron Lun (16:20:09): > it just does the same thing, rowBlockApply is just a friendly wrapper.
Kasper D. Hansen (16:23:58): > I was thinking performance
Aaron Lun (16:24:28): > That’s what I’m saying, it’s just a wrapper around blockApply with rowAutoGrid.
Kasper D. Hansen (16:24:46): > oh
Kasper D. Hansen (16:24:50): > thanks
Hervé Pagès (17:48:59) (in thread): > I see. That makes more sense than what ?base::mean is suggesting: > > trim: the fraction (0 to 0.5) of observations to be trimmed from each end of ‘x’ before the mean is computed. > They probably mean “each end of the quantiles”, which I’m sure is obvious for any statistician. One should always keep in mind that S/R was designed by statisticians for statisticians:grin:
Marcel Ramos Pérez (17:58:47) (in thread): > sorted x, right?
Hervé Pagès (19:33:08) (in thread): > yes, that would be even clearer
2021-02-23
Dario Righelli (04:52:11): > Thanks guys!
Wynn Cheung (10:32:07): > @Wynn Cheung has joined the channel
2021-02-26
Hervé Pagès (03:59:24): > HDF5Array objects now work with files on Amazon S3:https://support.bioconductor.org/p/9135005/#9135150
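A small usage sketch based on that announcement (hedged: it assumes the H5File() s3= interface described in the linked post, and the dataset name "a1" is a placeholder for whatever the file actually contains):

    library(HDF5Array)

    public_url <- "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5"
    fid <- H5File(public_url, s3 = TRUE)   # keeps the S3 connection open
    A <- HDF5Array(fid, "a1")              # on-S3 data wrapped as a DelayedArray
    dim(A)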
Aaron Lun (04:00:38): > very nice.
Mike Smith (04:08:30): > Awesome, thanks Hervé! Was there anything hacky involved in opening the files? I can’t remember at what level I exposed the S3 stuff inH5Fopen
etc. I know I didn’t write an example, but I can’t remember if that was because I was lazy or because it wasn’t possible for a user. If you had to make any modification to the lowlevel H5 functions let me know and I can backport it to rhdf5.
Martin Morgan (06:19:08): > Does this (rhdf5, I guess…) support direct use of the S3 protocol and bucket URLs s3://…? And since the feature iron is hot, is there similar functionality for google’s gs://… (or the public https:// equivalent via google)? > > Also, does it make sense to play better with the R ecosystem for cloud access, e.g., using https://CRAN.R-project.org/package=aws.signature / https://CRAN.R-project.org/package=googleAuthR (? not sure about the google solution) for credentials management? Happy to open issues if that’s appropriate - Attachment (cran.r-project.org): aws.signature: Amazon Web Services Request Signatures > Generates version 2 and version 4 request signatures for Amazon Web Services (‘AWS’) (https://aws.amazon.com/) Application Programming Interfaces (‘APIs’) and provides a mechanism for retrieving credentials from environment variables, ‘AWS’ credentials files, and ‘EC2’ instance metadata. For use on ‘EC2’ instances, users will need to install the suggested package ‘aws.ec2metadata’ (https://cran.r-project.org/package=aws.ec2metadata). - Attachment (cran.r-project.org): googleAuthR: Authenticate and Create Google APIs > Create R functions that interact with OAuth2 Google APIs (https://developers.google.com/apis-explorer/) easily, with auto-refresh and Shiny compatibility.
Mike Smith (06:26:41): > IIRC none of the HDF5 documentation used the S3 protocol, and I was just following their examples of which there are very few! I think it just uses libcurl to access via http(s) and so that has to be available when installing Rhdf5lib. If you know that HDF5 can uses3://
then I don’t think there’s a reason we can’t do that in the R package.
Mike Smith (06:33:21): > The current authentication is an example of the ‘self-declared technical debt’ I learnt about this week! I remember looking at aws.signature at the time, but decided to simplify my workload by using a list with the intention of improving that bit later. > > Important note: for now they must be in this specific order. > I think that comment in the vignette was intended as a FIXME/TODO
Martin Morgan (06:46:10) (in thread): > Maybe@John Readeyor@Vince Careyknows about support for s3; FWIW htslib (CRAM / BAM access) supports direct, indexed access to s3 / gc buckets.
Mike Smith (07:26:09) (in thread): > > the HDF5 command line tools currently do not support the authoritys3
, and we must specify the URL with thehttps
authority. > That’s fromhttps://www.hdfgroup.org/solutions/enterprise-support/cloud-amazon-s3-storage-hdf5-connector/
Dirk Eddelbuettel (08:19:08): > FWIW that is what tiledb does: instead of a local URI such as /share/project/dir it just becomes a URI with S3: s3://project-bucket/dir and ditto for GCS and Azure. See eg https://github.com/TileDB-Inc/TileDB-R/blob/master/inst/examples/ex_S3.R – works by having the two env vars set:
Dirk Eddelbuettel (08:19:27): - File (R): Untitled
Hervé Pagès (12:18:17) (in thread): > > Was there anything hacky involved in opening the files? > Not for opening the file. Was easy to implement usingH5Pset_fapl_ros3
like you did in rhdf5. > > There was some extra work involved in order to keep the connection opened in the HDF5Array object. This is to avoid the cost of re-establishing the connection each time the object needs to read data from the remote file. I introduced a new class for this: H5File. Basically an H5IdComponent object with some additional slots (e.g. filepath/url). However, I ran into a BIG gotcha:https://github.com/Bioconductor/HDF5Array/blob/a0903dfde2fbdf316d583341d0f5b53628815409/R/H5File-class.R#L211-L227Also, the connection ID in the object will not survive serialization/deserialization or transmission to the workers when using SnowParam. And even if it does survive transmission when using MulticoreParam, it’s not safe to use. Seehttps://github.com/Bioconductor/HDF5Array/blob/a0903dfde2fbdf316d583341d0f5b53628815409/man/H5File-class.Rd#L65-L97I might try to remedy this by implementing some kind of auto-reconnect feature. This would need that the S3 credentials can be retrieved on the workers so maybe we should discuss what’s the best way to do this e.g. via some environment variables? Unfortunately it can’t be a file-based solution (on some clusters the working nodes don’t have access to the file system of the head node). > > Even outside the context of parallel evaluation, having something like this would add a lot of convenience (e.g. no need to explicitly specify thes3credentials
argument). It would also allow people to write scripts that don’t expose anything, so they can safely share them.
Hervé Pagès (12:29:28) (in thread): > Bummer!
Hervé Pagès (12:44:20): > @Martin MorganWe should definitely harmonize management of credentials and stick to some well established solution.aws.signaturelooks like a natural candidate andflowWorkspacealready uses it. I’ll take a closer look at it.
Mike Smith (12:49:39): > We might also want to look at whether it makes sense to build the Rhdf5lib Mac binary with support for this - at the moment it doesn’t: https://support.bioconductor.org/p/9134972/ (That thread is all out of order now, hopefully you can follow the conversation). I don’t know enough / haven’t thought about the implications of whether linking against libopenssl when building the binary will be safe to distribute.
Hervé Pagès (12:57:37): > Oh, I missed that part of the conversation. Let me see what the situation is on the Mac builder and what we need to do to make this work. I vaguely remember facing the same problem forrtracklayerat some point (also needs to link to openssl).
John Readey (12:59:51) (in thread): > Yes that’s right. The S3VFD requires “https:”. E.g.
John Readey (13:00:10) (in thread): > > h5ls --vfd=ros3 -r[https://s3.amazonaws.com/pile-of-files/sample.h5](https://s3.amazonaws.com/pile-of-files/sample.h5) >
John Readey (13:00:59) (in thread): > Is this a big problem? All s3 uri’s can be represented as https paths.
John Readey (13:03:31) (in thread): > BTW, the HDF library team would love to get some real-world feedback on use of the S3VFD. Posting tohttps://forum.hdfgroup.org/c/hdf5is probably the best way to get a conversation going. - Attachment (HDF Forum): HDF5 > All HDF5 (and HDF4) questions, potential bug reports, and other issues.
2021-03-01
Diana Hendrickx (03:18:14): > @Diana Hendrickx has joined the channel
Diana Hendrickx (03:18:42): > @Diana Hendrickx has left the channel
2021-03-02
Hervé Pagès (13:20:00) (in thread): > @Mike SmithHey Mike,@Michael Lawrenceuses environment variableOPENSSL_LIBS
inrtracklayerto find the openssl libs. See hisINSTALL
file. This is currently set as: > > export OPENSSL_LIBS="/usr/local/Cellar/openssl@1.1/1.1.1i/lib/libssl.a /usr/local/Cellar/openssl@1.1/1.1.1i/lib/libcrypto.a" >
> on machv2. Because we use static linking, the resultingrtracklayerbinary is self-contained and works everywhere. See thertracklayersource for the details of how this is handled exactly. HopefullyRhdf5libcan do something similar. > PS: Note thatrtracklayerfails to INSTALL on machv2 at the moment. That’s because some recent brew installs on machv2 decided to replace openssl 1.1.1h with 1.1.1i. I just updatedOPENSSL_LIBS
accordingly on the machine so the current error should go away on the next build report.
2021-03-11
Shubham Gupta (11:12:05): > Hi, I want to convert a numeric matrix to data.frame or data.table so that I can add other columns of character type to it. In general I need to do it 90 million times, and the conversion from matrix to data.frame has become the bottleneck of the analysis. I am currently using as.data.frame. Is there a faster way to convert a matrix to a data.frame or data.table?
Martin Morgan (11:13:28) (in thread): > How are you doing it now?
Shubham Gupta (11:13:47) (in thread): > Usingas.data.frame(mat)
Martin Morgan (11:14:15) (in thread): > Why do you need to do this 90 million times?
Shubham Gupta (11:15:12) (in thread): > This is inherent to the problem. There are 90 M features and we create them in the software
Shubham Gupta (11:16:30) (in thread): > The matrix size is generally 5 x 6
Martin Morgan (11:22:05) (in thread): > I guess I was thinking about one call instead of 90 million, > > > m = matrix(1:30, 5) > > mm = replicate(1000000, m, simplify = FALSE) > > system.time(for (elt in mm) as.data.frame(elt)) > user system elapsed > 47.321 0.078 47.431 > > system.time(as.data.frame(do.call(rbind, mm))) > user system elapsed > 2.151 0.201 2.353 >
Shubham Gupta (11:23:35) (in thread): > Right. This makes sense. With a single call it is quite fast
2021-03-19
Helen Miller (15:36:05): > @Helen Miller has joined the channel
2021-03-24
Robert Castelo (04:50:10): > hi, does anybody here know whether there is an equivalent of base R’s max.col() for large sparse matrices stored in dgCMatrix objects? in case you’re not familiar with max.col()
: > > m <- matrix(sample(1:10, size=9), nrow=3) > m > [,1] [,2] [,3] > [1,] 3 6 7 > [2,] 4 9 2 > [3,] 1 5 10 > max.col(m) > [1] 3 2 3 >
> the closest I found is rowMaxs(), but it gives the maximum value per row, not the column where this maximum value is located.
Federico Marini (06:10:25) (in thread): > Pinging@Constantin Ahlmann-Eltzeon this, as he could be aware at best of what is/could be insparseMatrixStats
Constantin Ahlmann-Eltze (06:37:32) (in thread): > No, sorry there isn’t any such method in sparseMatrixStats. It’s not that it would be terribly complicated to implement, but I follow the API of matrixStats and it isn’t implemented there.
Constantin Ahlmann-Eltze (06:50:57) (in thread): > > sparse_max_col <- function(x){ > sparseMatrixStats:::reduce_sparse_matrix_to_num(t(x), function(values, row_indices, number_of_zeros){ > m_idx <- which.max(values) > if(values[m_idx] < 0){ > if(length(values) == ncol(x)){ > row_indices[m_idx] + 1 > }else{ > setdiff(seq_len(ncol(x)), row_indices + 1)[1] > } > }else{ > row_indices[m_idx] + 1 > } > }) > } >
Constantin Ahlmann-Eltze (06:52:10) (in thread): > Something like this should do the job fairly efficiently.setdiff(seq_len(ncol(x)), row_indices + 1)
is probably not ideal for actually large matrices, but I hope it illustrates the idea:slightly_smiling_face:
Robert Castelo (07:00:25) (in thread): > Thanks Constantin!! the benchmarking looks good!!! at least from the perspective of memory consumption: > > library(Matrix) > library(bench) > > p <- 10000L > m <- Matrix(0L, nrow=p*2, ncol=p, sparse=TRUE) > m[cbind(1:nrow(m), sample(1:ncol(m), size=nrow(m), replace=TRUE))] <- sample(1:10, size=nrow(m), replace=TRUE) > > res <- bench::mark(sparse_max_col(m), max.col(m)) > res > # A tibble: 2 x 13 > expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc > <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> > 1 sparse_max_col(m) 90.98ms 90.98ms 11.0 488.99KB 44.0 1 4 > 2 max.col(m) 7.02s 7.02s 0.142 1.49GB 0 1 0 > # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, > # time <list>, gc <list> >
Kasper D. Hansen (07:06:10): > @Henrik BengtssonWe got a feature request for matrixStats
Henrik Bengtsson (07:06:14): > @Henrik Bengtsson has joined the channel
Robert Castelo (07:07:24): > Thanks@Kasper D. Hansen!!@Constantin Ahlmann-EltzeI was also trying to develop my own sparse version, but at the==
operator below somehow the memory consumption explodes: > > sparseMaxCol <- function(m) { > mask <- m == sparseMatrixStats::rowMaxs(m) > wh <- Matrix::which(mask, arr.ind=TRUE, useNames=FALSE) > wh <- wh[!duplicated(wh[, 1]), ] > wh <- wh[order(wh[, 1]), ] > wh[, 2] > } > res <- bench::mark(sparseMaxCol(m), sparse_max_col(m), max.col(m)) > res > # A tibble: 3 x 13 > expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc > <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> > 1 sparseMaxCol(m) 10.04s 10.04s 0.0996 12.67GB 0.498 1 5 > 2 sparse_max_col(m) 72.68ms 75.23ms 9.18 468.94KB 5.51 5 3 > 3 max.col(m) 5.63s 5.63s 0.178 1.49GB 0 1 0 > # … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, > # time <list>, gc <list> >
Constantin Ahlmann-Eltze (07:14:54) (in thread): > I guess the memory problem with==
approach is that if the rowMaxs() is 0, you get a fairly dense lgCMatrix for that row
Robert Castelo (07:17:26) (in thread): > True, but that shouldn’t happen with the particular example I’m using: > > summary(rowMaxs(m)) > Min. 1st Qu. Median Mean 3rd Qu. Max. > 1.000 3.000 5.000 5.477 8.000 10.000 >
Robert Castelo (07:57:35) (in thread): > I was going to open an issue at the matrixStats repo but found this one from @Hervé Pagès which includes this request. - Attachment: #176 Feature request: whichRowMaxs, whichColMaxs, whichRowMins, and whichColMins
Robert Castelo (12:05:55) (in thread): > I’ve found out that recycling one single number doesn’t increase the memory consumption as with recycling a vector: > > sparseMaxCol <- function(m) { > mask <- (m / sparseMatrixStats::rowMaxs(m)) == 1 > wh <- Matrix::which(mask, arr.ind=TRUE, useNames=FALSE) > wh <- wh[!duplicated(wh[, 1]), ] > wh <- wh[order(wh[, 1]), ] > wh[, 2] > } > res <- bench::mark(sparseMaxCol(m), sparse_max_col(m), max.col(m)) > res > # A tibble: 3 x 13 > expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc > <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> > 1 sparseMaxCol(m) 5.7ms 6.7ms 143. 4.65MB 20.0 50 7 > 2 sparse_max_col(m) 75.61ms 78.21ms 12.8 486.91KB 32.0 2 5 > 3 max.col(m) 5.56s 5.56s 0.180 1.49GB 0 1 0 >
> and now this implementation is one order of magnitude faster than yours, but it still consumes one order of magnitude more memory.
Constantin Ahlmann-Eltze (12:10:51) (in thread): > that’s pretty cool. I guess if you want to have the speedup and the low memory, you would need to translate my function to C++. It actually should be straightforward following for examplehttps://github.com/const-ae/sparseMatrixStats/blob/master/src/methods.cpp#L354. Then it would also be easy to replace thesetdiff()
with a simple for loop:slightly_smiling_face: - Attachment: src/methods.cpp:354 > > return reduce_matrix_double(matrix, na_rm, [na_rm](auto values, auto row_indices, int number_of_zeros) -> double{ >
Harry Danwonno (20:07:54): > @Harry Danwonno has joined the channel
2021-04-05
Hervé Pagès (12:08:42) (in thread): > @Mike SmithHi Mike, did you have a chance to try this?
Mike Smith (12:11:54) (in thread): > Sorry@Hervé Pagès, it’s on my to do list, but not got round to it yet. Currently trying to push through the git browser/search tool & trying not to get sidetracked from that! Hopefully should get a chance to try by the end of this week.
Chase Clark (13:10:08): > @Chase Clark has joined the channel
2021-04-06
Alex Bott (18:54:51): > @Alex Bott has joined the channel
2021-04-07
Hervé Pagès (00:19:55) (in thread): > No problem. Just wanted to make sure you saw that.
Vince Carey (19:35:55): > Are there any current compendia of benchmarks of matrix methods relevant to Bioconductor? I know a lot of work has been reported here and is scattered in various places. Would BiocArrayBenchmarks be a sensible package to collect examples, or is there something suitable, perhaps in the non-bioc/R ecosystem, that could be used?
2021-04-08
Kasper D. Hansen (04:20:08): > I’m not aware of anything.
Kasper D. Hansen (04:20:38): > A complexity - which absolutely has to be tackled - is that you also need to report and consider IO, given our use of disk-based storage
Kasper D. Hansen (04:21:10): > What I mean is that when you record timings it becomes pretty important whether it is a network-mounted drive or a local SSD
2021-04-14
Dipanjan Dey (05:20:48): > @Dipanjan Dey has joined the channel
2021-04-15
Harshita Ojha bs17b012 (06:08:20): > @Harshita Ojha bs17b012 has joined the channel
2021-04-16
Ben Story (17:32:23): > @Ben Story has joined the channel
2021-04-30
Winfred Gatua (10:39:59): > @Winfred Gatua has joined the channel
Aaron Lun (12:22:34): > @Hervé Pagèsis the SparseArray repo visible? I also have some more random thoughts that I’d like to stick into the issues, from some experiences with a half-hearted implementation of aCsparseMatrix
.
Hervé Pagès (12:25:43) (in thread): > https://github.com/hpages/SparseArrayIt’s empty, I created it during the call.
Aaron Lun (12:26:07) (in thread): > thanks, was looking in Bioconductor/Sparsearray
Hervé Pagès (12:26:52) (in thread): > I’ll move it there after submission and inclusion to BioC.
Hervé Pagès (12:27:06) (in thread): > or maybe i should just start it there
Hervé Pagès (12:29:43) (in thread): > donehttps://github.com/Bioconductor/SparseArray
Aaron Lun (12:30:10) (in thread): > great, I’ll add some comments and such later today.
2021-05-01
Arjun Krishnan (21:39:02): > @Arjun Krishnan has joined the channel
2021-05-03
Stephen Chen (07:50:20): > @Stephen Chen has joined the channel
Mike Smith (10:33:35) (in thread): > @Hervé Pagès I finally got round to trying this - unfortunately it didn’t seem to go so well: http://bioconductor.org/checkResults/devel/bioc-LATEST/Rhdf5lib/machv2-install.html > * First up it reports checking for EVP_sha256 in -lcrypto... no
which disables the S3 file driver compilation and kind of kills the point of trying this! > * Ignoring that, then there’s a few instances of: > > > ***** Warning: Linking the shared library libhdf5.la against the > ***** static library /usr/local/Cellar/openssl@1.1/1.1.1i/lib/libssl.a is not portable! >
> > * some/Library/Developer/CommandLineTools/usr/bin/ranlib: file: .libs/libhdf5.a(H5CS.o) has no symbols
> * and finally, when it comes to test loading the package: > > > Error: package or namespace load failed for 'Rhdf5lib' in dyn.load(file, DLLpath = DLLpath, ...): > unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-Rhdf5lib/00new/Rhdf5lib/libs/Rhdf5lib.so': > dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-Rhdf5lib/00new/Rhdf5lib/libs/Rhdf5lib.so, 6): Symbol not found: _H5get_libversion > Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-Rhdf5lib/00new/Rhdf5lib/libs/Rhdf5lib.so > Expected in: flat namespace > in /Library/Frameworks/R.framework/Versions/4.1/Resources/library/00LOCK-Rhdf5lib/00new/Rhdf5lib/libs/Rhdf5lib.so >
> I’m mostly recording those here so I have a record of them in the case that we fix this.
Hervé Pagès (18:03:22) (in thread): > Hey Mike, have you been able to make this work on a Mac? Once you’ve figured this out, we can see how to make this work on machv2.
Aaron Lun (22:26:46): > @Hervé PagèsI’d like your opinion on this proof-of-concept:https://github.com/LTLA/DelayedArraySaver(which is also the reason why I spotted thebase=
bug earlier).
Aaron Lun (22:27:22): > The general idea is to be able to save and restore delayed operations in a more language-agnostic manner than serializing the DA as an RDS.
Aaron Lun (22:28:49): > The current examples are a bit trivial, but the endgame is to save, e.g., aHDF5ArraySeed
by just storing the path, so that we avoid resaving an entire dataset after applying some delayed ops.
Aaron Lun (22:36:12): > particularly useful for some of myHDF5ArraySeed
subclasses that are remote, so the delayed operations + data source is portable.
2021-05-04
Mike Smith (03:11:47) (in thread): > I’ve managed to reproduce the problem on a Mac VM, but can build successfully with static linking to libssl.a etc. I’m a bit confused as to exactly how to mimic the machv2 setup. For now I’ve just done brew install openssl@1.1 which gets me both the dynamic and static versions of the libraries. > > If I copy the strategy in rtracklayer’s configure.ac it compiles fine, but links with the dynamic versions of the libraries. Setting OPENSSL_LIBS to the static libraries on my system doesn’t achieve anything, the linking remains to the dynamic libraries. I also can’t tell whether rtracklayer actually uses that OPENSSL_LIBS environment variable. I can’t find mention of it directly in the package code, it looks to me like the variable is set via PKG_CHECK_MODULES at https://code.bioconductor.org/browse/rtracklayer/blob/master/configure.ac#L11, but if you needed to change the envvar to fix a build problem it must be doing something. I’ve ended up hard-coding the static libs into the configure script for testing - that gets me to the point where I can recreate the errors. > > For now I’m going to revert the changes on machv2 so it will build, and try to get this working outside the build system.
Tim Triche (10:20:39): > “trivial” is an interesting way to say “could waste a tremendous amount of time for end users otherwise”. thanks for writing this
Hervé Pagès (13:21:33) (in thread): > I’ve no idea whatrtracklayerdoes withOPENSSL_LIBS
sorry, but surely@Michael Lawrencewould know. I’ve just set the variable on machv2 as per his instructions. Let me know if there’s anything else I need to do on the machine to allow Rhdf5lib to compile and link.
Michael Lawrence (14:22:15) (in thread): > It should follow the conventions defined byPKG_CHECK_MODULES().
If that’s not happening, I’d be surprised.
Aaron Lun (19:03:50): - File (HTML): userguide.html
2021-05-05
Hervé Pagès (12:41:15) (in thread): > Major concern I have is that you’re playing with the internals of DelayedArray objects so you’re going to have to constantly keep up with them. For example, right nowsaveDelayed(x)
fails ifx
carries delayed ops that you are not explicitly supporting (e.g.dpois()
). Another example is when I add a new type of delayed operation. Doesn’t happen very often but it will happen soon to support reshaping (seehttps://support.bioconductor.org/p/9136602/#9136608). > BTW you can’t assume that the unary isometric operations stacked on the object have names. The general case is that they are anonymous functions. For example this is what happens withround(log(x, base=12), digits=2)
. What goes on the stack here are anonymous functionsfunction(a) log(a, base=12)
andfunction(a) round(a, digits=2)
. Maybe the DelayedUnaryIsoOpStack class could have been designed differently e.g. it could store the function names (e.g."log"
and"round"
) and the extra arguments passed to them but that’s not what I did. This is the reason whyshowtree()
doesn’t display the names of the unary iso ops: > > > showtree(round(log(x, base=12), digits=2)) > 6x5 double: DelayedMatrix object > └─ 6x5 double: Stack of 2 unary iso op(s) > └─ 6x5 double: [seed] matrix object >
> saveDelayed()
chokes on this because it wants those names.
Aaron Lun (12:45:39) (in thread): > yes, the keeping-up-with-things was also a concern for me.
Aaron Lun (13:55:34) (in thread): > FWIW I determine the identity of the anonymous function by reaching into its environment and pulling out its .Generic. log requires some special casing for base, which I just added.
Aaron Lun (13:57:49) (in thread): > I’m thinking that, as long as we cover most common operations, we can add support for new operations as we go along.
2021-05-06
Stephanie Hicks (05:18:31): > @Davide Rissothis tweet is supportive of your suspicion about scanpy memory consumptionhttps://twitter.com/gkarlssonlab/status/1389560499214426117 - Attachment (twitter): Attachment > 1/7 We are delighted to introduce Scarf, a single-cell analysis toolkit that puts atlas-scale datasets within the reach of all scientists, requiring just a laptop. One can process scRNA-Seq data of 4 million cells using less than 16GB of RAM. > > Preprint: http://doi.org/10.1101/2021.05.02.441899 https://pbs.twimg.com/media/E0iy0VhXEAE84IW.jpg
Davide Risso (05:21:57): > yep! I’ve seen this yesterday and it’s on my reading list!
Kasper D. Hansen (11:08:04): > You should reply to his tweet on no scalable clustering methods. I almost did it for you (and I could still do that)
Davide Risso (11:38:59): > Well, they say first hierarchical clustering method scalable to millions of cells…
Stephanie Hicks (11:40:28): > @Kasper D. Hansenlol
Stephanie Hicks (11:40:50): > yeah, i’m not in a position to correct folks on twitter. but please do feel free!
Aaron Lun (11:40:58): > can’t be that hard. PCA + kmeans + arbitrarily slow clustering method of choice.
Aaron Lun (11:41:17): > see e.g.bluster::TwoStepParam
.
Davide Risso (11:41:56): > oh yeah, that’s what we do in clusterExperiment too
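A minimal sketch of that two-step approach with bluster; the PC matrix and the parameter choices below are placeholders:

    library(bluster)

    ## stand-in for a cells x PCs matrix from a big dataset
    pcs <- matrix(rnorm(20000 * 20), ncol = 20)

    set.seed(100)
    clusters <- clusterRows(pcs,
                            TwoStepParam(first = KmeansParam(centers = 1000),
                                         second = NNGraphParam(k = 10)))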
2021-05-07
Davide Risso (03:16:42): > Also, isn’tbuildSNNGraph()
+cluster_walktrap()
hierarchical and scalable?
Davide Risso (03:17:30): > (never tried myself on a million cells so this is genuine question)
Aaron Lun (03:38:13): > the graph building is mostly fine, esp. when done with approximate methods. But the community detection just chokes. Louvain is faster but is still pretty slow.
Aaron Lun (03:39:53): > expect that to take about 10-20 minutes, IIRC.
2021-05-09
Hervé Pagès (02:01:42) (in thread): > Also to be totally honest, I don’t see the point of all of this. If you’re going to save the object to HDF5, why not just realize it? Why would you use such a complex representation (original dataset + stack of delayed operations) when you can use a simple representation (realized dataset)? I guess I don’t really understand your use case.
Aaron Lun (02:42:14) (in thread): > We have some big datasets that are shared across many analyses. Mostly single cell, but you could also imagine binned ChIP-seq matrices, methylation, etc. Many different users, and many replicates of the same analysis, and we keep track of all objects that are generated at every version of an analysis - sort of like a build system for analysis code that gets triggered ongit push
and archives those results. > > The objects that are generated include DA-containing SEs. Re-saving the entire array would be prohibitively time-consuming and increase our disk/network usage several-fold. Up until now, we have been saving the RDS files, which are extremely light as they just point back to the backing HDF5 file. (I use a HDF5ArraySeed subclass that handles a remote pull from our internal S3 instances, seehttps://github.com/Bioconductor/HDF5Array/issues/32, so the object itself can be portably reused on different machines.) However, I’d like these results to be usable by applications written in other languages, hence this package.
Hervé Pagès (13:15:37) (in thread): > Thanks for providing some context. Right now I see that the absolute path to the file gets stored in the file itself when I do something likesaveDelayed(HDF5Array("toto.h5", "m") + 1, "toto.h5")
. This breaks the file if I rename it or move it around.
Aaron Lun (14:29:02) (in thread): > For HDF5ArraySeeds, the expected behavior is the same as that of serializing to an RDS file; local use only, position-dependent objects. The real magic happens when we use HDF5ArraySeed subclasses that handle acquisition of the file, as I mentioned above. In particular, construction of my subclasses will automatically download and cache HDF5 files from our company file stores into the user’s current environment. In such cases,DelayedArraySaverwill be instructed to store the identifier of the file; theloadDelayed()
function will then automatically download the file when it constructs the seed.
Aaron Lun (14:29:46) (in thread): > The same principle could well be applied to EHub files - I remember some discussion on the HDF5Array GitHub repo to that effect.
Hervé Pagès (15:22:11) (in thread): > I see. So it will only do something useful in a very specific context/setup that is only available at your company. FWIWquickResaveHDF5SummarizedExperiment()
is an RDS-based approach that produces relocatable objects.
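For readers following along, a small sketch of the RDS-based workflow mentioned here, assuming the HDF5Array package (se is a hypothetical SummarizedExperiment):
> library(HDF5Array)
> ## writes assays to HDF5 plus a small se.rds next to them; the directory is relocatable
> saveHDF5SummarizedExperiment(se, dir = "my_se", replace = TRUE)
> se2 <- loadHDF5SummarizedExperiment("my_se")
> ## after adding delayed ops or new colData, re-save cheaply without rewriting the HDF5 data
> quickResaveHDF5SummarizedExperiment(se2)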
Aaron Lun (16:08:09) (in thread): > As I said, it would not be difficult to generalize this to ExperimentHub-sourced HDF5 files, if you wanted relocatability. It also wouldn’t be difficult to do this in general for any file available from a given URL. My internal case is just one example of its wider utility.
2021-05-10
Hervé Pagès (00:58:34) (in thread): > Yeah I get that. What I was trying to say is that using saveDelayed() really makes sense for a very specific use case in the very specific context where the seed of the DelayedArray object is not bound to the local file system. For example when the seed is an in-memory seed, or an HDF5ArraySeed object pointing to something like "EH1040" or "https://rhdf5-public.s3.eu-central-1.amazonaws.com/rhdf5ex_t_float_3d.h5". This is the exception, not the rule. Nothing wrong with that.
Hervé Pagès (01:56:06) (in thread): > Also I only realize now that the challenge of keeping-up-with-things in DelayedArraySaver is not only with the internal representation of delayed ops. It extends to explicitly supporting many types of seeds and knowing about the internals of each seed. > > I wonder if this couldn't be alleviated by introducing a new generic that would return the information that saveLayer() needs to write a description of the seed to the HDF5 file. The difference between the HDF5ArraySeed (or TileDBArraySeed) method for this new generic and the HDF5ArraySeed (or TileDBArraySeed) method for saveLayer() is that the former wouldn't write anything but only return the information that the latter needs to write to the HDF5 file. In other words the former would be agnostic about where/how saveLayer() actually writes this information. After all, you've made the choice to write the delayed ops + description of the seed to HDF5 but the same information could be written to JSON or other formats. This new generic would be agnostic about that. We could even ask the developer of a new seed to provide a method for this new generic, in addition to supporting the standard seed contract. This would alleviate the keeping-up-with-things challenge in DelayedArraySaver.
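A very rough sketch of what such a generic might look like (all names below are invented for illustration and are not part of DelayedArray or DelayedArraySaver):
> ## hypothetical generic: return a description of the seed, without writing anything
> setGeneric("describeSeed", function(seed, ...) standardGeneric("describeSeed"))
>
> ## a seed developer would provide a method like this one for HDF5ArraySeed
> setMethod("describeSeed", "HDF5ArraySeed", function(seed, ...) {
>     list(class = class(seed), filepath = path(seed), name = seed@name,
>          dim = dim(seed))
> })
> ## saveLayer() (or a JSON writer) could then serialize this list however it likes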
Jason Prasad (20:20:58): > @Jason Prasad has joined the channel
2021-05-11
Megha Lal (16:43:46): > @Megha Lal has joined the channel
2021-05-25
Enrica Calura (03:48:38): > @Enrica Calura has joined the channel
2021-06-01
Espen Riskedal (14:41:16): > Has anyone used arrow and its multi-file open_dataset functionality (https://arrow.apache.org/docs/r/reference/open_dataset.html)? I'm wondering if I can process data in batches, save the batches, and then open it all using this. - Attachment (arrow.apache.org): Open a multi-file dataset — open_dataset > Arrow Datasets allow you to query against data that has been split across multiple files. This sharding of data may indicate partitioning, which can accelerate queries that only touch some partitions (files). Call open_dataset() to point to a directory of data files and return a Dataset, then use dplyr methods to query it.
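In case it helps, a minimal sketch of the write-in-batches / open-as-one-dataset pattern, assuming the CRAN arrow and dplyr packages (paths are throwaway):
> library(arrow)
> library(dplyr)
> dir.create("batches", showWarnings = FALSE)
> for (i in 1:3) {
>     batch <- data.frame(batch = i, value = rnorm(1000))
>     write_parquet(batch, file.path("batches", sprintf("part-%d.parquet", i)))
> }
> ds <- open_dataset("batches")        # one logical dataset over all the files
> ds |> filter(batch == 2) |> collect()  # dplyr verbs are pushed down, then materialized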
2021-06-04
Tim Triche (11:33:17): > have you tried asking Jeff Granja about this, ArchR is the most prominent Arrow user I know of
Tim Triche (11:33:57) (in thread): > re: @Espen Riskedal doesn't look like Jeffrey Granja is on here but that's the first person I'd ask
Aaron Lun (11:37:39) (in thread): > oh tim. ArchR’s arrow files != Apache Arrow file.
Aaron Lun (11:38:32) (in thread): > well, at least not according to their documentation.
Aaron Lun (11:38:41) (in thread): > who knows what’s happening in practice?
2021-06-07
Espen Riskedal (03:59:15) (in thread): > They use hdf5 it seems. Which afaik is not the same as arrow. From docs:“More explicitly, an Arrow file is not an R-language object that is stored in memory but rather an HDF5-format file stored on disk.” - https://www.archrproject.com/bookdown/what-is-an-arrow-file-archrproject.html - Attachment (archrproject.com): 1.3 What is an Arrow file / ArchRProject? | ArchR: Robust and scaleable analysis of single-cell chromatin accessibility data. > A guide to ArchR
Espen Riskedal (04:01:31) (in thread): > I guess an archer shoots arrows, and it sounded cool.
2021-06-08
Pablo Rodriguez (14:22:58): > I'm having problems with subassignments with a DelayedArray-like object. > I'm working with a huge HDF5Array, and I need to subassign some of its values whose positions (row and column) are stored in a different array. > I came up with a partial and ugly solution, but it degrades the whole h5 array and it doesn't fit in memory, so in order to avoid degradation, or at least make the whole HDF5Array be realized in memory, should I try using blockApply()
or work with a realization sink? > > This is an example: > 1. h5 example: > > > > library(HDF5Array) > > m <- matrix(sample(10), nrow = 5, ncol = 5) > > h5 <- as(m, "HDF5Array") > > h5 > <5 x 5> matrix of class HDF5Matrix and type "integer": > [,1] [,2] [,3] [,4] [,5] > [1,] 8 1 8 1 8 > [2,] 2 5 2 5 2 > [3,] 9 7 9 7 9 > [4,] 6 3 6 3 6 > [5,] 4 10 4 10 4 >
> 2. matrix with positions to change: > > > rows <- c(1,2) > > cols <- c(2,2) > > ma <- cbind(rows, cols) > > ma > rows cols > [1,] 1 2 > [2,] 2 2 >
> Let's say that I want to change to 4 the values whose positions are saved in each row of ma > > 3. This gives an error, as specified in the DelayedArray documentation: > > > h5[ma] <- 4 > Error in `[<-`(`**tmp**`, ma, value = 4) : > linear subassignment to a DelayedArray object 'x' (i.e. 'x[i] <- value') is only supported > when the subscript 'i' is a logical DelayedArray object of the same dimensions as 'x' and > 'value' an ordinary vector of length 1) >
> 4. My ugly solution: > > > for(i in 1:nrow(ma)){ > + h5[ma[i,][1], ma[i,][2]] <- 4 > + } > > h5 > <5 x 5> matrix of class DelayedMatrix and type "double": > [,1] [,2] [,3] [,4] [,5] > [1,] 8 4 8 1 8 > [2,] 2 4 2 5 2 > [3,] 9 7 9 7 9 > [4,] 6 3 6 3 6 > [5,] 4 10 4 10 4 >
Pablo Rodriguez (14:25:27): > I know that for() hurts anyone's feelings, so any tip will be much appreciated
Mike Smith (14:55:57): > Can you just forget the for() loop and use the two columns of ma
directly? e.g. > {h5} > #> <5 x 5> matrix of class HDF5Matrix and type "integer": > #> [,1] [,2] [,3] [,4] [,5] > #> [1,] 10 8 10 8 10 > #> [2,] 6 2 6 2 6 > #> [3,] 5 7 5 7 5 > #> [4,] 4 9 4 9 4 > #> [5,] 1 3 1 3 1 > > h5[ ma[,1], ma[,2] ] <- 4 > h5 > #> <5 x 5> matrix of class DelayedMatrix and type "double": > #> [,1] [,2] [,3] [,4] [,5] > #> [1,] 10 4 10 8 10 > #> [2,] 6 4 6 2 6 > #> [3,] 5 7 5 7 5 > #> [4,] 4 9 4 9 4 > #> [5,] 1 3 1 3 1 >
> I think that should obey the "N-dimensional subassignment" mentioned in ?DelayedArray
which says “These 3 forms of subassignment are implemented as delayed operations so are very light.”
Pablo Rodriguez (15:03:52) (in thread): > This! I totally forgot that I could just use ma[,1], ma[,2] to subset, thanks!
Tim Triche (19:08:06) (in thread): > oh gross. I guess the wonderful thing about standards is that there are so many to choose from
2021-06-09
Hervé Pagès (02:23:44) (in thread): > Note that in general M[ma] <- 99 is not the same as M[ma[,1], ma[,2]] <- 99
: > > M1 <- M2 <- M3 <- matrix(0, nrow=8, ncol=5) > rows <- c(1, 5, 6) > cols <- c(2, 4, 5) > ma <- cbind(rows, cols) > > M1[ma] <- 99 > M1 > # [,1] [,2] [,3] [,4] [,5] > # [1,] 0 99 0 0 0 > # [2,] 0 0 0 0 0 > # [3,] 0 0 0 0 0 > # [4,] 0 0 0 0 0 > # [5,] 0 0 0 99 0 > # [6,] 0 0 0 0 99 > # [7,] 0 0 0 0 0 > # [8,] 0 0 0 0 0 > > M2[ma[,1], ma[,2]] <- 99 > M2 > # [,1] [,2] [,3] [,4] [,5] > # [1,] 0 99 0 99 99 > # [2,] 0 0 0 0 0 > # [3,] 0 0 0 0 0 > # [4,] 0 0 0 0 0 > # [5,] 0 99 0 99 99 > # [6,] 0 99 0 99 99 > # [7,] 0 0 0 0 0 > # [8,] 0 0 0 0 0 >
> And only the former is equivalent to your for
loop: > > for (i in 1:nrow(ma)) { > M3[ma[i,1], ma[i,2]] <- 99 > } > > M3 > # [,1] [,2] [,3] [,4] [,5] > # [1,] 0 99 0 0 0 > # [2,] 0 0 0 0 0 > # [3,] 0 0 0 0 0 > # [4,] 0 0 0 0 0 > # [5,] 0 0 0 99 0 > # [6,] 0 0 0 0 99 > # [7,] 0 0 0 0 0 > # [8,] 0 0 0 0 0 > > identical(M1, M3) > # [1] TRUE > identical(M2, M3) > # [1] FALSE >
> Unfortunately subassignment to a DelayedArray object doesn't support the M1[ma] <- 99
form.
Pablo Rodriguez (02:50:55) (in thread): > I tested it with some real data and it behaves as your M2 example, so I'm back to square one. > In my real data, my h5 file is a huge logical sparse matrix with all values set to FALSE, and then I populate it with TRUE using positions stored in a two-column matrix. > Does anybody have any tip to achieve this without realizing the whole h5 file into memory?
Mike Smith (04:17:42) (in thread): > We can do this in "pure" HDF5 (i.e. no DelayedArray etc), although you have to get pretty down and dirty with the rhdf5 C interface. I've got some code that writes to the correct indices. > > However it seems to have highlighted a bug in rhdf5 (or my understanding of the code) as the data it's currently writing is nonsense. It's currently producing this when I try to set the indices to 1: > > > h5read(tf, name = "/M1") > [,1] [,2] [,3] [,4] [,5] > [1,] 0 -641975624 0 0 0 > [2,] 0 0 0 0 0 > [3,] 0 0 0 0 0 > [4,] 0 0 0 0 0 > [5,] 0 0 0 16 0 > [6,] 0 0 0 0 1 > [7,] 0 0 0 0 0 > [8,] 0 0 0 0 0 >
> I’ll try to figure out the problem with writing the data, and then share the code with you.
Pablo Rodriguez (04:40:00) (in thread): > Using rhdf5 with my previous for()
I can get the desired output (using@Hervé Pagèsexample): > > > library(rhdf5) > > my_hdf5_file <- tempfile(fileext = ".h5") > > h5createFile(my_hdf5_file) > [1] TRUE > > h5write(obj = matrix(0, nrow=8, ncol=5), > + file = my_hdf5_file, > + name = "data") > > h5ls(my_hdf5_file) > group name otype dclass dim > 0 / data H5I_DATASET FLOAT 8 x 5 > > h5read(my_hdf5_file, "data") > [,1] [,2] [,3] [,4] [,5] > [1,] 0 0 0 0 0 > [2,] 0 0 0 0 0 > [3,] 0 0 0 0 0 > [4,] 0 0 0 0 0 > [5,] 0 0 0 0 0 > [6,] 0 0 0 0 0 > [7,] 0 0 0 0 0 > [8,] 0 0 0 0 0 > > rows <- c(1, 5, 6) > > cols <- c(2, 4, 5) > > ma <- cbind(rows, cols) > > for(i in 1:nrow(ma)){ > + h5write(99, file = my_hdf5_file, name ="data", index = list(ma[i,][1], ma[i,][2])) > + } > > h5read(my_hdf5_file, "data") > [,1] [,2] [,3] [,4] [,5] > [1,] 0 99 0 0 0 > [2,] 0 0 0 0 0 > [3,] 0 0 0 0 0 > [4,] 0 0 0 0 0 > [5,] 0 0 0 99 0 > [6,] 0 0 0 0 99 > [7,] 0 0 0 0 0 > [8,] 0 0 0 0 0 >
Mike Smith (04:47:54) (in thread): > What’s the performance of that like for your huge matrix? Looping over each index will involve reading and writing the HDF5 chunk that contains that position each time. That might range from “no problem” to “super inefficient” depending on the size of the HDF5 chunks and the number/density of the points you want to update. > > I was aiming for a solution that would give the full list of indices to HDF5 and letting that deal with the IO more efficiently, but if this works for you then awesome!
Pablo Rodriguez (05:54:56) (in thread): > I'm doing some tests on a subset of the real data (just 200 x 1500 and 4800 positions) and performance is pretty slow. My best time trying different chunk sizes was 13 secs (with chunk = c(1,1)). > Definitely need a different approach. Maybe trying a realization sink with HDF5Array. But if you come up with a way to give the full list of indices to HDF5, please let me know! :smiley:
Mike Smith (06:00:48) (in thread): > Yep, small chunks should give the best performance for “random” IO, but you’ll end up with a really large HDF5 file since compression happens at the chunk level.
Mike Smith (09:08:44) (in thread): > Here’s a function that should do the update. - File (R): Untitled
Mike Smith (09:11:36) (in thread): > For comparison, let's also create a function version of the for() loop approach. Then let's use the little example dataset to check that both functions do the same thing, and that it matches the M[ma] <- output. - File (R): Untitled
Mike Smith (09:14:25) (in thread): > That looks good to me. How about performance? Let's try with M being 200 x 1500 and 4800 positions to update. I also went for HDF5 chunks of 100 x 100 as the default setting took too long for me to wait when trying the looping approach. - File (R): Untitled
Mike Smith (09:18:53) (in thread): > That looks like a pretty decent speedup to me, and we can see it uses less memory than the inMemory()
approach, which should indicate that we’re never realizing the complete matrix. I’d expect the gap in performance to increase with larger HDF5 chunks, but this is already a better improvement than my back-of-the-envelope expectation, so maybe there’s more going on than I appreciate. There’s also lots of factors like disk IO performance that may give different performance patterns than I see.
Hervé Pagès (12:49:29) (in thread): > I'm wondering, if the size of the dataset is only 200 x 1500, it might be worth trying to use a contiguous layout (i.e. no chunks) with no compression. @Pablo Rodriguez Using compressed 1x1 chunks is very likely to be counter-productive. Compressing single values has no benefits and I suspect it's going to take more disk space than the uncompressed values.
2021-06-10
Pablo Rodriguez (04:55:45) (in thread): > The real data is composed of millions of rows by tens of thousands of columns. Anyway, thank you so much for the detailed walkthrough on both methods. I'll try the withoutLoop() method and another one I came up with using realization sinks and write_block(); I'll come back with benchmarks on real data to see the differences. Again, thank you all so much.
2021-06-16
Pablo Rodriguez (12:59:52) (in thread): > Sorry to come back so late. > So, to address my problem: > 1. I tried to use @Mike Smith's solution with the "real" lgCMatrix but ran into a little problem which I think is partly my fault for trying to create a huge h5 file > 2. In order to tackle that, I tried to implement a realization sink, but it is extremely slow, because I don't get the hang of doing the subassignments using a position matrix and traversing a grid at the same time > 3. I combined both approaches, which seems to work. On my personal computer it takes around 20 minutes to do the job, but I wanted to see if there could be other tweaks to make it faster. > Let's begin:
Pablo Rodriguez (13:04:48) (in thread): > Part I: My real-world data is comprised of a huge lgCMatrix full of FALSE, and I have to use another matrix where I store the TRUE positions. > The sparse matrix is 3092028 rows X 27076 cols, around 311 GB of RAM, while the position matrix is 70216670 rows X 2 cols. > As soon as I run > > h5write(obj = matrix(FALSE, nrow = 3092028, ncol = 27076), > file = my_hdf5_file, > name = "data") > > I get a C stack memory error. > Is there a particular way inside the Bioconductor ecosystem to build huge "empty" h5 files from scratch?
Pablo Rodriguez (13:15:22) (in thread): > Part 2I came up with a sink backend in order to create and populate the huge sparse matrix, but it’s extremely slow (it would take around 500 hours to finish) > > library(DelayedArray) > library(HDF5Array) > > # creating the position matrix, this is just to test > rows <- sample(1:5000, size = 5000, replace = TRUE) > cols <- sample(1:5000, size = 5000, replace = TRUE) > ma <- cbind(rows, cols) > > # create a realization sink > sink <- HDF5RealizationSink(dim = c(3092028L, 27076L), > type = "logical") > > # create a grid to traverse the sink > # bear in mind it has to take only one row > # in order for the sub assignment to work > # the way I found (which I know it's not great) > sink_grid <- rowAutoGrid(sink, nrow=1L) > > > for(bid in seq_along(sink_grid)){ > > viewport <- sink_grid[[bid]] > > # initialize the block as a matrix of 1 row full of FALSE > block <- matrix(FALSE, nrow = dim(viewport)[1], ncol= dim(viewport)[2]) > > # select values from 'ma': this matrix stores row and col position > # row in [,1] and col in [,2] > # subsetting is done in a 'weird' way: select rows from 'ma' > # where the first column (the one that stores 'row' position) is > # equal to 'bid', since sink_grid has the same number of rows > # this is my bottleneck using HDF5Array > values <- ma[ ma[,1] == bid, ] > > # this part is to filter cases where the position matrix > # has only one value (one row for one column): it's rare but it can happend > if(is.null(dim(values))){ > block[ values[2] ] <- TRUE > } else { > block[ values[,2] ] <- TRUE > } > > # write the block to sink > sink <- write_block(sink, viewport, block) > } > > close(sink) > res <- as(sink, "DelayedArray") >
Pablo Rodriguez (13:23:16) (in thread): > Part 3: Finally, I kind of did a mash-up of both solutions, using a realization sink to create an hdf5 file full of FALSE and then using updateElements() to subassign my TRUE
values. > > setAutoBlockSize(7e+08) > > # create the realization backend > sink <- HDF5RealizationSink(dim = c(3092028L, 27076L), > type = "logical", chunkdim = c(7000L, 27076L)) > > # create the grid > sink_grid <- rowAutoGrid(sink, nrow=7000L) > > # function to write blocks full of FALSE > FUN <- function(sink, viewport){ > block <- matrix(FALSE, nrow = dim(viewport)[1], ncol= dim(viewport)[2]) > write_block(sink, viewport, block) > } > sink <- sinkApply(sink, FUN, grid = sink_grid, verbose = TRUE) > close(sink) > res <- as(sink, "DelayedArray") > > # then using updateElements() > updateElements(res@seed@filepath, res@seed@name, ma, TRUE) >
> I tried it and it gave me the right answers (comparing against a workstation with huge RAM and in-memory methods). > I tried different block sizes, chunkdims and rowgrids to get the fastest h5 file creation with 16 GB of RAM on my PC, and I usually didn't go over 6 GB creating the h5 file, but as soon as updateElements() started working it took almost all available RAM. Is this by design (having to work with a heavy 'ma' file), or am I doing something wrong?
2021-06-17
Mike Smith (03:07:29) (in thread): > This should create an HDF5 file with an “empty” matrix dataset, without the need to realise the matrix in R: > > my_hdf5_file <- "/tmpdata/msmith/huge.h5" > h5createFile( my_hdf5_file ) > h5createDataset(my_hdf5_file, dataset = "data", dims = c(3092028, 27076), chunk = c(1000, 1000), > storage.mode = "logical", fillValue = FALSE) >
> You can use fillValue
to specify any starting value for the matrix, so it could also be “TRUE” etc. Running this is pretty much instant for me. > Here’s the output from reading a few rows and cols: > > > h5read(my_hdf5_file, name = "data", index = list(1001:1005, 10001:10005)) > [,1] [,2] [,3] [,4] [,5] > [1,] FALSE FALSE FALSE FALSE FALSE > [2,] FALSE FALSE FALSE FALSE FALSE > [3,] FALSE FALSE FALSE FALSE FALSE > [4,] FALSE FALSE FALSE FALSE FALSE > [5,] FALSE FALSE FALSE FALSE FALSE >
2021-06-24
Ilaria Billato (08:15:37): > @Ilaria Billato has joined the channel
2021-06-25
Hervé Pagès (04:42:47) (in thread): > @Pablo RodriguezHere is a function that does the job of traversing the grid defined on a RealizationSink and writing values at the positions supplied in a matrix of positions: > > write_to_sink_at_position_matrix <- > function(sink, pos_mat, values, fill_value, grid=NULL, verbose=NA) > { > stopifnot(is(sink, "RealizationSink"), > is.matrix(pos_mat), is.numeric(pos_mat), > ncol(pos_mat) == length(dim(sink)), > is.vector(values), is.atomic(values), > length(values) == nrow(pos_mat), > is.vector(fill_value), is.atomic(fill_value), > length(fill_value) == 1L) > > ## Prepare block to write for a given viewport. > prepare_block <- function(viewport, pos_mat, values, fill_value) { > block <- array(fill_value, dim=dim(viewport)) > in_block <- rep.int(TRUE, nrow(pos_mat)) > for (j in seq_len(ncol(pos_mat))) { > xj <- pos_mat[ , j] > in_block <- in_block & > xj >= start(viewport)[j] & xj <= end(viewport)[j] > } > idx <- which(in_block) > if (length(idx) != 0L) { > m <- pos_mat[idx, , drop=FALSE] > m <- m - rep(start(viewport) - 1L, each=length(idx)) > block[m] <- values[idx] > } > block > } > > FUN <- function(sink, viewport, prepare_block, pos_mat, values, fill_value) > { > block <- prepare_block(viewport, pos_mat, values, fill_value) > write_block(sink, viewport, block) > } > sinkApply(sink, FUN, prepare_block, pos_mat, values, fill_value, > grid=grid, verbose=verbose) > } >
> Note that the function is backend-agnostic and works with datasets with an arbitrary number of dimensions. Also the supplied grid can be chosen arbitrarily (i.e. it doesn't have to be made with rowAutoGrid() or colAutoGrid()), or not supplied at all, in which case an automatic grid is created with defaultSinkAutoGrid()
. > Using it on a very small toy example: > > library(HDF5Array) > > sink <- HDF5RealizationSink(c(10L, 5L)) > pos_mat <- rbind(c(1, 5), c(1, 1), c(8, 2), c(2, 1), c(10, 5)) > values <- 100 + seq_len(nrow(pos_mat)) > > sink_grid <- RegularArrayGrid(dim(sink), spacings=c(4, 3)) > sink <- write_to_sink_at_position_matrix(sink, pos_mat, values, 0, grid=sink_grid) > close(sink) > M <- as(sink, "DelayedArray") > M > # <10 x 5> matrix of class HDF5Matrix and type "double": > # [,1] [,2] [,3] [,4] [,5] > # [1,] 102 0 0 0 101 > # [2,] 104 0 0 0 0 > # [3,] 0 0 0 0 0 > # [4,] 0 0 0 0 0 > # [5,] 0 0 0 0 0 > # [6,] 0 0 0 0 0 > # [7,] 0 0 0 0 0 > # [8,] 0 103 0 0 0 > # [9,] 0 0 0 0 0 > # [10,] 0 0 0 0 105 >
> Using it with the dataset you used inPart 2above takes about 13 min on my laptop and consumes about 500 MB of RAM: > > rows <- sample(5000, size=5000, replace=TRUE) > cols <- sample(5000, size=5000, replace=TRUE) > ma <- cbind(rows, cols) > values <- rep(TRUE, nrow(ma)) > sink <- HDF5RealizationSink(dim=c(3092028L, 27076L), type="logical") > sink <- write_to_sink_at_position_matrix(sink, ma, values, FALSE, verbose=TRUE) > close(sink) > res <- as(sink, "DelayedArray") >
> Using it on the full-size position matrix (70216670 rows X 2 cols) takes about 3.5 hours and increases memory usage to almost 5 GB. The bottleneck is helper function prepare_block() where more than 90% of the time is spent. But there's room for making this function much faster e.g. by precomputing m and values[idx] for each viewport.
Pablo Rodriguez (04:47:36) (in thread): > Thanks so much for the detailed solution, i’ll take a look and try to adapt my code, thanks!
2021-07-05
Chouaib Benchraka (01:57:07): > @Chouaib Benchraka has joined the channel
2021-07-07
Pablo Rodriguez (11:30:43): > I have a question regarding HDF5Array. In the documentation of HDF5Array-class, in the examples there is this quote: > > ## The data in the dataset looks sparse. In this case it is recommended > ## to set 'as.sparse' to TRUE when constructing the HDF5Array object. > ## This will make block processing (used in operations like sum()) more > ## memory efficient and likely faster: > > From which package is this sum() operation mentioned in this comment? Is there already a sum()
function that has a native block processing form? - Attachment (rdrr.io): HDF5Array-class: HDF5 datasets as DelayedArray objects in HDF5Array: HDF5 backend for DelayedArray objects > The HDF5Array class is a DelayedArray subclass for representing a conventional (i.e. dense) HDF5 dataset. All the operations available for DelayedArray objects work on HDF5Array objects.
Vince Carey (12:21:14): > I am not sure sum
is the right reference here, but > > > getMethod("colSums2", "DelayedMatrix") > Method Definition: > > function (x, rows = NULL, cols = NULL, na.rm = FALSE, ...) > { > .local <- function (x, rows = NULL, cols = NULL, na.rm = FALSE, > force_block_processing = FALSE, ...) > { > .smart_seed_dispatcher(x, generic = MatrixGenerics::colSums2, > blockfun = .DelayedMatrix_block_colSums2, force_block_processing = force_block_processing, > rows = rows, cols = cols, na.rm = na.rm, ...) > } > .local(x, rows, cols, na.rm, ...) > } > <bytecode: 0x120a783c8> > <environment: namespace:DelayedMatrixStats> >
> is worth a look
2021-07-12
Hervé Pagès (17:21:11) (in thread): > Yes, there's already a sum() function that has a native block processing form. It's implemented in the DelayedArray package: > > > selectMethod("sum", "HDF5Array") > Method Definition: > > function (x, ..., na.rm = FALSE) > .BLOCK_Summary(.Generic, x, ..., na.rm = na.rm) > <bytecode: 0x558a5334c850> > <environment: namespace:DelayedArray> > > Signatures: > x > target "HDF5Array" > defined "DelayedArray" > > Many base R operations work directly on DelayedArray objects and derivatives thanks to methods defined in the DelayedArray package. See ?DelayedArray-utils. In addition, the DelayedMatrixStats package implements DelayedArray methods for all the matrix summarization operations from the matrixStats package.
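A tiny illustration of the two points above (block-processed sum() and the as.sparse hint), assuming the HDF5Array and DelayedMatrixStats packages; file and dataset names are made up:
> library(HDF5Array)
> fp <- tempfile(fileext = ".h5")
> M <- writeHDF5Array(matrix(rpois(1e4, lambda = 0.1), nrow = 100),
>                     filepath = fp, name = "counts")
> sum(M)                                   # block-processed, never loads all of M at once
> Msp <- HDF5Array(fp, "counts", as.sparse = TRUE)   # sparse-aware blocks for mostly-zero data
> sum(Msp)
> DelayedMatrixStats::colSums2(Msp)[1:5]   # block-processed matrixStats-style summary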
2021-07-14
António Domingues (03:51:38): > @António Domingues has left the channel
2021-07-16
Dario Righelli (03:54:50): > Hello everyone, I'm having a problem with an h5 file previously saved in Python. > It's a publicly accessible file that I'm downloading from the Allen Institute repository. > In particular, when I load the h5 file in Python I can access all the data stored in the file, while when loading the file in R with H5Fopen
it seems to not load all the data. > It’s easier if I provide an example: > Python example: > > ###### Code > fname = 'filepath/CTX_Hip_counts_10x.h5' > def print_attrs(name, obj): > print(name) > for key, val in obj.attrs.items(): > print(" %s: %s" % (key, val)) > > f_10x = h5py.File(fname, mode='r') > f_10x.visititems(print_attrs) > > ###### Output -> the structure of the h5 file > data > data/counts > data/gene > data/samples > data/shape >
> R code: > > tenx = H5Fopen("~/Downloads/CTX_Hip_counts_10x.h5") > > tenx > HDF5 FILE > name / > filename > > name otype dclass dim > 0 data H5I_GROUP > > tenx$data > Error: vector memory exhausted (limit reached?) > Error: Error in h5checktype(). H5Identifier not valid. >
> What am I doing wrong? > Thanks!
Hervé Pagès (04:11:59) (in thread): > Probably worth opening an issue on GitHub for this at https://github.com/grimbough/rhdf5/issues. But first make sure that a similar issue has not already been reported. Also make sure to provide access to the problematic file (or to provide code to generate such a file) so others can reproduce the problem. Plus the usual sessionInfo()
. Thx!
Dario Righelli (04:12:55) (in thread): > Thanks Hervè!
Martin Morgan (10:40:27) (in thread): > is tenx$data trying to read the entire content of data into memory? If you're after just the structure of the file you can do h5ls("~/Downloads/CTX_Hip_counts_10x.h5"). To actually visit the data you'll likely benefit from the HDF5Array package, maybe TENxMatrix() or HDF5Array() and the facilities they provide for iterating through content (e.g., blockApply()).
Dario Righelli (10:59:11) (in thread): > Hi@Martin Morgan, thanks for your reply, I tried to load the file with the HDF5Array package functions, but when I try to do that by using the name and group argument with different values ( “data”, “counts”) it gives me an error like these ones: > > tenx = TENxMatrix("~/Downloads/CTX_Hip_counts_10x.h5", group="data") > Error in .check_data_and_subdata(filepath, group, subdata) : > HDF5 object "/data/data" does not exist in this HDF5 file. Are you sure that HDF5 group > "/data" contains a sparse matrix stored in CSR/CSC/Yale format? > ##### > tenx <- HDF5Array("~/Downloads/CTX_Hip_counts_10x.h5", name="data") > Error in H5Dopen(gid, name) : HDF5. Dataset. Can't open object. >
Dario Righelli (11:08:49) (in thread): > I didn't know about the h5ls
function that indeed works and returns: > > h5ls("~/Downloads/CTX_Hip_counts_10x.h5") > group name otype dclass dim > 0 / data H5I_GROUP > 1 /data counts H5I_DATASET INTEGER 1169213 x 31053 > 2 /data gene H5I_DATASET STRING 31053 > 3 /data samples H5I_DATASET STRING 1169213 > 4 /data shape H5I_DATASET INTEGER 2 >
Dario Righelli (11:09:44) (in thread): > EDIT: this one worked! Thanks!:tada: > > aa <- HDF5Array("~/Downloads/CTX_Hip_counts_10x.h5", name="data/counts") > > aa > <1169213 x 31053> matrix of class HDF5Matrix and type "integer": > [,1] [,2] [,3] [,4] ... [,31050] [,31051] [,31052] [,31053] > [1,] 8 0 0 0 . 3 0 0 0 > [2,] 13 2 0 0 . 7 0 0 0 > [3,] 8 0 0 0 . 0 0 0 0 > [4,] 11 0 0 0 . 1 0 0 0 > [5,] 10 1 0 0 . 5 0 0 0 > ... . . . . . . . . . > [1169209,] 3 0 0 0 . 4 0 0 0 > [1169210,] 14 0 0 0 . 4 0 0 0 > [1169211,] 1 1 0 0 . 5 0 0 0 > [1169212,] 16 0 0 0 . 9 0 0 0 > [1169213,] 16 1 0 0 . 5 0 0 0 >
Lori Shepherd (12:42:39): > @Lori Shepherd has left the channel
2021-08-04
Ayush Aggarwal (19:20:11): > @Ayush Aggarwal has joined the channel
2021-08-05
Prateek Arora (04:53:32): > @Prateek Arora has joined the channel
2021-08-19
Ava Hoffman (she/her) (11:32:37): > @Ava Hoffman (she/her) has joined the channel
2021-09-08
Julien Roux (05:29:00): > @Julien Roux has joined the channel
2021-10-18
Pablo Rodriguez (04:21:43): > Hi > I was wondering if there is any development in storing and working with data on disk in a sparse format. > I saw a couple of utility functions in a package called scrattch.io where the developers try to emulate the dgCMatrix format by saving Matrix@i, Matrix@p and Matrix@x as separate datasets inside the same h5 file, and I wondered if this kind of solution could help in this matter. > My use case is that I have a huge sparse matrix saved as an h5 file and I need to modify it through vector-wise operations (dividing and multiplying by rowsum and colsum). And even though I can work using block-processing operations, these can't take advantage of the matrix sparsity.
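For concreteness, a minimal sketch of that scrattch.io-style layout using rhdf5 (the group and dataset names below are my own choice, not a standard):
> library(Matrix)
> library(rhdf5)
> m <- rsparsematrix(1000, 500, density = 0.01)   # a dgCMatrix
> f <- tempfile(fileext = ".h5")
> h5createFile(f)
> h5createGroup(f, "X")
> h5write(m@i, f, "X/indices")   # 0-based row indices of the non-zero values
> h5write(m@p, f, "X/indptr")    # column pointers
> h5write(m@x, f, "X/data")      # the non-zero values themselves
> h5write(dim(m), f, "X/shape")  # matrix dimensions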
Vince Carey (06:31:23) (in thread): > There is work on this, at https://github.com/Bioconductor/S4Arrays, and @Hervé Pagès will surely say more … this is still in development.
Dirk Eddelbuettel (09:09:38) (in thread): > You could also consider TileDB (on CRAN as package tiledb), which has helper functions such as fromSparseMatrix() taking an S4 class and storing it (as a dgTMatrix, as its three equal-sized columns match the TileDB arrays well). You can then index etc. at will via the (performant, C++) TileDB Embedded library, and get data back via toSparseMatrix(). It is already supported via DelayedArray in Bioconductor. Let @Aaron Wolen or myself know if you have questions.
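A rough sketch of the round trip described here, assuming the CRAN tiledb package and its fromSparseMatrix()/toSparseMatrix() helpers (the URI is a throwaway local directory; exact signatures may differ by version):
> library(Matrix)
> library(tiledb)
> m <- rsparsematrix(1000, 500, density = 0.01)
> uri <- tempfile("sparse_tiledb_")
> fromSparseMatrix(m, uri)      # store the sparse matrix as a TileDB array on disk
> m2 <- toSparseMatrix(uri)     # read it back as an in-memory sparse matrix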
John Kerl (09:14:34): > @John Kerl has joined the channel
Hervé Pagès (11:31:08) (in thread): > To be clear: the SparseArray container currently in development in the S4Arrays package is for in-memory representation of multidimensional sparse arrays: https://github.com/Bioconductor/S4Arrays/blob/4367dea011ab6b1c7454a020d766a3df09c10d6d/man/SVT_SparseArray-class.Rd#L88-L107
2021-11-02
itTan (00:10:51): > @itTan has joined the channel
2021-11-08
Paula Nieto García (03:18:41): > @Paula Nieto García has joined the channel
2021-11-11
Shilpa Garg (09:27:46): > @Shilpa Garg has joined the channel
2021-12-14
Megha Lal (08:23:22): > @Megha Lal has left the channel
2022-01-21
John Readey (15:02:39): > Hey@Pablo Rodriguez- I’m starting work on a project to handle sparse data with HDF5. Would be interested in discussing your requirements when you have time.
2022-01-24
Robert Castelo (03:31:00) (in thread): > Hi John, Pablo was working on this in my group until last December 31st, but unfortunately I ran out of money to renew his contract and he’s not with us anymore. Anyway, I’ll be happy to discuss our use case since I’m taking over what he was doing.
2022-01-25
John Readey (16:47:13) (in thread): > Hey @Robert Castelo - Sorry to hear about Pablo, but good to know that you'll be taking over the work. > > I've been thinking along the lines that it would be more efficient to have sparse-specific methods in HDF5 rather than trundling the data back and forth with each read and write request. E.g. a method to get the number of non-zero elements. > > What methods would be most useful in your use case? > > I'm more familiar with the Python scipy sparse package (https://docs.scipy.org/doc/scipy/reference/sparse.html) than with R, but it should boil down to the same thing I guess. > > Do you have any public datasets using sparse data? It would be handy to have some real-world data to work with.
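While such native methods are developed, one interim way to get e.g. the number of non-zero elements of an HDF5-backed matrix from R is block processing with DelayedArray (a sketch; x is any DelayedMatrix such as an HDF5Array):
> library(DelayedArray)
> nzcount_blockwise <- function(x) {
>     ## each block arrives as an ordinary (dense) array; sum the per-block counts
>     sum(unlist(blockApply(x, function(block) sum(block != 0))))
> }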
2022-02-07
Ning Shen (17:44:54): > @Ning Shen has joined the channel
2022-02-15
Gene Cutler (11:59:37): > @Gene Cutler has joined the channel
2022-03-15
Ruslan Rust (08:43:44): > @Ruslan Rust has joined the channel
2022-04-01
Nitesh Turaga (13:56:33): > @Nitesh Turaga has left the channel
2022-04-12
Vivian Chu (13:58:52): > @Vivian Chu has joined the channel
2022-05-10
michaelkleymn (23:00:45): > @michaelkleymn has joined the channel
2022-05-12
Helen Lindsay (05:47:38): > @Helen Lindsay has joined the channel
Bernat Bramon (06:04:12): > @Bernat Bramon has joined the channel
Bernat Bramon (06:05:57): > Hello everyone! Do you know of any faster ways to store SingleCellExperiment objects as HDF5 files than saveHDF5SummarizedExperiment? Alternatively, do you know if there are ways to speed up the process (e.g. maybe changing the chunk size)? Thanks in advance!
Mike Smith (06:18:32) (in thread): > Changing the chunk size might help, but I’d only expect to see a significant difference if the current chunks are really small i.e. tons of small write operations. You could also try reducing the compression level, but again I’d expect that to be a small percentage change unless you’re currently using maximum compression. > > What sort of size data and save time are you currently seeing? Perhaps you can share an example.
Bernat Bramon (07:23:38) (in thread): > Hi Mike, thank you so much for the quick reply. I am currently trying to work with the data from Stephenson et al. 2021, which is an h5ad file of 6.64 GB. At the moment, the only thing I am doing is reading the file with readH5AD (from the zellkonverter package), and trying to save the resulting SingleCellExperiment object (only one of the assays) with saveHDF5SummarizedExperiment
. Given that the file is fairly big, I am running things in a computer cluster where I can request a fair amount of RAM, and I am expecting the process to take more than 20 hours according to the current pace. With that said, my question is not specific to this dataset, as I would want to do the same with other big single-cell experiments. Thanks again for your help! - Attachment (ebi.ac.uk): Files < E-MTAB-10026 < Browse < ArrayExpress < EMBL-EBI > EMBL-EBI
Mike Smith (10:10:14) (in thread): > Setting the chunkdim argument to something larger than the default will probably yield some noticeable speed improvements, but be aware that it might make things slower in the future if you only want to use a subset of the cells or genes in the resulting file. You can also set the verbose = TRUE argument to get an idea of the progress. You'll see how many chunks it's going to write, which might help select an acceptable chunkdim argument.
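A hedged sketch of that suggestion (the values are arbitrary and should be tuned to the data; sce is the SingleCellExperiment being saved):
> library(HDF5Array)
> saveHDF5SummarizedExperiment(sce, dir = "sce_h5", replace = TRUE,
>                              chunkdim = c(5000, 50),   # bigger chunks, fewer write operations
>                              level = 3,                # lighter compression than the maximum
>                              verbose = TRUE)           # report chunks as they are written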
Raphael Gottardo (10:52:52) (in thread): > Thanks @Mike Smith. As a follow-up, what's the best current practice to store single-cell data (large and not large)? HDF5, sparse representation, TileDB (or some other format)? Also paging @Hervé Pagès to see if he has any comments.
2022-05-13
Dirk Eddelbuettel (10:21:56) (in thread): > Have you seen https://github.com/single-cell-data/matrix-api?
Raphael Gottardo (15:28:01) (in thread): > No, I have been sleeping for the past 2 years:wink:. Looks very interesting.
Raphael Gottardo (15:28:08) (in thread): > Thanks@Dirk Eddelbuettel
Dirk Eddelbuettel (15:39:02) (in thread): > It’s fairly new and fairly active and still pre-announcement. There should be something out later this summer…
2022-05-17
Isaac Virshup (16:49:04): > @Isaac Virshup has joined the channel
2022-05-19
Bernat Bramon (04:29:16): > Does anyone have any good recommendations? Thanks in advance!! - Attachment: Attachment > Thanks @Mike Smith As a follow up, what's the best current practice to store single-cell data (large and not large). HDF5, sparse representation, tileDB (or some other format). Also paging @Hervé Pagès to see if he has any comments.
Isaac Virshup (06:09:58) (in thread): > I'd say h5ad
, but I’m probably a bit biased:wink:
Vince Carey (08:44:18) (in thread): > The general question is being addressed at https://github.com/single-cell-data/matrix-api
Stephanie Hicks (09:32:04) (in thread): > I believe @Davide Risso has some thoughts here too
Isaac Virshup (09:51:18) (in thread): > For sure matrix-api as it moves along. > > From a conversation with @Luke Zappia, Robert, and Martin about the EOSS grant, there was some interest in making it easier to get a SingleCellExperiment with DelayedArrays in it from a .h5ad file. > > Which could be a relatively easy (:crossed_fingers:) and interchange-friendly way to get this.
John Kerl (11:29:45) (in thread): > See alsohttps://community-bioc.slack.com/archives/C35BSB5NF/p1652451716879809?thread_ts=1652349957.938969&cid=C35BSB5NF - Attachment: Attachment > Have you seen https://github.com/single-cell-data/matrix-api ?
2022-06-17
George Odette (17:25:18): > @George Odette has joined the channel
2022-06-22
Bernat Bramon (13:54:58): > Hey there, I am really struggling to work with single cell data stored as large csv files. I am quite new in the field, so I can only assume that I am using the wrong tools. What is the best practice/software for reading such big csv/tsv files? Thanks in advance for any suggestion!
Tim Triche (14:27:25): > vroom or tximport … that said, you will be well served by purchasing as much RAM and SSD as will fit into your machine, whether that machine is a laptop or an HPC node. We received an award recently to develop more efficient representations of these types of data for relatively new users. If you are interested in potentially testing some of them, please let me know – we would like to make fast, efficient data structures accessible to as many people as possible. CSV and TSV files are great but they are not the right format for sparse matrices of counts.
Tim Triche (14:29:36): > If these are droplet scRNAseq data, you will be well served by finding the original matrix market files (usually these will have files named barcodes, features, and matrix if they are like 10X output). If these are plate-based single-cell data they may be dense enough to bother with a full CSV/TSV scan, but even then we usually compress them after the first ingestion, at least in my lab.
Tim Triche (14:29:45): > for example: > > tim@thinkpad-P15:~/Dropbox/TricheLab/immunograph/moreSplenicDCs/10X_output/Batf3_WT$ ls > barcodes.tsv.gz features.tsv.gz matrix.mtx.gz >
Tim Triche (14:32:23): > processing the above was relatively straightforward (it is CITE-seq data, so, both RNA and protein): > > library(DropletUtils) > > # this dataset > tenXpath <- "10X_output" > splenic <- file.path(tenXpath, list.files(tenXpath)) > names(splenic) <- basename(splenic) > splenicDCs <- read10xCounts(samples=splenic, sample.names=names(splenic)) > colnames(splenicDCs) <- paste(splenicDCs$Sample, splenicDCs$Barcode, sep="_") > > # split into altExps (RNA and ADT) > rowData(splenicDCs)$type <- ifelse(grepl("^ENSMUS", rownames(splenicDCs)), > "RNA", "ADT") > splenicDCs <- splitAltExps(splenicDCs, rowData(splenicDCs)$type) > > # provide proper symbols for the ADTs > abGeneNames <- c(CD4="Cd4", CD8A="Cd8a", CD117="Kit", CD11B="Itgam", > CD11C="Itgax", F480="Adgre1", SIGLECH="Siglech", > CD24="Cd24a", CD172A="Sirpa", XCR1="Xcr1", CD370="Clec9a") > > # crosscheck against mRNA symbols > stopifnot(all(sapply(sapply(abGeneNames, grep, x=rowData(splenicDCs)$Symbol), length) > 0)) > rowData(altExp(splenicDCs))$Symbol <- abGeneNames[rownames(altExp(splenicDCs))] > > # lognormalize appropriately > splenicDCs <- applySCE(splenicDCs, logNormCounts) # apparently 1 is good enough > > # plot the ADTs > par(mfrow=c(3,4)) > for(i in rownames(altExp(splenicDCs))) { > plot(density(logcounts(altExp(splenicDCs))[i,]), main=i) > } > dev.copy2pdf(file="ADT_logNormCountsDensities.pdf") > dev.off() > > # plot correlation with transcripts > # .. >
> The SingleCellExperiment data structure is quite efficient for working with single-cell data once you are used to it.
Tim Triche (14:33:00): > That said, we’d quite like to have more compact representations for denoised data in broader use, so.
2022-07-01
kent riemondy (13:45:44): > @kent riemondy has joined the channel
2022-07-05
Andrew J. Rech (10:24:11): > @Andrew J. Rech has joined the channel
2022-07-18
Filip Stachura (09:24:14): > @Filip Stachura has joined the channel
2022-07-19
Bernat Bramon (05:46:50): > Hey there, I was hoping someone could point me in the right direction with a couple more questions that I had regarding single cell data. First, I find fread from data.table to be the best way to read single cell data written as huge csv matrices in R. The issue I am finding is that it doesn't seem straightforward to convert the resulting data.table object to a sparse matrix. Namely, I can use the Matrix package only if I first turn the data.table object into an R matrix (if I am not mistaken), and this process seems to use a lot of RAM. Are there more efficient ways to do that? I saw "sparsify" from the mltools package, but it goes really slow for big matrices. Thanks in advance, any suggestions would be super helpful!
Bernat Bramon (05:51:23): > The second question I had similarly involves RAM usage when reading big count matrices as text files. Is there a way to have an estimation of how much RAM I will need to read a big matrix if I know the size of the file? I originally thought it would be more of a one-to-one relationship, but with huge files I find that I need to request much more RAM than I expected
Alan O’C (05:57:30) (in thread): > If the values are read in as numeric rather than integer, it'll be ~twice as much as you might expect, since ints are 32-bit and floats are 64-bit: http://adv-r.had.co.nz/memory.html
Alan O’C (05:58:13) (in thread): > Unsure if fread defaults to numeric but I believe most things in R do unless you tell them that you really, really, really for sure want ints
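A quick back-of-the-envelope version of that arithmetic (doubles are 8 bytes, integers 4; the matrix dimensions below are illustrative):
> nr <- 1e6; nc <- 3e4          # e.g. a droplet-scale count matrix read densely
> nr * nc * 8 / 2^30            # ~223 GiB if read as numeric (double)
> nr * nc * 4 / 2^30            # ~112 GiB if read as integer
The on-disk CSV can be far smaller than either number, since the many zeros compress well as text, which is why the RAM needed often exceeds the file size by a large factor.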
Bernat Bramon (06:01:27) (in thread): > Thanks for your reply and the link, this is useful:raised_hands:
Tim Triche (10:53:36): > a fundamental issue with CSV files of big matrices is that they are inherently wasteful – if you have a matrix of (say) a billion entries and 950 million of them are 0, right off the bat you’ve spent about 20x as much time and space as necessary to turn around and throw them away. If they’re column-sparse the waste is even bigger. Maybe we should present SRLE at BioC to try and emphasize just how ridiculous of a waste this truly is
Tim Triche (10:56:43): > my suspicion is that vroom
(https://vroom.r-lib.org/) will be the most sensible way to approach this, if your collaborators insist on providing sparse data in a dense format (if it’s actually dense then that is a different story, but presumably you wouldn’t be asking about sparse data structures if that were the case:wink:) - Attachment (vroom.r-lib.org): Read and Write Rectangular Text Data Quickly > The goal of vroom is to read and write data (like csv, tsv and fwf) quickly. When reading it uses a quick initial indexing step, then reads the values lazily , so only the data you actually use needs to be read. The writer formats the data in parallel and writes to disk asynchronously from formatting.
Alan O’C (11:01:45): > This may be a case of PEBCAK but I found the vroom options super unintuitive for anything other than the type of CSV they expect, whereas fread “just works”. Also I think fread is still faster, though I may be wrong there
Tim Triche (11:10:24): > fair, although data.table is a bit… “quirky” w/r/t syntax
Dirk Eddelbuettel (11:10:32) (in thread): > I also think it is faster especially as vroom redefined the metrics. It indexes, and mmaps, and then claims to be faster. fread just reads, in parallel.
Dirk Eddelbuettel (11:10:59) (in thread): > fread has an option to return a data.frame. Still faster and simpler. YMMV.
Tim Triche (11:11:34): > one thought is that reading a set number of records and calling something like acast
to write them into a sparse structure will probably end up being faster (especially if the reading is done using iostreams via e.g. Rcpp)
Tim Triche (11:11:56) (in thread): > cool except that the goal here was a sparse matrix:wink:
Tim Triche (11:12:44) (in thread): > which is, after a fashion, a data.frame, I suppose. uint_16, uint_16, [float or int]
Tim Triche (11:13:13): > the good news is that CZI funded us to do some of this:wink:
Tim Triche (11:13:55): > the bad news is that CSV -> sparse is (or ought to be) a corner case when everyone sane is using matrixmarket
Tim Triche (11:14:08): > (or tiledb, or similar, before Dirk says it)
Dirk Eddelbuettel (11:14:10): > And we (== TileDB) are on it with them. I poked Aaron (W.) to come over here but he is perennially busy.
Alan O’C (11:15:08) (in thread): > I’d be shocked if vroom allowed you to specify anything like a matrix as output either
Dirk Eddelbuettel (11:15:39) (in thread): > Right. Neither does fread AFAIK
Tim Triche (11:15:50) (in thread): > this is a fundamental issue don’t you think:wink:
Tim Triche (11:16:16) (in thread): > vroom and broom actually do have something like this, but it’s… not at all efficient
Dirk Eddelbuettel (11:16:24) (in thread): > I am not the one insisting on vroom ATATE (== as the answer to everything)
Tim Triche (11:16:28) (in thread): > true
Tim Triche (11:16:39) (in thread): > I’m not particularly wedded to it eihter
Tim Triche (11:17:27) (in thread): > the thought occurs that if a person is going to parse a CSV into a large sparse matrix, I don’t want to support a general purpose tool to do it more than once (i.e. read it into a sensible format and park the original in something like Glacier)
Aaron Wolen (14:36:08): > Yeap, we’re working with czi on a data model (SOMA) for storing annotated matrices using tiledb and have R/Python packages available to get data in and out of this format. However, you’d still need to get the data into memory in a supported format (eg singlecellexperiment) first. For that I’d echo the vroom/fread recommendations.
Tim Triche (14:40:19): > right on. I was just looking at miller (mlr) and thinking that it appears to be a proper streaming interface, but given that you plural are the experts, I’ll follow your recommendations. What are the spec’d inputs to the SOMA model?
John Kerl (14:41:38): > @Tim Triche I can help you with any/all miller questions :slightly_smiling_face:
Tim Triche (14:42:35): > sweet – I was looking at SOMA and miller thinking about “gee, if I did end up needing to do something useful with a bunch of CSV rectangles, could I avoid writing any code by turning it into {row, column, value} tuples with mlr?”
Tim Triche (14:43:24): > also just knowing about SOMA will be handy for Zach and free up more of his CZI time, so thanks for pointing at your project:slightly_smiling_face:
John Kerl (14:44:13): > @Tim Triche indeed for prototyping I've done some things with miller & it is a great way to play with data > > of course for anything productionalized we use anndata's from_h5ad, scanpy's read_10x_mtx, Seurat I/O, tiledb.from_numpy, tiledb.from_pandas, etc.
Tim Triche (14:44:32): > so the current interfaces are primarily via the Python connector?
Tim Triche (14:45:00): > or scanpy's read_10x_mtx
which is presumably what everyone has written a few times:slightly_smiling_face:
John Kerl (14:45:49): > our current v0 R & Python impls are https://github.com/TileDB-Inc/tiledbsc https://tiledb-inc.github.io/tiledbsc/ https://github.com/single-cell-data/TileDB-SingleCell/tree/main/apis/python https://tiledb-singlecell-docs.s3.amazonaws.com/docs/overview.html
Tim Triche (14:46:40): > sweet, tiledbsc is what I was looking for. thanks!
Tim Triche (14:47:20): > this is great. CSVs for single cell data are yuck but I guess they’ll be with us for a while
John Kerl (14:47:40): > https://cloud.tiledb.com/notebooks/details/johnkerl-tiledb/33c4fe81-d15f-43cd-a588-5c277cf70cb6/preview (notebook soma-testing) has ingest examples you can copy/paste
Seth Shelnutt (14:58:25): > @Seth Shelnutt has joined the channel
2022-07-20
Hervé Pagès (04:18:22): > @Bernat Bramon Try readSparseCSV() from the S4Arrays package to natively load a CSV matrix into a dgCMatrix object: https://github.com/Bioconductor/S4Arrays (still a work in progress)
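A minimal usage sketch, assuming S4Arrays::readSparseCSV() as described (the package was still in development at this point, so the exact signature may differ; counts.csv is a hypothetical file with genes as rows and cells as columns):
> library(S4Arrays)
> m <- readSparseCSV("counts.csv")   # load directly into a sparse (dgCMatrix) object
> dim(m)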
Bernat Bramon (04:46:25) (in thread): > Awesome! Thanks
Tim Triche (08:02:17) (in thread): > Cool thanks Hervé!
Tim Triche (13:00:01) (in thread): > so @Martin Morgan is presenting at today's CZI workshop and in parallel I was playing with tiledbsc and hca / cellxgenedp
Tim Triche (13:00:09) (in thread): > I came across this: > > 'to_single_cell_experiment()': > > Convert to a Bioconductor > SingleCellExperiment::SingleCellExperiment object. >
Tim Triche (13:00:19) (in thread): > is it possible to implement these as regular old coercions?
Tim Triche (13:00:27) (in thread): > so far SOMA looks fantastic
John Kerl (13:40:57) (in thread): > good call@Tim Triche— i’ll defer the R questions to@Aaron Wolen
Aaron Wolen (13:42:41) (in thread): > Absolutely we could do that. Everything’s been R6 centric to facilitate faster cross-language development but it’s probably time to start adding some more familiar s4 methods
Aaron Wolen (13:48:30) (in thread): > Just a friendly warning, the bioc support is still a little undercooked relative to the Seurat side of the house (where we’ve had more customer engagement) so there may well be dragons there. But happy to fix/improve anything you come across.
Tim Triche (17:08:44) (in thread): > no worries – my only issue with Seurat is that the data structures are fairly horrible for anyone used to BioC workflows, and often end up putting assays in the wrong place if working from an .h5ad file (e.g. from scVI)
Tim Triche (17:09:34) (in thread): > we have used Seurat and Signac a ton lately and frankly it’s obnoxious to do simple things with that object model.
Tim Triche (17:10:07) (in thread): > I guess the obvious thing for us to do is try moving some of those workflows to SOMA and SCE, and see if the pain is lessened:wink:
John Kerl (17:11:08) (in thread): > definitely … but do know we’re moving from a prototype/v0 spec to a v1 spec which has better support for multimodal — where things go, i hope you like better from v0 to v1 but we should chat:slightly_smiling_face:
Tim Triche (17:11:21) (in thread): > right on!
Tim Triche (17:11:36) (in thread): > as long as the stack of matrices can be different sizes and annotated accordingly, that sounds great
Tim Triche (17:12:30) (in thread): > I just want them to end up in the right places. My experience was that Seurat tried to interpret things and got it wrong; SCE didn’t attempt to do such things and as a result things went where I placed them. The latter seems a lot less brittle.
Tim Triche (17:13:29) (in thread): > our specific project used in silico gating of CITE-seq to match RNA clusters to ADT "gates" per flow, validated in a second dataset, and then label transferred onto plate-scATAC data since that's all there is
Tim Triche (17:15:50) (in thread): > there are myriad issues here: scGate uses AUCell (which can handle SCE) to gate cells from Seurat; label transfer from the latter to ATAC; matching regions of interest to bulk contrasts between cell types and strains with/without treatment
Tim Triche (17:17:10) (in thread): > but it would be fun to work through this with a cloud-friendly data repository (even if we really just store it locally) and not have to deal with paths etc.
2022-07-25
Hervé Pagès (01:18:19) (in thread): > The function now has a full man page :smiley: I also optimized it a little (made it slightly faster + reduced its memory usage) and added writeSparseCSV(). This is in S4Arrays 0.3.0.
2022-07-26
Raphael Gottardo (02:19:25): > Thanks @Hervé Pagès!
2022-08-15
Alexander Bender (07:17:48): > @Alexander Bender has joined the channel
2022-08-22
Vince Carey (23:01:19): > https://www.hdfgroup.org/2022/08/hsds-streaming/discusses an update to HDF Scalable Data Service. - Attachment (The HDF Group): HSDS Streaming - The HDF Group > Highly Scalable Data Service principal architect John Readey covers an update to the Highly Scalable Data Service. The max request size limit per HTTP request no longer applies with the latest HSDS update. In the new version large requests are streamed back to the client as the bytes are fetched from storage. Regardless of the size of the read request, the amount of memory used by the service is limited and clients will start to see bytes coming back while the server is still processing the tail chunks in the selection. The same applies for write operations—the service will fetch some bytes from the connection, update the storage, and fetch more bytes until the entire request is complete. Learn more about this update, plus check out John’s benchmark results using a couple of different MacBook Pros and his new DevOne laptop.
2022-08-23
Tyrone Lee (14:00:40): > @Tyrone Lee has joined the channel
2022-09-26
Sean Davis (22:24:00): > https://google.github.io/tensorstore/index.html
2022-09-27
Jennifer Holmes (16:14:50): > @Jennifer Holmes has joined the channel
2022-11-17
Marcel Ramos Pérez (18:25:56) (in thread): > @Mike Smith It looks like TENxIO is bumping into this issue with Rhdf5lib on Mac: https://bioconductor.org/checkResults/devel/bioc-LATEST/TENxIO/merida1-checksrc.html
2022-11-18
Mike Smith (04:20:18) (in thread): > @Marcel Ramos Pérez I'm not sure I'll get a chance to look at this in the short term. I remember making very little progress when I wrote those comments a few years ago. If you want to get TENxIO passing the build, you can try and skip that test if S3 support isn't configured. Take a look at the top of this file for how I detect that with rhdf5 (https://github.com/grimbough/rhdf5/blob/master/tests/testthat/test_S3.R). I should probably turn that into a more useful diagnostic function at some point.
2022-12-14
Lijia Yu (19:38:40): > @Lijia Yu has joined the channel
2022-12-21
Andres Wokaty (15:03:18): > @Andres Wokaty has joined the channel
2023-01-10
Robert Shear (14:08:36): > @Robert Shear has joined the channel
2023-01-17
Charlotte Soneson (07:08:52): > I'm wondering if someone may have an idea about the reasons behind the issue represented by the following small example: I first create an HDF5Array, and then I would like to read it in blocks, in parallel, and do some processing on the blocks. This seems to work fine if I save the hdf5 file on our (Linux) server and run the code from there, or if I generate it on my (mac) laptop (or copy the one from the server) and run the code there. However, if I mount the file system of the server on the laptop and try to read the file from there, in an R session running on the laptop, it fails unless I set the number of workers to 1. I was guessing it has something to do with the hdf5 file being locked when used by another process (I was reading here for example), but I don't understand what makes it fail only in the last setup, and if there's something to do about it (perhaps I'm doing something wrong/suboptimal on a deeper level). Here's the code: > > writeHDF5Array(matrix(rnorm(1e6), nrow = 100), filepath = "dat.h5", > name = "dat", chunkdim = c(100, 100)) > hdf5 <- HDF5Array("dat.h5", "dat") > full_grid <- DelayedArray::colAutoGrid(hdf5, ncol = min(100, ncol(hdf5))) > nblock <- length(full_grid) > > ## This is the code that I refer to above > res <- BiocParallel::bplapply(seq_len(nblock), function(b) { > ref_block <- DelayedArray::read_block(hdf5, full_grid[[b]]) > return(1)}, BPPARAM = BiocParallel::MulticoreParam(workers = 3)) >
> BiocParallel
gives me the following error: > > Error: BiocParallel errors > 0 remote errors, element index: > 66 unevaluated and other errors > first remote error: >
> Switching to mclapply
instead, with 3 workers I get the correct behaviour for every third block, and the following error for the others: > > <simpleError in .Call2("C_h5openlocalfile", filepath, readonly, PACKAGE = "HDF5Array"): failed to open HDF5 file '/...mounted file path.../dat.h5'> >
> I get the same behaviour if I use blockApply() (which maybe is preferable for other reasons?). Finally, calling h5testFileLocking() on the file returns TRUE on both systems. Any pointers would be appreciated :slightly_smiling_face::gratitude-thank-you:
Mike Smith (07:52:02) (in thread): > What protocol are you using to mount the remote drive? > > Does running rhdf5::h5disableFileLocking() change the behaviour? I'm not sure if you'd want to run it before executing the parallel code or within the parallel function.
Charlotte Soneson (07:54:03) (in thread): > I'm using smb to mount the drive. I tried calling disableFileLocking before the code, but that didn't make a difference - I'll try within the function.
Charlotte Soneson (07:56:17) (in thread): > Also calling it within the function doesn’t seem to help
Charlotte Soneson (09:02:07) (in thread): > We just found this (under Special cases - Multiple opens): https://portal.hdfgroup.org/display/HDF5/H5F_OPEN. Perhaps we're trying to do something that shouldn't be done, and we should just restrict ourselves to single-threaded use here.
Mike Smith (09:09:37) (in thread): > Can you try just opening the HDF5 file in your parallel function? Maybe we can strip away any of the other packages and see if that makes things any clearer. > > Maybe something like this: function(x) { fid <- rhdf5::H5Fopen(x, flags = "H5F_ACC_RDONLY"); rhdf5::H5Fclose(fid); }
I’m just writing that from memory, so might need some refinement. The intention is to do nothing but open the file and then close it.
Mike Smith (09:10:47) (in thread): > I guess it’s possible that happens so fast they won’t interfere with each other, so you could add something built around H5Dread
to extract some data too.
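For reference, a minimal sketch of the stripped-down test Mike describes, using only rhdf5 calls. The file name "dat.h5" and dataset name "dat" come from Charlotte's example above; the worker count is arbitrary, and whether this fails will depend on where the file lives (local disk vs. smb mount):

    library(BiocParallel)
    library(rhdf5)

    ## Each worker opens the file read-only, reads the dataset, and closes
    ## everything again, so HDF5Array/DelayedArray are taken out of the picture.
    touch_file <- function(i, path) {
        fid <- H5Fopen(path, flags = "H5F_ACC_RDONLY")
        did <- H5Dopen(fid, "dat")
        dat <- H5Dread(did)
        H5Dclose(did)
        H5Fclose(fid)
        dim(dat)
    }

    res <- bplapply(1:6, touch_file, path = "dat.h5",
                    BPPARAM = MulticoreParam(workers = 3))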
Henrik Bengtsson (13:14:25) (in thread): > Drive-by comment: Forked parallel processing is known to cause all sorts of hard-to-troubleshoot errors in R. If this is the case here, replacing MulticoreParam
with SnowParam
should avoid the problem, or at least give another type of error.
Charlotte Soneson (15:49:42) (in thread): > Thanks both! So interestingly, replacing MulticoreParam()
with SnowParam()
does seem to help, but only when used in combination with rhdf5::h5disableFileLocking()
(called outside the parallel function) - I don’t think we tried that combination before.
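For anyone landing here later, a sketch of the combination reported to work in this thread (same "dat.h5" example as above; whether the locking workaround is needed seems to depend on how the remote share is mounted, so treat this as a starting point rather than a recipe):

    library(BiocParallel)
    library(DelayedArray)
    library(HDF5Array)

    rhdf5::h5disableFileLocking()   # called once in the manager, before the parallel code

    hdf5 <- HDF5Array("dat.h5", "dat")
    full_grid <- colAutoGrid(hdf5, ncol = min(100, ncol(hdf5)))
    nblock <- length(full_grid)

    ## Pass the array and grid as explicit arguments so the snow workers
    ## do not depend on objects in the manager's global environment.
    res <- bplapply(seq_len(nblock), function(b, x, grid) {
        block <- read_block(x, grid[[b]])
        ncol(block)
    }, x = hdf5, grid = full_grid, BPPARAM = SnowParam(workers = 3))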
2023-01-18
Charlotte Soneson (03:04:11) (in thread): > And just opening the file in parallel as suggested by Mike above also recapitulates the same behaviour - it works fine if the file is local, or with SnowParam()
if rhdf5::h5disableFileLocking()
is called before, but not with MulticoreParam()
or with SnowParam()
without disabling the file locking.
Charlotte Soneson (03:12:10) (in thread): > …and in the latter case, the error is > > Error in rhdf5::H5Fopen("<remote path>", : HDF5. File accessibility. Unable to open file. >
Mike Smith (05:43:00) (in thread): > I’m in the office today and managed to replicate your setup with an smb-mounted drive. I can confirm I see the same behaviour in the default case. Interestingly (but not helpfully), setting h5disableFileLocking()
allows both MulticoreParam()
and SnowParam()
to work for me.
Charlotte Soneson (09:38:16) (in thread): > Interesting indeed - I just tried again in a new R session, and even after h5disableFileLocking()
, MulticoreParam()
fails for me:thinking_face:. From our perspective, I think this is still very valuable, as at least one combination works:slightly_smiling_face:
Hervé Pagès (22:45:46) (in thread): > Could it be that smb’s own file locks are getting in the way? I don’t have experience with smb, but it seems that file locking can be configured on the samba server side, so maybe that’s the difference between your setup, Charlotte, and Mike’s setup?
2023-01-21
Hien (16:01:26): > @Hien has joined the channel
2023-02-06
Ying Chen (21:37:20): > @Ying Chen has joined the channel
2023-03-10
Edel Aron (15:27:59): > @Edel Aron has joined the channel
2023-03-15
Amarinder Singh Thind (01:45:07): > @Amarinder Singh Thind has joined the channel
Jiefei Wang (11:16:45): > @Jiefei Wang has joined the channel
2023-03-31
Monica Valecha (08:35:37): > @Monica Valecha has joined the channel
2023-04-26
Robert Shear (15:57:32): > @Robert Shear has left the channel
2023-05-03
Rebecca Butler (15:18:28): > @Rebecca Butler has joined the channel
2023-05-08
Axel Klenk (08:46:14): > @Axel Klenk has joined the channel
2023-05-18
Oluwafemi Oyedele (05:53:45): > @Oluwafemi Oyedele has joined the channel
2023-05-27
nezar (15:47:33): > @nezar has joined the channel
2023-06-07
Alyssa Obermayer (18:28:14): > @Alyssa Obermayer has joined the channel
2023-06-19
Pierre-Paul Axisa (05:08:36): > @Pierre-Paul Axisa has joined the channel
2023-07-04
Alexander Bender (09:36:13): > @Alexander Bender has left the channel
2023-07-06
Assa (02:54:07): > @Assa has left the channel
2023-07-28
Konstantinos Daniilidis (13:47:31): > @Konstantinos Daniilidis has joined the channel
2023-08-12
Xiuwen Zheng (22:18:11): > @Xiuwen Zheng has joined the channel
2023-08-15
Nick Owen (07:58:48): > @Nick Owen has joined the channel
2023-08-20
Federica Gazzelloni (10:37:54): > @Federica Gazzelloni has joined the channel
Jacques SERIZAY (10:38:31): > @Jacques SERIZAY has joined the channel
2023-08-28
Abdullah Al Nahid (15:06:14): > @Abdullah Al Nahid has joined the channel
2023-09-20
Jaykishan (05:30:16): > @Jaykishan has joined the channel
2023-11-21
Konstantinos Geles (Constantinos Yeles) (05:42:06): > @Konstantinos Geles (Constantinos Yeles) has left the channel
2023-11-30
Alex Bott (10:18:51): > @Alex Bott has left the channel
2024-05-15
Sunil Nahata (08:31:29): > @Sunil Nahata has left the channel
2024-07-04
Sounkou Mahamane Toure (15:28:51): > @Sounkou Mahamane Toure has joined the channel
2024-07-11
Sathish Kumar (06:02:32): > @Sathish Kumar has joined the channel
2024-07-30
Jorge Kageyama (17:48:43): > @Jorge Kageyama has joined the channel
2024-08-19
Rema Gesaka (09:41:06): > @Rema Gesaka has joined the channel
2024-09-04
Jiefei Wang (17:22:17): > Hi folks, I’m one of the developers of BiocParallel
, and I’ve noticed the channel has been quiet for a while. I’d like to start a conversation and gather your thoughts on parallel computing challenges, especially in the context of using BiocParallel
. > > What hurdles have you encountered in your research with parallel computing? Specifically, if you’ve had to switch to other parallel packages, what features or limitations prompted you to make the change? I’m hoping to collect suggestions and explore ideas for future development.
2024-09-05
Antonin Thiébaut (03:44:31): > @Antonin Thiébaut has joined the channel
Vince Carey (08:12:30): > How about measurement? How do you assess whether you are using resources efficiently? Rcollectl can tell you what percentage of CPU is in use on Linux for multicore. Can we do better in a wider set of contexts?
Martin Morgan (08:20:23) (in thread): > mirai and the underlying nanonext seem to provide a very nice / robust / fast alternative to sockets. - Attachment (cran.r-project.org): mirai: Minimalist Async Evaluation Framework for R > Designed for simplicity, a ‘mirai’ evaluates an R expression asynchronously in a parallel process, locally or distributed over the network, with the result automatically available upon completion. Modern networking and concurrency built on ‘nanonext’ and ‘NNG’ (Nanomsg Next Gen) ensures reliable and efficient scheduling, over fast inter-process communications or TCP/IP secured by TLS. Advantages include being inherently queued thus handling many more tasks than available processes, no storage on the file system, support for otherwise non-exportable reference objects, an event-driven promises implementation, and built-in asynchronous parallel map. - Attachment (cran.r-project.org): nanonext: NNG (Nanomsg Next Gen) Lightweight Messaging Library > R binding for NNG (Nanomsg Next Gen), a successor to ZeroMQ. NNG is a socket library implementing ‘Scalability Protocols’, a reliable, high-performance standard for common communications patterns including publish/subscribe, request/reply and service discovery, over in-process, IPC, TCP, WebSocket and secure TLS transports. As its own threaded concurrency framework, provides a toolkit for asynchronous programming and distributed computing, with intuitive ‘aio’ objects which resolve automatically upon completion of asynchronous operations, and synchronisation primitives allowing R to wait upon events signalled by concurrent threads.
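A minimal illustration of the mirai model Martin points to; this is the package's own API rather than a BiocParallel integration, and the worker count and expression are arbitrary:

    library(mirai)

    daemons(3)                      # launch three background worker processes
    m <- mirai(sum(x), x = 1:10)    # evaluate asynchronously on a daemon
    call_mirai(m)                   # block until the evaluation completes
    m$data                          # the result: 55
    daemons(0)                      # shut the daemons down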
Robert Castelo (09:37:16) (in thread): > I would suggest using the cli package for showing progress, also with bpiterate()
, which currently shows an increasing counter on the number of tasks, instead of a progress bar. - Attachment (cran.r-project.org): cli: Helpers for Developing Command Line Interfaces > A suite of tools to build attractive command line interfaces (‘CLIs’), from semantic elements: headings, lists, alerts, paragraphs, etc. Supports custom themes via a ‘CSS’-like language. It also contains a number of lower level ‘CLI’ elements: rules, boxes, trees, and ‘Unicode’ symbols with ‘ASCII’ alternatives. It support ANSI colors and text styles as well.
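For context, the kind of display being asked for; this is plain cli in a sequential loop as an illustration of the API, not something BiocParallel currently does for bpiterate():

    library(cli)

    n <- 200
    cli_progress_bar("Processing tasks", total = n)
    for (i in seq_len(n)) {
        Sys.sleep(0.01)         # stand-in for one unit of work
        cli_progress_update()
    }
    cli_progress_done()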
Jiefei Wang (09:49:44) (in thread): > This is a good idea. The worker R process is blocked while doing calculations. Perhaps we need a daemon process (or just a thread?) to monitor and report the resources. On the user side we could use R Shiny to provide an interface. It sounds like a good R package!
Gabriel Hoffman (10:56:09): > How about improving startup times and memory sharing across processes? My applications are all embarrassingly parallel: do the same thing across each of 20K genes (i.e. rows or columns). I find it takes a long time to ramp up to using multiple processes, and each process uses more memory than I expect. Even using iterators, I find a substantial overhead to launching BiocParallel processes. Or does a solution already exist?
Jiefei Wang (11:26:26) (in thread): > Have you tried SharedObject
? This is a package that allows you to share data across processes on a single machine. It wouldn’t change your code a lot; just calling obj <- share(obj)
before parallelization is enough to make it a shared object. The rest of the code can stay the same.
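A minimal sketch of that pattern; the object size and worker count are arbitrary, and the claim that only a small shared-memory descriptor travels to socket workers is my reading of how the package is intended to work:

    library(SharedObject)
    library(BiocParallel)

    mat <- matrix(rnorm(2e6), nrow = 100)
    mat <- share(mat)                       # place the matrix in shared memory

    ## Workers attach to the same memory instead of receiving a full copy.
    res <- bplapply(seq_len(ncol(mat)), function(j, m) mean(m[, j]),
                    m = mat, BPPARAM = SnowParam(workers = 3))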
Kasper D. Hansen (12:35:51) (in thread): > I also had some questions about SharedObject. For example, it is not clear from the documentation that it works with mclapply
Kasper D. Hansen (12:36:05) (in thread): > But Martin told me that it should work which would be amazing
Jiefei Wang (12:42:15) (in thread): > It should work. SharedObject
is not bound to BiocParallel
. You can use any parallel package you like: parallel, foreach, future, etc.
Jiefei Wang (12:42:43) (in thread): > (I’m the author of SharedObject
, if it does not work, let me know)
Kasper D. Hansen (12:44:17) (in thread): > I know you’re the author which is why I am jumping on the opportunity:slightly_smiling_face:
Kasper D. Hansen (12:44:44) (in thread): > I need to return to this, but I will say the package could really do with some documentation improvement in this space
Kasper D. Hansen (12:45:01) (in thread): > But perhaps I should return to this and then reach out when I have issues / more specific suggestions
2024-09-11
Kylie Bemis (16:41:11): > Supporting parallel::makePSOCKcluster
in SnowParam
would be my biggest request, for faster cluster startup and support for options like useXDR=FALSE
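For reference, this is the kind of startup tuning parallel itself exposes; a sketch only, since whether it helps depends on the platform and on how much data is shipped to the workers:

    library(parallel)

    cl <- makePSOCKcluster(4, useXDR = FALSE)   # skip XDR encoding of transfers
    res <- parLapply(cl, 1:8, function(i) i^2)
    stopCluster(cl)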
Henrik Bengtsson (16:48:46) (in thread): > See https://github.com/Bioconductor/BiocParallel/issues/231 for this exact feature request.
Kylie Bemis (16:51:02) (in thread): > Yep, I found that issue when searching for a solution. Would love to see it developed. :thumbsup: For now I am starting the cluster with parallel
and coercing to a SnowParam
object, but that’s not ideal since some options like logging aren’t supported that way.
2024-09-12
Vince Carey (08:04:16) (in thread): > @Jiefei Wang^^
Jiefei Wang (23:31:14) (in thread): > I have to go back and check if this is possible; I’ll put it on my todo list
2024-09-16
Mike Morgan (06:23:34): > @Mike Morgan has joined the channel
2024-10-23
Hong Qin (17:46:07): > @Hong Qin has joined the channel
2024-11-02
skuba (14:13:06): > @skuba has joined the channel
2024-11-10
Shian Su (19:47:46): > Very late to the conversation, but I’d appreciate more detailed documentation around memory. My specific challenge with parallelisation in R in general is explosive memory consumption: I don’t know when the entire global environment is going to be duplicated across workers, especially within RStudio. All I know is there’s some weird interaction between forking, GC and RStudio. I would be a lot more fearless about using parallelism as a developer if I had some definitive idea of what data can/will be duplicated. carrier
and job
for example allow you to specify what is exported. > > If such documentation already exists then I’d appreciate someone linking it for me; I haven’t looked into this for at least 3 years.
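For comparison, the carrier pattern Shian refers to makes the exported objects explicit; a sketch with illustrative object names:

    library(carrier)

    offset <- 2
    big_unrelated_object <- rnorm(1e7)   # should never travel to a worker

    ## crate() replaces the function's environment, so only what is named
    ## in the call (here, `offset`) is carried along with it.
    f <- crate(function(x) x + offset, offset = offset)
    f(1:3)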
Vince Carey (20:35:43): > Hi Shian, have you looked at https://www.bioconductor.org/packages/release/bioc/html/SharedObject.html - Attachment (Bioconductor): SharedObject > This package is developed for facilitating parallel computing in R. It is capable to create an R object in the shared memory space and share the data across multiple R processes. It avoids the overhead of memory dulplication and data transfer, which make sharing big data object across many clusters possible.
Shian Su (22:21:03) (in thread): > Thanks Vince, that looks promising; I will give it a try next time I try to do some parallelism. I can’t quite tell from the documentation: does this only work for FORK-based clusters? Obviously it only works if the workers are on the same machine, but can it share memory across processes?
2024-11-11
Vince Carey (05:23:14) (in thread): > @Jiefei Wang^^
2024-11-22
Jiefei Wang (22:25:08) (in thread): > Hi Shian, just to be clear, are you trying to use forked workers to do parallelisation? The forked process shares the same memory with the master process, so usually you do not need to use SharedObject
. However, I do know that for a long-running task, the GC in R can force the workers to make a copy of the memory. I do not know if this is your case. @Shian Su
2024-11-24
Shian Su (18:39:20) (in thread): > I think that is my case; my primary concern in general is control over what gets exported into parallel workers. I believe there are circumstances where the workers have full access to the calling environment, which is problematic as a package developer since I don’t know what the users will have in their environment when they use a parallel function, and I become worried that it will blow out the memory. Forking is probably the more annoying situation since it’s hard to reproduce when GC decides to mess everything up, but the same concern remains for multiprocess: I want some guarantees of which objects will and won’t be duplicated.
2024-12-04
Jiefei Wang (17:16:59) (in thread): > Hi Shian, do you think exportvariables
in BiocParallel::bpoptions
can solve your problem? Basically it can control which objects get exported to the worker and which do not. You might want to do some testing to see how it works for forked processes, as this is primarily designed for snow workers.
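A hedged sketch of that suggestion; the exact semantics of the option should be checked against ?bpoptions, and as Jiefei notes it is aimed at snow-style workers, so data the function needs is passed explicitly as an argument here:

    library(BiocParallel)

    y <- 2
    huge <- rnorm(1e7)   # something we do not want shipped to the workers

    res <- bplapply(1:4, function(i, y) i * y, y = y,
                    BPPARAM = SnowParam(workers = 2),
                    BPOPTIONS = bpoptions(exportvariables = FALSE))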
Shian Su (18:31:38) (in thread): > Thanks for letting me know about that argument; from the documentation it sounds like what I am looking for. I would turn global export off and choose the export variables manually.
Shian Su (18:32:53) (in thread): > In practice I am not really sure how I can actually test forking behaviour; I don’t know how to reliably trigger whatever RStudio/R does that causes CoW and memory explosion. I could at the very least verify that variables I don’t want to be exported are not visible in the forked worker; that should offer some peace of mind.
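One quick way to do that last check on a Unix-alike (a sketch; under forking the child inherits the whole session, so this mostly confirms the copy-on-write situation rather than preventing it, whereas a snow worker with exports restricted should report FALSE):

    library(parallel)

    secret <- rnorm(1e6)   # an object we would rather not expose to workers
    visible <- mclapply(1, function(i) exists("secret"), mc.cores = 2)[[1]]
    visible                # TRUE with forked workers: the parent's globals are visible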
2025-03-05
Benjamin Hernandez Rodriguez (22:26:55): > @Benjamin Hernandez Rodriguez has joined the channel
2025-03-17
Sunil Nahata (09:26:13): > @Sunil Nahata has joined the channel
2025-03-18
Nicolo (14:58:17): > @Nicolo has joined the channel