#biochubs
2018-10-31
Lori Shepherd (07:43:36): > @Lori Shepherd has joined the channel
Lori Shepherd (07:43:36): > set the channel description: give feedback or report issues on Bioconductor Hubs
Levi Waldron (07:43:36): > @Levi Waldron has joined the channel
Marcel Ramos Pérez (07:43:36): > @Marcel Ramos Pérez has joined the channel
Martin Morgan (07:43:36): > @Martin Morgan has joined the channel
2018-11-01
Charlotte Soneson (14:12:53): > @Charlotte Soneson has joined the channel
2018-12-14
Rena Yang (12:42:09): > @Rena Yang has joined the channel
2019-01-17
Jayaram Kancherla (15:30:41): > @Jayaram Kancherla has joined the channel
Jayaram Kancherla (15:35:48): > Hi @Lori Shepherd is there a way to get the response as json for the annotationhub api? > > for example the query endpoint http://annotationhub.bioconductor.org/query/dataprovider(BroadInstitute)
Lori Shepherd (15:57:26): > The individual resource results are in json but not the query result - off hand I don’t think so but I can look into it
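A hedged illustration of one workaround while JSON output for queries isn't available: run the query from R and serialize the metadata yourself (the jsonlite call and mcols() accessor describe a client-side approach, not part of the hub web API).

    library(AnnotationHub)
    library(jsonlite)

    ah <- AnnotationHub()
    hits <- query(ah, "BroadInstitute")     # same filter as the web endpoint
    meta <- as.data.frame(mcols(hits))      # metadata for the matching records
    json <- toJSON(meta, pretty = TRUE)     # serialize to JSON locally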
2019-03-13
James MacDonald (11:11:03): > @James MacDonald has joined the channel
Kasper D. Hansen (11:11:03): > @Kasper D. Hansen has joined the channel
Aaron Lun (11:11:03): > @Aaron Lun has joined the channel
Lori Shepherd (11:11:53): > <!channel> Hello Hub Users, > > It has been a concern for a while to have file-level checks for when a remote data resource has been updated. While we implemented the option to force a download (hub[[force=TRUE]]), we still felt this should be improved upon. With the creation of BiocFileCache, we decided it was time for the Hubs to switch to using BiocFileCache in the backend to provide this functionality. We have some test branches created, and before we fully implement we would appreciate your help in testing and providing feedback. The branch is called convertToBFC on GitHub for both AnnotationHub and ExperimentHub (if using ExperimentHub you will obviously need both branches): https://github.com/Bioconductor/AnnotationHub/tree/convertToBFC and https://github.com/Bioconductor/ExperimentHub/tree/convertToBFC > > Another notable change besides having BiocFileCache control the download and caching mechanism: BiocFileCache uses rappdirs::user_cache_dir to determine the default location of the directory where files are stored. This has also been updated in the Hubs. For convenience we provide the helper function convertHub() / convertHub(hubType="ExperimentHub") that will transition between the old hub default and the new hub default location, attempting to redownload any previously downloaded resources. (Feel free to test - it does not destroy the old cache location, so it will still be intact.) > > We realize there is still much work to be done on the Hubs to make them more user friendly, including some big desirable functionality. Please try to refrain from those comments for now; we really would just like feedback on the BiocFileCache implementation. Our next step after this will be to work on a versioning mechanism to allow more than one version of a file to be in the hubs and tracked by the same ID, and possibly improved remote querying (just so you have an idea of where we plan to go). > > We look forward to your feedback. Please send comments to Lori.Shepherd@Roswellpark.org
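A minimal sketch of trying the test branches described above (the helper names are taken from the announcement; installation via the remotes package is an assumption):

    # install the test branches (assumes the 'remotes' package is available)
    remotes::install_github("Bioconductor/AnnotationHub", ref = "convertToBFC")
    remotes::install_github("Bioconductor/ExperimentHub", ref = "convertToBFC")

    library(AnnotationHub)
    library(ExperimentHub)

    # migrate the old default cache locations to the new BiocFileCache-backed ones
    convertHub()
    convertHub(hubType = "ExperimentHub")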
Valerie Obenchain (11:14:18): > @Valerie Obenchain has joined the channel
Johannes Rainer (11:15:38): > @Johannes Rainer has joined the channel
Kayla Interdonato (11:39:41): > @Kayla Interdonato has joined the channel
2019-03-17
gamzeaydilek (07:17:28): > @gamzeaydilek has joined the channel
2019-03-26
Lori Shepherd (07:48:57): > We would like to push these changes later this week or early next - If anyone has any additional feedback please email lori.shepherd@roswellpark.org soon <!channel>
Johannes Rainer (09:10:30): > Actually, I ran into an error: > > > convertHub() > Attempting to redownload: 4 hub resources > /Users/jo/Library/Caches/AnnotationHub > does not exist, create directory? (yes/no): yes > Error in FUN(X[[i]], ...) : unused argument (exact = TRUE) >
Lori Shepherd (09:15:40): > Thanks I’ll look into it!
Lori Shepherd (10:48:21) (in thread): > Should be resolved with a BFC version >= 1.5.1
Lucas Schiffer (19:31:35): > @Lucas Schiffer has joined the channel
2019-03-27
Johannes Rainer (02:24:07): > Was my fault - old version of BiocFileCache. It’s working now :+1:
2019-03-29
Lori Shepherd (08:27:11): > I’m going to merge the branches into master and push up later this morning - please keep me posted on any feedback - cheers
2019-04-10
Sean Davis (09:48:55): > @Sean Davis has joined the channel
2019-05-07
Lori Shepherd (07:12:59): > <!channel> I realize there is a bug somewhere in the devel version of the hubs - I’m looking into it this morning and hope to have a resolution shortly.
2019-05-14
Aaron Lun (22:45:01): > @Davide Risso You should consider making scRNAseq a communal dumping ground for random scRNA-seq data sets that don’t thematically belong anywhere else. I’ve got a whole bunch of wrangled data sets from simpleSingleCell that could live in there for use by other people. If we EHub it, it should be pretty light (and a bunch of the existing data sets could go into EHub as well).
Davide Risso (22:45:05): > @Davide Risso has joined the channel
Kasper D. Hansen (22:53:33): > @Lori Shepherd Does ExperimentHub support hosting the material outside of S3?
Kasper D. Hansen (22:54:05): > We will have 5+TB of data and we want to host it elsewhere
2019-05-15
Lori Shepherd (07:08:56): > @Kasper D. Hansen As long as it is publicly accessible and can follow our naming schema, yes. You would add the Location_Prefix column in the metadata.csv file. We would still require the data to be organized in a subdirectory with the same name as the package
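A hedged sketch of one way to build such a metadata row from R (the package name, file, URL, and column subset are placeholders; a real metadata.csv needs the full set of columns described in the hub documentation):

    # illustrative metadata row for an externally hosted file
    meta <- data.frame(
        Title           = "Sample 1 coverage",
        Description     = "bigWig coverage for sample 1",
        RDataPath       = "MyDataPackage/sample1.bw",   # subdirectory = package name
        DispatchClass   = "BigWig",
        Location_Prefix = "https://data.mylab.org/"     # external host instead of Bioc S3
    )
    write.csv(meta, file = "inst/extdata/metadata.csv", row.names = FALSE)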
Kasper D. Hansen (08:39:18): > That’s good
Kasper D. Hansen (08:39:34): > (would be nice to have no constraints on the directory structure)
Kasper D. Hansen (08:40:21): > What are the requirements on the file types? I can put bigWig right? Is it possible to preprocess the data using a script. For example, could I have CSV files stored and I provide a script for parsing them into SummarizedExperiment?
Lori Shepherd (08:49:25): > There needs to be a resource load method - The currently supported methods can be accessed with AnnotationHub::DispatchClassList() - The first column is the name of the dispatch class to use, the second column gives information on the internal method call - Note: there is a BigWig dispatch class but that will use rtracklayer::BigWigFile() - If you want a file path to utilize in a script, use FilePath - FilePath returns a character string path to the locally cached file -
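For reference, a quick hedged example of inspecting the dispatch classes and using a FilePath-style resource (the AH number is a placeholder):

    library(AnnotationHub)

    # two-column matrix: dispatch class name and the internal call used to load it
    AnnotationHub::DispatchClassList()

    # a resource registered with the FilePath dispatch class returns only the
    # path to the locally cached file, to be parsed however you like
    ah <- AnnotationHub()
    path <- ah[["AH00001"]]   # placeholder ID for a FilePath-dispatched resource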
Lori Shepherd (08:50:36): > If you feel there should be a method/recipe added for a particular file type that we currently don’t provide, I can add one as well -
Kasper D. Hansen (09:10:48): > Right now I am thinking about design, I have not looked too closely at the code. But in our case, I think we want to provide data files in non-R formats, because (1) it makes processed data more broadly accessible and (2) parsing a text file for a (say) 30k x 1k SummarizedExperiment is not a big deal.
Kasper D. Hansen (09:11:13): > One thing we are likely to do, is to think about a plain text format for SummarizedExperiments
Kasper D. Hansen (09:11:39): > Alternatively, the different matrices (colData, rowData, assays) could be provided as separate CSV files
Lori Shepherd (09:12:16): > AnnotationHub by design was thought to load data in and take raw files to an R data object - However as mentioned, people desired the use of the raw file directly for their own processing. That was the idea behind the FilePath that will download, cache the file, and only give the file path back for you to use however you like.
Kasper D. Hansen (09:12:43): > We will do some more thinking because in our case we have the same colData for different types of output (think exons vs genes for example)
Kasper D. Hansen (09:13:07): > Yes, it sounds like it is feasible to go down the ExperimentHub path, which is nice
Lori Shepherd (09:13:37): > Let me know if you have any other questions or concerns
Kasper D. Hansen (11:05:45): > Ok thx. It may be a while…
2019-05-21
Aaron Lun (00:09:36): > @Lori Shepherd: me and @Davide Risso are planning to convert scRNAseq to an EHub package from its current data(...) usage. Any suggestions on deprecation of the old behaviour? I don’t see an easy way to insert a deprecation message anywhere…
Lori Shepherd (07:32:43): > hmm…. good question… it seems like there should be a way to add a message when someone uses data(…) but I can’t think of any off hand… You could add a message to the package but that doesn’t help people doing a direct access to your data if they did system.file or something of the like… @Martin Morgan / @Hervé Pagès any suggestions?
Martin Morgan (08:35:32): > If I understand, the idea is that you want to deprecate the data(foo) and point the user to hub[["bar"]] or something. The data() source can be an R file foo.R, and then you can do what you want, including a deprecation message.
Aaron Lun (13:52:03): > Not quite sure I understand - as in, override data() with our own function?
Kasper D. Hansen (14:16:48): > No
Kasper D. Hansen (14:17:33): > When you call data, we normally think of data(foo) as loading a file data/foo.rda in the package.
Kasper D. Hansen (14:17:50): > But actually data() supports .R files as well
Aaron Lun (14:18:29): > Woah
Kasper D. Hansen (14:18:33): > So you could have data/foo.R instead of data/foo.rda. Then inside foo.R you could have > > message("This is super deprecated") >
Aaron Lun (14:19:05): > This is nice, but whoever thought of allowing data() to call R files in the first place?
Kasper D. Hansen (14:19:21): > That has been supported since probably version 0.6.0
Aaron Lun (14:19:39): > as in, the logic behind it.
Kasper D. Hansen (14:19:41): > but look at the details in ?data
Kasper D. Hansen (14:19:50): > I think the idea was that you could have stuff like
Kasper D. Hansen (14:20:02): > > df <- data.frame(a = letters[1:3]) >
Kasper D. Hansen (14:20:32): > In principle this is even more backwards compatible than rda files, which have to be updated when the rda format changes
Kasper D. Hansen (14:20:58): > But it requires you are defining the data as full write-outs
Kasper D. Hansen (14:21:14): > I would not be surprised if some of the old datasets look like this, perhaps cars
Kasper D. Hansen (14:21:43): > Obviously this doesn’t work with bigger data
Kasper D. Hansen (14:23:10): > Yup
Kasper D. Hansen (14:23:47): > > ./src/library/datasets/data/mtcars.R > ./src/library/datasets/data/cars.R >
> and > > # cat src/library/datasets/data/cars.R > cars <- data.frame( > speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11, 11, 12, 12, 12, 12, 13, > 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 16, 16, 17, 17, 17, 18, 18, > 18, 18, 19, 19, 19, 20, 20, 20, 20, 20, 22, 23, 24, 24, 24, 24, 25), > dist = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17, 28, 14, 20, 24, 28, 26, > 34, 34, 46, 26, 36, 60, 80, 20, 26, 54, 32, 40, 32, 40, 50, 42, 56, > 76, 84, 36, 46, 68, 32, 48, 52, 56, 64, 66, 54, 70, 92, 93, 120, 85)) >
Aaron Lun (14:23:59): > So we could probably just throw in > > .Deprecated(msg="this is dep'd") > data(allen) >
Kasper D. Hansen (14:24:01): > this is OLD SKOOL
Kasper D. Hansen (14:24:04): > yes
Aaron Lun (14:24:04): > lol
Kasper D. Hansen (14:24:40): > the data(allen) has to be something else, right
Aaron Lun (14:25:02): > yes.
Kasper D. Hansen (14:25:04): > I mean, it should either have the code to actually load the data from the new package, which is probably more cumbersome, or it should error out
Aaron Lun (14:27:52): > will need to play around with this. It suggests that I’d need an R source file named allen, where the actual Allen data is moved somewhere else. The source file would then call data() on the renamed actual data, plus the deprecation message.
Kasper D. Hansen (14:28:24): > I am also not sure how .Deprecated() works inside data(), but alternatively you could just message("data(allen) is deprecated....")
Aaron Lun (14:28:30): > sure.
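A rough sketch of what such a data/allen.R shim might look like once the dataset itself moves to ExperimentHub (the AllenData() getter is a hypothetical stand-in for whatever accessor the package ends up exporting):

    # data/allen.R -- sourced by data("allen", package = "scRNAseq")
    message("'data(allen)' is deprecated; use AllenData() instead")
    allen <- scRNAseq::AllenData()   # hypothetical ExperimentHub-backed accessor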
2019-05-23
Aaron Lun (00:45:46): > Got it working,https://github.com/drisso/scRNAseq/issues/10#issuecomment-495065351
Aaron Lun (00:46:25): > @Lori Shepherd we were thinking to transition over in waves, which should give us an easier time of fixing issues than a “big bang” switch.
Aaron Lun (00:49:42): > Any thoughts on that? We have two datasets ready to go in https://github.com/drisso/scRNAseq/pull/11.
Lori Shepherd (07:10:38): > Thats fine. It would be easiest for me if there were a new metadata.csv file for each new wave so I know exactly what has been added when - (metadata.csv can be any .csv file as long as I know which name to use in the function to add to the database)
2019-05-25
Aaron Lun (01:24:55): > okay. @Davide Risso how about we just set up everything we currently have and do it all in one go? It’s a long weekend, perfect opportunity to get this done.
Davide Risso (04:01:22): > Ok, but I have a grant deadline on Monday so won’t be able to work on this until Tuesday.
Aaron Lun (05:42:03): > Are the scripts to generate the old Rdata files available somewhere? I can rejig them myself.
Davide Risso (09:53:34): > Mmm… what do you mean to generate the old Rdata files? This was a preprocessing of the data from FASTQs…
Davide Risso (09:54:06): > what do you need exactly? Can’t we just port the existing matrix + colData?
Davide Risso (09:54:31): > Also, since we’re at it, it would make sense to turn these from SE to SCE
Aaron Lun (13:17:05): > I need instructions to regenerate the files from public sources, see the make-zeisel-brain-data.Rmd file for example.
Aaron Lun (13:17:36): > Otherwise we’ll end up in exactly the same situation where we are now, where an observer doesn’t know exactly how these files were generated.
Aaron Lun (14:03:13): > And yes, they will be SCEs, but assembled from components client-side rather than being stored as SCEs server-side.
Aaron Lun (14:03:47): > This allows us to avoid at least one worry about serialized objects upon updating the SCE S4 structure.
Kasper D. Hansen (15:14:21): > Interesting with the assembly. Is that a standard feature or something new you’re baking? I want to do something like that sometime in the future for the next iteration of recount
Aaron Lun (16:46:35): > AFAIK that’s standard. We download the assay, rowData and colData separately, and put it together on the user’s end.
2019-05-30
Aaron Lun (00:12:24): > @Davide Risso you finished your grant yet? Let’s kick off our refactoring party!:party_parrot::party_parrot::party_parrot::party_parrot::party_parrot::party_parrot::party_parrot::party_parrot:
Davide Risso (07:46:38): > Yep. Let’s do this! So, how do we exactly proceed? The vignette includes the details on how we preprocessed the data
Davide Risso (07:48:17): > but, do we want to reprocess them or should we just start with the existing matrices?
Aaron Lun (11:37:44): > In principle, we should just start with the existing matrices, as reprocessed data would be outside the scope of the package (as discussed in https://github.com/drisso/scRNAseq/issues/10). Reprocessing is better suited for stuff like @Charlotte Soneson’s conquer and related projects. We would just be providing one-click solutions to get already-demangled count data for studies of interest. > > In practice, we may want to use the reprocessed data for backwards compatibility. This would require the full instructions (as in, code; not just words) to live in inst/scripts/make-allen-brain-data.Rmd, etc. It depends on how you want to proceed with the deprecation.
2019-06-01
Aaron Lun (18:48:17): > so… @Davide Risso … what’s the situation?
2019-06-03
Davide Risso (10:48:47): > Hi @Aaron Lun, I’m trying to get the actual code of the pre-processing (it was Michael Cole who preprocessed the samples).
2019-06-08
Aaron Lun (20:16:26): > @Davide Risso I can’t push to the repo anymore. What gives?
2019-06-09
Davide Risso (03:57:43): > Uh?:thinking_face:
Davide Risso (03:58:08): - File (PNG): Image from iOS
Aaron Lun (03:58:51): > Nope, getting > > ERROR: Authentication error: Authentication required: You must have push access to verify locks > > error: failed to push some refs to ‘https://LTLA@github.com/drisso/scRNAseq’
Aaron Lun (04:01:08): > okay, fixed.
Davide Risso (04:01:25): > :+1:
Davide Risso (04:01:48): > Btw, still waiting for Michael to send me the preprocessing code
Aaron Lun (04:02:33): > Let’s just move along. Can you review and compile all the reports in inst/scripts, and we’ll compare the checksums we get in the scRNAseq folder.
Davide Risso (04:06:14): > Sure. I’ll do it later today
Davide Risso (10:50:41): > @Aaron Lun I’ve checked and run all scripts. They look fine!
Davide Risso (10:52:55): - File (Plain Text): Untitled
Aaron Lun (12:43:01): > I just tried to merge this with Bioconductor’s Git repo, but the histories are unrelated. Did you not replace your repo when the SVN->Git transition occurred?
Aaron Lun (12:44:12): > While we sort that out,@Lori Shepherd- we’re good for upload to EHub.
Aaron Lun (17:21:13): > There is, perhaps, a more general question of whether it is better to (a) have XXXData() download count matrices from EHub, or (b) have XXXData() download count matrices from GEO or ArrayExpress if they’re already available there. The idea with (b) would be to push the work into the R code on the client side, and to simply cache objects using BiocFileCache; this would avoid the need to constantly ping Lori for every new thing that we want to add.
Aaron Lun (17:23:35): > Of course, this is only possible for datasets where GEO already has the count matrices, and there’s not a lot of additional work to do on the client side. It also requires that the remote resource be stable, which is okay for GEO and ArrayExpress, less so for your average lab FTP server. In the latter case, it is much safer to have a persistent copy on EHub.
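A hedged sketch of the option-(b) pattern being weighed here: cache a remote count matrix with BiocFileCache so repeated calls reuse the local copy (the URL is a placeholder):

    library(BiocFileCache)

    getRemoteCounts <- function(url = "https://example.org/GSExxxx_counts.csv.gz") {
        bfc <- BiocFileCache(ask = FALSE)
        path <- bfcrpath(bfc, url)    # downloads on first use, cached thereafter
        as.matrix(read.csv(path, row.names = 1, check.names = FALSE))
    }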
Aaron Lun (17:51:41): > Okay, having gone through that little thought experiment, XXXData() from EHub is almost 10 times faster than a version that pulls from GEO and does the reformatting on the client side. So it probably is worth the hassle of going onto EHub, especially if it allows people to use the data in tests and examples to fit in under the 5 minute limit.
Aaron Lun (17:53:35): > Unless…. I use BFC to cache the output as well, and then the entire function only needs to be run once for any use.
Aaron Lun (17:59:28): > But then the file needs to be stored twice - and there’s also the issue of what happens if I clobber a BFC entry that happens to have the same name…
Aaron Lun (17:59:41): > Well, okay. Sticking to EHub then.
2019-06-10
Lori Shepherd (07:28:26): > Go ahead and upload whenever - most credentials are temporary so the last round has probably expired since you last used them - let me know which email I should send some over
Aaron Lun (11:24:11): > My usual, infinite.monkeys.with.keyboards@gmail.com
Lori Shepherd (11:26:14): > sent
Aaron Lun (11:30:15): > thanks, will deal with this tonight.
Aaron Lun (11:40:54): > @Davide Risso if you give me write access to the Bioconductor repo, I can manage the rest.
Davide Risso (13:13:15): > Sure! How do I do that?
Aaron Lun (13:16:45): > Ask one of the core team - I guess @Lori Shepherd could do it or pass the message along? Do we have to do it on the mailing list, e.g., to leave a paper trail, or is this good enough?
Lori Shepherd (13:20:03): > which Bioconductor repo? Probably best on the mailing-list to keep track -
Aaron Lun (13:24:07): > Okay. @Davide Risso if you could send a ML email about giving me access to scRNAseq.
Davide Risso (13:35:06): > Done!
Lori Shepherd (13:58:40): > try now - let me know if you still do not have access
2019-06-11
Aaron Lun (00:10:00): > Thanks @Lori Shepherd. It is done, the data is on S3 and I’ve pushed the changes to the package to Github. Bit courageous as the EHub getters have never been tested and will probably fail, so expect some build errors for the next few days.
Aaron Lun (00:10:21): > Also @Davide Risso, you should archive your out-of-date https://github.com/drisso/scRNAseq.
Aaron Lun (00:10:49): > I’m happy to transfer https://github.com/LTLA/scRNAseq (which has the current BioC git history) to you, or I can hold onto it.
Lori Shepherd (13:46:47): > The data for scRNAseq has been added to ExperimentHub and is available in bioc devel 3.10
Aaron Lun (13:47:46): > Sweet, thx
Aaron Lun (22:55:15): > @Nitesh Turaga Now that I’ve moved the data in scRNAseq from data/ to EHub, I just deleted the old *.Rda files from the Git repository. But should I also ask you to wipe them from the Bioc Git history? What’s the official policy here? My inclination would be to leave them in; they’re a bit big and annoying to check out, but not intolerably so.
2019-06-12
Davide Risso (03:29:59): > Thanks @Aaron Lun! I’m archiving my repo, you can hold onto the other one for now!
Nitesh Turaga (10:16:52): > @Nitesh Turaga has joined the channel
Nitesh Turaga (10:19:18) (in thread): > Hi @Aaron Lun There is a tool called BFG repo cleaner https://rtyley.github.io/bfg-repo-cleaner/. You may use that on your “Github” repo to clean everything if you would like to, and then notify me. I will sync your bioc git repo and your github repo. > > Some more reading about this: https://help.github.com/en/articles/removing-sensitive-data-from-a-repository. > > If you prefer not to change it and it’s tolerable, I’d let it be. - Attachment (rtyley.github.io): BFG Repo-Cleaner by rtyley > A simpler, faster alternative to git-filter-branch for deleting big files and removing passwords from Git history.
2019-06-13
Aaron Lun (01:23:08) (in thread): > I think I’ll pass then.
Aaron Lun (01:23:54) (in thread): > On the grounds that, if we deleted the data history from a data package, the package wouldn’t actually have any history to speak of!
2019-06-25
Aaron Lun (00:43:17): > @Lori Shepherd I have a whole stack of mouse brain scRNA-seq datasets to upload to EHub in https://github.com/LTLA/scRNAseq/pull/2. I’ve got metadata CSVs set up and ready to roll as soon as you give the green light. (Haven’t uploaded to AWS yet - are the old creds still valid?)
Aaron Lun (00:45:13): > Incidentally, this is the first chunk of datasets from Martin Hemberg’s website (https://hemberg-lab.github.io/scRNA.seq.datasets/). I’m trying to break up the uploads into manageable chunks, otherwise I’ll go crazy; I’ll try to keep the number of upload requests down to one a week.
Lori Shepherd (09:02:14): > Cool. Try the old credentials; if they don’t work please email me and I’ll send new ones. I’m at BioC2019 until Thursday, but if everything is set and uploaded I could try to add everything on Friday
Aaron Lun (22:49:06): > Should we have an XLS/XLSX sourceType?
Aaron Lun (22:49:47): > And how careful should I be with this - e.g., if it is listed as .txt but parsed as a TSV, should I put down TXT or TSV?
2019-06-26
Aaron Lun (00:04:28): > @Lori Shepherd I’ll send an email with a summary of everything new in this wave.
2019-07-30
Friederike Dündar (10:30:28): > @Friederike Dündar has joined the channel
Friederike Dündar (10:53:08): > thanks for adding me
Tim Triche (10:55:30): > @Tim Triche has joined the channel
Friederike Dündar (10:56:29): > hi, > I’m trying to get a sense of where to go if one was interested in a set of reference expression values of pure cell populations in order to support the packages, such as SingleR, that do automated cell label assignment for scRNA-seq
Friederike Dündar (10:57:09): > for example, there’s this data set from the Allen Brain Atlas: https://experimenthub.bioconductor.org/dataprovider/Allen%20Brain%20Atlas
Friederike Dündar (10:57:24): > which is part of ExperimentHub and seems to be the supporting data package for CellMapper
Friederike Dündar (10:57:54): > I’m dealing with lots of very different scRNA-seq experiments, i.e. sometimes it would make sense to focus on brain cells, sometimes it’s blood cells, sometimes other tissues
Friederike Dündar (10:58:41): > that means, every time I want to run the automated cell label assignment, it’d be nice to be able to fairly quickly pull out the expression values for suitable reference cell types
Friederike Dündar (10:59:18): > I was going to start putting a data package together by annotating and cleaning up the data provided by SingleR: https://github.com/dviraran/SingleR/tree/master/data
Friederike Dündar (11:00:07): > which is when I came across ExperimentHub, wondering whether that data is actually already part of it, but I find browsing the data of ExperimentHub somewhat cumbersome
Tim Triche (11:04:52): > @Friederike Dündar this could be the basis for a compelling informatics proposal. I don’t think it is a minor feature request but rather something that merits long-term planning and prioritization – which aspects of findability are the most important to the most people and the most NIH- or other-funder-prioritized projects?
Friederike Dündar (11:05:26): > Are you referring to the search function?
Tim Triche (11:06:06): > @Friederike Dündar Which is not to say that some sort of seed funding or other quick turnaround scheme couldn’t enable things to get started. Yes I’m referring to the search tools for ExperimentHub/AnnotationHub. They’re not awful but they’re also not intuitive yet (IMHO) even for someone who’s used BioC for a decade plus.
Tim Triche (11:07:28): > @Friederike Dündar this is a recurring problem and your remarks reminded me of the sheer quantity of data and annotations in the Hubs. It’s probably worth figuring out how to fund the improvements to make the platform more broadly useful (it’s already deeply integrated with automation-friendly, open-source workflows, of course).
Friederike Dündar (11:08:23): > I had never even heard of ExperimentHub before, and I’m an avid user of BioC packages
Tim Triche (11:08:33): > @Friederike Dündar I have plenty of examples that I could write out from recent matching-up of annotations and experiments, for example, but my personal favorite example of ExperimentHub is the HCAData package
Tim Triche (11:08:54): > people are shocked when they realize that HCA matrices stored out-of-core are available right out of ehub
Friederike Dündar (11:10:14): > indeed
Tim Triche (11:12:10): > that’s not even the extent of it; Vince Carey and John Readey have made some things possible with on-demand per-row or per-column serving of immense datasets. I threw together a corpus of about 5,000 pediatric cancer patients to test it out, although I was not able to get the abstract submitted in time. More later, remind me to follow up on this. It’s important for a lot of people and the infrastructure is (at least partly) already built.
Friederike Dündar (11:15:09): > well, for me, personally and short-term-goal’ish, I’m basically just trying to figure out what the best practice would be to collect reference data sets for my personal use (as in: shareable with my colleagues and students and whoever else finds them useful, documentable, versionizable)
Friederike Dündar (11:15:49): > what’s the difference between submitting/preparing a data package for ExperimentHub vs. ExperimentData?
Aaron Lun (11:22:52): > @Friederike Dündar please also see https://github.com/LTLA/SingleR.
Aaron Lun (11:22:55): > This would help me a lot.
Friederike Dündar (11:24:12): > Not sure what type of help you’re hoping for….:grin:
Friederike Dündar (11:24:55): > but I’m guessing restructuring the data and submitting it separately would make it easier for you to use it as a dependency?
Aaron Lun (11:24:57): > Well, one of the TODOs is an EHub upload of the various datasets so that they’re not in the package itself.
Friederike Dündar (11:25:08): > got it
Friederike Dündar (11:25:42): > so what would you propose?
Aaron Lun (11:26:04): > I don’t think we need a separate data package, I would just have EHub-related functions in SingleR itself.
Friederike Dündar (11:26:20): > that’s where my understanding of EHub is missing
Aaron Lun (11:26:26): > See https://github.com/LTLA/scRNAseq for some practical examples of what might be required.
Friederike Dündar (11:26:26): > I have no idea what that entails/means
Aaron Lun (11:26:34): > In particular, inst/scripts
Aaron Lun (11:27:24): > just pick one of the various datasets to look at as an example.
Friederike Dündar (11:27:52): > will do, but can you give me a general intro/overview? What’s the role of Ehub there?
Aaron Lun (11:31:00): > It’s a remote hosting of data files. The idea is that you call a function like ZeiselBrainData(), and this pulls down data from BioC’s EHub S3 buckets and into a local cache on your computer. The next time you call this function, you don’t need to download it again, it just uses the local cache. It means that you don’t have to store the file in the package itself - this is good, as it avoids inflation of the Git repository with unnecessarily large blobs, and it also means that people only pay for what they use re. downloads. The second point makes the data packages more scalable as you can have loads of datasets and people only pull down what they need.
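A hedged sketch of what such a getter looks like in practice, assembling the object client-side from separately stored pieces (the function name and EH IDs are placeholders):

    library(ExperimentHub)
    library(SingleCellExperiment)

    ExampleBrainData <- function() {
        eh <- ExperimentHub()
        counts  <- eh[["EH0001"]]   # count matrix (placeholder ID)
        coldata <- eh[["EH0002"]]   # per-cell annotation (placeholder ID)
        SingleCellExperiment(assays = list(counts = counts), colData = coldata)
    }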
Aaron Lun (11:32:16): > In this case, I would have one inst/scripts/make-SingleR-data.Rmd that takes Rdata files from Dvir’s repository and reformats them into RDS files (one per list element).
Friederike Dündar (11:32:27): > right
Aaron Lun (11:32:53): > If you make a PR into LTLA/SingleR, I will provide appropriate code review comments.
Friederike Dündar (11:34:24): > so, these would be the two main issues I’d have to figure out: > 1. Which data to submit to EHub? (the singleR objects contain all sorts of information in addition to $data) > 2. How to submit it?
Aaron Lun (11:35:09): > 1 - just throw it all in, each dataset seems to have a standard format.
Aaron Lun (11:35:33): > 2 - that’s an easy bridge to cross when you get to it.
Aaron Lun (11:35:59): > Re 1: “standard” meaning “consistent”, it’s not really a standard in any actual sense of the word.
Aaron Lun (11:36:09): > So you should be able to iterate over all the datasets and apply the exact same code to each one.
Aaron Lun (11:36:53): > I would curl out the Rdata files from the GitHub repo (avoid cloning it, which is pretty heavy) and then loop over all the individual files, splitting the list into its constituent components for saving.
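A hedged sketch of that conversion loop (the repository layout, file stems, and the $data/$types fields are assumptions about Dvir's objects, so treat this purely as an outline):

    base  <- "https://github.com/dviraran/SingleR/raw/master/data"
    stems <- c("hpca", "blueprint_encode")                  # example file stems

    for (s in stems) {
        rda <- file.path(tempdir(), paste0(s, ".rda"))
        download.file(file.path(base, paste0(s, ".rda")), rda, mode = "wb")
        env <- new.env()
        load(rda, envir = env)
        obj <- get(s, envir = env)                          # assumes object name == file stem
        saveRDS(obj$data,  file = paste0(s, "-logexprs.rds"))   # expression matrix
        saveRDS(obj$types, file = paste0(s, "-labels.rds"))     # per-sample labels
    }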
Tim Triche (11:37:08): > @Aaron Lun how do you feel about the approach taken by the HCAdata package?
Friederike Dündar (11:37:33): > it would be a start, I guess, but my initial goal was to actually have a package that would be amenable to additional reference data sets, ideally stored/accessed in a less mind-boggling way than the structures SingleR has implemented
Aaron Lun (11:38:43) (in thread): > Well, storage and in-memory representation don’t have to be the same. And in fact they’re often not. In scRNAseq, I save all the individual components to file, and use them to reconstruct a SingleCellExperiment on the client side when they pull down the content.
Friederike Dündar (11:39:49) (in thread): > Yeah, I guess I was thinking along the lines of making the data accession/download simple and then providing, for example, SingleR-specific wrapper functions
Aaron Lun (11:40:18) (in thread): > Why would you need SingleR-specific wrappers? This is just data that goes into an SE.
Aaron Lun (11:40:37) (in thread): > I don’t recall anything SingleR-specific about the data files. There’s an assay and some metadata.
Friederike Dündar (11:40:46) (in thread): > the SingleR objects contain all sorts of nested lists
Friederike Dündar (11:40:59) (in thread): > e.g. with SD and labels and which ones are DE etc.
Tim Triche (11:41:02) (in thread): > @Friederike Dündar have you talked with Dvir about this?
Tim Triche (11:41:08) (in thread): > a lot of that can be automated
Tim Triche (11:41:14) (in thread): > whether it should be is another matter
Aaron Lun (11:41:41) (in thread): > Again… this can all be put into the metadata. But it doesn’t really matter because SingleR will recompute these by default anyway.
Friederike Dündar (11:42:46) (in thread): > I need to have a look at the changes you’ve implemented
Aaron Lun (11:42:54) (in thread): > Yes, the client-side assembly is how it’s meant to be done.
Friederike Dündar (11:43:08) (in thread): > but I think generally we’re on the same page, i.e. simplifying the data structures
Aaron Lun (11:43:19) (in thread): > The fact that there was recomputation is the same before and after my changes.
Friederike Dündar (11:43:27) (in thread): > and if everything has been brought into the SCE context, all we’d need are the matrices, correct?
Aaron Lun (11:43:34) (in thread): > And the labels.
Aaron Lun (11:43:42) (in thread): > Each dataset should have two sets of labels.
Aaron Lun (11:44:00) (in thread): > Broad and something else, I think.
Aaron Lun (11:44:11) (in thread): > “Broad” not referring to the institute, AFAIK.
Friederike Dündar (11:44:18) (in thread): > https://github.com/dviraran/SingleR/tree/master/data
Friederike Dündar (11:44:25) (in thread): > These are the objects I’ve been talking about
Aaron Lun (11:44:28) (in thread): > Yes.
Friederike Dündar (11:44:54) (in thread): > I would just extract the $data part of each and store that
Aaron Lun (11:44:58) (in thread): > And the labels.
Friederike Dündar (11:45:10) (in thread): > right, that would go into the metadata slot of SCE
Friederike Dündar (11:45:23) (in thread): > or SummarizedExperiment, I guess
Aaron Lun (11:45:40) (in thread): > Remember that you don’t assemble the S(C)Es on your side, you’re just saving raw data. The objects are assembled client-side at every download.
Friederike Dündar (11:45:47) (in thread): > yes
Aaron Lun (11:46:10) (in thread): > This has the advantage of avoiding storage of a serialized SE, ensuring that the client can always construct an updated SE if the class definition changes.
Friederike Dündar (11:46:40) (in thread): > for this, the scripts in inst/scripts/ would have to adapt, too, right?
Aaron Lun (11:47:12) (in thread): > Well, you wouldn’t be able to copy and paste them from scRNAseq, if that’s what you mean.
Tim Triche (11:47:36) (in thread): > @Friederike Dündar you work with @Davide Risso? what you are describing is better stored in an SCE with its reduced dim slot
Tim Triche (11:47:53) (in thread): > since singleR wants tSNE coords (ugh)
Aaron Lun (11:47:53) (in thread): > Wait what.
Aaron Lun (11:48:04) (in thread): > I don’t recall this being the case for the main SingleR() function.
Tim Triche (11:48:19) (in thread): > I looked at Dvir’s docs on github
Aaron Lun (11:48:32) (in thread): > I stripped out a lot of that stuff.
Friederike Dündar (11:48:36) (in thread): > @Aaron Lun no, I just meant as a general concept: if at one point SE specifications change, all one would have to do is to adapt the scripts to reflect those changes; raw data in EHub wouldn’t be affected
Tim Triche (11:48:36) (in thread): > oh nice
Friederike Dündar (11:50:10) (in thread): > @Tim TricheDavide used to be in an office right across the street from me before he deserted back to a place with proper food and proper summers
Aaron Lun (11:50:29) (in thread): > @Friederike Dündar We would not have to adapt the scripts in inst/scripts at all. The nature of the raw data on EHub is not affected (and it is, in fact, a pain to update and version the raw data, so we do want to minimize changes if possible). We only have to update the R functions that handle the download. See the R functions in scRNAseq, and note the extensive use of a version number in the S3 file path to allow for user-level control of versioned data.
Friederike Dündar (11:50:43) (in thread): > got it
Friederike Dündar (11:51:03) (in thread): > I meant that but was obviously not clear on the terminology
Friederike Dündar (11:51:07) (in thread): > thanks for clearing that up
Friederike Dündar (11:51:53) (in thread): > I’ll dig into this maybe next week or so and will ping you accordingly
Aaron Lun (11:52:09) (in thread): > k
Tim Triche (11:54:41) (in thread): > thanks – did not realize all the machinery under the surface in scRNAseq – that’s terrific to know about
Aaron Lun (11:56:38) (in thread): > I spent one hour per night uploading one dataset over the course of a fortnight.
Tim Triche (11:58:22) (in thread): > scRNAseq:::.create_sce is nice
Tim Triche (11:59:09) (in thread): > seems like it could be extended to automatically instantiate e.g. a restfulSE, which would be awesome
Aaron Lun (11:59:44) (in thread): > Uh, I guess so.
Aaron Lun (11:59:52) (in thread): > Haven’t done any RESTful stuff.
Tim Triche (12:01:47) (in thread): > although I guess for scRNAseq, the reduced-dimension representation is usually what people will use, so it’s not like people are likely to run around recomputing on 1M+ cells each time they download. Anyways, thanks for pointing out that function and its existence.
Martin Morgan (12:04:40) (in thread): > maybe worth ‘honest advertising’ that this is the ‘preview data’ https://preview.data.humancellatlas.org/ released quite a while ago, rather than a comprehensive collection of HCAData (although that is in the works…)
Tim Triche (12:17:55) (in thread): > fair enough, so people are being amazed by the efficient and facile distribution of the preview data. Not bad as an advertisement for what could be done with a bit of funding for some problems that (on their surface) don’t seem as hard as playing with millions of cells each measured on thousands of dimensions :slightly_smiling_face:
Dan Bunis (16:41:39): > @Dan Bunis has joined the channel
2019-07-31
Kasper D. Hansen (20:30:04): > I will second the call @Tim Triche makes above about enabling ExperimentHub to index resources located at essentially any URL, as I have asked about many times. We cannot guarantee persistence and we cannot guarantee correctness. It would still be extremely valuable. We could md5hash the remote resources at a given time X and then test if the remote resource has changed for the user at download time. Not perfect, but useful
2019-08-01
Tim Triche (10:26:27): > Useful is so much better than allegedly-perfect (which nothing is in practice)
Lori Shepherd (11:02:18): > The ability to host data elsewhere is already implemented - but I believe it is the validity that we implement with the association to an R package along with some constraints on the directory structure of how the data is hosted - I will look into how feasible it is to loosen these constraints
Tim Triche (11:19:08): > let me know what I can do to make this go more smoothly (although my primary BioC responsibility right now is to tidy up and document the changes in MTseeker, so…. :-D)
2019-08-06
Kasper D. Hansen (08:18:02): > https://twitter.com/melvidoni/status/1158489472456986624?s=21 - Attachment (twitter): Attachment > If you are working with genomics data in R, let me introduce you the UCSCXenaTools package by Shixiang Wang that just joined @rOpenSci and for which I was the Editor. It has also been submitted to #JOSS! > > https://joss.theoj.org/papers/10.21105/joss.01627 > https://github.com/ropensci/UCSCXenaTools > > #rstats #datascience
Kasper D. Hansen (08:18:29): > I have not looked at it, but it looks relevant to know about
Tim Triche (08:35:33): > toilR ?
Lori Shepherd (08:54:08): > thanks for the reference @Kasper D. Hansen I’ll check it out
Lori Shepherd (12:57:35): > So it is possible to host data outside of Bioconductor. This is controlled with the Location_Prefix column of the metadata file. If this column is included it is considered hosted outside of the Bioconductor S3 buckets. This has been implemented in AnnotationHub; there was a slight bug preventing this in ExperimentHub that is corrected in ExperimentHubData version 1.11.2 (just pushed, should propagate tonight). I tested locally with accessing data from http://imlspenticton.uzh.ch:3838/conquer and didn’t have any issues. The requirements will remain that it must be a publicly accessible site and that there must be a package associated with the addition of data (so a maintainer can monitor outside server reliability and presumably have some code to construct/load/analyze data). I will work on updating the man pages, vignette, and commonly referenced How to Create a Hub Package documentation to better expand on and explain this.
Tim Triche (12:58:47): > this is awesome! thanks @Lori Shepherd!
2019-08-16
Aedin Culhane (16:33:39): > @Aedin Culhane has joined the channel
2019-09-05
Aaron Lun (01:20:17): > @Lori Shepherd I noticed that the Rdataclass fields for my scRNAseq EHub entries are all wrong. Currently they’re all character, probably because I copied it from chipseqDBData (where the Hub was just returning literal character strings). But in scRNAseq, they’re returning matrix/dgCMatrix/DataFrame objects, and I should probably fix the metadata to reflect that. Would you be willing to update all the Hub entries once I’m done?
Lori Shepherd (07:14:56): > yes - If possible the easiest way for me to update them will be a list of EH numbers
2019-09-06
Aaron Lun (01:25:03): > There turns out to be a few things that I need to ask you, so I put it into a separate list here: https://docs.google.com/document/d/1nVVut7jZ2N0UYIQtjbxWQP8vEJADLk9929dHVYTsQxE/edit?usp=sharing
Aaron Lun (01:28:44): > Tag @Lori Shepherd just in case.
Lori Shepherd (07:49:03): > thanks @Aaron Lun I will try to get to these later today or early next week. Cheers
Rob Amezquita (17:42:47): > @Rob Amezquita has joined the channel
Rob Amezquita (17:46:25): > hi @Lori Shepherd - @Aaron Lun referred me to you re: some issues we’ve been having building the Orchestrating Single Cell Analysis book, specifically, in dealing with the scRNAseq
package it seems that sometimes the calls to the server will fail: > > ## code that causes error > library(scRNAseq) > sce.grun <- GrunPancreasData() > ... > Quitting from lines 19-21 (P3_W01.grun-pancreas.Rmd) > Error: failed to load resource > name: EH2687 > title: Grun pancreas counts > reason: Empty reply from server >
> This doesn’t always happen, it happens sporadically and breaks the entire build process. Another problematic dataset is the MouseGastrulationData package, where the call to WTChimeraData
calls the server several times, and if one of them fails once the whole thing fails and again borks the build. > > library(MouseGastrulationData) > sce.chimera <- WTChimeraData(samples = 5:10) > ... > snapshotDate(): 2019-09-04 > see ?MouseGastrulationData and browseVignettes('MouseGastrulationData') for documentation > downloading 0 resources > loading from cache > see ?MouseGastrulationData and browseVignettes('MouseGastrulationData') for documentation > downloading 0 resources > loading from cache > see ?MouseGastrulationData and browseVignettes('MouseGastrulationData') for documentation > downloading 0 resources > loading from cache > ## ... repeats several more times > see ?MouseGastrulationData and browseVignettes('MouseGastrulationData') for documentation > Quitting from lines 18-21 (P3_W10.pijuan-embryo.Rmd) > Error: failed to load resource >
> Wanted to let you know…any fixes that we can do on our end to ameliorate the behavior?
Lori Shepherd (17:58:47): > Thanks for letting me know. I’ll try to investigate a little more
2019-09-12
Rob Amezquita (17:39:39): > Hi @Lori Shepherd - @Aaron Lun and I are still getting these errors intermittently, even though it doesn’t seem to need to download the resources. Any workarounds to pinging the server? > > Attaching package: 'AnnotationHub' > > The following object is masked from 'package:Biobase': > > cache > > snapshotDate(): 2019-08-02 > downloading 0 resources > loading from cache > require("ensembldb") > Quitting from lines 41-49 (P3_W11.bach-mammary.Rmd) > Error: failed to load resource > name: AH73905 > title: Ensembl 97 EnsDb for Mus musculus > reason: Empty reply from server > > Execution halted >
Lori Shepherd (18:07:54): > Sorry I haven’t looked yet. I’ll make sure to get to this tomorrow
2019-09-13
Lori Shepherd (07:21:23): > Is this only happening on the Bioconductor build servers or when using in general?
Rob Amezquita (10:41:14): > this is happening every so often in our build of the osca book (osca.bioconductor.org), its not really reproducible unless you try a bunch of times to get the book to build
Rob Amezquita (10:42:17): > specifically, its happening on a cluster instance
Lori Shepherd (10:43:55): > harder to reproduce - harder for me to debug - any chance of being able to get a traceback to know more?
Rob Amezquita (10:45:01): > yeah thats a core issue…could run a script over and over again until it fails and then produce a traceback or something
Rob Amezquita (10:45:53): > alternately, is there any way of making the connection attempt at least try harder? say, if it fails on one attempt, to attempt again some set number of times?
Lori Shepherd (10:49:59): > yes I can look into that - just trying to figure out if its related to accessing the sqlite file from the remote location or if its an issue of trying to access the cached version of the file …
Rob Amezquita (10:51:35): > ahhh yeah… sorry I can’t be of more help :disappointed: if you have something you’d like me to run, happy to do it
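A simple client-side retry wrapper along the lines Rob suggests might look like this (purely illustrative; not part of any hub package):

    with_retries <- function(fun, times = 3, wait = 5) {
        for (i in seq_len(times)) {
            res <- try(fun(), silent = TRUE)
            if (!inherits(res, "try-error")) return(res)
            message("attempt ", i, " failed; retrying in ", wait, " seconds")
            Sys.sleep(wait)
        }
        stop("all ", times, " attempts failed")
    }

    # e.g. wrap a flaky hub-backed getter
    sce.grun <- with_retries(function() scRNAseq::GrunPancreasData())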
2019-10-13
Aaron Lun (18:55:26): > Hi @Lori Shepherd. Got another batch of scRNAseq data to upload (metadata files available in v1.99.7) but my AWS key seems to have expired.
Aaron Lun (18:56:21): > The relevant files are inst/exdata/metadata-buettner-esc.csv and metadata-leng-esc.csv. I will push up the actual files to S3 if you send me some creds.
Lori Shepherd (19:02:28): > Awesome. Its a holiday for us so not sure when I’ll be back at my computer but I’ll email credentials as soon as I’m back at my computer
Aaron Lun (19:03:29): > Wait, it’s a holiday? This is news to me.
Lori Shepherd (19:10:07): > Columbus day in the USA is a holiday.:woman-shrugging:
Aaron Lun (19:11:47): > Hm. Maybe I don’t have to go to work tomorrow, then.
Lori Shepherd (19:19:00): > I think it depends on which set the institution follows
Aaron Lun (19:20:15): > Oh bum.
2019-10-14
Davide Risso (03:11:11): > Yeah, I don’t think California celebrates Columbus day… sorry Aaron!:slightly_smiling_face:
Davide Risso (03:11:57): > You would think Italy would celebrate it but nope!
2019-10-15
Aaron Lun (17:13:25): > @Lori Shepherd creds?
Dan Bunis (18:20:38): > Somewhat correct!: Columbus Day isn’t celebrated throughout much of California. In SF, and lots of other cities across the US as well, we celebrate Indigenous Peoples’ Day instead. I believe cities individually set which one. Same day, but opposite connotation…
2019-10-16
Peter Hickey (23:34:42): > @Peter Hickey has joined the channel
2019-10-17
Lori Shepherd (14:20:37) (in thread): > Your new data is available in EH > > > eh = ExperimentHub() > |======================================================================| 100% > > snapshotDate(): 2019-10-17 > > query(eh, c("scRNAseq", "buettner")) > ExperimentHub with 3 records > # snapshotDate(): 2019-10-17 > # $dataprovider: ArrayExpress > # $species: Mus musculus > # $rdataclass: DFrame, matrix > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH3214"]]' > > title > EH3214 | Buettner ESC counts > EH3215 | Buettner ESC rowData > EH3216 | Buettner ESC colData > > query(eh, c("scRNAseq", "leng")) > ExperimentHub with 2 records > # snapshotDate(): 2019-10-17 > # $dataprovider: GEO > # $species: Homo sapiens > # $rdataclass: DFrame, matrix > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH3217"]]' > > title > EH3217 | Leng ESC normcounts > EH3218 | Leng ESC colData >
Aaron Lun (14:21:29) (in thread): > sweet, thanks
2019-12-03
Aaron Lun (18:54:02): > Wjat tje
Aaron Lun (18:54:07): > didn’t even show up
Peter Hickey (18:59:27): > Sorry wrong channel
Aaron Lun (18:59:59): > it’s probably something about embryos
2019-12-13
Aaron Lun (03:08:38): > @Lori Shepherd Difficult to reproduce, but the various Hubs seem to continually redownload stuff. It seems to only happen after a certain time period has passed after the initial use, but I’ve definitely noticed, say, > > library(AnnotationHub) > ens.mm.v97 <- AnnotationHub()[["AH73905"]] >
> redownloading on the second use (usually several hours after the first use and in a different R session).
Lori Shepherd (07:17:00): > I’ll check to see if we set an expiration on them
2019-12-14
Aaron Lun (04:40:48): > To provide some more clarification, this is what I see: > > > NestorowaHSCData() > |======================================================================| 100% > > snapshotDate(): 2019-10-22 > see ?scRNAseq and browseVignettes('scRNAseq') for documentation > loading from cache > see ?scRNAseq and browseVignettes('scRNAseq') for documentation > loading from cache > class: SingleCellExperiment > dim: 46078 1920 > metadata(0): > assays(1): counts > rownames(46078): ENSMUSG00000000001 ENSMUSG00000000003 ... > ENSMUSG00000107391 ENSMUSG00000107392 > rowData names(0): > colnames(1920): HSPC_007 HSPC_013 ... Prog_852 Prog_810 > colData names(2): cell.type FACS > reducedDimNames(1): diffusion > spikeNames(0): > altExpNames(1): ERCC >
> The 100% bar fills across the top; what is this? The actual objects don’t seem to be redownloaded, otherwise the snapshotDate messages would be followed by another progress bar.
Aaron Lun (04:41:00): > FYI that’s from the scRNAseq package.
Lori Shepherd (08:23:07): > That’s the hub sqlite file.
Lori Shepherd (08:23:11): > Most likely
Lori Shepherd (08:24:09): > I’ll see what meta is used in caching this object. Thanks for the clarification, actually helped a lot
2019-12-15
Aaron Lun (04:25:53): > The same thing also happens with AnnotationHub, where it’s noticeably slower.
Lori Shepherd (22:19:30): > Yes. That would make sense because the annotation hub sqlite file is bigger. I’ll look into it
2019-12-16
Aaron Lun (01:30:20): > On another note, I’ve a new dataset for scRNAseq, but it seems my AWS creds have expired.
Lori Shepherd (08:53:25): > I’ll email you the new creds
Lori Shepherd (08:55:04): > let me know when the data is uploaded and the repo updated with the metadata file. I’m training someone on backing me up for adding data into the hubs so it would be a good exercise to take them through a live upload :slightly_smiling_face:
Aaron Lun (11:42:41): > It’s up. scRNAseq/paul-hsc on S3, scRNAseq v2.1.4 on BioC Git.
2019-12-17
Lori Shepherd (12:03:12): > Your data has been added
Aaron Lun (12:07:45): > :+1:
2019-12-23
Lori Shepherd (13:23:42) (in thread): > still trying to reproduce this and figure out whats going on - my initial thought does not appear to be correct so I’m trying to do some more digging - hopefully will come up with more soon
2019-12-29
Aaron Lun (20:04:11): > Based on a few questions at https://github.com/LTLA/SingleR, I wonder… is AWS S3 accessible from China?
Aaron Lun (20:12:27): > https://forums.aws.amazon.com/thread.jspa?threadID=221050
Aaron Lun (20:12:32): > ridiculous
Peter Hickey (20:28:09) (in thread): > my experience teaching in China is “sometimes and slowly”
Peter Hickey (20:28:31) (in thread): > Using ExperimentHub/AnnotationHub resources
Aaron Lun (20:48:55) (in thread): > Well, that’s just great.
Sean Davis (22:16:40) (in thread): > This is what cloudfront is for, I think:https://aws.amazon.com/cloudfront/. Seehttps://aws.amazon.com/about-aws/whats-new/2019/04/amazon-cloudfront-is-now-available-in-mainland-china/. I’m not sure if the hubs are set up for cloudfront, but it might be worth doing so if latency becomes a problem. - Attachment (Amazon Web Services, Inc.): Content Delivery Network (CDN) | Low Latency, High Transfer Speeds, Video Streaming | Amazon CloudFront > Amazon CloudFront is a fast content delivery network (CDN) service that securely delivers data, videos, applications, and APIs to customers globally with low latency, high transfer speeds, all within a developer friendly environment.
Sean Davis (22:18:34) (in thread): > I’m not sure that this thread is still applicable?
Aaron Lun (23:50:50) (in thread): > One would hope not.
2020-01-13
Lori Shepherd (08:13:18) (in thread): > @Aaron Lun let me know if you still see this too - I made some adjustments that I was hoping would help - but since it’s intermittent I was having trouble testing.
Lori Shepherd (13:05:00): > <!channel> Hello all - there has been a recent increase in the number of support site questions particularly involving the use of the hubs behind a proxy. Namely - that they cannot use the hub and are getting an ERROR for internet connection, which is an effect of being behind a proxy - the proxy can be set in the constructor, as an environment variable, or as a global option to avoid this and then it all runs fine - I was wondering if any users that are behind a proxy would like to work up a little documentation/blurb that we can put into the hub documentation to better describe this. Perhaps some advice on how to find your proxy server? Unfortunately I cannot do this myself as I don’t have this set up.
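For anyone wanting a starting point, a hedged sketch of the three approaches Lori mentions (the proxy address is a placeholder; check ?AnnotationHub and ?setAnnotationHubOption for the exact argument, option, and environment variable names):

    library(AnnotationHub)

    ## 1. in the constructor
    ah <- AnnotationHub(proxy = "http://proxy.example.com:8080")

    ## 2. as a global hub option
    setAnnotationHubOption("PROXY", "http://proxy.example.com:8080")

    ## 3. as an environment variable, e.g. in ~/.Renviron
    Sys.setenv(ANNOTATION_HUB_PROXY = "http://proxy.example.com:8080")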
2020-01-18
Aaron Lun (21:39:26) (in thread): > nope, I’m still getting it for AHub.
2020-01-20
Aaron Lun (22:09:27) (in thread): > Hm. EHub seems to be fine so far.
2020-01-25
Aaron Lun (06:17:21): > Just thinking about @Hervé Pagès comments in https://github.com/Bioconductor/Contributions/issues/1318#issuecomment-578324471
Aaron Lun (06:18:02): > I wonder whether a variant of the solution we ended up using for basilisk might be applicable here.
Aaron Lun (06:18:56): > In this case, you would have some EXPERIMENTHUB_DOWNLOAD_ON_INSTALL environment variable that is either TRUE or just contains a range/selection of EH numbers to download into the ExperimentHub system directory upon R package installation.
Aaron Lun (06:19:56): > This would enable system admins to provide re-usable data resources on enterprise R deployments.
Aaron Lun (06:21:40): > The main difficulty (amongst others) is that EHub would need to be modified to accommodate a search path, as resources not installed in the system dir would need to be installed in some user directory. This seems pretty drastic if my memories of AHub’s source code are applicable here.
2020-01-27
Lori Shepherd (08:35:29): > Could you open this as an issue in AnnotationHub so I don’t lose track of it and the details
2020-02-13
Aaron Lun (01:23:27): > @Lori Shepherd What happens to the Hub data at every release? Is it snapshotted so that historical analyses with old versions of packages continue to use the versions of the files that they had access to at the time?
Lori Shepherd (07:19:00): > no it currently does not - right now we don’t have/allow versioning of files which is why we recommend not directly replacing files on S3 but rather have a subdirectory or new name for versions so that past version and new versions would be accessible as they then would be added to the hubs as a new EH number - We have been in a planning stage to allow versioning on S3 and build it into the database where EH #s would also have a version ID to get a specific version so the EH # stay consistent but we haven’t implemented it yet
Aaron Lun (13:13:58): > k
FelixErnst (17:21:12): > @FelixErnst has joined the channel
FelixErnst (17:29:03): > I am toying with the idea of submitting an annotation package, which uses the AnnotationDbi interfaces. However the sqlite file might exceed the 200 Mb mark, which causes some potential problems for git and the submission process. Can anyone offer advice on best practices or share some experience with annotation packages?
2020-02-14
Lori Shepherd (07:44:53): > There is also AnnotationHub for larger annotation resources - the same concept as ExperimentHub but for annotations.
FelixErnst (10:59:16): > Ok, that’s plausible. If I wanted to load a resource from AnnotationHub as a specific class, would I need to submit the package with the class before I could return an AnnotationHub resource as that specific class? (much like the resources returning an EnsDb object).
FelixErnst (11:00:31): > That was a weird sentence… I hope you know what I mean
Lori Shepherd (11:17:22): > I do know what you mean - We generally have stepped away from adding new dispatch classes to the AnnotationHub backend directly, to avoid the question of which comes first, the package or the addition of data; plus it seemed more appropriate that the loading/handling of data be in the associated package, which is also why we now require that AnnotationHub resources have an accompanying package rather than being added ad hoc (an option in the past). I would recommend loading as raw a file into the hub as possible, and if SQLiteFile as the dispatch class is not appropriate because you would like to have your own load method with a defined class structure, utilize the FilePath option, which will simply download the file and give you the file path on the system; that path can then be used in your package constructor/load/read function - hope this makes sense
FelixErnst (11:34:39): > Yep, that makes sense. Since I would use a class which inherits from AnnotationDb, I would just need an SQLite connection. Now I assume that I have to expose the resources somehow in the annotation package, for example with a function. Is there an equivalent of createHubAccessors available for the AnnotationHub? If not, do you have a code snippet/package I could look at for reference?
Lori Shepherd (11:55:39): > The SQLiteFile dispatch utilizes AnnotationDbi::loadDb to load the resource - so it may be sufficient, but I would encourage testing on your end - There is not an automatic createHubAccessors like there is for ExperimentHub, but you could (and we encourage you to) create your own. What your function returns will depend on how you want to utilize the information in your package, but generally it would be a function like > > retrieveData <- function(){ > ah <- AnnotationHub() > resource <- ah[["AH####"]] > # you may want additional manipulation?? > return(resource) > } >
> Once the data has been added to the hub you would know the exact AH# associated with your file/object
FelixErnst (12:11:14): > Ok. Such a function could then also take a version argument and return the appropriate version, couldn't it? I will ask for the AWS login once I am ready. Thanks for the hints and the example
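A rough sketch of such a version-aware accessor, assuming each version of the resource is added to the hub with its version string in the metadata (package name and version values here are placeholders):
> retrieveData <- function(version = "1.0.0") {
>     ah <- AnnotationHub::AnnotationHub()
>     hits <- AnnotationHub::query(ah, c("MyAnnotationPackage", version))
>     stopifnot(length(hits) == 1L)    # expect exactly one matching record
>     hits[[names(hits)]]              # download (or load from cache) and return
> }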
2020-02-25
Hena Ramay (12:44:02): > @Hena Ramay has joined the channel
FelixErnst (13:15:50): > getValidSourceTypes() does not contain an identifier for XML or HTML. Should I use TXT instead?
Lori Shepherd (13:48:30): > Where is the data coming from? TXT is probably okay - if you really feel like there should be a new source type I could add one in
FelixErnst (13:49:41): > It's a parsed HTML file/website
FelixErnst (13:50:14): > In my opinion XML would match it best and include HTML parsing by default
FelixErnst (13:52:47): > (I regard HTML, in the context of Bioc and the focus on data analysis, as a subset of XML. Probably some would disagree)
Lori Shepherd (13:57:41): > agreed - I have added XML as a valid source type in 1.17.2 - it should build tonight and be available tomorrow after the builds post - if you want to in the meantime you can temporarily install from the bioconductor github
FelixErnst (13:57:56): > Thank you!
2020-02-29
Aaron Lun (14:09:28): > @Lori Shepherd I just uploaded some updated files for SingleR to EHub. Some obvious questions came up around smooth handling of versioning. > > Some context first. I organize my EHub files with a structure like SingleR/1.0.0/blah.rds so as to allow me to make SingleR/1.2.0/blah.rds and so on. But I then realized, if I were to update just one file in a set (e.g., 1.0.0/blah.rds, 1.0.0/whee.rds, 1.0.0/stuff.rds), I would either have to duplicate all other files in the same set from 1.0.0 to 1.2.0; or my client-side getter code would need to be smart enough to fetch some files from 1.2.0 and other files from 1.0.0. The former is an unnecessary waste of space and the latter is pretty inconvenient. In addition, the latter requires me to recreate the metadata file with some files in 1.0.0 and other files in 1.2.0, which is very error prone. > > So I'm wondering what the policy should be here. Should we just go ahead and make copies? That would be the most convenient from my end; perhaps there may be some smart stuff that EHub can do under the hood to avoid duplicates (e.g., a poor-man's symlink to re-direct requests)?
2020-03-02
Lori Shepherd (07:58:43): > Obviously the best solution is for me to finally get versioning into the hubs, but it is still a work in progress and probably a few months off at least (although we are working on it) - We wouldn't want to duplicate files on S3 because Bioconductor fronts the cost of hosting resources - The recommended solution right now is to make the code smart enough; we recommend using the EH numbers in the code once they are known, so it would be more exact -
Martin Morgan (11:30:41): > @Aaron Lun maybe the simple thing is to maintain a simple named character vector or two-column data.frame that maps between EH number and convenient-to-you naming convention, and use the EH number rather than convenient-to-you path for retrieving the data? We really don't want to have duplicate copies of data resources in S3.
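A small sketch of that mapping (the EH ids below are placeholders for the ids assigned at ingestion):
> .eh_ids <- c(
>     "SingleR/1.0.0/blah.rds" = "EH0000",
>     "SingleR/1.2.0/blah.rds" = "EH0001"
> )
> fetchByPath <- function(path) {
>     eh <- ExperimentHub::ExperimentHub()
>     eh[[.eh_ids[[path]]]]            # retrieve by the stable EH number
> }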
2020-03-04
Aaron Lun (01:11:48): > @Lori Shepherd SingleR 1.1.11 contains updated metadata files for metadata-blueprint_encode.csv and metadata-dmap.csv. I've also uploaded the relevant files to AWS in SingleR/ (note that there's an unnecessary logcounts file floating around there; this can be removed).
Aaron Lun (01:16:07): > The new metadata files are in inst/extdata/1.2.0, which should make it a bit easier for you to pull them in.
Aaron Lun (01:34:46): > I have also resolved the versioning issues on my end, so I’m ready to go whenever.
2020-03-06
Aaron Lun (12:19:25): > @Lori Shepherd?
Lori Shepherd (12:21:12): > sorry i missed this. I’ll get to this either today or monday at the latest
2020-03-09
Lori Shepherd (08:58:33): > done > > > query(eh, c("SingleR", "1.2.0")) > ExperimentHub with 2 records > # snapshotDate(): 2020-03-09 > # $dataprovider: GEO, Dvir Aran > # $species: Homo sapiens > # $rdataclass: DataFrame > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH3295"]]' > > title > EH3295 | DMAP RNA microarray colData > EH3296 | Blueprint/Encode RNA-seq colData >
Aaron Lun (13:14:07): > Thanks, done.
2020-03-10
Aaron Lun (02:08:12): > @Lori Shepherd Another one uploaded at scRNAseq/ercc-concentrations/2.2.0/cms, for scRNAseq version 2.1.7 (extdata/2.2.0/metadata-ercc-.*). Note that this one is a bit weird; it's spike-in data, so it doesn't have a genome or species. It's also a raw TSV file and I couldn't figure out the best DispatchClass for that. I guess I'd like it read in as a pure data.frame with stringsAsFactors=FALSE and check.names=FALSE.
Aaron Lun (02:08:20): > Also tagging @Alan O'C.
Alan O’C (07:05:20): > @Alan O’C has joined the channel
Lori Shepherd (08:06:14): > I think because data.frame reading has so many options, that's why we never created a generic DispatchClass. May I suggest using "FilePath", which will download and return the path to the file, which you could then read into a data.frame with any options desired. @Aaron Lun
Aaron Lun (16:13:58): > Thanks @Lori Shepherd. What do you suggest for the RDataClass? I assume this is just character then if I'm getting the path back?
Lori Shepherd (16:57:03): > Developer's call - you may want to leave the RDataClass as what is actually loaded, which is probably more useful to a user
Aaron Lun (17:28:32): > done
2020-03-11
Aaron Lun (12:14:03): > tagging@Lori Shepherd
2020-03-12
Lori Shepherd (09:06:40): > @Aaron Lun It looks like the DispatchClass is Rds for the txt file? Is that correct or did you want it to be the suggested FilePath?
Aaron Lun (11:41:18): > Whoops, forgot to push my changes upstream.
Aaron Lun (11:41:24): > Now it should be good
Aaron Lun (11:41:29): > @Lori Shepherd
Lori Shepherd (12:02:11): > it's in the hub, let me know if there are any issues @Aaron Lun
Aaron Lun (12:03:58): > looks good. @Alan O'C file can now be accessed with: > > library(ExperimentHub) > ehub <- ExperimentHub() > blah <- ehub[["EH3298"]] > df <- read.delim(blah, check.names=FALSE) >
Alan O’C (12:05:05): > Awesome, cheers
Aaron Lun (12:06:54): > Probably want to use the rdatapath to look them up as in .create_sce, rather than hard-coding the EH number as I've done above for convenience.
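A sketch of that rdatapath-based lookup (the path string is a placeholder for the real RDataPath):
> library(ExperimentHub)
> ehub <- ExperimentHub()
> hit <- ehub[ehub$rdatapath == "scRNAseq/ercc-concentrations/2.2.0/your-file.txt"]
> stopifnot(length(hit) == 1L)
> df <- read.delim(ehub[[names(hit)]], check.names = FALSE)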
Aaron Lun (12:09:35): > thanks @Lori Shepherd
2020-03-31
Aaron Lun (20:36:26): > Looks like EHub is down@Lori Shepherd > > > library(scRNAseq) > Loading required package: SingleCellExperiment > > sce <- ReprocessedAllenData(assays="tophat_counts") > Error in curl::curl_fetch_memory(url, handle = handle) : > Timeout was reached: [[experimenthub.bioconductor.org](http://experimenthub.bioconductor.org)] Operation timed out after 10001 milliseconds with 0 out of 0 bytes received >
> Similar lack of response from attempting to connect via a web browser.
Lori Shepherd (20:39:21): > I’ll investigate.
Aaron Lun (20:39:31): > :+1:
Aaron Lun (20:54:31): > so, is this a problem on my end or yours?
Aaron Lun (20:55:53): > for me, connections fail both with and without a firewall in between.
Lori Shepherd (21:02:00): > Firewall / proxy doesn't help… But there is something going on on our end tok
Lori Shepherd (21:02:06): > *too
Aaron Lun (21:26:16): > ah, it’s back
2020-04-01
Sean Davis (16:04:45): > I have used this site for years to monitor web resources. It simply sends an email with any status change.https://uptimerobot.com/ - Attachment (Uptime Robot): Uptime Robot | Free Website Monitoring > Get up to 50 website, port or heartbeat monitors for free. When something happens, be alerted via email, SMS, Telegram, Slack or many more ways.
2020-04-11
Aaron Lun (14:31:51): > Looks like the hubs were down for a brief moment last night, and now they’re back.
Aaron Lun (14:33:59): > Is there a diagnosis for these connection issues from the logs? I’m assuming it’s not an S3 thing, what with all their 9’s of uptime.
Martin Morgan (14:39:18): > it was an AWS thing, "EC2 has detected degradation of the underlying hardware hosting your Amazon EC2 instance" the image became unresponsive and was restarted (on new hardware, apparently). Whether this addresses the intermittent timeout issue is another question…
Aaron Lun (14:40:07): > oh!
Aaron Lun (14:40:10): > okay, good.
Martin Morgan (14:40:12): > (restarting an image and getting new hardware sure beats filling a purchase order with the local IT and waiting for the hardware to arrive!)
Aaron Lun (14:40:23): > someone else to blame, then.
2020-04-14
Aaron Lun (12:11:43): > @Lori Shepherd is there a way to clear out contents of a BFC directory while leaving it in a valid state for future additions? I'm currently doing: > > unlink(bfccache(bfc), recursive=TRUE) > BiocFileCache(bfccache(bfc), ask=FALSE) >
Aaron Lun (12:13:59): > Wait, cleanbfc looks promising. (Doesn't start with bfc!) Can I set days=Inf?
Lori Shepherd (12:18:27): > cleanbfc I think is what you are looking for – I don't think you can use Inf in days; I don't think I thought to test that
Aaron Lun (13:28:56): > cleanbfc(bfc, days=0) seems like it should delete everything, but doesn't.
Aaron Lun (13:32:26): > Turned out I had to do days=-1.
Lori Shepherd (13:50:32): > I haven't checked this function since it was created, so it very well may need some refining
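For reference, another way to empty a BiocFileCache while keeping it valid is to drop every tracked resource id (a sketch; removebfc() would instead delete the cache wholesale):
> library(BiocFileCache)
> bfc <- BiocFileCache(ask = FALSE)
> bfcremove(bfc, bfcinfo(bfc)$rid)     # delete all cached files, keep the cache usable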
Aaron Lun (15:07:09): > @Lori Shepherd also, message("No internet connection using 'localHub=TRUE'") probably could use some grammar, "No internet connection[,] using …"
Aaron Lun (15:07:25): > As it's currently written I was wondering why localHub=TRUE would need a connection.
Aaron Lun (15:07:40): > Maybe even message("No internet connection, using 'localHub=TRUE' instead").
Dan Bunis (15:23:57): > Or flipping the order perhaps, message("'localHub=TRUE', no internet connection used")
Dan Bunis (17:34:26): > FWIW, having not looked at the code before commenting, I'd also misunderstood the error's intent. Ignore my message suggestion as it is certainly wrong.
Aaron Lun (17:36:13): > @Lori Shepherd may also want to comment on my thoughts in https://github.com/LTLA/SingleR/issues/109#issuecomment-613688636.
Kasper D. Hansen (22:26:48): > @Lori Shepherd we (@Leonardo Collado Torres and me) are preparing an ExperimentHub package where we are hosting data ourselves. I know we have to produce a metadata file (not sure about terminology) that eventually gets added to the central Bioc listing. Is there a way - for testing purposes - to access ExperimentHub resources detailed in a local metadata file? Like how do we test all of this before we add the file to the central listing
Leonardo Collado Torres (22:26:54): > @Leonardo Collado Torres has joined the channel
2020-04-15
Lori Shepherd (09:34:36): > It's not very easy to replicate, but I tried to make a generic server code that could be utilized for local testing: https://github.com/Bioconductor/BiocHubServer – The example shows AnnotationHub but it could be used with ExperimentHub the same way – > We understand this limitation and are willing to work with maintainers to add the data to the central listing (and edit or remove it) if necessary while maintainers test/debug
2020-04-16
Al J Abadi (02:06:19): > @Al J Abadi has joined the channel
2020-05-25
Aaron Lun (16:16:56): > @Lori Shepherd It's that time of year again. Can I get some updated ExperimentHub creds to upload another dataset for scRNAseq?
2020-05-26
Lori Shepherd (08:37:40): > please email me – I won’t send credentials over slack.
Aaron Lun (11:36:33): > done.
2020-05-27
Aaron Lun (00:51:54): > @Lori Shepherd It is done. The metadata files are in the latest scRNAseq repo on Bioc-devel, in extdata/2.4.0/metadata-zilionis-lung.csv. There are also 4 files in the scRNAseq directory on ExperimentHub.
2020-05-28
James MacDonald (11:08:04): > @Lori ShepherdHas something changed in the processing of GTF files for the AnnotationHub? Last release we had > > > hub <- AnnotationHub() > snapshotDate(): 2019-10-29 > > z <- hub[["AH50377"]] > loading from cache > require("GenomicRanges") > !> seqinfo(z) > Seqinfo object with 59 sequences (1 circular) from GRCh38 genome; no seqlengths: > seqnames seqlengths isCircular genome > 1 <NA> FALSE GRCh38 > 2 <NA> FALSE GRCh38 > 3 <NA> FALSE GRCh38 > 4 <NA> FALSE GRCh38 > 5 <NA> FALSE GRCh38 > ... ... ... ... > KI270741.1 <NA> FALSE GRCh38 > KI270743.1 <NA> FALSE GRCh38 > KI270744.1 <NA> FALSE GRCh38 > KI270750.1 <NA> FALSE GRCh38 > KI270752.1 <NA> FALSE GRCh38 >
> And in the current release we have > > > z <- qry[["AH50377"]] > loading from cache > require("GenomicRanges") > > seqinfo(z) > Seqinfo object with 59 sequences (1 circular) from 2 genomes (GRCh38, NA): > seqnames seqlengths isCircular genome > 1 248956422 FALSE GRCh38 > 2 242193529 FALSE GRCh38 > 3 198295559 FALSE GRCh38 > 4 190214555 FALSE GRCh38 > 5 181538259 FALSE GRCh38 > ... ... ... ... > KI270741.1 <NA> <NA> <NA> > KI270743.1 <NA> <NA> <NA> > KI270744.1 <NA> <NA> <NA> > KI270750.1 <NA> <NA> <NA> > KI270752.1 <NA> <NA> <NA> >
> Which seems not right? The only way this can be converted to a TxDb is to use keepStandardChromosomes to strip off the unplaced scaffolds and whatnot.
Lori Shepherd (11:19:35): > I have not made any changes to code or processing. When we add the gtfs to the hubs we only store metadata. It’s converted on the fly using rtracklayer::import(cache(yy), format=“gtf”, genome=yy$genome,…) . We could see if there was some change there.
Lori Shepherd (11:21:35): > Maybe some underlying change in R4.0 or another related package?
Aaron Lun (11:28:47) (in thread): > Bumping@Lori Shepherd
James MacDonald (11:56:22) (in thread): > It appears to be due to information coming from GenomeInfoDb that gets added in AnnotationHub:::.tidyGRanges. In the last release this happened: > > existingSeqinfo <- GenomeInfoDb::seqinfo(gr) > newSeqinfo <- tryCatch({ > GenomeInfoDb::Seqinfo(genome = genome) > }, error = function(err) { > NULL > }) >
Lori Shepherd (12:03:06) (in thread): > Any chance you can post this as an issue in AnnotationHub github for us to look into?
James MacDonald (12:04:39) (in thread): > ARGH stupid slack. Enter isn’t send, fool. > > existingSeqinfo <- GenomeInfoDb::seqinfo(gr) > newSeqinfo <- tryCatch({ > GenomeInfoDb::Seqinfo(genome = genome) > }, error = function(err) { > NULL > }) > > ## running the code we get > GenomeInfoDb::Seqinfo(genome = genome) > Error in fetchSequenceInfo(genome) : genome "GRCh38" is not supported >
> And the genome info would be inferred using the next bit of code: > > if (is.null(newSeqinfo)) { > if (guess.circular) > GenomeInfoDb::isCircular(existingSeqinfo) <- .guessIsCircular(existingSeqinfo) > if (addGenome) > GenomeInfoDb::genome(existingSeqinfo) <- genome > if (sort || guess.circular || addGenome) { > new2old <- match(GenomeInfoDb::seqlevels(existingSeqinfo), > GenomeInfoDb::seqlevels(gr)) > GenomeInfoDb::seqinfo(gr, new2old = new2old) <- existingSeqinfo > } > return(gr) > } >
> But in the current release we get > > GenomeInfoDb::Seqinfo(genome = genome) > Seqinfo object with 455 sequences (1 circular) from GRCh38 genome: > seqnames seqlengths isCircular genome > 1 248956422 FALSE GRCh38 > 2 242193529 FALSE GRCh38 > 3 198295559 FALSE GRCh38 > 4 190214555 FALSE GRCh38 > 5 181538259 FALSE GRCh38 > ... ... ... ... > HSCHRUN_RANDOM_CTG30 62944 FALSE GRCh38 > HSCHRUN_RANDOM_CTG33 40191 FALSE GRCh38 > HSCHRUN_RANDOM_CTG34 36723 FALSE GRCh38 > HSCHRUN_RANDOM_CTG35 79590 FALSE GRCh38 > HSCHRUN_RANDOM_CTG36 71251 FALSE GRCh38 >
> And that’s obviously NCBI, whereas this is an Ensembl GTF
James MacDonald (12:04:57) (in thread): > OK, I’ll post there
Lori Shepherd (12:09:08) (in thread): > I think @Hervé Pagès might have done some updating of GenomeInfoDb functions for seq levels – maybe related?
James MacDonald (12:29:29) (in thread): > Yeah, but all those data come from UCSC, so they don’t conform to what one gets from an Ensembl GTF
2020-05-29
Lori Shepherd (08:04:48) (in thread): > Available in devel > > > query(eh, c("scRNAseq", "Zilionis")) > ExperimentHub with 4 records > # snapshotDate(): 2020-05-29 > # $dataprovider: GEO > # $species: Mus musculus, Homo sapiens > # $rdataclass: dgCMatrix, DFrame > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH3460"]]' > > title > EH3460 | Zilionis human lung counts > EH3461 | Zilionis human lung colData > EH3462 | Zilionis mouse lung counts > EH3463 | Zilionis mouse lung colData >
Aaron Lun (11:11:31) (in thread): > :+1:
2020-05-30
Aaron Lun (02:13:08) (in thread): > @Lori Shepherd I realized I made a mistake with some of the files; these have now been fixed and the new versions are on S3 in scRNAseq/, can you push them along to EHub?
2020-06-01
Lori Shepherd (08:09:17) (in thread): > should be good now. files moved over. The cache should pick them up on next access.
Aaron Lun (11:05:59) (in thread): > thanks
Arshi Arora (15:22:43): > @Arshi Arora has joined the channel
2020-06-09
Aaron Lun (17:34:08): > @Lori Shepherd do you mind if I share the previous creds with @Charlotte Soneson to upload some more data for the scRNAseq package?
Lori Shepherd (20:29:22): > Not at all. Go ahead
2020-06-14
Aaron Lun (13:59:43): > @Lori Shepherd I don't know if it matters, but the metadata CSV files for SingleR's datasets now live in the (soon-to-be in the build system) celldex package.
Aaron Lun (14:00:09): > I assume that it doesn’t matter because EHub doesn’t refer back to these files past the initial ingestion, but just in case, here it is.
2020-06-15
Lori Shepherd (14:42:14): > It does and doesn't matter – yes, from a maintenance perspective it is only used in the ingestion process – but from a tracking perspective it matters, as we record in the database which package is associated with the resources in the hub. If the package (and package maintainer) are switching from SingleR to celldex then it would be better to have the package information in the database reflect this.
Aaron Lun (15:31:51): > Okay. Well, for the record, all of SingleR’s datasets are now managed by celldex. I haven’t changed any of the paths for back compatibility.
Lori Shepherd (15:37:10): > ok – let me know when it's in Bioconductor – I'll double check that the package value in the database is not utilized in retrieval and if necessary update at least that entry
Aaron Lun (15:38:47): > Sounds like if we had to update it, older versions of SingleR would no longer be able to pull data out of EHub.
2020-06-19
Aaron Lun (22:31:53): > @Lori Shepherd it is done: https://bioconductor.org/packages/devel/data/experiment/html/celldex.html - Attachment (Bioconductor): celldex (development version) > Provides a collection of reference expression datasets with curated cell type labels, for use in procedures like automated annotation of single-cell data or deconvolution of bulk RNA-seq.
2020-06-22
Lori Shepherd (12:12:34): > I talked with Martin – we feel like for consistency the data should be re-added under celldex, with the path to celldex in the metadata RDataPath and the associated package entry. For backwards compatibility we will keep the SingleR entries in the database and on S3.
Aaron Lun (13:29:23): > okay, tell me when you’ve done it and I’ll update all of the celldex files.
Aaron Lun (13:29:55): > to be clear, do you mean that you’ll copy the stuff on S3? Or do you want me to re-upload it?
Aaron Lun (13:30:00): > Because the latter will take some time.
2020-06-26
Lori Shepherd (08:04:26): > I’ll copy from my end it will be easier. working on this and some other hub uploads this morning
Lori Shepherd (08:04:34): > ill let you know when its officially done -
Lori Shepherd (09:17:56): > Done… The 16 files that are referenced in the inst/extdata metadata files in celldex are now officially under celldex – to add them to the database I only updated the RDataPath from SingleR to celldex and changed the BiocVersion to 3.12, where they are available under the new package – I did this change locally only and did not make any official pushes to the repo, to let you determine how to update – let me know if there are any issues > > > hub = ExperimentHub() > |======================================================================| 100% > > snapshotDate(): 2020-06-26 > > query(hub, "celldex") > ExperimentHub with 16 records > # snapshotDate(): 2020-06-26 > # $dataprovider: Dvir Aran, GEO, DICE > # $species: Homo sapiens, Mus musculus > # $rdataclass: DataFrame, matrix, DFrame > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH3486"]]' > > title > EH3486 | Blueprint/Encode RNA-seq logcounts > EH3487 | Blueprint/Encode RNA-seq colData > EH3488 | DICE RNA-seq logcounts > EH3489 | DICE RNA-seq colData > EH3490 | DMAP RNA microarray logcounts > ... ... > EH3497 | Monaco Immune Cell RNA-seq colData > EH3498 | Mouse bulk RNA-seq logcounts > EH3499 | Mouse bulk RNA-seq colData > EH3500 | Blueprint/Encode RNA-seq colData > EH3501 | DMAP RNA microarray colData >
Aaron Lun (11:23:24): > thanks @Lori Shepherd, will work on the transition.
2020-06-27
Aaron Lun (23:01:44): > IT IS DONE.
2020-07-04
Umar Ahmad (08:21:55): > @Umar Ahmad has joined the channel
2020-07-19
FelixErnst (15:03:23): > Hi, I have a bit of a problem debugging an error in building RNAmodR for devel, which involves loading a file from the ExperimentHub using the RNAmodR.Data package. http://bioconductor.org/checkResults/devel/bioc-LATEST/RNAmodR/tokay1-buildsrc.html claims that a file is malformatted. On my PC the file looks fine and does not seem to be malformatted. The vignette builds locally as well. I also cannot reproduce this error in GitHub Actions builds or using the Bioconductor docker image. Furthermore, the error is limited to the Windows build in the last three days (7/17 - 7/19). Does anybody have an idea what might be the problem? Thanks for any help
2020-07-20
Martin Morgan (11:27:07) (in thread): > Does your code try to use the same file from different threads / processes? It could be that two threads both tried to download the same file, and the file is now corrupt. We will see about cleaning this up on the builder.
FelixErnst (11:38:33) (in thread): > That might be the case, but it is downloaded ahead of time, so that shouldn't be an issue. > > However, thinking about downloading the file twice at the same time: the RNAmodR.AlkAnalineSeq build report shows the same error. Maybe they ran at the same time?
FelixErnst (11:39:51) (in thread): > That hasn’t been a problem in the last year and I haven’t touched this part of the code.
2020-07-24
Aaron Lun (01:31:17): > @Vince Carey and I encountered a pretty interesting edge case of ExperimentHub and HDF5Array. > > We were running the OSCA on Docker and we were getting weird errors with HDF5Arrays containing paths to ExperimentHub cached resources. I was creating a HDF5Array in one report, saving the R object in a cache (unrelated to the EHub cache), and then restoring the object in another R process for later use. However, the restored objects contained paths to HDF5 files in temporary directories that didn't exist; this was despite my overriding of EXPERIMENT_HUB_CACHE to provide a constant cache location. > > I eventually figured out that, when we run a report, ExperimentHub will try to ask if we want to create the cache location. Presumably, outside of an interactive session, it assumes that the answer is "no" and then it uses a temporary directory instead. This causes problems for my serialized HDF5Array objects because those objects are created with hard-coded paths to the HDF5 files in the temporary directories… which are promptly destroyed when the R process finishes. This invalidates the use of those serialized objects in the later R process. > > The solution was to set EXPERIMENT_HUB_ASK to FALSE.
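A minimal sketch of that setup for a non-interactive report (the cache path is an arbitrary example):
> ## set before any hub object is created, e.g. at the top of the first report
> Sys.setenv(
>     EXPERIMENT_HUB_CACHE = "/data/ExperimentHub",  # any persistent directory
>     EXPERIMENT_HUB_ASK = "FALSE"                   # never prompt, never fall back to a tempdir
> )
> library(ExperimentHub)
> ehub <- ExperimentHub()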
Martin Morgan (09:35:25) (in thread): > is that the right solution, or is there something better? For instance, the docker image could be constructed with the cache created; ExperimentHub would then use it.
Aaron Lun (11:05:22) (in thread): > we could create the dir, but I don’t see an easy way of populating the cache in advance.
Aaron Lun (16:38:38): > FYI @Lori Shepherd I will also have some scRNAseq datasets to upload in the near future. I'll check if my existing creds are still good.
2020-07-26
Aaron Lun (01:18:41): > @Lori Shepherd it is done. Two new datasets with metadata files in inst/extdata/2.4.0/metadata-mair-pbmc.csv and inst/extdata/2.4.0/metadata-stoeckius-hashing.csv for scRNAseq, on the BioC Git with commit f20b85f35702fdc28cf4d7ed738ec49fcf5518a5.
Aaron Lun (01:53:39): > And I forgot to say that data is uploaded to EHub in the scRNAseq folder.
2020-07-27
Lori Shepherd (07:20:39): > Thanks. With the conference it might take me a day or two to get to this but I have it (and the website badges still) on my radar
2020-07-29
Lori Shepherd (09:13:11): > I’ll try to carve out some time this afternoon to try and get these on – been distracted with the conference
Lori Shepherd (11:44:40): > for metadata-stoeckius-hashing – the directory on S3 is stoeckius-hashing but only stoeckius in the metadata – which do you prefer? @Aaron Lun
Aaron Lun (12:18:53): > oops. should be -hashing.
Aaron Lun (12:19:06): > changing now
Aaron Lun (12:20:30): > fixed.
Aaron Lun (12:20:47): > 1474110a6f08ee47b74c322d04e0497d394f7972.
Lori Shepherd (12:52:48): > @Aaron Lun for mair-pbmc: > > meta = makeExperimentHubMetadata("/home/shepherd/BioconductorPackages/ExperimentData/scRNAseq/", "2.4.0/metadata-mair-pbmc.csv") > missing or NA values for 'Coordinate_1_based set to TRUE' > Loading valid species information. > Error in AnnotationHubData:::.checkThatSingleStringOrNAAndNoCommas(SourceVersion) : > SourceVersion must not contain commas >
Aaron Lun (12:53:34): > Oh bugger.
Aaron Lun (12:53:47): > fixing now
Aaron Lun (12:55:48): > hm. I usually set the SourceVersion to the literal file that was used to generate a resource, but these resources were generated by combining multiple files. Any suggestions?
Aaron Lun (12:56:02): > semi-colon separation?
Lori Shepherd (12:56:31): > semi-colon I think should work
Aaron Lun (12:58:06): > fixed, 1caae3e3edf04728d6ca61e9c6cdc4073421fc04.
Lori Shepherd (13:08:25): > should be all set
Aaron Lun (13:08:40): > excellent, thanks
2020-07-30
Mike Smith (08:27:07): > @Mike Smith has joined the channel
Sean Davis (08:55:40): > With @Peter Hickey's lab at least one person noticed that making a hub or querying it led to some requests that interrupted code execution – a query to build the hub, and then some package updates … not clear why the latter was triggered. I wonder if there are some global option settings or instructions to authors to avoid this. It seems you have to answer the questions properly and then rerun the interrupted chunk. Any thoughts @Lori Shepherd?
Lori Shepherd (08:56:59): > I'm not sure why there would be package updates – is there a way to document it better for me to try and reproduce?
Vince Carey (08:58:45): > @Vince Carey has joined the channel
Vince Carey (09:00:12): > it was me … > > > > > # Load the relevant resource. > > # This will download the data and may take a little while on the first run. > > # The result will be cached, however, so subsequent runs avoid re-downloading > > # the data. > > fname <- hub[["EH1040"]] > Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.0 (2020-04-24) > Installing package(s) 'TENxBrainData' > trying URL '[https://bioconductor.org/packages/3.12/data/experiment/src/contrib/TENxBrainData_1.9.0.tar.gz](https://bioconductor.org/packages/3.12/data/experiment/src/contrib/TENxBrainData_1.9.0.tar.gz)' > Content type 'application/x-gzip' length 350287 bytes (342 KB) > ================================================== > downloaded 342 KB > > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > * installing **source** package ‘TENxBrainData’ ... > **** using staged installation > **** R > **** inst > **** byte-compile and prepare package for lazy loading > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > **** help > ***** installing help indices > **** building package indices > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > **** installing vignettes > **** testing if installed package can be loaded from temporary location > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > **** testing if installed package can be loaded from final location > Bioconductor version 3.12 (BiocManager 1.30.10), ?BiocManager::install for help > **** testing if installed package keeps a record of temporary installation path > * DONE (TENxBrainData) > > The downloaded source packages are in > ‘/tmp/Rtmp2lXKI5/downloaded_packages’ > Old packages: 'bit', 'bit64' > Update all/some/none? [a/s/n]: > Update all/some/none? [a/s/n]: # The structure of this HDF5 file can be seen using the h5ls() command > Update all/some/none? [a/s/n]: # from the rhdf5 package: > Update all/some/none? [a/s/n]: rhdf5::h5ls(fname) > Update all/some/none? [a/s/n]: > Update all/some/none? [a/s/n]: # The 1.3 Million Brain Cell Dataset is represented by the "counts" group. > Update all/some/none? [a/s/n]: # We point the HDF5Array() constructor to this group to create a HDF5Matrix > Update all/some/none? [a/s/n]: # object (a type of DelayedArray) representing the dataset: > Update all/some/none? [a/s/n]: tenx <- HDF5Array(filepath = fname, name = "counts") > Update all/some/none? [a/s/n]: > a > trying URL '[https://cran.r-project.org/src/contrib/bit_4.0.3.tar.gz](https://cran.r-project.org/src/contrib/bit_4.0.3.tar.gz)' > Content type 'application/x-gzip' length 279205 bytes (272 KB) > ================================================== > downloaded 272 KB > > trying URL '[https://cran.r-project.org/src/contrib/bit64_4.0.2.tar.gz](https://cran.r-project.org/src/contrib/bit64_4.0.2.tar.gz)' > Content type 'application/x-gzip' length 134375 bytes (131 KB) > ================================================== > downloaded 131 KB >
Lori Shepherd (09:00:47): > … hmm… I'll look into it – possibly the backend, if we require a package –
Vince Carey (09:02:25): > The user can sort it out by answering the question and then rerunning the chunk. But some bulletproofing should be possible.
Martin Morgan (09:20:38): > the resource requires TENxBrainData. TENxBrainData gets installed essentially with BiocManager::install(). One could instead do BiocManager::install(update = FALSE), but then the user might have an out of date installation. Of course update = TRUE would do the update without asking, but that might not be what the user wants either. One could also just stop, and say ‘hey, to download that resource you need to install TENxBrainData’. > > Which solution is best?
Martin Morgan (09:24:55): > From the workshop perspective, I guess the solution would have installed TENxBrainData on the image, maybe in the Suggests: field? (do Suggests packages get installed?)
Martin Morgan (09:25:59): > If something needs to be done, then opening an issue on https://github.com/Bioconductor/ExperimentHub/issues might be a good place for discussion
Sean Davis (09:53:01) (in thread): > If using our template repo, yes. The key component is that the default install does not use dependencies=TRUE, so a change from the template that does not include that additional parameter doesn't get Suggests packages. In this case, TENxBrainData wasn't in Suggests. Just added in a PR. https://github.com/PeteHaitch/BioC2020_DelayedArray_workshop/pull/7
2020-07-31
Dr Awala Fortune O. (16:13:33): > @Dr Awala Fortune O. has joined the channel
2020-08-01
Aaron Lun (15:12:02): > @Lori Shepherd there's a few new scRNAseq datasets I've pushed to the EHub server: hu-cortex and wu-kidney in the scRNAseq folder, corresponding metadata files in the usual place at inst/extdata for f4cdf0a1a6e3ae5be8b77763375cda4d656a6e31.
Lori Shepherd (17:02:07): > Awesome. We’ll look at adding it Mon or Tues
2020-08-03
Lori Shepherd (09:55:35): > @Aaron Lun we are doing a hot fix locally, but in your metadata for kidney could you add the extension .rds to the RDataPath – we did it locally to get the data added but didn't push up – please fix on your end
Kayla Interdonato (10:29:13): > @Aaron Lun Data should be all set
Aaron Lun (11:09:08): > thanks, done.
Junyan Xu (12:55:43): > @Junyan Xu has joined the channel
2020-08-06
Aaron Lun (13:39:17): > how hard would it be to make BFC (and thus EHub and AHub) thread-safe?
Aaron Lun (13:39:41): > Sounds like it should be achievable with liberal use of filelock or something similar.
Lori Shepherd (17:48:29): > I'm off for the next week but I'd like to explore this more when I get back
2020-08-10
Peter Hickey (07:11:50): > I'm planning to upload some large DNA methylation datasets to ExperimentHub and looking for best practices / preferred formatting for such uploads. > In a real analysis, these data are stored in disk-backed (i.e. HDF5-backed) SummarizedExperiment objects and comprise 2 (dense) count matrices with 10-100 million rows and 10-100 columns. > How should I prepare these data for upload to the ExperimentHub server? > > The two main options I see are: > > 1. 'All in one-ish': Run HDF5Array::saveHDF5SummarizedExperiment(), which produces a serialized SummarizedExperiment 'shell' (se.rds) and a single HDF5 file containing all the assay data (assays.h5). > 2. 'Separate components': Separately serialize the rowRanges (rowRanges.rds) and the colData (colData.rds) (in my case there's no metadata to be saved). > a. Save all the assays to a single HDF5 file (assays.h5). > b. Save each assay to its own HDF5 file (M.h5 and Cov.h5). > > An advantage of (1) is that I can call HDF5Array::loadHDF5SummarizedExperiment() to construct the SummarizedExperiment when the user calls the relevant data-loading function. > The returned object would also be compatible with HDF5Array::quickResaveHDF5SummarizedExperiment(), although I haven't fully thought through the implications of a user running that function on an AnnotationHub-cached version of the data. > Any thoughts, @Hervé Pagès? > > The style of (2) is more in line with what @Aaron Lun has done in the scRNAseq package, in that the individual components are serialized and then a SingleCellExperiment is constructed on-the-fly when the user calls the relevant data-loading function. > I'm guessing the choice to serialize the individual components rather than just serializing the SingleCellExperiment as a whole is a defensive move in case of changes to class definitions, or so an ExperimentHub user could opt to download just a portion of the object (e.g., count matrix)? > Splitting the assays up, as in (2b), feels like a step too far in this case because any analysis of these datasets requires both the M and Cov assays. > > Taking (2) even further would be to use more general file formats (e.g., rowRanges.csv) but then you'd need to do more file munging & format conversions within the relevant data-loading function (which can take a long time for these large datasets). > Finally, taking (1) and (2) even further would be storing the rowRanges and colData alongside the assays in the same file (e.g. se.h5). > The relevant data-loading function would then construct the rowRanges and colData S4 objects on-the-fly whereas the assay data would be stored in an HDF5Array so that the assay data would not be loaded into memory. > That's potentially neat but requires designing a schema for storing S4 objects in a HDF5 file (e.g., GenomicRanges for rowRanges and DataFrame for colData), which isn't something I'm keen to embark upon.
Hervé Pagès (07:12:07): > @Hervé Pagès has joined the channel
Martin Morgan (08:45:16): > I think there is value in keeping *Hub resources relatively granular, so one file per assay, separate files for row and column data. It is not costly to assemble these ‘on the fly’. > > I’d aim for the simplest feasible format, so if the row and column data are easily stored in a simple csv file I’d do that. This seems more future-proof, and potentially more interoperable (although maybe it’s feature-creep to think that some sort of meta-software will process these files). > > Assays seem like they are separate HDF5 files.
Hervé Pagès (18:43:15): > Definitely (2). I'm a big fan of the "defensive move in case of changes to class definitions". You will still serialize S4 objects (rowRanges and colData) but these are lower-level objects so maybe less likely to see their class internals change (:lying_face:). But you never know so yes, maybe the rowRanges can go in a csv or GFF file as Martin suggested. When @Stephanie Hicks submitted DNA methylation data last year (https://github.com/Bioconductor/Contributions/issues/1207) I took a quick look at it and it seemed easy enough to store the rowRanges in the h5 file along with the M and Cov datasets. Don't know for the colData but IMHO aiming at having everything in a single standalone platform/language agnostic h5 file is worth a shot.
Stephanie Hicks (18:43:23): > @Stephanie Hicks has joined the channel
Aaron Lun (18:49:06): > In addition to defending against class changes: in the case of scRNAseq, there are a few datasets with more complex getter semantics. E.g., get a single batch, multiple batches, RNA + ADT, and so on. So it makes sense to store some of these separately to avoid loading stuff you don’t need.
Peter Hickey (21:19:59): > thanks all. I’ll start with (2) using separate h5 files for each assay and assess the performance of storing rowRanges in BED and colData in CSV vs. storing them as serialized S4 objects. i might try storing everything in an H5 file but i’m hesitant to be responsible for a SummarizedExperiment <-> h5 file inter-converter
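A rough sketch of what an option (2) loader could look like (the EH ids and HDF5 dataset names are placeholders; assumes serialized rowRanges/colData plus one HDF5 file per assay):
> library(ExperimentHub)
> library(HDF5Array)
> library(SummarizedExperiment)
> loadMethylation <- function() {
>     eh <- ExperimentHub()
>     rowRanges <- eh[["EH0001"]]                   # serialized GPos
>     colData   <- eh[["EH0002"]]                   # serialized DataFrame
>     M   <- HDF5Array(cache(eh["EH0003"]), "M")    # assays stay on disk
>     Cov <- HDF5Array(cache(eh["EH0004"]), "Cov")
>     SummarizedExperiment(assays = list(M = M, Cov = Cov),
>                          rowRanges = rowRanges, colData = colData)
> }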
2020-08-12
Aaron Lun (02:29:20): > I’m getting some wackiness on nebbiolo1 for scRNAseq with something about a corrupt cache:http://bioconductor.org/checkResults/devel/data-experiment-LATEST/scRNAseq/nebbiolo1-checksrc.html
Hervé Pagès (15:17:26): > Maybe give it a few build cycles. I see a lot of AnnotationHub/ExperimentHub errors on nebbiolo1, not just in the data experiment builds but even more in the software builds. My feeling is that since we start from scratch on this machine, a lot of stuff is getting downloaded to the local AnnotationHub/ExperimentHub caches, and all this happens concurrently (up to 27 R CMD check commands can be running concurrently on nebbiolo1 during the software builds). Maybe the code in AnnotationHub/ExperimentHub/BiocFileCache is somehow having a hard time handling all this concurrent writing. I just went on nebbiolo1 to run R CMD check scRNAseq again and this time it was successful so who knows…
Aaron Lun (15:17:46): > :+1:
Aaron Lun (15:17:49): > makes sense.
2020-08-16
Peter Hickey (20:52:41): > Following up on https://community-bioc.slack.com/archives/CDSG30G66/p1597108799113400 with some numbers. > A dataset with 22,136,977 loci and 27 samples: > > 306M bulk_CpG.Cov.h5 > 303M bulk_CpG.M.h5 > 609M bulk_CpG.assays.h5 H5 file with both M and Cov > > 9.5K bulk_CpG.colData.csv.gz > 9.5K bulk_CpG.colData.rds colData as DataFrame > > 127M bulk_CpG.rowRanges.bed.gz > 55M bulk_CpG.rowRanges.rds rowRanges as GPos >
> csv.gz vs. rds for colData > > - Number of samples is usually small, so colData is usually small. > - "CSV files do not record an encoding" (?write.csv) so may have cross-platform issues? > - rds file does not need to be parsed upon loading > > bed.gz vs. rds for rowRanges > > - Number of loci is usually very large, so rowRanges may be large > - BED files don't usually contain genome / seqinfo (although could be stored as free text in TrackLine?) > - rds file does not need to be parsed upon loading > - rds of GPos is more space-efficient (no end column, name and score column not required for this dataset) > - rds file takes ~1 second to load, BED file takes ~38 seconds > > I've not pursued storing the rowRanges and colData in the H5 file.
2020-08-17
Hervé Pagès (16:44:11): > Do you have metadata columns on the GPos? If you don't, then storing it in the h5 file would just be a matter of storing 2 parallel 1D datasets, 1 for the chrom ids, 1 for the genomic pos. Then you'd need a 3rd 1D dataset to map the chrom ids to the chromosome names. Reconstructing the GPos from this might be pretty fast, maybe not 1 sec. like with the rds solution but certainly much less than the 38 sec. of the BED solution.
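A sketch of that layout using rhdf5 (the dataset names are made up; assumes a GPos with no metadata columns):
> library(rhdf5)
> library(GenomicRanges)
> writePos <- function(gpos, file) {
>     h5createFile(file)
>     h5write(match(as.character(seqnames(gpos)), seqlevels(gpos)), file, "chrom_id")
>     h5write(pos(gpos), file, "pos")
>     h5write(seqlevels(gpos), file, "chrom_names")
> }
> readPos <- function(file, genome = NA_character_) {
>     chrom <- h5read(file, "chrom_names")[h5read(file, "chrom_id")]
>     gp <- GPos(seqnames = chrom, pos = h5read(file, "pos"))
>     genome(gp) <- genome
>     gp
> }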
Peter Hickey (17:40:09): > this particular example has no metadata columns, so I agree in this case it's probably not too hard to store the positional information in the h5 file. What about seqinfo? Can I safely store the genome() in the h5 file as a string and then repopulate with Seqinfo(genome) at the R level?
Peter Hickey (17:43:17): > how does the core team feel about ending up with potentially dataset-specific .h5 files? i.e. if another dataset comes along that has metadata columns then it would need a richer structure in the h5 file, whereas saving the rds works for everything (ignoring class definition updates)
Vince Carey (22:36:56): > @Peter Hickey I read your question as asking whether you should put additional information in the HDF5 so that it is more self-describing. I see no reason not to do this, as long as there is documentation. For homogeneous data like coordinates and reference build name it seems pretty straightforward. For "metadata columns" that might have diverse types and missing values, maybe it is less straightforward for HDF5.
Aaron Lun (23:59:21): > I'm thinking of making a TENxTestData package for bits and pieces of 10x data that I would like to use to test DropletUtils.
Aaron Lun (23:59:52): > It doesn't really fit into the other packages like scRNAseq because I need the raw files to test that the DropletUtils utilities are working
2020-08-18
Aaron Lun (00:00:12): > the other packages use DropletUtils to get to the processed count matrices, so that's not really sufficient for my purposes.
Peter Hickey (01:41:42) (in thread): > I also see no reason not to do this, apart from not wanting to do it because it's not clear what it buys me when compared to saving rowData and colData as .rds files, except making extra work (including maintenance) :slightly_smiling_face:
Peter Hickey (01:43:55) (in thread): > as a community I can see it being potentially useful but that's a different matter (to me, right now)
Hervé Pagès (02:20:02) (in thread): > Any solution that avoids serializing S4 objects is worth trying IMO, granted that it's still fast enough. Storing the rowData in the h5 file is one way to achieve that (using BED seems way too slow). Another one maybe is to use GFF. Hopefully rtracklayer::readGFF() will be faster than reading a BED file. However I don't expect it to be as fast as the h5-based solution.
Peter Hickey (02:24:07) (in thread): > is avoiding serializing S4 objects better or worse than writing/maintaining code to ‘save’ and ‘load’ an S4 object to/from another file format (h5)? Not being facetious, but trying to think through implications
Aaron Lun (02:37:09): > well. I just went ahead and started: https://github.com/LTLA/DropletTestFiles.
Hervé Pagès (02:39:25) (in thread): > Maybe I misunderstood but you’re not trying to solve the general problem of storing an arbitrary SummarizedExperiment object in an h5 file right? Only trying to store some DNA methylation datasets to ExperimentHub in a form that makes it easy to reconstruct the original SummarizedExperiment objects. So the save/load functions are going to be very specialized and not that complicated. The save function is not even meant to be used by the end user. The load function is going to be a utility defined in the associated data experiment package. It’s going to be simple because you have total control on its input. This is a hundred times simpler than implementing/maintaining a general SummarizedExperiment <-> h5 file inter-converter.
Peter Hickey (02:40:07) (in thread): > okay, if we’re happy with dataset-specific save/load functionality then i agree
Peter Hickey (02:42:33) (in thread): > now to learn how to read/write to an HDF5 file outside of using HDF5Array :slightly_smiling_face:
Hervé Pagès (03:20:59) (in thread): > Well I'm assuming that you have more than one DNA methylation dataset that you want to put on ExperimentHub. And I suppose the dataset-specific save/load functions would work for all of the present ones and maybe for the future ones too. So hopefully there is already enough reusability to justify the effort. Maybe small adjustments to these save/load functions will be needed in the future to accommodate datasets that carry more metadata. No adjustments to these dataset-specific save/load functions will be as painful as seeing dozens of ExperimentHub resources break because of some change in the internals of some S4 class.
2020-08-19
Aaron Lun (03:34:18): > @Lori Shepherd I'm ready with the first round of files for DropletTestFiles. How do you want to do this? Should I submit to BioC first?
Lori Shepherd (07:46:49): > However you like – if you have the GitHub URL and upload the data we can work on that - if you want to wait to submit until you are fully able to debug/implement the hubs you can, or you can submit knowing that the data will be up in the next day or two. @Kayla Interdonato is helping with uploads now too so that we can do it more timely - making sure she sees this too, for whichever of us gets to it first
2020-08-30
Aaron Lun (01:21:31): > @Lori Shepherd Another week, another dataset; I have another upload in scRNAseq in inst/extdata/2.4.0/metadata-kotliarov-pbmc.csv. Data has been uploaded to EHub's S3 under scRNAseq/kotliarov-pbmc/.
2020-08-31
Lori Shepherd (08:48:49): > @Aaron Lun Done
FelixErnst (11:34:01): > @Lori Shepherd Hi Lori. If I want to add additional data to an existing package, I need to create a separate metadata.csv. Is that correct?
Lori Shepherd (11:35:12): > We encourage a separate metadata.csv file, yes – we don't document it very well, but the "metadata.csv" can be named anything.csv as long as you let us know the name of it when @Kayla Interdonato or I add the data to the database.
Aaron Lun (11:52:58) (in thread): > Thanks @Lori Shepherd, looks good. Upon testing the getter, I realized that the two rowdata files were actually redundant and could be deleted: > > scRNAseq/kotliarov-pbmc/2.4.0/rowdata-adt.rds > scRNAseq/kotliarov-pbmc/2.4.0/rowdata-rna.rds > > I've pushed up an updated metadata CSV with these entries stripped out, if you need it (inst/extdata/2.4.0/metadata-kotliarov-pbmc.csv).
Lori Shepherd (12:03:21) (in thread): > I think they are removed now
Aaron Lun (12:04:57) (in thread): > thx
FelixErnst (14:46:26) (in thread): > I uploaded the data and sent an explanation to hubs@bioconductor.org. Thanks for the hints. Please let me know if everything is ok or whether I made a mistake.
2020-09-07
Aaron Lun (23:45:37): > @Lori Shepherd I've got another set of files for DropletTestFiles on the EHub S3 bucket. This is for tenx-2.1.0-pbmc4k with metadata available in inst/extdata/1.0.0/metadata-tenx-2.1.0-pbmc4k.csv. (Turns out I couldn't be bothered waiting for 10X to fix their problematic file, so I'm just going ahead with what I've got.)
2020-09-08
Lori Shepherd (10:55:30): > @Aaron Lun done
Aaron Lun (11:09:22): > thanks
Hervé Pagès (14:58:26): > Posted this to the wrong channel sorry. Moving it to#bioc-builds
2020-09-18
Aaron Lun (03:05:30): > @Lori Shepherd was there documentation for how to handle the BAM/BAI paired file situation in the EHub caches? I remember discussing this and you had a solution that I didn't adopt because I had something else that worked. But now seems like a good time, given that my previous symlinking hack (for chipseqDBData) is no longer feasible on riesling.
Lori Shepherd (11:08:07): > you can have a resource id associated with multiple files – I think the paths get separated with a semicolon but I'll double check – and I think there is a dispatch class for Bam that assumes two files, with the first being the bam and the second being the bai – I will also check on this
Lori Shepherd (11:18:33): > > setClass("BamFileResource", contains="AnnotationHubResource") > > setMethod(".get1", "BamFileResource", > function(x, ...) > { > .require("Rsamtools") > bam <- cache(getHub(x)) > Rsamtools::BamFile(file=bam[1],index=bam[2]) > }) >
> Yes there is a BamFile DispatchClass that will assume the first file is the bam and the second the bai, and load with Rsamtools. Was the issue that you didn't want them directly loaded through Rsamtools? You could still feasibly, I think, use FilePath for the DispatchClass – but I would have to see if it displays the path to both files in the output or not
Aaron Lun (11:20:49): > Yes… I think I was expecting to get them as file paths, at least currently the infrastructure was expecting this. Hm.
Aaron Lun (11:21:56): > If it’s too hard to guarantee the same name, I can handle it on my end by forcing a copy instead of symlink.
Lori Shepherd (11:29:45): > hmm… yeah BFC adds the unique identifier to allow for multiple versions of the same file to be cached – I wonder if we need to revisit having an option to turn it off – I think BamFile is generic enough where the bam and bai don’t have to be exact matching names which also seems good practice too –
Aaron Lun (11:34:18): > Direct BamFile ingestion would require a fairly large overhaul to csaw's counting routines (https://github.com/LTLA/csaw/pull/4), which I never revisited because it wasn't a big deal at the time.
Aaron Lun (11:35:20): > Probably too late in the cycle to get that through, though. I guess I'll fall back to a copy in chipseqDBData if a symlink fails.
Martin Morgan (11:50:14): > BamFile just wraps the path names, without loading the bam; the path and index are path(bfl) and index(bfl).
Lori Shepherd (12:19:59): > ah so it might be feasible to work with this dispatchclass
Aaron Lun (12:22:36): > It is… possible. csaw's existing code would have to be modified to accommodate alternative indices, which I think should be fairly straightforward. The bigger problem is feeding these BamFile objects in the first place, which will be a breaking change to chipseqDBData's API.
Aaron Lun (12:23:28): > I will have to meditate on this. Maybe I'll add an argument to the getter function saying as.BamFile, and when that is set to TRUE, it uses this new behavior of returning a BamFile.
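A sketch of that getter (the function and argument names are hypothetical; assumes the hub record uses the BamFile dispatch class, so path() and index() are available):
> library(ExperimentHub)
> library(Rsamtools)
> getBamData <- function(id, as.BamFile = FALSE) {
>     eh <- ExperimentHub()
>     bfl <- eh[[id]]      # BamFile wrapping the cached bam + bai paths
>     if (as.BamFile) bfl else c(bam = path(bfl), index = index(bfl))
> }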
2020-09-21
FelixErnst (03:02:11): > The separator between BAM/BAI is expected to be a colon. See ?makeAnnotationHubMetadata and for an example: https://github.com/FelixErnst/RNAmodR.Data/blob/master/inst/extdata/metadata.csv
Aaron Lun (03:03:05): > right, but that’s not so much the problem here.
FelixErnst (03:03:21): > the file names can also be independent. AFAIK they don’t have to be the same
Aaron Lun (03:04:06): > The issue is not what the RDataPath is. The issue is the literal file name chosen by ExperimentHub when it caches the files locally.
FelixErnst (03:11:08): > I see. Sorry for the blurp.
2020-09-22
Marcel Ramos Pérez (11:39:19): > Hi Lori, @Lori Shepherd I have additional resources to add: https://github.com/waldronlab/SingleCellMultiModal/blob/dario/inst/extdata/metadata.csv
2020-09-23
Lori Shepherd (09:16:32): > The new entries are the two cord_blood?
Marcel Ramos Pérez (09:16:42): > Yes
Lori Shepherd (09:33:44): > done
Marcel Ramos Pérez (09:33:57): > Thank you:pray:
2020-09-27
Leonardo Collado Torres (23:39:17): > Hi! > > If I have a metadata csv file that is ~13 Mb, should I version control it? or ignore it and submit separately?
2020-10-01
Dan Bunis (16:19:21): > @Spiro Stilianoudakis answers getting to the wider email readership is great too, but just to potentially help you get help faster, this channel is where I generally see the "requires assistance from a Bioconductor team member" part from your email being handled. My understanding is that many of the messages here are about getting exactly that, and generally from Lori :smiley:
Spiro Stilianoudakis (16:19:26): > @Spiro Stilianoudakis has joined the channel
Spiro Stilianoudakis (16:31:55): > @Dan Bunis Great! Thank you for the heads up!
Lori Shepherd (18:11:25): > @Leonardo Collado Torres why/how is the metadata file that big? Does it contain a lot of extra data?
2020-10-02
Kasper D. Hansen (07:02:07): > @Lori Shepherd It's a LOT of studies
Lori Shepherd (14:10:43) (in thread): > I’m going to touch base about this but off slack
Leonardo Collado Torres (18:11:32): > Sorry Lori, just saw this. We decided not to submit this large metadata file as part of recount3. It had entries for about 18k studies: > > 8742 + 10088 > [1] 18830 >
2020-10-22
Aaron Lun (01:34:05): > oops, fat fingered there.
Aaron Lun (01:34:28): > Think I fixed my problem. But man, interfacing with HTSLib was a mistake.
Aaron Lun (01:35:21): > You know the HTSLib documentation is bad when you have to look at Rsamtools's C code to figure out how to use it.
Aaron Lun (01:37:59): > I mean, there isn't even documentation for hts_idx_load2! And it's the only index loader I can get to work when the BAI file does not end with "bam.bai".
Hervé Pagès (01:44:41): > Great, I’m not the only one to suffer. Migrating Rsamtools to Rhtslib was a nightmare.
Hervé Pagès (01:49:50): > This one was particularly painful:https://github.com/Bioconductor/Rsamtools/wiki/How-to-extract-records-within-a-user-specified-region-from-a-VCF-or-BCF-file-with-htslib-1.7
Aaron Lun (02:03:59): > my god
Aaron Lun (15:50:18): > I get an awful amount of stuff like: > > Warning: download failed > web resource path: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > local file path: '/local/tmp/RtmpTJRrDq/BiocFileCache/1ba064dac1ca_80651' > reason: Internal Server Error (HTTP 500). > Warning: bfcadd() failed; resource removed > rid: BFC12 > fpath: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > reason: download failed > Warning: download failed > hub path: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > cache resource: 'AH73905 : 80651' > reason: bfcadd() failed; see warnings() > Error: failed to load resource > name: AH73905 > title: Ensembl 97 EnsDb for Mus musculus > reason: 1 resources failed to download > Execution halted >
Aaron Lun (15:50:44): > Why does this happen? It’s not just a client-side locking problem, it seems to be an EHub-side issue.
Lori Shepherd (15:51:24): > I can look into whether AWS locks after a certain number of queries or something of the like – it's the only thing I could think of immediately
Aaron Lun (16:00:32): > Looking at https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/#:~:text=Amazon%20S3%20now%20provides%20increased,time%20for%20no%20additional%20charge indicates that you can have several thousand requests per second for retrieval, so it seems unlikely that we'd be overloading that.
2020-11-17
Aaron Lun (16:11:54): > @Lori Shepherd can I get the Hubs to just give me the local path back to the downloaded resource without trying to load it into R?
Lori Shepherd (16:13:04): > yes. DispatchClass FilePath
Aaron Lun (16:13:22): > no, I don’t own the resource.
Aaron Lun (16:13:41): > So I’m trying to use someone else’s resource and they’ve already set the dispatchclass to something else.
Lori Shepherd (16:14:26): > ah… um… let me think for a sec…
Lori Shepherd (16:21:49): > so you would have to download it initially with their dispatchclass – but after it’s downloaded you could get the local path using cache on a hub object
Lori Shepherd (16:22:00): > > > cache(eh['EH166']) > EH166 > "/home/shepherd/.cache/ExperimentHub/2e5771200474_166" >
Lori Shepherd (16:22:33): > I think cache may download if the resource isn’t downloaded otherwise it lists the path …
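A minimal sketch of the workflow described above (EH166 is just the id used in the example; the first access is assumed to go through the resource owner's DispatchClass):
> library(ExperimentHub)
> eh <- ExperimentHub()
> obj <- eh[["EH166"]]   # first access downloads and loads via the owner's DispatchClass
> cache(eh["EH166"])     # afterwards, returns the local file path inside the BiocFileCache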
Aaron Lun (16:24:16): > thanks, looks like that did it.
2020-11-18
Aaron Lun (00:55:58): > has there been any resolution to the Hub reliability issues? For example,https://bioconductor.org/checkResults/devel/books-LATEST/OSCA/malbec2-buildsrc.html: > {#} > # Error : failed to load resource > # name: EH3769 > # title: 10X PBMC 4k raw count matrix > # reason: 1 resources failed to download > # In addition: Warning message: > # download failed > # hub path: '[https://experimenthub.bioconductor.org/fetch/3805](https://experimenthub.bioconductor.org/fetch/3805)' > # cache resource: 'EH3769 : 3805' > # reason: Timeout was reached: [experimenthub.bioconductor.org] Operation timed out after 10000 milliseconds with 0 out of 0 bytes received >
> I only get these builds twice a week, so that’s pretty frustrating.
Aaron Lun (23:32:41): > And again, http://bioconductor.org/checkResults/release/bioc-LATEST/scuttle/merida1-buildsrc.html
Aaron Lun (23:32:51): > > using temporary cache /tmp/RtmpQMcMYw/BiocFileCache > snapshotDate(): 2020-10-27 > downloading 1 resources > retrieving 1 resource > Warning: download failed > web resource path: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > local file path: '/tmp/RtmpQMcMYw/BiocFileCache/121e7284f12c3_80651' > reason: Internal Server Error (HTTP 500). > Warning: bfcadd() failed; resource removed > rid: BFC3 > fpath: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > reason: download failed > Warning: download failed > hub path: '[https://annotationhub.bioconductor.org/fetch/80651](https://annotationhub.bioconductor.org/fetch/80651)' > cache resource: 'AH73905 : 80651' > reason: bfcadd() failed; see warnings() > Quitting from lines 263-272 (overview.Rmd) > Error: processing vignette 'overview.Rmd' failed with diagnostics: > failed to load resource > name: AH73905 > title: Ensembl 97 EnsDb for Mus musculus > reason: 1 resources failed to download >
2020-11-20
Vince Carey (10:00:29) (in thread): > I just tried this interactively and was a little surprised to be interrupted by a package update query: > > > xx = eh[["EH3769"]] > Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.2 Patched (2020-07-19 > r78892) > Installing package(s) 'DropletTestFiles' > trying URL '[https://bioconductor.org/packages/3.12/data/experiment/src/contrib/DropletTestFiles_1.0.0.tar.gz](https://bioconductor.org/packages/3.12/data/experiment/src/contrib/DropletTestFiles_1.0.0.tar.gz)' > Content type 'application/x-gzip' length 234039 bytes (228 KB) > ================================================== > downloaded 228 KB > > 1/9 packages newly attached/loaded, see sessionInfo() for details. > * installing **source** package ‘DropletTestFiles’ ... > **** using staged installation > **** R > **** inst > **** byte-compile and prepare package for lazy loading > 1/9 packages newly attached/loaded, see sessionInfo() for details. > **** help > ***** installing help indices > **** building package indices > 1/9 packages newly attached/loaded, see sessionInfo() for details. > **** installing vignettes > **** testing if installed package can be loaded from temporary location > 1/9 packages newly attached/loaded, see sessionInfo() for details. > **** testing if installed package can be loaded from final location > 1/9 packages newly attached/loaded, see sessionInfo() for details. > **** testing if installed package keeps a record of temporary installation path > * DONE (DropletTestFiles) > > The downloaded source packages are in > ‘/tmp/RtmpQwOoII/downloaded_packages’ > Updating HTML index of packages in '.Library' > Making 'packages.html' ... done > Old packages: 'cli', 'mclust', 'mzR', 'pillar', 'RCy3', 'Rsubread' > Update all/some/none? [a/s/n]: n > see ?DropletTestFiles and browseVignettes('DropletTestFiles') for documentation > downloading 1 resources > retrieving 1 resource > |======================================================================| 100% > > loading from cache >
Aaron Lun (12:17:28): > A series of failures in our internal automated builds: > > --- re-building 'pseudobulk.Rmd' using rmarkdown > snapshotDate(): 2020-04-27 > see ?scRNAseq and browseVignettes('scRNAseq') for documentation > downloading 1 resources > retrieving 1 resource > Warning: download failed > hub path: '[https://experimenthub.bioconductor.org/fetch/2591](https://experimenthub.bioconductor.org/fetch/2591)' > cache resource: 'EH2575 : 2591' > reason: Timeout was reached: [experimenthub.bioconductor.org] Connection timed out after 10003 milliseconds > Quitting from lines 41-50 (pseudobulk.Rmd) > Error: processing vignette 'pseudobulk.Rmd' failed with diagnostics: > failed to load resource > name: EH2575 > title: Segerstolpe pancreas counts > reason: 1 resources failed to download > --- failed re-building 'pseudobulk.Rmd' >
Lori Shepherd (12:26:01): > I’m looking into whether there is some timeout issue happening like we see with download.file, and into how we download files in the hub, as well as looking into AWS to see if there is some limit on consecutive pings to the instance – it’s incredibly hard for me to reproduce since when I try manually there aren’t any issues
Aaron Lun (12:27:41): > maybe we spam the Hubs with requests, record the times and see what shows up on the server logs.
Martin Morgan (13:46:31): > 1. If run in a non-interactive session, the default is to behave as though there is no cache. So can you say, e.g., BiocFileCache::BiocFileCache(ask=FALSE)
in your vignette, so the cache is actually re-used? > 2. Are these large files, so that what is happening is that the files start to be downloaded, but do not complete within the timeout of curl?
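A minimal sketch of point 1, assuming the hub-using vignette creates the cache itself before any hub calls (chunk placement is up to the package author):
> # In a vignette setup chunk: suppress the interactive prompt so a persistent
> # default cache is created and re-used in non-interactive (build) sessions.
> bfc <- BiocFileCache::BiocFileCache(ask = FALSE)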
2020-11-22
Aaron Lun (04:13:06): > The caches don’t persist in the container after each run.
Aaron Lun (04:14:01): > And the files are not altogetherthatlarge. Probably no more than 50 MB, if I had to guess.
Aaron Lun (04:15:02): > Another example: > > Error: failed to load resource > name: EH2557 > title: Allen brain Tophat counts > reason: 1 resources failed to download > In addition: Warning messages: > 1: download failed > web resource path: '[https://experimenthub.bioconductor.org/fetch/2573](https://experimenthub.bioconductor.org/fetch/2573)' > local file path: '/tmp/RtmpFd3eHM/BiocFileCache/74326df29ba_2573' > reason: Internal Server Error (HTTP 500). > 2: bfcadd() failed; resource removed > rid: BFC3 > fpath: '[https://experimenthub.bioconductor.org/fetch/2573](https://experimenthub.bioconductor.org/fetch/2573)' > reason: download failed > 3: download failed > hub path: '[https://experimenthub.bioconductor.org/fetch/2573](https://experimenthub.bioconductor.org/fetch/2573)' > cache resource: 'EH2557 : 2573' > reason: bfcadd() failed; see warnings() > Execution halted >
Aaron Lun (04:15:50): > I’m seeing this semi-regularly in GitHub Actions, on the BioC build reports, and on our internal company builders, so it can’t be that hard to reproduce.
Sean Davis (12:50:33) (in thread): > There is a 500 error in there that looks like it is probably coming from HubServer? That should be in a log somewhere?
Aaron Lun (14:51:56) (in thread): > My guess is that there’s an exclusive lock on the server-side DB that causes read operations to fail sporadically, e.g., during periods of high demand.
2020-11-24
Martin Morgan (03:48:28) (in thread): > is there a timestamp when this occurs?@Aaron Lun
Aaron Lun (03:49:43) (in thread): > I’d have to dig out my work computer to check the timestamp. Will do so tomorrow.
2020-12-05
Jonathan Griffiths (07:57:15): > @Jonathan Griffiths has joined the channel
Jonathan Griffiths (07:58:27): > I’ve received an email delivery error for something I sent to the hubs@bioc email address - this seems unusual. Is it a known issue at the moment? Might be something on the @roswellpark.org end of things? > > Diagnostic-Code: smtp; 550 #5.7.1 Your access to submit messages to this e-mail system has been rejected.
Lori Shepherd (15:56:50): > Hmm. It shouldn’t be a Roswell Park issue. You’re sure it was the correct address? hubs at Bioconductor.org?
2020-12-06
Jonathan Griffiths (16:59:15) (in thread): > Yeah, I replied to an old email I received. Well, I can send it again tomorrow morning anyway
2020-12-07
Lori Shepherd (07:36:08) (in thread): > I see the email now so not quite sure what the earlier issue was. We will try to look to see if there was any downtime
2020-12-11
Marcel Ramos Pérez (10:02:55): > I know there’s MTX available, but can mtx.gz be added as a valid source type in AnnotationHubData::getValidSourceTypes()? @Lori Shepherd
Lori Shepherd (10:11:11) (in thread): > yes we can look at adding it in
Marcel Ramos Pérez (10:11:28) (in thread): > Thanks!
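A quick way for contributors to check whether a SourceType is already accepted, using the function named above (a sketch; the exact return value depends on the AnnotationHubData version):
> valid <- AnnotationHubData::getValidSourceTypes()
> "mtx.gz" %in% valid   # FALSE until the new type is added, TRUE afterwards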
2020-12-12
Huipeng Li (00:41:03): > @Huipeng Li has joined the channel
2020-12-14
Kozo Nishida (13:38:09): > @Kozo Nishida has joined the channel
Kozo Nishida (13:45:11): > @Lori Shepherd @Kayla Interdonato I created an AnnotationHub package for the PathBank pathway database. > - https://pathbank.org/downloads > - https://github.com/biopackathon/AHPathbankDbs > I would like to access an S3 bucket where I can upload the rda files. > (https://github.com/biopackathon/AHPathbankDbs/blob/main/inst/extdata/metadata.csv) > > Can I ask you to give me access to the ‘AnnotationContributor’ user? > (https://bioconductor.org/packages/release/bioc/vignettes/AnnotationHub/inst/doc/CreateAnAnnotationPackage.html#uploading-data-to-s3) > > (I had a problem sending to hubs@bioconductor.org and contacted you via Slack instead.)
Constantin Ahlmann-Eltze (14:14:33): > @Constantin Ahlmann-Eltze has joined the channel
Nick Owen (14:49:11): > @Nick Owen has joined the channel
2020-12-21
Marcel Ramos Pérez (13:04:53): > @Lori Shepherd@Kayla InterdonatoI have a metadata file readyhttps://github.com/waldronlab/SingleCellMultiModal/blob/pbmc/inst/extdata/metadata.csvI’ve also made changes to the Maintainer field for earlier versions of data with EH_IDS > > c("EH3738", "EH3739", "EH3740", "EH3741", "EH3742", "EH3743", > "EH3744", "EH3745", "EH3746", "EH3747", "EH3748", "EH3749", "EH3750", > "EH3751", "EH3752", "EH3753", "EH3754", "EH3755", "EH3756", "EH3757", > "EH3758", "EH3759", "EH3760", "EH3761", "EH3762", "EH3763", "EH3764", > "EH3765", "EH3766", "EH3767") >
Marcel Ramos Pérez (14:15:01) (in thread): > Note: the MTX DispatchClass changed to FilePath. The metadata is GTG
Levi Waldron (15:31:57): > This looks kind of interesting as a way to include versioned data in git projects, looks like git LFS but better: https://github.com/iterative/dvc
2020-12-22
Kayla Interdonato (09:38:31) (in thread): > @Marcel Ramos PérezI’ll be working on this today
Marcel Ramos Pérez (09:38:48) (in thread): > Thanks Kayla!
Kayla Interdonato (12:29:26) (in thread): > The new data has been added (as shown below). Things to be changed in the metadata file: > 1. Species for peripheral_blood should be changed to ‘Homo sapiens’ as this field is case sensitive. > 2. SourceType for pbmc_10x should be ‘mtx.gz’ as this field is case sensitive. > I made these changes locally so the data could be added to the hubs. I’ll work on updating the Maintainer field in the earlier versions but wanted to be sure this new data was made available in the hubs. > > > eh = ExperimentHub() > |======================================================================| 100% > > snapshotDate(): 2020-12-22 > > query(eh, c("SingleCellMultiModal", "peripheral_blood")) > ExperimentHub with 10 records > # snapshotDate(): 2020-12-22 > # $dataprovider: Technology Innovation Lab, New York Genome Center, New York... > # $species: Homo sapiens > # $rdataclass: matrix, data.frame, dgCMatrix > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH4613"]]' > > title > EH4613 | CTCL_scADT > EH4614 | CTCL_scHTO > EH4615 | CTCL_scRNA > EH4616 | CTCL_TCRab > EH4617 | CTCL_TCRgd > EH4618 | CTRL_scADT > EH4619 | CTRL_scHTO > EH4620 | CTRL_scRNA > EH4621 | CTRL_TCRab > EH4622 | CTRL_TCRgd > > query(eh, c("SingleCellMultiModal", "pbmc_10x")) > ExperimentHub with 8 records > # snapshotDate(): 2020-12-22 > # $dataprovider: European Bioinformatics Institute (EMBL-EBI), United Kingdom > # $species: Homo sapiens > # $rdataclass: dgCMatrix, SingleCellExperiment, HDF5Matrix, DFrame > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH4623"]]' > > title > EH4623 | pbmc_atac_assays > EH4624 | pbmc_atac_se > EH4625 | pbmc_atac > EH4626 | pbmc_colData > EH4627 | pbmc_rna_assays > EH4628 | pbmc_rna_se > EH4629 | pbmc_rna > EH4630 | pbmc_sampleMap >
Marcel Ramos Pérez (12:45:13) (in thread): > Thanks Kayla, I’ll update this right away !
Kayla Interdonato (14:18:41) (in thread): > Maintainer field should be updated for those EH entries.
2020-12-23
Aaron Lun (20:53:23): > Another day, another 500: https://bioconductor.org/checkResults/release/books-LATEST/OSCA/nebbiolo1-buildsrc.html I recall being asked for a timestamp, so sometime between 10:30:26 and 10:30:58 on the 22nd.
2021-01-09
Aaron Lun (16:13:39): > @Lori Shepherd new year, new datasets! Can I get some new EHub credentials?
Aaron Lun (16:24:43): > I plan to be uploading 4-5 datasets every week, depending on how well I can stick to my plan. I’d like to upload these in smaller bundles rather than doing it all at once, because some of the datasets are difficult and need their own bespoke getter, and I want to check that the getter works before I move on to the next set of datasets.
Aaron Lun (16:25:20): > How long does it take on your end? I can of course run the validation before I upload it.
Lori Shepherd (16:59:37): > If it’s validated we can have it uploaded in a day or two, especially if you send the request to the hubs at Bioconductor.org email so that anyone who knows how to do it can assist.
Lori Shepherd (17:00:12): > I’ll send the creds over tomorrow or Monday. Cheers
2021-01-12
Aaron Lun (01:41:22): > Thanks @Lori Shepherd. I sent back a response with instructions but that ended up bouncing with: > > The mail system > > <kayla.morrell at roswellpark.org>: host mx2.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) > > <lori.shepherd at roswellpark.org>: host mx2.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) >
> (I manually replaced the @ with "at" above.)
Aaron Lun (01:41:46): > So I’ll just say that the upload has been completed to the scRNAseq/ directory, and the latest HEAD of the scRNAseq Bioconductor Git repository contains the following metadata files in inst/extdata/2.6.0: > > metadata-bacher-tcell.csv > metadata-darmanis-brain.csv > metadata-giladi-hsc.csv > metadata-zhong-prefrontal.csv >
> In summary, four new datasets.
Jonathan Griffiths (04:02:59) (in thread): > I knew I wasn’t going mad, I knew it!
Lori Shepherd (06:39:12) (in thread): > Thanks I’ll talk to IT.
2021-01-13
FelixErnst (03:12:55): > @Lori Shepherd I sent a mail to hubs at Bioconductor.org on the 14th of December and yesterday with the same request for EHub credentials. I haven’t received any reply. Did the mail go through?
Lori Shepherd (06:50:11): > No. Unfortunately they did not. I will pm you to try and straighten this out. Sorry for the inconveniences.
2021-01-22
Annajiat Alim Rasel (15:42:01): > @Annajiat Alim Rasel has joined the channel
Marcel Ramos Pérez (18:02:56): > Hi Lori and Kayla, @Lori Shepherd @Kayla Interdonato I have version 2.0.1 of the data for curatedTCGAData here (devel branch): https://github.com/waldronlab/curatedTCGAData/blob/ver2/inst/extdata/metadata.csv This CSV file includes only versions 1.1.38 and the latest 2.0.1. Can we get rid of the 2.0.0 datasets on AWS? 2.0.1 is the fixed version of the data, which is already uploaded and should replace 2.0.0 completely.
2021-01-27
Aaron Lun (02:28:07): > @Lori Shepherdstill getting bounced from my emails to hubs atbioconductor.org. > > This is the mail system at host delivery.bioconductor.org. > > I'm sorry to have to inform you that your message could not > be delivered to one or more recipients. It's attached below. > > For further assistance, please send mail to postmaster. > > If you do so, please include this problem report. You can > delete your own text from the attached returned message. > > The mail system > > <kayla.morrell@roswellpark.org>: host mx1.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) > > <lori.shepherd@roswellpark.org>: host mx1.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) >
Aaron Lun (02:28:36): > Here’s all the guts: > > Reporting-MTA: dns; delivery.bioconductor.org > X-Postfix-Queue-ID: B31EC800FC > X-Postfix-Sender: rfc822; infinite.monkeys.with.keyboards@gmail.com > Arrival-Date: Wed, 27 Jan 2021 07:26:05 +0000 (UTC) > > Final-Recipient: rfc822; kayla.morrell@roswellpark.org > Original-Recipient: rfc822;kayla.morrell@roswellpark.org > Action: failed > Status: 5.0.0 > Remote-MTA: dns; mx1.roswellpark.iphmx.com > Diagnostic-Code: smtp; 550 #5.7.1 Your access to submit messages to this e-mail > system has been rejected. > > Final-Recipient: rfc822; lori.shepherd@roswellpark.org > Original-Recipient: rfc822;lori.shepherd@roswellpark.org > Action: failed > Status: 5.0.0 > Remote-MTA: dns; mx1.roswellpark.iphmx.com > Diagnostic-Code: smtp; 550 #5.7.1 Your access to submit messages to this e-mail > system has been rejected. >
Aaron Lun (02:29:15): > All I wanted to say was that the latest batch has been pushed to ExperimentHub on S3. Three datasets, manifests in the HEAD of scRNAseq’s master on BioC Git: > > inst/extdata/2.6.0/metadata-bunis-hspc.csv > inst/extdata/2.6.0/metadata-he-organ-atlas.csv > inst/extdata/2.6.0/metadata-zhao-immune-liver.csv >
Lori Shepherd (07:57:39) (in thread): > @Marcel Ramos Pérez Can you confirm the EH ids for the ones that should be removed completely are the 574 records beginning with EH3965 to EH4538? And confirming I can delete the v2.0.0 directory of resources on S3 as well?
Marcel Ramos Pérez (09:08:07) (in thread): > Hi Lori, @Lori Shepherd Yes, please delete the resources and remove the IDs. Confirming EH3965 - EH4538 given by query(eh, "curatedTCGAData/v2.0.0")
. Thanks!
Lori Shepherd (09:21:33): > I’m not seeing these files when I pull from bioc git? @Aaron Lun
Lori Shepherd (09:56:59) (in thread): > should be all set – let me know if there are any issues.
Aaron Lun (11:25:50): > ooops. They’re there now.
2021-01-28
Lori Shepherd (08:15:05): > @Aaron Lun you should be all set with these
2021-01-30
Aaron Lun (18:54:09): > thanks @Lori Shepherd. Upon writing the getters, I realized I made a mistake with the various scRNAseq/he-organ-atlas/2.6.0/coldata-*.rds
files. I’ve reuploaded those to S3. No metadata changes, just need to swap out the files.
2021-01-31
Lori Shepherd (09:00:33) (in thread): > The data has been moved over. If you try to reaccess the resource it should automatically redownload
Aaron Lun (15:50:18) (in thread): > thanks, looks good.
2021-02-08
Davide Risso (04:37:19): > Hi, while updating the TENxPBMCData package, I realized that we had added an additional dataset, pbmc5k-CITEseq, which is in ExperimentHub but the code to access it was never pushed to Bioconductor (i.e., it’s only available in the GitHub version of the package)
Davide Risso (04:38:22): > I was about to push the code to Bioconductor, but I realized it doesn’t work. In particular the colData of the new dataset cannot be downloaded from the hub, with the following error: > > > hub <- ExperimentHub::ExperimentHub() > > hub[["EH3238"]] > see ?TENxPBMCData and browseVignettes('TENxPBMCData') for documentation > loading from cache > Error: failed to load resource > name: EH3238 > title: PBMC, 5k CITE-Seq sample (column) annotation > reason: unknown input format >
Davide Risso (04:38:35): > Any idea how to fix this?
Charlotte Soneson (05:08:37) (in thread): > out of curiosity and in case it may be related to similar recent experiences that I had - what is your sessionInfo?
Davide Risso (05:10:30) (in thread): > > R version 4.0.3 (2020-10-10) > Platform: x86_64-apple-darwin17.0 (64-bit) > Running under: macOS Catalina 10.15.7 > > Matrix products: default > BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib > LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] stats4 parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] TENxPBMCData_1.9.1 HDF5Array_1.18.1 rhdf5_2.34.0 > [4] DelayedArray_0.16.1 Matrix_1.3-2 SingleCellExperiment_1.12.0 > [7] SummarizedExperiment_1.20.0 Biobase_2.50.0 GenomicRanges_1.42.0 > [10] GenomeInfoDb_1.26.2 IRanges_2.24.1 S4Vectors_0.28.1 > [13] MatrixGenerics_1.2.1 matrixStats_0.58.0 ExperimentHub_1.16.0 > [16] AnnotationHub_2.22.0 BiocFileCache_1.14.0 dbplyr_2.1.0 > [19] BiocGenerics_0.36.0 > > loaded via a namespace (and not attached): > [1] Rcpp_1.0.6 lattice_0.20-41 assertthat_0.2.1 > [4] digest_0.6.27 mime_0.9 R6_2.5.0 > [7] RSQLite_2.2.3 httr_1.4.2 pillar_1.4.7 > [10] zlibbioc_1.36.0 rlang_0.4.10 curl_4.3 > [13] blob_1.2.1 RCurl_1.98-1.2 bit_4.0.4 > [16] shiny_1.6.0 compiler_4.0.3 httpuv_1.5.5 > [19] pkgconfig_2.0.3 htmltools_0.5.1.1 tidyselect_1.1.0 > [22] tibble_3.0.6 GenomeInfoDbData_1.2.4 interactiveDisplayBase_1.28.0 > [25] crayon_1.4.0 dplyr_1.0.4 withr_2.4.1 > [28] later_1.1.0.1 bitops_1.0-6 rhdf5filters_1.2.0 > [31] rappdirs_0.3.3 grid_4.0.3 xtable_1.8-4 > [34] lifecycle_0.2.0 DBI_1.1.1 magrittr_2.0.1 > [37] cachem_1.0.3 XVector_0.30.0 promises_1.1.1 > [40] ellipsis_0.3.1 generics_0.1.0 vctrs_0.3.6 > [43] Rhdf5lib_1.12.1 tools_4.0.3 bit64_4.0.5 > [46] glue_1.4.2 purrr_0.3.4 BiocVersion_3.12.0 > [49] fastmap_1.1.0 yaml_2.2.1 AnnotationDbi_1.52.0 > [52] BiocManager_1.30.10 memoise_2.0.0 >
Charlotte Soneson (05:11:16) (in thread): > Ah, ok. Then it’s probably not the same problem (thought you might be on R-devel).
Davide Risso (05:12:13) (in thread): > no, I probably should be, but I figured that if it doesn’t work with Bioc release it probably won’t work in devel either… I’ll try with devel just in case
Charlotte Soneson (05:12:58) (in thread): > I had all kinds of weird issues (with the hubs, and even just creating a simple SummarizedExperiment) in the past couple of days, which were seemingly solved by updating to the latest R-devel (admittedly, mine was pretty old).
Davide Risso (05:13:34) (in thread): > I’m trying with the latest bioc-devel docker image now…
Davide Risso (05:41:06) (in thread): > …and I get the same error
Charlotte Soneson (05:44:55) (in thread): > I just tried your call locally on my shiny new R-devel and I get the same error too, so it seems unlikely to be related to the R version indeed.
Lori Shepherd (08:21:52): > I think the uploaded files might be corrupt – or readRDS changed something – I downloaded the file manually from S3 and tried to manually do readRDS and received the same ERROR > > > readRDS("/home/shepherd/Downloads/pbmc5k-CITEseq_colData.rds") > Error in readRDS("/home/shepherd/Downloads/pbmc5k-CITEseq_colData.rds") : > unknown input format >
Lori Shepherd (08:24:22): > I would suggest re-uploading files that can be read with RDS to AWS –
FelixErnst (08:26:54): > Did you try load?
FelixErnst (08:27:11): > I know the chance is small, but maybe somehow the file ending was mistyped
Lori Shepherd (08:30:34): > regardless – then it would still require re-upload and/or changes to the database hubs but no that also did not work > > > load("pbmc5k-CITEseq_colData.rds") > Error in load("pbmc5k-CITEseq_colData.rds") : > bad restore file magic number (file may be corrupted) -- no data loaded > In addition: Warning message: > file 'pbmc5k-CITEseq_colData.rds' has magic number 'pbmc5' > Use of save versions prior to 2 is deprecated >
Lori Shepherd (08:46:22): > @Davide Risso I don’t know who has/created the original data for that set, as I see the maintainer listed is @Kasper D. Hansen – but it looks like there is an issue with the file itself – let me know if/who to send AWS credentials to if you would like to re-upload the file
Stephanie Hicks (08:52:03) (in thread): > oh are you handling the github issue Aaron brought up? if so, thank you!
Davide Risso (08:57:58): > thanks @Lori Shepherd let me check because this was a contributed dataset following EuroBioc2019… I’ll see if I can get the original file
Davide Risso (08:58:23) (in thread): > yes! and no worries!:slightly_smiling_face:
Davide Risso (09:09:40): > hi @Lori Shepherd I have the original file, can you please send the AWS credentials to me? Thanks!
2021-02-09
Aaron Lun (03:27:03): > Another bounce: > > This is the mail system at host delivery.bioconductor.org. > > I'm sorry to have to inform you that your message could not > be delivered to one or more recipients. It's attached below. > > For further assistance, please send mail to postmaster. > > If you do so, please include this problem report. You can > delete your own text from the attached returned message. > > The mail system > > <kayla.morrell@roswellpark.org>: host mx1.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) > > <lori.shepherd@roswellpark.org>: host mx1.roswellpark.iphmx.com[68.232.137.170] > said: 550 #5.7.1 Your access to submit messages to this e-mail system has > been rejected. (in reply to DATA command) >
Aaron Lun (03:27:11): > All I wanted to say was: looks good on my end. Yes, that was an error, now fixed in the latest master.
Lori Shepherd (06:28:26): > We have a ticket into our IT. Hopefully it will be resolved soon.
2021-02-18
Aaron Lun (23:57:25): > ExperimentHub::ExperimentHub() is taking ages for me. https://bioconductor.org/config.yaml is not responsive.
2021-02-19
Aaron Lun (00:02:32): > Wait, hold on, all of BioC.org is down.
Aaron Lun (00:02:38): > or at least, pretty slow.
Aaron Lun (00:02:52): > ah, good the config.yaml is back.
Peter Hickey (00:03:02): > yeah loaded eventually for me
Peter Hickey (00:03:30): > was > > > system.time(ExperimentHub::ExperimentHub()) > snapshotDate(): 2020-10-27 > user system elapsed > 0.948 0.008 32.787 >
> now > > > system.time(ExperimentHub::ExperimentHub()) > snapshotDate(): 2020-10-27 > user system elapsed > 0.739 0.004 2.551 >
2021-02-22
Aaron Lun (20:14:53): > @Lori Shepherd @Kayla Interdonato did you guys get an email about “Bulk uploads of H5AD files to ExperimentHub?”. Trying to figure out whether this bounced off your email server or not.
2021-02-23
Lori Shepherd (07:23:32): > Yes we received it. sorry for the slow response.
2021-02-25
Aaron Lun (03:48:59): > So… yay or nay?
Martin Morgan (06:20:18): > Let’s have the conversation in the email thread you started; I’ve responded there…
2021-03-05
Michael Love (11:00:05): > @Michael Love has joined the channel
Michael Love (11:14:35): > This is kind of related to our discussion earlier about how to know what types of resources are present in Ahub: I can’t seem to find the ENCODE Blacklist / Exclusion List which appears on UCSC: http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=876251189_lEwmtar4Jcr1MJgh92cL5gL9Anaa&g=wgEncodeDacMapabilityConsensusExcludable and https://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=1049368925_6diSdguwyFJ1E9CRGzZzVwo5m1g1&g=problematic I tried all kinds of pattern with a query(), but it doesn’t seem to be in there. In general, how would I know if/why such a track wouldn’t be in Ahub?
Michael Love (11:15:14): > I think figuring out the line that circumscribes what was already imported will help others in the community to know where to start for contributing new resources to the hub
Laurent Gatto (11:49:31): > @Laurent Gatto has joined the channel
Lori Shepherd (13:37:30): > Unless an outside contributor uploaded it, it probably is not in the hub. The core team limits what we provide by default, and it is generally a very minimal list of (at the time of release) new versions of TxDbs, new versions of OrgDbs, and, when Ensembl comes out with a new release, those resources as converted GFFs to GRanges and TwoBits. I think the sourceurls are searchable, or at least accessible with mcols(), so that would be the other option for seeing if it’s in there as well
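A sketch of the kind of lookup described above, combining a pattern-based query() with a grep over the source URLs exposed by mcols() (column names follow the mcols shown elsewhere in this channel; the search terms are only examples):
> library(AnnotationHub)
> ah <- AnnotationHub()
> # pattern-based search over the hub metadata
> query(ah, c("Homo sapiens", "blacklist"))
> # source URLs are accessible via mcols(), so they can be grepped directly
> urls <- mcols(ah)$sourceurl
> ah[grepl("wgEncodeDacMapabilityConsensusExcludable", urls, ignore.case = TRUE)]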
Michael Love (14:12:59): > Thanks Lori!
2021-03-19
Ludwig Geistlinger (17:28:41): > @Ludwig Geistlinger has joined the channel
2021-03-20
watanabe_st (01:56:49): > @watanabe_st has joined the channel
2021-03-22
Jovana Maksimovic (01:33:01): > @Jovana Maksimovic has joined the channel
2021-03-23
Marcel Ramos Pérez (23:02:56): > Hi Lori @Lori Shepherd and Kayla @Kayla Interdonato, could you please update the EH database to include GTseq data (SingleCellMultiModal)? They are the last four rows of the metadata.csv here: https://github.com/waldronlab/SingleCellMultiModal/blob/gtseq/inst/extdata/metadata.csv Thanks!
2021-03-24
Kayla Interdonato (10:16:34) (in thread): > The data should be added now > > > library(ExperimentHub) > > eh = ExperimentHub() > |======================================================================| 100% > snapshotDate(): 2021-03-24 > > query(eh, c("SingleCellMultiModal", "mouse_embryo_8_cell")) > ExperimentHub with 4 records > # snapshotDate(): 2021-03-24 > # $dataprovider: Wellcome Trust Sanger Institute, Cambridge, United Kingdom > # $species: Mus musculus > # $rdataclass: SingleCellExperiment, RaggedExperiment, DFrame, CompressedCha... > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH5430"]]' > > title > EH5430 | GTseq_colData > EH5431 | GTseq_genomic > EH5432 | GTseq_metadata > EH5433 | GTseq_transcriptomic >
Marcel Ramos Pérez (11:57:11) (in thread): > Thanks Kayla!
2021-04-02
Tim Howes (21:33:17): > @Tim Howes has joined the channel
2021-04-05
Lori Shepherd (09:34:47): > <!channel> We are in the process of making some major updates to the caching in BiocFileCache, AnnotationHub, and ExperimentHub. Namely, the default caching location will change from using rappdirs::user_cache_dir to using tools::R_user_dir, eventually relieving the dependency on rappdirs. To avoid conflicting default caches, if anyone used an old default caching directory there will be an error asking how to deal with the old location before proceeding, and documentation in the vignettes for how to resolve it. Currently I have updated BiocFileCache; the changes were just pushed to the devel branch and should propagate tonight. I plan on doing the same for both AnnotationHub and ExperimentHub later today and tomorrow. We appreciate any feedback or questions with regards to these updates. This is only relevant when using the default cache location; if a user manually specified a unique location or created a package-specific cache, the code/location is not affected. Anyone using package-specific caching that utilizes rappdirs is also encouraged to consider changing package code to use the now-available function in tools. Cheers –
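For reference, the two default locations being swapped can be compared directly; a minimal sketch (the package name is only an example):
> # old default, via the rappdirs dependency being retired
> rappdirs::user_cache_dir("AnnotationHub")
> # new default going forward
> tools::R_user_dir("AnnotationHub", which = "cache")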
Michael Love (10:20:03): > Thanks for the update Lori
Michael Love (10:23:09): > ~minor thing: could you push the changes to GitHub for ease of viewing~
Michael Love (10:23:24): > i’ll switch up tximeta as well
Lori Shepherd (10:25:22): > The changes are on the master branch of BiocFileCache — I’m still working on AnnotationHub and ExperimentHub but will let you know when they are available
Michael Love (10:25:31): > ok
Michael Love (10:25:54): > oh oops, i was looking at wrong pkg:flushed:
Michael Love (13:38:27): > thanks i’ve moved tximeta over to tools::R_user_dir and put a note in vignette pointing to BiocFileCache vignette (as we are a heavy user of BiocFileCache)
Lori Shepherd (13:50:18): > If anything doesn’t seem clear or anyone wants to improve the documentation gladly welcome input too
Michael Love (15:19:13): > really minor thing: https://github.com/Bioconductor/BiocFileCache/blob/master/vignettes/BiocFileCache.Rmd#L950 I would suggest “Users who have utilized the default BiocFileCache location, to continue using…” - Attachment: vignettes/BiocFileCache.Rmd:950 > > BiocFileCache, to continue using the created cache, must move the cache and its >
Lori Shepherd (15:22:41): > AnnotationHub updates have been pushed and are also viewable on github master branch
Lori Shepherd (15:32:11): > ExperimentHub now has also been updated in devel
2021-04-07
Jonathan Griffiths (07:35:48): > Is this going to break all builds of packages that use the Hubs? (since the fixes a user needs to perform are largely interactive) > > Or perhaps are the Bioc build machines going to have their caches flushed?
Lori Shepherd (07:37:30): > the builders will have their caches flushed, which is good practice occasionally to make sure all resources are still available (at least on the builders, maybe not on a user's system) – there was some breakage last night that will be investigated and fixed
Lori Shepherd (07:37:44): > I hope to have it smoothed over in the next few days
Jonathan Griffiths (07:37:53): > That’s super, thanks
2021-04-15
Aaron Lun (19:18:40): > continuing with the discussion from #developers-forum
Aaron Lun (19:19:19): > why not make hardlinks to all files from the old cache location to the new cache location, and if all hard links are successfully created, delete the old cache?
Aaron Lun (19:19:47): > This is effectively a copy-free move operation, but one that is read-only until the final deletion (and hence atomic if any of the individual linking steps fail)
2021-04-16
Hervé Pagès (00:39:11): > Can you make hard links on Windows? Also hard links don’t work across partitions, only within the same file system.
Aaron Lun (01:45:29): > don’t know about windows, but we can fall back to actual copies if the hard links fail.
Aaron Lun (01:46:06): > I do that if (!file.link() && !file.copy())
pattern a lot
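A sketch of that link-or-copy fallback applied to a whole cache migration, roughly in the spirit of the proposal above (the function name and paths are hypothetical; base R only):
> # Try a hard link first (cheap, but same-filesystem only); fall back to a real copy.
> # The old files are only removed once every file has been linked or copied.
> migrate_cache <- function(old_dir, new_dir) {
>     dir.create(new_dir, recursive = TRUE, showWarnings = FALSE)
>     files <- list.files(old_dir, full.names = TRUE)
>     ok <- vapply(files, function(f) {
>         dest <- file.path(new_dir, basename(f))
>         file.link(f, dest) || file.copy(f, dest)
>     }, logical(1))
>     if (all(ok)) unlink(files)   # delete originals only after a complete migration
>     all(ok)
> }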
Jonathan Griffiths (02:39:09): > My package is having some trouble building due to something funky about the cache files (http://bioconductor.org/checkResults/devel/data-experiment-LATEST/MouseGastrulationData/). This is also affecting many other Hub packages (http://bioconductor.org/checkResults/devel/data-experiment-LATEST/index.html). > > Is it possible to flush the cache (and possibly fire off the builds again)? I have been messaged by one developer whose packages use some of my data and I’m having a hard time propagating a fix for them into the devel builds. Thanks!
Hervé Pagès (03:11:39) (in thread): > Just tried this on riesling1 and file.link()
seems to be working there. Anyways, if the goal is to do a copy-free move operation of the content of the old cache to the new cache, falling back to actual copies is not really an option.
Aaron Lun (05:17:44) (in thread): > the ultimate goal is an atomic move of the cache directory. The hard links are just the most efficient way of achieving that, otherwise we’d have to do a copy and delete.
Lori Shepherd (07:38:37): > @Jonathan GriffithsI’ll look into fixing this today
Lori Shepherd (07:56:25) (in thread): > Would the hardlink work if you don’t have permission to move the file?
Jonathan Griffiths (08:35:14) (in thread): > thank you!
Lori Shepherd (08:59:29) (in thread): > I flushed the cache so it should be resolved on the next build.
Aaron Lun (11:46:55) (in thread): > i would imagine the hardlink would work but the deletion would fail. That’s still okay, you can start using the new cache and just emit a warning about incomplete deletion of the old cache.
Hervé Pagès (12:13:50) (in thread): > If you’re falling back to copies instead of hard links there could be some complications in case of limited disk space that could compromise the atomicity of the whole operation. There is a chance that you get stuck in the middle of the process with the risk that you won’t be able to restore the original state. In other words it seems hard to implement this in a robust way and to make sure that it works as expected in all possible scenarios. It might work just fine in 98% of the case but you might leave the user’s system in a messy state 2% of the time.
Aaron Lun (12:16:09) (in thread): > perhaps, but that was always going to happen if you decide to change the cache location. Therealsolution is to just leave the cache where it is. But given that the cache is changing, the code should be responsible for moving it, given that BiocFileCache is meant to manage my caches for me.
2021-04-17
Aaron Lun (19:33:26): > There are massive hub problems across the board for the OSCA books
Lori Shepherd (19:58:27): > @Aaron LunThat is related to@Jonathan Griffithspost yesterday. I reset the cache yesterday afternoon and it should clear up on the next build.
Aaron Lun (19:58:40): > ok, great.
2021-04-19
Dario Righelli (04:32:42): > @Dario Righelli has joined the channel
Dario Righelli (04:35:48): > Hi guys, I don’t know if this is the right place for this issue, but I’m trying to validate the metadata.csv file for an ExperimentHub package and I’m getting this error: > > makeAnnotationHubMetadata(pathToPackage="AllenBrainData",fileName="metadata.csv") > Error in h(simpleError(msg, call)) : > error in evaluating the argument 'x' in selecting a method for function 'strsplit': error in evaluating the argument 'x' in selecting a method for function 'gsub': subscript out of bounds > > My wd is one level up from the AllenBrainData package directory. > Do you know if it’s related to something in my package or to something else? > Thanks
Kayla Interdonato (08:08:31): > I believe this has something to do with not having biocViews defined in your DESCRIPTION file. I would try adding at least 2 biocViews and then try running the function again. Also - you mention it’s an ExperimentHub package but you show code using the makeAnnotationHubMetadata function. Just be sure you are using the correct function for your package type.
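A sketch of what the two fixes could look like for an ExperimentHub data package (the biocViews terms and the package/file names are illustrative, not taken from the package in question):
> # DESCRIPTION: at least two biocViews terms, e.g.
> #   biocViews: ExperimentHub, ExperimentData
> # then validate with the ExperimentHub variant of the metadata function:
> ExperimentHubData::makeExperimentHubMetadata("AllenBrainData", fileName = "metadata.csv")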
Lori Shepherd (08:17:48): > admittedly this error is unhelpful – if that ends up being the issue we should rework that section of code to have a better ERROR message
Dario Righelli (09:05:22): > Thanks @Kayla Interdonato, indeed I forgot the biocViews section in the DESCRIPTION. > Now it is producing better error information, so maybe, as @Lori Shepherd suggested, it would be better to provide an improved ERROR message.
2021-04-22
Satoshi Kume (12:18:55): > @Satoshi Kume has joined the channel
Satoshi Kume (12:56:06): > @Kayla Interdonato, Hi, Kayla, thank you for your e-mail a while ago. I just modified and uploaded the files in GitHub.
Satoshi Kume (12:59:32): > Sorry, I made many mistakes. This is the first time for me to apply to the BioC. Bit difficult…
Kayla Interdonato (13:18:52): > No problem at all! That’s what we are here for, making sure everything looks correct and help in any way we can. I’ll have the data uploaded to ExperimentHub shortly.
koki (20:11:37): > @koki has joined the channel
2021-04-28
Mikhail Dozmorov (21:43:20): > @Mikhail Dozmorov has joined the channel
2021-05-04
Leonardo Collado Torres (14:08:17): > Hi Lori @Lori Shepherd, > > I haven’t uploaded to the Hubs some data in a few packages that I should have done so by now: spatialLIBD, regutools and GenomicState. For example, the files related to https://github.com/LieberInstitute/spatialLIBD/blob/master/inst/extdata/metadata_spatialLIBD.csv. Is it ok to still use those CSVs or should I update the BiocVersion and maybe other fields? AnnotationHubData::makeAnnotationHubMetadata(here::here(), fileName = "metadata_spatialLIBD.csv") does work in R 4.1 with BioC 3.13 for example. https://github.com/LieberInstitute/spatialLIBD/blob/master/inst/scripts/make-metadata_spatialLIBD.R#L48 Another one is https://github.com/ComunidadBioInfo/regutools/blob/master/inst/extdata/metadata_regutools.csv that also has BiocVersion 3.10 (cc @Joselyn Chávez @Carmina Barberena Jonas). That was the BioC version when these packages were accepted, so I think that it might make sense to keep it that way, though strictly speaking, the data has been served through Dropbox since then and not the Hubs. > > For GenomicState I would like to submit new data. That is, all the data at https://github.com/LieberInstitute/GenomicState/blob/master/inst/extdata/metadata_gencode_human.csv is already on the *Hubs. My understanding is that I would need to make a new CSV file, make sure it works with AnnotationHubData::makeAnnotationHubMetadata(), then upload the files. From https://bioconductor.org/packages/devel/bioc/vignettes/AnnotationHub/inst/doc/CreateAHubPackage.html#uploading-data-to-s3 should I upload the files to, say, GenomicState/bioc3.13/? > > I’ll also send an email to hubs@bioconductor.org requesting access to upload the data. > > > After uploading the data, given the relatively new change in BiocCheck https://github.com/Bioconductor/BiocCheck/commit/2562114bf18c11a259dfb5a40f2b73821093db4a by Marcel @Marcel Ramos Pérez should I then delete the “backup” code we have in spatialLIBD and regutools for using files from Dropbox? That is https://github.com/LieberInstitute/spatialLIBD/blob/master/R/fetch_data.R#L140-L157 and https://github.com/ComunidadBioInfo/regutools/blob/master/R/connect_database.R#L54-L61. Or is it ok to leave that code there given that BBS doesn’t run BiocCheck? > > Thanks!
Joselyn Chávez (14:08:23): > @Joselyn Chávez has joined the channel
Carmina Barberena Jonas (14:08:23): > @Carmina Barberena Jonas has joined the channel
2021-05-05
Lori Shepherd (08:29:02): > spatialLIBD section: If you have updated versions, please remember to either name the files with a version or place the files in a new subdirectory so the old files remain accessible for legacy and reproducibility (we are working on automatic versioning in the hubs when using our S3 buckets and hope to have a rollout shortly after the next release) – this would require a new line in the file, at least updating the RDataPath and also the BiocVersion (and any other relevant information, like source information, that may have changed). You may choose to add new lines in the existing file or create a new file specific to the different versions. > regutools: so you are updating from using Dropbox to using the hubs since it is now a requirement to not host directly on Dropbox or GitHub? I think you would still want to have the current BiocVersion since the code changes would only be present in the new version of the package that would not be available in previous versions of Bioconductor… > metadata.csv is an example name, but as long as we know the name of the csv file and it is formatted correctly, it can be named anything. The BiocVersion should be updated to the BiocVersion that the new data would be released under. It takes the same stamp as packages to determine when a specific version of data was released. > GenomicState: yes, that all sounds fine. How you store the data in subdirectories is up to you. As long as the RDataPath matches what you upload to S3 then it will map. You can use the bioc version, a source version, whatever you deem appropriate. > Yes, we no longer encourage hosting on private locations like Dropbox and GitHub as the data is easily removed/destroyed/deleted, and we encourage hosting on more publicly available trusted servers. You may keep the script in the function for reference and legacy, but please default to downloading files from the hubs.
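A sketch of what the versioned layout might look like in the metadata, following the advice above (the path, filename, and versions are hypothetical illustrations, not the actual spatialLIBD entries):
> RDataPath                   BiocVersion
> spatialLIBD/mydata.rda      3.10    (existing line, left in place for reproducibility)
> spatialLIBD/v2/mydata.rda   3.13    (new line; new subdirectory uploaded to S3 so the old files stay accessible)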
2021-05-11
Jonathan Griffiths (16:06:57): > I’ve been a bit ambushed by a breaking change in my package (MouseGastrulationData) due to a behaviour-breaking change upstream in a core dependency. I think I have produced a change to fix it but I’m having a bit of trouble getting R compiled on a machine with enough memory to test it for sure.@Lori Shepherd, do you know when the next builds are due for the Hub packages? I’m keen to make sure it’s working before the deadline. Is there anything I can do to get it tested “live” out-of-cycle (if necessary)? Or any other steps I should take? Thanks!
Lori Shepherd (16:09:01): > It is an ExperimentHub package, right?
Lori Shepherd (16:10:00): > Experiment Data packages are run on Tue, Thur, and Sat
Megha Lal (16:44:09): > @Megha Lal has joined the channel
2021-05-12
Jonathan Griffiths (03:23:57) (in thread): > Thanks (sorry for the late reply, I went to bed!)
Jonathan Griffiths (11:10:29) (in thread): > Eventually I was able to get a machine to test it on - while it looks fine now, is there any recourse I have RE unexpected issues on this final pre-deadline build?
Devon Kohler (18:10:49): > @Devon Kohler has joined the channel
2021-05-14
Michael Love (04:37:42): > question about nebbiolo1 and devel branch package checks. Some of my packages (tximeta, fishpond) cannot build on Linux because > > Corrupt Cache: index file > See AnnotationHub's TroubleshootingTheCache vignette section on corrupt cache > cache: /home/biocbuild/.cache/R/AnnotationHub >
> should I just sit tight, or investigate? Note that the other builders don’t have this error
Aaron Lun (04:42:35): > seems like a system-wide build error, mentioned in #bioc-builds
Michael Love (05:12:28): > got it, thanks
Lori Shepherd (07:55:48) (in thread): > Sit tight. When we updated R on the builders earlier this week we flushed the cache and somehow two index files got created. It should already be fixed and not appear on today’s report.
2021-05-15
Pedro Baldoni (04:51:49): > @Pedro Baldoni has joined the channel
2021-05-17
Mahmoud Ahmed (00:43:03): > @Mahmoud Ahmed has joined the channel
Federico Marini (10:25:59): > @Federico Marini has joined the channel
2021-05-24
Daniela Cassol (17:19:42): > @Daniela Cassol has joined the channel
2021-05-27
Satoshi Kume (05:57:49): > Hi, @Kayla Interdonato. I noticed that the file format of the data in the BioImageDbs package (ver 1.0.0) was wrong. I changed them from Rda to rds, and then uploaded the new data to AWS (BioImageDbs/v01). Also, I fixed the metadata (https://github.com/kumeS/BioImageDbs/blob/main/inst/extdata/metadata_v02.csv) and GitHub (git@git.bioconductor.org:packages/BioImageDbs.git). Please check them.
2021-05-28
Lori Shepherd (10:47:09): > Please update the DispatchClass from rds to Rds; caps matter in R. I went ahead and made these changes to a copy I needed in order to add the data to the database, but please also push them up to the copy that is on git.bioconductor.org. The new versions of the files are now available. > > ExperimentHub with 23 records > # snapshotDate(): 2021-05-18 > # $dataprovider: CELL TRACKING CHALLENGE ([http://celltrackingchallenge.net/2](http://celltrackingchallenge.net/2)... > # $species: Homo sapiens, Mus musculus, Drosophila melanogaster > # $rdataclass: List, magick-image > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH6095"]]' > > title > EH6095 | EM_id0001_Brain_CA1_hippocampus_region_5dTensor.rds > EH6096 | EM_id0001_Brain_CA1_hippocampus_region_5dTensor_train_dataset.gif > EH6097 | EM_id0002_Drosophila_brain_region_5dTensor.rds > EH6098 | EM_id0002_Drosophila_brain_region_5dTensor_train_dataset.gif > EH6099 | LM_id0001_DIC_C2DH_HeLa_4dTensor.rds > ... ... > EH6113 | LM_id0003_Fluo_N2DH_GOWT1_5dTensor.rds > EH6114 | EM_id0003_J558L_4dTensor.rds > EH6115 | EM_id0003_J558L_4dTensor_train_dataset.gif > EH6116 | EM_id0004_PrHudata_4dTensor.rds > EH6117 | EM_id0004_PrHudata_4dTensor_train_dataset.gif >
Satoshi Kume (14:28:57): > Hi, Lori. Thank you very much for checking the files. I updated the DispatchClass from rds to Rds and then pushed it up to RELEASE_3_13 on git.bioconductor.org. I also checked that the data imports from EH successfully. Thank you a lot. Satoshi.
2021-06-01
Leonardo Collado Torres (13:06:29) (in thread): > Thanks a lot for the detailed reply Lori! I’ll get back to this later this month (I’m currently traveling). Thanks again!!
2021-06-18
Kozo Nishida (16:35:14): > Hi all, > Is the AnnotationHub web service publicly available somewhere? Here’s how to see it locally, but I would like to provide a web service that is accessible to everyone (without installing AnnotationHub).
Lori Shepherd (16:52:32): > Not sure I completely understand what you’re looking for, but there is the web API: https://annotationhub.bioconductor.org/
Kozo Nishida (16:56:27): > Thank you for the information. I didn’t know this web API. > But this is not what I am looking for. > What I’m looking for is a URL that makes this image publicly available instead of 127.0.0.1. - File (PNG): image.png
Kozo Nishida (17:01:38): > We can see this by installing AnnotationHub locally and runningd <-display (ah)
, but people may find it hassle.
Kozo Nishida (17:05:24): > I would like to be able to explore the data in AnnotationHub with just a web browser without installing it.
Lori Shepherd (17:09:17): > If you explore different endpoints of the web API I shared above, some of this information is available. We’ve had it as a to-do to improve the API and web display.
Kozo Nishida (17:19:17): > Thank you for your reply. I think I understand what you said…
Kozo Nishida (17:23:34): > By the way, (about the API) > Is it possible to know which software (or workflow) package the Hub package is used in?
Kozo Nishida (17:29:26): > To know it, I think we need to add the “Software packages that use the Hub package” column to the Hub package metadata. > Let me know if you have any comments on this.
Lori Shepherd (19:59:57): > Not through the website, but that would be a nice addition since we store this information in the database too. I could add another endpoint that lists packages and shows resources by package…. You could do this easily from a query within R, but I’m assuming you meant not in R / without installing again
Kozo Nishida (20:32:28) (in thread): > Sorry, now I found (didn’t know) the following note. > > NOTE: The metadata file can have additional columns beyond the 'Required Fields' listed above. These values are not added to the Hub database but they can be used in package functions to provide an additional level of metadata on the resources. >
> (in the help of ?AnnotationHubData::makeAnnotationHubMetadata) > I think I understand what you are saying.
Kozo Nishida (20:36:22) (in thread): > In the future, I will add the metadata (proposed earlier) in my (or others) hub package. Thank you.
2021-06-19
Kasper D. Hansen (03:01:46): > Having a webserver running which serves up the content in AnnotationHub / ExperimentHub at the browsing level, as suggested by @Kozo Nishida, seems a very useful service. Of course it means running a webserver or deploying it on an existing webserver, and I have no idea of the work or resources involved. But I can see the advantage.
Kasper D. Hansen (03:02:03): > Is the webpage a stand-alone HTML page or a shiny thing?
Kozo Nishida (03:07:48): > The webpage is a shiny thing. AnnotationHub::display(AnnotationHub())
will launch the shiny service.
Kozo Nishida (03:09:36): > Actually AnnotationHub::display calls interactiveDisplayBase::display.
Kozo Nishida (03:11:13): > So I asked the following on #random. This is not a topic limited to #biochubs so I asked on the #random channel. - Attachment: Attachment > Does anyone know how to deploy a package using interactiveDisplayBase to shinyapps.io? > I want to deploy a webserver to shinyapps.io that is started by the display
function of AnnotationHub.
Kozo Nishida (03:19:09): > I don’t think AnnotationHub::display is a pure(?) Shiny app, so I have no idea how to deploy it.
Kozo Nishida (03:20:55): > By the way, how to deploy a shiny app to shinyapps.io is described at https://shiny.rstudio.com/articles/shinyapps.html
Kozo Nishida (03:50:25): > Of course, if you have a server with a named global IP, all you have to do is just install AnnotationHub on that machine and run AnnotationHub::display(AnnotationHub()), but I don’t have such a server so I tried shinyapps.io
Lori Shepherd (12:53:44) (in thread): > The webpage and the shiny display are two different things
Lori Shepherd (12:55:38) (in thread): > I wrote the response on the wrong comment…. The webpage/API and the shiny app are two different things, both available right now
Lori Shepherd (13:00:58) (in thread): > Again… It needs a revamp and additional fields, but this is available at the link I provided, and a similar one for ExperimentHub… You can use the API or you can click on the ones that are links to browse down too…
2021-06-20
Vince Carey (08:35:20): > It seems to me that interactiveDisplayBase does something useful but it is quite underdocumented. We can use shinyapps.io to present this information, but the display method would have to be modified to initiate a download when the send button is used. If there is not a working group on resource exposure we should form one. We had a little group thinking about “hub 2.0” concepts and https://vjcitn.github.io/bedbaseRClient/articles/bedbaseRClient.html is an early exploration of bedbase.org. The emphasis there is on facilitating targeted retrieval; metadata and discovery support deserve more attention. - Attachment (vjcitn.github.io): bedbaseRClient: illustrative operations on bedbase.org > bedbaseRClient
Michael Love (10:11:45) (in thread): > Is targeted retrieval meaning targeted by range?
2021-06-21
Vince Carey (08:55:44) (in thread): > yes
2021-06-23
Kozo Nishida (03:03:59) (in thread): > Excuse me, @Lori Shepherd, I may not understand your intentions. > Is it possible to see the equivalent of https://mramos.shinyapps.io/AnnotationHubShiny/ (made by @Marcel Ramos Pérez) from your https://annotationhub.bioconductor.org/?
Kasper D. Hansen (07:09:30) (in thread): > When I say webpage, I mean that when I made www.epigeneticmachinery.org I just generated a single HTML page that I can upload to wherever; i.e. I need no R backend to host a shiny thing.
Kozo Nishida (07:16:13) (in thread): > Thank you for the information, I did not know we can filter the table rows (like http://www.epigeneticmachinery.org/) without Shiny.
Kozo Nishida (07:20:35) (in thread): > What I want to see is the full AnnotationHub dataset version of http://www.epigeneticmachinery.org/. If Shiny isn’t necessary to achieve that, I won’t stick to Shiny.
Marcel Ramos Pérez (11:11:00) (in thread): > It looks like http://www.epigeneticmachinery.org/ is using https://datatables.net/ DataTables (which translates to the DT package in R). It is strictly not necessary to use shiny / R to host a page like this, but it is easier to adapt existing shiny code and host on shinyapps.io as in the case of AnnotationHubShiny.
2021-06-24
Kasper D. Hansen (11:51:46) (in thread): > Well, I don’t think that’s easier. It is certainly not cheaper
Marcel Ramos Pérez (20:51:48) (in thread): > We’ve put it here for now: https://shiny.sph.cuny.edu/AnnotationHubShiny/
2021-06-25
Vince Carey (05:59:37) (in thread): > This is experimental and needs some more helper text in the app. But note that if you click on a single row of the table, you have the option to download the selected resource. This could be tedious for large resources because they will be cached at the server and then downloaded as RDS.
Vince Carey (06:13:33) (in thread): > If you want just a metadata table for your selections, there is a download option for that. We should probably download it as csv. Finally, some (many?) of the resources are not RData … for example there are 10247 bigwig files. https://annotationhub.bioconductor.org/sourcetype/BigWig provides URLs for them. Currently the shiny server accesses them via AnnotationHub and they are processed to GRanges. But they could be delivered in their raw .bw format too, to make the service more language-agnostic.
2021-07-12
Charlotte Soneson (10:18:23): > Are there ‘best practices’ for giving users a heads-up on the amount of data that would be downloaded by a call to a function in a hub package (brought up by Vince here)? I noticed that the ExperimentHubMetadata class has a sourcesize argument (and then we could perhaps let the user choose whether they actually want to download the files or just know how much would be downloaded by aggregating this information across all the files included in the requested object). However, all the hub entries I looked at seem to have a value of NA here, and when trying to access it I get > > > eh$sourcesize > Error in .resource_column(x, name) : > 'sourcesize' is not a resource data column >
> so I’m not sure this is actually being used. Would this be the way to go, or is there a better way? Thanks!
Lori Shepherd (10:21:20): > You could recommend checking out the information on the resource using the function getInfoOnIds, which accesses the file_size: > > > getInfoOnIds(eh, "EH1") > ah_id fetch_id > 1 EH1 1 > title > 1 RNA-Sequencing and clinical data for 7706 tumor samples from The Cancer Genome Atlas > rdataclass status biocversion rdatadateadded rdatadateremoved file_size > 1 ExpressionSet Public 3.4 2016-02-23 <NA> 349853641 >
Charlotte Soneson (10:25:30): > Oh great, thanks!
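A minimal sketch (my own illustration, not prescribed in the thread) of aggregating file_size over a query's hits to estimate the total download before fetching anything; it assumes getInfoOnIds() accepts a vector of hub IDs.

```r
library(ExperimentHub)

eh <- ExperimentHub()
hits <- query(eh, c("SingleCellMultiModal", "pbmc_10x"))   # resources of interest
info <- getInfoOnIds(eh, names(hits))                      # one row per hub ID, includes file_size
total <- sum(as.numeric(info$file_size), na.rm = TRUE)
message(sprintf("~%.1f MB would be downloaded", total / 1e6))
```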
Dania Machlab (10:35:42): > @Dania Machlab has joined the channel
2021-07-13
Dario Righelli (06:17:45): > Hi, I’m working with external file links for an ExperimentHub package and I have some files in tar format. This seems not to be a valid SourceType. Is there any way to handle this? Thanks
Marcel Ramos Pérez (13:31:16): > Hi Lori @Lori Shepherd and Kayla @Kayla Interdonato, I’d like to add a couple of resources to ExperimentHub. Here is my metadata file: https://github.com/waldronlab/SingleCellMultiModal/blob/tenx/inst/extdata/metadata.csv Thank you!
2021-07-16
Kayla Interdonato (12:31:43) (in thread): > Hey@Marcel Ramos Pérez- Sorry for the delay on this. I’ve got some time now to work on adding the data but I’ve just got some questions before doing so. I’m assuming the last two entries of the metadata file are the new resources to be added. One seems to be new, the other points to a resource already in the Hubs. Are you looking for another EH ID to be associated with this resource? I’m not sure if you’ve already talked this over with Lori but I just wanted to clarify before doing so.
Kayla Interdonato (12:33:47) (in thread): > Sorry for the late reply. There does seem to be tar.gz as a valid SourceType; does this not work for your files?
Marcel Ramos Pérez (12:34:38) (in thread): > Hi Kayla, thanks for taking care of this. That’s right, the last two entries are the new ones. Yes, we’d have to get another EH ID for this resource even though it’s already there. Thanks!
Kayla Interdonato (13:30:41) (in thread): > All set@Marcel Ramos Pérez- EH6688 and EH6689 are the newly added resources. > > > query(eh, c("SingleCellMultiModal", "pbmc_10x")) > ExperimentHub with 10 records > # snapshotDate(): 2021-07-16 > # $dataprovider: European Bioinformatics Institute (EMBL-EBI), United Kingdo... > # $species: Homo sapiens > # $rdataclass: SingleCellExperiment, dgCMatrix, HDF5Matrix, DFrame, TENxMatrix > # additional mcols(): taxonomyid, genome, description, > # coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags, > # rdatapath, sourceurl, sourcetype > # retrieve records with, e.g., 'object[["EH4623"]]' > > title > EH4623 | pbmc_atac_assays > EH4624 | pbmc_atac_se > EH4625 | pbmc_atac > EH4626 | pbmc_colData > EH4627 | pbmc_rna_assays > EH4628 | pbmc_rna_se > EH4629 | pbmc_rna > EH4630 | pbmc_sampleMap > EH6688 | pbmc_rna_tenx > EH6689 | pbmc_rna_se >
2021-07-17
Dario Righelli (05:21:17) (in thread): > Thanks, yes, for me it’s not a problem. I was just wondering if putting tar.gz as the SourceType could be a problem if in the end the file doesn’t have the .gz part.
2021-07-30
Lori Shepherd (14:42:06): > We are seeing a BiocFileCache ERROR for the default caching location on the build machines. This seems recent, in the last week or two. I’m trying to find the offending package that creates the now defunct location. Please ignore this error for now and know it is being investigated and should be resolved soon
2021-08-05
Wes W (11:49:23): > @Wes W has joined the channel
Robert M Flight (17:44:15): > @Robert M Flight has joined the channel
Michael Love (17:49:56): > Notes from the BoF: > provide a taxonomy for AHub; e.g. a biological taxonomy such as genes, functional elements, genetic variants might be useful > RC says: http://www.sequenceontology.org
PeterJ (17:52:36): > @PeterJ has joined the channel
Michael Love (17:52:41): > :wave:
Michael Love (17:55:11): > I would like to keep track of what people think is missing from AnnotationHub. > > This is something I can try to populate a bit at a later date, but I’m going to just throw up a GDoc for now: https://docs.google.com/document/d/1aZGUZo_6FZgWHWir4a78vAuOGvTPYo3OGaiHDx0pouw/edit?usp=sharing
Manojkumar Selvaraju (17:58:27): > @Manojkumar Selvaraju has joined the channel
Fabricio Almeida-Silva (18:01:52): > @Fabricio Almeida-Silva has joined the channel
2021-09-16
Henry Miller (18:36:04): > @Henry Miller has joined the channel
2021-09-23
Satoshi Kume (00:24:36): > Hi channel. I would like to upload the data for the BioImageDbs package to AWS S3 for version 3.14, but I had a problem uploading the data: my AWS ID and access key did not work and produced an error. Please see my e-mail to Bioconductor for the details.
Lori Shepherd (07:44:22): > Yes, I received your previous email. Please be patient. Generally the core team does not work on weekends (when your request came in) unless it’s an emergency, and earlier in the week the core team was away at a 2-day intensive. I plan to look at the pending hub requests later today and tomorrow.
Satoshi Kume (12:16:15) (in thread): > Thank you for your reply. I understand the situation. I am looking forward to your e-mail.
2021-09-25
Haichao Wang (07:20:26): > @Haichao Wang has joined the channel
2021-10-06
margherita mutarelli (01:36:29): > @margherita mutarelli has joined the channel
2021-11-03
Stephanie Hicks (22:34:59): > @Stephanie Hicks has left the channel
2021-11-11
Shilpa Garg (09:28:05): > @Shilpa Garg has joined the channel
2021-11-16
Chris Vanderaa (06:08:51): > @Chris Vanderaa has joined the channel
Chris Vanderaa (06:20:31): > Hello, I’m trying to debug my scpdata data package. I’m trying to understand why my datasets cannot be loaded (error) when using R 4.2.0/Bioc 3.15 but everything works fine on R 4.1.1/Bioc 3.14. Is there a way to retrieve the Rda (or other) files that were uploaded to ExperimentHub so that I can try to manually load the Rda files myself?
Lori Shepherd (07:21:09) (in thread): > getInfoOnIds can be helpful. The underlying code base of the hubs uses the hub API fetch call to download resources, so something like > > > getInfoOnIds(eh, "EH3899") > ah_id fetch_id title rdataclass status biocversion rdatadateadded > 3758 EH3899 3942 specht2019v2 QFeatures Public 3.13 2020-11-05 > rdatadateremoved file_size > 3758 <NA> 78244884 >
> will give you the fetch id to use in the call to download the resource. The redirection location of the resource can be seen by using, for instance, curlGetHeaders on the above > > > curlGetHeaders("https://experimenthub.bioconductor.org/fetch/3942") > [1] "HTTP/1.1 302 Found\r\n" > [2] "Date: Tue, 16 Nov 2021 12:19:42 GMT\r\n" > [3] "Server: Apache/2.4.18 (Ubuntu)\r\n" > [4] "X-XSS-Protection: 1; mode=block\r\n" > [5] "X-Content-Type-Options: nosniff\r\n" > [6] "X-Frame-Options: SAMEORIGIN\r\n" > [7] "X-Powered-By: Phusion Passenger 4.0.46\r\n" > [8] "Location: http://s3.amazonaws.com/experimenthub/scpdata/specht2019v2.Rda\r\n" > [9] "Status: 302 Found\r\n" > [10] "Content-Type: text/html;charset=utf-8\r\n" > [11] "\r\n" > [12] "HTTP/1.1 200 OK\r\n" > [13] "x-amz-id-2: WHzoGnz8nvhrtP/25rJUEcM3gzXLtWjxr2xk+QjZo5kKRjz48x44eNxybPltzxn19I7uPe0jaD4=\r\n" > [14] "x-amz-request-id: A20NJC4J6KSC51ZG\r\n" > [15] "Date: Tue, 16 Nov 2021 12:19:43 GMT\r\n" > [16] "Last-Modified: Tue, 13 Oct 2020 14:49:53 GMT\r\n" > [17] "ETag: \"9a0bd1329055eab14bf2bd04a2408892-5\"\r\n" > [18] "Accept-Ranges: bytes\r\n" > [19] "Content-Type: binary/octet-stream\r\n" > [20] "Server: AmazonS3\r\n" > [21] "Content-Length: 78244884\r\n" > [22] "\r\n" > attr(,"status") > [1] 200 >
Chris Vanderaa (07:22:04) (in thread): > Excellent thanks a lot for the tip:thumbsup:
Hervé Pagès (11:50:36) (in thread): > Hi Chris, > Make sure to use the latest AnnotationHub (v 3.3.6). Starting with BioC 3.15, serialized S4 hub resources are now passed through updateObject() at load-time. When I was testing this new feature, I ran across the 13 QFeatures instances that are in ExperimentHub because they were causing problems: updateObject() chokes on them with the error “Assay links names are wrong”. So I hardcoded an exception for these objects: https://github.com/Bioconductor/AnnotationHub/commit/add08877fc393b6270d0e33be960f6a8c9db5859 This hack is only there to get things going and should be temporary. Ideally those objects would need to be fixed. Thanks!
2021-11-18
Chris Vanderaa (02:46:40) (in thread): > Hi Hervé, thank you for identifying the problem! That saved me so much time!! Indeed, updateObject() is where the issue lies. I realize that the method calls the constructor from MultiAssayExperiment (the parent class) instead of QFeatures, probably because QFeatures has no updateObject() method. My question: do all packages that define a new class need to define an updateObject() method?
Hervé Pagès (04:26:05) (in thread): > No simple answer to that. It depends. > 1. First of all, if the new class is for objects that are not meant to be serialized, then updateObject() is not needed. For example nobody should ever serialize a TxDb object. > 2. Another situation where maybe there’s no need to define a specific updateObject() method for a new class B is if the new class extends a class A that already has an updateObject() method, and adds slots that are meant to contain only ordinary vectors or other base R objects. Then calling updateObject() on objects of class B will do the right thing, i.e. it will update the parts of the object that are under the control of A and will ignore the new slots. All this is fine until the author of class B decides to make changes to the class definition, e.g. to add more slots. When this happens, they’ll need to implement an updateObject() method for their objects (this method will need to call callNextMethod(), and that call should happen at the very end of the method body). > 3. If, OTOH, B extends A by adding slots that are meant to contain other S4 objects, then it’s recommended to define an updateObject() method for B. This method would simply update the new slots containing S4 objects by calling updateObject() on each of them, and call callNextMethod() at the end. > 4. If the new class does not extend anything, then calling updateObject() on these objects will call the updateObject,ANY method defined in BiocGenerics. This default updateObject() method will probably do the right thing, until the author of the new class makes a change to the internals of their class. When this happens, they’ll need to implement a specific updateObject() method for their class, of course. > 5. In any case, calling updateObject() on an S4 object is expected to work, even if it doesn’t do anything (no-op). So I’d say that authors of a new class should make sure that updateObject() doesn’t return an error on their objects. > Hope this helps.
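A minimal sketch of point 3 above, using hypothetical classes (A, B, Extra) rather than anything from the thread: B extends A and adds a slot holding another S4 object, so its updateObject() method updates that slot and then delegates to the parent via callNextMethod() as the last statement.

```r
library(methods)
library(BiocGenerics)   # provides the updateObject() generic and its default ANY method

setClass("A", slots = c(x = "numeric"))
setClass("Extra", slots = c(y = "numeric"))
setClass("B", contains = "A", slots = c(extra = "Extra"))

setMethod("updateObject", "B", function(object, ..., verbose = FALSE) {
    if (verbose)
        message("updateObject(object = 'B')")
    ## update the S4 object stored in the slot added by B
    object@extra <- updateObject(object@extra, ..., verbose = verbose)
    ## then let the parent class(es) handle their own slots;
    ## this call should come at the very end of the method body
    callNextMethod()
})

b <- new("B", x = 1, extra = new("Extra", y = 2))
validObject(updateObject(b, verbose = TRUE))
```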
Chris Vanderaa (05:09:10) (in thread): > This is very useful, thank you so much!
Chris Vanderaa (05:11:32) (in thread): > I plan to have QFeatures fixed by tomorrow, and will report the progress to AnnotationHub so that the hack in the commit you pointed out can be removed. Sorry for the inconvenience!
2021-11-24
Laurent Gatto (11:42:51): > Hello - quick question regarding how package-specific caching (based on BiocFileCache) handles release and devel cached data. If I understand correctly, cached data is shared between different Bioc versions and the devel version can access data that was cached by the release version (and vice versa)?
Lori Shepherd (11:47:21): > I think it depends on how it is implemented. BiocFileCache will take into account the URL of the data (if remote) and could then potentially hold two different versions of the data, if the URL for release vs devel data is different. If there is no difference then it probably is shared between release/devel.
Laurent Gatto (11:53:05) (in thread): > Ok, thank you very much.
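To make the point concrete, a minimal sketch (the URL is a small public Bioconductor download-stats file, chosen only for illustration): BiocFileCache keys remote resources on their URL, so two Bioconductor installations requesting the same URL reuse one cache entry, while distinct release/devel URLs get separate entries.

```r
library(BiocFileCache)

bfc <- BiocFileCache(tempdir(), ask = FALSE)   # throwaway cache for illustration
url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab"
path <- bfcrpath(bfc, url)   # downloads on the first call, reuses the cached copy afterwards
bfcinfo(bfc)[, c("rname", "rpath")]
```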
2021-12-02
Andres Wokaty (13:05:12): > @Andres Wokaty has joined the channel
2021-12-14
Megha Lal (08:24:50): > @Megha Lal has left the channel
2022-01-06
Michael Love (16:35:24): > for those who have added data to an existing EHub package, I’m not certain on this part > > Generate a new metadata.csv file. The package should contain metadata for all versions of the data in ExperimentHub or AnnotationHub so the old file should remain. When adding a new version it might be helpful to write a new csv file named by version, e.g., metadata_v84.csv, metadata_v85.csv etc. > if I have objects a.rda and b.rda from the first submission, and want to add c.rda and d.rda, should I keep inst/extdata/metadata.csv as it is (likewise inst/scripts/make-metadata.R) and make a new inst/extdata/metadata_2022.csv (with a corresponding R script to make it) with the two new entries? or should the new CSV have entries for all four resources?
Lori Shepherd (17:44:00): > Either will work. Depends on how you as a maintainer want to organize it. Just let us know which you did when you submit
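A minimal sketch of sanity-checking a versioned metadata file before submitting, under the layout discussed above; the package path and file name are hypothetical. makeExperimentHubMetadata() (or the AnnotationHubData counterpart) parses the csv and errors on malformed entries.

```r
library(ExperimentHubData)

## hypothetical package path and versioned metadata file name
ExperimentHubData::makeExperimentHubMetadata(
    pathToPackage = "/path/to/myDataPackage",
    fileName = "metadata_2022.csv"   # only the newly added entries, per the option above
)
```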
2022-03-09
Fabricio Almeida-Silva (03:06:10): > @Fabricio Almeida-Silva has joined the channel
2022-03-15
Ramon Massoni-Badosa (07:04:34): > @Ramon Massoni-Badosa has joined the channel
2022-03-29
Tim Triche (13:27:21): > @Tim Triche has left the channel
2022-04-04
Fabricio Almeida-Silva (14:06:44): > Hi, everyone. I mentioned in a BoF session at the Bioconductor conference last year that plant genome data were underrepresented in AHub. For people working on plant phylogenomics (myself included), having a unified resource to get data (protein sequences, DNA sequences, coordinates, etc.) would be extremely useful. Currently, the best resource the plant genomics community has is the PLAZA database (https://bioinformatics.psb.ugent.be/plaza/versions/plaza_v5_dicots/), but it is still quite limited. I would like to take on the challenge of integrating plant genomic data into Bioc annotation resources. What do you think is the best way to do it these days? I like BSgenome packages and organism packages, but people seem to be migrating to AHub these days. - Attachment (bioinformatics.psb.ugent.be): Dicots PLAZA 5.0 : comparative genomics in plants > Plaza : Plant Resource For Comparative Genomics
Fabricio Almeida-Silva (14:11:05): > Most common needs of the plant genomics community: > * translated sequences of primary transcripts (for gene family evolution analyses) - could be stored as AAStringSet objects > * CDS for each locus (for Ka/Ks analyses and similar) - could be DNAStringSet objects > * Whole-genome sequences (DNA and whole proteomes) - could be AA/DNAStringSet objects > * Ranges (GFF-/BED-like objects) - could be GRanges objects
Lori Shepherd (14:13:48): > there is a push to eventually put even traditional annotation resources (like BSgenomes) into the hubs; orgDbs are hosted both as traditional packages and in the hub – you can still create the structure of a BSgenome or an orgDb or StringSets/GRanges – the structure I will leave for others here who are more in the field to debate – also, a common misconception is that hub data has to be hosted on the Bioconductor server: the data can be hosted on any public, trusted server (institution-provided, Zenodo, etc.), just not a personal-level one (like GitHub), and then referenced in the hub for visibility and ease of use regarding caching/download through the hub interface
Fabricio Almeida-Silva (14:18:59): > Thank you for your feedback, Lori! Let’s see what other people suggest regarding structure.
Hervé Pagès (15:02:50): > Some random comments and thoughts: > * I think it’s important to separate hosting details from data structure. Ideally we want data structures that are as much as possible agnostic about where the data is actually located (e.g. on the Hubs, locally, or already in memory). > * It’s true that BSgenome objects don’t support that kind of separation at the moment but as Lori mentioned some work is planned to address this in BioC 3.16. > * Core data structures for annotations are org, TxDb, BSgenome, and OrganismDbi at the moment. Are there use cases with plants that are not covered by these structures? If so, what are they? I’m just curious if there’s something special about plants or if we just need to address some shortcomings with the current data structures that would benefit all organisms, not just plants. For example it’s true that the core data structures don’t make it easy to get the translated sequences of primary transcripts at the moment. Maybe that’s something we should address but I don’t see any reason to do this only for plants. > * It seems that assemblies and annotations available at PLAZA are also available at NCBI (e.g. the cs10 assembly for which a BSgenome package was recently requested: https://support.bioconductor.org/p/9142910/). FWIW making it easy to obtain a BSgenome object from an NCBI assembly is also on the roadmap for BioC 3.16.
2022-04-05
Fabricio Almeida-Silva (10:53:20): > Thanks a lot, @Hervé Pagès! Great to know you’re planning to address it in Bioc 3.16. Regarding your 3rd question (plants not covered by these structures), I would ask the opposite: what plants are covered by these structures? If you look carefully, you’ll see that only a tiny fraction of plant genomes is available as BSgenome packages. For instance, I was working on a plant family with >25 plant genomes, and I only found 1 of them in AHub/BSgenome. Maybe there’s an efficient way to store these data, right? Like storing only DNA sequences and ranges for each species, so users can get DNA and protein sequences for each gene from GRanges objects or TxDb objects.
Fabricio Almeida-Silva (10:57:01): > Creating the BSgenome and GRanges/TxDb from NCBI is nice, but I don’t know if one can get different gene IDs. RefSeq IDs are rarely used by the plant genomics community. There’s usually a widely used assembly for each species with its own nomenclature system.
2022-09-19
Julien Roux (04:39:13): > @Julien Roux has joined the channel
2022-10-24
Nitesh Turaga (14:43:08): > Is there documentation on how we can create Hubs we can host ourselves?
Nitesh Turaga (14:43:22): >
Lori Shepherd (14:49:52): > https://github.com/Bioconductor/BiocHubServer
Nitesh Turaga (16:47:01): > This is great! Thanks Lori. I’ll ask if I have further questions.
2023-01-10
Robert Shear (14:09:13): > @Robert Shear has joined the channel
2023-03-10
Leonardo Collado Torres (15:30:17): > Daianna Gonzalez-Padilla and I ran into a few hiccups at the end of HubPub::create_pkg(), which we documented at https://github.com/Bioconductor/HubPub/issues/7 - Attachment: #7 HubPub template unit tests have a few errors > Hi, > > About 3 weeks ago I did a LIBD rstats club session on HubPub and ran into some errors at the end. I haven’t processed and uploaded the video to YouTube yet, but you can see the live tests at https://github.com/lcolladotor/HubBSP2. @daianna21 is currently making her first ExperimentHub submission with https://github.com/LieberInstitute/smokingMouse and ran into the same errors. > > Error 1: typo in the template > > There’s a typo at line 8 of HubPub/inst/templates/test_metadata.R (commit 616c565, https://github.com/Bioconductor/HubPub/blob/616c5657a92193a39130abaf19c02c8e3c57340a/inst/templates/test_metadata.R#L8). It should be package, not packge; otherwise that carries over to https://github.com/lcolladotor/HubBSP2/blob/1fcabeb9447781204bd9f570dd509c84471a13b5/tests/testthat/test_metadata.R#L8 and it fails. > > Error 2: wrong fileName input > > @daianna21 noticed that in AnnotationHubData::makeAnnotationHubMetadata(), fileName should be the name of the metadata csv file, not the full path. Otherwise, it gets appended internally. > > > args(AnnotationHubData::makeAnnotationHubMetadata) > function (pathToPackage, fileName = character()) >
> > So AnnotationHubData::makeAnnotationHubMetadata(path, metadata) fails with line 9 of HubPub/inst/templates/test_metadata.R (commit 616c565, https://github.com/Bioconductor/HubPub/blob/616c5657a92193a39130abaf19c02c8e3c57340a/inst/templates/test_metadata.R#L9) but would work with AnnotationHubData::makeAnnotationHubMetadata(path, "metadata.csv"). > > Error 3: expect_true() test fails > > AnnotationHubData::makeAnnotationHubMetadata() returns a list() if it worked. We tried to find another function you might have meant to use to check that the metadata is correct, but couldn’t find it. We likely missed it; if you know of one, please let us know so we can edit https://github.com/LieberInstitute/smokingMouse/blob/4da23c8401a82b1c60eddb4f3f7d8a44d43b9078/tests/testthat/test_metadata.R#L10, and ultimately, if you could edit line 10 of HubPub/inst/templates/test_metadata.R (commit 616c565, https://github.com/Bioconductor/HubPub/blob/616c5657a92193a39130abaf19c02c8e3c57340a/inst/templates/test_metadata.R#L10) that would be great. Otherwise, we can check that the output is a list object with code like expect_type(AnnotationHubData::makeAnnotationHubMetadata(path, metadata), "list"). > > Here’s our modified version on smokingMouse that works:
> https://github.com/LieberInstitute/smokingMouse/blob/33afcaf18a63c9dcefd4cc9e708402e10e44bce2/tests/testthat/test_metadata.R#L8-L10 > > If you like it, I can teach Daianna how to send a PR. > > Thank you for creating HubPub::create_pkg() and making it easier for people to contribute data to the Hubs! > > Thanks!
2023-03-12
Vince Carey (08:34:02): > @Lori Shepherd^^@Kayla Interdonato
2023-04-20
Chris Vanderaa (04:34:21): > @Chris Vanderaa has left the channel
2023-06-27
Gavin Rhys Lloyd (08:56:58): > @Gavin Rhys Lloyd has joined the channel
2023-07-17
Leonardo Collado Torres (15:12:30): > https://twitter.com/lcolladotor/status/1681001144330637312 - Attachment (Twitter): :flag-mx: Leonardo Collado-Torres on Twitter > After a long hiatus, I’ve finally uploaded all @LIBDrstats videos from 2023 to @YouTube @LieberInstitute > > First we have 2023-02-17 “Making a R/Bioconductor :package: for ExperimentHub” #HubPub @kaylainter1011 @Bioconductor #rstats > > :spiral_note_pad: https://t.co/pfrNPDfo8i > :video_camera: https://t.co/K9DBVBbQOA
2023-07-28
Konstantinos Daniilidis (13:47:38): > @Konstantinos Daniilidis has joined the channel
2024-01-17
Ahmad Al Ajami (10:05:12): > @Ahmad Al Ajami has joined the channel
2024-02-26
Ahmad Al Ajami (10:56:38): > Hi community, > Thank you so much Lori for adding my data to the hubs. Since it’s my first data package, I have a general (maybe naive) question: > When trying to query the data, I get the following: > > Error: EH****** added after current Hub snapshot date. > added: 2024-02-26 > snapshot date: 2023-10-24 >
> My question is, how can I access the newest snapshot date? Is this something only accessible with the next release? Happy to share sessionInfo. Sorry if I missed this somewhere!
Marcel Ramos Pérez (10:58:37) (in thread): > The snapshot is updated when you run the ExperimentHub function: > > eh <- ExperimentHub() > |========================| 100% > snapshotDate(): 2024-02-26 >
Ahmad Al Ajami (11:00:20) (in thread): > > eh <- ExperimentHub() > snapshotDate(): 2023-10-24 >
> For me, this is what’s returned, even though I have the newest versions (I think): - File (PNG): image.png
Marcel Ramos Pérez (11:00:43) (in thread): > Are you using Bioconductor devel?
Marcel Ramos Pérez (11:01:54) (in thread): > Ah I see your R version. You should be using R-devel and Bioc version 3.19
Ahmad Al Ajami (11:03:04) (in thread): > Ah, I see! I will try that. Thanks a lot:raised_hands:
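A minimal sketch of the checks involved, assuming the devel setup described above (R-devel plus Bioconductor 3.19 at the time of this thread):

```r
library(ExperimentHub)

BiocManager::version()   # should report the devel version (3.19 here)
eh <- ExperimentHub()
snapshotDate(eh)         # must be on/after the rdatadateadded of the new EH IDs
```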
2024-04-13
Ankitha Ramaiyer (20:48:27): > @Ankitha Ramaiyer has joined the channel
2024-08-18
Sean Davis (13:32:35): > @Erdal Cosgun and @Lori Shepherd, I’m using the curatedMetagenomicData package to download and process all datasets. When running each “job” on a separate machine (so, 800+ IP addresses), I can run things in parallel just fine. However, when running on my HPC cluster, I find that after about 5 successful downloads, downloads “hang” and do not progress. I’m not seeing any errors, just stalled downloads (using BiocFileCache, so hard to debug directly). The code that stalls is just: > > suppressMessages(curatedMetagenomicData::curatedMetagenomicData(id, dryrun = FALSE, rownames = "NCBI"))[[1]] >
> But many of these are running in parallel and I suspect that the server endpoint is seeing a single IP address associated with the proxy for our HPC cluster. > > I’m wondering if there is any configuration in place on the bucket/storage that would limit access or throttle things?
2024-08-19
Rema Gesaka (09:41:17): > @Rema Gesaka has joined the channel
Lori Shepherd (13:45:15) (in thread): > I think @Erdal Cosgun would have a better idea of the settings and whether there is anything like this set up?
Erdal Cosgun (13:45:22): > @Erdal Cosgun has joined the channel
Sean Davis (13:46:37) (in thread): > Thanks, @Lori Shepherd. I’ll wait to hear from Erdal. Just to confirm, though, the hub data is now primarily hosted on Azure?
Lori Shepherd (13:47:55) (in thread): > The default Bioconductor location is Azure, yes. But for anyone who sees this message: you can host data on other trusted sites and still list/access it through the hubs (e.g. Zenodo, S3 buckets, other data lake locations).
Erdal Cosgun (14:03:43) (in thread): > Hi @Sean Davis, there is no limit on access to the storage accounts. You can download as much as you need. The storage account type is “Premium”, so throttling should not be a problem. I checked again from the portal but there is no specific config setting that I can change. On the other hand, there is no “Security” alert on the storage accounts. This means you’re within the safe limits of Azure Defender.
Sean Davis (14:05:39) (in thread): > Thanks, @Erdal Cosgun. Must be a local problem, then. Thanks to you both for looking into it.
2024-09-13
Gobi Dasu (18:20:09): > @Gobi Dasu has joined the channel