Version Three • curatedMetagenomicData

Dear User,

Thank you for using curatedMetagenomicData and thank you for installing curatedMetagenomicData version 3.0.0 – this release is a long awaited and major milestone for all of us who work on the software. We are very proud of the package and hope you will find the updates make your own work more easy and enjoyable. The curatedMetagenomicData package now contains 20,283 samples from 86 studies, all processed in a standardized way with manually curated sample metadata for each and every sample. The sample metadata is curated by humans, and, to be certain, we owe a huge debt of gratitude to our curators for their time and efforts. To our knowledge, curatedMetagenomicData represent the largest collection of its kind, and it will continue to grow into the future to enable metagenomics.

If you have been using curatedMetagenomicData for a few years you might notice we have skipped version 2.0.0 and went directly from version 1.0.0 to version 3.0.0. While this is not traditional, we had good reason to do so. The data in curatedMetagenomicData version 1.0.0 (released in 2017) came from MetaPhlAn2 and HUMAnN2; the data in curatedMetagenomicData version 3.0.0 (released in 2021) come from MetaPhlAn3 and HUMAnN3. The skipping of version 2.0.0 and release of 3.0.0 instead was done to align our version numbers with MetaPhlAn3 and HUMAnN3; from now on our major releases will reflect data processed using major versions of these software.

As for what has changed in curatedMetagenomicData version 3.0.0, allow us to explain. First, all raw data has been reprocessed with MetaPhlAn3 and HUMAnN3 – this includes both new and existing studies. And speaking of new studies, 29 new studies with a total of 10,084 samples were added in this release, roughly doubling the number of samples available in curatedMetagenomicData. Next, all data is now returned as SummarizedExperiment and/or TreeSummarizedExperiment objects by the refactored curatedMetagenomicData() method. The curatedMetagenomicData() method is now the primary (and only) means to access data – this change allows us to keep sample metadata up to date and pull in changes from the curation repository on a weekly basis. With that change it was also necessary to refactor the mergeData() method to work with the new data structures. Goodbye ExpressionSet, hello SummarizedExperiment/TreeSummarizedExperiment!

Perhaps most exciting though is the ability to now return samples across studies using the returnSamples() method. You simply supply the sampleMetadata data.frame subset to include only desired samples and metadata and a SummarizedExperiment or TreeSummarizedExperiment object is returned with the correct samples. At some point it became obvious to us that this method was likely the most needed and probably being implemented beyond the package frequently – sorry we did not include it sooner. Again, if you have been using curatedMetagenomicData for a few years, you might be asking what is sampleMetadata. The sampleMetadata object replaces the combined_metadata object, and combined_metadata object will be removed in the next release – please start using sampleMetadata.

Finally, a few things have become defunct in curatedMetagenomicData version 3.0.0, principally methods that worked on ExpressionSet objects (e.g. ExpressionSet2phyloseq()) and named accessors (e.g. HMP_2012.pathcoverage.stool()). The older methods had to become defunct with the move to SummarizedExperiment/TreeSummarizedExperiment and the named accessors were very hard to maintain and document. The package is now simpler and easier to use with the SummarizedExperiment/TreeSummarizedExperiment data structures and with the curatedMetagenomicData() method replacing all named accessors.

Again, thank you sincerly for using curatedMetagenomicData and please don’t hesitate to reach out to us when you find issues or need help.

Kind regards,

The curatedMetagenomicData team