vignettes/curatedMetagenomicData.Rmd
curatedMetagenomicData.Rmd
Abstract
The curatedMetagenomicData package provides standardized, curated human microbiome data for novel analyses. It includes gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples collected from different body sites. The bacterial, fungal, and archaeal taxonomic abundances for each sample were calculated with MetaPhlAn3, and metabolic functional potential was calculated with HUMAnN3. The manually curated sample metadata and standardized metagenomic data are available as (Tree)SummarizedExperiment objects.
curatedMetagenomicData
provides
curatedMetagenomicData
provides processed data from
whole-metagenome shotgun metagenomics, with manually-curated metadata,
as integrated and documented Bioconductor TreeSummarizedExperiment
objects or TSV flat text exports. It provides 6 types of data for each
dataset:
relative_abundance
)marker_presence
)marker_abundance
)gene_families
)pathway_coverage
)pathway_abundance
)Types 1-3 are generated by MetaPhlAn3; 4-6 are generated by HUMAnN3 using the UniRef90 database.
Currently there are:
To install curatedMetagenomicData from Bioconductor, use BiocManager as follows.
BiocManager::install("curatedMetagenomicData")
To install curatedMetagenomicData from GitHub, use BiocManager as follows.
BiocManager::install("waldronlab/curatedMetagenomicData", dependencies = TRUE, build_vignettes = TRUE)
Most users should simply install curatedMetagenomicData from Bioconductor.
To demonstrate the functionality of curatedMetagenomicData, the dplyr and DT packages are needed.
The curatedMetagenomicData
package contains a data.frame
, sampleMetadata
,
of manually curated sample metadata to help users understand the nature
of studies and samples available prior to returning resources. Beyond
this, it serves two purposes: 1) to define study_name
,
which is used with the curatedMetagenomicData()
function to
query and return resources, and 2) to define sample_id
,
which is used with the returnSamples()
function to return
samples across studies.
To demonstrate, the first ten rows and columns (without any
NA
values) of sampleMetadata
for the
AsnicarF_2017
study are shown in the table below.
There are three main ways to access data resources in curatedMetagenomicData.
curatedMetagenomicData()
function to search for and
return resources.returnSamples()
function to return samples across
studies.curatedMetagenomicData()
To access curated metagenomic data, users will use the
curatedMetagenomicData()
function both to query and return
resources. The first argument pattern
is a regular
expression pattern to look for in the titles of resources available in
curatedMetagenomicData;
""
will return all resources. The title of each resource is
a three part string with “.” as a delimiter – the fields are
runDate
, studyName
, and dataType
.
The runDate
is the date we created the resource and can
mostly be ignored by users because if there is more than one date
corresponding to a resource, the most recent one is selected
automatically – it would be used if a specific runDate
was
needed.
Multiple resources can be queried or returned with a single call to
curatedMetagenomicData()
, but only the titles of resources
are returned by default.
curatedMetagenomicData("AsnicarF_20.+")
## 2021-03-31.AsnicarF_2017.gene_families
## 2021-03-31.AsnicarF_2017.marker_abundance
## 2021-03-31.AsnicarF_2017.marker_presence
## 2021-03-31.AsnicarF_2017.pathway_abundance
## 2021-03-31.AsnicarF_2017.pathway_coverage
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-10-14.AsnicarF_2017.gene_families
## 2021-10-14.AsnicarF_2017.marker_abundance
## 2021-10-14.AsnicarF_2017.marker_presence
## 2021-10-14.AsnicarF_2017.pathway_abundance
## 2021-10-14.AsnicarF_2017.pathway_coverage
## 2021-10-14.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.gene_families
## 2021-03-31.AsnicarF_2021.marker_abundance
## 2021-03-31.AsnicarF_2021.marker_presence
## 2021-03-31.AsnicarF_2021.pathway_abundance
## 2021-03-31.AsnicarF_2021.pathway_coverage
## 2021-03-31.AsnicarF_2021.relative_abundance
When the dryrun
argument is set to FALSE
, a
list
of SummarizedExperiment
and/or
TreeSummarizedExperiment
objects is returned. The
rownames
argument determines the type of
rownames
to use for relative_abundance
resources: either "long"
(the default),
"short"
(species name), or "NCBI"
(NCBI
Taxonomy ID). When a single resource is requested, a single element
list
is returned.
curatedMetagenomicData("AsnicarF_2017.relative_abundance", dryrun = FALSE, rownames = "short")
## $`2021-10-14.AsnicarF_2017.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 296 24
## metadata(1): agglomerated_by_rank
## assays(1): relative_abundance
## rownames(296): Escherichia coli Bifidobacterium bifidum ...
## Streptococcus gordonii Abiotrophia sp. HMSC24B09
## rowData names(7): superkingdom phylum ... genus species
## colnames(24): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... MV_MIM5_t2M14
## MV_MIM5_t3F15
## colData names(22): study_name subject_id ... lactating curator
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (296 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
When the counts
argument is set to TRUE
,
relative abundance proportions are multiplied by read depth and rounded
to the nearest integer prior to being returned. Also, when multiple
resources are requested, the list
will contain named
elements corresponding to each SummarizedExperiment
and/or
TreeSummarizedExperiment
object.
curatedMetagenomicData("AsnicarF_20.+.relative_abundance", dryrun = FALSE, counts = TRUE, rownames = "short")
## $`2021-10-14.AsnicarF_2017.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 296 24
## metadata(1): agglomerated_by_rank
## assays(1): relative_abundance
## rownames(296): Escherichia coli Bifidobacterium bifidum ...
## Streptococcus gordonii Abiotrophia sp. HMSC24B09
## rowData names(7): superkingdom phylum ... genus species
## colnames(24): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... MV_MIM5_t2M14
## MV_MIM5_t3F15
## colData names(22): study_name subject_id ... lactating curator
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (296 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
##
## $`2021-03-31.AsnicarF_2021.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 633 1098
## metadata(1): agglomerated_by_rank
## assays(1): relative_abundance
## rownames(633): Phocaeicola vulgatus Bacteroides stercoris ...
## Pyramidobacter sp. C12-8 Brevibacterium aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1098): SAMEA7041133 SAMEA7041134 ... SAMEA7045952 SAMEA7045953
## colData names(24): study_name subject_id ... family treatment
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (633 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
mergeData()
To merge the list
elements returned from the
curatedMetagenomicData()
function into a single
SummarizedExperiment
or
TreeSummarizedExperiment
object, users will use the
mergeData()
function, provided elements are of the same
dataType
.
curatedMetagenomicData("AsnicarF_20.+.marker_abundance", dryrun = FALSE) |>
mergeData()
## class: SummarizedExperiment
## dim: 76639 1122
## metadata(0):
## assays(1): marker_abundance
## rownames(76639): 39491__A0A395UYA6__EUBREC_0408
## 39491__C4Z9E9__T1815_12341 ... 78448__A0A087CC86__BPULL_0419
## 356829__A0A087E8C8__BITS_5024
## rowData names(0):
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
## SAMEA7045953
## colData names(26): study_name subject_id ... family treatment
The mergeData()
function works for every
dataType
and will always return the appropriate data
structure (a single SummarizedExperiment
or
TreeSummarizedExperiment
object).
curatedMetagenomicData("AsnicarF_20.+.pathway_abundance", dryrun = FALSE) |>
mergeData()
## class: SummarizedExperiment
## dim: 16913 1122
## metadata(0):
## assays(1): pathway_abundance
## rownames(16913): UNMAPPED UNINTEGRATED ... PWY-6277: superpathway of
## 5-aminoimidazole ribonucleotide
## biosynthesis|g__Massiliomicrobiota.s__Massiliomicrobiota_timonensis
## PWY-6151: S-adenosyl-L-methionine cycle
## I|g__Massiliomicrobiota.s__Massiliomicrobiota_timonensis
## rowData names(0):
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
## SAMEA7045953
## colData names(26): study_name subject_id ... family treatment
This is useful for analysis across entire studies
(e.g. meta-analysis); however, when doing analysis across individual
samples (e.g. mega-analysis) the returnSamples()
function
is preferable.
curatedMetagenomicData("AsnicarF_20.+.relative_abundance", dryrun = FALSE, rownames = "short") |>
mergeData()
## class: TreeSummarizedExperiment
## dim: 673 1122
## metadata(0):
## assays(1): relative_abundance
## rownames(673): Escherichia coli Bifidobacterium bifidum ...
## Pyramidobacter sp. C12-8 Brevibacterium aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
## SAMEA7045953
## colData names(26): study_name subject_id ... family treatment
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (673 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
returnSamples()
The returnSamples()
function takes the
sampleMetadata
data.frame
subset to include
only desired samples and metadata as input, and returns a single
SummarizedExperiment
or
TreeSummarizedExperiment
object that includes only desired
samples and metadata. To use this function, filter rows and/or select
columns of interest from the sampleMetadata
data.frame
, maintaining at least one row, and the
sample_id
and study_name
columns. Then provide
the subset data.frame
as the first argument to the
returnSamples()
function.
The returnSamples()
function requires a second argument
dataType
(either "gene_families"
,
"marker_abundance"
, "marker_presence"
,
"pathway_abundance"
, "pathway_coverage"
, or
"relative_abundance"
) to be specified. It is often most
convenient to subset the sampleMetadata
data.frame
using dplyr
syntax.
sampleMetadata |>
filter(age >= 18) |>
filter(!is.na(alcohol)) |>
filter(body_site == "stool") |>
select(where(~ !all(is.na(.x)))) |>
returnSamples("relative_abundance", rownames = "short")
## class: TreeSummarizedExperiment
## dim: 824 702
## metadata(0):
## assays(1): relative_abundance
## rownames(824): Prevotella copri Prevotella sp. CAG:520 ...
## Corynebacterium aurimucosum Corynebacterium coyleae
## rowData names(7): superkingdom phylum ... genus species
## colnames(702): JAS_1 JAS_10 ... YSZC12003_37879 YSZC12003_37880
## colData names(45): study_name subject_id ... inr zigosity
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (824 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
The counts
and rownames
arguments apply to
returnSamples()
as well, and can be passed the function.
Finally, users should know that any arbitrary columns added to
sampleMetadata
will be present in the colData
of the SummarizedExperiment
or
TreeSummarizedExperiment
object that is returned.
To demonstrate the utility of curatedMetagenomicData, an example analysis is presented below. However, readers should know analysis is generally beyond the scope of curatedMetagenomicData and the analysis presented here is for demonstration alone. It is best to consider the output of curatedMetagenomicData as the input of analysis more than anything else.
To demonstrate the utility of curatedMetagenomicData, the stringr, mia, scater, and vegan packages are needed.
In our hypothetical study, let’s examine the association of alcohol consumption and stool microbial composition across all annotated samples in curatedMetagenomicData. We will examine the alpha diversity (within subject diversity), beta diversity (between subject diversity), and conclude with a few notes on differential abundance analysis.
First, as above, we use the returnSamples()
function to
return the relevant samples across all studies available in curatedMetagenomicData.
We want adults over the age of 18, for whom alcohol consumption status
is known, and we want only stool samples. The
select(where...
line below simply removes metadata columns
which are all NA
values – they exist in another study but
are all NA
once subsetting has been done. Lastly, the
"relative_abundance"
dataType
is requested
because it contains the relevant information about microbial
composition.
Most of the values in the sampleMetadata
data.frame
(which becomes colData
) are in
snake case (e.g. snake_case
) and don’t look nice in plots.
Here, the values of the alcohol
variable are made into
title case using stringr so
they will look nice in plots.
colData(alcoholStudy) <-
colData(alcoholStudy) |>
as.data.frame() |>
mutate(alcohol = str_replace_all(alcohol, "no", "No")) |>
mutate(alcohol = str_replace_all(alcohol, "yes", "Yes")) |>
DataFrame()
Next, the splitByRanks
function from mia is used
to create alternative experiments for each level of the taxonomic tree
(e.g. Genus). This allows for diversity and differential abundance
analysis at specific taxonomic levels; with this step complete, our data
is ready to analyze.
altExps(alcoholStudy) <-
splitByRanks(alcoholStudy)
Alpha diversity is a measure of the within sample diversity of
features (relative abundance proportions here) and seeks to quantify the
evenness (i.e. are the amounts of different microbes the same) and
richness (i.e. are they are large variety of microbial taxa present).
The Shannon index (H’) is a commonly used measure of alpha diversity,
it’s estimated here using the estimateDiversity()
function
from the mia
package.
To quickly plot the results of alpha diversity estimation, the
plotColData()
function from the scater
package is used along with ggplot2
syntax.
alcoholStudy |>
estimateDiversity(abund_values = "relative_abundance", index = "shannon") |>
plotColData(x = "alcohol", y = "shannon", colour_by = "alcohol", shape_by = "alcohol") +
labs(x = "Alcohol", y = "Alpha Diversity (H')") +
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
theme(legend.position = "none")
The figure suggest that those who consume alcohol have higher Shannon alpha diversity than those who do not consume alcohol; however, the difference does not appear to be significant, at least qualitatively.
Beta diversity is a measure of the between sample diversity of features (relative abundance proportions here) and seeks to quantify the magnitude of differences (or similarity) between every given pair of samples. Below it is assessed by Bray–Curtis Principal Coordinates Analysis (PCoA) and Uniform Manifold Approximation and Projection (UMAP).
To calculate pairwise Bray–Curtis distance for every sample in our
study we will use the runMDS()
function from the scater
package along with the vegdist()
function from the vegan
package.
To quickly plot the results of beta diversity analysis, the
plotReducedDim()
function from the scater
package is used along with ggplot2
syntax.
alcoholStudy |>
runMDS(FUN = vegdist, method = "bray", exprs_values = "relative_abundance", altexp = "genus", name = "BrayCurtis") |>
plotReducedDim("BrayCurtis", colour_by = "alcohol", shape_by = "alcohol") +
labs(x = "PCo 1", y = "PCo 2") +
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
theme(legend.position = c(0.90, 0.85))
To calculate the UMAP coordinates of every sample in our study we
will use the runUMAP()
function from the scater
package package, as it handles the task in a single line.
To quickly plot the results of beta diversity analysis, the
plotReducedDim()
function from the scater
package is used along with ggplot2
syntax again.
alcoholStudy |>
runUMAP(exprs_values = "relative_abundance", altexp = "genus", name = "UMAP") |>
plotReducedDim("UMAP", colour_by = "alcohol", shape_by = "alcohol") +
labs(x = "UMAP 1", y = "UMAP 2") +
guides(colour = guide_legend(title = "Alcohol"), shape = guide_legend(title = "Alcohol")) +
theme(legend.position = c(0.90, 0.85))
Next, it would be desirable to establish which microbes are
differentially abundant between the two groups (those who consume
alcohol, and those who do not). The lefser and
ANCOMBC
packages are excellent resources for this tasks; however, code is not
included here to avoid including excessive Suggests
packages – curatedMetagenomicData
had far too many of these in the the past and is now very lean. There is
a repository of analyses, curatedMetagenomicAnalyses,
on GitHub and a forthcoming paper that will feature extensive
demonstrations of analyses – but for now, the suggestions above will
have to suffice.
Finally, the curatedMetagenomicData
package previously had functions for conversion to phyloseq
class objects, and they have been removed. It is likely that some users
will still want to do analysis using phyloseq,
and we would like to help them do so – it is just easier if we don’t
have to maintain the conversion function ourselves. As such, the mia package
has a function, makePhyloseqFromTreeSummarizedExperiment
,
that will readily do the conversion – users needing this functionality
are advised to use it.
makePhyloseqFromTreeSummarizedExperiment(alcoholStudy, abund_values = "relative_abundance")
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] vegan_2.6-4 lattice_0.20-45
## [3] permute_0.9-7 scater_1.26.1
## [5] ggplot2_3.4.0 scuttle_1.8.3
## [7] mia_1.6.0 MultiAssayExperiment_1.24.0
## [9] stringr_1.5.0 DT_0.26
## [11] dplyr_1.0.10 curatedMetagenomicData_3.7.0
## [13] TreeSummarizedExperiment_2.6.0 Biostrings_2.66.0
## [15] XVector_0.38.0 SingleCellExperiment_1.20.0
## [17] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [19] GenomicRanges_1.50.2 GenomeInfoDb_1.34.4
## [21] IRanges_2.32.0 S4Vectors_0.36.1
## [23] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
## [25] matrixStats_0.63.0 BiocStyle_2.26.0
##
## loaded via a namespace (and not attached):
## [1] AnnotationHub_3.6.0 BiocFileCache_2.6.0
## [3] systemfonts_1.0.4 plyr_1.8.8
## [5] lazyeval_0.2.2 splines_4.2.2
## [7] crosstalk_1.2.0 BiocParallel_1.32.5
## [9] digest_0.6.31 yulab.utils_0.0.6
## [11] htmltools_0.5.4 viridis_0.6.2
## [13] fansi_1.0.3 magrittr_2.0.3
## [15] memoise_2.0.1 ScaledMatrix_1.6.0
## [17] cluster_2.1.4 DECIPHER_2.26.0
## [19] pkgdown_2.0.7 colorspace_2.0-3
## [21] blob_1.2.3 rappdirs_0.3.3
## [23] ggrepel_0.9.2 textshaping_0.3.6
## [25] xfun_0.36 crayon_1.5.2
## [27] RCurl_1.98-1.9 jsonlite_1.8.4
## [29] ape_5.6-2 glue_1.6.2
## [31] gtable_0.3.1 zlibbioc_1.44.0
## [33] DelayedArray_0.24.0 BiocSingular_1.14.0
## [35] scales_1.2.1 DBI_1.1.3
## [37] Rcpp_1.0.9 viridisLite_0.4.1
## [39] xtable_1.8-4 decontam_1.18.0
## [41] tidytree_0.4.2 bit_4.0.5
## [43] rsvd_1.0.5 htmlwidgets_1.6.0
## [45] httr_1.4.4 FNN_1.1.3.1
## [47] ellipsis_0.3.2 farver_2.1.1
## [49] pkgconfig_2.0.3 uwot_0.1.14
## [51] sass_0.4.4 dbplyr_2.2.1
## [53] utf8_1.2.2 labeling_0.4.2
## [55] tidyselect_1.2.0 rlang_1.0.6
## [57] reshape2_1.4.4 later_1.3.0
## [59] AnnotationDbi_1.60.0 munsell_0.5.0
## [61] BiocVersion_3.16.0 tools_4.2.2
## [63] cachem_1.0.6 cli_3.5.0
## [65] DirichletMultinomial_1.40.0 generics_0.1.3
## [67] RSQLite_2.2.20 ExperimentHub_2.6.0
## [69] evaluate_0.19 fastmap_1.1.0
## [71] yaml_2.3.6 ragg_1.2.4
## [73] knitr_1.41 bit64_4.0.5
## [75] fs_1.5.2 purrr_1.0.0
## [77] KEGGREST_1.38.0 nlme_3.1-161
## [79] sparseMatrixStats_1.10.0 mime_0.12
## [81] compiler_4.2.2 rstudioapi_0.14
## [83] beeswarm_0.4.0 filelock_1.0.2
## [85] curl_4.3.3 png_0.1-8
## [87] interactiveDisplayBase_1.36.0 treeio_1.22.0
## [89] tibble_3.1.8 bslib_0.4.2
## [91] stringi_1.7.8 highr_0.10
## [93] desc_1.4.2 Matrix_1.5-3
## [95] vctrs_0.5.1 pillar_1.8.1
## [97] lifecycle_1.0.3 BiocManager_1.30.19
## [99] jquerylib_0.1.4 BiocNeighbors_1.16.0
## [101] cowplot_1.1.1 bitops_1.0-7
## [103] irlba_2.3.5.1 httpuv_1.6.7
## [105] R6_2.5.1 bookdown_0.31
## [107] promises_1.2.0.1 gridExtra_2.3
## [109] vipor_0.4.5 codetools_0.2-18
## [111] MASS_7.3-58.1 assertthat_0.2.1
## [113] rprojroot_2.0.3 withr_2.5.0
## [115] GenomeInfoDbData_1.2.9 mgcv_1.8-41
## [117] parallel_4.2.2 grid_4.2.2
## [119] beachmat_2.14.0 tidyr_1.2.1
## [121] rmarkdown_2.19 DelayedMatrixStats_1.20.0
## [123] shiny_1.7.4 ggbeeswarm_0.7.1