R Packages for This Tutorial

This tutorial demonstrates basic usage of curatedMetagenomicData, prerequisite steps for analysis, estimation of alpha diversity, analysis of beta diversity, and differential abundance analysis. To make the tables and figures in this tutorial, a number of R packages are required, as follows. If you are using app.orchestra.cancerdatasci.org, these packages are already installed for you; if not, you’ll have to make sure the packages are installed.

With the requisite R packages loaded and a basic working knowledge of metagenomics analysis (as was covered in the workshop), you’re ready to begin. Just one last note, in most of the code chunks below messages have been suppressed; more verbose output should be expected when using the R console.

Using curatedMetagenomicData

First, let’s address what the curatedMetagenomicData R/Bioconductor package is; here’s an brief description.

The curatedMetagenomicData package provides standardized, curated human microbiome data for novel analyses. It includes gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples collected from different body sites. The bacterial, fungal, and archaeal taxonomic abundances for each sample were calculated with MetaPhlAn3 and metabolic functional potential was calculated with HUMAnN3. The manually curated sample metadata and standardized metagenomic data are available as (Tree)SummarizedExperiment objects.

Be sure to refer to the SummarizedExperiment and TreeSummarizedExperiment vignettes if the data structures are unclear. There is also a reference website for curatedMetagenomicData at waldronlab.io/curatedMetagenomicData, if you need it.

Sample Metadata Exploration

Any metagenomic analysis that uses curatedMetagenomicData is likely to begin with an exploration of sample metadata (i.e. sampleMetadata). The sampleMetadata object is a data.frame in the package that contains sample metadata that has been manually curated by humans for each and every sample. To give you an idea of some of the columns that are available, dplyr syntax is used below to take a random sampling of 10 rows (slice_sample). Then, columns containing NA values are removed, and the first 10 remaining columns are selected. The data.frame is sorted alphabetically by study_name prior to being returned.

sampleMetadata |>
  slice_sample(n = 10) |>
  select(where(~ !any(is.na(.x)))) |>
  select(1:10) |>
  arrange(study_name)
##            study_name                            sample_id
## 1       AsnicarF_2021                         SAMEA7045146
## 2       BackhedF_2015                           SID281_12M
## 3     HMP_2019_ibdmdb                           HSM5MD5K_P
## 4  LifeLinesDeep_2016     EGAR00001420974_9002000001474089
## 5      PehrssonE_2016                            SID030301
## 6      SchirmerM_2016                               G89012
## 7          ShaoY_2019 cc2f310e-7ae6-11e9-a106-68b59976a384
## 8       VatanenT_2016                               G78749
## 9            YuJ_2015                      SZAXPI015228-14
## 10        ZeeviD_2015                          PNP_Main_84
##                subject_id body_site study_condition        disease age_category
## 1       predict1_MTG_0293     stool         control        healthy        adult
## 2                  SID281     stool         control        healthy        child
## 3                   H4001     stool             IBD            IBD    schoolage
## 4  sub_9002000001474089LL     stool         control        healthy        adult
## 5                  SID3_3     stool         control        healthy    schoolage
## 6                   HV388     stool         control        healthy        adult
## 7               B02724_ba     stool         control        healthy      newborn
## 8                 P022130     stool  respiratoryinf respiratoryinf      newborn
## 9         SZAXPI015228-14     stool         control            T2D        adult
## 10            PNP_Main_84     stool         control        healthy        adult
##    country non_westernized sequencing_platform
## 1      GBR              no     IlluminaNovaSeq
## 2      SWE              no       IlluminaHiSeq
## 3      USA              no       IlluminaHiSeq
## 4      NLD              no       IlluminaHiSeq
## 5      SLV             yes       IlluminaHiSeq
## 6      NLD              no       IlluminaHiSeq
## 7      GBR              no       IlluminaHiSeq
## 8      RUS              no       IlluminaHiSeq
## 9      CHN              no       IlluminaHiSeq
## 10     ISR              no       IlluminaHiSeq

Finding Available Resources

The resources available in curatedMetagenomicData are organized by study_name and can be discovered with the curatedMetagenomicData() function. When provided with a string, it will return the names of available resources. Each study_name will have 6 data types (gene_families, marker_abundance, marker_presence, pathway_abundance, pathway_coverage, and relative_abundance), these follow the study_name and are separated by a dot. The date that precedes study_name is for versioning, but it’s unimportant here.

## 2021-03-31.AsnicarF_2017.gene_families
## 2021-03-31.AsnicarF_2017.marker_abundance
## 2021-03-31.AsnicarF_2017.marker_presence
## 2021-03-31.AsnicarF_2017.pathway_abundance
## 2021-03-31.AsnicarF_2017.pathway_coverage
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-10-14.AsnicarF_2017.gene_families
## 2021-10-14.AsnicarF_2017.marker_abundance
## 2021-10-14.AsnicarF_2017.marker_presence
## 2021-10-14.AsnicarF_2017.pathway_abundance
## 2021-10-14.AsnicarF_2017.pathway_coverage
## 2021-10-14.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.gene_families
## 2021-03-31.AsnicarF_2021.marker_abundance
## 2021-03-31.AsnicarF_2021.marker_presence
## 2021-03-31.AsnicarF_2021.pathway_abundance
## 2021-03-31.AsnicarF_2021.pathway_coverage
## 2021-03-31.AsnicarF_2021.relative_abundance

The first argument passed to curatedMetagenomicData(), pattern, is actually a regular expression. So things like .+ (match any character one or more times) work, and the names multiple resources are returned if they match. Below, you can see there are two AsnicarF resources for the relative_abundance data type.

curatedMetagenomicData("AsnicarF.+.relative_abundance")
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-10-14.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.relative_abundance

Downloading Study Resources

As is clear from the examples above, simply searching for the AsnicarF studies did not download any curated metagenomic data. To do that you must provide another argument, dryrun = FALSE, to the curatedMetagenomicData() function. Doing so will download the matching resources from ExperimentHub (or load them from the local cache). When dryrun = FALSE, curatedMetagenomicData() will always return a list of SummarizedExperiment and/or TreeSummarizedExperiment objects.

curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE)
## $`2021-10-14.AsnicarF_2017.relative_abundance`
## class: TreeSummarizedExperiment 
## dim: 298 24 
## metadata(0):
## assays(1): relative_abundance
## rownames(298):
##   k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
##   ...
##   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_gordonii
##   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Aerococcaceae|g__Abiotrophia|s__Abiotrophia_sp_HMSC24B09
## rowData names(7): superkingdom phylum ... genus species
## colnames(24): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... MV_MIM5_t2M14
##   MV_MIM5_t3F15
## colData names(22): study_name subject_id ... lactating curator
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (298 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
## 
## $`2021-03-31.AsnicarF_2021.relative_abundance`
## class: TreeSummarizedExperiment 
## dim: 639 1098 
## metadata(0):
## assays(1): relative_abundance
## rownames(639):
##   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_vulgatus
##   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_stercoris
##   ...
##   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1098): SAMEA7041133 SAMEA7041134 ... SAMEA7045952 SAMEA7045953
## colData names(22): study_name subject_id ... BMI family
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (639 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL

Merging Data Across Studies

It can be useful to have the list elements returned by curatedMetagenomicData() merged into a single SummarizedExperiment or TreeSummarizedExperiment object. This is accomplished using the mergeData() function – simply pipe (|>) the output of curatedMetagenomicData() to mergeData() and a single object will be returned.

curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE) |>
  mergeData()
## class: TreeSummarizedExperiment 
## dim: 680 1122 
## metadata(0):
## assays(1): relative_abundance
## rownames(680):
##   k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
##   ...
##   k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
##   SAMEA7045953
## colData names(25): study_name subject_id ... BMI family
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (680 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL

Returning Data from Samples

There is just one more function in the curatedMetagenomicData package to know about, returnSamples(). It is used when you want to return samples across studies that match a certain set of conditions. To return a specific set of samples, you first subset the sampleMetadata data.frame to include only the desired samples. In the example below, stool (body_site == "stool") samples from healthy (disease == "healthy") adults (age >= 18) living in Ireland or Italy (str_detect(country, "IRL|ITA")) are included before columns of all NA values are dropped. When the sampleMetadata data.frame is subset as desired, it is piped to the returnSamples() function. As you can see below, there is another argument, counts, which has not yet been mentioned. When counts = TRUE, relative abundance proportions are multiplied by read depth and rounded to the nearest integer prior to being returned – the argument applies to both returnSamples() and curatedMetagenomicData() when requesting the relative_abundance data type.

countryData <-
  filter(sampleMetadata, body_site == "stool") |>
  filter(disease == "healthy") |>
  filter(age >= 18) |>
  filter(str_detect(country, "IRL|ITA")) |>
  select(where(~ !all(is.na(.x)))) |>
  returnSamples("relative_abundance", counts = TRUE)

Getting Ready for Analysis

First, have a look at the countryData object you just created to get a better understanding of what it contains. You will be using it for the rest of this tutorial.

countryData
## class: TreeSummarizedExperiment 
## dim: 906 286 
## metadata(0):
## assays(1): relative_abundance
## rownames(906):
##   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
##   ...
##   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
##   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): superkingdom phylum ... genus species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
##   CRC_MR_SBJ84H_17
## colData names(32): study_name subject_id ... dyastolic_p systolic_p
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL

As can be seen above, the first assay of the TreeSummarizedExperiment is named relative_abundance. This is usually of little consequence, but you will need to rename the assay to counts for the transformation in the next step; do so by setting the assayNames as follows.

assayNames(countryData) <-
  "counts"

Next, use logNormCounts() from the scater package to create a second assay logcounts that contains log-normalized species abundances.

countryData <-
  logNormCounts(countryData)

By taking a look again, you can see there are now two assays as was desired.

countryData
## class: TreeSummarizedExperiment 
## dim: 906 286 
## metadata(0):
## assays(2): counts logcounts
## rownames(906):
##   k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
##   k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
##   ...
##   k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
##   k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): superkingdom phylum ... genus species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
##   CRC_MR_SBJ84H_17
## colData names(33): study_name subject_id ... systolic_p sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL

Before diving into analysis, you should create a quick summary table to understand the metadata better and be prepared to handle potential biases in analysis. The steps below use dplyr syntax to clean up a few of the variables before they are output in the summary table.

colData(countryData) |>
  as.data.frame() |>
  select(study_name, age, gender, country, sequencing_platform, curator, diet) |>
  mutate(gender = to_title_case(gender)) |>
  mutate(gender = fct_explicit_na(gender)) |>
  mutate(curator = str_replace_all(curator, "_", " ")) |>
  mutate(curator = str_replace_all(curator, ";", ", ")) |>
  mutate(diet = to_title_case(diet)) |>
  mutate(diet = fct_explicit_na(diet)) |>
  rename_with(to_title_case) |>
  tbl_summary()
Characteristic N = 2861
Study Name
    DeFilippisF_2019 97 (34%)
    FerrettiP_2018 20 (7.0%)
    KeohaneDM_2020 117 (41%)
    RampelliS_2015 11 (3.8%)
    ThomasAM_2018a 14 (4.9%)
    ThomasAM_2018b 27 (9.4%)
Age 39 (31, 52)
Gender
    Female 140 (49%)
    Male 126 (44%)
    (Missing) 20 (7.0%)
Country
    IRL 117 (41%)
    ITA 169 (59%)
Sequencing Platform
    IlluminaHiSeq 189 (66%)
    IlluminaNextSeq 97 (34%)
Curator
    Francesca DeFilippis, Edoardo Pasolli 97 (34%)
    Pamela Ferretti 20 (7.0%)
    Paolo Manghi 155 (54%)
    Paolo Manghi, Marisa Metzger 14 (4.9%)
Diet
    Omnivore 23 (8.0%)
    Vegan 36 (13%)
    Vegetarian 38 (13%)
    (Missing) 189 (66%)
1 n (%); Median (IQR)

Estimating Alpha Diversity

Alpha diversity estimation seeks to quantify variation within a sample; the Shannon index (H’) is probably the most widely used measure. It accounts for both richness (i.e. how many types of bacteria are present) and evenness (i.e. how equal the proportions of types is); it is assessed below using the estimateDiversity() function from the mia package. Then, the plotColData() function from the scater package is used to generate a basic ggplot2 plot, which is stylized further using the ggsignif and hrbrthemes packages.

countryData |>
  estimateDiversity(abund_values = "logcounts", index = "shannon") |>
  plotColData(x = "country", y = "shannon", colour_by = "country", shape_by = "country") +
  geom_signif(comparisons = list(c("IRL", "ITA")), test = "t.test", map_signif_level = TRUE) +
  labs(
    title = "Alpha Diversity by Country, Shannon Index (H')",
    subtitle = "Stool Samples of Healthy Adults",
    x = "Country",
    y = "Alpha Diversity (H')"
  ) +
  guides(color = guide_none(), shape = guide_none()) +
  theme_ipsum_rc()
Alpha Diversity by Country. Alpha Diversity of stool samples from healthy adults, as measured by the Shannon index (H'), is significantly $(Pr(T < t) < 0.001)$ lower among Irish samples, as compared to Italian samples.

Alpha Diversity by Country. Alpha Diversity of stool samples from healthy adults, as measured by the Shannon index (H’), is significantly \((Pr(T < t) < 0.001)\) lower among Irish samples, as compared to Italian samples.

Analysis of Beta Diversity

Beta diversity is a measurement of between sample variation that is usually qualitatively assessed using a low-dimensional embedding; classically this has been done in metagenomics with a principal coordinates analysis (PCoA) of Bray-Curtis dissimilarity.1 This is done below using the runMDS() function from the scater package with the vegdist function from the vegan package. The plotReducedDim() function from the scater package is used to generate a basic ggplot2 plot, which is styled further.

countryData |>
  runMDS(FUN = vegdist, method = "bray", exprs_values = "logcounts", name = "BrayCurtis") |>
  plotReducedDim("BrayCurtis", colour_by = "country", shape_by = "country", text_by = "country") +
  labs(
    title = "Beta Diversity by Country, Bray-Curtis PCoA",
    subtitle = "Stool Samples of Healthy Adults",
    x = "PCo 1",
    y = "PCo 2"
  ) +
  guides(color = guide_none(), shape = guide_none()) +
  theme_ipsum_rc()
Beta Diversity by Country, Bray-Curtis PCoA. The first two principal coordinates demonstrate good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.

Beta Diversity by Country, Bray-Curtis PCoA. The first two principal coordinates demonstrate good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.

To address the issue of using principal coordinates analysis (PCoA) with a dissimilarity, beta diversity can be assessed using the UMAP (Uniform Manifold Approximation and Projection) algorithm instead.2 The runUMAP() function from the scater package is used below; otherwise, the syntax is largely the same as above.

countryData |>
  runUMAP(exprs_values = "logcounts", name = "UMAP") |>
  plotReducedDim("UMAP", colour_by = "country", shape_by = "country", text_by = "country") +
  labs(
    title = "Beta Diversity by Country, UMAP Embedding",
    subtitle = "Stool Samples of Healthy Adults",
    x = "UMAP 1",
    y = "UMAP 2"
  ) +
  guides(color = guide_none(), shape = guide_none()) +
  theme_ipsum_rc()
Beta Diversity by Country, UMAP Embedding. The two-dimensional UMAP embedding demonstrates good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.

Beta Diversity by Country, UMAP Embedding. The two-dimensional UMAP embedding demonstrates good seperation of Irish and Italian stool samples from healthy adults, suggesting differences in gut microbial composition between the two populations.

Modeling Bacteria by Country

Where beta diversity embeddings suggest there are features that separate Irish and Italian stool samples from healthy adults, you might like to know which are most differentially abundance. To assess this the ANCOMBC package has a metagenomics-specific additive log-ratio model for the task. However, the ancombc() function requires a phyloseq class objects – one can be created using the makePhyloseqFromTreeSummarizedExperiment() function from the mia package, as shown below.

ancombcResults <-
  makePhyloseqFromTreeSummarizedExperiment(countryData) |>
  ancombc("country")

The results of the ANCOMBC model are in a strange list structure and have to be coerced into a data.frame before they can be displayed; the bind_cols() function from the dplyr package is used below.

ancombcTable <-
  bind_cols(ancombcResults[["res"]])

Yet, the column names of the results table are missing and have to be assigned, as shown below.

colnames(ancombcTable) <-
  names(ancombcResults[["res"]])

The row names of the results table are big long strings of microbial taxonomies that will need some editing if they are to be displayed nicely. the rownames_to_column() function from the tibble package is used below to turn them into a column so they can be edited.

ancombcTable <-
  rownames_to_column(ancombcTable)

Before the row names are split into 7 pieces, the names of columns that each piece will be assigned to are created below.

rankNames <-
  c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

The row names of the results table are then transformed using tidyr, dplyr, and stringr, as shown below.

ancombcTable[["rowname"]] <-
  separate(ancombcTable, rowname, rankNames, sep = "\\|") |>
  select(all_of(rankNames)) |>
  mutate(across(.fns = ~ str_remove_all(.x, ".__"))) |>
  mutate(across(.fns = ~ str_replace_all(.x, "_", " "))) |>
  mutate(label = Species) |>
  pull(label)

Once the results table is in good shape, it can be filtered to include only bacterial species that exhibited large (e.g. abs(beta) > 1) and significant (-log10(q_val) > 5) differences in abundances between the two countries. In the example below, the table is sorted by effect size and a number of formatting conventions are applied before displaying the results table.

filter(ancombcTable, abs(beta) > 1) |>
  filter(-log10(q_val) > 5) |>
  select(rowname, beta, se, p_val, q_val) |>
  arrange(-abs(beta)) |>
  column_to_rownames() |>
  mutate(across(.fns = ~ round(.x, digits = 3))) |>
  mutate(beta = format(beta)) |>
  mutate(beta = str_replace(beta, " ", "&nbsp;")) |>
  mutate(p_val = if_else(p_val == 0, "< 0.001", format(p_val, nsmall = 3))) |>
  mutate(q_val = if_else(q_val == 0, "< 0.001", format(q_val, nsmall = 3))) |>
  kable(col.names = c("β", "SE", "P", "Q"), align = "cccc", escape = FALSE)
β SE P Q
Catenibacterium mitsuokai -8.689 0.618 < 0.001 < 0.001
Holdemanella biformis -8.007 0.575 < 0.001 < 0.001
Eubacterium sp CAG 180 -7.361 0.687 < 0.001 < 0.001
Bacteroides dorei  6.364 0.551 < 0.001 < 0.001
Lactobacillus ruminis -6.029 0.562 < 0.001 < 0.001
Phascolarctobacterium succinatutens -5.947 0.706 < 0.001 < 0.001
Alistipes putredinis  5.733 0.622 < 0.001 < 0.001
Ruminococcus torques -5.591 0.442 < 0.001 < 0.001
Eubacterium hallii -5.245 0.437 < 0.001 < 0.001
Bifidobacterium angulatum -5.185 0.550 < 0.001 < 0.001
Prevotella sp CAG 279 -5.093 0.626 < 0.001 < 0.001
Lawsonibacter asaccharolyticus  4.976 0.408 < 0.001 < 0.001
Dorea formicigenerans -4.855 0.366 < 0.001 < 0.001
Prevotella sp 885 -4.646 0.588 < 0.001 < 0.001
Methanobrevibacter smithii -4.514 0.642 < 0.001 < 0.001
Coprococcus comes -4.476 0.417 < 0.001 < 0.001
Slackia isoflavoniconvertens -4.356 0.581 < 0.001 < 0.001
Roseburia intestinalis -4.332 0.559 < 0.001 < 0.001
Oscillibacter sp CAG 241 -4.328 0.540 < 0.001 < 0.001
Ruminococcus callidus -4.294 0.517 < 0.001 < 0.001
Eubacterium ramulus -4.224 0.489 < 0.001 < 0.001
Coprococcus catus -4.206 0.421 < 0.001 < 0.001
Intestinibacter bartlettii -4.007 0.534 < 0.001 < 0.001
Alistipes finegoldii  3.849 0.544 < 0.001 < 0.001
Eubacterium sp CAG 251 -3.826 0.652 < 0.001 < 0.001
Dorea longicatena -3.811 0.327 < 0.001 < 0.001
Prevotella sp AM42 24 -3.567 0.560 < 0.001 < 0.001
Roseburia faecis -3.483 0.491 < 0.001 < 0.001
Firmicutes bacterium CAG 170 -3.393 0.577 < 0.001 < 0.001
Bacteroides plebeius  3.353 0.522 < 0.001 < 0.001
Alistipes shahii  3.259 0.532 < 0.001 < 0.001
Bacteroides eggerthii  3.212 0.477 < 0.001 < 0.001
Butyricimonas virosa  3.171 0.457 < 0.001 < 0.001
Parasutterella excrementihominis  3.167 0.477 < 0.001 < 0.001
Parabacteroides distasonis  3.114 0.529 < 0.001 < 0.001
Alistipes indistinctus  3.074 0.516 < 0.001 < 0.001
Eubacterium rectale -3.044 0.410 < 0.001 < 0.001
Blautia wexlerae -3.029 0.428 < 0.001 < 0.001
Clostridium disporicum -2.965 0.441 < 0.001 < 0.001
Anaerostipes hadrus -2.797 0.457 < 0.001 < 0.001
Bacteroides sp CAG 530 -2.790 0.411 < 0.001 < 0.001
Blautia obeum -2.751 0.405 < 0.001 < 0.001
Intestinimonas butyriciproducens  2.642 0.465 < 0.001 < 0.001
Roseburia inulinivorans -2.535 0.365 < 0.001 < 0.001
Prevotella stercorea -2.525 0.457 < 0.001 < 0.001
Collinsella aerofaciens -2.333 0.322 < 0.001 < 0.001
Butyricimonas synergistica  2.311 0.411 < 0.001 < 0.001
Fusicatenibacter saccharivorans -1.539 0.274 < 0.001 < 0.001

While the table above is somewhat interesting, the results are better summarized as a volcano plot (i.e. statistical significance versus fold change) and one can be made using the results table. To shorten labels for readability, stringr is first used to abbreviate species names by replacing the first word of the name with a single letter followed by a period. Next, the same filtering of the table as above is undertaken, and the color scheme used in all the plots above is applied. Both labeling and coloring are only applied where effect size and significance thresholds are met, as denoted by the dotted lines.

ancombcTable |>
  mutate(rowname = str_replace(rowname, "^([A-Z])([a-z])+ ", "\\1. ")) |>
  mutate(q_val = -log10(q_val)) |>
  mutate(label = if_else(abs(beta) > 1, rowname, NA_character_)) |>
  mutate(label = if_else(q_val > 5, label, NA_character_)) |>
  mutate(color = if_else(beta > 0, "#FF9E4A", "#729ECE")) |>
  mutate(color = if_else(is.na(label), "#000000", color)) |>
  ggplot(mapping = aes(beta, q_val, color = I(color), label = label, shape = I(1))) +
  geom_point() +
  geom_hline(linetype = "dotted", yintercept = 5) +
  geom_vline(linetype = "dotted", xintercept = 1) +
  geom_vline(linetype = "dotted", xintercept = -1) +
  geom_label_repel(min.segment.length = 0, force = 10, max.overlaps = 20, na.rm = TRUE) +
  labs(
    title = "Significance vs. Effect Size, ANCOM-BC",
    subtitle = "Stool Samples of Healthy Adults",
    x = expression(beta),
    y = expression(-~log[10]~Q)
  ) +
  guides(color = guide_none(), shape = guide_none()) +
  theme_ipsum_rc()
Volcano Plot of Differentially Abundance Bacterial Species. In the model and the figure, Irish samples are the reference group such that bacterial species in blue at the left are significantly more abundant in Irish samples and those in yellow at the right are significantly more abundant in Italian samples.

Volcano Plot of Differentially Abundance Bacterial Species. In the model and the figure, Irish samples are the reference group such that bacterial species in blue at the left are significantly more abundant in Irish samples and those in yellow at the right are significantly more abundant in Italian samples.

As a bonus, do a Google or a PubMed search for Holdemanella biformis to see what condition(s) it is associated with and then explore differences between the two countries using GBD Compare visualizations.


  1. Bray-Curtis dissimilarity is not a metric distance and does not satisfy the triangle inequality, but it is generally accepted in metagenomics.↩︎

  2. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv [stat.ML] (2018)↩︎