vignettes/NUI-Galway-Metagenomics-Workshop.Rmd
NUI-Galway-Metagenomics-Workshop.Rmd
This tutorial demonstrates basic usage of curatedMetagenomicData, prerequisite steps for analysis, estimation of alpha diversity, analysis of beta diversity, and differential abundance analysis. To make the tables and figures in this tutorial, a number of R packages are required, as follows. If you are using app.orchestra.cancerdatasci.org, these packages are already installed for you; if not, you’ll have to make sure the packages are installed.
library(curatedMetagenomicData)
library(dplyr)
library(stringr)
library(scater)
library(snakecase)
library(forcats)
library(gtsummary)
library(mia)
library(ggplot2)
library(ggsignif)
library(hrbrthemes)
library(vegan)
library(uwot)
library(ANCOMBC)
library(tibble)
library(tidyr)
library(knitr)
library(ggrepel)
With the requisite R packages loaded and a basic working knowledge of metagenomics analysis (as was covered in the workshop), you’re ready to begin. Just one last note, in most of the code chunks below messages have been suppressed; more verbose output should be expected when using the R console.
First, let’s address what the curatedMetagenomicData R/Bioconductor package is; here’s an brief description.
The curatedMetagenomicData package provides standardized, curated human microbiome data for novel analyses. It includes gene families, marker abundance, marker presence, pathway abundance, pathway coverage, and relative abundance for samples collected from different body sites. The bacterial, fungal, and archaeal taxonomic abundances for each sample were calculated with MetaPhlAn3 and metabolic functional potential was calculated with HUMAnN3. The manually curated sample metadata and standardized metagenomic data are available as (Tree)SummarizedExperiment objects.
Be sure to refer to the SummarizedExperiment and TreeSummarizedExperiment vignettes if the data structures are unclear. There is also a reference website for curatedMetagenomicData at waldronlab.io/curatedMetagenomicData, if you need it.
Any metagenomic analysis that uses curatedMetagenomicData
is likely to begin with an exploration of sample metadata
(i.e. sampleMetadata
). The sampleMetadata
object is a data.frame
in the package that contains sample
metadata that has been manually curated by humans for each and every
sample. To give you an idea of some of the columns that are available,
dplyr
syntax is used below to take a random sampling of 10 rows
(slice_sample
). Then, columns containing NA
values are removed, and the first 10 remaining columns are selected. The
data.frame
is sorted alphabetically by
study_name
prior to being returned.
sampleMetadata |>
slice_sample(n = 10) |>
select(where(~ !any(is.na(.x)))) |>
select(1:10) |>
arrange(study_name)
## study_name sample_id
## 1 AsnicarF_2021 SAMEA7045146
## 2 BackhedF_2015 SID281_12M
## 3 HMP_2019_ibdmdb HSM5MD5K_P
## 4 LifeLinesDeep_2016 EGAR00001420974_9002000001474089
## 5 PehrssonE_2016 SID030301
## 6 SchirmerM_2016 G89012
## 7 ShaoY_2019 cc2f310e-7ae6-11e9-a106-68b59976a384
## 8 VatanenT_2016 G78749
## 9 YuJ_2015 SZAXPI015228-14
## 10 ZeeviD_2015 PNP_Main_84
## subject_id body_site study_condition disease age_category
## 1 predict1_MTG_0293 stool control healthy adult
## 2 SID281 stool control healthy child
## 3 H4001 stool IBD IBD schoolage
## 4 sub_9002000001474089LL stool control healthy adult
## 5 SID3_3 stool control healthy schoolage
## 6 HV388 stool control healthy adult
## 7 B02724_ba stool control healthy newborn
## 8 P022130 stool respiratoryinf respiratoryinf newborn
## 9 SZAXPI015228-14 stool control T2D adult
## 10 PNP_Main_84 stool control healthy adult
## country non_westernized sequencing_platform
## 1 GBR no IlluminaNovaSeq
## 2 SWE no IlluminaHiSeq
## 3 USA no IlluminaHiSeq
## 4 NLD no IlluminaHiSeq
## 5 SLV yes IlluminaHiSeq
## 6 NLD no IlluminaHiSeq
## 7 GBR no IlluminaHiSeq
## 8 RUS no IlluminaHiSeq
## 9 CHN no IlluminaHiSeq
## 10 ISR no IlluminaHiSeq
The resources available in curatedMetagenomicData
are organized by study_name
and can be discovered with the
curatedMetagenomicData()
function. When provided with a
string, it will return the names of available resources. Each
study_name
will have 6 data types
(gene_families
, marker_abundance
,
marker_presence
, pathway_abundance
,
pathway_coverage
, and relative_abundance
),
these follow the study_name
and are separated by a dot. The
date that precedes study_name
is for versioning, but it’s
unimportant here.
curatedMetagenomicData("AsnicarF")
## 2021-03-31.AsnicarF_2017.gene_families
## 2021-03-31.AsnicarF_2017.marker_abundance
## 2021-03-31.AsnicarF_2017.marker_presence
## 2021-03-31.AsnicarF_2017.pathway_abundance
## 2021-03-31.AsnicarF_2017.pathway_coverage
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-10-14.AsnicarF_2017.gene_families
## 2021-10-14.AsnicarF_2017.marker_abundance
## 2021-10-14.AsnicarF_2017.marker_presence
## 2021-10-14.AsnicarF_2017.pathway_abundance
## 2021-10-14.AsnicarF_2017.pathway_coverage
## 2021-10-14.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.gene_families
## 2021-03-31.AsnicarF_2021.marker_abundance
## 2021-03-31.AsnicarF_2021.marker_presence
## 2021-03-31.AsnicarF_2021.pathway_abundance
## 2021-03-31.AsnicarF_2021.pathway_coverage
## 2021-03-31.AsnicarF_2021.relative_abundance
The first argument passed to curatedMetagenomicData()
,
pattern
, is actually a regular expression. So things like
.+
(match any character one or more times) work, and the
names multiple resources are returned if they match. Below, you can see
there are two AsnicarF
resources for the
relative_abundance
data type.
curatedMetagenomicData("AsnicarF.+.relative_abundance")
## 2021-03-31.AsnicarF_2017.relative_abundance
## 2021-10-14.AsnicarF_2017.relative_abundance
## 2021-03-31.AsnicarF_2021.relative_abundance
As is clear from the examples above, simply searching for the
AsnicarF
studies did not download any curated metagenomic
data. To do that you must provide another argument,
dryrun = FALSE
, to the
curatedMetagenomicData()
function. Doing so will download
the matching resources from ExperimentHub
(or load them from the local cache). When dryrun = FALSE
,
curatedMetagenomicData()
will always return a
list
of SummarizedExperiment
and/or TreeSummarizedExperiment
objects.
curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE)
## $`2021-10-14.AsnicarF_2017.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 298 24
## metadata(0):
## assays(1): relative_abundance
## rownames(298):
## k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
## ...
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_gordonii
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Aerococcaceae|g__Abiotrophia|s__Abiotrophia_sp_HMSC24B09
## rowData names(7): superkingdom phylum ... genus species
## colnames(24): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... MV_MIM5_t2M14
## MV_MIM5_t3F15
## colData names(22): study_name subject_id ... lactating curator
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (298 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
##
## $`2021-03-31.AsnicarF_2021.relative_abundance`
## class: TreeSummarizedExperiment
## dim: 639 1098
## metadata(0):
## assays(1): relative_abundance
## rownames(639):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_vulgatus
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Bacteroidaceae|g__Bacteroides|s__Bacteroides_stercoris
## ...
## k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1098): SAMEA7041133 SAMEA7041134 ... SAMEA7045952 SAMEA7045953
## colData names(22): study_name subject_id ... BMI family
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (639 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
It can be useful to have the list
elements returned by
curatedMetagenomicData()
merged into a single SummarizedExperiment
or TreeSummarizedExperiment
object. This is accomplished using the mergeData()
function
– simply pipe (|>
) the output of
curatedMetagenomicData()
to mergeData()
and a
single object will be returned.
curatedMetagenomicData("AsnicarF.+.relative_abundance", dryrun = FALSE) |>
mergeData()
## class: TreeSummarizedExperiment
## dim: 680 1122
## metadata(0):
## assays(1): relative_abundance
## rownames(680):
## k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria|o__Enterobacterales|f__Enterobacteriaceae|g__Escherichia|s__Escherichia_coli
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_bifidum
## ...
## k__Bacteria|p__Synergistetes|c__Synergistia|o__Synergistales|f__Synergistaceae|g__Pyramidobacter|s__Pyramidobacter_sp_C12_8
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Micrococcales|f__Brevibacteriaceae|g__Brevibacterium|s__Brevibacterium_aurantiacum
## rowData names(7): superkingdom phylum ... genus species
## colnames(1122): MV_FEI1_t1Q14 MV_FEI2_t1Q14 ... SAMEA7045952
## SAMEA7045953
## colData names(25): study_name subject_id ... BMI family
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (680 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
There is just one more function in the curatedMetagenomicData
package to know about, returnSamples()
. It is used when you
want to return samples across studies that match a certain set of
conditions. To return a specific set of samples, you first subset the
sampleMetadata
data.frame
to include only the
desired samples. In the example below, stool
(body_site == "stool"
) samples from healthy
(disease == "healthy"
) adults (age >= 18
)
living in Ireland or Italy (str_detect(country, "IRL|ITA")
)
are included before columns of all NA
values are dropped.
When the sampleMetadata
data.frame
is subset
as desired, it is piped to the returnSamples()
function. As
you can see below, there is another argument, counts
, which
has not yet been mentioned. When counts = TRUE
, relative
abundance proportions are multiplied by read depth and rounded to the
nearest integer prior to being returned – the argument applies to both
returnSamples()
and curatedMetagenomicData()
when requesting the relative_abundance
data type.
countryData <-
filter(sampleMetadata, body_site == "stool") |>
filter(disease == "healthy") |>
filter(age >= 18) |>
filter(str_detect(country, "IRL|ITA")) |>
select(where(~ !all(is.na(.x)))) |>
returnSamples("relative_abundance", counts = TRUE)
First, have a look at the countryData
object you just
created to get a better understanding of what it contains. You will be
using it for the rest of this tutorial.
countryData
## class: TreeSummarizedExperiment
## dim: 906 286
## metadata(0):
## assays(1): relative_abundance
## rownames(906):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
## ...
## k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): superkingdom phylum ... genus species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
## CRC_MR_SBJ84H_17
## colData names(32): study_name subject_id ... dyastolic_p systolic_p
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
As can be seen above, the first assay
of the TreeSummarizedExperiment
is named relative_abundance
. This is usually of little
consequence, but you will need to rename the assay to
counts
for the transformation in the next step; do so by
setting the assayNames
as follows.
assayNames(countryData) <-
"counts"
Next, use logNormCounts()
from the scater
package to create a second assay logcounts
that contains
log-normalized species abundances.
countryData <-
logNormCounts(countryData)
By taking a look again, you can see there are now two
assays
as was desired.
countryData
## class: TreeSummarizedExperiment
## dim: 906 286
## metadata(0):
## assays(2): counts logcounts
## rownames(906):
## k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__Prevotellaceae|g__Prevotella|s__Prevotella_copri
## k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_adolescentis
## ...
## k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Clostridiaceae|g__Clostridium|s__Clostridium_sp_D5
## k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Lactobacillaceae|g__Lactobacillus|s__Lactobacillus_algidus
## rowData names(7): superkingdom phylum ... genus species
## colnames(286): SID01.BA.VG.2 SID01.BA.V_metag ... CRC_MR_SBJ83H_17
## CRC_MR_SBJ84H_17
## colData names(33): study_name subject_id ... systolic_p sizeFactor
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## rowLinks: a LinkDataFrame (906 rows)
## rowTree: 1 phylo tree(s) (10430 leaves)
## colLinks: NULL
## colTree: NULL
Before diving into analysis, you should create a quick summary table to understand the metadata better and be prepared to handle potential biases in analysis. The steps below use dplyr syntax to clean up a few of the variables before they are output in the summary table.
colData(countryData) |>
as.data.frame() |>
select(study_name, age, gender, country, sequencing_platform, curator, diet) |>
mutate(gender = to_title_case(gender)) |>
mutate(gender = fct_explicit_na(gender)) |>
mutate(curator = str_replace_all(curator, "_", " ")) |>
mutate(curator = str_replace_all(curator, ";", ", ")) |>
mutate(diet = to_title_case(diet)) |>
mutate(diet = fct_explicit_na(diet)) |>
rename_with(to_title_case) |>
tbl_summary()
Characteristic | N = 2861 |
---|---|
Study Name | |
DeFilippisF_2019 | 97 (34%) |
FerrettiP_2018 | 20 (7.0%) |
KeohaneDM_2020 | 117 (41%) |
RampelliS_2015 | 11 (3.8%) |
ThomasAM_2018a | 14 (4.9%) |
ThomasAM_2018b | 27 (9.4%) |
Age | 39 (31, 52) |
Gender | |
Female | 140 (49%) |
Male | 126 (44%) |
(Missing) | 20 (7.0%) |
Country | |
IRL | 117 (41%) |
ITA | 169 (59%) |
Sequencing Platform | |
IlluminaHiSeq | 189 (66%) |
IlluminaNextSeq | 97 (34%) |
Curator | |
Francesca DeFilippis, Edoardo Pasolli | 97 (34%) |
Pamela Ferretti | 20 (7.0%) |
Paolo Manghi | 155 (54%) |
Paolo Manghi, Marisa Metzger | 14 (4.9%) |
Diet | |
Omnivore | 23 (8.0%) |
Vegan | 36 (13%) |
Vegetarian | 38 (13%) |
(Missing) | 189 (66%) |
1 n (%); Median (IQR) |
Alpha diversity estimation seeks to quantify variation within a
sample; the Shannon index (H’) is probably the most widely used measure.
It accounts for both richness (i.e. how many types of bacteria are
present) and evenness (i.e. how equal the proportions of types is); it
is assessed below using the estimateDiversity()
function
from the mia package.
Then, the plotColData()
function from the scater
package is used to generate a basic ggplot2 plot,
which is stylized further using the ggsignif and
hrbrthemes
packages.
countryData |>
estimateDiversity(abund_values = "logcounts", index = "shannon") |>
plotColData(x = "country", y = "shannon", colour_by = "country", shape_by = "country") +
geom_signif(comparisons = list(c("IRL", "ITA")), test = "t.test", map_signif_level = TRUE) +
labs(
title = "Alpha Diversity by Country, Shannon Index (H')",
subtitle = "Stool Samples of Healthy Adults",
x = "Country",
y = "Alpha Diversity (H')"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Beta diversity is a measurement of between sample variation that is
usually qualitatively assessed using a low-dimensional embedding;
classically this has been done in metagenomics with a principal
coordinates analysis (PCoA) of Bray-Curtis dissimilarity.1 This is done below
using the runMDS()
function from the scater
package with the vegdist
function from the vegan package.
The plotReducedDim()
function from the scater
package is used to generate a basic ggplot2 plot,
which is styled further.
countryData |>
runMDS(FUN = vegdist, method = "bray", exprs_values = "logcounts", name = "BrayCurtis") |>
plotReducedDim("BrayCurtis", colour_by = "country", shape_by = "country", text_by = "country") +
labs(
title = "Beta Diversity by Country, Bray-Curtis PCoA",
subtitle = "Stool Samples of Healthy Adults",
x = "PCo 1",
y = "PCo 2"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
To address the issue of using principal coordinates analysis (PCoA)
with a dissimilarity, beta diversity can be assessed using the UMAP
(Uniform Manifold Approximation and Projection) algorithm instead.2 The
runUMAP()
function from the scater
package is used below; otherwise, the syntax is largely the same as
above.
countryData |>
runUMAP(exprs_values = "logcounts", name = "UMAP") |>
plotReducedDim("UMAP", colour_by = "country", shape_by = "country", text_by = "country") +
labs(
title = "Beta Diversity by Country, UMAP Embedding",
subtitle = "Stool Samples of Healthy Adults",
x = "UMAP 1",
y = "UMAP 2"
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
Where beta diversity embeddings suggest there are features that
separate Irish and Italian stool samples from healthy adults, you might
like to know which are most differentially abundance. To assess this the
ANCOMBC
package has a metagenomics-specific additive log-ratio model for the
task. However, the ancombc()
function requires a
phyloseq
class objects – one can be created using the
makePhyloseqFromTreeSummarizedExperiment()
function from
the mia package,
as shown below.
ancombcResults <-
makePhyloseqFromTreeSummarizedExperiment(countryData) |>
ancombc("country")
The results of the ANCOMBC
model are in a strange list
structure and have to be
coerced into a data.frame
before they can be displayed; the
bind_cols()
function from the dplyr package
is used below.
ancombcTable <-
bind_cols(ancombcResults[["res"]])
Yet, the column names of the results table are missing and have to be assigned, as shown below.
The row names of the results table are big long strings of microbial
taxonomies that will need some editing if they are to be displayed
nicely. the rownames_to_column()
function from the tibble package
is used below to turn them into a column so they can be edited.
ancombcTable <-
rownames_to_column(ancombcTable)
Before the row names are split into 7 pieces, the names of columns that each piece will be assigned to are created below.
rankNames <-
c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")
The row names of the results table are then transformed using tidyr, dplyr, and stringr, as shown below.
ancombcTable[["rowname"]] <-
separate(ancombcTable, rowname, rankNames, sep = "\\|") |>
select(all_of(rankNames)) |>
mutate(across(.fns = ~ str_remove_all(.x, ".__"))) |>
mutate(across(.fns = ~ str_replace_all(.x, "_", " "))) |>
mutate(label = Species) |>
pull(label)
Once the results table is in good shape, it can be filtered to
include only bacterial species that exhibited large
(e.g. abs(beta) > 1
) and significant
(-log10(q_val) > 5
) differences in abundances between
the two countries. In the example below, the table is sorted by effect
size and a number of formatting conventions are applied before
displaying the results table.
filter(ancombcTable, abs(beta) > 1) |>
filter(-log10(q_val) > 5) |>
select(rowname, beta, se, p_val, q_val) |>
arrange(-abs(beta)) |>
column_to_rownames() |>
mutate(across(.fns = ~ round(.x, digits = 3))) |>
mutate(beta = format(beta)) |>
mutate(beta = str_replace(beta, " ", " ")) |>
mutate(p_val = if_else(p_val == 0, "< 0.001", format(p_val, nsmall = 3))) |>
mutate(q_val = if_else(q_val == 0, "< 0.001", format(q_val, nsmall = 3))) |>
kable(col.names = c("β", "SE", "P", "Q"), align = "cccc", escape = FALSE)
β | SE | P | Q | |
---|---|---|---|---|
Catenibacterium mitsuokai | -8.689 | 0.618 | < 0.001 | < 0.001 |
Holdemanella biformis | -8.007 | 0.575 | < 0.001 | < 0.001 |
Eubacterium sp CAG 180 | -7.361 | 0.687 | < 0.001 | < 0.001 |
Bacteroides dorei | 6.364 | 0.551 | < 0.001 | < 0.001 |
Lactobacillus ruminis | -6.029 | 0.562 | < 0.001 | < 0.001 |
Phascolarctobacterium succinatutens | -5.947 | 0.706 | < 0.001 | < 0.001 |
Alistipes putredinis | 5.733 | 0.622 | < 0.001 | < 0.001 |
Ruminococcus torques | -5.591 | 0.442 | < 0.001 | < 0.001 |
Eubacterium hallii | -5.245 | 0.437 | < 0.001 | < 0.001 |
Bifidobacterium angulatum | -5.185 | 0.550 | < 0.001 | < 0.001 |
Prevotella sp CAG 279 | -5.093 | 0.626 | < 0.001 | < 0.001 |
Lawsonibacter asaccharolyticus | 4.976 | 0.408 | < 0.001 | < 0.001 |
Dorea formicigenerans | -4.855 | 0.366 | < 0.001 | < 0.001 |
Prevotella sp 885 | -4.646 | 0.588 | < 0.001 | < 0.001 |
Methanobrevibacter smithii | -4.514 | 0.642 | < 0.001 | < 0.001 |
Coprococcus comes | -4.476 | 0.417 | < 0.001 | < 0.001 |
Slackia isoflavoniconvertens | -4.356 | 0.581 | < 0.001 | < 0.001 |
Roseburia intestinalis | -4.332 | 0.559 | < 0.001 | < 0.001 |
Oscillibacter sp CAG 241 | -4.328 | 0.540 | < 0.001 | < 0.001 |
Ruminococcus callidus | -4.294 | 0.517 | < 0.001 | < 0.001 |
Eubacterium ramulus | -4.224 | 0.489 | < 0.001 | < 0.001 |
Coprococcus catus | -4.206 | 0.421 | < 0.001 | < 0.001 |
Intestinibacter bartlettii | -4.007 | 0.534 | < 0.001 | < 0.001 |
Alistipes finegoldii | 3.849 | 0.544 | < 0.001 | < 0.001 |
Eubacterium sp CAG 251 | -3.826 | 0.652 | < 0.001 | < 0.001 |
Dorea longicatena | -3.811 | 0.327 | < 0.001 | < 0.001 |
Prevotella sp AM42 24 | -3.567 | 0.560 | < 0.001 | < 0.001 |
Roseburia faecis | -3.483 | 0.491 | < 0.001 | < 0.001 |
Firmicutes bacterium CAG 170 | -3.393 | 0.577 | < 0.001 | < 0.001 |
Bacteroides plebeius | 3.353 | 0.522 | < 0.001 | < 0.001 |
Alistipes shahii | 3.259 | 0.532 | < 0.001 | < 0.001 |
Bacteroides eggerthii | 3.212 | 0.477 | < 0.001 | < 0.001 |
Butyricimonas virosa | 3.171 | 0.457 | < 0.001 | < 0.001 |
Parasutterella excrementihominis | 3.167 | 0.477 | < 0.001 | < 0.001 |
Parabacteroides distasonis | 3.114 | 0.529 | < 0.001 | < 0.001 |
Alistipes indistinctus | 3.074 | 0.516 | < 0.001 | < 0.001 |
Eubacterium rectale | -3.044 | 0.410 | < 0.001 | < 0.001 |
Blautia wexlerae | -3.029 | 0.428 | < 0.001 | < 0.001 |
Clostridium disporicum | -2.965 | 0.441 | < 0.001 | < 0.001 |
Anaerostipes hadrus | -2.797 | 0.457 | < 0.001 | < 0.001 |
Bacteroides sp CAG 530 | -2.790 | 0.411 | < 0.001 | < 0.001 |
Blautia obeum | -2.751 | 0.405 | < 0.001 | < 0.001 |
Intestinimonas butyriciproducens | 2.642 | 0.465 | < 0.001 | < 0.001 |
Roseburia inulinivorans | -2.535 | 0.365 | < 0.001 | < 0.001 |
Prevotella stercorea | -2.525 | 0.457 | < 0.001 | < 0.001 |
Collinsella aerofaciens | -2.333 | 0.322 | < 0.001 | < 0.001 |
Butyricimonas synergistica | 2.311 | 0.411 | < 0.001 | < 0.001 |
Fusicatenibacter saccharivorans | -1.539 | 0.274 | < 0.001 | < 0.001 |
While the table above is somewhat interesting, the results are better summarized as a volcano plot (i.e. statistical significance versus fold change) and one can be made using the results table. To shorten labels for readability, stringr is first used to abbreviate species names by replacing the first word of the name with a single letter followed by a period. Next, the same filtering of the table as above is undertaken, and the color scheme used in all the plots above is applied. Both labeling and coloring are only applied where effect size and significance thresholds are met, as denoted by the dotted lines.
ancombcTable |>
mutate(rowname = str_replace(rowname, "^([A-Z])([a-z])+ ", "\\1. ")) |>
mutate(q_val = -log10(q_val)) |>
mutate(label = if_else(abs(beta) > 1, rowname, NA_character_)) |>
mutate(label = if_else(q_val > 5, label, NA_character_)) |>
mutate(color = if_else(beta > 0, "#FF9E4A", "#729ECE")) |>
mutate(color = if_else(is.na(label), "#000000", color)) |>
ggplot(mapping = aes(beta, q_val, color = I(color), label = label, shape = I(1))) +
geom_point() +
geom_hline(linetype = "dotted", yintercept = 5) +
geom_vline(linetype = "dotted", xintercept = 1) +
geom_vline(linetype = "dotted", xintercept = -1) +
geom_label_repel(min.segment.length = 0, force = 10, max.overlaps = 20, na.rm = TRUE) +
labs(
title = "Significance vs. Effect Size, ANCOM-BC",
subtitle = "Stool Samples of Healthy Adults",
x = expression(beta),
y = expression(-~log[10]~Q)
) +
guides(color = guide_none(), shape = guide_none()) +
theme_ipsum_rc()
As a bonus, do a Google or a PubMed search for Holdemanella biformis to see what condition(s) it is associated with and then explore differences between the two countries using GBD Compare visualizations.