lefser.Rmd
lefser is the R implementation of Linear discriminant analysis (LDA) Effect Size (LEfSe), a Python package for metagenomic biomarker discovery and explanation (Huttenhower et al. 2011).
The original software uses standard statistical significance tests along with supplementary tests that incorporate biological consistency and effect relevance to identify the features (e.g., organisms, clades, OTUs, genes, or functions) that are most likely to account for differences between the two sample classes of interest. While LEfSe is widely used and available on different platforms such as the Galaxy UI and Conda, there is no convenient way to incorporate it into R-based workflows. Thus, we re-implemented LEfSe as an R/Bioconductor package, lefser. Following LEfSe's algorithm, including the Kruskal-Wallis test, the Wilcoxon rank-sum test, and linear discriminant analysis, with some modifications, lefser successfully reproduces and improves the original statistical method and the associated plotting functionality.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("lefser")
The lefser package includes the demo dataset, zeller14, which contains microbiome data from colorectal cancer (CRC) patients and controls (Zeller et al. 2014).
In this vignette, we excluded the ‘adenoma’ condition and used control/CRC as the main classes and age category as sub-classes (adult vs. senior) with different numbers of samples: control-adult (n = 46), control-senior (n = 20), CRC-adult (n = 45), and CRC-senior (n = 46).
data(zeller14)
zeller14 <- zeller14[, zeller14$study_condition != "adenoma"]
The class and subclass information is stored in the colData slot under the study_condition and age_category columns, respectively.
## Contingency table
table(zeller14$age_category, zeller14$study_condition)
#>
#>          control CRC
#>   adult       46  45
#>   senior      20  46
If you try to run lefser directly on the ‘zeller14’ data, you will get the following warning messages:
lefser(zeller14, classCol = "study_condition", subclassCol = "age_category")
Warning messages:
1: In lefser(zeller14, classCol = "study_condition", subclassCol = "age_category") :
Convert counts to relative abundances with 'relativeAb()'
2: In lda.default(x, classing, ...) : variables are collinear
When working with taxonomic data, including both terminal and non-terminal nodes in the analysis can lead to collinearity problems. Non-terminal nodes (e.g., genus) are often linearly dependent on their corresponding terminal nodes (e.g., species), since species-level information is essentially a more specific representation of the genus-level information. This collinearity can violate the assumptions of certain statistical methods, such as linear discriminant analysis (LDA), and can lead to unstable or unreliable results. Using only terminal nodes eliminates this source of collinearity, so the analysis is not affected by linearly dependent or highly correlated variables. It also avoids redundancy, increases specificity, simplifies the data, and reduces ambiguity.
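The collinearity described above can be seen on a toy example. The sketch below uses made-up abundances (the names `species_a`, `species_b`, and `genus` are hypothetical) and assumes the genus abundance is exactly the sum of its two species. It only illustrates the linear dependence and is not part of lefser.

```r
# Toy abundances: two species and the genus that contains them (made-up data).
set.seed(1)
species_a <- runif(10)
species_b <- runif(10)
genus <- species_a + species_b  # assumed: parent node equals the sum of its children

with_parent <- cbind(species_a, species_b, genus)
terminal_only <- cbind(species_a, species_b)

# A matrix whose rank is lower than its number of columns has linearly
# dependent columns, which is what triggers the collinearity warning in LDA.
rank_with_parent <- qr(with_parent)$rank
rank_terminal_only <- qr(terminal_only)$rank

c(columns = ncol(with_parent), rank = rank_with_parent)
c(columns = ncol(terminal_only), rank = rank_terminal_only)
```

With the parent column included, the rank stays at the number of species columns; dropping it leaves a full-rank matrix.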
You can select only the terminal nodes using the get_terminal_nodes function.
tn <- get_terminal_nodes(rownames(zeller14))
zeller14tn <- zeller14[tn,]
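To make the idea of a terminal node concrete, here is a minimal base R sketch that works on pipe-delimited lineage strings like those in zeller14. The helper `is_terminal` is hypothetical and written for illustration only; it is not the implementation of get_terminal_nodes, which may differ in its details and return type.

```r
# Hypothetical helper, for illustration only; not the lefser implementation.
is_terminal <- function(lineages) {
  vapply(lineages, function(x) {
    # A lineage is non-terminal if some other lineage extends it with "|".
    !any(startsWith(lineages, paste0(x, "|")))
  }, logical(1), USE.NAMES = FALSE)
}

lineages <- c(
  "k__Bacteria",
  "k__Bacteria|p__Firmicutes",
  "k__Bacteria|p__Firmicutes|c__Clostridia",
  "k__Bacteria|p__Actinobacteria"
)

# Keep only the lineages that no other lineage extends.
lineages[is_terminal(lineages)]
```

In this toy input, only the Clostridia and Actinobacteria lineages are kept, because every other entry is a prefix of a longer one.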
The first warning message informs you that lefser requires relative abundances of features. You can use the relativeAb function to reformat your input.
zeller14tn_ra <- relativeAb(zeller14tn)
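As a rough illustration of what converting counts to relative abundances involves, the sketch below rescales each sample (column) of a toy matrix so that it sums to a fixed total. The function `to_relative` and its `total` argument are hypothetical; the scale used by relativeAb itself is not shown here and should be checked in its help page.

```r
# Toy count matrix: features in rows, samples in columns (made-up data).
counts <- matrix(c(10, 30, 60,
                    0,  5, 15),
                 nrow = 3,
                 dimnames = list(c("f1", "f2", "f3"), c("s1", "s2")))

# Hypothetical helper: divide each column by its total so that every
# sample sums to `total`. The default of 1 is an illustrative choice.
to_relative <- function(x, total = 1) {
  sweep(x, 2, colSums(x), "/") * total
}

rel <- to_relative(counts)
rel
colSums(rel)  # every sample now has the same total
```

After this step, samples with different sequencing depths are on a common scale, which is the property lefser asks for in its input.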
lefser
The lefser function returns a data.frame with two columns: the names of selected features (the features column) and their effect size (the scores column).
There is a random number generation step in the lefser algorithm to ensure that more than half of the values for each feature are unique. Because inputs are sparse in most cases, this step mainly handles zeros in practice. To reproduce identical results, set the seed before running lefser.
set.seed(1234)
res <- lefser(zeller14tn_ra, # relative abundance only with terminal nodes
              classCol = "study_condition",
              subclassCol = "age_category")
head(res)
#> features
#> 1 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__Ruminococcus|s__Ruminococcus_sp_5_1_39BFAA|t__GCF_000159975
#> 2 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Eubacteriaceae|g__Eubacterium|s__Eubacterium_hallii|t__GCF_000173975
#> 3 k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_salivarius|t__Streptococcus_salivarius_unclassified
#> 4 k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_catenulatum|t__GCF_000173455
#> 5 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Eubacteriaceae|g__Eubacterium|s__Eubacterium_ventriosum|t__GCF_000153885
#> 6 k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Lachnospiraceae|g__Lachnospiraceae_noname|s__Lachnospiraceae_bacterium_5_1_63FAA|t__GCF_000185525
#> scores
#> 1 -3.795320
#> 2 -3.700813
#> 3 -3.521247
#> 4 -3.154544
#> 5 -3.126245
#> 6 -3.082713
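The reason for setting the seed can be shown with a small sketch. The helper `add_jitter` below is hypothetical: it breaks ties among repeated values (mostly zeros) by adding tiny random noise, with an arbitrarily chosen noise scale. It is not the lefser source code; it only illustrates why a random step makes results depend on the seed.

```r
# Hypothetical illustration: break ties among zeros with tiny random noise.
# This is not the lefser source code; the noise scale is an arbitrary assumption.
add_jitter <- function(x, scale = 1e-10) {
  x + abs(rnorm(length(x), sd = scale))
}

abundance <- c(0, 0, 0, 0, 0.2, 0.8)  # sparse feature, mostly zeros

set.seed(1234)
first <- add_jitter(abundance)
set.seed(1234)
second <- add_jitter(abundance)

identical(first, second)   # same seed, same jittered values
length(unique(abundance))  # number of distinct values before jittering
length(unique(first))      # number of distinct values after jittering
```

Without resetting the seed between the two calls, the jittered values would differ slightly, and any downstream statistic computed from them could change as well.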
lefserPlot
lefserPlot(res)
The code for benchmarking lefser against LEfSe and the other R implementation of LEfSe is available here.
When using phyloseq objects, we recommend extracting the data and creating a SummarizedExperiment object as follows:
library(phyloseq)
library(SummarizedExperiment)
## Load phyloseq object
fp <- system.file("extdata",
                  "study_1457_split_library_seqs_and_mapping.zip",
                  package = "phyloseq")
kostic <- microbio_me_qiime(fp)
#> Found biom-format file, now parsing it...
#> Done parsing biom...
#> Importing Sample Metdadata from mapping file...
#> Merging the imported objects...
#> Successfully merged, phyloseq-class created.
#> Returning...
## Split data tables
counts <- unclass(otu_table(kostic))
coldata <- as(sample_data(kostic), "data.frame")
## Create a SummarizedExperiment object
SummarizedExperiment(assays = list(counts = counts), colData = coldata)
#> class: SummarizedExperiment
#> dim: 2505 190
#> metadata(0):
#> assays(1): counts
#> rownames(2505): 304309 469478 ... 206906 298806
#> rowData names(0):
#> colnames(190): C0333.N.518126 C0333.T.518046 ... 32I9UNA9.518098
#> BFJMKNMP.518102
#> colData names(71): X.SampleID BarcodeSequence ... HOST_TAXID
#> Description
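The construction above relies on the count matrix having features in rows and samples in columns, with the sample table rows matching the matrix columns. When assembling these pieces from other sources, a quick alignment check in base R can help. The objects below are toy data with hypothetical names, not the kostic dataset.

```r
# Toy inputs (made-up data): the sample table is deliberately out of order.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("otu1", "otu2", "otu3"), c("s1", "s2")))
coldata <- data.frame(group = c("case", "control"),
                      row.names = c("s2", "s1"))

# Every sample in the count matrix must have a row in the sample table.
stopifnot(setequal(colnames(counts), rownames(coldata)))

# Reorder the sample table so its rows follow the count matrix columns.
coldata <- coldata[colnames(counts), , drop = FALSE]
identical(rownames(coldata), colnames(counts))
```

Once the two objects are aligned, they can be passed to the SummarizedExperiment constructor as shown above.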
You may also consider using makeTreeSummarizedExperimentFromPhyloseq from the mia package.
mia::makeTreeSummarizedExperimentFromPhyloseq(kostic)
#> class: TreeSummarizedExperiment
#> dim: 2505 190
#> metadata(0):
#> assays(1): counts
#> rownames(2505): 304309 469478 ... 206906 298806
#> rowData names(7): Kingdom Phylum ... Genus Species
#> colnames(190): C0333.N.518126 C0333.T.518046 ... 32I9UNA9.518098
#> BFJMKNMP.518102
#> colData names(71): X.SampleID BarcodeSequence ... HOST_TAXID
#> Description
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] phyloseq_1.50.0 lefser_1.16.0
#> [3] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [5] GenomicRanges_1.58.0 GenomeInfoDb_1.42.1
#> [7] IRanges_2.40.1 S4Vectors_0.44.0
#> [9] BiocGenerics_0.52.0 MatrixGenerics_1.18.0
#> [11] matrixStats_1.5.0 BiocStyle_2.34.0
#>
#> loaded via a namespace (and not attached):
#> [1] splines_4.4.2 ggplotify_0.1.2
#> [3] tibble_3.2.1 rpart_4.1.24
#> [5] DirichletMultinomial_1.48.0 lifecycle_1.0.4
#> [7] lattice_0.22-6 MASS_7.3-64
#> [9] MultiAssayExperiment_1.32.0 backports_1.5.0
#> [11] magrittr_2.0.3 Hmisc_5.2-1
#> [13] sass_0.4.9 rmarkdown_2.29
#> [15] jquerylib_0.1.4 yaml_2.3.10
#> [17] DBI_1.2.3 minqa_1.2.8
#> [19] ade4_1.7-22 multcomp_1.4-26
#> [21] abind_1.4-8 zlibbioc_1.52.0
#> [23] purrr_1.0.2 yulab.utils_0.1.9
#> [25] nnet_7.3-20 TH.data_1.1-2
#> [27] sandwich_3.1-1 GenomeInfoDbData_1.2.13
#> [29] ggrepel_0.9.6 irlba_2.3.5.1
#> [31] tidytree_0.4.6 testthat_3.2.2
#> [33] vegan_2.6-8 rbiom_1.0.3
#> [35] pkgdown_2.1.1 permute_0.9-7
#> [37] DelayedMatrixStats_1.28.0 codetools_0.2-20
#> [39] coin_1.4-3 DelayedArray_0.32.0
#> [41] scuttle_1.16.0 tidyselect_1.2.1
#> [43] aplot_0.2.4 UCSC.utils_1.2.0
#> [45] farver_2.1.2 lme4_1.1-35.5
#> [47] ScaledMatrix_1.14.0 viridis_0.6.5
#> [49] base64enc_0.1-3 jsonlite_1.8.9
#> [51] BiocNeighbors_2.0.1 multtest_2.62.0
#> [53] decontam_1.26.0 mia_1.14.0
#> [55] Formula_1.2-5 survival_3.8-3
#> [57] scater_1.34.0 iterators_1.0.14
#> [59] systemfonts_1.1.0 foreach_1.5.2
#> [61] tools_4.4.2 treeio_1.30.0
#> [63] ragg_1.3.3 Rcpp_1.0.13-1
#> [65] glue_1.8.0 gridExtra_2.3
#> [67] SparseArray_1.6.0 xfun_0.50
#> [69] mgcv_1.9-1 TreeSummarizedExperiment_2.14.0
#> [71] dplyr_1.1.4 withr_3.0.2
#> [73] BiocManager_1.30.25 fastmap_1.2.0
#> [75] boot_1.3-31 rhdf5filters_1.18.0
#> [77] bluster_1.16.0 digest_0.6.37
#> [79] rsvd_1.0.5 R6_2.5.1
#> [81] gridGraphics_0.5-1 textshaping_0.4.1
#> [83] colorspace_2.1-1 lpSolve_5.6.23
#> [85] tidyr_1.3.1 generics_0.1.3
#> [87] data.table_1.16.4 DECIPHER_3.2.0
#> [89] httr_1.4.7 htmlwidgets_1.6.4
#> [91] S4Arrays_1.6.0 pkgconfig_2.0.3
#> [93] gtable_0.3.6 modeltools_0.2-23
#> [95] SingleCellExperiment_1.28.1 XVector_0.46.0
#> [97] brio_1.1.5 htmltools_0.5.8.1
#> [99] bookdown_0.42 biomformat_1.34.0
#> [101] scales_1.3.0 ggfun_0.1.8
#> [103] knitr_1.49 rstudioapi_0.17.1
#> [105] reshape2_1.4.4 checkmate_2.3.2
#> [107] nlme_3.1-166 nloptr_2.1.1
#> [109] cachem_1.1.0 zoo_1.8-12
#> [111] rhdf5_2.50.1 stringr_1.5.1
#> [113] parallel_4.4.2 vipor_0.4.7
#> [115] libcoin_1.0-10 foreign_0.8-87
#> [117] desc_1.4.3 pillar_1.10.1
#> [119] grid_4.4.2 vctrs_0.6.5
#> [121] slam_0.1-55 BiocSingular_1.22.0
#> [123] beachmat_2.22.0 cluster_2.1.8
#> [125] beeswarm_0.4.0 htmlTable_2.4.3
#> [127] evaluate_1.0.1 mvtnorm_1.3-2
#> [129] cli_3.6.3 compiler_4.4.2
#> [131] rlang_1.1.4 crayon_1.5.3
#> [133] labeling_0.4.3 mediation_4.5.0
#> [135] plyr_1.8.9 fs_1.6.5
#> [137] ggbeeswarm_0.7.2 stringi_1.8.4
#> [139] viridisLite_0.4.2 BiocParallel_1.40.0
#> [141] munsell_0.5.1 Biostrings_2.74.1
#> [143] lazyeval_0.2.2 Matrix_1.7-1
#> [145] patchwork_1.3.0 sparseMatrixStats_1.18.0
#> [147] ggplot2_3.5.1 Rhdf5lib_1.28.0
#> [149] igraph_2.1.3 RcppParallel_5.1.9
#> [151] bslib_0.8.0 ggtree_3.14.0
#> [153] ape_5.8-1