Geistlinger_enrichOmics.Rmd
CUNY School of Public Health 55 W 125th St, New York, NY 10027
This workshop gives an in-depth overview of existing methods for enrichment analysis of gene expression data with regard to functional gene sets, pathways, and networks. The workshop will help participants understand the distinctions between assumptions and hypotheses of existing methods as well as the differences in objectives and interpretation of results. It will provide code and hands-on practice of all necessary steps for differential expression analysis, gene set- and network-based enrichment analysis, and identification of enriched genomic regions and regulatory elements, along with visualization and exploration of results.
Basic knowledge of R syntax
Familiarity with the SummarizedExperiment class
Familiarity with the GenomicRanges class
Familiarity with high-throughput gene expression data as obtained with microarrays and RNA-seq
Familiarity with the concept of differential expression analysis (with e.g. limma, edgeR, DESeq2)
Activity | Time |
---|---|
Background | 30m |
Differential expression analysis | 15m |
Gene set analysis | 30m |
Gene network analysis | 15m |
Genomic region analysis | 15m |
Theory:
Practice:
Geistlinger L, Csaba G, Zimmer R. Bioconductor’s EnrichmentBrowser: seamless navigation through combined results of set- & network-based enrichment analysis. BMC Bioinformatics, 17:45, 2016.
Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C, Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform, 2020.
Geistlinger L, Csaba G, Küffner R, Mulder N, Zimmer R. From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13):i366-73, 2011.
Geistlinger L, Csaba G, Dirmeier S, Küffner R, Zimmer R. A comprehensive gene regulatory network for the diauxic shift in Saccharomyces cerevisiae. Nucleic Acids Res, 41(18):8452-63, 2013.
Test whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed (DE) genes. See Goeman and Buehlmann, 2007 for a critical review.
Example: Transcriptomic study, in which 12,671 genes have been tested for differential expression between two sample conditions and 529 genes were found DE.
Among the DE genes, 28 are annotated to a specific functional gene set, which contains in total 170 genes. This setup corresponds to a 2x2 contingency table,
deTable <- matrix(c(28, 142, 501, 12000), nrow = 2, dimnames = list(c("DE", "Not.DE"), c("In.gene.set", "Not.in.gene.set"))) deTable
## In.gene.set Not.in.gene.set
## DE 28 501
## Not.DE 142 12000
where the overlap of 28 genes can be assessed based on the hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test, yielding here a significant enrichment.
fisher.test(deTable, alternative = "greater")
##
## Fisher's Exact Test for Count Data
##
## data: deTable
## p-value = 4.088e-10
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
## 3.226736 Inf
## sample estimates:
## odds ratio
## 4.721744
This basic principle is at the foundation of major public and commercial enrichment tools such as DAVID and Pathway Studio.
Although gene set enrichment methods have been primarily developed and applied on transcriptomic data, they have recently been modified, extended and applied also in other fields of genomic and biomedical research. This includes novel approaches for functional enrichment analysis of proteomic and metabolomic data as well as genomic regions and disease phenotypes, Lavallee and Yates, 2016, Chagoyen et al., 2016, McLean et al., 2010, Ried et al., 2012.
The first part of the workshop is largely based on the EnrichmentBrowser package, which implements an analysis pipeline for high-throughput gene expression data as measured with microarrays and RNA-seq. In a workflow-like manner, the package brings together a selection of established Bioconductor packages for gene expression data analysis. It integrates a wide range of gene set enrichment analysis methods and facilitates combination and exploration of results across methods.
library(EnrichmentBrowser)
Further information can be found in the vignette and publication.
Gene sets, pathways & regulatory networks
Gene sets are simple lists of usually functionally related genes without further specification of relationships between genes.
Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process. Pathways are commonly divided in metabolic and signaling pathways. Metabolic pathways such as glycolysis represent biochemical substrate conversions by specific enzymes. Signaling pathways such as the MAPK signaling pathway describe signal transduction cascades from receptor proteins to transcription factors, resulting in activation or inhibition of specific target genes.
Gene regulatory networks describe the interplay and effects of regulatory factors (such as transcription factors and microRNAs) on the expression of their target genes.
Resources
GO and KEGG annotations are most frequently used for the enrichment analysis of functional gene sets. Despite an increasing number of gene set and pathway databases, they are typically the first choice due to their long-standing curation and availability for a wide range of species.
GO: The Gene Ontology (GO) consists of three major sub-ontologies that classify gene products according to molecular function (MF), biological process (BP) and cellular component (CC). Each ontology consists of GO terms that define MFs, BPs or CCs to which specific genes are annotated. The terms are organized in a directed acyclic graph, where edges between the terms represent relationships of different types. They relate the terms according to a parent-child scheme, i.e. parent terms denote more general entities, whereas child terms represent more specific entities.
KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Metabolism and drug development pathways differ from pathways of the other 5 categories by illustrating reactions between chemical compounds. Pathways of the other 5 categories illustrate molecular interactions between genes and gene products.
Gene set analysis vs. gene set enrichment analysis
The two predominantly used enrichment methods are:
However, the term gene set enrichment analysis nowadays subsumes a general strategy implemented by a wide range of methods Huang et al., 2009. Those methods have in common the same goal, although approach and statistical model can vary substantially Goeman and Buehlmann, 2007, Khatri et al., 2012.
To better distinguish from the specific method, some authors use the term gene set analysis to denote the general strategy. However, there is also a specific method from Efron and Tibshirani, 2007 of this name.
Underlying null: competitive vs. self-contained
Goeman and Buehlmann, 2007 classified existing enrichment methods into competitive and self-contained based on the underlying null hypothesis.
Although the authors argue that a self-contained null is closer to the actual question of interest, the vast majority of enrichment methods is competitive.
Goeman and Buehlmann further raise several critical issues concerning the 2x2 ORA:
With regard to these statistical concerns, GSEA is considered superior:
However, the simplicity and general applicability of ORA is unmet by subsequent methods improving on these issues. For instance, GSEA requires the expression data as input, which is not available for gene lists derived from other experiment types. On the other hand, the involved sample permutation procedure has been proven inaccurate and time-consuming Efron and Tibshirani, 2007, Phipson and Smyth, 2010, Larson and Owen, 2015.
Generations: ora, fcs & topology-based
Khatri et al., 2012 have taken a slightly different approach by classifying methods along the timeline of development into three generations:
Although topology-based (also: network-based) methods appear to be most realistic, their straightforward application can be impaired by features that are not-detectable on the transcriptional level (such as protein-protein interactions) and insufficient network knowledge Geistlinger et al., 2013, Bayerlova et al., 2015.
Given the individual benefits and limitations of existing methods, cautious interpretation of results is required to derive valid conclusions. Whereas no single method is best suited for all application scenarios, applying multiple methods can be beneficial. This has been shown to filter out spurious hits of individual methods, thereby reducing the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.
Guidelines for input and method selection
Based on a comprehensive assessment of 10 major enrichment methods, we recently identified significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested, and detection of relevant processes Geistlinger et al., 2020.
Based on these results, we make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance:
For the exploratory analysis of simple gene lists, we recommend ORA given its ease of applicability, fast runtime and evident relevance of resulting gene set rankings, provided that input gene list and reference gene list are chosen carefully and remembering ORA’s propensity for type I error rate inflation when genes tend to be co-expressed within sets.
For the analysis of pre-ranked gene lists accompanied by gene scores such as fold changes, alternatives to ORA such as pre-ranked GSEA or pre-ranked CAMERA exist.
For expression-based enrichment analysis on the full expression matrix, we recommend providing normalized log2 intensities for microarray data, and logTPMs (or logRPKMs / logFPKMs) for RNA-seq data. When given raw read counts, we recommend to apply a variance-stabilizing transformation such as voom to arrive at library-size normalized logCPMs.
If the question of interest is to test for association of any gene in the set with the phenotype (self-contained null hypothesis), we recommend ROAST or GSVA that both test a directional hypothesis (genes in the set tend to be either predominantly up- or down-regulated). Both methods can be applied for simple or extended experimental designs, where ROAST is the more natural choice for the comparison of sample groups and also allows one to test a mixed hypothesis (genes in the set tend to be differentially expressed, regardless of the direction). The main strength of GSVA lies in its capabilities for analyzing single samples.
If the question of interest is to test for excess of differential expression in a gene set relative to genes outside the set (competitive null hypothesis), which we believe comes closest to the expectations and intuition of most end users when performing GSEA, we recommend PADOG, which is slower to run but resolves major shortcomings of ORA, and has desirable properties for the analyzed criteria and when compared to other competitive methods. However, PADOG is limited to testing a mixed hypothesis in a comparison of two sample groups, optionally including paired samples or sample batches. Therefore, we recommend the highly customizable SAFE for testing a directional hypothesis or in situations of more complex experimental designs such as comparisons between multiple groups, continuous phenotypes or the presence of covariates.
See also Geistlinger et al., 2020 for the results of the benchmarking study and the GSEABenchmarkeR package for a general framework for reproducible benchmarking of gene set enrichment methods.
Although RNA-seq (read count data) has become the de facto standard for transcriptomic profiling, it is important to know that many methods for differential expression and gene set enrichment analysis have been originally developed for microarray data (intensity measurements).
However, differences in data distribution assumptions (microarray: quasi-normal, RNA-seq: negative binomial) made adaptations in differential expression analysis and, to some extent, also in gene set enrichment analysis necessary.
Thus, we consider two example datasets - a microarray and a RNA-seq dataset, and discuss similarities and differences of the respective analysis steps.
For microarray data, we consider expression measurements of patients with acute lymphoblastic leukemia Chiaretti et al., 2004. A frequent chromosomal defect found among these patients is a translocation, in which parts of chromosome 9 and 22 swap places. This results in the oncogenic fusion gene BCR/ABL created by positioning the ABL1 gene on chromosome 9 to a part of the BCR gene on chromosome 22.
We load the ALL dataset
and select B-cell ALL patients with and without the BCR/ABL fusion, as described previously Gentleman et al., 2005.
ind.bs <- grep("^B", ALL$BT) ind.mut <- which(ALL$mol.biol %in% c("BCR/ABL", "NEG")) sset <- intersect(ind.bs, ind.mut) all.eset <- ALL[, sset]
We can now access the expression values, which are intensity measurements on a log-scale for 12,625 probes (rows) across 79 patients (columns).
dim(all.eset)
## Features Samples
## 12625 79
exprs(all.eset)[1:4,1:4]
## 01005 01010 03002 04007
## 1000_at 7.597323 7.479445 7.567593 7.905312
## 1001_at 5.046194 4.932537 4.799294 4.844565
## 1002_f_at 3.900466 4.208155 3.886169 3.416923
## 1003_s_at 5.903856 6.169024 5.860459 5.687997
As we often have more than one probe per gene, we compute gene expression values as the average of the corresponding probe values.
allSE <- probe2gene(all.eset) head(names(allSE))
## [1] "5595" "7075" "1557" "643" "1843" "4319"
For RNA-seq data, we consider transcriptome profiles of four primary human airway smooth muscle cell lines in two conditions: control and treatment with dexamethasone Himes et al., 2014.
We load the airway dataset
For further analysis, we only keep genes that are annotated to an ENSEMBL gene ID.
## [1] 63677 8
assay(airSE)[1:4,1:4]
## SRR1039508 SRR1039509 SRR1039512 SRR1039513
## ENSG00000000003 679 448 873 408
## ENSG00000000005 0 0 0 0
## ENSG00000000419 467 515 621 365
## ENSG00000000457 260 211 263 164
Normalization of high-throughput expression data is essential to make results within and between experiments comparable. Microarray (intensity measurements) and RNA-seq (read counts) data typically show distinct features that need to be normalized for. As this is beyond the scope of this workshop, we refer to limma for microarray normalization and EDASeq for RNA-seq normalization. See also EnrichmentBrowser::normalize
, which wraps commonly used functionality for normalization.
The EnrichmentBrowser incorporates established functionality from the limma package for differential expression analysis. This involves the voom
transformation when applied to RNA-seq data. Alternatively, differential expression analysis for RNA-seq data can also be carried out based on the negative binomial distribution with edgeR and DESeq2.
This can be performed using the function EnrichmentBrowser::deAna
and assumes some standardized variable names:
For more information on experimental design, see the limma user’s guide, chapter 9.
For the ALL dataset, the GROUP variable indicates whether the BCR-ABL gene fusion is present (1) or not (0).
##
## 0 1
## 42 37
For the airway dataset, it indicates whether the cell lines have been treated with dexamethasone (1) or not (0).
##
## 0 1
## 4 4
Paired samples, or in general sample batches/blocks, can be defined via a BLOCK
column in the colData
slot. For the airway dataset, the sample blocks correspond to the four different cell lines.
airSE$BLOCK <- airway$cell table(airSE$BLOCK)
##
## N052611 N061011 N080611 N61311
## 2 2 2 2
For microarray data, the EnrichmentBrowser::deAna
function carries out differential expression analysis based on functionality from the limma package. Resulting log2 fold changes and t-test derived p-values for each gene are appended to the rowData
slot.
allSE <- deAna(allSE) rowData(allSE)
## DataFrame with 9009 rows and 4 columns
## FC limma.STAT PVAL ADJ.PVAL
## <numeric> <numeric> <numeric> <numeric>
## 5595 0.0429699 0.734676 0.4646865 0.899340
## 7075 0.0320835 0.454690 0.6505647 0.949052
## 1557 -0.0439401 -1.065773 0.2897348 0.818508
## 643 -0.0277544 -0.567391 0.5720386 0.929215
## 1843 -0.4273025 -1.750504 0.0838684 0.566394
## ... ... ... ... ...
## 6300 -0.02665177 -0.608604 0.544512 0.922900
## 7297 -0.12425768 -1.112795 0.269132 0.804769
## 2246 0.05224289 1.274073 0.206329 0.748238
## 7850 -0.00908230 -0.102406 0.918691 0.991836
## 1593 -0.00747714 -0.145650 0.884565 0.989546
Nominal p-values are already corrected for multiple testing (ADJ.PVAL
) using the method from Benjamini and Hochberg implemented in stats::p.adjust
.
For RNA-seq data, the deAna
function can be used to carry out differential expression analysis between the two groups either based on functionality from limma (that includes the voom
transformation), or alternatively, the frequently used edgeR or DESeq2 package. Here, we use the analysis based on edgeR.
airSE <- deAna(airSE, de.method = "edgeR")
## Excluding 47751 genes not satisfying min.cpm threshold
rowData(airSE)
## DataFrame with 15926 rows and 4 columns
## FC edgeR.STAT PVAL ADJ.PVAL
## <numeric> <numeric> <numeric> <numeric>
## ENSG00000000003 -0.3901002 31.0558140 0.000232422 0.00217355
## ENSG00000000419 0.1978022 6.6454709 0.027419893 0.07560513
## ENSG00000000457 0.0291609 0.0929623 0.766666551 0.84808859
## ENSG00000000460 -0.1243820 0.3832263 0.549659194 0.67996523
## ENSG00000000971 0.4172901 28.7686093 0.000312276 0.00272063
## ... ... ... ... ...
## ENSG00000273373 -0.0438722 0.0397087 0.8460260 0.901607
## ENSG00000273382 -0.8597567 7.7869742 0.0190219 0.057267
## ENSG00000273448 0.0281667 0.0103270 0.9752405 0.984888
## ENSG00000273472 -0.4642705 1.9010366 0.1978818 0.328963
## ENSG00000273486 -0.1109445 0.1536285 0.7032766 0.802377
Exercise: Compare the number of differentially expressed genes as obtained on the airSE
with limma/voom
, edgeR
, and DESeq2
.
We are now interested in whether pre-defined sets of genes that are known to work together, e.g. as defined in the Gene Ontology or the KEGG pathway annotation, are coordinately differentially expressed.
The function getGenesets
can be used to download gene sets from databases such as GO and KEGG. Here, we use the function to download all KEGG pathways for a chosen organism (here: ) as gene sets.
kegg.gs <- getGenesets(org = "hsa", db = "kegg") kegg.gs[1:2]
## $`hsa00010_Glycolysis_/_Gluconeogenesis`
## [1] "10327" "124" "125" "126" "127" "128" "130" "130589"
## [9] "131" "160287" "1737" "1738" "2023" "2026" "2027" "217"
## [17] "218" "219" "2203" "221" "222" "223" "224" "226"
## [25] "229" "230" "2538" "2597" "26330" "2645" "2821" "3098"
## [33] "3099" "3101" "387712" "3939" "3945" "3948" "441531" "501"
## [41] "5105" "5106" "5160" "5161" "5162" "5211" "5213" "5214"
## [49] "5223" "5224" "5230" "5232" "5236" "5313" "5315" "55276"
## [57] "55902" "57818" "669" "7167" "80201" "83440" "84532" "8789"
## [65] "92483" "92579" "9562"
##
## $`hsa00020_Citrate_cycle_(TCA_cycle)`
## [1] "1431" "1737" "1738" "1743" "2271" "3417" "3418" "3419" "3420"
## [10] "3421" "4190" "4191" "47" "48" "4967" "50" "5091" "5105"
## [19] "5106" "5160" "5161" "5162" "55753" "6389" "6390" "6391" "6392"
## [28] "8801" "8802" "8803"
Analogously, the function getGenesets
can be used to retrieve GO terms of a selected ontology (here: biological process, BP) as defined in the GO.db annotation package.
go.gs <- getGenesets(org = "hsa", db = "go", onto = "BP", mode = "GO.db") go.gs[1:2]
## $`GO:0000002_mitochondrial_genome_maintenance`
## [1] "10000" "1890" "291" "4205" "4358" "4976" "55154" "55186" "80119"
## [10] "84275" "92667" "9361"
##
## $`GO:0000003_reproduction`
## [1] "2796" "2797" "286826" "8510"
If provided a file, the function getGenesets
parses user-defined gene sets from GMT file format. Here, we use this functionality for reading a list of already downloaded KEGG gene sets for Homo sapiens containing NCBI Entrez Gene IDs.
data.dir <- system.file("extdata", package = "EnrichmentBrowser") gmt.file <- file.path(data.dir, "hsa_kegg_gs.gmt") hsa.gs <- getGenesets(gmt.file) hsa.gs[1:2]
## $hsa05416_Viral_myocarditis
## [1] "100509457" "101060835" "1525" "1604" "1605" "1756"
## [7] "1981" "1982" "25" "2534" "27" "3105"
## [13] "3106" "3107" "3108" "3109" "3111" "3112"
## [19] "3113" "3115" "3117" "3118" "3119" "3122"
## [25] "3123" "3125" "3126" "3127" "3133" "3134"
## [31] "3135" "3383" "3683" "3689" "3908" "4624"
## [37] "4625" "54205" "5551" "5879" "5880" "5881"
## [43] "595" "60" "637" "6442" "6443" "6444"
## [49] "6445" "71" "836" "841" "842" "857"
## [55] "8672" "940" "941" "942" "958" "959"
##
## $`hsa04622_RIG-I-like_receptor_signaling_pathway`
## [1] "10010" "1147" "1432" "1540" "1654" "23586" "26007" "29110"
## [9] "338376" "340061" "3439" "3440" "3441" "3442" "3443" "3444"
## [17] "3445" "3446" "3447" "3448" "3449" "3451" "3452" "3456"
## [25] "3467" "3551" "3576" "3592" "3593" "3627" "3661" "3665"
## [33] "4214" "4790" "4792" "4793" "5300" "54941" "55593" "5599"
## [41] "5600" "5601" "5602" "5603" "56832" "57506" "5970" "6300"
## [49] "64135" "64343" "6885" "7124" "7186" "7187" "7189" "7706"
## [57] "79132" "79671" "80143" "841" "843" "8517" "8717" "8737"
## [65] "8772" "9140" "9474" "9636" "9641" "9755"
Note #1: Gene set collections for 11 different species from the
Molecular Signatures Database (MSigDB) can be obtained using getGenesets
with db = "msigdb"
. For example, the Hallmark gene set collection can be obtained from MSigDB via:
hall.gs <- getGenesets(org = "hsa", db = "msigdb", cat = "H") hall.gs[1:2]
## $M5890_HALLMARK_TNFA_SIGNALING_VIA_NFKB
## [1] "19" "57007" "374" "467" "490" "2683" "9334" "597"
## [9] "602" "604" "8553" "329" "330" "650" "694" "7832"
## [17] "10950" "6347" "6364" "6351" "6352" "3491" "595" "57018"
## [25] "9034" "960" "969" "941" "9308" "1026" "1051" "1052"
## [33] "8837" "23529" "1435" "1437" "2919" "3627" "6373" "2920"
## [41] "2921" "6372" "23586" "23258" "11080" "55332" "1843" "1844"
## [49] "1846" "1847" "1906" "1942" "1958" "1959" "1960" "10938"
## [57] "10209" "2114" "2150" "2152" "24147" "2353" "2354" "8061"
## [65] "2355" "2526" "50486" "1647" "4616" "2643" "2669" "9945"
## [73] "1880" "1839" "3280" "3383" "23308" "3398" "9592" "8870"
## [81] "51278" "64135" "3433" "3460" "3593" "3601" "3606" "3552"
## [89] "3553" "51561" "3569" "3572" "3575" "3624" "3659" "8660"
## [97] "182" "3725" "3726" "23135" "7071" "10365" "9314" "1316"
## [105] "687" "8942" "3914" "3949" "3976" "9516" "23764" "5606"
## [113] "1326" "4082" "4170" "9242" "4084" "4609" "10135" "10725"
## [121] "4780" "4783" "4790" "4791" "4792" "4794" "4814" "3164"
## [129] "4929" "8013" "4973" "24145" "5142" "10611" "5187" "5209"
## [137] "22822" "7262" "5328" "5329" "5341" "10769" "8613" "56937"
## [145] "10957" "23645" "5734" "5743" "5791" "5806" "1827" "5966"
## [153] "5970" "5971" "388" "8767" "127544" "6303" "6385" "5055"
## [161] "5271" "5054" "6446" "150094" "9120" "6515" "11182" "4088"
## [169] "8303" "9021" "6648" "8877" "80176" "8878" "6776" "10010"
## [177] "6890" "7050" "25976" "7097" "3371" "7124" "7127" "7128"
## [185] "7130" "25816" "3604" "8744" "10318" "79155" "7185" "10221"
## [193] "9322" "8848" "7280" "7422" "79693" "65986" "80149" "7538"
##
## $M5891_HALLMARK_HYPOXIA
## [1] "57007" "133" "136" "205" "9590" "226" "229" "230"
## [9] "272" "51129" "55139" "302" "467" "538" "126792" "124872"
## [17] "63827" "596" "633" "8553" "665" "680" "694" "771"
## [25] "839" "857" "284119" "112464" "3491" "1490" "8839" "901"
## [33] "1026" "1027" "1028" "9435" "9469" "10370" "1289" "1356"
## [41] "1466" "7852" "1634" "1649" "54541" "10570" "1837" "1843"
## [49] "1907" "1942" "1944" "1956" "2023" "2026" "2027" "30001"
## [57] "54206" "2113" "2131" "2152" "26355" "2203" "2353" "2355"
## [65] "2309" "2548" "2584" "2597" "26330" "2632" "2645" "2651"
## [73] "2745" "2817" "2719" "2239" "2821" "9380" "2997" "3036"
## [81] "3069" "3073" "3098" "3099" "3162" "3219" "9957" "3309"
## [89] "3423" "8870" "3484" "3486" "3569" "10994" "3623" "8660"
## [97] "3669" "23210" "3725" "11015" "55818" "3798" "1316" "8609"
## [105] "54800" "3906" "9215" "3939" "3948" "4015" "56925" "23764"
## [113] "4214" "4282" "4493" "4502" "4601" "4627" "55577" "1463"
## [121] "10397" "3340" "8509" "23327" "4783" "25819" "2908" "5033"
## [129] "8974" "5066" "5105" "5155" "5163" "5165" "5209" "5211"
## [137] "5214" "5224" "5228" "5230" "5236" "55276" "5260" "5292"
## [145] "5313" "5317" "51316" "5329" "123" "10957" "10891" "8497"
## [153] "23645" "5507" "25824" "5578" "5837" "3516" "6095" "58528"
## [161] "6275" "8819" "949" "6383" "9672" "6385" "8991" "5054"
## [169] "6478" "6576" "6513" "6515" "6518" "2542" "6533" "8406"
## [177] "8987" "6781" "8614" "6820" "26136" "7043" "7045" "7052"
## [185] "25976" "8277" "55076" "7128" "7162" "7163" "7167" "8459"
## [193] "7360" "7422" "7428" "7436" "26118" "7511" "7538" "23036"
Note #2: Gene set libraries from the comprehensive Enrichr collection for 5 different species can be obtained using getGenesets
with db = "enrichr"
. For example, gene sets based on transcription factor interactions can be obtained from Enrichr via:
tfppi.gs <- getGenesets(org = "hsa", db = "enrichr", lib = "Transcription_Factor_PPIs") tfppi.gs[1:2]
Note #3: The idMap
function can be used to map gene sets from NCBI Entrez Gene IDs to other common gene ID types such as ENSEMBL gene IDs or HGNC symbols.\ For example, to map the gene sets from Entrez Gene IDs to gene symbols:
hsa.gs.sym <- idMap(hsa.gs, org = "hsa", from = "ENTREZID", to = "SYMBOL") hsa.gs.sym[1:2]
## $hsa05416_Viral_myocarditis
## [1] "ABL1" "ABL2" "ACTB" "ACTG1" "BID" "CASP3"
## [7] "CASP8" "CASP9" "CAV1" "CCND1" "CD28" "CD40"
## [13] "CD40LG" "CD55" "CD80" "CD86" "CXADR" "CYCS"
## [19] "DAG1" "DMD" "EIF4G1" "EIF4G2" "EIF4G3" "FYN"
## [25] "HLA-A" "HLA-B" "HLA-C" "HLA-DMA" "HLA-DMB" "HLA-DOA"
## [31] "HLA-DOB" "HLA-DPA1" "HLA-DPB1" "HLA-DQA1" "HLA-DQA2" "HLA-DQB1"
## [37] "HLA-DRA" "HLA-DRB1" "HLA-DRB3" "HLA-DRB4" "HLA-DRB5" "HLA-E"
## [43] "HLA-F" "HLA-G" "ICAM1" "ITGAL" "ITGB2" "LAMA2"
## [49] "MYH6" "MYH7" "PRF1" "RAC1" "RAC2" "RAC3"
## [55] "SGCA" "SGCB" "SGCD" "SGCG"
##
## $`hsa04622_RIG-I-like_receptor_signaling_pathway`
## [1] "ATG12" "ATG5" "AZI2" "CASP10" "CASP8" "CHUK" "CXCL10" "CXCL8"
## [9] "CYLD" "DDX3X" "DDX58" "DHX58" "FADD" "IFIH1" "IFNA1" "IFNA10"
## [17] "IFNA13" "IFNA14" "IFNA16" "IFNA17" "IFNA2" "IFNA21" "IFNA4" "IFNA5"
## [25] "IFNA6" "IFNA7" "IFNA8" "IFNB1" "IFNE" "IFNK" "IFNW1" "IKBKB"
## [33] "IKBKE" "IKBKG" "IL12A" "IL12B" "IRF3" "IRF7" "ISG15" "MAP3K1"
## [41] "MAP3K7" "MAPK10" "MAPK11" "MAPK12" "MAPK13" "MAPK14" "MAPK8" "MAPK9"
## [49] "MAVS" "NFKB1" "NFKBIA" "NFKBIB" "NLRX1" "OTUD5" "PIN1" "RELA"
## [57] "RIPK1" "RNF125" "SIKE1" "STING1" "TANK" "TBK1" "TBKBP1" "TKFC"
## [65] "TNF" "TRADD" "TRAF2" "TRAF3" "TRAF6" "TRIM25"
A variety of gene set analysis methods have been proposed Khatri et al., 2012. The most basic, yet frequently used, method is the over-representation analysis (ORA) with gene sets defined according to GO or KEGG. As outlined in the first section, ORA tests the overlap between DE genes (typically DE p-value < 0.05) and genes in a gene set based on the hypergeometric distribution. Here, we choose a significance level \(\alpha = 0.2\) for demonstration.
## DataFrame with 7 rows and 4 columns
## GENE.SET NR.GENES NR.SIG.GENES PVAL
## <character> <numeric> <numeric> <numeric>
## 1 hsa05202_Transcripti.. 153 17 0.0351
## 2 hsa05412_Arrhythmoge.. 63 8 0.0717
## 3 hsa05144_Malaria 45 6 0.0932
## 4 hsa04670_Leukocyte_t.. 94 10 0.1220
## 5 hsa05100_Bacterial_i.. 64 7 0.1620
## 6 hsa04622_RIG-I-like_.. 54 6 0.1780
## 7 hsa05130_Pathogenic_.. 43 5 0.1840
Such a ranked list is the standard output of most existing enrichment tools. Using the eaBrowse
function creates a HTML summary from which each gene set can be inspected in more detail.
eaBrowse(ora.all)
The resulting summary page includes for each significant gene set
NR.GENES
),SET.VIEW
, supports mouse-over and click-on),PATH.VIEW
, supports mouse-over and click-on).As ORA works on the list of DE genes and not the actual expression values, it can be straightforward applied to RNA-seq data. However, as the gene sets here contain NCBI Entrez gene IDs and the airway dataset contains ENSEMBL gene ids, we first map the airway dataset to Entrez IDs.
airSE <- idMap(airSE, org = "hsa", from = "ENSEMBL", to = "ENTREZID")
## Encountered 85 from.IDs with >1 corresponding to.ID
## (the first to.ID was chosen for each of them)
## Excluded 2391 from.IDs without a corresponding to.ID
## Encountered 6 to.IDs with >1 from.ID (the first from.ID was chosen for each of them)
## Mapped from.IDs have been added to the rowData column ENSEMBL
## DataFrame with 360 rows and 4 columns
## GENE.SET NR.GENES NR.SIG.GENES PVAL
## <character> <numeric> <numeric> <numeric>
## 1 GO:0030198_extracell.. 163 96 7.87e-10
## 2 GO:0007155_cell_adhe.. 291 149 2.08e-08
## 3 GO:0030336_negative_.. 88 56 6.71e-08
## 4 GO:0007186_G_protein.. 282 136 6.11e-06
## 5 GO:0006936_muscle_co.. 62 38 3.03e-05
## ... ... ... ... ...
## 356 GO:0033077_T_cell_di.. 24 13 0.0472
## 357 GO:0048839_inner_ear.. 24 13 0.0472
## 358 GO:0030282_bone_mine.. 31 16 0.0480
## 359 GO:0031100_animal_or.. 31 16 0.0480
## 360 GO:0051781_positive_.. 31 16 0.0480
Note #1: Young et al., 2010, have reported biased results for ORA on RNA-seq data due to over-detection of differential expression for long and highly expressed transcripts. The goseq package and limma::goana
implement possibilities to adjust ORA for gene length and abundance bias.
Note #2: Independent of the expression data type under investigation, overlap between gene sets can result in redundant findings. This is well-documented for GO (parent-child structure, Rhee et al., 2008) and KEGG (pathway overlap/crosstalk, Donato et al., 2013). The topGO package (explicitly designed for GO) and mgsa (applicable to arbitrary gene set definitions) implement modifications of ORA to account for such redundancies.
A major limitation of ORA is that it restricts analysis to DE genes, excluding genes not satisfying the chosen significance threshold (typically the vast majority).
This is resolved by gene set enrichment analysis (GSEA), which scores the tendency of gene set members to appear rather at the top or bottom of the ranked list of all measured genes Subramanian et al., 2005. The statistical significance of the enrichment score (ES) of a gene set is assessed via sample permutation, i.e. (1) sample labels (= group assignment) are shuffled, (2) per-gene DE statistics are recomputed, and (3) the enrichment score is recomputed. Repeating this procedure many times allows to determine the empirical distribution of the enrichment score and to compare the observed enrichment score against it. Here, we carry out GSEA with 1000 permutations.
gsea.all <- sbea(method = "gsea", se = allSE, gs = hsa.gs, perm = 1000)
## Permutations: 1 -- 100
## Permutations: 101 -- 200
## Permutations: 201 -- 300
## Permutations: 301 -- 400
## Permutations: 401 -- 500
## Permutations: 501 -- 600
## Permutations: 601 -- 700
## Permutations: 701 -- 800
## Permutations: 801 -- 900
## Permutations: 901 -- 1000
## Processing ...
gsRanking(gsea.all)
## DataFrame with 19 rows and 4 columns
## GENE.SET ES NES PVAL
## <character> <numeric> <numeric> <numeric>
## 1 hsa05412_Arrhythmoge.. 0.511 1.86 0.00000
## 2 hsa04520_Adherens_ju.. 0.488 1.73 0.00000
## 3 hsa05202_Transcripti.. 0.530 1.72 0.00000
## 4 hsa05416_Viral_myoca.. 0.630 1.89 0.00191
## 5 hsa04670_Leukocyte_t.. 0.499 1.79 0.00194
## ... ... ... ... ...
## 15 hsa04621_NOD-like_re.. 0.633 1.56 0.0216
## 16 hsa05150_Staphylococ.. 0.576 1.69 0.0231
## 17 hsa05131_Shigellosis 0.479 1.52 0.0236
## 18 hsa05130_Pathogenic_.. 0.486 1.52 0.0283
## 19 hsa04210_Apoptosis 0.424 1.42 0.0413
As GSEA’s permutation procedure involves re-computation of per-gene DE statistics, adaptations are necessary for RNA-seq. When analyzing RNA-seq datasets with expression values given as logTPMs (or logRPKMs / logFPKMs), the available set-based enrichment methods can be applied as for microarray data. However, when given raw read counts as for the airway dataset, we recommend to first apply a variance-stabilizing transformation such as voom
to arrive at library-size normalized logCPMs.
airSE <- normalize(airSE, norm.method = "vst")
The mean-variance relationship of the transformed data is similar to what is observed for microarray data, simplifying the application of legacy enrichment methods such as GSEA and PADOG to RNA-seq data, and enable the use of fast and established methods.
gsea.air <- sbea(method = "gsea", se = airSE, gs = kegg.gs)
## Permutations: 1 -- 100
## Permutations: 101 -- 200
## Permutations: 201 -- 300
## Permutations: 301 -- 400
## Permutations: 401 -- 500
## Permutations: 501 -- 600
## Permutations: 601 -- 700
## Permutations: 701 -- 800
## Permutations: 801 -- 900
## Permutations: 901 -- 1000
## Processing ...
While it might be in some cases necessary to apply permutation-based GSEA for RNA-seq data, there are also alternatives avoiding permutation. Among them is ROtAtion gene Set Testing (ROAST), which uses rotation instead of permutation Wu et al., 2010.
## DataFrame with 23 rows and 4 columns
## GENE.SET NR.GENES DIR PVAL
## <character> <numeric> <numeric> <numeric>
## 1 hsa05131_Shigellosis 56 1 0.000999
## 2 hsa05206_MicroRNAs_i.. 129 -1 0.000999
## 3 hsa03030_DNA_replica.. 35 -1 0.000999
## 4 hsa05323_Rheumatoid_.. 52 -1 0.002000
## 5 hsa04390_Hippo_signa.. 116 1 0.003000
## ... ... ... ... ...
## 19 hsa00340_Histidine_m.. 15 1 0.019
## 20 hsa04150_mTOR_signal.. 51 1 0.020
## 21 hsa05218_Melanoma 52 -1 0.020
## 22 hsa04514_Cell_adhesi.. 74 -1 0.025
## 23 hsa04550_Signaling_p.. 101 1 0.038
A selection of additional methods is also available:
## [1] "ora" "safe" "gsea" "gsa" "padog"
## [6] "globaltest" "roast" "camera" "gsva" "samgs"
## [11] "ebm" "mgsa"
Exercise: Carry out a GO overrepresentation analysis for the allSE
and airSE
. How many significant gene sets do you observe in each case?
Having found gene sets that show enrichment for differential expression, we are now interested whether these findings can be supported by known regulatory interactions.
For example, we want to know whether transcription factors and their target genes are expressed in accordance to the connecting regulations (activation/inhibition). Such information is usually given in a gene regulatory network derived from specific experiments or compiled from the literature (Geistlinger et al., 2013 for an example).
There are well-studied processes and organisms for which comprehensive and well-annotated regulatory networks are available, e.g. the RegulonDB for E. coli and Yeastract for S. cerevisiae.
However, there are also cases where such a network is missing or at least incomplete. A basic workaround is to compile a network from regulations in pathway databases such as KEGG.
hsa.grn <- compileGRN(org = "hsa", db = "kegg") head(hsa.grn)
## FROM TO TYPE
## [1,] "10000" "100132074" "-"
## [2,] "10000" "1026" "-"
## [3,] "10000" "1027" "-"
## [4,] "10000" "10488" "+"
## [5,] "10000" "107" "+"
## [6,] "10000" "107080638" "-"
Signaling pathway impact analysis (SPIA) is a network-based enrichment analysis method, which is explicitly designed for KEGG signaling pathways Tarca et al., 2009. The method evaluates whether expression changes are propagated across the pathway topology in combination with ORA.
spia.all <- nbea(method = "spia", se = allSE, gs = hsa.gs, grn = hsa.grn, alpha = 0.2)
##
## Done pathway 1 : RNA transport..
## Done pathway 2 : RNA degradation..
## Done pathway 3 : PPAR signaling pathway..
## Done pathway 4 : Fanconi anemia pathway..
## Done pathway 5 : MAPK signaling pathway..
## Done pathway 6 : ErbB signaling pathway..
## Done pathway 7 : Calcium signaling pathway..
## Done pathway 8 : Cytokine-cytokine receptor int..
## Done pathway 9 : Chemokine signaling pathway..
## Done pathway 10 : NF-kappa B signaling pathway..
## Done pathway 11 : Phosphatidylinositol signaling..
## Done pathway 12 : Neuroactive ligand-receptor in..
## Done pathway 13 : Cell cycle..
## Done pathway 14 : Oocyte meiosis..
## Done pathway 15 : p53 signaling pathway..
## Done pathway 16 : Sulfur relay system..
## Done pathway 17 : SNARE interactions in vesicula..
## Done pathway 18 : Regulation of autophagy..
## Done pathway 19 : Protein processing in endoplas..
## Done pathway 20 : Lysosome..
## Done pathway 21 : mTOR signaling pathway..
## Done pathway 22 : Apoptosis..
## Done pathway 23 : Vascular smooth muscle contrac..
## Done pathway 24 : Wnt signaling pathway..
## Done pathway 25 : Dorso-ventral axis formation..
## Done pathway 26 : Notch signaling pathway..
## Done pathway 27 : Hedgehog signaling pathway..
## Done pathway 28 : TGF-beta signaling pathway..
## Done pathway 29 : Axon guidance..
## Done pathway 30 : VEGF signaling pathway..
## Done pathway 31 : Osteoclast differentiation..
## Done pathway 32 : Focal adhesion..
## Done pathway 33 : ECM-receptor interaction..
## Done pathway 34 : Cell adhesion molecules (CAMs)..
## Done pathway 35 : Adherens junction..
## Done pathway 36 : Tight junction..
## Done pathway 37 : Gap junction..
## Done pathway 38 : Complement and coagulation cas..
## Done pathway 39 : Antigen processing and present..
## Done pathway 40 : Toll-like receptor signaling p..
## Done pathway 41 : NOD-like receptor signaling pa..
## Done pathway 42 : RIG-I-like receptor signaling ..
## Done pathway 43 : Cytosolic DNA-sensing pathway..
## Done pathway 44 : Jak-STAT signaling pathway..
## Done pathway 45 : Natural killer cell mediated c..
## Done pathway 46 : T cell receptor signaling path..
## Done pathway 47 : B cell receptor signaling path..
## Done pathway 48 : Fc epsilon RI signaling pathwa..
## Done pathway 49 : Fc gamma R-mediated phagocytos..
## Done pathway 50 : Leukocyte transendothelial mig..
## Done pathway 51 : Intestinal immune network for ..
## Done pathway 52 : Circadian rhythm - mammal..
## Done pathway 53 : Long-term potentiation..
## Done pathway 54 : Neurotrophin signaling pathway..
## Done pathway 55 : Retrograde endocannabinoid sig..
## Done pathway 56 : Glutamatergic synapse..
## Done pathway 57 : Cholinergic synapse..
## Done pathway 58 : Serotonergic synapse..
## Done pathway 59 : GABAergic synapse..
## Done pathway 60 : Dopaminergic synapse..
## Done pathway 61 : Long-term depression..
## Done pathway 62 : Olfactory transduction..
## Done pathway 63 : Taste transduction..
## Done pathway 64 : Phototransduction..
## Done pathway 65 : Regulation of actin cytoskelet..
## Done pathway 66 : Insulin signaling pathway..
## Done pathway 67 : GnRH signaling pathway..
## Done pathway 68 : Progesterone-mediated oocyte m..
## Done pathway 69 : Melanogenesis..
## Done pathway 70 : Adipocytokine signaling pathwa..
## Done pathway 71 : Type II diabetes mellitus..
## Done pathway 72 : Type I diabetes mellitus..
## Done pathway 73 : Maturity onset diabetes of the..
## Done pathway 74 : Aldosterone-regulated sodium r..
## Done pathway 75 : Endocrine and other factor-reg..
## Done pathway 76 : Vasopressin-regulated water re..
## Done pathway 77 : Salivary secretion..
## Done pathway 78 : Gastric acid secretion..
## Done pathway 79 : Pancreatic secretion..
## Done pathway 80 : Carbohydrate digestion and abs..
## Done pathway 81 : Bile secretion..
## Done pathway 82 : Mineral absorption..
## Done pathway 83 : Alzheimer's disease..
## Done pathway 84 : Parkinson's disease..
## Done pathway 85 : Amyotrophic lateral sclerosis ..
## Done pathway 86 : Huntington's disease..
## Done pathway 87 : Prion diseases..
## Done pathway 88 : Cocaine addiction..
## Done pathway 89 : Amphetamine addiction..
## Done pathway 90 : Morphine addiction..
## Done pathway 91 : Alcoholism..
## Done pathway 92 : Bacterial invasion of epitheli..
## Done pathway 93 : Vibrio cholerae infection..
## Done pathway 94 : Epithelial cell signaling in H..
## Done pathway 95 : Pathogenic Escherichia coli in..
## Done pathway 96 : Shigellosis..
## Done pathway 97 : Salmonella infection..
## Done pathway 98 : Pertussis..
## Done pathway 99 : Legionellosis..
## Done pathway 100 : Leishmaniasis..
## Done pathway 101 : Chagas disease (American trypa..
## Done pathway 102 : African trypanosomiasis..
## Done pathway 103 : Malaria..
## Done pathway 104 : Toxoplasmosis..
## Done pathway 105 : Amoebiasis..
## Done pathway 106 : Staphylococcus aureus infectio..
## Done pathway 107 : Tuberculosis..
## Done pathway 108 : Hepatitis C..
## Done pathway 109 : Measles..
## Done pathway 110 : Influenza A..
## Done pathway 111 : HTLV-I infection..
## Done pathway 112 : Herpes simplex infection..
## Done pathway 113 : Epstein-Barr virus infection..
## Done pathway 114 : Pathways in cancer..
## Done pathway 115 : Transcriptional misregulation ..
## Done pathway 116 : Viral carcinogenesis..
## Done pathway 117 : Colorectal cancer..
## Done pathway 118 : Renal cell carcinoma..
## Done pathway 119 : Pancreatic cancer..
## Done pathway 120 : Endometrial cancer..
## Done pathway 121 : Glioma..
## Done pathway 122 : Prostate cancer..
## Done pathway 123 : Thyroid cancer..
## Done pathway 124 : Basal cell carcinoma..
## Done pathway 125 : Melanoma..
## Done pathway 126 : Bladder cancer..
## Done pathway 127 : Chronic myeloid leukemia..
## Done pathway 128 : Acute myeloid leukemia..
## Done pathway 129 : Small cell lung cancer..
## Done pathway 130 : Non-small cell lung cancer..
## Done pathway 131 : Asthma..
## Done pathway 132 : Autoimmune thyroid disease..
## Done pathway 133 : Systemic lupus erythematosus..
## Done pathway 134 : Rheumatoid arthritis..
## Done pathway 135 : Allograft rejection..
## Done pathway 136 : Graft-versus-host disease..
## Done pathway 137 : Arrhythmogenic right ventricul..
## Done pathway 138 : Dilated cardiomyopathy..
## Done pathway 139 : Viral myocarditis..
## Finished SPIA analysis
gsRanking(spia.all)
## DataFrame with 5 rows and 6 columns
## GENE.SET SIZE NDE T.ACT STATUS PVAL
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 hsa04620_Toll-like_r.. 75 6 -3.610 -1 0.0477
## 2 hsa05416_Viral_myoca.. 45 5 2.970 1 0.0733
## 3 hsa05202_Transcripti.. 94 9 -0.209 -1 0.0796
## 4 hsa04630_Jak-STAT_si.. 72 8 -1.060 -1 0.0908
## 5 hsa05143_African_try.. 22 4 0.263 1 0.1790
More generally applicable is gene graph enrichment analysis (GGEA), which evaluates consistency of interactions in a given gene regulatory network with the observed expression data Geistlinger et al., 2011.
## DataFrame with 8 rows and 5 columns
## GENE.SET NR.RELS RAW.SCORE NORM.SCORE PVAL
## <character> <numeric> <numeric> <numeric> <numeric>
## 1 hsa04390_Hippo_signa.. 64 23.20 0.363 0.002
## 2 hsa05416_Viral_myoca.. 7 3.30 0.471 0.005
## 3 hsa05217_Basal_cell_.. 18 6.92 0.385 0.013
## 4 hsa04350_TGF-beta_si.. 19 7.20 0.379 0.019
## 5 hsa04910_Insulin_sig.. 32 11.60 0.364 0.022
## 6 hsa04520_Adherens_ju.. 13 5.02 0.386 0.025
## 7 hsa04550_Signaling_p.. 36 13.10 0.363 0.025
## 8 hsa05412_Arrhythmoge.. 4 1.75 0.437 0.045
## [1] "ggea" "spia" "pathnet" "degraph" "ganpa"
## [6] "cepa" "topologygsa" "netgsa" "neat"
Note #1: As network-based enrichment methods typically do not involve sample permutation but rather network permutation, thus avoiding DE re-computation, they can likewise be applied to RNA-seq data.
Note #2: Given the various enrichment methods with individual benefits and limitations, combining multiple methods can be beneficial, e.g. combined application of a set-based and a network-based method. This has been shown to filter out spurious hits of individual methods and to reduce the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.
The function combResults
implements the straightforward combination of results, thereby facilitating seamless comparison of results across methods. For demonstration, we use the ORA and GSEA results for the ALL dataset from the previous section:
res.list <- list(ora.all, gsea.all) comb.res <- combResults(res.list) gsRanking(comb.res)
## DataFrame with 20 rows and 6 columns
## GENE.SET ORA.RANK GSEA.RANK MEAN.RANK ORA.PVAL GSEA.PVAL
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 hsa05202_Transcripti.. 2.6 7.7 5.1 0.0351 0.00000
## 2 hsa05412_Arrhythmoge.. 5.1 7.7 6.4 0.0717 0.00000
## 3 hsa04670_Leukocyte_t.. 10.3 12.8 11.5 0.1220 0.00194
## 4 hsa04520_Adherens_ju.. 20.5 7.7 14.1 0.2010 0.00000
## 5 hsa05100_Bacterial_i.. 12.8 28.2 20.5 0.1620 0.01190
## ... ... ... ... ... ... ...
## 16 hsa05410_Hypertrophi.. 41.0 51.3 46.2 0.403 0.0537
## 17 hsa04350_TGF-beta_si.. 38.5 59.0 48.7 0.390 0.0958
## 18 hsa04621_NOD-like_re.. 61.5 38.5 50.0 0.631 0.0216
## 19 hsa04514_Cell_adhesi.. 69.2 33.3 51.3 0.649 0.0158
## 20 hsa04622_RIG-I-like_.. 15.4 87.2 51.3 0.178 0.3550
Exercise: Carry out SPIA
and GGEA
for the airSE
and combine the results. How many gene sets are rendered significant by both methods?
Microarrays and next-generation sequencing are also widely applied for large-scale detection of variable and regulatory genomic regions, e.g. single nucleotide polymorphisms, copy number variations, and transcription factor binding sites.
Such experimentally-derived genomic region sets are raising similar questions regarding functional enrichment as in gene expression data analysis.
Of particular interest is thereby whether experimentally-derived regions overlap more (enrichment) or less (depletion) than expected by chance with regions representing known functional features such as genes or promoters.
The regioneR package implements a general framework for testing overlaps of genomic regions based on permutation sampling. This allows to repeatedly sample random regions from the genome, matching size and chromosomal distribution of the region set under study. By recomputing the overlap with the functional features in each permutation, statistical significance of the observed overlap can be assessed.
To demonstrate the basic functionality of the package, we consider the overlap of gene promoter regions and CpG islands in the human genome. We expect to find an enrichment as promoter regions are known to be GC-rich. Hence, is the overlap between CpG islands and promoters greater than expected by chance?
We use the collection of CpG islands described in Wu et al., 2010 and restrict them to the set of canonical chromosomes 1-23, X, and Y.
cpgHMM <- read.delim("http://www.haowulab.org/software/makeCGI/model-based-cpg-islands-hg19.txt") cpgHMM <- makeGRangesFromDataFrame(cpgHMM, keep.extra.columns = TRUE) genome(cpgHMM) <- "hg19" cpgHMM <- filterChromosomes(cpgHMM, chr.type = "canonical") cpgHMM <- sort(cpgHMM) cpgHMM
## GRanges object with 63705 ranges and 5 metadata columns:
## seqnames ranges strand | length CpGcount GCcontent
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr1 10497-11241 * | 745 110 549
## [2] chr1 28705-29791 * | 1087 115 792
## [3] chr1 135086-135805 * | 720 42 484
## [4] chr1 136164-137362 * | 1199 71 832
## [5] chr1 137665-138121 * | 457 22 301
## ... ... ... ... . ... ... ...
## [63701] chrY 59213702-59214290 * | 589 43 366
## [63702] chrY 59240512-59241057 * | 546 40 369
## [63703] chrY 59348047-59348370 * | 324 17 193
## [63704] chrY 59349137-59349565 * | 429 31 276
## [63705] chrY 59361489-59362401 * | 913 128 650
## pctGC obsExp
## <numeric> <numeric>
## [1] 0.737 1.106
## [2] 0.729 0.818
## [3] 0.672 0.548
## [4] 0.694 0.524
## [5] 0.659 0.475
## ... ... ...
## [63701] 0.621 0.765
## [63702] 0.676 0.643
## [63703] 0.596 0.593
## [63704] 0.643 0.700
## [63705] 0.712 1.108
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
Analogously, we load promoter regions in the hg19 human genome assembly as available from UCSC:
promoters <- read.delim("http://gattaca.imppc.org/regioner/data/UCSC.promoters.hg19.bed") promoters <- makeGRangesFromDataFrame(promoters, keep.extra.columns = TRUE) genome(promoters) <- "hg19" promoters <- filterChromosomes(promoters, chr.type = "canonical") promoters <- sort(promoters) promoters
## GRanges object with 49049 ranges and 3 metadata columns:
## seqnames ranges strand | V4 V5
## <Rle> <IRanges> <Rle> | <character> <character>
## [1] chr1 9873-12073 * | TSS .
## [2] chr1 16565-18765 * | TSS .
## [3] chr1 17551-19751 * | TSS .
## [4] chr1 17861-20061 * | TSS .
## [5] chr1 19559-21759 * | TSS .
## ... ... ... ... . ... ...
## [49045] chrY 59211948-59214148 * | TSS .
## [49046] chrY 59328251-59330451 * | TSS .
## [49047] chrY 59350972-59353172 * | TSS .
## [49048] chrY 59352984-59355184 * | TSS .
## [49049] chrY 59360654-59362854 * | TSS .
## V6
## <character>
## [1] +
## [2] -
## [3] -
## [4] -
## [5] -
## ... ...
## [49045] +
## [49046] +
## [49047] +
## [49048] +
## [49049] -
## -------
## seqinfo: 24 sequences from hg19 genome; no seqlengths
To speed up the example, we restrict analysis to chromosomes 21 and 22. Note that this is done for demonstration only. To make an accurate claim, the complete region set should be used (which, however, runs considerably longer).
cpg <- subset(cpgHMM, seqnames %in% c("chr21", "chr22")) prom <- subset(promoters, seqnames %in% c("chr21", "chr22"))
Now, we are applying an overlap permutation test with 100 permutations (ntimes = 100
), while maintaining chromosomal distribution of the CpG island region set (per.chromosome = TRUE
). Furthermore, we use the option count.once = TRUE
to count an overlapping CpG island only once, even if it overlaps with 2 or more promoters.
Note that we use 100 permutations for demonstration only. To draw robust conclusions a minimum of 1000 permutations should be carried out.
res <- overlapPermTest(cpg, prom, mask = NA, genome = "hg19", ntimes = 100, per.chromosome = TRUE, count.once = TRUE) res
## $numOverlaps
## P-value: 0.0099009900990099
## Z-score: 56.9304
## Number of iterations: 100
## Alternative: greater
## Evaluation of the original region set: 719
## Evaluation function: numOverlaps
## Randomization function: randomizeRegions
##
## attr(,"class")
## [1] "permTestResultsList"
summary(res[[1]]$permuted)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.0 112.0 117.0 117.4 123.0 139.0
The resulting permutation p-value indicates a significant enrichment. Out of the 2859 CpG islands, 719 overlap with at least one promoter. In contrast, when repeatedly drawing random regions matching the CpG islands in size and chromosomal distribution, the mean number of overlapping regions across permutations was 117.4 \(\pm\) 10.6.
Note #1: The function regioneR::permTest
allows to incorporate user-defined functions for randomizing regions and evaluating additional measures of overlap such as total genomic size in bp.
Note #2: The LOLA package implements a genomic region ORA, which assesses genomic region overlap based on the hypergeometric distribution using a library of pre-defined functional region sets.