This function implements a general interface for loading the pre-defined GEO2KEGG microarray compendium and the TCGA RNA-seq compendium. It also allows loading of user-defined data from file.
Arguments
- edata
Expression data compendium. A character vector of length 1 that must be either
'geo2kegg': to load the GEO2KEGG microarray compendium,
'tcga': to load the TCGA RNA-seq compendium, or
an absolute file path pointing to a directory, in which a user-defined compendium has been saved in RDS files.
See details.
- nr.datasets
Integer. Number of datasets that should be loaded from the compendium. This is mainly for demonstration purposes.
- cache
Logical. Should an already cached version be used if available? Defaults to TRUE.
- ...
Additional arguments passed to the internal loading routines of the GEO2KEGG and TCGA compendia. This currently includes, for loading of the GEO2KEGG compendium:
preproc: logical. Should probe level data automatically be summarized to gene level data? Defaults to FALSE.
de.only: logical. Include only datasets in which differentially expressed genes have been found? Defaults to FALSE.
excl.metac: logical. Exclude datasets for which MetaCore rather than KEGG pathways have been assigned as target pathways? Defaults to FALSE.
And for loading of the TCGA compendium:
mode: character. Determines how TCGA RNA-seq datasets are obtained. To obtain raw read counts from GSE62944 use either 'ehub' (default, via ExperimentHub) or 'geo' (direct download from GEO, slow). Alternatively, use 'cTD' to obtain normalized log2 TPM values from curatedTCGAData.
data.dir: character. Absolute file path indicating where processed RDS files for each dataset are written to. Defaults to NULL, which will then write to tools::R_user_dir("GSEABenchmarkeR").
min.ctrls: integer. Minimum number of controls, i.e. adjacent normal samples, for a cancer type to be included. Defaults to 9.
paired: logical. Should the pairing of samples (tumor and adjacent normal) be taken into account? Defaults to TRUE, which reduces the data for each cancer type to patients for which both sample types (tumor and adjacent normal) are available. Use FALSE to obtain all samples in an unpaired manner.
min.cpm: integer. Minimum counts-per-million reads mapped. See the edgeR vignette for details. The default filter is to exclude genes with cpm < 2 in more than half of the samples.
with.clin.vars: logical. Should clinical variables (>500) be kept to allow for more advanced sample groupings in addition to the default binary grouping (tumor vs. normal)?
map2entrez: Should human gene symbols be automatically mapped to Entrez Gene IDs? Defaults to TRUE.
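As a rough illustration of the min.cpm filter described above, the following base R sketch computes counts per million on a toy count matrix and drops genes whose cpm falls below the threshold in more than half of the samples. The count values and the helper name filterLowCpm are hypothetical, and the package may implement the filter differently internally (for example via edgeR).

```r
# Toy count matrix: 4 genes x 4 samples (made-up values for illustration only).
counts <- matrix(
    c(  0,   1,   0,   2,    # g1: very low counts in every sample
      500, 600, 550, 580,    # g2: well expressed in every sample
        0, 300, 280,   0,    # g3: absent in exactly half of the samples
      1e6, 1e6, 1e6, 1e6),   # g4: makes up most of each library
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("g", 1:4), paste0("s", 1:4))
)

# Hypothetical helper (not part of GSEABenchmarkeR):
# drop genes with cpm < min.cpm in more than half of the samples.
filterLowCpm <- function(counts, min.cpm = 2)
{
    lib.size <- colSums(counts)                  # total reads per sample
    cpm <- t(t(counts) / lib.size) * 1e6         # counts per million
    too.low <- rowSums(cpm < min.cpm) > ncol(counts) / 2
    counts[!too.low, , drop = FALSE]
}

filtered <- filterLowCpm(counts)
rownames(filtered)
```

In this toy example g1 is removed, whereas g3 is retained because it falls below the threshold in exactly half of the samples, not in more than half.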
Value
A list of datasets, typically of class SummarizedExperiment.
Note that loadEData("geo2kegg", preproc = FALSE) (the default) returns the original microarray probe level data as a list of ExpressionSet objects. Use preproc = TRUE or the maPreproc function to summarize the probe level data to gene level data and to obtain a list of SummarizedExperiment objects.
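The two routes to gene level data mentioned above might be used as follows. This is a minimal sketch that assumes GSEABenchmarkeR is installed and that the compendium can be downloaded or is already cached; the guard makes it a no-op otherwise. The variable names are illustrative.

```r
# Run only if the package is available (assumption: network access or a cached copy).
if (requireNamespace("GSEABenchmarkeR", quietly = TRUE))
{
    library(GSEABenchmarkeR)

    # Route 1: summarize probe level data to gene level while loading
    geo2kegg.se <- loadEData("geo2kegg", nr.datasets = 2, preproc = TRUE)

    # Route 2: load the probe level ExpressionSet objects first, then summarize
    geo2kegg <- loadEData("geo2kegg", nr.datasets = 2)
    geo2kegg <- maPreproc(geo2kegg)

    # Inspect the class of each returned dataset
    vapply(geo2kegg, function(x) class(x)[1], character(1))
}
```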
Details
The pre-defined GEO2KEGG microarray compendium consists of 42 datasets investigating a total of 19 different human diseases as collected by Tarca et al. (2012 and 2013).
The pre-defined TCGA RNA-seq compendium consists of datasets from The Cancer Genome Atlas (TCGA, 2013) investigating a total of 34 different cancer types.
User-defined data can also be loaded, given that datasets, preferably of class SummarizedExperiment, have been saved as RDS files.
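One way to prepare such a directory is sketched below with base R only: each dataset is written to its own RDS file, and the absolute path of the directory would then be passed as edata. The toy objects are plain lists standing in for SummarizedExperiment objects, and the directory and file names are hypothetical; how loadEData names or validates the datasets it reads is not covered here.

```r
# Hypothetical compendium directory inside the session's temporary directory.
data.dir <- file.path(tempdir(), "myEData")
dir.create(data.dir, showWarnings = FALSE)

# Toy stand-ins for datasets (real use: preferably SummarizedExperiment objects).
set.seed(1)
datasets <- list(
    ds1 = list(expr = matrix(rpois(20, 10), nrow = 5), group = c(0, 0, 1, 1)),
    ds2 = list(expr = matrix(rpois(30, 10), nrow = 5), group = c(0, 0, 0, 1, 1, 1))
)

# Write one RDS file per dataset.
for (n in names(datasets))
    saveRDS(datasets[[n]], file.path(data.dir, paste0(n, ".rds")))

# Read the files back with base R to confirm the round trip.
rds.files <- list.files(data.dir, pattern = "\\.rds$", full.names = TRUE)
restored <- lapply(rds.files, readRDS)
names(restored) <- sub("\\.rds$", "", basename(rds.files))

# With GSEABenchmarkeR installed and real datasets saved this way,
# the directory would then be loaded via:
# edat <- loadEData(normalizePath(data.dir))
```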
References
Tarca et al. (2012) Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13:136.
Tarca et al. (2013) A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11):e79217.
The Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10):1113-20.
Rahman et al. (2015) Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics, 31(22):3666-72.
See also
SummarizedExperiment, ExpressionSet, maPreproc
Examples
# (1) Loading the GEO2KEGG microarray compendium
geo2kegg <- loadEData("geo2kegg", nr.datasets=2)
#> Loading GEO2KEGG data compendium ...
# (2) Loading the TCGA RNA-seq compendium
tcga <- loadEData("tcga", nr.datasets=2)
#> Loading TCGA data compendium ...
#> Cancer types with tumor samples:
#> ACC, BLCA, BRCA, CESC, COAD, DLBC, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SKCM, STAD, THCA, UCEC, UCS
#> Cancer types with adj. normal samples:
#> BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, THCA, THYM, UCEC
#> Cancer types with sufficient tumor and adj. normal samples:
#> BLCA, BRCA
#> Creating a SummarizedExperiment for each of them ...
#> BLCA tumor: 19 adj.normal: 19
#> BRCA tumor: 113 adj.normal: 113
# (3) Reading user-defined expression data from file
data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
edat <- loadEData(data.dir)