Loading pre-defined and user-defined expression data

This function implements a general interface for loading the pre-defined GEO2KEGG microarray compendium and the TCGA RNA-seq compendium. It also allows loading of user-defined data from file.

Usage

loadEData(edata, nr.datasets = NULL, cache = TRUE, ...)

Arguments

edata

Expression data compendium. A character vector of length 1 that must be either

'geo2kegg': to load the GEO2KEGG microarray compendium,
'tcga': to load the TCGA RNA-seq compendium, or
an absolute file path pointing to a directory in which a user-defined compendium has been saved as RDS files.

See details.

nr.datasets

Integer. Number of datasets that should be loaded from the compendium. This is mainly for demonstration purposes.

cache

Logical. Should an already cached version be used if available? Defaults to TRUE.

...

Additional arguments passed to the internal loading routines of the GEO2KEGG and TCGA compendia. For loading the GEO2KEGG compendium, this currently includes:

preproc: logical. Should probe level data automatically be summarized to gene level data? Defaults to FALSE.
de.only: logical. Include only datasets in which differentially expressed genes have been found? Defaults to FALSE.
excl.metac: logical. Exclude datasets for which MetaCore rather than KEGG pathways have been assigned as target pathways? Defaults to FALSE.

For loading the TCGA compendium, this currently includes:

mode: character, determines how TCGA RNA-seq datasets are obtained. To obtain raw read counts from GSE62944 use either 'ehub' (default, via ExperimentHub) or 'geo' (direct download from GEO, slow). Alternatively, use 'cTD' to obtain normalized log2 TPM values from curatedTCGAData.
data.dir: character. Absolute file path indicating where processed RDS files for each dataset are written to. Defaults to NULL, which will then write to tools::R_user_dir("GSEABenchmarkeR").
min.ctrls: integer. Minimum number of controls, i.e. adjacent normal samples, for a cancer type to be included. Defaults to 9.
paired: logical. Should the pairing of samples (tumor and adjacent normal) be taken into account? Defaults to TRUE, which reduces the data for each cancer type to patients for which both sample types (tumor and adjacent normal) are available. Use FALSE to obtain all samples in an unpaired manner.
min.cpm: integer. Minimum counts-per-million reads mapped. See the edgeR vignette for details. The default filter is to exclude genes with cpm < 2 in more than half of the samples.
with.clin.vars: logical. Should clinical variables (>500) be kept to allow for more advanced sample groupings in addition to the default binary grouping (tumor vs. normal)?
map2entrez: logical. Should human gene symbols be automatically mapped to Entrez Gene IDs? Defaults to TRUE.
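
The default min.cpm filter described above can be illustrated in base R. This is a minimal sketch of the stated rule (exclude genes with cpm < 2 in more than half of the samples), not the package's internal code. The toy count matrix and all variable names are invented for illustration, and counts-per-million is computed directly from column sums rather than via edgeR.

```r
# Toy raw read counts: 4 genes (rows) x 4 samples (columns); values are invented.
counts <- matrix(
    c(  0,   0,   0,   0,    # geneA: zero in all samples
      500, 600, 550, 480,    # geneB: high in all samples
        0, 300,   0, 250,    # geneC: zero in exactly half of the samples
        0,   0, 400,   0),   # geneD: zero in three of four samples
    nrow = 4, byrow = TRUE,
    dimnames = list(paste0("gene", LETTERS[1:4]), paste0("s", 1:4))
)

# Counts per million: scale each sample (column) by its library size.
lib.size <- colSums(counts)
cpm <- t(t(counts) / lib.size) * 1e6

min.cpm <- 2
# Number of samples in which each gene falls below the threshold.
nr.low <- rowSums(cpm < min.cpm)
# Exclude genes that are below the threshold in more than half of the samples.
keep <- nr.low <= ncol(counts) / 2
filtered <- counts[keep, , drop = FALSE]
```

Under this reading of the rule, a gene that is below the threshold in exactly half of the samples (geneC here) is retained, because it is not low in more than half of them.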

Value

A list of datasets, typically of class SummarizedExperiment.

Note that loadEData("geo2kegg", preproc = FALSE) (the default) returns the original microarray probe level data as a list of ExpressionSet objects. Use preproc = TRUE or the maPreproc function to summarize the probe level data to gene level data and to obtain a list of SummarizedExperiment objects.
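
Because the element class depends on the compendium and on preproc, it can be useful to check what was returned before passing the list on. The helper below is a hypothetical convenience function, not part of GSEABenchmarkeR; it is shown on a stand-in list of base R objects so that the sketch runs without downloading any data.

```r
# Hypothetical helper: report the primary class of each element in a list.
datasetClasses <- function(datasets) {
    vapply(datasets, function(d) class(d)[1], character(1))
}

# Stand-in list for illustration only; with real data, pass the list
# returned by loadEData() instead.
toy <- list(GSE1 = matrix(0, 2, 2), GSE2 = data.frame(x = 1))
toy.classes <- datasetClasses(toy)
```

Applied to a loaded compendium, the result would be expected to show ExpressionSet for probe level data and SummarizedExperiment for gene level data, as described above.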

Details

The pre-defined GEO2KEGG microarray compendium consists of 42 datasets investigating a total of 19 different human diseases as collected by Tarca et al. (2012 and 2013).

The pre-defined TCGA RNA-seq compendium consists of datasets from The Cancer Genome Atlas (TCGA, 2013) investigating a total of 34 different cancer types.

User-defined data can also be loaded, provided that the datasets, preferably of class SummarizedExperiment, have been saved as RDS files in a single directory.
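
A user-defined compendium directory might be prepared as sketched below, using base R only. The one-file-per-dataset layout, the ".rds" file extension, the directory name, and the dataset names are assumptions made for illustration; plain matrices stand in for the SummarizedExperiment objects that would preferably be used in practice.

```r
# Hypothetical compendium directory; a temporary location is used here.
data.dir <- file.path(tempdir(), "myEData")
dir.create(data.dir, showWarnings = FALSE)

# Stand-in datasets (plain matrices) with invented names and values.
datasets <- list(
    studyA = matrix(1:6, nrow = 3,
                    dimnames = list(paste0("g", 1:3), c("s1", "s2"))),
    studyB = matrix(7:12, nrow = 3,
                    dimnames = list(paste0("g", 1:3), c("s1", "s2")))
)

# Save each dataset to its own RDS file, named after the dataset.
for (n in names(datasets)) {
    saveRDS(datasets[[n]], file.path(data.dir, paste0(n, ".rds")))
}

# Read the files back with base R to confirm the round trip.
rds.files <- list.files(data.dir, pattern = "\\.rds$", full.names = TRUE)
restored <- lapply(rds.files, readRDS)
names(restored) <- sub("\\.rds$", "", basename(rds.files))
```

The absolute path held in data.dir is the kind of value that would then be supplied as the edata argument.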

References

Tarca et al. (2012) Down-weighting overlapping genes improves gene set analysis. BMC Bioinformatics, 13:136.

Tarca et al. (2013) A comparison of gene set analysis methods in terms of sensitivity, prioritization and specificity. PLoS One, 8(11):e79217.

The Cancer Genome Atlas Research Network (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet, 45(10):1113-20.

Rahman et al. (2015) Alternative preprocessing of RNA-Sequencing data in The Cancer Genome Atlas leads to improved analysis results. Bioinformatics, 31(22):3666-72.

Author

Ludwig Geistlinger <Ludwig.Geistlinger@sph.cuny.edu>

Examples


    # (1) Loading the GEO2KEGG microarray compendium
    geo2kegg <- loadEData("geo2kegg", nr.datasets=2)
#> Loading GEO2KEGG data compendium ...

    # (2) Loading the TCGA RNA-seq compendium
    tcga <- loadEData("tcga", nr.datasets=2)
#> Loading TCGA data compendium ...
#> Cancer types with tumor samples:
#> ACC, BLCA, BRCA, CESC, COAD, DLBC, GBM, HNSC, KICH, KIRC, KIRP, LAML, LGG, LIHC, LUAD, LUSC, OV, PRAD, READ, SKCM, STAD, THCA, UCEC, UCS
#> Cancer types with adj. normal samples:
#> BLCA, BRCA, CESC, CHOL, COAD, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, THCA, THYM, UCEC
#> Cancer types with sufficient tumor and adj. normal samples:
#> BLCA, BRCA
#> Creating a SummarizedExperiment for each of them ...
#> BLCA tumor: 19 adj.normal: 19
#> BRCA tumor: 113 adj.normal: 113

    # (3) Reading user-defined expression data from file
    data.dir <- system.file("extdata/myEData", package="GSEABenchmarkeR")
    edat <- loadEData(data.dir)