Testing CNV regions for effects on the expression level of genes in defined genomic windows.
Usage
cnvEQTL(
cnvrs,
calls,
rcounts,
data,
window = "1Mbp",
multi.calls = .largest,
min.samples = 10,
min.state.freq = 3,
de.method = c("edgeR", "limma"),
padj.method = "BH",
filter.by.expr = TRUE,
verbose = FALSE
)
Arguments
- cnvrs
A
GRanges
or character object containing the summarized CNV regions as e.g. obtained withpopulationRanges
. Alternatively, the assay name if the 'data' argument is provided.- calls
Either a
GRangesList
orRaggedExperiment
storing the individual CNV calls for each sample. Alternatively, the assay name if 'data' is provided.- rcounts
A
RangedSummarizedExperiment
or character name storing either the raw RNA-seq read counts in a rectangular fashion (genes x samples). Alternatively, the assay name if 'data' is provided.- data
(optional) A
MultiAssayExperiment
object with `cnvrs`, `calls`, and `rcounts` arguments corresponding to assay names.- window
Numeric or Character. Size of the genomic window in base pairs by which each CNV region is extended up- and downstream. This determines which genes are tested for each CNV region. Character notation is supported for convenience such as "100kbp" (same as 100000) or "1Mbp" (same as 1000000). Defaults to
"1Mbp"
. Can also be set toNULL
to test against all genes included in the analysis.- multi.calls
A function. Determines how to summarize the CN state in a CNV region when there are multiple (potentially conflicting) calls for one sample in that region. Defaults to
.largest
, which assigns the CN state of the call that covers the largest part of the CNV region tested. A user-defined function that is passed on toqreduceAssay
can also be provided for customized behavior.- min.samples
Integer. Minimum number of samples with at least one call overlapping the CNV region tested. Defaults to 10. See details.
- min.state.freq
Integer. Minimun number of samples in each CNV state being tested. Defaults to 3.
- de.method
Character. Differential expression method. Defaults to
"edgeR"
.- padj.method
Character. Method for adjusting p-values to multiple testing. For available methods see the man page of the function
p.adjust
. Defaults to"BH"
.- filter.by.expr
Logical. Include only genes with sufficiently large counts in the DE analysis? If TRUE, excludes genes not satisfying a minimum number of read counts across samples using the
filterByExpr
function from the edgeR package. Defaults to TRUE.- verbose
Logical. Display progress messages? Defaults to
FALSE
.
Value
A DataFrame
containing measures of association for
each CNV region and each gene tested in the genomic window around the CNV region.
Details
Association testing between CNV regions and RNA-seq read counts is carried out using edgeR, which applies generalized linear models (GLMs) based on the negative-binomial distribution while incorporating normalization factors for different library sizes.
In the case of only one CN state deviating from 2n for a CNV region under investigation, this reduces to the classical 2-group comparison. For more than two states (e.g. 0n, 1n, 2n), edgeR’s ANOVA-like test is applied to test all deviating groups for significant expression differences relative to 2n.
To avoid artificial effects due to low expression of a gene or insufficient
sample size in deviating groups, it is typically recommended to exclude from
the analysis (i) genes with fewer than r reads per million reads mapped
(cpm, counts per million) in the maximally expressed sample group,
and (ii) CNV regions with fewer than s samples in a group deviating from 2n.
Use the min.cpm
and min.samples
arguments, respectively.
When testing local effects (adjacent or coinciding genes of a CNV region), suitable thresholds for candidate discovery are r = 3, s = 4, and a nominal significance level of 0.05; as such effects have a clear biological indication and the number of genes tested is typically small.
For distal effects (i.e. when testing genes far away from a CNV region) more stringent thresholds such as r = 20 and s = 10 for distal effects in conjunction with multiple testing correction using a conservative adjusted significance level such as 0.01 is typically recommended (due to power considerations and to avoid detection of spurious effects).
References
Geistlinger et al. (2018) Widespread modulation of gene expression by copy number variation in skeletal muscle. Sci Rep, 8(1):1399.
See also
findOverlaps
to find overlaps between sets of genomic
regions,
qreduceAssay
to summarize ragged genomic location data
in defined genomic regions,
glmQLFit
and glmQLFTest
to conduct negative
binomial generalized linear models for RNA-seq read count data.
Examples
# (1) CNV calls
states <- sample(c(0,1,3,4), 17, replace=TRUE)
calls <- GRangesList(
sample1 = GRanges( c("chr1:1-10", "chr2:15-18", "chr2:25-34"), state=states[1:3]),
sample2 = GRanges( c("chr1:1-10", "chr2:11-18" , "chr2:25-36"), state=states[4:6] ),
sample3 = GRanges( c("chr1:2-11", "chr2:14-18", "chr2:26-36"), state=states[7:9] ),
sample4 = GRanges( c("chr1:1-12", "chr2:18-35" ), state=states[10:11] ),
sample5 = GRanges( c("chr1:1-12", "chr2:11-17" , "chr2:26-34"), state=states[12:14] ) ,
sample6 = GRanges( c("chr1:1-12", "chr2:12-18" , "chr2:25-35"), state=states[15:17] )
)
# (2) summarized CNV regions
cnvrs <- populationRanges(calls, density=0.1)
# (3) RNA-seq read counts
genes <- GRanges(c("chr1:2-9", "chr1:100-150", "chr1:200-300",
"chr2:16-17", "chr2:100-150", "chr2:200-300", "chr2:26-33"))
y <- matrix(rnbinom(42,size=1,mu=10),7,6)
names(genes) <- rownames(y) <- paste0("gene", 1:7)
colnames(y) <- paste0("sample", 1:6)
library(SummarizedExperiment)
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: ‘matrixStats’
#> The following object is masked from ‘package:RaggedExperiment’:
#>
#> rowRanges
#>
#> Attaching package: ‘MatrixGenerics’
#> The following objects are masked from ‘package:matrixStats’:
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: ‘Biobase’
#> The following object is masked from ‘package:MatrixGenerics’:
#>
#> rowMedians
#> The following objects are masked from ‘package:matrixStats’:
#>
#> anyMissing, rowMedians
rse <- SummarizedExperiment(assays=list(counts=y), rowRanges=granges(genes))
# (4) perform the association analysis
res <- cnvEQTL(cnvrs, calls, rse,
min.samples = 1, min.state.freq = 1, filter.by.expr = FALSE)