Summarizing CNV ranges across a population

In CNV analysis, it is often of interest to summarize individual calls across the population, (i.e. to define CNV regions), for subsequent association analysis with e.g. phenotype data.

Usage

populationRanges(
  grl,
  mode = c("density", "RO"),
  density = 0.1,
  ro.thresh = 0.5,
  multi.assign = FALSE,
  verbose = FALSE,
  min.size = 2,
  classify.ranges = TRUE,
  type.thresh = 0.1,
  est.recur = FALSE
)

Arguments

grl: A GRangesList.
mode: Character. Should population ranges be computed based on regional density ("density") or reciprocal overlap ("RO"). See Details.
density: Numeric. Defaults to 0.1.
ro.thresh: Numeric. Threshold for reciprocal overlap required for merging two overlapping regions. Defaults to 0.5.
multi.assign: Logical. Allow regions to be assigned to several region clusters? Defaults to FALSE.
verbose: Logical. Report progress messages? Defaults to FALSE.
min.size: Numeric. Minimum size of a summarized region to be included. Defaults to 2 bp.
classify.ranges: Logical. Should CNV frequency (number of samples overlapping the region) and CNV type (gain, loss, or both) be annotated? Defaults to TRUE.
type.thresh: Numeric. Required minimum relative frequency of each CNV type (gain / loss) to be taken into account when assigning CNV type to a region. Defaults to 0.1. That means for a region overlapped by individual gain and loss calls that both types must be present in >10 in order to be typed as 'both'. If gain or loss calls are present below the threshold they are ignored.
est.recur: Logical. Should recurrence of regions be assessed via a permutation test? Defaults to FALSE. See Details.

Value

A GRanges object containing the summarized CNV ranges.

Details

CNVRuler procedure that trims region margins based on regional density

Trims low-density areas (usually <10% of the total contributing individual calls within a summarized region).

An illustration of the concept can be found here: https://www.ncbi.nlm.nih.gov/pubmed/22539667 (Figure 1)
Reciprocal overlap (RO) approach (e.g. Conrad et al., Nature, 2010)

Reciprocal overlap of 0.51 between two genomic regions A and B:

requires that B overlaps at least 51% of A, *and* that A also overlaps at least 51% of B

Approach:

At the top level of the hierarchy, all contiguous bases overlapping at least 1bp of individual calls are merged into one region. Within each region, we further define reciprocally overlapping regions with the following algorithm:
- Calculate reciprocal overlap (RO) between all remaining calls.
- Identify pair of calls with greatest RO. If RO > threshold, merge and create a new CNV. If not, exit.
- Continue adding unclustered calls to the region, in order of best overlap. In order to add a call, the new call must have > threshold to all calls within the region to be added. When no additional calls may be added, move to next step.
- If calls remain, return to 1. Otherwise exit.
GISTIC procedure (Beroukhim et al., PNAS, 2007) to identify recurrent CNV regions

GISTIC scores each CNV region with a G-score that is proportional to the total magnitude of CNV calls in each CNV region. In addition, by permuting the locations in each sample, GISTIC determines the frequency with which a given score would be attained if the events were due to chance and therefore randomly distributed. A significance threshold can then be used to determine scores / regions that are unlikely to occur by chance alone.

References

Kim et al. (2012) CNVRuler: a copy number variation-based case-control association analysis tool. Bioinformatics, 28(13):1790-2.

Conrad et al. (2010) Origins and functional impact of copy number variation in the human genome. Nature, 464(7289):704-12.

Beroukhim et al. (2007) Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. PNAS, 104(50):20007-12.

Author

Ludwig Geistlinger, Martin Morgan

Examples


grl <- GRangesList(
     sample1 = GRanges( c("chr1:1-10", "chr2:15-18", "chr2:25-34") ),
     sample2 = GRanges( c("chr1:1-10", "chr2:11-18" , "chr2:25-36") ),
     sample3 = GRanges( c("chr1:2-11", "chr2:14-18", "chr2:26-36") ),
     sample4 = GRanges( c("chr1:1-12", "chr2:18-35" ) ),
     sample5 = GRanges( c("chr1:1-12", "chr2:11-17" , "chr2:26-34") ) ,
     sample6 = GRanges( c("chr1:1-12", "chr2:12-18" , "chr2:25-35") )
)

# default as chosen in the original CNVRuler procedure
populationRanges(grl, density=0.1, classify.ranges=FALSE)
#> GRanges object with 3 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      1-12      *
#>   [2]     chr2     11-18      *
#>   [3]     chr2     25-36      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# density = 0 merges all overlapping regions, 
# equivalent to: reduce(unlist(grl))
populationRanges(grl, density=0, classify.ranges=FALSE) 
#> GRanges object with 2 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      1-12      *
#>   [2]     chr2     11-36      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# density = 1 disjoins all overlapping regions, 
# equivalent to: disjoin(unlist(grl))
populationRanges(grl, density=1, classify.ranges=FALSE)
#> GRanges object with 1 range and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      2-10      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths

# RO procedure
populationRanges(grl, mode="RO", ro.thresh=0.5, classify.ranges=FALSE)
#> GRanges object with 5 ranges and 0 metadata columns:
#>       seqnames    ranges strand
#>          <Rle> <IRanges>  <Rle>
#>   [1]     chr1      1-12      *
#>   [2]     chr2     11-18      *
#>   [3]     chr2     15-18      *
#>   [4]     chr2     18-35      *
#>   [5]     chr2     25-36      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome; no seqlengths