Utilities for working with HUMAN genome builds

A few functions are available to search for build versions, either from NCBI or UCSC.

translateBuild: translates between UCSC and NCBI build versions
extractBuild: use grep patterns to find the first build within the string input
uniformBuilds: replace build occurrences below a threshold level of occurrence with the alternative build
correctBuild: ensure that the build annotation is correct based on the NCBI/UCSC website; if not, use translateBuild with the indicated 'style' input
isCorrect: check to see if the build is exactly as annotated

Usage

translateBuild(from, to = c("UCSC", "NCBI"))

correctBuild(build, style = c("UCSC", "NCBI"))

isCorrect(build, style = c("UCSC", "NCBI"))

extractBuild(string, build = c("UCSC", "NCBI"))

uniformBuilds(builds, cutoff = 0.2, na = c("", "NA"))

Arguments

from: character() A vector of build versions, typically from genome() (e.g., "37"). The build vector must be homogeneous (i.e., length(unique(from)) == 1L).
to: character(1) The name of the desired build version (either "UCSC" or "NCBI"; default: "UCSC")
build: For correctBuild and isCorrect, a build version name; for extractBuild, a vector of build styles to search for (default: c("UCSC", "NCBI"))
style: character(1) The annotation style, either 'UCSC' or 'NCBI'
string: A single character string
builds: A character vector of builds
cutoff: numeric(1L) An inclusive threshold tolerance value (default: 0.2) for missing values and for translating builds that occur below the threshold
na: character() The values to be considered as missing (default: c("", "NA"))
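
To get a feel for how cutoff relates to the data, the proportion of each build can be tabulated with base R. The following is a rough sketch for illustration only, not the implementation of uniformBuilds; the variable names are hypothetical, and it assumes that values listed in na are set aside before the proportions are computed.

```r
# Hypothetical illustration; this is not how uniformBuilds() is implemented
builds <- rep(c("GRCh37", "hg19"), times = c(5, 1))
na <- c("", "NA")

# Assumption: values treated as missing are dropped before tabulating
observed <- builds[!builds %in% na]
props <- prop.table(table(observed))

# Builds whose proportion is at or below the (inclusive) cutoff
cutoff <- 0.2
minority <- names(props)[as.vector(props) <= cutoff]
minority
#> [1] "hg19"
```

Here "hg19" makes up 1/6 of the builds, which falls below the default cutoff of 0.2, so it is the build that would be a candidate for replacement.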

Value

translateBuild: A character vector of translated genome builds

extractBuild: A character string of the build information available

uniformBuilds: A character vector of builds where all builds are identical, i.e., identical(length(unique(builds)), 1L)

correctBuild: A character string of the 'corrected' build name

isCorrect: A logical indicating if the build is exactly as annotated

Details

The correctBuild function checks that the input build matches the specified style. If it does not, the function returns the corresponding build name in the requested style, suitable for use with seqlevelsStyle. Currently, the function does not support patched builds (e.g., 'GRCh38.p13'). Build names are taken from the NCBI website: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/
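
Because patched builds are not supported, one possible workaround is to remove the patch suffix before passing the build name along. The helper below is a minimal sketch using base R only; stripPatch is a hypothetical name and is not part of the package, and it assumes the patch suffix always has the form '.p' followed by digits.

```r
# Hypothetical helper (not part of the package): drop a trailing patch
# suffix such as ".p13" so that only the major build name remains
stripPatch <- function(build) {
    sub("\\.p[0-9]+$", "", build)
}

stripPatch("GRCh38.p13")
#> [1] "GRCh38"
```

The stripped name could then be given to correctBuild or isCorrect in the usual way.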

Examples


translateBuild("GRCh35", "UCSC")
#> [1] "hg17"


correctBuild("grch38", "NCBI")
#> [1] "GRCh38"
correctBuild("hg19", "NCBI")
#> [1] "GRCh37"


isCorrect("GRCh38", "NCBI")
#> [1] TRUE

isCorrect("hg19", "UCSC")
#> [1] TRUE


extractBuild(
"SCENA_p_TCGAb29and30_SNP_N_GenomeWideSNP_6_G05_569110.nocnv_grch38.seg.txt"
)
#>     NCBI 
#> "grch38" 


buildvec <- rep(c("GRCh37", "hg19"), times = c(5, 1))
uniformBuilds(buildvec)
#> [1] "GRCh37" "GRCh37" "GRCh37" "GRCh37" "GRCh37" "GRCh37"

navec <- c(rep(c("GRCh37", "hg19"), times = c(5, 1)), "NA")
uniformBuilds(navec)
#> [1] "GRCh37" "GRCh37" "GRCh37" "GRCh37" "GRCh37" "GRCh37" "GRCh37"