Translate study identifiers from barcode to UUID and vice versa
Source: R/ID-translation.R
These functions allow the user to enter a character vector of
identifiers and use the GDC API to translate from TCGA barcodes to
Universally Unique Identifiers (UUID) and vice versa. These relationships
are not one-to-one. Therefore, a data.frame is returned for all
inputs. The UUID to TCGA barcode translation only applies to file and case
UUIDs. Two-way UUID translation is available from 'file_id' to 'case_id'
and vice versa. Please double check any results before using these
features for analysis. Case / submitter identifiers are translated by
default; see the from_type argument for details. All identifiers are
converted to lower case.
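Because the relationships are not one-to-one, a long data.frame with one row per identifier pair is a natural shape for the result. The base R sketch below illustrates that shape and the lower-casing of inputs. It does not call the GDC API: the table `mapping` and its identifiers are made-up placeholders, not real GDC UUIDs, and the column names are chosen for illustration only.

```r
# Hypothetical mapping table; identifiers are placeholders, not real GDC UUIDs
mapping <- data.frame(
    case_id = c("case-a", "case-a", "case-b"),
    file_id = c("file-1", "file-2", "file-3"),
    stringsAsFactors = FALSE
)

# Lower-case the query identifiers, mirroring the documented conversion
query <- tolower(c("CASE-A", "case-b"))

# Keep the rows for the requested cases; one case may match several rows
hits <- mapping[mapping$case_id %in% query, ]

# Group file identifiers by case to expose the one-to-many structure
files_by_case <- split(hits$file_id, hits$case_id)
files_by_case
```

Grouping with split() is only one way to inspect such a result; keeping the long data.frame is usually more convenient for merging with other tables.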
Arguments
- id_vector
character() A vector of UUIDs corresponding to either files or cases (default assumes case_ids)
- from_type
character(1) Either case_id or file_id indicating the type of id_vector entered (default "case_id")
- to_type
character(1) The desired UUID type to obtain, can either be "case_id" (default) or "file_id"
- barcodes
character() A vector of TCGA barcodes
- filenames
character() A vector of file names usually obtained from a GenomicDataCommons query
- slides
logical(1L) DEFUNCT: No longer used. See details.
- id
character(1) A UUID whose history of versions is sought
- endpoint
character(1) Generally a constant pertaining to the location of the history API endpoint. This argument rarely needs to change.
Value
Generally, a data.frame of identifier mappings
UUIDhistory: A data.frame containing a list of associated UUIDs
for the given input along with file_change status, data_release
versions, etc.
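As a rough illustration of working with a UUIDhistory result, the sketch below rebuilds by hand the two-row table shown in the Examples section and picks the most recently released UUID. The column types are an assumption (everything is stored as character here), and no API call is made.

```r
# Hand-built copy of the UUIDhistory() output from the Examples section;
# storing every column as character is an assumption
history <- data.frame(
    uuid = c("0001801b-54b0-4551-8d7a-d66fb59429bf",
             "b4bce3ff-7fdc-4849-880b-56f2b348ceac"),
    version = c("1", "2"),
    file_change = c("superseded", "released"),
    release_date = c("2018-08-23", "2022-03-29"),
    data_release = c("12.0", "32.0"),
    stringsAsFactors = FALSE
)

# Convert the release dates to numbers so the latest one can be located
release_num <- as.numeric(as.Date(history$release_date))

# Select the row with the most recent release date
latest <- history[which.max(release_num), ]
latest$uuid
```

Filtering on file_change == "released" would be an alternative selection rule; which one is appropriate depends on how the GDC labels a given file's history.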
Details
Based on the file UUID supplied, the appropriate entity_id (TCGA barcode) is
returned. In previous versions of the package, the 'end_point' parameter
required the user to specify what type of barcode was needed. This is no
longer supported, as entity_id returns the appropriate one.
Slides are identified by the filenames input by searching for the
.svs extension. Slide queries can only be done when all filenames
inputs are slide file names.
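The all-or-nothing rule for slide file names can be sketched in base R as follows. The helper `all_slides` is hypothetical and is not part of the package; the case-sensitive match on a trailing ".svs" is an assumption about how the extension is detected. The file names are taken from the Examples section.

```r
# Hypothetical helper: TRUE only when every file name ends in ".svs"
all_slides <- function(filenames) {
    length(filenames) > 0L && all(grepl("\\.svs$", filenames))
}

# Two slide image names from the Examples section
slide_names <- c(
    "TCGA-E2-A14P-01Z-00-DX1.663B02FF-C64B-41A6-8685-FD61CD76F9C6.svs",
    "TCGA-A7-A0CD-01Z-00-DX1.F045B9C8-049C-41BF-8432-EF89F236D34D.svs"
)

# A mixed input: one slide name plus one copy number segment file
mixed_names <- c(
    slide_names[1],
    "TCGA-F4-6807-01A-11D-A91Z-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt"
)

all_slides(slide_names)
all_slides(mixed_names)
```

A check like this can be run before calling filenameToBarcode to split a mixed vector into slide and non-slide inputs and query each group separately.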
Examples
## Translate UUIDs >> TCGA Barcode
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoBarcode(uuids, from_type = "file_id")
#> file_id associated_entities.entity_submitter_id
#> 1 b4bce3ff-7fdc-4849-880b-56f2b348ceac TCGA-B0-5094-11A-01D-1421-08
#> 2 5ca9fa79-53bc-4e91-82cd-5715038ee23e TCGA-E9-A295-10A-01D-A16D-09
#> 3 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 TCGA-B0-5117-11A-01D-1421-08
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
#> case_id submitter_id
#> 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117
UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", "aliquot_ids")
#> portions.analytes.aliquots.aliquot_id portions.analytes.aliquots.submitter_id
#> 1 d85d8a17-8aea-49d3-8a03-8f13141c163b TCGA-CV-5443-01A-01D-1510-01
## Translate file UUIDs >> case UUIDs
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoUUID(uuids)
#> file_id cases.case_id
#> 1 5ca9fa79-53bc-4e91-82cd-5715038ee23e fec0da58-1047-44d2-b6d1-c18cceed43dc
#> 2 b4bce3ff-7fdc-4849-880b-56f2b348ceac 8aaa4e25-5c12-4ace-96dc-91aaa0c4457c
#> 3 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 ae55b2d3-62a1-419e-9f9a-5ddfac356db4
## Translate TCGA Barcode >> UUIDs
fullBarcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09")
sample_ids <- TCGAbarcode(fullBarcodes, sample = TRUE)
barcodeToUUID(sample_ids)
#> submitter_sample_ids sample_ids
#> 9 TCGA-B0-5117-11A b1116541-bece-4df3-b3dd-cec50aeb277b
#> 4 TCGA-B0-5094-11A 7519d7a8-c3ee-417b-9cfc-111bc5ad0637
#> 3 TCGA-E9-A295-10A e74183e1-f0b4-412a-8dac-a62d404add78
participant_ids <- c("TCGA-CK-4948", "TCGA-D1-A17N",
    "TCGA-4V-A9QX", "TCGA-4V-A9QM")
barcodeToUUID(participant_ids)
#> submitter_id case_id
#> 1 TCGA-CK-4948 5d73b382-3da3-4220-890e-2095228bbe6c
#> 4 TCGA-D1-A17N 001e0309-9c50-42b0-9e38-347883ee2cd3
#> 3 TCGA-4V-A9QX 0050d8be-1db6-4c17-8bef-3ae2eaaa63ce
#> 2 TCGA-4V-A9QM 0be4fa90-0122-4b26-b35f-7b1a4a16e63b
library(GenomicDataCommons)
#>
#> Attaching package: ‘GenomicDataCommons’
#> The following object is masked from ‘package:stats’:
#>
#> filter
### Query CNV data and get file names
cnv <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-COAD" &
            data_category == "Copy Number Variation" &
            data_type == "Copy Number Segment"
    ) |>
    results(size = 6)
filenameToBarcode(cnv$file_name)
#> file_name
#> 1 DADOS_p_TCGAb3_85_86_87_88_NSP_GenomeWideSNP_6_A02_1464720.grch38.seg.v2.txt
#> 2 GRIPS_p_TCGA_b116_SNP_N_GenomeWideSNP_6_E03_781394.grch38.seg.v2.txt
#> 3 TCGA-F4-6807-01A-11D-A91Z-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt
#> 4 d02407db-aece-475a-aa3f-00653b1e7bee_wgs_gdc_realn.cr.igv.reheader.seg.txt
#> 5 TCGA-A6-6781-01A-22D-A91U-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt
#> 6 BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_C09_808880.grch38.seg.v2.txt
#> file_id
#> 1 27942c4d-57d1-43c2-a27f-ce4ed46f75a2
#> 2 71801dc8-f50a-4e08-b423-ec0aa0fc18d5
#> 3 214a8894-94f1-44e6-b18e-6bf058588efd
#> 4 20f5867e-7312-413c-94b4-4b04b273db5d
#> 5 90492b0b-4787-4805-b1ff-8633febdf304
#> 6 51ce2c99-57e2-4304-b66a-16b70a25b235
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-NH-A5IV-01A-42D-A36W-01
#> 2 TCGA-AA-3697-01A-01D-1717-01
#> 3 TCGA-F4-6807-01A-11D-A91Z-36
#> 4 TCGA-A6-2676-10A-01D-A91U-36
#> 5 TCGA-A6-6781-10A-01D-A91U-36
#> 6 TCGA-AZ-4616-10A-01D-1834-01
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-NH-A5IV-01A-42D-A36W-01
#> 2 TCGA-AA-3697-01A-01D-1717-01
#> 3 TCGA-F4-6807-10A-01D-A91Z-36
#> 4 TCGA-A6-2676-01A-01D-1167-02
#> 5 TCGA-A6-6781-01A-22D-A91U-36
#> 6 TCGA-AZ-4616-10A-01D-1834-01
### Query slides data and get file names
slides <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-BRCA" &
            cases.samples.sample_type == "Primary Tumor" &
            data_type == "Slide Image" &
            experimental_strategy == "Diagnostic Slide"
    ) |>
    results(size = 3)
filenameToBarcode(slides$file_name)
#> file_name
#> 1 TCGA-E2-A14P-01Z-00-DX1.663B02FF-C64B-41A6-8685-FD61CD76F9C6.svs
#> 2 TCGA-A7-A0CD-01Z-00-DX1.F045B9C8-049C-41BF-8432-EF89F236D34D.svs
#> 3 TCGA-5L-AAT1-01Z-00-DX1.F3449A5B-2AC4-4ED7-BF44-4C8946CDB47D.svs
#> file_id entity_submitter_id entity_type
#> 1 4730b23e-aea1-49a2-ba63-2231fd88b592 TCGA-E2-A14P-01Z-00-DX1 slide
#> 2 554855d7-4e21-406b-8f9f-458b1e7c89c9 TCGA-A7-A0CD-01Z-00-DX1 slide
#> 3 4eec69ca-381b-4c17-b3e9-49492d71560e TCGA-5L-AAT1-01Z-00-DX1 slide
#> case_id entity_id
#> 1 e4fc0909-f284-4471-866d-d8967b6adcbc 6f9f59be-f550-4f53-8d7f-9f96fe1db152
#> 2 09765b0a-94f6-47d2-af56-93368084ac3a 2c72ef33-b4d7-406b-9b5a-8f1cf8cd1225
#> 3 16fc3677-0393-4ed1-ad3f-c8355f056369 256b1f51-012a-45ee-8950-e2e0eddd814b
#> project.project_id samples.tumor_descriptor samples.tissue_type
#> 1 TCGA-BRCA Primary Tumor
#> 2 TCGA-BRCA Primary Tumor
#> 3 TCGA-BRCA Primary Tumor
## Get the version history of a BAM file in TCGA-KIRC
UUIDhistory("0001801b-54b0-4551-8d7a-d66fb59429bf")
#> uuid version file_change release_date
#> 1 0001801b-54b0-4551-8d7a-d66fb59429bf 1 superseded 2018-08-23
#> 2 b4bce3ff-7fdc-4849-880b-56f2b348ceac 2 released 2022-03-29
#> data_release
#> 1 12.0
#> 2 32.0