Translate study identifiers from barcode to UUID and vice versa
Source: R/ID-translation.R
These functions allow the user to enter a character vector of
identifiers and use the GDC API to translate from TCGA barcodes to
Universally Unique Identifiers (UUID) and vice versa. These relationships
are not one-to-one. Therefore, a data.frame is returned for all
inputs. The UUID to TCGA barcode translation only applies to file and case
UUIDs. Two-way UUID translation is available from 'file_id' to 'case_id'
and vice versa. Please double check any results before using these
features for analysis. Case / submitter identifiers are translated by
default; see the from_type argument for details. All identifiers are
converted to lower case.
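Because the relationships are not one-to-one, and the rows in the UUIDtoUUID example output on this page do not follow the input order, it can help to group the returned identifiers by input. The following is a minimal base R sketch under stated assumptions: `res` is a hand-built stand-in for the returned data.frame, with values copied from that example rather than fetched from the GDC API, and `by_input` is an illustrative name.

```r
# Input identifiers, in the order used in the examples
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")

# Stand-in for the result of UUIDtoUUID(uuids); values copied from the
# example output, so no network access is needed (assumption: character columns)
res <- data.frame(
    file_id = c("5ca9fa79-53bc-4e91-82cd-5715038ee23e",
        "b4bce3ff-7fdc-4849-880b-56f2b348ceac",
        "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382"),
    cases.case_id = c("fec0da58-1047-44d2-b6d1-c18cceed43dc",
        "8aaa4e25-5c12-4ace-96dc-91aaa0c4457c",
        "ae55b2d3-62a1-419e-9f9a-5ddfac356db4"),
    stringsAsFactors = FALSE
)

# Group translated identifiers by input, in input order. An input with
# several matches keeps all of them; an input with none gets character(0).
by_input <- split(res$cases.case_id, factor(res$file_id, levels = uuids))

# Number of translated identifiers per input
lengths(by_input)
```

Using `split()` rather than `match()` avoids silently dropping additional matches when one input maps to more than one identifier.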
Arguments
- id_vector
character() A vector of UUIDs corresponding to either files or cases (default assumes case_ids)
- from_type
character(1) Either case_id or file_id indicating the type of id_vector entered (default "case_id")
- to_type
character(1) The desired UUID type to obtain, can either be "case_id" (default) or "file_id"
- barcodes
character() A vector of TCGA barcodes
- filenames
character() A vector of file names usually obtained from a GenomicDataCommons query
- slides
logical(1L) DEPRECATED: Whether the provided file names correspond to slides typically with an .svs extension. Note: The barcodes returned correspond 1:1 with the filename inputs. Always triple check the output against the Genomic Data Commons Data Portal by searching the file name and comparing associated "Entity ID" with the submitter_id given by the function.
- id
character(1) A UUID whose history of versions is sought
- endpoint
character(1) Generally a constant pertaining to the location of the history API endpoint. This argument rarely needs to change.
Value
Generally, a data.frame of identifier mappings
UUIDhistory: A data.frame containing a list of associated UUIDs
for the given input along with file_change status, data_release
versions, etc.
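As an illustration of working with the UUIDhistory result, the sketch below picks the most recent version from a history table. It assumes a data.frame with the columns shown in the UUIDhistory example output on this page; `hist_df` is a hand-built stand-in with values copied from that output, and the column types (character here) are an assumption rather than something the source states.

```r
# Stand-in for UUIDhistory("0001801b-54b0-4551-8d7a-d66fb59429bf");
# values copied from the example output, column types assumed
hist_df <- data.frame(
    uuid = c("0001801b-54b0-4551-8d7a-d66fb59429bf",
        "b4bce3ff-7fdc-4849-880b-56f2b348ceac"),
    version = c("1", "2"),
    file_change = c("superseded", "released"),
    release_date = c("2018-08-23", "2022-03-29"),
    data_release = c("12.0", "32.0"),
    stringsAsFactors = FALSE
)

# Select the row with the highest version number
latest <- hist_df[which.max(as.numeric(hist_df$version)), ]

# UUID and status of the most recent version
latest$uuid
latest$file_change
```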
Details
Based on the file UUID supplied, the appropriate entity_id (TCGA barcode) is
returned. In previous versions of the package, the 'end_point' parameter
required the user to specify which type of barcode was needed. This is no
longer supported, as entity_id returns the appropriate one.
When providing slide file names, the function will only work if
all the provided files are slide files with an .svs extension.
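The requirement above can be checked before calling the function. The following is a minimal sketch: `all_slides` is a hypothetical helper, not part of the package, and the file names are copied from the slide and CNV examples on this page.

```r
# Slide file names copied from the slide example
slide_names <- c(
    "TCGA-E2-A14P-01Z-00-DX1.663B02FF-C64B-41A6-8685-FD61CD76F9C6.svs",
    "TCGA-A7-A0CD-01Z-00-DX1.F045B9C8-049C-41BF-8432-EF89F236D34D.svs",
    "TCGA-5L-AAT1-01Z-00-DX1.F3449A5B-2AC4-4ED7-BF44-4C8946CDB47D.svs"
)

# The same names plus one non-slide name from the CNV example
mixed_names <- c(slide_names,
    "TCGA-F4-6807-01A-11D-A91Z-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt")

# Hypothetical helper: TRUE only if the vector is non-empty and every
# file name ends in .svs (case-insensitive)
all_slides <- function(filenames) {
    length(filenames) > 0L &&
        all(grepl("\\.svs$", filenames, ignore.case = TRUE))
}

all_slides(slide_names)
all_slides(mixed_names)
```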
Examples
## Translate UUIDs >> TCGA Barcode
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoBarcode(uuids, from_type = "file_id")
#> file_id associated_entities.entity_submitter_id
#> 1 b4bce3ff-7fdc-4849-880b-56f2b348ceac TCGA-B0-5094-11A-01D-1421-08
#> 2 5ca9fa79-53bc-4e91-82cd-5715038ee23e TCGA-E9-A295-10A-01D-A16D-09
#> 3 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 TCGA-B0-5117-11A-01D-1421-08
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
#> case_id submitter_id
#> 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117
UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", "aliquot_ids")
#> portions.analytes.aliquots.aliquot_id portions.analytes.aliquots.submitter_id
#> 1 d85d8a17-8aea-49d3-8a03-8f13141c163b TCGA-CV-5443-01A-01D-1510-01
## Translate file UUIDs >> case UUIDs
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoUUID(uuids)
#> file_id cases.case_id
#> 1 5ca9fa79-53bc-4e91-82cd-5715038ee23e fec0da58-1047-44d2-b6d1-c18cceed43dc
#> 2 b4bce3ff-7fdc-4849-880b-56f2b348ceac 8aaa4e25-5c12-4ace-96dc-91aaa0c4457c
#> 3 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 ae55b2d3-62a1-419e-9f9a-5ddfac356db4
## Translate TCGA Barcode >> UUIDs
fullBarcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09")
sample_ids <- TCGAbarcode(fullBarcodes, sample = TRUE)
barcodeToUUID(sample_ids)
#> submitter_sample_ids sample_ids
#> 9 TCGA-B0-5117-11A b1116541-bece-4df3-b3dd-cec50aeb277b
#> 4 TCGA-B0-5094-11A 7519d7a8-c3ee-417b-9cfc-111bc5ad0637
#> 3 TCGA-E9-A295-10A e74183e1-f0b4-412a-8dac-a62d404add78
participant_ids <- c("TCGA-CK-4948", "TCGA-D1-A17N",
    "TCGA-4V-A9QX", "TCGA-4V-A9QM")
barcodeToUUID(participant_ids)
#> submitter_id case_id
#> 1 TCGA-CK-4948 5d73b382-3da3-4220-890e-2095228bbe6c
#> 4 TCGA-D1-A17N 001e0309-9c50-42b0-9e38-347883ee2cd3
#> 3 TCGA-4V-A9QX 0050d8be-1db6-4c17-8bef-3ae2eaaa63ce
#> 2 TCGA-4V-A9QM 0be4fa90-0122-4b26-b35f-7b1a4a16e63b
library(GenomicDataCommons)
#>
#> Attaching package: ‘GenomicDataCommons’
#> The following object is masked from ‘package:stats’:
#>
#> filter
### Query CNV data and get file names
cnv <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-COAD" &
            data_category == "Copy Number Variation" &
            data_type == "Copy Number Segment"
    ) |>
    results(size = 6)
filenameToBarcode(cnv$file_name)
#> file_name
#> 1 DADOS_p_TCGAb3_85_86_87_88_NSP_GenomeWideSNP_6_A02_1464720.grch38.seg.v2.txt
#> 2 GRIPS_p_TCGA_b116_SNP_N_GenomeWideSNP_6_E03_781394.grch38.seg.v2.txt
#> 3 TCGA-F4-6807-01A-11D-A91Z-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt
#> 4 d02407db-aece-475a-aa3f-00653b1e7bee_wgs_gdc_realn.cr.igv.reheader.seg.txt
#> 5 TCGA-A6-6781-01A-22D-A91U-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt
#> 6 BAIZE_p_TCGA_b138_SNP_N_GenomeWideSNP_6_C09_808880.grch38.seg.v2.txt
#> file_id
#> 1 27942c4d-57d1-43c2-a27f-ce4ed46f75a2
#> 2 71801dc8-f50a-4e08-b423-ec0aa0fc18d5
#> 3 214a8894-94f1-44e6-b18e-6bf058588efd
#> 4 20f5867e-7312-413c-94b4-4b04b273db5d
#> 5 90492b0b-4787-4805-b1ff-8633febdf304
#> 6 51ce2c99-57e2-4304-b66a-16b70a25b235
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-NH-A5IV-01A-42D-A36W-01
#> 2 TCGA-AA-3697-01A-01D-1717-01
#> 3 TCGA-F4-6807-01A-11D-A91Z-36
#> 4 TCGA-A6-2676-10A-01D-A91U-36
#> 5 TCGA-A6-6781-10A-01D-A91U-36
#> 6 TCGA-AZ-4616-10A-01D-1834-01
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-NH-A5IV-01A-42D-A36W-01
#> 2 TCGA-AA-3697-01A-01D-1717-01
#> 3 TCGA-F4-6807-10A-01D-A91Z-36
#> 4 TCGA-A6-2676-01A-01D-1167-02
#> 5 TCGA-A6-6781-01A-22D-A91U-36
#> 6 TCGA-AZ-4616-10A-01D-1834-01
### Query slides data and get file names
slides <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-BRCA" &
            cases.samples.sample_type == "Primary Tumor" &
            data_type == "Slide Image" &
            experimental_strategy == "Diagnostic Slide"
    ) |>
    results(size = 3)
filenameToBarcode(slides$file_name, slides = TRUE)
#> Warning: The 'slides' argument is deprecated.
#> file_name
#> 1 TCGA-E2-A14P-01Z-00-DX1.663B02FF-C64B-41A6-8685-FD61CD76F9C6.svs
#> 2 TCGA-A7-A0CD-01Z-00-DX1.F045B9C8-049C-41BF-8432-EF89F236D34D.svs
#> 3 TCGA-5L-AAT1-01Z-00-DX1.F3449A5B-2AC4-4ED7-BF44-4C8946CDB47D.svs
#> file_id entity_submitter_id entity_type
#> 1 4730b23e-aea1-49a2-ba63-2231fd88b592 TCGA-E2-A14P-01Z-00-DX1 slide
#> 2 554855d7-4e21-406b-8f9f-458b1e7c89c9 TCGA-A7-A0CD-01Z-00-DX1 slide
#> 3 4eec69ca-381b-4c17-b3e9-49492d71560e TCGA-5L-AAT1-01Z-00-DX1 slide
#> case_id entity_id
#> 1 e4fc0909-f284-4471-866d-d8967b6adcbc 6f9f59be-f550-4f53-8d7f-9f96fe1db152
#> 2 09765b0a-94f6-47d2-af56-93368084ac3a 2c72ef33-b4d7-406b-9b5a-8f1cf8cd1225
#> 3 16fc3677-0393-4ed1-ad3f-c8355f056369 256b1f51-012a-45ee-8950-e2e0eddd814b
#> project.project_id samples.tumor_descriptor samples.tissue_type
#> 1 TCGA-BRCA Primary Tumor
#> 2 TCGA-BRCA Primary Tumor
#> 3 TCGA-BRCA Primary Tumor
## Get the version history of a BAM file in TCGA-KIRC
UUIDhistory("0001801b-54b0-4551-8d7a-d66fb59429bf")
#> uuid version file_change release_date
#> 1 0001801b-54b0-4551-8d7a-d66fb59429bf 1 superseded 2018-08-23
#> 2 b4bce3ff-7fdc-4849-880b-56f2b348ceac 2 released 2022-03-29
#> data_release
#> 1 12.0
#> 2 32.0