Translate study identifiers from barcode to UUID and vice versa
Source: R/ID-translation.R
These functions allow the user to enter a character vector of
identifiers and use the GDC API to translate from TCGA barcodes to
Universally Unique Identifiers (UUID) and vice versa. These relationships
are not one-to-one. Therefore, a data.frame is returned for all
inputs. The UUID to TCGA barcode translation only applies to file and case
UUIDs. Two-way UUID translation is available from 'file_id' to 'case_id'
and vice versa. Please double check any results before using these
features for analysis. Case / submitter identifiers are translated by
default; see the from_type argument for details. All identifiers are
converted to lower case.
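Because the mappings are not one-to-one, it can help to check a returned data.frame for identifiers that occur more than once. The following is a minimal sketch in base R, not part of the package: the `mapping` data.frame is built by hand from the `file_id` and `case_id` values shown in the slide example output on this page, and the names `mapping`, `files_per_case`, and `multi_mapped` are illustrative only.

```r
# Hand-built mapping (values copied from the slide example output on this
# page); in practice this would be the data.frame returned by a translation
# function.
mapping <- data.frame(
    file_id = c(
        "6a7477b1-86f3-473f-9bf1-174a758142e9",
        "d46167af-6c29-49c7-95cf-3a801181aca4",
        "490c4a57-24fc-4595-8b03-285ec9c61181"
    ),
    case_id = c(
        "55262fcb-1b01-4480-b322-36570430c917",
        "55262fcb-1b01-4480-b322-36570430c917",
        "db4bc6aa-2e7d-4bcb-8519-a455f624d33b"
    ),
    stringsAsFactors = FALSE
)

# Count how many files map to each case
files_per_case <- table(mapping$case_id)

# Cases linked to more than one file (many-to-one mappings)
multi_mapped <- names(files_per_case)[files_per_case > 1]
multi_mapped
```

Here two of the three slide files belong to the same case, so that case identifier is the only one reported.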
Arguments
- id_vector
character() A vector of UUIDs corresponding to either files or cases (default assumes case_ids)
- from_type
character(1) Either case_id or file_id indicating the type of id_vector entered (default "case_id")
- to_type
character(1) The desired UUID type to obtain, can either be "case_id" (default) or "file_id"
- barcodes
character() A vector of TCGA barcodes
- filenames
character() A vector of file names usually obtained from a GenomicDataCommons query
- slides
logical(1L) DEPRECATED: Whether the provided file names correspond to slides, typically with an .svs extension. Note: The barcodes returned correspond 1:1 with the filename inputs. Always triple check the output against the Genomic Data Commons Data Portal by searching the file name and comparing the associated "Entity ID" with the submitter_id given by the function.
- id
character(1) A UUID whose history of versions is sought
- endpoint
character(1) Generally a constant pertaining to the location of the history api endpoint. This argument rarely needs to change.
Value
Generally, a data.frame of identifier mappings
UUIDhistory: A data.frame containing a list of associated UUIDs
for the given input along with file_change status, data_release
versions, etc.
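As a rough illustration of working with the UUIDhistory return value, the sketch below selects the most recent version from a history table. It uses base R only; the `history` data.frame is typed in by hand from the UUIDhistory example output on this page, and the column types are an assumption (hence the explicit `as.numeric()` coercion).

```r
# Hand-built copy of the UUIDhistory example output shown on this page;
# column types are assumed, so 'version' is stored as character here.
history <- data.frame(
    uuid = c(
        "0001801b-54b0-4551-8d7a-d66fb59429bf",
        "b4bce3ff-7fdc-4849-880b-56f2b348ceac"
    ),
    version = c("1", "2"),
    file_change = c("superseded", "released"),
    release_date = c("2018-08-23", "2022-03-29"),
    data_release = c("12.0", "32.0"),
    stringsAsFactors = FALSE
)

# Keep the row with the highest version number
latest <- history[which.max(as.numeric(history$version)), ]
latest$uuid
```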
Details
Based on the file UUID supplied, the appropriate entity_id (TCGA barcode) is
returned. In previous versions of the package, the 'end_point' parameter
required the user to specify which type of barcode was needed. This is no
longer supported, as entity_id returns the appropriate one.
When providing slide file names, the function will only work if
all the provided files are slide files with an .svs extension.
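The all-slides requirement can be checked before calling the function. This is a minimal sketch in base R, assuming the file names are already available as a character vector; the names `filenames`, `is_slide`, and `all_slides` are illustrative, and the file names are copied from the slide example output on this page.

```r
# File names copied from the slide example output on this page
filenames <- c(
    "TCGA-3C-AALI-01Z-00-DX2.CF4496E0-AB52-4F3E-BDF5-C34833B91B7C.svs",
    "TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.svs",
    "TCGA-BH-A18Q-01Z-00-DX1.E89E49C7-D62A-4408-A3D9-19E79FCB249E.svs"
)

# TRUE for each name ending in ".svs" (case-insensitive match)
is_slide <- grepl("\\.svs$", filenames, ignore.case = TRUE)

# Proceed only when every file is a slide file
all_slides <- all(is_slide)
all_slides
```

If `all_slides` is FALSE, `filenames[!is_slide]` lists the entries that would need to be handled separately.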
Examples
## Translate UUIDs >> TCGA Barcode
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoBarcode(uuids, from_type = "file_id")
#> file_id associated_entities.entity_submitter_id
#> 1 b4bce3ff-7fdc-4849-880b-56f2b348ceac TCGA-B0-5094-11A-01D-1421-08
#> 2 5ca9fa79-53bc-4e91-82cd-5715038ee23e TCGA-E9-A295-10A-01D-A16D-09
#> 3 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 TCGA-B0-5117-11A-01D-1421-08
UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
#> case_id submitter_id
#> 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117
UUIDtoBarcode("d85d8a17-8aea-49d3-8a03-8f13141c163b", "aliquot_ids")
#> portions.analytes.aliquots.aliquot_id portions.analytes.aliquots.submitter_id
#> 1 d85d8a17-8aea-49d3-8a03-8f13141c163b TCGA-CV-5443-01A-01D-1510-01
## Translate file UUIDs >> case UUIDs
uuids <- c("b4bce3ff-7fdc-4849-880b-56f2b348ceac",
    "5ca9fa79-53bc-4e91-82cd-5715038ee23e",
    "b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382")
UUIDtoUUID(uuids)
#> file_id cases.case_id
#> 1 b4bce3ff-7fdc-4849-880b-56f2b348ceac 8aaa4e25-5c12-4ace-96dc-91aaa0c4457c
#> 2 b7c3e5ad-4ffc-4fc4-acbf-1dfcbd2e5382 ae55b2d3-62a1-419e-9f9a-5ddfac356db4
#> 3 5ca9fa79-53bc-4e91-82cd-5715038ee23e fec0da58-1047-44d2-b6d1-c18cceed43dc
## Translate TCGA Barcode >> UUIDs
fullBarcodes <- c("TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09")
sample_ids <- TCGAbarcode(fullBarcodes, sample = TRUE)
barcodeToUUID(sample_ids)
#> submitter_sample_ids sample_ids
#> 9 TCGA-B0-5117-11A b1116541-bece-4df3-b3dd-cec50aeb277b
#> 4 TCGA-B0-5094-11A 7519d7a8-c3ee-417b-9cfc-111bc5ad0637
#> 3 TCGA-E9-A295-10A e74183e1-f0b4-412a-8dac-a62d404add78
participant_ids <- c("TCGA-CK-4948", "TCGA-D1-A17N",
    "TCGA-4V-A9QX", "TCGA-4V-A9QM")
barcodeToUUID(participant_ids)
#> submitter_id case_id
#> 4 TCGA-CK-4948 5d73b382-3da3-4220-890e-2095228bbe6c
#> 3 TCGA-D1-A17N 001e0309-9c50-42b0-9e38-347883ee2cd3
#> 2 TCGA-4V-A9QX 0050d8be-1db6-4c17-8bef-3ae2eaaa63ce
#> 1 TCGA-4V-A9QM 0be4fa90-0122-4b26-b35f-7b1a4a16e63b
library(GenomicDataCommons)
#>
#> Attaching package: ‘GenomicDataCommons’
#> The following object is masked from ‘package:stats’:
#>
#> filter
### Query CNV data and get file names
cnv <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-COAD" &
            data_category == "Copy Number Variation" &
            data_type == "Copy Number Segment"
    ) |>
    results(size = 6)
filenameToBarcode(cnv$file_name)
#> file_name
#> 1 e6481f25-e9dd-483e-b274-1fc5f1e54dc1_wgs_gdc_realn.cr.igv.reheader.seg.txt
#> 2 SONGS_p_TCGAb36_SNP_N_GenomeWideSNP_6_F01_585308.grch38.seg.v2.txt
#> 3 RARER_p_TCGA_MixedRedos_N_GenomeWideSNP_6_B07_747818.grch38.seg.v2.txt
#> 4 VENUE_p_TCGAb28_SNP_N_GenomeWideSNP_6_D08_568938.grch38.seg.v2.txt
#> 5 SONGS_p_TCGAb36_SNP_N_GenomeWideSNP_6_E07_585304.grch38.seg.v2.txt
#> 6 TCGA-AA-3952-01A-01D-A91W-36.WholeGenome.RP-1657.cr.igv.reheader.seg.txt
#> file_id
#> 1 7221548d-4204-4fdb-b82c-3224cc15518d
#> 2 7b781535-4a39-4107-982f-5f455535a3bf
#> 3 c285f7a9-f99e-4940-8f98-1e8d03103158
#> 4 2a2a30d4-5708-483a-82f4-9c37a327f829
#> 5 3b51da42-9e29-4944-b282-d9b4d8979471
#> 6 c4685500-effe-44a8-9123-63693e5390da
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-AA-3715-01A-01D-0957-02
#> 2 TCGA-AA-3715-01A-01D-0903-01
#> 3 TCGA-AA-3531-01A-01D-1549-01
#> 4 TCGA-AA-3531-01A-01D-0819-01
#> 5 TCGA-AA-3860-01A-02D-0903-01
#> 6 TCGA-AA-3952-10A-01D-A91W-36
#> samples.portions.analytes.aliquots.submitter_id
#> 1 TCGA-AA-3715-10B-01D-A91V-36
#> 2 TCGA-AA-3715-01A-01D-0903-01
#> 3 TCGA-AA-3531-01A-01D-1549-01
#> 4 TCGA-AA-3531-01A-01D-0819-01
#> 5 TCGA-AA-3860-01A-02D-0903-01
#> 6 TCGA-AA-3952-01A-01D-A91W-36
### Query slides data and get file names
slides <- files() |>
    filter(
        ~ cases.project.project_id == "TCGA-BRCA" &
            cases.samples.sample_type == "Primary Tumor" &
            data_type == "Slide Image" &
            experimental_strategy == "Diagnostic Slide"
    ) |>
    results(size = 3)
filenameToBarcode(slides$file_name, slides = TRUE)
#> Warning: The 'slides' argument is deprecated.
#> file_name
#> 1 TCGA-3C-AALI-01Z-00-DX2.CF4496E0-AB52-4F3E-BDF5-C34833B91B7C.svs
#> 2 TCGA-3C-AALI-01Z-00-DX1.F6E9A5DF-D8FB-45CF-B4BD-C6B76294C291.svs
#> 3 TCGA-BH-A18Q-01Z-00-DX1.E89E49C7-D62A-4408-A3D9-19E79FCB249E.svs
#> file_id entity_submitter_id entity_type
#> 1 6a7477b1-86f3-473f-9bf1-174a758142e9 TCGA-3C-AALI-01Z-00-DX2 slide
#> 2 d46167af-6c29-49c7-95cf-3a801181aca4 TCGA-3C-AALI-01Z-00-DX1 slide
#> 3 490c4a57-24fc-4595-8b03-285ec9c61181 TCGA-BH-A18Q-01Z-00-DX1 slide
#> case_id entity_id
#> 1 55262fcb-1b01-4480-b322-36570430c917 162c3a3c-e5f0-4530-aadd-fd4992ae4f2f
#> 2 55262fcb-1b01-4480-b322-36570430c917 7ea905d4-30c8-4611-9889-c514a2b56fb0
#> 3 db4bc6aa-2e7d-4bcb-8519-a455f624d33b 4cbcb419-4a53-4aa8-b985-dc94d65a459b
#> project.project_id samples.tumor_descriptor samples.tissue_type
#> 1 TCGA-BRCA Primary Tumor
#> 2 TCGA-BRCA Primary Tumor
#> 3 TCGA-BRCA Primary Tumor
## Get the version history of a BAM file in TCGA-KIRC
UUIDhistory("0001801b-54b0-4551-8d7a-d66fb59429bf")
#> uuid version file_change release_date
#> 1 0001801b-54b0-4551-8d7a-d66fb59429bf 1 superseded 2018-08-23
#> 2 b4bce3ff-7fdc-4849-880b-56f2b348ceac 2 released 2022-03-29
#> data_release
#> 1 12.0
#> 2 32.0