cBioPortalData: Data Build Errors
Marcel Ramos & Levi Waldron
April 22, 2025
Source: vignettes/cBioPortalDataErrors.Rmd
Loading
library(cBioPortalData)
library(AnVIL)
library(jsonlite)
Overview
This document serves as a reporting tool for errors that occur when running our utility functions on the cBioPortal datasets.
Data from the cBioPortal API (cBioPortalData())
Typically, the number of errors encountered via the API is low. Only a handful of datasets produce an error when we apply the utility functions to provide a MultiAssayExperiment data representation.
First, we load the error data from the packaged JSON file.
api_errs <- system.file(
    "extdata", "api", "err_api_info.json",
    package = "cBioPortalData", mustWork = TRUE
)
err_api_info <- fromJSON(api_errs)
We can now inspect the contents of the data:
class(err_api_info)
## [1] "list"
length(err_api_info)
## [1] 6
lengths(err_api_info)
##                           Barcodes must start with 'TCGA' 
##                                                         2 
##                     group length is 0 but data length > 0 
##                                                         1 
##   Frequency of NA values higher than the cutoff tolerance 
##                                                         2 
##                          Inconsistent build numbers found 
##                                                        33 
##         `n` must be a single number, not an integer `NA`. 
##                                                         1 
## Argument 1 must be a data frame or a named atomic vector. 
##                                                         1
There were 6 unique errors during the last build run.
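For filtering or joining, a long table with one row per failing study can be easier to work with than a named list. The following is a minimal sketch using base R's stack(). The object toy_errs is a hypothetical stand-in that copies three entries from the full listing in this document, so that the example runs without the package installed; the column names chosen here are also only illustrative.

```r
# Hypothetical stand-in for err_api_info: three entries copied from this
# document's output so the sketch runs without cBioPortalData installed.
toy_errs <- list(
    "Barcodes must start with 'TCGA'" =
        c("blca_msk_tcga_2020", "nsclc_tcga_broad_2016"),
    "group length is 0 but data length > 0" = "glioma_msk_2018",
    "Frequency of NA values higher than the cutoff tolerance" =
        c("mixed_selpercatinib_2020", "ucec_ccr_msk_2022")
)

# stack() flattens a named list into a data.frame with the columns
# 'values' (here, the study IDs) and 'ind' (the list names, as a factor).
err_table <- stack(toy_errs)
names(err_table) <- c("cancer_study_id", "error")

# Number of rows, i.e. one per failing study
n_failing <- nrow(err_table)

# Look up the error recorded for a given study
glioma_error <- as.character(
    err_table$error[err_table$cancer_study_id == "glioma_msk_2018"]
)
```

Passing err_api_info instead of toy_errs should work the same way, assuming every element of the list is a character vector as in the output shown here.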
names(err_api_info)
## [1] "Barcodes must start with 'TCGA'"                          
## [2] "group length is 0 but data length > 0"                    
## [3] "Frequency of NA values higher than the cutoff tolerance"  
## [4] "Inconsistent build numbers found"                         
## [5] "`n` must be a single number, not an integer `NA`."        
## [6] "Argument 1 must be a data frame or a named atomic vector."
The most common error was Inconsistent build numbers found. This is due to
annotations from different build numbers that could not be resolved.
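The most frequent error can also be picked out programmatically rather than read off the printed counts. This is a minimal sketch, assuming a named vector of counts like the one returned by lengths(err_api_info). The err_counts vector below is typed in by hand from the output above and is only a stand-in for the real object.

```r
# Stand-in for lengths(err_api_info), copied by hand from the output above.
err_counts <- c(
    "Barcodes must start with 'TCGA'" = 2L,
    "group length is 0 but data length > 0" = 1L,
    "Frequency of NA values higher than the cutoff tolerance" = 2L,
    "Inconsistent build numbers found" = 33L,
    "`n` must be a single number, not an integer `NA`." = 1L,
    "Argument 1 must be a data frame or a named atomic vector." = 1L
)

# Name of the error that affects the most studies
top_error <- names(which.max(err_counts))

# Fraction of all failing API studies attributable to that error
top_share <- max(err_counts) / sum(err_counts)
```

With these counts, the build-number error accounts for 33 of the 40 failing studies.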
To see which datasets (cancer_study_id values) have that error,
we can use:
err_api_info[['Inconsistent build numbers found']]
##  [1] "msk_ch_2020"                       "msk_access_2021"                  
##  [3] "mixed_msk_tcga_2021"               "mixed_impact_subset_2022"         
##  [5] "pan_origimed_2020"                 "prad_msk_stopsack_2021"           
##  [7] "pancan_pcawg_2020"                 "prad_pik3r1_msk_2021"             
##  [9] "skcm_tcga"                         "stad_tcga"                        
## [11] "stad_tcga_pub"                     "skcm_tcga_pan_can_atlas_2018"     
## [13] "stad_tcga_pan_can_atlas_2018"      "stes_tcga_pub"                    
## [15] "summit_2018"                       "cfdna_msk_2019"                   
## [17] "blca_bcan_hcrn_2022"               "nsclc_ctdx_msk_2022"              
## [19] "thyroid_mskcc_2016"                "skcm_mskcc_2014"                  
## [21] "tmb_mskcc_2018"                    "rectal_msk_2019"                  
## [23] "skcm_tcga_pub_2015"                "msk_spectrum_tme_2022"            
## [25] "ucec_ccr_cfdna_msk_2022"           "paired_bladder_2022"              
## [27] "mtnn_msk_2022"                     "pog570_bcgsc_2020"                
## [29] "sarcoma_msk_2023"                  "bowel_colitis_msk_2022"           
## [31] "luad_mskcc_2023_met_organotropism" "coad_silu_2022"                   
## [33] "paac_msk_jco_2023"
We can also have a look at the entirety of the dataset.
err_api_info
## $`Barcodes must start with 'TCGA'`
## [1] "blca_msk_tcga_2020"    "nsclc_tcga_broad_2016"
## 
## $`group length is 0 but data length > 0`
## [1] "glioma_msk_2018"
## 
## $`Frequency of NA values higher than the cutoff tolerance`
## [1] "mixed_selpercatinib_2020" "ucec_ccr_msk_2022"       
## 
## $`Inconsistent build numbers found`
##  [1] "msk_ch_2020"                       "msk_access_2021"                  
##  [3] "mixed_msk_tcga_2021"               "mixed_impact_subset_2022"         
##  [5] "pan_origimed_2020"                 "prad_msk_stopsack_2021"           
##  [7] "pancan_pcawg_2020"                 "prad_pik3r1_msk_2021"             
##  [9] "skcm_tcga"                         "stad_tcga"                        
## [11] "stad_tcga_pub"                     "skcm_tcga_pan_can_atlas_2018"     
## [13] "stad_tcga_pan_can_atlas_2018"      "stes_tcga_pub"                    
## [15] "summit_2018"                       "cfdna_msk_2019"                   
## [17] "blca_bcan_hcrn_2022"               "nsclc_ctdx_msk_2022"              
## [19] "thyroid_mskcc_2016"                "skcm_mskcc_2014"                  
## [21] "tmb_mskcc_2018"                    "rectal_msk_2019"                  
## [23] "skcm_tcga_pub_2015"                "msk_spectrum_tme_2022"            
## [25] "ucec_ccr_cfdna_msk_2022"           "paired_bladder_2022"              
## [27] "mtnn_msk_2022"                     "pog570_bcgsc_2020"                
## [29] "sarcoma_msk_2023"                  "bowel_colitis_msk_2022"           
## [31] "luad_mskcc_2023_met_organotropism" "coad_silu_2022"                   
## [33] "paac_msk_jco_2023"                
## 
## $``n` must be a single number, not an integer `NA`.`
## [1] "msk_met_2021"
## 
## $`Argument 1 must be a data frame or a named atomic vector.`
## [1] "makeanimpact_ccr_2023"
Packaged data from cBioDataPack()
Now let’s look at the errors in the packaged datasets that are used
for cBioDataPack:
pack_errs <- system.file(
    "extdata", "pack", "err_pack_info.json",
    package = "cBioPortalData", mustWork = TRUE
)
err_pack_info <- fromJSON(pack_errs)
We can do the same for this data:
length(err_pack_info)
## [1] 5
lengths(err_pack_info)
##                                         more columns than column names 
##                                                                     12 
##                Frequency of NA values higher than the cutoff tolerance 
##                                                                      5 
## invalid class "ExperimentList" object: \n    Non-unique names provided 
##                                                                      2 
##                                                 non-character argument 
##                                                                      2 
##                                    'wget' call had nonzero exit status 
##                                                                     13
We can get a list of all the errors present:
names(err_pack_info)
## [1] "more columns than column names"                                          
## [2] "Frequency of NA values higher than the cutoff tolerance"                 
## [3] "invalid class \"ExperimentList\" object: \n    Non-unique names provided"
## [4] "non-character argument"                                                  
## [5] "'wget' call had nonzero exit status"
And finally the full list of errors:
err_pack_info
## $`more columns than column names`
##  [1] "ccrcc_utokyo_2013"                "gbm_cptac_2021"                  
##  [3] "luad_mskimpact_2021"              "mbl_dkfz_2017"                   
##  [5] "pan_origimed_2020"                "sarcoma_msk_2022"                
##  [7] "bowel_colitis_msk_2022"           "prad_msk_mdanderson_2023"        
##  [9] "brca_tcga_pan_can_atlas_2018"     "coadread_tcga_pan_can_atlas_2018"
## [11] "ov_tcga_pan_can_atlas_2018"       "sarc_tcga_pan_can_atlas_2018"    
## 
## $`Frequency of NA values higher than the cutoff tolerance`
## [1] "ihch_mskcc_2020"          "mixed_selpercatinib_2020"
## [3] "ucec_ccr_msk_2022"        "mixed_msk_tcga_2021"     
## [5] "ihch_msk_2021"           
## 
## $`invalid class "ExperimentList" object: \n    Non-unique names provided`
## [1] "mpnst_mskcc"   "stad_tcga_pub"
## 
## $`non-character argument`
## [1] "pcpg_tcga_pub"  "mbn_mdacc_2013"
## 
## $`'wget' call had nonzero exit status`
##  [1] "braf_msk_impact_2024"  "braf_msk_archer_2024"  "prostate_msk_2024"    
##  [4] "pcnsl_msk_2024"        "pdac_msk_2024"         "ucs_msk_2024"         
##  [7] "asclc_msk_2024"        "lms_msk_2024"          "crc_orion_2024"       
## [10] "brca_aurora_2023"      "hcc_msk_2024"          "pancreas_msk_2024"    
## [13] "pancan_mimsi_msk_2024"
sessionInfo
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] jsonlite_2.0.0              cBioPortalData_2.20.0      
##  [3] MultiAssayExperiment_1.34.0 SummarizedExperiment_1.38.0
##  [5] Biobase_2.68.0              GenomicRanges_1.60.0       
##  [7] GenomeInfoDb_1.44.0         IRanges_2.42.0             
##  [9] S4Vectors_0.46.0            BiocGenerics_0.54.0        
## [11] generics_0.1.3              MatrixGenerics_1.20.0      
## [13] matrixStats_1.5.0           AnVIL_1.20.0               
## [15] AnVILBase_1.2.0             dplyr_1.1.4                
## [17] BiocStyle_2.36.0           
## 
## loaded via a namespace (and not attached):
##   [1] DBI_1.2.3                 bitops_1.0-9             
##   [3] httr2_1.1.2               formatR_1.14             
##   [5] rlang_1.1.6               magrittr_2.0.3           
##   [7] compiler_4.5.0            RSQLite_2.3.9            
##   [9] GenomicFeatures_1.60.0    png_0.1-8                
##  [11] systemfonts_1.2.2         vctrs_0.6.5              
##  [13] rvest_1.0.4               stringr_1.5.1            
##  [15] pkgconfig_2.0.3           crayon_1.5.3             
##  [17] fastmap_1.2.0             dbplyr_2.5.0             
##  [19] XVector_0.48.0            Rsamtools_2.24.0         
##  [21] promises_1.3.2            rmarkdown_2.29           
##  [23] tzdb_0.5.0                UCSC.utils_1.4.0         
##  [25] ragg_1.4.0                purrr_1.0.4              
##  [27] bit_4.6.0                 xfun_0.52                
##  [29] cachem_1.1.0              blob_1.2.4               
##  [31] later_1.4.2               DelayedArray_0.34.1      
##  [33] BiocParallel_1.42.0       parallel_4.5.0           
##  [35] R6_2.6.1                  bslib_0.9.0              
##  [37] stringi_1.8.7             rtracklayer_1.68.0       
##  [39] jquerylib_0.1.4           Rcpp_1.0.14              
##  [41] bookdown_0.43             knitr_1.50               
##  [43] readr_2.1.5               BiocBaseUtils_1.10.0     
##  [45] httpuv_1.6.16             Matrix_1.7-3             
##  [47] tidyselect_1.2.1          abind_1.4-8              
##  [49] yaml_2.3.10               codetools_0.2-20         
##  [51] miniUI_0.1.2              curl_6.2.2               
##  [53] lattice_0.22-7            tibble_3.2.1             
##  [55] KEGGREST_1.48.0           shiny_1.10.0             
##  [57] evaluate_1.0.3            desc_1.4.3               
##  [59] lambda.r_1.2.4            futile.logger_1.4.3      
##  [61] BiocFileCache_2.16.0      xml2_1.3.8               
##  [63] Biostrings_2.76.0         pillar_1.10.2            
##  [65] BiocManager_1.30.25       filelock_1.0.3           
##  [67] DT_0.33                   TCGAutils_1.28.0         
##  [69] RCurl_1.98-1.17           hms_1.1.3                
##  [71] xtable_1.8-4              RTCGAToolbox_2.38.0      
##  [73] glue_1.8.0                tools_4.5.0              
##  [75] BiocIO_1.18.0             data.table_1.17.0        
##  [77] GenomicAlignments_1.44.0  rapiclient_0.1.8         
##  [79] XML_3.99-0.18             fs_1.6.6                 
##  [81] grid_4.5.0                tidyr_1.3.1              
##  [83] AnnotationDbi_1.70.0      GenomeInfoDbData_1.2.14  
##  [85] RaggedExperiment_1.32.0   RJSONIO_2.0.0            
##  [87] restfulr_0.0.15           cli_3.6.4                
##  [89] rappdirs_0.3.3            textshaping_1.0.0        
##  [91] futile.options_1.0.1      GenomicDataCommons_1.32.0
##  [93] S4Arrays_1.8.0            sass_0.4.10              
##  [95] digest_0.6.37             SparseArray_1.8.0        
##  [97] rjson_0.2.23              htmlwidgets_1.6.4        
##  [99] memoise_2.0.1             htmltools_0.5.8.1        
## [101] pkgdown_2.1.1             lifecycle_1.0.4          
## [103] httr_1.4.7                mime_0.13                
## [105] bit64_4.6.0-1