Create datasets for machine learning

This vignette identifies datasets suitable for multi-dataset machine learning and writes them to disk. For each relevant study condition, two files are written:

A .rda file containing a TreeSummarizedExperiment with full colData, rowData, a phylogenetic tree, and assay data containing taxonomic data.
A .csv.gz file containing only key sample metadata and taxonomic data, with samples in rows and variables in columns.


To run this vignette, first run BiocManager::install("waldronlab/curatedMetagenomicAnalyses") to install the package that provides the makeSEforCondition function.

Packages used here:

library(curatedMetagenomicData)
library(curatedMetagenomicAnalyses)
library(dplyr)

Investigate potential response variables

These are the study conditions found in curatedMetagenomicData, sorted from most to least common:

data("sampleMetadata")
availablediseases <- pull(sampleMetadata, study_condition) %>%
  table() %>%
  sort(decreasing = TRUE)
availablediseases

## .
##                   control                       IBD                       CRC 
##                     14446                      2086                       702 
##            premature_born                       T2D                       IGT 
##                       448                       332                       288 
##                      ACVD                   adenoma                 cirrhosis 
##                       214                       209                       132 
##                       STH                    otitis              schizofrenia 
##                       108                       107                       106 
##              hypertension                        AS                       T1D 
##                        99                        97                        89 
##                  melanoma           acute_diarrhoea          pre-hypertension 
##                        64                        56                        56 
##                    ME/CFS                  migraine                      STEC 
##                        50                        49                        42 
##               fatty_liver                 psoriasis carcinoma_surgery_history 
##                        41                        41                        40 
##                       FMT                        AD            cephalosporins 
##                        40                        38                        36 
##                       CDI                    asthma             periodontitis 
##                        33                        24                        24 
##                       SRP          peri-implantitis                        BD 
##                        24                        23                        20 
##                 mucositis                bronchitis            respiratoryinf 
##                        20                        18                        13 
##        metabolic_syndrome            pyelonephritis infectiousgastroenteritis 
##                        10                         6                         5 
##                     fever                 pneumonia               tonsillitis 
##                         3                         3                         3 
##                     cough             pyelonefritis                   skininf 
##                         2                         2                         2 
##                stomatitis                  cystitis                        NK 
##                         2                         1                         1 
##             salmonellosis                    sepsis                   suspinf 
##                         1                         1                         1
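
Many of the conditions above have very few samples. A minimal sketch of narrowing the table to candidate response variables is shown below. The counts are copied from a few rows of the output above, and the min_samples threshold is a hypothetical value chosen for illustration, not something the vignette prescribes.

```r
# Toy counts copied from a few rows of the table above
counts <- c(control = 14446, IBD = 2086, CRC = 702, fever = 3, sepsis = 1)

# Hypothetical threshold (an assumption, not part of the vignette)
min_samples <- 50

# Drop controls and keep conditions with at least min_samples samples
candidates <- counts[names(counts) != "control" & counts >= min_samples]
print(candidates)
```

The same subsetting expression should work on availablediseases directly, since a table can be indexed by a logical vector in the same way.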

And the studies each condition is found in, excluding controls and keeping only conditions available in more than one study:

studies <- lapply(names(availablediseases), function(x){
  filter(sampleMetadata, study_condition %in% x) %>%
    pull(study_name) %>%
    unique()
})
names(studies) <- names(availablediseases)
studies <- studies[-grep("control", names(studies))] #get rid of controls
studies <- studies[sapply(studies, length) > 1] #available in more than one study
studies

## $IBD
## [1] "HallAB_2017"     "HMP_2019_ibdmdb" "IjazUZ_2017"     "LiJ_2014"       
## [5] "NielsenHB_2014"  "VilaAV_2018"    
## 
## $CRC
##  [1] "FengQ_2015"      "GuptaA_2019"     "HanniganGD_2017" "ThomasAM_2018a" 
##  [5] "ThomasAM_2018b"  "ThomasAM_2019_c" "VogtmannE_2016"  "WirbelJ_2018"   
##  [9] "YachidaS_2019"   "YuJ_2015"        "ZellerG_2014"   
## 
## $premature_born
## [1] "BrooksB_2017" "OlmMR_2017"  
## 
## $T2D
## [1] "HMP_2019_t2d"           "KarlssonFH_2013"        "LiJ_2014"              
## [4] "QinJ_2012"              "SankaranarayananK_2015"
## 
## $IGT
## [1] "HMP_2019_t2d"    "KarlssonFH_2013"
## 
## $adenoma
## [1] "FengQ_2015"      "HanniganGD_2017" "ThomasAM_2018a"  "YachidaS_2019"  
## [5] "ZellerG_2014"   
## 
## $cirrhosis
## [1] "LoombaR_2017" "QinN_2014"   
## 
## $STH
## [1] "RosaBA_2018"  "RubelMA_2020"
## 
## $schizofrenia
## [1] "Castro-NallarE_2015" "ZhuF_2020"          
## 
## $T1D
## [1] "Heitz-BuschartA_2016" "KosticAD_2015"        "LiJ_2014"            
## 
## $melanoma
## [1] "GopalakrishnanV_2018" "MatsonV_2018"        
## 
## $acute_diarrhoea
## [1] "DavidLA_2015" "KieserS_2018"
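
The list above can be summarized as a count of studies per condition, and it can also be checked for studies that contribute to more than one condition. The sketch below uses a small hand-copied subset of the list; studies_toy is a hypothetical name standing in for studies.

```r
# Hand-copied subset of the list printed above
studies_toy <- list(
  IBD = c("HallAB_2017", "HMP_2019_ibdmdb", "IjazUZ_2017",
          "LiJ_2014", "NielsenHB_2014", "VilaAV_2018"),
  IGT = c("HMP_2019_t2d", "KarlssonFH_2013"),
  T1D = c("Heitz-BuschartA_2016", "KosticAD_2015", "LiJ_2014")
)

# Number of studies per condition, largest first
n_studies <- sort(lengths(studies_toy), decreasing = TRUE)
print(n_studies)

# Studies that appear under more than one condition
study_counts <- table(unlist(studies_toy))
shared <- names(study_counts)[study_counts > 1]
print(shared)
```

Studies shared between conditions matter when datasets for different conditions are analyzed together, because their control samples may overlap.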

Each of these datasets has six data types associated with it, and each type may be listed under more than one version date; for example:

curatedMetagenomicData("JieZ_2017.+")

## 2021-03-31.JieZ_2017.gene_families
## 2021-03-31.JieZ_2017.marker_abundance
## 2021-03-31.JieZ_2017.marker_presence
## 2021-03-31.JieZ_2017.pathway_abundance
## 2021-03-31.JieZ_2017.pathway_coverage
## 2021-03-31.JieZ_2017.relative_abundance
## 2021-10-14.JieZ_2017.gene_families
## 2021-10-14.JieZ_2017.marker_abundance
## 2021-10-14.JieZ_2017.marker_presence
## 2021-10-14.JieZ_2017.pathway_abundance
## 2021-10-14.JieZ_2017.pathway_coverage
## 2021-10-14.JieZ_2017.relative_abundance
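
The resource names above follow a date.study.data_type pattern. As an illustration only, the sketch below splits a few of these names into their parts and finds the most recent version date per data type. It assumes exactly three dot-separated fields per name, which holds for the names shown here but is not guaranteed for every resource.

```r
# A few resource names copied from the listing above
resources <- c(
  "2021-03-31.JieZ_2017.relative_abundance",
  "2021-10-14.JieZ_2017.relative_abundance",
  "2021-03-31.JieZ_2017.pathway_abundance",
  "2021-10-14.JieZ_2017.pathway_abundance"
)

# Split on literal dots: version date, study name, data type
parts <- do.call(rbind, strsplit(resources, ".", fixed = TRUE))
info <- data.frame(
  version   = parts[, 1],
  study     = parts[, 2],
  data_type = parts[, 3],
  stringsAsFactors = FALSE
)

# ISO dates sort correctly as strings, so max() gives the latest version
latest <- tapply(info$version, info$data_type, max)
print(latest)
```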

Write relative abundance datasets to disk

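for (i in seq_along(studies)){
  cond <- names(studies)[i]
  # Build a TreeSummarizedExperiment for this condition, excluding HMP_2019_ibdmdb
  se <-
    curatedMetagenomicAnalyses::makeSEforCondition(cond, removestudies = "HMP_2019_ibdmdb", dataType = "relative_abundance")
  # paste() is vectorized, so one line is printed per body site
  print(paste("Next study condition:", cond, " /// Body site: ", unique(colData(se)$body_site)))
  print(with(colData(se), table(study_name, study_condition)))
  cat("\n \n")
  # Save the full object (colData, rowData, tree, and assay)
  save(se, file = paste0(cond, ".rda"))
  # Flat table: key sample metadata, then taxa in columns (samples in rows)
  flattext <- select(as.data.frame(colData(se)), c("study_name", "study_condition", "subject_id"))
  rownames(flattext) <- colData(se)$sample_id
  # data.frame() converts taxon names into syntactically valid column names
  flattext <- cbind(flattext, data.frame(t(assay(se))))
  write.csv(flattext, file = paste0(cond, ".csv"))
  # Compress to .csv.gz; requires a gzip executable on the PATH
  system(paste0("gzip ", cond, ".csv"))
}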

## [1] "Next study condition: IBD  /// Body site:  stool"
##                 study_condition
## study_name       control IBD
##   HallAB_2017         74 185
##   IjazUZ_2017         38  56
##   LiJ_2014            10 140
##   NielsenHB_2014     248 148
##   VilaAV_2018          0 355
## 
##  
## [1] "Next study condition: CRC  /// Body site:  stool"
##                  study_condition
## study_name        control CRC
##   FengQ_2015           61  46
##   GuptaA_2019          30  30
##   HanniganGD_2017      28  27
##   ThomasAM_2018a       24  29
##   ThomasAM_2018b       28  32
##   ThomasAM_2019_c      40  40
##   VogtmannE_2016       52  52
##   WirbelJ_2018         65  60
##   YachidaS_2019       251 258
##   YuJ_2015             53  75
##   ZellerG_2014         61  53
## 
##  
## [1] "Next study condition: premature_born  /// Body site:  stool"     
## [2] "Next study condition: premature_born  /// Body site:  oralcavity"
## [3] "Next study condition: premature_born  /// Body site:  skin"      
##               study_condition
## study_name     control premature_born
##   BrooksB_2017       5            403
##   OlmMR_2017         0             45
## 
##  
## [1] "Next study condition: T2D  /// Body site:  stool"
##                         study_condition
## study_name               control T2D
##   HMP_2019_t2d                46  11
##   KarlssonFH_2013             43  53
##   LiJ_2014                    10  79
##   QinJ_2012                  174 170
##   SankaranarayananK_2015      18  19
## 
##  
## [1] "Next study condition: IGT  /// Body site:  stool"
##                  study_condition
## study_name        control IGT
##   HMP_2019_t2d         46 239
##   KarlssonFH_2013      43  49
## 
##  
## [1] "Next study condition: adenoma  /// Body site:  stool"
##                  study_condition
## study_name        adenoma control
##   FengQ_2015           47      61
##   HanniganGD_2017      26      28
##   ThomasAM_2018a       27      24
##   YachidaS_2019        67     251
##   ZellerG_2014         42      61
## 
##  
## [1] "Next study condition: cirrhosis  /// Body site:  stool"
##               study_condition
## study_name     cirrhosis control
##   LoombaR_2017         9      36
##   QinN_2014          123     114
## 
##  
## [1] "Next study condition: STH  /// Body site:  stool"
##               study_condition
## study_name     control STH
##   RosaBA_2018        5  19
##   RubelMA_2020      86  89
## 
##  
## [1] "Next study condition: schizofrenia  /// Body site:  oralcavity"
## [2] "Next study condition: schizofrenia  /// Body site:  stool"     
##                      study_condition
## study_name            control schizofrenia
##   Castro-NallarE_2015      16           16
##   ZhuF_2020                81           90
## 
##  
## [1] "Next study condition: T1D  /// Body site:  stool"
##                       study_condition
## study_name             control T1D
##   Heitz-BuschartA_2016      26  27
##   KosticAD_2015             89  31
##   LiJ_2014                  10  31
## 
##  
## [1] "Next study condition: melanoma  /// Body site:  stool"
##                       study_condition
## study_name             melanoma
##   GopalakrishnanV_2018       25
##   MatsonV_2018               39
## 
##  
## [1] "Next study condition: acute_diarrhoea  /// Body site:  stool"
##               study_condition
## study_name     acute_diarrhoea control
##   DavidLA_2015              38       9
##   KieserS_2018              18       9
## 
##
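
The loop above compresses each file by calling an external gzip program, which may not be available on every system. One possible alternative, sketched here under that assumption, is to write through a gzfile() connection so the compression happens inside R. The flattext_toy data frame and the temporary output path are hypothetical stand-ins for flattext and paste0(cond, ".csv.gz").

```r
# Toy stand-in for the flattext data frame built in the loop
flattext_toy <- data.frame(
  study_name      = c("StudyA", "StudyB"),
  study_condition = c("control", "IBD"),
  taxon1          = c(0.1, 0.9),
  row.names       = c("s1", "s2")
)

# Write the compressed CSV directly, without calling an external gzip
out <- file.path(tempdir(), "toy.csv.gz")
con <- gzfile(out, open = "wt")
write.csv(flattext_toy, file = con)
close(con)

# read.csv() decompresses gzip files transparently when given the path
back <- read.csv(out, row.names = 1)
print(back)
```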

Direct link to files

Download the .csv and .rda files directly from https://www.dropbox.com/sh/0t0nbhj9eqm3wkq/AACZIw42WA-uHjzo97bG5tE6a?dl=0
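
Once a flat file has been read into a data frame, for example with read.csv(), it can be split into metadata, features, and per-study folds for multi-dataset machine learning. The sketch below is only an illustration on toy data: the three metadata column names match those selected in the loop above, while the study names, taxon columns, and the leave-one-study-out scheme are assumptions made for the example.

```r
# Toy data frame with the same layout as the flat files:
# three metadata columns followed by taxa in columns
dat <- data.frame(
  study_name      = rep(c("StudyA", "StudyB", "StudyC"), each = 2),
  study_condition = rep(c("control", "CRC"), times = 3),
  subject_id      = paste0("subj", 1:6),
  taxon1          = c(0.2, 0.5, 0.1, 0.7, 0.3, 0.6),
  taxon2          = c(0.8, 0.5, 0.9, 0.3, 0.7, 0.4),
  row.names       = paste0("sample", 1:6)
)

# Separate the feature matrix from the response variable
meta_cols <- c("study_name", "study_condition", "subject_id")
x <- as.matrix(dat[, setdiff(colnames(dat), meta_cols)])
y <- factor(dat$study_condition)

# Leave-one-study-out folds: each study is held out once as the test set
all_idx <- seq_len(nrow(dat))
folds <- lapply(split(all_idx, dat$study_name), function(test_idx) {
  list(train = setdiff(all_idx, test_idx), test = test_idx)
})
str(folds$StudyA)
```

Holding out whole studies rather than random samples gives an estimate of how well a model transfers to a study it has not seen.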

Levi Waldron

19 October 2022
