Curation Checks • bugphyzz

library(bugphyzz)
library(dplyr)
library(purrr)

Import bugphyzz data (physiologies)

phys <- physiologies()
#> Warning: Missing columns in halophily. Missing columns are: Genome_ID,
#> Accession_ID

Check presence and order of ‘required’ columns

The function .checkRequiredColumnList() performs two checks:
1. All of the required columns are present.
2. The required columns are in the right order.

err1 <- bugphyzz:::.checkRequiredColumnsList(phys, table = TRUE)

The result has four columns:
1. dataset contains the name of each dataset.
2. error_type contains the type of error (‘missing column’ or ‘misplaced column’).
3. columns the name of the columns that must be corrected.
4. link the link to the dataset.

Visualization of the errors:

if (!is.null(err1)) {
  DT::datatable(err1)
} else {
  message('No errors found.')
}

Check syntax

.checkColumnValuesList performs two checks on the list of physiologies: 1. That a column is in the template file.
2. That all of the values in a column are valid according to the template
file.

phys2 <-  phys |> 
  map(~ dplyr::filter(.x, Attribute_source != 'https://bacdive.dsmz.de/')) |> 
  discard(~!nrow(.x))
err2 <- bugphyzz:::.checkColumnValuesList(phys2, table = TRUE)

Uncatalogued columns

Columns in the output:

dataset the name of the spreadsheet.
col the name of the column that has not been added to the template.csv file.
source_link link to the spreadsheet.

uncat_columns <- dplyr::filter(err2, error_type == 'Uncatalogued column') |> 
  discard(~ all(is.na(.x))) |> 
  select(-error_type)
DT::datatable(uncat_columns)

Errors in syntax/values

Columns in the output:

dataset. Name of the spreadsheet.
col. Name of the column with invalid values.
n_rows. Number of rows with the same invalid value.
invalid_values. The value that most be corrected or added to the attributes.tsv file or the template.tsv file
source_link. The link to the spreadsheet.

invalid_values <- dplyr::filter(err2, error_type == 'Invalid values') |> 
  discard(~ all(is.na(.x))) |> 
  tidyr::unnest(cols = c(invalid_values, invalid_pos)) |> 
  select(-error_type, -invalid_pos)
invalid_values |> 
  # head(100) |> 
  unique() |> 
  select(-col, -n_rows) |> 
  filter(dataset != 'isolation site') |> 
  DT::datatable(
    filter = 'top',
    extensions = c("Buttons","KeyTable"),
    options = list(
      dom = '<"top"B>lfrtip',
      buttons = list('copy', 'print'),
      iDisplayLength = 25,
      keys = TRUE,
      pageLength = 100, 
      scrollY = TRUE
    )
  )
#> Warning in instance$preRenderHook(instance): It seems your data is too big for
#> client-side DataTables. You may consider server-side processing:
#> https://rstudio.github.io/DT/server.html

Check valid NCBI IDs

NCBI IDs are compared against the latest version of the NCBI ID database available through taxizedb.

If an NCBI ID is outdated (some IDs are no longer in use or are merged into another one), then a more recent NCBI ID and taxon name are suggested based on the result of taxize::classification. You should see “outdated” in the comment column (below).

If no NCBI ID is suggested, it’s likely because the taxid doesn’t exist. You should see “not found” in the comment column (below).

Note: taxize and taxizedb are not based on the same annotations. Both former and updated annotations can be accessed through taxize. However, this can take a long time since the client must make requests to the NCBI API. Having an NCBI API key can help (it’s free once you created your NCBI account). On the other hand, working with the taxizedb functions is much faster. This is because the NCBI database is downloaded first to the user’s machine. Only the most recent version of the database is downloaded, that’s why only current annotations are show. Roughly, taxize behaves like https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi while taxizedb behaves like https://www.ncbi.nlm.nih.gov/taxonomy/.

id_err <- lapply(phys, bugphyzz:::.check_valid_ncbi_ids) %>% 
  purrr::discard(is.null) %>%
  bind_rows(.id = 'dataset')

links_file <- system.file('extdata/links.tsv', package = 'bugphyzz')
links <- readr::read_tsv(links_file, show_col_types = FALSE)
links <- links[,c('physiology', 'source_link')]

id_err_tbl <- dplyr::left_join(
  id_err, links, by = c('dataset' = 'physiology')
)
dim(id_err_tbl)

DT::datatable(id_err_tbl)