phys <- physiologies()
#> Warning: Missing columns in halophily. Missing columns are: Genome_ID,
#> Accession_ID
The function .checkRequiredColumnList()
performs two
checks:
1. All of the required columns are present.
2. The required columns are in the right order.
err1 <- bugphyzz:::.checkRequiredColumnsList(phys, table = TRUE)
The result has four columns:
1. dataset contains the name of each dataset.
2. error_type contains the type of error (‘missing
column’ or ‘misplaced column’).
3. columns the name of the columns that must be
corrected.
4. link the link to the dataset.
Visualization of the errors:
.checkColumnValuesList
performs two checks on the list
of physiologies: 1. That a column is in the template file.
2. That all of the values in a column are valid according to the
template
file.
phys2 <- phys |>
map(~ dplyr::filter(.x, Attribute_source != 'https://bacdive.dsmz.de/')) |>
discard(~!nrow(.x))
err2 <- bugphyzz:::.checkColumnValuesList(phys2, table = TRUE)
Columns in the output:
Columns in the output:
invalid_values <- dplyr::filter(err2, error_type == 'Invalid values') |>
discard(~ all(is.na(.x))) |>
tidyr::unnest(cols = c(invalid_values, invalid_pos)) |>
select(-error_type, -invalid_pos)
invalid_values |>
# head(100) |>
unique() |>
select(-col, -n_rows) |>
filter(dataset != 'isolation site') |>
DT::datatable(
filter = 'top',
extensions = c("Buttons","KeyTable"),
options = list(
dom = '<"top"B>lfrtip',
buttons = list('copy', 'print'),
iDisplayLength = 25,
keys = TRUE,
pageLength = 100,
scrollY = TRUE
)
)
#> Warning in instance$preRenderHook(instance): It seems your data is too big for
#> client-side DataTables. You may consider server-side processing:
#> https://rstudio.github.io/DT/server.html
NCBI IDs are compared against the latest version of the NCBI ID
database available through taxizedb
.
If an NCBI ID is outdated (some IDs are no longer in
use or are merged into another one), then a more recent NCBI ID and
taxon name are suggested based on the result of
taxize::classification
. You should see “outdated” in the
comment column (below).
If no NCBI ID is suggested, it’s likely because the taxid doesn’t exist. You should see “not found” in the comment column (below).
Note: taxize and taxizedb are not based on the same annotations. Both former and updated annotations can be accessed through taxize. However, this can take a long time since the client must make requests to the NCBI API. Having an NCBI API key can help (it’s free once you created your NCBI account). On the other hand, working with the taxizedb functions is much faster. This is because the NCBI database is downloaded first to the user’s machine. Only the most recent version of the database is downloaded, that’s why only current annotations are show. Roughly, taxize behaves like https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi while taxizedb behaves like https://www.ncbi.nlm.nih.gov/taxonomy/.
id_err <- lapply(phys, bugphyzz:::.check_valid_ncbi_ids) %>%
purrr::discard(is.null) %>%
bind_rows(.id = 'dataset')
links_file <- system.file('extdata/links.tsv', package = 'bugphyzz')
links <- readr::read_tsv(links_file, show_col_types = FALSE)
links <- links[,c('physiology', 'source_link')]
id_err_tbl <- dplyr::left_join(
id_err, links, by = c('dataset' = 'physiology')
)
dim(id_err_tbl)
DT::datatable(id_err_tbl)