# Build (or reuse) the example database and return its file path
path_to_db <- source("helper_make_example_db.R")$value

Inspecting and customising the taxonomic classification
Source: vignettes/a07_classification_workflow.qmd
Introduction
When you call get_taxa(classify_to = "genus"), vaultkeepr uses the TaxonClassification table built automatically from the GBIF Taxonomy Backbone via the {taxospace} package. That automated process works well for most taxa, but you may sometimes need to:
- inspect which species map to which genus or family before committing to a classification;
- restrict the analysis to a single plant family;
- override the classification for a taxon that is notoriously hard to harmonise automatically.
The get_classification_table() function is designed for exactly this workflow. It extracts the classification table filtered to the taxa already present in your pipeline, returns it as a plain tibble that you can inspect and edit in R, and the result can be passed straight back to get_taxa() or get_traits() via the classification_data argument.
We use a small example database (built automatically by the helper script sourced at the top of this page) that mirrors the structure and naming conventions of a real VegVault file. You can swap path_to_db for the path to your own VegVault download to run the same code on real data.
Working with real data
Download the full VegVault database from the Database Access page to run this code on real global vegetation data.
Step 1 — Build the plan up to species level
We build a base plan that covers dataset and sample metadata but does not include taxa yet. We then branch off plan_taxa specifically for the classification-table inspection: classify_to = "original" is required because get_classification_table() looks up species-level taxon_id values in the TaxonClassification table. Keeping plan_base separate is important — it is the correct starting point for the custom and default classification pipelines in Steps 5 and 6.
library(vaultkeepr)
plan_base <-
  open_vault(path = path_to_db) |>
  get_datasets() |>
  get_samples()
#> ℹ Vault opened successfully
plan_taxa <-
  plan_base |>
  get_taxa(classify_to = "original")
Note
plan_base and plan_taxa are both vault_pipe objects: no data has been collected yet. The plan is assembled lazily; get_classification_table() in the next step is the first call that actually touches the database.
Step 2 — Inspect the classification table
get_classification_table() executes the query, filters the TaxonClassification table to taxa present in the plan, joins the Taxa name table for every level, and returns a collected tibble — not a vault_pipe. This is the “pipeline split”: plan_taxa is kept intact for the next step.
data_class <-
  get_classification_table(con = plan_taxa)
dplyr::glimpse(data_class)
#> Rows: 45
#> Columns: 8
#> $ taxon_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ taxon_name <chr> "Poa annua", "Poa pratensis", "Poa trivialis", "Poa alpi…
#> $ taxon_species <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ species_name <chr> "Poa annua", "Poa pratensis", "Poa trivialis", "Poa alpi…
#> $ taxon_genus <int> 46, 46, 46, 46, 46, 47, 47, 47, 47, 47, 48, 48, 48, 48, …
#> $ genus_name <chr> "Poa", "Poa", "Poa", "Poa", "Poa", "Festuca", "Festuca",…
#> $ taxon_family <int> 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, …
#> $ family_name <chr> "Poaceae", "Poaceae", "Poaceae", "Poaceae", "Poaceae", "…
The default output (return_raw_data = FALSE) contains both IDs and resolved names for every classification level:
| Column | Content |
|---|---|
| taxon_id | Original species-level taxon ID |
| taxon_name | Species name (from Taxa) |
| taxon_species | Species taxon ID in TaxonClassification |
| species_name | Species name resolved from taxon_species |
| taxon_genus | Genus taxon ID |
| genus_name | Genus name |
| taxon_family | Family taxon ID |
| family_name | Family name |
The ID columns (taxon_id, taxon_species, taxon_genus, taxon_family) make the tibble directly usable as classification_data in subsequent calls, while the name columns let you inspect it with human-readable labels.
data_class
#> # A tibble: 45 × 8
#> taxon_id taxon_name taxon_species species_name taxon_genus genus_name
#> <int> <chr> <int> <chr> <int> <chr>
#> 1 1 Poa annua 1 Poa annua 46 Poa
#> 2 2 Poa pratensis 2 Poa pratens… 46 Poa
#> 3 3 Poa trivialis 3 Poa trivial… 46 Poa
#> 4 4 Poa alpina 4 Poa alpina 46 Poa
#> 5 5 Poa nemoralis 5 Poa nemoral… 46 Poa
#> 6 6 Festuca rubra 6 Festuca rub… 47 Festuca
#> 7 7 Festuca ovina 7 Festuca ovi… 47 Festuca
#> 8 8 Festuca pratensis 8 Festuca pra… 47 Festuca
#> 9 9 Festuca arundinac… 9 Festuca aru… 47 Festuca
#> 10 10 Festuca valesiaca 10 Festuca val… 47 Festuca
#> # ℹ 35 more rows
#> # ℹ 2 more variables: taxon_family <int>, family_name <chr>
Step 3 — Explore the family breakdown
A quick summary shows how many species fall into each family in the data:
data_class |>
  dplyr::count(family_name, sort = TRUE)
#> # A tibble: 3 × 2
#> family_name n
#> <chr> <int>
#> 1 Asteraceae 15
#> 2 Betulaceae 15
#> 3 Poaceae 15
And within a specific family:
data_class |>
  dplyr::filter(family_name == "Betulaceae") |>
  dplyr::select(
    taxon_name,
    genus_name,
    family_name
  )
#> # A tibble: 15 × 3
#> taxon_name genus_name family_name
#> <chr> <chr> <chr>
#> 1 Betula pendula Betula Betulaceae
#> 2 Betula pubescens Betula Betulaceae
#> 3 Betula nana Betula Betulaceae
#> 4 Betula humilis Betula Betulaceae
#> 5 Betula platyphylla Betula Betulaceae
#> 6 Alnus glutinosa Alnus Betulaceae
#> 7 Alnus incana Alnus Betulaceae
#> 8 Alnus viridis Alnus Betulaceae
#> 9 Alnus cordata Alnus Betulaceae
#> 10 Alnus rhombifolia Alnus Betulaceae
#> 11 Corylus avellana Corylus Betulaceae
#> 12 Corylus maxima Corylus Betulaceae
#> 13 Corylus colurna Corylus Betulaceae
#> 14 Corylus americana Corylus Betulaceae
#> 15 Corylus sieboldiana Corylus Betulaceae
Step 4 — Create a restricted classification table
Suppose we want to run the genus-level analysis for Betulaceae only (genera Betula, Alnus, Corylus). We filter the table here and pass the result to get_taxa() in Step 5.
data_class_betulaceae <-
  data_class |>
  dplyr::filter(family_name == "Betulaceae")
data_class_betulaceae
#> # A tibble: 15 × 8
#> taxon_id taxon_name taxon_species species_name taxon_genus genus_name
#> <int> <chr> <int> <chr> <int> <chr>
#> 1 31 Betula pendula 31 Betula pend… 52 Betula
#> 2 32 Betula pubescens 32 Betula pube… 52 Betula
#> 3 33 Betula nana 33 Betula nana 52 Betula
#> 4 34 Betula humilis 34 Betula humi… 52 Betula
#> 5 35 Betula platyphylla 35 Betula plat… 52 Betula
#> 6 36 Alnus glutinosa 36 Alnus gluti… 53 Alnus
#> 7 37 Alnus incana 37 Alnus incana 53 Alnus
#> 8 38 Alnus viridis 38 Alnus virid… 53 Alnus
#> 9 39 Alnus cordata 39 Alnus corda… 53 Alnus
#> 10 40 Alnus rhombifolia 40 Alnus rhomb… 53 Alnus
#> 11 41 Corylus avellana 41 Corylus ave… 54 Corylus
#> 12 42 Corylus maxima 42 Corylus max… 54 Corylus
#> 13 43 Corylus colurna 43 Corylus col… 54 Corylus
#> 14 44 Corylus americana 44 Corylus ame… 54 Corylus
#> 15 45 Corylus sieboldia… 45 Corylus sie… 54 Corylus
#> # ℹ 2 more variables: taxon_family <int>, family_name <chr>
Tip
Any column transformation is valid here: you can re-map individual species to different genera by editing the taxon_genus column directly, or remove rows for taxa you want to exclude entirely. The only requirement is that the table contains a taxon_id column and the column corresponding to classify_to (e.g. taxon_genus for classify_to = "genus").
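Before passing an edited table back, it can help to confirm that it still meets the column requirement stated in the Tip. The sketch below uses a hypothetical helper, check_classification_columns(), which is not part of vaultkeepr, and a small hand-written table whose IDs are copied from the output above. It assumes the required ID column is always named taxon_ followed by the classify_to value, as in the genus example.

```r
# Toy stand-in for an edited classification table (IDs copied from the
# example output above; a real table comes from get_classification_table()).
data_class_toy <-
  tibble::tribble(
    ~taxon_id, ~taxon_name,        ~taxon_genus, ~genus_name,
    31L,       "Betula pendula",   52L,          "Betula",
    36L,       "Alnus glutinosa",  53L,          "Alnus",
    41L,       "Corylus avellana", 54L,          "Corylus"
  )

# Hypothetical helper, NOT part of vaultkeepr: stop early if the table
# lacks `taxon_id` or the ID column matching `classify_to`.
check_classification_columns <- function(data, classify_to = "genus") {
  required <- c("taxon_id", paste0("taxon_", classify_to))
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop(
      "Missing required column(s): ",
      paste(missing_cols, collapse = ", "),
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can sit inside a pipeline
  invisible(data)
}

# Passes silently for a genus-level table
check_classification_columns(data_class_toy, classify_to = "genus")

# Would stop with an error, because the toy table has no taxon_family column:
# check_classification_columns(data_class_toy, classify_to = "family")
```

Because the helper returns its input invisibly, it could be placed between the editing step and get_taxa() without changing the table.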
Step 5 — Apply the custom classification
We pass data_class_betulaceae back into get_taxa() via classification_data, starting from plan_base — the plan before any taxa were added. Taxa not present in the custom table are dropped automatically — only Betulaceae species are retained.
plan_betulaceae <-
  plan_base |>
  get_taxa(
    classify_to = "genus",
    classification_data = data_class_betulaceae
  )
#> Warning: The classification is being made using an automatic workflow
#> and might contain errors.
#> ℹ We recommend checking the classification table by calling
#> `get_classification_table()`.
Step 6 — Compare default vs. custom classification
Let us execute both pipelines and compare the resulting taxa.
# Default: all families, classified to genus
data_default <-
  plan_base |>
  get_taxa(classify_to = "genus") |>
  extract_data(verbose = FALSE)
#> Warning: The classification is being made using an automatic workflow
#> and might contain errors.
#> ℹ We recommend checking the classification table by calling
#> `get_classification_table()`.
# Custom: Betulaceae only
data_custom <-
  plan_betulaceae |>
  extract_data(verbose = FALSE)
The custom result contains only the three Betulaceae genera (Betula, Alnus, Corylus), while the default contains all nine genera present in the example data.
Scenario B — Overriding a specific taxon assignment
Sometimes you need to correct or re-map a single row rather than restrict the whole table. Typical reasons include:
- the automated GBIF harmonisation is known to misplace a taxon (e.g., Senecio jacobaea has been reclassified to Jacobaea vulgaris in many modern treatments — and automated pipelines often still map it back to Senecio);
- your analysis follows a different taxonomic authority than GBIF (e.g., lumping Betula nana with Alnus s.l. for broad palaeoecological summaries).
The edit is a two-step process:
- Retrieve the target genus ID from the classification table itself.
- Patch the affected rows with dplyr::mutate() and dplyr::case_when().
# Step 1: look up the genus ID for Alnus in the classification table
id_genus_alnus <-
  data_class |>
  dplyr::filter(genus_name == "Alnus") |>
  dplyr::distinct(taxon_genus) |>
  dplyr::pull(taxon_genus)
# Step 2: patch the table by reassigning Betula nana to Alnus
# Only taxon_genus (the ID column) needs to change; name columns
# are display-only and have no effect on the query.
data_class_corrected <-
  data_class |>
  dplyr::mutate(
    taxon_genus = dplyr::case_when(
      taxon_name == "Betula nana" ~ id_genus_alnus,
      .default = taxon_genus
    )
  )
# Confirm: Betula nana now carries the Alnus genus ID (id_genus_alnus),
# while its genus_name display column still reads "Betula" — the
# pipeline uses only the ID, so the display value is irrelevant.
data_class_corrected |>
  dplyr::filter(
    taxon_name %in% c("Betula pendula", "Betula nana")
  ) |>
  dplyr::select(taxon_name, taxon_genus, genus_name)
#> # A tibble: 2 × 3
#> taxon_name taxon_genus genus_name
#> <chr> <int> <chr>
#> 1 Betula pendula 52 Betula
#> 2 Betula nana 53 Betula
Note
Only the ID columns (taxon_id, taxon_species, taxon_genus, taxon_family) influence the pipeline result. The display columns (genus_name, family_name, …) are resolved from the database at collection time and can be left as-is: the confirmation above shows that Betula nana still reads genus_name = "Betula" even though its taxon_genus ID now points to Alnus. The pipeline will group it under Alnus regardless.
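If the stale display value is distracting when you inspect an edited table, the names can be refreshed from the IDs. This is purely cosmetic and optional, since only the ID columns affect the result. A minimal sketch on a hand-written table (IDs copied from the output above), assuming each genus ID maps to exactly one genus name:

```r
# Toy stand-in for the classification table (not read from the database)
data_class_toy <-
  tibble::tribble(
    ~taxon_id, ~taxon_name,       ~taxon_genus, ~genus_name,
    31L,       "Betula pendula",  52L,          "Betula",
    33L,       "Betula nana",     52L,          "Betula",
    36L,       "Alnus glutinosa", 53L,          "Alnus"
  )

# ID -> name lookup, built BEFORE any IDs are edited
genus_lookup <-
  data_class_toy |>
  dplyr::distinct(taxon_genus, genus_name)

id_genus_alnus <-
  genus_lookup |>
  dplyr::filter(genus_name == "Alnus") |>
  dplyr::pull(taxon_genus)

data_relabelled <-
  data_class_toy |>
  # patch the ID, as in the text above
  dplyr::mutate(
    taxon_genus = dplyr::if_else(
      taxon_name == "Betula nana",
      id_genus_alnus,
      taxon_genus
    )
  ) |>
  # drop the stale name and re-derive it from the patched ID
  dplyr::select(-genus_name) |>
  dplyr::left_join(genus_lookup, by = "taxon_genus")

data_relabelled
```

After the join, genus_name is rebuilt from taxon_genus, so Betula nana displays the genus it will actually be grouped under.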
Apply the corrected table starting from plan_base. Because Betula nana is now lumped with Alnus, its abundance contributes to the Alnus column instead of Betula.
data_corrected <-
  plan_base |>
  get_taxa(
    classify_to = "genus",
    classification_data = data_class_corrected
  ) |>
  extract_data(verbose = FALSE)
#> Warning: The classification is being made using an automatic workflow
#> and might contain errors.
#> ℹ We recommend checking the classification table by calling
#> `get_classification_table()`.
# Compare total Betula and Alnus abundance across all samples
dplyr::bind_rows(
  data_default |>
    dplyr::pull("data_community") |>
    purrr::list_rbind() |>
    dplyr::summarise(
      Betula = sum(Betula, na.rm = TRUE),
      Alnus = sum(Alnus, na.rm = TRUE)
    ) |>
    dplyr::mutate(classification = "default"),
  data_corrected |>
    dplyr::pull("data_community") |>
    purrr::list_rbind() |>
    dplyr::summarise(
      Betula = sum(Betula, na.rm = TRUE),
      Alnus = sum(Alnus, na.rm = TRUE)
    ) |>
    dplyr::mutate(classification = "corrected")
) |>
  dplyr::relocate(classification)
#> # A tibble: 2 × 3
#> classification Betula Alnus
#> <chr> <dbl> <dbl>
#> 1 default 1046520. 991041
#> 2 corrected 836880. 1200681.
The Betula total is lower in the corrected result (one species removed) and the Alnus total is higher by the same amount, about 209,640 (Betula nana gained). Other genera are unchanged. The genus columns themselves do not disappear, since Betula still has four species, but the abundance budget shifts between the two genera.
Tip
To re-map every species of a genus rather than a single taxon, use genus_name == "Betula" as the dplyr::case_when() condition instead of taxon_name == "Betula nana". The same approach works for family-level remapping via taxon_family.
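The genus-wide variant described in the Tip can be sketched as follows. The table is a hand-written stand-in with IDs copied from the output above, and the .default argument of dplyr::case_when() assumes dplyr 1.1.0 or later.

```r
# Toy stand-in for the classification table (not read from the database)
data_class_toy <-
  tibble::tribble(
    ~taxon_id, ~taxon_name,        ~taxon_genus, ~genus_name,
    31L,       "Betula pendula",   52L,          "Betula",
    33L,       "Betula nana",      52L,          "Betula",
    36L,       "Alnus glutinosa",  53L,          "Alnus",
    41L,       "Corylus avellana", 54L,          "Corylus"
  )

# Look up the target genus ID from the table itself
id_genus_alnus <-
  data_class_toy |>
  dplyr::filter(genus_name == "Alnus") |>
  dplyr::distinct(taxon_genus) |>
  dplyr::pull(taxon_genus)

# Re-map every Betula species to Alnus; all other rows keep their ID
data_class_lumped <-
  data_class_toy |>
  dplyr::mutate(
    taxon_genus = dplyr::case_when(
      genus_name == "Betula" ~ id_genus_alnus,
      .default = taxon_genus
    )
  )

data_class_lumped |>
  dplyr::count(taxon_genus)
```

Matching on genus_name rather than taxon_name is what widens the edit from one species to the whole genus; the condition still reads the original display column, which is why it is evaluated before any names are refreshed.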
Using the raw-ID table
If you only need the ID columns (e.g. to edit IDs programmatically or to pass to a function that expects only the TaxonClassification schema), set return_raw_data = TRUE:
data_class_raw <-
  get_classification_table(
    con = plan_taxa,
    return_raw_data = TRUE
  )
data_class_raw
#> # A tibble: 45 × 4
#> taxon_id taxon_species taxon_genus taxon_family
#> <int> <int> <int> <int>
#> 1 1 1 46 55
#> 2 2 2 46 55
#> 3 3 3 46 55
#> 4 4 4 46 55
#> 5 5 5 46 55
#> 6 6 6 47 55
#> 7 7 7 47 55
#> 8 8 8 47 55
#> 9 9 9 47 55
#> 10 10 10 47 55
#> # ℹ 35 more rows
This returns only taxon_id, taxon_species, taxon_genus, and taxon_family. It is slightly lighter than the default but lacks the resolved names, so it is harder to inspect by eye.
Full pipeline
The complete workflow, from opening the vault to applying a custom family filter, can be written compactly as three statements:
# Step 1: build the base plan (no taxa yet) and branch for inspection
plan_base <-
  open_vault(path = path_to_db) |>
  get_datasets() |>
  get_samples()
# Step 2: get and filter the classification table
data_class_betulaceae <-
  get_classification_table(
    con = plan_base |> get_taxa(classify_to = "original")
  ) |>
  dplyr::filter(family_name == "Betulaceae")
# Step 3: apply and extract, starting from the base plan
data_betulaceae <-
  plan_base |>
  get_taxa(
    classify_to = "genus",
    classification_data = data_class_betulaceae
  ) |>
  extract_data()
Next steps
- Combine with climate data after restricting the taxa of interest (see the vegetation and climate article)
- Retrieve the full trait dataset for the restricted or overridden taxa with get_traits(classify_to = "genus", classification_data = ...); the same classification_data argument is available there too
- Explore age uncertainty for the filtered fossil samples (see the paleoecological uncertainty article)
- Chain both edits: first filter the table to a family, then patch individual rows within it before passing the result to get_taxa()
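The last item, chaining a family filter with a row-level patch, might look like the sketch below. It runs on a hand-written table (IDs and names copied from the outputs above) rather than on the database, and the .default argument of dplyr::case_when() assumes dplyr 1.1.0 or later.

```r
# Toy stand-in for the classification table (not read from the database)
data_class_toy <-
  tibble::tribble(
    ~taxon_id, ~taxon_name,       ~taxon_genus, ~genus_name, ~family_name,
    1L,        "Poa annua",       46L,          "Poa",       "Poaceae",
    31L,       "Betula pendula",  52L,          "Betula",    "Betulaceae",
    33L,       "Betula nana",     52L,          "Betula",    "Betulaceae",
    36L,       "Alnus glutinosa", 53L,          "Alnus",     "Betulaceae"
  )

# Edit 1: restrict the table to a single family
data_class_family <-
  data_class_toy |>
  dplyr::filter(family_name == "Betulaceae")

# Look up the target genus ID inside the restricted table
id_genus_alnus <-
  data_class_family |>
  dplyr::filter(genus_name == "Alnus") |>
  dplyr::distinct(taxon_genus) |>
  dplyr::pull(taxon_genus)

# Edit 2: patch one row within the restricted table
data_class_chained <-
  data_class_family |>
  dplyr::mutate(
    taxon_genus = dplyr::case_when(
      taxon_name == "Betula nana" ~ id_genus_alnus,
      .default = taxon_genus
    )
  )

data_class_chained
```

With a real classification table, data_class_chained would then be supplied as classification_data in get_taxa(), starting from plan_base as in Step 5.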