# Build (or reuse) the example database and return its file path
path_to_db <- source("helper_make_example_db.R")$valueExploring extract_data(): from ready-to-use output to raw data
Source:vignettes/a04_extract_data.qmd
Introduction
Every vaultkeepr pipeline follows the same two-phase pattern: build the plan, then execute it. Each function call — get_datasets(), select_dataset_by_type(), get_samples(), get_taxa(), etc. — adds a step to a lazy SQL query stored in the plan object without pulling any data into R. extract_data() is the execution step: it runs the complete plan against the database and returns a tibble ready for analysis.
extract_data() offers two output modes controlled by the return_raw_data argument:
-
return_raw_data = FALSE(default) — the result is passed throughget_readable_column_names()(which translates internal database column names to descriptive equivalents) andpack_data()(which nests sample, community, trait, and abiotic data into dedicated list-columns, one row per dataset). This is the recommended output for most users. -
return_raw_data = TRUE— the raw SQL result is returned immediately, skipping all post-processing. It is faster and preserves every column exactly as it comes from the database, but requires familiarity with the VegVault database schema.
We use a small example database (built automatically below) that mirrors a real VegVault file. You can swap the path for your own VegVault download to run the same code on real data.
Working with real data
Download the full VegVault database from the Database Access page. For an overview of the extraction workflow, see VegVault Usage Examples.
Building the plan
We start with a minimal plan that retrieves vegetation plot datasets from a small geographic area together with their taxa. This gives us both sample metadata and community (taxon abundance) data — enough to illustrate both output modes clearly.
library(vaultkeepr)
plan <-
open_vault(path = path_to_db) |>
get_datasets() |>
select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
select_dataset_by_geo(
long_lim = c(-116, -114),
lat_lim = c(44, 46)
) |>
get_samples() |>
get_taxa()
#> The data does not contain the all dataset types specified in `vec_dataset_type`.
#> Changing `vec_dataset_type` to the dataset types present in the data as: vegetation_plotAt this point plan holds the complete lazy query; no data has been collected yet.
Default output — Nested (packed) tibble (return_raw_data = FALSE)
With the default mode, extract_data() executes the plan and then passes the SQL result through two additional steps before returning it:
-
get_readable_column_names()— translates internal database column names to descriptive equivalents. -
pack_data()— nests sample, community, trait, and abiotic data into dedicated list-columns, one row per dataset.
data_nested <-
plan |>
extract_data(
return_raw_data = FALSE,
verbose = FALSE
)
dplyr::glimpse(data_nested)
#> Rows: 27
#> Columns: 9
#> $ dataset_name <chr> "dataset_1", "dataset_2", "dataset_3", "datase…
#> $ data_source_desc <chr> "EVA (European Vegetation Archive)", "EVA (Eur…
#> $ dataset_type <chr> "vegetation_plot", "vegetation_plot", "vegetat…
#> $ dataset_source_type <chr> "BIEN", "BIEN", "BIEN", "sPlotOpen", "sPlotOpe…
#> $ coord_long <dbl> -115, -115, -115, -115, -115, -115, -115, -115…
#> $ coord_lat <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45…
#> $ sampling_method_details <chr> "Standardised vegetation plot survey (1–1000 m…
#> $ data_samples <list> [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], [<tbl_d…
#> $ data_community <list> [<tbl_df[2 x 46]>], [<tbl_df[2 x 46]>], [<tbl…Structure of the nested tibble
Each row represents one dataset. The top-level metadata columns (dataset_name, dataset_type, coord_long, coord_lat, etc.) describe the dataset. Detailed data are packed into nested data.frame columns:
| Nested column | Content | Present when |
|---|---|---|
data_samples |
Sample metadata: sample_name, age, sample_size, description
|
get_samples() was called |
data_community |
Community matrix: samples × taxa (wide format, taxa as columns, abundance as values) |
get_taxa() was called |
data_traits |
Trait table: samples × taxa × trait values |
get_traits() was called |
data_abiotic |
Abiotic table: samples linked to gridpoint variables |
get_abiotic_data() was called |
Working with the nested columns
Use tidyr::unnest() to expand a nested column back into a flat tibble, or use purrr::map() to operate on each dataset’s data independently.
# Expand sample metadata for every dataset
data_nested |>
tidyr::unnest(cols = data_samples)
#> # A tibble: 54 × 13
#> dataset_name data_source_desc dataset_type dataset_source_type coord_long
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 dataset_1 EVA (European Veget… vegetation_… BIEN -115
#> 2 dataset_1 EVA (European Veget… vegetation_… BIEN -115
#> 3 dataset_2 EVA (European Veget… vegetation_… BIEN -115
#> 4 dataset_2 EVA (European Veget… vegetation_… BIEN -115
#> 5 dataset_3 EVA (European Veget… vegetation_… BIEN -115
#> 6 dataset_3 EVA (European Veget… vegetation_… BIEN -115
#> 7 dataset_4 EVA (European Veget… vegetation_… sPlotOpen -115
#> 8 dataset_4 EVA (European Veget… vegetation_… sPlotOpen -115
#> 9 dataset_5 EVA (European Veget… vegetation_… sPlotOpen -115
#> 10 dataset_5 EVA (European Veget… vegetation_… sPlotOpen -115
#> # ℹ 44 more rows
#> # ℹ 8 more variables: coord_lat <dbl>, sampling_method_details <chr>,
#> # sample_name <chr>, age <dbl>, sample_details <chr>, sample_size <dbl>,
#> # description <chr>, data_community <list>
# Expand the community matrix for every dataset
data_nested |>
tidyr::unnest(cols = data_community)
#> # A tibble: 54 × 54
#> dataset_name data_source_desc dataset_type dataset_source_type coord_long
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 dataset_1 EVA (European Veget… vegetation_… BIEN -115
#> 2 dataset_1 EVA (European Veget… vegetation_… BIEN -115
#> 3 dataset_2 EVA (European Veget… vegetation_… BIEN -115
#> 4 dataset_2 EVA (European Veget… vegetation_… BIEN -115
#> 5 dataset_3 EVA (European Veget… vegetation_… BIEN -115
#> 6 dataset_3 EVA (European Veget… vegetation_… BIEN -115
#> 7 dataset_4 EVA (European Veget… vegetation_… sPlotOpen -115
#> 8 dataset_4 EVA (European Veget… vegetation_… sPlotOpen -115
#> 9 dataset_5 EVA (European Veget… vegetation_… sPlotOpen -115
#> 10 dataset_5 EVA (European Veget… vegetation_… sPlotOpen -115
#> # ℹ 44 more rows
#> # ℹ 49 more variables: coord_lat <dbl>, sampling_method_details <chr>,
#> # data_samples <list>, sample_name <chr>, `Alnus cordata` <dbl>,
#> # `Alnus glutinosa` <dbl>, `Alnus incana` <dbl>, `Alnus rhombifolia` <dbl>,
#> # `Alnus viridis` <dbl>, `Artemisia absinthium` <dbl>,
#> # `Artemisia annua` <dbl>, `Artemisia campestris` <dbl>,
#> # `Artemisia maritima` <dbl>, `Artemisia vulgaris` <dbl>, …
# Compute per-dataset taxon richness using purrr::map()
data_nested |>
dplyr::mutate(
n_taxa = purrr::map_int(
.x = data_community,
.f = ~ {
# Number of taxon columns = all columns except sample_name
ncol(.x) - 1L
}
)
) |>
dplyr::select(dataset_name, dataset_type, n_taxa)
#> # A tibble: 27 × 3
#> dataset_name dataset_type n_taxa
#> <chr> <chr> <int>
#> 1 dataset_1 vegetation_plot 45
#> 2 dataset_2 vegetation_plot 45
#> 3 dataset_3 vegetation_plot 45
#> 4 dataset_4 vegetation_plot 45
#> 5 dataset_5 vegetation_plot 45
#> 6 dataset_6 vegetation_plot 45
#> 7 dataset_7 vegetation_plot 45
#> 8 dataset_8 vegetation_plot 45
#> 9 dataset_9 vegetation_plot 45
#> 10 dataset_28 vegetation_plot 45
#> # ℹ 17 more rowsWhen to use the nested output
- Recommended for most users — the data arrives clean and clearly organised, with no schema knowledge required.
-
Iterate over datasets with
purrr::map(), applying the same analysis to every dataset independently. - Keep data organised by dataset — useful when datasets differ in the number of samples they contain.
- Export each dataset’s community matrix or trait table as a separate object or file.
Advanced — Raw flat tibble (return_raw_data = TRUE)
Setting return_raw_data = TRUE executes the plan and returns the SQL result immediately, skipping the column-renaming and packing steps. This makes it noticeably faster on large queries.
Note: This mode requires familiarity with the VegVault database schema. Column names match the internal database fields, and mixed data types (e.g. taxon rows interspersed with abiotic rows) must be separated manually before analysis.
data_flat <-
plan |>
extract_data(
return_raw_data = TRUE,
verbose = FALSE
)
dplyr::glimpse(data_flat)
#> Rows: 2,430
#> Columns: 16
#> $ dataset_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_name <chr> "dataset_1", "dataset_1", "dataset_1", "dataset_1"…
#> $ data_source_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_type_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ data_source_type_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ coord_long <dbl> -115, -115, -115, -115, -115, -115, -115, -115, -1…
#> $ coord_lat <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45…
#> $ sampling_method_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_type <chr> "vegetation_plot", "vegetation_plot", "vegetation_…
#> $ sample_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ sample_name <chr> "sample_1", "sample_1", "sample_1", "sample_1", "s…
#> $ sample_details <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ age <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ sample_size_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ taxon_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
#> $ value <dbl> 34.3, 74.7, 33.9, 61.4, 35.5, 51.2, 56.1, 46.7, 44…What the columns mean
Every row is one sample × taxon combination. Key columns include:
| Column | Content |
|---|---|
dataset_id |
Unique dataset identifier |
dataset_type |
Type of dataset (e.g. vegetation_plot) |
coord_long / coord_lat
|
Geographic coordinates of the dataset |
sample_id |
Unique sample identifier |
age |
Age of the sample (cal yr BP; 0 = modern) |
taxon_id |
Unique taxon identifier |
taxon_name |
Original taxon name from the source |
value |
Taxon abundance in the sample |
When to use the raw output
- Large queries where the packing step is a bottleneck.
- Workflows that need to join directly on database IDs with external tables before any reshaping.
- Debugging or exploring the raw query result to understand what the database returned.
- Users who are already familiar with the VegVault schema and prefer to do their own tidying with dplyr and tidyr.
# Count the number of samples per dataset
data_flat |>
dplyr::distinct(dataset_id, sample_id) |>
dplyr::count(dataset_id, name = "n_samples")
#> # A tibble: 27 × 2
#> dataset_id n_samples
#> <int> <int>
#> 1 1 2
#> 2 2 2
#> 3 3 2
#> 4 4 2
#> 5 5 2
#> 6 6 2
#> 7 7 2
#> 8 8 2
#> 9 9 2
#> 10 28 2
#> # ℹ 17 more rowsComparing the two outputs
Both objects represent the same plan executed in different output modes. We can verify this by checking that the number of datasets matches.
# Nested: one row per dataset
nrow(data_nested)
#> [1] 27
# Flat: one row per sample × taxon combination
nrow(data_flat)
#> [1] 2430
# The nested tibble covers the same datasets as the flat one
dplyr::n_distinct(data_flat$dataset_id) == nrow(data_nested)
#> [1] TRUEFull pipeline
Both modes in a single script — build the plan once, execute it twice with different output modes:
# --- Default: nested (packed) output ---
data_nested <-
open_vault(path = path_to_db) |>
get_datasets() |>
select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
select_dataset_by_geo(
long_lim = c(-116, -114),
lat_lim = c(44, 46)
) |>
get_samples() |>
get_taxa() |>
extract_data(return_raw_data = FALSE)
# --- Advanced: raw flat output ---
data_flat <-
open_vault(path = path_to_db) |>
get_datasets() |>
select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
select_dataset_by_geo(
long_lim = c(-116, -114),
lat_lim = c(44, 46)
) |>
get_samples() |>
get_taxa() |>
extract_data(return_raw_data = TRUE)Next steps
- Retrieve data citations with
get_references()before publishing results - Combine taxa with climate data — see the vegetation and climate article
- Work with trait data — see the functional traits article
- When publishing, cite VegVault using the data paper