Exploring extract_data(): from ready-to-use output to raw data

Introduction

Every vaultkeepr pipeline follows the same two-phase pattern: build the plan, then execute it. Each function call — get_datasets(), select_dataset_by_type(), get_samples(), get_taxa(), etc. — adds a step to a lazy SQL query stored in the plan object without pulling any data into R. extract_data() is the execution step: it runs the complete plan against the database and returns a tibble ready for analysis.

extract_data() offers two output modes controlled by the return_raw_data argument:

return_raw_data = FALSE (default) — the result is passed through get_readable_column_names() (which translates internal database column names to descriptive equivalents) and pack_data() (which nests sample, community, trait, and abiotic data into dedicated list-columns, one row per dataset). This is the recommended output for most users.
return_raw_data = TRUE — the raw SQL result is returned immediately, skipping all post-processing. It is faster and preserves every column exactly as it comes from the database, but requires familiarity with the VegVault database schema.

We use a small example database (built automatically below) that mirrors a real VegVault file. You can swap the path for your own VegVault download to run the same code on real data.

Working with real data

Download the full VegVault database from the Database Access page. For an overview of the extraction workflow, see VegVault Usage Examples.

# Build (or reuse) the example database and return its file path
path_to_db <- source("helper_make_example_db.R")$value

Building the plan

We start with a minimal plan that retrieves vegetation plot datasets from a small geographic area together with their taxa. This gives us both sample metadata and community (taxon abundance) data — enough to illustrate both output modes clearly.

library(vaultkeepr)

plan <-
  open_vault(path = path_to_db) |>
  get_datasets() |>
  select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
  select_dataset_by_geo(
    long_lim = c(-116, -114),
    lat_lim  = c(44, 46)
  ) |>
  get_samples() |>
  get_taxa()
#> ℹ Vault opened successfully
#> ! The data does not contain all dataset types specified in `vec_dataset_type`. Changing `vec_dataset_type` to the dataset types present in the data as: vegetation_plot

At this point plan holds the complete lazy query; no data has been collected yet.

Default output — Nested (packed) tibble (`return_raw_data = FALSE`)

With the default mode, extract_data() executes the plan and then passes the SQL result through two additional steps before returning it:

get_readable_column_names() — translates internal database column names to descriptive equivalents.
pack_data() — nests sample, community, trait, and abiotic data into dedicated list-columns, one row per dataset.

data_nested <-
  plan |>
  extract_data(
    return_raw_data = FALSE,
    verbose = FALSE
  )

dplyr::glimpse(data_nested)
#> Rows: 27
#> Columns: 9
#> $ dataset_name            <chr> "dataset_1", "dataset_2", "dataset_3", "datase…
#> $ data_source_desc        <chr> "EVA (European Vegetation Archive)", "EVA (Eur…
#> $ dataset_type            <chr> "vegetation_plot", "vegetation_plot", "vegetat…
#> $ dataset_source_type     <chr> "BIEN", "BIEN", "BIEN", "sPlotOpen", "sPlotOpe…
#> $ coord_long              <dbl> -115, -115, -115, -115, -115, -115, -115, -115…
#> $ coord_lat               <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45…
#> $ sampling_method_details <chr> "Standardised vegetation plot survey (1–1000 m…
#> $ data_samples            <list> [<tbl_df[2 x 5]>], [<tbl_df[2 x 5]>], [<tbl_d…
#> $ data_community          <list> [<tbl_df[2 x 46]>], [<tbl_df[2 x 46]>], [<tbl…

Structure of the nested tibble

Each row represents one dataset. The top-level metadata columns (dataset_name, dataset_type, coord_long, coord_lat, etc.) describe the dataset. Detailed data are packed into nested data.frame columns:

Nested column	Content	Present when
`data_samples`	Sample metadata: `sample_name`, `age`, `sample_size`, `description`	`get_samples()` was called
`data_community`	Community matrix: samples × taxa (wide format, taxa as columns, abundance as values)	`get_taxa()` was called
`data_traits`	Trait table: samples × taxa × trait values	`get_traits()` was called
`data_abiotic`	Abiotic table: samples linked to gridpoint variables	`get_abiotic_data()` was called

Working with the nested columns

Use tidyr::unnest() to expand a nested column back into a flat tibble, or use purrr::map() to operate on each dataset’s data independently.

# Expand sample metadata for every dataset
data_nested |>
  tidyr::unnest(cols = data_samples)
#> # A tibble: 54 × 13
#>    dataset_name data_source_desc     dataset_type dataset_source_type coord_long
#>    <chr>        <chr>                <chr>        <chr>                    <dbl>
#>  1 dataset_1    EVA (European Veget… vegetation_… BIEN                      -115
#>  2 dataset_1    EVA (European Veget… vegetation_… BIEN                      -115
#>  3 dataset_2    EVA (European Veget… vegetation_… BIEN                      -115
#>  4 dataset_2    EVA (European Veget… vegetation_… BIEN                      -115
#>  5 dataset_3    EVA (European Veget… vegetation_… BIEN                      -115
#>  6 dataset_3    EVA (European Veget… vegetation_… BIEN                      -115
#>  7 dataset_4    EVA (European Veget… vegetation_… sPlotOpen                 -115
#>  8 dataset_4    EVA (European Veget… vegetation_… sPlotOpen                 -115
#>  9 dataset_5    EVA (European Veget… vegetation_… sPlotOpen                 -115
#> 10 dataset_5    EVA (European Veget… vegetation_… sPlotOpen                 -115
#> # ℹ 44 more rows
#> # ℹ 8 more variables: coord_lat <dbl>, sampling_method_details <chr>,
#> #   sample_name <chr>, age <dbl>, sample_details <chr>, sample_size <dbl>,
#> #   description <chr>, data_community <list>

# Expand the community matrix for every dataset
data_nested |>
  tidyr::unnest(cols = data_community)
#> # A tibble: 54 × 54
#>    dataset_name data_source_desc     dataset_type dataset_source_type coord_long
#>    <chr>        <chr>                <chr>        <chr>                    <dbl>
#>  1 dataset_1    EVA (European Veget… vegetation_… BIEN                      -115
#>  2 dataset_1    EVA (European Veget… vegetation_… BIEN                      -115
#>  3 dataset_2    EVA (European Veget… vegetation_… BIEN                      -115
#>  4 dataset_2    EVA (European Veget… vegetation_… BIEN                      -115
#>  5 dataset_3    EVA (European Veget… vegetation_… BIEN                      -115
#>  6 dataset_3    EVA (European Veget… vegetation_… BIEN                      -115
#>  7 dataset_4    EVA (European Veget… vegetation_… sPlotOpen                 -115
#>  8 dataset_4    EVA (European Veget… vegetation_… sPlotOpen                 -115
#>  9 dataset_5    EVA (European Veget… vegetation_… sPlotOpen                 -115
#> 10 dataset_5    EVA (European Veget… vegetation_… sPlotOpen                 -115
#> # ℹ 44 more rows
#> # ℹ 49 more variables: coord_lat <dbl>, sampling_method_details <chr>,
#> #   data_samples <list>, sample_name <chr>, `Alnus cordata` <dbl>,
#> #   `Alnus glutinosa` <dbl>, `Alnus incana` <dbl>, `Alnus rhombifolia` <dbl>,
#> #   `Alnus viridis` <dbl>, `Artemisia absinthium` <dbl>,
#> #   `Artemisia annua` <dbl>, `Artemisia campestris` <dbl>,
#> #   `Artemisia maritima` <dbl>, `Artemisia vulgaris` <dbl>, …

# Compute per-dataset taxon richness using purrr::map()
data_nested |>
  dplyr::mutate(
    n_taxa = purrr::map_int(
      .x = data_community,
      .f = ~ {
        # Number of taxon columns = all columns except sample_name
        ncol(.x) - 1L
      }
    )
  ) |>
  dplyr::select(dataset_name, dataset_type, n_taxa)
#> # A tibble: 27 × 3
#>    dataset_name dataset_type    n_taxa
#>    <chr>        <chr>            <int>
#>  1 dataset_1    vegetation_plot     45
#>  2 dataset_2    vegetation_plot     45
#>  3 dataset_3    vegetation_plot     45
#>  4 dataset_4    vegetation_plot     45
#>  5 dataset_5    vegetation_plot     45
#>  6 dataset_6    vegetation_plot     45
#>  7 dataset_7    vegetation_plot     45
#>  8 dataset_8    vegetation_plot     45
#>  9 dataset_9    vegetation_plot     45
#> 10 dataset_28   vegetation_plot     45
#> # ℹ 17 more rows

When to use the nested output

Recommended for most users — the data arrives clean and clearly organised, with no schema knowledge required.
Iterate over datasets with purrr::map(), applying the same analysis to every dataset independently.
Keep data organised by dataset — useful when datasets differ in the number of samples they contain.
Export each dataset’s community matrix or trait table as a separate object or file.

Advanced — Raw flat tibble (`return_raw_data = TRUE`)

Setting return_raw_data = TRUE executes the plan and returns the SQL result immediately, skipping the column-renaming and packing steps. This makes it noticeably faster on large queries.

Note: This mode requires familiarity with the VegVault database schema. Column names match the internal database fields, and mixed data types (e.g. taxon rows interspersed with abiotic rows) must be separated manually before analysis.

data_flat <-
  plan |>
  extract_data(
    return_raw_data = TRUE,
    verbose = FALSE
  )

dplyr::glimpse(data_flat)
#> Rows: 2,430
#> Columns: 16
#> $ dataset_id          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_name        <chr> "dataset_1", "dataset_1", "dataset_1", "dataset_1"…
#> $ data_source_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_type_id     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ data_source_type_id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ coord_long          <dbl> -115, -115, -115, -115, -115, -115, -115, -115, -1…
#> $ coord_lat           <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45…
#> $ sampling_method_id  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ dataset_type        <chr> "vegetation_plot", "vegetation_plot", "vegetation_…
#> $ sample_id           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ sample_name         <chr> "sample_1", "sample_1", "sample_1", "sample_1", "s…
#> $ sample_details      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
#> $ age                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ sample_size_id      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
#> $ taxon_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
#> $ value               <dbl> 34.3, 74.7, 33.9, 61.4, 35.5, 51.2, 56.1, 46.7, 44…

What the columns mean

Every row is one sample × taxon combination. Key columns include:

Column	Content
`dataset_id`	Unique dataset identifier
`dataset_type`	Type of dataset (e.g. `vegetation_plot`)
`coord_long` / `coord_lat`	Geographic coordinates of the dataset
`sample_id`	Unique sample identifier
`age`	Age of the sample (cal yr BP; 0 = modern)
`taxon_id`	Unique taxon identifier
`taxon_name`	Original taxon name from the source
`value`	Taxon abundance in the sample

When to use the raw output

Large queries where the packing step is a bottleneck.
Workflows that need to join directly on database IDs with external tables before any reshaping.
Debugging or exploring the raw query result to understand what the database returned.
Users who are already familiar with the VegVault schema and prefer to do their own tidying with dplyr and tidyr.

# Count the number of samples per dataset
data_flat |>
  dplyr::distinct(dataset_id, sample_id) |>
  dplyr::count(dataset_id, name = "n_samples")
#> # A tibble: 27 × 2
#>    dataset_id n_samples
#>         <int>     <int>
#>  1          1         2
#>  2          2         2
#>  3          3         2
#>  4          4         2
#>  5          5         2
#>  6          6         2
#>  7          7         2
#>  8          8         2
#>  9          9         2
#> 10         28         2
#> # ℹ 17 more rows

Comparing the two outputs

Both objects represent the same plan executed in different output modes. We can verify this by checking that the number of datasets matches.

# Nested: one row per dataset
nrow(data_nested)
#> [1] 27

# Flat: one row per sample × taxon combination
nrow(data_flat)
#> [1] 2430

# The nested tibble covers the same datasets as the flat one
dplyr::n_distinct(data_flat$dataset_id) == nrow(data_nested)
#> [1] TRUE

Full pipeline

Both modes in a single script — build the plan once, execute it twice with different output modes:

# --- Default: nested (packed) output ---
data_nested <-
  open_vault(path = path_to_db) |>
  get_datasets() |>
  select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
  select_dataset_by_geo(
    long_lim = c(-116, -114),
    lat_lim  = c(44, 46)
  ) |>
  get_samples() |>
  get_taxa() |>
  extract_data(return_raw_data = FALSE)

# --- Advanced: raw flat output ---
data_flat <-
  open_vault(path = path_to_db) |>
  get_datasets() |>
  select_dataset_by_type(sel_dataset_type = "vegetation_plot") |>
  select_dataset_by_geo(
    long_lim = c(-116, -114),
    lat_lim  = c(44, 46)
  ) |>
  get_samples() |>
  get_taxa() |>
  extract_data(return_raw_data = TRUE)

Next steps

Retrieve data citations with get_references() before publishing results
Combine taxa with climate data — see the vegetation and climate article
Work with trait data — see the functional traits article
When publishing, cite VegVault using the data paper