Published

August 7, 2025

Modified

August 28, 2025

Data Assembly Process of VegVault 1.0.0

VegVault v1.0.0 has been assembled through systematic integration of multiple publicly available databases, each contributing specialized data types to create a comprehensive vegetation database spanning both temporal and ecological dimensions.

[Figure: VegVault assembly workflow]

Figure legend:

Primary data sources:

  1. Neotoma Paleoecology Database - Fossil pollen data (0-20,000 years BP)
  2. sPlotOpen - Contemporary vegetation plots
  3. BIEN - Contemporary vegetation and functional traits
  4. TRY Plant Trait Database - Comprehensive plant functional traits
  5. CHELSA - High-resolution climate data (contemporary and paleoclimate)
  6. WoSIS - Global soil property data

Data Processing: Each data type has its own processing pipeline, maintained in a dedicated GitHub repository

  1. VegVault-FOSSILPOL v1.0.0 - Fossil pollen data processing
  2. VegVault-Vegetation_data v1.1.0 - Contemporary vegetation processing
  3. VegVault-Trait_data v1.1.0 - Functional trait data processing
  4. VegVault-abiotic_data v1.1.0 - Environmental data processing
  5. VegVault v1.0.0 - Consolidation into unified SQLite database structure with key steps:
    • Taxonomic Harmonization: Using the {taxospace} R package to align taxa with GBIF backbone
    • Trait Categorization: Grouping traits into functional domains following Díaz et al. (2016)
    • Spatial-Temporal Linking: Creating gridpoints for abiotic data and linking to vegetation samples

Reproducibility Through Version Control

All VegVault processing repositories use GitHub Tags to ensure complete reproducibility. The specific tagged versions used for v1.0.0 are documented in both our code and documentation, enabling exact replication of the database assembly process.

Data Processing Details

Fossil Pollen Records

Fossil pollen records were downloaded from the Neotoma Paleoecology Database through its API on 26th June 2023. All data acquisition and processing used FOSSILPOL, a workflow for processing and standardizing global paleoecological pollen data (version 1.0.0). The workflow covers the selection of depositional environments, ecological groups, chronology control point types, and the minimum number of chronology control points needed to construct age-depth models. Individual samples and records were filtered by age limits, number of pollen grains, maximum age interpolation, and number of valid levels. In addition, all age-depth models were re-estimated using a Bayesian probabilistic approach that includes information about individual age uncertainty, which increases the accuracy of the fossil pollen data.

Selection Criteria:

  • Only records of type “pollen” with valid geographical coordinates (longitude between -180 and 180, latitude between -90 and 90)
  • Depositional environments: lakes, bogs, and mires only (selection table)
  • Specific ecological groups retained (selection table)
  • Minimum 5 chronology control points (valid types)

Quality Filters:

  • Minimum 125 pollen grains per sample (balanced for data retention vs. quality)
    • The preferred minimum was initially set at 150, but this threshold resulted in significant data loss
    • The threshold was lowered to 125, on the condition that fewer than 75% of samples have a low pollen sum
  • Age limits: 0-20,000 cal yr BP
  • Exclusion of samples more than 3,000 years older than the last chronology control point
  • Minimum 5 samples per record
  • Age-depth models re-calculated using a Bayesian approach ({Bchron} R package)
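
The quality filters above can be sketched in a few lines. This is a minimal illustration, not the FOSSILPOL implementation: the `Sample` class and `filter_record` function are hypothetical names, the "last" chronology control point is assumed to be the oldest one, and the 75% low-pollen-sum condition is left out because its exact definition is not given here.

```python
from dataclasses import dataclass

MIN_POLLEN_SUM = 125             # minimum pollen grains per sample
AGE_MIN, AGE_MAX = 0, 20_000     # age limits in cal yr BP
MAX_BEYOND_LAST_CONTROL = 3_000  # years allowed past the last control point
MIN_SAMPLES = 5                  # minimum samples per record
MIN_CONTROL_POINTS = 5           # minimum chronology control points


@dataclass
class Sample:  # hypothetical container for one pollen sample
    age: float       # cal yr BP
    pollen_sum: int  # total counted pollen grains


def filter_record(samples, control_point_ages):
    """Return the samples that pass all filters, or [] if the record is rejected."""
    if len(control_point_ages) < MIN_CONTROL_POINTS:
        return []
    # Assumption: the "last" control point is the oldest one.
    oldest_allowed = max(control_point_ages) + MAX_BEYOND_LAST_CONTROL
    kept = [
        s for s in samples
        if s.pollen_sum >= MIN_POLLEN_SUM
        and AGE_MIN <= s.age <= AGE_MAX
        and s.age <= oldest_allowed
    ]
    # A record needs enough surviving samples to be retained.
    return kept if len(kept) >= MIN_SAMPLES else []
```

Records that fail the control-point or sample-count minimum are dropped as a whole, while the age and pollen-sum checks remove individual samples.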

Processing Details: Available at VegVault-FOSSILPOL

Contemporary Vegetation Plots

The primary sources of contemporary plot-based vegetation data are BIEN (Botanical Information and Ecology Network) and sPlotOpen (the open-access version of sPlot).

BIEN Processing:

  • Downloaded using {RBIEN} R package v1.2.7 on 2nd August 2023 using function BIEN::BIEN_plot_datasource()
  • Retained key columns: datasource_id, datasource, plot_name, sampling_protocol, methodology_reference, methodology_description, longitude, latitude, plot_area_ha, subplot, individual_count, family_matched, name_matched, name_matched_author, verbatim_family, verbatim_scientific_name, scrubbed_species_binomial, scrubbed_taxonomic_status, scrubbed_family, scrubbed_author
  • All columns renamed using snake case
  • Filtered out records with missing essential information (NA values in datasource_id, datasource, plot_name, longitude, latitude, and/or plot_area_ha)

sPlotOpen Processing:

  • Downloaded sPlotOpen v2.0 on 26th September 2023
  • Linked tables DT2.oa and header.oa via PlotObservationID column
  • Applied quality filters for geographic and area data (filtered out rows with NA values in PlotObservationID, GIVD_ID, Longitude, Latitude, and/or Releve_area)
  • All columns renamed using snake case
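
The sPlotOpen steps above (link, filter, rename) can be sketched as follows. The actual pipeline is written in R; this Python version only illustrates the logic. The function names are hypothetical, the link is assumed to behave like an inner join, and rows are represented as plain dictionaries.

```python
import re

# Columns that must not be missing (taken from the filter described above).
REQUIRED = ("PlotObservationID", "GIVD_ID", "Longitude", "Latitude", "Releve_area")


def to_snake_case(name):
    """Convert names such as 'PlotObservationID' to 'plot_observation_id'."""
    s = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)
    return re.sub(r"_+", "_", s).lower()


def link_and_clean(header_rows, species_rows):
    """Join species rows to plot headers, drop incomplete plots, rename columns."""
    headers = {
        h["PlotObservationID"]: h
        for h in header_rows
        if all(h.get(col) is not None for col in REQUIRED)
    }
    linked = []
    for row in species_rows:
        header = headers.get(row["PlotObservationID"])
        if header is None:  # plot missing or filtered out (inner-join assumption)
            continue
        merged = {**header, **row}
        linked.append({to_snake_case(k): v for k, v in merged.items()})
    return linked
```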

Processing Details: Available at VegVault-Vegetation_data

Functional Traits

Following Díaz et al. (2016), we selected six key functional traits representing:

  • Stem specific density
  • Leaf nitrogen content per unit mass
  • Diaspore mass
  • Plant height
  • Leaf area
  • Leaf mass per area

TRY Database Processing:

  • Data request ID: 28498 on 29th August 2023
  • Requested traits with codes: 3106, 4, 3108, 3110, 3112, 3114, 3116, 3117, 14, and 26 (closest to Díaz et al. 2016 description)
  • Used {rtry} R package v1.1.0 for data import
  • For each trait (TraitID), extracted all relevant observations (ObservationID), ensuring only observations unique to each trait
  • Identified all unique data (DataID) associated with each trait
  • Excluded non-meaningful trait variations (DataID: 2221, 2222, 2223, 2224, 2225, 3646, 3647, 3698, 3699, 3727, 3728, 3730, 3731, 3849, 3850, 4029, and 4030), e.g., height at 15 days
  • Extracted covariate information (additional data stored in DataName column)
  • All columns renamed using snake case
  • Added Trait Domain variable to group traits following Díaz et al. (2016) selection for efficient extraction across TRY and BIEN

BIEN Traits Processing:

  • Downloaded using {RBIEN} R package v1.2.7 on 15th December 2023 using function BIEN::BIEN_trait_trait()
  • Requested traits: whole plant height, stem wood density, leaf area, leaf area per leaf dry mass, leaf nitrogen content per leaf dry mass, and seed mass
  • Retained columns: trait_name, trait_value, unit, id, longitude, latitude, method, url_source, source_citation, project_pi, scrubbed_species_binomial, and access
  • Calculated derived measures (e.g., leaf mass per area = 1/leaf area per leaf dry mass)
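
As a small illustration of the derived measure, leaf mass per area is the reciprocal of leaf area per leaf dry mass. The sketch below assumes the input is a positive number; the function name is hypothetical, and the unit follows from the input (for example, an input in m² kg⁻¹ gives a result in kg m⁻²).

```python
def leaf_mass_per_area(leaf_area_per_leaf_dry_mass):
    """Return 1 / (leaf area per leaf dry mass), or None for unusable input."""
    if leaf_area_per_leaf_dry_mass is None or leaf_area_per_leaf_dry_mass <= 0:
        return None  # the reciprocal is undefined for missing or non-positive values
    return 1.0 / leaf_area_per_leaf_dry_mass
```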

Processing Details: Available at VegVault-Trait_data

Abiotic Environmental Data

The primary sources of abiotic data are CHELSA, CHELSA-TRACE21, and WoSIS Soil Profile Database. The first two data sources provide high-resolution downscaled climatic data, while the latter offers detailed soil information (only available for contemporary data).

CHELSA Climate Data:

  • Contemporary: CHELSA v2.1 downloaded on 8th September 2023
  • Used {ClimDatDownloadR} R package with function ClimDatDownloadR::Chelsa.Clim.download()
  • Selected bio-variables: 1, 4, 6, 12, 15, 18, 19
  • Spatial aggregation: 25x factor using median values with {terra} R package function terra::aggregate(factor = 25, fun = "median")
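
To show what aggregation by a factor with a median does, here is a minimal pure-Python sketch on a nested list. It is not the {terra} implementation. The function name is hypothetical, and skipping `None` cells and keeping partial blocks at the edges are assumptions of this sketch, not statements about how terra::aggregate() behaves.

```python
from statistics import median


def aggregate_median(grid, factor):
    """Coarsen a 2D grid by taking the median of each factor x factor block."""
    n_rows, n_cols = len(grid), len(grid[0])
    out = []
    for r in range(0, n_rows, factor):
        out_row = []
        for c in range(0, n_cols, factor):
            # Collect the non-missing values of the current block.
            block = [
                grid[i][j]
                for i in range(r, min(r + factor, n_rows))
                for j in range(c, min(c + factor, n_cols))
                if grid[i][j] is not None
            ]
            out_row.append(median(block) if block else None)
        out.append(out_row)
    return out
```

With a factor of 25, each output cell summarizes up to 625 input cells, which reduces the number of values to store by roughly that factor.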

CHELSA-TRACE21 Paleoclimate Data:

  • Paleoclimate: CHELSA-TRACE21 v1.0 downloaded on 31st December 2023
  • Downloaded values for each 500-year time-slice between 0 and 18,000 years before present (BP)
  • Selected bio-variables: 1, 4, 6, 12, 15, 18, 19
  • For each time slice, applied spatial aggregation using terra::aggregate(factor = 25, fun = "median")

WoSIS Soil Data:

  • Downloaded on 11th September 2023 (both HWSD2_RASTER.zip and HWSD2_DB.zip)
  • Extracted soil type names (column HWSD2_SMU_ID) by combining tables HWSD2_SMU and D_WRB4
  • Added soil type information to the raster
  • Resampled using terra::resample(method = "near") function to match climate data resolution
  • Provides essential edaphic context for vegetation-environment relationships (contemporary data only)

Processing Details: Available at VegVault-abiotic_data

Data Integration Procedures

The outputs of all processing pipelines, at their corresponding Tags, are migrated into a single SQLite database by the VegVault GitHub repository, which is available under DOI: 10.5281/zenodo.15201994.

Data Migration Details

Migrating sPlotOpen vegetation data:

  • Dataset name (dataset_name) created from plot_observation_id as splot_[plot_observation_id]
  • Original data source from column givd_id stored in DatasetSourcesID table
  • Sample name (sample_name) created using plot_observation_id as splot_[plot_observation_id]
  • Sample Size (sample_size) created from releve_area column
  • All samples automatically assigned age of 0
  • Taxonomic names extracted from Species column, abundances from Original_abundance

Migrating BIEN vegetation data:

  • Dataset name (dataset_name) created as bien_[row number]
  • Original data source from column datasource stored in DatasetSourcesID table
  • Sampling method extracted from methodology_description column
  • Sample name (sample_name) created as bien_[row number]
  • Sample Size (sample_size) created from plot_area_ha column, multiplied by 10,000 (stored in square meters)
  • All samples automatically assigned age of 0
  • Taxonomic names extracted from name_matched column
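
The naming and sample-size rules for the two vegetation sources can be sketched as below. The function names and the dictionary layout are hypothetical; only the name patterns, the age of 0, and the hectare conversion come from the lists above.

```python
def splot_sample(row):
    """Build dataset/sample fields for one sPlotOpen plot (row: dict of columns)."""
    name = f"splot_{row['plot_observation_id']}"
    return {
        "dataset_name": name,
        "sample_name": name,
        "sample_size": row["releve_area"],
        "age": 0,  # contemporary data
    }


def bien_sample(row_number, row):
    """Build dataset/sample fields for one BIEN plot, identified by its row number."""
    name = f"bien_{row_number}"
    return {
        "dataset_name": name,
        "sample_name": name,
        "sample_size": row["plot_area_ha"] * 10_000,  # hectares to square meters
        "age": 0,  # contemporary data
    }
```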

Migrating fossil pollen data:

  • Dataset name (dataset_name) created from dataset_id as fossilpol_[dataset_id]
  • Note: column dataset_id from primary source does not match dataset_id in VegVault
  • Original data source from source_of_data column stored in DatasetSourcesID table
  • Sampling method extracted from depositionalenvironment column
  • Individual Dataset Reference extracted from doi column
  • Sample name (sample_name) created using dataset_id and sample_id as fossilpol_[dataset_id]_[sample_id]
  • Ages extracted from levels column
  • Age uncertainty from age-depth models extracted from age_uncertainty column
  • Taxonomic names and abundances extracted from counts_harmonised column

Migrating TRY functional traits:

  • Dataset name (dataset_name) created as try_[row number]
  • Original data source from column dataset stored in DatasetSourcesID table
  • Individual Dataset Reference extracted from reference_source column
  • Sample name (sample_name) created as try_[row number]
  • Individual Sample reference extracted from dataset_reference_citation
  • All samples automatically assigned age of 0
  • Trait names extracted from trait_full_name, taxonomic names from acc_species_name, trait values from trait_value

Migrating BIEN functional traits:

  • Dataset name (dataset_name) created as bien_traits_[row number]
  • Original data source from column project_pi stored in DatasetSourcesID table
  • Individual Dataset Reference extracted from source_citation column
  • Sample name (sample_name) created using column id as bien_traits_[id]
  • All samples automatically assigned age of 0
  • Trait names extracted from trait_name, taxonomic names from scrubbed_species_binomial, trait values from trait_value
  • Trait leaf mass per area calculated from leaf area per leaf dry mass as 1/value

Final Database Integration Procedures

In addition to the consolidation of all processed data into a unified SQLite database, the final VegVault migration repository performs three critical procedures to ensure data consistency and usability:

(i) Taxonomic Classification

As VegVault integrates data on taxa from various sources, the {taxospace} R package is used to classify diverse taxa into a unifying taxonomic backbone. The {taxospace} tool automatically aligns taxon names with the GBIF taxonomic backbone. Specifically, we find the best match of raw taxon names using the Global Names Resolver, which is then aligned with GBIF. The resulting taxonomic classification information, detailed up to the family level, is stored for each taxon, ensuring consistency and facilitating comparative analyses across different datasets.

Important limitations: Taxonomic classification down to the species level is not available for every taxon (e.g., some fossil pollen types can only be identified to the genus or family level). For several taxa, no matching classification could be found. The taxonomic classification is additional information: the original taxon name is always present and returned by default. Finally, users should be aware that this classification is automated and may contain errors.

(ii) Grouping of Traits into Trait Domains

As there are differences in trait names across data sources (e.g., “Leaf nitrogen (N) content per leaf dry mass” vs. “leaf nitrogen content per leaf dry mass”), we added a new variable Trait Domain that groups traits together following the trait selection of Díaz et al. (2016). For example, trait Plant height vegetative from TRY and trait whole plant height from BIEN are both grouped under the Plant height Trait Domain. This grouping serves as an efficient mechanism for extracting comparable traits across both TRY and BIEN datasets.
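
A lookup table is one simple way to picture the Trait Domain grouping. The sketch below uses only the trait names quoted in the text. The domain label for leaf nitrogen is assumed to match the name in the list of six selected traits, and the case-insensitive lookup is an illustration choice, not necessarily how VegVault stores the mapping.

```python
# Keys are lower-cased source trait names; values are Trait Domain labels.
TRAIT_DOMAINS = {
    "plant height vegetative": "Plant height",  # TRY name quoted in the text
    "whole plant height": "Plant height",       # BIEN name quoted in the text
    "leaf nitrogen (n) content per leaf dry mass": "Leaf nitrogen content per unit mass",
    "leaf nitrogen content per leaf dry mass": "Leaf nitrogen content per unit mass",
}


def trait_domain(trait_name):
    """Return the Trait Domain for a source trait name, or None if unmapped."""
    return TRAIT_DOMAINS.get(trait_name.strip().lower())
```

Filtering on the domain value then returns comparable traits from both sources in a single step.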

(iii) Creation of Gridpoints for Abiotic Data

We developed a data structure that provides readily available environmental context for each vegetation (and trait) record by creating spatio-temporal links between these records and abiotic information. As raster data are not suitable for storage in an SQLite database, we created artificial points, called ‘gridpoints’, located in the center of each raster cell. This resulted in a uniform spatio-temporal matrix of gridpoints holding the abiotic information.

Gridpoint naming conventions: dataset_name is created as “geo_[longitude]_[latitude]” and sample_name is created as “geo_[longitude]_[latitude]_[age]”.
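
A minimal sketch of how gridpoints and their names could be derived from a regular raster follows. The function names are hypothetical, the name parts are assumed to be separated by underscores, and the way coordinates are formatted inside the names is an illustration choice that may differ from what VegVault stores.

```python
def cell_centres(x_min, y_max, resolution, n_rows, n_cols):
    """Yield (longitude, latitude) of each raster cell centre, row by row from the top."""
    for row in range(n_rows):
        lat = y_max - (row + 0.5) * resolution
        for col in range(n_cols):
            lon = x_min + (col + 0.5) * resolution
            yield lon, lat


def gridpoint_names(lon, lat, age):
    """Return (dataset_name, sample_name) for one gridpoint at one time slice."""
    dataset_name = f"geo_{lon}_{lat}"       # assumed underscore-separated pattern
    sample_name = f"{dataset_name}_{age}"   # one sample per gridpoint and age
    return dataset_name, sample_name
```

Each gridpoint therefore acts as one dataset, with one sample per time slice.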

Spatial-temporal linking: We linked gridpoints with other non-gridpoint Samples (vegetation_plot, fossil_pollen_archive, and traits) and calculated the spatial and temporal distances between them. We retained any gridpoint Sample within 50 km and/or 5000 years from any other non-gridpoint Sample and discarded the rest.
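
The retention rule can be sketched with a great-circle distance and an age difference. This is an illustration under stated assumptions rather than the VegVault code: the function names and the dictionary keys `lon`, `lat`, and `age` are hypothetical, distances use the haversine formula on a spherical Earth, and because "and/or" is ambiguous the `require_both` flag makes the interpretation explicit (defaulting to both criteria).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0   # mean Earth radius (spherical approximation)
MAX_DISTANCE_KM = 50       # spatial threshold from the text
MAX_AGE_DIFF_YR = 5000     # temporal threshold from the text


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def retain_gridpoints(gridpoints, samples, require_both=True):
    """Keep gridpoints close in space and/or time to at least one other sample."""
    kept = []
    for g in gridpoints:
        for s in samples:
            near = haversine_km(g["lon"], g["lat"], s["lon"], s["lat"]) <= MAX_DISTANCE_KM
            recent = abs(g["age"] - s["age"]) <= MAX_AGE_DIFF_YR
            if (near and recent) if require_both else (near or recent):
                kept.append(g)
                break  # one matching sample is enough to retain the gridpoint
    return kept
```

This brute-force loop compares every gridpoint with every sample; at global scale a spatial index or a pre-filter on coordinates would be needed.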
