Published: January 24, 2025

Modified: August 28, 2025

VegVault Database Structure

Currently, VegVault consists of 31 interconnected tables with 87 fields (variables), which are described in detail below. However, the internal database structure may not be directly relevant to most users, as the {vaultkeepr} R package processes all data to output only the most relevant information in a “ready-to-analyze” format.

[!NOTE]

For Most Users: If you’re primarily interested in using VegVault for research, you may want to start with our Usage Examples rather than diving into the technical database structure.

This section provides comprehensive documentation of all tables and their relationships for users who need detailed technical information about the database architecture.


Full Database Schema

Metadata Tables

Several tables contain metadata and administrative information that are not directly linked to the scientific data:

  • Authors: Information about VegVault authors and maintainers, including contact details
  • version_control: Database version information with descriptions of changes over time
  • sqlite_stat1 & sqlite_stat4: SQLite system tables containing database index statistics for query optimization

column_name data_type description
author_id INTEGER ID of an Author (unique)
author_fullname TEXT Full name of an Author
author_email TEXT Contact email of an Author
author_orcid TEXT ORCID ID of an Author

Column names and types for table Authors.

column_name data_type description
id INTEGER ID of a database version (unique)
version TEXT Version number
update_date TEXT Date of the creation of that version
changelog TEXT Text description of main changes in the database

Column names and types for table version_control.
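As a minimal sketch of how the version_control table could be read, the snippet below builds an in-memory stand-in with the four columns listed above and fetches the most recent entry. The rows are invented, and the idea that a higher id means a newer version is an assumption rather than something the schema guarantees.

```python
import sqlite3

# In-memory stand-in for version_control; the rows are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE version_control (id INTEGER PRIMARY KEY, version TEXT,
                              update_date TEXT, changelog TEXT);
INSERT INTO version_control VALUES
    (1, '0.0.1', '2000-01-01', 'example entry'),
    (2, '0.0.2', '2000-06-01', 'another example entry');
""")

# Assumption: a higher id corresponds to a more recent database version.
latest_version = con.execute(
    "SELECT version, update_date FROM version_control ORDER BY id DESC LIMIT 1"
).fetchone()
print(latest_version)
```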


Datasets

Dataset Structure Overview (Datasets)

The Datasets table represents the main organizational structure in VegVault, serving as the keystone for managing and organizing all data. The table contains one row per Dataset, each with a unique Dataset ID (dataset_id), Dataset name (dataset_name), geographic location (coord_lat, coord_long), Dataset Type (dataset_type_id), Dataset Source (data_source_id), Dataset Source-Type (data_source_type_id), and Sampling Method (sampling_method_id).

column_name data_type description
dataset_id INTEGER ID of a Dataset (unique)
dataset_name TEXT Name of each Dataset
data_source_id INTEGER ID of a Dataset Source
dataset_type_id INTEGER ID of a Dataset Type
data_source_type_id INTEGER ID of a Dataset Source-Type
coord_long REAL Geographical coordinates - longitude
coord_lat REAL Geographical coordinates - latitude
sampling_method_id INTEGER ID of a Sampling Method

Column names and types for table Datasets.

Dataset Types (DatasetTypeID)

The Dataset Type defines the most basic classification of each Dataset, ensuring that the vast amount of data is categorized systematically. Currently, VegVault contains the following types of Datasets:

  • vegetation_plot: This type includes contemporary vegetation plot data, capturing contemporary vegetation characteristics and distributions.
  • fossil_pollen_archive: This type encompasses past vegetation plot data derived from fossil pollen records, providing insights into past vegetation and climate dynamics.
  • traits: This type contains functional trait data, detailing specific characteristics of plant species that influence their ecological roles.
  • gridpoints: This type holds artificially created Datasets used to manage abiotic data, here climate and soil information (see details in Database Assembly).

column_name data_type description
dataset_type_id INTEGER ID of a Dataset Type (unique)
dataset_type TEXT Text description of individual Dataset Types (currently vegetation_plot, fossil_pollen_archive, traits, gridpoints)

Column names and types for table DatasetTypeID.
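The link between Datasets and DatasetTypeID can be illustrated with a small query. The sketch below uses an in-memory SQLite database with invented rows and only a subset of the Datasets columns; it is not a copy of the real data, only an example of resolving dataset_type_id to its text label.

```python
import sqlite3

# In-memory stand-in for the two tables; rows are invented and the
# Datasets table is reduced to the columns needed here.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DatasetTypeID (dataset_type_id INTEGER PRIMARY KEY, dataset_type TEXT);
CREATE TABLE Datasets (
    dataset_id INTEGER PRIMARY KEY, dataset_name TEXT, dataset_type_id INTEGER,
    coord_long REAL, coord_lat REAL
);
INSERT INTO DatasetTypeID VALUES (1, 'vegetation_plot'), (2, 'fossil_pollen_archive');
INSERT INTO Datasets VALUES
    (10, 'plot_a', 1, 14.4, 50.1),
    (11, 'core_b', 2, 5.3, 60.4),
    (12, 'plot_c', 1, 15.0, 49.8);
""")

# Count Datasets per Dataset Type by joining on the shared ID column.
datasets_per_type = con.execute("""
    SELECT t.dataset_type, COUNT(*) AS n_datasets
    FROM Datasets AS d
    JOIN DatasetTypeID AS t USING (dataset_type_id)
    GROUP BY t.dataset_type
    ORDER BY t.dataset_type
""").fetchall()
print(datasets_per_type)
```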

Dataset Source-Types (DatasetSourceTypeID)

VegVault maintains detailed information about the primary source of each Dataset, which improves the findability and referencing of the primary data. Each Dataset is derived from a specific Source-Type, which identifies the source used to retrieve the original data. The current Source-Types in VegVault are BIEN, sPlotOpen, TRY, Neotoma-FOSSILPOL, and gridpoints.

column_name data_type description
data_source_type_id INTEGER ID of a Dataset Source-Type (unique)
dataset_source_type TEXT Text description of individual Dataset Source-Types (currently BIEN, sPlotOpen, TRY, Neotoma-FOSSILPOL, gridpoints)

Column names and types for table DatasetSourceTypeID.

column_name data_type description
data_source_type_id INTEGER ID of a Dataset Source-Type
reference_id INTEGER ID of a Reference
data_source_id NA ID of a Dataset Source

Column names and types for table DatasetSourceTypeReference.

Dataset Sources (DatasetSourcesID)

Each individual Dataset from a specific Dataset Source-Type can have information on the source of the data (i.e. sub-database). VegVault v1.0.0 currently includes 706 sources of Datasets, where each dataset can also have one or more direct references to specific data, ensuring that users can accurately cite and verify the sources of their data. This should help to promote better findability of the primary source of data and referencing.

column_name data_type description
data_source_id INTEGER ID of a Dataset Source (unique)
data_source_desc TEXT Text description of individual Dataset Sources (e.g., name of the sub-database from the primary source)

Column names and types for table DatasetSourcesID.

column_name data_type description
data_source_id INTEGER ID of a Dataset Source
reference_id INTEGER ID of a Reference

Column names and types for table DatasetSourcesReference.

Currently, there are 691 sources of Datasets.

Sampling Methods (SamplingMethodID)

Sampling methods vary significantly across the different types of Datasets integrated into VegVault, reflecting the diverse nature of the data collected. Such information is crucial for understanding the context and limitations of each Dataset Type. For contemporary vegetation plots, sampling involves standardised plot inventories and surveys that capture detailed vegetation characteristics across various regions. Fossil pollen data are collected from sediment records from numerous different depositional environments representing past vegetation. Information on sampling methods is therefore available only for vegetation_plot and fossil_pollen_archive Datasets, providing metadata that supports accurate and contextually relevant analyses.

column_name data_type description
sampling_method_id INTEGER ID of a Dataset Sampling Method (unique)
sampling_method_details TEXT Text description of individual Dataset Sampling Methods

Column names and types for table SamplingMethodID.

column_name data_type description
sampling_method_id INTEGER ID of a Dataset Sampling Method
reference_id INTEGER ID of a Reference

Column names and types for table SamplingMethodReference.

Dataset References (DatasetReferences)

To support robust and transparent scientific research, each Dataset in VegVault can have multiple references at different levels. The Dataset Source-Type, Dataset Source, and Sampling Method can all have their own references, providing detailed provenance and citation information. This multi-level referencing system enhances the traceability and validation of the data. Each Dataset can also have one or more direct references to specific data, further ensuring that users can accurately cite and verify the sources of their data.

column_name data_type description
dataset_id INTEGER ID of a Dataset
reference_id INTEGER ID of a Reference

Column names and types for table DatasetReferences.

This means that one Dataset can have one or several References from each of those levels. Let’s take a look at an example of what that could mean in practice.

We have selected dataset ID: 91256, which is a fossil pollen archive. Therefore, it has the reference of the Dataset Source-Type:

  • Flantua, S. G. A., Mottl, O., Felde, V. A., Bhatta, K. P., Birks, H. H., Grytnes, J.-A., Seddon, A. W. R., & Birks, H. J. B. (2023). A guide to the processing and standardization of global palaeoecological data for large-scale syntheses using fossil pollen. Global Ecology and Biogeography, 32, 1377–1394. https://doi.org/10.1111/geb.13693

and reference for the individual dataset:

  • Grimm, E.C., 2008. Neotoma: an ecosystem database for the Pliocene, Pleistocene, and Holocene. Illinois State Museum Scientific Papers E Series, 1.
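The multi-level referencing described above can be sketched as a single query that gathers the reference IDs of one Dataset from all four linking tables. The example runs on an in-memory database with invented IDs, and the helper name dataset_reference_ids is hypothetical. It returns only reference IDs, because the columns of the References table are not listed in this section.

```python
import sqlite3

# In-memory stand-in; all IDs are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Datasets (dataset_id INTEGER, data_source_id INTEGER,
                       data_source_type_id INTEGER, sampling_method_id INTEGER);
CREATE TABLE DatasetReferences (dataset_id INTEGER, reference_id INTEGER);
CREATE TABLE DatasetSourcesReference (data_source_id INTEGER, reference_id INTEGER);
CREATE TABLE DatasetSourceTypeReference (data_source_type_id INTEGER, reference_id INTEGER);
CREATE TABLE SamplingMethodReference (sampling_method_id INTEGER, reference_id INTEGER);
INSERT INTO Datasets VALUES (1, 20, 30, 40);
INSERT INTO DatasetReferences VALUES (1, 101);
INSERT INTO DatasetSourcesReference VALUES (20, 102);
INSERT INTO DatasetSourceTypeReference VALUES (30, 103);
INSERT INTO SamplingMethodReference VALUES (40, 101);
""")

def dataset_reference_ids(con, dataset_id):
    """Return sorted (level, reference_id) pairs for one Dataset (hypothetical helper)."""
    query = """
        SELECT 'dataset' AS level, r.reference_id
        FROM DatasetReferences AS r
        WHERE r.dataset_id = :id
        UNION
        SELECT 'source', r.reference_id
        FROM Datasets AS d
        JOIN DatasetSourcesReference AS r USING (data_source_id)
        WHERE d.dataset_id = :id
        UNION
        SELECT 'source_type', r.reference_id
        FROM Datasets AS d
        JOIN DatasetSourceTypeReference AS r USING (data_source_type_id)
        WHERE d.dataset_id = :id
        UNION
        SELECT 'sampling_method', r.reference_id
        FROM Datasets AS d
        JOIN SamplingMethodReference AS r USING (sampling_method_id)
        WHERE d.dataset_id = :id
    """
    # Sorting in Python keeps the SQL simple and the output order predictable.
    return sorted(con.execute(query, {"id": dataset_id}).fetchall())

refs = dataset_reference_ids(con, 1)
unique_reference_ids = sorted({ref_id for _, ref_id in refs})
print(refs)
print(unique_reference_ids)
```

The same reference can appear at more than one level (here ID 101), so deduplicating the IDs before building a citation list is usually worthwhile.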


Samples

Samples represent the main unit of data in VegVault, serving as the fundamental building blocks for all analyses. There are currently over 13 million Samples in VegVault v1.0.0 (of which ~ 1.6 million are gridpoints of abiotic data, see Database Assembly).

VegVault encompasses both contemporary and paleo data, necessitating accurate age information for each Sample. Contemporary Samples are assigned an age of 0, while Samples from fossil pollen records are in calibrated years before the present (cal yr BP). The present is here specified as 1950 AD.
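Because the present is fixed at 1950 AD, an age in cal yr BP can be converted to an approximate calendar year by simple subtraction. The helper below is a hypothetical illustration of that arithmetic, not part of VegVault or {vaultkeepr}; results at or below zero are in astronomical year numbering, with no correction for the missing year 0.

```python
def cal_bp_to_calendar_year(age_cal_bp: float) -> float:
    """Convert an age in cal yr BP (present = 1950 AD) to a calendar year.

    Hypothetical helper; non-positive results use astronomical year numbering.
    """
    return 1950 - age_cal_bp

print(cal_bp_to_calendar_year(0))     # a Sample with age 0 maps to 1950
print(cal_bp_to_calendar_year(5000))  # 5000 cal yr BP maps to -3050
```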

Sample Structure Overview (Samples)

The table contains one Sample per row, with each Sample containing: a unique Sample ID (sample_id), Sample name (sample_name), temporal information about the Sample (age), sample size (size of the plot if available; sample_size_id), and additional information about the Sample (sample_details; currently not used in v1.0.0).

column_name data_type description
sample_id INTEGER ID of a Sample (unique)
sample_name TEXT Name of a Sample
sample_details TEXT Specific description of a Sample. Currently not being used.
age REAL Age of a Sample. Mainly used for fossil_pollen_archive Datasets, where it records the age of a Sample in calibrated years before present. Note that all contemporary Samples have an age of 0.
sample_size_id INTEGER ID of a Sample Size

Column names and types for table Samples.

Dataset-Sample (DatasetSample)

Each Sample is linked to a specific Dataset via the DatasetSample table, which ensures that every Sample is correctly associated with its corresponding Dataset Type (whether it is vegetation_plot, fossil_pollen_archive, traits, or gridpoints) and other Dataset properties (e.g., geographic location). A Dataset contains several Samples only when they differ in time (age).

column_name data_type description
dataset_id INTEGER ID of a Dataset
sample_id INTEGER ID of a Sample

Column names and types for table DatasetSample.
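As an illustration of the Dataset-Sample link, the sketch below lists the Samples of one Dataset ordered by age. It runs against an in-memory database with invented rows and a reduced Samples table; the helper name samples_of_dataset is hypothetical.

```python
import sqlite3

# In-memory stand-in; rows are invented and Samples is reduced to three columns.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Samples (sample_id INTEGER PRIMARY KEY, sample_name TEXT, age REAL);
CREATE TABLE DatasetSample (dataset_id INTEGER, sample_id INTEGER);
INSERT INTO Samples VALUES (1, 's1', 0), (2, 's2', 1200.5), (3, 's3', 350.0);
INSERT INTO DatasetSample VALUES (100, 1), (200, 2), (200, 3);
""")

def samples_of_dataset(con, dataset_id):
    """Return (sample_id, age) pairs of one Dataset, youngest first (hypothetical helper)."""
    cursor = con.execute("""
        SELECT s.sample_id, s.age
        FROM DatasetSample AS ds
        JOIN Samples AS s USING (sample_id)
        WHERE ds.dataset_id = ?
        ORDER BY s.age
    """, (dataset_id,))
    return cursor.fetchall()

contemporary_samples = samples_of_dataset(con, 100)  # one Sample with age 0
fossil_samples = samples_of_dataset(con, 200)        # two Samples differing in age
print(contemporary_samples, fossil_samples)
```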

Sample Size (SampleSizeID)

The size of vegetation plots can vary substantially. This detail is crucial for ecological studies where plot size can influence species diversity and abundance metrics, thus impacting follow-up analyses and interpretations. To account for this variability, information about the plot size is recorded separately for each contemporary Sample.

column_name data_type description
sample_size_id INTEGER ID of a size category (unique)
sample_size REAL Numeric expression of size
description TEXT Mostly description of units in which the values are stored

Column names and types for table SampleSizeID.

Sample age uncertainty (SampleUncertainty)

Each Sample from a fossil_pollen_archive Dataset is also associated with an uncertainty matrix generated during the re-estimation of ages using the FOSSILPOL workflow. This matrix provides a range of potential ages derived from age-depth modelling, reflecting the inherent uncertainty in dating paleoecological records.

column_name data_type description
sample_id INTEGER ID of a Sample
iteration INTEGER ID of an iteration from the age-depth model. Currently, there are 1000 iterations per Sample.
age INTEGER Potential age of a Sample

Column names and types for table SampleUncertainty.

We can show this on the previously selected fossil pollen archive with dataset ID: 91256.
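One way to work with the stored iterations is to summarise them into a median and an interval. The sketch below does this for a single Sample on an in-memory table holding five invented ages in place of the 1000 iterations stored in VegVault. The helper names and the choice of a 95% interval with linear interpolation are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in; five invented ages replace the 1000 stored iterations.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE SampleUncertainty (sample_id INTEGER, iteration INTEGER, age INTEGER)"
)
invented_ages = [1100, 980, 1010, 1030, 1000]
con.executemany(
    "INSERT INTO SampleUncertainty VALUES (7, ?, ?)",
    [(i + 1, age) for i, age in enumerate(invented_ages)],
)

def quantile(sorted_values, q):
    """Linearly interpolated quantile of a non-empty sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])

def age_summary(con, sample_id, lower=0.025, upper=0.975):
    """Return (lower, median, upper) potential ages of one Sample (hypothetical helper)."""
    ages = sorted(
        age for (age,) in con.execute(
            "SELECT age FROM SampleUncertainty WHERE sample_id = ?", (sample_id,)
        )
    )
    return quantile(ages, lower), quantile(ages, 0.5), quantile(ages, upper)

low, median, high = age_summary(con, 7)
print(low, median, high)
```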

Sample Reference (SampleReference)

Each Sample in VegVault can have specific References in addition to those at the Dataset-level. These individual Sample References provide detailed provenance and citation information, ensuring that users can trace the origin and validation of each data point. Note that a single Sample can have several References. This level of referencing enhances the transparency and reliability of the data, especially when the dataset continues to be updated in the future.

column_name data_type description
sample_id INTEGER ID of a Sample
reference_id INTEGER ID of a Reference

Column names and types for table SampleReference.


Taxa

Taxa Structure Overview (Taxa)

The VegVault database records the original taxonomic names derived directly from the primary data sources, and currently, it holds over 100 thousand taxonomic names.

column_name data_type description
taxon_id INTEGER ID of a Taxon (unique)
taxon_name TEXT Name of a Taxon from primary source.

Column names and types for table Taxa.

Sample-Taxa (SampleTaxa)

Each individual Taxon is linked to corresponding Samples through the SampleTaxa table, ensuring accurate and systematic association between species and their ecological data. Note that the abundance information varies across the primary data sources. Therefore, users have to be careful while processing data from various sources.

column_name data_type description
sample_id INTEGER ID of a Sample
taxon_id INTEGER ID of a Taxon
value REAL Abundance representation of a Taxon (the units may differ among primary sources, i.e. Dataset Source-Types)

Column names and types for table SampleTaxa.
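Since abundance units differ among primary sources, one possible preprocessing step is to rescale the values within each Sample to proportions. The sketch below shows this on an in-memory table with invented values. Whether such rescaling is meaningful depends on the units of the source in question, so treat it as an assumption rather than a recommended workflow.

```python
import sqlite3

# In-memory stand-in for SampleTaxa; values are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SampleTaxa (sample_id INTEGER, taxon_id INTEGER, value REAL);
INSERT INTO SampleTaxa VALUES (1, 10, 30), (1, 11, 10), (2, 10, 5);
""")

# Divide each value by the total of its Sample to obtain proportions.
proportions = con.execute("""
    SELECT st.sample_id, st.taxon_id, st.value / tot.total AS proportion
    FROM SampleTaxa AS st
    JOIN (
        SELECT sample_id, SUM(value) AS total
        FROM SampleTaxa
        GROUP BY sample_id
    ) AS tot ON tot.sample_id = st.sample_id
    ORDER BY st.sample_id, st.taxon_id
""").fetchall()
print(proportions)
```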

Taxon Classification (TaxonClassification)

Each taxonomic name undergoes an automated classification (see Database Assembly) and results are stored in the TaxonClassification table. To classify the diverse taxa present in the VegVault database, the {taxospace} R package was used. This tool automatically aligns taxa names with the Taxonomy Backbone from the Global Biodiversity Information Facility, providing a standardized classification framework. Specifically, we try to find the best match of the raw names of taxa using Global Names Resolver.

column_name data_type description
taxon_id INTEGER ID of a Taxon
taxon_species INTEGER ID of the Taxon assigned at species level
taxon_genus INTEGER ID of the Taxon assigned at genus level
taxon_family INTEGER ID of the Taxon assigned at family level

Column names and types for table TaxonClassification.

Taxonomic classification for some Taxa might be only available down to the genus or family level, while most of the data is classified to species level. Classification information, detailed up to the family level, is stored for each taxon, ensuring consistency and facilitating comparative analyses across different datasets. Currently, the VegVault database holds over 110 thousand taxonomic names, of which we were unable to classify only 1312 (1.2%).
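The classification columns can be used to harmonise taxa to a common level before comparing Samples. As a rough sketch, the query below sums abundances at genus level by following taxon_genus back to the Taxa table. It uses an in-memory database with invented rows and assumes that classified names are themselves stored as entries in Taxa; taxa without a classification are dropped by the inner joins.

```python
import sqlite3

# In-memory stand-in; taxa, classifications, and abundances are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Taxa (taxon_id INTEGER PRIMARY KEY, taxon_name TEXT);
CREATE TABLE TaxonClassification (taxon_id INTEGER, taxon_species INTEGER,
                                  taxon_genus INTEGER, taxon_family INTEGER);
CREATE TABLE SampleTaxa (sample_id INTEGER, taxon_id INTEGER, value REAL);
INSERT INTO Taxa VALUES (1, 'Pinus sylvestris'), (2, 'Pinus mugo'), (3, 'Pinus'),
                        (4, 'Pinaceae'), (5, 'Unknown type');
INSERT INTO TaxonClassification VALUES (1, 1, 3, 4), (2, 2, 3, 4), (3, NULL, 3, 4);
INSERT INTO SampleTaxa VALUES (1, 1, 2.0), (1, 2, 3.0), (1, 3, 1.0), (1, 5, 4.0);
""")

# Sum abundances per Sample and genus; taxon 5 has no classification and is excluded.
genus_abundance = con.execute("""
    SELECT st.sample_id, g.taxon_name AS genus, SUM(st.value) AS total
    FROM SampleTaxa AS st
    JOIN TaxonClassification AS tc ON tc.taxon_id = st.taxon_id
    JOIN Taxa AS g ON g.taxon_id = tc.taxon_genus
    GROUP BY st.sample_id, g.taxon_name
    ORDER BY st.sample_id, g.taxon_name
""").fetchall()
print(genus_abundance)
```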

Taxon Reference (TaxonReference)

Each Taxon can have a Reference. Currently, this is used to track the origin of the Taxon name (i.e. which primary source first used this Taxon). Note that Taxa generated by the taxonomic classification are associated with the taxospace reference.

column_name data_type description
taxon_id INTEGER ID of a Taxon
reference_id INTEGER ID of a Reference

Column names and types for table TaxonReference.


Traits

Traits Structure Overview (Traits)

The Traits table contains the list of functional traits currently contained in VegVault. The table contains one Trait per row, with each Trait containing: a unique Trait ID (trait_id), original Trait name from primary source (trait_name), and Trait Domain (trait_domain_id). Functional traits of vegetation taxa follow the same structure of Dataset and Samples obtained directly from Dataset Source-Types.

column_name data_type description
trait_id INTEGER ID of a Trait (unique)
trait_domain_id INTEGER ID of a Trait Domain
trait_name TEXT Name of the trait from the primary source. See ‘VegVault Content’ for the details about the specific columns used from primary sources.

Column names and types for table Traits.

Traits Domain (TraitsDomain)

Traits are grouped into Trait Domains to allow for easier aggregation of Traits across data sources. As there are differences in trait names across sources of data and individual Datasets, the VegVault database contains Trait Domain information to group traits together. In total, six Trait Domains are present: Stem specific density, Leaf nitrogen content per unit mass, Diaspore mass, Plant height, Leaf area, Leaf mass per area, following Diaz et al. (2016). Yet, it is up to the user to decide how to further aggregate trait values if multiple trait Samples of one Trait Domain are available for the same environmental or taxonomic entity.

column_name data_type description
trait_domain_id INTEGER ID of a Trait Domain (unique)
trait_domain_name TEXT Name of the Trait Domain from Diaz et al. (2016)
trait_domanin_description TEXT Additional information about the Trait Domain

Column names and types for table TraitsDomain.

Trait domain | Trait | Data Source
Stem specific density | stem wood density | BIEN
Stem specific density | Stem specific density (SSD, stem dry mass per stem fresh volume) or wood density | TRY
Leaf nitrogen content per unit mass | leaf nitrogen content per leaf dry mass | BIEN
Leaf nitrogen content per unit mass | Leaf nitrogen (N) content per leaf dry mass | TRY
Diaspore mass | seed mass | BIEN
Diaspore mass | Seed dry mass | TRY
Plant height | whole plant height | BIEN
Plant height | Plant height vegetative | TRY
Leaf area | leaf area | BIEN
Leaf area | Leaf area (in case of compound leaves undefined if leaf or leaflet, undefined if petiole is in- or excluded) | TRY
Leaf area | Leaf area (in case of compound leaves: leaf, petiole excluded) | TRY
Leaf area | Leaf area (in case of compound leaves: leaf, petiole included) | TRY
Leaf area | Leaf area (in case of compound leaves: leaf, undefined if petiole in- or excluded) | TRY
Leaf mass per area | leaf mass per area | BIEN
Leaf mass per area | Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): petiole included | TRY
Leaf mass per area | Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded | TRY

Overview of Trait Domains and their associated Traits

Traits value (TraitsValue)

In general, functional trait data for vegetation taxa follow the same structure of Datasets and Samples obtained directly from the Dataset Source-Types. Therefore, the TraitsValue table contains not only the measured value of each Trait observation but also its links to Datasets, Samples, and Taxa. This comprehensive linkage ensures that each Trait value is accurately associated with its relevant ecological, environmental, and taxonomic context.

column_name data_type description
trait_id INTEGER ID of a Trait
dataset_id INTEGER ID of a Dataset
sample_id INTEGER ID of a Sample
taxon_id INTEGER ID of a Taxon
trait_value REAL Value of specific measured observation of Trait.

Column names and types for table TraitsValue.
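To illustrate how Trait Domains allow values from different sources to be combined, the sketch below averages trait values per Trait Domain and Taxon. It uses an in-memory database with invented rows and reduced tables. Taking the mean is only one possible aggregation, and it assumes the values within a Trait Domain are already in comparable units.

```python
import sqlite3

# In-memory stand-in; all IDs and trait values are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TraitsDomain (trait_domain_id INTEGER PRIMARY KEY, trait_domain_name TEXT);
CREATE TABLE Traits (trait_id INTEGER PRIMARY KEY, trait_domain_id INTEGER, trait_name TEXT);
CREATE TABLE TraitsValue (trait_id INTEGER, dataset_id INTEGER, sample_id INTEGER,
                          taxon_id INTEGER, trait_value REAL);
INSERT INTO TraitsDomain VALUES (1, 'Plant height');
INSERT INTO Traits VALUES (1, 1, 'whole plant height'), (2, 1, 'Plant height vegetative');
INSERT INTO TraitsValue VALUES
    (1, 500, 9001, 42, 10.0),
    (2, 501, 9002, 42, 14.0),
    (2, 501, 9003, 43, 2.0);
""")

# Average all observations of a Taxon within one Trait Domain,
# regardless of which source-specific Trait name they came from.
domain_means = con.execute("""
    SELECT td.trait_domain_name, tv.taxon_id,
           AVG(tv.trait_value) AS mean_value, COUNT(*) AS n_observations
    FROM TraitsValue AS tv
    JOIN Traits AS t ON t.trait_id = tv.trait_id
    JOIN TraitsDomain AS td ON td.trait_domain_id = t.trait_domain_id
    GROUP BY td.trait_domain_name, tv.taxon_id
    ORDER BY td.trait_domain_name, tv.taxon_id
""").fetchall()
print(domain_means)
```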

Traits Reference (TraitsReference)

To ensure clarity and reproducibility, each Trait in VegVault can have additional References beyond the general Dataset and Sample References. These Trait-specific References provide detailed provenance and citation information, supporting rigorous scientific research and enabling users to trace the origins and validation of each trait value.

column_name data_type description
trait_id INTEGER ID of a Trait
reference_id INTEGER ID of a Reference

Column names and types for table TraitsReference.


Abiotic Variables

The abiotic data in the VegVault database provide essential information on environmental factors affecting vegetation distribution and traits. Currently, VegVault includes abiotic data from CHELSA, CHELSA-TRACE21, and WoSIS. CHELSA and CHELSA-TRACE21 provide high-resolution climate data, while WoSIS offers detailed soil information.

Abiotic variables (AbioticVariable)

As VegVault contains abiotic variables from several primary sources, the AbioticVariable table contains descriptions of abiotic variables (abiotic_variable_name), their units (abiotic_variable_unit), and measurement details (measure_details). These data include variables such as climate and soil conditions, which are crucial for understanding the ecological contexts of vegetation dynamics.

column_name data_type description
abiotic_variable_id INTEGER ID of an Abiotic Variable
abiotic_variable_name TEXT Name of an Abiotic Variable from primary source
abiotic_variable_unit TEXT Unit of an Abiotic Variable
measure_details TEXT Additional details about an Abiotic Variable
abiotic_variable_scale NA Scale of an Abiotic Variable

Column names and types for table AbioticVariable.

Variable name | Variable unit | Measure details
bio1 | °C (degree Celsius) | mean annual air temperature
bio4 | °C (degree Celsius) | temperature seasonality
bio6 | °C (degree Celsius) | mean daily minimum air temperature of the coldest month
bio12 | kg m-2 year-1 | annual precipitation amount
bio15 | Unitless | precipitation seasonality
bio18 | kg m-2 quarter-1 | mean monthly precipitation amount of the warmest quarter
bio19 | kg m-2 quarter-1 | mean monthly precipitation amount of the coldest quarter
HWSD2 | Unitless | SoilGrids-soil_class

Table showing abiotic variables.

Abiotic Data (AbioticData)

The AbioticData table holds the actual values of abiotic variables (the units are the same for each AbioticVariable).

Gridpoints (AbioticDataReference)

Gridpoints are stored in artificially created Datasets and Samples, with one Dataset holding several Samples only if they differ in age. We have estimated the spatial and temporal distance between each gridpoint and the non-gridpoint Samples (vegetation_plot, fossil_pollen_archive, and traits). We store the link between gridpoint and non-gridpoint Samples, as well as the spatial and temporal distance between them. As this results in very large amounts of data, we have discarded any gridpoint Sample that is not within 50 km and/or 5000 years of any non-gridpoint Sample, as it is not relevant for vegetation dynamics.

column_name data_type description
sample_id INTEGER ID of non-gridpoint Sample
sample_ref_id INTEGER ID of gridpoint Sample
distance_in_km INTEGER Distance among samples expressed in kilometres
distance_in_years INTEGER Distance among samples expressed in years

Column names and types for table AbioticDataReference.

This data structure makes the environmental context readily available for each vegetation and trait Sample. For each non-gridpoint Sample, users can select the spatio-temporally closest abiotic data or take the average of all surrounding gridpoints.
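Selecting the closest gridpoint for each Sample can be sketched directly from the AbioticDataReference table. The example below uses an in-memory table with invented links and picks, per non-gridpoint Sample, the gridpoint with the smallest spatial distance, breaking ties by temporal distance. Ranking space before time is an arbitrary choice made for this illustration.

```python
import sqlite3

# In-memory stand-in for AbioticDataReference; links and distances are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE AbioticDataReference (sample_id INTEGER, sample_ref_id INTEGER,
                                   distance_in_km INTEGER, distance_in_years INTEGER);
INSERT INTO AbioticDataReference VALUES
    (1, 900, 12, 0),
    (1, 901, 3, 500),
    (1, 902, 3, 100),
    (2, 900, 40, 0);
""")

# Rows arrive sorted from closest to farthest within each Sample,
# so the first gridpoint seen for a sample_id is its closest one.
closest_gridpoint = {}
for sample_id, sample_ref_id in con.execute("""
    SELECT sample_id, sample_ref_id
    FROM AbioticDataReference
    ORDER BY sample_id, distance_in_km, distance_in_years, sample_ref_id
"""):
    closest_gridpoint.setdefault(sample_id, sample_ref_id)

print(closest_gridpoint)
```

The resulting gridpoint Sample IDs could then be used to look up values in the AbioticData table, whose columns are not listed in this section.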

Abiotic Variable Reference (AbioticVariableReference)

Each Abiotic Variable can have a separate Reference, in addition to the References of its Dataset and Sample.

column_name data_type description
abiotic_variable_id INTEGER ID of an Abiotic Variable
reference_id INTEGER ID of a Reference

Column names and types for table AbioticVariableReference.


References

The References table is a central component that serves all sections of the VegVault database. This table contains all References, independent of the source of the reference and the type of data. Each row contains a single Reference, which is then linked to the type of data being referenced. This allows a single Reference to be used across data types, and a single data point to have many different References.

Moreover, most primary sources of the data have a license that requires correct attribution. Therefore, each Reference records whether it must be cited when the specific data are used.
