VegVault Database Structure
Currently, VegVault consists of 31 interconnected tables with 87 fields (variables), which are described in detail below. However, the internal database structure may not be directly relevant to most users, as the {vaultkeepr} R package processes all data to output only the most relevant information in a “ready-to-analyze” format.
[!NOTE]
For Most Users: If you’re primarily interested in using VegVault for research, you may want to start with our Usage Examples rather than diving into the technical database structure.
This section provides comprehensive documentation of all tables and their relationships for users who need detailed technical information about the database architecture.
Full Database Schema

Metadata Tables
Several tables contain metadata and administrative information that are not directly linked to the scientific data:
- Authors: Information about VegVault authors and maintainers, including contact details
- version_control: Database version information with descriptions of changes over time
- sqlite_stat1 and sqlite_stat4: SQLite system tables containing database index statistics for query optimization
| column_name | data_type | description | 
|---|---|---|
| author_id | INTEGER | ID of an Author (unique) | 
| author_fullname | TEXT | Full name of an Author | 
| author_email | TEXT | Contact email of an Author | 
| author_orcid | TEXT | ORCID ID of an Author | 
Column names and types for table Authors.
| column_name | data_type | description | 
|---|---|---|
| id | INTEGER | ID of a database version (unique) | 
| version | TEXT | Version number | 
| update_date | TEXT | Date of the creation of that version | 
| changelog | TEXT | Text description of main changes in the database | 
Column names and types for table version_control.
Datasets
Dataset Structure Overview (Datasets)
The Datasets table represents the main organizational structure in VegVault, serving as the keystone for managing and organizing all data. The table contains one Dataset per row, with each Dataset having a unique Dataset ID (dataset_id), Dataset name (dataset_name), geographic location (coord_lat, coord_long), Dataset Type (dataset_type_id), Dataset Source (data_source_id), Dataset Source-Type (data_source_type_id), and Sampling Method (sampling_method_id).
| column_name | data_type | description | 
|---|---|---|
| dataset_id | INTEGER | ID of a Dataset (unique) | 
| dataset_name | TEXT | Name of each Dataset | 
| data_source_id | INTEGER | ID of a Dataset Source | 
| dataset_type_id | INTEGER | ID of a Dataset Type | 
| data_source_type_id | INTEGER | ID of a Dataset Source-Type | 
| coord_long | REAL | Geographical coordinates - longitude | 
| coord_lat | REAL | Geographical coordinates - latitude | 
| sampling_method_id | INTEGER | ID of a Sampling Method | 
Column names and types for table Datasets.
Dataset Types (DatasetTypeID)
The Dataset Type defines the most basic classification of each Dataset, ensuring that the vast amount of data is categorized systematically. Currently, VegVault contains the following types of Datasets:
- vegetation_plot: This type includes contemporary vegetation plot data, capturing contemporary vegetation characteristics and distributions.
- fossil_pollen_archive: This type encompasses past vegetation plot data derived from fossil pollen records, providing insights into past vegetation and climate dynamics.
- traits: This type contains functional trait data, detailing specific characteristics of plant species that influence their ecological roles.
- gridpoints: This type holds artificially created Datasets used to manage abiotic data, here climate and soil information (see details in the Database Assembly).
| column_name | data_type | description | 
|---|---|---|
| dataset_type_id | INTEGER | ID of a Dataset Type (unique) | 
| dataset_type | TEXT | Text description of individual Dataset Types (currently vegetation_plot, fossil_pollen_archive, traits, gridpoints) | 
Column names and types for table DatasetTypeID.
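As a minimal sketch of how Datasets and DatasetTypeID relate, the following Python snippet builds a tiny in-memory SQLite stand-in and selects all Datasets of one Dataset Type. The table and column names follow the tables above; the rows are invented for illustration, and the helper `datasets_of_type` is hypothetical (it is not part of {vaultkeepr}).

```python
import sqlite3

# In-memory stand-in for VegVault; only the columns needed here are created.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DatasetTypeID (
    dataset_type_id INTEGER PRIMARY KEY,
    dataset_type TEXT
);
CREATE TABLE Datasets (
    dataset_id INTEGER PRIMARY KEY,
    dataset_name TEXT,
    dataset_type_id INTEGER REFERENCES DatasetTypeID (dataset_type_id),
    coord_long REAL,
    coord_lat REAL
);
INSERT INTO DatasetTypeID VALUES (1, 'vegetation_plot'), (2, 'fossil_pollen_archive');
-- Toy rows, not real VegVault records.
INSERT INTO Datasets VALUES
    (10, 'plot_a', 1, 14.4, 50.1),
    (11, 'core_b', 2, 5.3, 60.4),
    (12, 'plot_c', 1, -3.7, 40.4);
""")


def datasets_of_type(con, dataset_type):
    """Return (dataset_id, dataset_name) for all Datasets of one Dataset Type."""
    return con.execute(
        """
        SELECT d.dataset_id, d.dataset_name
        FROM Datasets AS d
        JOIN DatasetTypeID AS t ON t.dataset_type_id = d.dataset_type_id
        WHERE t.dataset_type = ?
        ORDER BY d.dataset_id
        """,
        (dataset_type,),
    ).fetchall()


plots = datasets_of_type(con, "vegetation_plot")
print(plots)
```

The same join pattern applies to the other lookup tables referenced from Datasets (Source-Type, Source, and Sampling Method), each through its own ID column.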
Dataset Source-Types (DatasetSourceTypeID)
VegVault maintains detailed information about the primary data source, thereby enhancing the findability and referencing of primary data sources. Each Dataset is derived from a specific Source-Type that provides detailed information on the source used to retrieve the original data. The current Source-Types in VegVault include:
- BIEN - Botanical Information and Ecology Network
- sPlotOpen - The open-access version of sPlot
- TRY - TRY Plant Trait Database
- Neotoma-FOSSILPOL - The workflow that aims to process and standardise global palaeoecological pollen data. Note that we specifically state Neotoma-FOSSILPOL and not just Neotoma, as FOSSILPOL not only provides the data acquisition but also alters it (e.g., creating new age-depth models). It also addresses major challenges in paleoecological data integration, such as age uncertainty, by incorporating probabilistic age-depth models and their associated uncertainty matrices. This enables the propagation of temporal uncertainty in subsequent analyses, a critical advancement for robust macroecological studies, previously flagged as a major issue with paleo-data.
- gridpoints - artificially created Datasets to hold abiotic data. See Database Assembly for more details.
| column_name | data_type | description | 
|---|---|---|
| data_source_type_id | INTEGER | ID of a Dataset Source-Type (unique) | 
| dataset_source_type | TEXT | Text description of individual Dataset Source-Type (currently, BIEN, sPlotOpen, TRY, Neotoma-FOSSILPOL, gridpoints) | 
Column names and types for table DatasetSourceTypeID.
| column_name | data_type | description | 
|---|---|---|
| data_source_type_id | INTEGER | ID of a Dataset Source-Type | 
| reference_id | INTEGER | ID of a Reference | 
| data_source_id | NA | ID of a Dataset Source | 
Column names and types for table DatasetSourceTypeReference.
Dataset Sources (DatasetSourcesID)
Each individual Dataset from a specific Dataset Source-Type can have information on the source of the data (i.e. sub-database). VegVault v1.0.0 currently includes 706 sources of Datasets, where each dataset can also have one or more direct references to specific data, ensuring that users can accurately cite and verify the sources of their data. This should help to promote better findability of the primary source of data and referencing.
| column_name | data_type | description | 
|---|---|---|
| data_source_id | INTEGER | ID of a Dataset Source (unique) | 
| data_source_desc | TEXT | Text description of individual Dataset Sources (e.g., name of the sub-database from the primary source) | 
Column names and types for table DatasetSourcesID.
| column_name | data_type | description | 
|---|---|---|
| data_source_id | INTEGER | ID of a Dataset Source | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table DatasetSourcesReference.
Currently, there are 691 sources of Datasets.
Sampling Methods (SamplingMethodID)
Sampling methods vary significantly across the different types of Datasets integrated into VegVault, reflecting the diverse nature of the data collected. Such information is crucial for understanding the context and limitations of each Dataset Type. For contemporary vegetation plots, sampling involves standardised plot inventories and surveys that capture detailed vegetation characteristics across various regions. Fossil pollen data are collected from sediment records from numerous different depositional environments representing past vegetation. Information on sampling methods is therefore available only for vegetation_plot and fossil_pollen_archive Datasets, providing metadata that ensures accurate and contextually relevant analyses.
| column_name | data_type | description | 
|---|---|---|
| sampling_method_id | INTEGER | ID of a Dataset Sampling Method (unique) | 
| sampling_method_details | TEXT | Text description of individual Dataset Sampling Methods | 
Column names and types for table SamplingMethodID.
| column_name | data_type | description | 
|---|---|---|
| sampling_method_id | INTEGER | ID of a Dataset Sampling Method | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table SamplingMethodReference.
Dataset References (DatasetReferences)
To support robust and transparent scientific research, each Dataset in VegVault can have multiple references at different levels. The Dataset Source-Type, Dataset Source, and Sampling Method can all have their own references, providing detailed provenance and citation information. This multi-level referencing system enhances the traceability and validation of the data. Each Dataset can also have one or more direct references to specific data, further ensuring that users can accurately cite and verify the sources of their data.
| column_name | data_type | description | 
|---|---|---|
| dataset_id | INTEGER | ID of a Dataset | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table DatasetReferences.
This means that one Dataset can have one or several references from each of those levels. Let’s take a look at an example of what that could mean in practice.
We have selected dataset ID: 91256, which is a fossil pollen archive. Therefore, it has the reference of the Dataset Source-Type:
- Flantua, S. G. A., Mottl, O., Felde, V. A., Bhatta, K. P., Birks, H. H., Grytnes, J.-A., Seddon, A. W. R., & Birks, H. J. B. (2023). A guide to the processing and standardization of global palaeoecological data for large-scale syntheses using fossil pollen. Global Ecology and Biogeography, 32, 1377–1394. https://doi.org/10.1111/geb.13693
and reference for the individual dataset:
- Grimm, E.C., 2008. Neotoma: an ecosystem database for the Pliocene, Pleistocene, and Holocene. Illinois State Museum Scientific Papers E Series, 1.
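The multi-level lookup described above can be sketched as a single query that collects reference IDs from each level. This is an illustration under assumptions: it uses an in-memory SQLite database with invented rows, only the ID columns listed in the tables above, and a hypothetical helper `references_for_dataset`. Resolving the IDs to citation text would additionally require the References table, whose columns are not listed in this section.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Datasets (
    dataset_id INTEGER PRIMARY KEY,
    data_source_type_id INTEGER,
    data_source_id INTEGER,
    sampling_method_id INTEGER
);
CREATE TABLE DatasetReferences (dataset_id INTEGER, reference_id INTEGER);
CREATE TABLE DatasetSourceTypeReference (data_source_type_id INTEGER, reference_id INTEGER);
CREATE TABLE DatasetSourcesReference (data_source_id INTEGER, reference_id INTEGER);
CREATE TABLE SamplingMethodReference (sampling_method_id INTEGER, reference_id INTEGER);
-- Toy rows: Dataset 1 has a Source-Type reference (100) and a direct reference (200).
INSERT INTO Datasets VALUES (1, 4, 7, NULL);
INSERT INTO DatasetSourceTypeReference VALUES (4, 100);
INSERT INTO DatasetReferences VALUES (1, 200);
""")

# One SELECT per referencing level, combined with UNION.
QUERY = """
SELECT 'source_type' AS level, r.reference_id AS reference_id
FROM Datasets AS d
JOIN DatasetSourceTypeReference AS r ON r.data_source_type_id = d.data_source_type_id
WHERE d.dataset_id = :id
UNION
SELECT 'source', r.reference_id
FROM Datasets AS d
JOIN DatasetSourcesReference AS r ON r.data_source_id = d.data_source_id
WHERE d.dataset_id = :id
UNION
SELECT 'sampling_method', r.reference_id
FROM Datasets AS d
JOIN SamplingMethodReference AS r ON r.sampling_method_id = d.sampling_method_id
WHERE d.dataset_id = :id
UNION
SELECT 'dataset', r.reference_id
FROM DatasetReferences AS r
WHERE r.dataset_id = :id
ORDER BY reference_id
"""


def references_for_dataset(con, dataset_id):
    """Return (level, reference_id) pairs for one Dataset, ordered by reference_id."""
    return con.execute(QUERY, {"id": dataset_id}).fetchall()


refs = references_for_dataset(con, 1)
print(refs)
```

Levels without a matching row (here the Dataset Source and the Sampling Method) simply contribute nothing to the result.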
Samples
Samples represent the main unit of data in VegVault, serving as the fundamental building blocks for all analyses. There are currently over 13 million Samples in VegVault v1.0.0 (of which ~ 1.6 million are gridpoints of abiotic data, see Database Assembly).
VegVault encompasses both contemporary and paleo data, necessitating accurate age information for each Sample. Contemporary Samples are assigned an age of 0, while Samples from fossil pollen records are in calibrated years before the present (cal yr BP). The present is here specified as 1950 AD.
Sample Structure Overview (Samples)
The table contains one Sample per row, with each Sample containing: a unique Sample ID (sample_id), Sample name (sample_name), temporal information about the Sample (age), sample size (size of the plot if available; sample_size_id), and additional information about the Sample (sample_details; currently not used in v1.0.0).
| column_name | data_type | description | 
|---|---|---|
| sample_id | INTEGER | ID of a Sample (unique) | 
| sample_name | TEXT | Name of a Sample | 
| sample_details | TEXT | Specific description of a Sample. Currently not being used. | 
| age | REAL | Age of a Sample. Mainly used for fossil_pollen_archive Samples, where it records the age in calibrated years before present. All contemporary Samples have an age of 0. | 
| sample_size_id | INTEGER | ID of a Sample Size | 
Column names and types for table Samples.
Dataset-Sample (DatasetSample)
Each Sample is linked to a specific Dataset via the DatasetSample table, which ensures that every Sample is correctly associated with its corresponding Dataset Type (whether it is vegetation_plot, fossil_pollen_archive, traits, or gridpoints) and other Dataset properties (e.g., geographic location). One Dataset contains several Samples only when those Samples differ in time (age).
| column_name | data_type | description | 
|---|---|---|
| dataset_id | INTEGER | ID of a Dataset | 
| sample_id | INTEGER | ID of a Sample | 
Column names and types for table DatasetSample.
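To illustrate the Dataset-Sample link, the sketch below retrieves the Samples of one Dataset that fall within an age window. It assumes an in-memory SQLite database with invented rows; the helper name `samples_in_age_range` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Samples (sample_id INTEGER PRIMARY KEY, sample_name TEXT, age REAL);
CREATE TABLE DatasetSample (dataset_id INTEGER, sample_id INTEGER);
-- Toy rows: Dataset 11 is a pollen record with two Samples of different age.
INSERT INTO Samples VALUES (1, 's1', 0), (2, 's2', 1200), (3, 's3', 8400), (4, 's4', 0);
INSERT INTO DatasetSample VALUES (10, 1), (11, 2), (11, 3), (12, 4);
""")


def samples_in_age_range(con, dataset_id, age_min, age_max):
    """Return (sample_id, age) of a Dataset's Samples with age_min <= age <= age_max."""
    return con.execute(
        """
        SELECT s.sample_id, s.age
        FROM DatasetSample AS ds
        JOIN Samples AS s ON s.sample_id = ds.sample_id
        WHERE ds.dataset_id = ? AND s.age BETWEEN ? AND ?
        ORDER BY s.age
        """,
        (dataset_id, age_min, age_max),
    ).fetchall()


holocene_late = samples_in_age_range(con, 11, 0, 5000)
print(holocene_late)
```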
Sample Size (SampleSizeID)
The size of vegetation plots can vary substantially. This detail is crucial for ecological studies where plot size can influence species diversity and abundance metrics, thus impacting follow-up analyses and interpretations. To account for this variability, information about the plot size is recorded separately for each contemporary Sample.
| column_name | data_type | description | 
|---|---|---|
| sample_size_id | INTEGER | ID of a size category (unique) | 
| sample_size | REAL | Numeric expression of size | 
| description | TEXT | Mostly description of units in which the values are stored | 
Column names and types for table SampleSizeID.
Sample age uncertainty (SampleUncertainty)
Each Sample from a fossil_pollen_archive Dataset is also associated with an uncertainty matrix generated during the re-estimation of ages using the FOSSILPOL workflow. This matrix provides a range of potential ages derived from age-depth modelling, reflecting the inherent uncertainty in dating paleoecological records.
| column_name | data_type | description | 
|---|---|---|
| sample_id | INTEGER | ID of a Sample | 
| iteration | INTEGER | ID of an iteration from the age-depth model. Currently, there are 1000 iterations for each Sample. | 
| age | INTEGER | Potential age of a Sample | 
Column names and types for table SampleUncertainty.
We can show this on the previously selected fossil pollen archive with dataset ID: 91256.
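The actual values for that Dataset are not reproduced here. As a hedged sketch with synthetic ages, the potential ages of one Sample could be summarised by their median and a 95% interval as follows; the helper `age_summary` and the choice of a 95% interval are assumptions for illustration.

```python
import sqlite3
import statistics

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE SampleUncertainty (sample_id INTEGER, iteration INTEGER, age INTEGER)"
)
# Synthetic ages (5000..5999 cal yr BP), one per iteration; not real VegVault values.
con.executemany(
    "INSERT INTO SampleUncertainty VALUES (?, ?, ?)",
    [(1, i + 1, 5000 + i) for i in range(1000)],
)


def age_summary(con, sample_id):
    """Return (lower, median, upper) of the potential ages, using a 95% interval."""
    ages = [
        row[0]
        for row in con.execute(
            "SELECT age FROM SampleUncertainty WHERE sample_id = ?", (sample_id,)
        )
    ]
    # n=40 yields cut points every 2.5%; the first and last bound the central 95%.
    cuts = statistics.quantiles(ages, n=40, method="inclusive")
    return cuts[0], statistics.median(ages), cuts[-1]


lower, median, upper = age_summary(con, 1)
print(lower, median, upper)
```

Carrying the full set of iterations through an analysis, rather than a single summary, is what allows temporal uncertainty to be propagated.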
Sample Reference
Each Sample in VegVault can have specific References in addition to those at the Dataset-level. These individual Sample References provide detailed provenance and citation information, ensuring that users can trace the origin and validation of each data point. Note that a single Sample can have several References. This level of referencing enhances the transparency and reliability of the data, especially when the dataset continues to be updated in the future.
| column_name | data_type | description | 
|---|---|---|
| sample_id | INTEGER | ID of a Sample | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table SampleReference.
Taxa
Taxa Structure Overview (Taxa)
The VegVault database records the original taxonomic names derived directly from the primary data sources, and currently, it holds over 100 thousand taxonomic names.
| column_name | data_type | description | 
|---|---|---|
| taxon_id | INTEGER | ID of a Taxon (unique) | 
| taxon_name | TEXT | Name of a Taxon from primary source. | 
Column names and types for table Taxa.
Sample-Taxa (SampleTaxa)
Each individual Taxon is linked to corresponding Samples through the SampleTaxa table, ensuring accurate and systematic association between species and their ecological data. Note that the abundance information varies across the primary data sources. Therefore, users have to be careful while processing data from various sources.
| column_name | data_type | description | 
|---|---|---|
| sample_id | INTEGER | ID of a Sample | 
| taxon_id | INTEGER | ID of a Taxon | 
| value | REAL | Abundance representation of a Taxon (the units may differ among primary sources, i.e. Dataset Source-Types) | 
Column names and types for table SampleTaxa.
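Because abundance units differ among sources, one common way to make Samples comparable is to convert values to within-Sample proportions. The sketch below shows this on invented rows in an in-memory SQLite database; whether proportions are appropriate for a given source is an assumption the user has to check.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SampleTaxa (sample_id INTEGER, taxon_id INTEGER, value REAL);
-- Toy values: Sample 1 sums to 100, Sample 2 sums to 8.
INSERT INTO SampleTaxa VALUES (1, 1, 30), (1, 2, 70), (2, 1, 2), (2, 3, 6);
""")

# Divide each value by the total of its Sample.
proportions = con.execute(
    """
    SELECT st.sample_id, st.taxon_id, st.value / tot.total AS proportion
    FROM SampleTaxa AS st
    JOIN (
        SELECT sample_id, SUM(value) AS total
        FROM SampleTaxa
        GROUP BY sample_id
    ) AS tot ON tot.sample_id = st.sample_id
    ORDER BY st.sample_id, st.taxon_id
    """
).fetchall()

for sample_id, taxon_id, proportion in proportions:
    print(sample_id, taxon_id, round(proportion, 3))
```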
Taxon Classification (TaxonClassification)
Each taxonomic name undergoes an automated classification (see Database Assembly) and results are stored in the TaxonClassification table. To classify the diverse taxa present in the VegVault database, the {taxospace} R package was used. This tool automatically aligns taxa names with the Taxonomy Backbone from the Global Biodiversity Information Facility, providing a standardized classification framework. Specifically, we try to find the best match of the raw names of taxa using Global Names Resolver.
| column_name | data_type | description | 
|---|---|---|
| taxon_id | INTEGER | ID of a Taxon | 
| taxon_species | INTEGER | ID of the Taxon assigned at the species level | 
| taxon_genus | INTEGER | ID of the Taxon assigned at the genus level | 
| taxon_family | INTEGER | ID of the Taxon assigned at the family level | 
Column names and types for table TaxonClassification.
Taxonomic classification for some Taxa might be only available down to the genus or family level, while most of the data is classified to species level. Classification information, detailed up to the family level, is stored for each taxon, ensuring consistency and facilitating comparative analyses across different datasets. Currently, the VegVault database holds over 110 thousand taxonomic names, of which we were unable to classify only 1312 (1.2%).
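Since the classification columns hold Taxon IDs, abundances can be harmonised to a coarser level by joining back to Taxa. The following is a minimal sketch on invented rows (in-memory SQLite, family left NULL for brevity) that sums values per Sample at the genus level.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Taxa (taxon_id INTEGER PRIMARY KEY, taxon_name TEXT);
CREATE TABLE TaxonClassification (
    taxon_id INTEGER, taxon_species INTEGER, taxon_genus INTEGER, taxon_family INTEGER
);
CREATE TABLE SampleTaxa (sample_id INTEGER, taxon_id INTEGER, value REAL);
-- Toy rows; taxon 3 ('Pinus') is itself only resolved to genus level.
INSERT INTO Taxa VALUES
    (1, 'Pinus sylvestris'), (2, 'Pinus mugo'), (3, 'Pinus'),
    (4, 'Betula pendula'), (5, 'Betula');
INSERT INTO TaxonClassification VALUES
    (1, 1, 3, NULL), (2, 2, 3, NULL), (3, NULL, 3, NULL),
    (4, 4, 5, NULL), (5, NULL, 5, NULL);
INSERT INTO SampleTaxa VALUES (1, 1, 5), (1, 2, 3), (1, 3, 1), (1, 4, 2);
""")

# Map every recorded Taxon to its genus, then sum the values per Sample and genus.
genus_totals = con.execute(
    """
    SELECT st.sample_id, g.taxon_name AS genus, SUM(st.value) AS total
    FROM SampleTaxa AS st
    JOIN TaxonClassification AS tc ON tc.taxon_id = st.taxon_id
    JOIN Taxa AS g ON g.taxon_id = tc.taxon_genus
    GROUP BY st.sample_id, g.taxon_name
    ORDER BY st.sample_id, g.taxon_name
    """
).fetchall()
print(genus_totals)
```

Taxa without a genus-level classification drop out of this inner join, so an analysis that needs them would have to handle unclassified names separately.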
Taxon Reference (TaxonReference)
Each Taxon can have a Reference. Currently, this is used to track the origin of the Taxon name (i.e. which primary source first used this Taxon). Note that Taxa generated by the taxonomic classification are associated with the taxospace reference.
| column_name | data_type | description | 
|---|---|---|
| taxon_id | INTEGER | ID of a Taxon | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table TaxonReference.
Traits
Traits Structure Overview (Traits)
The Traits table contains the list of functional traits currently contained in VegVault. The table contains one Trait per row, with each Trait containing: a unique Trait ID (trait_id), original Trait name from primary source (trait_name), and Trait Domain (trait_domain_id). Functional traits of vegetation taxa follow the same structure of Dataset and Samples obtained directly from Dataset Source-Types.
| column_name | data_type | description | 
|---|---|---|
| trait_id | INTEGER | ID of a Trait (unique) | 
| trait_domain_id | INTEGER | ID of a Trait Domain | 
| trait_name | TEXT | Name of the trait from the primary source. See ‘VegVault Content’ for the details about the specific columns used from primary sources. | 
Column names and types for table Traits.
Traits Domain (TraitsDomain)
Traits are grouped into Trait Domains to allow for easier aggregation of Traits across data sources. As there are differences in trait names across sources of data and individual Datasets, the VegVault database contains Trait Domain information to group traits together. In total, six Trait Domains are present: Stem specific density, Leaf nitrogen content per unit mass, Diaspore mass, Plant height, Leaf area, Leaf mass per area, following Diaz et al. (2016). Yet, it is up to the user to decide how to further aggregate trait values if multiple trait Samples of one Trait Domain are available for the same environmental or taxonomic entity.
| column_name | data_type | description | 
|---|---|---|
| trait_domain_id | INTEGER | ID of a Trait Domain (unique) | 
| trait_domain_name | TEXT | Name of the Trait Domain from Diaz et al. (2016) | 
| trait_domanin_description | TEXT | NA | 
| trait_domain_description | NA | Additional information about the Trait Domain | 
Column names and types for table TraitsDomain.
| Trait domain | Trait | Data Source | 
|---|---|---|
| Stem specific density | stem wood density | BIEN | 
| Stem specific density | Stem specific density (SSD, stem dry mass per stem fresh volume) or wood density | TRY | 
| Leaf nitrogen content per unit mass | leaf nitrogen content per leaf dry mass | BIEN | 
| Leaf nitrogen content per unit mass | Leaf nitrogen (N) content per leaf dry mass | TRY | 
| Diaspore mass | seed mass | BIEN | 
| Diaspore mass | Seed dry mass | TRY | 
| Plant height | whole plant height | BIEN | 
| Plant height | Plant height vegetative | TRY | 
| Leaf Area | leaf area | BIEN | 
| Leaf Area | Leaf area (in case of compound leaves undefined if leaf or leaflet, undefined if petiole is in- or exluded) | TRY | 
| Leaf Area | Leaf area (in case of compound leaves: leaf, petiole excluded) | TRY | 
| Leaf Area | Leaf area (in case of compound leaves: leaf, petiole included) | TRY | 
| Leaf Area | Leaf area (in case of compound leaves: leaf, undefined if petiole in- or excluded) | TRY | 
| Leaf mass per area | leaf mass per area | BIEN | 
| Leaf mass per area | Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): petiole included | TRY | 
| Leaf mass per area | Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded) | TRY | 
Overview of Trait Domains and their associated Traits
Traits value (TraitsValue)
In general, data on functional traits of vegetation taxa follow the same structure as the Datasets and Samples obtained directly from the Dataset Source-Types. Therefore, the TraitsValue table contains not only the actual measured value of each Trait observation but also the links to the corresponding Dataset, Sample, and Taxon. This comprehensive linkage ensures that each Trait value is accurately associated with its relevant ecological, environmental and taxonomic context.
| column_name | data_type | description | 
|---|---|---|
| trait_id | INTEGER | ID of a Trait | 
| dataset_id | INTEGER | ID of a Dataset | 
| sample_id | INTEGER | ID of a Sample | 
| taxon_id | INTEGER | ID of a Taxon | 
| trait_value | REAL | Value of specific measured observation of Trait. | 
Column names and types for table TraitsValue.
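As one possible way to aggregate trait observations, the sketch below averages trait values per Trait Domain and Taxon across sources. It uses invented rows in an in-memory SQLite database and assumes the values within a Trait Domain are already in comparable units; the arithmetic mean is only one of several defensible choices.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TraitsDomain (trait_domain_id INTEGER PRIMARY KEY, trait_domain_name TEXT);
CREATE TABLE Traits (trait_id INTEGER PRIMARY KEY, trait_domain_id INTEGER, trait_name TEXT);
CREATE TABLE TraitsValue (
    trait_id INTEGER, dataset_id INTEGER, sample_id INTEGER,
    taxon_id INTEGER, trait_value REAL
);
INSERT INTO TraitsDomain VALUES (1, 'Plant height');
-- Two source-specific Trait names that belong to the same Trait Domain.
INSERT INTO Traits VALUES (1, 1, 'whole plant height'), (2, 1, 'Plant height vegetative');
-- Toy observations, not real VegVault values.
INSERT INTO TraitsValue VALUES
    (1, 20, 200, 7, 2.0), (2, 21, 201, 7, 4.0), (2, 21, 202, 8, 1.5);
""")

# Mean value and number of observations per Trait Domain and Taxon.
domain_means = con.execute(
    """
    SELECT td.trait_domain_name, tv.taxon_id, AVG(tv.trait_value), COUNT(*)
    FROM TraitsValue AS tv
    JOIN Traits AS t ON t.trait_id = tv.trait_id
    JOIN TraitsDomain AS td ON td.trait_domain_id = t.trait_domain_id
    GROUP BY td.trait_domain_name, tv.taxon_id
    ORDER BY td.trait_domain_name, tv.taxon_id
    """
).fetchall()
print(domain_means)
```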
Traits Reference (TraitsReference)
To ensure clarity and reproducibility, each Trait in VegVault can have additional References beyond the general Dataset and Sample References. These Trait-specific References provide detailed provenance and citation information, supporting rigorous scientific research and enabling users to trace the origins and validation of each trait value.
| column_name | data_type | description | 
|---|---|---|
| trait_id | INTEGER | ID of a Trait | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table TraitsReference.
Abiotic Variables
The abiotic data in the VegVault database provide essential information on environmental factors affecting vegetation distribution and traits. Currently, VegVault includes abiotic data from CHELSA, CHELSA-TraCE21k, and WoSIS. CHELSA and CHELSA-TraCE21k provide high-resolution climate data, while WoSIS offers detailed soil information.
Abiotic variables (AbioticVariable)
As VegVault contains abiotic variables from several primary sources, the AbioticVariable table contains descriptions of abiotic variables (abiotic_variable_name), their units (abiotic_variable_unit), and measurement details (measure_details). These data include variables such as climate and soil conditions, which are crucial for understanding the ecological contexts of vegetation dynamics.
| column_name | data_type | description | 
|---|---|---|
| abiotic_variable_id | INTEGER | ID of an Abiotic Variable | 
| abiotic_variable_name | TEXT | Name of an Abiotic Variable from primary source | 
| abiotic_variable_unit | TEXT | Unit of an Abiotic Variable | 
| measure_details | TEXT | Additional details about an Abiotic Variable | 
| abiotic_variable_scale | NA | Scale of an Abiotic Variable | 
Column names and types for table AbioticVariable.
| Variable name | Variable unit | Description | 
|---|---|---|
| bio1 | C (degree Celsius) | mean annual air temperature | 
| bio4 | C (degree Celsius) | temperature seasonality | 
| bio6 | C (degree Celsius) | mean daily minimum air temperature of the coldest month | 
| bio12 | kg m-2 year-1 | annual precipitation amount | 
| bio15 | Unitless | precipitation seasonality | 
| bio18 | kg m-2 quarter-1 | mean monthly precipitation amount of the warmest quarter | 
| bio19 | kg m-2 quarter-1 | mean monthly precipitation amount of the coldest quarter | 
| HWSD2 | Unitless | SoilGrids-soil_class | 
Table showing abiotic variables.
Abiotic Data (AbioticData)
The AbioticData table holds the actual values of abiotic variables (the units are the same for each AbioticVariable).
Gridpoints (AbioticDataReference)
Gridpoints are stored in artificially created Datasets and Samples, with one Dataset holding several Samples only if they differ in age. We have estimated the spatial and temporal distance between each gridpoint and the non-gridpoint Samples (vegetation_plot, fossil_pollen_archive, and traits). We store the link between gridpoint and non-gridpoint Samples as well as the spatial and temporal distance. As this results in very large amounts of data, we have discarded any gridpoint Sample that is not within 50 km and/or 5000 years of any non-gridpoint Sample, as such gridpoints are not relevant for the vegetation dynamics.
| column_name | data_type | description | 
|---|---|---|
| sample_id | INTEGER | ID of non-gridpoint Sample | 
| sample_ref_id | INTEGER | ID of gridpoint Sample | 
| distance_in_km | INTEGER | Distance between the two Samples expressed in kilometres | 
| distance_in_years | INTEGER | Distance between the two Samples expressed in years | 
Column names and types for table AbioticDataReference.
This data structure makes the environmental context readily available for each vegetation and trait Sample. For each non-gridpoint Sample, users can select the spatio-temporally closest abiotic data or take the average of all surrounding gridpoints.
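Selecting the closest gridpoint can be sketched as follows. The example runs on invented rows in an in-memory SQLite database, and the helper `closest_gridpoints` is hypothetical. Ranking by spatial distance first and breaking ties by temporal distance is an assumption made here; other ways of combining the two distances are equally possible.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE AbioticDataReference (
    sample_id INTEGER, sample_ref_id INTEGER,
    distance_in_km INTEGER, distance_in_years INTEGER
);
-- Toy links between non-gridpoint Samples (1, 2) and gridpoint Samples (100..102).
INSERT INTO AbioticDataReference VALUES
    (1, 100, 30, 0), (1, 101, 12, 500), (1, 102, 12, 0), (2, 100, 45, 100);
""")


def closest_gridpoints(con):
    """Map each non-gridpoint sample_id to the sample_ref_id of its closest gridpoint."""
    closest = {}
    rows = con.execute(
        """
        SELECT sample_id, sample_ref_id
        FROM AbioticDataReference
        ORDER BY sample_id, distance_in_km, distance_in_years
        """
    )
    for sample_id, sample_ref_id in rows:
        # Rows arrive sorted by distance, so the first one seen per Sample is the closest.
        closest.setdefault(sample_id, sample_ref_id)
    return closest


nearest = closest_gridpoints(con)
print(nearest)
```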
Abiotic Variable Reference (AbioticVariableReference)
Each Abiotic Variable can have a separate Reference, in addition to a Dataset and Sample.
| column_name | data_type | description | 
|---|---|---|
| abiotic_variable_id | INTEGER | ID of an Abiotic Variable | 
| reference_id | INTEGER | ID of a Reference | 
Column names and types for table AbioticVariableReference.
References
The References table is a central component that serves all sections of the VegVault database. This table contains all References, independent of the source of the reference and the type of data. Each row contains a single Reference, which is then linked to the type of data being referenced. This allows a single Reference to be used across data types, and a single data point to have many different References.
Moreover, most primary sources of the data have a license that requires correct attribution. Therefore, each Reference records whether it must be cited when the specific data are used.