%%{init: {'theme': 'base', 'themeVariables': {
'background': '#0B0F14',
'primaryColor': '#141B22',
'primaryTextColor': '#E6EDF3',
'primaryBorderColor': '#2A3441',
'lineColor': '#98A6B3',
'secondaryColor': '#141B22',
'tertiaryColor': '#141B22',
'edgeLabelBackground': '#141B22',
'clusterBkg': '#141B22',
'titleColor': '#E6EDF3',
'fontFamily': 'IBM Plex Mono, monospace'
}}}%%
flowchart TD
A[(VegVault)] --> B["Community<br/>Processing"]
A --> C["Abiotic<br/>Processing"]
A --> D["Coordinates"]
B --> E{"Analysis<br/>Type"}
C --> E
D --> E
E -- "Temporal<br/>pipeline_time.R" --> F["Age Filter<br/>per 500-yr slice"]
E -- "Spatial<br/>pipeline_basic.R" --> G["Align Sample IDs<br/>3-way intersection"]
F --> H["Model Data<br/>Preparation"]
G --> H
H --> I["Compute MEMs<br/>Moran Eigenvector Maps"]
I --> J["Fit jSDM<br/>sjSDM + GPU"]
J --> K["Variance Decomposition<br/>ANOVA: A / B / S"]
J --> L["Network Analysis<br/>connectance / nestedness<br/>modularity"]
style E fill:#D98CFF,color:#0B0F14,stroke:#2A3441
style F fill:#48C7B8,color:#0B0F14,stroke:#2A3441
style G fill:#8DF59A,color:#0B0F14,stroke:#2A3441
style J fill:#FFB35C,color:#0B0F14,stroke:#2A3441
BIODYNAMICS — Vegetation Co-occurrence
This project is a work in progress. The code is not yet ready!
This project analyses co-occurrence patterns of plant taxa across space and time using paleoecological pollen records and co-located palaeoclimate reconstructions. The central aim is to quantify how much of the observed co-occurrence signal is explained by shared abiotic (environmental) responses versus genuine biotic associations — and how those patterns shift across geographic scales and through the last 20,000 years.
The project is part of the BIODYNAMICS initiative.
Data Sources
VegVault
All vegetation and palaeoclimate data are sourced from VegVault, a harmonised database combining fossil pollen archives with co-located gridpoint climate reconstructions. For each analysis, records are extracted for a configurable spatial window (longitude × latitude bounding box) and time range.
The function extract_data_from_vegvault() queries a local VegVault SQLite database and returns a unified tidy object containing both community and abiotic records for the requested region and period. Coordinates are extracted separately with get_coords(), which records one pair per dataset (not per sample), using dataset_name as the row identifier. Database integrity is verified at session startup via check_presence_of_vegvault().
Filters applied at extraction:
- Spatial:
x_lim(longitude) andy_lim(latitude) from the active configuration - Temporal:
age_limfrom the active configuration - Dataset types:
sel_dataset_type— typicallyc("fossil_pollen_archive", "gridpoints") - Abiotic variables:
sel_abiotic_var_namefrom the active configuration
Climate Variables
Palaeoclimate predictors are CHELSA-TraCE21k bioclimatic variables co-registered with the pollen records in VegVault. Full continental analyses use seven variables: BIO1 (mean annual temperature), BIO4 (temperature seasonality), BIO6 (minimum temperature of the coldest month), BIO12 (annual precipitation), BIO15 (precipitation seasonality), BIO18 (precipitation of the warmest quarter), and BIO19 (precipitation of the coldest quarter).
To prevent multicollinearity, predictors are assessed pairwise and a non-redundant subset is automatically retained before modelling.
Pipeline targets (from pipe_segment_abiotic_data.R):
| Target | Function | Description |
|---|---|---|
data_abiotic_ages |
add_age_to_samples() |
Abiotic data with sample age column attached |
abiotic_collinearity |
get_predictor_collinearity() |
Pairwise Spearman correlation matrix |
data_abiotic_selected |
select_non_collinear_predictors() |
Least-redundant predictor subset |
data_abiotic_interpolated |
interpolate_data() |
Interpolated to the configured time step |
The collinearity filter greedily removes predictors exceeding the pairwise correlation threshold, retaining the variable with the most unique information at each step.
Two Complementary Analysis Axes
The project addresses two related but distinct questions using two separate pipeline designs:
Temporal — How do co-occurrence patterns change through time?
pipeline_time.R fits the model independently for each 500-year time slice across the last 20,000 years. The full spatial window for a continent is used; the pipeline is branched over time using tarchetypes::tar_map(). Results aggregate into temporal trajectories showing how co-occurrence structure and its variance components evolve from the Last Glacial Maximum to the present.
Configurations: project_temporal_europe · project_temporal_america · project_temporal_asia
Spatial — How do co-occurrence patterns vary across spatial scales?
pipeline_basic.R is executed independently for each geographic unit defined in Data/Input/spatial_grid.csv, at three nested scales (continental → regional → local). Results are loaded post-hoc and compared to identify which co-occurrence patterns hold at all scales and which are scale-specific.
Configurations: project_spatial_continental · project_spatial_regional · project_spatial_local
The two pipelines share an identical analytical core. The only structural differences are (1) how the sample set is defined — time-filtering versus geographic subsetting — and (2) which Moran Eigenvector Maps are computed for the spatial random effect: pure 2-D MEMs for spatial analyses, or 3-D spatiotemporal MEMs for temporal analyses.
Analysis Pipeline
The diagram below shows the shared core workflow and the single divergence point between the two analysis types:
Pipeline Stages
1. VegVault Data Extraction
Vegetation records and co-located palaeoclimate data are queried from a local VegVault SQLite database for the spatial and temporal window defined by the active configuration. The extraction produces three parallel data objects — community records, abiotic records, and dataset-level coordinates — that flow independently through the next stages until merged at model assembly.
Pipe segment: pipe_segment_vegvault_data.R
| Target | Function | Description |
|---|---|---|
data_vegvault_extracted |
extract_data_from_vegvault() |
Raw community + abiotic records from VegVault |
data_coords |
get_coords() |
One coordinate pair per dataset_name |
check_n_cores |
check_min_n_cores() |
Guard: aborts if fewer than min_n_cores cores available |
2. Community Data Processing
Raw pollen counts are transformed into a cleaned, classified, and interpolated community matrix. The processing pipeline applies a standardised sequence of filters to ensure taxonomic consistency and temporal alignment across sites.
Pipe segment: pipe_segment_community_data.R
| Step | Target | Function |
|---|---|---|
| Reshape to long format | data_community_long |
make_community_data_long() |
| Attach sample ages | data_community_long_ages |
add_age_to_samples() |
| Compute proportions | data_community_proportions |
make_community_proportions() |
| Interpolate to time step | data_community_interpolated |
interpolate_community_data() |
| Retain only Plantae | data_community_plantae |
filter_non_plantae_taxa() |
| Resolve taxonomic resolution | data_community_classified |
classify_taxonomic_resolution() |
| Remove rare taxa | data_community_rare_filtered |
filter_rare_taxa() |
| Filter by core count | data_community_filtered_cores |
filter_community_by_n_cores() |
| Filter by sample count | data_community_filtered_samples |
filter_by_n_samples() |
| Select top N taxa | data_community_subset |
select_n_taxa() |
Taxonomic classification combines automated API lookup (get_taxa_classification()) with a user-supplied auxiliary table (Data/Input/aux_classification_table.csv) via combine_classification_tables().
Filtering thresholds from active config: minimal_proportion_of_pollen (default: 1%), min_n_cores, min_n_samples, number_of_taxa.
3. Abiotic Data Processing
Palaeoclimate predictors are screened for multicollinearity and a non-redundant subset is selected and interpolated to the community time step.
Pipe segment: pipe_segment_abiotic_data.R
| Target | Function | Description |
|---|---|---|
data_abiotic_ages |
add_age_to_samples() |
Abiotic data with age column |
abiotic_collinearity |
get_predictor_collinearity() |
Pairwise Spearman correlation matrix |
data_abiotic_selected |
select_non_collinear_predictors() |
Non-redundant predictor subset |
data_abiotic_interpolated |
interpolate_data() |
Interpolated to time_step |
4. Data Alignment — Pipeline Divergence Point
This step defines which samples enter the model and is the point where the temporal and spatial pipelines structurally diverge. Both paths produce the same output — a canonical (dataset_name, age) index present in all three data sources — but they derive it differently.
Spatial pipeline — pipe_segment_alignment.R:
align_sample_ids() computes the 3-way intersection of (dataset_name, age) pairs across community, abiotic, and coordinate data. All time steps within the spatial window are retained; the model sees the full temporal range of the geographic unit.
Temporal pipeline — pipe_segment_age_filter.R:
The pipeline is wrapped in tarchetypes::tar_map(), which injects a sel_age value for each time slice. align_sample_ids() is called with a subset_age argument restricting the intersection to samples at or near that target age. Each branch yields an independent model fitted to a single 500-year time window.
Both paths then call check_data_sample_ids_n() to validate that the minimum number of (dataset_name, age) pairs is met before proceeding.
Row-name convention: all matrices (community, abiotic, spatial) use "dataset_name__age" (double-underscore separator) as row names, ensuring consistent cross-matrix alignment throughout the pipeline.
5. Model Data Preparation
The aligned data are converted to model-ready matrix form: community counts are binarised, predictors are centred and scaled, coordinates are projected to a metric CRS, and the spatial random effect is computed as a set of Moran Eigenvector Maps (MEMs).
Pipe segments: pipe_segment_fit_data_prep.R + pipe_segment_model_prep.R
| Target | Function | Description |
|---|---|---|
data_community_to_fit |
prepare_community_for_fit() |
Long → wide community matrix aligned to data_sample_ids |
data_abiotic_wide |
prepare_abiotic_for_fit() |
Long → wide abiotic matrix |
data_abiotic_scaled_list |
scale_abiotic_for_fit() |
Centred + scaled; age centred only; scaling attributes saved for back-transform |
data_community_prepared |
binarize_community_data() |
Binary presence/absence matrix |
data_coords_projected |
project_coords_to_metric() |
WGS84 → metric CRS (default EPSG:3035) in km |
data_spatial_mev_core |
compute_spatial_mev() |
2-D MEMs from site locations via sjSDM::generateSpatialEV() |
data_spatial_mev_samples |
compute_spatiotemporal_mev() / prepare_spatial_predictors_for_fit() |
3-D MEMs (temporal) or 2-D MEMs expanded to sample level (spatial) |
data_spatial_scaled_list |
scale_spatial_for_fit() |
Centred + scaled MEM predictors |
data_to_fit |
assemble_data_to_fit() |
Validates and assembles all components into the model input list |
MEMs: the number of retained eigenvectors (n_mev) scales with the geographic extent. In the spatial mode, a 2-D geographic distance matrix is used. In the spatiotemporal mode, a 3-D distance matrix over (longitude, latitude, age) captures both spatial and temporal autocorrelation simultaneously.
6. Joint Species Distribution Model (sjSDM)
The model is a Joint Species Distribution Model (jSDM) fitted with the {sjSDM} package. sjSDM uses stochastic gradient descent with an optional GPU backend, making it tractable for the large community matrices and many sites in this project.
Pipe segment: pipe_segment_model_simple.R
| Target | Function | Description |
|---|---|---|
model_formula |
make_env_formula() |
~ (v1 + v2 + ...) * age - age if age is a predictor; ~ v1 + v2 + ... otherwise |
mod_jsdm |
fit_jsdm_model() |
Fits the jSDM with configured architecture, iterations, and error family |
mod_jsdm_se |
compute_jsdm_se() |
Standard errors computed post-hoc (CPU parallel) |
model_evaluation |
evaluate_jsdm() |
McFadden R², Nagelkerke R²; species-level AUC, Accuracy, LogLoss |
Model structure:
- Environment component: linear abiotic predictors, with optional age × climate interactions from
make_env_formula() - Spatial random effect: linear MEM predictors as fixed effects, approximating a spatial random field
- Biotic component: residual species covariation captured via a shared low-rank latent covariance matrix
- Error family:
binomial(presence/absence)
Key parameters: n_iter (training iterations; 500 default, 1000 for continental), n_sampling (posterior draws), n_cores (parallel SE computation).
The project was originally built with {HMSC} (MCMC-based). Fitting became prohibitively slow at continental scale; the pipeline was fully migrated to {sjSDM} for GPU-accelerated gradient descent, retaining the joint modelling framework at a fraction of the computation time.
7. Variance Decomposition (ANOVA)
After fitting, the variance explained by the model is partitioned into three components using Monte Carlo ANOVA: abiotic (A), biotic (B), and spatial (S) — plus all pairwise and three-way interaction fractions.
Pipe segment: pipe_segment_model_anova.R
| Target | Function | Description |
|---|---|---|
model_anova |
get_anova() |
Wraps sjSDM:::anova.sjSDM() with n_samples_anova Monte Carlo draws |
Downstream (temporal pipeline — pipe_segment_result_summary_age.R):
| Function | Description |
|---|---|
extract_anova_fractions() |
Named fractions: F_A, F_B, F_S, F_AB, F_AS, F_BS, F_ABS |
aggregate_anova_components() |
Combines per-age ANOVA objects into a long tibble: (age, component, R²) |
Components:
- A — variance uniquely explained by abiotic predictors
- B — residual species covariation not explained by the environment (biotic signal)
- S — variance captured by the spatial random effect (MEMs)
- AB, AS, BS, ABS — shared and interaction fractions between pairs and all three
8. Community Network Analysis
In the temporal pipeline, the binary presence/absence community matrix for each time slice is analysed as a co-occurrence network to extract ecological network metrics characterising the structure of plant associations.
Pipe segment: pipe_segment_community_network.R
| Target | Function | Description |
|---|---|---|
| Per-age network metrics | compute_network_metrics() |
Binary community matrix → bipartite::networklevel() |
Metrics:
- Connectance — proportion of realised species pairs out of all possible pairs
- Nestedness (NODF) — degree to which specialist taxa interact with subsets of generalist partners
- Modularity — partitioning strength of the network into discrete co-occurrence sub-communities
Results are aggregated across ages in pipe_segment_network_summary_age.R and visualised with plot_network_metrics_by_age().
Analysis Configurations
Temporal Analyses
Each model is fitted independently per 500-year time slice; results are aggregated into temporal trajectories spanning 0–20,000 yr BP.
| Configuration | Region | Lon | Lat | Age range | Bioclim variables | n_mev | Min samples |
|---|---|---|---|---|---|---|---|
project_cz |
Czechia (test) | 12–18.9°E | 48.5–51.5°N | 0–5,000 yr BP | BIO1, BIO12 | 3 | 5 |
project_temporal_europe |
Europe | 10°W–40°E | 35–70°N | 0–20,000 yr BP | BIO1, 4, 6, 12, 15, 18, 19 | 5 | 10 |
project_temporal_america |
North America | 130–60°W | 30–70°N | 0–20,000 yr BP | BIO1, 4, 6, 12, 15, 18, 19 | 5 | 10 |
project_temporal_asia |
Northern Asia | 60–140°E | 50–75°N | 0–20,000 yr BP | BIO1, 4, 6, 12, 15, 18, 19 | 5 | 10 |
Spatial Analyses
pipeline_basic.R is run independently for each unit in Data/Input/spatial_grid.csv. Geographic bounds are defined per unit in the spatial grid; all scales use the full set of seven CHELSA bioclim variables. Targets for each unit are stored at Data/targets/spatial_{scale}/{scale_id}/pipeline_basic/.
| Scale | Configuration | Units | n_mev | Min samples |
|---|---|---|---|---|
| Continental | project_spatial_continental |
3 (Europe, America, Asia) | 5 | 10 |
| Regional | project_spatial_regional |
~18 per continent | 4 | 10 |
| Local | project_spatial_local |
Fine-grain geographic units | 3 | 5 |
Current Pipeline Status
Below are live interactive pipeline graphs generated by targets::tar_visnetwork(). Each node colour indicates the target state (up-to-date, outdated, or unbuilt). Click a node to inspect its name and status.