Published

May 15, 2025

Modified

March 31, 2026

BIODYNAMICS — Vegetation Co-occurrence

This project is a work in progress. The code is not yet ready!

This project analyses co-occurrence patterns of plant taxa across space and time using paleoecological pollen records and co-located palaeoclimate reconstructions. The central aim is to quantify how much of the observed co-occurrence signal is explained by shared abiotic (environmental) responses versus genuine biotic associations — and how those patterns shift across geographic scales and through the last 20,000 years.

The project is part of the BIODYNAMICS initiative.

Data Sources

VegVault

All vegetation and palaeoclimate data are sourced from VegVault, a harmonised database combining fossil pollen archives with co-located gridpoint climate reconstructions. For each analysis, records are extracted for a configurable spatial window (longitude × latitude bounding box) and time range.

Technical details — data extraction

The function extract_data_from_vegvault() queries a local VegVault SQLite database and returns a unified tidy object containing both community and abiotic records for the requested region and period. Coordinates are extracted separately with get_coords(), which records one pair per dataset (not per sample), using dataset_name as the row identifier. Database integrity is verified at session startup via check_presence_of_vegvault().

Filters applied at extraction:

Spatial: x_lim (longitude) and y_lim (latitude) from the active configuration
Temporal: age_lim from the active configuration
Dataset types: sel_dataset_type — typically c("fossil_pollen_archive", "gridpoints")
Abiotic variables: sel_abiotic_var_name from the active configuration

Climate Variables

Palaeoclimate predictors are CHELSA-TraCE21k bioclimatic variables co-registered with the pollen records in VegVault. Full continental analyses use seven variables: BIO1 (mean annual temperature), BIO4 (temperature seasonality), BIO6 (minimum temperature of the coldest month), BIO12 (annual precipitation), BIO15 (precipitation seasonality), BIO18 (precipitation of the warmest quarter), and BIO19 (precipitation of the coldest quarter).

To prevent multicollinearity, predictors are assessed pairwise and a non-redundant subset is automatically retained before modelling.

Technical details — abiotic processing

Pipeline targets (from pipe_segment_abiotic_data.R):

Target	Function	Description
`data_abiotic_ages`	`add_age_to_samples()`	Abiotic data with sample age column attached
`abiotic_collinearity`	`get_predictor_collinearity()`	Pairwise Spearman correlation matrix
`data_abiotic_selected`	`select_non_collinear_predictors()`	Least-redundant predictor subset
`data_abiotic_interpolated`	`interpolate_data()`	Interpolated to the configured time step

The collinearity filter greedily removes predictors exceeding the pairwise correlation threshold, retaining the variable with the most unique information at each step.

Two Complementary Analysis Axes

The project addresses two related but distinct questions using two separate pipeline designs:

Two Complementary Analysis Axes

Temporal — How do co-occurrence patterns change through time?

pipeline_time.R fits the model independently for each 500-year time slice across the last 20,000 years. The full spatial window for a continent is used; the pipeline is branched over time using tarchetypes::tar_map(). Results aggregate into temporal trajectories showing how co-occurrence structure and its variance components evolve from the Last Glacial Maximum to the present.

Configurations: project_temporal_europe · project_temporal_america · project_temporal_asia

Spatial — How do co-occurrence patterns vary across spatial scales?

pipeline_basic.R is executed independently for each geographic unit defined in Data/Input/spatial_grid.csv, at three nested scales (continental → regional → local). Results are loaded post-hoc and compared to identify which co-occurrence patterns hold at all scales and which are scale-specific.

Configurations: project_spatial_continental · project_spatial_regional · project_spatial_local

The two pipelines share an identical analytical core. The only structural differences are (1) how the sample set is defined — time-filtering versus geographic subsetting — and (2) which Moran Eigenvector Maps are computed for the spatial random effect: pure 2-D MEMs for spatial analyses, or 3-D spatiotemporal MEMs for temporal analyses.

Analysis Pipeline

The diagram below shows the shared core workflow and the single divergence point between the two analysis types:

%%{init: {'theme': 'base', 'themeVariables': {
  'background': '#0B0F14',
  'primaryColor': '#141B22',
  'primaryTextColor': '#E6EDF3',
  'primaryBorderColor': '#2A3441',
  'lineColor': '#98A6B3',
  'secondaryColor': '#141B22',
  'tertiaryColor': '#141B22',
  'edgeLabelBackground': '#141B22',
  'clusterBkg': '#141B22',
  'titleColor': '#E6EDF3',
  'fontFamily': 'IBM Plex Mono, monospace'
}}}%%
flowchart TD
    A[(VegVault)] --> B["Community<br/>Processing"]
    A --> C["Abiotic<br/>Processing"]
    A --> D["Coordinates"]
    B --> E{"Analysis<br/>Type"}
    C --> E
    D --> E
    E -- "Temporal<br/>pipeline_time.R" --> F["Age Filter<br/>per 500-yr slice"]
    E -- "Spatial<br/>pipeline_basic.R" --> G["Align Sample IDs<br/>3-way intersection"]
    F --> H["Model Data<br/>Preparation"]
    G --> H
    H --> I["Compute MEMs<br/>Moran Eigenvector Maps"]
    I --> J["Fit jSDM<br/>sjSDM + GPU"]
    J --> K["Variance Decomposition<br/>ANOVA: A / B / S"]
    J --> L["Network Analysis<br/>connectance / nestedness<br/>modularity"]
    style E fill:#D98CFF,color:#0B0F14,stroke:#2A3441
    style F fill:#48C7B8,color:#0B0F14,stroke:#2A3441
    style G fill:#8DF59A,color:#0B0F14,stroke:#2A3441
    style J fill:#FFB35C,color:#0B0F14,stroke:#2A3441

Pipeline Stages

1. VegVault Data Extraction

Vegetation records and co-located palaeoclimate data are queried from a local VegVault SQLite database for the spatial and temporal window defined by the active configuration. The extraction produces three parallel data objects — community records, abiotic records, and dataset-level coordinates — that flow independently through the next stages until merged at model assembly.

Technical details

Pipe segment: pipe_segment_vegvault_data.R

Target	Function	Description
`data_vegvault_extracted`	`extract_data_from_vegvault()`	Raw community + abiotic records from VegVault
`data_coords`	`get_coords()`	One coordinate pair per `dataset_name`
`check_n_cores`	`check_min_n_cores()`	Guard: aborts if fewer than `min_n_cores` cores available

2. Community Data Processing

Raw pollen counts are transformed into a cleaned, classified, and interpolated community matrix. The processing pipeline applies a standardised sequence of filters to ensure taxonomic consistency and temporal alignment across sites.

Technical details

Pipe segment: pipe_segment_community_data.R

Step	Target	Function
Reshape to long format	`data_community_long`	`make_community_data_long()`
Attach sample ages	`data_community_long_ages`	`add_age_to_samples()`
Compute proportions	`data_community_proportions`	`make_community_proportions()`
Interpolate to time step	`data_community_interpolated`	`interpolate_community_data()`
Retain only Plantae	`data_community_plantae`	`filter_non_plantae_taxa()`
Resolve taxonomic resolution	`data_community_classified`	`classify_taxonomic_resolution()`
Remove rare taxa	`data_community_rare_filtered`	`filter_rare_taxa()`
Filter by core count	`data_community_filtered_cores`	`filter_community_by_n_cores()`
Filter by sample count	`data_community_filtered_samples`	`filter_by_n_samples()`
Select top N taxa	`data_community_subset`	`select_n_taxa()`

Taxonomic classification combines automated API lookup (get_taxa_classification()) with a user-supplied auxiliary table (Data/Input/aux_classification_table.csv) via combine_classification_tables().

Filtering thresholds from active config: minimal_proportion_of_pollen (default: 1%), min_n_cores, min_n_samples, number_of_taxa.

3. Abiotic Data Processing

Palaeoclimate predictors are screened for multicollinearity and a non-redundant subset is selected and interpolated to the community time step.

Technical details

Pipe segment: pipe_segment_abiotic_data.R

Target	Function	Description
`data_abiotic_ages`	`add_age_to_samples()`	Abiotic data with age column
`abiotic_collinearity`	`get_predictor_collinearity()`	Pairwise Spearman correlation matrix
`data_abiotic_selected`	`select_non_collinear_predictors()`	Non-redundant predictor subset
`data_abiotic_interpolated`	`interpolate_data()`	Interpolated to `time_step`

4. Data Alignment — Pipeline Divergence Point

This step defines which samples enter the model and is the point where the temporal and spatial pipelines structurally diverge. Both paths produce the same output — a canonical (dataset_name, age) index present in all three data sources — but they derive it differently.

Technical details

Spatial pipeline — pipe_segment_alignment.R:

align_sample_ids() computes the 3-way intersection of (dataset_name, age) pairs across community, abiotic, and coordinate data. All time steps within the spatial window are retained; the model sees the full temporal range of the geographic unit.

Temporal pipeline — pipe_segment_age_filter.R:

The pipeline is wrapped in tarchetypes::tar_map(), which injects a sel_age value for each time slice. align_sample_ids() is called with a subset_age argument restricting the intersection to samples at or near that target age. Each branch yields an independent model fitted to a single 500-year time window.

Both paths then call check_data_sample_ids_n() to validate that the minimum number of (dataset_name, age) pairs is met before proceeding.

Row-name convention: all matrices (community, abiotic, spatial) use "dataset_name__age" (double-underscore separator) as row names, ensuring consistent cross-matrix alignment throughout the pipeline.

5. Model Data Preparation

The aligned data are converted to model-ready matrix form: community counts are binarised, predictors are centred and scaled, coordinates are projected to a metric CRS, and the spatial random effect is computed as a set of Moran Eigenvector Maps (MEMs).

Technical details

Pipe segments: pipe_segment_fit_data_prep.R + pipe_segment_model_prep.R

Target	Function	Description
`data_community_to_fit`	`prepare_community_for_fit()`	Long → wide community matrix aligned to `data_sample_ids`
`data_abiotic_wide`	`prepare_abiotic_for_fit()`	Long → wide abiotic matrix
`data_abiotic_scaled_list`	`scale_abiotic_for_fit()`	Centred + scaled; `age` centred only; scaling attributes saved for back-transform
`data_community_prepared`	`binarize_community_data()`	Binary presence/absence matrix
`data_coords_projected`	`project_coords_to_metric()`	WGS84 → metric CRS (default EPSG:3035) in km
`data_spatial_mev_core`	`compute_spatial_mev()`	2-D MEMs from site locations via `sjSDM::generateSpatialEV()`
`data_spatial_mev_samples`	`compute_spatiotemporal_mev()` / `prepare_spatial_predictors_for_fit()`	3-D MEMs (temporal) or 2-D MEMs expanded to sample level (spatial)
`data_spatial_scaled_list`	`scale_spatial_for_fit()`	Centred + scaled MEM predictors
`data_to_fit`	`assemble_data_to_fit()`	Validates and assembles all components into the model input list

MEMs: the number of retained eigenvectors (n_mev) scales with the geographic extent. In the spatial mode, a 2-D geographic distance matrix is used. In the spatiotemporal mode, a 3-D distance matrix over (longitude, latitude, age) captures both spatial and temporal autocorrelation simultaneously.

6. Joint Species Distribution Model (sjSDM)

The model is a Joint Species Distribution Model (jSDM) fitted with the {sjSDM} package. sjSDM uses stochastic gradient descent with an optional GPU backend, making it tractable for the large community matrices and many sites in this project.

Technical details

Pipe segment: pipe_segment_model_simple.R

Target	Function	Description
`model_formula`	`make_env_formula()`	`~ (v1 + v2 + ...) * age - age` if age is a predictor; `~ v1 + v2 + ...` otherwise
`mod_jsdm`	`fit_jsdm_model()`	Fits the jSDM with configured architecture, iterations, and error family
`mod_jsdm_se`	`compute_jsdm_se()`	Standard errors computed post-hoc (CPU parallel)
`model_evaluation`	`evaluate_jsdm()`	McFadden R², Nagelkerke R²; species-level AUC, Accuracy, LogLoss

Model structure:

Environment component: linear abiotic predictors, with optional age × climate interactions from make_env_formula()
Spatial random effect: linear MEM predictors as fixed effects, approximating a spatial random field
Biotic component: residual species covariation captured via a shared low-rank latent covariance matrix
Error family: binomial (presence/absence)

Key parameters: n_iter (training iterations; 500 default, 1000 for continental), n_sampling (posterior draws), n_cores (parallel SE computation).

The project was originally built with {HMSC} (MCMC-based). Fitting became prohibitively slow at continental scale; the pipeline was fully migrated to {sjSDM} for GPU-accelerated gradient descent, retaining the joint modelling framework at a fraction of the computation time.

7. Variance Decomposition (ANOVA)

After fitting, the variance explained by the model is partitioned into three components using Monte Carlo ANOVA: abiotic (A), biotic (B), and spatial (S) — plus all pairwise and three-way interaction fractions.

Technical details

Pipe segment: pipe_segment_model_anova.R

Target	Function	Description
`model_anova`	`get_anova()`	Wraps `sjSDM:::anova.sjSDM()` with `n_samples_anova` Monte Carlo draws

Downstream (temporal pipeline — pipe_segment_result_summary_age.R):

Function	Description
`extract_anova_fractions()`	Named fractions: F_A, F_B, F_S, F_AB, F_AS, F_BS, F_ABS
`aggregate_anova_components()`	Combines per-age ANOVA objects into a long tibble: `(age, component, R²)`

Components:

A — variance uniquely explained by abiotic predictors
B — residual species covariation not explained by the environment (biotic signal)
S — variance captured by the spatial random effect (MEMs)
AB, AS, BS, ABS — shared and interaction fractions between pairs and all three

8. Community Network Analysis

In the temporal pipeline, the binary presence/absence community matrix for each time slice is analysed as a co-occurrence network to extract ecological network metrics characterising the structure of plant associations.

Technical details

Pipe segment: pipe_segment_community_network.R

Target	Function	Description
Per-age network metrics	`compute_network_metrics()`	Binary community matrix → `bipartite::networklevel()`

Metrics:

Connectance — proportion of realised species pairs out of all possible pairs
Nestedness (NODF) — degree to which specialist taxa interact with subsets of generalist partners
Modularity — partitioning strength of the network into discrete co-occurrence sub-communities

Results are aggregated across ages in pipe_segment_network_summary_age.R and visualised with plot_network_metrics_by_age().

Analysis Configurations

Temporal Analyses

Each model is fitted independently per 500-year time slice; results are aggregated into temporal trajectories spanning 0–20,000 yr BP.

Configuration	Region	Lon	Lat	Age range	Bioclim variables	n_mev	Min samples
`project_cz`	Czechia (test)	12–18.9°E	48.5–51.5°N	0–5,000 yr BP	BIO1, BIO12	3	5
`project_temporal_europe`	Europe	10°W–40°E	35–70°N	0–20,000 yr BP	BIO1, 4, 6, 12, 15, 18, 19	5	10
`project_temporal_america`	North America	130–60°W	30–70°N	0–20,000 yr BP	BIO1, 4, 6, 12, 15, 18, 19	5	10
`project_temporal_asia`	Northern Asia	60–140°E	50–75°N	0–20,000 yr BP	BIO1, 4, 6, 12, 15, 18, 19	5	10

Spatial Analyses

pipeline_basic.R is run independently for each unit in Data/Input/spatial_grid.csv. Geographic bounds are defined per unit in the spatial grid; all scales use the full set of seven CHELSA bioclim variables. Targets for each unit are stored at Data/targets/spatial_{scale}/{scale_id}/pipeline_basic/.

Scale	Configuration	Units	n_mev	Min samples
Continental	`project_spatial_continental`	3 (Europe, America, Asia)	5	10
Regional	`project_spatial_regional`	~18 per continent	4	10
Local	`project_spatial_local`	Fine-grain geographic units	3	5

Current Pipeline Status

Below are live interactive pipeline graphs generated by targets::tar_visnetwork(). Each node colour indicates the target state (up-to-date, outdated, or unbuilt). Click a node to inspect its name and status.

BIODYNAMICS — Vegetation Co-occurrence

Data Sources

VegVault

Climate Variables

Two Complementary Analysis Axes

Temporal — How do co-occurrence patterns change through time?

Spatial — How do co-occurrence patterns vary across spatial scales?

Analysis Pipeline

Pipeline Stages

1. VegVault Data Extraction

2. Community Data Processing

3. Abiotic Data Processing

4. Data Alignment — Pipeline Divergence Point

5. Model Data Preparation

6. Joint Species Distribution Model (sjSDM)

7. Variance Decomposition (ANOVA)

8. Community Network Analysis

Analysis Configurations

Temporal Analyses

Spatial Analyses

Current Pipeline Status

Temporal Pipeline — Europe (`project_temporal_europe`)

Spatial Pipeline — Europe continental (`spatial_continental / europe`)

BIODYNAMICS — Vegetation Co-occurrence

Data Sources

VegVault

Climate Variables

Two Complementary Analysis Axes

Temporal — How do co-occurrence patterns change through time?

Spatial — How do co-occurrence patterns vary across spatial scales?

Analysis Pipeline

Pipeline Stages

1. VegVault Data Extraction

2. Community Data Processing

3. Abiotic Data Processing

4. Data Alignment — Pipeline Divergence Point

5. Model Data Preparation

6. Joint Species Distribution Model (sjSDM)

7. Variance Decomposition (ANOVA)

8. Community Network Analysis

Analysis Configurations

Temporal Analyses

Spatial Analyses

Current Pipeline Status

Temporal Pipeline — Europe (project_temporal_europe)

Spatial Pipeline — Europe continental (spatial_continental / europe)

Temporal Pipeline — Europe (`project_temporal_europe`)

Spatial Pipeline — Europe continental (`spatial_continental / europe`)