Step-by-step guide

This workflow should show the full strength of the RRatepol package and serve as step-by-step guidance starting from downloading dataset from Neotoma, building age-depth models, to estimating rate-of-change using age uncertainty.

:warning: This workflow is only meant as an example: There are several additional steps for data preparation which should be done to properly implement RRatepol and asssess rate of change of a fossil pollen dataset from Neotoma!

Install packages

Please follow the pre-workshop instructions to make sure all packages are installed.

Attach packages

library(tidyverse) # general data wrangling and visualisation
library(pander) # nice tables
library(RRatepol) # rate-of-vegetation change ! v1.2.0 !
library(neotoma2) # obtain data from the Neotoma database
library(Bchron) # age-depth modeling
library(janitor) # string cleaning

Download a dataset from Neotoma

Here we have selected the Ahakagyezi Swamp record.

sel_dataset_download <-
  neotoma2::get_downloads(50216)

Prepare the pollen counts

# get samples
sel_counts <-
  neotoma2::samples(sel_dataset_download)

# select only "pollen" taxa
sel_taxon_list_selected <-
  neotoma2::taxa(sel_dataset_download) %>%
  dplyr::filter(element == "pollen") %>%
  purrr::pluck("variablename")

# prepare taxa table
sel_counts_selected <-
  sel_counts %>%
  as.data.frame() %>%
  dplyr::mutate(sample_id = as.character(sampleid)) %>%
  tibble::as_tibble() %>%
  dplyr::select("sample_id", "value", "variablename") %>%
  # only include selected taxons
  dplyr::filter(
    variablename %in% sel_taxon_list_selected
  ) %>%
  # turn into the wider format
  tidyr::pivot_wider(
    names_from = "variablename",
    values_from = "value",
    values_fill = 0
  ) %>%
  # clean names
  janitor::clean_names()

head(sel_counts_selected)[, 1:5]

sample_id	rhamnaceae	combretaceae_melastomataceae	ranunculaceae	prunus
500543	1	1	1	1
500544	4	0	0	2
500545	6	1	0	0
500547	4	0	0	0
500548	7	0	0	0
500549	3	1	0	1

Here, we strongly advocate that attention should be paid to the selection of the ecological groups as well as the harmonisation of the pollen taxa. However, that is not the subject of this workflow, but any analysis to be published needs careful preparation of the fossil pollen datasets before using R-Ratepol!

We can now try to visualise the taxa per sample_id

sel_counts_selected %>%
  tibble::rowid_to_column("ID") %>%
  tidyr::pivot_longer(
    cols = -c(sample_id, ID),
    names_to = "taxa",
    values_to = "n_grains"
  ) %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      x = ID,
      y = n_grains,
      fill = taxa
    ),
  ) +
  ggplot2::geom_bar(
    stat = "identity",
    position = "fill"
  ) +
  ggplot2::labs(
    x = "sample_id",
    y = "proportion of pollen grains"
  ) +
  ggplot2::theme(
    axis.text.x = ggplot2::element_blank(),
    legend.position = "none"
  )

Preparation of the levels

Sample depth

Extract depth for each level

sel_level <-
  neotoma2::samples(sel_dataset_download) %>%
  tibble::as_tibble() %>%
  dplyr::mutate(sample_id = as.character(sampleid)) %>%
  dplyr::distinct(sample_id, depth) %>%
  dplyr::relocate(sample_id)

head(sel_level)

sample_id	depth
500543	703
500544	753
500545	803
500547	853
500548	908
500549	953

Age-depth modelling

We will recalculate the age-depth model ‘de novo’ using the Bchron package.

Prepare chron.control table and run Bchron

The chronology control table contains all the dates (mostly radiocarbon) to create the age-depth model.

Here we only present a few of the important steps of preparation of the chronology control table. There are many more potential issues, but solving those is not the focus of this workflow.

# First, get the chronologies and check which we want to use used
sel_chron_control_table_download <-
  neotoma2::chroncontrols(sel_dataset_download)

print(sel_chron_control_table_download)

Table continues below
siteid	chronologyid	depth	thickness	agelimitolder	chroncontrolid
27552	35429	330	20	1150	110443
27552	35429	1892	16	14860	110449
27552	35429	1175	30	5250	110446
27552	35429	2060	14	15750	110451
27552	35429	2235	30	24200	110454
27552	35429	2087	27	18890	110453

agelimityounger	chroncontrolage	chroncontroltype
1010	1080	Radiocarbon
14660	14760	Radiocarbon
5110	5180	Radiocarbon
15530	15640	Radiocarbon
23680	23940	Radiocarbon
18650	18770	Radiocarbon

# prepare the table
sel_chron_control_table <-
  sel_chron_control_table_download %>%
  # Here select the ID of one of the chronology
  dplyr::filter(chronologyid == 35430) %>%
  tibble::as_tibble() %>%
  # Here we calculate the error as the average of the age `limitolder` and
  #   `agelimityounger`
  dplyr::mutate(
    error = round((agelimitolder - agelimityounger) / 2)
  ) %>%
  # As Bchron cannot accept a error of 0, we need to replace the value with 1
  dplyr::mutate(
    error = replace(error, error == 0, 1),
    error = ifelse(is.na(error), 1, error)
  ) %>%
  # We need to specify which calibration curve should be used for what point
  dplyr::mutate(
    curve = ifelse(as.data.frame(sel_dataset_download)["lat"] > 0, "intcal20", "shcal20"),
    curve = ifelse(chroncontroltype != "Radiocarbon", "normal", curve)
  ) %>%
  tibble::column_to_rownames("chroncontrolid") %>%
  dplyr::arrange(depth) %>%
  dplyr::select(
    chroncontrolage, error, depth, thickness, chroncontroltype, curve
  )

head(sel_chron_control_table)

chroncontrolage	error	depth	thickness	chroncontroltype	curve
1080	70	330	20	Radiocarbon	shcal20
3070	70	675	30	Radiocarbon	shcal20
3360	70	820	20	Radiocarbon	shcal20
5180	70	1175	30	Radiocarbon	shcal20
5870	90	1335	20	Radiocarbon	shcal20
9580	60	1535	30	Radiocarbon	shcal20

As this is just a toy example, we will use only the iteration multiplier (i_multiplier) of 0.1 to reduce the computation time. However, we strongly recommend increasing it to 5 for any normal age-depth model construction.

i_multiplier <- 0.1 # increase to 5

# Those are default values suggested by the Bchron package
n_iteration_default <- 10e3
n_burn_default <- 2e3
n_thin_default <- 8

# Let's multiply them by our i_multiplier
n_iteration <- n_iteration_default * i_multiplier
n_burn <- n_burn_default * i_multiplier
n_thin <- max(c(1, n_thin_default * i_multiplier))

# run Bchron
sel_bchron <-
  Bchron::Bchronology(
    ages = sel_chron_control_table$chroncontrolage,
    ageSds = sel_chron_control_table$error,
    positions = sel_chron_control_table$depth,
    calCurves = sel_chron_control_table$curve,
    positionThicknesses = sel_chron_control_table$thickness,
    iterations = n_iteration,
    burn = n_burn,
    thin = n_thin
  )

Visually check the age-depth models

plot(sel_bchron)

Predict ages

Let’s first extract posterior ages (i.e. possible ages) from the age-depth model.

age_position <-
  Bchron:::predict.BchronologyRun(object = sel_bchron, newPositions = sel_level$depth)

age_uncertainties <-
  age_position %>%
  as.data.frame() %>%
  dplyr::mutate_all(., as.integer) %>%
  as.matrix()

colnames(age_uncertainties) <- sel_level$sample_id

head(age_uncertainties, n = 8)[, 1:8]

Here we see samples (e.g., 500543,500544, 500547,…) and their possible ages (age-sequence) with each model iteration (posterior). Each age-sequence is similar but there are differences of tens or hundreds of years. We will call this the uncertainty matrix.

500543	500544	500545	500547	500548	500549	500550	500546
3124	3343	3588	3714	3834	4172	4669	4756
3379	3503	3627	4796	5082	5317	5551	5588
3337	3479	3623	3775	3935	4062	4402	4518
3405	3546	3635	3798	4021	4202	4384	4445
3404	3477	3536	3872	4438	4548	4657	4693
3483	3598	3648	3711	3787	3849	3911	3932
3418	3524	3629	3762	3924	4056	4189	4233
3315	3448	3581	3787	4054	4275	4505	4587

We can visualise these “possible ages” (age-sequence) of each iteration.

# create a data.frame for plotting
data_age_uncertainties <-
  age_uncertainties %>%
  as.data.frame() %>%
  tibble::rowid_to_column("ID") %>%
  tidyr::pivot_longer(
    cols = -ID,
    names_to = "sample_id",
    values_to = "age"
  ) %>%
  dplyr::left_join(
    sel_level,
    by = dplyr::join_by(sample_id)
  )

Each line is a single potential age-depth model iteration (age-sequence). Green points represent the radiocarbon dates. Horizontal lines are depths of our samples.

(
  fig_age_uncertainties <-
    data_age_uncertainties %>%
    ggplot2::ggplot(
      mapping = ggplot2::aes(
        x = age,
        y = depth
      )
    ) +
    ggplot2::geom_line(
      mapping = ggplot2::aes(
        group = ID
      ),
      alpha = 0.05,
      linewidth = 0.1
    ) +
    ggplot2::geom_hline(
      yintercept = sel_level$depth,
      lty = 2,
      color = "gray50",
      alpha = 0.5,
      linewidth = 0.1
    ) +
    ggplot2::geom_point(
      data = sel_chron_control_table,
      mapping = ggplot2::aes(
        x = chroncontrolage
      ),
      color = "green",
      shape = 15,
      size = 3
    ) +
    ggplot2::scale_y_continuous(trans = "reverse") +
    ggplot2::scale_x_continuous(trans = "reverse")
)

We can visualise all age-depth “possible ages” together as the range of values. Here, each line representing one sampled depth in our record.

data_age_uncertainties %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      x = age,
      y = depth,
      group = depth
    )
  ) +
  ggplot2::geom_hline(
    yintercept = sel_level$depth,
    lty = 2,
    color = "gray50",
    alpha = 0.5,
    linewidth = 0.1
  ) +
  ggplot2::geom_boxplot(
    outlier.shape = NA
  )

Let’s take the median age of all possible ages (i.e. the estimated age from each age-depth model run) as our default.

sel_level_predicted <-
  sel_level %>%
  dplyr::mutate(
    age = apply(
      age_uncertainties, 2,
      stats::quantile,
      probs = 0.5
    )
  )

head(sel_level_predicted)

sample_id	depth	age
500543	703	3284
500544	753	3410
500545	803	3532
500547	853	3846
500548	908	4178
500549	953	4444

We can visualise the median age by drawing a red line. This age is the age that is often reported in publications but in essence it represents multiple age-depth model runs with smaller or larger age uncertainties throughout the pollen record.

fig_age_uncertainties +
  ggplot2::geom_point(
    data = sel_level_predicted,
    color = "red",
    size = 3
  ) +
  ggplot2::geom_line(
    data = sel_level_predicted,
    color = "red",
    linewidth = 1
  )

Visualisation of our data

Let’s now make a simple pollen diagram with proportions of the main pollen taxa (x-axis) against our estimated ages along depth (y-axis).

sel_counts_selected %>%
  tibble::column_to_rownames("sample_id") %>%
  RRatepol:::transform_into_proportions() %>%
  tibble::rownames_to_column("sample_id") %>%
  dplyr::inner_join(
    sel_level_predicted,
    by = dplyr::join_by(sample_id)
  ) %>%
  tidyr::pivot_longer(
    cols = -c(sample_id, depth, age),
    names_to = "taxa",
    values_to = "proportion_of_grains"
  ) %>%
  dplyr::group_by(taxa) %>%
  # Calculate the average proportion of grains
  dplyr::mutate(
    avg_prop = mean(proportion_of_grains)
  ) %>%
  # only keep te main taxa
  dplyr::filter(avg_prop > 0.01) %>%
  dplyr::ungroup() %>%
  ggplot2::ggplot(
    mapping = ggplot2::aes(
      y = age,
      x = proportion_of_grains,
      xmax = proportion_of_grains,
      xmin = 0,
      fill = taxa,
      col = taxa
    ),
  ) +
  ggplot2::geom_ribbon() +
  ggplot2::scale_y_continuous(trans = "reverse") +
  ggplot2::scale_x_continuous(breaks = c(0, 1)) +
  ggplot2::facet_wrap(~taxa, nrow = 1) +
  ggplot2::theme(
    legend.position = "none"
  )

Estimation Rate-of-Change

Now we will use our prepared fossil pollen data and age-depth model to estimate the rate of vegetation change. We will present several scenarios (i.e. approaches) to calculate RoC. For all scenarios, we will be using the chisq dissimilarity coefficient (works best for pollen data), and time_standardisation == 500 (this means that all ROC values are ‘change per 500 yr’).

Scenario 1 - Estimating RoC for each level

This is the “Classic” approach that uses each sampled depth in a pollen record (i.e. individual level) to estimate RoC.

scenario_1 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    time_standardisation = 500,
    working_units = "levels" # here is set to use individual levels
  )

RRatepol::plot_roc(data_source = scenario_1)

Scenario 2 - Estimating RoC for each level with smoothing of data

We do the same as in Scenario 1 but now we smooth the pollen data before calculating RoC. Specifically, we will add smooth_method = “shep” (i.e. Shepard’s 5-term filter).

scenario_2 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    time_standardisation = 500,
    working_units = "levels",
    smooth_method = "shep" # Shepard's 5-term filter
  )

RRatepol::plot_roc(data_source = scenario_2)

We see that the pattern changed only slightly but the absolute RoC scores changed (x-axis).

Scenario 3 - Estimating RoC for each level subsampling data

We will now do taxa-standardisation by random sub-sampling each level to 150 pollen grains (i.e. for analysis only 150 pollen grains per sample will be used). In order to do that we need to increase the number of randomisations. This is again a toy example for a quick computation and therefore we only do 100 randomisations. We would recommend increasing the set_randomisations to 10.000 for any real estimation. To speed the process up, you can also set use_parallel == TRUE, which will use all cores of your computer.

set_randomisations <- 100

scenario_3 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    working_units = "levels",
    time_standardisation = 500,
    smooth_method = "shep",
    standardise = TRUE, # set the taxa standardisation
    n_individuals = 150, # set the number of pollen grains
    rand = set_randomisations, # set number of randomisations
    use_parallel = TRUE # do use parallel computing
  )

We will now also visualize uncertainty around the RoC scores shown by a grey shadow.

RRatepol::plot_roc(data_source = scenario_3)

Scenario 4 - Estimating RoC for each level subsampling data and calculating age uncertainties

For RoC analysis, it is important to consider age uncertainties. For each iteration, RRatepol will randomly select one age-sequence from the uncertainty matrix (see the age-depth modelling section for more info).

scenario_4 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    working_units = "levels",
    time_standardisation = 500,
    smooth_method = "shep",
    standardise = TRUE,
    n_individuals = 150,
    rand = set_randomisations,
    use_parallel = TRUE,
    age_uncertainty = age_uncertainties # Add the uncertainty matrix
  )

RRatepol::plot_roc(data_source = scenario_4)

Here you can see that the uncertainty (grey shadow) around the RoC scores (black line) increased drastically. This is because we are randomly sampling age and taxa with a small number of randomisations.

Scenario 5 - Estimating RoC per bin

In order to get rid of the effect of uneven distribution of sampled depths (i.e. levels) in a fossil pollen record, we can bin the data. Specifically, we will change the working_units from single levels to "bins". Here we select bins of 500 years each instead of the individual levels. Note that one level is randomly selected as a representation of that time bin.

scenario_5 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    working_units = "bins", # change the "bins"
    bin_size = 500, # sie of a time bin
    time_standardisation = 500,
    smooth_method = "shep",
    standardise = TRUE,
    n_individuals = 150,
    rand = set_randomisations,
    use_parallel = TRUE,
    age_uncertainty = age_uncertainties
  )

RRatepol::plot_roc(data_source = scenario_5)

We see a substantial increase in temporal uncertainty around the RoC scores (grey shadow), indicating a loss of temporal precision.

Scenario 6 A - Estimating RoC with the new “Moving-window” approach

In order to reduce the temporal uncertainty and improve temporal precision, we can apply a novel approach in RRATEPOL called “moving window”.

scenario_6 <-
  RRatepol::estimate_roc(
    data_source_community = sel_counts_selected,
    data_source_age = sel_level_predicted,
    dissimilarity_coefficient = "chisq",
    working_units = "MW", # change the "MW" to apply the "moving window"
    bin_size = 500,
    number_of_shifts = 5, # number of shifts
    time_standardisation = 500,
    smooth_method = "shep",
    standardise = TRUE,
    n_individuals = 150,
    rand = set_randomisations,
    use_parallel = TRUE,
    age_uncertainty = age_uncertainties
  )

RRatepol::plot_roc(data_source = scenario_6)

Scenario 6 B - Estimating RoC with the new “Moving-window” approach and detecting peak points

Throughout the record, there can be periods when the RoC will substantially change. We can detect RoC increases that are significant by identifying so called peak-points. Here, we will use “Non-linear” method, which will detect significant change from a non-linear trend of RoC.

scenario_6_peak <-
  RRatepol::detect_peak_points(
    data_source = scenario_6,
    sel_method = "trend_non_linear"
  )

Now we will plot the RoC estimates showing the peak-points. So here we can see that there were rates of vegetation change throughout the record but only at certain moments in time (green dots - peak points) these changes were significant. There you go!

RRatepol::plot_roc(
  data_source = scenario_6_peak,
  peaks = TRUE
)