time-uncertain-data-analysis-in-R/ageModelling.Rmd at main · LinkedEarth/time-uncertain-data-analysis-in-R · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
# Age modelling in geoChronR {#agemodelling}

geoChronR quantifies the uncertainties due to time uncertainty by taking advantage of ensembles of plausible age histories for one or more datasets.
This means that often an early step in the geoChronR workflow is generating age ensembles.
Most modern age modelling approaches quantify uncertainties using methods that rely on ensembles, however preserving, extracting, and storing those uncertainties for subsequent analysis can be challenging.
geoChronR helps with this!
In this chapter, we'll go through the workflow of generating age models with four methods that are integrated into geoChronR
<!-- add note about importing age models from other approaches -->

```{r}
library(lipdR)
library(geoChronR)
library(ggplot2)
library(magrittr)

tana <- readLipd("https://lipdverse.org/Temp12k/1_0_2/TanaLake.Loomis.2015.lpd")
```


## Bacon

The Bayesian ACcumulatiON (Bacon) algorithm  [@Blaauw2011BACON] is one of the most broadly used age-modelling techniques, and was designed to take advantage of prior knowledge about the distribution and autocorrelation structure of sedimentation rates in a sequence to better quantify uncertainty between dated levels.
Bacon divides a sediment sequence into a parameterized number of equally-thick segments; most models use dozens to hundreds of these segments.
Bacon then models sediment deposition, with uniform accumulation within each segment, as an autoregressive gamma process, where both the amount of autocorrelation and the shape of the gamma distribution are given prior estimates.
The algorithm employs an adaptive Markov Chain Monte Carlo algorithm that allows for Bayesian learning to update these variables given the age-depth constraints, and converge on a distribution of age estimates for each segment in the model.
Bacon has two key parameters: the shape of the accumulation prior, and the segment length, which can interact in complicated ways [@trachsel2017].
In our experience, the segment length parameter has the greatest impact on the ultimate shape and amount of uncertainty simulated by Bacon, as larger segments result in increased flexibility of the age-depth curve, and increased uncertainty between dated levels.
Bacon is written in C++ and R, with an R interface.
More recently, the authors released an R package "rbacon" [@baconPackage], which geoChronR leverages to provide access to the algorithm.
Bacon will optionally return a thinned subset of the stabilized MCMC accumulation rate ensemble members, which geoChronR uses to form age ensemble members for subsequent analysis.


```{r runBacon,message=FALSE,results='hide',cache=TRUE,fig.show='hide',}
tana <- runBacon(tana,
                 lab.id.var = 'LabID',
                 age.14c.var = 'age14C',
                 age.14c.uncertainty.var = 'age14CUnc',
                 age.var = 'age',
                 age.uncertainty.var = '1SD',
                 depth.var = 'depth',
                 reservoir.age.14c.var = NULL,
                 reservoir.age.14c.uncertainty.var = NULL,
                 rejected.ages.var = NULL,
                 accept.suggestions = TRUE)
```
Great! If all went well Bacon ran, and geoChronR grabbed the ensembles for future use. What kind of future use? Well, let's start with plotting.


The `plotChronEns()` function is great for making, quick, but pretty nice, figures to show an age model ensemble. It has a lot of options for customization (check out `?plotChronEns`). Lastly, what it returns is a ggplot2 object, meaning that you can further customize it! Let's see how it goes!

```{r}
plotChronEns(tana) + ggtitle("Tana Lake - default Bacon model")
```
That was easy! But you'll have to explore the options to fully customize your figure.

:::: {.blackbox data-latex=""}
::: {.exercise #exploreplotChronEns}
Explore the parameter choices in plotChronEns. Can you a) change the confidence interval colors and b) quantiles? c) Change the type of distribution plotted for the dates d) and their color and transparency? e) what does truncate.dist do?
:::
::::

## BChron

BChron [@bchron;@parnell2008flexible] uses a similar approach, using a continuous Markov monotone stochastic process coupled to a piecewise linear deposition model.
This simplicity allows semi-analytical solutions that make BChron computationally efficient. BChron was originally intended to model radiocarbon-based age-depth models in lake sedimentary cores of primarily Holocene age, but its design allows broader applications.
In particular, modeling accumulation as additive independent gamma increments is appealing for the representation of hiatuses, particularly for speleothem records, where accumulation rate can vary quite abruptly between quiescent intervals of near-constant accumulation [@Parnell_QSR2011;@PRYSM;@Hu_epsl17].
The downside of this assumption is that BChron is known to exaggerate age uncertainties in cases where sedimentation varies smoothly [@trachsel2017].

Bchron has several key parameters, which allow a user to encode their specific knowledge about their data.
In particular, the `outlierProbs` parameter is useful in giving less weight to chronological tie points that may be considered outliers, either because they create a reversal in the stratigraphic sequence, or because they were flagged during analysis (e.g. contamination).
This is extremely useful for radiocarbon-based chronologies where the top age may not be accurately measured for modern samples.
The `thetaMhSd`, `psiMhSd`, and `muMhSd` parameters control the Metropolis-Hastings standard deviation for the age parameters and Compound Poisson-Gamma scale and mean respectively, which influence the width of the ensemble between age control tie points.
geoChronR uses the same default values as the official Bchron package, and we recommend that users only change them if they have good cause for doing so.


```{r runBchron, cache=TRUE, results = 'hide',cache=TRUE}
tana <- runBchron(tana,
                 iter = 10000,
                 model.num = 2,
                 lab.id.var = 'LabID',
                 age.14c.var = 'age14C',
                 age.14c.uncertainty.var = 'age14CUnc',
                 age.var = 'age',
                 age.uncertainty.var = '1SD',
                 depth.var = 'depth',
                 reservoir.age.14c.var = NULL,
                 reservoir.age.14c.uncertainty.var = NULL,
                 rejected.ages.var = NULL)
```
```{r}
plotChronEns(tana,model.num = 2,truncate.dist = .0001) + ggtitle("Tana Lake - default Bchron model")
```

## OxCal

The OxCal software package has a long history and extensive tools for the statistical treatment of radiocarbon and other geochronological data [@BronkRamsey95].
In @ramsey2008deposition, age-depth modelling was introduced with three options for modelling depositional processes that are typically useful for sedimentary sequences: uniform, varve, and Poisson deposition models, labeled U-sequence, V-sequence and P-sequence, respectively.
The Poisson-based model is the most broadly applicable for sedimentary, or other accumulation-based archives (e.g. speleothems), and although any sequence type can be used in geoChronR, most users should use a P-sequence, which is the default.
Analogously to segment length parameter in Bacon, the *k* parameter (called `eventsPerUnitLength` in geoChronR), controls how many events are simulated per unit of depth, and has a strong impact on the flexibility of the model, as well as the amplitude of the resulting uncertainty.
As the number of events increases, the flexibility of the model, and the uncertainties, decrease.
@trachsel2017 found that this parameter has a large impact on the accuracy of the model, more so than the choices made in Bacon or Bchron.
Fortunately, @bronkramsey2010 made it possible for *k* to be treated as a variable, and the model will estimate the most likely values of *k* given a prior estimate and the data. The downside of this flexibility is that this calculation can greatly increase the convergence time of the model.
Oxcal is written in C++, with an interface in R [@oxcAAR].
Oxcal does not typically calculate posterior ensembles for a depth sequence, but can optionally output MCMC posteriors at specified levels in the sequence.
geoChronR uses this feature to extract ensemble members for subsequent analysis.


```{r runOxcal,cache=TRUE}
tana <- runOxcal(tana,model.num = 3,
                 lab.id.var = 'LabID',
                 age.14c.var = 'age14C',
                 age.14c.uncertainty.var = 'age14CUnc',
                 age.var = 'age',
                 age.uncertainty.var = '1SD',
                 depth.var = 'depth',
                 reservoir.age.14c.var = NULL,
                 reservoir.age.14c.uncertainty.var = NULL,
                 rejected.ages.var = NULL,
                 events.per.unit.length = .05,
                 depth.interval = 20)
```
```{r}
plotChronEns(tana,model.num = 3,truncate.dist = .0001) + ggtitle("Tana Lake - Oxcal model")
```

### Let's compare these models.
First, lets use `selectData()` to pull the depth and ageEnsemble variables for each model. The `selectData()` function is introduced in section \ref{#selectData} .


```{r,warning=FALSE,results = 'hide',message = FALSE}
ensBacon <- selectData(tana,
                       var.name = "ageEnsemble",
                       paleo.or.chron = "chronData",
                       model.num = 1,
                       table.type = "ensemble")

depthBacon <- selectData(tana,
                       var.name = "depth",
                       paleo.or.chron = "chronData",
                       model.num = 1,
                       table.type = "ensemble")

ensBchron <- selectData(tana,
                       var.name = "ageEnsemble",
                       paleo.or.chron = "chronData",
                       model.num = 2,
                       table.type = "ensemble")

depthBchron <- selectData(tana,
                       var.name = "depth",
                       paleo.or.chron = "chronData",
                       model.num = 2,
                       table.type = "ensemble")

ensOxcal <- selectData(tana,
                       var.name = "ageEnsemble",
                       paleo.or.chron = "chronData",
                       model.num = 3,
                       table.type = "ensemble")

depthOxcal <- selectData(tana,
                       var.name = "depth",
                       paleo.or.chron = "chronData",
                       model.num = 3,
                       table.type = "ensemble")

```


Now that we have all the data extracted, we can use the `plotTimeseriesEnsRibbons()` function to plot each of the modeled age-depth relationships and their uncertainties. We will use the `magrittr` "pipe" function `%>%` to pass the output of one plot into the next to build up a complex figure. We'll also use different colors and transparencies so we can distinguish the different models.

```{r,message=FALSE}
plotTimeseriesEnsRibbons(X = ensBacon,Y = depthBacon) %>%
  plotTimeseriesEnsRibbons(X = ensBchron,Y = depthBchron,
                           alp = .7,
                           color.high = "DarkGreen",
                           color.line = "Green") %>%
plotTimeseriesEnsRibbons(X = ensOxcal,Y = depthOxcal,
                         alp = .7,
                         color.high = "DarkBlue",
                         color.line = "Blue") %>%
  plotModelDistributions(tana,add.to.plot = .) + #here we use the ggplot +
  scale_y_reverse()
```
All geoChronR plotting functions return ggplot2 objects, so we can modify the scale by adding a layer using `+` using the ggplot2 model.


:::: {.blackbox data-latex=""}
::: {.exercise #compareChronModels}
Where do the models agree? Where do they differ? Do you think one is better than the others?

<details>
  <summary>After you've answered, click for next step</summary>
   The OxCal model is considerably more flexible than the Bacon model, which leaves outliers off the main trend. If you wanted to make the OxCal model less flexible, which parameter(s) would you change? Alternatively, if you wanted to make the Bacon model more flexible, which parameter(s) would you change in the Bacon model?

   Try making a change to parameters in either Bacon or OxCal to make the models more similar (note, Bacon runs much faster, so I'd probably try that one first)

   Finally, how should you decide whether a more or less flexible model is better?
</details>
:::
::::


### Creating a multimodel ensemble

Sometimes, there are good reason to believe that because of it's design, or underlying assumptions, one model may be superior to the others, in which case you should choose that model. However, frequently, it's unclear which model to choose, or to objectively pick on model over another. In this case, you might want to create a multimodel ensemble that incorporates model structural uncertainty into your uncertainty structure. This is pretty straightforward in geoChronR.

Here, we'll create a fourth model that combines these three into a "Grand Ensemble" using `createMultiModelEnsemble`

```{r}
tana <- createMultiModelEnsemble(tana,
                                 models.to.combine = 1:3,
                                 depth.interval =10,
                                 n.ens = 1000)
```
:::: {.blackbox data-latex=""}
::: {.exercise #plotGrandEnsemble}
Use plotChronEns() and plotModelDistributions() to visualize your multi-model age model.
<details>
<summary>Hint #1</summary>
  First plot the chronEns, then use the "add.to.plot" parameter to add in the distributions.
</details>
<details>
  <summary>Hint #2</summary>
  Something with this structure is what you're looking for
```{r,eval = FALSE}
plotChronEns() %>% plotModelDistributions()
```
</details>
</details>
:::
::::


:::: {.blackbox data-latex=""}
::: {.exercise #exploreMultiModelEnsemble}
Now that you've got plotting working, try changing the choices made in createMultiModelEnsemble. Specifically, what is the impact of changing, depth.interval, n.ens, or depth.sequence? Use the documentation for help!
:::
::::


:::: {.blackbox data-latex=""}
::: {.exercise #plotAllFour}
Add your final multi model ensemble to the figure that showed the three original age models above. Does it look like a combination of the three?
:::
::::


### Mapping the age ensemble to the paleoData measurements

Great, our LiPD file now has an age ensemble (actually 4 age ensembles!) that we can use in subsequent analysis. We could write out our LiPD file right now using `lipdR::writeLipd(tana)`, for future work, or share with a colleague, and when we load it back in, all of our ensembles will be there, ready to go!

But for now, let's think about the next step in our analysis. We want to look at our paleoenvironmental data in the context of the age uncertainties. So let's take a look at the paleoData!

```{r}
#First, create a tibble from the paleoata
paleo <- extractTs(tana) %>% ts2tibble()

#Now you can explore that much more easily - here are all the variable names in all the measurementTables in the paleoData.
paleo$paleoData_variableName
```
 It looks like depth in this dataset is "Composite_depth", and the median age vector is here, but the age ensemble is not! Why not? Well, the ensemble chronology in a model may or may not have values corresponding to the paleoclimatic or paleoenvironmental measurements in paleoData. Each of our models have different depth scales, and they're all different than our paleoclimate data. So we need to "map" the model ensemble values to a measurement table in paleoData, so we can estimate the age uncertainty on each value. To do this we use the `mapAgeEnsembleToPaleoData()` function.


```{r , results = 'hide'}
tana <- mapAgeEnsembleToPaleoData(tana,
                                  age.var = "ageEnsemble",
                                  model.num = 4,
                                  paleo.depth.var = "Composite_depth",
                                  paleo.meas.table.num = 1)
```
Now let's look at the paleoData again:

```{r}
paleo <- extractTs(tana) %>% ts2tibble()

paleo$paleoData_variableName
```

Great, now we have an ageEnsemble variable in our paleoData (and our tibble!)


### Creating a timeseries plot as a spaghetti plot of lines

Let's visualize the reconstructed temperature with age uncertainties.

First, we'll use `selectData()` again to get our mapped ensemble and temperature data:
```{r}
tana.ae <- selectData(tana,var.name = "ageEnsemble",meas.table.num = 1)
tana.temp <- selectData(tana,var.name = "temperature",meas.table.num = 1)
```

OK, we're ready to plot it. There are a few ways to visualize ensemble data. The simplest is to just plot multiple instances of the line. Here we will just plot the temperature data against 50 random ensemble members.
```{r,results = 'hide',warning = FALSE,message = FALSE}
tana.ts.plot <-  plotTimeseriesEnsLines(X = tana.ae,Y = tana.temp,alp = 0.05,n.ens.plot = 50,color = "blue")
print(tana.ts.plot)
```


:::: {.blackbox data-latex=""}
::: {.exercise #plotTimeseriesEnsLines}
plotTimeseriesEnsLines() has options that control the output. Take a look at the documentation, and then change the following parameters, and understand how that affects the output:

  a. alp
  b. color (What does "Blues" or "Set2" do? How does it work)
  c. n.ens.plot
  d. Change the limits of the plot to only show the Holocene (~12,000-0 yr BP)
  <details>
  <summary>Hint</summary>
  To change the limits, you'll use the xlim() or scale_x_reverse() functions from ggplot2
  </details>
:::
::::


### Creating a timeseries plot with a ribbon confidence intervals

We can also plot this as a ribbon plot of quantiles

```{r,results = 'hide',warning=FALSE,message = FALSE}
tana.ribbon.plot <- plotTimeseriesEnsRibbons(X = tana.ae,Y = tana.temp)
print(tana.ribbon.plot)
```


:::: {.blackbox data-latex=""}
::: {.exercise #plotTimeseriesEnsLines2}
plotTimeseriesEnsRibbons () has many more options that control the output. Take a look at the documentation for that function, and then change the following parameters, and understand how that affects the output:

  a. probs
  b. color.high
  c. n.bins
  d. export.quantiles
  e. limit.outliers.x
:::
::::

### Combining the two kinds of timeseries plots

Each of these approaches to age-uncertain timeseries visualization is valuable - ribbons show case the probability distributions of the data through time, given the ensembles, but tend to smooth out real variability recorded by the data. Plotting an ensemble of lines highlights this variability, but gets messy with many lines, or doesn't showcase the true range of uncertainty with a few lines.

At different times, either approach can be right, but our favorite, standard approach is to do both, using the `add.to.plot` option we used above.


:::: {.blackbox data-latex=""}
::: {.exercise #ribbonsAndLines}
Use plotTimeseriesEnsRibbons(), plotTimeseriesEnsLines(), and the "add.to.plot" parameter to create a figure that shows the uncertainty ribbons in the background, and a with 5 ensemble members to highlight the variability of the data.
<details>
  <summary>Hint</summary>
  Something with this structure is what you're looking for
```{r,eval = FALSE}
plotTimeseriesEnsRibbons() %>% plotTimeseriesEnsLines()
```
</details>
:::
::::


## Banded Age Modelling
Thus far, we've explored methods to derive, transfer, and visualize age models derived from age-depth observations (mostly radiocarbon data). geoChronR also includes another type of age modelling algorithm, that estimates uncertainties based on miscounting rates in layer-counted archives. Consequently, BAM does not require, or even consider, depth in its uncertainty quantification. Although BAM is intended for layer-counted archives, like corals or varves, it can be useful as a rough approximation of age-uncertainty in non-banded timeseries, where the original geochronologic data needed to create a proper age model are not available. We'll explore this with Tana Lake dataset and see how it compares.


@BAM is a probabilistic model of age errors in layer-counted chronologies.
The model allows a flexible parametric representation of such errors (either as Poisson or Bernoulli processes), and separately considers the possibility of double-counting or missing a band.
The model is parameterized in terms of the error rates associated with each event, which are intuitive parameters to geoscientists, and may be estimated via replication [@DeLong_Paleo3_2013].
In cases where such rates can be estimated from the data alone, an optimization principle may be used to identify a more likely age model when a high-frequency common signal can be used as a clock [@BAM].
As of now, BAM does not consider uncertainties about such parameters, representing a weakness of the method.
BAM was coded in MATLAB, Python and R, and it is this latter version that geoChronR uses.


```{r}
tana <- runBam(tana,
               paleo.meas.table.num = 1,
               n.ens = 1000,
               model.num = 5,
               make.new = TRUE,
               ens.table.number = 1,
               model = list(name = "poisson",
                            param = 0.05,
                            resize = 0,
                            ns = 1000))
```


```{r}
tana.ye <- selectData(tana,var.name = "yearEnsemble",meas.table.num = 1)
tana.ae.bam <- convertAD2BP(tana.ye)

tana.ribbon.plot.bam <- plotTimeseriesEnsRibbons(X = tana.ae.bam,Y = tana.temp)

#we can compare this to the original age model supplied by the paper (which used the Heegaard et al., 2005 model, so a whole other approach)

tana.orig.age <- selectData(tana,var.name = "age",meas.table.num = 1)

tana.ribbon.plot.bam <- tana.ribbon.plot.bam +
  geom_line(aes(x = tana.orig.age$values, y = tana.temp$values),color = "red")

tana.ribbon.plot.bam

```

```{r}
library(egg)
ggarrange(plots = list(tana.ribbon.plot + xlim(c(15000,0)) + ggtitle("Temperature on Multimodel age model"),
                            tana.ribbon.plot.bam + xlim(c(15000,0)) + ggtitle("Temperature on BAM")),
               nrow = 2)


```


## Chapter project

This chapter covers a range of age modelling approaches, how to implement them in geoChronR, how to propagate the uncertainties to data of interest, and visualization of all of the above. Now it's time to test what you've learned!

In addition to a temperature reconstruction, the Tana Lake dataset includes \delta$^{18}$O from leaf waxes. The data are measured on a different samples at different resolution, and so are stored in a different measurement table. So your project is:


:::: {.blackbox data-latex=""}
::: {.exercise #ageModelProject}
Using tools learned in this chapter, create ensemble timeseries figures for \delta$^{18}$O leaf wax from Tana Lake using, Bacon, Bchron, OxCal and BAM. Arrange these four figures vertically into a 4 panel figure with constant x-axes to allow comparison. How does the choice of age modelling algorithm impact the record? Which parts of the dataset are most vulnerable to age uncertainties?
<details>
  <summary>Hint</summary>
Create plots for each model separately, and save them as variables. Then use ggarrange to combine them.
</details>
:::
::::