ColbyForecasting/C02_background.qmd at main · BigelowLab/ColbyForecasting · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "Background"
format: html
---

> No choice is the wrong choice as long as you make a choice. The only wrong choice is choosing not to make one. ~ [Jake Abel](https://en.wikipedia.org/wiki/Jack_Abel)

Traditional ecological surveys are systematic, for a given species survey data sets tell us where the species is found and where it is absent.  Using an observational data set (like [OBIS](https://obis.org))  we only know where the species is found, which leaves us guessing about where they might not be found. This difference is what distinguishes a *presence-abscence* data set from a *presence-only* data set, and this difference guides the modeling process.

When we model species distributions we are trying to define the environments where we should expect to find a species as well as the environments we would not expect to find a species. With OBIS data we have in hand the locations of observations, and we can extract the environmental data at those locations.  To characterize the unoccupied environments we are going to have to sample what is called "background" (aka "background points" and "pseudo-absences".)

We assume that we want the number of background points to be roughoy balanced with the number of observations. What balance means is open to interpretation, but if we have 100 observations then we would like to have 50 to 200 background points in order to be balanced.  One can spend a lot of time of time finding the most appropriate balance.

We want these background samples to roughly match the regional preferences of the observations; that is we want to avoid having observations that are mostly over Georges Bank while our background samples are primarily around the Bay of Fundy. We want there to be some reasonable proximity between occupied and unoccupied environments.

Here we want to satisfy a couple of basic requirements...

+ balance the number of observations and background points (whatever balance means),

+ avoid selecting background points that coincide with observations,

+ choose background points where we can measure the same environmental characteristics that we can measure for observations.

Keep in mind we will be glossing over important details; there is much more to investigate here, but *tempus fugit* and our course of study is brief.  At the end we want to have in hand a set of locations that can be a companion for the observations.

# Setup

As always, we start by running our setup function. Start RStudio/R, and reload your project with the menu `File > Recent Projects`.

```{r setup}
source("setup.R")
```

We also will need the Brickman mask and the observation data. Note that we are making a model that works for every month, so we shall select background points by month. Given what we have learned about the distribution of observations over the months of year this may be surprising; some months have many more observations than other months. Won't months with more data be "better informed" than months with fewer data? That is possible, but we have to start somewhere so we'll be choosing a set number of background points for each month that balances the observations and background throughout the year rather than per month.

OK - load the data we need.

```{r load_obs_mask}
coast = read_coastline()
obs = read_observations(scientificname = "Mola mola")
db = brickman_database() |>
  filter(scenario == "STATIC", var == "mask")
mask = read_brickman(db)
```

Let's start with a simple plot of the observations per month. We'll make an array of maps with one constructed per month.  We'll use the [ggplot2 plotting system](https://ggplot2.tidyverse.org/) which is a powerful system for layering components of a plot.

We will want to add a notation to each plot showing the number of points that month.  To make that table we use the `count()` function from the [dplyr R package](https://dplyr.tidyverse.org/). We also found the best location to print the count (through trial and error) so we define that, too.

```{r all_observations_count}
LON0 = -67
LAT0 = 46
all_counts = count(st_drop_geometry(obs), month) # counting is faster without spatial baggage
all_counts
```

Now we can construct the plot.  Note that layering of data, then the coast then the extra text.  Also, we use alpha (aka 'transparency' or 'opacity') blending to make the plotting symbols slightly transparent which reveals overlapping points nicely.

```{r all_observations}
ggplot() +
  geom_sf(data = obs, alpha = 0.2, shape = "circle small", size = 1) +
  geom_sf(data = coast, col = "orange") +
  geom_text(data = all_counts,
            mapping = aes(x = LON0,
                          y = LAT0,
                          label = sprintf("n: %i", .data$n)),
                          size = 3) +
  labs(x = "Longitude", y = "Latitude", title = "All observations") +
  facet_wrap(~month)
```

Some of those points are densely distributed depending upon the month.  In fact, it may be that we have multiple points in each raster cell in our Brickman data.  We have a function to help use reduce "thin" the observations so that we have just one observation per cell.

## Thinning observations

Thinning is the process we use to sub-sample observations so that they are more evenly sampled across the spatial domain. We do this to discourage accidental sampling bias. The [tidysdm package]() provides a nice function, `thin_by_cell()`, that "thins" by allowing only one observation per cell in the Brickman data.  We'll iterate through each month

```{r thin_observations}
thinned_obs = sapply(month.abb,
               function(mon){
                 temp_x = obs |> filter(month == mon)
                 if(nrow(temp_x) == 0) return(NULL)
                 thin_by_cell(temp_x, mask)
               }, simplify = FALSE) |>
  dplyr::bind_rows()

# another count
thinned_counts = count(st_drop_geometry(thinned_obs), month)

ggplot() +
  geom_sf(data = thinned_obs,
          alpha = 0.2,
          shape = "circle small",
          size = 1) +
  geom_sf(data = coast, col = "orange") +
  geom_text(data = thinned_counts,
            mapping = aes(x = LON0,
                          y = LAT0,
                          label = sprintf("n: %i", .data$n)),
                          size = 3) +
  labs(x = "Longitude", y = "Latitude", title = "Thinned observations") +
  facet_wrap(~month)
```

## Weighted sampling

We now have thinned the observations to just `r nrow(thinned_obs)` records compared to the original `r nrow(obs)` observations. So that get us to no more than 1 observation to cell.  But do we loose the importance of clustered observations?  Isn't meaningful ecologically that the species is found in clusters?  The answers to those are **yes** and **yes**.

Uh oh, so how do we retain that clustering importance?  The answer is to build a sampling bias map using the original observations that influences where we sample.  Let's build a bias map by counting the number of observations in each cell - we'll use the original observations for the entire year (regardless of month.)

```{r bias_map}
bias_map = rasterize_point_density(obs, mask) # <-- note the original observations

ggplot() +
  geom_stars(data = bias_map, aes(fill = count)) +
  scale_fill_viridis_b(na.value = "transparent") +
  geom_sf(data = coast, col = "orange") +
  labs(x = "Longitude", y = "Latitude", title = "Bias map using all observations")
```
Depending upon your distribution, the clustering may be more or less apparent.  For *Mola mola* it's dominated by a wide even spread with a few "peak" spots.

## Randomly sample background points

Now that we have a reasonable set of observation and a sampling bias map, we now need to choose points using those to represent the background.  The `sample_background()` function requires three input arguments: the set of observations, the raster array and the number of points desired.  We can also ask for the presence observations to be returned along with the background points to keep things tidy.  Since we are doing this by month, we'll modify the output by adding a column identifying the month.

#### But just how many background points?

But how many background points should we get for each month? Should be try balance by month?  Or balance by taking the average for all months.  Let's do the latter, but it might be worth exploring that further.

```{r n_background_per_month}
nback_avg = mean(all_counts$n) |>
  round()
nback_avg
```

So, here we go.  We shall use the [`sapply` iterator](https://bigelowlab.github.io/handytandy/iterations.html) to step through each month. For each month we'll use the thinned observations and the annual bias map to sample for background. At the end we'll have a list of monthly tables which we'll bind into one giant table, and we'll refresh the month column to be a factor (of monthly groupings of course!)

```{r sample_background}
obsbkg = sapply(month.abb,
    function(mon){
      temp_x = thinned_obs |> filter(month == mon)
      if (nrow(temp_x) == 0) return(NULL)
      sample_background(temp_x, # <- just this month
                       bias_map,
                       method = "bias",  # <-- it needs to know it's a bias map
                       return_pres = TRUE, # <-- give me the obs back, too
                       n = nback_avg) |>   # <-- how many points
        mutate(month = mon, .before = 1)
    }, simplify = FALSE) |>
  bind_rows() |>
  mutate(month = factor(month, levels = month.abb))
obsbkg
```

A quick tally of the class by month confirms that our `nback_avg` of `r nback_avg` points is used consistently every month.

```{r nback_avg}
# drop the spatial baggage to speed the tally
count(st_drop_geometry(obsbkg), month, class)
```

Let's see where these presence/background points fall relative to each other.

```{r plot_presence_background}
ggplot() +
  geom_sf(data = obsbkg,
          mapping = aes(col = class),
          alpha =  0.4, shape = "circle small", size = 1) +
  geom_sf(data = coast, col = "orange")  +
  labs(x = "Longitude", y = "Latitude", title = "All") +
  theme_bw() +  # <- make a simple white background
  scale_fill_okabe_ito() +  # <-- colorblind friendly for N Record
  facet_wrap(~month)

```

These would be the data locations we feed into the model.  So is that reasonable solution? At first glance it seems to be, so we'll choose this background sampling approach; thinning with biased sampling. We can't know if another approach might be better until we actually start modeling.

Let's save this.

```{r save_model_input}
write_model_input(obsbkg, scientificname = "Mola mola")
```

Note that there is a function for reading model input files, too.

```{r read_model_input}
x = read_model_input(scientificname = "Mola mola")
```


# Recap

We have prepared what we call "model inputs", in particular for *Mola mola*, by thinning observations and selecting background points.  There are lots of other approaches to creating the model input, but for the sake of learning we'll stick with this approach.  We developed a function that will produce our model inputs for a given month, and saved them to disk.

# Coding Assignment

::: {.callout-note appearance="simple"}
Use the menu option File > New File > R Script to create a blank file. Save the file (even though it is empty) in the “assignment” directory as “assignment_script_2.R”. Use this file to build a script that meets the following challenge. Note that the existing file, “assignment_script_0.R” is already there as an example.

For each month select at random one presence and one background point (so, that will be 2 x 12 = 24 points!) from your model input data.  Then select three (3) variables in the Brickman present monthly data set, and build a single table that has the three variables for the 24 points.  This is **very similar** to the [C00 coding chapter homework](https://bigelowlab.github.io/ColbyForecasting/C00_coding.html#coding-assignment).

Hint!  Check out [iterating over grouped tables](https://bigelowlab.github.io/handytandy/iterations.html#groups-in-a-table) and the [`slice_sample` function](https://dplyr.tidyverse.org/reference/slice.html).

Here's what our table looks like for the *Mola mola* species.

```
Simple feature collection with 24 features and 7 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -72.99962 ymin: 39.07928 xmax: -65.51362 ymax: 43.89779
Geodetic CRS:  WGS 84
# A tibble: 24 × 8
   month class                      geom point  depth   SSS   SST  Tbtm
 * <fct> <fct>               <POINT [°]> <chr>  <dbl> <dbl> <dbl> <dbl>
 1 Jan   presence   (-72.05151 39.07928) p1    2116.   33.3 10.9   3.72
 2 Jan   background (-67.48795 40.36045) p2     789.   31.7  6.65  4.67
 3 Feb   presence   (-72.00363 39.67702) p1     231.   31.9  5.73  8.64
 4 Feb   background (-69.79134 40.60724) p2      60.4  31.2  3.58  3.89
 5 Mar   presence       (-72.056 39.205) p1    1362.   32.4  6.98  4.13
 6 Mar   background (-67.73474 42.66383) p2     194.   30.9  1.88  7.58
 7 Apr   presence      (-72.263 39.4379) p1     386.   31.7  7.08  6.44
 8 Apr   background  (-72.8351 39.20875) p2     102.   31.5  6.74 10.3
 9 May   presence          (-66.3 41.25) p1     281.   31.2  8.03  7.12
10 May   background (-71.43661 39.62007) p2    1867.   31.9 11.4   3.78
# ℹ 14 more rows
# ℹ Use `print(n = ...)` to see more rows
```

:::