---
title: "Data Sources and Processing"
output: html_document
---
We begin by loading the required packages. Note that `raster` masks `dplyr::select`, which is why `select` is called with an explicit namespace throughout this document.
```{r}
library(dplyr)
library(tidyverse)
library(stringr)
library(raster)
library(sf)
library(terra)
library(geosphere)
library(leaflet)
library(leaflet.extras2)
```
## Elk Data Source
Our main data set is the Elk GPS collar data from the National Elk Refuge (2006-2015), published by the Northern Rocky Mountain Science Center in 2018. The data follows 17 adult female elk, captured in the National Elk Refuge, on their migration to Yellowstone National Park. Elk that did not migrate to Yellowstone National Park were not included in the data set. The data is available [here](https://www.usgs.gov/data/elk-gps-collar-data-national-elk-refuge-2006-2015).
## Cleaning Elk Data
The elk data contains a unique ID for each of the 17 elk, a date-time variable, and coordinates. To clean the elk data, we drop the `tz` timezone variable because it is homogeneous, and we drop the `utm_x` and `utm_y` variables because they duplicate the location information we already have as latitude and longitude. We create the `dist_km` variable, which measures the distance traveled in kilometers between consecutive observations using the `distHaversine` function from the `geosphere` package. The Haversine distance is the length of the shortest path between two points on a sphere.
```{r}
elk =
read_csv('raw_data/Elk GPS collar data from National Elk Refuge 2006-2015.csv') |>
janitor::clean_names() |>
  dplyr::mutate(
    day = day(dt),
    hour = hour(dt),
    # distance from the previous fix; NA at each elk's first observation
    dist_km =
      ifelse(
        elk_id == lag(elk_id),
        geosphere::distHaversine(cbind(long, lat), cbind(lag(long), lag(lat))) / 1000,
        NA)
  ) |>
dplyr::select(
elk_id,
dt,
year,
month,
day,
hour,
lat,
long,
dist_km
)
```
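For intuition, here is a minimal base-R sketch of the Haversine formula. `geosphere::distHaversine` implements the same calculation (with its own default Earth radius); the radius below is the common 6371 km approximation, used here only for illustration:

```{r}
# great-circle (Haversine) distance in meters between two (lat, lon) points
haversine_m = function(lat1, lon1, lat2, lon2, r = 6371000) {
  to_rad = pi / 180
  dlat = (lat2 - lat1) * to_rad
  dlon = (lon2 - lon1) * to_rad
  a = sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))
}

# one degree of longitude at the equator is roughly 111 km
haversine_m(0, 0, 0, 1) / 1000
```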
## Land Cover Cleaning
Our second data set is the land cover data for the state of Wyoming, published by the GAP Analysis Project in 2019. This data set uses satellite imagery from 2011 to classify land cover at a thirty-meter resolution. While our elk migration data ranges from 2006 to 2015, we make the assumption that land coverage in the form of water or vegetation stays approximately constant over time. The data is available [here](https://www.usgs.gov/programs/gap-analysis-project/science/land-cover-data-download).
We use the [terra package](https://bookdown.org/mcwimberly/gdswr-book/coordinate-reference-systems.html#reprojecting-raster-data) to import and transform the data into latitude and longitude coordinates. We save a cleaned, cropped version of the land cover data. For the purpose of this website, we won't evaluate the code chunk below.
```{r eval=FALSE}
land = rast('raw_data/land_cover/land_cover.tif')
# Reprojecting in latitude and longitude
land_coord = project(land, "EPSG:4326")
plot(land_coord)
# Subset to the relevant area
min_long = elk |> pull(long) |> min()
max_long = elk |> pull(long) |> max()
rng_long = abs(min_long - max_long)
lowerleftlon = min_long - 0.1 * rng_long
upperrightlon = max_long + 0.1 * rng_long
min_lat = elk |> pull(lat) |> min()
max_lat = elk |> pull(lat) |> max()
rng_lat = abs(min_lat - max_lat)
lowerleftlat = min_lat - 0.1 * rng_lat
upperrightlat = max_lat + 0.1 * rng_lat
# crop to the padded bounding box (terra::ext builds a SpatExtent)
small_land_coord = crop(
  land_coord,
  ext(lowerleftlon, upperrightlon, lowerleftlat, upperrightlat))
plot(small_land_coord)
# writing raster
terra::writeRaster(small_land_coord, 'clean_data/land_cover.tif', overwrite=TRUE)
```
Using terra's `extract` function, we look up the land cover class at the points relevant to the analysis, i.e., the elk locations.
```{r}
small_land_coord = rast('clean_data/land_cover.tif')
temp_elk =
  elk |>
  dplyr::select(
    longitude = long,
    latitude = lat
  )
elk_land_cover = terra::extract(x = small_land_coord, y = temp_elk)
```
We add the extracted land cover to the elk data frame and save the result.
```{r}
elk =
elk |>
mutate(land_cover = elk_land_cover$land_cover)
elk |> write.csv('clean_data/elk.csv', row.names = FALSE)
```
## Cleaning Water Quality Data
The water quality data is sourced from the Greater Yellowstone Network and published by the National Park Service. It contains readings on a variety of water quality measures, including mineral composition, flow speed, and temperature. The two locations relevant to this analysis, GRTE_SNR01 and GRTE_SNR02, lie along the elk migration route. The data is available [here](https://catalog.data.gov/dataset/greater-yellowstone-network-published-water-quality-data-through-2023-from-the-bicawq01-g-).
### Locations
This data set lists the locations where water was sampled. We drop the `org_code` variable because it is homogeneous.
```{r}
water_quality_locations =
read_csv('raw_data/water_quality/Locations.csv')|>
janitor::clean_names() |>
dplyr::select(
location_id,
location_name,
park_code,
location_type,
latitude,
longitude
)
```
### Results
Here we read in the results of the water quality sampling.
```{r}
water_quality_results =
read_csv('raw_data/water_quality/Results.csv')|>
janitor::clean_names()
```
We identify the most frequently measured characteristics: those with more than 875 non-missing readings, excluding free-text weather comments.
```{r}
common_obs =
water_quality_results |>
drop_na(result_text) |>
group_by(characteristic_name) |>
summarize(n = n())|>
arrange(desc(n)) |>
filter(
n > 875,
characteristic_name != "Weather Comments (text)") |>
pull(characteristic_name)
```
We filter for the most common characteristics and for readings we can use, given in the `acceptable_readings` variable. We replace non-detected values with zero and select the relevant columns.
```{r}
acceptable_readings = c("Detected and Quantified", "Not Detected", "Present Below Quantification Limit")
water_quality_results=
water_quality_results |>
filter(
characteristic_name %in% common_obs,
result_detection_condition %in% acceptable_readings) |>
mutate(
result_text = stringr::str_replace(result_text, "NULL", "0"),
result_unit = stringr::str_replace(result_unit, "None", ""),
characteristic_name = paste0(characteristic_name, " ", result_unit) |> trimws(),
year = year(activity_start_date),
month = month(activity_start_date),
day = day(activity_start_date)
) |>
dplyr::select(
location_id,
activity_id,
activity_type,
activity_start_date,
year,
month,
day,
characteristic_name,
result_text
)
```
### Combining Water Data
We aggregate the data at the month level by taking the minimum, mean, and maximum readings. We filter for the GRTE_SNR01 and GRTE_SNR02 locations, the two locations relevant to our analysis. We save the water quality data in filtered, monthly-aggregated, and wide formats for ease of use.
```{r}
water_quality =
water_quality_locations |>
left_join(water_quality_results) |>
filter(location_id %in% c('GRTE_SNR01', 'GRTE_SNR02'))
water_quality2 =
  water_quality |>
  mutate(
    # recode flow stage as an ordered numeric scale; 'ABOVE NORMAL' must be
    # replaced before 'NORMAL' so the longer label is not partially matched
    result_text = stringr::str_replace(result_text, 'LOW', '1'),
    result_text = stringr::str_replace(result_text, 'ABOVE NORMAL', '3'),
    result_text = stringr::str_replace(result_text, 'NORMAL', '2'),
    result_text = stringr::str_replace(result_text, 'FLOOD', '4'),
  ) |>
  group_by(
    location_id,
    location_name,
    year,
    month,
    characteristic_name
  ) |>
  summarize(
    monthly_mean = mean(as.numeric(result_text), na.rm = TRUE),
    monthly_min = min(as.numeric(result_text), na.rm = TRUE),
    monthly_max = max(as.numeric(result_text), na.rm = TRUE)
  )
water_quality |> write.csv('clean_data/water_quality.csv', row.names = FALSE)
water_quality2 |> write.csv('clean_data/water_quality2.csv', row.names = FALSE)
clean_water =
  water_quality2 |>
  pivot_wider(
    names_from = characteristic_name,
    values_from = c('monthly_mean', 'monthly_min', 'monthly_max')
  )
write.csv(clean_water, 'clean_data/clean_water.csv', row.names = FALSE)
```
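The order of the flow-stage replacements above matters: a label that is a substring of a longer label must be replaced after it. A quick base-R illustration with `sub`, which, like `str_replace`, replaces the first match:

```{r}
# replacing 'NORMAL' before 'ABOVE NORMAL' would mangle the longer label
sub("NORMAL", "2", "ABOVE NORMAL")        # leaves "ABOVE 2"
sub("ABOVE NORMAL", "3", "ABOVE NORMAL")  # correct: "3"
```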
## Reading in Weather Data
To explore the relationship between temperature, precipitation, and migration, we use the Global Historical Climatology Network's data. The data is available [here](https://www.ncei.noaa.gov/products/land-based-station/global-historical-climatology-network-daily)
We read in the data, filter for the four stations closest to the elk range, restrict to the relevant date range, and average the temperature, precipitation, and snow readings across the stations, since we only have weather data at four points in the elk's general area.
```{r}
# closest four stations
four_stations <-
c("SNAKE RIVER STATION, WY US", "MORAN 5 WNW, WY US", "BURRO HILL WYOMING, WY US", "MOOSE 1 NNE, WY US")
weather = read_csv("raw_data/raw_weather_data.csv") |>
janitor::clean_names() |>
filter(
name %in% four_stations,
date >= '2006-03-01',
date <= '2015-08-25') |>
group_by(date) |>
summarize(
tavg = mean(tavg, na.rm = TRUE),
tmin = mean(tmin, na.rm = TRUE),
    tmax = mean(tmax, na.rm = TRUE),
prcp = mean(prcp, na.rm = TRUE),
snow = mean(snow, na.rm = TRUE),
snwd = mean(snwd, na.rm = TRUE),
) |>
mutate(
year = year(date),
month = month(date),
day = day(date)
)
weather |> write.csv('clean_data/weather.csv', row.names = FALSE)
```
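One caveat with `na.rm = TRUE` worth keeping in mind (a note on base R behavior, not something the pipeline above checks): on a day when every station is missing a reading, the mean of zero values is `NaN`, so downstream code should expect `NaN` alongside `NA`:

```{r}
mean(c(2.5, NA), na.rm = TRUE)             # some stations missing: 2.5
mean(c(NA_real_, NA_real_), na.rm = TRUE)  # all stations missing: NaN
```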
## Combining all data
We combine the data for ease of comparison. We begin with the elk data, then left join the weather data by date. We find the closest water quality location, then join on the relevant water quality data. We save the data.
```{r}
all_data =
  elk |>
  # the weather data joins on the shared year, month, and day columns
  left_join(weather |> dplyr::select(-date)) |>
  mutate(
    # distance in km to each of the two water quality sites, used to assign
    # each observation its nearest site
    dist_to_GRTE_SNR01 = distHaversine(cbind(long, lat), c(-110.6716, 44.10177)) / 1000,
    dist_to_GRTE_SNR02 = distHaversine(cbind(long, lat), c(-110.7159, 43.65261)) / 1000,
    location_id =
      ifelse(dist_to_GRTE_SNR01 < dist_to_GRTE_SNR02, 'GRTE_SNR01', 'GRTE_SNR02')
  ) |>
  left_join(clean_water) |>
  dplyr::select(-dist_to_GRTE_SNR01, -dist_to_GRTE_SNR02)
```
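The joins above rely on dplyr's default behavior when `by` is omitted: rows are matched on all column names the two data frames share. Base R's `merge` behaves the same way, which gives a minimal self-contained illustration (the toy values below are made up):

```{r}
# with no join keys specified, matching happens on the shared columns
# (year, month, day); elk_id and tavg come along from their own sides
x = data.frame(year = 2006, month = 3, day = 1, elk_id = 1)
y = data.frame(year = 2006, month = 3, day = 1, tavg = -4.2)
merge(x, y)
```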
```{r}
# quick check: observations that matched a water quality record
all_data |> drop_na(location_name)
```
```{r}
all_data |> write.csv('clean_data/all_data.csv', row.names = FALSE)
```