ml_mitigation/model.qmd at main · amalmuda/ml_mitigation · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
---
title: "A model to identify development aid agreements targeting climate change mitigation"
author: "Norad - Section for statistics and analysis"
format:
  html:
    code-fold: true
editor: source
execute:
  echo: true
  warning: false
  error: false
  enabled: false
---

<!-- To execute code chunks when rendering this document, switch to enabled: true in yaml -->

We build a model to identify development agreements supporting developing countries in reducing greenhouse gas emissions. The model is trained using supervised machine learning.

## Project dependencies

The `renv` package is used for reproducibility by bringing project-local R dependency management to the project. Call `renv::restore()` to restore the projects dependencies (R version and package version). Then we load the packages.

```{r}
#| label: load-packages

#renv::restore()

library(tidyverse)
library(tidymodels)
library(stopwords)
library(tidytext)
library(stringr)
library(textrecipes)
library(themis)
library(ranger)
library(doParallel)
library(here)
library(janitor)
library(knitr)
library(vetiver)
library(plumber)
library(renv)

```

## Load data

Norwegian development aid statistics (microdata) is publicly available to download [aidresults.no/microdata](https://resultater.norad.no/microdata). We select last five years (2013-2017). These data can also be downloaded by using [this link](https://resultater.norad.no/api/microdata?from_year=2011&to_year=2017&main_region_code=&country_iso_code=&agreement_partner_group_sid=&agreement_partner_sid=&target_area_code=&dac_main_sector_code=&dac_sub_sector_code=&chapter_code=&format=csv&language=en). The csv file is saved in the the *data* folder in the project directory and named *Norad-Norwegian_development_assistance.csv*.

Let's import the csv file. We use the data for the years 2013-2017.

```{r}
#| label: load-data

df_oda <- read_csv(here("data", "Norad-Norwegian_development_assistance.csv"))

df_oda <- df_oda |>
  filter(year %in% c(2013:2017))
```

## Selecting data

The development aid data contain information about the individual ddevelopment aid agreements. The unit of observations is agreement-year, and most agreements are multi-year agreements. The unique agreements can be identified using the variable *agreement_number*.

**Unique agreement observations to avoid data leakage into the testing set**

We want to avoid data leakage when training the model, meaning that we want to keep the training data separate from the testing data. We therefore chose to only include unique agreements (*agreement_numbers*) when training/testing the data. If not, statistical information for a multi-year agreement would be included in both the training set and the testing set, for example agreement1-year1 in the training set and agreement1-year2 in the testing set) and then overestimate the model performance on the testing set. The testing data would then not be really *new* data (agreements). We therefore include only one observation for each agreement in the training/testing data. For the time period 2013-2017 the total observations are X. However, when we only include unique agreement_numbers the number of unique agreement observations are redused to X.


**Initial data transformation: outcome variable and predictors**

-   We create a dichotomous outcome variable *mitigation* based on the existing variable *policy_marker_climate_change_mitigation*. The levels *2 (main objective)* and *1 (significant objective)* are collapsed to one level *Mitigation*.
-   We create a character variable containg the text information from two variables, *agreement_title* and *description_of_agreement* The character strings are collapsed into one.
-   We exclude non-english characters from all character variables. This is to avoid errors using special non-english characters, like spanish letters for excample in the variable *agreement_partner*.
-   We keep most of the variables in the dataset to be used as predictors in the model. We exclude all policy markers as these are often quality checked at the same time.

```{r}
#| label: data-wrangling

df_oda <- df_oda |>
  distinct(agreement_number, .keep_all = TRUE) |>
  mutate(mitigation = if_else(policy_marker_climate_change_mitigation == 0, "Not mitigation", "Mitigation")) |>
  mutate(title_desc = paste0(agreement_title, ". ", description_of_agreement)) |>
  mutate(across(where(is.character), ~str_replace_all(., pattern = "[^[\\da-zA-Z ]]", " "))) |>
  select(c(mitigation,
  #         agreement_number,
           title_desc,
           agreement_partner,
           group_of_agreement_partner,
           implementing_partner,
           main_sector,
           sub_sector,
           target_area,
           recipient_country,
           recipient_region,
           type_of_assistance,
           extending_agency,
           budget_post_chapter,
           budget_post_post
           )
         )


# df_oda |>
#   mutate_if(is.character, factor) |>
#   skimr::skim()

```

## Spending our data budget

![](figs/resampling.svg)

-   The data split into traing set and testing set. The split is stratified on the *mitigation* variable. This ensures that our training and test data sets will keep roughly the same proportions of *mitigation* and *not relevant* agreements as in the original data.
-   We create a set of 10 cross-validation resampling folds of the trainind data to evaluate the model. For each of the ten folds the model will train and evaluate.

```{r}
#| label: data-budget

set.seed(1)
oda_split <- df_oda |>
  initial_split(strata = mitigation)


oda_train <- training(oda_split)
oda_test <- testing(oda_split)

set.seed(1)
oda_folds <- vfold_cv(oda_train, strata = mitigation)

#oda_folds

```

## Recipe for preprocessing

Let's set up our recipe for preprocessing.

-   First, we specify the model. The *mitigation* as outcome and all other variables are predictors. We also specify the training data.
-   We upgrade the role of agreement_number to an id-variable. Instead of getting rid of the id we update the role. Instead of being a predictor or outcome variable, this is an ID, and not used in the model.

Preprocessing steps for the predictor variable *title_desc*:

-   We tokenize the *title_desc* variable using ordinary tokenization by splitting words by space. This creates a dummy variable for each token in the variable and count the presence of each token in each "document" (*title_desc*).
-   We remove stopwords to eliminate words that are so commonly used that they carry very little useful information, like *a*, *the* and *is*.
-   We don't keep all of the tokens, but keep the top 1000 used tokens.
-   We weight the token (word) counts by using TF-IDF (*Text Frequency-Inverse Document Frequency*) for each token. TF-IDF is a weight to measure the importance of a token in the document and corpus (collection of documents).

Other preprocessing steps:

-   Remember: All steps are executed on the training set, but steps with `skip = TRUE` will be ignored when `bake()` is invoked (under the hood in the model). We should therefore isolate any steps that involve the outcome and/or use the `skip = TRUE` argument. For example, see the separate steps and the use of skip() when converting the character variables to factors below using `step_string2factor()`.
-   We normalize all the numeric predictors as this is required in some models.
-   We convert all character variables to factors. Both output and predictor variables. This is a good habit, but not necessary for this model workflow. Remember to isolate any steps that involves the outcome variable. Therefore do this step separately for the predictors and the outcome variable. We also specify skip=TRUE for preprosessing on the outcome variable to ignore this step when bake() is invoked.
-   The variables *agreement_partner* and *implementing_partner* have a lot of factor levels. We collapse the less used partners in an *other*-level and specify a treshold for this level so that we keep only around 200 levels (partners).
-   We convert all factor or character variables to binary numeric variables (dummy variables).
-   We transform all the nominal predictors to factors by creating dummy variables. Also we specify that unseen factor levels will be assign with a new value. Also we deal with missing (unknown) data, and we remove factor levels with zero variance.
-   We handle the class imbalance in the outcome variable by using oversampling, to avoid poorly model perfonmanse on the minority class. We oversample the *mitigation* level in the outcome (*mitigation*) variable using the method *Synthetic Minority Over-sampling Technique (SMOTE)*.

```{r}
#| label: recipe

# Preprocessing recipe. Steps for fature engineering
oda_rec <- recipe(mitigation ~ ., data = oda_train) |>
  #update_role(agreement_number, new_role = "ID") |>
  step_tokenize(title_desc) |>
  step_stopwords(title_desc) |>
  step_tokenfilter(title_desc, max_tokens = 1e3, min_times = 10) |>
  step_tfidf(title_desc) |>
  step_normalize(all_numeric_predictors()) |>
  step_string2factor(all_nominal_predictors()) |>
  step_string2factor(mitigation, skip = TRUE) |>
  step_other(agreement_partner, implementing_partner, threshold = 0.001) |>
  step_novel(all_nominal_predictors()) |>
  step_unknown(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_smote(mitigation)

#oda_rec

```

Let's have a look at the training data after these preprocessing steps. The `recipe()` defines the preprocessing, the `prep()` calculates statistics from the training set, and `bake()` applies the preprocessing to data sets. These preprocessing steps are applied under the hood in the model, but it's useful to have a look to at the preprocessing.

```{r}
#| label: baked

# df_train_baked <- prep(oda_rec) |> bake(new_data = NULL)
#
# glimpse(df_train_baked)

```

Let's create a model specification for the model(s) we want to try. We are specifying a a) random forest model withouth hyperparameter tuning, and b) random forest model with hyperparameter tuning using a tuning grid.

In the rest of the document both of these models are included in the code, but only the random forest model workflow with hyperparameter tuning is extracted and saved as the best final fitted model workflow.

```{r}
#| label: model-specs

# Random forest (without hyperparameter tuning)
# rf_spec <- rand_forest(trees = 500) |>
#   set_mode("classification") |>
#   set_engine("ranger")

# Random forest (with hyperparameter tuning)
rf_spec_tune <-
  rand_forest(trees = 500,
              mtry = tune(),
              min_n = tune()
              ) |>
  set_mode("classification") |>
  set_engine("ranger")
```

![](figs/random_forest.png)

## Model workflow

We build a model workflow where we put both the data recipe for preprocessing and the model specification.

```{r}
#| label: model-workflow

# Random forest (without hyperparameter tuning)
# rf_wf <- workflow() |>
#   add_recipe(oda_rec) |>
#   add_model(rf_spec)

# Random forest tune hyperparameters
rf_wf_tune <- workflow() |>
  add_recipe(oda_rec) |>
  add_model(rf_spec_tune)
```

## Fit a model

We fit the model workflow (preprocessing and model) on the training set using resampling.

```{r}
#| label: resampling-results

# Fit a random forest model (without tuning hyperparameters)

# doParallel::registerDoParallel()
#
# set.seed(1)
# rf_res <- fit_resamples(
#   rf_wf,
#   resamples = oda_folds,
#   metrics = metric_set(accuracy, recall, precision, roc_auc, sens, spec),
#   control = control_resamples(save_pred = TRUE)
# )

# Fit a random forest model with hyperparameters

start_time <- Sys.time()

doParallel::registerDoParallel()

set.seed(1)
rf_tune_res <-
  tune_grid(
    rf_wf_tune,
    resamples = oda_folds,
    metrics = metric_set(accuracy, recall, precision),
    control = control_resamples(save_pred = TRUE),
    grid = 8
    )

end_time <- Sys.time()
time <- end_time - start_time
```

## Evaluate performance

How did the model perform? Let's have a look at the resampling performance metrics. We select the model with the best hyperparameters.

```{r}
#| label: resampling-metrics

# Random forest (without hyperparameter tuning)

# rf_res_metrics <- collect_metrics(rf_res)
#
# rf_res_truefalse <- rf_res |>
#   conf_mat_resampled(tidy = FALSE)
#
# rf_res_autoplot <- rf_res_truefalse |>
#   autoplot()

# Random forest (with hyperparameter tuning)
collect_metrics(rf_tune_res)

best_accuracy <- select_best(rf_tune_res, "accuracy")
```

## Finalizing our model

Finally, let's make a final workflow, and then fit and evaluate the model one last time. We use the function `last_fit()` to fit the final model on the full training data set and evaluates the finalized model on the testing data set. We just need to give this funtion our original train/test data split (*oda_split*). This is the first time we have used the testing data. The purpose of the testing data is to estimate the model performance we expect to see with new data.

```{r}
#| label: final-model

# Fit final model (without tuning hyperparameters)
# final_res <- last_fit(
#   rf_wf,
#   split = oda_split,
#   metrics = metric_set(accuracy, recall, precision, roc_auc, sens, spec)
# )
#
# collect_metrics(final_res) # Metrics evaluated on the testing data. No sign of overfitting.
#
# final_res_autoplot <- collect_predictions(final_res) |>
#   conf_mat(mitigation, .pred_class) |>
#   autoplot()

# Fit final model with the best tuning hyperparameters
final_rf <- finalize_model(
  rf_spec_tune,
  best_accuracy
)

final_wf <- workflow() |>
  add_recipe(oda_rec) |>
  add_model(final_rf)

final_res <- final_wf |>
  last_fit(oda_split)

# Model performance on new data
final_res |>  collect_metrics()

p_final_res <- final_res |>
  collect_predictions() |>
  conf_mat(mitigation, .pred_class) |>
  autoplot()

```

The performance metrics from the test set indicate that we did not overfit during the training procedure.

## Make predictions using the workflow

The final_res object contains a finalized, fitted **workflow** that can be used for predicting on new data. We can extract this object.

```{r}
#| label: extract-workflow

# Extract final fitted workflow used to train the algoritm
final_wf <- extract_workflow(final_res)

augment(final_wf, new_data = oda_test[1,])

```

We can save this fitted `final_wf()` object to use later with new data.

```{r save final workflow}
#| label: save-workflow

#readr::write_rds(final_wf, "final_wf.rds")

```

## Deploy model workflow

Deploy model workflow using packages `vetiver` and `plumber`.

```{r}
#| label: deploy-workflow

# v <- vetiver::vetiver_model(final_wf, "A model to identify development aid agreements targeting climate change mitigation")
#
# v
#
# plumber::pr() |>
#   vetiver::vetiver_api(v) |>
#   plumber::pr_run()

```