# Feature Selection Example

This section presents a modern feature-selection workflow using `tidymodels`.
The goal is to keep the process explicit and reproducible:

- split data into training and test subsets
- define feature filters in a `recipe`
- fit a model through a `workflow`
- inspect variable importance

```{r setup_fs, message=FALSE, warning=FALSE}
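library(tidymodels)

# vip is optional; the importance plot is skipped when it is not installed
has_vip <- requireNamespace("vip", quietly = TRUE)
set.seed(10)

# Load the unified defect dataset and keep a small set of source-code metrics
kc1 <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
kc1 <- kc1[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]

# Binary target: "Y" when the bug count is positive, "N" otherwise
kc1$Defective <- factor(ifelse(kc1$bug > 0, "Y", "N"))
kc1$bug <- NULL

# Stratified 75/25 split so both subsets keep a similar class balance
split <- initial_split(kc1, prop = 0.75, strata = Defective)
training <- training(split)
testing <- testing(split)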
```
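
As a quick sanity check on stratification, the sketch below runs the same kind of `initial_split()` call on a small synthetic data frame and compares class proportions in the two subsets. The `strat_*` names and the 80/20 class ratio are assumptions made for illustration; they are not taken from the chapter's dataset.

```{r fs_split_check, message=FALSE, warning=FALSE}
library(rsample)

set.seed(10)

# Hypothetical toy data: 160 non-defective and 40 defective rows
strat_toy <- data.frame(
  x = rnorm(200),
  Defective = factor(rep(c("N", "Y"), times = c(160, 40)))
)

strat_split <- initial_split(strat_toy, prop = 0.75, strata = Defective)
strat_train <- training(strat_split)
strat_test <- testing(strat_split)

# With stratification, the share of "Y" should be close in both subsets
prop_train <- prop.table(table(strat_train$Defective))
prop_test <- prop.table(table(strat_test$Defective))
round(rbind(train = prop_train, test = prop_test), 3)
```

The same comparison can be run on `training` and `testing` from the setup chunk to confirm that the split did not distort the defect rate.
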
## Feature filtering with recipes

`recipes` supports common feature-selection and cleaning steps:

- `step_zv()` removes zero-variance predictors
- `step_nzv()` removes near-zero-variance predictors
- `step_corr()` removes highly correlated numeric predictors

```{r fs_recipe, message=FALSE, warning=FALSE}
rec <- recipe(Defective ~ ., data = training) |>
  step_zv(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.90)

# Estimate the filters on the training data, then inspect the processed training set
rec_prep <- prep(rec)
bake(rec_prep, new_data = NULL) |> dplyr::glimpse()
```
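
To make the effect of each filter visible, here is a minimal sketch on synthetic data in which every column is built to trigger one specific step. The `filt_*` names and the column names are hypothetical; only the `recipes` calls mirror the chunk above.

```{r fs_filter_demo, message=FALSE, warning=FALSE}
library(recipes)

set.seed(1)
n <- 100

# Hypothetical toy data, one column per filter
filt_toy <- data.frame(
  a = rnorm(n),
  constant = rep(1, n),         # zero variance: target of step_zv()
  rare = c(rep(0, n - 1), 1),   # one value dominates: target of step_nzv()
  y = factor(sample(c("N", "Y"), n, replace = TRUE))
)
# Near-duplicate of `a`: one of the pair is a target of step_corr()
filt_toy$a_copy <- filt_toy$a + rnorm(n, sd = 0.01)

filt_prep <- recipe(y ~ ., data = filt_toy) |>
  step_zv(all_predictors()) |>
  step_nzv(all_predictors()) |>
  step_corr(all_numeric_predictors(), threshold = 0.90) |>
  prep()

# Columns that survive all three filters
filt_kept <- names(bake(filt_prep, new_data = NULL))
filt_kept

# Columns removed by the third step (step_corr)
filt_corr_removed <- tidy(filt_prep, number = 3)
filt_corr_removed
```

Calling `tidy()` with a step number on the chapter's `rec_prep` should list, in the same way, which of the real metrics each filter dropped.
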
## Leakage pitfalls in feature selection

Feature selection must be fit on training data only. Common mistakes:

- selecting features on the full dataset before splitting
- ranking features using labels from future releases
- including attributes that are proxies of the target (post-release fields)

In this chapter, feature filters are defined in the recipe and learned from
`training` only, then applied to `testing` through the fitted workflow.

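
The train-only pattern can be sketched in isolation as follows. The data are synthetic and the `leak_*` names are hypothetical; the point is only that `prep()` sees the training rows, while the test rows are passed through `bake()` without re-estimating anything.

```{r fs_leakage_demo, message=FALSE, warning=FALSE}
library(recipes)
library(rsample)

set.seed(42)
n <- 200

# Hypothetical toy data: x2 is a noisy copy of x1, x3 is independent
leak_toy <- data.frame(
  x1 = rnorm(n),
  x3 = rnorm(n),
  y = factor(sample(c("N", "Y"), n, replace = TRUE))
)
leak_toy$x2 <- leak_toy$x1 + rnorm(n, sd = 0.05)

leak_split <- initial_split(leak_toy, prop = 0.75, strata = y)
leak_train <- training(leak_split)
leak_test <- testing(leak_split)

# The correlation filter is estimated from the training rows only
leak_prep <- recipe(y ~ ., data = leak_train) |>
  step_corr(all_numeric_predictors(), threshold = 0.90) |>
  prep(training = leak_train)

# The test set is only transformed with what was learned on training
leak_train_baked <- bake(leak_prep, new_data = NULL)
leak_test_baked <- bake(leak_prep, new_data = leak_test)

# Both subsets end up with the same columns, decided without looking at the test rows
identical(names(leak_train_baked), names(leak_test_baked))
```

A workflow performs the same two stages internally: `fit()` preps the recipe on the data it is given, and `predict()` bakes the new data with the trained recipe.
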
## Train a model after feature filtering
```{r fs_model, message=FALSE, warning=FALSE}
rf_spec <- rand_forest(trees = 500, min_n = 5) |>
  set_mode("classification") |>
  set_engine("ranger", importance = "impurity")

wf <- workflow() |>
  add_recipe(rec) |>
  add_model(rf_spec)

rf_fit <- fit(wf, data = training)
rf_fit
```
## Evaluate on the test set
```{r fs_eval, message=FALSE, warning=FALSE}
pred_cls <- predict(rf_fit, testing, type = "class")
pred_prb <- predict(rf_fit, testing, type = "prob")
eval_tbl <- bind_cols(testing, pred_cls, pred_prb)
metrics(eval_tbl, truth = Defective, estimate = .pred_class)
conf_mat(eval_tbl, truth = Defective, estimate = .pred_class)
```
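
Accuracy alone can be misleading when defective modules are the minority class. Because `Defective` has the levels `N`, `Y` in that order, `yardstick` treats `N` as the event by default, so class-specific metrics for defects need `event_level = "second"`. The sketch below shows this on a handful of hand-written, hypothetical predictions (the `ev_*` names and values are for illustration only, not results from the chapter's model).

```{r fs_eval_demo, message=FALSE, warning=FALSE}
library(yardstick)

# Hypothetical truth and predictions: 3 defective and 7 non-defective modules
ev_toy <- data.frame(
  Defective = factor(
    c("Y", "Y", "Y", "N", "N", "N", "N", "N", "N", "N"),
    levels = c("N", "Y")
  ),
  .pred_class = factor(
    c("Y", "N", "N", "N", "N", "N", "N", "N", "N", "Y"),
    levels = c("N", "Y")
  )
)

# Accuracy does not depend on which class is treated as the event
ev_acc <- accuracy(ev_toy, truth = Defective, estimate = .pred_class)

# Precision, recall and F1 for the defective class ("Y" is the second level)
ev_prec <- precision(ev_toy, truth = Defective, estimate = .pred_class,
                     event_level = "second")
ev_rec <- recall(ev_toy, truth = Defective, estimate = .pred_class,
                 event_level = "second")
ev_f1 <- f_meas(ev_toy, truth = Defective, estimate = .pred_class,
                event_level = "second")

rbind(ev_acc, ev_prec, ev_rec, ev_f1)
```

In this toy case accuracy is 0.7 while recall for the defective class is only 1/3, which is the kind of gap that `metrics()` on its own would not reveal. The same calls can be applied to `eval_tbl`.
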
## Variable importance
```{r fs_importance, message=FALSE, warning=FALSE, fig.width=8, fig.height=6}
if (has_vip) {
  rf_fit |>
    extract_fit_parsnip() |>
    vip::vip(num_features = 15)
} else {
  message("Package 'vip' is not installed; skipping variable-importance plot.")
}
```
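
When `vip` is unavailable, the raw scores can still be read from the underlying `ranger` object, which stores them in `variable.importance` whenever an importance mode is requested. The sketch below fits `ranger` directly on synthetic data; the `imp_*` names and the data-generating rule are assumptions for illustration. In the chapter's workflow, the same vector should be reachable through `extract_fit_engine(rf_fit)$variable.importance`.

```{r fs_importance_demo, message=FALSE, warning=FALSE}
library(ranger)

set.seed(10)
n <- 300

# Hypothetical toy data: only `signal` is related to the target
imp_toy <- data.frame(
  signal = rnorm(n),
  noise1 = rnorm(n),
  noise2 = rnorm(n)
)
imp_toy$y <- factor(ifelse(imp_toy$signal + rnorm(n, sd = 0.3) > 0, "Y", "N"))

imp_rf <- ranger(
  y ~ .,
  data = imp_toy,
  num.trees = 500,
  importance = "impurity",  # same importance mode as the chapter's model spec
  seed = 10
)

# Named numeric vector of impurity-based importance, largest first
imp_scores <- sort(imp_rf$variable.importance, decreasing = TRUE)
imp_scores
```

Impurity-based scores are relative within one model, so they are best used to rank predictors rather than to compare across datasets.
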
## Feature Selection Packages and Further Reading

For this chapter, the most useful package families are:

| Package | Typical Use in Feature Selection |
|---------|----------------------------------|
| [recipes](https://recipes.tidymodels.org/) | Filtering and preprocessing (`step_zv`, `step_nzv`, `step_corr`) |
| [Boruta](https://cran.r-project.org/package=Boruta) | All-relevant feature selection using random forests |
| [FSelectorRcpp](https://cran.r-project.org/package=FSelectorRcpp) | Information-gain and entropy-based ranking |
| [vip](https://koalaverse.github.io/vip/) | Variable-importance visualization for fitted models |
| [ranger](https://cran.r-project.org/package=ranger) | Fast tree-based models with built-in importance scores |

To keep this chapter concise, we focus on the `recipes` + `ranger` workflow. See @sec-popular-packages for a broader package map and @sec-rpackages for package-management guidance.