DASE/310_basicPreprocessing.qmd at main · danrodgar/DASE · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563

# Preprocessing

Following the data mining process, we describe what is meant by preprocessing, classical supervised models,  unsupervised models and evaluation in the context of software engineering with examples

This task is probably the hardest and where most of effort is spend in the data mining process. It is quite typical to transform the data, for example, finding inconsistencies, normalising, imputing missing values, transforming input data, merging variables, etc.

Typically, pre-processing consist of the following tasks (subprocesses):

 + Data cleaning (consistency, noise detection, outliers)
 + Data integration
 + Data transformation  (normalisation, discretisation) and derivation of new attributes from existing ones (e.g., population density from population and area)
 + Missing data imputation
 + Data reduction (feature selection and instance selection)


## Data

_Consistent_ data are semantically correct based on real-world knowledge of the problem, i.e., no constraints are violated and data can be used for inducing models and analysis. For example, LoC or effort is constrained to non-negative values. We can also consider multiple attributes to be consistent among them, and even across datasets (e.g., same metrics but collected by different tools).


## Missing values

_Missing values_ will have a negative effect when analysing the data or learning models. The results can be biased when compared with the models induced from the complete data, the results can be harder to analyse, it may be needed to discard records with missing values depending on the algorithm and this can be an important problems with small datasets such as the effort estimation ones.

Missing data is typically classified into:
  * MCAR (Missing Completely at Random) or MAR (Missing At Random) where there is no reason for those missing values and we can assume that the distribution could follow the attribute's distribution.
  * MNAR (Missing Not At Random) where there is a pattern for those missing values and it may be advisable to check the data-gathering process to try to understand why such information is missing.

_Imputation_ consists of replacing missing values with estimates. Many algorithms cannot handle missing values directly; therefore, imputation methods are needed. We can use simple approaches such as replacing missing values with the mean or mode of the attribute. More elaborate approaches include:

  * EM (Expectation-Maximisation)
  * Distance-based
      + $k$-NN ($k$-Nearest Neighbours)
      + Clustering

In R, a missing value is represented with `NA` and the analyst must decide what to do with missing data. The simplest approach is to leave out instances (ignore missing -IM-) with with missing data. This functionality is supported by many base functions through the `na.rm` option.

The `mice` R package. MICE (Multivariate Imputation via Chained Equations) assumes that data are missing at random.  Other packages include `Amelia`, `missForest`, `Hmisc` and `mi`.


## Noise

Imperfections in real-world data can negatively influence induced machine learning models. Approaches to deal with noisy data include:
 * Robust learners capable of handling noisy data (e.g., C4.5 through pruning strategies)
 * Data polishing methods which aim to correct noisy instances prior training
 * Noise filters which are used to identify and eliminate noisy instances from the training data.


Types of noise data:
 * Class Noise (aka label noise).
      + There can be contradictory cases (all attributes have the same value except the class)
      + Misclassifications. The class attribute is not labeled with the true label (golden truth)
 * Attribute Noise. Values of attributes that are noise, missing or unknown.


## Outliers

There is a large amount of literature related to outlier detection, and furthermore several definitions of outlier exist.

```{r message=FALSE, warning=FALSE}
library(dbscan)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
kc1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
kc1$Defective <- factor(ifelse(kc1$bug > 0, "Y", "N"))
kc1$bug <- NULL
```

The LOF algorithm (`lofactor`), given a data set it produces a vector of local outlier factors for each case.

```{r message=FALSE, warning=FALSE}
kc1num <- kc1[, sapply(kc1, is.numeric)]
outlier.scores <- dbscan::lof(kc1num, minPts=5)
plot(density(na.omit(outlier.scores)))
outliers <- order(outlier.scores, decreasing=T)[1:5]
print(outliers)
```


Another simple method of Hiridoglou and Berthelot for positive observations.


## Feature selection

Feature Selection (FS) aims at identifying the most relevant attributes from a dataset. It is important in different ways:

  * A reduced volume of data allows different data mining or searching techniques to be applied.

  * Irrelevant and redundant attributes can generate less accurate and more complex models. Furthermore, data mining algorithms can be executed faster.

  * It avoids the collection of data for those irrelevant and redundant attributes in the future.


The problem of FS received a thorough treatment in pattern recognition and machine learning. Most of the FS algorithms tackle the task as a _search_ problem, where each
state in the search specifies a distinct subset of the possible attributes [@BL97]. The search procedure is combined with a criterion to evaluate the merit of each candidate subset of attributes. There are a multiple possible combinations between each procedure search and each attribute measure [@LY05].

There are two major approaches in FS from the method's output point of view:

  * _Feature subset selection_ (FSS)

  * _Feature ranking_ in which attributes are ranked as a list of features which are ordered according to evaluation measures (a subset of features is often selected from the top of the ranking list).


FFS algorithms designed with different evaluation criteria broadly fall into two categories:

  * The _filter_ model relies on general characteristics of the data to evaluate and select feature subsets without involving any data mining algorithm.

  * The _wrapper_ model requires one predetermined mining algorithm and uses its performance as the evaluation criterion. It searches for features better suited to the mining algorithm aiming to improve mining performance, but it also tends to be more computationally expensive than filter model [@KJ97,@Lan94].


Feature subset algorithms search through candidate feature subsets guide by a certain evaluation measure [@LM98] which captures the goodness of each subset. An optimal (or near optimal) subset is selected when the search stops.

Some existing evaluation measures that have been shown effective in removing both irrelevant and redundant features include the consistency measure [@DLM00], the correlation measure [@Hal99] and the estimated accuracy of a learning algorithm [@KJ97].


 + _Consistency_ measure attempts to find a minimum number of features that separate classes as consistently as the full set of features can. An inconsistency is defined as to instances having the same
feature values but different class labels.

  + _Correlation_ measure evaluates the goodness of feature subsets based on the hypothesis that good feature subsets contain features highly correlated to the class, yet uncorrelated to each other.

  + _Wrapper-based_ attribute selection uses the target learning algorithm to estimate the worth of attribute subsets. The feature subset selection algorithm conducts a search for a good subset using
the induction algorithm itself as part of the evaluation function.

Langley [-@Lan94] notes that feature selection algorithms that search through the space of feature subsets must address four main issues: (i) the starting point of the search, (ii) the organization of the search, (iii) the evaluation of feature subsets and (iv) the criterion used to terminate the search. Different algorithms address these issues differently.

It is impractical to look at all possible feature subsets, even with a small number of attributes. Feature selection algorithms usually proceed greedily and can be classified into those that add features to an initially empty set (_forward selection_) and those that remove features from an initially complete set (_backward elimination_). Hybrid approaches both add and remove features as the algorithm progresses. Forward selection is much faster than backward elimination and therefore scales better to large datasets. A wide range of search strategies can be used: best-first, branch-and-bound, simulated annealing, genetic algorithms (see Kohavi and John [-@KJ97] for a review).


### FSelector package in R

The `FSelector` package in R implements several classic feature-selection
algorithms inspired by Weka. For new projects, prefer actively maintained
alternatives such as `FSelectorRcpp`, `Boruta`, or recipe-based selectors in
`tidymodels`.


```{r message=FALSE, warning=FALSE}
library(FSelector)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
cm1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
cm1$Defective <- factor(ifelse(cm1$bug > 0, "Y", "N"))
cm1$bug <- NULL

cm1RFWeigths <- random.forest.importance(Defective ~ ., cm1)
cutoff.biggest.diff(cm1RFWeigths)
```

Using the Information Gain measure as ranking:

```{r message=FALSE, warning=FALSE}
cm1GRWeights <- gain.ratio(Defective ~ ., cm1)
cm1GRWeights
cutoff.biggest.diff(cm1GRWeights)

# After assigning weights, we can select the statistically significant ones
cm1X2Weights <- chi.squared(Defective ~ ., cm1)
cutoff.biggest.diff(cm1X2Weights)
```


Using CFS attribute selection

```{r message=FALSE, warning=FALSE}
library(FSelector)

unified <- read.csv("./datasets/defectPred/unified/Unified-file.csv", stringsAsFactors = FALSE)
cm1 <- unified[, c("McCC", "CLOC", "PDA", "PUA", "LLOC", "LOC", "bug")]
cm1$Defective <- factor(ifelse(cm1$bug > 0, "Y", "N"))
cm1$bug <- NULL

result <- cfs(Defective ~ ., cm1)
f <- as.simple.formula(result, "Defective")
f
```

Other packages for feature selection in R include `FSelectorRcpp`, which
re-implements `FSelector` methods without Weka dependencies.

Another popular package is `Boruta`, which is based on selection based on Random Forest.

## Instance selection

Removal of samples (complementary to the removal of attributes) in order to scale down the dataset prior to learning a model so that there is (almost) no performance loss.

There are two types of processes:

 * _Prototype Selection_ (PS) [@GDCH12] when the subset is used with a distance based method (kNN)

 * _Training Set Selection_ (TSS) [@CanoHL07] in which an actual model is learned.

It is also a search problem as with _feature selection_. Garcia et al. [-@GDCH12] provide a  comprehensive overview of the topic.


## Discretization

This process transforms continuous attributes into discrete ones, by associating categorical values to intervals and thus transforming quantitative data into qualitative data.


## Correlation Coefficient and Covariance for Numeric Data

Two random variables $x$ and $y$ are called independent if the probability distribution of one variable is not affected by the presence of another.


$\tilde{\chi}^2=\frac{1}{d}\sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}$

```{r message=FALSE, warning=FALSE}
cor(kc1$CLOC, kc1$LOC, use = "complete.obs")
cor(kc1$McCC, kc1$LLOC, use = "complete.obs")
```

## Normalization

### Min-Max Normalization

$z_i=\frac{x_i-\min(x)}{\max(x)-\min(x)}$

```{r message=FALSE, warning=FALSE}
library(recipes)

rec_norm <- recipe(Defective ~ ., data = kc1) |>
  step_normalize(all_numeric_predictors())

prep(rec_norm) |> juice()
```


### Z-score normalization
Z-score normalization rescales each value by subtracting the mean and dividing
by the standard deviation. It is useful when models assume centered predictors
with comparable scales.

```{r message=FALSE, warning=FALSE}
kc1_num <- dplyr::select(kc1, where(is.numeric))
kc1_scaled <- scale(kc1_num)
summary(kc1_scaled[, 1:min(5, ncol(kc1_scaled))])
```

## Transformations

### Linear and Quadratic Transformations

Transformations can reduce skewness, stabilize variance, or expose nonlinear
patterns for models that rely on linear assumptions.

```{r message=FALSE, warning=FALSE}
if ("LOC_TOTAL" %in% names(kc1)) {
  loc_linear <- kc1$LOC_TOTAL
  loc_quadratic <- kc1$LOC_TOTAL^2
  summary(data.frame(loc_linear, loc_quadratic))
}
```

### Box-cox transformation
Box-Cox is often used for strictly positive variables. It can improve
normality-like behavior and model fit for some regression techniques.

```{r message=FALSE, warning=FALSE}
if ("LOC_TOTAL" %in% names(kc1)) {
  loc_pos <- kc1$LOC_TOTAL + 1
  lambda_grid <- seq(-2, 2, by = 0.25)
  ll <- sapply(lambda_grid, function(l) MASS::boxcox(loc_pos ~ 1, lambda = l, plotit = FALSE)$y)
  lambda_grid[which.max(ll)]
}
```

### Nominal to Binary Transformations

Categorical predictors are commonly converted into binary indicators before
training some models.

```{r message=FALSE, warning=FALSE}
if ("Defective" %in% names(kc1)) {
  dummy_tbl <- model.matrix(~ Defective - 1, data = kc1)
  head(dummy_tbl)
}
```


##  Preprocessing in R

### The `dplyr` package

The *[dplyr](https://cran.r-project.org/web/packages/dplyr/index.html)* package created by Hadley Wickham. Some functions are similar to SQL syntax.  key functions in dplyr include:

 + select: select columns from a dataframe
 + filter: select rows from a dataframe
 + summarize: allows us to do summary stats based upon the grouped variable
 + group_by: group by a factor variable
 + arrange: order the dataset
 + joins: as in sql left join

Tutorial:
[https://github.com/justmarkham/dplyr-tutorial](https://github.com/justmarkham/dplyr-tutorial)

Examples

```{r message=FALSE, warning=FALSE}
library(dplyr)
```

Describe the dataframe:

```{r}
str(kc1)
```

`as_tibble()` creates a local tibble for better printing and modern dplyr workflows.

```{r}
kc1_tbl <- tibble::as_tibble(kc1)
```


Filter:

```{r}
# Filter rows: use comma or & to represent AND condition
filter(kc1_tbl, Defective == "Y" & CLOC != 0)
```

Another operator is `%in%`.

Select:
```{r}
select(kc1_tbl, contains("LOC"), Defective)
```

Now, `kc1_tbl` contains("LOC"), Defective


Filter and Select together:
```{r}
# nesting method
filter(select(kc1_tbl, contains("LOC"), Defective), Defective == "Y")
```

It is easier using the chaining method:

```{r}
# chaining method
kc1_tbl %>%
    select(contains("LOC"), Defective) %>%
  filter(Defective == "Y")
```


Arrange ascending
```{r}
#
kc1_tbl %>%
  select(LOC, Defective) %>%
  arrange(LOC)
```

Arrange descending:
```{r}
kc1_tbl %>%
  select(LOC, Defective) %>%
  arrange(desc(LOC))
```


Mutate:
```{r}
kc1_tbl %>%
    filter(Defective == "Y") %>%
  select(CLOC, LLOC, LOC, Defective) %>%
  mutate(loc_density = LOC / pmax(CLOC, 1))
```

`summarise`: Reduce variables to values

```{r}
# Create a table grouped by Defective, and then summarise each group by taking the mean of loc
kc1_tbl %>%
    group_by(Defective) %>%
  summarise(avg_loc = mean(LOC, na.rm=TRUE))
```


```{r}
# Create a table grouped by Defective, and then summarise each group by taking the mean of loc
kc1_tbl %>%
    group_by(Defective) %>%
    summarise(
      mccc_mean = mean(McCC, na.rm = TRUE),
      mccc_min = min(McCC, na.rm = TRUE),
      mccc_max = max(McCC, na.rm = TRUE),
      loc_mean = mean(LOC, na.rm = TRUE),
      loc_min = min(LOC, na.rm = TRUE),
      loc_max = max(LOC, na.rm = TRUE)
    )
```


It seems than the number of _Defective_ modules is larger than the _Non-Defective_ ones. We can count them with:

```{r}
# n() or tally
kc1_tbl %>%
    group_by(Defective) %>%
    tally()
```

It seems that it's an imbalanced dataset...

```{r}
# randomly sample a fixed number of rows, without replacement
kc1_tbl %>% sample_n(2)

# randomly sample a fraction of rows, with replacement
kc1_tbl %>% sample_frac(0.05, replace=TRUE)

# Better formatting adapted to the screen width
glimpse(kc1_tbl)
```

## Missing but critical in preprocessing: leakage-safe workflows

A common mistake is to normalize, impute, or select features on the **full
dataset** before splitting into train/test. This leaks information from test to
train and usually inflates performance.

Recommended order:

1. Split data first
2. Fit preprocessing only on training data
3. Apply the learned preprocessing to validation/test data
4. Train and evaluate model

```{r message=FALSE, warning=FALSE}
library(tidymodels)

set.seed(10)
split <- initial_split(kc1_tbl, prop = 0.75, strata = Defective)
train_tbl <- training(split)
test_tbl <- testing(split)

rec_safe <- recipe(Defective ~ ., data = train_tbl) |>
  step_zv(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_normalize(all_numeric_predictors())

prep_rec <- prep(rec_safe)
train_processed <- bake(prep_rec, new_data = train_tbl)
test_processed <- bake(prep_rec, new_data = test_tbl)

dim(train_processed)
dim(test_processed)
```

### Group-aware and time-aware splitting

Random splits can still leak information when related rows belong to the same
project, release, or time period. Prefer:

- **group-aware splits**: keep all rows of a project/release in one side
- **time-aware splits**: train on older observations, test on newer observations

```{r eval=FALSE}
# group-aware split idea (project_id must exist)
# rsample::group_initial_split(data_tbl, group = project_id)

# time-aware split idea (timestamp must exist and be sorted)
# data_tbl <- data_tbl |> dplyr::arrange(timestamp)
# cutoff <- floor(0.8 * nrow(data_tbl))
# train_tbl <- data_tbl[1:cutoff, ]
# test_tbl  <- data_tbl[(cutoff + 1):nrow(data_tbl), ]
```

### Concrete leakage examples to avoid

Typical leakage patterns in SE datasets:

1. Using post-release defect counts as predictors of the same release.
2. Computing normalization/imputation on full data before split.
3. Duplicated modules across train and test due to merges or snapshots.
4. Features derived from future information (for example, final churn).


## Class imbalance handling

Defect datasets are often imbalanced, so accuracy alone can be misleading.

Useful practices:

- report precision, recall, F1, and PR-AUC (not only accuracy)
- use stratified sampling in train/test splits
- try rebalancing methods on training data only

```{r message=FALSE, warning=FALSE}
if (requireNamespace("themis", quietly = TRUE)) {
  rec_bal <- recipe(Defective ~ ., data = train_tbl) |>
    step_zv(all_predictors()) |>
    themis::step_smote(Defective)

  rec_bal |> prep() |> juice() |>
    dplyr::count(Defective)
} else {
  message("Package 'themis' not installed; skipping SMOTE example.")
}
```


## Preprocessing checklist

Before modeling, verify:

1. No leakage from test folds into preprocessing
2. Missing-value strategy documented and reproducible
3. Outlier policy documented (keep/cap/remove)
4. Categorical encoding strategy defined
5. Class imbalance strategy defined (if classification)
6. Temporal split considered when timestamps are available
7. Pipeline saved so the same transformations can be reused in deployment


## End-to-End Workflow (Preprocess to Deploy)

The following flow summarizes a leakage-safe and reproducible workflow for
software engineering prediction tasks.

```{mermaid}
flowchart LR
  A[Raw SE Data<br/>commits issues metrics tests] --> B[Data Quality Checks<br/>missing duplicates impossible values]
  B --> C[Train/Test Split<br/>stratified or temporal]
  C --> D[Fit Preprocessing on Train Only<br/>impute encode normalize select]
  D --> E[Transform Train]
  D --> F[Transform Test]
  E --> G[Train Model]
  G --> H[Cross-Validation on Train]
  H --> I[Final Evaluation on Test<br/>precision recall F1 PR-AUC]
  I --> J[Persist Pipeline + Model]
  J --> K[Deployment]
  K --> L[Monitoring<br/>drift performance alerts]
```

Key principle: any operation that learns parameters from data (imputation,
normalization, feature selection, balancing) must be fit using training data
only.

## Other libraries and tricks

The `lubridate` package contains a number of functions facilitating the conversion of text to
POSIX dates. As an example, consider the following code. We may use this, for example, with time series.


For example [https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf](https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf)


```{r message=FALSE, warning=FALSE}
library(lubridate)
dates <- c("15/02/2013", "15 Feb 13", "It happened on 15 02 '13")
dmy(dates)
```