---
title: "Diagnosing Diseases using kNN"
subtitle: "An Application of kNN to Diagnose Diabetes"
author: "Elena Boiko, Jacqueline Razo (Advisor: Dr. Cohen)"
date: '`r Sys.Date()`'
format:
  html:
    code-fold: true
course: Capstone Projects in Data Science
bibliography: references.bib # file contains bibtex for references
#always_allow_html: true # this allows to get PDF with HTML features
self-contained: true
execute:
  warning: false
  message: false
editor:
  markdown:
    wrap: 72
---
Slides: [slides.html](slides.html){target="_blank"}
## Introduction
The k-Nearest Neighbors (kNN) algorithm is used in a variety of fields to
classify or predict data. It is a simple algorithm that classifies a data
point based on how similar it is to existing classes of data points. Its main
benefits are its ease of use and the fact that it is non-parametric, which
allows it to fit a wide variety of datasets. Its main drawback is a higher
computational cost than many other models, so it does not perform as well, or
as quickly, on big data. Despite this, its simplicity makes kNN easy to
understand and easy to implement across many domains. One such domain is
healthcare, where kNN models have been used successfully to predict diseases
such as diabetes and hypertension. In this paper we focus on the methodology
and application of kNN models in healthcare to predict diabetes, a pressing
public health problem.
To better understand the role of kNN in healthcare applications, it is important to first review its theoretical foundations, the key factors affecting its performance, and recent advancements in optimizing kNN for large datasets and medical diagnosis, particularly for diabetes prediction.
#### Theoretical Background of kNN
kNN is a supervised learning algorithm that labels a data point by comparing
it to other, similar data points. It rests on the assumption that similar
data points lie close to one another. In [@zhang2016introduction], the author
introduces how kNN works and how to run a kNN model in R. He describes the
method as assigning an unlabeled observation to a class using labeled
examples that are similar to it, and presents the Euclidean distance, the
default distance measure for kNN. The author also describes the impact of the
k parameter, which tells the model how many neighbors to consult when
classifying a data point. Zhang recommends setting k to the square root of
the number of observations in the training dataset.
Although Zhang's recommendation is a good starting point for choosing k,
[@zhang2017efficient] proposed decision tree-assisted tuning to optimize k
and significantly improve accuracy. The authors use a training stage in which
a decision tree selects suitable k values, making kNN more efficient. They
implemented and tested two such methods, kTree and k\*Tree, and found that
they reduced running costs while increasing classification accuracy.
Another major influence on accuracy is the distance metric the model uses to
identify neighbors. Although Euclidean distance is the default for kNN, other
distances can be used. In [@kataria2013review] the authors compare different
distance metrics in classification algorithms, with a focus on kNN. The
review first explains how kNN uses the nearest k neighbors to classify data
points, and describes how the Euclidean distance does this by drawing a line
segment between points A and B and measuring its length with the Euclidean
distance formula. It then describes the "city block" or taxicab (Manhattan)
distance as "the sum of the length of the projections of the line segment",
along with the cosine and correlation distances, and compares the performance
of the default Euclidean distance against city block, cosine, and correlation
distances. In their experiments, the Euclidean distance proved more efficient
than the alternatives.
Syriopoulos et al. [@syriopoulos2023k] also reviewed distance metric
selection, confirming that Euclidean distance remains the most effective
choice for most datasets. However, alternative metrics like Mahalanobis
distance can perform better for correlated features. The review
emphasized that selecting the right metric is dataset-dependent,
influencing classification accuracy.
#### Challenges in Scaling kNN for Large Datasets
While kNN is simple and effective, it struggles with computational
inefficiency when working with large datasets since it must calculate
distances for every new observation. This becomes a major challenge in
big data, where the sheer volume of information makes traditional kNN
methods slow and resource-intensive.
To address this, Deng et al. [@deng2016efficient] proposed an improved
approach called LC-kNN, which combines k-means clustering with kNN to
speed up computations and enhance accuracy. By dividing large datasets
into smaller clusters, their method reduces the number of distance
calculations needed. After extensive testing, the authors found that
LC-kNN consistently outperformed standard kNN, achieving higher accuracy
and better efficiency. Their study highlights a key limitation of
traditional kNN (without optimization, its performance significantly
declines on big data) and offers an effective solution to improve its
scalability.
Continuing and summarizing these ideas, Syriopoulos et al.
[@syriopoulos2023k] explored techniques for accelerating kNN
computations, such as:
- Dimensionality reduction (e.g., PCA, feature selection) to reduce
data complexity.
- Approximate Nearest Neighbor (ANN) methods to speed up distance
calculations.
- Hybrid models combining kNN with clustering (e.g., LC-kNN) to
improve efficiency.
These approaches enhance both speed and accuracy, making them promising
solutions for handling large datasets. In addition, the study categorizes
kNN modifications into local hyperplane methods, fuzzy-based models,
weighting schemes, and hybrid approaches, demonstrating how these
adaptations help tackle issues like class imbalance, computational
inefficiency, and sensitivity to noise.
Another key challenge for kNN is its performance in high-dimensional
datasets. The 2023 study by Syriopoulos et al. evaluates multiple
nearest neighbor search algorithms such as kd-trees, ball trees,
Locality-Sensitive Hashing (LSH), and graph-based search methods, which allow
kNN to scale to larger datasets by minimizing the number of distance
calculations.
The enhancements to kNN have substantially increased its performance in
terms of speed and accuracy which now allows it to better handle
large-scale datasets. However, as Syriopoulos et al. primarily compile
prior research rather than conducting empirical comparisons, further
work is needed to evaluate these optimizations in real-world medical
classification tasks.
#### kNN in Disease Prediction: Applications & Limitations
##### Disease Prediction with kNN
kNN has been widely used for diabetes classification and early
detection. Ali et al. [@ali2020diabetes] tested six different kNN
variants in MATLAB to classify blood glucose levels, finding that fine
kNN was the most accurate. Their research highlights how optimizing kNN
can improve classification performance, making it a valuable tool in
healthcare.
In turn, Saxena et al. [@saxena2014diagnosis] used kNN on a diabetes
dataset and observed that increasing the number of neighbors (k) led to
better accuracy, but only to a certain extent. In their MATLAB-based
study, they found that using k = 3 resulted in 70% accuracy, while
increasing k to 5 improved it to 75%. Both studies demonstrate how kNN
can effectively classify diabetes, with accuracy depending on the choice
of k and dataset characteristics. Ongoing research continues to refine
kNN, making it a more efficient and reliable tool for medical
applications.
Feature selection is another critical factor. Panwar et al.
[@panwar2016k] demonstrated that focusing on just BMI and the Diabetes
Pedigree Function improved accuracy, suggesting that simplifying feature
selection enhances model performance. Suriya and Muthu [@suriya2023type]
showed that kNN is a promising model for predicting type 2 diabetes,
achieving its highest accuracy on smaller datasets. The authors tested the
algorithm on three datasets ranging from 692 to 1,853 rows and 9 to 22
dimensions and found that larger datasets require a higher k value. In
addition, using PCA to reduce dimensionality did not improve model
performance, suggesting that simplifying the data does not always lead to
better results in diabetes prediction. Similar findings on the influence of
PCA on machine learning models, and on kNN in particular, were reported by
Iparraguirre-Villanueva et al. [@iparraguirre2023application]. They also
confirmed that kNN alone is not always the best choice: comparing kNN with
Logistic Regression, Naïve Bayes, and Decision Trees, they found that kNN
performed well on balanced datasets but struggled when class imbalances
existed. While PCA significantly reduced accuracy for all models, the
SMOTE-preprocessed dataset produced the highest accuracy for the kNN model
(79.6%), followed by BNB at 77.2%. This underscores the importance of
appropriate preprocessing in improving kNN accuracy, especially on imbalanced
datasets.
Khateeb & Usman [@khateeb2017efficient] extended kNN’s application to
heart disease prediction, demonstrating that feature selection and data
balancing techniques significantly impact accuracy. Their study showed
that removing irrelevant features did not always improve performance,
emphasizing the need for careful feature engineering in medical
datasets.
##### kNN Beyond Prediction: Handling Missing Data
While kNN is widely known for classification, it also plays a key role
in data preprocessing for medical machine learning. Altamimi et al.
[@altamimi2024automated] explored kNN imputation as a method to handle
missing values in medical datasets. Their study showed that applying kNN
imputation before training a machine learning model significantly
improved diabetes prediction accuracy - from 81.13% to 98.59%. This
suggests that kNN is not only useful for disease classification but also
for improving data quality and completeness in healthcare applications.
Traditional methods often discard incomplete records, but kNN imputation
preserves valuable information, leading to more reliable model
performance. However, Altamimi et al. (2024) also highlighted challenges
such as computational costs and sensitivity to parameter selection,
reinforcing the need for further optimization when applying kNN to
large-scale medical datasets.
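For illustration, scikit-learn exposes this idea through `KNNImputer`; the snippet below is a minimal sketch with made-up values, not the pipeline used by Altamimi et al.

```{python, eval=FALSE}
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix (e.g., glucose and BMI) with missing entries marked as np.nan
X = np.array([
    [148.0, 33.6],
    [np.nan, 26.6],
    [183.0, 23.3],
    [89.0, np.nan],
    [137.0, 43.1],
])

# Each missing entry is filled with the mean of that feature over the 2 nearest
# neighbors that have a value for it (nan-aware Euclidean distance).
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```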
##### Comparing kNN Variants & Hybrid Approaches
Research indicates that kNN works well for diabetes prediction, but recent
studies show it does not consistently provide the best results. Theerthagiri
et al. [@theerthagiri2022diagnosis] evaluated kNN against multiple machine
learning models, such as Naïve Bayes, Decision Trees, Extra Trees, Radial
Basis Function (RBF) networks, and the Multi-Layer Perceptron (MLP), on the
Pima Indians Diabetes dataset. kNN performed adequately, but MLP outperformed
all other algorithms, achieving the top accuracy of 80.68% and the leading
AUC-ROC of 86%. Despite its effectiveness in classification tasks, kNN's
primary limitation is its inability to compete with more advanced models,
such as neural networks, on complex datasets.
In turn, Uddin et al. [@uddin2022comparative] explored advanced kNN
variants, including Weighted kNN, Distance-Weighted kNN, and Ensemble
kNN. Their findings suggest that:
- Weighted kNN improved classification by assigning greater importance
to closer neighbors.
- Ensemble kNN outperformed standard kNN in disease prediction but
required additional computational resources.
- Performance was highly sensitive to the choice of distance metric
and k value tuning.
Their findings suggest that kNN can be improved through modifications,
but it remains highly sensitive to dataset size, feature selection, and
distance metric choices. In large-scale healthcare applications,
Decision Trees (DT) and ensemble models may offer better trade-offs
between accuracy and efficiency. These studies highlight the ongoing
debate over kNN’s role in medical classification - whether modifying kNN
is the best approach or if other models, such as DT or ensemble
learning, provide stronger performance for diagnosing diseases.
kNN continues to be a valuable tool in medical machine learning,
offering simplicity and strong performance in classification tasks.
However, as research shows, its effectiveness depends on proper feature
selection, optimized k values, and preprocessing techniques like
imputation. While kNN remains an interpretable and adaptable model,
newer methods - such as ensemble learning and neural networks - often
outperform it, particularly in large-scale datasets. For our capstone
project, exploring feature selection, fine-tuning kNN’s settings, and
comparing it to other algorithms could give us valuable insights into
its strengths and limitations.
## Methods
The kNN algorithm is a nonparametric supervised learning algorithm that
can be used for classification or regression problems
[@syriopoulos2023k]. In classification, it works on the assumption that
similar data points are close to each other in distance. It classifies a
data point by using the Euclidean distance formula to find the k nearest
data points. Once these k data points have been found, kNN assigns the new
data point to the category held by the majority of those neighbors
[@zhang2016introduction].
Figure 1 illustrates this methodology with two distinct classes, hearts and
circles. The kNN algorithm is attempting to classify the mystery figure
represented by the red square. The k parameter is set to k = 5, which means
the algorithm uses the Euclidean distance formula to find the 5 nearest
neighbors, illustrated by the green circle. From here the algorithm simply
counts the number of neighbors from each class and assigns the class that
holds the majority, which in this case is a heart.
{width=400 height=400}
The red square represents a data point to be classified. The algorithm selects the 5 nearest neighbors within the green circle—3 hearts and 2 circles. Based on the majority vote, the red square is classified as a heart.
### Classification process
The classification process has three distinct steps:
##### 1. Distance calculation
The kNN algorithm first calculates the distance between the data point it’s trying to classify and all the points in the training dataset. The most commonly used method is the **Euclidean distance** [@theerthagiri2022diagnosis], which measures the straight-line distance between two points. The formula is:
$$
d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}
$$
Figure 2 illustrates the Euclidean distance, where $X_2 - X_1$ is the horizontal difference and $Y_2 - Y_1$ is the vertical difference. These two differences are squared so they are positive regardless of direction; squaring also gives greater emphasis to larger distances.
```{r}
library(ggplot2)
# Add points
X1 <- 10; Y1 <- 12
X2 <- 14; Y2 <- 16
# Create base plot
plot(c(X1, X2), c(Y1, Y2), type = "n",
xlab = "X-axis", ylab = "Y-axis",
main = "Figure 2: Euclidean and Manhattan Distances",
xlim = c(X1 - 4, X2 + 4), ylim = c(Y1 - 4, Y2 + 4))
# Plot points
points(X1, Y1, col = "red", pch = 16, cex = 2)
points(X2, Y2, col = "blue", pch = 16, cex = 2)
# Add Manhattan path (green arrows)
arrows(X1, Y1, X2, Y1, col = "darkgreen", lwd = 2, length = 0.1) # horizontal
arrows(X2, Y1, X2, Y2, col = "darkgreen", lwd = 2, length = 0.1) # vertical
# Add Euclidean line (dashed purple)
segments(X1, Y1, X2, Y2, col = "purple", lwd = 2, lty = 2)
# Point labels
text(X1 - 0.5, Y1, labels = paste("(X1, Y1)\n(", X1, ",", Y1, ")"),
col = "red", cex = 0.8, pos = 2)
text(X2 + 0.5, Y2, labels = paste("(X2, Y2)\n(", X2, ",", Y2, ")"),
col = "blue", cex = 0.8, pos = 4)
# Euclidean distance label + arrow
text((X1 + X2)/2 - 1, (Y1 + Y2)/2 + 2.5, "Euclidean Distance (d)", col = "purple", font = 2, cex = 1)
arrows((X1 + X2)/2, (Y1 + Y2)/2 + 2, (X1 + X2)/2, (Y1 + Y2)/2 + 0.6,
col = "purple", lwd = 1.5, length = 0.1)
# Manhattan label
text(X2 + 1.5, Y1 + 0.5, "Manhattan\nDistance", col = "darkgreen", font = 2, cex = 0.9, pos = 4)
# Euclidean distance formula
text(mean(c(X1, X2)), mean(c(Y1, Y2)) - 4.5,
labels = expression(d == sqrt((14 - 10)^2 + (16 - 12)^2)),
col = "black", cex = 0.9)
```
In some cases, **Manhattan distance** may be used instead. This metric calculates the total absolute difference across dimensions:
$$
d = |X_2 - X_1| + |Y_2 - Y_1|
$$
Unlike Euclidean distance, Manhattan distance follows a grid-like path (horizontal + vertical in Figure 2), making it more suitable for certain types of structured data or when outliers need to be minimized in influence [@aggarwal2015data].
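For the two points in Figure 2, both metrics are easy to compute directly. A quick sketch (using the same coordinates as the figure):

```{python, eval=FALSE}
import math

# Same points as in Figure 2
x1, y1 = 10, 12
x2, y2 = 14, 16

# Euclidean: straight-line distance between the points
euclidean = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)  # ~5.66

# Manhattan: sum of the absolute horizontal and vertical differences
manhattan = abs(x2 - x1) + abs(y2 - y1)                 # 8

print(f"Euclidean: {euclidean:.2f}, Manhattan: {manhattan}")
```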
##### 2. Neighbor Selection
The kNN algorithm exposes a parameter k that controls how many neighbors are
used to classify an unknown data point. The choice of k is important: a k
that is too large lets the majority class dominate the vote, biasing the
model and causing underfitting [@mucherino2009k], while a k that is too small
makes the algorithm overly sensitive to noise and outliers, which can cause
overfitting. Studies recommend using cross-validation or heuristic methods,
such as setting k to the square root of the dataset size, to determine an
optimal value [@syriopoulos2023k].
##### 3. Classification decision based on majority voting
Once the k-nearest neighbors are identified, the algorithm assigns the
new data point the most frequent class label among its neighbors. In
cases of ties, distance-weighted voting can be applied, where closer
neighbors have higher influence on the classification decision
[@uddin2022comparative].
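Putting the three steps together, the sketch below is a minimal from-scratch version of the classifier (Euclidean distance, unweighted majority vote); the tiny dataset is purely illustrative.

```{python, eval=FALSE}
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify one point by majority vote among its k nearest neighbors."""
    # Step 1: distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated classes in two dimensions
X_train = np.array([[1, 2], [2, 3], [3, 3], [8, 8], [9, 10], [10, 9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0
```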
### Assumptions
The k-Nearest Neighbors (kNN) algorithm operates under the assumption that data points with similar features exist in close proximity within the feature space and are therefore likely to belong to the same class [@boateng2020basic].
### Implementation of kNN
```{r, include=FALSE}
# Install DiagrammeR only if it is not already available
if (!requireNamespace("DiagrammeR", quietly = TRUE)) {
  install.packages("DiagrammeR", repos = "https://cloud.r-project.org")
}
```
```{r}
library(DiagrammeR)
grViz("
digraph {
graph [layout = dot, rankdir = LR, splines = true, size= 10]
node [shape = box, style = rounded, fillcolor = lightblue, fontname = Arial, fontsize = 25, penwidth = 2]
A [label = '1. Load Required Libraries',width=3, height=1.5]
B [label = '2. Import & Explore Dataset',width=3, height=1.5]
C [label = '3. Is preprocessing required?', shape = circle, fillcolor = lightblue, width=0.8, height=0.8, fontsize=25]
D [label = '3a. Pre-Process the data',width=3, height=1.5]
E [label = '4. Split Dataset into Training & Testing',width=3, height=1.5]
F [label = '5. Hyperparameter tuning',width=3, height=1.5]
G [label = '6. Train kNN Model',width=3, height=1.5]
H [label = '7. Make Predictions',width=3, height=1.5]
I [label = '8. Evaluate Model',width=3, height=1.5]
A -> B
B -> C
C -> E [label = 'No', fontsize=25]
C -> D [label = 'Yes', fontsize=25]
D -> E
E -> F
F -> G
G -> H
H -> I
#Edge Style
edge [color = '#8B814C', arrowhead = vee, penwidth = 2]
}
")
```
#### Pre-processing Data
Data must be prepared before implementing kNN. For the algorithm to work, we
need to handle missing values, make all values numeric, and normalize or
standardize the features. We can also improve accuracy by reducing
dimensionality, removing correlated features, and fixing class imbalance if
the data calls for it; a sketch of these steps appears after the list below.
1. **Handle missing values**: kNN works by calculating the distance
   between data points, and missing values can skew the results. We must
   handle missing values by either imputing them or dropping the affected rows.
2. **Make all values numeric**: kNN only handles numeric values, so all
   categorical values must be encoded using either one-hot encoding or
   label encoding.
3. **Normalize or standardize the features**: We must scale the features so
   that variables with larger numeric ranges do not dominate the distance
   calculation. The min-max scaler or the standard scaler can be used for this.
4. **Reduce dimensionality**: kNN can struggle to compute meaningful
   distances when there are too many features. Principal Component Analysis
   can reduce the number of features while preserving most of the variance.
5. **Remove correlated features**: The kNN works best when there aren't
too many features, so we can use a correlation matrix to see which
features we can drop. For example, it might be good to drop any
features that have low variance or have a high correlation over 0.9
because this can be redundant.
6. **Fix class imbalance**: Class imbalances can lead to a bias. We
noticed a class imbalance in our dataset and chose to use Synthetic
Minority Over-sampling Technique(SMOTE) in order to handle the
imbalance.
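The sketch below shows how these steps could be chained for this dataset; the column names follow the CDC data introduced later, and the specific choices (median imputation, a 0.9 correlation cutoff, standard scaling) are illustrative assumptions rather than the final pipeline.

```{python, eval=FALSE}
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Assumes cdc_data_df has already been loaded (see the Data Exploration section)
df = cdc_data_df.copy()

# 1. Handle missing values (this dataset has none; median imputation is a safe default)
df = df.fillna(df.median(numeric_only=True))

# 2. All CDC features are already numeric; otherwise encode categoricals, e.g.
#    df = pd.get_dummies(df, columns=["SomeCategoricalColumn"])

# 3. Standardize the features so no single feature dominates the distance calculation
X = df.drop(columns=["Diabetes_binary"])
y = df["Diabetes_binary"]
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# 4. (Optional) PCA could reduce dimensionality here; it is omitted in this sketch.

# 5. Drop one feature from any pair with |correlation| > 0.9 (none exist in this dataset)
corr = X_scaled.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_scaled = X_scaled.drop(columns=to_drop)

# 6. Fix class imbalance by oversampling the minority (diabetic) class with SMOTE
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_scaled, y)
print(y_resampled.value_counts())
```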
#### Hyperparameter Tuning
To increase the accuracy of the model, there are a few parameters we can
adjust (see the sketch after this list):
1. Find the optimal k parameter: We manually tested several k values and
   selected the one that provided the best balance of performance metrics.
2. Change the distance metric: kNN uses the Euclidean distance by default,
   but the Manhattan distance or another metric can be used instead.
3. Adjust the weights: kNN defaults to "uniform" weights, which give the
   same weight to all neighbors, but this can be changed to "distance" so
   that the closest neighbors carry more weight.
### Advantages and Limitations
One advantage of kNN is that it is easy to understand and implement, and it
can maintain good accuracy even with noisy data [@syriopoulos2023k]. A
serious limitation is its high computational cost and the large amount of
memory needed to calculate distances between all data points. kNN also has
low accuracy on multidimensional data containing irrelevant features
[@saxena2014diagnosis]. Because the distance must be computed to every data
point, kNN becomes slow when the number of data points grows very large, as
is the case with big data; it spends a significant amount of time calculating
distances across a large file [@deng2016efficient].
## Analysis and Results
### Data Exploration
We explored the [CDC Diabetes Health
Indicators](https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators)
dataset, sourced from the UC Irvine Machine Learning Repository. It is a
set of data that was gathered by the Centers for Disease Control and
Prevention (CDC) through the Behavioral Risk Factor Surveillance System
(BRFSS), which is one of the biggest continuous health surveys in the
United States.
The BRFSS is an annual telephone survey that has been ongoing since 1984
and each year, more than 400,000 Americans respond to the survey. It
provides important data on health behaviors, chronic diseases, and
preventive health care use to help researchers and policymakers
understand the health status and risks of the public.
To transfer the data, we used Python and the ucimlrepo package to import the
dataset directly from the UCI Machine Learning Repository, following the
recommended instructions. This allowed us to easily save, prepare, and
analyze the data for the current research.
```{r, include= FALSE}
library(reticulate)
# Tell reticulate to use the correct Python environment
use_python("C:/Users/Elena/miniconda3/envs/myenv/python.exe", required = TRUE)
# Install necessary packages into myenv (if needed)
py_install(c("pandas", "ucimlrepo", "scipy", "scikit-learn", "imbalanced-learn"))
```
```{python}
from ucimlrepo import fetch_ucirepo
import pandas as pd
# Fetch the dataset from UCI repository
cdc_data = fetch_ucirepo(id=891)
# Combine features and target into a single DataFrame
cdc_data_df = pd.concat([cdc_data.data.features, cdc_data.data.targets], axis=1)
# Save to CSV for R environment
cdc_data_df.to_csv("cdc_data.csv", index=False)
```
#### Data Composition
The dataset consists of **253,680** survey responses collected through the CDC Behavioral Risk Factor Surveillance System (BRFSS). It includes:
- **1 binary target variable**: Diabetes_binary
- **21 explanatory features** covering demographics, health conditions, lifestyle habits, and healthcare access.
This large-scale dataset is well-suited for modeling diabetes risk, providing a mix of binary, ordinal, and continuous variables.
The following table displays the first few rows of the CDC Diabetes Health Indicators dataset.
```{r}
library(readr)
library(knitr)
# Load dataset in R
cdc_data_df <- read_csv("cdc_data.csv", show_col_types = FALSE)
kable(head(cdc_data_df), caption = "The first few rows of the CDC Diabetes Health Indicators dataset")
```
#### Feature Overview
The variables fall into *four types*, each encoded to preserve meaning and support distance-based modeling:
**1. Target Variable (1)**
- *Diabetes_binary*: Binary classification (0 = No diabetes, 1 = Diabetes/prediabetes)
**2. Binary Variables (14)**
Encoded as 0 = No, 1 = Yes (except for Sex: 0 = Female, 1 = Male)
- *Health Conditions*: HighBP, HighChol, CholCheck, Stroke, HeartDiseaseorAttack
- *Lifestyle Factors*: Smoker, PhysActivity, Fruits, Veggies, HvyAlcoholConsump
- *Healthcare Access & Mobility*: AnyHealthcare, NoDocbcCost, DiffWalk, Sex
**3. Ordinal Variables (6)**
Encoded using ranked integers to reflect meaningful progression:
- *Self-Reported Health*: GenHlth, MentHlth, PhysHlth
- *Demographics*: Age, Education, Income
(Higher values represent worse health or higher socioeconomic levels.)
**4. Continuous Variable (1)**
- *BMI*: Numeric value for Body Mass Index
The table below provides a detailed breakdown of variable types, descriptions, and value ranges.
```{r}
# Load necessary packages
library(knitr)
# Create a Data Frame with Variable Information
table_data <- data.frame(
Type = c(
"Target",
"Binary", "", "", "", "", "", "", "", "", "", "", "", "", "",
"Ordinal", "", "", "", "", "",
"Continuous"
),
Variable = c(
"Diabetes_binary",
"HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack",
"PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare",
"NoDocbcCost", "DiffWalk", "Sex",
"GenHlth", "MentHlth", "PhysHlth", "Age", "Education", "Income",
"BMI"
),
Description = c(
"Indicates whether a person has diabetes",
"High Blood Pressure", "High Cholesterol", "Cholesterol check in the last 5 years",
"Smoked at least 100 cigarettes in lifetime", "Had a stroke", "History of heart disease or attack",
"Engaged in physical activity in the last 30 days", "Regular fruit consumption",
"Regular vegetable consumption", "Heavy alcohol consumption", "Has health insurance or healthcare access",
"Could not see a doctor due to cost", "Difficulty walking/climbing stairs", "Biological sex",
"Self-reported general health (1=Excellent, 5=Poor)",
"Number of mentally unhealthy days in last 30 days", "Number of physically unhealthy days in last 30 days",
"Age Groups (1 = 18-24, ..., 13 = 80+)",
"Highest education level (1 = No school, ..., 6 = College graduate)",
"Household income category (1 = <$10K, ..., 8 = $75K+)",
"Body Mass Index (BMI), measure of body fat"
),
Range = c(
"(0 = No, 1 = Yes)",
"(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)",
"(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)",
"(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)", "(0 = No, 1 = Yes)",
"(0 = No, 1 = Yes)", "(0 = Female, 1 = Male)",
"(1 = Excellent, ..., 5 = Poor)", "(0 - 30)", "(0 - 30)",
"(1 = 18-24, ..., 13 = 80+)", "(1 = No school, ..., 6 = College grad)",
"(1 = <$10K, ..., 8 = $75K+)", "(12 - 98)"
)
)
# Print Table with knitr::kable()
kable(table_data, caption = "Table 1. Summary of Explanatory Variables", align = "l")
```
```{python, echo=FALSE}
import pandas as pd
from ucimlrepo import fetch_ucirepo
# fetch dataset
cdc_diabetes_health_indicators = fetch_ucirepo(id=891)
# data (as pandas dataframes)
X = cdc_diabetes_health_indicators.data.features
y = cdc_diabetes_health_indicators.data.targets
cdc_data_df = pd.concat([cdc_diabetes_health_indicators.data.features,
cdc_diabetes_health_indicators.data.targets], axis=1)
exploratory_data_analysis = {
    "Data Quality Check": ["Number of Nulls", "Missing Data", "Duplicate Rows", "Total Rows"],
    "Count": [
        cdc_data_df.isna().sum().sum(),
        (cdc_data_df == " ").sum().sum(),
        cdc_data_df.duplicated().sum(),
        cdc_data_df.shape[0],
    ],
}
exploratory_data_analysis_df=pd.DataFrame(exploratory_data_analysis)
exploratory_data_analysis_df.to_csv("eda.csv", index=False)
```
#### Data Integrity Assessment
In this step, we checked for null values, missing data (NaNs), and
duplicate rows to ensure data integrity. Additionally, we identified
columns with invalid values such as strings with spaces in numeric
fields.
```{r}
library(knitr)
library(readr)
# Load the dataset
exploratory_df <- read_csv("eda.csv", show_col_types = FALSE)
# Print table with a new title (caption)
kable(exploratory_df, caption = "Table 2: Data Integrity Report")
```
There are no missing values, so no imputation is needed. However, 24,206
duplicate records were detected; these need to be analyzed to determine
whether they should be removed or down-weighted to prevent redundancy in
model training.
### Exploratory Data Analysis (EDA)
To effectively prepare data for a distance-based model like k-Nearest Neighbors (kNN), it's critical to understand the statistical properties of the features - including scale, variability, and the presence of outliers.
```{python, echo=FALSE, results='hide'}
df_stats = cdc_data_df.describe()
print(df_stats)
```
Figures 3 and 4 summarize the central tendencies and distributional characteristics of selected *ordinal* and *continuous* variables: *GenHlth, MentHlth, PhysHlth, Age, Education, Income, BMI*
#### Summary Statistics Heatmap
The heatmap below presents descriptive statistics for each variable, including mean, standard deviation, min/max, and quartiles.
```{python}
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Select ordinal + continuous variables
summary_cols = ['GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income', 'BMI']
# Calculate statistics
summary_stats = cdc_data_df[summary_cols].describe().loc[['mean', 'std', 'min', '25%', '50%', '75%', 'max']]
# Plot heatmap
plt.figure(figsize=(12, 6))
sns.heatmap(summary_stats, annot=True, cmap="YlGnBu", fmt=".1f")
plt.title("Figure 3: Summary Statistics Heatmap")
plt.tight_layout()
plt.show()
```
- **BMI** stands out with the largest range (min = 12.0, max = 98.0) and highest standard deviation (6.6) - indicating significant variability.
- **MentHlth** and **PhysHlth** also show wide spreads (standard deviations of 7.4 and 8.7, respectively), reinforcing the need for scaling to prevent these features from dominating distance calculations.
- **Age** also shows moderate variability (std = 3.1), which may impact distance calculations if not scaled.
- **Ordinal features** like GenHlth, Education, and Income are on a much smaller scale (e.g., GenHlth: 1-5), which may cause them to be underweighted unless scaling is applied.
Because features like BMI, MentHlth, PhysHlth, and Age have larger numeric ranges, they can disproportionately influence distance metrics in kNN. This is why feature scaling is essential - it ensures that each feature contributes fairly when calculating similarity.
#### Outliers in Distribution
The boxplot below further illustrates value distributions and highlights extreme values:
```{python}
import matplotlib.pyplot as plt
import seaborn as sns
# Select numeric ordinal and continuous variables
cols = ['GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income', 'BMI']
# Create boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(data=cdc_data_df[cols], orient="h", palette="Set2")
plt.title("Figure 4: Boxplot of Ordinal and Continuous Variables")
plt.xlabel("Value")
plt.tight_layout()
plt.show()
```
**Notable Outliers:**
- **MentHlth** and **PhysHlth** exhibit outliers up to 30 — these may reflect long-term health issues but skew distributions.
- **BMI** has a wide distribution with extreme values approaching 98, which can affect both scaling and model sensitivity.
Outliers can mislead distance calculations, making certain data points appear abnormally close or far in feature space.
**Practical Steps for Preprocessing:**
| Problem | Solution |
|-----------------|----------------------------------------|
| Scale imbalance | StandardScaler / MinMaxScaler |
| Outliers | RobustScaler / Clipping / Removal |
| Skewed features | Consider log or square root transform |
Applying these transformations ensures the distance metric used in kNN remains balanced and sensitive to meaningful differences across all feature dimensions.
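These remedies map directly onto scikit-learn transformers. As a brief, illustrative sketch on the BMI column only:

```{python, eval=FALSE}
from sklearn.preprocessing import StandardScaler, RobustScaler

bmi = cdc_data_df[["BMI"]]        # single-column frame for the scalers
bmi_series = cdc_data_df["BMI"]   # plain Series for clipping

# StandardScaler: zero mean / unit variance, but sensitive to the extreme values near 98
bmi_standard = StandardScaler().fit_transform(bmi)

# RobustScaler: centers on the median and scales by the IQR, so outliers carry less weight
bmi_robust = RobustScaler().fit_transform(bmi)

# Clipping: cap values at the 1st and 99th percentiles before scaling
bmi_clipped = bmi_series.clip(bmi_series.quantile(0.01), bmi_series.quantile(0.99))
```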
#### Class Imbalance in Diabetes Prevalence
A critical issue in classification problems is *target class imbalance*. For our **Diabetes_binary** variable, the majority class (No Diabetes) comprises over 86% of the dataset, while the minority class (Diabetes/Prediabetes) represents only about 14%.
```{python}
def plot_class_distribution():
import matplotlib.pyplot as plt
import seaborn as sns
target_variable = "Diabetes_binary"
if target_variable in cdc_data_df.columns:
class_counts = cdc_data_df[target_variable].value_counts()
class_percentages = cdc_data_df[target_variable].value_counts(normalize=True) * 100
plt.figure(figsize=(6, 4))
ax = sns.barplot(x=class_counts.index, y=class_counts.values, palette="Set2")
for i, value in enumerate(class_counts.values):
percentage = class_percentages[i]
ax.text(i, value + 1000, f"{value} ({percentage:.2f}%)", ha="center", fontsize=12)
plt.title(f"Figure 5: Class Distribution of {target_variable}")
plt.ylabel("Count")
plt.xlabel("Diabetes Status (0 = No, 1 = Diabetes/Prediabetes)")
plt.xticks([0, 1], ["No Diabetes", "Diabetes/Prediabetes"])
plt.tight_layout()
plt.show()
plot_class_distribution()
```
This imbalance can lead to biased model predictions, favoring the dominant class while under-detecting diabetes cases.
To handle this imbalance and improve classification performance, we can oversample the minority class (e.g., with SMOTE) or undersample the majority class. Another option is to weight neighbors by distance, for example by setting weights='distance' in KNeighborsClassifier, so that closer neighbors have more influence on the prediction.
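A small sketch of these two options, assuming a scaled training split (`X_train`, `y_train`) already exists:

```{python, eval=FALSE}
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier

# Option 1: oversample the minority class so both classes are equally represented
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 2: keep the original data but let closer neighbors dominate the vote
knn_weighted = KNeighborsClassifier(n_neighbors=11, weights="distance")
knn_weighted.fit(X_train, y_train)
```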
#### Correlation Analysis
To better understand how variables relate to each other - and to our target - we generated a correlation heatmap. This helps detect redundant features, multicollinearity, and potential predictors of diabetes.
```{python, echo=TRUE, message=FALSE, warning=FALSE}
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
corr_matrix = cdc_data_df.corr()
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(corr_matrix, ax=ax, annot=True, fmt=".2f", cmap="coolwarm", linewidths=0.5, vmin=-1, vmax=1)
ax.set_title("Figure 6: Feature Correlation Heatmap")
plt.tight_layout()
plt.show()
```
**Positive Correlations:**
• *General Health (GenHlth) is strongly correlated with Physical Health
(PhysHlth)* (0.52) and *Difficulty Walking (DiffWalk)* (0.45).
As individuals report poorer general health, they experience more
physical health issues and mobility limitations.
• *Physical Health (PhysHlth)* and *Difficulty Walking (DiffWalk)* (0.47)
show a strong link. Those with more days of poor physical health are
likely to struggle with mobility.
• *Age* correlates with *High Blood Pressure* (0.34) and *High
Cholesterol* (0.27), indicating an increased risk of cardiovascular
conditions as people get older.
• *Mental Health (MentHlth)* and *Physical Health (PhysHlth)* (0.34) are
positively associated. Worsening mental health often coincides with
physical health problems.
**Negative Correlations:**
• Higher *Income* is associated with better *General Health* (-0.33),
fewer *Mobility Issues* (-0.30), and better *Physical Health* (-0.24).
This suggests financial stability improves access to healthcare and
promotes a healthier lifestyle.
• Higher *Education* is linked to better *General Health* (-0.28) and
*Mental Health* (-0.19). Educated individuals may have better health
awareness and coping strategies.
The heatmap confirms well-known health trends: age, high blood pressure, and cholesterol are major risk factors for diabetes. Poor physical and mental health are strongly related, and socioeconomic status (income, education) plays a key role in overall health. These insights highlight the importance of early intervention strategies and lifestyle modifications to prevent chronic diseases like diabetes.
No pair of features exceeds a correlation of ±0.52, so **multicollinearity is not a concern**. No features need to be dropped due to redundancy.
These patterns support the need for early interventions and lifestyle-focused health strategies.
#### Age and BMI Density Analysis by Diabetes Status
Age and Body Mass Index (BMI) are both recognized as key risk factors for diabetes. To explore these relationships, we compared their distributions between individuals with and without diabetes.
**Age Distribution:**
Figure 7 illustrates the distribution of age categories for individuals with and without diabetes. Although age is represented as an ordinal variable (1-13, likely corresponding to increasing age groups), several trends are apparent:
- Individuals with *diabetes or prediabetes* show a *higher density in the upper age categories*, especially around values 10–13.
- Conversely, the *non-diabetic group* is more prominent in the mid-range age categories (around 8–11).
- The sharp peaks reflect the ordinal nature of the age variable and the likely grouping into discrete bands.
```{python}
plt.figure(figsize=(10, 6))
sns.kdeplot(data=cdc_data_df, x="Age", hue="Diabetes_binary", fill=True,
common_norm=False, palette={0: "#80cdc1", 1: "#d6604d"},
alpha=0.4, linewidth=1.5)
plt.title("Figure 7: Age Density by Diabetes Status")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend(title="Diabetes Status", labels=["No Diabetes (0)", "Diabetes/Prediabetes (1)"])
plt.tight_layout()
plt.show()
```
As expected, the prevalence of diabetes increases with age. This distribution confirms the importance of *age as a predictive feature* and suggests that older adults are at higher risk - aligning with clinical and epidemiological findings.
**BMI Distribution:**
BMI is a known risk factor for diabetes, and the analysis confirms that individuals with diabetes tend to have slightly higher BMI values on average. The **KDE plot** below shows a noticeable rightward shift in BMI values for diabetic individuals.
```{python}
# Set figure size
plt.figure(figsize=(10, 6))
# Ensure Diabetes_binary is integer for filtering and plotting
cdc_data_df["Diabetes_binary"] = cdc_data_df["Diabetes_binary"].astype(int)
# KDE plot for BMI distribution by diabetes status
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 0]['BMI'],
label='No Diabetes (0)', color="mediumaquamarine", fill=True)
sns.kdeplot(data=cdc_data_df[cdc_data_df['Diabetes_binary'] == 1]['BMI'],
label='Diabetes/Prediabetes (1)', color="salmon", fill=True)
# Titles and labels
plt.title('Figure 8: BMI Density by Diabetes Status', fontsize=16)
plt.xlabel('BMI', fontsize=14)
plt.ylabel('Density', fontsize=14)
plt.legend(title='Diabetes Status')
# Show plot
plt.show()
```
A significant portion of individuals with diabetes have BMI values above 30, supporting established links between obesity and diabetes risk. Despite this, there remains substantial overlap between the two groups, indicating that *BMI alone is not a definitive predictor* of diabetes.
These observations reinforce the importance of using *multiple factors* - not just BMI - when modeling diabetes risk.
#### EDA Summary
This Exploratory Data Analysis (EDA) provides a comprehensive overview
of the dataset’s structure, distributions, and key correlations. The
findings highlight several critical patterns:

- Diabetes prevalence is low (13.9%), leading to a class imbalance that may require resampling techniques.
- Age, BMI, and high blood pressure are strong risk factors for diabetes.
- Socioeconomic factors (income, education) influence health status, supporting the need for targeted interventions.

The next phase involves data preprocessing, feature selection, and model
development to enhance predictive performance.
### Modeling and Results
This section explores the performance of the k-Nearest Neighbors (kNN) algorithm for predicting diabetes using the CDC Behavioral Risk Factor Surveillance System dataset. The primary objective was to evaluate how various modeling choices - such as scaling techniques, distance metrics, SMOTE resampling, feature selection, and hyperparameter tuning - impact classification performance, especially for identifying the minority diabetic class.
We tested four distinct kNN configurations, progressively applying different strategies to improve the model:
1. **kNN 1:** Baseline model trained on the imbalanced dataset (original distribution) using Euclidean distance and uniform weights.
2. **kNN 2:** Tuned model with Manhattan distance, distance-based weighting, and RobustScaler to address outliers.
3. **kNN 3:** SMOTE-resampled model using standard preprocessing, to mitigate class imbalance.
4. **kNN 4:** Feature-selected model combining SMOTE and top 12 features (via chi-squared test), along with distance weighting.
Each model was evaluated using accuracy, ROC-AUC, recall, precision, and f1-score, with particular emphasis on recall for the diabetic class, given the critical importance of minimizing false negatives in healthcare applications.
Results from the four configurations are summarized in Table 3, highlighting the effect of different preprocessing and tuning choices on kNN’s ability to detect diabetes accurately and fairly.
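For orientation, each configuration follows the same train/evaluate pattern. The sketch below outlines the baseline setup (kNN 1) with illustrative parameters, assuming a feature matrix `X` and target `y` prepared as described in the preprocessing steps that follow; it is not the exact code used for the reported results.

```{python, eval=FALSE}
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, roc_auc_score

# Baseline: original class distribution, Euclidean distance, uniform weights
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=11)  # illustrative k
knn.fit(scaler.transform(X_train), y_train)

y_pred = knn.predict(scaler.transform(X_test))
y_prob = knn.predict_proba(scaler.transform(X_test))[:, 1]
print(classification_report(y_test, y_pred))
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
```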
#### Data Preprocessing
```{python, echo=FALSE}
# Import all libraries once
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
from imblearn.over_sampling import SMOTE
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from IPython.display import Markdown, display
import textwrap
```
Preprocessing is a critical step for models like k-Nearest Neighbors (kNN), which rely on distance-based calculations. If features are not properly scaled or class imbalance is not addressed, kNN’s performance - particularly for detecting minority classes - can degrade significantly.
The dataset originally contained 253,680 survey responses with 21 predictor variables and 1 binary outcome (Diabetes_binary). The following preprocessing steps were applied:
```{python}
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# 1. Remove duplicates
cdc_df = cdc_data_df.drop_duplicates()
# Define features and target
X = cdc_df.drop(columns=['Diabetes_binary'])
y = cdc_df['Diabetes_binary']
# Calculate class distribution
class_distribution = (y.value_counts(normalize=True) * 100).round(2)
# Print clean output
print(
"Class Distribution After Removing Duplicates (%):\n" +
"\n".join([f"Class {label}: {percent:.2f}%" for label, percent in class_distribution.items()])
)
```
- **Duplicates Removed:**