So far, we've explored regression problems where the outcome is numerical. But what if we want to predict **qualitative** outcomes—like a person’s mental health diagnosis based on their behavioral data? This is where **classification** methods come into play.
The k-Nearest Neighbours (kNN) algorithm is one of the simplest and most intuitive methods for classification and regression tasks. It is a non-parametric, instance-based learning algorithm, which means it makes predictions based on the similarity of new instances to previously encountered data points, without assuming any specific distribution for the underlying data. The idea behind kNN is straightforward: to classify a new data point, the algorithm finds the k closest points in the training set (its 'neighbours') and assigns the most common class among those neighbours to the new instance. In the case of regression, the prediction is the average of the values of its nearest neighbours.
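For regression, this amounts to averaging the responses of the nearest neighbours: if $N_0$ denotes the set of the $k$ training points closest to a query point $x_0$, the prediction is

$$
\hat{y}(x_0) = \frac{1}{k} \sum_{i \in N_0} y_i .
$$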
While concepts like the **bias-variance trade-off** still apply, traditional regression metrics such as the Mean Squared Error aren't useful here. Instead, we often evaluate kNN classification using the **error rate**, the proportion of misclassified observations. The choice of $k$ is crucial for the algorithm's performance: a small $k$ (e.g., $k=1$) makes the classifier sensitive to noise, while a large $k$ may smooth out boundaries too much, leading to underfitting. Typically, $k$ is chosen through cross-validation to optimise predictive accuracy.
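Concretely, for $n$ observations with true classes $y_i$ and predicted classes $\hat{y}_i$, the error rate is

$$
\text{error rate} = \frac{1}{n} \sum_{i=1}^{n} I\left(y_i \neq \hat{y}_i\right),
$$

where $I(\cdot)$ equals 1 if observation $i$ is misclassified and 0 otherwise. The accuracy we will compute with scikit-learn below is simply one minus the error rate.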
Have a look at the following plot, which illustrates the concept with two clearly defined classes. The blue dot in the center represents a new data point that we wish to classify based on the already labeled points (Class A and Class B). To determine its class, we calculate the distance from this point to every point in both classes. The dashed circle around the new point marks the radius up to its fifth-nearest neighbour, i.e. the region within which the algorithm searches for neighbours when $k = 5$.
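If you want to recreate a figure like this yourself, the following sketch does so with made-up data; the two class locations, the query point, and the choice of $k = 5$ are arbitrary choices for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Two made-up, already-labelled classes and a new point to classify
class_a = rng.normal(loc=[2.0, 2.0], scale=0.6, size=(20, 2))
class_b = rng.normal(loc=[4.0, 4.0], scale=0.6, size=(20, 2))
new_point = np.array([3.0, 3.0])

# Distance from the new point to every labelled point
all_points = np.vstack([class_a, class_b])
distances = np.linalg.norm(all_points - new_point, axis=1)

# Radius up to the fifth-nearest neighbour (k = 5)
radius = np.sort(distances)[4]

fig, ax = plt.subplots()
ax.scatter(class_a[:, 0], class_a[:, 1], label="Class A")
ax.scatter(class_b[:, 0], class_b[:, 1], label="Class B")
ax.scatter(*new_point, color="blue", s=80, label="New data point")
ax.add_patch(plt.Circle(new_point, radius, fill=False, linestyle="--"))
ax.set_aspect("equal")
ax.legend();
```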
Described algorithmically, the kNN classifier proceeds in the following steps:
**Step 1: Neighbour Identification**
Given a positive integer $k$ and an observation $x_0$, the kNN classifier first identifies the $k$ points in the training data that are closest to $x_0$, represented by the set $N_0$.
**Step 2: Conditional Probability Estimation**
The classifier then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:

$$
\Pr(Y = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j)
$$

Finally, $x_0$ is assigned to the class with the largest estimated probability, i.e. the most common class among its $k$ neighbours.
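To make the two steps concrete, here is a minimal from-scratch sketch in NumPy. The toy data and the helper function `knn_predict` are purely illustrative; below we will of course rely on scikit-learn's implementation instead:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k=5):
    # Step 1: find the k training points closest to x0 (Euclidean distance)
    distances = np.linalg.norm(X_train - x0, axis=1)
    neighbour_idx = np.argsort(distances)[:k]

    # Step 2: estimate class probabilities as fractions within the neighbourhood
    votes = Counter(y_train[neighbour_idx])
    probabilities = {label: count / k for label, count in votes.items()}

    # Assign the class with the highest estimated probability
    return max(probabilities, key=probabilities.get), probabilities

# Tiny, made-up example with two classes (0 and 1)
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.0], [3.2, 2.9]])
y_train = np.array([0, 0, 0, 1, 1])
knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3)
```

Let's now look at a practical application of kNN.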
You should already be familiar with the [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). It contains measurements for three different iris species — Setosa, Versicolor, and Virginica. To refresh your memory, let’s visualize the sepal length and width for the samples:
```{code-cell} ipython3
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the iris dataset and store it in a DataFrame
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["class"] = iris.target

# Scatter plot of the first two features: sepal length and sepal width
sns.scatterplot(data=df, x="sepal length (cm)", y="sepal width (cm)", hue="class");
```
In the plot, you can already see that the Setosa species stands out clearly — it tends to have shorter and wider sepals, making it easily distinguishable. However, Versicolor and Virginica show more overlap, making them harder to separate using only these two dimensions.
The question is: Given a **new data point** with certain sepal measurements, how can we decide which iris species **it belongs to**? This is the kind of problem that a classification algorithm like kNN is designed to solve.

Let's first prepare our dataset:
```{code-cell} ipython3
# Features: sepal length and width; target: flower class
X = df[["sepal length (cm)", "sepal width (cm)"]]
y = df["class"]
```
When training a kNN classifier, it's important to normalize the features. This is because kNN relies on distance calculations, and unscaled features can distort the results. The `StandardScaler` from `sklearn` standardizes features by removing the mean and scaling them to unit variance:
```{code-cell} ipython3
from sklearn.preprocessing import StandardScaler

# Scale the entire feature set
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
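As a quick sanity check, each standardized feature should now have (approximately) zero mean and unit variance:

```python
# Columns of X_scaled: mean ~ 0, standard deviation ~ 1
print(X_scaled.mean(axis=0).round(2))
print(X_scaled.std(axis=0).round(2))
```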
## KNN Classifier Implementation
```{margin}
k is the number of nearest neighbors to use and is a hyperparameter
```

The choice of $k$ plays a crucial role:

- A small $k$ (e.g., 1) makes the method sensitive to noise and outliers.
- A larger $k$ smooths the decision boundary, possibly at the cost of ignoring finer local structures.
As with all hyperparameters, there is no magical formula to determine the best value for $k$ in advance. Instead, we need to try out a range of values and use our best judgment to choose the one that works best.
To do this, we’ll fit the k-Nearest Neighbors model using different $k$-values within a specified range and evaluate each value with 5-fold cross-validation. Since cross-validation handles the splitting of the data into training and test sets internally, we don’t need to divide the dataset manually beforehand.
1) Identifying the best *k*!
```{code-cell} ipython3
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

k_range = list(range(1, 60))
accuracies = []

# Loop over all k values and save the mean cross-validated accuracy
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, X_scaled, y, cv=5)
    accuracies.append(np.mean(accuracy))

# Plot the accuracy as a function of k
fig, ax = plt.subplots()
sns.lineplot(x=k_range, y=accuracies, marker='o')
ax.set(xlabel="k (number of neighbours)", ylabel="accuracy");
```
In datasets where class boundaries are clear and the data is clean, a higher $k$ can be beneficial. It provides a form of regularization by averaging over many neighbors. In contrast, for more complex or noisy datasets, a high $k$ could oversimplify the structure and reduce model performance.
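If you also want a single final model, one option is to pick the $k$ with the highest cross-validated accuracy from above and refit on a training split. The sketch below assumes the variables `k_range`, `accuracies`, `X`, and `y` from the earlier cells are still available; note that we split first and only then scale, fitting the scaler on the training data alone:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Best k according to the cross-validated accuracies computed above
best_k = k_range[int(np.argmax(accuracies))]

# Split first, then scale: the scaler is fit on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the final model with the best k and evaluate it on the held-out test set
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)
```

Since the cross-validation above already gives an honest estimate of predictive accuracy, this extra hold-out evaluation is optional.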