
Commit 7bee402

committed
updated KNN session
1 parent 30de5d0 commit 7bee402

3 files changed

Lines changed: 861 additions & 195 deletions

File tree

book/2_models/4_KNN.md

Lines changed: 137 additions & 189 deletions
Original file line number · Diff line number · Diff line change
@@ -17,192 +17,202 @@ myst:
1717

1818
# <i class="fa-solid fa-arrows-to-dot"></i> K-Nearest Neighbors
1919

20-
So far, we've explored regression problems where the outcome is numerical. But what if we want to predict **qualitative** outcomes—like a person’s mental health diagnosis based on their behavioral data? This is where **classification** methods come into play.
20+
The k-Nearest Neighbours (kNN) algorithm is one of the simplest and most intuitive methods for classification and regression tasks. It is a non-parametric, instance-based learning algorithm, which means it makes predictions based on the similarity of new instances to previously encountered data points, without assuming any specific distribution for the underlying data. The idea behind kNN is straightforward: to classify a new data point, the algorithm finds the k closest points in the training set (its 'neighbours') and assigns the most common class among those neighbours to the new instance. In the case of regression, the prediction is the average of the values of its nearest neighbours.
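As a quick illustration of that idea (a minimal sketch with made-up toy arrays, not part of the lesson's own code), scikit-learn's `KNeighborsClassifier` takes the majority vote among the neighbours, while `KNeighborsRegressor` averages their values:

```python
# Minimal sketch (toy data, not from the lesson): the same neighbour logic
# drives classification (majority vote) and regression (averaging).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_train = np.array([[1.0], [1.2], [1.4], [5.0], [5.2], [5.4]])
y_class = np.array([0, 0, 0, 1, 1, 1])               # class labels
y_value = np.array([1.0, 1.1, 1.3, 5.1, 5.0, 5.2])   # numeric targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_value)

x_new = np.array([[1.1]])
print(clf.predict(x_new))  # most common class among the 3 nearest points -> [0]
print(reg.predict(x_new))  # mean target of the 3 nearest points -> about 1.13
```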
2121

22+
We often evaluate kNN classification using the error rate (the proportion of misclassified observations). Here, the choice of $k$ is crucial for the algorithm's performance. A small $k$ (e.g., $k=1$) makes the classifier sensitive to noise, while a large $k$ may smooth out boundaries too much, leading to underfitting. Typically, $k$ is chosen through cross-validation to optimise predictive accuracy.
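For example (a hedged sketch with made-up label arrays, not from the lesson), the error rate is simply one minus the accuracy:

```python
# Hedged sketch: error rate = proportion of misclassified observations.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 1, 1, 2, 2])   # true classes
y_pred = np.array([0, 1, 1, 1, 2, 0])   # predictions from some classifier

error_rate = 1 - accuracy_score(y_true, y_pred)   # same as np.mean(y_true != y_pred)
print(f"Error rate: {error_rate:.2f}")            # 2 of 6 wrong -> 0.33
```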
2223

23-
One of the simplest yet effective classification algorithms is **k-Nearest Neighbors (kNN)**. The core idea is simple: observations that are similar tend to belong to the same class. In practice, this means that to classify a new data point, we look at the *k* closest points in the training data and assign the most common class label among them.
24+
Have a look at the following plot, which illustrates the concept with two clearly defined classes. The blue dot in the center represents a new data point, which we wish to classify based on other data points that are already labeled (Class A and Class B). To determine its class, we calculate the distance from this point to every point in both Class A and Class B. The dashed circle around the new example marks the radius up to the fifth-nearest neighbour, i.e. the boundary within which the algorithm searches for its neighbours.
2425

25-
```{figure} figures/KNN.png
26-
:alt: KNN
27-
:width: 800
28-
:align: center
2926

30-
source: https://www.ibm.com/think/topics/knn
31-
```
27+
```{code-cell} ipython3
28+
:tags: [remove-input]
29+
import numpy as np
30+
import seaborn as sns
31+
from matplotlib import pyplot as plt
32+
33+
np.random.seed(123)
34+
palette = sns.color_palette("Set2", 3)
35+
36+
class_a = np.random.randn(10, 2) + [-2, 2]
37+
class_b = np.random.randn(10, 2) + [2, -2]
38+
39+
new_example = np.array([[0, 0]])
40+
all_points = np.vstack([class_a, class_b])  # all labelled points
41+
42+
# Distances
43+
euclidean_distances = np.sqrt(((all_points - new_example) ** 2).sum(axis=1))
44+
sorted_indices = np.argsort(euclidean_distances)
45+
nearest_indices = sorted_indices[:5]
46+
radius = euclidean_distances[nearest_indices[-1]]
47+
48+
# Data points
49+
sns.scatterplot(x=class_a[:, 0], y=class_a[:, 1], color=palette[0], s=200, label='Class A')
50+
sns.scatterplot(x=class_b[:, 0], y=class_b[:, 1], color=palette[1], s=200, label='Class B')
51+
sns.scatterplot(x=new_example[:, 0], y=new_example[:, 1], color=palette[2], s=300)
52+
53+
# Annotate
54+
plt.annotate("Data to\nclassify", (0.1, 0.1), textcoords="offset points", xytext=(80, 40), ha='center',
55+
arrowprops=dict(arrowstyle='->', color='grey'))
56+
57+
# Lines
58+
for i, point in enumerate(all_points):
59+
color = 'black' if i in nearest_indices else 'grey'
60+
alpha = 1 if i in nearest_indices else 0.5
61+
plt.plot([new_example[0, 0], point[0]], [new_example[0, 1], point[1]], linestyle='--', color=color, alpha=alpha, zorder=-1)
62+
63+
# Decision boundary
64+
circle = plt.Circle((0, 0), radius, color='grey', linestyle='--', fill=False, alpha=0.5)
65+
plt.gca().add_artist(circle)
66+
plt.gca().set_aspect('equal', adjustable='box')
67+
68+
# Custom legend
69+
from matplotlib.lines import Line2D
70+
legend_elements = [
71+
Line2D([0], [0], marker='o', color='w', label='Class A', markerfacecolor=palette[0], markersize=10),
72+
Line2D([0], [0], marker='o', color='w', label='Class B', markerfacecolor=palette[1], markersize=10),
73+
Line2D([0], [0], linestyle='--', color='grey', label='Distances')
74+
]
75+
76+
plt.xlim(-6,6)
77+
plt.ylim(-5.5,5.5)
78+
plt.xticks([], [])
79+
plt.yticks([], [])
80+
plt.legend(handles=legend_elements, loc="lower left", frameon=False)
81+
plt.title("5-Nearest Neighbours")
82+
plt.show()
83+
```
3284

33-
While concepts like the **bias-variance trade-off** still apply, traditional regression metrics like Mean Squared Error aren't useful here. Instead, we often evaluate kNN classification using the **error rate**, the proportion of misclassified observations.
85+
Described algorithmically, kNN classification involves the following steps:
3486

35-
In summary, kNN is a straightforward and powerful method for classification, relying on the simple assumption that similar inputs lead to similar outputs. Its ease of use and intuitive appeal make it a foundational technique in machine learning. Let's have a look on a practical application of kNN.
87+
**Step 1: Neighbour Identification**
88+
Given a positive integer $k$ and an observation $x_0$, the kNN classifier first identifies the $k$ points in the training data that are closest to $x_0$, represented by the set $N_0$.
3689

37-
----------------------------------------------------------------
38-
## *Todays data - Iris dataset*
39-
As we’ve already worked with this dataset, it may look familiar. It contains measurements of three different iris species—Setosa, Versicolor, and Virginica—based on their sepal and petal lengths and widths.
90+
**Step 2: Conditional Probability Estimation**
91+
The classifier then estimates the conditional probability for class $j$ as the fraction of points in $N_0$ whose response values equal $j$:
4092

93+
$$P(Y = j \mid X = x_0) = \frac{1}{k} \sum_{i \in N_0} I(y_i = j)$$
4194

42-
```{code-cell}
43-
from sklearn import datasets
95+
where:
96+
- $I(y_i = j)$ is an indicator function that equals 1 if the label of the neighbour $y_i$ is class $j$, and 0 otherwise.
97+
- $k$ is the number of neighbours considered.
98+
- $N_0$ is the set of the $k$ nearest neighbours to $x_0$.
99+
100+
**Step 3: Classification Decision**
101+
Finally, the test observation $x_0$ is classified into the class with the largest estimated probability:
102+
103+
$$\hat{y} = \underset{j}{\operatorname{argmax}} \; P(Y = j \mid X = x_0)$$
104+
105+
106+
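To make these steps concrete, here is a small from-scratch sketch (the toy data and the helper name `knn_predict` are illustrative assumptions, not part of the lesson):

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k=5):
    # Step 1: identify the k training points closest to x0 (Euclidean distance)
    dists = np.sqrt(((X_train - x0) ** 2).sum(axis=1))
    neighbours = np.argsort(dists)[:k]                 # indices forming N_0

    # Step 2: estimate P(Y = j | X = x0) as the class fractions within N_0
    classes, counts = np.unique(y_train[neighbours], return_counts=True)
    probs = counts / k

    # Step 3: assign the class with the largest estimated probability
    return classes[np.argmax(probs)]

# Toy example: two well-separated clusters around (-2, -2) and (+2, +2)
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y_train = np.array([0] * 10 + [1] * 10)
print(knn_predict(np.array([1.5, 1.8]), X_train, y_train, k=5))  # most likely class 1
```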
## Today's Data: The Iris Dataset
107+
108+
You should already be familiar with the [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). It contains measurements for three different iris species — Setosa, Versicolor, and Virginica. To refresh your memory, let’s visualize the sepal length and width for the samples:
44109

45-
# Load the iris dataset
46-
iris = datasets.load_iris()
47-
```
48-
To refresh your memory, let’s visualize the dataset using the sepal length and sepal width features:
49110

50111
```{code-cell} ipython3
51-
:tags: [remove-input]
112+
import seaborn as sns
113+
import pandas as pd
114+
from sklearn import datasets
52115
53-
import matplotlib.pyplot as plt
54-
# Create a scatter plot for the first two features: sepal length and sepal width
55-
_, ax = plt.subplots()
56-
scatter = ax.scatter(iris.data[:, 0], iris.data[:, 1], c=iris.target)
57-
ax.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
58-
_ = ax.legend(
59-
scatter.legend_elements()[0], iris.target_names, loc="lower right", title="Classes"
60-
)
61-
```
62-
From the plot, you can already see that the Setosa species stands out clearly—it tends to have shorter and wider sepals, making it easily distinguishable. However, Versicolor and Virginica show more overlap, making them harder to separate using only these two dimensions.
116+
# Get data
117+
iris = datasets.load_iris(as_frame=True)
63118
119+
df = iris.frame
120+
df['class'] = pd.Categorical.from_codes(iris.target, iris.target_names)
64121
65-
The question is: Given a **new data point** with certain sepal measurements, how can we decide which iris species **it belongs to**? This is exactly the kind of problem that a classification algorithm like k-Nearest Neighbors is designed to solve.
122+
sns.scatterplot(data=df, x="sepal length (cm)", y="sepal width (cm)", hue="class");
123+
```
66124

125+
In the plot, you can already see that the Setosa species stands out clearly — it tends to have shorter and wider sepals, making it easily distinguishable. However, Versicolor and Virginica show more overlap, making them harder to separate using only these two dimensions.
67126

68-
Let's first prepare our dataset:
69127

70-
```{code-cell}
71-
import pandas as pd
128+
The question is: Given a **new data point** with certain sepal measurements, how can we decide which iris species **it belongs to**? This is the kind of problem that a classification algorithm like kNN is designed to solve.
72129

73-
# Defining features and target
74-
# Features: sepal length and width; target: type of flower
75130

76-
# Convert data (features) to a DataFrame
77-
targets = pd.DataFrame(iris.data, columns=iris.feature_names)
131+
Let's first prepare our dataset:
78132

79-
X = targets[["sepal length (cm)", "sepal width (cm)"]]
80-
y = iris.target
133+
```{code-cell} ipython3
134+
X = df[["sepal length (cm)", "sepal width (cm)"]]
135+
y = df["class"]
81136
```
82-
When training a kNN classifier, it's essential to **normalize the features**. This because kNN relies on distance calculations, and unscaled features can distort the results.
83137

84-
```{code-cell}
138+
When training a kNN classifier, it's important to normalize the features. This is because kNN relies on distance calculations, and unscaled features can distort the results. The `StandardScaler` from `sklearn` standardizes features by removing the mean and scaling them to unit variance:
139+
140+
```{code-cell} ipython3
85141
from sklearn.preprocessing import StandardScaler
86142
87-
# Scale the features using StandardScaler
88143
scaler = StandardScaler()
89-
# scale the entire feature set
90144
X_scaled = scaler.fit_transform(X)
91-
92145
```
93146

94-
-----------------------------------------
95-
96147
## KNN Classifier Implementation
97148

98149
```{margin}
99150
k is the number of nearest neighbors to use and is a hyperparameter
100151
```
101-
The choice of *k* plays a crucial role:
102-
- A small *k* (e.g., 1) makes the method sensitive to noise and outliers.
103-
- A larger *k* smooths the decision boundary, possibly at the cost of ignoring finer local structures.
104152

105-
####MICHA: please insert an INTERACTIVE PLOT here - YOU WILL FIND A PLOT WITH A FIXED DECISION BOUNDARY FURTHER DOWN IN THE SCRIPT
153+
The choice of $k$ plays a crucial role:
154+
- A small $k$ (e.g., 1) makes the method sensitive to noise and outliers.
155+
- A larger $k$ smooths the decision boundary, possibly at the cost of ignoring finer local structures.
106156

107157

108-
Unfortunately, there’s no magical formula to determine the best value for *k* in advance. Instead, we need to try out a range of values and use our best judgment to choose the one that works best.
158+
As with all hyperparameters, there is no magical formula to determine the best value for $k$ in advance. Instead, we need to try out a range of values and use our best judgment to choose the one that works best.
109159

110-
111-
To do this, we’ll fit the k-Nearest Neighbors model using different *k*-values within a specified range. To evaluate which value performs best, we use cross-validation — specifically, 5-fold cross-validation. Since cross-validation handles the splitting of the data into training and test sets internally, we don’t need to manually divide the dataset beforehand.
160+
To do this, we’ll fit the k-Nearest Neighbors model using different $k$-values within a specified range. To evaluate which value performs best, we use 5-fold cross-validation.
112161

113162
1) Identifying the best *k*!
114-
```{code-cell}
163+
164+
```{code-cell} ipython3
115165
import numpy as np
116166
from sklearn.neighbors import KNeighborsClassifier
117167
from sklearn.model_selection import cross_val_score
118168
119-
# range 1 to 45 in steps of one
120-
k_range= list(range(1, 46))
121-
122-
123-
# variable to store the accuracy scores in loop
124-
scores= []
169+
k_range = range(1, 60)
170+
accuracies = []
125171
126-
# loop trough the range of k using cross validation
172+
# Loop over all k values and save the accuracy
127173
for k in k_range:
128174
knn = KNeighborsClassifier(n_neighbors=k)
129-
score = cross_val_score(knn, X_scaled, y, cv=5) # get scores for each k
130-
scores.append(np.mean(score)) # append mean score to list
175+
accuracy = cross_val_score(knn, X_scaled, y, cv=5)
176+
accuracies.append(np.mean(accuracy))
131177
178+
# Plot
179+
fig, ax = plt.subplots()
180+
sns.lineplot(x=k_range, y=accuracies, marker='o')
181+
ax.set(xlabel="number of neighbours k", ylabel="accuracy");
132182
```
133183

134-
135-
```{code-cell}
136-
import matplotlib.pyplot as plt
137-
import seaborn as sns
138-
139-
sns.lineplot(x = k_range, y = scores, marker = 'o')
140-
plt.xlabel("K Values")
141-
plt.ylabel("Accuracy Score")
142-
```
143-
144-
```{code-cell}
145-
best_index = np.argmax(scores)
146-
# getting best k
147-
best_k = k_range[best_index]
148-
# getting accuracy of best k
149-
best_score = scores[best_index]
184+
```{code-cell} ipython3
185+
best_index = np.argmax(accuracies)
186+
best_k, best_accuracy = k_range[best_index], accuracies[best_index]
150187
151188
print(f"Best k: {best_k}")
152-
print(f"Accuracy with best k: {best_score:.4f}")
189+
print(f"Accuracy: {best_accuracy:.2f}")
153190
```
154191

155-
```{code-cell} ipython3
156-
:tags: [remove-input]
157-
158-
import numpy as np
159-
import pandas as pd
160-
import matplotlib.pyplot as plt
161-
from sklearn import datasets
162-
from sklearn.preprocessing import StandardScaler
163-
from sklearn.neighbors import KNeighborsClassifier
164-
165-
# Load iris data and select only 2 features for 2D plotting
166-
iris = datasets.load_iris()
167-
X = iris.data[:, :2] # only sepal length and width
168-
y = iris.target
169-
feature_names = iris.feature_names[:2]
192+
Let's visualise the decision boundary:
170193

171-
# Standardize the features
172-
scaler = StandardScaler()
173-
X_scaled = scaler.fit_transform(X)
194+
```{code-cell} ipython3
195+
# Create a DataFrame with the scaled features and the target
196+
df_scaled = pd.DataFrame(X_scaled, columns=["sepal length (cm)", "sepal width (cm)"])
197+
df_scaled['class'] = df['class']
174198
175-
# Use the best k from your cross-validation
199+
# Fit the KNN classifier
176200
knn = KNeighborsClassifier(n_neighbors=best_k)
177-
knn.fit(X_scaled, y)
201+
knn.fit(X_scaled, df['target'])  # numeric labels (0, 1, 2), so the mesh predictions can be contour-plotted
178202
179-
# Create a meshgrid to evaluate the model
203+
# Generate mesh grid for predicting and plotting the decision boundary
180204
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
181205
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
182-
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
183-
np.linspace(y_min, y_max, 300))
184-
grid = np.c_[xx.ravel(), yy.ravel()]
185-
Z = knn.predict(grid).reshape(xx.shape)
186-
187-
# Plot the decision boundary
188-
plt.figure(figsize=(8, 6))
189-
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
190-
191-
# Plot the training points
192-
for i, label in enumerate(iris.target_names):
193-
plt.scatter(
194-
X_scaled[y == i, 0],
195-
X_scaled[y == i, 1],
196-
label=label,
197-
edgecolor='k'
198-
)
199-
200-
plt.xlabel(feature_names[0])
201-
plt.ylabel(feature_names[1])
202-
plt.title(f"Decision Boundary with k = {best_k}")
203-
plt.legend(loc="lower right")
204-
plt.show()
206+
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
207+
np.linspace(y_min, y_max, 100))
205208
209+
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
210+
211+
# Plot
212+
fig, ax = plt.subplots()
213+
ax.contourf(xx, yy, Z, alpha=0.3, cmap="Set2")
214+
sns.scatterplot(data=df_scaled, x="sepal length (cm)", y="sepal width (cm)", hue="class", ax=ax, palette='Set2')
215+
ax.set(xlabel=df_scaled.columns[0], ylabel=df_scaled.columns[1], title=f"Decision Boundary with k = {best_k}");
206216
```
207217

208218
```{code-cell} ipython3
@@ -214,67 +224,5 @@ display_quiz("quiz/KNN.json", shuffle_answers=True)
214224
```{admonition} The choice of k
215225
:class: note
216226
217-
In datasets where class boundaries are clear and the data is clean, a high k can actually be beneficial. It provides a form of regularization by averaging over many neighbors. In contrast, for more complex or noisy datasets, such a high k could oversimplify the structure and reduce model performance.
218-
```
219-
220-
<br>
221-
<br>
222-
223-
2) Train the model using the best *k*!
224-
We can now train our model using the best *k* value using the code below.
225-
226-
**Hands on:**
227-
228-
229-
Now that we've determined the best value for k, we can go ahead and train and evaluate our final kNN model using this value. To properly assess how well the model performs on unseen data, we need to split our dataset into training and test sets. It's important to follow the correct sequence here: **first, we split the data — and only then do we scale it.**
230-
231-
232-
(TO MICHA: IS THIS STILL NEEDED AT ALL? WE ALREADY USED 5-FOLD CV ABOVE, SO TEST AND TRAINING DATA WERE ALREADY PART OF THE ACCURACY CALCULATION, RIGHT?? https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn THEY DID IT THIS WAY HERE, WHICH IS WHY I AM STILL DOING IT - BUT YOU CAN SIMPLY DELETE IT IF IT IS NOT NEEDED)
233-
234-
<iframe src="https://trinket.io/embed/python3/8c050c48e2b4" width="100%" height="356" frameborder="0" marginwidth="0" marginheight="0" allowfullscreen></iframe>
235-
236-
237-
```{code-cell} ipython3
238-
:tags: [remove-input]
239-
240-
# import packages
241-
import pandas as pd
242-
from sklearn import datasets
243-
from sklearn.metrics import accuracy_score
244-
from sklearn.neighbors import KNeighborsClassifier
245-
from sklearn.preprocessing import StandardScaler
246-
from sklearn.model_selection import train_test_split
247-
248-
249-
## PREPARE THE DATA
250-
iris = datasets.load_iris()
251-
252-
# Convert data (features) to a DataFrame
253-
targets = pd.DataFrame(iris.data, columns=iris.feature_names)
254-
255-
X = targets[["sepal length (cm)", "sepal width (cm)"]]
256-
y = iris.target
257-
258-
# Split the data into training and test sets
259-
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
260-
261-
# normalize the data
262-
scaler = StandardScaler()
263-
X_train = scaler.fit_transform(X_train)
264-
X_test = scaler.transform(X_test)
265-
266-
267-
268-
## TRAIN THE MODEL USING THE BEST K!
269-
# Train the model using training data
270-
knn = KNeighborsClassifier(n_neighbors=31)
271-
knn.fit(X_train, y_train)
272-
273-
#The model is now trained using the training data. Next, we can use it to make predictions on the test set.
274-
275-
# predict the feature-category with the trained model
276-
y_pred = knn.predict(X_test)
277-
278-
# check accuracy
279-
accuracy = accuracy_score(y_test, y_pred)
280-
```
227+
In datasets where class boundaries are clear and the data is clean, a higher $k$ can be beneficial. It provides a form of regularization by averaging over many neighbors. In contrast, for more complex or noisy datasets, a high $k$ could oversimplify the structure and reduce model performance.
228+
```
