
Commit 0597f4a
committed

Implement k-means clustering visualization and update documentation

- Added detailed implementation of the k-means algorithm, including initialization and fitting methods.
- Introduced visualizations for clustering results, showcasing the final cluster assignments and centroids.
- Updated documentation to reflect changes, enhancing clarity and educational value for users.
- Included new SVG figures to illustrate the k-means iterations and final clustering results.

1 parent b5b8640 commit 0597f4a

25 files changed
Lines changed: 19542 additions & 75 deletions

src/05-machine_learning/04-clustering.md

Lines changed: 29 additions & 38 deletions
and can be visualized in [Voronoi diagrams](https://en.wikipedia.org/wiki/Voronoi_diagram).
```

### Implementation

Of course, we implement the $k$-means algorithm as a class. In the `__init__` method
we set the number of clusters and the maximum number of iterations. Additionally,
we already initialize the class attributes `self.centroids` and `self.labels`, which will be updated alternately during the algorithm: `self.centroids` will be a 2D array holding in each row the centroid of one cluster, while `self.labels` will be a 1D array holding the cluster label ($0, \ldots, K-1$) of each data point.

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_init}}
```
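The included snippet is not reproduced here; a minimal sketch of what such an `__init__` could look like, assuming the attribute names used in the text (`n_clusters`, `num_iter`, `centroids`, `labels`) and not necessarily matching the actual code in `k_means.py`:

```python
import numpy as np


class KMeans:
    """Illustrative skeleton of the k-means class described in the text."""

    def __init__(self, n_clusters, num_iter=100):
        # Number of clusters K and maximum number of iterations.
        self.n_clusters = n_clusters
        self.num_iter = num_iter
        # Updated alternately during fitting:
        # centroids will become a (n_clusters, n_features) array,
        # labels a (n_points,) array of cluster indices 0..K-1.
        self.centroids = None
        self.labels = None
```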

Then, we implement the `fit` method, which executes the algorithm as described above. Here, we use [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select indices of data points to serve as the initial centroids: it selects `self.n_clusters` unique indices from the range of available data points, given by `X.shape[0]`, where `replace=False` ensures that the same data point is not selected more than once. By indexing the data matrix `X` with these indices, we obtain the initial centroids. After that, we alternately update the labels and centroids in a loop for `self.num_iter` iterations, as described above.

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_fit}}
```
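The initialization idiom described above can be sketched in isolation (toy data; the actual `fit` method lives in `k_means.py`):

```python
import numpy as np

# Toy data matrix: 10 points with 2 features each (illustrative values).
X = np.arange(20, dtype=float).reshape(10, 2)
n_clusters = 3

# Select 3 unique row indices; replace=False forbids duplicates.
idx = np.random.choice(X.shape[0], n_clusters, replace=False)

# Indexing X with the chosen indices yields the initial centroids.
initial_centroids = X[idx]  # shape (n_clusters, n_features)
```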

Here, we assume that we will implement the methods `assign_labels` and `compute_centroids` later. Note that we can access `self.centroids` and `self.labels` within all methods of the class, since they are defined as class attributes.

Within the `assign_labels` method, we first compute the distances between all data points and all centroids. This is done by subtracting the centroids from the data points and computing the Euclidean norm of the resulting difference vectors. Because the data matrix `X` has the shape `(n_points, n_features)` while the centroids have the shape `(n_clusters, n_features)`, we first expand the data matrix by an additional dimension, i.e. `X.shape = (n_points, 1, n_features)`. The subtraction of the centroids from the data points then broadcasts to an array of shape `(n_points, n_clusters, n_features)`. The Euclidean norm is computed across all features, i.e. `axis=2`, so the resulting array `distances` holds the distances between all data points and all $K$ centroids. Each data point is then assigned to the cluster with the smallest distance, which can be achieved with the `np.argmin` function along the cluster axis, i.e. `axis=1`.

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_assign_labels}}
```
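The broadcasting steps above can be traced on a tiny example (illustrative values, independent of the included snippet):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])  # (n_points, n_features)
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])    # (n_clusters, n_features)

# (n_points, 1, n_features) - (n_clusters, n_features)
# broadcasts to (n_points, n_clusters, n_features).
diff = X[:, np.newaxis, :] - centroids

# Euclidean norm across the feature axis gives (n_points, n_clusters).
distances = np.linalg.norm(diff, axis=2)

# Index of the nearest centroid for each point.
labels = np.argmin(distances, axis=1)
```

Here the first two points end up in cluster 0 and the last one in cluster 1.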

The calculation of the centroids is relatively simple, since we only need to compute the mean of the data points of each cluster $i = 0, \ldots, K-1$ and store the results in an array. Remember that we can get the data points assigned to cluster $i$ by indexing the data matrix `X` with a boolean mask, i.e. `X[labels == i]`. We use a list comprehension for this:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_compute_centroids}}
```
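The masking idiom can be checked on a small example (illustrative values, not the included code):

```python
import numpy as np

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
n_clusters = 2

# Mean of the points of each cluster, selected via a boolean mask.
centroids = np.array([X[labels == i].mean(axis=0) for i in range(n_clusters)])
```

Cluster 0 averages the first two points to `[1, 1]`, while cluster 1 keeps `[10, 10]`.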

To stay true to the concept of the general ML class, we also implement the `predict` method, which computes the cluster assignments for possibly new data points.
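For k-means, `predict` can plausibly just reuse `assign_labels` with the fitted centroids; the following stand-in class is hypothetical and only illustrates that idea, it is not the implementation from `k_means.py`:

```python
import numpy as np


class KMeansPredictSketch:
    """Hypothetical stand-in showing how predict() can reuse assign_labels()."""

    def __init__(self, centroids):
        # Pretend these centroids were already fitted.
        self.centroids = centroids

    def assign_labels(self, X):
        # Same broadcasting trick as described in the text.
        d = np.linalg.norm(X[:, np.newaxis, :] - self.centroids, axis=2)
        return np.argmin(d, axis=1)

    def predict(self, X):
        # New points are simply assigned to their nearest centroid.
        return self.assign_labels(X)
```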
### Clustering of aptamers

We can now test our implementation of the $k$-means algorithm on the aptamer dataset. We will use the representation of the aptamers in the principal component space, which we computed earlier with the PCA. With some imagination, we can see that the aptamers form four groups, which we want to identify with the $k$-means algorithm.

Loading the data from the CSV file and converting it to a NumPy array is straightforward:

```python
{{#include ../codes/05-machine_learning/k_means.py:load_data_from_csv}}
```
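A minimal sketch of such a loading step, with a hypothetical in-memory stand-in for the real CSV file (the actual file name, columns, and loading code may differ):

```python
import io

import numpy as np

# Stand-in for the real CSV file: a header row plus two PCA columns.
csv_text = "pc1,pc2\n0.5,1.2\n-0.3,0.8\n1.1,-0.4\n"

# skiprows=1 skips the header; the result is already a float ndarray.
X = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)
```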

We can now create an instance of the `KMeans` class and fit it to the data:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_fit_and_predict}}
```
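Putting the pieces described above together, a complete, self-contained sketch of the whole workflow might look as follows (toy data; attribute and method names follow the text, not necessarily the actual `k_means.py`):

```python
import numpy as np


class KMeans:
    """Minimal k-means sketch assembling the steps from the text."""

    def __init__(self, n_clusters, num_iter=100):
        self.n_clusters = n_clusters
        self.num_iter = num_iter
        self.centroids = None
        self.labels = None

    def assign_labels(self, X):
        # Broadcasted distances (n_points, n_clusters), then nearest centroid.
        d = np.linalg.norm(X[:, np.newaxis, :] - self.centroids, axis=2)
        return np.argmin(d, axis=1)

    def compute_centroids(self, X):
        # Mean of the points of each cluster via boolean-mask indexing.
        return np.array([X[self.labels == i].mean(axis=0)
                         for i in range(self.n_clusters)])

    def fit(self, X):
        # Random data points as initial centroids, then alternate updates.
        idx = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centroids = X[idx]
        for _ in range(self.num_iter):
            self.labels = self.assign_labels(X)
            self.centroids = self.compute_centroids(X)
        return self

    def predict(self, X):
        return self.assign_labels(X)


# Two well-separated toy blobs instead of the aptamer data.
np.random.seed(0)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
labels = KMeans(n_clusters=2, num_iter=10).fit(X).predict(X)
```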
Plotting the result is also straightforward: we simply color the data points according to their cluster labels. Additionally, we access the class attribute `centroids` to plot the centroids as red crosses:

```python
{{#include ../codes/05-machine_learning/k_means.py:plot_kmeans_result}}
```
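A sketch of such a plot with hypothetical data and axis labels (the real snippet in `k_means.py` may differ in details):

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical 2D points, labels, and centroids as produced above.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.1, 0.05], [5.1, 5.0]])

fig, ax = plt.subplots()
# Color the points by cluster label; mark the centroids as red crosses.
ax.scatter(X[:, 0], X[:, 1], c=labels)
ax.scatter(centroids[:, 0], centroids[:, 1], c="red", marker="x", s=100)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```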
By inspecting the plot, we can see that the $k$-means algorithm correctly identifies the four clusters of aptamers in most cases. Because the centroids are initialized randomly, the result can vary slightly between runs.

![Result of the $k$-means algorithm on the aptamer dataset](../assets/figures/05-machine_learning/k_means_aptamers.svg)
