and can be visualized in [Voronoi diagrams](https://en.wikipedia.org/wiki/Voronoi_diagram).
```

### Implementation

We implement the $k$-means algorithm as a class as well. In the `__init__` method, we set the
number of clusters and the maximum number of iterations. Additionally, we already initialize the
class attributes `self.centroids` and `self.labels`, which are updated alternately during the
algorithm: `self.centroids` is a 2D array holding the centroid of each cluster in its rows, and
`self.labels` is a 1D array holding the cluster label ($0, \ldots, K-1$) of each data point.

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_init}}
```

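As a point of reference, a constructor consistent with this description could look roughly like
the following sketch; the attribute names are taken from the text, while the exact signature and
default value are assumptions:

```python
class KMeans:
    def __init__(self, n_clusters, num_iter=100):
        # Number of clusters K and maximum number of iterations
        self.n_clusters = n_clusters
        self.num_iter = num_iter
        # Updated alternately during fit():
        self.centroids = None  # later a 2D array of shape (n_clusters, n_features)
        self.labels = None     # later a 1D array with labels 0, ..., K-1
```
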
Then, we implement the `fit` method, which executes the algorithm as described above. Here, we use
[`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html)
to randomly select the indices of the data points that serve as the initial centroids. It draws
`self.n_clusters` unique indices from the range of available data points, given by `X.shape[0]`,
where `replace=False` ensures that the same data point is not selected more than once. Indexing
the data matrix `X` with these indices yields the initial centroids. After that, we update the
centroids and labels in a loop for `self.num_iter` iterations, as described above:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_fit}}
```

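To see the initialization step in isolation, the following standalone snippet (with made-up data,
independent of the class) mimics it:

```python
import numpy as np

X = np.random.rand(10, 2)  # 10 made-up data points with 2 features
n_clusters = 3

# Draw 3 unique row indices from 0, ..., 9; replace=False avoids duplicates
indices = np.random.choice(X.shape[0], n_clusters, replace=False)
centroids = X[indices]

print(indices)          # e.g. [7 2 5]
print(centroids.shape)  # (3, 2)
```
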
Here, we have assumed that we will implement the methods `assign_labels` and `compute_centroids`
later. Note once more that we can access the attributes `self.centroids` and `self.labels` within
all methods of the class, since they are defined as class attributes.

Within the `assign_labels` method, we first compute the distances between all data points and all
centroids. This is done by subtracting the centroids from the data points and computing the
Euclidean norm of the resulting difference vectors. However, because the data matrix `X` has the
shape `(n_points, n_features)`, while the centroids have the shape `(n_clusters, n_features)`,
we first need to expand the data matrix by an additional dimension, i.e.
`X.shape = (n_points, 1, n_features)`. Subtracting the centroids from the data points then
results in an array of shape `(n_points, n_clusters, n_features)`. The Euclidean norm is computed
across all features, i.e. along `axis=2`, so that the resulting array `distances` holds the
distances between all data points and all centroids. Finally, each data point is assigned to the
cluster with the smallest distance, which can be achieved with the `np.argmin` function along the
second axis, i.e. `axis=1`:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_assign_labels}}
```

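The broadcasting trick is easiest to see with explicit shapes. The following standalone snippet
(with arbitrary example arrays) traces the shapes through the computation:

```python
import numpy as np

X = np.random.rand(100, 2)        # (n_points, n_features)
centroids = np.random.rand(4, 2)  # (n_clusters, n_features)

# (100, 1, 2) - (4, 2) broadcasts to (100, 4, 2)
diff = X[:, np.newaxis, :] - centroids
distances = np.linalg.norm(diff, axis=2)  # (100, 4)

# For each data point, pick the cluster with the smallest distance
labels = np.argmin(distances, axis=1)     # (100,)
```
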
The calculation of the centroids is comparatively simple, since we only need to compute the mean
of the data points in each cluster $i = 0, \ldots, K-1$ and store the results in an array.
Remember that we can get the data points assigned to cluster $i$ by indexing the data matrix `X`
with the cluster labels, i.e. `X[labels == i]`. We use a list comprehension for this:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_compute_centroids}}
```

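Again as a standalone illustration with made-up arrays: the boolean mask `labels == i` selects the
rows of one cluster, and the list comprehension collects the per-cluster means:

```python
import numpy as np

X = np.random.rand(100, 2)
labels = np.random.randint(0, 4, size=100)  # pretend cluster assignments
n_clusters = 4

# Mean of all points assigned to cluster i, for i = 0, ..., K-1
centroids = np.array(
    [X[labels == i].mean(axis=0) for i in range(n_clusters)]
)
print(centroids.shape)  # (4, 2)
```
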
To stay true to the concept of the general ML class, we also implement the `predict` method,
which computes the cluster assignments for possibly new data points.

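Since its source is not shown here, the following is only a sketch of how `predict` could be
realized, assuming that `assign_labels` takes the data matrix as its argument and that `fit` has
already been called:

```python
def predict(self, X):
    # Assign each (possibly new) data point to its nearest centroid;
    # requires that fit() has been called so that self.centroids is set
    return self.assign_labels(X)
```
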
### Clustering of aptamers

We can now test our implementation of the $k$-means algorithm on the aptamer dataset. We will use
the representation of the aptamers in the principal component space, which we computed earlier
with PCA. With some imagination, we can see that the aptamers form four groups, which we want to
identify with the $k$-means algorithm.

Loading the data from the CSV file and converting it to a NumPy array is straightforward:

```python
{{#include ../codes/05-machine_learning/k_means.py:load_data_from_csv}}
```

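If you want to reproduce this without the book's source files, a hypothetical standalone version
could look like this (file name and layout are assumptions):

```python
import numpy as np

# Hypothetical file name and layout; the actual CSV may differ
X = np.genfromtxt("aptamers_pca.csv", delimiter=",", skip_header=1)
print(X.shape)  # (n_points, n_features)
```
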
We can now create an instance of the `KMeans` class and fit it to the data:

```python
{{#include ../codes/05-machine_learning/k_means.py:kmeans_fit_and_predict}}
```

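In essence, the call sequence boils down to the following lines; the constructor arguments are
assumptions based on the attributes named above:

```python
# Four clusters, as suggested by the PCA plot
kmeans = KMeans(n_clusters=4, num_iter=100)
kmeans.fit(X)
labels = kmeans.predict(X)
```
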
Plotting the result is also straightforward: we simply color the data points according to their
cluster labels. Additionally, we access the class attribute `centroids` to plot the centroids as
red crosses:

```python
{{#include ../codes/05-machine_learning/k_means.py:plot_kmeans_result}}
```

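A minimal version of such a plot with `matplotlib` could look like this (axis labels and styling
are assumptions):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Color each data point by its cluster label
ax.scatter(X[:, 0], X[:, 1], c=labels, s=15)
# Mark the centroids as red crosses
ax.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1],
           c="red", marker="x", s=100)
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
plt.show()
```
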
By inspecting the plot, we can see that the $k$-means algorithm correctly identifies the four
clusters of aptamers in most cases. Because the centroids are initialized randomly, the result
can vary slightly between runs.

![Data points colored by their cluster assignment with centroids from the $k$-means algorithm](../assets/figures/05-machine_learning/kmeans_pca.svg)