
Commit 977ebab

Enhance k-means clustering documentation and visualization
- Updated the clustering section to clarify the k-means algorithm, including improved descriptions of assignment and reconstruction rules.
- Added an animated GIF to illustrate the evolution of clusters during the k-means iterations, enhancing the educational value of the documentation.
- Replaced SVG figure outputs with PNG format for better compatibility and visualization in the documentation.
- Improved clarity in the code comments related to centroid initialization and plotting steps.
1 parent 323cbdb commit 977ebab

47 files changed

Lines changed: 26 additions & 18028 deletions


src/05-machine_learning/04-clustering.md

Lines changed: 25 additions & 22 deletions
@@ -1,21 +1,21 @@
 ## Clustering
 
-Clustering is another widely used method of unsupervised learning. This means that our datapoints $\mathcal{D} := \{\vec{x}_i\}_{i=1, \dots, N}$ are without any labels. Our goal is to group them into $K$ clusters, where data points within a cluster are *as similar as possible* and data points between clusters are *as different as possible*.
+Clustering is another widely used method of unsupervised learning. This means that our datapoints $\mathcal{D} := \{\vec{x}_i\}_{i=1, \dots, N}$ have no labels. Our goal is to group them into $K$ clusters, where data points within a cluster are *as similar as possible* and data points between clusters are *as different as possible*.
 
 Before we start, let's list some properties we expect from a clustering algorithm:
 
-1. A general **assignment rule**, which assigns each data point to a cluster, i.e. $\vec{x}_i \mapsto k \in \{1, \ldots, K\}$ for $i = 1, \ldots, N$.
-2. A **reconstruction rule**, which for each cluster $k \in \{1, \ldots, K\}$ determines a representative element $\vec{m}_k$, i.e. $k \mapsto \vec{m}_k \in \mathbb{R}^n$ for $k \in \{1, \ldots, K\}$.
+1. A general **assignment rule**, which assigns each data point to a cluster, i.e., $\vec{x}_i \mapsto k \in \{1, \ldots, K\}$ for $i = 1, \ldots, N$.
+2. A **reconstruction rule**, which for each cluster $k \in \{1, \ldots, K\}$ determines a representative element $\vec{m}_k$, i.e., $k \mapsto \vec{m}_k \in \mathbb{R}^n$ for $k \in \{1, \ldots, K\}$.
 
 ### $k$-Means Clustering
 
-As the name suggests, the $k$-means algorithm is based on the idea of defining this representative element as the geometrical mean of the data points in the cluster. To formulate the $k$-means algorithm, we first fix the number of clusters $K$ and define two quantities, $\mathbf{C} := (C_1, \ldots, C_K)$, which contains the subsets $C_k \subseteq \mathcal{D}$ of the data points that are assigned to cluster $k$, and $\mathbf{M} := (\vec{m}_1, \ldots, \vec{m}_K)$, which contains the mean values of the clusters.
+As the name suggests, the $k$-means algorithm defines the representative element as the arithmetic mean of the data points in the cluster. To formulate the $k$-means algorithm, we first fix the number of clusters $K$ and define two quantities: $\mathbf{C} := (C_1, \ldots, C_K)$, which contains the subsets $C_k \subseteq \mathcal{D}$ of the data points assigned to cluster $k$, and $\mathbf{M} := (\vec{m}_1, \ldots, \vec{m}_K)$, which contains the mean values of the clusters.
 
 ```admonish note title="Note"
-The union of the clusters must cover the entire data set $\mathcal{D}$, i.e. $\bigcup_{k=1}^K C_k = \mathcal{D}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, i.e. a data point cannot be assigned to multiple clusters at the same time.
+The union of the clusters must cover the entire dataset $\mathcal{D}$, i.e., $\bigcup_{k=1}^K C_k = \mathcal{D}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, meaning a data point cannot be assigned to multiple clusters simultaneously.
 ```
 
-The $k$-means algorithm then iteratively alternates between updating the cluster variable $\mathbf{C}$ and the mean values $\mathbf{M}$. For an initial clustering $\mathbf{C}$, we first **compute the mean value of each cluster as the mean value of the data points in that cluster**:
+The $k$-means algorithm iteratively alternates between updating the cluster variable $\mathbf{C}$ and the mean values $\mathbf{M}$. For an initial clustering $\mathbf{C}$, we first **compute the mean value of each cluster as the average of the data points in that cluster**:
 
 $$
 \vec{m}_k \leftarrow \frac{1}{|C_k|} \sum_{\vec{x}_i \in C_k} \vec{x}_i\,,
@@ -27,11 +27,11 @@ $$
 C_k \leftarrow \{\vec{x}_i \in \mathcal{D} \mid \|\vec{x}_i - \vec{m}_k\| \leq \|\vec{x}_i - \vec{m}_j\| \text{ for all } j \neq k\}\,.
 $$
 
-This corresponds to the assignment rule. These two steps are then repeated iteratively for a given number of iterations, or until the clusters no longer change.
+This corresponds to the assignment rule. These two steps are repeated iteratively for a given number of iterations, or until the clusters no longer change.
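Taken together, the two rules already form the whole algorithm. A minimal NumPy sketch of this alternation (the toy data, the initial labels, and $K = 2$ are made up for illustration; this is not the chapter's implementation, which follows below):

```python
import numpy as np

# Toy dataset: two obvious groups in the plane
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels = np.array([0, 0, 1, 1, 0, 1])  # some (bad) initial clustering C

for _ in range(10):
    # Reconstruction rule for M: mean of the points currently in each cluster
    M = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # Assignment rule for C: each point joins the cluster with the nearest mean
    new_labels = np.argmin(np.linalg.norm(X[:, None, :] - M, axis=2), axis=1)
    if np.array_equal(new_labels, labels):  # clusters no longer change -> stop
        break
    labels = new_labels
```

On this toy data the loop recovers the two groups after a couple of iterations.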
 
 ```admonish info title="Voronoi cells"
-An alternative view on the clusters assigned by the $k$-means algorithm are the so-called *Voronoi cells*, which are defined as:
+An alternative view of the clusters assigned by the $k$-means algorithm are the so-called *Voronoi cells*, which are defined as:
 
 $$
 V_k := \{\vec{x} \in \mathbb{R}^n \mid \|\vec{x} - \vec{m}_k\| \leq \|\vec{x} - \vec{m}_j\| \text{ for all } j \neq k\}
@@ -42,42 +42,40 @@ and can be visualized in [Voronoi diagrams](https://en.wikipedia.org/wiki/Voronoi_diagram)
 
 ### Implementation
 
-Of course, we implement the $k$-means algorithm as a class. In the `__init__` method
-we set the number of clusters and the maximum number of iterations. Additionally,
-we already initialize the class attributes `self.centroids` and `self.labels`, which will be updated alternately during the algorithm. `self.centroids` will be, of course, a 2D array holding in each row the centroid of a cluster. `self.labels` will be a 1D array holding the cluster label ($0, \ldots, K-1$) for each data point.
+We implement the $k$-means algorithm as a class. In the `__init__` method, we set the number of clusters and the maximum number of iterations. Additionally, we initialize the class attributes `self.centroids` and `self.labels`, which will be updated alternately during the algorithm. `self.centroids` will be a 2D array holding the centroid of each cluster in each row. `self.labels` will be a 1D array holding the cluster label ($0, \ldots, K-1$) for each data point.
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:kmeans_init}}
 ```
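The included snippet is not expanded on this page; as a rough sketch of what such a constructor might look like (the default value and exact attribute handling are assumptions):

```python
class KMeans:
    def __init__(self, n_clusters, num_iter=100):
        self.n_clusters = n_clusters  # the number of clusters K
        self.num_iter = num_iter      # maximum number of iterations
        # updated alternately while the algorithm runs:
        self.centroids = None  # becomes a 2D array, one centroid per row
        self.labels = None     # becomes a 1D array of labels 0, ..., K-1
```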
 
-Then, we implement the `fit` method, which executes the algorithm as described above. Here, we have used [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select indices of data points to serve as the initial centroids. The function `np.random.choice` selects `self.n_clusters` unique indices from the range of available data points, given by `X.shape[0]`. The parameter `replace=False` ensures that the same data point is not selected more than once. By indexing the data matrix `X` with these indices, we obtain the initial centroids. After that, we update the centroids and labels in a loop as described above for `self.num_iter` iterations.
+Next, we implement the `fit` method, which executes the algorithm as described above. Here, we use [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select indices of data points to serve as the initial centroids. The function `np.random.choice` selects `self.n_clusters` unique indices from the range of available data points, given by `X.shape[0]`. The parameter `replace=False` ensures that the same data point is not selected more than once. By indexing the data matrix `X` with these indices, we obtain the initial centroids. After that, we update the centroids and labels in a loop as described above for `self.num_iter` iterations.
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:kmeans_fit}}
 ```
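Since the include is not rendered here, a self-contained sketch of this flow may help. The two helper methods are stubbed with one-liners so the snippet runs on its own; all names mirror the text, but the bodies are assumptions, not the chapter's actual code:

```python
import numpy as np

class KMeans:
    def __init__(self, n_clusters, num_iter=100):
        self.n_clusters = n_clusters
        self.num_iter = num_iter
        self.centroids = None
        self.labels = None

    def fit(self, X):
        # np.random.choice draws n_clusters unique row indices (replace=False),
        # so the initial centroids are distinct data points from X
        idx = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centroids = X[idx]
        # alternate between the assignment and the reconstruction rule
        for _ in range(self.num_iter):
            self.labels = self.assign_labels(X)
            self.centroids = self.compute_centroids(X, self.labels)

    # stand-ins for the methods discussed next
    def assign_labels(self, X):
        return np.argmin(np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2), axis=1)

    def compute_centroids(self, X, labels):
        return np.array([X[labels == k].mean(axis=0) for k in range(self.n_clusters)])
```

On well-separated toy data, `fit` converges to the obvious grouping regardless of which points happen to be drawn as initial centroids.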
 
-Here, we actually assumed that we will implement the methods `assign_labels` and `compute_centroids` later.
+Here, we assume that we will implement the methods `assign_labels` and `compute_centroids` separately.
 
-Within the `assign_labels` method, we first compute the distances between all data points and all centroids. This is done by subtracting the centroids from the data points and then computing the euclidean norm of the resulting vectors. However, because the shape of the data matrix `X` is `(n_points, n_features)`, while the shape of the centroids is `(n_clusters, n_features)`, we need to expand the data matrix to match the shape of the centroids. This is done by adding an additional dimension to the data matrix, i.e. `X.shape = (n_points, 1, n_features)`. The subtraction of the centroids from the data points then results in an array of shape `(n_points, n_clusters, n_features)`. The euclidean norm of the resulting vectors is then computed accross all features, i.e. `axis=2`. The resulting array `distances` then holds the distances between all data points and all centroids. The assignment of the data points to the clusters is then done by selecting the cluster with the smallest distance for each data point, which can be achieved by using the `np.argmin` function, along the second axis, i.e. `axis=1`.
+Within the `assign_labels` method, we first compute the distances between all data points and all centroids. This is done by subtracting the centroids from the data points and then computing the Euclidean norm of the resulting vectors. However, because the shape of the data matrix `X` is `(n_points, n_features)`, while the shape of the centroids is `(n_clusters, n_features)`, we need to expand the data matrix to match the shape of the centroids. This is done by adding an additional dimension to the data matrix, representing the different clusters. The new shape of the data matrix is then `(n_points, 1, n_features)`. The subtraction of the centroids from the data points then results in an array of shape `(n_points, n_clusters, n_features)`. The Euclidean norm of the resulting vectors is computed across all features, i.e., `axis=2`. The resulting array `distances` holds the distances between all data points and all centroids. The assignment of data points to clusters is then done by selecting the cluster with the smallest distance for each data point, which can be achieved using the `np.argmin` function along the second axis, i.e., `axis=1`.
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:kmeans_assign_labels}}
 ```
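The shapes involved in this broadcasting step can be checked in isolation; here is a standalone sketch with made-up toy arrays (not the included code):

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])  # (n_points, n_features) = (3, 2)
centroids = np.array([[0.5, 0.5], [8.0, 8.0]])      # (n_clusters, n_features) = (2, 2)

diff = X[:, np.newaxis, :] - centroids    # (3, 1, 2) - (2, 2) broadcasts to (3, 2, 2)
distances = np.linalg.norm(diff, axis=2)  # Euclidean norm over the features -> (3, 2)
labels = np.argmin(distances, axis=1)     # nearest centroid per point -> (3,)
print(labels)  # -> [0 0 1]
```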
 
-The calculation of the centroids is relatively simple, since we only need to compute the mean values of the data points for each cluster $i = 1, \ldots, K$ and store them in an array. Remeber that we can get the data points assigned to a cluster by indexing the data matrix `X` with the cluster labels, i.e. `X[labels == i]`. We use list comprehension for this:
+The calculation of the centroids is relatively straightforward, since we only need to compute the mean values of the data points for each cluster $i = 1, \ldots, K$ and store them in an array. Remember that we can get the data points assigned to a cluster by indexing the data matrix `X` with the cluster labels, i.e., `X[labels == i]`. We use a list comprehension for this:
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:kmeans_compute_centroids}}
 ```
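As a standalone illustration of this step with hypothetical toy values (not the included snippet):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.2, 0.2], [4.0, 4.0], [6.0, 6.0]])
labels = np.array([0, 0, 1, 1])
K = 2

# Row k is the mean of all data points currently labeled k
centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
# rows: [0.1, 0.1] and [5.0, 5.0]
```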
 
-To stay true to the concept of the general ML class, we also implement the `predict` method, which computes the assignments for possibly new data points.
+To stay true to the concept of the general ML class, we also implement the `predict` method, which computes the assignments for new data points.
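Because the assignment rule only needs the fitted centroids, a `predict` along these lines would suffice (a sketch under that assumption; the actual method lives in the included file):

```python
import numpy as np

class KMeans:
    def __init__(self, n_clusters):
        self.n_clusters = n_clusters
        self.centroids = None

    def predict(self, X):
        # Reuse the assignment rule with the centroids found during fitting
        distances = np.linalg.norm(X[:, np.newaxis] - self.centroids, axis=2)
        return np.argmin(distances, axis=1)

km = KMeans(2)
km.centroids = np.array([[0.0, 0.0], [5.0, 5.0]])  # pretend `fit` produced these
print(km.predict(np.array([[0.2, -0.1], [4.8, 5.3]])))  # -> [0 1]
```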
 
-### Clustering of aptamers
+### Clustering of Aptamers
 
-We can now test our implementation of the $k$-means algorithm on the aptamer dataset. We will use the representation of the aptamers in the principal component space, which we computed earlier with the PCA. With some imagination, we can see that the aptamers are clustered into four groups, which we want to identify with the $k$-means algorithm.
+We can now test our implementation of the $k$-means algorithm on the aptamer dataset. We will use the representation of the aptamers in the principal component space, which we computed earlier with PCA. With some imagination, we can see that the aptamers are clustered into four groups, which we want to identify using the $k$-means algorithm.
 
-Loading the data from the CSV file and converting the data to a numpy array is straightforward:
+Loading the data from the CSV file and converting it to a numpy array is straightforward:
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:load_data_from_csv}}
@@ -89,12 +87,17 @@ We can now create an instance of the `KMeans` class and fit it to the data:
 {{#include ../codes/05-machine_learning/k_means.py:kmeans_fit_and_predict}}
 ```
 
-Plotting the result is also straightforward. We simply color the data points according to the cluster labels. Additionally, we access the class attribute `centroids` to plot the centroids as red crosses:
+Plotting the result is also straightforward. We simply color the data points according to their cluster labels. Additionally, we access the class attribute `centroids` to plot the centroids as red crosses:
 
 ```python
 {{#include ../codes/05-machine_learning/k_means.py:plot_kmeans_result}}
 ```
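The plotting include is likewise not expanded here; a sketch of such a plot with synthetic stand-in data (matplotlib is assumed available; the `Agg` backend and the output filename are arbitrary choices so the snippet runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Synthetic stand-in for the aptamer principal components: two shifted blobs
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2)) + np.repeat([[0, 0], [6, 6]], 20, axis=0)
labels = np.repeat([0, 1], 20)
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels)  # color data points by cluster label
ax.scatter(centroids[:, 0], centroids[:, 1],
           c="red", marker="x", s=100)  # centroids as red crosses
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
fig.savefig("k_means_clusters.png")
```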
 
-By inspecting the plot, we can see that the $k$-means algorithm will correctly identified the four clusters of aptamers in most cases. Because centroids are initialized randomly, the result will vary slightly between runs.
+By inspecting the plot, we can see that the $k$-means algorithm correctly identifies the four clusters of aptamers in most cases. Because centroids are initialized randomly, the result may vary slightly between runs. In the animation below, we can see the evolution of the clusters over the iterations of the $k$-means algorithm.
+
+<figure>
+<center>
+<img src="../assets/figures/05-machine_learning/k_means_aptamers_animated.gif" alt="K-Means Clustering of aptamers" style="max-width: 600px;" />
+</center>
+</figure>
 
-![Result of the $k$-means algorithm on the aptamer dataset](../assets/figures/05-machine_learning/k_means_aptamers.svg)
