Enhance k-means clustering documentation and visualization
- Updated the clustering section to clarify the k-means algorithm, including improved descriptions of assignment and reconstruction rules.
- Added an animated GIF to illustrate the evolution of clusters during the k-means iterations, enhancing the educational value of the documentation.
- Replaced SVG figure outputs with PNG format for better compatibility and visualization in the documentation.
- Improved clarity in the code comments related to centroid initialization and plotting steps.
src/05-machine_learning/04-clustering.md
## Clustering

Clustering is another widely used method of unsupervised learning. This means that our data points $\mathcal{D} := \{\vec{x}_i\}_{i=1, \dots, N}$ have no labels. Our goal is to group them into $K$ clusters, where data points within a cluster are *as similar as possible* and data points between clusters are *as different as possible*.

Before we start, let's list some properties we expect from a clustering algorithm:

1. A general **assignment rule**, which assigns each data point to a cluster, i.e., $\vec{x}_i \mapsto k \in \{1, \ldots, K\}$ for $i = 1, \ldots, N$.
2. A **reconstruction rule**, which for each cluster $k \in \{1, \ldots, K\}$ determines a representative element $\vec{m}_k$, i.e., $k \mapsto \vec{m}_k \in \mathbb{R}^n$.

### $k$-Means Clustering

As the name suggests, the $k$-means algorithm defines the representative element as the mean (centroid) of the data points in the cluster. To formulate the $k$-means algorithm, we first fix the number of clusters $K$ and define two quantities: $\mathbf{C} := (C_1, \ldots, C_K)$, which contains the subsets $C_k \subseteq \mathcal{D}$ of the data points assigned to cluster $k$, and $\mathbf{M} := (\vec{m}_1, \ldots, \vec{m}_K)$, which contains the mean values of the clusters.

```admonish note title="Note"
The union of the clusters must cover the entire dataset $\mathcal{D}$, i.e., $\bigcup_{k=1}^K C_k = \mathcal{D}$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, meaning a data point cannot be assigned to multiple clusters simultaneously.
```

The $k$-means algorithm iteratively alternates between updating the cluster variable $\mathbf{C}$ and the mean values $\mathbf{M}$. For an initial clustering $\mathbf{C}$, we first **compute the mean value of each cluster as the average of the data points in that cluster**:

$$
\vec{m}_k \leftarrow \frac{1}{|C_k|} \sum_{\vec{x}_i \in C_k} \vec{x}_i\,.
$$

This corresponds to the reconstruction rule. Next, we **assign each data point to the cluster with the nearest mean**:

$$
C_k \leftarrow \{\vec{x}_i \in \mathcal{D} \mid \|\vec{x}_i - \vec{m}_k\| \leq \|\vec{x}_i - \vec{m}_j\| \text{ for all } j \neq k\}\,.
$$

This corresponds to the assignment rule. These two steps are repeated iteratively for a given number of iterations, or until the clusters no longer change.

```admonish info title="Voronoi cells"
An alternative view of the clusters assigned by the $k$-means algorithm are the so-called *Voronoi cells*, which are defined as:

$$
V_k := \{\vec{x} \in \mathbb{R}^n \mid \|\vec{x} - \vec{m}_k\| \leq \|\vec{x} - \vec{m}_j\| \text{ for all } j \neq k\}
$$

and can be visualized in [Voronoi diagrams](https://en.wikipedia.org/wiki/Voronoi_diagram).
```

### Implementation

We implement the $k$-means algorithm as a class. In the `__init__` method, we set the number of clusters and the maximum number of iterations. Additionally, we initialize the class attributes `self.centroids` and `self.labels`, which will be updated alternately during the algorithm. `self.centroids` will be a 2D array holding one cluster centroid per row. `self.labels` will be a 1D array holding the cluster label ($0, \ldots, K-1$) for each data point.

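As a minimal sketch, the constructor could look like this (the class name `KMeans` and the default of 100 iterations are our choices here; the exact code may differ):

```python
import numpy as np

class KMeans:
    def __init__(self, n_clusters, num_iter=100):
        self.n_clusters = n_clusters  # number of clusters K
        self.num_iter = num_iter      # maximum number of iterations
        self.centroids = None         # becomes a 2D array with one centroid per row
        self.labels = None            # becomes a 1D array of cluster labels
```
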
Next, we implement the `fit` method, which executes the algorithm as described above. Here, we use [`np.random.choice`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) to randomly select indices of data points to serve as the initial centroids. The function `np.random.choice` selects `self.n_clusters` unique indices from the range of available data points, given by `X.shape[0]`. The parameter `replace=False` ensures that the same data point is not selected more than once. By indexing the data matrix `X` with these indices, we obtain the initial centroids. After that, we update the centroids and labels in a loop as described above for `self.num_iter` iterations.

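A sketch of how `fit` could look under these assumptions, delegating to the two helper methods described next:

```python
    def fit(self, X):
        # Pick self.n_clusters distinct data points as the initial centroids.
        indices = np.random.choice(X.shape[0], self.n_clusters, replace=False)
        self.centroids = X[indices]

        # Alternate between the assignment and reconstruction steps.
        for _ in range(self.num_iter):
            self.labels = self.assign_labels(X)
            self.centroids = self.compute_centroids(X, self.labels)
```
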
Here, we assume that the methods `assign_labels` and `compute_centroids` are implemented separately, which we do next.

Within the `assign_labels` method, we first compute the distances between all data points and all centroids. This is done by subtracting the centroids from the data points and then computing the Euclidean norm of the resulting vectors. However, because the shape of the data matrix `X` is `(n_points, n_features)`, while the shape of the centroids is `(n_clusters, n_features)`, we need to expand the data matrix to match the shape of the centroids. This is done by adding an additional dimension to the data matrix, representing the different clusters. The new shape of the data matrix is then `(n_points, 1, n_features)`. The subtraction of the centroids from the data points then results in an array of shape `(n_points, n_clusters, n_features)`. The Euclidean norm of the resulting vectors is computed across all features, i.e., `axis=2`. The resulting array `distances` holds the distances between all data points and all centroids. The assignment of data points to clusters is then done by selecting the cluster with the smallest distance for each data point, which can be achieved using the `np.argmin` function along the second axis, i.e., `axis=1`.

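As a sketch, following the shapes just described:

```python
    def assign_labels(self, X):
        # X[:, np.newaxis, :] has shape (n_points, 1, n_features); subtracting
        # the (n_clusters, n_features) centroids broadcasts to an array of
        # shape (n_points, n_clusters, n_features).
        differences = X[:, np.newaxis, :] - self.centroids
        # Euclidean norm across the feature axis -> (n_points, n_clusters).
        distances = np.linalg.norm(differences, axis=2)
        # For each point, pick the cluster with the smallest distance.
        return np.argmin(distances, axis=1)
```
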
The calculation of the centroids is relatively straightforward, since we only need to compute the mean values of the data points for each cluster $i = 1, \ldots, K$ and store them in an array. Remember that we can get the data points assigned to a cluster by indexing the data matrix `X` with the cluster labels, i.e., `X[labels == i]`. We use list comprehension for this:

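A possible sketch:

```python
    def compute_centroids(self, X, labels):
        # Mean of the data points assigned to each cluster
        # (assumes no cluster ends up empty).
        return np.array([X[labels == i].mean(axis=0)
                         for i in range(self.n_clusters)])
```
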
To stay true to the concept of the general ML class, we also implement the `predict` method, which computes the assignments for (possibly new) data points.

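Since prediction is exactly the assignment step, a sketch can simply reuse `assign_labels`:

```python
    def predict(self, X):
        # Assign (possibly new) data points to the nearest learned centroid.
        return self.assign_labels(X)
```
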
### Clustering of Aptamers

We can now test our implementation of the $k$-means algorithm on the aptamer dataset. We will use the representation of the aptamers in the principal component space, which we computed earlier with PCA. With some imagination, we can see that the aptamers are clustered into four groups, which we want to identify using the $k$-means algorithm.

Loading the data from the CSV file and converting it to a numpy array is straightforward:

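For example (the file name `aptamers_pca.csv` is a placeholder; use the path of the PCA output from the previous section):

```python
import pandas as pd

# Placeholder path for the PCA-projected aptamer data.
data = pd.read_csv("aptamers_pca.csv")
X = data.to_numpy()
```
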
Plotting the result is also straightforward. We simply color the data points according to their cluster labels. Additionally, we access the class attribute `centroids` to plot the centroids as red crosses:

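A sketch of the fitting and plotting steps (the constructor arguments and axis labels are our assumptions):

```python
import matplotlib.pyplot as plt

kmeans = KMeans(n_clusters=4, num_iter=100)
kmeans.fit(X)

# Color the data points by their cluster label.
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels)
# Plot the centroids as red crosses.
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1],
            color="red", marker="x", s=100)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()
```
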
By inspecting the plot, we can see that the $k$-means algorithm correctly identifies the four clusters of aptamers in most cases. Because centroids are initialized randomly, the result may vary slightly between runs. In the animation below, we can see the evolution of the clusters over the iterations of the $k$-means algorithm.

<figure>
<center>
<img src="../assets/figures/05-machine_learning/k_means_aptamers_animated.gif" alt="K-Means Clustering of aptamers" style="max-width: 600px;"/>
</center>
</figure>