
Commit 366ff61

Enhance SVM documentation with loss function derivation
- Updated the Support Vector Machines (SVM) section to clarify the concept of maximizing the separation margin by introducing the signed distance to the hyperplane.
- Added a new subsection detailing the derivation of the SVM loss function, including both hard and soft margin formulations.
- Provided mathematical expressions for the loss functions and their optimization using gradient descent, improving the educational value of the documentation.
1 parent db76d2e commit 366ff61

1 file changed


src/psets/04.md

Lines changed: 36 additions & 1 deletion
@@ -74,7 +74,7 @@ Again, implementation is straightforward — just replace the respective line in

In the lecture, we implemented and used the Rosenblatt Perceptron to classify RNA aptamers according to their absorption behavior. Not only did our data have to be linearly separable for the training to converge, but we also saw that the Perceptron did not provide us with a very robust decision boundary.

- In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the distance of the hyperplane to the closest data points from each class.
+ In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the smallest signed distance of the hyperplane to the data points, called the separation margin.

**(a) Derivation of the Point-to-Hyperplane Distance Formula**

@@ -147,6 +147,41 @@ $$
$$
-->

**(b) Deriving the SVM Loss Function**

From (a), it follows that the signed distance of a data point $(\vec{x}_i, y_i)$ to the hyperplane is given by $\frac{y_i \langle \vec{w}, \vec{x}_i \rangle}{\|\vec{w}\|}$. Since $\vec{w}$ can be rescaled without changing the hyperplane, we can fix the numerator to be at least $1$ for every data point (with equality for the closest ones), so that maximizing the separation margin becomes equivalent to minimizing $\|\vec{w}\|$. This is known as the *hard margin* SVM formulation:

$$
\mathcal{L}_{\vec{w}} = \frac{1}{2} \|\vec{w}\|^2 \quad \text{s.t.} \quad y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 \quad \forall i = 1, \ldots, N \,.
$$
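To make these quantities concrete, here is a minimal NumPy sketch (illustrative only; the names `X`, `y`, and `w` are placeholders and not part of the exercise) that computes the separation margin and checks the hard-margin constraints for a candidate weight vector:

```python
import numpy as np

def separation_margin(w, X, y):
    """Smallest signed distance of the labelled points (X, y) to the hyperplane <w, x> = 0."""
    signed_dist = y * (X @ w) / np.linalg.norm(w)  # y_i <w, x_i> / ||w||
    return signed_dist.min()

def satisfies_hard_margin(w, X, y):
    """Check the hard-margin constraints y_i <w, x_i> >= 1 for all i."""
    return bool(np.all(y * (X @ w) >= 1))

# Toy example: two linearly separable points with labels in {-1, +1}.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])
w = np.array([1.0, 0.0])
print(separation_margin(w, X, y))      # 2.0
print(satisfies_hard_margin(w, X, y))  # True
```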
We can also relax the constraint that every data point must be classified correctly by introducing a *slack* variable $\xi_i \geq 0$ for each data point, such that $y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 - \xi_i$. Intuitively, we want the sum of the slack variables to be as small as possible; substituting the smallest feasible slack $\xi_i = \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle)$ into the objective yields the so-called *soft margin* SVM loss:

$$
\mathcal{L}_{\vec{w}} = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle) + \frac{\lambda}{2} \|\vec{w}\|^2 \,,
$$

where $\lambda$ is a regularization parameter that controls the trade-off between a large margin and small margin violations.
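For reference, a minimal NumPy sketch of this loss (an illustration using the same placeholder names as above, not the required implementation) could look like this:

```python
import numpy as np

def soft_margin_loss(w, X, y, lam):
    """Mean hinge loss plus L2 regularization, matching the soft margin formula above."""
    margins = y * (X @ w)                   # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)  # max(0, 1 - y_i <w, x_i>)
    return hinge.mean() + 0.5 * lam * np.dot(w, w)
```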
To optimize this loss using gradient descent, we need the gradient of the loss with respect to the weights $\vec{w}$. For the regularization term this is straightforward, but the hinge-loss term is not differentiable everywhere: its subgradient with respect to $\vec{w}$ is $-y_i \vec{x}_i$ whenever the margin is violated, i.e. $y_i \langle \vec{w}, \vec{x}_i \rangle < 1$, and $\vec{0}$ otherwise. Show that under these conditions, the update rule for a single data point is given by

$$
\vec{w} \leftarrow (1 - \eta \lambda) \vec{w} \,,
$$

and, additionally, for data points that violate the margin,

$$
\vec{w} \leftarrow \vec{w} + \eta y_i \vec{x}_i \,,
$$

where $\eta$ is the learning rate.
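Combining the two update rules, one possible stochastic subgradient descent loop is sketched below (a rough outline, assuming NumPy arrays `X` with one sample per row and labels `y` in $\{-1, +1\}$; the function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def train_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Soft-margin SVM trained with stochastic subgradient descent on the hinge loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            if y[i] * np.dot(w, X[i]) < 1.0:
                # Margin violated: shrink step plus hinge subgradient step.
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                # Margin satisfied: only the shrink step from the regularizer.
                w = (1.0 - eta * lam) * w
    return w
```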
**(c) Implementing the SVM**
<!--
