
Commit 366ff61

Enhance SVM documentation with loss function derivation
- Updated the Support Vector Machines (SVM) section to clarify the concept of maximizing the separation margin by introducing the signed distance to the hyperplane.
- Added a new subsection detailing the derivation of the SVM loss function, including both hard and soft margin formulations.
- Provided mathematical expressions for the loss functions and their optimization using gradient descent, improving the educational value of the documentation.
1 parent db76d2e commit 366ff61

1 file changed


src/psets/04.md

Lines changed: 36 additions & 1 deletion
@@ -74,7 +74,7 @@ Again, implementation is straightforward — just replace the respective line in

In the lecture, we implemented and used the Rosenblatt Perceptron to classify RNA aptamers according to their absorption behavior. Not only did our data have to be linearly separable for the training to converge, but we also saw that the Perceptron did not provide us with a very robust decision boundary.

- In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the distance of the hyperplane to the closest data points from each class.
+ In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the smallest signed distance of the hyperplane to the data points, called the separation margin.

**(a) Derivation of the Point-to-Hyperplane Distance Formula**

@@ -147,6 +147,41 @@ $$
$$
-->

**(b) Deriving the SVM Loss Function**

From (a), it follows that the signed distance of a data point $(\vec{x}_i, y_i)$ to the hyperplane is given by $\frac{y_i \langle \vec{w}, \vec{x}_i \rangle}{\|\vec{w}\|}$. Since $\vec{w}$ can be rescaled without changing the hyperplane, we can fix the numerator to be at least $1$ for every data point (with equality for the closest ones), so that maximizing the separation margin becomes equivalent to minimizing $\|\vec{w}\|$. This is known as the *hard margin* SVM formulation:

$$
\mathcal{L}_{\vec{w}} = \frac{1}{2} \|\vec{w}\|^2 \quad \text{s.t.} \quad y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 \quad \forall i = 1, \ldots, N \,.
$$
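To make these quantities concrete, here is a minimal NumPy sketch (illustrative only; the names `X`, `y`, and `w` are placeholders and not part of the exercise) that computes the separation margin and checks the hard-margin constraints for a candidate weight vector:

```python
import numpy as np

def separation_margin(w, X, y):
    """Smallest signed distance of the labelled points (X, y) to the hyperplane <w, x> = 0."""
    signed_dist = y * (X @ w) / np.linalg.norm(w)  # y_i <w, x_i> / ||w||
    return signed_dist.min()

def satisfies_hard_margin(w, X, y):
    """Check the hard-margin constraints y_i <w, x_i> >= 1 for all i."""
    return bool(np.all(y * (X @ w) >= 1))

# Toy example: two linearly separable points with labels in {-1, +1}.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1, -1])
w = np.array([1.0, 0.0])
print(separation_margin(w, X, y))      # 2.0
print(satisfies_hard_margin(w, X, y))  # True
```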
We can also relax the constraint that every data point must be classified correctly by introducing a *slack* variable $\xi_i \geq 0$ for each data point, such that $y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 - \xi_i$. Intuitively, we want the sum of the slack variables to be as small as possible; substituting the smallest feasible slack $\xi_i = \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle)$ into the objective yields the so-called *soft margin* SVM loss:

$$
\mathcal{L}_{\vec{w}} = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle) + \frac{\lambda}{2} \|\vec{w}\|^2 \,,
$$

where $\lambda$ is a regularization parameter that controls the trade-off between a large margin and small margin violations.
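For reference, a minimal NumPy sketch of this loss (an illustration using the same placeholder names as above, not the required implementation) could look like this:

```python
import numpy as np

def soft_margin_loss(w, X, y, lam):
    """Mean hinge loss plus L2 regularization, matching the soft margin formula above."""
    margins = y * (X @ w)                   # y_i <w, x_i>
    hinge = np.maximum(0.0, 1.0 - margins)  # max(0, 1 - y_i <w, x_i>)
    return hinge.mean() + 0.5 * lam * np.dot(w, w)
```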
To optimize this loss using gradient descent, we need the gradient of the loss with respect to the weights $\vec{w}$. For the regularization term this is straightforward, but the hinge-loss term is not differentiable everywhere: its subgradient with respect to $\vec{w}$ is $-y_i \vec{x}_i$ whenever the margin is violated, i.e. $y_i \langle \vec{w}, \vec{x}_i \rangle < 1$, and $\vec{0}$ otherwise. Show that under these conditions, the update rule for a single data point is given by

$$
\vec{w} \leftarrow (1 - \eta \lambda) \vec{w} \,,
$$

and, additionally, for data points that violate the margin,

$$
\vec{w} \leftarrow \vec{w} + \eta y_i \vec{x}_i \,,
$$

where $\eta$ is the learning rate.
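Combining the two update rules, one possible stochastic subgradient descent loop is sketched below (a rough outline, assuming NumPy arrays `X` with one sample per row and labels `y` in $\{-1, +1\}$; the function name and hyperparameter defaults are illustrative):

```python
import numpy as np

def train_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Soft-margin SVM trained with stochastic subgradient descent on the hinge loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for i in rng.permutation(n_samples):
            if y[i] * np.dot(w, X[i]) < 1.0:
                # Margin violated: shrink step plus hinge subgradient step.
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                # Margin satisfied: only the shrink step from the regularizer.
                w = (1.0 - eta * lam) * w
    return w
```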
**(c) Implementing the SVM**
<!--
