Enhance SVM documentation with loss function derivation
- Updated the Support Vector Machines (SVM) section to clarify the concept of maximizing the separation margin by introducing the signed distance to the hyperplane.
- Added a new subsection detailing the derivation of the SVM loss function, including both hard and soft margin formulations.
- Provided mathematical expressions for the loss functions and their optimization using gradient descent, improving the educational value of the documentation.
src/psets/04.md (+36 −1)
@@ -74,7 +74,7 @@ Again, implementation is straightforward — just replace the respective line in
In the lecture, we implemented and used the Rosenblatt Perceptron to classify RNA aptamers according to their absorption behavior. Not only did our data have to be linearly separable for the training to converge, but we also saw that it did not provide us with a very robust decision boundary.
-In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the distance of the hyperplane to the closest data points from each class.
+In this exercise, we will implement Support Vector Machines (SVM) to improve the performance of our classifier. SVMs are based on the idea of finding the best separating hyperplane between two classes by maximizing the smallest signed distance of the hyperplane to the data points, called the separation margin.
**(a) Derivation of the Point-to-Hyperplane Distance Formula**
@@ -147,6 +147,41 @@ $$
$$
-->
+**(b) Deriving the SVM Loss Function**
+
+From (a), it follows that the signed distance of a datapoint $(\vec{x}_i, y_i)$ to the hyperplane is given by $\frac{y_i \langle \vec{w}, \vec{x}_i \rangle}{\|\vec{w}\|}$. Therefore, maximizing the separation margin is equivalent to minimizing $\|\vec{w}\|$ once we fix the scale of the numerator to 1. This is known as the *hard margin* SVM loss:
+
+$$
+\min_{\vec{w}} \frac{1}{2} \|\vec{w}\|^2 \quad \text{subject to} \quad y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 \quad \text{for all } i.
+$$
+
+We can also relax the constraint that all datapoints are correctly classified by introducing a *slack* variable $\xi_i \geq 0$, such that $y_i \langle \vec{w}, \vec{x}_i \rangle \geq 1 - \xi_i$. Intuitively, we want to keep the sum of the slack variables small, which we can directly incorporate into the so-called *soft margin* SVM loss. Since the smallest feasible slack is $\xi_i = \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle)$, we obtain the unconstrained objective
+
+$$
+L(\vec{w}) = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle\right) + \frac{\lambda}{2} \|\vec{w}\|^2.
+$$
+
+To optimize this loss using gradient descent, we need its gradient with respect to the weights $\vec{w}$. For the second term this is easy, but the first term is not differentiable everywhere, so we use a subgradient instead; its contribution is nonzero only for datapoints that violate the margin, i.e. $y_i \langle \vec{w}, \vec{x}_i \rangle < 1$. Show that under these conditions, the update rule for a single datapoint with learning rate $\eta$ is given by:
+
+$$
+\vec{w} \leftarrow \begin{cases} \vec{w} - \eta \left( \lambda \vec{w} - y_i \vec{x}_i \right) & \text{if } y_i \langle \vec{w}, \vec{x}_i \rangle < 1, \\ \vec{w} - \eta \lambda \vec{w} & \text{otherwise.} \end{cases}
+$$
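The per-datapoint update described above can be sketched in code. The following is a minimal illustration (not the pset's reference solution), assuming a hinge loss plus L2 regularization on a hyperplane through the origin, i.e. no bias term; the function name `svm_sgd` and all hyperparameter values are hypothetical choices for the sketch:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Soft-margin SVM via per-sample subgradient descent (illustrative sketch).

    X: (N, d) data matrix; y: labels in {-1, +1}.
    Minimizes mean hinge loss + (lam / 2) * ||w||^2.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * np.dot(w, X[i]) < 1:
                # Margin violated: hinge subgradient -y_i * x_i plus regularizer.
                w -= eta * (lam * w - y[i] * X[i])
            else:
                # Margin satisfied: only the regularizer contributes.
                w -= eta * lam * w
    return w

# Toy linearly separable data, classes on either side of the origin.
X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w = svm_sgd(X, y)
print(np.sign(X @ w))
```

Note that correctly classified points still shrink $\vec{w}$ through the regularizer; only margin violators pull the hyperplane toward themselves.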