
Commit 06cb0c4

Add Support Vector Machine implementation for classification
- Introduced a new SVM class with methods for fitting the model and making predictions.
- Implemented data loading and processing from a CSV file, including feature selection and target variable extraction.
- Added functionality to calculate accuracy and visualize the decision boundary using matplotlib.
- Enhanced educational value by providing clear structure and comments throughout the code.
1 parent 366ff61 commit 06cb0c4

2 files changed

Lines changed: 130 additions & 4 deletions


Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
#!/usr/bin/env python

SEED = 1234

### ANCHOR: load_data_from_csv
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

path_to_csv = "aptamer_classification_data.csv"
df = pd.read_csv(path_to_csv)
print(df.head())
### ANCHOR_END: load_data_from_csv

### ANCHOR: process_data
target_column = "lambda_abs_class"

X = df[["PC1", "PC2"]].values  # Select the PC1 and PC2 columns as features
X = np.hstack([X, np.ones((X.shape[0], 1))])  # Add a column of ones for the bias term
y = df[target_column].values  # Target column

print(X.shape)
print(y.shape)
### ANCHOR_END: process_data

# fig.savefig('../../assets/figures/05-machine_learning/classification_data.svg')

### ANCHOR: svm_init
class SupportVectorMachine:
    def __init__(self, learning_rate=0.01, n_iterations=50, lam=10.0):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.lam = lam  # regularization parameter lambda
        self.weights = None
        self.losses = []  # store loss values for each epoch
        self.margins = []  # store margin values (2 / ||w||) for each epoch
### ANCHOR_END: svm_init

### ANCHOR: svm_fit
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)

        for epoch in range(self.n_iterations):
            epoch_loss = 0

            for i, (x_i, y_i) in enumerate(zip(X, y)):
                # Calculate prediction
                prediction = np.dot(x_i, self.weights)

                # Calculate hinge loss for this sample
                hinge_loss = max(0, 1 - y_i * prediction)

                # Update weights based on whether point is misclassified
                if y_i * prediction < 1:  # misclassified or within margin
                    # Gradient of hinge loss + regularization
                    self.weights = (1 - self.learning_rate * self.lam) * self.weights + self.learning_rate * y_i * x_i
                else:  # correctly classified
                    # Only regularization term
                    self.weights = (1 - self.learning_rate * self.lam) * self.weights

                # Accumulate loss for this epoch
                epoch_loss += hinge_loss

            # Calculate total loss for this epoch (hinge loss + regularization)
            regularization_loss = 0.5 * self.lam * np.dot(self.weights, self.weights)
            total_loss = epoch_loss / n_samples + regularization_loss
            self.losses.append(total_loss)

            # Calculate margin (2 / ||w||)
            weight_norm = np.linalg.norm(self.weights)
            margin = 2 / weight_norm if weight_norm > 0 else 0
            self.margins.append(margin)

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Loss: {total_loss:.4f}, Margin: {margin:.4f}")
### ANCHOR_END: svm_fit

### ANCHOR: svm_predict
    def predict(self, X):
        return np.sign(np.dot(X, self.weights))
### ANCHOR_END: svm_predict

np.random.seed(SEED)
### ANCHOR: fit_svm_model
svm_model = SupportVectorMachine(learning_rate=0.01, n_iterations=100, lam=0.1)
svm_model.fit(X, y)
y_pred_svm = svm_model.predict(X)
### ANCHOR_END: fit_svm_model

### ANCHOR: calculate_svm_accuracy
accuracy_svm = np.mean(y_pred_svm == y)
print(f"SVM Accuracy: {accuracy_svm}")
### ANCHOR_END: calculate_svm_accuracy

### ANCHOR: plot_svm_decision_boundary
fig, ax = plt.subplots(figsize=(7, 6))

# Plot the data points colored by class
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.7)

# Define the decision boundary as a function of the first feature
x1_range = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
if svm_model.weights[1] != 0:
    x2_boundary = -(svm_model.weights[0] * x1_range + svm_model.weights[2]) / svm_model.weights[1]
    ax.plot(x1_range, x2_boundary, 'k--', linewidth=2, label='SVM Decision Boundary')

ax.legend(loc='upper right')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('SVM Decision Boundary')
ax.set_xlim(X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
ax.set_ylim(X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

plt.show()
### ANCHOR_END: plot_svm_decision_boundary
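Note: `predict` returns labels in $\pm 1$ via `np.sign`, so the accuracy comparison above assumes that `lambda_abs_class` is already stored as -1/+1. The CSV itself is not part of this commit; if the labels were encoded as 0/1 instead, a remapping along these lines would be needed before fitting:

```python
# Hypothetical remapping, only needed if the target column is stored as 0/1.
y = np.where(y <= 0, -1, 1)
```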

src/psets/04.md

Lines changed: 12 additions & 4 deletions
@@ -24,7 +24,7 @@ Use your object-oriented implementation to perform linear regression on the data

**(b) Moore-Penrose Pseudoinverse**

-You will see that `numpy` will likely throw an error of the form `numpy.linalg.LinAlgError: Singular matrix`, indicating that the matrix $\bm{X}^T \bm{X}$ is not invertible. This means the columns of $\bm{X}$ are linearly dependent. This occurs because the finite dimension (e.g., 2048 bits) means multiple substructures must be encoded in the same bit, which directly induces dependencies between bits. To solve this problem, we can use the Moore-Penrose pseudoinverse $\bm{X}^+$, which you learned about in the SVD lecture.
+You will see that `numpy` will likely throw an error of the form `numpy.linalg.LinAlgError: Singular matrix`, indicating that the matrix $\bm{X}^T \bm{X}$ is not invertible. This means the columns of $\bm{X}$ are linearly dependent, which is not surprising, as we have more features than data points. Therefore, there are infinitely many solutions for the weights $\vec{w}$ that minimize the loss function. We have shown, however, that we can obtain the least-norm solution with the help of the Moore-Penrose pseudoinverse $\bm{X}^+$.

Show that the analytical solution for the weights in linear regression

@@ -41,14 +41,14 @@ $$
Then use `np.linalg.pinv` to compute the pseudoinverse and use it to compute the weights for linear regression. Compute the MAE and plot the predicted vs. actual fluorescence intensity for the training and test data.

```admonish tip title="Tip" collapsible=true
-You can show that $(\bm{X}^T \bm{X})^{+} \bm{X}^T = \bm{X}^+$ by checking the Moore-Penrose conditions for the matrix $\bm{B} := (\bm{X}^T \bm{X})^{+} \bm{X}^T$. Also, use the fact that the matrix $\bm{X}^T \bm{X}$ is invertible.
+You can show that $(\bm{X}^T \bm{X})^{+} \bm{X}^T = \bm{X}^+$ by checking the Moore-Penrose conditions for the matrix $\bm{B} := (\bm{X}^T \bm{X})^{+} \bm{X}^T$. Note that in the derivation of the optimal weights, we used the fact that the matrix $\bm{X}^T \bm{X}$ is invertible.

Implementation is straightforward — just replace the respective line in the `LinearRegression` class with the pseudoinverse.
```
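For orientation, a minimal sketch of the pseudoinverse-based fit. The array names and shapes are illustrative placeholders, not the pset's actual fingerprint data:

```python
import numpy as np

# Illustrative data: more features than samples, so X.T @ X is singular.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 200))
y_train = rng.normal(size=50)

# Least-norm least-squares weights via the Moore-Penrose pseudoinverse.
weights = np.linalg.pinv(X_train) @ y_train

# Predictions and mean absolute error on the training data.
y_pred = X_train @ weights
mae = np.mean(np.abs(y_train - y_pred))
print(f"Train MAE: {mae:.4f}")
```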

**(c) Ridge Regression**

-To reduce overfitting risk, we can use Ridge regression. Remember that Ridge regression is linear regression with a regularization term added to the loss function:
+The use of the Moore-Penrose pseudoinverse already introduces a form of regularization to the weights. A similar approach is Ridge regression, which is linear regression with a regularization term added to the loss function:

$$
\mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} (y_i - \hat{f}(\vec{x}_i))^2 + \frac{\lambda}{2} \|\vec{w}\|^2 \,,
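As an aside, this loss has a closed-form minimizer given by the regularized normal equations $(\bm{X}^T \bm{X} + \lambda \bm{I})\vec{w} = \bm{X}^T \vec{y}$. A small sketch with illustrative names (the bias column is regularized here for simplicity):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Illustrative usage with random data, not the pset dataset.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 200))
y_train = rng.normal(size=50)
w_ridge = ridge_weights(X_train, y_train, lam=1.0)
print(w_ridge.shape)  # (200,)
```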
@@ -161,7 +161,7 @@ $$
\mathcal{L}_{\vec{w}} = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle) + \frac{\lambda}{2} \|\vec{w}\|^2 \,.
$$

-where $\lambda$ is a regularization parameter.
+where $\lambda$ is a regularization parameter. Here, the $\frac{\lambda}{2} \|\vec{w}\|^2$ term maximizes the margin (by keeping $\|\vec{w}\|$ small), and the hinge-loss sum penalizes points in proportion to how far they fall short of the margin.

To optimize this loss using gradient descent, we need the gradient of this loss with respect to the weights $\vec{w}$. For the second term, this is easy, but the hinge loss in the first term is not differentiable everywhere, so we work with subgradients, which are nonzero only for data points that are misclassified or lie within the margin. Show that under these conditions, the update rule for a single datapoint is given by:

@@ -179,6 +179,14 @@ where $\eta$ is the learning rate.

**(c) Implementing the SVM**

+Implement the SVM by using the loss function and the update rule from above. Use the `Perceptron` class as a template. Then, apply the SVM to the aptamer dataset that we used for classification and plot the decision boundary.
+
+What happens if you randomly change the sign of the labels of one or two data points? How does $\lambda$ affect the decision boundary?
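A possible sketch of this experiment, reusing `SupportVectorMachine`, `X`, `y`, and `SEED` from the script added in this commit and assuming the labels are encoded as -1/+1:

```python
import numpy as np

np.random.seed(SEED)

# Flip the sign of two randomly chosen labels (assumes y is encoded as -1/+1).
y_flipped = y.copy()
flip_idx = np.random.choice(len(y_flipped), size=2, replace=False)
y_flipped[flip_idx] *= -1

# Refit with different regularization strengths and compare the resulting boundaries.
for lam in [0.01, 0.1, 1.0, 10.0]:
    model = SupportVectorMachine(learning_rate=0.01, n_iterations=100, lam=lam)
    model.fit(X, y_flipped)
    accuracy = np.mean(model.predict(X) == y_flipped)
    print(f"lambda={lam}: accuracy={accuracy:.3f}, final margin={model.margins[-1]:.3f}")
```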
