
Commit 06cb0c4

Add Support Vector Machine implementation for classification
- Introduced a new SVM class with methods for fitting the model and making predictions.
- Implemented data loading and processing from a CSV file, including feature selection and target variable extraction.
- Added functionality to calculate accuracy and visualize the decision boundary using matplotlib.
- Enhanced educational value by providing clear structure and comments throughout the code.
1 parent 366ff61 commit 06cb0c4

2 files changed

Lines changed: 130 additions & 4 deletions


Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
#!/usr/bin/env python

SEED = 1234

### ANCHOR: load_data_from_csv
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

path_to_csv = "aptamer_classification_data.csv"
df = pd.read_csv(path_to_csv)
print(df.head())
### ANCHOR_END: load_data_from_csv

### ANCHOR: process_data
target_column = "lambda_abs_class"

X = df[["PC1", "PC2"]].values  # Select the PC1 and PC2 columns as features
X = np.hstack([X, np.ones((X.shape[0], 1))])  # Add a column of ones for the bias term
y = df[target_column].values  # Target column

print(X.shape)
print(y.shape)
### ANCHOR_END: process_data

# fig.savefig('../../assets/figures/05-machine_learning/classification_data.svg')

### ANCHOR: svm_init
class SupportVectorMachine:
    def __init__(self, learning_rate=0.01, n_iterations=50, lam=10.0):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.lam = lam  # regularization parameter lambda
        self.weights = None
        self.losses = []  # store loss values for each epoch
        self.margins = []  # store margin values (2 / ||w||) for each epoch
### ANCHOR_END: svm_init

### ANCHOR: svm_fit
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.random.randn(n_features)

        for epoch in range(self.n_iterations):
            epoch_loss = 0

            for i, (x_i, y_i) in enumerate(zip(X, y)):
                # Calculate prediction
                prediction = np.dot(x_i, self.weights)

                # Calculate hinge loss for this sample
                hinge_loss = max(0, 1 - y_i * prediction)

                # Update weights based on whether point is misclassified
                if y_i * prediction < 1:  # misclassified or within margin
                    # Gradient of hinge loss + regularization
                    self.weights = (1 - self.learning_rate * self.lam) * self.weights + self.learning_rate * y_i * x_i
                else:  # correctly classified
                    # Only regularization term
                    self.weights = (1 - self.learning_rate * self.lam) * self.weights

                # Accumulate loss for this epoch
                epoch_loss += hinge_loss

            # Calculate total loss for this epoch (hinge loss + regularization)
            regularization_loss = 0.5 * self.lam * np.dot(self.weights, self.weights)
            total_loss = epoch_loss / n_samples + regularization_loss
            self.losses.append(total_loss)

            # Calculate margin (2 / ||w||)
            weight_norm = np.linalg.norm(self.weights)
            margin = 2 / weight_norm if weight_norm > 0 else 0
            self.margins.append(margin)

            if epoch % 10 == 0:
                print(f"Epoch {epoch}, Loss: {total_loss:.4f}, Margin: {margin:.4f}")
### ANCHOR_END: svm_fit

### ANCHOR: svm_predict
    def predict(self, X):
        return np.sign(np.dot(X, self.weights))
### ANCHOR_END: svm_predict

np.random.seed(SEED)
### ANCHOR: fit_svm_model
svm_model = SupportVectorMachine(learning_rate=0.01, n_iterations=100, lam=0.1)
svm_model.fit(X, y)
y_pred_svm = svm_model.predict(X)
### ANCHOR_END: fit_svm_model

### ANCHOR: calculate_svm_accuracy
accuracy_svm = np.mean(y_pred_svm == y)
print(f"SVM Accuracy: {accuracy_svm}")
### ANCHOR_END: calculate_svm_accuracy

### ANCHOR: plot_svm_decision_boundary
fig, ax = plt.subplots(figsize=(7, 6))

# Plot the data points colored by class
ax.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', alpha=0.7)

# Define the decision boundary as a function of the first feature
x1_range = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
if svm_model.weights[1] != 0:
    x2_boundary = -(svm_model.weights[0] * x1_range + svm_model.weights[2]) / svm_model.weights[1]
    ax.plot(x1_range, x2_boundary, 'k--', linewidth=2, label='SVM Decision Boundary')

ax.legend(loc='upper right')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('SVM Decision Boundary')
ax.set_xlim(X[:, 0].min() - 0.1, X[:, 0].max() + 0.1)
ax.set_ylim(X[:, 1].min() - 0.1, X[:, 1].max() + 0.1)

plt.show()
### ANCHOR_END: plot_svm_decision_boundary
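Note: `predict` returns labels in $\pm 1$ via `np.sign`, so the accuracy comparison above assumes that `lambda_abs_class` is already stored as -1/+1. The CSV itself is not part of this commit; if the labels were encoded as 0/1 instead, a remapping along these lines would be needed before fitting:

```python
# Hypothetical remapping, only needed if the target column is stored as 0/1.
y = np.where(y <= 0, -1, 1)
```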

src/psets/04.md

Lines changed: 12 additions & 4 deletions
@@ -24,7 +24,7 @@ Use your object-oriented implementation to perform linear regression on the data

**(b) Moore-Penrose Pseudoinverse**

-You will see that `numpy` will likely throw an error of the form `numpy.linalg.LinAlgError: Singular matrix`, indicating that the matrix $\bm{X}^T \bm{X}$ is not invertible. This means the columns of $\bm{X}$ are linearly dependent. This occurs because the finite dimension (e.g., 2048 bits) means multiple substructures must be encoded in the same bit, which directly induces dependencies between bits. To solve this problem, we can use the Moore-Penrose pseudoinverse $\bm{X}^+$, which you learned about in the SVD lecture.
+You will see that `numpy` will likely throw an error of the form `numpy.linalg.LinAlgError: Singular matrix`, indicating that the matrix $\bm{X}^T \bm{X}$ is not invertible. This means the columns of $\bm{X}$ are linearly dependent, which is not surprising, as we have more features than data points. Therefore, there are infinitely many solutions for the weights $\vec{w}$ that minimize the loss function. We have shown, however, that we can obtain the least-norm solution with the help of the Moore-Penrose pseudoinverse $\bm{X}^+$.

Show that the analytical solution for the weights in linear regression

@@ -41,14 +41,14 @@ $$
Then use `np.linalg.pinv` to compute the pseudoinverse and use it to compute the weights for linear regression. Compute the MAE and plot the predicted vs. actual fluorescence intensity for the training and test data.

```admonish tip title="Tip" collapsible=true
-You can show that $(\bm{X}^T \bm{X})^{+} \bm{X}^T = \bm{X}^+$ by checking the Moore-Penrose conditions for the matrix $\bm{B} := (\bm{X}^T \bm{X})^{+} \bm{X}^T$. Also, use the fact that the matrix $\bm{X}^T \bm{X}$ is invertible.
+You can show that $(\bm{X}^T \bm{X})^{+} \bm{X}^T = \bm{X}^+$ by checking the Moore-Penrose conditions for the matrix $\bm{B} := (\bm{X}^T \bm{X})^{+} \bm{X}^T$. Note that in the derivation of the optimal weights, we used the fact that the matrix $\bm{X}^T \bm{X}$ is invertible.

Implementation is straightforward — just replace the respective line in the `LinearRegression` class with the pseudoinverse.
```
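For orientation, a minimal sketch of the pseudoinverse-based fit. The array names and shapes are illustrative placeholders, not the pset's actual fingerprint data:

```python
import numpy as np

# Illustrative data: more features than samples, so X.T @ X is singular.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 200))
y_train = rng.normal(size=50)

# Least-norm least-squares weights via the Moore-Penrose pseudoinverse.
weights = np.linalg.pinv(X_train) @ y_train

# Predictions and mean absolute error on the training data.
y_pred = X_train @ weights
mae = np.mean(np.abs(y_train - y_pred))
print(f"Train MAE: {mae:.4f}")
```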

**(c) Ridge Regression**

-To reduce overfitting risk, we can use Ridge regression. Remember that Ridge regression is linear regression with a regularization term added to the loss function:
+The use of the Moore-Penrose pseudoinverse already introduces a form of regularization to the weights. A similar approach is Ridge regression, which is linear regression with a regularization term added to the loss function:

$$
\mathcal{L} = \frac{1}{2} \sum_{i=1}^{N} (y_i - \hat{f}(\vec{x}_i))^2 + \frac{\lambda}{2} \|\vec{w}\|^2 \,,
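As an aside, this loss has a closed-form minimizer given by the regularized normal equations $(\bm{X}^T \bm{X} + \lambda \bm{I})\vec{w} = \bm{X}^T \vec{y}$. A small sketch with illustrative names (the bias column is regularized here for simplicity):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Illustrative usage with random data, not the pset dataset.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(50, 200))
y_train = rng.normal(size=50)
w_ridge = ridge_weights(X_train, y_train, lam=1.0)
print(w_ridge.shape)  # (200,)
```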
@@ -161,7 +161,7 @@ $$
\mathcal{L}_{\vec{w}} = \frac{1}{N} \sum_{i=1}^{N} \max(0, 1 - y_i \langle \vec{w}, \vec{x}_i \rangle) + \frac{\lambda}{2} \|\vec{w}\|^2 \,.
$$

-where $\lambda$ is a regularization parameter.
+where $\lambda$ is a regularization parameter. Here, the $\frac{\lambda}{2} \|\vec{w}\|^2$ term maximizes the margin (by keeping $\|\vec{w}\|$ small), and the hinge-loss sum penalizes points in proportion to how far they fall short of the margin.

To optimize this loss using gradient descent, we need the gradient of this loss with respect to the weights $\vec{w}$. For the second term, this is easy, but the hinge loss in the first term is not differentiable everywhere, so we work with subgradients, which are nonzero only for data points that are misclassified or lie within the margin. Show that under these conditions, the update rule for a single datapoint is given by:

@@ -179,6 +179,14 @@ where $\eta$ is the learning rate.

**(c) Implementing the SVM**

+Implement the SVM by using the loss function and the update rule from above. Use the `Perceptron` class as a template. Then, apply the SVM to the aptamer dataset that we used for classification and plot the decision boundary.
+
+What happens if you randomly change the sign of the labels of one or two data points? How does $\lambda$ affect the decision boundary?
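A possible sketch of this experiment, reusing `SupportVectorMachine`, `X`, `y`, and `SEED` from the script added in this commit and assuming the labels are encoded as -1/+1:

```python
import numpy as np

np.random.seed(SEED)

# Flip the sign of two randomly chosen labels (assumes y is encoded as -1/+1).
y_flipped = y.copy()
flip_idx = np.random.choice(len(y_flipped), size=2, replace=False)
y_flipped[flip_idx] *= -1

# Refit with different regularization strengths and compare the resulting boundaries.
for lam in [0.01, 0.1, 1.0, 10.0]:
    model = SupportVectorMachine(learning_rate=0.01, n_iterations=100, lam=lam)
    model.fit(X, y_flipped)
    accuracy = np.mean(model.predict(X) == y_flipped)
    print(f"lambda={lam}: accuracy={accuracy:.3f}, final margin={model.margins[-1]:.3f}")
```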
