# <i class="fa-solid fa-gear"></i> Support Vector Machines
After a brief excursion into generative models such as [LDA & QDA](5_LDA_QDA) or [Naïve Bayes](6_Naive_Bayes), we will now again discuss a discriminative family of models: Support Vector Machines (SVM). SVMs are powerful supervised learning models used for classification and regression tasks. When used for classification, they are called Support Vector Classifiers (SVC).

Let's consider some simulated classification data:

```{code-cell} ipython3
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from matplotlib.lines import Line2D
from sklearn.datasets import make_classification

sns.set_theme(style="darkgrid")

# NOTE: the trailing arguments of this call are not shown in the source; the last two are assumed
X, y = make_classification(n_samples=50, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=42)
```
There are infinitely many ways to separate the two classes, because you can find an infinite number of lines that separate them perfectly.

If we visualise this and add a new data point for classification, a potential issue becomes apparent. For some models, this data point would fall into Class 0, and for others into Class 1:
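
A possible way to sketch this is shown below (the candidate decision boundaries and the new point are chosen purely for illustration, not taken from the original figure):

```{code-cell} ipython3
# Draw the data, a few candidate separating lines, and a new, unlabelled point
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr")

# Candidate decision boundaries through the midpoint of the two class means
midpoint = (X[y == 0].mean(axis=0) + X[y == 1].mean(axis=0)) / 2
x_vals = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
for slope in (-2.0, -0.75, 0.5):
    ax.plot(x_vals, midpoint[1] + slope * (x_vals - midpoint[0]), "k--", lw=1)

# A new observation whose predicted class depends on which line we pick
ax.scatter(midpoint[0], midpoint[1], marker="*", s=200, color="green", label="new data point")
ax.legend()
plt.show()
```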
So evidently, we can't just be satisfied with an infinite number of possible solutions; we need a more justifiable way to pick one. If you remember, we already did so for linear regression: there, the least squares method chose the line that minimised the total squared distance between predictions and true values.

Support Vector Classifiers take a slightly different approach. As Robert Tibshirani put it, they are
> An approach to the classification problem in a way computer scientists would approach it.

Rather than minimising a squared error, they aim to find the hyperplane that maximises the margin: the distance between the separating hyperplane and the closest data points from each class. The idea is that by maximising this margin, we obtain a decision boundary that is both robust and generalisable.

- **Hyperplane**: A decision boundary that separates classes. In p dimensions, it is a (p−1)-dimensional flat affine subspace, given by the equation $\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0$. The vector $\beta = (\beta_1, \beta_2, \dots, \beta_p)$ is its normal vector, and in the case of two predictors the hyperplane is one-dimensional (a line).
- **Separating Hyperplane**: A hyperplane that correctly separates the data by class label.
- **Margin**: The (perpendicular) distance between the hyperplane and the closest training points. A maximal margin classifier chooses the hyperplane that maximises this margin.
- **Support Vectors**: Observations closest to the decision boundary. They define the margin and the classifier.
- **Soft Margin**: A method used when the data is not linearly separable. Allows some observations to violate the margin. Controlled via the hyperparameter $C$.
- **Kernel Trick**: Implicitly maps data into a higher-dimensional space to make it linearly separable using functions like polynomial or RBF (Gaussian) kernels.

To formalise this intuition, SVCs look for the maximum margin classifier: a hyperplane that not only separates the classes but does so with the greatest possible distance to the closest training samples. These closest samples are known as support vectors, and they uniquely determine the position of the hyperplane. All other samples can be moved without changing the decision boundary, making SVCs especially robust to outliers away from the margin.
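
One common way to write this optimisation problem down (following the formulation in *An Introduction to Statistical Learning*) is:

$$
\underset{\beta_0, \beta_1, \dots, \beta_p, M}{\text{maximise}} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1,
\qquad
y_i \left( \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \right) \geq M
\;\; \text{for all } i = 1, \dots, n.
$$

The constraint $\sum_{j} \beta_j^2 = 1$ ensures that the left-hand side of the second constraint is the distance of each observation to the hyperplane, so $M$ is exactly the margin being maximised.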
## Using SVCs
As you learned in the lecture, SVCs are considered to be one of the best "out of the box" classifiers and can be used in many scenarios. This includes:
- When the number of features is large relative to the number of samples
- When classes are not linearly separable
- When a robust and generalisable classifier is needed

If the data is not perfectly separable (either because the classes overlap or because they are not linearly separable), SVCs become creative in two ways:
1. "Soften" what is meant by separating the classes and allow for some errors (a soft margin; one common formulation is shown after this list)
2. Map the feature space into a higher dimension (the kernel trick)
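
For the first idea, the soft-margin problem can be written, again following the formulation in *An Introduction to Statistical Learning*, as:

$$
\underset{\beta_0, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n, M}{\text{maximise}} \; M
\quad \text{subject to} \quad
\sum_{j=1}^{p} \beta_j^2 = 1,
\qquad
y_i \left( \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} \right) \geq M (1 - \epsilon_i),
\qquad
\epsilon_i \geq 0, \;\; \sum_{i=1}^{n} \epsilon_i \leq C.
$$

The slack variables $\epsilon_i$ allow individual observations to be on the wrong side of the margin (or even of the hyperplane), and the budget $C$ controls how much total violation is tolerated. Note that scikit-learn's `C` parameter plays the opposite role: it acts as a penalty on violations, so a *larger* `C` leads to a narrower margin with fewer violations.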
### Example 1: Linear Classification
Fitting an SVC is straightforward:
```{code-cell} ipython3
from sklearn.svm import SVC

# Fit a linear SVC to the data generated above
# (C=1.0 is scikit-learn's default; the value used originally is not shown)
svc = SVC(kernel='linear', C=1.0)
svc.fit(X, y)

# The observations that act as support vectors
print(svc.support_vectors_)
```
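The picture changes when the classes cannot be separated by a straight line at all. As an illustration, consider data where one class forms a clump in the middle of the other (generated here with `make_circles`; the exact parameters are illustrative):

```{code-cell} ipython3
from sklearn.datasets import make_circles

# One class sits inside the other, so the classes are not linearly separable
X, y = make_circles(n_samples=100, factor=.1, noise=.1)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr")
plt.title("Data that is not linearly separable")
plt.show()
```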
In that case, a non-linear SVC can be applied. For example, a simple projection would be a radial basis function centred on the middle clump. As the following plot shows, the data then becomes linearly separable in three dimensions:
```{code-cell} ipython3
from mpl_toolkits import mplot3d

# Apply radial basis function to the feature space
r = np.exp(-(X ** 2).sum(1))

# Plot features in 3D
fig = plt.figure()
ax = fig.add_subplot(projection='3d')

colors = np.array(["#0173B2", "#DE8F05"])[y]  # colors for each class

# Sketch: scatter the two original features against the RBF value
ax.scatter(X[:, 0], X[:, 1], r, c=colors)
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('r')
plt.show()
```
### Example 2: Non-linear Classification with RBF Kernel
We can create plots similar to those above, first with a linear SVC and second with an RBF SVC, to visualise the decision boundary, margins, and support vectors:
```{code-cell} ipython3
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Minimal sketch: fit a linear and an RBF SVC on the training data
svc_linear = SVC(kernel="linear").fit(X_train, y_train)
svc_rbf = SVC(kernel="rbf").fit(X_train, y_train)

print(classification_report(y_test, svc_linear.predict(X_test)))
print(classification_report(y_test, svc_rbf.predict(X_test)))
```
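A sketch of such a plot, using the two models fitted above and `DecisionBoundaryDisplay` (one convenient way to draw the decision function in recent versions of scikit-learn; the styling is illustrative): the solid line is the decision boundary, the dashed lines are the margins, and the circled points are the support vectors.

```{code-cell} ipython3
from sklearn.inspection import DecisionBoundaryDisplay

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)

for ax, model, title in zip(axes, [svc_linear, svc_rbf], ["Linear SVC", "RBF SVC"]):
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", s=20)
    # Decision boundary (level 0) and margins (levels -1 and 1)
    DecisionBoundaryDisplay.from_estimator(
        model, X, ax=ax, plot_method="contour",
        response_method="decision_function",
        levels=[-1, 0, 1], linestyles=["--", "-", "--"], colors="k",
    )
    # Highlight the support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=100, facecolors="none", edgecolors="k")
    ax.set_title(title)

plt.show()
```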
Notably, we now see that there are a lot more support vectors, especially when we fit a linear SVC to the data. This is a consequence of the soft margin: every observation that lies on the margin or violates it acts as a support vector, so when the classes overlap, many more points end up defining the decision boundary.
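
We can confirm this with the `n_support_` attribute of the two (illustrative) models fitted above, which gives the number of support vectors per class:

```{code-cell} ipython3
print("Linear SVC:", svc_linear.n_support_)
print("RBF SVC:   ", svc_rbf.n_support_)
```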
## Multiclass Classification
SVCs are inherently binary classifiers but can be extended:
* **One-vs-One**: $\binom{K}{2}$ classifiers, one for each pair of classes.
* **One-vs-All**: $K$ classifiers, each comparing one class against the rest.
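
A brief sketch of both strategies in scikit-learn (the three-class data below is illustrative): `SVC` itself already handles multiple classes with a one-vs-one scheme internally, while `OneVsRestClassifier` wraps a binary classifier into a one-vs-all scheme.

```{code-cell} ipython3
from sklearn.multiclass import OneVsRestClassifier

# Illustrative three-class data
X_mc, y_mc = make_classification(n_samples=150, n_features=2, n_informative=2,
                                 n_redundant=0, n_classes=3, n_clusters_per_class=1,
                                 random_state=0)

# SVC handles multiple classes with a one-vs-one scheme internally
ovo = SVC(kernel="rbf").fit(X_mc, y_mc)

# One-vs-all (one-vs-rest) by wrapping the binary classifier
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X_mc, y_mc)

print(ovo.predict(X_mc[:5]), ovr.predict(X_mc[:5]))
```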
## Choosing Hyperparameters
SVCs have a few hyperparameters. Please have a look at the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) for a more in-depth overview. For the SVC used in the previous examples, the most important ones are:
* `C`: Regularisation parameter; trade-off between margin width and classification error.
* `kernel`: `'linear'`, `'poly'`, `'rbf'`, `'sigmoid'`, or a custom kernel.
* `gamma`: Kernel coefficient for the RBF, polynomial, and sigmoid kernels; controls how far the influence of a single training example reaches.
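
For reference, these map directly onto the `SVC` constructor; the values shown here are scikit-learn's defaults:

```{code-cell} ipython3
# scikit-learn's defaults: an RBF kernel with C=1.0 and gamma="scale"
SVC(C=1.0, kernel="rbf", gamma="scale")
```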
As always, hyperparameters should be tuned using [cross-validation](book/1_basics/3_resampling) to balance bias and variance. It often makes sense to use a [grid search](https://scikit-learn.org/stable/modules/grid_search.html) or related strategies to find the optimal solution:
```{code-cell} ipython3
import pandas as pd
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV

# Generate data
X, y = make_circles(100, factor=.1, noise=.3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Minimal sketch of the search itself; the parameter grid is an illustrative choice
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
pd.DataFrame(grid.cv_results_)[["params", "mean_test_score", "rank_test_score"]].head()
```
To summarise:

- Support Vector Classifiers are a robust and versatile tool for classification tasks
- The key ideas are rooted in geometry: finding the optimal hyperplane that separates the data with maximum margin
- With the use of kernels, SVCs extend effectively to non-linear decision boundaries
- Multiclass classification can be done in a one-vs-one or one-vs-all approach