Commit 5fa69cd

updated neural networks and PCA sections and exercises
1 parent fc45089

6 files changed
Lines changed: 1026 additions & 289 deletions

File tree

book/2_models/10_PCA.md

Lines changed: 108 additions & 4 deletions
```diff
@@ -10,11 +10,115 @@ kernelspec:
   display_name: Python 3
   language: python
   name: python3
-myst:
-  substitutions:
-    ref_test: 1
 ---
-https://www.datacamp.com/tutorial/principal-component-analysis-in-python
```

# <i class="fa-solid fa-magnifying-glass-chart"></i> Principal Component Analysis

Principal Component Analysis (PCA) is a *dimensionality reduction* technique. It is considered an *unsupervised* machine learning method, since we do not model any relationship with a target/response variable. Instead, PCA finds a lower-dimensional representation of our data.

Simply put, PCA finds the principal components of the data: the *eigenvectors* of the covariance matrix of the centered data matrix $X$. Each eigenvector points in a direction of maximal variance, and the components are ordered by how much variance they explain.

```{code-cell} ipython3
---
tags:
- hide-input
---

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulate & center 2D data
np.random.seed(0)
X = np.random.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)
Xc = X - X.mean(axis=0)

# Run PCA: extract eigenvectors and eigenvalues
pca2d = PCA().fit(Xc)
pcs, scales = pca2d.components_, np.sqrt(pca2d.explained_variance_)

# Plot original data and principal components
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], alpha=0.5, label='Data')
mean = X.mean(axis=0)
# Draw PC1 (red) and PC2 (blue), scaled to 3 standard deviations for visibility
ax.arrow(*mean, *(pcs[0] * scales[0] * 3), head_width=0.2, head_length=0.3, color='r', linewidth=2, label='PC1')
ax.arrow(*mean, *(pcs[1] * scales[1] * 3), head_width=0.2, head_length=0.3, color='b', linewidth=2, label='PC2')
ax.set(xlabel="Feature 1", ylabel="Feature 2", title="PCA for 2D data")
ax.axis('equal')
ax.legend()
plt.show()
```

This example illustrates how PCA finds the two orthogonal directions (eigenvectors) along which the data vary most.

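To make the eigenvector claim concrete, here is a minimal check, assuming `Xc` and `pca2d` from the cell above: the principal components should match the eigenvectors of the sample covariance matrix, up to sign.

```{code-cell} ipython3
# Sanity check (assumes Xc and pca2d from above): the principal components
# are the eigenvectors of the sample covariance matrix of the centered data.
cov = np.cov(Xc, rowvar=False)          # 2x2 sample covariance (ddof=1)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("Eigenvalues:       ", eigvals)
print("Explained variance:", pca2d.explained_variance_)
# Eigenvectors are only defined up to sign, so compare absolute values
print(np.allclose(np.abs(eigvecs.T), np.abs(pca2d.components_)))
```
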
Next, we apply PCA to the Iris dataset (4 features). First, we can inspect which features are most discriminative with a pairplot:

```{code-cell} ipython3
import seaborn as sns
from sklearn.datasets import load_iris

# Load Iris as a DataFrame for easy plotting
iris = load_iris(as_frame=True)
iris.frame["target"] = iris.target_names[iris.target]
_ = sns.pairplot(iris.frame, hue="target")
```

The pairplot shows that **petal length** and **petal width** separate the three species most clearly.

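This impression can be quantified with a rough per-feature separability score; the sketch below (assuming the `iris.frame` built above) divides the variance of the per-species means by the average within-species variance, so larger values mean a feature separates the species better.

```{code-cell} ipython3
# Rough separability score per feature (assumes iris.frame from above):
# between-species spread relative to the average within-species spread.
grouped = iris.frame.groupby("target")
score = grouped.mean().var() / grouped.var().mean()
print(score.sort_values(ascending=False))
```
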
```{code-cell} ipython3
from sklearn.decomposition import PCA

# Prepare data arrays
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Fit PCA, retaining the first 3 principal components
pca = PCA(n_components=3).fit(X)

# Display feature loadings and explained variance
print("Feature names:\n", feature_names)
print("\nPrincipal components (loadings):\n", pca.components_)
print("\nExplained variance ratio:\n", pca.explained_variance_ratio_)
```

86+
87+
* **`pca.components_`**: each row is an eigenvector (unit length) showing how the four original features load onto each principal component.
88+
* **`pca.explained_variance_ratio_`**: the fraction of total variance each component explains (e.g. PC1 ≈ 0.92, PC2 ≈ 0.05, PC3 ≈ 0.02).
89+
90+
Since PC1 explains over 92% of the variance, projecting onto it alone already captures most of the dataset’s structure.
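The ratio is nothing mysterious: it is each component's variance divided by the total variance of the data. A quick sanity check, assuming `X` and `pca` from above (note that `X` is a DataFrame here):

```{code-cell} ipython3
# explained_variance_ratio_ equals explained_variance_ / total variance
# (assumes X and pca from above; ddof=1 matches sklearn's convention)
total_var = X.var(ddof=1).sum()
print(pca.explained_variance_ / total_var)
print(pca.explained_variance_ratio_)
```
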
Finally, we can project the data with the `.transform(X)` method. This does the following:

1. Centers `X` by subtracting each feature's mean (stored in `pca.mean_`).
2. Computes dot products with the selected eigenvectors.

The resulting `X_pca` matrix has shape `(n_samples, 3)`, giving the coordinates of each sample in the PCA subspace.

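In other words, the projection is just centering followed by a matrix product. A minimal sketch of the equivalence, assuming `X` and `pca` from above:

```{code-cell} ipython3
# transform() == center, then project onto the components
# (assumes X and pca from above; pca.mean_ holds the fitted feature means)
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))
```
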
```{code-cell} ipython3
# Project (transform) the data onto the first 3 PCs
X_pca = pca.transform(X)

# Plot the 1D, 2D, and 3D projections side by side
fig = plt.figure(figsize=(12, 4), constrained_layout=True)

ax1 = fig.add_subplot(1, 3, 1)
scatter1 = ax1.scatter(X_pca[:, 0], np.zeros_like(X_pca[:, 0]), c=y, cmap='viridis', s=40)
ax1.set(title='1 Component', xlabel='PC1')
ax1.get_yaxis().set_visible(False)

ax2 = fig.add_subplot(1, 3, 2)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=40)
ax2.set(title='2 Components', xlabel='PC1', ylabel='PC2')

ax3 = fig.add_subplot(1, 3, 3, projection='3d', elev=-150, azim=110)
ax3.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis', s=40)
ax3.set(title='3 Components', xlabel='PC1', ylabel='PC2', zlabel='PC3')

# Build a species legend from the first scatter's color mapping
handles, labels = scatter1.legend_elements()
legend = ax1.legend(handles, iris.target_names, loc='upper left', title='Species')
ax1.add_artist(legend);
```
