kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

# <i class="fa-solid fa-magnifying-glass-chart"></i> Principal Component Analysis

Principal Component Analysis (PCA) is a *dimensionality reduction* technique. It is considered an *unsupervised* machine learning method, since we do not model any relationship with a target/response variable. Instead, PCA finds a lower-dimensional representation of our data.

Simply put, PCA finds the principal components: the *eigenvectors* of the covariance matrix of the centered data matrix $X$. Each eigenvector points in a direction of maximal variance, ordered by how much variance it explains.

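Concretely, as a brief restatement in standard notation (an addition to the original text; the symbols $X_c$, $\Sigma$, $v_i$, and $\lambda_i$ are introduced here): with centered data $X_c \in \mathbb{R}^{n \times p}$, PCA eigendecomposes the sample covariance matrix

$$
\Sigma = \frac{1}{n - 1} X_c^\top X_c, \qquad \Sigma v_i = \lambda_i v_i,
$$

where the eigenvectors $v_i$ are the principal components and the eigenvalues $\lambda_i$ give the variance captured along each direction.
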
```{code-cell} ipython3
---
tags:
- hide-input
---

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Simulate & center 2D data
np.random.seed(0)
X = np.random.multivariate_normal([0, 0], [[3, 2], [2, 2]], size=500)
Xc = X - X.mean(axis=0)

# Run PCA: extract eigenvectors and eigenvalues
pca2d = PCA().fit(Xc)
pcs, scales = pca2d.components_, np.sqrt(pca2d.explained_variance_)

# Plot original data and principal components
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], alpha=0.5, label='Data')
mean = X.mean(axis=0)
# Draw PC1 (red) and PC2 (blue)
ax.arrow(*mean, *(pcs[0] * scales[0] * 3), head_width=0.2, head_length=0.3, color='r', linewidth=2, label='PC1')
ax.arrow(*mean, *(pcs[1] * scales[1] * 3), head_width=0.2, head_length=0.3, color='b', linewidth=2, label='PC2')
ax.set(xlabel="Feature 1", ylabel="Feature 2", title="PCA for 2D data")
ax.axis('equal')
ax.legend()
plt.show()
```

This example illustrates how PCA finds the two orthogonal directions (eigenvectors) along which the data vary most.

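To tie this picture back to the eigenvector view, the following check (an addition, not part of the original walkthrough) eigendecomposes the sample covariance matrix directly and compares the result with what `PCA` found; the two agree up to the sign of each vector.

```{code-cell} ipython3
# Added sanity check: PCA directions are eigenvectors of the covariance matrix
cov = np.cov(Xc, rowvar=False)           # 2x2 sample covariance of the centered data
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print("Eigenvalues of the covariance matrix:", eigvals)
print("Explained variance from PCA:", pca2d.explained_variance_)
# Rows of components_ match the eigenvectors up to an overall sign
print(np.allclose(np.abs(eigvecs.T), np.abs(pca2d.components_)))
```
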
Next, we apply PCA to the Iris dataset (4 features). First, we can inspect which features are most discriminative with a pairplot:

```{code-cell} ipython3
import seaborn as sns
from sklearn.datasets import load_iris

# Load Iris as a DataFrame for easy plotting
iris = load_iris(as_frame=True)
iris.frame["target"] = iris.target_names[iris.target]
_ = sns.pairplot(iris.frame, hue="target")
```

The pairplot shows that **petal length** and **petal width** separate the three species most clearly.

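To put a number on that visual impression (an aside added here, not in the original text), a univariate ANOVA F-score per feature, via `sklearn.feature_selection.f_classif`, ranks how well each feature separates the three species on its own:

```{code-cell} ipython3
from sklearn.feature_selection import f_classif

# Higher F-score = stronger univariate separation of the three species
f_scores, _ = f_classif(iris.data, iris.target)
for name, score in zip(iris.feature_names, f_scores):
    print(f"{name:>20s}: F = {score:.1f}")
```
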
```{code-cell} ipython3
from sklearn.decomposition import PCA

# Prepare data arrays
X, y = iris.data, iris.target
feature_names = iris.feature_names

# Fit PCA to retain first 3 principal components
pca = PCA(n_components=3).fit(X)

# Display feature loadings and explained variance
print("Feature names:\n", feature_names)
print("\nPrincipal components (loadings):\n", pca.components_)
print("\nExplained variance ratio:\n", pca.explained_variance_ratio_)
```

* **`pca.components_`**: each row is an eigenvector (unit length) showing how the four original features load onto each principal component.
* **`pca.explained_variance_ratio_`**: the fraction of total variance each component explains (e.g. PC1 ≈ 0.92, PC2 ≈ 0.05, PC3 ≈ 0.02).

Since PC1 explains over 92% of the variance, projecting onto it alone already captures most of the dataset’s structure.

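A common way to act on these ratios (shown here as an illustrative aside) is to look at the cumulative explained variance, or to pass a float to `n_components` so that `PCA` keeps just enough components to reach a target fraction of the variance:

```{code-cell} ipython3
# Cumulative explained variance of the 3-component fit
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))

# Ask PCA for enough components to explain at least 95% of the variance
pca95 = PCA(n_components=0.95).fit(X)
print("Components needed for 95% of the variance:", pca95.n_components_)
```
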
Finally, we can project the data with the `.transform(X)` method. This does the following:

1. Centers `X` by subtracting each feature’s mean.
2. Computes dot products with the selected eigenvectors.

The resulting `X_pca` matrix has shape `(n_samples, 3)`, giving the coordinates of each sample in the PCA subspace.

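As a quick check on that description (an added sketch, not from the original), `transform` can be reproduced by hand: subtract the per-feature means stored in `pca.mean_`, then take dot products with the rows of `pca.components_`:

```{code-cell} ipython3
# Manual projection: center with the fitted means, then project onto the eigenvectors
X_manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X)))   # True: matches sklearn's transform
```
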
```{code-cell} ipython3
# Project (transform) the data into the first 3 PCs
X_pca = pca.transform(X)

# Plot
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 4), constrained_layout=True)

ax1 = fig.add_subplot(1, 3, 1)
scatter1 = ax1.scatter(X_pca[:, 0], np.zeros_like(X_pca[:, 0]), c=y, cmap='viridis', s=40)
ax1.set(title='1 Component', xlabel='PC1')
ax1.get_yaxis().set_visible(False)

ax2 = fig.add_subplot(1, 3, 2)
ax2.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', s=40)
ax2.set(title='2 Components', xlabel='PC1', ylabel='PC2')

ax3 = fig.add_subplot(1, 3, 3, projection='3d', elev=-150, azim=110)
ax3.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis', s=40)
ax3.set(title='3 Components', xlabel='PC1', ylabel='PC2', zlabel='PC3')

handles, labels = scatter1.legend_elements()
legend = ax1.legend(handles, iris.target_names, loc='upper left', title='Species')
ax1.add_artist(legend);
```