What is PCA?
Principal Component Analysis is a technique to extract the most important information from a dataset. Imagine you have a table of thousands of measurements — PCA finds the directions where the data varies the most, and represents everything using far fewer numbers.
The Singular Value Decomposition
Every matrix A can be broken into rank-1 pieces. This is the Singular Value Decomposition (SVD). Understanding it is the key to everything in PCA.
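The rank-1 pieces can be checked numerically: summing σᵢuᵢvᵢᵀ over all i rebuilds A exactly. A small sketch using a random matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

# SVD: A = U @ diag(sigma) @ V.T
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a sum of rank-1 pieces sigma_i * u_i * v_i^T
A_rebuilt = sum(sigma[i] * np.outer(U[:, i], Vt[i]) for i in range(len(sigma)))

print(np.allclose(A, A_rebuilt))  # True — the rank-1 pieces add up to A
```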
Suppose we have a diagonal matrix A. Its SVD is trivial to see — the singular values are right on the diagonal:
| Piece | Singular Value (σ) | Importance | Keep it? |
|---|---|---|---|
| σ₁ | 4.0 | ████████ 53% | ✓ Yes |
| σ₂ | 3.0 | █████ 30% | ✓ Yes |
| σ₃ | 2.0 | ██ 13% | ✗ Drop |
| σ₄ | 1.0 | █ 3% | ✗ Drop |
```python
import numpy as np

# Our data matrix A — diagonal, so the singular values sit right on the diagonal
A = np.diag([4.0, 3.0, 2.0, 1.0])

# Compute SVD: A = U @ diag(sigma) @ V.T
U, sigma, Vt = np.linalg.svd(A)
print("Singular values:", sigma)  # [4. 3. 2. 1.]

# Best rank-2 approximation (keep the top 2 pieces)
k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# How much variance is explained?
explained = np.sum(sigma[:k]**2) / np.sum(sigma**2)
print(f"Variance explained: {explained:.1%}")  # 83.3%
```
The Eckart-Young Theorem
This is the key theorem that justifies PCA. It tells us that the SVD gives the provably best low-rank approximation — you cannot do better.
Say A = diag(4, 3, 2, 1). Prof. Strang asks: could a different rank-2 matrix B — say one whose diagonal is (3.5, 3.5, 0, 1.5) — beat the SVD truncation A₂?
| Matrix | Strategy | Diagonal Errors | Off-diagonal Penalty | Total Error |
|---|---|---|---|---|
| A₂ (SVD) | Keep σ₁=4, σ₂=3 | 0, 0, 2, 1 | 0 | ‖error‖ = σ₃ = 2 |
| B (alternative) | diagonal (3.5, 3.5, 0, 1.5) | 0.5, 0.5, 2, 0.5 | Needs large off-diagonal entries to have rank 2 | ‖error‖ > 2 |
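A quick numerical sanity check (a sketch, not a proof): sample many random rank-2 candidates B and confirm that none beats the error ‖A − A₂‖₂ = σ₃ = 2 achieved by the SVD truncation.

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])

# SVD truncation to rank 2
U, s, Vt = np.linalg.svd(A)
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
err_svd = np.linalg.norm(A - A2, 2)
print(err_svd)  # 2.0 — the first dropped singular value

# Try many random rank-2 candidates B = X @ Y with X (4x2), Y (2x4)
rng = np.random.default_rng(42)
best = min(
    np.linalg.norm(A - rng.standard_normal((4, 2)) @ rng.standard_normal((2, 4)), 2)
    for _ in range(10_000)
)
print(best >= 2.0)  # True — Eckart-Young: no rank-2 B does better
```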
Matrix Norms — How to Measure a Matrix
A norm is a measure of "how big" a matrix is. There are many possible norms. The three most important in data science all depend only on singular values.
| Norm | Formula | Intuition | Application |
|---|---|---|---|
| L2 (Spectral) ‖A‖₂ | = σ₁ | The single largest singular value: how much can A "stretch" a unit vector? | Stability analysis, condition number |
| Frobenius ‖A‖_F | = √(σ₁² + σ₂² + ··· + σᵣ²) | Treat A as one long vector and take its Euclidean length (square root of the sum of all squared entries). | PCA variance explained, regression |
| Nuclear ‖A‖_* | = σ₁ + σ₂ + ··· + σᵣ | Sum of all singular values — like the L1 norm for matrices, it promotes low-rank solutions. | Netflix completion, MRI, recommendations |
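A small sketch verifying that all three norms can be read off the singular values (`np.linalg.norm` supports each via `ord=2`, `ord='fro'`, and `ord='nuc'`):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
sigma = np.linalg.svd(A, compute_uv=False)

# All three norms depend only on the singular values
print(np.isclose(np.linalg.norm(A, 2), sigma[0]))                        # spectral = sigma_1
print(np.isclose(np.linalg.norm(A, 'fro'), np.sqrt(np.sum(sigma**2))))   # Frobenius
print(np.isclose(np.linalg.norm(A, 'nuc'), np.sum(sigma)))               # nuclear = sum of all sigmas
```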
📐 Vector Norms (Review)
- ‖v‖₂ = √(v₁² + ··· + vₙ²) — Euclidean length, least squares
- ‖v‖₁ = |v₁| + ··· + |vₙ| — promotes sparsity (mostly-zero answers)
- ‖v‖_∞ = max|vᵢ| — worst-case component
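All three vector norms are available directly through `np.linalg.norm` (a quick sketch):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

print(np.linalg.norm(v, 2))       # 5.0 — Euclidean length
print(np.linalg.norm(v, 1))       # 7.0 — sum of absolute values
print(np.linalg.norm(v, np.inf))  # 4.0 — largest component in magnitude
```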
📊 Why L1 gives sparse solutions
- Minimizing L2 → many small components (hard to interpret)
- Minimizing L1 → mostly zero components (easy to interpret)
- This is the basis of "basis pursuit" in signal processing
- Also the foundation of LASSO regression in statistics
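To see the sparsity effect concretely, here is a sketch comparing the minimum-L2 and minimum-L1 solutions of the same underdetermined system Ax = b (basis pursuit via the standard linear-programming reformulation; SciPy is assumed available):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 4, 10                      # 4 equations, 10 unknowns
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Minimum-L2 solution: np.linalg.lstsq returns the least-norm solution
x_l2, *_ = np.linalg.lstsq(A, b, rcond=None)

# Minimum-L1 solution (basis pursuit): split x = p - q with p, q >= 0,
# then minimize sum(p + q) subject to A @ (p - q) = b
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b)  # default bounds enforce p, q >= 0
x_l1 = res.x[:n] - res.x[n:]

print("nonzeros, L2:", np.sum(np.abs(x_l2) > 1e-8))  # typically all 10
print("nonzeros, L1:", np.sum(np.abs(x_l1) > 1e-8))  # at most 4 — sparse!
```

The L1 minimizer lands on a vertex of the feasible polytope, which is why it has at most m nonzero components.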
Step 1 — Center the Data
Before applying PCA, we must remove the mean of each row (each feature). This moves the data cloud so it is centered at the origin. Without centering, the first singular vector points toward the data's mean rather than its main direction of variation.
Original data — heights (cm) and ages (years) of 5 people:
| Person | Height (cm) | Age (years) |
|---|---|---|
| P₁ | 160 | 20 |
| P₂ | 170 | 25 |
| P₃ | 175 | 30 |
| P₄ | 180 | 35 |
| P₅ | 165 | 22 |
| Mean | 170 | 26.4 |
Centered data — after subtracting the mean of each row:
| Person | Height − 170 | Age − 26.4 |
|---|---|---|
| P₁ | −10 | −6.4 |
| P₂ | 0 | −1.4 |
| P₃ | +5 | +3.6 |
| P₄ | +10 | +8.6 |
| P₅ | −5 | −4.4 |
| Sum → | 0 ✓ | 0 ✓ |
```python
import numpy as np

# Data matrix: rows = features, columns = samples
A = np.array([
    [160, 170, 175, 180, 165],  # heights
    [ 20,  25,  30,  35,  22],  # ages
])

# Subtract row means (keepdims keeps the shape for broadcasting)
row_means = A.mean(axis=1, keepdims=True)
A_centered = A - row_means

print("Row means:", row_means.flatten())                     # [170.   26.4]
print("Row sums after centering:", A_centered.sum(axis=1))   # [0. 0.] ← both rows sum to zero ✓
```
Step 2 — Compute the Covariance Matrix
Once the data is centered, we compute the sample covariance matrix S. This tells us how features vary together — the key to finding the principal directions.
Using our centered data Ã, let n = 5 samples, so we divide by (n−1) = 4:
| Entry | Meaning | Calculation | Value |
|---|---|---|---|
| S₁₁ | Variance of height | ((−10)² + 0² + 5² + 10² + (−5)²)/4 | 62.5 |
| S₂₂ | Variance of age | ((−6.4)² + (−1.4)² + 3.6² + 8.6² + (−4.4)²)/4 | 37.3 |
| S₁₂ = S₂₁ | Covariance (height & age) | ((−10)(−6.4) + ··· + (−5)(−4.4))/4 | 47.5 |

S = ⎡ 62.5  47.5 ⎤
    ⎣ 47.5  37.3 ⎦
```python
import numpy as np

# Centered data from Step 1
A = np.array([
    [160, 170, 175, 180, 165],  # heights
    [ 20,  25,  30,  35,  22],  # ages
])
A_centered = A - A.mean(axis=1, keepdims=True)
n = A_centered.shape[1]  # number of samples

# Method 1: manual formula S = A_centered @ A_centered.T / (n - 1)
S_manual = (A_centered @ A_centered.T) / (n - 1)

# Method 2: NumPy built-in (rowvar=True → each row is a variable)
S = np.cov(A_centered, rowvar=True)
print(S)
# [[62.5 47.5]
#  [47.5 37.3]]
# S is symmetric ✓ (S == S.T)
```
Step 3 — Apply SVD to Find Principal Components
With the centered data matrix à ready, we apply SVD. The right singular vectors v₁, v₂, ... are the principal components — the directions of maximum variance.
After SVD on our centered height-age matrix Ã, we find:
| Component | Singular Value σᵢ | Variance Explained | Direction v |
|---|---|---|---|
| PC₁ (1st) | 19.90 | 99.2% | [0.79, 0.61]ᵀ |
| PC₂ (2nd) | 1.74 | 0.8% | [−0.61, 0.79]ᵀ |
```python
import numpy as np

# ─── Complete PCA from scratch ───────────────────────────────

# Step 1: Start with your data matrix
A = np.array([
    [160, 170, 175, 180, 165],  # heights (cm)
    [ 20,  25,  30,  35,  22],  # ages (years)
])

# Step 2: Center the data (subtract row means)
A_c = A - A.mean(axis=1, keepdims=True)

# Step 3: SVD of the centered matrix
U, sigma, Vt = np.linalg.svd(A_c, full_matrices=False)

# Step 4: Explained variance
explained = sigma**2 / np.sum(sigma**2)
print("Explained variance ratio:", np.round(explained, 3))
# [0.992 0.008] → PC1 explains 99.2%!

# Step 5: The principal components (columns of U, since rows = features)
PC1 = U[:, 0]  # first left singular vector
print("PC1 direction:", np.round(PC1, 2))
# [0.79 0.61] (up to sign) → a mix of height and age

# Step 6: Project data onto PC1 (reduce 2D → 1D)
k = 1
Z = U[:, :k].T @ A_c  # shape: (k, n_samples)
print("PC1 scores:", np.round(Z[0], 2))
# Each sample's position along the main growth direction

# ─── Using sklearn (production) ──────────────────────────────
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Z_sk = pca.fit_transform(A.T)  # sklearn wants (n_samples, n_features)
print("Explained variance (sklearn):", pca.explained_variance_ratio_)
```
PCA vs Least Squares Regression
Both find the "best line" through data — but they minimize different things. This is a subtle but critical distinction that trips up many students.
📉 Least Squares Regression
- Minimizes vertical distances from points to line
- Solves AᵀAx̂ = Aᵀb (normal equations)
- Asymmetric — one variable is "output" (y), others are "inputs"
- Fast — linear system of equations
- Use when predicting y from x (e.g., house price from size)
📐 PCA
- Minimizes perpendicular distances from points to line
- Solves via SVD — no linear system needed
- Symmetric — no "output" variable, all features equal
- Richer — gives full variance structure, not just one line
- Use when exploring structure, reducing dimensions, visualizing
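The contrast shows up on the height/age data itself. A sketch comparing the regression slope (vertical errors) with the PC1 slope (perpendicular errors) — the numbers differ because the two problems minimize different distances:

```python
import numpy as np

heights = np.array([160, 170, 175, 180, 165], dtype=float)
ages = np.array([20, 25, 30, 35, 22], dtype=float)

# Center both variables
h = heights - heights.mean()
a = ages - ages.mean()

# Least squares: predict age from height → minimize vertical distances
slope_ls = (h @ a) / (h @ h)  # one-variable normal equation

# PCA: first left singular vector of the centered 2 x n matrix
A_c = np.vstack([h, a])
U, sigma, Vt = np.linalg.svd(A_c, full_matrices=False)
slope_pca = U[1, 0] / U[0, 0]  # rise/run of the PC1 direction (sign-invariant)

print(f"Regression slope: {slope_ls:.3f}")  # ≈ 0.760
print(f"PCA (PC1) slope:  {slope_pca:.3f}") # ≈ 0.769
```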
Real-World Applications
The SVD and matrix norms we studied power some of the most impactful algorithms in modern data science and engineering.
1. Center: Ã = A − mean(rows)
2. SVD: Ã = U Σ Vᵀ
3. Choose k: keep the top k where (σ₁² + ··· + σₖ²) / (σ₁² + ··· + σᵣ²) ≥ threshold
4. Project: Z = Uₖᵀ Ã (k × n compressed representation)
Output: Z — low-dimensional data, principal directions in U, variances σᵢ²
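The recipe above packs into one small function (a sketch following the rows = features convention used throughout):

```python
import numpy as np

def pca(A, threshold=0.95):
    """PCA for a (features x samples) matrix A. Returns scores Z,
    principal directions (columns of U_k), and squared singular values."""
    A_c = A - A.mean(axis=1, keepdims=True)                  # 1. center
    U, sigma, Vt = np.linalg.svd(A_c, full_matrices=False)   # 2. SVD
    ratio = np.cumsum(sigma**2) / np.sum(sigma**2)
    k = int(np.searchsorted(ratio, threshold) + 1)           # 3. choose k
    Z = U[:, :k].T @ A_c                                     # 4. project
    return Z, U[:, :k], sigma**2

A = np.array([
    [160, 170, 175, 180, 165],  # heights (cm)
    [ 20,  25,  30,  35,  22],  # ages (years)
])
Z, directions, var = pca(A)
print("Compressed shape:", Z.shape)  # (1, 5) — one component suffices
```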