MIT 18.065 · Gilbert Strang

Principal Component
Analysis & SVD

A visual, step-by-step guide to understanding how the Singular Value Decomposition powers PCA — the most important tool in data science for finding structure in data.

Linear Algebra Data Science Machine Learning Matrix Theory
Introduction

What is PCA?

Principal Component Analysis is a technique to extract the most important information from a dataset. Imagine you have a table of thousands of measurements — PCA finds the directions where the data varies the most, and represents everything using far fewer numbers.

📊
The Problem
You have a big matrix of data. Most of it is noise or redundancy. PCA finds what actually matters — the low-rank structure hiding inside.
🎯
The Goal
Reduce a high-dimensional dataset to a small number of "principal components" that capture the maximum variance — without losing the important signal.
🔬
The Tool
SVD (Singular Value Decomposition) is the mathematical engine behind PCA. It breaks any matrix into structured, interpretable pieces.
💡
Key Insight
If you memorize all the data exactly, you haven't learned anything. The goal is to find the important patterns — that's what PCA does.
💬
Prof. Strang's Key Quote
"In machine learning, if you've learned all the training data, you haven't learned anything, really. You've just copied it. The whole point is to learn important facts about the data."
Step 01 · Foundation

The Singular Value Decomposition

Every matrix A can be broken into rank-1 pieces. This is the Singular Value Decomposition (SVD). Understanding it is the key to everything in PCA.

SVD — The Main Formula Any m×n matrix
A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + σ₃u₃v₃ᵀ + ··· + σᵣuᵣvᵣᵀ
σ₁ ≥ σ₂ ≥ ··· ≥ σᵣ ≥ 0 are the singular values (how important each piece is) · u's are orthonormal left singular vectors · v's are orthonormal right singular vectors · r is the rank
Example Breaking a 4×4 matrix into rank-1 pieces

Suppose we have the diagonal matrix A = diag(4, 3, 2, 1). Its SVD is trivial to see — the singular values are right on the diagonal:

Piece | Singular Value (σ) | Importance (σ²/Σσ²) | Keep it?
σ₁ | 4.0 | 53% | ✓ Yes
σ₂ | 3.0 | 30% | ✓ Yes
σ₃ | 2.0 | 13% | ✗ Drop
σ₄ | 1.0 | 3% | ✗ Drop
Best Rank-2 Approximation
Keep σ₁ and σ₂ only. This captures 83% of the variance (25/30 of Σσᵢ²) using just 2 pieces instead of 4. This is the Eckart-Young theorem in action — proven to be the best possible rank-2 approximation.
Python · NumPy SVD
import numpy as np

# Our data matrix A
A = np.diag([4.0, 3.0, 2.0, 1.0])

# Compute SVD: A = U @ diag(sigma) @ V.T
U, sigma, Vt = np.linalg.svd(A)

print("Singular values:", sigma)
# Output: [4. 3. 2. 1.]

# Best rank-2 approximation (keep top 2 pieces)
k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# How much variance is explained?
explained = np.sum(sigma[:k]**2) / np.sum(sigma**2)
print(f"Variance explained: {explained:.1%}")
# Output: 83.3%
Step 02 · Theorem

The Eckart-Young Theorem

This is the key theorem that justifies PCA. It tells us that the SVD gives the provably best low-rank approximation — you cannot do better.

Eckart-Young Theorem (1936) Fundamental Result
‖A − Aₖ‖ ≤ ‖A − B‖
for any rank-k matrix B
Aₖ (using the first k pieces of SVD) is closer to A than any other rank-k matrix B. This works for the L2, Frobenius, and Nuclear norms.
Why?
Intuition: The Greedy Argument
The singular values σ₁ ≥ σ₂ ≥ ··· are sorted by importance. Taking the top k is like keeping the "heaviest" pieces. Any other rank-k matrix mixes directions and pays a penalty — it can't do better than picking purely along the dominant directions.
Why?
Orthogonal Invariance
All three key norms (L2, Frobenius, Nuclear) are orthogonally invariant — multiplying by an orthogonal matrix Q doesn't change them. This means the diagonal case (easy to see) covers all cases.
Key
The Error You Accept
When you drop pieces k+1, k+2, ..., r, the approximation error is exactly ‖A − Aₖ‖₂ = σₖ₊₁ (in the L2 norm). The next singular value tells you exactly how much you're losing.
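This claim is easy to verify numerically. A minimal sketch (the 6×5 random matrix here is our own arbitrary example, not from the lecture):

```python
import numpy as np

# Any matrix will do; we use a random 6x5 example
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation from the SVD
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# L2 error of the truncation equals the first dropped singular value
err = np.linalg.norm(A - A_k, 2)
print(err, s[k])
assert np.isclose(err, s[k])
```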
Counter-Example Proof Why any other rank-2 matrix does worse

Say A = diag(4, 3, 2, 1). Prof. Strang asks: could we do better with this rank-2 matrix B?

Matrix | Strategy | Diagonal Errors | Off-diagonal Penalty | Total Error
A₂ (SVD) | Keep σ₁ = 4, σ₂ = 3 | 0, 0, 2, 1 | 0 | ‖A − A₂‖₂ = σ₃ = 2
B (alternative) | Rank-2: 2×2 blocks of 3.5s and 1.5s on the diagonal | 0.5, 0.5, 0.5, 0.5 | ±3.5 and ±1.5 entries off the diagonal | ‖A − B‖₂ = √12.5 ≈ 3.54 > 2
⚠️
The Trap
B might look closer on the diagonal, but to keep rank ≤ 2, it needs off-diagonal terms that cost more in total. The Eckart-Young theorem guarantees A₂ always wins.
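The comparison can be checked in a few lines of NumPy. A sketch, using one concrete rank-2 competitor B built from 2×2 blocks of 3.5s and 1.5s:

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])

# Best rank-2 approximation from the SVD: keep sigma1 = 4, sigma2 = 3
A2 = np.diag([4.0, 3.0, 0.0, 0.0])

# A rank-2 competitor: 2x2 blocks of 3.5s and 1.5s on the diagonal
B = np.array([[3.5, 3.5, 0.0, 0.0],
              [3.5, 3.5, 0.0, 0.0],
              [0.0, 0.0, 1.5, 1.5],
              [0.0, 0.0, 1.5, 1.5]])
print(np.linalg.matrix_rank(B))     # 2 -> a legal rank-2 competitor

print(np.linalg.norm(A - A2, 2))    # 2.0 (= sigma_3)
print(np.linalg.norm(A - B, 2))     # ~3.54 (= sqrt(12.5)) -> strictly worse
```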
Step 03 · Tools

Matrix Norms — How to Measure a Matrix

A norm is a measure of "how big" a matrix is. There are many possible norms. The three most important in data science all depend only on singular values.

Norm | Formula | Intuition | Application
L2 (Spectral) | ‖A‖₂ = σ₁ | The single largest singular value. How much can A "stretch" a unit vector? | Stability analysis, condition number
Frobenius | ‖A‖_F = √(σ₁² + σ₂² + ··· + σᵣ²) | Treat A as one long vector and take its Euclidean length: the square root of the sum of all squared entries. | PCA variance explained, regression
Nuclear | ‖A‖_* = σ₁ + σ₂ + ··· + σᵣ | Sum of all singular values. Like the L1 norm for matrices — promotes low-rank solutions. | Netflix completion, MRI, recommendations
Orthogonal Invariance (Key Property)
All three norms above satisfy ‖QA‖ = ‖A‖ for any orthogonal matrix Q. This is why the Eckart-Young theorem works for all of them — multiplying by U or Vᵀ doesn't change the error.
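Both facts are quick to confirm with NumPy (a sketch; the random matrix and the orthogonal Q are our own test data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
s = np.linalg.svd(A, compute_uv=False)

# The three norms, computed straight from the singular values
l2        = s[0]
frobenius = np.sqrt(np.sum(s**2))
nuclear   = np.sum(s)

assert np.isclose(l2, np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))

# Orthogonal invariance: multiplying by orthogonal Q changes nothing
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
for which in (2, 'fro', 'nuc'):
    assert np.isclose(np.linalg.norm(Q @ A, which), np.linalg.norm(A, which))
print("all three norms depend only on the singular values")
```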

📐 Vector Norms (Review)

  • ‖v‖₂ = √(v₁² + ··· + vₙ²) — Euclidean length, least squares
  • ‖v‖₁ = |v₁| + ··· + |vₙ| — promotes sparsity (mostly-zero answers)
  • ‖v‖_∞ = max|vᵢ| — worst-case component

📊 Why L1 gives sparse solutions

  • Minimizing L2 → many small components (hard to interpret)
  • Minimizing L1 → mostly zero components (easy to interpret)
  • This is the basis of "basis pursuit" in signal processing
  • Also the foundation of LASSO regression in statistics
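A tiny numerical illustration of why L1 prefers sparse answers (the constraint x₁ + 2x₂ = 5 is our own toy example, minimized by brute force along the line of solutions):

```python
import numpy as np

# Toy problem: minimize ||x|| subject to x1 + 2*x2 = 5.
# Parametrize every solution on that line and compare the two norms.
t = np.linspace(-5, 5, 100001)
pts = np.stack([t, (5 - t) / 2])   # all (x1, x2) satisfying the constraint

l2_best = pts[:, np.argmin(np.sum(pts**2, axis=0))]
l1_best = pts[:, np.argmin(np.sum(np.abs(pts), axis=0))]

print("L2 minimizer:", np.round(l2_best, 2))  # [1.  2.]  -> both nonzero
print("L1 minimizer:", np.round(l1_best, 2))  # [0.  2.5] -> sparse!
```

The L2 answer spreads the "energy" over both components; the L1 answer zeros one component entirely — the behavior LASSO and basis pursuit exploit.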
Step 04 · PCA in Practice

Step 1 — Center the Data

Before applying PCA, we must remove the mean of each row. This moves the data cloud to be centered at the origin. It's required for PCA to work correctly.

Worked Example Height & Age Dataset (5 people)

Original data — heights (cm) and ages (years) of 5 people:

Person | Height (cm) | Age (years)
P₁ | 160 | 20
P₂ | 170 | 25
P₃ | 175 | 30
P₄ | 180 | 35
P₅ | 165 | 22
Mean | 170 | 26.4
Centering Formula
à = A − mean(each row)

Centered data — after subtracting the mean of each row:

Person | Height − 170 | Age − 26.4
P₁ | −10 | −6.4
P₂ | 0 | −1.4
P₃ | +5 | +3.6
P₄ | +10 | +8.6
P₅ | −5 | −4.4
Sum → | 0 ✓ | 0 ✓
Both rows now sum to zero!
This centers our data at the origin. The "negative age" values just mean "younger than average" — the data is now relative to the mean, not absolute.
Python · Centering
import numpy as np

# Data matrix: rows = features, columns = samples
A = np.array([
    [160, 170, 175, 180, 165],  # heights
    [ 20,  25,  30,  35,  22]   # ages
])

# Subtract row means (keepdims keeps the shape for broadcasting)
row_means = A.mean(axis=1, keepdims=True)
A_centered = A - row_means

print("Row means:", row_means.flatten())
# [170.  26.4]

print("Row sums after centering:", A_centered.sum(axis=1))
# [0. 0.]  ← both rows sum to zero ✓
Step 05 · PCA in Practice

Step 2 — Compute the Covariance Matrix

Once the data is centered, we compute the sample covariance matrix S. This tells us how features vary together — the key to finding the principal directions.

Sample Covariance Matrix
S = (1/(n−1)) · Ã · Ãᵀ
à = centered data matrix · n = number of samples · n−1 not n (Bessel's correction — one degree of freedom used for the mean) · S is always symmetric and positive semi-definite
Worked Example Computing S for Height & Age

Using our centered data Ã, let n = 5 samples, so we divide by (n−1) = 4:

Entry | Meaning | Calculation | Value
S₁₁ | Variance of height | ((−10)² + 0² + 5² + 10² + (−5)²)/4 | 62.5
S₂₂ | Variance of age | ((−6.4)² + (−1.4)² + 3.6² + 8.6² + (−4.4)²)/4 | 37.3
S₁₂ = S₂₁ | Covariance (height & age) | ((−10)(−6.4) + ··· + (−5)(−4.4))/4 | 47.5
Resulting 2×2 Covariance Matrix
S = ⎡ 62.5   47.5 ⎤
    ⎣ 47.5   37.3 ⎦
💡
What does S₁₂ = 47.5 tell us?
A large positive covariance means height and age increase together — taller people tend to be older (in children/teens). PCA will find this relationship automatically through the eigenvectors of S.
Python · Covariance Matrix
# Method 1: Manual formula  S = A_centered @ A_centered.T / (n-1)
n = A_centered.shape[1]
S_manual = (A_centered @ A_centered.T) / (n - 1)

# Method 2: NumPy built-in (rowvar=True → each row is a variable)
S = np.cov(A_centered, rowvar=True)

print(S)
# [[62.5  47.5]
#  [47.5  37.3]]
# S is symmetric ✓ (S == S.T)
Step 06 · PCA in Practice

Step 3 — Apply SVD to Find Principal Components

With the centered data matrix à ready (features in rows, samples in columns), we apply SVD. The left singular vectors u₁, u₂, ... — the columns of U — are the principal components: the directions of maximum variance in feature space.

3a
Compute SVD of the centered matrix Ã
Run à = U Σ Vᵀ. The singular values σ₁ ≥ σ₂ ≥ ··· tell you how important each direction is. The columns of U are the principal component directions — they are also the eigenvectors of S = ÃÃᵀ/(n−1).
3b
Choose how many components to keep
Compute the explained variance ratio: σᵢ² / Σσⱼ². Keep enough components to explain ~90–95% of the variance. This is your k.
3c
Project the data onto the k principal components
The reduced representation is Z = Uₖᵀ · Ã, where Uₖ holds the first k columns of U. Each column of Z is the "score" of one data point along the principal directions — this is your compressed, interpretable data.
Numerical Example Finding the 1st Principal Component of Height & Age

After SVD on our centered height-age matrix Ã, we find:

Component | Singular Value σᵢ | Variance Explained | Direction u
PC₁ (1st) | 19.90 | 99.2% | [0.79, 0.61]ᵀ
PC₂ (2nd) | 1.74 | 0.8% | [−0.61, 0.79]ᵀ
🎯
Interpretation
PC₁ = [0.79, 0.61]ᵀ means a weighted mix of height (weight 0.79) and age (weight 0.61) is the most important direction. This captures the growth pattern: taller = older. One number (the PC₁ score) replaces two (height + age) while keeping over 99% of the variance!
Python · Full PCA Pipeline
import numpy as np

## ─── Complete PCA from scratch ───────────────────────────────

# Step 1: Start with your data matrix
A = np.array([
    [160, 170, 175, 180, 165],  # heights (cm)
    [ 20,  25,  30,  35,  22],  # ages (years)
])

# Step 2: Center the data (subtract row means)
A_c = A - A.mean(axis=1, keepdims=True)

# Step 3: SVD of the centered matrix
U, sigma, Vt = np.linalg.svd(A_c, full_matrices=False)

# Step 4: Explained variance
explained = sigma**2 / np.sum(sigma**2)
print("Explained variance ratio:", np.round(explained, 3))
# [0.992 0.008] → PC1 explains 99.2%!

# Step 5: The principal components (the columns of U)
PC1 = U[:, 0]  # first left singular vector
print("PC1 direction:", np.round(PC1, 2))
# ±[0.79 0.61] → mix of height and age (SVD may flip the overall sign)

# Step 6: Project data onto PC1 (reduce 2D → 1D)
k = 1
Z = U[:, :k].T @ A_c  # shape: (k, n_samples)
print("PC1 scores:", np.round(Z[0], 2))
# Each sample's position along the main growth direction

## ─── Using sklearn (production) ─────────────────────────────
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Z_sk = pca.fit_transform(A.T)  # sklearn wants (n_samples, n_features)
print("Explained variance (sklearn):", pca.explained_variance_ratio_)
Step 07 · Key Distinction

PCA vs Least Squares Regression

Both find the "best line" through data — but they minimize different things. This is a subtle but critical distinction that trips up many students.

📉 Least Squares Regression

  • Minimizes vertical distances from points to line
  • Solves AᵀAx̂ = Aᵀb (normal equations)
  • Asymmetric — one variable is "output" (y), others are "inputs"
  • Fast — linear system of equations
  • Use when predicting y from x (e.g., house price from size)

📐 PCA

  • Minimizes perpendicular distances from points to line
  • Solves via SVD — no linear system needed
  • Symmetric — no "output" variable, all features equal
  • Richer — gives full variance structure, not just one line
  • Use when exploring structure, reducing dimensions, visualizing
[Diagram: least squares minimizes Σ(vertical distance)² (vertical errors); PCA minimizes Σ(perpendicular distance)² (perpendicular errors).]
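The distinction shows up directly in code. A sketch comparing the two slopes on a small made-up dataset (the numbers are illustrative only):

```python
import numpy as np

# Small 2D dataset (hypothetical values, roughly along the line y = x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least squares: minimize vertical errors via the normal equations
X = np.stack([np.ones_like(x), x], axis=1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ls_slope = coef[1]

# PCA: minimize perpendicular errors via the first singular vector
xc, yc = x - x.mean(), y - y.mean()
U, s, Vt = np.linalg.svd(np.stack([xc, yc]), full_matrices=False)
pca_slope = U[1, 0] / U[0, 0]   # slope of the PC1 direction

print(f"least-squares slope: {ls_slope:.3f}")
print(f"PCA (perpendicular) slope: {pca_slope:.3f}")
# The two slopes are close but not equal — different objectives.
```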
Step 08 · Real World

Real-World Applications

The SVD and matrix norms we studied power some of the most impactful algorithms in modern data science and engineering.

🎬
Netflix Prize ($1M)
User-movie rating matrix with millions of missing entries. Nuclear norm minimization completed the matrix — predicting what you'd rate a movie you haven't seen. The foundation of modern recommender systems.
🏥
MRI Reconstruction
Children can't stay still in MRI scanners long enough for a full scan. Nuclear norm minimization reconstructs a complete, high-quality image from incomplete measurements — saving scan time.
🛍️
Amazon Recommendations
If you've bought math books, Amazon's matrix factorization model recommends more math books. Same matrix completion idea as Netflix — a huge sparse purchase matrix, completed with SVD.
🖼️
Image Compression
An image is a matrix. SVD finds its low-rank approximation: keep only the top k singular values. k=50 might give 90% visual quality at 5% the storage cost — exactly the Eckart-Young theorem.
🧬
Genomics
Gene expression matrices have thousands of genes (rows) and hundreds of patients (columns). PCA reduces this to 2-3 principal components for visualization and disease classification.
👤
Eigenfaces (Face Recognition)
Apply PCA to a dataset of face images. The top principal components are the "eigenfaces" — the most common facial patterns. New faces are recognized by comparing their PCA coefficients.
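The image-compression idea above fits in a few lines of NumPy. A sketch on a synthetic "image" (a smooth gradient plus noise, so it is nearly rank-1; real photos need a larger k):

```python
import numpy as np

# Synthetic 200x150 "image": a rank-1 gradient plus small noise
rng = np.random.default_rng(0)
img = np.outer(np.linspace(0, 1, 200), np.linspace(0, 1, 150)) \
      + 0.01 * rng.standard_normal((200, 150))

U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Keep only the top k singular values (Eckart-Young truncation)
k = 5
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

original_numbers = img.size                 # 200 * 150 = 30000
compressed_numbers = k * (200 + 150 + 1)    # U_k, V_k, and the sigmas
rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')

print(f"storage: {compressed_numbers} of {original_numbers} numbers")
print(f"relative Frobenius error: {rel_err:.3%}")
```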
Summary — The Full PCA Pipeline
Input:  Data matrix A (m features × n samples)
1.  Center:  Ã = A − mean(rows)
2.  SVD:  Ã = U Σ Vᵀ
3.  Choose k:  keep the top k where (σ₁² + ··· + σₖ²) / (σ₁² + ··· + σᵣ²) ≥ threshold
4.  Project:  Z = Uₖᵀ Ã  (k × n compressed representation)
Output:  Z — low-dimensional data, principal directions (columns of U), variances σᵢ²