MIT 18.065 · Gilbert Strang

Principal Component
Analysis & SVD

A visual, step-by-step guide to understanding how the Singular Value Decomposition powers PCA — the most important tool in data science for finding structure in data.

Linear Algebra Data Science Machine Learning Matrix Theory
Introduction

What is PCA?

Principal Component Analysis is a technique to extract the most important information from a dataset. Imagine you have a table of thousands of measurements — PCA finds the directions where the data varies the most, and represents everything using far fewer numbers.

📊
The Problem
You have a big matrix of data. Most of it is noise or redundancy. PCA finds what actually matters — the low-rank structure hiding inside.
🎯
The Goal
Reduce a high-dimensional dataset to a small number of "principal components" that capture the maximum variance — without losing the important signal.
🔬
The Tool
SVD (Singular Value Decomposition) is the mathematical engine behind PCA. It breaks any matrix into structured, interpretable pieces.
💡
Key Insight
If you memorize all the data exactly, you haven't learned anything. The goal is to find the important patterns — that's what PCA does.
💬
Prof. Strang's Key Quote
"In machine learning, if you've learned all the training data, you haven't learned anything, really. You've just copied it. The whole point is to learn important facts about the data."
Step 01 · Foundation

The Singular Value Decomposition

Every matrix A can be broken into rank-1 pieces. This is the Singular Value Decomposition (SVD). Understanding it is the key to everything in PCA.

SVD — The Main Formula Any m×n matrix
A = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + σ₃u₃v₃ᵀ + ··· + σᵣuᵣvᵣᵀ
σ₁ ≥ σ₂ ≥ ··· ≥ σᵣ ≥ 0 are the singular values (how important each piece is) · u's are orthonormal left singular vectors · v's are orthonormal right singular vectors · r is the rank
Example Breaking a 4×4 matrix into rank-1 pieces

Suppose we have the diagonal matrix A = diag(4, 3, 2, 1). Its SVD is trivial to see — the singular values are right on the diagonal:

Piece | Singular Value (σ) | Importance (σ²/Σσ²) | Keep it?
σ₁ | 4.0 | 53% | ✓ Yes
σ₂ | 3.0 | 30% | ✓ Yes
σ₃ | 2.0 | 13% | ✗ Drop
σ₄ | 1.0 | 3% | ✗ Drop
Best Rank-2 Approximation
Keep σ₁ and σ₂ only. This captures 83% of the variance (25/30 of Σσᵢ²) using just 2 pieces instead of 4. This is the Eckart-Young theorem in action — proven to be the best possible rank-2 approximation.
Python · NumPy SVD
import numpy as np

# Our data matrix A
A = np.diag([4.0, 3.0, 2.0, 1.0])

# Compute SVD: A = U @ diag(sigma) @ V.T
U, sigma, Vt = np.linalg.svd(A)

print("Singular values:", sigma)
# Output: [4. 3. 2. 1.]

# Best rank-2 approximation (keep top 2 pieces)
k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# How much variance is explained?
explained = np.sum(sigma[:k]**2) / np.sum(sigma**2)
print(f"Variance explained: {explained:.1%}")
# Output: 83.3%
Step 02 · Theorem

The Eckart-Young Theorem

This is the key theorem that justifies PCA. It tells us that the SVD gives the provably best low-rank approximation — you cannot do better.

Eckart-Young Theorem (1936) Fundamental Result
‖A − Aₖ‖ ≤ ‖A − B‖
for any rank-k matrix B
Aₖ (using the first k pieces of SVD) is closer to A than any other rank-k matrix B. This works for the L2, Frobenius, and Nuclear norms.
Why?
Intuition: The Greedy Argument
The singular values σ₁ ≥ σ₂ ≥ ··· are sorted by importance. Taking the top k is like keeping the "heaviest" pieces. Any other rank-k matrix mixes directions and pays a penalty — it can't do better than picking purely along the dominant directions.
Why?
Orthogonal Invariance
All three key norms (L2, Frobenius, Nuclear) are orthogonally invariant — multiplying by an orthogonal matrix Q doesn't change them. This means the diagonal case (easy to see) covers all cases.
Key
The Error You Accept
When you drop pieces k+1, k+2, ..., r, the approximation error is exactly ‖A − Aₖ‖₂ = σₖ₊₁ (in the L2 norm). The next singular value tells you exactly how much you're losing.
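This claim is easy to verify numerically. A minimal sketch (the 6×5 random matrix here is our own arbitrary example, not from the lecture):

```python
import numpy as np

# Any matrix will do; we use a random 6x5 example
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation from the SVD
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# L2 error of the truncation equals the first dropped singular value
err = np.linalg.norm(A - A_k, 2)
print(err, s[k])
assert np.isclose(err, s[k])
```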
Counter-Example Proof Why any other rank-2 matrix does worse

Say A = diag(4, 3, 2, 1). Prof. Strang asks: could we do better with this rank-2 matrix B?

Matrix | Strategy | Diagonal Errors | Off-diagonal Penalty | Total Error
A₂ (SVD) | Keep σ₁ = 4, σ₂ = 3 | 0, 0, 2, 1 | 0 | ‖A − A₂‖₂ = σ₃ = 2
B (alternative) | Rank-2: 2×2 blocks of 3.5s and 1.5s on the diagonal | 0.5, 0.5, 0.5, 0.5 | ±3.5 and ±1.5 entries off the diagonal | ‖A − B‖₂ = √12.5 ≈ 3.54 > 2
⚠️
The Trap
B might look closer on the diagonal, but to keep rank ≤ 2, it needs off-diagonal terms that cost more in total. The Eckart-Young theorem guarantees A₂ always wins.
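The comparison can be checked in a few lines of NumPy. A sketch, using one concrete rank-2 competitor B built from 2×2 blocks of 3.5s and 1.5s:

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])

# Best rank-2 approximation from the SVD: keep sigma1 = 4, sigma2 = 3
A2 = np.diag([4.0, 3.0, 0.0, 0.0])

# A rank-2 competitor: 2x2 blocks of 3.5s and 1.5s on the diagonal
B = np.array([[3.5, 3.5, 0.0, 0.0],
              [3.5, 3.5, 0.0, 0.0],
              [0.0, 0.0, 1.5, 1.5],
              [0.0, 0.0, 1.5, 1.5]])
print(np.linalg.matrix_rank(B))     # 2 -> a legal rank-2 competitor

print(np.linalg.norm(A - A2, 2))    # 2.0 (= sigma_3)
print(np.linalg.norm(A - B, 2))     # ~3.54 (= sqrt(12.5)) -> strictly worse
```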
Step 03 · Tools

Matrix Norms — How to Measure a Matrix

A norm is a measure of "how big" a matrix is. There are many possible norms. The three most important in data science all depend only on singular values.

Norm | Formula | Intuition | Application
L2 (Spectral) | ‖A‖₂ = σ₁ | The single largest singular value. How much can A "stretch" a unit vector? | Stability analysis, condition number
Frobenius | ‖A‖_F = √(σ₁² + σ₂² + ··· + σᵣ²) | Treat A as one long vector and take its Euclidean length: the square root of the sum of all squared entries. | PCA variance explained, regression
Nuclear | ‖A‖_* = σ₁ + σ₂ + ··· + σᵣ | Sum of all singular values. Like the L1 norm for matrices — promotes low-rank solutions. | Netflix completion, MRI, recommendations
Orthogonal Invariance (Key Property)
All three norms above satisfy ‖QA‖ = ‖A‖ for any orthogonal matrix Q. This is why the Eckart-Young theorem works for all of them — multiplying by U or Vᵀ doesn't change the error.
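Both facts are quick to confirm with NumPy (a sketch; the random matrix and the orthogonal Q are our own test data):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
s = np.linalg.svd(A, compute_uv=False)

# The three norms, computed straight from the singular values
l2        = s[0]
frobenius = np.sqrt(np.sum(s**2))
nuclear   = np.sum(s)

assert np.isclose(l2, np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))

# Orthogonal invariance: multiplying by orthogonal Q changes nothing
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
for which in (2, 'fro', 'nuc'):
    assert np.isclose(np.linalg.norm(Q @ A, which), np.linalg.norm(A, which))
print("all three norms depend only on the singular values")
```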

📐 Vector Norms (Review)

  • ‖v‖₂ = √(v₁² + ··· + vₙ²) — Euclidean length, least squares
  • ‖v‖₁ = |v₁| + ··· + |vₙ| — promotes sparsity (mostly-zero answers)
  • ‖v‖_∞ = max|vᵢ| — worst-case component

📊 Why L1 gives sparse solutions

  • Minimizing L2 → many small components (hard to interpret)
  • Minimizing L1 → mostly zero components (easy to interpret)
  • This is the basis of "basis pursuit" in signal processing
  • Also the foundation of LASSO regression in statistics
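A tiny numerical illustration of why L1 prefers sparse answers (the constraint x₁ + 2x₂ = 5 is our own toy example, minimized by brute force along the line of solutions):

```python
import numpy as np

# Toy problem: minimize ||x|| subject to x1 + 2*x2 = 5.
# Parametrize every solution on that line and compare the two norms.
t = np.linspace(-5, 5, 100001)
pts = np.stack([t, (5 - t) / 2])   # all (x1, x2) satisfying the constraint

l2_best = pts[:, np.argmin(np.sum(pts**2, axis=0))]
l1_best = pts[:, np.argmin(np.sum(np.abs(pts), axis=0))]

print("L2 minimizer:", np.round(l2_best, 2))  # [1.  2.]  -> both nonzero
print("L1 minimizer:", np.round(l1_best, 2))  # [0.  2.5] -> sparse!
```

The L2 answer spreads the "energy" over both components; the L1 answer zeros one component entirely — the behavior LASSO and basis pursuit exploit.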
Step 04 · PCA in Practice

Step 1 — Center the Data

Before applying PCA, we must remove the mean of each row. This moves the data cloud to be centered at the origin. It's required for PCA to work correctly.

Worked Example Height & Age Dataset (5 people)

Original data — heights (cm) and ages (years) of 5 people:

Person | Height (cm) | Age (years)
P₁ | 160 | 20
P₂ | 170 | 25
P₃ | 175 | 30
P₄ | 180 | 35
P₅ | 165 | 22
Mean | 170 | 26.4
Centering Formula
à = A − mean(each row)

Centered data — after subtracting the mean of each row:

Person | Height − 170 | Age − 26.4
P₁ | −10 | −6.4
P₂ | 0 | −1.4
P₃ | +5 | +3.6
P₄ | +10 | +8.6
P₅ | −5 | −4.4
Sum → | 0 ✓ | 0 ✓
Both rows now sum to zero!
This centers our data at the origin. The "negative age" values just mean "younger than average" — the data is now relative to the mean, not absolute.
Python · Centering
import numpy as np

# Data matrix: rows = features, columns = samples
A = np.array([
    [160, 170, 175, 180, 165],  # heights
    [ 20,  25,  30,  35,  22]   # ages
])

# Subtract row means (keepdims keeps the shape for broadcasting)
row_means = A.mean(axis=1, keepdims=True)
A_centered = A - row_means

print("Row means:", row_means.flatten())
# [170.  26.4]

print("Row sums after centering:", A_centered.sum(axis=1))
# [0. 0.]  ← both rows sum to zero ✓
Step 05 · PCA in Practice

Step 2 — Compute the Covariance Matrix

Once the data is centered, we compute the sample covariance matrix S. This tells us how features vary together — the key to finding the principal directions.

Sample Covariance Matrix
S = (1/(n−1)) · Ã · Ãᵀ
à = centered data matrix · n = number of samples · n−1 not n (Bessel's correction — one degree of freedom used for the mean) · S is always symmetric and positive semi-definite
Worked Example Computing S for Height & Age

Using our centered data Ã, let n = 5 samples, so we divide by (n−1) = 4:

Entry | Meaning | Calculation | Value
S₁₁ | Variance of height | ((−10)² + 0² + 5² + 10² + (−5)²)/4 | 62.5
S₂₂ | Variance of age | ((−6.4)² + (−1.4)² + 3.6² + 8.6² + (−4.4)²)/4 | 37.3
S₁₂ = S₂₁ | Covariance (height & age) | ((−10)(−6.4) + ··· + (−5)(−4.4))/4 | 47.5
Resulting 2×2 Covariance Matrix
S = ⎡ 62.5   47.5 ⎤
    ⎣ 47.5   37.3 ⎦
💡
What does S₁₂ = 47.5 tell us?
A large positive covariance means height and age increase together — taller people tend to be older (in children/teens). PCA will find this relationship automatically through the eigenvectors of S.
Python · Covariance Matrix
# Method 1: Manual formula  S = A_centered @ A_centered.T / (n-1)
n = A_centered.shape[1]
S_manual = (A_centered @ A_centered.T) / (n - 1)

# Method 2: NumPy built-in (rowvar=True → each row is a variable)
S = np.cov(A_centered, rowvar=True)

print(S)
# [[62.5  47.5]
#  [47.5  37.3]]
# S is symmetric ✓ (S == S.T)
Step 06 · PCA in Practice

Step 3 — Apply SVD to Find Principal Components

With the centered data matrix à ready (features in rows, samples in columns), we apply SVD. The left singular vectors u₁, u₂, ... — the columns of U — are the principal components: the directions of maximum variance in feature space.

3a
Compute SVD of the centered matrix Ã
Run à = U Σ Vᵀ. The singular values σ₁ ≥ σ₂ ≥ ··· tell you how important each direction is. The columns of U are the principal component directions — they are also the eigenvectors of S = ÃÃᵀ/(n−1).
3b
Choose how many components to keep
Compute the explained variance ratio: σᵢ² / Σσⱼ². Keep enough components to explain ~90–95% of the variance. This is your k.
3c
Project the data onto the k principal components
The reduced representation is Z = Uₖᵀ · Ã, where Uₖ holds the first k columns of U. Each column of Z is the "score" of one data point along the principal directions — this is your compressed, interpretable data.
Numerical Example Finding the 1st Principal Component of Height & Age

After SVD on our centered height-age matrix Ã, we find:

Component | Singular Value σᵢ | Variance Explained | Direction u
PC₁ (1st) | 19.90 | 99.2% | [0.79, 0.61]ᵀ
PC₂ (2nd) | 1.74 | 0.8% | [−0.61, 0.79]ᵀ
🎯
Interpretation
PC₁ = [0.79, 0.61]ᵀ means a weighted mix of height (weight 0.79) and age (weight 0.61) is the most important direction. This captures the growth pattern: taller = older. One number (the PC₁ score) replaces two (height + age) while keeping over 99% of the variance!
Python · Full PCA Pipeline
import numpy as np

## ─── Complete PCA from scratch ───────────────────────────────

# Step 1: Start with your data matrix
A = np.array([
    [160, 170, 175, 180, 165],  # heights (cm)
    [ 20,  25,  30,  35,  22],  # ages (years)
])

# Step 2: Center the data (subtract row means)
A_c = A - A.mean(axis=1, keepdims=True)

# Step 3: SVD of the centered matrix
U, sigma, Vt = np.linalg.svd(A_c, full_matrices=False)

# Step 4: Explained variance
explained = sigma**2 / np.sum(sigma**2)
print("Explained variance ratio:", np.round(explained, 3))
# [0.992 0.008] → PC1 explains 99.2%!

# Step 5: The principal components (the columns of U)
PC1 = U[:, 0]  # first left singular vector
print("PC1 direction:", np.round(PC1, 2))
# ±[0.79 0.61] → mix of height and age (SVD may flip the overall sign)

# Step 6: Project data onto PC1 (reduce 2D → 1D)
k = 1
Z = U[:, :k].T @ A_c  # shape: (k, n_samples)
print("PC1 scores:", np.round(Z[0], 2))
# Each sample's position along the main growth direction

## ─── Using sklearn (production) ─────────────────────────────
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Z_sk = pca.fit_transform(A.T)  # sklearn wants (n_samples, n_features)
print("Explained variance (sklearn):", pca.explained_variance_ratio_)
Step 07 · Key Distinction

PCA vs Least Squares Regression

Both find the "best line" through data — but they minimize different things. This is a subtle but critical distinction that trips up many students.

📉 Least Squares Regression

  • Minimizes vertical distances from points to line
  • Solves AᵀAx̂ = Aᵀb (normal equations)
  • Asymmetric — one variable is "output" (y), others are "inputs"
  • Fast — linear system of equations
  • Use when predicting y from x (e.g., house price from size)

📐 PCA

  • Minimizes perpendicular distances from points to line
  • Solves via SVD — no linear system needed
  • Symmetric — no "output" variable, all features equal
  • Richer — gives full variance structure, not just one line
  • Use when exploring structure, reducing dimensions, visualizing
[Diagram: least squares minimizes Σ(vertical distance)² (vertical errors); PCA minimizes Σ(perpendicular distance)² (perpendicular errors).]
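The distinction shows up directly in code. A sketch comparing the two slopes on a small made-up dataset (the numbers are illustrative only):

```python
import numpy as np

# Small 2D dataset (hypothetical values, roughly along the line y = x)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Least squares: minimize vertical errors via the normal equations
X = np.stack([np.ones_like(x), x], axis=1)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ls_slope = coef[1]

# PCA: minimize perpendicular errors via the first singular vector
xc, yc = x - x.mean(), y - y.mean()
U, s, Vt = np.linalg.svd(np.stack([xc, yc]), full_matrices=False)
pca_slope = U[1, 0] / U[0, 0]   # slope of the PC1 direction

print(f"least-squares slope: {ls_slope:.3f}")
print(f"PCA (perpendicular) slope: {pca_slope:.3f}")
# The two slopes are close but not equal — different objectives.
```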
Step 08 · Real World

Real-World Applications

The SVD and matrix norms we studied power some of the most impactful algorithms in modern data science and engineering.

🎬
Netflix Prize ($1M)
User-movie rating matrix with millions of missing entries. Nuclear norm minimization completed the matrix — predicting what you'd rate a movie you haven't seen. The foundation of modern recommender systems.
🏥
MRI Reconstruction
Children can't stay still in MRI scanners long enough for a full scan. Nuclear norm minimization reconstructs a complete, high-quality image from incomplete measurements — saving scan time.
🛍️
Amazon Recommendations
If you've bought math books, Amazon's matrix factorization model recommends more math books. Same matrix completion idea as Netflix — a huge sparse purchase matrix, completed with SVD.
🖼️
Image Compression
An image is a matrix. SVD finds its low-rank approximation: keep only the top k singular values. k=50 might give 90% visual quality at 5% the storage cost — exactly the Eckart-Young theorem.
🧬
Genomics
Gene expression matrices have thousands of genes (rows) and hundreds of patients (columns). PCA reduces this to 2-3 principal components for visualization and disease classification.
👤
Eigenfaces (Face Recognition)
Apply PCA to a dataset of face images. The top principal components are the "eigenfaces" — the most common facial patterns. New faces are recognized by comparing their PCA coefficients.
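The image-compression idea above fits in a few lines of NumPy. A sketch on a synthetic "image" (a smooth gradient plus noise, so it is nearly rank-1; real photos need a larger k):

```python
import numpy as np

# Synthetic 200x150 "image": a rank-1 gradient plus small noise
rng = np.random.default_rng(0)
img = np.outer(np.linspace(0, 1, 200), np.linspace(0, 1, 150)) \
      + 0.01 * rng.standard_normal((200, 150))

U, s, Vt = np.linalg.svd(img, full_matrices=False)

# Keep only the top k singular values (Eckart-Young truncation)
k = 5
img_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

original_numbers = img.size                 # 200 * 150 = 30000
compressed_numbers = k * (200 + 150 + 1)    # U_k, V_k, and the sigmas
rel_err = np.linalg.norm(img - img_k, 'fro') / np.linalg.norm(img, 'fro')

print(f"storage: {compressed_numbers} of {original_numbers} numbers")
print(f"relative Frobenius error: {rel_err:.3%}")
```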
Summary — The Full PCA Pipeline
Input:  Data matrix A (m features × n samples)
1.  Center:  Ã = A − mean(rows)
2.  SVD:  Ã = U Σ Vᵀ
3.  Choose k:  keep the top k where (σ₁² + ··· + σₖ²) / (σ₁² + ··· + σᵣ²) ≥ threshold
4.  Project:  Z = Uₖᵀ Ã  (k × n compressed representation)
Output:  Z — low-dimensional data, principal directions (columns of U), variances σᵢ²