Principal Component Analysis (PCA) — ML Depth

Understanding Principal Component Analysis

Core Concepts for PCA

Data Representation: Data matrix X, centered data.
Covariance Matrix (S or Σ): Captures variance and covariance between features.
Eigendecomposition: Sv = λv. Eigenvectors (v) and eigenvalues (λ).
Principal Components (PCs): Linear combinations of original features, defined by eigenvectors.
Projection: Projecting data onto the principal components.
Variance Maximization: PCA finds directions (PCs) that maximize the variance of the projected data.
Reconstruction Error Minimization: PCA finds a lower-dimensional subspace that minimizes the squared error when reconstructing the original data from the projection.
Lagrange Multipliers: Used in constrained optimization for the max variance derivation.
Kernel PCA: Using the kernel trick to perform PCA in a higher-dimensional feature space for non-linear dimensionality reduction.

PCA Derivation Explained

Interviewer: Let's talk about dimensionality reduction. Can you derive the principal components from the eigendecomposition of the covariance matrix? And then explain why PCA is often described as finding directions of maximum variance and, alternatively, directions that minimize reconstruction error.

Candidate: Absolutely. PCA is a fundamental technique for dimensionality reduction.

1. Setup and Covariance Matrix

Let X be our n × d data matrix, where n is the number of samples and d is the number of features. We assume the data is centered (mean of each feature is zero). If not, we subtract the mean first.

The covariance matrix S (or Σ) is a d × d matrix given by:

S = (1/n) X^TX

(Sometimes (1/(n-1)) is used for an unbiased estimate, but (1/n) is common in PCA derivations related to variance maximization of projected data).

PCA aims to find a new set of orthogonal basis vectors (principal components) in the feature space.

2. PCA as Maximizing Variance

Let u be a unit vector (||u||₂ = 1) representing a direction in the d-dimensional space. We want to find the direction u such that when we project the data X onto this direction, the variance of the projected data is maximized.

The projection of a data point x_i onto u is x_i^Tu.

The variance of the projected data (since data is centered, E[x_i^Tu] = E[x_i]^Tu = 0) is:

Var(Xu) = (1/n) Σ_i=1ⁿ (x_i^Tu - E[x_i^Tu])²
                 = (1/n) Σ_i=1ⁿ (x_i^Tu)²
                 = (1/n) Σ_i=1ⁿ (u^Tx_i)(x_i^Tu)
                 = u^T [ (1/n) Σ_i=1ⁿ x_ix_i^T ] u
                 = u^TSu

So, we want to maximize u^TSu subject to the constraint ||u||² = u^Tu = 1.

We can use Lagrange multipliers. Let L(u, λ) = u^TSu - λ(u^Tu - 1).

Taking the derivative with respect to u and setting it to zero:

∂L/∂u = 2Su - 2λu = 0
    Su = λu

This is the definition of an eigenvector problem! u is an eigenvector of the covariance matrix S, and λ is its corresponding eigenvalue.

To maximize u^TSu = u^Tλu = λu^Tu = λ, we should choose the eigenvector u corresponding to the largest eigenvalue λ. This u is the first principal component (PC1).

Subsequent principal components are found by seeking directions orthogonal to the previous ones that maximize the remaining variance, which correspond to the eigenvectors associated with the next largest eigenvalues, and so on.

So, the principal components are the eigenvectors of the covariance matrix S, ordered by their corresponding eigenvalues (which represent the variance captured by that PC).

Interviewer: That clearly shows how PCA relates to maximizing variance and eigendecomposition. Now, can you explain the connection to minimizing reconstruction error?

Candidate:

3. PCA as Minimizing Reconstruction Error

Let's say we want to project our d-dimensional data onto a k-dimensional subspace (k < d) spanned by an orthonormal basis {u₁, ..., u_k}. A data point x can be approximated by its projection onto this subspace:

x̂ = Σ_j=1^k (x^Tu_j)u_j

We want to find the basis {u_j} that minimizes the mean squared reconstruction error:

Error = (1/n) Σ_i=1ⁿ ||x_i - x̂_i||²

Consider a single data point x. We can express x in terms of a full orthonormal basis {u₁, ..., u_d} (where the first k are our chosen projection vectors):

x = Σ_j=1^d (x^Tu_j)u_j

Then the reconstruction error for this point is:

||x - x̂||² = || Σ_j=1^d (x^Tu_j)u_j - Σ_j=1^k (x^Tu_j)u_j ||²
                    = || Σ_j=k+1^d (x^Tu_j)u_j ||²

Since {u_j} are orthonormal, by Pythagorean theorem:

= Σ_j=k+1^d (x^Tu_j)²

The total mean squared reconstruction error is:

Error = (1/n) Σ_i=1ⁿ Σ_j=k+1^d (x_i^Tu_j)²
            = Σ_j=k+1^d [ (1/n) Σ_i=1ⁿ (x_i^Tu_j)² ]
            = Σ_j=k+1^d u_j^TSu_j

To minimize this sum, we want the terms u_j^TSu_j (which are the variances along directions u_j) for j=k+1 to d to be as small as possible. The total variance in the data is Σ_j=1^d λ_j = trace(S). The variance captured by the first k components is Σ_j=1^k u_j^TSu_j. Minimizing Σ_j=k+1^d u_j^TSu_j is equivalent to maximizing Σ_j=1^k u_j^TSu_j (the variance captured by the projected data).

As shown before, to maximize this captured variance, the vectors u₁, ..., u_k must be the eigenvectors of S corresponding to the k largest eigenvalues. Therefore, selecting the top k eigenvectors of the covariance matrix simultaneously maximizes the variance of the projected data and minimizes the mean squared reconstruction error.

Interviewer: That connects both perspectives well. Now, for the follow-up: How does Kernel PCA extend this to perform non-linear dimensionality reduction, and what are the computational trade-offs involved?

Candidate:

Kernel PCA for Non-Linear Dimensionality Reduction

Standard PCA finds linear projections. If the underlying structure of the data is non-linear, PCA might not be effective at capturing it.

Kernel PCA extends PCA to find non-linear principal components using the "kernel trick":

Mapping to a Higher-Dimensional Feature Space: Conceptually, Kernel PCA first maps the input data x_i from the original d-dimensional space to a (potentially much higher or even infinite dimensional) feature space F using a non-linear mapping Φ(x_i).
Performing PCA in Feature Space: PCA is then performed in this feature space F. This involves finding the eigenvectors of the covariance matrix in F:
```
S_F = (1/n) Σ_i=1ⁿ Φ(x_i)Φ(x_i)^T
```
(Assuming Φ(x_i) is centered in F). The eigenvectors v in F satisfy S_Fv = λv.
The Kernel Trick: The mapping Φ can be very high-dimensional, making direct computation of Φ(x_i) and S_F infeasible. The kernel trick allows us to compute dot products in the feature space F without explicitly forming Φ(x_i). A kernel function K(x_i, x_j) = Φ(x_i)^TΦ(x_j) computes these dot products.
Reformulating PCA with Kernels: The PCA problem in F can be reformulated such that all operations involve only these dot products. The eigenvectors v in F can be expressed as linear combinations of the mapped data points: v = Σ_i=1ⁿ α_iΦ(x_i). Substituting this into the eigenvalue equation and using the kernel function leads to an eigenvalue problem involving the n × n Kernel Matrix (Gram matrix) K, where K_ij = K(x_i, x_j).
```
Kα = nλα
```
(This is a simplified form; centering the kernel matrix is also required). Solving this gives the coefficients α. The projection of a new point x onto an eigenvector v_k is then:
```
v_k^TΦ(x) = (Σ_i α_i^(k)Φ(x_i))^TΦ(x) = Σ_i α_i^(k)K(x_i, x)
```

Popular kernels include the Polynomial kernel and the Radial Basis Function (RBF) kernel, which can capture complex non-linear structures.

Computational Trade-offs of Kernel PCA

Advantages:
- Can find non-linear manifolds and perform non-linear dimensionality reduction.
- Flexibility through choice of kernel and its parameters.
Disadvantages (Computational Costs):
- Kernel Matrix Computation: Requires computing the n × n kernel matrix K, which takes O(n²d) time (if d is feature dimension for each K(x_i, x_j) call) or O(n²) if kernel evaluations are O(1) after initial feature processing. This is prohibitive for very large n (number of samples).
- Eigendecomposition of Kernel Matrix: Solving the eigenvalue problem for the n × n kernel matrix typically takes O(n³) time. This is a major bottleneck for large n. Standard PCA's eigendecomposition is on a d × d matrix, taking O(d³) or O(min(n,d)² * max(n,d)) if using SVD on X, which is better if d << n.
- Projection of New Data: Projecting a new point involves computing its kernel similarity with all n training points, taking O(nd) time per projection, which can be slow if n is large.
- No Explicit Feature Mapping: We get the projections, but not an explicit transformation matrix to a lower-dimensional space in the original feature coordinates like in linear PCA. Reconstructing back to the original space (pre-image problem) can also be difficult or ill-defined for some kernels.
- Memory: Storing the n × n kernel matrix requires O(n²) memory.

So, Kernel PCA is powerful for non-linear tasks but scales poorly with the number of samples (n) due to the n × n kernel matrix operations. Linear PCA scales with the number of features (d).

Interviewer: That's a very clear and comprehensive explanation of both linear PCA and its extension with Kernel PCA, along with the trade-offs. Well done.

Candidate: Thank you!

Why Understanding PCA In-Depth Matters

Foundation of Dimensionality Reduction: PCA is a cornerstone technique used widely for data compression, visualization, and as a preprocessing step.
Linear Algebra Connection: Reinforces understanding of covariance, eigenvectors, and eigenvalues and their geometric interpretations.
Multiple Perspectives: Understanding PCA as both variance maximization and reconstruction error minimization provides deeper intuition.
Basis for Advanced Techniques: Concepts from PCA extend to more advanced methods like Kernel PCA, Probabilistic PCA, Sparse PCA, etc.
Interpreting Results: Knowing how PCs are derived helps in interpreting what they represent (directions of greatest variance).
Limitations Awareness: Understanding its linearity helps recognize when non-linear methods like Kernel PCA or t-SNE might be more appropriate.