Clustering: K-means vs. Gaussian Mixture Models (GMM) — ML Depth

K-means vs. GMM & EM Algorithm

Core Concepts for K-means & GMM

K-means:
- Objective: Minimize intra-cluster sum of squares (inertia).
- Algorithm: Iterative assignment and centroid update.
- Assumptions: Spherical, equally sized clusters. Hard assignments.
Gaussian Mixture Model (GMM):
- Assumption: Data points are generated from a mixture of several Gaussian distributions.
- Parameters: Mixing coefficients (π_k), means (μ_k), covariances (Σ_k).
- Objective: Maximize the log-likelihood of the data.
Expectation-Maximization (EM) Algorithm: Iterative algorithm for finding MLE or MAP estimates for models with latent variables.
- E-step (Expectation): Calculate posterior probabilities of latent variables (responsibilities).
- M-step (Maximization): Update model parameters using these responsibilities.
Latent Variables: In GMM, the component from which each data point is drawn.
Gaussian PDF: For x in d dimensions, N(x | μ, Σ) = (2π)^(-d/2) |Σ|^(-1/2) exp( -½ (x - μ)^T Σ^(-1) (x - μ) ).
Cluster Shapes and Densities: How different covariance matrix structures in GMM handle varied cluster characteristics.

Clustering Algorithms Explained

Interviewer: Let's discuss clustering algorithms. Can you compare the mathematical formulations of K-means and Gaussian Mixture Models (GMM)? Then, please derive the Expectation-Maximization (EM) algorithm for GMMs.

Candidate: Certainly. Both K-means and GMM are popular clustering algorithms, but they make different assumptions and have different mathematical underpinnings.

K-means Clustering

Objective: K-means aims to partition N data points {x₁, ..., x_N} into K disjoint clusters C = {C₁, ..., C_K}, each with a centroid μ_k. The objective is to minimize the within-cluster sum of squares (WCSS), also known as inertia:

J_K-means = Σ_{k=1}^K Σ_{x_i ∈ C_k} ||x_i - μ_k||²

where μ_k is the mean of the points in cluster C_k.

Algorithm (Iterative):

1. Initialize K centroids μ₁, ..., μ_K (e.g., randomly).
2. Assignment Step: Assign each data point x_i to the cluster C_k whose centroid μ_k is closest (in Euclidean distance):
   C_k = { x_i | ||x_i - μ_k||² ≤ ||x_i - μ_j||² for all j ≠ k }
3. Update Step: Recalculate the centroids μ_k as the mean of all data points assigned to cluster C_k:
   μ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i
4. Repeat steps 2 and 3 until convergence (e.g., centroids don't change much, or assignments stabilize).

K-means makes "hard" assignments of points to clusters. Because each point goes to the nearest centroid under squared Euclidean distance, which treats every direction and every cluster identically, it implicitly assumes clusters are spherical and roughly equally sized.
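The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names, the optional `init` argument, and the toy data are assumptions made for the example, and the demo passes explicit starting centroids only so that the run is reproducible.

```python
import numpy as np

def squared_distances(X, centroids):
    """d2[i, k] = ||x_i - mu_k||^2 for every point and centroid."""
    return ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, inertia)."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Random initialization: pick k distinct data points.
        init = X[rng.choice(len(X), size=k, replace=False)]
    centroids = np.array(init, dtype=float)
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance.
        labels = squared_distances(X, centroids).argmin(axis=1)
        # Update step: mean of assigned points (keep the old centroid if a cluster is empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    d2 = squared_distances(X, centroids)
    labels = d2.argmin(axis=1)
    inertia = d2.min(axis=1).sum()  # the WCSS objective J
    return centroids, labels, inertia

# Toy data: two well-separated blobs of 100 points each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(10.0, 10.0), scale=0.5, size=(100, 2)),
])
# Start from one point of each blob so the demo is deterministic.
centroids, labels, inertia = kmeans(X, k=2, init=X[[0, -1]])
```

With random initialization the result depends on the seed, which is why K-means is usually run several times and the run with the lowest inertia is kept.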

Gaussian Mixture Models (GMM)

Assumption: GMM assumes that the data points are generated from a mixture of K Gaussian distributions. Each Gaussian component k has its own mean μ_k, covariance matrix Σ_k, and a mixing coefficient π_k (the prior probability of a point belonging to cluster k, with π_k ≥ 0 and π₁ + ... + π_K = 1).
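The generative assumption can be made concrete by sampling: first draw a component index from the mixing coefficients, then draw the point from that component's Gaussian. The sketch below uses made-up parameter values and a hypothetical helper `sample_gmm`; it is only meant to illustrate the two-stage process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 2-component mixture in 2D (assumed values).
weights = np.array([0.8, 0.2])                     # pi_k, sums to 1
means = np.array([[0.0, 0.0], [5.0, 5.0]])         # mu_k
covs = np.array([[[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.9], [0.9, 0.5]]])        # Sigma_k, each positive definite

def sample_gmm(n, weights, means, covs, rng):
    """Draw n points: z_i ~ Categorical(pi), then x_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    z = rng.choice(len(weights), size=n, p=weights)
    X = np.empty((n, means.shape[1]))
    for k in range(len(weights)):
        idx = np.flatnonzero(z == k)
        X[idx] = rng.multivariate_normal(means[k], covs[k], size=len(idx))
    return X, z

X, z = sample_gmm(10_000, weights, means, covs, rng)
# Empirical share of points per component; should be close to the weights.
fractions = np.bincount(z, minlength=len(weights)) / len(z)
```

In a real clustering problem only `X` is observed; the component indices `z` are exactly the latent variables that EM has to reason about.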

Objective: GMM aims to find the parameters (π_k, μ_k, Σ_k for all k) that maximize the log-likelihood of the observed data {x₁, ..., x_N}.

The probability density of a data point x_i is given by the mixture:

P(x_i | θ) = Σ_{k=1}^K π_k N(x_i | μ_k, Σ_k)

where N(x_i | μ_k, Σ_k) is the probability density function of the multivariate Gaussian for component k. θ represents all parameters {π_k, μ_k, Σ_k}_{k=1}^K.

The log-likelihood of the entire dataset is:

log L(θ) = log ∏_{i=1}^N P(x_i | θ) = Σ_{i=1}^N log [ Σ_{k=1}^K π_k N(x_i | μ_k, Σ_k) ]

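Maximizing this log-likelihood directly is hard: because the sum over components sits inside the logarithm, the parameters of different components are coupled, and setting the gradient to zero yields no closed-form solution. This is where the Expectation-Maximization (EM) algorithm comes in.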
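Evaluating the log-likelihood for given parameters is straightforward even though maximizing it is not. A minimal sketch, assuming SciPy is available and using a hypothetical helper name `gmm_log_likelihood`, computes it in log space to avoid underflow:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """log L(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    # log_terms[i, k] = log pi_k + log N(x_i | mu_k, Sigma_k)
    log_terms = np.column_stack([
        np.log(w) + np.atleast_1d(multivariate_normal.logpdf(X, mean=m, cov=c))
        for w, m, c in zip(weights, means, covs)
    ])
    # The inner sum over components becomes a log-sum-exp over columns.
    return logsumexp(log_terms, axis=1).sum()

# Sanity check: two identical standard-normal components behave like one Gaussian.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5])
means = np.zeros((2, 2))
covs = np.array([np.eye(2), np.eye(2)])
ll = gmm_log_likelihood(X, weights, means, covs)
# For a single 2D standard normal: log N(x) = -log(2*pi) - 0.5 * ||x||^2
expected = -2.0 * np.log(2.0 * np.pi) - 1.0
```

The same quantity is what EM monitors from one iteration to the next to decide when to stop.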

Interviewer: That's a good comparison of their objectives. Now, please derive the EM algorithm steps for GMM.

Candidate:

EM Algorithm for GMM

The EM algorithm is an iterative method for finding (local) maximum-likelihood solutions when there are latent (unobserved) variables. In GMM, the latent variable z_ik for each data point x_i indicates which Gaussian component k generated it (z_ik = 1 if x_i came from component k, 0 otherwise, so exactly one z_ik equals 1 for each i).

The EM algorithm alternates between two steps:

1. E-step (Expectation Step):

In this step, we calculate the "responsibility" or posterior probability that each component k takes for each data point x_i, given the current parameter estimates (π_k, μ_k, Σ_k). This is essentially E[z_ik | x_i, θ^(t)].

Let γ(z_ik) or r_ik denote this responsibility:

r_ik = P(z_ik=1 | x_i, θ^(t)) = [ π_k N(x_i | μ_k, Σ_k) ] / [ Σ_{j=1}^K π_j N(x_i | μ_j, Σ_j) ]

This is calculated using Bayes' theorem. π_k acts as the prior P(z_ik=1), N(x_i | μ_k, Σ_k) as P(x_i | z_ik=1), and the denominator is P(x_i).
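As a rough NumPy sketch of the E-step, the responsibilities can be computed in log space and normalized per row. The helper names `log_gaussian` and `e_step` and the example parameters are assumptions for illustration.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """log N(x_i | mean, cov) for each row of X, via a Cholesky factor of cov."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)                 # cov = L @ L.T
    y = np.linalg.solve(L, (X - mean).T)        # L^{-1} (x_i - mean), shape (d, N)
    maha = (y ** 2).sum(axis=0)                 # squared Mahalanobis distance
    log_det = 2.0 * np.log(np.diag(L)).sum()    # log |cov|
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)

def e_step(X, weights, means, covs):
    """Return r[i, k] = P(z_ik = 1 | x_i, theta) using Bayes' theorem."""
    log_num = np.column_stack([
        np.log(weights[k]) + log_gaussian(X, means[k], covs[k])
        for k in range(len(weights))
    ])
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    log_num -= log_num.max(axis=1, keepdims=True)
    r = np.exp(log_num)
    return r / r.sum(axis=1, keepdims=True)

# Two unit-variance components centred at (-1, 0) and (1, 0), equal weights.
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
covs = np.array([np.eye(2), np.eye(2)])
X = np.array([[0.0, 0.0], [-1.0, 0.0], [3.0, 0.0]])
r = e_step(X, weights, means, covs)
```

The point at the origin is equidistant from both means, so its responsibilities are split evenly; the other two points lean toward the nearer component without being assigned to it outright.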

2. M-step (Maximization Step):

In this step, we re-estimate the model parameters (π_k, μ_k, Σ_k) by maximizing the expected complete-data log-likelihood, using the responsibilities r_ik calculated in the E-step. The complete-data log-likelihood (if we knew z_ik) would be:

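log P(X,Z | θ) = Σ_{i=1}^N Σ_{k=1}^K z_ik [ log π_k + log N(x_i | μ_k, Σ_k) ]

We maximize E_{Z | X, θ^(t)}[log P(X,Z | θ)]. Because the complete-data log-likelihood is linear in z_ik, taking this expectation simply replaces each z_ik with its posterior mean r_ik:

Q(θ, θ^(t)) = Σ_{i=1}^N Σ_{k=1}^K r_ik [ log π_k + log N(x_i | μ_k, Σ_k) ]

Setting the derivatives of Q with respect to μ_k and Σ_k to zero, and maximizing over π_k with a Lagrange multiplier for the constraint that the π_k sum to 1, gives the following update rules: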

Mixing Coefficients (π_k):
Let N_k = Σ_{i=1}^N r_ik (effective number of points assigned to cluster k).

π_k^(t+1) = N_k / N

Means (μ_k): (Weighted average of data points)

μ_k^(t+1) = (1/N_k) Σ_{i=1}^N r_ik x_i

Covariance Matrices (Σ_k): (Weighted sample covariance)

Σ_k^(t+1) = (1/N_k) Σ_{i=1}^N r_ik (x_i - μ_k^(t+1))(x_i - μ_k^(t+1))^T

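The E-step and M-step are repeated until the log-likelihood (or the parameters) converges. Each EM iteration is guaranteed not to decrease the log-likelihood, but the algorithm may converge to a local rather than a global maximum.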
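Putting the two steps together, a compact full-covariance EM loop might look like the sketch below. It is illustrative only: `fit_gmm`, its arguments, the starting covariances, and the toy data are assumptions; it expects at least two features, and it applies no covariance regularization, so it can fail if a component collapses onto very few points.

```python
import numpy as np

def log_gaussian(X, mean, cov):
    """log N(x_i | mean, cov) for each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    y = np.linalg.solve(L, (X - mean).T)
    return -0.5 * (d * np.log(2 * np.pi) + 2 * np.log(np.diag(L)).sum() + (y ** 2).sum(axis=0))

def fit_gmm(X, init_means, n_iter=100, tol=1e-8):
    """EM for a full-covariance GMM, starting from the given initial means."""
    n, _ = X.shape
    means = np.array(init_means, dtype=float)
    k = len(means)
    weights = np.full(k, 1.0 / k)
    covs = np.array([np.cov(X.T) for _ in range(k)])  # start from the overall covariance
    history = []                                      # log-likelihood per iteration
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j], computed in log space.
        log_num = np.column_stack([
            np.log(weights[j]) + log_gaussian(X, means[j], covs[j]) for j in range(k)
        ])
        row_max = log_num.max(axis=1, keepdims=True)
        log_px = row_max[:, 0] + np.log(np.exp(log_num - row_max).sum(axis=1))
        r = np.exp(log_num - log_px[:, None])
        history.append(log_px.sum())
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
        # M-step: closed-form updates for pi_k, mu_k, Sigma_k.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j]
    return weights, means, covs, np.array(history)

# Toy data: 200 points around (0, 0) and 100 correlated points around (6, 6).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], size=200),
    rng.multivariate_normal([6.0, 6.0], [[1.0, 0.8], [0.8, 1.0]], size=100),
])
weights, means, covs, history = fit_gmm(X, init_means=[[1.0, 1.0], [5.0, 5.0]])
```

The `history` array makes the monotonicity property visible: the recorded log-likelihood should never go down from one iteration to the next, up to floating-point noise.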

When GMM is Preferred Over K-means:

Probabilistic Assignments (Soft Clustering): GMM provides probabilities (responsibilities) of a point belonging to each cluster, which can be more informative than K-means' hard assignments. This is useful when clusters overlap or points are ambiguous.
Non-Spherical Cluster Shapes: K-means assumes spherical clusters due to its use of Euclidean distance and variance minimization around a single centroid. GMM, with its full covariance matrices Σ_k for each component, can model elliptical and oriented clusters of varying shapes and sizes.
Unequal Cluster Sizes/Densities: The mixing coefficients π_k let GMM model clusters that contain different fractions of the data, and together with the per-component covariances Σ_k they capture differences in density.
Model-Based Approach: GMM is a generative model, providing a probability density for the data. This allows for tasks like density estimation and generating new samples.
Theoretical Foundation: Based on statistical principles (MLE via EM), which can be more appealing than the more heuristic nature of K-means.

However, K-means is simpler, computationally faster, and often works well when its assumptions are reasonably met or as a good initialization for GMM.
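A small numeric example shows how the two methods can disagree on the same point. The parameters below are invented for illustration: one tight spherical component and one component stretched along the x-axis. The point is closer in Euclidean distance to the first mean, yet far more probable under the second component.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Assumed parameters for illustration only.
means = np.array([[0.0, 0.0], [4.0, 0.0]])
covs = np.array([np.diag([0.1, 0.1]),    # component 0: tight and spherical
                 np.diag([9.0, 0.1])])   # component 1: elongated along x
weights = np.array([0.5, 0.5])
x = np.array([1.5, 0.0])

# K-means-style hard assignment: index of the nearest mean.
hard = int(np.argmin(((x - means) ** 2).sum(axis=1)))

# GMM responsibilities: pi_k * N(x | mu_k, Sigma_k), normalized over k.
dens = np.array([
    w * multivariate_normal.pdf(x, mean=m, cov=c)
    for w, m, c in zip(weights, means, covs)
])
resp = dens / dens.sum()
soft = int(np.argmax(resp))
```

Here the nearest-mean rule picks component 0, while the responsibilities put almost all of the mass on component 1, because the covariance of each component changes how distance is measured.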

Interviewer: That's a very good derivation and comparison. For the follow-up: How would you handle the case where clusters have different shapes and densities using GMMs, and what are the implications of different covariance matrix structures?

Candidate:

Handling Different Cluster Shapes and Densities with GMM

GMM's ability to model varied cluster shapes and densities comes directly from the flexibility of the covariance matrices (Σ_k) for each Gaussian component and the mixing coefficients (π_k).

1. Mixing Coefficients (π_k):

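These model the relative sizes of the clusters: a cluster k with a larger π_k is expected to contain a larger proportion of the data points. Density is determined jointly with Σ_k, since the same share of points spread over a larger covariance gives a sparser cluster.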

2. Covariance Matrix Structures (Σ_k):

The structure of the covariance matrix Σ_k for each component k determines the shape and orientation of the Gaussian ellipsoid for that component. Common choices for covariance structures include:

Spherical (covariance_type='spherical' in scikit-learn):
Σ_k = σ_k²I (where I is the identity matrix).
Each component has its own variance σ_k², but it's the same along all dimensions (spherical shape).
This allows for clusters of different sizes (radii) but assumes they are spherical. It's more flexible than K-means, which corresponds to the limiting case of a GMM whose components all share one fixed spherical covariance σ²I with σ² → 0, so that responsibilities become hard assignments.
Diagonal (covariance_type='diag'):
Σ_k is a diagonal matrix, with variances σ_kj² along each dimension j.
This allows each component to have different variances along different axes, resulting in axis-aligned elliptical shapes. Clusters can be stretched or compressed along the coordinate axes.
Tied (covariance_type='tied'):
All K components share the same full covariance matrix: Σ_k = Σ for all k.
This means all clusters have the same shape and orientation, but can be centered differently. This reduces the number of parameters to estimate.
Full (covariance_type='full'):
Each component k has its own unconstrained (symmetric positive-definite) covariance matrix Σ_k.
This is the most flexible option, allowing each cluster to have any elliptical shape, size, and orientation. It can model highly varied cluster structures.
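One way to see the four structures side by side is to fit scikit-learn's `GaussianMixture` with each `covariance_type` and inspect the shape of the fitted `covariances_` attribute. The data here is synthetic and chosen only for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 3D data with two axis-aligned blobs (assumed for the demo).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.multivariate_normal([0, 0, 0], np.diag([1.0, 0.2, 0.5]), size=150),
    rng.multivariate_normal([6, 6, 6], np.diag([0.3, 2.0, 1.0]), size=150),
])

shapes = {}
for cov_type in ("spherical", "diag", "tied", "full"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type, random_state=0).fit(X)
    # covariances_ is stored differently for each structure:
    #   spherical: one variance per component
    #   diag:      one variance per component and feature
    #   tied:      a single shared d x d matrix
    #   full:      one d x d matrix per component
    shapes[cov_type] = gmm.covariances_.shape
```

The shapes make the parameter savings explicit: the constrained types simply store fewer numbers than one full matrix per component.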

Implications and Handling:

Flexibility vs. Overfitting:
- 'Full' covariance offers the most flexibility to capture diverse shapes and densities but has the most parameters: d(d+1)/2 covariance parameters per component, compared with d for 'diag' and 1 for 'spherical'. This makes it prone to overfitting, especially with limited data or high dimensionality. It can also produce singular covariance matrices if a component has too few points.
- 'Spherical' and 'diag' are more constrained, have fewer parameters, and are less prone to overfitting but might not capture the true cluster shapes if they are arbitrarily oriented ellipses.
- 'Tied' is a compromise, reducing parameters while still allowing elliptical shapes.
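To get a feel for how quickly the parameter count grows, the free parameters of each structure can be tallied directly. The function below is a hypothetical helper written for this comparison; it counts K - 1 mixing coefficients, K·d mean entries, and the covariance parameters implied by each structure.

```python
def gmm_free_parameters(n_components, n_features, covariance_type):
    """Number of free parameters in a GMM with the given covariance structure."""
    k, d = n_components, n_features
    cov_params = {
        "spherical": k,                   # one variance per component
        "diag": k * d,                    # d variances per component
        "tied": d * (d + 1) // 2,         # one shared symmetric matrix
        "full": k * d * (d + 1) // 2,     # one symmetric matrix per component
    }[covariance_type]
    mean_params = k * d
    weight_params = k - 1                 # the weights sum to 1
    return cov_params + mean_params + weight_params

# Example: 5 components in 50 dimensions.
counts = {
    cov_type: gmm_free_parameters(5, 50, cov_type)
    for cov_type in ("spherical", "diag", "tied", "full")
}
```

For 5 components in 50 dimensions the 'full' model has thousands of parameters while 'diag' has a few hundred, which is why the constrained structures are often preferred when data is scarce relative to dimensionality.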
Model Selection:
- The choice of covariance type (and the number of components K) is a model selection problem.
- Information criteria like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) are often used to choose the best covariance structure and K. These criteria balance model fit (likelihood) against model complexity (number of parameters), and lower values are better. BIC's penalty grows with log N per parameter, versus a constant 2 for AIC, so BIC tends to prefer simpler models on all but very small datasets.
- Cross-validation can also be used, though defining a good clustering evaluation metric for CV can be tricky.
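A common way to apply this in practice is a small grid search over K and covariance type, scored by BIC. The sketch below uses scikit-learn's `GaussianMixture.bic` on synthetic data with three blobs; the data, the grid, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated spherical blobs of 200 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(200, 2))
    for center in ((0, 0), (10, 0), (0, 10))
])

results = {}
for cov_type in ("spherical", "diag", "tied", "full"):
    for k in range(1, 6):
        gmm = GaussianMixture(
            n_components=k, covariance_type=cov_type, n_init=3, random_state=0
        ).fit(X)
        results[(cov_type, k)] = gmm.bic(X)   # lower is better

# Configuration with the lowest BIC across the whole grid.
best_cov_type, best_k = min(results, key=results.get)
```

On data like this one would typically expect the selected K to match the number of blobs and a constrained covariance type to be competitive, but the outcome depends on the sample, so the scores in `results` are worth inspecting rather than trusting the single minimum.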
Initialization and Convergence:
- GMM (and EM) is sensitive to initialization. Poor initialization can lead to convergence to a suboptimal local maximum of the likelihood. Running EM multiple times with different initializations (e.g., from K-means results) is common.
- With 'full' covariances, if a component gets very few points, its covariance matrix might become ill-conditioned or singular during updates. Regularization (e.g., adding a small constant to the diagonal of the covariance matrices) can help stabilize the estimation.
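In scikit-learn these safeguards map onto constructor arguments of `GaussianMixture`: `init_params` selects the initialization scheme, `n_init` controls the number of restarts, and `reg_covar` is the value added to the diagonal of the covariances. The specific values and data below are illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data: two correlated blobs (assumed for the demo).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=200),
    rng.multivariate_normal([8, 8], [[1.0, -0.4], [-0.4, 0.5]], size=200),
])

gmm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    init_params="kmeans",   # initialize from a K-means run
    n_init=5,               # run EM 5 times and keep the best result
    reg_covar=1e-4,         # regularization added to the covariance diagonals
    random_state=0,
).fit(X)

# Smallest eigenvalue across all fitted covariance matrices; positive means none is singular.
min_eigenvalue = np.linalg.eigvalsh(gmm.covariances_).min()
```

Raising `reg_covar` trades a little bias in the covariance estimates for stability, which tends to matter most for small components in high dimensions.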

So, by choosing the appropriate `covariance_type`, GMM can effectively model clusters of very different shapes (spherical, axis-aligned elliptical, arbitrarily oriented elliptical) and densities (via π_k). The challenge lies in selecting the right complexity to avoid overfitting while capturing the true underlying structure.

Interviewer: That's an excellent discussion on how GMM handles diverse cluster structures through its covariance options. You've covered the topic very thoroughly.

Candidate: Thank you!

Why K-means vs. GMM & EM Matter

Understanding Clustering Assumptions: Highlights the difference between hard (K-means) and soft (GMM) clustering, and the geometric assumptions each makes.
Probabilistic Modeling: GMM introduces a probabilistic framework for clustering, allowing for more nuanced assignments and density estimation.
EM Algorithm Insight: The EM algorithm is a powerful general technique for MLE with latent variables, applicable far beyond GMMs (e.g., Hidden Markov Models, missing data imputation).
Model Flexibility: GMMs, through different covariance structures, offer great flexibility in modeling clusters of varying shapes, sizes, and orientations.
Model Selection: Underscores the importance of model selection criteria (like AIC, BIC) when dealing with models of varying complexity.
Choosing the Right Tool: Knowing when K-means is sufficient versus when the added complexity and flexibility of GMM are beneficial is key for practical applications.