K-means vs. GMM & EM Algorithm
Core Concepts for K-means & GMM
- K-means:
- Objective: Minimize intra-cluster sum of squares (inertia).
- Algorithm: Iterative assignment and centroid update.
- Assumptions: Spherical, equally sized clusters. Hard assignments.
- Gaussian Mixture Model (GMM):
- Assumption: Data points are generated from a mixture of several Gaussian distributions.
- Parameters: Mixing coefficients (πk), means (μk), covariances (Σk).
- Objective: Maximize the log-likelihood of the data.
- Expectation-Maximization (EM) Algorithm: Iterative algorithm for finding MLE or MAP estimates for models with latent variables.
- E-step (Expectation): Calculate posterior probabilities of latent variables (responsibilities).
- M-step (Maximization): Update model parameters using these responsibilities.
- Latent Variables: In GMM, the component from which each data point is drawn.
- Gaussian PDF: Formula for multivariate Gaussian distribution.
- Cluster Shapes and Densities: How different covariance matrix structures in GMM handle varied cluster characteristics.
Clustering Algorithms Explained
K-means Clustering
Objective: K-means aims to partition N data points {x1, ..., xN} into K disjoint clusters C = {C1, ..., CK}, each with a centroid μk. The objective is to minimize the within-cluster sum of squares (WCSS), also known as inertia:
JK-means = Σk=1K Σxi ∈ Ck ||xi - μk||2
where μk is the mean of the points in cluster Ck.
Algorithm (Iterative):
- Initialize K centroids μ1, ..., μK (e.g., randomly).
- Assignment Step: Assign each data point xi to the cluster Ck whose centroid μk is closest (in Euclidean distance):
Ck = { xi | ||xi - μk||2 ≤ ||xi - μj||2 for all j ≠ k } - Update Step: Recalculate the centroids μk as the mean of all data points assigned to cluster Ck:
μk = (1/|Ck|) Σxi ∈ Ck xi - Repeat steps 2 and 3 until convergence (e.g., centroids don't change much, or assignments stabilize).
K-means makes "hard" assignments of points to clusters and implicitly assumes clusters are spherical and roughly equally sized due to the Euclidean distance and centroid update.
Gaussian Mixture Models (GMM)
Assumption: GMM assumes that the data points are generated from a mixture of K Gaussian distributions. Each Gaussian component k has its own mean μk, covariance matrix Σk, and a mixing coefficient πk (the prior probability of a point belonging to cluster k, with Σk πk = 1).
Objective: GMM aims to find the parameters (πk, μk, Σk for all k) that maximize the log-likelihood of the observed data {x1, ..., xN}.
The probability density of a data point xi is given by the mixture:
P(xi | θ) = Σk=1K πk N(xi | μk, Σk)
where N(xi | μk, Σk) is the probability density function of the multivariate Gaussian for component k. θ represents all parameters {πk, μk, Σk}k=1K.
The log-likelihood of the entire dataset is:
log L(θ) = log ∏i=1N P(xi | θ) = Σi=1N log [ Σk=1K πk N(xi | μk, Σk) ]
Maximizing this log-likelihood directly is hard because the sum is inside the logarithm. This is where the Expectation-Maximization (EM) algorithm comes in.
EM Algorithm for GMM
The EM algorithm is an iterative method to find MLE solutions when there are latent (unobserved) variables. In GMM, the latent variable zik for each data point xi indicates which Gaussian component k it was generated from (zik=1 if xi from component k, 0 otherwise).
The EM algorithm alternates between two steps:
1. E-step (Expectation Step):
In this step, we calculate the "responsibility" or posterior probability that each component k takes for each data point xi, given the current parameter estimates (πk, μk, Σk). This is essentially E[zik | xi, θ(t)].
Let γ(zik) or rik denote this responsibility:
rik = P(zik=1 | xi, θ(t)) = [ πk N(xi | μk, Σk) ] / [ Σj=1K πj N(xi | μj, Σj) ]
This is calculated using Bayes' theorem. πk acts as the prior P(zik=1), N(xi | μk, Σk) as P(xi | zik=1), and the denominator is P(xi).
2. M-step (Maximization Step):
In this step, we re-estimate the model parameters (πk, μk, Σk) by maximizing the expected complete-data log-likelihood, using the responsibilities rik calculated in the E-step. The complete-data log-likelihood (if we knew zik) would be:
log P(X,Z | θ) = Σi=1N Σk=1K zik [ log πk + log N(xi | μk, Σk) ]
We maximize EZ|X,θ(t)[log P(X,Z | θ)], which involves replacing zik with rik.
The update rules derived from maximizing this expected log-likelihood are:
- Mixing Coefficients (πk):
Let Nk = Σi=1N rik (effective number of points assigned to cluster k).πk(t+1) = Nk / N - Means (μk): (Weighted average of data points)
μk(t+1) = (1/Nk) Σi=1N rik xi - Covariance Matrices (Σk): (Weighted sample covariance)
Σk(t+1) = (1/Nk) Σi=1N rik (xi - μk(t+1))(xi - μk(t+1))T
The E-step and M-step are repeated iteratively until the log-likelihood (or parameters) converges.
When GMM is Preferred Over K-means:
- Probabilistic Assignments (Soft Clustering): GMM provides probabilities (responsibilities) of a point belonging to each cluster, which can be more informative than K-means' hard assignments. This is useful when clusters overlap or points are ambiguous.
- Non-Spherical Cluster Shapes: K-means assumes spherical clusters due to its use of Euclidean distance and variance minimization around a single centroid. GMM, with its full covariance matrices Σk for each component, can model elliptical and oriented clusters of varying shapes and sizes.
- Unequal Cluster Sizes/Densities: The mixing coefficients πk allow GMM to model clusters of different sizes/densities naturally.
- Model-Based Approach: GMM is a generative model, providing a probability density for the data. This allows for tasks like density estimation and generating new samples.
- Theoretical Foundation: Based on statistical principles (MLE via EM), which can be more appealing than the more heuristic nature of K-means.
However, K-means is simpler, computationally faster, and often works well when its assumptions are reasonably met or as a good initialization for GMM.
Handling Different Cluster Shapes and Densities with GMM
GMM's ability to model varied cluster shapes and densities comes directly from the flexibility of the covariance matrices (Σk) for each Gaussian component and the mixing coefficients (πk).
1. Mixing Coefficients (πk):
- These directly model different densities or sizes of clusters. A cluster k with a larger πk is expected to contain a larger proportion of the data points.
2. Covariance Matrix Structures (Σk):
The structure of the covariance matrix Σk for each component k determines the shape and orientation of the Gaussian ellipsoid for that component. Common choices for covariance structures include:
- Spherical (
covariance_type='spherical'in scikit-learn):
Σk = σk2I (where I is the identity matrix).
Each component has its own variance σk2, but it's the same along all dimensions (spherical shape).
This allows for clusters of different sizes (radii) but assumes they are spherical. It's more flexible than K-means (which assumes all clusters have similar variance if viewed as a GMM with fixed, equal spherical covariances). - Diagonal (
covariance_type='diag'):
Σk is a diagonal matrix, with variances σkj2 along each dimension j.
This allows each component to have different variances along different axes, resulting in axis-aligned elliptical shapes. Clusters can be stretched or compressed along the coordinate axes. - Tied (
covariance_type='tied'):
All K components share the same full covariance matrix: Σk = Σ for all k.
This means all clusters have the same shape and orientation, but can be centered differently. This reduces the number of parameters to estimate. - Full (
covariance_type='full'):
Each component k has its own arbitrary (positive semi-definite) covariance matrix Σk.
This is the most flexible option, allowing each cluster to have any elliptical shape, size, and orientation. It can model highly varied cluster structures.
Implications and Handling:
- Flexibility vs. Overfitting:
- 'Full' covariance offers the most flexibility to capture diverse shapes and densities but has the most parameters (d(d+1)/2 per component). This makes it prone to overfitting, especially with limited data or high dimensionality. It might also lead to singular covariance matrices if a component has too few points.
- 'Spherical' and 'diag' are more constrained, have fewer parameters, and are less prone to overfitting but might not capture the true cluster shapes if they are arbitrarily oriented ellipses.
- 'Tied' is a compromise, reducing parameters while still allowing elliptical shapes.
- Model Selection:
- The choice of covariance type (and the number of components K) is a model selection problem.
- Information criteria like AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion) are often used to choose the best covariance structure and K. These criteria balance model fit (likelihood) with model complexity (number of parameters). BIC tends to prefer simpler models.
- Cross-validation can also be used, though defining a good clustering evaluation metric for CV can be tricky.
- Initialization and Convergence:
- GMM (and EM) is sensitive to initialization. Poor initialization can lead to convergence to a suboptimal local maximum of the likelihood. Running EM multiple times with different initializations (e.g., from K-means results) is common.
- With 'full' covariances, if a component gets very few points, its covariance matrix might become ill-conditioned or singular during updates. Regularization (e.g., adding a small constant to the diagonal of the covariance matrices) can help stabilize the estimation.
So, by choosing the appropriate `covariance_type`, GMM can effectively model clusters of very different shapes (spherical, axis-aligned elliptical, arbitrarily oriented elliptical) and densities (via πk). The challenge lies in selecting the right complexity to avoid overfitting while capturing the true underlying structure.
Why K-means vs. GMM & EM Matter
- Understanding Clustering Assumptions: Highlights the difference between hard (K-means) and soft (GMM) clustering, and the geometric assumptions each makes.
- Probabilistic Modeling: GMM introduces a probabilistic framework for clustering, allowing for more nuanced assignments and density estimation.
- EM Algorithm Insight: The EM algorithm is a powerful general technique for MLE with latent variables, applicable far beyond GMMs (e.g., Hidden Markov Models, missing data imputation).
- Model Flexibility: GMMs, through different covariance structures, offer great flexibility in modeling clusters of varying shapes, sizes, and orientations.
- Model Selection: Underscores the importance of model selection criteria (like AIC, BIC) when dealing with models of varying complexity.
- Choosing the Right Tool: Knowing when K-means is sufficient versus when the added complexity and flexibility of GMM are beneficial is key for practical applications.