Ensemble Methods: Variance Reduction & Diversity — ML Depth

Ensemble Learning Principles

Core Concepts for Ensembles

Ensemble Learning: Combining multiple models to achieve better predictive performance.
Variance of a Sum/Average: How Var( (1/N) Σ X_i ) decomposes into individual variances and pairwise covariances.
Uncorrelated Models: Assumption that errors of base models are not correlated.
Decorrelation Techniques: How methods like bagging and random subspaces achieve this.
Random Forests: Bagging + random feature subspace selection.
Bagging (Bootstrap Aggregating): Reduces variance by averaging predictions of models trained on bootstrap samples.
Boosting: Reduces bias by sequentially training models, with each focusing on errors of the previous ones.
Bias-Variance Tradeoff: How different ensemble strategies affect bias and variance.
Diversity-Accuracy Tradeoff: The balance between having diverse (uncorrelated) base models and having individually accurate base models.

Ensemble Principles Explained

Interviewer: Let's talk about ensemble methods. Can you prove mathematically that the variance of an ensemble of N uncorrelated models (each with the same variance σ²) is reduced by a factor of 1/N compared to a single model? Then, explain how random forests achieve this decorrelation and why bagging primarily reduces variance while boosting primarily reduces bias.

Candidate: Certainly. Ensembling is a powerful technique.

1. Variance Reduction in Ensembles of Uncorrelated Models

Let's consider an ensemble of N models, f₁, f₂, ..., f_N. For a regression task, the ensemble prediction is often the average of individual model predictions:

f_ensemble(x) = (1/N) Σ_i=1^N f_i(x)

We are interested in the variance of this ensemble prediction. Let Var(f_i(x)) = σ² for all individual models i (they have the same variance). Assume the models are uncorrelated, meaning Cov(f_i(x), f_j(x)) = 0 for i ≠ j.

The variance of a sum of random variables is the sum of their variances plus twice the sum of the covariances over all distinct pairs (i < j). For an average, we have:

Var(f_ensemble(x)) = Var( (1/N) Σ_i=1^N f_i(x) )

Using the property Var(cX) = c²Var(X):

= (1/N²) Var( Σ_i=1^N f_i(x) )

Now, Var( ΣX_i ) = ΣVar(X_i) + Σ_i≠j Cov(X_i, X_j), where the sum over i ≠ j counts each pair twice (once in each order). Since the models are uncorrelated, Cov(f_i(x), f_j(x)) = 0 for i ≠ j.

= (1/N²) [ Σ_i=1^N Var(f_i(x)) + Σ_i≠j Cov(f_i(x), f_j(x)) ]
= (1/N²) [ Σ_i=1^N σ² + 0 ]
= (1/N²) [ Nσ² ]
= (1/N) σ²

Thus, the variance of the ensemble of N uncorrelated models is σ²/N, which is 1/N times the variance of a single model. Averaging uncorrelated models therefore shrinks variance in proportion to the ensemble size.
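
The σ²/N result can be checked numerically. The sketch below is only an illustration under stated assumptions: each "model prediction" is an independent zero-mean Gaussian draw (independence implies zero correlation), and the function name, trial count, and seed are hypothetical choices rather than anything from the discussion above.

```python
import random
import statistics

def empirical_ensemble_variance(n_models, sigma, n_trials=20000, seed=0):
    """Estimate the variance of the average of n_models independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    averages = []
    for _ in range(n_trials):
        # One trial = one set of predictions from n_models uncorrelated models.
        preds = [rng.gauss(0.0, sigma) for _ in range(n_models)]
        averages.append(sum(preds) / n_models)
    return statistics.pvariance(averages)

sigma = 2.0
for n in (1, 5, 25):
    # Print the simulated variance next to the theoretical sigma^2 / N.
    print(n, round(empirical_ensemble_variance(n, sigma), 3), sigma**2 / n)
```

The simulated value should track σ²/N closely, up to Monte Carlo noise.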

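If the models are correlated, with a common variance σ² and an average pairwise correlation ρ, each of the N(N-1) covariance terms equals ρσ² on average. Substituting into the expansion above gives (1/N²)[ Nσ² + N(N-1)ρσ² ], which simplifies to: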

Var(f_ensemble(x)) = ρσ² + ( (1-ρ)/N )σ²

As N → ∞, the variance approaches ρσ². So, decorrelation (ρ close to 0) is key for variance reduction.
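
As a rough check of the correlated-case formula, the sketch below compares the closed form with a simulation. It assumes a simple construction that is not part of the original discussion: every prediction mixes one shared Gaussian draw with one private draw, which gives each pair correlation ρ (for 0 ≤ ρ ≤ 1) and each prediction variance σ². The function names are hypothetical.

```python
import math
import random
import statistics

def ensemble_variance(sigma2, rho, n):
    """Closed form from the text: rho*sigma^2 + ((1 - rho)/n)*sigma^2."""
    return rho * sigma2 + (1.0 - rho) / n * sigma2

def simulate_correlated_average(sigma2, rho, n, n_trials=20000, seed=0):
    """Empirical variance of the mean of n equicorrelated predictions."""
    rng = random.Random(seed)
    sigma = math.sqrt(sigma2)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    means = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all models
        # Each model adds its own private noise to the shared component.
        preds = [sigma * (a * shared + b * rng.gauss(0.0, 1.0)) for _ in range(n)]
        means.append(sum(preds) / n)
    return statistics.pvariance(means)

print(ensemble_variance(1.0, 0.5, 10), simulate_correlated_average(1.0, 0.5, 10))
```

The shared component is what averaging cannot remove: however large n gets, its contribution ρσ² stays in the ensemble variance.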

2. How Random Forests Achieve Decorrelation

Random Forests build upon bagging (which already promotes diversity) by adding another layer of randomness to further decorrelate the individual decision trees:

Bagging (Bootstrap Aggregating): Each tree in the Random Forest is trained on a different bootstrap sample (random sample with replacement) of the original training data. This means different trees see slightly different data, leading to diverse tree structures.
Random Feature Subspace Selection: At each split point in each tree, instead of considering all available features to find the best split, Random Forest considers only a random subset of the features.
- Typically, for m features, √m features (for classification) or m/3 features (for regression) are randomly selected at each node.
- This prevents strong predictors from dominating all trees. If a few features are very strong, without this random subspace, many trees might choose to split on these same strong features early on, making the trees very similar (correlated).
- By forcing each split to consider only a subset of features, different trees are more likely to use different features for their splits, leading to more diverse and less correlated trees.

The combination of these two sources of randomness – different data subsets and different feature subsets for splits – helps ensure that the individual trees in the ensemble are as decorrelated as possible, maximizing the variance reduction benefit from averaging.
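
The two sources of randomness can be sketched in a few lines. This is not a Random Forest implementation, only an illustration of the sampling steps. The helper names are hypothetical, and the subset sizes use floor(√m) and floor(m/3) as one reading of the typical defaults mentioned above.

```python
import math
import random

def bootstrap_indices(n_rows, rng):
    """Row indices for one bootstrap sample: n_rows draws with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, task, rng):
    """Random subset of feature indices considered at a single split."""
    if task == "classification":
        k = max(1, math.isqrt(n_features))   # floor(sqrt(m))
    else:
        k = max(1, n_features // 3)          # floor(m / 3)
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
rows = bootstrap_indices(1000, rng)
print("fraction of distinct rows:", len(set(rows)) / 1000)
print("split candidates:", candidate_features(16, "classification", rng))
```

A bootstrap sample of size n contains roughly 63% of the distinct original rows (1 - 1/e in the limit), which is why each tree sees a somewhat different dataset. A fresh feature subset is drawn at every split, not once per tree.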

3. Why Bagging Reduces Variance & Boosting Reduces Bias

Bagging (e.g., Random Forests):

Mechanism: Trains multiple independent, typically high-variance, low-bias base models (e.g., deep decision trees) on different bootstrap samples of the data. It then averages their predictions.
Effect on Variance: As shown mathematically, averaging predictions from (ideally) uncorrelated or less correlated models reduces the overall variance of the ensemble. Each individual tree might overfit its bootstrap sample (high variance), but their errors tend to average out.
Effect on Bias: The bias of the bagged ensemble is typically similar to the average bias of the individual base models. If the base models are already low-bias (e.g., deep trees), the ensemble will also have low bias. Bagging doesn't systematically try to reduce the bias of the base learners.
Primary Goal: Variance reduction. Works best with unstable learners (high variance).
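
A small experiment can make the variance effect visible. The sketch below is an assumption-laden toy, not the setup described above: it swaps deep trees for 1-nearest-neighbour regression (another low-bias, high-variance learner that fits in a few lines), uses y = sin(x) plus Gaussian noise as a hypothetical data source, and measures how much the prediction at one test point varies across freshly drawn training sets.

```python
import math
import random
import statistics

def make_training_set(n, noise_sd, rng):
    """Noisy samples of y = sin(x) on [0, 2*pi]."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, noise_sd)) for x in xs]

def predict_1nn(train, x0):
    """1-nearest-neighbour regression: return the y of the closest x."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def predict_bagged(train, x0, n_models, rng):
    """Average 1-NN predictions over bootstrap resamples of `train`."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(predict_1nn(boot, x0))
    return sum(preds) / n_models

def compare_variance(x0=1.0, n_train=30, n_models=25, n_repeats=400, seed=0):
    """Variance of the prediction at x0 across independent training sets."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_repeats):
        train = make_training_set(n_train, 0.5, rng)
        single.append(predict_1nn(train, x0))
        bagged.append(predict_bagged(train, x0, n_models, rng))
    return statistics.pvariance(single), statistics.pvariance(bagged)

var_single, var_bagged = compare_variance()
print(f"single 1-NN variance: {var_single:.3f}, bagged: {var_bagged:.3f}")
```

In this regression setting the bagged predictor effectively averages over several nearby neighbours, so its prediction fluctuates less from one training set to the next.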

Boosting (e.g., AdaBoost, Gradient Boosting):

Mechanism: Trains base models (often weak learners like shallow decision trees, which have high bias but low variance) sequentially. Each new model focuses on correcting the errors made by the previous ensemble of models.
- AdaBoost reweights instances based on previous errors.
- Gradient Boosting fits new models to the pseudo-residuals (gradients of the loss function) of the previous ensemble.
Effect on Bias: By iteratively focusing on the "hard" examples or residual errors, boosting incrementally builds a more complex and powerful model that can better fit the training data, thereby reducing bias. Each weak learner contributes to reducing the overall bias of the ensemble.
Effect on Variance: Boosting can sometimes also reduce variance, especially Gradient Boosting with proper regularization (like tree depth limits, shrinkage). However, if not carefully controlled (e.g., too many trees or too complex base learners), boosting can overfit and increase variance. Its primary strength is bias reduction.
Primary Goal: Bias reduction. Works best by combining many weak learners.
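
The sequential correction can be sketched with gradient boosting on squared loss, where the pseudo-residuals are simply the ordinary residuals. Everything below is a minimal illustration under assumptions: one-dimensional inputs, decision stumps as the weak learner, a hypothetical learning rate of 0.3, and synthetic sin(x) data. It is not how any particular library implements boosting.

```python
import math
import random

def fit_stump(xs, targets):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # dropping the max keeps the right side non-empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Return the training MSE after each round of stump boosting."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
    history = []
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history

rng = random.Random(0)
xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(60)]
ys = [math.sin(x) + rng.gauss(0.0, 0.1) for x in xs]
history = boost(xs, ys, n_rounds=50)
print(f"training MSE after round 1: {history[0]:.4f}, after round 50: {history[-1]:.4f}")
```

A single stump cannot represent a sine curve (high bias), but the sum of many stumps, each fitted to what the previous ones left unexplained, tracks it increasingly well. Training error alone says nothing about overfitting, which is where the regularization mentioned above comes in.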

Interviewer: That's a very clear distinction. Now, for the follow-up: How does the diversity-accuracy tradeoff affect ensemble performance?

Candidate:

The Diversity-Accuracy Tradeoff in Ensembles

The performance of an ensemble depends critically on two factors concerning its base learners: their individual accuracy and their diversity (or lack of correlation).

The Tradeoff:

Often, there's a tradeoff: techniques that increase the diversity among base learners might slightly decrease their individual accuracy, and vice versa. The goal is to find a sweet spot where the ensemble benefits more from diversity than it loses from any potential drop in individual accuracy.

Accuracy: If the individual base learners are very poor (e.g., worse than random guessing or consistently making large errors), even an ensemble of diverse models won't perform well. The base models need to have some predictive power.
Diversity: If all base learners are identical or highly correlated, they will make the same mistakes, and the ensemble will not perform better than a single base learner. Diversity means that the errors made by different base learners are, ideally, uncorrelated. When one model makes an error, other, different models might get it right, and averaging/voting can cancel out these individual errors.
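
For classification, both points can be illustrated with a majority vote. The sketch below assumes an idealized case that real models rarely meet: every base classifier is correct with the same probability p, and their errors are fully independent. The function name is hypothetical.

```python
from math import comb

def majority_vote_accuracy(n_models, p):
    """P(a strict majority of n_models independent voters is correct); n_models odd."""
    need = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(need, n_models + 1)
    )

for p in (0.6, 0.4):
    print(p, round(majority_vote_accuracy(1, p), 3), round(majority_vote_accuracy(11, p), 3))
```

With p = 0.6, eleven independent voters reach roughly 0.75 accuracy, while with p = 0.4 the vote ends up worse than any single model: diversity amplifies whatever edge the base learners have, in either direction. If the eleven models were identical instead of independent, the vote would stay at p.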

Mathematical Intuition (revisiting the variance formula for correlated models):

If f_i are predictions of N base models with average variance σ² and average pairwise correlation ρ, the variance of the ensemble mean is:

Var(f_ensemble) = ρσ² + ( (1-ρ)/N )σ²

If models are perfectly correlated (ρ=1), Var(f_ensemble) = σ² (no variance reduction).
If models are uncorrelated (ρ=0), Var(f_ensemble) = σ²/N (maximum variance reduction).
If models are somewhat correlated (0 < ρ < 1), there's still variance reduction, but less than the uncorrelated case. The term ρσ² is the floor that the ensemble variance approaches as N grows; adding more models cannot push it lower.

This shows that even if individual models are accurate and stable (low σ²), high correlation (ρ) limits what averaging can add. Conversely, even if individual models are somewhat noisier (higher σ²), strong decorrelation (low ρ) can still yield a low-variance ensemble.
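
Plugging numbers into the formula makes the comparison concrete. The values below are hypothetical and chosen only for illustration: one ensemble of low-variance but highly correlated models, and one of noisier but largely decorrelated models.

```python
def ensemble_variance(sigma2, rho, n):
    """rho*sigma^2 + ((1 - rho)/n)*sigma^2, the formula discussed above."""
    return rho * sigma2 + (1.0 - rho) / n * sigma2

# Hypothetical settings: same ensemble size, different sigma^2 and rho.
stable_but_correlated = ensemble_variance(sigma2=1.0, rho=0.8, n=50)
noisier_but_diverse = ensemble_variance(sigma2=1.5, rho=0.2, n=50)

print(f"stable but correlated: {stable_but_correlated:.3f}")  # 0.804
print(f"noisier but diverse:   {noisier_but_diverse:.3f}")    # 0.324
```

Despite a 50% higher per-model variance, the decorrelated ensemble ends up with well under half the variance of the correlated one. This only covers the variance term; bias is not captured by the formula.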

How it Affects Ensemble Performance:

Too Little Diversity (High Correlation): The ensemble behaves like a single model. Errors are systematic across base learners. Example: Bagging very stable learners (like linear regression on a large dataset) provides little benefit because the bootstrap samples don't change the model much.
Too Much Diversity at the Cost of Accuracy: If base learners are made too diverse by, for example, extreme randomization or using very weak features, their individual accuracy might drop so much that even averaging them doesn't lead to a good ensemble. Example: Randomly guessing base learners are diverse but useless.
Optimal Balance: The best ensembles find a balance.
- Random Forests achieve this by using relatively complex (low-bias, high-variance) individual trees (good accuracy on their bootstrap sample) but then injecting randomness (bootstrap sampling, random feature subspaces) to ensure diversity.
- Boosting starts with weak (high-bias) but diverse (as they focus on different errors) learners. The sequential process combines them to improve accuracy while maintaining some diversity.
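
The low-diversity case can also be simulated. As a stand-in for the stable learner mentioned above, the sketch below uses the simplest one available, the sample mean (in effect an intercept-only regression). That substitution, along with the sample sizes and function name, is an assumption made for brevity.

```python
import random
import statistics

def compare_stable_learner(n_train=100, n_models=25, n_repeats=300, seed=0):
    """Variance across training sets of a sample mean vs. its bagged version."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(n_repeats):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_train)]
        single.append(statistics.fmean(data))
        # Each bootstrap "model" is the mean of a resample with replacement.
        boot_means = [
            statistics.fmean(rng.choices(data, k=n_train)) for _ in range(n_models)
        ]
        bagged.append(statistics.fmean(boot_means))
    return statistics.pvariance(single), statistics.pvariance(bagged)

var_single, var_bagged = compare_stable_learner()
print(f"single: {var_single:.4f}, bagged: {var_bagged:.4f}")
```

Because every bootstrap mean lands close to the original sample mean, the bagged models are almost perfectly correlated and averaging them changes very little.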

Effectively, we want base learners that are "experts" in different aspects of the data or make different kinds of errors. The ensemble then aggregates these diverse "opinions" to arrive at a more robust and accurate final prediction. Controls such as depth or leaf-size limits in Random Forests (though fully grown, unpruned trees are often used) and regularization in boosting help tune the accuracy of individual base learners without eliminating the diversity among them.

Interviewer: That's a great explanation of the tradeoff and how it plays out in different ensemble strategies. Very well articulated.

Candidate: Thank you!

Why Understanding Ensemble Principles Matters

Improved Predictive Performance: Ensembles are often among the top-performing models in many machine learning competitions and real-world applications.
Bias-Variance Management: Provides explicit strategies to tackle high bias (boosting) or high variance (bagging).
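Robustness: Averaging ensembles such as bagging and Random Forests are generally more robust to noise and outliers than single models; boosting can be more sensitive, since it keeps refocusing on hard (possibly mislabeled) examples.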
Understanding Model Behavior: Knowing why ensembles work (variance reduction through decorrelation, bias reduction through sequential correction) provides deeper insight into model building.
Informed Algorithm Choice: Helps in selecting appropriate ensemble techniques (e.g., Random Forest vs. Gradient Boosting) based on the characteristics of the base learners and the problem.
Key to Modern ML: Many state-of-the-art algorithms (like XGBoost, LightGBM, CatBoost) are sophisticated ensemble methods.