SVM Dual Formulation, KKT & Kernels — ML Depth

Understanding Support Vector Machines (SVM)

Core Concepts for SVM Dual Formulation

Maximal Margin Classifier: The goal of SVMs.
Primal Optimization Problem: Minimizing ||w||² subject to y_i(w^Tx_i + b) ≥ 1.
Lagrangian Mechanics: Introducing Lagrange multipliers (α_i) for constrained optimization.
Karush-Kuhn-Tucker (KKT) Conditions: Necessary (and often sufficient) conditions for optimality in constrained non-linear programming.
- Stationarity
- Primal feasibility
- Dual feasibility
- Complementary slackness
Dual Optimization Problem: Formulated in terms of Lagrange multipliers α_i.
Support Vectors: Data points for which α_i > 0, which lie on or inside the margin.
Kernel Trick: Replacing dot products with kernel functions for non-linear classification.
Kernel Functions: RBF, Polynomial, their impact on decision boundaries, and parameter tuning.

SVM Dual Formulation Explained

Interviewer: Let's dive into Support Vector Machines. Can you derive the dual formulation of the hard-margin SVM optimization problem using Lagrange multipliers? Then, please explain the KKT conditions and how they lead to the concept of support vectors.

Candidate: Certainly. The SVM aims to find a hyperplane that maximizes the margin between two classes.

1. Primal SVM Optimization Problem (Hard Margin)

For a linearly separable dataset {(x_i, y_i)}^N_i=1, where x_i ∈ R^d and y_i ∈ {-1, +1}, the hyperplane is defined by w^Tx + b = 0. The goal is to maximize the margin, which is 2/||w||. This is equivalent to minimizing (1/2)||w||².

The primal optimization problem is:

minimize_w,b (1/2) ||w||²
    subject to: y_i(w^Tx_i + b) ≥ 1,  for all i = 1, ..., N

2. Lagrangian Formulation

We introduce non-negative Lagrange multipliers α_i ≥ 0 for each constraint:

L(w, b, α) = (1/2) ||w||² - Σ_i=1^N α_i [y_i(w^Tx_i + b) - 1]

To find the solution, we need to minimize L with respect to w and b, and maximize L with respect to α (subject to α_i ≥ 0). This is a saddle point problem.

Setting the partial derivatives of L with respect to w and b to zero:

∂L / ∂w = w - Σ_i=1^N α_i y_i x_i = 0 ⇒ w = Σ_i=1^N α_i y_i x_i (Equation 1)

∂L / ∂b = - Σ_i=1^N α_i y_i = 0 ⇒ Σ_i=1^N α_i y_i = 0 (Equation 2)

3. Dual Optimization Problem

Now we substitute these conditions (Eq. 1 and Eq. 2) back into the Lagrangian L to obtain the dual objective function, which we want to maximize with respect to α.

L_D(α) = (1/2) (Σ_i α_i y_i x_i)^T(Σ_j α_j y_j x_j) - Σ_i α_i y_i ( (Σ_j α_j y_j x_j)^Tx_i + b ) + Σ_i α_i

L_D(α) = (1/2) Σ_iΣ_j α_iα_jy_iy_j(x_i^Tx_j) - Σ_iΣ_j α_iα_jy_iy_j(x_i^Tx_j) - b(Σ_i α_iy_i) + Σ_i α_i

Using Eq. 2 (Σ_i α_iy_i = 0), the term with b vanishes:

L_D(α) = Σ_i α_i - (1/2) Σ_iΣ_j α_iα_jy_iy_j(x_i^Tx_j)

The dual problem is to maximize L_D(α) subject to the constraints derived:

maximize_α   Σ_i=1^N α_i - (1/2) Σ_i=1^NΣ_j=1^N α_iα_jy_iy_j(x_i^Tx_j)
    subject to:  α_i ≥ 0,  for all i = 1, ..., N
                 Σ_i=1^N α_iy_i = 0

This is a quadratic programming problem. Once we find the optimal α^*, we can find w^* using Eq. 1. The bias b^* can be found using the KKT conditions, typically from a support vector.

Interviewer: That's the correct dual formulation. Now, please explain the Karush-Kuhn-Tucker (KKT) conditions for this problem and how they lead to the concept of support vectors.

Candidate:

4. KKT Conditions and Support Vectors

For an optimization problem with inequality constraints (like SVM), the Karush-Kuhn-Tucker (KKT) conditions are necessary for a solution to be optimal (and sufficient under certain convexity conditions, which SVM satisfies).

The KKT conditions for the SVM primal problem are:

Stationarity (from derivatives of L w.r.t. primal variables):
- ∂L/∂w = 0 ⇒ w = Σ_i α_iy_ix_i
- ∂L/∂b = 0 ⇒ Σ_i α_iy_i = 0
Primal Feasibility:
- y_i(w^Tx_i + b) - 1 ≥ 0 for all i
Dual Feasibility:
- α_i ≥ 0 for all i
Complementary Slackness:
- α_i [y_i(w^Tx_i + b) - 1] = 0 for all i

The complementary slackness condition is key to understanding support vectors.

From α_i [y_i(w^Tx_i + b) - 1] = 0, we have two cases for each data point x_i:

Case 1: α_i > 0
For this condition to hold, we must have y_i(w^Tx_i + b) - 1 = 0.
This means y_i(w^Tx_i + b) = 1.
These are the data points that lie exactly on the margin hyperplanes (either w^Tx + b = 1 or w^Tx + b = -1).
These points are called Support Vectors. They are "supporting" the margin.
Case 2: α_i = 0
For this condition to hold, y_i(w^Tx_i + b) - 1 > 0 (since α_i = 0 and we know from primal feasibility it's ≥ 0).
This means y_i(w^Tx_i + b) > 1.
These are the data points that are correctly classified and lie strictly outside the margin. They do not contribute to defining the hyperplane (since their α_i = 0, they don't appear in the sum for w = Σ α_iy_ix_i).

Therefore, only the support vectors (points with α_i > 0) contribute to the definition of the weight vector w. This is a very important property, as it means the SVM solution is sparse in terms of the data points – it only depends on a subset of the training data (the support vectors).

The bias term 'b' can be calculated using any support vector x_s (for which α_s > 0):

y_s(w^Tx_s + b) = 1
    y_s²(w^Tx_s + b) = y_s   (since y_s² = 1)
    w^Tx_s + b = y_s
    b = y_s - w^Tx_s

In practice, b is often averaged over all support vectors for numerical stability.

Interviewer: That's a very clear explanation of the KKT conditions and support vectors. Now, for the follow-up: How do different kernel functions, like RBF and polynomial kernels, affect the decision boundary, and how would you typically choose the optimal kernel and its parameters?

Candidate:

Kernel Functions and Decision Boundaries

The dual formulation is powerful because the data points x_i only appear in the form of dot products (x_i^Tx_j). The kernel trick allows us to replace these dot products with a kernel function K(x_i, x_j) = Φ(x_i)^TΦ(x_j), where Φ is a mapping to a higher-dimensional feature space. This allows SVMs to find non-linear decision boundaries in the original input space.

1. Linear Kernel:

K(x_i, x_j) = x_i^Tx_j
This is the standard SVM, resulting in a linear decision boundary in the original feature space.
No extra parameters to tune beyond the (soft-margin) C parameter.

2. Polynomial Kernel:

K(x_i, x_j) = (γx_i^Tx_j + r)^d
Parameters:
- γ (gamma): controls the influence of individual training samples.
- r (coef0): a constant term.
- d (degree): the degree of the polynomial.
Effect on Decision Boundary: Can create curved, polynomial decision boundaries. The complexity of the boundary depends on the degree 'd'. Higher degrees can lead to more complex boundaries and potential overfitting if not careful.

3. Radial Basis Function (RBF) Kernel (or Gaussian Kernel):

K(x_i, x_j) = exp(-γ ||x_i - x_j||²)
Parameters:
- γ (gamma): defines how far the influence of a single training example reaches.
  - Low γ: Far reach, smoother boundary (more like linear).
  - High γ: Close reach, more complex, potentially wiggly boundary, can overfit by essentially memorizing training points.
Effect on Decision Boundary: Can create highly flexible, non-linear decision boundaries. It maps samples to an infinitely dimensional space. The decision boundary can be localized, meaning the influence of each support vector is localized.

Choosing Optimal Kernel and Parameters

The choice of kernel and its parameters is crucial and data-dependent. There's no universally best kernel.

Domain Knowledge/Data Characteristics:
- If you believe the data is linearly separable or nearly so, a linear kernel is a good starting point (fast, less prone to overfitting).
- If the data has a known polynomial structure, a polynomial kernel might be appropriate.
- RBF is a very popular general-purpose kernel because of its flexibility and ability to handle complex relationships. It's often a good default to try if you have no prior knowledge.
Cross-Validation and Grid Search:
- This is the most common and robust approach.
- Define a grid of possible values for the kernel parameters (e.g., C for all SVMs, γ for RBF, degree/γ/r for polynomial).
- For each combination of parameters, train the SVM on a training fold and evaluate its performance (e.g., accuracy, F1-score) on a validation fold using k-fold cross-validation.
- Select the kernel and parameter combination that yields the best average performance on the validation sets.
Computational Cost:
- Linear kernels are generally the fastest to train.
- RBF and polynomial kernels can be more computationally intensive, especially with large datasets or high-dimensional feature spaces (though the kernel trick avoids explicit mapping).
Interpretability: Linear SVMs are more interpretable as the decision boundary is a simple hyperplane. Non-linear kernels make interpretation harder.
Start Simple: It's often advisable to start with a linear kernel. If performance is poor, then explore more complex kernels like RBF.

For RBF, common practice is to search for C and γ on a logarithmic scale (e.g., C in [0.1, 1, 10, 100], γ in [0.001, 0.01, 0.1, 1]). Visualizing the decision boundary on 2D data can also provide intuition during parameter tuning.

Interviewer: That's a very thorough and clear explanation, covering the derivation, KKT conditions, support vectors, and practical aspects of kernel selection. Excellent!

Candidate: Thank you!

Why Understanding SVM Dual & Kernels Matters

Core of SVM: The dual formulation is fundamental to how SVMs work, especially for enabling the kernel trick.
Kernel Trick Enablement: The dual depends only on dot products of data points, allowing kernels to be used for non-linear classification.
Concept of Support Vectors: The KKT conditions naturally lead to the idea that only a subset of data (support vectors) defines the decision boundary, making SVMs efficient and robust.
Constrained Optimization: Demonstrates the application of powerful optimization theory (Lagrangians, KKT conditions) in machine learning.
Model Flexibility: Understanding kernels (linear, polynomial, RBF) allows for choosing appropriate model complexity and shaping the decision boundary.
Parameter Tuning Rationale: Knowing how parameters like C and kernel-specific parameters (γ, degree) affect the model is crucial for effective tuning.