Ask Claude about this
 

Deriving Logistic Regression's Update Rule

Core Concepts to Understand

  • Problem Setup: Defining inputs, outputs, parameters for binary classification.
  • Sigmoid Function: Its mathematical form, properties, and crucially, its derivative.
  • Probabilistic Model: How logistic regression models P(y=1|x) and P(y=0|x).
  • Maximum Likelihood Estimation (MLE): The principle of finding parameters that maximize the likelihood of observed data.
  • Log-Likelihood: Why we use it (numerical stability, easier differentiation).
  • Cost Function: The negative log-likelihood (or binary cross-entropy).
  • Gradient Derivation: Applying the chain rule step-by-step.
  • Gradient Descent Update Rule: The final form for parameter updates.
  • Regularization (L1 & L2): How it modifies the cost function and gradient, and its implications.

Derivation Walkthrough

Interviewer: Welcome! Today, we're going to dive deep into the fundamentals of logistic regression. Can you start by deriving the gradient descent update rule for logistic regression from first principles? Let's begin with the problem setup and the sigmoid function.
Candidate: Absolutely.

Problem Setup

For binary classification with logistic regression:

  • Input features: x ∈ ℝd
  • Binary labels: y ∈ {0, 1}
  • Parameters: θ = [θ₀, θ₁, ..., θd]T
  • Linear combination: z = θTx = θ₀ + θ₁x₁ + ... + θdxd

Step 1: Sigmoid Function and Its Properties

The sigmoid function maps any real number to (0,1):

σ(z) = 1 / (1 + e-z)

Key Properties:

  • σ(z) ∈ (0, 1) for all z ∈ ℝ
  • σ(0) = 0.5
  • σ(-z) = 1 - σ(z)

Derivative of Sigmoid (Critical for elegant update rule):

dσ(z)/dz = σ(z) × (1 - σ(z))

Proof:

dσ(z)/dz = d/dz [1 / (1 + e-z)]
                     = -1 / (1 + e-z)2 × d/dz(1 + e-z)
                     = -1 / (1 + e-z)2 × (-e-z)
                     = e-z / (1 + e-z)2
                     = [1 / (1 + e-z)] × [e-z / (1 + e-z)]
                     = σ(z) × [e-z / (1 + e-z)]

Since e-z / (1 + e-z) = (1 + e-z - 1) / (1 + e-z) = 1 - 1/(1 + e-z) = 1 - σ(z)

Therefore: dσ(z)/dz = σ(z)(1 - σ(z))

Interviewer: Excellent. Now, how do we use this to model probabilities, and what's the likelihood function for our training data? Explain why we typically move to the log-likelihood.
Candidate:

Step 2: Probabilistic Model

Logistic regression models the probability:

P(y = 1 | x, θ) = σ(θTx) = 1 / (1 + eTx)
P(y = 0 | x, θ) = 1 - σ(θTx) = σ(-θTx)

This can be written compactly as:

P(y | x, θ) = σ(θTx)y × (1 - σ(θTx))(1-y)

Step 3: Why Log-Likelihood?

Maximum Likelihood Estimation (MLE) Principle: We want to find θ that maximizes the probability of observing our training data.

For n training examples {(x₁, y₁), (x₂, y₂), ..., (xn, yn)}:

Likelihood:

L(θ) = ∏i=1n P(yi | xi, θ) = ∏i=1n σ(θTxi)yi × (1 - σ(θTxi))(1-yi)

Why take the logarithm?

  1. Numerical stability: Products of small probabilities → underflow
  2. Computational efficiency: Products become sums
  3. Optimization: Monotonic transformation preserves maxima
  4. Differentiability: Easier to differentiate sums than products

Log-Likelihood:

ℓ(θ) = log L(θ) = Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))]
Interviewer: Very clear. How does this lead us to the cost function we aim to optimize?
Candidate:

Step 4: Cost Function

We minimize the negative log-likelihood (cross-entropy loss):

J(θ) = -ℓ(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))]

For a single example:

J(θ) = -[y log σ(θTx) + (1-y) log(1 - σ(θTx))]
Interviewer: Perfect. Now for the core part: please derive the gradient of this cost function J(θ) with respect to a single parameter θj.
Candidate:

Step 5: Gradient Derivation

To find the gradient ∇θJ(θ), we compute ∂J/∂θj for each parameter θj.

For a single training example:

∂J/∂θj = ∂/∂θj [-y log σ(θTx) - (1-y) log(1 - σ(θTx))]

Chain rule application:

∂J/∂θj = -y × (1/σ(θTx)) × ∂σ(θTx)/∂θj - (1-y) × (1/(1-σ(θTx))) × ∂(1-σ(θTx))/∂θj

Since ∂(1-σ(θTx))/∂θj = -∂σ(θTx)/∂θj:

∂J/∂θj = -y × (1/σ(θTx)) × ∂σ(θTx)/∂θj + (1-y) × (1/(1-σ(θTx))) × ∂σ(θTx)/∂θj

Factor out ∂σ(θTx)/∂θj:

∂J/∂θj = [-y/σ(θTx) + (1-y)/(1-σ(θTx))] × ∂σ(θTx)/∂θj

Now use the chain rule for ∂σ(θTx)/∂θj:

∂σ(θTx)/∂θj = σ(θTx)(1 - σ(θTx)) × ∂(θTx)/∂θj = σ(θTx)(1 - σ(θTx)) × xj

Substituting back:

∂J/∂θj = [-y/σ(θTx) + (1-y)/(1-σ(θTx))] × σ(θTx)(1 - σ(θTx)) × xj

Simplifying:

∂J/∂θj = [-y(1 - σ(θTx)) + (1-y)σ(θTx)] × xj
                    = [-y + yσ(θTx) + σ(θTx) - yσ(θTx)] × xj
                    = [σ(θTx) - y] × xj
Interviewer: That's the key simplification! So, what is the full gradient descent update rule, and why is this result often described as "elegant"?
Candidate:

Step 6: The Elegant Update Rule

For n training examples:

∂J/∂θj = Σi=1n (σ(θTxi) - yi) × xij

In vector form:

θJ(θ) = Σi=1n (σ(θTxi) - yi) × xi = XT(σ(Xθ) - y)

Gradient Descent Update:

θ := θ - α × ∇θJ(θ)
θ := θ - α × XT(σ(Xθ) - y)

Where α is the learning rate.

Why This Update Rule is "Elegant"

  1. Simple Form: The gradient has the intuitive form (prediction - actual) × feature
  2. No Complex Terms: Despite the non-linear sigmoid, the gradient is linear in the error
  3. Automatic Weighting: Larger errors get more weight in the update
  4. Sigmoid Derivative Cancellation: The σ(z)(1-σ(z)) terms cancel out beautifully
  5. Resembles Linear Regression: Same form as linear regression gradient!
Interviewer: Excellent derivation and explanation. Now, let's consider regularization. How would you modify this for L2 regularized logistic regression, and what are the key computational implications?
Candidate:

Regularized Logistic Regression

L2 Regularization (Ridge)

Modified Cost Function:

J(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))] + λ/2 Σj=1d θj2

Note: We typically don't regularize the bias term θ0.

Modified Gradient:

∂J/∂θj = Σi=1n (σ(θTxi) - yi) × xij + λθj  (for j ≠ 0)
∂J/∂θ0 = Σi=1n (σ(θTxi) - yi) × xi0  (bias term, no regularization)

Update Rule:

θj := θj - α[Σi=1n (σ(θTxi) - yi) × xij + λθj]
θj := θj(1 - αλ) - α Σi=1n (σ(θTxi) - yi) × xij

Computational Implications of L2:

  • Advantages:
    • Smooth, differentiable everywhere
    • Closed-form solutions possible (for linear models)
    • Shrinks parameters proportionally
  • Computational Cost: O(d) additional operations per iteration
  • Memory: No additional memory overhead
Interviewer: Good. And what about L1 regularization (Lasso)? How does that differ, and what are its implications?
Candidate:

L1 Regularization (Lasso)

Modified Cost Function:

J(θ) = -Σi=1n [yi log σ(θTxi) + (1-yi) log(1 - σ(θTxi))] + λ Σj=1dj|

Challenge: |θj| is not differentiable at θj = 0.

Subgradient:

∂|θj|/∂θj = {
    +1  if θj > 0
    -1  if θj < 0
    [-1, +1]  if θj = 0
}

Update Rule (Soft Thresholding):

θj := sign(θj - α∇j) × max(0, |θj - α∇j| - αλ)

Where ∇j is the gradient of the unregularized loss.

Computational Implications of L1:

  • Advantages:
    • Promotes sparsity (automatic feature selection)
    • Robust to outliers
  • Disadvantages:
    • Non-differentiable at zero
    • Requires specialized algorithms (proximal gradient, coordinate descent)
    • Slower convergence than L2
  • Computational Cost: O(d) additional operations but requires more sophisticated updates
  • Memory: May require storing active set of non-zero parameters
Interviewer: That's a very comprehensive answer. You've covered the derivation from first principles beautifully and addressed the nuances of regularization well. Thank you.
Candidate: Thank you! It was a good exercise to walk through it.

Why This Derivation Matters

  • Fundamental Understanding: Knowing this derivation solidifies your understanding of how logistic regression learns.
  • Basis for Other Models: The concepts of likelihood, log-likelihood, cross-entropy, and gradient descent are foundational to many other machine learning models, especially in deep learning.
  • Debugging & Modification: Understanding the "why" behind the update rule helps in debugging custom implementations or modifying the loss function for specific needs.
  • Regularization Intuition: Seeing how regularization terms are added to the cost and gradient provides a clear picture of their impact.
  • Interview Preparedness: This is a classic "ML fundamentals" interview question.

 

Nerchuko Academy · Free DS Interview Prep