Log-Normal Distributions

A practical guide to recognizing, fitting, and using log-normal distributions in your data science workflow.

The Mathematics Behind Real-World Phenomena

Many real-world quantities are strictly positive and right-skewed rather than symmetric like the bell curve. The log-normal distribution is a common model for this kind of asymmetry, particularly when a quantity arises as the product of many independent positive factors.

Definition & Key Properties

A random variable X follows a log-normal distribution if Y = ln(X) follows a normal distribution.

If Y ~ N(μ, σ²), then X = e^Y ~ LogNormal(μ, σ²)

Key parameters:

Log-scale mean (μ): The mean of ln(X). It sets the scale of X, and exp(μ) is the median of X
Log-scale standard deviation (σ): The standard deviation of ln(X). It controls the shape, skewness, and dispersion of X
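The relationship between the two parameters and the underlying normal distribution can be checked numerically. The sketch below is illustrative only: the parameter values, sample size, and seed are arbitrary assumptions. It also shows how μ and σ map onto SciPy's parametrization, where s = σ and scale = exp(μ).

```python
import numpy as np
from scipy import stats

# Illustrative parameters of ln(X); these values are assumptions for the demo
mu, sigma = 0.0, 0.5

rng = np.random.default_rng(42)
samples = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

# Taking logs should recover an approximately normal sample
# with mean close to mu and standard deviation close to sigma
log_samples = np.log(samples)
log_mean = log_samples.mean()
log_std = log_samples.std(ddof=1)

# SciPy's lognorm takes s = sigma and scale = exp(mu), with loc defaulting to 0
dist = stats.lognorm(s=sigma, scale=np.exp(mu))
theoretical_median = dist.median()  # equals exp(mu)

print(f"mean of ln(X): {log_mean:.4f} (mu = {mu})")
print(f"std of ln(X):  {log_std:.4f} (sigma = {sigma})")
print(f"median of X:   {theoretical_median:.4f}")
```

Note that NumPy and SciPy use different argument conventions for the same distribution, which is a common source of silent errors when switching between them.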

Why Data Scientists Should Care

Log-normal distributions appear naturally in many real-world datasets:

Financial data: Stock prices and asset valuations (log returns, rather than simple returns, are the quantity usually modeled as approximately normal)
Income distributions: Wages and wealth across populations
Biological measurements: Species abundance, cell growth, survival times
Internet phenomena: Website traffic, viral content spread
Environmental data: Pollution levels, particle sizes

Common Pitfall: Normal vs. Log-Normal

A frequent mistake is applying normal distribution assumptions to log-normally distributed data. This leads to:

Underestimating rare events and extreme values
Biased confidence intervals
Inaccurate hypothesis testing
Misleading visualizations
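The first item in the list above can be illustrated with a small experiment. The following is a sketch under stated assumptions: the data are simulated, and the threshold of 4.0 is an arbitrary choice for illustration. It compares the tail probability implied by a normal fit on the raw data with the one implied by a normal fit on the logged data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=0.5, size=50_000)  # simulated data

threshold = 4.0  # hypothetical "extreme value" cutoff

# Wrong model: treat the raw data as normal
normal_tail = stats.norm.sf(threshold, loc=data.mean(), scale=data.std(ddof=1))

# Log-normal model: fit a normal to ln(X) and evaluate at ln(threshold)
log_data = np.log(data)
lognormal_tail = stats.norm.sf(
    np.log(threshold), loc=log_data.mean(), scale=log_data.std(ddof=1)
)

# Fraction of the sample that actually exceeds the threshold
empirical_tail = np.mean(data > threshold)

print(f"P(X > {threshold}) under normal fit:     {normal_tail:.2e}")
print(f"P(X > {threshold}) under log-normal fit: {lognormal_tail:.2e}")
print(f"Observed fraction above threshold:      {empirical_tail:.2e}")
```

On data like this, the normal fit assigns far less probability to the upper tail than is actually observed, while the log-normal fit stays close to the empirical fraction.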

Implementation in Python

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate log-normal distributed data
# mu and sigma are the mean and standard deviation of ln(X), not of X itself
mu = 0
sigma = 0.5
sample_size = 1000
data = np.random.lognormal(mu, sigma, sample_size)

# Plot histogram with log-normal PDF
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')

# Evaluate the theoretical PDF over the observed range
# SciPy's lognorm uses s = sigma and scale = exp(mu), with loc left at 0
x = np.linspace(data.min(), data.max(), 1000)
pdf = stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
plt.plot(x, pdf, 'r-', linewidth=2, label='Log-Normal PDF')

plt.title('Log-Normal Distribution Example')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Theoretical statistics implied by mu and sigma (not estimated from the sample)
mean = np.exp(mu + sigma**2/2)
median = np.exp(mu)
mode = np.exp(mu - sigma**2)

print(f"Mean: {mean:.4f}")
print(f"Median: {median:.4f}")
print(f"Mode: {mode:.4f}")

Testing for Log-Normality

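# Q-Q plot for log-normality check (reuses `data` and the imports from the example above)
# If data is log-normal, ln(data) is normal and the points fall close to a straight line
log_data = np.log(data)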

plt.figure(figsize=(10, 6))
stats.probplot(log_data, dist="norm", plot=plt)
plt.title("Q-Q Plot for Log-Normality Check")
plt.show()

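# Shapiro-Wilk test on ln(X): the null hypothesis is that log_data is normal,
# so a small p-value (e.g. below 0.05) is evidence against log-normality
stat, p_value = stats.shapiro(log_data)
print(f"Shapiro-Wilk test: p-value = {p_value:.4f}")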
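Once the log-normality check looks reasonable, the parameters can be estimated from the logged data. The sketch below is one possible approach rather than a prescribed method: it uses simulated data with assumed true values, computes the maximum-likelihood estimates directly from ln(X), and compares them with `scipy.stats.lognorm.fit`, where `floc=0` fixes the location at zero so that the two-parameter log-normal is fitted.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.lognormal(mean=1.0, sigma=0.75, size=20_000)  # simulated data

# Direct estimates: sample mean and standard deviation of ln(X)
# (ddof=0 gives the maximum-likelihood estimate of sigma)
log_data = np.log(data)
mu_hat = log_data.mean()
sigma_hat = log_data.std(ddof=0)

# SciPy fit with the location fixed at 0; returns (shape, loc, scale)
shape, loc, scale = stats.lognorm.fit(data, floc=0)

print(f"Direct: mu = {mu_hat:.4f}, sigma = {sigma_hat:.4f}")
print(f"SciPy:  mu = {np.log(scale):.4f}, sigma = {shape:.4f}")
```

Without `floc=0`, SciPy also estimates a shift parameter, which fits a three-parameter distribution and can give results that are hard to compare with μ and σ.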

Decision Making with Log-Normal Distributions

Risk Assessment

For positive, right-skewed quantities such as losses or claim sizes, log-normal models capture tail risk better than normal models in financial modeling and insurance pricing.
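A tail quantile is one simple way to put a number on this kind of risk. The sketch below assumes hypothetical parameter values and a 99% level chosen only for illustration. It computes the quantile with SciPy and with the closed form exp(μ + σ·z), where z is the matching standard normal quantile.

```python
import numpy as np
from scipy import stats

# Hypothetical fitted parameters of ln(X)
mu, sigma = 0.0, 0.5
level = 0.99  # illustrative confidence level

dist = stats.lognorm(s=sigma, scale=np.exp(mu))

# Quantile from SciPy's inverse CDF
q99 = dist.ppf(level)

# Same quantile from the closed form: exp(mu + sigma * z)
q99_closed_form = np.exp(mu + sigma * stats.norm.ppf(level))

print(f"{level:.0%} quantile (SciPy):       {q99:.4f}")
print(f"{level:.0%} quantile (closed form): {q99_closed_form:.4f}")
print(f"Mean for comparison:           {dist.mean():.4f}")
```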

Resource Allocation

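When resource usage is right-skewed, a small number of heavy consumers accounts for a large share of demand. Modeling that skew helps size and allocate infrastructure from upper percentiles rather than from the mean alone.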

Anomaly Detection

Set outlier thresholds on ln(X) rather than on X, so that the limits reflect the natural skew of the data instead of assuming symmetry.
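A minimal sketch of this idea follows, assuming simulated data and a hypothetical multiplier of k = 3. It compares a mean ± k standard deviations rule applied to the raw values with the same rule applied to ln(X) and mapped back with exp.

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.lognormal(mean=2.0, sigma=0.6, size=10_000)  # simulated data

k = 3.0  # hypothetical threshold multiplier

# Rule applied in log space, then mapped back to the original scale
log_values = np.log(values)
mu_hat = log_values.mean()
sigma_hat = log_values.std(ddof=1)
lower_log = np.exp(mu_hat - k * sigma_hat)
upper_log = np.exp(mu_hat + k * sigma_hat)

# Naive rule applied directly to the skewed values
lower_naive = values.mean() - k * values.std(ddof=1)
upper_naive = values.mean() + k * values.std(ddof=1)

flagged_log = np.mean((values < lower_log) | (values > upper_log))
flagged_naive = np.mean((values < lower_naive) | (values > upper_naive))

print(f"Log-space limits: [{lower_log:.2f}, {upper_log:.2f}], flagged {flagged_log:.2%}")
print(f"Naive limits:     [{lower_naive:.2f}, {upper_naive:.2f}], flagged {flagged_naive:.2%}")
```

On data like this, the naive lower limit is negative, so it can never flag an unusually small positive value, and the naive upper limit flags a larger share of ordinary points than the log-space rule.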

Important Statistics

In a log-normal distribution with σ > 0, the ordering is always Mean > Median > Mode, because exp(μ + σ²/2) > exp(μ) > exp(μ - σ²).

This characteristic right-skew means:

The geometric mean = median = exp(μ)
The arithmetic mean is heavily influenced by extreme values
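These relationships can be checked on a sample. The sketch below uses simulated data with assumed parameters (σ = 1 is chosen to make the skew pronounced) and compares the arithmetic mean, the geometric mean, and the median.

```python
import numpy as np
from scipy import stats

# Illustrative parameters of ln(X)
mu, sigma = 0.0, 1.0

rng = np.random.default_rng(3)
data = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

arithmetic_mean = data.mean()
geometric_mean = stats.gmean(data)  # exp of the mean of ln(X)
sample_median = np.median(data)

print(f"Arithmetic mean: {arithmetic_mean:.4f} (theory: {np.exp(mu + sigma**2 / 2):.4f})")
print(f"Geometric mean:  {geometric_mean:.4f} (theory: {np.exp(mu):.4f})")
print(f"Median:          {sample_median:.4f} (theory: {np.exp(mu):.4f})")
```

The geometric mean and the median land close to exp(μ), while the arithmetic mean is pulled upward by the long right tail.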

Log-Normal Distributions: Key Takeaways

Definition: X is log-normal if ln(X) is normal
Natural for positive, skewed data: Income, stock prices, biological measurements
Mean > Median > Mode: Characteristic of right-skewed distributions
Geometric mean often best: More appropriate than arithmetic mean for central tendency
Test before assuming: Use a Q-Q plot and the Shapiro-Wilk test on ln(X) to check for log-normality
Beware of normal assumptions: Misapplying normal theory leads to underestimating extremes