Data Science · Statistics · 2025-05-25

Log-Normal Distributions

Master log-normal distributions for modeling naturally skewed data. Learn when to apply them to income, stock prices, and biological measurements.

A comprehensive guide to understanding and leveraging log-normal distributions in your data science workflow.

The Mathematics Behind Real-World Phenomena

Many real-world phenomena produce asymmetric data distributions rather than the symmetrical bell curve. The log-normal distribution is one of the most powerful tools for capturing this asymmetry.

Definition & Key Properties

A random variable X follows a log-normal distribution if Y = ln(X) follows a normal distribution.

If Y ~ N(μ, σ²), then X = e^Y ~ LogNormal(μ, σ²)
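A quick way to see the definition in action is to exponentiate normal draws and then take logs again. This is a minimal sketch: the seed and the values of `mu` and `sigma` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility
mu, sigma = 1.0, 0.4             # illustrative values, not from the article

# Draw Y ~ N(mu, sigma^2) and exponentiate to get X = e^Y
y = rng.normal(mu, sigma, size=100_000)
x = np.exp(y)

# Taking logs of X recovers the parameters of the underlying normal
log_mean = np.log(x).mean()
log_std = np.log(x).std(ddof=1)

print(f"mean of ln(X): {log_mean:.3f}  (mu = {mu})")
print(f"std of ln(X):  {log_std:.3f}  (sigma = {sigma})")
```

Every value in `x` is strictly positive, which is why the log-normal only makes sense for positive data.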

Key parameters:

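  • Log-scale location (μ): The mean of ln(X). It sets the scale of X, since exp(μ) is the median
  • Log-scale spread (σ): The standard deviation of ln(X). It controls the shape: a larger σ gives a heavier right tail and more dispersion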
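In practice the two parameters are usually estimated from data. The sketch below assumes SciPy's `lognorm`, which uses a shape `s` equal to σ and a `scale` equal to exp(μ). The sample is synthetic and the "true" values are made up for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, sigma_true = 2.0, 0.75  # illustrative values
sample = rng.lognormal(mean=mu_true, sigma=sigma_true, size=50_000)

# Fix loc at 0 so the fit stays in the two-parameter log-normal family
shape, loc, scale = stats.lognorm.fit(sample, floc=0)

# Map SciPy's parameterization back to (mu, sigma)
mu_hat = np.log(scale)
sigma_hat = shape

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

Leaving `loc` free fits a shifted, three-parameter variant, which is a different model from the one defined above.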

Why Data Scientists Should Care

Log-normal distributions appear naturally in many real-world datasets:

  • Financial data: Stock prices and asset valuations (log returns are then modeled as roughly normal)
  • Income distributions: Wages and wealth across populations
  • Biological measurements: Species abundance, cell growth, survival times
  • Internet phenomena: Website traffic, viral content spread
  • Environmental data: Pollution levels, particle sizes

Common Pitfall: Normal vs. Log-Normal

A frequent mistake is applying normal distribution assumptions to log-normally distributed data. This leads to:

  • Underestimating rare events and extreme values
  • Biased confidence intervals
  • Inaccurate hypothesis testing
  • Misleading visualizations
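To make the first point concrete, the sketch below compares the probability of exceeding a threshold under the true log-normal and under a normal distribution with the same mean and variance. The parameters and the threshold are illustrative assumptions.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0   # illustrative log-normal parameters
threshold = 10.0       # illustrative "extreme" value

# Exact mean and variance of the log-normal
ln_mean = np.exp(mu + sigma**2 / 2)
ln_var = (np.exp(sigma**2) - 1) * np.exp(2 * mu + sigma**2)

# Tail probability under the log-normal itself
p_lognormal = stats.lognorm.sf(threshold, s=sigma, scale=np.exp(mu))

# Tail probability under a normal with matching mean and variance
p_normal = stats.norm.sf(threshold, loc=ln_mean, scale=np.sqrt(ln_var))

print(f"P(X > {threshold}) log-normal: {p_lognormal:.2e}")
print(f"P(X > {threshold}) normal:     {p_normal:.2e}")
print(f"ratio: {p_lognormal / p_normal:.0f}x")
```

With these parameters the normal approximation understates the tail probability by roughly two orders of magnitude, even though both distributions share the same mean and variance.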

Implementation in Python

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate log-normal distributed data
mu = 0
sigma = 0.5
sample_size = 1000
data = np.random.lognormal(mu, sigma, sample_size)

# Plot histogram with log-normal PDF
plt.figure(figsize=(10, 6))
plt.hist(data, bins=50, density=True, alpha=0.7, label='Data')

# Calculate PDF for comparison.
# SciPy's lognorm takes shape s = sigma and scale = exp(mu).
x = np.linspace(data.min(), data.max(), 1000)
pdf = stats.lognorm.pdf(x, s=sigma, scale=np.exp(mu))
plt.plot(x, pdf, 'r-', linewidth=2, label='Log-Normal PDF')

plt.title('Log-Normal Distribution Example')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend()
plt.show()

# Theoretical statistics from the closed-form formulas (not sample estimates)
mean = np.exp(mu + sigma**2/2)
median = np.exp(mu)
mode = np.exp(mu - sigma**2)

print(f"Mean: {mean:.4f}")
print(f"Median: {median:.4f}")
print(f"Mode: {mode:.4f}")

Testing for Log-Normality

# Q-Q plot for log-normality check
log_data = np.log(data)

plt.figure(figsize=(10, 6))
stats.probplot(log_data, dist="norm", plot=plt)
plt.title("Q-Q Plot for Log-Normality Check")
plt.show()

# Statistical test: a small p-value is evidence against normality of ln(data)
stat, p_value = stats.shapiro(log_data)
print(f"Shapiro-Wilk test: p-value = {p_value:.4f}")
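The same check can be wrapped in a small helper. This is a sketch, not a standard API: `check_lognormality` and its `alpha` default are hypothetical names chosen for the example, and the two samples are synthetic.

```python
import numpy as np
from scipy import stats

def check_lognormality(values, alpha=0.05):
    """Run Shapiro-Wilk on ln(values). Hypothetical helper for illustration."""
    values = np.asarray(values, dtype=float)
    if np.any(values <= 0):
        raise ValueError("log-normal data must be strictly positive")
    stat, p_value = stats.shapiro(np.log(values))
    return {
        "statistic": float(stat),
        "p_value": float(p_value),
        "reject_normality": bool(p_value < alpha),
    }

rng = np.random.default_rng(1)  # arbitrary seed
lognormal_sample = rng.lognormal(mean=0.0, sigma=0.5, size=500)
uniform_sample = rng.uniform(1.0, 10.0, size=500)  # clearly not log-normal

print(check_lognormality(lognormal_sample))
print(check_lognormality(uniform_sample))
```

Failing to reject is not proof of log-normality, so read the result together with the Q-Q plot. SciPy's documentation also notes that the Shapiro-Wilk p-value may be inaccurate for samples larger than 5000.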

Decision Making with Log-Normal Distributions

Risk Assessment

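Log-normal models assign more probability to very large outcomes than a normal model with the same mean and variance, so they capture tail risk better in financial modeling and insurance pricing.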

Resource Allocation

Understand skewed distributions of resource usage to optimize allocation across infrastructure.

Anomaly Detection

Establish more accurate thresholds for outlier detection in naturally skewed data.
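One way to set such thresholds is to apply the usual "mean plus or minus k standard deviations" rule to ln(X) and map the bounds back with exp. The sketch below uses synthetic data, and the choice k = 3 is an assumed convention, not something the data dictates.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed
values = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)  # synthetic skewed metric
k = 3.0  # assumed number of standard deviations

# Naive rule on the raw scale
naive_lower = values.mean() - k * values.std(ddof=1)
naive_upper = values.mean() + k * values.std(ddof=1)

# Same rule in log space, mapped back to the original scale
log_values = np.log(values)
log_mean, log_std = log_values.mean(), log_values.std(ddof=1)
lower = np.exp(log_mean - k * log_std)
upper = np.exp(log_mean + k * log_std)

naive_flagged = np.mean(values > naive_upper)
log_flagged = np.mean((values < lower) | (values > upper))

print(f"naive bounds:     [{naive_lower:.1f}, {naive_upper:.1f}], flagged {naive_flagged:.2%}")
print(f"log-space bounds: [{lower:.1f}, {upper:.1f}], flagged {log_flagged:.2%}")
```

On data like this the naive lower bound is negative, which is meaningless for a positive quantity, and the naive upper bound flags a larger share of ordinary right-tail values than the log-space rule does.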

Important Statistics

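For any log-normal distribution with σ > 0, the ordering is: Mean > Median > Mode

This follows from the closed forms exp(μ + σ²/2) > exp(μ) > exp(μ − σ²), and it has two practical consequences:

  • The geometric mean equals the median: both are exp(μ)
  • The arithmetic mean is pulled above the median by extreme values, more so as σ grows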
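These relationships are easy to check on a sample. The sketch below assumes `scipy.stats.gmean` for the geometric mean, and the parameters and seed are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
mu, sigma = 0.0, 1.0            # illustrative values
sample = rng.lognormal(mean=mu, sigma=sigma, size=200_000)

arithmetic_mean = sample.mean()
geometric_mean = stats.gmean(sample)  # equals exp(mean of ln(sample))
sample_median = np.median(sample)

print(f"arithmetic mean: {arithmetic_mean:.3f}  (theory: {np.exp(mu + sigma**2 / 2):.3f})")
print(f"geometric mean:  {geometric_mean:.3f}  (theory: {np.exp(mu):.3f})")
print(f"median:          {sample_median:.3f}  (theory: {np.exp(mu):.3f})")
```

The geometric mean and the median land close to exp(μ), while the arithmetic mean sits well above both.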

Log-Normal Distributions: Key Takeaways

  • Log-normal relationships: X is log-normal if ln(X) is normal
  • Natural for positive, skewed data: Income, stock prices, biological measurements
  • Mean > Median > Mode: Characteristic of right-skewed distributions
  • Geometric mean often best: It equals the median exp(μ), so it describes a typical value better than the arithmetic mean
  • Test before assuming: Use Q-Q plots and Shapiro-Wilk test to verify
  • Beware of normal assumptions: Misapplying normal theory leads to underestimating extremes
Nerchuko Academy · Free DS Interview Prep