
Regularization

Regularization is a powerful technique for reducing overfitting by preventing parameters from becoming too large, leading to simpler and more generalizable models.

Collect More Data

Benefits: Often the most effective approach

  • Larger training sets make algorithms less likely to overfit
  • Can continue using complex models with sufficient data
  • Limitation: Not always feasible to obtain more data

Reduce Features

Approach: Use fewer features

  • Select the most important features (e.g., keep size, bedrooms, and age; drop less relevant ones such as distance to the nearest coffee shop)
  • Eliminate less relevant polynomial terms
  • Trade-off: May discard useful information

Regularization

Best of Both Worlds: Keep all features but prevent large parameters

  • Shrink parameter values without eliminating features
  • Maintain all available information while reducing overfitting
  • Most commonly used approach in practice

Instead of eliminating features entirely (setting parameters to 0), regularization encourages smaller parameter values across all features.

Consider a high-order polynomial that overfits:

f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b

If we could make w₃ and w₄ very small (close to 0), we get a function closer to:

f(x) ≈ w₁x + w₂x² + b

This reduces complexity while keeping all features.
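
As a quick illustration, here is a minimal NumPy sketch (the weights and inputs below are made up purely for illustration) showing that when w₃ and w₄ are near zero, the quartic behaves almost like the quadratic:

```python
import numpy as np

# Hypothetical weights for f(x) = w1*x + w2*x^2 + w3*x^3 + w4*x^4 + b
x = np.linspace(0.0, 2.0, 5)                  # a few sample inputs
w_large = np.array([1.0, 1.0, 3.0, 5.0])      # large w3, w4: the x^3 and x^4 terms dominate
w_small = np.array([1.0, 1.0, 0.01, 0.002])   # w3, w4 near zero: close to a quadratic

def poly(x, w, b=0.0):
    features = np.stack([x, x**2, x**3, x**4])  # the four polynomial features
    return w @ features + b

print(poly(x, w_large))   # grows quickly, driven by the high-order terms
print(poly(x, w_small))   # approximately w1*x + w2*x^2 + b
```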

To encourage small parameter values, regularization adds a penalty term to the cost function:

J(w,b) = (1/2m) * Σ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) * Σ(wⱼ²)

Where:

  • λ (lambda): Regularization parameter
  • First term: Original cost (fit training data)
  • Second term: Regularization penalty (keep parameters small)
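
As a sketch of how this cost could be computed (the function name and arguments are illustrative, assuming a linear model f(x) = w·x + b):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared-error cost (1/2m)*sum((f(x)-y)^2) plus the (lambda/2m)*sum(w_j^2) penalty; b is not penalized."""
    m = X.shape[0]
    predictions = X @ w + b                            # f(x) for each training example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    l2_penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return squared_error + l2_penalty
```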

The value of λ controls the balance between the two terms:

  • λ = 0: No regularization - the model can still overfit
  • λ very large (e.g., 10¹⁰): All wⱼ ≈ 0, so f(x) ≈ b - underfitting
  • λ just right: Balanced trade-off between fitting the training data and keeping parameters small
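
Continuing the sketch above with made-up data and a deliberately large weight, evaluating the cost for different λ values shows how the penalty term comes to dominate:

```python
X = np.array([[1.0], [2.0], [3.0]])   # toy dataset, one feature
y = np.array([1.5, 2.9, 4.1])
w, b = np.array([5.0]), 0.0           # a deliberately large weight

for lam in (0.0, 1.0, 1e10):
    print(lam, regularized_cost(X, y, w, b, lam))
# lam = 0:    only the squared-error term counts (no regularization)
# lam = 1e10: the penalty term dominates, so minimizing J would drive w toward 0
```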

Conventions:

  • Parameter b: Usually not regularized (regularizing it has minimal impact in practice)
  • Scaling: Divide the penalty by 2m to make λ selection easier across different dataset sizes
  • Scope: Regularize w₁ through wₙ, but not b

The regularized update for wⱼ becomes:

wⱼ := wⱼ(1 - α*λ/m) - α*(1/m)*Σ[(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)*xⱼ⁽ⁱ⁾]

On each iteration:

  • Multiply wⱼ by (1 - α*λ/m), a factor slightly less than 1 (e.g., ≈ 0.9998 for α = 0.01, λ = 1, m = 50)
  • This gradually shrinks the parameters toward zero
  • Then apply the usual gradient descent update (see the sketch after this list)

The effect:

  • Prevents overfitting even with many features
  • Creates smoother, more generalizable functions
  • Maintains all feature information
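
A minimal sketch of this update loop, under the same assumptions as the cost function above (illustrative names, linear model, b left unregularized):

```python
import numpy as np

def gradient_descent_regularized(X, y, w, b, alpha, lambda_, num_iters):
    """Gradient descent for regularized linear regression: w is shrunk, b is not."""
    m = X.shape[0]
    for _ in range(num_iters):
        error = X @ w + b - y                          # f(x) - y for every example
        dj_dw = (X.T @ error) / m + (lambda_ / m) * w  # gradient of J with respect to w
        dj_db = np.sum(error) / m                      # b has no penalty term
        w = w - alpha * dj_dw                          # equals w*(1 - alpha*lambda_/m) minus the usual step
        b = b - alpha * dj_db
    return w, b
```

Sweeping λ over values like 0, 1, and 10¹⁰ on a small dataset and inspecting the learned w reproduces the trade-off described above: the weights stay large, shrink moderately, or collapse toward zero.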
For logistic regression, the same penalty term is added to the cross-entropy cost:

J(w,b) = -(1/m) * Σ[y⁽ⁱ⁾*log(f(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾)*log(1-f(x⁽ⁱ⁾))] + (λ/2m) * Σ(wⱼ²)

The same principle, and the same modification to gradient descent, applies to classification problems.
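
A corresponding sketch for the logistic case (again with illustrative names, assuming f(x) is the sigmoid of w·x + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(X, y, w, b, lambda_):
    """Cross-entropy loss plus the same (lambda/2m)*sum(w_j^2) penalty."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                             # predicted probabilities
    cross_entropy = -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m
    l2_penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty
```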

Regularization becomes even more important in deep learning due to the large number of parameters.

Regularization offers an elegant solution to overfitting by encouraging smaller parameter values rather than eliminating features entirely. By adding a penalty term to the cost function, it creates a balance between fitting the training data and maintaining model simplicity, leading to better generalization on new examples.