
Regularization

Regularization is a powerful technique for reducing overfitting by preventing parameters from becoming too large, leading to simpler and more generalizable models.

Collect More Data

Benefits: Often the most effective approach

  • Larger training sets make algorithms less likely to overfit
  • Can continue using complex models with sufficient data
  • Limitation: Not always feasible to obtain more data

Reduce Features

Approach: Use fewer features

  • Select the most important features (e.g., keep size, bedrooms, and age; drop less relevant ones such as distance to the nearest coffee shop)
  • Eliminate less relevant polynomial terms
  • Trade-off: May discard useful information

Regularization

Best of Both Worlds: Keep all features but prevent large parameters

  • Shrink parameter values without eliminating features
  • Maintain all available information while reducing overfitting
  • Most commonly used approach in practice

Instead of eliminating features entirely (setting parameters to 0), regularization encourages smaller parameter values across all features.

Consider a high-order polynomial that overfits:

f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b

If we could make w₃ and w₄ very small (close to 0), we get a function closer to:

f(x) ≈ w₁x + w₂x² + b

This reduces complexity while keeping all features.
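
As a quick illustration, here is a minimal NumPy sketch (the weights and inputs below are made up purely for illustration) showing that when w₃ and w₄ are near zero, the quartic behaves almost like the quadratic:

```python
import numpy as np

# Hypothetical weights for f(x) = w1*x + w2*x^2 + w3*x^3 + w4*x^4 + b
x = np.linspace(0.0, 2.0, 5)                  # a few sample inputs
w_large = np.array([1.0, 1.0, 3.0, 5.0])      # large w3, w4: the x^3 and x^4 terms dominate
w_small = np.array([1.0, 1.0, 0.01, 0.002])   # w3, w4 near zero: close to a quadratic

def poly(x, w, b=0.0):
    features = np.stack([x, x**2, x**3, x**4])  # the four polynomial features
    return w @ features + b

print(poly(x, w_large))   # grows quickly, driven by the high-order terms
print(poly(x, w_small))   # approximately w1*x + w2*x^2 + b
```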

To encourage small parameter values, regularization adds a penalty term to the cost function:

J(w,b) = (1/2m) * Σ(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)² + (λ/2m) * Σ(wⱼ²)

Where:

  • λ (lambda): Regularization parameter
  • First term: Original cost (fit training data)
  • Second term: Regularization penalty (keep parameters small)
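
As a sketch of how this cost could be computed (the function name and arguments are illustrative, assuming a linear model f(x) = w·x + b):

```python
import numpy as np

def regularized_cost(X, y, w, b, lambda_):
    """Squared-error cost (1/2m)*sum((f(x)-y)^2) plus the (lambda/2m)*sum(w_j^2) penalty; b is not penalized."""
    m = X.shape[0]
    predictions = X @ w + b                            # f(x) for each training example
    squared_error = np.sum((predictions - y) ** 2) / (2 * m)
    l2_penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return squared_error + l2_penalty
```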

The value of λ controls the balance between the two terms:

  • λ = 0: No regularization - the model can still overfit
  • λ very large (e.g., 10¹⁰): All wⱼ ≈ 0, so f(x) ≈ b - underfitting
  • λ just right: Balanced trade-off between fitting the training data and keeping parameters small
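
Continuing the sketch above with made-up data and a deliberately large weight, evaluating the cost for different λ values shows how the penalty term comes to dominate:

```python
X = np.array([[1.0], [2.0], [3.0]])   # toy dataset, one feature
y = np.array([1.5, 2.9, 4.1])
w, b = np.array([5.0]), 0.0           # a deliberately large weight

for lam in (0.0, 1.0, 1e10):
    print(lam, regularized_cost(X, y, w, b, lam))
# lam = 0:    only the squared-error term counts (no regularization)
# lam = 1e10: the penalty term dominates, so minimizing J would drive w toward 0
```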

Conventions:

  • Parameter b: Usually not regularized (regularizing it has minimal impact in practice)
  • Scaling: Divide the penalty by 2m to make λ selection easier across different dataset sizes
  • Scope: Regularize w₁ through wₙ, but not b

The regularized update for wⱼ becomes:

wⱼ := wⱼ(1 - α*λ/m) - α*(1/m)*Σ[(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)*xⱼ⁽ⁱ⁾]

On each iteration:

  • Multiply wⱼ by (1 - α*λ/m), a factor slightly less than 1 (e.g., ≈ 0.9998 for α = 0.01, λ = 1, m = 50)
  • This gradually shrinks the parameters toward zero
  • Then apply the usual gradient descent update (see the sketch after this list)

The effect:

  • Prevents overfitting even with many features
  • Creates smoother, more generalizable functions
  • Maintains all feature information
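
A minimal sketch of this update loop, under the same assumptions as the cost function above (illustrative names, linear model, b left unregularized):

```python
import numpy as np

def gradient_descent_regularized(X, y, w, b, alpha, lambda_, num_iters):
    """Gradient descent for regularized linear regression: w is shrunk, b is not."""
    m = X.shape[0]
    for _ in range(num_iters):
        error = X @ w + b - y                          # f(x) - y for every example
        dj_dw = (X.T @ error) / m + (lambda_ / m) * w  # gradient of J with respect to w
        dj_db = np.sum(error) / m                      # b has no penalty term
        w = w - alpha * dj_dw                          # equals w*(1 - alpha*lambda_/m) minus the usual step
        b = b - alpha * dj_db
    return w, b
```

Sweeping λ over values like 0, 1, and 10¹⁰ on a small dataset and inspecting the learned w reproduces the trade-off described above: the weights stay large, shrink moderately, or collapse toward zero.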
For logistic regression, the same penalty term is added to the cross-entropy cost:

J(w,b) = -(1/m) * Σ[y⁽ⁱ⁾*log(f(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾)*log(1-f(x⁽ⁱ⁾))] + (λ/2m) * Σ(wⱼ²)

The same principle, and the same modification to gradient descent, applies to classification problems.
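
A corresponding sketch for the logistic case (again with illustrative names, assuming f(x) is the sigmoid of w·x + b):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(X, y, w, b, lambda_):
    """Cross-entropy loss plus the same (lambda/2m)*sum(w_j^2) penalty."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                             # predicted probabilities
    cross_entropy = -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f)) / m
    l2_penalty = (lambda_ / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + l2_penalty
```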

Regularization becomes even more important in deep learning due to the large number of parameters.

Regularization offers an elegant solution to overfitting by encouraging smaller parameter values rather than eliminating features entirely. By adding a penalty term to the cost function, it creates a balance between fitting the training data and maintaining model simplicity, leading to better generalization on new examples.