Overfitting Problem

Understanding Overfitting and Underfitting

Overfitting is a critical problem in which a learning algorithm performs poorly on new data despite fitting the training data well. Understanding this concept is essential for building robust machine learning models.

Three Scenarios with Housing Price Prediction

Underfitting (High Bias)

Linear Model: f(x) = wx + b (straight-line fit)

  • Poor performance on training data
  • Algorithm has strong preconception that data is linear
  • Unable to capture clear patterns in the data

Just Right

Quadratic Model: f(x) = w₁x + w₂x² + b

  • Good fit to training data
  • Likely to generalize well to new examples
  • Captures underlying pattern without excessive complexity

Overfitting (High Variance)

Fourth-Order Polynomial: f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b

  • Perfect fit to training data (zero cost)
  • Very wiggly, unrealistic curve
  • Poor generalization to new examples
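To make the three scenarios concrete, here is a minimal sketch that fits first-, second-, and fourth-order polynomials to a small synthetic housing dataset, assuming NumPy and scikit-learn are available; the data-generating process is an illustrative assumption, not the lecture's actual dataset. With five training points, the fourth-order model has enough parameters (four weights plus b) to interpolate them exactly, so its training cost is essentially zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Synthetic (assumed) data: house size in 1000s of sq ft, price in $1000s,
# generated from a curve that rises and then flattens, plus noise.
x = np.sort(rng.uniform(1.0, 3.0, size=5)).reshape(-1, 1)
y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=5)

for degree, label in [(1, "underfit"), (2, "just right"), (4, "overfit")]:
    poly = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x), y)
    mse = np.mean((model.predict(poly.transform(x)) - y) ** 2)
    print(f"degree {degree} ({label}): training MSE = {mse:.3f}")
```

Low training error alone does not distinguish "just right" from "overfit"; that requires data the model has not seen, which is the subject of the next point.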

Generalization means making good predictions on brand new examples never seen during training. This is the ultimate goal of machine learning.

  • Underfit models: Fail to capture patterns, poor on both training and new data
  • Overfit models: Memorize training data, fail on new data
  • Well-fit models: Learn generalizable patterns, perform well on new data
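Generalization can be checked directly by holding out data. The sketch below, under the same synthetic-data assumptions as above, compares error on the training set with error on examples the model never saw; the high-degree model's test error is typically much larger than its training error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 3.0, size=(60, 1))
y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=60)
# Hold out half of the data to measure generalization.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 2, 8):
    poly = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
    tr = np.mean((model.predict(poly.transform(x_tr)) - y_tr) ** 2)
    te = np.mean((model.predict(poly.transform(x_te)) - y_te) ** 2)
    print(f"degree {degree}: train MSE {tr:8.2f} | test MSE {te:8.2f}")
```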

The same three regimes appear in classification. Using tumor classification with features x₁ (tumor size) and x₂ (patient age), consider three decision boundaries (sketched in code after this list):

  1. Simple Linear Model: z = w₁x₁ + w₂x₂ + b

    • Creates straight-line decision boundary
    • May underfit the data patterns
  2. Quadratic Features: z = w₁x₁ + w₂x₂ + w₃x₁² + w₄x₂² + b

    • Creates elliptical decision boundary
    • Good balance of fit and generalization
  3. High-Order Polynomial: Many polynomial features

    • Creates very complex, twisted decision boundary
    • Overfits by perfectly separating training data
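A minimal sketch of these three classifiers, assuming synthetic tumor data and scikit-learn's LogisticRegression with manually constructed polynomial features; the dataset, label-noise rate, and regularization settings are all illustrative assumptions. Higher-degree features push training accuracy toward 100% by contorting the boundary around mislabeled points:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                         # x1 = tumor size, x2 = age (standardized)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)   # malignant inside an elliptical region
flip = rng.random(100) < 0.05                         # flip 5% of labels as noise
y = np.where(flip, 1 - y, y)

for degree in (1, 2, 6):
    Xp = PolynomialFeatures(degree, include_bias=False).fit_transform(X)
    clf = LogisticRegression(C=1e4, max_iter=5000).fit(Xp, y)  # large C = weak regularization
    print(f"degree {degree}: training accuracy = {clf.score(Xp, y):.2f}")
```

The degree-1 model cannot separate an elliptical region with a straight line, the degree-2 model matches the true boundary shape, and the degree-6 model gains its last few points of training accuracy only by fitting the noise.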

High variance means that small changes in the training set can lead to very different learned functions. If two engineers train the same high-order polynomial on slightly different datasets, they might get completely different models.
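This instability can be demonstrated directly. The sketch below, again with assumed synthetic data, fits the same fourth-order model to two samples that differ only in their random noise draw, then compares the two learned functions on a common grid; the disagreement is typically large relative to the noise level:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def fit_quartic(seed):
    """Fit a fourth-order polynomial to one small noisy sample of the same process."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(1.0, 3.0, size=6)).reshape(-1, 1)
    y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=6)
    poly = PolynomialFeatures(4, include_bias=False)
    return poly, LinearRegression().fit(poly.fit_transform(x), y)

# Two "engineers" train on two noise draws of the same underlying process.
grid = np.linspace(1.0, 3.0, 9).reshape(-1, 1)
poly_a, model_a = fit_quartic(seed=1)
poly_b, model_b = fit_quartic(seed=2)
gap = np.abs(model_a.predict(poly_a.transform(grid)) - model_b.predict(poly_b.transform(grid)))
print(f"largest disagreement between the two learned functions: {gap.max():.1f}")
```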

Like the children’s story of Goldilocks and the Three Bears:

  • Too cold (underfitting): Too few features, too simple
  • Too hot (overfitting): Too many features, too complex
  • Just right: Appropriate model complexity for the data

The goal is finding models that are neither too simple nor too complex.

Signs of each regime (a rough diagnostic sketch follows these lists):

  • Underfitting: Poor performance even on training data; clear patterns are missed
  • Overfitting: Near-perfect training performance, but an unrealistic, overly complex model shape
  • Good fit: Reasonable training performance with realistic model complexity

Why this matters in practice:

  • Medical diagnosis: Overfit models might make dangerous predictions on new patients
  • Financial systems: Poor generalization could lead to incorrect fraud detection
  • Autonomous systems: Safety-critical applications require reliable generalization
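As a rough rule of thumb, comparing training error, validation error, and a baseline (e.g. human-level) error separates these regimes. The thresholds below are illustrative assumptions, not values from the original text:

```python
def diagnose(train_err: float, cv_err: float, baseline_err: float) -> str:
    """Classify the fit regime by comparing errors against a baseline level."""
    high_bias = train_err > 2 * baseline_err   # poor even on training data
    high_variance = cv_err > 2 * train_err     # much worse on unseen data
    if high_bias and high_variance:
        return "both symptoms: rethink features and model"
    if high_bias:
        return "underfitting (high bias): try a more expressive model"
    if high_variance:
        return "overfitting (high variance): more data or regularization"
    return "good fit: training and validation errors both acceptable"

print(diagnose(train_err=0.5, cv_err=3.0, baseline_err=0.4))  # -> overfitting
```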

Overfitting occurs when models become too complex and memorize training data instead of learning generalizable patterns. Striking the right balance between underfitting (high bias) and overfitting (high variance) is crucial for building machine learning models that perform well on new, unseen data.