Overfitting Problem

Understanding Overfitting and Underfitting

Overfitting is a critical problem in which a learning algorithm performs poorly on new data despite fitting the training data well. Understanding this concept is essential for building robust machine learning models.

Three Scenarios with Housing Price Prediction

Underfitting (High Bias)

Linear Model: f(x) = wx + b (straight-line fit)

  • Poor performance on training data
  • Algorithm has strong preconception that data is linear
  • Unable to capture clear patterns in the data

Just Right

Quadratic Model: f(x) = w₁x + w₂x² + b

  • Good fit to training data
  • Likely to generalize well to new examples
  • Captures underlying pattern without excessive complexity

Overfitting (High Variance)

Fourth-Order Polynomial: f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b

  • Perfect fit to training data (zero cost)
  • Very wiggly, unrealistic curve
  • Poor generalization to new examples
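To make the three scenarios concrete, here is a minimal sketch that fits first-, second-, and fourth-order polynomials to a small synthetic housing dataset, assuming NumPy and scikit-learn are available; the data-generating process is an illustrative assumption, not the lecture's actual dataset. With five training points, the fourth-order model has enough parameters (four weights plus b) to interpolate them exactly, so its training cost is essentially zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
# Synthetic (assumed) data: house size in 1000s of sq ft, price in $1000s,
# generated from a curve that rises and then flattens, plus noise.
x = np.sort(rng.uniform(1.0, 3.0, size=5)).reshape(-1, 1)
y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=5)

for degree, label in [(1, "underfit"), (2, "just right"), (4, "overfit")]:
    poly = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x), y)
    mse = np.mean((model.predict(poly.transform(x)) - y) ** 2)
    print(f"degree {degree} ({label}): training MSE = {mse:.3f}")
```

Low training error alone does not distinguish "just right" from "overfit"; that requires data the model has not seen, which is the subject of the next point.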

Generalization means making good predictions on brand new examples never seen during training. This is the ultimate goal of machine learning.

  • Underfit models: Fail to capture patterns, poor on both training and new data
  • Overfit models: Memorize training data, fail on new data
  • Well-fit models: Learn generalizable patterns, perform well on new data
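Generalization can be checked directly by holding out data. The sketch below, under the same synthetic-data assumptions as above, compares error on the training set with error on examples the model never saw; the high-degree model's test error is typically much larger than its training error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 3.0, size=(60, 1))
y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=60)
# Hold out half of the data to measure generalization.
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in (1, 2, 8):
    poly = PolynomialFeatures(degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(x_tr), y_tr)
    tr = np.mean((model.predict(poly.transform(x_tr)) - y_tr) ** 2)
    te = np.mean((model.predict(poly.transform(x_te)) - y_te) ** 2)
    print(f"degree {degree}: train MSE {tr:8.2f} | test MSE {te:8.2f}")
```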

The same three regimes appear in classification. Using tumor classification with features x₁ (tumor size) and x₂ (patient age), consider three decision boundaries (sketched in code after this list):

  1. Simple Linear Model: z = w₁x₁ + w₂x₂ + b

    • Creates straight-line decision boundary
    • May underfit the data patterns
  2. Quadratic Features: z = w₁x₁ + w₂x₂ + w₃x₁² + w₄x₂² + b

    • Creates elliptical decision boundary
    • Good balance of fit and generalization
  3. High-Order Polynomial: Many polynomial features

    • Creates very complex, twisted decision boundary
    • Overfits by perfectly separating training data
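A minimal sketch of these three classifiers, assuming synthetic tumor data and scikit-learn's LogisticRegression with manually constructed polynomial features; the dataset, label-noise rate, and regularization settings are all illustrative assumptions. Higher-degree features push training accuracy toward 100% by contorting the boundary around mislabeled points:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                         # x1 = tumor size, x2 = age (standardized)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)   # malignant inside an elliptical region
flip = rng.random(100) < 0.05                         # flip 5% of labels as noise
y = np.where(flip, 1 - y, y)

for degree in (1, 2, 6):
    Xp = PolynomialFeatures(degree, include_bias=False).fit_transform(X)
    clf = LogisticRegression(C=1e4, max_iter=5000).fit(Xp, y)  # large C = weak regularization
    print(f"degree {degree}: training accuracy = {clf.score(Xp, y):.2f}")
```

The degree-1 model cannot separate an elliptical region with a straight line, the degree-2 model matches the true boundary shape, and the degree-6 model gains its last few points of training accuracy only by fitting the noise.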

High variance means that small changes in the training set can lead to very different learned functions. If two engineers train the same high-order polynomial on slightly different datasets, they might get completely different models.
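This instability can be demonstrated directly. The sketch below, again with assumed synthetic data, fits the same fourth-order model to two samples that differ only in their random noise draw, then compares the two learned functions on a common grid; the disagreement is typically large relative to the noise level:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def fit_quartic(seed):
    """Fit a fourth-order polynomial to one small noisy sample of the same process."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.uniform(1.0, 3.0, size=6)).reshape(-1, 1)
    y = -50 + 300 * x[:, 0] - 50 * x[:, 0] ** 2 + rng.normal(0, 10, size=6)
    poly = PolynomialFeatures(4, include_bias=False)
    return poly, LinearRegression().fit(poly.fit_transform(x), y)

# Two "engineers" train on two noise draws of the same underlying process.
grid = np.linspace(1.0, 3.0, 9).reshape(-1, 1)
poly_a, model_a = fit_quartic(seed=1)
poly_b, model_b = fit_quartic(seed=2)
gap = np.abs(model_a.predict(poly_a.transform(grid)) - model_b.predict(poly_b.transform(grid)))
print(f"largest disagreement between the two learned functions: {gap.max():.1f}")
```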

Like the children’s story of Goldilocks and the Three Bears:

  • Too cold (underfitting): Too few features, too simple
  • Too hot (overfitting): Too many features, too complex
  • Just right: Appropriate model complexity for the data

The goal is finding models that are neither too simple nor too complex.

Signs of each regime (a rough diagnostic sketch follows these lists):

  • Underfitting: Poor performance even on training data; clear patterns are missed
  • Overfitting: Near-perfect training performance, but an unrealistic, overly complex model shape
  • Good fit: Reasonable training performance with realistic model complexity

Why this matters in practice:

  • Medical diagnosis: Overfit models might make dangerous predictions on new patients
  • Financial systems: Poor generalization could lead to incorrect fraud detection
  • Autonomous systems: Safety-critical applications require reliable generalization
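As a rough rule of thumb, comparing training error, validation error, and a baseline (e.g. human-level) error separates these regimes. The thresholds below are illustrative assumptions, not values from the original text:

```python
def diagnose(train_err: float, cv_err: float, baseline_err: float) -> str:
    """Classify the fit regime by comparing errors against a baseline level."""
    high_bias = train_err > 2 * baseline_err   # poor even on training data
    high_variance = cv_err > 2 * train_err     # much worse on unseen data
    if high_bias and high_variance:
        return "both symptoms: rethink features and model"
    if high_bias:
        return "underfitting (high bias): try a more expressive model"
    if high_variance:
        return "overfitting (high variance): more data or regularization"
    return "good fit: training and validation errors both acceptable"

print(diagnose(train_err=0.5, cv_err=3.0, baseline_err=0.4))  # -> overfitting
```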

Overfitting occurs when models become too complex and memorize training data instead of learning generalizable patterns. Striking the right balance between underfitting (high bias) and overfitting (high variance) is crucial for building machine learning models that perform well on new, unseen data.