Overfitting is a critical problem in which a model performs well on the data it was trained on but poorly on new data. Understanding this concept is essential for building robust machine learning models.
Underfitting (High Bias)
Linear Model : Straight line fit
Poor performance on training data
Algorithm has strong preconception that data is linear
Unable to capture clear patterns in the data
Just Right
Quadratic Model : f(x) = w₁x + w₂x² + b
Good fit to training data
Likely to generalize well to new examples
Captures underlying pattern without excessive complexity
Overfitting (High Variance)
Fourth-Order Polynomial : f(x) = w₁x + w₂x² + w₃x³ + w₄x⁴ + b
Perfect fit to training data (zero cost)
Very wiggly, unrealistic curve
Poor generalization to new examples (all three fits are compared in the code sketch below)
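These three regimes are easy to see in code. Below is a minimal sketch (not from the original lesson): it generates synthetic noisy quadratic data and fits degree-1, degree-2, and degree-4 polynomials with NumPy; all constants are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: a quadratic trend plus noise (assumed for illustration).
x = np.linspace(0, 4, 8)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.6, size=x.shape)

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: training MSE = {train_mse:.4f}")

# Expected pattern: degree 1 (underfit) leaves the largest training error,
# while degree 4 drives training error toward zero by bending to fit every point.
```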
Generalization means making good predictions on brand new examples never seen during training. This is the ultimate goal of machine learning.
Underfit models : Fail to capture patterns, poor on both training and new data
Overfit models : Memorize training data, fail on new data
Well-fit models : Learn generalizable patterns, perform well on new data (see the held-out error sketch after this list)
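One way to measure generalization is to hold some examples out of training and compare errors on both sets. A minimal sketch, again with synthetic data and an arbitrary split (all numbers are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy quadratic data, split into training examples and held-out examples.
x = rng.uniform(0, 4, size=12)
y = 1.0 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.6, size=x.shape)
x_train, y_train = x[:8], y[:8]
x_new, y_new = x[8:], y[8:]  # stands in for brand new, unseen examples

for degree in (1, 2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    new_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, new-data MSE {new_mse:.3f}")

# A well-fit model keeps the two numbers close; an overfit one typically shows
# a large gap between training error and error on the unseen examples.
```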
Using tumor classification with features x₁ (tumor size) and x₂ (patient age):
Simple Linear Model : z = w₁x₁ + w₂x₂ + b
Creates straight-line decision boundary
May underfit the data patterns
Quadratic Features : z = w₁x₁ + w₂x₂ + w₃x₁² + w₄x₂² + b
Creates elliptical decision boundary
Good balance of fit and generalization
High-Order Polynomial : Many polynomial features
Creates very complex, twisted decision boundary
Overfits by perfectly separating the training data (see the classification sketch below)
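A sketch of this progression with scikit-learn, assuming a synthetic stand-in for the tumor data (the elliptical label rule, feature ranges, and degrees used are all invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic stand-in: x1 = tumor size, x2 = patient age (assumed scales).
X = rng.uniform(0, 10, size=(100, 2))
# Label as malignant when a point falls inside an assumed elliptical region.
y = ((X[:, 0] - 5) ** 2 + 0.5 * (X[:, 1] - 5) ** 2 < 6).astype(int)

for degree in (1, 2, 6):
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=5000))
    model.fit(X, y)
    print(f"degree {degree}: training accuracy = {model.score(X, y):.2f}")

# Degree 1 can only draw a straight-line boundary (underfits the elliptical class);
# degree 2 can match the ellipse; a high degree contorts the boundary around
# individual training points, chasing perfect separation.
```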
Note: Important Distinctions
Bias in ML : Algorithm’s inability to fit training data (underfitting)
Bias in Society : Unfair discrimination, a completely different concept
Variance : Algorithm’s sensitivity to small changes in training data
High variance means that small changes in the training set can lead to very different learned functions. If two engineers train the same high-order polynomial on slightly different datasets, they might get completely different models.
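This sensitivity can be demonstrated directly. The sketch below (an assumed setup, not from the lesson) fits the same degrees to two datasets that differ only in their noise draws and compares the two learned functions at one probe point:

```python
import numpy as np

# Two training sets from the same underlying process, differing only in noise.
x = np.linspace(0, 4, 6)
true_y = 1.0 + 0.5 * x + 0.8 * x**2

rng_a, rng_b = np.random.default_rng(10), np.random.default_rng(11)
y_a = true_y + rng_a.normal(scale=0.5, size=x.shape)
y_b = true_y + rng_b.normal(scale=0.5, size=x.shape)

x_probe = 3.5  # a single input where the two learned functions are compared
for degree in (2, 4):
    pred_a = np.polyval(np.polyfit(x, y_a, degree), x_probe)
    pred_b = np.polyval(np.polyfit(x, y_b, degree), x_probe)
    print(f"degree {degree}: predictions differ by {abs(pred_a - pred_b):.2f}")

# The high-degree fit typically swings far more between the two datasets than
# the quadratic fit does; that sensitivity is what "high variance" refers to.
```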
Like the children’s story of Goldilocks and the Three Bears:
Too cold (underfitting) : Too few features, too simple
Too hot (overfitting) : Too many features, too complex
Just right : Appropriate model complexity for the data
The goal is finding models that are neither too simple nor too complex.
Underfitting : Poor performance even on training data, clear patterns missed
Overfitting : Perfect training performance but unrealistic/complex model shape
Good fit : Reasonable training performance with realistic model complexity (a rough diagnostic sketch follows)
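These symptoms suggest a rough numerical diagnostic. The helper below is a hypothetical sketch (the function name, thresholds, and gap_tol parameter are inventions for illustration, not a standard test):

```python
def diagnose_fit(train_error, val_error, baseline_error, gap_tol=0.5):
    """Rough rule of thumb: compare training error to an acceptable baseline
    and to validation error. Thresholds here are illustrative, not standard."""
    if train_error > baseline_error:
        return "underfitting: high error even on the training data"
    if val_error - train_error > gap_tol * max(train_error, 1e-12):
        return "overfitting: large gap between training and validation error"
    return "good fit: low training error and a small generalization gap"

# Example usage with made-up error values:
print(diagnose_fit(train_error=0.10, val_error=0.12, baseline_error=0.30))
print(diagnose_fit(train_error=0.10, val_error=0.90, baseline_error=0.30))
print(diagnose_fit(train_error=0.80, val_error=0.85, baseline_error=0.30))
```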
Medical diagnosis : Overfit models might make dangerous predictions on new patients
Financial systems : Poor generalization could lead to incorrect fraud detection
Autonomous systems : Safety-critical applications require reliable generalization
Tip: Prevention Strategy
Later topics will cover specific techniques for addressing overfitting, including regularization, cross-validation, and proper model selection methods.
Overfitting occurs when models become too complex and memorize training data instead of learning generalizable patterns. Recognizing the balance between underfitting (high bias) and overfitting (high variance) is crucial for building machine learning models that perform well on new, unseen data.