Two learning-curve patterns signal that gradient descent is misbehaving (a quick way to check for them is sketched after this list):
- Symptom: Cost sometimes increases, sometimes decreases
- Cause: Learning rate α is too large
- Mechanism: Algorithm overshoots the minimum repeatedly
- Pattern: Cost grows at every iteration
- Likely Cause: Learning rate too large
- Alternative Cause: Bug in the code implementation
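Both patterns can be spotted by recording the cost J after every iteration and inspecting the history. A minimal sketch (the costs list is assumed to come from your own training loop, and the diagnose name is purely illustrative):
def diagnose(costs):
    # Flag the two warning patterns in a per-iteration cost history
    increases = sum(1 for prev, cur in zip(costs, costs[1:]) if cur > prev)
    if increases == 0:
        return "cost decreases every iteration - looks healthy"
    if increases == len(costs) - 1:
        return "cost grows every iteration - alpha too large or a bug"
    return "cost fluctuates up and down - alpha is likely too large"
print(diagnose([10.0, 7.1, 12.3, 6.0, 14.9]))   # example history: fluctuating pattern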
When α is too large, gradient descent can overshoot the minimum (demonstrated in the sketch after this list):
- Start Position: Algorithm begins at some point
- Large Step: Takes too big a step and overshoots minimum
- Next Step: Overshoots in opposite direction
- Result: Bounces back and forth, cost increases
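An illustrative toy example of this bounce (not from the original notes): gradient descent on the one-dimensional cost J(w) = w², whose gradient is 2w, with an oversized learning rate:
# Toy cost J(w) = w**2 with its minimum at w = 0; the gradient is 2*w.
# alpha = 1.1 is deliberately too large, so each update overshoots zero
# and lands farther from the minimum than the previous point.
w, alpha = 1.0, 1.1
for i in range(5):
    w = w - alpha * 2 * w                      # overshoots the minimum
    print(f"iter {i}: w = {w:+.3f}, cost J = {w**2:.3f}")
The sign of w flips on every step and the cost grows, which is exactly the back-and-forth bouncing described above.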
When α is appropriately sized (see the sketch after this list):
- Controlled Steps: Each step moves closer to minimum
- Consistent Progress: Cost decreases at every iteration
- Efficient Convergence: Reaches minimum relatively quickly
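For contrast, a sketch of the same toy cost with a smaller step; it also stops once the per-iteration improvement becomes negligible, which is one simple stopping test chosen for this illustration rather than something prescribed by these notes:
# Same toy cost J(w) = w**2, now with a well-sized learning rate.
w, alpha, prev_cost = 1.0, 0.1, float("inf")
for i in range(100):
    w = w - alpha * 2 * w                      # controlled step toward w = 0
    cost = w ** 2
    if prev_cost - cost < 1e-9:                # progress has effectively stopped
        print(f"converged after {i + 1} iterations, w = {w:.6f}")
        break
    prev_cost = cost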
A quick debugging test isolates the cause (sketched after this list):
- Purpose: Determine whether the problem is the learning rate or a bug in the implementation
- Method: Set α to a very small value (e.g., 0.0001)
- Expected Result: With a correct implementation, the cost should decrease on every iteration
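A sketch of this test on the same toy cost, with a deliberately buggy plus-sign update included to show why the test isolates bugs: with α = 0.0001 the correct update always lowers the cost, while the buggy one raises it no matter how small α is.
def run(update_sign, alpha=0.0001, iters=50):
    # Gradient descent on the toy cost J(w) = w**2; update_sign = -1 is the
    # correct minus-sign update, +1 simulates a sign bug in the implementation.
    w, prev_cost = 1.0, 1.0
    for _ in range(iters):
        w = w + update_sign * alpha * 2 * w
        cost = w ** 2
        if cost > prev_cost:
            return "cost increased -> suspect a bug, not the learning rate"
        prev_cost = cost
    return "cost decreased every iteration -> implementation looks correct"
print("correct update:", run(update_sign=-1))
print("buggy '+' update:", run(update_sign=+1))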
If the cost still misbehaves with a tiny α, work through this checklist (a gradient-check sketch follows):
- Verify Update Rule: Ensure the parameter updates use a minus sign
- Check Derivatives: Confirm the gradient calculations are correct
- Test Small α: Use a tiny learning rate to verify the implementation
- Examine Data: Ensure features are properly preprocessed
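One way to carry out the derivative check is a numerical gradient check, comparing the hand-coded gradient against a centered finite-difference estimate. An illustrative sketch on the toy cost J(w) = w² (not from the original notes):
def cost(w):
    return w ** 2
def analytic_grad(w):
    return 2 * w                               # the derivative coded by hand
def numerical_grad(w, eps=1e-6):
    # Centered finite difference: (J(w + eps) - J(w - eps)) / (2 * eps)
    return (cost(w + eps) - cost(w - eps)) / (2 * eps)
w = 1.7
print("analytic:  ", analytic_grad(w))
print("numerical: ", numerical_grad(w))
print("agreement: ", abs(analytic_grad(w) - numerical_grad(w)) < 1e-4)
If the two values disagree noticeably, the gradient code is the likely culprit rather than the learning rate.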
Choosing the Learning Rate
If α is too small:
- Problem: Very slow convergence
- Result: Many iterations needed
- Impact: Inefficient training time
If α is too large:
- Problem: Overshooting the minimum
- Result: Cost increases or fluctuates
- Impact: Algorithm doesn't converge
If α is well chosen:
- Result: Rapid, consistent decrease in cost
- Goal: Find the largest α that converges reliably
Test learning rates in roughly 3× increments:
alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
for alpha in alpha_candidates:
    # Run gradient descent for a handful of iterations with this alpha
    # and record the cost history (a complete runnable sketch follows)
    ...
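A fuller, self-contained sketch of the sweep (assumes NumPy; the synthetic data, function names, and the cost_histories dict are illustrative choices, not part of the original notes):
import numpy as np

# Small synthetic univariate-regression dataset: y roughly 4 + 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=100)
y = 4 + 3 * X + rng.normal(0, 0.5, size=100)

def compute_cost(w, b, X, y):
    # Mean squared error with the conventional 1/(2m) factor
    m = len(y)
    err = w * X + b - y
    return (err @ err) / (2 * m)

def gradient_descent(X, y, alpha, num_iters=100):
    # Batch gradient descent for univariate linear regression;
    # returns the cost recorded after every iteration.
    m = len(y)
    w, b = 0.0, 0.0
    costs = []
    for _ in range(num_iters):
        err = w * X + b - y
        dj_dw = (err @ X) / m
        dj_db = err.mean()
        w -= alpha * dj_dw                     # note the minus sign
        b -= alpha * dj_db
        costs.append(compute_cost(w, b, X, y))
    return costs

alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
cost_histories = {}
for alpha in alpha_candidates:
    costs = gradient_descent(X, y, alpha, num_iters=100)
    cost_histories[alpha] = costs
    decreasing = all(cur <= prev for prev, cur in zip(costs, costs[1:]))
    print(f"alpha={alpha}: final cost={costs[-1]:.4f}, always decreasing={decreasing}")
With this setup the cost should decrease for the smaller candidates and blow up for the largest, which marks the boundaries described next.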
- Find Boundaries: Identify α that’s too small and α that’s too large
- Choose Optimal: Pick the largest value that decreases cost rapidly but consistently (one possible selection rule is sketched below)
- Verify Stability: Ensure chosen α works reliably across multiple runs
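One possible selection rule, again only a sketch: keep the candidates whose cost never increased, then take the largest of them. It assumes the cost_histories dict built in the sweep sketch above.
def choose_alpha(cost_histories):
    # Largest alpha whose cost decreased (or held steady) at every recorded step
    stable = [alpha for alpha, costs in cost_histories.items()
              if all(cur <= prev for prev, cur in zip(costs, costs[1:]))]
    if not stable:
        raise ValueError("no candidate converged - extend the search to smaller values")
    return max(stable)

# e.g. best_alpha = choose_alpha(cost_histories)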
- Duration: Run each α for a small number of iterations
- Comparison: Compare learning curves side by side (a plotting sketch follows this list)
- Efficiency: Quick way to identify promising learning rates
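Plotting the recorded histories on one set of axes makes the side-by-side comparison easy. A sketch assuming Matplotlib and the cost_histories dict from the sweep sketch:
import matplotlib.pyplot as plt

def plot_learning_curves(cost_histories):
    # One curve per candidate alpha, on shared axes for direct comparison
    for alpha, costs in sorted(cost_histories.items()):
        plt.plot(costs, label=f"alpha = {alpha}")
    plt.xlabel("iteration")
    plt.ylabel("cost J")
    plt.yscale("log")    # keeps diverging and converging runs visible together
    plt.legend()
    plt.show()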
- Avoid Extremes: Not the smallest working α, not the largest that just barely works
- Prioritize Speed: Choose α that decreases cost most rapidly while remaining stable
- Safety Margin: Pick a value slightly smaller than the largest α that works
- Properly Scaled Features: Can use larger learning rates
- Unscaled Features: May require smaller learning rates
- Recommendation: Always scale features first, then tune the learning rate (a minimal scaling sketch follows)
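A minimal z-score scaling sketch (one common scaling choice; assumes NumPy and a 2-D feature matrix X that is not defined in these notes):
import numpy as np

def zscore_scale(X):
    # Scale each feature (column) to zero mean and unit standard deviation.
    # Assumes no feature is constant (zero standard deviation).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Scale first, then tune alpha on the scaled features; keep mu and sigma
# so the same transformation can be applied to new examples later.
# X_scaled, mu, sigma = zscore_scale(X)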
- Dataset Size: Larger datasets may allow larger learning rates
- Feature Count: More features may require smaller learning rates
- Problem Complexity: Some problems are inherently more sensitive
- Continuous Monitoring: Watch learning curves throughout training
- Early Detection: Spot problems before wasting computational time
- Dynamic Adjustment: Some advanced techniques adjust the learning rate during training (a simple decay scheme is sketched below)
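As one illustration of dynamic adjustment (a simple scheme chosen for this sketch, not something prescribed by these notes), time-based decay shrinks α as training progresses:
def decayed_alpha(alpha0, iteration, decay_rate=0.01):
    # Time-based decay: the effective learning rate shrinks as iterations accumulate
    return alpha0 / (1 + decay_rate * iteration)

for it in (0, 100, 1000):
    print(it, decayed_alpha(0.1, it))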
Conservative Choice
Choose a learning rate that works reliably even if it is not the fastest
Empirical Testing
Always test multiple values rather than guessing
Visual Validation
Trust learning curves over theoretical recommendations
Choosing an appropriate learning rate is crucial for effective gradient descent. A systematic approach that combines debugging checks, empirical testing of multiple α values, and careful monitoring of learning curves leads to reliable, efficient training.