
Choosing the Learning Rate

  • Symptom: Cost sometimes increases, sometimes decreases
  • Cause: Learning rate α is too large
  • Mechanism: Algorithm overshoots the minimum repeatedly

  • Symptom: Cost grows at every iteration
  • Likely Cause: Learning rate too large
  • Alternative Cause: Bug in the code implementation
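As a rough way to read these two patterns from a recorded cost history, the short sketch below labels a curve as decreasing, oscillating, or increasing; the classify_cost_history helper and the sample numbers are purely illustrative, not from the course material.

import numpy as np

def classify_cost_history(costs):
    """Label a recorded cost history as decreasing, oscillating, or increasing."""
    diffs = np.diff(costs)                 # change in cost between iterations
    if np.all(diffs <= 0):
        return "decreasing: alpha looks reasonable"
    if np.all(diffs >= 0):
        return "increasing every iteration: alpha too large, or a bug"
    return "oscillating: alpha is likely too large"

print(classify_cost_history([10.0, 8.0, 6.5, 5.4, 4.6]))    # steadily decreasing
print(classify_cost_history([10.0, 14.0, 9.0, 15.0, 8.0]))  # bounces up and down
print(classify_cost_history([10.0, 12.0, 15.0, 19.0]))      # grows every step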

When α is too large, gradient descent can overshoot the minimum:

  1. Start Position: Algorithm begins at some point
  2. Large Step: Takes too big a step and overshoots minimum
  3. Next Step: Overshoots in opposite direction
  4. Result: Bounces back and forth, cost increases
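A minimal sketch of this overshoot behavior, using a toy one-dimensional cost J(w) = w² whose minimum is at w = 0 (the cost function, starting point, and step counts are illustrative assumptions): a large α makes the parameter bounce back and forth across the minimum while the cost grows, and a smaller α converges.

# Toy cost J(w) = w**2, with gradient dJ/dw = 2*w
def run(alpha, w=1.0, iters=5):
    costs = []
    for _ in range(iters):
        w = w - alpha * 2 * w              # gradient descent update
        costs.append(round(w ** 2, 3))     # cost after the step
    return costs

print("alpha = 1.1:", run(1.1))   # overshoots: w flips sign each step and cost rises
print("alpha = 0.1:", run(0.1))   # controlled steps: cost falls every iteration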

When α is appropriately sized:

  • Controlled Steps: Each step moves closer to the minimum
  • Consistent Progress: Cost decreases at every iteration
  • Efficient Convergence: Reaches the minimum relatively quickly

Debugging

A useful diagnostic separates learning-rate problems from implementation bugs (a code sketch follows the checklist below):

  • Purpose: Isolate whether the problem is the learning rate or a bug
  • Method: Set α to a very small value (e.g., 0.0001)
  • Expected Result: Cost should decrease every iteration with a correct implementation

If the cost still fails to decrease with a tiny α, check the implementation:
  1. Verify Update Rule: Ensure the parameter update uses a minus sign (w = w - α·∂J/∂w)
  2. Check Derivatives: Confirm the gradient calculations are correct
  3. Test Small α: Use a tiny learning rate to verify the implementation
  4. Examine Data: Ensure features are properly preprocessed
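A minimal sketch of this diagnostic, assuming a tiny single-feature linear-regression problem; the data, iteration count, and the sign parameter used to simulate a plus-instead-of-minus bug are all illustrative assumptions, not part of the original notes.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative single-feature data
y = np.array([2.0, 4.0, 6.0, 8.0])

def cost(w, b):
    return ((w * x + b - y) ** 2).mean() / 2

def decreases_with_tiny_alpha(alpha=0.0001, iters=100, sign=-1):
    """Return True if the cost drops at every iteration; sign=+1 simulates the '+' bug."""
    w, b = 0.0, 0.0
    prev = cost(w, b)
    for _ in range(iters):
        err = w * x + b - y
        w += sign * alpha * (err @ x) / x.size   # a correct update uses the minus sign
        b += sign * alpha * err.mean()
        if cost(w, b) >= prev:                   # cost should never rise with a tiny alpha
            return False
        prev = cost(w, b)
    return True

print(decreases_with_tiny_alpha(sign=-1))  # True: implementation behaves correctly
print(decreases_with_tiny_alpha(sign=+1))  # False: the sign bug makes the cost rise at once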
Keep in mind that a very small α is only a diagnostic setting, not a good final choice:

  • Problem: Very slow convergence
  • Result: Many iterations needed
  • Impact: Inefficient training time

Test learning rates in roughly 3× increments:

alpha_search.py
alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
for alpha in alpha_candidates:
    # Run gradient descent for a handful of iterations
    # Plot the learning curve
    # Evaluate performance
    ...  # see the filled-in sketch below
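One way the loop body above might be filled in. This is a minimal runnable sketch that assumes a tiny single-feature linear-regression dataset and a fixed, small iteration count; the data and numbers are illustrative, not part of the original notes.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative single-feature data
y = np.array([2.0, 4.0, 6.0, 8.0])

alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
histories = {}                                       # cost curve per alpha, for plotting

for alpha in alpha_candidates:
    w, b, costs = 0.0, 0.0, []
    for _ in range(30):                              # a handful of iterations
        err = w * x + b - y
        w -= alpha * (err @ x) / x.size              # gradient descent update for w
        b -= alpha * err.mean()                      # gradient descent update for b
        costs.append(((w * x + b - y) ** 2).mean() / 2)
    histories[alpha] = costs                         # keep the curve for side-by-side plots
    print(f"alpha={alpha}: first cost {costs[0]:.3f}, last cost {costs[-1]:.3f}")

With these illustrative numbers the smallest candidates barely move, the middle candidates steadily shrink the cost, and the largest candidates blow up, which is exactly the boundary-finding behavior described next.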
Search procedure:

  1. Find Boundaries: Identify an α that is too small and an α that is too large
  2. Choose Optimal: Pick the largest value that decreases cost rapidly but consistently
  3. Verify Stability: Ensure the chosen α works reliably across multiple runs

Practical tips:

  • Duration: Run each α for a small number of iterations
  • Comparison: Compare the learning curves side by side
  • Efficiency: A quick way to identify promising learning rates

Selecting the final value (a selection sketch follows this list):

  • Avoid Extremes: Not the smallest working α, not the largest that just barely works
  • Prioritize Speed: Choose the α that decreases cost most rapidly while remaining stable
  • Safety Margin: Pick a value slightly smaller than the maximum that works
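One possible way to encode this selection rule, assuming cost histories like those recorded by the search loop above; the cost values below are made up for illustration.

histories = {                          # hypothetical cost curves per learning rate
    0.001: [10.0, 9.9, 9.8, 9.7],      # stable but very slow
    0.01:  [10.0, 9.0, 8.2, 7.5],      # stable and faster
    0.1:   [10.0, 6.0, 3.8, 2.5],      # stable and fastest here
    0.3:   [10.0, 7.0, 9.0, 12.0],     # oscillates, then grows
    1.0:   [10.0, 25.0, 70.0, 200.0],  # clearly diverging
}

def is_stable(costs):
    """Stable means the cost decreases at every recorded iteration."""
    return all(later < earlier for earlier, later in zip(costs, costs[1:]))

stable = sorted(alpha for alpha, costs in histories.items() if is_stable(costs))
largest_stable = stable[-1]                                       # largest alpha that still behaves
safe_choice = stable[-2] if len(stable) > 1 else largest_stable   # back off one step as a margin

print("stable alphas:", stable)            # [0.001, 0.01, 0.1]
print("largest stable:", largest_stable)   # 0.1
print("choice with margin:", safe_choice)  # 0.01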
Feature scaling interacts with the learning rate:

  • Properly Scaled Features: Can use larger learning rates
  • Unscaled Features: May require smaller learning rates
  • Recommendation: Always scale features first (see the example below), then tune the learning rate
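For example, a common way to scale features before tuning α is z-score normalization; the feature matrix below is an illustrative assumption, not data from the notes.

import numpy as np

X = np.array([[2000.0, 3.0],    # illustrative raw features (e.g., size, bedrooms)
              [1200.0, 2.0],
              [ 850.0, 1.0]])

mu = X.mean(axis=0)             # per-feature mean
sigma = X.std(axis=0)           # per-feature standard deviation
X_scaled = (X - mu) / sigma     # z-score normalized features

print(X_scaled.round(2))        # each column now has mean 0 and unit spread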
Other factors that affect the usable range:

  • Dataset Size: Larger datasets may allow larger learning rates
  • Feature Count: More features may require smaller learning rates
  • Problem Complexity: Some problems are inherently more sensitive
Monitoring during training:

  • Continuous Monitoring: Watch learning curves throughout training
  • Early Detection: Spot problems before wasting computational time (a small sketch follows this list)
  • Dynamic Adjustment: Some advanced techniques adjust the learning rate during training
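As a small illustration of early detection, a training loop can watch the cost and stop as soon as it rises instead of wasting the remaining iterations; the toy cost J(w) = w² and the values below are illustrative assumptions.

def gradient(w):
    return 2 * w                          # gradient of the toy cost J(w) = w**2

def train(alpha, w=1.0, max_iters=1000):
    prev_cost = w ** 2
    for i in range(max_iters):
        w -= alpha * gradient(w)          # gradient descent update
        current_cost = w ** 2
        if current_cost > prev_cost:      # early detection: cost went up
            print(f"alpha={alpha}: stopping early at iteration {i}, cost is rising")
            return w
        prev_cost = current_cost
    print(f"alpha={alpha}: completed all iterations without the cost rising")
    return w

train(1.2)   # too large: stops almost immediately
train(0.1)   # well sized: runs to completion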

Conservative Choice

Choose a learning rate that works reliably, even if it is not the fastest

Empirical Testing

Always test multiple values rather than guessing

Visual Validation

Trust learning curves over theoretical recommendations

Choosing an appropriate learning rate is crucial for effective gradient descent. A systematic approach combining debugging techniques, empirical testing, and careful monitoring ensures optimal algorithm performance.