Two learning-curve patterns signal that gradient descent is misbehaving (a quick way to check for them is sketched after this list):
- Symptom: Cost sometimes increases, sometimes decreases
- Cause: Learning rate α is too large
- Mechanism: Algorithm overshoots the minimum repeatedly
- Pattern: Cost grows at every iteration
- Likely Cause: Learning rate too large
- Alternative Cause: Bug in the code implementation
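Both patterns can be spotted by recording the cost J after every iteration and inspecting the history. A minimal sketch (the costs list is assumed to come from your own training loop, and the diagnose name is purely illustrative):
def diagnose(costs):
    # Flag the two warning patterns in a per-iteration cost history
    increases = sum(1 for prev, cur in zip(costs, costs[1:]) if cur > prev)
    if increases == 0:
        return "cost decreases every iteration - looks healthy"
    if increases == len(costs) - 1:
        return "cost grows every iteration - alpha too large or a bug"
    return "cost fluctuates up and down - alpha is likely too large"
print(diagnose([10.0, 7.1, 12.3, 6.0, 14.9]))   # example history: fluctuating pattern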
When α is too large, gradient descent can overshoot the minimum (demonstrated in the sketch after this list):
- Start Position: Algorithm begins at some point
- Large Step: Takes too big a step and overshoots minimum
- Next Step: Overshoots in opposite direction
- Result: Bounces back and forth, cost increases
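An illustrative toy example of this bounce (not from the original notes): gradient descent on the one-dimensional cost J(w) = w², whose gradient is 2w, with an oversized learning rate:
# Toy cost J(w) = w**2 with its minimum at w = 0; the gradient is 2*w.
# alpha = 1.1 is deliberately too large, so each update overshoots zero
# and lands farther from the minimum than the previous point.
w, alpha = 1.0, 1.1
for i in range(5):
    w = w - alpha * 2 * w                      # overshoots the minimum
    print(f"iter {i}: w = {w:+.3f}, cost J = {w**2:.3f}")
The sign of w flips on every step and the cost grows, which is exactly the back-and-forth bouncing described above.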
When α is appropriately sized (see the sketch after this list):
- Controlled Steps: Each step moves closer to minimum
- Consistent Progress: Cost decreases at every iteration
- Efficient Convergence: Reaches minimum relatively quickly
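For contrast, a sketch of the same toy cost with a smaller step; it also stops once the per-iteration improvement becomes negligible, which is one simple stopping test chosen for this illustration rather than something prescribed by these notes:
# Same toy cost J(w) = w**2, now with a well-sized learning rate.
w, alpha, prev_cost = 1.0, 0.1, float("inf")
for i in range(100):
    w = w - alpha * 2 * w                      # controlled step toward w = 0
    cost = w ** 2
    if prev_cost - cost < 1e-9:                # progress has effectively stopped
        print(f"converged after {i + 1} iterations, w = {w:.6f}")
        break
    prev_cost = cost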
A quick debugging test isolates the cause (sketched after this list):
- Purpose: Determine whether the problem is the learning rate or a bug in the implementation
- Method: Set α to a very small value (e.g., 0.0001)
- Expected Result: With a correct implementation, the cost should decrease on every iteration
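A sketch of this test on the same toy cost, with a deliberately buggy plus-sign update included to show why the test isolates bugs: with α = 0.0001 the correct update always lowers the cost, while the buggy one raises it no matter how small α is.
def run(update_sign, alpha=0.0001, iters=50):
    # Gradient descent on the toy cost J(w) = w**2; update_sign = -1 is the
    # correct minus-sign update, +1 simulates a sign bug in the implementation.
    w, prev_cost = 1.0, 1.0
    for _ in range(iters):
        w = w + update_sign * alpha * 2 * w
        cost = w ** 2
        if cost > prev_cost:
            return "cost increased -> suspect a bug, not the learning rate"
        prev_cost = cost
    return "cost decreased every iteration -> implementation looks correct"
print("correct update:", run(update_sign=-1))
print("buggy '+' update:", run(update_sign=+1))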
If the cost still misbehaves with a tiny α, work through this checklist (a gradient-check sketch follows):
- Verify Update Rule: Ensure the parameter updates use a minus sign
- Check Derivatives: Confirm the gradient calculations are correct
- Test Small α: Use a tiny learning rate to verify the implementation
- Examine Data: Ensure features are properly preprocessed
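One way to carry out the derivative check is a numerical gradient check, comparing the hand-coded gradient against a centered finite-difference estimate. An illustrative sketch on the toy cost J(w) = w² (not from the original notes):
def cost(w):
    return w ** 2
def analytic_grad(w):
    return 2 * w                               # the derivative coded by hand
def numerical_grad(w, eps=1e-6):
    # Centered finite difference: (J(w + eps) - J(w - eps)) / (2 * eps)
    return (cost(w + eps) - cost(w - eps)) / (2 * eps)
w = 1.7
print("analytic:  ", analytic_grad(w))
print("numerical: ", numerical_grad(w))
print("agreement: ", abs(analytic_grad(w) - numerical_grad(w)) < 1e-4)
If the two values disagree noticeably, the gradient code is the likely culprit rather than the learning rate.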
Choosing the Learning Rate
If α is too small:
- Problem: Very slow convergence
- Result: Many iterations needed
- Impact: Inefficient training time
If α is too large:
- Problem: Overshooting the minimum
- Result: Cost increases or fluctuates
- Impact: Algorithm doesn't converge
If α is well chosen:
- Result: Rapid, consistent decrease in cost
- Goal: Find the largest α that converges reliably
Test learning rates in roughly 3× increments:
alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
for alpha in alpha_candidates:
    # Run gradient descent for a handful of iterations with this alpha
    # and record the cost history (a complete runnable sketch follows)
    ...
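A fuller, self-contained sketch of the sweep (assumes NumPy; the synthetic data, function names, and the cost_histories dict are illustrative choices, not part of the original notes):
import numpy as np

# Small synthetic univariate-regression dataset: y roughly 4 + 3x plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=100)
y = 4 + 3 * X + rng.normal(0, 0.5, size=100)

def compute_cost(w, b, X, y):
    # Mean squared error with the conventional 1/(2m) factor
    m = len(y)
    err = w * X + b - y
    return (err @ err) / (2 * m)

def gradient_descent(X, y, alpha, num_iters=100):
    # Batch gradient descent for univariate linear regression;
    # returns the cost recorded after every iteration.
    m = len(y)
    w, b = 0.0, 0.0
    costs = []
    for _ in range(num_iters):
        err = w * X + b - y
        dj_dw = (err @ X) / m
        dj_db = err.mean()
        w -= alpha * dj_dw                     # note the minus sign
        b -= alpha * dj_db
        costs.append(compute_cost(w, b, X, y))
    return costs

alpha_candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0]
cost_histories = {}
for alpha in alpha_candidates:
    costs = gradient_descent(X, y, alpha, num_iters=100)
    cost_histories[alpha] = costs
    decreasing = all(cur <= prev for prev, cur in zip(costs, costs[1:]))
    print(f"alpha={alpha}: final cost={costs[-1]:.4f}, always decreasing={decreasing}")
With this setup the cost should decrease for the smaller candidates and blow up for the largest, which marks the boundaries described next.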
- Find Boundaries: Identify α that’s too small and α that’s too large
- Choose Optimal: Pick the largest value that decreases cost rapidly but consistently (one possible selection rule is sketched below)
- Verify Stability: Ensure chosen α works reliably across multiple runs
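One possible selection rule, again only a sketch: keep the candidates whose cost never increased, then take the largest of them. It assumes the cost_histories dict built in the sweep sketch above.
def choose_alpha(cost_histories):
    # Largest alpha whose cost decreased (or held steady) at every recorded step
    stable = [alpha for alpha, costs in cost_histories.items()
              if all(cur <= prev for prev, cur in zip(costs, costs[1:]))]
    if not stable:
        raise ValueError("no candidate converged - extend the search to smaller values")
    return max(stable)

# e.g. best_alpha = choose_alpha(cost_histories)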
- Duration: Run each α for a small number of iterations
- Comparison: Compare learning curves side by side (a plotting sketch follows this list)
- Efficiency: Quick way to identify promising learning rates
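Plotting the recorded histories on one set of axes makes the side-by-side comparison easy. A sketch assuming Matplotlib and the cost_histories dict from the sweep sketch:
import matplotlib.pyplot as plt

def plot_learning_curves(cost_histories):
    # One curve per candidate alpha, on shared axes for direct comparison
    for alpha, costs in sorted(cost_histories.items()):
        plt.plot(costs, label=f"alpha = {alpha}")
    plt.xlabel("iteration")
    plt.ylabel("cost J")
    plt.yscale("log")    # keeps diverging and converging runs visible together
    plt.legend()
    plt.show()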
- Avoid Extremes: Not the smallest working α, not the largest that just barely works
- Prioritize Speed: Choose α that decreases cost most rapidly while remaining stable
- Safety Margin: Pick a value slightly smaller than the largest α that works
- Properly Scaled Features: Can use larger learning rates
- Unscaled Features: May require smaller learning rates
- Recommendation: Always scale features first, then tune the learning rate (a minimal scaling sketch follows)
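A minimal z-score scaling sketch (one common scaling choice; assumes NumPy and a 2-D feature matrix X that is not defined in these notes):
import numpy as np

def zscore_scale(X):
    # Scale each feature (column) to zero mean and unit standard deviation.
    # Assumes no feature is constant (zero standard deviation).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Scale first, then tune alpha on the scaled features; keep mu and sigma
# so the same transformation can be applied to new examples later.
# X_scaled, mu, sigma = zscore_scale(X)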
- Dataset Size: Larger datasets may allow larger learning rates
- Feature Count: More features may require smaller learning rates
- Problem Complexity: Some problems are inherently more sensitive
- Continuous Monitoring: Watch learning curves throughout training
- Early Detection: Spot problems before wasting computational time
- Dynamic Adjustment: Some advanced techniques adjust the learning rate during training (a simple decay scheme is sketched below)
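As one illustration of dynamic adjustment (a simple scheme chosen for this sketch, not something prescribed by these notes), time-based decay shrinks α as training progresses:
def decayed_alpha(alpha0, iteration, decay_rate=0.01):
    # Time-based decay: the effective learning rate shrinks as iterations accumulate
    return alpha0 / (1 + decay_rate * iteration)

for it in (0, 100, 1000):
    print(it, decayed_alpha(0.1, it))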
Conservative Choice
Choose a learning rate that works reliably even if it is not the fastest
Empirical Testing
Always test multiple values rather than guessing
Visual Validation
Trust learning curves over theoretical recommendations
Choosing an appropriate learning rate is crucial for effective gradient descent. A systematic approach that combines debugging checks, empirical testing of multiple α values, and careful monitoring of learning curves leads to reliable, efficient training.