
Learning Rate

The learning rate α has a huge impact on the efficiency of gradient descent. A poor choice can make the algorithm fail completely or converge very slowly.

Learning Rate in Update Rule
w = w - α * (d/dw)J(w)
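
To make the update rule concrete, here is a minimal sketch in Python. The cost J(w) = (w - 3)² and its derivative are assumed toy examples, not anything from the notes; the update line itself is exactly the rule above.

# Toy cost J(w) = (w - 3)**2; its derivative is 2*(w - 3)  (assumed example)
def dJ_dw(w):
    return 2 * (w - 3)

alpha = 0.1   # learning rate α
w = 0.0       # initial guess

for _ in range(100):
    w = w - alpha * dJ_dw(w)   # w = w - α * (d/dw)J(w)

print(w)   # approaches 3.0, the minimizer of the toy cost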

Problem: Extremely Slow Convergence

Behavior: Takes very tiny baby steps toward the minimum
Example: α = 0.0000001 (a very small number)
Result: Each step barely moves the parameter
Consequence: Requires an enormous number of iterations to reach the minimum (see the sketch after the summary below)

  • Step size: Minuscule movements on cost function
  • Progress: Gradual decrease in cost, but extremely slow
  • Iterations: May need thousands or millions of steps
  • Time: Takes very long time to complete training

Summary: Algorithm works but is impractically slow.
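
To see just how slow this is, rerun the toy update from above with a very small α (the cost J(w) = (w - 3)² is again an assumed example):

def dJ_dw(w):
    return 2 * (w - 3)

alpha = 1e-7   # extremely small learning rate
w = 0.0

for _ in range(1000):
    w = w - alpha * dJ_dw(w)

print(w)   # still only about 0.0006 after 1000 iterations; millions more are needed to get near 3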

Problem: Overshooting and Divergence

Behavior: Takes huge steps that overshoot the minimum
Result: The cost may increase instead of decrease
Consequence: The algorithm fails to converge and may diverge completely (the numbered steps below trace a typical failure; a short sketch after them reproduces it)

  1. Initial position: Start relatively close to minimum
  2. First step: Large α causes massive overshoot to opposite side
  3. Second step: Another massive overshoot in opposite direction
  4. Continued steps: Gets progressively further from minimum
  5. Final result: Complete failure to find optimal parameters
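
The same assumed toy cost J(w) = (w - 3)² reproduces this failure: for this particular cost, any α greater than 1 makes each step land farther from the minimum than the last.

def J(w):
    return (w - 3) ** 2

def dJ_dw(w):
    return 2 * (w - 3)

alpha = 1.1    # too large for this cost
w = 2.5        # start relatively close to the minimum at w = 3

for i in range(5):
    w = w - alpha * dJ_dw(w)
    print(i, round(w, 3), round(J(w), 3))   # w oscillates around 3 with growing amplitude; J(w) increases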

Scenario: Parameter w is already at a local minimum
Derivative: d/dw J(w) = 0 (the slope is zero at the minimum)
Update calculation: w = w - α × 0 = w
Result: No change in the parameter value

Automatic Stopping

At the minimum: Gradient descent leaves the parameters unchanged
Reason: The derivative equals zero, so the update term is zero
Benefit: The algorithm naturally stops when the optimal point is reached
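
With the assumed toy cost J(w) = (w - 3)², the minimum is at w = 3 and the derivative there is zero, so a gradient descent update is a no-op:

def dJ_dw(w):
    return 2 * (w - 3)

alpha = 0.1
w = 3.0                        # already at the minimum
w_new = w - alpha * dJ_dw(w)   # derivative is 0, so the update term is 0
print(w_new)                   # 3.0 -- the parameter is unchanged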

Even with a fixed learning rate α, gradient descent automatically adjusts its effective step size:

  1. Far from minimum: Large derivative → large steps
  2. Getting closer: Smaller derivative → smaller steps
  3. Near minimum: Very small derivative → very small steps
  4. At minimum: Zero derivative → no movement

Early iterations: Large steps while far from the minimum
Later iterations: Progressively smaller steps while approaching the minimum
Final iterations: Tiny steps as the derivative approaches zero
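
Printing the size of each update for the same assumed toy cost shows the steps shrinking even though α never changes:

def dJ_dw(w):
    return 2 * (w - 3)

alpha = 0.1
w = 0.0

for i in range(10):
    step = alpha * dJ_dw(w)
    w = w - step
    print(i, abs(step))   # |step| shrinks every iteration because the derivative shrinks, while α stays fixed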

Practical guidelines for choosing α:

  • Start moderate: Try values like 0.01, 0.1, or 0.001
  • Monitor cost: Ensure the cost decreases with each iteration (the sweep sketch after this list automates the check)
  • Adjust if needed: Increase α if progress is too slow, decrease it if the cost is diverging
  • Use validation: Test on separate data to avoid overfitting
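
One way to put these guidelines into practice is a small sweep: run a few iterations with each candidate α and check whether the cost is decreasing. The cost and its derivative below are the same assumed toy example used earlier.

def J(w):
    return (w - 3) ** 2

def dJ_dw(w):
    return 2 * (w - 3)

for alpha in [0.001, 0.01, 0.1, 1.1]:
    w = 0.0
    costs = []
    for _ in range(20):
        w = w - alpha * dJ_dw(w)
        costs.append(J(w))
    trend = "decreasing" if costs[-1] < costs[0] else "increasing (diverging)"
    print(f"alpha={alpha}: final cost {costs[-1]:.4g}, {trend}")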

Fixed α works: There is no need to manually adjust the learning rate during training
Automatic adaptation: The step size naturally decreases as the minimum is approached
Convergence guarantee: For a convex cost function (such as the squared-error cost used in linear regression), gradient descent reaches the global minimum with an appropriately chosen α

Understanding learning rate behavior helps choose appropriate values and diagnose problems when gradient descent isn’t working as expected.