The learning rate α has a huge impact on gradient descent efficiency. A poorly chosen value can cause the algorithm to fail completely or to run very slowly.
w = w - α * (d/dw)J(w)
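As a minimal sketch of this update rule, the snippet below applies one step to an assumed example cost J(w) = w², so dJ/dw = 2w; the learning rate and starting value are illustrative, not from the original material.

    # One gradient descent update on an assumed cost J(w) = w**2.
    def dJ_dw(w):
        return 2 * w              # derivative of the assumed cost

    alpha = 0.1                   # illustrative learning rate
    w = 3.0                       # illustrative starting value
    w = w - alpha * dJ_dw(w)      # update moves w from 3.0 to 2.4, toward the minimum at 0
    print(w)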
Problem: Extremely Slow Convergence
Behavior: Takes very tiny baby steps toward the minimum
Example: α = 0.0000001 (a very small number)
Result: Each step barely moves the parameter
Consequence: Requires an enormous number of iterations to reach the minimum
Summary: Algorithm works but is impractically slow.
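A rough sketch of this failure mode, using the same assumed cost J(w) = w²: with an extremely small α the parameter hardly moves, even after many iterations.

    # Assumed cost J(w) = w**2, so dJ/dw = 2*w.
    alpha = 0.0000001             # extremely small learning rate
    w = 3.0
    for _ in range(1000):
        w = w - alpha * (2 * w)   # each step shrinks w by a factor of (1 - 2*alpha)
    print(w)                      # still about 2.9994: barely moved after 1000 iterations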
Problem: Overshooting and Divergence
Behavior: Takes huge steps that overshoot the minimum
Result: Cost may increase instead of decrease
Consequence: Algorithm fails to converge, may diverge completely
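A sketch of overshooting on the same assumed cost J(w) = w²: with an assumed α = 1.5, each update lands on the other side of the minimum with a larger magnitude, so the cost grows.

    # Assumed cost J(w) = w**2; alpha = 1.5 is too large for this cost.
    alpha = 1.5
    w = 3.0
    for _ in range(5):
        w = w - alpha * (2 * w)   # equivalent to w = -2 * w, so |w| doubles each step
        print(w, w ** 2)          # w oscillates (-6, 12, -24, ...) and the cost keeps increasing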
Scenario: Parameter w already at a local minimum
Derivative: d/dw J(w) = 0 (slope is zero at the minimum)
Update calculation: w = w - α × 0 = w
Result: No change in parameter value
Automatic Stopping
At minimum: Gradient descent leaves parameters unchanged
Reason: Derivative equals zero, so the update term is zero
Benefit: Algorithm naturally stops when the optimal point is reached
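A sketch of this stopping behavior with an assumed cost J(w) = (w - 2)², whose minimum is at w = 2: the derivative there is zero, so the update leaves w unchanged.

    # Assumed cost J(w) = (w - 2)**2, so dJ/dw = 2*(w - 2) and the minimum is at w = 2.
    alpha = 0.1
    w = 2.0                       # parameter already at the minimum
    grad = 2 * (w - 2)            # derivative equals zero here
    w = w - alpha * grad          # update term is alpha * 0, so w stays at 2.0
    print(grad, w)                # 0.0  2.0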
Even with a fixed learning rate α, gradient descent automatically adjusts its effective step size:
Early iterations: Large steps when far from the minimum
Later iterations: Progressively smaller steps as the minimum is approached
Final iterations: Tiny steps as the derivative approaches zero
Steeper slopes: Larger derivatives → bigger steps
Flatter slopes: Smaller derivatives → smaller steps
Mathematical insight: The derivative naturally encodes how far the parameter is from the minimum
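The shrinking steps can be seen in a short sketch on the assumed cost J(w) = w² with a fixed α: the step α · dJ/dw gets smaller on its own as w nears the minimum, with no manual schedule.

    # Assumed cost J(w) = w**2, fixed alpha; no manual learning rate adjustment.
    alpha = 0.1
    w = 3.0
    for i in range(10):
        step = alpha * (2 * w)    # step shrinks because the derivative 2*w shrinks
        w = w - step
        print(f"iteration {i}: step = {step:.4f}, w = {w:.4f}")
    # steps: 0.6000, 0.4800, 0.3840, ... each one smaller than the last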
Fixed α works: No need to manually adjust the learning rate during training
Automatic adaptation: Step size naturally decreases as the minimum is approached
Convergence guarantee: For convex cost functions (like the squared error cost of linear regression), gradient descent will reach the global minimum with an appropriate α
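To illustrate that convergence point, here is a sketch of batch gradient descent on a tiny made-up linear regression dataset (points on the line y = 2x) with the squared error cost; the data, α, and iteration count are illustrative assumptions.

    # Tiny assumed dataset lying on y = 2x; cost is (1/2m) * sum((w*x + b - y)**2).
    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.0, 4.0, 6.0, 8.0]
    w, b, alpha, m = 0.0, 0.0, 0.05, len(x)
    for _ in range(2000):
        dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
        dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update of both parameters
    print(round(w, 3), round(b, 3))                   # approaches the global minimum w ≈ 2, b ≈ 0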
Understanding learning rate behavior helps choose appropriate values and diagnose problems when gradient descent isn’t working as expected.