The learning rate α has a huge impact on gradient descent efficiency. A poorly chosen value can cause the algorithm to fail completely or to run very slowly.
w = w - α * (d/dw)J(w)
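As a minimal sketch of this update rule, the snippet below applies one step to an assumed example cost J(w) = w², so dJ/dw = 2w; the learning rate and starting value are illustrative, not from the original material.

    # One gradient descent update on an assumed cost J(w) = w**2.
    def dJ_dw(w):
        return 2 * w              # derivative of the assumed cost

    alpha = 0.1                   # illustrative learning rate
    w = 3.0                       # illustrative starting value
    w = w - alpha * dJ_dw(w)      # update moves w from 3.0 to 2.4, toward the minimum at 0
    print(w)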
Problem: Extremely Slow Convergence
Behavior: Takes very tiny baby steps toward the minimum
Example: α = 0.0000001 (a very small number)
Result: Each step barely moves the parameter
Consequence: Requires an enormous number of iterations to reach the minimum
Summary: Algorithm works but is impractically slow.
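A rough sketch of this failure mode, using the same assumed cost J(w) = w²: with an extremely small α the parameter hardly moves, even after many iterations.

    # Assumed cost J(w) = w**2, so dJ/dw = 2*w.
    alpha = 0.0000001             # extremely small learning rate
    w = 3.0
    for _ in range(1000):
        w = w - alpha * (2 * w)   # each step shrinks w by a factor of (1 - 2*alpha)
    print(w)                      # still about 2.9994: barely moved after 1000 iterations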
Problem: Overshooting and Divergence
Behavior: Takes huge steps that overshoot the minimum
Result: Cost may increase instead of decrease
Consequence: Algorithm fails to converge, may diverge completely
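A sketch of overshooting on the same assumed cost J(w) = w²: with an assumed α = 1.5, each update lands on the other side of the minimum with a larger magnitude, so the cost grows.

    # Assumed cost J(w) = w**2; alpha = 1.5 is too large for this cost.
    alpha = 1.5
    w = 3.0
    for _ in range(5):
        w = w - alpha * (2 * w)   # equivalent to w = -2 * w, so |w| doubles each step
        print(w, w ** 2)          # w oscillates (-6, 12, -24, ...) and the cost keeps increasing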
Scenario: Parameter w already at a local minimum
Derivative: d/dw J(w) = 0 (slope is zero at the minimum)
Update calculation: w = w - α × 0 = w
Result: No change in parameter value
Automatic Stopping
At minimum: Gradient descent leaves parameters unchanged
Reason: Derivative equals zero, so the update term is zero
Benefit: Algorithm naturally stops when the optimal point is reached
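A sketch of this stopping behavior with an assumed cost J(w) = (w - 2)², whose minimum is at w = 2: the derivative there is zero, so the update leaves w unchanged.

    # Assumed cost J(w) = (w - 2)**2, so dJ/dw = 2*(w - 2) and the minimum is at w = 2.
    alpha = 0.1
    w = 2.0                       # parameter already at the minimum
    grad = 2 * (w - 2)            # derivative equals zero here
    w = w - alpha * grad          # update term is alpha * 0, so w stays at 2.0
    print(grad, w)                # 0.0  2.0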
Even with a fixed learning rate α, gradient descent automatically adjusts its effective step size:
Early iterations: Large steps when far from the minimum
Later iterations: Progressively smaller steps as the minimum is approached
Final iterations: Tiny steps as the derivative approaches zero
Steeper slopes: Larger derivatives → bigger steps
Flatter slopes: Smaller derivatives → smaller steps
Mathematical insight: The derivative naturally encodes how far the parameter is from the minimum
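The shrinking steps can be seen in a short sketch on the assumed cost J(w) = w² with a fixed α: the step α · dJ/dw gets smaller on its own as w nears the minimum, with no manual schedule.

    # Assumed cost J(w) = w**2, fixed alpha; no manual learning rate adjustment.
    alpha = 0.1
    w = 3.0
    for i in range(10):
        step = alpha * (2 * w)    # step shrinks because the derivative 2*w shrinks
        w = w - step
        print(f"iteration {i}: step = {step:.4f}, w = {w:.4f}")
    # steps: 0.6000, 0.4800, 0.3840, ... each one smaller than the last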
Fixed α works: No need to manually adjust the learning rate during training
Automatic adaptation: Step size naturally decreases as the minimum is approached
Convergence guarantee: For convex cost functions (like the squared error cost of linear regression), gradient descent will reach the global minimum with an appropriate α
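To illustrate that convergence point, here is a sketch of batch gradient descent on a tiny made-up linear regression dataset (points on the line y = 2x) with the squared error cost; the data, α, and iteration count are illustrative assumptions.

    # Tiny assumed dataset lying on y = 2x; cost is (1/2m) * sum((w*x + b - y)**2).
    x = [1.0, 2.0, 3.0, 4.0]
    y = [2.0, 4.0, 6.0, 8.0]
    w, b, alpha, m = 0.0, 0.0, 0.05, len(x)
    for _ in range(2000):
        dj_dw = sum((w * xi + b - yi) * xi for xi, yi in zip(x, y)) / m
        dj_db = sum((w * xi + b - yi) for xi, yi in zip(x, y)) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update of both parameters
    print(round(w, 3), round(b, 3))                   # approaches the global minimum w ≈ 2, b ≈ 0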
Understanding learning rate behavior helps choose appropriate values and diagnose problems when gradient descent isn’t working as expected.