Pablo Rodriguez

Gradient Descent

Gradient descent is a systematic algorithm for finding parameter values that minimize the cost function J(w,b). This fundamental algorithm is used throughout machine learning, from linear regression to advanced neural networks and deep learning models.

Universal Application

  • Linear regression: Minimize squared error cost function
  • Neural networks: Train deep learning models
  • General optimization: Minimize any differentiable function
  • Industry standard: Used across all of machine learning

Gradient descent can minimize functions with multiple parameters:

  • J(w,b): Two parameters (linear regression)
  • J(w₁, w₂, …, wₙ, b): Multiple parameters (complex models)

Objective: Find parameter values that give the smallest possible J value

  • Starting point: Choose initial guesses for parameters
  • Common choice: Set w = 0, b = 0 for linear regression
  • Impact: Starting values don’t matter much for linear regression

General procedure (a runnable sketch follows the analogy below):

  1. Calculate adjustments: Determine how to change parameters
  2. Update parameters: Modify w and b to reduce cost
  3. Repeat: Continue until cost stops decreasing significantly
  4. Convergence: Algorithm settles at or near minimum

Hill-descent analogy (imagine standing on a hill and walking down to the valley below):

  1. Look around 360°: Assess all possible directions
  2. Find steepest descent: Choose direction that goes downhill fastest
  3. Take a baby step: Move small distance in that direction
  4. Repeat process: From new position, again find steepest descent direction
  5. Continue: Until reaching valley bottom (local minimum)
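
To make the loop concrete, here is a minimal Python sketch (not code from the course): the two-parameter cost J(w, b) = (w - 2)² + (b + 1)² and its derivatives are made up purely so the steps above can be seen running end to end.

Gradient Descent Loop (Illustrative Sketch)
# Made-up cost J(w, b) = (w - 2)**2 + (b + 1)**2, whose minimum is at w = 2, b = -1.
def J(w, b):
    return (w - 2) ** 2 + (b + 1) ** 2

w, b = 0.0, 0.0      # starting point: initial guesses of zero
alpha = 0.1          # learning rate (step size)
prev_cost = J(w, b)

for step in range(10_000):
    dw = 2 * (w - 2)                          # 1. calculate adjustments (derivatives of this toy J)
    db = 2 * (b + 1)
    w, b = w - alpha * dw, b - alpha * db     # 2. update parameters to reduce cost
    cost = J(w, b)
    if abs(prev_cost - cost) < 1e-9:          # 3./4. repeat until cost stops decreasing significantly
        break
    prev_cost = cost

print(w, b)   # converges to approximately (2.0, -1.0)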

Some cost functions (not linear regression) can have multiple local minima:

  • Definition: Lowest point in a particular valley
  • Characteristic: Surrounded by higher points
  • Limitation: May not be the global optimum

  • Different starting positions: Can lead to different local minima
  • Example: Starting on the left side of the surface vs. the right side may result in reaching different valleys
  • Implication: Initial parameter values can affect the final solution

Convex Function Property

  • Squared error cost function: Always has a bowl shape (convex)
  • Single minimum: Only one global minimum exists
  • No local minima: Cannot get trapped in suboptimal solutions
  • Guaranteed convergence: Will always find the global optimum with a proper learning rate
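
For concreteness, the squared-error cost from earlier lessons can be written as below; the tiny dataset is made up and lies exactly on y = 2x + 1, so the cost is zero at w = 2, b = 1 and rises in every direction away from that point (the bowl shape described above).

Squared Error Cost (Toy Example)
# Made-up data lying exactly on y = 2x + 1, so J(2, 1) = 0 is the global minimum.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

def compute_cost(w, b, xs, ys):
    m = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

print(compute_cost(2.0, 1.0, x_train, y_train))   # 0.0  (the single global minimum)
print(compute_cost(2.5, 1.0, x_train, y_train))   # > 0: cost rises on one side
print(compute_cost(1.5, 1.0, x_train, y_train))   # > 0: and on the other side too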

Gradient descent is crucial because:

  • Automation: Eliminates need for manual parameter tuning
  • Efficiency: Systematically finds optimal solutions
  • Scalability: Works with any number of parameters
  • Foundation: Understanding enables work with complex models

The algorithm uses calculus concepts (derivatives) to determine:

  • Direction: Which way to adjust parameters
  • Magnitude: How much to change parameters
  • Efficiency: Fastest path to minimum

Next steps involve learning the mathematical implementation and applying gradient descent specifically to linear regression problems.

The gradient descent algorithm updates parameters using these equations:

Gradient Descent Update Rules
w = w - α * (∂/∂w)J(w,b)
b = b - α * (∂/∂b)J(w,b)
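
Translated into Python, a single application of these rules might look like the sketch below. The partial derivatives are estimated numerically here (a finite-difference approximation) only so the snippet is self-contained; the exact derivative formulas are introduced later.

One Update Step (Illustrative Sketch)
# Made-up data on the line y = 2x + 1; J is the squared-error cost from earlier.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [3.0, 5.0, 7.0, 9.0]

def J(w, b):
    m = len(x_train)
    return sum((w * x + b - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)

def dJ_dw(w, b, eps=1e-6):                        # numerical stand-in for ∂J/∂w
    return (J(w + eps, b) - J(w - eps, b)) / (2 * eps)

def dJ_db(w, b, eps=1e-6):                        # numerical stand-in for ∂J/∂b
    return (J(w, b + eps) - J(w, b - eps)) / (2 * eps)

w, b, alpha = 0.0, 0.0, 0.01
w, b = w - alpha * dJ_dw(w, b), b - alpha * dJ_db(w, b)   # the two update rules
print(w, b, J(w, b))   # cost is already lower than J(0, 0) = 20.5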

Assignment vs Mathematical Equality

  • Programming context: a = c means “store value c in variable a”
  • Example: a = a + 1 means “increment a by 1”
  • Mathematical context: a = c asserts that a and c are equal
  • Programming equality: Often written as a == c for testing
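
In Python, for example, the distinction looks like this (the “=” in the update rules above is the assignment kind):

Assignment vs Equality in Python
a = 5           # assignment: store the value 5 in variable a
a = a + 1       # assignment: take the current a, add 1, store it back (a is now 6)
print(a == 6)   # equality test: True, and a is unchanged
print(a == 7)   # equality test: False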

The learning rate:

  • Symbol: Greek letter alpha (α)
  • Typical values: Small positive number between 0 and 1 (e.g., 0.01)
  • Purpose: Controls the size of each step toward the minimum
  • Effect: Determines how aggressive the gradient descent procedure is

If α is too large:

  • Behavior: Very aggressive gradient descent
  • Steps: Large steps downhill
  • Risk: May overshoot the minimum
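
A quick illustration, using a made-up one-parameter cost J(w) = w² whose minimum is at w = 0: a small α walks steadily toward the minimum, while a large α overshoots it and ends up farther away on every step.

Effect of a Large Learning Rate (Illustrative Sketch)
def dJ_dw(w):          # derivative of the toy cost J(w) = w**2
    return 2 * w

for alpha in (0.1, 1.1):
    w = 1.0
    for _ in range(5):
        w = w - alpha * dJ_dw(w)
    print(alpha, w)
# alpha = 0.1 -> w ≈ 0.33   (moving toward the minimum at 0)
# alpha = 1.1 -> w ≈ -2.49  (each step overshoots; |w| keeps growing)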

The derivative term (∂/∂w)J(w,b):

  • Mathematical concept: Comes from calculus
  • Purpose: Determines direction and magnitude of parameter updates
  • Intuition: Tells you which direction to take your step
  • Combined with α: Determines both direction and step size

Since linear regression has two parameters, both must be updated:

Complete Parameter Updates
w = w - α * (∂/∂w)J(w,b)
b = b - α * (∂/∂b)J(w,b)

Note: The derivative terms are slightly different for w and b.
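
For reference, the derivative terms for the squared-error cost work out to the standard expressions sketched below (the course derives them in a later section); the only difference is the extra factor of x in the w term. The toy dataset is made up.

Derivative Terms for Linear Regression (Standard Result)
# ∂J/∂w = (1/m) * Σ (w*xᵢ + b - yᵢ) * xᵢ
# ∂J/∂b = (1/m) * Σ (w*xᵢ + b - yᵢ)
def compute_gradients(w, b, xs, ys):
    m = len(xs)
    dj_dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m   # extra factor of x
    dj_db = sum((w * x + b - y) for x, y in zip(xs, ys)) / m       # no factor of x
    return dj_dw, dj_db

x_train = [1.0, 2.0, 3.0, 4.0]   # made-up data on y = 2x + 1
y_train = [3.0, 5.0, 7.0, 9.0]
print(compute_gradients(0.0, 0.0, x_train, y_train))   # (-17.5, -6.0)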

Convergence:

  • Definition: Algorithm reaches a point where parameters no longer change significantly with additional steps
  • Indication: The algorithm has found a local minimum
  • Termination: Stop when convergence is achieved

Critical Implementation Detail

  • Requirement: Update both w and b simultaneously
  • Correct approach: Calculate both updates using current values, then apply both changes together
  • Incorrect approach: Update w first, then use the new w value to calculate the b update

Simultaneous Update (Correct)
temp_w = w - α * (∂/∂w)J(w,b)
temp_b = b - α * (∂/∂b)J(w,b)
w = temp_w
b = temp_b

Process:

  1. Calculate both updates using original w and b values
  2. Store results in temporary variables
  3. Simultaneously update both parameters
Non-Simultaneous Update (Incorrect)
temp_w = w - α * (∂/∂w)J(w,b)
w = temp_w # w updated first
temp_b = b - α * (∂/∂b)J(w,b) # uses new w value
b = temp_b

Problem: The updated w value affects the b calculation, creating a different algorithm with different properties.
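
In Python the difference is easy to see: a tuple assignment evaluates both right-hand sides with the current w and b before either variable changes, which is exactly the simultaneous update. The cost here is a made-up J(w, b) = (w + b - 1)², chosen only because its partial derivatives depend on both parameters, so the two orderings give visibly different results.

Simultaneous vs Sequential Update in Python (Illustrative Sketch)
def dJ_dw(w, b):                  # partial derivatives of the toy cost
    return 2 * (w + b - 1)

def dJ_db(w, b):
    return 2 * (w + b - 1)

alpha = 0.1

# Correct (simultaneous): both derivatives are evaluated at the old (w, b)
w, b = 0.0, 0.0
w, b = w - alpha * dJ_dw(w, b), b - alpha * dJ_db(w, b)
print(w, b)   # (0.2, 0.2)

# Incorrect (sequential): the b update already sees the modified w
w, b = 0.0, 0.0
w = w - alpha * dJ_dw(w, b)
b = b - alpha * dJ_db(w, b)       # called with the new w
print(w, b)   # ≈ (0.2, 0.16): a different result, i.e. a different algorithm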

Universality: Works for any function J, not just linear regression cost functions

Flexibility: Applies to models with any number of parameters

Foundation: Understanding gradient descent enables work with advanced models like neural networks

The next step involves understanding the derivative terms in detail, which will complete your ability to implement and apply gradient descent effectively.

To understand how gradient descent works, let’s examine a simplified version with one parameter w:

One-Parameter Gradient Descent
w = w - α * (d/dw)J(w)

This simplification helps visualize the algorithm’s behavior using 2D graphs instead of 3D.
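
Tracing a few iterations of this one-parameter version makes the behavior visible. The cost J(w) = (w - 3)² is made up, with its minimum at w = 3.

One-Parameter Trace (Illustrative Sketch)
def dJ_dw(w):              # derivative of the toy cost J(w) = (w - 3)**2
    return 2 * (w - 3)

w, alpha = 0.0, 0.1
for step in range(10):
    w = w - alpha * dJ_dw(w)
    print(step, round(w, 4))
# w moves 0.6, 1.08, 1.464, ... toward 3, taking smaller and smaller steps
# because the slope (and therefore the update) shrinks near the minimum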

Derivative as Slope

  • Tangent line: Straight line that touches the curve at a specific point
  • Slope calculation: Height divided by width of the triangle formed by the tangent line
  • Derivative value: Equals the slope of the tangent line at that point
  • Sign significance: Positive slope = upward direction, negative slope = downward direction
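
A quick numerical check of this idea, using a made-up cost J(w) = w²: the “rise over run” of a very short segment of the curve matches the analytic derivative, and its positive sign says the curve goes uphill to the right at that point.

Derivative as Slope (Numerical Check)
def J(w):
    return w ** 2            # made-up example cost

w0, eps = 2.0, 1e-6
slope = (J(w0 + eps) - J(w0 - eps)) / (2 * eps)   # height / width of a tiny triangle
print(slope)   # ≈ 4.0, matching d/dw w**2 = 2w at w = 2; positive means uphill to the right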

Positive derivative case:

  • Starting point: Right side of the cost function curve
  • Tangent line: Points up and to the right
  • Slope example: +2 (positive number)
  • Derivative: d/dw J(w) > 0
Positive Derivative Update
w_new = w - α * (positive_number)
w_new = w - positive_value
Result: w decreases

  • Effect: Moving left on the graph (decreasing w)
  • Result: Cost J decreases, moving toward the minimum
  • Conclusion: Correct direction for optimization

Negative derivative case:

  • Starting point: Left side of the cost function curve
  • Tangent line: Slopes down and to the right
  • Slope example: -2 (negative number)
  • Derivative: d/dw J(w) < 0
Negative Derivative Update
w_new = w - α * (negative_number)
w_new = w + positive_value
Result: w increases

  • Effect: Moving right on the graph (increasing w)
  • Result: Cost J decreases, moving toward the minimum
  • Conclusion: Correct direction for optimization

  • Positive derivative: Algorithm automatically moves left (decreases w)
  • Negative derivative: Algorithm automatically moves right (increases w)
  • No manual intervention: Direction is determined by the mathematics

The derivative term provides:

  1. Direction information: Sign (+ or -) indicates which way to move
  2. Magnitude information: Size indicates how steep the slope is
  3. Automatic adjustment: Combines with learning rate to determine step size

Why this works:

  • Right of minimum: Positive derivative → move left → approach minimum
  • Left of minimum: Negative derivative → move right → approach minimum
  • At minimum: Zero derivative → no movement → stay at optimal point
  • Self-correcting: Always moves toward minimum regardless of starting point
  • Mathematical foundation: Based on calculus principles
  • Reliable convergence: Systematic approach to optimization

The same principles apply to the full gradient descent with parameters w and b:

  • Each parameter has its own derivative
  • Each parameter moves in its optimal direction
  • Combined movement navigates toward the global minimum
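
Putting the pieces together, a complete sketch of gradient descent for univariate linear regression might look like this. The dataset is made up and lies on y = 2x + 1, and the derivative expressions are the standard ones for the squared-error cost noted earlier.

Gradient Descent for Linear Regression (Illustrative Sketch)
x_train = [1.0, 2.0, 3.0, 4.0]   # made-up data generated from y = 2x + 1
y_train = [3.0, 5.0, 7.0, 9.0]

def gradient_descent(xs, ys, alpha=0.01, num_steps=10_000):
    w, b = 0.0, 0.0
    m = len(xs)
    for _ in range(num_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        dj_dw = sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = sum(errors) / m
        w, b = w - alpha * dj_dw, b - alpha * dj_db   # simultaneous update
    return w, b

w, b = gradient_descent(x_train, y_train)
print(w, b)   # ≈ 2.0, 1.0 (recovering the line the toy data came from)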

Understanding this intuition helps explain why gradient descent is such a powerful and widely-used optimization algorithm in machine learning.