
Linear Regression

This brings together the linear regression model, squared error cost function, and gradient descent algorithm to create a complete learning system.

Complete Linear Regression System

  • Model: f(x) = wx + b
  • Cost Function: J(w,b) = 1/(2m) Σ(f(x^(i)) - y^(i))²
  • Optimization: Gradient descent to minimize J(w,b)
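
As a rough sketch, these pieces can be written directly in code. The snippet below uses Python with NumPy; the names predict and compute_cost are illustrative, not part of the original notes.

Model and Cost Function (Python sketch)
import numpy as np

def predict(x, w, b):
    """Model: f(x) = w*x + b for a single input feature."""
    return w * x + b

def compute_cost(x, y, w, b):
    """Squared error cost: J(w,b) = 1/(2m) * sum((f(x^(i)) - y^(i))^2)."""
    m = x.shape[0]
    errors = predict(x, w, b) - y
    return np.sum(errors ** 2) / (2 * m)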

The gradient descent algorithm requires computing these derivatives:

Linear Regression Derivatives
∂/∂w J(w,b) = (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)
∂/∂b J(w,b) = (1/m) Σ(f(x^(i)) - y^(i))

Key differences:

  • w derivative: Includes x^(i) term at the end
  • b derivative: Same formula without the x^(i) term
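
Translated directly into code, the two derivatives might be computed as follows (a sketch assuming NumPy arrays x and y holding the m training examples; compute_gradients is an illustrative name):

Derivative Computation (Python sketch)
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared error cost for f(x) = w*x + b."""
    m = x.shape[0]
    errors = (w * x + b) - y          # f(x^(i)) - y^(i) for every example
    dj_dw = np.sum(errors * x) / m    # w derivative includes the x^(i) term
    dj_db = np.sum(errors) / m        # b derivative omits the x^(i) term
    return dj_dw, dj_db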

Starting with the cost function:

Cost Function
J(w,b) = 1/(2m) Σ(f(x^(i)) - y^(i))²

Substituting f(x^(i)) = wx^(i) + b:

Expanded Cost Function
J(w,b) = 1/(2m) Σ(wx^(i) + b - y^(i))²

Using chain rule:

Chain Rule Application
∂/∂w J(w,b) = 1/(2m) Σ 2(wx^(i) + b - y^(i)) * x^(i)

The 2 from differentiation cancels with the 1/2 in the cost function:

Final w Derivative
∂/∂w J(w,b) = (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)

Similarly for b:

b Derivative Chain Rule
∂/∂b J(w,b) = 1/(2m) Σ 2(wx^(i) + b - y^(i)) * 1

After cancellation:

Final b Derivative
∂/∂b J(w,b) = (1/m) Σ(f(x^(i)) - y^(i))

Note: The cost function uses 1/(2m) rather than 1/m precisely so that the 2 produced by differentiation cancels, leaving cleaner derivative formulas.
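
A quick way to double-check these formulas is to compare the analytic derivatives against numerical finite-difference approximations on a tiny dataset (the data values below are made up purely for the check):

Numerical Derivative Check (Python sketch)
import numpy as np

# Tiny made-up dataset, used only to check the derivative formulas.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
w, b, eps = 0.5, 0.1, 1e-6
m = x.shape[0]

def cost(w, b):
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Analytic derivatives from the formulas above.
errors = w * x + b - y
dj_dw = np.sum(errors * x) / m
dj_db = np.sum(errors) / m

# Central finite-difference approximations.
dj_dw_num = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
dj_db_num = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)

print(dj_dw, dj_dw_num)  # the two values should agree to several decimals
print(dj_db, dj_db_num)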

Linear Regression Gradient Descent
Repeat until convergence:
w = w - α * (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)
b = b - α * (1/m) Σ(f(x^(i)) - y^(i))
where f(x^(i)) = w * x^(i) + b
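
In code, this loop could be sketched as follows (assuming NumPy arrays x and y, and a fixed iteration count in place of a formal convergence test; gradient_descent is an illustrative name):

Gradient Descent Loop (Python sketch)
import numpy as np

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Batch gradient descent for f(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    for _ in range(num_iters):
        errors = (w * x + b) - y           # f(x^(i)) - y^(i)
        dj_dw = np.sum(errors * x) / m
        dj_db = np.sum(errors) / m
        # Simultaneous update: both derivatives use the old w and b.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b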

Linear regression with squared error cost function has a special property:

Global Minimum Guarantee

  • Convex function: Bowl-shaped cost function
  • Single minimum: Only one global minimum exists
  • No local minima: Cannot get trapped in suboptimal solutions
  • Guaranteed convergence: Always finds the best possible solution

  • Neural networks: May have multiple local minima
  • Non-convex surfaces: Risk of getting stuck in suboptimal solutions
  • Linear regression: Always reaches global optimum with proper learning rate
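
One way to see why the squared error cost for linear regression is convex (a sketch of the reasoning, not from the original notes): all of its second derivatives are non-negative in the sense below, so the surface curves upward everywhere.

Why J(w,b) Is Convex (sketch)
∂²/∂w² J(w,b) = (1/m) Σ(x^(i))² ≥ 0
∂²/∂b² J(w,b) = 1
∂²/∂w∂b J(w,b) = (1/m) Σ x^(i)
Hessian determinant = (1/m) Σ(x^(i))² − ((1/m) Σ x^(i))² = variance of the x values ≥ 0

Because both diagonal second derivatives and the Hessian determinant are non-negative, the Hessian is positive semidefinite everywhere, which is exactly the condition for J(w,b) to be convex.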

  • Guaranteed: Will always find the global minimum
  • Condition: Learning rate must be chosen appropriately
  • Time: Depends on learning rate and data characteristics

  1. Initialize parameters: Start with w = 0, b = 0
  2. Set learning rate: Choose appropriate α (e.g., 0.01)
  3. Compute derivatives: Calculate both partial derivatives
  4. Update parameters: Apply gradient descent updates simultaneously
  5. Check convergence: Repeat until cost stops decreasing significantly
  6. Use model: Apply learned parameters for predictions
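
Putting these steps together end to end might look like the sketch below (the synthetic data, tolerance, and iteration cap are illustrative choices, not values from the notes):

End-to-End Training (Python sketch)
import numpy as np

# Illustrative training data, roughly following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.1, 6.9, 9.2])

# Steps 1-2: initialize parameters and set the learning rate.
w, b = 0.0, 0.0
alpha, tol, max_iters = 0.01, 1e-8, 100_000
m = x.shape[0]

prev_cost = float("inf")
for _ in range(max_iters):
    errors = (w * x + b) - y
    # Step 3: compute both partial derivatives.
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    # Step 4: simultaneous parameter update.
    w, b = w - alpha * dj_dw, b - alpha * dj_db
    # Step 5: stop once the cost is no longer decreasing significantly.
    cost = np.sum(((w * x + b) - y) ** 2) / (2 * m)
    if prev_cost - cost < tol:
        break
    prev_cost = cost

# Step 6: use the learned parameters for predictions.
print(f"w = {w:.3f}, b = {b:.3f}, prediction at x = 5: {w * 5 + b:.3f}")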

  • Automation: The parameters w and b are learned from the data rather than set by hand
  • Optimality: Finds the mathematically best fit to the training data
  • Efficiency: A systematic approach is faster than trial-and-error
  • Scalability: Works with datasets of any size

Combining linear regression with gradient descent creates a complete, automated system for finding the best straight-line fit to data. The convex nature of the squared error cost function guarantees finding the global optimum, making this a reliable and effective machine learning algorithm.