
Linear Regression

This brings together the linear regression model, squared error cost function, and gradient descent algorithm to create a complete learning system.

Complete Linear Regression System

  • Model: f(x) = wx + b
  • Cost Function: J(w,b) = 1/(2m) Σ(f(x^(i)) - y^(i))²
  • Optimization: Gradient descent to minimize J(w,b)
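
As a rough sketch, these pieces can be written directly in code. The snippet below uses Python with NumPy; the names predict and compute_cost are illustrative, not part of the original notes.

Model and Cost Function (Python sketch)
import numpy as np

def predict(x, w, b):
    """Model: f(x) = w*x + b for a single input feature."""
    return w * x + b

def compute_cost(x, y, w, b):
    """Squared error cost: J(w,b) = 1/(2m) * sum((f(x^(i)) - y^(i))^2)."""
    m = x.shape[0]
    errors = predict(x, w, b) - y
    return np.sum(errors ** 2) / (2 * m)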

The gradient descent algorithm requires computing these derivatives:

Linear Regression Derivatives
∂/∂w J(w,b) = (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)
∂/∂b J(w,b) = (1/m) Σ(f(x^(i)) - y^(i))

Key differences:

  • w derivative: Includes x^(i) term at the end
  • b derivative: Same formula without the x^(i) term
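
Translated directly into code, the two derivatives might be computed as follows (a sketch assuming NumPy arrays x and y holding the m training examples; compute_gradients is an illustrative name):

Derivative Computation (Python sketch)
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared error cost for f(x) = w*x + b."""
    m = x.shape[0]
    errors = (w * x + b) - y          # f(x^(i)) - y^(i) for every example
    dj_dw = np.sum(errors * x) / m    # w derivative includes the x^(i) term
    dj_db = np.sum(errors) / m        # b derivative omits the x^(i) term
    return dj_dw, dj_db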

Starting with the cost function:

Cost Function
J(w,b) = 1/(2m) Σ(f(x^(i)) - y^(i))²

Substituting f(x^(i)) = wx^(i) + b:

Expanded Cost Function
J(w,b) = 1/(2m) Σ(wx^(i) + b - y^(i))²

Using chain rule:

Chain Rule Application
∂/∂w J(w,b) = 1/(2m) Σ 2(wx^(i) + b - y^(i)) * x^(i)

The 2 from differentiation cancels with the 1/2 in the cost function:

Final w Derivative
∂/∂w J(w,b) = (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)

Similarly for b:

b Derivative Chain Rule
∂/∂b J(w,b) = 1/(2m) Σ 2(wx^(i) + b - y^(i)) * 1

After cancellation:

Final b Derivative
∂/∂b J(w,b) = (1/m) Σ(f(x^(i)) - y^(i))

Note: The cost function uses 1/(2m) rather than 1/m precisely so that the 2 produced by differentiation cancels, leaving cleaner derivative formulas.
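
A quick way to double-check these formulas is to compare the analytic derivatives against numerical finite-difference approximations on a tiny dataset (the data values below are made up purely for the check):

Numerical Derivative Check (Python sketch)
import numpy as np

# Tiny made-up dataset, used only to check the derivative formulas.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
w, b, eps = 0.5, 0.1, 1e-6
m = x.shape[0]

def cost(w, b):
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Analytic derivatives from the formulas above.
errors = w * x + b - y
dj_dw = np.sum(errors * x) / m
dj_db = np.sum(errors) / m

# Central finite-difference approximations.
dj_dw_num = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
dj_db_num = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)

print(dj_dw, dj_dw_num)  # the two values should agree to several decimals
print(dj_db, dj_db_num)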

Linear Regression Gradient Descent
Repeat until convergence:
w = w - α * (1/m) Σ(f(x^(i)) - y^(i)) * x^(i)
b = b - α * (1/m) Σ(f(x^(i)) - y^(i))
where f(x^(i)) = w * x^(i) + b
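
In code, this loop could be sketched as follows (assuming NumPy arrays x and y, and a fixed iteration count in place of a formal convergence test; gradient_descent is an illustrative name):

Gradient Descent Loop (Python sketch)
import numpy as np

def gradient_descent(x, y, w, b, alpha, num_iters):
    """Batch gradient descent for f(x) = w*x + b with squared error cost."""
    m = x.shape[0]
    for _ in range(num_iters):
        errors = (w * x + b) - y           # f(x^(i)) - y^(i)
        dj_dw = np.sum(errors * x) / m
        dj_db = np.sum(errors) / m
        # Simultaneous update: both derivatives use the old w and b.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b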

Linear regression with squared error cost function has a special property:

Global Minimum Guarantee

  • Convex function: Bowl-shaped cost function
  • Single minimum: Only one global minimum exists
  • No local minima: Cannot get trapped in suboptimal solutions
  • Guaranteed convergence: Always finds the best possible solution

  • Neural networks: May have multiple local minima
  • Non-convex surfaces: Risk of getting stuck in suboptimal solutions
  • Linear regression: Always reaches global optimum with proper learning rate
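
One way to see why the squared error cost for linear regression is convex (a sketch of the reasoning, not from the original notes): all of its second derivatives are non-negative in the sense below, so the surface curves upward everywhere.

Why J(w,b) Is Convex (sketch)
∂²/∂w² J(w,b) = (1/m) Σ(x^(i))² ≥ 0
∂²/∂b² J(w,b) = 1
∂²/∂w∂b J(w,b) = (1/m) Σ x^(i)
Hessian determinant = (1/m) Σ(x^(i))² − ((1/m) Σ x^(i))² = variance of the x values ≥ 0

Because both diagonal second derivatives and the Hessian determinant are non-negative, the Hessian is positive semidefinite everywhere, which is exactly the condition for J(w,b) to be convex.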

  • Guaranteed: Will always find the global minimum
  • Condition: Learning rate must be chosen appropriately
  • Time: Depends on learning rate and data characteristics

  1. Initialize parameters: Start with w = 0, b = 0
  2. Set learning rate: Choose appropriate α (e.g., 0.01)
  3. Compute derivatives: Calculate both partial derivatives
  4. Update parameters: Apply gradient descent updates simultaneously
  5. Check convergence: Repeat until cost stops decreasing significantly
  6. Use model: Apply learned parameters for predictions
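
Putting these steps together end to end might look like the sketch below (the synthetic data, tolerance, and iteration cap are illustrative choices, not values from the notes):

End-to-End Training (Python sketch)
import numpy as np

# Illustrative training data, roughly following y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.1, 6.9, 9.2])

# Steps 1-2: initialize parameters and set the learning rate.
w, b = 0.0, 0.0
alpha, tol, max_iters = 0.01, 1e-8, 100_000
m = x.shape[0]

prev_cost = float("inf")
for _ in range(max_iters):
    errors = (w * x + b) - y
    # Step 3: compute both partial derivatives.
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    # Step 4: simultaneous parameter update.
    w, b = w - alpha * dj_dw, b - alpha * dj_db
    # Step 5: stop once the cost is no longer decreasing significantly.
    cost = np.sum(((w * x + b) - y) ** 2) / (2 * m)
    if prev_cost - cost < tol:
        break
    prev_cost = cost

# Step 6: use the learned parameters for predictions.
print(f"w = {w:.3f}, b = {b:.3f}, prediction at x = 5: {w * 5 + b:.3f}")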

  • Automation: The parameters w and b are learned from the data rather than set by hand
  • Optimality: Finds the mathematically best fit to the training data
  • Efficiency: A systematic approach is faster than trial-and-error
  • Scalability: Works with datasets of any size

Combining linear regression with gradient descent creates a complete, automated system for finding the best straight-line fit to data. The convex nature of the squared error cost function guarantees finding the global optimum, making this a reliable and effective machine learning algorithm.