Gradient Descent
Gradient Descent for Multiple Features
Vector Notation for Parameters
Instead of treating w₁, w₂, …, wₙ as separate parameters, we collect them into a vector w of length n. The model parameters are now (a short NumPy sketch follows the list):
- w: Vector of weights [w₁, w₂, …, wₙ]
- b: Scalar bias term
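As a minimal NumPy sketch (the parameter values and the choice of n = 4 features are made up purely for illustration), the parameters can be stored as one array plus a scalar:

```python
import numpy as np

# Hypothetical parameters for a model with n = 4 features; the numbers
# are placeholders, not learned values.
w = np.array([0.39, 18.75, -53.36, -26.42])  # weight vector w, shape (4,)
b = 785.18                                   # scalar bias b
```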
Model Representation
Vector Form
f_{w⃗,b}(x⃗) = w⃗ · x⃗ + b
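As a hedged sketch, this vector form maps directly onto a NumPy dot product; the function name `predict` and the feature values are illustrative, not part of any particular library:

```python
import numpy as np

def predict(x, w, b):
    """Single-example prediction: f_{w,b}(x) = w · x + b."""
    return np.dot(w, x) + b

x = np.array([2104.0, 5.0, 1.0, 45.0])        # one example with n = 4 features (placeholder values)
w = np.array([0.39, 18.75, -53.36, -26.42])
b = 785.18
print(predict(x, w, b))                       # scalar prediction
```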
Cost Function
J(w⃗, b) now takes the weight vector w⃗ and the scalar b as inputs, rather than individual weights.
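A minimal sketch of the squared-error cost in this vectorized setting, assuming X is an (m, n) array of training examples and y a length-m target vector (names are illustrative):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1/(2m)) * Σ (f(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    errors = X @ w + b - y          # prediction error for every example, shape (m,)
    return (errors ** 2).sum() / (2 * m)
```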
Gradient Descent Updates
Single Feature (Review)
For univariate regression:
w = w - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
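A compact sketch of one univariate update step, assuming x and y are length-m NumPy arrays and alpha is the learning rate (function and variable names are illustrative):

```python
import numpy as np

def update_step_univariate(x, y, w, b, alpha):
    """One simultaneous gradient-descent update for single-feature regression."""
    m = x.shape[0]
    errors = w * x + b - y                       # f(x^(i)) - y^(i) for all examples
    w_new = w - alpha * (errors * x).sum() / m   # weight update uses error * feature
    b_new = b - alpha * errors.sum() / m         # bias update uses error alone
    return w_new, b_new
```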
Multiple Features
For each parameter wⱼ (j = 1 to n):
w_j = w_j - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x_j^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
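The sketch below mirrors these formulas literally, looping over each weight wⱼ; X is assumed to be an (m, n) float array with one row per training example:

```python
import numpy as np

def update_step_multifeature(X, y, w, b, alpha):
    """One simultaneous update: each w_j uses its own feature column X[:, j]."""
    m, n = X.shape
    errors = X @ w + b - y                       # f(x^(i)) - y^(i), shape (m,)
    w_new = w.copy()
    for j in range(n):                           # update each weight with its own feature
        w_new[j] = w[j] - alpha * (errors * X[:, j]).sum() / m
    b_new = b - alpha * errors.sum() / m
    return w_new, b_new
```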
Key Differences from Single Feature
Error Term
Section titled “Error Term”- Same Structure: (f(x⁽ⁱ⁾) - y⁽ⁱ⁾) remains the prediction error
- Vector Operations: w⃗ and x⃗ are now vectors, so predictions use the dot product w⃗ · x⃗
Feature Indexing
- Single Feature: Used x⁽ⁱ⁾ directly
- Multiple Features: Use x_j⁽ⁱ⁾ for the j-th feature of the i-th example
Update Pattern
Each weight wⱼ is updated using its corresponding feature x_j⁽ⁱ⁾ from each training example.
Implementation Considerations
Vectorized Implementation
- Use NumPy operations for efficient computation (see the sketch after this list)
- Leverage parallel processing capabilities
- Handle large feature sets effectively
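As a hedged example of what the vectorized version looks like, the explicit loop over features can be replaced with matrix operations (X again assumed to be an (m, n) NumPy array):

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Vectorized gradients of the squared-error cost for all weights at once."""
    m = X.shape[0]
    errors = X @ w + b - y          # shape (m,)
    dj_dw = X.T @ errors / m        # shape (n,): entry j is (1/m) * Σ error * x_j
    dj_db = errors.sum() / m
    return dj_dw, dj_db
```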
Algorithm Flow
- Initialize parameters w and b (a full loop is sketched after this list)
- Compute predictions using vectorized dot product
- Calculate cost and gradients for all parameters
- Update all parameters simultaneously
- Repeat until convergence
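Putting the steps together, a minimal batch gradient-descent loop might look like the following; the learning rate and iteration count are placeholders, and a fixed iteration count stands in for a real convergence check:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression with multiple features."""
    m, n = X.shape
    w = np.zeros(n)                          # 1. initialize parameters
    b = 0.0
    for _ in range(num_iters):               # 5. repeat (fixed count instead of a convergence test)
        errors = X @ w + b - y               # 2. vectorized predictions and errors
        dj_dw = X.T @ errors / m             # 3. gradients for all weights at once
        dj_db = errors.sum() / m
        w = w - alpha * dj_dw                # 4. update all parameters simultaneously
        b = b - alpha * dj_db
    return w, b
```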
Alternative: Normal Equation
Overview
The normal equation is an analytical solution that solves for w and b directly without iterations.
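As a hedged illustration of the idea (not of any library's internals), the closed-form solution for linear regression amounts to solving (XᵀX)θ = Xᵀy, where a column of ones appended to X absorbs the bias b:

```python
import numpy as np

def normal_equation(X, y):
    """Solve for the weights and bias in one shot, assuming XᵀX is invertible."""
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])          # extra column of ones absorbs the bias
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # closed-form solution, no iterations
    w, b = theta[:-1], theta[-1]
    return w, b
```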
Limitations
- Not Generalizable: Only works for linear regression
- Computational Complexity: Slow for large numbers of features (n > 10,000)
- Limited Applicability: Cannot be used for other algorithms like logistic regression or neural networks
Practical Use
- Some machine learning libraries may use the normal equation internally
- Most practitioners stick with gradient descent for flexibility and efficiency
- Important to know the term for interviews but implementation details aren’t critical
Gradient descent with multiple features follows the same principles as single-feature gradient descent but leverages vector operations for efficiency and scalability. This approach forms the foundation for most machine learning optimization problems.