
Gradient Descent

Instead of treating w₁, w₂, …, wₙ as separate parameters, we collect them into a vector w of length n. The model parameters are now:

  • w: Vector of weights [w₁, w₂, …, wₙ]
  • b: Scalar bias term

f_{w⃗,b}(x⃗) = w⃗ · x⃗ + b

The cost function J(w⃗, b) now takes a vector w⃗ and a scalar b as inputs.
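As a minimal sketch of these definitions in NumPy (the file name, function names, and the 1/(2m) scaling of the cost are illustrative assumptions, not taken from the text above):

prediction_and_cost.py
import numpy as np

def predict(x, w, b):
    # Prediction for one example: f_{w,b}(x) = w · x + b
    return np.dot(w, x) + b

def compute_cost(X, y, w, b):
    # Squared-error cost J(w, b) over the m training examples.
    # X has shape (m, n): one row per example, one column per feature.
    m = X.shape[0]
    errors = X @ w + b - y  # f(x^(i)) - y^(i) for every example
    return np.sum(errors ** 2) / (2 * m)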

For univariate regression:

single_feature.py
w = w - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
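A sketch of one such update step for the single-feature case (the helper name and array shapes are assumptions; x and y are 1-D arrays of length m):

single_feature_step.py
import numpy as np

def gradient_step_single(x, y, w, b, alpha):
    # One simultaneous update of the scalar parameters w and b.
    m = x.shape[0]
    errors = w * x + b - y              # f(x^(i)) - y^(i) for every example
    dj_dw = np.sum(errors * x) / m
    dj_db = np.sum(errors) / m
    return w - alpha * dj_dw, b - alpha * dj_db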

For each parameter wⱼ (j = 1 to n):

multiple_features.py
w_j = w_j - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x_j^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
  • Same Structure: (f(x⁽ⁱ⁾) - y⁽ⁱ⁾) remains the prediction error
  • Vector Operations: w and x are now vectors, combined with a dot product
  • Single Feature: uses x⁽ⁱ⁾ directly
  • Multiple Features: uses x_j⁽ⁱ⁾ for the j-th feature of the i-th example

Each weight wⱼ is updated using its corresponding feature x_j⁽ⁱ⁾ from each training example.
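The same step for n features can be written so that all n weights are updated at once (a sketch; X is assumed to have shape (m, n), one row per training example):

multiple_features_step.py
import numpy as np

def gradient_step_multi(X, y, w, b, alpha):
    # One simultaneous update of the weight vector w (length n) and bias b.
    m = X.shape[0]
    errors = X @ w + b - y          # f(x^(i)) - y^(i) for every example
    dj_dw = (X.T @ errors) / m      # one partial derivative per feature j
    dj_db = np.sum(errors) / m
    return w - alpha * dj_dw, b - alpha * dj_db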

Vectorization

Vectorized implementations:

  • Use NumPy operations for efficient computation
  • Leverage parallel processing capabilities
  • Handle large feature sets effectively

The overall procedure (see the sketch below):

  1. Initialize parameters w and b
  2. Compute predictions using vectorized dot product
  3. Calculate cost and gradients for all parameters
  4. Update all parameters simultaneously
  5. Repeat until convergence
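Putting these steps together, a minimal vectorized loop might look like the following (the learning rate, iteration limit, and convergence tolerance are illustrative defaults, not values prescribed above):

vectorized_gradient_descent.py
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000, tol=1e-8):
    # Batch gradient descent for linear regression with X of shape (m, n).
    m, n = X.shape
    w = np.zeros(n)                            # 1. initialize parameters
    b = 0.0
    prev_cost = np.inf
    for _ in range(num_iters):
        errors = X @ w + b - y                 # 2. vectorized predictions and errors
        cost = np.sum(errors ** 2) / (2 * m)   # 3. cost ...
        dj_dw = (X.T @ errors) / m             # 3. ... and gradients for all parameters
        dj_db = np.sum(errors) / m
        w = w - alpha * dj_dw                  # 4. update all parameters simultaneously
        b = b - alpha * dj_db
        if abs(prev_cost - cost) < tol:        # 5. repeat until convergence
            break
        prev_cost = cost
    return w, b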

Normal Equation

The normal equation is an analytical method that solves for w and b directly, without iteration.

Disadvantages of the normal equation:

  • Not Generalizable: Only works for linear regression
  • Computational Complexity: Slow for large numbers of features (n > 10,000)
  • Limited Applicability: Cannot be used for other algorithms such as logistic regression or neural networks

In practice:

  • Some machine learning libraries may use the normal equation internally
  • Most practitioners stick with gradient descent for its flexibility and efficiency
  • The term is worth knowing for interviews, but implementation details aren't critical
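For comparison, a sketch of the normal-equation route using NumPy's least-squares solver (appending a column of ones so the bias is learned as an extra weight is a standard trick, assumed here rather than taken from the text):

normal_equation.py
import numpy as np

def normal_equation_fit(X, y):
    # Solve for w and b in one shot, without iterating.
    m = X.shape[0]
    X_aug = np.hstack([X, np.ones((m, 1))])            # extra column of ones for the bias
    theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)  # least-squares solution
    return theta[:-1], theta[-1]                       # w (length n), scalar b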

Gradient descent with multiple features follows the same principles as single-feature gradient descent but leverages vector operations for efficiency and scalability. This approach forms the foundation for most machine learning optimization problems.