Gradient Descent
Gradient Descent for Multiple Features
Vector Notation for Parameters
Instead of treating w₁, w₂, …, wₙ as separate parameters, we collect them into a vector w of length n. The model parameters are now (a short NumPy sketch follows the list):
- w: Vector of weights [w₁, w₂, …, wₙ]
- b: Scalar bias term
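As a minimal NumPy sketch (the parameter values and the choice of n = 4 features are made up purely for illustration), the parameters can be stored as one array plus a scalar:

```python
import numpy as np

# Hypothetical parameters for a model with n = 4 features; the numbers
# are placeholders, not learned values.
w = np.array([0.39, 18.75, -53.36, -26.42])  # weight vector w, shape (4,)
b = 785.18                                   # scalar bias b
```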
Model Representation
Vector Form
f_{w⃗,b}(x⃗) = w⃗ · x⃗ + b
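As a hedged sketch, this vector form maps directly onto a NumPy dot product; the function name `predict` and the feature values are illustrative, not part of any particular library:

```python
import numpy as np

def predict(x, w, b):
    """Single-example prediction: f_{w,b}(x) = w · x + b."""
    return np.dot(w, x) + b

x = np.array([2104.0, 5.0, 1.0, 45.0])        # one example with n = 4 features (placeholder values)
w = np.array([0.39, 18.75, -53.36, -26.42])
b = 785.18
print(predict(x, w, b))                       # scalar prediction
```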
Cost Function
J(w⃗, b) now takes the weight vector w⃗ and the scalar b as inputs, rather than individual weights.
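A minimal sketch of the squared-error cost in this vectorized setting, assuming X is an (m, n) array of training examples and y a length-m target vector (names are illustrative):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) = (1/(2m)) * Σ (f(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    errors = X @ w + b - y          # prediction error for every example, shape (m,)
    return (errors ** 2).sum() / (2 * m)
```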
Gradient Descent Updates
Single Feature (Review)
For univariate regression:
w = w - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
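A compact sketch of one univariate update step, assuming x and y are length-m NumPy arrays and alpha is the learning rate (function and variable names are illustrative):

```python
import numpy as np

def update_step_univariate(x, y, w, b, alpha):
    """One simultaneous gradient-descent update for single-feature regression."""
    m = x.shape[0]
    errors = w * x + b - y                       # f(x^(i)) - y^(i) for all examples
    w_new = w - alpha * (errors * x).sum() / m   # weight update uses error * feature
    b_new = b - alpha * errors.sum() / m         # bias update uses error alone
    return w_new, b_new
```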
Multiple Features
For each parameter wⱼ (j = 1 to n):
w_j = w_j - α * (1/m) * Σ(f(x^(i)) - y^(i)) * x_j^(i)
b = b - α * (1/m) * Σ(f(x^(i)) - y^(i))
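The sketch below mirrors these formulas literally, looping over each weight wⱼ; X is assumed to be an (m, n) float array with one row per training example:

```python
import numpy as np

def update_step_multifeature(X, y, w, b, alpha):
    """One simultaneous update: each w_j uses its own feature column X[:, j]."""
    m, n = X.shape
    errors = X @ w + b - y                       # f(x^(i)) - y^(i), shape (m,)
    w_new = w.copy()
    for j in range(n):                           # update each weight with its own feature
        w_new[j] = w[j] - alpha * (errors * X[:, j]).sum() / m
    b_new = b - alpha * errors.sum() / m
    return w_new, b_new
```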
Key Differences from Single Feature
Error Term
Section titled “Error Term”- Same Structure: (f(x⁽ⁱ⁾) - y⁽ⁱ⁾) remains the prediction error
- Vector Operations: w⃗ and x⃗ are now vectors, so predictions use the dot product w⃗ · x⃗
Feature Indexing
- Single Feature: Used x⁽ⁱ⁾ directly
- Multiple Features: Use x_j⁽ⁱ⁾ for the j-th feature of the i-th example
Update Pattern
Each weight wⱼ is updated using its corresponding feature x_j⁽ⁱ⁾ from each training example.
Implementation Considerations
Vectorized Implementation
- Use NumPy operations for efficient computation (see the sketch after this list)
- Leverage parallel processing capabilities
- Handle large feature sets effectively
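As a hedged example of what the vectorized version looks like, the explicit loop over features can be replaced with matrix operations (X again assumed to be an (m, n) NumPy array):

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Vectorized gradients of the squared-error cost for all weights at once."""
    m = X.shape[0]
    errors = X @ w + b - y          # shape (m,)
    dj_dw = X.T @ errors / m        # shape (n,): entry j is (1/m) * Σ error * x_j
    dj_db = errors.sum() / m
    return dj_dw, dj_db
```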
Algorithm Flow
- Initialize parameters w and b (a full loop is sketched after this list)
- Compute predictions using vectorized dot product
- Calculate cost and gradients for all parameters
- Update all parameters simultaneously
- Repeat until convergence
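Putting the steps together, a minimal batch gradient-descent loop might look like the following; the learning rate and iteration count are placeholders, and a fixed iteration count stands in for a real convergence check:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression with multiple features."""
    m, n = X.shape
    w = np.zeros(n)                          # 1. initialize parameters
    b = 0.0
    for _ in range(num_iters):               # 5. repeat (fixed count instead of a convergence test)
        errors = X @ w + b - y               # 2. vectorized predictions and errors
        dj_dw = X.T @ errors / m             # 3. gradients for all weights at once
        dj_db = errors.sum() / m
        w = w - alpha * dj_dw                # 4. update all parameters simultaneously
        b = b - alpha * dj_db
    return w, b
```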
Alternative: Normal Equation
Overview
The normal equation is an analytical solution that solves for w and b directly without iterations.
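As a hedged illustration of the idea (not of any library's internals), the closed-form solution for linear regression amounts to solving (XᵀX)θ = Xᵀy, where a column of ones appended to X absorbs the bias b:

```python
import numpy as np

def normal_equation(X, y):
    """Solve for the weights and bias in one shot, assuming XᵀX is invertible."""
    m = X.shape[0]
    Xb = np.hstack([X, np.ones((m, 1))])          # extra column of ones absorbs the bias
    theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)  # closed-form solution, no iterations
    w, b = theta[:-1], theta[-1]
    return w, b
```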
Limitations
- Not Generalizable: Only works for linear regression
- Computational Complexity: Slow for large numbers of features (n > 10,000)
- Limited Applicability: Cannot be used for other algorithms like logistic regression or neural networks
Practical Use
- Some machine learning libraries may use the normal equation internally
- Most practitioners stick with gradient descent for flexibility and efficiency
- Important to know the term for interviews but implementation details aren’t critical
Gradient descent with multiple features follows the same principles as single-feature gradient descent but leverages vector operations for efficiency and scalability. This approach forms the foundation for most machine learning optimization problems.