Gradient Descent for Logistic Regression
Find parameters w and b that minimize the logistic regression cost function:
J(w,b) = -(1/m) * Σ[y⁽ⁱ⁾*log(f(x⁽ⁱ⁾)) + (1-y⁽ⁱ⁾)*log(1-f(x⁽ⁱ⁾))]
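As an illustration, here is a minimal NumPy sketch of this cost computation. The names sigmoid, compute_cost, X, y, w, and b are our own choices for the example, not anything defined in the notes above.

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(X, y, w, b):
    """Average cross-entropy cost J(w, b) over the m training examples.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1},
    w: (n,) weight vector, b: scalar bias.
    """
    m = X.shape[0]
    f = sigmoid(X @ w + b)          # f(x^(i)) for every example at once
    eps = 1e-15                     # guards log(0) when predictions saturate
    return -(1.0 / m) * np.sum(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))

# Tiny illustrative call with made-up data:
X = np.array([[1.0, 2.0], [2.0, 0.5]])
y = np.array([1, 0])
print(compute_cost(X, y, w=np.array([0.1, -0.2]), b=0.0))
```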
w_j := w_j - α * (∂J/∂w_j)
b := b - α * (∂J/∂b)
Where α is the learning rate and j goes from 1 to n (number of features).
Taking the partial derivatives of the logistic cost function gives:
∂J/∂w_j = (1/m) * Σ[(f(x⁽ⁱ⁾) - y⁽ⁱ⁾) * x_j⁽ⁱ⁾]
∂J/∂b = (1/m) * Σ[(f(x⁽ⁱ⁾) - y⁽ⁱ⁾)]
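These two partial derivatives translate directly into a few lines of vectorized NumPy. The following sketch, with the hypothetical helper name compute_gradients and toy training data, computes both gradients and applies one simultaneous update of w and b.

```python
import numpy as np

def compute_gradients(X, y, w, b):
    """Gradients of J with respect to w and b, matching the formulas above."""
    m = X.shape[0]
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f(x^(i)) = sigmoid(w·x^(i) + b)
    error = f - y                             # f(x^(i)) - y^(i)
    dj_dw = (X.T @ error) / m                 # ∂J/∂w_j = (1/m) Σ error_i * x_j^(i)
    dj_db = np.sum(error) / m                 # ∂J/∂b   = (1/m) Σ error_i
    return dj_dw, dj_db

# One simultaneous update with learning rate alpha (toy data for illustration):
X = np.array([[0.5, 1.5], [1.0, 1.0], [2.0, 0.5]])
y = np.array([0, 0, 1])
w, b, alpha = np.zeros(2), 0.0, 0.1
dj_dw, dj_db = compute_gradients(X, y, w, b)
w, b = w - alpha * dj_dw, b - alpha * dj_db
```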
These gradient descent updates look exactly the same as those for linear regression. The crucial difference is in the definition of f(x):
Linear Regression
f(x) = w·x + b
Logistic Regression
f(x) = 1/(1 + e^(-(w·x + b)))
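To make the contrast concrete, the sketch below defines both model functions side by side. The names f_linear and f_logistic and the example inputs are illustrative only.

```python
import numpy as np

def f_linear(x, w, b):
    """Linear regression model: f(x) = w·x + b, an unbounded real value."""
    return np.dot(w, x) + b

def f_logistic(x, w, b):
    """Logistic regression model: f(x) = sigmoid(w·x + b), always in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

x = np.array([2.0, -1.0])
w, b = np.array([0.5, 1.0]), -0.25
print(f_linear(x, w, b))    # any real number
print(f_logistic(x, w, b))  # a probability-like value between 0 and 1
```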
Even though the update equations appear identical, they represent completely different algorithms due to the different function definitions.
Feature scaling remains beneficial for logistic regression: standardizing the input features helps gradient descent converge faster, just as it does for linear regression.
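As one common choice, z-score normalization can be applied to the training features before running gradient descent. The function name zscore_normalize and the sample matrix below are hypothetical.

```python
import numpy as np

def zscore_normalize(X):
    """Standardize each feature column to zero mean and unit standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Scale training features; keep mu and sigma to transform future inputs identically.
X_raw = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])
X_norm, mu, sigma = zscore_normalize(X_raw)
```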
The algorithm can also be vectorized for computational efficiency, computing all predictions and gradients with matrix operations rather than explicit loops over examples and features.
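Putting the pieces together, a vectorized training loop might look like the following sketch. The function name, hyperparameter values, and toy dataset are assumptions made for illustration.

```python
import numpy as np

def gradient_descent_logistic(X, y, alpha=0.1, num_iters=1000):
    """Vectorized gradient descent for logistic regression.

    Each iteration computes all m predictions and both gradients with
    matrix operations instead of looping over examples and features.
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of all m predictions at once
        error = f - y
        w -= alpha * (X.T @ error) / m          # all ∂J/∂w_j in one matrix product
        b -= alpha * np.sum(error) / m
    return w, b

# Toy usage (hypothetical data):
X = np.array([[0.5, 1.5], [1.0, 1.0], [1.5, 0.5], [3.0, 0.5], [2.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = gradient_descent_logistic(X, y, alpha=0.1, num_iters=5000)
```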
Gradient descent for logistic regression uses the same algorithmic structure as linear regression but with the sigmoid function defining f(x). The resulting updates look identical in form but solve a fundamentally different optimization problem, making logistic regression suitable for classification tasks.