Logistic Regression

The Problem with Linear Regression’s Cost Function

For linear regression, we used the squared error cost function:

J(w,b) = (1/(2m)) * Σ (f(x⁽ⁱ⁾) - y⁽ⁱ⁾)²   (summed over the m training examples)

However, when f(x) uses the sigmoid function for logistic regression, this creates a non-convex cost function with many local minima, making gradient descent unreliable.
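
To make the setup concrete, here is a minimal NumPy sketch of the sigmoid model and the squared error cost it would be plugged into. The helper names (sigmoid, predict, squared_error_cost) and the use of NumPy are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, b):
    """Model output f(x) = sigmoid(w · x + b) for each row of X."""
    return sigmoid(X @ w + b)

def squared_error_cost(X, y, w, b):
    """Squared error cost (1/(2m)) * sum((f(x) - y)^2).
    With the sigmoid inside f(x), this surface is non-convex in (w, b),
    which is why it is not used for logistic regression."""
    m = X.shape[0]
    f = predict(X, w, b)
    return np.sum((f - y) ** 2) / (2 * m)
```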

Instead of squared error, we define a new loss function for a single training example:

If y = 1: Loss = -log(f(x))
If y = 0: Loss = -log(1 - f(x))

When y = 1:

  • If f(x) ≈ 1 (correct prediction): Loss ≈ 0 (very small penalty)
  • If f(x) = 0.5: Loss = -log(0.5) ≈ 0.69 (moderate penalty)
  • If f(x) ≈ 0 (wrong prediction): Loss → ∞ (very large penalty)

The loss function encourages the algorithm to output high probabilities for positive examples.

When y = 0:

  • If f(x) ≈ 0 (correct prediction): Loss ≈ 0 (very small penalty)
  • If f(x) = 0.5: Loss = -log(0.5) ≈ 0.69 (moderate penalty)
  • If f(x) ≈ 1 (wrong prediction): Loss → ∞ (very large penalty)

The loss function encourages the algorithm to output low probabilities for negative examples.
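
The behavior described in both cases can be checked numerically. A small sketch (the helper name logistic_loss is illustrative; printed values are rounded):

```python
import numpy as np

def logistic_loss(f, y):
    """Per-example loss: -log(f) if y == 1, -log(1 - f) if y == 0."""
    return -np.log(f) if y == 1 else -np.log(1.0 - f)

for f in [0.99, 0.5, 0.01]:
    print(f"f(x) = {f:4.2f}   loss if y=1: {logistic_loss(f, 1):6.3f}   "
          f"loss if y=0: {logistic_loss(f, 0):6.3f}")

# f(x) = 0.99   loss if y=1:  0.010   loss if y=0:  4.605
# f(x) = 0.50   loss if y=1:  0.693   loss if y=0:  0.693
# f(x) = 0.01   loss if y=1:  4.605   loss if y=0:  0.010
```

Confident, correct predictions cost almost nothing; confident, wrong predictions are penalized heavily, in either direction.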

Why this loss function works well:

  • Convex Function: creates a convex cost surface, ensuring gradient descent finds the global minimum
  • Penalizes Wrong Predictions: large penalties for confident but incorrect predictions
  • Smooth Gradients: provides smooth gradients for reliable optimization

This cost function is derived from the statistical principle of maximum likelihood estimation, which provides a principled way to find optimal parameters for probabilistic models.
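
Although the two cases are stated separately above, they are conventionally folded into a single expression, Loss = -y*log(f(x)) - (1 - y)*log(1 - f(x)), and the overall cost J(w,b) is the average loss over the m training examples. A minimal sketch of that cost, reusing the hypothetical predict helper from the earlier snippet:

```python
import numpy as np

def logistic_cost(X, y, w, b, eps=1e-15):
    """Average logistic (cross-entropy) cost over m examples:
    J(w, b) = -(1/m) * sum(y * log(f) + (1 - y) * log(1 - f))."""
    m = X.shape[0]
    f = predict(X, w, b)             # sigmoid(X @ w + b), as defined earlier
    f = np.clip(f, eps, 1.0 - eps)   # keep log() away from 0 for stability
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1.0 - f)) / m
```

Because this cost is convex in w and b, gradient descent can reach the global minimum rather than getting stuck in local minima.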

The logistic loss function replaces squared error to create a convex optimization problem suitable for binary classification. By heavily penalizing confident but wrong predictions, it encourages the model to output appropriate probabilities for each class, leading to better classification performance.