Skip to content
Pablo Rodriguez

Motivations

Classification is the third major topic in machine learning where the output variable y can take on only one of a small handful of possible values, rather than any number in an infinite range like linear regression.

  • Email Spam Detection: Determining whether an email is spam (yes) or not spam (no)
  • Financial Transaction Fraud: Identifying if an online transaction is fraudulent or legitimate
  • Tumor Classification: Classifying a tumor as malignant versus benign

The choice of which class to call positive or negative is somewhat arbitrary. Different engineers might swap the assignments - for example, calling a good email the positive class or a healthy patient the positive class.

Why Linear Regression Fails for Classification

Section titled “Why Linear Regression Fails for Classification”

When attempting classification, you might consider using linear regression with a threshold:

  • If f(x) ≥ 0.5, predict y = 1
  • If f(x) < 0.5, predict y = 0

Linear regression can work initially on simple datasets, but adding outlier training examples causes the best-fit line to shift. This shifts the decision boundary inappropriately, leading to misclassification of examples that should remain correctly classified.

The core issue is that linear regression predicts values across the entire real number line, but classification needs outputs constrained to specific categories.

Logistic regression solves classification problems by:

  • Always outputting values between 0 and 1
  • Using an S-shaped curve instead of a straight line
  • Avoiding the decision boundary shifting problems of linear regression
Note

Despite the name “logistic regression,” this algorithm is actually used for classification, not regression. The name exists for historical reasons.

Classification problems require specialized algorithms because linear regression is unsuitable for categorical outputs. The need for bounded outputs between 0 and 1, along with stable decision boundaries, motivates the development of logistic regression as the preferred algorithm for binary classification tasks.