Motivations

Classification Motivations

Introduction to Classification

Classification is the third major topic in machine learning where the output variable y can take on only one of a small handful of possible values, rather than any number in an infinite range like linear regression.

Examples of Classification Problems

Binary Classification Examples

Email Spam Detection: Determining whether an email is spam (yes) or not spam (no)
Financial Transaction Fraud: Identifying if an online transaction is fraudulent or legitimate
Tumor Classification: Classifying a tumor as malignant versus benign

Binary Classification Terminology

The choice of which class to call positive or negative is somewhat arbitrary. Different engineers might swap the assignments - for example, calling a good email the positive class or a healthy patient the positive class.

Why Linear Regression Fails for Classification

Initial Approach

When attempting classification, you might consider using linear regression with a threshold:

If f(x) ≥ 0.5, predict y = 1
If f(x) < 0.5, predict y = 0

The Problem with Additional Data

Linear regression can work initially on simple datasets, but adding outlier training examples causes the best-fit line to shift. This shifts the decision boundary inappropriately, leading to misclassification of examples that should remain correctly classified.

The core issue is that linear regression predicts values across the entire real number line, but classification needs outputs constrained to specific categories.

Introduction to Logistic Regression

Logistic regression solves classification problems by:

Always outputting values between 0 and 1
Using an S-shaped curve instead of a straight line
Avoiding the decision boundary shifting problems of linear regression

Note

Despite the name “logistic regression,” this algorithm is actually used for classification, not regression. The name exists for historical reasons.

Summary

Classification problems require specialized algorithms because linear regression is unsuitable for categorical outputs. The need for bounded outputs between 0 and 1, along with stable decision boundaries, motivates the development of logistic regression as the preferred algorithm for binary classification tasks.