Motivations
Classification Motivations
Introduction to Classification
Classification is the third major topic in machine learning. In a classification problem, the output variable y can take on only one of a small handful of possible values, rather than any number in an infinite range as in linear regression.
Examples of Classification Problems
Binary Classification Examples
- Email Spam Detection: Determining whether an email is spam (yes) or not spam (no)
- Financial Transaction Fraud: Identifying if an online transaction is fraudulent or legitimate
- Tumor Classification: Classifying a tumor as malignant versus benign
Binary Classification Terminology
When there are only two possible outputs, the problem is called binary classification. The two outcomes are commonly written as 0 and 1, also called the negative class and the positive class. The choice of which class to call positive or negative is somewhat arbitrary, and different engineers might swap the assignments, for example by calling a good email or a healthy patient the positive class.
Why Linear Regression Fails for Classification
Initial Approach
When attempting classification, you might consider using linear regression with a threshold:
- If f(x) ≥ 0.5, predict y = 1
- If f(x) < 0.5, predict y = 0
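The threshold rule above can be sketched in a few lines of Python. The function name `predict_with_threshold` and the sample model outputs are illustrative assumptions, not values from the original material.

```python
def predict_with_threshold(f_x: float, threshold: float = 0.5) -> int:
    """Map a real-valued model output to a class label (0 or 1)."""
    # Outputs at or above the threshold are labeled 1, everything else 0.
    return 1 if f_x >= threshold else 0


# Hypothetical raw outputs f(x) from a linear regression model.
# Note that some fall outside the range [0, 1].
model_outputs = [-0.2, 0.3, 0.5, 0.9, 1.4]
labels = [predict_with_threshold(v) for v in model_outputs]
print(labels)
```

An output of exactly 0.5 is assigned to class 1 here, matching the "≥" in the rule.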
The Problem with Additional Data
Linear regression can work initially on simple datasets, but adding an outlying training example, such as one with a very large feature value, can shift the best-fit line. This moves the decision boundary (the point where f(x) crosses 0.5) even though the new example should not change how the existing examples are classified, so examples that were previously classified correctly become misclassified.
The core issue is that linear regression predicts values across the entire real number line, but classification needs outputs constrained to specific categories.
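A small experiment can make the shift concrete. The sketch below fits a one-feature least-squares line to a made-up dataset (hypothetical tumor sizes with labels 0 and 1), then refits after adding one extreme example. The data and the helper names `fit_line`, `boundary`, and `classify` are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of f(x) = w * x + b for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


def boundary(w, b, threshold=0.5):
    """Value of x at which f(x) equals the threshold."""
    return (threshold - b) / w


def classify(x, w, b, threshold=0.5):
    """Apply the threshold rule to the fitted line."""
    return 1 if w * x + b >= threshold else 0


# Hypothetical data: small tumors are benign (0), larger ones malignant (1).
sizes = [1, 2, 3, 4, 5, 6]
labels = [0, 0, 0, 1, 1, 1]
w1, b1 = fit_line(sizes, labels)

# Add one very large malignant tumor and refit.
w2, b2 = fit_line(sizes + [30], labels + [1])

print(f"boundary before outlier: {boundary(w1, b1):.2f}")
print(f"boundary after outlier:  {boundary(w2, b2):.2f}")
print("size 4 before:", classify(4, w1, b1), "after:", classify(4, w2, b2))
```

On this toy data the boundary moves from 3.5 to roughly 4.5, so the example at size 4, which is labeled malignant, flips from a correct prediction of 1 to an incorrect prediction of 0.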
Introduction to Logistic Regression
Logistic regression solves classification problems by:
- Always outputting values between 0 and 1
- Using an S-shaped curve instead of a straight line
- Avoiding the decision boundary shifting problems of linear regression
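The S-shaped curve is not spelled out in this section. A minimal sketch, assuming it is the standard logistic (sigmoid) function g(z) = 1 / (1 + e^(-z)) applied to a linear score z = w * x + b, is shown below. The parameter values `w` and `b` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Standard logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical parameters (assumed, not learned from data here).
w, b = 1.5, -5.0

for x in [1, 3, 4, 6, 30]:
    z = w * x + b          # linear score, can be any real number
    p = sigmoid(z)         # squashed into the range 0 to 1
    label = 1 if p >= 0.5 else 0
    print(f"x={x:>2}  z={z:6.1f}  sigmoid(z)={p:.4f}  label={label}")
```

Because the curve flattens toward 1 for large scores, an extreme input such as x = 30 yields an output close to 1 rather than a value far above it, which is the intuition behind the more stable decision boundary.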
Despite the name “logistic regression,” this algorithm is actually used for classification, not regression. The name exists for historical reasons.
Summary
Classification problems require specialized algorithms because linear regression is unsuitable for categorical outputs. The need for bounded outputs between 0 and 1, along with stable decision boundaries, motivates the development of logistic regression as the preferred algorithm for binary classification tasks.