Feature Scaling
The Problem with Different Feature Ranges
When features have vastly different ranges of values, gradient descent can run slowly and inefficiently.
Example Scenario
- x₁ (house size): 300-2,000 square feet
- x₂ (bedrooms): 0-5 bedrooms
Parameter Relationships
For a house with 2,000 sq ft, 5 bedrooms, priced at $500k:
Poor Parameters: w₁=50, w₂=0.1, b=50
- Prediction: 50×2,000 + 0.1×5 + 50 = 100,050.5k (way too high)
Good Parameters: w₁=0.1, w₂=50, b=50
- Prediction: 0.1×2000 + 50×5 + 50 = 500k (correct)
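The two predictions above can be reproduced with a few lines of Python (prices in thousands of dollars; `predict` is just a hypothetical helper for this example):

```python
# Linear model f(x) = w1*x1 + w2*x2 + b, with prices in $1000s.
def predict(w1, w2, b, size_sqft, bedrooms):
    return w1 * size_sqft + w2 * bedrooms + b

poor = predict(50, 0.1, 50, 2000, 5)   # 100,050.5 — way too high
good = predict(0.1, 50, 50, 2000, 5)   # 500.0 — matches the listed price
```

Note how the "good" parameters pair a small weight with the large-valued feature (size) and a large weight with the small-valued feature (bedrooms).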
Impact on Gradient Descent
Contour Plot Characteristics
- Without Scaling: Tall, narrow elliptical contours
- Inefficient Path: Gradient descent bounces back and forth
- Slow Convergence: Takes many iterations to reach minimum
With Feature Scaling
- Circular Contours: More symmetric cost function shape
- Direct Path: Gradient descent moves efficiently toward minimum
- Fast Convergence: Significantly fewer iterations needed
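This convergence gap can be demonstrated with a small synthetic experiment (all data, seeds, and learning rates below are illustrative assumptions, not values from the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(300, 2000, 50)             # house size (sq ft)
x2 = rng.integers(0, 6, 50).astype(float)   # bedrooms
y = 0.1 * x1 + 50 * x2 + 50                 # synthetic prices in $1000s

def run_gd(X, y, alpha, iters):
    """Batch gradient descent on half-MSE; returns the final mean squared error."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / len(y)
        b -= alpha * err.mean()
    return float(((X @ w + b - y) ** 2).mean())

X_raw = np.column_stack([x1, x2])
X_scaled = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# Raw features force a tiny learning rate (anything much larger diverges)
# and still leave a large error; scaled features fit almost exactly.
mse_raw = run_gd(X_raw, y, alpha=1e-7, iters=500)
mse_scaled = run_gd(X_scaled, y, alpha=0.1, iters=500)
```

With unscaled inputs, the update for the small-ranged feature (bedrooms) barely moves at a learning rate safe for the large-ranged one (size) — exactly the bouncing, slow path the contour picture describes.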
Feature Scaling Methods
1. Division by Maximum
x1_scaled = x1 / max(x1)  # Range: 0.15 to 1.0
x2_scaled = x2 / max(x2)  # Range: 0.0 to 1.0
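In NumPy this looks like the following (the sample values are assumed for illustration):

```python
import numpy as np

x1 = np.array([300.0, 1200.0, 2000.0])   # house sizes (sq ft)
x2 = np.array([0.0, 2.0, 5.0])           # bedroom counts

x1_scaled = x1 / x1.max()   # [0.15, 0.6, 1.0]
x2_scaled = x2 / x2.max()   # [0.0, 0.4, 1.0]
```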
2. Mean Normalization
# Calculate mean (μ) for each feature
mu1 = mean(x1)  # e.g., 600
mu2 = mean(x2)  # e.g., 2.3

# Normalize
x1_norm = (x1 - mu1) / (max(x1) - min(x1))
x2_norm = (x2 - mu2) / (max(x2) - min(x2))
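A concrete sketch with assumed sample data:

```python
import numpy as np

x1 = np.array([300.0, 450.0, 650.0, 2000.0])   # assumed sizes; mean is 850
x1_norm = (x1 - x1.mean()) / (x1.max() - x1.min())
# The normalized values straddle zero and span exactly one unit.
```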
Results: Features centered around zero with both positive and negative values
3. Z-Score Normalization
# Calculate mean (μ) and standard deviation (σ)
mu1, sigma1 = mean(x1), std(x1)  # e.g., 600, 450
mu2, sigma2 = mean(x2), std(x2)  # e.g., 2.3, 1.4

# Z-score normalize
x1_zscore = (x1 - mu1) / sigma1
x2_zscore = (x2 - mu2) / sigma2
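A runnable version with assumed data; after z-scoring, a feature has mean 0 and standard deviation 1:

```python
import numpy as np

x1 = np.array([300.0, 500.0, 900.0, 2000.0])   # assumed house sizes

mu1, sigma1 = x1.mean(), x1.std()
x1_zscore = (x1 - mu1) / sigma1
# x1_zscore now has mean 0 and standard deviation 1.
```

scikit-learn's `StandardScaler` applies this same transform if you prefer a library implementation.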
Target Ranges and Guidelines
Acceptable Ranges
Aim for features in approximately the -1 to +1 range, though these are flexible guidelines:
- ✅ Good: -3 to +3, -0.3 to +0.3, 0 to 3, -2 to +0.5
- ⚠️ Consider Rescaling: -100 to +100, -0.001 to +0.001, 98.6 to 105°F
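These rules of thumb can be encoded in a small helper (the thresholds are assumptions chosen to match the examples above, not hard rules):

```python
def consider_rescaling(x_min, x_max):
    """Flag a feature whose range is much wider than roughly -3..+3,
    much narrower than ~0.3, or centered far from zero."""
    span = x_max - x_min
    center = (x_min + x_max) / 2
    return x_max > 3 or x_min < -3 or span < 0.3 or abs(center) > 3

consider_rescaling(-0.3, 0.3)    # False: already a reasonable range
consider_rescaling(-100, 100)    # True: far too wide
consider_rescaling(98.6, 105.0)  # True: centered far from zero
```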
When to Apply Feature Scaling
- ✅ Apply: Features with very different magnitudes (e.g., house size vs. number of rooms)
- ✅ Apply: When in doubt - there's rarely harm in feature scaling
- ❌ Usually unnecessary: Features already in similar ranges (approximately -1 to +1)
Practical Examples
Temperature Data
Hospital patient temperatures ranging from 98.6 to 105°F:
- Values around 100 are large compared to typical scaled features
- Recommendation: Scale to improve gradient descent performance
Financial Data
Income ($30k-$200k) vs. Age (25-65 years):
- Very different scales require feature scaling
- Prevents income from dominating the model due to larger numeric values
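A sketch of z-scoring both columns (the figures are assumed for illustration) so income's larger raw magnitude no longer dominates:

```python
import numpy as np

income = np.array([30_000.0, 65_000.0, 110_000.0, 200_000.0])  # dollars
age = np.array([25.0, 38.0, 51.0, 65.0])                       # years

income_z = (income - income.mean()) / income.std()
age_z = (age - age.mean()) / age.std()
# Both features now live on a comparable, unitless scale.
```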
Feature scaling is a simple yet powerful technique that can dramatically improve the efficiency and performance of gradient descent algorithms, making it essential for practical machine learning implementations.