
Cost Function

The cost function is one of the most universal and important concepts in machine learning, used both in linear regression and in training advanced AI models worldwide.

Cost Function Purpose

  • Goal: Tell us how well the model is performing
  • Benefit: Helps improve the model by identifying areas for adjustment
  • Application: Used across all types of machine learning models

Model Setup

  • Training set: Contains input features x and output targets y
  • Model: Linear function f_w,b(x) = wx + b
  • Parameters: w and b (variables adjusted during training to improve the model)

Different values of w and b create different functions and different lines (evaluated in the short sketch after this list):

  • w = 0, b = 1.5: f(x) = 1.5 (horizontal line, constant prediction)

    • Always predicts 1.5 regardless of input
    • b represents the y-intercept
  • w = 0.5, b = 0: f(x) = 0.5x (line through origin)

    • Slope of 0.5
    • When x = 2, prediction = 1
  • w = 0.5, b = 1: f(x) = 0.5x + 1

    • Slope of 0.5, y-intercept of 1
    • When x = 0, prediction = 1; when x = 2, prediction = 2
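
As a hedged illustration (a Python sketch; the helper name f and the chosen evaluation points are my own, not from the notes), the three parameter settings above can be evaluated directly:

```python
# Minimal sketch of the linear model f_w,b(x) = w*x + b,
# evaluated at the parameter settings listed above.

def f(x, w, b):
    """Prediction of the linear model: f_w,b(x) = w*x + b."""
    return w * x + b

# (w, b) pairs from the bullet points above
for w, b in [(0.0, 1.5), (0.5, 0.0), (0.5, 1.0)]:
    print(f"w={w}, b={b}: f(0)={f(0, w, b)}, f(2)={f(2, w, b)}")
# w=0.0, b=1.5 always predicts 1.5; w=0.5, b=1.0 predicts 1 at x=0 and 2 at x=2
```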

  • Objective: Choose w and b so the straight line fits the data well
  • Visual interpretation: The line should pass through or close to the training examples

The cost function measures prediction accuracy by comparing predictions to actual values:

  1. Error for single example: ŷ^(i) - y^(i) (prediction minus actual)
  2. Squared error: (ŷ^(i) - y^(i))² (eliminates negative values)
  3. Sum across all examples: Σ(ŷ^(i) - y^(i))² for i = 1 to m

Squared Error Cost Function

J(w,b) = 1/(2m) * Σ(f(x^(i)) - y^(i))²

Where:

  • J(w,b): Cost function
  • m: Number of training examples
  • 1/(2m): Normalizes by dataset size and makes later calculations neater
  • f(x^(i)): Model prediction for the i-th example
  • y^(i): Actual target value for the i-th example

  • Average instead of total: Dividing by m prevents the cost from automatically increasing with larger datasets
  • Division by 2: Makes derivative calculations cleaner (the 2 cancels out later)
  • Squared errors: Penalize large errors more heavily than small errors

With the Explicit Model Function

J(w,b) = 1/(2m) * Σ(wx^(i) + b - y^(i))²

Writing the model explicitly as f(x^(i)) = wx^(i) + b emphasizes the relationship between the parameters and the predictions.
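
The formula translates directly into code. Below is a minimal NumPy sketch (the function name compute_cost and the array-based interface are assumptions, not something given in the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = 1/(2m) * sum((w*x_i + b - y_i)^2).

    x and y are 1-D NumPy arrays holding the training inputs and targets.
    """
    m = x.shape[0]
    predictions = w * x + b              # f_w,b(x^(i)) for every example
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)
```

Because the operations are vectorized, the same function works for any number of training examples m.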

  • Minimize J(w,b): Find the parameter values that give the smallest possible cost
  • Mathematical notation: min over w,b of J(w,b)
  • Interpretation: Smaller cost means a better fit to the training data

The squared error cost function is the most commonly used cost function for linear regression and many other regression problems, providing good results across diverse applications.

To build intuition about the cost function, we’ll use a simplified version of linear regression:

  • Original model: f_w,b(x) = wx + b
  • Simplified model: f_w(x) = wx (setting b = 0)
  • Simplified cost function: J(w) depends only on parameter w

Simplification Benefits

  • Visualization: Easier to understand with 2D graphs instead of 3D
  • Concepts: The same principles apply to the full model with both w and b
  • Goal: Minimize J(w) by finding the optimal value of w

  • Training examples: (1,1), (2,2), (3,3)
  • Pattern: Perfect linear relationship where y = x

Case w = 1. Function: f(x) = 1·x = x. Predictions:

  • f(1) = 1, actual y = 1 → error = 0
  • f(2) = 2, actual y = 2 → error = 0
  • f(3) = 3, actual y = 3 → error = 0

Cost calculation: J(1) = 1/(2·3) × (0² + 0² + 0²) = 0

Case w = 0.5. Function: f(x) = 0.5x. Predictions:

  • f(1) = 0.5, actual y = 1 → error = -0.5, squared = 0.25
  • f(2) = 1, actual y = 2 → error = -1, squared = 1
  • f(3) = 1.5, actual y = 3 → error = -1.5, squared = 2.25

Cost calculation: J(0.5) = 1/(2·3) × (0.25 + 1 + 2.25) = 3.5/6 ≈ 0.58

Case w = 0. Function: f(x) = 0 (horizontal line along the x-axis). Predictions:

  • f(1) = 0, actual y = 1 → error = -1, squared = 1
  • f(2) = 0, actual y = 2 → error = -2, squared = 4
  • f(3) = 0, actual y = 3 → error = -3, squared = 9

Cost calculation: J(0) = 1/(2·3) × (1 + 4 + 9) = 14/6 ≈ 2.33

Case w = -0.5. Function: f(x) = -0.5x (downward-sloping line). Result: an even higher cost, J(-0.5) = 5.25.
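
A short sketch in the same hedged style (the toy dataset arrays and output formatting are my own) reproduces all four costs for the simplified model f_w(x) = wx:

```python
import numpy as np

# Training set with a perfect linear relationship y = x
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 2.0, 3.0])

def cost_simplified(w, x, y):
    """J(w) = 1/(2m) * sum((w*x_i - y_i)^2) for the simplified model f_w(x) = w*x."""
    m = x.shape[0]
    return np.sum((w * x - y) ** 2) / (2 * m)

for w in (1.0, 0.5, 0.0, -0.5):
    print(f"J({w}) = {cost_simplified(w, x_train, y_train):.2f}")
# Prints J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, J(-0.5) = 5.25
```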

Model graph, f(x) versus x:

  • Horizontal axis: Input x (house size)
  • Vertical axis: Output y (price)
  • Points: Training examples plotted as crosses
  • Line: The function f(x) = wx for different values of w

  • Each w value: Defines a different straight line through the data
  • Corresponding cost: Every line has an associated cost J(w)
  • Optimal choice: w = 1 gives minimum cost (perfect fit for this data)
  • U-shaped curve: Called a “bowl” shape
  • Minimum point: Occurs at w = 1 where J(w) = 0
  • Increasing cost: Moving away from the optimal w increases the cost, as the numerical sweep below illustrates
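
To trace the U-shaped curve numerically, a simple grid sweep (a sketch; the grid range and variable names are illustrative) evaluates J(w) over many values of w and reports the minimizer:

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 2.0, 3.0])

# Evaluate J(w) on a grid of w values to trace out the U-shaped cost curve.
w_grid = np.linspace(-0.5, 2.5, 61)
costs = np.array([np.sum((w * x_train - y_train) ** 2) / (2 * len(x_train))
                  for w in w_grid])

best_w = w_grid[np.argmin(costs)]
print(f"Minimum cost {costs.min():.4f} at w = {best_w:.2f}")  # w = 1.00, cost 0.0000
```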

The systematic approach to finding the optimal w (and b in the full model) involves:

  1. Goal: Minimize J(w,b) over parameters w and b
  2. Method: Use gradient descent algorithm (covered in upcoming videos)
  3. Result: Automatically find parameter values that give the best fit

The cost function provides a systematic way to measure model performance. By understanding how different parameter values affect the cost, we can identify the best parameters for our model. The relationship between model function f(x) and cost function J(w) shows how parameter choices directly impact prediction quality and overall model performance.