
Cost Function

The cost function is one of the most universal and important concepts in machine learning, used both in linear regression and in training advanced AI models worldwide.

Cost Function Purpose

  • Goal: Tell us how well the model is performing
  • Benefit: Helps improve the model by identifying areas for adjustment
  • Application: Used across all types of machine learning models

Model Setup

  • Training set: Contains input features x and output targets y
  • Model: Linear function f_w,b(x) = wx + b
  • Parameters: w and b (variables adjusted during training to improve the model)

Different values of w and b create different functions and different lines (evaluated in the short sketch after this list):

  • w = 0, b = 1.5: f(x) = 1.5 (horizontal line, constant prediction)

    • Always predicts 1.5 regardless of input
    • b represents the y-intercept
  • w = 0.5, b = 0: f(x) = 0.5x (line through origin)

    • Slope of 0.5
    • When x = 2, prediction = 1
  • w = 0.5, b = 1: f(x) = 0.5x + 1

    • Slope of 0.5, y-intercept of 1
    • When x = 0, prediction = 1; when x = 2, prediction = 2
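
As a hedged illustration (a Python sketch; the helper name f and the chosen evaluation points are my own, not from the notes), the three parameter settings above can be evaluated directly:

```python
# Minimal sketch of the linear model f_w,b(x) = w*x + b,
# evaluated at the parameter settings listed above.

def f(x, w, b):
    """Prediction of the linear model: f_w,b(x) = w*x + b."""
    return w * x + b

# (w, b) pairs from the bullet points above
for w, b in [(0.0, 1.5), (0.5, 0.0), (0.5, 1.0)]:
    print(f"w={w}, b={b}: f(0)={f(0, w, b)}, f(2)={f(2, w, b)}")
# w=0.0, b=1.5 always predicts 1.5; w=0.5, b=1.0 predicts 1 at x=0 and 2 at x=2
```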

  • Objective: Choose w and b so the straight line fits the data well
  • Visual interpretation: The line should pass through or close to the training examples

The cost function measures prediction accuracy by comparing predictions to actual values:

  1. Error for single example: ŷ^(i) - y^(i) (prediction minus actual)
  2. Squared error: (ŷ^(i) - y^(i))² (eliminates negative values)
  3. Sum across all examples: Σ(ŷ^(i) - y^(i))² for i = 1 to m

Squared Error Cost Function

J(w,b) = 1/(2m) * Σ(f(x^(i)) - y^(i))²

Where:

  • J(w,b): Cost function
  • m: Number of training examples
  • 1/(2m): Normalizes by dataset size and makes later calculations neater
  • f(x^(i)): Model prediction for the i-th example
  • y^(i): Actual target value for the i-th example

  • Average instead of total: Dividing by m prevents the cost from automatically increasing with larger datasets
  • Division by 2: Makes derivative calculations cleaner (the 2 cancels out later)
  • Squared errors: Penalize large errors more heavily than small errors

With the Explicit Model Function

J(w,b) = 1/(2m) * Σ(wx^(i) + b - y^(i))²

Writing the model explicitly as f(x^(i)) = wx^(i) + b emphasizes the relationship between the parameters and the predictions.
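
The formula translates directly into code. Below is a minimal NumPy sketch (the function name compute_cost and the array-based interface are assumptions, not something given in the notes):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = 1/(2m) * sum((w*x_i + b - y_i)^2).

    x and y are 1-D NumPy arrays holding the training inputs and targets.
    """
    m = x.shape[0]
    predictions = w * x + b              # f_w,b(x^(i)) for every example
    squared_errors = (predictions - y) ** 2
    return squared_errors.sum() / (2 * m)
```

Because the operations are vectorized, the same function works for any number of training examples m.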

  • Minimize J(w,b): Find the parameter values that give the smallest possible cost
  • Mathematical notation: min over w,b of J(w,b)
  • Interpretation: Smaller cost means a better fit to the training data

The squared error cost function is the most commonly used cost function for linear regression and many other regression problems, providing good results across diverse applications.

To build intuition about the cost function, we’ll use a simplified version of linear regression:

  • Original model: f_w,b(x) = wx + b
  • Simplified model: f_w(x) = wx (setting b = 0)
  • Simplified cost function: J(w) depends only on parameter w

Simplification Benefits

  • Visualization: Easier to understand with 2D graphs instead of 3D
  • Concepts: The same principles apply to the full model with both w and b
  • Goal: Minimize J(w) by finding the optimal value of w

  • Training examples: (1,1), (2,2), (3,3)
  • Pattern: Perfect linear relationship where y = x

Case w = 1. Function: f(x) = 1·x = x. Predictions:

  • f(1) = 1, actual y = 1 → error = 0
  • f(2) = 2, actual y = 2 → error = 0
  • f(3) = 3, actual y = 3 → error = 0

Cost calculation: J(1) = 1/(2·3) × (0² + 0² + 0²) = 0

Case w = 0.5. Function: f(x) = 0.5x. Predictions:

  • f(1) = 0.5, actual y = 1 → error = -0.5, squared = 0.25
  • f(2) = 1, actual y = 2 → error = -1, squared = 1
  • f(3) = 1.5, actual y = 3 → error = -1.5, squared = 2.25

Cost calculation: J(0.5) = 1/(2·3) × (0.25 + 1 + 2.25) = 3.5/6 ≈ 0.58

Case w = 0. Function: f(x) = 0 (horizontal line along the x-axis). Predictions:

  • f(1) = 0, actual y = 1 → error = -1, squared = 1
  • f(2) = 0, actual y = 2 → error = -2, squared = 4
  • f(3) = 0, actual y = 3 → error = -3, squared = 9

Cost calculation: J(0) = 1/(2·3) × (1 + 4 + 9) = 14/6 ≈ 2.33

Case w = -0.5. Function: f(x) = -0.5x (downward-sloping line). Result: an even higher cost, J(-0.5) = 5.25.
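
A short sketch in the same hedged style (the toy dataset arrays and output formatting are my own) reproduces all four costs for the simplified model f_w(x) = wx:

```python
import numpy as np

# Training set with a perfect linear relationship y = x
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 2.0, 3.0])

def cost_simplified(w, x, y):
    """J(w) = 1/(2m) * sum((w*x_i - y_i)^2) for the simplified model f_w(x) = w*x."""
    m = x.shape[0]
    return np.sum((w * x - y) ** 2) / (2 * m)

for w in (1.0, 0.5, 0.0, -0.5):
    print(f"J({w}) = {cost_simplified(w, x_train, y_train):.2f}")
# Prints J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, J(-0.5) = 5.25
```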

Model graph, f(x) versus x:

  • Horizontal axis: Input x (house size)
  • Vertical axis: Output y (price)
  • Points: Training examples plotted as crosses
  • Line: The function f(x) = wx for different values of w

  • Each w value: Defines a different straight line through the data
  • Corresponding cost: Every line has an associated cost J(w)
  • Optimal choice: w = 1 gives minimum cost (perfect fit for this data)
  • U-shaped curve: Called a “bowl” shape
  • Minimum point: Occurs at w = 1 where J(w) = 0
  • Increasing cost: Moving away from the optimal w increases the cost, as the numerical sweep below illustrates
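
To trace the U-shaped curve numerically, a simple grid sweep (a sketch; the grid range and variable names are illustrative) evaluates J(w) over many values of w and reports the minimizer:

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 2.0, 3.0])

# Evaluate J(w) on a grid of w values to trace out the U-shaped cost curve.
w_grid = np.linspace(-0.5, 2.5, 61)
costs = np.array([np.sum((w * x_train - y_train) ** 2) / (2 * len(x_train))
                  for w in w_grid])

best_w = w_grid[np.argmin(costs)]
print(f"Minimum cost {costs.min():.4f} at w = {best_w:.2f}")  # w = 1.00, cost 0.0000
```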

The systematic approach to finding the optimal w (and b in the full model) involves:

  1. Goal: Minimize J(w,b) over parameters w and b
  2. Method: Use gradient descent algorithm (covered in upcoming videos)
  3. Result: Automatically find parameter values that give the best fit

The cost function provides a systematic way to measure model performance. By understanding how different parameter values affect the cost, we can identify the best parameters for our model. The relationship between model function f(x) and cost function J(w) shows how parameter choices directly impact prediction quality and overall model performance.