
Optimization Objective

Like the supervised learning algorithms from courses 1 and 2, K-means also optimizes a specific cost function. However, it doesn't use gradient descent; it uses the iterative algorithm shown in the previous videos.

  • c⁽ⁱ⁾ = index of the cluster (1 to K) to which training example x⁽ⁱ⁾ is currently assigned
  • μₖ = location of cluster centroid k
  • μ_c⁽ⁱ⁾ = cluster centroid of the cluster to which example x⁽ⁱ⁾ has been assigned
  • If training example x⁽¹⁰⁾ is assigned to cluster 2: c⁽¹⁰⁾ = 2
  • Then μ_c⁽¹⁰⁾ = μ₂ = location of cluster centroid 2

The cost function J is:

J(c⁽¹⁾,...,c⁽ᵐ⁾, μ₁,...,μ_K) = (1/m) Σ(i=1 to m) ||x⁽ⁱ⁾ - μ_c⁽ⁱ⁾||²
  • Input variables: c⁽¹⁾ through c⁽ᵐ⁾ (cluster assignments), μ₁ through μ_K (centroid locations)
  • Calculation: Average squared distance between every training example x⁽ⁱ⁾ and the cluster centroid to which it's assigned
  • Alternative name: Distortion function (a name commonly used in the literature)

At any step of K-means (using the lecture's two-cluster illustration, with a blue and a red cluster):

  • Look at all blue points → measure distances to the blue centroid → square them
  • Look at all red points → measure distances to the red centroid → square them
  • The average of all these squared distances is the value of the cost function J
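As a minimal NumPy sketch of this computation (the function name and array layout are my own assumptions, not code from the course):

```python
import numpy as np

def distortion(X, centroids, idx):
    """Average squared distance from each example to its assigned centroid (the cost J).

    X:         (m, n) array of training examples
    centroids: (K, n) array of centroid locations mu_1..mu_K
    idx:       (m,) array of cluster indices c(i), one per example
    """
    diffs = X - centroids[idx]              # x(i) - mu_c(i) for every example
    return np.mean(np.sum(diffs ** 2, axis=1))
```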

Step 1: Assign Points to Cluster Centroids

  • Goal: Update c⁽¹⁾ through c⁽ᵐ⁾ to minimize J
  • Method: Hold μ₁ through μ_K fixed
  • Logic: To minimize ||x⁽ⁱ⁾ - μ_c⁽ⁱ⁾||², assign x⁽ⁱ⁾ to the closest cluster centroid

For example, with two cluster centroids μ₁ and μ₂ and a single training example x⁽ⁱ⁾:

  • Assign to centroid 1: large squared distance
  • Assign to centroid 2: much smaller squared distance
  • Result: Choose the closer centroid to minimize the cost
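A minimal sketch of this assignment step (again, the function name is my own, not the course's code):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Step 1: assign each example to its closest centroid, holding centroids fixed."""
    # Pairwise squared distances from every example to every centroid: shape (m, K)
    dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    # Picking the nearest centroid minimizes each example's term in J
    return np.argmin(dists, axis=1)
```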
Step 2: Move Cluster Centroids

  • Goal: Update μ₁ through μ_K to minimize J
  • Method: Hold c⁽¹⁾ through c⁽ᵐ⁾ fixed
  • Logic: Choose μₖ as the mean of the points assigned to cluster k
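A matching sketch of the update step (the guard for empty clusters is my own addition; the course does not specify this behavior):

```python
import numpy as np

def move_centroids(X, idx, centroids):
    """Step 2: move each centroid to the mean of its assigned points, holding assignments fixed."""
    new_centroids = centroids.copy()
    for k in range(centroids.shape[0]):
        points = X[idx == k]
        if len(points) > 0:          # a cluster that lost all its points keeps its old location here
            new_centroids[k] = points.mean(axis=0)
    return new_centroids
```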

Example: a cluster with two points at positions 1 and 11 on a number line:

  • Centroid at position 1:

    • Distance to point 1: 0, squared = 0
    • Distance to point 11: 10, squared = 100
    • Average: (0 + 100)/2 = 50
  • Centroid at position 6 (mean):

    • Distance to point 1: 5, squared = 25
    • Distance to point 11: 5, squared = 25
    • Average: (25 + 25)/2 = 25

The mean location minimizes the average squared distance.
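This holds in general: for the points assigned to cluster k, setting the gradient of their summed squared distance to zero shows the mean is the minimizer (a standard calculus step, not worked in the video):

```latex
\nabla_{\mu_k} \sum_{i : c^{(i)} = k} \left\| x^{(i)} - \mu_k \right\|^2
  = -2 \sum_{i : c^{(i)} = k} \left( x^{(i)} - \mu_k \right) = 0
\quad \Longrightarrow \quad
\mu_k = \frac{1}{\left|\{ i : c^{(i)} = k \}\right|} \sum_{i : c^{(i)} = k} x^{(i)}
```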

Convergence properties:

  • Every K-means iteration either decreases J or keeps it the same
  • J should never increase (if it does, there's a bug in the code)
  • Each step can only lower J or leave it unchanged, and J is bounded below by zero, so the algorithm must converge
  • If the cost function stops going down, K-means has converged
  • You can stop the algorithm when J stays the same for one iteration
  • Sometimes J goes down very slowly, and you may choose to stop early (similar to gradient descent)
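Putting the pieces together, a sketch of the full loop with this stopping rule (it reuses the helper sketches above; the function name, tolerance, and iteration cap are my own choices):

```python
import numpy as np

def run_kmeans(X, initial_centroids, max_iters=100, tol=1e-6):
    """Alternate the two steps until the distortion J stops decreasing."""
    centroids = initial_centroids.astype(float).copy()
    prev_J = np.inf
    for _ in range(max_iters):
        idx = assign_clusters(X, centroids)            # Step 1: minimizes J over c(1)..c(m)
        centroids = move_centroids(X, idx, centroids)  # Step 2: minimizes J over mu_1..mu_K
        J = distortion(X, centroids, idx)
        if J > prev_J:                                 # should never happen: indicates a bug
            raise RuntimeError("J increased between iterations")
        if prev_J - J < tol:                           # J stayed (nearly) the same: converged
            break
        prev_J = J
    return centroids, idx, J
```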

Understanding K-means as the optimization of a cost function provides insight into why the algorithm works and why it is guaranteed to converge to a local minimum of the distortion function.