
Optimization Objective

Like the supervised learning algorithms from courses 1 and 2, K-means also optimizes a specific cost function. However, it doesn't use gradient descent; it uses the iterative algorithm shown in the previous videos.

  • c⁽ⁱ⁾ = index of the cluster (1 to K) to which training example x⁽ⁱ⁾ is currently assigned
  • μₖ = location of cluster centroid k
  • μ_c⁽ⁱ⁾ = cluster centroid of the cluster to which example x⁽ⁱ⁾ has been assigned
  • If training example x⁽¹⁰⁾ is assigned to cluster 2: c⁽¹⁰⁾ = 2
  • Then μ_c⁽¹⁰⁾ = μ₂ = location of cluster centroid 2

The cost function J is:

J(c⁽¹⁾,...,c⁽ᵐ⁾, μ₁,...,μ_K) = (1/m) Σ(i=1 to m) ||x⁽ⁱ⁾ - μ_c⁽ⁱ⁾||²
  • Input variables: c⁽¹⁾ through c⁽ᵐ⁾ (cluster assignments), μ₁ through μ_K (centroid locations)
  • Calculation: Average squared distance between every training example x⁽ⁱ⁾ and the cluster centroid to which it's assigned
  • Alternative name: Distortion function (a name commonly used in the literature)

At any step of K-means (using the lecture's two-cluster illustration, with a blue and a red cluster):

  • Look at all blue points → measure distances to the blue centroid → square them
  • Look at all red points → measure distances to the red centroid → square them
  • The average of all these squared distances is the value of the cost function J
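As a minimal NumPy sketch of this computation (the function name and array layout are my own assumptions, not code from the course):

```python
import numpy as np

def distortion(X, centroids, idx):
    """Average squared distance from each example to its assigned centroid (the cost J).

    X:         (m, n) array of training examples
    centroids: (K, n) array of centroid locations mu_1..mu_K
    idx:       (m,) array of cluster indices c(i), one per example
    """
    diffs = X - centroids[idx]              # x(i) - mu_c(i) for every example
    return np.mean(np.sum(diffs ** 2, axis=1))
```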

Step 1: Assign Points to Cluster Centroids

  • Goal: Update c⁽¹⁾ through c⁽ᵐ⁾ to minimize J
  • Method: Hold μ₁ through μ_K fixed
  • Logic: To minimize ||x⁽ⁱ⁾ - μ_c⁽ⁱ⁾||², assign x⁽ⁱ⁾ to the closest cluster centroid

For example, with two cluster centroids μ₁ and μ₂ and a single training example x⁽ⁱ⁾:

  • Assign to centroid 1: large squared distance
  • Assign to centroid 2: much smaller squared distance
  • Result: Choose the closer centroid to minimize the cost
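A minimal sketch of this assignment step (again, the function name is my own, not the course's code):

```python
import numpy as np

def assign_clusters(X, centroids):
    """Step 1: assign each example to its closest centroid, holding centroids fixed."""
    # Pairwise squared distances from every example to every centroid: shape (m, K)
    dists = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    # Picking the nearest centroid minimizes each example's term in J
    return np.argmin(dists, axis=1)
```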
Step 2: Move Cluster Centroids

  • Goal: Update μ₁ through μ_K to minimize J
  • Method: Hold c⁽¹⁾ through c⁽ᵐ⁾ fixed
  • Logic: Choose μₖ as the mean of the points assigned to cluster k
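A matching sketch of the update step (the guard for empty clusters is my own addition; the course does not specify this behavior):

```python
import numpy as np

def move_centroids(X, idx, centroids):
    """Step 2: move each centroid to the mean of its assigned points, holding assignments fixed."""
    new_centroids = centroids.copy()
    for k in range(centroids.shape[0]):
        points = X[idx == k]
        if len(points) > 0:          # a cluster that lost all its points keeps its old location here
            new_centroids[k] = points.mean(axis=0)
    return new_centroids
```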

Example: a cluster with two points at positions 1 and 11 on a number line:

  • Centroid at position 1:

    • Distance to point 1: 0, squared = 0
    • Distance to point 11: 10, squared = 100
    • Average: (0 + 100)/2 = 50
  • Centroid at position 6 (mean):

    • Distance to point 1: 5, squared = 25
    • Distance to point 11: 5, squared = 25
    • Average: (25 + 25)/2 = 25

The mean location minimizes the average squared distance.
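This holds in general: for the points assigned to cluster k, setting the gradient of their summed squared distance to zero shows the mean is the minimizer (a standard calculus step, not worked in the video):

```latex
\nabla_{\mu_k} \sum_{i : c^{(i)} = k} \left\| x^{(i)} - \mu_k \right\|^2
  = -2 \sum_{i : c^{(i)} = k} \left( x^{(i)} - \mu_k \right) = 0
\quad \Longrightarrow \quad
\mu_k = \frac{1}{\left|\{ i : c^{(i)} = k \}\right|} \sum_{i : c^{(i)} = k} x^{(i)}
```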

Convergence properties:

  • Every K-means iteration either decreases J or keeps it the same
  • J should never increase (if it does, there's a bug in the code)
  • Each step can only lower J or leave it unchanged, and J is bounded below by zero, so the algorithm must converge
  • If the cost function stops going down, K-means has converged
  • You can stop the algorithm when J stays the same for one iteration
  • Sometimes J goes down very slowly, and you may choose to stop early (similar to gradient descent)
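Putting the pieces together, a sketch of the full loop with this stopping rule (it reuses the helper sketches above; the function name, tolerance, and iteration cap are my own choices):

```python
import numpy as np

def run_kmeans(X, initial_centroids, max_iters=100, tol=1e-6):
    """Alternate the two steps until the distortion J stops decreasing."""
    centroids = initial_centroids.astype(float).copy()
    prev_J = np.inf
    for _ in range(max_iters):
        idx = assign_clusters(X, centroids)            # Step 1: minimizes J over c(1)..c(m)
        centroids = move_centroids(X, idx, centroids)  # Step 2: minimizes J over mu_1..mu_K
        J = distortion(X, centroids, idx)
        if J > prev_J:                                 # should never happen: indicates a bug
            raise RuntimeError("J increased between iterations")
        if prev_J - J < tol:                           # J stayed (nearly) the same: converged
            break
        prev_J = J
    return centroids, idx, J
```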

Understanding K-means as the optimization of a cost function provides insight into why the algorithm works and why it is guaranteed to converge to a local minimum of the distortion function.