Pablo Rodriguez

K Means Algorithm

  • Randomly initialize K cluster centroids: μ₁, μ₂, …, μₖ
  • In the example: K = 2 (red cross = μ₁, blue cross = μ₂)
  • μ₁ and μ₂ are vectors with the same dimension as the training examples
  • If training examples have n = 2 features, then μ₁ and μ₂ are also 2-dimensional vectors
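As a minimal sketch of this initialization step, one common scheme (assumed here, not prescribed by the notes) is to pick K distinct training examples as the initial centroids. The function name `init_centroids` is a hypothetical helper:

```python
import numpy as np

def init_centroids(X, K, rng=None):
    """Pick K distinct training examples as initial centroids
    (one common initialization scheme; a sketch, not the only choice)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx]
```

Each centroid then has the same dimension n as the training examples, as described above.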

K-means repeatedly carries out two steps:

Step 2.1: Assign Points to Cluster Centroids


For i = 1 to m (all m training examples):

  • Set c⁽ⁱ⁾ = index of cluster centroid closest to training example x⁽ⁱ⁾
  • Mathematically: find k that minimizes ||x⁽ⁱ⁾ - μₖ||²
  • Distance between two points: ||x⁽ⁱ⁾ - μₖ|| (L2 norm)
  • In practice it is more convenient to minimize the squared distance: ||x⁽ⁱ⁾ - μₖ||²
  • The centroid with the smallest squared distance is the same as the one with the smallest distance
  • Example: if x⁽¹⁾ is closer to the red centroid (centroid 1), set c⁽¹⁾ = 1
  • If x⁽¹²⁾ is closer to the blue centroid (centroid 2), set c⁽¹²⁾ = 2

Step 2.2: Move Cluster Centroids

For k = 1 to K (number of clusters):

  • Set cluster centroid location = average/mean of points assigned to cluster k
  • Look at all points with same color (e.g., all red points)
  • Compute average value on horizontal axis (first feature x₁)
  • Compute average value on vertical axis (second feature x₂)
  • New centroid location = these two averages
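The centroid-update step can likewise be sketched as follows (`move_centroids` is a hypothetical helper name; the empty-cluster case discussed further down is not handled here):

```python
import numpy as np

def move_centroids(X, c, K):
    """Move each centroid mu_k to the mean of the points assigned to
    cluster k. Averaging along axis 0 averages each feature (x1, x2, ...)
    separately, exactly as described in the bullets above."""
    return np.array([X[c == k].mean(axis=0) for k in range(K)])
```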

If cluster 1 has training examples 1, 5, 6, and 10 assigned to it:

μ₁ = (x⁽¹⁾ + x⁽⁵⁾ + x⁽⁶⁾ + x⁽¹⁰⁾) / 4

Where:

  • x⁽¹⁾, x⁽⁵⁾, x⁽⁶⁾, x⁽¹⁰⁾ are training examples (vectors)
  • Divide by 4 (number of examples)
  • μ₁ has the same number of dimensions as the training examples
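Plugging in made-up numbers (the four 2-D points below are hypothetical, chosen only to show the arithmetic):

```python
import numpy as np

# hypothetical 2-D training examples assigned to cluster 1
x1  = np.array([1.0, 2.0])
x5  = np.array([3.0, 2.0])
x6  = np.array([2.0, 4.0])
x10 = np.array([2.0, 0.0])

mu1 = (x1 + x5 + x6 + x10) / 4   # divide by 4, the number of examples
# mu1 comes out to [2.0, 2.0] — same dimension (n = 2) as the examples
```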

Problem: What if cluster has zero training examples assigned?

  • Cannot compute average of zero points (not well-defined)

Solutions:

  1. Most common: Eliminate that cluster (end up with K-1 clusters)
  2. Alternative: Randomly reinitialize that cluster centroid
    • Hope it gets assigned some points next iteration
    • Less commonly used
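A sketch of the less common fix, reinitializing an empty cluster's centroid to a random training example (the function name `move_centroids_safe` is a hypothetical helper; dropping the cluster instead would simply shrink K by 1):

```python
import numpy as np

def move_centroids_safe(X, c, centroids, rng=None):
    """Update centroids; if a cluster has zero assigned points (so its
    mean is not well-defined), reinitialize it to a random training
    example and hope it picks up points on the next iteration."""
    rng = np.random.default_rng(rng)
    K = centroids.shape[0]
    new = centroids.astype(float).copy()
    for k in range(K):
        members = X[c == k]
        if len(members) > 0:
            new[k] = members.mean(axis=0)
        else:
            new[k] = X[rng.integers(X.shape[0])]  # random reinitialization
    return new
```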

Applications Beyond Well-Separated Clusters

  • K-means still works even when clusters aren’t well-separated
  • Example: t-shirt sizing — collect data on people’s heights and weights
  • Height and weight vary continuously, with no clear cluster boundaries
  • Run K-means with 3 clusters to determine:
    • How small should small t-shirt be?
    • What dimensions for medium t-shirt?
    • How large should large t-shirt be?
  • Three clusters might represent small, medium, large t-shirt sizes
  • Cluster centroids give sense of most representative height/weight
  • Choose t-shirt dimensions to fit individuals in each cluster well

The K-means algorithm provides a systematic approach to clustering through random initialization followed by iterative assignment and centroid updates until convergence is achieved.
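Putting the pieces together, the whole loop can be sketched as below (a minimal illustration assuming NumPy; `k_means` is a hypothetical function name, initialization uses random training examples, and empty clusters are simply left in place rather than dropped):

```python
import numpy as np

def k_means(X, K, max_iters=100, rng=0):
    """Minimal K-means sketch: random initialization, then alternate the
    two steps until the cluster assignments stop changing."""
    rng = np.random.default_rng(rng)
    # initialize centroids to K distinct training examples
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    c = None
    for _ in range(max_iters):
        # Step 2.1: assign each point to its closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_c = dists.argmin(axis=1)
        if c is not None and np.array_equal(new_c, c):
            break  # assignments unchanged: converged
        c = new_c
        # Step 2.2: move each centroid to the mean of its assigned points
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids, c
```

On the t-shirt example, running this with K = 3 on a heights-and-weights matrix would return three centroids, each a representative height/weight pair for one size.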