Pablo Rodriguez

K Means Algorithm

  • Randomly initialize K cluster centroids: μ₁, μ₂, …, μₖ
  • In the example: K = 2 (red cross = μ₁, blue cross = μ₂)
  • μ₁ and μ₂ are vectors with the same dimension as the training examples
  • If training examples have n = 2 features, then μ₁ and μ₂ are also 2-dimensional vectors
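As a minimal sketch of this initialization step, one common scheme (assumed here, not prescribed by the notes) is to pick K distinct training examples as the initial centroids. The function name `init_centroids` is a hypothetical helper:

```python
import numpy as np

def init_centroids(X, K, rng=None):
    """Pick K distinct training examples as initial centroids
    (one common initialization scheme; a sketch, not the only choice)."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx]
```

Each centroid then has the same dimension n as the training examples, as described above.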

K-means repeatedly carries out two steps:

Step 2.1: Assign Points to Cluster Centroids


For i = 1 to m (all m training examples):

  • Set c⁽ⁱ⁾ = index of cluster centroid closest to training example x⁽ⁱ⁾
  • Mathematically: find k that minimizes ||x⁽ⁱ⁾ - μₖ||²
  • Distance between two points: ||x⁽ⁱ⁾ - μₖ|| (L2 norm)
  • In practice it is more convenient to minimize the squared distance: ||x⁽ⁱ⁾ - μₖ||²
  • The centroid with the smallest squared distance is the same as the one with the smallest distance
  • Example: if x⁽¹⁾ is closer to the red centroid (centroid 1), set c⁽¹⁾ = 1
  • If x⁽¹²⁾ is closer to the blue centroid (centroid 2), set c⁽¹²⁾ = 2

Step 2.2: Move Cluster Centroids

For k = 1 to K (number of clusters):

  • Set cluster centroid location = average/mean of points assigned to cluster k
  • Look at all points with same color (e.g., all red points)
  • Compute average value on horizontal axis (first feature x₁)
  • Compute average value on vertical axis (second feature x₂)
  • New centroid location = these two averages
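The centroid-update step can likewise be sketched as follows (`move_centroids` is a hypothetical helper name; the empty-cluster case discussed further down is not handled here):

```python
import numpy as np

def move_centroids(X, c, K):
    """Move each centroid mu_k to the mean of the points assigned to
    cluster k. Averaging along axis 0 averages each feature (x1, x2, ...)
    separately, exactly as described in the bullets above."""
    return np.array([X[c == k].mean(axis=0) for k in range(K)])
```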

If cluster 1 has training examples 1, 5, 6, and 10 assigned to it:

μ₁ = (x⁽¹⁾ + x⁽⁵⁾ + x⁽⁶⁾ + x⁽¹⁰⁾) / 4

Where:

  • x⁽¹⁾, x⁽⁵⁾, x⁽⁶⁾, x⁽¹⁰⁾ are training examples (vectors)
  • Divide by 4 (number of examples)
  • μ₁ has the same number of dimensions as the training examples
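Plugging in made-up numbers (the four 2-D points below are hypothetical, chosen only to show the arithmetic):

```python
import numpy as np

# hypothetical 2-D training examples assigned to cluster 1
x1  = np.array([1.0, 2.0])
x5  = np.array([3.0, 2.0])
x6  = np.array([2.0, 4.0])
x10 = np.array([2.0, 0.0])

mu1 = (x1 + x5 + x6 + x10) / 4   # divide by 4, the number of examples
# mu1 comes out to [2.0, 2.0] — same dimension (n = 2) as the examples
```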

Problem: What if cluster has zero training examples assigned?

  • Cannot compute average of zero points (not well-defined)

Solutions:

  1. Most common: Eliminate that cluster (end up with K-1 clusters)
  2. Alternative: Randomly reinitialize that cluster centroid
    • Hope it gets assigned some points next iteration
    • Less commonly used
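A sketch of the less common fix, reinitializing an empty cluster's centroid to a random training example (the function name `move_centroids_safe` is a hypothetical helper; dropping the cluster instead would simply shrink K by 1):

```python
import numpy as np

def move_centroids_safe(X, c, centroids, rng=None):
    """Update centroids; if a cluster has zero assigned points (so its
    mean is not well-defined), reinitialize it to a random training
    example and hope it picks up points on the next iteration."""
    rng = np.random.default_rng(rng)
    K = centroids.shape[0]
    new = centroids.astype(float).copy()
    for k in range(K):
        members = X[c == k]
        if len(members) > 0:
            new[k] = members.mean(axis=0)
        else:
            new[k] = X[rng.integers(X.shape[0])]  # random reinitialization
    return new
```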

Applications Beyond Well-Separated Clusters

  • K-means still works even when clusters aren’t well-separated
  • Example: t-shirt sizing — collect data on people’s heights and weights
  • Height and weight vary continuously, with no clear cluster boundaries
  • Run K-means with 3 clusters to determine:
    • How small should small t-shirt be?
    • What dimensions for medium t-shirt?
    • How large should large t-shirt be?
  • Three clusters might represent small, medium, large t-shirt sizes
  • Cluster centroids give sense of most representative height/weight
  • Choose t-shirt dimensions to fit individuals in each cluster well

The K-means algorithm provides a systematic approach to clustering through random initialization followed by iterative assignment and centroid updates until convergence is achieved.
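Putting the pieces together, the whole loop can be sketched as below (a minimal illustration assuming NumPy; `k_means` is a hypothetical function name, initialization uses random training examples, and empty clusters are simply left in place rather than dropped):

```python
import numpy as np

def k_means(X, K, max_iters=100, rng=0):
    """Minimal K-means sketch: random initialization, then alternate the
    two steps until the cluster assignments stop changing."""
    rng = np.random.default_rng(rng)
    # initialize centroids to K distinct training examples
    centroids = X[rng.choice(X.shape[0], size=K, replace=False)].astype(float)
    c = None
    for _ in range(max_iters):
        # Step 2.1: assign each point to its closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_c = dists.argmin(axis=1)
        if c is not None and np.array_equal(new_c, c):
            break  # assignments unchanged: converged
        c = new_c
        # Step 2.2: move each centroid to the mean of its assigned points
        for k in range(K):
            members = X[c == k]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[k] = members.mean(axis=0)
    return centroids, c
```

On the t-shirt example, running this with K = 3 on a heights-and-weights matrix would return three centroids, each a representative height/weight pair for one size.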