K Means Intuition

  • Data set with 30 unlabeled training examples (30 points)
  • Goal: Run K-means to find two clusters (K=2)
  • Later in week: discuss how to decide how many clusters to find
  • K-means algorithm first takes random guess at where centers of clusters might be
  • Randomly pick two points (red cross and blue cross) as initial cluster centers
  • These are just random initial guesses, not necessarily good ones (a rough initialization sketch follows this list)
  • Centers of clusters are called cluster centroids
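
As a minimal sketch (the function name `init_centroids` and the use of NumPy are my own assumptions, not from the lecture), one common way to make that random guess is to pick K of the training examples themselves as the starting centroids:

```python
import numpy as np

def init_centroids(X, K, seed=None):
    """Pick K distinct training examples at random as the initial centroids."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx].copy()
```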

K-means repeatedly does two different things:

  1. Assign points to cluster centroids
  2. Move cluster centroids

Step 1: Assign Points to Cluster Centroids

  • Go through each of the points (x⁽¹⁾ through x⁽³⁰⁾)
  • For each point, check whether it is closer to the red cross or the blue cross
  • Assign each point to whichever cluster centroid it is closer to (a short assignment sketch follows this list)
  • Illustrate assignment by painting each point either red or blue
    • Point closer to red centroid → painted red
    • Point closer to blue centroid → painted blue
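
Continuing the sketch above (still an illustrative NumPy version, not the lecture's code), the assignment step computes the distance from every point to every centroid and keeps the index of the nearest one:

```python
import numpy as np

def assign_to_centroids(X, centroids):
    """Return, for each example, the index of the closest cluster centroid."""
    # Distance from every point to every centroid, shape (m, K)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # With K=2, these indices play the role of the red/blue colors
    return np.argmin(dists, axis=1)
```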

Step 2: Move Cluster Centroids

  • Look at all the red points and take their average
  • Move the red cross to the average location of the red dots
  • Look at all the blue dots and take their average
  • Move the blue cross to the average location of the blue dots
  • This gives new locations for both cluster centroids (a short update sketch follows this list)
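
A minimal sketch of the update step, in the same illustrative NumPy style (the handling of an empty cluster is my own assumption; the lecture does not cover that case here):

```python
import numpy as np

def move_centroids(X, assignments, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[assignments == k]
        if len(members) > 0:
            # Average location of all points "painted" with color k
            new_centroids[k] = members.mean(axis=0)
    return new_centroids
```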

Repeat the Two Steps

  1. Look through all 30 training examples again
  2. Check for every point whether it’s closer to red or blue centroid (new locations)
  3. Associate each point with closer cluster centroid (indicated by color)
  4. Some points may change color as centroids have moved
  5. Repeat: compute average location of red dots and blue dots
  6. Move cluster centroids to new average locations
  7. Repeat process
  • Keep repeating the two steps:
    • Assign each point to nearest cluster centroid
    • Move each cluster centroid to mean of all points with same color
  • The algorithm converges when there are no more changes to either of the following (a combined loop sketch follows this list):
    • Colors of points
    • Locations of cluster centroids
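
Putting the pieces together, here is a sketch of the full loop, reusing the helper functions from the earlier sketches (the `max_iters` safety cap is my own addition, not part of the lecture):

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=None):
    """Alternate the assign/move steps until no point changes cluster."""
    centroids = init_centroids(X, K, seed=seed)      # random initial guesses
    assignments = assign_to_centroids(X, centroids)  # first "coloring" of the points
    for _ in range(max_iters):
        centroids = move_centroids(X, assignments, centroids)  # step 2: move centroids
        new_assignments = assign_to_centroids(X, centroids)    # step 1: reassign points
        if np.array_equal(new_assignments, assignments):
            break  # converged: no colors changed, so the centroids stop moving too
        assignments = new_assignments
    return centroids, assignments
```

For the 30-point example above, something like `centroids, labels = k_means(X, K=2)` would return two centroid locations and a 0/1 label (the "color") for each point.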

In the example, K-means has done a pretty good job:

  • Points up in one area correspond to one cluster
  • Points down in another area correspond to second cluster

The K-means algorithm demonstrates a simple yet powerful approach to automatically discovering natural groupings in data through this iterative refinement process.