Choosing the Number of Clusters

The Ambiguity Problem

Inherent Ambiguity in Clustering

For many clustering problems, the “right” value of K is truly ambiguous.

Example: Same dataset, different valid perspectives:

Some people: “Two distinct clusters” ✓ (correct)
Other people: “Four distinct clusters” ✓ (also correct)
Alternative view: “Three clusters” ✓ (also valid)

Why This Happens

Clustering is unsupervised learning
No “right answers” in the form of specific labels to replicate
The data itself often gives no clear indication of the number of clusters
Multiple valid interpretations exist

The Elbow Method

Academic Approach (Rarely Used in Practice)

One technique mentioned in academic literature:

Process

Run K-means with a variety of values of K
Plot the cost function J (distortion) against the number of clusters K
Look for “elbow” in the curve
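
The process above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the data are three synthetic 2-D blobs, the helper names (`kmeans_distortion`, `mean_point`, `squared_distance`) are made up for this sketch, and J is taken to be the average squared distance from each point to its nearest centroid. In practice a library implementation of K-means would normally be used instead.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_distortion(points, k, n_init=20, max_iter=100, seed=0):
    """Run K-means from n_init random starts; return the lowest distortion J found."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(n_init):
        # Initialise centroids at k distinct training points.
        centroids = rng.sample(points, k)
        for _ in range(max_iter):
            # Assignment step: each point joins its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its cluster
            # (an empty cluster keeps its old centroid).
            updated = [mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)]
            if updated == centroids:
                break
            centroids = updated
        j_value = sum(min(squared_distance(p, c) for c in centroids) for p in points) / len(points)
        best = min(best, j_value)
    return best

# Synthetic data (an assumption for illustration): three blobs of 30 points each.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (6.0, 0.0), (3.0, 5.0)]
points = [(data_rng.gauss(cx, 0.6), data_rng.gauss(cy, 0.6))
          for cx, cy in blob_centers for _ in range(30)]

# Distortion J for each candidate K; these are the values one would plot.
distortions = {k: kmeans_distortion(points, k) for k in range(1, 7)}
for k, j_value in distortions.items():
    print(f"K={k}: J={j_value:.3f}")
```

Plotting `distortions` against K gives the curve in which one looks for an elbow. On data without well-separated groups the printed values tend to fall off smoothly, with no obvious bend.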

Elbow Pattern

Few clusters (e.g., K=1): High distortion J
More clusters: J decreases rapidly initially
Even more clusters: J decreases more slowly
“Elbow”: Point where decrease becomes more gradual (e.g., K=3)

Why Called “Elbow”

Shape resembles arm: hand → elbow joint → upper arm
Clear bend in the curve suggests a reasonable choice of K

Limitations of Elbow Method

Personal experience: “I personally hardly ever use the elbow method”
Reason: Many cost functions decrease smoothly without clear elbow
Reality: Most applications don’t show obvious elbow point

What NOT to Do

Wrong approach: Choose K to minimize cost function J

Problem: Would almost always choose largest possible K
Why: More clusters → lower cost J, down to J = 0 when every point is its own centroid
Result: Not meaningful clustering
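
A tiny worked example makes the problem concrete. The 1-D data and the hand-picked centroids below are made up for illustration, and `distortion` is a hypothetical helper that computes J as the average squared distance to the nearest centroid.

```python
def distortion(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    return sum(min((p - c) ** 2 for c in centroids) for p in points) / len(points)

# Made-up 1-D data with two visible groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

j_k1 = distortion(points, [6.5])        # K = 1: single centroid at the overall mean
j_k2 = distortion(points, [2.0, 11.0])  # K = 2: one centroid per visible group
j_k6 = distortion(points, points)       # K = 6: every point is its own centroid

print(j_k1, j_k2, j_k6)
```

With K equal to the number of points, J is exactly zero, so minimizing J alone always favors the largest K even though that "clustering" says nothing about the structure of the data.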

Practical Approach: Downstream Purpose

Recommended Strategy

Evaluate K-means by how well it serves the later (downstream) purpose it is being run for.

T-shirt Sizing Example

Business Context

Run K-means to determine t-shirt sizes
Question: How many t-shirt sizes should there be?

Two Valid Options

Option 1: K = 3 (Small, Medium, Large)

Three distinct clusters for three t-shirt sizes
Simpler manufacturing and shipping
Lower costs

Option 2: K = 5 (XS, S, M, L, XL)

Five clusters for five t-shirt sizes
Better fit for customers
Higher manufacturing and shipping costs

Decision Process

Run K-means with K = 3 and K = 5
Examine both solutions
Consider trade-offs:
- Better fit (more sizes) vs Lower cost (fewer sizes)
Decide based on business needs and constraints
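
As a rough sketch of this decision process, the snippet below runs a 1-D K-means for K = 3 and K = 5 and reports how closely the resulting sizes fit the customers. Everything specific here is an assumption for illustration: the chest measurements are invented, the helpers `kmeans_1d` and `mean_fit_error` are hypothetical, and the evenly spaced initialization is chosen only to keep the sketch deterministic (random restarts are more usual).

```python
def kmeans_1d(values, k, max_iter=100):
    """Plain K-means on 1-D data, started from evenly spaced points of the sorted data."""
    ordered = sorted(values)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(max_iter):
        # Assignment step: each measurement goes to its nearest size.
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            groups[nearest].append(v)
        # Update step: each size moves to the mean of its group.
        updated = [sum(g) / len(g) if g else centroids[j] for j, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)

def mean_fit_error(values, sizes):
    """Average gap (in cm) between a customer and the nearest available size."""
    return sum(min(abs(v - s) for s in sizes) for v in values) / len(values)

# Invented chest measurements in cm (illustrative only).
chest_cm = [84, 86, 88, 90, 91, 93, 95, 96, 98, 100, 102, 104, 106, 109, 112, 116]

options = {}
for k in (3, 5):
    sizes = kmeans_1d(chest_cm, k)
    options[k] = (sizes, mean_fit_error(chest_cm, sizes))
    print(f"K={k}: sizes={[round(s, 1) for s in sizes]}, mean fit error={options[k][1]:.2f} cm")
```

The code only quantifies the fit side of the trade-off. Whether the improvement from five sizes justifies the extra manufacturing and shipping cost remains a business judgment that the algorithm cannot make.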

Image Compression Example

Context: K-means for image compression (programming exercise)

Trade-off: Image quality vs compression ratio
Decision factors:
- How good should image look?
- How much compression is needed?
- File size constraints?
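
The compression side of this trade-off can be estimated with simple arithmetic. The sketch below assumes a 24-bit RGB image stored as a palette of K colors plus one palette index per pixel. The 128 x 128 image size and the function names are assumptions chosen for illustration, and real file formats add overhead that is ignored here.

```python
def compressed_bits(n_pixels, k, bits_per_color=24):
    """Bits needed for a K-color palette plus one palette index per pixel."""
    palette_bits = k * bits_per_color
    # Bits per index: smallest whole number of bits that can address k colors.
    index_bits = max(1, (k - 1).bit_length())
    return palette_bits + n_pixels * index_bits

def compression_ratio(n_pixels, k, bits_per_color=24):
    """Original size divided by compressed size (higher means smaller files)."""
    return (n_pixels * bits_per_color) / compressed_bits(n_pixels, k, bits_per_color)

n_pixels = 128 * 128  # assumed image size
ratios = {k: compression_ratio(n_pixels, k) for k in (4, 16, 64, 256)}
for k, ratio in ratios.items():
    print(f"K={k}: {compressed_bits(n_pixels, k)} bits, ratio={ratio:.2f}x")
```

Smaller K gives a higher compression ratio but fewer colors to represent the image, so the quality side still has to be judged by looking at the reconstructed image for each candidate K.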

Manual Decision Process

Use trade-off analysis to choose K manually, based on:

Quality requirements
Storage limitations
Computational constraints

Choosing the number of clusters requires balancing mathematical optimization with practical considerations specific to your application’s goals and constraints.