Choosing the Number of Clusters
The Ambiguity Problem
Inherent Ambiguity in Clustering
For many clustering problems, the “right” value of K is truly ambiguous.
Example: different people can look at the same dataset and reasonably see different structure:
- Some people: “Two distinct clusters” ✓ (a valid reading)
- Other people: “Four distinct clusters” ✓ (also valid)
- Alternative view: “Three clusters” ✓ (also valid)
Why This Happens
- Clustering is unsupervised learning
- There are no “right answers” in the form of specific labels to replicate
- The data itself often gives no clear indication of the number of clusters
- Multiple valid interpretations exist
The Elbow Method
Academic Approach (Rarely Used in Practice)
One technique mentioned in the academic literature is the elbow method.
Process
- Run K-means with a variety of K values
- Plot the cost function J (distortion) against the number of clusters
- Look for an “elbow” in the curve
Elbow Pattern
- Few clusters (e.g., K=1): the distortion J is high
- More clusters: J decreases rapidly at first
- Even more clusters: J decreases more slowly
- “Elbow”: the point where the decrease becomes more gradual (e.g., K=3)
Why Called “Elbow”
- The shape resembles an arm: hand → elbow joint → upper arm
- A clear bend in the curve suggests a reasonable choice of K
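The process above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names (`kmeans_distortion`, `distortion_curve`, `make_blobs`), the number of random restarts, and the synthetic three-blob dataset are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def kmeans_distortion(points, k, rng, max_iters=100):
    """Run K-means once from a random start and return the distortion J."""
    centroids = rng.sample(points, k)  # pick k distinct examples as initial centroids
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster simply keeps its previous centroid).
        updated = [centroid(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # J = average squared distance from each point to its nearest centroid.
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)


def distortion_curve(points, k_values, n_restarts=10, seed=0):
    """For each K, keep the lowest J found over several random restarts."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_distortion(points, k, rng) for _ in range(n_restarts))
        for k in k_values
    }


def make_blobs(centers, n_per_blob=30, spread=0.5, seed=1):
    """Synthetic 2-D data: Gaussian noise around each given center."""
    rng = random.Random(seed)
    return [
        (rng.gauss(cx, spread), rng.gauss(cy, spread))
        for cx, cy in centers
        for _ in range(n_per_blob)
    ]


points = make_blobs([(0, 0), (10, 0), (5, 9)])
curve = distortion_curve(points, range(1, 7))
for k, j in curve.items():
    print(f"K={k}: J={j:.3f}")
```

Printing or plotting `curve` lets you look for the bend by eye. On data with well-separated groups the bend tends to be visible; on real data it is often much less clear.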
Limitations of Elbow Method
- Personal experience: “I personally hardly ever use the elbow method”
- Reason: Many cost functions decrease smoothly, without a clear elbow
- Reality: Most applications don’t show an obvious elbow point
What NOT to Do
Wrong approach: Choose K to minimize the cost function J
- Problem: This would almost always choose the largest possible K
- Why: More clusters → lower cost function J
- Result: Not a meaningful clustering
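To see why minimizing J over K fails, it helps to write the distortion out. The notation below (m examples x^(i), cluster assignments c^(i), centroids μ_k) is the usual K-means convention and is assumed here, since these notes do not define it:

```latex
J\bigl(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K\bigr)
  = \frac{1}{m}\sum_{i=1}^{m}\bigl\lVert x^{(i)} - \mu_{c^{(i)}}\bigr\rVert^{2},
\qquad
K = m,\ \mu_i = x^{(i)},\ c^{(i)} = i
  \;\Longrightarrow\; J = 0 .
```

Placing one centroid on every example drives J to zero, the smallest value possible, yet that "clustering" says nothing about structure in the data.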
Practical Approach: Downstream Purpose
Recommended Strategy
Evaluate K-means based on how well it performs for the later (downstream) purpose.
T-shirt Sizing Example
Business Context
- Run K-means to determine t-shirt sizes
- Question: How many t-shirt sizes should there be?
Two Valid Options
Option 1: K = 3 (Small, Medium, Large)
- Three distinct clusters for three t-shirt sizes
- Simpler manufacturing and shipping
- Lower costs
Option 2: K = 5 (XS, S, M, L, XL)
- Five clusters for five t-shirt sizes
- Better fit for customers
- Higher manufacturing and shipping costs
Decision Process
- Run K-means with K = 3 and K = 5
- Examine both solutions
- Consider trade-offs:
  - Better fit (more sizes) vs. lower cost (fewer sizes)
- Decide based on business needs and constraints
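The comparison described above can be made concrete with a small sketch. Everything specific here is hypothetical: the customer data is a synthetic, evenly spread list of chest measurements in centimetres, `COST_PER_SIZE` is a made-up cost unit, and `kmeans_1d` is a simplified one-dimensional K-means with a deterministic start chosen for reproducibility.

```python
def kmeans_1d(values, k, max_iters=100):
    """Lloyd's algorithm on 1-D data (requires k >= 2).

    Returns (centroids, distortion), where distortion is the mean squared
    distance from each value to its nearest centroid.
    """
    data = sorted(values)
    n = len(data)
    # Deterministic start: k evenly spaced values from the sorted data
    # (an illustrative choice; random initialization is more common).
    centroids = [data[i * (n - 1) // (k - 1)] for i in range(k)]
    for _ in range(max_iters):
        # Assignment step: each value goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: (x - centroids[j]) ** 2)
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    distortion = sum(min((x - c) ** 2 for c in centroids) for x in data) / n
    return centroids, distortion


COST_PER_SIZE = 1.0              # hypothetical manufacturing/shipping cost per size
chest_cm = list(range(80, 121))  # synthetic customers, one per centimetre

results = {}
for k in (3, 5):
    sizes, distortion = kmeans_1d(chest_cm, k)
    results[k] = {
        "sizes": [round(s, 1) for s in sizes],
        "distortion": distortion,
        "cost": k * COST_PER_SIZE,
    }
    print(f"K={k}: sizes={results[k]['sizes']}, "
          f"fit error J={distortion:.2f}, cost={results[k]['cost']:.1f}")
```

The code only lays the two options side by side: more sizes give a lower fit error and a higher cost. Which one to pick remains a business judgment, not something the algorithm decides.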
Image Compression Example
Context: K-means for image compression (programming exercise)
- Trade-off: Image quality vs. compression ratio
- Decision factors:
  - How good should the image look?
  - How much compression is needed?
  - Are there file size constraints?
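The compression side of this trade-off is simple arithmetic, sketched below under stated assumptions: a hypothetical 128×128 image with 24-bit RGB pixels, compressed by storing a palette of K colors plus one palette index per pixel. The function names are illustrative, and the sketch does not measure image quality, which has to be judged on the actual picture.

```python
def image_bits(height, width, bits_per_pixel=24):
    """Size of the uncompressed image in bits."""
    return height * width * bits_per_pixel


def compressed_bits(height, width, k, bits_per_color=24):
    """Size after replacing each pixel with an index into a K-color palette."""
    index_bits = (k - 1).bit_length()  # bits needed to store an index in 0..k-1
    palette = k * bits_per_color       # the K centroid colors themselves
    return palette + height * width * index_bits


HEIGHT, WIDTH = 128, 128  # hypothetical image dimensions
original = image_bits(HEIGHT, WIDTH)

ratios = {}
for k in (2, 4, 16, 64, 256):
    size = compressed_bits(HEIGHT, WIDTH, k)
    ratios[k] = original / size
    print(f"K={k:3d}: {size:7d} bits, compression ratio {ratios[k]:.2f}x")
```

Smaller K compresses more aggressively but leaves fewer colors to represent the image, so the ratio printed here needs to be weighed against how the result looks.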
Manual Decision Process
Use trade-off analysis to decide manually on a suitable K, based on:
- Quality requirements
- Storage limitations
- Computational constraints
Choosing the number of clusters requires balancing mathematical optimization with practical considerations specific to your application’s goals and constraints.