Pablo Rodriguez

Choosing the Number of Clusters

For many clustering problems, the “right” value of K is truly ambiguous.

Example: The same dataset can be read in several valid ways:

  • Some people: “Two distinct clusters” ✓ (correct)
  • Other people: “Four distinct clusters” ✓ (also correct)
  • Alternative view: “Three clusters” ✓ (also valid)

Why the ambiguity arises:

  • Clustering is unsupervised learning
  • There are no “right answers” in the form of specific labels to replicate
  • The data itself often gives no clear indication of the number of clusters
  • Multiple valid interpretations exist

Academic Approach (Rarely Used in Practice)


One technique mentioned in the academic literature is the elbow method:

  1. Run K-means with a variety of K values
  2. Plot the cost function J (distortion) against the number of clusters
  3. Look for an “elbow” in the curve

How the curve typically behaves:

  • Few clusters (e.g., K=1): High distortion J
  • More clusters: J decreases rapidly at first
  • Even more clusters: J decreases more slowly
  • “Elbow”: The point where the decrease becomes more gradual (e.g., K=3)
  • The shape resembles an arm: hand → elbow joint → upper arm
  • A clear bend in the curve suggests a value for K

Why it is rarely used in practice:

  • Personal experience: “I personally hardly ever use the elbow method”
  • Reason: Many cost functions decrease smoothly without a clear elbow
  • Reality: Most applications don’t show an obvious elbow point
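The three steps of the elbow method can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the synthetic three-blob data, the helper names (`sq_dist`, `kmeans`, `distortion_curve`), and the choice of five random restarts per K are all assumptions made for the example. It prints J for each K so the curve can be plotted or inspected by eye.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, rng=None):
    """Basic Lloyd's algorithm; returns (centroids, labels, distortion J)."""
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)  # initialise at k distinct data points
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, c in zip(points, labels) if c == j]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
              for p in points]
    j_cost = sum(sq_dist(p, centroids[c])
                 for p, c in zip(points, labels)) / len(points)
    return centroids, labels, j_cost

def distortion_curve(points, k_values, restarts=5, seed=0):
    """Best (lowest) distortion found for each K over several restarts."""
    rng = random.Random(seed)
    return {k: min(kmeans(points, k, rng=rng)[2] for _ in range(restarts))
            for k in k_values}

# Synthetic data (an assumption for illustration): three 2-D blobs.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [(cx + data_rng.gauss(0, 0.5), cy + data_rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

curve = distortion_curve(points, range(1, 7))
for k, j_cost in curve.items():
    print(f"K={k}: J={j_cost:.3f}")
```

On well-separated blobs like these the bend tends to be visible; on real data the curve is often smooth, which is the limitation noted above.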

Wrong approach: Choose K to minimize cost function J

  • Problem: Would almost always choose largest possible K
  • Why: More clusters → lower cost function J
  • Result: Not meaningful clustering
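To see why, assume the usual definition of distortion as the average squared distance between each point and its assigned centroid. The notation here ($m$ training examples $x^{(i)}$, assignments $c^{(i)}$, centroids $\mu_k$) is an assumption, since these notes do not define it.

```latex
J(c, \mu) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^{2}

% With K = m, every point can be its own centroid:
\mu_{c^{(i)}} = x^{(i)} \ \text{for all } i \quad\Longrightarrow\quad J = 0
```

Under this definition, the extreme choice K = m drives J to zero while saying nothing about the structure of the data.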

Evaluate K-means by how well it serves the later (downstream) purpose.

Example: T-shirt sizing

  • Run K-means to determine t-shirt sizes
  • Question: How many t-shirt sizes should there be?

Option 1: K = 3 (Small, Medium, Large)

  • Three distinct clusters for three t-shirt sizes
  • Simpler manufacturing and shipping
  • Lower costs

Option 2: K = 5 (XS, S, M, L, XL)

  • Five clusters for five t-shirt sizes
  • Better fit for customers
  • Higher manufacturing and shipping costs

Decision process:

  1. Run K-means with K = 3 and K = 5
  2. Examine both solutions
  3. Consider trade-offs:
    • Better fit (more sizes) vs Lower cost (fewer sizes)
  4. Decide based on business needs and constraints
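The comparison of K = 3 and K = 5 might look like the following sketch. Everything concrete in it is assumed for illustration: customers are described by a single synthetic height measurement, `kmeans_1d` is a hypothetical helper, and "fit error" is simply the distortion J. The code only lays the two options side by side; weighing fit against cost remains a human decision.

```python
import random

def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm on scalars; returns (sorted centroids, distortion J)."""
    data = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: (v - centroids[j]) ** 2)
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    j_cost = sum(min((v - c) ** 2 for c in centroids) for v in data) / len(data)
    return sorted(centroids), j_cost

# Synthetic customer heights in cm (made up for this example, not real data).
rng = random.Random(7)
heights_cm = [rng.gauss(170, 10) for _ in range(200)]

options = {}
for k in (3, 5):
    centroids, fit_error = kmeans_1d(heights_cm, k)
    options[k] = {"centroids": centroids, "fit_error": fit_error}
    sizes = ", ".join(f"{c:.0f}" for c in centroids)
    print(f"K={k}: size centers (cm) = [{sizes}], fit error J = {fit_error:.1f}")
```

The output shows how much fit improves with five sizes; whether that improvement justifies the extra manufacturing and shipping cost is the business question from step 4.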

Context: K-means for image compression (programming exercise)

  • Trade-off: Image quality vs compression ratio
  • Decision factors:
    • How good should image look?
    • How much compression is needed?
    • File size constraints?

Weigh these trade-offs to choose K manually, based on:

  • Quality requirements
  • Storage limitations
  • Computational constraints
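The storage side of this trade-off can be estimated with simple arithmetic. The sketch below assumes a 24-bit RGB image and a compressed form made of a palette of K colors plus one palette index per pixel; the 128×128 image size and the function names are illustrative assumptions, not details taken from these notes.

```python
import math

def original_size_bits(n_pixels, channels=3, bits_per_channel=8):
    """Size of the uncompressed image: every pixel stores a full color."""
    return n_pixels * channels * bits_per_channel

def compressed_size_bits(n_pixels, k, channels=3, bits_per_channel=8):
    """Size after K-means color quantization: palette plus per-pixel index."""
    palette_bits = k * channels * bits_per_channel
    bits_per_index = max(1, math.ceil(math.log2(k)))  # bits to name one of k colors
    return palette_bits + n_pixels * bits_per_index

n_pixels = 128 * 128  # assumed image size for illustration
for k in (4, 16, 64):
    before = original_size_bits(n_pixels)
    after = compressed_size_bits(n_pixels, k)
    print(f"K={k}: {before} bits -> {after} bits (ratio {before / after:.2f})")
```

A larger K keeps more distinct colors, which usually means better image quality, but each pixel index needs more bits, so the compression ratio drops.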

Choosing the number of clusters requires balancing mathematical optimization with practical considerations specific to your application’s goals and constraints.