
Measuring Purity

Entropy quantifies the impurity of a set of examples, providing a mathematical foundation for a decision tree's splitting decisions.

For a set of examples:

  • p₁: Fraction of examples that are cats (positive class)
  • p₀: Fraction of examples that are not cats = (1 - p₁)

Mathematical Definition:

H(p₁) = -p₁ log₂(p₁) - p₀ log₂(p₀)
      = -p₁ log₂(p₁) - (1 - p₁) log₂(1 - p₁)

Base 2 Logarithm

Using log₂ makes the peak entropy value equal to 1, giving the measure an intuitive 0-to-1 scale.
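
As a rough illustration, here is a minimal Python sketch of this definition (the function name and edge-case handling are my own; the 0 log₂(0) = 0 convention it relies on is discussed later in this section):

```python
from math import log2

def entropy(p1: float) -> float:
    """Entropy of a binary label distribution, given p1 = fraction of positive examples."""
    p0 = 1.0 - p1
    # Skip zero-probability terms, applying the 0 * log2(0) = 0 convention,
    # so pure nodes (p1 = 0 or p1 = 1) evaluate to exactly 0.
    return sum(-p * log2(p) for p in (p1, p0) if p > 0)

print(entropy(0.5))  # 1.0 -- maximum impurity for a 50-50 split
```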

Example 1: 3 cats, 3 dogs (6 total)

  • p₁ = 3/6 = 0.5
  • H(0.5) = 1.0
  • Maximum impurity: 50-50 split represents highest uncertainty

Example 2: 5 cats, 1 dog (6 total)

  • p₁ = 5/6 ≈ 0.83
  • H(0.83) ≈ 0.65
  • Lower impurity: Strong majority class

Example 3: 6 cats, 0 dogs (6 total)

  • p₁ = 6/6 = 1.0
  • H(1.0) = 0
  • Zero impurity: Single class only

Example 4: 2 cats, 4 dogs (6 total)

  • p₁ = 2/6 = 1/3 ≈ 0.33
  • H(0.33) ≈ 0.92
  • High impurity: Closer to 50-50 than Example 2
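
The four worked examples above can be checked numerically; a small sketch, reusing the same entropy definition as before:

```python
from math import log2

def entropy(p1: float) -> float:
    # Same formula as above; zero-probability terms are skipped (0 * log2(0) = 0).
    return sum(-p * log2(p) for p in (p1, 1.0 - p1) if p > 0)

for cats, dogs in [(3, 3), (5, 1), (6, 0), (2, 4)]:
    p1 = cats / (cats + dogs)
    print(f"{cats} cats, {dogs} dogs: p1 = {p1:.2f}, H(p1) = {entropy(p1):.2f}")

# Expected H values (rounded): 1.00, 0.65, 0.00, 0.92
```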

Maximum at p₁ = 0.5

Entropy = 1.0

  • 50-50 class distribution
  • Highest uncertainty
  • Most impure state

Minimum at Extremes

Entropy = 0.0

  • p₁ = 0 (all negative)
  • p₁ = 1 (all positive)
  • Perfect purity

  • Mathematical issue: log₂(0) is undefined (the limit is negative infinity)
  • Convention: for entropy calculations, 0 log₂(0) is treated as 0
  • Result: entropy is correctly computed as 0 for pure nodes
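
A one-term helper makes the convention concrete (plogp is a hypothetical name, not from any library):

```python
from math import log2

def plogp(p: float) -> float:
    # The 0 * log2(0) = 0 convention: return 0 for a zero-probability term
    # instead of evaluating log2(0), which is undefined.
    return 0.0 if p == 0.0 else -p * log2(p)

print(plogp(1.0) + plogp(0.0))  # entropy of a pure node: 0.0
```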

Alternative Option: Some open-source packages use the Gini criterion instead of entropy

  • Similar behavior: Rises from 0 to a single peak at p₁ = 0.5 and falls back to 0 (compared in the sketch below)
  • Equivalent effectiveness: Works well in practice for decision trees
  • Focus Choice: We'll use entropy for simplicity and consistency
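
For a side-by-side feel, here is a minimal sketch comparing binary entropy with the standard binary Gini impurity, 2 · p₁ · (1 - p₁) (a generic formula, not tied to any particular package's implementation):

```python
from math import log2

def entropy(p1: float) -> float:
    return sum(-p * log2(p) for p in (p1, 1.0 - p1) if p > 0)

def gini(p1: float) -> float:
    # Binary Gini impurity: 1 - p1^2 - p0^2, which simplifies to 2 * p1 * (1 - p1).
    return 2.0 * p1 * (1.0 - p1)

for p1 in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p1 = {p1:.2f}: entropy = {entropy(p1):.2f}, gini = {gini(p1):.2f}")
```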

Interpreting Entropy Values

  • H ≈ 1.0: Mixed classes, needs splitting
  • H ≈ 0.5-0.8: Moderate impurity, potential for improvement
  • H ≈ 0.0: Pure node, stop splitting

Higher Entropy = More Information Needed

  • Uncertain outcomes require more questions
  • Additional features needed to improve classification

Lower Entropy = Less Information Needed

  • Predictable outcomes require fewer questions
  • Current features sufficient for good classification

Entropy provides the mathematical foundation for systematically choosing which feature to split on at each node, letting the decision tree algorithm pick the split that yields the largest measurable reduction in impurity.
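
As a preview of how this is used in practice, one common way to score a candidate split is the reduction in entropy from the parent node to the example-weighted average of its children; a minimal sketch, with hypothetical function and parameter names:

```python
from math import log2

def entropy(p1: float) -> float:
    return sum(-p * log2(p) for p in (p1, 1.0 - p1) if p > 0)

def impurity_reduction(p_parent: float,
                       p_left: float, w_left: float,
                       p_right: float, w_right: float) -> float:
    """Parent entropy minus the example-weighted entropy of the two child nodes."""
    return entropy(p_parent) - (w_left * entropy(p_left) + w_right * entropy(p_right))

# Example: a parent node that is 50% cats, split into two equal-sized children
# that are 80% cats and 20% cats respectively.
print(impurity_reduction(p_parent=0.5, p_left=0.8, w_left=0.5, p_right=0.2, w_right=0.5))
# ~0.28 -- a larger reduction means a more useful split
```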