
Information Gain

Information Gain measures the reduction in entropy achieved by splitting on a particular feature, enabling systematic feature selection for decision tree nodes.

Starting Point: All 10 examples at root node

  • p₁ = 5/10 = 0.5 (5 cats, 5 dogs)
  • Root Entropy: H(0.5) = 1.0 (maximum impurity)
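
The entropy values used throughout this example can be reproduced with a minimal helper; the sketch below assumes binary entropy in bits (log base 2), which is the convention used in these notes, and is not taken from any particular library:

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits; defined as 0 when p1 is 0 or 1."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(0.5))  # 1.0 -> maximum impurity for a 5-cat / 5-dog root node
```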

Ear Shape Split Results:

  • Left Branch: 5 examples, p₁ = 4/5 = 0.8, H(0.8) ≈ 0.72
  • Right Branch: 5 examples, p₁ = 1/5 = 0.2, H(0.2) ≈ 0.72

Face Shape Split Results:

  • Left Branch: 7 examples, p₁ = 4/7, H(4/7) ≈ 0.99
  • Right Branch: 3 examples, p₁ = 1/3, H(1/3) ≈ 0.92

Whiskers Split Results:

  • Left Branch: 4 examples, p₁ = 3/4, H(3/4) ≈ 0.81
  • Right Branch: 6 examples, p₁ = 2/6, H(2/6) ≈ 0.92
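
Assuming the entropy helper sketched above, the branch entropies quoted for the three candidate splits can be checked directly:

```python
# Branch entropies for the three candidate splits (rounded as in the text)
print(round(entropy(4/5), 2), round(entropy(1/5), 2))  # Ear Shape:  0.72 0.72
print(round(entropy(4/7), 2), round(entropy(1/3), 2))  # Face Shape: 0.99 0.92
print(round(entropy(3/4), 2), round(entropy(2/6), 2))  # Whiskers:   0.81 0.92
```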

Key Insight: A branch with many examples and high entropy is worse than a branch with only a few examples and the same high entropy, which is why each branch's entropy is weighted by the fraction of examples it receives.

Weight Calculation:

  • w^left: Fraction of examples going to the left branch
  • w^right: Fraction of examples going to the right branch

Ear Shape Split:

Weighted Entropy = (5/10) × H(0.8) + (5/10) × H(0.2)
= 0.5 × 0.72 + 0.5 × 0.72 = 0.72

Face Shape Split:

Weighted Entropy = (7/10) × H(4/7) + (3/10) × H(1/3)
= 0.7 × 0.99 + 0.3 × 0.92 ≈ 0.97

Whiskers Split:

Weighted Entropy = (4/10) × H(3/4) + (6/10) × H(2/6)
= 0.4 × 0.81 + 0.6 × 0.92 ≈ 0.88
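
Again assuming the entropy helper from earlier, the three weighted entropies can be reproduced directly:

```python
# Weighted entropy = weighted average of branch entropies,
# with weights equal to the fraction of examples in each branch
ear      = (5/10) * entropy(4/5) + (5/10) * entropy(1/5)   # ≈ 0.72
face     = (7/10) * entropy(4/7) + (3/10) * entropy(1/3)   # ≈ 0.97
whiskers = (4/10) * entropy(3/4) + (6/10) * entropy(2/6)   # ≈ 0.88
print(round(ear, 2), round(face, 2), round(whiskers, 2))
```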

Information Gain = Entropy Reduction from Split

Information Gain = H(p₁^root) - [w^left × H(p₁^left) + w^right × H(p₁^right)]
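
The formula translates into a small helper. This is only a sketch assuming binary labels (1 = cat, 0 = dog) and the entropy function defined earlier; the name information_gain is illustrative, not a specific library API:

```python
def information_gain(root_labels, left_labels, right_labels):
    """Entropy reduction from splitting root_labels into the two given branches."""
    def p1(labels):
        return sum(labels) / len(labels)       # fraction of positive (cat) examples
    w_left = len(left_labels) / len(root_labels)
    w_right = len(right_labels) / len(root_labels)
    return entropy(p1(root_labels)) - (
        w_left * entropy(p1(left_labels)) + w_right * entropy(p1(right_labels))
    )

# Ear Shape split: 4 cats + 1 dog go left, 1 cat + 4 dogs go right
root  = [1] * 5 + [0] * 5
left  = [1] * 4 + [0] * 1
right = [1] * 1 + [0] * 4
print(round(information_gain(root, left, right), 2))  # 0.28
```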

Ear Shape

Information Gain = 1.0 - 0.72 = 0.28

  • Highest reduction in entropy
  • Best feature choice

Face Shape

Information Gain = 1.0 - 0.97 = 0.03

  • Minimal entropy reduction
  • Poor feature choice

Whiskers

Information Gain = 1.0 - 0.88 = 0.12

  • Moderate entropy reduction
  • Second-best option
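
Putting the numbers together (still assuming the entropy helper above), the three gains and their ranking can be reproduced:

```python
root_entropy = entropy(5/10)                                   # 1.0
gains = {
    "Ear Shape":  root_entropy - ((5/10) * entropy(4/5) + (5/10) * entropy(1/5)),
    "Face Shape": root_entropy - ((7/10) * entropy(4/7) + (3/10) * entropy(1/3)),
    "Whiskers":   root_entropy - ((4/10) * entropy(3/4) + (6/10) * entropy(2/6)),
}
for feature, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {gain:.2f}")   # Ear Shape: 0.28, Whiskers: 0.12, Face Shape: 0.03
```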

Notation:

  • p₁^root: Fraction of positive examples at the current node
  • p₁^left: Fraction of positive examples in the left branch
  • p₁^right: Fraction of positive examples in the right branch
  • w^left: Fraction of examples going left
  • w^right: Fraction of examples going right

Information Gain = H(p₁^root) - [w^left × H(p₁^left) + w^right × H(p₁^right)]

Where: w^left + w^right = 1 (every example goes either left or right)

Choose the feature that maximizes information gain

Example Result: Ear Shape (0.28) > Whiskers (0.12) > Face Shape (0.03)

Decision: Split on Ear Shape at the root node

Why Information Gain Matters:

  1. Quantitative Comparison: Every candidate split receives a numerical score, so features can be ranked directly
  2. Purity Optimization: Choosing the highest-gain split drives the resulting branches toward purer class distributions
  3. Stopping Criteria: Small gains signal that further splitting may not be worthwhile

Use Case: Stop splitting when information gain < threshold

  • Benefit: Prevents overfitting
  • Benefit: Keeps tree size manageable
  • Example: If gain < 0.01, create a leaf node instead (see the sketch below)
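
A minimal sketch of how such a threshold check might look; the name stop_splitting and the 0.01 threshold are illustrative assumptions taken from the example above, not part of any specific library:

```python
MIN_GAIN = 0.01  # illustrative threshold from the example above

def stop_splitting(best_gain, min_gain=MIN_GAIN):
    """Stopping check: turn the node into a leaf when the best split gains too little."""
    return best_gain < min_gain

print(stop_splitting(0.28))   # False -> worth splitting (e.g. Ear Shape at the root)
print(stop_splitting(0.005))  # True  -> create a leaf node instead
```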

Information gain provides the mathematical foundation for systematic feature selection in decision trees, ensuring each split maximally improves class separation while enabling principled stopping decisions.