
Information Gain

Information Gain measures the reduction in entropy achieved by splitting on a particular feature, enabling systematic feature selection for decision tree nodes.

Starting Point: All 10 examples at root node

  • p₁ = 5/10 = 0.5 (5 cats, 5 dogs)
  • Root Entropy: H(0.5) = 1.0 (maximum impurity)
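
The entropy values used throughout this example can be reproduced with a minimal helper; the sketch below assumes binary entropy in bits (log base 2), which is the convention used in these notes, and is not taken from any particular library:

```python
import math

def entropy(p1):
    """Binary entropy H(p1) in bits; defined as 0 when p1 is 0 or 1."""
    if p1 == 0 or p1 == 1:
        return 0.0
    return -p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1)

print(entropy(0.5))  # 1.0 -> maximum impurity for a 5-cat / 5-dog root node
```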

Ear Shape Split Results:

  • Left Branch: 5 examples, p₁ = 4/5 = 0.8, H(0.8) ≈ 0.72
  • Right Branch: 5 examples, p₁ = 1/5 = 0.2, H(0.2) ≈ 0.72

Face Shape Split Results:

  • Left Branch: 7 examples, p₁ = 4/7, H(4/7) ≈ 0.99
  • Right Branch: 3 examples, p₁ = 1/3, H(1/3) ≈ 0.92

Whiskers Split Results:

  • Left Branch: 4 examples, p₁ = 3/4, H(3/4) ≈ 0.81
  • Right Branch: 6 examples, p₁ = 2/6, H(2/6) ≈ 0.92
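
Assuming the entropy helper sketched above, the branch entropies quoted for the three candidate splits can be checked directly:

```python
# Branch entropies for the three candidate splits (rounded as in the text)
print(round(entropy(4/5), 2), round(entropy(1/5), 2))  # Ear Shape:  0.72 0.72
print(round(entropy(4/7), 2), round(entropy(1/3), 2))  # Face Shape: 0.99 0.92
print(round(entropy(3/4), 2), round(entropy(2/6), 2))  # Whiskers:   0.81 0.92
```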

Key Insight: A branch with many examples and high entropy is worse than a branch with only a few examples and the same high entropy, which is why each branch's entropy is weighted by the fraction of examples it receives.

Weight Calculation:

  • w^left: Fraction of examples going to the left branch
  • w^right: Fraction of examples going to the right branch

Ear Shape Split:

Weighted Entropy = (5/10) × H(0.8) + (5/10) × H(0.2)
= 0.5 × 0.72 + 0.5 × 0.72 = 0.72

Face Shape Split:

Weighted Entropy = (7/10) × H(4/7) + (3/10) × H(1/3)
= 0.7 × 0.99 + 0.3 × 0.92 ≈ 0.97

Whiskers Split:

Weighted Entropy = (4/10) × H(3/4) + (6/10) × H(2/6)
= 0.4 × 0.81 + 0.6 × 0.92 ≈ 0.88
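
Again assuming the entropy helper from earlier, the three weighted entropies can be reproduced directly:

```python
# Weighted entropy = weighted average of branch entropies,
# with weights equal to the fraction of examples in each branch
ear      = (5/10) * entropy(4/5) + (5/10) * entropy(1/5)   # ≈ 0.72
face     = (7/10) * entropy(4/7) + (3/10) * entropy(1/3)   # ≈ 0.97
whiskers = (4/10) * entropy(3/4) + (6/10) * entropy(2/6)   # ≈ 0.88
print(round(ear, 2), round(face, 2), round(whiskers, 2))
```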

Information Gain = Entropy Reduction from Split

Information Gain = H(p₁^root) - [w^left × H(p₁^left) + w^right × H(p₁^right)]
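
The formula translates into a small helper. This is only a sketch assuming binary labels (1 = cat, 0 = dog) and the entropy function defined earlier; the name information_gain is illustrative, not a specific library API:

```python
def information_gain(root_labels, left_labels, right_labels):
    """Entropy reduction from splitting root_labels into the two given branches."""
    def p1(labels):
        return sum(labels) / len(labels)       # fraction of positive (cat) examples
    w_left = len(left_labels) / len(root_labels)
    w_right = len(right_labels) / len(root_labels)
    return entropy(p1(root_labels)) - (
        w_left * entropy(p1(left_labels)) + w_right * entropy(p1(right_labels))
    )

# Ear Shape split: 4 cats + 1 dog go left, 1 cat + 4 dogs go right
root  = [1] * 5 + [0] * 5
left  = [1] * 4 + [0] * 1
right = [1] * 1 + [0] * 4
print(round(information_gain(root, left, right), 2))  # 0.28
```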

Ear Shape

Information Gain = 1.0 - 0.72 = 0.28

  • Highest reduction in entropy
  • Best feature choice

Face Shape

Information Gain = 1.0 - 0.97 = 0.03

  • Minimal entropy reduction
  • Poor feature choice

Whiskers

Information Gain = 1.0 - 0.88 = 0.12

  • Moderate entropy reduction
  • Second-best option
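
Putting the numbers together (still assuming the entropy helper above), the three gains and their ranking can be reproduced:

```python
root_entropy = entropy(5/10)                                   # 1.0
gains = {
    "Ear Shape":  root_entropy - ((5/10) * entropy(4/5) + (5/10) * entropy(1/5)),
    "Face Shape": root_entropy - ((7/10) * entropy(4/7) + (3/10) * entropy(1/3)),
    "Whiskers":   root_entropy - ((4/10) * entropy(3/4) + (6/10) * entropy(2/6)),
}
for feature, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {gain:.2f}")   # Ear Shape: 0.28, Whiskers: 0.12, Face Shape: 0.03
```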

Notation:

  • p₁^root: Fraction of positive examples at the current node
  • p₁^left: Fraction of positive examples in the left branch
  • p₁^right: Fraction of positive examples in the right branch
  • w^left: Fraction of examples going left
  • w^right: Fraction of examples going right

Information Gain = H(p₁^root) - [w^left × H(p₁^left) + w^right × H(p₁^right)]

Where: w^left + w^right = 1 (every example goes either left or right)

Choose the feature that maximizes information gain

Example Result: Ear Shape (0.28) > Whiskers (0.12) > Face Shape (0.03)

Decision: Split on Ear Shape at the root node

Why Information Gain Matters:

  1. Quantitative Comparison: Every candidate split receives a numerical score, so features can be ranked directly
  2. Purity Optimization: Choosing the highest-gain split drives the resulting branches toward purer class distributions
  3. Stopping Criteria: Small gains signal that further splitting may not be worthwhile

Use Case: Stop splitting when information gain < threshold

  • Benefit: Prevents overfitting
  • Benefit: Keeps tree size manageable
  • Example: If gain < 0.01, create a leaf node instead (see the sketch below)
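
A minimal sketch of how such a threshold check might look; the name stop_splitting and the 0.01 threshold are illustrative assumptions taken from the example above, not part of any specific library:

```python
MIN_GAIN = 0.01  # illustrative threshold from the example above

def stop_splitting(best_gain, min_gain=MIN_GAIN):
    """Stopping check: turn the node into a leaf when the best split gains too little."""
    return best_gain < min_gain

print(stop_splitting(0.28))   # False -> worth splitting (e.g. Ear Shape at the root)
print(stop_splitting(0.005))  # True  -> create a leaf node instead
```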

Information gain provides the mathematical foundation for systematic feature selection in decision trees, ensuring each split maximally improves class separation while enabling principled stopping decisions.