
Putting It Together

Information gain enables systematic feature selection at every node. Here’s the complete process for building decision trees from training data.

  1. Start at Root: Place all training examples at root node
  2. Calculate Information Gain: For all possible features
  3. Select Best Feature: Choose feature with highest information gain
  4. Split Dataset: Create left and right branches based on selected feature
  5. Repeat Process: Apply same process to each branch recursively
  6. Stop When Criteria Met: Based on purity, depth, or gain thresholds

Root split:

  • Input: All 10 training examples (5 cats, 5 dogs)
  • Process: Calculate information gain for ear shape, face shape, and whiskers
  • Decision: Ear shape provides the highest information gain (0.28)
  • Result: Split on ear shape
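
As a quick check of that 0.28 figure, here is a minimal sketch in Python (the entropy and information_gain helper names are just for illustration; entropy is measured in bits):

    import math

    def entropy(p):
        # Entropy (in bits) of a node where p is the fraction of cat examples
        if p == 0 or p == 1:
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def information_gain(p_node, branches):
        # branches: list of (weight, p) pairs, where weight is the fraction of the
        # node's examples sent to that branch and p is that branch's fraction of cats
        return entropy(p_node) - sum(w * entropy(p) for w, p in branches)

    # Root split on ear shape: 10 examples (5 cats); the pointy branch gets 5 examples
    # (4 cats) and the floppy branch gets 5 examples (1 cat)
    gain = information_gain(5 / 10, [(5 / 10, 4 / 5), (5 / 10, 1 / 5)])
    print(round(gain, 2))  # 0.28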

Left branch (pointy ears):

  • Focus: 5 examples with pointy ears (4 cats, 1 dog)
  • Check stopping criteria: Not all the same class → continue splitting

Feature Evaluation:

  • Face shape: Calculate information gain
  • Whiskers: Calculate information gain
  • Ear shape: Information gain = 0 (all have pointy ears)

Decision: Face shape provides the highest information gain
Result: Split the left branch on face shape

  • Left sub-branch: 4 examples, all cats → Leaf Node: Cat
  • Right sub-branch: 1 example, dog → Leaf Node: Not Cat

Right branch (floppy ears):

  • Focus: 5 examples with floppy ears (1 cat, 4 dogs)
  • Check stopping criteria: Not all the same class → continue splitting

Process: Evaluate the remaining features (face shape, whiskers)
Decision: Whiskers provides the highest information gain
Result: Split the right branch on whiskers

  • Left sub-branch: 1 example, cat → Leaf Node: Cat
  • Right sub-branch: 4 examples, all dogs → Leaf Node: Not Cat

Splitting stops when one of the following criteria is met.

Perfect Purity

Node is 100% single class

  • All cats or all dogs
  • Entropy = 0
  • Natural stopping point

Maximum Depth

Tree reaches the depth limit

  • Prevents overly complex trees
  • Reduces overfitting risk
  • User-defined parameter

Information Gain Threshold

Gain below minimum threshold

  • Small improvements not worthwhile
  • Keeps tree size manageable
  • Prevents overfitting

Minimum Examples

Too few examples in node

  • Avoid splits on tiny subsets
  • Statistical reliability
  • Prevent overfitting
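
Taken together, these four checks make up a typical stopping rule. A minimal sketch of how they might be combined (the function and parameter names are illustrative, not taken from any particular library):

    def should_stop(labels, depth, best_gain,
                    max_depth=5, min_gain=0.01, min_examples=2):
        # Decide whether a node should become a leaf instead of splitting further
        if len(set(labels)) == 1:        # perfect purity: entropy is already 0
            return True
        if depth >= max_depth:           # maximum depth: user-defined limit reached
            return True
        if best_gain < min_gain:         # gain threshold: the best split barely helps
            return True
        if len(labels) < min_examples:   # minimum examples: subset too small to split
            return True
        return False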

Recursive Algorithm: Function that calls itself

Decision Tree Recursion:

  • Build overall tree by building smaller sub-trees
  • Left subtree: Decision tree on left subset of data
  • Right subtree: Decision tree on right subset of data
  • Software libraries: Often implement as recursive functions
  • Understanding optional: Can use trees without implementing recursion
  • Conceptual help: Useful for understanding tree construction logic
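
To make the recursion concrete, here is one possible sketch of the overall construction in Python. It assumes binary (0/1) features, labels of 1 for cat and 0 for not cat, and plain dicts as nodes; the entropy helper repeats the one sketched earlier, and real libraries are far more optimized:

    import math

    def entropy(p):
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def split_gain(X, y, f):
        # Information gain from splitting on binary feature f
        left = [label for x, label in zip(X, y) if x[f] == 1]
        right = [label for x, label in zip(X, y) if x[f] == 0]
        if not left or not right:
            return 0.0
        frac = lambda labels: sum(labels) / len(labels)
        w_left, w_right = len(left) / len(y), len(right) / len(y)
        return entropy(frac(y)) - (w_left * entropy(frac(left)) + w_right * entropy(frac(right)))

    def build_tree(X, y, features, depth=0, max_depth=5, min_gain=1e-3):
        # Recursively build a tree; a leaf predicts the majority class of its examples
        majority = int(sum(y) * 2 >= len(y))
        if len(set(y)) == 1 or depth >= max_depth or not features:
            return {"leaf": majority}
        best = max(features, key=lambda f: split_gain(X, y, f))
        if split_gain(X, y, best) < min_gain:
            return {"leaf": majority}
        rest = [f for f in features if f != best]
        go_left = [x[best] == 1 for x in X]
        left_X = [x for x, keep in zip(X, go_left) if keep]
        left_y = [label for label, keep in zip(y, go_left) if keep]
        right_X = [x for x, keep in zip(X, go_left) if not keep]
        right_y = [label for label, keep in zip(y, go_left) if not keep]
        return {"feature": best,
                "left": build_tree(left_X, left_y, rest, depth + 1, max_depth, min_gain),
                "right": build_tree(right_X, right_y, rest, depth + 1, max_depth, min_gain)}

Each recursive call receives only the subset of examples sent to its branch, which is exactly the build-smaller-sub-trees idea described above.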

Maximum depth intuition: Larger depth = more complex model

  • Benefits: Can learn more sophisticated patterns
  • Risks: Higher overfitting potential
  • Analogy: Similar to higher-degree polynomials or larger neural networks

How to choose the maximum depth:

  1. Cross-validation: Test different depth values and choose the one with the best CV performance
  2. Library defaults: Open-source packages provide good default values
  3. Advanced methods: Libraries offer more sophisticated parameter selection procedures
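
For the cross-validation route, one common pattern with scikit-learn looks roughly like the sketch below. The X and y here are a hypothetical 0/1 encoding of the ten cat/dog examples (ear shape pointy, face shape round, whiskers present), constructed to be consistent with the walkthrough above rather than taken from any official dataset:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical encoding: [ear_pointy, face_round, whiskers_present], label 1 = cat
    X = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
         [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
    y = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

    best_depth, best_score = None, -1.0
    for depth in (1, 2, 3, None):            # None means no depth limit
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        score = cross_val_score(model, X, y, cv=5).mean()   # mean 5-fold CV accuracy
        if score > best_score:
            best_depth, best_score = depth, score
    print(best_depth, best_score)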

Given a new test example:

  1. Start at root node
  2. Follow decision path based on feature values
  3. Continue until leaf node
  4. Output leaf prediction

Example: Animal with pointy ears, round face, whiskers present

  • Root: Pointy ears → go left
  • Decision node: Round face → go left
  • Leaf node: Predict “Cat”
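
A small sketch of that traversal, hard-coding the walkthrough's tree as nested dicts (the representation and feature names are illustrative):

    # The tree from the walkthrough above
    tree = {
        "feature": "ear_shape_pointy",
        "left": {                            # pointy ears
            "feature": "face_shape_round",
            "left":  {"leaf": "Cat"},        # round face
            "right": {"leaf": "Not Cat"},    # not round face
        },
        "right": {                           # floppy ears
            "feature": "whiskers_present",
            "left":  {"leaf": "Cat"},        # whiskers present
            "right": {"leaf": "Not Cat"},    # whiskers absent
        },
    }

    def predict(node, example):
        # Walk from the root to a leaf, branching left when the feature is present (1)
        while "leaf" not in node:
            node = node["left"] if example[node["feature"]] == 1 else node["right"]
        return node["leaf"]

    # Pointy ears, round face, whiskers present -> "Cat"
    print(predict(tree, {"ear_shape_pointy": 1, "face_shape_round": 1, "whiskers_present": 1}))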

The decision tree algorithm evolved over time, with multiple researchers contributing different pieces:

  • Basic algorithm: Core splitting methodology
  • Stopping criteria: Various threshold approaches
  • Parameter selection: Depth limits and gain thresholds
  • Optimization techniques: Efficiency improvements

Focus on application rather than implementation details:

  • Use established libraries: scikit-learn, XGBoost, etc.
  • Leverage defaults: Well-tested parameter choices
  • Focus on data and evaluation: More impact than parameter tweaking
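
As a sketch of that library-first approach, the same hypothetical cat/dog encoding used in the cross-validation example above can be fit with scikit-learn's defaults in a few lines:

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical encoding: [ear_pointy, face_round, whiskers_present], label 1 = cat
    X = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 1],
         [1, 1, 0], [0, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 0]]
    y = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

    clf = DecisionTreeClassifier(random_state=0)   # library defaults for everything else
    clf.fit(X, y)
    print(clf.predict([[1, 1, 1]]))   # pointy ears, round face, whiskers -> [1] (cat)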

The complete algorithm combines information gain calculations with systematic recursive splitting, creating effective classifiers while providing multiple approaches for controlling model complexity.