Practice Lab: Decision Trees
Lab Overview
In this exercise, you will implement a decision tree from scratch and apply it to classify whether a mushroom is edible or poisonous.
Build from Scratch: This lab focuses on implementing the core decision tree algorithms rather than using libraries.
Problem Statement
- Business Context: Starting a wild mushroom company
- Challenge: Determine whether mushrooms are edible or poisonous based on physical attributes
- Goal: Build a classifier using existing data to identify safe mushrooms for sale
Dataset Description
Mushroom Features
10 training examples with three categorical features:
| Feature | Values | Encoding |
| --- | --- | --- |
| Cap Color | Brown/Red | Brown=1, Red=0 |
| Stalk Shape | Tapering/Enlarging | Tapering=1, Enlarging=0 |
| Solitary | Yes/No | Yes=1, No=0 |
Target Variable: Edible (1 = edible, 0 = poisonous)
One-Hot Encoded Dataset
```python
import numpy as np

X_train = np.array([[1, 1, 1],   # Brown, Tapering, Solitary      → Edible
                    [1, 0, 1],   # Brown, Enlarging, Solitary     → Edible
                    [1, 0, 0],   # Brown, Enlarging, Not Solitary → Poisonous
                    # ... more examples
                    ])
y_train = np.array([1, 1, 0, 0, 1, 0, 0, 1, 1, 0])

print('Training examples (m):', len(X_train))   # 10 examples
print('Features (n):', X_train.shape[1])        # 3 features
```
Core Implementation
Exercise 1: Compute Entropy
Implementation Task: Calculate entropy to measure node impurity
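For reference, `compute_entropy` implements the standard binary entropy, where p₁ denotes the fraction of edible (label 1) examples at the node:

$$
H(p_1) = -p_1 \log_2(p_1) - (1 - p_1)\,\log_2(1 - p_1), \qquad \text{with the convention } 0 \log_2 0 = 0
$$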
```python
def compute_entropy(y):
    """
    Computes entropy for binary classification

    Args:
        y (ndarray): Binary labels (1=edible, 0=poisonous)
    Returns:
        entropy (float): Entropy value
    """
    entropy = 0.

    # STUDENT IMPLEMENTATION REQUIRED
    if len(y) != 0:
        p1 = len(y[y == 1]) / len(y)  # Fraction of edible mushrooms
        if p1 != 0 and p1 != 1:
            entropy = -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)
        else:
            entropy = 0.  # Pure node

    return entropy

# Test: Root node entropy should be 1.0 (5 edible, 5 poisonous)
```
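A minimal sanity check of that expectation, assuming `y_train` is the full 10-example label vector defined above:

```python
# Root node contains all examples: 5 edible, 5 poisonous -> maximum impurity
print("Entropy at root node:", compute_entropy(y_train))  # expected: 1.0
```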
Key Implementation Points (exercised in the checks after this list):
- Handle edge cases: empty nodes and pure nodes (p₁ = 0 or 1)
- Use log base 2 for standard entropy calculation
- Convention: 0 log(0) = 0 for computation purposes
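The edge cases can be exercised directly with a few toy label vectors (illustrative inputs, not part of the lab dataset):

```python
print(compute_entropy(np.array([])))         # empty node         -> 0.0
print(compute_entropy(np.array([1, 1, 1])))  # pure node (p1 = 1) -> 0.0
print(compute_entropy(np.array([0, 1])))     # 50/50 split        -> 1.0
```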
Exercise 2: Split Dataset
Implementation Task: Split data based on feature values
```python
def split_dataset(X, node_indices, feature):
    """
    Splits data into left (feature=1) and right (feature=0) branches

    Args:
        X (ndarray): Feature matrix
        node_indices (list): Active example indices
        feature (int): Feature index to split on
    Returns:
        left_indices, right_indices: Split indices
    """
    left_indices = []   # feature = 1
    right_indices = []  # feature = 0

    # STUDENT IMPLEMENTATION REQUIRED
    for i in node_indices:
        if X[i][feature] == 1:
            left_indices.append(i)
        else:
            right_indices.append(i)

    return left_indices, right_indices

# Test: Split on feature 0 (Brown Cap) at root
```
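A quick look at that split; `root_indices` is an illustrative name for the list of all example indices, and the groupings assume the full 10-example `X_train`:

```python
root_indices = list(range(len(X_train)))  # all examples start at the root

# Split on feature 0 (Cap Color): left branch gets Brown (=1), right branch gets Red (=0)
left_indices, right_indices = split_dataset(X_train, root_indices, 0)
print("Left indices (Brown cap):", left_indices)
print("Right indices (Red cap):", right_indices)
```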
Exercise 3: Calculate Information Gain
Implementation Task: Compute information gain from splitting
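For reference, the quantity computed below is the entropy at the node minus the weighted average entropy of its two branches, where the weights are the fractions of the node's examples sent left and right:

$$
\text{information gain} = H\!\left(p_1^{\text{node}}\right) - \left(w^{\text{left}}\, H\!\left(p_1^{\text{left}}\right) + w^{\text{right}}\, H\!\left(p_1^{\text{right}}\right)\right)
$$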
```python
def compute_information_gain(X, y, node_indices, feature):
    """
    Compute information gain from splitting on given feature

    Args:
        X, y: Training data
        node_indices: Active indices at current node
        feature: Feature to split on
    Returns:
        information_gain: Reduction in entropy
    """
    # Split the dataset
    left_indices, right_indices = split_dataset(X, node_indices, feature)

    # Extract relevant data
    y_node = y[node_indices]
    y_left = y[left_indices]
    y_right = y[right_indices]

    # STUDENT IMPLEMENTATION REQUIRED
    # Calculate entropies
    node_entropy = compute_entropy(y_node)
    left_entropy = compute_entropy(y_left)
    right_entropy = compute_entropy(y_right)

    # Calculate weights
    w_left = len(left_indices) / len(node_indices)
    w_right = len(right_indices) / len(node_indices)

    # Calculate weighted entropy
    weighted_entropy = w_left * left_entropy + w_right * right_entropy

    # Information gain
    information_gain = node_entropy - weighted_entropy

    return information_gain
```
Expected Results (information gain at the root node; see the computation sketch after this list):
- Brown Cap: 0.035 information gain
- Tapering Stalk: 0.125 information gain
- Solitary: 0.278 information gain (highest - best split)
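A sketch of how these values can be reproduced at the root node, assuming `X_train` and `y_train` hold all 10 examples (only the first three rows are shown in the snippet above); the feature names are illustrative labels for the three columns:

```python
root_indices = list(range(len(X_train)))
feature_names = ["Brown Cap", "Tapering Stalk", "Solitary"]  # column order of X_train

for feature, name in enumerate(feature_names):
    gain = compute_information_gain(X_train, y_train, root_indices, feature)
    print(f"Information gain from splitting on {name}: {gain:.3f}")
```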
Exercise 4: Get Best Split
Implementation Task: Find feature with maximum information gain
```python
def get_best_split(X, y, node_indices):
    """
    Returns optimal feature to split on

    Args:
        X, y: Training data
        node_indices: Active indices at current node
    Returns:
        best_feature: Index of best feature to split on
    """
    num_features = X.shape[1]
    best_feature = -1

    # STUDENT IMPLEMENTATION REQUIRED
    max_info_gain = 0

    for feature in range(num_features):
        info_gain = compute_information_gain(X, y, node_indices, feature)
        if info_gain > max_info_gain:
            max_info_gain = info_gain
            best_feature = feature

    return best_feature

# Test: Should return feature 2 (Solitary) as best split
```
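Running the selection at the root should agree with the information-gain values above (again assuming the full 10-example dataset):

```python
root_indices = list(range(len(X_train)))
best_feature = get_best_split(X_train, y_train, root_indices)
print("Best feature to split on at the root:", best_feature)  # expected: 2 (Solitary)
```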
Tree Building Process
Recursive Tree Construction
Automated Implementation (provided - no student coding required):
```python
def build_tree_recursive(X, y, node_indices, branch_name, max_depth, current_depth):
    """
    Build decision tree using recursive splitting

    Stopping criteria:
    - Maximum depth reached (max_depth=2)
    - Pure node achieved
    - No more examples to split
    """
    # Check stopping criteria
    if current_depth == max_depth:
        print(f"Leaf node: {branch_name} with indices {node_indices}")
        return

    # Get best split
    best_feature = get_best_split(X, y, node_indices)
    print(f"Depth {current_depth}, {branch_name}: Split on feature {best_feature}")

    # Split and recurse
    left_indices, right_indices = split_dataset(X, node_indices, best_feature)
    build_tree_recursive(X, y, left_indices, "Left", max_depth, current_depth + 1)
    build_tree_recursive(X, y, right_indices, "Right", max_depth, current_depth + 1)
```
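A typical way to start the recursion from the root, assuming the full 10-example dataset and the max_depth=2 limit noted in the docstring (the branch name "Root" is purely a label for the printed output):

```python
root_indices = list(range(len(X_train)))
build_tree_recursive(X_train, y_train, root_indices, "Root", max_depth=2, current_depth=0)
```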
Tree Construction Results
Expected Tree Structure:
- Root: Split on Solitary (feature 2)
- Left Branch (Solitary=Yes): Further splits on other features
- Right Branch (Solitary=No): Further splits on other features
- Leaf Nodes: Final predictions (Edible/Poisonous)
Key Lab Learning Objectives
Algorithmic Understanding
- Entropy Calculation: Measure impurity using information theory
- Dataset Splitting: Partition examples based on feature values
- Information Gain: Quantify improvement from splitting
- Feature Selection: Choose optimal splits systematically
- Recursive Construction: Build complete tree through repeated splitting
Implementation Skills
- Mathematical formulas: Translate entropy equations into code
- Data manipulation: Split datasets efficiently using indices
- Algorithm logic: Implement greedy feature selection
- Edge case handling: Deal with empty nodes and pure nodes
- Recursive thinking: Understand tree building as recursive process
Practical Insights
- Feature importance: Solitary feature provides highest information gain
- Stopping criteria: Maximum depth prevents overfitting
- Pure nodes: Natural stopping points when entropy = 0
- Greedy selection: Best local choice at each node
This hands-on implementation reinforces the theoretical concepts from the lectures and prepares students to use production-ready decision tree libraries with a deeper understanding of the underlying algorithms.