
Optional Lab Tree Ensembles

This lab demonstrates how to use pandas for one-hot encoding and how to implement Decision Tree, Random Forest, and XGBoost models using the scikit-learn and XGBoost libraries.

Cardiovascular Disease (CVD) is the leading cause of death globally:

  • 17.9 million lives lost annually (31% of all deaths)
  • Early detection and management crucial for high-risk patients

The dataset provides 11 features for heart disease prediction, plus the target:

  • Age: Patient age in years
  • Sex: M (Male) or F (Female)
  • ChestPainType: TA (Typical Angina), ATA (Atypical Angina), NAP (Non-Anginal Pain), ASY (Asymptomatic)
  • RestingBP: Resting blood pressure [mm Hg]
  • Cholesterol: Serum cholesterol [mg/dl]
  • FastingBS: 1 if fasting blood sugar > 120 mg/dl, 0 otherwise
  • RestingECG: Normal, ST (abnormality), LVH (left ventricular hypertrophy)
  • MaxHR: Maximum heart rate achieved [60-202]
  • ExerciseAngina: Y (Yes) or N (No)
  • Oldpeak: ST depression numeric value
  • ST_Slope: Up (upsloping), Flat, Down (downsloping)
  • HeartDisease: 1 (disease), 0 (normal) - TARGET VARIABLE

Categorical Variables requiring encoding:

  • Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
one_hot_encoding.py
import pandas as pd

# df is the heart disease DataFrame loaded earlier in the lab
# Categorical variables to encode
cat_variables = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# One-hot encode using pandas: each category value becomes its own 0/1 column
df = pd.get_dummies(data=df,
                    prefix=cat_variables,
                    columns=cat_variables)

# Result: 11 original features → 20 features after encoding
# (6 numeric features + 14 one-hot columns from the 5 categorical variables)
data_split.py
from sklearn.model_selection import train_test_split

# Split into features and target (RANDOM_STATE is a fixed seed defined earlier)
features = [x for x in df.columns if x != 'HeartDisease']

# Train-validation split (80% / 20%)
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df['HeartDisease'],
    train_size=0.8,
    random_state=RANDOM_STATE
)

print(f'train samples: {len(X_train)}')
print(f'validation samples: {len(X_val)}')
print(f'target proportion: {sum(y_train)/len(y_train):.4f}')

Key Hyperparameters:

  • min_samples_split: Minimum samples required to split internal node
  • max_depth: Maximum depth of tree

min_samples_split Analysis:

  • Low values (2-10): Higher training accuracy, potential overfitting
  • Higher values (30-50): Reduced overfitting, training/validation gap closes
  • Very high values (200+): May underfit

max_depth Analysis:

  • Shallow trees (1-3): Underfitting, low accuracy
  • Optimal depth (4): Best validation performance
  • Deep trees (8+): Overfitting, high training accuracy but poor validation (see the sweep sketch below)
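A minimal sketch of how such a sweep can be produced, assuming the X_train, X_val, y_train, y_val split and RANDOM_STATE defined above (the filename and parameter grids are illustrative, not the lab's exact values):

hyperparameter_sweep.py
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative grids for the two hyperparameters discussed above
min_samples_split_list = [2, 10, 30, 50, 100, 200, 300]
max_depth_list = [1, 2, 3, 4, 8, 16, 32, None]

for min_samples_split in min_samples_split_list:
    model = DecisionTreeClassifier(min_samples_split=min_samples_split,
                                   random_state=RANDOM_STATE).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'min_samples_split={min_samples_split}: train={train_acc:.4f}, val={val_acc:.4f}')

for max_depth in max_depth_list:
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   random_state=RANDOM_STATE).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'max_depth={max_depth}: train={train_acc:.4f}, val={val_acc:.4f}')
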
decision_tree.py
from sklearn.tree import DecisionTreeClassifier

# Decision Tree with the chosen hyperparameters
decision_tree_model = DecisionTreeClassifier(
    min_samples_split=50,
    max_depth=3,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Results: no significant overfitting, balanced train/validation performance

Additional Hyperparameter:

  • n_estimators: Number of decision trees in the forest

Random Forest Advantages:

  • Reduced overfitting: Ensemble voting reduces variance
  • Feature randomization: Each split considers √n features by default (see the sketch below)
  • Parallel training: Trees can be trained independently
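A minimal sketch of how these properties map onto scikit-learn parameters (max_features='sqrt' controls per-split feature sampling and n_jobs=-1 enables parallel tree training; the filename and variable name are illustrative):

random_forest_sketch.py
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt': each split draws a random subset of sqrt(n) features
# n_jobs=-1: train the independent trees in parallel on all CPU cores
rf_sketch = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    n_jobs=-1,
    random_state=RANDOM_STATE
).fit(X_train, y_train)
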
random_forest.py
from sklearn.ensemble import RandomForestClassifier

# Random Forest with the chosen hyperparameters
random_forest_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=16,
    min_samples_split=10,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Better validation performance than the single decision tree

  • n_estimators: 100 trees provide a good balance of performance vs. computation
  • Deeper trees allowed: The ensemble reduces the risk of overfitting
  • Less conservative parameters: The ensemble can afford more complexity per tree

Gradient Boosting Features:

  • Sequential training: Each tree learns from previous mistakes
  • Early stopping: Prevents overfitting using validation set
  • Built-in regularization: Advanced overfitting prevention
xgboost.py
from xgboost import XGBClassifier

# Assumption: the fit/eval subsets are carved out of the training set,
# e.g. the last 20% of X_train is held out as the early-stopping eval set
n = int(len(X_train) * 0.8)
X_train_fit, X_train_eval = X_train[:n], X_train[n:]
y_train_fit, y_train_eval = y_train[:n], y_train[n:]

# XGBoost with early stopping
xgb_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    verbosity=1,
    random_state=RANDOM_STATE
)

# Early stopping: stop once the eval metric has not improved for 10 rounds
xgb_model.fit(X_train_fit, y_train_fit,
              eval_set=[(X_train_eval, y_train_eval)],
              early_stopping_rounds=10)

# Training stopped at round 26 (best was round 16)

  • Monitor validation metric: Track log loss on the evaluation set
  • Best iteration tracking: Round 16 had lowest evaluation metric
  • Automatic stopping: After 10 rounds without improvement
  • Efficiency: Only 26 estimators needed instead of 500
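After fitting, the best round can be read back from the model; best_iteration is part of xgboost's scikit-learn wrapper (a small sketch):

# Inspect the boosting round with the lowest evaluation metric
print(f'best iteration: {xgb_model.best_iteration}')  # 16 in the run described above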

Model Performance Summary:

Decision Tree

  • Training Accuracy: ~0.85
  • Validation Accuracy: ~0.82
  • No overfitting
  • Simple, interpretable

Random Forest

  • Training Accuracy: ~0.98
  • Validation Accuracy: ~0.87
  • Best validation performance
  • Robust ensemble method

XGBoost

  • Training Accuracy: ~0.90
  • Validation Accuracy: ~0.87
  • Competitive performance: Matches the Random Forest's validation accuracy
  • Automatic optimization: Early stopping selects the number of boosting rounds
  • Competition-grade performance: Widely used in Kaggle competitions
  • Efficient training: Weighted examples instead of sampling with replacement
  • Automatic stopping: Built-in overfitting prevention
  • Regression support: XGBRegressor for continuous targets
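
These accuracy figures can be reproduced with a short comparison loop over the three fitted models (a sketch using scikit-learn's accuracy_score; the filename and dictionary layout are illustrative):

model_comparison.py
from sklearn.metrics import accuracy_score

models = {
    'Decision Tree': decision_tree_model,
    'Random Forest': random_forest_model,
    'XGBoost': xgb_model,
}

for name, model in models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'{name}: train accuracy = {train_acc:.4f}, validation accuracy = {val_acc:.4f}')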

All three models follow the same simple three-step pattern:

  1. Import the model class
  2. Initialize and fit: model.fit(X_train, y_train)
  3. Predict: model.predict(X_val)
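
As a concrete instance of the pattern (a sketch reusing the Decision Tree settings chosen above):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(min_samples_split=50, max_depth=3,
                               random_state=RANDOM_STATE).fit(X_train, y_train)
predictions = model.predict(X_val)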

Model Selection Takeaways:

  • Random Forest: Often best for tabular data without extensive tuning
  • XGBoost: Superior for competitions and complex datasets
  • Single Decision Tree: Good baseline, highly interpretable
  • Early stopping: Critical for gradient boosting methods

This lab demonstrates the progression from simple decision trees to sophisticated ensemble methods, showing how each approach builds upon the previous to achieve better performance and robustness.