
Optional Lab Tree Ensembles

This lab demonstrates how to use pandas for one-hot encoding and how to implement Decision Tree, Random Forest, and XGBoost models using the scikit-learn and XGBoost libraries.

Cardiovascular Disease (CVD) is the leading cause of death globally:

  • 17.9 million lives lost annually (31% of all deaths)
  • Early detection and management crucial for high-risk patients

The dataset provides 11 features for heart disease prediction, plus the target:

  • Age: Patient age in years
  • Sex: M (Male) or F (Female)
  • ChestPainType: TA (Typical Angina), ATA (Atypical Angina), NAP (Non-Anginal Pain), ASY (Asymptomatic)
  • RestingBP: Resting blood pressure [mm Hg]
  • Cholesterol: Serum cholesterol [mg/dl]
  • FastingBS: 1 if fasting blood sugar > 120 mg/dl, 0 otherwise
  • RestingECG: Normal, ST (abnormality), LVH (left ventricular hypertrophy)
  • MaxHR: Maximum heart rate achieved [60-202]
  • ExerciseAngina: Y (Yes) or N (No)
  • Oldpeak: ST depression numeric value
  • ST_Slope: Up (upsloping), Flat, Down (downsloping)
  • HeartDisease: 1 (disease), 0 (normal) - TARGET VARIABLE

Categorical Variables requiring encoding:

  • Sex, ChestPainType, RestingECG, ExerciseAngina, ST_Slope
one_hot_encoding.py
import pandas as pd

# df is the heart disease DataFrame loaded earlier in the lab
# Categorical variables to encode
cat_variables = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# One-hot encode using pandas: each category value becomes its own 0/1 column
df = pd.get_dummies(data=df,
                    prefix=cat_variables,
                    columns=cat_variables)

# Result: 11 original features → 20 features after encoding
# (6 numeric features + 14 one-hot columns from the 5 categorical variables)
data_split.py
from sklearn.model_selection import train_test_split

# Split into features and target (RANDOM_STATE is a fixed seed defined earlier)
features = [x for x in df.columns if x != 'HeartDisease']

# Train-validation split (80% / 20%)
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df['HeartDisease'],
    train_size=0.8,
    random_state=RANDOM_STATE
)

print(f'train samples: {len(X_train)}')
print(f'validation samples: {len(X_val)}')
print(f'target proportion: {sum(y_train)/len(y_train):.4f}')

Key Hyperparameters:

  • min_samples_split: Minimum samples required to split internal node
  • max_depth: Maximum depth of tree

min_samples_split Analysis:

  • Low values (2-10): Higher training accuracy, potential overfitting
  • Higher values (30-50): Reduced overfitting, training/validation gap closes
  • Very high values (200+): May underfit

max_depth Analysis:

  • Shallow trees (1-3): Underfitting, low accuracy
  • Optimal depth (4): Best validation performance
  • Deep trees (8+): Overfitting, high training accuracy but poor validation (see the sweep sketch below)
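A minimal sketch of how such a sweep can be produced, assuming the X_train, X_val, y_train, y_val split and RANDOM_STATE defined above (the filename and parameter grids are illustrative, not the lab's exact values):

hyperparameter_sweep.py
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative grids for the two hyperparameters discussed above
min_samples_split_list = [2, 10, 30, 50, 100, 200, 300]
max_depth_list = [1, 2, 3, 4, 8, 16, 32, None]

for min_samples_split in min_samples_split_list:
    model = DecisionTreeClassifier(min_samples_split=min_samples_split,
                                   random_state=RANDOM_STATE).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'min_samples_split={min_samples_split}: train={train_acc:.4f}, val={val_acc:.4f}')

for max_depth in max_depth_list:
    model = DecisionTreeClassifier(max_depth=max_depth,
                                   random_state=RANDOM_STATE).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'max_depth={max_depth}: train={train_acc:.4f}, val={val_acc:.4f}')
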
decision_tree.py
from sklearn.tree import DecisionTreeClassifier

# Decision Tree with the chosen hyperparameters
decision_tree_model = DecisionTreeClassifier(
    min_samples_split=50,
    max_depth=3,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Results: no significant overfitting, balanced train/validation performance

Additional Hyperparameter:

  • n_estimators: Number of decision trees in the forest

Random Forest Advantages:

  • Reduced overfitting: Ensemble voting reduces variance
  • Feature randomization: Each split considers √n features by default (see the sketch below)
  • Parallel training: Trees can be trained independently
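A minimal sketch of how these properties map onto scikit-learn parameters (max_features='sqrt' controls per-split feature sampling and n_jobs=-1 enables parallel tree training; the filename and variable name are illustrative):

random_forest_sketch.py
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt': each split draws a random subset of sqrt(n) features
# n_jobs=-1: train the independent trees in parallel on all CPU cores
rf_sketch = RandomForestClassifier(
    n_estimators=100,
    max_features='sqrt',
    n_jobs=-1,
    random_state=RANDOM_STATE
).fit(X_train, y_train)
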
random_forest.py
from sklearn.ensemble import RandomForestClassifier

# Random Forest with the chosen hyperparameters
random_forest_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=16,
    min_samples_split=10,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Better validation performance than the single decision tree

  • n_estimators: 100 trees provide a good balance of performance vs. computation
  • Deeper trees allowed: The ensemble reduces the risk of overfitting
  • Less conservative parameters: The ensemble can afford more complexity per tree

Gradient Boosting Features:

  • Sequential training: Each tree learns from previous mistakes
  • Early stopping: Prevents overfitting using validation set
  • Built-in regularization: Advanced overfitting prevention
xgboost.py
from xgboost import XGBClassifier

# Assumption: the fit/eval subsets are carved out of the training set,
# e.g. the last 20% of X_train is held out as the early-stopping eval set
n = int(len(X_train) * 0.8)
X_train_fit, X_train_eval = X_train[:n], X_train[n:]
y_train_fit, y_train_eval = y_train[:n], y_train[n:]

# XGBoost with early stopping
xgb_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    verbosity=1,
    random_state=RANDOM_STATE
)

# Early stopping: stop once the eval metric has not improved for 10 rounds
xgb_model.fit(X_train_fit, y_train_fit,
              eval_set=[(X_train_eval, y_train_eval)],
              early_stopping_rounds=10)

# Training stopped at round 26 (best was round 16)

  • Monitor validation metric: Track log loss on the evaluation set
  • Best iteration tracking: Round 16 had lowest evaluation metric
  • Automatic stopping: After 10 rounds without improvement
  • Efficiency: Only 26 estimators needed instead of 500
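After fitting, the best round can be read back from the model; best_iteration is part of xgboost's scikit-learn wrapper (a small sketch):

# Inspect the boosting round with the lowest evaluation metric
print(f'best iteration: {xgb_model.best_iteration}')  # 16 in the run described above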

Model Performance Summary:

Decision Tree

  • Training Accuracy: ~0.85
  • Validation Accuracy: ~0.82
  • No overfitting
  • Simple, interpretable

Random Forest

  • Training Accuracy: ~0.98
  • Validation Accuracy: ~0.87
  • Best validation performance
  • Robust ensemble method

XGBoost

  • Training Accuracy: ~0.90
  • Validation Accuracy: ~0.87
  • Competitive performance: Matches the Random Forest's validation accuracy
  • Automatic optimization: Early stopping selects the number of boosting rounds
  • Competition-grade performance: Widely used in Kaggle competitions
  • Efficient training: Weighted examples instead of sampling with replacement
  • Automatic stopping: Built-in overfitting prevention
  • Regression support: XGBRegressor for continuous targets
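
These accuracy figures can be reproduced with a short comparison loop over the three fitted models (a sketch using scikit-learn's accuracy_score; the filename and dictionary layout are illustrative):

model_comparison.py
from sklearn.metrics import accuracy_score

models = {
    'Decision Tree': decision_tree_model,
    'Random Forest': random_forest_model,
    'XGBoost': xgb_model,
}

for name, model in models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'{name}: train accuracy = {train_acc:.4f}, validation accuracy = {val_acc:.4f}')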

All three models follow the same simple three-step pattern:

  1. Import the model class
  2. Initialize and fit: model.fit(X_train, y_train)
  3. Predict: model.predict(X_val)
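
As a concrete instance of the pattern (a sketch reusing the Decision Tree settings chosen above):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(min_samples_split=50, max_depth=3,
                               random_state=RANDOM_STATE).fit(X_train, y_train)
predictions = model.predict(X_val)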

Model Selection Takeaways:

  • Random Forest: Often best for tabular data without extensive tuning
  • XGBoost: Superior for competitions and complex datasets
  • Single Decision Tree: Good baseline, highly interpretable
  • Early stopping: Critical for gradient boosting methods

This lab demonstrates the progression from simple decision trees to sophisticated ensemble methods, showing how each approach builds upon the previous to achieve better performance and robustness.