This lab demonstrates how to use Pandas for one-hot encoding and how to implement Decision Tree, Random Forest, and XGBoost models using the scikit-learn and XGBoost libraries.
Cardiovascular disease (CVD) is the leading cause of death globally, which motivates the lab's prediction target: whether a patient has heart disease (the `HeartDisease` column).
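The snippets below assume the following setup (a minimal sketch; the seed value and the `heart.csv` filename are assumptions, not fixed by the lab text):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

RANDOM_STATE = 55              # assumed seed; any fixed value gives reproducible splits
df = pd.read_csv('heart.csv')  # assumed filename for the heart-disease dataset
```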
Categorical Variables requiring encoding:
```python
# Categorical variables to encode
cat_variables = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

# One-hot encode using pandas
df = pd.get_dummies(data=df, prefix=cat_variables, columns=cat_variables)

# Result: 11 original features → 17 features after encoding
```
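To sanity-check the encoding, the new dummy columns can be inspected directly (a quick sketch; the exact column names depend on the category values present in the data):

```python
# Dummy columns created by get_dummies follow the pattern '<prefix>_<category>'
print([c for c in df.columns if c.startswith('ChestPainType')])
```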
```python
# Split into features and target (exclude the target column by name)
features = [x for x in df.columns if x != 'HeartDisease']

# Train-validation split
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df['HeartDisease'],
    train_size=0.8, random_state=RANDOM_STATE
)

print(f'train samples: {len(X_train)}')
print(f'validation samples: {len(X_val)}')
print(f'target proportion: {sum(y_train)/len(y_train):.4f}')
```
Key Hyperparameters:
- `min_samples_split`: the minimum number of samples required to split an internal node; larger values constrain tree growth and reduce overfitting.
- `max_depth`: the maximum depth of the tree; shallower trees generalize better but may underfit.

Sweeping each hyperparameter and comparing training vs. validation accuracy reveals where the model starts to overfit, as shown in the sketch below.
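A minimal sketch of such a sweep, here for `min_samples_split` (the candidate values are illustrative, not the lab's exact grid):

```python
from sklearn.metrics import accuracy_score

# Try increasingly restrictive values and watch the train/validation gap
for mss in [2, 10, 30, 50, 100, 200, 300, 700]:
    model = DecisionTreeClassifier(min_samples_split=mss,
                                   random_state=RANDOM_STATE).fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    print(f'min_samples_split={mss:>3}: train={train_acc:.4f}, val={val_acc:.4f}')
```

The same loop works for `max_depth` by swapping the hyperparameter being varied.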
```python
# Optimal Decision Tree configuration
decision_tree_model = DecisionTreeClassifier(
    min_samples_split=50,
    max_depth=3,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Results: no overfitting, balanced performance
```
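The quoted accuracies come from comparing train and validation performance; a sketch of that check (reusing `accuracy_score` from the sweep above):

```python
print(f'train accuracy: {accuracy_score(y_train, decision_tree_model.predict(X_train)):.4f}')
print(f'validation accuracy: {accuracy_score(y_val, decision_tree_model.predict(X_val)):.4f}')
```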
Additional Hyperparameter: `n_estimators`, the number of trees in the forest.
```python
# Optimal Random Forest configuration
random_forest_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=16,
    min_samples_split=10,
    random_state=RANDOM_STATE
).fit(X_train, y_train)

# Better performance than a single decision tree
```
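A side benefit of the fitted forest: scikit-learn exposes impurity-based feature importances via the `feature_importances_` attribute (a quick sketch):

```python
# Rank features by the forest's impurity-based importances
importances = pd.Series(random_forest_model.feature_importances_, index=features)
print(importances.sort_values(ascending=False).head(5))
```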
Gradient Boosting Features: trees are built sequentially, each new tree trained to correct the errors of the ensemble so far; a learning rate scales each tree's contribution, and a held-out evaluation set enables early stopping.
```python
# XGBoost with early stopping
xgb_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.1,
    verbosity=1,
    random_state=RANDOM_STATE
)

# Hold out part of the training set as the early-stopping eval set
# (an 80/20 split is assumed here; the lab's exact fraction may differ)
X_train_fit, X_train_eval, y_train_fit, y_train_eval = train_test_split(
    X_train, y_train, train_size=0.8, random_state=RANDOM_STATE
)

# Early stopping: halt when the eval metric stops improving for 10 rounds
xgb_model.fit(X_train_fit, y_train_fit,
              eval_set=[(X_train_eval, y_train_eval)],
              early_stopping_rounds=10)

# Training stopped at round 26 (best was round 16)
```
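After early stopping, the best round can be inspected via the `best_iteration` attribute of the XGBoost scikit-learn wrapper (available once an eval set and early stopping are used):

```python
# Index of the best boosting round found during early stopping
print(xgb_model.best_iteration)  # 16, per the run described above
```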
Model Performance Summary:

| Model | Training Accuracy | Validation Accuracy | Notes |
| --- | --- | --- | --- |
| Decision Tree | ~0.85 | ~0.82 | No overfitting; simple and interpretable |
| Random Forest | ~0.98 | ~0.87 | Best validation accuracy, but the largest train-validation gap |
| XGBoost | ~0.90 | ~0.87 | Matches Random Forest's validation accuracy with a smaller gap |
All models require just 3 lines: instantiate, fit, predict.

```python
model = RandomForestClassifier()  # or DecisionTreeClassifier() / XGBClassifier()
model.fit(X_train, y_train)
model.predict(X_val)
```
This lab demonstrates the progression from simple decision trees to sophisticated ensemble methods, showing how each approach builds upon the previous to achieve better performance and robustness.