
XGBoost

Industry Standard

XGBoost (Extreme Gradient Boosting) is the most commonly used implementation of decision tree ensembles, known for speed, effectiveness, and competition-winning performance.

Baseline approach (bagged ensembles such as random forests): Train trees independently on different random samples (see the sketch after this list)

  • Sample selection: Equal probability for all examples
  • Tree independence: Each tree trained separately
  • Combination: Simple voting/averaging
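
As a quick illustration of this equal-probability sampling, here is a minimal sketch using numpy; the numbers are arbitrary:

bagging_sampling.py
# Minimal sketch: equal-probability sampling with replacement (the bagged-ensemble baseline)
import numpy as np

rng = np.random.default_rng(0)
n_examples = 10

# Every example is equally likely to be drawn for every tree;
# some end up repeated, others are left out of a given sample
indices = rng.choice(n_examples, size=n_examples, replace=True)
print(indices)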

Key Innovation: Focus subsequent trees on examples that previous trees got wrong

Modified Algorithm:

  1. First tree: Train on original dataset (equal probability sampling)
  2. Subsequent trees: Higher probability of selecting misclassified examples
  3. Iterative improvement: Each tree focuses on “hard” examples
  4. Final ensemble: Weighted combination of all trees

Analogy: practicing a 5-minute piece of music.

Inefficient approach: Practice the entire piece repeatedly.

Efficient approach:

  1. Play complete piece to identify difficult sections
  2. Focus practice on problematic parts only
  3. Targeted improvement on specific weaknesses

Inefficient ensemble: All trees see the same data distribution.

Efficient boosting:

  1. Train initial tree on full dataset
  2. Identify misclassified examples that need more attention
  3. Focus next tree on difficult examples
  4. Iterative specialization on remaining errors

Step 1 - First Tree: Train on original training set

  • Standard decision tree training
  • Evaluate predictions on all examples
  • Identify misclassified examples

Step 2 - Focus on Errors:

  • Increase sampling probability for misclassified examples
  • Decrease sampling probability for correctly classified examples
  • Train second tree on this reweighted dataset

Step 3 - Iterative Improvement:

  • Evaluate ensemble performance (trees 1 + 2)
  • Further adjust sampling probabilities based on remaining errors
  • Continue for B iterations (a sketch of this loop follows below)
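
As a rough illustration only, the three steps above can be sketched as the following loop, using plain scikit-learn trees; the reweighting factor and tree depth are arbitrary choices, and XGBoost implements this idea very differently internally:

boosting_sketch.py
# Conceptual sketch of the reweighting loop in Steps 1-3
# (an illustration of the intuition, not XGBoost's actual algorithm)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boosted_trees(X, y, B=10, seed=0):
    # X, y: numpy arrays of features and labels
    rng = np.random.default_rng(seed)
    n = len(y)
    probs = np.full(n, 1.0 / n)        # Step 1: equal probability for every example
    trees = []
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True, p=probs)
        tree = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])
        trees.append(tree)
        wrong = tree.predict(X) != y                  # Step 2: find misclassified examples
        probs = np.where(wrong, probs * 2.0, probs)   # upweight them (factor is arbitrary)
        probs = probs / probs.sum()                   # Step 3: renormalize, repeat for B rounds
    return trees

A fuller version would evaluate the ensemble built so far, not just the latest tree, when deciding which examples still need attention.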

Example Analysis

Tree 1 Performance

  • ✓ Correct: Examples 1, 2, 4, 5, 6, 7, 9, 10
  • ✗ Incorrect: Examples 3, 8
  • Tree 2 focus: Higher sampling probability on examples 3 and 8 (worked through below)
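
A back-of-the-envelope version of that reweighting (the boost factor of 3 is purely illustrative):

reweighting_example.py
# Illustrative reweighting for the 10-example analysis above
import numpy as np

probs = np.full(10, 0.1)               # Tree 1: each of the 10 examples drawn with probability 0.1
misclassified = np.array([3, 8]) - 1   # examples 3 and 8 were wrong (convert to 0-based indices)
probs[misclassified] *= 3.0            # boost their sampling probability (factor is illustrative)
probs = probs / probs.sum()            # renormalize to a valid distribution
print(probs.round(3))                  # correct examples ~0.071 each, examples 3 and 8 ~0.214 each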

Ensemble Building

Progressive Specialization

  • Tree 1: General patterns
  • Tree 2: Difficult cases from Tree 1
  • Tree 3: Remaining difficult cases
  • Result: Complementary expertise

XGBoost advantages:

  • Fast implementation: Highly optimized C++ core
  • Good defaults: Excellent out-of-the-box performance
  • Built-in regularization: Helps prevent overfitting
  • Flexible objective functions: Supports various loss functions
Kaggle Winner

Machine Learning Competitions: XGBoost frequently wins data science competitions

  • Deep Learning: Also commonly wins competitions
  • Dual dominance: XGBoost + Deep Learning dominate competitive ML

Technical improvement: Instead of sampling with replacement, XGBoost assigns different weights to training examples

  • Computational benefit: Avoids creating multiple datasets
  • Same intuition: Higher weights = more focus on difficult examples
  • Better efficiency: Single dataset with weighted examples (see the sketch below)
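
A loose illustration of the weighted-dataset idea, using a plain scikit-learn tree and made-up data (XGBoost handles its weighting internally):

weighted_dataset.py
# Sketch: one dataset with per-example weights instead of resampled copies
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
weights = np.array([1.0, 1.0, 3.0, 3.0, 1.0, 1.0])   # "difficult" middle examples get more influence

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y, sample_weight=weights)                # a single weighted dataset, nothing duplicated
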
xgboost_usage.py
# Import and basic usage (synthetic data added so the snippet runs end to end)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRegressor

# Example data: a small binary classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Classification
model = XGBClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Regression alternative (use a continuous target in practice)
model = XGBRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

Key parameters:

  • n_estimators: Number of boosting rounds (trees)
  • learning_rate: Shrinkage applied to each tree's contribution (the boosting step size)
  • max_depth: Maximum depth of each tree
  • reg_alpha, reg_lambda: L1 and L2 regularization strengths (see the example below)
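
For example, these parameters are passed directly to the constructor; the values below are common starting points, not tuned recommendations:

xgboost_params.py
# Illustrative hyperparameter settings
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,    # number of boosting rounds (trees)
    learning_rate=0.1,   # shrinks each tree's contribution
    max_depth=6,         # maximum depth of each tree
    reg_alpha=0.0,       # L1 regularization on leaf weights
    reg_lambda=1.0,      # L2 regularization on leaf weights
)
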
Random forests (independent trees):

  • Independent trees: Each tree trained separately
  • Parallel training: Trees can be trained simultaneously
  • Equal focus: All trees see a similar data distribution
  • Robust: Less prone to overfitting

XGBoost (boosted trees):

  • Sequential trees: Each tree builds on previous trees
  • Sequential training: Trees must be trained in order
  • Focused learning: Later trees specialize on difficult cases
  • Powerful: Often achieves higher accuracy

Random forests are a good fit for:

  • Quick prototyping
  • Parallel training needs
  • Robust baseline models
  • When interpretability matters
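
A quick side-by-side baseline (synthetic data; accuracies will vary by problem) can make the trade-off concrete:

forest_vs_xgboost.py
# Train a random forest and an XGBoost model on the same split and compare accuracy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("random forest", RandomForestClassifier(random_state=0)),
                    ("xgboost", XGBClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))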

XGBoost represents the state-of-the-art in gradient boosting, combining the power of ensemble learning with sophisticated optimization techniques to create highly effective models for structured data problems.