
Random Forest Algorithm

Random Forest builds multiple decision trees using sampling with replacement to create a robust ensemble classifier.

  1. For b = 1 to B (total number of trees):

    • Use sampling with replacement to create a new training set of size M
    • Train a decision tree on this new dataset
  2. For prediction:

    • Have all B trees vote on the final prediction
    • Use the majority vote for classification

Original training set: 10 examples.

Process:

  • Tree 1: Sample 10 examples with replacement → Train decision tree
  • Tree 2: Sample 10 examples with replacement → Train different decision tree
  • Tree 3: Sample 10 examples with replacement → Train another decision tree
  • Continue for B total trees… (a small sampling sketch follows this list)
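To make the sampling step concrete, here is a minimal sketch (using NumPy; the example array and seed are illustrative assumptions, not from the original notes) of drawing three bootstrap samples from a 10-example training set:

bootstrap_sampling_sketch.py
# Draw bootstrap samples (sampling with replacement) from 10 training examples.
import numpy as np

rng = np.random.default_rng(0)
examples = np.arange(10)  # stand-in for the original 10 training examples

for b in range(3):  # first 3 of B trees
    bootstrap = rng.choice(examples, size=10, replace=True)
    print(f"Tree {b + 1} sample:", sorted(bootstrap))
# Each sample usually repeats some examples and omits others,
# which is what makes the resulting trees differ from one another.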

Typical B values: 64 to 128 trees

  • More trees: Generally doesn’t hurt performance
  • Diminishing returns: Beyond ~100 trees, improvement minimal
  • Computational cost: More trees = slower training and prediction (see the quick sweep below)
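As a rough illustration of these diminishing returns, the sketch below sweeps the number of trees on a synthetic dataset with scikit-learn (the dataset and any resulting scores are illustrative assumptions, not results from the original notes):

tree_count_sweep.py
# Sketch: cross-validated accuracy vs. number of trees on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for n_trees in [1, 16, 64, 128, 512]:
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_trees:4d} trees: mean CV accuracy = {score:.3f}")
# Accuracy usually climbs quickly at first and then flattens out,
# while training time keeps growing roughly linearly with the tree count.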
Bagged Decision Tree

This basic ensemble method is called a “bagged decision tree” (bagging is short for “bootstrap aggregating”) because training examples are effectively placed in a virtual “bag” and drawn with replacement.

Issue: Even with sampling, the trees often make the same split decisions

  • Root node: Frequently chooses the same feature across different samples
  • Near-root nodes: Similar split patterns emerge
  • Result: The trees are less diverse than they could be

Random Forest Solution: Feature Randomization


Key Innovation: At each node, randomly select a subset of features to consider

Process at each node:

  1. Available features: n total features
  2. Random subset: Choose k < n features randomly
  3. Best split selection: Choose the feature with the highest information gain from those k features only

Feature subset size (k):

  • Typical choice: k = √n (square root of total features)
  • Small datasets: The technique helps less when n is small; it is most useful when there are many features
  • Example: If n = 100 features, consider k = 10 features at each split (see the sketch below)
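A minimal sketch of this per-node feature sampling (the feature count and seed here are illustrative assumptions):

node_feature_subset_sketch.py
# Sketch: pick k = sqrt(n) candidate features for one split decision.
import numpy as np

n_features = 100                   # n total features
k = int(np.sqrt(n_features))       # k = sqrt(n) -> 10

rng = np.random.default_rng(42)
candidate_features = rng.choice(n_features, size=k, replace=False)
print("Features considered at this node:", sorted(candidate_features))
# The best split is chosen only among these k features;
# every node draws its own fresh random subset.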

Basic bagging (randomization source):

  • Sample different training sets
  • Same features available at each node

Random Forest (enhanced randomization):

  • Sample different training sets
  • Random feature subsets at each node
  • Greater tree diversity

Multiple sources of randomness:

  1. Data sampling: Different training sets via sampling with replacement
  2. Feature sampling: Different feature subsets at each node split

Result: More diverse trees → better ensemble performance

Ensemble averaging: The algorithm automatically explores many small variations of the data

  • Data variations: Sampling with replacement creates data perturbations
  • Feature variations: Random feature selection creates different perspectives
  • Collective wisdom: Averaging over variations reduces impact of individual changes

How the three approaches compare (a code comparison follows this list):

Single decision tree

  • Training: Fixed dataset, all features available
  • Weakness: High sensitivity to data changes
  • Strength: Fast, interpretable

Bagged decision trees

  • Training: Multiple sampled datasets, all features available
  • Improvement: Reduced sensitivity through ensemble voting
  • Limitation: Trees may still be too similar

Random Forest

  • Training: Multiple sampled datasets, random feature subsets
  • Advantages: Maximum diversity, excellent robustness
  • Performance: Typically much better than single trees
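The sketch below puts the three approaches side by side on a synthetic dataset with scikit-learn (BaggingClassifier bags plain decision trees by default; the dataset and any resulting scores are illustrative assumptions, not figures from the original notes):

tree_ensemble_comparison.py
# Sketch: single tree vs. bagged trees vs. random forest on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "Single decision tree": DecisionTreeClassifier(random_state=0),
    "Bagged decision trees": BaggingClassifier(n_estimators=100, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:22s} mean CV accuracy = {score:.3f}")
# BaggingClassifier's default base estimator is a decision tree that sees all
# features at each split; RandomForestClassifier adds per-node feature sampling.

Conceptually, the forest-building loop itself looks like the following: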
random_forest_concept.py
# Conceptual Random Forest algorithm.
# Sketch: assumes X (features) and y (labels) are NumPy arrays holding the
# original training data, B is the number of trees, and the individual trees
# are scikit-learn DecisionTreeClassifiers.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

M = X.shape[0]  # number of training examples
forest = []

for b in range(B):  # B = number of trees
    # 1. Sample with replacement to build a bootstrap training set of size M
    idx = np.random.choice(M, size=M, replace=True)
    X_sample, y_sample = X[idx], y[idx]
    # 2. Train a tree with feature randomization:
    #    at each node, only sqrt(n) randomly chosen features are considered
    tree = DecisionTreeClassifier(max_features="sqrt")
    tree.fit(X_sample, y_sample)
    # 3. Add the tree to the ensemble
    forest.append(tree)

# Prediction: majority vote from all trees
def predict(x):
    votes = [tree.predict(x.reshape(1, -1))[0] for tree in forest]
    return Counter(votes).most_common(1)[0][0]
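In practice you would rarely hand-roll this loop; scikit-learn's RandomForestClassifier packages the same two ideas (bootstrap samples plus per-node feature subsets). A minimal usage sketch, again on an assumed synthetic dataset:

sklearn_random_forest_usage.py
# Sketch: the same concepts via scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rf = RandomForestClassifier(
    n_estimators=100,      # B: number of trees
    max_features="sqrt",   # k = sqrt(n) features considered at each split
    bootstrap=True,        # sample with replacement for each tree
    n_jobs=-1,             # trees are independent, so training parallelizes
    random_state=0,
)
rf.fit(X, y)
print("Training accuracy:", rf.score(X, y))

Because each tree is trained independently, the n_jobs setting above spreads training across CPU cores, which offsets much of the extra training cost listed below.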
Benefits:

  • Accuracy: Usually significantly better than a single decision tree
  • Robustness: Much less sensitive to training data changes
  • Overfitting reduction: Ensemble averaging reduces overfitting

Costs:

  • Training time: Longer than a single tree (B trees must be trained)
  • Prediction time: Slower (B predictions plus a vote)
  • Memory usage: Higher (B trees must be stored)
  • Parallelization: Tree training can be parallelized, which offsets much of the training cost

Random Forest is a major improvement over single decision trees: two complementary randomization techniques, bootstrap sampling of the training data and random feature subsets at each split, produce diverse, robust ensembles that typically outperform individual trees by a wide margin while retaining much of the efficiency of tree-based methods.