Pablo Rodriguez

Sampling With Replacement

Sampling with Replacement is a statistical technique used to create diverse training sets for building tree ensembles.

Setup: Four colored tokens (red, yellow, green, blue) in a bag

Process:

  1. Draw token: Randomly select one (e.g., green)
  2. Replace token: Put it back in the bag
  3. Shake and repeat: Draw again (e.g., yellow)
  4. Continue process: Draw, replace, shake
  5. Final sample: [green, yellow, blue, blue]

Key Observations:

  • Blue appears twice: Same item can be selected multiple times
  • Red never appears: Some items may not be selected at all
  • Different each time: Repeating process gives different sequences
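The token exercise is easy to reproduce in a few lines of Python. The sketch below is illustrative (the file and variable names are not from the original): random.choice draws uniformly, and replacement happens implicitly because the token list is never modified.

token_demo.py
# Simulate drawing four tokens with replacement
import random

tokens = ["red", "yellow", "green", "blue"]

# Each draw is independent; the "bag" is never changed, so duplicates can occur
sample = [random.choice(tokens) for _ in range(len(tokens))]
print(sample)  # e.g. ['green', 'yellow', 'blue', 'blue']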

Original Training Set: 10 examples of cats and dogs

Sampling Process:

  1. Place examples in “theoretical bag”: All 10 training examples
  2. Sample with replacement: Draw 10 examples randomly
  3. Allow duplicates: Same example can appear multiple times
  4. Create new training set: Same size (10) but different composition

Original Set: [Example 1, Example 2, …, Example 10]
Sampled Set: [Example 3, Example 7, Example 3, Example 1, Example 9, Example 2, Example 7, Example 5, Example 8, Example 7]

Characteristics:

  • Duplicates: Example 3 and Example 7 appear multiple times
  • Missing examples: Examples 4, 6, 10 don’t appear
  • Same size: Still 10 examples total
  • Different distribution: Changed frequency of different patterns
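A minimal sketch of this sampling step, assuming the ten training examples are simply labeled strings (the names below are illustrative):

bootstrap_demo.py
# Build one sampled training set of the same size as the original
import random
from collections import Counter

original = [f"Example {i}" for i in range(1, 11)]

# Draw 10 examples with replacement
sampled = [random.choice(original) for _ in range(len(original))]

counts = Counter(sampled)
print("Sampled set:", sampled)
print("Duplicated:", [ex for ex, c in counts.items() if c > 1])
print("Missing:", [ex for ex in original if ex not in counts])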

Repeated Examples

Some examples appear multiple times

  • Increases weight of certain patterns
  • Creates emphasis on specific cases

Missing Examples

Some examples don’t appear

  • Reduces influence of certain patterns
  • Creates gaps in training coverage

Different Emphasis

Each sample has unique focus

  • Different trees see different patterns
  • Natural source of diversity

Tree 1: Trained on Sample Set A

  • Emphasized certain patterns due to duplicates
  • Missing some patterns entirely
  • Develops specific decision boundaries

Tree 2: Trained on Sample Set B

  • Different duplicates and missing patterns
  • Develops different decision boundaries
  • Complements Tree 1’s knowledge

Tree 3: Trained on Sample Set C

  • Yet another pattern distribution
  • Further diversifies ensemble knowledge

  • For each draw: every example has an equal probability (1/n) of being selected
  • Across multiple draws: some examples are selected more often, some less, and some not at all
  • Overall effect: creates natural variation in training set composition

Probability that a given example is never selected:

  • Not selected in a single draw: (n-1)/n
  • Not selected in any of n draws: ((n-1)/n)^n, which approaches 1/e ≈ 37% for large n
  • Expected missing examples: roughly 37% of the original examples

Practical Result: Each sampled training set is missing roughly one third (≈37%) of the original examples, and a different subset is missing each time.
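A quick numerical check of the ((n-1)/n)^n figure, just to confirm it settles near 1/e:

missing_probability.py
# Probability that a specific example is never drawn in n draws
import math

for n in [10, 100, 1000, 10000]:
    print(n, round(((n - 1) / n) ** n, 4))

print("1/e =", round(1 / math.e, 4))  # ≈ 0.3679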

Common practice: Sample same number of examples as original training set

  • Advantage: Maintains statistical properties
  • Result: Good balance of diversity and coverage

Without replacement: Drawing n examples from n always returns the identical training set (only the order changes)
With replacement: Essential for creating diversity across sampled training sets
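The contrast is easy to see with Python's standard library: random.sample draws without replacement, random.choices draws with replacement (the data and names below are illustrative).

replacement_comparison.py
# Sampling n of n items: without replacement only reorders; with replacement varies
import random

data = list(range(10))

without = random.sample(data, k=len(data))     # a permutation of the same 10 items
with_repl = random.choices(data, k=len(data))  # duplicates and omissions possible

print(sorted(without) == data)    # always True
print(sorted(with_repl) == data)  # usually False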

sampling_concept.py
# Conceptual sampling with replacement
import random

def sample_with_replacement(original_data, sample_size):
    new_sample = []
    for _ in range(sample_size):
        # Randomly select an index from the original data (uniform, with replacement)
        random_index = random.randrange(len(original_data))
        # Add the selected example to the new sample
        new_sample.append(original_data[random_index])
    return new_sample
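A short usage sketch for the function above (the example data is made up):

training_set = ["cat_1", "dog_1", "cat_2", "dog_2", "cat_3"]
new_set = sample_with_replacement(training_set, sample_size=len(training_set))
print(new_set)  # e.g. ['dog_2', 'cat_1', 'cat_1', 'cat_3', 'dog_2']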

Sampling with replacement provides the foundation for creating diverse training sets that enable robust tree ensembles, turning the sensitivity of individual decision trees from a weakness into a source of ensemble robustness.