Pablo Rodriguez

Sampling With Replacement

Sampling with Replacement is a statistical technique used to create diverse training sets for building tree ensembles.

Setup: Four colored tokens (red, yellow, green, blue) in a bag

Process:

  1. Draw token: Randomly select one (e.g., green)
  2. Replace token: Put it back in the bag
  3. Shake and repeat: Draw again (e.g., yellow)
  4. Continue process: Draw, replace, shake
  5. Final sample: [green, yellow, blue, blue]

Key Observations:

  • Blue appears twice: Same item can be selected multiple times
  • Red never appears: Some items may not be selected at all
  • Different each time: Repeating process gives different sequences
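The token exercise is easy to reproduce in a few lines of Python. The sketch below is illustrative (the file and variable names are not from the original): random.choice draws uniformly, and replacement happens implicitly because the token list is never modified.

token_demo.py
# Simulate drawing four tokens with replacement
import random

tokens = ["red", "yellow", "green", "blue"]

# Each draw is independent; the "bag" is never changed, so duplicates can occur
sample = [random.choice(tokens) for _ in range(len(tokens))]
print(sample)  # e.g. ['green', 'yellow', 'blue', 'blue']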

Original Training Set: 10 examples of cats and dogs

Sampling Process:

  1. Place examples in “theoretical bag”: All 10 training examples
  2. Sample with replacement: Draw 10 examples randomly
  3. Allow duplicates: Same example can appear multiple times
  4. Create new training set: Same size (10) but different composition

Original Set: [Example 1, Example 2, …, Example 10]
Sampled Set: [Example 3, Example 7, Example 3, Example 1, Example 9, Example 2, Example 7, Example 5, Example 8, Example 7]

Characteristics:

  • Duplicates: Example 3 and Example 7 appear multiple times
  • Missing examples: Examples 4, 6, 10 don’t appear
  • Same size: Still 10 examples total
  • Different distribution: Changed frequency of different patterns
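A minimal sketch of this sampling step, assuming the ten training examples are simply labeled strings (the names below are illustrative):

bootstrap_demo.py
# Build one sampled training set of the same size as the original
import random
from collections import Counter

original = [f"Example {i}" for i in range(1, 11)]

# Draw 10 examples with replacement
sampled = [random.choice(original) for _ in range(len(original))]

counts = Counter(sampled)
print("Sampled set:", sampled)
print("Duplicated:", [ex for ex, c in counts.items() if c > 1])
print("Missing:", [ex for ex in original if ex not in counts])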

Repeated Examples

Some examples appear multiple times

  • Increases weight of certain patterns
  • Creates emphasis on specific cases

Missing Examples

Some examples don’t appear

  • Reduces influence of certain patterns
  • Creates gaps in training coverage

Different Emphasis

Each sample has unique focus

  • Different trees see different patterns
  • Natural source of diversity

Tree 1: Trained on Sample Set A

  • Emphasized certain patterns due to duplicates
  • Missing some patterns entirely
  • Develops specific decision boundaries

Tree 2: Trained on Sample Set B

  • Different duplicates and missing patterns
  • Develops different decision boundaries
  • Complements Tree 1’s knowledge

Tree 3: Trained on Sample Set C

  • Yet another pattern distribution
  • Further diversifies ensemble knowledge

  • For each draw: every example has an equal probability (1/n) of being selected
  • Across multiple draws: some examples are selected more often, some less, and some not at all
  • Overall effect: creates natural variation in training set composition

Probability that a given example is never selected:

  • Not selected in a single draw: (n-1)/n
  • Not selected in any of n draws: ((n-1)/n)^n, which approaches 1/e ≈ 37% for large n
  • Expected missing examples: roughly 37% of the original examples

Practical Result: Each sampled training set is missing roughly one third (≈37%) of the original examples, and a different subset is missing each time.
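A quick numerical check of the ((n-1)/n)^n figure, just to confirm it settles near 1/e:

missing_probability.py
# Probability that a specific example is never drawn in n draws
import math

for n in [10, 100, 1000, 10000]:
    print(n, round(((n - 1) / n) ** n, 4))

print("1/e =", round(1 / math.e, 4))  # ≈ 0.3679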

Common practice: Sample same number of examples as original training set

  • Advantage: Maintains statistical properties
  • Result: Good balance of diversity and coverage

Without replacement: Drawing n examples from n always returns the identical training set (only the order changes)
With replacement: Essential for creating diversity across sampled training sets
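The contrast is easy to see with Python's standard library: random.sample draws without replacement, random.choices draws with replacement (the data and names below are illustrative).

replacement_comparison.py
# Sampling n of n items: without replacement only reorders; with replacement varies
import random

data = list(range(10))

without = random.sample(data, k=len(data))     # a permutation of the same 10 items
with_repl = random.choices(data, k=len(data))  # duplicates and omissions possible

print(sorted(without) == data)    # always True
print(sorted(with_repl) == data)  # usually False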

sampling_concept.py
# Conceptual sampling with replacement
import random

def sample_with_replacement(original_data, sample_size):
    new_sample = []
    for _ in range(sample_size):
        # Randomly select an index from the original data (uniform, with replacement)
        random_index = random.randrange(len(original_data))
        # Add the selected example to the new sample
        new_sample.append(original_data[random_index])
    return new_sample
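A short usage sketch for the function above (the example data is made up):

training_set = ["cat_1", "dog_1", "cat_2", "dog_2", "cat_3"]
new_set = sample_with_replacement(training_set, sample_size=len(training_set))
print(new_set)  # e.g. ['dog_2', 'cat_1', 'cat_1', 'cat_3', 'dog_2']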

Sampling with replacement provides the foundation for creating diverse training sets that enable robust tree ensembles, turning the sensitivity of individual decision trees from a weakness into a source of ensemble robustness.