Sampling with replacement is a statistical technique used to create the diverse training sets needed to build tree ensembles.
Setup: Four colored tokens (red, yellow, green, blue) in a bag
Process: Draw one token at random, record its color, and return it to the bag before the next draw; repeat until the sample reaches the desired size
Key Observations: Because each token is returned before the next draw, the same color can be drawn more than once, and some colors may never be drawn at all
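The token demonstration can be simulated in a few lines of Python (the colors come from the setup above; the rest is a minimal illustrative sketch):

```python
import random

tokens = ["red", "yellow", "green", "blue"]

# Draw four tokens with replacement: each draw "puts the token back",
# so the same color can appear more than once and some may never appear.
draws = [random.choice(tokens) for _ in range(4)]
print(draws)
```

Running this a few times shows sequences with duplicates and omissions, which is exactly the behavior sampling with replacement relies on.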
Original Training Set: 10 examples of cats and dogs
Sampling Process:
Original Set: [Example 1, Example 2, …, Example 10]
Sampled Set: [Example 3, Example 7, Example 3, Example 1, Example 9, Example 2, Example 7, Example 5, Example 8, Example 7]
Characteristics:
- Repeated Examples: some examples appear multiple times, which increases the weight of those patterns and creates emphasis on specific cases
- Missing Examples: some examples don't appear at all
- Different Emphasis: each sampled set ends up with its own unique focus
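These characteristics can be checked directly on the sampled set shown above, for example with Python's `collections.Counter`:

```python
from collections import Counter

original = list(range(1, 11))             # Examples 1..10
sampled = [3, 7, 3, 1, 9, 2, 7, 5, 8, 7]  # the sampled set from above

counts = Counter(sampled)
repeated = [ex for ex, c in counts.items() if c > 1]
missing = [ex for ex in original if ex not in counts]

print(repeated)  # [3, 7] - Example 7 appears three times, Example 3 twice
print(missing)   # [4, 6, 10] - these examples were never drawn
```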
Tree 1: Trained on Sample Set A
Tree 2: Trained on Sample Set B
Tree 3: Trained on Sample Set C
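The three-tree setup above can be sketched as a toy bagging loop. This is not a real decision-tree implementation: `train_stub_tree` is a hypothetical stand-in that just predicts the majority label of its bootstrap sample, but the sample-then-train-then-vote structure is the same one a real tree ensemble uses.

```python
import random
from collections import Counter

def sample_with_replacement(data, k):
    # Each draw is independent and uniform over the data
    return [random.choice(data) for _ in range(k)]

def train_stub_tree(labeled_sample):
    # Stand-in for a decision tree: always predicts the sample's majority label
    majority = Counter(label for _, label in labeled_sample).most_common(1)[0][0]
    return lambda x: majority

def bagged_predict(trees, x):
    # Majority vote across the ensemble
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Toy training set: 10 labeled examples of cats and dogs
data = [(i, "cat") for i in range(6)] + [(i, "dog") for i in range(4)]

# Tree 1, 2, 3 each trained on its own bootstrap sample (Sample Sets A, B, C)
trees = [train_stub_tree(sample_with_replacement(data, len(data))) for _ in range(3)]
print(bagged_predict(trees, None))
```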
For each draw: every example has equal probability (1/n) of being selected
Across multiple draws: some examples are selected more often, some less, and some not at all
Overall effect: creates natural variation in training set composition
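As a sanity check on the equal-probability claim, a short simulation (illustrative names, not part of any library) confirms that each example's empirical frequency settles near 1/n:

```python
import random
from collections import Counter

n = 10
examples = list(range(n))

# 100,000 independent draws with replacement
draws = [random.choice(examples) for _ in range(100_000)]
freqs = {ex: c / len(draws) for ex, c in Counter(draws).items()}

# Every example's empirical frequency should be close to 1/n = 0.1
print(min(freqs.values()), max(freqs.values()))
```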
Probability an example appears at least once in n draws: 1 − (1 − 1/n)^n, which approaches 1 − 1/e ≈ 0.632 as n grows
Practical Result: Each sampled training set is missing about 1/3 of the original examples, with a different 1/3 missing each time.
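The "about 1/3 missing" figure comes from the probability that a specific example is never drawn in n draws, (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick numerical check:

```python
import math

def miss_probability(n):
    # Probability a given example is never drawn in n draws with replacement
    return (1 - 1 / n) ** n

for n in (10, 100, 1000):
    print(n, miss_probability(n))

print(1 / math.e)  # limiting value, roughly 0.3679
```

Even at n = 10 the value is already close to 1/3, so the rule of thumb holds for small training sets too.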
Common practice: Sample same number of examples as original training set
Without replacement: always produces the identical training set
With replacement: essential for creating diversity
import random

# Conceptual sampling with replacement
def sample_with_replacement(original_data, sample_size):
    new_sample = []
    for _ in range(sample_size):
        # Randomly select an index from the original data
        random_index = random.choice(range(len(original_data)))
        # Add the selected example to the new sample
        new_sample.append(original_data[random_index])
    return new_sample
Sampling with replacement provides the foundation for creating the diverse training sets that enable robust tree ensembles, turning the sensitivity of individual decision trees (a weakness) into ensemble robustness (a strength).