Adding Data

Strategic Data Collection

Rather than collecting “more data of everything,” use error analysis to decide which kinds of data to collect. Targeted collection usually improves the model at a lower cost.

Targeted Data Collection

Error Analysis-Driven Approach

If error analysis reveals that pharmaceutical spam is a major problem:

Targeted effort: Focus on collecting more pharmaceutical spam examples
Cost efficiency: Modest cost compared to general data collection
Higher impact: Specific improvement in problematic area
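To make the error analysis step concrete, here is a minimal Python sketch that tallies hand-assigned tags on misclassified emails so the most common error category stands out. The tag names, the `rank_error_categories` helper, and the sample data are all hypothetical and only illustrate the idea.

```python
from collections import Counter

def rank_error_categories(misclassified):
    """Count how often each hand-assigned tag appears among misclassified examples."""
    if not misclassified:
        return []
    counts = Counter()
    for example in misclassified:
        # An example may carry several tags (e.g. both "pharma" and "misspelling").
        counts.update(example["tags"])
    total = len(misclassified)
    # Most frequent categories first, with their share of the reviewed errors.
    return [(tag, n, n / total) for tag, n in counts.most_common()]

# Hypothetical hand-tagged sample of misclassified validation emails.
errors = [
    {"id": 1, "tags": ["pharma"]},
    {"id": 2, "tags": ["pharma", "misspelling"]},
    {"id": 3, "tags": ["phishing"]},
    {"id": 4, "tags": ["pharma"]},
]

ranking = rank_error_categories(errors)
for tag, n, share in ranking:
    print(f"{tag}: {n} of {len(errors)} errors ({share:.0%})")
```

The category at the top of the ranking is the natural candidate for targeted data collection.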

Implementation Strategy

Unlabeled data utilization:

Have labelers skim through unlabeled email data
Specifically identify pharmaceutical spam examples
Much more efficient than random data collection

Data Augmentation

Definition: Create new training examples by modifying existing ones. This is especially effective for image and audio data.

Image Data Augmentation

Optical Character Recognition (OCR) Example: Recognizing letters A-Z

Basic Transformations

Rotation (slight angles)
Enlarging/shrinking
Contrast adjustment
Mirror images (only for letters that remain valid when flipped, such as A)

Advanced Techniques

Grid warping
Random distortions
Creates multiple training examples from single image

Result: One original image → Multiple training examples with same label
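The basic transformations above can be sketched in plain Python. This is a rough illustration, assuming a grayscale image stored as a list of rows with pixel values in [0, 1]; the function names and parameter ranges are assumptions chosen for the example, and resizing and grid warping are left out to keep it short.

```python
import math
import random

def mirror(img):
    """Flip the image left to right."""
    return [row[::-1] for row in img]

def adjust_contrast(img, factor):
    """Scale each pixel's distance from mid-gray (0.5) and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, 0.5 + factor * (p - 0.5))) for p in row] for row in img]

def rotate(img, degrees, fill=0.0):
    """Rotate about the image center using nearest-neighbor sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Map each output pixel back to its source location.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            row.append(img[iy][ix] if 0 <= ix < w and 0 <= iy < h else fill)
        out.append(row)
    return out

def augment(img, label, n, rng, allow_mirror=False):
    """Return n (image, label) pairs; the label is never changed."""
    examples = []
    for _ in range(n):
        new = rotate(img, rng.uniform(-10, 10))            # slight angles only
        new = adjust_contrast(new, rng.uniform(0.7, 1.3))
        if allow_mirror and rng.random() < 0.5:
            new = mirror(new)
        examples.append((new, label))
    return examples

# Toy 5x5 image of the letter A (1 = ink, 0 = background).
letter_a = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
]
batch = augment(letter_a, "A", n=5, rng=random.Random(0), allow_mirror=True)
```

Mirroring is opt-in through `allow_mirror` because flipping only preserves the label for symmetric letters such as A.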

Audio Data Augmentation

Speech Recognition Example: “What is today’s weather?”

Background Noise Addition:

Crowd noise: Original audio + crowd sounds = speech in noisy environment
Car noise: Original audio + car sounds = speech in vehicle
Phone distortion: Simulate bad cell phone connection quality

Implementation: Simply add background audio to original clean audio

# Conceptual approach: mix background noise into the clean recording
augmented_audio = clean_audio + background_noise
# Creates realistic training examples for various environments
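A slightly fuller version of the mixing step might look like the following Python sketch. It assumes both clips are lists of float samples in [-1.0, 1.0] at the same sample rate; the `mix_with_noise` helper, the `noise_gain` parameter, and the stand-in signals are illustrative choices, not part of the original notes.

```python
import math
import random

def mix_with_noise(clean, noise, noise_gain=0.3):
    """Add scaled background noise to a clean clip, sample by sample.

    Noise shorter than the clean clip is looped from its start.
    """
    if not noise:
        raise ValueError("noise clip must not be empty")
    mixed = []
    for i, sample in enumerate(clean):
        value = sample + noise_gain * noise[i % len(noise)]
        # Clamp so the result is still a valid waveform.
        mixed.append(max(-1.0, min(1.0, value)))
    return mixed

# Stand-in signals: a 440 Hz tone for speech, random values for crowd noise.
sample_rate = 8000
clean_audio = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
               for t in range(sample_rate)]
rng = random.Random(0)
crowd_noise = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

augmented_audio = mix_with_noise(clean_audio, crowd_noise, noise_gain=0.2)
```

Running the same clean clip through several noise recordings (crowd, car, phone line) yields several training examples that all keep the original transcript as their label.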

Data Augmentation Best Practices

Representative Distortions

Effective approach: Ensure distortions match expected test conditions

OCR warping should resemble real-world text distortions
Audio noise should match actual usage environments (cars, crowds, phones)

Avoid Meaningless Noise

Poor approach: Adding random per-pixel noise to images

Creates unrealistic examples unlike test set
Doesn’t improve real-world performance
Wastes computational resources

Guidelines

Apply only distortions that keep the augmented data similar to what the model will see in the test set

Data Synthesis

Definition: Create entirely new training examples from scratch rather than modifying existing ones.

Photo OCR Example

Application: Reading text from real-world images

Synthetic generation approach:

Use computer fonts to generate text
Screenshot text with various:
- Colors and contrasts
- Font types and sizes
- Background variations
Create large datasets of realistic-looking examples
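As a toy illustration of the synthetic generation approach, the Python sketch below renders random text into small grayscale images with varying size and contrast. The tiny hard-coded bitmap glyphs in `GLYPHS` are a hypothetical stand-in for real computer fonts, and the flat backgrounds are far simpler than a realistic pipeline would use.

```python
import random

# Hypothetical 5x3 bitmap glyphs standing in for real computer fonts.
GLYPHS = {
    "A": ["010", "101", "111", "101", "101"],
    "C": ["011", "100", "100", "100", "011"],
    "T": ["111", "010", "010", "010", "010"],
}

def render_text(text, scale, fg, bg):
    """Render text into a grayscale image (rows of pixel values)."""
    rows = []
    for glyph_row in range(5):
        line = []
        for ch in text:
            for bit in GLYPHS[ch][glyph_row]:
                line.extend([fg if bit == "1" else bg] * scale)
            line.extend([bg] * scale)  # one blank column after each letter
        # Repeat the row to scale the glyph vertically.
        rows.extend([list(line) for _ in range(scale)])
    return rows

def synthesize(n, rng, length=3):
    """Generate n (image, label) pairs with random text, size, and contrast."""
    dataset = []
    for _ in range(n):
        text = "".join(rng.choice(sorted(GLYPHS)) for _ in range(length))
        scale = rng.randint(1, 3)        # stands in for font size
        bg = rng.uniform(0.0, 1.0)       # background brightness
        # Keep text at least 0.3 away from the background so it stays legible.
        gap = rng.uniform(0.3, 0.5)
        fg = bg + gap if bg < 0.5 else bg - gap
        dataset.append((render_text(text, scale, fg, bg), text))
    return dataset

examples = synthesize(4, random.Random(0))
```

Because the generator chooses the text itself, every image comes with a correct label for free, which is what makes synthesis attractive for building large datasets.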

Comparison:

Real data: Actual photos of text in natural settings
Synthetic data: Computer-generated text that appears realistic

Implementation Considerations

High development cost: Significant coding effort to create realistic synthetic data
High payoff: Can generate very large training datasets
Primary use: Most successful in computer vision applications
Limited adoption: Less common in audio or other domains

Model-Centric vs Data-Centric Approaches

Traditional Model-Centric Approach

Focus: Improve algorithms/models while keeping data fixed
Historical emphasis: Most ML research focused on better algorithms
Current state: Algorithms (linear regression, neural networks, etc.) are already quite good

Modern Data-Centric Approach

Focus on engineering the data:

Collect targeted data based on error analysis
Apply data augmentation techniques
Generate synthetic training examples
Often more fruitful than algorithm improvements

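Model-Centric: Download a fixed dataset; focus on algorithm improvements; the traditional research approach
Data-Centric: Focus on engineering the data (targeted collection, augmentation, synthesis); often more fruitful than algorithm improvements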

The systematic approach of targeted data collection, augmentation, and synthesis provides powerful tools for improving learning algorithm performance, often yielding better results than purely algorithmic improvements.