Skip to content
Pablo Rodriguez

Adding Data

Rather than collecting “more data of everything,” focus efforts based on error analysis insights for more efficient improvement.

If error analysis reveals pharmaceutical spam as major problem:

  • Targeted effort: Focus on collecting more pharmaceutical spam examples
  • Cost efficiency: Modest cost compared to general data collection
  • Higher impact: Specific improvement in problematic area

Unlabeled data utilization:

  • Have labelers skim through unlabeled email data
  • Specifically identify pharmaceutical spam examples
  • Much more efficient than random data collection

Definition: Create new training examples by modifying existing ones, especially effective for images and audio.

Optical Character Recognition (OCR) Example: Recognizing letters A-Z

Basic Transformations

  • Rotation (slight angles)
  • Enlarging/shrinking
  • Contrast adjustment
  • Mirror images (for applicable letters)

Advanced Techniques

  • Grid warping
  • Random distortions
  • Creates multiple training examples from single image

Result: One original image → Multiple training examples with same label

Speech Recognition Example: “What is today’s weather?”

Background Noise Addition:

  • Crowd noise: Original audio + crowd sounds = speech in noisy environment
  • Car noise: Original audio + car sounds = speech in vehicle
  • Phone distortion: Simulate bad cell phone connection quality

Implementation: Simply add background audio to original clean audio

audio_augmentation.py
# Conceptual approach
clean_audio + background_noise = augmented_audio
# Creates realistic training examples for various environments

Effective approach: Ensure distortions match expected test conditions

  • OCR warping should resemble real-world text distortions
  • Audio noise should match actual usage environments (cars, crowds, phones)

Poor approach: Adding random per-pixel noise to images

  • Creates unrealistic examples unlike test set
  • Doesn’t improve real-world performance
  • Wastes computational resources
Guidelines

Distort data to remain similar to test set conditions

Definition: Create entirely new training examples from scratch rather than modifying existing ones.

Application: Reading text from real-world images

Synthetic generation approach:

  1. Use computer fonts to generate text
  2. Screenshot text with various:
    • Colors and contrasts
    • Font types and sizes
    • Background variations
  3. Create large datasets of realistic-looking examples

Comparison:

  • Left (Real data): Actual photos of text in natural settings
  • Right (Synthetic data): Computer-generated text that appears realistic
  • High development cost: Significant coding effort to create realistic synthetic data
  • High payoff: Can generate very large training datasets
  • Primary use: Most successful in computer vision applications
  • Limited adoption: Less common in audio or other domains
  • Focus: Improve algorithms/models while keeping data fixed
  • Historical emphasis: Most ML research focused on better algorithms
  • Current state: Algorithms (linear regression, neural networks, etc.) already quite good

Focus on engineering the data:

  • Collect targeted data based on error analysis
  • Apply data augmentation techniques
  • Generate synthetic training examples
  • Often more fruitful than algorithm improvements
  • Download fixed dataset
  • Focus on algorithm improvements
  • Traditional research approach

The systematic approach of targeted data collection, augmentation, and synthesis provides powerful tools for improving learning algorithm performance, often yielding better results than purely algorithmic improvements.