Rather than collecting “more data of everything,” focus data-collection efforts on the categories that error analysis flags; this yields improvement at lower cost.
If error analysis reveals pharmaceutical spam as a major problem:
Targeted effort : Focus on collecting more pharmaceutical spam examples
Cost efficiency : Modest cost compared to general data collection
Higher impact : Specific improvement in problematic area
Unlabeled data utilization :
Have labelers skim through unlabeled email data
Specifically identify pharmaceutical spam examples
Much more efficient than random data collection
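As a rough sketch (the keyword list, helper names, and example emails below are hypothetical, not from the original material), pre-filtering the unlabeled pool can make the labelers’ skim much faster:

```python
# Hypothetical sketch: pre-filter an unlabeled email pool so labelers only
# review messages likely to be pharmaceutical spam.
PHARMA_KEYWORDS = {"pharmacy", "prescription", "pills", "cheap meds"}

def likely_pharma_spam(email_text: str) -> bool:
    """Cheap heuristic flag: does the email mention any pharma-related keyword?"""
    text = email_text.lower()
    return any(keyword in text for keyword in PHARMA_KEYWORDS)

def select_for_labeling(unlabeled_emails):
    """Return the subset of emails worth sending to human labelers."""
    return [email for email in unlabeled_emails if likely_pharma_spam(email)]

if __name__ == "__main__":
    pool = [
        "Cheap meds, no prescription needed!!!",
        "Meeting moved to 3pm tomorrow",
        "Discount pharmacy pills shipped overnight",
    ]
    print(select_for_labeling(pool))  # only the pharma-like messages remain
```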
Data augmentation : Create new training examples by modifying existing ones; especially effective for images and audio.
Optical Character Recognition (OCR) Example : Recognizing letters A-Z
Basic Transformations
Rotation (slight angles)
Enlarging/shrinking
Contrast adjustment
Mirror images (for applicable letters)
Advanced Techniques
Grid warping
Random distortions
Creates multiple training examples from a single image
Result : One original image → multiple training examples with the same label
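A minimal sketch of the basic transformations using Pillow; the input path, parameter ranges, and output names are illustrative placeholders:

```python
# Data-augmentation sketch for OCR letter images using Pillow.
# The file path and parameter ranges are illustrative placeholders.
import random
from PIL import Image, ImageEnhance

def augment_letter(img: Image.Image, n_copies: int = 5):
    """Create several distorted copies of one letter image, all keeping its label."""
    copies = []
    for _ in range(n_copies):
        out = img.rotate(random.uniform(-10, 10), fillcolor=255)   # slight rotation
        scale = random.uniform(0.9, 1.1)                            # enlarge / shrink
        out = out.resize((int(out.width * scale), int(out.height * scale)))
        out = ImageEnhance.Contrast(out).enhance(random.uniform(0.7, 1.3))  # contrast
        copies.append(out)
    return copies

if __name__ == "__main__":
    original = Image.open("letter_A.png").convert("L")  # placeholder path
    for i, im in enumerate(augment_letter(original)):
        im.save(f"letter_A_aug_{i}.png")
```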
Speech Recognition Example : “What is today’s weather?”
Background Noise Addition :
Crowd noise : Original audio + crowd sounds = speech in noisy environment
Car noise : Original audio + car sounds = speech in vehicle
Phone distortion : Simulate bad cell phone connection quality
Implementation : Simply add background audio to original clean audio
augmented_audio = clean_audio + background_noise
# Creates realistic training examples for various environments
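A slightly fuller sketch, assuming the clean speech and background noise are already loaded as NumPy float arrays at the same sample rate (the 0.3 mixing weight is an arbitrary illustrative choice):

```python
# Sketch of audio augmentation by overlaying background noise on clean speech.
import numpy as np

def add_background_noise(clean: np.ndarray, noise: np.ndarray, noise_level: float = 0.3) -> np.ndarray:
    """Overlay scaled background noise on a clean speech waveform."""
    # Tile or trim the noise so it matches the length of the clean clip
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]
    augmented = clean + noise_level * noise
    # Keep the result in the valid [-1, 1] range for floating-point audio
    return np.clip(augmented, -1.0, 1.0)

if __name__ == "__main__":
    sr = 16000
    clean = np.random.uniform(-0.5, 0.5, sr)   # stand-in for "What is today's weather?"
    crowd = np.random.uniform(-0.5, 0.5, sr)   # stand-in for recorded crowd noise
    noisy = add_background_noise(clean, crowd)
    print(noisy.shape, noisy.min(), noisy.max())
```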
Effective approach : Ensure distortions match expected test conditions
OCR warping should resemble real-world text distortions
Audio noise should match actual usage environments (cars, crowds, phones)
Poor approach : Adding random per-pixel noise to images
Creates unrealistic examples that look nothing like the test set
Doesn’t improve real-world performance
Wastes computational resources
Guidelines
Distort data in ways that keep it representative of test-set conditions
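For contrast, here is a sketch of the discouraged per-pixel-noise approach; the noise scale is an illustrative choice:

```python
# Cautionary sketch: independent random per-pixel noise is easy to generate
# but usually produces images unlike anything in the test set.
import numpy as np

def per_pixel_noise(image: np.ndarray, sigma: float = 25.0) -> np.ndarray:
    """Add independent Gaussian noise to every pixel (generally NOT recommended)."""
    noisy = image.astype(np.float32) + np.random.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```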
Data synthesis : Create entirely new training examples from scratch rather than modifying existing ones.
Application : Reading text from real-world images
Synthetic generation approach :
Use computer fonts to generate text
Screenshot text with various:
Colors and contrasts
Font types and sizes
Background variations
Create large datasets of realistic-looking examples
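A toy sketch of this rendering idea with Pillow’s default font; a real pipeline would sample many fonts, sizes, and backgrounds, and the words, colors, and image sizes below are placeholders:

```python
# Toy sketch of synthetic text-image generation with Pillow.
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["OPEN", "EXIT", "SALE", "CAFE"]
BACKGROUNDS = [(255, 255, 255), (200, 200, 180), (40, 40, 60)]
TEXT_COLORS = [(0, 0, 0), (220, 30, 30), (240, 240, 240)]

def make_synthetic_example():
    """Render one word on a random background and return (image, label)."""
    word = random.choice(WORDS)
    img = Image.new("RGB", (160, 48), random.choice(BACKGROUNDS))
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # swap in ImageFont.truetype(...) for font variety
    draw.text((10, 15), word, fill=random.choice(TEXT_COLORS), font=font)
    return img, word

if __name__ == "__main__":
    for i in range(5):
        image, label = make_synthetic_example()
        image.save(f"synthetic_{label}_{i}.png")
```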
Comparison :
Real data : Actual photos of text in natural settings
Synthetic data : Computer-generated text that appears similarly realistic
High development cost : Significant coding effort to create realistic synthetic data
High payoff : Can generate very large training datasets
Primary use : Most successful in computer vision applications
Limited adoption : Less common in audio or other domains
Conventional model-centric approach : Improve algorithms/models while keeping the data fixed
Historical emphasis : Most ML research focused on better algorithms
Current state : Algorithms (linear regression, neural networks, etc.) already quite good
Data-centric approach : Focus on engineering the data
Collect targeted data based on error analysis
Apply data augmentation techniques
Generate synthetic training examples
Often more fruitful than algorithm improvements
Model-centric approach :
Download a fixed dataset
Focus on algorithm improvements
Traditional research approach
Data-centric approach :
Engineer and improve the training data
Often more efficient for performance gains
Modern practical approach
The systematic approach of targeted data collection, augmentation, and synthesis provides powerful tools for improving learning algorithm performance, often yielding better results than purely algorithmic improvements.