Developing and Evaluating an Anomaly Detection System
Importance of Real Number Evaluation
Development Challenge
When developing anomaly detection systems:
- Need to choose features
- Need to tune parameters (like ε)
- Need to make decisions about algorithm modifications
Real Number Evaluation Benefits
Having a way to compute a number that indicates algorithm performance makes development much easier:
- Quickly test changes to features or parameters
- Determine if algorithm got better or worse
- Make faster development decisions
Using Labeled Data for Evaluation
Modified Assumption
Even though anomaly detection is unsupervised, for evaluation purposes assume we have:
- Small number of previously observed anomalies
- Labels: y = 1 (anomalous), y = 0 (normal)
Training Set Structure
- Unlabeled training set: x⁽¹⁾ through x⁽ᵐ⁾ (see the fitting sketch after this list)
- Assumption: All examples are normal (y = 0)
- Note: A few anomalous examples that accidentally end up in the training set are okay
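For concreteness, here is a minimal NumPy sketch of fitting per-feature Gaussians to such an unlabeled training set and scoring p(x). The helper names `estimate_gaussian` and `gaussian_p` are illustrative, not from any particular library:

```python
import numpy as np

def estimate_gaussian(X):
    """Fit a per-feature Gaussian to the unlabeled training set X of shape (m, n)."""
    mu = X.mean(axis=0)     # per-feature mean
    var = X.var(axis=0)     # per-feature variance (sigma squared)
    return mu, var

def gaussian_p(X, mu, var):
    """p(x): product of independent univariate Gaussian densities, one per feature."""
    dens = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return dens.prod(axis=1)
```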
Evaluation Sets
Create a cross-validation set and a test set that include:
- Some normal examples (y = 0)
- A few anomalous examples (y = 1)
Note: Both sets should have a mix of normal and anomalous examples for proper evaluation.
Aircraft Engine Example
Dataset Composition
Section titled “Dataset Composition”- 10,000 good/normal engines (y = 0)
- 20 flawed/anomalous engines (y = 1)
Typical range: 2-50 known anomalies is common for this type of application.
Data Split Example
Training set: 6,000 good engines (this split is sketched in code below)
- Use for fitting Gaussian distributions
- If a couple of anomalous engines slip in, it’s okay
Cross-validation set: 2,000 good + 10 anomalous engines
- Use for tuning parameter ε
- Evaluate detection performance
Test set: 2,000 good + 10 anomalous engines
- Final evaluation after tuning
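As a rough sketch, the split might look like this in NumPy, with randomly generated stand-in arrays (`X_normal`, `X_anom`) in place of real engine measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: 10,000 normal engines and 20 flawed ones, two features each.
X_normal = rng.normal(size=(10_000, 2))
X_anom = rng.normal(size=(20, 2)) + 5.0

rng.shuffle(X_normal)
rng.shuffle(X_anom)

# Training set: 6,000 good engines (unlabeled, assumed y = 0)
X_train = X_normal[:6_000]

# Cross-validation set: 2,000 good + 10 anomalous
X_cv = np.vstack([X_normal[6_000:8_000], X_anom[:10]])
y_cv = np.concatenate([np.zeros(2_000), np.ones(10)])

# Test set: 2,000 good + 10 anomalous
X_test = np.vstack([X_normal[8_000:], X_anom[10:]])
y_test = np.concatenate([np.zeros(2_000), np.ones(10)])
```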
Development Process
- Train: Fit Gaussian distributions on the 6,000 training examples (the Train and Validate steps are sketched in code after this list)
- Validate: Use cross-validation set to:
  - Tune ε parameter higher/lower
  - Add/subtract/modify features
  - Check detection of 10 anomalies
  - Monitor false positives on 2,000 good engines
- Test: Final evaluation on separate test set
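One way to run the Train and Validate steps in code: fit the Gaussians on the training set, then sweep candidate values of ε on the cross-validation set and keep whichever scores best. This sketch assumes the `estimate_gaussian` / `gaussian_p` helpers and the `X_train`, `X_cv`, `y_cv` arrays from the earlier sketches, and uses scikit-learn’s `f1_score` for brevity:

```python
import numpy as np
from sklearn.metrics import f1_score

# Train: fit per-feature Gaussians on the 6,000 unlabeled training examples.
mu, var = estimate_gaussian(X_train)

# Validate: score the cross-validation set and sweep candidate thresholds.
p_cv = gaussian_p(X_cv, mu, var)

best_eps, best_f1 = 0.0, 0.0
for eps in np.linspace(p_cv.min(), p_cv.max(), 1_000):
    y_pred = (p_cv < eps).astype(int)        # flag y = 1 when p(x) < epsilon
    f1 = f1_score(y_cv, y_pred, zero_division=0)
    if f1 > best_f1:
        best_eps, best_f1 = eps, f1
```

The same loop also supports feature experiments: add or remove a feature, re-fit, re-sweep, and keep whichever configuration gives the higher cross-validation score.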
Still Unsupervised Learning
- Primary learning from unlabeled training set (all y = 0)
- Labeled examples only used for evaluation and parameter tuning
- Core algorithm learns by fitting Gaussian distributions as before
Alternative: No Test Set
When to Use
When you have very few anomalous examples (e.g., only 2 flawed engines):
Training set: 6,000 good engines
Cross-validation set: 4,000 good + all 20 anomalous engines
Trade-offs
Advantages:
- Makes sense when data is extremely limited
- Uses all anomalous examples for evaluation
Disadvantages:
- No fair way to evaluate final performance
- Higher risk of overfitting to cross-validation set
- Performance on future data may not match expectations
Evaluation Metrics
Algorithm Evaluation Process
- Fit model: Learn p(x) from the training set (the prediction rule is sketched in code after this list)
- Make predictions: For any example x, predict:
  - y = 1 (anomalous) if p(x) < ε
  - y = 0 (normal) if p(x) ≥ ε
- Compare: Match predictions against true labels in cross-validation/test sets
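As a fragment, the prediction rule is just a threshold comparison; `p_values` here would come from a density function such as the `gaussian_p` helper sketched earlier:

```python
import numpy as np

def predict(p_values, epsilon):
    """Return y = 1 (anomalous) where p(x) < epsilon, else y = 0 (normal)."""
    return (p_values < epsilon).astype(int)

# Compare predictions against the known labels, e.g. on the cross-validation set:
# mistakes = predict(p_cv, best_eps) != y_cv
```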
Handling Skewed Distributions
Problem: Far fewer anomalies (y = 1) than normal examples (y = 0)
- Example: 10 anomalies vs 2,000 normal examples
Solution: Use evaluation metrics designed for skewed data:
- True positive rate
- False positive rate
- False negative rate
- Precision and recall
- F₁ score
These metrics work better than simple classification accuracy for highly imbalanced datasets.
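A minimal sketch of computing precision, recall, and F₁ directly from the confusion counts, assuming `y_true` and `y_pred` are 0/1 NumPy arrays:

```python
import numpy as np

def skewed_metrics(y_true, y_pred):
    """Precision, recall, and F1 from raw counts; robust to heavily skewed labels."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # anomalies correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))   # normal examples falsely flagged
    fn = np.sum((y_pred == 0) & (y_true == 1))   # anomalies that were missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```

For example, with 10 anomalies among 2,010 cross-validation examples, always predicting y = 0 scores about 99.5% accuracy yet has zero recall, which is exactly why these metrics are preferred over plain accuracy.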
Practical Evaluation
Basic approach (sketched in code after this list):
- Count how many anomalies detected correctly
- Count how many normal engines incorrectly flagged
- Use this information to choose good value for ε
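In code, those two counts are just boolean sums over the cross-validation set (a fragment assuming `p_cv`, `y_cv`, and a candidate `eps` from the earlier sketches):

```python
import numpy as np

flagged = p_cv < eps                               # examples the model calls anomalous
detected = int(np.sum(flagged & (y_cv == 1)))      # anomalies caught, out of the 10 known
false_alarms = int(np.sum(flagged & (y_cv == 0)))  # good engines incorrectly flagged
```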
The combination of unlabeled training data with a small set of labeled evaluation examples provides the best of both worlds: unsupervised learning from abundant normal data plus supervised evaluation for system optimization.