
Developing and Evaluating an Anomaly Detection System


When developing anomaly detection systems:

  • Need to choose features
  • Need to tune parameters (like ε)
  • Need to make decisions about algorithm modifications

Having a way to compute a number that indicates algorithm performance makes development much easier:

  • Quickly test changes to features or parameters
  • Determine if the algorithm got better or worse
  • Make faster development decisions

Even though anomaly detection is unsupervised, assume for evaluation purposes that we have:

  • A small number of previously observed anomalies
  • Labels: y = 1 (anomalous), y = 0 (normal)
  • An unlabeled training set: x⁽¹⁾ through x⁽ᵐ⁾
  • Assumption: All training examples are normal (y = 0)
  • Note: A few anomalous examples accidentally ending up in the training set is okay
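
As a minimal sketch of this data layout (the array names, shapes, and synthetic values below are illustrative assumptions, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2  # number of features (illustrative)

# Unlabeled training set x(1) ... x(m): assumed to be entirely normal (y = 0).
X_train = rng.normal(size=(1000, n))

# Labeled cross-validation and test sets: mostly normal examples (y = 0)
# plus a small number of known anomalies (y = 1).
X_cv = np.vstack([rng.normal(size=(500, n)),
                  rng.normal(loc=5.0, size=(5, n))])
y_cv = np.concatenate([np.zeros(500), np.ones(5)])

X_test = np.vstack([rng.normal(size=(500, n)),
                    rng.normal(loc=5.0, size=(5, n))])
y_test = np.concatenate([np.zeros(500), np.ones(5)])
```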

Create a cross-validation set and a test set that include:

  • Some normal examples (y = 0)
  • A few anomalous examples (y = 1)

Note: Both sets should have a mix of normal and anomalous examples for proper evaluation.

Example: a dataset of manufactured engines with:

  • 10,000 good/normal engines (y = 0)
  • 20 flawed/anomalous engines (y = 1)

A typical range is 2-50 known anomalies for this type of algorithm.

Training set: 6,000 good engines

  • Use for fitting the Gaussian distributions
  • If a couple of anomalous engines slip in, it’s okay

Cross-validation set: 2,000 good + 10 anomalous engines

  • Use for tuning the parameter ε
  • Use for evaluating detection performance

Test set: 2,000 good + 10 anomalous engines

  • Final evaluation after tuning

Development workflow:

  1. Train: Fit the Gaussian distributions on the 6,000 training examples (sketched below)
  2. Validate: Use the cross-validation set to:
    • Tune the ε parameter higher or lower
    • Add, remove, or modify features
    • Check how many of the 10 anomalies are detected
    • Monitor false positives on the 2,000 good engines
  3. Test: Run a final evaluation on the separate test set

Key points:

  • Primary learning comes from the unlabeled training set (all y = 0)
  • Labeled examples are used only for evaluation and parameter tuning
  • The core algorithm still learns by fitting Gaussian distributions as before
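
A minimal sketch of the training step, assuming the per-feature Gaussian model from earlier lectures; the helper names estimate_gaussian and p_of_x, and the synthetic stand-in for the 6,000 good engines, are illustrative:

```python
import numpy as np

def estimate_gaussian(X):
    """Fit a Gaussian to each feature of the (unlabeled) training set:
    mean mu_j and variance sigma2_j for every feature j."""
    return X.mean(axis=0), X.var(axis=0)

def p_of_x(X, mu, sigma2):
    """Model p(x) as the product of the per-feature Gaussian densities."""
    densities = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# Illustrative stand-in for the 6,000 good engines in the training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6000, 2))

mu, sigma2 = estimate_gaussian(X_train)
p_train = p_of_x(X_train, mu, sigma2)   # p(x) for each training example
```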

When you have very few anomalous examples (e.g., only 2 flawed engines):

Training set: 6,000 good engines
Cross-validation set: 4,000 good + all 20 anomalous engines (no separate test set)

Advantages:

  • Makes sense when data is extremely limited
  • Uses all anomalous examples for evaluation

Disadvantages:

  • No fair way to evaluate final performance
  • Higher risk of overfitting to cross-validation set
  • Performance on future data may not match expectations
Evaluation procedure:

  1. Fit the model: Learn p(x) from the training set
  2. Make predictions: For any example x, predict:
    • y = 1 (anomalous) if p(x) < ε
    • y = 0 (normal) if p(x) ≥ ε
  3. Compare: Match the predictions against the true labels in the cross-validation and test sets (see the sketch below)
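
Steps 2 and 3 in code, as a hedged sketch (the p_values and label arrays below are made-up stand-ins for p(x) and the true labels on the cross-validation set):

```python
import numpy as np

def predict(p_values, epsilon):
    """Predict y = 1 (anomalous) when p(x) < epsilon, else y = 0 (normal)."""
    return (p_values < epsilon).astype(int)

# Made-up p(x) values and true labels for five cross-validation examples.
p_values = np.array([0.12, 0.003, 0.08, 0.0001, 0.25])
y_cv     = np.array([0,    1,     0,    1,      0])

y_pred = predict(p_values, epsilon=0.01)
print(y_pred)                   # [0 1 0 1 0]
print((y_pred == y_cv).mean())  # fraction of CV examples predicted correctly
```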

Problem: There are far fewer anomalies (y = 1) than normal examples (y = 0)

  • Example: 10 anomalies vs 2,000 normal examples

Solution: Use evaluation metrics designed for skewed data:

  • True positive rate
  • False positive rate
  • False negative rate
  • Precision and recall
  • F₁ score

These metrics work better than simple classification accuracy for highly imbalanced datasets.
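
To see why accuracy is misleading here, consider a small sketch that computes precision, recall, and F₁ directly from the counts (the helper name and synthetic labels are assumptions):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the rare anomaly class (y = 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # anomalies correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))   # normal examples flagged by mistake
    fn = np.sum((y_pred == 0) & (y_true == 1))   # anomalies that were missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall    = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# 10 anomalies among 2,010 examples: predicting "normal" for everything is
# ~99.5% accurate, yet it catches zero anomalies.
y_true = np.concatenate([np.zeros(2000), np.ones(10)])
y_pred = np.zeros(2010)

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)   # 0.0 0.0 0.0
```

A classifier that never flags anything scores high accuracy on this data but an F₁ of zero, which is exactly the failure mode these metrics expose.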

Basic approach:

  • Count how many anomalies are detected correctly
  • Count how many normal engines are incorrectly flagged
  • Use this information to choose a good value for ε (see the sketch below)
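
Putting it together, a sketch of picking ε on the cross-validation set by scanning candidate values and keeping the one with the best F₁ score (scikit-learn's f1_score is used here for brevity; the p_cv and y_cv arrays are illustrative stand-ins for p(x) and the labels on the cross-validation set):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_cv, p_cv):
    """Try many candidate epsilons between min(p) and max(p) on the
    cross-validation set and keep the one with the highest F1 score."""
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(p_cv.min(), p_cv.max(), 1000):
        y_pred = (p_cv < epsilon).astype(int)
        f1 = f1_score(y_cv, y_pred, zero_division=0)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Illustrative stand-ins: 2,000 good engines with larger p(x),
# 10 anomalous engines with p(x) close to zero.
rng = np.random.default_rng(0)
p_cv = np.concatenate([rng.uniform(0.05, 0.30, size=2000),
                       rng.uniform(0.0, 0.01, size=10)])
y_cv = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

epsilon, best_f1 = select_threshold(y_cv, p_cv)
```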

The combination of unlabeled training data with a small set of labeled evaluation examples provides the best of both worlds: unsupervised learning from abundant normal data plus supervised evaluation for system optimization.