
Developing and Evaluating an Anomaly Detection System


When developing anomaly detection systems:

  • Need to choose features
  • Need to tune parameters (like ε)
  • Need to make decisions about algorithm modifications

Having a way to compute a number that indicates algorithm performance makes development much easier:

  • Quickly test changes to features or parameters
  • Determine if the algorithm got better or worse
  • Make faster development decisions

Even though anomaly detection is unsupervised, assume for evaluation purposes that we have:

  • A small number of previously observed anomalies
  • Labels: y = 1 (anomalous), y = 0 (normal)
  • An unlabeled training set: x⁽¹⁾ through x⁽ᵐ⁾
  • Assumption: All training examples are normal (y = 0)
  • Note: A few anomalous examples accidentally ending up in the training set is okay
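
As a minimal sketch of this data layout (the array names, shapes, and synthetic values below are illustrative assumptions, not part of the original example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2  # number of features (illustrative)

# Unlabeled training set x(1) ... x(m): assumed to be entirely normal (y = 0).
X_train = rng.normal(size=(1000, n))

# Labeled cross-validation and test sets: mostly normal examples (y = 0)
# plus a small number of known anomalies (y = 1).
X_cv = np.vstack([rng.normal(size=(500, n)),
                  rng.normal(loc=5.0, size=(5, n))])
y_cv = np.concatenate([np.zeros(500), np.ones(5)])

X_test = np.vstack([rng.normal(size=(500, n)),
                    rng.normal(loc=5.0, size=(5, n))])
y_test = np.concatenate([np.zeros(500), np.ones(5)])
```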

Create a cross-validation set and a test set that include:

  • Some normal examples (y = 0)
  • A few anomalous examples (y = 1)

Note: Both sets should have a mix of normal and anomalous examples for proper evaluation.

Example: a dataset of manufactured engines with:

  • 10,000 good/normal engines (y = 0)
  • 20 flawed/anomalous engines (y = 1)

A typical range is 2-50 known anomalies for this type of algorithm.

Training set: 6,000 good engines

  • Use for fitting the Gaussian distributions
  • If a couple of anomalous engines slip in, it’s okay

Cross-validation set: 2,000 good + 10 anomalous engines

  • Use for tuning the parameter ε
  • Use for evaluating detection performance

Test set: 2,000 good + 10 anomalous engines

  • Final evaluation after tuning

Development workflow:

  1. Train: Fit the Gaussian distributions on the 6,000 training examples (sketched below)
  2. Validate: Use the cross-validation set to:
    • Tune the ε parameter higher or lower
    • Add, remove, or modify features
    • Check how many of the 10 anomalies are detected
    • Monitor false positives on the 2,000 good engines
  3. Test: Run a final evaluation on the separate test set

Key points:

  • Primary learning comes from the unlabeled training set (all y = 0)
  • Labeled examples are used only for evaluation and parameter tuning
  • The core algorithm still learns by fitting Gaussian distributions as before
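
A minimal sketch of the training step, assuming the per-feature Gaussian model from earlier lectures; the helper names estimate_gaussian and p_of_x, and the synthetic stand-in for the 6,000 good engines, are illustrative:

```python
import numpy as np

def estimate_gaussian(X):
    """Fit a Gaussian to each feature of the (unlabeled) training set:
    mean mu_j and variance sigma2_j for every feature j."""
    return X.mean(axis=0), X.var(axis=0)

def p_of_x(X, mu, sigma2):
    """Model p(x) as the product of the per-feature Gaussian densities."""
    densities = np.exp(-((X - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod(axis=1)

# Illustrative stand-in for the 6,000 good engines in the training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(6000, 2))

mu, sigma2 = estimate_gaussian(X_train)
p_train = p_of_x(X_train, mu, sigma2)   # p(x) for each training example
```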

When you have very few anomalous examples (e.g., only 2 flawed engines):

Training set: 6,000 good engines
Cross-validation set: 4,000 good + all 20 anomalous engines (no separate test set)

Advantages:

  • Makes sense when data is extremely limited
  • Uses all anomalous examples for evaluation

Disadvantages:

  • No fair way to evaluate final performance
  • Higher risk of overfitting to cross-validation set
  • Performance on future data may not match expectations
Evaluation procedure:

  1. Fit the model: Learn p(x) from the training set
  2. Make predictions: For any example x, predict:
    • y = 1 (anomalous) if p(x) < ε
    • y = 0 (normal) if p(x) ≥ ε
  3. Compare: Match the predictions against the true labels in the cross-validation and test sets (see the sketch below)
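
Steps 2 and 3 in code, as a hedged sketch (the p_values and label arrays below are made-up stand-ins for p(x) and the true labels on the cross-validation set):

```python
import numpy as np

def predict(p_values, epsilon):
    """Predict y = 1 (anomalous) when p(x) < epsilon, else y = 0 (normal)."""
    return (p_values < epsilon).astype(int)

# Made-up p(x) values and true labels for five cross-validation examples.
p_values = np.array([0.12, 0.003, 0.08, 0.0001, 0.25])
y_cv     = np.array([0,    1,     0,    1,      0])

y_pred = predict(p_values, epsilon=0.01)
print(y_pred)                   # [0 1 0 1 0]
print((y_pred == y_cv).mean())  # fraction of CV examples predicted correctly
```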

Problem: There are far fewer anomalies (y = 1) than normal examples (y = 0)

  • Example: 10 anomalies vs 2,000 normal examples

Solution: Use evaluation metrics designed for skewed data:

  • True positive rate
  • False positive rate
  • False negative rate
  • Precision and recall
  • F₁ score

These metrics work better than simple classification accuracy for highly imbalanced datasets.
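
To see why accuracy is misleading here, consider a small sketch that computes precision, recall, and F₁ directly from the counts (the helper name and synthetic labels are assumptions):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the rare anomaly class (y = 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))   # anomalies correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))   # normal examples flagged by mistake
    fn = np.sum((y_pred == 0) & (y_true == 1))   # anomalies that were missed
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall    = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# 10 anomalies among 2,010 examples: predicting "normal" for everything is
# ~99.5% accurate, yet it catches zero anomalies.
y_true = np.concatenate([np.zeros(2000), np.ones(10)])
y_pred = np.zeros(2010)

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)   # 0.0 0.0 0.0
```

A classifier that never flags anything scores high accuracy on this data but an F₁ of zero, which is exactly the failure mode these metrics expose.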

Basic approach:

  • Count how many anomalies are detected correctly
  • Count how many normal engines are incorrectly flagged
  • Use this information to choose a good value for ε (see the sketch below)
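
Putting it together, a sketch of picking ε on the cross-validation set by scanning candidate values and keeping the one with the best F₁ score (scikit-learn's f1_score is used here for brevity; the p_cv and y_cv arrays are illustrative stand-ins for p(x) and the labels on the cross-validation set):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_cv, p_cv):
    """Try many candidate epsilons between min(p) and max(p) on the
    cross-validation set and keep the one with the highest F1 score."""
    best_epsilon, best_f1 = 0.0, 0.0
    for epsilon in np.linspace(p_cv.min(), p_cv.max(), 1000):
        y_pred = (p_cv < epsilon).astype(int)
        f1 = f1_score(y_cv, y_pred, zero_division=0)
        if f1 > best_f1:
            best_epsilon, best_f1 = epsilon, f1
    return best_epsilon, best_f1

# Illustrative stand-ins: 2,000 good engines with larger p(x),
# 10 anomalous engines with p(x) close to zero.
rng = np.random.default_rng(0)
p_cv = np.concatenate([rng.uniform(0.05, 0.30, size=2000),
                       rng.uniform(0.0, 0.01, size=10)])
y_cv = np.concatenate([np.zeros(2000, dtype=int), np.ones(10, dtype=int)])

epsilon, best_f1 = select_threshold(y_cv, p_cv)
```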

The combination of unlabeled training data with a small set of labeled evaluation examples provides the best of both worlds: unsupervised learning from abundant normal data plus supervised evaluation for system optimization.