Anomaly Detection Programming

Programming Assignment: Anomaly Detection

Assignment Overview

This exercise implements the anomaly detection algorithm and applies it to detect failing servers on a network.

Problem Statement

Server Monitoring Scenario

Dataset: Contains two features for each server
- Throughput (mb/s)
- Latency (ms) of response
Training data: m = 307 examples of server behavior
Assumption: Vast majority are “normal” examples
Goal: Use Gaussian model to detect anomalous server behavior

Dataset Structure

X_train: Used to fit Gaussian distribution
X_val and y_val: Cross-validation set for threshold selection
y_val labels: 1 = anomalous, 0 = normal

Key Exercises

Exercise 1: Estimate Gaussian Parameters

Objective: Complete estimate_gaussian function

Task: Calculate mean (μ) and variance (σ²) for each feature

Required Formulas:

μᵢ = (1/m) * Σ(j=1 to m) xᵢ⁽ʲ⁾

σᵢ² = (1/m) * Σ(j=1 to m) (xᵢ⁽ʲ⁾ - μᵢ)²

Implementation Approach:

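import numpy as np

def estimate_gaussian(X):
    """Return the per-feature mean and variance of X (shape m x n)."""
    m, n = X.shape

    # Vectorized implementation: reduce over the m examples (axis 0)
    mu = (1/m) * np.sum(X, axis=0)
    var = (1/m) * np.sum((X - mu) ** 2, axis=0)

    return mu, var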

Expected Output:

Mean: [14.11222578 14.99771051]
Variance: [1.83263141 1.70974533]
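
As a quick sanity check, the 1/m normalization in the formulas above is the same one NumPy uses by default in np.mean and np.var (ddof=0). The sketch below runs on a small made-up array, not the assignment data, and X_toy is a hypothetical name.

```python
import numpy as np

def estimate_gaussian(X):
    """Per-feature mean and (1/m) variance, following the formulas above."""
    m, n = X.shape
    mu = np.sum(X, axis=0) / m
    var = np.sum((X - mu) ** 2, axis=0) / m
    return mu, var

# Hypothetical toy data: 4 examples, 2 features (not the server dataset)
X_toy = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0],
                  [4.0, 8.0]])

mu, var = estimate_gaussian(X_toy)
print(mu)   # [2.5 5. ]
print(var)  # [1.25 5.  ]

# NumPy's built-ins give the same values (population variance, ddof=0)
same_mean = np.allclose(mu, X_toy.mean(axis=0))
same_var = np.allclose(var, X_toy.var(axis=0))
print(same_mean, same_var)
```

Note that np.var with ddof=1 would divide by m - 1 instead and would not reproduce the expected output listed above.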

Exercise 2: Select Threshold

Objective: Complete select_threshold function

Task: Find best threshold ε using F₁ score on cross-validation set

Key Metrics:

True positives (tp): Correctly identified anomalies
False positives (fp): Normal examples incorrectly flagged
False negatives (fn): Missed anomalies

Formulas:

Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F₁ = (2 * precision * recall) / (precision + recall)
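
To make the formulas concrete, here is a small worked example. The labels and predictions are invented for illustration and are unrelated to the assignment's cross-validation set.

```python
import numpy as np

# Hypothetical labels and predictions (1 = anomalous, 0 = normal)
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # 3 anomalies caught
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1 normal example flagged
fn = np.sum((y_pred == 0) & (y_true == 1))  # 2 anomalies missed

precision = tp / (tp + fp)                  # 3 / 4 = 0.75
recall = tp / (tp + fn)                     # 3 / 5 = 0.60
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f1, 4))
```

Here F₁ comes out at 2/3, between the precision and the recall and closer to the lower of the two, which is the behavior that makes it useful when anomalies are rare.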

Implementation Approach:

import numpy as np

def select_threshold(y_val, p_val):
    best_epsilon = 0
    best_F1 = 0

    step_size = (max(p_val) - min(p_val)) / 1000

    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        # Predictions: True if anomaly (p < epsilon), False if normal
        predictions = (p_val < epsilon)

        # Count outcomes against the ground-truth labels
        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # With no true positives, F1 is 0 or undefined (0/0);
        # skip to avoid a division-by-zero warning and a NaN score
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1

Expected Output:

Best epsilon: 8.99e-05
Best F₁ score: 0.875

Algorithm Process

Step 1: Data Visualization

Scatter plot shows 2D server data (throughput vs latency)
Visual inspection reveals general clustering pattern

Step 2: Gaussian Fitting

Estimate parameters μ and σ² for each feature
Create probability model p(x) using a multivariate Gaussian; with one variance per feature, this is the product of the per-feature Gaussian densities
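
A minimal sketch of the density computation, assuming the features are treated as independent so that p(x) factors into univariate Gaussians. The helper name gaussian_density and the sample values are hypothetical, not taken from the assignment.

```python
import numpy as np

def gaussian_density(X, mu, var):
    """p(x) as a product of per-feature univariate Gaussian densities.

    Hypothetical helper: equivalent to a multivariate Gaussian whose
    covariance matrix is diagonal with entries var.
    """
    coef = 1.0 / np.sqrt(2.0 * np.pi * var)
    expo = np.exp(-((X - mu) ** 2) / (2.0 * var))
    return np.prod(coef * expo, axis=1)

# Made-up parameters: zero mean, unit variance in both features
mu = np.array([0.0, 0.0])
var = np.array([1.0, 1.0])

# One point at the mean, one point 3 standard deviations out in each feature
X = np.array([[0.0, 0.0],
              [3.0, 3.0]])

p = gaussian_density(X, mu, var)
print(p)  # the second value is far smaller than the first
```

The point at the mean gets the peak density 1/(2π), while the distant point is smaller by a factor of exp(-9), which is the gap a threshold ε is meant to exploit.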

Step 3: Threshold Selection

Use cross-validation set to find optimal ε
Choose the ε that gives the highest F₁ score
Balance between detecting anomalies and avoiding false positives

Step 4: Anomaly Detection

Apply learned model to identify anomalous servers
Visualize results with contour plots showing probability regions
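
The four steps can be sketched end to end on synthetic data. Everything below is an illustrative assumption: the data is randomly generated rather than loaded from the assignment, and the threshold is hand-picked instead of coming from select_threshold. Plotting is left out to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the server data: 200 typical points plus one
# deliberately extreme point appended at index 200 (all values made up)
X = np.vstack([rng.normal(loc=[14.0, 15.0], scale=1.0, size=(200, 2)),
               [[30.0, 30.0]]])

# Fit the per-feature Gaussian parameters
mu = X.mean(axis=0)
var = X.var(axis=0)

# p(x) as the product of per-feature Gaussian densities
p = np.prod(np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var),
            axis=1)

# Hypothetical threshold; in the exercise it would come from select_threshold
epsilon = 1e-6

# Indices of the examples flagged as anomalous
outliers = np.flatnonzero(p < epsilon)
print(outliers)
```

Only the appended extreme point falls below the threshold here; on real data the flagged set depends heavily on how ε was chosen.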

High-Dimensional Extension

Realistic Dataset

Features: 11 dimensions (many server properties)
Process: Same algorithm, more complex feature space
Results:
- Best epsilon: 1.38e-18
- Best F₁: 0.615385
- Anomalies found: 117

Scalability

The algorithm carries over to many features without modification
The same Gaussian model and F₁-based threshold selection apply
Feature engineering becomes more important
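
One practical detail as the feature count grows: a product of many per-feature densities becomes very small (the best epsilon above is already on the order of 1e-18), and with enough features it can underflow to zero. A common workaround, which is not part of the assignment as described here, is to sum log-densities and compare against log(ε). The helper name and the 400-feature setup below are hypothetical.

```python
import numpy as np

def log_gaussian_density(X, mu, var):
    """Sum of per-feature Gaussian log-densities (hypothetical helper)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var)
                  - (X - mu) ** 2 / (2.0 * var), axis=1)

# Made-up setup: 400 features, zero mean, unit variance
n = 400
mu = np.zeros(n)
var = np.ones(n)

# A single point 3 standard deviations out in every feature
X = np.full((1, n), 3.0)

# Direct product of densities: underflows to exactly 0.0 in float64
p_direct = np.prod(np.exp(-(X - mu) ** 2 / (2.0 * var))
                   / np.sqrt(2.0 * np.pi * var), axis=1)

# Log-space version stays finite and can be compared against log(epsilon)
log_p = log_gaussian_density(X, mu, var)

print(p_direct[0], log_p[0])
```

Since the logarithm is monotonic, p(x) < ε and log p(x) < log ε flag the same examples whenever p(x) is representable.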

Key Learning Outcomes

Algorithm Implementation

Gaussian parameter estimation from data
Probability computation for anomaly scoring
Threshold selection using cross-validation

Evaluation Understanding

F₁ score for imbalanced datasets
Cross-validation for parameter tuning
Trade-offs between precision and recall

Practical Applications

Server monitoring and fault detection
Systematic approach to identifying unusual behavior
Real-world relevance in system administration

The exercise shows how a simple statistical model can flag unusual patterns in system behavior, a critical capability for maintaining reliable computer infrastructure.