Skip to content
Pablo Rodriguez

Anomaly Detection Programming

This exercise implements the anomaly detection algorithm and applies it to detect failing servers on a network.

  • Dataset: Contains two features for each server
    • Throughput (mb/s)
    • Latency (ms) of response
  • Training data: m = 307 examples of server behavior
  • Assumption: Vast majority are “normal” examples
  • Goal: Use Gaussian model to detect anomalous server behavior
  • X_train: Used to fit Gaussian distribution
  • X_val and y_val: Cross-validation set for threshold selection
  • y_val labels: 1 = anomalous, 0 = normal

Objective: Complete estimate_gaussian function

Task: Calculate mean (μ) and variance (σ²) for each feature

Required Formulas:

gaussian-parameters
μᵢ = (1/m) * Σ(j=1 to m) xᵢ⁽ʲ⁾
σᵢ² = (1/m) * Σ(j=1 to m) (xᵢ⁽ʲ⁾ - μᵢ)²

Implementation Approach:

estimate-gaussian-solution
def estimate_gaussian(X):
m, n = X.shape
# Vectorized implementation
mu = (1/m) * np.sum(X, axis=0)
var = (1/m) * np.sum((X - mu) ** 2, axis=0)
return mu, var

Expected Output:

  • Mean: [14.11222578 14.99771051]
  • Variance: [1.83263141 1.70974533]

Objective: Complete select_threshold function

Task: Find best threshold ε using F₁ score on cross-validation set

Key Metrics:

  • True positives (tp): Correctly identified anomalies
  • False positives (fp): Normal examples incorrectly flagged
  • False negatives (fn): Missed anomalies

Formulas:

evaluation-metrics
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F₁ = (2 * precision * recall) / (precision + recall)

Implementation Approach:

select-threshold-solution
def select_threshold(y_val, p_val):
best_epsilon = 0
best_F1 = 0
step_size = (max(p_val) - min(p_val)) / 1000
for epsilon in np.arange(min(p_val), max(p_val), step_size):
# Predictions: 1 if anomaly (p < epsilon), 0 if normal
predictions = (p_val < epsilon)
# Calculate metrics
tp = np.sum((predictions == 1) & (y_val == 1))
fp = np.sum((predictions == 1) & (y_val == 0))
fn = np.sum((predictions == 0) & (y_val == 1))
prec = tp / (tp + fp)
rec = tp / (tp + fn)
F1 = 2 * prec * rec / (prec + rec)
if F1 > best_F1:
best_F1 = F1
best_epsilon = epsilon
return best_epsilon, best_F1

Expected Output:

  • Best epsilon: 8.99e-05
  • Best F₁ score: 0.875
  • Scatter plot shows 2D server data (throughput vs latency)
  • Visual inspection reveals general clustering pattern
  • Estimate parameters μ and σ² for each feature
  • Create probability model p(x) using multivariate Gaussian
  • Use cross-validation set to find optimal ε
  • Optimize for best F₁ score performance
  • Balance between detecting anomalies and avoiding false positives
  • Apply learned model to identify anomalous servers
  • Visualize results with contour plots showing probability regions
  • Features: 11 dimensions (many server properties)
  • Process: Same algorithm, more complex feature space
  • Results:
    • Best epsilon: 1.38e-18
    • Best F₁: 0.615385
    • Anomalies found: 117
  • Algorithm works well with many features
  • Same mathematical principles apply
  • Feature engineering becomes more important
  • Gaussian parameter estimation from data
  • Probability computation for anomaly scoring
  • Threshold selection using cross-validation
  • F₁ score for imbalanced datasets
  • Cross-validation for parameter tuning
  • Trade-offs between precision and recall
  • Server monitoring and fault detection
  • Systematic approach to identifying unusual behavior
  • Real-world relevance in system administration

The exercise showcases the power of statistical modeling for identifying unusual patterns in system behavior, a critical capability for maintaining reliable computer infrastructure.