Anomaly Detection Programming
Programming Assignment: Anomaly Detection
Assignment Overview
This exercise implements the anomaly detection algorithm and applies it to detect failing servers on a network.
Problem Statement
Server Monitoring Scenario
- Dataset: Contains two features for each server
- Throughput (mb/s)
- Latency (ms) of response
- Training data: m = 307 examples of server behavior
- Assumption: Vast majority are “normal” examples
- Goal: Use Gaussian model to detect anomalous server behavior
Dataset Structure
- X_train: Used to fit the Gaussian distribution
- X_val and y_val: Cross-validation set for threshold selection
- y_val labels: 1 = anomalous, 0 = normal
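Before fitting anything, it can help to confirm the dataset layout described above. A minimal sanity-check sketch, assuming the notebook has already loaded X_train, X_val, and y_val as NumPy arrays:

```python
# Quick sanity check on the arrays described above (assumes they are
# already loaded as NumPy arrays by the notebook's data-loading cell)
print("X_train:", X_train.shape)  # expected (307, 2): 307 servers, 2 features
print("X_val:  ", X_val.shape)    # cross-validation features
print("y_val:  ", y_val.shape)    # labels: 1 = anomalous, 0 = normal
```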
Key Exercises
Exercise 1: Estimate Gaussian Parameters
Objective: Complete the estimate_gaussian function
Task: Calculate mean (μ) and variance (σ²) for each feature
Required Formulas:
μᵢ = (1/m) * Σ(j=1 to m) xᵢ⁽ʲ⁾
σᵢ² = (1/m) * Σ(j=1 to m) (xᵢ⁽ʲ⁾ - μᵢ)²
Implementation Approach:
```python
import numpy as np

def estimate_gaussian(X):
    """Estimate the mean and variance of each feature (column) of X."""
    m, n = X.shape
    # Vectorized implementation: one mean and one variance per feature
    mu = (1 / m) * np.sum(X, axis=0)
    var = (1 / m) * np.sum((X - mu) ** 2, axis=0)
    return mu, var
```
Expected Output:
- Mean: [14.11222578 14.99771051]
- Variance: [1.83263141 1.70974533]
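A usage sketch, assuming X_train has been loaded, that should reproduce the expected output above:

```python
# Fit the per-feature Gaussian parameters on the training set
mu, var = estimate_gaussian(X_train)
print("Mean of each feature:", mu)       # expected approx [14.112 14.998]
print("Variance of each feature:", var)  # expected approx [1.833 1.710]
```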
Exercise 2: Select Threshold
Objective: Complete the select_threshold function
Task: Find best threshold ε using F₁ score on cross-validation set
Key Metrics:
- True positives (tp): Correctly identified anomalies
- False positives (fp): Normal examples incorrectly flagged
- False negatives (fn): Missed anomalies
Formulas:
Precision = tp / (tp + fp)
Recall = tp / (tp + fn)
F₁ = (2 * Precision * Recall) / (Precision + Recall)
Implementation Approach:
```python
import numpy as np

def select_threshold(y_val, p_val):
    """Find the threshold epsilon with the best F1 score on the
    cross-validation set, given labels y_val and densities p_val."""
    best_epsilon = 0
    best_F1 = 0

    step_size = (max(p_val) - min(p_val)) / 1000
    for epsilon in np.arange(min(p_val), max(p_val), step_size):
        # Predictions: 1 if anomaly (p < epsilon), 0 if normal
        predictions = (p_val < epsilon)

        # Calculate metrics
        tp = np.sum((predictions == 1) & (y_val == 1))
        fp = np.sum((predictions == 1) & (y_val == 0))
        fn = np.sum((predictions == 0) & (y_val == 1))

        # Guard against division by zero when no anomalies are caught
        if tp == 0:
            continue

        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        F1 = 2 * prec * rec / (prec + rec)

        if F1 > best_F1:
            best_F1 = F1
            best_epsilon = epsilon

    return best_epsilon, best_F1
```
Expected Output:
- Best epsilon: 8.99e-05
- Best F₁ score: 0.875
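A usage sketch, assuming mu and var were fit on X_train and p_val holds the model's density p(x) evaluated on each cross-validation example (see the p(x) sketch under Step 2 below):

```python
# Pick the threshold on the labeled cross-validation set
epsilon, F1 = select_threshold(y_val, p_val)
print("Best epsilon:", epsilon)  # expected approx 8.99e-05
print("Best F1:", F1)            # expected approx 0.875
```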
Algorithm Process
Step 1: Data Visualization
- Scatter plot shows 2D server data (throughput vs latency); a plotting sketch follows this list
- Visual inspection reveals general clustering pattern
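A minimal plotting sketch for this step, assuming X_train is loaded; which column holds throughput and which holds latency is an assumption here, so swap the labels if the notebook orders the features differently:

```python
import matplotlib.pyplot as plt

# Scatter plot of the two server features (column order assumed)
plt.scatter(X_train[:, 0], X_train[:, 1], marker="x", c="b")
plt.xlabel("Throughput (mb/s)")
plt.ylabel("Latency (ms)")
plt.title("Server behavior in the training set")
plt.show()
```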
Step 2: Gaussian Fitting
- Estimate parameters μ and σ² for each feature
- Create a probability model p(x) using the multivariate Gaussian (a sketch follows below)
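The notebook supplies its own density function; the sketch below uses a hypothetical helper name (compute_probability) and models p(x) as a product of per-feature univariate Gaussians, which is equivalent to a multivariate Gaussian with a diagonal covariance matrix:

```python
import numpy as np

def compute_probability(X, mu, var):
    """Hypothetical helper: density of each row of X under independent
    per-feature Gaussians with means mu and variances var."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * var)          # per-feature normalizers
    exponent = -((X - mu) ** 2) / (2.0 * var)         # per-feature exponents
    return np.prod(coeff * np.exp(exponent), axis=1)  # multiply across features
```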
Step 3: Threshold Selection
- Use cross-validation set to find optimal ε
- Optimize for best F₁ score performance
- Balance between detecting anomalies and avoiding false positives
Step 4: Anomaly Detection
- Apply the learned model to identify anomalous servers (see the sketch after this list)
- Visualize results with contour plots showing probability regions
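A minimal sketch of this step, assuming mu, var, and epsilon come from the earlier steps and compute_probability is the hypothetical helper sketched under Step 2:

```python
import numpy as np

# Score every training example and flag the low-probability ones
p = compute_probability(X_train, mu, var)
outliers = p < epsilon  # boolean mask of anomalous servers
print("Anomalous servers flagged:", np.sum(outliers))
```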
High-Dimensional Extension
Realistic Dataset
- Features: 11 dimensions (many server properties)
- Process: Same algorithm, more complex feature space
- Results:
- Best epsilon: 1.38e-18
- Best F₁: 0.615385
- Anomalies found: 117
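The same pipeline applies unchanged to the 11-feature dataset. A sketch using hypothetical variable names for the high-dimensional arrays (the notebook's actual names may differ), with compute_probability as the helper sketched under Step 2:

```python
import numpy as np

# Same pipeline on the 11-feature dataset (variable names are assumptions)
mu_high, var_high = estimate_gaussian(X_train_high)
p_high = compute_probability(X_train_high, mu_high, var_high)
p_val_high = compute_probability(X_val_high, mu_high, var_high)
epsilon_high, F1_high = select_threshold(y_val_high, p_val_high)
print("Best epsilon:", epsilon_high)                      # expected approx 1.38e-18
print("Best F1:", F1_high)                                # expected approx 0.615
print("Anomalies found:", np.sum(p_high < epsilon_high))  # expected 117
```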
Scalability
- Algorithm works well with many features
- Same mathematical principles apply
- Feature engineering becomes more important (one common transform is sketched below)
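One common transform, discussed alongside anomaly detection in the course but not required by this assignment, is taking logs of skewed, non-negative features so they look more Gaussian. A purely illustrative sketch:

```python
import numpy as np

# Illustration only: a log transform to make a skewed, non-negative
# feature look more Gaussian before fitting the model
x_raw = X_train[:, 0]
x_transformed = np.log(x_raw + 1.0)  # +1 avoids log(0) for zero-valued features
```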
Key Learning Outcomes
Algorithm Implementation
- Gaussian parameter estimation from data
- Probability computation for anomaly scoring
- Threshold selection using cross-validation
Evaluation Understanding
- F₁ score for imbalanced datasets
- Cross-validation for parameter tuning
- Trade-offs between precision and recall
Practical Applications
- Server monitoring and fault detection
- Systematic approach to identifying unusual behavior
- Real-world relevance in system administration
The exercise showcases the power of statistical modeling for identifying unusual patterns in system behavior, a critical capability for maintaining reliable computer infrastructure.