Pablo Rodriguez

Anomaly Detection Algorithm

  • Training set: x⁽¹⁾ through x⁽ᵐ⁾
  • Each example x has n features
  • Each x⁽ⁱ⁾ is a vector with n numbers
  • Running example with two features: heat (x₁) and vibrations (x₂)
  • Each x⁽ⁱ⁾ is then a 2-dimensional vector
  • n = 2
  • For practical applications: n can be dozens or hundreds

Build model for p(x) where:

independence-assumption
p(x) = p(x₁) × p(x₂) × p(x₃) × ... × p(xₙ)
  • Assumes features x₁, x₂, …, xₙ are statistically independent
  • Algorithm often works fine even when features aren’t actually independent
  • Understanding statistical independence not required for effective use
product-notation
p(x) = ∏(j=1 to n) p(xⱼ)

Where ∏ symbol means “product” (like Σ means “sum”)

For each feature j:

  • μⱼ: Mean of feature j
  • σⱼ²: Variance of feature j

Mean estimation:

feature-mean
μⱼ = (1/m) * Σ(i=1 to m) xⱼ⁽ⁱ⁾

Variance estimation:

feature-variance
σⱼ² = (1/m) * Σ(i=1 to m) (xⱼ⁽ⁱ⁾ - μⱼ)²
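
As a minimal sketch of the two estimates above, the snippet below computes μⱼ and σⱼ² for a single feature in plain Python. The function name `estimate_gaussian` and the five readings are made up for illustration; they do not come from the notes.

```python
def estimate_gaussian(values):
    """Return (mu, sigma2) for one feature, dividing by m as in the formulas above."""
    m = len(values)
    mu = sum(values) / m                               # mean of feature j
    sigma2 = sum((v - mu) ** 2 for v in values) / m    # variance of feature j
    return mu, sigma2

# Hypothetical readings of feature j across m = 5 training examples.
x_j = [4.0, 5.5, 6.0, 3.5, 6.0]
mu_j, sigma2_j = estimate_gaussian(x_j)
print(mu_j, sigma2_j)
```

Note the division by m rather than m - 1, matching the variance formula above; Python's `statistics.pvariance` follows the same convention.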

For efficiency, all n means can be computed at once by averaging the example vectors directly:

vectorized-mean
μ = (1/m) * Σ(i=1 to m) x⁽ⁱ⁾

Where x⁽ⁱ⁾ and μ are both n-dimensional vectors, and the sum is taken element by element.
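
One way to picture the vectorized form: treat each x⁽ⁱ⁾ as a whole vector, add the vectors element by element, then divide by m. The sketch below does this in plain Python; `mean_vector` and the three example vectors are hypothetical.

```python
def mean_vector(X):
    """Return mu = (1/m) * sum of the rows of X, where each row is one example."""
    m = len(X)
    total = [0.0] * len(X[0])
    for x in X:
        # Element-wise vector addition: total <- total + x
        total = [t + v for t, v in zip(total, x)]
    return [t / m for t in total]

# Hypothetical training set with m = 3 examples and n = 2 features.
X = [[4.0, 2.0], [6.0, 3.0], [5.0, 4.0]]
mu = mean_vector(X)
print(mu)  # [5.0, 3.0]
```

With an array library such as NumPy, the same result would typically come from a single call like `X.mean(axis=0)`.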

Select features xⱼ that might be indicative of anomalous examples.

Estimate μ₁ through μₙ and σ₁² through σₙ² for n features using formulas above.

For new example x, compute:

probability-calculation
p(x) = ∏(j=1 to n) p(xⱼ; μⱼ, σⱼ²)

Substituting Gaussian formula:

full-probability
p(x) = ∏(j=1 to n) (1/(√(2π) σⱼ)) * e^(-(xⱼ-μⱼ)²/(2σⱼ²))
  • If p(x) < ε: Flag as anomaly
  • If p(x) ≥ ε: Consider normal
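
The scoring and thresholding steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the parameter values, and the threshold are all hypothetical. Each per-feature density uses the Gaussian normalizing constant 1/√(2πσⱼ²), which is the same as 1/(√(2π) σⱼ).

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p(x, mu, sigma2):
    """p(x) as the product of the per-feature densities."""
    return math.prod(gaussian_pdf(xj, mj, s2j) for xj, mj, s2j in zip(x, mu, sigma2))

def is_anomaly(x, mu, sigma2, epsilon):
    """Flag x when p(x) falls below the threshold epsilon."""
    return p(x, mu, sigma2) < epsilon

# Hypothetical fitted parameters and threshold (not from the notes).
mu = [10.0, 20.0]
sigma2 = [4.0, 9.0]
epsilon = 0.001

print(is_anomaly([10.0, 20.0], mu, sigma2, epsilon))  # False: x sits at the mean
print(is_anomaly([10.0, 35.0], mu, sigma2, epsilon))  # True: feature 2 is 5 sigma out
```

In the second call only one feature is unusual, yet the product still drops below ε.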

Algorithm flags example as anomalous if one or more features are either:

  • Very large relative to training set, OR
  • Very small relative to training set

Why this works:

  • A Gaussian distribution is fit to each feature xⱼ
  • If any feature xⱼ lies far out in a tail of its Gaussian → p(xⱼ) becomes very small
  • If any single term in product is very small → overall p(x) becomes small
  • Systematic way to quantify unusually large or small feature values

Example with two features:

  • Feature x₁: Large range of values
  • Feature x₂: Smaller range of values
  • μ₁ = 5, σ₁ = 2 (for x₁)
  • μ₂ = 3, σ₂ = 1 (for x₂)
  • Multiplying p(x₁) × p(x₂) creates 3D surface
  • Height at any point = product of the individual densities
  • Higher probability near center (μ₁, μ₂)
  • Lower probability toward edges
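
To make the surface concrete, the sketch below evaluates its height on a coarse grid using μ₁ = 5, σ₁ = 2, μ₂ = 3, σ₂ = 1 from the notes. The grid range and step are arbitrary choices, and the helper names are hypothetical.

```python
import math

MU = (5.0, 3.0)       # mu_1, mu_2 from the notes
SIGMA2 = (4.0, 1.0)   # variances: sigma_1 = 2 and sigma_2 = 1, squared

def density(v, mu, sigma2):
    """Univariate Gaussian density with mean mu and variance sigma2."""
    return math.exp(-((v - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def height(x1, x2):
    """Height of the surface at (x1, x2): product of the two densities."""
    return density(x1, MU[0], SIGMA2[0]) * density(x2, MU[1], SIGMA2[1])

# Coarse grid (arbitrary): x1 in 0..10 and x2 in 0..6, step 1.
grid = {(x1, x2): height(x1, x2) for x1 in range(11) for x2 in range(7)}
peak = max(grid, key=grid.get)

print(peak, round(grid[peak], 4))       # (5, 3) 0.0796
print(height(0, 0) < grid[peak])        # True: a corner is far lower than the center
```

The peak height equals 1/(2π σ₁ σ₂) = 1/(4π) ≈ 0.0796, so no point can have p(x) above that value with these parameters.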

Example 1 (x_test1):

  • p(x) ≈ 0.0426 (larger than ε = 0.02)
  • Decision: Normal, not anomalous

Example 2 (x_test2) (x₁ ≈ 8, x₂ ≈ 0.5):

  • p(x) ≈ 0.0021 (much smaller than ε = 0.02)
  • Decision: Flag as anomaly
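
The decision for Example 2 can be checked with a few lines of Python. This sketch assumes the parameters μ₁ = 5, σ₁ = 2, μ₂ = 3, σ₂ = 1 and takes the approximate coordinates x₁ ≈ 8, x₂ ≈ 0.5 at face value. Because those coordinates are only approximate, the computed value (about 0.0011) does not match the quoted 0.0021 exactly, but it leads to the same decision.

```python
import math

def p(x, mu, sigma2):
    """p(x) as a product of per-feature Gaussian densities (variances in sigma2)."""
    return math.prod(
        math.exp(-((xj - mj) ** 2) / (2 * s2j)) / math.sqrt(2 * math.pi * s2j)
        for xj, mj, s2j in zip(x, mu, sigma2)
    )

mu = (5.0, 3.0)
sigma2 = (4.0, 1.0)     # sigma_1 = 2 and sigma_2 = 1, squared
epsilon = 0.02

x_test2 = (8.0, 0.5)    # approximate coordinates, as given in the notes
p2 = p(x_test2, mu, sigma2)

print(p2 < epsilon)     # True: flag as anomaly
```

Here x₁ is 1.5 standard deviations above its mean and x₂ is 2.5 standard deviations below its mean; the second factor is what pulls the product under ε.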

In short, the algorithm models normal behavior with one Gaussian per feature and flags any example whose p(x) falls below the threshold ε.