Pablo Rodriguez

Iterative ML Development

Developing a machine learning model is an iterative process that rarely works perfectly on the first attempt. The systematic approach involves multiple cycles of improvement guided by diagnostics and evaluation.

  1. Choose Overall Architecture

    • Select machine learning model type
    • Decide what data to use
    • Pick hyperparameters
    • Define system components
  2. Implement and Train Model

    • Build the initial implementation
    • Train using chosen architecture
    • Expect suboptimal initial performance
  3. Run Diagnostics

    • Analyze bias and variance
    • Perform error analysis
    • Evaluate model performance
  4. Make Informed Decisions

    • Increase neural network size
    • Adjust regularization parameter (λ)
    • Add or remove data/features
    • Modify architecture based on insights
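
The four steps above can be sketched as a control loop. Everything here is a toy illustration: `evaluate` is a hypothetical placeholder for a real train-and-diagnose pipeline, and the threshold logic is a drastic simplification of bias/variance analysis.

```python
# Toy sketch of the four-step development loop described above.
# `evaluate` is a stand-in for a real train-and-diagnose pipeline;
# here it is faked so the control flow of the loop is visible end to end.

def development_loop(evaluate, max_rounds=10, target_cv_error=0.10):
    config = {"hidden_units": 4, "reg_lambda": 1.0}  # step 1: choose architecture
    history = []
    for _ in range(max_rounds):
        train_err, cv_err = evaluate(config)          # steps 2-3: train and diagnose
        history.append((dict(config), train_err, cv_err))
        if cv_err <= target_cv_error:                 # good enough: stop iterating
            break
        if train_err > target_cv_error:               # high bias: underfitting
            config["hidden_units"] *= 2               # try a bigger network
        else:                                         # high variance: overfitting
            config["reg_lambda"] *= 2                 # try more regularization
    return config, history

# Fake diagnostics: errors shrink as the network grows (illustration only).
fake_eval = lambda cfg: (1.0 / cfg["hidden_units"], 1.5 / cfg["hidden_units"])
final_config, history = development_loop(fake_eval)
```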

Example: Spam Classifier

Objective: Build a classifier to distinguish spam from legitimate emails

Input Features:

  • Top 10,000 words from English dictionary
  • Feature vector x₁, x₂, …, x₁₀,₀₀₀
  • Binary encoding (1 if word appears, 0 otherwise)

Given email: “Hi Andrew, buy this great deal discount…”

Word Features

  • a: 0 (doesn’t appear)
  • Andrew: 1 (appears)
  • buy: 1 (appears)
  • deal: 1 (appears)
  • discount: 1 (appears)
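
A minimal sketch of this binary encoding, with a five-word toy vocabulary standing in for the top-10,000-word dictionary:

```python
# Sketch of the binary word-feature encoding described above.
# The tiny vocabulary is a stand-in for the full 10,000-word dictionary.

def binary_features(email_text, vocabulary):
    """Return x_j = 1 if vocabulary word j appears in the email, else 0."""
    words = {w.strip(".,!?").lower() for w in email_text.split()}
    return [1 if v in words else 0 for v in vocabulary]

vocab = ["a", "andrew", "buy", "deal", "discount"]
x = binary_features("Hi Andrew, buy this great deal discount", vocab)
# x == [0, 1, 1, 1, 1]
```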

Alternative Approach

Count the frequency of each word's occurrences instead of its binary presence; in practice, the simpler binary encoding works well.
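
The frequency-count variant is a small change from the binary encoding; a sketch with a toy vocabulary:

```python
from collections import Counter

# Frequency-count variant of the word features: x_j counts occurrences
# rather than recording mere presence.

def count_features(email_text, vocabulary):
    """Return x_j = number of times vocabulary word j occurs in the email."""
    counts = Counter(w.strip(".,!?").lower() for w in email_text.split())
    return [counts[v] for v in vocabulary]

x = count_features("Buy now, buy this deal", ["a", "buy", "deal", "discount"])
# x == [0, 2, 1, 0]
```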

When initial model performance is insufficient, multiple approaches seem tempting:

  • Honeypot projects: Create fake email addresses to attract spam and gather more training data
  • Email routing features: Analyze the path an email took through servers (from its headers) for spam indicators
  • Enhanced body text features: Better handling of misspellings and word variants
  • Unified word treatment: Treat “discounting” and “discount” as the same word
  • Misspelling detection: Identify deliberate misspellings like “w4tches”, “med1cine”, “m0rtgage”

Diagnostics indicate which of these is worth the effort:

  • High bias algorithm: Spending months on honeypot data collection is unlikely to help
  • High variance algorithm: Collecting more data could provide significant improvement
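
The word-variant and misspelling ideas can be illustrated with a crude normalizer. This is only a toy: the digit map and suffix list are invented for illustration, and a real system would use a proper stemmer.

```python
# Toy normalizer illustrating two ideas from the list above: mapping common
# digit-for-letter substitutions back ("w4tches" -> "watches") and stripping
# a few suffixes so "discounting" and "discount" collapse to one feature.
# The digit map and suffix list are illustrative assumptions, not standard.

LEET = str.maketrans("014", "oia")  # 0 -> o, 1 -> i, 4 -> a

def normalize(word):
    word = word.lower().translate(LEET)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 3:
            return word[: -len(suffix)]
    return word

normalize("discounting") == normalize("discount")  # both map to "discount"
```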

Rather than randomly trying improvements:

  1. Run diagnostics to understand current limitations
  2. Analyze results to identify most promising directions
  3. Choose techniques based on evidence rather than intuition
  4. Iterate systematically through the development loop
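
As a sketch of how step 3 feeds step 4, a diagnostic can be reduced to comparing training and cross-validation error against a baseline. The 0.05 gap threshold is an illustrative assumption, not a fixed rule:

```python
# Minimal sketch of turning bias/variance diagnostics into a decision.
# The gap threshold is an illustrative assumption, not a fixed rule.

def suggest_next_step(train_error, cv_error, baseline_error=0.0, gap=0.05):
    if train_error - baseline_error > gap:   # high bias: underfits even the training set
        return "high bias: bigger network, more features, or smaller lambda"
    if cv_error - train_error > gap:         # high variance: fails to generalize
        return "high variance: more data, fewer features, or larger lambda"
    return "acceptable: stop, or refine via error analysis"

suggest_next_step(train_error=0.02, cv_error=0.15)
# -> the high-variance suggestion: collecting more data is likely to help
```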

The key insight is that proper diagnostics (bias/variance analysis, error analysis) provide crucial guidance for architectural choices, preventing wasted effort on low-impact improvements.

Multiple iterations through this loop, guided by systematic evaluation, lead to models that achieve desired performance levels.