Pablo Rodriguez

Precision-Recall Trade-off

Ideal scenario: High precision AND high recall.

Reality: There is often a trade-off between precision and recall that requires careful consideration.

Precision: True Positives / (True Positives + False Positives)

Recall: True Positives / (True Positives + False Negatives)

Standard approach:

  • Predict y=1 if f(x) ≥ 0.5
  • Predict y=0 if f(x) < 0.5
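The definitions above can be sketched in a few lines of Python (a minimal illustration; the function names and example counts are invented for this sketch):

```python
# Standard 0.5-threshold decision rule, plus precision/recall from raw counts.

def predict(score: float, threshold: float = 0.5) -> int:
    """Predict y=1 when the model's output f(x) meets the threshold."""
    return 1 if score >= threshold else 0

def precision(tp: int, fp: int) -> float:
    """True Positives / (True Positives + False Positives)."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """True Positives / (True Positives + False Negatives)."""
    return tp / (tp + fn)

# Toy counts: 15 true positives, 5 false positives, 10 false negatives
print(predict(0.6))       # 1  (0.6 >= 0.5)
print(precision(15, 5))   # 0.75
print(recall(15, 10))     # 0.6
```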

High Confidence Predictions (Higher Precision)


Scenario: Only predict disease if very confident

  • Threshold: f(x) ≥ 0.7 (instead of 0.5)
  • Philosophy: Avoid unnecessary invasive/expensive treatments
  • Use case: When disease consequences are manageable if untreated

Results:

  • Higher precision: When you predict disease, more likely to be correct
  • Lower recall: Identify fewer of the total disease cases

Extreme example: f(x) ≥ 0.9

  • Very high precision: Almost always right when predicting disease
  • Very low recall: Miss many actual disease cases
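The effect of raising the threshold can be seen on a small set of scored examples (the scores and labels below are invented for illustration):

```python
# Raising the threshold trades recall for precision on the same scored data.

def prec_recall(scores, labels, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3]
labels = [1,    1,   0,   1,   0,   1,    1,   0]

# Lower threshold: more positives predicted -> higher recall, lower precision
print(prec_recall(scores, labels, 0.5))
# Higher threshold: fewer, more confident positives -> higher precision, lower recall
print(prec_recall(scores, labels, 0.9))
```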

High Sensitivity Predictions (Higher Recall)


Scenario: Avoid missing disease cases (“when in doubt, predict y=1”)

  • Threshold: f(x) ≥ 0.3 (instead of 0.5)
  • Philosophy: Better safe than sorry for serious diseases
  • Use case: When untreated disease has severe consequences

Results:

  • Lower precision: More false alarms, but fewer missed cases
  • Higher recall: Catch more of the actual disease cases

Pushing the threshold to extremes makes the trade-off stark. High threshold (0.99):

  • Very high precision, low recall
  • Few predictions, but very confident when made

Low threshold (0.01):

  • Low precision, high recall
  • Many predictions, catch most cases but many false alarms

High Threshold

Threshold = 0.9

  • High Precision
  • Low Recall
  • Conservative predictions

Low Threshold

Threshold = 0.1

  • Low Precision
  • High Recall
  • Liberal predictions

Example results:

  • Algorithm 1: P=0.5, R=0.4
  • Algorithm 2: P=0.7, R=0.1
  • Algorithm 3: P=0.3, R=0.7

Problem: No single algorithm is clearly best on both metrics

Average approach: (Precision + Recall) / 2

  • Algorithm 1: (0.5 + 0.4) / 2 = 0.45
  • Algorithm 2: (0.7 + 0.1) / 2 = 0.4
  • Algorithm 3: (0.3 + 0.7) / 2 = 0.5

F1 Score: Emphasizes whichever value (precision or recall) is lower

Formula:

F1 = 1 / (1/2 * (1/P + 1/R))
= 2PR / (P + R)

  • Algorithm 1: F1 = 2(0.5)(0.4) / (0.5 + 0.4) = 0.4 / 0.9 ≈ 0.444
  • Algorithm 2: F1 = 2(0.7)(0.1) / (0.7 + 0.1) = 0.14 / 0.8 = 0.175
  • Algorithm 3: F1 = 2(0.3)(0.7) / (0.3 + 0.7) = 0.42 / 1.0 = 0.42
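A quick way to check the arithmetic is to code the F1 formula directly (a small sketch; the dictionary layout is just for illustration):

```python
# F1 = 2PR / (P + R): the harmonic mean of precision and recall.

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

algorithms = {"Algorithm 1": (0.5, 0.4),
              "Algorithm 2": (0.7, 0.1),
              "Algorithm 3": (0.3, 0.7)}

for name, (p, r) in algorithms.items():
    print(f"{name}: F1 = {f1(p, r):.3f}")
# Algorithm 1: F1 = 0.444
# Algorithm 2: F1 = 0.175
# Algorithm 3: F1 = 0.420
```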

Better Metric

Key insight: F1 score heavily penalizes algorithms with very low precision OR very low recall

Results interpretation:

  • Algorithm 1: Best overall balance (F1 ≈ 0.444)
  • Algorithm 2: Good precision but very low recall (F1 = 0.175)
  • Algorithm 3: Good recall but lower precision (F1 = 0.42)

Common approach: Plot precision-recall curve and manually select threshold balancing:

  • Cost of false positives vs false negatives
  • Medical/business consequences of each error type
  • Available resources for follow-up procedures
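Tracing a precision-recall curve amounts to sweeping the threshold and recording (precision, recall) at each point. A minimal pure-Python sketch (the scores, labels, and thresholds below are invented; libraries such as scikit-learn provide a ready-made `precision_recall_curve`):

```python
# Sweep the decision threshold and record precision/recall at each value.

def curve_points(scores, labels, thresholds):
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 1.0   # no positive predictions
        rec = tp / (tp + fn) if tp + fn else 0.0
        points.append((t, round(prec, 2), round(rec, 2)))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   1,   0,   0]
for t, p, r in curve_points(scores, labels, [0.1, 0.5, 0.85]):
    print(f"threshold={t}: precision={p}, recall={r}")
```

Plotting these points (recall on the x-axis, precision on the y-axis) gives the curve from which a threshold can be chosen by eye against the costs listed above.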

When automated selection needed: Use F1 score to pick best algorithm or threshold

  • Harmonic mean: An average that is pulled toward the smaller of the two values
  • Practical benefit: Identifies algorithms with good balance rather than extreme trade-offs
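The difference between the two averages is easy to see numerically (a small sketch using Algorithm 2's values from above):

```python
# Harmonic vs. arithmetic mean: the harmonic mean is dominated by the
# smaller value, which is why F1 punishes extreme precision/recall trade-offs.

def arithmetic_mean(p: float, r: float) -> float:
    return (p + r) / 2

def harmonic_mean(p: float, r: float) -> float:
    return 2 / (1 / p + 1 / r)   # algebraically identical to 2PR / (P + R)

p, r = 0.7, 0.1
print(round(arithmetic_mean(p, r), 3))  # 0.4   -- looks respectable
print(round(harmonic_mean(p, r), 3))    # 0.175 -- dominated by the low recall
```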

Conservative Applications (High Precision Priority)

  • Expensive follow-up procedures
  • Low disease severity if untreated
  • Patient anxiety from false positives

Aggressive Applications (High Recall Priority)

  • Serious consequences if disease missed
  • Relatively inexpensive/non-invasive treatments
  • Early intervention critical for outcomes

The precision-recall trade-off requires understanding your specific application context, but F1 score provides a useful automated way to identify algorithms with good overall balance between the two metrics.