Ideal scenario : High precision AND high recall
Reality : Often there’s a trade-off between precision and recall that requires careful consideration.
Precision : True Positives / (True Positives + False Positives)
Recall : True Positives / (True Positives + False Negatives)
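To make the formulas concrete, here is a minimal Python sketch computing both metrics from raw confusion counts (the example counts are hypothetical):

```python
def precision(tp: int, fp: int) -> float:
    # Of all positive predictions, the fraction that were actually positive
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all actual positives, the fraction the model caught
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 15 true positives, 5 false positives, 10 false negatives
print(precision(15, 5))   # 0.75
print(recall(15, 10))     # 0.6
```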
Standard approach : Predict y=1 if f(x) ≥ 0.5
Standard approach : Predict y=0 if f(x) < 0.5
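A minimal sketch of the thresholding step, assuming f(x) returns a probability-like score in [0, 1] (the scores below are made up for illustration):

```python
def predict(score: float, threshold: float = 0.5) -> int:
    # Predict y=1 (disease) when the model's score clears the threshold
    return 1 if score >= threshold else 0

scores = [0.2, 0.55, 0.8]
print([predict(s) for s in scores])        # [0, 1, 1] with the standard 0.5 cutoff
print([predict(s, 0.7) for s in scores])   # [0, 0, 1] with a stricter cutoff
```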
Scenario : Only predict disease if very confident
Threshold : f(x) ≥ 0.7 (instead of 0.5)
Philosophy : Avoid unnecessary invasive/expensive treatments
Use case : When disease consequences are manageable if untreated
Results :
Higher precision : When you predict disease, the prediction is more likely to be correct
Lower recall : You identify fewer of the total disease cases
Extreme example : f(x) ≥ 0.9
Very high precision : Almost always right when predicting disease
Very low recall : Miss many actual disease cases
Scenario : Avoid missing disease cases (“when in doubt, predict y=1”)
Threshold : f(x) ≥ 0.3 (instead of 0.5)
Philosophy : Better safe than sorry for serious diseases
Use case : When untreated disease has severe consequences
Results :
Lower precision : More false alarms, but fewer missed cases
Higher recall : Catch more of the actual disease cases
High threshold (0.99) :
Very high precision, low recall
Few predictions, but very confident when made
Low threshold (0.01) :
Low precision, high recall
Many predictions, catch most cases but many false alarms
Summary :
High threshold (e.g., 0.9) : High precision, low recall (conservative predictions)
Low threshold (e.g., 0.1) : Low precision, high recall (liberal predictions)
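To see this trade-off end to end, here is a minimal sketch that sweeps the threshold over a toy set of labels and scores (both made up for illustration) and reports precision and recall at each cutoff:

```python
def precision_recall_at(y_true, scores, threshold):
    # Threshold the scores, then count confusion entries by hand
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return prec, rec

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
for t in (0.1, 0.5, 0.9):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold from 0.1 to 0.9 moves precision from 0.50 to 1.00 while recall drops from 1.00 to 0.25, matching the summary above.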
Example results :
Algorithm 1 : P=0.5, R=0.4
Algorithm 2 : P=0.7, R=0.1
Algorithm 3 : P=0.3, R=0.7
Problem : No single algorithm is clearly best on both metrics
Average approach : (Precision + Recall) / 2
Algorithm 1: (0.5 + 0.4) / 2 = 0.45
Algorithm 2: (0.7 + 0.1) / 2 = 0.4
Algorithm 3: (0.3 + 0.7) / 2 = 0.5
Problem : The simple average ranks Algorithm 3 highest even though its precision is the weakest, which motivates a metric that penalizes imbalance
F1 Score : Emphasizes whichever value (precision or recall) is lower
Formula :
F1 = 1 / ((1/2)(1/P + 1/R)) = 2PR / (P + R)
Algorithm 1 : F1 = 2(0.5)(0.4) / (0.5 + 0.4) = 0.4 / 0.9 = 0.444
Algorithm 2 : F1 = 2(0.7)(0.1) / (0.7 + 0.1) = 0.14 / 0.8 = 0.175
Algorithm 3 : F1 = 2(0.3)(0.7) / (0.3 + 0.7) = 0.42 / 1.0 = 0.420
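A quick sanity check of the arithmetic above, comparing the simple average against F1 for all three algorithms (the numbers are the example values from this section):

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.3, 0.7)]:
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1(p, r):.3f}")
```

F1 stays close to the simple average only when precision and recall are balanced; Algorithm 2's score collapses because its recall is so low.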
Better Metric
Key insight : F1 score heavily penalizes algorithms with very low precision OR very low recall
Results interpretation :
Algorithm 1 : Best overall balance (F1 = 0.444)
Algorithm 2 : Good precision but terrible recall (F1 = 0.175)
Algorithm 3 : Good recall but weak precision (F1 = 0.420), ranked below Algorithm 1 despite having the best simple average
Common approach : Plot the precision-recall curve and manually select a threshold that balances:
Cost of false positives vs false negatives
Medical/business consequences of each error type
Available resources for follow-up procedures
When automated selection is needed : Use the F1 score to pick the best algorithm or threshold (see the sketch after this list)
Harmonic mean : Mathematical average that emphasizes smaller values
Practical benefit : Identifies algorithms with good balance rather than extreme trade-offs
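If scikit-learn is available, its precision_recall_curve makes this automated selection straightforward; the toy labels and scores below are illustrative, and this is a sketch rather than a complete pipeline:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"best threshold={thresholds[best]:.2f}, F1={f1[best]:.3f}")
```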
Favor a higher threshold (precision) when :
Expensive follow-up procedures
Low disease severity if untreated
Patient anxiety from false positives
Favor a lower threshold (recall) when :
Serious consequences if disease missed
Relatively inexpensive/non-invasive treatments
Early intervention critical for outcomes
The precision-recall trade-off requires understanding your specific application context, but F1 score provides a useful automated way to identify algorithms with good overall balance between the two metrics.