Ideal scenario : High precision AND high recall
Reality : Often there’s a trade-off between precision and recall that requires careful consideration.
Precision : True Positives / (True Positives + False Positives)
Recall : True Positives / (True Positives + False Negatives)
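To make the formulas concrete, here is a minimal Python sketch computing both metrics from raw confusion counts (the example counts are hypothetical):

```python
def precision(tp: int, fp: int) -> float:
    # Of all positive predictions, the fraction that were actually positive
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp: int, fn: int) -> float:
    # Of all actual positives, the fraction the model caught
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 15 true positives, 5 false positives, 10 false negatives
print(precision(15, 5))   # 0.75
print(recall(15, 10))     # 0.6
```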
Standard approach : Predict y=1 if f(x) ≥ 0.5
Standard approach : Predict y=0 if f(x) < 0.5
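A minimal sketch of the thresholding step, assuming f(x) returns a probability-like score in [0, 1] (the scores below are made up for illustration):

```python
def predict(score: float, threshold: float = 0.5) -> int:
    # Predict y=1 (disease) when the model's score clears the threshold
    return 1 if score >= threshold else 0

scores = [0.2, 0.55, 0.8]
print([predict(s) for s in scores])        # [0, 1, 1] with the standard 0.5 cutoff
print([predict(s, 0.7) for s in scores])   # [0, 0, 1] with a stricter cutoff
```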
Scenario : Only predict disease if very confident
Threshold : f(x) ≥ 0.7 (instead of 0.5)
Philosophy : Avoid unnecessary invasive/expensive treatments
Use case : When disease consequences are manageable if untreated
Results :
Higher precision : When you predict disease, the prediction is more likely to be correct
Lower recall : You identify fewer of the total disease cases
Extreme example : f(x) ≥ 0.9
Very high precision : Almost always right when predicting disease
Very low recall : Miss many actual disease cases
Scenario : Avoid missing disease cases (“when in doubt, predict y=1”)
Threshold : f(x) ≥ 0.3 (instead of 0.5)
Philosophy : Better safe than sorry for serious diseases
Use case : When untreated disease has severe consequences
Results :
Lower precision : More false alarms, but fewer missed cases
Higher recall : Catch more of the actual disease cases
High threshold (0.99) :
Very high precision, low recall
Few predictions, but very confident when made
Low threshold (0.01) :
Low precision, high recall
Many predictions, catch most cases but many false alarms
Summary :
High threshold (e.g., 0.9) : High precision, low recall (conservative predictions)
Low threshold (e.g., 0.1) : Low precision, high recall (liberal predictions)
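To see this trade-off end to end, here is a minimal sketch that sweeps the threshold over a toy set of labels and scores (both made up for illustration) and reports precision and recall at each cutoff:

```python
def precision_recall_at(y_true, scores, threshold):
    # Threshold the scores, then count confusion entries by hand
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return prec, rec

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
for t in (0.1, 0.5, 0.9):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold from 0.1 to 0.9 moves precision from 0.50 to 1.00 while recall drops from 1.00 to 0.25, matching the summary above.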
Example results :
Algorithm 1 : P=0.5, R=0.4
Algorithm 2 : P=0.7, R=0.1
Algorithm 3 : P=0.3, R=0.7
Problem : No single algorithm is clearly best on both metrics
Average approach : (Precision + Recall) / 2
Algorithm 1: (0.5 + 0.4) / 2 = 0.45
Algorithm 2: (0.7 + 0.1) / 2 = 0.4
Algorithm 3: (0.3 + 0.7) / 2 = 0.5
Problem : The simple average ranks Algorithm 3 highest even though its precision is the weakest, which motivates a metric that penalizes imbalance
F1 Score : Emphasizes whichever value (precision or recall) is lower
Formula :
F1 = 1 / ((1/2)(1/P + 1/R)) = 2PR / (P + R)
Algorithm 1 : F1 = 2(0.5)(0.4) / (0.5 + 0.4) = 0.4 / 0.9 = 0.444
Algorithm 2 : F1 = 2(0.7)(0.1) / (0.7 + 0.1) = 0.14 / 0.8 = 0.175
Algorithm 3 : F1 = 2(0.3)(0.7) / (0.3 + 0.7) = 0.42 / 1.0 = 0.420
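A quick sanity check of the arithmetic above, comparing the simple average against F1 for all three algorithms (the numbers are the example values from this section):

```python
def f1(p: float, r: float) -> float:
    # Harmonic mean of precision and recall; defined as 0 when both are 0
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.3, 0.7)]:
    print(f"{name}: average={(p + r) / 2:.3f}, F1={f1(p, r):.3f}")
```

F1 stays close to the simple average only when precision and recall are balanced; Algorithm 2's score collapses because its recall is so low.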
Better Metric
Key insight : F1 score heavily penalizes algorithms with very low precision OR very low recall
Results interpretation :
Algorithm 1 : Best overall balance (F1 = 0.444)
Algorithm 2 : Good precision but terrible recall (F1 = 0.175)
Algorithm 3 : Good recall but weak precision (F1 = 0.420), ranked below Algorithm 1 despite having the best simple average
Common approach : Plot the precision-recall curve and manually select a threshold that balances:
Cost of false positives vs false negatives
Medical/business consequences of each error type
Available resources for follow-up procedures
When automated selection is needed : Use the F1 score to pick the best algorithm or threshold (see the sketch after this list)
Harmonic mean : Mathematical average that emphasizes smaller values
Practical benefit : Identifies algorithms with good balance rather than extreme trade-offs
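If scikit-learn is available, its precision_recall_curve makes this automated selection straightforward; the toy labels and scores below are illustrative, and this is a sketch rather than a complete pipeline:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds; drop the final point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print(f"best threshold={thresholds[best]:.2f}, F1={f1[best]:.3f}")
```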
Favor a higher threshold (precision) when :
Expensive follow-up procedures
Low disease severity if untreated
Patient anxiety from false positives
Favor a lower threshold (recall) when :
Serious consequences if disease missed
Relatively inexpensive/non-invasive treatments
Early intervention critical for outcomes
The precision-recall trade-off requires understanding your specific application context, but F1 score provides a useful automated way to identify algorithms with good overall balance between the two metrics.