Evaluation Metrics Deep Dive¶
What This Is¶
Metric choice decides what "better" means. This page is about one decision:
- which metric should control model choice for this task
A model can win on accuracy and fail the real job. A model can rank well and still have useless probabilities. The metric has to match the consequence of the mistake.
When You Use It¶
- choosing the primary model-selection metric
- comparing models that win on different numbers
- explaining results to a team with a specific operational cost
- deciding whether ranking, thresholding, or probability quality matters most
Start With The Decision Type¶
Choose the metric by what the score will actually control:
| Need | Better first metric | Why |
|---|---|---|
| balanced classes, equal costs | accuracy | simple, but only when the class mix is honest |
| rare positive class | average precision or balanced accuracy | accuracy hides minority-class failure |
| ranking before threshold choice | ROC AUC or average precision | the score orders cases rather than making one hard call |
| hard threshold policy | precision, recall, or cost at the chosen cutoff | the threshold is the decision |
| trustworthy probabilities | log loss or Brier score | probability quality matters, not just ranking |
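To make the table concrete, here is a hedged sketch with invented toy numbers: ten cases, one positive, scored by several of the metrics above so the differences are visible side by side.

```python
# Toy data, for illustration only: nine negatives, one rare positive.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    balanced_accuracy_score,
    brier_score_loss,
    roc_auc_score,
)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4])
y_pred = (y_prob >= 0.5).astype(int)              # hard calls at a 0.5 cutoff

acc = accuracy_score(y_true, y_pred)              # 0.80: looks respectable
bal_acc = balanced_accuracy_score(y_true, y_pred) # ~0.44: the positive was missed
auc = roc_auc_score(y_true, y_prob)               # ~0.89: ranking is decent
ap = average_precision_score(y_true, y_prob)      # 0.50: one negative outranks it
brier = brier_score_loss(y_true, y_prob)          # probability quality, not ranking
```

Accuracy looks fine here while balanced accuracy exposes that the one positive case was never flagged, which is exactly the gap the table is pointing at.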
Quick Rule¶
Ask three questions:
- is the class balance skewed
- is one error worse than the other
- does the score control ranking, thresholding, or calibrated probability
If you cannot answer those, the metric is still arbitrary.
Minimal Pattern¶
```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
```
The important move is not computing every metric. The important move is deciding which one should be primary and why.
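One hedged way to make that decision explicit in code — `score_model` and the `primary/` key prefix are illustrative conventions, not a library API:

```python
# Sketch: record which metric is primary instead of leaving it implicit.
from sklearn.metrics import accuracy_score, average_precision_score

def score_model(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        # primary: rare-event ranking is what this hypothetical task rewards
        "primary/average_precision": average_precision_score(y_true, y_prob),
        # secondary: sanity check only, never the selector
        "secondary/accuracy": accuracy_score(y_true, y_pred),
    }

scores = score_model([0, 0, 0, 1], [0.1, 0.4, 0.2, 0.9])
```

Naming the primary metric in the report itself keeps a later reader from quietly re-ranking models by a different number.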
What To Inspect First¶
Inspect these before announcing a winner:
- class balance
- baseline metric values
- whether two models swap places under different metrics
- whether the chosen metric actually matches the downstream decision
If the primary metric changes, the story about the "best" model can change with it.
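The first two checks have cheap numeric floors. A minimal sketch, with invented class balance: the majority-class accuracy and the positive rate are the baselines any candidate model must clear.

```python
# Baselines to compute before announcing a winner (toy 5%-positive data).
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 95 + [1] * 5)

# A constant "all negative" model sets the accuracy floor.
majority_acc = accuracy_score(y_true, np.zeros_like(y_true))   # 0.95 for free

# The positive rate is roughly the average precision of a random ranker.
random_ap = y_true.mean()                                      # 0.05
```

A model reporting 0.94 accuracy on this data is losing to a constant prediction, which is the kind of fact that should surface before the comparison table does.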
Failure Pattern¶
The classic failure is choosing accuracy on a rare-event task: with 1% positives, a model that always predicts the negative class scores 99% accuracy while catching nothing.
Another common failure is using ROC AUC when the real decision is a constrained review budget or a calibrated cutoff. The model may rank examples well and still be poor at the actual operating point.
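The second failure can be shown in a few lines. A hedged sketch with invented scores: the ranking is perfect, but every score sits below 0.5, so a hard 0.5 cutoff flags nothing.

```python
# Perfect ranking, useless operating point (toy numbers).
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.4])

auc = roc_auc_score(y_true, y_prob)                  # 1.0: both positives on top
recall_at_half = recall_score(y_true, y_prob >= 0.5) # 0.0: nothing gets flagged
```

ROC AUC certifies the ordering; it says nothing about whether the scores land anywhere near the threshold the policy actually uses.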
Common Mistakes¶
- reporting one metric as if it tells the whole story
- using accuracy on imbalanced data
- calling ROC AUC proof of good probabilities
- optimizing F1 without saying whether precision or recall matters more
- comparing scores across datasets with different class balance and acting as if they mean the same thing
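The ROC AUC mistake in particular is easy to demonstrate. A sketch with invented probabilities: shifting every score upward keeps the ordering (same ROC AUC) but wrecks the Brier score, so AUC cannot certify good probabilities.

```python
# Identical ranking, very different probability quality (toy numbers).
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
calibrated = np.array([0.10, 0.20, 0.30, 0.70, 0.80, 0.90])
shifted = np.array([0.60, 0.70, 0.80, 0.85, 0.90, 0.95])   # same order, inflated

auc_cal = roc_auc_score(y_true, calibrated)      # 1.0
auc_shift = roc_auc_score(y_true, shifted)       # also 1.0: order unchanged
brier_cal = brier_score_loss(y_true, calibrated)
brier_shift = brier_score_loss(y_true, shifted)  # much worse
```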
A Good Metric Note¶
After one experiment, the learner should be able to say:
- which metric is primary
- why that metric matches the task
- which secondary metric guards against a blind spot
- what decision would change if the metric changed
Practice¶
- Pick a rare-event task and explain why accuracy is weak there.
- Compare ROC AUC and average precision on the same predictions.
- Explain when Brier score matters more than ranking metrics.
- Choose a metric for a review-budget workflow and defend it.
- Show one case where the same model wins on one metric and loses on another.
Runnable Example¶
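A minimal runnable sketch, with invented numbers, of the central point: the same pair of models swaps places depending on the metric. Model A wins on accuracy by refusing to predict the positive class; model B wins on average precision by ranking the positives first.

```python
# Two models, two metrics, two different "winners" (illustrative data).
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([0] * 18 + [1] * 2)      # 10% positives

prob_a = np.full(20, 0.01)                 # model A: always "negative"
prob_b = np.array([0.6] * 4 + [0.1] * 14 + [0.9, 0.8])  # model B: 4 false alarms

acc_a = accuracy_score(y_true, prob_a >= 0.5)    # 0.90: wins on accuracy
acc_b = accuracy_score(y_true, prob_b >= 0.5)    # 0.80: pays for the false alarms
ap_a = average_precision_score(y_true, prob_a)   # ~0.10: never finds a positive
ap_b = average_precision_score(y_true, prob_b)   # 1.00: positives ranked first

print(f"A: acc={acc_a:.2f} ap={ap_a:.2f}   B: acc={acc_b:.2f} ap={ap_b:.2f}")
```

If accuracy is primary, model A ships; if average precision is primary, model B ships. Nothing about the models changed, only the definition of "better".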
Longer Connection¶
Continue with Calibration and Thresholds when probability quality matters, and Honest Splits and Baselines when the split itself is still the bigger problem.