Evaluation Metrics Deep Dive¶
What This Is¶
Metric choice decides what "better" means. This page is about one decision:
- which metric should control model choice for this task
A model can win on accuracy and fail the real job. A model can rank well and still have useless probabilities. The metric has to match the consequence of the mistake.
When You Use It¶
- choosing the primary model-selection metric
- comparing models that win on different numbers
- explaining results to a team with a specific operational cost
- deciding whether ranking, thresholding, or probability quality matters most
Start With The Decision Type¶
Choose the metric by what the score will actually control:
| Need | Better first metric | Why |
|---|---|---|
| balanced classes, equal costs | accuracy | simple, but only when the class mix is honest |
| rare positive class | average precision or balanced accuracy | accuracy hides minority-class failure |
| ranking before threshold choice | ROC AUC or average precision | the score orders cases rather than making one hard call |
| hard threshold policy | precision, recall, or cost at the chosen cutoff | the threshold is the decision |
| trustworthy probabilities | log loss or Brier score | probability quality matters, not just ranking |
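To make the table concrete, here is a hedged sketch with invented toy numbers: ten cases, one positive, scored by several of the metrics above so the differences are visible side by side.

```python
# Toy data, for illustration only: nine negatives, one rare positive.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    balanced_accuracy_score,
    brier_score_loss,
    roc_auc_score,
)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_prob = np.array([0.1, 0.2, 0.1, 0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4])
y_pred = (y_prob >= 0.5).astype(int)              # hard calls at a 0.5 cutoff

acc = accuracy_score(y_true, y_pred)              # 0.80: looks respectable
bal_acc = balanced_accuracy_score(y_true, y_pred) # ~0.44: the positive was missed
auc = roc_auc_score(y_true, y_prob)               # ~0.89: ranking is decent
ap = average_precision_score(y_true, y_prob)      # 0.50: one negative outranks it
brier = brier_score_loss(y_true, y_prob)          # probability quality, not ranking
```

Accuracy looks fine here while balanced accuracy exposes that the one positive case was never flagged, which is exactly the gap the table is pointing at.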
Quick Rule¶
Ask three questions:
- is the class balance skewed
- is one error worse than the other
- does the score control ranking, thresholding, or calibrated probability
If you cannot answer those, the metric is still arbitrary.
Minimal Pattern¶
```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)
```
The important move is not computing every metric. The important move is deciding which one should be primary and why.
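One hedged way to make that decision explicit in code — `score_model` and the `primary/` key prefix are illustrative conventions, not a library API:

```python
# Sketch: record which metric is primary instead of leaving it implicit.
from sklearn.metrics import accuracy_score, average_precision_score

def score_model(y_true, y_prob, threshold=0.5):
    y_pred = [int(p >= threshold) for p in y_prob]
    return {
        # primary: rare-event ranking is what this hypothetical task rewards
        "primary/average_precision": average_precision_score(y_true, y_prob),
        # secondary: sanity check only, never the selector
        "secondary/accuracy": accuracy_score(y_true, y_pred),
    }

scores = score_model([0, 0, 0, 1], [0.1, 0.4, 0.2, 0.9])
```

Naming the primary metric in the report itself keeps a later reader from quietly re-ranking models by a different number.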
What To Inspect First¶
Inspect these before announcing a winner:
- class balance
- baseline metric values
- whether two models swap places under different metrics
- whether the chosen metric actually matches the downstream decision
If the primary metric changes, the story about the "best" model can change with it.
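The first two checks have cheap numeric floors. A minimal sketch, with invented class balance: the majority-class accuracy and the positive rate are the baselines any candidate model must clear.

```python
# Baselines to compute before announcing a winner (toy 5%-positive data).
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 95 + [1] * 5)

# A constant "all negative" model sets the accuracy floor.
majority_acc = accuracy_score(y_true, np.zeros_like(y_true))   # 0.95 for free

# The positive rate is roughly the average precision of a random ranker.
random_ap = y_true.mean()                                      # 0.05
```

A model reporting 0.94 accuracy on this data is losing to a constant prediction, which is the kind of fact that should surface before the comparison table does.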
Failure Pattern¶
The classic failure is choosing accuracy on a rare-event task: with 1% positives, a model that always predicts the negative class scores 99% accuracy while catching nothing.
Another common failure is using ROC AUC when the real decision is a constrained review budget or a calibrated cutoff. The model may rank examples well and still be poor at the actual operating point.
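The second failure can be shown in a few lines. A hedged sketch with invented scores: the ranking is perfect, but every score sits below 0.5, so a hard 0.5 cutoff flags nothing.

```python
# Perfect ranking, useless operating point (toy numbers).
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.4])

auc = roc_auc_score(y_true, y_prob)                  # 1.0: both positives on top
recall_at_half = recall_score(y_true, y_prob >= 0.5) # 0.0: nothing gets flagged
```

ROC AUC certifies the ordering; it says nothing about whether the scores land anywhere near the threshold the policy actually uses.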
Common Mistakes¶
- reporting one metric as if it tells the whole story
- using accuracy on imbalanced data
- calling ROC AUC proof of good probabilities
- optimizing F1 without saying whether precision or recall matters more
- comparing scores across datasets with different class balance and acting as if they mean the same thing
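The ROC AUC mistake in particular is easy to demonstrate. A sketch with invented probabilities: shifting every score upward keeps the ordering (same ROC AUC) but wrecks the Brier score, so AUC cannot certify good probabilities.

```python
# Identical ranking, very different probability quality (toy numbers).
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
calibrated = np.array([0.10, 0.20, 0.30, 0.70, 0.80, 0.90])
shifted = np.array([0.60, 0.70, 0.80, 0.85, 0.90, 0.95])   # same order, inflated

auc_cal = roc_auc_score(y_true, calibrated)      # 1.0
auc_shift = roc_auc_score(y_true, shifted)       # also 1.0: order unchanged
brier_cal = brier_score_loss(y_true, calibrated)
brier_shift = brier_score_loss(y_true, shifted)  # much worse
```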
A Good Metric Note¶
After one experiment, the learner should be able to say:
- which metric is primary
- why that metric matches the task
- which secondary metric guards against a blind spot
- what decision would change if the metric changed
Practice¶
- Pick a rare-event task and explain why accuracy is weak there.
- Compare ROC AUC and average precision on the same predictions.
- Explain when Brier score matters more than ranking metrics.
- Choose a metric for a review-budget workflow and defend it.
- Show one case where the same model wins on one metric and loses on another.
Runnable Example¶
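A minimal runnable sketch, with invented numbers, of the central point: the same pair of models swaps places depending on the metric. Model A wins on accuracy by refusing to predict the positive class; model B wins on average precision by ranking the positives first.

```python
# Two models, two metrics, two different "winners" (illustrative data).
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

y_true = np.array([0] * 18 + [1] * 2)      # 10% positives

prob_a = np.full(20, 0.01)                 # model A: always "negative"
prob_b = np.array([0.6] * 4 + [0.1] * 14 + [0.9, 0.8])  # model B: 4 false alarms

acc_a = accuracy_score(y_true, prob_a >= 0.5)    # 0.90: wins on accuracy
acc_b = accuracy_score(y_true, prob_b >= 0.5)    # 0.80: pays for the false alarms
ap_a = average_precision_score(y_true, prob_a)   # ~0.10: never finds a positive
ap_b = average_precision_score(y_true, prob_b)   # 1.00: positives ranked first

print(f"A: acc={acc_a:.2f} ap={ap_a:.2f}   B: acc={acc_b:.2f} ap={ap_b:.2f}")
```

If accuracy is primary, model A ships; if average precision is primary, model B ships. Nothing about the models changed, only the definition of "better".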
Longer Connection¶
Continue with Calibration and Thresholds when probability quality matters, and Honest Splits and Baselines when the split itself is still the bigger problem.