
Evaluation Metrics Deep Dive

What This Is

Metric choice decides what "better" means. This page is about one decision:

  • which metric should control model choice for this task

A model can win on accuracy and fail the real job. A model can rank well and still have useless probabilities. The metric has to match the consequence of the mistake.

When You Use It

  • choosing the primary model-selection metric
  • comparing models that win on different numbers
  • explaining results to a team with a specific operational cost
  • deciding whether ranking, thresholding, or probability quality matters most

Start With The Decision Type

Choose the metric by what the score will actually control:

| Need | Better first metric | Why |
| --- | --- | --- |
| balanced classes, equal costs | accuracy | simple, but only when the class mix is honest |
| rare positive class | average precision or balanced accuracy | accuracy hides minority-class failure |
| ranking before threshold choice | ROC AUC or average precision | the score orders cases rather than making one hard call |
| hard threshold policy | precision, recall, or cost at the chosen cutoff | the threshold is the decision |
| trustworthy probabilities | log loss or Brier score | probability quality matters, not just ranking |
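The table can be mirrored as a small lookup. This is a hypothetical sketch: the `FIRST_METRIC` name and the decision-type keys are invented for illustration, not part of any library.

```python
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)

# Hypothetical lookup mirroring the table: decision type -> default first metric.
# Keys are illustrative labels, not standard terminology.
# Note what each metric consumes: hard labels vs. scores vs. probabilities.
FIRST_METRIC = {
    "balanced_equal_costs": accuracy_score,            # hard label predictions
    "rare_positive_class": average_precision_score,    # scores or probabilities
    "ranking_before_threshold": roc_auc_score,         # scores or probabilities
    "trustworthy_probabilities": brier_score_loss,     # probabilities
}
```

The hard-threshold row is missing on purpose: precision or recall at a chosen cutoff depends on the cutoff, so it cannot be a context-free lookup entry.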

Quick Rule

Ask three questions:

  1. is the class balance skewed
  2. is one error worse than the other
  3. does the score control ranking, thresholding, or calibrated probability

If you cannot answer all three, the metric choice is still arbitrary.

Minimal Pattern

from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    precision_score,
    recall_score,
    roc_auc_score,
)

The important move is not computing every metric. The important move is deciding which one should be primary and why.
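For instance, on a rare-positive task the primary/secondary split might look like this (the toy arrays are invented for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score, brier_score_loss

# Invented toy predictions: 3 positives out of 10 cases
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.9, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.2])

# Primary: average precision, because the positive class is the rare one we care about
primary = average_precision_score(y_true, y_prob)
# Secondary guard: Brier score, so good ranking cannot hide bad probabilities
secondary = brier_score_loss(y_true, y_prob)
```

Here the ranking is perfect (all three positives score highest), so average precision is 1.0, while the Brier score still registers that the probabilities are not confident.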

What To Inspect First

Inspect these before announcing a winner:

  • class balance
  • baseline metric values
  • whether two models swap places under different metrics
  • whether the chosen metric actually matches the downstream decision

If the primary metric changes, the story about the "best" model can change with it.
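A minimal inspection pass for the first two checks, assuming an invented rare-event label vector:

```python
import numpy as np

# Invented rare-event labels: 10 positives in 1000 cases
y_true = np.array([1] * 10 + [0] * 990)

pos_rate = y_true.mean()                         # class balance: 0.01
baseline_accuracy = max(pos_rate, 1 - pos_rate)  # majority-class baseline: 0.99

# Any model accuracy below 0.99 here is worse than predicting the majority
# class every time, which is the first hint that accuracy is the wrong primary metric.
```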

Failure Pattern

The classic failure is choosing accuracy on a rare-event task: a model that predicts the majority class for every case scores near-perfect accuracy while catching nothing.

Another common failure is using ROC AUC when the real decision is a constrained review budget or a calibrated cutoff. The model may rank examples well and still be poor at the actual operating point.
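One way to see this, with invented scores: a model where 40 negatives outrank every positive still earns a strong AUC, because the other 960 negatives sit far below, but precision at the 0.5 cutoff is poor.

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score

# Invented scores: 10 positives at 0.8, 40 negatives outranking them at 0.9,
# and the remaining 960 negatives far below at 0.1
y_true = np.array([1] * 10 + [0] * 40 + [0] * 960)
y_prob = np.array([0.8] * 10 + [0.9] * 40 + [0.1] * 960)

auc = roc_auc_score(y_true, y_prob)                          # 0.96: ranking looks strong
prec = precision_score(y_true, (y_prob >= 0.5).astype(int))  # 0.2: weak at the cutoff
```

If the real decision is "review everything above 0.5", the 0.2 precision is the number that matters, and the 0.96 AUC is a distraction.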

Common Mistakes

  • reporting one metric as if it tells the whole story
  • using accuracy on imbalanced data
  • calling ROC AUC proof of good probabilities
  • optimizing F1 without saying whether precision or recall matters more
  • comparing scores across datasets with different class balance and acting as if they mean the same thing

A Good Metric Note

After one experiment, the learner should be able to say:

  • which metric is primary
  • why that metric matches the task
  • which secondary metric guards against a blind spot
  • what decision would change if the metric changed

Practice

  1. Pick a rare-event task and explain why accuracy is weak there.
  2. Compare ROC AUC and average precision on the same predictions.
  3. Explain when Brier score matters more than ranking metrics.
  4. Choose a metric for a review-budget workflow and defend it.
  5. Show one case where the same model wins on one metric and loses on another.

Runnable Example
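A self-contained sketch covering practice item 5, with invented numbers: model A wins on accuracy, model B wins on average precision, and the choice of primary metric decides the story.

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Invented rare-event task: 10 positives in 1000 cases
y_true = np.array([1] * 10 + [0] * 990)

# Model A predicts "negative" for everyone
a_prob = np.zeros(1000)
# Model B scores every positive high, at the cost of 20 false alarms
b_prob = np.concatenate([np.full(10, 0.9), np.full(20, 0.9), np.full(970, 0.1)])

acc_a = accuracy_score(y_true, (a_prob >= 0.5).astype(int))  # 0.99
acc_b = accuracy_score(y_true, (b_prob >= 0.5).astype(int))  # 0.98
ap_a = average_precision_score(y_true, a_prob)               # 0.01
ap_b = average_precision_score(y_true, b_prob)               # ~0.33

# A wins on accuracy; B wins on average precision.
# On a rare-event task, average precision tells the truer story:
# A catches zero positives, B catches all ten.
```

Swap the primary metric and the "best" model flips, which is exactly the point of this page.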

Longer Connection

Continue with Calibration and Thresholds when probability quality matters, and Honest Splits and Baselines when the split itself is still the bigger problem.