Imbalanced Metrics and Review Budgets

What This Is

Rare-event triage tasks need ranked decisions, not just plain labels. The real question is not “is the model accurate?” but “does the top of the ranking contain enough of the cases we care about?”

Accuracy can look strong while the review queue is almost useless. For rare positives, a model must be judged by what it puts near the top, how that behaves under a fixed review budget, and whether the operating threshold still makes sense after the base rate shifts.

When You Use It

  • rare-event detection with human review
  • fixed manual-review capacity
  • operating-point selection from predicted scores
  • queue design where false negatives are more expensive than false positives

Core Tools

  • average_precision_score
  • precision_recall_curve
  • PrecisionRecallDisplay.from_predictions
  • precision_score
  • recall_score
  • fbeta_score
  • confusion_matrix
  • ConfusionMatrixDisplay.from_predictions
  • classification_report
  • predict_proba
  • decision_function
  • CalibratedClassifierCV
  • calibration_curve
  • CalibrationDisplay.from_predictions
  • np.argsort
  • np.quantile
  • pd.DataFrame.nlargest
  • pd.cut
  • pd.qcut
  • groupby
  • crosstab

Reading The Ranking

If your model emits scores, start by asking whether the highest scores really contain the positives.

# Two equivalent ways to pull the top 50 scored cases.
order = np.argsort(-scores)
top_cases = df.iloc[order[:50]]
# ...or, when the scores already live in a DataFrame column:
top_cases = df.nlargest(50, "score")

top_positive_rate = top_cases["target"].mean()

Use this pattern when you care about a fixed review count. It is better than staring at a single threshold because it directly answers, “What do reviewers actually see?”

Prevalence And Lift

Rare-event work gets clearer once you compare the queue against the base rate.

  • prevalence: positive_count / total_count
  • precision at k: positive rate inside the reviewed set
  • lift at k: precision_at_k / prevalence

If prevalence is 0.03 and precision at the top 5% is 0.21, the lift is 7x. That says more than accuracy ever will about whether the ranking is worth reviewer time.
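The same arithmetic can be sketched end to end. The dataset here is synthetic (the column names `target` and `score` follow the snippets above; the prevalence and score shift are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical scored dataset: ~3% prevalence, scores loosely correlated with target.
rng = np.random.default_rng(0)
n = 10_000
target = (rng.random(n) < 0.03).astype(int)
score = rng.normal(0.0, 1.0, n) + 2.0 * target
df = pd.DataFrame({"target": target, "score": score})

prevalence = df["target"].mean()
k = max(1, int(len(df) * 0.05))            # top 5% review budget
reviewed = df.nlargest(k, "score")
precision_at_k = reviewed["target"].mean()  # positive rate inside the queue
lift_at_k = precision_at_k / prevalence     # how much better than random review
```

Any lift meaningfully above 1 means the queue beats random sampling; a lift near 1 means the ranking is not paying for itself.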

Threshold And Curve View

For rare events, the precision-recall view is usually more informative than accuracy.

from sklearn.metrics import (
    PrecisionRecallDisplay,
    average_precision_score,
    precision_recall_curve,
)

ap = average_precision_score(y_true, scores)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
PrecisionRecallDisplay.from_predictions(y_true, scores)

Interpretation:

  • average_precision_score summarizes how well positives are ranked near the top.
  • precision_recall_curve shows the tradeoff across thresholds.
  • PrecisionRecallDisplay.from_predictions is useful when you want the whole curve visible in one place.

Practical trick:

  • If precision is low even at the left edge of the curve (low recall), the model is not separating positives well enough for a tight queue.
  • If the curve is good only at very low recall, the model may not be useful once the budget expands.

Counterexample:

  • model A can have a better overall average precision because it ranks positives reasonably well across the whole list
  • model B can still have better precision@k at the tiny operating point your reviewers actually use

That is why AP and top-k precision answer related but different questions.
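The counterexample can be made concrete with a tiny hand-built ranking (the ten labels and two score vectors below are constructed for illustration, not taken from any real dataset):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Model A: every positive near the top, but a negative holds first place.
scores_a = np.array([9, 8, 7, 6, 10, 5, 4, 3, 2, 1])
# Model B: one positive in first place, the other three dumped to the bottom.
scores_b = np.array([10, 3, 2, 1, 9, 8, 7, 6, 5, 4])

def precision_at_k(y, scores, k):
    top = np.argsort(-scores)[:k]
    return y[top].mean()

ap_a = average_precision_score(y_true, scores_a)   # ~0.68: better overall AP
ap_b = average_precision_score(y_true, scores_b)   # ~0.50
p1_a = precision_at_k(y_true, scores_a, 1)         # 0.0 at the k=1 operating point
p1_b = precision_at_k(y_true, scores_b, 1)         # 1.0
```

Model A wins on AP, model B wins on precision@1: with a one-case budget, reviewers would prefer B despite its worse summary score.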

Budget Tables

Once the queue size is fixed, build a small table over candidate budgets.

budgets = [0.01, 0.02, 0.05, 0.10]
rows = []

for budget in budgets:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    rows.append(
        {
            "budget": budget,
            "k": k,
            "review_precision": reviewed["target"].mean(),
            "captured_recall": reviewed["target"].sum() / df["target"].sum(),
        }
    )

budget_table = pd.DataFrame(rows)

That table is more useful than a single score because it tells you how the model behaves at several realistic operating points.

Useful follow-up:

  • add the false-positive count
  • add the number of positives captured
  • compare the table against the prevalence or random-ranking floor
  • compare it against one simple heuristic ranker if the task has an obvious hand-built signal
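The follow-ups above can be folded into the same loop. A sketch with a synthetic dataset (the prevalence, score shift, and budgets are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({"target": (rng.random(n) < 0.05).astype(int)})
df["score"] = rng.normal(0.0, 1.0, n) + 1.5 * df["target"]

prevalence = df["target"].mean()
rows = []
for budget in [0.01, 0.02, 0.05, 0.10]:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    tp = int(reviewed["target"].sum())
    rows.append({
        "budget": budget,
        "k": k,
        "review_precision": reviewed["target"].mean(),
        "captured_recall": tp / df["target"].sum(),
        "positives_captured": tp,
        "false_positives": k - tp,       # reviewer time spent on negatives
        "random_floor": prevalence,      # expected precision of a random queue
    })

budget_table = pd.DataFrame(rows)
```

Reading `review_precision` against `random_floor` row by row makes the lift at each budget visible without a separate computation.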

Thresholds Versus Top-k Under Shift

A fixed threshold and a fixed top-k budget solve different problems.

  • threshold: keeps the score meaning fixed, but the queue size can change under score drift
  • top-k: keeps the queue size fixed, but the score cutoff can move as the score distribution shifts

In a stable operational queue, top-k is often the safer first control. In a calibrated risk policy, a threshold can be better. The important point is to decide which quantity should stay fixed when the base rate or score scale moves.
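The difference shows up clearly under a simulated score drift (the drift size and budget below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
scores_before = rng.normal(0.0, 1.0, 10_000)
scores_after = scores_before + 0.5   # uniform upward score drift

# A threshold tuned for a 5% queue on the old distribution...
threshold = np.quantile(scores_before, 0.95)

# ...lets the queue balloon once scores drift upward.
queue_before = int((scores_before >= threshold).sum())
queue_after = int((scores_after >= threshold).sum())

# A fixed top-k keeps the queue size constant; the effective cutoff moves instead.
k = int(0.05 * len(scores_after))
cutoff_after = np.sort(scores_after)[-k]
```

Here the fixed threshold more than doubles the review load, while top-k absorbs the drift by silently raising the cutoff: each control keeps one quantity fixed and lets the other move.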

Operating Point

When you need a hard decision rule, convert scores into labels and inspect the confusion matrix at the chosen threshold.

from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)

pred = (scores >= threshold).astype(int)

cm = confusion_matrix(y_true, pred)
report = classification_report(y_true, pred, zero_division=0)
ConfusionMatrixDisplay.from_predictions(y_true, pred, normalize="true")

What to look for:

  • normalize="true" helps compare misses and hits within each class.
  • classification_report(..., zero_division=0) avoids noisy exceptions when a threshold predicts no positives, and output_dict=True is useful when you want to tabulate several thresholds.
  • precision_score and recall_score are often the right pair when the class is imbalanced.
  • fbeta_score(beta=2) is useful when recall matters more than precision.
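Tabulating several thresholds at once often beats inspecting them one by one. A sketch on synthetic scores (the thresholds and score shift are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)
y_true = (rng.random(2_000) < 0.04).astype(int)
scores = rng.normal(0.0, 1.0, 2_000) + 1.8 * y_true

rows = []
for threshold in [0.5, 1.0, 1.5, 2.0]:
    pred = (scores >= threshold).astype(int)
    rows.append({
        "threshold": threshold,
        "precision": precision_score(y_true, pred, zero_division=0),
        "recall": recall_score(y_true, pred, zero_division=0),
        "flagged": int(pred.sum()),   # implied queue size at this threshold
    })

sweep = pd.DataFrame(rows)
```

The `flagged` column is the bridge back to the budget view: each threshold implies a queue size, and that number should be checked against real review capacity.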

Calibration And Threshold Sweeps

If predicted probabilities are poorly spread, a threshold chosen on one split may not transfer well.

from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

calibrator = CalibratedClassifierCV(estimator=model, method="sigmoid", cv=5)
probabilities = calibrator.fit(X_train, y_train).predict_proba(X_valid)[:, 1]
CalibrationDisplay.from_predictions(y_valid, probabilities)

Use this when:

  • the queue size changes frequently
  • score ordering is stable but the cutoff is unstable
  • the top scores look too compressed to separate cleanly

Practical rule:

  • predict_proba gives you probabilities to threshold directly.
  • decision_function is fine for ranking when probabilities are unavailable, but it is not a calibration-plot input by itself.
  • use calibrated probabilities for CalibrationDisplay.from_predictions(...), not arbitrary ranking scores.
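A margin-based model such as LinearSVC illustrates the distinction: its decision_function output ranks fine but cannot be read as a probability, while wrapping it in CalibratedClassifierCV yields thresholdable probabilities. A sketch on synthetic data (dataset shape and class weights are assumptions for illustration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, random_state=0, stratify=y
)

# LinearSVC exposes only decision_function: usable for ranking,
# but not as input to a calibration plot.
svc = LinearSVC(max_iter=10_000).fit(X_train, y_train)
margins = svc.decision_function(X_valid)

# Wrapping the same estimator yields probabilities in [0, 1]
# that can be thresholded or fed to CalibrationDisplay.
calibrated = CalibratedClassifierCV(
    LinearSVC(max_iter=10_000), method="sigmoid", cv=5
).fit(X_train, y_train)
proba = calibrated.predict_proba(X_valid)[:, 1]
```

The ordering of `margins` and `proba` is usually similar; what calibration adds is a probability scale that survives a change of threshold.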

Slice-Aware Checks

A good review queue can still fail on a subgroup. Always check whether the budget is being consumed by the right slice.

df = df.assign(reviewed=df["score"] >= threshold)

slice_table = df.groupby("slice").agg(
    reviewed_rate=("reviewed", "mean"),
    positive_rate=("target", "mean"),
    count=("target", "size"),
)
slice_budget = pd.crosstab(df["slice"], df["reviewed"], normalize="index")

What matters here:

  • the smallest slice may need its own threshold check
  • a slice with a good average score can still have poor top-k capture
  • counts must stay next to rates, or the table becomes misleading
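A complementary check runs the same idea under a global top-k budget instead of a threshold: which slices consume the queue, and what each slice's positives get captured. A sketch on synthetic data (the slice names, mix, and budget are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 6_000
df = pd.DataFrame({
    "slice": rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1]),
    "target": (rng.random(n) < 0.04).astype(int),
})
df["score"] = rng.normal(0.0, 1.0, n) + 1.5 * df["target"]

# Global 5% budget: mark the reviewed rows.
k = int(0.05 * len(df))
df["reviewed"] = df.index.isin(df.nlargest(k, "score").index)
df["hit"] = df["reviewed"] & (df["target"] == 1)

slice_view = df.groupby("slice").agg(
    count=("target", "size"),
    positives=("target", "sum"),
    reviewed_rate=("reviewed", "mean"),
    captured_positives=("hit", "sum"),
)
slice_view["captured_recall"] = (
    slice_view["captured_positives"] / slice_view["positives"]
)
```

A slice with a low `captured_recall` despite a normal `reviewed_rate` is the classic failure: the budget reaches the slice but not its positives.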

Failure Pattern

Picking the highest-accuracy model and discovering that it barely improves the review queue because the positives are rare.

Another common mistake is choosing a threshold from validation and assuming it will stay optimal if the queue size, base rate, or subgroup mix changes.

Practice

  1. Compare a random-ranking or simple heuristic baseline against a learned ranked model using average_precision_score.
  2. Evaluate three or four review budgets and write the queue precision and captured recall for each one.
  3. Compare the same threshold before and after calibration.
  4. Build one per-slice budget table and identify the weakest slice.
  5. Explain whether your operating point should maximize precision, recall, or fbeta_score.
  6. Say which result would convince you that the model is not worth deploying.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect the queue at multiple budgets, then compare the thresholded confusion matrix against the ranked top-k view.

Questions To Ask

  1. How many positives can the queue catch at this budget?
  2. Is the top of the ranking actually better, or just more confident?
  3. What happens if the review capacity changes next month?
  4. Would a small gain in recall be worth a large drop in precision?
  5. Which threshold would you explain to a human reviewer?
  6. Does the same threshold still work on the weakest slice?
  7. Is the model better than a simpler ranking baseline at the operating points that matter?

Longer Connection

Continue with Imbalanced Triage and Review Budgets for the full leaderboard, budget curve, slice check, and submission workflow.