Imbalanced Metrics and Review Budgets

What This Is

Rare-event triage tasks need ranked decisions, not just plain labels. The real question is not “is the model accurate?” but “does the top of the ranking contain enough of the cases we care about?”

Accuracy can look strong while the review queue is almost useless. For rare positives, a model must be judged by what it puts near the top, how that behaves under a fixed review budget, and whether the operating threshold still makes sense after the base rate shifts.

When You Use It

  • rare-event detection with human review
  • fixed manual-review capacity
  • operating-point selection from predicted scores
  • queue design where false negatives are more expensive than false positives

Core Tools

  • average_precision_score
  • precision_recall_curve
  • PrecisionRecallDisplay.from_predictions
  • precision_score
  • recall_score
  • fbeta_score
  • confusion_matrix
  • ConfusionMatrixDisplay.from_predictions
  • classification_report
  • predict_proba
  • decision_function
  • CalibratedClassifierCV
  • calibration_curve
  • CalibrationDisplay.from_predictions
  • np.argsort
  • np.quantile
  • pd.DataFrame.nlargest
  • pd.cut
  • pd.qcut
  • groupby
  • crosstab

Reading The Ranking

If your model emits scores, start by asking whether the highest scores really contain the positives.

# Two equivalent ways to pull the top 50 scored cases.
order = np.argsort(-scores)
top_cases = df.iloc[order[:50]]
# ...or, when the scores already live in a DataFrame column:
top_cases = df.nlargest(50, "score")

top_positive_rate = top_cases["target"].mean()

Use this pattern when you care about a fixed review count. It is better than staring at a single threshold because it directly answers, “What do reviewers actually see?”

Prevalence And Lift

Rare-event work gets clearer once you compare the queue against the base rate.

  • prevalence: positive_count / total_count
  • precision at k: positive rate inside the reviewed set
  • lift at k: precision_at_k / prevalence

If prevalence is 0.03 and precision at the top 5% is 0.21, the lift is 7x. That says more than accuracy ever will about whether the ranking is worth reviewer time.
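The same arithmetic can be sketched end to end. The dataset here is synthetic (the column names `target` and `score` follow the snippets above; the prevalence and score shift are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical scored dataset: ~3% prevalence, scores loosely correlated with target.
rng = np.random.default_rng(0)
n = 10_000
target = (rng.random(n) < 0.03).astype(int)
score = rng.normal(0.0, 1.0, n) + 2.0 * target
df = pd.DataFrame({"target": target, "score": score})

prevalence = df["target"].mean()
k = max(1, int(len(df) * 0.05))            # top 5% review budget
reviewed = df.nlargest(k, "score")
precision_at_k = reviewed["target"].mean()  # positive rate inside the queue
lift_at_k = precision_at_k / prevalence     # how much better than random review
```

Any lift meaningfully above 1 means the queue beats random sampling; a lift near 1 means the ranking is not paying for itself.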

Threshold And Curve View

For rare events, the precision-recall view is usually more informative than accuracy.

from sklearn.metrics import (
    PrecisionRecallDisplay,
    average_precision_score,
    precision_recall_curve,
)

ap = average_precision_score(y_true, scores)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
PrecisionRecallDisplay.from_predictions(y_true, scores)

Interpretation:

  • average_precision_score summarizes how well positives are ranked near the top.
  • precision_recall_curve shows the tradeoff across thresholds.
  • PrecisionRecallDisplay.from_predictions is useful when you want the whole curve visible in one place.

Practical trick:

  • If precision is low even at the left edge of the curve (low recall), the model is not separating positives well enough for a tight queue.
  • If the curve is good only at very low recall, the model may not be useful once the budget expands.

Counterexample:

  • model A can have a better overall average precision because it ranks positives reasonably well across the whole list
  • model B can still have better precision@k at the tiny operating point your reviewers actually use

That is why AP and top-k precision answer related but different questions.
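The counterexample can be made concrete with a tiny hand-built ranking (the ten labels and two score vectors below are constructed for illustration, not taken from any real dataset):

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

# Model A: every positive near the top, but a negative holds first place.
scores_a = np.array([9, 8, 7, 6, 10, 5, 4, 3, 2, 1])
# Model B: one positive in first place, the other three dumped to the bottom.
scores_b = np.array([10, 3, 2, 1, 9, 8, 7, 6, 5, 4])

def precision_at_k(y, scores, k):
    top = np.argsort(-scores)[:k]
    return y[top].mean()

ap_a = average_precision_score(y_true, scores_a)   # ~0.68: better overall AP
ap_b = average_precision_score(y_true, scores_b)   # ~0.50
p1_a = precision_at_k(y_true, scores_a, 1)         # 0.0 at the k=1 operating point
p1_b = precision_at_k(y_true, scores_b, 1)         # 1.0
```

Model A wins on AP, model B wins on precision@1: with a one-case budget, reviewers would prefer B despite its worse summary score.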

Budget Tables

Once the queue size is fixed, build a small table over candidate budgets.

budgets = [0.01, 0.02, 0.05, 0.10]
rows = []

for budget in budgets:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    rows.append(
        {
            "budget": budget,
            "k": k,
            "review_precision": reviewed["target"].mean(),
            "captured_recall": reviewed["target"].sum() / df["target"].sum(),
        }
    )

budget_table = pd.DataFrame(rows)

That table is more useful than a single score because it tells you how the model behaves at several realistic operating points.

Useful follow-up:

  • add the false-positive count
  • add the number of positives captured
  • compare the table against the prevalence or random-ranking floor
  • compare it against one simple heuristic ranker if the task has an obvious hand-built signal
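The follow-ups above can be folded into the same loop. A sketch with a synthetic dataset (the prevalence, score shift, and budgets are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5_000
df = pd.DataFrame({"target": (rng.random(n) < 0.05).astype(int)})
df["score"] = rng.normal(0.0, 1.0, n) + 1.5 * df["target"]

prevalence = df["target"].mean()
rows = []
for budget in [0.01, 0.02, 0.05, 0.10]:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    tp = int(reviewed["target"].sum())
    rows.append({
        "budget": budget,
        "k": k,
        "review_precision": reviewed["target"].mean(),
        "captured_recall": tp / df["target"].sum(),
        "positives_captured": tp,
        "false_positives": k - tp,       # reviewer time spent on negatives
        "random_floor": prevalence,      # expected precision of a random queue
    })

budget_table = pd.DataFrame(rows)
```

Reading `review_precision` against `random_floor` row by row makes the lift at each budget visible without a separate computation.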

Thresholds Versus Top-k Under Shift

A fixed threshold and a fixed top-k budget solve different problems.

  • threshold: keeps the score meaning fixed, but the queue size can change under score drift
  • top-k: keeps the queue size fixed, but the score cutoff can move as the score distribution shifts

In a stable operational queue, top-k is often the safer first control. In a calibrated risk policy, a threshold can be better. The important point is to decide which quantity should stay fixed when the base rate or score scale moves.
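The difference shows up clearly under a simulated score drift (the drift size and budget below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
scores_before = rng.normal(0.0, 1.0, 10_000)
scores_after = scores_before + 0.5   # uniform upward score drift

# A threshold tuned for a 5% queue on the old distribution...
threshold = np.quantile(scores_before, 0.95)

# ...lets the queue balloon once scores drift upward.
queue_before = int((scores_before >= threshold).sum())
queue_after = int((scores_after >= threshold).sum())

# A fixed top-k keeps the queue size constant; the effective cutoff moves instead.
k = int(0.05 * len(scores_after))
cutoff_after = np.sort(scores_after)[-k]
```

Here the fixed threshold more than doubles the review load, while top-k absorbs the drift by silently raising the cutoff: each control keeps one quantity fixed and lets the other move.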

Operating Point

When you need a hard decision rule, convert scores into labels and inspect the confusion matrix at the chosen threshold.

from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)

pred = (scores >= threshold).astype(int)

cm = confusion_matrix(y_true, pred)
report = classification_report(y_true, pred, zero_division=0)
ConfusionMatrixDisplay.from_predictions(y_true, pred, normalize="true")

What to look for:

  • normalize="true" helps compare misses and hits within each class.
  • classification_report(..., zero_division=0) avoids noisy exceptions when a threshold predicts no positives, and output_dict=True is useful when you want to tabulate several thresholds.
  • precision_score and recall_score are often the right pair when the class is imbalanced.
  • fbeta_score(beta=2) is useful when recall matters more than precision.
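Tabulating several thresholds at once often beats inspecting them one by one. A sketch on synthetic scores (the thresholds and score shift are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(3)
y_true = (rng.random(2_000) < 0.04).astype(int)
scores = rng.normal(0.0, 1.0, 2_000) + 1.8 * y_true

rows = []
for threshold in [0.5, 1.0, 1.5, 2.0]:
    pred = (scores >= threshold).astype(int)
    rows.append({
        "threshold": threshold,
        "precision": precision_score(y_true, pred, zero_division=0),
        "recall": recall_score(y_true, pred, zero_division=0),
        "flagged": int(pred.sum()),   # implied queue size at this threshold
    })

sweep = pd.DataFrame(rows)
```

The `flagged` column is the bridge back to the budget view: each threshold implies a queue size, and that number should be checked against real review capacity.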

Calibration And Threshold Sweeps

If predicted probabilities are poorly spread, a threshold chosen on one split may not transfer well.

from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

calibrator = CalibratedClassifierCV(estimator=model, method="sigmoid", cv=5)
probabilities = calibrator.fit(X_train, y_train).predict_proba(X_valid)[:, 1]
CalibrationDisplay.from_predictions(y_valid, probabilities)

Use this when:

  • the queue size changes frequently
  • score ordering is stable but the cutoff is unstable
  • the top scores look too compressed to separate cleanly

Practical rule:

  • predict_proba gives you probabilities to threshold directly.
  • decision_function is fine for ranking when probabilities are unavailable, but it is not a calibration-plot input by itself.
  • use calibrated probabilities for CalibrationDisplay.from_predictions(...), not arbitrary ranking scores.
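A margin-based model such as LinearSVC illustrates the distinction: its decision_function output ranks fine but cannot be read as a probability, while wrapping it in CalibratedClassifierCV yields thresholdable probabilities. A sketch on synthetic data (dataset shape and class weights are assumptions for illustration):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2_000, weights=[0.95], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, random_state=0, stratify=y
)

# LinearSVC exposes only decision_function: usable for ranking,
# but not as input to a calibration plot.
svc = LinearSVC(max_iter=10_000).fit(X_train, y_train)
margins = svc.decision_function(X_valid)

# Wrapping the same estimator yields probabilities in [0, 1]
# that can be thresholded or fed to CalibrationDisplay.
calibrated = CalibratedClassifierCV(
    LinearSVC(max_iter=10_000), method="sigmoid", cv=5
).fit(X_train, y_train)
proba = calibrated.predict_proba(X_valid)[:, 1]
```

The ordering of `margins` and `proba` is usually similar; what calibration adds is a probability scale that survives a change of threshold.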

Slice-Aware Checks

A good review queue can still fail on a subgroup. Always check whether the budget is being consumed by the right slice.

df = df.assign(reviewed=df["score"] >= threshold)

slice_table = df.groupby("slice").agg(
    reviewed_rate=("reviewed", "mean"),
    positive_rate=("target", "mean"),
    count=("target", "size"),
)
slice_budget = pd.crosstab(df["slice"], df["reviewed"], normalize="index")

What matters here:

  • the smallest slice may need its own threshold check
  • a slice with a good average score can still have poor top-k capture
  • counts must stay next to rates, or the table becomes misleading
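A complementary check runs the same idea under a global top-k budget instead of a threshold: which slices consume the queue, and what each slice's positives get captured. A sketch on synthetic data (the slice names, mix, and budget are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 6_000
df = pd.DataFrame({
    "slice": rng.choice(["a", "b", "c"], size=n, p=[0.6, 0.3, 0.1]),
    "target": (rng.random(n) < 0.04).astype(int),
})
df["score"] = rng.normal(0.0, 1.0, n) + 1.5 * df["target"]

# Global 5% budget: mark the reviewed rows.
k = int(0.05 * len(df))
df["reviewed"] = df.index.isin(df.nlargest(k, "score").index)
df["hit"] = df["reviewed"] & (df["target"] == 1)

slice_view = df.groupby("slice").agg(
    count=("target", "size"),
    positives=("target", "sum"),
    reviewed_rate=("reviewed", "mean"),
    captured_positives=("hit", "sum"),
)
slice_view["captured_recall"] = (
    slice_view["captured_positives"] / slice_view["positives"]
)
```

A slice with a low `captured_recall` despite a normal `reviewed_rate` is the classic failure: the budget reaches the slice but not its positives.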

Failure Pattern

Picking the highest-accuracy model and discovering that it barely improves the review queue because the positives are rare.

Another common mistake is choosing a threshold from validation and assuming it will stay optimal if the queue size, base rate, or subgroup mix changes.

Practice

  1. Compare a random-ranking or simple heuristic baseline against a learned ranked model using average_precision_score.
  2. Evaluate three or four review budgets and write the queue precision and captured recall for each one.
  3. Compare the same threshold before and after calibration.
  4. Build one per-slice budget table and identify the weakest slice.
  5. Explain whether your operating point should maximize precision, recall, or fbeta_score.
  6. Say which result would convince you that the model is not worth deploying.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect the queue at multiple budgets, then compare the thresholded confusion matrix against the ranked top-k view.

Questions To Ask

  1. How many positives can the queue catch at this budget?
  2. Is the top of the ranking actually better, or just more confident?
  3. What happens if the review capacity changes next month?
  4. Would a small gain in recall be worth a large drop in precision?
  5. Which threshold would you explain to a human reviewer?
  6. Does the same threshold still work on the weakest slice?
  7. Is the model better than a simpler ranking baseline at the operating points that matter?

Longer Connection

Continue with Imbalanced Triage and Review Budgets for the full leaderboard, budget curve, slice check, and submission workflow.