Imbalanced Metrics and Review Budgets¶
What This Is¶
Rare-event triage tasks need ranked decisions, not only plain labels. The real question is not “is the model accurate?” but “does the top of the ranking contain enough of the cases we care about?”
Accuracy can look strong while the review queue is almost useless. For rare positives, a model must be judged by what it puts near the top, how that behaves under a fixed review budget, and whether the operating threshold still makes sense after the base rate shifts.
When You Use It¶
- rare-event detection with human review
- fixed manual-review capacity
- operating-point selection from predicted scores
- queue design where false negatives are more expensive than false positives
Core Tools¶
`average_precision_score`, `precision_recall_curve`, `PrecisionRecallDisplay.from_predictions`, `precision_score`, `recall_score`, `fbeta_score`, `confusion_matrix`, `ConfusionMatrixDisplay.from_predictions`, `classification_report`, `predict_proba`, `decision_function`, `CalibratedClassifierCV`, `calibration_curve`, `CalibrationDisplay.from_predictions`, `np.argsort`, `np.quantile`, `pd.DataFrame.nlargest`, `pd.cut`, `pd.qcut`, `groupby`, `crosstab`
Reading The Ranking¶
If your model emits scores, start by asking whether the highest scores really contain the positives.
```python
# Rank by score descending and inspect the 50 highest-scoring cases
order = np.argsort(-scores)
top_cases = df.iloc[order[:50]]
top_positive_rate = top_cases["target"].mean()

# Equivalent pandas shortcut for the same top-50 set
top_cases = df.nlargest(50, "score")
```
Use this pattern when you care about a fixed review count. It is better than staring at a single threshold because it directly answers, “What do reviewers actually see?”
Prevalence And Lift¶
Rare-event work gets clearer once you compare the queue against the base rate.
- prevalence: `positive_count / total_count`
- precision at k: positive rate inside the reviewed set
- lift at k: `precision_at_k / prevalence`
If prevalence is 0.03 and precision at the top 5% is 0.21, the lift is 7x. That says more than accuracy ever will about whether the ranking is worth reviewer time.
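These quantities are one-liners once the scored frame exists. The sketch below uses a synthetic dataset (the prevalence, score model, and column names are illustrative assumptions, not part of any real pipeline):

```python
import numpy as np
import pandas as pd

# Hypothetical scored dataset with roughly 3% prevalence
rng = np.random.default_rng(0)
n = 4000
target = (rng.random(n) < 0.03).astype(int)
score = rng.random(n) + 0.6 * target  # positives tend to score higher
df = pd.DataFrame({"target": target, "score": score})

prevalence = df["target"].mean()
k = max(1, int(n * 0.05))           # top 5% review budget
reviewed = df.nlargest(k, "score")
precision_at_k = reviewed["target"].mean()
lift_at_k = precision_at_k / prevalence
```

A lift near 1x means the ranking is no better than reviewing random cases; the further above 1x, the more reviewer time the model saves.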
Threshold And Curve View¶
For rare events, the precision-recall view is usually more informative than accuracy.
```python
from sklearn.metrics import (
    PrecisionRecallDisplay,
    average_precision_score,
    precision_recall_curve,
)

ap = average_precision_score(y_true, scores)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
PrecisionRecallDisplay.from_predictions(y_true, scores)
```
Interpretation:
- `average_precision_score` summarizes how well positives are ranked near the top.
- `precision_recall_curve` shows the tradeoff across thresholds.
- `PrecisionRecallDisplay.from_predictions` is useful when you want the whole curve visible in one place.
Practical trick:
- If the curve is flat near the left edge, the model is not separating the hard positives well enough for a tight queue.
- If the curve is good only at very low recall, the model may not be useful once the budget expands.
Counterexample:
- model A can have a better overall average precision because it ranks positives reasonably well across the whole list
- model B can still have better precision@k at the tiny operating point your reviewers actually use
That is why AP and top-k precision answer related but different questions.
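The divergence is easy to construct. In the synthetic sketch below (the score arrays are contrived for illustration), model A spreads its positives near the top of the list while model B nails two positives at ranks 1-2 and buries the other two at the bottom:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 4 positives among 20 items (indices 0-3 are positive)
y = np.array([1, 1, 1, 1] + [0] * 16)

# Model A: positives at ranks 2-5, one negative outranks them all
scores_a = np.array([0.90, 0.85, 0.80, 0.75]    # positives
                    + [0.95] + [0.10] * 15)     # negatives

# Model B: two positives at ranks 1-2, two buried at ranks 19-20
scores_b = np.array([0.99, 0.98, 0.02, 0.01]    # positives
                    + list(np.linspace(0.90, 0.10, 16)))  # negatives

def precision_at_k(y_true, scores, k):
    top = np.argsort(-scores)[:k]
    return y_true[top].mean()

ap_a = average_precision_score(y, scores_a)   # rewards the whole ranking
ap_b = average_precision_score(y, scores_b)
p2_a = precision_at_k(y, scores_a, 2)         # only the top-2 queue matters
p2_b = precision_at_k(y, scores_b, 2)
```

Model A wins on AP, model B wins on precision@2, so the metric choice decides which model "looks better."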
Budget Tables¶
Once the queue size is fixed, build a small table over candidate budgets.
```python
budgets = [0.01, 0.02, 0.05, 0.10]
rows = []
for budget in budgets:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    rows.append(
        {
            "budget": budget,
            "k": k,
            "review_precision": reviewed["target"].mean(),
            "captured_recall": reviewed["target"].sum() / df["target"].sum(),
        }
    )
budget_table = pd.DataFrame(rows)
```
That table is more useful than a single score because it tells you how the model behaves at several realistic operating points.
Useful follow-up:
- add the false-positive count
- add the number of positives captured
- compare the table against the prevalence or random-ranking floor
- compare it against one simple heuristic ranker if the task has an obvious hand-built signal
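Those follow-ups fit in the same loop. A self-contained sketch with synthetic data (prevalence, score model, and budgets are illustrative assumptions) that adds the false-positive count and the random-ranking floor:

```python
import numpy as np
import pandas as pd

# Hypothetical scored dataset: 3% prevalence, weakly informative scores
rng = np.random.default_rng(0)
n = 2000
target = (rng.random(n) < 0.03).astype(int)
score = rng.random(n) + 0.5 * target
df = pd.DataFrame({"target": target, "score": score})

prevalence = df["target"].mean()
rows = []
for budget in [0.01, 0.02, 0.05, 0.10]:
    k = max(1, int(len(df) * budget))
    reviewed = df.nlargest(k, "score")
    tp = int(reviewed["target"].sum())
    rows.append(
        {
            "budget": budget,
            "k": k,
            "true_positives": tp,
            "false_positives": k - tp,
            "review_precision": tp / k,
            "captured_recall": tp / df["target"].sum(),
            "random_floor": prevalence,  # expected precision of a random queue
        }
    )
budget_table = pd.DataFrame(rows)
```

Reading `review_precision` next to `random_floor` gives the lift at each budget without a separate calculation.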
Thresholds Versus Top-k Under Shift¶
A fixed threshold and a fixed top-k budget solve different problems.
- threshold: keeps the score meaning fixed, but the queue size can change under score drift
- top-k: keeps the queue size fixed, but the score cutoff can move as the score distribution shifts
In a stable operational queue, top-k is often the safer first control. In a calibrated risk policy, a threshold can be better. The important point is to decide which quantity should stay fixed when the base rate or score scale moves.
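The difference shows up immediately under a simple simulated drift. In this sketch (scores, threshold, and the uniform +0.5 shift are all contrived assumptions), a fixed threshold lets the queue balloon while top-k holds the queue size and moves the cutoff instead:

```python
import numpy as np

rng = np.random.default_rng(1)
scores_before = rng.normal(0.0, 1.0, 1000)
scores_after = scores_before + 0.5   # uniform upward score drift

threshold = 1.0
k = 50

# Fixed threshold: queue size changes under drift
queue_before = int((scores_before >= threshold).sum())
queue_after = int((scores_after >= threshold).sum())

# Fixed top-k: queue size constant, but the effective cutoff moves
cutoff_before = np.sort(scores_before)[-k]
cutoff_after = np.sort(scores_after)[-k]
```

Deciding which of these two failure modes is acceptable is exactly the "which quantity stays fixed" question.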
Operating Point¶
When you need a hard decision rule, convert scores into labels and inspect the confusion matrix at the chosen threshold.
```python
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    confusion_matrix,
)

pred = (scores >= threshold).astype(int)
cm = confusion_matrix(y_true, pred)
report = classification_report(y_true, pred, zero_division=0)
ConfusionMatrixDisplay.from_predictions(y_true, pred, normalize="true")
```
What to look for:
- `normalize="true"` helps compare misses and hits within each class.
- `classification_report(..., zero_division=0)` avoids noisy exceptions when a threshold predicts no positives, and `output_dict=True` is useful when you want to tabulate several thresholds.
- `precision_score` and `recall_score` are often the right pair when the class is imbalanced.
- `fbeta_score(beta=2)` is useful when recall matters more than precision.
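Tabulating several thresholds is a short loop. The sketch below runs on synthetic labels and scores (the threshold grid and score model are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical imbalanced labels and scores
rng = np.random.default_rng(2)
y_true = (rng.random(500) < 0.05).astype(int)
scores = rng.random(500) + 0.4 * y_true

rows = []
for threshold in [0.5, 0.7, 0.9, 1.1]:
    pred = (scores >= threshold).astype(int)
    rows.append(
        {
            "threshold": threshold,
            "precision": precision_score(y_true, pred, zero_division=0),
            "recall": recall_score(y_true, pred, zero_division=0),
            "f2": fbeta_score(y_true, pred, beta=2, zero_division=0),
        }
    )
sweep = pd.DataFrame(rows)
```

Raising the threshold can only shrink the predicted-positive set, so recall never increases down the table; precision usually moves the other way.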
Calibration And Threshold Sweeps¶
If predicted probabilities are poorly spread, a threshold chosen on one split may not transfer well.
```python
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay

calibrator = CalibratedClassifierCV(estimator=model, method="sigmoid", cv=5)
probabilities = calibrator.fit(X_train, y_train).predict_proba(X_valid)[:, 1]

# Plot against the validation labels the probabilities were produced for
CalibrationDisplay.from_predictions(y_valid, probabilities)
```
Use this when:
- the queue size changes frequently
- score ordering is stable but the cutoff is unstable
- the top scores look too compressed to separate cleanly
Practical rule:
- `predict_proba` gives you probabilities to threshold directly.
- `decision_function` is fine for ranking when probabilities are unavailable, but it is not a calibration-plot input by itself.
- use calibrated probabilities for `CalibrationDisplay.from_predictions(...)`, not arbitrary ranking scores.
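When you want numbers rather than a plot, `calibration_curve` returns the per-bin values directly. A minimal sketch with synthetic probabilities, where labels are drawn from the probabilities themselves so the result is well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(3)
prob = rng.random(2000)
y = (rng.random(2000) < prob).astype(int)  # labels sampled from prob

# frac_pos: observed positive rate per bin; mean_pred: mean predicted prob
frac_pos, mean_pred = calibration_curve(y, prob, n_bins=5)
```

For calibrated probabilities, `frac_pos` tracks `mean_pred` in each bin; large gaps mean a threshold on these scores will not behave like a probability cutoff.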
Slice-Aware Checks¶
A good review queue can still fail on a subgroup. Always check whether the budget is being consumed by the right slice.
```python
df = df.assign(reviewed=df["score"] >= threshold)
slice_table = df.groupby("slice").agg(
    reviewed_rate=("reviewed", "mean"),
    positive_rate=("target", "mean"),
    count=("target", "size"),
)
slice_budget = pd.crosstab(df["slice"], df["reviewed"], normalize="index")
```
What matters here:
- the smallest slice may need its own threshold check
- a slice with a good average score can still have poor top-k capture
- counts must stay next to rates, or the table becomes misleading
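The top-k version of the same check shows how a single shared queue lands on each slice. This sketch uses synthetic data (the slice names, sizes, and random placeholder scores are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a majority slice "a" and a minority slice "b"
rng = np.random.default_rng(4)
n = 1000
df = pd.DataFrame(
    {
        "slice": rng.choice(["a", "b"], size=n, p=[0.8, 0.2]),
        "target": (rng.random(n) < 0.05).astype(int),
        "score": rng.random(n),
    }
)

# One shared top-k queue, then check how it lands per slice
k = 50
df["reviewed"] = df.index.isin(df.nlargest(k, "score").index)

per_slice = df.groupby("slice").agg(
    positives=("target", "sum"),
    reviewed_count=("reviewed", "sum"),
)
captured = df[df["reviewed"]].groupby("slice")["target"].sum()
per_slice["captured"] = captured.reindex(per_slice.index, fill_value=0)
```

If the minority slice holds positives but receives almost none of the shared budget, that is the signal for a per-slice threshold check.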
Failure Pattern¶
Picking the highest-accuracy model and discovering that it barely improves the review queue because the positives are rare.
Another common mistake is choosing a threshold from validation and assuming it will stay optimal if the queue size, base rate, or subgroup mix changes.
Practice¶
- Compare a random-ranking or simple heuristic baseline against a learned ranked model using `average_precision_score`.
- Evaluate three or four review budgets and write down the queue precision and captured recall for each one.
- Compare the same threshold before and after calibration.
- Build one per-slice budget table and identify the weakest slice.
- Explain whether your operating point should maximize precision, recall, or `fbeta_score`.
- Say which result would convince you that the model is not worth deploying.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform. Inspect the queue at multiple budgets, then compare the thresholded confusion matrix against the ranked top-k view.
Library Notes¶
- `average_precision_score` is usually more useful than accuracy when positives are rare.
- `precision_recall_curve` helps you choose thresholds from a score list.
- `ConfusionMatrixDisplay.from_predictions` makes thresholded errors easier to read.
- `CalibratedClassifierCV` helps when score thresholds are unstable.
- `DataFrame.nlargest` is a clean way to build top-k review sets.
- `np.argsort` is the lower-level option when you need explicit ranking control.
- `pd.cut` and `pd.qcut` are useful when you want budget bands instead of a single cutoff.
Questions To Ask¶
- How many positives can the queue catch at this budget?
- Is the top of the ranking actually better, or just more confident?
- What happens if the review capacity changes next month?
- Would a small gain in recall be worth a large drop in precision?
- Which threshold would you explain to a human reviewer?
- Does the same threshold still work on the weakest slice?
- Is the model better than a simpler ranking baseline at the operating points that matter?
Longer Connection¶
Continue with Imbalanced Triage and Review Budgets for the full leaderboard, budget curve, slice check, and submission workflow.