Selective Prediction and Review Budgets¶
What This Is¶
Selective prediction means the model does not need to decide every case automatically. It can accept high-confidence cases, defer borderline cases to review, and abstain when the score is too weak to trust.
This is a policy problem as much as a modeling problem. The real question is not only "How accurate is the model?" but also "How much can we safely automate before the error rate or review load becomes unacceptable?"
Use this page when the main decision is the abstain or review band. If the real question is “which top part of the queue fits inside a fixed budget,” start with Imbalanced Metrics and Review Budgets first.
When You Use It¶
- human-in-the-loop workflows
- limited reviewer capacity
- high-cost mistakes where abstention is allowed
- queues where only the top-ranked cases should be reviewed first
Core Signals¶
- `predict_proba` gives a probability-like score for thresholding.
- `decision_function` gives a raw confidence score when probabilities are unavailable.
- `precision_recall_curve` shows the tradeoff between catching positives and keeping precision high.
- `average_precision_score` summarizes ranking quality when positives are rare.
- a threshold sweep table built from `confusion_matrix` helps compare many cutoffs without hand-writing every confusion table.
- `calibration_curve` checks whether the scores behave like probabilities or only like rankings.
- `CalibratedClassifierCV` helps when the scores are useful but poorly calibrated.
- `TunedThresholdClassifierCV` helps when you want the threshold chosen by cross-validation instead of by guesswork.
- `balanced_accuracy_score` is a useful sanity check when the classes are imbalanced.
Tooling¶
`predict_proba`, `decision_function`, `precision_recall_curve`, `average_precision_score`, `confusion_matrix`, `calibration_curve`, `CalibratedClassifierCV`, `TunedThresholdClassifierCV`, `balanced_accuracy_score`, `pd.DataFrame`, `DataFrame.assign`, `DataFrame.groupby`, `DataFrame.agg`, `DataFrame.quantile`, `pd.cut`, `np.linspace`, `np.quantile`, `np.clip`
Getting Scores¶
scores = clf.predict_proba(X_valid)[:, 1]
Use predict_proba when the model exposes probabilities and you want a direct cutoff on the positive class.
scores = clf.decision_function(X_valid)
Use decision_function when the estimator does not expose calibrated probabilities. The sign and ranking still matter, and many thresholding tools can work with raw scores.
Practical rule:
- use `predict_proba` when you care about probability-like outputs and coverage bands
- use `decision_function` when you mainly need a ranking signal
- do not pass raw `decision_function` outputs into calibration plots that expect probabilities
- if the base model only exposes raw scores, calibrate on training data first and then plot or threshold the resulting probabilities
Three-Way Policy¶
Selective prediction is usually a three-way policy, not a disguised binary classifier:
- automatic negative
- review or abstain
- automatic positive
The review band is an abstention region. It is not the same thing as predicting negative.
low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high
For a policy like this, inspect:
- automation rate: fraction of cases decided automatically
- automated error rate: mistakes inside the automated bands only
- auto-positive precision
- auto-negative miss rate
- review positive rate: whether the abstention band is catching the hard cases
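The metrics above fall out of a few boolean masks. A minimal sketch on synthetic scores and labels; the variable names are illustrative, not a library API:

```python
import numpy as np

rng = np.random.default_rng(0)
probabilities = rng.uniform(size=1000)  # stand-in model scores
y_true = (probabilities + rng.normal(0.0, 0.2, size=1000) > 0.5).astype(int)

low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high

automated = auto_negative | auto_positive
automation_rate = automated.mean()

# Mistakes inside the automated bands only; review cases are excluded.
auto_pred = np.where(auto_positive, 1, 0)
automated_error_rate = (auto_pred[automated] != y_true[automated]).mean()

auto_positive_precision = y_true[auto_positive].mean()
auto_negative_miss_rate = y_true[auto_negative].mean()
review_positive_rate = y_true[review].mean()
```

Note that the automated error rate deliberately ignores the review band: abstentions are paid for in reviewer time, not counted as model mistakes.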
Threshold Sweep¶
A sweep turns one score vector into a policy table.
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
thresholds = np.linspace(0.05, 0.95, 19)
rows = []
for t in thresholds:
    accept = scores >= t
    y_pred = accept.astype(int)
    tn, fp, fn, tp = confusion_matrix(y_valid, y_pred, labels=[0, 1]).ravel()
    rows.append(
        {
            "threshold": t,
            "coverage": accept.mean(),
            "review_rate": 1.0 - accept.mean(),
            "precision": tp / max(tp + fp, 1),
            "recall": tp / max(tp + fn, 1),
            "balanced_accuracy": (tp / max(tp + fn, 1) + tn / max(tn + fp, 1)) / 2,
        }
    )
policy_table = pd.DataFrame(rows)
What this buys you for a single-threshold positive policy:
- `coverage` tells you how much of the queue is above the threshold
- `review_rate` tells you how much still needs review
- `precision` tells you whether reviewed positives are worth the effort
- `recall` tells you what fraction of positives the policy still catches
- `balanced_accuracy` is a quick check that the policy is not ignoring one class
If the budget is fixed, pick the row that satisfies the review limit first and only then compare the remaining metrics.
This sweep is useful for one-threshold triage. It is not the full replacement for a two-threshold abstention policy.
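Picking the budget-feasible row is a small filter-then-sort step on the sweep table. A sketch with a hypothetical, hand-written `policy_table`; in practice use the table built by the sweep above:

```python
import pandas as pd

# Hypothetical sweep output with only the columns needed here.
policy_table = pd.DataFrame(
    {
        "threshold": [0.3, 0.5, 0.7],
        "review_rate": [0.60, 0.35, 0.15],
        "recall": [0.95, 0.85, 0.60],
    }
)

budget = 0.40  # reviewers can handle at most 40% of the queue

# First satisfy the review limit, then compare the remaining metrics.
feasible = policy_table[policy_table["review_rate"] <= budget]
best_row = feasible.sort_values("recall", ascending=False).iloc[0]
```

The order matters: filtering on the budget first guarantees the chosen row is operable, even if an infeasible row has prettier metrics.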
Review Bands¶
A single threshold is often too brittle for review workflows. A low-confidence band is easier to explain and easier to monitor.
low, high = np.quantile(scores, [0.2, 0.8])
auto_negative = scores < low
review = (scores >= low) & (scores <= high)
auto_positive = scores > high
coverage = float((auto_negative | auto_positive).mean())
review_load = float(review.mean())
This pattern is useful when:
- you want the most obvious cases to bypass review
- the middle of the score range is where errors cluster
- reviewers should only see borderline cases
Useful tricks:
- if the score distribution shifts, recompute the band from the new validation window
- if too many scores sit close to the center, widen the review band instead of pretending the policy is stable
- if the queue is too large, shrink the band and check which slice loses coverage first
Do not report the whole middle band as if it were automatic negatives. The point of the band is precisely that the model is refusing to decide there.
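Recomputing the band from a new window is cheap. A sketch on synthetic score windows showing how a frozen quantile band drifts away from its designed review load, while a recomputed band restores it; the distributions and the shift are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
old_scores = rng.beta(2, 2, size=2000)  # original validation window
# Drifted window: same shape, shifted up, clipped back into [0, 1].
new_scores = np.clip(rng.beta(2, 2, size=2000) + 0.15, 0.0, 1.0)

# Band frozen on the old window (designed for a 60% review load).
low, high = np.quantile(old_scores, [0.2, 0.8])
new_review_load = ((new_scores >= low) & (new_scores <= high)).mean()

# Band recomputed on the new window restores the designed load.
low2, high2 = np.quantile(new_scores, [0.2, 0.8])
recomputed_load = ((new_scores >= low2) & (new_scores <= high2)).mean()
```

Comparing `new_review_load` against the designed load is exactly the monitoring signal that tells you the band is stale.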
Calibration Check¶
Thresholds are easier to defend when the scores are calibrated. If the score says 0.8, the observed positive rate should be close to 0.8 in the corresponding bin.
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
calibrated = CalibratedClassifierCV(estimator=clf, cv=5, method="sigmoid")
calibrated.fit(X_train, y_train)
valid_prob = calibrated.predict_proba(X_valid)[:, 1]
prob_true, prob_pred = calibration_curve(y_valid, valid_prob, n_bins=10, strategy="quantile")
Interpretation:
- `prob_pred` is the mean predicted probability in each bin
- `prob_true` is the actual fraction of positives in each bin
- if the model is overconfident, a threshold that looked safe may be too aggressive in practice
If the calibration curve is poor, calibrate before freezing the policy and before interpreting the score axis as probability.
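One way to reduce the curve to a single monitorable number is the largest bin-wise gap between predicted and observed rates. A sketch on synthetic scores that are calibrated by construction; `max_gap` is an illustrative name, not a library output:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
valid_prob = rng.uniform(size=5000)
# Labels drawn with probability equal to the score: calibrated by construction.
y_valid = (rng.uniform(size=5000) < valid_prob).astype(int)

prob_true, prob_pred = calibration_curve(
    y_valid, valid_prob, n_bins=10, strategy="quantile"
)
# One-number summary of miscalibration across the bins.
max_gap = np.abs(prob_true - prob_pred).max()
```

For well-calibrated scores the gap stays near the bin-level sampling noise; a persistently large gap says the score axis cannot be read as probability.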
Choosing The Cutoff¶
If you want the model to choose a threshold for you, use a threshold tuner rather than hand-picking a number from one validation run.
from sklearn.model_selection import TunedThresholdClassifierCV
tuned = TunedThresholdClassifierCV(
    estimator=clf,
    scoring="balanced_accuracy",
    response_method="predict_proba",
    thresholds=100,
    cv=5,
)
Use this when:
- the base model is already sensible
- the main question is the operating point
- you want the cutoff selected by cross-validation instead of a single split
If the problem is heavily imbalanced, also check average_precision_score and precision_recall_curve so the chosen threshold matches the queue you actually care about.
Freeze-And-Test Protocol¶
A selective policy should be frozen like a model:
- choose the base scorer on training data
- calibrate if needed inside the training boundary
- choose the low and high cutoffs on validation
- write down the expected automation rate and review load
- evaluate once on the locked test or later time window
If the cutoffs move every time the queue looks uncomfortable, the workflow no longer has honest evidence.
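One way to make freezing concrete is to write the cutoffs and the expected operating numbers into a single record that the one-shot test evaluation is later compared against. A sketch with hypothetical values:

```python
# Frozen on validation; the numbers here are hypothetical placeholders.
frozen_policy = {
    "low": 0.22,
    "high": 0.81,
    "expected_automation_rate": 0.63,
    "expected_review_load": 0.37,
}


def apply_policy(score: float, policy: dict) -> str:
    """Map a single score to one of the three actions."""
    if score <= policy["low"]:
        return "auto_negative"
    if score >= policy["high"]:
        return "auto_positive"
    return "review"
```

If the locked test window produces automation or review rates far from the recorded expectations, that is a finding to report, not a cue to quietly move the cutoffs.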
Cost View¶
Three-way policies are easier to defend when the costs are explicit:
- automatic positive false alarm cost
- automatic negative miss cost
- human review cost per abstained case
That turns the policy question from “which threshold looks nice?” into “which action mix is cheapest or safest under the current constraints?”
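With explicit costs, comparing policies becomes an expected-cost calculation over the three bands. A sketch on synthetic data; the cost values and variable names are assumptions for illustration:

```python
import numpy as np

# Hypothetical per-case costs; set these from the actual operation.
cost_false_alarm = 5.0   # automatic positive that is actually negative
cost_miss = 50.0         # automatic negative that is actually positive
cost_review = 1.0        # every abstained case pays reviewer time

rng = np.random.default_rng(0)
scores = rng.uniform(size=1000)
y_true = (rng.uniform(size=1000) < scores).astype(int)

low, high = 0.2, 0.8
auto_neg = scores <= low
auto_pos = scores >= high
review = ~(auto_neg | auto_pos)

total_cost = (
    cost_false_alarm * (auto_pos & (y_true == 0)).sum()
    + cost_miss * (auto_neg & (y_true == 1)).sum()
    + cost_review * review.sum()
)
mean_cost = total_cost / scores.size
```

Sweeping `low` and `high` over a grid and minimizing `mean_cost` is then a direct answer to "which action mix is cheapest under the current constraints?"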
Budget-Aware Monitoring¶
The policy should be monitored the same way the model is monitored.
df = pd.DataFrame({"score": scores, "y_true": y_valid, "slice": slice_name})
df["band"] = pd.cut(df["score"], bins=np.linspace(0.0, 1.0, 6), include_lowest=True)
band_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
    .groupby("band")
    .agg(
        count=("y_true", "size"),
        positive_rate=("y_true", "mean"),
        review_rate=("review", "mean"),
    )
)
This is where pandas is useful:
- `assign` keeps the policy column close to the data
- `cut` makes score bands visible instead of hiding them inside a single average
- `groupby(...).agg(...)` shows which parts of the score range are driving the review load
Per-Slice Checks¶
A policy that looks good overall can still overload one subgroup or miss one subgroup entirely.
slice_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
    .groupby("slice")
    .agg(
        count=("y_true", "size"),
        review_rate=("review", "mean"),
        positive_rate=("y_true", "mean"),
    )
)
What to look for:
- slices with tiny counts but huge review loads
- slices with high positive rates but low coverage
- slices where calibration is worse than the global average
If one slice is much weaker, the next move may be to change the threshold policy, not the model family.
Failure Patterns¶
Choosing a threshold from the validation set and then treating it as permanent.
Using accuracy alone when the positive class is rare. A model can be accurate and still be useless for the review queue.
Ignoring calibration. If the scores are miscalibrated, the threshold is defending a number, not a policy.
Checking only the overall queue and never checking slice-level load. That can hide a serious operational bottleneck.
Stronger Evidence¶
Stronger evidence usually means at least one of these:
- the policy still works after calibration
- the policy still works under a different seed or time window
- the review queue stays within budget across important slices
- the policy beats a simpler baseline at the same coverage level
For applied evaluation work, this matters more than a single headline metric because real-world deployment often rewards operational consistency, not just a lucky cutoff.
Practice¶
- Compare a one-threshold policy against a two-band abstention policy.
- Build a threshold sweep table and choose the best row under a fixed review budget.
- Inspect a calibration curve and decide whether the scores are trustworthy.
- Compare `average_precision_score` against plain accuracy on an imbalanced queue.
- Check how review load changes across at least one important slice.
- Explain which metric would make you stop tuning and freeze the policy.
- Identify one case where a wider review band is safer than a narrower one.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Run the same idea in the browser:
Inspect how coverage changes as the review band becomes stricter.
Questions To Ask¶
- How much of the queue can reviewers actually handle?
- Are you optimizing the policy for the cases that matter most?
- Does the chosen cutoff still make sense after calibration?
- Which slice would break first if the score distribution drifted?
- Would a smaller review band reduce overload without hiding the hardest cases?
- Is the threshold stable enough to freeze, or are you still guessing?
- Would a ranking metric or a coverage metric better match the task?
Longer Connection¶
Continue with Imbalanced Triage and Review Budgets for the longer rare-event queue and operating-point workflow.