Selective Prediction and Review Budgets

What This Is

Selective prediction means the model does not need to decide every case automatically. It can accept high-confidence cases, defer borderline cases to review, and abstain when the score is too weak to trust.

This is a policy problem as much as a modeling problem. The real question is not only "How accurate is the model?" but also "How much can we safely automate before the error rate or review load becomes unacceptable?"

Use this page when the main decision is where to put the abstain-or-review band. If the real question is “which top part of the queue fits inside a fixed budget,” start with Imbalanced Metrics and Review Budgets first.

When You Use It

  • human-in-the-loop workflows
  • limited reviewer capacity
  • high-cost mistakes where abstention is allowed
  • queues where only the top-ranked cases should be reviewed first

Core Signals

  • predict_proba gives a probability-like score for thresholding.
  • decision_function gives a raw confidence score when probabilities are unavailable.
  • precision_recall_curve shows the tradeoff between catching positives and keeping precision high.
  • average_precision_score summarizes ranking quality when positives are rare.
  • a threshold sweep table built from confusion_matrix helps compare many cutoffs without hand-writing every confusion table.
  • calibration_curve checks whether the scores behave like probabilities or only like rankings.
  • CalibratedClassifierCV helps when the scores are useful but poorly calibrated.
  • TunedThresholdClassifierCV helps when you want the threshold chosen by cross-validation instead of by guesswork.
  • balanced_accuracy_score is a useful sanity check when the classes are imbalanced.

Tooling

  • predict_proba
  • decision_function
  • precision_recall_curve
  • average_precision_score
  • confusion_matrix
  • calibration_curve
  • CalibratedClassifierCV
  • TunedThresholdClassifierCV
  • balanced_accuracy_score
  • pd.DataFrame
  • DataFrame.assign
  • DataFrame.groupby
  • DataFrame.agg
  • DataFrame.quantile
  • pd.cut
  • np.linspace
  • np.quantile
  • np.clip

Getting Scores

scores = clf.predict_proba(X_valid)[:, 1]

Use predict_proba when the model exposes probabilities and you want a direct cutoff on the positive class.

scores = clf.decision_function(X_valid)

Use decision_function when the estimator does not expose calibrated probabilities. The sign and ranking still matter, and many thresholding tools can work with raw scores.

Practical rule:

  • use predict_proba when you care about probability-like outputs and coverage bands
  • use decision_function when you mainly need a ranking signal
  • do not pass raw decision_function outputs into calibration plots that expect probabilities
  • if the base model only exposes raw scores, calibrate on training data first and then plot or threshold the resulting probabilities

Three-Way Policy

Selective prediction is usually a three-way policy, not a disguised binary classifier:

  • automatic negative
  • review or abstain
  • automatic positive

The review band is an abstention region. It is not the same thing as predicting negative.

low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high

For a policy like this, inspect:

  • automation rate: fraction of cases decided automatically
  • automated error rate: mistakes inside the automated bands only
  • auto-positive precision
  • auto-negative miss rate
  • review positive rate: whether the abstention band is catching the hard cases
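The five policy metrics above can be computed directly from the three masks. A sketch with synthetic scores and labels standing in for real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
probabilities = rng.random(1000)  # stand-in for model scores
y_true = (probabilities + rng.normal(0, 0.3, 1000) > 0.5).astype(int)

low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high

automated = auto_negative | auto_positive
automation_rate = automated.mean()
# Mistakes inside the automated bands only: positives auto-rejected, negatives auto-accepted.
automated_errors = (auto_negative & (y_true == 1)) | (auto_positive & (y_true == 0))
automated_error_rate = automated_errors.sum() / max(automated.sum(), 1)
auto_positive_precision = (auto_positive & (y_true == 1)).sum() / max(auto_positive.sum(), 1)
auto_negative_miss_rate = (auto_negative & (y_true == 1)).sum() / max(auto_negative.sum(), 1)
review_positive_rate = y_true[review].mean()
```

Note that automated_error_rate divides by the automated count, not the full queue; the review band is deliberately excluded.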

Threshold Sweep

A sweep turns one score vector into a policy table.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

thresholds = np.linspace(0.05, 0.95, 19)
rows = []

for t in thresholds:
    accept = scores >= t  # cases decided automatically at this cutoff
    y_pred = accept.astype(int)
    tn, fp, fn, tp = confusion_matrix(y_valid, y_pred, labels=[0, 1]).ravel()
    rows.append(
        {
            "threshold": t,
            "coverage": accept.mean(),
            "review_rate": 1.0 - accept.mean(),
            "precision": tp / max(tp + fp, 1),  # max(..., 1) guards against empty cells
            "recall": tp / max(tp + fn, 1),
            "balanced_accuracy": (tp / max(tp + fn, 1) + tn / max(tn + fp, 1)) / 2,
        }
    )

policy_table = pd.DataFrame(rows)

What this buys you for a single-threshold positive policy:

  • coverage tells you how much of the queue is above the threshold
  • review_rate tells you how much still needs review
  • precision tells you whether reviewed positives are worth the effort
  • recall tells you what fraction of positives the policy still catches
  • balanced_accuracy is a quick check that the policy is not ignoring one class

If the budget is fixed, pick the row that satisfies the review limit first and only then compare the remaining metrics.

This sweep is useful for one-threshold triage. It is not a full replacement for a two-threshold abstention policy.
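Picking the budget-feasible row can also be done in one line. A sketch with a minimal stand-in for the sweep output and a hypothetical 30% review budget:

```python
import pandas as pd

# Minimal stand-in for the policy_table produced by the sweep.
policy_table = pd.DataFrame({
    "threshold": [0.3, 0.5, 0.7],
    "review_rate": [0.45, 0.28, 0.15],
    "precision": [0.60, 0.72, 0.85],
    "recall": [0.90, 0.80, 0.55],
})

budget = 0.30  # hypothetical: reviewers can absorb at most 30% of the queue
feasible = policy_table[policy_table["review_rate"] <= budget]
# Among feasible rows, prefer the highest recall (or precision, depending on the task).
best_row = feasible.sort_values("recall", ascending=False).iloc[0]
# → threshold 0.5 here: the highest recall among rows within budget
```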

Review Bands

A single threshold is often too brittle for review workflows. A low-confidence band is easier to explain and easier to monitor.

low, high = np.quantile(scores, [0.2, 0.8])

auto_negative = scores < low
review = (scores >= low) & (scores <= high)
auto_positive = scores > high

coverage = float((auto_negative | auto_positive).mean())
review_load = float(review.mean())

This pattern is useful when:

  • you want the most obvious cases to bypass review
  • the middle of the score range is where errors cluster
  • reviewers should only see borderline cases

Useful trick:

  • if the score distribution shifts, recompute the band from the new validation window
  • if too many scores pile up near the center, widen the review band instead of pretending the policy is stable
  • if the queue is too large, shrink the band and check which slice loses coverage first

Do not report the whole middle band as if it were automatic negatives. The point of the band is precisely that the model is refusing to decide there.

Calibration Check

Thresholds are easier to defend when the scores are calibrated. If the score says 0.8, the observed positive rate should be close to 0.8 in the corresponding bin.

from sklearn.calibration import CalibratedClassifierCV, calibration_curve

calibrated = CalibratedClassifierCV(estimator=clf, cv=5, method="sigmoid")
calibrated.fit(X_train, y_train)
valid_prob = calibrated.predict_proba(X_valid)[:, 1]
prob_true, prob_pred = calibration_curve(y_valid, valid_prob, n_bins=10, strategy="quantile")

Interpretation:

  • prob_pred is the mean predicted probability in each bin
  • prob_true is the actual fraction of positives in each bin
  • if the model is overconfident, a threshold that looked safe may be too aggressive in practice

If the calibration curve is poor, calibrate before freezing the policy and before interpreting the score axis as probability.
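One simple way to summarize "poor" is the mean absolute gap between prob_pred and prob_true across bins, a crude expected-calibration-error style number; the synthetic overconfident scorer below is only an illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
scores = rng.random(5000)
# Overconfident scorer: the true positive rate is pulled toward 0.5 relative to the score.
y = (rng.random(5000) < (0.25 + 0.5 * scores)).astype(int)

prob_true, prob_pred = calibration_curve(y, scores, n_bins=10, strategy="quantile")
# Mean absolute gap between predicted and observed rates, a crude calibration summary.
gap = float(np.abs(prob_true - prob_pred).mean())
```

A gap near zero means the score axis can be read as probability; a large gap means calibrate before freezing any cutoff.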

Choosing The Cutoff

If you want the model to choose a threshold for you, use a threshold tuner rather than hand-picking a number from one validation run.

from sklearn.model_selection import TunedThresholdClassifierCV

tuned = TunedThresholdClassifierCV(
    estimator=clf,
    scoring="balanced_accuracy",
    response_method="predict_proba",
    thresholds=100,
    cv=5,
)

Use this when:

  • the base model is already sensible
  • the main question is the operating point
  • you want the cutoff selected by cross-validation instead of a single split

If the problem is heavily imbalanced, also check average_precision_score and precision_recall_curve so the chosen threshold matches the queue you actually care about.

Freeze-And-Test Protocol

A selective policy should be frozen like a model:

  1. choose the base scorer on training data
  2. calibrate if needed inside the training boundary
  3. choose the low and high cutoffs on validation
  4. write down the expected automation rate and review load
  5. evaluate once on the locked test or later time window

If the cutoffs move every time the queue looks uncomfortable, the workflow no longer has honest evidence.
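Steps 3 to 5 can be made concrete by writing the cutoffs and expected numbers into a frozen record, then comparing once against the locked window. A sketch with synthetic score vectors standing in for validation and test:

```python
import numpy as np

rng = np.random.default_rng(0)
valid_scores = rng.random(1000)  # stand-in for the validation window
test_scores = rng.random(1000)   # stand-in for the locked test window

# Steps 3-4: freeze the cutoffs and the expected operating numbers from validation.
frozen = {
    "low": float(np.quantile(valid_scores, 0.2)),
    "high": float(np.quantile(valid_scores, 0.8)),
}
frozen["expected_automation_rate"] = float(
    ((valid_scores < frozen["low"]) | (valid_scores > frozen["high"])).mean()
)

# Step 5: evaluate once on the locked window with the frozen numbers unchanged.
observed_automation_rate = float(
    ((test_scores < frozen["low"]) | (test_scores > frozen["high"])).mean()
)
drift = abs(observed_automation_rate - frozen["expected_automation_rate"])
```

If drift is large, the honest move is to report it, not to quietly move the cutoffs.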

Cost View

Three-way policies are easier to defend when the costs are explicit:

  • automatic positive false alarm cost
  • automatic negative miss cost
  • human review cost per abstained case

That turns the policy question from “which threshold looks nice?” into “which action mix is cheapest or safest under the current constraints?”
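A sketch of that action-mix comparison; the unit costs here are hypothetical placeholders for numbers that must come from the business context:

```python
import numpy as np

# Hypothetical unit costs; the real values come from the business context.
FP_COST, FN_COST, REVIEW_COST = 5.0, 50.0, 1.0

def policy_cost(y_true, scores, low, high):
    """Total expected cost of a two-threshold policy at the given cutoffs."""
    auto_negative = scores <= low
    auto_positive = scores >= high
    review = ~(auto_negative | auto_positive)
    false_alarms = (auto_positive & (y_true == 0)).sum()
    misses = (auto_negative & (y_true == 1)).sum()
    return FP_COST * false_alarms + FN_COST * misses + REVIEW_COST * review.sum()

rng = np.random.default_rng(0)
scores = rng.random(1000)
y_true = (rng.random(1000) < scores).astype(int)

# Compare a wide review band against a narrow one under the same costs.
wide = policy_cost(y_true, scores, 0.1, 0.9)
narrow = policy_cost(y_true, scores, 0.4, 0.6)
```

Sweeping low and high through this function makes the cheapest or safest action mix explicit instead of eyeballed.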

Budget-Aware Monitoring

The policy should be monitored the same way the model is monitored.

df = pd.DataFrame({"score": scores, "y_true": y_valid, "slice": slice_name})
df["band"] = pd.cut(df["score"], bins=np.linspace(0.0, 1.0, 6), include_lowest=True)

band_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
      .groupby("band")
      .agg(
          count=("y_true", "size"),
          positive_rate=("y_true", "mean"),
          review_rate=("review", "mean"),
      )
)

This is where pandas is useful:

  • assign keeps the policy column close to the data
  • cut makes score bands visible instead of hiding them inside a single average
  • groupby(...).agg(...) shows which parts of the score range are driving the review load

Per-Slice Checks

A policy that looks good overall can still overload one subgroup or miss one subgroup entirely.

slice_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
      .groupby("slice")
      .agg(
          count=("y_true", "size"),
          review_rate=("review", "mean"),
          positive_rate=("y_true", "mean"),
      )
)

What to look for:

  • slices with tiny counts but huge review loads
  • slices with high positive rates but low coverage
  • slices where calibration is worse than the global average

If one slice is much weaker, the next move may be to change the threshold policy, not the model family.

Failure Patterns

Choosing a threshold from the validation set and then treating it as permanent.

Using accuracy alone when the positive class is rare. A model can be accurate and still be useless for the review queue.

Ignoring calibration. If the scores are miscalibrated, the threshold is defending a number, not a policy.

Checking only the overall queue and never checking slice-level load. That can hide a serious operational bottleneck.

Stronger Evidence

Stronger evidence usually means at least one of these:

  • the policy still works after calibration
  • the policy still works under a different seed or time window
  • the review queue stays within budget across important slices
  • the policy beats a simpler baseline at the same coverage level

For applied evaluation work, this matters more than a single headline metric because real-world deployment often rewards operational consistency, not just a lucky cutoff.

Practice

  1. Compare a one-threshold policy against a two-band abstention policy.
  2. Build a threshold sweep table and choose the best row under a fixed review budget.
  3. Inspect a calibration curve and decide whether the scores are trustworthy.
  4. Compare average_precision_score against plain accuracy on an imbalanced queue.
  5. Check how review load changes across at least one important slice.
  6. Explain which metric would make you stop tuning and freeze the policy.
  7. Identify one case where a wider review band is safer than a narrower one.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect how coverage changes as the review band becomes stricter.

Questions To Ask

  1. How much of the queue can reviewers actually handle?
  2. Are you optimizing the policy for the cases that matter most?
  3. Does the chosen cutoff still make sense after calibration?
  4. Which slice would break first if the score distribution drifted?
  5. Would a smaller review band reduce overload without hiding the hardest cases?
  6. Is the threshold stable enough to freeze, or are you still guessing?
  7. Would a ranking metric or a coverage metric better match the task?

Longer Connection

Continue with Imbalanced Triage and Review Budgets for the longer rare-event queue and operating-point workflow.