Selective Prediction and Review Budgets

What This Is

Selective prediction means the model does not need to decide every case automatically. It can accept high-confidence cases, defer borderline cases to review, and abstain when the score is too weak to trust.

This is a policy problem as much as a modeling problem. The real question is not only "How accurate is the model?" but also "How much can we safely automate before the error rate or review load becomes unacceptable?"

Use this page when the main decision is where to put the abstain-or-review band. If the real question is “which top part of the queue fits inside a fixed budget,” start with Imbalanced Metrics and Review Budgets first.

When You Use It

  • human-in-the-loop workflows
  • limited reviewer capacity
  • high-cost mistakes where abstention is allowed
  • queues where only the top-ranked cases should be reviewed first

Core Signals

  • predict_proba gives a probability-like score for thresholding.
  • decision_function gives a raw confidence score when probabilities are unavailable.
  • precision_recall_curve shows the tradeoff between catching positives and keeping precision high.
  • average_precision_score summarizes ranking quality when positives are rare.
  • a threshold sweep table built from confusion_matrix helps compare many cutoffs without hand-writing every confusion table.
  • calibration_curve checks whether the scores behave like probabilities or only like rankings.
  • CalibratedClassifierCV helps when the scores are useful but poorly calibrated.
  • TunedThresholdClassifierCV helps when you want the threshold chosen by cross-validation instead of by guesswork.
  • balanced_accuracy_score is a useful sanity check when the classes are imbalanced.

Tooling

  • predict_proba
  • decision_function
  • precision_recall_curve
  • average_precision_score
  • confusion_matrix
  • calibration_curve
  • CalibratedClassifierCV
  • TunedThresholdClassifierCV
  • balanced_accuracy_score
  • pd.DataFrame
  • DataFrame.assign
  • DataFrame.groupby
  • DataFrame.agg
  • DataFrame.quantile
  • pd.cut
  • np.linspace
  • np.quantile
  • np.clip

Getting Scores

scores = clf.predict_proba(X_valid)[:, 1]

Use predict_proba when the model exposes probabilities and you want a direct cutoff on the positive class.

scores = clf.decision_function(X_valid)

Use decision_function when the estimator does not expose calibrated probabilities. The sign and ranking still matter, and many thresholding tools can work with raw scores.

Practical rule:

  • use predict_proba when you care about probability-like outputs and coverage bands
  • use decision_function when you mainly need a ranking signal
  • do not pass raw decision_function outputs into calibration plots that expect probabilities
  • if the base model only exposes raw scores, calibrate on training data first and then plot or threshold the resulting probabilities

Three-Way Policy

Selective prediction is usually a three-way policy, not a disguised binary classifier:

  • automatic negative
  • review or abstain
  • automatic positive

The review band is an abstention region. It is not the same thing as predicting negative.

low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high

For a policy like this, inspect:

  • automation rate: fraction of cases decided automatically
  • automated error rate: mistakes inside the automated bands only
  • auto-positive precision
  • auto-negative miss rate
  • review positive rate: whether the abstention band is catching the hard cases
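The five policy metrics above can be computed directly from the three masks. A sketch with synthetic scores and labels standing in for real ones:

```python
import numpy as np

rng = np.random.default_rng(0)
probabilities = rng.random(1000)  # stand-in for model scores
y_true = (probabilities + rng.normal(0, 0.3, 1000) > 0.5).astype(int)

low, high = 0.20, 0.80
auto_negative = probabilities <= low
review = (probabilities > low) & (probabilities < high)
auto_positive = probabilities >= high

automated = auto_negative | auto_positive
automation_rate = automated.mean()
# Mistakes inside the automated bands only: positives auto-rejected, negatives auto-accepted.
automated_errors = (auto_negative & (y_true == 1)) | (auto_positive & (y_true == 0))
automated_error_rate = automated_errors.sum() / max(automated.sum(), 1)
auto_positive_precision = (auto_positive & (y_true == 1)).sum() / max(auto_positive.sum(), 1)
auto_negative_miss_rate = (auto_negative & (y_true == 1)).sum() / max(auto_negative.sum(), 1)
review_positive_rate = y_true[review].mean()
```

Note that automated_error_rate divides by the automated count, not the full queue; the review band is deliberately excluded.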

Threshold Sweep

A sweep turns one score vector into a policy table.

import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

thresholds = np.linspace(0.05, 0.95, 19)
rows = []

for t in thresholds:
    accept = scores >= t  # cases decided automatically at this cutoff
    y_pred = accept.astype(int)
    tn, fp, fn, tp = confusion_matrix(y_valid, y_pred, labels=[0, 1]).ravel()
    rows.append(
        {
            "threshold": t,
            "coverage": accept.mean(),
            "review_rate": 1.0 - accept.mean(),
            "precision": tp / max(tp + fp, 1),  # max(..., 1) guards against empty cells
            "recall": tp / max(tp + fn, 1),
            "balanced_accuracy": (tp / max(tp + fn, 1) + tn / max(tn + fp, 1)) / 2,
        }
    )

policy_table = pd.DataFrame(rows)

What this buys you for a single-threshold positive policy:

  • coverage tells you how much of the queue is above the threshold
  • review_rate tells you how much still needs review
  • precision tells you whether reviewed positives are worth the effort
  • recall tells you what fraction of positives the policy still catches
  • balanced_accuracy is a quick check that the policy is not ignoring one class

If the budget is fixed, pick the row that satisfies the review limit first and only then compare the remaining metrics.

This sweep is useful for one-threshold triage. It is not a full replacement for a two-threshold abstention policy.
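Picking the budget-feasible row can also be done in one line. A sketch with a minimal stand-in for the sweep output and a hypothetical 30% review budget:

```python
import pandas as pd

# Minimal stand-in for the policy_table produced by the sweep.
policy_table = pd.DataFrame({
    "threshold": [0.3, 0.5, 0.7],
    "review_rate": [0.45, 0.28, 0.15],
    "precision": [0.60, 0.72, 0.85],
    "recall": [0.90, 0.80, 0.55],
})

budget = 0.30  # hypothetical: reviewers can absorb at most 30% of the queue
feasible = policy_table[policy_table["review_rate"] <= budget]
# Among feasible rows, prefer the highest recall (or precision, depending on the task).
best_row = feasible.sort_values("recall", ascending=False).iloc[0]
# → threshold 0.5 here: the highest recall among rows within budget
```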

Review Bands

A single threshold is often too brittle for review workflows. A low-confidence band is easier to explain and easier to monitor.

low, high = np.quantile(scores, [0.2, 0.8])

auto_negative = scores < low
review = (scores >= low) & (scores <= high)
auto_positive = scores > high

coverage = float((auto_negative | auto_positive).mean())
review_load = float(review.mean())

This pattern is useful when:

  • you want the most obvious cases to bypass review
  • the middle of the score range is where errors cluster
  • reviewers should only see borderline cases

Useful trick:

  • if the score distribution shifts, recompute the band from the new validation window
  • if too many scores pile up near the center, widen the review band instead of pretending the policy is stable
  • if the queue is too large, shrink the band and check which slice loses coverage first

Do not report the whole middle band as if it were automatic negatives. The point of the band is precisely that the model is refusing to decide there.

Calibration Check

Thresholds are easier to defend when the scores are calibrated. If the score says 0.8, the observed positive rate should be close to 0.8 in the corresponding bin.

from sklearn.calibration import CalibratedClassifierCV, calibration_curve

calibrated = CalibratedClassifierCV(estimator=clf, cv=5, method="sigmoid")
calibrated.fit(X_train, y_train)
valid_prob = calibrated.predict_proba(X_valid)[:, 1]
prob_true, prob_pred = calibration_curve(y_valid, valid_prob, n_bins=10, strategy="quantile")

Interpretation:

  • prob_pred is the mean predicted probability in each bin
  • prob_true is the actual fraction of positives in each bin
  • if the model is overconfident, a threshold that looked safe may be too aggressive in practice

If the calibration curve is poor, calibrate before freezing the policy and before interpreting the score axis as probability.
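One simple way to summarize "poor" is the mean absolute gap between prob_pred and prob_true across bins, a crude expected-calibration-error style number; the synthetic overconfident scorer below is only an illustration:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
scores = rng.random(5000)
# Overconfident scorer: the true positive rate is pulled toward 0.5 relative to the score.
y = (rng.random(5000) < (0.25 + 0.5 * scores)).astype(int)

prob_true, prob_pred = calibration_curve(y, scores, n_bins=10, strategy="quantile")
# Mean absolute gap between predicted and observed rates, a crude calibration summary.
gap = float(np.abs(prob_true - prob_pred).mean())
```

A gap near zero means the score axis can be read as probability; a large gap means calibrate before freezing any cutoff.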

Choosing The Cutoff

If you want the model to choose a threshold for you, use a threshold tuner rather than hand-picking a number from one validation run.

from sklearn.model_selection import TunedThresholdClassifierCV

tuned = TunedThresholdClassifierCV(
    estimator=clf,
    scoring="balanced_accuracy",
    response_method="predict_proba",
    thresholds=100,
    cv=5,
)

Use this when:

  • the base model is already sensible
  • the main question is the operating point
  • you want the cutoff selected by cross-validation instead of a single split

If the problem is heavily imbalanced, also check average_precision_score and precision_recall_curve so the chosen threshold matches the queue you actually care about.

Freeze-And-Test Protocol

A selective policy should be frozen like a model:

  1. choose the base scorer on training data
  2. calibrate if needed inside the training boundary
  3. choose the low and high cutoffs on validation
  4. write down the expected automation rate and review load
  5. evaluate once on the locked test or later time window

If the cutoffs move every time the queue looks uncomfortable, the workflow no longer has honest evidence.
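Steps 3 to 5 can be made concrete by writing the cutoffs and expected numbers into a frozen record, then comparing once against the locked window. A sketch with synthetic score vectors standing in for validation and test:

```python
import numpy as np

rng = np.random.default_rng(0)
valid_scores = rng.random(1000)  # stand-in for the validation window
test_scores = rng.random(1000)   # stand-in for the locked test window

# Steps 3-4: freeze the cutoffs and the expected operating numbers from validation.
frozen = {
    "low": float(np.quantile(valid_scores, 0.2)),
    "high": float(np.quantile(valid_scores, 0.8)),
}
frozen["expected_automation_rate"] = float(
    ((valid_scores < frozen["low"]) | (valid_scores > frozen["high"])).mean()
)

# Step 5: evaluate once on the locked window with the frozen numbers unchanged.
observed_automation_rate = float(
    ((test_scores < frozen["low"]) | (test_scores > frozen["high"])).mean()
)
drift = abs(observed_automation_rate - frozen["expected_automation_rate"])
```

If drift is large, the honest move is to report it, not to quietly move the cutoffs.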

Cost View

Three-way policies are easier to defend when the costs are explicit:

  • automatic positive false alarm cost
  • automatic negative miss cost
  • human review cost per abstained case

That turns the policy question from “which threshold looks nice?” into “which action mix is cheapest or safest under the current constraints?”
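A sketch of that action-mix comparison; the unit costs here are hypothetical placeholders for numbers that must come from the business context:

```python
import numpy as np

# Hypothetical unit costs; the real values come from the business context.
FP_COST, FN_COST, REVIEW_COST = 5.0, 50.0, 1.0

def policy_cost(y_true, scores, low, high):
    """Total expected cost of a two-threshold policy at the given cutoffs."""
    auto_negative = scores <= low
    auto_positive = scores >= high
    review = ~(auto_negative | auto_positive)
    false_alarms = (auto_positive & (y_true == 0)).sum()
    misses = (auto_negative & (y_true == 1)).sum()
    return FP_COST * false_alarms + FN_COST * misses + REVIEW_COST * review.sum()

rng = np.random.default_rng(0)
scores = rng.random(1000)
y_true = (rng.random(1000) < scores).astype(int)

# Compare a wide review band against a narrow one under the same costs.
wide = policy_cost(y_true, scores, 0.1, 0.9)
narrow = policy_cost(y_true, scores, 0.4, 0.6)
```

Sweeping low and high through this function makes the cheapest or safest action mix explicit instead of eyeballed.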

Budget-Aware Monitoring

The policy should be monitored the same way the model is monitored.

df = pd.DataFrame({"score": scores, "y_true": y_valid, "slice": slice_name})
df["band"] = pd.cut(df["score"], bins=np.linspace(0.0, 1.0, 6), include_lowest=True)

band_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
      .groupby("band")
      .agg(
          count=("y_true", "size"),
          positive_rate=("y_true", "mean"),
          review_rate=("review", "mean"),
      )
)

This is where pandas is useful:

  • assign keeps the policy column close to the data
  • cut makes score bands visible instead of hiding them inside a single average
  • groupby(...).agg(...) shows which parts of the score range are driving the review load

Per-Slice Checks

A policy that looks good overall can still overload one subgroup or miss one subgroup entirely.

slice_table = (
    df.assign(review=df["score"].between(low, high, inclusive="both"))
      .groupby("slice")
      .agg(
          count=("y_true", "size"),
          review_rate=("review", "mean"),
          positive_rate=("y_true", "mean"),
      )
)

What to look for:

  • slices with tiny counts but huge review loads
  • slices with high positive rates but low coverage
  • slices where calibration is worse than the global average

If one slice is much weaker, the next move may be to change the threshold policy, not the model family.

Failure Patterns

Choosing a threshold from the validation set and then treating it as permanent.

Using accuracy alone when the positive class is rare. A model can be accurate and still be useless for the review queue.

Ignoring calibration. If the scores are miscalibrated, the threshold is defending a number, not a policy.

Checking only the overall queue and never checking slice-level load. That can hide a serious operational bottleneck.

Stronger Evidence

Stronger evidence usually means at least one of these:

  • the policy still works after calibration
  • the policy still works under a different seed or time window
  • the review queue stays within budget across important slices
  • the policy beats a simpler baseline at the same coverage level

For applied evaluation work, this matters more than a single headline metric because real-world deployment often rewards operational consistency, not just a lucky cutoff.

Practice

  1. Compare a one-threshold policy against a two-band abstention policy.
  2. Build a threshold sweep table and choose the best row under a fixed review budget.
  3. Inspect a calibration curve and decide whether the scores are trustworthy.
  4. Compare average_precision_score against plain accuracy on an imbalanced queue.
  5. Check how review load changes across at least one important slice.
  6. Explain which metric would make you stop tuning and freeze the policy.
  7. Identify one case where a wider review band is safer than a narrower one.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect how coverage changes as the review band becomes stricter.

Questions To Ask

  1. How much of the queue can reviewers actually handle?
  2. Are you optimizing the policy for the cases that matter most?
  3. Does the chosen cutoff still make sense after calibration?
  4. Which slice would break first if the score distribution drifted?
  5. Would a smaller review band reduce overload without hiding the hardest cases?
  6. Is the threshold stable enough to freeze, or are you still guessing?
  7. Would a ranking metric or a coverage metric better match the task?

Longer Connection

Continue with Imbalanced Triage and Review Budgets for the longer rare-event queue and operating-point workflow.