Calibration and Thresholds¶
Scenario: Optimizing Fraud Alerts in Banking¶
You're deploying a fraud detection model where false positives waste investigator time, and false negatives let fraud slip through. The model ranks transactions well but overestimates probabilities—calibrate it and tune thresholds to balance costs, ensuring alerts are reliable and actionable.
What This Topic Is¶
Calibration and thresholding are the bridge between a score and an action.
A classifier can do three different things well or badly:
- rank examples from most likely to least likely
- assign probabilities that match reality
- turn a score into a yes/no decision at the right cutoff
Those are related, but they are not the same. A model can rank well and still give overconfident probabilities. It can also be well calibrated and still use the wrong threshold for the task.
When You Need It¶
- when a score will trigger a human review queue
- when false positives and false negatives have different costs
- when class balance is skewed and 0.5 is not a serious threshold
- when you need probabilities that mean something operationally
- when the team keeps asking, “What cutoff should we use?”
The Core Tools¶
- predict_proba
- decision_function
- calibration_curve
- CalibrationDisplay
- CalibratedClassifierCV
- precision_recall_curve
- FixedThresholdClassifier
- TunedThresholdClassifierCV
Library Notes¶
- predict_proba is the default way to get class probabilities when the estimator supports it. A calibrated probability estimate should mean that samples predicted near 0.8 are positive about 8 times out of 10.
- decision_function returns a score, not a probability. For many binary classifiers, scikit-learn uses 0 as the default cutoff on that score.
- calibration_curve compares average predicted probability with observed frequency in bins. It is the basic reliability-diagram helper.
- CalibrationDisplay is the plotting wrapper for calibration curves. Use it when you want a visual check rather than just arrays.
- CalibratedClassifierCV recalibrates a classifier's outputs. In stable scikit-learn 1.8, it supports sigmoid, isotonic, and temperature; in older stable releases, only sigmoid and isotonic are available.
- FixedThresholdClassifier lets you set a threshold manually.
- TunedThresholdClassifierCV chooses a threshold automatically using cross-validation and a scoring metric.
- precision_recall_curve is useful when the positive class is rare and threshold choice changes the business outcome more than raw accuracy does. It returns one more precision and recall value than threshold values; that last point is the endpoint of the curve and does not have a matching threshold.
What To Understand First¶
Start with the separation below:
- If the model cannot rank examples well, do not obsess over calibration yet.
- If the model ranks well but its probabilities are unreliable, calibrate it.
- If the model probabilities are usable but the cutoff is wrong, tune the threshold.
That order matters. Threshold tuning does not fix a bad ranker. Calibration does not automatically pick the right business cutoff.
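The separation between ranking and calibration can be made concrete: any strictly monotone squash of the scores leaves the ranking (ROC AUC) untouched while ruining probability quality (Brier score). A minimal sketch on synthetic data; the names and the `** 0.25` distortion are illustrative, not from any real model:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=5000)                   # true positive rates
y = (rng.uniform(0, 1, size=5000) < p_true).astype(int)

proba_calibrated = p_true               # matches observed frequencies
proba_overconfident = p_true ** 0.25    # monotone squash toward 1.0

# ranking is unchanged by a strictly monotone transform of the score
auc_cal = roc_auc_score(y, proba_calibrated)
auc_over = roc_auc_score(y, proba_overconfident)

# probability quality is not: the overconfident scores pay a Brier penalty
brier_cal = brier_score_loss(y, proba_calibrated)
brier_over = brier_score_loss(y, proba_overconfident)
print(f"AUC {auc_cal:.3f} vs {auc_over:.3f}; Brier {brier_cal:.3f} vs {brier_over:.3f}")
```

The identical AUC values and the worse Brier score for the squashed probabilities are exactly the "ranks well, overconfident probabilities" case described above.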
Honest Protocol¶
The clean workflow is:
- train the base model on training data
- calibrate on held-out calibration data or with CV inside the training boundary
- choose the threshold or policy on validation data
- report once on the locked test set
If you tune the threshold on the same data that certified the probabilities, the decision rule becomes much easier to overfit.
One safe pattern is:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

# carve a policy split (for threshold selection) out of the training data
X_model, X_policy, y_model, y_policy = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)
# calibrate with internal CV on the model split only
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_model, y_model)
# these probabilities are used only to choose the threshold or policy
policy_proba = calibrated.predict_proba(X_policy)[:, 1]
Then the final threshold is frozen before the locked test is touched.
The Default Decision Rule¶
For binary classification in scikit-learn, the default cutoff is hard-coded:
- positive if predict_proba(...)[:, 1] > 0.5
- positive if decision_function(...) > 0
That default is only a starting point. It is often wrong when:
- the positive class is rare
- false negatives are expensive
- the class balance in production differs from training
- the model is not well calibrated
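The default rule is easy to verify directly. This sketch assumes a binary problem with labels {0, 1} and a model that exposes predict_proba; the dataset is synthetic and illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba_pos = model.predict_proba(X)[:, 1]
manual = (proba_pos > 0.5).astype(int)   # the hard-coded default cutoff

# predict() applies exactly this rule for a binary {0, 1} problem
print((model.predict(X) == manual).all())
```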
Minimal Examples¶
Probability Output¶
proba = model.predict_proba(X_valid)[:, 1]
Use this when the estimator exposes class probabilities. Treat those values as candidate operational probabilities only after a calibration check or calibration step; raw predict_proba output is not automatically trustworthy just because it is on a [0, 1] scale.
Raw Decision Score¶
score = model.decision_function(X_valid)
Use this when the estimator does not expose probabilities or when you want the underlying ranking score.
Simple Thresholding¶
pred = (proba >= 0.30).astype(int)
This is the simplest possible decision rule. It is fine only when the threshold is chosen for a reason.
Calibration Pattern¶
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
base_model = ...  # any classifier exposing predict_proba or decision_function
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_valid)[:, 1]
prob_true, prob_pred = calibration_curve(y_valid, proba, n_bins=10, strategy="quantile")
Use sigmoid when the model is mildly miscalibrated or the calibration set is not huge. Use isotonic only when you have enough calibration data and you want a more flexible correction.
Practical rule:
- sigmoid is usually safer on smaller data
- isotonic can fit more shape, but it overfits more easily
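One way to see the pattern end to end is to compare the raw and calibrated Brier scores on held-out data. Naive Bayes is a convenient demo because redundant features violate its independence assumption and typically make it overconfident; the dataset here is synthetic, and on a given split the calibrated score is usually, though not always, lower:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# redundant features break the NB independence assumption -> overconfidence
X, y = make_classification(n_samples=2000, n_informative=5, n_redundant=5,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y,
                                                      random_state=0)

raw = GaussianNB().fit(X_train, y_train)
cal = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
cal.fit(X_train, y_train)

brier_raw = brier_score_loss(y_valid, raw.predict_proba(X_valid)[:, 1])
brier_cal = brier_score_loss(y_valid, cal.predict_proba(X_valid)[:, 1])
print(f"Brier raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")
```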
Plot the Calibration Curve¶
from sklearn.calibration import CalibrationDisplay
CalibrationDisplay.from_predictions(y_valid, proba, n_bins=10, strategy="quantile")
Use this when you want the diagram directly instead of working with the binned arrays yourself.
Threshold Pattern¶
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_valid, proba)
This is the standard sweep when you care about the tradeoff between finding positives and avoiding false alarms.
The key detail is that precision, recall, and thresholds are aligned by index, but precision and recall each have one extra endpoint value.
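The off-by-one shape is worth checking once by hand; the toy labels and scores below are illustrative only:

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# one extra endpoint: (precision=1, recall=0) closes the curve
print(len(precision), len(recall), len(thresholds))
```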
A useful pattern is to pick the first threshold that satisfies a constraint:
chosen = None
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    if p >= 0.90:
        chosen = t
        break
That is a more honest workflow than picking the threshold with the prettiest single metric.
Manual Threshold Control¶
from sklearn.model_selection import FixedThresholdClassifier
wrapped = FixedThresholdClassifier(model, threshold=0.35)
wrapped.fit(X_train, y_train)
pred = wrapped.predict(X_valid)
Use this when the threshold comes from policy, cost, or capacity rather than from automatic optimization.
Automatic Threshold Tuning¶
from sklearn.model_selection import TunedThresholdClassifierCV
tuned = TunedThresholdClassifierCV(model, scoring="balanced_accuracy", cv=5)
tuned.fit(X_train, y_train)
pred = tuned.predict(X_valid)
Use this when you want scikit-learn to search for the threshold that best matches a metric.
Important habit:
- tune the threshold on validation data, not on test data
- keep the test set for the final check only
What The Curves Tell You¶
Calibration curves and precision-recall curves answer different questions.
- calibration curve: “Do the predicted probabilities match observed frequency?”
- precision-recall curve: “What happens if I move the cutoff?”
Do not confuse them.
A model can have:
- good ranking and poor calibration
- good calibration and mediocre ranking
- a strong validation threshold that fails in production
Cost, Budget, And Metric Rules¶
Thresholds should usually come from a policy rule, not from visual preference.
Examples:
- review-budget rule: choose the highest threshold that keeps the queue within capacity
- cost rule: minimize cost_fp * FP + cost_fn * FN
- recall-floor rule: choose the largest threshold that still keeps recall above a required level
Those are all better than inheriting 0.5 without explanation.
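The cost rule can be implemented as a plain grid sweep over candidate thresholds. Everything here is illustrative: the cost weights, the synthetic validation scores, and the helper name total_cost are assumptions, not a fixed API:

```python
import numpy as np

# illustrative costs: a missed fraud is 10x worse than a wasted review
cost_fp, cost_fn = 1.0, 10.0

# synthetic validation labels and scores with deliberate class overlap
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=1000)
proba = np.clip(0.35 * y_valid + rng.uniform(0, 0.65, size=1000), 0, 1)

def total_cost(threshold):
    pred = (proba >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_valid == 0))
    fn = np.sum((pred == 0) & (y_valid == 1))
    return cost_fp * fp + cost_fn * fn

grid = np.linspace(0.01, 0.99, 99)
best = min(grid, key=total_cost)
print(f"cost at 0.5: {total_cost(0.5):.0f}, "
      f"cost at tuned {best:.2f}: {total_cost(best):.0f}")
```

With false negatives weighted heavily, the swept threshold lands below 0.5 on this data, which is exactly what the cost asymmetry predicts.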
Brier, Log-Loss, And Sharpness¶
Use different metrics for different questions:
- Brier score: overall probability accuracy for binary outcomes
- log loss: harsher penalty for confident mistakes
- sharpness: how concentrated or decisive the probabilities are, separate from whether they are calibrated
A model can be sharp but badly calibrated, or well calibrated but not very sharp. That is why ranking, calibration, and threshold choice need separate checks.
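The difference in penalty steepness is easy to see with hand-picked predictions; the numbers below are illustrative, not from a fitted model:

```python
from sklearn.metrics import brier_score_loss, log_loss

y_true = [1, 0]
mild  = [0.6, 0.4]    # mildly confident, correct
wrong = [0.01, 0.99]  # very confident, wrong on both

b_mild = brier_score_loss(y_true, mild)
b_wrong = brier_score_loss(y_true, wrong)
ll_mild = log_loss(y_true, mild)
ll_wrong = log_loss(y_true, wrong)

# log loss grows much faster than Brier as wrong confidence approaches 1.0
print(f"Brier: {b_mild:.3f} -> {b_wrong:.3f}")
print(f"Log loss: {ll_mild:.3f} -> {ll_wrong:.3f}")
```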
Multiclass Note¶
The same ideas extend to multiclass problems, but the policy becomes class-conditional:
- inspect per-class calibration, not only one global summary
- choose thresholds or reject options per class when the costs differ
- do not assume the argmax class probability is calibrated just because the ranking is reasonable
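A simple way to do the per-class check is a one-vs-rest reliability pass over each column of predict_proba. This sketch assumes a plain multinomial logistic model on illustrative synthetic data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)

# one-vs-rest reliability check per class, instead of one global summary
for k in range(3):
    frac_pos, mean_pred = calibration_curve((y_va == k).astype(int),
                                            proba[:, k], n_bins=5)
    print(f"class {k}: mean |gap| = {np.abs(frac_pos - mean_pred).mean():.3f}")
```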
What To Inspect¶
- whether predictions near 0.7 really come true about 7 times out of 10
- whether the curve is systematically above or below the diagonal
- whether the threshold was chosen from validation only
- whether the positive class is rare enough that precision-recall matters more than accuracy
- whether calibration changed probabilities without changing the ranking
- whether your threshold remains stable across resampled splits
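Threshold stability can be checked by re-picking the threshold on bootstrap resamples of the validation set and looking at the spread. The rule, data, and helper name pick_threshold below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=1000)
proba = np.clip(0.35 * y_valid + rng.uniform(0, 0.65, size=1000), 0, 1)

def pick_threshold(y, p, min_recall=0.90):
    # highest cutoff on a grid that still meets the recall floor
    grid = np.linspace(0.01, 0.99, 99)
    n_pos = max((y == 1).sum(), 1)
    ok = [t for t in grid if ((p >= t) & (y == 1)).sum() / n_pos >= min_recall]
    return max(ok) if ok else 0.0

# bootstrap resamples of the validation set: how stable is the choice?
choices = []
for _ in range(200):
    idx = rng.integers(0, len(y_valid), size=len(y_valid))
    choices.append(pick_threshold(y_valid[idx], proba[idx]))
print(f"threshold 5th-95th percentile: "
      f"{np.percentile(choices, 5):.2f} .. {np.percentile(choices, 95):.2f}")
```

A narrow spread suggests the rule is stable; a wide spread means the chosen cutoff is mostly noise from the split.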
Inspection Tricks¶
- Compare the raw model against the calibrated model before changing the threshold.
- Use
quantilebins when you want roughly equal sample counts per bin. - Use
uniformbins when you want fixed probability ranges. - Look at the low-probability region if false positives are cheap but frequent.
- Look at the high-probability region if you only act on confident positives.
- Check whether the same threshold still looks reasonable after a class-balance shift.
Common Mistakes¶
- treating ROC AUC as proof that probabilities are calibrated
- choosing 0.5 because it feels standard
- tuning the threshold on the test set
- using
isotonicon too little data - forgetting that
decision_functionscores are not probabilities - assuming calibration will fix a weak feature set
- using the same split for training, calibration, and threshold selection
Failure Checks¶
Ask these before trusting the output:
- Did I separate training data from calibration data?
- Did I choose the threshold on validation only?
- Did I check whether the class balance in production matches the training split?
- Did I compare the calibrated model to the uncalibrated one?
- Did I check whether my threshold is robust across multiple splits?
- Did I inspect precision and recall, not only accuracy?
If the answer to any of these is no, the decision rule is probably under-validated.
Applied Examples¶
Review Queue¶
If the team can only review 100 cases per day, use threshold tuning to keep the queue within budget. In that setup, the “best” threshold is the one that fits the review capacity while preserving as much recall as possible.
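The capacity rule has a direct implementation: flag only the top-k scores, so the operating threshold is just a score quantile. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
proba = rng.uniform(0, 1, size=3000)  # one day's scored transactions
capacity = 100                        # the team reviews at most 100 cases

# flag only the top-`capacity` scores: the cutoff is the k-th largest score
threshold = np.sort(proba)[-capacity]
flagged = (proba >= threshold).sum()
print(f"threshold={threshold:.3f}, flagged={flagged}")
```

Note that this threshold drifts with the day's score distribution; recomputing it per batch keeps the queue within budget, at the cost of a moving cutoff.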
Fraud Or Risk¶
If false positives are expensive but tolerable, a higher threshold may be better than the default 0.5. If false negatives are expensive, a lower threshold may be better.
Triage¶
If a probability is used to decide whether a case goes to human review, calibration matters because the score should mean what the team thinks it means.
Questions To Ask¶
- Is the model supposed to rank or decide?
- Do I need a probability estimate or only a class label?
- Is 0.5 justified, or is it just inherited?
- Which mistake is worse: false positive or false negative?
- Does the calibration curve show overconfidence or underconfidence?
- Should I use a fixed threshold, a tuned threshold, or both?
Practice¶
- Explain the difference between ranking quality and calibration quality.
- Explain why a good ROC AUC does not guarantee useful probabilities.
- Describe when
sigmoidcalibration is safer thanisotonic. - Describe one reason to prefer
precision_recall_curveover accuracy. - Pick a threshold rule for a review budget and explain it.
- Explain why threshold tuning should not be done on the test set.
- Say what the calibration curve would look like if the model is systematically overconfident.
- Say what you would inspect before trusting a tuned threshold in production.
Common Trick¶
The best trick is to stop asking one metric to do three jobs.
Use one check for ranking, one check for calibration, and one check for the final cutoff. That keeps the decision pipeline honest.
Another useful trick is to write down the threshold rule in plain language before you tune it. If you cannot explain the rule to a teammate, it is probably not ready.
Case Study: Email Spam Filtering with Threshold Tuning¶
Gmail uses calibrated probabilities and tuned thresholds to minimize false positives in spam detection. This ensures important emails aren't flagged, while catching most spam—improving user trust and efficiency.
Expanded Quick Quiz¶
What's the difference between ranking and calibration?
Answer: Ranking orders examples by likelihood; calibration ensures probabilities match real frequencies.
When should you use isotonic calibration?
Answer: For distortions that don't follow a sigmoid shape; isotonic fits any monotone correction, so it's more flexible than sigmoid but needs more calibration data.
How does TunedThresholdClassifierCV choose a threshold?
Answer: It uses cross-validation to optimize a scoring metric like precision or F1, finding the best cutoff.
In the fraud detection scenario, why tune thresholds?
Answer: To balance false positives (costly investigations) and false negatives (missed fraud) based on business costs.
Progress Checkpoint¶
- [ ] Plotted calibration curves and identified miscalibration.
- [ ] Applied CalibratedClassifierCV to improve probabilities.
- [ ] Tuned thresholds using precision-recall curves.
- [ ] Evaluated trade-offs and chose an operational threshold.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Evaluation Metrics Deep Dive" in the Classical ML track. Share your threshold analysis in the academy Discord!
Further Reading¶
- Scikit-Learn Calibration Guide.
- "Predicting Good Probabilities with Supervised Learning" paper.
- Precision-Recall Curve tutorials.
Runnable Example¶
Open the matching example in AI Academy and inspect how the score changes as you move the threshold.
Focus on:
- what happens to precision when the threshold rises
- what happens to recall when the threshold rises
- whether calibration improved the probability story
Longer Connection¶
Continue with the broader evaluation workflow in the validation and tuning track, then return here when you need to turn a score into a decision.