Calibration and Thresholds¶
Scenario: Optimizing Fraud Alerts in Banking¶
You're deploying a fraud detection model where false positives waste investigator time, and false negatives let fraud slip through. The model ranks transactions well but overestimates probabilities—calibrate it and tune thresholds to balance costs, ensuring alerts are reliable and actionable.
What This Topic Is¶
Calibration and thresholding are the bridge between a score and an action.
A classifier can do three different things well or badly:
- rank examples from most likely to least likely
- assign probabilities that match reality
- turn a score into a yes/no decision at the right cutoff
Those are related, but they are not the same. A model can rank well and still give overconfident probabilities. It can also be well calibrated and still use the wrong threshold for the task.
When You Need It¶
- when a score will trigger a human review queue
- when false positives and false negatives have different costs
- when class balance is skewed and 0.5 is not a serious threshold
- when you need probabilities that mean something operationally
- when the team keeps asking, “What cutoff should we use?”
The Core Tools¶
- predict_proba
- decision_function
- calibration_curve
- CalibrationDisplay
- CalibratedClassifierCV
- precision_recall_curve
- FixedThresholdClassifier
- TunedThresholdClassifierCV
Library Notes¶
- predict_proba is the default way to get class probabilities when the estimator supports it. A calibrated probability estimate should mean that samples predicted near 0.8 are positive about 8 times out of 10.
- decision_function returns a score, not a probability. For many binary classifiers, scikit-learn uses 0 as the default cutoff on that score.
- calibration_curve compares average predicted probability with observed frequency in bins. It is the basic reliability-diagram helper.
- CalibrationDisplay is the plotting wrapper for calibration curves. Use it when you want a visual check rather than just arrays.
- CalibratedClassifierCV recalibrates a classifier's outputs. In stable scikit-learn 1.8, it supports sigmoid, isotonic, and temperature; in older stable releases, only sigmoid and isotonic are available.
- FixedThresholdClassifier lets you set a threshold manually.
- TunedThresholdClassifierCV chooses a threshold automatically using cross-validation and a scoring metric.
- precision_recall_curve is useful when the positive class is rare and threshold choice changes the business outcome more than raw accuracy does. It returns one more precision and recall value than threshold values; that last point is the endpoint of the curve and does not have a matching threshold.
What To Understand First¶
Start with the separation below:
- If the model cannot rank examples well, do not obsess over calibration yet.
- If the model ranks well but its probabilities are unreliable, calibrate it.
- If the model probabilities are usable but the cutoff is wrong, tune the threshold.
That order matters. Threshold tuning does not fix a bad ranker. Calibration does not automatically pick the right business cutoff.
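The separation between ranking and calibration can be made concrete: any strictly monotone squash of the scores leaves the ranking (ROC AUC) untouched while ruining probability quality (Brier score). A minimal sketch on synthetic data; the names and the `** 0.25` distortion are illustrative, not from any real model:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=5000)                   # true positive rates
y = (rng.uniform(0, 1, size=5000) < p_true).astype(int)

proba_calibrated = p_true               # matches observed frequencies
proba_overconfident = p_true ** 0.25    # monotone squash toward 1.0

# ranking is unchanged by a strictly monotone transform of the score
auc_cal = roc_auc_score(y, proba_calibrated)
auc_over = roc_auc_score(y, proba_overconfident)

# probability quality is not: the overconfident scores pay a Brier penalty
brier_cal = brier_score_loss(y, proba_calibrated)
brier_over = brier_score_loss(y, proba_overconfident)
print(f"AUC {auc_cal:.3f} vs {auc_over:.3f}; Brier {brier_cal:.3f} vs {brier_over:.3f}")
```

The identical AUC values and the worse Brier score for the squashed probabilities are exactly the "ranks well, overconfident probabilities" case described above.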
Honest Protocol¶
The clean workflow is:
- train the base model on training data
- calibrate on held-out calibration data or with CV inside the training boundary
- choose the threshold or policy on validation data
- report once on the locked test set
If you tune the threshold on the same data that certified the probabilities, the decision rule becomes much easier to overfit.
One safe pattern is:
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import train_test_split

# carve a policy split (for threshold selection) out of the training data
X_model, X_policy, y_model, y_policy = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0
)
# calibrate with internal CV on the model split only
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_model, y_model)
# these probabilities are used only to choose the threshold or policy
policy_proba = calibrated.predict_proba(X_policy)[:, 1]
Then the final threshold is frozen before the locked test is touched.
The Default Decision Rule¶
For binary classification in scikit-learn, the default cutoff is hard-coded:
- positive if predict_proba(...)[:, 1] > 0.5
- positive if decision_function(...) > 0
That default is only a starting point. It is often wrong when:
- the positive class is rare
- false negatives are expensive
- the class balance in production differs from training
- the model is not well calibrated
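The default rule is easy to verify directly. This sketch assumes a binary problem with labels {0, 1} and a model that exposes predict_proba; the dataset is synthetic and illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba_pos = model.predict_proba(X)[:, 1]
manual = (proba_pos > 0.5).astype(int)   # the hard-coded default cutoff

# predict() applies exactly this rule for a binary {0, 1} problem
print((model.predict(X) == manual).all())
```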
Minimal Examples¶
Probability Output¶
proba = model.predict_proba(X_valid)[:, 1]
Use this when the estimator exposes class probabilities. Treat those values as candidate operational probabilities only after a calibration check or calibration step; raw predict_proba output is not automatically trustworthy just because it is on a [0, 1] scale.
Raw Decision Score¶
score = model.decision_function(X_valid)
Use this when the estimator does not expose probabilities or when you want the underlying ranking score.
Simple Thresholding¶
pred = (proba >= 0.30).astype(int)
This is the simplest possible decision rule. It is fine only when the threshold is chosen for a reason.
Calibration Pattern¶
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
base_model = ...  # any classifier exposing predict_proba or decision_function
calibrated = CalibratedClassifierCV(base_model, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_valid)[:, 1]
prob_true, prob_pred = calibration_curve(y_valid, proba, n_bins=10, strategy="quantile")
Use sigmoid when the model is mildly miscalibrated or the calibration set is not huge. Use isotonic only when you have enough calibration data and you want a more flexible correction.
Practical rule:
- sigmoid is usually safer on smaller data
- isotonic can fit more shape, but it overfits more easily
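One way to see the pattern end to end is to compare the raw and calibrated Brier scores on held-out data. Naive Bayes is a convenient demo because redundant features violate its independence assumption and typically make it overconfident; the dataset here is synthetic, and on a given split the calibrated score is usually, though not always, lower:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# redundant features break the NB independence assumption -> overconfidence
X, y = make_classification(n_samples=2000, n_informative=5, n_redundant=5,
                           random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, stratify=y,
                                                      random_state=0)

raw = GaussianNB().fit(X_train, y_train)
cal = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5)
cal.fit(X_train, y_train)

brier_raw = brier_score_loss(y_valid, raw.predict_proba(X_valid)[:, 1])
brier_cal = brier_score_loss(y_valid, cal.predict_proba(X_valid)[:, 1])
print(f"Brier raw: {brier_raw:.4f}  calibrated: {brier_cal:.4f}")
```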
Plot the Calibration Curve¶
from sklearn.calibration import CalibrationDisplay
CalibrationDisplay.from_predictions(y_valid, proba, n_bins=10, strategy="quantile")
Use this when you want the diagram directly instead of working with the binned arrays yourself.
Threshold Pattern¶
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_valid, proba)
This is the standard sweep when you care about the tradeoff between finding positives and avoiding false alarms.
The key detail is that precision, recall, and thresholds are aligned by index, but precision and recall each have one extra endpoint value.
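The off-by-one shape is worth checking once by hand; the toy labels and scores below are illustrative only:

```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.9]
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# one extra endpoint: (precision=1, recall=0) closes the curve
print(len(precision), len(recall), len(thresholds))
```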
A useful pattern is to pick the first threshold that satisfies a constraint:
chosen = None
for p, r, t in zip(precision[:-1], recall[:-1], thresholds):
    if p >= 0.90:
        chosen = t
        break
That is a more honest workflow than picking the threshold with the prettiest single metric.
Manual Threshold Control¶
from sklearn.model_selection import FixedThresholdClassifier
wrapped = FixedThresholdClassifier(model, threshold=0.35)
wrapped.fit(X_train, y_train)
pred = wrapped.predict(X_valid)
Use this when the threshold comes from policy, cost, or capacity rather than from automatic optimization.
Automatic Threshold Tuning¶
from sklearn.model_selection import TunedThresholdClassifierCV
tuned = TunedThresholdClassifierCV(model, scoring="balanced_accuracy", cv=5)
tuned.fit(X_train, y_train)
pred = tuned.predict(X_valid)
Use this when you want scikit-learn to search for the threshold that best matches a metric.
Important habit:
- tune the threshold on validation data, not on test data
- keep the test set for the final check only
What The Curves Tell You¶
Calibration curves and precision-recall curves answer different questions.
- calibration curve: “Do the predicted probabilities match observed frequency?”
- precision-recall curve: “What happens if I move the cutoff?”
Do not confuse them.
A model can have:
- good ranking and poor calibration
- good calibration and mediocre ranking
- a strong validation threshold that fails in production
Cost, Budget, And Metric Rules¶
Thresholds should usually come from a policy rule, not from visual preference.
Examples:
- review-budget rule: choose the highest threshold that keeps the queue within capacity
- cost rule: minimize cost_fp * FP + cost_fn * FN
- recall-floor rule: choose the largest threshold that still keeps recall above a required level
Those are all better than inheriting 0.5 without explanation.
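The cost rule can be implemented as a plain grid sweep over candidate thresholds. Everything here is illustrative: the cost weights, the synthetic validation scores, and the helper name total_cost are assumptions, not a fixed API:

```python
import numpy as np

# illustrative costs: a missed fraud is 10x worse than a wasted review
cost_fp, cost_fn = 1.0, 10.0

# synthetic validation labels and scores with deliberate class overlap
rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=1000)
proba = np.clip(0.35 * y_valid + rng.uniform(0, 0.65, size=1000), 0, 1)

def total_cost(threshold):
    pred = (proba >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_valid == 0))
    fn = np.sum((pred == 0) & (y_valid == 1))
    return cost_fp * fp + cost_fn * fn

grid = np.linspace(0.01, 0.99, 99)
best = min(grid, key=total_cost)
print(f"cost at 0.5: {total_cost(0.5):.0f}, "
      f"cost at tuned {best:.2f}: {total_cost(best):.0f}")
```

With false negatives weighted heavily, the swept threshold lands below 0.5 on this data, which is exactly what the cost asymmetry predicts.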
Brier, Log-Loss, And Sharpness¶
Use different metrics for different questions:
- Brier score: overall probability accuracy for binary outcomes
- log loss: harsher penalty for confident mistakes
- sharpness: how concentrated or decisive the probabilities are, separate from whether they are calibrated
A model can be sharp but badly calibrated, or well calibrated but not very sharp. That is why ranking, calibration, and threshold choice need separate checks.
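The difference in penalty steepness is easy to see with hand-picked predictions; the numbers below are illustrative, not from a fitted model:

```python
from sklearn.metrics import brier_score_loss, log_loss

y_true = [1, 0]
mild  = [0.6, 0.4]    # mildly confident, correct
wrong = [0.01, 0.99]  # very confident, wrong on both

b_mild = brier_score_loss(y_true, mild)
b_wrong = brier_score_loss(y_true, wrong)
ll_mild = log_loss(y_true, mild)
ll_wrong = log_loss(y_true, wrong)

# log loss grows much faster than Brier as wrong confidence approaches 1.0
print(f"Brier: {b_mild:.3f} -> {b_wrong:.3f}")
print(f"Log loss: {ll_mild:.3f} -> {ll_wrong:.3f}")
```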
Multiclass Note¶
The same ideas extend to multiclass problems, but the policy becomes class-conditional:
- inspect per-class calibration, not only one global summary
- choose thresholds or reject options per class when the costs differ
- do not assume the argmax class probability is calibrated just because the ranking is reasonable
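A simple way to do the per-class check is a one-vs-rest reliability pass over each column of predict_proba. This sketch assumes a plain multinomial logistic model on illustrative synthetic data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
proba = model.predict_proba(X_va)

# one-vs-rest reliability check per class, instead of one global summary
for k in range(3):
    frac_pos, mean_pred = calibration_curve((y_va == k).astype(int),
                                            proba[:, k], n_bins=5)
    print(f"class {k}: mean |gap| = {np.abs(frac_pos - mean_pred).mean():.3f}")
```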
What To Inspect¶
- whether predictions near 0.7 really come true about 7 times out of 10
- whether the curve is systematically above or below the diagonal
- whether the threshold was chosen from validation only
- whether the positive class is rare enough that precision-recall matters more than accuracy
- whether calibration changed probabilities without changing the ranking
- whether your threshold remains stable across resampled splits
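Threshold stability can be checked by re-picking the threshold on bootstrap resamples of the validation set and looking at the spread. The rule, data, and helper name pick_threshold below are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
y_valid = rng.integers(0, 2, size=1000)
proba = np.clip(0.35 * y_valid + rng.uniform(0, 0.65, size=1000), 0, 1)

def pick_threshold(y, p, min_recall=0.90):
    # highest cutoff on a grid that still meets the recall floor
    grid = np.linspace(0.01, 0.99, 99)
    n_pos = max((y == 1).sum(), 1)
    ok = [t for t in grid if ((p >= t) & (y == 1)).sum() / n_pos >= min_recall]
    return max(ok) if ok else 0.0

# bootstrap resamples of the validation set: how stable is the choice?
choices = []
for _ in range(200):
    idx = rng.integers(0, len(y_valid), size=len(y_valid))
    choices.append(pick_threshold(y_valid[idx], proba[idx]))
print(f"threshold 5th-95th percentile: "
      f"{np.percentile(choices, 5):.2f} .. {np.percentile(choices, 95):.2f}")
```

A narrow spread suggests the rule is stable; a wide spread means the chosen cutoff is mostly noise from the split.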
Inspection Tricks¶
- Compare the raw model against the calibrated model before changing the threshold.
- Use
quantilebins when you want roughly equal sample counts per bin. - Use
uniformbins when you want fixed probability ranges. - Look at the low-probability region if false positives are cheap but frequent.
- Look at the high-probability region if you only act on confident positives.
- Check whether the same threshold still looks reasonable after a class-balance shift.
Common Mistakes¶
- treating ROC AUC as proof that probabilities are calibrated
- choosing 0.5 because it feels standard
- tuning the threshold on the test set
- using
isotonicon too little data - forgetting that
decision_functionscores are not probabilities - assuming calibration will fix a weak feature set
- using the same split for training, calibration, and threshold selection
Failure Checks¶
Ask these before trusting the output:
- Did I separate training data from calibration data?
- Did I choose the threshold on validation only?
- Did I check whether the class balance in production matches the training split?
- Did I compare the calibrated model to the uncalibrated one?
- Did I check whether my threshold is robust across multiple splits?
- Did I inspect precision and recall, not only accuracy?
If the answer to any of these is no, the decision rule is probably under-validated.
Applied Examples¶
Review Queue¶
If the team can only review 100 cases per day, use threshold tuning to keep the queue within budget. In that setup, the “best” threshold is the one that fits the review capacity while preserving as much recall as possible.
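The capacity rule has a direct implementation: flag only the top-k scores, so the operating threshold is just a score quantile. The numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
proba = rng.uniform(0, 1, size=3000)  # one day's scored transactions
capacity = 100                        # the team reviews at most 100 cases

# flag only the top-`capacity` scores: the cutoff is the k-th largest score
threshold = np.sort(proba)[-capacity]
flagged = (proba >= threshold).sum()
print(f"threshold={threshold:.3f}, flagged={flagged}")
```

Note that this threshold drifts with the day's score distribution; recomputing it per batch keeps the queue within budget, at the cost of a moving cutoff.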
Fraud Or Risk¶
If false positives are expensive but tolerable, a higher threshold may be better than the default 0.5. If false negatives are expensive, a lower threshold may be better.
Triage¶
If a probability is used to decide whether a case goes to human review, calibration matters because the score should mean what the team thinks it means.
Questions To Ask¶
- Is the model supposed to rank or decide?
- Do I need a probability estimate or only a class label?
- Is 0.5 justified, or is it just inherited?
- Which mistake is worse: false positive or false negative?
- Does the calibration curve show overconfidence or underconfidence?
- Should I use a fixed threshold, a tuned threshold, or both?
Practice¶
- Explain the difference between ranking quality and calibration quality.
- Explain why a good ROC AUC does not guarantee useful probabilities.
- Describe when
sigmoidcalibration is safer thanisotonic. - Describe one reason to prefer
precision_recall_curveover accuracy. - Pick a threshold rule for a review budget and explain it.
- Explain why threshold tuning should not be done on the test set.
- Say what the calibration curve would look like if the model is systematically overconfident.
- Say what you would inspect before trusting a tuned threshold in production.
Common Trick¶
The best trick is to stop asking one metric to do three jobs.
Use one check for ranking, one check for calibration, and one check for the final cutoff. That keeps the decision pipeline honest.
Another useful trick is to write down the threshold rule in plain language before you tune it. If you cannot explain the rule to a teammate, it is probably not ready.
Case Study: Email Spam Filtering with Threshold Tuning¶
Gmail uses calibrated probabilities and tuned thresholds to minimize false positives in spam detection. This ensures important emails aren't flagged, while catching most spam—improving user trust and efficiency.
Expanded Quick Quiz¶
What's the difference between ranking and calibration?
Answer: Ranking orders examples by likelihood; calibration ensures probabilities match real frequencies.
When should you use isotonic calibration?
Answer: For distortions that don't follow a sigmoid shape; isotonic fits any monotone correction, so it's more flexible than sigmoid but needs more calibration data.
How does TunedThresholdClassifierCV choose a threshold?
Answer: It uses cross-validation to optimize a scoring metric like precision or F1, finding the best cutoff.
In the fraud detection scenario, why tune thresholds?
Answer: To balance false positives (costly investigations) and false negatives (missed fraud) based on business costs.
Progress Checkpoint¶
- [ ] Plotted calibration curves and identified miscalibration.
- [ ] Applied CalibratedClassifierCV to improve probabilities.
- [ ] Tuned thresholds using precision-recall curves.
- [ ] Evaluated trade-offs and chose an operational threshold.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Evaluation Metrics Deep Dive" in the Classical ML track. Share your threshold analysis in the academy Discord!
Further Reading¶
- Scikit-Learn Calibration Guide.
- "Predicting Good Probabilities with Supervised Learning" paper.
- Precision-Recall Curve tutorials.
Runnable Example¶
Open the matching example in AI Academy and inspect how the score changes as you move the threshold.
Focus on:
- what happens to precision when the threshold rises
- what happens to recall when the threshold rises
- whether calibration improved the probability story
Longer Connection¶
Continue with the broader evaluation workflow in the validation and tuning track, then return here when you need to turn a score into a decision.