Reliability Slices

What This Is

Reliability slices ask whether a model still behaves acceptably on important subgroups, difficult conditions, or shifted inputs. A strong overall metric is not enough by itself.

The slice view is where hidden risk usually appears. A model can look strong on the whole dataset and still be clearly unacceptable for one subgroup or one operating condition.

When You Use It

  • checking subgroup performance
  • looking for robustness failures
  • deciding whether deployment risk is acceptable
  • comparing before and after a change on the same subgroups
  • checking whether a threshold or calibration choice hurts one slice more than others

Tooling

  • pandas.DataFrame.groupby
  • pandas.DataFrame.assign
  • pandas.crosstab
  • pandas.Series.value_counts
  • groupby(...).agg(...) with counts beside scores
  • groupby(...).value_counts(...) for subgroup class mix
  • sort_values(...) to surface the weakest slice first
  • sklearn.metrics.confusion_matrix
  • sklearn.metrics.ConfusionMatrixDisplay
  • sklearn.metrics.classification_report
  • sklearn.metrics.precision_recall_fscore_support
  • sklearn.metrics.balanced_accuracy_score
  • sklearn.calibration.calibration_curve
  • sklearn.calibration.CalibratedClassifierCV
  • sklearn.model_selection.TunedThresholdClassifierCV
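
A minimal sketch of combining the pandas and scikit-learn pieces above: `precision_recall_fscore_support` computed per slice inside a `groupby` loop. The column names (`channel`, `y_true`, `y_pred`) and the synthetic data are assumptions standing in for a real evaluation frame.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Toy data standing in for a real evaluation frame; the column names
# "channel", "y_true", and "y_pred" are assumptions about your schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "channel": rng.choice(["chat", "email", "phone"], size=300),
    "y_true": rng.integers(0, 2, size=300),
})
# Predictions that agree with the truth about 80% of the time.
df["y_pred"] = np.where(rng.random(300) < 0.8, df["y_true"], 1 - df["y_true"])

rows = []
for channel, part in df.groupby("channel"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        part["y_true"], part["y_pred"], average="binary", zero_division=0
    )
    rows.append({"channel": channel, "count": len(part),
                 "precision": precision, "recall": recall, "f1": f1})

per_slice = pd.DataFrame(rows).set_index("channel")
print(per_slice)
```

Keeping `count` next to the scores is deliberate: it is what lets you read the table later without over-trusting tiny slices.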

Minimal Example

# Accuracy on one slice; assumes y_true and y_pred are arrays aligned with df's rows.
group_mask = df["channel"] == "chat"
slice_accuracy = (y_pred[group_mask] == y_true[group_mask]).mean()

Worked Pattern

slice_table = (
    df.assign(
        actual_positive=(y_true == 1).astype(int),
        tp=((y_true == 1) & (y_pred == 1)).astype(int),
        fn=((y_true == 1) & (y_pred == 0)).astype(int),
        fp=((y_true == 0) & (y_pred == 1)).astype(int),
        tn=((y_true == 0) & (y_pred == 0)).astype(int),
    )
    .groupby("channel")
    .agg(
        count=("actual_positive", "size"),
        positive_rate=("actual_positive", "mean"),
        tp=("tp", "sum"),
        fn=("fn", "sum"),
        fp=("fp", "sum"),
        tn=("tn", "sum"),
    )
)

positives = (slice_table["tp"] + slice_table["fn"]).clip(lower=1)
negatives = (slice_table["fp"] + slice_table["tn"]).clip(lower=1)
slice_table["recall"] = slice_table["tp"] / positives
slice_table["fnr"] = slice_table["fn"] / positives
slice_table["fpr"] = slice_table["fp"] / negatives
slice_table = slice_table.sort_values(["fnr", "count"], ascending=[False, False])

That first table is intentionally more operational than plain accuracy. If the next decision is about thresholding, calibration, or safety review, recall, false-negative rate, and false-positive rate usually tell the real story sooner than slice accuracy does.

Slice Design

Do not make slices only because the data frame has columns. Choose slices that could change the deployment decision.

Strong first slice families:

  • user or device groups that map to real product populations
  • time windows or acquisition regimes that reflect shift
  • length, quality, or missingness bands that reflect input difficulty
  • confidence or score bands where the model is supposed to defer or stay calibrated
  • safety-critical subgroups where false negatives or false positives matter more

Weak slice families:

  • arbitrary bins with no operational meaning
  • dozens of tiny subgroup combinations that no one will act on
  • post-hoc slices invented only after seeing one bad result

The best slice table is not the biggest one. It is the smallest one that would actually change the next decision.

Uncertainty Floors

Slice metrics need support floors before they become deployment arguments.

Practical rules:

  • if the slice count is tiny, treat the metric as a warning sign, not as final evidence
  • if the slice has too few positives, do not over-read recall or false-negative rate
  • if a critical slice is weak on both validation and holdout, that is much stronger evidence than one bad table

A useful first-pass floor is:

  • enough rows to make the slice operationally relevant
  • enough positives and negatives that fnr and fpr are not driven by one example

When the floor is not met, the next move is usually more data or a larger holdout, not a hard deployment claim.
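
A minimal sketch of turning those floors into a flag instead of a silent filter. The table values, `MIN_ROWS`, and `MIN_POSITIVES` are assumptions; the point is that low-support slices stay visible but marked as noisy.

```python
import pandas as pd

# Hypothetical slice table with counts already attached (names assumed).
slice_table = pd.DataFrame({
    "count": [420, 18, 160],
    "tp": [60, 1, 25],
    "fn": [12, 2, 30],
}, index=["chat", "email", "phone"])

MIN_ROWS = 50       # assumed operational floor for total rows
MIN_POSITIVES = 10  # enough positives that fnr is not driven by one example

positives = slice_table["tp"] + slice_table["fn"]
slice_table["fnr"] = slice_table["fn"] / positives.clip(lower=1)
slice_table["low_support"] = (
    (slice_table["count"] < MIN_ROWS) | (positives < MIN_POSITIVES)
)
# Low-support slices are warnings, not final evidence.
print(slice_table[["count", "fnr", "low_support"]])
```

Flagging rather than dropping keeps the warning sign on the table while stopping it from driving a hard deployment claim.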

Useful tricks:

  • always keep the counts next to the metric
  • sort by the weakest slice first when you want to find the risk quickly
  • compare slice metrics before and after a change, not only the overall score
  • keep a slice table for both the validation set and the final holdout set
  • if a slice is small, treat the metric as noisy instead of over-reading it

Another useful pattern is a slice-level confusion table:

slice_confusion = pd.crosstab(
    df["channel"],
    [y_true, y_pred],
    rownames=["channel"],
    colnames=["true", "pred"],
    margins=True,
)

That makes it easier to see whether a slice is failing through false positives, false negatives, or both.
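
One variant of that read, sketched with assumed names and synthetic data: label each prediction outcome first, then row-normalize the crosstab so each slice's error mix sums to one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; "channel" and the 80% agreement rate are assumptions.
rng = np.random.default_rng(3)
df = pd.DataFrame({"channel": rng.choice(["chat", "email"], size=400)})
y_true = rng.integers(0, 2, size=400)
y_pred = np.where(rng.random(400) < 0.8, y_true, 1 - y_true)

# Name each prediction outcome, then look at the error mix per slice.
outcome = np.select(
    [(y_true == 1) & (y_pred == 1), (y_true == 1) & (y_pred == 0),
     (y_true == 0) & (y_pred == 1)],
    ["tp", "fn", "fp"], default="tn",
)
error_mix = pd.crosstab(df["channel"], outcome, normalize="index")
print(error_mix.round(2))
```

With `normalize="index"`, a slice dominated by the `fn` column is failing through misses, one dominated by `fp` through false alarms.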

Failure Pattern

Stopping at the overall accuracy and never checking whether one important slice performs much worse than the rest.

Another failure is checking too many slices without a decision rule. The point is not to make the table larger. The point is to identify the risk that changes the next move.

One more failure is using classification_report only once on the full validation set. A report for the whole set can hide a slice where recall is collapsing.
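
A minimal sketch of the per-slice alternative, using `classification_report(..., output_dict=True)` inside a `groupby` loop. The column names and synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy evaluation frame; "channel", "y_true", "y_pred" are assumed names.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "channel": rng.choice(["chat", "email"], size=200),
    "y_true": rng.integers(0, 2, size=200),
})
df["y_pred"] = np.where(rng.random(200) < 0.75, df["y_true"], 1 - df["y_true"])

# One machine-readable report per slice instead of one report overall.
reports = {
    channel: classification_report(
        part["y_true"], part["y_pred"], output_dict=True, zero_division=0
    )
    for channel, part in df.groupby("channel")
}
# Pull out positive-class recall per slice; label keys are strings.
per_slice_recall = {c: r["1"]["recall"] for c, r in reports.items()}
print(per_slice_recall)
```

A recall that looks fine in the pooled report can still be collapsing in one of these per-slice dictionaries.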

Another failure is promoting a weak slice into a blocking issue before asking whether the slice matters operationally and whether its support is large enough to trust.

Mitigation Ideas

  • collect more examples for the weak slice
  • simplify the model if the slice failure looks like overfitting
  • adjust the threshold if the error type is asymmetric
  • calibrate probabilities if the model is overconfident on one slice
  • defer hard cases to review if the slice is safety-critical
  • add a slice-specific feature view if the subgroup is poorly represented
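
For the calibration mitigation, a minimal sketch of wrapping a model in CalibratedClassifierCV. The synthetic data is an assumption; in practice you would check calibration on each slice separately before and after wrapping.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for training data (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Recalibrate the base model's probabilities via cross-validation.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=3
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
print(proba[:3])
```

This is the move when the slice problem is really a confidence problem: the ranking may be fine while the probabilities on one slice are too extreme.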

Practice

  1. Compute one slice metric by subgroup.
  2. Add counts so the slice table is interpretable.
  3. Explain what mitigation you would consider if one slice is much weaker.
  4. Pick the slice you would investigate first if time were short.
  5. Name one slice that should have a higher standard than the overall average.
  6. Say whether you would keep a model that is strong overall but weak on a critical slice.
  7. Compare slice precision and slice recall, not only slice accuracy.
  8. Explain whether the weak slice is a data problem, a threshold problem, or a model problem.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect the per-slice metrics and compare them with the overall score before drawing conclusions.

Library Notes

  • groupby(...).agg(...) is the quickest way to summarize slices when you want counts and scores in one table.
  • groupby(...).value_counts(...) is useful when you want to see how class mix changes across slices.
  • crosstab(...) is a good fit for a slice-versus-error view or a slice-versus-label view.
  • classification_report(..., output_dict=True) is useful when you want a machine-readable breakdown for one slice at a time.
  • balanced_accuracy_score is a better quick check than plain accuracy when class counts are uneven.
  • calibration_curve(...) helps when a slice looks reliable overall but its probabilities are poorly calibrated.
  • CalibratedClassifierCV is worth trying when a slice problem is really a confidence problem.
  • TunedThresholdClassifierCV is useful when one threshold is too blunt for the operating policy.
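
A minimal sketch of the `calibration_curve` check from the notes above. The probabilities here are constructed to be calibrated by assumption, so the bin-wise fraction of positives should track the mean predicted score; on a real slice, a large gap is the signal.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical scores and outcomes for one slice, calibrated by construction.
rng = np.random.default_rng(2)
y_prob = rng.random(1000)
y_true = (rng.random(1000) < y_prob).astype(int)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
# For a calibrated slice, each bin's fraction of positives tracks its mean score.
for mp, fp in zip(mean_pred, frac_pos):
    print(f"mean score {mp:.2f} -> fraction positive {fp:.2f}")
```

Running the same check on each slice's mask, rather than on the pooled set, is what surfaces a slice that looks accurate but is overconfident.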

Questions To Ask

  1. Which slice is weakest?
  2. Is the weakest slice also important operationally?
  3. Did the last model change help the weak slice, or only the overall average?
  4. Is the risk localized, or does it repeat across several slices?
  5. Would you change the threshold, the feature view, or the model family first?
  6. Is the slice large enough that the metric is trustworthy?
  7. Are the errors mostly false positives or false negatives?
  8. If you changed the threshold, would the weak slice improve or just move the problem elsewhere?

Decision Rule

A weak slice blocks deployment when the slice is both important and consistently under the minimum acceptable floor.

A practical review rule is:

  • if a critical slice fails badly, do not average it away
  • if a slice is tiny, demand more data before making a hard decision
  • if the whole table improves but the critical slice gets worse, treat that as a regression
  • if calibration is poor, inspect thresholding before changing the whole model
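
The review rule above can be sketched as a small before/after check. The slice names, numbers, and the `FNR_FLOOR` policy value are all assumptions; the shape of the rule is what matters.

```python
import pandas as pd

# Hypothetical per-slice false-negative rates before and after a change.
before = pd.Series({"chat": 0.10, "email": 0.30, "phone": 0.12})
after = pd.Series({"chat": 0.06, "email": 0.33, "phone": 0.08})
critical_slices = {"email"}
FNR_FLOOR = 0.25  # maximum acceptable false-negative rate (assumed policy)

overall_improved = after.mean() < before.mean()
# A critical slice regresses if it got worse or sits above the floor.
regressions = [
    s for s in critical_slices
    if after[s] > before[s] or after[s] > FNR_FLOOR
]
blocked = bool(regressions)
print("overall improved:", overall_improved, "| blocked:", blocked)
```

This is the "do not average it away" rule made concrete: the table improves on average, yet the change is still treated as a regression because the critical slice got worse.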

The point is to decide whether the model is acceptable, not to create a larger scoreboard.

Longer Connection

Continue with scikit-learn Validation and Tuning for the broader evaluation workflow.