Reliability Slices

What This Is

Reliability slices ask whether a model still behaves acceptably on important subgroups, difficult conditions, or shifted inputs. A strong overall metric is not enough by itself.

The slice view is where hidden risk usually appears. A model can look strong on the whole dataset and still be clearly unacceptable for one subgroup or one operating condition.

When You Use It

  • checking subgroup performance
  • looking for robustness failures
  • deciding whether deployment risk is acceptable
  • comparing before and after a change on the same subgroups
  • checking whether a threshold or calibration choice hurts one slice more than others

Tooling

  • pandas.DataFrame.groupby
  • pandas.DataFrame.assign
  • pandas.crosstab
  • pandas.Series.value_counts
  • groupby(...).agg(...) with counts beside scores
  • groupby(...).value_counts(...) for subgroup class mix
  • sort_values(...) to surface the weakest slice first
  • sklearn.metrics.confusion_matrix
  • sklearn.metrics.ConfusionMatrixDisplay
  • sklearn.metrics.classification_report
  • sklearn.metrics.precision_recall_fscore_support
  • sklearn.metrics.balanced_accuracy_score
  • sklearn.calibration.calibration_curve
  • sklearn.calibration.CalibratedClassifierCV
  • sklearn.model_selection.TunedThresholdClassifierCV
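
A minimal sketch of combining the pandas and scikit-learn pieces above: `precision_recall_fscore_support` computed per slice inside a `groupby` loop. The column names (`channel`, `y_true`, `y_pred`) and the synthetic data are assumptions standing in for a real evaluation frame.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Toy data standing in for a real evaluation frame; the column names
# "channel", "y_true", and "y_pred" are assumptions about your schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "channel": rng.choice(["chat", "email", "phone"], size=300),
    "y_true": rng.integers(0, 2, size=300),
})
# Predictions that agree with the truth about 80% of the time.
df["y_pred"] = np.where(rng.random(300) < 0.8, df["y_true"], 1 - df["y_true"])

rows = []
for channel, part in df.groupby("channel"):
    precision, recall, f1, _ = precision_recall_fscore_support(
        part["y_true"], part["y_pred"], average="binary", zero_division=0
    )
    rows.append({"channel": channel, "count": len(part),
                 "precision": precision, "recall": recall, "f1": f1})

per_slice = pd.DataFrame(rows).set_index("channel")
print(per_slice)
```

Keeping `count` next to the scores is deliberate: it is what lets you read the table later without over-trusting tiny slices.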

Minimal Example

# Accuracy on one slice; assumes y_true and y_pred are arrays aligned with df's rows.
group_mask = df["channel"] == "chat"
slice_accuracy = (y_pred[group_mask] == y_true[group_mask]).mean()

Worked Pattern

slice_table = (
    df.assign(
        actual_positive=(y_true == 1).astype(int),
        tp=((y_true == 1) & (y_pred == 1)).astype(int),
        fn=((y_true == 1) & (y_pred == 0)).astype(int),
        fp=((y_true == 0) & (y_pred == 1)).astype(int),
        tn=((y_true == 0) & (y_pred == 0)).astype(int),
    )
    .groupby("channel")
    .agg(
        count=("actual_positive", "size"),
        positive_rate=("actual_positive", "mean"),
        tp=("tp", "sum"),
        fn=("fn", "sum"),
        fp=("fp", "sum"),
        tn=("tn", "sum"),
    )
)

positives = (slice_table["tp"] + slice_table["fn"]).clip(lower=1)
negatives = (slice_table["fp"] + slice_table["tn"]).clip(lower=1)
slice_table["recall"] = slice_table["tp"] / positives
slice_table["fnr"] = slice_table["fn"] / positives
slice_table["fpr"] = slice_table["fp"] / negatives
slice_table = slice_table.sort_values(["fnr", "count"], ascending=[False, False])

That first table is intentionally more operational than plain accuracy. If the next decision is about thresholding, calibration, or safety review, recall, false-negative rate, and false-positive rate usually tell the real story sooner than slice accuracy does.

Slice Design

Do not make slices only because the data frame has columns. Choose slices that could change the deployment decision.

Strong first slice families:

  • user or device groups that map to real product populations
  • time windows or acquisition regimes that reflect shift
  • length, quality, or missingness bands that reflect input difficulty
  • confidence or score bands where the model is supposed to defer or stay calibrated
  • safety-critical subgroups where false negatives or false positives matter more

Weak slice families:

  • arbitrary bins with no operational meaning
  • dozens of tiny subgroup combinations that no one will act on
  • post-hoc slices invented only after seeing one bad result

The best slice table is not the biggest one. It is the smallest one that would actually change the next decision.

Uncertainty Floors

Slice metrics need support floors before they become deployment arguments.

Practical rules:

  • if the slice count is tiny, treat the metric as a warning sign, not as final evidence
  • if the slice has too few positives, do not over-read recall or false-negative rate
  • if a critical slice is weak on both validation and holdout, that is much stronger evidence than one bad table

A useful first-pass floor is:

  • enough rows to make the slice operationally relevant
  • enough positives and negatives that fnr and fpr are not driven by one example

When the floor is not met, the next move is usually more data or a larger holdout, not a hard deployment claim.
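
A minimal sketch of turning those floors into a flag instead of a silent filter. The table values, `MIN_ROWS`, and `MIN_POSITIVES` are assumptions; the point is that low-support slices stay visible but marked as noisy.

```python
import pandas as pd

# Hypothetical slice table with counts already attached (names assumed).
slice_table = pd.DataFrame({
    "count": [420, 18, 160],
    "tp": [60, 1, 25],
    "fn": [12, 2, 30],
}, index=["chat", "email", "phone"])

MIN_ROWS = 50       # assumed operational floor for total rows
MIN_POSITIVES = 10  # enough positives that fnr is not driven by one example

positives = slice_table["tp"] + slice_table["fn"]
slice_table["fnr"] = slice_table["fn"] / positives.clip(lower=1)
slice_table["low_support"] = (
    (slice_table["count"] < MIN_ROWS) | (positives < MIN_POSITIVES)
)
# Low-support slices are warnings, not final evidence.
print(slice_table[["count", "fnr", "low_support"]])
```

Flagging rather than dropping keeps the warning sign on the table while stopping it from driving a hard deployment claim.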

Useful tricks:

  • always keep the counts next to the metric
  • sort by the weakest slice first when you want to find the risk quickly
  • compare slice metrics before and after a change, not only the overall score
  • keep a slice table for both the validation set and the final holdout set
  • if a slice is small, treat the metric as noisy instead of over-reading it

Another useful pattern is a slice-level confusion table:

slice_confusion = pd.crosstab(
    df["channel"],
    [y_true, y_pred],
    rownames=["channel"],
    colnames=["true", "pred"],
    margins=True,
)

That makes it easier to see whether a slice is failing through false positives, false negatives, or both.
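
One variant of that read, sketched with assumed names and synthetic data: label each prediction outcome first, then row-normalize the crosstab so each slice's error mix sums to one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; "channel" and the 80% agreement rate are assumptions.
rng = np.random.default_rng(3)
df = pd.DataFrame({"channel": rng.choice(["chat", "email"], size=400)})
y_true = rng.integers(0, 2, size=400)
y_pred = np.where(rng.random(400) < 0.8, y_true, 1 - y_true)

# Name each prediction outcome, then look at the error mix per slice.
outcome = np.select(
    [(y_true == 1) & (y_pred == 1), (y_true == 1) & (y_pred == 0),
     (y_true == 0) & (y_pred == 1)],
    ["tp", "fn", "fp"], default="tn",
)
error_mix = pd.crosstab(df["channel"], outcome, normalize="index")
print(error_mix.round(2))
```

With `normalize="index"`, a slice dominated by the `fn` column is failing through misses, one dominated by `fp` through false alarms.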

Failure Pattern

Stopping at the overall accuracy and never checking whether one important slice performs much worse than the rest.

Another failure is checking too many slices without a decision rule. The point is not to make the table larger. The point is to identify the risk that changes the next move.

One more failure is using classification_report only once on the full validation set. A report for the whole set can hide a slice where recall is collapsing.
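
A minimal sketch of the per-slice alternative, using `classification_report(..., output_dict=True)` inside a `groupby` loop. The column names and synthetic data are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy evaluation frame; "channel", "y_true", "y_pred" are assumed names.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "channel": rng.choice(["chat", "email"], size=200),
    "y_true": rng.integers(0, 2, size=200),
})
df["y_pred"] = np.where(rng.random(200) < 0.75, df["y_true"], 1 - df["y_true"])

# One machine-readable report per slice instead of one report overall.
reports = {
    channel: classification_report(
        part["y_true"], part["y_pred"], output_dict=True, zero_division=0
    )
    for channel, part in df.groupby("channel")
}
# Pull out positive-class recall per slice; label keys are strings.
per_slice_recall = {c: r["1"]["recall"] for c, r in reports.items()}
print(per_slice_recall)
```

A recall that looks fine in the pooled report can still be collapsing in one of these per-slice dictionaries.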

Another failure is promoting a weak slice into a blocking issue before asking whether the slice matters operationally and whether its support is large enough to trust.

Mitigation Ideas

  • collect more examples for the weak slice
  • simplify the model if the slice failure looks like overfitting
  • adjust the threshold if the error type is asymmetric
  • calibrate probabilities if the model is overconfident on one slice
  • defer hard cases to review if the slice is safety-critical
  • add a slice-specific feature view if the subgroup is poorly represented
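
For the calibration mitigation, a minimal sketch of wrapping a model in CalibratedClassifierCV. The synthetic data is an assumption; in practice you would check calibration on each slice separately before and after wrapping.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for training data (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Recalibrate the base model's probabilities via cross-validation.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=3
)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)
print(proba[:3])
```

This is the move when the slice problem is really a confidence problem: the ranking may be fine while the probabilities on one slice are too extreme.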

Practice

  1. Compute one slice metric by subgroup.
  2. Add counts so the slice table is interpretable.
  3. Explain what mitigation you would consider if one slice is much weaker.
  4. Pick the slice you would investigate first if time were short.
  5. Name one slice that should have a higher standard than the overall average.
  6. Say whether you would keep a model that is strong overall but weak on a critical slice.
  7. Compare slice precision and slice recall, not only slice accuracy.
  8. Explain whether the weak slice is a data problem, a threshold problem, or a model problem.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

Inspect the per-slice metrics and compare them with the overall score before drawing conclusions.

Library Notes

  • groupby(...).agg(...) is the quickest way to summarize slices when you want counts and scores in one table.
  • groupby(...).value_counts(...) is useful when you want to see how class mix changes across slices.
  • crosstab(...) is a good fit for a slice-versus-error view or a slice-versus-label view.
  • classification_report(..., output_dict=True) is useful when you want a machine-readable breakdown for one slice at a time.
  • balanced_accuracy_score is a better quick check than plain accuracy when class counts are uneven.
  • calibration_curve(...) helps when a slice looks reliable overall but its probabilities are poorly calibrated.
  • CalibratedClassifierCV is worth trying when a slice problem is really a confidence problem.
  • TunedThresholdClassifierCV is useful when one threshold is too blunt for the operating policy.
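
A minimal sketch of the `calibration_curve` check from the notes above. The probabilities here are constructed to be calibrated by assumption, so the bin-wise fraction of positives should track the mean predicted score; on a real slice, a large gap is the signal.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Hypothetical scores and outcomes for one slice, calibrated by construction.
rng = np.random.default_rng(2)
y_prob = rng.random(1000)
y_true = (rng.random(1000) < y_prob).astype(int)

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
# For a calibrated slice, each bin's fraction of positives tracks its mean score.
for mp, fp in zip(mean_pred, frac_pos):
    print(f"mean score {mp:.2f} -> fraction positive {fp:.2f}")
```

Running the same check on each slice's mask, rather than on the pooled set, is what surfaces a slice that looks accurate but is overconfident.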

Questions To Ask

  1. Which slice is weakest?
  2. Is the weakest slice also important operationally?
  3. Did the last model change help the weak slice, or only the overall average?
  4. Is the risk localized, or does it repeat across several slices?
  5. Would you change the threshold, the feature view, or the model family first?
  6. Is the slice large enough that the metric is trustworthy?
  7. Are the errors mostly false positives or false negatives?
  8. If you changed the threshold, would the weak slice improve or just move the problem elsewhere?

Decision Rule

A weak slice blocks deployment when the slice is both important and consistently under the minimum acceptable floor.

A practical review rule is:

  • if a critical slice fails badly, do not average it away
  • if a slice is tiny, demand more data before making a hard decision
  • if the whole table improves but the critical slice gets worse, treat that as a regression
  • if calibration is poor, inspect thresholding before changing the whole model
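
The review rule above can be sketched as a small before/after check. The slice names, numbers, and the `FNR_FLOOR` policy value are all assumptions; the shape of the rule is what matters.

```python
import pandas as pd

# Hypothetical per-slice false-negative rates before and after a change.
before = pd.Series({"chat": 0.10, "email": 0.30, "phone": 0.12})
after = pd.Series({"chat": 0.06, "email": 0.33, "phone": 0.08})
critical_slices = {"email"}
FNR_FLOOR = 0.25  # maximum acceptable false-negative rate (assumed policy)

overall_improved = after.mean() < before.mean()
# A critical slice regresses if it got worse or sits above the floor.
regressions = [
    s for s in critical_slices
    if after[s] > before[s] or after[s] > FNR_FLOOR
]
blocked = bool(regressions)
print("overall improved:", overall_improved, "| blocked:", blocked)
```

This is the "do not average it away" rule made concrete: the table improves on average, yet the change is still treated as a regression because the critical slice got worse.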

The point is to decide whether the model is acceptable, not to create a larger scoreboard.

Longer Connection

Continue with scikit-learn Validation and Tuning for the broader evaluation workflow.