Cross-Validation¶
Scenario: Evaluating Fraud Models for a Bank¶
You're building fraud detection for a bank. The models must be reliable: an overfit model could miss real fraud or flag innocent customers. Cross-validation helps you compare stability across customer groups, so the chosen model generalizes without wild swings.
Learning Objectives¶
By the end of this module (25-35 minutes), you should be able to:
- Choose appropriate CV splitters (KFold, Stratified, Group) for different data types.
- Interpret CV scores for stability, not just averages.
- Avoid common pitfalls like data leakage.
- Use scikit-learn CV tools for efficient evaluation.
- Apply CV in model selection workflows.
Prerequisites: Basic scikit-learn (models, fitting); understanding train/validation splits. Difficulty: Intermediate.
What This Is¶
Cross-validation estimates how stable a model family looks inside the training boundary. It is a selection tool, not a replacement for the final test result.
The deeper point is that cross-validation is about variability, not just average performance. A model that wins on the mean but swings wildly across folds is not automatically a safer choice than a slightly weaker but steadier model.
When You Use It¶
- comparing model families
- estimating variation across folds
- choosing a workflow before the final locked evaluation
Tooling¶
`KFold`, `StratifiedKFold`, `StratifiedGroupKFold`, `cross_val_score`, `cross_validate`, `cross_val_predict`, `GroupKFold`, `GroupShuffleSplit`, `TimeSeriesSplit`, `PredefinedSplit`, `LeaveOneOut`, `RepeatedStratifiedKFold`, `permutation_test_score`
Library Notes¶
- `StratifiedKFold` is the default choice for classification when class balance matters.
- `KFold` is the basic split pattern for regression or any dataset where class balance is not the concern.
- `cross_val_score` is the simplest way to get one score per fold.
- `cross_validate` is better when you want multiple metrics, training scores, fit times, score times, or even the fitted estimators themselves.
- `cross_val_predict` is for out-of-fold predictions and error analysis, not for replacing a proper cross-validation score.
- `GroupKFold` keeps each group entirely on one side of the split, which is essential when rows belong to the same person, document, device, or session.
- `StratifiedGroupKFold` is the stronger choice when you need both class balance and group separation.
- `TimeSeriesSplit` respects order and avoids training on the future.
- `GroupShuffleSplit` is useful when you want group-aware random holdout splits instead of full K-fold partitioning.
- `PredefinedSplit` is useful when the validation fold is already defined by the experiment design.
- `LeaveOneOut` is a special case with one test sample per fold and is usually too expensive for larger datasets.
- `RepeatedStratifiedKFold` is useful when one pass of stratified folds feels too noisy.
- `permutation_test_score` helps test whether a score is meaningfully above chance.
- `shuffle=True` plus `random_state` gives you a repeatable split pattern when the splitter supports shuffling.
Fold Design Notes¶
- use more folds when the dataset is small and each validation slice needs more reuse
- use fewer folds when the dataset is large or the training cost is high
- more folds are not automatically better; beyond a point they add compute and can leave the decision just as noisy
- keep the same split logic while comparing model families
- watch for groups, time order, or repeated entities that require a different split strategy
- if class balance matters, prefer `StratifiedKFold` over plain `KFold`
- if the same entity appears multiple times, use a group-aware splitter instead of regular folds
- if the data are ordered in time, use `TimeSeriesSplit` instead of shuffling
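The class-balance bullet above can be checked directly. A small sketch (toy labels with a 10% positive rate; all names here are illustrative, not from the module) showing that `StratifiedKFold` holds the fold positive rate fixed while plain `KFold` lets it drift:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels with 10% positives; features are irrelevant for split construction.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive rate inside each validation fold.
plain_rates = [y[val_idx].mean() for _, val_idx in plain.split(X, y)]
strat_rates = [y[val_idx].mean() for _, val_idx in strat.split(X, y)]
# Stratified folds all hold the overall 10% rate; plain folds can drift.
```

On a tiny minority class, that drift can leave a plain fold with almost no positives, which makes the fold score meaningless.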
Selection Protocol¶
Cross-validation answers one narrow question: which workflow looks safest inside the training boundary?
Use it like this:
- keep the final test or holdout untouched
- run CV only on the training portion
- compare candidates on the same folds
- pick the simplest candidate whose fold behavior is good enough
- refit on the full training data and evaluate once on the locked holdout
When search is active, the CV summary is selection evidence, not final evidence.
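A runnable sketch of that protocol, assuming a toy dataset and one illustrative candidate (the dataset, model, and split sizes are assumptions, not part of the module):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=400, random_state=0)

# 1) lock away the final holdout before any CV runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) run CV only on the training portion; every candidate gets the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidate = LogisticRegression(max_iter=1000)
fold_scores = cross_val_score(candidate, X_train, y_train, cv=cv, scoring="roc_auc")

# 3) refit the chosen candidate on all training data, then evaluate ONCE
final_model = candidate.fit(X_train, y_train)
holdout_accuracy = final_model.score(X_test, y_test)  # .score() reports accuracy
```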
When To Reach For Each Splitter¶
- `StratifiedKFold`: classification with class imbalance or any case where fold balance matters
- `GroupKFold`: repeated rows from the same person, patient, customer, file, device, or conversation
- `TimeSeriesSplit`: forecasting, drift, rolling windows, or any task where the future must stay out of training
- `RepeatedStratifiedKFold`: small data where one fold assignment feels too noisy to trust
What The Main Functions Do¶
- `cross_val_score` returns one array of fold scores, which is ideal when one metric is enough.
- `cross_validate` returns a dictionary of fold-by-fold results, which is ideal when you want several metrics, timing information, or the fitted estimators themselves with `return_estimator=True`.
- `cross_val_predict` returns an out-of-fold prediction for each row, which is useful for inspection but not a general replacement for `cross_val_score`.
- `permutation_test_score` compares the observed score to a null distribution built from permuted labels.
Minimal Example¶
from sklearn.model_selection import StratifiedKFold, cross_val_score

# assumes `model`, `X_train`, and `y_train` are already defined
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
Worked Pattern¶
from sklearn.model_selection import cross_validate

# `model`, `X_train`, `y_train`, and `cv` carry over from the minimal example
results = cross_validate(
    model,
    X_train,
    y_train,
    cv=cv,
    scoring={"roc_auc": "roc_auc", "avg_precision": "average_precision"},
)
That pattern is useful because it exposes both the ranking quality and the minority-class behavior without needing separate ad hoc runs.
Applied examples:
- if the average ROC AUC is good but one fold is much worse, the model may be too fragile for the task
- if average precision is much lower than ROC AUC, the model may rank well overall but still be weak where the positive class matters most
- if fit time is large and the score barely changes across folds, the model may be too expensive for the gain it gives
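A hedged sketch of reading that `results` dictionary, using a self-contained toy setup (the imbalanced dataset is an assumption standing in for the fraud-style task) so the fold-by-fold keys are visible:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# toy imbalanced problem: roughly 10% positives
X_train, y_train = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=cv,
    scoring={"roc_auc": "roc_auc", "avg_precision": "average_precision"},
)

# report mean, spread, and the worst fold for each metric, not just the mean
for name in ("test_roc_auc", "test_avg_precision"):
    vals = results[name]
    print(name, vals.mean(), vals.std(), vals.min())
```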
Paired Fold Deltas¶
Do not compare candidates only by separate means. Compare them fold by fold on the same split.
# scores_a and scores_b are cross_validate results for two candidates on the SAME folds
delta = scores_b["test_roc_auc"] - scores_a["test_roc_auc"]
mean_delta = delta.mean()
worst_delta = delta.min()
Why this matters:
- a tiny mean win can disappear when you inspect the paired differences
- one model may have a better mean only because it spiked on one easy fold
- a safer model often has a smaller worst-fold loss even if the means are close
If the paired deltas are inconsistent, the selection claim should stay weak.
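A self-contained sketch of the paired comparison (both candidates and the dataset are illustrative). The key detail is that both calls reuse the same `cv` object with a fixed `random_state`, so each delta compares the two models on an identical fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # fixed folds for both calls

scores_a = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring={"roc_auc": "roc_auc"}
)
scores_b = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring={"roc_auc": "roc_auc"}
)

delta = scores_b["test_roc_auc"] - scores_a["test_roc_auc"]  # fold-by-fold, paired
mean_delta = delta.mean()
worst_delta = delta.min()
```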
Out-Of-Fold Inspection¶
from sklearn.model_selection import cross_val_predict
oof_pred = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
Use this when you want to inspect where the model is wrong on rows it did not train on.
Useful questions:
- which rows are consistently misranked
- which subgroup gets the worst probabilities
- whether the error pattern points to a feature problem or a split problem
Do not use out-of-fold predictions as a shortcut to a final score unless you understand the metric behavior and the fold sizes.
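A sketch of that inspection loop on a toy dataset (names and sizes are assumptions): rank rows by the gap between the label and the out-of-fold probability, then look at the worst ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_train, y_train = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]

# absolute gap between the label and the out-of-fold probability
errors = np.abs(y_train - oof_pred)
worst_rows = np.argsort(errors)[-10:]  # the ten rows the model gets most wrong
```

Those row indices are where the subgroup and feature questions above get answered.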
Group And Time Splits¶
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold, TimeSeriesSplit
group_cv = GroupKFold(n_splits=5)
group_stratified_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
time_cv = TimeSeriesSplit(n_splits=5)
Use GroupKFold when leakage would happen if the same person, file, or session appeared in both train and validation.
Use StratifiedGroupKFold when you need both class balance and group isolation. That pattern is common when the positive class is rare and the samples also come from repeated entities.
Use TimeSeriesSplit when each later fold should train on earlier history and validate on later data.
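One easy mistake with group-aware splitters is forgetting to pass the `groups` array to the scoring helper as well. A minimal sketch (the customer layout here is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
groups = np.repeat(np.arange(40), 5)  # 40 customers, 5 transactions each

# groups must be passed here too, not only when constructing the splitter
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=GroupKFold(n_splits=5),
    groups=groups,
    scoring="roc_auc",
)
```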
Nested CV When Search Is Active¶
If you are tuning hyperparameters or comparing many model settings, the honest pattern is nested:
- inner CV chooses settings
- outer CV estimates the performance of that choice process
You do not always need full nested CV in day-to-day work, but the logic matters. Once the same folds pick the settings and certify them, the mean is already optimistic.
A practical compromise is:
- hold out a final test set
- use inner CV for tuning on the remaining training data
- report the tuned model once on the holdout
Worked pattern:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# assumes `pipeline` is a Pipeline whose final estimator step is named "model"
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
pipeline,
param_grid={"model__C": [0.1, 1.0, 10.0]},
cv=inner_cv,
scoring="roc_auc",
)
outer_results = cross_validate(
search,
X_train,
y_train,
cv=outer_cv,
scoring="roc_auc",
)
The interpretation is important:
- inner CV chooses settings
- outer CV estimates the performance of that whole choice procedure
- the outer mean is evidence for trusting the selection rule, not just the final parameter value
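A self-contained version of the nested pattern, with `return_estimator=True` used to check whether the outer folds agree on the chosen setting (the pipeline, grid, and dataset here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=300, random_state=0)
pipeline = Pipeline(
    [("scale", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]
)
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)
outer = cross_validate(
    search, X_train, y_train, cv=outer_cv, scoring="roc_auc", return_estimator=True
)

# which C did each outer fold's inner search pick?
chosen_c = [est.best_params_["model__C"] for est in outer["estimator"]]
```

If `chosen_c` varies wildly across outer folds, the selection rule itself is unstable, which is exactly what the outer loop is there to expose.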
Walk-Forward Temporal Example¶
Time-aware validation needs both order and spacing.
from sklearn.model_selection import TimeSeriesSplit
horizon = 24
gap = 6
time_cv = TimeSeriesSplit(n_splits=4, test_size=horizon, gap=gap)
Useful meanings:
- `test_size=horizon` keeps the validation horizon aligned with the task you care about
- `gap=gap` prevents rows near the boundary from leaking through lagged or slowly revealed features
If the model will always predict six hours ahead, one day ahead, or one week ahead, the validation horizon should imitate that instead of using arbitrary fold widths.
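A sketch that makes the window geometry visible (200 hourly rows, with the horizon and gap values from above; the sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows, horizon, gap = 200, 24, 6
X = np.zeros((n_rows, 1))
time_cv = TimeSeriesSplit(n_splits=4, test_size=horizon, gap=gap)

windows = []
for train_idx, val_idx in time_cv.split(X):
    # training always ends `gap` rows before validation begins
    windows.append((train_idx.max(), val_idx.min(), val_idx.max()))
```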
Permutation Check¶
from sklearn.model_selection import permutation_test_score
score, perm_scores, pvalue = permutation_test_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
This is useful when a score looks suspiciously strong and you want a sanity check against chance.
What To Inspect¶
- whether the fold scores are close together or noisy
- whether one model wins on average but loses badly on some folds
- whether the metric choice matches the task you actually care about
- whether the cross-validation setup mirrors the final training workflow
- whether the fold means agree with the later validation result
- whether one fold is a clear outlier
- whether different metrics tell the same story or a different one
- whether the split design matches the deployment story
- whether the fold assignment respects groups or time order
- whether the out-of-fold errors line up with the suspicious examples you expected
- whether a high score survives a permutation sanity check
- whether `return_estimator=True` reveals one fold with a very different learned model
If the spread is large, the mean alone is not enough. A stable-looking average can hide a brittle model family. If the folds disagree strongly, the task is telling you the evaluation story is not stable enough yet.
One more important caution: fold standard deviation is not a confidence interval. It is only a rough measure of fold-to-fold spread under one split design. Do not present it as if it were formal uncertainty around deployment performance.
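One honest way to get a richer feel for the spread, without pretending it is a confidence interval, is repetition (toy setup; names illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# 5 folds x 3 repeats = 15 fold scores under different fold assignments
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
spread = scores.std()  # still descriptive spread, not formal uncertainty
```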
More Splitters Worth Knowing¶
- `KFold` is the plain baseline. Use it when you do not need stratification, groups, or time order.
- `StratifiedGroupKFold` is the practical answer when both class balance and group boundaries matter.
- `PredefinedSplit` is useful when a project already comes with a fixed validation fold, such as a curated holdout or a custom benchmark slice.
- `LeaveOneOut` can be useful for very small datasets, but it is expensive and rarely the best default.
- `GroupShuffleSplit` is useful when you want random group-aware holdouts instead of a full set of equal-size folds.
Applied pattern:
- if the dataset is tiny and every row matters, consider `LeaveOneOut` or repeated K-fold style comparisons
- if the dataset is grouped and imbalanced, prefer `StratifiedGroupKFold`
- if the benchmark already defines a test fold, use `PredefinedSplit`
- if you need a quick group-aware sanity check, `GroupShuffleSplit` can be easier than a full partition strategy
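For `PredefinedSplit` in particular, the interface is a single array marking each row's fold. A minimal sketch (the eight-row layout is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 means "always in training"; 0 marks the one predefined validation fold
test_fold = np.array([-1, -1, -1, -1, 0, 0, 0, 0])
ps = PredefinedSplit(test_fold)

splits = list(ps.split())
train_idx, val_idx = splits[0]  # exactly one split, fixed by design
```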
Common Mistakes¶
- using plain shuffled folds on grouped or time-ordered data
- treating `cross_val_predict` as a direct replacement for a proper score
- comparing models with different split logic
- tuning repeatedly against the same cross-validation summary until the folds stop being independent evidence
- reading the mean and ignoring the spread
One useful counterexample is a grouped task where shuffled StratifiedKFold says the model is stable, but GroupKFold reveals a large fold spread and a lower mean. The second result is not harsher by accident. It is closer to deployment.
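That counterexample is easy to reproduce synthetically. In this invented setup the label is a property of the customer and one feature leaks the customer id, so shuffled folds reward memorization while group folds fall toward chance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_groups, rows_per_group = 40, 5
groups = np.repeat(np.arange(n_groups), rows_per_group)

group_label = rng.integers(0, 2, n_groups)  # each customer's label is pure chance
y = group_label[groups]
# a noisy copy of the customer id: worthless for new customers, perfect for seen ones
X = (groups + rng.normal(0.0, 0.1, groups.size)).reshape(-1, 1)

model = RandomForestClassifier(n_estimators=50, random_state=0)
shuffled = cross_val_score(model, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)
# shuffled folds look near-perfect; group-aware folds collapse toward chance accuracy
```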
Failure Pattern¶
Treating the cross-validation mean as the final public score. Cross-validation helps you choose the workflow; the final test set still has a different role.
Another failure pattern is tuning against the same fold summary too many times. Repeated peeking eventually turns CV into a soft validation set.
One more failure mode is to use ordinary shuffled folds on grouped or time-based data. That can make the model look more stable than it really is.
Practice¶
- Run cross-validation for two candidate models on the same training split.
- Report both the mean and the spread.
- Explain why the test set should still remain untouched.
- Identify one sign that the fold design is too noisy.
- Explain what would make you prefer `cross_validate` over `cross_val_score`.
- Describe one reason to keep the cross-validation design unchanged after the first comparison.
- Explain when grouped or time-based splitting would be more appropriate.
- Describe one reason a model with a slightly lower mean can still be the safer choice.
- Explain when `cross_val_predict` is useful and when it is the wrong tool.
- Name one situation where `permutation_test_score` is a useful sanity check.
Case Study: Fraud Detection Stability¶
A credit card company used plain KFold CV and ignored customer groups, so transactions from the same customer leaked across folds. After switching to GroupKFold, the CV mean dropped and the fold spread widened, revealing the model had been unstable all along and preventing a costly deployment.
Expanded Quick Quiz¶
Why is CV about variability, not just averages?
Answer: A model with high mean but wild swings is risky; stability matters for reliability.
When to use StratifiedKFold?
Answer: For imbalanced classification to maintain class proportions in folds.
What's the risk of tuning on CV scores repeatedly?
Answer: Overfitting to the CV summary; repeated peeking turns CV into a soft validation set, so keep a locked holdout for the final evaluation.
In the fraud scenario, why avoid regular KFold?
Answer: Customers have multiple transactions; GroupKFold prevents group leakage.
Progress Checkpoint¶
- [ ] Ran CV on a dataset and compared models.
- [ ] Analyzed fold variability.
- [ ] Completed all practice questions.
Milestone: Complete this to unlock "Hyperparameter Tuning" in the scikit-learn Validation track. Post your CV results in the academy forum!
Further Reading¶
- Scikit-Learn CV Guide.
- Papers on nested CV for unbiased model selection.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the mean and spread for both models instead of looking only at the best single fold.
Common Trick¶
If one fold looks wildly different from the others, check the split logic before blaming the model. Sometimes the issue is a rare subgroup, but sometimes it is a bad fold construction choice.
If the mean is close but the spread is large, prefer the model family that is easier to defend operationally. Stability is often the better tradeoff than a tiny mean gain.
If a classifier looks much stronger under StratifiedKFold than under a group-aware splitter, the split is probably telling you something important about repeated entities.
If a time-based task looks great under shuffled folds, stop and ask whether the evaluation is accidentally letting the future leak in.
Longer Connection¶
Continue with scikit-learn Validation and Tuning for the complete selection workflow.