Cross-Validation¶
Scenario: Evaluating Fraud Models for a Bank¶
You're building fraud detection for a bank. The models must be reliable: an overfit model could miss real fraud or flag innocent customers. Cross-validation helps you compare stability across customer groups, so the chosen model generalizes without wild swings.
Learning Objectives¶
By the end of this module (25-35 minutes), you should be able to:
- Choose appropriate CV splitters (KFold, Stratified, Group) for different data types.
- Interpret CV scores for stability, not just averages.
- Avoid common pitfalls like data leakage.
- Use scikit-learn CV tools for efficient evaluation.
- Apply CV in model selection workflows.
Prerequisites: Basic scikit-learn (models, fitting); understanding train/validation splits. Difficulty: Intermediate.
What This Is¶
Cross-validation estimates how stable a model family looks inside the training boundary. It is a selection tool, not a replacement for the final test result.
The deeper point is that cross-validation is about variability, not just average performance. A model that wins on the mean but swings wildly across folds is not automatically a safer choice than a slightly weaker but steadier model.
When You Use It¶
- comparing model families
- estimating variation across folds
- choosing a workflow before the final locked evaluation
Tooling¶
`KFold`, `StratifiedKFold`, `StratifiedGroupKFold`, `cross_val_score`, `cross_validate`, `cross_val_predict`, `GroupKFold`, `GroupShuffleSplit`, `TimeSeriesSplit`, `PredefinedSplit`, `LeaveOneOut`, `RepeatedStratifiedKFold`, `permutation_test_score`
Library Notes¶
- `StratifiedKFold` is the default choice for classification when class balance matters.
- `KFold` is the basic split pattern for regression or any dataset where class balance is not the concern.
- `cross_val_score` is the simplest way to get one score per fold.
- `cross_validate` is better when you want multiple metrics, training scores, fit times, score times, or even the fitted estimators themselves.
- `cross_val_predict` is for out-of-fold predictions and error analysis, not for replacing a proper cross-validation score.
- `GroupKFold` keeps each group entirely on one side of the split, which is essential when rows belong to the same person, document, device, or session.
- `StratifiedGroupKFold` is the stronger choice when you need both class balance and group separation.
- `TimeSeriesSplit` respects order and avoids training on the future.
- `GroupShuffleSplit` is useful when you want group-aware random holdout splits instead of full K-fold partitioning.
- `PredefinedSplit` is useful when the validation fold is already defined by the experiment design.
- `LeaveOneOut` is a special case with one test sample per fold and is usually too expensive for larger datasets.
- `RepeatedStratifiedKFold` is useful when one pass of stratified folds feels too noisy.
- `permutation_test_score` helps test whether a score is meaningfully above chance.
- `shuffle=True` plus `random_state` gives you a repeatable split pattern when the splitter supports shuffling.
Fold Design Notes¶
- use more folds when the dataset is small and each validation slice needs more reuse
- use fewer folds when the dataset is large or the training cost is high
- more folds are not automatically better; beyond a point they add compute and can leave the decision just as noisy
- keep the same split logic while comparing model families
- watch for groups, time order, or repeated entities that require a different split strategy
- if class balance matters, prefer `StratifiedKFold` over plain `KFold`
- if the same entity appears multiple times, use a group-aware splitter instead of regular folds
- if the data are ordered in time, use `TimeSeriesSplit` instead of shuffling
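The class-balance bullet above can be checked directly. A small sketch (toy labels with a 10% positive rate; all names here are illustrative, not from the module) showing that `StratifiedKFold` holds the fold positive rate fixed while plain `KFold` lets it drift:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels with 10% positives; features are irrelevant for split construction.
y = np.array([1] * 10 + [0] * 90)
X = np.zeros((100, 1))

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive rate inside each validation fold.
plain_rates = [y[val_idx].mean() for _, val_idx in plain.split(X, y)]
strat_rates = [y[val_idx].mean() for _, val_idx in strat.split(X, y)]
# Stratified folds all hold the overall 10% rate; plain folds can drift.
```

On a tiny minority class, that drift can leave a plain fold with almost no positives, which makes the fold score meaningless.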
Selection Protocol¶
Cross-validation answers one narrow question: which workflow looks safest inside the training boundary?
Use it like this:
- keep the final test or holdout untouched
- run CV only on the training portion
- compare candidates on the same folds
- pick the simplest candidate whose fold behavior is good enough
- refit on the full training data and evaluate once on the locked holdout
When search is active, the CV summary is selection evidence, not final evidence.
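A runnable sketch of that protocol, assuming a toy dataset and one illustrative candidate (the dataset, model, and split sizes are assumptions, not part of the module):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=400, random_state=0)

# 1) lock away the final holdout before any CV runs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# 2) run CV only on the training portion; every candidate gets the same folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidate = LogisticRegression(max_iter=1000)
fold_scores = cross_val_score(candidate, X_train, y_train, cv=cv, scoring="roc_auc")

# 3) refit the chosen candidate on all training data, then evaluate ONCE
final_model = candidate.fit(X_train, y_train)
holdout_accuracy = final_model.score(X_test, y_test)  # .score() reports accuracy
```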
When To Reach For Each Splitter¶
- `StratifiedKFold`: classification with class imbalance or any case where fold balance matters
- `GroupKFold`: repeated rows from the same person, patient, customer, file, device, or conversation
- `TimeSeriesSplit`: forecasting, drift, rolling windows, or any task where the future must stay out of training
- `RepeatedStratifiedKFold`: small data where one fold assignment feels too noisy to trust
What The Main Functions Do¶
- `cross_val_score` returns one array of fold scores, which is ideal when one metric is enough.
- `cross_validate` returns a dictionary of fold-by-fold results, which is ideal when you want several metrics, timing information, or the fitted estimators themselves with `return_estimator=True`.
- `cross_val_predict` returns an out-of-fold prediction for each row, which is useful for inspection but not a general replacement for `cross_val_score`.
- `permutation_test_score` compares the observed score to a null distribution built from permuted labels.
Minimal Example¶
from sklearn.model_selection import StratifiedKFold, cross_val_score

# assumes `model`, `X_train`, and `y_train` are already defined
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
Worked Pattern¶
from sklearn.model_selection import cross_validate

# `model`, `X_train`, `y_train`, and `cv` carry over from the minimal example
results = cross_validate(
    model,
    X_train,
    y_train,
    cv=cv,
    scoring={"roc_auc": "roc_auc", "avg_precision": "average_precision"},
)
That pattern is useful because it exposes both the ranking quality and the minority-class behavior without needing separate ad hoc runs.
Applied examples:
- if the average ROC AUC is good but one fold is much worse, the model may be too fragile for the task
- if average precision is much lower than ROC AUC, the model may rank well overall but still be weak where the positive class matters most
- if fit time is large and the score barely changes across folds, the model may be too expensive for the gain it gives
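A hedged sketch of reading that `results` dictionary, using a self-contained toy setup (the imbalanced dataset is an assumption standing in for the fraud-style task) so the fold-by-fold keys are visible:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# toy imbalanced problem: roughly 10% positives
X_train, y_train = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    cv=cv,
    scoring={"roc_auc": "roc_auc", "avg_precision": "average_precision"},
)

# report mean, spread, and the worst fold for each metric, not just the mean
for name in ("test_roc_auc", "test_avg_precision"):
    vals = results[name]
    print(name, vals.mean(), vals.std(), vals.min())
```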
Paired Fold Deltas¶
Do not compare candidates only by separate means. Compare them fold by fold on the same split.
# scores_a and scores_b are cross_validate results for two candidates on the SAME folds
delta = scores_b["test_roc_auc"] - scores_a["test_roc_auc"]
mean_delta = delta.mean()
worst_delta = delta.min()
Why this matters:
- a tiny mean win can disappear when you inspect the paired differences
- one model may have a better mean only because it spiked on one easy fold
- a safer model often has a smaller worst-fold loss even if the means are close
If the paired deltas are inconsistent, the selection claim should stay weak.
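A self-contained sketch of the paired comparison (both candidates and the dataset are illustrative). The key detail is that both calls reuse the same `cv` object with a fixed `random_state`, so each delta compares the two models on an identical fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # fixed folds for both calls

scores_a = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring={"roc_auc": "roc_auc"}
)
scores_b = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring={"roc_auc": "roc_auc"}
)

delta = scores_b["test_roc_auc"] - scores_a["test_roc_auc"]  # fold-by-fold, paired
mean_delta = delta.mean()
worst_delta = delta.min()
```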
Out-Of-Fold Inspection¶
from sklearn.model_selection import cross_val_predict
oof_pred = cross_val_predict(model, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
Use this when you want to inspect where the model is wrong on rows it did not train on.
Useful questions:
- which rows are consistently misranked
- which subgroup gets the worst probabilities
- whether the error pattern points to a feature problem or a split problem
Do not use out-of-fold predictions as a shortcut to a final score unless you understand the metric behavior and the fold sizes.
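A sketch of that inspection loop on a toy dataset (names and sizes are assumptions): rank rows by the gap between the label and the out-of-fold probability, then look at the worst ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_train, y_train = make_classification(n_samples=300, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = cross_val_predict(
    LogisticRegression(max_iter=1000), X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]

# absolute gap between the label and the out-of-fold probability
errors = np.abs(y_train - oof_pred)
worst_rows = np.argsort(errors)[-10:]  # the ten rows the model gets most wrong
```

Those row indices are where the subgroup and feature questions above get answered.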
Group And Time Splits¶
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold, TimeSeriesSplit
group_cv = GroupKFold(n_splits=5)
group_stratified_cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=0)
time_cv = TimeSeriesSplit(n_splits=5)
Use GroupKFold when leakage would happen if the same person, file, or session appeared in both train and validation.
Use StratifiedGroupKFold when you need both class balance and group isolation. That pattern is common when the positive class is rare and the samples also come from repeated entities.
Use TimeSeriesSplit when each later fold should train on earlier history and validate on later data.
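One easy mistake with group-aware splitters is forgetting to pass the `groups` array to the scoring helper as well. A minimal sketch (the customer layout here is an assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
groups = np.repeat(np.arange(40), 5)  # 40 customers, 5 transactions each

# groups must be passed here too, not only when constructing the splitter
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=GroupKFold(n_splits=5),
    groups=groups,
    scoring="roc_auc",
)
```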
Nested CV When Search Is Active¶
If you are tuning hyperparameters or comparing many model settings, the honest pattern is nested:
- inner CV chooses settings
- outer CV estimates the performance of that choice process
You do not always need full nested CV in day-to-day work, but the logic matters. Once the same folds pick the settings and certify them, the mean is already optimistic.
A practical compromise is:
- hold out a final test set
- use inner CV for tuning on the remaining training data
- report the tuned model once on the holdout
Worked pattern:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# assumes `pipeline` is a Pipeline whose final estimator step is named "model"
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(
pipeline,
param_grid={"model__C": [0.1, 1.0, 10.0]},
cv=inner_cv,
scoring="roc_auc",
)
outer_results = cross_validate(
search,
X_train,
y_train,
cv=outer_cv,
scoring="roc_auc",
)
The interpretation is important:
- inner CV chooses settings
- outer CV estimates the performance of that whole choice procedure
- the outer mean is evidence for trusting the selection rule, not just the final parameter value
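A self-contained version of the nested pattern, with `return_estimator=True` used to check whether the outer folds agree on the chosen setting (the pipeline, grid, and dataset here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train, y_train = make_classification(n_samples=300, random_state=0)
pipeline = Pipeline(
    [("scale", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]
)
inner_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    pipeline,
    param_grid={"model__C": [0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)
outer = cross_validate(
    search, X_train, y_train, cv=outer_cv, scoring="roc_auc", return_estimator=True
)

# which C did each outer fold's inner search pick?
chosen_c = [est.best_params_["model__C"] for est in outer["estimator"]]
```

If `chosen_c` varies wildly across outer folds, the selection rule itself is unstable, which is exactly what the outer loop is there to expose.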
Walk-Forward Temporal Example¶
Time-aware validation needs both order and spacing.
from sklearn.model_selection import TimeSeriesSplit
horizon = 24
gap = 6
time_cv = TimeSeriesSplit(n_splits=4, test_size=horizon, gap=gap)
Useful meanings:
- `test_size=horizon` keeps the validation horizon aligned with the task you care about
- `gap=gap` prevents rows near the boundary from leaking through lagged or slowly revealed features
If the model will always predict six hours ahead, one day ahead, or one week ahead, the validation horizon should imitate that instead of using arbitrary fold widths.
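A sketch that makes the window geometry visible (200 hourly rows, with the horizon and gap values from above; the sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows, horizon, gap = 200, 24, 6
X = np.zeros((n_rows, 1))
time_cv = TimeSeriesSplit(n_splits=4, test_size=horizon, gap=gap)

windows = []
for train_idx, val_idx in time_cv.split(X):
    # training always ends `gap` rows before validation begins
    windows.append((train_idx.max(), val_idx.min(), val_idx.max()))
```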
Permutation Check¶
from sklearn.model_selection import permutation_test_score
score, perm_scores, pvalue = permutation_test_score(model, X_train, y_train, cv=cv, scoring="roc_auc")
This is useful when a score looks suspiciously strong and you want a sanity check against chance.
What To Inspect¶
- whether the fold scores are close together or noisy
- whether one model wins on average but loses badly on some folds
- whether the metric choice matches the task you actually care about
- whether the cross-validation setup mirrors the final training workflow
- whether the fold means agree with the later validation result
- whether one fold is a clear outlier
- whether different metrics tell the same story or a different one
- whether the split design matches the deployment story
- whether the fold assignment respects groups or time order
- whether the out-of-fold errors line up with the suspicious examples you expected
- whether a high score survives a permutation sanity check
- whether `return_estimator=True` reveals one fold with a very different learned model
If the spread is large, the mean alone is not enough. A stable-looking average can hide a brittle model family. If the folds disagree strongly, the task is telling you the evaluation story is not stable enough yet.
One more important caution: fold standard deviation is not a confidence interval. It is only a rough measure of fold-to-fold spread under one split design. Do not present it as if it were formal uncertainty around deployment performance.
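One honest way to get a richer feel for the spread, without pretending it is a confidence interval, is repetition (toy setup; names illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)

# 5 folds x 3 repeats = 15 fold scores under different fold assignments
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc")
spread = scores.std()  # still descriptive spread, not formal uncertainty
```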
More Splitters Worth Knowing¶
- `KFold` is the plain baseline. Use it when you do not need stratification, groups, or time order.
- `StratifiedGroupKFold` is the practical answer when both class balance and group boundaries matter.
- `PredefinedSplit` is useful when a project already comes with a fixed validation fold, such as a curated holdout or a custom benchmark slice.
- `LeaveOneOut` can be useful for very small datasets, but it is expensive and rarely the best default.
- `GroupShuffleSplit` is useful when you want random group-aware holdouts instead of a full set of equal-size folds.
Applied pattern:
- if the dataset is tiny and every row matters, consider `LeaveOneOut` or repeated K-fold style comparisons
- if the dataset is grouped and imbalanced, prefer `StratifiedGroupKFold`
- if the benchmark already defines a test fold, use `PredefinedSplit`
- if you need a quick group-aware sanity check, `GroupShuffleSplit` can be easier than a full partition strategy
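For `PredefinedSplit` in particular, the interface is a single array marking each row's fold. A minimal sketch (the eight-row layout is invented for illustration):

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# -1 means "always in training"; 0 marks the one predefined validation fold
test_fold = np.array([-1, -1, -1, -1, 0, 0, 0, 0])
ps = PredefinedSplit(test_fold)

splits = list(ps.split())
train_idx, val_idx = splits[0]  # exactly one split, fixed by design
```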
Common Mistakes¶
- using plain shuffled folds on grouped or time-ordered data
- treating `cross_val_predict` as a direct replacement for a proper score
- comparing models with different split logic
- tuning repeatedly against the same cross-validation summary until the folds stop being independent evidence
- reading the mean and ignoring the spread
One useful counterexample is a grouped task where shuffled StratifiedKFold says the model is stable, but GroupKFold reveals a large fold spread and a lower mean. The second result is not harsher by accident. It is closer to deployment.
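That counterexample is easy to reproduce synthetically. In this invented setup the label is a property of the customer and one feature leaks the customer id, so shuffled folds reward memorization while group folds fall toward chance:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_groups, rows_per_group = 40, 5
groups = np.repeat(np.arange(n_groups), rows_per_group)

group_label = rng.integers(0, 2, n_groups)  # each customer's label is pure chance
y = group_label[groups]
# a noisy copy of the customer id: worthless for new customers, perfect for seen ones
X = (groups + rng.normal(0.0, 0.1, groups.size)).reshape(-1, 1)

model = RandomForestClassifier(n_estimators=50, random_state=0)
shuffled = cross_val_score(model, X, y, cv=StratifiedKFold(5, shuffle=True, random_state=0))
grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups)
# shuffled folds look near-perfect; group-aware folds collapse toward chance accuracy
```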
Failure Pattern¶
Treating the cross-validation mean as the final public score. Cross-validation helps you choose the workflow; the final test set still has a different role.
Another failure pattern is tuning against the same fold summary too many times. Repeated peeking eventually turns CV into a soft validation set.
One more failure mode is to use ordinary shuffled folds on grouped or time-based data. That can make the model look more stable than it really is.
Practice¶
- Run cross-validation for two candidate models on the same training split.
- Report both the mean and the spread.
- Explain why the test set should still remain untouched.
- Identify one sign that the fold design is too noisy.
- Explain what would make you prefer `cross_validate` over `cross_val_score`.
- Describe one reason to keep the cross-validation design unchanged after the first comparison.
- Explain when grouped or time-based splitting would be more appropriate.
- Describe one reason a model with a slightly lower mean can still be the safer choice.
- Explain when `cross_val_predict` is useful and when it is the wrong tool.
- Name one situation where `permutation_test_score` is a useful sanity check.
Case Study: Fraud Detection Stability¶
A credit card company used plain KFold CV and ignored customer groups, so transactions from the same customer leaked across folds. After switching to GroupKFold, the CV mean dropped and the fold spread widened, revealing the model had been unstable all along and preventing a costly deployment.
Expanded Quick Quiz¶
Why is CV about variability, not just averages?
Answer: A model with high mean but wild swings is risky; stability matters for reliability.
When to use StratifiedKFold?
Answer: For imbalanced classification to maintain class proportions in folds.
What's the risk of tuning on CV scores repeatedly?
Answer: Overfitting to the CV summary; repeated peeking turns CV into a soft validation set, so keep a locked holdout for the final evaluation.
In the fraud scenario, why avoid regular KFold?
Answer: Customers have multiple transactions; GroupKFold prevents group leakage.
Progress Checkpoint¶
- [ ] Ran CV on a dataset and compared models.
- [ ] Analyzed fold variability.
- [ ] Completed all practice questions.
Milestone: Complete this to unlock "Hyperparameter Tuning" in the scikit-learn Validation track. Post your CV results in the academy forum!
Further Reading¶
- Scikit-Learn CV Guide.
- Papers on nested CV for unbiased model selection.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the mean and spread for both models instead of looking only at the best single fold.
Common Trick¶
If one fold looks wildly different from the others, check the split logic before blaming the model. Sometimes the issue is a rare subgroup, but sometimes it is a bad fold construction choice.
If the mean is close but the spread is large, prefer the model family that is easier to defend operationally. Stability is often the better tradeoff than a tiny mean gain.
If a classifier looks much stronger under StratifiedKFold than under a group-aware splitter, the split is probably telling you something important about repeated entities.
If a time-based task looks great under shuffled folds, stop and ask whether the evaluation is accidentally letting the future leak in.
Longer Connection¶
Continue with scikit-learn Validation and Tuning for the complete selection workflow.