Honest Splits and Baselines¶
Scenario: Evaluating a Spam Email Classifier¶
You're developing a spam filter for an email service. Without proper splits and baselines, you might overstate performance. Establish honest train/validation/test splits and compare against dummy baselines to ensure the model genuinely improves over random or majority-class guessing.
What This Is¶
This topic is about creating a clean split before comparison starts and making sure a real model beats a trivial baseline honestly.
The deeper lesson is that the split is part of the method. If the split is weak, every later comparison becomes suspect.
When You Use It¶
- starting any supervised tabular task
- comparing a first learned model against a dummy baseline
- keeping the validation story clean
- checking whether a metric or feature idea is actually better than the floor
- deciding whether a stronger model is worth the extra complexity yet
Tooling¶
`train_test_split`, `StratifiedShuffleSplit`, `DummyClassifier`, `LogisticRegression`, `stratify`
Library Notes¶
- `train_test_split` is the quickest honest split when you want one train and one validation set.
- `stratify=y` keeps the class mix similar across splits when the target is imbalanced.
- `random_state` makes the split reproducible, which matters when you want to compare model changes fairly.
- `StratifiedShuffleSplit` is useful when you want repeated random stratified splits instead of just one split.
- `StratifiedKFold` is useful when you want several stratified validation folds and a steadier estimate than a single holdout.
- `DummyClassifier(strategy="prior")` is the cleanest baseline when you want to know what the class distribution alone gives you.
- `DummyClassifier(strategy="most_frequent")` is useful when you want to compare against the simplest hard-label rule.
- `DummyClassifier(strategy="stratified")` is useful when you want a random baseline that still respects the observed class balance.
- `DummyClassifier(strategy="uniform")` is a rough lower-information baseline when you want to compare against chance-like behavior.
- `DummyClassifier(strategy="constant")` is useful when the task is really about one class and you want to test that assumption directly.
`prior` and `most_frequent` often look similar in `predict`, but they differ in `predict_proba`. That matters when the metric depends on probabilities rather than hard labels.
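A quick way to see that difference, using a tiny synthetic target (the 80/20 class mix here is purely illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy imbalanced target: 80% class 0, 20% class 1.
X = np.zeros((10, 1))
y = np.array([0] * 8 + [1] * 2)

prior = DummyClassifier(strategy="prior").fit(X, y)
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

# Hard labels agree: both strategies always predict the majority class.
print(prior.predict(X[:3]))      # [0 0 0]
print(majority.predict(X[:3]))   # [0 0 0]

# Probabilities differ: prior reproduces the observed class mix,
# most_frequent puts all probability mass on the majority class.
print(prior.predict_proba(X[:1]))     # [[0.8 0.2]]
print(majority.predict_proba(X[:1]))  # [[1. 0.]]
```

Any probability-based metric (log loss, Brier score, average precision) will therefore score these two baselines differently even though their accuracy is identical.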
The Evaluation Contract¶
The honest contract for a first serious workflow is:
- define the split rule from the deployment story
- reserve a validation set or cross-validation scheme for selection
- keep one locked test or holdout for the final check only
- compare the dummy baseline and the first learned model under exactly the same split
If the task is small, cross-validation can replace a single validation split for selection, but the locked test still keeps a different role. It is not another knob in the tuning loop.
Two baseline numbers are worth knowing cold:
- majority-class accuracy floor: `max_c p(y = c)`
- random-ranking average precision floor: approximately the positive prevalence
Those numbers stop a weak metric choice from looking like progress.
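Both floors can be computed directly from the target vector before any model exists (the 90/10 mix here is hypothetical):

```python
import numpy as np

# Hypothetical imbalanced target: 90% negatives, 10% positives.
y = np.array([0] * 90 + [1] * 10)

# Majority-class accuracy floor: max_c p(y = c).
classes, counts = np.unique(y, return_counts=True)
majority_floor = counts.max() / len(y)

# Random-ranking average-precision floor: roughly the positive prevalence.
ap_floor = (y == 1).mean()

print(majority_floor)  # 0.9
print(ap_floor)        # 0.1
```

If a model reports 90% accuracy on this target, it has learned nothing beyond the class distribution; the average-precision floor makes the same point for ranking metrics.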
Baseline Ladder¶
- start with `DummyClassifier` as the floor
- then try one honest linear model
- then add one stronger family if the baseline is stable
- only after that decide whether more complex tuning is worth the effort
This keeps the first comparison honest and prevents early overfitting to a pleasing result.
Split Recipe¶
Use one split rule and keep it fixed while you compare models:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=0,
)
Why this pattern matters:
- the validation set is created once
- the class mix stays believable
- the same rows are used to compare every first-pass model
- the split can be repeated exactly when you need to explain a result later
If the task has groups, time order, or repeated entities, this recipe is not enough. Use a split strategy that respects the data shape instead of forcing everything through one random split.
Which Split Should Win¶
Do not ask only whether the model wins. Ask whether it still wins under the split that matches the real task.
- random split: useful only when rows are genuinely independent
- grouped split: required when repeated entities can leak identity across train and validation
- ordered split: required when features are meant to predict the future
- leaky split: any split that lets future or duplicate information cross the boundary; treat it as invalid evidence
If a model looks strong on a random split but weak on grouped or ordered validation, the correct conclusion is usually that the earlier comparison was flattering, not that the grouped split is unfair.
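As a sketch of the grouped case, `GroupShuffleSplit` keeps every entity entirely on one side of the boundary (the group ids and data here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical repeated-entity data: ten users, three rows each.
groups = np.repeat(np.arange(10), 3)
X = np.zeros((30, 1))
y = np.tile([0, 0, 1], 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides, so identity cannot leak across the split.
overlap = set(groups[train_idx]) & set(groups[valid_idx])
print(overlap)  # set()
```

A plain `train_test_split` on the same data would scatter each user's rows across both sides, which is exactly the leakage the grouped split prevents.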
Minimal Example¶
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
If you need several comparable validation draws, repeat the split with a stratified splitter:
from sklearn.model_selection import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.30, random_state=0)
for train_idx, valid_idx in splitter.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
Worked Pattern¶
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
The point of this pattern is not to impress yourself with the model. It is to create a floor that is hard to fool.
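A minimal end-to-end version of that comparison, on synthetic data so the numbers are reproducible (the dataset and its sizes are illustrative, not part of this course's materials):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced task: roughly 80% negatives.
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both scores come from the same validation rows: that is the whole point.
print(f"dummy floor: {dummy.score(X_valid, y_valid):.3f}")
print(f"logistic:    {model.score(X_valid, y_valid):.3f}")
```

The only claim worth making from this output is the gap between the two numbers on identical rows; either score in isolation says much less.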
Another useful pattern is to compare two dummy choices before moving on:
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
prior = DummyClassifier(strategy="prior").fit(X_train, y_train)
If those two baselines behave very differently, the metric is telling you something about class balance or probability quality before the real model even enters the picture.
For a clearer comparison on a single split, check both hard predictions and probabilities:
majority_pred = majority.predict(X_valid)
prior_proba = prior.predict_proba(X_valid)
That simple check helps you see whether the task is mostly about predicting the common class or about ranking examples with probabilities.
If you want a steadier estimate, wrap the baseline inside stratified folds:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X, y):
    baseline = DummyClassifier(strategy="prior").fit(X[train_idx], y[train_idx])
    score = baseline.score(X[valid_idx], y[valid_idx])
This is especially helpful when one lucky validation split would otherwise make the baseline look misleadingly strong or weak.
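The fold loop can also be collapsed into a single `cross_val_score` call, which returns one score per fold and makes the mean and spread explicit (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced task: roughly 80% negatives.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy per stratified fold for the prior baseline.
scores = cross_val_score(DummyClassifier(strategy="prior"), X, y, cv=cv)
print(f"baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds is what "the baseline is stable" looks like in numbers; a large one means even the floor is split-sensitive.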
What To Check¶
- whether the validation split was fixed before feature work began
- whether the baseline uses the same input matrix as the real model
- whether the train and validation class mix looks plausible
- whether the dummy model is strong enough to expose a weak metric choice
- whether the baseline and the real model are competing under the same assumptions
One reliable counterexample is a repeated-entity dataset where a random split makes the learned model look excellent while GroupKFold cuts the score back toward the dummy floor. That is not bad luck. It is the split revealing that the earlier score was partially identity lookup.
If the baseline already looks suspiciously good, that is usually a sign to inspect the split, not to celebrate the model.
Inspection Habits¶
- print the class balance in train and validation before comparing models
- check that every model sees the same split
- compare both a hard-label baseline and a probability-aware baseline when the metric uses probabilities
- ask whether the baseline is strong because the task is easy or because the split is leaking
- compare against the simplest model before reaching for tuning
If a validation score feels too good, ask what would happen if the target distribution were slightly more imbalanced.
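The first inspection habit, printing the class balance on both sides of the split, can be as short as this (the 90/10 target is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90% negatives, 10% positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# With stratify=y the positive rate should match on both sides.
print(f"train positive rate: {y_train.mean():.2f}")  # 0.10
print(f"valid positive rate: {y_valid.mean():.2f}")  # 0.10
```

If those two rates drift apart noticeably, either stratification was dropped or the dataset is small enough that a single split cannot be trusted.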
Failure Pattern¶
Building features or choosing the model before the split is fixed. Once the split boundary moves around casually, the comparison stops being trustworthy.
Another failure pattern is letting a good dummy baseline talk you out of a real model. A baseline is a reference point, not the end goal.
If the data are imbalanced, the baseline can be deceptively strong on accuracy. In that case, a rare-class metric usually tells a more honest story.
Another failure pattern is using stratify as a cure-all. Stratification helps preserve class ratios, but it does not fix leakage, group overlap, or a bad feature definition.
Another failure pattern is treating one lucky split as a final answer. A single split is a decision aid, not a guarantee.
Applied Examples¶
- In a rare-fraud task, `DummyClassifier(strategy="prior")` can show that accuracy is almost meaningless before you touch the model.
- In a three-class task, `DummyClassifier(strategy="most_frequent")` gives a harder floor than uniform chance if the classes are uneven.
- In a review-queue task, `train_test_split(..., stratify=y, random_state=0)` helps you keep the positive rate believable while you compare a baseline logistic model against a more complex family.
- In a repeated-evaluation setting, `StratifiedShuffleSplit` can give you several comparable train/validation draws without changing the overall split story.
- In a small dataset, `StratifiedKFold` can show whether a baseline is stable or just lucky on one holdout.
- In a probability-scored task, `DummyClassifier(strategy="prior")` is often the better first check than `most_frequent` because it produces a meaningful probability floor.
Practice¶
- Train a dummy baseline and report its validation accuracy.
- Train logistic regression on the same split and compare it honestly.
- Explain why the split should be chosen before tuning begins.
- State one class-imbalance situation where accuracy would mislead you.
- Describe one change that should be postponed until after the baseline is stable.
- Explain why reproducibility matters even before tuning starts.
- Explain what you would compare after the dummy baseline.
- State one reason to prefer `prior` over `most_frequent` in a baseline.
- Explain when `stratify=y` is helpful and when it is not enough.
- Describe what you would check before believing a single split result.
Case Study: Baseline Checks in Kaggle Competitions¶
Strong Kaggle entries typically begin with honest local splits and dummy baselines to avoid overfitting to the public leaderboard. That discipline helps their models generalize to the private test set, which often decides the final ranking.
Expanded Quick Quiz¶
Why stratify splits for imbalanced data?
Answer: To maintain similar class distributions in train and test sets, preventing biased evaluation.
What's the difference between prior and most_frequent dummy strategies?
Answer: Prior predicts class probabilities based on training distribution; most_frequent always predicts the majority class.
When to use cross-validation over a single split?
Answer: For smaller datasets to get a more stable performance estimate.
In the spam classifier scenario, why compare to a dummy baseline?
Answer: To ensure the model genuinely improves over trivial rules like "all emails are spam."
Progress Checkpoint¶
- [ ] Created stratified train/test splits.
- [ ] Trained and evaluated dummy baselines (prior, most_frequent).
- [ ] Compared a simple model (e.g., logistic regression) against baselines.
- [ ] Used cross-validation for stable estimates.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Leakage Patterns" in the Classical ML track. Share your baseline comparison in the academy Discord!
Further Reading¶
- Scikit-Learn Model Selection Guide.
- "Why Do We Need Train/Test Splits?" tutorials.
- Baseline model best practices.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the dummy-versus-logistic comparison first, then compare it with the leaky variant.
Common Trick¶
If the dummy baseline is already strong, switch attention from raw accuracy to average precision, ROC AUC, or a thresholded metric that matches the task. The right metric often reveals that the real model has more room to improve than accuracy suggests.
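A sketch of that metric switch on a synthetic rare-positive task (the dataset, sizes, and class weights are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic rare-positive task: accuracy flatters the baseline here.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Average precision separates the models far more than accuracy does.
for name, clf in [("dummy", dummy), ("logistic", model)]:
    ap = average_precision_score(y_valid, clf.predict_proba(X_valid)[:, 1])
    print(f"{name}: accuracy={clf.score(X_valid, y_valid):.3f}  AP={ap:.3f}")
```

The dummy's average precision collapses toward the positive prevalence while its accuracy stays high, which is exactly the "room to improve" that accuracy hides.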
If the baseline and the learned model are very close, the next improvement might be a better feature representation or a better split story rather than a different algorithm.
If you need one repeatable comparison and the task is class-imbalanced, StratifiedShuffleSplit is often the better habit than ad hoc re-splitting by hand.
If you want the simplest baseline to explain to a teammate, DummyClassifier(strategy="prior") is usually the cleanest first reference point because it turns the observed class mix into a probability floor.
If the baseline is being used as a teaching reference, keep the code visible and short: one split, one dummy model, one real model, one metric. That makes it easier for students to inspect the whole comparison at once.
What Students Often Miss¶
- `DummyClassifier` is not only for classification accuracy; it is also a way to test whether your metric is sensible.
- `random_state` is not just for reproducibility; it is part of making the comparison defensible.
- a split that preserves class balance is still not honest if leakage is present
- a good baseline is not the final goal; it is the floor that makes the next improvement meaningful
Questions To Ask¶
- Is the split reproducible enough that someone else could get the same comparison?
- Does the baseline still look weak when you switch from accuracy to a ranking metric?
- Is the model winning because it learned signal or because the split is favorable?
- Would `prior`, `most_frequent`, or `stratified` be the most informative baseline for this task?
- What would make you choose repeated stratified splits instead of one holdout?
Function Cheat Sheet¶
train_test_split
- use it for a first honest holdout
- add `stratify=y` when class balance matters
- add `random_state` when you want the same comparison again later
- avoid it when rows belong to groups or when order carries meaning
StratifiedShuffleSplit
- use it when one holdout is too noisy
- use it when you want repeated random splits with similar class balance
- avoid it when the split must respect time or grouped entities
StratifiedKFold
- use it when you want a more stable estimate across several folds
- use it when the dataset is small and one holdout feels too fragile
- avoid it when group identity or order must stay intact
DummyClassifier
- use `prior` when you want a probability-aware floor
- use `most_frequent` when you want the simplest label-only floor
- use `stratified` when you want a random baseline that still mirrors class frequencies
- use `uniform` when you want a rough chance-level reference
- use `constant` when you want to test a fixed label assumption directly
Longer Connection¶
Continue with scikit-learn Validation and Tuning for the full split-selection-tuning workflow.