Honest Splits and Baselines¶
Scenario: Evaluating a Spam Email Classifier¶
You're developing a spam filter for an email service. Without proper splits and baselines, you might overstate performance. Establish honest train/validation/test splits and compare against dummy baselines to ensure the model genuinely improves over random or majority-class guessing.
What This Is¶
This topic is about creating a clean split before comparison starts and making sure a real model beats a trivial baseline honestly.
The deeper lesson is that the split is part of the method. If the split is weak, every later comparison becomes suspect.
When You Use It¶
- starting any supervised tabular task
- comparing a first learned model against a dummy baseline
- keeping the validation story clean
- checking whether a metric or feature idea is actually better than the floor
- deciding whether a stronger model is worth the extra complexity yet
Tooling¶
`train_test_split`, `StratifiedShuffleSplit`, `DummyClassifier`, `LogisticRegression`, `stratify`
Library Notes¶
- `train_test_split` is the quickest honest split when you want one train and one validation set.
- `stratify=y` keeps the class mix similar across splits when the target is imbalanced.
- `random_state` makes the split reproducible, which matters when you want to compare model changes fairly.
- `StratifiedShuffleSplit` is useful when you want repeated random stratified splits instead of just one split.
- `StratifiedKFold` is useful when you want several stratified validation folds and a steadier estimate than a single holdout.
- `DummyClassifier(strategy="prior")` is the cleanest baseline when you want to know what the class distribution alone gives you.
- `DummyClassifier(strategy="most_frequent")` is useful when you want to compare against the simplest hard-label rule.
- `DummyClassifier(strategy="stratified")` is useful when you want a random baseline that still respects the observed class balance.
- `DummyClassifier(strategy="uniform")` is a rough lower-information baseline when you want to compare against chance-like behavior.
- `DummyClassifier(strategy="constant")` is useful when the task is really about one class and you want to test that assumption directly.
`prior` and `most_frequent` often look similar in `predict`, but they differ in `predict_proba`. That matters when the metric depends on probabilities rather than hard labels.
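A quick way to see that difference, using a tiny synthetic target (the 80/20 class mix here is purely illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy imbalanced target: 80% class 0, 20% class 1.
X = np.zeros((10, 1))
y = np.array([0] * 8 + [1] * 2)

prior = DummyClassifier(strategy="prior").fit(X, y)
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

# Hard labels agree: both strategies always predict the majority class.
print(prior.predict(X[:3]))      # [0 0 0]
print(majority.predict(X[:3]))   # [0 0 0]

# Probabilities differ: prior reproduces the observed class mix,
# most_frequent puts all probability mass on the majority class.
print(prior.predict_proba(X[:1]))     # [[0.8 0.2]]
print(majority.predict_proba(X[:1]))  # [[1. 0.]]
```

Any probability-based metric (log loss, Brier score, average precision) will therefore score these two baselines differently even though their accuracy is identical.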
The Evaluation Contract¶
The honest contract for a first serious workflow is:
- define the split rule from the deployment story
- reserve a validation set or cross-validation scheme for selection
- keep one locked test or holdout for the final check only
- compare the dummy baseline and the first learned model under exactly the same split
If the task is small, cross-validation can replace a single validation split for selection, but the locked test still keeps a different role. It is not another knob in the tuning loop.
Two baseline numbers are worth knowing cold:
- majority-class accuracy floor: `max_c p(y = c)`
- random-ranking average precision floor: approximately the positive prevalence
Those numbers stop a weak metric choice from looking like progress.
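Both floors can be computed directly from the target vector before any model exists (the 90/10 mix here is hypothetical):

```python
import numpy as np

# Hypothetical imbalanced target: 90% negatives, 10% positives.
y = np.array([0] * 90 + [1] * 10)

# Majority-class accuracy floor: max_c p(y = c).
classes, counts = np.unique(y, return_counts=True)
majority_floor = counts.max() / len(y)

# Random-ranking average-precision floor: roughly the positive prevalence.
ap_floor = (y == 1).mean()

print(majority_floor)  # 0.9
print(ap_floor)        # 0.1
```

If a model reports 90% accuracy on this target, it has learned nothing beyond the class distribution; the average-precision floor makes the same point for ranking metrics.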
Baseline Ladder¶
- start with `DummyClassifier` as the floor
- then try one honest linear model
- then add one stronger family if the baseline is stable
- only after that decide whether more complex tuning is worth the effort
This keeps the first comparison honest and prevents early overfitting to a pleasing result.
Split Recipe¶
Use one split rule and keep it fixed while you compare models:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X,
    y,
    test_size=0.30,
    stratify=y,
    random_state=0,
)
Why this pattern matters:
- the validation set is created once
- the class mix stays believable
- the same rows are used to compare every first-pass model
- the split can be repeated exactly when you need to explain a result later
If the task has groups, time order, or repeated entities, this recipe is not enough. Use a split strategy that respects the data shape instead of forcing everything through one random split.
Which Split Should Win¶
Do not ask only whether the model wins. Ask whether it still wins under the split that matches the real task.
- random split: useful only when rows are genuinely independent
- grouped split: required when repeated entities can leak identity across train and validation
- ordered split: required when features are meant to predict the future
- leaky split: any split that lets future or duplicate information cross the boundary; treat it as invalid evidence
If a model looks strong on a random split but weak on grouped or ordered validation, the correct conclusion is usually that the earlier comparison was flattering, not that the grouped split is unfair.
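As a sketch of the grouped case, `GroupShuffleSplit` keeps every entity entirely on one side of the boundary (the group ids and data here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical repeated-entity data: ten users, three rows each.
groups = np.repeat(np.arange(10), 3)
X = np.zeros((30, 1))
y = np.tile([0, 0, 1], 10)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, valid_idx = next(splitter.split(X, y, groups=groups))

# No group appears on both sides, so identity cannot leak across the split.
overlap = set(groups[train_idx]) & set(groups[valid_idx])
print(overlap)  # set()
```

A plain `train_test_split` on the same data would scatter each user's rows across both sides, which is exactly the leakage the grouped split prevents.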
Minimal Example¶
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
If you need several comparable validation draws, repeat the split with a stratified splitter:
from sklearn.model_selection import StratifiedShuffleSplit
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.30, random_state=0)
for train_idx, valid_idx in splitter.split(X, y):
    X_train, X_valid = X[train_idx], X[valid_idx]
    y_train, y_valid = y[train_idx], y[valid_idx]
Worked Pattern¶
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
The point of this pattern is not to impress yourself with the model. It is to create a floor that is hard to fool.
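A minimal end-to-end version of that comparison, on synthetic data so the numbers are reproducible (the dataset and its sizes are illustrative, not part of this course's materials):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced task: roughly 80% negatives.
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Both scores come from the same validation rows: that is the whole point.
print(f"dummy floor: {dummy.score(X_valid, y_valid):.3f}")
print(f"logistic:    {model.score(X_valid, y_valid):.3f}")
```

The only claim worth making from this output is the gap between the two numbers on identical rows; either score in isolation says much less.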
Another useful pattern is to compare two dummy choices before moving on:
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
prior = DummyClassifier(strategy="prior").fit(X_train, y_train)
If those two baselines behave very differently, the metric is telling you something about class balance or probability quality before the real model even enters the picture.
For a clearer comparison on a single split, check both hard predictions and probabilities:
majority_pred = majority.predict(X_valid)
prior_proba = prior.predict_proba(X_valid)
That simple check helps you see whether the task is mostly about predicting the common class or about ranking examples with probabilities.
If you want a steadier estimate, wrap the baseline inside stratified folds:
from sklearn.model_selection import StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, valid_idx in cv.split(X, y):
    baseline = DummyClassifier(strategy="prior").fit(X[train_idx], y[train_idx])
    score = baseline.score(X[valid_idx], y[valid_idx])
This is especially helpful when one lucky validation split would otherwise make the baseline look misleadingly strong or weak.
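The fold loop can also be collapsed into a single `cross_val_score` call, which returns one score per fold and makes the mean and spread explicit (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced task: roughly 80% negatives.
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy per stratified fold for the prior baseline.
scores = cross_val_score(DummyClassifier(strategy="prior"), X, y, cv=cv)
print(f"baseline accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A small standard deviation across folds is what "the baseline is stable" looks like in numbers; a large one means even the floor is split-sensitive.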
What To Check¶
- whether the validation split was fixed before feature work began
- whether the baseline uses the same input matrix as the real model
- whether the train and validation class mix looks plausible
- whether the dummy model is strong enough to expose a weak metric choice
- whether the baseline and the real model are competing under the same assumptions
One reliable counterexample is a repeated-entity dataset where a random split makes the learned model look excellent while GroupKFold cuts the score back toward the dummy floor. That is not bad luck. It is the split revealing that the earlier score was partially identity lookup.
If the baseline already looks suspiciously good, that is usually a sign to inspect the split, not to celebrate the model.
Inspection Habits¶
- print the class balance in train and validation before comparing models
- check that every model sees the same split
- compare both a hard-label baseline and a probability-aware baseline when the metric uses probabilities
- ask whether the baseline is strong because the task is easy or because the split is leaking
- compare against the simplest model before reaching for tuning
If a validation score feels too good, ask what would happen if the target distribution were slightly more imbalanced.
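The first inspection habit, printing the class balance on both sides of the split, can be as short as this (the 90/10 target is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 90% negatives, 10% positives.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# With stratify=y the positive rate should match on both sides.
print(f"train positive rate: {y_train.mean():.2f}")  # 0.10
print(f"valid positive rate: {y_valid.mean():.2f}")  # 0.10
```

If those two rates drift apart noticeably, either stratification was dropped or the dataset is small enough that a single split cannot be trusted.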
Failure Pattern¶
Building features or choosing the model before the split is fixed. Once the split boundary moves around casually, the comparison stops being trustworthy.
Another failure pattern is letting a good dummy baseline talk you out of a real model. A baseline is a reference point, not the end goal.
If the data are imbalanced, the baseline can be deceptively strong on accuracy. In that case, a rare-class metric usually tells a more honest story.
Another failure pattern is using stratify as a cure-all. Stratification helps preserve class ratios, but it does not fix leakage, group overlap, or a bad feature definition.
Another failure pattern is treating one lucky split as a final answer. A single split is a decision aid, not a guarantee.
Applied Examples¶
- In a rare-fraud task, `DummyClassifier(strategy="prior")` can show that accuracy is almost meaningless before you touch the model.
- In a three-class task, `DummyClassifier(strategy="most_frequent")` gives a harder floor than uniform chance if the classes are uneven.
- In a review-queue task, `train_test_split(..., stratify=y, random_state=0)` helps you keep the positive rate believable while you compare a baseline logistic model against a more complex family.
- In a repeated-evaluation setting, `StratifiedShuffleSplit` can give you several comparable train/validation draws without changing the overall split story.
- In a small dataset, `StratifiedKFold` can show whether a baseline is stable or just lucky on one holdout.
- In a probability-scored task, `DummyClassifier(strategy="prior")` is often the better first check than `most_frequent` because it produces a meaningful probability floor.
Practice¶
- Train a dummy baseline and report its validation accuracy.
- Train logistic regression on the same split and compare it honestly.
- Explain why the split should be chosen before tuning begins.
- State one class-imbalance situation where accuracy would mislead you.
- Describe one change that should be postponed until after the baseline is stable.
- Explain why reproducibility matters even before tuning starts.
- Explain what you would compare after the dummy baseline.
- State one reason to prefer `prior` over `most_frequent` in a baseline.
- Explain when `stratify=y` is helpful and when it is not enough.
- Describe what you would check before believing a single split result.
Case Study: Baseline Checks in Kaggle Competitions¶
Strong Kaggle entries typically begin with honest local splits and dummy baselines to avoid overfitting to the public leaderboard. That discipline helps their models generalize to the private test set, which often decides the final ranking.
Expanded Quick Quiz¶
Why stratify splits for imbalanced data?
Answer: To maintain similar class distributions in train and test sets, preventing biased evaluation.
What's the difference between prior and most_frequent dummy strategies?
Answer: Prior predicts class probabilities based on training distribution; most_frequent always predicts the majority class.
When to use cross-validation over a single split?
Answer: For smaller datasets to get a more stable performance estimate.
In the spam classifier scenario, why compare to a dummy baseline?
Answer: To ensure the model genuinely improves over trivial rules like "all emails are spam."
Progress Checkpoint¶
- [ ] Created stratified train/test splits.
- [ ] Trained and evaluated dummy baselines (prior, most_frequent).
- [ ] Compared a simple model (e.g., logistic regression) against baselines.
- [ ] Used cross-validation for stable estimates.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Leakage Patterns" in the Classical ML track. Share your baseline comparison in the academy Discord!
Further Reading¶
- Scikit-Learn Model Selection Guide.
- "Why Do We Need Train/Test Splits?" tutorials.
- Baseline model best practices.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the dummy-versus-logistic comparison first, then compare it with the leaky variant.
Common Trick¶
If the dummy baseline is already strong, switch attention from raw accuracy to average precision, ROC AUC, or a thresholded metric that matches the task. The right metric often reveals that the real model has more room to improve than accuracy suggests.
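A sketch of that metric switch on a synthetic rare-positive task (the dataset, sizes, and class weights are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic rare-positive task: accuracy flatters the baseline here.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Average precision separates the models far more than accuracy does.
for name, clf in [("dummy", dummy), ("logistic", model)]:
    ap = average_precision_score(y_valid, clf.predict_proba(X_valid)[:, 1])
    print(f"{name}: accuracy={clf.score(X_valid, y_valid):.3f}  AP={ap:.3f}")
```

The dummy's average precision collapses toward the positive prevalence while its accuracy stays high, which is exactly the "room to improve" that accuracy hides.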
If the baseline and the learned model are very close, the next improvement might be a better feature representation or a better split story rather than a different algorithm.
If you need one repeatable comparison and the task is class-imbalanced, StratifiedShuffleSplit is often the better habit than ad hoc re-splitting by hand.
If you want the simplest baseline to explain to a teammate, DummyClassifier(strategy="prior") is usually the cleanest first reference point because it turns the observed class mix into a probability floor.
If the baseline is being used as a teaching reference, keep the code visible and short: one split, one dummy model, one real model, one metric. That makes it easier for students to inspect the whole comparison at once.
What Students Often Miss¶
- `DummyClassifier` is not only for classification accuracy; it is also a way to test whether your metric is sensible.
- `random_state` is not just for reproducibility; it is part of making the comparison defensible.
- a split that preserves class balance is still not honest if leakage is present
- a good baseline is not the final goal; it is the floor that makes the next improvement meaningful
Questions To Ask¶
- Is the split reproducible enough that someone else could get the same comparison?
- Does the baseline still look weak when you switch from accuracy to a ranking metric?
- Is the model winning because it learned signal or because the split is favorable?
- Would `prior`, `most_frequent`, or `stratified` be the most informative baseline for this task?
- What would make you choose repeated stratified splits instead of one holdout?
Function Cheat Sheet¶
train_test_split
- use it for a first honest holdout
- add `stratify=y` when class balance matters
- add `random_state` when you want the same comparison again later
- avoid it when rows belong to groups or when order carries meaning
StratifiedShuffleSplit
- use it when one holdout is too noisy
- use it when you want repeated random splits with similar class balance
- avoid it when the split must respect time or grouped entities
StratifiedKFold
- use it when you want a more stable estimate across several folds
- use it when the dataset is small and one holdout feels too fragile
- avoid it when group identity or order must stay intact
DummyClassifier
- use `prior` when you want a probability-aware floor
- use `most_frequent` when you want the simplest label-only floor
- use `stratified` when you want a random baseline that still mirrors class frequencies
- use `uniform` when you want a rough chance-level reference
- use `constant` when you want to test a fixed label assumption directly
Longer Connection¶
Continue with scikit-learn Validation and Tuning for the full split-selection-tuning workflow.