Leakage Patterns¶
Scenario: Predicting Stock Prices with Historical Data¶
You're building a model to predict tomorrow's stock prices using today's data. Accidentally including future prices or post-event features would cause leakage. This page shows how to detect and prevent it so the model uses only information that would actually be available at prediction time, which is what makes its predictions realistic.
Leakage is one of the fastest ways to get a fake improvement in classical ML. The model looks strong on paper, then falls apart on new data because it learned from information that would not be available at prediction time.
The main habit to build is simple: split the data first, and keep every learned transformation inside the training boundary.
Use this page early in the classical workflow, not only after tuning. Honest splits, cross-validation, and feature work all depend on the same leakage discipline.
What Leakage Usually Means¶
Leakage happens when the training process sees something it should not see.
That can be:
- a statistic computed from the full dataset before the split
- a label-derived feature
- a row that appears in both train and validation
- a person, account, patient, or device that appears across folds
- future information in a time-ordered problem
- a join that accidentally carries post-outcome data into features
The dangerous part is that leakage often makes the model look better without making it truly better.
Prediction-Time Availability Timeline¶
The most reliable leakage question is chronological:
- when is the prediction made
- when does each feature become available
- when does the outcome become known
A feature is suspicious if its availability time is after the prediction time.
Useful rule:
feature_ready_time <= prediction_time < outcome_time
If that inequality is false, the feature does not belong in the model no matter how predictive it looks.
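The rule can be applied mechanically when each feature has a known availability time. A minimal sketch, assuming a hypothetical feature registry whose names and dates are illustrative only:

```python
import pandas as pd

# Hypothetical feature registry: when each feature becomes available.
# The feature names and timestamps are illustrative, not a real schema.
features = pd.DataFrame({
    "feature": ["account_age", "avg_30d_spend", "chargeback_flag"],
    "ready_time": pd.to_datetime(["2024-01-01", "2024-01-05", "2024-02-01"]),
})

prediction_time = pd.Timestamp("2024-01-10")

# Keep only features that satisfy feature_ready_time <= prediction_time.
allowed = features.loc[features["ready_time"] <= prediction_time, "feature"].tolist()
print(allowed)  # the feature that becomes available after prediction time is dropped
```

Maintaining such a registry alongside the feature table turns the availability question into a filter instead of a judgment call.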
Leakage Triage¶
When a score jump looks suspicious, run this triage before you celebrate it:
- check prediction-time availability for the new feature
- check whether preprocessing stayed inside the training boundary
- check whether the splitter matches the entity or time structure
- remove the suspicious feature and see whether the gain survives
If the gain disappears under one of those checks, the next move is diagnosis, not more tuning.
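The last triage step, removing the suspicious feature and rechecking, can be sketched on synthetic data. Here the leaky column is deliberately constructed as a noisy copy of the label, so the pattern of the check is visible:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X_honest = rng.normal(size=(200, 5))          # pure noise features
leaky = y + rng.normal(scale=0.1, size=200)   # synthetic leak: a noisy copy of the label
X_leaky = np.column_stack([X_honest, leaky])

model = LogisticRegression(max_iter=1000)
score_with = cross_validate(model, X_leaky, y, scoring="roc_auc")["test_score"].mean()
score_without = cross_validate(model, X_honest, y, scoring="roc_auc")["test_score"].mean()
print(round(score_with, 2), round(score_without, 2))
# The gain does not survive removal: strong evidence the feature is leaky.
```

In a real project the "leaky" column is of course not labeled; the point is the workflow of scoring with and without the candidate feature.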
The Most Useful Tools¶
train_test_split¶
Use this to create a clean first split before any preprocessing.
It is the default move for small and medium problems when you want a simple train/validation split. The key idea is that the split should happen before scaling, encoding, imputation, feature selection, or target encoding.
Good habit:
- split first
- fit transformations only on the training portion
- transform validation with the fitted transformers
Pipeline¶
Use Pipeline when you want preprocessing and modeling to behave as one unit.
This is the safest everyday pattern because it makes scikit-learn fit each transformer on the right subset during cross-validation. It also prevents the common mistake of fitting a scaler, imputer, selector, or encoder on all rows before evaluation.
Typical use:
- scaling numeric columns
- imputing missing values
- feature selection
- dimensionality reduction
- the final estimator
ColumnTransformer¶
Use ColumnTransformer when different columns need different preprocessing.
This is especially useful for mixed tabular data. Numeric columns can be scaled, categorical columns can be one-hot encoded, and the whole thing still stays inside the same pipeline.
Why it helps with leakage:
- it keeps preprocessing tied to the training fold
- it reduces manual column handling
- it makes feature flow easier to inspect
make_column_selector¶
Use make_column_selector when your DataFrame has mixed dtypes and you want scikit-learn to select columns by type or name pattern.
This helps reduce hand-written column lists, which are often where mistakes creep in when data changes shape over time.
OneHotEncoder(handle_unknown="ignore")¶
Use this instead of hand-made dummy columns when you have categorical data.
This setting is useful because validation or test data can contain categories that were not seen during training. With handle_unknown="ignore", those categories do not crash the transform step.
Why it matters for leakage:
- separate dummy creation on train and validation can create mismatched columns
- manual column alignment can hide problems
- one encoder fitted on train keeps the schema consistent
GroupKFold¶
Use GroupKFold when rows belong to the same entity and that entity should never be split across train and validation.
Examples:
- multiple visits from the same patient
- multiple sessions from the same user
- multiple rows from the same document or machine
This prevents the model from seeing one record from a group and being judged on another record from the same group.
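A quick sketch that verifies no group ever crosses the split boundary, using toy groups standing in for patient or user IDs:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
groups = np.array(["a", "a", "b", "b", "c", "c"])  # e.g. patient IDs, two rows each

for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    # No group appears on both sides of any split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
    print(groups[train_idx], "|", groups[test_idx])
```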
TimeSeriesSplit¶
Use TimeSeriesSplit when order matters and future rows must not influence past rows.
This is the right choice for chronological data because random shuffling can leak future patterns into training. The training set grows over time, and each test fold comes after the corresponding train fold.
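A sketch of the fold layout on a toy chronological array, showing the growing training window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # rows already in chronological order

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Every test index comes strictly after every train index.
    assert train_idx.max() < test_idx.min()
    print(train_idx, "->", test_idx)
```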
cross_validate¶
Use cross_validate when you want a proper cross-validation score for a full pipeline.
It evaluates the estimator on each fold and keeps the preprocessing inside the fold boundary when the estimator is a Pipeline.
Good use:
- compare model families
- compare feature sets
- check score stability across folds
TargetEncoder¶
Use TargetEncoder carefully when you need target-aware categorical encoding.
The important detail is that its fit_transform uses cross fitting to reduce leakage. That means it is safer than manually computing target means over the full dataset.
If you use it, prefer the encoder inside a pipeline so the training fold gets the correct cross-fitted behavior.
pandas merge(validate=..., indicator=True)¶
Use merge(validate=...) when joining feature tables.
This is one of the best ways to catch accidental many-to-many joins or duplicate keys. If the merge shape is not what you expect, pandas can raise an error before the leakage reaches the model.
indicator=True adds a column that shows whether rows came from the left table, the right table, or both.
pandas DataFrame.duplicated¶
Use duplicated to inspect repeated rows or repeated entity keys.
This is useful before splitting and after joining. If the same entity appears in both train and validation, the model may learn the entity rather than the problem.
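A sketch of both checks on a toy table with an illustrative entity_id column:

```python
import pandas as pd

df = pd.DataFrame({
    "entity_id": [1, 1, 2, 3, 3, 3],
    "value": [10, 10, 20, 30, 31, 32],
})

# Fully duplicated rows (identical in every column).
print(df[df.duplicated(keep=False)])

# Repeated entity keys, which matter for group-aware splitting.
repeated = df["entity_id"][df["entity_id"].duplicated(keep=False)].unique()
print(repeated)  # entities that appear in more than one row
```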
pandas get_dummies¶
Use get_dummies only when you have a clear reason not to use OneHotEncoder.
It can work for quick exploration, but it is easier to misuse than a pipeline-based encoder. If you use it separately on train and validation, the columns can drift apart. That often causes silent mistakes or manual alignment hacks.
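A sketch of the drift: the two splits below produce different dummy columns, which a model fitted on one cannot consume from the other:

```python
import pandas as pd

train = pd.DataFrame({"city": ["paris", "tokyo"]})
valid = pd.DataFrame({"city": ["tokyo", "lima"]})

# Encoding each split separately yields mismatched schemas.
train_d = pd.get_dummies(train)
valid_d = pd.get_dummies(valid)

print(sorted(train_d.columns))  # ['city_paris', 'city_tokyo']
print(sorted(valid_d.columns))  # ['city_lima', 'city_tokyo']
```

A single OneHotEncoder fitted on train avoids this entirely, because the column schema is frozen at fit time.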
Safe Patterns¶
Split First, Then Fit¶
This is the most important rule.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
```
The unsafe version is fitting the scaler on all of X before the split. That lets validation statistics leak into training.
Keep Preprocessing Inside a Pipeline¶
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
```
This is the clean default when you want a model that can be cross-validated safely.
Use Column-Aware Preprocessing for Tabular Data¶
```python
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), make_column_selector(dtype_include="number")),
        ("cat", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_include="object")),
    ]
)

model = Pipeline([
    ("prep", preprocessor),
    ("clf", LogisticRegression(max_iter=1000)),
])
```
This pattern is strong for mixed tabular data because the same fitted preprocessing is used in every fold.
Choose the Right Splitter¶
```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit, cross_validate

# Grouped data
cv = GroupKFold(n_splits=5)
scores = cross_validate(model, X, y, groups=groups, cv=cv, scoring="roc_auc")

# Time-ordered data
cv = TimeSeriesSplit(n_splits=5)
scores = cross_validate(model, X, y, cv=cv, scoring="roc_auc")
```
Use group-aware or time-aware splitting whenever random shuffling would mix information that should stay separate.
Validate Joins Before Modeling¶
```python
joined = left.merge(
    right,
    on="entity_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)
```
If the join unexpectedly becomes many-to-many, pandas can stop it. That is much better than discovering the problem after the model gets a fake boost.
Keep Target Encoding Fold-Safe¶
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import TargetEncoder
from sklearn.linear_model import LogisticRegression

model = Pipeline([
    ("encode", TargetEncoder()),
    ("clf", LogisticRegression(max_iter=1000)),
])
```
The important idea is that target-aware encoding should be learned inside the evaluation loop, not from the whole dataset at once.
Adaptive Validation Leakage¶
Leakage is not only a feature problem. It can also happen through the evaluation process itself.
Examples:
- tuning against the same validation set until the split behaves like soft training data
- deciding to clean labels only in the rows that validation exposed as mistakes
- choosing transformations after inspecting the test or hidden-holdout errors
This is adaptive leakage: the model may never read the target directly, but the workflow still absorbs information from a boundary that was meant to stay neutral.
The fix is procedural:
- write the evaluation rule before the iteration starts
- keep one locked holdout out of the loop
- treat surprising gains as a reason to inspect availability, duplication, or peeking
Common Mistakes¶
- Fitting scalers, imputers, selectors, or encoders on the full dataset before splitting.
- Using future data to build features in a time problem.
- Letting the same person, customer, patient, or device appear in both train and validation.
- Computing category statistics from the whole dataset before evaluation.
- Using get_dummies separately on train and validation and then patching column mismatches by hand.
- Joining tables without checking key uniqueness.
- Creating features from the label itself or from fields that are only known after the event.
- Selecting the model after looking at the test set.
Applied Examples¶
Example 1: The scaler trap¶
If you fit a scaler on the full dataset, the validation set influences the mean and variance used by training.
That may look harmless, but it still changes the evaluation. The safe habit is to fit on train only and transform validation with the fitted scaler.
Example 2: Categorical drift¶
If one split has categories that another split never saw, manual dummy creation often breaks the schema.
OneHotEncoder(handle_unknown="ignore") avoids that by producing a stable feature space and by keeping the transform behavior consistent when a new category appears later.
Example 3: Repeated entities¶
Suppose a customer appears in many rows.
If that customer is present in both train and validation, the model may learn the customer identity rather than the actual task. GroupKFold helps prevent that by keeping groups non-overlapping across folds.
Example 4: Time order¶
Suppose your data has timestamps.
If you split randomly, the model can learn patterns from the future and score unrealistically well. TimeSeriesSplit keeps the test fold after the train fold so the evaluation matches the real use case.
Example 5: Feature joins¶
Suppose you join an event table with a customer table.
Check whether the join key is truly unique, and ask whether the joined columns were available before the prediction point. merge(validate=...) and indicator=True are simple ways to catch the dangerous cases.
Example 6: Honest versus leaky scoring¶
If an honest logistic pipeline scores around roc_auc=0.74 and a version with a target-copy or post-outcome feature jumps to roc_auc=0.99, that is not a successful feature-engineering story. It is a leakage diagnosis. Large jumps are evidence to investigate, not evidence to celebrate.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the honest-versus-leaky score gap first, then remove the suspicious feature and check whether the gain survives.
Case Study: Leakage in Healthcare Predictions¶
Healthcare models often suffer from leakage when using post-diagnosis data (e.g., treatment outcomes) to predict diagnoses. Detecting this early prevents overoptimistic results and ensures models are deployable in real clinical settings.
Expanded Quick Quiz¶
Why is leakage dangerous?
Answer: It makes models appear better than they are by using unavailable information, leading to failure in production.
How to prevent temporal leakage?
Answer: Use time-aware splits like TimeSeriesSplit to ensure test data is always after train data.
What does GroupKFold prevent?
Answer: Overlapping entities (e.g., same user) across folds, which could cause the model to learn identities instead of patterns.
In the stock prediction scenario, what feature might leak?
Answer: Any data from after the prediction time, like next day's prices or future news.
Progress Checkpoint¶
- [ ] Identified potential leakage sources (temporal, entity overlap, feature availability).
- [ ] Used appropriate splitters (GroupKFold, TimeSeriesSplit) for data structure.
- [ ] Checked feature availability timelines.
- [ ] Removed suspicious features and re-evaluated performance.
- [ ] Answered quiz questions without peeking.
Milestone: Complete this to unlock "Learning Curves and Bias-Variance" in the Classical ML track. Share your leakage check in the academy Discord!
Further Reading¶
- "Leakage in Machine Learning" articles.
- Scikit-Learn Cross-Validation Guide.
- Case studies on data leakage in competitions.
Practical Questions To Ask¶
- Could I know this feature at prediction time?
- Does this feature come from the label, a future event, or a later system update?
- Could the same entity appear in two different folds?
- Did I fit anything on the full dataset before splitting?
- Did I use a splitter that matches the data structure?
- Did I check the join cardinality before modeling?
- If I remove the suspicious feature, does the score collapse?
Inspection Habits That Catch Leakage¶
- Compare train and validation scores fold by fold, not just the mean.
- Shuffle the labels once. If the score does not collapse to near-noise, something in the pipeline is leaking.
- Remove one suspicious feature at a time and see whether the score changes dramatically.
- Inspect the rows that share entity IDs across splits.
- Print merge counts and key uniqueness before and after each join.
- Review which transformers are being fitted inside the pipeline and which are not.
- Treat unusually large score jumps as a warning sign, not a victory.
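The label-shuffle habit from the list above can be sketched as a quick sanity check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)  # learnable signal

model = LogisticRegression(max_iter=1000)
real = cross_validate(model, X, y, scoring="roc_auc")["test_score"].mean()

# Shuffle the labels: any score far above chance afterwards
# means the evaluation is picking up structure it should not have.
y_shuffled = rng.permutation(y)
shuffled = cross_validate(model, X, y_shuffled, scoring="roc_auc")["test_score"].mean()
print(round(real, 2), round(shuffled, 2))  # real well above 0.5, shuffled near 0.5
```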
Failure Checks¶
- If the public score is high but private performance later drops, check leakage alongside leaderboard overfitting, shift, and variance rather than jumping to one explanation first.
- If a tiny feature change causes a huge score jump, inspect for label leakage or duplicate rows.
- If random splits look great but group-aware or time-aware splits do not, the earlier score was probably too optimistic.
- If a new category or missing-value pattern appears only in validation, check whether the encoder or imputer was fit correctly.
What Students Should Remember¶
Leakage is not just a technical mistake. It changes the story you think the model is telling.
The safest default is:
- split first
- keep preprocessing inside a pipeline
- use the right splitter for the data
- validate joins and duplicates
- distrust gains that are too easy
If a score is strong only because the model had access to the answer key in disguise, it is not a real improvement.