Sequential Splits and Lag Features¶
What This Is¶
Sequential data must respect time order. Features often depend on windows or lags, and evaluation becomes invalid quickly if future information leaks backward.
The hard rule is simple: if the future can influence the feature generation, the split is wrong. Sequential tasks usually reward patience, not random shuffling.
When You Use It¶
- forecasting
- event prediction over time
- any dataset where later rows should not influence earlier decisions
- click, demand, churn, fraud, and maintenance problems with timestamps
- any task where the model will be used on "future" rows that were not available during training
Tooling¶
- chronological splitting
- lag features
- rolling windows
- time-aware validation
- backtesting-style summaries
- event alignment with nearest-time joins
- frequency alignment before feature creation
- grouped lag construction when multiple entities are mixed together
Core Idea¶
Sort first, then split, then build history-based features inside the ordered frame.
If the rows are not in time order, a lag can point to the wrong event.
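A minimal sketch with a three-row toy frame (dates and values are made up) showing how an unsorted frame corrupts a lag:

```python
import pandas as pd

# Toy frame with rows deliberately out of time order.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-03", "2024-01-01", "2024-01-02"]),
    "value": [30, 10, 20],
})

# Lag built on the unsorted frame: points at the positionally previous row,
# not the temporally previous one.
df["bad_lag"] = df["value"].shift(1)

# Sort first, then lag: each row now sees the value from the previous day.
df = df.sort_values("timestamp")
df["good_lag"] = df["value"].shift(1)
```

After sorting, the earliest row's `bad_lag` turns out to hold a value from the future, which is exactly the leak the rule above is guarding against.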
Chronological Split¶
The simplest safe split uses a timestamp cutoff:
df = df.sort_values("timestamp")
train = df[df["timestamp"] < split_time]
valid = df[df["timestamp"] >= split_time]
That pattern is useful when you need one clean holdout window and you want to simulate deployment on later data.
When the task has multiple evaluation windows, prefer a time-aware cross-validation splitter from scikit-learn instead of making up random folds.
Worked Pattern¶
The most common lag pattern in pandas is shift:
df["lag_1"] = df["value"].shift(1)
df["lag_7"] = df["value"].shift(7)
df = df.dropna()
Useful tricks:
- keep lag construction inside the time-ordered frame
- use shift(1) for the immediately previous observation and larger shifts for longer memory
- drop rows that do not have a full lag history instead of filling them blindly
- compare a lag-only baseline to a lag-plus-current-value baseline when you want to know how much history matters
- if the data has several entities, build the lag per entity so one user's history does not leak into another user's row
Grouped lag features matter when the table mixes users, devices, stores, or patients:
df = df.sort_values(["entity_id", "timestamp"])
df["lag_1"] = df.groupby("entity_id")["value"].shift(1)
df["lag_3"] = df.groupby("entity_id")["value"].shift(3)
That pattern keeps the lag history local to each entity.
Window Features¶
Sometimes the previous row is not enough. Use a rolling window when recent context matters more than one exact lag:
df = df.sort_values("timestamp")
df["mean_last_3"] = df["value"].rolling(3).mean()
df["std_last_7"] = df["value"].rolling(7).std()
Be careful: rolling windows can accidentally include the current target row. If you are predicting the current row from the past, shift before rolling when needed:
df["past_mean_7"] = df["value"].shift(1).rolling(7).mean()
That pattern is safer when the target is defined on the same row as the features.
Calendar Alignment¶
If your data should live on a regular timeline, align it first.
asfreq is useful when you want a row for every expected timestamp and missing times should stay missing:
ts = ts.asfreq("D")
resample is useful when multiple raw events must be aggregated into a coarser time bucket:
daily = df.resample("D", on="timestamp")["value"].mean()
Use asfreq when you want to expose gaps. Use resample when you want to summarize many events into one time step.
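A small sketch with made-up daily readings that contrasts the two behaviors:

```python
import pandas as pd

# Irregular series: the reading for 2024-01-02 is missing.
ts = pd.Series(
    [1.0, 3.0],
    index=pd.to_datetime(["2024-01-01", "2024-01-03"]),
)

# asfreq exposes the gap: the missing day appears as NaN instead of being hidden.
regular = ts.asfreq("D")

# Event-level frame: several raw events per day, aggregated to one row per day.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-01 09:00", "2024-01-01 17:00", "2024-01-02 12:00"]
    ),
    "value": [2.0, 4.0, 6.0],
})
daily = events.resample("D", on="timestamp")["value"].mean()
```

Here `regular` keeps three rows with one visible NaN, while `daily` collapses the two events on the first day into a single mean.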
Event Alignment¶
When features come from a separate event stream, merge_asof is the usual tool. It matches each row to the nearest prior event rather than forcing an exact timestamp match.
aligned = pd.merge_asof(
left=observations.sort_values("timestamp"),
right=events.sort_values("timestamp"),
on="timestamp",
by="entity_id",
direction="backward",
)
Use this when you need the most recent known signal at prediction time, such as the last status update, last price, or last sensor reading.
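A runnable sketch of the same join, using a hypothetical status-event stream for one entity:

```python
import pandas as pd

# Observation rows that need the latest known status at their timestamp.
observations = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 12:00"]),
    "entity_id": ["a", "a"],
})

# Separate event stream carrying the signal to be joined.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 11:00"]),
    "entity_id": ["a", "a"],
    "status": ["ok", "warn"],
})

# Each observation picks up the most recent prior event for its entity.
aligned = pd.merge_asof(
    observations.sort_values("timestamp"),
    events.sort_values("timestamp"),
    on="timestamp",
    by="entity_id",
    direction="backward",
)
```

The 10:00 observation receives the 09:00 event and the 12:00 observation receives the 11:00 event; neither can see an event from its own future because the direction is backward.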
scikit-learn Pattern¶
For validation, TimeSeriesSplit is the main scikit-learn splitter for time-ordered data.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
tscv = TimeSeriesSplit(n_splits=5, gap=1)
scores = cross_val_score(model, X, y, cv=tscv, scoring="roc_auc")
Why this matters:
- each fold respects time order
- later folds train on more history than earlier ones
- gap helps when labels or features need a buffer between train and test
If you only need one final holdout window, use a simple chronological split. If you need a reliability check across several time windows, use TimeSeriesSplit.
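The fold behavior above can be verified directly on a small index range; the twelve rows here are just a stand-in for real time-ordered features:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered rows stand in for real features.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3, gap=1)
folds = list(tscv.split(X))

for train_idx, test_idx in folds:
    # Every training index precedes every test index, with a one-row buffer.
    assert train_idx.max() + 1 < test_idx.min()
```

Printing the fold sizes also shows that later folds train on more history than earlier ones, which is the expanding-window behavior described above.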
Failure Pattern¶
Randomly shuffling time-dependent data before the split. That leaks future patterns backward and makes the task look easier than it is.
Another failure is computing rolling features with a window that accidentally includes the current target row or later rows.
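That rolling leak can be made concrete with a five-row toy series (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Leaky: the window ends at the current row, so the "feature" sees the target.
df["leaky_mean_3"] = df["value"].rolling(3).mean()

# Safe: shift first, so the window covers strictly earlier rows only.
df["past_mean_3"] = df["value"].shift(1).rolling(3).mean()
```

On row index 2, the leaky column already equals the mean of rows 0 through 2, including the row being predicted, while the shifted version still has no full history there.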
Other common mistakes:
- building lags before sorting by time
- using a global statistic, like a mean over the full dataset, and then pretending it was known early
- fitting preprocessing on the whole table before the split
- using train_test_split(..., shuffle=True) on sequential data
- joining on timestamps without checking whether the joined row came from the past or the future
Inspection habit:
- print the first few rows after sorting and lag creation
- check whether the first valid lag row is actually aligned with the second time step
- compare the timestamp of each feature source against the prediction timestamp
- inspect one entity at a time when the table has multiple entities mixed together
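Those habits can be scripted as a quick sanity check; the two-entity frame below is a toy example with illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "entity_id": ["a", "a", "b", "b"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"]
    ),
    "value": [10.0, 20.0, 100.0, 200.0],
})

# Sort, then build the grouped lag, then verify alignment per entity.
df = df.sort_values(["entity_id", "timestamp"])
df["lag_1"] = df.groupby("entity_id")["value"].shift(1)

for entity, grp in df.groupby("entity_id"):
    # First row of each entity has no history; second row sees the first value.
    assert pd.isna(grp["lag_1"].iloc[0])
    assert grp["lag_1"].iloc[1] == grp["value"].iloc[0]

print(df.head())
```

If the per-entity assertion fails, either the sort keys are wrong or one entity's history is leaking into another's rows.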
Practice¶
- Build two lag features for a sequential table and explain what each lag means in plain language.
- Compare a random split against a chronological split. Which one matches deployment?
- Explain why the random split is invalid for forecasting even if the score is higher.
- Name one feature that would be safe in a time-aware setup and one that would not.
- Explain what you would backtest first if time were short.
- Say whether a simple linear model is enough to test the lag idea.
- Decide whether asfreq, resample, or merge_asof is the right tool for a given data source.
- Say when TimeSeriesSplit is better than one fixed cutoff.
- Identify one place where a rolling window could silently leak future information.
- Explain why grouped lags are safer than global lags when the table mixes entities.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
Inspect the random-split score against the chronological score and explain which one matches deployment.
Library Notes¶
- DataFrame.shift(...) and Series.shift(...) build lag features by moving observations forward in time.
- groupby(...).shift(...) keeps lags inside each entity.
- rolling(...) is useful when the task depends on recent history rather than just the previous row.
- asfreq(...) keeps a regular timeline visible and exposes missing periods.
- resample(...) aggregates many events into time buckets.
- merge_asof(...) is the practical way to align each row with the most recent known event.
- TimeSeriesSplit(...) gives time-aware cross-validation folds in scikit-learn.
- cross_val_score(...) is a quick way to compare models under those folds.
- SequentialFeatureSelector(...) can help if you want to test whether a small set of lag features is actually enough.
- A chronological split should be chosen before ranking models or tuning features.
Short Applied Examples¶
Use a lag when the target depends on recent history:
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)
Use a grouped lag when the table contains many entities:
df = df.sort_values(["store_id", "date"])
df["prev_store_sales"] = df.groupby("store_id")["sales"].shift(1)
Use a rolling mean when noise is high and the short-term trend matters:
df["sales_mean_7"] = df["sales"].shift(1).rolling(7).mean()
Use a time-aware split when the deployment target is later data:
tscv = TimeSeriesSplit(n_splits=4, gap=2)
Questions To Ask¶
- What counts as future information in this task?
- Which feature would be unavailable at prediction time?
- How much score is lost when you force a real chronological split?
- Does the lag feature help because of causality or because it leaks?
- Would you trust a model that only wins on a random split?
- Can the same feature be computed without seeing the answer row?
- If you remove one lag, do the scores collapse or barely change?
- Would a feature still be valid if the test rows arrived one week later?
- Is the entity-specific history isolated correctly?
- Are you evaluating the model on the same time granularity it will face in use?
Longer Connection¶
Continue with scikit-learn Validation and Tuning for the broader split-discipline workflow.