Tabular Feature Engineering¶
What This Is¶
Feature engineering for tabular data turns raw columns into representations that help models find signal. It is often the single most impactful step in a tabular ML pipeline — more so than model selection.
When You Use It¶
- when raw features are not expressive enough for the model to learn patterns
- when domain knowledge suggests interactions, ratios, or aggregations
- when the baseline model plateaus and you suspect the representation is the bottleneck
- before trying a more complex model family
Tooling¶
| Tool | What it does |
|---|---|
| PolynomialFeatures | creates interaction and polynomial terms |
| KBinsDiscretizer | bins continuous features into categories |
| FunctionTransformer | wraps custom transformations in a pipeline |
| ColumnTransformer | applies different transforms to different columns |
| pandas.cut / pandas.qcut | manual binning by value or quantile |
| groupby().transform() | group-level aggregation as new features |
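Several of these tools compose naturally: a ColumnTransformer can route different columns to different transformers in one step. A minimal sketch, assuming a toy frame with hypothetical columns `income`, `rooms`, and `age`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "income": [2.1, 3.5, 5.0, 1.2, 4.4, 2.9],
    "rooms":  [4.0, 6.0, 8.0, 3.0, 7.0, 5.0],
    "age":    [12,  30,  5,   40,  22,  18],
})

ct = ColumnTransformer([
    # degree-2 powers and interactions of income and rooms
    ("poly", PolynomialFeatures(degree=2, include_bias=False), ["income", "rooms"]),
    # quantile-bin age into 3 ordinal buckets
    ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile"), ["age"]),
])

X = ct.fit_transform(df)
print(X.shape)  # (6, 6): income, rooms, income^2, income*rooms, rooms^2, age_bin
```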
Feature Engineering Ladder¶
- Clean — handle missing values, fix types
- Encode — one-hot, ordinal, or target encoding for categories
- Scale — standardize or normalize numeric features
- Combine — create interactions, ratios, differences
- Aggregate — group-level statistics (mean, count, std per category)
- Select — drop low-importance or redundant features
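The first three rungs of the ladder map directly onto a scikit-learn pipeline that fits on training data and replays the same transforms on validation/test. A sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["income", "rooms"]   # hypothetical numeric columns
categorical = ["city"]          # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # Clean
        ("scale", StandardScaler()),                   # Scale
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # Clean
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # Encode
    ]), categorical),
])

df = pd.DataFrame({
    "income": [2.0, None, 4.0],
    "rooms": [3.0, 5.0, None],
    "city": ["a", "b", "a"],
})
X = preprocess.fit_transform(df)
print(X.shape)  # (3, 4): 2 scaled numerics + 2 one-hot city columns
```

Fitting the whole thing once on the training split is what keeps the transforms consistent across splits.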
Common Patterns¶
Interaction features¶
```python
df["rooms_per_person"] = df["total_rooms"] / df["population"]
df["income_x_rooms"] = df["median_income"] * df["total_rooms"]
```
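Ratios like this blow up when the denominator is zero (see Common Mistakes below). One guarded variant, on a toy frame, maps a zero denominator to NaN and then fills with the column median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_rooms": [6.0, 8.0, 4.0], "population": [3.0, 0.0, 2.0]})

# Turn a zero denominator into NaN instead of producing inf,
# then fill the resulting NaN with a neutral value (the median).
ratio = df["total_rooms"] / df["population"].replace(0, np.nan)
df["rooms_per_person"] = ratio.fillna(ratio.median())
print(df["rooms_per_person"].tolist())  # [2.0, 2.0, 2.0]
```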
Date decomposition¶
```python
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
```
Group-level aggregations¶
```python
df["group_mean_score"] = df.groupby("category")["score"].transform("mean")
df["group_count"] = df.groupby("category")["score"].transform("count")
df["score_vs_group"] = df["score"] - df["group_mean_score"]
```
Binning¶
```python
df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)
```
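One pitfall worth knowing: pd.qcut raises a ValueError when quantile edges coincide, which happens whenever one value dominates the column. Passing duplicates="drop" collapses the duplicate edges, at the cost of fewer bins. A small sketch with made-up values:

```python
import pandas as pd

income = pd.Series([1, 1, 1, 1, 2, 3, 4, 5, 6, 7])

# Without duplicates="drop" this raises ValueError, because the
# lowest quantile edges all land on the repeated value 1.
bins = pd.qcut(income, q=5, labels=False, duplicates="drop")
print(bins.nunique())  # fewer than 5 bins survive
```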
Failure Pattern¶
Creating hundreds of features without checking which ones actually help. Feature engineering should increase signal, not just dimensionality.
Another trap: computing group aggregations on the full dataset before splitting, which leaks test-set statistics into training.
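A leakage-safe version fits the aggregation on the training rows only and then maps those statistics onto the other splits. A sketch, with an illustrative split and column names:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "score":    [1.0, 3.0, 2.0, 4.0, 5.0, 6.0],
})
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Compute the group statistic on the training rows only...
group_mean = train.groupby("category")["score"].mean()

# ...then map those training statistics onto both splits.
train["group_mean_score"] = train["category"].map(group_mean)
test["group_mean_score"] = test["category"].map(group_mean)
print(test["group_mean_score"].tolist())  # [2.0, 3.0]
```

Note that the test rows get the training means (2.0 and 3.0), not means computed over their own values.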
Common Mistakes¶
- creating ratios that produce infinity or NaN when the denominator is zero
- target encoding without proper cross-validation, which leaks the target
- one-hot encoding high-cardinality features and drowning the model in sparse noise
- forgetting to apply the same transformation pipeline to validation and test data
Practice¶
- Create 3 interaction features from a dataset and compare model performance before and after.
- Decompose a date column and explain which components carry signal.
- Add group-level mean and count features and measure the impact.
- Use PolynomialFeatures(degree=2) and inspect the resulting feature matrix.
- Explain why group aggregations must be computed within the training fold only.
Runnable Example¶
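A minimal end-to-end sketch combining the patterns above on synthetic data (all column names and parameters are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.uniform(1, 10, n),
    "rooms": rng.integers(1, 8, n).astype(float),
    "category": rng.choice(["a", "b", "c"], n),
    "date": pd.to_datetime("2024-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
})

# Interaction feature
df["income_x_rooms"] = df["income"] * df["rooms"]

# Date decomposition
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Group-level aggregation
df["group_mean_income"] = df.groupby("category")["income"].transform("mean")

# Binning
df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)

print(df.shape)  # (200, 9): 4 raw columns + 5 engineered features
```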
Longer Connection¶
Continue with Feature Matrix Construction for building the initial matrix, and Data Cleaning and Preprocessing for the step that should come before engineering.