Tabular Feature Engineering¶
What This Is¶
Feature engineering for tabular data turns raw columns into representations that help models find signal. It is often the single most impactful step in a tabular ML pipeline — more so than model selection.
When You Use It¶
- when raw features are not expressive enough for the model to learn patterns
- when domain knowledge suggests interactions, ratios, or aggregations
- when the baseline model plateaus and you suspect the representation is the bottleneck
- before trying a more complex model family
Tooling¶
| Tool | What it does |
|---|---|
| PolynomialFeatures | creates interaction and polynomial terms |
| KBinsDiscretizer | bins continuous features into categories |
| FunctionTransformer | wraps custom transformations in a pipeline |
| ColumnTransformer | applies different transforms to different columns |
| pandas.cut / pandas.qcut | manual binning by value or quantile |
| groupby().transform() | group-level aggregation as new features |
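Several of these tools compose naturally: a ColumnTransformer can route different columns to different transformers in one step. A minimal sketch, assuming a toy frame with hypothetical columns `income`, `rooms`, and `age`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# Hypothetical toy frame; column names are illustrative only.
df = pd.DataFrame({
    "income": [2.1, 3.5, 5.0, 1.2, 4.4, 2.9],
    "rooms":  [4.0, 6.0, 8.0, 3.0, 7.0, 5.0],
    "age":    [12,  30,  5,   40,  22,  18],
})

ct = ColumnTransformer([
    # degree-2 powers and interactions of income and rooms
    ("poly", PolynomialFeatures(degree=2, include_bias=False), ["income", "rooms"]),
    # quantile-bin age into 3 ordinal buckets
    ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile"), ["age"]),
])

X = ct.fit_transform(df)
print(X.shape)  # (6, 6): income, rooms, income^2, income*rooms, rooms^2, age_bin
```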
Feature Engineering Ladder¶
- Clean — handle missing values, fix types
- Encode — one-hot, ordinal, or target encoding for categories
- Scale — standardize or normalize numeric features
- Combine — create interactions, ratios, differences
- Aggregate — group-level statistics (mean, count, std per category)
- Select — drop low-importance or redundant features
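The first three rungs of the ladder map directly onto a scikit-learn pipeline that fits on training data and replays the same transforms on validation/test. A sketch with hypothetical column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["income", "rooms"]   # hypothetical numeric columns
categorical = ["city"]          # hypothetical categorical column

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # Clean
        ("scale", StandardScaler()),                   # Scale
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # Clean
        ("encode", OneHotEncoder(handle_unknown="ignore")),   # Encode
    ]), categorical),
])

df = pd.DataFrame({
    "income": [2.0, None, 4.0],
    "rooms": [3.0, 5.0, None],
    "city": ["a", "b", "a"],
})
X = preprocess.fit_transform(df)
print(X.shape)  # (3, 4): 2 scaled numerics + 2 one-hot city columns
```

Fitting the whole thing once on the training split is what keeps the transforms consistent across splits.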
Common Patterns¶
Interaction features¶
```python
df["rooms_per_person"] = df["total_rooms"] / df["population"]
df["income_x_rooms"] = df["median_income"] * df["total_rooms"]
```
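Ratios like this blow up when the denominator is zero (see Common Mistakes below). One guarded variant, on a toy frame, maps a zero denominator to NaN and then fills with the column median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_rooms": [6.0, 8.0, 4.0], "population": [3.0, 0.0, 2.0]})

# Turn a zero denominator into NaN instead of producing inf,
# then fill the resulting NaN with a neutral value (the median).
ratio = df["total_rooms"] / df["population"].replace(0, np.nan)
df["rooms_per_person"] = ratio.fillna(ratio.median())
print(df["rooms_per_person"].tolist())  # [2.0, 2.0, 2.0]
```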
Date decomposition¶
```python
df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
```
Group-level aggregations¶
```python
df["group_mean_score"] = df.groupby("category")["score"].transform("mean")
df["group_count"] = df.groupby("category")["score"].transform("count")
df["score_vs_group"] = df["score"] - df["group_mean_score"]
```
Binning¶
```python
df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)
```
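One pitfall worth knowing: pd.qcut raises a ValueError when quantile edges coincide, which happens whenever one value dominates the column. Passing duplicates="drop" collapses the duplicate edges, at the cost of fewer bins. A small sketch with made-up values:

```python
import pandas as pd

income = pd.Series([1, 1, 1, 1, 2, 3, 4, 5, 6, 7])

# Without duplicates="drop" this raises ValueError, because the
# lowest quantile edges all land on the repeated value 1.
bins = pd.qcut(income, q=5, labels=False, duplicates="drop")
print(bins.nunique())  # fewer than 5 bins survive
```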
Failure Pattern¶
Creating hundreds of features without checking which ones actually help. Feature engineering should increase signal, not just dimensionality.
Another trap: computing group aggregations on the full dataset before splitting, which leaks test-set statistics into training.
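A leakage-safe version fits the aggregation on the training rows only and then maps those statistics onto the other splits. A sketch, with an illustrative split and column names:

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "score":    [1.0, 3.0, 2.0, 4.0, 5.0, 6.0],
})
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# Compute the group statistic on the training rows only...
group_mean = train.groupby("category")["score"].mean()

# ...then map those training statistics onto both splits.
train["group_mean_score"] = train["category"].map(group_mean)
test["group_mean_score"] = test["category"].map(group_mean)
print(test["group_mean_score"].tolist())  # [2.0, 3.0]
```

Note that the test rows get the training means (2.0 and 3.0), not means computed over their own values.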
Common Mistakes¶
- creating ratios that produce infinity or NaN when the denominator is zero
- target encoding without proper cross-validation, which leaks the target
- one-hot encoding high-cardinality features and drowning the model in sparse noise
- forgetting to apply the same transformation pipeline to validation and test data
Practice¶
- Create 3 interaction features from a dataset and compare model performance before and after.
- Decompose a date column and explain which components carry signal.
- Add group-level mean and count features and measure the impact.
- Use PolynomialFeatures(degree=2) and inspect the resulting feature matrix.
- Explain why group aggregations must be computed within the training fold only.
Runnable Example¶
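A minimal end-to-end sketch combining the patterns above on synthetic data (all column names and parameters are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.uniform(1, 10, n),
    "rooms": rng.integers(1, 8, n).astype(float),
    "category": rng.choice(["a", "b", "c"], n),
    "date": pd.to_datetime("2024-01-01")
            + pd.to_timedelta(rng.integers(0, 365, n), unit="D"),
})

# Interaction feature
df["income_x_rooms"] = df["income"] * df["rooms"]

# Date decomposition
df["day_of_week"] = df["date"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

# Group-level aggregation
df["group_mean_income"] = df.groupby("category")["income"].transform("mean")

# Binning
df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)

print(df.shape)  # (200, 9): 4 raw columns + 5 engineered features
```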
Longer Connection¶
Continue with Feature Matrix Construction for building the initial matrix, and Data Cleaning and Preprocessing for the step that should come before engineering.