Tabular Feature Engineering

What This Is

Feature engineering for tabular data turns raw columns into representations that help models find signal. It is often the single most impactful step in a tabular ML pipeline — more so than model selection.

When You Use It

  • when raw features are not expressive enough for the model to learn patterns
  • when domain knowledge suggests interactions, ratios, or aggregations
  • when the baseline model plateaus and you suspect the representation is the bottleneck
  • before trying a more complex model family

Tooling

Tool                      What it does
------------------------  ------------------------------------------------
PolynomialFeatures        creates interaction and polynomial terms
KBinsDiscretizer          bins continuous features into categories
FunctionTransformer       wraps custom transformations in a pipeline
ColumnTransformer         applies different transforms to different columns
pandas.cut / pandas.qcut  manual binning by value or quantile
groupby().transform()     group-level aggregation as new features
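
Several of these tools compose naturally in one preprocessing object. The sketch below is a minimal example of that composition; the column names ("age", "income", "city") and the toy data are illustrative, not from any real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer, OneHotEncoder

df = pd.DataFrame({
    "age": [22, 35, 58, 41],
    "income": [28_000, 54_000, 91_000, 62_000],
    "city": ["NY", "SF", "NY", "LA"],
})

pre = ColumnTransformer([
    # FunctionTransformer: log-transform a skewed numeric column
    ("log_income", FunctionTransformer(np.log1p), ["income"]),
    # KBinsDiscretizer: bin age into 3 ordinal quantile buckets
    ("age_bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile"), ["age"]),
    # OneHotEncoder: expand the categorical column
    ("city_ohe", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

X = pre.fit_transform(df)
print(X.shape)  # (4, 5): 1 log column + 1 bin column + 3 one-hot columns
```

Because everything lives in one `ColumnTransformer`, the same fitted object can later be applied to validation and test data, which addresses the pipeline-consistency mistake listed further down.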

Feature Engineering Ladder

  1. Clean — handle missing values, fix types
  2. Encode — one-hot, ordinal, or target encoding for categories
  3. Scale — standardize or normalize numeric features
  4. Combine — create interactions, ratios, differences
  5. Aggregate — group-level statistics (mean, count, std per category)
  6. Select — drop low-importance or redundant features
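
The ladder above can be walked end to end on a tiny frame. This is a sketch with made-up column names ("price", "qty", "store"), using plain pandas for each step:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 30.0, 20.0],
    "qty": [1, 2, 2, 4],
    "store": ["a", "b", "a", "b"],
})

# 1. Clean: fill missing price with the median
df["price"] = df["price"].fillna(df["price"].median())
# 2. Encode: one-hot the category (keep the raw column for step 5)
df = pd.concat([df, pd.get_dummies(df["store"], prefix="store")], axis=1)
# 3. Scale: standardize qty
df["qty_z"] = (df["qty"] - df["qty"].mean()) / df["qty"].std()
# 4. Combine: a ratio feature
df["price_per_unit"] = df["price"] / df["qty"]
# 5. Aggregate: store-level mean price as a new feature
df["store_mean_price"] = df.groupby("store")["price"].transform("mean")
# 6. Select: drop the raw category now that it is encoded and aggregated
df = df.drop(columns=["store"])
```

Note the ordering detail: the raw `store` column must survive until the aggregation step, which is why the one-hot columns are concatenated rather than replacing it.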

Common Patterns

Interaction features

df["rooms_per_person"] = df["total_rooms"] / df["population"]
df["income_x_rooms"] = df["median_income"] * df["total_rooms"]
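
Ratios like `rooms_per_person` blow up when the denominator is zero (see Common Mistakes below). One way to guard against this, sketched on toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"total_rooms": [6.0, 3.0], "population": [2.0, 0.0]})

# Replace zero denominators with NaN so the division yields NaN instead of inf,
# then fill with a sentinel (or leave NaN for models that tolerate it).
denom = df["population"].replace(0, np.nan)
df["rooms_per_person"] = (df["total_rooms"] / denom).fillna(0.0)
print(df["rooms_per_person"].tolist())  # [3.0, 0.0]
```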

Date decomposition

df["day_of_week"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

Group-level aggregations

df["group_mean_score"] = df.groupby("category")["score"].transform("mean")
df["group_count"] = df.groupby("category")["score"].transform("count")
df["score_vs_group"] = df["score"] - df["group_mean_score"]

Binning

df["income_bin"] = pd.qcut(df["income"], q=5, labels=False)
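
`pd.qcut` learns its edges from whatever data it is given, so it cannot be reused on test data directly. A roughly equivalent fit/transform version, sketched with made-up income values:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

income = np.array([[20_000], [35_000], [50_000], [75_000], [120_000]])

# Quantile binning, comparable to pd.qcut(..., q=5, labels=False), but the
# bin edges learned here can be applied to new data via binner.transform().
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
bins = binner.fit_transform(income).ravel()
print(bins)  # one ordinal bin id per row, in 0..4
```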

Failure Pattern

Creating hundreds of features without checking which ones actually help. Feature engineering should increase signal, not just dimensionality.

Another trap: computing group aggregations on the full dataset before splitting, which leaks test-set statistics into training.
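
The leakage-safe version fits the aggregation on the training split only, then maps those train-derived statistics onto the test split. A minimal sketch with a toy split:

```python
import pandas as pd

# Toy split; in practice use your real train/test split.
train = pd.DataFrame({"category": ["a", "a", "b"], "score": [1.0, 3.0, 5.0]})
test = pd.DataFrame({"category": ["a", "b", "c"], "score": [2.0, 4.0, 6.0]})

# Fit the aggregation on train only...
group_means = train.groupby("category")["score"].mean()
train["group_mean_score"] = train["category"].map(group_means)
# ...then apply the same train-derived statistics to test, falling back
# to the global train mean for categories never seen in training.
test["group_mean_score"] = test["category"].map(group_means).fillna(train["score"].mean())
print(test["group_mean_score"].tolist())  # [2.0, 5.0, 3.0]
```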

Common Mistakes

  • creating ratios that produce infinity or NaN when the denominator is zero
  • target encoding without proper cross-validation, which leaks the target
  • one-hot encoding high-cardinality features and drowning the model in sparse noise
  • forgetting to apply the same transformation pipeline to validation and test data
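
For the target-encoding mistake in particular, the standard fix is out-of-fold encoding: each row's encoding is computed from folds that exclude that row, so the row's own target never leaks into its feature. A sketch on toy data:

```python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "a", "b"],
    "target":   [1.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

# Out-of-fold target encoding: fit means on the other folds, apply to this one.
df["cat_te"] = float("nan")
for fit_idx, enc_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(df):
    means = df.iloc[fit_idx].groupby("category")["target"].mean()
    df.loc[df.index[enc_idx], "cat_te"] = df.iloc[enc_idx]["category"].map(means).values
# Categories unseen in a fold's fit portion fall back to the global mean.
df["cat_te"] = df["cat_te"].fillna(df["target"].mean())
```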

Practice

  1. Create 3 interaction features from a dataset and compare model performance before and after.
  2. Decompose a date column and explain which components carry signal.
  3. Add group-level mean and count features and measure the impact.
  4. Use PolynomialFeatures(degree=2) and inspect the resulting feature matrix.
  5. Explain why group aggregations must be computed within the training fold only.

Runnable Example
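
This copy of the page has no code under the heading, so here is a minimal self-contained sketch on synthetic data (the column names echo the interaction example above but the data is generated, not real). It runs practice exercise 1: compare a linear model with and without an engineered ratio feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "total_rooms": rng.uniform(1, 10, n),
    "population": rng.uniform(1, 5, n),
})
# Synthetic target that depends on the ratio, so the engineered
# feature should help a linear model substantially.
df["price"] = df["total_rooms"] / df["population"] + rng.normal(0, 0.1, n)

df["rooms_per_person"] = df["total_rooms"] / df["population"]
X_raw = df[["total_rooms", "population"]]
X_eng = df[["total_rooms", "population", "rooms_per_person"]]
y = df["price"]

scores = {}
for name, X in [("raw", X_raw), ("engineered", X_eng)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = Ridge().fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))
    print(name, round(scores[name], 3))
```

The engineered run should score markedly higher, because the ratio is exactly the representation the linear model cannot build on its own.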

Further Connections

Continue with Feature Matrix Construction for building the initial matrix, and Data Cleaning and Preprocessing for the step that should come before engineering.