Feature Matrix Construction

What This Is

Feature-matrix construction is the step where you turn a table into the numeric representation a model can actually fit. The hard part is not making numbers. The hard part is keeping row alignment, column meaning, and train/validation consistency intact while you do it.

The strongest feature matrix is usually the simplest one that still preserves the information the model needs.

When You Use It

  • preparing tabular data for scikit-learn
  • mixing numeric and categorical columns
  • adding a small number of derived features
  • keeping train and validation matrices consistent
  • deciding whether to build features manually or inside a pipeline

Tooling

  • pandas.select_dtypes
  • pandas.get_dummies
  • pandas.concat
  • pandas.reindex
  • pandas.assign
  • pandas.fillna
  • pandas.merge
  • pandas.DataFrame.align
  • pandas.DataFrame.to_numpy
  • sklearn.compose.ColumnTransformer
  • sklearn.compose.make_column_selector
  • sklearn.preprocessing.OneHotEncoder
  • sklearn.preprocessing.StandardScaler
  • sklearn.pipeline.Pipeline

Library Notes

  • select_dtypes is the fastest way to separate numeric and categorical columns when you already know which columns are allowed into the feature block.
  • get_dummies is the simplest manual one-hot encoder when you want a readable matrix quickly.
  • reindex is how you force validation or test columns to match the training layout before scoring.
  • concat is often the cleanest way to join numeric and one-hot blocks after encoding.
  • assign is useful for creating a small number of derived features without losing the original table.
  • ColumnTransformer is the most reliable way to build mixed-type features inside a Pipeline.
  • OneHotEncoder(handle_unknown="ignore") protects scoring from unseen categories.
  • StandardScaler is usually the right choice for numeric features that feed linear or margin-based models.

Manual Pattern

import pandas as pd

feature_columns = ["days_until_deadline", "attendance_rate", "quiz_average", "channel", "issue_category"]
numeric_columns = ["days_until_deadline", "attendance_rate", "quiz_average"]

numeric = df[numeric_columns].fillna(0.0)
categorical = pd.get_dummies(df[["channel", "issue_category"]], drop_first=False)
X = pd.concat([numeric, categorical], axis=1)

This pattern is useful when you want:

  • a quick baseline
  • readable feature names
  • explicit control over missing values
  • a matrix you can inspect row by row

What to check first:

  • whether the allowed feature list excluded the target and ID columns before any dtype-based selection
  • whether numeric columns stayed numeric
  • whether the categorical columns expanded into the expected indicators
  • whether the row count stayed fixed
  • whether the target column stayed out of the feature block
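These checks are cheap to script as plain assertions. A minimal sketch on a toy table (the column names mirror the ones used above and are illustrative):

```python
import pandas as pd

# Toy table; "needs_human_review" is the target and must stay out of X
df = pd.DataFrame({
    "days_until_deadline": [3, 10, 1],
    "channel": ["email", "chat", "email"],
    "needs_human_review": [0, 1, 0],
})

numeric = df[["days_until_deadline"]].fillna(0.0)
categorical = pd.get_dummies(df[["channel"]])
X = pd.concat([numeric, categorical], axis=1)

assert len(X) == len(df)                             # row count stayed fixed
assert "needs_human_review" not in X.columns         # target stayed out
assert X["days_until_deadline"].dtype.kind in "if"   # numeric stayed numeric
```

If any assertion fails, stop and inspect the frame before fitting anything.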

If you really want a dtype-based shortcut, use it only after you remove columns that are not features:

blocked = ["needs_human_review", "student_id"]
candidate_frame = df.drop(columns=blocked)
numeric = candidate_frame.select_dtypes(include=["number"]).fillna(0.0)
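The reason for dropping blocked columns first is easy to demonstrate: a numeric target passes straight through select_dtypes. A small sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "attendance_rate": [0.9, 0.7],
    "needs_human_review": [0, 1],   # numeric target, not a feature
    "channel": ["email", "chat"],
})

# Without dropping the target first, select_dtypes happily includes it
leaky = df.select_dtypes(include=["number"])
assert "needs_human_review" in leaky.columns

# Dropping non-feature columns first keeps the shortcut safe
safe = df.drop(columns=["needs_human_review"]).select_dtypes(include=["number"])
assert list(safe.columns) == ["attendance_rate"]
```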

Stable Train/Validation Pattern

train_X = pd.get_dummies(train_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = pd.get_dummies(valid_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)

This pattern matters because train and validation do not always contain the same categories.

Without reindex, a category that appears only in training has no indicator column in the validation matrix, and a category that appears only in validation adds an extra column, so the two layouts silently drift apart. With reindex, the validation matrix is forced to match the training columns exactly: missing indicators are created and filled with 0, and extra ones are dropped.
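A runnable demonstration of the drift and the fix, using a toy single-column table:

```python
import pandas as pd

train_df = pd.DataFrame({"channel": ["email", "chat", "phone"]})
valid_df = pd.DataFrame({"channel": ["email", "email"]})  # no "chat" or "phone"

train_X = pd.get_dummies(train_df[["channel"]])
valid_X = pd.get_dummies(valid_df[["channel"]])

# Before alignment: validation is missing two indicator columns
assert list(valid_X.columns) == ["channel_email"]

valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)

# After alignment: same columns, absent categories filled with 0
assert list(valid_X.columns) == list(train_X.columns)
assert valid_X["channel_phone"].sum() == 0
```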

Pipeline Pattern

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_selector = make_column_selector(dtype_include=["number"])
categorical_selector = make_column_selector(dtype_exclude=["number"])

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_selector),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_selector),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

This is the safer production-style route when you want:

  • one place for preprocessing
  • no leakage between train and validation
  • categories handled consistently
  • model fitting and feature construction tied together

Why it helps:

  • the preprocessing is learned only on the training split
  • unknown categories at validation time do not crash the run
  • the same pipeline can be cross-validated cleanly
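The cross-validation point is worth seeing end to end. A self-contained sketch with small randomly generated data (the column names are illustrative; the shape of the pipeline matches the pattern above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "days_until_deadline": rng.integers(0, 30, size=40),
    "channel": rng.choice(["email", "chat"], size=40),
})
y = rng.integers(0, 2, size=40)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=["number"])),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=["number"])),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler and encoder are refit inside every fold, so nothing leaks
# from the held-out split into preprocessing
scores = cross_val_score(model, df, y, cv=5)
assert len(scores) == 5
```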

Common Functions To Know

  • select_dtypes for separating column types without guesswork
  • get_dummies for manual categorical expansion
  • reindex for column alignment across splits
  • concat for joining feature blocks
  • assign for small derived features
  • merge for joining tables when the key column is explicit
  • ColumnTransformer for mixed-type preprocessing
  • OneHotEncoder for categorical columns inside a pipeline
  • StandardScaler for numeric columns that need scale control

Failure Pattern

The classic failure is building features from one table while the target comes from another. If the two tables were filtered or sorted differently, the rows no longer line up by position and every label is silently attached to the wrong example.

Other traps:

  • calling get_dummies separately on train and validation and forgetting to align the columns
  • using merge without checking that the join key is unique
  • dropping the target column into the feature matrix by accident
  • filling missing values after converting to a raw NumPy array, which hides column meaning
  • scaling sparse one-hot data with a dense centering step (StandardScaler with the default with_mean=True refuses to center a sparse matrix)
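The non-unique join key trap can be made loud instead of silent with merge's validate argument. A toy sketch (the tables and column names are illustrative):

```python
import pandas as pd

features = pd.DataFrame({"student_id": [1, 2, 3],
                         "quiz_average": [0.8, 0.6, 0.9]})
labels = pd.DataFrame({"student_id": [1, 2, 2, 3],
                       "needs_human_review": [0, 1, 1, 0]})

# student_id 2 is duplicated on the label side, so a one-to-one
# join is impossible and validate catches it loudly
try:
    features.merge(labels, on="student_id", validate="one_to_one")
    strict_join_ok = True
except pd.errors.MergeError:
    strict_join_ok = False

# Without validation the merge "works" but quietly adds a row
joined = features.merge(labels, on="student_id")

assert strict_join_ok is False
assert len(joined) == 4  # one more row than the feature table
```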

What To Inspect First

When the matrix looks wrong, check these in order:

  1. row count
  2. column count
  3. target alignment
  4. category expansion
  5. missing-value handling
  6. train/validation column match

If any one of those is off, the model result is not trustworthy yet.

Practical Tricks

  • Use assign for one or two derived features, not a long chain of hidden transformations.
  • Use fillna(0.0) for counts or indicator-like numeric gaps, but be careful with true measurements where zero is a real value.
  • Use merge(..., validate=...) when the join key should be one-to-one or many-to-one.
  • Use OneHotEncoder(handle_unknown="ignore") if validation data may contain unseen categories.
  • Use StandardScaler on numeric columns before LogisticRegression or an SVM, but not as a blanket rule for every model.
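The assign and fillna tricks combine naturally. A small sketch with illustrative values, showing that assign leaves the original table untouched and that a median fill can match a measurement column's meaning better than zero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quiz_average": [0.8, np.nan, 0.9],
    "days_until_deadline": [3, 10, 1],
})

# assign adds a derived feature without mutating df
enriched = df.assign(is_urgent=(df["days_until_deadline"] <= 3).astype(int))

assert "is_urgent" not in df.columns            # original table untouched
assert list(enriched["is_urgent"]) == [1, 0, 1]

# fillna(0.0) would record the missing quiz_average as a real zero score;
# filling with the column median preserves the column's meaning here
filled = enriched.assign(
    quiz_average=enriched["quiz_average"].fillna(enriched["quiz_average"].median())
)
```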

What Makes A Good First Baseline

A good first baseline is usually:

  • easy to explain
  • stable across splits
  • small enough to inspect
  • strong enough to expose weak data handling

That often means a numeric block, a one-hot categorical block, and a simple linear model.

Practice

  1. Build a numeric block with select_dtypes and explain which columns it included.
  2. One-hot encode two categorical columns and join them to the numeric features.
  3. Rebuild the same matrix on a validation table and align the columns with reindex.
  4. Write a ColumnTransformer version of the same matrix and compare it to the manual one.
  5. Explain when get_dummies is good enough and when a pipeline is safer.
  6. Describe one case where StandardScaler helps and one case where it is unnecessary.
  7. Show how handle_unknown="ignore" prevents a validation failure.
  8. Explain why a joined table can still be wrong even when it has the right shape.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

While reading the output, ask:

  • whether the row count stayed fixed
  • whether numeric columns stayed numeric
  • whether categorical columns expanded into the expected indicators
  • whether the validation matrix matches the training layout
  • whether the target column stayed separate

Inspect the feature-matrix shape and the first few encoded rows before moving on.

Quick Checks

  • If two tables were filtered differently, they should not be matched by position alone.
  • If a category exists only in validation, the matrix should still have a stable place for it.
  • If the target column appears in the feature set, stop immediately.
  • If the matrix is hard to explain, simplify the feature set before trying a more complex model.
  • If the join key is not unique, inspect the merge before trusting the output.
  • If the sparse/dense choice changes the model behavior, check whether the preprocessing path is still appropriate.

Questions To Ask

  1. Which feature comes from the original table and which one was derived?
  2. Which transformation should happen before the split, and which should happen inside the pipeline?
  3. Which categorical column is safe to one-hot encode, and which one may have too many levels?
  4. Which missing-value strategy matches the meaning of the column?
  5. Would a manual matrix or a ColumnTransformer be easier to defend here?
  6. What would make the validation matrix differ from training in a dangerous way?
  7. Which function would you use first to debug a column alignment problem?
  8. What is the simplest feature set that still preserves the signal?

Longer Connection

Continue with Python, NumPy, Pandas, Visualization for the full table-to-matrix workflow.