Feature Matrix Construction

What This Is

Feature-matrix construction is the step where you turn a table into the numeric representation a model can actually fit. The hard part is not making numbers. The hard part is keeping row alignment, column meaning, and train/validation consistency intact while you do it.

The strongest feature matrix is usually the simplest one that still preserves the information the model needs.

When You Use It

  • preparing tabular data for scikit-learn
  • mixing numeric and categorical columns
  • adding a small number of derived features
  • keeping train and validation matrices consistent
  • deciding whether to build features manually or inside a pipeline

Tooling

  • pandas.select_dtypes
  • pandas.get_dummies
  • pandas.concat
  • pandas.reindex
  • pandas.assign
  • pandas.fillna
  • pandas.merge
  • pandas.DataFrame.align
  • pandas.DataFrame.to_numpy
  • sklearn.compose.ColumnTransformer
  • sklearn.compose.make_column_selector
  • sklearn.preprocessing.OneHotEncoder
  • sklearn.preprocessing.StandardScaler
  • sklearn.pipeline.Pipeline

Library Notes

  • select_dtypes is the fastest way to separate numeric and categorical columns when you already know which columns are allowed into the feature block.
  • get_dummies is the simplest manual one-hot encoder when you want a readable matrix quickly.
  • reindex is how you force validation or test columns to match the training layout before scoring.
  • concat is often the cleanest way to join numeric and one-hot blocks after encoding.
  • assign is useful for creating a small number of derived features without losing the original table.
  • ColumnTransformer is the most reliable way to build mixed-type features inside a Pipeline.
  • OneHotEncoder(handle_unknown="ignore") protects scoring from unseen categories.
  • StandardScaler is usually the right choice for numeric features that feed linear or margin-based models.

Manual Pattern

import pandas as pd

feature_columns = ["days_until_deadline", "attendance_rate", "quiz_average", "channel", "issue_category"]
numeric_columns = ["days_until_deadline", "attendance_rate", "quiz_average"]

numeric = df[numeric_columns].fillna(0.0)
categorical = pd.get_dummies(df[["channel", "issue_category"]], drop_first=False)
X = pd.concat([numeric, categorical], axis=1)

This pattern is useful when you want:

  • a quick baseline
  • readable feature names
  • explicit control over missing values
  • a matrix you can inspect row by row

What to check first:

  • whether the allowed feature list excluded the target and ID columns before any dtype-based selection
  • whether numeric columns stayed numeric
  • whether the categorical columns expanded into the expected indicators
  • whether the row count stayed fixed
  • whether the target column stayed out of the feature block
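These checks are cheap to script as plain assertions. A minimal sketch on a toy table (the column names mirror the ones used above and are illustrative):

```python
import pandas as pd

# Toy table; "needs_human_review" is the target and must stay out of X
df = pd.DataFrame({
    "days_until_deadline": [3, 10, 1],
    "channel": ["email", "chat", "email"],
    "needs_human_review": [0, 1, 0],
})

numeric = df[["days_until_deadline"]].fillna(0.0)
categorical = pd.get_dummies(df[["channel"]])
X = pd.concat([numeric, categorical], axis=1)

assert len(X) == len(df)                             # row count stayed fixed
assert "needs_human_review" not in X.columns         # target stayed out
assert X["days_until_deadline"].dtype.kind in "if"   # numeric stayed numeric
```

If any assertion fails, stop and inspect the frame before fitting anything.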

If you really want a dtype-based shortcut, use it only after you remove columns that are not features:

blocked = ["needs_human_review", "student_id"]
candidate_frame = df.drop(columns=blocked)
numeric = candidate_frame.select_dtypes(include=["number"]).fillna(0.0)
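The reason for dropping blocked columns first is easy to demonstrate: a numeric target passes straight through select_dtypes. A small sketch with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "attendance_rate": [0.9, 0.7],
    "needs_human_review": [0, 1],   # numeric target, not a feature
    "channel": ["email", "chat"],
})

# Without dropping the target first, select_dtypes happily includes it
leaky = df.select_dtypes(include=["number"])
assert "needs_human_review" in leaky.columns

# Dropping non-feature columns first keeps the shortcut safe
safe = df.drop(columns=["needs_human_review"]).select_dtypes(include=["number"])
assert list(safe.columns) == ["attendance_rate"]
```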

Stable Train/Validation Pattern

train_X = pd.get_dummies(train_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = pd.get_dummies(valid_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)

This pattern matters because train and validation do not always contain the same categories.

Without reindex, a category that appears only in training has no indicator column in the validation matrix, and a category that appears only in validation adds an extra column, so the two layouts silently drift apart. With reindex, the validation matrix is forced to match the training columns exactly: missing indicators are created and filled with 0, and extra ones are dropped.
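A runnable demonstration of the drift and the fix, using a toy single-column table:

```python
import pandas as pd

train_df = pd.DataFrame({"channel": ["email", "chat", "phone"]})
valid_df = pd.DataFrame({"channel": ["email", "email"]})  # no "chat" or "phone"

train_X = pd.get_dummies(train_df[["channel"]])
valid_X = pd.get_dummies(valid_df[["channel"]])

# Before alignment: validation is missing two indicator columns
assert list(valid_X.columns) == ["channel_email"]

valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)

# After alignment: same columns, absent categories filled with 0
assert list(valid_X.columns) == list(train_X.columns)
assert valid_X["channel_phone"].sum() == 0
```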

Pipeline Pattern

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_selector = make_column_selector(dtype_include=["number"])
categorical_selector = make_column_selector(dtype_exclude=["number"])

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_selector),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_selector),
    ]
)

model = Pipeline(
    [
        ("preprocess", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

This is the safer production-style route when you want:

  • one place for preprocessing
  • no leakage between train and validation
  • categories handled consistently
  • model fitting and feature construction tied together

Why it helps:

  • the preprocessing is learned only on the training split
  • unknown categories at validation time do not crash the run
  • the same pipeline can be cross-validated cleanly
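The cross-validation point is worth seeing end to end. A self-contained sketch with small randomly generated data (the column names are illustrative; the shape of the pipeline matches the pattern above):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "days_until_deadline": rng.integers(0, 30, size=40),
    "channel": rng.choice(["email", "chat"], size=40),
})
y = rng.integers(0, 2, size=40)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=["number"])),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=["number"])),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Scaler and encoder are refit inside every fold, so nothing leaks
# from the held-out split into preprocessing
scores = cross_val_score(model, df, y, cv=5)
assert len(scores) == 5
```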

Common Functions To Know

  • select_dtypes for separating column types without guesswork
  • get_dummies for manual categorical expansion
  • reindex for column alignment across splits
  • concat for joining feature blocks
  • assign for small derived features
  • merge for joining tables when the key column is explicit
  • ColumnTransformer for mixed-type preprocessing
  • OneHotEncoder for categorical columns inside a pipeline
  • StandardScaler for numeric columns that need scale control

Failure Pattern

The classic failure is building features from one table while the target comes from another. If the two tables were filtered or sorted differently, the rows no longer line up by position and every label is silently attached to the wrong example.

Other traps:

  • calling get_dummies separately on train and validation and forgetting to align the columns
  • using merge without checking that the join key is unique
  • dropping the target column into the feature matrix by accident
  • filling missing values after converting to a raw NumPy array, which hides column meaning
  • scaling sparse one-hot data with a dense centering step (StandardScaler with the default with_mean=True refuses to center a sparse matrix)
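The non-unique join key trap can be made loud instead of silent with merge's validate argument. A toy sketch (the tables and column names are illustrative):

```python
import pandas as pd

features = pd.DataFrame({"student_id": [1, 2, 3],
                         "quiz_average": [0.8, 0.6, 0.9]})
labels = pd.DataFrame({"student_id": [1, 2, 2, 3],
                       "needs_human_review": [0, 1, 1, 0]})

# student_id 2 is duplicated on the label side, so a one-to-one
# join is impossible and validate catches it loudly
try:
    features.merge(labels, on="student_id", validate="one_to_one")
    strict_join_ok = True
except pd.errors.MergeError:
    strict_join_ok = False

# Without validation the merge "works" but quietly adds a row
joined = features.merge(labels, on="student_id")

assert strict_join_ok is False
assert len(joined) == 4  # one more row than the feature table
```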

What To Inspect First

When the matrix looks wrong, check these in order:

  1. row count
  2. column count
  3. target alignment
  4. category expansion
  5. missing-value handling
  6. train/validation column match

If any one of those is off, the model result is not trustworthy yet.

Practical Tricks

  • Use assign for one or two derived features, not a long chain of hidden transformations.
  • Use fillna(0.0) for counts or indicator-like numeric gaps, but be careful with true measurements where zero is a real value.
  • Use merge(..., validate=...) when the join key should be one-to-one or many-to-one.
  • Use OneHotEncoder(handle_unknown="ignore") if validation data may contain unseen categories.
  • Use StandardScaler on numeric columns before LogisticRegression or an SVM, but not as a blanket rule for every model.
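The assign and fillna tricks combine naturally. A small sketch with illustrative values, showing that assign leaves the original table untouched and that a median fill can match a measurement column's meaning better than zero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "quiz_average": [0.8, np.nan, 0.9],
    "days_until_deadline": [3, 10, 1],
})

# assign adds a derived feature without mutating df
enriched = df.assign(is_urgent=(df["days_until_deadline"] <= 3).astype(int))

assert "is_urgent" not in df.columns            # original table untouched
assert list(enriched["is_urgent"]) == [1, 0, 1]

# fillna(0.0) would record the missing quiz_average as a real zero score;
# filling with the column median preserves the column's meaning here
filled = enriched.assign(
    quiz_average=enriched["quiz_average"].fillna(enriched["quiz_average"].median())
)
```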

What Makes A Good First Baseline

A good first baseline is usually:

  • easy to explain
  • stable across splits
  • small enough to inspect
  • strong enough to expose weak data handling

That often means a numeric block, a one-hot categorical block, and a simple linear model.

Practice

  1. Build a numeric block with select_dtypes and explain which columns it included.
  2. One-hot encode two categorical columns and join them to the numeric features.
  3. Rebuild the same matrix on a validation table and align the columns with reindex.
  4. Write a ColumnTransformer version of the same matrix and compare it to the manual one.
  5. Explain when get_dummies is good enough and when a pipeline is safer.
  6. Describe one case where StandardScaler helps and one case where it is unnecessary.
  7. Show how handle_unknown="ignore" prevents a validation failure.
  8. Explain why a joined table can still be wrong even when it has the right shape.

Runnable Example

Open the matching example in AI Academy and run it from the platform.

While reading the output, ask:

  • whether the row count stayed fixed
  • whether numeric columns stayed numeric
  • whether categorical columns expanded into the expected indicators
  • whether the validation matrix matches the training layout
  • whether the target column stayed separate

Inspect the feature-matrix shape and the first few encoded rows before moving on.

Quick Checks

  • If two tables were filtered differently, they should not be matched by position alone.
  • If a category exists only in validation, the matrix should still have a stable place for it.
  • If the target column appears in the feature set, stop immediately.
  • If the matrix is hard to explain, simplify the feature set before trying a more complex model.
  • If the join key is not unique, inspect the merge before trusting the output.
  • If the sparse/dense choice changes the model behavior, check whether the preprocessing path is still appropriate.

Questions To Ask

  1. Which feature comes from the original table and which one was derived?
  2. Which transformation should happen before the split, and which should happen inside the pipeline?
  3. Which categorical column is safe to one-hot encode, and which one may have too many levels?
  4. Which missing-value strategy matches the meaning of the column?
  5. Would a manual matrix or a ColumnTransformer be easier to defend here?
  6. What would make the validation matrix differ from training in a dangerous way?
  7. Which function would you use first to debug a column alignment problem?
  8. What is the simplest feature set that still preserves the signal?

Longer Connection

Continue with Python, NumPy, Pandas, Visualization for the full table-to-matrix workflow.