Feature Matrix Construction¶
What This Is¶
Feature-matrix construction is the step where you turn a table into the numeric representation a model can actually fit. The hard part is not making numbers. The hard part is keeping row alignment, column meaning, and train/validation consistency intact while you do it.
The strongest feature matrix is usually the simplest one that still preserves the information the model needs.
When You Use It¶
- preparing tabular data for scikit-learn
- mixing numeric and categorical columns
- adding a small number of derived features
- keeping train and validation matrices consistent
- deciding whether to build features manually or inside a pipeline
Tooling¶
`pandas.select_dtypes`, `pandas.get_dummies`, `pandas.concat`, `pandas.reindex`, `pandas.assign`, `pandas.fillna`, `pandas.merge`, `pandas.DataFrame.align`, `pandas.DataFrame.to_numpy`, `sklearn.compose.ColumnTransformer`, `sklearn.compose.make_column_selector`, `sklearn.preprocessing.OneHotEncoder`, `sklearn.preprocessing.StandardScaler`, `sklearn.pipeline.Pipeline`
Library Notes¶
- `select_dtypes` is the fastest way to separate numeric and categorical columns when you already know which columns are allowed into the feature block.
- `get_dummies` is the simplest manual one-hot encoder when you want a readable matrix quickly.
- `reindex` is how you force validation or test columns to match the training layout before scoring.
- `concat` is often the cleanest way to join numeric and one-hot blocks after encoding.
- `assign` is useful for creating a small number of derived features without losing the original table.
- `ColumnTransformer` is the most reliable way to build mixed-type features inside a `Pipeline`.
- `OneHotEncoder(handle_unknown="ignore")` protects scoring from unseen categories.
- `StandardScaler` is usually the right choice for numeric features that feed linear or margin-based models.
Manual Pattern¶
import pandas as pd

# Explicit allow-list: the target and ID columns never enter the matrix.
feature_columns = ["days_until_deadline", "attendance_rate", "quiz_average", "channel", "issue_category"]
numeric_columns = ["days_until_deadline", "attendance_rate", "quiz_average"]
numeric = df[numeric_columns].fillna(0.0)
categorical = pd.get_dummies(df[["channel", "issue_category"]], drop_first=False)
X = pd.concat([numeric, categorical], axis=1)
This pattern is useful when you want:
- a quick baseline
- readable feature names
- explicit control over missing values
- a matrix you can inspect row by row
What to check first:
- whether the allowed feature list excluded the target and ID columns before any dtype-based selection
- whether numeric columns stayed numeric
- whether the categorical columns expanded into the expected indicators
- whether the row count stayed fixed
- whether the target column stayed out of the feature block
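These checks can be written as assertions. A minimal sketch on an invented toy table (the column names `student_id` and `needs_human_review` stand in for your ID and target columns):

```python
import pandas as pd

# Toy table standing in for df; "needs_human_review" is the target.
df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "days_until_deadline": [5.0, 2.0, None],
    "channel": ["email", "chat", "email"],
    "issue_category": ["grading", "access", "grading"],
    "needs_human_review": [0, 1, 0],
})

numeric = df[["days_until_deadline"]].fillna(0.0)
categorical = pd.get_dummies(df[["channel", "issue_category"]], drop_first=False)
X = pd.concat([numeric, categorical], axis=1)

# Row count stayed fixed, and the target and ID stayed out of the features.
assert len(X) == len(df)
assert "needs_human_review" not in X.columns
assert "student_id" not in X.columns
```

Cheap assertions like these catch most alignment mistakes before a model ever sees the matrix.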
If you really want a dtype-based shortcut, use it only after you remove columns that are not features:
blocked = ["needs_human_review", "student_id"]
candidate_frame = df.drop(columns=blocked)
numeric = candidate_frame.select_dtypes(include=["number"]).fillna(0.0)
Stable Train/Validation Pattern¶
train_X = pd.get_dummies(train_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = pd.get_dummies(valid_df[["days_until_deadline", "channel", "issue_category"]], drop_first=False)
valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)
This pattern matters because train and validation do not always contain the same categories.
Without reindex, a category that exists only in training can disappear in validation and break the column layout. With reindex, the validation matrix is forced to match the training columns exactly.
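A minimal sketch of that failure and its fix, using an invented `channel` column where one category appears only in training:

```python
import pandas as pd

# "phone" appears only in the training split.
train_df = pd.DataFrame({"channel": ["email", "chat", "phone"]})
valid_df = pd.DataFrame({"channel": ["email", "email"]})

train_X = pd.get_dummies(train_df[["channel"]])
valid_X = pd.get_dummies(valid_df[["channel"]])

# Before alignment the validation matrix is missing two training columns.
assert list(valid_X.columns) == ["channel_email"]

valid_X = valid_X.reindex(columns=train_X.columns, fill_value=0)

# After alignment the layouts match exactly; missing indicators are all zero.
assert list(valid_X.columns) == list(train_X.columns)
```

The `fill_value=0` is what makes the absent categories read as "not present" rather than `NaN`.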
Pipeline Pattern¶
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
numeric_selector = make_column_selector(dtype_include=["number"])
categorical_selector = make_column_selector(dtype_exclude=["number"])
preprocess = ColumnTransformer(
[
("num", StandardScaler(), numeric_selector),
("cat", OneHotEncoder(handle_unknown="ignore"), categorical_selector),
]
)
model = Pipeline(
[
("preprocess", preprocess),
("clf", LogisticRegression(max_iter=1000)),
]
)
This is the safer production-style route when you want:
- one place for preprocessing
- no leakage between train and validation
- categories handled consistently
- model fitting and feature construction tied together
Why it helps:
- the preprocessing is learned only on the training split
- unknown categories at validation time do not crash the run
- the same pipeline can be cross-validated cleanly
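The clean cross-validation point can be shown end to end. A minimal sketch with invented data (the column names are illustrative, not from a real dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data: one numeric column, one categorical column.
df = pd.DataFrame({
    "quiz_average": [55, 60, 62, 58, 90, 88, 92, 85, 57, 91, 59, 89],
    "channel": ["email", "chat", "email", "chat", "email", "chat",
                "email", "chat", "email", "chat", "email", "chat"],
})
y = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), make_column_selector(dtype_include=["number"])),
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude=["number"])),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The scaler and encoder are refit inside every fold, so no statistics
# leak from the held-out rows into the training transformation.
scores = cross_val_score(model, df, y, cv=3)
assert len(scores) == 3
```

Because preprocessing lives inside the pipeline, `cross_val_score` can pass the raw DataFrame straight in.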
Common Functions To Know¶
- `select_dtypes` for separating column types without guesswork
- `get_dummies` for manual categorical expansion
- `reindex` for column alignment across splits
- `concat` for joining feature blocks
- `assign` for small derived features
- `merge` for joining tables when the key column is explicit
- `ColumnTransformer` for mixed-type preprocessing
- `OneHotEncoder` for categorical columns inside a pipeline
- `StandardScaler` for numeric columns that need scale control
Failure Pattern¶
Building features from one table while the target comes from another. That creates row misalignment and invalid labels.
Other traps:
- calling `get_dummies` separately on train and validation and forgetting to align the columns
- using `merge` without checking that the join key is unique
- dropping the target column into the feature matrix by accident
- filling missing values after converting to a raw NumPy array, which hides column meaning
- scaling sparse one-hot data with a dense centering step
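The last trap is worth seeing concretely. A minimal sketch: scikit-learn refuses to center a sparse matrix, and `with_mean=False` is the escape hatch that keeps the data sparse:

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# A sparse one-hot block, like OneHotEncoder produces by default.
X = sparse.csr_matrix(np.array([[1, 0], [0, 1], [1, 0]], dtype=float))

# Dense centering would fill the matrix with nonzero values,
# so scikit-learn raises instead of silently densifying.
try:
    StandardScaler().fit(X)
    raised = False
except (TypeError, ValueError):
    raised = True
assert raised

# Scaling without centering preserves sparsity.
scaled = StandardScaler(with_mean=False).fit_transform(X)
assert sparse.issparse(scaled)
```

For one-hot columns you usually do not need scaling at all; this matters when a scaler is applied blanket-style to a mixed sparse matrix.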
What To Inspect First¶
When the matrix looks wrong, check these in order:
- row count
- column count
- target alignment
- category expansion
- missing-value handling
- train/validation column match
If any one of those is off, the model result is not trustworthy yet.
Practical Tricks¶
- Use `assign` for one or two derived features, not a long chain of hidden transformations.
- Use `fillna(0.0)` for counts or indicator-like numeric gaps, but be careful with true measurements where zero is a real value.
- Use `merge(..., validate=...)` when the join key should be one-to-one or many-to-one.
- Use `OneHotEncoder(handle_unknown="ignore")` if validation data may contain unseen categories.
- Use `StandardScaler` on numeric columns before `LogisticRegression` or an SVM, but not as a blanket rule for every model.
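The `validate` trick turns silent row duplication into a loud error. A minimal sketch with invented tables where the lookup key is accidentally duplicated:

```python
import pandas as pd

features = pd.DataFrame({"student_id": [1, 2, 3],
                         "quiz_average": [80, 70, 90]})
# Duplicate key: student 2 appears twice in the lookup table.
lookup = pd.DataFrame({"student_id": [1, 2, 2],
                       "channel": ["email", "chat", "email"]})

# validate="many_to_one" requires the right-hand key to be unique,
# so the bad join fails immediately instead of quietly adding rows.
try:
    features.merge(lookup, on="student_id", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True
assert raised
```

Without `validate`, this merge would silently produce a duplicated row for student 2 and break row alignment with the target.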
What Makes A Good First Baseline¶
A good first baseline is usually:
- easy to explain
- stable across splits
- small enough to inspect
- strong enough to expose weak data handling
That often means a numeric block, a one-hot categorical block, and a simple linear model.
Practice¶
- Build a numeric block with `select_dtypes` and explain which columns it included.
- One-hot encode two categorical columns and join them to the numeric features.
- Rebuild the same matrix on a validation table and align the columns with `reindex`.
- Write a `ColumnTransformer` version of the same matrix and compare it to the manual one.
- Explain when `get_dummies` is good enough and when a pipeline is safer.
- Describe one case where `StandardScaler` helps and one case where it is unnecessary.
- Show how `handle_unknown="ignore"` prevents a validation failure.
- Explain why a joined table can still be wrong even when it has the right shape.
Runnable Example¶
Open the matching example in AI Academy and run it from the platform.
While reading the output, ask:
- whether the row count stayed fixed
- whether numeric columns stayed numeric
- whether categorical columns expanded into the expected indicators
- whether the validation matrix matches the training layout
- whether the target column stayed separate
Inspect the feature-matrix shape and the first few encoded rows before moving on.
Quick Checks¶
- If two tables were filtered differently, they should not be matched by position alone.
- If a category exists only in validation, the matrix should still have a stable place for it.
- If the target column appears in the feature set, stop immediately.
- If the matrix is hard to explain, simplify the feature set before trying a more complex model.
- If the join key is not unique, inspect the merge before trusting the output.
- If the sparse/dense choice changes the model behavior, check whether the preprocessing path is still appropriate.
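Two of these checks are one-liners in pandas. A minimal sketch (the `order_id` tables are invented): `is_unique` flags a bad join key, and `indicator=True` labels where each merged row came from so the join can be audited.

```python
import pandas as pd

left = pd.DataFrame({"order_id": [10, 11, 11]})
right = pd.DataFrame({"order_id": [10, 12]})

# The key is duplicated on the left: inspect before trusting any merge.
assert not left["order_id"].is_unique

# indicator=True adds a "_merge" column showing each row's origin.
audited = left.merge(right, on="order_id", how="outer", indicator=True)
assert set(audited["_merge"]) == {"both", "left_only", "right_only"}
```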
Questions To Ask¶
- Which feature comes from the original table and which one was derived?
- Which transformation should happen before the split, and which should happen inside the pipeline?
- Which categorical column is safe to one-hot encode, and which one may have too many levels?
- Which missing-value strategy matches the meaning of the column?
- Would a manual matrix or a `ColumnTransformer` be easier to defend here?
- What would make the validation matrix differ from training in a dangerous way?
- Which function would you use first to debug a column alignment problem?
- What is the simplest feature set that still preserves the signal?
Longer Connection¶
Continue with Python, NumPy, Pandas, Visualization for the full table-to-matrix workflow.