Data Cleaning and Preprocessing¶
What This Is¶
Data cleaning is the step where you decide which values are trustworthy enough to enter the model path. The goal is not to make the table look tidy. The goal is to stop type bugs, missing-value shortcuts, duplicate inflation, and preprocessing leakage before feature engineering starts.
This topic teaches one core rule:
- inspect first
- split first
- fit preprocessing only on the training split
When You Use It¶
- after loading a new table
- after merging sources with different schemas
- before feature engineering or model fitting
- when a model looks suspiciously good or strangely unstable
First Questions¶
Before you clean anything, answer these:
- which columns have the wrong type
- where is missingness concentrated
- are there duplicates that would inflate metrics
- which categorical columns need encoding
- which transformations must be learned only from train
If you cannot answer those questions, you are cleaning blindly.
The Cleaning Ladder¶
Use the same order every time:
- fix obvious type problems
- inspect missingness and decide drop versus impute
- remove harmful duplicates
- encode categorical columns
- scale numeric columns only when the model needs it
- keep the whole preprocessing path inside a pipeline
The ladder matters because later steps depend on earlier ones. Encoding before fixing types usually makes the table harder to reason about.
Decision Table¶
| Problem | First move | What to inspect next |
|---|---|---|
| numeric column loaded as text | `pd.to_numeric(..., errors="coerce")` | how many values became missing |
| date column loaded as text | `pd.to_datetime(..., errors="coerce")` | missing parse rate and suspicious formats |
| sparse missingness in a low-value column | drop rows or column | how much data you lost |
| meaningful missingness in an important column | impute inside pipeline | whether the imputed values distort the distribution |
| duplicate rows | `drop_duplicates()` or grouped dedupe rule | whether the rows are accidental copies or genuinely repeated events |
| unordered categories | `OneHotEncoder(handle_unknown="ignore")` | matrix width and unseen categories |
| ordered categories | `OrdinalEncoder` | whether the order is real or invented |
Core Pattern¶
This is the shortest honest preprocessing path for mixed tabular data:
```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]
categorical_cols = ["city", "plan"]

preprocessor = ColumnTransformer(
    [
        (
            "num",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler()),
                ]
            ),
            numeric_cols,
        ),
        (
            "cat",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_cols,
        ),
    ]
)

model = Pipeline(
    [
        ("prep", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# X_train and y_train come from a split made before any preprocessing is fitted
model.fit(X_train, y_train)
```
This pattern matters because the imputer, encoder, and scaler are all fitted only on X_train.
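The `X_train` in that pattern has to come from a split made before anything is fitted. A minimal sketch of that step, using a small hypothetical table with the same column names as the pattern above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw table matching the columns used in the core pattern.
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 38, 29],
        "income": [40000, 52000, 88000, 70000, 61000, 45000],
        "city": ["oslo", "bergen", "oslo", "oslo", "bergen", "bergen"],
        "plan": ["basic", "pro", "pro", "basic", "basic", "pro"],
        "churned": [0, 0, 1, 1, 0, 1],
    }
)

X = df.drop(columns="churned")
y = df["churned"]

# Split first: validation rows must never reach the imputer, encoder, or scaler.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.33, random_state=0
)
print(X_train.shape, X_val.shape)
```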
What To Inspect First¶
Start with inspection, not transformation:
```python
df.dtypes
df.isna().mean().sort_values(ascending=False)
df.duplicated().sum()
df.head()
```
Look for:
- numeric columns hiding as `object`
- missingness concentrated in one slice or source
- duplicates that would copy the target signal
- categories with messy spelling or inconsistent casing
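The last point is easy to quantify: count distinct values before and after normalizing whitespace and case. A small sketch with a hypothetical city column:

```python
import pandas as pd

# Hypothetical column with inconsistent casing and stray whitespace.
s = pd.Series(["Oslo", "oslo ", "OSLO", "Bergen", "bergen"])

print(s.nunique())  # → 5 distinct raw spellings

# Normalize before encoding so the encoder sees real categories.
clean = s.str.strip().str.lower()
print(clean.nunique())  # → 2 real categories
```

A large gap between the two counts is a sign the column needs normalization before any encoder touches it.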
The Most Important Failure Pattern¶
The main failure is fitting preprocessing on the full dataset before the split.
That breaks the logic of evaluation in three ways:
- the imputer sees validation values
- the encoder sees validation categories
- the scaler sees validation distribution
The score may still look reasonable, but the workflow is no longer honest.
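The damage is concrete, not philosophical: statistics fitted on the full data contain validation information. A small sketch contrasting a leaky scaler fit with an honest one:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
val = np.array([[100.0]])  # an extreme validation point

# Leaky: mean and scale computed over train + validation together.
leaky = StandardScaler().fit(np.vstack([train, val]))

# Honest: mean and scale computed on train only.
honest = StandardScaler().fit(train)

print(leaky.mean_[0], honest.mean_[0])  # → 26.5 2.0
```

The leaky scaler has already absorbed the validation distribution, so every transformed training value now depends on data the model was never supposed to see.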
Common Mistakes¶
- using `dropna()` so aggressively that the dataset collapses
- filling every missing value with `0` without asking whether zero is meaningful
- one-hot encoding extremely high-cardinality identifiers
- forgetting `handle_unknown="ignore"` and crashing on new categories
- scaling columns that do not need scaling
- treating duplicate rows as harmless when they duplicate the label too
A Good Cleaning Note¶
After one pass, the learner should be able to write:
- which columns had type problems
- which columns were dropped or imputed
- which duplicate rule was used
- which encoding path was chosen
- what leakage risk the pipeline avoided
If the note only says "cleaned the data," the workflow is still too vague.
Practice¶
- Inspect a new table and name the first three cleaning problems you see.
- Decide which columns should be dropped and which should be imputed.
- Build a `ColumnTransformer` for numeric and categorical columns.
- Explain why the imputer must be fitted only on training data.
- Compare a dropped-row version against an imputed version and say which one is more defensible.
Runnable Example¶
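The sketch below walks the whole ladder on a small synthetic table (column names and values are illustrative): fix types, drop duplicates, split, then fit all preprocessing inside a pipeline on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic table with the problems the ladder targets:
# a numeric column loaded as text, missing values, and a duplicate row.
df = pd.DataFrame(
    {
        "age": ["25", "32", "47", "47", "not available", "29", "51", "38"],
        "income": [40000, 52000, 88000, 88000, 61000, None, 72000, 58000],
        "city": ["oslo", "bergen", "oslo", "oslo", "bergen", "oslo", np.nan, "bergen"],
        "plan": ["basic", "pro", "pro", "pro", "basic", "pro", "basic", "basic"],
        "churned": [0, 0, 1, 1, 0, 1, 1, 0],
    }
)

# Step 1: fix types; unparseable ages become NaN for the imputer to handle.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Step 2: remove exact duplicate rows (here, one repeated record).
df = df.drop_duplicates()

X = df.drop(columns="churned")
y = df["churned"]

# Step 3: split before fitting any preprocessing.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Steps 4-6: impute, encode, and scale inside one pipeline.
preprocessor = ColumnTransformer(
    [
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["age", "income"]),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["city", "plan"]),
    ]
)

model = Pipeline(
    [
        ("prep", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```

With eight rows the score itself is noise; the point is that every fitted statistic in the pipeline comes from `X_train` alone.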
Longer Connection¶
Continue with Table Inspection for the inspection habits that come before cleaning, and Feature Matrix Construction for turning the cleaned table into a stable model matrix.