Data Cleaning and Preprocessing

What This Is

Data cleaning is the step where you decide which values are trustworthy enough to enter the model path. The goal is not to make the table look tidy. The goal is to stop type bugs, missing-value shortcuts, duplicate inflation, and preprocessing leakage before feature engineering starts.

This topic teaches one core rule:

  • inspect first
  • split first
  • fit preprocessing only on the training split
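The rule can be sketched in a few lines. This is a minimal illustration with a made-up toy frame (the `income`/`churn` names are invented for the example); the point is only the ordering: split before any fitting, fit only on the training split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy frame standing in for a real table; values and column names are illustrative
df = pd.DataFrame({"income": [30.0, 45.0, 52.0, 61.0, 70.0, 88.0],
                   "churn": [0, 0, 1, 0, 1, 1]})

# split first, so nothing downstream ever sees the held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    df[["income"]], df["churn"], test_size=0.33, random_state=0
)

# fit preprocessing only on the training split
scaler = StandardScaler().fit(X_train)      # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)    # held-out rows are transformed, never fitted
```

The scaler's learned statistics match the training split exactly, which is the whole point: the held-out rows contribute nothing to what is learned.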

When You Use It

  • after loading a new table
  • after merging sources with different schemas
  • before feature engineering or model fitting
  • when a model looks suspiciously good or strangely unstable

First Questions

Before you clean anything, answer these:

  1. which columns have the wrong type
  2. where is missingness concentrated
  3. are there duplicates that would inflate metrics
  4. which categorical columns need encoding
  5. which transformations must be learned only from train

If you cannot answer those questions, you are cleaning blindly.

The Cleaning Ladder

Use the same order every time:

  1. fix obvious type problems
  2. inspect missingness and decide drop versus impute
  3. remove harmful duplicates
  4. encode categorical columns
  5. scale numeric columns only when the model needs it
  6. keep the whole preprocessing path inside a pipeline

The ladder matters because later steps depend on earlier ones. Encoding before fixing types usually makes the table harder to reason about.
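A quick sketch of the first three rungs on a made-up toy frame shows why the order matters: duplicates can only be detected reliably after types are fixed, and missingness can only be measured after coercion reveals it.

```python
import pandas as pd

# toy frame illustrating the first three rungs; the values are made up
df = pd.DataFrame({
    "age":  ["25", "31", "n/a", "40", "40"],
    "plan": ["basic", "pro", "basic", "pro", "pro"],
})

# 1. fix obvious type problems: "n/a" becomes NaN instead of poisoning the dtype
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# 2. inspect missingness before choosing drop versus impute
missing_rate = df["age"].isna().mean()   # 1 of 5 values failed to parse

# 3. only then remove duplicate rows, now that equal values compare equal
df = df.drop_duplicates()
```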

Decision Table

Each row reads: problem → first move → what to inspect next.

  • numeric column loaded as text → pd.to_numeric(..., errors="coerce") → how many values became missing
  • date column loaded as text → pd.to_datetime(..., errors="coerce") → missing parse rate and suspicious formats
  • sparse missingness in a low-value column → drop rows or column → how much data you lost
  • meaningful missingness in an important column → impute inside pipeline → whether the imputed values distort the distribution
  • duplicate rows → drop_duplicates() or grouped dedupe rule → whether the duplicates were accidental copies or real repeated events
  • unordered categories → OneHotEncoder(handle_unknown="ignore") → matrix width and unseen categories
  • ordered categories → OrdinalEncoder → whether the order is real or invented
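The coercion moves in the first two rows pair naturally with their follow-up inspection. A small sketch with made-up values, showing why errors="coerce" plus a parse-rate check beats a crash:

```python
import pandas as pd

raw = pd.Series(["2024-01-05", "2024-02-30", "not a date"])  # made-up values

# first move: coerce instead of crash, then inspect the parse-failure rate
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
failure_rate = parsed.isna().mean()   # "2024-02-30" is not a real date, so it fails too
```

A high failure rate is a signal to look at the raw strings before dropping anything: sometimes the column just uses a different format.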

Core Pattern

This is the shortest honest preprocessing path for mixed tabular data:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]
categorical_cols = ["city", "plan"]

preprocessor = ColumnTransformer(
    [
        (
            "num",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler()),
                ]
            ),
            numeric_cols,
        ),
        (
            "cat",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_cols,
        ),
    ]
)

model = Pipeline(
    [
        ("prep", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# X_train and y_train come from a train/test split performed before any fitting
model.fit(X_train, y_train)

This pattern matters because the imputer, encoder, and scaler are all fitted only on X_train.
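One payoff of fitting the encoder only on training data is visible at predict time. A minimal self-contained sketch (toy city labels, invented for the example): because the one-hot encoder was fitted with handle_unknown="ignore", a category never seen during training is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy training data; the city labels are illustrative
X_train = pd.DataFrame({"city": ["NY", "LA", "NY", "LA"]})
y_train = [0, 1, 0, 1]

clf = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])
clf.fit(X_train, y_train)

# "SF" never appeared during fit; handle_unknown="ignore" encodes it as all zeros
pred = clf.predict(pd.DataFrame({"city": ["SF"]}))
```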

What To Inspect First

Start with inspection, not transformation:

df.dtypes
df.isna().mean().sort_values(ascending=False)
df.duplicated().sum()
df.head()

Look for:

  • numeric columns hiding as object
  • missingness concentrated in one slice or source
  • duplicates that would copy the target signal
  • categories with messy spelling or inconsistent casing
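The last bullet is easy to check programmatically. A sketch with made-up labels: value_counts() or nunique() on the raw column exposes near-duplicate spellings, and normalising whitespace and casing collapses them.

```python
import pandas as pd

# made-up labels with whitespace and casing noise
cities = pd.Series(["NY", "ny ", " Ny", "LA", "la"])

raw_levels = cities.nunique()            # 5 apparent categories

clean = cities.str.strip().str.lower()
clean_levels = clean.nunique()           # only 2 real categories remain
```

Do this before encoding; otherwise every spelling variant becomes its own one-hot column.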

The Most Important Failure Pattern

The main failure is fitting preprocessing on the full dataset before the split.

That breaks the logic of evaluation in three ways:

  • the imputer sees validation values
  • the encoder sees validation categories
  • the scaler sees validation distribution

The score may still look reasonable, but the workflow is no longer honest.
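The scaler case is the easiest to demonstrate. A deterministic sketch with made-up numbers: fitting on the full data versus the training rows alone produces different learned statistics, which is direct evidence that the validation row leaked.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny numeric column with one extreme held-out value (made-up numbers)
X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
X_train, X_val = X[:4], X[4:]            # deterministic split for illustration

leaky = StandardScaler().fit(X)          # sees the validation row: wrong
honest = StandardScaler().fit(X_train)   # sees training rows only: right

# the learned means differ, so the validation row shaped the "training" statistics
```

The same argument applies to the imputer's medians and the encoder's category list.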

Common Mistakes

  • using dropna() so aggressively that the dataset collapses
  • filling every missing value with 0 without asking whether zero is meaningful
  • one-hot encoding extremely high-cardinality identifiers
  • forgetting handle_unknown="ignore" and crashing on new categories
  • scaling columns that do not need scaling
  • treating duplicate rows as harmless when they duplicate the label too
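The last mistake deserves a demonstration. In this sketch with a made-up, heavily duplicated frame, a random split sends verbatim copies of training rows (label included) into the test set, so a model that merely memorises would score perfectly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# heavily duplicated toy rows where the label is copied along with the features
df = pd.DataFrame({"x": [1, 2, 3] * 10, "y": [0, 1, 0] * 10})

train, test = train_test_split(df, test_size=0.3, random_state=0)

# count test rows that also appear verbatim in train
overlap = test.merge(train.drop_duplicates(), on=["x", "y"]).shape[0]
```

Here every test row has an identical twin in train, which is why deduplication must happen before the split, not after.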

A Good Cleaning Note

After one pass, the learner should be able to write:

  • which columns had type problems
  • which columns were dropped or imputed
  • which duplicate rule was used
  • which encoding path was chosen
  • what leakage risk the pipeline avoided

If the note only says "cleaned the data," the workflow is still too vague.

Practice

  1. Inspect a new table and name the first three cleaning problems you see.
  2. Decide which columns should be dropped and which should be imputed.
  3. Build a ColumnTransformer for numeric and categorical columns.
  4. Explain why the imputer must be fitted only on training data.
  5. Compare a dropped-row version against an imputed version and say which one is more defensible.

Runnable Example
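The sketch below makes the core pattern fully runnable by generating a synthetic table (the column names match the core pattern; the values are random stand-ins, not real data), injecting missing values, splitting first, and fitting the whole pipeline on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# synthetic stand-in for a real table
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["NY", "LA", "SF"], n),
    "plan": rng.choice(["basic", "pro"], n),
})
X.loc[X.sample(frac=0.1, random_state=0).index, "age"] = np.nan  # inject missingness
y = ((X["income"] + rng.normal(0, 5_000, n)) > 50_000).astype(int)

# split first, then fit everything on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city", "plan"]),
])

model = Pipeline([("prep", preprocessor), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Because the imputer, encoder, and scaler all live inside the pipeline, refitting on a different split or running cross-validation cannot leak held-out rows into the preprocessing statistics.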

Longer Connection

Continue with Table Inspection for the inspection habits that come before cleaning, and Feature Matrix Construction for turning the cleaned table into a stable model matrix.