Data Cleaning and Preprocessing

What This Is

Data cleaning is the step where you decide which values are trustworthy enough to enter the model path. The goal is not to make the table look tidy. The goal is to stop type bugs, missing-value shortcuts, duplicate inflation, and preprocessing leakage before feature engineering starts.

This topic teaches one core rule:

  • inspect first
  • split first
  • fit preprocessing only on the training split
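The rule can be sketched in a few lines. This is a minimal illustration with a made-up toy frame (the `income`/`churn` names are invented for the example); the point is only the ordering: split before any fitting, fit only on the training split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy frame standing in for a real table; values and column names are illustrative
df = pd.DataFrame({"income": [30.0, 45.0, 52.0, 61.0, 70.0, 88.0],
                   "churn": [0, 0, 1, 0, 1, 1]})

# split first, so nothing downstream ever sees the held-out rows
X_train, X_test, y_train, y_test = train_test_split(
    df[["income"]], df["churn"], test_size=0.33, random_state=0
)

# fit preprocessing only on the training split
scaler = StandardScaler().fit(X_train)      # learns mean/std from train only
X_test_scaled = scaler.transform(X_test)    # held-out rows are transformed, never fitted
```

The scaler's learned statistics match the training split exactly, which is the whole point: the held-out rows contribute nothing to what is learned.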

When You Use It

  • after loading a new table
  • after merging sources with different schemas
  • before feature engineering or model fitting
  • when a model looks suspiciously good or strangely unstable

First Questions

Before you clean anything, answer these:

  1. which columns have the wrong type
  2. where is missingness concentrated
  3. are there duplicates that would inflate metrics
  4. which categorical columns need encoding
  5. which transformations must be learned only from train

If you cannot answer those questions, you are cleaning blindly.

The Cleaning Ladder

Use the same order every time:

  1. fix obvious type problems
  2. inspect missingness and decide drop versus impute
  3. remove harmful duplicates
  4. encode categorical columns
  5. scale numeric columns only when the model needs it
  6. keep the whole preprocessing path inside a pipeline

The ladder matters because later steps depend on earlier ones. Encoding before fixing types usually makes the table harder to reason about.
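A quick sketch of the first three rungs on a made-up toy frame shows why the order matters: duplicates can only be detected reliably after types are fixed, and missingness can only be measured after coercion reveals it.

```python
import pandas as pd

# toy frame illustrating the first three rungs; the values are made up
df = pd.DataFrame({
    "age":  ["25", "31", "n/a", "40", "40"],
    "plan": ["basic", "pro", "basic", "pro", "pro"],
})

# 1. fix obvious type problems: "n/a" becomes NaN instead of poisoning the dtype
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# 2. inspect missingness before choosing drop versus impute
missing_rate = df["age"].isna().mean()   # 1 of 5 values failed to parse

# 3. only then remove duplicate rows, now that equal values compare equal
df = df.drop_duplicates()
```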

Decision Table

Each row reads: problem → first move → what to inspect next.

  • numeric column loaded as text → pd.to_numeric(..., errors="coerce") → how many values became missing
  • date column loaded as text → pd.to_datetime(..., errors="coerce") → missing parse rate and suspicious formats
  • sparse missingness in a low-value column → drop rows or column → how much data you lost
  • meaningful missingness in an important column → impute inside pipeline → whether the imputed values distort the distribution
  • duplicate rows → drop_duplicates() or grouped dedupe rule → whether the duplicates were accidental copies or real repeated events
  • unordered categories → OneHotEncoder(handle_unknown="ignore") → matrix width and unseen categories
  • ordered categories → OrdinalEncoder → whether the order is real or invented
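The coercion moves in the first two rows pair naturally with their follow-up inspection. A small sketch with made-up values, showing why errors="coerce" plus a parse-rate check beats a crash:

```python
import pandas as pd

raw = pd.Series(["2024-01-05", "2024-02-30", "not a date"])  # made-up values

# first move: coerce instead of crash, then inspect the parse-failure rate
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
failure_rate = parsed.isna().mean()   # "2024-02-30" is not a real date, so it fails too
```

A high failure rate is a signal to look at the raw strings before dropping anything: sometimes the column just uses a different format.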

Core Pattern

This is the shortest honest preprocessing path for mixed tabular data:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]
categorical_cols = ["city", "plan"]

preprocessor = ColumnTransformer(
    [
        (
            "num",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler()),
                ]
            ),
            numeric_cols,
        ),
        (
            "cat",
            Pipeline(
                [
                    ("impute", SimpleImputer(strategy="most_frequent")),
                    ("onehot", OneHotEncoder(handle_unknown="ignore")),
                ]
            ),
            categorical_cols,
        ),
    ]
)

model = Pipeline(
    [
        ("prep", preprocessor),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# X_train and y_train come from a train/test split performed before any fitting
model.fit(X_train, y_train)

This pattern matters because the imputer, encoder, and scaler are all fitted only on X_train.
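One payoff of fitting the encoder only on training data is visible at predict time. A minimal self-contained sketch (toy city labels, invented for the example): because the one-hot encoder was fitted with handle_unknown="ignore", a category never seen during training is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# toy training data; the city labels are illustrative
X_train = pd.DataFrame({"city": ["NY", "LA", "NY", "LA"]})
y_train = [0, 1, 0, 1]

clf = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ("clf", LogisticRegression()),
])
clf.fit(X_train, y_train)

# "SF" never appeared during fit; handle_unknown="ignore" encodes it as all zeros
pred = clf.predict(pd.DataFrame({"city": ["SF"]}))
```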

What To Inspect First

Start with inspection, not transformation:

df.dtypes
df.isna().mean().sort_values(ascending=False)
df.duplicated().sum()
df.head()

Look for:

  • numeric columns hiding as object
  • missingness concentrated in one slice or source
  • duplicates that would copy the target signal
  • categories with messy spelling or inconsistent casing
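The last bullet is easy to check programmatically. A sketch with made-up labels: value_counts() or nunique() on the raw column exposes near-duplicate spellings, and normalising whitespace and casing collapses them.

```python
import pandas as pd

# made-up labels with whitespace and casing noise
cities = pd.Series(["NY", "ny ", " Ny", "LA", "la"])

raw_levels = cities.nunique()            # 5 apparent categories

clean = cities.str.strip().str.lower()
clean_levels = clean.nunique()           # only 2 real categories remain
```

Do this before encoding; otherwise every spelling variant becomes its own one-hot column.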

The Most Important Failure Pattern

The main failure is fitting preprocessing on the full dataset before the split.

That breaks the logic of evaluation in three ways:

  • the imputer sees validation values
  • the encoder sees validation categories
  • the scaler sees validation distribution

The score may still look reasonable, but the workflow is no longer honest.
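The scaler case is the easiest to demonstrate. A deterministic sketch with made-up numbers: fitting on the full data versus the training rows alone produces different learned statistics, which is direct evidence that the validation row leaked.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# tiny numeric column with one extreme held-out value (made-up numbers)
X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
X_train, X_val = X[:4], X[4:]            # deterministic split for illustration

leaky = StandardScaler().fit(X)          # sees the validation row: wrong
honest = StandardScaler().fit(X_train)   # sees training rows only: right

# the learned means differ, so the validation row shaped the "training" statistics
```

The same argument applies to the imputer's medians and the encoder's category list.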

Common Mistakes

  • using dropna() so aggressively that the dataset collapses
  • filling every missing value with 0 without asking whether zero is meaningful
  • one-hot encoding extremely high-cardinality identifiers
  • forgetting handle_unknown="ignore" and crashing on new categories
  • scaling columns that do not need scaling
  • treating duplicate rows as harmless when they duplicate the label too
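The last mistake deserves a demonstration. In this sketch with a made-up, heavily duplicated frame, a random split sends verbatim copies of training rows (label included) into the test set, so a model that merely memorises would score perfectly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# heavily duplicated toy rows where the label is copied along with the features
df = pd.DataFrame({"x": [1, 2, 3] * 10, "y": [0, 1, 0] * 10})

train, test = train_test_split(df, test_size=0.3, random_state=0)

# count test rows that also appear verbatim in train
overlap = test.merge(train.drop_duplicates(), on=["x", "y"]).shape[0]
```

Here every test row has an identical twin in train, which is why deduplication must happen before the split, not after.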

A Good Cleaning Note

After one pass, the learner should be able to write:

  • which columns had type problems
  • which columns were dropped or imputed
  • which duplicate rule was used
  • which encoding path was chosen
  • what leakage risk the pipeline avoided

If the note only says "cleaned the data," the workflow is still too vague.

Practice

  1. Inspect a new table and name the first three cleaning problems you see.
  2. Decide which columns should be dropped and which should be imputed.
  3. Build a ColumnTransformer for numeric and categorical columns.
  4. Explain why the imputer must be fitted only on training data.
  5. Compare a dropped-row version against an imputed version and say which one is more defensible.

Runnable Example
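The sketch below makes the core pattern fully runnable by generating a synthetic table (the column names match the core pattern; the values are random stand-ins, not real data), injecting missing values, splitting first, and fitting the whole pipeline on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# synthetic stand-in for a real table
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "income": rng.normal(50_000, 15_000, n),
    "city": rng.choice(["NY", "LA", "SF"], n),
    "plan": rng.choice(["basic", "pro"], n),
})
X.loc[X.sample(frac=0.1, random_state=0).index, "age"] = np.nan  # inject missingness
y = ((X["income"] + rng.normal(0, 5_000, n)) > 50_000).astype(int)

# split first, then fit everything on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["city", "plan"]),
])

model = Pipeline([("prep", preprocessor), ("clf", LogisticRegression(max_iter=1000))])
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

Because the imputer, encoder, and scaler all live inside the pipeline, refitting on a different split or running cross-validation cannot leak held-out rows into the preprocessing statistics.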

Longer Connection

Continue with Table Inspection for the inspection habits that come before cleaning, and Feature Matrix Construction for turning the cleaned table into a stable model matrix.