Table Inspection¶
Table inspection is the first habit for understanding a dataset before you model, join, filter, or aggregate it. The goal is simple: answer a few basic questions quickly. What columns exist? Which ones have missing values? Which fields are numeric, categorical, dates, or identifiers? Are there duplicate rows? Are a few values dominating the table?
What This Is¶
In pandas, table inspection means reading the structure and the shape of a DataFrame before doing deeper analysis. The most useful inspection methods are the ones that help you detect schema problems, bad encodings, hidden duplicates, and misleading first impressions.
Think of it as a short checklist, not a single command.
When You Use It¶
Use inspection whenever you:
- load a new table
- receive a file from another team
- compare train and test data
- prepare features for a model
- debug a suspicious result
- check whether a join, filter, or pivot changed the data shape
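The last point is easy to automate. As a minimal sketch (the table and column names here are invented for illustration), compare row counts around a merge:

```python
import pandas as pd

# Hypothetical tables; the names are made up for illustration.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer": ["a", "b", "b"]})
customers = pd.DataFrame({"customer": ["a", "b", "b"], "region": ["N", "S", "S"]})

before = orders.shape[0]
merged = orders.merge(customers, on="customer", how="left")
after = merged.shape[0]

# A left join should not add rows unless the right-hand key is duplicated.
print(before, after)  # row growth here signals duplicate keys in `customers`
```

Here the row count grows from 3 to 5 because `customer` is duplicated on the right side, exactly the kind of shape change inspection is meant to catch.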
Tooling¶
The most useful pandas inspection functions for this topic are:
| Function | What it tells you | When to use it |
|---|---|---|
| `DataFrame.info()` | column names, non-null counts, dtypes, memory use | first pass on schema and missingness |
| `DataFrame.head()` / `tail()` | first or last rows | quick row-level sanity check |
| `DataFrame.sample()` | random rows | when the top rows are not representative |
| `DataFrame.dtypes` | per-column types | when you need a compact type summary |
| `DataFrame.isna()` / `notna()` | missing-value mask | missingness checks and filters |
| `DataFrame.count()` | non-missing counts | completeness checks by column or group |
| `DataFrame.nunique()` | distinct-value counts | identifier checks, cardinality checks |
| `Series.value_counts()` | category frequencies | label balance, placeholders, rare values |
| `DataFrame.describe()` | summary statistics | numeric spread and rough distribution checks |
| `DataFrame.duplicated()` | repeated rows or keys | duplicate detection |
| `DataFrame.sort_values()` | ordered extremes | outliers, suspicious records, boundary cases |
Official pandas references:
`DataFrame.info`, `DataFrame.head`, `DataFrame.tail`, `DataFrame.sample`, `DataFrame.describe`, `Series.value_counts`, `DataFrame.duplicated`, `DataFrame.isna`, `DataFrame.nunique`, `DataFrame.count`, `DataFrame.sort_values`
Inspection Ladder¶
Start broad, then get more specific:
1. `df.shape`, `df.info()`, `df.dtypes`
2. `df.head()` and `df.sample(n=5, random_state=0)`
3. `df.isna().sum()` and `df.count()`
4. `df.nunique()` and `df[col].value_counts(dropna=False)`
5. `df.duplicated().sum()` or duplicate key checks
6. `df.describe(include="all")`
7. `df.sort_values(...)` to inspect extremes or suspicious rows
That order matters. If you start with summaries before checking types and missingness, you can miss the real problem.
Minimal Example¶
import pandas as pd
df = pd.DataFrame(
    {
        "student_id": [101, 102, 103, 103],
        "grade": [8, 9, None, 9],
        "group": ["A", "A", "B", "A"],
        "score": [84, 91, 77, 91],
    }
)
df.info()
df.head()
df.sample(n=2, random_state=0)
df.isna().sum()
df.nunique()
df["group"].value_counts(dropna=False)
df.describe(include="all")
df.duplicated().sum()
This small set already answers a lot:
- Is `grade` missing anywhere?
- Is `student_id` really unique?
- Is `group` balanced or dominated by one value?
- Are there duplicate rows?
- Is `score` in a plausible range?
Worked Patterns¶
1. Use info() to check schema and completeness¶
info() is the fastest way to spot type and missing-value problems together.
df.info()
Look for:
- columns with far fewer non-null values than expected
- object columns that should be numeric or datetime
- columns that were read as strings because of mixed values
Common mistake: reading only the shape and ignoring the non-null counts. A table can look fine at a glance and still have half of one feature missing.
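A quick way to see why the non-null counts matter, using a toy table (the column names are invented):

```python
import pandas as pd
import numpy as np

# A table whose shape looks healthy but whose "income" column is half missing.
df = pd.DataFrame({
    "age": list(range(10)),
    "income": [50_000, np.nan] * 5,  # every other value missing
})

print(df.shape)              # (10, 2) -- looks fine at a glance
print(df["income"].count())  # only 5 non-null values out of 10
```

The shape alone says nothing is wrong; the non-null count is what exposes the gap.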
2. Use sample() when the top rows are misleading¶
head() is useful, but file order can mislead you. If the rows are sorted by time, label, or source, the first rows may not represent the dataset.
df.sample(n=5, random_state=42)
Use sample() when you want a more honest spot check. If the table is tiny, head() and tail() together are still helpful.
3. Use value_counts() for balance and rare categories¶
For categorical columns, value_counts() is usually more useful than describe().
df["group"].value_counts(dropna=False)
This tells you whether a category is dominant, whether a placeholder like "unknown" is common, and whether missing values are silently present.
Good habit: compare counts with proportions.
df["group"].value_counts(normalize=True, dropna=False)
4. Use nunique() to find identifiers and suspicious constants¶
nunique() helps you answer whether a column behaves like an identifier, a category, or a mostly-constant field.
df.nunique()
What to notice:
- very high uniqueness in a supposed category can mean the field is really an identifier
- very low uniqueness in a feature can mean it carries little signal
- a constant column is often useless unless it matters as a filter or metadata field
5. Use duplicated() before you trust row counts¶
Duplicate rows can inflate frequencies, distort averages, and leak information across train/test splits.
df.duplicated().sum()
df[df.duplicated(keep=False)]
If you care about a subset of columns, check duplicates on the key columns instead of the full row.
df.duplicated(subset=["student_id"]).sum()
6. Use describe(include="all") carefully¶
describe() gives a compact overview, but it does not replace detailed inspection.
df.describe(include="all")
Use it to look for:
- suspicious ranges
- unexpected min/max values
- highly skewed numeric distributions
- repeated values in categorical columns
Common mistake: treating describe() as if it explains the whole table. It does not show duplicates, key uniqueness, or row-level context.
7. Use sort_values() for boundary and outlier checks¶
When a value looks suspicious, sort it.
df.sort_values("score").head()
df.sort_values("score").tail()
This is especially useful for:
- outlier detection
- impossible values
- rows with extreme missingness patterns
- categories that appear only in the highest or lowest records
What To Look For¶
Ask these questions every time:
- Does the table have the columns I expected?
- Are the dtypes reasonable?
- Are missing values concentrated in a few columns?
- Is a label or category heavily imbalanced?
- Are there duplicate rows or duplicate keys?
- Are there placeholder strings such as
"N/A","unknown", or"-"? - Are there columns that look numeric but were read as text?
- Are there values outside the expected range?
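One of these checks, spotting numeric columns that arrived as text, can be scripted with `pd.to_numeric(..., errors="coerce")`. A sketch with an invented column:

```python
import pandas as pd

# A column that should be numeric but was read as strings, with one bad value.
s = pd.Series(["10", "12", "n/a", "9"], name="score_raw")

coerced = pd.to_numeric(s, errors="coerce")  # unparseable values become NaN
print(coerced.isna().sum())  # 1 -> exactly one value failed to parse
```

If the coerced column gains many new missing values, the field is not cleanly numeric and needs cleaning before any summary statistics.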
Common Mistakes¶
- Trusting `head()` alone. Top rows are often sorted and unrepresentative.
- Checking `isna()` without checking the denominator. A missing count is more useful with the total row count.
- Confusing empty strings with missing values. `""` is not `NaN`.
- Assuming `describe()` works the same way for every column type. Categorical summaries and numeric summaries answer different questions.
- Ignoring duplicate keys. A table can have unique rows and still have repeated identifiers.
- Looking at counts without proportions. A rare category can still matter a lot.
- Treating one inspection pass as enough. Good inspection is iterative.
Practical Tricks¶
- Use `dropna=False` in `value_counts()` when missing labels matter.
- Pair `value_counts()` with `normalize=True` to see imbalance quickly.
- Use `df[cols].isna().sum()` on a focused subset instead of the whole table when you already know the suspicious columns.
- Use `df.select_dtypes(include="number")` before numeric summaries if the table mixes text and numbers.
- Use `df.sort_values(by=...)` on both ends to inspect extremes, not just the first few rows.
- Use `df[df.duplicated(keep=False)]` when you need to inspect all repeated rows, not only count them.
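The `select_dtypes` trick is worth seeing once, since it silently excludes text and boolean columns. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],      # text, excluded
    "score": [1.0, 2.0, 3.0],     # numeric, kept
    "flag": [True, False, True],  # boolean, not selected by "number"
})

numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['score']
```

Summaries like `numeric.describe()` then stay purely numeric instead of mixing in category counts.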
Failure Checks¶
If a table looks “fine,” still verify:
assert df.shape[0] > 0
assert df.notna().any().any()
assert df.nunique().max() > 1
That is not a full validation suite, but it catches obvious failures:
- empty tables
- tables with all missing values
- tables with columns that unexpectedly collapsed to one value
If a feature should be an identifier, check that it is actually close to unique:
df["student_id"].nunique() == len(df)
If a label should be balanced enough to train on, inspect its distribution before fitting anything:
df["label"].value_counts(normalize=True)
Practice¶
Try answering these before writing any analysis code:
- Which columns look like identifiers?
- Which columns are likely categorical?
- Which columns have missing values that could matter?
- Which column would you inspect with `value_counts()` first?
- Which column would you sort to inspect extreme values?
- Which problem would duplicate rows cause in a modeling pipeline?
Runnable Example¶
Use this pattern when you want a quick inspection pass:
def inspect_table(df):
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())
    print(df.nunique())
    print(df.duplicated().sum())
    print(df.describe(include="all"))
Then refine it with column-specific checks:
def inspect_categories(df, column):
    counts = df[column].value_counts(dropna=False, normalize=True)
    print(counts)
Inspect the shape, missing-value table, and category balance before you write any feature code.
Quick Checks¶
- `info()` for structure
- `sample()` for realism
- `isna()` and `count()` for completeness
- `nunique()` for cardinality
- `value_counts()` for category balance
- `duplicated()` for repeated rows
- `describe()` for fast distribution context
- `sort_values()` for extremes
Longer Connection¶
Table inspection is not just a cleanup step. It is part of model quality. The same habits help with feature engineering, leakage detection, train/test comparison, and post-merge validation. If you inspect tables well, you catch errors earlier and spend less time debugging the wrong layer.
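Train/test comparison, for instance, reuses the same primitives. A minimal sketch with invented frames, checking that both splits share the same schema:

```python
import pandas as pd

# Hypothetical splits; the column names are made up for illustration.
train = pd.DataFrame({"age": [20, 30], "city": ["x", "y"]})
test = pd.DataFrame({"age": [25], "region": ["z"]})

# Columns present in one split but not the other are a common source of bugs.
only_in_train = set(train.columns) - set(test.columns)
only_in_test = set(test.columns) - set(train.columns)
print(sorted(only_in_train), sorted(only_in_test))  # ['city'] ['region']
```

A mismatch like this is usually caught late, inside a model pipeline; a two-line schema diff catches it at load time.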
The best modelers usually ask boring questions first. That is what makes their later work faster.