
Table Inspection

Table inspection is the first habit for understanding a dataset before you model, join, filter, or aggregate it. The goal is simple: answer a few basic questions quickly. What columns exist? Which ones have missing values? Which fields are numeric, categorical, dates, or identifiers? Are there duplicate rows? Are a few values dominating the table?

What This Is

In pandas, table inspection means reading the structure and the shape of a DataFrame before doing deeper analysis. The most useful inspection methods are the ones that help you detect schema problems, bad encodings, hidden duplicates, and misleading first impressions.

Think of it as a short checklist, not a single command.

When You Use It

Use inspection whenever you:

  • load a new table
  • receive a file from another team
  • compare train and test data
  • prepare features for a model
  • debug a suspicious result
  • check whether a join, filter, or pivot changed the data shape

Tooling

The most useful pandas inspection functions for this topic are:

  • DataFrame.info(): column names, non-null counts, dtypes, and memory use; first pass on schema and missingness.
  • DataFrame.head() / tail(): first or last rows; quick row-level sanity check.
  • DataFrame.sample(): random rows; use it when the top rows are not representative.
  • DataFrame.dtypes: per-column types; a compact type summary.
  • DataFrame.isna() / notna(): missing-value mask; missingness checks and filters.
  • DataFrame.count(): non-missing counts; completeness checks by column or group.
  • DataFrame.nunique(): distinct-value counts; identifier and cardinality checks.
  • Series.value_counts(): category frequencies; label balance, placeholders, rare values.
  • DataFrame.describe(): summary statistics; numeric spread and rough distribution checks.
  • DataFrame.duplicated(): repeated rows or keys; duplicate detection.
  • DataFrame.sort_values(): ordered extremes; outliers, suspicious records, boundary cases.

Inspection Ladder

Start broad, then get more specific:

  1. df.shape, df.info(), df.dtypes
  2. df.head() and df.sample(n=5, random_state=0)
  3. df.isna().sum() and df.count()
  4. df.nunique() and df[col].value_counts(dropna=False)
  5. df.duplicated().sum() or duplicate key checks
  6. df.describe(include="all")
  7. df.sort_values(...) to inspect extremes or suspicious rows

That order matters. If you start with summaries before checking types and missingness, you can miss the real problem.

Minimal Example

import pandas as pd

df = pd.DataFrame(
    {
        "student_id": [101, 102, 103, 103],
        "grade": [8, 9, None, 9],
        "group": ["A", "A", "B", "A"],
        "score": [84, 91, 77, 91],
    }
)

df.info()
df.head()
df.sample(n=2, random_state=0)
df.isna().sum()
df.nunique()
df["group"].value_counts(dropna=False)
df.describe(include="all")
df.duplicated().sum()

This small set already answers a lot:

  • Is grade missing anywhere?
  • Is student_id really unique?
  • Is group balanced or dominated by one value?
  • Are there duplicate rows?
  • Is score in a plausible range?

Worked Patterns

1. Use info() to check schema and completeness

info() is the fastest way to spot type and missing-value problems together.

df.info()

Look for:

  • columns with far fewer non-null values than expected
  • object columns that should be numeric or datetime
  • columns that were read as strings because of mixed values

Common mistake: reading only the shape and ignoring the non-null counts. A table can look fine at a glance and still have half of one feature missing.
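If info() reports an object dtype where you expected numbers, one quick follow-up is to coerce the column and count what fails to parse. A minimal sketch, using a made-up column for illustration:

```python
import pandas as pd

# Hypothetical column that should be numeric but was read as text.
raw = pd.Series(["84", "91", "n/a", "77"], name="score")

# errors="coerce" turns unparseable entries into NaN, so values that are
# missing after coercion but not before point at the bad raw entries.
numeric = pd.to_numeric(raw, errors="coerce")
bad = numeric.isna() & raw.notna()

print(bad.sum())          # how many values failed to parse
print(raw[bad].tolist())  # the offending raw values
```

This turns a vague "why is this column object?" into a concrete list of rows to investigate.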

2. Use sample() when the top rows are misleading

head() is useful, but file order can mislead you. If the rows are sorted by time, label, or source, the first rows may not represent the dataset.

df.sample(n=5, random_state=42)

Use sample() when you want a more honest spot check. If the table is tiny, head() and tail() together are still helpful.

3. Use value_counts() for balance and rare categories

For categorical columns, value_counts() is usually more useful than describe().

df["group"].value_counts(dropna=False)

This tells you whether a category is dominant, whether a placeholder like "unknown" is common, and whether missing values are silently present.

Good habit: compare counts with proportions.

df["group"].value_counts(normalize=True, dropna=False)
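Placeholders deserve the same attention as NaN, because value_counts() shows them as ordinary categories. A small sketch; the placeholder set here is an assumption, so adjust it to your dataset:

```python
import pandas as pd

# Hypothetical label column mixing real values, placeholders, and NaN.
labels = pd.Series(["A", "unknown", "B", "N/A", None, "A"])

# Strings you suspect are stand-ins for "missing" (an assumption here).
placeholders = {"unknown", "N/A", "-"}

print(labels.value_counts(dropna=False))  # placeholders appear as normal categories
print(labels.isin(placeholders).sum())    # placeholder entries
print(labels.isna().sum())                # genuinely missing entries
```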

4. Use nunique() to find identifiers and suspicious constants

nunique() helps you answer whether a column behaves like an identifier, a category, or a mostly-constant field.

df.nunique()

What to notice:

  • very high uniqueness in a supposed category can mean the field is really an identifier
  • very low uniqueness in a feature can mean it carries little signal
  • a constant column is often useless unless it matters as a filter or metadata field
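The bullets above can be turned into a small screening pass. A sketch, assuming exact uniqueness as the identifier test:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "student_id": [101, 102, 103, 104],      # unique per row: identifier-like
        "group": ["A", "A", "B", "A"],           # low cardinality: a category
        "source": ["csv", "csv", "csv", "csv"],  # constant: little signal
    }
)

n = df.nunique(dropna=False)
id_like = n[n == len(df)].index.tolist()  # candidates for identifiers
constant = n[n <= 1].index.tolist()       # candidates for useless columns

print(id_like)   # ['student_id']
print(constant)  # ['source']
```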

5. Use duplicated() before you trust row counts

Duplicate rows can inflate frequencies, distort averages, and leak information across train/test splits.

df.duplicated().sum()
df[df.duplicated(keep=False)]

If you care about a subset of columns, check duplicates on the key columns instead of the full row.

df.duplicated(subset=["student_id"]).sum()
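Once duplicate keys are confirmed, a common follow-up is drop_duplicates() on the same subset. Which record to keep is a judgment call about the data source, not a pandas default to trust blindly; the keep="last" below is only an example:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "student_id": [101, 102, 103, 103],
        "score": [84, 91, 77, 91],
    }
)

# keep=False marks every row that shares a key, so you can inspect all of them.
dup_mask = df.duplicated(subset=["student_id"], keep=False)
print(df[dup_mask])

# One possible resolution: keep the last record per key.
deduped = df.drop_duplicates(subset=["student_id"], keep="last")
print(len(deduped))  # one row per student_id remains
```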

6. Use describe(include="all") carefully

describe() gives a compact overview, but it does not replace detailed inspection.

df.describe(include="all")

Use it to look for:

  • suspicious ranges
  • unexpected min/max values
  • highly skewed numeric distributions
  • repeated values in categorical columns

Common mistake: treating describe() as if it explains the whole table. It does not show duplicates, key uniqueness, or row-level context.

7. Use sort_values() for boundary and outlier checks

When a value looks suspicious, sort it.

df.sort_values("score").head()
df.sort_values("score").tail()

This is especially useful for:

  • outlier detection
  • impossible values
  • rows with extreme missingness patterns
  • categories that appear only in the highest or lowest records
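For a numeric column, nsmallest() and nlargest() cover both ends without sorting the whole frame twice. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"score": [84, 91, 77, 91, 12, 99]})

# Lowest values: candidates for impossible or placeholder numbers.
print(df.nsmallest(2, "score"))

# Highest values: candidates for outliers.
print(df.nlargest(2, "score"))
```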

What To Look For

Ask these questions every time:

  • Does the table have the columns I expected?
  • Are the dtypes reasonable?
  • Are missing values concentrated in a few columns?
  • Is a label or category heavily imbalanced?
  • Are there duplicate rows or duplicate keys?
  • Are there placeholder strings such as "N/A", "unknown", or "-"?
  • Are there columns that look numeric but were read as text?
  • Are there values outside the expected range?

Common Mistakes

  • Trusting head() alone. Top rows are often sorted and unrepresentative.
  • Checking isna() without the denominator. A raw missing count only makes sense relative to the total row count.
  • Confusing empty strings with missing values. "" is not NaN.
  • Assuming describe() works the same way for every column type. Categorical summaries and numeric summaries answer different questions.
  • Ignoring duplicate keys. A table can have unique rows and still have repeated identifiers.
  • Looking at counts without proportions. A rare category can still matter a lot.
  • Treating one inspection pass as enough. Good inspection is iterative.
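The empty-string pitfall above is easy to demonstrate. A minimal sketch; whether "" should count as missing is a dataset-specific decision:

```python
import numpy as np
import pandas as pd

# "" and NaN look alike in a printout but behave differently.
s = pd.Series(["A", "", None, "B", ""])

print(s.isna().sum())   # only None/NaN counts as missing
print((s == "").sum())  # empty strings are ordinary values

# One possible normalization: treat empty strings as missing too.
normalized = s.replace("", np.nan)
print(normalized.isna().sum())
```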

Practical Tricks

  • Use dropna=False in value_counts() when missing labels matter.
  • Pair value_counts() with normalize=True to see imbalance quickly.
  • Use df[cols].isna().sum() on a focused subset instead of the whole table when you already know the suspicious columns.
  • Use df.select_dtypes(include="number") before numeric summaries if the table mixes text and numbers.
  • Use df.sort_values(by=...) on both ends to inspect extremes, not just the first few rows.
  • Use df[df.duplicated(keep=False)] when you need to inspect all repeated rows, not only count them.
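Two of the tricks above combine naturally: missingness as proportions on a focused subset, and a numeric-only summary. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "grade": [8, 9, None, 9],
        "score": [84, 91, 77, 91],
        "group": ["A", "A", "B", None],
    }
)

# Proportions carry the denominator with them, unlike raw counts.
suspect = ["grade", "group"]
print(df[suspect].isna().mean())

# Numeric summary without text columns getting in the way.
print(df.select_dtypes(include="number").describe())
```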

Failure Checks

If a table looks “fine,” still verify:

assert df.shape[0] > 0
assert df.notna().any().any()
assert df.nunique().max() > 1

That is not a full validation suite, but it catches obvious failures:

  • empty tables
  • tables with all missing values
  • tables with columns that unexpectedly collapsed to one value

If a feature should be an identifier, check that it is actually close to unique:

assert df["student_id"].nunique() == len(df)

If a label should be balanced enough to train on, inspect its distribution before fitting anything:

df["label"].value_counts(normalize=True)

Practice

Try answering these before writing any analysis code:

  1. Which columns look like identifiers?
  2. Which columns are likely categorical?
  3. Which columns have missing values that could matter?
  4. Which column would you inspect with value_counts() first?
  5. Which column would you sort to inspect extreme values?
  6. Which problem would duplicate rows cause in a modeling pipeline?

Runnable Example

Use this pattern when you want a quick inspection pass:

def inspect_table(df):
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())
    print(df.nunique())
    print(df.duplicated().sum())
    print(df.describe(include="all"))

Then refine it with column-specific checks:

def inspect_categories(df, column):
    counts = df[column].value_counts(dropna=False, normalize=True)
    print(counts)


Inspect the shape, missing-value table, and category balance before you write any feature code.

Quick Checks

  • info() for structure
  • sample() for realism
  • isna() and count() for completeness
  • nunique() for cardinality
  • value_counts() for category balance
  • duplicated() for repeated rows
  • describe() for fast distribution context
  • sort_values() for extremes

Longer Connection

Table inspection is not just a cleanup step. It is part of model quality. The same habits help with feature engineering, leakage detection, train/test comparison, and post-merge validation. If you inspect tables well, you catch errors earlier and spend less time debugging the wrong layer.

The best modelers usually ask boring questions first. That is what makes their later work faster.