Table Inspection¶
Table inspection is the first habit for understanding a dataset before you model, join, filter, or aggregate it. The goal is simple: answer a few basic questions quickly. What columns exist? Which ones have missing values? Which fields are numeric, categorical, dates, or identifiers? Are there duplicate rows? Are a few values dominating the table?
What This Is¶
In pandas, table inspection means reading the structure and the shape of a DataFrame before doing deeper analysis. The most useful inspection methods are the ones that help you detect schema problems, bad encodings, hidden duplicates, and misleading first impressions.
Think of it as a short checklist, not a single command.
When You Use It¶
Use inspection whenever you:
- load a new table
- receive a file from another team
- compare train and test data
- prepare features for a model
- debug a suspicious result
- check whether a join, filter, or pivot changed the data shape
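The last point is easy to automate. As a minimal sketch (the table and column names here are invented for illustration), compare row counts around a merge:

```python
import pandas as pd

# Hypothetical tables; the names are made up for illustration.
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer": ["a", "b", "b"]})
customers = pd.DataFrame({"customer": ["a", "b", "b"], "region": ["N", "S", "S"]})

before = orders.shape[0]
merged = orders.merge(customers, on="customer", how="left")
after = merged.shape[0]

# A left join should not add rows unless the right-hand key is duplicated.
print(before, after)  # row growth here signals duplicate keys in `customers`
```

Here the row count grows from 3 to 5 because `customer` is duplicated on the right side, exactly the kind of shape change inspection is meant to catch.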
Tooling¶
The most useful pandas inspection functions for this topic are:
| Function | What it tells you | When to use it |
|---|---|---|
| `DataFrame.info()` | column names, non-null counts, dtypes, memory use | first pass on schema and missingness |
| `DataFrame.head()` / `tail()` | first or last rows | quick row-level sanity check |
| `DataFrame.sample()` | random rows | when the top rows are not representative |
| `DataFrame.dtypes` | per-column types | when you need a compact type summary |
| `DataFrame.isna()` / `notna()` | missing-value mask | missingness checks and filters |
| `DataFrame.count()` | non-missing counts | completeness checks by column or group |
| `DataFrame.nunique()` | distinct-value counts | identifier checks, cardinality checks |
| `Series.value_counts()` | category frequencies | label balance, placeholders, rare values |
| `DataFrame.describe()` | summary statistics | numeric spread and rough distribution checks |
| `DataFrame.duplicated()` | repeated rows or keys | duplicate detection |
| `DataFrame.sort_values()` | ordered extremes | outliers, suspicious records, boundary cases |
Official pandas references:
`DataFrame.info`, `DataFrame.head`, `DataFrame.tail`, `DataFrame.sample`, `DataFrame.describe`, `Series.value_counts`, `DataFrame.duplicated`, `DataFrame.isna`, `DataFrame.nunique`, `DataFrame.count`, `DataFrame.sort_values`
Inspection Ladder¶
Start broad, then get more specific:
1. `df.shape`, `df.info()`, `df.dtypes`
2. `df.head()` and `df.sample(n=5, random_state=0)`
3. `df.isna().sum()` and `df.count()`
4. `df.nunique()` and `df[col].value_counts(dropna=False)`
5. `df.duplicated().sum()` or duplicate key checks
6. `df.describe(include="all")`
7. `df.sort_values(...)` to inspect extremes or suspicious rows
That order matters. If you start with summaries before checking types and missingness, you can miss the real problem.
Minimal Example¶
import pandas as pd
df = pd.DataFrame(
    {
        "student_id": [101, 102, 103, 103],
        "grade": [8, 9, None, 9],
        "group": ["A", "A", "B", "A"],
        "score": [84, 91, 77, 91],
    }
)
df.info()
df.head()
df.sample(n=2, random_state=0)
df.isna().sum()
df.nunique()
df["group"].value_counts(dropna=False)
df.describe(include="all")
df.duplicated().sum()
This small set already answers a lot:
- Is `grade` missing anywhere?
- Is `student_id` really unique?
- Is `group` balanced or dominated by one value?
- Are there duplicate rows?
- Is `score` in a plausible range?
Worked Patterns¶
1. Use info() to check schema and completeness¶
info() is the fastest way to spot type and missing-value problems together.
df.info()
Look for:
- columns with far fewer non-null values than expected
- object columns that should be numeric or datetime
- columns that were read as strings because of mixed values
Common mistake: reading only the shape and ignoring the non-null counts. A table can look fine at a glance and still have half of one feature missing.
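A quick way to see why the non-null counts matter, using a toy table (the column names are invented):

```python
import pandas as pd
import numpy as np

# A table whose shape looks healthy but whose "income" column is half missing.
df = pd.DataFrame({
    "age": list(range(10)),
    "income": [50_000, np.nan] * 5,  # every other value missing
})

print(df.shape)              # (10, 2) -- looks fine at a glance
print(df["income"].count())  # only 5 non-null values out of 10
```

The shape alone says nothing is wrong; the non-null count is what exposes the gap.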
2. Use sample() when the top rows are misleading¶
head() is useful, but file order can mislead you. If the rows are sorted by time, label, or source, the first rows may not represent the dataset.
df.sample(n=5, random_state=42)
Use sample() when you want a more honest spot check. If the table is tiny, head() and tail() together are still helpful.
3. Use value_counts() for balance and rare categories¶
For categorical columns, value_counts() is usually more useful than describe().
df["group"].value_counts(dropna=False)
This tells you whether a category is dominant, whether a placeholder like "unknown" is common, and whether missing values are silently present.
Good habit: compare counts with proportions.
df["group"].value_counts(normalize=True, dropna=False)
4. Use nunique() to find identifiers and suspicious constants¶
nunique() helps you answer whether a column behaves like an identifier, a category, or a mostly-constant field.
df.nunique()
What to notice:
- very high uniqueness in a supposed category can mean the field is really an identifier
- very low uniqueness in a feature can mean it carries little signal
- a constant column is often useless unless it matters as a filter or metadata field
5. Use duplicated() before you trust row counts¶
Duplicate rows can inflate frequencies, distort averages, and leak information across train/test splits.
df.duplicated().sum()
df[df.duplicated(keep=False)]
If you care about a subset of columns, check duplicates on the key columns instead of the full row.
df.duplicated(subset=["student_id"]).sum()
6. Use describe(include="all") carefully¶
describe() gives a compact overview, but it does not replace detailed inspection.
df.describe(include="all")
Use it to look for:
- suspicious ranges
- unexpected min/max values
- highly skewed numeric distributions
- repeated values in categorical columns
Common mistake: treating describe() as if it explains the whole table. It does not show duplicates, key uniqueness, or row-level context.
7. Use sort_values() for boundary and outlier checks¶
When a value looks suspicious, sort it.
df.sort_values("score").head()
df.sort_values("score").tail()
This is especially useful for:
- outlier detection
- impossible values
- rows with extreme missingness patterns
- categories that appear only in the highest or lowest records
What To Look For¶
Ask these questions every time:
- Does the table have the columns I expected?
- Are the dtypes reasonable?
- Are missing values concentrated in a few columns?
- Is a label or category heavily imbalanced?
- Are there duplicate rows or duplicate keys?
- Are there placeholder strings such as
"N/A","unknown", or"-"? - Are there columns that look numeric but were read as text?
- Are there values outside the expected range?
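One of these checks, spotting numeric columns that arrived as text, can be scripted with `pd.to_numeric(..., errors="coerce")`. A sketch with an invented column:

```python
import pandas as pd

# A column that should be numeric but was read as strings, with one bad value.
s = pd.Series(["10", "12", "n/a", "9"], name="score_raw")

coerced = pd.to_numeric(s, errors="coerce")  # unparseable values become NaN
print(coerced.isna().sum())  # 1 -> exactly one value failed to parse
```

If the coerced column gains many new missing values, the field is not cleanly numeric and needs cleaning before any summary statistics.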
Common Mistakes¶
- Trusting `head()` alone. Top rows are often sorted and unrepresentative.
- Checking `isna()` without checking the denominator. A missing count is more useful with the total row count.
- Confusing empty strings with missing values. `""` is not `NaN`.
- Assuming `describe()` works the same way for every column type. Categorical summaries and numeric summaries answer different questions.
- Ignoring duplicate keys. A table can have unique rows and still have repeated identifiers.
- Looking at counts without proportions. A rare category can still matter a lot.
- Treating one inspection pass as enough. Good inspection is iterative.
Practical Tricks¶
- Use `dropna=False` in `value_counts()` when missing labels matter.
- Pair `value_counts()` with `normalize=True` to see imbalance quickly.
- Use `df[cols].isna().sum()` on a focused subset instead of the whole table when you already know the suspicious columns.
- Use `df.select_dtypes(include="number")` before numeric summaries if the table mixes text and numbers.
- Use `df.sort_values(by=...)` on both ends to inspect extremes, not just the first few rows.
- Use `df[df.duplicated(keep=False)]` when you need to inspect all repeated rows, not only count them.
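The `select_dtypes` trick is worth seeing once, since it silently excludes text and boolean columns. A sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b", "c"],      # text, excluded
    "score": [1.0, 2.0, 3.0],     # numeric, kept
    "flag": [True, False, True],  # boolean, not selected by "number"
})

numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())  # ['score']
```

Summaries like `numeric.describe()` then stay purely numeric instead of mixing in category counts.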
Failure Checks¶
If a table looks “fine,” still verify:
assert df.shape[0] > 0
assert df.notna().any().any()
assert df.nunique().max() > 1
That is not a full validation suite, but it catches obvious failures:
- empty tables
- tables with all missing values
- tables with columns that unexpectedly collapsed to one value
If a feature should be an identifier, check that it is actually close to unique:
df["student_id"].nunique() == len(df)
If a label should be balanced enough to train on, inspect its distribution before fitting anything:
df["label"].value_counts(normalize=True)
Practice¶
Try answering these before writing any analysis code:
- Which columns look like identifiers?
- Which columns are likely categorical?
- Which columns have missing values that could matter?
- Which column would you inspect with `value_counts()` first?
- Which column would you sort to inspect extreme values?
- Which problem would duplicate rows cause in a modeling pipeline?
Runnable Example¶
Use this pattern when you want a quick inspection pass:
def inspect_table(df):
    print(df.shape)
    print(df.dtypes)
    print(df.isna().sum())
    print(df.nunique())
    print(df.duplicated().sum())
    print(df.describe(include="all"))
Then refine it with column-specific checks:
def inspect_categories(df, column):
    counts = df[column].value_counts(dropna=False, normalize=True)
    print(counts)
Inspect the shape, missing-value table, and category balance before you write any feature code.
Quick Checks¶
- `info()` for structure
- `sample()` for realism
- `isna()` and `count()` for completeness
- `nunique()` for cardinality
- `value_counts()` for category balance
- `duplicated()` for repeated rows
- `describe()` for fast distribution context
- `sort_values()` for extremes
Longer Connection¶
Table inspection is not just a cleanup step. It is part of model quality. The same habits help with feature engineering, leakage detection, train/test comparison, and post-merge validation. If you inspect tables well, you catch errors earlier and spend less time debugging the wrong layer.
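Train/test comparison, for instance, reuses the same primitives. A minimal sketch with invented frames, checking that both splits share the same schema:

```python
import pandas as pd

# Hypothetical splits; the column names are made up for illustration.
train = pd.DataFrame({"age": [20, 30], "city": ["x", "y"]})
test = pd.DataFrame({"age": [25], "region": ["z"]})

# Columns present in one split but not the other are a common source of bugs.
only_in_train = set(train.columns) - set(test.columns)
only_in_test = set(test.columns) - set(train.columns)
print(sorted(only_in_train), sorted(only_in_test))  # ['city'] ['region']
```

A mismatch like this is usually caught late, inside a model pipeline; a two-line schema diff catches it at load time.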
The best modelers usually ask boring questions first. That is what makes their later work faster.