
Experiments and Ablations

What This Is

Experiments and ablations are about making smaller, more credible claims. The point is to isolate what changed, measure the effect honestly, and decide whether the change is worth keeping.

If you cannot explain the difference between the base run and the changed run, you do not really have an experiment yet. You have two scores.

A strong experiment answers four questions:

  • what changed
  • what metric moved
  • how much it moved
  • whether the change survives a second seed, a harder slice, or a holdout check

When You Use It

  • comparing two workflow choices
  • testing whether a feature or module helps
  • checking whether a result is robust across runs
  • deciding whether to keep a new baseline or roll it back
  • checking whether a gain is real or just seed noise
  • separating model improvement from data or split luck

Tooling

  • pd.DataFrame tables for runs, settings, and scores
  • groupby(...).agg(...) for seed summaries
  • pivot_table(...) for comparing settings by slice or metric
  • np.random.default_rng(...) for explicit seed control
  • np.mean, np.std, np.percentile, and np.quantile for spread
  • cross_validate(...) with return_train_score=True
  • validation_curve(...) and learning_curve(...) for model diagnostics
  • permutation_test_score(...) for a quick significance check
  • plt.errorbar(...) and plt.fill_between(...) for compact score plots
  • one change at a time

Minimal Example

import numpy as np

rng = np.random.default_rng(7)
run_scores = rng.normal(loc=0.81, scale=0.01, size=5)
mean_score = run_scores.mean()
spread = run_scores.std(ddof=1)

Worked Pattern

import pandas as pd

runs = pd.DataFrame(
    {
        "setting": ["base", "base", "with_feature_x", "with_feature_x"],
        "seed": [0, 1, 0, 1],
        "validation_score": [0.82, 0.81, 0.84, 0.83],
        "holdout_score": [0.80, 0.79, 0.82, 0.81],
    }
)

summary = (
    runs.groupby("setting")
    .agg(
        mean_validation=("validation_score", "mean"),
        std_validation=("validation_score", "std"),
        sem_validation=("validation_score", "sem"),
        mean_holdout=("holdout_score", "mean"),
        n=("seed", "size"),
    )
    .reset_index()
    .sort_values("mean_validation", ascending=False)
)

What to look for in an ablation table:

  • whether the changed row improves the right metric
  • whether the gain is large enough to matter in absolute terms
  • whether the variance across runs is smaller than the gain
  • whether the simpler baseline remains competitive
  • whether validation and holdout move in the same direction
  • whether the score gap is bigger than the standard error
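The last check can be read straight off the summary table. A minimal sketch, assuming the same two-seed runs table as in the worked pattern (rebuilt here so the snippet stands alone):

```python
import pandas as pd

# Two seeds per setting, as in the worked pattern above
runs = pd.DataFrame({
    "setting": ["base", "base", "with_feature_x", "with_feature_x"],
    "seed": [0, 1, 0, 1],
    "validation_score": [0.82, 0.81, 0.84, 0.83],
})

stats = runs.groupby("setting")["validation_score"].agg(["mean", "sem"])
gap = stats.loc["with_feature_x", "mean"] - stats.loc["base", "mean"]
noise = stats.loc["with_feature_x", "sem"] + stats.loc["base", "sem"]
gain_is_clear = gap > noise  # crude: the gap must beat the summed standard errors
```

Summing the two standard errors is deliberately conservative; a paired comparison on matched seeds is the cleaner test.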

Useful trick:

  • use sem when you want a quick uncertainty read on repeated seeds
  • use std when you want to know the raw run-to-run spread
  • use quantile when the run distribution is skewed or has an outlier
  • keep the sample count beside every score so the table does not overstate precision
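Those three reads can be computed side by side. The run vector below is synthetic, and sem is derived by hand rather than via pandas:

```python
import numpy as np

rng = np.random.default_rng(7)
run_scores = rng.normal(loc=0.81, scale=0.01, size=5)

n = run_scores.size
std = run_scores.std(ddof=1)                       # raw run-to-run spread
sem = std / np.sqrt(n)                             # quick uncertainty on the mean
q10, q90 = np.quantile(run_scores, [0.10, 0.90])   # robust read if skewed
```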

Paired Deltas

The cleanest ablation comparison is paired: same split, same seed, same evaluation budget.

paired = runs.pivot(index="seed", columns="setting", values="validation_score")
paired["delta_vs_base"] = paired["with_feature_x"] - paired["base"]

That keyed reshape is safer than depending on row order. It still works if you add more settings or if the table is sorted differently later.
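Two follow-up reads on those deltas are usually enough: the mean delta and whether every seed agrees on the sign. The runs table is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

runs = pd.DataFrame({
    "setting": ["base", "base", "with_feature_x", "with_feature_x"],
    "seed": [0, 1, 0, 1],
    "validation_score": [0.82, 0.81, 0.84, 0.83],
})

paired = runs.pivot(index="seed", columns="setting", values="validation_score")
paired["delta_vs_base"] = paired["with_feature_x"] - paired["base"]

mean_delta = paired["delta_vs_base"].mean()
sign_agrees = (paired["delta_vs_base"] > 0).all()  # every seed moved the same way
```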

Why this matters:

  • the paired delta removes some of the noise from split or seed variation
  • a gain that looks real in separate means can disappear once the pairing is respected
  • a negative delta on the hardest seeds is often more informative than a tiny positive average

At minimum, keep these columns in the ablation log:

  • setting
  • seed
  • what_changed
  • validation_score
  • holdout_score
  • decision

Honest Comparison

A good comparison usually has the same split, the same metric, and the same evaluation budget.

If the split changes between rows, you are not comparing models anymore. You are comparing different tests.

If the only number you keep is the best one, the table stops being an experiment record and becomes a highlight reel.

2x2 Interaction Checks

When two changes might interact, do not jump directly from base to both-changes-on. Run the four cells:

  • base
  • change A only
  • change B only
  • change A plus change B

That small grid tells you whether:

  • each change helps alone
  • one change only helps when the other is present
  • the bundle is stronger or weaker than the sum of its parts

If you skip the middle cells, you cannot tell whether the gain belongs to one component or to the interaction.
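A minimal sketch of that grid with an interaction term; the four scores are illustrative, not from a real run:

```python
import pandas as pd

# The four cells of the 2x2 grid (illustrative scores)
grid = pd.DataFrame({
    "a_on": [False, True, False, True],
    "b_on": [False, False, True, True],
    "score": [0.80, 0.82, 0.81, 0.86],
})

cell = grid.set_index(["a_on", "b_on"])["score"]
effect_a = cell.loc[(True, False)] - cell.loc[(False, False)]  # A alone
effect_b = cell.loc[(False, True)] - cell.loc[(False, False)]  # B alone
both = cell.loc[(True, True)] - cell.loc[(False, False)]
interaction = both - (effect_a + effect_b)  # gain beyond the sum of parts
```

A clearly positive interaction says the bundle is stronger than the sum of its parts; near zero says the effects are roughly additive.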

Plot It

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(
    summary["setting"],
    summary["mean_validation"],
    yerr=summary["sem_validation"],
    fmt="o",
    capsize=4,
)
ax.set_ylabel("validation score")
ax.set_title("Ablation with uncertainty")

When you want to show a range instead of a point estimate, fill_between is a better choice than a bare line:

ax.fill_between(x, lower, upper, alpha=0.2)

That helps students see that a small gain may sit inside the noise band.
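A complete, hedged version of that fragment: x, lower, and upper are the assumed names from the one-liner above, filled with synthetic numbers, and the Agg backend keeps it runnable without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = np.arange(5)                    # e.g. five checkpoints or settings
mean = 0.80 + 0.01 * x              # illustrative mean scores
band = np.full_like(mean, 0.008)    # illustrative uncertainty
lower, upper = mean - band, mean + band

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, mean, marker="o")
ax.fill_between(x, lower, upper, alpha=0.2)  # the noise band around the mean
ax.set_ylabel("validation score")
```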

Fast Evaluation Routine

When time is short, use a very small experiment routine:

  1. keep one base run
  2. change one thing only
  3. write both scores into one table
  4. repeat only if the change is large enough to matter

This is what keeps leaderboard work from turning into noise collection.
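The four steps can be sketched as a tiny logging helper; log_run and the column names are illustrative, not a fixed API:

```python
import pandas as pd

def log_run(rows, setting, seed, score):
    """Append one run so base and changed scores end up in one table."""
    rows.append({"setting": setting, "seed": seed, "validation_score": score})

rows = []
log_run(rows, "base", 0, 0.82)            # step 1: keep one base run
log_run(rows, "with_feature_x", 0, 0.84)  # step 2: change one thing only
table = pd.DataFrame(rows)                # step 3: both scores in one table

scores = table.set_index("setting")["validation_score"]
gain = scores["with_feature_x"] - scores["base"]  # step 4: repeat only if this matters
```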

Trick:

  • if a change helps one metric but hurts the metric you actually care about, record that explicitly instead of rounding it away
  • if a gain disappears under a second seed, treat it as weak evidence, not a win
  • if a simpler model matches the score within uncertainty, prefer the simpler model unless you have a strong reason not to
  • if the validation gain is real but the holdout gap grows, the change may be overfitting the public signal

Failure Pattern

Claiming a feature helped because one run looked better. Without repeated runs or a clean comparison, the claim is weaker than it looks.

Another failure is adding two changes at once. Then the result is a conclusion about a bundle, not a reasoned decision about a component.

Another common failure is to reuse the same validation set for too many decisions. Once the validation set becomes the scoreboard for every idea, its value drops and the leaderboard effect gets worse.

What Counts As Stronger Evidence

Stronger evidence usually means at least one of these:

  • the gain survives across repeated seeds
  • the gain survives on a harder slice
  • the gain still looks good when compared to a simpler baseline
  • the gain survives a public-versus-private comparison
  • the gain also shows up in cross_validate(...) or validation_curve(...)
  • the improvement is larger than the seed spread and the standard error
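The cross_validate check on the list above looks like this; the dataset and model are stand-ins, so treat the numbers as illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data and a simple model
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = cross_validate(LogisticRegression(max_iter=1000), X, y,
                    cv=5, return_train_score=True)

train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()
fold_spread = cv["test_score"].std(ddof=1)  # compare the claimed gain to this
```

If the gain you want to promote is smaller than fold_spread, the list above says it is not yet stronger evidence.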

Promote, Defer, Reject

Use a simple post-ablation rule:

  • promote when the gain is larger than the paired-noise level, survives the holdout or hard slice, and still looks worth the added complexity
  • defer when the gain exists but is too small to justify a confident claim yet
  • reject when the gain disappears under a second seed, hurts the hard slice, or only improves a secondary metric
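That rule is small enough to write down directly; the argument names and threshold logic here are illustrative:

```python
def decide(gain, paired_noise, survives_holdout, hurts_hard_slice):
    """Map one ablation result to promote / defer / reject (illustrative rule)."""
    if gain <= 0 or hurts_hard_slice:
        return "reject"
    if gain > paired_noise and survives_holdout:
        return "promote"
    return "defer"  # real but too small or not yet confirmed

decision = decide(gain=0.02, paired_noise=0.01,
                  survives_holdout=True, hurts_hard_slice=False)
```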

This keeps the ablation table connected to actual engineering choices instead of turning it into a score museum.

That last point is why this topic connects directly to the public/private leaderboard track.

Library Notes

  • pd.DataFrame is the easiest way to keep runs, seeds, and scores visible together.
  • DataFrame.groupby(...).agg(...) lets you summarize a repeated experiment without hand-written loops.
  • DataFrame.pivot_table(...) is useful when you want a setting-by-slice comparison table.
  • np.random.default_rng(...) is the cleanest way to control randomness for repeatable comparisons.
  • cross_validate(...) gives both test scores and training scores, which helps reveal overfitting.
  • validation_curve(...) is useful when you want to vary one hyperparameter and see where performance peaks.
  • learning_curve(...) is useful when you want to know whether more data or less model complexity would help.
  • permutation_test_score(...) is useful when you want a quick sanity check that the observed score is not random luck.
  • GridSearchCV is helpful when the search space is small, but it is not a substitute for an ablation table.

If you want a compact experiment summary, this pattern is often enough:

table = runs.groupby("setting").agg(
    mean_score=("validation_score", "mean"),
    std_score=("validation_score", "std"),
    n=("validation_score", "size"),
)
table["ci_like"] = 1.96 * table["std_score"] / np.sqrt(table["n"])

That is not a formal confidence interval for every situation, but it is a useful competitive check for whether a gain is obviously bigger than the noise.

Diagnosing A Result

Use the diagnostic that matches the question:

  • cross_validate(...) when you want stable fold-level scores and train/test comparison
  • validation_curve(...) when you want to know whether a hyperparameter is too small or too large
  • learning_curve(...) when you want to know whether more data would help
  • permutation_test_score(...) when you want to know whether the score is clearly above random chance
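The permutation check is the least familiar of the four, so here is a minimal sketch on synthetic data; 30 permutations is deliberately small to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    n_permutations=30, random_state=0,
)
# score: fit on real labels; perm_scores: fits on shuffled labels;
# a small p_value means the real score is clearly above random chance
```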

If the training score is high and the validation score is low, the experiment is telling you to simplify, regularize, or improve the split.

If both scores are low, the experiment is telling you the feature view or model family is not yet strong enough.

If both scores are high and the holdout drops, the experiment is telling you the validation setup was too easy or too narrow.
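Those three patterns amount to a rough lookup; the thresholds below are illustrative, not universal:

```python
def diagnose(train_score, val_score, holdout_score, good=0.80, gap=0.05):
    """Rough regime check for the three patterns above (illustrative thresholds)."""
    if train_score - val_score > gap:
        return "overfit: simplify, regularize, or improve the split"
    if train_score < good and val_score < good:
        return "underfit: strengthen the feature view or model family"
    if val_score - holdout_score > gap:
        return "validation too easy or too narrow"
    return "consistent"
```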

Practice

  1. Write one ablation table with a base setting and one removed component.
  2. Repeat a run with three random seeds and summarize the spread.
  3. Explain what you can and cannot claim from that result.
  4. Identify the single change that deserves another run.
  5. Explain which result would make you stop iterating.
  6. Name one time when a negative ablation is actually useful.
  7. Build a slice-by-setting table and identify where the gain disappears.
  8. Plot a mean score with error bars and decide whether the gain is meaningful.
  9. Use cross_validate to compare training and test behavior for the same model.
  10. Use validation_curve to show whether a hyperparameter is under-tuned or over-tuned.

Runnable Example

Open the experiments-and-ablations example in AI Academy and run it from the platform.

Inspect how much the score moves when one component is removed and when the seed changes.

Questions To Ask

  1. Did the same change help both validation and holdout?
  2. Is the gain bigger than the run-to-run noise?
  3. What is the simplest explanation for the change?
  4. Is the change broad or only helping a single slice?
  5. If you removed the change tomorrow, would the model still be acceptable?
  6. Is the score stable enough to justify more engineering time?
  7. Did the change help the hard cases or only the easy ones?
  8. Would a smaller model get almost the same result with less complexity?

Longer Connection