Experiments and Ablations¶
What This Is¶
Experiments and ablations are about making smaller, more credible claims. The point is to isolate what changed, measure the effect honestly, and decide whether the change is worth keeping.
If you cannot explain the difference between the base run and the changed run, you do not really have an experiment yet. You have two scores.
A strong experiment answers four questions:
- what changed
- what metric moved
- how much it moved
- whether the change survives a second seed, a harder slice, or a holdout check
When You Use It¶
- comparing two workflow choices
- testing whether a feature or module helps
- checking whether a result is robust across runs
- deciding whether to keep a new baseline or roll it back
- checking whether a gain is real or just seed noise
- separating model improvement from data or split luck
Tooling¶
- `pd.DataFrame` tables for runs, settings, and scores
- `groupby(...).agg(...)` for seed summaries
- `pivot_table(...)` for comparing settings by slice or metric
- `np.random.default_rng(...)` for explicit seed control
- `np.mean`, `np.std`, `np.percentile`, and `np.quantile` for spread
- `cross_validate(...)` with `return_train_score=True`
- `validation_curve(...)` and `learning_curve(...)` for model diagnostics
- `permutation_test_score(...)` for a quick significance check
- `plt.errorbar(...)` and `plt.fill_between(...)` for compact score plots
- one change at a time
Minimal Example¶
import numpy as np

rng = np.random.default_rng(7)
run_scores = rng.normal(loc=0.81, scale=0.01, size=5)
mean_score = run_scores.mean()
spread = run_scores.std(ddof=1)  # sample standard deviation across runs
Worked Pattern¶
import pandas as pd

runs = pd.DataFrame(
{
"setting": ["base", "base", "with_feature_x", "with_feature_x"],
"seed": [0, 1, 0, 1],
"validation_score": [0.82, 0.81, 0.84, 0.83],
"holdout_score": [0.80, 0.79, 0.82, 0.81],
}
)
summary = (
runs.groupby("setting")
.agg(
mean_validation=("validation_score", "mean"),
std_validation=("validation_score", "std"),
sem_validation=("validation_score", "sem"),
mean_holdout=("holdout_score", "mean"),
n=("seed", "size"),
)
.reset_index()
.sort_values("mean_validation", ascending=False)
)
What to look for in an ablation table:
- whether the changed row improves the right metric
- whether the gain is large enough to matter in absolute terms
- whether the variance across runs is smaller than the gain
- whether the simpler baseline remains competitive
- whether validation and holdout move in the same direction
- whether the score gap is bigger than the standard error
Useful trick:
- use `sem` when you want a quick uncertainty read on repeated seeds
- use `std` when you want to know the raw run-to-run spread
- use `quantile` when the run distribution is skewed or has an outlier
- keep the sample count beside every score so the table does not overstate precision
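A setting-by-slice table makes it obvious where a gain concentrates. This is a minimal sketch with a hypothetical `slice` column (not part of the worked table above) showing how `pivot_table` lays out the comparison:

```python
import pandas as pd

# Hypothetical per-slice scores: the same two settings, split by difficulty.
runs = pd.DataFrame(
    {
        "setting": ["base", "base", "with_feature_x", "with_feature_x"],
        "slice": ["easy", "hard", "easy", "hard"],
        "validation_score": [0.84, 0.78, 0.86, 0.77],
    }
)

# Rows are settings, columns are slices. In this toy example the gain on
# "easy" does not carry over to "hard", which the flat mean would hide.
by_slice = runs.pivot_table(
    index="setting", columns="slice", values="validation_score", aggfunc="mean"
)
```

Reading across a row shows whether a change helps broadly or only on the easy slice.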
Paired Deltas¶
The cleanest ablation comparison is paired: same split, same seed, same evaluation budget.
paired = runs.pivot(index="seed", columns="setting", values="validation_score")
paired["delta_vs_base"] = paired["with_feature_x"] - paired["base"]
That keyed reshape is safer than depending on row order. It still works if you add more settings or if the table is sorted differently later.
Why this matters:
- the paired delta removes some of the noise from split or seed variation
- a gain that looks real in separate means can disappear once the pairing is respected
- a negative delta on the hardest seeds is often more informative than a tiny positive average
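The paired table can be summarized in two numbers: the mean delta and its uncertainty. This is a minimal sketch, using the same toy runs as above and a rough two-standard-error rule (a heuristic, not a formal test):

```python
import pandas as pd

runs = pd.DataFrame(
    {
        "setting": ["base", "base", "with_feature_x", "with_feature_x"],
        "seed": [0, 1, 0, 1],
        "validation_score": [0.82, 0.81, 0.84, 0.83],
    }
)

paired = runs.pivot(index="seed", columns="setting", values="validation_score")
delta = paired["with_feature_x"] - paired["base"]

# A gain is only credible if the mean delta clears its own uncertainty.
mean_delta = delta.mean()
sem_delta = delta.sem()  # std / sqrt(n), here across two seeds
looks_real = mean_delta > 2 * sem_delta
```

With only two seeds the uncertainty estimate is itself noisy, which is one more reason to repeat the run before promoting a change.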
At minimum, keep these columns in the ablation log:
- `setting`
- `seed`
- `what_changed`
- `validation_score`
- `holdout_score`
- `decision`
Honest Comparison¶
A good comparison usually has the same split, the same metric, and the same evaluation budget.
If the split changes between rows, you are not comparing models anymore. You are comparing different tests.
If the only number you keep is the best one, the table stops being an experiment record and becomes a highlight reel.
2x2 Interaction Checks¶
When two changes might interact, do not jump directly from base to both-changes-on. Run the four cells:
- base
- change A only
- change B only
- change A plus change B
That small grid tells you whether:
- each change helps alone
- one change only helps when the other is present
- the bundle is stronger or weaker than the sum of its parts
If you skip the middle cells, you cannot tell whether the gain belongs to one component or to the interaction.
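The four cells make the interaction term computable directly. This is a minimal sketch with hypothetical mean scores for each cell; the interaction is what the bundle gains beyond the sum of the solo gains:

```python
import pandas as pd

# Hypothetical mean validation scores for the four cells of the 2x2 grid.
grid = pd.Series(
    {"base": 0.80, "a_only": 0.82, "b_only": 0.81, "a_and_b": 0.85}
)

gain_a = grid["a_only"] - grid["base"]      # what A adds alone
gain_b = grid["b_only"] - grid["base"]      # what B adds alone
gain_both = grid["a_and_b"] - grid["base"]  # what the bundle adds

# Positive interaction: the bundle beats the sum of its parts.
interaction = gain_both - (gain_a + gain_b)
```

Here the interaction is about +0.02: neither solo gain accounts for the bundle, which is exactly what the middle cells exist to reveal.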
Plot It¶
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(
summary["setting"],
summary["mean_validation"],
yerr=summary["sem_validation"],
fmt="o",
capsize=4,
)
ax.set_ylabel("validation score")
ax.set_title("Ablation with uncertainty")
When you want to show a range instead of a point estimate, fill_between is a better choice than a bare line (here x is the x-axis values, and lower and upper are the band edges):
ax.fill_between(x, lower, upper, alpha=0.2)
That helps students see that a small gain may sit inside the noise band.
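Put together, a complete band plot is only a few lines. This is a minimal sketch with made-up per-step scores and a plus-or-minus one standard error band:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical mean scores over five iterations, with a +/- one-sem band.
x = np.arange(5)
mean = np.array([0.80, 0.81, 0.82, 0.82, 0.83])
sem = np.full(5, 0.01)

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, mean)
ax.fill_between(x, mean - sem, mean + sem, alpha=0.2)
ax.set_ylabel("validation score")
```

If a later point sits inside the band around an earlier one, the "improvement" between them is not yet distinguishable from noise.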
Fast Evaluation Routine¶
When time is short, use a very small experiment routine:
- keep one base run
- change one thing only
- write both scores into one table
- repeat only if the change is large enough to matter
This is what keeps leaderboard work from turning into noise collection.
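The "one table" part of the routine can be as simple as a list of dicts turned into a DataFrame. This is a minimal sketch with hypothetical runs; the column names follow the ablation log fields suggested earlier:

```python
import pandas as pd

# One dict per run: one setting, one seed, one change at a time.
rows = []
rows.append({"setting": "base", "seed": 0,
             "what_changed": "nothing", "validation_score": 0.81})
rows.append({"setting": "with_feature_x", "seed": 0,
             "what_changed": "added feature x", "validation_score": 0.83})

log = pd.DataFrame(rows)
```

Appending dicts and building the frame once is simpler than growing a DataFrame row by row, and it keeps every run, kept or rejected, in the record.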
Trick:
- if a change helps one metric but hurts the metric you actually care about, record that explicitly instead of rounding it away
- if a gain disappears under a second seed, treat it as weak evidence, not a win
- if a simpler model matches the score within uncertainty, prefer the simpler model unless you have a strong reason not to
- if the validation gain is real but the holdout gap grows, the change may be overfitting the public signal
Failure Pattern¶
Claiming a feature helped because one run looked better. Without repeated runs or a clean comparison, the claim is weaker than it looks.
Another failure is adding two changes at once. Then the result is a conclusion about a bundle, not a reasoned decision about a component.
Another common failure is to reuse the same validation set for too many decisions. Once the validation set becomes the scoreboard for every idea, its value drops and the leaderboard effect gets worse.
What Counts As Stronger Evidence¶
Stronger evidence usually means at least one of these:
- the gain survives across repeated seeds
- the gain survives on a harder slice
- the gain still looks good when compared to a simpler baseline
- the gain survives a public-versus-private comparison
- the gain also shows up in `cross_validate(...)` or `validation_curve(...)`
- the improvement is larger than the seed spread and the standard error
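A cross-validated check is cheap to add. This is a minimal sketch on synthetic data; the model and dataset are stand-ins, not part of the worked example above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data for the sketch.
X, y = make_classification(n_samples=300, n_features=10, random_state=7)

cv = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, return_train_score=True,
)

# Fold-level scores give spread for free; the train/test gap flags overfitting.
train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()
gap = train_mean - test_mean
```

If a single-split gain does not survive the fold-level spread here, it belongs in the "defer" or "reject" bucket.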
Promote, Defer, Reject¶
Use a simple post-ablation rule:
- promote when the gain is larger than the paired-noise level, survives the holdout or hard slice, and still looks worth the added complexity
- defer when the gain exists but is too small to justify a confident claim yet
- reject when the gain disappears under a second seed, hurts the hard slice, or only improves a secondary metric
This keeps the ablation table connected to actual engineering choices instead of turning it into a score museum.
That last point is why this topic connects directly to the public/private leaderboard track.
Library Notes¶
- `pd.DataFrame` is the easiest way to keep runs, seeds, and scores visible together.
- `DataFrame.groupby(...).agg(...)` lets you summarize a repeated experiment without hand-written loops.
- `DataFrame.pivot_table(...)` is useful when you want a setting-by-slice comparison table.
- `np.random.default_rng(...)` is the cleanest way to control randomness for repeatable comparisons.
- `cross_validate(...)` gives both test scores and training scores, which helps reveal overfitting.
- `validation_curve(...)` is useful when you want to vary one hyperparameter and see where performance peaks.
- `learning_curve(...)` is useful when you want to know whether more data or less model complexity would help.
- `permutation_test_score(...)` is useful when you want a quick sanity check that the observed score is not random luck.
- `GridSearchCV` is helpful when the search space is small, but it is not a substitute for an ablation table.
If you want a compact experiment summary, this pattern is often enough:
table = runs.groupby("setting").agg(
mean_score=("validation_score", "mean"),
std_score=("validation_score", "std"),
n=("validation_score", "size"),
)
table["ci_like"] = 1.96 * table["std_score"] / np.sqrt(table["n"])
That is not a formal confidence interval for every situation, but it is a useful quick check for whether a gain is obviously bigger than the noise.
Diagnosing A Result¶
Use the diagnostic that matches the question:
- `cross_validate(...)` when you want stable fold-level scores and train/test comparison
- `validation_curve(...)` when you want to know whether a hyperparameter is too small or too large
- `learning_curve(...)` when you want to know whether more data would help
- `permutation_test_score(...)` when you want to know whether the score is clearly above random chance
If the training score is high and the validation score is low, the experiment is telling you to simplify, regularize, or improve the split.
If both scores are low, the experiment is telling you the feature view or model family is not yet strong enough.
If both scores are high and the holdout drops, the experiment is telling you the validation setup was too easy or too narrow.
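The last diagnostic, the random-chance check, takes one call. This is a minimal sketch on synthetic stand-in data; the model choice is an assumption for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Synthetic stand-in data for the sketch.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Refit the model on label-shuffled copies of y; a real signal should
# score well above the shuffled-label distribution.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, n_permutations=50, random_state=0,
)
```

A small p_value here only says the score beats shuffled labels; it says nothing about whether the gain over the baseline is worth keeping.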
Practice¶
- Write one ablation table with a base setting and one removed component.
- Repeat a run with three random seeds and summarize the spread.
- Explain what you can and cannot claim from that result.
- Identify the single change that deserves another run.
- Explain which result would make you stop iterating.
- Name one time when a negative ablation is actually useful.
- Build a slice-by-setting table and identify where the gain disappears.
- Plot a mean score with error bars and decide whether the gain is meaningful.
- Use `cross_validate` to compare training and test behavior for the same model.
- Use `validation_curve` to show whether a hyperparameter is under-tuned or over-tuned.
Runnable Example¶
Open the experiments-and-ablations example in AI Academy and run it from the platform.
Inspect how much the score moves when one component is removed and when the seed changes.
Questions To Ask¶
- Did the same change help both validation and holdout?
- Is the gain bigger than the run-to-run noise?
- What is the simplest explanation for the change?
- Is the change broad or only helping a single slice?
- If you removed the change tomorrow, would the model still be acceptable?
- Is the score stable enough to justify more engineering time?
- Did the change help the hard cases or only the easy ones?
- Would a smaller model get almost the same result with less complexity?