
Experiments and Ablations

What This Is

Experiments and ablations are about making smaller, more credible claims. The point is to isolate what changed, measure the effect honestly, and decide whether the change is worth keeping.

If you cannot explain the difference between the base run and the changed run, you do not really have an experiment yet. You have two scores.

A strong experiment answers four questions:

  • what changed
  • what metric moved
  • how much it moved
  • whether the change survives a second seed, a harder slice, or a holdout check

When You Use It

  • comparing two workflow choices
  • testing whether a feature or module helps
  • checking whether a result is robust across runs
  • deciding whether to keep a new baseline or roll it back
  • checking whether a gain is real or just seed noise
  • separating model improvement from data or split luck

Tooling

  • pd.DataFrame tables for runs, settings, and scores
  • groupby(...).agg(...) for seed summaries
  • pivot_table(...) for comparing settings by slice or metric
  • np.random.default_rng(...) for explicit seed control
  • np.mean, np.std, np.percentile, and np.quantile for spread
  • cross_validate(...) with return_train_score=True
  • validation_curve(...) and learning_curve(...) for model diagnostics
  • permutation_test_score(...) for a quick significance check
  • plt.errorbar(...) and plt.fill_between(...) for compact score plots
  • one change at a time

Minimal Example

import numpy as np

rng = np.random.default_rng(7)
run_scores = rng.normal(loc=0.81, scale=0.01, size=5)
mean_score = run_scores.mean()
spread = run_scores.std(ddof=1)

Worked Pattern

import pandas as pd

runs = pd.DataFrame(
    {
        "setting": ["base", "base", "with_feature_x", "with_feature_x"],
        "seed": [0, 1, 0, 1],
        "validation_score": [0.82, 0.81, 0.84, 0.83],
        "holdout_score": [0.80, 0.79, 0.82, 0.81],
    }
)

summary = (
    runs.groupby("setting")
    .agg(
        mean_validation=("validation_score", "mean"),
        std_validation=("validation_score", "std"),
        sem_validation=("validation_score", "sem"),
        mean_holdout=("holdout_score", "mean"),
        n=("seed", "size"),
    )
    .reset_index()
    .sort_values("mean_validation", ascending=False)
)

What to look for in an ablation table:

  • whether the changed row improves the right metric
  • whether the gain is large enough to matter in absolute terms
  • whether the variance across runs is smaller than the gain
  • whether the simpler baseline remains competitive
  • whether validation and holdout move in the same direction
  • whether the score gap is bigger than the standard error
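The last check can be read straight off the summary table. A minimal sketch, assuming the same two-seed runs table as in the worked pattern (rebuilt here so the snippet stands alone):

```python
import pandas as pd

# Two seeds per setting, as in the worked pattern above
runs = pd.DataFrame({
    "setting": ["base", "base", "with_feature_x", "with_feature_x"],
    "seed": [0, 1, 0, 1],
    "validation_score": [0.82, 0.81, 0.84, 0.83],
})

stats = runs.groupby("setting")["validation_score"].agg(["mean", "sem"])
gap = stats.loc["with_feature_x", "mean"] - stats.loc["base", "mean"]
noise = stats.loc["with_feature_x", "sem"] + stats.loc["base", "sem"]
gain_is_clear = gap > noise  # crude: the gap must beat the summed standard errors
```

Summing the two standard errors is deliberately conservative; a paired comparison on matched seeds is the cleaner test.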

Useful trick:

  • use sem when you want a quick uncertainty read on repeated seeds
  • use std when you want to know the raw run-to-run spread
  • use quantile when the run distribution is skewed or has an outlier
  • keep the sample count beside every score so the table does not overstate precision
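Those three reads can be computed side by side. The run vector below is synthetic, and sem is derived by hand rather than via pandas:

```python
import numpy as np

rng = np.random.default_rng(7)
run_scores = rng.normal(loc=0.81, scale=0.01, size=5)

n = run_scores.size
std = run_scores.std(ddof=1)                       # raw run-to-run spread
sem = std / np.sqrt(n)                             # quick uncertainty on the mean
q10, q90 = np.quantile(run_scores, [0.10, 0.90])   # robust read if skewed
```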

Paired Deltas

The cleanest ablation comparison is paired: same split, same seed, same evaluation budget.

paired = runs.pivot(index="seed", columns="setting", values="validation_score")
paired["delta_vs_base"] = paired["with_feature_x"] - paired["base"]

That keyed reshape is safer than depending on row order. It still works if you add more settings or if the table is sorted differently later.
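Two follow-up reads on those deltas are usually enough: the mean delta and whether every seed agrees on the sign. The runs table is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

runs = pd.DataFrame({
    "setting": ["base", "base", "with_feature_x", "with_feature_x"],
    "seed": [0, 1, 0, 1],
    "validation_score": [0.82, 0.81, 0.84, 0.83],
})

paired = runs.pivot(index="seed", columns="setting", values="validation_score")
paired["delta_vs_base"] = paired["with_feature_x"] - paired["base"]

mean_delta = paired["delta_vs_base"].mean()
sign_agrees = (paired["delta_vs_base"] > 0).all()  # every seed moved the same way
```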

Why this matters:

  • the paired delta removes some of the noise from split or seed variation
  • a gain that looks real in separate means can disappear once the pairing is respected
  • a negative delta on the hardest seeds is often more informative than a tiny positive average

At minimum, keep these columns in the ablation log:

  • setting
  • seed
  • what_changed
  • validation_score
  • holdout_score
  • decision

Honest Comparison

A good comparison usually has the same split, the same metric, and the same evaluation budget.

If the split changes between rows, you are not comparing models anymore. You are comparing different tests.

If the only number you keep is the best one, the table stops being an experiment record and becomes a highlight reel.

2x2 Interaction Checks

When two changes might interact, do not jump directly from base to both-changes-on. Run the four cells:

  • base
  • change A only
  • change B only
  • change A plus change B

That small grid tells you whether:

  • each change helps alone
  • one change only helps when the other is present
  • the bundle is stronger or weaker than the sum of its parts

If you skip the middle cells, you cannot tell whether the gain belongs to one component or to the interaction.
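A minimal sketch of that grid with an interaction term; the four scores are illustrative, not from a real run:

```python
import pandas as pd

# The four cells of the 2x2 grid (illustrative scores)
grid = pd.DataFrame({
    "a_on": [False, True, False, True],
    "b_on": [False, False, True, True],
    "score": [0.80, 0.82, 0.81, 0.86],
})

cell = grid.set_index(["a_on", "b_on"])["score"]
effect_a = cell.loc[(True, False)] - cell.loc[(False, False)]  # A alone
effect_b = cell.loc[(False, True)] - cell.loc[(False, False)]  # B alone
both = cell.loc[(True, True)] - cell.loc[(False, False)]
interaction = both - (effect_a + effect_b)  # gain beyond the sum of parts
```

A clearly positive interaction says the bundle is stronger than the sum of its parts; near zero says the effects are roughly additive.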

Plot It

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.errorbar(
    summary["setting"],
    summary["mean_validation"],
    yerr=summary["sem_validation"],
    fmt="o",
    capsize=4,
)
ax.set_ylabel("validation score")
ax.set_title("Ablation with uncertainty")

When you want to show a range instead of a point estimate, fill_between is a better choice than a bare line:

ax.fill_between(x, lower, upper, alpha=0.2)

That helps students see that a small gain may sit inside the noise band.
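A complete, hedged version of that fragment: x, lower, and upper are the assumed names from the one-liner above, filled with synthetic numbers, and the Agg backend keeps it runnable without a display:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

x = np.arange(5)                    # e.g. five checkpoints or settings
mean = 0.80 + 0.01 * x              # illustrative mean scores
band = np.full_like(mean, 0.008)    # illustrative uncertainty
lower, upper = mean - band, mean + band

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, mean, marker="o")
ax.fill_between(x, lower, upper, alpha=0.2)  # the noise band around the mean
ax.set_ylabel("validation score")
```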

Fast Evaluation Routine

When time is short, use a very small experiment routine:

  1. keep one base run
  2. change one thing only
  3. write both scores into one table
  4. repeat only if the change is large enough to matter

This is what keeps leaderboard work from turning into noise collection.
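The four steps can be sketched as a tiny logging helper; log_run and the column names are illustrative, not a fixed API:

```python
import pandas as pd

def log_run(rows, setting, seed, score):
    """Append one run so base and changed scores end up in one table."""
    rows.append({"setting": setting, "seed": seed, "validation_score": score})

rows = []
log_run(rows, "base", 0, 0.82)            # step 1: keep one base run
log_run(rows, "with_feature_x", 0, 0.84)  # step 2: change one thing only
table = pd.DataFrame(rows)                # step 3: both scores in one table

scores = table.set_index("setting")["validation_score"]
gain = scores["with_feature_x"] - scores["base"]  # step 4: repeat only if this matters
```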

Trick:

  • if a change helps one metric but hurts the metric you actually care about, record that explicitly instead of rounding it away
  • if a gain disappears under a second seed, treat it as weak evidence, not a win
  • if a simpler model matches the score within uncertainty, prefer the simpler model unless you have a strong reason not to
  • if the validation gain is real but the holdout gap grows, the change may be overfitting the public signal

Failure Pattern

Claiming a feature helped because one run looked better. Without repeated runs or a clean comparison, the claim is weaker than it looks.

Another failure is adding two changes at once. Then the result is a conclusion about a bundle, not a reasoned decision about a component.

Another common failure is to reuse the same validation set for too many decisions. Once the validation set becomes the scoreboard for every idea, its value drops and the leaderboard effect gets worse.

What Counts As Stronger Evidence

Stronger evidence usually means at least one of these:

  • the gain survives across repeated seeds
  • the gain survives on a harder slice
  • the gain still looks good when compared to a simpler baseline
  • the gain survives a public-versus-private comparison
  • the gain also shows up in cross_validate(...) or validation_curve(...)
  • the improvement is larger than the seed spread and the standard error
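The cross_validate check on the list above looks like this; the dataset and model are stand-ins, so treat the numbers as illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data and a simple model
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
cv = cross_validate(LogisticRegression(max_iter=1000), X, y,
                    cv=5, return_train_score=True)

train_mean = cv["train_score"].mean()
test_mean = cv["test_score"].mean()
fold_spread = cv["test_score"].std(ddof=1)  # compare the claimed gain to this
```

If the gain you want to promote is smaller than fold_spread, the list above says it is not yet stronger evidence.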

Promote, Defer, Reject

Use a simple post-ablation rule:

  • promote when the gain is larger than the paired-noise level, survives the holdout or hard slice, and still looks worth the added complexity
  • defer when the gain exists but is too small to justify a confident claim yet
  • reject when the gain disappears under a second seed, hurts the hard slice, or only improves a secondary metric
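That rule is small enough to write down directly; the argument names and threshold logic here are illustrative:

```python
def decide(gain, paired_noise, survives_holdout, hurts_hard_slice):
    """Map one ablation result to promote / defer / reject (illustrative rule)."""
    if gain <= 0 or hurts_hard_slice:
        return "reject"
    if gain > paired_noise and survives_holdout:
        return "promote"
    return "defer"  # real but too small or not yet confirmed

decision = decide(gain=0.02, paired_noise=0.01,
                  survives_holdout=True, hurts_hard_slice=False)
```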

This keeps the ablation table connected to actual engineering choices instead of turning it into a score museum.

That last point is why this topic connects directly to the public/private leaderboard track.

Library Notes

  • pd.DataFrame is the easiest way to keep runs, seeds, and scores visible together.
  • DataFrame.groupby(...).agg(...) lets you summarize a repeated experiment without hand-written loops.
  • DataFrame.pivot_table(...) is useful when you want a setting-by-slice comparison table.
  • np.random.default_rng(...) is the cleanest way to control randomness for repeatable comparisons.
  • cross_validate(...) gives both test scores and training scores, which helps reveal overfitting.
  • validation_curve(...) is useful when you want to vary one hyperparameter and see where performance peaks.
  • learning_curve(...) is useful when you want to know whether more data or less model complexity would help.
  • permutation_test_score(...) is useful when you want a quick sanity check that the observed score is not random luck.
  • GridSearchCV is helpful when the search space is small, but it is not a substitute for an ablation table.

If you want a compact experiment summary, this pattern is often enough:

table = runs.groupby("setting").agg(
    mean_score=("validation_score", "mean"),
    std_score=("validation_score", "std"),
    n=("validation_score", "size"),
)
table["ci_like"] = 1.96 * table["std_score"] / np.sqrt(table["n"])

That is not a formal confidence interval for every situation, but it is a useful competitive check for whether a gain is obviously bigger than the noise.

Diagnosing A Result

Use the diagnostic that matches the question:

  • cross_validate(...) when you want stable fold-level scores and train/test comparison
  • validation_curve(...) when you want to know whether a hyperparameter is too small or too large
  • learning_curve(...) when you want to know whether more data would help
  • permutation_test_score(...) when you want to know whether the score is clearly above random chance
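The permutation check is the least familiar of the four, so here is a minimal sketch on synthetic data; 30 permutations is deliberately small to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    n_permutations=30, random_state=0,
)
# score: fit on real labels; perm_scores: fits on shuffled labels;
# a small p_value means the real score is clearly above random chance
```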

If the training score is high and the validation score is low, the experiment is telling you to simplify, regularize, or improve the split.

If both scores are low, the experiment is telling you the feature view or model family is not yet strong enough.

If both scores are high and the holdout drops, the experiment is telling you the validation setup was too easy or too narrow.
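Those three patterns amount to a rough lookup; the thresholds below are illustrative, not universal:

```python
def diagnose(train_score, val_score, holdout_score, good=0.80, gap=0.05):
    """Rough regime check for the three patterns above (illustrative thresholds)."""
    if train_score - val_score > gap:
        return "overfit: simplify, regularize, or improve the split"
    if train_score < good and val_score < good:
        return "underfit: strengthen the feature view or model family"
    if val_score - holdout_score > gap:
        return "validation too easy or too narrow"
    return "consistent"
```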

Practice

  1. Write one ablation table with a base setting and one removed component.
  2. Repeat a run with three random seeds and summarize the spread.
  3. Explain what you can and cannot claim from that result.
  4. Identify the single change that deserves another run.
  5. Explain which result would make you stop iterating.
  6. Name one time when a negative ablation is actually useful.
  7. Build a slice-by-setting table and identify where the gain disappears.
  8. Plot a mean score with error bars and decide whether the gain is meaningful.
  9. Use cross_validate to compare training and test behavior for the same model.
  10. Use validation_curve to show whether a hyperparameter is under-tuned or over-tuned.

Runnable Example

Open the experiments-and-ablations example in AI Academy and run it from the platform.

Inspect how much the score moves when one component is removed and when the seed changes.

Questions To Ask

  1. Did the same change help both validation and holdout?
  2. Is the gain bigger than the run-to-run noise?
  3. What is the simplest explanation for the change?
  4. Is the change broad or only helping a single slice?
  5. If you removed the change tomorrow, would the model still be acceptable?
  6. Is the score stable enough to justify more engineering time?
  7. Did the change help the hard cases or only the easy ones?
  8. Would a smaller model get almost the same result with less complexity?

Longer Connection