Clinic 10

Splitter Choice Under Ambiguity

The problem statement is silent on how to split. Random gives a great score, grouped cuts it in half, time-ordered collapses it to the dummy floor. Pick the split you would defend.

Situation

Three Splits, Three Scores

Random, grouped, and time-ordered splits give wildly different validation numbers. The problem statement does not say which one the grader uses.

Your Job

Pick The Deployment-Shaped Split

Pick the split you would defend, the model you would ship, and say what single piece of evidence would change your mind.

Bad Habit To Avoid

Best Score Wins

If the reasoning picks the split that looks best, the clinic failed.

Situation

You are reviewing a customer-behavior classification task. The dataset has:

  • ~40,000 rows spanning 18 months of user activity
  • repeated entities: about 3,500 unique users, each appearing multiple times
  • a timestamp column that is monotonically increasing across the file
  • no instruction in the problem statement about how to split
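Before comparing splits, it can help to confirm the two structural facts in that list: users repeat, and the file is ordered by time. The sketch below is illustrative only. The helper name `describe_structure` and the tiny example data are hypothetical, not part of the clinic's dataset.

```python
from collections import Counter

def describe_structure(user_ids, timestamps):
    """Summarise the two properties that constrain the split choice."""
    counts = Counter(user_ids)
    n_rows = len(user_ids)
    return {
        "n_rows": n_rows,
        "n_users": len(counts),
        "mean_rows_per_user": n_rows / len(counts),
        "users_with_repeats": sum(1 for c in counts.values() if c > 1),
        # True when no timestamp is earlier than the one before it.
        "time_sorted": all(a <= b for a, b in zip(timestamps, timestamps[1:])),
    }

# Tiny hypothetical example; the real file has ~40,000 rows and ~3,500 users.
user_ids = ["u1", "u2", "u1", "u3", "u2", "u1"]
timestamps = [1, 2, 3, 4, 5, 6]
summary = describe_structure(user_ids, timestamps)
print(summary)
```

If `users_with_repeats` is non-zero, a random split can place the same user on both sides. If `time_sorted` is true, a random split also shuffles later rows into training.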

You ran the three basic validation schemes, plus a combined grouped-and-time variant, for the same model and got very different stories:

Artifact Packet

Read the packet before you decide:

| split scheme | validation ROC AUC | validation average precision | dummy baseline ROC AUC | notes |
|---|---|---|---|---|
| random_70_30 | 0.912 | 0.74 | 0.50 | no group or time constraint |
| grouped_by_user | 0.681 | 0.38 | 0.50 | held-out users never seen in train |
| time_ordered_cutoff | 0.553 | 0.22 | 0.50 | train on first 12 months, validate on last 6 |
| grouped_and_time | 0.541 | 0.21 | 0.50 | held-out users AND later time period |
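The four schemes in the packet can be sketched as index-producing functions. This is a minimal, dependency-free illustration on synthetic rows, not the code that produced the packet. The field names `user` and `day`, the 30% hold-out fraction, and the day-360 cutoff (roughly 12 of 18 months) are assumptions made for the example. In practice scikit-learn's `train_test_split` and `GroupShuffleSplit` cover the first two schemes.

```python
import random

def make_rows(n_rows=2000, n_users=200, n_days=540, seed=0):
    """Synthetic stand-in: one dict per row, sorted by day like the real file."""
    rng = random.Random(seed)
    rows = [{"user": rng.randrange(n_users), "day": rng.uniform(0, n_days)}
            for _ in range(n_rows)]
    return sorted(rows, key=lambda r: r["day"])

def held_out_users(rows, val_frac, seed):
    """Pick a random subset of users to keep entirely out of training."""
    users = sorted({r["user"] for r in rows})
    random.Random(seed).shuffle(users)
    return set(users[:int(len(users) * val_frac)])

def random_split(rows, val_frac=0.3, seed=0):
    # Ignores both user identity and time.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_val = int(len(idx) * val_frac)
    return sorted(idx[n_val:]), sorted(idx[:n_val])

def grouped_split(rows, val_frac=0.3, seed=0):
    # Every row of a held-out user goes to validation.
    val_users = held_out_users(rows, val_frac, seed)
    train = [i for i, r in enumerate(rows) if r["user"] not in val_users]
    val = [i for i, r in enumerate(rows) if r["user"] in val_users]
    return train, val

def time_split(rows, cutoff_day=360):
    # Train strictly before the cutoff, validate at or after it.
    train = [i for i, r in enumerate(rows) if r["day"] < cutoff_day]
    val = [i for i, r in enumerate(rows) if r["day"] >= cutoff_day]
    return train, val

def grouped_and_time_split(rows, val_frac=0.3, cutoff_day=360, seed=0):
    # Rows of held-out users before the cutoff, and of training users after
    # it, are dropped so that neither boundary is crossed.
    val_users = held_out_users(rows, val_frac, seed)
    train = [i for i, r in enumerate(rows)
             if r["user"] not in val_users and r["day"] < cutoff_day]
    val = [i for i, r in enumerate(rows)
           if r["user"] in val_users and r["day"] >= cutoff_day]
    return train, val

rows = make_rows()
splits = {
    "random_70_30": random_split(rows),
    "grouped_by_user": grouped_split(rows),
    "time_ordered_cutoff": time_split(rows),
    "grouped_and_time": grouped_and_time_split(rows),
}
for name, (train, val) in splits.items():
    print(name, len(train), len(val))
```

Note that the combined split discards rows, so it validates on less data than the others. That cost is part of what the stricter split buys.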

The tempting move is obvious: the random split has the most flattering number.

The harder question is which split matches the actual deployment story. If the model will be applied to new users later in time, the relevant split is grouped_and_time, which scores 0.541 ROC AUC, barely above the 0.50 dummy floor.

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

  1. Which validation split would you defend as the right one for this task?
  2. What score would you report to the grader, and from which split?
  3. Would you stop training or iterate — and if iterating, what would you target?
  4. What single piece of evidence would change your mind?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

  • it starts from the deployment story ("what will this model predict about, and when"), not from the scores
  • it acknowledges the gap between random and grouped (0.912 vs 0.681 ROC AUC, about 0.23 points) and treats the random split as flattering
  • it picks grouped_and_time if the deployment includes new users arriving later, otherwise grouped_by_user if new users are expected but time drift is not a concern
  • it is willing to report a lower number as the honest number
  • it names one direction to iterate (stronger per-user features, drift-aware features) or explicitly says "stop — the model barely beats the floor under the honest split"

Common Wrong Moves

  • picking random_70_30 because it has the best AUC
  • picking grouped_by_user without checking whether time drift also matters
  • averaging the four split scores as if the average were meaningful
  • reporting the random number to the grader and hoping the grader uses random
  • concluding the task is impossible because grouped_and_time is weak — it may be a real but harder task
  • refusing to pick until you "know" which split the grader uses, when the problem itself tells you about the deployment shape

Run The Clinic In Browser

Use the browser runner to sketch your reasoning before writing the note.

Reference Reveal

Open only after you write the note

The reference choice is:

  • `selected_split = grouped_and_time`
  • `reported_score = 0.541 ROC AUC, 0.21 AP`
  • `decision = continue only on features that respect both group and time boundaries; otherwise stop and report the honest floor`

Why:

  • the dataset has repeated users AND a monotonic timestamp; both are classical leakage axes
  • the random split mixes train/validation users and shuffles time, which is the loosest possible interpretation; its score is not deployable
  • the grouped-only split leaks time information; the time-only split leaks user identity
  • `grouped_and_time` is the only split that mirrors the deployment story "new users at a later time"
  • the score under that split is just above the floor, which is honest information even though it is not flattering

The practical lesson is: the split is part of the method. Pick the split that matches the deployment, report the score under that split, and iterate against it, not against a more flattering alternative.
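The two leakage axes can be checked mechanically for any candidate split. The following is a minimal sketch under the assumption that each row reduces to a `(user_id, timestamp)` pair. The function name `audit_split` and the example rows are hypothetical.

```python
def audit_split(train_rows, val_rows):
    """Report which leakage axes a split leaves open.

    Each row is a (user_id, timestamp) pair; timestamps only need to be
    mutually comparable.
    """
    train_users = {u for u, _ in train_rows}
    val_users = {u for u, _ in val_rows}
    shared = train_users & val_users
    latest_train = max(t for _, t in train_rows)
    earliest_val = min(t for _, t in val_rows)
    return {
        "shared_users": len(shared),
        "user_leak": bool(shared),
        # Time leaks when any training row is at or after the earliest
        # validation row.
        "time_leak": latest_train >= earliest_val,
    }

# A random-style split: user u1 is on both sides and time is interleaved.
leaky = audit_split([("u1", 1), ("u2", 2), ("u1", 5)], [("u1", 4), ("u3", 6)])

# A grouped-and-time-style split: new user, strictly later.
clean = audit_split([("u1", 1), ("u2", 2)], [("u3", 3)])

print(leaky)
print(clean)
```

A split that passes both checks is not automatically the right one, but a split that fails either check cannot stand in for "new users at a later time".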

What To Do Next

After this clinic:

  1. open Honest Splits and Baselines
  2. open Leakage Patterns for the adjacent failures
  3. use IOAI Competition Surface for the problem-reading habit that catches this earlier