Clinic 10
Splitter Choice Under Ambiguity
The problem statement is silent on how to split. Random gives a great score, grouped cuts it in half, time-ordered collapses it to the dummy floor. Pick the split you would defend.
Situation
Three Splits, Three Scores
Random, grouped, and time-ordered splits give wildly different validation numbers. The problem statement does not say which one the grader uses.
Your Job
Pick The Deployment-Shaped Split
Pick the split you would defend, the model you would ship, and say what single piece of evidence would change your mind.
Bad Habit To Avoid
Best Score Wins
If the reasoning picks the split that looks best, the clinic failed.
Situation
You are reviewing a customer-behavior classification task. The dataset has:
- ~40,000 rows spanning 18 months of user activity
- repeated entities: about 3,500 unique users, each appearing multiple times
- a `timestamp` column that is monotonically increasing across the file
- no instruction in the problem statement about how to split
You ran four validation schemes for the same model and got very different stories:
Artifact Packet
Read the packet before you decide:
| split scheme | validation ROC AUC | validation average precision | dummy baseline ROC AUC | notes |
|---|---|---|---|---|
| `random_70_30` | 0.912 | 0.74 | 0.50 | no group or time constraint |
| `grouped_by_user` | 0.681 | 0.38 | 0.50 | held-out users never seen in train |
| `time_ordered_cutoff` | 0.553 | 0.22 | 0.50 | train on first 12 months, validate on last 6 |
| `grouped_and_time` | 0.541 | 0.21 | 0.50 | held-out users AND later time period |
The tempting move is obvious: the random split has the most flattering number.
The harder question is which split matches the actual deployment story. If the model will be applied to new users later in time, the relevant split is `grouped_and_time`, which looks barely above the dummy floor.
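To make the `grouped_and_time` scheme concrete, here is a minimal sketch in plain Python. It assumes each row is a dict with a `timestamp` and a user identifier; the function name `grouped_and_time_split`, the `user_id` key, the `holdout_frac` parameter, and the toy data are all hypothetical and not part of the clinic packet. Only the 12-month/6-month cutoff mirrors the table.

```python
import random
from datetime import date

def grouped_and_time_split(rows, cutoff, holdout_frac=0.3, seed=0):
    """Return (train_idx, val_idx) for rows with 'user_id' and 'timestamp'.

    Validation rows are on/after `cutoff` AND belong to held-out users.
    Training rows are strictly before `cutoff` AND belong to the other users.
    Rows in the two mixed cells (held-out user early, seen user late) are dropped.
    """
    users = sorted({r["user_id"] for r in rows})
    rng = random.Random(seed)
    n_holdout = max(1, round(len(users) * holdout_frac))
    holdout = set(rng.sample(users, n_holdout))

    train_idx, val_idx = [], []
    for i, r in enumerate(rows):
        late = r["timestamp"] >= cutoff
        held = r["user_id"] in holdout
        if held and late:
            val_idx.append(i)
        elif not held and not late:
            train_idx.append(i)
    return train_idx, val_idx

# Toy data (hypothetical): 20 users, one row per user per month, 18 months.
rows = [
    {"user_id": u, "timestamp": date(2023 + m // 12, m % 12 + 1, 1)}
    for m in range(18)
    for u in range(20)
]
cutoff = date(2024, 1, 1)  # first 12 months train, last 6 validate
train_idx, val_idx = grouped_and_time_split(rows, cutoff)
```

The design choice worth noticing is the dropped rows: enforcing both boundaries at once discards the two mixed cells, so this split trains and validates on less data than the other schemes.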
Decision Prompt
Write the note before you open the reveal.
Your note should answer:
- Which validation split would you defend as the right one for this task?
- What score would you report to the grader, and from which split?
- Would you stop training or iterate — and if iterating, what would you target?
- What single piece of evidence would change your mind?
Keep the note short. Four to six sentences is enough.
Strong Reasoning Looks Like
- it starts from the deployment story ("what will this model predict about, and when"), not from the scores
- it acknowledges the gap between random and grouped (0.912 vs 0.681 ROC AUC, 0.74 vs 0.38 average precision) and treats the random split as flattering
- it picks `grouped_and_time` if the deployment includes new users arriving later, otherwise `grouped_by_user` if new users are expected but time drift is not a concern
- it is willing to report a lower number as the honest number
- it names one direction to iterate (stronger per-user features, drift-aware features) or explicitly says "stop — the model barely beats the floor under the honest split"
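The size of the gaps is easier to judge against the dummy floor than in raw AUC. The sketch below uses only the four ROC AUC values and the 0.50 baseline from the packet; the helper name `lift_report` and its output keys are hypothetical.

```python
DUMMY_AUC = 0.50  # dummy baseline ROC AUC from the packet

packet = {
    "random_70_30": 0.912,
    "grouped_by_user": 0.681,
    "time_ordered_cutoff": 0.553,
    "grouped_and_time": 0.541,
}

def lift_report(scores, floor=DUMMY_AUC, reference="random_70_30"):
    """Compare each split's lift over the floor with the reference split's lift."""
    ref_lift = scores[reference] - floor
    report = {}
    for name, auc in scores.items():
        lift = auc - floor
        report[name] = {
            "abs_gap_vs_ref": round(scores[reference] - auc, 3),
            "lift_over_floor": round(lift, 3),
            "share_of_ref_lift": round(lift / ref_lift, 2),
        }
    return report

report = lift_report(packet)
for name, row in report.items():
    print(name, row)
```

On the packet numbers, the grouped split keeps roughly 44% of the random split's lift over the floor, and `grouped_and_time` keeps about 10%, which is the quantitative version of "barely beats the floor".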
Common Wrong Moves
- picking `random_70_30` because it has the best AUC
- picking `grouped_by_user` without checking whether time drift also matters
- averaging the four split scores as if the average were meaningful
- reporting the random number to the grader and hoping the grader uses random
- concluding the task is impossible because `grouped_and_time` is weak, when it may be a real but harder task
- refusing to pick until you "know" which split the grader uses, when the problem itself tells you about the deployment shape
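Several of these wrong moves can be caught by auditing a split along the two axes the dataset exposes: user identity and time. The following is a rough sketch, assuming rows are dicts with hypothetical `user_id` and `timestamp` keys; the helper `audit_split` and the toy data are illustrative, not part of the clinic.

```python
import random

def audit_split(rows, train_idx, val_idx):
    """Report how much a split overlaps on user identity and on time."""
    train_users = {rows[i]["user_id"] for i in train_idx}
    latest_train = max(rows[i]["timestamp"] for i in train_idx)
    # Validation rows whose user also appears in training.
    seen = sum(rows[i]["user_id"] in train_users for i in val_idx)
    # Validation rows that do not come strictly after every training row.
    early = sum(rows[i]["timestamp"] <= latest_train for i in val_idx)
    return {
        "val_rows_with_user_in_train": seen / len(val_idx),
        "val_rows_not_after_train": early / len(val_idx),
    }

# Toy data (hypothetical): 10 users, 20 rows each, timestamp increasing across the file.
rows = [{"user_id": i % 10, "timestamp": i} for i in range(200)]

# Random 70/30 split: no group or time constraint.
idx = list(range(len(rows)))
random.Random(0).shuffle(idx)
random_audit = audit_split(rows, idx[:140], idx[140:])

# Time-ordered cutoff: first 70% of the file trains, the rest validates.
time_audit = audit_split(rows, list(range(140)), list(range(140, 200)))

print("random:", random_audit)
print("time:  ", time_audit)
```

In this toy setup the time-ordered split has no validation rows before the end of training, yet every validation user was also seen in training, which is why checking only one axis is not enough.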
Run The Clinic In Browser
Use the browser runner to sketch your reasoning before writing the note.
Reference Reveal
Open only after you write the note
The reference choice is:
- `selected_split = grouped_and_time`
- `reported_score = 0.541 ROC AUC, 0.21 AP`
- `decision = continue only on features that respect both group and time boundaries; otherwise stop and report the honest floor`
Why:
- the dataset has repeated users AND a monotonic timestamp; both are classical leakage axes
- the random split mixes train/validation users and shuffles time, which is the loosest possible interpretation; its score is not deployable
- the grouped-only split leaks time information; the time-only split leaks user identity
- `grouped_and_time` is the only split that mirrors the deployment story "new users at a later time"
- the score under that split is just above the floor, which is honest information even though it is not flattering
The practical lesson is: the split is part of the method. Pick the split that matches the deployment, report the score under that split, and iterate against it, not against a more flattering alternative.
What To Do Next
After this clinic:
- open Honest Splits and Baselines
- open Leakage Patterns for the adjacent failures
- use IOAI Competition Surface for the problem-reading habit that catches this earlier