Clinic 10

Splitter Choice Under Ambiguity

The problem statement is silent on how to split. Random gives a great score, grouped cuts it in half, time-ordered collapses it to the dummy floor. Pick the split you would defend.


Situation

Three Splits, Three Scores

Random, grouped, and time-ordered splits give wildly different validation numbers. The problem statement does not say which one the grader uses.

Your Job

Pick The Deployment-Shaped Split

Pick the split you would defend, the model you would ship, and say what single piece of evidence would change your mind.

Bad Habit To Avoid

Best Score Wins

If the reasoning picks the split that looks best, the clinic failed.

Situation

You are reviewing a customer-behavior classification task. The dataset has:

~40,000 rows spanning 18 months of user activity
repeated entities: about 3,500 unique users, each appearing multiple times
a timestamp column that is monotonically increasing across the file
no instruction in the problem statement about how to split
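The two properties that matter for splitting, repeated entities and an ordered timestamp, can be confirmed before any model is trained. The sketch below is one minimal way to do that; the field names `user_id` and `ts` and the helper `profile` are hypothetical, since the packet does not name the columns.

```python
from collections import Counter

def profile(rows, group_key="user_id", time_key="ts"):
    """Summarize the two leakage axes: repeated groups and time ordering.

    `rows` is a list of dicts in file order. The key names are assumptions.
    """
    counts = Counter(r[group_key] for r in rows)
    times = [r[time_key] for r in rows]
    return {
        "n_rows": len(rows),
        "n_groups": len(counts),
        "mean_rows_per_group": len(rows) / len(counts),
        # Groups that appear more than once can straddle a random split.
        "repeated_groups": sum(1 for c in counts.values() if c > 1),
        # Non-decreasing in file order; ties are allowed in this check.
        "time_is_sorted": all(a <= b for a, b in zip(times, times[1:])),
    }

# Toy rows only, not the clinic dataset.
toy_rows = [
    {"user_id": "u1", "ts": 1},
    {"user_id": "u2", "ts": 2},
    {"user_id": "u1", "ts": 3},
    {"user_id": "u3", "ts": 4},
]
summary = profile(toy_rows)
print(summary)
```

On the dataset described here, roughly 40,000 rows over about 3,500 users works out to about 11 rows per user, so almost every user would be split across train and validation under a random scheme.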

You ran four validation schemes for the same model and got very different stories.

Artifact Packet

Read the packet before you decide:

| split scheme | validation ROC AUC | validation average precision | dummy baseline ROC AUC | notes |
| --- | --- | --- | --- | --- |
| `random_70_30` | 0.912 | 0.74 | 0.50 | no group or time constraint |
| `grouped_by_user` | 0.681 | 0.38 | 0.50 | held-out users never seen in train |
| `time_ordered_cutoff` | 0.553 | 0.22 | 0.50 | train on first 12 months, validate on last 6 |
| `grouped_and_time` | 0.541 | 0.21 | 0.50 | held-out users AND later time period |

The tempting move is obvious: the random split has the most flattering number.

The harder question is which split matches the actual deployment story. If the model will be applied to new users later in time, the relevant split is `grouped_and_time`, which scores 0.541 ROC AUC, barely above the 0.50 dummy floor.
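To make the "new users later in time" split concrete, here is a minimal stdlib-only sketch of one way to build it. It is not the clinic's own splitter: the function name `grouped_and_time_split`, the field names `user_id` and `ts`, the 30% holdout fraction, and the toy data are all assumptions for illustration.

```python
import random
from datetime import date

def grouped_and_time_split(rows, cutoff, holdout_fraction=0.3, seed=0):
    """Train on non-holdout users before `cutoff`; validate on holdout users
    at or after `cutoff`. Rows in the other two quadrants are dropped.
    """
    users = sorted({r["user_id"] for r in rows})
    random.Random(seed).shuffle(users)
    n_holdout = max(1, int(len(users) * holdout_fraction))
    holdout_users = set(users[:n_holdout])

    train = [r for r in rows
             if r["user_id"] not in holdout_users and r["ts"] < cutoff]
    valid = [r for r in rows
             if r["user_id"] in holdout_users and r["ts"] >= cutoff]
    return train, valid

# Toy data: 20 hypothetical users, one row per month for 18 months.
months = [date(2023 + m // 12, m % 12 + 1, 1) for m in range(18)]
rows = [{"user_id": f"u{u:02d}", "ts": ts} for u in range(20) for ts in months]

# First 12 months for training, last 6 for validation.
train, valid = grouped_and_time_split(rows, cutoff=date(2024, 1, 1))
print(len(rows), len(train), len(valid))
```

The design cost is visible in the row counts: holdout users' early rows and training users' late rows are both discarded, so this split uses less data than any of the others. That is the price of respecting both boundaries at once.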

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

Which validation split would you defend as the right one for this task?
What score would you report to the grader, and from which split?
Would you stop training or iterate — and if iterating, what would you target?
What single piece of evidence would change your mind?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

it starts from the deployment story ("what will this model predict about, and when"), not from the scores
it acknowledges the gap between random and grouped (0.912 vs 0.681 ROC AUC, a drop of 0.231, with average precision roughly halved) and treats the random split as flattering
it picks grouped_and_time if the deployment includes new users arriving later, otherwise grouped_by_user if new users are expected but time drift is not a concern
it is willing to report a lower number as the honest number
it names one direction to iterate (stronger per-user features, drift-aware features) or explicitly says "stop — the model barely beats the floor under the honest split"
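The size of each gap is easier to reason about when it is computed against both the random split and the dummy floor. The sketch below does that arithmetic using the ROC AUC values from the artifact packet; the helper name `gap_report` and the choice of `random_70_30` as the reference are illustrative assumptions.

```python
# ROC AUC values copied from the artifact packet.
scores = {
    "random_70_30": 0.912,
    "grouped_by_user": 0.681,
    "time_ordered_cutoff": 0.553,
    "grouped_and_time": 0.541,
}
DUMMY_AUC = 0.50

def gap_report(scores, reference="random_70_30", floor=DUMMY_AUC):
    """For each split: lift over the floor, and the drop versus the reference."""
    ref = scores[reference]
    return {
        name: {
            "lift_over_floor": round(auc - floor, 3),
            "abs_drop_vs_reference": round(ref - auc, 3),
            "rel_drop_vs_reference": round((ref - auc) / ref, 3),
        }
        for name, auc in scores.items()
    }

report = gap_report(scores)
for name, row in report.items():
    print(name, row)
```

Read this way, the random split sits 0.412 above the floor while `grouped_and_time` sits 0.041 above it, which is why the honest number deserves the "stop or iterate" question rather than a shrug.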

Common Wrong Moves

picking random_70_30 because it has the best AUC
picking grouped_by_user without checking whether time drift also matters
averaging the four split scores as if the average were meaningful
reporting the random number to the grader and hoping the grader uses random
concluding the task is impossible because grouped_and_time is weak — it may be a real but harder task
refusing to pick until you "know" which split the grader uses, when the problem itself tells you about the deployment shape

Run The Clinic In Browser

Use the browser runner to sketch your reasoning before writing the note.

Reference Reveal

Open only after you write the note

The reference choice is:

- `selected_split = grouped_and_time`
- `reported_score = 0.541 ROC AUC, 0.21 AP`
- `decision = continue only on features that respect both group and time boundaries; otherwise stop and report the honest floor`

Why:

- the dataset has repeated users AND a monotonic timestamp; both are classical leakage axes
- the random split mixes train/validation users and shuffles time, which is the loosest possible interpretation; its score is not deployable
- the grouped-only split leaks time information; the time-only split leaks user identity
- `grouped_and_time` is the only split that mirrors the deployment story "new users at a later time"
- the score under that split is just above the floor, which is honest information even though it is not flattering

The practical lesson is: the split is part of the method. Pick the split that matches the deployment, report the score under that split, and iterate against it, not against a more flattering alternative.
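The two leakage axes named in the reveal can be checked mechanically for any candidate split. The following is a minimal sketch under assumed field names (`user_id`, `ts`); `audit_split` is a hypothetical helper, not part of the clinic's tooling.

```python
def audit_split(train, valid, group_key="user_id", time_key="ts"):
    """Report leakage along the group axis and the time axis."""
    shared = {r[group_key] for r in train} & {r[group_key] for r in valid}
    latest_train = max(r[time_key] for r in train)
    earliest_valid = min(r[time_key] for r in valid)
    return {
        # Users present on both sides of the split.
        "shared_groups": len(shared),
        # True if any training row is at or after the earliest validation row.
        "time_overlap": latest_train >= earliest_valid,
    }

def row(user, ts):
    return {"user_id": user, "ts": ts}

# A random-style split on toy rows: users and time are both mixed.
leaky = audit_split(
    train=[row("a", 1), row("b", 6), row("c", 3)],
    valid=[row("a", 5), row("b", 2), row("d", 7)],
)

# A grouped-and-time-style split: unseen user, strictly later period.
clean = audit_split(
    train=[row("a", 1), row("b", 2), row("c", 3)],
    valid=[row("d", 7)],
)
print(leaky, clean)
```

A split that passes both checks is not automatically the right one; it only means the score was not inflated along these two axes. Whether both axes matter still depends on the deployment story.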

What To Do Next

After this clinic:

open Honest Splits and Baselines
open Leakage Patterns for the adjacent failures
use IOAI Competition Surface for the problem-reading habit that catches this earlier