Clinic 12
Mock Packet Triage
You open a fresh mock packet at minute zero. The clock is 90 minutes. Decide — in under five — which split to trust, which metric to chase, and what baseline you will have submitted by minute twenty.
Situation
Minute Zero, Ninety On The Clock
A packet lands. Under-specified problem, oddly chosen metric, a leakage trap hiding in the data card. You have five minutes before the clock becomes tight.
Your Job
Triage Before Coding
Pick the split, the metric, and the baseline you will submit by minute 20. Write the note. Then open the reveal.
Bad Habit To Avoid
Code First, Read Second
If the first thing your hands do is pip install, the clinic failed.
Situation¶
You are sitting down to a mock packet. The packet is given to you verbatim:
Problem. A logistics company wants to predict whether a shipped package will be delivered on time. The prediction is used to flag high-risk shipments for a supervisor's review queue. The queue can hold 200 shipments per day; anything flagged past that is ignored.
Data card. 210,000 rows, one per shipment over the last 24 months. Fields include
shipment_id,origin_hub,destination_hub,carrier,weight_kg,route_distance_km,scheduled_transit_days,weather_at_origin,delivered_on_time. Two notes in the data card: (1) a handful ofdestination_hubcodes appear only in the last month of the data; (2) theweather_at_originfield was recorded at delivery time from a retrospective archive, not at dispatch time.Evaluation. Scoring is
recall@200-per-day— given 200 flagged shipments per day in the test period, how many of the actually late deliveries appear in the flagged set.Split. The test set is the last 30 days of data, unspecified otherwise.
Time budget. 90 minutes. Single submission.
The packet looks routine. Three things in it are not.
Artifact Packet¶
Read carefully. Every row is a decision waiting for you to make it.
| facet | what the packet says |
|---|---|
| deployment | "flagged for supervisor review, capped at 200/day" |
| data span | 24 months, test = last 30 days |
destination_hub |
some codes appear only in the last month |
weather_at_origin |
recorded at delivery time, not dispatch time |
| metric | recall@200-per-day |
For each row, write down what it implies for your validation rule, your feature set, your metric, and your stop condition before reading the reveal. There are at least five non-trivial implications buried in the packet — if you find fewer than five, you are not done triaging.
Decision Prompt¶
Before opening the reveal, write a six-line note that answers:
- Which validation split will you use inside training (to pick hyperparameters and stop epochs)?
- Which field will you drop or transform to avoid leakage, and why?
- Which baseline model will you submit by minute 20?
- Which single inspection will you run between iterations?
- What evidence would make you revise the submission?
- If
recall@200-per-dayis suspiciously high on your validation split, which failure do you suspect first?
Four to six sentences is enough. No code yet.
Strong Reasoning Looks Like¶
- it names the
weather_at_originleakage explicitly and removes or re-engineers the field (e.g., use the forecast that was available at dispatch, not the retrospective observation) - it picks a time-ordered CV inside training (e.g., walk-forward by week) so the training split mimics the test split
- it commits to a ranking-friendly baseline (gradient boosting or a calibrated logistic with sensible features) rather than chasing accuracy
- it treats the new
destination_hubcodes as an out-of-vocabulary problem: either a catch-all "unseen hub" bucket, or target encoding with proper CV fallback - it names one inspection that runs between iterations — typically: "rebuild recall@200-per-day on my last 30 days of training data and confirm the number did not move from pure shuffle luck"
- it is suspicious of an unusually high validation recall — the leaky field is exactly the kind of thing that would produce one
Common Wrong Moves¶
- starting to code before triage is finished
- random k-fold CV on a time-ordered task
- keeping
weather_at_originbecause "it's the best feature" (it is — but not for a deployable model) - submitting an accuracy-tuned model to a
recall@kgrader - treating new
destination_hubcodes as "I'll handle it later" — the test set is full of them - jumping to a neural net on a 200k-row tabular problem with no baseline
- running a second iteration before the first one's number was recorded
Run The Clinic In Browser¶
Use the runner as a scratchpad for the triage note, not for training code.
Reference Reveal¶
Open only after you write the note
The reference triage is: - `selected_cv = walk-forward by week across the last 6 months of training data` - `leakage_fix = drop weather_at_origin; if time permits, replace with a dispatch-time forecast feature` - `baseline = LightGBM with ordinal-encoded hubs, one-hot carrier, engineered lag features (past-week on-time rate per hub/carrier), early-stopping on recall@200 on the rolling validation window` - `between-iteration inspection = recompute recall@200-per-day on the most recent 30 days of training data — if the number is inflated, suspect leakage first` - `revise-if = validation recall@200 is higher than a naive "flag by scheduled_transit_days > 5" heuristic by an implausible margin — that is a leakage signal, not a win` - `if recall is suspicious = first suspect: the field that was recorded at delivery time is still present; second suspect: the walk-forward CV accidentally saw the future; third: new destination hubs are being mapped to a training code and inheriting its base rate` Why: - the graded metric is a ranking metric with a budget cap, so the entire workflow — loss, eval, baseline, inspection — must respect rank and the 200/day cap - the `weather_at_origin` field is the kind of subtle leak that inflates validation while destroying deployment; reading the data card once catches it, but only if you actually read it - a time-ordered split is structural, not cosmetic — it is the only way to produce a number that resembles the test period - new categories at test time are a common silent failure; handling them at minute 10 is ten times cheaper than at minute 80 The practical lesson is *triage before training*. Five minutes of reading the packet for leakage traps, metric shape, and split structure is the most leverage you will have all day. Coding before that is theatre.What To Do Next¶
After this clinic:
- open Mock Tasks and Timed Workflows and run Packet A start to finish under a real clock
- open Leakage Patterns — the
weather_at_origintrap is a textbook case from that topic - open Splitter Choice Under Ambiguity — the time-ordered CV decision is the same muscle
- open Imbalanced Metrics and Review Budgets — the 200/day cap is an instance of the review-budget frame