Clinic 12

Mock Packet Triage

You open a fresh mock packet at minute zero. The clock is 90 minutes. Decide — in under five — which split to trust, which metric to chase, and what baseline you will have submitted by minute twenty.

Back To Clinics Open Mock Tasks Topic Open The IOAI Surface

Situation

Minute Zero, Ninety On The Clock

A packet lands. Under-specified problem, oddly chosen metric, a leakage trap hiding in the data card. You have five minutes before the clock becomes tight.

Your Job

Triage Before Coding

Pick the split, the metric, and the baseline you will submit by minute 20. Write the note. Then open the reveal.

Bad Habit To Avoid

Code First, Read Second

If the first thing your hands do is pip install, the clinic failed.

Situation¶

You are sitting down to a mock packet. The packet is given to you verbatim:

Problem. A logistics company wants to predict whether a shipped package will be delivered on time. The prediction is used to flag high-risk shipments for a supervisor's review queue. The queue can hold 200 shipments per day; anything flagged past that is ignored.

Data card. 210,000 rows, one per shipment over the last 24 months. Fields include shipment_id, origin_hub, destination_hub, carrier, weight_kg, route_distance_km, scheduled_transit_days, weather_at_origin, delivered_on_time. Two notes in the data card: (1) a handful of destination_hub codes appear only in the last month of the data; (2) the weather_at_origin field was recorded at delivery time from a retrospective archive, not at dispatch time.

Evaluation. Scoring is recall@200-per-day — given 200 flagged shipments per day in the test period, how many of the actually late deliveries appear in the flagged set.

Split. The test set is the last 30 days of data, unspecified otherwise.

Time budget. 90 minutes. Single submission.

The packet looks routine. Three things in it are not.

Artifact Packet¶

Read carefully. Every row is a decision waiting for you to make it.

facet	what the packet says
deployment	"flagged for supervisor review, capped at 200/day"
data span	24 months, test = last 30 days
`destination_hub`	some codes appear only in the last month
`weather_at_origin`	recorded at delivery time, not dispatch time
metric	`recall@200-per-day`

For each row, write down what it implies for your validation rule, your feature set, your metric, and your stop condition before reading the reveal. There are at least five non-trivial implications buried in the packet — if you find fewer than five, you are not done triaging.

Decision Prompt¶

Before opening the reveal, write a six-line note that answers:

Which validation split will you use inside training (to pick hyperparameters and stop epochs)?
Which field will you drop or transform to avoid leakage, and why?
Which baseline model will you submit by minute 20?
Which single inspection will you run between iterations?
What evidence would make you revise the submission?
If recall@200-per-day is suspiciously high on your validation split, which failure do you suspect first?

Four to six sentences is enough. No code yet.

Strong Reasoning Looks Like¶

it names the weather_at_origin leakage explicitly and removes or re-engineers the field (e.g., use the forecast that was available at dispatch, not the retrospective observation)
it picks a time-ordered CV inside training (e.g., walk-forward by week) so the training split mimics the test split
it commits to a ranking-friendly baseline (gradient boosting or a calibrated logistic with sensible features) rather than chasing accuracy
it treats the new destination_hub codes as an out-of-vocabulary problem: either a catch-all "unseen hub" bucket, or target encoding with proper CV fallback
it names one inspection that runs between iterations — typically: "rebuild recall@200-per-day on my last 30 days of training data and confirm the number did not move from pure shuffle luck"
it is suspicious of an unusually high validation recall — the leaky field is exactly the kind of thing that would produce one

Common Wrong Moves¶

starting to code before triage is finished
random k-fold CV on a time-ordered task
keeping weather_at_origin because "it's the best feature" (it is — but not for a deployable model)
submitting an accuracy-tuned model to a recall@k grader
treating new destination_hub codes as "I'll handle it later" — the test set is full of them
jumping to a neural net on a 200k-row tabular problem with no baseline
running a second iteration before the first one's number was recorded

Run The Clinic In Browser¶

Use the runner as a scratchpad for the triage note, not for training code.

Reference Reveal¶

Open only after you write the note

The reference triage is: - `selected_cv = walk-forward by week across the last 6 months of training data` - `leakage_fix = drop weather_at_origin; if time permits, replace with a dispatch-time forecast feature` - `baseline = LightGBM with ordinal-encoded hubs, one-hot carrier, engineered lag features (past-week on-time rate per hub/carrier), early-stopping on recall@200 on the rolling validation window` - `between-iteration inspection = recompute recall@200-per-day on the most recent 30 days of training data — if the number is inflated, suspect leakage first` - `revise-if = validation recall@200 is higher than a naive "flag by scheduled_transit_days > 5" heuristic by an implausible margin — that is a leakage signal, not a win` - `if recall is suspicious = first suspect: the field that was recorded at delivery time is still present; second suspect: the walk-forward CV accidentally saw the future; third: new destination hubs are being mapped to a training code and inheriting its base rate` Why: - the graded metric is a ranking metric with a budget cap, so the entire workflow — loss, eval, baseline, inspection — must respect rank and the 200/day cap - the `weather_at_origin` field is the kind of subtle leak that inflates validation while destroying deployment; reading the data card once catches it, but only if you actually read it - a time-ordered split is structural, not cosmetic — it is the only way to produce a number that resembles the test period - new categories at test time are a common silent failure; handling them at minute 10 is ten times cheaper than at minute 80 The practical lesson is *triage before training*. Five minutes of reading the packet for leakage traps, metric shape, and split structure is the most leverage you will have all day. Coding before that is theatre.

What To Do Next¶

After this clinic:

open Mock Tasks and Timed Workflows and run Packet A start to finish under a real clock
open Leakage Patterns — the weather_at_origin trap is a textbook case from that topic
open Splitter Choice Under Ambiguity — the time-ordered CV decision is the same muscle
open Imbalanced Metrics and Review Budgets — the 200/day cap is an instance of the review-budget frame