Clinic 02

Leakage Or Signal?

A validation jump looks too good to ignore. The harder question is whether the gain came from a real feature or from an answer key in disguise.

Situation

Huge Gain, Wrong Reason?

The most predictive feature is also the one whose availability looks suspiciously late in the workflow.

Your Job

Ship, Reject, Or Inspect

Choose the model you would keep, name the feature you would reject or keep, and say what evidence would change your mind.

Bad Habit To Avoid

Score Jump = Progress

If the whole argument is “it scored better,” the clinic failed.

Situation

You are reviewing a support-triage model with repeated customers and a frozen grouped split.

The packet says:

  • the baseline pipeline is stable
  • one new feature causes a dramatic jump
  • the same feature is populated only after the case is resolved
  • a safer feature gives a smaller but believable gain
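The frozen grouped split mentioned above can be sketched in a few lines. This is a minimal illustration, not the clinic's actual pipeline: the tickets, customer ids, and fold count are all invented, and `GroupKFold` stands in for whatever grouping the packet's authors used.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: 12 tickets from 4 repeated customers.
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
y = rng.integers(0, 2, size=12)
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])

# GroupKFold keeps every ticket from one customer on a single side of
# each split, so the model is never scored on customers it trained on.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

Freezing the split — same groups, same fold assignment for every candidate — is what makes the rows in the packet below comparable to each other.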

Artifact Packet

Read the packet before you decide:

| candidate | grouped ROC AUC | grouped average precision | prediction-time availability | top feature |
| --- | --- | --- | --- | --- |
| baseline_pipeline | 0.712 | 0.391 | yes | recent_backlog_count |
| plus_resolution_status | 0.964 | 0.901 | no, filled after the case is closed | resolved_within_24h |
| plus_backlog_pressure | 0.748 | 0.427 | yes | recent_backlog_count + queue_pressure |

The tempting move is obvious: plus_resolution_status dominates the board.

The harder question is the one that matters: would that score survive the actual prediction timeline?
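The availability rule can be made mechanical with a small audit over feature metadata. The `filled_at` values here are hypothetical stand-ins for whatever your pipeline records; the point is that anything not populated at prediction time gets flagged before any score is read.

```python
# Hypothetical metadata for the three packet features.
features = [
    {"name": "recent_backlog_count", "filled_at": "ticket_open"},
    {"name": "queue_pressure", "filled_at": "ticket_open"},
    {"name": "resolved_within_24h", "filled_at": "case_closed"},
]

PREDICTION_TIME = "ticket_open"

# Anything filled later than prediction time is unusable,
# no matter how well it scores offline.
unsafe = [f["name"] for f in features if f["filled_at"] != PREDICTION_TIME]
print(unsafe)  # -> ['resolved_within_24h']
```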

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

  1. Which candidate would you keep right now?
  2. Which feature or change would you reject immediately?
  3. Would you stop or continue?
  4. What evidence would justify changing the decision?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

  • it names the strongest score and still refuses to ship a feature that is unavailable at prediction time
  • it prefers a smaller honest gain over a dramatic invalid gain
  • it uses the grouped split and the feature-availability rule together
  • it says what next inspection would still be worth doing

Common Wrong Moves

  • shipping the best row without mentioning feature availability
  • saying “the grouped split is clean, so the feature must be safe”
  • rejecting every feature change instead of separating the unsafe one from the useful one
  • continuing without naming what you would inspect next

Reference Reveal

Open only after you write the note.

The reference choice is:

  • `selected_candidate = plus_backlog_pressure`
  • `rejected_feature = resolved_within_24h`
  • `decision = continue only with prediction-time-safe features`

Why:

  • the late feature is not available when the prediction is made
  • the huge gain is therefore invalid evidence, not a real improvement
  • the smaller grouped gain from `queue_pressure` is still honest enough to inspect further

The practical lesson is simple: a believable smaller gain is stronger than an impossible larger one.
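Why the dramatic row is not evidence: a feature filled after resolution is effectively a noisy copy of the label, and any offline metric will reward it. A synthetic sketch of that effect, using scikit-learn's `roc_auc_score` — the labels and both features are invented here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=200)

# A post-resolution feature is close to a copy of the label;
# an honest prediction-time feature is only weakly related to it.
leaky = y + rng.normal(scale=0.1, size=200)   # stand-in for resolved_within_24h
honest = y + rng.normal(scale=2.0, size=200)  # stand-in for queue_pressure

print(f"leaky AUC:  {roc_auc_score(y, leaky):.3f}")   # near-perfect
print(f"honest AUC: {roc_auc_score(y, honest):.3f}")  # modest but real
```

The near-perfect number says nothing about deployment, because at prediction time the leaky column would be empty.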

What To Do Next

After this clinic:

  1. open Leakage Patterns
  2. run the matching leakage example
  3. use scikit-learn Validation and Tuning when you want the full split-pipeline-selection workflow