Validation, Leakage, and Model Choice

This pack is about one thing: protecting the honesty of the evaluation loop. These questions are adapted from publicly available university course materials and rewritten into academy form.

QV01. Where Does Imputation Belong?

Question: You have a train/validation/test split with missing values in all three partitions. A teammate computes median imputation values on the full dataset first, then applies them everywhere. Is that acceptable?

Solution:

  • No.
  • The imputation statistics must be fit on the training partition only.
  • If you use the full dataset first, the validation and test partitions influence preprocessing.
  • That leaks information from held-out data into the pipeline, even if the labels are untouched.
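The correct order can be sketched with scikit-learn's `SimpleImputer` (a minimal sketch on synthetic data; the array shapes and missingness rate are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
X_valid = rng.normal(size=(30, 3))
# Knock out ~10% of entries in each partition to simulate missing values.
X_train[rng.random(X_train.shape) < 0.1] = np.nan
X_valid[rng.random(X_valid.shape) < 0.1] = np.nan

# Fit the medians on the training partition only...
imputer = SimpleImputer(strategy="median")
X_train_filled = imputer.fit_transform(X_train)

# ...then apply those frozen training medians to held-out data.
X_valid_filled = imputer.transform(X_valid)
```

The key point is the asymmetry: `fit_transform` touches only the training partition, and the held-out data only ever sees `transform`.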

Why it matters: Leakage often enters through preprocessing long before model fitting.

Source family: UC Berkeley CS189 cross-validation themes and Stanford CS229 model-selection notes

QV02. Repeated Test-Set Tuning

Question: You try eight hyperparameter settings and look at the test score after each one. The last setting is best. Can you report that test score as an honest final estimate?

Solution:

  • No.
  • Once you have used the test set to compare settings, it stops being a pure final check.
  • The honest workflow is: tune on training/validation or cross-validation, freeze the choice rule, then evaluate once on test.
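The honest workflow can be sketched with scikit-learn (a minimal sketch on synthetic data; the model, grid, and sizes are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=200)

# Hold out the test set before any tuning happens.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Tune on the development data only, via cross-validation.
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_dev, y_dev)

# Freeze the chosen setting, then touch the test set exactly once.
final_score = search.score(X_test, y_test)
```

All eight-settings-style comparisons happen inside `GridSearchCV`; the test set appears on exactly one line, after the choice rule is frozen.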

Why it matters: This is the clean line between model selection and final evaluation.

Source family: UC Berkeley CS189 cross-validation themes and Stanford CS229 model-selection notes

QV03. Group Leakage Across Users

Question: You are predicting churn from repeated events, and the same user appears many times. A random row-wise split gives much higher validation accuracy than a user-grouped split. What is the first diagnosis?

Solution:

  • The row-wise split is probably optimistic because events from the same user appear in both train and validation.
  • The model may be exploiting user-specific signatures rather than generalizable behavior.
  • The safer split groups by user so the validation set represents unseen users.
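A group-aware split can be sketched with scikit-learn's `GroupKFold` (a minimal sketch; the user counts and feature shapes are illustrative):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(2)
n_events = 300
user_ids = rng.integers(0, 40, size=n_events)  # repeated users across events
X = rng.normal(size=(n_events, 4))

# Group-aware folds: every event from a given user lands on one side only.
gkf = GroupKFold(n_splits=5)
for train_idx, valid_idx in gkf.split(X, groups=user_ids):
    overlap = set(user_ids[train_idx]) & set(user_ids[valid_idx])
    assert not overlap, "a user leaked across the split"
```

If the same loop is run with a plain `KFold`, the overlap check fails almost immediately, which is exactly the leakage the question describes.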

Why it matters: Entity leakage can make a weak model look production-ready.

Source family: CMU 10-601 evaluation and overfitting themes and UC Berkeley CS189 cross-validation themes

QV04. Chronological Data, Random Split

Question: You are forecasting next-week demand from historical sales. A random split produces excellent validation error. Should you trust it?

Solution:

  • Not by default.
  • For forecasting or sequential decisions, random splits usually break the deployment story.
  • The model may train on future patterns while validating on earlier periods.
  • The first honest check is a chronological split or rolling backtest.
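A rolling backtest can be sketched with scikit-learn's `TimeSeriesSplit` (a minimal sketch; the two-year weekly horizon is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

weeks = np.arange(104)  # two years of weekly observations, in time order

# Rolling backtest: each validation block sits strictly after its training block.
tscv = TimeSeriesSplit(n_splits=4)
for train_idx, valid_idx in tscv.split(weeks):
    assert train_idx.max() < valid_idx.min()  # no future data in training
```

The invariant in the assertion is the whole point: under a random split it would not hold, and the model would see the future it is supposed to predict.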

Why it matters: A strong score under the wrong split is not evidence.

Source family: Stanford CS229 model-selection notes and CMU 10-601 evaluation themes

QV05. Calibration Or Better Ranking?

Question: Model A and Model B have the same ROC-AUC. Model A is badly overconfident, while Model B's predicted probabilities match observed frequencies much better. Which is better if the deployment decision depends on a probability threshold tied to review cost?

Solution:

  • Model B is usually the safer choice for a threshold policy.
  • Equal ranking quality does not mean equal probability quality.
  • If the operating policy uses predicted probabilities directly, calibration matters because threshold decisions depend on the scale being trustworthy.
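The gap between ranking quality and probability quality can be sketched numerically (a minimal sketch; the sharpening transform and sample size are illustrative assumptions, not data from a real model pair):

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(3)
p_true = rng.uniform(0.05, 0.95, size=5000)
y = (rng.random(5000) < p_true).astype(int)

p_b = p_true  # Model B: probabilities match observed frequencies

# Model A: a strictly monotone sharpening of the same scores, so the
# ranking (and hence ROC-AUC) is identical, but it is overconfident.
logits = np.log(p_true / (1 - p_true))
p_a = 1 / (1 + np.exp(-3 * logits))

auc_a, auc_b = roc_auc_score(y, p_a), roc_auc_score(y, p_b)
brier_a, brier_b = brier_score_loss(y, p_a), brier_score_loss(y, p_b)
```

The AUCs come out equal while the Brier score, which is sensitive to the probability scale, penalizes the overconfident model.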

Why it matters: Good ranking is not enough when money, review bandwidth, or risk policy depends on actual probabilities.

Source family: UC Berkeley CS189 bias-variance and evaluation themes and Stanford CS229 model-selection notes

QV06. Public Leaderboard Jump

Question: Your public leaderboard score jumps after a long sequence of small feature edits, but your local cross-validation does not improve. What is the disciplined next move?

Solution:

  • Do not trust the jump by itself.
  • Keep the split and pipeline fixed, rerun local validation, and inspect whether the change makes sense on slices or paired examples.
  • If the local evidence is weak, the jump may be leaderboard noise or mild overfitting to the public subset.
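One concrete version of "inspect the local evidence" is a paired, fold-wise comparison on a fixed split (a minimal sketch; the fold scores here are made-up illustrative numbers, not real results):

```python
import numpy as np

# CV scores for the old and edited pipelines on the SAME fixed folds.
old = np.array([0.812, 0.805, 0.821, 0.809, 0.816])
new = np.array([0.813, 0.802, 0.822, 0.810, 0.815])

# Pair the folds: if the differences hover around zero and flip sign,
# the local evidence is weak and the leaderboard jump is likely noise
# or mild overfitting to the public subset.
diffs = new - old
mean_diff = diffs.mean()
consistent = bool(np.all(diffs > 0) or np.all(diffs < 0))
```

Pairing by fold removes fold-difficulty variance from the comparison, which is why it is more informative than comparing two averages.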

Why it matters: A flattering public score is not the same thing as better generalization.

Source family: MIT 6.867 generative-versus-discriminative themes and Stanford CS229 model-selection notes

QV07. Weak Slice, Strong Average

Question: Overall validation F1 improved from 0.81 to 0.83, but the worst demographic slice fell from 0.62 to 0.48. Should the new model win automatically?

Solution:

  • No.
  • The aggregate improvement is real, but it does not settle the decision alone.
  • If the weak slice matters operationally or ethically, the drop may dominate the choice.
  • The right move is to make the objective explicit: overall average only, minimum-slice floor, or some weighted policy between them.
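A minimum-slice floor is easy to make explicit in code (a minimal sketch; `slice_report` and the toy labels are illustrative, not a standard API):

```python
import numpy as np
from sklearn.metrics import f1_score

def slice_report(y_true, y_pred, slices):
    """Return overall F1 together with the worst per-slice F1."""
    overall = f1_score(y_true, y_pred)
    per_slice = {s: f1_score(y_true[slices == s], y_pred[slices == s])
                 for s in np.unique(slices)}
    return overall, min(per_slice.values())

# Toy example: slice "a" is predicted perfectly, slice "b" poorly.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1])
slices = np.array(["a"] * 6 + ["b"] * 6)

overall, worst = slice_report(y_true, y_pred, slices)
# A respectable overall F1 coexists with a much weaker worst slice,
# so a selection rule like `overall improves AND worst >= floor`
# has to be stated explicitly rather than assumed.
```

Once both numbers are on the table, the choice between average-only, minimum-slice floor, or a weighted policy becomes a stated decision instead of an accident of the metric.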

Why it matters: Averages hide where the system actually fails.

Source family: CMU 10-601 evaluation and model-comparison themes

QV08. Threshold Tuning Under Review Budget

Question: Your model outputs probabilities for fraud. Operations can review only 200 cases per day. Is the right first move to maximize default F1 at threshold 0.5?

Solution:

  • No.
  • The first move is to match the operating rule to the budget.
  • Rank cases by score, measure precision/recall at the review cutoff that corresponds to 200 cases, and tune the threshold or top-k policy around that constraint.
  • Threshold 0.5 is not special unless the deployment objective makes it special.
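Measuring precision and recall at the review cutoff can be sketched as a top-k policy (a minimal sketch on synthetic scores; the score-to-fraud relationship is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000
scores = rng.random(n)
# Synthetic ground truth: fraud is more likely at higher scores.
y = (rng.random(n) < scores ** 3).astype(int)

BUDGET = 200  # operations can review only this many cases per day

# Rank by score and take exactly the cases the budget allows.
order = np.argsort(-scores)
reviewed = order[:BUDGET]

precision_at_budget = y[reviewed].mean()        # fraud rate among reviewed cases
recall_at_budget = y[reviewed].sum() / y.sum()  # share of all fraud caught
```

The threshold is implied by the 200th-ranked score, not by 0.5; tuning happens around this cutoff because that is where the deployment decision actually lives.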

Why it matters: Metrics should follow the decision rule, not the other way around.

Source family: CMU 10-601 Naive Bayes and evaluation questions and UC Berkeley CS189 evaluation themes

What To Do After This Pack

If this pack exposed a weakness, route back into the smallest academy layer that fixes it: