Deep Learning and Checkpoints
This pack is about training dynamics, overfitting signals, transfer choices, and checkpoint discipline. The questions are adapted from public official course materials and rewritten into academy form.
QD01. Learning Rate Too High
Question: During training, the loss jumps up and down violently and sometimes becomes NaN after a few updates. What is the first optimizer-related diagnosis?
Solution:
- The learning rate is probably too high.
- Large steps can overshoot repeatedly and destabilize training.
- The first corrective move is to reduce the learning rate and confirm whether the loss curve becomes smoother.
Why it matters: Many “mysterious” training failures are just step-size failures.
Source family: Stanford CS231n schedule and Stanford CS229 optimization and regularization themes
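The overshoot mechanism can be seen in a toy setting. A minimal sketch (our illustration, not from the course materials): plain gradient descent on f(w) = w², where any step size above the stability threshold makes each update overshoot and grow the loss, while a smaller step size converges smoothly. With an even larger rate or more steps, the iterate overflows to inf and then NaN, matching the symptom in the question.

```python
def train(lr, steps=50, w0=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the loss history."""
    w = w0
    losses = []
    for _ in range(steps):
        grad = 2.0 * w      # df/dw
        w = w - lr * grad   # gradient step
        losses.append(w * w)
    return losses

# lr = 1.5 gives w <- -2w each step: the loss grows geometrically.
high = train(lr=1.5)
# lr = 0.1 gives w <- 0.8w each step: the loss shrinks smoothly.
low = train(lr=0.1)

print(high[-1] > high[0])  # → True, diverging
print(low[-1] < 1e-6)      # → True, stable after lowering the learning rate
```

The first corrective move in the solution, reducing the learning rate, is exactly the switch from the first regime to the second.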
QD02. Low Train Loss, High Validation Loss
Question: Training loss keeps improving, validation loss stays much worse, and the gap widens. What is the first diagnosis?
Solution:
- High variance or overfitting.
- The model is fitting the training distribution more aggressively than the held-out evidence supports.
- First responses include stronger regularization, early stopping, more data, or reduced model capacity.
Why it matters: The train-valid gap is one of the core deep-learning readouts.
Source family: Stanford CS229 regularization/model-selection notes and UC Berkeley CS189 bias-variance themes
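The train-valid readout can be turned into a simple rule. A hypothetical helper (the names and thresholds are ours, not from the pack): flag overfitting when validation loss has stopped improving for several epochs while training loss keeps dropping.

```python
def overfitting_signal(train_losses, val_losses, patience=3):
    """Return True if val loss has not improved for `patience` epochs
    while train loss kept improving (a classic high-variance readout)."""
    best_val = min(val_losses)
    best_epoch = val_losses.index(best_val)
    stalled = len(val_losses) - 1 - best_epoch >= patience
    still_fitting = train_losses[-1] < train_losses[best_epoch]
    return stalled and still_fitting

train_hist = [1.0, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1]
val_hist   = [1.1, 0.8, 0.7, 0.72, 0.75, 0.8, 0.85]  # worsens after epoch 2
print(overfitting_signal(train_hist, val_hist))  # → True
```

A rule like this is also the trigger for the early-stopping response listed above.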
QD03. Which Checkpoint Do You Keep?
Question: Validation accuracy peaks at epoch 7, then drifts downward, but training accuracy keeps climbing until epoch 25. Which checkpoint should you keep?
Solution:
- Keep the best-validation checkpoint.
- Model selection should follow the held-out metric tied to deployment, not the final epoch by default.
- The later epochs are evidence of additional fit, not necessarily better generalization.
Why it matters: Checkpoint choice is part of the evaluation protocol.
Source family: Stanford CS231n assignments and Stanford CS229 regularization/model-selection notes
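The selection rule is one line of logic: keep the checkpoint with the best held-out metric, not the last one. A minimal sketch with illustrative numbers matching the question:

```python
def best_checkpoint(history):
    """history: list of (epoch, val_accuracy) pairs. Return the epoch to keep."""
    return max(history, key=lambda item: item[1])[0]

# Validation peaks at epoch 7 even though training ran to epoch 25.
history = [(5, 0.81), (6, 0.84), (7, 0.86), (8, 0.85), (25, 0.79)]
print(best_checkpoint(history))  # → 7
```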
QD04. Frozen Head, Partial Unfreeze, Or Full Fine-Tune?
Question: You have 2,000 labeled images and a good pretrained backbone from a related visual domain. What is the safest first sequence: full fine-tune immediately, or start smaller?
Solution:
- Start smaller.
- The safest first sequence is usually: train a new head on frozen features, then unfreeze selectively if validation evidence supports it.
- Full fine-tuning can help, but it raises optimization and overfitting risk.
Why it matters: Transfer learning works best when escalation is earned, not assumed.
Source family: Stanford CS231n assignments and Stanford CS229 model-selection notes
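The "train a new head on frozen features" step looks like this in PyTorch. A minimal sketch assuming a generic backbone/head split; the module shapes here are stand-ins, not a real pretrained network:

```python
import torch.nn as nn
import torch.optim as optim

backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU())  # stand-in for a pretrained trunk
head = nn.Linear(64, 10)                                # new task-specific head

# Freeze the pretrained features.
for p in backbone.parameters():
    p.requires_grad = False

# Optimize only the parameters that still require gradients (the head).
trainable = [p for p in head.parameters() if p.requires_grad]
opt = optim.SGD(trainable, lr=1e-2)
print(sum(p.numel() for p in trainable))  # → 650, head parameters only
```

Selective unfreezing later is the same pattern: flip `requires_grad` back on for chosen backbone layers and rebuild the optimizer, only after validation evidence supports it.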
QD05. Weight Decay Or More Capacity?
Question: An MLP already fits the training set extremely well but generalizes poorly. Should the next move usually be a larger network or stronger regularization?
Solution:
- Usually stronger regularization.
- If training fit is already strong, more capacity often makes variance worse.
- Weight decay, dropout, early stopping, or data augmentation are more sensible first responses.
Why it matters: The next move should respond to the failure mode, not just add power.
Source family: Stanford CS231n schedule and Stanford CS229 regularization/model-selection notes
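Weight decay itself is a small modification to the update rule. A minimal sketch of the classic coupled (L2) form used by SGD-style optimizers: the decay term pulls weights toward zero even when the loss gradient is zero.

```python
def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD step with coupled L2 weight decay: w <- w - lr * (grad + wd * w)."""
    return w - lr * (grad + weight_decay * w)

w = 2.0
w = sgd_step(w, grad=0.0)  # even with a zero loss gradient...
print(w)                   # → approximately 1.998; decay shrinks the weight
```

This shrinkage is why decay reduces variance: it penalizes the large weights an already-well-fit network uses to memorize the training set.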
QD06. Small-Batch Fine-Tuning And BatchNorm
Question: You are fine-tuning a pretrained vision network with very small batches. Is it safer to aggressively relearn every BatchNorm statistic immediately, or to be conservative?
Solution:
- Be conservative first.
- Very small batches can produce noisy BatchNorm estimates.
- A common safe starting move is to keep BatchNorm behavior stable while testing the rest of the adaptation plan under a fixed validation rule.
Why it matters: Some fine-tuning failures come from unstable normalization rather than bad representation.
Source family: Stanford CS231n assignments
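Why the estimates get noisy can be simulated directly. A pure-Python sketch (our simplification, not PyTorch's actual BatchNorm implementation) that updates a BatchNorm-style running mean from per-batch means and measures how much the result wanders across repeated runs:

```python
import random

random.seed(0)
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # true mean ~0

def running_mean(batch_size, momentum=0.1, steps=200):
    """Update a BatchNorm-style running mean from noisy per-batch means."""
    rm = 0.0
    for _ in range(steps):
        batch = random.sample(population, batch_size)
        batch_mean = sum(batch) / batch_size
        rm = (1 - momentum) * rm + momentum * batch_mean
    return rm

def spread(batch_size, trials=50):
    """Standard deviation of the final running mean across repeated runs."""
    vals = [running_mean(batch_size) for _ in range(trials)]
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

# Tiny batches leave the running statistic far noisier than large batches,
# which is why keeping pretrained statistics fixed is a common first move.
print(spread(2) > spread(256))  # → True
```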
QD07. Better Final Train Loss, Worse Validation Curve
Question: Run A ends with lower training loss than Run B, but Run B achieves the better validation metric at its best checkpoint. Which run is better?
Solution:
- Run B.
- The best deployment candidate is the run that wins on the held-out metric you actually care about.
- Final training loss is secondary if it does not translate into held-out performance.
Why it matters: This is the same discipline as classical model selection, just inside a deeper training loop.
Source family: Stanford CS229 model-selection notes and Stanford CS231n assignments
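The comparison rule in code: rank runs by their best held-out metric, ignoring final training loss. The run names and numbers below are illustrative, not from the pack.

```python
run_a = {"final_train_loss": 0.05, "val_acc_by_epoch": [0.70, 0.78, 0.80]}
run_b = {"final_train_loss": 0.09, "val_acc_by_epoch": [0.72, 0.83, 0.81]}

def best_val(run):
    """Best held-out metric achieved at any checkpoint of the run."""
    return max(run["val_acc_by_epoch"])

winner = "A" if best_val(run_a) > best_val(run_b) else "B"
print(winner)  # → B: A's lower final training loss does not decide the choice
```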
QD08. Augmentation Helped Training, Hurt Deployment Slice
Question: A stronger augmentation recipe improves average validation accuracy slightly but hurts performance on a critical real deployment slice. Should it stay by default?
Solution:
- No.
- The average gain is relevant, but the deployment slice can still dominate the choice.
- Keep the split and slice definition fixed, then decide based on the real operating objective rather than average performance alone.
Why it matters: A training trick is only useful if it helps the deployment problem you actually have.
Source family: Stanford CS231n schedule and Stanford CS229 model-selection notes
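One way to make that decision repeatable is an explicit acceptance rule. A hypothetical sketch (our construction, with illustrative numbers): adopt a new recipe only if it helps on average and does not regress the critical slice beyond a stated tolerance.

```python
def accept_recipe(avg_delta, slice_delta, slice_tolerance=0.0):
    """avg_delta / slice_delta: metric change vs. the baseline recipe
    (positive = better). The critical slice gets veto power."""
    return avg_delta > 0 and slice_delta >= -slice_tolerance

# Slight average gain, clear regression on the deployment slice: reject.
print(accept_recipe(avg_delta=+0.004, slice_delta=-0.03))  # → False
```

Fixing the rule (and the split and slice definitions) in advance keeps the decision tied to the operating objective rather than to whichever average looks best after the fact.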
What To Do After This Pack
If this pack exposed a gap, route back into:
- core training loop: PyTorch Training Loops
- regularization choices: Optimizers and Regularization
- transfer decisions: Transfer and Fine-Tuning
- longer workflow: PyTorch Training Recipes
- transfer workflow: ResNet, BERT, and Fine-Tuning