Optimizers and Regularization¶
What This Is¶
This page is about one practical training question:
- Is the run failing because of the optimization recipe, or because the model is memorizing too easily?
Many weak neural runs are not architecture failures. They are learning-rate, optimizer, weight-decay, dropout, or mode-control failures.
When You Use It¶
- the loss is unstable or slow
- validation collapses while training keeps improving
- you need a stronger baseline recipe before changing the architecture
- you are fine-tuning and different parts of the model should move at different speeds
Start With The Symptom¶
Choose the next fix by what the run is doing:
| Symptom | Better first move | Why |
|---|---|---|
| loss oscillates or goes NaN | lower LR, inspect gradients, add clipping if needed | the step is too aggressive |
| train improves, validation degrades early | stronger regularization or earlier stop | the model is memorizing |
| optimization is slow but stable | try a scheduler or optimizer change | the recipe may be too blunt |
| pretrained backbone moves too much | smaller backbone LR or staged unfreeze | reuse is being damaged |
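For the first row, "add clipping if needed" is one line in practice. This is a minimal sketch with a hypothetical tiny model and random data, not a full training loop:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and batch, purely for illustration.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm before stepping. The return value is the
# pre-clip norm, which is worth logging to see whether clipping ever fires.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

If the logged pre-clip norm is large on almost every step, clipping is masking the real problem and the LR is the better thing to change.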
Strong First Recipe¶
For many tabular or general deep-learning baselines, this is a reasonable first recipe:
```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```
If the model is pretrained or the task is more delicate, use parameter groups so the backbone moves more slowly than the head.
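A minimal parameter-group sketch, assuming a hypothetical two-part model where the first layer stands in for a pretrained backbone:

```python
import torch
import torch.nn as nn

# Hypothetical model split into a "backbone" and a "head".
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
backbone, head = model[0], model[2]

# One optimizer, two speeds: the backbone gets a 10x smaller LR so a
# pretrained representation is nudged rather than rewritten.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=1e-4,  # applies to both groups unless a group overrides it
)
```

The 10x ratio here is a common starting point, not a rule; the point is that each group's `lr` overrides the optimizer-level default.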
What To Inspect Together¶
Always inspect these together:
- training loss
- validation loss
- validation metric
- current learning rate
Looking at only one of them hides the real cause.
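One cheap way to avoid that is to log all four signals on the same line every epoch. `log_epoch` and its arguments below are illustrative names, not an API from this page:

```python
import torch
import torch.nn as nn

def log_epoch(epoch, train_loss, val_loss, val_metric, optimizer):
    # The live LR is read off the optimizer, so scheduler changes show up too.
    lr = optimizer.param_groups[0]["lr"]
    line = (f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} "
            f"metric={val_metric:.4f} lr={lr:.2e}")
    print(line)
    return line

# Hypothetical usage with made-up loss/metric values.
optimizer = torch.optim.AdamW(nn.Linear(2, 1).parameters(), lr=1e-3)
line = log_epoch(1, 0.512, 0.634, 0.81, optimizer)
```

With all four on one line, the common patterns (train down while val up, or everything flat while the LR decayed to nothing) are visible at a glance.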
Failure Pattern¶
The most common failure is changing the architecture before checking the training recipe.
That usually hides a much simpler issue:
- learning rate too high
- no meaningful weight decay
- dropout or regularization mismatch
- wrong train/eval mode
- overly aggressive unfreezing during transfer
Common Moves¶
- `AdamW` when you want a strong default with decoupled weight decay
- `SGD` with momentum when you want a more classic, explicit recipe
- dropout when memorization starts early
- weight decay when large weights are part of the overfit story
- gradient clipping when spikes are causing unstable steps
- parameter groups when backbone and head should move differently
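Several of these moves fit in a few lines. This sketch shows dropout in the model plus the two optimizer choices side by side; the hyperparameters are illustrative defaults, not recommendations for any specific task:

```python
import torch
import torch.nn as nn

# Dropout in the model: fights early memorization.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

# Strong default: AdamW with decoupled weight decay.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Classic, explicit alternative: SGD with momentum (and L2-style decay).
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,
                      weight_decay=5e-4)
```

You would pick one optimizer per run; constructing both here only makes the contrast explicit.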
Common Mistakes¶
- comparing `Adam` and `AdamW` with no effective weight decay and thinking the comparison taught something
- adding multiple regularizers at once and not knowing which one mattered
- clipping gradients every step instead of asking whether the LR is already wrong
- leaving dropout or batch norm in training mode during evaluation
- using the same learning rate for head and backbone in fine-tuning without justification
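The train/eval-mode mistake is easy to demonstrate. A minimal sketch with a dropout layer:

```python
import torch
import torch.nn as nn

# Hypothetical model with a dropout layer in the middle.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(4, 8)

model.eval()              # dropout becomes a no-op; batch norm would use running stats
with torch.no_grad():     # skip autograd bookkeeping during evaluation
    out_a = model(x)
    out_b = model(x)
# In eval mode, repeated forward passes on the same input match exactly.
# Left in train mode, dropout would randomize each pass and the two
# outputs would disagree.
```

If validation metrics are noisy from run to run on identical data, checking `model.training` is a faster diagnosis than changing the recipe.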
A Good Training-Recipe Note¶
After one run, the learner should be able to say:
- what the main symptom was
- which recipe change was tried first
- what metric or curve actually moved
- whether the next move should still target optimization or should return to data/model structure
Practice¶
- Compare one optimizer change while keeping the architecture fixed.
- Add weight decay or dropout and describe what changed in validation.
- Use parameter groups for backbone and head and explain why.
- Decide whether the next fix should be regularization, scheduler, or simpler architecture.
Longer Connection¶
Continue with PyTorch Training Loops for the full loop mechanics, Learning Rate Schedulers when the optimizer is stable but pacing is weak, and Transfer and Fine-Tuning when adaptation depth is the real issue.