Optimizers and Regularization¶
What This Is¶
This page is about one practical training question:
- Is the run failing because of the optimization recipe, or because the model is memorizing too easily?
Many weak neural runs are not architecture failures. They are learning-rate, optimizer, weight-decay, dropout, or mode-control failures.
When You Use It¶
- the loss is unstable or slow
- validation collapses while training keeps improving
- you need a stronger baseline recipe before changing the architecture
- you are fine-tuning and different parts of the model should move at different speeds
Start With The Symptom¶
Choose the next fix by what the run is doing:
| Symptom | Better first move | Why |
|---|---|---|
| loss oscillates or goes NaN | lower LR, inspect gradients, add clipping if needed | the step is too aggressive |
| train improves, validation degrades early | stronger regularization or earlier stop | the model is memorizing |
| optimization is slow but stable | try a scheduler or optimizer change | the recipe may be too blunt |
| pretrained backbone moves too much | smaller backbone LR or staged unfreeze | reuse is being damaged |
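For the first row, "add clipping if needed" is one line in practice. This is a minimal sketch with a hypothetical tiny model and random data, not a full training loop:

```python
import torch
import torch.nn as nn

# Hypothetical tiny model and batch, purely for illustration.
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 8), torch.randn(32, 1)

loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Clip the global gradient norm before stepping. The return value is the
# pre-clip norm, which is worth logging to see whether clipping ever fires.
grad_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```

If the logged pre-clip norm is large on almost every step, clipping is masking the real problem and the LR is the better thing to change.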
Strong First Recipe¶
For many tabular or general deep-learning baselines, this is a reasonable first recipe:
```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
```
If the model is pretrained or the task is more delicate, use parameter groups so the backbone moves more slowly than the head.
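A minimal parameter-group sketch, assuming a hypothetical two-part model where the first layer stands in for a pretrained backbone:

```python
import torch
import torch.nn as nn

# Hypothetical model split into a "backbone" and a "head".
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 2))
backbone, head = model[0], model[2]

# One optimizer, two speeds: the backbone gets a 10x smaller LR so a
# pretrained representation is nudged rather than rewritten.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone.parameters(), "lr": 1e-4},
        {"params": head.parameters(), "lr": 1e-3},
    ],
    weight_decay=1e-4,  # applies to both groups unless a group overrides it
)
```

The 10x ratio here is a common starting point, not a rule; the point is that each group's `lr` overrides the optimizer-level default.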
What To Inspect Together¶
Always inspect these together:
- training loss
- validation loss
- validation metric
- current learning rate
Looking at only one of them hides the real cause.
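One cheap way to avoid that is to log all four signals on the same line every epoch. `log_epoch` and its arguments below are illustrative names, not an API from this page:

```python
import torch
import torch.nn as nn

def log_epoch(epoch, train_loss, val_loss, val_metric, optimizer):
    # The live LR is read off the optimizer, so scheduler changes show up too.
    lr = optimizer.param_groups[0]["lr"]
    line = (f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} "
            f"metric={val_metric:.4f} lr={lr:.2e}")
    print(line)
    return line

# Hypothetical usage with made-up loss/metric values.
optimizer = torch.optim.AdamW(nn.Linear(2, 1).parameters(), lr=1e-3)
line = log_epoch(1, 0.512, 0.634, 0.81, optimizer)
```

With all four on one line, the common patterns (train down while val up, or everything flat while the LR decayed to nothing) are visible at a glance.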
Failure Pattern¶
The most common failure is changing the architecture before checking the training recipe.
That usually hides a much simpler issue:
- learning rate too high
- no meaningful weight decay
- dropout or regularization mismatch
- wrong train/eval mode
- overly aggressive unfreezing during transfer
Common Moves¶
- `AdamW` when you want a strong default with decoupled weight decay
- `SGD` with momentum when you want a more classic, explicit recipe
- dropout when memorization starts early
- weight decay when large weights are part of the overfit story
- gradient clipping when spikes are causing unstable steps
- parameter groups when backbone and head should move differently
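Several of these moves fit in a few lines. This sketch shows dropout in the model plus the two optimizer choices side by side; the hyperparameters are illustrative defaults, not recommendations for any specific task:

```python
import torch
import torch.nn as nn

# Dropout in the model: fights early memorization.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

# Strong default: AdamW with decoupled weight decay.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Classic, explicit alternative: SGD with momentum (and L2-style decay).
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,
                      weight_decay=5e-4)
```

You would pick one optimizer per run; constructing both here only makes the contrast explicit.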
Common Mistakes¶
- comparing `Adam` and `AdamW` with no effective weight decay and thinking the comparison taught something
- adding multiple regularizers at once and not knowing which one mattered
- clipping gradients every step instead of asking whether the LR is already wrong
- leaving dropout or batch norm in training mode during evaluation
- using the same learning rate for head and backbone in fine-tuning without justification
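The train/eval-mode mistake is easy to demonstrate. A minimal sketch with a dropout layer:

```python
import torch
import torch.nn as nn

# Hypothetical model with a dropout layer in the middle.
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(4, 8)

model.eval()              # dropout becomes a no-op; batch norm would use running stats
with torch.no_grad():     # skip autograd bookkeeping during evaluation
    out_a = model(x)
    out_b = model(x)
# In eval mode, repeated forward passes on the same input match exactly.
# Left in train mode, dropout would randomize each pass and the two
# outputs would disagree.
```

If validation metrics are noisy from run to run on identical data, checking `model.training` is a faster diagnosis than changing the recipe.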
A Good Training-Recipe Note¶
After one run, the learner should be able to say:
- what the main symptom was
- which recipe change was tried first
- what metric or curve actually moved
- whether the next move should still target optimization or should return to data/model structure
Practice¶
- Compare one optimizer change while keeping the architecture fixed.
- Add weight decay or dropout and describe what changed in validation.
- Use parameter groups for backbone and head and explain why.
- Decide whether the next fix should be regularization, scheduler, or simpler architecture.
Longer Connection¶
Continue with PyTorch Training Loops for the full loop mechanics, Learning Rate Schedulers when the optimizer is stable but pacing is weak, and Transfer and Fine-Tuning when adaptation depth is the real issue.