
PyTorch Training Loops

What This Is

This page is about one rule:

  • if the training loop is not trustworthy, nothing downstream is trustworthy

A healthy loop is not just forward, backward, step. It is mode control, gradient hygiene, validation separation, checkpoint choice, and enough inspection to know whether learning is actually happening.

When You Use It

  • building the first neural baseline
  • debugging a stalled or unstable run
  • adding validation and checkpointing to a toy model
  • fine-tuning a pretrained model without breaking evaluation discipline

The Core Loop

The loop should stay conceptually simple:

model.train()
for x, y in train_loader:
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

Then validation runs separately:

model.eval()
with torch.no_grad():
    ...

That train/validation boundary is the core discipline.
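Put together, one epoch that respects that boundary can be sketched as follows. This is a minimal sketch: the run_epoch name and the per-sample loss averaging are illustrative, and the model, loaders, loss function, and optimizer are assumed to come from elsewhere.

```python
import torch


def run_epoch(model, train_loader, val_loader, loss_fn, optimizer):
    # Training phase: train mode (dropout/batchnorm active), gradients on.
    model.train()
    train_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * x.size(0)

    # Validation phase: eval mode, no graph built, no parameter updates.
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += loss_fn(model(x), y).item() * x.size(0)

    # Average per sample so the two numbers are comparable across splits.
    return (train_loss / len(train_loader.dataset),
            val_loss / len(val_loader.dataset))
```

Returning both losses from one function makes it harder to forget the validation half, and the mode switches live next to the phases they control.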

What To Inspect Every Epoch

Inspect:

  • training loss
  • validation loss
  • validation metric
  • best checkpoint so far
  • whether train() and eval() were switched correctly

If you only look at the last epoch, you can easily keep the wrong checkpoint.
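Tracking the best checkpoint is only a few lines of state. A minimal sketch, assuming caller-supplied train_one_epoch and validate callables (names are illustrative) and a higher-is-better metric such as accuracy:

```python
import copy

import torch


def fit_with_best_checkpoint(model, num_epochs, train_one_epoch, validate):
    """Train for num_epochs, then restore the best-validation weights.

    train_one_epoch(model) and validate(model) are caller-supplied;
    validate must return a scalar where higher is better.
    """
    best_metric = float("-inf")
    best_state = copy.deepcopy(model.state_dict())
    for epoch in range(num_epochs):
        train_one_epoch(model)
        val_metric = validate(model)
        if val_metric > best_metric:
            best_metric = val_metric
            # Deep-copy: state_dict tensors alias the live parameters,
            # so later updates would silently mutate the snapshot.
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # restore best, not last
    return best_metric
```

The deep copy is the part people forget: without it, the "saved" checkpoint tracks the live weights and the last epoch wins anyway.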

Start With These Questions

Before debugging anything fancy, ask:

  1. are gradients being zeroed at the right time
  2. is model.train() used during training
  3. is model.eval() used during validation
  4. is validation wrapped in torch.no_grad()
  5. is the saved checkpoint tied to the best held-out metric

If any of those are wrong, the loop itself is still broken.
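Question 1 can be verified directly: backward() adds into .grad rather than overwriting it, so skipping the zeroing step silently accumulates gradients across batches. A small demonstration:

```python
import torch

model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)

# Two backward passes without zeroing: the gradients add up.
model(x).sum().backward()
g1 = model.weight.grad.clone()
model(x).sum().backward()
assert torch.allclose(model.weight.grad, 2 * g1)

# Zeroing between passes restores the single-batch gradient.
model.zero_grad(set_to_none=True)
model(x).sum().backward()
assert torch.allclose(model.weight.grad, g1)
```

Accumulation is a feature when done deliberately (simulating larger batches); done accidentally, it scales the effective learning rate without telling you.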

Failure Pattern

The most common failure is blaming the model when the loop is the real problem.

Typical examples:

  • validation still running in training mode
  • gradients accumulating across batches accidentally
  • checkpointing the last epoch instead of the best validation epoch
  • reading only training loss and assuming learning is healthy

Common Mistakes

  • shuffling or leaking the validation loader
  • using the training split as validation by accident
  • applying softmax before CrossEntropyLoss
  • validating with gradients enabled and holding onto graphs
  • clipping gradients before backward()
  • saving only the model weights and not the best validation decision
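The clipping mistake is an ordering problem: torch.nn.utils.clip_grad_norm_ operates on .grad, which only exists after backward(), and it must run before optimizer.step() for the clipped gradients to actually be used. A minimal sketch of the correct placement:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad(set_to_none=True)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                   # gradients now exist
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                                  # uses the clipped grads
```

Called before backward(), the same line is a silent no-op: .grad is still None, so nothing is clipped and the step proceeds with unclipped gradients.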

A Good Loop Note

After one run, the learner should be able to say:

  • what the best checkpoint was
  • what the training and validation curves implied
  • whether the loop was stable enough to trust
  • what failure pattern, if any, appeared first

Practice

  1. Build a minimal training and validation loop for one classifier.
  2. Add checkpointing based on best validation metric.
  3. Deliberately leave model.eval() out once and explain what changes.
  4. Explain why the best checkpoint and the last checkpoint can differ.

Longer Connection

Continue with:

  • MLP Training Baseline for the smallest useful neural baseline
  • Optimizers and Regularization for recipe choices
  • Transfer and Fine-Tuning when the loop starts controlling a pretrained backbone