
PyTorch Training Loops

What This Is

This page is about one rule:

  • if the training loop is not trustworthy, nothing downstream is trustworthy

A healthy loop is not just forward, backward, step. It is mode control, gradient hygiene, validation separation, checkpoint choice, and enough inspection to know whether learning is actually happening.

When You Use It

  • building the first neural baseline
  • debugging a stalled or unstable run
  • adding validation and checkpointing to a toy model
  • fine-tuning a pretrained model without breaking evaluation discipline

The Core Loop

The loop should stay conceptually simple:

model.train()
for x, y in train_loader:
    optimizer.zero_grad(set_to_none=True)
    logits = model(x)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

Then validation runs separately:

model.eval()
with torch.no_grad():
    ...

That train/validation boundary is the core discipline.
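Put together, one epoch that respects that boundary can be sketched as follows. This is a minimal sketch: the run_epoch name and the per-sample loss averaging are illustrative, and the model, loaders, loss function, and optimizer are assumed to come from elsewhere.

```python
import torch


def run_epoch(model, train_loader, val_loader, loss_fn, optimizer):
    # Training phase: train mode (dropout/batchnorm active), gradients on.
    model.train()
    train_loss = 0.0
    for x, y in train_loader:
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item() * x.size(0)

    # Validation phase: eval mode, no graph built, no parameter updates.
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for x, y in val_loader:
            val_loss += loss_fn(model(x), y).item() * x.size(0)

    # Average per sample so the two numbers are comparable across splits.
    return (train_loss / len(train_loader.dataset),
            val_loss / len(val_loader.dataset))
```

Returning both losses from one function makes it harder to forget the validation half, and the mode switches live next to the phases they control.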

What To Inspect Every Epoch

Inspect:

  • training loss
  • validation loss
  • validation metric
  • best checkpoint so far
  • whether train() and eval() were switched correctly

If you only look at the last epoch, you can easily keep the wrong checkpoint.
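Tracking the best checkpoint is only a few lines of state. A minimal sketch, assuming caller-supplied train_one_epoch and validate callables (names are illustrative) and a higher-is-better metric such as accuracy:

```python
import copy

import torch


def fit_with_best_checkpoint(model, num_epochs, train_one_epoch, validate):
    """Train for num_epochs, then restore the best-validation weights.

    train_one_epoch(model) and validate(model) are caller-supplied;
    validate must return a scalar where higher is better.
    """
    best_metric = float("-inf")
    best_state = copy.deepcopy(model.state_dict())
    for epoch in range(num_epochs):
        train_one_epoch(model)
        val_metric = validate(model)
        if val_metric > best_metric:
            best_metric = val_metric
            # Deep-copy: state_dict tensors alias the live parameters,
            # so later updates would silently mutate the snapshot.
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # restore best, not last
    return best_metric
```

The deep copy is the part people forget: without it, the "saved" checkpoint tracks the live weights and the last epoch wins anyway.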

Start With These Questions

Before debugging anything fancy, ask:

  1. are gradients being zeroed at the right time
  2. is model.train() used during training
  3. is model.eval() used during validation
  4. is validation wrapped in torch.no_grad()
  5. is the saved checkpoint tied to the best held-out metric

If any of those are wrong, the loop itself is still broken.
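Question 1 can be verified directly: backward() adds into .grad rather than overwriting it, so skipping the zeroing step silently accumulates gradients across batches. A small demonstration:

```python
import torch

model = torch.nn.Linear(3, 1)
x = torch.randn(8, 3)

# Two backward passes without zeroing: the gradients add up.
model(x).sum().backward()
g1 = model.weight.grad.clone()
model(x).sum().backward()
assert torch.allclose(model.weight.grad, 2 * g1)

# Zeroing between passes restores the single-batch gradient.
model.zero_grad(set_to_none=True)
model(x).sum().backward()
assert torch.allclose(model.weight.grad, g1)
```

Accumulation is a feature when done deliberately (simulating larger batches); done accidentally, it scales the effective learning rate without telling you.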

Failure Pattern

The most common failure is blaming the model when the loop is the real problem.

Typical examples:

  • validation still running in training mode
  • gradients accumulating across batches accidentally
  • checkpointing the last epoch instead of the best validation epoch
  • reading only training loss and assuming learning is healthy

Common Mistakes

  • shuffling or leaking the validation loader
  • using the training split as validation by accident
  • applying softmax before CrossEntropyLoss
  • validating with gradients enabled and holding onto graphs
  • clipping gradients before backward()
  • saving only the model weights and not the best validation decision
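The clipping mistake is an ordering problem: torch.nn.utils.clip_grad_norm_ operates on .grad, which only exists after backward(), and it must run before optimizer.step() for the clipped gradients to actually be used. A minimal sketch of the correct placement:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(8, 4), torch.randn(8, 1)

optimizer.zero_grad(set_to_none=True)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()                                   # gradients now exist
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()                                  # uses the clipped grads
```

Called before backward(), the same line is a silent no-op: .grad is still None, so nothing is clipped and the step proceeds with unclipped gradients.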

A Good Loop Note

After one run, the learner should be able to say:

  • what the best checkpoint was
  • what the training and validation curves implied
  • whether the loop was stable enough to trust
  • what failure pattern, if any, appeared first

Practice

  1. Build a minimal training and validation loop for one classifier.
  2. Add checkpointing based on best validation metric.
  3. Deliberately leave model.eval() out once and explain what changes.
  4. Explain why the best checkpoint and the last checkpoint can differ.

Longer Connection

Continue with:

  • MLP Training Baseline for the smallest useful neural baseline
  • Optimizers and Regularization for recipe choices
  • Transfer and Fine-Tuning when the loop starts controlling a pretrained backbone