Learning Rate Schedulers¶
Scenario: Fine-Tuning a Language Model¶
You're fine-tuning a large BERT model for sentiment analysis. A high learning rate would erase pretrained knowledge—use schedulers with warmup and decay to gently adapt the model without catastrophic forgetting.
What This Is¶
A learning rate scheduler adjusts the learning rate during training. The right schedule can make the difference between a model that converges quickly and one that oscillates or stalls.
Objectives¶
- Understand how learning rate schedules improve training stability and convergence
- Implement common schedulers like StepLR, CosineAnnealingLR, and OneCycleLR
- Apply warmup and decay strategies for fine-tuning pretrained models
- Monitor and adjust schedules based on training metrics
When You Use It¶
- fine-tuning a pretrained model where a fixed high rate would destroy learned features
- training a large model where the learning rate should decrease as the loss plateaus
- using warmup to stabilize early training before the optimizer finds a good trajectory
- comparing training runs and needing reproducible schedules
Tooling¶
- `torch.optim.lr_scheduler.StepLR` — reduce LR by a factor every N epochs
- `torch.optim.lr_scheduler.CosineAnnealingLR` — smooth cosine decay
- `torch.optim.lr_scheduler.ReduceLROnPlateau` — reduce LR when a metric stops improving
- `torch.optim.lr_scheduler.OneCycleLR` — warmup then annealing in one policy
- `torch.optim.lr_scheduler.LinearLR` — linear warmup or decay
- `torch.optim.lr_scheduler.SequentialLR` — chain multiple schedulers
Common Schedules¶
| Schedule | Pattern | Best For |
|---|---|---|
| StepLR | drop by factor every N epochs | simple baselines |
| CosineAnnealing | smooth cosine curve to a minimum | general training |
| ReduceLROnPlateau | drop when metric stalls | when you watch validation loss |
| OneCycleLR | warmup → high → anneal | fast convergence with super-convergence |
| LinearLR | linear ramp up or down | warmup phase |
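To see two of these patterns concretely, you can step schedulers on a dummy optimizer and record the learning rate each epoch. This is a minimal sketch; the throwaway parameter and the base LR of 0.1 are assumptions for illustration.

```python
import torch

# Dummy parameter so we can build a real optimizer without a model.
param = torch.nn.Parameter(torch.zeros(1))

def lr_trace(make_scheduler, epochs=30):
    """Step a scheduler for `epochs` epochs and record the LR before each step."""
    opt = torch.optim.SGD([param], lr=0.1)
    sched = make_scheduler(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(opt.param_groups[0]["lr"])
        opt.step()      # optimizer first, to avoid PyTorch's ordering warning
        sched.step()
    return lrs

step = lr_trace(lambda o: torch.optim.lr_scheduler.StepLR(o, step_size=10, gamma=0.5))
cosine = lr_trace(lambda o: torch.optim.lr_scheduler.CosineAnnealingLR(o, T_max=30))

print(step[0], step[10], step[20])  # halves every 10 epochs: 0.1, 0.05, 0.025
print(cosine[-1])                   # cosine decays smoothly toward zero
```

Plotting these traces (e.g. with matplotlib) makes the staircase-versus-curve difference in the table immediately visible.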
Quick Quiz¶
1. What scheduler would you use for a simple baseline that drops the learning rate by half every 10 epochs?
    a) CosineAnnealingLR
    b) StepLR
    c) ReduceLROnPlateau
    d) OneCycleLR
2. When should you use ReduceLROnPlateau instead of a fixed schedule?
    a) When you want a smooth cosine curve
    b) When the metric stops improving
    c) For fine-tuning pretrained models
    d) For super-convergence
3. What is the key difference between OneCycleLR and other schedulers?
    a) It uses cosine annealing
    b) It steps per batch, not per epoch
    c) It reduces LR when loss stalls
    d) It has a linear warmup
Minimal Examples¶
StepLR¶
```python
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # once per epoch, after the optimizer has stepped
```
Cosine Annealing¶
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)
```
OneCycleLR (per-batch stepping)¶
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=num_epochs
)
for epoch in range(num_epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # called per batch, not per epoch
```
ReduceLROnPlateau¶
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, valid_loader)
    scheduler.step(val_loss)  # pass the metric
```
Warmup Pattern¶
```python
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[5])
```
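A quick way to verify the warmup handoff is to run the chained scheduler against a dummy optimizer and inspect the learning rate at the start, at the milestone, and at the end. The dummy parameter and base LR of 0.01 here are assumptions for the sketch.

```python
import torch

# Dummy parameter so the optimizer has something to own.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.01)

# 5 warmup epochs starting at 10% of the base LR, then a 45-epoch cosine decay.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=45)
sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[5])

lrs = []
for epoch in range(50):
    lrs.append(opt.param_groups[0]["lr"])
    opt.step()
    sched.step()

# LR ramps from 0.001 up to 0.01 over the first 5 epochs, then decays.
print(f"start: {lrs[0]:.4f}  after warmup: {lrs[5]:.4f}  end: {lrs[-1]:.6f}")
```

Checking these three values catches the most common chaining mistake: a milestone that does not line up with the warmup's `total_iters`.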
How To Inspect The Learning Rate¶
```python
current_lr = optimizer.param_groups[0]["lr"]
print(f"Epoch {epoch}: lr = {current_lr:.6f}")
```
Always log the learning rate. Many training bugs are invisible without it.
Failure Pattern¶
Using a fixed high learning rate throughout training and wondering why the loss oscillates after initial progress.
Another failure: calling scheduler.step() at the wrong frequency — per-batch schedulers like OneCycleLR must be stepped every batch, while epoch schedulers like StepLR must be stepped every epoch.
Common Mistakes¶
- calling `scheduler.step()` before `optimizer.step()` (a reversed order that PyTorch warns about)
- using `ReduceLROnPlateau` without passing the monitored metric to `scheduler.step(metric)`
- setting warmup too long, so the model barely trains during early epochs
- forgetting that `OneCycleLR` steps per batch, not per epoch
Practice¶
- Compare training with a fixed LR versus cosine annealing and inspect convergence.
- Add a warmup phase and explain why the first few epochs look different.
- Use `ReduceLROnPlateau` and explain what triggers the LR reduction.
- Log the learning rate every epoch and create a simple plot of the schedule.
- Compare `StepLR` and `CosineAnnealingLR` on the same task.
Checkpoint¶
- [ ] Implement StepLR and CosineAnnealingLR in a training loop
- [ ] Add warmup using LinearLR and SequentialLR
- [ ] Monitor learning rate changes and plot the schedule
- [ ] Use ReduceLROnPlateau with validation loss monitoring
- [ ] Explain when to choose each scheduler type
Runnable Example¶
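The following end-to-end sketch ties the pieces together: a training loop with cosine annealing and per-epoch LR logging. The tiny synthetic regression task is an assumption for illustration, chosen so the script runs anywhere without downloading data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data: 256 samples, 10 features (illustrative stand-in
# for a real dataset).
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-4)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()   # optimizer first,
    scheduler.step()   # then the scheduler
    losses.append(loss.item())
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch:2d}  loss {losses[-1]:.4f}  lr {lr:.5f}")
```

By the final epoch the learning rate has decayed to `eta_min`, and the logged LR column makes the cosine curve visible at a glance.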
Longer Connection¶
Continue with PyTorch Training Loops for the full loop structure, and Transfer and Fine-Tuning where schedulers matter most.
Further Reading¶
- PyTorch LR Scheduler Documentation
- "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (Smith & Topin, 2018)
- "Cyclical Learning Rates for Training Neural Networks" (Smith, 2017)