Learning Rate Schedulers¶
Scenario: Fine-Tuning a Language Model¶
You're fine-tuning a large BERT model for sentiment analysis. A high learning rate would erase pretrained knowledge—use schedulers with warmup and decay to gently adapt the model without catastrophic forgetting.
What This Is¶
A learning rate scheduler adjusts the learning rate during training. The right schedule can make the difference between a model that converges quickly and one that oscillates or stalls.
Objectives¶
- Understand how learning rate schedules improve training stability and convergence
- Implement common schedulers like StepLR, CosineAnnealingLR, and OneCycleLR
- Apply warmup and decay strategies for fine-tuning pretrained models
- Monitor and adjust schedules based on training metrics
When You Use It¶
- fine-tuning a pretrained model where a fixed high rate would destroy learned features
- training a large model where the learning rate should decrease as the loss plateaus
- using warmup to stabilize early training before the optimizer finds a good trajectory
- comparing training runs and needing reproducible schedules
Tooling¶
- `torch.optim.lr_scheduler.StepLR` — reduce LR by a factor every N epochs
- `torch.optim.lr_scheduler.CosineAnnealingLR` — smooth cosine decay
- `torch.optim.lr_scheduler.ReduceLROnPlateau` — reduce LR when a metric stops improving
- `torch.optim.lr_scheduler.OneCycleLR` — warmup then annealing in one policy
- `torch.optim.lr_scheduler.LinearLR` — linear warmup or decay
- `torch.optim.lr_scheduler.SequentialLR` — chain multiple schedulers
Common Schedules¶
| Schedule | Pattern | Best For |
|---|---|---|
| StepLR | drop by factor every N epochs | simple baselines |
| CosineAnnealing | smooth cosine curve to a minimum | general training |
| ReduceLROnPlateau | drop when metric stalls | when you watch validation loss |
| OneCycleLR | warmup → high → anneal | fast convergence with super-convergence |
| LinearLR | linear ramp up or down | warmup phase |
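To see two of these patterns concretely, you can step schedulers on a dummy optimizer and record the learning rate each epoch. This is a minimal sketch; the throwaway parameter and the base LR of 0.1 are assumptions for illustration.

```python
import torch

# Dummy parameter so we can build a real optimizer without a model.
param = torch.nn.Parameter(torch.zeros(1))

def lr_trace(make_scheduler, epochs=30):
    """Step a scheduler for `epochs` epochs and record the LR before each step."""
    opt = torch.optim.SGD([param], lr=0.1)
    sched = make_scheduler(opt)
    lrs = []
    for _ in range(epochs):
        lrs.append(opt.param_groups[0]["lr"])
        opt.step()      # optimizer first, to avoid PyTorch's ordering warning
        sched.step()
    return lrs

step = lr_trace(lambda o: torch.optim.lr_scheduler.StepLR(o, step_size=10, gamma=0.5))
cosine = lr_trace(lambda o: torch.optim.lr_scheduler.CosineAnnealingLR(o, T_max=30))

print(step[0], step[10], step[20])  # halves every 10 epochs: 0.1, 0.05, 0.025
print(cosine[-1])                   # cosine decays smoothly toward zero
```

Plotting these traces (e.g. with matplotlib) makes the staircase-versus-curve difference in the table immediately visible.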
Quick Quiz¶
1. What scheduler would you use for a simple baseline that drops the learning rate by half every 10 epochs?
    a) CosineAnnealingLR
    b) StepLR
    c) ReduceLROnPlateau
    d) OneCycleLR
2. When should you use ReduceLROnPlateau instead of a fixed schedule?
    a) When you want a smooth cosine curve
    b) When the metric stops improving
    c) For fine-tuning pretrained models
    d) For super-convergence
3. What is the key difference between OneCycleLR and other schedulers?
    a) It uses cosine annealing
    b) It steps per batch, not per epoch
    c) It reduces LR when loss stalls
    d) It has a linear warmup
Minimal Examples¶
StepLR¶
```python
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()  # once per epoch, after the optimizer has stepped
```
Cosine Annealing¶
```python
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)
```
OneCycleLR (per-batch stepping)¶
```python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=num_epochs
)
for epoch in range(num_epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # called per batch, not per epoch
```
ReduceLROnPlateau¶
```python
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)
for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, valid_loader)
    scheduler.step(val_loss)  # pass the metric
```
Warmup Pattern¶
```python
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[5])
```
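A quick way to verify the warmup handoff is to run the chained scheduler against a dummy optimizer and inspect the learning rate at the start, at the milestone, and at the end. The dummy parameter and base LR of 0.01 here are assumptions for the sketch.

```python
import torch

# Dummy parameter so the optimizer has something to own.
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.01)

# 5 warmup epochs starting at 10% of the base LR, then a 45-epoch cosine decay.
warmup = torch.optim.lr_scheduler.LinearLR(opt, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=45)
sched = torch.optim.lr_scheduler.SequentialLR(opt, [warmup, cosine], milestones=[5])

lrs = []
for epoch in range(50):
    lrs.append(opt.param_groups[0]["lr"])
    opt.step()
    sched.step()

# LR ramps from 0.001 up to 0.01 over the first 5 epochs, then decays.
print(f"start: {lrs[0]:.4f}  after warmup: {lrs[5]:.4f}  end: {lrs[-1]:.6f}")
```

Checking these three values catches the most common chaining mistake: a milestone that does not line up with the warmup's `total_iters`.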
How To Inspect The Learning Rate¶
```python
current_lr = optimizer.param_groups[0]["lr"]
print(f"Epoch {epoch}: lr = {current_lr:.6f}")
```
Always log the learning rate. Many training bugs are invisible without it.
Failure Pattern¶
Using a fixed high learning rate throughout training and wondering why the loss oscillates after initial progress.
Another failure: calling scheduler.step() at the wrong frequency — per-batch schedulers like OneCycleLR must be stepped every batch, while epoch schedulers like StepLR must be stepped every epoch.
Common Mistakes¶
- calling `scheduler.step()` before `optimizer.step()` (a reversed order that PyTorch warns about)
- using `ReduceLROnPlateau` without passing the monitored metric to `scheduler.step(metric)`
- setting warmup too long, so the model barely trains during early epochs
- forgetting that `OneCycleLR` steps per batch, not per epoch
Practice¶
- Compare training with a fixed LR versus cosine annealing and inspect convergence.
- Add a warmup phase and explain why the first few epochs look different.
- Use `ReduceLROnPlateau` and explain what triggers the LR reduction.
- Log the learning rate every epoch and create a simple plot of the schedule.
- Compare `StepLR` and `CosineAnnealingLR` on the same task.
Checkpoint¶
- [ ] Implement StepLR and CosineAnnealingLR in a training loop
- [ ] Add warmup using LinearLR and SequentialLR
- [ ] Monitor learning rate changes and plot the schedule
- [ ] Use ReduceLROnPlateau with validation loss monitoring
- [ ] Explain when to choose each scheduler type
Runnable Example¶
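The following end-to-end sketch ties the pieces together: a training loop with cosine annealing and per-epoch LR logging. The tiny synthetic regression task is an assumption for illustration, chosen so the script runs anywhere without downloading data.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic regression data: 256 samples, 10 features (illustrative stand-in
# for a real dataset).
X = torch.randn(256, 10)
true_w = torch.randn(10, 1)
y = X @ true_w + 0.1 * torch.randn(256, 1)

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-4)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()   # optimizer first,
    scheduler.step()   # then the scheduler
    losses.append(loss.item())
    lr = optimizer.param_groups[0]["lr"]
    print(f"epoch {epoch:2d}  loss {losses[-1]:.4f}  lr {lr:.5f}")
```

By the final epoch the learning rate has decayed to `eta_min`, and the logged LR column makes the cosine curve visible at a glance.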
Longer Connection¶
Continue with PyTorch Training Loops for the full loop structure, and Transfer and Fine-Tuning where schedulers matter most.
Further Reading¶
- PyTorch LR Scheduler Documentation
- "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (Smith & Topin, 2018)
- "Cyclical Learning Rates for Training Neural Networks" (Smith, 2017)