Learning Rate Schedulers

Scenario: Fine-Tuning a Language Model

You're fine-tuning a large BERT model for sentiment analysis. A high learning rate would erase pretrained knowledge—use schedulers with warmup and decay to gently adapt the model without catastrophic forgetting.
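A warmup-then-decay schedule for this scenario can be assembled from stock PyTorch schedulers. A minimal sketch, assuming a tiny linear head stands in for the pretrained encoder and the step counts (100 warmup, 900 decay) are illustrative:

```python
import torch
from torch import nn

# Stand-in for the pretrained model; in practice you would load BERT here.
model = nn.Linear(768, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Linear warmup from 1% of the base LR over the first 100 steps...
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=100
)
# ...then linear decay from the base LR down to zero over 900 steps.
decay = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0, total_iters=900
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup, decay], milestones=[100]
)
```

Step this scheduler once per batch, after optimizer.step(), so the warmup and decay track optimization steps rather than epochs.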

What This Is

A learning rate scheduler adjusts the learning rate during training. The right schedule can make the difference between a model that converges quickly and one that oscillates or stalls.

Objectives

  • Understand how learning rate schedules improve training stability and convergence
  • Implement common schedulers like StepLR, CosineAnnealingLR, and OneCycleLR
  • Apply warmup and decay strategies for fine-tuning pretrained models
  • Monitor and adjust schedules based on training metrics

When You Use It

  • fine-tuning a pretrained model where a fixed high rate would destroy learned features
  • training a large model where the learning rate should decrease as the loss plateaus
  • using warmup to stabilize early training before the optimizer finds a good trajectory
  • comparing training runs and needing reproducible schedules

Tooling

  • torch.optim.lr_scheduler.StepLR — reduce LR by a factor every N epochs
  • torch.optim.lr_scheduler.CosineAnnealingLR — smooth cosine decay
  • torch.optim.lr_scheduler.ReduceLROnPlateau — reduce LR when a metric stops improving
  • torch.optim.lr_scheduler.OneCycleLR — warmup then annealing in one policy
  • torch.optim.lr_scheduler.LinearLR — linear warmup or decay
  • torch.optim.lr_scheduler.SequentialLR — chain multiple schedulers

Common Schedules

Schedule          | Pattern                          | Best For
------------------|----------------------------------|--------------------------------------
StepLR            | drop by a factor every N epochs  | simple baselines
CosineAnnealingLR | smooth cosine curve to a minimum | general training
ReduceLROnPlateau | drop when a metric stalls        | runs where you watch validation loss
OneCycleLR        | warmup → high LR → anneal        | fast convergence (super-convergence)
LinearLR          | linear ramp up or down           | warmup phases

Quick Quiz

  1. What scheduler would you use for a simple baseline that drops the learning rate by half every 10 epochs?
    a) CosineAnnealingLR
    b) StepLR
    c) ReduceLROnPlateau
    d) OneCycleLR

  2. When should you use ReduceLROnPlateau instead of a fixed schedule?
    a) When you want a smooth cosine curve
    b) When the metric stops improving
    c) For fine-tuning pretrained models
    d) For super-convergence

  3. What is the key difference between OneCycleLR and other schedulers?
    a) It uses cosine annealing
    b) It steps per batch, not per epoch
    c) It reduces LR when loss stalls
    d) It has a linear warmup

Minimal Examples

StepLR

import torch  # these examples assume optimizer, model, and loaders are already defined

scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    scheduler.step()

Cosine Annealing

scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

OneCycleLR (per-batch stepping)

scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, steps_per_epoch=len(train_loader), epochs=num_epochs
)

for epoch in range(num_epochs):
    for batch in train_loader:
        train_step(model, batch, optimizer)
        scheduler.step()  # called per batch, not per epoch

ReduceLROnPlateau

scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5
)

for epoch in range(num_epochs):
    train_one_epoch(model, train_loader, optimizer)
    val_loss = evaluate(model, valid_loader)
    scheduler.step(val_loss)  # pass the metric

Warmup Pattern

# 5 warmup epochs, then cosine decay for the remaining 45 (milestone = warmup length)
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.1, total_iters=5)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=45)
scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[5])

How To Inspect The Learning Rate

current_lr = optimizer.param_groups[0]["lr"]  # or scheduler.get_last_lr()[0] on most schedulers
print(f"Epoch {epoch}: lr = {current_lr:.6f}")

Always log the learning rate. Many training bugs are invisible without it.
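One way to make a schedule visible before committing to a long run is to dry-run it against a throwaway optimizer and record the values. A sketch (the plotting itself is left to whatever tool you log with):

```python
import torch

# Dummy parameter so the optimizer has something to hold; no real training happens.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50, eta_min=1e-6)

lrs = []
for epoch in range(50):
    lrs.append(optimizer.param_groups[0]["lr"])  # record before stepping
    optimizer.step()
    scheduler.step()

# lrs now traces the full cosine curve from 0.1 down toward eta_min.
print(f"start={lrs[0]:.4f} mid={lrs[25]:.4f} end={lrs[-1]:.2e}")
```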

Failure Pattern

Using a fixed high learning rate throughout training and wondering why the loss oscillates after initial progress.

Another failure: calling scheduler.step() at the wrong frequency — per-batch schedulers like OneCycleLR must be stepped every batch, while epoch schedulers like StepLR must be stepped every epoch.
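The frequency mistake is easy to reproduce with a throwaway optimizer; in this sketch the 20 epochs and 100 batches per epoch are illustrative numbers:

```python
import torch

def final_lr(num_step_calls):
    param = torch.nn.Parameter(torch.zeros(1))
    optimizer = torch.optim.SGD([param], lr=0.1)
    # StepLR is an epoch scheduler: it expects exactly one step() per epoch.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    for _ in range(num_step_calls):
        optimizer.step()
        scheduler.step()
    return optimizer.param_groups[0]["lr"]

# Stepped once per epoch for 20 epochs: two halvings, so lr ends at 0.025.
per_epoch = final_lr(20)
# Mistakenly stepped every batch (20 epochs x 100 batches): 200 halvings, lr collapses.
per_batch = final_lr(20 * 100)
print(per_epoch, per_batch)
```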

Common Mistakes

  • calling scheduler.step() before optimizer.step(); since PyTorch 1.1 the optimizer must be stepped first, or the first value of the schedule is skipped
  • using ReduceLROnPlateau without passing the monitored metric to scheduler.step(metric)
  • setting warmup too long so the model barely trains during early epochs
  • forgetting that OneCycleLR is per-batch, not per-epoch

Practice

  1. Compare training with a fixed LR versus cosine annealing and inspect convergence.
  2. Add a warmup phase and explain why the first few epochs look different.
  3. Use ReduceLROnPlateau and explain what triggers the LR reduction.
  4. Log the learning rate every epoch and create a simple plot of the schedule.
  5. Compare StepLR and CosineAnnealingLR on the same task.

Checkpoint

  • [ ] Implement StepLR and CosineAnnealingLR in a training loop
  • [ ] Add warmup using LinearLR and SequentialLR
  • [ ] Monitor learning rate changes and plot the schedule
  • [ ] Use ReduceLROnPlateau with validation loss monitoring
  • [ ] Explain when to choose each scheduler type

Runnable Example

This example stays local-only for now because the browser runner does not yet include PyTorch.

Longer Connection

Continue with PyTorch Training Loops for the full loop structure, and Transfer and Fine-Tuning where schedulers matter most.

Further Reading

  • PyTorch LR Scheduler Documentation
  • "Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates" (Smith & Topin, 2018)
  • "Cyclical Learning Rates for Training Neural Networks" (Smith, 2017)