PyTorch Optimization Recipes

What This Is

Optimization recipes go beyond basic Adam: they include advanced schedulers, regularization, mixed precision, and parameter-efficient fine-tuning (PEFT). These techniques stabilize training, speed up convergence, and make large models trainable without updating every parameter. The focus is on practical choices for competition or production workflows.

When You Use It

  • training large models that overfit or converge slowly
  • fine-tuning pretrained encoders without updating all parameters
  • speeding up training with mixed precision
  • debugging unstable gradients or plateaus
  • comparing optimizers/schedulers for baselines

Learning Objectives

By the end of this topic, you should be able to:

  • Implement advanced optimizers and schedulers in PyTorch.
  • Apply PEFT techniques like LoRA for efficient fine-tuning.
  • Use mixed precision to accelerate training.
  • Diagnose and fix common optimization issues.
  • Choose recipes based on model size and task.

Tooling

  • torch.optim.AdamW with decoupled weight decay
  • torch.optim.lr_scheduler.CosineAnnealingWarmRestarts for cyclic schedules
  • torch.cuda.amp for mixed precision (FP16; recent PyTorch exposes the same tools under torch.amp)
  • peft library for LoRA/Adapters (install: pip install peft)
  • torch.nn.utils.clip_grad_norm_ for stability
  • torch.optim.lr_scheduler.ReduceLROnPlateau for adaptive decay

Minimal Example

import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Model setup
model = nn.Linear(100, 10)  # toy example
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
scaler = GradScaler()  # handles loss scaling for FP16
criterion = nn.CrossEntropyLoss()

# Training loop with mixed precision; assumes a CUDA device and a
# `dataloader` yielding (inputs, labels) batches
for inputs, labels in dataloader:
    optimizer.zero_grad()

    with autocast():  # FP16 forward pass
        outputs = model(inputs)
        loss = criterion(outputs, labels)

    scaler.scale(loss).backward()  # backward on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()                # adjusts the scale factor for the next step

    scheduler.step()

Key Concepts Explained

Advanced Optimizers

  • AdamW: Default choice for stability; decouples weight decay from the gradient update.
  • Lion: Memory-efficient alternative (from Google); often competitive with AdamW on large models while storing less optimizer state.
  • Adafactor: Scales to very large models; factored second-moment estimates cut optimizer memory, with adaptive per-parameter LR.
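A common AdamW recipe (an addition here, not part of the text above) is to split parameters into groups so weight decay applies to weight matrices but not to biases or normalization parameters. A minimal sketch:

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(100, 64), nn.LayerNorm(64), nn.Linear(64, 10))

# Decay 2-D weight matrices; exempt biases and LayerNorm parameters (all 1-D).
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-4},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```

The 1-D heuristic is a convention, not a rule; some codebases match on parameter names instead.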

Schedulers

  • Cosine Annealing: Smooth decay to minimum; good for convergence.
  • Warm Restarts: Cyclic cosine; helps escape plateaus.
  • ReduceLROnPlateau: Adaptive; reduces LR when validation stalls.
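ReduceLROnPlateau is driven by a metric you pass in, not by step count. A small sketch with simulated validation losses (the loss values are illustrative):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Halve the LR after 2 epochs without improvement in the monitored metric.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

# Simulate a stalled validation loss; the LR drops once patience is exceeded.
for val_loss in [1.0, 0.9, 0.9, 0.9, 0.9]:
    scheduler.step(val_loss)

print(optimizer.param_groups[0]["lr"])  # 0.05 after the reduction
```

Note that `scheduler.step(val_loss)` is typically called once per epoch, after validation, unlike step-wise schedulers.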

PEFT Techniques

  • LoRA: Low-Rank Adaptation; adds trainable low-rank adapters to frozen layers. Cuts trainable parameters dramatically, often by more than 99%.
  • Adapters: Similar; inserts bottleneck layers for task-specific tuning.
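The peft library wires this up automatically; as a from-scratch sketch of the LoRA idea only (the class name and initialization choices here are illustrative, not the peft API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (B A) x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # Low-rank factors: A projects down to rank r, B projects back up.
        # B starts at zero so the adapted layer initially equals the base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 8192 trainable out of 270848 (~3%)
```

With the peft library, the equivalent setup is a `LoraConfig` plus `get_peft_model` on a pretrained model.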

Mixed Precision

  • Uses FP16 for the forward/backward computation while keeping FP32 master weights. Typically speeds up training 2-3x on GPUs with Tensor Cores.
  • Requires GradScaler to handle gradient scaling.
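The device-agnostic torch.autocast makes the mechanics visible even without a GPU. A sketch on CPU with bfloat16 (an assumption for illustration; bfloat16 keeps FP32's exponent range, so it does not need GradScaler):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # parameters stay FP32
x = torch.randn(2, 16)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)  # the matmul runs in bfloat16 under autocast

print(out.dtype, model.weight.dtype)
```

The output tensor is low precision, but the stored weights remain FP32: autocast casts per-op inputs, not the model.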

What Can Go Wrong

  • Gradient Explosion: Clip norms or reduce LR.
  • NaN Losses: Check for bad data or unstable activations; use gradient clipping.
  • Slow Convergence: Try warm restarts or adaptive schedulers.
  • PEFT Mismatch: Ensure adapter ranks and target modules match the model architecture.
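Gradient clipping is a one-liner between backward and step. A minimal sketch (the toy loss is illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

inputs = torch.randn(4, 100)
loss = model(inputs).pow(2).mean()
loss.backward()

# Clip the global gradient norm to 1.0; returns the pre-clip norm,
# which is worth logging to spot exploding gradients.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

Under mixed precision, call `scaler.unscale_(optimizer)` before clipping so the norm is computed on true gradients rather than scaled ones, then use `scaler.step(optimizer)` as usual.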

Inspection Habits

  • Monitor LR decay and loss curves.
  • Check gradient norms (log them).
  • Compare train/val loss for overfitting.
  • Profile memory and speed with torch.profiler.
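Logging the global gradient norm each step is cheap and catches instability early. One way to compute it after backward (a sketch with a toy model):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Linear(10, 1))
model(torch.randn(8, 10)).mean().backward()

# Global L2 norm over all parameter gradients, as one loggable scalar.
grad_norm = torch.norm(
    torch.stack([p.grad.norm() for p in model.parameters() if p.grad is not None])
)
print(f"grad norm: {float(grad_norm):.4f}")
```

A sudden spike in this scalar usually precedes NaN losses by a few steps, which is why it belongs on the same dashboard as the loss curve.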

Quick Quiz

  1. What is LoRA, and when should you use it?
    a) A type of regularization; use for small models
    b) Low-Rank Adaptation; use for efficient fine-tuning of large models
    c) A learning rate scheduler; use for cyclic decay
    d) A mixed precision technique; use for faster training

  2. How does mixed precision speed up training?
    a) By using lower precision floats for computations
    b) By increasing batch size
    c) By reducing model parameters
    d) By using simpler optimizers

  3. When would you use CosineAnnealingWarmRestarts?
    a) For simple exponential decay
    b) For cyclic LR decay to escape local minima
    c) For adaptive LR based on validation
    d) For warmup phases only

Checkpoint

  • [ ] Implement AdamW with decoupled weight decay
  • [ ] Apply CosineAnnealingWarmRestarts for cyclic learning rates
  • [ ] Set up mixed precision training with GradScaler
  • [ ] Implement LoRA for parameter-efficient fine-tuning
  • [ ] Use gradient clipping and monitor training stability

Further Reading

See examples/deep-learning-recipes/pytorch_optimization_recipes.py for a complete workflow.