Generative Adversarial Networks
What This Is
A Generative Adversarial Network (GAN) is a pair of neural networks trained against each other:
- a generator `G: z → x̂` that maps random noise to synthetic samples
- a discriminator `D: x → p(real)` that tries to tell real samples from synthetic ones
Training alternates between improving the discriminator (to get better at catching fakes) and improving the generator (to fool the discriminator). In equilibrium, the generator produces samples the discriminator cannot distinguish from real ones.
GANs ruled image generation from 2014 until around 2022. Diffusion models have since taken that crown for controllable generation, and GANs are now a specialized tool rather than the default. This topic teaches what they are, when they are still the right tool, and how to spot the two training failures that cost most GAN projects their time.
When You Use It
- you need fast inference — GAN generation is a single forward pass, unlike diffusion's many
- you want sharp, high-frequency samples (textures, faces, specific photo styles)
- the task is narrow and the dataset is clean (StyleGAN on faces is still arguably state of the art for photorealistic single-domain generation)
- you need a learned discriminator as part of a larger objective (perceptual losses, cycle losses for image translation)
Do Not Use It When
- the data is heterogeneous or long-tailed — GANs struggle with mode coverage
- you need controllable, text-conditional generation — diffusion is usually better
- training stability is a first-order concern — GAN training is famously touchy; prefer diffusion or VAEs if you cannot tolerate long debugging cycles
- likelihoods or density estimates are needed — GANs do not estimate them
The Minimax Objective
The classical GAN loss is:
min_G max_D E_{x ~ p_data}[ log D(x) ] + E_{z ~ p_z}[ log (1 - D(G(z))) ]
In plain terms: the discriminator wants to output 1 on real and 0 on fake; the generator wants the discriminator to output 1 on its fakes.
In practice, the log(1 - D(G(z))) term gives vanishing gradients when the discriminator is strong, so the generator is trained with the "non-saturating" loss instead:
L_G = -E[ log D(G(z)) ]
This is what actually gets used everywhere. The original formulation is mathematically cleaner; the non-saturating version is what trains.
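To see why the swap matters, write the discriminator's output on a fake as D = sigmoid(a), where a is the logit. The sketch below assumes only that parameterization and compares the gradient each generator loss sends back through a; the function names are illustrative, not from any library.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a: float) -> float:
    # Generator minimizes log(1 - D) with D = sigmoid(a).
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def non_saturating_grad(a: float) -> float:
    # Generator minimizes -log D with D = sigmoid(a).
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A very negative logit means the discriminator confidently rejects the fake.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(
        f"logit={a:+.1f}  D={sigmoid(a):.4f}  "
        f"saturating={saturating_grad(a):+.4f}  "
        f"non_saturating={non_saturating_grad(a):+.4f}"
    )
```

When the discriminator is confident (very negative logit), the saturating gradient shrinks toward zero while the non-saturating gradient approaches -1, which is exactly the regime where the generator most needs a learning signal.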
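A Minimal GAN In PyTorch
A barebones skeleton using fully connected networks sized for flattened 28×28 images (a convolutional, DCGAN-style model would follow the same training step):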
```python
import torch
from torch import nn
import torch.nn.functional as F


class Generator(nn.Module):
    def __init__(self, z_dim=64, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    def __init__(self, in_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),  # raw logit; the sigmoid is folded into the loss
        )

    def forward(self, x):
        return self.net(x)


def gan_step(x_real, G, D, opt_G, opt_D, z_dim=64):
    B = x_real.size(0)

    # --- Discriminator ---
    z = torch.randn(B, z_dim, device=x_real.device)
    x_fake = G(z).detach()
    d_real = D(x_real)
    d_fake = D(x_fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad()
    loss_d.backward()
    opt_D.step()

    # --- Generator (non-saturating) ---
    z = torch.randn(B, z_dim, device=x_real.device)
    x_fake = G(z)
    d_fake = D(x_fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad()
    loss_g.backward()
    opt_G.step()

    return loss_d.item(), loss_g.item()
```
Points that matter:
- use `LeakyReLU(0.2)` in the discriminator and plain ReLU in the generator; a tradition that still holds up
- the generator's last activation is `Tanh` and images are normalized to `[-1, 1]`; a mismatch between these two ranges is a common bug
- `detach()` the fake when training `D`; otherwise the discriminator loss backpropagates into `G`
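The range point is easy to get wrong in the data pipeline, so here is a minimal sketch of the scaling in both directions. It assumes images arrive as `uint8` NumPy arrays in `[0, 255]`; the helper names `to_tanh_range` and `from_tanh_range` are hypothetical.

```python
import numpy as np

def to_tanh_range(img_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 pixels in [0, 255] to float32 values in [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def from_tanh_range(x: np.ndarray) -> np.ndarray:
    """Map generator-range values in [-1, 1] back to uint8 pixels for viewing."""
    return np.rint((np.clip(x, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

pixels = np.array([0, 64, 128, 255], dtype=np.uint8)
scaled = to_tanh_range(pixels)       # what the discriminator should see as "real"
restored = from_tanh_range(scaled)   # what you would save or plot
print(scaled.min(), scaled.max(), restored)
```

With torchvision, the same scaling is commonly obtained by following `ToTensor()` (which yields `[0, 1]`) with `Normalize((0.5,), (0.5,))`. If real images stay in `[0, 1]` while the generator emits `[-1, 1]`, the discriminator can separate the two on range alone.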
Variants That Changed Things
| variant | what it changed | why it matters |
|---|---|---|
| DCGAN (Radford) | convolutional generator/discriminator, specific architecture choices | first stable image GAN |
| WGAN / WGAN-GP (Arjovsky) | Wasserstein distance + gradient penalty | much more stable training; clearer correlation between loss and quality |
| Spectral Normalization | bounds discriminator Lipschitz constant | standard stability aid |
| Progressive Growing / StyleGAN | grow resolution over training, style-based generator | high-res face generation |
| BigGAN | large batch, class conditioning, truncation | high-res conditional generation |
| CycleGAN | unpaired image translation | practical for style transfer |
| Pix2Pix | paired image-to-image translation | practical for conditional mapping |
For a new GAN project today, start from WGAN-GP or spectral normalization and a DCGAN-ish architecture. Do not try to invent stability from scratch.
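As a starting point for the WGAN-GP route, the following is a sketch of the gradient penalty term under stated assumptions: the discriminator acts as a critic that outputs an unbounded score, the penalty is evaluated on random interpolates between real and fake samples, and the weight of 10 is the commonly used default rather than something this text prescribes. The function names are illustrative.

```python
import torch
from torch import nn

def gradient_penalty(critic: nn.Module, x_real: torch.Tensor, x_fake: torch.Tensor) -> torch.Tensor:
    """Mean of (||d critic / d x_hat||_2 - 1)^2 over random real/fake interpolates."""
    batch = x_real.size(0)
    # One mixing coefficient per sample, broadcast over the remaining dimensions.
    eps = torch.rand(batch, *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real.detach() + (1.0 - eps) * x_fake.detach()).requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True keeps the graph so the penalty itself can be backpropagated.
    (grads,) = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.reshape(batch, -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def critic_loss(critic, x_real, x_fake, gp_weight: float = 10.0):
    """Wasserstein critic loss: raise real scores, lower fake scores, add the penalty."""
    x_fake = x_fake.detach()
    wasserstein = critic(x_fake).mean() - critic(x_real).mean()
    return wasserstein + gp_weight * gradient_penalty(critic, x_real, x_fake)

torch.manual_seed(0)
critic = nn.Sequential(nn.Linear(4, 16), nn.LeakyReLU(0.2), nn.Linear(16, 1))
x_real, x_fake = torch.randn(8, 4), torch.randn(8, 4)
loss = critic_loss(critic, x_real, x_fake)
loss.backward()  # populates critic gradients, including the penalty's second-order terms
```

The spectral-normalization route needs no extra loss term: wrapping each discriminator layer with `torch.nn.utils.spectral_norm` is the usual way to apply it in PyTorch.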
Mode Collapse
The defining GAN failure is mode collapse: the generator learns to produce one (or a few) samples that fool the discriminator reliably, and stops exploring the data distribution. Output diversity collapses; the model "wins" by being a narrow specialist.
Signatures:
- a small number of distinct samples across many `z` draws
- generator loss drops suddenly; discriminator loss rises suddenly; then both stabilize at degenerate values
- interpolation between `z_1` and `z_2` produces nearly the same sample
Fixes (not guarantees):
- use the non-saturating loss, not the original saturating form
- add spectral normalization or gradient penalty to stabilize the discriminator
- use minibatch discrimination — the discriminator sees a batch at once and can penalize similarity across a batch
- use historical buffers — occasionally train the discriminator against past generator outputs, not just the latest
- consider unrolled GANs or other stabilization techniques
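The historical-buffer idea from the list above can be sketched in a few lines. This is one possible design, not a canonical one: the class name `HistoryBuffer`, the capacity of 50, and the 0.5 swap probability are all assumptions chosen for illustration.

```python
import random

class HistoryBuffer:
    """Bounded pool of past generator outputs (hypothetical helper)."""

    def __init__(self, capacity: int = 50, swap_prob: float = 0.5, seed: int = 0):
        self.capacity = capacity
        self.swap_prob = swap_prob
        self.pool = []
        self.rng = random.Random(seed)

    def query(self, fakes):
        """Return the batch to show D: each fresh fake may be swapped for a stored one."""
        out = []
        for fake in fakes:
            if len(self.pool) < self.capacity:
                # Still filling the pool: store the sample and pass it through.
                self.pool.append(fake)
                out.append(fake)
            elif self.rng.random() < self.swap_prob:
                # Hand D an older sample and keep the fresh one for later.
                idx = self.rng.randrange(self.capacity)
                out.append(self.pool[idx])
                self.pool[idx] = fake
            else:
                out.append(fake)
        return out

buffer = HistoryBuffer(capacity=4)
for step in range(5):
    fresh = [f"step{step}_sample{i}" for i in range(4)]
    batch_for_d = buffer.query(fresh)
print(batch_for_d)
```

In a real training loop the stored items would be detached fake tensors, and the returned list would be stacked into the discriminator's fake batch. The point of the design is that the generator cannot escape the discriminator simply by hopping to a new mode, because old modes keep reappearing in the discriminator's training data.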
Evaluation
The same evaluation caveats as diffusion apply:
- FID (Fréchet Inception Distance) is the default quality metric
- Inception Score is historical; do not rely on it alone
- precision and recall (Kynkäänniemi et al.): separates the two questions that FID folds into one number, "are samples on the manifold" (precision) and "does the manifold cover the data" (recall). Mode collapse shows up as low recall with reasonable precision.
- human evaluation: the only thing that catches uncanny-valley artifacts
- classifier-based diagnostics: a classifier trained on real-vs-fake reveals which axes the generator has not yet matched
For mode-coverage diagnostics specifically, precision-and-recall is much more informative than FID.
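A simplified sketch of the k-nearest-neighbour precision and recall estimate shows how the low-recall signature appears. It assumes raw 2-D points stand in for the deep features a real evaluation would use, and the function names and `k=3` are illustrative choices.

```python
import numpy as np

def knn_radii(feats: np.ndarray, k: int) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    # After sorting, column 0 is the zero self-distance, so column k is the k-th neighbour.
    return np.sort(d, axis=1)[:, k]

def coverage(queries: np.ndarray, support: np.ndarray, k: int) -> float:
    """Fraction of queries inside at least one support point's k-NN ball."""
    radii = knn_radii(support, k)
    d = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float((d <= radii[None, :]).any(axis=1).mean())

def precision_recall(real: np.ndarray, fake: np.ndarray, k: int = 3):
    precision = coverage(fake, real, k)  # are fakes on the real manifold?
    recall = coverage(real, fake, k)     # do fakes cover the real manifold?
    return precision, recall

rng = np.random.default_rng(0)
left = rng.normal([-5.0, 0.0], 0.5, size=(200, 2))
right = rng.normal([5.0, 0.0], 0.5, size=(200, 2))
real = np.concatenate([left, right])
# A "collapsed" generator that only ever produces the right-hand mode.
collapsed_fake = rng.normal([5.0, 0.0], 0.5, size=(400, 2))

precision, recall = precision_recall(real, collapsed_fake)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the fake samples sit on one of the two real modes, precision stays high, while recall cannot exceed one half: none of the left-hand real points fall inside any fake neighbourhood.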
What To Inspect
- loss dynamics: both generator and discriminator losses should oscillate within a band, not diverge. A generator loss that goes to zero and a discriminator loss that stays at zero are both bad signs.
- gradient norms: huge gradient swings are a precursor to mode collapse
- sample diversity: fix a batch of random `z` and decode it every N iterations; watch whether the samples are still distinct
- precision-recall curves on real vs. generated features: the honest coverage signal
- training time: GANs can fail silently after apparent convergence; keep running past the point where you think training is done
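The sample-diversity check can be reduced to a single number logged alongside the image panel. The sketch below assumes mean pairwise L2 distance as the diversity score, which is a crude proxy rather than a standard metric; `panel_diversity` and the one-layer stand-in generator are hypothetical.

```python
import torch
from torch import nn

def panel_diversity(samples: torch.Tensor) -> float:
    """Mean pairwise L2 distance between flattened samples in the panel."""
    flat = samples.reshape(samples.size(0), -1)
    dists = torch.cdist(flat, flat)  # (B, B) matrix with zeros on the diagonal
    b = flat.size(0)
    return (dists.sum() / (b * (b - 1))).item()

torch.manual_seed(0)
fixed_z = torch.randn(16, 64)  # sampled once, reused at every checkpoint
G = nn.Sequential(nn.Linear(64, 784), nn.Tanh())  # stand-in for a trained generator

with torch.no_grad():
    healthy = panel_diversity(G(fixed_z))
    # Feeding identical inputs yields identical outputs, imitating a fully collapsed panel.
    collapsed = panel_diversity(G(torch.zeros(16, 64)))

print(f"healthy={healthy:.3f} collapsed={collapsed:.3f}")
```

Logging this value every N iterations for the same `fixed_z` gives a trend line: a steady slide toward zero is a cue to look at the panel before the losses show anything unusual.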
Failure Pattern
The two classic failures:
- mode collapse: described above, most common
- training instability and divergence: generator and discriminator losses oscillate wildly and training never settles. Common causes are a learning-rate mismatch, an overly aggressive discriminator, and missing spectral normalization or gradient penalty
A third, quieter failure: the discriminator becomes too good, too fast. Gradients into the generator vanish (especially with the saturating loss), and the generator stops moving. Switching to the non-saturating loss or using WGAN-GP is usually sufficient.
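Both the vanishing-gradient case and the instability case are easier to catch when gradient norms are logged per network. A minimal sketch of such a helper follows; `grad_norm` is a hypothetical name, and it should be called after `backward()` and before the next `zero_grad()`.

```python
import torch
from torch import nn

def grad_norm(module: nn.Module) -> float:
    """Global L2 norm over all parameter gradients currently stored on the module."""
    total_sq = 0.0
    for p in module.parameters():
        if p.grad is not None:
            total_sq += p.grad.detach().pow(2).sum().item()
    return total_sq ** 0.5

# Tiny check on a layer whose gradient is known: d(sum(w . x))/dw = x.
layer = nn.Linear(2, 1, bias=False)
layer(torch.tensor([[3.0, 4.0]])).sum().backward()
print(grad_norm(layer))
```

A generator norm that sits near zero while the discriminator loss is also near zero points to the quiet failure described above; large spikes in either network point to instability.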
Common Mistakes
- using the saturating `log(1 - D(G(z)))` loss in practice (use the non-saturating form)
- using the same learning rate for generator and discriminator when one should be lower (often two-time-scale, e.g., `2e-4` for D, `1e-4` for G with Adam)
- stopping too early: GANs can go through long plateaus before gaining quality
- judging quality by visual cherry-picks without a quantitative metric
- using FID with too few samples (< 5000): the metric is noisy
- not setting up a fixed random-z eval panel that is re-decoded every N steps
- training on heavily class-imbalanced data without a plan: GANs magnify imbalance into mode collapse
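For the learning-rate point, a two-time-scale setup is just two optimizers with different rates. In this sketch the rates are the ones quoted in the list above, while `betas=(0.5, 0.999)` is an assumption borrowed from common GAN practice and the one-layer networks are stand-ins.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(64, 784), nn.Tanh())  # stand-in generator
D = nn.Sequential(nn.Linear(784, 1))              # stand-in discriminator

# Discriminator learns faster than the generator.
# betas=(0.5, 0.999) is an assumption, not something the text specifies.
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
```

Treat the ratio as a tuning knob rather than a constant: if the discriminator loss pins near zero, lowering its rate is one of the cheaper things to try.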
Decision: GAN vs. Diffusion vs. VAE vs. AR
| option | samples | speed | training ease | likelihood |
|---|---|---|---|---|
| GAN | sharp | fast (single forward pass) | hard | no |
| diffusion | sharp | slow (but distillable) | medium | lower bound (ELBO) |
| VAE | blurry | fast | easy | lower bound (ELBO) |
| autoregressive | sharp, conditional | slow (serial) | medium | exact |
A practical rule: if you need fast sampling and you can tolerate training pain, GAN. If you need controllable text-to-image at scale and can afford inference time, diffusion. If you need a latent space, VAE. If you need density, AR or diffusion.
Practice
- Train the minimal GAN above on MNIST with the non-saturating loss. Plot generator and discriminator loss side by side for 10 epochs.
- Swap to the saturating loss. Watch the generator gradient vanish. Compare loss curves.
- Add spectral normalization to the discriminator. Measure training stability across three seeds.
- Deliberately produce mode collapse by dropping the discriminator's learning rate to 1/10 of the generator's. Show that the generator "wins" but produces 1–2 modes.
- Train once with WGAN-GP instead of the BCE loss. Report the FID on MNIST with 5000 generated samples and 5000 real samples.
- Run precision-and-recall on the same trained model against the training data. Identify whether the model has a precision problem, a recall problem, or both.
Runnable Example
Install the dependencies with `pip install torch torchvision matplotlib`. For anything beyond MNIST, plan on a GPU; image GANs at 64×64 resolution and above are demanding to train and not suited to the browser runner.
Longer Connection
GANs stand next to the other generative topics in the academy:
- Autoencoders and VAEs — the other cheap-to-sample alternative; VAEs lose to GANs on sharpness but win on stability
- Diffusion Models — the modern default for most generative tasks; training is more stable and samples are competitive
- Attention and Transformers — modern GAN generators and discriminators often use attention; StyleGAN-XL and derivatives push this
For the decision frame — "GAN or diffusion or VAE" — the choice is almost always driven by sample cost at inference, training pain tolerance, and the shape of the data. Read Baseline-First Task Solving before picking; a simpler method may already meet the bar you think a GAN is needed for.