Generative Adversarial Networks¶

What This Is¶

A Generative Adversarial Network (GAN) is a pair of neural networks trained against each other:

a generator G: z → x̂ that maps random noise to synthetic samples
a discriminator D: x → p(real) that tries to tell real samples from synthetic ones

Training alternates between improving the discriminator (to get better at catching fakes) and improving the generator (to fool the discriminator). In equilibrium, the generator produces samples the discriminator cannot distinguish from real ones.

GANs ruled image generation from 2014 until around 2022. Diffusion models have since taken that crown for controllable generation, and GANs are now a specialized tool rather than the default. This topic teaches what they are, when they are still the right tool, and how to spot the two training failures that cost most GAN projects their time.

When You Use It¶

you need fast inference: GAN generation is a single forward pass, unlike diffusion's many denoising steps
you want sharp, high-frequency samples (textures, faces, specific photo styles)
the task is narrow and the dataset is clean (StyleGAN on faces is still arguably state of the art for photorealistic single-domain generation)
you need a learned discriminator as part of a larger objective (perceptual losses, cycle losses for image translation)

Do Not Use It When¶

the data is heterogeneous or long-tailed — GANs struggle with mode coverage
you need controllable, text-conditional generation — diffusion is usually better
training stability is a first-order concern — GAN training is famously touchy; prefer diffusion or VAEs if you cannot tolerate long debugging cycles
likelihoods or density estimates are needed — GANs do not estimate them

The Minimax Objective¶

The classical GAN loss is:

min_G  max_D   E_{x ~ p_data}[ log D(x) ]  +  E_{z ~ p_z}[ log (1 - D(G(z))) ]

In plain terms: the discriminator wants to output 1 on real and 0 on fake; the generator wants the discriminator to output 1 on its fakes.

In practice, the log(1 - D(G(z))) term gives vanishing gradients when the discriminator is strong, so the generator is trained with the "non-saturating" loss instead:

L_G = -E[ log D(G(z)) ]

This is what actually gets used everywhere. The original formulation is mathematically cleaner; the non-saturating version is what trains.
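
The difference between the two generator losses is easiest to see as a gradient with respect to the discriminator's logit a, where D(G(z)) = sigmoid(a). The sketch below uses the closed-form derivatives rather than autograd; the function names are illustrative and not part of any library.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a: float) -> float:
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def non_saturating_grad(a: float) -> float:
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A very negative logit means the discriminator is confident the sample is fake.
for a in (-6.0, 0.0, 6.0):
    print(f"logit={a:+.0f}  D(G(z))={sigmoid(a):.4f}  "
          f"saturating={saturating_grad(a):+.4f}  "
          f"non-saturating={non_saturating_grad(a):+.4f}")
```

When the discriminator confidently rejects a fake (large negative logit), the saturating gradient shrinks toward zero while the non-saturating gradient stays close to -1, which is exactly the regime where the generator most needs a learning signal.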

A Minimal GAN In PyTorch¶

A barebones fully connected skeleton for flattened 28×28 images (a DCGAN-style model would swap the linear layers for convolutions):

import torch
from torch import nn
import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, z_dim=64, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),
        )
    def forward(self, z): return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, in_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )
    def forward(self, x): return self.net(x)

def gan_step(x_real, G, D, opt_G, opt_D, z_dim=64):
    # x_real: batch of real images, flattened to (B, 784) and scaled to [-1, 1]
    B = x_real.size(0)
    # --- Discriminator ---
    z = torch.randn(B, z_dim, device=x_real.device)
    x_fake = G(z).detach()
    d_real = D(x_real)
    d_fake = D(x_fake)
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_D.zero_grad(); loss_d.backward(); opt_D.step()

    # --- Generator (non-saturating) ---
    z = torch.randn(B, z_dim, device=x_real.device)
    x_fake = G(z)
    d_fake = D(x_fake)
    loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    opt_G.zero_grad(); loss_g.backward(); opt_G.step()
    return loss_d.item(), loss_g.item()

Points that matter:

use LeakyReLU(0.2) in the discriminator, plain ReLU in the generator — tradition that still holds up
the generator's last activation is Tanh and images are normalized to [-1, 1] — matching the range is a common bug
detach() the fake when training D — otherwise gradients flow back into G the wrong way
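
The range-matching point can be made concrete with a pair of small helpers. This is a sketch that assumes images arrive as uint8 tensors in [0, 255]; the helper names are hypothetical.

```python
import torch

def to_tanh_range(x_uint8: torch.Tensor) -> torch.Tensor:
    """Map uint8 pixels in [0, 255] to floats in [-1, 1], the range Tanh produces."""
    return x_uint8.float() / 127.5 - 1.0

def from_tanh_range(x: torch.Tensor) -> torch.Tensor:
    """Map generator outputs in [-1, 1] back to uint8 pixels in [0, 255]."""
    return ((x.clamp(-1.0, 1.0) + 1.0) * 127.5).round().to(torch.uint8)

pixels = torch.tensor([[0, 128, 255]], dtype=torch.uint8)
scaled = to_tanh_range(pixels)       # what the discriminator should see as "real"
restored = from_tanh_range(scaled)   # what you save or plot
```

If the real data were left in [0, 1] instead, the discriminator could separate real from fake on value range alone, and the generator would never receive a useful signal.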

Variants That Changed Things¶

variant	what it changed	why it matters
DCGAN (Radford)	convolutional generator/discriminator, specific architecture choices	first stable image GAN
WGAN (Arjovsky et al.) / WGAN-GP (Gulrajani et al.)	Wasserstein distance, enforced by weight clipping (WGAN) or a gradient penalty (WGAN-GP)	much more stable training; clearer correlation between loss and quality
Spectral Normalization	bounds discriminator Lipschitz constant	standard stability aid
Progressive Growing / StyleGAN	grow resolution over training, style-based generator	high-res face generation
BigGAN	large batch, class conditioning, truncation	high-res conditional generation
CycleGAN	unpaired image translation	practical for style transfer
Pix2Pix	paired image-to-image translation	practical for conditional mapping

For a new GAN project today, start from WGAN-GP or spectral normalization and a DCGAN-ish architecture. Do not try to invent stability from scratch.
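
As a starting point, the gradient penalty term of WGAN-GP can be sketched as below. This is a minimal version under stated assumptions: the critic outputs one unbounded score per sample, and the penalty weight of 10 is the commonly used default rather than anything this document prescribes. The function names are illustrative.

```python
import torch
from torch import nn

def gradient_penalty(critic: nn.Module, x_real: torch.Tensor, x_fake: torch.Tensor) -> torch.Tensor:
    """Penalize the critic when its gradient norm at interpolated points departs from 1."""
    B = x_real.size(0)
    # One interpolation coefficient per sample, broadcast over the remaining dims.
    eps = torch.rand(B, *([1] * (x_real.dim() - 1)), device=x_real.device)
    x_hat = (eps * x_real + (1 - eps) * x_fake).detach().requires_grad_(True)
    scores = critic(x_hat)
    # create_graph=True so the penalty itself can be backpropagated into the critic.
    (grads,) = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat, create_graph=True)
    grad_norm = grads.reshape(B, -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def critic_loss(critic: nn.Module, x_real: torch.Tensor, x_fake: torch.Tensor,
                gp_weight: float = 10.0) -> torch.Tensor:
    """Wasserstein critic loss plus gradient penalty (gp_weight is an assumed default)."""
    x_fake = x_fake.detach()  # do not update the generator on the critic step
    wasserstein = critic(x_fake).mean() - critic(x_real).mean()
    return wasserstein + gp_weight * gradient_penalty(critic, x_real, x_fake)
```

The matching generator loss in this setup would be -critic(G(z)).mean(). Note that the critic has no sigmoid and no BCE; its raw score is the quantity being optimized.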

Mode Collapse¶

The defining GAN failure is mode collapse: the generator learns to produce one (or a few) samples that fool the discriminator reliably, and stops exploring the data distribution. Output diversity collapses; the model "wins" by being a narrow specialist.

Signatures:

a small number of distinct samples across many z draws
generator loss drops suddenly; discriminator loss rises suddenly; then both stabilize at degenerate values
interpolation between z_1 and z_2 produces nearly the same sample
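
The first signature can be turned into a number you log during training. The sketch below measures the mean pairwise distance between samples drawn from many z values; the function name and the default sample count are assumptions, and the useful reference value is the same statistic computed on a batch of real data.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_pairwise_distance(G, z_dim: int = 64, n: int = 256, device: str = "cpu") -> float:
    """Average L2 distance between all pairs of generated samples."""
    z = torch.randn(n, z_dim, device=device)
    x = G(z).reshape(n, -1)
    # pdist returns the distances for every unordered pair of rows.
    return F.pdist(x).mean().item()

# Stand-in generators for illustration only: one collapsed, one diverse.
collapsed_G = lambda z: torch.zeros(z.size(0), 784)
diverse_G = lambda z: torch.tanh(z @ torch.ones(64, 784) * 0.1)
print(mean_pairwise_distance(collapsed_G), mean_pairwise_distance(diverse_G))
```

A value that trends toward zero across training, while the real-data value stays put, is a strong hint that the generator is collapsing.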

Fixes (not guarantees):

use the non-saturating loss, not the original saturating form
add spectral normalization or gradient penalty to stabilize the discriminator
use minibatch discrimination — the discriminator sees a batch at once and can penalize similarity across a batch
use historical buffers — occasionally train the discriminator against past generator outputs, not just the latest
consider unrolled GANs or other stabilization techniques
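
The historical-buffer idea can be sketched as a small pool of past generator outputs that is mixed into each discriminator batch. The class name, capacity, and 50% swap probability below are assumptions chosen for illustration.

```python
import random
import torch

class FakeHistoryBuffer:
    """Hypothetical helper: keeps a pool of past generator outputs for the discriminator."""

    def __init__(self, capacity: int = 512):
        self.capacity = capacity
        self.pool = []

    def query(self, x_fake: torch.Tensor) -> torch.Tensor:
        """Return a batch the same size as x_fake, mixing current and stored fakes."""
        out = []
        for sample in x_fake.detach():
            if len(self.pool) < self.capacity:
                # Pool not full yet: store the sample and pass it through.
                self.pool.append(sample.clone())
                out.append(sample)
            elif random.random() < 0.5:
                # Swap: hand the discriminator an old fake, store the new one.
                idx = random.randrange(self.capacity)
                out.append(self.pool[idx])
                self.pool[idx] = sample.clone()
            else:
                out.append(sample)
        return torch.stack(out)
```

In the discriminator step you would pass buffer.query(G(z)) to D instead of the fresh fakes. The generator step still uses fresh samples, since it needs gradients through G.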

Evaluation¶

The same evaluation caveats as diffusion apply:

FID (Fréchet Inception Distance) is the default quality metric
Inception Score is historical; do not rely on it alone
precision and recall (Kynkäänniemi et al.): separates the two properties that FID folds into a single number, "are samples on the manifold" (precision) and "does the manifold cover the data" (recall). Mode collapse shows up as low recall with reasonable precision.
human evaluation — the only thing that catches uncanny-valley artifacts
classifier-based diagnostics — a classifier trained on real-vs-fake reveals what axes the generator has not yet matched

For mode-coverage diagnostics specifically, precision-and-recall is much more informative than FID.
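
A k-nearest-neighbour version of precision and recall can be sketched in a few lines. This is a simplified illustration, not a reference implementation: in practice the inputs are features from a pretrained network, whereas the toy data here is raw 2-D points, and k=3 is an assumed default.

```python
import torch

def knn_radii(feats: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Distance from each point to its k-th nearest neighbour within the same set."""
    d = torch.cdist(feats, feats)
    # k + 1 because the smallest distance in each row is the point to itself.
    return d.kthvalue(k + 1, dim=1).values

def coverage(queries: torch.Tensor, support: torch.Tensor, k: int = 3) -> float:
    """Fraction of queries that fall inside at least one k-NN ball of the support set."""
    radii = knn_radii(support, k)        # (N_support,)
    d = torch.cdist(queries, support)    # (N_query, N_support)
    return (d <= radii.unsqueeze(0)).any(dim=1).float().mean().item()

def precision_recall(real_feats: torch.Tensor, fake_feats: torch.Tensor, k: int = 3):
    precision = coverage(fake_feats, real_feats, k)  # are fakes on the real manifold?
    recall = coverage(real_feats, fake_feats, k)     # do fakes cover the real manifold?
    return precision, recall

# Toy check: real data has two modes, the "generator" has collapsed onto one.
torch.manual_seed(0)
real = torch.cat([torch.randn(200, 2) + torch.tensor([-5.0, 0.0]),
                  torch.randn(200, 2) + torch.tensor([5.0, 0.0])])
fake = torch.randn(400, 2) + torch.tensor([5.0, 0.0])
precision, recall = precision_recall(real, fake)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setup the fakes sit squarely on one real mode, so precision stays high while recall falls to roughly the fraction of real data in the covered mode.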

What To Inspect¶

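loss dynamics: both generator and discriminator losses should oscillate within a band, not diverge. A discriminator loss pinned near zero means the discriminator is winning outright and the generator gets little useful gradient; a generator loss that falls to near zero means the generator is fooling the discriminator too easily, often with collapsed samples. Both are bad.
gradient norms: large, sudden swings in gradient norm often precede instability or mode collapse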
sample diversity — fix a batch of random z and decode every N iterations; watch whether the samples are still distinct
precision-recall curves on real vs. generated features — the honest coverage signal
training time — GANs can fail silently after convergence; run past when you think it is done
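
Two of these checks are cheap to wire into a training loop. The sketch below shows a global gradient-norm helper and a fixed latent panel that is decoded repeatedly; the names, panel size, and seed are assumptions.

```python
import torch
from torch import nn

def global_grad_norm(model: nn.Module) -> float:
    """L2 norm over all parameter gradients; call after backward() and before step()."""
    sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            sq += p.grad.detach().pow(2).sum().item()
    return sq ** 0.5

# Fixed latent panel: sampled once with a seeded generator, then reused all run.
FIXED_Z = torch.randn(16, 64, generator=torch.Generator().manual_seed(0))

@torch.no_grad()
def decode_panel(G: nn.Module, fixed_z: torch.Tensor = FIXED_Z) -> torch.Tensor:
    """Decode the same latents every time so changes reflect G, not the sampling."""
    was_training = G.training
    G.eval()
    samples = G(fixed_z)
    G.train(was_training)
    return samples
```

Logging global_grad_norm(G) and global_grad_norm(D) every step, and saving decode_panel(G) every N iterations, gives a timeline you can read back when something goes wrong.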

Failure Pattern¶

The two classic failures:

mode collapse — described above, most common
training instability or divergence: generator and discriminator losses oscillate wildly and training never settles. Common causes: learning-rate mismatch, an overly aggressive discriminator, missing spectral normalization or gradient penalty

A third, quieter failure: the discriminator becomes too good, too fast. Gradients into the generator vanish (especially with the saturating loss), and the generator stops moving. Switching to the non-saturating loss or using WGAN-GP is usually sufficient.

Common Mistakes¶

using the saturating log(1 - D(G(z))) loss in practice (use the non-saturating form)
using the same learning rate for generator and discriminator when one should be lower (a two-time-scale setup is common, e.g., 2e-4 for D and 1e-4 for G with Adam)
stopping too early — GANs can go through long plateaus before gaining quality
judging quality by visual cherry-picks without a quantitative metric
using FID with too few samples (< 5000) — the metric is noisy
not setting up a fixed-z eval panel (sampled once, then re-decoded every N steps), so changes in the samples cannot be separated from changes in the noise
training on heavily class-imbalanced data without a plan — GANs magnify imbalance into mode collapse
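
The two-time-scale learning rates from the list above translate into an optimizer setup like the following. The learning rates come from the text; the betas=(0.5, 0.999) setting is an assumption borrowed from common DCGAN practice, and the two tiny models are stand-ins.

```python
import torch
from torch import nn

# Stand-in networks for illustration; substitute your own generator and discriminator.
G = nn.Sequential(nn.Linear(64, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 1))

# Two-time-scale setup: the discriminator learns faster than the generator.
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.999))
```

Treat both rates as starting values to tune. If the discriminator loss pins near zero, lowering its rate is usually the first thing to try.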

Decision: GAN vs. Diffusion vs. VAE vs. AR¶

option	samples	speed	training ease	likelihood
GAN	sharp	fast (single forward pass)	hard	no
diffusion	sharp	slow (but distillable)	medium	lower bound (ELBO)
VAE	blurry	fast	easy	lower bound (ELBO)
autoregressive	sharp, conditional	slow (serial)	medium	exact

A practical rule: if you need fast sampling and you can tolerate training pain, GAN. If you need controllable text-to-image at scale and can afford inference time, diffusion. If you need a latent space, VAE. If you need density, AR or diffusion.

Practice¶

Train the minimal GAN above on MNIST with the non-saturating loss. Plot generator and discriminator loss side by side for 10 epochs.
Swap to the saturating loss. Watch the generator gradient vanish. Compare loss curves.
Add spectral normalization to the discriminator. Measure training stability across three seeds.
Deliberately produce mode collapse by dropping the discriminator's learning rate to 1/10 of the generator's. Show that the generator "wins" but produces 1–2 modes.
Train once with WGAN-GP instead of the BCE loss. Report the FID on MNIST with 5000 generated samples and 5000 real samples.
Run precision-and-recall on the same trained model against the training data. Identify whether the model has a precision problem, a recall problem, or both.

Runnable Example¶

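A GAN on MNIST trains in minutes on a laptop CPU. The snippet above covers the models and the update step; a runnable script additionally needs an MNIST data loader (images flattened to 784 values and scaled to [-1, 1]), two optimizers, and an outer loop that calls gan_step on each batch. Install with pip install torch torchvision matplotlib. For anything beyond MNIST, plan on a GPU; image GANs at 64×64 resolution and above are demanding to train and not suited to the browser runner.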

Longer Connection¶

GANs stand next to the other generative topics in the academy:

Autoencoders and VAEs — the other cheap-to-sample alternative; VAEs lose to GANs on sharpness but win on stability
Diffusion Models — the modern default for most generative tasks; training is more stable and samples are competitive
Attention and Transformers — modern GAN generators and discriminators often use attention; StyleGAN-XL and derivatives push this

For the decision frame — "GAN or diffusion or VAE" — the choice is almost always driven by sample cost at inference, training pain tolerance, and the shape of the data. Read Baseline-First Task Solving before picking; a simpler method may already meet the bar you think a GAN is needed for.