
Steering Frozen Generative Models

When a generator's weights are frozen and you cannot fine-tune, you can still steer its output by editing the inputs it sees. Text prompts, text embeddings, and initial latents are all knobs; the technique is to learn small edits to those knobs.

What This Is

Given a frozen generator G(text_embedding, latent_z) → image (or text, or audio), you want to solve:

find   δ_text, δ_z    such that   G(text_emb + δ_text, z + δ_z)  matches some objective
subject to            ||δ_text|| <= ε_text,  ||δ_z|| <= ε_z
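
One way to respect the norm bounds is projected gradient descent: take a gradient step on each delta, then scale it back into its ε-ball. The sketch below is illustrative only. The "generator" is a hypothetical fixed linear map (W_text, W_z), the objective is squared error to a chosen target output, and every name and constant is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "generator": fixed linear maps standing in for G.
W_text = rng.normal(size=(8, 4))   # acts on the text embedding
W_z = rng.normal(size=(8, 4))      # acts on the latent

def generate(text_emb, z):
    return W_text @ text_emb + W_z @ z

def project_l2(delta, eps):
    """Scale delta back onto the L2 ball of radius eps if it lies outside."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

text_emb = rng.normal(size=4)
z = rng.normal(size=4)
# Target output; reaching it exactly would need ||delta_text|| = 0.6 > eps_text.
target = generate(text_emb + 0.3, z - 0.2)

eps_text, eps_z, lr = 0.5, 0.5, 0.005
delta_text = np.zeros(4)
delta_z = np.zeros(4)

def loss_fn(dt, dz):
    return float(np.sum((generate(text_emb + dt, z + dz) - target) ** 2))

initial_loss = loss_fn(delta_text, delta_z)
for _ in range(500):
    residual = generate(text_emb + delta_text, z + delta_z) - target
    # Analytic gradients of the squared error; W_text and W_z are never updated.
    grad_text = 2 * W_text.T @ residual
    grad_z = 2 * W_z.T @ residual
    delta_text = project_l2(delta_text - lr * grad_text, eps_text)
    delta_z = project_l2(delta_z - lr * grad_z, eps_z)
final_loss = loss_fn(delta_text, delta_z)

print(f"loss {initial_loss:.3f} -> {final_loss:.3f}")
print(f"||delta_text|| = {np.linalg.norm(delta_text):.3f}, ||delta_z|| = {np.linalg.norm(delta_z):.3f}")
```

With a real generator the gradients would come from autograd rather than a closed form, but the step-then-project structure stays the same.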

No weight in G is updated. Only the deltas, plus at most a few sampling scalars, are learned or tuned. They can be:

  • a single new token embedding added to the text encoder's vocabulary (textual inversion, the "new concept" case)
  • a prompt embedding edit that shifts the generation toward a target concept
  • an initial-latent edit that primes the sampling path
  • a classifier-free guidance scale that amplifies or dampens the conditioning signal

Three named techniques you should recognize:

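  • Textual Inversion (Gal et al., 2022). Learn one new token embedding v* from a small set of target images by minimizing the generator's own training loss with everything except v* frozen. For a diffusion model that is the noise-prediction error E[ ||ε - ε_θ(x_t, t, c("a photo of v*"))||² ], where x_t is a noised target image. At inference you just use v* in prompts.
  • Embedding Arithmetic. Find a direction in text-embedding space that corresponds to adding a concept (e.g., emb("a photo of a cow with a fire hydrant") - emb("a photo of a cow"), averaged over contexts). Add a scaled version of that direction at inference. This is a simpler relative of Prompt-to-Prompt, which edits cross-attention maps during sampling rather than the prompt embedding itself.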
  • Classifier-Free Guidance tuning. Rerun sampling with different guidance scales; higher scales push harder toward the conditioning; too high produces artifacts. One scalar, often the cheapest knob.
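
For reference, the guidance scale enters sampling through one line of arithmetic at each denoising step. The sketch below uses the common formulation in which the sampler extrapolates from the unconditional noise prediction toward the conditional one; eps_uncond and eps_cond are placeholder arrays standing in for the two model outputs, and some codebases parameterize the scale slightly differently.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Placeholder predictions; in practice these come from two passes of the
# frozen denoiser, one with the prompt and one with an empty prompt.
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])

for scale in (0.0, 1.0, 7.5):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

At scale 0 the prompt is ignored, at scale 1 the result is the plain conditional prediction, and larger values amplify only the component on which the two predictions disagree.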

When You Use It

  • you must not change the generator's weights (task constraint, license constraint, or compute constraint)
  • you need the generator to produce a specific concept, style, or entity not in its training set
  • you need the generator to combine two concepts in a way the text encoder does not naturally bridge
  • the edit budget is small — you can learn one token embedding or a handful of scalars, not a full LoRA

Do Not Use It When

  • you have permission and compute to fine-tune the generator (full or LoRA) — that is usually faster and more reliable
  • the model has a safety filter that rejects your target — you cannot steer around a server-side filter with client-side edits
  • the objective is a hard constraint (e.g., "output must contain a legally specific string") — generative steering is a soft push, not a guarantee

Textual Inversion In Practice

You want the generator to learn a new concept v* from a handful of target images x_1, ..., x_K.

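import random
import torch

# embedding_dim, lr, N, targets, random_timestep, q_sample, text_embedding_with,
# and G are placeholders for your own setup.
v = torch.nn.Parameter(torch.randn(embedding_dim))   # one new token embedding
for step in range(N):
    x = random.choice(targets)
    t = random_timestep()
    noise = torch.randn_like(x)
    x_noisy = q_sample(x, t, noise)                   # forward-diffuse x to timestep t
    pred_noise = G.unet(x_noisy, t, text_embedding_with(v))
    loss = ((pred_noise - noise) ** 2).mean()         # noise-prediction MSE
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad                              # plain gradient step on v only
    v.grad = None                                     # reset so gradients do not accumulate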

The UNet is frozen. Only v is updated. Typical budget: 3-10 target images, a few thousand steps, and you end up with one token you can drop into any prompt.

Embedding-Arithmetic Steering In Practice

When you do not have target images, but you know the concept as text, you can compute a direction:

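import torch

base_contexts = ["a photo of a cow", "a photo of a cat", ...]
target_contexts = ["a photo of a cow with a fire hydrant", "a photo of a cat with a fire hydrant", ...]
# one difference per (base, target) pair, averaged into a single direction;
# assumes text_encoder returns same-shape tensors for every prompt
direction = torch.stack(
    [text_encoder(t) - text_encoder(b) for b, t in zip(base_contexts, target_contexts)]
).mean(dim=0)

# at inference:
edited_embedding = text_encoder(user_prompt) + α * direction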

The scalar α is the knob — α = 0 leaves the prompt unchanged, α too high produces artifacts where the concept dominates.
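
To see the mechanics end to end, here is a toy version that runs without a real text encoder. toy_text_encoder is a hypothetical stand-in that sums fixed random word vectors, which makes the encoder exactly linear, so the direction transfers perfectly to an unseen prompt. Real text encoders are contextual and nonlinear, so expect only approximate transfer there.

```python
import numpy as np

rng = np.random.default_rng(0)
_word_vectors = {}

def toy_text_encoder(prompt, dim=16):
    """Hypothetical stand-in for a text encoder: sum of fixed random word vectors."""
    vectors = []
    for word in prompt.split():
        if word not in _word_vectors:
            _word_vectors[word] = rng.normal(size=dim)
        vectors.append(_word_vectors[word])
    return np.sum(vectors, axis=0)

base_contexts = ["a photo of a cow", "a photo of a cat"]
target_contexts = [
    "a photo of a cow with a fire hydrant",
    "a photo of a cat with a fire hydrant",
]

# Average the per-pair differences into one concept direction.
direction = np.mean(
    [toy_text_encoder(t) - toy_text_encoder(b)
     for b, t in zip(base_contexts, target_contexts)],
    axis=0,
)

def steer(prompt, alpha):
    """Shift the prompt embedding along the concept direction."""
    return toy_text_encoder(prompt) + alpha * direction

unchanged = steer("a photo of a dog", 0.0)   # alpha = 0: identical to the plain prompt
steered = steer("a photo of a dog", 1.0)     # alpha = 1: concept added
explicit = toy_text_encoder("a photo of a dog with a fire hydrant")
print(np.linalg.norm(steered - explicit))    # distance to the explicitly written prompt
```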

Initial-Latent Priming

Diffusion sampling starts from Gaussian noise z_T. You can pick a z_T that, when denoised with a fixed prompt, tends to produce the concept you want:

  • generate many samples, rank by a concept classifier, keep the latents behind the highest-scoring samples
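  • average or interpolate between them to get a "primed" latent (a plain average of k roughly independent Gaussian latents shrinks the per-element variance toward 1/k, so rescale the result or use spherical interpolation)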
  • at inference, start from a noisy version of this primed latent instead of pure noise

This is less principled but often effective when the budget on other deltas is tight.
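
The interpolate-then-renoise steps can be sketched as follows. Spherical interpolation (slerp) is one common choice for combining Gaussian latents because it keeps the result at roughly the scale the sampler expects. The latent shape, the renoise helper, and its strength parameter are assumptions made for illustration.

```python
import numpy as np

def slerp(z0, z1, t):
    """Spherical interpolation between two latents of the same shape."""
    a, b = z0.ravel(), z1.ravel()
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:   # nearly parallel: linear interpolation is fine
        return (1 - t) * z0 + t * z1
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

def renoise(z_primed, strength, rng):
    """Blend the primed latent with fresh noise (strength in [0, 1]).
    The weights satisfy a^2 + b^2 = 1, so unit variance is preserved."""
    noise = rng.normal(size=z_primed.shape)
    return np.sqrt(1.0 - strength**2) * z_primed + strength * noise

rng = np.random.default_rng(0)
# Stand-ins for two latents that produced high-scoring samples.
z_a = rng.normal(size=(4, 64, 64))
z_b = rng.normal(size=(4, 64, 64))

averaged = 0.5 * (z_a + z_b)       # plain average: scale shrinks
primed = slerp(z_a, z_b, 0.5)      # slerp: scale stays near 1
start = renoise(primed, strength=0.3, rng=rng)

print(f"std of plain average: {averaged.std():.3f}")
print(f"std of slerp result:  {primed.std():.3f}")
print(f"std after renoising:  {start.std():.3f}")
```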

What To Inspect

  • sample grid — generate a 4×4 grid of samples with and without the edit; visually confirm the concept shows up
  • concept-classifier score distribution — if you have an external classifier that detects the target concept, score samples with and without the edit; look at the full distribution, not just the mean
  • unintended changes — does the edit also break unrelated prompts? A "fire hydrant direction" should not also change "a photo of a sunset"
  • magnitude sweep — generate samples at α ∈ {0.25, 0.5, 1.0, 2.0, 4.0} and find the sweet spot
  • training-image overfit (for textual inversion) — does v* reproduce the training images verbatim? That means you overfit and lost generality
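
The point about looking at the full score distribution can be made concrete with a small helper. The scores below are made-up illustrative numbers, not measurements, and the 0.5 threshold is an arbitrary assumption. They are chosen so that the mean of the edited set looks healthy while half of the samples still miss the concept.

```python
import statistics

def summarize(scores):
    """Report min, quartiles, max, and mean so a bimodal result is visible."""
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
        "mean": statistics.fmean(scores),
    }

def fraction_above(scores, threshold):
    """Share of samples the classifier counts as containing the concept."""
    return sum(s >= threshold for s in scores) / len(scores)

# Hypothetical concept-classifier scores for the same prompts and seeds.
baseline = [0.05, 0.10, 0.08, 0.12, 0.07, 0.09, 0.11, 0.06]
edited = [0.05, 0.95, 0.07, 0.97, 0.06, 0.96, 0.08, 0.94]

for name, scores in (("baseline", baseline), ("edited", edited)):
    stats = summarize(scores)
    hit_rate = fraction_above(scores, 0.5)
    print(name, {k: round(v, 3) for k, v in stats.items()}, "hit rate:", hit_rate)
```

Running the same summary at each α in a magnitude sweep gives a compact way to compare settings.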

Failure Patterns

  • embedding explosion. The learned delta's norm keeps growing during optimization because nothing penalizes it (no weight decay, no norm bound). Sample quality degrades. Fix: L2-regularize the delta or clip its norm.
  • concept leakage. The edit is "too strong": every sample now has a fire hydrant, even for prompts where that is absurd. Fix: reduce α, or apply α * direction conditionally, only when the user prompt semantically allows the concept.
  • textual-inversion overfit. With too few training images or too many steps, v* memorizes exact training images. Fix: fewer steps, more diverse augmentations, or a smaller token-embedding learning rate.
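  • classifier-free guidance oversaturation. Very high CFG scales (roughly >15, though the threshold is model-dependent) produce oversaturated, artifacted images. The visual effect is "burned" colors and unnatural contrast. Fix: keep CFG in a moderate range (5-12 is a common starting point) and confirm with a sweep.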
  • evaluating with the wrong metric. "Does the concept appear?" is what the task wants, but you measure FID or CLIP-score. Those do not directly capture concept presence. Use a task-specific classifier or the exact evaluator the task uses.

Quick Checks

  • is the generator truly frozen? (all parameters with requires_grad=False; only the delta has requires_grad=True)
  • is the delta magnitude bounded? (norm clip or L2 regularization in the loss)
  • are you evaluating on the same classifier/metric the task uses, not a proxy?
  • did you compare against the no-edit baseline? ("my edit works" means "my edit works compared to no edit with the same prompts")
  • have you checked unrelated prompts still look normal?

Practice

Run academy/labs/generative-model-steering/src/steering_workflow.py:

  • trains a small conditional pixel generator (toy VAE decoder, not a real diffusion model — just enough to demonstrate the mechanics) on a synthetic shape dataset
  • freezes the generator and learns a textual-inversion-style new token embedding that makes it produce a held-out shape class it was not trained to generate
  • adds an embedding-arithmetic experiment: compute a "size → large" direction and demonstrate it transfers
  • demonstrates the "LLM-judge-in-the-loop" black-box search pattern: given a frozen scorer and a candidate generator output, search over prompts with a small beam to maximize the scorer

You should leave able to explain when frozen-model steering is the right move, what textual inversion optimizes, and why the same search pattern ("optimize inputs against a frozen model") applies whether the frozen model is a classifier, a generator, or an LLM judge.
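
As a rough picture of the black-box search pattern from the last lab item, here is a minimal beam search over prompts. It is a sketch under stated assumptions, not the lab's code: toy_scorer is a hypothetical frozen scorer that scores the prompt text directly, whereas in a real setup it would wrap "generate with this prompt, then score the output".

```python
def beam_search_prompts(seed_prompt, candidate_words, scorer, beam_width=3, steps=3):
    """Extend prompts one word at a time, keeping the top-scoring few each round."""
    beam = [(scorer(seed_prompt), seed_prompt)]
    for _ in range(steps):
        # Current beam members stay in the pool, so the score never gets worse.
        candidates = {prompt for _, prompt in beam}
        for _, prompt in beam:
            for word in candidate_words:
                candidates.add(f"{prompt} {word}")
        scored = sorted(((scorer(p), p) for p in candidates), reverse=True)
        beam = scored[:beam_width]
    return beam[0]

def toy_scorer(prompt):
    """Hypothetical frozen scorer: rewards 'red' and 'large', penalizes length."""
    words = prompt.split()
    return 2.0 * ("red" in words) + 1.5 * ("large" in words) - 0.1 * len(words)

best_score, best_prompt = beam_search_prompts(
    "a square", ["red", "blue", "large", "small"], toy_scorer
)
print(best_score, best_prompt)
```

The scorer is only ever called, never differentiated, which is why the same loop works against an API-only model or an LLM judge.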

Longer Connection

"Optimize inputs against a frozen model" is a design pattern with many instances:

  • textual inversion — learn an input token to match a target image distribution
  • adversarial examples — learn an input perturbation to flip a classifier
  • prompt engineering by search — search over prompts to maximize an external judge's score (see Black-Box LLM Optimization)
  • adversarial patch attacks — learn a fixed image patch that, when overlaid, triggers a target class

This family of techniques is sometimes called input-space optimization. Activation steering is a close cousin that edits a model's intermediate activations rather than its inputs. These methods are powerful precisely because they do not require retraining the target model, so they apply where retraining is impossible (API-only models, licensed weights, compute-limited settings). They are also the natural way to approach any "the model is frozen, here is a budget for deltas on the inputs" task.