Vision and Text Encoders

What This Is

This page is about one practical decision:

  • should you use a frozen encoder
  • a small probe or retrieval layer
  • or a fine-tuned vision/text model

The right answer depends on task type, label count, and whether the pretrained representation is already doing most of the work.

Do not treat encoders as magic. Treat them as reusable representations that need to justify their extra cost.

When You Use It

  • image or text classification with limited labels
  • embedding reuse for retrieval or similarity
  • comparing frozen reuse against fine-tuning
  • multimodal tasks where image and text need a common representation

Start With The Workflow Choice

Choose the smallest reuse strategy that matches the task:

| Situation | Best first move | What to inspect |
| --- | --- | --- |
| small labeled dataset | frozen encoder plus linear probe | whether the probe is already competitive |
| retrieval or nearest-neighbor task | frozen embeddings | nearest neighbors and cluster quality |
| moderate label count with domain shift | small fine-tune | whether the frozen representation is missing domain signal |
| true multimodal matching | joint encoder or aligned embeddings | whether text and image neighbors actually agree |

If the frozen representation already works, full fine-tuning is usually wasted effort.
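A minimal sketch of the "frozen encoder plus linear probe" row, using random numpy arrays as stand-ins for real frozen features and labels (the feature matrix, label rule, and training loop here are illustrative assumptions, not part of any library API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen encoder outputs: in practice these would come from
# text_encoder / vision_encoder under torch.no_grad().
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Linear probe: logistic regression trained with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

probe_acc = ((X @ w > 0) == (y == 1)).mean()
print(f"frozen-probe accuracy: {probe_acc:.2f}")
```

If a probe this cheap is already competitive with your target metric, that is the signal that fine-tuning needs to justify itself.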

Representation Ladder

Use this ladder in order:

  1. frozen encoder
  2. probe, retrieval rule, or shallow head
  3. small fine-tune
  4. larger adaptation only if the simpler reuse path clearly fails

This order matters because every step upward adds cost, memory pressure, and more ways to overfit.

Minimal Pattern

This is the shortest reusable-encoder pattern:

import torch
from transformers import AutoModel, AutoTokenizer
from torchvision.models import resnet50

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
vision_encoder = resnet50(weights="DEFAULT")

# Freeze both encoders: eval mode plus no_grad makes the reuse explicit.
text_encoder.eval()
vision_encoder.eval()
vision_encoder.fc = torch.nn.Identity()  # drop the classifier, keep pooled features

inputs = tokenizer("hello world", return_tensors="pt")
image = torch.randn(1, 3, 224, 224)  # placeholder image batch

with torch.no_grad():
    # mask-aware mean pooling, so padding tokens never dilute the embedding
    hidden = text_encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    text_features = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    image_features = vision_encoder(image)  # shape (1, 2048)

The goal of this pattern is not to show every library option. The goal is to make frozen reuse visible before you add a task-specific head.

What To Inspect First

Before you fine-tune anything, inspect:

  • the frozen probe result
  • nearest-neighbor quality in the embedding space
  • whether the representation fails on a specific slice
  • whether the task is really classification, retrieval, or matching
  • whether the added compute is justified by the gain

If you skip those checks, you do not know whether the problem is representation quality or adaptation depth.
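Nearest-neighbor quality is the easiest of those checks to automate. A minimal sketch with synthetic stand-in embeddings (two made-up clusters; real embeddings would come from the frozen encoders above): normalize, take each point's nearest neighbor by cosine similarity, and measure how often neighbor labels agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in embeddings: two synthetic "classes" in a shared space.
a = rng.normal(loc=-3.0, size=(50, 8))
b = rng.normal(loc=+3.0, size=(50, 8))
emb = np.vstack([a, b])
labels = np.array([0] * 50 + [1] * 50)

# Cosine-normalize, then find each point's nearest neighbor (excluding itself).
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T
np.fill_diagonal(sims, -np.inf)
nn = sims.argmax(axis=1)

agreement = (labels[nn] == labels).mean()
print(f"nearest-neighbor label agreement: {agreement:.2f}")
```

Low agreement on a specific slice points at representation quality; high agreement with poor task metrics points at the head or the task framing instead.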

Good First Comparisons

Make one honest comparison at a time:

  • simple baseline versus frozen encoder
  • frozen encoder versus small fine-tune
  • text-only or vision-only encoder versus multimodal version
  • accuracy or retrieval quality versus latency and memory

The page is doing its job only if one of those comparisons changes the next move.
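One way to keep a comparison honest is to vary exactly one thing. The sketch below uses a tiny made-up torch "encoder" and synthetic task (all names and sizes are assumptions), with identical initialization in both runs, so the only difference is whether the encoder's weights update:

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 16)
y = (X[:, 0] > 0).long()  # synthetic binary task

def run(freeze_encoder: bool) -> float:
    torch.manual_seed(1)  # identical init in both runs: an honest comparison
    encoder = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
    head = torch.nn.Linear(8, 2)
    for p in encoder.parameters():
        p.requires_grad = not freeze_encoder
    params = [p for p in list(encoder.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.Adam(params, lr=0.05)
    for _ in range(100):
        loss = torch.nn.functional.cross_entropy(head(encoder(X)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (head(encoder(X)).argmax(dim=1) == y).float().mean().item()

print("frozen encoder + head:", run(freeze_encoder=True))
print("small fine-tune:      ", run(freeze_encoder=False))
```

If the gap between the two numbers is small, the frozen row of the table above is the stopping point.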

Failure Pattern

The most common failure is fine-tuning because the pretrained model feels advanced.

That usually hides one of these simpler realities:

  • the frozen representation was already good enough
  • the task really needed retrieval, not classification
  • the domain shift was small and only the head needed updating
  • the evaluation never checked whether the representation organized the data sensibly

Common Mistakes

  • skipping the frozen baseline
  • using a multimodal encoder when one modality already solves the task
  • judging embeddings only with a pretty projection
  • changing the encoder, the head, and the split at the same time
  • calling a tiny validation gain proof that full fine-tuning is necessary
  • forgetting to compare compute cost against the size of the gain

A Good Encoder Note

After one run, the learner should be able to say:

  • what task shape was being solved
  • whether frozen reuse was already strong enough
  • what slice or neighbor pattern still failed
  • whether the next move is probe, retrieval, or fine-tuning
  • why the chosen adaptation depth is worth it

Runnable Example

Use the local demo:

academy/.venv/bin/python academy/examples/vision-text-encoders-demo.py

It is deliberately small. Its job is to make frozen extraction, head replacement, and comparison against trainable variants visible.

Practice

  1. Compare a frozen encoder against a small trainable head.
  2. Inspect whether nearest neighbors look semantically correct.
  3. Decide whether the task is really retrieval, matching, or classification.
  4. Explain when a multimodal encoder is actually justified.
  5. Name one case where frozen reuse is the better stopping point.

Longer Connection

Continue with Multi-Modal Fusion when image and text really need to interact, Text Representations and Order for lighter text baselines, and Representation Reuse and Embedding Transfer for the full reuse workflow.