Vision and Text Encoders

What This Is

This page is about one practical decision:

  • should you use a frozen encoder
  • a small probe or retrieval layer
  • or a fine-tuned vision/text model

The right answer depends on task type, label count, and whether the pretrained representation is already doing most of the work.

Do not treat encoders as magic. Treat them as reusable representations that need to justify their extra cost.

When You Use It

  • image or text classification with limited labels
  • embedding reuse for retrieval or similarity
  • comparing frozen reuse against fine-tuning
  • multimodal tasks where image and text need a common representation

Start With The Workflow Choice

Choose the smallest reuse strategy that matches the task:

| Situation | Best first move | What to inspect |
| --- | --- | --- |
| small labeled dataset | frozen encoder plus linear probe | whether the probe is already competitive |
| retrieval or nearest-neighbor task | frozen embeddings | nearest neighbors and cluster quality |
| moderate label count with domain shift | small fine-tune | whether the frozen representation is missing domain signal |
| true multimodal matching | joint encoder or aligned embeddings | whether text and image neighbors actually agree |

If the frozen representation already works, full fine-tuning is usually wasted effort.
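A minimal sketch of the "frozen encoder plus linear probe" row, using random numpy arrays as stand-ins for real frozen features and labels (the feature matrix, label rule, and training loop here are illustrative assumptions, not part of any library API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen encoder outputs: in practice these would come from
# text_encoder / vision_encoder under torch.no_grad().
n, d = 200, 16
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)  # synthetic binary labels

# Linear probe: logistic regression trained with plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

probe_acc = ((X @ w > 0) == (y == 1)).mean()
print(f"frozen-probe accuracy: {probe_acc:.2f}")
```

If a probe this cheap is already competitive with your target metric, that is the signal that fine-tuning needs to justify itself.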

Representation Ladder

Use this ladder in order:

  1. frozen encoder
  2. probe, retrieval rule, or shallow head
  3. small fine-tune
  4. larger adaptation only if the simpler reuse path clearly fails

This order matters because every step upward adds cost, memory pressure, and more ways to overfit.

Minimal Pattern

This is the shortest reusable-encoder pattern:

import torch
from transformers import AutoModel, AutoTokenizer
from torchvision.models import resnet50

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")
vision_encoder = resnet50(weights="DEFAULT")

# Freeze both encoders: eval mode plus no_grad makes the reuse explicit.
text_encoder.eval()
vision_encoder.eval()
vision_encoder.fc = torch.nn.Identity()  # drop the classifier, keep pooled features

inputs = tokenizer("hello world", return_tensors="pt")
image = torch.randn(1, 3, 224, 224)  # placeholder image batch

with torch.no_grad():
    # mask-aware mean pooling, so padding tokens never dilute the embedding
    hidden = text_encoder(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    text_features = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    image_features = vision_encoder(image)  # shape (1, 2048)

The goal of this pattern is not to show every library option. The goal is to make frozen reuse visible before you add a task-specific head.

What To Inspect First

Before you fine-tune anything, inspect:

  • the frozen probe result
  • nearest-neighbor quality in the embedding space
  • whether the representation fails on a specific slice
  • whether the task is really classification, retrieval, or matching
  • whether the added compute is justified by the gain

If you skip those checks, you do not know whether the problem is representation quality or adaptation depth.
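Nearest-neighbor quality is the easiest of those checks to automate. A minimal sketch with synthetic stand-in embeddings (two made-up clusters; real embeddings would come from the frozen encoders above): normalize, take each point's nearest neighbor by cosine similarity, and measure how often neighbor labels agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in embeddings: two synthetic "classes" in a shared space.
a = rng.normal(loc=-3.0, size=(50, 8))
b = rng.normal(loc=+3.0, size=(50, 8))
emb = np.vstack([a, b])
labels = np.array([0] * 50 + [1] * 50)

# Cosine-normalize, then find each point's nearest neighbor (excluding itself).
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sims = emb @ emb.T
np.fill_diagonal(sims, -np.inf)
nn = sims.argmax(axis=1)

agreement = (labels[nn] == labels).mean()
print(f"nearest-neighbor label agreement: {agreement:.2f}")
```

Low agreement on a specific slice points at representation quality; high agreement with poor task metrics points at the head or the task framing instead.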

Good First Comparisons

Make one honest comparison at a time:

  • simple baseline versus frozen encoder
  • frozen encoder versus small fine-tune
  • text-only or vision-only encoder versus multimodal version
  • accuracy or retrieval quality versus latency and memory

The page is doing its job only if one of those comparisons changes the next move.
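One way to keep a comparison honest is to vary exactly one thing. The sketch below uses a tiny made-up torch "encoder" and synthetic task (all names and sizes are assumptions), with identical initialization in both runs, so the only difference is whether the encoder's weights update:

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 16)
y = (X[:, 0] > 0).long()  # synthetic binary task

def run(freeze_encoder: bool) -> float:
    torch.manual_seed(1)  # identical init in both runs: an honest comparison
    encoder = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU())
    head = torch.nn.Linear(8, 2)
    for p in encoder.parameters():
        p.requires_grad = not freeze_encoder
    params = [p for p in list(encoder.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.Adam(params, lr=0.05)
    for _ in range(100):
        loss = torch.nn.functional.cross_entropy(head(encoder(X)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return (head(encoder(X)).argmax(dim=1) == y).float().mean().item()

print("frozen encoder + head:", run(freeze_encoder=True))
print("small fine-tune:      ", run(freeze_encoder=False))
```

If the gap between the two numbers is small, the frozen row of the table above is the stopping point.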

Failure Pattern

The most common failure is fine-tuning because the pretrained model feels advanced.

That usually hides one of these simpler realities:

  • the frozen representation was already good enough
  • the task really needed retrieval, not classification
  • the domain shift was small and only the head needed updating
  • the evaluation never checked whether the representation organized the data sensibly

Common Mistakes

  • skipping the frozen baseline
  • using a multimodal encoder when one modality already solves the task
  • judging embeddings only with a pretty projection
  • changing the encoder, the head, and the split at the same time
  • calling a tiny validation gain proof that full fine-tuning is necessary
  • forgetting to compare compute cost against the size of the gain

A Good Encoder Note

After one run, the learner should be able to say:

  • what task shape was being solved
  • whether frozen reuse was already strong enough
  • what slice or neighbor pattern still failed
  • whether the next move is probe, retrieval, or fine-tuning
  • why the chosen adaptation depth is worth it

Runnable Example

Use the local demo:

academy/.venv/bin/python academy/examples/vision-text-encoders-demo.py

It is deliberately small. Its job is to make frozen extraction, head replacement, and comparison against trainable variants visible.

Practice

  1. Compare a frozen encoder against a small trainable head.
  2. Inspect whether nearest neighbors look semantically correct.
  3. Decide whether the task is really retrieval, matching, or classification.
  4. Explain when a multimodal encoder is actually justified.
  5. Name one case where frozen reuse is the better stopping point.

Longer Connection

Continue with Multi-Modal Fusion when image and text really need to interact, Text Representations and Order for lighter text baselines, and Representation Reuse and Embedding Transfer for the full reuse workflow.