Encoder-Decoder And Machine Translation

What This Is

An encoder-decoder model maps an input sequence to an output sequence through two modules: an encoder that reads the whole input into a set of contextual representations, and a decoder that generates the output token by token, conditioned on the encoder states and on what it has already produced.

The practical lesson is that encoder-decoder is not just a translation architecture: it is a standard recipe for sequence-to-sequence tasks such as translation, summarization, question answering over context, image captioning (with a vision encoder), and code completion over documents. The choices of encoder, decoder, attention, and decoding strategy all come from the same family of decisions.

When You Use It

machine translation
summarization — long article → short summary
question-answering — context + question → answer span or generated answer
image captioning — vision encoder → text decoder
speech recognition — audio encoder → text decoder
any task where the output length is not determined by the input length and must be generated one token at a time

Core Structure

source tokens → encoder → contextual representations (one per source position)
target tokens so far → decoder → next-token distribution
                                  attends to encoder states + its own past

Two forms:

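RNN encoder-decoder (Sutskever et al.; Bahdanau et al.): recurrent networks on both sides (LSTMs in Sutskever et al., gated recurrent units in Bahdanau et al.), with Bahdanau et al. adding attention as the bridge; historical but clear
Transformer encoder-decoder (Vaswani et al.): self-attention within each side, cross-attention from decoder to encoder; the modern default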

See Attention and Transformers for the attention mechanics that make both work.
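As a rough illustration of the cross-attention step, here is a minimal pure-Python sketch. It is simplified by assumption: there are no learned query/key/value projections, the encoder states serve as both keys and values, and the names `cross_attention`, `encoder_states`, and `query` are illustrative rather than taken from any library.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, encoder_states):
    """One decoder query attends over all encoder states.

    Assumption: encoder states act as both keys and values (no projections).
    """
    d = len(query)
    # Scaled dot-product score between the query and each source position.
    scores = [sum(q * k for q, k in zip(query, state)) / math.sqrt(d)
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

# Three source positions with 2-dimensional states (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = cross_attention(query, encoder_states)
print(weights)  # one weight per source position, summing to 1
print(context)
```

The weights are what you would plot when inspecting alignment: one row per decoder step, one column per source position.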

Teacher Forcing: How Training Works

During training the decoder is given the ground-truth previous tokens as input, not its own predictions. This is teacher forcing. The loss at each position is cross-entropy against the true next token.

input to decoder:    <BOS> y_1 y_2 ... y_{T-1}
target:              y_1   y_2 y_3 ... y_T
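The shift above can be sketched in a few lines of Python. This is a minimal sketch, assuming the target sequence already ends in `<EOS>` and that the model's per-position output is available as a dictionary of token probabilities; the function names and the probabilities are hypothetical.

```python
import math

BOS, EOS = "<BOS>", "<EOS>"

def shift_for_teacher_forcing(target):
    """Decoder input is the target shifted right by one, starting with <BOS>."""
    decoder_input = [BOS] + target[:-1]
    return decoder_input, target

def cross_entropy(step_probs, target):
    """Mean negative log-probability assigned to the true token at each position."""
    return -sum(math.log(p[t]) for p, t in zip(step_probs, target)) / len(target)

target = ["le", "chat", EOS]
decoder_input, decoder_target = shift_for_teacher_forcing(target)
print(decoder_input)   # ['<BOS>', 'le', 'chat']
print(decoder_target)  # ['le', 'chat', '<EOS>']

# Hypothetical model outputs: one distribution per decoder position.
step_probs = [
    {"le": 0.5, "chat": 0.3, EOS: 0.2},
    {"le": 0.25, "chat": 0.25, EOS: 0.5},
    {"le": 0.25, "chat": 0.25, EOS: 0.5},
]
loss = cross_entropy(step_probs, decoder_target)
print(loss)
```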

Teacher forcing makes training stable and, with a Transformer decoder, parallel (all positions in one forward pass); an RNN decoder still steps through positions in order. It also introduces the exposure bias problem: at inference the decoder conditions on its own (possibly wrong) predictions, a distribution it never saw during training. Proposed fixes include scheduled sampling and minimum-risk training; at large scale, the usual choice is to accept the gap and ship the model.
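Scheduled sampling can be sketched as a coin flip per position between the gold previous token and the model's own prediction. The sketch below is an illustration under assumptions, not a reference implementation: `predict_next` stands in for a real decoder step, and `teacher_prob` is a hypothetical name for the probability of feeding the gold token.

```python
import random

BOS = "<BOS>"

def scheduled_sampling_inputs(gold, predict_next, teacher_prob, rng):
    """Build decoder inputs, mixing gold tokens with the model's own predictions."""
    inputs = [BOS]
    for t in range(len(gold) - 1):
        predicted = predict_next(inputs)          # model's guess for position t
        use_gold = rng.random() < teacher_prob    # coin flip for this position
        inputs.append(gold[t] if use_gold else predicted)
    return inputs

def toy_predict_next(prefix):
    # Hypothetical stand-in for a decoder step: always predicts the same token.
    return "?"

gold = ["y1", "y2", "y3", "<EOS>"]
rng = random.Random(0)

pure_teacher = scheduled_sampling_inputs(gold, toy_predict_next, 1.0, rng)
pure_model = scheduled_sampling_inputs(gold, toy_predict_next, 0.0, rng)
mixed = scheduled_sampling_inputs(gold, toy_predict_next, 0.5, rng)
print(pure_teacher)  # ['<BOS>', 'y1', 'y2', 'y3']
print(pure_model)    # ['<BOS>', '?', '?', '?']
print(mixed)
```

Because each input may depend on the previous prediction, this loop is sequential, which gives up the single-pass parallelism of plain teacher forcing.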

See Language Modeling Fundamentals for the underlying next-token-prediction framing.

Decoding At Inference Time

The decoder is autoregressive: it predicts one token, appends it to the context, and predicts the next. Strategies to pick each token:

Strategy	Rule	Good for
Greedy	argmax at every step	fast, deterministic; can miss higher-probability sequences
Beam search	keep top-`k` partial hypotheses	translation, summarization — the historical default
Sampling	draw from the distribution	open-ended generation
Top-k sampling	sample from top `k` tokens, renormalized	creative writing, chat
Top-p / nucleus	sample from smallest set with total prob ≥ `p`	modern default for open-ended text
Temperature	scale logits by `1/T` before softmax	lower `T` = sharper, higher `T` = more random
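The token-level rules in the table can be sketched over a single logit vector. This is a minimal pure-Python sketch with made-up logits; the helper names (`softmax_with_temperature`, `top_k_filter`, `top_p_filter`) are illustrative and not any library's API.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing logits by T is the same as scaling them by 1/T.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    # Zero out everything outside `keep` and rescale the rest to sum to 1.
    keep_set = set(keep)
    mass = sum(probs[i] for i in keep_set)
    return [probs[i] / mass if i in keep_set else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    # Keep the smallest high-probability prefix whose total mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, total = [], 0.0
    for i in order:
        keep.append(i)
        total += probs[i]
        if total >= p:
            break
    return renormalize(probs, keep)

logits = [2.0, 1.0, 0.5, -1.0]                      # made-up scores for 4 tokens
probs = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.2)       # lower T concentrates mass
greedy_token = max(range(len(probs)), key=lambda i: probs[i])
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.8)

rng = random.Random(0)
token = rng.choices(range(len(nucleus)), weights=nucleus, k=1)[0]
print(greedy_token, token)
```

Temperature is applied before filtering here, so the two stack: a lower `T` sharpens the distribution, which in turn shrinks the nucleus that top-p keeps.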

Beam search is still the right default for translation; top-p with temperature is the right default for open-ended generation. See Prompting and Tool Use for how these choices interact with instruction following.
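To make the difference between greedy and beam search concrete, here is a small sketch over a hand-built toy distribution. Everything in it is an assumption for illustration: `toy_next_probs` replaces a real decoder, scores are plain summed log-probabilities, and there is no length normalization, which practical systems usually add.

```python
import math

EOS = "<EOS>"

def toy_next_probs(prefix):
    """Hypothetical next-token distribution, looked up by the prefix so far."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"x": 0.9, "y": 0.1},
    }
    # Any prefix not in the table ends the sequence.
    return table.get(tuple(prefix), {EOS: 1.0})

def beam_search(next_probs, beam_size, max_len=10):
    beams = [(0.0, [])]            # (summed log-probability, tokens)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for score, tokens in beams:
            for tok, p in next_probs(tokens).items():
                candidates.append((score + math.log(p), tokens + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        # Keep the best `beam_size`; those ending in <EOS> are set aside.
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

greedy_score, greedy_tokens = beam_search(toy_next_probs, beam_size=1)
beam_score, beam_tokens = beam_search(toy_next_probs, beam_size=2)
print(greedy_tokens, math.exp(greedy_score))
print(beam_tokens, math.exp(beam_score))
```

With a beam of 1 the search commits to `a` (probability 0.6) and ends with a sequence of probability 0.30; with a beam of 2 it keeps `b` alive and finds the sequence of probability 0.36.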

Evaluation Metrics

Sequence generation has no single right answer, so evaluation is harder than classification.

Metric	Measures	Use for
BLEU	n-gram precision against references	translation — standard baseline
chrF	character-level F-score	translation for morphologically rich languages
ROUGE	n-gram recall against references	summarization
BERTScore	embedding-space similarity to references	semantic similarity beyond n-grams
COMET	learned metric trained on human judgments	translation — state of the art
perplexity	exponentiated average negative log-likelihood per token under the model	language modeling and intrinsic fit

No metric is a substitute for reading outputs. A BLEU of 42 says the n-grams overlap with the reference; it does not say the sentence is correct.
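The point about n-gram overlap can be seen with a deliberately simplified BLEU. This sketch assumes a single reference, sentence-level scoring, no smoothing, and whitespace tokenization; it is not the standard corpus-level BLEU that tools such as sacreBLEU compute, and `simple_bleu` is a hypothetical name.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        ref_counts = ngrams(reference, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return brevity * math.exp(sum(log_precisions) / max_n)

reference = "the cat is on the mat".split()
wrong_meaning = "the cat is not on the mat".split()   # negated, high overlap
paraphrase = "a feline sits upon the rug".split()     # similar meaning, low overlap

score_wrong = simple_bleu(wrong_meaning, reference, max_n=2)
score_paraphrase = simple_bleu(paraphrase, reference, max_n=2)
print(score_wrong, score_paraphrase)
```

The negated sentence scores far higher than the paraphrase, even though only the paraphrase preserves the meaning.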

Length Control And <BOS>/<EOS>

Two special tokens hold the sequence-generation plumbing together:

<BOS> (beginning of sequence) — primes the decoder
<EOS> (end of sequence) — the decoder emits this to stop; without it the decoder generates forever
practical systems also impose a max_new_tokens safety cap

At inference, generation stops when the decoder emits <EOS> or the cap is hit. A decoder that rarely emits <EOS> is a common bug, usually caused by a training set in which sequences that terminate promptly with <EOS> were rare.
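The stopping rule can be sketched as a short loop. This is an illustrative sketch: `next_token` stands in for a real decoder step, and the two toy decoders below are hypothetical.

```python
BOS, EOS = "<BOS>", "<EOS>"

def generate(next_token, max_new_tokens):
    """Autoregressive loop that stops on <EOS> or when the cap is reached."""
    tokens = [BOS]
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == EOS:
            return tokens[1:], "eos"
    # The cap was hit without the decoder choosing to stop.
    return tokens[1:], "max_new_tokens"

def well_behaved(tokens):
    # Hypothetical decoder: emits three tokens, then <EOS>.
    return EOS if len(tokens) > 3 else "tok"

def runaway(tokens):
    # Hypothetical decoder that never emits <EOS>.
    return "tok"

good_out, good_reason = generate(well_behaved, max_new_tokens=8)
bad_out, bad_reason = generate(runaway, max_new_tokens=8)
print(good_out, good_reason)
print(len(bad_out), bad_reason)
```

Returning the stop reason alongside the tokens makes it easy to count how often the cap, rather than the model, ended generation.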

What To Inspect

the alignment of decoder cross-attention to encoder positions — do they correspond to the words you expect
whether <EOS> is produced at a reasonable rate
output length distribution vs. reference length distribution — a systematic shortening or lengthening is a training-signal bug
whether the error pattern lives in translation quality (word choice) or in fluency (syntax)
BLEU and a small read of 20 examples — never ship on metric alone
teacher-forced loss vs. free-running loss gap — large gap signals exposure bias
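Two of the checks above, the <EOS> rate and the length distribution, reduce to simple counts. The sketch below uses made-up outputs and assumes each generation was logged with a stop reason of either `"eos"` or `"max_new_tokens"`; the function names are illustrative.

```python
from statistics import mean

def eos_rate(stop_reasons):
    """Fraction of generations that ended because the decoder emitted <EOS>."""
    return sum(reason == "eos" for reason in stop_reasons) / len(stop_reasons)

def length_ratio(outputs, references):
    """Mean output length divided by mean reference length (1.0 means matched)."""
    return mean(len(o) for o in outputs) / mean(len(r) for r in references)

# Made-up tokenized outputs and references for three examples.
outputs = [["a", "b"], ["a", "b", "c"], ["a"]]
references = [["a", "b", "c"], ["a", "b", "c", "d"], ["a", "b"]]
stop_reasons = ["eos", "eos", "max_new_tokens", "eos"]

rate = eos_rate(stop_reasons)
ratio = length_ratio(outputs, references)
print(rate)   # 0.75
print(ratio)  # below 1.0 here: outputs are systematically shorter
```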

Failure Pattern

Shipping a model that never emits <EOS> and relying on max_new_tokens to truncate. Truncation looks like a bug to users; the fix is to rebalance the training set or upweight <EOS> in the loss.
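Upweighting <EOS> in the loss can be sketched as a per-token weight on the cross-entropy. This is one possible formulation, shown under assumptions: the probabilities are made up, `eos_weight` is a hypothetical parameter name, and the loss is normalized by the sum of weights.

```python
import math

EOS = "<EOS>"

def weighted_cross_entropy(step_probs, target, eos_weight=1.0):
    """Cross-entropy where positions whose true token is <EOS> count more."""
    total, weight_sum = 0.0, 0.0
    for probs, tok in zip(step_probs, target):
        w = eos_weight if tok == EOS else 1.0
        total += -w * math.log(probs[tok])
        weight_sum += w
    return total / weight_sum

# Hypothetical model that is confident on the word but reluctant to stop.
target = ["hi", EOS]
step_probs = [
    {"hi": 0.9, EOS: 0.1},
    {"hi": 0.9, EOS: 0.1},
]

plain = weighted_cross_entropy(step_probs, target, eos_weight=1.0)
upweighted = weighted_cross_entropy(step_probs, target, eos_weight=3.0)
print(plain, upweighted)
```

With the weight at 3, the missed <EOS> dominates the loss, so the gradient pushes harder on learning to terminate.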

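A second failure pattern is using greedy decoding on a translation task and wondering why the outputs are worse than expected. Greedy commits to the locally best token at every step and cannot recover from an early mistake; beam search is the expected baseline for translation.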

A third failure pattern is comparing two models on BLEU alone and picking the higher one. BLEU differences of around 0.5 are typically within noise; read the actual outputs of both models on a fixed comparison set.
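One way to check whether a metric difference is within noise is paired bootstrap resampling over the test set. The sketch below is a simplification under assumptions: it resamples per-sentence scores and compares their sums, whereas corpus-level BLEU would have to be recomputed on each resample; the scores are made up and `paired_bootstrap` is a hypothetical name.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Sample sentence indices with replacement; use the same ones for both.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Made-up per-sentence scores for two systems on the same eight sentences.
system_b = [0.31, 0.42, 0.18, 0.55, 0.27, 0.39, 0.44, 0.23]
system_a_close = [0.33, 0.40, 0.19, 0.53, 0.29, 0.38, 0.45, 0.22]
system_a_better = [s + 0.1 for s in system_b]

close_rate = paired_bootstrap(system_a_close, system_b)
clear_rate = paired_bootstrap(system_a_better, system_b)
print(close_rate)  # a value well short of 1.0 suggests the gap is within noise
print(clear_rate)  # 1.0: A wins on every resample
```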

A fourth failure pattern is evaluating generation quality on the training distribution. Exposure bias shows up only at inference on fresh inputs, where the decoder must condition on its own output.

Quick Checks

Is teacher forcing used in training, and free-running decoding used at inference?
Is the decoder producing <EOS> at a reasonable rate?
Is the decoding strategy matched to the task — beam for translation, nucleus for open-ended?
Is the cross-attention alignment sensible on a sample?
Is the output length distribution matched to the reference length distribution?

Practice

Implement a tiny RNN encoder-decoder on a character-level reversal task (input → reversed).
Replace the RNN decoder with a Transformer decoder, decode greedily, and compare sample quality.
Add beam search with k = 1, 4, 16 and compare BLEU on a small translation set.
Lower decoding temperature from 1.0 to 0.2 on an open-ended generation task and observe the sample distribution.
Compute teacher-forced loss and free-running loss on the same validation set. Explain the gap.
Explain why cross-attention is what makes the decoder "look at" the encoder.
Describe one case where BLEU disagrees with human judgment.
State why <EOS> is a trained decision and not a post-processing step.
Explain the relationship between temperature and top-p and when they stack.
Describe one systematic error pattern you would expect in a low-resource language pair.

Longer Connection

Encoder-decoder sits next to:

Attention and Transformers — the architecture that powers the modern encoder-decoder stack
Language Modeling Fundamentals — the next-token-prediction framing shared with decoder-only models
Tokenization Mechanics — the sub-word tokenizers that keep open-vocabulary sequence generation tractable
Text Generation and Language Models — the broader generation workflow including decoder-only models
Recurrent Networks and Sequences — the sequence models the modern architecture replaced

Encoder-decoder is the shape of every sequence-in, sequence-out problem. The decisions — attention pattern, decoding strategy, tokenization, metric — are where the real engineering lives.