Encoder-Decoder And Machine Translation
What This Is
An encoder-decoder model maps an input sequence to an output sequence through two modules: an encoder that reads the whole input into a set of contextual representations, and a decoder that generates the output token by token, conditioned on the encoder states and on what it has already produced.
The practical lesson is that encoder-decoder is not just a translation architecture: it is a standard recipe for sequence-to-sequence tasks such as translation, summarization, question answering over context, image captioning (with a vision encoder), and code completion over documents. The choices of encoder, decoder, attention, and decoding strategy all come from the same family of decisions.
When You Use It
- machine translation
- summarization — long article → short summary
- question-answering — context + question → answer span or generated answer
- image captioning — vision encoder → text decoder
- speech recognition — audio encoder → text decoder
- any task where the output length is not determined by the input length and must be generated one token at a time
Core Structure
source tokens → encoder → contextual representations (one per source position)
target tokens so far → decoder → next-token distribution
(the decoder attends to the encoder states and to its own past tokens)
Two forms:
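- RNN encoder-decoder (Sutskever et al.; Bahdanau et al.): recurrent networks on both sides. Sutskever et al. passed a fixed-size LSTM state from encoder to decoder; Bahdanau et al. added attention over all encoder states. Historical but clear
- Transformer encoder-decoder (Vaswani et al.): self-attention within each side, cross-attention from decoder to encoder; the modern default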
See Attention and Transformers for the attention mechanics that make both work.
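The cross-attention step that links the two sides can be sketched in a few lines. This is a minimal illustration, not a full layer: it assumes a single head, no learned projections, and encoder states used as both keys and values. The names `cross_attention` and `encoder_states` are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, encoder_states):
    """One decoder query attends over all encoder states (keys == values here)."""
    d = len(query)
    # Scaled dot-product score between the query and each source position.
    scores = [sum(q * k for q, k in zip(query, state)) / math.sqrt(d)
              for state in encoder_states]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the encoder states.
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source position
weights, context = cross_attention([2.0, 0.0], encoder_states)
```

The `weights` list has one entry per source position and sums to 1; it is the quantity usually plotted when inspecting alignments.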
Teacher Forcing: How Training Works
During training the decoder is given the ground-truth previous tokens as input, not its own predictions. This is teacher forcing. The loss at each position is cross-entropy against the true next token.
input to decoder: <BOS> y_1 y_2 ... y_{T-1}
target: y_1 y_2 y_3 ... y_T
Teacher forcing makes training stable and, for a Transformer decoder with a causal mask, parallel: all positions are computed in one forward pass. An RNN decoder still steps through time, but it never has to wait on its own samples. Teacher forcing also introduces the exposure bias problem: at inference the decoder conditions on its own (possibly wrong) predictions, a distribution it never saw during training. Proposed fixes include scheduled sampling and minimum-risk training; at large scale, the usual choice is to accept the gap and ship the model.
See Language Modeling Fundamentals for the underlying next-token-prediction framing.
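The one-position shift between decoder input and target can be made concrete with a small sketch. The token ids and the uniform "model output" are placeholders, and treating `<EOS>` as the final target token y_T is an assumption of this example.

```python
import math

BOS, EOS = 0, 1  # hypothetical special-token ids

def teacher_forcing_pair(target_ids):
    """Shift the reference by one: the decoder sees <BOS> y_1..y_{T-1} and predicts y_1..y_T."""
    decoder_input = [BOS] + target_ids[:-1]
    return decoder_input, target_ids

def cross_entropy(step_probs, labels):
    """Mean negative log-likelihood of the true next token at each position."""
    return -sum(math.log(p[y]) for p, y in zip(step_probs, labels)) / len(labels)

reference = [5, 7, 9, EOS]  # y_1 .. y_T, assumed to end in <EOS>
decoder_input, labels = teacher_forcing_pair(reference)

# Stand-in for model outputs: a uniform distribution over a 10-token vocabulary.
uniform = [[0.1] * 10 for _ in labels]
loss = cross_entropy(uniform, labels)
```

With a uniform distribution over 10 tokens the loss equals ln 10, which is a useful sanity check for an untrained model.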
Decoding At Inference Time
The decoder is autoregressive: it predicts one token, appends it to the context, and predicts the next. Strategies to pick each token:
| Strategy | Rule | Good for |
|---|---|---|
| Greedy | argmax at every step | fast, deterministic, weak on fluency |
| Beam search | keep top-k partial hypotheses | translation, summarization; the historical default |
| Sampling | draw from the distribution | open-ended generation |
| Top-k sampling | sample from top k tokens, renormalized | creative writing, chat |
| Top-p / nucleus | sample from smallest set with total prob ≥ p | modern default for open-ended text |
| Temperature | scale logits by 1/T before softmax | lower T = sharper, higher T = more random |
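The sampling rules in the table can be written as small filters over a next-token distribution. This is a minimal sketch with made-up logits; the function names are illustrative and not taken from any particular library.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose total mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
greedy = max(range(len(logits)), key=lambda i: logits[i])
sharp = softmax_with_temperature(logits, temperature=0.2)
flat = softmax_with_temperature(logits, temperature=2.0)
nucleus = top_p_filter(softmax_with_temperature(logits), p=0.9)
token = sample(nucleus, random.Random(0))
```

Temperature reshapes the distribution while top-k and top-p truncate it, so the two kinds of control compose: apply temperature first, then filter, then sample.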
Beam search is still the right default for translation; top-p with temperature is the right default for open-ended generation. See Prompting and Tool Use for how these choices interact with instruction following.
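A toy example shows why beam search can beat greedy decoding. The sketch below assumes a hypothetical lookup-table "model" (`toy_next_log_probs`) and omits length normalization, which practical implementations usually add. With `beam_width=1` the procedure reduces to greedy decoding.

```python
import math

EOS = "<EOS>"

def toy_next_log_probs(prefix):
    """Hypothetical stand-in for a decoder step: {token: log-prob} given a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    probs = table.get(tuple(prefix), {EOS: 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

def beam_search(next_log_probs, beam_width, max_new_tokens):
    """Keep the beam_width highest-scoring partial hypotheses at every step."""
    beams = [([], 0.0)]  # (tokens, summed log-prob)
    finished = []
    for _ in range(max_new_tokens):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            # Hypotheses that end in <EOS> leave the beam and are kept aside.
            (finished if tokens[-1] == EOS else beams).append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_next_log_probs, beam_width=1, max_new_tokens=5)
beam_tokens, beam_score = beam_search(toy_next_log_probs, beam_width=2, max_new_tokens=5)
```

Greedy commits to "a" because it is the best first token, then can reach at most probability 0.3. A beam of width 2 keeps "b" alive and finds the sequence with probability 0.36.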
Evaluation Metrics
Sequence generation has no single right answer, so evaluation is harder than classification.
| Metric | Measures | Use for |
|---|---|---|
| BLEU | n-gram precision against references | translation — standard baseline |
| chrF | character-level F-score | translation for morphologically rich languages |
| ROUGE | n-gram recall against references | summarization |
| BERTScore | embedding-space similarity to references | semantic similarity beyond n-grams |
| COMET | learned metric trained on human judgments | translation — state of the art |
| perplexity | token-level likelihood under the model | language modeling and intrinsic fit |
No metric is a substitute for reading outputs. A BLEU of 42 says the n-grams overlap with the reference; it does not say the sentence is correct.
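The gap between n-gram overlap and correctness is easy to reproduce. The sketch below computes only clipped n-gram precision, the core ingredient of BLEU. Full BLEU also combines several n-gram orders and applies a brevity penalty, so an established implementation such as sacreBLEU is the better choice for reported numbers. The example sentences are invented.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference, with counts clipped."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0

reference = "the cat did not sit on the mat".split()
wrong_meaning = "the cat did sit on the mat".split()  # negation dropped

unigram = clipped_precision(wrong_meaning, reference, 1)
bigram = clipped_precision(wrong_meaning, reference, 2)
```

The hypothesis reverses the meaning of the reference, yet every unigram and five of its six bigrams match.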
Length Control And <BOS> / <EOS>
Two special tokens hold the sequence-generation plumbing together:
- <BOS> (beginning of sequence): primes the decoder
- <EOS> (end of sequence): the decoder emits this to stop; without it the decoder generates forever
- practical systems also impose a max_new_tokens safety cap
At inference, generation stops when the decoder emits <EOS> or the cap is hit. A decoder that rarely emits <EOS> is a common bug, usually caused by a training set in which short, cleanly terminated targets were rare.
What To Inspect
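The two stopping conditions can be sketched as a plain loop. The `next_token` callables here are toy stand-ins for a real decoder step, and the token ids are hypothetical. Returning the stop reason alongside the tokens makes the cap-hit case visible instead of silent.

```python
def generate(next_token, bos_id, eos_id, max_new_tokens):
    """Autoregressive loop: stop on <EOS> or when the safety cap is reached."""
    tokens = [bos_id]
    stop_reason = "max_new_tokens"
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos_id:
            stop_reason = "eos"
            break
    return tokens[1:], stop_reason  # drop <BOS> from the returned output

BOS, EOS = 0, 1  # hypothetical ids

def counts_to_four(tokens):
    """Toy decoder step: emits 2, 3, 4, then <EOS>."""
    nxt = len(tokens) + 1
    return nxt if nxt <= 4 else EOS

def never_stops(tokens):
    """Toy decoder step that never produces <EOS>."""
    return 2

good, good_reason = generate(counts_to_four, BOS, EOS, max_new_tokens=10)
bad, bad_reason = generate(never_stops, BOS, EOS, max_new_tokens=10)
```

Logging how often `stop_reason` is `"max_new_tokens"` over a validation set is a cheap way to catch a decoder that has not learned to terminate.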
- the alignment of decoder cross-attention to encoder positions: does each output token attend to the source words you expect?
- whether <EOS> is produced at a reasonable rate
- output length distribution vs. reference length distribution; a systematic shortening or lengthening is a training-signal bug
- whether the error pattern lives in translation quality (word choice) or in fluency (syntax)
- BLEU plus a manual read of about 20 examples; never ship on a metric alone
- teacher-forced loss vs. free-running loss gap — large gap signals exposure bias
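Two of the checks above, the <EOS> rate and the length comparison, reduce to simple summary statistics. The following sketch assumes outputs and references are lists of token ids; `termination_report` and the sample data are illustrative.

```python
from statistics import mean

def termination_report(outputs, references, eos_id):
    """Summarize how often <EOS> appears and how output lengths compare to references."""
    eos_rate = mean(1.0 if eos_id in out else 0.0 for out in outputs)
    length_ratio = mean(len(o) for o in outputs) / mean(len(r) for r in references)
    return {"eos_rate": eos_rate, "length_ratio": length_ratio}

EOS = 1  # hypothetical id
outputs = [[5, 6, EOS], [7, 8, 9, 4, 3, 2], [5, EOS], [6, 6, 6, 6, 6, 6]]
references = [[5, 6, EOS], [7, 8, EOS], [5, EOS], [6, 7, 8, EOS]]
report = termination_report(outputs, references, EOS)
```

A length ratio well above or below 1, or an <EOS> rate far from 1, points at the termination signal rather than at word choice.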
Failure Pattern
Shipping a model that never emits <EOS> and relying on max_new_tokens to truncate. Truncation looks like a bug to users; the fix is to rebalance the training set or upweight <EOS> in the loss.
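One way to read "upweight <EOS> in the loss" is a weighted cross-entropy in which positions whose label is <EOS> count more. The sketch below is an assumption about how that could look, with invented probabilities and an arbitrary weight of 5.

```python
import math

def weighted_cross_entropy(step_probs, labels, eos_id, eos_weight=1.0):
    """Cross-entropy where positions whose label is <EOS> count eos_weight times."""
    weights = [eos_weight if y == eos_id else 1.0 for y in labels]
    losses = [-math.log(p[y]) for p, y in zip(step_probs, labels)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)

EOS = 1  # hypothetical id
labels = [2, 3, EOS]
# Invented model outputs: confident on content tokens, unsure about stopping.
step_probs = [
    [0.03, 0.03, 0.90, 0.04],
    [0.03, 0.03, 0.04, 0.90],
    [0.30, 0.10, 0.30, 0.30],
]
plain = weighted_cross_entropy(step_probs, labels, EOS)
upweighted = weighted_cross_entropy(step_probs, labels, EOS, eos_weight=5.0)
```

With the weight applied, the poorly predicted <EOS> position dominates the loss, so the gradient pushes harder on learning when to stop.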
A second failure pattern is using greedy decoding on a translation task and wondering why the outputs are dull or short. Beam search is not optional for translation.
A third failure pattern is comparing two models on BLEU alone and picking the higher one. BLEU differences of about 0.5 or less are usually within noise; read the actual outputs of both models side by side on a fixed set of examples.
A fourth failure pattern is evaluating generation quality on the training distribution. Exposure bias shows only at inference on fresh inputs, where the decoder must condition on its own output.
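For the third pattern, a paired bootstrap gives a rough sense of whether a metric difference is noise. The sketch below makes a simplifying assumption: it resamples per-sentence scores and compares their sums, whereas BLEU is a corpus-level statistic that would need to be recomputed on each resample. The scores are invented.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A's total score beats system B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, using the same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples

# Invented per-sentence scores for two systems on the same eight sentences.
a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
b = [0.60, 0.57, 0.69, 0.50, 0.64, 0.60, 0.70, 0.52]
win_rate = paired_bootstrap(a, b)
```

A win rate near 0.5 means the two systems are indistinguishable on this test set; only a rate close to 0 or 1 supports picking one over the other.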
Quick Checks
- Is teacher forcing wired in training, and free-running wired at inference?
- Is the decoder producing <EOS> at a reasonable rate?
- Is the decoding strategy matched to the task: beam for translation, nucleus for open-ended?
- Is the cross-attention alignment sensible on a sample?
- Is the output length distribution matched to the reference length distribution?
Practice
- Implement a tiny RNN encoder-decoder on a character-level reversal task (input → reversed).
- Replace the RNN decoder with a Transformer decoder, decode greedily, and compare sample quality.
- Add beam search with k = 1, 4, 16 and compare BLEU on a small translation set.
- Lower decoding temperature from 1.0 to 0.2 on an open-ended generation task and observe the sample distribution.
- Compute teacher-forced loss and free-running loss on the same validation set. Explain the gap.
- Explain why cross-attention is what makes the decoder "look at" the encoder.
- Describe one case where BLEU disagrees with human judgment.
- State why <EOS> is a trained decision and not a post-processing step.
- Explain the relationship between temperature and top-p and when they stack.
- Describe one systematic error pattern you would expect in a low-resource language pair.
Longer Connection
Encoder-decoder sits next to:
- Attention and Transformers — the architecture that powers the modern encoder-decoder stack
- Language Modeling Fundamentals — the next-token-prediction framing shared with decoder-only models
- Tokenization Mechanics — the sub-word tokenizers that keep open-vocabulary sequence generation tractable
- Text Generation and Language Models — the broader generation workflow including decoder-only models
- Recurrent Networks and Sequences — the sequence models the modern architecture replaced
Encoder-decoder is the shape of every sequence-in, sequence-out problem. The decisions — attention pattern, decoding strategy, tokenization, metric — are where the real engineering lives.