
Encoder-Decoder And Machine Translation

What This Is

An encoder-decoder model maps an input sequence to an output sequence through two modules: an encoder that reads the whole input into a set of contextual representations, and a decoder that generates the output token by token, conditioned on the encoder states and on what it has already produced.

The practical lesson is that encoder-decoder is not just a translation architecture — it is the standard recipe for any sequence-to-sequence task: translation, summarization, question-answering over context, image captioning (with a vision encoder), and code completion over documents. The choices of encoder, decoder, attention, and decoding strategy all come from the same family of decisions.

When You Use It

  • machine translation
  • summarization — long article → short summary
  • question-answering — context + question → answer span or generated answer
  • image captioning — vision encoder → text decoder
  • speech recognition — audio encoder → text decoder
  • any task where the output length is not determined by the input length and must be generated one token at a time

Core Structure

source tokens → encoder → contextual representations (one per source position)
target tokens so far → decoder → next-token distribution
                                  attends to encoder states + its own past

Two forms:

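  • RNN encoder-decoder (Sutskever et al.; Bahdanau et al.): recurrent networks (LSTM or GRU) on both sides. Sutskever et al. hand the decoder a single fixed-size encoder state; Bahdanau et al. add attention over all encoder states. Historical but clear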
  • Transformer encoder-decoder (Vaswani et al.) — self-attention within each side, cross-attention from decoder to encoder; the modern default

See Attention and Transformers for the attention mechanics that make both work.
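
As a rough illustration of the decoder "attending to encoder states", here is a minimal sketch of scaled dot-product cross-attention for a single decoder query. It assumes the encoder states serve directly as keys and values; a real model would apply learned projections first. The function names and toy vectors are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(query, encoder_states):
    """Scaled dot-product attention of one decoder query over all encoder states.

    Assumption: keys and values are the raw encoder states (no learned
    projections), which keeps the sketch short.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, state)) / math.sqrt(d)
              for state in encoder_states]
    weights = softmax(scores)  # one weight per source position
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(d)]
    return weights, context

# Toy example: three source positions, one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = cross_attention(query, encoder_states)
```

The weights form a distribution over source positions, which is the quantity to plot when checking alignment.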

Teacher Forcing — How Training Works

During training the decoder is given the ground-truth previous tokens as input, not its own predictions. This is teacher forcing. The loss at each position is cross-entropy against the true next token.

input to decoder:    <BOS>  y_1  y_2  ...  y_{T-1}
target:              y_1    y_2  y_3  ...  y_T

Teacher forcing makes training stable, and with a Transformer decoder and a causal mask it also makes training parallel (all positions in one forward pass); an RNN decoder still steps through positions one at a time. It also introduces the exposure bias problem: at inference the decoder conditions on its own (possibly wrong) predictions, a distribution it never saw during training. Proposed fixes include scheduled sampling and minimum-risk training; at large scale the common choice is to accept the gap and ship the model.
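
The shifted input/target layout and the per-position loss can be sketched as follows. This is a toy illustration: the `<BOS>` marker string, the helper names, and the probability values are assumptions for the example, and the "model output" is a hand-written list of distributions rather than a real decoder.

```python
import math

BOS = "<BOS>"  # hypothetical marker string for this sketch

def teacher_forcing_pairs(target_tokens):
    """Return (decoder_inputs, labels) for one target sequence y_1..y_T."""
    decoder_inputs = [BOS] + target_tokens[:-1]  # ground truth, shifted right
    labels = list(target_tokens)                 # true next token per position
    return decoder_inputs, labels

def cross_entropy(predicted_dists, labels):
    """Mean negative log-probability of the true next token at each position."""
    losses = [-math.log(dist[label])
              for dist, label in zip(predicted_dists, labels)]
    return sum(losses) / len(losses)

target = ["le", "chat", "dort", "<EOS>"]
decoder_inputs, labels = teacher_forcing_pairs(target)

# Stand-in for the decoder's output: one distribution per position
# (made-up numbers, each summing to 1).
predicted = [
    {"le": 0.7, "chat": 0.1, "dort": 0.1, "<EOS>": 0.1},
    {"le": 0.2, "chat": 0.6, "dort": 0.1, "<EOS>": 0.1},
    {"le": 0.2, "chat": 0.2, "dort": 0.5, "<EOS>": 0.1},
    {"le": 0.05, "chat": 0.05, "dort": 0.1, "<EOS>": 0.8},
]
loss = cross_entropy(predicted, labels)
```

Note that the decoder input at each position is the true previous token, never the model's own prediction; that is the whole of teacher forcing.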

See Language Modeling Fundamentals for the underlying next-token-prediction framing.

Decoding At Inference Time

The decoder is autoregressive: it predicts one token, appends it to the context, and predicts the next. Strategies to pick each token:

| Strategy | Rule | Good for |
| --- | --- | --- |
| Greedy | argmax at every step | fast, deterministic, weak on fluency |
| Beam search | keep top-k partial hypotheses | translation, summarization; the historical default |
| Sampling | draw from the distribution | open-ended generation |
| Top-k sampling | sample from top k tokens, renormalized | creative writing, chat |
| Top-p / nucleus | sample from smallest set with total prob ≥ p | modern default for open-ended text |
| Temperature | scale logits by 1/T before softmax | lower T = sharper, higher T = more random |
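
The sampling rows of the table can be sketched in a few lines of plain Python. This is an illustrative sketch, not any library's API: the function names are made up, token ids are list indices, and the logits are arbitrary.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

def sample(dist, rng):
    """Draw one token id from a {token_id: probability} mapping."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)  # lower T, sharper
rng = random.Random(0)
token = sample(top_p_filter(probs, p=0.9), rng)
```

Temperature reshapes the distribution before any filtering, so the two stack: a lower temperature concentrates mass and usually shrinks the nucleus that top-p keeps.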

Beam search is still the right default for translation; top-p with temperature is the right default for open-ended generation. See Prompting and Tool Use for how these choices interact with instruction following.
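
To see why beam search can beat greedy decoding, here is a minimal sketch over a hand-built toy "decoder". Everything in it is an assumption for illustration: `toy_next_probs` is a lookup table standing in for a model, and the search omits length normalization, which practical systems usually add.

```python
import math

EOS = "<EOS>"

def toy_next_probs(prefix):
    """Hypothetical stand-in for a decoder: next-token probabilities for a prefix."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "x": 0.1},
    }
    # Any prefix not in the table ends the sequence.
    return table.get(tuple(prefix), {EOS: 1.0})

def beam_search(next_probs, beam_size, max_len=10):
    """Return (tokens, log_prob) of the best hypothesis found."""
    beams = [([], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams only if nothing reached <EOS>.
    return max(finished or beams, key=lambda c: c[1])

greedy_tokens, greedy_score = beam_search(toy_next_probs, beam_size=1)
beam_tokens, beam_score = beam_search(toy_next_probs, beam_size=2)
```

With `beam_size=1` the search is greedy and commits to "a" (probability 0.6), ending with a sequence of probability 0.3. With `beam_size=2` it keeps "b" alive and finds "b z" at probability 0.36.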

Evaluation Metrics

Sequence generation has no single right answer, so evaluation is harder than classification.

| Metric | Measures | Use for |
| --- | --- | --- |
| BLEU | n-gram precision against references | translation; standard baseline |
| chrF | character-level F-score | translation for morphologically rich languages |
| ROUGE | n-gram recall against references | summarization |
| BERTScore | embedding-space similarity to references | semantic similarity beyond n-grams |
| COMET | learned metric trained on human judgments | translation; state of the art |
| perplexity | token-level likelihood under the model | language modeling and intrinsic fit |

No metric is a substitute for reading outputs. A BLEU of 42 says the n-grams overlap with the reference; it does not say the sentence is correct.
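
The point about overlap versus correctness can be made concrete with a simplified sketch of the two n-gram ingredients. These are not full BLEU or ROUGE (no brevity penalty, no geometric mean over n-gram orders, single reference); the function names and sentences are made up for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(hypothesis, reference, n):
    """Share of hypothesis n-grams found in the reference, counts clipped (BLEU-style)."""
    hyp = Counter(ngrams(hypothesis, n))
    ref = Counter(ngrams(reference, n))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

def ngram_recall(hypothesis, reference, n):
    """Share of reference n-grams covered by the hypothesis (ROUGE-N-style)."""
    hyp = Counter(ngrams(hypothesis, n))
    ref = Counter(ngrams(reference, n))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return matched / total

reference = "the cat did not sit on the mat".split()
hypothesis = "the cat did sit on the mat".split()  # drops "not": meaning flips

p1 = clipped_precision(hypothesis, reference, 1)
p2 = clipped_precision(hypothesis, reference, 2)
r1 = ngram_recall(hypothesis, reference, 1)
```

Here unigram precision is 1.0 and bigram precision is 5/6, yet the hypothesis says the opposite of the reference. High overlap scores do not rule out this kind of error.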

Length Control And <BOS> / <EOS>

Two special tokens, plus a length cap, hold the sequence-generation plumbing together:

  • <BOS> (beginning of sequence) — primes the decoder
  • <EOS> (end of sequence) — the decoder emits this to stop; without it the decoder generates forever
  • practical systems also impose a max_new_tokens safety cap

At inference, generation stops when the decoder emits <EOS> or the cap is hit. A decoder that rarely emits <EOS> is a common bug, usually traced to training data in which targets that end promptly with <EOS> were rare.
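
The stopping rule can be sketched as a small generation loop. The `next_token` callable is a hypothetical stand-in for one decoder step, and the two toy decoders exist only to show both exit paths.

```python
EOS = "<EOS>"
BOS = "<BOS>"

def generate(next_token, max_new_tokens):
    """Autoregressive loop: stop on <EOS> or when the cap is reached.

    `next_token` is a hypothetical callable mapping the tokens so far to the
    next token; it stands in for a real decoder step.
    Returns the generated tokens and a flag saying whether the cap was hit.
    """
    tokens = [BOS]
    hit_cap = True
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == EOS:
            hit_cap = False
            break
        tokens.append(tok)
    return tokens[1:], hit_cap  # drop <BOS> from the returned output

def well_behaved(tokens):
    """Toy decoder that emits three tokens and then <EOS>."""
    return EOS if len(tokens) > 3 else f"w{len(tokens)}"

def never_stops(tokens):
    """Toy decoder that never emits <EOS>."""
    return "again"

out_ok, capped_ok = generate(well_behaved, max_new_tokens=8)
out_bad, capped_bad = generate(never_stops, max_new_tokens=8)
```

Returning the `hit_cap` flag alongside the output makes it cheap to count how often the cap, rather than <EOS>, ended a generation.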

What To Inspect

  • the alignment of decoder cross-attention to encoder positions — do they correspond to the words you expect
  • whether <EOS> is produced at a reasonable rate
  • output length distribution vs. reference length distribution — a systematic shortening or lengthening is a training-signal bug
  • whether the error pattern lives in translation quality (word choice) or in fluency (syntax)
  • BLEU and a small read of 20 examples — never ship on metric alone
  • teacher-forced loss vs. free-running loss gap — large gap signals exposure bias
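
Two of the checks above, the <EOS> rate and the length comparison, reduce to a few lines of bookkeeping. The sketch below assumes each generation was logged with a flag for whether it hit the length cap; the helper names and the toy data are made up for illustration.

```python
def termination_rate(hit_cap_flags):
    """Fraction of generations that ended with <EOS> rather than the cap."""
    ended = sum(1 for capped in hit_cap_flags if not capped)
    return ended / len(hit_cap_flags)

def length_ratio(outputs, references):
    """Total output length divided by total reference length (1.0 = matched)."""
    return sum(len(o) for o in outputs) / sum(len(r) for r in references)

# Toy logged generations and their references (token lists).
outputs = [["a", "b"], ["a", "b", "c", "d"], ["a"]]
references = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"]]
hit_cap_flags = [False, True, False]

eos_rate = termination_rate(hit_cap_flags)
ratio = length_ratio(outputs, references)
```

A ratio well below or above 1.0 across a whole validation set is the systematic shortening or lengthening worth investigating.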

Failure Pattern

Shipping a model that never emits <EOS> and relying on max_new_tokens to truncate. Truncation looks like a bug to users; the fix is to rebalance the training set or upweight <EOS> in the loss.
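
One way to upweight <EOS> in the loss is a weighted cross-entropy, sketched below. The weighting scheme (a single scalar applied to positions whose label is <EOS>, normalized by the total weight) is an assumption for illustration, and the probabilities are made up.

```python
import math

EOS = "<EOS>"

def weighted_cross_entropy(predicted_dists, labels, eos_weight=1.0):
    """Cross-entropy where positions labeled <EOS> count eos_weight times."""
    weights = [eos_weight if label == EOS else 1.0 for label in labels]
    losses = [-w * math.log(dist[label])
              for w, dist, label in zip(weights, predicted_dists, labels)]
    return sum(losses) / sum(weights)

labels = ["hi", EOS]
# Made-up predictions: the model is unsure about stopping at position 2.
predicted = [
    {"hi": 0.9, EOS: 0.1},
    {"hi": 0.8, EOS: 0.2},
]
plain = weighted_cross_entropy(predicted, labels, eos_weight=1.0)
upweighted = weighted_cross_entropy(predicted, labels, eos_weight=3.0)
```

With the higher weight, the poorly predicted <EOS> position dominates the loss, so gradient updates push harder toward emitting it.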

A second failure pattern is using greedy decoding on a translation task and wondering why the outputs are dull or short. Beam search is not optional for translation.

A third failure pattern is comparing two models on BLEU alone and picking the higher one. BLEU differences of about 0.5 are within noise; read the actual outputs of both models on the same fixed set of examples.

A fourth failure pattern is evaluating generation quality on the training distribution. Exposure bias shows only at inference on fresh inputs, where the decoder must condition on its own output.

Quick Checks

  1. Is teacher forcing wired in training, and free-running wired at inference?
  2. Is the decoder producing <EOS> at a reasonable rate?
  3. Is the decoding strategy matched to the task — beam for translation, nucleus for open-ended?
  4. Is the cross-attention alignment sensible on a sample?
  5. Is the output length distribution matched to the reference length distribution?

Practice

  1. Implement a tiny RNN encoder-decoder on a character-level reversal task (input → reversed).
  2. Replace the RNN decoder with a Transformer decoder, keep greedy decoding, and compare sample quality.
  3. Add beam search with k = 1, 4, 16 and compare BLEU on a small translation set.
  4. Lower decoding temperature from 1.0 to 0.2 on an open-ended generation task and observe the sample distribution.
  5. Compute teacher-forced loss and free-running loss on the same validation set. Explain the gap.
  6. Explain why cross-attention is what makes the decoder "look at" the encoder.
  7. Describe one case where BLEU disagrees with human judgment.
  8. State why <EOS> is a trained decision and not a post-processing step.
  9. Explain the relationship between temperature and top-p and when they stack.
  10. Describe one systematic error pattern you would expect in a low-resource language pair.

Longer Connection

Encoder-decoder is the shape of every sequence-in, sequence-out problem. The decisions — attention pattern, decoding strategy, tokenization, metric — are where the real engineering lives.