Tokenization Mechanics

What This Is

A tokenizer maps a string of characters to a sequence of integer token IDs drawn from a fixed vocabulary — and back. It is the contract between human-readable text and a neural network that only sees integers.

The practical lesson is that the tokenizer is part of the model. Changing it invalidates pretrained weights, silently breaks inference, and changes what "one token" even means for prompt-length budgets and perplexity numbers. Every subword tokenizer is a learned object that must ship with the model.

When You Use It

  • preparing text as input to any transformer or sub-word-aware model
  • counting tokens for context-window budgets and API billing
  • comparing model outputs and losses (tokenizer-dependent)
  • debugging unexpected model behavior that traces to a tokenization quirk

Why Sub-word Tokenization Exists

Three older schemes and why modern models abandoned them:

  • word-level — vocabulary grows with the corpus, out-of-vocabulary (OOV) words become <UNK> and lose their meaning, and rare or compound words are brittle
  • character-level — no OOV but sequences get long, and the model has to rediscover morphology from scratch
  • sub-word — learn a vocabulary of common character sequences; frequent words stay whole, rare words split into meaningful pieces. No OOV, reasonable sequence length, compositional.

Nearly every modern large language model uses sub-word tokenization.

The Three Standard Algorithms

Byte-Pair Encoding (BPE) — GPT family

  1. start with a vocabulary of individual bytes (or characters)
  2. count all adjacent pairs in the training corpus
  3. merge the most frequent pair into a new token
  4. repeat until the vocabulary reaches a target size (typically 30k-100k)

Encoding splits new text into base units and then applies the learned merges in the order they were learned, so earlier merges take priority. BPE is the dominant scheme for open-vocabulary text generation.
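
The four training steps above can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not how any production tokenizer is implemented: the base alphabet is characters rather than bytes, there is no pre-tokenization, and the names train_bpe, merge_pair, and encode are made up for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of training words."""
    # Step 1: every word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair (ties: first pair seen).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
        # Step 4: the loop repeats until the merge budget is used up.
    return merges

def encode(word, merges):
    """Simplified encoder: apply each merge once, in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy corpus (hypothetical word counts).
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

With this toy corpus, four merges are enough for encode("lowest", merges) to return ['low', 'est'], even though "lowest" never appears in the training words.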

WordPiece — BERT family

Similar to BPE, but the pair chosen at each step is the one that maximizes the likelihood of the training corpus under a unigram-language-model view, rather than raw frequency. In practice the vocabularies are very close to BPE vocabularies.

Distinguishing mark: WordPiece prefixes continuation tokens with ## (e.g., play ##ing).
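
To make the ## convention concrete, here is a minimal sketch of the greedy longest-match-first segmentation that BERT-style WordPiece tokenizers typically use at encoding time. The vocabulary below is a hypothetical toy set, and wordpiece_encode is an illustrative name, not a library function.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Segment one word by repeatedly taking the longest matching piece."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary.
vocab = {"play", "##ing", "##ed", "un", "##play", "##able"}
print(wordpiece_encode("playing", vocab))
print(wordpiece_encode("unplayable", vocab))
print(wordpiece_encode("xyz", vocab))
```

Note that "play" and "##play" are separate vocabulary entries: the same characters get a different token depending on whether they start a word or continue one.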

SentencePiece (unigram or BPE) — T5, LLaMA, many multilingual models

SentencePiece treats the input as a raw stream of Unicode characters and handles whitespace itself instead of relying on a language-specific pre-tokenizer, which is critical for languages without space-separated words (Japanese, Chinese, Thai). It implements both a BPE algorithm and a unigram language model tokenizer (which starts from a large initial vocabulary and prunes tokens based on likelihood).

Distinguishing mark: a leading ▁ (U+2581, which resembles an underscore) marks word boundaries by standing in for the preceding space.
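
The whitespace marker is what makes SentencePiece-style output reversible without language-specific rules. The sketch below only imitates the marker convention; it does not call the SentencePiece library, the split into pieces is a hypothetical one chosen by hand, and mark_whitespace and detokenize are illustrative names.

```python
WS = "\u2581"  # the '▁' marker character (U+2581)

def mark_whitespace(text):
    """Replace spaces with the marker and prepend one for the first word."""
    return WS + text.replace(" ", WS)

def detokenize(pieces):
    """Concatenate pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(WS, " ")
    # Drop the single space introduced by the prepended marker.
    return text[1:] if text.startswith(" ") else text

print(mark_whitespace("Hello world."))

# A hypothetical segmentation of the marked string.
pieces = [WS + "Hello", WS + "wor", "ld", "."]
print(detokenize(pieces))
```

Because the space is stored inside the pieces, joining them and swapping the marker back is enough to recover the original text.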

Byte-Level BPE

Most modern GPT-family tokenizers (tiktoken, GPT-2/3/4 tokenizers) are byte-level BPE: the base alphabet is bytes, not Unicode characters. This gives a closed, finite alphabet of 256 base units that covers every possible text (including emoji, non-Latin scripts, binary bytes) without any UNK handling.

Consequence: a single emoji can become several tokens, and a non-Latin sentence can use 2-4× more tokens than the same sentence in English.
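
The byte counts behind that consequence are easy to check with the standard library. The sketch below shows only the base units before any merges are applied; a trained tokenizer may merge some of these bytes back into fewer tokens, so treat the counts as an upper bound per character. utf8_byte_ids is an illustrative name.

```python
def utf8_byte_ids(text):
    """Return the UTF-8 bytes of `text` as integers in the range 0..255."""
    return list(text.encode("utf-8"))

# One ASCII letter, one accented Latin letter, one CJK character, one emoji.
for sample in ["A", "é", "中", "😀"]:
    ids = utf8_byte_ids(sample)
    print(repr(sample), "characters:", len(sample), "bytes:", len(ids), ids)

# Any string round-trips, so there is nothing left over to map to UNK.
text = "Tokenize 中文 and 😀"
assert bytes(utf8_byte_ids(text)).decode("utf-8") == text
```

Each sample is a single character, but the number of base units grows from one byte for ASCII to four for the emoji.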

Minimal Example

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Tokenization is a contract.")
print(ids)
print(tok.convert_ids_to_tokens(ids))

A clean habit is to print both the IDs and the token strings when debugging — a lot of tokenizer bugs become obvious the moment you see which subwords the model actually consumes.

Special Tokens

Models ship with reserved tokens that have special roles:

  • <BOS> / [CLS] — beginning-of-sequence / classification head anchor
  • <EOS> / [SEP] — end-of-sequence / separator
  • <PAD> — padding to a fixed length
  • <UNK> — reserved unknown token (rare in modern sub-word tokenizers)
  • <MASK> — masked position in MLM
  • chat-template tokens — modern chat models define roles (<|user|>, <|assistant|>, etc.) as reserved tokens

These must match the model. A model trained with <|end|> as its EOS token will not stop where you expect if the generation code is told to stop on <|endoftext|> instead.
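
A toy decoding loop shows why the EOS mismatch matters. Everything here is hypothetical: the token IDs are invented, toy_model is a scripted stand-in for a real model, and greedy_generate is an illustrative name rather than a library function.

```python
def greedy_generate(next_token, eos_id, max_new_tokens):
    """Collect tokens until `eos_id` appears or the budget runs out."""
    generated = []
    for _ in range(max_new_tokens):
        token_id = next_token(generated)
        if token_id == eos_id:
            break
        generated.append(token_id)
    return generated

MODEL_EOS_ID = 7  # hypothetical ID of <|end|>, the EOS the model was trained with
OTHER_EOS_ID = 0  # hypothetical ID of <|endoftext|>, which this model never emits

def toy_model(generated):
    """Scripted stand-in: two content tokens, then the trained EOS, repeated."""
    script = [11, 12, MODEL_EOS_ID]
    return script[len(generated) % len(script)]

print(greedy_generate(toy_model, eos_id=MODEL_EOS_ID, max_new_tokens=8))
print(greedy_generate(toy_model, eos_id=OTHER_EOS_ID, max_new_tokens=8))
```

With the matching EOS the loop stops after two tokens; with the wrong one it runs until max_new_tokens and the trained EOS shows up as ordinary output.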

What "One Token" Actually Is

A few calibration points (GPT-family BPE, English text):

  • Hello → 1 token
  • tokenization → 2 tokens (token + ization)
  • unexpectedly → 3 tokens
  • 1234567890 → several tokens (digits often split)
  • emoji → 1-4 tokens each
  • a Chinese sentence → often 2× or more tokens per character compared with English, depending on the tokenizer

Practical implication: context windows are measured in tokens, not characters. At roughly 0.75 English words per token, a "32k context" model fits about 24k English words, and noticeably less text in CJK languages.
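
The arithmetic behind that estimate can be written down directly. The 0.75 words-per-token ratio is a rule-of-thumb assumption implied by the 32k-to-24k figure above, not a property of any particular tokenizer, and both function names are illustrative.

```python
def estimate_english_words(context_tokens, words_per_token=0.75):
    """Rough word capacity of a context window (English rule of thumb)."""
    return int(context_tokens * words_per_token)

def fits_in_context(prompt_tokens, max_new_tokens, context_tokens):
    """The prompt and the generated tokens share the same window."""
    return prompt_tokens + max_new_tokens <= context_tokens

print(estimate_english_words(32_000))
print(fits_in_context(prompt_tokens=31_000, max_new_tokens=2_000, context_tokens=32_000))
```

For real budgets, replace the estimate with a count from the model's own tokenizer; the ratio shifts with language and content type.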

What To Inspect

  • the actual token sequence for a representative input — before training, before prompting, before debugging
  • whether the tokenizer matches the model (check the checkpoint metadata)
  • the token count budget for your context — and the cost in tokens of your padding
  • whether special tokens (BOS/EOS/PAD) are being added as expected
  • whether chat templates are being applied (many libraries apply them silently)
  • which inputs produce unusually long token counts — often a sign of non-Latin text or numeric strings

Failure Pattern

Loading a model with a different tokenizer than it was trained with. No error; the model just produces garbage because the token IDs mean nothing to it.

A second failure pattern is computing perplexity or loss with a tokenizer different from the one used during training, then comparing to published numbers. The comparison is meaningless.
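
A small calculation shows why the comparison breaks. Perplexity divides the total negative log-likelihood by the number of tokens, so the same total spread over a different token count gives a different number. The values below are hypothetical, and in practice a different tokenizer also means a different model and a different total; the sketch isolates only the normalization effect. Bits per byte is shown as one tokenizer-independent alternative.

```python
import math

def perplexity(total_nll_nats, n_tokens):
    """exp of the average negative log-likelihood per token."""
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats, text):
    """Normalize by UTF-8 bytes instead of tokens."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

total_nll = 120.0  # hypothetical total NLL (in nats) for one fixed text
print(perplexity(total_nll, n_tokens=40))  # tokenizer A: 40 tokens
print(perplexity(total_nll, n_tokens=60))  # tokenizer B: 60 tokens
```

The finer-grained tokenizer reports the lower perplexity here even though the total likelihood of the text is identical.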

A third failure pattern is measuring prompt length in characters and then being surprised when the model truncates. Prompt length is measured in tokens, not characters, and the ratio is not constant across languages or content types.

A fourth failure pattern is forgetting to apply the chat template. Passing raw "You are a helpful assistant.\nUser: hello" to a chat model can produce strange outputs because the template tokens the model expects are missing.
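
The sketch below illustrates what a chat template does, using a made-up format built from the role tokens mentioned earlier. The layout, the <|end|> terminator, and the render_chat name are all assumptions for illustration; a real model defines its own template, and libraries such as Hugging Face Transformers expose it through the tokenizer (apply_chat_template) rather than requiring you to hand-write it.

```python
def render_chat(messages, add_generation_prompt=True):
    """Wrap each message in hypothetical role tokens and an end marker."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Cue the model that the assistant speaks next.
        parts.append("<|assistant|>\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
print(render_chat(messages))
```

Compare the rendered string with the raw "User: hello" text: the role markers are reserved tokens the model saw during training, not ordinary words.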

A fifth failure pattern is padding on the wrong side. Most decoder-only LMs expect left-padding during batched generation so that each sequence's real tokens sit at the end, directly before the position where new tokens are generated; right-padding during generation can silently degrade quality.
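
The difference between the two padding sides is easiest to see on a tiny batch. This is a pure-Python sketch with invented token IDs; pad_batch is an illustrative helper, not a library call.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        if side == "left":
            input_ids.append([pad_id] * n_pad + list(seq))
            attention_mask.append([0] * n_pad + [1] * len(seq))
        elif side == "right":
            input_ids.append(list(seq) + [pad_id] * n_pad)
            attention_mask.append([1] * len(seq) + [0] * n_pad)
        else:
            raise ValueError("side must be 'left' or 'right'")
    return input_ids, attention_mask

batch = [[5, 6, 7], [8]]  # hypothetical token IDs
print(pad_batch(batch, pad_id=0, side="left"))
print(pad_batch(batch, pad_id=0, side="right"))
```

With left-padding the last position of every row is a real token, so generation continues from real context; with right-padding the short row ends in pad tokens.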

Quick Checks

  1. Does the tokenizer checkpoint match the model checkpoint?
  2. Is the chat template being applied if you are using a chat model?
  3. Is padding on the correct side for your operation (left for generation, right for training)?
  4. Are special tokens (BOS, EOS, PAD) the ones the model expects?
  5. Does your token budget calculation use the actual tokenizer or a character-based guess?

Practice

  1. Encode the same sentence with GPT-2, BERT, and LLaMA tokenizers. Compare token counts and splits.
  2. Train a tiny BPE tokenizer on a small corpus and inspect the top 100 merges.
  3. Tokenize a Japanese sentence with a SentencePiece unigram tokenizer and count tokens per character.
  4. Deliberately load a wrong tokenizer against a model and observe the generation output.
  5. Measure token count for a prompt at 100-character, 1000-character, and 10000-character lengths. Compute the token-per-character ratio.
  6. Explain why byte-level BPE has no OOV problem.
  7. Describe why " hello" (with a leading space) and "hello" are different tokens in BPE tokenizers.
  8. Explain the difference between the ## prefix in WordPiece and the ▁ prefix in SentencePiece.
  9. State one reason digit-by-digit splitting can cause a model to struggle with long-number arithmetic.
  10. Describe one practical implication of non-Latin scripts using more tokens per character.

Longer Connection

The tokenizer is not a pre-processing detail; it is part of the model's learned contract. Changing it, mis-matching it, or miscounting its tokens is one of the easiest and most painful ways to ship a broken NLP system.