Tokenization Mechanics
What This Is
A tokenizer maps a string of characters to a sequence of integer token IDs drawn from a fixed vocabulary — and back. It is the contract between human-readable text and a neural network that only sees integers.
The practical lesson is that the tokenizer is part of the model. Changing it invalidates pretrained weights, silently breaks inference, and changes what "one token" even means for prompt-length budgets and perplexity numbers. Every subword tokenizer is a learned object that must ship with the model.
When You Use It
- preparing text as input to any transformer or sub-word-aware model
- counting tokens for context-window budgets and API billing
- comparing model outputs and losses (tokenizer-dependent)
- debugging unexpected model behavior that traces to a tokenization quirk
Why Sub-word Tokenization Exists
Three schemes: two older ones that modern models abandoned, and the one that replaced them.
- word-level: vocabulary grows with the corpus, out-of-vocabulary (OOV) words become <UNK> and lose their meaning, and rare or compound words are brittle
- character-level: no OOV, but sequences get long, and the model has to rediscover morphology from scratch
- sub-word: learn a vocabulary of common character sequences; frequent words stay whole, rare words split into meaningful pieces. No OOV, reasonable sequence length, compositional.
Nearly every modern large language model uses sub-word tokenization.
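The trade-off between the two older schemes can be seen in a toy sketch. The vocabulary, function names, and sentence below are hypothetical and chosen only for illustration; real word-level vocabularies are far larger, but the failure mode is the same.

```python
# Toy comparison of word-level and character-level tokenization.
# WORD_VOCAB is a hypothetical, deliberately tiny vocabulary.
WORD_VOCAB = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3}


def word_level_encode(text, vocab=WORD_VOCAB):
    """Map each whitespace-separated word to its ID, or to <UNK> if unseen."""
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]


def char_level_encode(text):
    """Use Unicode code points as IDs: never out-of-vocabulary, but long."""
    return [ord(ch) for ch in text]


sentence = "the cat sat unexpectedly"
word_ids = word_level_encode(sentence)
char_ids = char_level_encode(sentence)

print(word_ids)                      # [1, 2, 3, 0]: the unseen word collapses to <UNK>
print(len(word_ids), len(char_ids))  # 4 24
```

The word-level encoder loses "unexpectedly" entirely, while the character-level encoder keeps everything at six times the sequence length. Sub-word schemes aim for the middle ground.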
The Three Standard Algorithms
Byte-Pair Encoding (BPE): GPT family
- start with a vocabulary of individual bytes (or characters)
- count all adjacent pairs in the training corpus
- merge the most frequent pair into a new token
- repeat until the vocabulary reaches a target size (typically 30k-100k)
Encoding replays the learned merges on new text in the order they were learned, so earlier (more frequent) merges take priority. BPE is the dominant scheme for open-vocabulary text generation.
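The training steps and the encoding rule above can be sketched in a few lines of pure Python. This is a minimal character-level illustration, not a production implementation: the function names and the toy word counts are assumptions made for the example, ties are broken by first occurrence, and real tokenizers add pre-tokenization, byte handling, and much faster data structures.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_counts, num_merges):
    """Learn an ordered list of merges from a {word: count} dictionary."""
    # Every word starts as a tuple of single-character symbols.
    corpus = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the pair counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = {merge_pair(s, best): c for s, c in corpus.items()}
    return merges


def encode(word, merges):
    """Split a new word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus of word frequencies.
toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = train_bpe(toy_corpus, num_merges=4)
print(merges)                    # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it splits into two pieces that did: this is the compositional behavior that removes the OOV problem.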
WordPiece: BERT family
Similar to BPE, but the pair chosen at each step is the one that maximizes the likelihood of the training corpus under a unigram-language-model view, rather than raw frequency. In practice the vocabularies are very close to BPE vocabularies.
Distinguishing mark: WordPiece prefixes continuation tokens with ## (e.g., play ##ing).
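At encoding time, BERT-style WordPiece splits each word greedily, taking the longest vocabulary entry that matches at the current position. The sketch below assumes a tiny hypothetical vocabulary (`TOY_VOCAB`) and a function name chosen for this example; it shows where the ## prefix comes from and how a word with no valid split falls back to the unknown token.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word with a WordPiece-style vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits here, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; a real BERT vocabulary has roughly 30k entries.
TOY_VOCAB = {"play", "un", "##ing", "##ed", "##play", "##able", "[UNK]"}

print(wordpiece_encode("playing", TOY_VOCAB))     # ['play', '##ing']
print(wordpiece_encode("unplayable", TOY_VOCAB))  # ['un', '##play', '##able']
print(wordpiece_encode("xyz", TOY_VOCAB))         # ['[UNK]']
```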
SentencePiece (unigram or BPE): T5, LLaMA, many multilingual models
SentencePiece treats the input as a raw stream of Unicode characters and handles whitespace internally as an ordinary symbol, rather than relying on a separate word-splitting step. This is critical for languages without space-separated words (Japanese, Chinese, Thai). It implements both a BPE algorithm and a unigram language model tokenizer, which starts from a large initial vocabulary and prunes the pieces whose removal costs the least likelihood.
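Distinguishing mark: a leading ▁ (U+2581, LOWER ONE EIGHTH BLOCK, which resembles an underscore but is a different character) marks the start of a word by standing in for the preceding space.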
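The unigram tokenizer's encoding step can be sketched as a Viterbi search for the most probable split of the marked-up text. The piece probabilities below are hypothetical toy values (a trained model would learn them), the function names are made up for this example, and real SentencePiece adds normalization and options that are omitted here.

```python
import math

MARK = "\u2581"  # the visible whitespace marker

# Hypothetical, unnormalized piece probabilities for illustration only.
TOY_PIECES = {
    MARK + "token": 0.10, "ization": 0.08, MARK + "is": 0.12,
    MARK + "to": 0.05, "ken": 0.02, "iz": 0.02, "ation": 0.05,
    MARK: 0.10, "i": 0.03, "s": 0.03,
}


def unigram_segment(text, piece_probs):
    """Return the most probable segmentation of `text` under a unigram piece model."""
    s = MARK + text.replace(" ", MARK)  # spaces become the marker
    n = len(s)
    best_score = [-math.inf] * (n + 1)  # best log-probability of s[:i]
    best_start = [0] * (n + 1)          # where the last piece of that split begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in piece_probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(piece_probs[piece])
                if score > best_score[end]:
                    best_score[end], best_start[end] = score, start
    if best_score[n] == -math.inf:
        raise ValueError("no segmentation exists with this vocabulary")
    # Follow the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(s[start:end])
        end = start
    return pieces[::-1]


def detokenize(pieces):
    """Join pieces and turn markers back into spaces, dropping the leading one."""
    return "".join(pieces).replace(MARK, " ")[1:]


pieces = unigram_segment("tokenization is", TOY_PIECES)
print(pieces)              # ['▁token', 'ization', '▁is']
print(detokenize(pieces))  # tokenization is
```

Because the space is encoded inside the pieces themselves, joining and replacing the marker recovers the original string exactly, with no language-specific rules.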
Byte-Level BPE
Most modern GPT-family tokenizers (tiktoken, GPT-2/3/4 tokenizers) are byte-level BPE — the base alphabet is bytes, not unicode characters. This gives a closed, finite alphabet of 256 base units that covers every possible text (including emoji, non-Latin scripts, binary bytes) without any UNK handling.
Consequence: a single emoji can become several tokens, and a non-Latin sentence can use 2-4× more tokens than the same sentence in English.
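The byte counts behind this consequence are easy to check with the standard library. The helper name below is made up for this sketch; it shows only the starting point before any merges are applied, so a trained tokenizer may merge some of these bytes back into fewer tokens.

```python
def utf8_byte_ids(text):
    """Return the base byte IDs (0-255) that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))


# ASCII letter, accented Latin letter, CJK character, emoji.
for sample in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    ids = utf8_byte_ids(sample)
    print(repr(sample), len(ids), ids)
# 'A'  -> 1 byte:  [65]
# 'é'  -> 2 bytes: [195, 169]
# '中' -> 3 bytes: [228, 184, 173]
# '😀' -> 4 bytes: [240, 159, 152, 128]
```

Every value falls in the range 0 to 255, which is why no input can ever be out of vocabulary. Text whose byte sequences were rare in the training corpus simply gets fewer merges and therefore more tokens.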
Minimal Example
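# Requires the Hugging Face transformers package; downloads the GPT-2 tokenizer files on first use.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok.encode("Tokenization is a contract.")
print(ids)                             # integer token IDs the model consumes
print(tok.convert_ids_to_tokens(ids))  # the matching sub-word strings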
A clean habit is to print both the IDs and the token strings when debugging — a lot of tokenizer bugs become obvious the moment you see which subwords the model actually consumes.
Special Tokens
Models ship with reserved tokens that have special roles:
- <BOS> / [CLS]: beginning-of-sequence / classification head anchor
- <EOS> / [SEP]: end-of-sequence / separator
- <PAD>: padding to a fixed length
- <UNK>: reserved unknown token (rare in modern sub-word tokenizers)
- <MASK>: masked position in MLM
- chat-template tokens: modern chat models define roles (<|user|>, <|assistant|>, etc.) as reserved tokens
These must match the model. A model trained with <|end|> as the EOS token will not stop generating on cue if the generation loop watches for <|endoftext|> instead.
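A toy decoding loop makes the EOS mismatch concrete. Everything here is a stand-in: `fake_model`, the token IDs, and the length cap are hypothetical, and a real generation loop would sample from model logits instead.

```python
def generate(next_token_fn, eos_id, max_new_tokens=8):
    """Collect token IDs from `next_token_fn` until `eos_id` or the length cap."""
    out = []
    for step in range(max_new_tokens):
        token_id = next_token_fn(step)
        if token_id == eos_id:
            break
        out.append(token_id)
    return out


# Hypothetical IDs: this stand-in model learned to emit 7 (<|end|>) when done.
END_ID, ENDOFTEXT_ID = 7, 0


def fake_model(step):
    """Stand-in for a model: three content tokens, then <|end|> repeatedly."""
    return [11, 12, 13][step] if step < 3 else END_ID


print(generate(fake_model, eos_id=END_ID))        # [11, 12, 13]
print(generate(fake_model, eos_id=ENDOFTEXT_ID))  # [11, 12, 13, 7, 7, 7, 7, 7]
```

With the wrong stop token configured, the loop never sees the ID it is waiting for and runs to the cap, which is how the mismatch shows up in practice as runaway or trailing output.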
What "One Token" Actually Is
A few calibration points (GPT-family BPE, English text):
- Hello → 1 token
- tokenization → 2 tokens (token + ization)
- unexpectedly → 3 tokens
- 1234567890 → several tokens (digits often split)
- emoji → 1-4 tokens each
- a Chinese sentence → roughly 2× more tokens per character than English
Practical implication: context windows are in tokens, not characters. A "32k context" model fits roughly 24k English words or fewer CJK words.
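The 32k-to-24k figure corresponds to a rule of thumb of about 0.75 English words per token. A back-of-the-envelope helper, with that ratio and the function name treated as assumptions rather than measured values, might look like this:

```python
# Rough English-only heuristic implied by the 32k -> 24k figure; not exact.
WORDS_PER_TOKEN = 0.75


def approx_english_words(context_tokens, reserved_tokens=0):
    """Estimate how many English words fit after reserving tokens for other uses."""
    usable = max(context_tokens - reserved_tokens, 0)
    return int(usable * WORDS_PER_TOKEN)


print(approx_english_words(32_000))                         # 24000
print(approx_english_words(32_000, reserved_tokens=2_000))  # 22500
```

The `reserved_tokens` parameter stands for whatever else must share the window, such as a system prompt or the expected output. For anything that matters, count with the model's actual tokenizer instead of this ratio.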
What To Inspect
- the actual token sequence for a representative input — before training, before prompting, before debugging
- whether the tokenizer matches the model (check the checkpoint metadata)
- the token count budget for your context — and the cost in tokens of your padding
- whether special tokens (BOS/EOS/PAD) are being added as expected
- whether chat templates are being applied (many libraries apply them silently)
- which inputs produce unusually long token counts — often a sign of non-Latin text or numeric strings
Failure Pattern
The first failure pattern is loading a model with a different tokenizer than it was trained with. Often there is no error; the model just produces garbage because the token IDs mean nothing to it.
A second failure pattern is computing perplexity or loss with a tokenizer different from the one used during training, then comparing to published numbers. The comparison is meaningless.
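The arithmetic behind this is short. Per-token perplexity divides the total negative log-likelihood by the number of tokens, so the same total spread over a different token count yields a different number. The values below are hypothetical, and the sketch holds the total fixed purely to isolate the normalization effect; a per-byte measure is shown as one tokenizer-independent alternative.

```python
import math


def per_token_perplexity(total_nll_nats, num_tokens):
    """Perplexity = exp(average negative log-likelihood per token)."""
    return math.exp(total_nll_nats / num_tokens)


def bits_per_byte(total_nll_nats, text):
    """Normalize by UTF-8 bytes instead of tokens, so the tokenizer drops out."""
    return total_nll_nats / math.log(2) / len(text.encode("utf-8"))


text = "Tokenization is a contract."
total_nll = 12.0  # hypothetical total negative log-likelihood, in nats

ppl_a = per_token_perplexity(total_nll, num_tokens=4)  # tokenizer A: 4 tokens
ppl_b = per_token_perplexity(total_nll, num_tokens=6)  # tokenizer B: 6 tokens

print(round(ppl_a, 1), round(ppl_b, 1))  # about 20.1 versus about 7.4
print(bits_per_byte(total_nll, text))    # same value whichever tokenizer was used
```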
A third failure pattern is measuring prompt length in characters and then being surprised when the model truncates. Prompt length is measured in tokens, not characters, and the ratio is not constant across languages or content types.
A fourth failure pattern is forgetting to apply the chat template. Passing raw "You are a helpful assistant.\nUser: hello" to a chat model can produce strange outputs because the template tokens the model expects are missing.
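What a chat template does can be sketched with plain string formatting. The format below is made up for illustration, reusing the <|user|>, <|assistant|>, and <|end|> tokens mentioned earlier plus an assumed <|system|> role; every real model defines its own template, and in Hugging Face transformers the tokenizer's `apply_chat_template` method renders the real one.

```python
def render_chat(messages, add_generation_prompt=True):
    """Render messages with a hypothetical template; real templates differ per model."""
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    if add_generation_prompt:
        # Cue the model that the assistant's turn starts here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello"},
]
rendered = render_chat(messages)
print(rendered)
```

The raw string "You are a helpful assistant.\nUser: hello" contains none of these role markers, so a model trained on the templated form sees an input unlike anything in its chat training data.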
A fifth failure pattern is padding on the wrong side. Most decoder-only LMs expect left-padding during batched generation so that every sequence ends in real tokens and generation continues directly from the prompt; with right-padding, the model is asked to continue from pad tokens, which can silently degrade quality.
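The difference between the two padding sides is easiest to see on a small batch. The helper below is a hypothetical pure-Python sketch (the name `pad_batch` and the pad ID of 0 are assumptions); libraries expose the same choice through a tokenizer setting.

```python
def pad_batch(sequences, pad_id, side):
    """Pad token-ID lists to equal length; returns (input_ids, attention_mask)."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = [pad_id] * (width - len(seq))
        real, zeros = [1] * len(seq), [0] * len(pad)
        if side == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(zeros + real)
        else:
            input_ids.append(list(seq) + pad)
            attention_mask.append(real + zeros)
    return input_ids, attention_mask


batch = [[5, 6, 7], [8, 9]]
left_ids, left_mask = pad_batch(batch, pad_id=0, side="left")
right_ids, right_mask = pad_batch(batch, pad_id=0, side="right")

print(left_ids)   # [[5, 6, 7], [0, 8, 9]]: every row ends in a real token
print(right_ids)  # [[5, 6, 7], [8, 9, 0]]: the short row ends in padding
```

During generation the next token is predicted from the last position of each row, so only the left-padded batch has a real token there for every sequence.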
Quick Checks
- Does the tokenizer checkpoint match the model checkpoint?
- Is the chat template being applied if you are using a chat model?
- Is padding on the correct side for your operation (left for generation, right for training)?
- Are special tokens (BOS, EOS, PAD) the ones the model expects?
- Does your token budget calculation use the actual tokenizer or a character-based guess?
Practice
- Encode the same sentence with GPT-2, BERT, and LLaMA tokenizers. Compare token counts and splits.
- Train a tiny BPE tokenizer on a small corpus and inspect the top 100 merges.
- Tokenize a Japanese sentence with a SentencePiece unigram tokenizer and count tokens per character.
- Deliberately load a wrong tokenizer against a model and observe the generation output.
- Measure token count for a prompt at 100-character, 1000-character, and 10000-character lengths. Compute the token-per-character ratio.
- Explain why byte-level BPE has no OOV problem.
- Describe why " hello" (with a leading space) and "hello" are different tokens in BPE tokenizers.
- Explain the difference between the ## prefix in WordPiece and the ▁ prefix in SentencePiece.
- State one reason digit-by-digit splitting can cause a model to struggle with long-number arithmetic.
- Describe one practical implication of non-Latin scripts using more tokens per character.
Longer Connection
Tokenization mechanics sit next to:
- Language Modeling Fundamentals — perplexity and everything downstream depend on the tokenizer
- Text Representations and Order — where sub-word tokenization sits in the broader text-encoding story
- Text Generation and Language Models — generation is over token IDs; the tokenizer converts back to text
- Encoder-Decoder And Machine Translation — why cross-lingual models use SentencePiece and why source and target often share a tokenizer
- Retrieval-Augmented Generation — where token budgets for context become a production constraint
The tokenizer is not a pre-processing detail; it is part of the model's learned contract. Changing it, mis-matching it, or miscounting its tokens is one of the easiest and most painful ways to ship a broken NLP system.