Thinking in Transformers
Chapter 1: Tokenization — Where Language Becomes Discrete
Why tokenization is the first information bottleneck in LLMs, shaping compression, context length, multilingual behavior, and symbolic performance.
Before a language model learns anything about syntax, semantics, facts, code, reasoning, or style, someone has to answer a question that sounds almost insultingly mechanical:
What counts as a unit?
That question is easy to underestimate because language arrives on the page looking already segmented. We see words, punctuation, lines, symbols. But that neatness is partly an illusion. Raw text is just a stream of characters. Characters do not reliably correspond to meaningful units. Words feel more natural to humans, but even “word” turns fuzzy the moment you move across languages, scripts, domains, code, mathematical notation, URLs, chemical formulas, emoji, or mixed-script text.
A language model cannot work directly with that ambiguity. It needs a finite vocabulary of discrete pieces. The system that creates those pieces is the tokenizer, and it is the very first transformation applied to raw text before the model sees anything at all.
That makes tokenization easy to dismiss as preprocessing, a solved problem, an implementation detail at the edge of the real system.
That dismissal is a mistake.
Tokenization is the first lossy compression in the entire pipeline. Its choices propagate forward into everything that follows: sequence length, memory cost, attention patterns, multilingual capability, arithmetic behavior, context efficiency, and even some failure modes that later get misdiagnosed as “reasoning problems.”
The sentence to hold onto through this chapter is this:
The compression ratio achieved by the tokenizer is the information bottleneck that constrains every layer above it.
Every tokenizer is an answer to the same design question: how should language be compressed into discrete units so a neural network can process it? That answer determines what the model finds easy, what it finds hard, and what it never gets a fair chance to learn.
The naive extremes both fail
The easiest way to understand tokenization is to look first at the two extremes.
Character-level tokenization
The most obvious scheme is the most granular one: make every character a token.
On paper, this looks clean. The vocabulary is tiny. There is no out-of-vocabulary problem. There is no complex segmentation logic. Everything can be represented.
The problem is sequence length.
A sentence that might take roughly 10 subword tokens can easily take 50 or more character tokens once spaces and punctuation are included. That difference matters because transformers pay for sequence length everywhere. Attention cost grows quadratically with sequence length. Memory pressure increases. Context windows hold less usable content. KV caches get larger. Training slows down.
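The inflation is easy to see directly. A character-level tokenizer is nothing more than a mapping from characters to integer IDs; this minimal sketch (with invented names) shows how a short sentence becomes a long ID sequence over a tiny vocabulary:

```python
def char_tokenize(text, vocab):
    # Assign a new ID the first time each character is seen.
    ids = []
    for ch in text:
        if ch not in vocab:
            vocab[ch] = len(vocab)
        ids.append(vocab[ch])
    return ids

vocab = {}
sentence = "Attention is all you need."
ids = char_tokenize(sentence, vocab)

# One token per character: 26 tokens from a 14-entry vocabulary.
print(len(ids), len(vocab))
```

The vocabulary stays tiny, but every character costs a full sequence position — exactly the tradeoff described above.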
But cost is only the visible problem. The deeper one is informational.
A single character carries very little meaning. The letter “t” in “the” and the letter “t” in “transformer” are identical as tokens even though they participate in entirely different structures. A character-level model has to learn, from scratch, that some character sequences form morphemes, that morphemes compose into words, and that words participate in syntax and semantics. It has to rediscover structure that a better tokenization scheme could have handed it almost for free.
This is not impossible. Character-level models can work. But they are inefficient, slow to train, and historically not the frontier choice for large-scale language modeling. The information density per token is too low. You are asking the model to spend precious capacity learning spelling before it can learn language.
Word-level tokenization
Now move to the other extreme: make each word a token.
This solves the sequence-length problem immediately. Tokens now carry much richer semantic content. “Transformer” is one token. “Architecture” is one token. The model can operate over shorter, denser sequences.
But the vocabulary explodes.
Once you account for inflections, compounds, proper nouns, misspellings, neologisms, domain-specific terminology, code identifiers, multilingual text, and the endless long tail of real-world strings, the number of possible words becomes enormous. A vocabulary large enough to cover realistic language use becomes unwieldy. The embedding table grows huge. Rare words receive little training signal. And worst of all, truly unseen words become impossible to represent cleanly.
That last issue is fatal.
In a word-level system, anything outside the vocabulary collapses into an unknown token. A new name, a typo, a novel technical term, an underrepresented language form — all of it becomes the same placeholder. Information is not merely compressed. It is erased.
A general-purpose language model cannot afford that.
The subword insight
The solution that made modern language modeling practical sits between those extremes.
Language has internal structure, and that structure is compressible.
Take a word like unhappiness. A word-level system treats it as a single opaque symbol. A character-level system sees eleven low-information pieces. A more useful representation recognizes recurring substructures: un + happy + ness.
Those pieces matter because they recur across many contexts. Prefixes recur. Roots recur. Suffixes recur. Common words recur. Common fragments recur. Rare words often decompose into familiar pieces.
Subword tokenization exploits exactly that fact.
Instead of committing to characters or words, it learns a vocabulary of variable-length units directly from data. Frequent words often stay intact. Rare words get decomposed into familiar sub-pieces. The vocabulary remains bounded, but coverage remains open-ended because the tokenizer can always fall back to smaller units.
That is why subword tokenization was such a decisive idea. It gives you a workable compromise on all the key axes:
- tokens richer than characters
- vocabulary far smaller than word-level schemes
- graceful handling of unseen strings
- shorter sequences than character-level encoding
- more reusable structure than word-level encoding
This is not a heuristic stuck on the front of the model. It is a compression scheme.
The tokenizer is learning which patterns deserve their own code under a fixed vocabulary budget. Frequent patterns get short, reusable representations. Rare patterns get decomposed into common pieces. In that sense, tokenization is much closer to data compression than to dictionary lookup.
Byte Pair Encoding: the basic idea that changed everything
The most influential subword method is Byte Pair Encoding, or BPE.
Its power comes from how simple it is.
Start with a base vocabulary of atomic units. Historically this could be characters; in modern systems it is often bytes. Then scan the training corpus and count adjacent pairs. Find the most frequent pair. Merge it into a new token. Add that token to the vocabulary. Replace occurrences of the pair in the corpus. Repeat.
That is the whole mechanism.
A tiny example makes the intuition clearer. Suppose your corpus contains forms like:
- low
- lower
- new
- newer
- wider
At the beginning, everything is split into atomic units, and the most frequent adjacent pair merges first. Here that is “e” followed by “r”, which appears three times (in lower, newer, and wider), so it becomes er. The remaining pairs are tied: “l” + “o” may become lo, and then lo + “w” may become low; on another branch, new is assembled from “n”, “e”, and “w”. Eventually, common words become single tokens, while less frequent ones become sensible compositions like new + er or low + er.
The important thing is not the toy mechanics. The important thing is what the algorithm is optimizing for: efficient reuse of recurring structure.
Frequent patterns become cheap. Rare patterns remain representable. The merge list itself becomes the tokenizer.
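The whole training loop fits in a few lines. This is a deliberately simplified sketch — word-internal merges only, words weighted by frequency, no production details — but it is the same mechanism:

```python
from collections import Counter

def train_bpe(words, num_merges):
    # Represent each word as a tuple of symbols; start from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Replace every occurrence of the winning pair with a new merged symbol.
        merged = {}
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

merges = train_bpe(["low", "lower", "new", "newer", "wider"], num_merges=4)
# The first learned merge is ('e', 'r'): it occurs three times in this corpus.
print(merges)
```

The resulting merge list is the tokenizer: applying the same merges, in the same order, to new text reproduces the learned segmentation.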
This is why BPE feels almost inevitable in hindsight. It turns the vocabulary-building problem into a compression problem, and compression is the right lens.
It also makes vocabulary size a real design lever. A larger vocabulary shortens sequences and reduces downstream compute, but it grows the embedding matrix and makes rare tokens sparser. A smaller vocabulary keeps embeddings lighter and training signal denser, but lengthens sequences and raises the cost of attention and caching later on.
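The embedding-side cost is easy to quantify. The input embedding matrix alone holds vocab_size × d_model parameters (an untied output head of the same shape roughly doubles that). Illustrative numbers for a hypothetical model width:

```python
d_model = 4096  # hypothetical hidden size

for vocab_size in (32_000, 128_000, 256_000):
    # Parameters in the input embedding table alone.
    params = vocab_size * d_model
    print(f"{vocab_size:>7} tokens -> {params / 1e6:.0f}M embedding parameters")
```

A 256K vocabulary costs roughly a billion parameters in the embedding table at this width — parameters that buy shorter sequences rather than deeper computation.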
So the first major tokenizer tradeoff appears immediately:
vocabulary size versus sequence length
That tradeoff is not cosmetic. It is one of the first places where Chapter 0’s compute-versus-memory tension becomes concrete.
WordPiece and Unigram: different scoring rules, same underlying problem
BPE is not the only subword algorithm. It is just the most famous.
WordPiece
WordPiece, associated most strongly with BERT, follows a similar merge-based intuition but changes the scoring rule. Instead of simply merging the most frequent pair, it favors merges that most improve the likelihood of the training corpus under a probabilistic model.
Conceptually, that means WordPiece is a little less naive about raw frequency. It is not just asking what appears often. It is asking what is most worth making atomic.
In practice, BPE and WordPiece often produce similar behavior on large corpora, especially in English-heavy settings. The difference is more about optimization philosophy than about a completely different worldview. Both are still answers to the same underlying question: which substructures deserve dedicated tokens?
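The difference in scoring can be stated in one line. Where BPE ranks a candidate pair by raw count, WordPiece (in its commonly described form) scores a pair by its count divided by the product of the counts of its parts, favoring pairs whose parts rarely appear apart. A sketch with invented counts:

```python
def bpe_score(pair_count, a_count, b_count):
    # BPE: raw frequency of the pair.
    return pair_count

def wordpiece_score(pair_count, a_count, b_count):
    # WordPiece-style score: frequent pairs of otherwise-rare parts win.
    return pair_count / (a_count * b_count)

# "q" almost never appears without "u", so "qu" scores highly under WordPiece
# even though "th" is far more frequent in absolute terms. Counts are invented.
print(wordpiece_score(100, 110, 5000))   # "qu"
print(wordpiece_score(900, 8000, 6000))  # "th"
```

Under BPE's raw count "th" would merge first; under the WordPiece score "qu" does — a small example of frequency versus worth-making-atomic.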
Unigram and SentencePiece
The Unigram model, popularized through SentencePiece, flips the construction process around.
BPE grows a vocabulary upward from small units. Unigram starts with a large candidate vocabulary and prunes it downward. It treats tokenization as a probabilistic segmentation problem: a sentence may admit multiple valid segmentations, and the model tries to keep the vocabulary that best explains the corpus.
This has a conceptual elegance that BPE lacks. Ambiguity is not treated as a nuisance to eliminate immediately. It is treated as part of the model. That tends to work well in multilingual settings and in languages with rich morphology, where rigid greedy merging can produce awkward boundaries.
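Segmentation under a unigram model is a standard dynamic program: among all ways to split a string into in-vocabulary pieces, keep the split with the highest total log-probability. A sketch with an invented toy vocabulary and probabilities:

```python
import math

# Toy unigram vocabulary with invented probabilities.
vocab = {"un": 0.05, "happi": 0.02, "ness": 0.04, "unhappi": 0.0005,
         "u": 0.01, "n": 0.03, "h": 0.01, "a": 0.02, "p": 0.01,
         "i": 0.02, "e": 0.03, "s": 0.03}

def segment(text):
    # best[i] = (log-prob, segmentation) of the best split of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for i in range(1, len(text) + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab and best[j][1] is not None:
                score = best[j][0] + math.log(vocab[piece])
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [piece])
    return best[len(text)][1]

print(segment("unhappiness"))
```

The single-character entries guarantee that every string remains representable; the larger pieces win whenever their probability beats the product of their parts.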
The big picture, though, is the same across all three methods:
- characters are too small
- words are too big
- useful units must be learned from data
Why bytes matter more than characters
Once you leave English, a new problem appears: what exactly is the atomic unit at the bottom?
If you build a character-level base vocabulary over full Unicode, the base vocabulary becomes very large before you have even learned a single useful merge. Real-world text contains far more than ASCII. It contains thousands of scripts, symbols, and encoded forms.
The elegant solution is to go one level lower: bytes.
Byte-level BPE uses the 256 possible byte values as the universal base vocabulary. Any UTF-8 text, regardless of language or script, can be represented as bytes. From there, merges learn whatever recurring structures the data supports.
This move solved several problems at once:
- no true out-of-vocabulary text
- multilingual coverage without a separate tokenizer per language
- a fixed, tiny base vocabulary
- graceful handling of arbitrary strings, symbols, and weird edge cases
That is a big reason modern general-purpose language models can at least process so many writing systems under one tokenizer.
But byte-level tokenization is not free. Human-legible boundaries can become less transparent. Some scripts compress better than others depending on training distribution. Rare languages may remain much closer to byte-level fragmentation, paying a higher token tax than dominant languages.
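The asymmetry is visible before a single merge is learned. UTF-8 already prices scripts differently at the byte level, and an undertrained byte-level tokenizer inherits that floor:

```python
samples = {
    "English": "hello",
    "Greek": "γειά",
    "Devanagari": "नमस्ते",
}

for script, text in samples.items():
    raw = text.encode("utf-8")
    # Bytes per character before any merges are applied.
    print(f"{script}: {len(text)} chars -> {len(raw)} bytes")
```

ASCII text starts at one byte per character; Greek starts at two, Devanagari at three. Merges learned from an English-heavy corpus will close that gap for English far faster than for everything else.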
That matters more than it first appears.
What the tokenizer actually determines
People often talk about tokenization as though it only affects vocabulary construction. In practice, it reaches much further.
1. Sequence length and effective context
The tokenizer determines how much text fits into a fixed context window.
A 128K-token model does not really have a single context length in any human sense. It has a token budget. How much real text that budget covers depends on how efficiently the tokenizer compresses the input. A language that compresses well gets more usable context. A language that tokenizes poorly gets less.
So context length is not just an architectural number. It is partly a tokenizer property.
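The arithmetic is worth making explicit. With illustrative compression ratios (real values vary widely by tokenizer and corpus), the same token budget covers very different amounts of text:

```python
context_tokens = 128_000

# Illustrative characters-per-token ratios; not measurements of any real tokenizer.
chars_per_token = {"well-compressed language": 4.0,
                   "poorly-compressed language": 1.5}

for lang, cpt in chars_per_token.items():
    usable = int(context_tokens * cpt)
    print(f"{lang}: ~{usable:,} characters of usable context")
```

Under these assumed ratios, one language fits more than two and a half times as much text into the same window — the same nominal context length, very different effective context.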
2. Multilingual fairness
A tokenizer trained primarily on English will learn efficient merges for English. That means English gets short, dense tokenizations. Other languages, especially underrepresented ones, may get split into many more pieces for the same semantic content.
The result is a silent inequality:
- the same prompt costs more tokens in some languages than others
- the same context window holds less content
- the same generation takes more steps
- the same model can appear weaker in a language not only because of data scarcity, but because the tokenizer gave that language a worse deal
This is one reason tokenization is not just a systems choice. It is also, quietly, a modeling and fairness choice.
3. Arithmetic and symbolic behavior
Tokenization also shapes what the model can learn about numbers.
A string like 13927 may be one token, two tokens, or several fragments depending on the tokenizer. The model then has to learn arithmetic over whatever representation it is given. If number boundaries are inconsistent or opaque, arithmetic becomes harder than it looks.
This helps explain a behavior that confuses many people the first time they see it: the same model that writes good code can fail on what looks like simple arithmetic or character manipulation.
That failure is not always “lack of intelligence.” Sometimes the task was made structurally awkward upstream. Code often tokenizes into recurring, clean patterns. Numbers and symbolic strings often do not.
The model is solving the problem it was handed, not the one the human thinks they asked.
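One mitigation some tokenizers adopt is to pre-tokenize digits before any merges apply, so number boundaries are always predictable — for example, forcing every digit to be its own piece, so 13927 is always exactly five digit tokens. A hedged sketch using a regular expression:

```python
import re

def split_digits(text):
    # Each digit becomes its own piece; non-digit runs are left for later merging.
    return re.findall(r"\d|\D+", text)

print(split_digits("order 13927 shipped"))
```

The tradeoff is explicit: sequences get longer wherever numbers appear, but the model sees a consistent positional structure it can learn arithmetic over.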
4. Whitespace, formatting, and prompt sensitivity
Many tokenizers also bake formatting into the token stream more aggressively than readers expect. A leading space can change token identity. Small formatting changes can shift token boundaries. That means seemingly trivial prompt edits sometimes alter the model’s behavior in ways that feel disproportionate.
They are not disproportionate. They are tokenization effects.
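The leading-space effect is easy to reproduce with a toy vocabulary that, like many real ones, contains a word-with-leading-space entry. Greedy longest-match encoding (a hypothetical vocabulary, and a simpler matching rule than real BPE) yields different tokens for the "same" word depending on what precedes it:

```python
# Toy vocabulary: " the" (with a leading space) and "the" are distinct tokens.
vocab = [" the", "the", " ", "t", "h", "e", "c", "a"]

def greedy_encode(text):
    out, i = [], 0
    while i < len(text):
        # Longest vocabulary entry matching at position i.
        piece = max((v for v in vocab if text.startswith(v, i)), key=len)
        out.append(piece)
        i += len(piece)
    return out

print(greedy_encode("the cat"))   # starts with the token "the"
print(greedy_encode(" the cat"))  # starts with the different token " the"
```

One invisible space at the start of the prompt changes the first token ID the model receives — a small illustration of why formatting edits can have outsized effects.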
The counterfactual is what makes this real
The cleanest way to see that tokenization matters is to change it while holding as much else fixed as possible.
Increase vocabulary size and sequences get shorter, sometimes improving efficiency and perplexity, until diminishing returns and sparse training signal start to push back.
Train a tokenizer on a more balanced multilingual corpus and performance on underrepresented languages can improve even when the transformer architecture stays the same.
Use a tokenizer that preserves digit structure more transparently and arithmetic behavior can improve without any grand breakthrough in reasoning.
Those counterfactuals matter because they force the right conclusion:
the tokenizer is not a neutral preprocessor
It is part of the model.
Not in the narrow sense that it sits inside the transformer block diagram, but in the more important sense that it shapes the space of problems the model can learn efficiently. It defines the granularity of the world the network gets to see.
Once you see that, tokenization stops being boring instantly.
The tensions become visible here first
Tokenization is the first place in the series where several recurring tensions collide in a form that is easy to feel.
Vocabulary size vs. sequence length
Larger vocabularies compress better and shorten sequences. Smaller vocabularies lengthen sequences but keep embeddings smaller and training signal denser. There is no universally correct answer. The best choice depends on model size, data mixture, target languages, and serving constraints.
Compression efficiency vs. compositional transparency
An aggressively merged tokenization is efficient, but it can hide internal structure. A more decomposed tokenization is less efficient, but sometimes exposes reusable pieces the model can exploit. Whether the model benefits more from shorter sequences or from clearer composition depends on scale and domain.
Universality vs. specialization
A single tokenizer for everything is operationally simple. A domain-specific or language-specific tokenizer may compress its target far better. General-purpose systems usually choose universality, but they pay for that choice, especially in specialized domains and underrepresented languages.
These are not abstract tensions. They show up immediately in cost, coverage, robustness, and capability.
Why this chapter matters to the rest of the series
Chapter 0 argued that the LLM pipeline is a cascade of irreversible commitments. Tokenization is the first one.
Once raw text has been discretized, the model never gets the original stream back. If the tokenizer fractures an important structure, every later layer inherits that damage. Embeddings can only map the units they are given. Attention can only relate the units it is given. Context windows are measured in the units it is given.
That is why tokenization sits much closer to the conceptual core of LLMs than most introductions admit.
It also sets up several later chapters directly.
The embedding layer exists because token IDs need geometry. Attention operates over token sequences whose length the tokenizer determines. Long-context behavior is constrained by token efficiency. Multimodality will later inherit the same compression logic when images are turned into patch-like token sequences. Even some questions about reasoning are really questions about representation.
Before the model can learn, it has to be handed a world.
The tokenizer decides how that world is chopped up.
And that decision, quiet as it looks, is the first bottleneck in the entire stack.
The next chapter moves from discrete units to continuous geometry. A token ID is just an integer. Integers do not encode meaning, similarity, or order. So once tokenization has decided what the units are, the model faces the next problem: how to place those units into a space where relationships can be learned at all.