Tokens and Tokenization | Simon’s Journal

Ask GPT-4 how many r’s are in “strawberry” and it will confidently say two. The right answer is three. This isn’t because the model can’t count. It’s because it never sees the letters at all.

Every Large Language Model (LLM) starts with the same operation: text comes in, gets chopped into chunks called tokens, and those chunks become integer IDs that index into an embedding matrix. The chunks aren’t characters and they aren’t words. They’re something more specific, and the specificity matters more than most people realize.

What a “token” really is

Most people first meet the word “token” through prices and limits: “1,500 tokens used”, “the context window is 128K tokens”. Those numbers are real, but they hide what a token actually is.

A token is the smallest unit of input a specific model can perceive. Each model has its own fixed list of tokens, called its vocabulary, decided once at training time. GPT-4’s vocabulary isn’t Claude’s. Claude’s isn’t Llama’s.

When you send text to a model, the text gets chopped into pieces from that model’s vocabulary, and each piece is swapped for an integer ID. Only those IDs ever reach the model. The model never sees text. It sees a sequence of integer indices into its own private alphabet.

So tokens aren’t “roughly like words” or “kind of like characters”. They’re the atoms of perception for one specific model, and they’re the only language that model speaks. Two models fed the same English sentence will produce two different integer sequences, often of different lengths:

“I love strawberry milkshakes!”

GPT-4 (9 tokens): I | ·love | ·str | aw | berry | ·milk | sh | akes | !

Llama 3 (7 tokens): I | ·love | ·straw | berry | ·milk | shakes | !

Each piece is one token. · marks a leading space (so ·love is the token “ love”, space included, distinct from “love”). Splits are approximate.

The same sentence is nine tokens to GPT-4 and seven tokens to Llama 3. Not because Llama is smarter or the sentence changed, but because the two models have different vocabularies. To GPT-4, the token ·straw doesn’t exist as a single chunk, so “strawberry” splits across three pieces. Llama 3’s vocabulary happens to include ·straw, so it gets through in two.
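To make the "same text, different integer sequences" point concrete, here is a minimal sketch. It assumes two tiny hand-picked vocabularies (`VOCAB_A` and `VOCAB_B` are hypothetical, chosen only to mirror the two splits above) and uses greedy longest-match splitting, which is a simplification: real BPE tokenizers replay learned merges instead. The IDs are just list positions, not real model IDs.

```python
def tokenize(text, vocab):
    """Greedy longest-match split of `text` into pieces from `vocab`.

    Returns (pieces, ids), where each ID is the piece's index in `vocab`.
    """
    token_to_id = {token: i for i, token in enumerate(vocab)}
    longest = max(len(token) for token in vocab)
    pieces, ids, pos = [], [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in token_to_id:
                pieces.append(piece)
                ids.append(token_to_id[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token in the vocabulary covers {text[pos]!r}")
    return pieces, ids


# Hypothetical vocabularies, hand-picked to mirror the two splits above.
VOCAB_A = ["I", " love", " str", "aw", "berry", " milk", "sh", "akes", "!"]
VOCAB_B = ["I", " love", " straw", "berry", " milk", "shakes", "!"]

TEXT = "I love strawberry milkshakes!"
pieces_a, ids_a = tokenize(TEXT, VOCAB_A)
pieces_b, ids_b = tokenize(TEXT, VOCAB_B)
print(len(ids_a), pieces_a)  # 9 pieces
print(len(ids_b), pieces_b)  # 7 pieces
```

Only `ids_a` or `ids_b` would ever reach a model. The strings in `pieces_a` and `pieces_b` exist purely for the human reading the output.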

How does a model end up with one specific vocabulary instead of another? The dominant algorithm is Byte Pair Encoding, or BPE.

BPE, the algorithm

BPE is an algorithm for deciding which subword chunks deserve to be tokens, given a corpus and a target vocabulary size. (A corpus is the dataset of text used to train the tokenizer and the model: typically a giant mix of web pages, books, code, and other text. For modern models it’s measured in trillions of tokens.) It starts small and grows the vocabulary one merge at a time, always merging the most frequent adjacent pair in the corpus.

The whole algorithm fits on a sticky note.

The setup. You have:

  • A corpus to tokenize.
  • A target vocabulary size $V$ (a number you choose; typical values run from about 30,000 to 256,000).

You want to end up with a list of $V$ tokens such that common substrings (the, ing, to) get their own token, so common text compresses into short sequences. Rare substrings decompose into smaller pieces, down to single characters in the worst case, so nothing is ever out-of-vocabulary.

The algorithm.

  1. Initialize the vocabulary as every distinct character in the corpus.
  2. Scan the corpus and count every adjacent pair of tokens.
  3. Take the most frequent pair, merge it into a new token, and add it to the vocabulary.
  4. Repeat steps 2 and 3 until the vocabulary has $V$ entries.

That’s it. No clever scoring, no neural network, no second pass. (A neural network is a computational model made of layers of trainable mathematical functions whose parameters are tuned to fit data; modern LLMs are massive neural networks. BPE, by contrast, is plain bookkeeping with no learned parameters.) The “merge” in step 3 doesn’t do anything sophisticated. It just declares: from now on, whenever you see t followed by h in this corpus, treat them as one symbol called th.

Two details matter:

  • The originals don’t disappear: when t and h get merged into th, all three are now in the vocabulary. If a word later happens to use t followed by some other character, the tokenizer can still represent it. The vocabulary grows monotonically.
  • Pairs get re-counted after each merge: once th is a token, the next iteration might find that th + e is the new top pair → merge → the. Then · + the → ·the. Multi-character common words emerge from running the same 4-step loop with no extra cleverness. The vocabulary builds compositionally: longer tokens are assembled from shorter ones merged earlier.

A worked example

Let me run it on a tiny corpus: just two words, cat appearing 3 times and mat appearing 2 times.

cat   × 3
mat   × 2

The initial vocabulary is the 4 distinct characters that appear: c, a, t, m. Every word starts as a sequence of single-character tokens.

Initial state

cat ×3: c | a | t
mat ×2: m | a | t

Vocabulary (4 tokens): c, a, t, m

Iteration 1. Count every adjacent pair, weighted by word frequency:

| pair | count |
|---|---|
| (c, a) | 3 |
| (a, t) | 3 + 2 = 5 |
| (m, a) | 2 |
Winner: (a, t) → at. The suffix at appears in both words, which is why it scores highest. Merge it:

After merge (a, t) → at

cat ×3: c | at
mat ×2: m | at

Vocabulary (5 tokens): c, a, t, m, at

Iteration 2. Re-count:

| pair | count |
|---|---|
| (c, at) | 3 |
| (m, at) | 2 |

(c, at) → cat wins because cat is the more frequent word. Merge:

After merge (c, at) → cat

cat ×3: cat
mat ×2: m | at

Vocabulary (6 tokens): c, a, t, m, at, cat

After two merges the vocabulary holds 6 tokens: c, a, t, m, at, cat. Notice what just happened. The word cat now tokenizes to a single token. The word mat still takes two tokens (m + at), because BPE judged cat worth its own ID but not yet mat. In a larger corpus where mat was more common, it would eventually merge too. This is exactly what real tokenizers look like: common words collapse to one token, rarer words decompose into shared subword pieces like the at suffix.
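Here is a minimal sketch of how those two learned merges get applied to new words at tokenization time. The function name `apply_merges` is made up for illustration, and replaying merges in training order is one simple strategy; production tokenizers do the same job with faster data structures.

```python
def apply_merges(word, merges):
    """Tokenize one word by replaying learned merges in training order."""
    tokens = list(word)  # start from single characters
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Fuse every adjacent (left, right) occurrence into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# The two merges from the worked example, in the order they were learned.
MERGES = [("a", "t"), ("c", "at")]

for word in ["cat", "mat", "tact"]:
    print(word, "->", apply_merges(word, MERGES))
# cat -> ['cat']
# mat -> ['m', 'at']
# tact -> ['t', 'a', 'c', 't']
```

The last case shows the fallback: tact contains no adjacent a, t pair, so no merge fires and it stays at single characters.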

A two-word corpus only takes the algorithm so far. Let’s step through a richer four-word corpus to watch meaningful subwords emerge.
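One way to step through it is a short sketch of the four-step loop. The corpus below (low ×5, lower ×2, newest ×6, widest ×3) is an assumption: it is a small corpus commonly used to illustrate BPE, picked here because it produces est and low. Two simplifications to note: the loop stops after a fixed number of merges rather than at a target $V$, and ties go to whichever pair was counted first.

```python
from collections import Counter


def count_pairs(words):
    """Step 2: count adjacent token pairs, weighted by word frequency."""
    pairs = Counter()
    for tokens, freq in words.items():
        for pair in zip(tokens, tokens[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(words, pair):
    """Step 3: rewrite every word so each occurrence of `pair` is one token."""
    merged_words = {}
    for tokens, freq in words.items():
        out, i = [], 0
        while i < len(tokens):
            if tokens[i:i + 2] == pair:
                out.append(pair[0] + pair[1])
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        key = tuple(out)
        merged_words[key] = merged_words.get(key, 0) + freq
    return merged_words


def train_bpe(corpus, num_merges):
    """Run the BPE loop on {word: frequency}; return (merges, vocab)."""
    words = {tuple(word): freq for word, freq in corpus.items()}
    # Step 1: every distinct character, in order of first appearance.
    vocab = list(dict.fromkeys(ch for word in corpus for ch in word))
    merges = []
    for _ in range(num_merges):  # Step 4: repeat
        pairs = count_pairs(words)
        if not pairs:
            break  # every word is already a single token
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        words = merge_pair(words, best)
        vocab.append(best[0] + best[1])
        merges.append(best)
    return merges, vocab


# Assumed example corpus (not taken from this post).
CORPUS = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = train_bpe(CORPUS, num_merges=4)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Running the same function on the earlier cat ×3, mat ×2 corpus with `num_merges=2` reproduces the hand-worked result: (a, t) first, then (c, at).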

So the whole algorithm is bookkeeping. No machine learning, no scoring functions. The structure that emerges (suffixes like est, common words like low, eventually multi-character tokens for frequent words like the, ing, tion) is a direct snapshot of the corpus’s frequency statistics.

Byte-level BPE

Look back at one line from the algorithm: “the initial vocabulary is every distinct character in the corpus”. That works fine if the corpus is plain English with no surprises. The moment you feed BPE the actual internet (Chinese, emoji, code, accented letters, rare Unicode codepoints), the “distinct characters” set explodes, and worse: any rare codepoint the corpus didn’t include is still out-of-vocabulary at the character level. (Codepoints are Unicode’s numeric IDs for characters, written as U+XXXX in hex: U+0041 for A, U+1F353 for 🍓. About 150,000 are assigned in total, covering every script, symbol, and emoji.)

GPT-2 introduced a fix that’s now near-universal: don’t start with characters. Start with bytes. (A byte is just 8 bits, a number from 0 to 255. Everything stored on a computer, whether text, images, or programs, ultimately lives as a sequence of bytes; text is just a particular interpretation of byte sequences via an encoding like UTF-8.)

There are exactly 256 possible byte values, so:

  • The initial vocabulary is fixed at 256, regardless of corpus.
  • Every byte is in the vocabulary, by definition.
  • Any text representable on a computer is, by definition, a byte sequence.
  • Out-of-vocabulary is eliminated by construction. The worst case for any input is “fall back to bytes”.

The UTF-8 wrinkle. Most modern text is encoded as UTF-8, a variable-length encoding that maps each Unicode character to a sequence of 1 to 4 bytes. ASCII takes 1 byte, most European scripts 2, most Asian scripts 3, emoji 4:

| character | bytes (hex) | byte count |
|---|---|---|
| A | 41 | 1 |
| é | C3 A9 | 2 |
| 中 | E4 B8 AD | 3 |
| 🍓 | F0 9F 8D 93 | 4 |

ASCII is just “UTF-8 where every character is one byte”, so plain English text is unchanged. But 中 enters the tokenizer as the 3-byte sequence E4 B8 AD, not as a single character.

After BPE training on a multilingual corpus, the merges could end up producing a single token for the sequence E4 B8 AD. Those three bytes always appear together in any valid UTF-8 encoding of 中. The byte triple gets compressed into a “character-shaped” token via merging, the same way est and low did in the English example. The algorithm doesn’t change. We just swapped the starting alphabet.
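You can check those byte sequences yourself with Python's built-in `str.encode`. This is a small sketch; the helper name `utf8_hex` is made up for illustration, and the characters are written as escapes so the snippet survives copy-paste.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of `ch` as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()


# A, é (U+00E9), 中 (U+4E2D), 🍓 (U+1F353)
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F353"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_hex(ch), len(encoded))
# U+0041 41 1
# U+00E9 C3 A9 2
# U+4E2D E4 B8 AD 3
# U+1F353 F0 9F 8D 93 4

# What a byte-level tokenizer starts from, before any merges: plain integers.
print(list("\u4e2d".encode("utf-8")))  # [228, 184, 173]
```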

Input: “Hello 🍓!”

Character-level (8 tokens): H | e | l | l | o | · | ⚠ UNK | !

🍓 isn’t in the vocabulary, so it is replaced with <UNK>. The character is lost: the model can never recover it.

Byte-level (11 tokens): H | e | l | l | o | · | F0 | 9F | 8D | 93 | !

🍓 decomposes into 4 bytes F0 9F 8D 93. Every byte is in the vocabulary by construction. Nothing is lost.

Same input, two tokenizers. The character-level one fails on any character it wasn’t trained to know. The byte-level one cannot fail.
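The comparison can be sketched in a few lines. The character vocabulary here is hypothetical (just the characters of "Hello !"), and `<UNK>` stands in for whatever unknown-token marker a character-level tokenizer would use.

```python
UNK = "<UNK>"


def char_level(text, vocab):
    """Character tokenizer: anything outside `vocab` collapses to UNK."""
    return [ch if ch in vocab else UNK for ch in text]


def byte_level(text):
    """Byte tokenizer before any merges: one token per UTF-8 byte (0..255)."""
    return list(text.encode("utf-8"))


TEXT = "Hello \U0001F353!"      # "Hello 🍓!"
CHAR_VOCAB = set("Hello !")     # hypothetical: trained without the emoji

char_tokens = char_level(TEXT, CHAR_VOCAB)
byte_tokens = byte_level(TEXT)

print(len(char_tokens), char_tokens)  # 8 tokens, one of them <UNK>
print(len(byte_tokens), byte_tokens)  # 11 tokens, all in 0..255

# Only the byte-level sequence can be decoded back to the original text.
print(bytes(byte_tokens).decode("utf-8") == TEXT)  # True
```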

Byte-level BPE pays in tokens to win in coverage:

  • The cost: non-ASCII text uses more tokens when the training corpus underrepresents the script. A Chinese sentence run through an English-heavy model decomposes into byte-level chunks rather than character-shaped tokens. Same string, more tokens. This is why API pricing tends to hit Chinese, Arabic, and Hindi harder than English.
  • The guarantee: nothing is ever out-of-vocabulary. The starting vocabulary is fixed at 256 entries, every byte sequence is representable by construction, and there’s no <UNK> token to lose information to.

Once you internalize that the model literally never sees characters (only integer IDs corresponding to byte sequences that may or may not align with human characters), a bunch of LLM weirdness stops being mysterious. The strawberry problem is one of them. We’ll get there.

Vocabulary size as a design knob

Vocabulary size $V$ (the number of distinct tokens in the model’s vocabulary) is a hyperparameter, meaning it is set by hand before training rather than learned from data. The obvious instinct is that bigger should be better, since common substrings collapse into single tokens and text compresses into shorter sequences. So why do real models stop at 32K to 256K? Why not a vocabulary of a million tokens, or ten million?

The short answer: $V$ controls three different costs at once and only one benefit, and the costs quickly become severe.

Alongside $V$ sits one other number that shows up in nearly every formula below: $d$, the model’s hidden dimension. It’s the width of every vector the model passes around internally. For a 7B-class model $d$ is around 4,096; for 70B-class models it grows to 8,192. Bigger $d$ gives vectors more room to encode meaning, but compute grows with $d^2$. Most of the formulas below are some flavor of $V \cdot d$.

A quick clarification on those size labels: “7B” means the model has 7 billion learned parameters in total, “70B” means 70 billion. That total is a fixed budget the whole model has to share. Even the vocab tables we’re about to discuss come out of it: every parameter the designer spends on one part of the model is a parameter that cannot go to another part.

The benefit: compression. Bigger $V$ means more common substrings get their own token, which means a given document encodes into fewer tokens. Shorter sequences are worth a lot:

  • Less work per document: the model processes fewer tokens to read the same text.
  • More content per budget: a fixed input window holds more real text.
  • Lower compute cost: both training and inference scale with token count, so each gets cheaper.

Cost 1: embedding matrices. Every token needs its own row in the embedding matrix, which has shape $V \times d$: $V$ rows and $d$ columns, holding $V \cdot d$ numbers in total. There’s also a matching output matrix at the top of the model that projects each final vector back to a $V$-dimensional distribution over the vocabulary. That matrix is also $V \times d$. So just the vocab tables cost:

$$\text{vocab parameters} = 2 \cdot V \cdot d$$

(Some models weight-tie embedding and output: the same matrix is used both as the embedding lookup at the input and as the output projection at the top of the model, cutting this to $V \cdot d$. The two ends of the model share the same vocabulary, so reusing the matrix mostly works. The principle is the same.)

With $d = 4{,}096$:

| $V$ | model | vocab parameters ($2 \cdot V \cdot d$) |
|---|---|---|
| 32,000 | LLaMA 2 | 262 M |
| 128,000 | LLaMA 3 | 1.05 B |
| 256,000 | Gemini | 2.10 B |
| 1,000,000 | hypothetical | 8.19 B |

At $V = 1\text{M}$, you’ve spent the parameter count of an entire 8B-class model on lookup tables alone. None of that capacity goes to the rest of the model, where the actual processing happens. Every parameter spent on $V$ is a parameter you cannot spend on reasoning capacity.
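The table is plain arithmetic, and a few lines reproduce it. This sketch assumes $d = 4{,}096$ for every row, as the table does; the function name and the `tied` flag are illustrative, not taken from any library.

```python
def vocab_parameters(vocab_size, hidden_dim, tied=False):
    """Parameters in the embedding and output tables: 2*V*d, or V*d if tied."""
    tables = 1 if tied else 2
    return tables * vocab_size * hidden_dim


D = 4096
for v in (32_000, 128_000, 256_000, 1_000_000):
    print(f"{v:>9,}  {vocab_parameters(v, D) / 1e9:.2f} B")
#    32,000  0.26 B
#   128,000  1.05 B
#   256,000  2.10 B
# 1,000,000  8.19 B
```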

The benefit shrinks with each new token. Cost grows linearly with $V$: every new token in the vocabulary costs the same. Benefit does not. Real text is dominated by a small number of very common tokens, so once those have their own vocabulary entries, each additional token covers vanishing additional content.

Empirically, English text tokenizes to roughly:

| $V$ | tokens per word | comment |
|---|---|---|
| 1,000 | $\approx$ 5 | essentially character-level |
| 30,000 | $\approx$ 1.3 | common words are one token |
| 100,000 | $\approx$ 1.15 | most words and common phrases consolidate |
| 1,000,000 | $\approx$ 1.05 | tiny extra gain, huge extra cost |

Compression gain scales roughly with $\log V$.
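To see the diminishing returns in numbers, this sketch walks the table row by row and compares what each jump in $V$ costs against what it buys. It takes the approximate tokens-per-word figures from the table at face value and assumes untied tables with $d = 4{,}096$, so treat the output as a rough illustration.

```python
D = 4096  # assumed hidden dimension

# Approximate tokens-per-word figures, copied from the table above.
TOKENS_PER_WORD = {1_000: 5.0, 30_000: 1.3, 100_000: 1.15, 1_000_000: 1.05}

sizes = sorted(TOKENS_PER_WORD)
steps = []
for small, large in zip(sizes, sizes[1:]):
    # Cost of the jump: extra rows in both vocab tables (2 * delta_V * d).
    extra_params = 2 * (large - small) * D
    # Gain of the jump: fraction by which sequences get shorter.
    shorter = 1 - TOKENS_PER_WORD[large] / TOKENS_PER_WORD[small]
    steps.append((small, large, extra_params, shorter))

for small, large, extra_params, shorter in steps:
    print(f"{small:>9,} -> {large:>9,}: "
          f"+{extra_params / 1e6:,.0f} M params, sequences {shorter:.1%} shorter")
```

Each step costs more parameters than the one before and shortens sequences by less.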

So the central tradeoff is on the table: parameter cost rising linearly with $V$, compression gain rising only with $\log V$. Before stacking on the other costs, the two curves are worth a long look side by side.

[Figure: parameter cost and tokens per word, each plotted against vocabulary size $V$ from 1K to 1M.]

The asymmetry is plain. Past 100K to 256K, you pay linear cost for vanishing additional gain. But two more cost considerations haven’t entered the picture yet, and both pull the optimum further toward smaller $V$.

Cost 2: rare tokens barely get trained. A token’s row in the embedding matrix only gets trained on the times that token appears in the data. A token that shows up millions of times gets a well-trained embedding. A token that shows up a handful of times barely gets trained at all.

Real text is brutally skewed. Zipf’s law, an empirical regularity in natural language, says the $k$-th most common word or token appears about $1/k$ as often as the most common one. It holds, roughly, across languages, corpora, and subword tokenization. Practically:

  • The top 1,000 tokens cover roughly 80% of all text.
  • The top 10,000 cover something like 95%.
  • Everything beyond is the long tail.

On a 1 trillion-token training corpus (a typical pre-training scale):

  • $V = 32\text{K}$: even the rarest tokens see tens of thousands of updates. Embeddings converge.
  • $V = 1\text{M}$: hundreds of thousands of long-tail tokens see only 10 to a few hundred updates each. Those embeddings stay close to their random initialization. The parameters are allocated but never learn anything useful.

Cost 3: each prediction gets expensive. Every time the model picks the next token, it first produces a probability distribution over all $V$ tokens in the vocabulary: its prediction for what comes next. To produce that distribution, it computes a score for every token (a $V \times d$ matrix multiplication, called the unembedding), then normalizes those $V$ scores into probabilities through a softmax, a function that converts a list of raw scores into probabilities: bigger scores get bigger probabilities, and the results all sum to 1.

The $V \times d$ matrix has $V \cdot d$ entries, and producing each prediction means touching every one of them, costing $V \cdot d$ basic arithmetic operations. Each layer in the rest of the model costs something like $12 \cdot d^2$ operations per token, give or take, depending on the architecture. The two are comparable when $V$ is around $12 \cdot d$. For $d = 4{,}096$ that crossover lands somewhere near $V = 50{,}000$. Beyond it, the prediction is one of the most expensive single operations the model does per token. Training also gets harder: the model has to learn to pick the right token from more options.
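The crossover is easy to check. This sketch uses the two rough cost estimates from the paragraph above ($V \cdot d$ for the unembedding, $12 \cdot d^2$ per layer); the function names are made up for illustration.

```python
def unembedding_ops(vocab_size, hidden_dim):
    """Rough per-token cost of scoring every vocabulary entry: V * d."""
    return vocab_size * hidden_dim


def layer_ops(hidden_dim):
    """Rough per-token cost of one layer, using the 12 * d^2 estimate."""
    return 12 * hidden_dim ** 2


def crossover_vocab(hidden_dim):
    """Vocabulary size at which V * d equals 12 * d^2, i.e. V = 12 * d."""
    return 12 * hidden_dim


D = 4096
print(crossover_vocab(D))  # 49152
for v in (32_000, 128_000, 256_000):
    ratio = unembedding_ops(v, D) / layer_ops(D)
    print(f"V = {v:>7,}: unembedding costs about {ratio:.1f}x one layer")
```

At $V = 256{,}000$ the unembedding alone costs roughly as much as five layers under these estimates.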

Putting it together. Two competing curves:

$$
\text{parameter cost} \propto V, \qquad \text{compression gain} \propto \log V
$$

That gives a clear Pareto frontier: the curve of best-possible tradeoffs between two competing objectives, where you cannot improve one without hurting the other. At small $V$ (below 30K), spending a small extra parameter budget yields big compression gains: spend more. At large $V$ (above 256K), spending huge extra parameters yields almost nothing: stop. The sweet spot is wherever marginal cost matches marginal gain.

Where real models land. For modern hardware and modern $d$ values, the empirical answer sits in the 30K to 256K range:

| model | $V$ | comment |
|---|---|---|
| LLaMA 1 / LLaMA 2 | 32,000 | English-focused, parameter-efficient |
| GPT-2 | 50,257 | |
| GPT-4 (cl100k_base) | $\approx$ 100,000 | |
| LLaMA 3 | 128,256 | jumped specifically for multilingual coverage |
| Gemini | 256,000 | heavy multilingual |

The dominant pressure pushing $V$ up is multilingual coverage. Each new script (Cyrillic, Arabic, Devanagari, Chinese-Japanese-Korean or CJK) wants its own token budget. The alternative, decomposing those scripts into bytes via the byte-level BPE we just saw, bloats sequence length unacceptably for users writing in those languages.

Variants: BPE, WordPiece, and SentencePiece

So far we’ve focused on BPE, since it’s the dominant algorithm and the one used directly by most modern models. But it’s not the only one. Two variants share the rest of the landscape, and both are common enough that any survey of LLM tokenization has to cover them.

All three solve the same problem: split text into subword chunks drawn from a fixed vocabulary. They differ on two axes: what they merge (the scoring function for picking which pair to combine) and how they treat the raw text (whether they pre-tokenize before merging).

BPE. Already covered above. Frequency-based: at each iteration, merge the most common adjacent pair. Runs on pre-tokenized words (regex-split on whitespace and punctuation first, GPT-style). The most common standalone choice, used directly by the GPT family.

WordPiece. Google’s variant, originally introduced for speech recognition and later adopted by BERT (Bidirectional Encoder Representations from Transformers, a 2018 Google language model focused on understanding text rather than generating it; it predates the modern LLM wave but is still widely used for classification, search, and similar analysis tasks). Same overall loop as BPE: start with characters, count pairs, merge the top pair, repeat. What changes is the scoring function. Where BPE picks the pair with highest joint count, WordPiece picks the pair whose merging most increases the corpus’s likelihood under a unigram model. Concretely:

$$\text{score}(a, b) = \frac{\text{count}(a, b)}{\text{count}(a) \cdot \text{count}(b)}$$

The numerator (joint count) rewards pairs that appear together often. The denominator (product of individual counts) penalizes pairs whose pieces are already frequent on their own. Intuition: BPE picks the most frequent pair; WordPiece picks the “stickiest” pair, the one whose pieces co-occur more often than chance would predict. In practice the two produce vocabularies that look very similar at the same target size, with subtle differences in how low-frequency content gets handled. Used by BERT, DistilBERT, ELECTRA, and most of the BERT family. RoBERTa is the notable exception: it uses GPT-2’s byte-level BPE.
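A toy comparison shows how the two scoring rules can disagree. All counts below are invented for illustration: t and h are both very common, while q is rare but almost always followed by u.

```python
def wordpiece_score(pair, pair_counts, token_counts):
    """score(a, b) = count(a, b) / (count(a) * count(b))."""
    a, b = pair
    return pair_counts[pair] / (token_counts[a] * token_counts[b])


# Hypothetical counts, invented for this example.
TOKEN_COUNTS = {"t": 1000, "h": 600, "q": 20, "u": 300}
PAIR_COUNTS = {("t", "h"): 400, ("q", "u"): 19}

# BPE: highest joint count wins.
bpe_choice = max(PAIR_COUNTS, key=PAIR_COUNTS.get)

# WordPiece: highest count relative to what the pieces' own counts suggest.
wordpiece_choice = max(
    PAIR_COUNTS,
    key=lambda pair: wordpiece_score(pair, PAIR_COUNTS, TOKEN_COUNTS),
)

print("BPE merges:      ", bpe_choice)        # ('t', 'h')
print("WordPiece merges:", wordpiece_choice)  # ('q', 'u')
```

With these numbers, (t, h) has the higher raw count, but (q, u) scores about five times higher once each count is divided by how common its pieces are.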

SentencePiece. Google’s other tokenizer, and a bigger philosophical departure. The shift is that SentencePiece skips pre-tokenization entirely. It reads raw text directly, treating whitespace as just another character (rendered visibly as ▁, the lower-one-eighth block).

Underneath, SentencePiece can run BPE or a different algorithm called the unigram language model (a probabilistic alternative we won’t go into here). Either way, the no-pre-tokenization choice is the defining feature.

Why does that matter? Because pre-tokenization assumes whitespace marks word boundaries, and that assumption breaks for languages written without spaces between words, such as Chinese, Japanese, Thai, and Khmer. A BPE tokenizer that pre-splits only on whitespace will treat an entire Chinese paragraph as a single “word”, which breaks the merge logic. SentencePiece sidesteps the problem by not assuming whitespace means anything special.

The trade-off: SentencePiece adds the ▁ character to tokens that begin a new “word” (tokens preceded by a space in the original text), so the original whitespace can be recovered from the token sequence. Spaces are not lost; they are encoded into the tokens themselves. Used by LLaMA 1 and 2 (with BPE underneath), T5, Gemma, mT5, and most multilingual models.
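The whitespace round trip can be mimicked without the library. This is a toy sketch, not SentencePiece's API: the function names are made up, the pieces are split by hand, and it copies only the convention of marking spaces with ▁ (U+2581), including one at the start of the text.

```python
MARKER = "\u2581"  # ▁, LOWER ONE EIGHTH BLOCK


def mark_whitespace(text):
    """Replace spaces with the visible marker, and add one at the start."""
    return MARKER + text.replace(" ", MARKER)


def restore_whitespace(pieces):
    """Join pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(MARKER, " ")[1:]


TEXT = "I love strawberry"
marked = mark_whitespace(TEXT)
print(marked)  # ▁I▁love▁strawberry

# A hand-written, hypothetical segmentation of the marked string.
PIECES = ["\u2581I", "\u2581love", "\u2581straw", "berry"]
print(restore_whitespace(PIECES))  # I love strawberry
```

Because the marker travels inside the tokens, decoding needs no knowledge of where the word boundaries were.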

The practical landscape. Which family uses what:

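| Model family | Tokenizer |
|---|---|
| GPT (GPT-2, GPT-3, GPT-4) | BPE (with pre-tokenization) |
| RoBERTa | BPE (byte-level, GPT-2 style) |
| BERT family (BERT, DistilBERT, ELECTRA) | WordPiece |
| LLaMA 1 and 2 | SentencePiece (BPE underneath) |
| T5, mT5 | SentencePiece (unigram underneath) |
| Gemma | SentencePiece |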

One useful note for reading papers and model cards: when something says “our tokenizer is SentencePiece”, that’s a toolkit claim. The underlying algorithm is either BPE or the unigram language model, so check which one. The SentencePiece-vs-BPE choice is mostly about pre-tokenization handling and multilingual support, not the merge algorithm itself.

The strawberry investigation

The most famous demonstration of tokenization’s hidden weirdness is the strawberry problem. Ask a language model how many r’s are in “strawberry” and watch it confidently miss.

How many r’s are in “strawberry”?

There are 2 r’s in “strawberry”.

The right answer is three: s-t-r-a-w-b-e-r-r-y. This is not the model being stupid. It’s the model running into the limits of its own perception.

Why it happens. The model never sees the letters. By the time the word reaches the transformer, it has already been chopped into tokens. (The transformer is the neural-network architecture at the heart of every modern LLM: stacked layers that pass token vectors through attention, which mixes information across positions, and feed-forward networks, which mix information within each position. It is covered in depth later in the series.) GPT-4’s tokenizer splits “strawberry” into three pieces:

“strawberry” through GPT-4’s tokenizer (3 tokens): str | aw | berry

The three r’s are distributed across the tokens: one in str, two in berry. The model never sees them as separate letters at all.

Each token is an integer ID. The model never sees s, t, r, a, w, b, e, r, r, y individually. From its perspective, the input is three opaque IDs.

To answer “how many r’s are in strawberry”, the model would have to:

  1. Spell out each token internally: str → s, t, r. aw → a, w. berry → b, e, r, r, y.
  2. Count r’s per token: 1, 0, 2.
  3. Add: 3.

None of those steps is a native operation for a token-level predictor. The model has to know from training what letters live inside each token, then perform multi-step counting reasoning that it has no built-in primitive for.
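Written as ordinary code, the three steps are trivial, which is exactly the point: the code gets to look inside the strings, and the model does not. In this sketch the integer IDs are invented for illustration, and the `ID_TO_TEXT` table plays the role of spelling knowledge the model would have to absorb from training.

```python
# Hypothetical token IDs; a real tokenizer assigns different numbers.
ID_TO_TEXT = {101: "str", 202: "aw", 303: "berry"}

# What the model actually receives for "strawberry": three opaque integers.
model_input = [101, 202, 303]


def count_letter(token_ids, letter, id_to_text):
    """Spell out each token, count the letter per token, then add."""
    spelled = [id_to_text[token_id] for token_id in token_ids]  # step 1
    per_token = [piece.count(letter) for piece in spelled]      # step 2
    return per_token, sum(per_token)                            # step 3


per_token, total = count_letter(model_input, "r", ID_TO_TEXT)
print(per_token, total)  # [1, 0, 2] 3
```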

A useful analogy. Imagine you only know a word by hearing it spoken, never by reading it. Someone asks how many c’s are in “macchiato”. You can probably get there, but only by mentally rehearsing the spelling first and then counting. The LLM is in that situation for every word it processes.

Modern frontier models often pass. Ask GPT-4, Claude, or Gemini today and they will usually answer three. They get there through some combination of:

  • Targeted training: specific letter-counting examples included in post-training data.
  • Chain-of-thought reasoning: the model spells the word out token by token in its own output, then counts.
  • Tool use: calling a spelling utility.

But the underlying limitation has not gone away. Smaller open models still fail. Subtler letter-level questions (counting double letters, finding the second-to-last consonant, judging whether two words rhyme exactly) can still trip up frontier models. The mitigations are scaffolding around a token-level core that simply does not have letter-level structure.

The broader point. The strawberry failure is the cleanest demonstration of a deeper truth: an LLM’s “alphabet” is its token vocabulary, and that alphabet does not decompose into letters. Anything that needs character-level operations (counting letters, finding anagrams, detecting palindromes, judging rhymes, manipulating spelling, swapping letter positions) is uphill for the architecture.

Once you internalize that, a whole class of “the model is dumb” reactions stops being mysterious.

Summary

What we covered:

  • Tokens are the model’s atoms of perception. Each model has its own vocabulary, decided once at training time. Two models given the same sentence produce different integer sequences.
  • BPE produces that vocabulary by repeatedly merging the most frequent adjacent pair, starting from raw characters or bytes. Simple bookkeeping, no machine learning underneath.
  • Byte-level BPE eliminates out-of-vocabulary failures. Starting from the 256 possible bytes guarantees any input is representable.
  • Vocabulary size $V$ is a design knob. Parameter cost grows linearly with $V$, compression gain grows roughly with $\log V$. Real models land at 30K to 256K, with multilingual coverage pushing the upper end.
  • WordPiece and SentencePiece are variants on the same merge-based core. WordPiece changes the scoring; SentencePiece changes how raw text is handled.
  • The model is blind to letters. It never sees them, only tokens. Any character-level task (counting, anagrams, rhymes) is uphill for a token-level architecture.

The model now reads text as a sequence of integer IDs. But an integer alone tells the model nothing about how one token relates to another. To do anything useful with those IDs (compute similarities, predict the next token, propagate gradients), they have to become vectors. That’s where this series goes next.


🕒 **Posted on**: 1780866561
