Foundations · Day 3
Tokens & Context: The Invisible Limits of Every AI
Once you understand tokens and context windows, strange AI behaviour stops being mysterious, and starts being predictable.
April 22, 2026·11 min read
✦ Tokens, Looked at More Closely
In Day 1, we called tokens "fragments of text that hold predictive value." That was true, and enough to start with.
Now we can look deeper.
Why Not Just Use Words?
Here is the question most people never think to ask: why would anyone design a system that breaks "Intelligence" into "Int" and "elligence"?
The answer comes from a tension in language itself.
A vocabulary of individual characters is too small: the model would need to chain hundreds of characters to say anything meaningful.
A vocabulary of complete words is too large: English alone has hundreds of thousands of words, and an infinite number of names, compounds, and technical terms across all languages.
Tokens are the compromise, and they were discovered, not designed.
A compression algorithm called Byte Pair Encoding (BPE) was set loose on a vast corpus of text with one instruction:
Find the most frequent pair of adjacent characters. Merge them into a single unit. Repeat.
Start with every letter as its own unit. Merge the most common pair. Then the next. After tens of thousands of iterations, you have a vocabulary of roughly 50,000–100,000 tokens that covers common words, common syllables, and rare characters, all in the same dictionary.
The result handles any language, any jargon, any code, with roughly equal efficiency.
What this looks like in practice:
- "sun", "the", "is" → 1 token each, seen so often they earn their own entry
- "training", "language", "tokenization" → 1–2 tokens, common enough to stay intact
- "indivisible", "unbelievable" → 3–4 tokens, broken into familiar pieces
- Emojis, uncommon scripts, rare names → sometimes 1 character per token
Here is a direct way to feel the token boundary: open the tokenizer and type
the. Note the token. Now typethewith a leading space before it. It produces a different token. To you, a space before a word is just formatting. To the tokenizer, it is part of the token itself.Now try
Thewith a capital letter. Three ways of writing the same word, three different tokens. The model does not read the way you do. Characters you consider invisible, spacing, casing, punctuation, are part of the fragment the model actually sees.
✦ The Context Window
Now the central concept.
When you send a message to an LLM, it does not see just your message.
It sees:
- Your current message
- Everything the assistant has already said
- Everything you said before that
- Any system instructions set at the start
All of this is assembled into one long sequence of tokens and passed to the model at once. In Day 2 we saw how the model generates token-by-token, each new token conditioned on everything before it. That "everything before it," all of it, must fit here.
This sequence has a maximum length. That maximum is called the context window.
Think of it as a desk with a fixed surface. Everything the model needs to work with must fit on the desk at once. New pages arrive from the left. Old pages fall off the right.
A context window of 128,000 tokens holds roughly 90,000–100,000 words, about 200 pages of text. That sounds generous. But consider:
- A long conversation accumulates thousands of tokens per exchange
- A document analysis task may involve an entire report
- A coding session with a large codebase can fill a context window in a single paste
A new conversation. The system prompt occupies a small slice. Your first message, a handful of tokens. The context window is almost entirely open.
What Gets Lost First
When the context window fills, something has to go.
In many AI systems, older parts of the conversation are the first to be trimmed from the active context. In practice, this often means the model begins to lose sight of how the conversation started.
OpenAI’s docs describe this kind of truncation explicitly, while also noting that truncation strategy can be configured.
The practical consequence is simple: the model may forget the original framing.
It may still know you are discussing climate policy. But it may no longer see the careful instruction you gave at the beginning, such as: “respond in the voice of an economist, not an activist.”
The instruction that shaped everything can quietly fall out of view.
This is one of the most common sources of drift in long AI conversations. The model is not behaving strangely. It is behaving correctly, given the context it can still see.
Understanding this changes how you structure long sessions.
✦ Lost in the Middle
Here is a subtler problem, and in some ways a more important one.
Even within a full context window, models do not read all positions with equal care.
Research published by Stanford in 2023, now widely cited across the industry, examined how well language models could answer questions when the relevant information was placed at different positions within a long context.
The finding was striking: accuracy dropped significantly when relevant information was placed in the middle of the context, compared to the beginning or end.
The performance curve was U-shaped: strong at the opening, strong at the close, weak in between. In multi-document question-answering tasks, accuracy when the answer was buried in the middle fell to under 40%, versus over 80% when it was at the start or end.
Psychologists call this the serial-position effect: humans also remember the first and last things in a list far better than the middle. It turns out the models we trained on human language inherited the same bias.
This has a direct, practical implication:
Put the most important information at the beginning of your prompt, or at the end. Never bury it in the middle.
This one principle, applied consistently, will improve the quality of your AI outputs more than most prompt tricks.
You are summarising a 60-page policy document and paste the full text into an AI with a 128K context window. The critical exception you care about is on page 34, right in the middle. Given what you now know about context and attention, what might go wrong? What would you do differently to ensure the model focuses on what actually matters?
✦ Context Windows Across Models
Context window sizes have grown dramatically, and fast.
| Model / Era | Context Window | Roughly equivalent to |
|---|---|---|
| GPT-2 (2019) | 1,024 tokens | ~750 words, less than this article |
| GPT-3 (2020) | 4,096 tokens | ~3,000 words, a short story |
| GPT-4 Turbo (2023) | 128K tokens | ~90,000 words, a full novel |
| Claude 3 Opus (2024) | 200K tokens | ~150,000 words, 300 pages |
| Claude Sonnet 4.6 (2025) | 1M tokens | ~750,000 words, a full codebase |
| Gemini 3 Pro (2026) | 2M tokens | ~1.5 million words, a small library |
The growth is real and remarkable. But a critical caveat:
Advertised context length ≠ reliable performance across the full range.
Several frontier models that advertise 1M-token context windows have been observed degrading significantly past 256K tokens, struggling to maintain above 50% match accuracy at their maximum claimed lengths. The context window is a ceiling, not a quality guarantee.
Think of it this way: a scholar can physically carry 1,000 books into an examination room. Whether they can meaningfully reason across all 1,000 simultaneously is a different question.
This brings us to context rot.
✦ Context Rot: When More Becomes Less
You might expect that adding more context always helps: more information, richer answers.
The research disagrees.
Context rot is the measurable degradation in output quality that occurs as the context grows, even when the model is nowhere near its limit. Adding more tokens does not just leave quality unchanged; it can actively reduce it.
The model does not just lose focus in the middle. It also becomes more diffuse overall, as the signal it needs competes with a growing volume of loosely relevant noise.
The implication for builders is counterintuitive:
A 500-token prompt containing exactly the right information often outperforms a 5,000-token prompt padded with loosely related background material.
Quality over quantity. In tokens, as in most things.
✦ Tokens as Currency
This is where tokens shift from being a technical concept to a business one.
Every token you send and receive is counted. Billed.
Reference pricing (2025, illustrative):
- GPT-4.1: ~$2.00 per million input tokens, ~$8.00 per million output tokens
- Claude Sonnet: competitive, with no surcharge for extended context
A single API call might use 500–2,000 tokens. Negligible. But consider an application making 10,000 calls per day, averaging 1,500 tokens each. That is 15 million input tokens daily: $30/day in input alone, before output tokens.
The multiplier most developers discover too late:
In a multi-turn conversation, every new message includes all previous messages as context. A 10-turn conversation does not cost 10× one message. It can cost 50× or more, because each turn carries the full accumulated history.
Applications that behave fine in development, with clean, short test conversations, can generate surprising production bills because real users have long, rambling sessions.
Tokens are not just a technical constraint. They are the unit of cost, latency, and reliability in every AI system you will build.
✦ Working Smart Within the Limits
Understanding the constraints is only useful if it changes how you act.
1. Lead with what matters most.
Given the U-shaped attention curve, put critical instructions at the beginning of your prompt. If your conversation is long, re-state key constraints before asking for important outputs.
2. Trim the unnecessary.
Well-crafted prompts can achieve the same output quality with 30–50% fewer tokens by removing filler. Phrases like "As you know," "I was wondering if maybe," "feel free to" add tokens without adding information.
3. Summarise long threads.
For very long sessions, do not let the raw history accumulate indefinitely. Summarise what has been established so far and carry the summary forward. The summary preserves meaning at a fraction of the token cost.
4. Use structured prompts.
Asking for a specific output format (a numbered list, a table, JSON) helps the model produce focused, efficient responses. A vague question invites a sprawling answer.
5. Preview: RAG for large documents.
If your use case involves querying long documents, the right tool is not pasting the whole document into the context. It is Retrieval-Augmented Generation (RAG): storing the document in a vector database and retrieving only the relevant passages at query time. This keeps the actual context small, regardless of document length.
We cover RAG in depth on Day 8. For now, know it exists, and that it was specifically designed for the problem we have been describing today.
You are building a customer support bot. After five days of live use, conversations have become very long, and the bot begins contradicting its own earlier advice within a single session. You suspect context overflow. What two changes would you make first, and what would you measure to know if they worked?
✦ Takeaway Summary
| Concept | What It Means |
|---|---|
| Token | A chunk of text (not a word), the fundamental unit LLMs read and generate |
| BPE | The algorithm that builds the token vocabulary by merging frequent character pairs |
| Context Window | The maximum total tokens the model can see at once: prompt + history + system instructions all count |
| FIFO Overflow | When the window fills, the oldest tokens are silently dropped with no warning |
| Lost in the Middle | Models attend strongly to beginning and end; accuracy drops 30%+ for information buried in the middle |
| Context Rot | Output quality can degrade even before the window is full; more context is not always better |
| Token Cost | Every token is billed; multi-turn conversations accumulate rapidly in production |
✦ Try It Yourself
1. Explore your own tokenisation
Open the OpenAI Tokenizer and try these:
- Paste a paragraph in Hindi, Tamil, or another Indian language. Compare its token count with the same idea expressed in English. You are seeing the "low-resource language tax": languages underrepresented in the training corpus cost more tokens per word.
- Open any tokenizer tool and try these one by one:
tokenization,Tokenization,tokenization!,token-ization,👨🏽💻. Before checking the result, pause and guess how many tokens each one will become. Then reveal the tokenization and compare your guesses. Notice what changed: sometimes a space, punctuation mark, capitalization shift, or emoji changes the boundary completely. The point of the exercise is simple: models do not read text as words the way humans do. They read token pieces.
✦ Learn More
- Hugging Face: Byte Pair Encoding Tokenization
- Liu et al. (2023): Lost in the Middle, How Language Models Use Long Contexts
- Morph: Context Rot, Why LLMs Degrade as Context Grows
- OpenAI: What Are Tokens and How to Count Them
The vessel is finite. Clarity begins when you stop filling it with everything, and start filling it with what matters.