Context engineering | Chris Loy

🚀 Read this awesome post from Hacker News 📖

📂 Category:

✅ Here’s what you’ll learn:

As our use of LLMs has changed from conversational chatbots and into integral decision-making components
of complex systems, our inference approach must also evolve. The practice of “prompt engineering”, in
which precise wording is submitted to the LLM to elicit desired responses, has serious limitations. And so this is
giving way to a more general practice of considering every token fed into the LLM in a way that is more dynamic,
targeted, and deliberate. This expanded, more structured practice is what we now call “context engineering.”

Throughout, we’ll use a toy example of understanding how an LLM might help us answer a subjective question such as
“What is the best sci-fi film?”


Context windows

An LLM is a machine learning model that understands language by modelling
it as a sequence of tokens and learning the meaning of those tokens from the patterns of their co-occurrence in large
datasets. The number of tokens that the model can comprehend is a fixed quantity for each model, often in the hundreds
of thousands, and is known as the context window:
LLMs are models for understanding sequences of tokens
LLMs are trained through repeated exposure to coherent token sequences — normally large textual databases scraped from
the internet. Once trained, we use the LLM by running “inference” (i.e. prediction) of the next token based on all the
previous tokens in a sequence. This sequence of previous tokens is what we used to refer to as the prompt:
At inference time they generate one new token
Inference continues the token sequence by adding high-probability tokens to the sequence one at a time.

When prompted to complete the sentence “the best sci-fi film ever made is…”, the highest probability
tokens to be generated might be probably, star, and wars.

Early uses of LLMs focused on this mode of “completion”, taking partially written texts and predicting each subsequent
token in order to complete the text based on the desired lines. While impressive at the time, this was limiting in
several ways, including that it was difficult to instruct the LLM exactly how you wished the text to be completed.


Chat framing

To address this limitation, model providers started training their models to expect sequences of tokens that framed
conversations, with special tokens inserted to indicate the hand-off between two speakers. By learning to replicate
this “chat” framing when generating a completion, models were suddenly far more usable in conversational settings, and
therefore easier to instruct:
Most are tuned to operate in a chat mode
The context window started to be more greedily filled up by different types of messages — system messages (special
instructions telling the LLM what to do), and chat history from both the user and the response from the LLM itself.

With a chat framing, we can instruct the LLM that it is “a film critic” before “asking” it what the best sci-fi film
is. Maybe we’ll now get the response tokens blade and runner, as the AI plays the role of a speaker likely to
reflect critical rather than popular consensus.

The crucial point to understand here is that the LLM architecture did not change — it was still just predicting the
next token one at a time. But it was now doing that with a worldview learned from a training dataset that framed
everything in terms of delimited back-and-forth conversations, and so would consistently respond in kind.


Prompt engineering

In this setting, getting the most out of LLMs involved finding the perfect sequence of prompt
tokens to elicit the best completions. This was the birth of so-called “prompt engineering”, though in practice
there was often far less “engineering” than trial-and-error guesswork. This could often feel closer to
uttering mystical incantations and hoping for magic to happen, rather than the deliberate construction and rigorous
application of systems thinking that epitomises true engineering.

We might try imploring the AI to reflect critical consensus with a smarter system prompt, something like
You are a knowledgeable and fair film critic who is aware of the history of cinema awards. We might hope that this
will “trick” the LLM into generating more accurate answers, but this hope rests on linguistic probability
and offers no guarantees.


In-context learning

As LLMs got smarter and more reliable, we were able to feed them more complex sequences of tokens, covering different
types of structured and unstructured data. This enabled LLMs to produce completions that displayed
“knowledge” of probable token sequences based on novel structures in the prompt, rather than just remembered patterns
from their training dataset. This mode of feeding examples to the LLM is known as in-context learning because the
LLM appears to “learn” how to produce output purely based on example sequences within its context window.

This approach led to an explosion of different token sequences that we might programmatically include within
the prompt:

  • Hard-coded examples, taken from our knowledge domain (documentation, past examples of good output from human
    or generated sources, toy examples) to encourage predictable output.
  • Non-text modalities, with tokens that represented images, audio, or video, that were either directly part of
    the context window, or first transcribed to text and then tokenised.
  • Tool and function calls, defining external functions that the LLM could tell the caller to invoke to
    access data or computation from the outside world.
  • Documents and summaries, returned via “RAG” from data sources, or uploaded by users, to feed knowledge into the
    LLM that lay outside its training dataset.
  • Memory and conversation history, condensing information from prior chats, that allowed continuity between a
    single user and the “chatbot” over multiple conversations.

In our sci-fi film example, our prompt could include many things to help the LLM: historic box office receipts, lists
of the hundred greatest films from various publications, Rotten Tomatoes ratings, the full history of Oscar winners,
etc.

Suddenly, our 100,000+ context window isn’t looking so generous anymore, as we stuff it with tokens from all kinds of
places:
But the full context can be controlled by the caller
This expansion of context not only depletes the available context window for output generation, it also increases the
overall footprint and complexity of what the LLM is paying attention to at any one time. This then increases the risk of
failure modes such as hallucination. As such, we must start approaching its construction with more nuance — considering
brevity, relevance, timeliness, safety, and other factors.

At this point, we aren’t simply “prompt engineering” anymore. We are beginning to engineer the entire context in which
generation occurs.


From oracle to analyst

Language encodes knowledge, but it also encodes meaning, logic, structure, and thought. Training an LLM to encode
knowledge of what exists in the world, and to be capable of producing language that would describe it, therefore, also
produces a system capable of simulating thought. This is, in fact, the key utility of an LLM, and to take advantage
of it requires a mindset shift in how we approach inference.

To adopt context engineering as an approach to LLM usage is to reject using the LLM as a mystical oracle to
approach, pray to with muttered incantations, and await the arrival of wisdom. We instead think of briefing a skilled
analyst
: bringing them all the relevant information to sift through, clearly and precisely defining the task at hand,
documenting the tools available to complete it, and avoiding reliance on outdated, imperfectly remembered training data.
Context engineering reframes the LLM as an analyst, not an oracle
In practice, our integration of the LLM shifts from “crafting the perfect prompt”, towards instead the precise
construction of exactly the right set of
tokens needed to complete the task at hand. Managing context becomes an engineering problem, and the LLM is reframed
as a task solver whose output is natural language.


Engineering context for agentic behaviour

Let’s consider a simple question you might wish an LLM to answer for you:

What is the average weekly cinema box office revenue in the UK?

In “oracle” mode, our LLM will happily quote a value learned from the data in its training dataset prior to its cutoff:

As of 2019, the UK box office collects roughly £24 million in revenue per week on average.

This answer from GPT 4.1 is accurate, but imprecise and outdated. Through context engineering, we can do a lot better.
Consider what additional context we might feed into the context window before generating the first token of the
response:

  • The date, so we use updated stats (GPT 4.1 thinks now is June 2024)
  • Actual published statistics such as this BBC News article
  • Instructions on how to tell the caller to divide two numbers

The above should be enough for the LLM to know how to: look for data for 2024; extract the total
figure of £979 million from the document; and call an external function to precisely divide that by 52 weeks.
Assuming the caller then runs that calculation and
invokes the LLM again, with all the above context, plus its own output, plus the result of the calculation, we will
then get our accurate answer:

Across the full year of 2024, the UK box office collected £18.8 million in revenue per week on average.

Even this trivial example involves multiple ways of engineering the context before generating the answer:

  • Stating the current date and desired outcome;
  • Searching for and returning relevant documents;
  • Documenting available calculation operations;
  • Expanding context with intermediary results.

Fortunately, we do not need to invent a new approach every single time.


Is this just RAG?

Retrieval-augmented generation (RAG) is a fashionable technique for injecting external knowledge into the
context window at inference time. Leaving aside implementation details of how to identify the correct documents to
include, we can clearly see that this is another specific form of context engineering:
RAG is just one pattern of context engineering
This is a useful and obvious way to use pre-trained LLMs in contexts that need access to knowledge outside the training
dataset.

For a correct answer, our application needs to be aware of up-to-date film reviews, ratings, and awards, to track
new films and critical opinion after the point the model was trained.
By including
relevant extracts in the context window, we enable our LLM to generate completions with today’s data and avoid
hallucination.

To do this, we can search for relevant documents and then include them in the context window. If this sounds
conceptually simple, that is because it is — though reliable
implementation is not trivial and requires robust engineering.

Complex systems can be brittle and opaque to build. We need a way to scale complexity without harming our ability to
maintain, debug, and reason about our code. Fortunately, we can apply the same thinking that traditional software design
used to solve this same problem.

We can think of RAG as simply the first of many design patterns for context engineering. And
just as with other software engineering design patterns, in future we will find that most complex systems will have to
employ variations and combinations of such patterns in order to be most effective.


Composition over inheritance

In software engineering, design patterns promote reusable software by providing proven, general solutions to common
design problems. They
encourage composition over inheritance, meaning systems are built from smaller, interchangeable components rather than
rigid class hierarchies. They make your codebase more flexible, testable, and easier to maintain or extend. They are a
crucial piece of the software design toolkit, that enable engineers to build large functioning codebases that can scale
over time.

Some examples of software engineering design patterns include:

  • Factory: standardises object creation to make isolated testing easier
  • Decorator: extends behaviour without editing the original
  • Command: passes work around as a value, similar to a lambda function
  • Facade: hides internals with a simple interface to promote abstraction
  • Dependency injection: wires modules externally using configuration

These patterns were developed over a long time, though many were first codified in
a single book. Context engineering is a nascent field, but already
we see some common patterns emerging that adapt LLMs well to certain tasks:

  • RAG: inject retrieved documents based on relevance to user intent
  • Tool calling: list available tools and inject results into the context
  • Structured output: fix a JSON/XML schema for the LLM completions
  • Chain of thought / ReAct: emit reasoning tokens before answering
  • Context compression: summarise long history into pertinent facts
  • Memory: store and recall salient facts across sessions

In our examples above, we have already used some of these patterns:

  • RAG for getting film reviews, critics’ lists, and box office data
  • Tool calling to calculate weekly revenues accurately

Some of the other techniques, such as ReAct, could help our LLM to frame and verify its responses more carefully,
counterbalancing the weight of linguistic probability learnt from its training data.

By seeing each as a context engineering design pattern, we are able to pick the right ones for the task at hand,
compose them into an “agent”, and avoid compromising our ability to test and reason about our code.


Extending to multiple agents

Production systems that rely on LLMs for decision-making and action will naturally evolve towards multiple agents with
different specialisations: safety guardrails; information retrieval; knowledge distillation; human interaction; etc.
Each of these is a component that interprets a task, then returning a sequence of tokens indicating
actions to take, the information retrieved, or both.
Multi-agent systems feed context to each other

For our multi-agent film ranker, we might need several agents:

  • Chatbot Agent: to maintain a conversation with the user
  • Safety Agent: to check that the user is not acting maliciously
  • Preference Agent: recalls if the user wants to ignore some reviews
  • Critic Agent: to synthesise sources and make a final decision

Each of these is specialised for a given task, but this can be done purely through engineering the context they
consume, including outputs from other agents in the system.

Outputs are then passed around the system and into the context windows of other agents. At every step, the crucial
aspect to consider is the patterns by which token sequences are generated, and how the output of one agent will
be used as context for another agent to complete its own task. The hand-off token sequence is effectively the
contract for agent interaction — apply as much rigour to it as you would any other API within your software
architecture.


Summary

Context engineering is the nascent but critical discipline that governs how we are able to effectively guide LLMs
into solving the tasks we feed into them. As a subfield of software engineering, it benefits from systems and design
thinking, and we can learn lessons from the application of design patterns for producing software that is
modular, robust, and comprehensible.

When working with LLMs, we must therefore:

  • Treat the LLM as an analyst, not an oracle. Give it whatever it needs to solve the task.
  • Take responsibility for the entire context window, not just the system and user prompts.
  • Use composable, reusable design patterns that can be engineered and tested in isolation.
  • Frame the hand-off between agents as an API contract between their context windows.

By doing these, we can control in-context learning with the same rigour as any other engineered software.

💬 What do you think?

#️⃣ #Context #engineering #Chris #Loy

🕒 Posted on 1762081293

By

Leave a Reply

Your email address will not be published. Required fields are marked *