For the past few years, prompt engineering has been the center of gravity in applied AI. We obsessed over verbs, personas, and delimiters. But as we transition from chat interfaces to autonomous agents, a more foundational paradigm is emerging: context engineering.
Building with large language models (LLMs) is less about "hacking" the right sentence and more about solving a systems-level architecture problem: "What holistic configuration of context is most likely to yield the desired model behavior while maintaining efficiency and reliability?"
What is Context Engineering?
Context refers to the entire "memory space" provided to a model during a single inference call — the tokens that represent the history, tools, data, and instructions. Context engineering is the strategic curation, reduction, and orchestration of these tokens.
Unlike prompt engineering, which is often a discrete, one-shot exercise, context engineering is:
- Dynamic: It changes based on the agent's previous actions.
- Systemic: It involves external data sources (RAG), tool definitions, and multi-agent protocols.
- Resource-Aware: It respects the finite limits of token windows and the cognitive "attention" of the model.
Effectively wrangling LLMs requires thinking in context — considering the holistic state available to the model at any given time and how that state shapes the probability space of its next response.
The Physics of Attention: Why Context Matters
Despite the hype around "infinite context" windows (reaching 1M+ tokens), LLMs are still bound by the physics of the Transformer architecture.
1. The $O(n^2)$ Bottleneck
Standard self-attention mechanisms scale quadratically: every token must "attend" to every other token. Optimizations such as FlashAttention and Ring Attention reduce the practical cost, but the quadratic scaling remains. More importantly, the model's attention is spread across more tokens, diluting how much weight any single token receives.
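For reference, the quadratic term falls directly out of standard scaled dot-product attention: the score matrix $QK^{\top}$ has one entry per pair of tokens, so for a sequence of length $n$ it costs $O(n^2)$ to compute and store.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad QK^{\top} \in \mathbb{R}^{n \times n}$$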
2. Context Rot and "Lost in the Middle"
Researchers have identified that LLMs often suffer from Context Rot: as the window fills, the model's ability to recall and use earlier details degrades. Retrieval performance also tends to follow a U-shaped curve, the "lost in the middle" effect: information at the very beginning or very end of the context window is recalled accurately, while details buried in the middle are frequently ignored or replaced with hallucinated ones.
3. The Signal-to-Noise Ratio
In an agentic loop, more data is not always better. Including ten irrelevant documentation snippets alongside the one that matters can lower the accuracy of the next tool call by pulling the model's focus away from the signal. Context engineering is, therefore, primarily an exercise in signal extraction.
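One common way to protect the signal-to-noise ratio is to rank candidate snippets against the current task and keep only the top few before they ever enter the prompt. A minimal sketch follows; the word-overlap scorer is a toy stand-in for whatever relevance signal you actually use (embedding similarity, BM25, a reranker):

```python
def score_relevance(task: str, doc: str) -> float:
    """Toy relevance signal: word overlap. Swap in embeddings or a reranker in practice."""
    task_words = set(task.lower().split())
    doc_words = set(doc.lower().split())
    return len(task_words & doc_words) / (len(doc_words) or 1)

def select_context(task: str, candidates: list[str], k: int = 3) -> list[str]:
    """Keep only the k snippets most relevant to the current task."""
    ranked = sorted(candidates, key=lambda doc: score_relevance(task, doc), reverse=True)
    return ranked[:k]
```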
The Anatomy of a High-Signal Context
A well-engineered context window for an agent typically consists of three distinct layers:
1. The Adaptive System Prompt
The system prompt defines the "identity" and "operating rules." Instead of massive blocks of text, use structured sections with clear hierarchy:
```markdown
<role>
Senior DevOps Engineer specialized in Kubernetes.
</role>

<constraints>
- Never suggest 'force delete' without warning.
- Use 'grep' before 'cat' to save context.
</constraints>

<context_markers>
The current environment is [PROD].
Current active namespace: [default].
</context_markers>
```
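What makes this prompt "adaptive" is that the volatile markers are injected at call time rather than hard-coded. A minimal sketch, assuming the runtime state lives in a plain dict (the field names here are illustrative):

```python
SYSTEM_TEMPLATE = """<role>
Senior DevOps Engineer specialized in Kubernetes.
</role>

<constraints>
- Never suggest 'force delete' without warning.
- Use 'grep' before 'cat' to save context.
</constraints>

<context_markers>
The current environment is [{environment}].
Current active namespace: [{namespace}].
</context_markers>"""

def build_system_prompt(state: dict) -> str:
    """Render the static identity/rules with the current runtime markers."""
    return SYSTEM_TEMPLATE.format(
        environment=state.get("environment", "UNKNOWN"),
        namespace=state.get("namespace", "default"),
    )

print(build_system_prompt({"environment": "PROD", "namespace": "default"}))
```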
2. Just-in-Time Retrieval (RAG)
Rather than dumping an entire codebase into an agent, use a Progressive Disclosure strategy. The agent should "scout" first (a sketch follows the list):
- Phase 1: List file names or directory structures.
- Phase 2: Retrieve specific lines or summaries based on relevance.
- Phase 3: Load full context only when a specific edit is required.
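A minimal sketch of that three-phase flow, using the local filesystem as the "codebase"; the line limits and glob pattern are arbitrary choices for illustration:

```python
from pathlib import Path

def scout(repo_root: str, pattern: str = "*.py") -> list[str]:
    """Phase 1: cheap overview, file paths only, no contents."""
    return [str(p) for p in Path(repo_root).rglob(pattern)]

def peek(path: str, max_lines: int = 20) -> str:
    """Phase 2: a small slice of a file judged relevant."""
    lines = Path(path).read_text(errors="ignore").splitlines()[:max_lines]
    return "\n".join(lines)

def load_full(path: str) -> str:
    """Phase 3: full contents, only once an edit is actually needed."""
    return Path(path).read_text(errors="ignore")
```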
3. Tool Manifests
Tools are the agent's hands. They must be defined with extreme precision. If an agent is constantly picking the wrong tool, the context engineering of your tool_definitions is likely flawed (e.g., overlapping descriptions or ambiguous parameters).
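As an illustration (the tool names and fields below are hypothetical, loosely following the JSON-schema style most function-calling APIs use), a precise manifest makes the choice unambiguous: one tool reads, the other writes, and each parameter description says exactly what it expects:

```python
TOOLS = [
    {
        "name": "read_config",
        "description": "Read the current Kubernetes ConfigMap for a namespace. Read-only; never modifies state.",
        "parameters": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string", "description": "Kubernetes namespace, e.g. 'default'."},
            },
            "required": ["namespace"],
        },
    },
    {
        "name": "patch_config",
        "description": "Apply a merge patch to a ConfigMap. Mutating; requires an explicit key and value.",
        "parameters": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string", "description": "Kubernetes namespace to patch."},
                "key": {"type": "string", "description": "ConfigMap key to set."},
                "value": {"type": "string", "description": "New value for the key."},
            },
            "required": ["namespace", "key", "value"],
        },
    },
]
```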
Advanced Strategies for Long-Horizon Tasks
When an agent works on a task for hours, the context window inevitably fills up. We use three primary techniques to handle this:
1. Compaction & Distillation
Before a context window overflows, the system should trigger a "Compaction" event. An LLM (often a smaller, faster one) reads the history and produces a distilled summary (sketched below) that preserves:
- The original goal.
- Decisions made so far.
- Unresolved blockers.
- Minimal "anchor" tokens for continuity.
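A minimal sketch of a compaction trigger, assuming a generic `llm_summarize` callable and a rough characters-per-token estimate, both stand-ins for whatever model client and tokenizer you actually use:

```python
COMPACTION_PROMPT = (
    "Summarize the conversation so far. Preserve: the original goal, "
    "decisions made, unresolved blockers, and any identifiers needed for continuity."
)

def estimate_tokens(messages: list[dict]) -> int:
    """Rough heuristic: roughly 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4

def maybe_compact(messages: list[dict], llm_summarize, budget: int = 100_000) -> list[dict]:
    """When the history nears the budget, replace it with a distilled summary."""
    if estimate_tokens(messages) < budget:
        return messages
    summary = llm_summarize(COMPACTION_PROMPT, messages)
    # Keep the system prompt, the summary, and the most recent turns as anchors.
    return [messages[0], {"role": "user", "content": f"[Compacted history]\n{summary}"}] + messages[-4:]
```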
2. The "Sub-Agent" Orchestration Pattern
Break one complex task into multiple context windows (a sketch follows the list):
- The Manager: Maintains the high-level plan and state.
- The Worker: Receives a scoped task and a fresh, clean context. It executes and returns only the result to the Manager.
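A stripped-down sketch of the pattern, assuming a generic `call_llm(system, task)` function as a stand-in for your model client. The important property is that each Worker starts from a clean context and only its final result flows back to the Manager:

```python
def run_worker(call_llm, task: str) -> str:
    """Worker: fresh context, one scoped task, returns only the result."""
    system = "You are a focused sub-agent. Complete the task and reply with the result only."
    return call_llm(system, task)

def run_manager(call_llm, goal: str, subtasks: list[str]) -> str:
    """Manager: holds the plan and accumulates only compact results."""
    results = []
    for subtask in subtasks:
        result = run_worker(call_llm, subtask)  # each call is a new context window
        results.append(f"- {subtask}: {result}")
    summary = "\n".join(results)
    return call_llm(
        "You are the orchestrator.",
        f"Goal: {goal}\nSub-task results:\n{summary}\nProduce the final answer.",
    )
```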
3. Structured External Memory
Instead of relying on the LLM's history, maintain a state.json or JOURNAL.md that the agent explicitly reads and writes to. This offloads the "hard memory" to a deterministic storage layer, leaving the context window free for "reasoning."
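A minimal sketch of that external memory, using a state.json file as suggested above. The agent reads it at the start of each turn and writes back durable facts, so the context window only carries the current reasoning step; the field names are illustrative:

```python
import json
from pathlib import Path

STATE_FILE = Path("state.json")

def load_state() -> dict:
    """Read durable memory; return an empty skeleton on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"goal": None, "decisions": [], "blockers": []}

def save_state(state: dict) -> None:
    """Persist durable memory outside the context window."""
    STATE_FILE.write_text(json.dumps(state, indent=2))

# Example turn: record a decision without bloating the chat history.
state = load_state()
state["decisions"].append("Use blue/green deploy for service X")
save_state(state)
```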
Practical Takeaways for Developers
- Be Prudent: Treat every token as a cost — not just in dollars, but in accuracy.
- Standardize Hierarchy: Use XML-like tags (e.g., <thought>, <action>, <result>) to help the model distinguish between different types of context.
- Validate Tools: If a human can't distinguish between Tool A and Tool B based on their descriptions, your agent won't either.
- Monitor "The Middle": If your agent is missing details, check if they are being buried in the middle of a massive 50-message thread.
The Future: From "Context Windows" to "Context Streams"
As we move toward a world of always-on agents, the distinction between "prompt" and "data" will vanish. We will stop "prompting" models and start "streaming" them the exact slice of reality they need to solve the task at hand. Context engineering is the bridge to that future.
Found this useful? This post is part of a series on Building Reliable AI Agents. Next up: "State Machine Design for Autonomous Systems."