Why Context Engineering Mirrors Information Architecture for LLMs
If you've worked in UX, this might sound familiar. Information architecture (IA) is the discipline of organizing website content so users can find what they need without cognitive overload. This is the same fundamental challenge we face with LLMs.
Both disciplines answer the same question: How do you structure information so the entity consuming it (whether human or LLM) can efficiently perceive and act on what's relevant?
In traditional IA, you're designing for human attention spans and cognitive load. In context engineering, you're designing for transformer attention mechanisms and token limits. The constraints differ, but the principles are remarkably similar.
The Million Token Trap
When Gemini and other frontier models started rolling out context windows of hundreds of thousands of tokens and then pushing toward a million, I remember the excitement in the community. Finally, we could throw entire codebases, documentation sets, even whole books into a single prompt. No more careful retrieval, no more juggling what to include. Just dump everything in and let the model figure it out.
Except that's not how it works in practice. At all.
We learned this lesson pretty quickly at Innowhyte when building production agentic systems. Bigger context windows didn't magically make agents more reliable. In some cases, they made things worse. And once you start digging into why, you realize there are some fundamental issues with how these models process long contexts.
Why Long Context Problems Are Inevitable: Three Root Causes
The context challenges we see in production are fundamental properties of how LLMs are built and trained. Understanding this matters because it changes how you approach the problem. You're not waiting for the next model release to solve context issues. You're engineering around inherent constraints.
1. Architecture: Mathematical Inevitability
The transformer architecture that powers modern LLMs has fundamental mathematical constraints:
Attention spreads thin as context grows. When you double the context length, each individual token receives roughly half the attention it had before. This isn't a bug; it's how the math works. The model physically cannot focus as intensely on specific information across 100,000 tokens as it does across 1,000 tokens. Think of it like trying to remember details from a 5-minute conversation versus a 5-hour one.
Errors compound through the architecture. When the attention mechanism misallocates focus early, attending to irrelevant tokens while missing critical ones, these errors propagate and amplify through subsequent layers. The model builds its understanding layer by layer, so mistakes at layer 5 influence layer 10, which influences layer 15, creating a cascade of degradation.
There's an unavoidable trade-off between focus and coverage. The model can either spread attention broadly across many tokens (good for coverage, bad for precision) or concentrate attention narrowly on a few tokens (good for precision, bad for coverage). You cannot have both simultaneously—it's mathematically impossible. This creates inherent limitations in how well models can process long, complex contexts.
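To make the first point concrete, here's a toy softmax calculation. It's a sketch, not the model's actual multi-head attention, and the score boost of 5 is an arbitrary assumption; the point is simply that the same relative score advantage buys less and less focus as context grows.

```python
import numpy as np

def important_token_weight(context_length: int, score_boost: float = 5.0) -> float:
    """Toy softmax attention: one 'important' token scores higher than the rest.
    Returns the attention weight that token actually receives."""
    scores = np.zeros(context_length)
    scores[0] = score_boost                      # the token the model should focus on
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights[0]

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> important token gets {important_token_weight(n):.4f} of attention")
# ~0.13 at 1K tokens, ~0.015 at 10K, ~0.0015 at 100K: the same advantage, spread ever thinner.
```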
2. Training Data: Position Bias Is Learned
Even if the architecture were perfect, training data creates systematic biases:
Models learn that important information should be nearby. Throughout training, models see billions of examples where answers follow questions immediately, where summaries appear at the start or end of documents, where relevant information sits within a few sentences of the query. The model learns this pattern: "what I need is usually within ±10-20 tokens." When you put critical information at position 45,000 in a 100K context, you're fighting against this deeply learned expectation.
This creates the lost-in-the-middle problem. When researchers test where models can actually find information in long contexts, they see a U-shaped curve: information at the beginning gets found 70-80% of the time, information in the middle drops to 40-50%, and information at the end recovers to 70-80%. This isn't random; it directly reflects how training data is structured.
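If you want to see this curve for your own model, a minimal needle-in-a-haystack probe looks roughly like this. It's a sketch, assuming a hypothetical `call_llm(prompt)` wrapper around whatever model you're testing; the exact recall numbers will vary by model and context length.

```python
NEEDLE = "The access code for project Falcon is 7413."
QUESTION = "What is the access code for project Falcon? Answer with the number only."
FILLER = "This sentence is unrelated background filler about nothing in particular. "

def build_context(relative_depth: float, n_sentences: int = 2000) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) inside filler text."""
    sentences = [FILLER] * n_sentences
    sentences.insert(int(relative_depth * n_sentences), NEEDLE)
    return "".join(sentences)

def recall_at(relative_depth: float, trials: int = 20) -> float:
    hits = 0
    for _ in range(trials):
        answer = call_llm(build_context(relative_depth) + "\n\n" + QUESTION)  # hypothetical wrapper
        hits += "7413" in answer
    return hits / trials

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"needle at {depth:.0%} depth -> recall {recall_at(depth):.0%}")
```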
3. Evaluation: Incentive Misalignment
OpenAI's recent research paper Why Language Models Hallucinate reveals a third fundamental cause: the way we evaluate models creates perverse incentives.
Language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty. OpenAI researchers analyzed ten major AI benchmarks and found nine out of ten use binary grading systems that award zero points for expressing uncertainty. When an AI says "I don't know," it receives the same score as giving completely wrong information. The optimal strategy becomes: always guess.
During pretraining, models learn from massive text without true/false labels—only positive examples of fluent language. For patterns with clear structure (spelling, grammar), models learn reliably. But for arbitrary facts with no learnable pattern—like a random person's birthday—the model cannot distinguish valid from invalid statements.
Post-training RLHF exacerbates this guessing behavior. A model that abstains 52% of the time when uncertain produces substantially fewer wrong answers than one that abstains only 1%, even though the latter shows higher "accuracy" by getting lucky on guesses.
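Some toy arithmetic makes the incentive problem obvious. These are illustrative numbers, not the paper's: assume 1,000 questions the model is genuinely uncertain about and a 20% chance of a lucky guess.

```python
questions, p_lucky = 1_000, 0.20

def outcomes(abstain_rate: float) -> tuple[float, float]:
    guesses = questions * (1 - abstain_rate)
    correct = guesses * p_lucky
    wrong = guesses * (1 - p_lucky)
    return correct / questions, wrong   # binary grading: abstentions score zero

for abstain_rate in (0.52, 0.01):
    score, wrong = outcomes(abstain_rate)
    print(f"abstains {abstain_rate:.0%}: benchmark score {score:.1%}, wrong answers {wrong:.0f}")
# abstains 52%: benchmark score 9.6%, wrong answers 384
# abstains 1%: benchmark score 19.8%, wrong answers 792
```

The model that almost never abstains looks twice as "accurate" on the leaderboard while producing twice as many wrong answers.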
This evaluation misalignment directly impacts context engineering. When a model encounters ambiguous or insufficient context, the training incentive says "guess confidently" rather than "admit insufficient information." This makes context problems worse—not only does the model struggle with long contexts architecturally, but its training actively discourages the cautious behaviors that would help manage context limitations gracefully.
What Can Go Wrong with Long Context
Let me walk you through the failure modes we've seen. These aren't edge cases; they're systematic problems that show up when you're not careful about context management.
The Hallucination Snowball Effect
Here's a scenario we ran into: An agent is helping analyze quarterly reports. Early in its workflow, it misreads a table - maybe confuses revenue with expenses. Ideally, it should self-correct.
But here's what actually happens: That error gets referenced in the next step. "Based on the expenses we identified earlier..." Now the error is reinforced. It appears in the scratchpad, in the reasoning trace, in tool call results. By step five, the entire analysis is built on a foundation of garbage. The agent has poisoned its own context.
The Google DeepMind team called this out as "context poisoning" in the Gemini 2.5 technical report:
"An especially egregious form of this issue can take place with "context poisoning" – where many parts of the context (goals, summary) are "poisoned" with misinformation about the game state, which can often take a very long time to undo. As a result, the model can become fixated on achieving impossible or irrelevant goals."
This is context poisoning. And it's particularly insidious because the agent doesn't know it's wrong. The hallucination looks just as authoritative in context as real facts.
When History Becomes a Weight
We saw this in a customer service agent we built. Early versions worked great for the first 10-20 turns of conversation. But in longer sessions, the agent started repeating itself. Using the same phrases. Falling back on patterns from earlier in the conversation instead of adapting to new information.
The research backs this up. When Gemini's context grew beyond 100K tokens, it started favoring "repeating actions from its vast history rather than synthesizing novel plans." It wasn't being creative anymore - it was pattern matching against its own accumulated behavior.
For smaller models, this ceiling is way lower. A Databricks study shows Llama 3.1 405B starts degrading around 32K tokens. The model gets overwhelmed by its own history and starts relying on it instead of its training.
The Problem of Too Many Options
Here's something that surprised us: giving agents access to more tools doesn't make them more capable. It makes them worse at tool selection.
The Berkeley Function-Calling Leaderboard shows this consistently. Every single model - from the best to the smallest - performs worse when you give it multiple tools versus a single tool. Even when only one tool is relevant, having other options in context confuses the model.
We saw this firsthand building a multi-tool agent. With 10 carefully selected tools, it worked great. We expanded to 15 tools to handle more scenarios. Performance tanked. The agent started calling irrelevant tools, mixing up similar-sounding descriptions, basically getting lost in its own toolbox.
The problem is that the model has to pay attention to everything in context. If you put something there, it influences the model's reasoning. There's no such thing as "just ignore the tools that aren't relevant."
When Your Context Argues With Itself
This one's subtle but really problematic for agent systems. Microsoft and Salesforce ran a study where they took benchmark problems and split them across multiple conversation turns - simulating how agents actually gather information incrementally.
Performance dropped by 39% on average. A model that scored 98% when given all information upfront dropped to 64% when gathering that same information over time.
This happens because the context accumulated the model's early, incomplete attempts at solving the problem. Those premature solutions stayed visible. So the final answer had to contend with the model's own earlier and incorrect reasoning. Context was clashing with itself.
This hits agents hard because this is exactly how agents work. They gather information step by step. Each step produces reasoning that goes into context. By the end, you have a context full of partial attempts, incomplete information, and evolving understanding. All of it visible. All of it influencing the final decision.
So What Do We Actually Do About This?
The key insight, and it connects back to our original thesis about augmentation, is that context engineering is intentional information architecture. You can't just throw stuff into the context and hope for the best.
Feeding Context Strategically (The Principle of Disclosure)
IA Principle: Only reveal relevant information as needed. Users can become overwhelmed if too much information is presented to them all at once.
First, not everything needs to be in the active context window all the time. We use what we call "staged memory" - different persistence layers for different types of information.
Scratchpads are great for session-scoped notes. The agent can write down intermediate findings, hypotheses, or things to remember without bloating the prompt. Think of it like working memory - available when needed, but not crowding out everything else. [LangGraph Deep Agents](https://blog.langchain.com/deep-agents/), for example, uses a to-do list tool inspired by Claude Code.
Long-term memory is trickier. Systems like ChatGPT auto-generate memories from conversations. But we've found you need to be selective. Not every conversation detail deserves to persist. We need to build memory systems that extract patterns and preferences, not transcripts.
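Here's a minimal sketch of what we mean by staged memory. It's framework-agnostic, and the names and defaults are ours for illustration: a session-scoped scratchpad, a long-term store that only persists what's explicitly worth keeping, and a single method that decides what actually reaches the prompt.

```python
from dataclasses import dataclass, field

@dataclass
class StagedMemory:
    scratchpad: list[str] = field(default_factory=list)      # session-scoped working notes
    long_term: dict[str, str] = field(default_factory=dict)  # persists across sessions

    def note(self, text: str) -> None:
        """Working memory: cheap to write, never injected wholesale into the prompt."""
        self.scratchpad.append(text)

    def remember(self, key: str, value: str) -> None:
        """Long-term memory: patterns and preferences worth keeping, not transcripts."""
        self.long_term[key] = value

    def context_slice(self, max_notes: int = 5) -> str:
        """The only part that enters the prompt: recent notes plus durable preferences."""
        recent = "\n".join(self.scratchpad[-max_notes:])
        prefs = "\n".join(f"{k}: {v}" for k, v in self.long_term.items())
        return f"Recent notes:\n{recent}\n\nKnown preferences:\n{prefs}"

memory = StagedMemory()
memory.note("Q3 revenue table found on page 12; expenses table on page 14.")
memory.remember("reporting_currency", "USD")
prompt_fragment = memory.context_slice()   # everything else stays out of the window
```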
Selecting Context Carefully (The Principle of Choices)
IA Principle: Users should never feel overwhelmed by too many choices when navigating an interface. Instead, there should be a limited number of options available at any given time.
This is where RAG principles really shine. You don't want to retrieve everything - you want to retrieve what's relevant.
For code agents, we maintain rule files that always get pulled in. Stable, procedural knowledge that doesn't change much. For semantic memory - facts, relationships, historical context - we use embedding-based retrieval. Find the relevant stuff, leave the rest out.
Tool selection needs the same treatment. We've had good results with RAG over tool descriptions. When a task comes in, retrieve the 3-5 most relevant tools rather than dumping 40 tool schemas into the prompt. Recent research shows this improves accuracy by 3x.
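Here's roughly what RAG over tool descriptions looks like. This is a sketch: `embed()` stands in for whatever embedding model you use, and the tool names are made up.

```python
import numpy as np

TOOLS = {
    "search_orders": "Look up customer orders by order ID or email address.",
    "refund_payment": "Issue a refund for a specific payment transaction.",
    "update_address": "Change the shipping address on an open order.",
    # ...plus the rest of the catalog
}

def top_k_tools(task: str, k: int = 3) -> list[str]:
    """Return only the k most relevant tool names for this task."""
    names = list(TOOLS)
    vectors = embed([TOOLS[name] for name in names] + [task])   # hypothetical embedding call
    tool_vecs, task_vec = np.array(vectors[:-1]), np.array(vectors[-1])
    scores = tool_vecs @ task_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(task_vec)
    )
    return [names[i] for i in np.argsort(scores)[::-1][:k]]

# Only these few schemas go into the prompt, not the whole catalog:
relevant = top_k_tools("The customer wants their money back for order #8812")
```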
Compressing When Needed (The Principle of Growth)
IA Principle: Designers should consider how the design will evolve. Over time, more information will be added to the interface, so it should have built-in flexibility to accommodate additional content without becoming too cluttered.
As information grows, context needs to be managed along with it. Summarization is what keeps that growth manageable. For long-running systems, it isn’t just nice to have, it’s necessary.
Claude Code, for example, automatically compresses the conversation once it reaches 95% of its context capacity. We take a similar approach but make it more focused. After big, token-heavy steps like searches or document retrievals, we summarize right away instead of waiting until space runs out.
We’ve also tried hierarchical summarization, where we group related steps, summarize each group, and then summarize those summaries. It’s like building an inverted pyramid: the most important information stays detailed, and everything else gets gradually more compact. This keeps the interface flexible and easy to grow without becoming overwhelming.
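In code, the threshold-triggered version is simple. This is a sketch assuming hypothetical `count_tokens()` and `summarize()` helpers; the limits are placeholders, not Claude Code's internals.

```python
CONTEXT_LIMIT = 200_000
COMPRESS_AT = 0.95   # placeholder threshold; we also compress eagerly after heavy steps

def maybe_compress(messages: list[dict]) -> list[dict]:
    """Once usage crosses the threshold, replace older turns with a summary,
    keeping the system prompt (assumed to be messages[0]) and the latest turns verbatim."""
    if count_tokens(messages) < COMPRESS_AT * CONTEXT_LIMIT:
        return messages
    head, middle, recent = messages[:1], messages[1:-6], messages[-6:]
    summary = summarize(middle)   # hypothetical LLM summarization call
    return head + [{"role": "user", "content": f"Summary of earlier steps:\n{summary}"}] + recent

def compress_tool_result(result: str, threshold_chars: int = 8_000) -> str:
    """Summarize token-heavy tool outputs (searches, retrievals) right away, not later."""
    if len(result) < threshold_chars:
        return result
    return summarize([{"role": "tool", "content": result}])
```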
Isolating When It Makes Sense (The Principle of Focused Navigation)
IA Principle: Navigation should be clear, consistent, and intuitive so that users can quickly get from one page or section to another. Each navigational scheme should serve a single, specific purpose without mixing concerns or including extraneous links.
Instead of one agent trying to hold everything - the full conversation history, all tool outputs, every piece of retrieved knowledge - you split concerns. Different sub-agents maintain different contexts. The research coordinator has its context. The data analyst has its context. They hand off summarized findings rather than raw context.
It is important to note that multi-agent systems add communication complexity and use more tokens; Anthropic reported 15x more tokens for their multi-agent researcher compared to chat. But the system also performs better because each agent's context stays focused.
We also use state objects for isolation. Store things in structured fields that don't automatically go to the LLM. Pull them in only when needed. This gives you fine-grained control over what enters context at each step rather than accumulating everything.
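Here's a sketch of what state-object isolation looks like. It's framework-agnostic and the field names are ours: bulky artifacts live in structured fields, and only a distilled view gets rendered into the prompt at each step.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    task: str
    findings_summary: str = ""                                  # goes to the LLM
    raw_documents: list[str] = field(default_factory=list)      # stays out of the prompt
    tool_traces: list[dict] = field(default_factory=list)       # stays out of the prompt

    def prompt_view(self) -> str:
        """Only the fields the model should reason over at this step."""
        return f"Task: {self.task}\nFindings so far: {self.findings_summary}"

state = AgentState(task="Compare Q3 and Q4 operating expenses")
state.raw_documents.append("<40 pages of retrieved filings>")
state.findings_summary = "Q3 opex parsed; Q4 filing still pending."
llm_input = state.prompt_view()   # raw_documents never enter the context window
```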
From Art to Architecture
Just as information architecture transformed web design from artistic intuition into systematic discipline, context engineering must evolve the same way.
Andrej Karpathy described context engineering as "the delicate art and science of filling the context window with just the right information". Every token you add draws on a finite attention budget, with diminishing returns as the window fills. Effective context use comes from prioritizing relevance over volume.
Below are the three engineering disciplines critical to context engineering:
Visibility: You need to see exactly what context your model receives at every step. Not guess. Not infer. Actually see it. We instrument everything. Every LLM call is logged with its full prompt. When something goes wrong, we can trace exactly what information the model had when it made a bad decision. Integrating your agent with an observability platform isn't optional.
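The plumbing doesn't need to be fancy. A minimal sketch, assuming a hypothetical `call_llm()` wrapper; in practice the records go to your observability platform rather than a local file.

```python
import json
import time
import uuid

def logged_llm_call(messages: list[dict], step: str, **params) -> str:
    """Wrap every model call so the exact prompt it saw is recoverable later."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "step": step,
        "timestamp": time.time(),
        "params": params,
        "messages": messages,        # the full prompt, not a truncated preview
    }
    response = call_llm(messages, **params)   # hypothetical model call
    record["response"] = response
    with open("llm_calls.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return response
```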
Ownership: Too many frameworks hide context management behind abstractions. You set some parameters and hope the framework does the right thing. That's fine for demos. In production, you need to own your context construction logic. Write it explicitly. Test it. Version it. This is also highlighted as the third factor in the 12-factor agents.
Intentionality: Context format matters. We've run experiments where the exact same information structured differently yields 20-30% accuracy differences. Sometimes consolidating everything into one structured message works better than alternating user/assistant messages. Sometimes you want explicit XML tags. Sometimes plain text is best. You need to experiment and measure.
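The experiments themselves are mundane: render the same facts in different shapes and score each variant against the same eval set. A sketch follows; the facts, tags, and `run_eval_suite()` are hypothetical.

```python
FACTS = {"customer": "Acme Corp", "plan": "Enterprise", "open_tickets": "3"}

def as_xml(facts: dict) -> str:
    body = "\n".join(f"  <{k}>{v}</{k}>" for k, v in facts.items())
    return f"<customer_context>\n{body}\n</customer_context>"

def as_plain(facts: dict) -> str:
    return "Customer context:\n" + "\n".join(f"- {k}: {v}" for k, v in facts.items())

variants = {"xml": as_xml(FACTS), "plain": as_plain(FACTS)}
# for name, context in variants.items():
#     print(name, run_eval_suite(context))   # hypothetical eval harness
```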
In short, achieving visibility, ownership, and intentionality isn’t possible without strong observability and evaluation. They are the feedback mechanisms that turn context management from an art into an engineering discipline.
Conclusion
Context engineering mirrors information architecture because the problems are fundamentally the same: finite attention resources, information overload, and the need for intuitive navigation. The only difference is whether you're architecting for human minds or transformer attention heads. The principles remain constant.
Looking Forward
As models keep improving, I don't think context engineering becomes less important. It actually becomes more important.
Because as we ask agents to do more ambitious things - operate autonomously for hours, coordinate complex workflows, maintain consistency across long sessions - the context engineering challenges multiply.
The gap between "this model is theoretically capable of X" and "this system reliably does X in production" is entirely about information architecture. How you structure context. What you include and exclude. How you handle history and errors and conflicting information.
The truth is your model is probably good enough. The question is whether your context architecture lets that capability shine through. Most of the time, it doesn't. And that's what we need to fix.
A Note on Learning and Building
The insights in this blog come from two sources: our hands-on experience building production agents at Innowhyte, and standing on the shoulders of researchers and practitioners who've openly shared their findings. We've learned as much from reading papers and blog posts as we have from debugging failing agents at 2 AM.
While we strongly believe in learning by building, we also believe in learning from others' hard-won lessons. The references provided after this section are carefully curated resources that shaped our understanding of context engineering. We encourage you to read them.