Tiered Context Loading for LLM Agent Routers

It is late, the kind of late where you are doing back-of-the-envelope math on a system you were sure was fine, and the envelope keeps telling you it is not. The dispatcher had been happily picking the right agent for each incoming task. The fleet kept growing, more capabilities every week, and that was supposed to be a good thing. Then I actually multiplied two numbers I had never bothered to multiply, and the floor quietly dropped out. The router I had shipped could not, on paper, do the thing I was watching it do. It was only still standing because the registry had stayed small enough to hide the truth.

I had built it the way you build everything the first time, which is to say without imagining it ever getting big. Every time a task came in, I loaded every capability's full spec into context, let the model read all of it, and asked it to choose. Description, parameters, examples, edge cases, error modes, the whole document for every tool the system owned. It worked beautifully when I had a dozen capabilities. It worked fine at fifty. The trouble is that nothing in the code complains as you climb. There is no alarm at the moment it stops being affordable. It just keeps working right up until the arithmetic says it cannot, and you only find that wall if you go looking for it.

So I did the math I should have done before I ever shipped it, and the number was genuinely funny in the way that only an own goal can be. Take a realistic multi-agent registry. I will use round engineering estimates here, not a census of the real fleet, because the shape of the problem is what matters and the shape holds at any scale: call it somewhere between two hundred and four hundred capabilities, each full spec running maybe four to eight kilobytes once you count all the parameters and examples and edge cases. Put four hundred capabilities at a ballpark six kilobytes each and you are holding around 2.4 megabytes of plain text. GPT-4o gives you a 128K-token window. At roughly four bytes a token, that is about 512 kilobytes of usable room. You can fully load on the order of eighty-five capabilities into that. The estimate said four hundred.

Loading all of them was not a little over budget. Even on the byte estimate above, 2.4 megabytes at four bytes a token is about 600K tokens, nearly five times the window. Budget the heavier eight thousand tokens per spec that I use as the worst case further down, and four hundred specs is 3.2 million tokens against a 128K window, before a single word of conversation history. Not twenty-five percent over the window. Roughly twenty-five times over it. I had not built a routing system with a performance problem. On the numbers, I had built one that could not run, and had been quietly getting away with it only because the live registry stayed small enough to keep the bill from coming due. The exact figures are illustrative; the order of magnitude is the point, and the order of magnitude does not forgive you.
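The arithmetic is small enough to check in a few lines. This is a minimal sketch using only the illustrative estimates from the text; the constant and function names are made up for the example, and nothing here measures a real registry or a real tokenizer.

```python
# Back-of-the-envelope check of the naive "load every full spec" router.
# Every constant is an illustrative estimate from the text, not a measurement.

WINDOW_TOKENS = 128_000   # GPT-4o context window
BYTES_PER_TOKEN = 4       # rough rule of thumb for plain English text
CAPABILITIES = 400
SPEC_BYTES = 6_000        # ballpark size of one full spec


def naive_load(spec_tokens: int) -> tuple[int, float]:
    """Return (total tokens, multiple of the window) for loading every spec."""
    total = CAPABILITIES * spec_tokens
    return total, total / WINDOW_TOKENS


# Byte-based estimate: 6 KB per spec at about 4 bytes per token.
light_total, light_ratio = naive_load(SPEC_BYTES // BYTES_PER_TOKEN)

# Heavier estimate: 8K tokens per spec.
heavy_total, heavy_ratio = naive_load(8_000)

# How many 6 KB specs fit if the window held nothing else at all.
specs_that_fit = (WINDOW_TOKENS * BYTES_PER_TOKEN) // SPEC_BYTES

print(light_total, round(light_ratio, 1))  # 600000 4.7
print(heavy_total, round(heavy_ratio, 1))  # 3200000 25.0
print(specs_that_fit)                      # 85
```

Either estimate lands well past the window; the two differ only in how heavy a single spec is assumed to be.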

The three bad doors

My first move was the obvious one, and the obvious one was a trap. Just load less. Load none of the specs upfront and fetch them on demand. But fetch them how? The honest options were all doors that opened onto the same drop. Load all the specs: impossible, that was the wall the arithmetic had just drawn. Load none and route blind: useless, the model cannot pick a tool it knows nothing about. Load by keyword match: brittle in exactly the ways that bite you in production, where synonyms miss, categories overlap, and the router confidently sends a task to the wrong arm because two descriptions happened to share a word. These were not three alternatives to a solution. They were three different ways to fail, and I spent a frustrating evening discovering each one personally.

What dragged me out of it was a reframe I should have started with. I had been treating "what tools exist" and "how do I use this tool" as the same question, loaded at the same time, paid for on every single call. They are not the same question. To pick a capability, the router barely needs to know it is there and roughly what it does. It does not need the parameter types, the examples, the retry semantics, any of it. All of that detail only matters at the instant of dispatch, for the one tool actually being called. I had been paying the full price of knowing how to use four hundred tools in order to make one decision about which one to reach for.

The turn

I did not invent the way out of this. I found it. KARIMO, an Apache-2.0 project, had already codified exactly the reframe I was groping toward, and given it a clean shape: three tiers, L0, L1, L2, each holding a different depth of knowledge, each loaded at a different moment.

L0 is the always-present index. For every capability you keep almost nothing: the name, a one-line description, the primary task type. No parameters, no examples. At roughly a hundred tokens per capability, four hundred of them cost on the order of 40K tokens, about a third of the window. That is the rent you pay unconditionally, on every call, and in exchange the router always knows the full menu of what exists. It can scan all four hundred names in one pass and shortlist the candidates without ever calling out to a retrieval system or risking a recall miss. It just does not yet know how to use any of them.
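As a rough illustration of how little an L0 entry holds, here is a minimal sketch. The `L0Entry` type, the `render_l0` and `estimate_tokens` helpers, and the two sample capabilities are all hypothetical; they are not KARIMO's format, only one plausible way to keep a name, a one-liner, and a task type per capability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L0Entry:
    """Hypothetical index record: just enough to know a capability exists."""
    name: str
    summary: str     # one-line description
    task_type: str   # primary task type


def render_l0(entries: list[L0Entry]) -> str:
    """One line per capability; this text rides along on every routing call."""
    return "\n".join(f"{e.name} [{e.task_type}]: {e.summary}" for e in entries)


def estimate_tokens(text: str, bytes_per_token: int = 4) -> int:
    """Crude size estimate using the four-bytes-per-token rule of thumb."""
    return len(text.encode("utf-8")) // bytes_per_token


# Made-up sample entries, for illustration only.
index = [
    L0Entry("invoice.parse", "Extract line items from an invoice PDF.", "extraction"),
    L0Entry("invoice.reconcile", "Match invoice totals against ledger entries.", "finance"),
]
menu = render_l0(index)
print(menu)
print(estimate_tokens(menu))
```

A real tokenizer would give a more accurate count than the byte heuristic; the heuristic is used here only because it is the one the estimates above rely on.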

L1 is the category layer, and it loads only when the router gets stuck. When the shortlist comes back ambiguous, two or three candidates in the same category whose one-liners are too close to separate, the router pulls the full parameter summaries for that one category and nothing else. One category, on demand, a couple thousand tokens. When one candidate clearly wins, this step never happens at all.

L2 is the full spec, the complete document I used to load for everything: every parameter with its types and constraints, the examples, the edge cases, the error modes. It loads for exactly one capability, the one being dispatched, at the moment of dispatch, and never a beat earlier. The invariant underneath all of it is almost insultingly simple. At any point in a routing decision, the total context you are carrying is L0 for everything, plus L1 for at most one category, plus L2 for at most one capability. That is the whole ceiling.
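One way to make that invariant hard to violate is to give the context exactly one slot for L1 and one slot for L2, so loading a second category or spec can only replace the first. The sketch below assumes that approach; `Registry`, `RoutingContext`, and their methods are hypothetical names, and the case of an ambiguous shortlist that spans several categories is simply left unhandled because the text does not specify it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Registry:
    """Hypothetical registry authored at three depths of detail."""
    l0: dict[str, str]           # capability -> one-line index entry
    category_of: dict[str, str]  # capability -> category name
    l1: dict[str, str]           # category -> parameter summaries
    l2: dict[str, str]           # capability -> full spec


@dataclass
class RoutingContext:
    registry: Registry
    l1_category: Optional[str] = None    # at most one category loaded
    l2_capability: Optional[str] = None  # at most one full spec loaded

    def disambiguate(self, candidates: list[str]) -> None:
        """Load L1 only when several candidates share a single category."""
        categories = {self.registry.category_of[c] for c in candidates}
        if len(candidates) > 1 and len(categories) == 1:
            self.l1_category = categories.pop()  # replaces any earlier category

    def dispatch(self, capability: str) -> None:
        """Load the full spec for the one capability being called."""
        self.l2_capability = capability

    def render(self) -> str:
        """L0 for everything, plus at most one L1 and at most one L2."""
        parts = list(self.registry.l0.values())
        if self.l1_category is not None:
            parts.append(self.registry.l1[self.l1_category])
        if self.l2_capability is not None:
            parts.append(self.registry.l2[self.l2_capability])
        return "\n".join(parts)


# Made-up three-capability registry, for illustration only.
reg = Registry(
    l0={"a.one": "a.one: does one", "a.two": "a.two: does two", "b.three": "b.three: does three"},
    category_of={"a.one": "a", "a.two": "a", "b.three": "b"},
    l1={"a": "[L1 a] parameter summaries", "b": "[L1 b] parameter summaries"},
    l2={"a.one": "[L2 a.one] full spec", "a.two": "[L2 a.two] full spec",
        "b.three": "[L2 b.three] full spec"},
)
ctx = RoutingContext(reg)
ctx.disambiguate(["a.one", "a.two"])  # ambiguous within category "a": L1 loads
ctx.dispatch("a.two")                 # L2 loads for the chosen capability only
prompt = ctx.render()

clear = RoutingContext(reg)
clear.disambiguate(["b.three"])       # a clear winner: L1 is never loaded
```

Because the slots are single-valued, the ceiling is enforced by the data structure itself rather than by a check that someone has to remember to run.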

What the math becomes

Run the same arithmetic that had been mocking me, now against the tiered shape. L0 across all four hundred capabilities is around 40K tokens. One category of L1 is about 2K. One capability of L2 is about 8K. The worst case the system can ever be in is roughly 50K tokens, which leaves about 78K of a 128K window free for conversation history, retrieved documents, and the actual output.
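The same sum as a function of registry size, as a minimal sketch. The per-tier sizes are the round estimates from the text, and the names are illustrative only.

```python
# Worst-case context for the tiered layout, using the text's round estimates.
WINDOW_TOKENS = 128_000
L0_TOKENS_PER_CAPABILITY = 100
L1_TOKENS_PER_CATEGORY = 2_000
L2_TOKENS_PER_CAPABILITY = 8_000


def worst_case_tokens(capabilities: int) -> int:
    """L0 for every capability, plus one category of L1, plus one L2 spec."""
    return (capabilities * L0_TOKENS_PER_CAPABILITY
            + L1_TOKENS_PER_CATEGORY
            + L2_TOKENS_PER_CAPABILITY)


used = worst_case_tokens(400)
free = WINDOW_TOKENS - used
print(used, free)  # 50000 78000
```

Only the first term depends on the number of capabilities; the L1 and L2 terms are capped at one category and one spec however large the registry gets.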

Fifty thousand tokens at four hundred capabilities, growing by only about a hundred tokens for each capability added, versus the millions the naive version demanded. Same capabilities, same window, same model. The difference is not a clever compression trick. The router never needed all that detail at once; I had just never given it permission to look things up only when it had to. The wall the numbers drew was not a limit of the model. It was a consequence of asking the wrong question on every call.

The cleanup that keeps it honest

There is one more piece that makes this hold up over a long session rather than just on paper, and it is the part I find quietly elegant. When context does eventually press against the ceiling, the eviction order is a hard rule, not a guess. The L2 spec for an in-flight dispatch is untouchable, because truncating it mid-call means the router generates against incomplete parameters and emits a malformed request. A category's L1 is atomic, all of it or none, because half a category overview is worse than no overview. So truncation always falls on L0 first, and never at random: the least-routed capabilities drop out before the most-routed ones.

The effect is that, over a long-running session, the always-present index slowly sheds the tools nobody is calling and keeps the ones the work actually leans on. The system concentrates its remaining headroom on its real working set. That is not decay. It is the design doing its job.
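A minimal sketch of that eviction rule, under stated assumptions: routing frequency is tracked as a simple per-capability counter, token sizes are known up front, and the loaded L1 category and in-flight L2 spec are represented only as a pinned token total that is never reduced. The function name and parameters are hypothetical.

```python
def evict_l0(l0_tokens: dict[str, int],
             route_counts: dict[str, int],
             pinned_tokens: int,
             budget: int) -> list[str]:
    """Return the L0 entries to keep, dropping the least-routed first.

    pinned_tokens covers the loaded L1 category (kept whole) and the
    in-flight L2 spec (never truncated); only L0 entries are removed.
    """
    # Most-routed first, so the least-routed entry is always at the end.
    keep = sorted(l0_tokens, key=lambda name: route_counts.get(name, 0), reverse=True)
    total = pinned_tokens + sum(l0_tokens.values())
    while keep and total > budget:
        dropped = keep.pop()  # least-routed entry still in the index
        total -= l0_tokens[dropped]
    return keep


# Made-up numbers: four index entries, 10K tokens pinned, 10,250-token budget.
sizes = {"a": 100, "b": 100, "c": 100, "d": 100}
counts = {"a": 9, "b": 0, "c": 5, "d": 2}
kept = evict_l0(sizes, counts, pinned_tokens=10_000, budget=10_250)
print(kept)  # ['a', 'c']
```

If the pinned L1 and L2 content alone exceeds the budget, this sketch returns an empty index rather than touching either of them; what a real router should do in that case is outside what the text describes.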

What I would tell you

If you are routing across a registry that can grow, build the tiered split before it grows, not after the wall. The whole lesson is one I had to learn the embarrassing way: knowing that a tool exists and knowing how to use it are different questions, and a router that pays for the second one on every decision is buying something it does not need yet. Separate them, load each at the moment it actually matters, and a registry that could never fit suddenly costs a small, predictable slice of your window: about a hundred tokens per capability for the index, plus one category and one spec at a time.

None of the core idea is mine. The L0/L1/L2 discipline is KARIMO's, published openly. What I brought was the bad evening, the arithmetic I should have run first, and the relief of recognizing my own problem in someone else's clean answer. The pattern needs no embeddings, no retrieval infrastructure, no dependency beyond a registry you author with three depths of detail instead of one. The only cost is writing those three layers when you register a capability, and that cost pays for itself on the first routing decision the old way could not have survived.

Repository

Pattern: github.com/opensesh/KARIMO, Apache-2.0