Tiered Context Loading: Fit a Huge Agent Registry in Your Context Window
Pattern source: KARIMO (Apache-2.0)
The Token Wall
Tiered context loading earns its keep the moment your agent’s capability registry outgrows the context window - which happens fast. Naively loading all capability specs into context costs 25x the available budget. That is not 25% over. It is twenty-five times.
A realistic multi-agent capability registry reaches 200-400 registered capabilities. Each capability spec (description, parameters, examples, edge cases, error modes) runs 4-8 KB. At 400 capabilities x 6 KB average, that is 2.4 MB of plain text.
GPT-4o’s context window is 128K tokens. At ~4 bytes per token, that is 512 KB of usable space. You can fully load 85 capabilities. You have 400.
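Measured in tokens the gap is wider still. Budget each full spec at the ~8,000-token L2 ceiling used in the rest of this article, and loading all of them costs 400 x 8,000 tokens = 3.2M tokens against a 128K window. Twenty-five times the budget, before you have added a single word of conversation history.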
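The figures in this section can be checked in a few lines. This is only a sketch of the arithmetic: the constant names are illustrative, and the values are the approximate ones quoted above (128K is taken as 128,000 tokens).

```python
# Figures quoted in the text; the constant names are illustrative only.
CAPABILITIES = 400
AVG_SPEC_KB = 6            # midpoint-ish of the 4-8 KB range
BYTES_PER_TOKEN = 4        # rough rule of thumb used above
WINDOW_TOKENS = 128_000    # 128K-token context window
L2_TOKENS = 8_000          # full-spec token budget per capability

registry_kb = CAPABILITIES * AVG_SPEC_KB               # whole registry, in KB
window_kb = WINDOW_TOKENS * BYTES_PER_TOKEN // 1_000   # context window, in KB
fit_by_bytes = window_kb // AVG_SPEC_KB                # whole specs that fit
naive_tokens = CAPABILITIES * L2_TOKENS                # every L2 spec at once
overrun = naive_tokens / WINDOW_TOKENS                 # multiple of the window

print(registry_kb, window_kb, fit_by_bytes, naive_tokens, overrun)
```

The byte estimate and the token estimate use different per-spec sizes, so they disagree on how bad the overrun is; both agree that the registry does not fit.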
The naive options are not alternatives to tiered loading: load all (impossible), load none (useless), load by keyword match (brittle: synonyms fail, category overlap breaks routing). They are just different ways to fail.
The solution is tiered context loading. KARIMO codified this into a three-level discipline: L0, L1, L2.
The Three Tiers
L0 - The Always-Present Index (~100 tokens per capability)
L0 is the capability index. For each capability you store: name, one-line description, primary task type. No parameters. No examples. Nothing else.
At ~100 tokens per capability and 400 capabilities, L0 costs ~40K tokens, roughly 31% of a 128K context window. This is the permanent overhead you pay unconditionally. Every inference call carries L0. The router always knows what capabilities exist; it just doesn’t know how to use any of them yet.
L0 makes capability discovery O(1) in context loads: the router can scan all 400 capability names in a single pass and identify candidates without fetching anything. Without L0, you’d need a retrieval system, which introduces latency, embedding drift, and recall failures. L0 trades 31% of context headroom for zero-failure capability enumeration.
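To make the shape of an L0 record concrete, here is a minimal sketch. The `L0Entry` type, the `render_l0_index` helper, the line format, and the two sample capabilities are all hypothetical; the source only specifies which three fields an entry holds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class L0Entry:
    """One row of the always-present index: name, one-liner, task type."""
    name: str
    description: str   # a single line; no parameters, no examples
    task_type: str


def render_l0_index(entries: list[L0Entry]) -> str:
    """Join every entry into the block that rides along on each call."""
    return "\n".join(
        f"{e.name} [{e.task_type}]: {e.description}" for e in entries
    )


# Hypothetical capabilities, for illustration only.
entries = [
    L0Entry("csv_dedupe", "Remove duplicate rows from a CSV file.", "data-processing"),
    L0Entry("http_get", "Fetch a URL and return the response body.", "external-api"),
]
index = render_l0_index(entries)
print(index)
```

One line per capability keeps the index easy to scan and easy to trim entry by entry if the budget ever gets tight.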
L1 - Category Overview (~2K tokens per category, loaded on disambiguation)
L1 is the category-level detail layer. When the router identifies a task category (“this is a data-processing task,” “this is an external API call,” “this is a file-system operation”) it loads the full L1 for that category: all capabilities in the category with full parameter summaries, input/output type signatures, and brief usage notes.
L1 is loaded on demand, not upfront, and one category at a time. At ~2K tokens per category, each L1 load adds only ~2K tokens to the context. You load one; you don’t load the others.
The trigger for L1 is ambiguity in the L0 routing pass. When the router identifies two or more candidate capabilities with overlapping descriptions, it requests L1 for the relevant category before narrowing its selection.
L2 - Full Spec (~8K tokens per capability, loaded only at dispatch)
L2 is the complete capability specification: full parameter descriptions with types and constraints, multiple usage examples, edge cases, error modes, retry semantics. This is the document a developer would read before implementing a call.
L2 is loaded for exactly one capability, the one being dispatched, at the moment of dispatch. Never earlier. Never for multiple capabilities simultaneously.
The invariant KARIMO enforces: at any point in the routing lifecycle, total loaded context is at most L0 (all capabilities) + L1 (one category) + L2 (one capability). Full stop.
The Token Math
The arithmetic is the argument:
| Tier | Per Unit | Units Loaded | Total Tokens |
|---|---|---|---|
| L0 | 100 tok | 400 | 40,000 |
| L1 | 2,000 tok | 1 category | 2,000 |
| L2 | 8,000 tok | 1 capability | 8,000 |
| Total | | | 50,000 |
50K tokens. That leaves 78K tokens in a 128K window for conversation history, retrieved documents, and generated output.
Compare to the naive alternative of loading all L2 specs: 400 capabilities at 8,000 tokens each is 3,200,000 tokens. That is 25x over the context budget. Not 25% over. Twenty-five times. This is not an optimization problem. Tiered loading is not a performance improvement over some working baseline. Without it, the system cannot function at all. L0/L1/L2 is a prerequisite, not a refinement.
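The tiered budget can be reproduced, and broken out per routing stage, with a small helper. This is a sketch: the `footprint` function and the stage labels are illustrative names, and the per-unit costs are the approximate figures from the table.

```python
WINDOW = 128_000
TIER_COST = {"L0": 100, "L1": 2_000, "L2": 8_000}   # tokens per unit, from the table


def footprint(capabilities: int, l1_categories: int = 0, l2_specs: int = 0) -> int:
    """Total tokens for a given mix of loaded tiers."""
    return (
        capabilities * TIER_COST["L0"]
        + l1_categories * TIER_COST["L1"]
        + l2_specs * TIER_COST["L2"]
    )


stages = {
    "L0 only": footprint(400),
    "L0 + L1": footprint(400, l1_categories=1),
    "L0 + L1 + L2": footprint(400, l1_categories=1, l2_specs=1),
}
for stage, tokens in stages.items():
    print(f"{stage}: {tokens:,} used, {WINDOW - tokens:,} free")
```

The worst case is the last row: 50,000 tokens used and 78,000 free, matching the table.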
The Lookup Protocol
The routing decision follows three discrete steps, each tied to a tier:
Step 1 - Candidate identification (L0 only). The router receives a task description. L0 is always present. The router scans all 400 capability summaries and identifies the top-3 candidates by relevance. No additional context is loaded.
Step 2 - Disambiguation (L0 + L1). If the top-3 candidates share a category and their L0 summaries are insufficiently distinct to resolve the selection, the router computes an ambiguity score. When ambiguity exceeds a threshold, L1 for the relevant category is loaded. The router re-evaluates with full parameter summaries available and narrows to a single candidate. For unambiguous tasks, when one candidate dominates clearly, this step is skipped entirely.
Step 3 - Dispatch (L0 + L2, plus L1 if Step 2 ran). The router selects a single capability and requests L2. The full spec is loaded. The router constructs the dispatch payload with complete type information and examples available. L2 is evicted after dispatch.
The ambiguity score is the decision gate between Step 1 and Step 2. A capability pair that shares a category and whose L0 descriptions have cosine similarity above a threshold (or, in a simpler implementation, overlap more than N keywords) triggers L1 load. Below threshold: skip to L2 directly.
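A minimal sketch of the simpler, keyword-overlap variant of that gate follows. Everything specific here is an assumption made for illustration: the Jaccard overlap measure, the stopword list, the 0.5 threshold, and the function names. The source only says that same-category pairs with sufficiently overlapping L0 descriptions trigger an L1 load.

```python
import re
from itertools import combinations

STOPWORDS = {"a", "an", "the", "of", "to", "from", "and", "in", "for"}


def keywords(description: str) -> set[str]:
    """Lowercased word set of an L0 one-liner, minus a few stopwords."""
    return set(re.findall(r"[a-z0-9]+", description.lower())) - STOPWORDS


def ambiguity(candidates: list[tuple[str, str, str]]) -> float:
    """Highest Jaccard overlap between any two same-category candidates.

    Each candidate is (name, category, l0_description).
    """
    score = 0.0
    for (_, cat_a, desc_a), (_, cat_b, desc_b) in combinations(candidates, 2):
        if cat_a != cat_b:
            continue  # only same-category pairs can trigger an L1 load
        a, b = keywords(desc_a), keywords(desc_b)
        if a or b:
            score = max(score, len(a & b) / len(a | b))
    return score


def needs_l1(candidates: list[tuple[str, str, str]], threshold: float = 0.5) -> bool:
    """Gate between Step 1 and Step 2: load L1 only above the threshold."""
    return ambiguity(candidates) > threshold


# Hypothetical candidates, for illustration only.
close = [
    ("csv_dedupe", "data", "Remove duplicate rows from a CSV file"),
    ("csv_dedupe_fuzzy", "data", "Remove near duplicate rows from a CSV file"),
    ("http_get", "api", "Fetch a URL and return the response body"),
]
distinct = [
    ("csv_dedupe", "data", "Remove duplicate rows from a CSV file"),
    ("csv_sort", "data", "Sort rows by a column"),
]
print(needs_l1(close), needs_l1(distinct))
```

The first set contains two near-identical one-liners in the same category, so it crosses the threshold and L1 is requested; the second does not, and the router would go straight to L2.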
The Design
A loader for this pattern needs no dependencies beyond the standard library. Each registered capability carries five fields: its name, its category, and the three tiered payloads - the L0 one-liner (bounded to roughly 100 tokens), the L1 category-level parameter summary (bounded to roughly 2,000 tokens), and the full L2 specification (bounded to roughly 8,000 tokens). At construction the loader indexes capabilities by name for direct lookup and groups them by category for L1 loading, and it tracks which L1 category and which L2 capability are currently active so it can report the live context footprint.
The three tier loaders map one-to-one onto the three routing steps. The L0 loader concatenates every capability’s name, category, and one-liner into a single always-present index. The L1 loader takes a category, raises if it is unknown, marks it active, and emits the full parameter summaries for every capability in that category. The L2 loader takes a single capability name, raises if unknown, marks it active, and emits its complete specification; this is also where the loader bumps a per-capability routing-frequency counter that later drives truncation. A separate eviction step clears the active L2 once a dispatch completes.
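The description above could be realized roughly as follows. This is a sketch under stated assumptions, not the reference implementation: the class, method, and attribute names are hypothetical, `KeyError` is an assumed choice for the "raises if unknown" behavior, and the counter here is a plain tally with no recency weighting.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    name: str
    category: str
    l0: str   # one-liner, roughly 100 tokens
    l1: str   # parameter summary
    l2: str   # full specification, roughly 8,000 tokens


class TieredLoader:
    def __init__(self, capabilities: list[Capability]) -> None:
        self._by_name = {c.name: c for c in capabilities}
        groups: dict[str, list[Capability]] = defaultdict(list)
        for c in capabilities:
            groups[c.category].append(c)
        self._by_category = dict(groups)          # plain dict: no silent key creation
        self.active_l1: Optional[str] = None      # at most one category
        self.active_l2: Optional[str] = None      # at most one capability
        self.route_count: dict[str, int] = defaultdict(int)

    def load_l0(self) -> str:
        """Always-present index: name, category, one-liner for every capability."""
        return "\n".join(
            f"{c.name} [{c.category}]: {c.l0}" for c in self._by_name.values()
        )

    def load_l1(self, category: str) -> str:
        """Parameter summaries for every capability in one category."""
        if category not in self._by_category:
            raise KeyError(f"unknown category: {category}")
        self.active_l1 = category
        return "\n\n".join(f"{c.name}\n{c.l1}" for c in self._by_category[category])

    def load_l2(self, name: str) -> str:
        """Full spec for the capability being dispatched; bumps its tally."""
        if name not in self._by_name:
            raise KeyError(f"unknown capability: {name}")
        self.active_l2 = name
        self.route_count[name] += 1
        return self._by_name[name].l2

    def evict_l2(self) -> None:
        """Clear the active L2 once the dispatch has completed."""
        self.active_l2 = None


# Hypothetical registry, for illustration only.
loader = TieredLoader([
    Capability("csv_dedupe", "data", "Remove duplicate rows.", "params: path, key", "full spec ..."),
    Capability("http_get", "api", "Fetch a URL.", "params: url, timeout", "full spec ..."),
])
index = loader.load_l0()               # Step 1: always present
overview = loader.load_l1("data")      # Step 2: only on ambiguity
spec = loader.load_l2("csv_dedupe")    # Step 3: at dispatch
loader.evict_l2()                      # after dispatch completes
```

Holding the active L1 and L2 in single slots, rather than lists, means a second load replaces the first instead of accumulating, which keeps the one-category, one-capability limit true by construction.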
A token-accounting helper approximates token counts from word counts (a rough multiplier suffices for budgeting), and a budget report sums the live L0, L1, and L2 footprints and reports remaining headroom against the context window. The report reads the active L2 content directly rather than through the L2 loader, so generating a diagnostic never inflates the routing-frequency counter that truncation depends on.
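A sketch of the accounting side, assuming a multiplier of 1.3 tokens per whitespace-separated word. That multiplier, the function names, and the report layout are all assumptions; a real tokenizer will give different counts.

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a word count (the multiplier is assumed)."""
    return round(len(text.split()) * tokens_per_word)


def budget_report(
    l0: str,
    active_l1: str = "",
    active_l2: str = "",
    window: int = 128_000,
) -> dict[str, int]:
    """Sum the live tier footprints and report remaining headroom.

    Takes the already-loaded tier text as plain strings, so producing a
    report has no side effect on any routing-frequency counter.
    """
    used = {
        "L0": estimate_tokens(l0),
        "L1": estimate_tokens(active_l1),
        "L2": estimate_tokens(active_l2),
    }
    total = sum(used.values())
    return {**used, "total": total, "headroom": window - total}


# Placeholder text standing in for real tier content.
report = budget_report(l0="word " * 100, active_l1="word " * 20)
print(report)
```

Word-count estimates are only suitable for budgeting. Where exact limits matter, swap in the tokenizer that matches the target model.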
The interface is deliberately small - three tier loaders, an eviction call, and a budget report - and maps directly onto the three-step routing protocol described above.
The Truncation Contract
When total context approaches the budget ceiling, the eviction priority order is a hard rule, not a heuristic:
- L2 for an active dispatch is inviolable. Truncating it mid-dispatch means the router generates with incomplete parameter information, producing malformed calls.
- L1 is atomic per category. A partial category overview is worse than none.
- L0 is the only tier that can shrink entry by entry, and it shrinks only when L1 and L2 are already at their minimum (one category, one capability).
Since the invariant already holds L1 and L2 at that minimum, truncation in practice always lands on L0. And L0 truncation is not random: KARIMO’s pattern uses a recency-weighted frequency counter per capability - the routing-frequency tally the L2 loader maintains. Capabilities with the lowest routing frequency are dropped first. The most-routed capabilities stay visible longest.
The result: over a long session, the L0 index self-prunes toward the working set. That is not a bug. It is the designed behavior: the system concentrates its remaining context headroom on the capabilities it actually uses.
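One way the frequency-driven L0 truncation might look is sketched below. The exponential-decay scheme, the 0.9 decay factor, and both function names are assumptions; the source specifies only that the counter is recency-weighted and that the least-routed capabilities are dropped first.

```python
def record_dispatch(frequency: dict[str, float], name: str, decay: float = 0.9) -> None:
    """Recency weighting (assumed scheme): decay every score, then credit the winner."""
    for key in frequency:
        frequency[key] *= decay
    frequency[name] = frequency.get(name, 0.0) + 1.0


def truncate_l0(entries: dict[str, int], frequency: dict[str, float], budget: int) -> list[str]:
    """Drop the least-routed capabilities until the L0 index fits the budget.

    entries maps capability name -> L0 token cost.
    Returns the names that stay in the index, most-routed first.
    """
    kept = sorted(entries, key=lambda name: frequency.get(name, 0.0), reverse=True)
    total = sum(entries.values())
    while kept and total > budget:
        total -= entries[kept.pop()]   # remove the lowest-frequency survivor
    return kept


# Hypothetical session: "a" is routed twice, "b" and "c" once, "d" never.
frequency: dict[str, float] = {}
for name in ["a", "b", "a", "c"]:
    record_dispatch(frequency, name)

entries = {"a": 100, "b": 100, "c": 100, "d": 100}
kept = truncate_l0(entries, frequency, budget=250)
print(kept)
```

With a 250-token budget the never-routed "d" goes first, then "b", whose single dispatch is older than that of "c". The index that remains is the recent working set.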
Repository
Pattern: github.com/opensesh/KARIMO - Apache-2.0
The loader described above is a standalone realization of KARIMO’s L0/L1/L2 pattern. It drops into any agent framework that manages a capability registry. The three tier loaders map directly to the three routing steps, and the budget report gives you live token accounting for observability.
No external dependencies. No embeddings. No retrieval infrastructure. The only requirement is that you populate L0, L1, and L2 fields when you register capabilities: a one-time authoring cost per capability that pays for itself on the first routing decision.