Caching LLM Agent Routing Decisions

It bothered me more than it should have. The agent was not learning anything. It had no memory of competence. It knew nothing about what had worked a hundred times before, only what the model currently predicted was most likely to work given the prompt in front of it. Every dispatch was the first dispatch. The planner tax was being paid forty-seven times for a route that never changed.

I knew the obvious fix, and I knew it was wrong, which is the worst kind of knowing. Embed the incoming task, find the nearest cached task by cosine distance, reuse the route if the similarity clears a threshold. It demos beautifully. It falls apart in production, and I had three reasons sitting in my head for exactly why.

Embedding similarity is not the same thing as task identity. "Summarize the Q3 earnings report" and "summarize this PDF about competitor pricing" sit close together in vector space. They are not the same task. The things that actually decide which agent handles the job are structural: document type, domain, the output format the caller needs. Those are discrete questions with crisp answers. A similarity score smears them into one continuous number that approximates meaning, not identity.

A similarity cache also has no concept of confidence. A route cached from one lucky run carries the same weight as one confirmed fifty times. And when a sub-agent's capabilities shift underneath it (a new tool, a swapped model, a deprecated endpoint), the cache has no way to express that it should trust the old entry less. The stale route just sits there at the same threshold it always had. Worse, the thing grows forever. Every unique-enough task adds another entry, and the longer it runs the more it accumulates embeddings from workflows that no longer exist. There is no signal telling it what to forget.

So I had a problem I understood and a solution I did not trust. The routing problem was structural, and I wanted a structural answer. I just could not see one.

Then I remembered SOAR.

I had read about it years ago, the cognitive architecture from a 1987 paper, the kind of thing you file away as intellectual furniture and never expect to use. SOAR runs a loop: take a state, pick an operator, apply it, update the state, repeat until the goal is reached. When it hits a situation where no operator applies, an impasse, it opens a subgoal, works the problem out in that smaller space, and resolves it. The expensive part is that the same impasses recur across different tasks, and without help the architecture solves each one from scratch every time. Replaying the same moves. Deliberation without memory.

That last phrase was the moment it clicked. Deliberation without memory was exactly what I was staring at. My routing agent was a machine for re-resolving the same impasse forty-seven times.

SOAR closes that loop with a mechanism called chunking. When a subgoal resolves successfully, the architecture traces the conditions that led to the impasse and the operators that resolved it, and compiles them into a single rule: given this context, fire this directly. Next time the same context shows up, the rule fires immediately and the whole deliberation is skipped. Recall replaces reasoning.

Two properties make it precise, and they are the two properties I had been missing. Chunking only happens on successful resolution; failed attempts compile nothing, so the cache is built exclusively from evidence of competence. And it fires on context equivalence, not similarity. The match is structural, a pattern match, not a nearest-neighbor guess. The payoff is a competence ratchet: performance over time only holds steady or improves. The architecture cannot get slower at a problem it has already solved.

That was the abstraction I needed, sitting in a paper older than most of the people building agents today.

The mapping turned out to be almost embarrassingly direct. A SOAR problem context becomes a task fingerprint: a deterministic hash of the structural attributes that decide the route (task type, input modality, required output format, domain flags). Not an embedding. A fingerprint. Two tasks with the same fingerprint should get the same route, and if they would not, the fingerprinting schema is wrong, not the threshold. The operator trace becomes a routing trajectory, the ordered list of agents and tools a successful run actually used, captured only after the outcome is confirmed good. A SOAR production rule becomes a cached route keyed by fingerprint. An impasse resolution is the planner call. A chunk firing is a direct dispatch with the planner bypassed entirely.
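To make the fingerprint concrete, here is a minimal sketch in Python. The class name `TaskShape`, its field names, and the choice of SHA-256 over canonical JSON are my assumptions for illustration; the only thing the idea requires is that the hash is deterministic and built from structural attributes alone.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskShape:
    # Hypothetical field names for the four attribute kinds named in the text.
    task_type: str
    input_modality: str
    output_format: str
    domain_flags: frozenset = frozenset()


def fingerprint(shape: TaskShape) -> str:
    """Deterministic hash of the structural attributes that decide the route."""
    payload = {
        "task_type": shape.task_type,
        "input_modality": shape.input_modality,
        "output_format": shape.output_format,
        # Sort the flags so set ordering can never change the hash.
        "domain_flags": sorted(shape.domain_flags),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two summarization tasks that an embedding would place close together.
earnings = TaskShape("summarize", "pdf", "bullets", frozenset({"finance"}))
pricing = TaskShape("summarize", "pdf", "bullets", frozenset({"competitive-intel"}))
earnings_again = TaskShape("summarize", "pdf", "bullets", frozenset({"finance"}))

print(fingerprint(earnings) == fingerprint(earnings_again))  # True
print(fingerprint(earnings) == fingerprint(pricing))         # False
```

The match is exact or it is nothing. If two tasks that deserve different routes ever collide, the fix is another field on the shape, not a tuned threshold.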

And the ratchet, in this setting, is one clean rule: after K confirmed successful dispatches for the same fingerprint, stop deliberating and dispatch the cached route. The planner does not run. The route is known.

I went with K equal to three. One success could be luck. Two is a signal. Three is enough confirmation to skip the deliberation while still being small enough to adapt fast when something changes. I want to be honest that this is not a deeply principled number. It is a reasonable prior that should be configurable for wherever it runs. I would rather flag that than dress it up.
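The ratchet itself fits in a few lines. What follows is a sketch under my own assumptions, not a reference implementation: the names `RouteCache`, `Chunk`, `lookup`, and `record_success` are made up for illustration, and resetting the count when the planner picks a different route is a design choice the rule above does not dictate.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    route: List[str]      # ordered agents/tools from a confirmed-good run
    successes: int = 0    # confirmed successful dispatches so far


class RouteCache:
    def __init__(self, k: int = 3):
        self.k = k  # configurable confirmation threshold
        self._chunks: Dict[str, Chunk] = {}

    def lookup(self, fp: str) -> Optional[List[str]]:
        """Return the cached route only once it has K confirmed successes."""
        chunk = self._chunks.get(fp)
        if chunk is not None and chunk.successes >= self.k:
            return chunk.route
        return None  # caller should run the planner

    def record_success(self, fp: str, route: List[str]) -> None:
        """Call only after the outcome is confirmed good; failures record nothing."""
        chunk = self._chunks.get(fp)
        if chunk is None or chunk.route != route:
            # Assumption: a different route for the same fingerprint restarts the count.
            self._chunks[fp] = Chunk(route=list(route), successes=1)
        else:
            chunk.successes += 1


cache = RouteCache(k=3)
route = ["pdf_extractor", "summarizer"]  # hypothetical agent names

cache.record_success("fp-1", route)
cache.record_success("fp-1", route)
print(cache.lookup("fp-1"))  # None

cache.record_success("fp-1", route)
print(cache.lookup("fp-1"))  # ['pdf_extractor', 'summarizer']
```

There is deliberately no method for recording a failure. A failed run simply never reaches `record_success`, so nothing is compiled from it.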

The honesty extends to a second mechanism I borrowed from biology rather than cognitive science. A chunk that has not been used in a long time is a liability, not an asset, because its route may point at an agent that no longer exists. So chunks die. In cell biology, apoptosis is programmed cell death, the body clearing cells that have stopped being useful. A chunk untouched past its age window gets swept out the same way. Ninety days is my conservative default; for a fast-moving deployment with frequent model swaps, thirty is more honest. The invariant I care about: no chunk survives long enough to become a trap after the ground has shifted under it.
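A sweep like that might look as follows. This is a sketch that assumes each chunk carries a `last_used` timestamp in epoch seconds, updated whenever the chunk fires; the names `Chunk` and `sweep` and the `now` parameter (there so the function can be tested without waiting ninety days) are mine.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional

SECONDS_PER_DAY = 86_400


@dataclass
class Chunk:
    route: List[str]
    last_used: float  # epoch seconds of the most recent dispatch


def sweep(
    chunks: Dict[str, Chunk],
    max_age_days: float = 90.0,
    now: Optional[float] = None,
) -> List[str]:
    """Delete chunks untouched for longer than max_age_days; return their fingerprints."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    # Collect first, then delete, so the dict is not mutated while iterating.
    expired = [fp for fp, chunk in chunks.items() if chunk.last_used < cutoff]
    for fp in expired:
        del chunks[fp]
    return expired


now = 1_000 * SECONDS_PER_DAY  # fixed clock for the example
chunks = {
    "fresh": Chunk(["summarizer"], last_used=now - 10 * SECONDS_PER_DAY),
    "stale": Chunk(["old_agent"], last_used=now - 120 * SECONDS_PER_DAY),
}

print(sweep(chunks, max_age_days=90, now=now))  # ['stale']
print(list(chunks))                             # ['fresh']
```

Dropping `max_age_days` to thirty for a fast-moving deployment is a one-argument change, which is the reason to keep it a parameter rather than a constant.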

I also kept a list of places where I refuse to chunk at all, because a ratchet that bypasses thought is dangerous in the wrong context. High-stakes routing, where a misroute means data written to the wrong system or a destructive operation on the wrong resource: the planner is cheap insurance and I leave it in. Drift-detected contexts, where a distribution-shift signal says the world has changed: the chunks go on probation. Genuinely new task shapes, where there is no fingerprint match: the planner runs and that is correct, and I resist every temptation to add fuzzy matching, because structural precision is the entire point. And tasks where the route selection is itself the valuable reasoning, such as A/B comparisons or exploring route diversity: caching would eliminate the exploration I actually wanted. This is an optimization for settled decisions, not for routing research.
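Those four refusals can sit in front of the cache as a single guard. The sketch below assumes each condition has already been reduced to a boolean by something upstream; the `DispatchContext` fields and the `planner_required` function are hypothetical names, and how drift or stakes are actually detected is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DispatchContext:
    # Hypothetical flags; how each one is computed is deployment-specific.
    high_stakes: bool = False
    drift_detected: bool = False
    fingerprint_known: bool = True
    exploring_routes: bool = False


def planner_required(ctx: DispatchContext) -> Optional[str]:
    """Return the reason the planner must run, or None if a chunk may fire."""
    if ctx.high_stakes:
        return "high-stakes route: planner stays in as insurance"
    if ctx.drift_detected:
        return "drift detected: chunks are on probation"
    if not ctx.fingerprint_known:
        return "new task shape: no exact fingerprint match"
    if ctx.exploring_routes:
        return "route selection is the experiment"
    return None


print(planner_required(DispatchContext()))                  # None
print(planner_required(DispatchContext(high_stakes=True)))
print(planner_required(DispatchContext(fingerprint_known=False)))
```

Returning the reason rather than a bare boolean makes it cheap to log why a dispatch paid the planner cost, which is useful when deciding whether a refusal is still earning its keep.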

The moment I want you to feel is the fourth run. The first three dispatches for a fingerprint pay the full planner cost and accumulate their successes, and the third one is what brings the count up to the threshold. From the fourth run on, the lookup returns the cached route, the planner is skipped, and the latency just collapses. In a toy loop the per-run cost drops from around a hundred milliseconds to around twenty, the planner taken clean off the critical path. In a real deployment, where the planner is an actual model API call with a network round-trip, the delta is bigger: four hundred to twelve hundred milliseconds depending on the model and the infrastructure. The agent stops re-deciding what it already knows.

That is the whole story, and it is also the principle. The agents we build re-deliberate settled decisions because we never gave them a way to remember being right. A thirty-nine-year-old paper on general intelligence had already solved that, and the only new work was recognizing my problem in its shape. The competence ratchet is not a clever trick I came up with. It is an old idea I was lucky enough to remember at the right moment, on attempt forty-eight, when I finally got tired of paying the same tax twice.

If you are building routing agents, you have probably paid it too. You just might not have noticed yet.

Citation

Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33(1), 1-64. https://doi.org/10.1016/0004-3702(87)90050-6