Agent Routing Caches: A Competence Ratchet from SOAR Chunking

The Problem: Agents That Re-Deliberate Settled Routes

An agent routing cache solves a quiet but expensive problem: a routing agent that re-decides the same call over and over. Your routing agent has successfully dispatched “summarize this PDF” to the same sub-agent 47 times. On attempt 48, it calls the planner again.

This is not a hypothetical. Any agent that selects tools or delegates to sub-agents via an LLM call re-deliberates on every single invocation unless you explicitly prevent it. Tokens spent. Latency incurred. Same route selected. Every time. The agent has no memory of competence - it knows nothing about what worked before, only what the LLM currently predicts is most likely to work based on the prompt.

The performance ceiling for a routing agent in this configuration is flat. You can add faster hardware, a snappier model, better prompts. The agent will still burn planning capacity on tasks it has handled successfully dozens of times. There is no accumulation.

In 1987, a paper on cognitive architecture described a mechanism that solved a structurally identical problem. The architecture was SOAR. The mechanism was chunking. The idea translates directly.


Why Vector-Similarity Caching Is the Wrong Abstraction

The obvious first move is semantic caching: embed the incoming task, find the nearest cached task by cosine distance, and if the similarity exceeds some threshold, reuse the stored route. This works in demos. It fails in production for three compounding reasons.

Embedding similarity is not fingerprint equivalence. “Summarize the Q3 earnings report” and “Summarize this PDF about competitor pricing” have high cosine similarity. They are not the same task type. The attributes that determine which agent handles the job - document type, domain, required output format - are structural, not semantic. A similarity score conflates them. Structural routing correctness is a discrete question with a crisp answer; vector distance is a continuous measure that approximates meaning, not task identity.

No confidence model means stale routes survive capability drift. A similarity cache has no concept of how many times a particular route has been confirmed correct. A route cached from a single successful run carries the same weight as one confirmed by 50 runs. When sub-agent capabilities change - a new tool becomes available, a model is swapped, an endpoint is deprecated - the cache has no mechanism to express reduced confidence. The stale entry persists at the same similarity threshold it always had.

No apoptosis means the cache becomes a liability. Similarity caches grow monotonically. Every task that falls below the similarity threshold against all existing entries adds a new one. The longer the system runs, the more the cache accumulates task embeddings from contexts that may no longer exist: deprecated workflows, retired data sources, old task shapes. There is no pruning signal. The cache doesn’t know what it has forgotten to forget.

The routing problem is structural. The solution needs to be structural.


SOAR’s Chunking - The 1987 Idea

SOAR (State, Operator, And Result) is a cognitive architecture designed around a specific theory of how general intelligence arises from problem-space search [1]. The core loop: given a current state, select an operator, apply it, update state, repeat until the goal is reached. When the architecture encounters a situation where no operator applies directly - a so-called impasse - it opens a subgoal, works through the impasse in that subproblem space, and eventually resolves it.

The expensive part is impasse resolution. The same impasse pattern can recur across different high-level tasks. Without chunking, SOAR resolves it from scratch every time, replaying the same operator sequence. This is deliberation without memory.

Chunking is the learning mechanism that closes this loop. When a subgoal is successfully resolved, SOAR traces the conditions that led to the impasse and the operators that resolved it, then compiles them into a single production rule. The production rule encodes: given this problem context, fire this operator directly. On subsequent encounters with the same context, the production fires immediately, bypassing the entire impasse-resolution subgoal. Deliberation is replaced by recall.

Two properties make this precise:

Chunking requires successful resolution. Rules are only compiled from subgoals that were actually resolved. Failed attempts do not generate chunks. The cache is built exclusively from evidence of competence.

Chunking requires context equivalence, not similarity. The production fires when the problem context matches exactly, not approximately. Structural identity is the trigger. This is not a nearest-neighbor lookup; it is a pattern match.

The consequence is the competence ratchet: performance over time is monotonically non-decreasing. Every resolved impasse contributes a production that accelerates future resolution. The architecture cannot get slower at tasks it has already solved. It can only stay the same or get faster.


Modern Reframing: Chunks as Cached Routing Trajectories

The mapping to LLM agent routing is direct.

| SOAR concept | Agent routing analog |
| --- | --- |
| Problem context | Task fingerprint - structured hash of task type + key attributes |
| Operator application trace | Routing trajectory - which agent/tool chain was called, in what order |
| Production rule | Cached route entry, keyed by fingerprint |
| Impasse resolution | LLM planner call |
| Chunk firing | Direct dispatch, planner bypassed |

A task fingerprint is not an embedding. It is a deterministic hash of the structural attributes that determine which route is correct: task type, input modality, required output format, any domain flags that affect routing. Two tasks with the same fingerprint should receive the same route. If they wouldn’t, your fingerprinting schema needs refinement, not your similarity threshold.
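As a minimal sketch of what such a fingerprint could look like, the function below hashes a canonical JSON serialization of the structural attributes. The function name, the attribute names (`modality`, `output_format`), and the 16-character truncation are illustrative assumptions, not a fixed schema.

```python
import hashlib
import json


def task_fingerprint(task_type: str, **attributes: str) -> str:
    """Deterministic fingerprint built from structural attributes only.

    The free-text wording of the request is deliberately not an input.
    """
    payload = {"task_type": task_type, **attributes}
    # sort_keys makes the serialization independent of keyword order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    # Truncating the digest is an assumption made for readable keys.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


a = task_fingerprint("summarize", modality="pdf", output_format="bullets")
b = task_fingerprint("summarize", output_format="bullets", modality="pdf")
c = task_fingerprint("summarize", modality="html", output_format="bullets")
print(a == b)  # same structure, different keyword order
print(a == c)  # different modality
```

Because only structural attributes are hashed, two differently worded requests map to the same key exactly when your schema says they are the same kind of task, which is the property a similarity threshold cannot give you.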

A routing trajectory is the ordered list of agent identifiers or tool names invoked during a successful run. It is captured after the run completes and the outcome is confirmed successful - the SOAR constraint holds here too. Only successful runs contribute chunks.

The competence ratchet in this setting: after K confirmed successful dispatches for the same fingerprint, stop deliberating and dispatch the cached route directly. The LLM planner is not called. The route is known.

K = 3 is a defensible default. One success could be coincidence. Two is a signal. Three represents enough confirmation to warrant bypassing deliberation while still being small enough to adapt quickly when capabilities change. This is not a deeply principled number - it is a reasonable prior that should be configurable in your specific deployment context.


The Design

The chunk store needs no external dependencies. Each cached entry - a chunk - records the task fingerprint that keys it, the routing trajectory (the ordered list of agents or tools to dispatch), the count of confirmed successes accumulated so far, and two UTC timestamps: when the chunk was created and when it was last used. The store itself is a dictionary keyed by fingerprint, optionally backed by a persistence file that it loads on startup.
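One way to lay out those records, assuming Python dataclasses; the class and field names here are illustrative, and persistence is left out of this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def utc_now() -> datetime:
    """Timezone-aware current time in UTC."""
    return datetime.now(timezone.utc)


@dataclass
class Chunk:
    fingerprint: str   # key into the store
    route: list[str]   # ordered agent or tool names to dispatch
    success_count: int = 0
    created_at: datetime = field(default_factory=utc_now)
    last_used: datetime = field(default_factory=utc_now)


@dataclass
class ChunkStore:
    threshold: int = 3      # confirmed successes required before bypassing the planner
    max_age_days: int = 90  # apoptosis window
    chunks: dict[str, Chunk] = field(default_factory=dict)


store = ChunkStore()
store.chunks["abc123"] = Chunk("abc123", ["pdf_extractor", "summarizer"])
```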

Recording an outcome is where the SOAR constraint lives. The method takes a fingerprint, a route, and a success flag. A failed run returns immediately without touching the store - chunks form from confirmed competence only, and a single failure neither creates nor degrades an entry (staleness is handled separately by pruning). A successful run either creates a new entry with a success count of one, or increments the count on the existing entry and refreshes its last-used timestamp. Notably, a success always stores the most recent successful route: if a capability change causes a different trajectory to start succeeding for the same fingerprint, each new success overwrites the stored route, so the chunk migrates to the new trajectory while its count keeps climbing.
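A minimal sketch of the recording step over a plain dictionary store. The dictionary layout and function name are assumptions for illustration; the choice to keep the count when the route changes follows the description above rather than any reference implementation.

```python
from datetime import datetime, timezone


def record_outcome(store: dict, fingerprint: str, route: list[str], success: bool) -> None:
    """Record one run. Only successes touch the store."""
    if not success:
        return  # failures neither create nor degrade an entry
    now = datetime.now(timezone.utc)
    entry = store.get(fingerprint)
    if entry is None:
        store[fingerprint] = {
            "route": list(route),  # copy so later caller mutations cannot leak in
            "success_count": 1,
            "created_at": now,
            "last_used": now,
        }
    else:
        entry["route"] = list(route)  # most recent successful route wins
        entry["success_count"] += 1
        entry["last_used"] = now


store: dict = {}
record_outcome(store, "fp", ["summarizer_v1"], success=False)
record_outcome(store, "fp", ["summarizer_v1"], success=True)
record_outcome(store, "fp", ["summarizer_v2"], success=True)
```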

Reading a chunk enforces the ratchet threshold. The lookup returns the cached route only if an entry exists and its success count has reached the configured threshold (three by default); otherwise it returns nothing and the caller falls through to the LLM planner. A successful read also touches the last-used timestamp, which resets the apoptosis clock so that actively-used chunks never expire.
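The read path can be sketched in a few lines, again over an assumed dictionary layout. Returning a copy of the route is a defensive choice so that callers cannot mutate the stored trajectory.

```python
from datetime import datetime, timezone
from typing import Optional


def get_route(store: dict, fingerprint: str, threshold: int = 3) -> Optional[list[str]]:
    """Return the cached route only once the chunk has matured."""
    entry = store.get(fingerprint)
    if entry is None or entry["success_count"] < threshold:
        return None  # caller falls through to the LLM planner
    entry["last_used"] = datetime.now(timezone.utc)  # reset the apoptosis clock
    return list(entry["route"])


store = {"fp": {"route": ["pdf_extractor", "summarizer"], "success_count": 2, "last_used": None}}
before = get_route(store, "fp")  # two successes: still deliberating
store["fp"]["success_count"] = 3
after = get_route(store, "fp")   # matured: planner can be bypassed
```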

Three more operations round out the interface. Pruning removes every chunk not accessed within the maximum-age window and reports how many it dropped - this is the apoptosis step, meant to run on a schedule. A clear operation removes either a named list of fingerprints or the entire store, which is the integration point for a drift signal that should force re-deliberation. A statistics call reports total entries, how many have matured past the threshold, how many are still pending, and the configured parameters, for observability.
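These three operations might look like the following over a dictionary store. The function names, return values, and the optional `now` parameter (included to make pruning testable) are assumptions of this sketch.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def prune(store: dict, max_age_days: int = 90, now: Optional[datetime] = None) -> int:
    """Drop chunks not used within the age window; return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    stale = [fp for fp, entry in store.items() if entry["last_used"] < cutoff]
    for fp in stale:
        del store[fp]
    return len(stale)


def clear(store: dict, fingerprints: Optional[Iterable[str]] = None) -> int:
    """Remove the named fingerprints, or everything when none are given."""
    if fingerprints is None:
        removed = len(store)
        store.clear()
        return removed
    return sum(1 for fp in fingerprints if store.pop(fp, None) is not None)


def stats(store: dict, threshold: int = 3, max_age_days: int = 90) -> dict:
    """Summarize the store for observability."""
    mature = sum(1 for entry in store.values() if entry["success_count"] >= threshold)
    return {
        "total": len(store),
        "mature": mature,
        "pending": len(store) - mature,
        "threshold": threshold,
        "max_age_days": max_age_days,
    }
```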

Persistence, when enabled, writes the store to a file via an atomic replace - write to a temporary file, then swap it into place - so a crash mid-write can never corrupt the cache. Loading reverses the process on startup, tolerating an empty file. Every mutation that changes the store flushes to disk so the cache survives restarts.
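A sketch of the atomic write using only the standard library. It assumes the store is JSON-serializable, which means timestamps would need to be stored as ISO-8601 strings; the function names and file layout are illustrative.

```python
import json
import os
import tempfile


def save_store(store: dict, path: str) -> None:
    """Write the store to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the swap stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(store, fh)
            fh.flush()
            os.fsync(fh.fileno())  # push bytes to disk before the swap
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # do not leave a partial temp file behind
        raise


def load_store(path: str) -> dict:
    """Load the store, treating a missing or empty file as an empty cache."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Readers of the file see either the old contents or the new contents, never a half-written mix, because the only operation that touches the real path is the final `os.replace`.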


Apoptosis: Programmed Chunk Death

In cellular biology, apoptosis is programmed cell death - the mechanism by which an organism clears cells that are no longer useful or that have become potentially harmful. The analogy to cache management is precise: a chunk that has not been used in 90 days is a chunk whose task type may have shifted, whose route may point at a deprecated agent, or whose fingerprint may no longer match any live task shape. It occupies memory and may mislead the system if its context is reactivated.

The pruning operation is the apoptosis mechanism. Wire it to a scheduler - a daily cron job, a startup hook, or a background thread on a configurable interval - and it sweeps out every chunk that has gone untouched past the age window, reporting how many it removed.
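For the background-thread option, a minimal sketch might look like this. The name `start_pruner` is hypothetical, and `prune_fn` stands for whatever zero-argument callable runs your pruning step.

```python
import threading
from typing import Callable


def start_pruner(prune_fn: Callable[[], int], interval_seconds: float) -> threading.Event:
    """Run prune_fn every interval_seconds on a daemon thread.

    Returns an Event; call .set() on it to stop the loop.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait returns False on timeout (time to prune) and True once
        # stop has been set (time to exit).
        while not stop.wait(interval_seconds):
            prune_fn()

    threading.Thread(target=loop, name="chunk-pruner", daemon=True).start()
    return stop
```

A daily sweep would pass `24 * 60 * 60` as the interval. Note that an exception raised by `prune_fn` ends the thread in this sketch, so a production version would want to catch and log it.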

The 90-day default is conservative. For rapidly evolving agent deployments - frequent model swaps, new tools added monthly - a shorter window of around 30 days may be appropriate. The right value is a function of your deployment cadence: how often does the optimal route for a given task type change? Set the maximum age just above that period. The invariant you want: no chunk survives long enough to become a liability after the capability landscape has shifted underneath it.


When Not to Chunk

The competence ratchet is not a universal optimization. There are task categories where bypassing deliberation is the wrong engineering decision.

High-stakes routing decisions. If the cost of a misrouted task is significant - data written to the wrong system, a destructive operation triggered on the wrong resource, a compliance boundary crossed - the marginal token cost of deliberation is cheap insurance. The LLM planner is a sanity check. Do not cache it away.

Drift-detected contexts. If you have a distribution shift signal on the incoming task stream - ADWIN, a monitoring alert, a significant drop in downstream task success rates - cached chunks should be invalidated or placed on probation. A chunk formed during a prior operating regime may not apply to the current one. The detection mechanism is outside the scope of the chunker; the integration point is: on drift signal, call chunker.clear() or selectively remove fingerprints associated with the drifted task category.

Genuinely new task shapes. By construction, a task with no prior fingerprint match gets no chunk. The planner runs. This is correct behavior - the chunker has nothing to offer. The scenario to watch is near-miss fingerprints: a task that is structurally slightly different from a cached one. Resist the temptation to add fuzzy fingerprint matching. If a task is similar but structurally distinct, the planner should handle it. Structural precision is the mechanism’s strength.

Tasks where route selection is the valuable work. Some routing decisions are themselves a form of reasoning that should not be short-circuited. If the point of the agent is to select among genuinely interchangeable options - A/B testing different sub-agents, exploring route diversity for quality comparison - then caching eliminates the exploration you need. The chunker is an optimization for settled routing decisions, not for routing research.
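Two of these exclusions can be expressed as small guards around the chunk store. The sketch below is one possible reading: the task-type names are hypothetical, the `exploring` flag stands in for whatever marks an A/B or comparison run, and "probation" is interpreted here as resetting the success count so the planner runs again until the chunk re-matures.

```python
from typing import Iterable, Optional

# Hypothetical task types whose misrouting would be costly: always deliberate.
NEVER_CHUNK = frozenset({"delete_records", "issue_refund"})


def may_bypass_planner(task_type: str, exploring: bool = False) -> bool:
    """Gate to check before consulting the chunk store at all."""
    return task_type not in NEVER_CHUNK and not exploring


def on_drift(store: dict, fingerprints: Optional[Iterable[str]] = None,
             probation: bool = False) -> int:
    """React to a drift signal; return the number of chunks affected.

    probation=False drops the chunks outright. probation=True keeps them but
    resets success_count to 0, forcing re-deliberation until they re-mature.
    """
    if fingerprints is None:
        targets = list(store)
    else:
        # dict.fromkeys de-duplicates while preserving order.
        targets = [fp for fp in dict.fromkeys(fingerprints) if fp in store]
    for fp in targets:
        if probation:
            store[fp]["success_count"] = 0
        else:
            del store[fp]
    return len(targets)
```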


Seeing the Ratchet in Action

Picture the chunker driving a minimal fake agent loop. A deterministic fingerprint is built from a task type plus its key structural attributes - say a PDF-modality summarization task. A stand-in planner simulates LLM deliberation with a fixed latency cost of around 80 milliseconds, and a stand-in dispatcher simulates executing the chosen route with a smaller execution cost and always reports success.

The loop runs the same fingerprint six times. On each pass it first asks the chunker for a cached route; if one comes back, it dispatches directly and labels the path as a chunk bypass, and if not, it calls the planner and labels the path as deliberation. Either way it records the successful outcome. The first three runs find no mature chunk, so they pay the full planner cost and accumulate successes; the third run is the one that pushes the success count to the threshold and matures the chunk. From the fourth run onward the lookup returns the cached route, the planner is skipped entirely, and per-run latency collapses from roughly 100 milliseconds to roughly 20 - the planner taken off the critical path. The final statistics show a single mature chunk and no pending entries.
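The walkthrough above can be sketched as a self-contained script. To keep the output deterministic, latencies are summed as numbers rather than slept; the agent names and the exact 80 ms and 20 ms costs are illustrative assumptions.

```python
import hashlib
import json

PLANNER_MS = 80   # simulated LLM deliberation cost
DISPATCH_MS = 20  # simulated execution cost
THRESHOLD = 3

store: dict = {}  # fingerprint -> {"route": [...], "success_count": int}


def fingerprint(task_type: str, **attrs: str) -> str:
    payload = json.dumps({"task_type": task_type, **attrs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def plan(fp: str) -> tuple[list[str], int]:
    """Stand-in planner: fixed route, fixed latency."""
    return ["pdf_extractor", "summarizer"], PLANNER_MS


def dispatch(route: list[str]) -> tuple[bool, int]:
    """Stand-in dispatcher: always succeeds."""
    return True, DISPATCH_MS


def lookup(fp: str):
    entry = store.get(fp)
    if entry is None or entry["success_count"] < THRESHOLD:
        return None
    return list(entry["route"])


def record(fp: str, route: list[str], success: bool) -> None:
    if not success:
        return
    entry = store.setdefault(fp, {"route": [], "success_count": 0})
    entry["route"] = list(route)
    entry["success_count"] += 1


def run_once(fp: str) -> tuple[str, int]:
    route = lookup(fp)
    if route is not None:
        path, latency = "chunk bypass", 0
    else:
        route, latency = plan(fp)
        path = "deliberation"
    ok, exec_ms = dispatch(route)
    record(fp, route, ok)
    return path, latency + exec_ms


fp = fingerprint("summarize", modality="pdf")
results = [run_once(fp) for _ in range(6)]
for i, (path, ms) in enumerate(results, start=1):
    print(f"run {i}: {path:<13} {ms} ms")
```

Runs one through three report deliberation at 100 ms; runs four through six report a chunk bypass at 20 ms.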

In a real deployment, where the planner is an LLM API call with a network round-trip, the saving is larger than in this simulation - typically 400-1200 ms depending on model and infrastructure.


Implementation Shape

A complete implementation has three working parts plus the usual scaffolding: the chunk store itself, the agent-loop demonstration, and a unit-test suite. The tests should cover threshold enforcement (no route returned before the success count reaches the threshold), immutability of returned routes (mutating a returned route does not corrupt the stored copy), apoptosis correctness (pruning removes only entries outside the age window), persistence round-trip (save then load reproduces identical store state), and failure non-contribution (a recorded failure never increments any counter).
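A compressed sketch of four of those tests follows, written as plain pytest-style functions against a deliberately small in-memory store. The store class here is a stand-in for illustration, and the persistence round-trip test is omitted because this stand-in has no persistence.

```python
from datetime import datetime, timedelta, timezone


class ChunkStore:
    """Minimal in-memory stand-in, just enough to exercise the tests."""

    def __init__(self, threshold: int = 3, max_age_days: int = 90):
        self.threshold = threshold
        self.max_age = timedelta(days=max_age_days)
        self.entries: dict = {}

    def record_outcome(self, fp, route, success):
        if not success:
            return
        now = datetime.now(timezone.utc)
        entry = self.entries.setdefault(fp, {"route": [], "success_count": 0, "last_used": now})
        entry["route"] = list(route)
        entry["success_count"] += 1
        entry["last_used"] = now

    def get(self, fp):
        entry = self.entries.get(fp)
        if entry is None or entry["success_count"] < self.threshold:
            return None
        entry["last_used"] = datetime.now(timezone.utc)
        return list(entry["route"])

    def prune(self):
        now = datetime.now(timezone.utc)
        stale = [fp for fp, e in self.entries.items() if now - e["last_used"] > self.max_age]
        for fp in stale:
            del self.entries[fp]
        return len(stale)


def test_threshold_enforced():
    s = ChunkStore(threshold=3)
    for _ in range(2):
        s.record_outcome("fp", ["a"], True)
    assert s.get("fp") is None
    s.record_outcome("fp", ["a"], True)
    assert s.get("fp") == ["a"]


def test_returned_route_is_a_copy():
    s = ChunkStore(threshold=1)
    s.record_outcome("fp", ["a", "b"], True)
    s.get("fp").append("corrupted")
    assert s.get("fp") == ["a", "b"]


def test_prune_removes_only_stale_entries():
    s = ChunkStore(threshold=1, max_age_days=90)
    s.record_outcome("old", ["a"], True)
    s.record_outcome("fresh", ["b"], True)
    s.entries["old"]["last_used"] -= timedelta(days=91)  # backdate one entry
    assert s.prune() == 1
    assert set(s.entries) == {"fresh"}


def test_failure_never_counts():
    s = ChunkStore()
    s.record_outcome("fp", ["a"], False)
    assert s.entries == {}
    s.record_outcome("fp", ["a"], True)
    s.record_outcome("fp", ["a"], False)
    assert s.entries["fp"]["success_count"] == 1
```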


Citation

Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33(1), 1-64. https://doi.org/10.1016/0004-3702(87)90050-6

Footnotes

  1. Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33(1), 1-64.