Drift Detection for LLM Routing: Catching Silent Model Degradation
The Problem: Drift Detection in LLM Routing
You have four capabilities - a fast cheap model, a slow expensive one, a retrieval tool, and a code-execution agent. You route incoming tasks to one of them and observe a binary reward: did the output satisfy the quality gate? You run this for a few thousand calls and your bandit policy converges. Routing weights stabilize. Life is good.
Then the fast cheap model gets a silent update. Its accuracy on your task distribution drops from 0.85 to 0.30 over two days. Your bandit doesn’t know this. It has a high historical success estimate for that arm, accumulated over weeks of good performance. It keeps routing there. Your quality gate failure rate climbs. You get paged.
This is the stationarity assumption failure. Classical multi-armed bandit algorithms - from Thompson Sampling to UCB to epsilon-greedy - rest on the premise that each arm’s true reward probability is fixed. In production LLM routing, it almost never is. Models update. Prompts age. External APIs degrade. The question isn’t whether drift will happen - it’s whether your routing layer will notice before your users do.
ADWIN (ADaptive WINdowing), published by Bifet and Gavaldà at SIAM SDM 2007, targets exactly this problem. It monitors a streaming binary signal, maintains an adaptive observation window, and triggers a statistically grounded alarm when the mean shifts. Wire one ADWIN instance per capability arm into your bandit’s reward path and you get automatic policy refresh on drift - no retraining, no fixed window sizes, and no hand-tuned thresholds beyond a single confidence parameter.
Takeaways:
- Bandit routing assumes each arm’s success rate is fixed; in production it isn’t.
- A silent model update can degrade an arm for days before your policy reacts.
- Per-arm drift detection gives you automatic policy refresh without retraining.
The Bandit Routing Setup
You have a handful of capability arms. At each step you select one according to your routing policy, observe a binary reward - did the output pass the quality gate or not - and update that arm’s success estimate.
The simplest policy keeps an exponential moving average per arm: each new outcome nudges the running estimate by a fixed learning rate, and you route to whichever arm has the highest estimate. Under stationary conditions this converges to the true best arm, and it converges faster the wider the quality gap between the best arm and its nearest rival. Thompson Sampling does this more efficiently via posterior sampling, but the core structure is the same: each arm has a learned estimate, routing follows the estimate.
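As a minimal sketch of that moving-average policy (the class name, arm names, and the 0.05 learning rate are illustrative assumptions, not taken from any particular library):

```python
class EmaRouter:
    """Greedy router that keeps one exponential moving average per arm."""

    def __init__(self, arms, learning_rate=0.05, prior=0.5):
        self.learning_rate = learning_rate
        self.estimates = {arm: prior for arm in arms}

    def select(self):
        # Route to the arm with the highest current estimate.
        return max(self.estimates, key=self.estimates.get)

    def update(self, arm, reward):
        # Nudge the estimate toward the observed 0/1 reward.
        old = self.estimates[arm]
        self.estimates[arm] = old + self.learning_rate * (reward - old)


router = EmaRouter(["fast", "slow", "retrieval", "code_agent"])
router.update("fast", 1)      # one quality-gate pass on the fast arm
chosen = router.select()      # the fast arm now has the highest estimate
```

A purely greedy `select` never explores; a real router would mix in some exploration, which is omitted here to keep the update rule in focus.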
The stationarity assumption is baked in. When an arm’s true success rate shifts, the estimate lags, and how badly depends on the estimator. A fixed learning rate of 0.05 gives an effective memory of roughly the last twenty observations, so the gap to the new rate halves about every fourteen observations. A sample average or a Beta posterior, which is what Thompson Sampling maintains, weights all history equally: if a capability has served two thousand historical calls at 85 percent accuracy, the estimate carries enormous inertia, and after a drop to 30 percent it takes roughly another two thousand observations just to halve the gap. During that lag your routing policy preferentially selects the degraded arm, and the cost grows with the delay.
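To put numbers on the lag, the sketch below compares two estimators in expectation: a fixed-rate moving average and a running sample mean, which is the kind of all-history estimate a Beta posterior carries. The function names are illustrative, and the 0.05 learning rate is an assumed value.

```python
def observations_to_halve_gap_ema(learning_rate):
    """Updates a fixed-rate EMA needs to close half the gap to a new rate."""
    gap, steps = 1.0, 0
    while gap > 0.5:
        gap *= 1.0 - learning_rate   # each update shrinks the remaining gap
        steps += 1
    return steps


def observations_to_halve_gap_mean(history_length):
    """Same question for a running mean that already holds `history_length` points."""
    steps = 0
    # After `steps` new points the old data still carries weight n / (n + steps).
    while history_length / (history_length + steps) > 0.5:
        steps += 1
    return steps


ema_steps = observations_to_halve_gap_ema(0.05)      # 14
mean_steps = observations_to_halve_gap_mean(2000)    # 2000
```

The moving average forgets at a fixed pace regardless of history, while the running mean needs as many fresh observations as it has old ones before the gap is even halved.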
The fix is not a faster learning rate - that introduces variance everywhere, making every estimate jittery even when nothing has changed. The fix is a drift detector that watches each arm’s reward stream and fires a targeted reset only when a distributional shift is detected with high confidence.
How ADWIN Detects Drift, in Plain English
ADWIN maintains a sliding window of recent binary observations for a single data stream. The window grows by appending each new observation to the right. The key operation runs after every append: ADWIN looks for a split point inside the window where the older portion and the more recent portion have means that are statistically different - too far apart to be the same underlying distribution.
If such a split exists, ADWIN declares drift. It then shrinks the window to the recent portion - discarding everything before the split - and signals the caller. The caller resets the arm’s learned estimate using only the observations that survived in the new, shorter window.
This adaptive shrinkage is the algorithm’s central insight. A fixed sliding window can only detect drift once the shift has been present for roughly half the window - you’re always looking backwards at a fixed horizon. ADWIN’s window instead grows without bound during stable periods, accumulating statistical power to resist false positives, and collapses aggressively the moment drift occurs, giving fast detection and fast policy recovery. The window size is a consequence of the data, not a hyperparameter you have to guess.
The behavior is intuitive in the two regimes:
Stable regime. No split passes the statistical test. The window grows. With a larger window the test’s tolerance band tightens, making it harder to trip on noise, so the false-positive rate stays controlled.
Drift regime. A genuine shift occurs. At first the recent observations just look like ordinary variance. As more post-shift observations accumulate, the gap between the older mean and the recent mean widens past the tolerance band. ADWIN then fires the alarm and resets to the post-shift window. The detection delay depends on the size of the shift and on the chosen sensitivity: a bigger shift is detected faster, and a more conservative sensitivity setting detects more slowly but more reliably.
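The split-and-shrink loop can be sketched in a few lines of plain Python. This is a deliberately unoptimized illustration, not river's implementation: it stores every observation rather than compressing the window into buckets, it uses the Hoeffding-style threshold from the 2007 paper, and the `min_side` guard is an assumption added here to skip splits with a tiny side.

```python
import math


class SimpleAdwin:
    """Unoptimized ADWIN-style detector that stores every observation."""

    def __init__(self, delta=0.002, min_side=5):
        self.delta = delta
        self.min_side = min_side      # assumption: ignore very lopsided splits
        self.window = []

    @property
    def width(self):
        return len(self.window)

    @property
    def estimation(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def update(self, x):
        """Append x; return True if the window was cut (drift declared)."""
        self.window.append(float(x))
        drift = False
        while self._cut_once():       # keep cutting until every split passes
            drift = True
        return drift

    def _cut_once(self):
        n = len(self.window)
        total = sum(self.window)
        delta_prime = self.delta / n  # correction for testing many splits
        head = 0.0
        for split in range(1, n):
            head += self.window[split - 1]
            n0, n1 = split, n - split
            if n0 < self.min_side or n1 < self.min_side:
                continue
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps:
                self.window = self.window[split:]   # keep the recent portion
                return True
        return False


detector = SimpleAdwin()
stream = [1] * 200 + [0] * 20     # abrupt shift at index 200
fired_at = next((i for i, x in enumerate(stream) if detector.update(x)), None)
```

On this toy stream the detector stays silent through the stable prefix and fires within a handful of observations of the shift, after which `detector.estimation` reflects only post-shift data.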
For a bandit with several arms, you run one independent ADWIN instance per reward stream. They share no state. Each arm’s drift detection is independent of the others, which is correct: arms don’t generally degrade at the same moment.
How the Detection Test Works
The whole test rests on one question, asked after every new observation: is there any way to split the window into an older portion and a more recent portion that look like they came from different distributions? ADWIN answers it by comparing the gap between the two portions’ success rates against a tolerance band, and declaring drift only when the gap is too large to be chance.
Three things set the width of that tolerance band, and each earns its place. First, the balance of the split: the band is tightest when the two sides are evenly sized, and it widens sharply toward the edges, so a split that puts only one or two observations on one side is essentially impossible to trip - exactly what you want, since a single fresh observation should never on its own declare drift. Second, a multiple-comparisons penalty: because ADWIN tests every possible split point in the window, it inflates the band to account for all those chances to get unlucky - but the penalty enters through a logarithm of the window length, so it grows only gently rather than in proportion to it. Third, the observed variance of the stream: low-variance, high-quality arms - ones that almost always succeed - get a tighter band than a worst-case argument would allow, so genuine degradation on a near-perfect arm is caught sooner.
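For reference, here is how those three ingredients appear in the cut threshold, transcribed from the 2007 paper as best I can reconstruct it (check the constants against the paper before relying on them). Here n is the window length, n_0 and n_1 are the sizes of the two sides of a split, delta is the sensitivity setting, and sigma_W^2 is the observed variance within the window.

```latex
m = \frac{1}{1/n_0 + 1/n_1}, \qquad \delta' = \frac{\delta}{n}

\epsilon_{\text{cut}} = \sqrt{\frac{1}{2m}\,\ln\frac{4}{\delta'}}
\quad \text{(Hoeffding form)}

\epsilon_{\text{cut}} = \sqrt{\frac{2}{m}\,\sigma_W^2\,\ln\frac{2}{\delta'}}
  + \frac{2}{3m}\,\ln\frac{2}{\delta'}
\quad \text{(variance-aware form)}
```

The balance of the split enters through m, which is largest when n_0 = n_1; the multiple-comparisons penalty enters through delta' inside the logarithm; and the stream's variance enters only in the second form.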
A single sensitivity knob, called delta, governs how eager the test is to fire - effectively, the false-positive rate you’ll tolerate on a perfectly stationary stream. The implementation in the river library uses a variance-aware form of the bound, which the 2007 paper proposes alongside its simpler Hoeffding-based version; the variance-aware form is tighter for large windows and high-quality arms, and the two agree closely at very small windows.
For routing applications a delta of 0.002 is a practical default. That setting caps spurious firings at one per five hundred evaluations on a stationary arm, and with a window of around a thousand observations the realized rate sits well under that cap. Across four arms sharing a hundred routing decisions an hour, the cap alone allows at most one false drift event every five hours; the realized rate is lower, roughly once every thirty hours - low enough not to pollute your policy, while the setting stays eager enough that you don’t wait weeks to catch real degradation.
The original authors prove two guarantees. On a truly stationary stream, the chance of a false drift alarm stays bounded by the sensitivity setting. And once a genuine shift of a given size occurs, ADWIN detects it within a number of observations that scales inversely with the square of the shift magnitude - so a large drop is caught quickly, while halving the shift roughly quadruples the detection delay. Both are worst-case bounds and tend to be conservative in practice.
The Design
You don’t need to implement ADWIN yourself - the river online-machine-learning library ships a maintained implementation. The work is wiring it into the routing layer. Build a monitor that holds one independent ADWIN detector per capability arm, a running success-rate estimate per arm seeded at an uninformed 0.5 prior, and a per-arm observation counter for diagnostics and volume gating. The constructor validates that the capability list is non-empty and that the false-positive rate sits strictly between zero and one.
The core entry point records one binary outcome for a named arm after each routed request. It snapshots the arm’s current window size and mean, feeds the success-or-failure bit into that arm’s detector, bumps the counter, and refreshes the running estimate to the detector’s current window mean. When the detector signals drift, the monitor packages an immutable event describing the arm, the window size before and after the collapse, and the mean before and after, then resets the arm’s estimate to the fresh post-drift window mean and returns the event. On a non-drift outcome it returns nothing. Representing the drift event as a frozen, immutable record means it is safe to pass downstream to logging or dashboards without defensive copying.
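A minimal sketch of that monitor follows. To keep it runnable without dependencies, the detector is injected through a factory and demonstrated with `ToyDetector`, a hypothetical stand-in that is not a real ADWIN. The attribute names it exposes (`update`, `drift_detected`, `width`, `estimation`) are chosen to mirror what `river.drift.ADWIN` exposes as far as I know; verify them against your pinned river version. All class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class DriftEvent:
    arm: str
    width_before: int
    width_after: int
    mean_before: float
    mean_after: float


class ToyDetector:
    """Hypothetical stand-in with an ADWIN-like interface; NOT a real ADWIN."""

    def __init__(self, delta: float, recent: int = 10, gap: float = 0.5):
        self.delta, self.recent, self.gap = delta, recent, gap
        self.window: list = []
        self.drift_detected = False

    @property
    def width(self) -> int:
        return len(self.window)

    @property
    def estimation(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def update(self, x: float) -> None:
        self.window.append(x)
        self.drift_detected = False
        old, new = self.window[:-self.recent], self.window[-self.recent:]
        if len(old) >= self.recent:
            if abs(sum(old) / len(old) - sum(new) / len(new)) > self.gap:
                self.window = new          # crude collapse to recent data
                self.drift_detected = True


class DriftMonitor:
    def __init__(self, arms: Sequence[str],
                 make_detector: Callable[[float], object],
                 delta: float = 0.002):
        if not arms:
            raise ValueError("at least one capability arm is required")
        if not 0.0 < delta < 1.0:
            raise ValueError("delta must lie strictly between 0 and 1")
        self.delta = delta
        self.detectors = {arm: make_detector(delta) for arm in arms}
        self.estimates = {arm: 0.5 for arm in arms}   # uninformed prior
        self.counts = {arm: 0 for arm in arms}

    def record(self, arm: str, success: bool) -> Optional[DriftEvent]:
        detector = self.detectors[arm]
        width_before, mean_before = detector.width, detector.estimation
        detector.update(1.0 if success else 0.0)
        self.counts[arm] += 1
        # The estimate always tracks the detector's current window mean.
        self.estimates[arm] = detector.estimation
        if detector.drift_detected:
            return DriftEvent(arm, width_before, detector.width,
                              mean_before, detector.estimation)
        return None


monitor = DriftMonitor(["fast", "slow"], make_detector=ToyDetector)
events = [monitor.record("fast", ok) for ok in [True] * 30 + [False] * 10]
drifts = [event for event in events if event is not None]
```

With river installed, the factory would presumably be something like `lambda d: ADWIN(delta=d)`; that wiring is an assumption here and is not exercised by the sketch.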
Two further methods serve the routing loop. One converts the per-arm estimates into normalized routing weights that sum to one - a point on the probability simplex that plugs directly into either weighted random sampling or deterministic argmax routing. Crucially, no arm is ever zeroed out: a tiny floor on each weight preserves a sliver of exploration budget even for a consistently failing arm, so the system keeps probing it and can notice if it recovers. A second method exposes per-arm diagnostics - estimate, observation count, and the configured delta - purely for logging and observability.
The decisive design choice is tying the monitor’s estimate directly to ADWIN’s internal window instead of keeping a separately accumulated average. When drift fires, ADWIN discards its pre-drift history and keeps only the post-drift window; the monitor reads that collapsed window’s mean and adopts it as the new estimate, which is exactly the fast policy refresh the whole exercise is built to deliver.
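One way to build floored weights is to give every arm the floor first and split the remaining mass in proportion to the estimates. The function name and the 0.01 floor are assumptions for illustration; the text above only calls for "a tiny floor".

```python
def routing_weights(estimates, floor=0.01):
    """Normalize per-arm estimates into weights that sum to one, each >= floor."""
    n = len(estimates)
    if floor * n >= 1.0:
        raise ValueError("floor is too large for this many arms")
    total = sum(estimates.values())
    if total <= 0.0:
        # No arm has any estimated success: fall back to uniform routing.
        return {arm: 1.0 / n for arm in estimates}
    spare = 1.0 - floor * n   # mass left once every arm has its floor
    return {arm: floor + spare * value / total
            for arm, value in estimates.items()}


weights = routing_weights({"fast": 0.85, "slow": 0.0})
```

Even with an estimate of exactly zero, the `slow` arm keeps a weight equal to the floor, so it continues to receive occasional probes.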
Drift Detection in Action: A Synthetic Demo
To see the mechanism work, simulate a four-arm routing bandit over 500 steps with a fixed random seed for reproducibility. Three arms hold steady true success rates - roughly 0.70, 0.65, and 0.60. The fourth, capability A, starts strong at 0.85 but degrades sharply to 0.25 at step 300, modeling a silent model update. Each step, the simulation pulls the current routing weights, picks an arm by weighted random sampling (a greedy policy would just take the top-weighted arm), draws a success-or-failure outcome from that arm’s current true rate, and feeds the result back to the monitor. Every drift event the monitor emits gets logged, and a final summary reports the routing weights at the end of the run.
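The loop itself can be sketched as follows. So that it runs on its own, the policy here is a plain per-arm moving average standing in for the drift monitor; the arm labels B, C, and D, the seed value 42, and the 0.1 learning rate are all assumptions. Because of those substitutions this sketch will not reproduce the figures reported for the companion run.

```python
import random

TRUE_RATES = {"A": 0.85, "B": 0.70, "C": 0.65, "D": 0.60}
SHIFT_STEP, SHIFTED_RATE = 300, 0.25    # capability A degrades at step 300


def true_rate(arm, step):
    if arm == "A" and step >= SHIFT_STEP:
        return SHIFTED_RATE
    return TRUE_RATES[arm]


def simulate(get_weights, record, steps=500, seed=42):
    """Run the routing loop and return how often each arm was pulled."""
    rng = random.Random(seed)
    pulls = {arm: 0 for arm in TRUE_RATES}
    for step in range(steps):
        weights = get_weights()
        arms = list(weights)
        # Weighted random sampling; a greedy policy would take the max instead.
        arm = rng.choices(arms, weights=[weights[a] for a in arms])[0]
        success = rng.random() < true_rate(arm, step)
        record(arm, success)
        pulls[arm] += 1
    return pulls


# Stand-in policy: swap these two functions for the drift monitor's methods.
estimates = {arm: 0.5 for arm in TRUE_RATES}


def get_weights():
    total = sum(estimates.values())
    return {arm: value / total for arm, value in estimates.items()}


def record(arm, success):
    estimates[arm] += 0.1 * (float(success) - estimates[arm])


pulls = simulate(get_weights, record)
final_weights = get_weights()
```

Passing the policy in as two callables keeps the simulation independent of how estimates are maintained, so the same loop can drive a monitor backed by a real drift detector.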
With that seed, the run produces two drift events, both on capability A. The first fires around step 318 - 18 steps after the true shift at step 300. The arm’s estimated success rate drops from about 0.85 to about 0.28 as ADWIN’s window collapses from over 300 observations down to roughly a dozen, discarding the stale high-accuracy history. A second, smaller event near step 412 is a minor recalibration as the window settles on the new regime. By the end, capability A’s routing weight has fallen from its pre-drift level near 0.40 to around 0.11 - reflecting its true post-drift rate - while the three stable arms hold weights near 0.25 to 0.29 each.
That 18-step lag is consistent with ADWIN’s guarantee that detection time scales inversely with the square of the shift magnitude: the shift here is 0.60 (from 0.85 down to 0.25), and 1/0.60² is about 2.8, which puts the theoretical floor at only a few observations - the logarithmic confidence term, constant factors, and window initialization account for the rest.
Tuning Drift Sensitivity
The sensitivity setting controls the precision/recall tradeoff for drift detection. Concretely:
Low sensitivity (0.0001). The tolerance band is wider, requiring a bigger gap before firing. Fewer false alarms, but detection is slower - ADWIN needs more post-drift observations to clear the band. Use this when your arm volumes are high (above 500 calls per hour per arm) and you have strong cost aversion to spurious policy resets.
High sensitivity (0.05). Fires faster on genuine drift, but also fires on natural variance. At this setting the false-alarm bound permits a spurious drift event as often as once every twenty stable evaluations per arm - too noisy for most production systems.
Default sensitivity (0.002) occupies a practical sweet spot for LLM routing. The cost asymmetry is important: a false positive causes an unnecessary policy reset for one arm, which self-corrects within a few dozen observations as the arm re-estimates from fresh data. A false negative means routing to a degraded capability for an extended period. The false-positive cost is low; the false-negative cost is high. That asymmetry pushes toward a slightly more eager setting than you’d choose where false alarms are expensive, such as medical monitoring.
One calibration approach: run a 500-observation baseline on each arm during a known-stable period and count ADWIN firings. If you see more than one firing per arm, lower the sensitivity by a factor of five (for example, from 0.002 to 0.0004) and repeat.
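That calibration loop might look like the sketch below. Both helpers are hypothetical, and they assume a detector object with an `update` method and a `drift_detected` flag, which is the interface I believe `river.drift.ADWIN` provides; confirm against your installed version. The `max_rounds` cap is an added safeguard, not part of the procedure described above.

```python
def count_firings(detector, outcomes):
    """Feed 0/1 outcomes to a detector and count how often it signals drift."""
    firings = 0
    for outcome in outcomes:
        detector.update(outcome)
        if detector.drift_detected:
            firings += 1
    return firings


def calibrate_delta(make_detector, baseline, delta=0.002,
                    max_firings=1, factor=5.0, max_rounds=5):
    """Divide delta by `factor` until a known-stable baseline stays quiet."""
    for _ in range(max_rounds):
        # A fresh detector per round, so earlier rounds leave no state behind.
        if count_firings(make_detector(delta), baseline) <= max_firings:
            return delta
        delta /= factor
    return delta
```

With river, `make_detector` would presumably be `lambda d: ADWIN(delta=d)` and `baseline` the recorded quality-gate outcomes for one arm over the stable period.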
When NOT to Use ADWIN for Drift Detection
Your arm sees fewer than 30 observations per hour. The tolerance band is enormous at small sample sizes; only a complete collapse triggers detection, and post-drift estimates are too noisy to trust. Aggregate at a coarser granularity or use a Bayesian change-point detector with an informative prior.
Your drift is gradual. ADWIN is at its best on abrupt shifts - a model update, a tool endpoint degradation, a sudden prompt distribution change. For a capability whose success rate decays one percent per week due to world-knowledge aging, the gap between the older and more recent portions of the window stays narrow for a long time, and detection lags by months. A scheduled two-sample test on rolling 7-day buckets is the right tool here.
Your non-stationarity is structural. If reward correlates with time-of-day, task type, or session context by design, ADWIN will fire constantly and churn your estimates. That’s not a routing system with drift detection - it’s a contextual bandit in denial. Model the context explicitly.
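For the gradual-drift case, the scheduled two-sample test could be a pooled two-proportion z-test comparing a reference week against the latest week. This is one reasonable choice among several, sketched with the standard library only; the function name and the example counts are made up for illustration.

```python
import math


def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Two-sided z-test for equal success rates; returns (z, p_value)."""
    p_a, p_b = successes_a / total_a, successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    if se == 0.0:
        return 0.0, 1.0     # both buckets are all-success or all-failure
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal tail
    return z, p_value


# Reference week: 850 passes out of 1000. Latest week: 800 out of 1000.
z, p_value = two_proportion_z_test(850, 1000, 800, 1000)
```

If the test runs every week, the significance threshold should account for the repeated looks, for instance by using a stricter cutoff than 0.05.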
Implementation Shape
A complete companion implementation has a handful of parts: the per-arm drift monitor itself, the synthetic four-arm simulation, an optional visualization of the detected drift, and a test suite holding coverage above 80%. The only runtime dependency is the river library (which provides the ADWIN detector); the simulation and tests need nothing else beyond a test runner. Pin a recent river version, document the reproduction steps, and the whole thing runs from a single command.
References
Bifet, A., & Gavaldà, R. (2007). Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining (SDM 2007), pp. 443-448. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972771.42
river Python library. Online machine learning in Python. BSD-3-Clause license. Source: github.com/online-ml/river. ADWIN implementation: river.drift.ADWIN.