Articles.
Essays and field notes. The why behind the work, at reading length.
Testing AI Agents: Five Dimensions Line Coverage Misses
Testing AI agents well is not about line coverage, and a green suite proves less than you think. You build an LLM-based routing agent. You write unit tests. You hit 95% line coverage. CI turns green. You deploy. Within a week, the agent produces nonsensical routing decisions under load, crashes on unusual task strings, and gets into routing loops the loop detector misses because two threads…
Tiered Context Loading: Fit a Huge Agent Registry in Your Context Window
Pattern source: KARIMO (Apache-2.0)
LLM Self-Preference Bias: How Anonymized Peer Review Fixes It
LLM self-preference bias is the reason naive multi-model evaluation panels don't work. Ask GPT-4o to judge outputs from GPT-4o, Claude, and Gemini. It will prefer its own output the majority of the time, regardless of quality. Panickssery et al. (2024, "LLM Evaluators Recognize and Favor Their Own Generations", NeurIPS) measured GPT-4 self-preference at a pairwise win-rate above 0.90 when…
Agent Routing Caches: A Competence Ratchet from SOAR Chunking
An agent routing cache solves a quiet but expensive problem: a routing agent that re-decides the same call over and over. Your routing agent has successfully dispatched "summarize this PDF" to the same sub-agent 47 times. On attempt 48, it calls the planner again.
Semantic Loop Detection: Catching Stuck AI Agents
Semantic loop detection exists to catch a failure that hash-based loop detection cannot see: an AI agent that keeps changing what it says while making zero real progress. Your agent is trying to fix a bug. It generates a patch and runs the tests; they fail. It reads the error and generates another patch: different variable names, different line numbers, same underlying logic. Tests fail again. Third…
Drift Detection for LLM Routing: Catching Silent Model Degradation
You have four capabilities - a fast cheap model, a slow expensive one, a retrieval tool, and a code-execution agent. You route incoming tasks to one of them and observe a binary reward: did the output satisfy the quality gate? You run this for a few thousand calls and your bandit policy converges. Routing weights stabilize. Life is good.
A Field Guide to Multi-Agent Orchestration in Late 2025: ruflo, KARIMO, llm-council
Every few months a new paper announces that multi-agent LLM orchestration has been figured out. ReAct, then Reflexion, then AutoGen, then LangGraph, then a hundred forks of each. Within their experimental setup, the problem is solved. The problem is that the setup is always a toy. Fixed task horizon. Homogeneous model pool. No concurrent agents writing to shared state. No context windows that…
Two queues for local-LLM fleets
Two ollama pulls, plus an LM Studio Llama 70B load, plus two subagents hitting a cloud LLM provider's API, plus seven daemons running scheduled scans. All at once. 2026-05-13, 10:58 UTC. Kernel panic.