Two queues for local-LLM fleets

Two ollama pulls, plus an LM Studio Llama 70B load, plus two subagents hitting a cloud LLM provider's API, plus seven daemons running scheduled scans. All at once. 2026-05-13, 10:58 UTC. Kernel panic.

I’d triggered all of them myself, carelessly, inside ten minutes. The ollama pulls were fetching the latest quantized weights for two different oracle models. LM Studio was loading a 70B parameter model into resident memory for a council review. Two subagents were dispatched via my orchestration layer, each making concurrent calls to a frontier model API. Seven launchd daemons fired on schedule because it was the top of the hour. The machine had 96GB of unified memory. It wasn’t enough.

The postmortem produced a rule I now follow religiously: local-heavy tasks run serially, one at a time; remote-API fleet tasks run with bounded concurrency; never cross-mix the two. This is the two-queue discipline.

A quick vocabulary stop before we go further. By fleet I mean a set of cloud-LLM agent calls running in parallel against remote APIs. Each agent is a thin process; the work happens server-side. By oracle I mean a heavy local model I load into resident memory for high-stakes reasoning that has to happen on the machine itself, usually for IP or latency reasons. Both are part of the same agent-orchestration pattern, but they have completely different resource profiles. That difference is exactly why mixing them blows up.

The two task classes

When you’re running local LLMs alongside cloud APIs, the work splits into two classes based on where the saturation happens.

Local-heavy tasks saturate your machine’s resources directly. Model downloads via ollama pull. Loading a 30-70GB quantized model into LM Studio or ollama with keep_alive set to hold it resident. Running a full pytest sweep across your entire repo. Oracle council dispatch where you’re loading massive models into RAM for inference. These compete for unified memory bandwidth, disk I/O, CPU cores for dequantization.

Remote-API fleet tasks saturate neither your memory nor your CPU. They saturate network connections and your cloud provider’s rate limits. Subagent dispatch via an orchestration layer where each agent calls a frontier model via API. Parallel web scrapes. Batch processing jobs that fan out to external services. These tasks are I/O-bound on network latency and remote throughput, not local compute.

The distinction matters: saturation failure modes are different. Local-heavy tasks cause memory pressure, thermal throttling, and in the worst case, kernel panics when the memory allocator can’t satisfy a request. Remote-API tasks cause connection pool exhaustion, rate limit errors, and cascading timeouts when you overwhelm either your network stack or the remote service.

Why mixing them saturates faster than you’d expect

Here’s the non-obvious part. On Apple Silicon, the memory is unified: the same physical RAM serves CPU, GPU, and the Neural Engine. When ollama loads a 67GB quantized model with keep_alive > 0, it’s not just occupying 67GB. It’s pinning that allocation in a way that fragments the address space for everything else. The OS can’t trivially reclaim it because the model needs to stay resident for fast inference.

Now add two ollama pulls in parallel. Each one is streaming gigabytes of weights from the network and writing them to disk while simultaneously validating checksums and decompressing blobs. That’s sustained disk I/O and memory allocation churn. The system is juggling resident model memory, in-flight download buffers, filesystem cache pressure, and whatever else is running.

Then add an LM Studio load: another 30-70GB allocation request. LM Studio is trying to mmap the model file into a contiguous region. If the address space is already fragmented by the ollama allocations, the kernel has to work harder to find a suitable range. On a system with 96GB total and a 67GB model already loaded, the remaining headroom is at most ~29GB. That’s before fragmentation overhead, filesystem cache, and kernel allocations eat into it. A 30-70GB request collides with that headroom immediately. At the low end, even a 30GB model is already larger than the ~29GB nominally left, so there is no margin for spikes at all; at the high end, the request is more than twice the available headroom. Either way the kernel starts swapping, which on a machine built for low-latency inference is effectively a soft hang.

Now add two subagents making concurrent API calls. They’re not heavy on memory, but they are allocating connection state, buffers for HTTP responses, and JSON parsing overhead. Not huge individually, but enough to tip the balance when the system is already under memory pressure from the local-heavy work.

Add seven daemons firing on schedule. Maybe they’re just doing lightweight scans, but each one spawns a process, allocates a stack, opens file descriptors, and touches the filesystem. Again, not huge individually. But in aggregate, on a system already saturated, it’s the cumulative load with no single smoking gun. Each spawn adds overhead the kernel can’t reclaim fast enough.

The kernel panic isn’t random. It’s deterministic. You’ve exceeded the practical working set the unified memory architecture can sustain under concurrent pressure. The math isn’t “96GB total, models fit if they sum to less than 96GB.” The math is “96GB total minus fragmentation overhead minus in-flight buffers minus filesystem cache minus kernel allocations minus margin for allocation spikes.” You hit the limit well before the total RAM. Once you account for everything else the system is doing, the practical working set is significantly smaller.
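The subtraction above can be written down as a few lines of Python. This is a minimal sketch, not a measurement: only the 96GB total and the 67GB resident model come from the incident, and every overhead figure (along with the `MemoryBudget` name and its fields) is an illustrative assumption you would replace with numbers from your own machine.

```python
from dataclasses import dataclass


@dataclass
class MemoryBudget:
    """All figures in GB. Only total_gb and resident_models_gb come from the
    incident; every other field is an illustrative assumption."""
    total_gb: float
    resident_models_gb: float
    fragmentation_gb: float      # assumption
    inflight_buffers_gb: float   # assumption: download and HTTP buffers
    fs_cache_gb: float           # assumption: filesystem cache
    kernel_gb: float             # assumption: kernel allocations
    spike_margin_gb: float       # assumption: reserve for allocation spikes

    def practical_headroom_gb(self) -> float:
        # Total minus what is resident minus every overhead term.
        overhead = (self.fragmentation_gb + self.inflight_buffers_gb
                    + self.fs_cache_gb + self.kernel_gb + self.spike_margin_gb)
        return self.total_gb - self.resident_models_gb - overhead

    def fits(self, candidate_gb: float) -> bool:
        # A candidate load fits only if it is within the practical headroom.
        return candidate_gb <= self.practical_headroom_gb()


budget = MemoryBudget(total_gb=96, resident_models_gb=67,
                      fragmentation_gb=2, inflight_buffers_gb=4,
                      fs_cache_gb=6, kernel_gb=5, spike_margin_gb=6)
print(budget.practical_headroom_gb())  # 96 - 67 - 23
print(budget.fits(30))                 # a 30GB load against that headroom
```

With these assumed overheads the nominal 29GB of headroom shrinks to single digits, which is the point of the paragraph: the naive sum says "almost fits" while the practical number says "not close".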

The two-queue rule

Don’t mix the classes. Run them in separate queues with separate concurrency limits.

Local-heavy tasks: serial only. One at a time. No exceptions. If you’re pulling a model via ollama, nothing else heavy runs until it’s done. If you’re loading an oracle model into LM Studio, no other model loads or downloads run concurrently. If you’re running a full test sweep, no oracle dispatch, no model pulls. Serial.

Remote-API fleet tasks: bounded concurrency, ≤5 concurrent by default. Five subagents hitting cloud APIs in parallel is fine. They’re network-bound, not memory-bound. You can saturate your rate limit before you saturate your machine. But don’t go unbounded. Connection pool exhaustion is real, and most cloud providers have per-account concurrency limits anyway.

Never cross-mix. If a local-heavy task is in flight, the remote-API queue is paused. If the remote-API fleet is running, no local-heavy tasks start. This is the hard rule. It eliminates the saturation-interaction failure mode entirely.
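One way to hold all three parts of the rule in code is a small admission gate built on `asyncio.Condition`. This is a sketch under assumptions, not the orchestration layer I actually run: the `TwoQueueGate` class, its method names, and the `demo` harness are hypothetical, and the "tasks" are just short sleeps that record how many of each class were alive at once.

```python
import asyncio


class TwoQueueGate:
    """Hypothetical gate: local-heavy tasks run one at a time, remote tasks
    run up to remote_limit at once, and the two classes never overlap."""

    def __init__(self, remote_limit: int = 5) -> None:
        self._cond = asyncio.Condition()
        self._remote_limit = remote_limit
        self._local_running = 0
        self._remote_running = 0

    async def run_local(self, coro_fn):
        async with self._cond:
            # Wait until nothing at all is in flight.
            await self._cond.wait_for(
                lambda: self._local_running == 0 and self._remote_running == 0)
            self._local_running = 1
        try:
            return await coro_fn()
        finally:
            async with self._cond:
                self._local_running = 0
                self._cond.notify_all()

    async def run_remote(self, coro_fn):
        async with self._cond:
            # Wait until no local-heavy task runs and the cap has room.
            await self._cond.wait_for(
                lambda: self._local_running == 0
                and self._remote_running < self._remote_limit)
            self._remote_running += 1
        try:
            return await coro_fn()
        finally:
            async with self._cond:
                self._remote_running -= 1
                self._cond.notify_all()


async def demo() -> dict:
    gate = TwoQueueGate(remote_limit=5)
    live = {"local": 0, "remote": 0}
    peak = {"local": 0, "remote": 0, "mixed": False}

    async def work(kind: str) -> None:
        live[kind] += 1
        peak[kind] = max(peak[kind], live[kind])
        if live["local"] and live["remote"]:
            peak["mixed"] = True
        await asyncio.sleep(0.01)  # stand-in for the real task
        live[kind] -= 1

    jobs = [gate.run_remote(lambda: work("remote")) for _ in range(8)]
    jobs += [gate.run_local(lambda: work("local")) for _ in range(3)]
    await asyncio.gather(*jobs)
    return peak


result = asyncio.run(demo())
print(result)
```

The gate is deliberately naive about fairness: a steady stream of remote work can keep a queued local-heavy task waiting indefinitely. For a solo setup that trade-off is usually acceptable, but a real scheduler would want a priority or a drain step.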

The pre-flight gate

Before starting any heavy task, I check three things.

Load average. uptime shows the 1-minute load average. If it’s above 4.0 on my 12-core machine, something is already saturated. Wait or kill. Don’t pile on.

Free disk space. df -h ~ shows available space on the home volume. I treat 30GB free as the floor before any pull, and for a large model I want roughly 2× the model size in transient headroom, because Ollama writes to a temp location before moving the final file.

In-flight heavy task check. I explicitly verify no other local-heavy task is running: pgrep -fl 'ollama.*pull|lmstudio|pytest|oracle_dispatch' as a sanity check. (The more familiar ps aux | grep -E form also matches the grep process itself, so it always reports at least one hit.) It’s manual, it’s primitive, but it works. Automation can come later. Discipline comes first.

If any gate fails, I don’t proceed. I reschedule or I kill the conflicting task. The pre-flight check takes ten seconds. Recovery from a kernel panic takes ten minutes and loses whatever state was in flight.
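The three checks are simple enough to script. What follows is a rough sketch assuming a Unix-like machine with `pgrep` on the PATH; the thresholds and the process pattern are the ones from the text, while the function names (`gate_failures`, `heavy_task_pids`, `preflight`) are hypothetical. The decision logic is kept separate from the system probes so it can be exercised with made-up numbers.

```python
import os
import shutil
import subprocess

MAX_LOAD_1M = 4.0    # threshold from the text, for a 12-core machine
MIN_FREE_GB = 30.0   # disk floor from the text
HEAVY_PATTERN = "ollama.*pull|lmstudio|pytest|oracle_dispatch"


def gate_failures(load_1m: float, free_gb: float, heavy_pids: list) -> list:
    """Pure decision logic: reasons the gate fails (empty list means pass)."""
    reasons = []
    if load_1m > MAX_LOAD_1M:
        reasons.append(f"1-minute load {load_1m:.2f} is above {MAX_LOAD_1M}")
    if free_gb < MIN_FREE_GB:
        reasons.append(f"only {free_gb:.0f}GB free, need {MIN_FREE_GB:.0f}GB")
    if heavy_pids:
        reasons.append(f"heavy task already running (pids {heavy_pids})")
    return reasons


def heavy_task_pids(pattern: str = HEAVY_PATTERN) -> list:
    """PIDs whose full command line matches the pattern, via pgrep -f."""
    out = subprocess.run(["pgrep", "-f", pattern],
                         capture_output=True, text=True)
    own = os.getpid()  # never count this script itself
    return [int(p) for p in out.stdout.split() if int(p) != own]


def preflight() -> list:
    """Probe the live system and apply the gate."""
    load_1m = os.getloadavg()[0]
    free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9
    return gate_failures(load_1m, free_gb, heavy_task_pids())


# Decision logic on made-up readings, without touching the system:
print(gate_failures(load_1m=1.8, free_gb=120.0, heavy_pids=[]))
print(gate_failures(load_1m=6.2, free_gb=12.0, heavy_pids=[4242]))
```

In practice you would call `preflight()` before dispatching and refuse to start if the returned list is non-empty. The pattern is intentionally broad, so expect false positives (any pytest run counts as heavy) rather than misses.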

The forbidden combinations

Some combinations are always wrong:

  • Local-heavy + remote-API fleet active: memory pressure + connection churn → saturation
  • Fleet active + oracle dispatch starting: oracle loads a 30-70GB model while fleet holds HTTP state → OOM
  • Oracle dispatch + ollama pull: two large allocations competing for the same unified memory → kernel panic
  • Two oracle models loaded simultaneously: 67GB + 30GB > practical working set → swap death spiral
  • More than 5 concurrent fleet dispatches without override: connection pool exhaustion, cascading timeouts

Each of these surfaced as a real failure at some point. The rule isn’t theoretical. It’s scar tissue.
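The list above collapses into a small admission check that says why a candidate task is refused. This is an illustrative sketch only: the `Task` type, the four `kind` labels, and the `violations` function are assumptions made for the example, not names from any real tool.

```python
from dataclasses import dataclass

FLEET_CAP = 5                               # default cap from the text
LOCAL_HEAVY = {"pull", "oracle", "tests"}   # hypothetical kind labels


@dataclass(frozen=True)
class Task:
    name: str
    kind: str  # "pull", "oracle", "tests", or "fleet"


def violations(running: list, candidate: Task) -> list:
    """Return the forbidden combinations that starting candidate would create."""
    kinds = [t.kind for t in running]
    local_running = [k for k in kinds if k in LOCAL_HEAVY]
    fleet_running = kinds.count("fleet")
    problems = []
    if candidate.kind in LOCAL_HEAVY:
        if fleet_running:
            problems.append("local-heavy task while the fleet is active")
        if local_running:
            problems.append(
                f"{candidate.kind} while {local_running[0]} is in flight")
    elif candidate.kind == "fleet":
        if local_running:
            problems.append("fleet dispatch while a local-heavy task is in flight")
        if fleet_running >= FLEET_CAP:
            problems.append(f"more than {FLEET_CAP} concurrent fleet dispatches")
    else:
        raise ValueError(f"unknown task kind: {candidate.kind}")
    return problems


print(violations([Task("llama-70b", "oracle")], Task("weights", "pull")))
print(violations([Task(f"agent-{i}", "fleet") for i in range(5)],
                 Task("agent-5", "fleet")))
print(violations([], Task("weights", "pull")))
```

Returning reasons rather than a bare boolean is a deliberate choice: when the check refuses a task, the message names which scar it is protecting.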

The discipline

A postmortem for a solo founder means writing rules your future self will hate following, until the day they save you.

What comes next

The two-queue rule is the floor, not the ceiling. Once the discipline is in place, four trajectories open up.

Auto-scheduling. The pre-flight gate is currently ten seconds of manual checks. With the gate logic codified, you can wrap it in a scheduler that decides automatically: when an ollama pull finishes, fire the next queued heavy task. When the fleet rate-limit ceiling approaches, throttle. When load average rises, defer. Manual discipline becomes automatic policy. The hard part is not the scheduler. The hard part is encoding “what counts as heavy” precisely enough that the scheduler doesn’t have to ask.

Cross-machine fleet coordination. The same logic extends to a multi-node setup. One machine handles oracle work, another handles fleet, a third runs daemons. The queues become network-coordinated and the rule generalizes: never let a node accept a task that would push it past its per-class concurrency cap. The interesting design question is where the queue state lives. A Redis sorted set works for two nodes. Past five nodes you start wanting a real durable queue, and now you have a different operations problem.

Predictive saturation modeling. The kernel-panic math at the end of “Why mixing them saturates faster than you’d expect” is a working-set predictor in disguise. Given a tuple of (free memory, in-flight model sizes, filesystem cache pressure, kernel allocation headroom), you can compute whether a candidate task will fit before dispatching. The math is there. What’s missing is the wrapper that runs it on every dispatch and refuses the unsafe ones. That refusal is more valuable than the scheduler, because it stops you from making the mistake in the first place.

Observability that matches the discipline. A single surface showing what’s queued vs running per queue, real-time load average, projected memory headroom after the next dispatch. Not a separate observability project. Just the gate logic exposed as state, with a small UI on top. The reason this is worth building: when the rule fails it fails silently for several minutes before the kernel intervenes. A live view of “you’re about to violate the rule” beats a postmortem of “you violated the rule” every time.

Each of these builds on the two-queue rule without abandoning it. The rule stays the same: serial on local, bounded on remote, never cross-mix. What changes is how much of the discipline you have to hold in your head, and how much the system holds for you.


AI Disclosure

This artifact was prepared with assistance from generative AI tools, in accordance with COPE+STM “AI in scholarly publishing” guidance and the target venue’s AI-use policy:

  • Drafting: Author-Enthusiast agent via a frontier large language model
  • Voice + IP firewall: Author-Human agent via a frontier large language model
  • AI-tell removal + readability: Author-Humanizer agent via a frontier large language model
  • Final responsibility: The named human author has read and approved the final content. No generative AI is listed as a co-author. Substantive intellectual contribution remains with the human author.