Scheduling large language models across a mixed fleet of consumer GPUs

May 19, 2026 · 6 min read

One VRAM Number Can't Schedule LLMs Across Mixed Consumer GPUs

What we learned running a 27B coding model across a heterogeneous fleet of consumer GPUs — why the obvious sizing formula silently OOMs, why pipeline parallelism behaves nothing like the benchmarks suggest, and the gotchas nobody writes down.

vLLM · GPU orchestration · pipeline parallelism