Scheduling
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools eliminates interference entirely, enabling independent scaling and optimal latency — at the cost of KV cache migration across machines.
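The split described above can be sketched in a few lines. This is a hypothetical illustration, not any engine's real API: `PrefillWorker`, `DecodeWorker`, and `serve` are invented names, and the KV cache is reduced to its token count so that the one thing that must cross machines — the cache handoff — is visible.

```python
# Hypothetical sketch of disaggregated prefill; all class and function
# names here are illustrative, not a real serving framework's API.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Per-request key/value cache: produced by prefill, consumed by decode."""
    request_id: int
    num_tokens: int


class PrefillWorker:
    """Runs on the prefill GPU pool: one compute-bound pass over the prompt."""

    def prefill(self, request_id: int, prompt_tokens: list[int]) -> KVCache:
        # A real system runs the full forward pass here; we model only the
        # artifact that must migrate across machines: the KV cache.
        return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))


class DecodeWorker:
    """Runs on the decode GPU pool: memory-bound, one token per step."""

    def __init__(self) -> None:
        self.caches: dict[int, KVCache] = {}

    def receive_kv(self, cache: KVCache) -> None:
        # The migration step: the KV cache arrives over the interconnect.
        self.caches[cache.request_id] = cache

    def decode_step(self, request_id: int) -> int:
        cache = self.caches[request_id]
        cache.num_tokens += 1  # each generated token extends the cache
        return cache.num_tokens


def serve(prompt: list[int], request_id: int = 0) -> int:
    """Route one request through both pools; return final cache length."""
    prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
    kv = prefill_pool.prefill(request_id, prompt)  # compute-heavy pool
    decode_pool.receive_kv(kv)                     # cross-machine transfer
    length = kv.num_tokens
    for _ in range(3):                             # generate 3 tokens
        length = decode_pool.decode_step(request_id)
    return length
```

Because the two pools never share a forward pass, prefill bursts cannot stall in-flight decodes — the trade is that `receive_kv` becomes a real network transfer whose cost scales with prompt length.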
Continuous Batching: Scheduling at Iteration Granularity
· ☕ 9 min read · ✍️ k4i
How iteration-level scheduling eliminates GPU idle time by inserting new requests the moment a slot opens, and the math behind mixing prefill and decode in a single forward pass.
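The iteration-level admission described above can be sketched as a small scheduler loop. This is an illustrative toy, not any engine's scheduler: requests are reduced to "decode steps remaining", and the defining move is that free slots are refilled before every forward pass rather than after the whole batch drains.

```python
# Hypothetical sketch of continuous (iteration-level) batching; the queue,
# slot model, and function name are illustrative, not a real engine's API.

from collections import deque


def continuous_batching(requests: list[int], max_slots: int) -> int:
    """Each entry in `requests` is the number of decode steps that request
    still needs. Returns the number of forward-pass iterations needed to
    drain the queue when new requests are admitted the moment a slot frees.
    """
    waiting = deque(requests)
    running: list[int] = []  # remaining steps per in-flight request
    iterations = 0
    while waiting or running:
        # Admit new requests into any free slots *before* this iteration --
        # the step that distinguishes continuous from static batching.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        # One forward pass: every running request advances by one token.
        running = [steps - 1 for steps in running]
        # Finished requests release their slots immediately.
        running = [steps for steps in running if steps > 0]
        iterations += 1
    return iterations
```

With `requests=[3, 1, 2]` and `max_slots=2`, the short request finishes after one iteration and the third request slides into its slot on the very next pass, so the queue drains in 3 iterations; a static batcher that waits for each full batch would need 5.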