Scheduling
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools eliminates interference entirely, enabling independent scaling and optimal latency — at the cost of KV cache migration across machines.
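The split described above can be sketched in a few lines. This is a hypothetical illustration, not any engine's real API: `PrefillWorker`, `DecodeWorker`, and `serve` are invented names, and the KV cache is reduced to its token count so that the one thing that must cross machines — the cache handoff — is visible.

```python
# Hypothetical sketch of disaggregated prefill; all class and function
# names here are illustrative, not a real serving framework's API.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Per-request key/value cache: produced by prefill, consumed by decode."""
    request_id: int
    num_tokens: int


class PrefillWorker:
    """Runs on the prefill GPU pool: one compute-bound pass over the prompt."""

    def prefill(self, request_id: int, prompt_tokens: list[int]) -> KVCache:
        # A real system runs the full forward pass here; we model only the
        # artifact that must migrate across machines: the KV cache.
        return KVCache(request_id=request_id, num_tokens=len(prompt_tokens))


class DecodeWorker:
    """Runs on the decode GPU pool: memory-bound, one token per step."""

    def __init__(self) -> None:
        self.caches: dict[int, KVCache] = {}

    def receive_kv(self, cache: KVCache) -> None:
        # The migration step: the KV cache arrives over the interconnect.
        self.caches[cache.request_id] = cache

    def decode_step(self, request_id: int) -> int:
        cache = self.caches[request_id]
        cache.num_tokens += 1  # each generated token extends the cache
        return cache.num_tokens


def serve(prompt: list[int], request_id: int = 0) -> int:
    """Route one request through both pools; return final cache length."""
    prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
    kv = prefill_pool.prefill(request_id, prompt)  # compute-heavy pool
    decode_pool.receive_kv(kv)                     # cross-machine transfer
    length = kv.num_tokens
    for _ in range(3):                             # generate 3 tokens
        length = decode_pool.decode_step(request_id)
    return length
```

Because the two pools never share a forward pass, prefill bursts cannot stall in-flight decodes — the trade is that `receive_kv` becomes a real network transfer whose cost scales with prompt length.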
Continuous Batching: Scheduling at Iteration Granularity
· ☕ 9 min read · ✍️ k4i
How iteration-level scheduling eliminates GPU idle time by inserting new requests the moment a slot opens, and the math behind mixing prefill and decode in a single forward pass.
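The iteration-level admission described above can be sketched as a small scheduler loop. This is an illustrative toy, not any engine's scheduler: requests are reduced to "decode steps remaining", and the defining move is that free slots are refilled before every forward pass rather than after the whole batch drains.

```python
# Hypothetical sketch of continuous (iteration-level) batching; the queue,
# slot model, and function name are illustrative, not a real engine's API.

from collections import deque


def continuous_batching(requests: list[int], max_slots: int) -> int:
    """Each entry in `requests` is the number of decode steps that request
    still needs. Returns the number of forward-pass iterations needed to
    drain the queue when new requests are admitted the moment a slot frees.
    """
    waiting = deque(requests)
    running: list[int] = []  # remaining steps per in-flight request
    iterations = 0
    while waiting or running:
        # Admit new requests into any free slots *before* this iteration --
        # the step that distinguishes continuous from static batching.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        # One forward pass: every running request advances by one token.
        running = [steps - 1 for steps in running]
        # Finished requests release their slots immediately.
        running = [steps for steps in running if steps > 0]
        iterations += 1
    return iterations
```

With `requests=[3, 1, 2]` and `max_slots=2`, the short request finishes after one iteration and the third request slides into its slot on the very next pass, so the queue drains in 3 iterations; a static batcher that waits for each full batch would need 5.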