First Principles
Strip it down until it's true
Bloom Once, Bloom Right
Write it clean the first time
Rain, Tea & Recursion
The paper you skipped has the answer
Disaggregated Prefill: Splitting Compute Across Machines
☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools removes interference between the two phases and lets each pool scale and tune latency independently, at the cost of migrating the KV cache across machines.
Prefix Caching: Reusing KV Cache Across Requests
☕ 8 min read · ✍️ k4i
When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores and reuses those keys and values, cutting time to first token (TTFT) by up to 97% in common deployments.
Continuous Batching: Scheduling at Iteration Granularity
☕ 9 min read · ✍️ k4i
How iteration-level scheduling eliminates GPU idle time by inserting new requests the moment a slot opens, and the math behind mixing prefill and decode in a single forward pass.
Paged Attention: Virtual Memory for the GPU
☕ 10 min read · ✍️ k4i
How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing KV cache memory utilization from ~30% to ~96%.
Online Softmax: Tiling for Arbitrarily Large Rows
☕ 6 min read · ✍️ k4i
How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable two-pass tiling algorithm.
Why KV Cache Works in LLM Inference
☕ 8 min read · ✍️ k4i
Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.
Fused Softmax in Triton
☕ 7 min read · ✍️ k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.