Chunked Prefill: Slicing the Prefill to Protect Decode Latency
Apr 22, 2026 · May 30, 2026 · 8 min read · k4i
Splitting a long prefill across multiple iterations keeps decode requests from stalling, with no extra FLOPs and negligible IO overhead.

Continuous Batching: Scheduling at Iteration Granularity
Apr 22, 2026 · May 30, 2026 · 7 min read · k4i
How iteration-level scheduling eliminates GPU idle time by inserting new requests the moment a slot opens, and the math behind mixing prefill and decode in a single forward pass.

Paged Attention: Virtual Memory for the GPU
Apr 22, 2026 · May 30, 2026 · 10 min read · k4i
How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing GPU utilization from ~30% to ~96%.

Online Softmax: Tiling for Arbitrarily Large Rows
Apr 21, 2026 · Jun 5, 2026 · 6 min read · k4i
How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable 2-pass tiling algorithm.

Why KV Cache Works in LLM Inference
Apr 20, 2026 · Jun 5, 2026 · 8 min read · k4i
Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.

Fused Softmax in Triton
Apr 20, 2026 · Jun 5, 2026 · 7 min read · k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.

Batch vs Stochastic Gradient Descent
Feb 16, 2026 · Apr 19, 2026 · 4 min read · k4i
Understand batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Forward & Backward Propagation
Feb 16, 2026 · Apr 19, 2026 · 5 min read · k4i
Understand how backward propagation works in gradient descent.