AI – k4i's blog

Paged Attention: Virtual Memory for the GPU

📅 Apr 22, 2026 · ☕ 10 min read · ✍️ k4i

How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing KV cache memory utilization from ~30% to ~96%.

Online Softmax: Tiling for Arbitrarily Large Rows

📅 Apr 21, 2026 · ☕ 6 min read · ✍️ k4i

How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable 2-pass tiling algorithm.

Why KV Cache Works in LLM Inference

📅 Apr 20, 2026 · ☕ 9 min read · ✍️ k4i

Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.

Fused Softmax in Triton

📅 Apr 20, 2026 · ☕ 7 min read · ✍️ k4i

How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.

Batch vs Stochastic Gradient Descent

📅 Feb 16, 2026 · ☕ 4 min read · ✍️ k4i

Understand batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

Forward & Backward Propagation

📅 Feb 16, 2026 · ☕ 5 min read · ✍️ k4i

Understand how backward propagation computes the gradients that gradient descent uses.
