AI
Paged Attention: Virtual Memory for the GPU
· ☕ 10 min read · ✍️ k4i
How vLLM borrows the OS paging idea to eliminate KV cache memory fragmentation, pushing KV cache memory utilization from ~30% to ~96%.
Why KV Cache Works in LLM Inference
· ☕ 9 min read · ✍️ k4i
Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.
Fused Softmax in Triton
· ☕ 7 min read · ✍️ k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.
Batch vs Stochastic Gradient Descent
· ☕ 4 min read · ✍️ k4i
Understand the differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.