
LLM Attention Kernels and GPU Primitives


This series covers attention kernels and GPU primitives. The mechanism series explains why a serving system needs a given optimization; this series explains how that optimization works at the kernel and memory-access level.

Existing Posts

  1. Fused Softmax in Triton
  2. Online Softmax: Tiling for Arbitrarily Large Rows

Planned Posts

  • FlashAttention: how online softmax becomes IO-aware attention
  • From FlashAttention to PagedAttention: how attention kernels and cache layout constrain each other
  • PagedAttention kernels: how block tables enter the attention memory path
  • Triton profiling: using roofline thinking for bandwidth-bound and compute-bound kernels
  • Why decode kernels are often limited by HBM bandwidth
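As a rough preview of the roofline thinking mentioned in the last two items, here is a minimal sketch that classifies a kernel as bandwidth-bound or compute-bound from its arithmetic intensity. The function name `classify_kernel`, the peak numbers, and the simplified decode-attention cost model (one query token, FP16 K and V read once from HBM, softmax FLOPs and Q/O traffic ignored) are all assumptions made for illustration, not measurements of any real GPU or kernel.

```python
def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline-style classification (illustrative helper, not a library API)."""
    intensity = flops / bytes_moved          # FLOPs per byte of HBM traffic
    ridge = peak_flops / peak_bandwidth      # intensity where the two roofs meet
    bound = "compute-bound" if intensity >= ridge else "bandwidth-bound"
    attainable = min(peak_flops, intensity * peak_bandwidth)
    return intensity, ridge, bound, attainable


# Hypothetical single-head decode step: one query attends to n cached tokens.
n, d = 4096, 128                 # context length and head dimension (assumed)
bytes_per_elem = 2               # FP16
flops = 2 * n * d + 2 * n * d    # q @ K^T plus p @ V, 2 FLOPs per multiply-add
bytes_moved = 2 * n * d * bytes_per_elem  # read K and V once from HBM

# Hypothetical hardware peaks, chosen as round numbers for the example.
peak_flops = 100e12              # 100 TFLOP/s
peak_bandwidth = 1e12            # 1 TB/s

intensity, ridge, bound, attainable = classify_kernel(
    flops, bytes_moved, peak_flops, peak_bandwidth
)
print(f"intensity={intensity:.1f} FLOP/B, ridge={ridge:.1f} FLOP/B, {bound}")
```

Under these assumptions the decode step performs about one FLOP per byte read, far below the ridge point, which is the intuition behind the claim that decode is often limited by HBM bandwidth.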

Questions Each Post Should Answer

  • Which memory access does this kernel remove or reduce?
  • How does data move through HBM, L2, shared memory, and registers?
  • Does it improve prefill, decode, or both?
  • How is it coupled to vLLM / SGLang serving parameters or cache layout?