AI
LLM Attention Kernels and GPU Primitives
· ☕ 1 min read · ✍️ k4i
A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
LLM Quantization and Low-Precision Serving
· ☕ 1 min read · ✍️ k4i
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools removes interference between the two phases and lets each pool scale and tune latency independently, at the cost of migrating the KV cache across machines.
Prefix Caching: Reusing KV Cache Across Requests
· ☕ 8 min read · ✍️ k4i
When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores those key/value vectors and reuses them across requests, cutting time to first token (TTFT) by up to 97% in common deployments.