AI – k4i's blog

LLM Attention Kernels and GPU Primitives

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.

LLM Quantization and Low-Precision Serving

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.

LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems

📅 Jun 5, 2026 · ☕ 2 min read · ✍️ k4i

An LLM inference experiment series index: vLLM/SGLang benchmarks, TTFT/TPOT, prefix cache, chunked prefill, PagedAttention, quantization, and a profiler dashboard.

vLLM / SGLang Source Reading: From Request to Forward Pass

📅 Jun 4, 2026 · 📝 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A vLLM / SGLang source-reading series index: request lifecycle, scheduler, KV cache allocation, block manager, radix cache, and benchmarks.

LLM Inference Internals: Core Mechanisms for Serving Engines

📅 Jun 4, 2026 · 📝 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for core LLM serving mechanisms: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, and disaggregated prefill.

A Survey of LLM Quantization: From Linear Quantization to Codebooks

📅 Jun 1, 2026 · 📝 Jun 5, 2026 · ☕ 34 min read · ✍️ k4i

A practical survey of LLM quantization, covering linear quantization, codebook quantization, LLM.int8(), SmoothQuant, GPTQ, AWQ, NF4, AQLM, KV cache quantization, and FP8.

From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation

📅 May 28, 2026 · 📝 Jun 5, 2026 · ☕ 10 min read · ✍️ k4i

A step-by-step explanation of positional encoding in Transformers, from absolute embeddings to sinusoidal encodings, Euler's formula, and rotary position embeddings.

Estimating Compute and Memory Requirements for LLM Training and Inference

📅 May 27, 2026 · 📝 Jun 5, 2026 · ☕ 14 min read · ✍️ k4i

A back-of-the-envelope framework for estimating LLM training FLOPs, inference FLOPs, weight memory, KV cache, and training memory.

Disaggregated Prefill: Splitting Compute Across Machines

📅 Apr 22, 2026 · 📝 May 30, 2026 · ☕ 9 min read · ✍️ k4i

Routing prefill and decode to separate GPU pools removes interference between the two phases, so each pool can be scaled and tuned for latency independently, at the cost of migrating the KV cache across machines.

Prefix Caching: Reusing KV Cache Across Requests

📅 Apr 22, 2026 · 📝 May 30, 2026 · ☕ 8 min read · ✍️ k4i

When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores the shared prefix's KV cache once and reuses it across requests, cutting TTFT by up to 97% in common deployments.
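
To make the idea concrete, here is a toy sketch of prefix reuse. It is not how vLLM or SGLang implement it: `ToyPrefixCache`, `prefill_cost`, and the block size of 4 are all hypothetical names and values chosen for illustration, and the cached "KV" entries are placeholder strings rather than tensors. The sketch only counts how many prompt tokens still need prefill once a shared prefix is cached.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens per cached block


class ToyPrefixCache:
    """Toy cache keyed by the full token prefix that ends at each block boundary."""

    def __init__(self) -> None:
        # prefix tokens (as a tuple) -> placeholder standing in for real KV tensors
        self.blocks: Dict[Tuple[int, ...], str] = {}

    def match(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already covered by cached blocks."""
        hit = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            if tuple(tokens[:end]) not in self.blocks:
                break
            hit = end
        return hit

    def insert(self, tokens: List[int]) -> None:
        """Record every full block of this prompt; a trailing partial block is skipped."""
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")


def prefill_cost(cache: ToyPrefixCache, tokens: List[int]) -> int:
    """Number of tokens whose KV must still be computed for this request."""
    reused = cache.match(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


# Two requests that share an 8-token "system prompt".
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
cache = ToyPrefixCache()
first = prefill_cost(cache, system_prompt + [100, 101])        # cold cache
second = prefill_cost(cache, system_prompt + [200, 201, 202])  # shared prefix reused
print(first, second)  # prints: 10 3
```

The first request pays for all 10 tokens; the second only pays for the 3 tokens after the shared prefix. Real engines also have to handle eviction, reference counting, and hashing of block contents, none of which this sketch attempts.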