Why KV Cache Works in LLM Inference

📅 Apr 20, 2026 · 📝 Apr 22, 2026 · ☕ 8 min read · ✍️ k4i

Why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.