This series explains why LLM inference engines are designed the way they are. The focus is not on framework APIs but on the core mechanisms behind vLLM / SGLang-style serving engines: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, chunked prefill, and disaggregated prefill.
Existing Posts
Read the existing posts in this order:
- Estimating Compute and Memory Requirements for LLM Training and Inference
- From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation
- Why KV Cache Works in LLM Inference
- Paged Attention: Virtual Memory for the GPU
- Continuous Batching: Scheduling at Iteration Granularity
- Chunked Prefill: Slicing the Prefill to Protect Decode Latency
- Prefix Caching: Reusing KV Cache Across Requests
- Disaggregated Prefill: Splitting Compute Across Machines
Planned Posts
- Prefill vs decode: why one model has two very different bottlenecks
- The scheduler’s real objective: bigger batches are not always better
- KV cache eviction: LRU, prefix trees, reference counts, and cache pollution
Questions Each Post Should Answer
- What production problem does this mechanism solve?
- Does it mainly affect TTFT (time to first token), TPOT (time per output token), throughput, or memory capacity?
- How does it change the KV cache, the scheduler, the attention kernels, or the GPU workload?
- Which vLLM / SGLang design or parameter does it map to?
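To make the latency question concrete, here is a minimal sketch of how the two latency metrics could be computed from per-token timestamps of a single request. The function name `latency_metrics` and the `LatencyMetrics` type are hypothetical and not taken from vLLM or SGLang. The TPOT convention used here (mean gap between output tokens after the first) is one common choice; benchmarks may define it slightly differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft: float  # seconds from request arrival to the first output token
    tpot: float  # mean seconds per output token after the first one


def latency_metrics(request_start: float, token_times: Sequence[float]) -> LatencyMetrics:
    """Compute TTFT and TPOT for one request (illustrative helper, not a library API).

    request_start: time the request arrived, in seconds.
    token_times:   time each output token was emitted, in seconds, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token timestamp")

    n = len(token_times)
    # TTFT covers queueing plus prefill: everything before the first token appears.
    ttft = token_times[0] - request_start
    # TPOT averages the decode steps: n tokens have n - 1 gaps between them.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return LatencyMetrics(ttft=ttft, tpot=tpot)


# Example: first token after 0.5 s, then one token every 0.25 s.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Under this convention, TTFT covers the time before the first token appears and TPOT covers the pace of the tokens that follow it, so a post can state which of the two its mechanism moves.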