
LLM Inference Internals: Core Mechanisms for Serving Engines


This series explains why inference engines are shaped the way they are. The focus is on the core mechanisms behind vLLM / SGLang-style serving engines rather than on framework APIs: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, chunked prefill, and disaggregated prefill.

Existing Posts

Read the existing posts in this order:

  1. Estimating Compute and Memory Requirements for LLM Training and Inference
  2. From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation
  3. Why KV Cache Works in LLM Inference
  4. Paged Attention: Virtual Memory for the GPU
  5. Continuous Batching: Scheduling at Iteration Granularity
  6. Chunked Prefill: Slicing the Prefill to Protect Decode Latency
  7. Prefix Caching: Reusing KV Cache Across Requests
  8. Disaggregated Prefill: Splitting Compute Across Machines

Planned Posts

  • Prefill vs decode: why one model has two very different bottlenecks
  • The scheduler’s real objective: bigger batches are not always better
  • KV cache eviction: LRU, prefix trees, reference counts, and cache pollution

Questions Each Post Should Answer

  • What production problem does this mechanism solve?
  • Does it mainly affect TTFT (time to first token), TPOT (time per output token), throughput, or memory capacity?
  • How does it change the KV cache, the scheduler, the attention kernels, or the GPU workload?
  • Which vLLM / SGLang design or parameter does it map to?
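
To make the latency metrics in these questions concrete, here is a minimal sketch of how TTFT, TPOT, and throughput could be computed from per-request timestamps. The `RequestTrace` type and the function names are hypothetical, chosen for illustration; they are not part of vLLM or SGLang, and real engines may define or measure these metrics slightly differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request timing record (all times in seconds)."""
    arrival: float            # when the request reached the engine
    token_times: List[float]  # when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time to first token: queueing plus prefill, up to the first output token."""
    return trace.token_times[0] - trace.arrival


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between consecutive tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the whole observed window."""
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.arrival for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
    print(f"TTFT: {ttft(trace):.3f} s")
    print(f"TPOT: {tpot(trace):.3f} s")
    print(f"Throughput: {throughput([trace]):.1f} tokens/s")
```

Under these definitions, TTFT is dominated by the prefill phase and TPOT by the decode phase, which is why a single mechanism can improve one metric while hurting the other.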