This series is for experiment reports. Unlike mechanism explainers or source-reading notes, each post should include a reproducible environment, commands, metrics, tables or figures, and concrete tuning conclusions.
For inference-engine interviews, knowing the names PagedAttention, prefix cache, and chunked prefill is only the first layer. The stronger signal is being able to answer: which workload benefits, how much did the metric improve, where did the bottleneck move, and what should we inspect first if production metrics regress?
Experiment Order
The planned order is:
- Build a vLLM / SGLang benchmark environment
- Experiment: how batch size and max_num_batched_tokens change throughput and latency
- Experiment: how prefix cache hit rate changes TTFT
- Experiment: tuning chunk size for chunked prefill
- Experiment: PagedAttention and memory fragmentation
- Experiment: the memory, speed, and quality triangle for quantized models
- Final project: an inference-service profiler dashboard for TTFT, TPOT, cache hit rate, memory watermark, and tuning suggestions
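Since TTFT and TPOT appear in almost every experiment above, it helps to pin down how they are computed before comparing numbers. The following is a minimal sketch under one common convention: TTFT is the delay from sending the request to the first output token, and TPOT is the average gap between output tokens after the first. The RequestTrace type and its field names are hypothetical, not part of vLLM or SGLang.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Client-side timestamps in seconds for one request (hypothetical type)."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    if not trace.token_times:
        raise ValueError("TTFT needs at least one output token")
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token, averaged over the tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)
```

Under this convention, a request sent at t=0.0 whose tokens arrive at 0.5, 0.6, and 0.7 seconds has a TTFT of 0.5 s and a TPOT of about 0.1 s. Other tools may define TPOT slightly differently, so each report should state which definition it uses.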
Standard Report Format
Each lab report should include:
- Question: what hypothesis is this experiment testing?
- Environment: GPU, driver, CUDA, model, framework version, and launch parameters.
- Workload: prompt length, output length, concurrency, request distribution, and whether prefixes are shared.
- Metrics: TTFT, TPOT, throughput, memory watermark, cache hit rate, and GPU utilization.
- Results: tables or figures that show the key changes.
- Explanation: connect the result back to prefill, decode, KV cache, scheduler, or kernels.
- Conclusion: what should change in the next deployment or tuning pass?
Without these details, a post is mostly a learning note. With them, it becomes evidence of engineering judgment.
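One way to keep reports consistent is to treat the format above as a checklist that can be validated before publishing. The sketch below is only an illustration: the LabReport class and the missing_sections helper are hypothetical names, and each section is stored as free text.

```python
from dataclasses import dataclass, fields


@dataclass
class LabReport:
    """One field per section of the standard report format (hypothetical type)."""
    question: str = ""
    environment: str = ""
    workload: str = ""
    metrics: str = ""
    results: str = ""
    explanation: str = ""
    conclusion: str = ""


def missing_sections(report: LabReport) -> list[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [f.name for f in fields(report) if not getattr(report, f.name).strip()]
```

A draft that fills in only the question and environment would report the remaining five sections as missing, which is a quick signal that the post is still a learning note rather than a full experiment report.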