LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems

This series is for experiment reports. Unlike mechanism explainers or source-reading notes, each post should include a reproducible environment, commands, metrics, tables or figures, and concrete tuning conclusions.

For inference-engine interviews, knowing the names PagedAttention, prefix cache, and chunked prefill is only the first layer. The stronger signal is being able to answer: which workload benefits, how much did the metric improve, where did the bottleneck move, and what should we inspect first if production metrics regress?

Experiment Order

The planned order is:

1. Build a vLLM / SGLang benchmark environment
2. Experiment: how batch size and max_num_batched_tokens change throughput and latency
3. Experiment: how prefix cache hit rate changes TTFT
4. Experiment: tuning chunk size for chunked prefill
5. Experiment: PagedAttention and memory fragmentation
6. Experiment: the memory, speed, and quality triangle for quantized models
7. Final project: an inference-service profiler dashboard for TTFT, TPOT, cache hit rate, memory watermark, and tuning suggestions

Standard Report Format

Each lab report should include:

Question: what hypothesis is this experiment testing?
Environment: GPU, driver, CUDA, model, framework version, and launch parameters.
Workload: prompt length, output length, concurrency, request distribution, and whether prefixes are shared.
Metrics: TTFT, TPOT, throughput, memory watermark, cache hit rate, and GPU utilization.
Results: tables or figures that show the key changes.
Explanation: connect the result back to prefill, decode, KV cache, scheduler, or kernels.
Conclusion: what should change in the next deployment or tuning pass?
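
To make the latency metrics in this format concrete, here is a minimal sketch of how TTFT, TPOT, and output throughput could be derived from client-side timestamps of a streamed response. The RequestTrace record and the function names are hypothetical, chosen for illustration only; they are not part of vLLM or SGLang. The sketch assumes TTFT is measured from the moment the request is sent to the arrival of the first output token, and TPOT is the mean gap between consecutive output tokens after the first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Client-side timestamps (seconds) for one streamed request.

    Hypothetical record format, used only for this illustration.
    """
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each output token, in order


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.sent_at


def tpot(trace: RequestTrace) -> float:
    """Time per output token: mean gap between tokens after the first."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0  # a single token has no inter-token gap
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def output_throughput(traces: List[RequestTrace]) -> float:
    """Output tokens per second over the whole run (wall-clock window)."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


if __name__ == "__main__":
    trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8])
    print(f"TTFT: {ttft(trace):.3f} s")
    print(f"TPOT: {tpot(trace):.3f} s")
    print(f"Throughput: {output_throughput([trace]):.1f} tok/s")
```

Whichever definitions a report uses, stating them explicitly matters: measuring TTFT on the server instead of the client, or including the first token in the TPOT average, changes the numbers and makes reports hard to compare.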

Without these details, a post is mostly a learning note. With them, it becomes evidence of engineering judgment.
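
One way to enforce the format is to treat each report as a structured record and check it for empty sections before publishing. The following is a minimal sketch under that assumption; the LabReport class and the missing_sections helper are hypothetical names introduced here, with fields that simply mirror the seven sections of the standard report format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class LabReport:
    """One field per section of the standard report format (hypothetical)."""
    question: str = ""
    environment: str = ""
    workload: str = ""
    metrics: str = ""
    results: str = ""
    explanation: str = ""
    conclusion: str = ""


def missing_sections(report: LabReport) -> List[str]:
    """Return the names of sections that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(report)
        if not getattr(report, f.name).strip()
    ]


if __name__ == "__main__":
    draft = LabReport(
        question="Does a higher prefix cache hit rate lower TTFT?",
        environment="GPU, driver, CUDA, model, framework version, launch parameters",
    )
    print("Missing sections:", missing_sections(draft))
```

A check like this only confirms that a section is filled in, not that its content is reproducible, so it complements a manual review rather than replacing one.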
