LLM
vLLM Request Lifecycle: From OpenAI API to One Forward Pass
· ☕ 8 min read · ✍️ k4i
A source-reading walkthrough of the vLLM V1 request path: OpenAI-compatible HTTP entrypoint, serving render, AsyncLLM, EngineCore client, Tensor IPC, scheduler, and one GPUModelRunner forward pass.
LLM Attention Kernels and GPU Primitives
· ☕ 1 min read · ✍️ k4i
A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
LLM Quantization and Low-Precision Serving
· ☕ 1 min read · ✍️ k4i
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
Disaggregated Prefill: Splitting Compute Across Machines
· ☕ 9 min read · ✍️ k4i
Routing prefill and decode to separate GPU pools removes interference between the two phases, so each pool can be scaled and tuned for latency independently, at the cost of migrating the KV cache across machines.