A mechanism-first guide to activation functions: why neural networks need nonlinearities, how sigmoid, tanh, ReLU, GELU, and SiLU differ, and why a 400-function survey is best read as a map rather than a menu.
A practical model for upload-side and download-side streaming: transport moves bytes, while the application layer defines boundaries, progress, recovery, idempotency, backpressure, and business meaning.
A source-reading walkthrough of the vLLM V1 request path: OpenAI-compatible HTTP entrypoint, serving render, AsyncLLM, EngineCore client, Tensor IPC, scheduler, and one GPUModelRunner forward pass.
Why LLM inference splits into a compute-bound prefill phase and a memory-bandwidth-bound decode phase, and how that explains TTFT, TPOT, batching, KV cache pressure, and serving-engine design.
A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
An LLM inference experiment series index: vLLM/SGLang benchmarks, TTFT/TPOT, prefix cache, chunked prefill, PagedAttention, quantization, and a profiler dashboard.
A series index for core LLM serving mechanisms: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, and disaggregated prefill.
A step-by-step explanation of positional encoding in Transformers, from absolute embeddings to sinusoidal encodings, Euler's formula, and rotary position embeddings.
A practical note on managing agent skills: how to install, create, remove, disable, update, and evolve skills, plus which meta-skills are worth installing first.
Routing prefill and decode to separate GPU pools eliminates interference entirely, enabling independent scaling and optimal latency â at the cost of KV cache migration across machines.
When thousands of requests share the same system prompt, recomputing its KV cache each time is pure waste. Prefix caching stores and reuses those vectors, cutting TTFT by up to 97% in common deployments.