A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.
A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.
A series index for LLM inference experiments: vLLM/SGLang benchmarks, TTFT/TPOT, prefix cache, chunked prefill, PagedAttention, quantization, and a profiler dashboard.
A series index for core LLM serving mechanisms: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, and disaggregated prefill.
A step-by-step explanation of positional encoding in Transformers, from absolute embeddings to sinusoidal encodings, Euler's formula, and rotary position embeddings.
A practical note on managing agent skills: how to install, create, remove, disable, update, and evolve skills, plus which meta-skills are worth installing first.
Routing prefill and decode to separate GPU pools eliminates interference entirely, enabling independent scaling and optimal latency, at the cost of KV cache migration across machines.