2026
- LLM Inference Sampling: What Temperature, Top-p, and Top-k Actually Control
- Three Routes For Embodied Models: VLA, World Models, And WAM
- Activation Functions: The Small Nonlinearity That Shapes a Network
- Streaming Design: Why The Application Layer Still Matters
- vLLM Request Lifecycle: From OpenAI API to One Forward Pass
- Prefill vs Decode: Why One Model Has Two Very Different Bottlenecks
- LLM Attention Kernels and GPU Primitives
- LLM Quantization and Low-Precision Serving
- LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems
- vLLM / SGLang Source Reading: From Request to Forward Pass
- LLM Inference Internals: Core Mechanisms for Serving Engines
- A Survey of LLM Quantization: From Linear Quantization to Codebooks
- From Absolute Positional Encoding to RoPE: Why Position Can Be a Rotation
- Estimating Compute and Memory Requirements for LLM Training and Inference
- Agent Skill Management: Turning AI Assistants from Clever to Reliable
- Disaggregated Prefill: Splitting Compute Across Machines
- Prefix Caching: Reusing KV Cache Across Requests
- Chunked Prefill: Slicing the Prefill to Protect Decode Latency
- Continuous Batching: Scheduling at Iteration Granularity
- Paged Attention: Virtual Memory for the GPU