k4i's blog

Entropy, Cross Entropy, And KL Divergence: A Coding-Cost View

📅 Jul 14, 2026 · ☕ 5 min read · ✍️ k4i

A coding-cost explanation of entropy, cross entropy, and KL divergence: which uncertainty is unavoidable, what it costs to code with the wrong distribution, and why cross entropy naturally appears in classification and language-model training.

Why KL Divergence Is Not A Distance: Direction Changes The Question

📅 Jul 7, 2026 · ☕ 7 min read · ✍️ k4i

An intuition-first explanation of KL divergence: why KL(P || Q) differs from KL(Q || P), what each direction penalizes, and how this connects to SFT, cross entropy, and RLHF KL penalties.

Common Probability Distributions: Variance And Standard Deviation

📅 Jul 2, 2026 · ☕ 9 min read · ✍️ k4i

A compact guide to common probability distributions, what each one models, and the mean, variance, and standard deviation formulas behind Bernoulli, Binomial, Poisson, Uniform, Normal, Exponential, Gamma, Beta, Chi-square, t, F, and related distributions.

Optimizers: From SGD To AdamW

📅 Jun 29, 2026 · ☕ 10 min read · ✍️ k4i

A mechanism-first guide to optimizers: what SGD, momentum, RMSProp, Adam, and AdamW each solve, why AdamW became a strong default for modern deep learning, and when other optimizers still matter.

vLLM Scheduler: How Request Queues Become SchedulerOutput

📅 Jun 23, 2026 · ☕ 6 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 Scheduler: how it balances the running and waiting queues, token budget, KV cache blocks, prefix-cache hits, and preemption to produce SchedulerOutput for ModelRunner.

Numeric Types in Neural Networks: FP32, BF16, FP8, INT8, and INT4

📅 Jun 23, 2026 · ☕ 4 min read · ✍️ k4i

A concise map of floating point, integer quantization, storage dtype, compute dtype, and accumulation dtype in neural networks.

vLLM ModelRunner: How SchedulerOutput Becomes a GPU Forward

📅 Jun 23, 2026 · ☕ 6 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 GPUModelRunner: how SchedulerOutput becomes input batches, attention metadata, KV slot mappings, model forward, logits, and sampled tokens.

Loss Functions: What a Model Is Really Optimizing

📅 Jun 23, 2026 · ☕ 9 min read · ✍️ k4i

A practical guide to loss functions: when to use MSE, MAE, Huber, binary cross entropy, cross entropy, KL divergence, hinge loss, contrastive loss, and triplet loss.

LLM Inference Sampling: What Temperature, Top-p, and Top-k Actually Control

📅 Jun 18, 2026 · ☕ 7 min read · ✍️ k4i

A small 5-token example for understanding temperature, top-p, and top-k during LLM inference, with source-reading notes from the vLLM V1 sampler.

Activation Functions: The Small Nonlinearity That Shapes a Network

📅 Jun 18, 2026 · ☕ 8 min read · ✍️ k4i

A mechanism-first guide to activation functions: why neural networks need nonlinearities, how sigmoid, tanh, ReLU, GELU, and SiLU differ, and why a 400-function survey is best read as a map rather than a menu.

Three Routes For Embodied Models: VLA, World Models, And WAM

📅 Jun 18, 2026 · ☕ 9 min read · ✍️ k4i

A compact map of embodied model development: VLA policies, JEPA-style world models, and WAM-style models that tie actions to predicted future states.

Streaming Design: Why The Application Layer Still Matters

📅 Jun 17, 2026 · ☕ 11 min read · ✍️ k4i

A practical model for upload-side and download-side streaming: transport moves bytes, while the application layer defines boundaries, progress, recovery, idempotency, backpressure, and business meaning.

vLLM Request Lifecycle: From OpenAI API to One Forward Pass

📅 Jun 7, 2026 · ☕ 5 min read · ✍️ k4i

A source-reading walkthrough of the vLLM V1 request path: OpenAI-compatible HTTP entrypoint, serving render, AsyncLLM, EngineCore client, Tensor IPC, scheduler, and one GPUModelRunner forward pass.

Prefill vs Decode: Why One Model Has Two Very Different Bottlenecks

📅 Jun 5, 2026 · ☕ 8 min read · ✍️ k4i

Why LLM inference splits into a compute-bound prefill phase and a memory-bandwidth-bound decode phase, and how that explains TTFT, TPOT, batching, KV cache pressure, and serving-engine design.

LLM Attention Kernels and GPU Primitives

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM attention kernels and GPU primitives: fused softmax, online softmax, FlashAttention, PagedAttention kernels, Triton/CUDA, and memory-access optimization.

LLM Quantization and Low-Precision Serving

📅 Jun 5, 2026 · ☕ 1 min read · ✍️ k4i

A series index for LLM quantization and low-precision serving: INT8/INT4, GPTQ, AWQ, SmoothQuant, NF4, AQLM, KV cache quantization, FP8 serving, and quality/speed/memory tradeoffs.

LLM Inference Lab Reports: Experiments and Benchmarks for Serving Systems

📅 Jun 5, 2026 · ☕ 2 min read · ✍️ k4i

An LLM inference experiment series index: vLLM/SGLang benchmarks, TTFT/TPOT, prefix cache, chunked prefill, PagedAttention, quantization, and a profiler dashboard.

vLLM / SGLang Source Reading: From Request to Forward Pass

📅 Jun 4, 2026 · ☕ 1 min read · ✍️ k4i

A vLLM / SGLang source-reading series index: request lifecycle, scheduler, KV cache allocation, block manager, radix cache, and benchmarks.

LLM Inference Internals: Core Mechanisms for Serving Engines

📅 Jun 4, 2026 · ☕ 1 min read · ✍️ k4i

A series index for core LLM serving mechanisms: prefill/decode, KV cache, PagedAttention, continuous batching, prefix caching, and disaggregated prefill.

A Survey of LLM Quantization: From Linear Quantization to Codebooks

📅 Jun 1, 2026 · ☕ 34 min read · ✍️ k4i

A practical survey of LLM quantization, covering linear quantization, codebook quantization, LLM.int8(), SmoothQuant, GPTQ, AWQ, NF4, AQLM, KV cache quantization, and FP8.