Online Softmax: Tiling for Arbitrarily Large Rows
· ☕ 6 min read · ✍️ k4i
how online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable two-pass tiling algorithm.
Why KV Cache Works in LLM Inference
· ☕ 8 min read · ✍️ k4i
why the key-value cache avoids redundant computation in autoregressive decoding, and the memory/compute tradeoffs it introduces.
Fused Softmax in Triton
· ☕ 7 min read · ✍️ k4i
how to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.
Batch vs Stochastic Gradient Descent
· ☕ 4 min read · ✍️ k4i
understand batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.
Forward & Backward Propagation
· ☕ 5 min read · ✍️ k4i
understand how backward propagation works in gradient descent.
Key Management With GnuPG
· ☕ 11 min read · ✍️ k4i
learn how to manage your keys with GPG, and use it with SSH, Git, and pass.
Shortest Paths Algorithms
· ☕ 3 min read · ✍️ k4i
compare shortest path algorithms: Dijkstra, Floyd-Warshall, and Bellman-Ford.