Online Softmax: Tiling for Arbitrarily Large Rows
📅 Apr 21, 2026 · 📝 Apr 22, 2026 · ☕ 6 min read · ✍️ k4i
How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable 2-pass tiling algorithm.

Fused Softmax in Triton
📅 Apr 20, 2026 · 📝 Apr 22, 2026 · ☕ 7 min read · ✍️ k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.