Triton
Online Softmax: Tiling for Arbitrarily Large Rows
· ☕ 6 min read · ✍️ k4i
How online softmax extends the fused kernel to handle rows that exceed SRAM capacity, using a numerically stable two-pass tiling algorithm.
Fused Softmax in Triton
· ☕ 7 min read · ✍️ k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.