Fused Softmax in Triton
☕ 7 min read · ✍️ k4i
How to write a fused softmax kernel in Triton that eliminates redundant memory accesses and outperforms PyTorch's native implementation.