RBF Attention Reveals Dot‑Product's Hidden Norm Bias

Source: DEV Community
Swapping dot‑product attention for RBF attention sounds like an architectural revolution. In Raphael Pisoni’s experiment, it turned out to be something stranger: a one‑line algebraic tweak that silently reproduces half the “mysterious” behaviors of modern Transformers and, along the way, breaks the hardware stack.

TL;DR

- RBF attention is just dot‑product attention plus an explicit squared‑L2 penalty on keys; the “new” geometry is already latent in SDPA.
- Changing the metric forces you to confront everything your stack has hard‑coded about dot products: RoPE, attention sinks, fused kernels, even how you debug training.
- The right way to use RBF is as a diagnostic scalpel: borrow its inductive biases (norm penalties, distance‑based similarity) without paying the full engineering tax of a wholesale swap.

RBF Attention Is Just Dot‑Product + a Key L2 Penalty

Pisoni’s math trick is the key: start from a distance‑based score instead of a dot product,

\[ \text{score}(q, k) = -\gamma \lVert q - k \rVert^2 \]

Expanding the square gives

\[ -\gamma \lVert q - k \rVert^2 = 2\gamma\, q^\top k - \gamma \lVert k \rVert^2 - \gamma \lVert q \rVert^2, \]

and the \(-\gamma \lVert q \rVert^2\) term is the same for every key a given query attends to, so the softmax discards it. What remains is ordinary dot‑product attention, rescaled by \(2\gamma\), with an explicit \(-\gamma \lVert k \rVert^2\) penalty on each key’s squared norm.
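Here is a minimal numerical sketch of that identity (illustrative code, not from Pisoni’s post): it computes the RBF scores and the equivalent dot‑product‑plus‑key‑penalty scores for random tensors and checks that the resulting attention weights match. The shapes and the choice γ = 1/√d are assumptions made for the demo.

```python
# Hypothetical demo (not Pisoni's code): softmax over the RBF scores
# -gamma * ||q - k||^2 matches softmax over 2*gamma*q.k - gamma*||k||^2,
# because the per-query constant -gamma*||q||^2 cancels inside the softmax.
import torch

torch.manual_seed(0)
B, T, d = 2, 5, 16           # batch, sequence length, head dim (arbitrary)
gamma = 1.0 / d ** 0.5       # assumed bandwidth, standing in for the usual 1/sqrt(d) scale

q = torch.randn(B, T, d)
k = torch.randn(B, T, d)

# RBF scores: negative squared Euclidean distance between each query/key pair
rbf_scores = -gamma * torch.cdist(q, k) ** 2                       # (B, T, T)

# Dot-product form: rescaled q.k^T minus a squared-norm penalty on each key
key_penalty = gamma * (k ** 2).sum(-1).unsqueeze(-2)               # (B, 1, T)
dot_scores = 2 * gamma * q @ k.transpose(-1, -2) - key_penalty     # (B, T, T)

# Attention weights agree up to floating-point error
print(torch.allclose(rbf_scores.softmax(-1), dot_scores.softmax(-1), atol=1e-5))  # True
```

In practice, this suggests you never need the pairwise-distance computation at all: subtracting a key‑norm penalty from the pre‑softmax logits of standard SDPA gives the same geometry while keeping the fused kernels intact.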