Cosine Similarity vs Dot Product in Attention Mechanisms

Source: DEV Community
For comparing the hidden states between the encoder and decoder, we need a similarity score. Two common approaches to calculate this are:

- Cosine similarity
- Dot product

## Cosine Similarity

Cosine similarity performs a dot product on the vectors and then normalizes the result by the vectors' magnitudes.

**Example**

- Encoder output: `[-0.76, 0.75]`
- Decoder output: `[0.91, 0.38]`
- Cosine similarity ≈ -0.39

Interpreting the score:

- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention

This is useful when:

- Values can vary a lot in size
- You want a consistent scale (-1 to 1)

The problem is that it's a bit expensive. It requires extra calculations (division, square roots), and in attention we don't always need that.

## Dot Product

The dot product is much simpler. It does the following:

1. Multiply corresponding values
2. Add them up

**Example**

(-0.76 × 0.91) + (0.75 × 0.38) ≈ -0.41

The dot product is preferred in attention because:

- It's fast
- It's simple
- It gives good relative scores

Even if the numbers are not normalized, the model can still learn useful attention patterns from the relative scores.