Cosine Similarity vs Dot Product in Attention Mechanisms

Source: DEV Community
For comparing the hidden states between the encoder and decoder, we need a similarity score. Two common approaches to calculate this are:

- Cosine similarity
- Dot product

## Cosine Similarity

Cosine similarity performs a dot product on the vectors and then normalizes the result by the vectors' magnitudes.

**Example**

- Encoder output: `[-0.76, 0.75]`
- Decoder output: `[0.91, 0.38]`
- Cosine similarity ≈ -0.39

Interpreting the score:

- Close to 1 → very similar → strong attention
- Close to 0 → not related
- Negative → opposite → low attention

This is useful when:

- Values can vary a lot in size
- You want a consistent scale (-1 to 1)

The problem is that it's a bit expensive. It requires extra calculations (division, square roots), and in attention we don't always need that.

## Dot Product

The dot product is much simpler. It does the following:

1. Multiply corresponding values
2. Add them up

**Example**

(-0.76 × 0.91) + (0.75 × 0.38) ≈ -0.41

The dot product is preferred in attention because:

- It's fast
- It's simple
- It gives good relative scores

Even if the numbers are not normalized, the model can still learn useful attention patterns from the relative scores.