Attention Weights

Attention Weights are the normalized version of Attention Scores used to compute the final Context Vector. They represent the percentage or fraction of attention the model should pay to each input token for a given Query.

Calculation

Attention weights are derived from attention scores through a two-step process in Scaled Dot-Product Attention:

  1. Scaling: scores are divided by $\sqrt{d_k}$ (where $d_k$ is the Key dimension) to prevent large dot products from saturating the softmax.
  2. Normalization: Softmax Normalization is applied to ensure the weights sum to 1.

$$\text{Weights} = \text{softmax}\left(\frac{\text{Scores}}{\sqrt{d_k}}\right)$$
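As a concrete illustration, here is a minimal NumPy sketch of the two steps above. The function name `attention_weights` and the toy score values are assumptions for illustration, not part of any particular library:

```python
import numpy as np

def attention_weights(scores: np.ndarray, d_k: int) -> np.ndarray:
    """Turn raw attention scores into attention weights: scale, then softmax."""
    scaled = scores / np.sqrt(d_k)                        # step 1: divide by sqrt(d_k)
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)          # step 2: softmax, each row sums to 1

# Toy example: one query attending over four keys, with d_k = 64 assumed
scores = np.array([[8.0, 6.5, 5.0, 4.0]])
weights = attention_weights(scores, d_k=64)
print(weights, weights.sum())  # non-negative weights that sum to 1
```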

Properties

Interpretability

Because they are normalized, we can make statements like: “When the query is ‘Journey’, the model pays 20% attention to ‘starts’, 15% to ‘your’, and only 9% to ‘one’”.
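To make that reading concrete, a single row of the weight matrix can be printed as percentages over the input tokens. The sentence, token list, and the exact values below (beyond the 20%, 15%, and 9% quoted above) are illustrative assumptions, not the output of a real model:

```python
# Hypothetical weight row for the query "journey" over an illustrative sentence
tokens = ["Your", "journey", "starts", "with", "one", "step"]
weights_row = [0.15, 0.25, 0.20, 0.16, 0.09, 0.15]  # sums to 1.0

for token, weight in zip(tokens, weights_row):
    print(f"{token}: {weight:.0%}")  # e.g. "starts: 20%", "one: 9%"
```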
