Attention Weights are the normalized version of Attention Scores used to compute the final Context Vector. They represent the percentage or fraction of attention the model should pay to each input token for a given Query.
Calculation
Attention weights are derived from attention scores through a two-step process in Scaled Dot-Product Attention (sketched in code below):
- Scaling: scores are divided by $\sqrt{d_k}$ (where $d_k$ is the Key dimension) to ensure stability.
- Normalization: Softmax Normalization is applied to ensure the weights sum to 1.
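A minimal NumPy sketch of these two steps, assuming queries and keys are stored row-wise with Key dimension `d_k`; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys):
    """Turn raw attention scores into attention weights.

    queries: (num_queries, d_k), keys: (num_keys, d_k)
    Returns a (num_queries, num_keys) matrix whose rows sum to 1.
    """
    d_k = keys.shape[-1]
    scores = queries @ keys.T          # raw attention scores (dot products)
    scaled = scores / np.sqrt(d_k)     # step 1: scale by sqrt(d_k)
    return softmax(scaled, axis=-1)    # step 2: normalize each row with softmax

# Toy example with made-up embeddings (2 queries, 4 keys, d_k = 3).
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3))
K = rng.normal(size=(4, 3))
W = attention_weights(Q, K)
print(W.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```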
Properties
- Sum to One: Weights across all inputs for a single query must sum to exactly 1.0 (or 100%). This makes them interpretable as probabilities.
- Positivity: Softmax ensures all weights are strictly positive; see the quick check below.
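Both properties follow directly from the softmax and can be verified on any single query's weights; a self-contained check with arbitrary placeholder scores (the values carry no meaning):

```python
import numpy as np

scores = np.array([0.9, 1.4, -0.3, 0.2])         # made-up scaled scores for one query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over that query's scores

assert np.isclose(weights.sum(), 1.0)  # sums to 1 (up to floating-point error)
assert (weights > 0).all()             # every weight is strictly positive
```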
Interpretability
Because they are normalized, we can make statements like: “When the query is ‘Journey’, the model pays 20% attention to ‘starts’, 15% to ‘your’, and only 9% to ‘one’”.
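A small formatting sketch of that kind of statement, using the running "Your journey starts with one step" sentence; the weight values below are hypothetical placeholders, not actual model output and not the percentages quoted above:

```python
tokens  = ["Your", "journey", "starts", "with", "one", "step"]
weights = [0.14, 0.24, 0.23, 0.12, 0.10, 0.17]  # placeholder weights that sum to 1

# Print each weight as the percentage of attention the query "journey" pays to that token.
for token, w in zip(tokens, weights):
    print(f"journey -> {token}: {w:.0%}")
```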
