Attention Weights are the normalized version of Attention Scores used to compute the final Context Vector. They represent the percentage or fraction of attention the model should pay to each input token for a given Query.
Calculation
Attention weights are derived from attention scores through a two-step process in Scaled Dot-Product Attention (sketched in code below):
- Scaling: scores are divided by $\sqrt{d_k}$ (where $d_k$ is the Key dimension) to ensure stability.
- Normalization: Softmax Normalization is applied to ensure the weights sum to 1.
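A minimal NumPy sketch of these two steps, assuming queries and keys are stored row-wise with Key dimension `d_k`; the function and variable names here are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(queries, keys):
    """Turn raw attention scores into attention weights.

    queries: (num_queries, d_k), keys: (num_keys, d_k)
    Returns a (num_queries, num_keys) matrix whose rows sum to 1.
    """
    d_k = keys.shape[-1]
    scores = queries @ keys.T          # raw attention scores (dot products)
    scaled = scores / np.sqrt(d_k)     # step 1: scale by sqrt(d_k)
    return softmax(scaled, axis=-1)    # step 2: normalize each row with softmax

# Toy example with made-up embeddings (2 queries, 4 keys, d_k = 3).
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3))
K = rng.normal(size=(4, 3))
W = attention_weights(Q, K)
print(W.sum(axis=-1))  # each row sums to 1 (up to floating-point error)
```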
Properties
- Sum to One: Weights across all inputs for a single query must sum to exactly 1.0 (or 100%). This makes them interpretable as probabilities.
- Positivity: Softmax ensures all weights are strictly positive; see the quick check below.
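Both properties follow directly from the softmax and can be verified on any single query's weights; a self-contained check with arbitrary placeholder scores (the values carry no meaning):

```python
import numpy as np

scores = np.array([0.9, 1.4, -0.3, 0.2])         # made-up scaled scores for one query
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over that query's scores

assert np.isclose(weights.sum(), 1.0)  # sums to 1 (up to floating-point error)
assert (weights > 0).all()             # every weight is strictly positive
```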
Interpretability
Because they are normalized, we can make statements like: “When the query is ‘Journey’, the model pays 20% attention to ‘starts’, 15% to ‘your’, and only 9% to ‘one’”.
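A small formatting sketch of that kind of statement, using the running "Your journey starts with one step" sentence; the weight values below are hypothetical placeholders, not actual model output and not the percentages quoted above:

```python
tokens  = ["Your", "journey", "starts", "with", "one", "step"]
weights = [0.14, 0.24, 0.23, 0.12, 0.10, 0.17]  # placeholder weights that sum to 1

# Print each weight as the percentage of attention the query "journey" pays to that token.
for token, w in zip(tokens, weights):
    print(f"journey -> {token}: {w:.0%}")
```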
