Scaled Dot-Product Attention

Scaled Dot-Product Attention is a specific type of Self-Attention Mechanism used in the Transformer Architecture, where the Attention Scores are scaled by the inverse square root of the dimension of the keys ($d_k$) before applying the Softmax Normalization.

This scaling is critical to prevent the dot products from growing too large in magnitude, which would push the softmax function into regions with extremely small gradients (vanishing gradients), making optimization difficult.
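A quick numerical check makes this concrete. The sketch below (hypothetical random tensors; PyTorch is assumed, matching the sample code later in this entry) compares the softmax of raw dot products with the scaled version for a single query attending over a handful of keys:

import torch

torch.manual_seed(0)
d_k = 256

# Hypothetical setup: one query attending over 8 random keys
query = torch.randn(d_k)
keys = torch.randn(8, d_k)

# Raw dot products have a standard deviation around sqrt(d_k)
scores = keys @ query

# Without scaling, the softmax typically collapses toward a one-hot vector
unscaled = torch.softmax(scores, dim=-1)

# With 1/sqrt(d_k) scaling, the distribution stays soft
scaled = torch.softmax(scores / d_k**0.5, dim=-1)

print(unscaled)  # most of the mass on a single key -> near-zero gradients
print(scaled)    # mass spread across keys -> useful gradients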

Mechanism

The process involves:

  1. Computing Attention Scores by taking the dot product of queries and keys.
  2. Scaling these scores by $\frac{1}{\sqrt{d_k}}$.
  3. Applying Softmax to obtain Attention Weights.
  4. Computing the Context Vector as the weighted sum of Values.

Mathematical Formulation

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Sample Code

import torch

# Calculate attention scores as the dot product of queries and keys
attn_scores = queries @ keys.T

# Scale by sqrt(d_k) and apply softmax to get attention weights
attn_weights = torch.softmax(
    attn_scores / keys.shape[-1]**0.5, dim=-1
)

# Compute the context vector as the weighted sum of values
context_vec = attn_weights @ values
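
The snippet assumes queries, keys, and values already exist as 2-D tensors of shape (num_tokens, d_k). A minimal way to exercise it with hypothetical random inputs:

import torch

torch.manual_seed(0)
num_tokens, d_k = 6, 4

# Hypothetical projected inputs; in a real model these come from learned
# query/key/value weight matrices applied to the token embeddings.
queries = torch.randn(num_tokens, d_k)
keys = torch.randn(num_tokens, d_k)
values = torch.randn(num_tokens, d_k)

attn_scores = queries @ keys.T
attn_weights = torch.softmax(attn_scores / keys.shape[-1]**0.5, dim=-1)
context_vec = attn_weights @ values

print(attn_weights.sum(dim=-1))  # each row of attention weights sums to 1
print(context_vec.shape)         # torch.Size([6, 4])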
