Causal Attention

Causal Attention, also known as Masked Attention, is a specialized form of self-attention that restricts each token to attend only to itself and the tokens that precede it in the sequence.

This restriction is crucial for autoregressive tasks such as text generation in models like GPT, where the model must not have access to future tokens during training.

In contrast to standard self-attention, which allows every token to attend to the entire input sequence, causal attention masks out future tokens, ensuring that for a given query only tokens at or before the current position contribute to the attention output.
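The masking itself is usually implemented by setting the scores of future positions to negative infinity before the softmax, so those positions receive exactly zero weight. Here is a minimal NumPy sketch (the function name `causal_mask` and the 4-token size are illustrative, not from the original text):

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks the future positions to hide.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

scores = np.random.randn(4, 4)  # raw attention scores (query x key)
masked = np.where(causal_mask(4), -np.inf, scores)

# softmax: exp(-inf) = 0, so future tokens get exactly zero weight
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Because the mask is applied before normalization, each row of `weights` still sums to 1 over the allowed (past and current) positions.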

Implementation Concept

In the attention weight matrix, this is visualized as keeping only the elements on or below the diagonal of each row. For example, when processing the token “Journey” (2nd token), the model attends only to “Your” (1st) and “Journey” (2nd), masking out “Starts”, “with”, “one”, and “step”.
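The “Journey” example can be worked end to end with a toy causal self-attention in NumPy. The embedding size, random projection matrices, and seed below are illustrative assumptions, not values from the original text:

```python
import numpy as np

np.random.seed(0)
tokens = ["Your", "Journey", "Starts", "with", "one", "step"]
d = 8
x = np.random.randn(len(tokens), d)           # toy token embeddings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product scores
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)      # hide future tokens

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
context = weights @ V                         # causal context vectors
```

Row 1 of `weights` (the “Journey” row) has nonzero entries only for “Your” and “Journey”; the four future positions are exactly zero, matching the diagonal picture above.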
