Multi-head Attention

The Multi-head Attention Wrapper is an implementation pattern that stacks multiple instances of a causal self-attention module to form a multi-head attention mechanism: each instance (head) processes the same input independently, and the per-head outputs are combined into a single result.

Implementation Details
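
A minimal PyTorch sketch of the pattern described above is shown below. The class and parameter names (CausalAttention, MultiHeadAttentionWrapper, d_in, context_length, dropout, qkv_bias) are illustrative assumptions rather than names taken from the original module; the intent is only to show how a wrapper can hold several single-head causal attention instances and concatenate their outputs.

```python
import torch
import torch.nn as nn


class CausalAttention(nn.Module):
    """Single-head causal self-attention (illustrative sketch)."""

    def __init__(self, d_in, d_out, context_length, dropout=0.0, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future positions.
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1),
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        queries = self.W_query(x)
        keys = self.W_key(x)
        values = self.W_value(x)

        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], float("-inf")
        )
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1] ** 0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)
        return attn_weights @ values  # (b, num_tokens, d_out)


class MultiHeadAttentionWrapper(nn.Module):
    """Stacks several causal attention instances to form multi-head attention."""

    def __init__(self, d_in, d_out, context_length, dropout=0.0,
                 num_heads=2, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
             for _ in range(num_heads)]
        )

    def forward(self, x):
        # Each head returns (b, num_tokens, d_out); concatenating along the
        # last dimension yields (b, num_tokens, num_heads * d_out).
        return torch.cat([head(x) for head in self.heads], dim=-1)
```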

Example

If num_heads=2 and d_out=2, the wrapper creates two causal attention instances, each producing 2-dimensional context vectors. If the input is a batch of sentences, the wrapper processes it through both instances and concatenates the results along the last dimension, yielding num_heads * d_out = 4-dimensional context vectors per token.
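
Continuing the sketch above, a possible usage for this configuration might look as follows; the batch shape and embedding size are illustrative assumptions.

```python
# Uses the MultiHeadAttentionWrapper sketched under Implementation Details.
torch.manual_seed(123)

batch = torch.randn(2, 6, 3)  # 2 sentences, 6 tokens each, 3-dim embeddings
mha = MultiHeadAttentionWrapper(
    d_in=3, d_out=2, context_length=6, dropout=0.0, num_heads=2
)
context_vecs = mha(batch)
print(context_vecs.shape)  # torch.Size([2, 6, 4]) -> num_heads * d_out = 4
```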
