Broadcasting is a mechanism in libraries like PyTorch and NumPy that allows operations on tensors of different shapes by automatically expanding their dimensions to match.
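For instance, a minimal PyTorch sketch of the general rule (dimensions are aligned from the right, and missing or size-1 dimensions are expanded); the specific values are illustrative:

```python
import torch

# A (2, 3) matrix plus a (3,) vector: the vector is broadcast across both rows.
matrix = torch.tensor([[1., 2., 3.],
                       [4., 5., 6.]])
vector = torch.tensor([10., 20., 30.])

result = matrix + vector   # shape (2, 3); the vector is added to each row
print(result)
# tensor([[11., 22., 33.],
#         [14., 25., 36.]])
```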
Context in LLMs
In the context of creating Input Embeddings, broadcasting is used to add Positional Embeddings (which typically have a smaller shape corresponding to the Context Size) to the batched Token Embeddings.
Example Calculation
- Token Embeddings Shape: [Batch Size, Context Length, Embedding Dim] = 8 x 4 x 256
- Positional Embeddings Shape: [Context Length, Embedding Dim] = 4 x 256
- Operation: When adding these two tensors, PyTorch automatically "broadcasts" the 4 x 256 positional matrix across the 8 batches of the token embeddings.
- Effectively, the same 4 x 256 positional vectors are added to each of the 8 samples in the batch (see the sketch below).
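A minimal PyTorch sketch of this addition, assuming an illustrative vocabulary size of 50257 and hypothetical embedding layers; only the shapes match the example above:

```python
import torch

batch_size, context_length, embedding_dim = 8, 4, 256
vocab_size = 50257  # illustrative assumption

# Hypothetical embedding layers mirroring the shapes above.
token_embedding_layer = torch.nn.Embedding(vocab_size, embedding_dim)
pos_embedding_layer = torch.nn.Embedding(context_length, embedding_dim)

token_ids = torch.randint(0, vocab_size, (batch_size, context_length))
token_embeddings = token_embedding_layer(token_ids)                  # (8, 4, 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))   # (4, 256)

# Broadcasting: the (4, 256) positional matrix is added to every sample in the batch.
input_embeddings = token_embeddings + pos_embeddings                 # (8, 4, 256)
print(input_embeddings.shape)  # torch.Size([8, 4, 256])
```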
Application in Softmax
In the coding implementation of the Softmax Activation Function, broadcasting is critical for operations like subtracting the maximum value or dividing by the sum.
- Keep Dimensions: When calculating the max or sum along a specific axis (e.g., `axis=1` for rows), it is crucial to use `keepdims=True`. This retains the dimensions (e.g., keeping the shape as `(3, 1)` instead of reducing it to `(3,)`), allowing the result to be correctly broadcast back against the original `(3, 3)` matrix for element-wise subtraction or division (see the sketch after this list).
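A minimal NumPy sketch of this pattern; the 3 x 3 `logits` matrix is an illustrative assumption:

```python
import numpy as np

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    # keepdims=True keeps the shape (3, 1) instead of (3,), so the result
    # broadcasts back against the (3, 3) matrix element-wise.
    shifted = x - np.max(x, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [1.0, 3.0, 0.2],
                   [0.5, 0.5, 0.5]])
probs = softmax(logits)
print(probs.sum(axis=1))  # each row sums to 1.0
```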
