Input-target pairs are the fundamental data structure used for training Large Language Models (LLMs). Training is self-supervised: the supervision signal is derived from the text itself rather than from hand-labeled data. In this setup, the “input” is a sequence of tokens, and the “target” is the token that immediately follows that sequence.
Concept
The goal is to train the model to predict the next word given a context.
- Input: A sequence of text (e.g., “LLMs learn to”).
- Target: The immediate next word (e.g., “predict”).
This process is repeated for every position in the text, creating multiple training examples from a single sentence.
Example
Given the sentence “LLMs learn to predict one word at a time” and a context size of 4, sliding through the text yields:
- Input: “LLMs” Target: “learn”
- Input: “LLMs learn” Target: “to”
- Input: “LLMs learn to” Target: “predict”
- Input: “LLMs learn to predict” Target: “one”
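These pairs can be generated with a short loop. Here is a minimal word-level sketch (real pipelines slide over token IDs rather than words; the sentence and context size are taken from the example above):
sentence = "LLMs learn to predict one word at a time"
words = sentence.split()  # word-level stand-in for real tokenization
context_size = 4

for i in range(1, context_size + 1):
    context = words[:i]  # everything seen so far
    target = words[i]    # the word that follows
    print(" ".join(context), "-->", target)
Each iteration extends the context by one word, so this single sentence already yields four training examples.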
In code, this is often implemented by creating two variables, x (input) and y (target), where y is simply x shifted by one position.
x = [290, 4920, 2241, 287] # Input tokens
y = [4920, 2241, 287, 257] # Target tokens (shifted by 1)
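As a minimal runnable sketch, assuming enc_sample holds tokenizer output (the IDs below are the ones from the snippet above), x and y are just two overlapping windows over the same sequence:
enc_sample = [290, 4920, 2241, 287, 257]  # token IDs, e.g. from a BPE tokenizer
context_size = 4

x = enc_sample[:context_size]       # the first context_size tokens
y = enc_sample[1:context_size + 1]  # the same window, shifted right by one
print(f"x: {x}")  # [290, 4920, 2241, 287]
print(f"y: {y}")  # [4920, 2241, 287, 257]
During training, the model sees x and is optimized so that, at each position, it assigns high probability to the corresponding entry of y.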
