Tokenization is the process of breaking text down into smaller units, known as tokens, typically individual words or subwords. It is the first step in the data preparation pipeline for building a Large Language Model (LLM).
Process: Tokenization typically involves three steps (sketched in code after this list):
- Splitting the text into individual word or subword tokens.
- Converting these tokens into token IDs.
- Encoding these token IDs into vector embedding representations.
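As a minimal sketch of these three steps, the snippet below uses a toy whitespace tokenizer, a hypothetical hand-built vocabulary, and a small random embedding matrix; a real LLM pipeline would use a learned subword tokenizer (e.g. BPE) and a trained embedding layer instead.

```python
import random

# Step 1: split text into tokens (toy whitespace tokenizer;
# real pipelines use subword schemes such as BPE).
def tokenize(text: str) -> list[str]:
    return text.split()

# Hypothetical vocabulary mapping each known token to an ID.
vocab = {"This": 0, "is": 1, "an": 2, "example": 3}

# Step 2: convert tokens into token IDs via the vocabulary.
def to_ids(tokens: list[str]) -> list[int]:
    return [vocab[tok] for tok in tokens]

# Step 3: look up a vector embedding for each token ID.
# The embedding matrix here is random; in an LLM it is learned.
EMBED_DIM = 4
random.seed(0)
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab))
]

def embed(ids: list[int]) -> list[list[float]]:
    return [embedding_matrix[i] for i in ids]

tokens = tokenize("This is an example")   # ['This', 'is', 'an', 'example']
ids = to_ids(tokens)                      # [0, 1, 2, 3]
vectors = embed(ids)                      # four 4-dimensional vectors
print(tokens, ids, vectors, sep="\n")
```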
Example: For the input “This is an example”:
- Step 1: “This”, “is”, “an”, “example” (Tokens).
- Step 2: 14013, 2011, 302, 1134 (Token IDs; the exact values depend on the tokenizer's vocabulary).
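To reproduce this kind of mapping with a real tokenizer, the example below uses the tiktoken package with its GPT-2 encoding (assuming tiktoken is installed); the resulting IDs will differ from the numbers above, since every tokenizer has its own vocabulary.

```python
import tiktoken  # pip install tiktoken

# Load a real BPE tokenizer (GPT-2's vocabulary).
enc = tiktoken.get_encoding("gpt2")

# Steps 1 and 2 together: text -> subword tokens -> token IDs.
ids = enc.encode("This is an example")
print(ids)              # token IDs from GPT-2's vocabulary

# Decoding maps the IDs back to the original text.
print(enc.decode(ids))  # "This is an example"
```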
