Tokenization is the process of breaking text down into smaller units, known as tokens, typically individual words or subwords. It is the first step in the data preparation pipeline for building a Large Language Model (LLM).
Process: Tokenization typically involves three steps (sketched in code after this list):
- Splitting the text into individual word or subword tokens.
- Converting these tokens into token IDs.
- Encoding these token IDs into vector embedding representations.
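As a minimal sketch of these three steps, the snippet below uses a toy whitespace tokenizer, a hypothetical hand-built vocabulary, and a small random embedding matrix; a real LLM pipeline would use a learned subword tokenizer (e.g. BPE) and a trained embedding layer instead.

```python
import random

# Step 1: split text into tokens (toy whitespace tokenizer;
# real pipelines use subword schemes such as BPE).
def tokenize(text: str) -> list[str]:
    return text.split()

# Hypothetical vocabulary mapping each known token to an ID.
vocab = {"This": 0, "is": 1, "an": 2, "example": 3}

# Step 2: convert tokens into token IDs via the vocabulary.
def to_ids(tokens: list[str]) -> list[int]:
    return [vocab[tok] for tok in tokens]

# Step 3: look up a vector embedding for each token ID.
# The embedding matrix here is random; in an LLM it is learned.
EMBED_DIM = 4
random.seed(0)
embedding_matrix = [
    [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
    for _ in range(len(vocab))
]

def embed(ids: list[int]) -> list[list[float]]:
    return [embedding_matrix[i] for i in ids]

tokens = tokenize("This is an example")   # ['This', 'is', 'an', 'example']
ids = to_ids(tokens)                      # [0, 1, 2, 3]
vectors = embed(ids)                      # four 4-dimensional vectors
print(tokens, ids, vectors, sep="\n")
```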
Example: For the input “This is an example”:
- Step 1: “This”, “is”, “an”, “example” (Tokens).
- Step 2: 14013, 2011, 302, 1134 (Token IDs; the exact values depend on the tokenizer's vocabulary).
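To reproduce this kind of mapping with a real tokenizer, the example below uses the tiktoken package with its GPT-2 encoding (assuming tiktoken is installed); the resulting IDs will differ from the numbers above, since every tokenizer has its own vocabulary.

```python
import tiktoken  # pip install tiktoken

# Load a real BPE tokenizer (GPT-2's vocabulary).
enc = tiktoken.get_encoding("gpt2")

# Steps 1 and 2 together: text -> subword tokens -> token IDs.
ids = enc.encode("This is an example")
print(ids)              # token IDs from GPT-2's vocabulary

# Decoding maps the IDs back to the original text.
print(enc.decode(ids))  # "This is an example"
```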
