Tokenization

Tokenization is the process of breaking down text into individual words or subwords, known as tokens. It is the first step in the data preparation pipeline for building a Large Language Model.

Process: Tokenization typically involves three steps (see the sketch after the list):

  1. Splitting the text into individual word or subword tokens.
  2. Converting these tokens into token IDs.
  3. Encoding these token IDs into vector embedding representations.
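A minimal sketch of the pipeline, assuming the tiktoken library and PyTorch are installed; the "gpt2" encoding and the embedding dimension are illustrative choices, not prescribed here:

```python
import tiktoken  # BPE tokenizer library (assumed available)
import torch
import torch.nn as nn

text = "This is an example"

# Steps 1-2: split the text into subword tokens and map them to token IDs.
enc = tiktoken.get_encoding("gpt2")            # illustrative encoding choice
token_ids = enc.encode(text)
tokens = [enc.decode([i]) for i in token_ids]  # recover the token strings
print(tokens)     # e.g. ['This', ' is', ' an', ' example']
print(token_ids)  # the ID values depend on the encoding's vocabulary

# Step 3: look up a vector embedding for each token ID.
# The embedding dimension (256) is an arbitrary illustrative value.
embedding = nn.Embedding(num_embeddings=enc.n_vocab, embedding_dim=256)
vectors = embedding(torch.tensor(token_ids))
print(vectors.shape)  # torch.Size([4, 256]) for this input
```

In a real model, the embedding layer is part of the network and its weights are learned during training; here it is randomly initialized only to show the lookup.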

Example: For the input “This is an example”, a simple word-level tokenizer would produce:

  1. Tokens: “This”, “is”, “an”, “example”
  2. Token IDs (illustrative): 0, 1, 2, 3
  3. Vector embeddings: one learned vector per token ID
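A toy word-level tokenizer makes the first two steps concrete; the vocabulary below is hypothetical and covers only this sentence:

```python
# Hypothetical toy vocabulary mapping each word to an illustrative token ID.
vocab = {"This": 0, "is": 1, "an": 2, "example": 3}

def tokenize(text: str) -> list[int]:
    """Split on whitespace (step 1) and map each word to its token ID (step 2)."""
    return [vocab[word] for word in text.split()]

print(tokenize("This is an example"))  # [0, 1, 2, 3]
```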
