Reinforcement Learning (RL)

Reinforcement Learning (RL) is a machine learning paradigm where an Agent learns to make decisions by performing actions in an Environment and receiving feedback in the form of Rewards (positive) or Penalties (negative).

Unlike Supervised Fine-Tuning (SFT), where the model is told exactly what to output, RL tells the model how good its output was and lets it figure out the optimal strategy.
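
To make that contrast concrete, the sketch below compares the two training signals on a single token. The probabilities and the reward value are made-up numbers for illustration only.

```python
import math

p_target = 0.20   # model's probability of the human-written target token (SFT)
p_sampled = 0.20  # model's probability of a token it sampled itself (RL)
reward = 0.8      # hypothetical scalar score assigned to the sampled output

sft_loss = -math.log(p_target)           # imitate the given label exactly
rl_loss = -reward * math.log(p_sampled)  # reinforce whatever scored well
print(f"SFT loss: {sft_loss:.3f}, RL loss: {rl_loss:.3f}")
```

In SFT the model is pushed toward a specific answer; in RL it is pushed toward whatever it produced itself, in proportion to how well that output scored.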

Core Components

  1. Agent: The learner or decision maker (e.g., the LLM).
  2. Environment: The world the agent interacts with (e.g., the chat interface, a game, a simulator).
  3. Action: What the agent does (e.g., generating a token).
  4. State: The current situation (e.g., the conversation history).
  5. Reward: A scalar signal indicating success (e.g., +1 for a helpful answer, -1 for a toxic one).
  6. Policy: The strategy or rule the agent follows (the mapping from State to Action).
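
The following toy loop shows how these six pieces interact. `CorridorEnv`, `random_policy`, and the reward values are hypothetical, chosen only to show the State → Action → Reward cycle, not any real training setup.

```python
import random

# Environment: a 1-D corridor; the agent starts at position 0 and the
# episode ends when it reaches the goal position.
class CorridorEnv:
    def __init__(self, goal=4):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action: -1 (step left) or +1 (step right).
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else -0.1  # small step penalty favors short paths
        return self.state, reward, done

# Policy: a mapping from State to Action; here, a uniformly random baseline.
def random_policy(state):
    return random.choice([-1, 1])

# Agent-environment loop: act, observe the new state and reward, repeat.
env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for _ in range(50):  # cap the episode length
    action = random_policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
print(f"episode return: {total_reward:.1f}")
```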

Role in LLMs

RL is critical for the Alignment and Reasoning phases of modern LLMs:

1. Alignment (RLHF): Reinforcement Learning from Human Feedback fits a reward model to human preference data, then uses RL to steer the LLM toward responses people judge helpful and harmless.

2. Reasoning (e.g., DeepSeek R1): RL with automatically verifiable rewards (e.g., whether a math answer is correct or unit tests pass) lets a model discover long chain-of-thought strategies without step-by-step supervision.
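
To illustrate the mechanic both phases share, here is a hedged REINFORCE sketch in plain Python: a two-token "vocabulary" in which one token earns reward. The vocabulary size, reward values, and learning rate are illustrative assumptions; production pipelines use learned reward models and algorithms such as PPO or GRPO rather than this bare update rule.

```python
import math
import random

# Policy: a softmax over two logits, one per "token" in a tiny vocabulary.
logits = [0.0, 0.0]
lr = 0.5  # assumed learning rate

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

for step in range(200):
    probs = softmax(logits)
    # Sample an action (a "token") from the current policy.
    token = 0 if random.random() < probs[0] else 1
    reward = 1.0 if token == 1 else 0.0  # hypothetical reward signal
    # REINFORCE: nudge logits along reward * grad(log pi(token)),
    # which for a softmax is reward * (one_hot(token) - probs).
    for i in range(2):
        indicator = 1.0 if i == token else 0.0
        logits[i] += lr * reward * (indicator - probs[i])

print("P(rewarded token) after training:", round(softmax(logits)[1], 3))
```

Because each update scales the log-probability gradient by the reward, the rewarded token's probability climbs toward 1 over training; the model is never told which token is "correct", only how well its own choices scored.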
