Turn Non-Reasoning LLMs into Reasoning LLMs
In January 2025, DeepSeek released DeepSeek R1, a model that fundamentally changed how we approach reasoning in Large Language Models. Unlike previous “Reasoning” models that relied heavily on massive amounts of Supervised Fine-Tuning (SFT) data (like “Chain of Thought” traces), R1 demonstrated that strong reasoning capabilities could emerge purely through Reinforcement Learning (RL).
This section covers how we can replicate this phenomenon—turning a standard “instruction-following” model (like Qwen 2.5) into a “reasoning” model that pauses, thinks, and self-corrects before answering.
The “Aha!” Moment & RLVR
The most fascinating discovery from DeepSeek’s research (specifically the DeepSeek R1-Zero experiment) was that when an LLM is incentivized correctly using RL, it autonomously learns to:
- Think (Test-Time Compute): Spend more time processing the problem before generating an answer.
- Self-Correction: Re-evaluate its own steps and backtrack if it detects an error (the “Aha!” moment).
They achieved this using a method called RLVR (Reinforcement Learning with Verifiable Rewards). Instead of training a complex, black-box Reward Model (which is standard in RLHF), they used deterministic, verifiable outcomes.
For example, in math problems (like the GSM8K dataset):
- Is the answer correct? (Checked via a simple rule or calculator)
- Is the format correct? (Did the model use the required <think> tags?)
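In code, both of these checks are a few lines of deterministic logic rather than a learned model. Here is a minimal sketch (the function name, tag handling, and score weights are my own illustration, not DeepSeek's exact implementation):

import re

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: deterministic checks, no reward model involved."""
    reward = 0.0
    # Format check: did the model wrap its reasoning in <think> ... </think>?
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.5
    # Correctness check: does the text after the thinking block contain the gold answer?
    final_part = completion.split("</think>")[-1]
    if gold_answer.strip() in final_part:
        reward += 1.0
    return reward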
Group Relative Policy Optimization (GRPO)
To train this efficiently, we use GRPO. Traditional PPO (Proximal Policy Optimization) requires a “Value Function” model (Critic) which doubles the memory requirements. GRPO simplifies this by:
- Generating a Group of outputs for the same prompt (e.g., 4-8 different responses).
- Calculating the mean reward of that group.
- Using the relative advantage of each response compared to the group average to update the model.
If a response is better than the group average, it gets a positive signal. If worse, a negative one. No extra Value Model needed.
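In code, the whole group-relative signal is just a little arithmetic over the group's rewards. Here is a minimal sketch (the function name and the example rewards are illustrative; the small eps guards against a zero standard deviation):

def group_relative_advantages(rewards, eps=1e-4):
    """Each reward's advantage = its distance from the group mean, scaled by the group's spread."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# A group of 4 sampled responses: the above-average ones get positive signals
print(group_relative_advantages([2.0, 0.0, 2.5, 0.5]))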
Implementation
I used the Qwen 2.5 (3B) model and fine-tuned it on the GSM8K dataset (grade-school math problems) using the MLX framework for efficiency on Apple silicon.
1. The Training Template
Crucially, we don’t tell the model how to think. We just provide a structure that forces it to separate its “internal monologue” from its final answer.
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""
2. Reward Functions
We define two simple reward functions: one for correctness and one for format.
import re
# 1. Correctness Reward: Does the final answer match the ground truth?
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    # GSM8K answers are often just numbers or short phrases
    gold = answer[0] if isinstance(answer, (list, tuple)) else answer
    if gold is None:
        return [0.0] * len(extracted_responses)
    return [2.0 if r == gold else 0.0 for r in extracted_responses]
# 2. Strict Format Reward: Must strictly follow the XML structure
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]
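Note that correctness_reward_func calls an extract_xml_answer helper that isn't shown above. A minimal version, plus a quick sanity check of both reward functions, could look like this (the helper's exact implementation is my assumption; the dummy completion is just for illustration):

def extract_xml_answer(text: str) -> str:
    """Return whatever sits between the <answer> and </answer> tags."""
    answer = text.split("<answer>")[-1].split("</answer>")[0]
    return answer.strip()

# Quick sanity check with a dummy completion in the expected structure
completion = [{"content": "<reasoning>\n12 + 24 = 36; 120 - 36 = 84; 84 / 2 = 42.\n</reasoning>\n<answer>\n42\n</answer>\n"}]
print(correctness_reward_func(None, [completion], "42"))  # [2.0]
print(strict_format_reward_func([completion]))            # [0.5]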
3. The GRPO Training Loop
Step 1: Verification (Sampling)
For a single prompt, we generate a group of G outputs (here, G = 3).
Question: Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?
- Response A: <reasoning>Yesterday she read 12 pages. Today she read 12*2=24 pages. Total read is 12+24=36. Remaining is 120-36=84. Half of remaining is 84/2=42.</reasoning><answer>42</answer> (Correct Answer + Correct Format)
- Response B: <reasoning>Yesterday 12. Today 12*2=24. Total 36. Remaining 120-36=84. She needs to read all of them.</reasoning><answer>84</answer> (Incorrect Answer)
- Response C: 42 (Correct Answer + Poor Format)
Step 2: Evaluation (Rewards)
We calculate rewards based on our functions (Correctness + Format). Let’s assume the Maximum Reward is 1.0.
- Reward A (R_A): 1.0 (Correct + Good Format)
- Reward B (R_B): 0.0 (Incorrect)
- Reward C (R_C): 0.5 (Correct + Bad Format)
Group Rewards: R = [1.0, 0.0, 0.5]
Step 3: Advantage Calculation
Instead of a Value Model predicting how good each state is, GRPO uses the Group Average as the baseline.
1. Calculate the Mean Reward (μ): μ = (1.0 + 0.0 + 0.5) / 3 = 0.5
2. Calculate the Standard Deviation (σ): σ = sqrt(((1.0 - 0.5)² + (0.0 - 0.5)² + (0.5 - 0.5)²) / 3) ≈ 0.41
3. Calculate the Advantage (A_i = (R_i - μ) / σ):
- Advantage A: (1.0 - 0.5) / 0.41 ≈ +1.22 (Strongly Encourage)
- Advantage B: (0.0 - 0.5) / 0.41 ≈ -1.22 (Strongly Discourage)
- Advantage C: (0.5 - 0.5) / 0.41 = 0 (Neutral)
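These numbers are easy to verify with a couple of lines of NumPy (using the same population standard deviation that the training code below uses via rewards.std()):

import numpy as np

rewards = np.array([1.0, 0.0, 0.5])                          # Responses A, B, C
advantages = (rewards - rewards.mean()) / rewards.std()      # baseline = group mean, scale = group spread
print(rewards.mean(), rewards.std().round(3), advantages.round(2))  # 0.5 0.408 [ 1.22 -1.22  0.  ]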
Step 4: Optimization (Policy Update)
Finally, we update the model’s weights to reinforce the good behaviors we found.
1. The Intuition: Pushing Probabilities
Think of the Advantage as a "force".
- If a response was good (Advantage > 0), we want to push its probability up.
- If a response was bad (Advantage < 0), we want to push its probability down.
2. The Mechanism: Updating Logits
Inside the model, output probabilities are determined by internal scores called logits. Learning happens by nudging these logits.
Let’s trace exactly how our correct “Julie” answer gets reinforced using the values from our simulation script:
- Initial State: The model had a logit of roughly -2.20 for the correct token, which corresponds to a probability of 10.0%.
- The Signal: We calculated an Advantage of roughly +1.22 for Response A (it was better than the group average).
- The Update: We apply the gradient update rule, new logit = old logit + learning rate × Advantage; with a learning rate of 0.1 this gives -2.20 + 0.1 × 1.22 ≈ -2.08.
- The Result: The new logit (-2.08) converts to a probability of 11.1%.
The model has effectively "learned" to be 1.1 percentage points more confident in the correct answer.
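The same nudge can be reproduced as a tiny toy computation. Note the simplifying assumptions: a single logit read through a sigmoid and a learning rate of 0.1, rather than the real softmax-over-vocabulary update:

import math

def to_prob(logit: float) -> float:
    """Toy read-out: squash a single logit into a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

lr = 0.1            # assumed toy learning rate
old_logit = -2.20   # to_prob(-2.20) is roughly 0.100, i.e. 10.0%
advantage = 1.22    # Response A's group-relative advantage

new_logit = old_logit + lr * advantage                     # push the logit in the direction of the advantage
print(round(new_logit, 2), round(to_prob(new_logit), 3))   # -2.08 0.111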
3. The Loss Function
To implement this efficiently across thousands of tokens, GRPO uses a loss function derived from PPO. It is built around the ratio between the new and old probabilities:
- Ratio (r = π_new / π_old): Measures how much the probability of a response has changed since the last update.
- Objective: Maximizing (Ratio × Advantage) achieves the "pushing" effect we calculated above.
- KL Penalty: Prevents the model from changing too drastically (drift) during a single update.
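Written out for a single response o_i (ignoring the clipping term, as the worked example below also does), the per-response objective has the shape:

$$
J_i = \underbrace{\frac{\pi_\theta(o_i \mid q)}{\pi_{\text{old}}(o_i \mid q)}}_{\text{ratio } r_i} \cdot A_i \;-\; \beta \, D_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right],
\qquad
D_{\text{KL}} \approx \frac{\pi_{\text{ref}}}{\pi_\theta} - \log\frac{\pi_{\text{ref}}}{\pi_\theta} - 1
$$

The final loss is the negative of the group average of these J_i values.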
Concrete Loss Calculation: Using the probabilities we just found (Old: 0.100, New: 0.111):
- Calculate Ratio: r = 0.111 / 0.100 = 1.11
- Calculate Surrogate Objective (ignoring clipping/KL for simplicity): r × A = 1.11 × 1.22 ≈ 1.35
- Calculate KL Penalty (Drift): We check how far the new policy drifted from the reference using the approximation KL ≈ r_inv - log(r_inv) - 1, where r_inv = π_ref / π_new (here the reference is the old policy).
  - Inverse Ratio: r_inv = 0.100 / 0.111 ≈ 0.90
  - Log Term: log(0.90) ≈ -0.105
  - KL: 0.90 + 0.105 - 1 ≈ 0.005
  With a standard β = 0.04 (a common hyperparameter magnitude for the KL penalty), the penalty is 0.04 × 0.005 ≈ 0.0002.
- Final Value (for Response A): 1.35 - 0.0002 ≈ 1.35
This positive value means the model is being powerfully reinforced for this correct answer.
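The numbers above can be reproduced in a few lines (β = 0.04 is the assumed KL coefficient from the step before):

import math

p_old, p_new, advantage, beta = 0.100, 0.111, 1.22, 0.04

ratio = p_new / p_old                         # 1.11
surrogate = ratio * advantage                 # ~1.35
inv_ratio = p_old / p_new                     # ~0.90
kl = inv_ratio - math.log(inv_ratio) - 1      # ~0.005
print(round(surrogate - beta * kl, 2))        # 1.35, Response A's objective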
- The Group Average: We repeat this process for every response in the group and take the average; the training loss is the negative of that average.
  - Response A (Correct): ≈ +1.35 (Reinforced)
  - Response B (Incorrect): a negative value (Penalized; computed the same way, but with a negative advantage)
  - Response C (Neutral): ≈ 0 (Ignored, since its advantage was 0)
The optimizer tries to make that loss (a negative number here) even more negative, which is the same as maximizing the positive total objective.
Notice that unlike standard PPO, we didn't need a separate "Critic" model to estimate the baseline. The group average served as our baseline, making GRPO much more memory-efficient.
Training Loop
def train_step(batch):
    prompt = batch['prompt']

    # Step 1: Verification (Generation & Old Probs)
    # Sample G outputs from the old policy
    responses = model_old.generate(prompt, num_generations=G)
    # Store the log-probs of these responses under the old policy (before update)
    old_log_probs = model_old.get_log_probs(responses)

    # Step 2: Evaluation (Rewards)
    # Check correctness & formatting
    rewards = compute_rewards(responses, batch['answer'])

    # Step 3: Advantage Calculation
    # A_i = (R_i - Mean(R)) / Std(R)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)

    # Step 4: Optimization (Policy Update)
    # A. Get new log-probs (gradient flows here)
    new_log_probs = model.get_log_probs(responses)
    # B. Get ref log-probs (for KL penalty) - usually same as model_old
    ref_log_probs = ref_model.get_log_probs(responses)
    # C. Calculate Ratio
    ratio = exp(new_log_probs - old_log_probs)
    surrogate = ratio * advantages
    # D. Subtract KL Penalty (drift from the reference policy),
    #    using the estimator KL = exp(ref - new) - (ref - new) - 1
    kl_penalty = beta * (exp(ref_log_probs - new_log_probs) - (ref_log_probs - new_log_probs) - 1)
    objectives = surrogate - kl_penalty

    # Step 5: Aggregation (Group Average)
    loss = -objectives.mean()
    optimizer.update(loss)
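For completeness, train_step is just called in a plain loop over the dataset. A minimal sketch (gsm8k_batches is a hypothetical iterable of prompt/answer dicts; the real script also logs the loss and mean reward shown below):

for step, batch in enumerate(gsm8k_batches, start=1):  # hypothetical iterable of {'prompt', 'answer'} batches
    train_step(batch)                                  # one GRPO update, as sketched above
    if step >= 300:                                    # the run shown in the logs used 300 steps
        break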
Training Progression
Here is a snippet of the training logs showing how the model iterates. You can see the Mean Reward fluctuating as the model explores different reasoning paths.
Loading Qwen/Qwen2.5-3B-Instruct...
Fetching 9 files: 100%|█████████████████████████████| 9/9 [00:00<00:00, 226040.34it/s]
Loading GSM8K...
Starting Training...
Step 1/300 | Loss: 0.2549 | Mean Reward: 0.3110
----------------------------------------
Question: The basketball team sold 50 cupcakes for $2 each and 40 cookies for $0.5 each. Then the money from their sales was used to buy two basketballs that cost $40 each. The rest of the money was used to buy 20 bottles of energy drinks. How much does a bottle of energy drink cost?
Gold Answer: 2
Sample Response:
The basketball team sold 50 cupcakes at $2 each and 40 cookies at $0.5 each. We need to first calculate the total earnings from selling the cupcakes and cookies.
The total earnings from cupcakes is:...
Sample Rewards: array([-0.106, 0.267, 0.083, 1], dtype=float32)
----------------------------------------
Step 2/300 | Loss: 0.1916 | Mean Reward: 0.2900
----------------------------------------
Question: Dina has 60 dolls. She has twice as many dolls as Ivy. 2/3 of Ivy's dolls are collectors editions. How many collectors edition dolls does Ivy have?
Gold Answer: 20
Sample Response:
Dina has 60 dolls, which is twice as many as Ivy. Therefore, Ivy must have half the number of dolls as Dina.
So, Ivy has \( \frac{60}{2} = 30 \) dolls.
Additionally, 2/3 of Ivy's dolls are collect...
Sample Rewards: array([0.568, 0.172, 0.181, 0.239], dtype=float32)
... (omitted)
----------------------------------------
Step 9/300 | Loss: 0.1309 | Mean Reward: 0.2580
----------------------------------------
Question: In four years, Peter will be twice as old as Harriet. If Peter's age is currently half of his mother's age, who is 60, how old is Harriet now?
Gold Answer: 13
Sample Response:
Let's break down the problem into smaller parts and define the variables:
- Let \( P \) be Peter's current age.
- Let \( H \) be Harriet's current age.
- Peter's mother is currently 60 years old.
- Ac...
Sample Rewards: array([0.145, 0.428, 0.365, 0.094], dtype=float32)
----------------------------------------
Step 10/300 | Loss: -0.0394 | Mean Reward: 0.4820
----------------------------------------
Question: Hannah fills her kids' stockings with 4 candy canes, 2 beanie babies and 1 book. If she has 3 kids, how many stocking stuffers does she buy total?
Gold Answer: 21
Sample Response:
Hannah fills each stocking with 4 candy canes, 2 beanie babies, and 1 book. Therefore, the total number of items for one stocking is the sum of these items: 4 (candy canes) + 2 (beanie babies) + 1 (bo...
Sample Rewards: array([0.477, 0.35, 0.547, 0.554], dtype=float32)
... (omitted)
----------------------------------------
Step 299/300 | Loss: 0.5351 | Mean Reward: 0.4538
----------------------------------------
Question: A party store ordered 5 boxes of balloons. Each box contains 8 bags of balloons, and there are 12 balloons in each bag. How many balloons are there?
Gold Answer: 480
Sample Response:
First, we need to determine the total number of bags. Given that there are 5 boxes and each box contains 8 bags, the total number of bags is 5 * 8.
Next, each bag contains 12 balloons. Therefore, to ...
Sample Rewards: array([1, 0.123, -0.308, 1], dtype=float32)
----------------------------------------
Step 300/300 | Loss: 3.1886 | Mean Reward: 0.5498
----------------------------------------
Question: John and his two brothers decide to split the cost of an apartment. It is 40% more expensive than John's old apartment which costs $1200 per month. How much does John save per year by splitting the apartment compared to living alone?
Gold Answer: 7680
Sample Response:
First, we need to calculate the cost of the new apartment that John and his two brothers are sharing. The problem states that the new apartment is 40% more expensive than John's old apartment, which c...
Sample Rewards: array([1, 0.998, 0.375, -0.174], dtype=float32)
----------------------------------------
Comparing the initial steps to the final ones, we can see a clear trend:
- Reward Growth: The Mean Reward improved from hovering around 0.30 (Steps 1-3) to over 0.50 (Step 300). This indicates the model is generating correct answers and following the formatting rules more consistently.
- Loss Volatility: The loss fluctuates significantly (e.g., from 0.25 to 3.18), which is characteristic of RL training, where explore/exploit dynamics can lead to sudden shifts in policy updates.
- Reasoning Quality: The responses in later steps show structured "First… Next…" logical flows, matching the reward incentives.
Results: The Strawberry Test
A classic failure mode of standard LLMs is the question: “How many R’s are in Strawberry?”.
Before GRPO (Base Qwen 2.5):
There are two R's in strawberry.
After GRPO (Reasoning Model):
> <reasoning>
> To find out how many times the letter R appears in the word strawberry, we can go through the word character by character and count each occurrence.
> - s
> - t
> - r (1)
> - a
> - w
> - b
> - e
> - r (2)
> - r (3)
> - y
> </reasoning>
> <answer>
> The answer is 3.
> </answer>
The model has learned that simply guessing leads to low rewards. By breaking the problem down (even without being explicitly taught how to count characters), it achieves the correct result. This is the essence of building a Large Reasoning Model.

