The Gap Between Prediction and Usefulness
If you have been following this blog, you know how LLMs generate text: attention mechanisms, KV caches, speculative decoding, quantized weights moving through GPU memory hierarchies. We have spent considerable time understanding what happens after the model exists. This post asks a different question: how did the model learn to be useful in the first place?
A pretrained language model is a remarkable thing. It can complete sentences, mimic writing styles, and recite facts absorbed from trillions of tokens of internet text. But it cannot follow instructions. Ask it to summarize an article and it might continue the article instead. Ask it to refuse a harmful request and it will cheerfully comply. The gap between “can predict the next token” and “can be a helpful assistant” is enormous, and closing it is the job of reinforcement learning from human feedback (RLHF) and its descendants.
This post covers three techniques that bridge this gap: PPO (Proximal Policy Optimization), the original workhorse that proved RL could align language models; DPO (Direct Preference Optimization), an elegant reformulation that eliminates the reward model entirely; and GRPO (Group Relative Policy Optimization), the technique behind DeepSeek-R1’s reasoning capabilities. Each optimizes the same underlying objective (maximize reward while staying close to a reference policy) but they make fundamentally different engineering trade-offs.
We will not cover pretraining, nor will we survey every RLHF variant (KTO, SimPO, ORPO, and others exist but are beyond our scope). Instead, we will go deep on these three methods: the math, the intuition, the practical trade-offs, and the reasons each one was invented.
Where RL Fits: The Model Training Lifecycle
Before we touch any equations, let’s establish context. Training a modern LLM that can follow instructions involves three distinct phases, each with a different objective.
Phase 1: Pretraining. The model learns language by predicting the next token on a massive corpus: books, articles, code, web pages. This produces a powerful text completion engine that knows facts, grammar, and reasoning patterns but has no concept of a “conversation” or “helpfulness.” This phase consumes the vast majority of compute (months on thousands of GPUs) and produces what we call the base model.
Phase 2: Supervised Fine-Tuning (SFT). Humans write demonstration data: pairs of (instruction, ideal response). The model is trained to reproduce these demonstrations using standard cross-entropy loss:
$$\mathcal{L}_{\text{SFT}} = -\sum_{t} \log \pi_\theta(y_t \mid x, y_{\lt t})$$

This is the same next-token prediction objective from pretraining, just applied to curated instruction-response pairs. The model learns the format of helpful responses: how to structure answers, when to use code blocks, how to handle multi-turn conversations. SFT typically requires only thousands of examples and a few hours of training.
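To ground this, here is a minimal sketch of the masked cross-entropy in PyTorch, assuming the labels are already shifted and a response mask marks which tokens to train on (shapes and batching conventions here are illustrative, not any particular lab's pipeline):

```python
import torch.nn.functional as F

def sft_loss(logits, labels, response_mask):
    """Cross-entropy over response tokens only.

    logits:        (batch, seq_len, vocab) model outputs
    labels:        (batch, seq_len) target token ids (already shifted)
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for prompt/padding
    """
    per_token_nll = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        reduction="none",
    ).view(labels.shape)
    # Average the negative log-likelihood over response tokens only.
    return (per_token_nll * response_mask).sum() / response_mask.sum()
```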
But SFT has a fundamental limitation: it teaches the model to mimic demonstrations without understanding why one response is better than another. The model learns that a particular answer to “explain quantum computing” was in the training data, but it cannot distinguish between a clear explanation and a subtly misleading one. It learns format, not judgment.
Phase 3: RL-based Alignment. This is where the techniques in this post come in. Instead of showing the model what to produce, we teach it what better means. The model generates its own responses, receives feedback on quality, and updates its parameters to produce higher-quality outputs. This is reinforcement learning: the model (agent) generates text (actions), receives scores (rewards), and improves its generation strategy (policy).
The SFT model serves double duty here: it initializes the policy we will optimize, and it becomes the frozen reference policy $\pi_{\text{ref}}$ that prevents the RL-trained model from drifting too far from coherent language. This reference is crucial. Without it, the model can find degenerate ways to maximize reward that produce nonsensical text.
[Interactive figure: The Model Training Lifecycle — the three alignment paths, the models each requires, and the key trade-offs involved.]
The three methods we will examine differ in how they implement Phase 3. PPO trains a separate reward model, then runs RL against it. DPO skips the reward model by extracting the reward signal directly from preference data. GRPO replaces learned value estimates with group statistics and pairs naturally with verifiable rewards. Let’s start with the foundation they all share: the reward signal.
The Reward Signal: Bradley-Terry and What Makes It Work
The fundamental challenge of aligning language models is this: we cannot write a reward function for “helpfulness.” Unlike game-playing AI where the score is clearly defined, the quality of a text response is subjective, contextual, and multidimensional. But humans can do something simpler: given two responses to the same prompt, they can usually say which one is better.
This observation is the foundation of RLHF. Collect pairwise comparisons, then train a model to predict which response humans prefer. The mathematical framework for this is the Bradley-Terry model, originally developed in 1952 for modeling paired comparisons:

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \frac{\exp\!\big(r(x, y_w)\big)}{\exp\!\big(r(x, y_w)\big) + \exp\!\big(r(x, y_l)\big)}$$

Here $y_w$ is the preferred ("winning") response, $y_l$ is the dispreferred ("losing") one, and $\sigma$ is the logistic sigmoid.
The elegant property: only the difference in rewards matters. Adding a constant to all reward scores leaves preferences unchanged. This means the reward model only needs to learn a relative ranking, not absolute quality scores.
We train this reward model by maximizing the log-likelihood of observed human preferences:
$$\mathcal{L}_{\text{RM}} = -\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)$$

This is binary cross-entropy: we are training a classifier that says “response A is better than response B.” Architecturally, the reward model is typically the same transformer as the language model, with the language modeling head replaced by a single linear layer that maps the final hidden state to a scalar reward.
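A sketch of that pairwise loss in PyTorch, assuming the reward model has already produced one scalar score per sequence (pooling and batching details vary by implementation):

```python
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss on preference pairs.

    reward_chosen:   (batch,) scalar rewards for the preferred responses y_w
    reward_rejected: (batch,) scalar rewards for the dispreferred responses y_l
    """
    # -log sigma(r_w - r_l): push the preferred reward above the dispreferred one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```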
In practice, InstructGPT used a 6B parameter reward model to guide a 175B policy, trained on approximately 33,000 prompts with 4-9 ranked completions each. The reward model is trained for only a single epoch to avoid overfitting to the preference data, a detail that matters more than it might seem.
With a trained reward model in hand, we can now define what “better” means mathematically. The question becomes: how do we actually optimize the language model to produce higher-reward outputs?
PPO: The Four-Model Pipeline
The RLHF Objective
The goal of RLHF is captured in a single objective. Let’s walk through it symbol by symbol:
$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \big[r_\phi(x, y)\big] - \beta \cdot D_{\text{KL}}\big(\pi_\theta \| \pi_{\text{ref}}\big)$$

Reading left to right: $\max_{\pi_\theta}$ means “find the policy parameters that maximize the following expression.” $\mathbb{E}$ is the expected value, averaging over many prompts and responses. $x \sim \mathcal{D}$ means prompts are drawn from the training distribution. $y \sim \pi_\theta(\cdot \mid x)$ means responses are sampled from the current policy, not taken from a fixed dataset. $r_\phi(x, y)$ is the reward model’s score. $\beta$ is a coefficient controlling constraint strength. $D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})$ is the KL divergence measuring how far the policy has drifted from the reference.
This objective contains two opposing forces. The first term pushes toward human-preferred outputs: maximize the expected reward. The second term, the KL divergence penalty, is the guardrail that prevents reward hacking.
Reward hacking is not a theoretical concern. Without the KL constraint, models learn to game the reward model: they produce longer responses (reward models often prefer length), use confident language and bullet-point formatting (which correlates with higher human ratings), and can even produce convincing fabrications that fool the reward evaluator. Wen et al. (2024) showed that RLHF without proper regularization increases human approval ratings while simultaneously decreasing actual correctness. The KL penalty keeps the optimized policy close enough to the reference that these degenerate strategies remain unlikely.
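In many implementations the KL term is folded into the per-token reward before the PPO update ever sees it. Here is a hedged sketch of that shaping, assuming per-token log-probabilities are available; the exact KL estimator and where the reward-model score is added vary across codebases:

```python
import torch

@torch.no_grad()  # reward shaping is done outside the autograd graph
def shaped_rewards(reward_score, logprobs_policy, logprobs_ref, beta):
    """Combine a sequence-level reward-model score with a per-token KL penalty.

    reward_score:    (batch,) scalar score from the reward model per response
    logprobs_policy: (batch, seq_len) log pi_theta of each generated token
    logprobs_ref:    (batch, seq_len) log pi_ref of the same tokens
    beta:            KL coefficient
    """
    # Simple per-token KL estimate on the sampled tokens: log pi_theta - log pi_ref.
    kl = logprobs_policy - logprobs_ref
    rewards = -beta * kl                 # penalize drift from the reference at every token
    rewards[:, -1] += reward_score       # add the reward-model score at the final token
    return rewards
```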
The PPO Clipped Surrogate
The RLHF objective tells us what to optimize. PPO tells us how. The challenge is that policy gradient methods are notoriously unstable. A single large update can destroy the policy, and recovery is difficult. PPO solves this with a clipping mechanism that limits how much any single update can change the policy.
First, we define the probability ratio, how much the policy’s opinion of a particular token has changed:
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

When $r_t = 1.0$, the policy hasn’t changed its probability for this token. When $r_t = 1.5$, the token is 50% more likely under the new policy. When $r_t = 0.6$, it is 40% less likely. The ratio tells us the direction and magnitude of the policy shift.
The PPO clipped surrogate objective is:
$$\mathcal{L}^{\text{CLIP}} = \mathbb{E}_t \left[\min\Big(r_t(\theta) \cdot \hat{A}_t,\; \text{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big) \cdot \hat{A}_t\Big)\right]$$

Here $\hat{A}_t$ is the advantage estimate: how much better (positive) or worse (negative) this token was compared to the expected baseline. Concretely, a critic network $V_\psi(s)$ estimates the expected future reward from each state; the advantage is the difference between the actual return and this estimate. If the model produced a token that led to higher reward than expected, $\hat{A}_t > 0$ (“good token, do more of this”); if the reward was lower than expected, $\hat{A}_t < 0$ (“bad token, do less of this”). The advantage is what tells PPO which direction to push; the clipping mechanism controls how far.
The clip function is a simple three-case clamp: if $r_t < 1-\varepsilon$, return $1-\varepsilon$; if $r_t > 1+\varepsilon$, return $1+\varepsilon$; otherwise return $r_t$ unchanged. With the standard $\varepsilon = 0.2$, the ratio is constrained to the range $[0.8, 1.2]$.
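Here is a per-token sketch of the clipped surrogate in PyTorch, assuming advantages and old/new log-probabilities have already been computed; it returns a loss to minimize, i.e. the negated objective:

```python
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, mask, eps=0.2):
    """Clipped surrogate loss, averaged over response tokens.

    logprobs_new, logprobs_old: (batch, seq_len) per-token log-probabilities
    advantages:                 (batch, seq_len) advantage estimates A_hat
    mask:                       (batch, seq_len) 1.0 for generated tokens
    """
    ratio = torch.exp(logprobs_new - logprobs_old)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    objective = torch.minimum(unclipped, clipped)               # pessimistic bound
    return -(objective * mask).sum() / mask.sum()               # negate: we maximize the objective
```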
The behavior of this objective follows a 2×2 matrix that is worth internalizing:
| | Advantage $\hat{A}_t > 0$ (good token) | Advantage $\hat{A}_t < 0$ (bad token) |
|---|---|---|
| Policy increases probability ($r_t > 1$) | Clip activates at $1+\varepsilon$. Caps how aggressively we reinforce. | No clipping. Full gradient to suppress this token. |
| Policy decreases probability ($r_t < 1$) | No clipping. Full gradient to reinforce this token. | Clip activates at $1-\varepsilon$. Caps how aggressively we suppress. |
The pattern reveals something important: clipping only constrains the policy when it is already moving in the right direction too aggressively. When the policy increases probability for a good token (top-left), clipping says “that’s enough reinforcement for one update.” When it decreases probability for a bad token (bottom-right), clipping says “that’s enough suppression.” But when the policy is moving in the wrong direction — decreasing a good token or increasing a bad one — the full gradient signal flows through. Clipping never protects wrong moves.
Why $\min$ and not $\max$? The $\min$ operator takes the pessimistic bound. If the clipped version yields a lower objective than the unclipped version, we take the clipped (lower) one, preventing overconfident updates. If the unclipped version is already lower (meaning the policy moved in a harmful direction), we take that instead, allowing the full corrective gradient.
Let’s trace through concrete numbers. With $\hat{A}_t = +2.0$ (a good token) and $\varepsilon = 0.2$:
- At $r_t = 1.1$: Unclipped = $1.1 \times 2.0 = 2.2$. Clipped = $1.1 \times 2.0 = 2.2$ (within bounds). $\min = 2.2$. Full gradient.
- At $r_t = 1.3$: Unclipped = $1.3 \times 2.0 = 2.6$. Clipped = $1.2 \times 2.0 = 2.4$ (capped at $1+\varepsilon$). $\min = 2.4$. Gradient is reduced.
- At $r_t = 2.0$: Unclipped = $2.0 \times 2.0 = 4.0$. Clipped = $1.2 \times 2.0 = 2.4$. $\min = 2.4$. The objective plateaus. No matter how much more likely this token becomes, the gradient contribution is capped.
This plateau is the key mechanism. The objective becomes flat beyond the clip boundary, which means the gradient is zero, so the optimizer receives no signal to push the ratio further. The policy can only change by $\pm 20\%$ per update, ensuring training stability.
[Interactive figure: PPO Clipped Surrogate Objective — how $\varepsilon$-clipping constrains policy updates within a trust region.]
An important subtlety for LLM training: clipping operates per-token, not globally. In a 512-token response, some tokens might have $r_t$ well within bounds (contributing full gradients) while others hit the clip boundary (contributing zero gradient). The overall update is a blend of these per-token signals, which produces remarkably stable training even without careful learning rate tuning.
One notable exception: DeepSeek-R1 uses $\varepsilon = 10$, which effectively disables clipping. Their group-normalized advantages (which we will see in the GRPO section) are already well-scaled, reducing the need for a tight trust region.
The Four-Model Problem
Running PPO for LLM alignment requires four models simultaneously in GPU memory:
- The policy $\pi_\theta$ — the model being trained. Requires gradients and optimizer states (2-3× the weight memory).
- The reference policy $\pi_{\text{ref}}$ — a frozen copy of the SFT model. Only forward passes, but still occupies full weight memory.
- The reward model $r_\phi$ — scores generated responses. Frozen during PPO, forward passes only.
- The value/critic network $V_\psi$ — estimates expected future reward to compute advantages $\hat{A}_t$. Requires gradients and optimizer states.
For a 7B parameter model in fp16, weights alone consume approximately 14GB per model, roughly 56GB across all four, before accounting for optimizer states (Adam stores two additional copies of the policy’s and critic’s parameters). With a batch of generated sequences in memory, the total easily exceeds 100GB for a single 7B model. Running PPO on a 70B model requires multi-node setups that only frontier labs can afford.
Beyond memory, PPO faces two systemic challenges. Distribution shift: as the policy improves, the reward model’s training data (collected from an earlier, weaker policy) becomes stale. The proxy reward keeps climbing while true human preference plateaus or declines. Gao et al. (2022) formalized this as “reward model overoptimization.” Hyperparameter sensitivity: learning rates, KL coefficients, clipping parameters, and even Adam’s epsilon require careful tuning. Huang et al. (2023) found that reward scores and loss values are poor indicators of training health; practitioners should monitor KL divergence, response length distributions, and perplexity instead.
Despite all this complexity, PPO produced the first convincing result: InstructGPT showed that a 1.3B parameter model trained with RLHF was preferred by human evaluators over the 175B parameter base GPT-3. A 130× smaller model, made more useful through alignment. The engineering was expensive, but the result was undeniable.
The Question That Sparked DPO
PPO demonstrated that RL could align language models with human preferences. But the engineering complexity was severe: four models, meticulous hyperparameter tuning, and infrastructure that only a handful of organizations could afford. Researchers began asking: could we achieve similar results without the reward model entirely?
The mathematical observation that makes this possible: the RLHF objective has a closed-form optimal policy. If we can express the reward in terms of the policy itself, we can optimize directly on preference data without a reward model, RL loop, or critic network. This insight leads to DPO.
DPO: Your Language Model Is Secretly a Reward Model
The Reparameterization That Changes Everything
Let’s start from the same KL-constrained RLHF objective we defined for PPO. Using variational calculus (or, more practically, by expanding the KL divergence and completing the algebra), we can derive the optimal policy in closed form:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \cdot \pi_{\text{ref}}(y \mid x) \cdot \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \cdot \exp(r(x, y) / \beta)$ is the partition function that ensures the distribution sums to 1.
The intuition here is direct: the optimal policy is the reference distribution “warped” by an exponential reward function. Responses with high reward get boosted in probability; responses with low reward get suppressed. The parameter $\beta$ controls how aggressive this warping is. When $\beta \to 0$, the policy collapses toward pure reward maximization; only the highest-reward response gets any probability mass. When $\beta \to \infty$, the exponential flattens and the policy stays frozen at the reference.
We can rearrange this to express the reward in terms of the policy:
$$r(x, y) = \beta \cdot \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \cdot \log Z(x)$$

This says something remarkable: the reward is fully determined by the log-ratio of optimal policy to reference policy, plus a prompt-dependent constant $Z(x)$. The reward is hiding inside the policy all along.
But $Z(x)$ is intractable. It requires summing over all possible responses to prompt $x$, every possible sequence of tokens the model could produce. For a vocabulary of 50,000 tokens and responses of even modest length, this is an astronomically large set. PPO avoids computing $Z(x)$ by using iterative approximate optimization. DPO avoids it through algebraic cancellation.
From RL to Classification in One Substitution
Here is where DPO’s elegance emerges. We substitute the implicit reward expression into the Bradley-Terry preference model. For a preferred response $y_w$ and dispreferred response $y_l$ given the same prompt $x$:
$$r(x, y_w) - r(x, y_l) = \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} + \beta \log Z(x) - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \beta \log Z(x)$$

The $\beta \log Z(x)$ terms cancel exactly. This is the critical step. $Z(x)$ depends only on the prompt, not the response, so it appears identically in both terms and drops out of the difference. The intractable partition function vanishes.
Substituting into the Bradley-Terry model and replacing the theoretical optimal policy $\pi^*$ with our trainable policy $\pi_\theta$, we get the DPO loss:
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

This is a binary cross-entropy loss. The “logit” is the difference in implicit rewards between the preferred and dispreferred responses. Each implicit reward $\beta \log(\pi_\theta / \pi_{\text{ref}})$ measures how much the current policy has shifted its probability relative to the reference, a direct proxy for how much the policy “values” that response.
During training, this loss simultaneously increases the relative probability of preferred completions and decreases the relative probability of dispreferred ones. The $\beta$ parameter controls how sharply: low $\beta$ (e.g., 0.1) allows aggressive optimization away from the reference, while high $\beta$ (e.g., 0.5) keeps updates conservative. No explicit KL penalty is needed because the reference policy appears directly in the loss function. Deviating too far automatically reduces the gradient signal through the sigmoid saturation.
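The core computation really does fit in a few lines. A sketch, assuming sequence-level log-probabilities (summed over response tokens) have already been gathered for both the policy and the frozen reference:

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss on a batch of preference pairs.

    Each argument is a (batch,) tensor of summed log-probabilities:
      policy_logp_w: log pi_theta(y_w | x)    ref_logp_w: log pi_ref(y_w | x)
      policy_logp_l: log pi_theta(y_l | x)    ref_logp_l: log pi_ref(y_l | x)
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (policy_logp_w - ref_logp_w)
    rejected_reward = beta * (policy_logp_l - ref_logp_l)
    # Binary cross-entropy on the implicit reward margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```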
A Worked Numerical Example
Let’s make this concrete. Consider a prompt $x$ = “What is the capital of France?” with two responses:
- $y_w$ (preferred): “The capital of France is Paris.”
- $y_l$ (dispreferred): “France’s capital is Berlin, a beautiful city.”
Suppose the reference policy assigns $\pi_{\text{ref}}(y_w \mid x) = 0.15$ and $\pi_{\text{ref}}(y_l \mid x) = 0.12$. The trainable policy $\pi_\theta$ starts as a copy of the reference, so initially $\pi_\theta = \pi_{\text{ref}}$. Let $\beta = 0.1$.
At initialization:
The implicit rewards are both zero:
$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.15}{0.15} = 0.1 \cdot \log 1 = 0$$

$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.12}{0.12} = 0$$

The reward difference is $0 - 0 = 0$. The loss is $-\log \sigma(0) = -\log(0.5) = \log 2 \approx 0.693$. The model has no preference, exactly what we’d expect before any training.
After one gradient step:
The gradient pushes $\pi_\theta(y_w \mid x)$ up to $0.20$ and $\pi_\theta(y_l \mid x)$ down to $0.08$:
$$\hat{r}(x, y_w) = 0.1 \cdot \log \frac{0.20}{0.15} = 0.1 \cdot 0.288 = 0.029$$

$$\hat{r}(x, y_l) = 0.1 \cdot \log \frac{0.08}{0.12} = 0.1 \cdot (-0.405) = -0.041$$

The reward difference is $0.029 - (-0.041) = 0.069$. The loss drops to $-\log \sigma(0.069) \approx 0.659$.
The model is learning without ever computing an explicit reward. The reward signal emerges from the probability shift relative to the reference. And notice the structural KL constraint at work: as $\pi_\theta$ pushes probabilities further from $\pi_{\text{ref}}$, the log-ratio grows, which eventually saturates the sigmoid and produces diminishing gradient signal. The policy naturally resists extreme deviations.
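The arithmetic is easy to verify directly, using the probabilities assumed in this example:

```python
import math

beta = 0.1
ref_w, ref_l = 0.15, 0.12            # reference probabilities for y_w, y_l
pol_w, pol_l = 0.20, 0.08            # policy probabilities after one update

r_w = beta * math.log(pol_w / ref_w)     # implicit reward of the preferred response
r_l = beta * math.log(pol_l / ref_l)     # implicit reward of the dispreferred response
margin = r_w - r_l

loss = -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log sigmoid(margin)
print(round(r_w, 3), round(r_l, 3), round(margin, 3), round(loss, 3))
# 0.029 -0.041 0.069 0.659
```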
The insight, captured in the DPO paper’s title: your language model is secretly a reward model. The reward was never eliminated; it was absorbed into the policy itself.
[Interactive figure: DPO, Learning Without a Reward Model — stepping through training shows the implicit rewards emerge.]
Where DPO Shines and Where It Falls Short
DPO’s practical advantages are substantial. Training requires only two models (policy and reference), not four. The implementation is roughly 20 lines of PyTorch on top of a standard language modeling pipeline. HuggingFace’s TRL library provides a DPOTrainer that handles the details. Major models adopted it quickly: Llama 3, Zephyr-beta, and Tulu 2 all used DPO in their alignment pipelines. DPO democratized alignment research. Any lab with a GPU and preference data could train an aligned model.
But DPO has limitations that become apparent at scale. The most fundamental is its offline nature: DPO trains on a fixed dataset of preference pairs, with no mechanism for the model to explore and discover new behaviors. As training progresses, the policy drifts from the distribution that generated the training data, but the training data cannot adapt. This is particularly problematic for tasks where the model needs to discover novel reasoning strategies.
Xu et al. (ICML 2024) conducted a systematic comparison and found that PPO consistently surpasses DPO across all tested benchmarks when properly tuned, especially on challenging code generation tasks (on CodeContest, PPO-34B achieved 22.4% while DPO-34B scored significantly lower). The gap widens on tasks that require exploration and long-horizon reasoning.
There is also a subtler issue: DPO assumes the Bradley-Terry preference model perfectly fits the data. Real human preferences can be intransitive (A > B, B > C, but C > A), context-dependent, and noisy. When these assumptions break down, DPO’s loss function can produce misleading gradients.
DPO traded RL’s complexity for supervised learning’s simplicity. The next technique we’ll examine takes a different path: keep the online RL loop, but find a cheaper way to run it.
GRPO: Grading Responses on a Curve
The Insight: Eliminate the Critic, Keep the RL Loop
DPO eliminated RL entirely but lost online exploration. GRPO takes a different approach: retain the online RL loop (the model generates responses, gets feedback, and updates) but eliminate the critic network, which is the most expensive component of PPO after the reward model.
Recall that PPO needs a critic $V_\psi(s)$ to compute advantages $\hat{A}_t$, estimating how much better each token was compared to baseline expectations. This critic is a full-sized neural network with its own gradient computation and optimizer states. GRPO’s key observation: instead of learning this baseline, we can estimate it empirically by sampling multiple responses to the same prompt and comparing them to each other.
The Mechanism: Sample, Score, Normalize
For each prompt $q$, GRPO samples $G$ completions $\{o_1, o_2, \ldots, o_G\}$ from the current policy $\pi_\theta$. Each completion is scored by a reward function, producing rewards $\{r_1, r_2, \ldots, r_G\}$. The advantage for each completion is computed via z-score normalization:
$$\hat{A}_i = \frac{r_i - \text{mean}(r_1, \ldots, r_G)}{\text{std}(r_1, \ldots, r_G)}$$

This is “grading on a curve.” Instead of evaluating each response against an absolute rubric (the learned critic), we evaluate it relative to its peers. A response that scores 0.8 when all other responses also score around 0.8 gets a near-zero advantage because it was average for this prompt. The same score of 0.8 when peers score around 0.3 earns a strongly positive advantage because it was exceptional.
The group mean serves as an empirical Monte Carlo estimate of the expected reward for this prompt, playing the same role as the learned value function $V(s)$ in PPO. More samples mean a better estimate. In practice, $G = 8$ to $G = 64$ provides sufficient accuracy without excessive compute.
The standard deviation in the denominator does something subtle but important: it rescales advantages per prompt. For hard prompts where rewards vary widely, the large standard deviation normalizes away the raw scale, producing moderate advantages regardless of the reward range. For prompts where every response earns nearly the same reward, the numerators are already close to zero, so the advantages stay small; in the degenerate case where all rewards are identical, the advantages are zero and the prompt contributes no gradient (implementations add a small epsilon to the denominator to avoid dividing by zero). This provides automatic per-prompt scaling of the learning signal without any additional hyperparameters.
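A sketch of the group-relative advantage computation, with the small epsilon mentioned above in the denominator; the group size and binary rewards are illustrative:

```python
import torch

def group_advantages(rewards, eps=1e-6):
    """Z-score rewards within a group of G responses to the same prompt.

    rewards: (G,) scalar rewards for the G sampled completions
    returns: (G,) advantages, one per completion (later broadcast over its tokens)
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0])  # e.g. binary correctness
print(group_advantages(rewards))
```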
[Interactive figure: GRPO, Group Relative Policy Optimization — sample, score, and normalize advantages within a group.]
The full GRPO objective incorporates this advantage into a clipped surrogate structure that should look familiar:
$$\mathcal{J}_{\text{GRPO}} = \mathbb{E}_{q \sim \mathcal{D}} \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\Big(\rho_t^{(i)} \hat{A}_i,\; \text{clip}\big(\rho_t^{(i)}, 1{-}\varepsilon, 1{+}\varepsilon\big) \hat{A}_i\Big) - \beta \cdot D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

This is structurally identical to PPO’s clipped surrogate. The probability ratio $\rho_t^{(i)} = \pi_\theta(o_{i,t} \mid q, o_{i,\lt t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,\lt t})$ is the same per-token ratio. The clipping mechanism works identically. The only difference is where the advantage $\hat{A}_i$ comes from: PPO estimates it with a learned critic, GRPO estimates it with group statistics. The double averaging (over group members and over tokens) combined with clipping and KL penalty gives GRPO the stability of PPO without the critic’s memory cost.
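Putting the pieces together, one GRPO loss computation might look roughly like the sketch below. The group-normalized advantages come from a helper like the one above, the KL term uses the simple log-ratio rather than the estimator used in the papers, and the beta value is an assumed coefficient; real implementations add micro-batching and more careful masking:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask, eps=0.2, beta=0.04):
    """GRPO objective for one group of G completions (negated, for gradient descent).

    logp_new / logp_old / logp_ref: (G, seq_len) per-token log-probabilities
    advantages:                     (G,) group-normalized advantages, one per completion
    mask:                           (G, seq_len) 1.0 for generated tokens
    """
    adv = advantages.unsqueeze(1)                            # broadcast over tokens
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    kl = logp_new - logp_ref                                 # simple per-token KL estimate vs. reference
    per_token = surrogate - beta * kl
    # Average over tokens within each completion, then over the group.
    per_seq = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_seq.mean()
```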
GRPO + RLVR: The Reasoning Revolution
GRPO’s natural partner is Reinforcement Learning from Verifiable Rewards (RLVR). These are tasks where correctness can be checked deterministically: math problems have right answers, code must pass test cases, logic puzzles have verifiable solutions. For these tasks, the reward function is a simple rule (correct or incorrect) requiring no learned reward model at all.
Rule-based rewards are immune to reward hacking. There is no neural network to exploit, no proxy to overoptimize. The reward is ground truth. This makes GRPO + RLVR an extraordinarily clean training setup: sample responses, check if they are correct, normalize advantages within the group, update the policy. Two models in memory (policy and reference), a deterministic reward function, and online exploration.
DeepSeek-R1 demonstrated how powerful this combination can be. Its reward function was remarkably simple:
$$R = R_{\text{accuracy}} + R_{\text{format}}$$

where $R_{\text{accuracy}}$ is binary (1 if the final answer matches the ground truth, 0 otherwise, verified by regex matching) and $R_{\text{format}}$ enforces structured reasoning with `<think>...</think>` and `<answer>...</answer>` tags. That’s it. No neural reward model, no human preference data for the RL stage.
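A sketch of what such a rule-based reward could look like; the tag names follow the description above, but the regex patterns and the 0.5 format bonus are illustrative choices, not DeepSeek's actual implementation:

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Rule-based reward: format bonus plus binary accuracy, no learned model."""
    reward = 0.0

    # Format reward: reasoning and answer must be wrapped in the expected tags.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        reward += 0.5

    # Accuracy reward: extract the final answer and compare against ground truth.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        reward += 1.0

    return reward
```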
The results with DeepSeek-R1-Zero — trained with GRPO and no supervised fine-tuning at all — were striking: 71.0% on AIME 2024 (matching OpenAI’s o1-preview), 97.3% on MATH-500, and a 2,029 Elo rating on Codeforces. Perhaps most remarkable was the emergent behavior: the model spontaneously developed self-correction strategies (“Wait, let me reconsider this step…”) — without any explicit training signal for reflection. This self-verification behavior emerged purely from the pressure to produce correct final answers.
DeepSeek-R1’s practical configurations: $G = 16$ responses per prompt, batches of 32 unique questions, and notably $\varepsilon = 10$, which effectively disables clipping entirely. The group-normalized advantages are already well-scaled, reducing the need for a tight trust region constraint.
Connection to REINFORCE and Variance Reduction
GRPO is best understood as a variant of REINFORCE, the simplest policy gradient algorithm, with a group-based baseline for variance reduction. Vanilla REINFORCE computes policy gradients as:
$$\nabla_\theta J = \mathbb{E}\big[R \cdot \nabla_\theta \log \pi_\theta(a \mid s)\big]$$

This has notoriously high variance because raw returns fluctuate enormously between episodes. The standard fix is to subtract a baseline $b$ from the return: $\nabla_\theta J = \mathbb{E}[(R - b) \cdot \nabla_\theta \log \pi_\theta]$. Any baseline that does not depend on the action is unbiased. PPO learns this baseline with an expensive critic network. GRPO uses the group mean instead, a sample-based estimate that improves with larger group sizes, costs no additional parameters, and requires no additional training.
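In code, the difference between vanilla REINFORCE and the group-baselined version is a single line, as in this sketch over one group of completions (sequence-level log-probabilities and scalar rewards assumed):

```python
import torch

def reinforce_loss(seq_logprobs, rewards, use_group_baseline=True):
    """REINFORCE loss over a group of completions for the same prompt.

    seq_logprobs: (G,) sum of log pi_theta over each completion's tokens
    rewards:      (G,) scalar reward for each completion
    """
    if use_group_baseline:
        baseline = rewards.mean()        # group-mean baseline, no learned critic
    else:
        baseline = 0.0                   # vanilla REINFORCE: raw returns, high variance
    return -((rewards - baseline) * seq_logprobs).mean()
```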
Choosing the Right Technique
The choice between PPO, DPO, and GRPO depends primarily on the nature of your reward signal and your computational constraints.
Use PPO when you are training on open-ended tasks (creative writing, general helpfulness, safety) where reward must come from a learned model, and you have the compute budget to support four models in memory. Despite its complexity, PPO with proper tuning remains the highest-ceiling approach. Xu et al. (ICML 2024) showed it consistently outperforms DPO on challenging benchmarks. It is the choice of frontier labs training flagship models.
Use DPO when you have high-quality paired preference data, want simple and stable training, and are working with limited compute. DPO matched or exceeded PPO on summarization benchmarks (61% GPT-4 win rate vs. PPO’s 57% on TL;DR) and is implementable with a standard supervised training pipeline. It is ideal for quick alignment passes and situations where the preference dataset covers the intended use distribution well.
Use GRPO when your task has verifiable rewards (math, code, logic, factual QA with ground-truth answers). It combines online exploration (like PPO) with low memory footprint (like DPO), and rule-based rewards eliminate reward hacking entirely. It is the standard for training reasoning models.
In practice, these techniques are often used in combination, not isolation. Llama 3 used a pipeline of SFT → Rejection Sampling → PPO → DPO, where each stage addressed different aspects of alignment. DeepSeek-R1 alternated between SFT stages, RLVR with GRPO for reasoning, and RLHF stages for general helpfulness. The techniques are complementary.
| Attribute | PPO | DPO | GRPO |
|---|---|---|---|
| Models in Memory | 4 (policy, reference, reward, critic) | 2 (policy, reference) | 2 (policy, reference) |
| Training Paradigm | Online RL | Offline supervised | Online RL |
| Reward Source | Learned reward model | Implicit (derived from preferences) | Verifiable / rule-based |
| Implementation | Complex (~1000s of lines of code) | Simple (~20 LOC core) | Moderate (~100s of lines of code) |
| Training Stability | Sensitive to hyperparameters | Very stable | Stable (with group normalization) |
| Performance Ceiling | Highest (with proper tuning) | Good (limited by offline data) | Excellent (for verifiable tasks) |
| Reward Hacking Risk | High (learned proxy) | Low (no explicit reward) | Very low (rule-based rewards) |
| Best For | General alignment, frontier models | Quick alignment, limited compute | Math, code, reasoning tasks |
The landscape continues to expand. On the DPO side, IPO removes the Bradley-Terry assumption, KTO works with binary feedback (thumbs up/down) instead of pairwise preferences, and SimPO drops the reference model entirely. On the GRPO side, DAPO addresses training instabilities with dynamic sampling, and Dr. GRPO corrects biases introduced by GRPO’s length and standard-deviation normalization. Each builds on the foundations covered here.
From Explicit Rewards to Emergent Reasoning
Let’s step back and trace the arc we have followed. PPO established that RL could align language models, using an explicit reward model to score responses and a learned critic to estimate advantages. DPO showed the reward model was unnecessary. The reward signal was implicit in the probability ratios, waiting to be extracted through a clever reparameterization. GRPO showed the critic was unnecessary too. Group statistics could replace learned value functions, especially when paired with verifiable rewards.
Each step eliminated a component that turned out to be inessential for the task at hand. What remained was the core objective: maximize expected reward while staying close to the reference. And progressively simpler ways of optimizing it.
But the most interesting result came from the simplest setup. DeepSeek-R1-Zero, trained with GRPO and binary correct/incorrect rewards, spontaneously developed multi-step reasoning, self-correction, and solution verification, capabilities that were not explicitly trained. The model learned how to think from the sole pressure to be correct. No demonstrations of reasoning. No reward for intermediate steps. Just final-answer accuracy and the group-relative advantage signal.
This suggests that the path to capable reasoning models may be less about sophisticated reward engineering and more about giving models the right optimization framework and letting them discover strategies on their own. The field is still learning which components are genuinely necessary and which are engineering artifacts of earlier approaches. These three techniques (PPO, DPO, and GRPO) represent the progression of that understanding.
References
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint.
- The original PPO paper introducing the clipped surrogate objective.
Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
- The DPO paper showing preference optimization can be reduced to classification.
Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Zhang, M., … & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint.
- Introduces GRPO and demonstrates its effectiveness for mathematical reasoning.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint.
- DeepSeek-R1 and R1-Zero results using GRPO with verifiable rewards.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
- The InstructGPT paper establishing the SFT → RM → PPO pipeline.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. NeurIPS 2017.
- Foundational work on learning reward models from human preferences.
Gao, L., Schulman, J., & Hilton, J. (2022). Scaling Laws for Reward Model Overoptimization. ICML 2023.
- Formalizes reward hacking and overoptimization in RLHF.
Xu, J., Xie, T., Zhao, A., Song, J., Wang, J., & Zhang, Y. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. ICML 2024.
- Systematic comparison showing PPO outperforms DPO when properly tuned.
Wen, Y., Zhang, Z., Jiao, H., Yang, M., Zhang, H., & Wang, G. (2024). From RLHF to RLHF: The Dilemma of Improving Human Alignment. arXiv preprint.
- Analysis of reward hacking showing approval ratings can increase while correctness decreases.
Huang, H., Zhong, H., Li, S., Yang, K., & Zitnik, M. (2023). The N Implementation Details of RLHF with PPO. arXiv preprint.
- Practical insights on PPO hyperparameter sensitivity and training diagnostics.
Bradley, R. A., & Terry, M. E. (1952). Rank Analysis of Incomplete Block Designs: I. The Method of Paired Comparisons. Biometrika, 39(3/4), 324-345.
- The original paired comparison model adapted for preference learning.
Yu, Q., Zhang, H., Shao, Z., Guo, D., Zhu, Q., & Lu, H. (2025). DAPO: An Open-Source LLM Reinforcement Learning System. arXiv preprint.
- Addresses GRPO training instabilities with dynamic sampling and clip-higher strategy.