Making progress step by step
Reinforcement learning is a widely used learning method. For example, it shows up in the creation of Large Language Models - “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” by the DeepSeek Team (2025) used it to introduce “reasoning” capabilities into an LLM. There is a range of RL algorithms out there, but Proximal Policy Optimization is a popular one and was used in the paper above. PPO was introduced in “Proximal Policy Optimization Algorithms” by Schulman et al. (2017). The algorithm itself is quite straightforward; however, it utilizes a “surrogate objective” whose motivation and derivation were in turn introduced in the earlier paper “Trust Region Policy Optimization” by Schulman et al. (2015).
We will go into the details of the derivation and the motivation behind the “surrogate objective” in general, as well as the modifications made for PPO in particular, to get a good feel for the reasoning behind PPO. The core derivation mostly follows “Trust Region Policy Optimization”, with some rearrangements, expansions, and fixes for typos. We will also keep the derivation self-contained - if some extra tangential result is needed, we will derive it ourselves. Familiarity with RL and linear algebra is assumed; familiarity with PPO and TRPO is not required, though it could be helpful.
As a starting point, what problem are PPO and TRPO trying to address? One in particular is that a small update to the parameters \(\theta\) might cause large changes in the resulting policy \(\tilde{\pi}(\theta)\), which can ruin our progress. So is there a way to make small steps in terms of the policy \(\tilde{\pi}\) instead of small steps in terms of the parameters \(\theta\)?
Minorization-Maximization
Our approach to finding the right steps will be the Minorization-Maximization algorithm. We would like to maximize the expected discounted reward \(\eta\), but instead we will find a “surrogate function” \(g(\tilde{\pi}, \pi)\). It is a function which, given the current policy \(\pi\) and a new policy \(\tilde{\pi}\), has the following properties $$ \tilde{\eta} \geq g(\tilde{\pi}, \pi) $$ $$ \eta = g(\pi, \pi) $$ Where \(\tilde{\eta}\) is the expected discounted reward under the new policy. If we have such a function, we can guarantee improvement. $$ \tilde{\pi}^* = \arg \max_{\tilde{\pi}} g(\tilde{\pi}, \pi) $$ $$ \tilde{\eta} \geq g(\tilde{\pi}^*, \pi) \geq g(\pi, \pi) = \eta $$ The algorithm then just runs this maximization in a loop.
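To make the minorize-maximize idea concrete, here is a small toy sketch outside the RL setting (the function and constants are made up for illustration): for a function whose derivative is \(L\)-Lipschitz, \(f(x) + f'(x)(y-x) - \frac{L}{2}(y-x)^2\) is a valid minorizer of \(f(y)\) that touches \(f\) at \(y = x\), and repeatedly maximizing it can only improve \(f\).

```python
import numpy as np

# Toy Minorization-Maximization loop (not the RL setting; the function is made up).
# For f with an L-Lipschitz derivative, the quadratic
#   g(y, x) = f(x) + f'(x) * (y - x) - (L / 2) * (y - x)**2
# satisfies g(y, x) <= f(y) for all y and g(x, x) = f(x), so maximizing it
# over y (which gives y = x + f'(x) / L) can never decrease f.

def f(x):
    return np.sin(x) - 0.05 * x**2

def f_prime(x):
    return np.cos(x) - 0.1 * x

L = 1.1          # |f''(x)| <= 1 + 0.1, so L = 1.1 works
x = -2.0         # arbitrary starting point
for _ in range(50):
    x = x + f_prime(x) / L   # exact maximizer of the minorizer g(., x)

print(x, f(x))   # f of each iterate is at least f of the previous one
```

The surrogate we are about to construct for policies plays exactly the role of \(g(y, x)\) here.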
Finding the surrogate function
Before moving forward, a few notes on notation. We will use a tilde to indicate quantities under the new policy, an over-bar to indicate a (column) vector, and an under-bar to indicate a co-vector (or in simpler terms, a “row vector”). We begin by considering the matrix \(P\), which contains the probabilities of transitioning from state \(s\) to state \(s'\); its entries are \(P_{s's} = p(s'|s)\), so each column sums to one. We also define \(\overline{\rho}\), where each element is the probability of being in the corresponding state. Then if the starting distribution is \(\overline{\rho}_0\), after one step the new distribution is \(\overline{\rho}_1 = P\overline{\rho}_0\). We can keep advancing, in which case \(\overline{\rho}_n = P^n\overline{\rho}_0\). Next, we define the reward row vector \(\underline{r}\), where each element specifies the reward for a given state, so the expected reward under distribution \(\overline{\rho}\) is \(\underline{r}\overline{\rho}\). Now, given a discount factor \(\gamma\), we can express the expected discounted reward.
$$ \eta = \sum_t \gamma^t\underline{r}\overline{\rho}_t = \underline{r} \sum_t \gamma^t P^t\overline{\rho}_0 = \underline{r}G\overline{\rho}_0 $$
Where we have defined \(G \equiv \sum_t (\gamma P)^t \). This is the good ol’ geometric series, but with matrices, and the same manipulations as for the regular geometric series apply (with some caveats on convergence of the matrix series), giving \(G \equiv \sum_t (\gamma P)^t = (1-\gamma P)^{-1} \).
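As a quick sanity check (a sketch on a made-up 3-state chain, nothing from the papers), we can verify numerically that the matrix geometric series matches the closed form, and that \(\eta = \underline{r}G\overline{\rho}_0\) agrees with summing discounted rewards term by term.

```python
import numpy as np

# Made-up 3-state Markov chain; columns of P sum to one so that
# rho_{t+1} = P @ rho_t stays a probability vector.
rng = np.random.default_rng(0)
n, gamma = 3, 0.9

P = rng.random((n, n))
P /= P.sum(axis=0, keepdims=True)     # P[s_next, s] = p(s'|s)
r = rng.random(n)                     # reward per state (used as a row vector)
rho0 = np.array([1.0, 0.0, 0.0])      # start in state 0

G_series = sum(np.linalg.matrix_power(gamma * P, t) for t in range(1000))
G_closed = np.linalg.inv(np.eye(n) - gamma * P)
print(np.allclose(G_series, G_closed))   # True: sum_t (gamma P)^t = (1 - gamma P)^{-1}

eta_closed = r @ G_closed @ rho0
eta_sum = sum(gamma**t * r @ np.linalg.matrix_power(P, t) @ rho0 for t in range(1000))
print(np.isclose(eta_closed, eta_sum))   # True: eta = r G rho_0
```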
Let us calculate the difference in expected reward under the new and old policies $$ \tilde{\eta} - \eta = \underline{r}\tilde{G}\overline{\rho}_0 - \underline{r}G\overline{\rho}_0 = \underline{r}(\tilde{G} - G)\overline{\rho}_0 $$
To find \(\tilde{G} - G\) we will start with their inverses, use the geometric series result, left- and right-multiply by \(G\) and \(\tilde{G}\), substitute the resulting expression for \(\tilde{G}\) back into itself, and use the definition \(P_{\Delta} \equiv \tilde{P} - P \). $$ G^{-1} - \tilde{G}^{-1} = (1-\gamma P) - (1-\gamma \tilde{P}) = \gamma(\tilde{P} - P) $$ $$ GG^{-1}\tilde{G} - G\tilde{G}^{-1}\tilde{G} = G\gamma P_{\Delta} \tilde{G} $$ $$ \tilde{G} - G = \gamma GP_{\Delta}\tilde{G} $$ $$ \tilde{G} = G + \gamma GP_{\Delta}\tilde{G} $$ $$ \tilde{G} = G + \gamma GP_{\Delta}(G + \gamma GP_{\Delta}\tilde{G}) $$ $$ \tilde{G} - G = \gamma GP_{\Delta}G + \gamma^2 GP_{\Delta} GP_{\Delta}\tilde{G} $$ $$ \tilde{\eta} - \eta = \underline{r}(\tilde{G} - G)\overline{\rho}_0 = \underline{r}(\gamma GP_{\Delta}G + \gamma^2 GP_{\Delta} GP_{\Delta}\tilde{G})\overline{\rho}_0 $$ $$ \tilde{\eta} - \eta = \gamma \underline{r} GP_{\Delta}G\overline{\rho}_0 + \gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0 $$
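This expansion is exact, which we can again check numerically (a sketch with two made-up transition matrices standing in for \(P\) and \(\tilde{P}\)).

```python
import numpy as np

# Check: G_new - G == gamma*G*P_delta*G + gamma^2*G*P_delta*G*P_delta*G_new
rng = np.random.default_rng(2)
n, gamma = 4, 0.95

def random_transition(n):
    P = rng.random((n, n))
    return P / P.sum(axis=0, keepdims=True)   # columns sum to one

P, P_new = random_transition(n), random_transition(n)
G = np.linalg.inv(np.eye(n) - gamma * P)
G_new = np.linalg.inv(np.eye(n) - gamma * P_new)
P_delta = P_new - P

lhs = G_new - G
rhs = gamma * G @ P_delta @ G + gamma**2 * G @ P_delta @ G @ P_delta @ G_new
print(np.allclose(lhs, rhs))   # True
```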
Let us investigate the first term \(\gamma \underline{r} GP_{\Delta}G\overline{\rho}_0\). It looks somewhat opaque, but there are familiar things in disguise. First, consider \(\underline{r}G\). Say the simulation starts in a specific state \(i\), so \(\overline{\rho}_0\) has all components zero but one. In this case the expected discounted reward is the \(i\)-th component of \(\underline{r}G\), or in other words the co-vector \(\underline{v} \equiv \underline{r}G\) is the good ol’ state value function. So, we can rewrite \(\gamma \underline{r} GP_{\Delta}G\overline{\rho}_0 = \gamma\underline{v}P_{\Delta}G\overline{\rho}_0 \). Next, consider \(G\overline{\rho}_0\) and define \(\overline{\rho} \equiv G\overline{\rho}_0\). Going back to the definition of \(G\), we observe that \(\overline{\rho} = \sum_t \gamma^t P^t\overline{\rho}_0 \). The terms \(P^t\overline{\rho}_0 \) are all probability distribution vectors, which are then discounted and summed up. It looks like \(\overline{\rho}\) is a probability distribution, but with one problem - it does not sum to one: \(\sum_i \rho_i = \sum_t \gamma^t \sum_i (P^t\overline{\rho}_0)_i = \sum_t \gamma^t = (1-\gamma)^{-1} \), as each \(P^t\overline{\rho}_0\) is still a probability vector and thus sums to one. We can remedy this by normalizing \(\overline{\rho}\), so let us redefine \(\overline{\rho} \equiv (1-\gamma) G\overline{\rho}_0\). Substituting back into our term, \(\gamma\underline{v}P_{\Delta}G\overline{\rho}_0 = \gamma (1 - \gamma )^{-1} \underline{v}P_{\Delta}\overline{\rho}\). This is looking better and better. Next, we write out the expression as a sum and recall that the components of \(P\) can be expanded in terms of the policy \(\pi\) as \(p(s'|s) = \sum_a p(s'|s,a)\pi(a|s)\), and we define \(\pi_{\Delta} \equiv \tilde{\pi} - \pi\) $$ \gamma\underline{v}P_{\Delta}\overline{\rho} = \gamma\sum_s \overline{\rho}(s) \sum_{s'} (\tilde{p}(s'|s) - p(s'|s))v(s') $$ $$ = \mathbb{E}_{s \sim \overline{\rho}}[\sum_{s'} (\sum_a p(s'|s,a)\tilde{\pi}(a|s) - \sum_a p(s'|s,a)\pi(a|s))\gamma v(s')] $$ $$ = \mathbb{E}_{s \sim \overline{\rho}}[\sum_a(\tilde{\pi}(a|s) - \pi(a|s))\sum_{s'}p(s'|s,a)\gamma v(s')] $$ $$ = \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)\sum_{s'}p(s'|s,a)\gamma v(s')] $$ One of the terms might look familiar. Writing out the advantage explicitly, and using \(\sum_{s'} p(s'|s,a) = 1\), $$ A(s, a) = Q(s,a) - v(s) $$ $$ = \sum_{s'} p(s'|s,a)(r(s) + \gamma v(s')) - v(s) $$ $$ = r(s) + \sum_{s'} p(s'|s,a)\gamma v(s') - v(s) $$ $$ \sum_{s'} p(s'|s,a)\gamma v(s') = A(s, a) - r(s) + v(s) $$
Replacing the term in our equation and noting that any term which does not depend on \(a\) drops out, since \(\sum_a\pi_{\Delta}(a|s)f(s) = f(s)\sum_a\pi_{\Delta}(a|s) = 0\) because \(\sum_a \tilde{\pi}(a|s) = \sum_a \pi(a|s) = 1\).
$$ \gamma\underline{v}P_{\Delta}\overline{\rho} = \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)(A(s, a) - r(s) + v(s))] $$ $$ \gamma\underline{v}P_{\Delta}\overline{\rho} = \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s, a)] $$
And we have arrived at something that might start to look more familiar. Putting it back into our equation for the difference in expected reward $$ \tilde{\eta} - \eta = \gamma (1 - \gamma )^{-1} \underline{v}P_{\Delta}\overline{\rho} + \gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0 $$ $$ \tilde{\eta} - \eta = (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] + \gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0 $$
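Quite a few identities went into that, so here is a numerical check on a small made-up MDP (3 states, 2 actions, random rewards and policies; none of the numbers come from the papers): the exact gap \(\tilde{\eta} - \eta\) should equal the advantage-based first term plus the second-order remainder.

```python
import numpy as np

# Exact decomposition check: eta_new - eta ==
#   (1-gamma)^{-1} * E_{s~rho}[sum_a (pi_new - pi)(a|s) A(s,a)]
#   + gamma^2 * r G P_delta G P_delta G_new rho0
rng = np.random.default_rng(1)
nS, nA, gamma = 3, 2, 0.9

p = rng.random((nS, nS, nA))             # p[s_next, s, a]
p /= p.sum(axis=0, keepdims=True)        # sum over s_next is one
r = rng.random(nS)                       # reward per state
rho0 = np.array([1.0, 0.0, 0.0])

def random_policy():
    pi = rng.random((nA, nS))            # pi[a, s]
    return pi / pi.sum(axis=0, keepdims=True)

def transition(pi):                      # P[s_next, s] = sum_a p(s'|s,a) pi(a|s)
    return np.einsum('ksa,as->ks', p, pi)

def green(P):                            # G = (1 - gamma P)^{-1}
    return np.linalg.inv(np.eye(nS) - gamma * P)

pi, pi_new = random_policy(), random_policy()
P, P_new = transition(pi), transition(pi_new)
G, G_new = green(P), green(P_new)
P_delta = P_new - P

eta, eta_new = r @ G @ rho0, r @ G_new @ rho0
v = r @ G                                        # state values under the old policy
Q = r[:, None] + gamma * np.einsum('ksa,k->sa', p, v)   # Q[s, a]
A = Q - v[:, None]                               # advantage under the old policy
rho = (1 - gamma) * G @ rho0                     # normalized discounted state distribution

first = (1 - gamma)**-1 * np.sum(rho * np.sum((pi_new - pi).T * A, axis=1))
second = gamma**2 * r @ G @ P_delta @ G @ P_delta @ G_new @ rho0
print(np.isclose(eta_new - eta, first + second))   # True
```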
Now that we have considered the first term, let us tackle the second term \(\gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0\). Instead of approaching this term exactly, we will establish an inequality on the size of the term.
$$ |\tilde{\eta} - \eta - (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)]| = |\gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0| $$
Before proceeding with the derivation, we will have a short refresher interlude about norms. For our purposes we will define \(\|\underline{x}\|_\infty \equiv \max_i |x_i|\), \(\|\overline{x}\|_1 \equiv \sum_i|x_i|\), and for a matrix \(\|A\|_\infty \equiv \max_j \sum_i|A_{ij}|\), the maximum absolute column sum (the natural norm for a matrix acting on row vectors from the right). Why would this be handy for finding a bound? We will derive a few properties.
$$ |\underline{x}\overline{y}| = |\sum_i x_iy_i| \leq \sum_i |x_iy_i| = \sum_i |x_i||y_i| $$ $$ \leq \sum_i \max_j|x_j||y_i| = \max_j|x_j| \sum_i |y_i| = \lVert \underline{x} \rVert_\infty \lVert \overline{y} \rVert_1 $$ $$ \lVert \underline{x}A \rVert_\infty = \max_j |\sum_i x_i A_{ij}| \leq \max_j \sum_i |A_{ij}|(\max_k |x_k|) $$ $$ = \lVert \underline{x} \rVert_\infty \max_j \sum_i |A_{ij}| = \lVert \underline{x} \rVert_\infty \lVert A \rVert_\infty $$ $$ \lVert \underline{x}BA \rVert_\infty \leq \lVert \underline{x}B \rVert_\infty \lVert A \rVert_\infty \leq \lVert \underline{x} \rVert_\infty \lVert B \rVert_\infty \lVert A \rVert_\infty $$ Now that we have some good properties, let us apply them to find a bound on our term $$ |\gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0| \leq \gamma^2 \lVert \underline{r}G P_{\Delta} GP_{\Delta}\tilde{G} \rVert_\infty\lVert \overline{\rho}_0 \rVert_1 $$ $$ \leq \gamma^2 \lVert \underline{r}G \rVert_\infty\lVert P_{\Delta} \rVert_\infty\lVert G \rVert_\infty\lVert P_{\Delta} \rVert_\infty\lVert \tilde{G} \rVert_\infty\lVert \overline{\rho}_0 \rVert_1 $$ $$ = C2^{-2}\lVert P_{\Delta} \rVert_\infty^2 $$ Where \(C \equiv 2^2\gamma^2 \lVert \underline{r}G \rVert_\infty\lVert G \rVert_\infty\lVert \tilde{G} \rVert_\infty\lVert \overline{\rho}_0 \rVert_1 \). (Note that with our column-sum norm \(\lVert \tilde{G} \rVert_\infty = \sum_t \gamma^t = (1-\gamma)^{-1}\) and \(\lVert \overline{\rho}_0 \rVert_1 = 1\), so \(C\) does not depend on the new policy.) Let us find \(\lVert P_{\Delta} \rVert_\infty\) next; recall that the columns of \(P\) are indexed by the current state \(s\) $$ \lVert P_{\Delta}\rVert_\infty = \max_{s} \sum_{s'}|\tilde{p}(s'|s)-p(s'|s)| $$ $$ = \max_{s} \sum_{s'}|\sum_a \pi_{\Delta}(a|s)p(s'|s,a)| $$ $$ \leq \max_{s} \sum_{s'} \sum_a |\pi_{\Delta}(a|s)|p(s'|s,a) $$ $$ = \max_{s} \sum_a |\pi_{\Delta}(a|s)|\sum_{s'} p(s'|s,a) $$ $$ = \max_{s} \sum_a |\tilde{\pi}(a|s) - \pi(a|s)| $$ We recognize the inner sum as the difference in area between two probability distributions, which in effect measures how different the distributions are. Half of it goes by the name of Total Variation Distance; taking the worst case over states, we will write \(D_{TV}(\tilde{\pi}, \pi) \equiv 2^{-1}\max_{s}\sum_a|\tilde{\pi}(a|s)-\pi(a|s)|\), so \(\lVert P_{\Delta}\rVert_\infty \leq 2D_{TV}(\tilde{\pi}, \pi)\). Putting it all back into our inequality $$ |\gamma^2 \underline{r}GP_{\Delta} GP_{\Delta}\tilde{G}\overline{\rho}_0| \leq CD^2_{TV}(\tilde{\pi}, \pi) $$ $$ |\tilde{\eta} - \eta - (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)]| \leq CD^2_{TV}(\tilde{\pi}, \pi) $$ $$ \tilde{\eta} - \eta - (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] \geq -CD^2_{TV}(\tilde{\pi}, \pi) $$ $$ \tilde{\eta} \geq \eta + (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] - CD^2_{TV}(\tilde{\pi}, \pi) $$ And we have found our surrogate function \(g(\tilde{\pi}, \pi)\) $$ g(\tilde{\pi}, \pi) \equiv \eta + (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] - CD^2_{TV}(\tilde{\pi}, \pi) $$ $$ \tilde{\eta} \geq g(\tilde{\pi}, \pi) $$ Let us check what happens if the new policy is the same as the old one. In that case \(\pi_{\Delta} = 0\) and \(D_{TV} = 0\), so \(\eta = g(\pi, \pi)\) as required.
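As a small aside, here is a sketch of the distance we just used - the worst-case (over states) Total Variation Distance between two policies stored as arrays \(\pi[a, s]\) whose columns sum to one (the example numbers are made up).

```python
import numpy as np

# D_TV(pi_new, pi_old) = 0.5 * max_s sum_a |pi_new(a|s) - pi_old(a|s)|
def tv_max(pi_new, pi_old):
    return 0.5 * np.abs(pi_new - pi_old).sum(axis=0).max()

pi_old = np.array([[0.5, 0.9],    # 2 actions (rows) x 2 states (columns)
                   [0.5, 0.1]])
pi_new = np.array([[0.7, 0.8],
                   [0.3, 0.2]])
print(tv_max(pi_new, pi_old))     # 0.2 (state 0 is the worst case)
```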
Maximizing
As we discussed at the start, we are guaranteed to make progress (or in the worst case not regress) if we maximize \(g(\tilde{\pi}, \pi)\). Since we only care about the maximizing \(\tilde{\pi}\), the objective can be simplified: terms that do not depend on \(\tilde{\pi}\) are constant and do not change the maximizer, so we can throw them away $$ g(\pi) \equiv \max_{\tilde{\pi}} g(\tilde{\pi}, \pi) $$ $$ = \max_{\tilde{\pi}}[\eta + (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] - CD^2_{TV}(\tilde{\pi}, \pi)] $$ which has the same maximizer as $$ \max_{\tilde{\pi}}[(1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\tilde{\pi}(a|s)A(s,a)] - CD^2_{TV}(\tilde{\pi}, \pi)] $$
We now have an algorithm that guarantees safe progress - maximizing the surrogate at \(\pi_0\) gives \(\pi_1\), maximizing it at \(\pi_1\) gives \(\pi_2\), and life is great, but… as a matter of practice, as noted by Schulman et al. (2015), the resulting step sizes would be very small. We always advance safely, but pay for it with the speed of advancement.
Worry less - PPO and TRPO
Let’s look a bit more at the parts of \(g(\tilde{\pi}, \pi) \) $$ l(\tilde{\pi}, \pi) \equiv (1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)] $$ $$ \tilde{\eta} - \eta \geq l(\tilde{\pi}, \pi) - CD^2_{TV}(\tilde{\pi}, \pi) $$
To maximize \(\tilde{\eta} - \eta\), we would like to simply maximize \(l(\tilde{\pi}, \pi)\) over \(\tilde{\pi}\), defining \(l(\pi) \equiv \max_{\tilde{\pi}}l(\tilde{\pi}, \pi)\), but the \(CD^2_{TV}(\tilde{\pi}, \pi)\) term is holding us back - the more different the policies are, the more negative it gets. However, since \(l(\pi) \geq 0\) (in the worst case the maximizer is just the current policy, for which \(l(\pi, \pi) = 0\)), the maximizer still satisfies \(\tilde{\eta} - \eta \geq - CD^2_{TV}(\tilde{\pi}, \pi)\). In other words, when maximizing \(l\) alone, the worst-case deterioration of our expected reward is quadratic in the distance between the new and old policy.
This gives an idea of how we could speed up our algorithm. We could sacrifice the guarantee of always making progress and look for \(\tilde{\pi}\) by maximizing \(l\) instead of \(g\), but add on top a more heuristic requirement that keeps the policy difference within some bound. We might then progress faster at the price of making missteps, but our missteps should not cause us to fall off a cliff. We are optimizing the policy within a region of trust.
What heuristic could we use? The original TRPO algorithm restrains the policy by requiring \(\mathbb{E}[D(\tilde{\pi}, \pi)] < \epsilon\), where \(\epsilon\) is some small number [1]. PPO uses a different approach to restraining the policy distance. Let us have a look at \(l(\pi)\)
$$ l(\pi) \equiv \max_{\tilde{\pi}} [(1-\gamma)^{-1} \mathbb{E}_{s \sim \overline{\rho}}[\sum_a\pi_{\Delta}(a|s)A(s,a)]] $$ Dropping the terms and positive factors that do not affect the maximizer, we can equivalently maximize $$ \mathbb{E}_{s \sim \overline{\rho}}[\sum_a \tilde{\pi}(a|s)A(s,a)] = \mathbb{E}_{s \sim \overline{\rho}, a \sim \tilde{\pi}}[A(s,a)] $$ We note that the state is sampled under the old policy, \(s \sim \overline{\rho}\), but the action under the new policy, \(a \sim \tilde{\pi}\), which we cannot do with data collected under the old policy. Let us adjust to always sample from the old policy using the following identity $$ \mathbb{E}_{x \sim p}[f(x)] = \sum_x p(x)f(x) = \sum_x q(x)\frac{p(x)}{q(x)}f(x) = \mathbb{E}_{x \sim q}[\frac{p(x)}{q(x)}f(x)] $$ so the objective becomes $$ \mathbb{E}_{s \sim \overline{\rho}, a \sim \pi}[\frac{\tilde{\pi}(a|s)}{\pi(a|s)}A(s,a)] $$ The expression contains a ratio of the two policies, which can serve as a measure of how far apart they are. To remove the incentive for the ratio to move far from one, we could clip it, so the per-sample objective would be \(\text{clip}(\frac{\tilde{\pi}}{\pi}, 1-\epsilon, 1+\epsilon)A\). While we have removed the incentive, there is no particular disincentive for the ratio to get large either - in the original expression \(g(\pi)\) that role was played by the \(CD^2_{TV}\) term. To improve on this front, we note that the clipping bounds the objective not only when the change would make it larger, but also when it would make it smaller. Since we are maximizing anyway, let us remove the clipping whenever the unclipped term is the smaller one, by taking the minimum of the two. Then changes that hurt the objective can make it arbitrarily negative as \(\frac{\tilde{\pi}}{\pi}\) moves away from one, which discourages over-optimizing \(\tilde{\pi}\) in a harmful direction. Putting it all together, we arrive at the PPO objective, which we maximize in a loop $$ \max_{\tilde{\pi}} \mathbb{E}_{s \sim \overline{\rho}, a \sim \pi}[\min(\frac{\tilde{\pi}(a|s)}{\pi(a|s)}A(s,a), \text{clip}(\frac{\tilde{\pi}(a|s)}{\pi(a|s)}, 1-\epsilon, 1+\epsilon)A(s,a))] $$
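To make the final objective concrete, here is a minimal sketch of the clipped surrogate as a function of per-sample probability ratios and advantage estimates (just the objective itself, not a training loop; the numbers are made up).

```python
import numpy as np

# PPO clipped surrogate: mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A),
# where ratio = pi_new(a|s) / pi_old(a|s) for samples collected under pi_old.
def ppo_clip_objective(ratio, advantage, eps=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([1.5, 0.5, 1.05])        # hypothetical pi_new / pi_old per sample
advantage = np.array([2.0, -1.0, 0.3])    # hypothetical advantage estimates
print(ppo_clip_objective(ratio, advantage))
```

For \(A > 0\) the contribution stops growing once the ratio exceeds \(1+\epsilon\), while for \(A < 0\) pushing the ratio above \(1+\epsilon\) keeps making the objective worse - the asymmetric behaviour described above.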
Does the final result work in practice? Yes!
[1] Instead of using Total Variation Distance, it uses Kullback–Leibler divergence, and the two are related through Pinsker’s inequality.