Proximal Policy Optimization
- March 12, 2020
- Posted by: Shubha Manikarnike
- Category: Reinforcement Learning
In my previous post, we discussed the simplest policy gradient method, REINFORCE. We saw how policy-based methods improve on value-based methods, derived the gradient of the score (cost) function, and implemented a simple policy gradient to train Gym's Acrobot-v0. We then saw how introducing a baseline reduces variance, which leads to the Vanilla Policy Gradient (VPG) algorithm.
In this post, we discuss the roles played by credit assignment and importance sampling in policy optimization. We will also work through the objective and gradient of the clipped PPO algorithm.
PPO was released by OpenAI in August 2017; here is a link to the published paper. It is a policy gradient method that optimizes a 'surrogate' objective function using stochastic gradient ascent.
PPO has some of the benefits of trust region policy optimization (TRPO), but it is much simpler to implement, more general, and has better sample complexity (empirically). Let us dive into the details of the algorithm.
We know that the gradient of the score function is given by

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big] \]
In the Vanilla Policy Gradient algorithm, which uses a baseline \(b\), the gradient is given by

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(R(\tau) - b\big) \Big] \]
Credit Assignment: A close look at the return of a trajectory \(\tau\) shows that it is simply the sum of the rewards at each time step:

\[ R(\tau) = \sum_{t=0}^{T} r_t \]
So the gradient can be expressed as

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=0}^{T} r_{t'} \Big] \]
Since this is a Markov process, the action at time step t can influence only the future rewards; the past rewards should not contribute to the gradient.
Alternatively, this can be written as

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r_{t'} \Big] \]

where we consider only the rewards from time step t onward (the reward-to-go, \(R_t^{\mathrm{future}} = \sum_{t'=t}^{T} r_{t'}\)).
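The reward-to-go can be computed for a whole trajectory in one backward pass. Below is a minimal sketch; the helper name and the `gamma` discount parameter are my own additions, not from this post:

```python
import numpy as np

def rewards_to_go(rewards, gamma=1.0):
    """G_t = sum_{t' >= t} gamma**(t' - t) * r_t'.

    Each action is credited only with the rewards that come after it,
    which is the credit-assignment idea described above.
    """
    g = np.zeros(len(rewards))
    running = 0.0
    # Walk backwards so each step reuses the already-accumulated future sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g

print(rewards_to_go([1.0, 0.0, 2.0]))  # [3. 2. 2.]
```

Note how the reward at t = 0 no longer appears in the entries for t = 1 and t = 2.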
Importance Sampling: In statistics, importance sampling is a general technique for estimating properties of a particular distribution while only having samples generated from a different distribution than the one of interest. We use importance sampling so that the trajectories generated by previous policies can be reused.
In a simple policy gradient algorithm, the weights of the neural network are updated using the equation

\[ \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \]
The probability of sampling a trajectory \(\tau\) under the policy \(\pi_\theta\) is given by \(P(\tau; \theta)\), and the probability of sampling the same trajectory under the updated policy is given by \(P(\tau; \theta')\).
By using importance sampling, the score function can be rewritten as an expectation over trajectories sampled from the old policy:

\[ J(\theta') = \sum_{\tau} P(\tau; \theta)\, \frac{P(\tau; \theta')}{P(\tau; \theta)}\, R(\tau) \]

and the gradient is given by

\[ \nabla_{\theta'} L_{\mathrm{sur}}(\theta', \theta) = \mathbb{E}\Big[ \sum_{t=0}^{T} \frac{\nabla_{\theta'} \pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, R_t^{\mathrm{future}} \Big] \]

Here

\[ L_{\mathrm{sur}}(\theta', \theta) = \mathbb{E}\Big[ \sum_{t=0}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\, R_t^{\mathrm{future}} \Big] \]

is called the surrogate function; it re-weights the rewards from time step t onward of trajectories collected under the old policy by the probability ratio between the new policy and the old policy.
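In practice the probability ratio is computed from log-probabilities for numerical stability. A minimal numpy sketch of the surrogate; the sample values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical log-probs of the sampled actions under the old (sampling)
# policy and the current policy, plus per-step advantage estimates.
old_log_probs = np.array([-1.2, -0.7, -2.0])
new_log_probs = np.array([-1.0, -0.9, -1.8])
advantages    = np.array([ 0.5, -0.3,  1.1])

# Ratio pi_new(a|s) / pi_old(a|s), computed in log space to avoid
# dividing two very small probabilities.
ratio = np.exp(new_log_probs - old_log_probs)

# Unclipped surrogate: old-trajectory advantages re-weighted by the ratio.
surrogate = np.mean(ratio * advantages)
```

Maximizing `surrogate` with respect to the new policy's parameters reuses trajectories sampled under the old policy, which is exactly what importance sampling buys us.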
A new objective function is constructed that clips the probability ratio, limiting the update when the new policy moves far from the old policy:

\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta) A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \big) \Big] \]

The same can be rewritten as

\[ L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t \begin{cases} \min\big( r_t(\theta),\, 1+\epsilon \big) A_t & \text{if } A_t \ge 0 \\ \max\big( r_t(\theta),\, 1-\epsilon \big) A_t & \text{if } A_t < 0 \end{cases} \]

In the above equation, epsilon is a small number, usually taking the value 0.1 or 0.2. If the value of the advantage function is positive, the ratio is clipped at (1 + epsilon); if the advantage function is negative, the ratio is clipped at (1 − epsilon).
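Putting the two cases together, the clipped objective is just a `min` between the unclipped and the clipped surrogate terms. A minimal numpy sketch, with a function name and sample values of my own choosing:

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, epsilon=0.2):
    """L_CLIP = E[min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)]."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the elementwise min removes any incentive for the optimizer
    # to push the ratio beyond the [1 - eps, 1 + eps] range.
    return np.mean(np.minimum(unclipped, clipped))

# Positive advantage: a ratio of 1.5 is clipped down to 1 + epsilon = 1.2.
print(ppo_clip_objective(np.array([1.5]), np.array([1.0])))  # 1.2
```

With a negative advantage the `min` binds on the other side: a ratio of 0.5 with A = −1 yields −0.8, i.e. the ratio is effectively clipped at 1 − epsilon, matching the two cases above.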