# DDPG and TD3

- April 20, 2020
- Posted by: Shubha Manikarnike
- Category: Reinforcement Learning, Robotics

This post assumes a strong understanding of the basics of Reinforcement Learning: MDPs, DQN, and Policy Gradient algorithms. You can go through **Policy Gradients** to understand the derivation for stochastic policies.

In the previous **post on Actor Critic**, we saw the advantage of merging value-based and policy-based methods. The Actor takes in a state and outputs a policy, which is a probability distribution over actions. The Critic evaluates the policy by estimating the value function. The Actor uses this value estimate to update the probabilities of its actions.

The Actor Critic algorithm discussed so far can be used with discrete action spaces. The Critic is trained with the **Mean Squared Bellman Error (MSBE) loss**:

$$L(\theta) = \mathbb{E}\left[ \left( Q_\theta(s, a) - y \right)^2 \right]$$

where the target is

$$y = r + \gamma \max_{a'} Q_{\theta^-}(s', a')$$
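The MSBE loss is straightforward to compute on a batch of transitions. Here is a minimal numpy sketch (the function name and the toy numbers are illustrative, not from the post):

```python
import numpy as np

def msbe_loss(q_values, rewards, next_q_max, dones, gamma=0.99):
    """Mean Squared Bellman Error: mean((Q(s,a) - y)^2), with the
    target y = r + gamma * max_a' Q(s', a'), zeroed at terminal states."""
    targets = rewards + gamma * next_q_max * (1.0 - dones)
    return np.mean((q_values - targets) ** 2)

# Toy batch of two transitions; the second one is terminal.
q = np.array([1.0, 2.0])
r = np.array([0.5, 1.0])
next_q = np.array([1.5, 0.0])
done = np.array([0.0, 1.0])
loss = msbe_loss(q, r, next_q, done)
```

In practice the targets come from a separate target network, a point the post returns to below.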

The gradient of the score function for the policy-based algorithm **REINFORCE** is the return times the gradient of the log probability of the actions:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[ G_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]$$

In the Actor Critic method, the reward term in the above equation is replaced by the Q value calculated by the Critic. The Actor is trained by **gradient ascent** to maximize this Q value.

This Actor Critic algorithm works well for finite discrete action spaces. Atari games have a finite discrete action space, and hence DQN-style algorithms work well. Now imagine a problem with a continuous action space, like a two-joint robot arm: the action space contains the angles and angular velocities of the two joints, ranging over a continuum of values. Recall the Mean Squared Bellman Error used as the loss function for the Critic network. It requires computing the maximum Q value for the next state S'. Taking the maximum over a finite set of discrete actions is easy; over a large continuous action space, it becomes intractable.
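To see why the discrete case is easy, note that the max in the target is just an argmax over the critic's output vector. A tiny sketch (the Q values are made up for illustration):

```python
import numpy as np

# With a finite discrete action space, max_a' Q(s', a') is one argmax
# over the critic's output vector (one Q value per action).
q_next = np.array([0.2, 1.3, -0.4, 0.9])  # hypothetical Q(s', .) for 4 actions
max_q = q_next.max()                       # value used in the Bellman target
best_action = int(q_next.argmax())         # greedy action in s'
```

For the robot arm there is no finite list of actions to enumerate, so this one-liner has no direct analogue; that gap is exactly what DDPG addresses next.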

**Deep Deterministic Policy Gradient (DDPG)** addresses this issue. The Actor network outputs a **deterministic policy** instead of a **stochastic policy**: it gives out the exact action itself rather than a probability distribution over actions. The action given out by the Actor is then used by the Critic to compute the Q value.

DDPG reuses the tricks of ‘Experience Replay’ and ‘Fixed Q Targets’ from the DQN algorithm.

**Experience Replay:** A replay buffer stores experience tuples. Since every action affects the next state, interaction produces a sequence of experience tuples that can be highly correlated. A random mini-batch is sampled from the replay buffer so that the training tuples are decorrelated.
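A replay buffer is simple to implement; a minimal sketch (class and method names are my own, not from the DDPG paper):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) tuples. Uniform random
    sampling breaks the temporal correlation of consecutive transitions."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest tuples evicted first

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform mini-batch; each element is one experience tuple.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The agent adds one tuple per environment step and, once the buffer is warm, samples a mini-batch per training step.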

**Fixed Q-Targets:** With fixed Q-targets, the target in the MSBE is computed using a separate set of weights. The reason: we want to reduce the difference between the Q value and the target, and if the same weights produce both, then every time the Q value changes, the target moves as well. Training boils down to chasing a moving target. In DQN, the target weights are synchronized with the local weights every tau steps. DDPG instead uses **polyak averaging** (a soft update):

$$\theta_{\text{target}} \leftarrow \rho \, \theta_{\text{target}} + (1 - \rho) \, \theta$$

where $\rho$ is close to 1.
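The polyak (soft) update is one line per parameter; a minimal sketch, with the function name chosen for illustration:

```python
def polyak_update(target_params, local_params, rho=0.995):
    """Soft update: theta_target <- rho * theta_target + (1 - rho) * theta_local.
    Unlike DQN's hard copy every tau steps, this nudges the target network
    toward the local network by a small amount at every training step."""
    return [rho * t + (1.0 - rho) * l
            for t, l in zip(target_params, local_params)]

# Example: a single scalar "weight" moves 10% of the way per call at rho=0.9.
new_target = polyak_update([1.0], [0.0], rho=0.9)
```

Typical values of rho are close to 1 (e.g. 0.995), so the target network trails the local network smoothly.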

Incorporating the fixed targets, the DDPG algorithm has **two instances** of the Actor (local and target), which compute the deterministic policy, and two instances of the Critic (local and target), which compute the Q value for the actions given by the Actor.

Since computing the max over actions in the target is challenging, DDPG uses the target policy network to produce an action that approximately maximizes the target Q value. The target policy network is updated with polyak averaging.

On the policy learning side, the Actor needs to learn a deterministic policy that outputs the action maximizing the Q value. Since the Q function is differentiable with respect to the action, we perform gradient ascent (with respect to the policy parameters only) to solve

$$\max_\theta \, \mathbb{E}_{s}\left[ Q_\phi\!\left(s, \mu_\theta(s)\right) \right]$$
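The core idea — follow the gradient of Q with respect to the action uphill — can be shown on a toy one-dimensional Q function where the maximizer is known. This is only an illustration of the ascent step; in DDPG the same direction is backpropagated through the critic into the actor's weights:

```python
# Toy Q surface: Q(a) = -(a - 2)^2 is differentiable in the action a,
# with its maximum at a* = 2. The scalar a stands in for the actor's
# output mu_theta(s) in this one-state sketch.
def q(a):
    return -(a - 2.0) ** 2

def dq_da(a):
    # Analytic gradient of Q with respect to the action.
    return -2.0 * (a - 2.0)

a = 0.0
for _ in range(200):
    a += 0.1 * dq_da(a)  # gradient ASCENT: step uphill on Q
```

After enough steps, `a` converges to the maximizing action 2.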

**Exploration:** In Reinforcement Learning with discrete action spaces, exploration is done by selecting a random action with some probability (such as epsilon-greedy). In continuous action spaces, exploration is done by adding noise to the action itself (there is also parameter-space noise, but we will skip that for now). In the DDPG paper, the authors use an *Ornstein-Uhlenbeck process* to add noise to the action output (Uhlenbeck & Ornstein, 1930).
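A common discretized form of the Ornstein-Uhlenbeck process can be sketched as below; the class name and default coefficients (theta=0.15, sigma=0.2, as in the DDPG paper's setup) are illustrative:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated, mean-reverting
    noise, added to the deterministic action for exploration."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta  # strength of the pull back toward mu
        self.sigma = sigma  # scale of the random kicks
        self.rng = np.random.default_rng(seed)
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1): drift toward the mean
        # plus Gaussian noise, so successive samples are correlated.
        dx = (self.theta * (self.mu - self.state)
              + self.sigma * self.rng.standard_normal(len(self.state)))
        self.state = self.state + dx
        return self.state
```

Each environment step, the agent acts with `action = actor(state) + noise.sample()`, clipped to the valid action range.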

## Twin Delayed Deep Deterministic Policy Gradients (TD3)

In deep Q-learning, function approximation errors lead to overestimation of values and suboptimal policies. Twin Delayed Deep Deterministic Policy Gradients (TD3) solves this problem by drawing on ideas from **Double DQN**, which helps prevent overestimation of Q values.

TD3 adds three enhancements to DDPG to address this problem:

- *Clipped double Q-learning*: In Deep Q Networks, the target needs the maximum Q value for the next state S', which means choosing the best action in that state. In the early stages of training, the Q values are still evolving: not many states have been explored or actions tried, so computing the target this way overestimates the target Q values. Hence, as in double Q-learning, the action used in the target calculation is derived from the target Q network, which is stationary for a while. TD3 goes further: two Q functions are learned, and the smaller of the two Q values is used to compute the target.
- *Delayed policy updates*: The target policy is updated less frequently than the Q functions. The paper suggests one policy update for every two Q function updates. This lets the value network become more stable before the policy is updated.
- *Target policy smoothing*: Deterministic policies can produce target values with high variance when updating the Critic. TD3 reduces this variance by adding clipped noise to the target action.
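The delayed-update schedule is just bookkeeping in the training loop; a sketch of the counting, with hypothetical names (`policy_delay` matches the paper's one-policy-update-per-two-critic-updates suggestion when set to 2):

```python
def td3_schedule(num_steps, policy_delay=2):
    """Count how often each network is updated under TD3's schedule:
    both critics every step, the actor (and target networks) only
    every `policy_delay` steps."""
    critic_updates, actor_updates = 0, 0
    for step in range(1, num_steps + 1):
        critic_updates += 1            # always update both critics
        if step % policy_delay == 0:   # delayed actor + target updates
            actor_updates += 1
    return critic_updates, actor_updates

counts = td3_schedule(10)
```

With the default delay of 2, ten training steps yield ten critic updates but only five actor updates.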

Target policy smoothing is done as follows:

$$a'(s') = \text{clip}\left( \mu_{\theta_{\text{target}}}(s') + \text{clip}(\epsilon, -c, c), \; a_{Low}, \; a_{High} \right), \qquad \epsilon \sim \mathcal{N}(0, \sigma)$$

where the final action is clipped to lie between $a_{Low}$ and $a_{High}$.
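Target policy smoothing translates to two clips around a Gaussian draw; a minimal sketch, with illustrative parameter names and defaults (`noise_clip` playing the role of c):

```python
import numpy as np

def smooth_target_action(mu_target, noise_clip=0.5, sigma=0.2,
                         a_low=-1.0, a_high=1.0, rng=None):
    """Target policy smoothing: add clipped Gaussian noise to the
    target actor's action, then clip back into the valid action range."""
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = np.clip(sigma * rng.standard_normal(np.shape(mu_target)),
                  -noise_clip, noise_clip)      # inner clip: bound the noise
    return np.clip(mu_target + eps, a_low, a_high)  # outer clip: valid range

smoothed = smooth_target_action(np.zeros(4))
```

The smoothed action replaces the raw target-actor output when forming the critic targets, so the critic cannot exploit sharp peaks in the Q estimate.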

In **Clipped Double Q Learning**, both Q functions regress toward the same target, built from the minimum of the two target-critic estimates:

$$y = r + \gamma \, \min_{i=1,2} Q_{\phi_{i,\text{target}}}\!\left(s', a'(s')\right)$$
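Putting the min into the Bellman target is a one-liner on a batch; a numpy sketch with an illustrative function name and toy numbers:

```python
import numpy as np

def clipped_double_q_target(rewards, q1_next, q2_next, dones, gamma=0.99):
    """Both critics regress toward a target built from the smaller of the
    two target-critic estimates, curbing overestimation."""
    min_q = np.minimum(q1_next, q2_next)  # elementwise min of the twin critics
    return rewards + gamma * min_q * (1.0 - dones)

# Toy transition: the pessimistic estimate (1.0) wins over the larger one.
y = clipped_double_q_target(np.array([1.0]), np.array([2.0]),
                            np.array([1.0]), np.array([0.0]))
```

Both critic losses are then MSBE against this single shared target `y`.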
