
# RL with Actor-Critic Methods

- March 19, 2020
- Posted by: vsinghal
- Category: Reinforcement Learning Robotics


**Minutes from Saturday 14th March 2020 AI Lab Workshop at BLR :-**

__Session Presenter__ : SHUBHA M., Deep Reinforcement Learning Researcher, CellStrat AI Lab

Last Saturday, our Reinforcement Learning Team Lead **Shubha M.** gave a presentation and workshop on the **Actor-Critic method** used in RL. She also demonstrated the technique on **stock market prediction**.

Reinforcement Learning broadly involves Value-based methods and Policy-based Methods.

__VALUE-BASED LEARNING__ :-

An RL agent follows a **state, action, reward** paradigm: in a particular state the agent takes an action, for which the environment grants it a reward.

An MDP or **Markov Decision Process** provides the mathematical framework to solve the RL problem.

The Markov property states that the future depends only on the current state and not on the states that came before it. In other words, the current state captures all past dependencies. A **Markov Reward Process** (MRP) attaches a reward to each state of such a process, together with a discount factor.

The cumulative reward at time step *t* is the sum of the current reward and all future rewards, with future rewards discounted exponentially by a factor *γ* between 0 and 1. The agent’s goal is to learn a policy *π* which maximizes this cumulative reward.
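As a minimal sketch of the discounted cumulative reward described above, the snippet below sums a list of rewards weighted by powers of *γ*. The function name `discounted_return` and the sample rewards are illustrative assumptions, not taken from the workshop.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a list of rewards."""
    g = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Illustrative rewards only: three steps of -2 followed by a final reward of +10.
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # prints -2.25
```

With *γ* close to 0 the agent mostly cares about the immediate reward; with *γ* close to 1 later rewards count almost as much as early ones.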

The **state-value function** *V*_{π}*(s)* measures how good it is to be in a certain state when following the policy *π*.

The sample returns from Student Markov Reward Process can be depicted as :-

In this way, a value can be calculated for each state.

The **Bellman Equation** for an MRP is given by :-
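The original figure is not reproduced here, so the following is a sketch in assumed notation: $\mathcal{R}_s$ is the expected immediate reward in state $s$ and $\mathcal{P}_{ss'}$ is the probability of moving from $s$ to $s'$. The Bellman equation splits the value of a state into the immediate reward and the discounted value of the successor state:

$$
V(s) = \mathcal{R}_s + \gamma \sum_{s'} \mathcal{P}_{ss'}\, V(s')
$$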

The Bellman Equation for Student MRP can be depicted as follows :-

A Markov Decision Process is a Markov Reward Process with decisions.

The **state-value function** captures the value starting from a particular state *s*. The **action-value function** captures the value starting from a particular state *s* and taking action *a*, as per policy *π*.

The state-value function and action-value functions may be decomposed as follows :-
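One standard way to write this decomposition is sketched below. The symbols $\mathcal{R}_s^a$ (expected immediate reward for action $a$ in state $s$) and $\mathcal{P}_{ss'}^a$ (transition probability) are assumed notation, since the original figure is not reproduced.

$$
\begin{aligned}
V_\pi(s) &= \sum_{a} \pi(a \mid s)\, q_\pi(s, a) \\
q_\pi(s, a) &= \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a\, V_\pi(s')
\end{aligned}
$$

Substituting one equation into the other gives a recursive expression for each function in terms of itself.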

The optimal value functions are found by maximizing over all policies :-

The optimal policy *π*_{*} is found by maximizing over *q*_{*}*(s,a)* :-
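In assumed notation, the two maximizations mentioned above can be sketched as follows. The deterministic form of the optimal policy shown here is one common way to write it, not necessarily the one used in the workshop slides.

$$
V_*(s) = \max_{\pi} V_\pi(s), \qquad q_*(s, a) = \max_{\pi} q_\pi(s, a)
$$

$$
\pi_*(a \mid s) =
\begin{cases}
1 & \text{if } a = \arg\max_{a'} q_*(s, a') \\
0 & \text{otherwise}
\end{cases}
$$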

__Deep Q-Learning__ **:-**

**Q-learning** is a model-free reinforcement learning algorithm to learn a policy telling an agent what action to take under what circumstances [*from Wikipedia*].

Arriving at an optimal policy involves a trade-off between **Exploration and Exploitation**. Exploitation is about going with choices already known to be good (e.g. visit your favorite restaurant). Exploration involves trying random choices in order to discover better long-term rewards (e.g. try a new restaurant).

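We use techniques such as **ε-greedy** to make a call on exploration vs exploitation at each step: with probability ε the agent takes a random action, otherwise it takes the action with the highest estimated value.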
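A minimal sketch of ε-greedy action selection is shown below. The function name `epsilon_greedy` and the list-of-Q-values representation are assumptions made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-value estimates."""
    if rng.random() < epsilon:
        # Explore: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: choose the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With epsilon = 0 the choice is always greedy; with epsilon = 1 it is always random.
print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))  # prints 1
```

A common design choice is to start with a large ε and decay it over training, so the agent explores early and exploits later.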

Q-Learning does have limitations; e.g. the total number of states may be enormous (such as all possible pixel values of an Atari game screen). Also, methods built on Q(s, a) and V(s) assume a discrete action space and are not well suited to continuous control spaces such as the angle of a steering wheel or the temperature of a heater.

**Deep Q-Learning** (DQN) handles large state spaces by using a neural network to approximate the Q-function :-

For an Atari game a DQN might look like :-

A DQN for an Atari game takes the screen pixels as input and outputs a Q-value for each possible action, from which the action is chosen. The change in game score is fed back to the network as the reward at each time step.

The Q-values are updated as per this formula :-
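The original formula image is not reproduced here. As a hedged sketch, the standard tabular Q-learning rule moves Q(s, a) a fraction α toward the target r + γ · max Q(s', a'). The function name `q_learning_update`, the dictionary-based table and the sample transition are assumptions for illustration.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # TD target: immediate reward plus discounted best value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen (state, action) pairs start at 0
q_learning_update(q, state="s0", action=0, reward=1.0, next_state="s1", actions=[0, 1])
print(q[("s0", 0)])  # prints 0.1
```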

For additional information on Q-value update, click here.

We also employ an **Experience Replay** technique in order to avoid forgetting previous experiences and to reduce correlations between experiences.
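A replay buffer can be sketched in a few lines. The class name `ReplayBuffer` and the tuple layout are assumptions for illustration; the demo's actual buffer may differ.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest experience once capacity is reached.
        self.memory = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0, 1.0, step + 1, False)
batch = buffer.sample(2)
print(len(buffer), len(batch))  # prints 3 2
```

Training on random mini-batches drawn from the buffer, rather than on the latest transition only, is what lets the network revisit older experiences.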

In normal DQN learning, the same weights are used for estimating the target and the Q value, so the target shifts every time the weights are updated. The weight adjustment is given by :-

A technique called **Fixed Q Targets** (introduced by DeepMind) allows us to avoid this problem of chasing moving targets. Here we use a separate network with fixed weights *w-* for estimating the TD target. Every τ steps, we copy the parameters from our DQN network to the target network, so the target network’s weights are updated less often than those of the primary Q-network.
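The weight update with a fixed target network can be sketched as follows. The notation is assumed, since the original figures are not reproduced: $\alpha$ is the learning rate, $\hat{Q}$ the network's estimate, $w$ the primary network's weights and $w^-$ the target network's weights.

$$
\Delta w = \alpha \left[ \left( R + \gamma \max_{a} \hat{Q}(s', a, w^-) \right) - \hat{Q}(s, a, w) \right] \nabla_w \hat{Q}(s, a, w)
$$

$$
w^- \leftarrow w \quad \text{every } \tau \text{ steps}
$$

Replacing $w^-$ with $w$ in the first equation gives the plain DQN update, where the target moves with every weight change.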

__POLICY-BASED LEARNING__ :-

Here the system learns an optimal policy directly without storing action-values. Unlike value-based methods, policy-based methods can learn true stochastic policies. Policy-based methods are also suitable for continuous action spaces.

The **policy gradient** is always of the form (for details and derivation of this equation, please check our prior post on Policy Gradients here) :-
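The equation image is not reproduced here. A common way to write this form, given as a sketch in assumed notation ($\theta$ for the policy parameters, $\tau$ for a trajectory, $R(\tau)$ for its reward, and a superscript $t$ for the time step), is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \left( \sum_{t} \nabla_\theta \log \pi_\theta(a^{t} \mid s^{t}) \right) R(\tau) \right]
$$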

The central term is the log likelihood of the policy. In our context, it measures how likely the trajectory is under the current policy. We multiply this by the rewards, so trajectories with highly positive rewards become more likely under the updated policy and trajectories with negative rewards become less likely.

The **REINFORCE algorithm** is given by :-

The REINFORCE algorithm can be stated as follows :-

1) Perform a trajectory roll-out using the current policy

2) Store log probabilities (of policy) and reward values at each step

3) Calculate discounted cumulative future reward at each step

4) Compute policy gradient and update policy parameter

5) Repeat 1–4
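The five steps above can be sketched in plain Python. Everything specific here is an assumption for illustration: the toy task (a stateless problem where action 1 pays +1 and action 0 pays 0), the softmax policy with one logit per action, and the hyperparameters. Because there is no autodiff framework, the gradient of the log probability is computed analytically instead of being stored during the roll-out.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(episodes=500, steps=3, alpha=0.1, gamma=0.99, seed=0):
    """REINFORCE on a hypothetical stateless task: action 1 pays +1, action 0 pays 0."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one logit per action
    for _ in range(episodes):
        # Steps 1 and 2: roll out a trajectory, recording actions and rewards.
        probs = softmax(theta)
        actions, rewards = [], []
        for _ in range(steps):
            a = 0 if rng.random() < probs[0] else 1
            actions.append(a)
            rewards.append(1.0 if a == 1 else 0.0)
        # Step 3: discounted cumulative future reward at each step.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        # Step 4: for a softmax policy, grad log pi(a) = one_hot(a) - probs.
        # Weight it by the return and take a gradient-ascent step.
        grad = [0.0, 0.0]
        for a, g in zip(actions, returns):
            for i in range(2):
                grad[i] += ((1.0 if i == a else 0.0) - probs[i]) * g
        theta = [t + alpha * d for t, d in zip(theta, grad)]
    return softmax(theta)


print(reinforce())  # action probabilities after training
```

On this toy task the probability of the rewarded action should grow over training, since only trajectories containing it receive a positive return.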

The Gradient tries to :-

- increase probability of paths with positive R
- decrease probability of paths with negative R

A **trajectory** is a sequence of states and actions in one particular episode.

In the REINFORCE algorithm, we update the policy parameters through **Monte Carlo updates** (i.e. from randomly sampled trajectories). Because sampled trajectories can differ widely from one another, the log probabilities (of the policy distribution) and the cumulative reward values have high variance, leading to noisy gradients. This can cause unstable learning or skew the policy in a non-optimal direction.

One way to reduce variance and increase stability is to subtract a baseline from the cumulative reward.

Recall that the policy gradient is given by :-

We establish a Reward baseline as follows :-

The **Advantage function** *A* is defined as :-

The Advantage function provides a measure of how each action compares to a certain baseline. Using *A*^{π}*(s*^{t}*,a*^{t}*)* centers the learning signal and reduces the variance significantly.
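The formula images are not reproduced here, so the following is a sketch in assumed notation: $G^{t}$ is the discounted cumulative future reward from step $t$ and $b(s^{t})$ is a baseline that depends only on the state. A baseline of this kind changes the variance of the gradient estimate but not its expected value.

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a^{t} \mid s^{t}) \left( G^{t} - b(s^{t}) \right) \right]
$$

Choosing the state value as the baseline gives the usual definition of the advantage:

$$
A^{\pi}(s^{t}, a^{t}) = Q_{\pi}(s^{t}, a^{t}) - V_{\pi}(s^{t})
$$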

A **Vanilla Policy Gradient** algorithm or VPG is given by :-

Another policy-based method is Proximal Policy Optimization (PPO), which is described here.

**Actor-Critic** **:-**

Here we combine Value-based and Policy-based methods. The Actor is policy-based and the Critic is value-based.

First, let's focus on choosing the right baseline. Actor-critic methods use the value function as a baseline for policy gradients, so the only fundamental difference between actor-critic methods and other baseline methods is that actor-critic methods utilise a learned value function :-

In effect, this increases the log probability of an action in proportion to how much its return exceeds the expected return *V(s)* under the current policy.

So we can rewrite the policy gradient using the advantage function:

The Advantage function provides a measure of how each action compares to the average performance at the state *s*^{t} , which is given by *V*_{π}*(s*^{t}*)*.
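Written out in assumed notation (the original image is not reproduced), the policy gradient with the advantage function takes the form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a^{t} \mid s^{t})\, A^{\pi}(s^{t}, a^{t}) \right]
$$

In practice the advantage is often estimated from the critic alone using the one-step TD error, where $r^{t}$ is the reward received at step $t$:

$$
\delta^{t} = r^{t} + \gamma V_{\pi}(s^{t+1}) - V_{\pi}(s^{t})
$$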

The Actor-Critic architecture consists of two neural networks, the Actor and the Critic.

- The Actor network takes the state as input and outputs a probability for each action.
- The Critic network receives the state and the reward resulting from the previous interaction. It uses the TD error calculated from this information to update itself and the Actor.
- The Actor network is trained to maximize the reward using gradient ascent.
- The Critic network is trained to minimize the mean squared TD error of its state-value estimates.
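The interplay between the two can be sketched without neural networks, using tables in place of the Actor and Critic networks. Everything specific here is an assumption for illustration: the two-state toy task, the tabular softmax actor, the hyperparameters and the function names. It is a one-step actor-critic sketch, not the model used in the demo.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic(episodes=300, steps=10, alpha_actor=0.1, alpha_critic=0.2,
                 gamma=0.9, seed=0):
    """One-step actor-critic on a hypothetical 2-state task.

    In either state, action 1 pays +1 and action 0 pays 0; the state then flips.
    """
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor: one logit per (state, action)
    value = [0.0, 0.0]                # critic: one value estimate per state
    for _ in range(episodes):
        s = 0
        for t in range(steps):
            # Actor: sample an action from the action probabilities for state s.
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1
            r = 1.0 if a == 1 else 0.0
            s_next = 1 - s
            done = t == steps - 1
            # Critic: the TD error compares the bootstrapped target with V(s).
            target = r + (0.0 if done else gamma * value[s_next])
            td_error = target - value[s]
            # Critic update: gradient step that shrinks the squared TD error.
            value[s] += alpha_critic * td_error
            # Actor update: gradient ascent on log pi(a|s) weighted by the TD error.
            for i in range(2):
                grad_log = (1.0 if i == a else 0.0) - probs[i]
                theta[s][i] += alpha_actor * td_error * grad_log
            s = s_next
    return [softmax(row) for row in theta], value


policy, value = actor_critic()
print(policy, value)
```

Unlike REINFORCE, the updates happen at every step rather than at the end of an episode, because the critic's estimate stands in for the rest of the return.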

Here is a summary of PG Algorithms :-

The **Q Actor Critic** algorithm is :-

Two different neural networks may be used for Actor and Critic Networks. Sometimes the base network can be common :-

After this extensive discussion, Shubha demonstrated the use of Actor-Critic for price prediction on the S&P 500 US equity market index. Her demo used an Actor-Critic model with Fixed Q Targets and an Experience Replay buffer.


Best Regards,

Vivek Singhal

Co-Founder & Chief Data Scientist, CellStrat

+91-9742800566

*References :-*

- David Silver lectures on DRL at UCL
- CMU DRL lectures for 10703 DRL course
- John Schulman and Pieter Abbeel lectures at UC Berkeley
- Reinforcement Learning – An Introduction, Richard S. Sutton and Andrew G. Barto