# A Summary of Model-Free RL Algorithms

- April 13, 2020
- Posted by: vsinghal
- Category: Reinforcement Learning, Robotics

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #AlwaysUpskilling*

Reinforcement Learning (RL) refers to training agents with the help of incentive-driven environments.

RL typically involves a **<state, action, reward>** paradigm: the agent has action choices to make in various states, and each action entails a potential reward. This also means that each state has a “**value**” associated with it. The sequence of <state, action> pairs follows a recommended “**policy**”, such that maximum rewards can be attained by following that policy *π*.
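This interaction loop can be sketched in a few lines of Python. The environment below is a hypothetical toy (a coin-flip guessing game, not from any library) chosen only to make the <state, action, reward> cycle concrete:

```python
import random

class CoinFlipEnv:
    """Toy environment (hypothetical, for illustration): guess a coin flip.
    There is a single trivial state; the point is the interaction loop."""
    def reset(self):
        self.coin = random.choice(["heads", "tails"])
        return "start"                      # the (only) state

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0
        return "start", reward, True        # next_state, reward, done

def run_episode(env, policy):
    """One <state, action, reward> interaction following the given policy."""
    state = env.reset()
    action = policy(state)
    next_state, reward, done = env.step(action)
    return state, action, reward

random.seed(0)
env = CoinFlipEnv()
state, action, reward = run_episode(env, policy=lambda s: "heads")
```

Real libraries such as OpenAI Gym follow essentially this same `reset`/`step` structure.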

In the mid-2000s, with the advent and progress of Deep Learning, RL started using deep neural networks for policy optimization. Modern RL is generally separated into two types: “**model-free**” and “**model-based**” (MBRL).

**Model-free methods** learn directly from experience: they perform actions either in the real world (e.g. robots) or in a computer (e.g. games), collect the resulting reward from the environment, whether positive or negative, and update their value functions accordingly.

This is a **key difference** with the model-based approach: model-free methods act in the real environment in order to learn.

Conversely, a **model-based algorithm** uses a reduced number of interactions with the real environment during the learning phase. Its aim is to construct a model from these interactions, and then use that model to simulate further episodes, not in the real environment but against the **constructed model**, using the results the model returns.

This has the advantage of speeding up learning, since there is no need to wait for the environment to respond, nor to reset the environment to some state in order to resume learning.

On the downside, however, if the model is inaccurate, we risk learning something completely different from reality.

(*Source Credit – **https://towardsdatascience.com/model-based-reinforcement-learning-cb9e41ff1f0d*)

**We will discuss model-free RL in the rest of this article**. It is the more mature area in terms of research and practicality; model-based RL lies more in the research domain so far.

Model-free RL has two types of algorithms – value-based and policy-based.

*Value-based algorithms :-*

Value-based algorithms iteratively update the perceived value of a state to finally learn an optimal policy. These algorithms are generally “**off-policy**” (with the exception of a few, like SARSA, which is “on-policy”).

Value-based algorithms solve a Markov Decision Process (MDP). The Markov property states that the future depends only on the current state, not on the sequence of past states; in other words, the current state captures all past dependencies. The **Markov Reward Process** models this paradigm.

The **cumulative reward** at time step *t* is the sum of the current reward and the discounted future rewards. The agent’s goal is to learn a policy *π* which maximizes this cumulative reward.
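A minimal sketch of this quantity, the discounted return *G_t = r_t + γ r_{t+1} + γ² r_{t+2} + …*, computed by folding rewards from the back:

```python
def discounted_return(rewards, gamma=0.9):
    """Return G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    for the given list of rewards starting at time t."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three rewards of 1 each with gamma = 0.5:
# G = 1 + 0.5*1 + 0.25*1 = 1.75
g = discounted_return([1, 1, 1], gamma=0.5)
```

The backward fold avoids recomputing powers of *γ* for each term.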

The **state-value function** *v*_{π}*(s)* measures how good it is to be in a certain state when following the policy *π*.

The major types of value-based algorithms are :-

### Q-Learning :

Here the agent stores a perceived value, called a **Q-value**, for each (state, action) pair; these Q-values then determine the policy’s action.

We use **Dynamic Programming** to define the Q-value, via the recursive **Bellman Equation**:

*Q*_{π}*(s, a)* = 𝔼[*r* + *γ* *Q*_{π}*(s′, a′)*]

where *s′* is the next state and *a′* is the next action under *π*.

This means that *Q*_{π} can be improved by *bootstrapping*, i.e., we can use the current values of our estimate of *Q*_{π} to improve our estimate.

The Q-values for various states are updated iteratively using a Temporal Difference algorithm, which compares the old and new Q-values after each action:

*Q(s, a) ← Q(s, a) + α ẟ*, with *ẟ = r + γ max*_{a′}* Q(s′, a′) − Q(s, a)*

where *α* is the learning rate and *ẟ* the temporal difference (TD) error.

The table below summarizes the Q learning algorithm.
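In code, one step of the tabular update described above can be sketched as follows (the state/action names in the example are made up for illustration):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One TD update: Q(s,a) <- Q(s,a) + alpha * delta, where
    delta = r + gamma * max_a' Q(s', a') - Q(s, a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    delta = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    return Q

# Q-table starts at zero for every (state, action) pair.
Q = defaultdict(float)
# Example transition: in state 0, took action "right", got reward 1, reached state 1.
q_learning_update(Q, 0, "right", 1.0, 1, actions=["left", "right"])
```

With all Q-values initially zero, the TD error for this transition is 1.0, so the updated entry becomes *α* × 1.0 = 0.1.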

The limitation of Q-learning is that a system may have an enormous number of possible states, making tabular learning unviable. It is also unsuitable for continuous spaces, such as the steering angle of a car.

### Deep-Q Learning :

Deep-Q Learning uses neural networks to predict Q-values for various state-action combinations, allowing an expansion to large or continuous state spaces while saving computational resources. (Continuous action spaces still require other methods, such as the policy-based algorithms discussed below.)

A common technique in value-based methods is to balance the **exploration vs. exploitation** paradigm. Exploration means deliberately trying states with lower perceived value in order to achieve a better overall result over the lifetime of the episode (e.g. a salesperson explores new markets in anticipation of higher overall sales). Exploitation means sticking to “safe bets”, in other words taking actions with higher known reward potential (e.g. a salesperson sticks to proven markets for the sales hunt).
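The standard way to implement this trade-off is an epsilon-greedy rule: explore at random with probability *ε*, otherwise exploit the best-known action. A minimal sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (pick a random action),
    otherwise exploit (pick the action with the highest Q-value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

random.seed(42)
action = epsilon_greedy([0.2, 0.8, 0.5], epsilon=0.1)
```

With `epsilon=0.1`, the agent exploits 90% of the time; annealing *ε* toward zero over training is a common refinement.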

More details on value-based methods are available here.

*Policy-based algorithms :-*

Policy-based methods update the policy directly, without storing state values. These typically use **Policy Gradient (PG)** algorithms, which modify an agent’s policy based on which actions bring it higher rewards. These algorithms are considered “**on-policy**”.

One key point to note here is that there are two probabilities involved – the policy (*π*) probability and environment probability (*p*).

The policy recommends which actions (a1, a2, etc.) to take and with what probability. E.g. in a game of Chess, the policy predicts probabilities for the multiple moves available to the player at any point (e.g. move Pawn A forward with 20% probability and move Pawn B forward with 80% probability). We move Pawn B forward (as it has the higher recommendation), and once we do so, the environment probability is 100% that it lands in the desired square.

Whereas in a game of Frozen Lake, the policy recommends probabilities over which slots to move to, but after the agent takes the highest-probability action, the environment dictates the probability of it landing in the targeted slot (the slippery ice means the agent may not land where intended).

Policy methods have these advantages over value-based methods: (i) better convergence properties, (ii) suitability for continuous action spaces (with infinitely many possible actions), and (iii) the ability to learn stochastic policies.

A Policy can be of two types :

(1) A **deterministic policy** maps states to actions: you give it a state and the function returns an action to take [*π(s) -> a*]. The action fully determines the outcome. E.g. make a three-dimensional robot walk forward as fast as possible without falling over.

(2) A **stochastic policy** gives a probabilistic output [*π(a|s) -> P(a*_{t}*|s*_{t}*)*]. A stochastic policy is used when the environment is uncertain; the policy outputs a probability distribution over actions rather than a concrete action. E.g. in the Rock-Paper-Scissors game, you should output Rock, Paper or Scissors under an equiprobable random policy of 33% each. In Chess, you can move Pawn A or Pawn B to different squares with different probabilities.
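A stochastic policy is commonly parameterized as a softmax over action preferences; sampling from it can be sketched like this (the three equal logits reproduce the uniform Rock-Paper-Scissors case from the text):

```python
import math, random

def softmax(logits):
    """Turn raw action preferences into a probability distribution."""
    m = max(logits)                               # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probs):
    """Sample an action index from the policy's probability distribution."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

probs = softmax([0.0, 0.0, 0.0])   # uniform: Rock/Paper/Scissors at ~33% each
```

Note that the policy returns a distribution; the concrete action only appears after sampling.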

A basic PG algorithm is the “**REINFORCE**” technique, which implements the core idea of “reinforcing” policy gradients (in other words, increasing the likelihood of actions that lead to higher rewards).

The REINFORCE update is given by :-

*θ ← θ + α ∇*_{θ}* log π*_{θ}*(a*_{t}*|s*_{t}*) G*_{t}

where *G*_{t} is the cumulative discounted reward from time step *t*.
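As a toy illustration, here is REINFORCE on a two-armed bandit with a softmax policy (the bandit, reward values, and hyperparameters are all made up for this sketch; for a softmax policy the gradient of the log-probability has the simple closed form used below):

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_bandit(true_rewards, episodes=2000, lr=0.1, seed=0):
    """REINFORCE on a toy bandit: nudge theta so that actions returning
    higher reward become more likely. For a softmax policy,
    d/d theta_i log pi(a) = 1[i == a] - pi(i)."""
    random.seed(seed)
    theta = [0.0] * len(true_rewards)
    for _ in range(episodes):
        probs = softmax(theta)
        a = random.choices(range(len(probs)), weights=probs)[0]
        r = true_rewards[a] + random.gauss(0, 0.1)      # noisy reward
        for i in range(len(theta)):
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += lr * r * grad_log               # theta += lr * G * grad(log pi)
    return softmax(theta)

final_probs = reinforce_bandit([1.0, 0.0])   # arm 0 pays more on average
```

After training, the policy should place most of its probability on the better-paying arm.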

More details on Value-based and Policy-based algorithms are available here and here.

*Actor-Critic Algorithms :-*

These algorithms combine the value-based and policy-based methods in a complementary process. The Actor is a policy-based network that recommends an action based on the policy. The Critic is a value-based network that takes the reward attained by the Actor’s action, along with the state, to update both itself and the Actor using a TD control mechanism.
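One step of this Actor-Critic interaction can be sketched as below (the table-based representation and hyperparameters are illustrative assumptions; real implementations use neural networks for both):

```python
def actor_critic_step(theta, V, s, a, r, s_next, probs,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """One actor-critic update: the Critic computes the TD error
    delta = r + gamma*V(s') - V(s), moves V(s) toward the target,
    and the Actor scales its policy-gradient step by delta."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta                       # critic update
    for i in range(len(theta[s])):                     # actor update
        grad_log = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha_actor * delta * grad_log
    return delta

# Toy example: one state, two actions, uniform policy, reward 1 for action 0.
theta = {0: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
delta = actor_critic_step(theta, V, s=0, a=0, r=1.0, s_next=1, probs=[0.5, 0.5])
```

A positive TD error means the action turned out better than the Critic expected, so the Actor raises that action’s preference; a negative one lowers it.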

More details on Actor-Critic are available here.

Policy gradients suffer from noisy gradient estimates. Recent variations of PG algorithms attempt to mitigate some of these issues. These include TRPO (Trust Region Policy Optimization, TRPO paper), PPO (Proximal Policy Optimization, PPO paper), DDPG (Deep Deterministic Policy Gradient), TD3 (Twin Delayed Deep Deterministic Policy Gradient) and SAC (Soft Actor-Critic). A related value-based advance, Rainbow, combines many variations of DQN for a better result, such as Prioritized DQN, Dueling DQN, A3C, Distributional DQN and Noisy DQN.

Model-free RL is improving rapidly with modern state-of-the-art algorithms, and it will enable future applications in areas like robotics.

## CellStrat Deep Reinforcement Learning Course :-

CellStrat AI Lab is a leading AI Lab and is working on the cutting-edge of Artificial Intelligence including latest algorithms in ML, DL, RL, Computer Vision, NLP etc.

We are pleased to launch an extensive course in Deep Reinforcement Learning (DRL). More details and enrollment here : https://bit.ly/CSDRLC

Wish to attend a TRIAL CLASS (online webinar) for the new Deep RL course? If yes, please RSVP below to attend.

*CellStrat AI Lab :-*

Register : http://bit.ly/15A-RLdp

Topic : Dynamic Programming

Date : Wednesday 15th Apr 2020, 4:00 – 5:30 PM IST

See you this Wednesday in the RL webinar!

Questions? Call me at **+91-9742800566**!

Best Regards,

Vivek Singhal

Co-Founder & Chief Data Scientist, CellStrat

+91-9742800566

*References :-*

- Gists of Recent Deep RL Algorithms by Nathan Lambert
- A Brief Survey of Deep Reinforcement Learning by Kai Arulkumaran et al.
- An introduction to Policy Gradients with Cartpole and Doom by Thomas Simonini