# Temporal Difference methods in RL

- April 27, 2020
- Posted by: Shubha Manikarnike
- Category: Reinforcement Learning, Robotics

This post discusses temporal difference (TD) methods in Reinforcement Learning, contrasting them with Monte Carlo (MC) methods and dynamic programming (DP). A working knowledge of Markov Decision Processes (MDPs) is assumed.

**Prediction and Control:** In general, RL methods have two components:

1) *Prediction / Evaluation*: given a policy, estimate the expected future return from each state.

2) *Control / Improvement*: once the policy has been evaluated, improve it to find the best policy.

**On-policy vs Off-Policy:** Control methods can be either

1) *On-policy* – the agent learns about a policy using experience sampled from that same policy.

2) *Off-policy* – the agent learns about a policy using experience sampled from a different policy.

**Dynamic programming:** DP requires complete knowledge of the environment: the state transition probabilities and the rewards received at each time step. Given this model, it takes in a policy and evaluates the state value of every state using the **Bellman expectation equation**.

The **Policy Improvement** step selects actions greedily with respect to the current value estimates: at every state, it picks the best action (the action which leads to the highest-valued next state). This evaluation–improvement loop is known as generalized policy iteration. DP uses *Policy Iteration* and *Value Iteration* as control methods. Policy Iteration alternates between policy evaluation and policy improvement. In Value Iteration, the value function is iterated until the optimal values are computed, and policy improvement happens once, after the optimal values are computed.
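As a sketch, value iteration on a tiny hand-made MDP (the two-state transition table, rewards, and discount factor below are invented purely for illustration, not taken from the post):

```python
import numpy as np

# Toy 2-state MDP: P[s][a] is a list of (prob, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
gamma = 0.9

def value_iteration(P, gamma, theta=1e-8):
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: value of the best action in s.
            q = [sum(p * (r + gamma * V[ns]) for p, ns, r in P[s][a])
                 for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy improvement, done once after the values have converged.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[ns])
                                             for p, ns, r in P[s][a]))
              for s in P}
    return V, policy
```

Since action 1 in state 1 yields reward 2 forever, the optimal values converge to V(1) = 2 / (1 − 0.9) = 20 and V(0) = 1 + 0.9 · 20 = 19.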

**Monte Carlo methods :-**

In MC methods, episodes are sampled by following a policy, and the returns from these episodes are collected. The value of every state is computed by averaging the collected returns. Since these methods sample the entire episode to get the return *G*, they do not need complete knowledge of the environment, but they apply only to episodic tasks.

Monte Carlo control uses generalized policy iteration to optimize its policy. For policy improvement it uses an **epsilon-greedy policy**: with probability (1 − epsilon) it picks the best action, and with probability epsilon it picks a random action.
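Epsilon-greedy action selection can be sketched as follows (the tuple-keyed Q-table is an illustrative choice, not a prescribed data structure):

```python
import random

def epsilon_greedy(Q, state, n_actions, epsilon):
    # With probability epsilon explore uniformly at random;
    # otherwise exploit the current action-value estimates.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])
```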

**Temporal Difference methods :-**

TD learning is a combination of Monte Carlo methods and Dynamic Programming methods. TD methods do not need complete knowledge of the environment.

Like MC methods, they learn from experience, and like DP, they update estimates based on other learned estimates: they bootstrap rather than wait for the episode to complete. The TD target combines the one-step actual reward with an estimate of the return from the next state.

MC update: *V(S*_{t}*) ← V(S*_{t}*) + α [G*_{t}* − V(S*_{t}*)]*

TD(0) update: *V(S*_{t}*) ← V(S*_{t}*) + α [R*_{t+1}* + γ V(S*_{t+1}*) − V(S*_{t}*)]*

In the above update equations, the MC update uses the return *G*_{t} obtained upon completing the episode, whereas the TD update uses the one-step reward *R*_{t+1} and an estimate of the state value at the next state *S*_{t+1}.
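Both updates can be written as one-line functions over a tabular value function (state names and step sizes below are illustrative):

```python
def mc_update(V, s, G, alpha):
    # Monte Carlo: move V(s) toward the full episode return G.
    V[s] += alpha * (G - V[s])

def td0_update(V, s, r, s_next, alpha, gamma):
    # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s').
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```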

**TD Control**: Control follows the usual pattern of Generalized Policy Iteration, with the exploration–exploitation trade-off handled by the epsilon-greedy method. These approaches fall into two main categories: on-policy and off-policy. We discuss three main TD control methods in the subsequent sections: (1) SARSA, (2) Q-learning, (3) Expected SARSA.

**SARSA** – **On-policy TD control** :-

*q*_{π}*(s,a)* needs to be estimated for the current behavior policy *π*, for all states *s* and actions *a*, using the update rule

*Q(S*_{t}*, A*_{t}*) ← Q(S*_{t}*, A*_{t}*) + α [R*_{t+1}* + γ Q(S*_{t+1}*, A*_{t+1}*) − Q(S*_{t}*, A*_{t}*)]*

This update is done after every transition from a non-terminal state *S*_{t}. If *S*_{t+1} is terminal, then *Q(S*_{t+1}*,A*_{t+1}*)* is defined as zero. This rule uses every element of the quintuple of events, *(S*_{t}*, A*_{t}*, R*_{t+1}*, S*_{t+1}*, A*_{t+1}*)*, that make up a transition from one state–action pair to the next. This quintuple gives rise to the name “*SARSA*” for the algorithm.
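A minimal tabular SARSA update might look like this (the tuple-keyed Q-table and argument names are illustrative):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma, terminal=False):
    # On-policy target: uses the action actually chosen in s_next.
    # Q(terminal, .) is defined as zero, so the target is just r.
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```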

**Q-learning / SARSAMax :-**

This is an off-policy TD control algorithm. The learned action-value function, *Q*, directly approximates *q*_{*}, the optimal action-value function, independent of the policy being followed: the update target uses max_{a} *Q(S*_{t+1}*, a)* rather than the action the behavior policy actually takes next.
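The corresponding Q-learning update replaces the sampled next action with a maximization over actions (again, the tabular representation below is an illustrative sketch):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma,
                      terminal=False):
    # Off-policy target: max over next actions, regardless of which
    # action the behavior policy will actually take.
    target = r if terminal else r + gamma * max(Q[(s_next, b)]
                                                for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```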

**Expected SARSA :-**

This is an On-policy TD control algorithm. It is similar to Q-learning except that instead of the maximum over next state–action pairs it uses the expected value, taking into account how likely each action is under the current policy. Given the next state, *S*_{t+1}, this algorithm moves deterministically in the same direction as SARSA moves in expectation, and accordingly it is called Expected SARSA.
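The expectation over next actions under an epsilon-greedy policy can be sketched as follows (the action list and the particular epsilon are illustrative):

```python
def expected_sarsa_update(Q, s, a, r, s_next, actions, alpha, gamma,
                          epsilon, terminal=False):
    # Target is the expectation of Q(s', .) under the epsilon-greedy
    # policy: the greedy action gets probability 1 - epsilon plus its
    # share of the uniform exploration mass epsilon / |A|.
    if terminal:
        target = r
    else:
        q_next = [Q[(s_next, b)] for b in actions]
        best = max(range(len(actions)), key=lambda i: q_next[i])
        probs = [epsilon / len(actions)] * len(actions)
        probs[best] += 1 - epsilon
        target = r + gamma * sum(p * q for p, q in zip(probs, q_next))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```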

*References:*

- Sutton, R. S. and Barto, A. G. – *Reinforcement Learning: An Introduction*
- David Silver’s Reinforcement Learning lectures at UCL
- van Seijen et al. – *A Theoretical and Empirical Analysis of Expected Sarsa*