
# AlphaZero with Monte Carlo Tree Search

- August 20, 2020
- Posted by: Shubha Manikarnike
- Category: Gaming, Reinforcement Learning, Robotics

*#CellStratAILab #disrupt4.0 #WeCreateAISuperstars #WhereLearningNeverStops*

In recent weeks, I presented a session on the **“AlphaZero with Monte Carlo Tree Search”** algorithm at the CellStrat AI Lab. The approach was developed by **Google DeepMind**, whose AlphaGo program mastered the **game of Go** and in 2016 beat Lee Sedol, an 18-time world champion.

Go is an ancient Chinese abstract strategy board game, more than 2,500 years old. The standard board size is 19 x 19, and the game has roughly 10^{170} legal board positions, more than there are atoms in the observable universe.

Historically, computer programs have achieved superhuman capability in chess, backgammon, and checkers. But superhuman performance in Go was thought impossible until 2016, when Google’s AlphaGo exceeded human performance.

## AlphaGo -> AlphaGo Zero -> AlphaZero

**AlphaGo**: Program developed by DeepMind in 2016, which defeated world champion Lee Sedol at Go 4-1 in a series watched by over 200 million people. Expert human moves were fed into the program, and it was trained using supervised learning. The paper was titled ‘**Mastering the game of Go with deep neural networks and tree search**’.

**AlphaGo Zero**: In October 2017, DeepMind unveiled a variant of AlphaGo called AlphaGo Zero, where learning happened through self-play starting from a blank state. It was able to find strategies to beat previous incarnations of itself. The paper was titled ‘**Mastering the Game of Go without Human Knowledge**’.

**AlphaZero**: In December 2017, DeepMind came up with AlphaZero, a more generic version of AlphaGo Zero, which could defeat world champions at Chess and Shogi. The paper is titled ‘**Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm**’.

### Components of AlphaZero :-

- **Monte Carlo Tree Search (MCTS)**: a search algorithm used for classical games.
- A two-headed neural network in the Actor-Critic style.
- The neural network begins with a residual CNN that reads images of board positions.
- It then has two heads: a Policy Head and a Value Head.
- The neural network guides the MCTS algorithm to output the most promising moves.

## Game Tree Concept

Games can be modelled as trees. A node represents a state (a specific configuration), and an edge represents an action. On performing an action, there is a transition from a node to one of its children.

Terminal nodes are leaf nodes. A brute-force search would try out all possible paths in order to determine the moves that result in winning. Though this might work for smaller games, it is infeasible for games like Go or chess.

## Monte Carlo Tree Search (MCTS)

There are many ways to search for the best move in such a game tree. In the algorithm called **minimax**, every path is searched to find the best move. This brute-force approach fails for complex games, since the tree gets very large.

MCTS works effectively for trees with a high branching factor. It samples many paths down the tree, and as it tries more paths, it gains better estimates of which paths are good.

MCTS builds its own search tree from scratch corresponding to the game tree. The nodes in this tree store statistical information which help the search algorithm to make good moves.

The focus of MCTS is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space.

The application of Monte Carlo tree search in games is based on many playouts, also called **roll-outs**. In each playout, the game is played out to the very end by selecting moves at random.

The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts.

The search algorithm follows these 4 steps:

1) **Selection:** Moves are selected based on existing information in the tree. This is carried out until you reach an unvisited child node.

2) **Expansion:** An unvisited child is randomly chosen and added to the search tree.

3) **Simulation:** From this node a simulation is run, completely at random, until the end of the game to determine the winner.

4) **Backpropagation:** All the nodes in the selected path are updated with the new information gained from the simulated game.

### MCTS – Selection :-

Starting from the root node, you traverse the tree by selecting a legal move and moving to the child node. You stop selection when you reach an unvisited node.

The selection function should balance *exploration* and *exploitation*.

A good selection function which balances both is called **UCB1 (Upper Confidence Bound 1)**. When applied to MCTS, the combined algorithm is named **UCT** (Upper Confidence Bound 1 applied to trees). So, MCTS + UCB1 = UCT.

The expression below (UCB1) determines the value of a node, based on which it is more (or less) likely to be selected in a particular playout:

UCB1 = wᵢ/sᵢ + c·sqrt(ln(sₚ)/sᵢ)

where

- *wᵢ*: this node’s number of simulations that resulted in a win
- *sᵢ*: this node’s total number of simulations
- *sₚ*: the parent node’s total number of simulations
- *c*: the exploration parameter

- The left term (*wᵢ*/*sᵢ*) is the *exploitation term*. It is simply the average win rate, growing larger the better a node has historically performed.
- The right term (c·sqrt(ln(sₚ)/*sᵢ*)) is the *exploration term*. It grows larger the less frequently a node is selected for simulation.
- The exploration parameter *c* is just a number we can choose, which controls how much the expression favors exploration over exploitation.
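The UCB1 expression translates almost directly into code. This is an illustrative helper (not from the original post) using the wᵢ, sᵢ, sₚ notation defined above:

```python
import math

def ucb1(wins, visits, parent_visits, c=1.41):
    """UCB1 value of a node: exploitation term + exploration term.

    wins          -- w_i, this node's simulations that ended in a win
    visits        -- s_i, this node's total number of simulations
    parent_visits -- s_p, the parent node's total number of simulations
    c             -- the exploration parameter
    """
    if visits == 0:
        return float("inf")  # unvisited children are always tried first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)
```

With `c = 0` the function reduces to the plain win rate; increasing `c` shifts the balance toward rarely-visited nodes.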

Here the positions and moves selected by the UCB1 algorithm (instead of selecting randomly) at each step are marked in bold.

Note that a number of playouts have already been run to accumulate the statistics shown. Each circle contains the number of wins / number of times played.

The game is alternating between two players – represented by white nodes and gray nodes in this figure :

### MCTS – Expansion :-

The expansion phase occurs when you reach a node with unvisited children, where the UCB1 algorithm can no longer be applied. An unvisited child position is randomly chosen, and a new record node is added to the tree of statistics.

The position marked 1/1 at the bottom of the tree has no further statistics records under it, so we choose a random move and add a new record for it (bold), initialized to 0/0.

### MCTS – Simulation :-

In the simulation phase, the remainder of the playout occurs.

This is done as a typical Monte Carlo simulation.

From the newly created node in the expansion phase, moves are selected randomly and the game state is repeatedly advanced. This repeats until the game is finished and a winner emerges. No new nodes are created in this phase.

Simulation is depicted in the figure using a dashed arrow.

### MCTS – Backpropagation :-

This occurs when the playout reaches the end of the game. All of the positions visited during this playout have their play count incremented: each visited node has its simulation count *sᵢ* incremented. Depending on which player wins, its win count *wᵢ* may also be incremented.

In the figure, the incremented counts are shown with bolded numbers.

## Policy Iteration in AlphaGo Zero

Alpha(Go) Zero uses a Neural Network to guide the MCTS. It uses Policy Iteration to train this Neural Network.

**Policy iteration** is a classic algorithm that generates a sequence of improving policies by alternating between *policy evaluation* (estimating the value function of the current policy) and *policy improvement* (using the current value function to generate a better policy).

The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s recommendations, and then projects the (much stronger) search policy back into the function space of the neural network.

Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network.

These projection steps are achieved by training the neural network parameters to match the search. In other words, the MCTS output is the target (Y) that the neural network is trained to match.

### The Neural Network :-

The program uses a deep neural network f_{ϴ} with parameters ϴ. This neural network takes as an input the raw board representation s of the position and its history, and outputs both move probabilities and a value,

(**p**; v) = f_{ϴ}(s)

The vector of move probabilities p represents the probability of selecting each move (including pass),

p_{a} = Pr(a|s)

The value v is a scalar evaluation, estimating the probability of the current player winning from position s. v_{θ}(s)∈[−1,1].

This neural network combines the roles of both policy network and value network into a single architecture. This is similar to the Actor Critic architecture, which has a common body and outputs the Policy and Value.

The neural network in AlphaZero consists of many residual blocks of convolutional layers with batch normalization and rectifier non-linearities.
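As a rough illustration of the two-headed output shape (not DeepMind’s residual CNN), the sketch below uses a single NumPy linear layer as a stand-in for the shared body; `W_body`, `W_policy`, and `W_value` are hypothetical parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

def two_headed_forward(s, W_body, W_policy, W_value):
    """Toy stand-in for f_theta(s) = (p, v): shared body, two heads."""
    h = np.tanh(W_body @ s)          # shared representation (the "body")
    p = softmax(W_policy @ h)        # policy head: move probabilities, sums to 1
    v = float(np.tanh(W_value @ h))  # value head: scalar in [-1, 1]
    return p, v
```

The tanh on the value head is what constrains v to [−1, 1], matching the win-probability interpretation above.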

### Tree Creation guided by the Neural Network :-

Given a state s, the neural network provides an estimate of the policy p_{θ}. During the training phase, we wish to improve these estimates. This is accomplished using a Monte Carlo Tree Search (MCTS).

In the search tree, each node represents a board configuration. A directed edge exists between two nodes i→j, if a valid action can cause state transition from state i to j.

Starting with an empty search tree, we expand the search tree one node (state) at a time. When a new node is encountered, instead of performing a rollout, the value of the new node is obtained from the neural network itself. This value is propagated up the search path.

### Calculating the UCB1 :-

For the tree search, we maintain the following:

- Q(s,a): the expected reward for taking action a from state s, i.e. the Q-values
- N(s,a): the number of times we took action a from state s across simulations
- P(s,⋅) = p_{θ}(s): the initial estimate of taking each action from state s, according to the policy returned by the current neural network.

From these, we can calculate U(s,a), the upper confidence bound on the Q-values, as

U(s,a) = Q(s,a) + c_{puct} · P(s,a) · sqrt(Σ_{b} N(s,b)) / (1 + N(s,a))

Here c_{puct} is a hyperparameter that controls the degree of exploration.
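A small helper makes the U(s,a) computation concrete (illustrative; `q`, `p`, and the visit counts follow the definitions above):

```python
import math

def puct_u(q, p, n_sa, n_s_total, c_puct=1.0):
    """U(s,a) = Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)).

    q         -- Q(s,a), the current action-value estimate
    p         -- P(s,a), the network's prior probability for action a
    n_sa      -- N(s,a), visits to this action from s
    n_s_total -- sum over all actions b of N(s,b)
    """
    return q + c_puct * p * math.sqrt(n_s_total) / (1 + n_sa)
```

Note that, unlike UCB1, the exploration term here is weighted by the network prior P(s,a) and shrinks as the action is visited more often.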

### Tree Expansion using the Neural Network :-

To use MCTS to improve the initial policy returned by the current neural network:

- We initialize our empty search tree with s as the root.
- We compute the action a that maximizes the upper confidence bound U(s,a). If the next state s′ (obtained by playing action a on state s) exists in our tree, we recursively call the search on s′. If it does not exist, we add the new state to our tree, initialize P(s′,⋅) = p_{θ}(s′) and the value v(s′) = v_{θ}(s′) from the neural network, and initialize Q(s′,a) and N(s′,a) to 0 for all a.
- Instead of performing a rollout, we then propagate v(s′) up along the path seen in the current simulation and update all Q(s,a) values. If, on the other hand, we encounter a terminal state, we propagate the actual reward (+1 if the player wins, else -1).

### Self Play using MCTS :-

MCTS may be viewed as a self-play algorithm that, given neural network parameters and a root position s, computes a vector of search probabilities recommending moves to play, π = α_{ϴ}(s), proportional to the exponentiated visit count of each move, π_{a} ∝ N(s,a)^{1/τ}, where τ is a temperature parameter that controls the degree of exploration.

AlphaGo Zero uses τ = 1 for the first 30 moves of each game and then sets it to a very small value.
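A sketch of how the exponentiated visit counts can be turned into search probabilities (an illustrative NumPy helper, not DeepMind’s code):

```python
import numpy as np

def search_probabilities(visit_counts, tau=1.0):
    """pi_a proportional to N(s,a)^(1/tau)."""
    counts = np.asarray(visit_counts, dtype=float)
    if tau == 0:  # limit tau -> 0: all probability mass on the most-visited move
        probs = np.zeros_like(counts)
        probs[np.argmax(counts)] = 1.0
        return probs
    exponentiated = counts ** (1.0 / tau)
    return exponentiated / exponentiated.sum()
```

With τ = 1 this is just the normalized visit counts; as τ shrinks toward 0 the distribution sharpens toward the most-visited move, which is why small τ means near-greedy play.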

The neural network is trained by a self-play reinforcement learning algorithm that uses MCTS to play each move. First, the neural network is initialized to random weights ϴ_{0}. At each subsequent iteration i >= 1, games of self-play are generated. At each time-step t, an MCTS search π_{t} = α_{ϴi-1}(s_{t}) is executed using the previous iteration’s neural network f_{ϴi-1}, and a move is played by sampling the search probabilities π_{t}.

A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of r_{T} ∈ {-1,+1}. The data for each time-step t is stored as (s_{t}, π_{t}, z_{t}), where z_{t} = r_{T} is the game winner from the perspective of the current player at step t.

In parallel, new network parameters ϴ_{i} are trained from data (s;π; z) sampled uniformly among all time-steps of the last iteration(s) of self-play.

### Loss Function :-

The loss function is a combination of the **Critic Loss** (mean squared error between v and z) and the **Actor Loss** (cross-entropy between p and π).

When training the network, at the end of each game of self-play the neural network is provided training examples of the form (s_{t}, π_{t}, z_{t}).

π_{t} is the MCTS estimate of the policy from state s_{t}, and z_{t} ∈ {−1,1} is the final outcome of the game from the perspective of the player at s_{t} (+1 if the player wins, -1 if the player loses). The neural network is then trained to minimize the following loss function (excluding regularization terms):

l = (z − v)² − π^{T} log **p**
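As a sketch, the combined loss for a single training example could be computed as follows (NumPy, with a small epsilon inside the log for numerical safety; regularization omitted as in the text):

```python
import numpy as np

def alphazero_loss(z, v, pi, p, eps=1e-12):
    """l = (z - v)^2 - pi . log(p), for one training example."""
    value_loss = (z - v) ** 2  # critic: MSE between game outcome z and value v
    policy_loss = -np.sum(np.asarray(pi) * np.log(np.asarray(p) + eps))  # actor: cross-entropy
    return float(value_loss + policy_loss)
```

When the network output p matches π exactly and v matches z, the loss reduces to the entropy of π (its minimum achievable value for that target).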

### MCTS for Policy Iteration :-

Each simulation traverses the tree by selecting the edge with maximum action-value Q, plus an upper confidence bound U that depends on a stored prior probability P and visit count N for that edge (which is incremented once traversed).

The leaf node is expanded, and the associated position s is evaluated by the neural network, (P(s,⋅), V(s)) = f_{ϴ}(s); the vector of P values is stored in the outgoing edges from s.

Action-values Q are updated to track the mean of all evaluations V in the subtree below that action.

Once the search is complete, search probabilities are returned, proportional to N^{1/τ}, where N is the visit count of each move from the root state and τ is a parameter controlling temperature.

## A Recap

- Starting with random weights, the network was trained by stochastic gradient descent (with momentum, regularization, and step-size parameter decreasing as training continues) using batches of examples sampled uniformly at random from all the steps of the most recent 500,000 games of self-play with the current best policy. Extra noise was added to the network’s output p to encourage exploration of all possible moves.
- At periodic checkpoints during training (that the authors of the algorithm chose to be at every 1,000 training steps), the policy output by the ANN with the latest weights was evaluated by simulating 400 games (using MCTS with 1,600 iterations to select each move) against the current best policy.
- If the new policy won (by a margin set to reduce noise in the outcome), then it became the best policy to be used in subsequent self-play.
- The network’s weights were updated to make the network’s policy output p more closely match the policy returned by MCTS, and to make its value output, v, more closely match the probability that the current best policy wins from the board position represented by the network’s input.

## Alpha(Go) Zero – Algorithm Summary

- Neural Network starts training with random weights and assigns (p,v) values for various states.
- MCTS builds the tree from the initial state while balancing exploration vs exploitation using the U+Q values of nodes. Once it encounters a new leaf node, it uses the neural net to predict the (p,v) values of this new node. The Q and N values of this new node are zero at this point.
- Backpropagation then goes up the tree and updates the Q and N values of the upper nodes (by averaging the V of the child nodes). This way the whole tree gets built out.
- Use the MCTS from Step (2) to do some rounds of self-play to get the (pi, z) values.
- RECIRCLE – Neural Network is trained again with new Y targets of (pi,z). If the new NN overtakes the previous NN model by certain threshold, the new NN becomes the model to be used in next MCTS build-out.

Above cycle repeats till certain thresholds are met or certain game conditions are met.

## Code snippet – MCTS
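Below is a minimal sketch of the network-guided search described in the sections above. The `game` and `net` objects, and their methods `legal_actions`, `next_state`, `terminal_reward` (returning None for non-terminal states, otherwise the reward from the perspective of the player to move) and `predict`, are assumed interfaces rather than any particular library:

```python
import math

class MCTS:
    """Minimal neural-network-guided MCTS sketch (states must be hashable)."""

    def __init__(self, game, net, c_puct=1.0):
        self.game, self.net, self.c_puct = game, net, c_puct
        self.Q, self.N, self.P = {}, {}, {}  # statistics keyed by (state, action)
        self.Ns = {}                         # total visits per state

    def search(self, s):
        """Run one simulation from s; return the value for the parent's player."""
        r = self.game.terminal_reward(s)
        if r is not None:
            return -r                        # terminal: propagate the actual reward

        if s not in self.Ns:                 # leaf: expand using the network, no rollout
            policy, v = self.net.predict(s)
            self.P[s] = policy
            self.Ns[s] = 0
            return -v

        # selection: pick the action maximizing Q + U
        best_a, best_u = None, -float("inf")
        for a in self.game.legal_actions(s):
            q = self.Q.get((s, a), 0.0)
            u = q + self.c_puct * self.P[s][a] * math.sqrt(self.Ns[s]) / (1 + self.N.get((s, a), 0))
            if u > best_u:
                best_u, best_a = u, a

        v = self.search(self.game.next_state(s, best_a))

        # backpropagation: update the running mean Q and the visit counts
        sa = (s, best_a)
        self.Q[sa] = (self.N.get(sa, 0) * self.Q.get(sa, 0.0) + v) / (self.N.get(sa, 0) + 1)
        self.N[sa] = self.N.get(sa, 0) + 1
        self.Ns[s] += 1
        return -v
```

Each call to `search` runs one simulation from the root; repeated calls accumulate the N(s,a) statistics from which the search probabilities π are then computed. The sign flip on the return value encodes the alternation between the two players.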

## CellStrat Deep Reinforcement Learning Course :-

CellStrat AI Lab is a leading AI Lab and is working on the cutting-edge of Artificial Intelligence including latest algorithms in ML, DL, RL, Computer Vision, NLP etc.

Interested in learning Deep RL from one of the world’s best AI Labs? If yes, enroll in our extensive course in Deep Reinforcement Learning (DRL). More details and enrollment here: https://bit.ly/CSDRLC

Questions? Please feel free to call me at **+91-9742800566**!

Best Regards,

Vivek Singhal

Co-Founder & Chief Data Scientist, CellStrat

+91-9742800566