Thinking Fast and Slow with Deep Learning and Tree Search

Think fast and slow says that human mind has two systems system 1 and system2.

System 1: intuition or heuristic process.

Eg: Reinforce, DQN

System 2: conscious, explicit and rule-based mode of reasoning

Tree Search

Expert Iteration - System 1 to make selection with no look over head and System 2 to search possible

search policies and suggest a strong policies

Imitation Learnning : Apprentice is trained to imitate the behaviour of expert policy

Reference:

Read 2.Preliminaries for good explanation

Markov Decision Process (MDP

Monte Carlo tree search

Kullback–Leibler divergence

A search tree is created while simulation is been run on the game, which has two phases: tree phase and roll out phase.

Each simulation consists of two parts. First, a tree phase, where the tree is traversed by taking actions according to a tree policy. Second, a rollout phase, where some default policy is followed until the simulation reaches a terminal game state. The result returned by this simulation can then be used to update estimates of the value of each node traversed in the tree during the first phase.

At each node the algo store n(s), the number of iterations in which the node has been visited so far. Each edge stores both n(s; a), the number of times it has been traversed, and r(s; a) the sum of all rewards obtained in simulations that passed through the edge.