ex) Tic-Tac-Toe
Assign a value to each board position (state):
- +1 if the state is a winning terminal state.
- 0 if the state is a losing terminal state.
- +0.5 for all other states (terminal and non-terminal).
Play many games.
After each move, adjust the value of the previous state towards the value of the current state.
Temporal difference (TD) learning = learning from successive estimates.
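A minimal sketch of this update in Python, assuming a dictionary V of state values (initialised to 0.5, with +1/0 at winning/losing terminal states) and a learning rate eta; all names are illustrative:

```python
# Illustrative sketch of the tic-tac-toe value update.
# V: dict mapping (hashable) board states to value estimates,
#    initialised to 0.5, with 1.0 / 0.0 at winning / losing terminal states.
eta = 0.1  # learning rate

def td_update(V, prev_state, curr_state):
    """Move the value of the previous state towards the value of the current one."""
    V[prev_state] += eta * (V[curr_state] - V[prev_state])
```

Played over many games with (mostly) greedy move selection based on V, this propagates the terminal values backwards through the visited states.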
Notes:
- There are no explicit rewards in this setup.
- The agent never has to look more than one step ahead to make a move! V(s) already contains information about the future (the relevant future information is stored in the state's value). Compare to Minimax.
- The agent will exploit weaknesses in the opponent. In contrast, Minimax will not make a move that may lead to defeat, even if that move always leads to victory in practice (due to weaknesses in the opponent). On the other hand, the RL/TD agent may become too specialized. Compare to over-training and getting stuck in local minima.
TD-based RL, more generally
From last lecture, we know that the value of a state s at time t is a (discounted) sum of future rewards (on average; let's drop the expected values for now).
If all values are correctly estimated, the relation \(V_{t}=\mathrm{E}\left(r_{t}+\gamma V_{t+1}\right)\) holds by definition; this is a variant of Bellman's equation in DP.
If the estimates are not perfect, there will be a difference between the two sides of the equation. This difference is called the temporal difference error
\[\mathrm{Err}_{TD}=r_{t}+\gamma V_{t+1}-V_{t}\]
and can be used to update the estimate \(V_t\):
\[V_{t} \leftarrow V_{t}+\eta\left(r_{t}+\gamma V_{t+1}-V_{t}\right)\]
This is the learning rule (Sutton 1988; a special case of TD(\(\lambda\))).
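As a minimal sketch, the same rule for a single observed transition, assuming a table V of value estimates indexed by state (names are illustrative):

```python
# TD(0) update for one observed transition (s_t, r_t, s_next); illustrative sketch.
eta = 0.1    # learning rate
gamma = 0.9  # discount factor

def td0_update(V, s_t, r_t, s_next):
    """Update V[s_t] using the temporal difference error."""
    td_error = r_t + gamma * V[s_next] - V[s_t]
    V[s_t] += eta * td_error
    return td_error
```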
Sarsa (Rummery & Niranjan 1994)
Associate values to actions, rather than states. These are called Q-values.
The Q-value, Q(s, a), is the estimated value of doing action a in state s.
So: we are in a state s, select an action a, receive a reward r, and reach a new state s' where we select an action a'. (Hence the name "Sarsa".)
\[Q(s, a)=r+\gamma Q\left(s^{\prime}, a^{\prime}\right)\]
The right-hand side minus the left-hand side can be used as an error to update Q(s, a):
\[Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)\]
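A sketch of one Sarsa step, assuming the Q-table is stored as a dict of dicts and actions are selected epsilon-greedily; names and hyperparameters are illustrative:

```python
import random

eta, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative hyperparameters

def epsilon_greedy(Q, s, actions):
    """Random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update: uses the action a_next actually selected in s_next."""
    Q[s][a] += eta * (r + gamma * Q[s_next][a_next] - Q[s][a])
```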
Q-learning (Watkins 1989)
The Q-value is defined under an assumption: Q(s, a) is the estimated value of doing action a in state s, assuming that all future actions are greedy.
The assumption leads to a max in the relation:
\[Q(s, a)=r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\]
The update rule (r.h.s. minus l.h.s. again) becomes:
\[Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)\]
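For comparison, a sketch of the tabular Q-learning update; it differs from the Sarsa sketch above only in the max over next actions (names are again illustrative):

```python
eta, gamma = 0.1, 0.9  # illustrative hyperparameters

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy update: bootstraps from the greedy value in s_next,
    regardless of which action the behaviour policy actually takes there."""
    best_next = max(Q[s_next].values())
    Q[s][a] += eta * (r + gamma * best_next - Q[s][a])
```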
Neural networks in Q-Learning (and Sarsa)
For large state spaces, we cannot store the Q-values in a table: Q-learning would require a huge table, with an entry for each state-action pair.
ANNs can be used to estimate the Q-values.
We want:
- An ANN with the state, s, on its inputs and n outputs, estimating the Q-values of the n possible actions (see the sketch after this list).
- Use supervised learning, with the right-hand side of the Q-value relation above as the target.
- Only update the output corresponding to the selected action (the errors of all other outputs are set to zero, since we don't know how good they are).
- Use linear outputs.
- Backprop is not so good here (learning is continual, so new updates interfere with old ones). RBFs (Lecture 10) are better. Alternatively, use backprop but with experience replay (Lecture 10).
- Deep Q-learning = the same thing, just with a deeper network.
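A minimal numpy sketch of this setup: a one-hidden-layer network with linear outputs, trained with a Q-learning target where only the selected action's output receives an error. Layer sizes, the tanh hidden layer, and all names are illustrative assumptions; a practical version would add experience replay as noted above.

```python
import numpy as np

class QNetwork:
    """Tiny one-hidden-layer network: state in, one linear output per action."""
    def __init__(self, n_inputs, n_actions, n_hidden=32, lr=0.01, gamma=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr, self.gamma = lr, gamma

    def forward(self, s):
        h = np.tanh(s @ self.W1 + self.b1)   # hidden layer
        q = h @ self.W2 + self.b2            # linear outputs = Q-values
        return h, q

    def update(self, s, a, r, s_next, done):
        """One Q-learning step: supervised target only for the chosen action a."""
        h, q = self.forward(s)
        _, q_next = self.forward(s_next)
        target = r if done else r + self.gamma * np.max(q_next)
        err = np.zeros_like(q)
        err[a] = target - q[a]               # all other outputs get zero error
        # backprop of the squared error for the selected output
        grad_q = -err
        grad_W2 = np.outer(h, grad_q)
        grad_h = self.W2 @ grad_q
        grad_pre = grad_h * (1 - h ** 2)     # tanh derivative
        grad_W1 = np.outer(s, grad_pre)
        self.W2 -= self.lr * grad_W2
        self.b2 -= self.lr * grad_q
        self.W1 -= self.lr * grad_W1
        self.b1 -= self.lr * grad_pre
```

Action selection would use the argmax of the network's outputs plus some exploration; for the Sarsa variant, the target would instead use the network's Q-value for the action actually selected in s_next rather than the max.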
In this additional part, we take a closer look at the Q-learning and SARSA algorithms.
The figure (copyright morvanzhou) shows how these algorithms update the Q-table.
Let us focus on a state S that the agent is in. The agent takes action a (according to its policy) and ends up in state S'. What happens in state S' is where the key differences lie (a small sketch contrasting the two updates follows the list).
- The agent following Q-learning assumes that future rewards are obtained by a greedy policy. It updates its Q-values using the Q-value of state S' and the greedy action a'. When the agent then actually acts in S', it follows its behaviour policy, so the action it takes may not be a'.
- The agent following SARSA assumes that the current policy will also be used in the future. It updates its Q-values using the Q-value of state S' and the action a' selected according to its policy. When the agent then acts in S', it does take action a'.
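A compact sketch contrasting the two update targets for the same transition, reusing the Q-table layout and the epsilon_greedy helper assumed in the earlier sketches:

```python
# For a transition (s, a, r, s_next) under an epsilon-greedy behaviour policy:
a_next = epsilon_greedy(Q, s_next, actions)                # action actually taken in S'

sarsa_target      = r + gamma * Q[s_next][a_next]          # SARSA: the action that will be taken
q_learning_target = r + gamma * max(Q[s_next].values())    # Q-learning: greedy action, may differ from a_next

# Either target is then used in the same way:
# Q[s][a] += eta * (target - Q[s][a])
```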