ex) TicTacToe
Assign value to each board position (state):
 +1 if the state is a winning terminal state.
 0 if the state is a losing terminal state
 +0.5 to all other states (terminal and non–terminal)
play many games
After each move, adjust the value of the previous state towards the value of the current state:
Temporal difference learning = to learn from successive estimates.
Notes:

There are no explicit rewards in this setup.

The agent never has to look more than one step ahead to make a move! V(s) will contain information about the future(all features will be stored in the node). Compare to Minimax.

The agent will explore the weakness in the opponent. In contrast, Minimax will not make a move which may lead to defeat, even if that move always leads to victory in practice (due to weaknesses in the opponent). On the other hand, the RL/TD agent may become too specialized. Compare to overtraining and getting such in local minima.
TD based RL, more general
From last lecture, we know that the value of a state s at time t is a (discounted) sum of future rewards (on average – lets drop the expected values for now).
If all values are correctly estimated, this relation holds by definition of \(V_t=\mathrm{E}\left(r_{t}+\gamma V_{t+1}\right)\), which is a variant of Bellman’s equation in DP.
If the estimates are not perfect, there will be a difference between the two sides of the equation. This difference is called the temporal difference error
Err_{TD}=r_{t}+\gamma V_{t+1}V_{t}
and can be used to update the value of \(V_t\) :
This is the learning rule. (Sutton 1988, special case of )
Sarsa (Rummery & Niranjan 1994)
Associate values to actions, rather than states. These are called Qvalues.
The Qvalue, Q(s, a) is the estimated value of doing action a in state s.
So, we are in a state s select an action a, receive a reward r and reach a new state s’ where we select an action a'. (Hence the name "Sarsa").
Q(s, a)=r+\gamma Q\left(s^{\prime}, a^{\prime}\right)
The right hand side  left hand side can be used as an error to update .
Q(s,a).
Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma Q\left(s^{\prime}, a^{\prime}\right)Q(s, a)\right)
Qlearning (Watkins 1989)
The Qvalue is defined under an assumption: Q(s, a) is the estimated value of doing action a in state s, assuming that all future actions are greedy.
The assumption leads to a max in the relation:
The update rule (r.h.s – l.h.s. again) becomes:
Note:'PPO'
Neural networks in QLearning (and Sarsa)
For large state space, we can store them in table.
Qlearning requires a large table (with an entry for each stateaction pair).
ANNs can be used to estimate the Qvalues.
We want,

ANN with the state, s, on its inputs and n outputs, estimating the Qvalues of the n possible actions.


Use supervised learning with the right hand side of (5) as target

Only update the output corresponding to the selected action (the errors of all other outputs are set to zero, since we don't know how good they are).

Use linear outputs

Backprop not so good (continuous learning). RBFs (Lecture 10) better. Alternative, use backprop but with experience replay (Lecture 10).

Deep Qlearning = same thing, just a deeper network.
In this additional part, we are going to talk about the algorithms of Qlearning and SARSA.
This figure is copyright by morvanzhou. In shows how these algorithms update Q table.
Let us focus on a state S
that an agent stay in. The agent do action a
(according to its policy), then it comes into state S'
.
In the state S'
, there are some key differences.

The agent following Qlearning assumes that the next reward is based on greedy policy. It updates its Qvalues using the Qvalue of the state
s′
and the greedy actiona′
, . Then at the time the agent makes action , it does action according to its policy. This action may not bea′
. 
The agent following SARSA assumes that the current policy will be used in future. It updates its Qvalues using the Qvalue of the state
s′
and the action*a′
according to its policy. Then at the time the agent makes action , it does actiona′
.
文章评论