ex) Tic-Tac-Toe
Assign a value to each board position (state):
- +1 if the state is a winning terminal state.
- 0 if the state is a losing terminal state.
- +0.5 for all other states (terminal and non-terminal).
Play many games.
After each move, adjust the value of the previous state towards the value of the current state.
Temporal difference (TD) learning = learning from successive estimates.
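A minimal sketch of this update in Python, assuming a dictionary V of state values (initialised to 0.5, with +1/0 at winning/losing terminal states) and a learning rate eta; all names are illustrative:

```python
# Illustrative sketch of the tic-tac-toe value update.
# V: dict mapping (hashable) board states to value estimates,
#    initialised to 0.5, with 1.0 / 0.0 at winning / losing terminal states.
eta = 0.1  # learning rate

def td_update(V, prev_state, curr_state):
    """Move the value of the previous state towards the value of the current one."""
    V[prev_state] += eta * (V[curr_state] - V[prev_state])
```

Played over many games with (mostly) greedy move selection based on V, this propagates the terminal values backwards through the visited states.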
Notes:
- There are no explicit rewards in this setup.
- The agent never has to look more than one step ahead to make a move! V(s) already contains information about the future (the relevant future information is stored in the state's value). Compare to Minimax.
- The agent will exploit weaknesses in the opponent. In contrast, Minimax will not make a move that may lead to defeat, even if that move always leads to victory in practice (due to weaknesses in the opponent). On the other hand, the RL/TD agent may become too specialized. Compare to over-training and getting stuck in local minima.
TD-based RL, more generally
From last lecture, we know that the value of a state s at time t is a (discounted) sum of future rewards (on average; let's drop the expected values for now).
If all values are correctly estimated, the relation \(V_{t}=\mathrm{E}\left(r_{t}+\gamma V_{t+1}\right)\) holds by definition; this is a variant of Bellman's equation in DP.
If the estimates are not perfect, there will be a difference between the two sides of the equation. This difference is called the temporal difference error
\[\mathrm{Err}_{TD}=r_{t}+\gamma V_{t+1}-V_{t}\]
and can be used to update the estimate \(V_t\):
\[V_{t} \leftarrow V_{t}+\eta\left(r_{t}+\gamma V_{t+1}-V_{t}\right)\]
This is the learning rule (Sutton 1988; a special case of TD(\(\lambda\))).
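As a minimal sketch, the same rule for a single observed transition, assuming a table V of value estimates indexed by state (names are illustrative):

```python
# TD(0) update for one observed transition (s_t, r_t, s_next); illustrative sketch.
eta = 0.1    # learning rate
gamma = 0.9  # discount factor

def td0_update(V, s_t, r_t, s_next):
    """Update V[s_t] using the temporal difference error."""
    td_error = r_t + gamma * V[s_next] - V[s_t]
    V[s_t] += eta * td_error
    return td_error
```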
Sarsa (Rummery & Niranjan 1994)
Associate values to actions, rather than states. These are called Q-values.
The Q-value, Q(s, a), is the estimated value of doing action a in state s.
So: we are in a state s, select an action a, receive a reward r, and reach a new state s' where we select an action a'. (Hence the name "Sarsa".)
\[Q(s, a)=r+\gamma Q\left(s^{\prime}, a^{\prime}\right)\]
The right-hand side minus the left-hand side can be used as an error to update Q(s, a):
\[Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)\]
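A sketch of one Sarsa step, assuming the Q-table is stored as a dict of dicts and actions are selected epsilon-greedily; names and hyperparameters are illustrative:

```python
import random

eta, gamma, epsilon = 0.1, 0.9, 0.1  # illustrative hyperparameters

def epsilon_greedy(Q, s, actions):
    """Random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update: uses the action a_next actually selected in s_next."""
    Q[s][a] += eta * (r + gamma * Q[s_next][a_next] - Q[s][a])
```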
Q-learning (Watkins 1989)
The Q-value is defined under an assumption: Q(s, a) is the estimated value of doing action a in state s, assuming that all future actions are greedy.
The assumption leads to a max in the relation:
\[Q(s, a)=r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)\]
The update rule (r.h.s. minus l.h.s. again) becomes:
\[Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma \max_{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)\]
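For comparison, a sketch of the tabular Q-learning update; it differs from the Sarsa sketch above only in the max over next actions (names are again illustrative):

```python
eta, gamma = 0.1, 0.9  # illustrative hyperparameters

def q_learning_update(Q, s, a, r, s_next):
    """Off-policy update: bootstraps from the greedy value in s_next,
    regardless of which action the behaviour policy actually takes there."""
    best_next = max(Q[s_next].values())
    Q[s][a] += eta * (r + gamma * best_next - Q[s][a])
```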
Neural networks in Q-Learning (and Sarsa)
For large state spaces, we cannot store the Q-values in a table: Q-learning would require a huge table, with an entry for each state-action pair.
ANNs can be used to estimate the Q-values.
We want:
- An ANN with the state, s, on its inputs and n outputs, estimating the Q-values of the n possible actions (see the sketch after this list).
- Use supervised learning, with the right-hand side of the Q-value relation above as the target.
- Only update the output corresponding to the selected action (the errors of all other outputs are set to zero, since we don't know how good they are).
- Use linear outputs.
- Backprop is not so good here (learning is continual, so new updates interfere with old ones). RBFs (Lecture 10) are better. Alternatively, use backprop but with experience replay (Lecture 10).
- Deep Q-learning = the same thing, just with a deeper network.
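A minimal numpy sketch of this setup: a one-hidden-layer network with linear outputs, trained with a Q-learning target where only the selected action's output receives an error. Layer sizes, the tanh hidden layer, and all names are illustrative assumptions; a practical version would add experience replay as noted above.

```python
import numpy as np

class QNetwork:
    """Tiny one-hidden-layer network: state in, one linear output per action."""
    def __init__(self, n_inputs, n_actions, n_hidden=32, lr=0.01, gamma=0.9):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_inputs, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_actions))
        self.b2 = np.zeros(n_actions)
        self.lr, self.gamma = lr, gamma

    def forward(self, s):
        h = np.tanh(s @ self.W1 + self.b1)   # hidden layer
        q = h @ self.W2 + self.b2            # linear outputs = Q-values
        return h, q

    def update(self, s, a, r, s_next, done):
        """One Q-learning step: supervised target only for the chosen action a."""
        h, q = self.forward(s)
        _, q_next = self.forward(s_next)
        target = r if done else r + self.gamma * np.max(q_next)
        err = np.zeros_like(q)
        err[a] = target - q[a]               # all other outputs get zero error
        # backprop of the squared error for the selected output
        grad_q = -err
        grad_W2 = np.outer(h, grad_q)
        grad_h = self.W2 @ grad_q
        grad_pre = grad_h * (1 - h ** 2)     # tanh derivative
        grad_W1 = np.outer(s, grad_pre)
        self.W2 -= self.lr * grad_W2
        self.b2 -= self.lr * grad_q
        self.W1 -= self.lr * grad_W1
        self.b1 -= self.lr * grad_pre
```

Action selection would use the argmax of the network's outputs plus some exploration; for the Sarsa variant, the target would instead use the network's Q-value for the action actually selected in s_next rather than the max.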
In this additional part, we take a closer look at the Q-learning and SARSA algorithms.
The figure (copyright morvanzhou) shows how these algorithms update the Q-table.
Let us focus on a state S that the agent is in. The agent takes action a (according to its policy) and ends up in state S'. What happens in state S' is where the key differences lie (a small sketch contrasting the two updates follows the list).
- The agent following Q-learning assumes that future rewards are obtained by a greedy policy. It updates its Q-values using the Q-value of state S' and the greedy action a'. When the agent then actually acts in S', it follows its behaviour policy, so the action it takes may not be a'.
- The agent following SARSA assumes that the current policy will also be used in the future. It updates its Q-values using the Q-value of state S' and the action a' selected according to its policy. When the agent then acts in S', it does take action a'.
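A compact sketch contrasting the two update targets for the same transition, reusing the Q-table layout and the epsilon_greedy helper assumed in the earlier sketches:

```python
# For a transition (s, a, r, s_next) under an epsilon-greedy behaviour policy:
a_next = epsilon_greedy(Q, s_next, actions)                # action actually taken in S'

sarsa_target      = r + gamma * Q[s_next][a_next]          # SARSA: the action that will be taken
q_learning_target = r + gamma * max(Q[s_next].values())    # Q-learning: greedy action, may differ from a_next

# Either target is then used in the same way:
# Q[s][a] += eta * (target - Q[s][a])
```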