# Natural Computation Methods for Machine Learning Note 08

2020年2月20日 2590点热度 0人点赞 0条评论

# Natural Computation Methods for Machine Learning Note 08

ex) Tic-Tac-Toe

Assign value to each board position (state):

• +1 if the state is a winning terminal state.
• 0 if the state is a losing terminal state
• +0.5 to all other states (terminal and non–terminal)

play many games

After each move, adjust the value of the previous state towards the value of the current state:

$$V(s) \leftarrow V(s)+\eta\left[V\left(s^{\prime}\right)-V(s)\right]$$

Temporal difference learning = to learn from successive estimates.

Notes:

• There are no explicit rewards in this setup.

• The agent never has to look more than one step ahead to make a move! V(s) will contain information about the future(all features will be stored in the node). Compare to Minimax.

• The agent will explore the weakness in the opponent. In contrast, Minimax will not make a move which may lead to defeat, even if that move always leads to victory in practice (due to weaknesses in the opponent). On the other hand, the RL/TD agent may become too specialized. Compare to over-training and getting such in local minima.

## TD based RL, more general

From last lecture, we know that the value of a state s at time t is a (discounted) sum of future rewards (on average – lets drop the expected values for now).

$$V_{t}=\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}=\gamma^{0} r_{t}+\sum_{k=1}^{\infty} \gamma^{k} r_{t+k}=r_{t}+\gamma \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}=r_{t}+\gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}=r_{t}+\mathcal{\gamma V}_{t+1}$$

If all values are correctly estimated, this relation holds by definition of ​$$V_t=\mathrm{E}\left(r_{t}+\gamma V_{t+1}\right)$$, which is a variant of Bellman’s equation in DP.

If the estimates are not perfect, there will be a difference between the two sides of the equation. This difference is called the temporal difference error

Err_{TD}=r_{t}+\gamma V_{t+1}-V_{t}

and can be used to update the value of $$V_t$$ :

$$V_{t} \leftarrow V_{t}+\eta\left(E r r_{T D}\right)=V_{t}+\eta\left(r_{t}+\mathcal\gamma V_{t+1}-V_{t}\right)$$

This is the learning rule. (Sutton 1988, special case of )

## Sarsa (Rummery & Niranjan 1994)

Associate values to actions, rather than states. These are called Q-values.

The Q-value, Q(s, a) is the estimated value of doing action a in state s.

So, we are in a state s select an action a, receive a reward r and reach a new state s’ where we select an action a'. (Hence the name "Sarsa").

Q(s, a)=r+\gamma Q\left(s^{\prime}, a^{\prime}\right)

The right hand side - left hand side can be used as an error to update .

Q(s,a).

Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)

## Q-learning (Watkins 1989)

The Q-value is defined under an assumption: Q(s, a) is the estimated value of doing action a in state s, assuming that all future actions are greedy.

The assumption leads to a max in the relation:

$$Q(s, a)=r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)$$

The update rule (r.h.s – l.h.s. again) becomes:

$$Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma \max Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)$$

Note:'PPO'

## Neural networks in Q-Learning (and Sarsa)

For large state space, we can store them in table.

Q-learning requires a large table (with an entry for each state-action pair).

ANNs can be used to estimate the Q-values.

We want,

• ANN with the state, s, on its inputs and n outputs, estimating the Q-values of the n possible actions.

• Use supervised learning with the right hand side of (5) as target

• Only update the output corresponding to the selected action (the errors of all other outputs are set to zero, since we don't know how good they are).

• Use linear outputs

• Backprop not so good (continuous learning). RBFs (Lecture 10) better. Alternative, use backprop but with experience replay (Lecture 10).

• Deep Q-learning = same thing, just a deeper network.

## Algorithms of Q-learning and SARSA

In this additional part, we are going to talk about the algorithms of Q-learning and SARSA.

This figure is copyright by morvanzhou. In shows how these algorithms update Q table.

Let us focus on a state S that an agent stay in. The agent do action a(according to its policy), then it comes into state S'.

In the state S', there are some key differences.

• The agent following Q-learning assumes that the next reward is based on greedy policy. It updates its Q-values using the Q-value of the state s′and the greedy action a′, . Then at the time the agent makes action , it does action according to its policy. This action may not be a′.

• The agent following SARSA assumes that the current policy will be used in future. It updates its Q-values using the Q-value of the state s′and the action* a′ according to its policy. Then at the time the agent makes action , it does action a′.

Dong Wang

I will work as a PhD student of TU Graz in Austria. My research interests include Embedded/Edge AI, federated learning, computer vision, and IoT.