Natural Computation Methods for Machine Learning Note 08

February 20, 2020



ex) Tic-Tac-Toe

Assign value to each board position (state):

  • +1 if the state is a winning terminal state
  • 0 if the state is a losing terminal state
  • +0.5 for all other states (terminal and non-terminal)

Play many games.

[Figure: Tic-Tac-Toe game tree]

After each move, adjust the value of the previous state towards the value of the current state:

$$V(s) \leftarrow V(s)+\eta\left[V\left(s^{\prime}\right)-V(s)\right]$$

Temporal difference learning = learning from successive estimates.
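A minimal Python sketch of this Tic-Tac-Toe setup follows. The state encoding and the won/lost flags are assumptions for illustration; only the initial values (1, 0, 0.5) and the update rule come from the note above.

```python
# Value table for Tic-Tac-Toe board positions and the TD-style update above.
ETA = 0.1                 # learning rate eta (value chosen arbitrarily)
V = {}                    # state -> value, filled in lazily as states are visited

def get_value(state, won=False, lost=False):
    """V(s): +1 for a winning terminal state, 0 for a losing one, 0.5 otherwise."""
    if state not in V:
        V[state] = 1.0 if won else (0.0 if lost else 0.5)
    return V[state]

def td_update(prev_state, curr_state):
    """After each move: V(s) <- V(s) + eta * (V(s') - V(s))."""
    v_prev = V.get(prev_state, 0.5)
    V[prev_state] = v_prev + ETA * (V.get(curr_state, 0.5) - v_prev)
```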

Notes:

  • There are no explicit rewards in this setup.

  • The agent never has to look more than one step ahead to make a move! V(s) will contain information about the future (all of the future information is stored in the node's value). Compare to Minimax.

  • The agent will exploit weaknesses in the opponent. In contrast, Minimax will not make a move which may lead to defeat, even if that move always leads to victory in practice (due to weaknesses in the opponent). On the other hand, the RL/TD agent may become too specialized. Compare to over-training and getting stuck in local minima.

TD based RL, more general

From the last lecture, we know that the value of a state s at time t is a (discounted) sum of future rewards (on average; let's drop the expected values for now).

[Figure: TD-based RL flow]

$$V_{t}=\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}=\gamma^{0} r_{t}+\sum_{k=1}^{\infty} \gamma^{k} r_{t+k}=r_{t}+\gamma \sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k}=r_{t}+\gamma \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}=r_{t}+\gamma V_{t+1}$$

If all values are correctly estimated, this relation holds by definition: \(V_{t}=\mathrm{E}\left(r_{t}+\gamma V_{t+1}\right)\), which is a variant of Bellman's equation in DP.

If the estimates are not perfect, there will be a difference between the two sides of the equation. This difference is called the temporal difference error

$$\mathrm{Err}_{TD}=r_{t}+\gamma V_{t+1}-V_{t}$$

and can be used to update the value of \(V_t\):

$$V_{t} \leftarrow V_{t}+\eta\left(\mathrm{Err}_{TD}\right)=V_{t}+\eta\left(r_{t}+\gamma V_{t+1}-V_{t}\right)$$

This is the TD learning rule (Sutton 1988; a special case of TD(λ)).
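As a concrete illustration, here is a minimal tabular TD(0) sketch in Python. The dictionary-backed value function, the default value of 0.0, and the function name td0_update are assumptions of this sketch; only the error and update lines follow the formulas above.

```python
# Tabular TD(0): V_t <- V_t + eta * (r_t + gamma * V_{t+1} - V_t)
ETA, GAMMA = 0.1, 0.9   # learning rate and discount factor (values chosen arbitrarily)

def td0_update(V, s, r, s_next, terminal=False):
    """Update V[s] from the observed reward r and the successor state s_next."""
    v_next = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + GAMMA * v_next - V.get(s, 0.0)   # Err_TD
    V[s] = V.get(s, 0.0) + ETA * td_error
    return td_error
```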

Sarsa (Rummery & Niranjan 1994)

Associate values with actions rather than with states. These are called Q-values.

The Q-value Q(s, a) is the estimated value of doing action a in state s.

[Figure: Sarsa flow, s → a → r → s′ → a′]

So, we are in a state s, select an action a, receive a reward r, and reach a new state s′, where we select an action a′ (hence the name "Sarsa").

$$Q(s, a)=r+\gamma Q\left(s^{\prime}, a^{\prime}\right)$$

The difference between the right-hand side and the left-hand side can be used as an error to update Q(s, a):

$$Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)$$
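A minimal tabular sketch of one Sarsa backup, assuming a dictionary-backed Q-table with a default value of 0.0; only the update line follows the rule above.

```python
# One Sarsa backup for a single (s, a, r, s', a') transition.
ETA, GAMMA = 0.1, 0.9

def sarsa_update(Q, s, a, r, s_next, a_next, terminal=False):
    """Q(s,a) <- Q(s,a) + eta * (r + gamma * Q(s',a') - Q(s,a))."""
    q_next = 0.0 if terminal else Q.get((s_next, a_next), 0.0)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + ETA * (r + GAMMA * q_next - q_sa)
```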

Q-learning (Watkins 1989)

The Q-value is defined under an assumption: Q(s, a) is the estimated value of doing action a in state s, assuming that all future actions are greedy.

The assumption leads to a max in the relation:

$$Q(s, a)=r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)$$

The update rule (r.h.s – l.h.s. again) becomes:

$$Q(s, a) \leftarrow Q(s, a)+\eta\left(r+\gamma \max _{a^{\prime}} Q\left(s^{\prime}, a^{\prime}\right)-Q(s, a)\right)$$
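The same tabular sketch with the single change Q-learning requires: the bootstrap term maximizes over the actions available in s′ (the actions argument is an assumption of this sketch).

```python
# One Q-learning backup: like sarsa_update, but bootstrapping with the greedy action in s'.
ETA, GAMMA = 0.1, 0.9

def q_learning_update(Q, s, a, r, s_next, actions, terminal=False):
    """Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    q_max = 0.0 if terminal else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = Q.get((s, a), 0.0)
    Q[(s, a)] = q_sa + ETA * (r + GAMMA * q_max - q_sa)
```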

[Figure: comparing Q-learning and Sarsa]

Neural networks in Q-Learning (and Sarsa)

For large state spaces, we cannot simply store the Q-values in a table.

Q-learning requires a large table (with an entry for each state-action pair).

ANNs can be used to estimate the Q-values.

We want:

  • ANN with the state, s, on its inputs and n outputs, estimating the Q-values of the n possible actions.

  • [Figure: neural network estimating Q-values]

  • Use supervised learning with the right-hand side of the Q-learning relation above as the target (see the sketch after this list).

  • Only update the output corresponding to the selected action (the errors of all other outputs are set to zero, since we don't know how good they are).

  • Use linear outputs

  • Backprop is not so good here (continuous learning); RBFs (Lecture 10) are better. Alternatively, use backprop with experience replay (Lecture 10).

  • Deep Q-learning = same thing, just a deeper network.
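Here is a minimal PyTorch sketch of such a Q-network, as referenced in the list above. PyTorch itself, the hidden layer size, and the train_step helper are assumptions of this sketch; the lecture only specifies state inputs, one linear output per action, and updating only the selected action's output.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """State vector in, one linear output per action (n Q-value estimates)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, n_actions)   # linear outputs, as suggested above

    def forward(self, s):
        return self.head(self.body(s))

def train_step(net, opt, s, a, target):
    """Supervised step: s is a float tensor (batch, state_dim), a a long tensor (batch,),
    target the TD target, e.g. r + gamma * max_a' Q(s', a')."""
    q = net(s)                                            # shape (batch, n_actions)
    q_selected = q.gather(1, a.unsqueeze(1)).squeeze(1)   # only the chosen actions
    loss = nn.functional.mse_loss(q_selected, target)     # other outputs get zero error
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```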

Algorithms of Q-learning and SARSA

In this additional part, we are going to talk about the algorithms of Q-learning and SARSA.

[Figure: how Q-learning and SARSA update the Q-table]

This figure is copyright of morvanzhou. It shows how these algorithms update the Q-table.

Let us focus on a state S that an agent is in. The agent takes action a (according to its policy) and then arrives in state S′.

In state S′, there are some key differences:

  • The agent following Q-learning assumes that the next reward is based on the greedy policy. It updates its Q-value using the Q-value of state s′ and the greedy action a′. Then, when it actually acts in s′, it chooses an action according to its policy; this action may not be a′.

  • The agent following SARSA assumes that the current policy will also be used in the future. It updates its Q-value using the Q-value of state s′ and the action a′ chosen by its policy. Then, when it actually acts in s′, it does take action a′ (see the sketch below).
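To make this concrete, here is a sketch of the two episode loops side by side. The ε-greedy behaviour policy and the simplified env interface (env.reset() returning a state, env.step(a) returning (s_next, r, done)) are assumptions of this sketch, not part of the lecture note.

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Behaviour policy assumed for both agents."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa_episode(env, Q, actions, eta=0.1, gamma=0.9):
    s = env.reset()
    a = epsilon_greedy(Q, s, actions)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, actions)            # chosen by the policy...
        target = r + (0.0 if done else gamma * Q.get((s_next, a_next), 0.0))
        Q[(s, a)] = Q.get((s, a), 0.0) + eta * (target - Q.get((s, a), 0.0))
        s, a = s_next, a_next                                  # ...and actually executed next

def q_learning_episode(env, Q, actions, eta=0.1, gamma=0.9):
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(Q, s, actions)                      # behaviour action
        s_next, r, done = env.step(a)
        q_max = 0.0 if done else max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + eta * (r + gamma * q_max - Q.get((s, a), 0.0))
        s = s_next                                             # the greedy a' was only used in the update
```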

