From today, we are going to introduce reinforcement learning(RL).

Forget ANNs for a while - RL is a field in its own right, independent from ANNs. However, as we will see, ANNs are sometimes used in RL.

**Definition**: Learning by interaction with an environment, to maximize some long-term scalar value (or to minimize a cost). Trial-and-error.

Some classic RL problems: Robot navigation, pole balancing, TD-gammon, AlphaGo, Crite’s elevators.

## MENACE

Learns to play the game Tic–Tac–Toe (Noughts and Crosses), using a number of matchboxes and coloured beads.

The machine gets no feedback until the game ends.

It is very difficult to decide which individual moves during a game which were the most responsible for the result. Here: All moves made during a game are considered good, if we won, and bad if we lost.

All moves are credited/blamed by the same amount, disregarding how early or late they were made during the game. We would perhaps want give more credit to recent moves than to early ones. This can be done by initiating the boxes with a different number of beads of each colour, depending on the distance from the starting position.

The rules are given as a prior (by the initialization procedure). MENACE can therefore focus on learning, not how to play, but how to play well.

All beads of a certain colour may be removed in a box.

## General issues

### state

A description of the current state of the environment.

(Menace: the current board position)

Must contain enough information for the agent to solve its task, but does not have to be (seldom is) a complete description of the environment. (Think of it as a vector of sensor values)

## Reward (reinforcement signal, grade)

A quality measure of the environment's state. Only an indirect measure of the agent’s state (maybe incomplete) and actions (since the environment may be affected by other things as well). This is what we want to minimize/ maximize.

Note: Don’t assume that negative rewards are bad and positive rewards are good – the reward is simply a scalar to be maximized. The maximum can be negative!

## Action

Affects the environment and hence, indirectly, the next state and reward. [The environment may be affected by other factors as well, e.g. other agents] Discrete or continuous (finite or infinite number of possible actions for each state). (Menace: discrete, Learning to steer a car: continuous).

## The Agent

Action selection should be stochastic! The agent must be allowed to select and the must be allowed to explore, i.e. sometimes make actions which may seem sub-optimal at the time.

Common example: Epsilon-greedy: With probability , select a random action (i.e. ignore the suggestions from the learning system). Otherwise, be greedy.

Greedy, in this context, means to trust the learning system - to select the action that is currently believed to be the best with (Menace: the action which has the largest number of associated beads in the box)

## Delayed vs Immediate RL

Rewards should be associated with a sequence of actions (delayed RL), not only the latest one (immediate RL), since earlier decisions may also have contributed to the result. (Menace: delayed).

The difference is not only in how often we get rewarded, but also in how the rewards are used/interpreted.

Core issue: How to distribute credit or blame on the sequence of actions. Temporal credit assignment problem (Menace: give more credit/blame to recent moves than earlier ones)

## Comparison to supervised learning

Supervised learning imitates (and can never exceed the performance of its teacher). RL explores (and therefore can). Can be combined, as in AlphaGo.

Supervised learning does not (usually) interact. RL does.

Feedback in supervised learning is specific and instructive (w.r.t. the agent’s actions). In RL it is not. The reward does not say which other actions would have been better (which is why we must explore).

## Goals, values and policies

### Goal

To maximize the long term reward, i.e. to select actions which maximize expected future(long term) rewards. E.g., at time t to maximize

R_{t}=E\left(\sum_{k=0}^{h} r_{t+k}\right)

(discrete time, finite horizon, looks h steps ahead)

**Problem**: We often need to do this for an infinite horizon (\(h=\infty\)). Then, we must ensure that the sum still is finite (since it is supposed to be maximized).

**Solution**: Maximize the value, V – the expected value of a discounted (exponentially decreasing) sum of future rewards

V_{t}=E\left(\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right), \gamma \in[0,1) \text{is a discount factor}

Continuous time algorithms exist but is out of scope for this course.

A **policy** is a function \(\pi(s)\) from states to actions (i.e. the learning system).

**Goal (reformulated)**: Find an optimal policy, \(\pi^*(s)\), which for each state selects the action which maximizes V.

These type of optimization problems are called Markov Decision Problems (MDPs) and most methods to solve them are closely related to Dynamic Programming (DP). RL can be seen as a generalization of DP, applicable to much larger problems.

## 文章评论