# Natural Computation Methods for Machine Learning Note 04

2020年2月7日 2394点热度 0人点赞 0条评论

# Natural Computation Methods for Machine Learning Note 04

## Credit assignment problem

in order to decide by how much of the result blamed or credited to a particular weight.

We need sigmoid function. Sigmoid function = any S-shape function.

examples.

logistical function $$y=f(s)=\frac{1}{1+e^{- \lambda s}}$$, $$\lambda$$ decides slope. When $$\lambda$$ approach $$\infty$$, $$f(s)$$ approach a step function. Normally $$\lambda = 1$$. The logistic function is differentiable and has a derivative which can be expressed in terms of the function itself.

f'(s)= \lambda f(s) (1-f(s))= \lambda y(1-y)

than function $$f(s)=tanh(s)$$

## How to make a learning rule

### The delta rule

$$y=f(s) \ s= \sum_{i=0}^n w_ix_i \ where \ \begin{cases} x_0=-1 \\w_0=\theta \end{cases}$$

According to the chain rule, we have these

\frac{\partial E}{w_i} = \frac{\partial E}{y} \frac{\partial y}{s} \frac{\partial s}{w_i}

\frac{\partial E}{y} = 2\frac{1}{2}(d-y)(-1)=-(d-y)

\frac{\partial y}{s} = \frac{\partial f(s)}{s}=f'(s)

\frac{\partial s}{w_i}=\frac{\partial \sum_{j=0}^n w_jx_j}{w_i} = \frac{\partial (w_ix_i)}{w_i}=x_i

Hence,

\frac{\partial E}{w_i}=-(d-y)f'(s)x_i

From step 3 in the recipe

\Delta w_i = - \eta \frac{\partial E}{w_i} = \eta (d-y)f'(s)x_i

or, on the general form:

$$\Delta w_{i}=\eta \delta x_{i}$$

where $$\delta=f^{\prime}(S)(d-y)=\lambda y(1-y)(d-y)$$

This is as known as LMS.

### Back propagation(generalized delta rule)

For a multiply perceptron(MLP)

Just extend the delta rule to the hidden nodes and weights.

For the output layer, the delta rule is applicable as it is, only that we must index the nodes, and $$w$$ needs a second index. $$w_{ji}$$ is the weight from node (or input) $$i$$ to node $$j$$.

So, the update equation is $$\Delta w_{j i}=\eta \delta_{j} x_{i}$$

where, For an output node, $$\delta_{j}=f^{\prime}\left(S_{j}\right)\left(d_{j}-y_{j}\right)=\lambda y_{j}\left(1-y_{j}\right)\left(d_{j}-y_{j}\right)$$

For a hidden node, $$\delta_{j}=f^{\prime}\left(S_{j}\right) \sum_{k} w_{k j} \delta_{k}=\lambda y_{j}\left(1-y_{j}\right) \sum_{k} w_{k j} \delta_{k}$$

Note we usually set $$\lambda$$ to 1.

### $$\eta$$ size

$$\eta$$ corresponds to the step length. Large $$\eta$$ may overshoot minima or zig-zag
across ravines. Small $$\eta$$ is slower and more likely to get stuck in local minima

### steps

1. Initialize. Set all weights to small random values with zero mean.
2. Present an input vector, $$\boldsymbol{x}=\left(x_{1}, x_{2}, \ldots, x_{n}\right)$$ corresponding target vector, $$\boldsymbol{d}=\left(d_{l}, d_{2}, \ldots, d_{m}\right)$$.
3. Feed forward phase. Compute network outputs, by updating the nodes layer by layer from the first hidden layer to the outputs. The first hidden layer computes (for all nodes, $$y_j$$): y_{j}=f\left(\sum_{i=0}^{n} w_{j i} x_{i}\right) \text { where } x_{0}=-1 The next layer applies the same formula, substituting this layer's node values for $$x_i$$.
4. Back propagation phase. Compute weight changes iteratively, layer by layer, from the outputs to the first hidden layer \Delta w_{j i}=\eta \delta_{j} x_{i} \delta_{j}=\left\{\begin{array}{l}
{y_{j}\left(1-y_{j}\right)\left(d_{j}-y_{j}\right) \text { if } y_{j} \text { is an output node }} \\
{y_{j}\left(1-y_{j}\right) \sum_{k} w_{k j} \delta_{k} \text { if } y_{j} \text { is a hidden node }}
\end{array}\right.

5. Repeat from step 2 with a new input-target pair.

### Momentum

Add a momentum term to the update formula, causing smoother weight changes over time:

$$\Delta w_{j i}(t+1)=\eta \delta_{j} x_{i}+\alpha \Delta w_{j i}(t)$$

## Pattern learning and Epoch learning

* Epoch learning: Accumulate $$\Delta w$$ until all patterns in the training set has been presented once (= 1 epoch). Then update the weight and clear $$\Delta w$$. Special case of Batch Learning (batches can be smaller).

* Pattern/stochastic learning: Update $$w$$ after each pattern presentation (as step 4.5:

$$w+=\Delta w$$). This requires a random order of presentation. This requires a random order of presentation.

## How to use the algorithm and when to stop training

Decide network structure

we should firstly collect a set of input-target pairs

Then split the set into a train set (larger)and a test set(smaller)

We are going to minimize the error of the training set.

The worst case is, a network with too many hidden nodes which is training for too long or too little data. This will be overfitting. There are three reasons that

1. too large network

2. too long training

3. too little data

Dong Wang

Master student of computer science at Uppsala University in Sweden. My primary research interests are deep learning, computer vision, federated learning and internet-of-things.