Natural Computation Methods for Machine Learning Note 04
Credit assignment problem
The problem of deciding how much of the result should be blamed on, or credited to, a particular weight.
We need a sigmoid function. A sigmoid function is any S-shaped function.
Examples:
The logistic function \( y=f(s)=\frac{1}{1+e^{-\lambda s}} \), where \(\lambda\) controls the slope. As \(\lambda\) approaches \(\infty\), \(f(s)\) approaches a step function. Normally \(\lambda = 1\). The logistic function is differentiable, and its derivative can be expressed in terms of the function itself:
f'(s) = \lambda f(s)(1-f(s)) = \lambda y(1-y)
The hyperbolic tangent function, \(f(s)=\tanh(s)\).
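As a quick sanity check, the closed-form derivative \(\lambda y(1-y)\) can be compared against a numerical difference. A minimal sketch, assuming the standard `numpy` package (the sample points and step size are illustrative choices):

```python
import numpy as np

def logistic(s, lam=1.0):
    """Logistic sigmoid: y = 1 / (1 + exp(-lam * s))."""
    return 1.0 / (1.0 + np.exp(-lam * s))

def logistic_deriv(s, lam=1.0):
    """Derivative expressed via the function itself: lam * y * (1 - y)."""
    y = logistic(s, lam)
    return lam * y * (1.0 - y)

# Compare with a central-difference numerical derivative
s = np.linspace(-3.0, 3.0, 7)
h = 1e-6
numeric = (logistic(s + h) - logistic(s - h)) / (2.0 * h)
assert np.allclose(numeric, logistic_deriv(s), atol=1e-6)
```

At \(s=0\) the logistic function gives \(y=0.5\), so the derivative is \(\lambda \cdot 0.25\), which is its maximum; this is why learning slows down once a node saturates toward 0 or 1.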
How to make a learning rule
The delta rule
\(y=f(s), \quad s= \sum_{i=0}^n w_i x_i, \quad \text{where} \ \begin{cases} x_0=1 \\ w_0=\theta \end{cases}\)
According to the chain rule (with the error \(E=\frac{1}{2}(d-y)^{2}\)):
\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial s} \frac{\partial s}{\partial w_i}
\frac{\partial E}{\partial y} = 2 \cdot \frac{1}{2}(d-y)(-1) = -(d-y)
\frac{\partial y}{\partial s} = \frac{\partial f(s)}{\partial s} = f'(s)
\frac{\partial s}{\partial w_i} = \frac{\partial \sum_{j=0}^n w_j x_j}{\partial w_i} = \frac{\partial (w_i x_i)}{\partial w_i} = x_i
Hence,
\frac{\partial E}{\partial w_i} = -(d-y) f'(s) x_i
From step 3 in the recipe:
\Delta w_i = -\eta \frac{\partial E}{\partial w_i} = \eta (d-y) f'(s) x_i
or, in the general form:
\(\Delta w_{i}=\eta \delta x_{i}\)
where \(\delta=f^{\prime}(s)(d-y)=\lambda y(1-y)(d-y)\)
This is also known as the LMS (least mean squares) rule.
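The delta rule above can be sketched for a single logistic node. A minimal example, assuming `numpy`; the OR targets, learning rate, and epoch count are illustrative choices, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs with bias term x0 = 1; targets for logical OR (an illustrative,
# linearly separable task)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 1.0])

w = rng.normal(0, 0.1, 3)    # small random weights, zero mean
eta = 0.5
for epoch in range(2000):
    for x, t in zip(X, d):
        s = w @ x
        y = 1.0 / (1.0 + np.exp(-s))     # logistic output, lambda = 1
        delta = y * (1 - y) * (t - y)    # delta = f'(s)(d - y)
        w += eta * delta * x             # Delta w_i = eta * delta * x_i

y_out = 1.0 / (1.0 + np.exp(-(X @ w)))   # final outputs for all patterns
```

After training, thresholding the outputs at 0.5 reproduces the OR targets; a single node suffices here because the task is linearly separable.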
Backpropagation (generalized delta rule)
For a multilayer perceptron (MLP):
Just extend the delta rule to the hidden nodes and weights.
For the output layer, the delta rule is applicable as it is, only that we must index the nodes, and \(w\) needs a second index. \(w_{ji}\) is the weight from node (or input) \(i\) to node \(j\).
So, the update equation is \(\Delta w_{j i}=\eta \delta_{j} x_{i}\)
where, for an output node, \(\delta_{j}=f^{\prime}\left(S_{j}\right)\left(d_{j}-y_{j}\right)=\lambda y_{j}\left(1-y_{j}\right)\left(d_{j}-y_{j}\right)\)
and for a hidden node, \(\delta_{j}=f^{\prime}\left(S_{j}\right) \sum_{k} w_{k j} \delta_{k}=\lambda y_{j}\left(1-y_{j}\right) \sum_{k} w_{k j} \delta_{k}\)
Note we usually set \(\lambda\) to 1.
The size of \(\eta\)
\(\eta\) corresponds to the step length. Large \(\eta\) may overshoot minima or zigzag
across ravines. Small \(\eta\) is slower and more likely to get stuck in local minima
Steps
1. Initialize. Set all weights to small random values with zero mean.
2. Present an input vector, \(\boldsymbol{x}=\left(x_{1}, x_{2}, \ldots, x_{n}\right)\), and the corresponding target vector, \(\boldsymbol{d}=\left(d_{1}, d_{2}, \ldots, d_{m}\right)\).
3. Feed forward phase. Compute the network outputs by updating the nodes layer by layer, from the first hidden layer to the outputs. The first hidden layer computes (for all nodes \(y_j\)): \(y_{j}=f\left(\sum_{i=0}^{n} w_{j i} x_{i}\right)\) where \(x_{0}=1\). The next layer applies the same formula, substituting this layer's node values for \(x_i\).
4. Back propagation phase. Compute weight changes iteratively, layer by layer, from the outputs to the first hidden layer: \(\Delta w_{j i}=\eta \delta_{j} x_{i}\) where \(\delta_{j}=\begin{cases} y_{j}\left(1-y_{j}\right)\left(d_{j}-y_{j}\right) & \text{if } y_{j} \text{ is an output node} \\ y_{j}\left(1-y_{j}\right) \sum_{k} w_{k j} \delta_{k} & \text{if } y_{j} \text{ is a hidden node} \end{cases}\)
5. Repeat from step 2 with a new input-target pair.
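The five steps above can be sketched as a tiny network with one hidden layer. A minimal `numpy` sketch; XOR as the task, 3 hidden nodes, \(\eta=0.5\), and the epoch count are all illustrative assumptions, and patterns are presented in a fixed order for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR (illustrative): 2 inputs -> 3 hidden nodes -> 1 output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0.0], [1.0], [1.0], [0.0]])

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

# Step 1: small random weights, zero mean (last column = bias weight, x0 = 1)
W1 = rng.normal(0, 0.5, (3, 3))   # hidden layer: 3 nodes, 2 inputs + bias
W2 = rng.normal(0, 0.5, (1, 4))   # output layer: 1 node, 3 hidden + bias

eta = 0.5

def forward(x):
    h = sigmoid(W1 @ np.append(x, 1.0))   # step 3: first hidden layer
    y = sigmoid(W2 @ np.append(h, 1.0))   # next layer, hidden values as inputs
    return h, y

def mse():
    return np.mean([(d - forward(x)[1]) ** 2 for x, d in zip(X, D)])

err_before = mse()
for epoch in range(5000):                  # step 5: repeat over patterns
    for x, d in zip(X, D):                 # step 2: present a pattern
        h, y = forward(x)
        # Step 4: deltas, output layer first (lambda = 1)
        delta_out = y * (1 - y) * (d - y)
        delta_hid = h * (1 - h) * (W2[:, :3].T @ delta_out)
        W2 += eta * np.outer(delta_out, np.append(h, 1.0))
        W1 += eta * np.outer(delta_hid, np.append(x, 1.0))
err_after = mse()
```

With this seed the training error drops well below its initial value, though gradient descent on XOR can in general get stuck in a local minimum, which is one reason the notes discuss \(\eta\) and momentum.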
Momentum
Add a momentum term to the update formula, causing smoother weight changes over time:
\(\Delta w_{j i}(t+1)=\eta \delta_{j} x_{i}+\alpha \Delta w_{j i}(t)\)
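The effect of the momentum term can be seen in isolation. A small sketch, where `grad_term` stands for the product \(\delta_j x_i\) and the values of \(\eta\) and \(\alpha\) are illustrative:

```python
def momentum_step(grad_term, prev_dw, eta=0.1, alpha=0.9):
    """One momentum update: dw(t+1) = eta * grad_term + alpha * dw(t).

    grad_term stands for delta_j * x_i in the update formula; eta and
    alpha here are illustrative values, not prescribed by the notes.
    """
    return eta * grad_term + alpha * prev_dw

# With a constant gradient term, the update settles at eta*g / (1 - alpha),
# i.e. momentum effectively amplifies a consistent gradient direction.
dw = 0.0
for _ in range(200):
    dw = momentum_step(1.0, dw)
```

Conversely, when successive gradient terms alternate in sign (zigzagging across a ravine), the \(\alpha \Delta w_{ji}(t)\) term partially cancels them, smoothing the trajectory.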
Pattern learning and Epoch learning
* Epoch learning: Accumulate \(\Delta w\) until all patterns in the training set have been presented once (= 1 epoch). Then update the weights and clear \(\Delta w\). A special case of batch learning (batches can be smaller).
* Pattern/stochastic learning: Update \(w\) after each pattern presentation (as step 4.5: \(w \mathrel{+}= \Delta w\)). This requires a random order of presentation.
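The two schemes differ only in when the accumulated change is applied. A sketch of the contrast; `grad` is a hypothetical function returning the per-weight change \(\delta_j x_i\) for one pattern, not something defined in the notes:

```python
import random

def pattern_learning(patterns, w, eta, grad):
    """Stochastic/pattern learning: update w after every single pattern,
    presented in random order."""
    order = list(patterns)
    random.shuffle(order)            # random presentation order required
    for x, d in order:
        w = [wi + eta * g for wi, g in zip(w, grad(w, x, d))]
    return w

def epoch_learning(patterns, w, eta, grad):
    """Epoch learning: accumulate the changes over one full epoch,
    then apply them once and clear the accumulator."""
    acc = [0.0] * len(w)
    for x, d in patterns:
        acc = [a + g for a, g in zip(acc, grad(w, x, d))]
    return [wi + eta * a for wi, a in zip(w, acc)]
```

Note that in epoch learning every gradient is computed at the same weights, whereas in pattern learning each update shifts the weights the next gradient is computed at, which is why presentation order matters there.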
How to use the algorithm and when to stop training
Decide network structure
First, collect a set of input-target pairs.
Then split the set into a training set (larger) and a test set (smaller).
We then minimize the error on the training set.
The worst case, overfitting, occurs when the network fits the training data closely but generalizes poorly to the test set. There are three common causes:

* too large a network
* too long training
* too little data