Natural Computation Methods for Machine Learning Note 05
Let's continue talking about overtraining.
Training set size
What to do if we have too little data?
- noise injection: add noise to the inputs
- k-fold cross validation: split the data into k parts (of equal size).
  For each sub-set i: train on all other sub-sets, test on sub-set i.
- generalization measure: the average error over the k tests.
  With k = n, this is leave-one-out cross validation.
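A minimal sketch of the procedure in Python, assuming hypothetical `train_model` and `error` routines for the network:

```python
import numpy as np

def k_fold_error(X, y, k, train_model, error):
    """Estimate generalization error by k-fold cross validation.

    train_model(X_tr, y_tr) -> model   (hypothetical training routine)
    error(model, X_te, y_te) -> float  (hypothetical error measure)
    """
    indices = np.random.permutation(len(X))   # shuffle before splitting
    folds = np.array_split(indices, k)        # k parts of (roughly) equal size
    errors = []
    for i in range(k):
        test_idx = folds[i]                                    # fold i is the test set
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # train on all other folds
        model = train_model(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return np.mean(errors)                    # average error over the k tests
```

With k = n (one example per fold) this is exactly leave-one-out cross validation.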
Early stopping
- training set (largest; usually the rest of the data), used to update the weights
- validation set (smallest), used to decide when to stop: stop when the validation error starts to increase
- test set (smaller), used to test the generalization ability
Requires lots of data. This is one of many regularization techniques.
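A sketch of the stopping rule, assuming hypothetical `train_one_epoch` and `validation_error` routines and a model object that can be copied:

```python
def train_with_early_stopping(model, train_data, val_data,
                              train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_model = model.copy()          # assumes the model supports copying
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)        # weight updates use the training set only
        err = validation_error(model, val_data)   # generalization is monitored on the validation set
        if err < best_error:
            best_error, best_model, epochs_since_best = err, model.copy(), 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                              # validation error keeps rising: stop
    return best_model                              # final quality is reported on the test set
```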
How many layers?
One hidden layer is sufficient in theory! In practice, for some problems the required number of nodes in this layer can be very large. Two hidden layers are therefore sometimes used, which drastically reduces the required number of nodes in each layer. To get a feeling for how many nodes we need in the hidden layer, we must get a feeling for what the hidden layer does.
How many nodes?
- In classification, the hidden nodes form discriminants. With sigmoidal nodes, the lines (hyperplanes) of a classifier become fuzzy and the corners rounded. It is possible to approximate a circle with three hidden nodes.
- In regression/function approximation, the hidden nodes correspond to monotonic regions in the function.
The ability to represent a function is not a guarantee that this function can be found by training! The required number of hidden nodes is greater in practice than in theory!
ANNs for classification should have sigmoidal outputs, while ANNs for function approximation should have linear outputs (the hidden layer is still nonlinear).
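A minimal forward pass showing the two output choices (a sketch with one hidden layer and biases omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_hidden, W_out, task="classification"):
    """One-hidden-layer MLP; `task` selects the output activation."""
    h = sigmoid(W_hidden @ x)      # hidden layer is nonlinear in both cases
    o = W_out @ h
    if task == "classification":
        return sigmoid(o)          # sigmoidal outputs for classification
    return o                       # linear outputs for function approximation
```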
Optimizing for speed
\[
\eta: \begin{cases} \text{step length} \\ \text{step size} \\ \text{learning rate} \end{cases}
\]
Idea: Increase training speed by on-line adjustment of the step length, either globally or individually for each weight.
Examples:
- Backprop with a momentum term (see the sketch after this list).
- Start with a large \(\eta\) and reduce it over time.
- Consider the gradient history. Has it changed a lot lately? If so, decrease \(\eta\), else increase it.
Requires epoch learning.
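A sketch of the first idea, the momentum term (the symbol \(\alpha\) for the momentum coefficient is assumed here, not given in the notes):

```python
def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """Backprop update with a momentum term.

    Part of the previous weight change (`velocity`) is added to the new one,
    so steps grow while the gradient keeps pointing in the same direction.
    """
    velocity = alpha * velocity - eta * grad   # mix old direction with new gradient
    return w + velocity, velocity
```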
Adaptive \(\eta\), local for each weight.
Idea: \(\frac{\partial E}{\partial w_{ji}}\) decides the direction only (i.e. we only consider its sign).
The step length is instead decided by a new parameter \(\Delta_{ji}\) (individual for each weight, replacing \(\eta\)):
\[
\Delta w_{ji} = -\Delta_{ji}\,\operatorname{sign}\!\left(\frac{\partial E}{\partial w_{ji}}\right)
\]
\(\Delta_{ji}\) is updated (within a specified interval) so that:
- If \(E'\) keeps its sign, the step length \(\Delta_{ji}\) is increased by a factor \(\eta^+\)
- If \(E'\) changes sign, \(\Delta_{ji}\) is reduced by a factor \(\eta^-\) (and the weight change is discarded)
Effect: Accelerate down slopes. Decelerate when we (would have) overshot a minimum.
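A sketch of this sign-based rule (essentially Rprop), assuming epoch-wise gradients; the limits `delta_min` and `delta_max` stand in for the "specified interval" above, and their values here are just placeholders:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """One sign-based update: the gradient decides direction only."""
    same = grad * prev_grad                     # >0: E' kept its sign, <0: it changed
    delta = np.where(same > 0, np.minimum(delta * eta_plus, delta_max), delta)
    delta = np.where(same < 0, np.maximum(delta * eta_minus, delta_min), delta)
    step = np.where(same < 0, 0.0, -delta * np.sign(grad))  # discard change after a sign flip
    prev_grad = np.where(same < 0, 0.0, grad)   # avoid shrinking delta twice in a row
    return w + step, prev_grad, delta
```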