Natural Computation Methods for Machine Learning Note 05

February 9, 2020

Let's continue talking about overtraining.

Training set size

The number of training samples should be much larger than the number of weights: as a rough rule of thumb, on the order of \(N^2\), where N is the number of weights.

What to do if we have too little data?

  • minimize the number of nodes and layers (and thereby the number of weights)
  • noise injection: add noise to the inputs
  • k-fold cross validation: split the data into k parts of equal size, then

        for each subset i:
            train on all the other subsets
            test on subset i

  • generalization measure: the average error over the k tests.

If k = n (the number of samples), this is leave-one-out cross validation.
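
The k-fold procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original notes: the `fit` and `error` callables, and the least-squares line model used to exercise them, are hypothetical stand-ins for whatever network and error measure are actually in use.

```python
import numpy as np

def k_fold_error(X, y, k, fit, error, seed=0):
    """Average test error over k folds.

    fit(X_train, y_train) returns a model and error(model, X_test, y_test)
    returns a scalar. Both are hypothetical callables supplied by the caller.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, k)      # k parts of (almost) equal size
    errors = []
    for i in range(k):
        test_idx = folds[i]
        # Train on all the other subsets, test on subset i.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[test_idx], y[test_idx]))
    return float(np.mean(errors))           # generalization measure

# Illustrative model (an assumption): a least-squares line fit.
def fit_line(X, y):
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mse(coef, X, y):
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))

X = np.linspace(0.0, 1.0, 20)
y = 3.0 * X + 1.0
cv_error = k_fold_error(X, y, k=5, fit=fit_line, error=mse)
loo_error = k_fold_error(X, y, k=len(X), fit=fit_line, error=mse)  # leave-one-out
```

Setting `k=len(X)` gives the leave-one-out case, at the cost of training one model per sample.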

Early stopping

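Split the data into three sets and stop training when the error on the validation set starts to increase:

  • training set (the largest; usually the rest of the data)
  • validation set (the smallest), used to decide when to stop
  • test set (smaller), used to test the generalization ability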

Early stopping requires lots of data. It is one of many regularization techniques.
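
A minimal sketch of early stopping, assuming a linear model trained by gradient descent on the mean squared error in place of a full MLP. The `patience` parameter (how many epochs without improvement to tolerate) and all other names are assumptions for illustration, not something specified in the notes.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              eta=0.1, max_epochs=1000, patience=10):
    """Gradient descent on a linear model, stopped by validation error."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, bad_epochs = w.copy(), np.inf, 0
    for epoch in range(max_epochs):
        # One epoch of batch gradient descent on the training set.
        grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - eta * grad
        # Monitor the error on the validation set.
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_val:
            best_w, best_val, bad_epochs = w.copy(), val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # validation error stopped improving
                break
    return best_w, float(best_val)
```

The weights returned are those with the lowest validation error seen so far, not the last ones. The test set is deliberately not touched here, so it can still give an unbiased estimate afterwards.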

Network size

With sufficiently many hidden layers and nodes, the MLP can approximate any function to any degree of accuracy.

How many layers? One hidden layer is sufficient in theory! In practice, for some problems the required number of nodes in this layer can be very large. Two hidden layers are therefore sometimes used, which drastically reduces the required number of nodes in each layer. To get a feeling for how many nodes we need in the hidden layer, we must get a feeling for what the hidden layer does.

How many nodes?

  • In classification, the hidden nodes form discriminants. With sigmoidal nodes, the lines (hyperplanes) of a classifier become fuzzy and the corners rounded. It is possible to approximate a circle with three hidden nodes.
  • In regression/function approximation, the hidden nodes correspond to monotonic regions in the function.

TODO: There should be a picture here.


The ability to represent a function is not a guarantee that this function can be found by training! The required number of hidden nodes is greater in practice than in theory!

ANNs for classification should have sigmoidal outputs, while ANNs for function approximation should have linear outputs (the hidden layer is still nonlinear).
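
To make the output-layer choice concrete, here is a small sketch of the forward pass of an MLP with one hidden layer. The function name, the `task` argument, and the weight shapes are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2, task):
    """Forward pass of an MLP with one sigmoidal hidden layer."""
    h = sigmoid(W1 @ x + b1)          # hidden layer is nonlinear in both cases
    z = W2 @ h + b2                   # weighted sum at the output layer
    if task == "classification":
        return sigmoid(z)             # sigmoidal output, bounded in (0, 1)
    if task == "regression":
        return z                      # linear output, unbounded
    raise ValueError("task must be 'classification' or 'regression'")
```

The only difference between the two cases is the final activation; the hidden layer is the same.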

Optimizing for speed

The parameter \(\eta\) is known as the step length, step size, or learning rate.

Idea: Increase training speed by on-line adjustments of the step length, either globally or individually for each weight.


  1. Backprop with a momentum term.
  2. Start with a large \(\eta\) and reduce it over time.
  3. Consider the gradient history. Has it changed a lot lately? If so, decrease \(\eta\), else increase it.
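
The first two ideas in the list can be combined in one short sketch: gradient descent with a momentum term and a step length that decays over time. The decay schedule \(\eta_t = \eta_0 / (1 + \text{decay} \cdot t)\) and the parameter names are assumptions; the notes do not prescribe a particular schedule.

```python
import numpy as np

def gd_momentum(grad, w0, eta0=0.1, alpha=0.9, decay=0.01, steps=200):
    """Gradient descent with momentum and a decaying step length.

    grad(w) returns the gradient of the error at w (a caller-supplied
    callable). alpha is the momentum coefficient.
    """
    w = np.asarray(w0, dtype=float).copy()
    dw = np.zeros_like(w)                    # previous weight change
    for t in range(steps):
        eta = eta0 / (1.0 + decay * t)       # start large, reduce over time
        dw = -eta * grad(w) + alpha * dw     # momentum: reuse the last change
        w = w + dw
    return w

# Example: minimize E(w) = ||w||^2, whose gradient is 2w.
w_final = gd_momentum(lambda w: 2.0 * w, w0=[3.0, -2.0])
```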

Resilient backpropagation (RPROP)

Requires epoch learning.

Adaptive \(\eta\), local for each weight.

Idea: \(\frac{\partial E}{\partial w_{ji}}\) decides the direction only (i.e., we only consider its sign).

The step length is instead decided by a new parameter \(\Delta_{ji}\) (individual for each weight, replacing \(\eta\)):

\[\Delta w_{ji} = -\Delta_{ji} \operatorname{sign}\left(\frac{\partial E}{\partial w_{ji}}\right)\]

\(\Delta_{ji}\) is updated (within a specified interval) so that:

  • If \(\frac{\partial E}{\partial w_{ji}}\) keeps its sign, the step length \(\Delta_{ji}\) is increased by a factor \(\eta^+\)
  • If \(\frac{\partial E}{\partial w_{ji}}\) changes sign, \(\Delta_{ji}\) is reduced by a factor \(\eta^-\) (and the weight change is discarded)

Effect: Accelerate down slopes. Decelerate when we (would have) overshot a minimum.
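
The RPROP update rule can be sketched as follows. This is one possible reading, not a reference implementation: "the weight change is discarded" is interpreted here as skipping the update for that weight in the epoch where the sign flips, and the default values for `delta0`, `eta_plus`, `eta_minus`, `delta_min`, and `delta_max` are assumptions.

```python
import numpy as np

def rprop(grad, w0, epochs=100, delta0=0.1,
          eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    """Sign-based weight updates with an individual step length per weight.

    grad(w) returns the gradient of the error over the whole epoch.
    """
    w = np.asarray(w0, dtype=float).copy()
    delta = np.full_like(w, delta0)          # one step length per weight
    prev_g = np.zeros_like(w)
    for _ in range(epochs):
        g = grad(w)
        same = g * prev_g                    # > 0: sign kept, < 0: sign changed
        # Sign kept: increase the step length (within the allowed interval).
        delta = np.where(same > 0, np.minimum(delta * eta_plus, delta_max), delta)
        # Sign changed: reduce the step length.
        delta = np.where(same < 0, np.maximum(delta * eta_minus, delta_min), delta)
        # Assumption: discard the weight change by zeroing the gradient,
        # so sign(g) = 0 and that weight does not move this epoch.
        g = np.where(same < 0, 0.0, g)
        w = w - delta * np.sign(g)           # only the sign of the gradient is used
        prev_g = g
    return w

# Example: a quadratic bowl with very different curvature per weight.
target = np.array([3.0, -2.0])
curvature = np.array([1.0, 100.0])
w_final = rprop(lambda w: 2.0 * curvature * (w - target), w0=[0.0, 0.0])
```

Because only the sign of the gradient is used, the very different curvatures in the example do not affect how fast each weight converges.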







Dong Wang

I will work as a PhD student of TU Graz in Austria. My research interests include Embedded/Edge AI, federated learning, computer vision, and IoT.