Natural Computation Methods for Machine Learning Note 06

February 12, 2020

This note covers further extensions of network training and the use of prior knowledge.

Automatic dimensioning

We want to minimize the number of hidden nodes (fewer nodes means less overfitting).

The approaches:

  1. Start with a large network and prune.
  2. Start with a small network and grow.

One example of each approach follows.

Weight decay (pruning)

Let each weight decay toward zero during training:

w^{new}=(1-\varepsilon) w^{old}

Remove the weights that end up at (or very close to) zero and retrain.

This is not only a pruning technique. Keeping the weights small is useful anyway: it helps for numerical reasons, it restricts the network, and it reduces the risk of overtraining.
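The decay rule and the pruning step can be sketched in a few lines of NumPy. This is only an illustration: the function names `decay_step` and `prune_mask`, the value of epsilon, and the pruning threshold are assumptions made here, not part of the lecture. A threshold is used because decayed weights rarely become exactly zero in floating point.

```python
import numpy as np

def decay_step(weights, epsilon=0.01):
    """Apply w_new = (1 - epsilon) * w_old to every weight."""
    return (1.0 - epsilon) * weights

def prune_mask(weights, threshold=1e-3):
    """Boolean mask of the weights to keep (those not close to zero).

    The threshold is a hypothetical choice for this sketch.
    """
    return np.abs(weights) >= threshold

w = np.array([0.5, -0.002, 1.2, 0.0005])
for _ in range(100):
    # In real training, a gradient step would be interleaved here,
    # pulling the useful weights away from zero again.
    w = decay_step(w, epsilon=0.01)

mask = prune_mask(w)   # weights marked False would be removed before retraining
print(w, mask)
```

In this toy run nothing counteracts the decay, so every weight shrinks by the same factor; in actual training only the weights that the error gradient does not keep alive would drift to zero.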

The Upstart algorithm (growing)

A self-dimensioning method for classification.

Idea: Create child nodes, trained separately to recognize when the parent node
makes mistakes.

An output node in a classifier network can make two types of mistakes:

E^+: The node’s value is 1, when it should be 0.

E^-: The node’s value is 0, when it should be 1.

For each output, create two children, x^+ and x^-, trained to recognize the cases
where the parent makes mistakes of type E^+ and E^-, respectively. This is also a
classification problem, but a smaller one.

x^+ is connected to the parent node with a large negative weight, so that it can switch the parent off when it is wrongly 1.
x^- is connected to the parent node with a large positive weight, so that it can switch the parent on when it is wrongly 0.

If the children cannot solve this task, let them create their own children, etc.
Result: A finite “tree” of neurons. Severe risk of overtraining.
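A minimal sketch of one growth step, assuming threshold units with 0/1 outputs. The helper names `child_targets` and `corrected_output` and the weight magnitude `big` are hypothetical, and the children are simply assumed to have learned their targets perfectly. The child targets follow the description above (1 where the parent makes the corresponding mistake, 0 elsewhere); the published algorithm may treat some patterns as "don't care".

```python
import numpy as np

def child_targets(parent_output, target):
    """Return 0/1 training targets for the x+ and x- children."""
    parent_output = np.asarray(parent_output)
    target = np.asarray(target)
    e_plus = (parent_output == 1) & (target == 0)    # E+: wrongly 1
    e_minus = (parent_output == 0) & (target == 1)   # E-: wrongly 0
    return e_plus.astype(int), e_minus.astype(int)

def corrected_output(parent_net, x_plus, x_minus, big=10.0):
    """Parent threshold unit after adding the children's contributions.

    x+ enters with weight -big, x- with weight +big; `big` must exceed
    the magnitude of the parent's own net input.
    """
    net = parent_net - big * x_plus + big * x_minus
    return (net > 0).astype(int)

parent_net = np.array([0.7, -0.4, 0.2, -0.9])   # parent's net input per pattern
target = np.array([1, 1, 0, 0])
parent_out = (parent_net > 0).astype(int)       # one E- and one E+ mistake

t_plus, t_minus = child_targets(parent_out, target)
# Assume the children reproduce their targets exactly.
fixed = corrected_output(parent_net, t_plus, t_minus)
print(parent_out, t_plus, t_minus, fixed)
```

If a child cannot learn its targets, the same construction would be applied to the child's own mistakes, which is what produces the tree.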

Tip! Fahlman’s cascade correlation algorithm uses a similar idea, but is applicable to function approximation in general.

Second order methods

Use second order information (how the slope changes over time).


Requires epoch learning.

Assumptions:

1. The error surface (the landscape) can be approximated locally by a
parabola.

2. The change in the slope \frac{\partial E}{\partial w_{ji}} from the previous step is due only to
the change of w_{ji}.

Then, the current and previous slope together with the latest weight change \Delta w_{ji}
can be used to define a parabola. Jump directly to its minimum.

Very sensitive to the choice of control parameters, but can be extremely fast.
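Under the two assumptions, the slope is a linear function of w_{ji}, so two slope measurements and the last weight change determine where the slope becomes zero. The sketch below works that out for a single weight. The function name `parabola_step` is hypothetical, and the note does not name the method; the resulting update has the same form as the one usually associated with Fahlman's Quickprop.

```python
def parabola_step(slope_now, slope_prev, delta_prev):
    """Weight change that jumps to the minimum of the fitted parabola.

    The slope is assumed linear in w, so extrapolating the line through
    (w_prev, slope_prev) and (w_now, slope_now) to zero slope gives
        delta = slope_now / (slope_prev - slope_now) * delta_prev
    """
    denom = slope_prev - slope_now
    if denom == 0.0:
        raise ValueError("slope did not change; no parabola can be fitted")
    return slope_now / denom * delta_prev

# Toy error surface E(w) = (w - 3)^2 with slope dE/dw = 2 (w - 3).
def slope(w):
    return 2.0 * (w - 3.0)

w_prev, delta_prev = 0.0, 1.0
w_now = w_prev + delta_prev
w_new = w_now + parabola_step(slope(w_now), slope(w_prev), delta_prev)
print(w_new)
```

On an exactly quadratic surface a single jump lands on the minimum. On a real error surface the fit is only local, and a nearly unchanged slope makes the step very large, which is one reason practical variants need control parameters such as a maximum step size.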

No free lunch theorem

Averaged over all possible learning problems, no learning algorithm is better
than any other. They all perform the same. This includes random search.

  • There is always a catch. All modifications that seem to make things better,
    must have a drawback!
  • If your algorithm is worse than another on a subset of problems, you also
    know that there must exist another subset for which your algorithm is
    better.

The choice of input and output representations is the real problem. This is where the
user actually solves the task (example: the two-spirals problem).

Figure: the two-spirals data set.

  • Distribute your representations!
    Example: In classification, use as many outputs as there are classes. Train using target vectors where all elements are 0 except for the element corresponding to the current class, which is 1 (one-hot). The network will then approach a Bayesian classifier (output i will approximate P(C_i|x)).
  • Any prior knowledge of statistical distributions, symmetries, etc., should
    be exploited in the pre-processing stage. Don't make the network learn something you already know.
    Normalization/scaling – see Engelbrecht. Remember, though, that this may
    affect performance (it introduces bias).
  • Normalization example: Normalizing inputs makes the network more
    sensitive to fluctuations for some inputs over others. That bias may be the
    reason for normalizing (give small range values a chance) but it is easy to
  • Scaling example: Logarithms of target values make the network more
    sensitive (better) for low values than higher values.
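Two of the representation choices above, one-hot targets and logarithmic scaling of targets, can be illustrated as follows. The helper name `one_hot` and the example values are assumptions made for this sketch.

```python
import numpy as np

def one_hot(labels, num_classes):
    """One target row per sample: all zeros except a 1 at the class index."""
    labels = np.asarray(labels)
    targets = np.zeros((labels.size, num_classes))
    targets[np.arange(labels.size), labels] = 1.0
    return targets

# Three samples belonging to classes 0, 2 and 1.
T = one_hot([0, 2, 1], num_classes=3)
print(T)

# Log-scaled targets: equal ratios become equal distances, so an error
# of a given size matters relatively more for low target values.
y = np.array([1.0, 10.0, 100.0, 1000.0])
log_y = np.log10(y)
print(log_y)
```

With the log scale, the step from 1 to 10 is as large as the step from 100 to 1000, which is the sense in which the network becomes more sensitive to low values than to high ones.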

Exploiting prior knowledge

Examples of prior knowledge types:

  1. Initial guess

    a. Choosing initial weights (or how to randomize them)

  2. Known decomposition of the problem into subproblems

    a. Preprocessing
    b. Network structure
    c. Multitask learning

  3. Constraints
    a. Modifying the objective function (and deriving a new learning rule)

  4. Regions
    a. Preprocessing
    b. Multitask learning

Multitask learning (extra output learning)

Suppose we have additional information that we think could help the network.

Example: XOR is much easier (becomes linearly separable) if we add an extra
input ab, the product of the two inputs a and b. But the extra information then becomes required, also after training.

Solution: Add ab – the hint function – as an extra output instead! Network
trained to implement both functions (outputs) at once.
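A minimal NumPy sketch of this setup: a small sigmoid network with two inputs and two outputs, where the first output is XOR and the second is the hint ab. The network size, learning rate, number of epochs and random seed are assumptions chosen for illustration, and the sketch makes no claim about how fast or how reliably this particular run converges.

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor = np.array([[0], [1], [1], [0]], dtype=float)
hint = X[:, :1] * X[:, 1:]       # the hint function ab
T = np.hstack([xor, hint])       # targets: column 0 = XOR, column 1 = hint

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden = 2                     # hypothetical choice
W1 = rng.normal(0.0, 0.5, (2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.5, (n_hidden, 2))
b2 = np.zeros(2)

def forward(inputs):
    H = sigmoid(inputs @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    return H, Y

lr = 0.5
losses = []
for epoch in range(5000):        # epoch (batch) learning
    H, Y = forward(X)
    err = Y - T
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation of the squared error (up to a constant factor).
    dY = err * Y * (1.0 - Y)
    dH = (dY @ W2.T) * H * (1.0 - H)
    W2 -= lr * (H.T @ dY)
    b2 -= lr * dY.sum(axis=0)
    W1 -= lr * (X.T @ dH)
    b1 -= lr * dH.sum(axis=0)

_, Y = forward(X)
print(losses[0], losses[-1])
print(Y[:, 0])                   # after training, only the XOR output is used
```

Both outputs share the same hidden layer, so the hint constrains the hidden representation during training, while at use time the hint output can simply be ignored.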

Multitask Learning for XOR + x

This figure is copyright by Olle Gällmo.

Restricts the freedom of the hidden layer, i.e. the number of models the network
can form of the original target function.

Effect: Faster training. Less overtraining. Less variance in both training time and
results. Required number of nodes closer to theory. Hint only needed during
training.
Example: XOR and function approximation experiments.

Dong Wang

I will work as a PhD student of TU Graz in Austria. My research interests include Embedded/Edge AI, federated learning, computer vision, and IoT.