Natural Computation Methods for Machine Learning Note 06
This note covers further extensions of neural network training and the use of prior knowledge.
Automatic dimensioning
We want to minimize the number of hidden nodes (=> less overfitting).
The approaches:
• Start with a large network and prune.
• Start with a small network and grow.
One example of each follows.
Weight decay (pruning)
Let each weight strive to zero during training.
w^{new} = (1 - \varepsilon) w^{old}
Remove the weights that have decayed to (approximately) zero and retrain.
This is not only a pruning technique; keeping the weights small is good in any case:
it helps for numerical reasons, restricts the network, and reduces the risk of overtraining.
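The decay rule and the pruning step above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the learning rate `lr`, the pruning `threshold`, and the choice to apply the decay right after the gradient step are all assumptions made for the example.

```python
import numpy as np

def decay_step(w, grad, lr=0.1, eps=0.01):
    """One gradient step followed by the decay w_new = (1 - eps) * w_old.

    Applying the decay after the gradient step is an assumption of this sketch.
    """
    w = w - lr * grad
    return (1.0 - eps) * w

def prune(w, threshold=1e-3):
    """Zero out weights whose magnitude is below `threshold` (hypothetical cutoff).

    Returns the pruned weights and the boolean mask of surviving weights.
    """
    mask = np.abs(w) >= threshold
    return w * mask, mask
```

After pruning, the mask can be kept fixed while the remaining weights are retrained, so that removed connections stay removed.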
The Upstart algorithm (growing)
A self-dimensioning method for classification.
Idea: Create child nodes, trained separately to recognize when the parent node
makes mistakes.
An output node in a classifier network can make two types of mistakes:
E^+: The node’s value is 1 when it should be 0.
E^-: The node’s value is 0 when it should be 1.
For each output, create two children, x^+ and x^-, trained to recognize the cases
where the parent makes mistakes of type E^+ and E^-, respectively. This is also a
classification problem (but a smaller one).
x^+ is connected to the parent node with a large negative weight.
x^- is connected to the parent node with a large positive weight.
If the children cannot solve this task, let them create their own children, etc.
Result: A finite “tree” of neurons. Severe risk of overtraining.
Tip! See Fahlman’s cascade-correlation algorithm. It is based on a similar idea, but is applicable to function approximation in general.
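To make the child construction in the Upstart algorithm concrete, here is a small sketch of how the children's training targets could be derived from the parent's mistakes, and how perfect children would correct the parent. It assumes binary threshold units that fire when the net input is positive; the function names and the weight magnitude `big` are hypothetical.

```python
import numpy as np

def child_targets(parent_out, desired):
    """Targets for the two children of one parent node.

    x_plus should fire on E^+ cases (parent says 1, should be 0),
    x_minus should fire on E^- cases (parent says 0, should be 1).
    """
    parent_out = np.asarray(parent_out)
    desired = np.asarray(desired)
    e_plus = ((parent_out == 1) & (desired == 0)).astype(int)
    e_minus = ((parent_out == 0) & (desired == 1)).astype(int)
    return e_plus, e_minus

def corrected_output(parent_net, x_plus, x_minus, big=10.0):
    """Parent output after adding the children's contributions.

    x_plus enters with a large negative weight, x_minus with a large positive
    one; `big` must exceed the magnitude of the parent's net input.
    """
    net = np.asarray(parent_net) - big * np.asarray(x_plus) + big * np.asarray(x_minus)
    return (net > 0).astype(int)
```

In practice the children are themselves trained networks, so their outputs only approximate these targets; when they fail, the same construction is applied to them recursively.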
Second order methods
Use second order information (how the slope changes over time).
Quickprop
Requires epoch (batch) learning.
Assumptions:
1. The error surface (the landscape) can be approximated locally by a parabola.
2. The change in the slope \frac{\partial E}{\partial w_{ji}} from the previous step is only due to
the change of w_{ji}.
Then, the current and previous slope together with the latest weight change \Delta w_{ji}
can be used to define a parabola. Jump directly to its minimum.
Very sensitive to the choice of control parameters, but can be extremely fast.
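The jump to the parabola's minimum can be written out for a single weight. With S(t) the current slope, S(t-1) the previous slope, and \Delta w(t-1) the latest weight change, fitting a parabola through the two slopes gives the step \Delta w(t) = \frac{S(t)}{S(t-1) - S(t)} \Delta w(t-1). The sketch below implements only this core step; the function name is hypothetical, and the safeguards that practical versions add (for example limiting how fast the step may grow) are left out.

```python
def quickprop_step(slope, prev_slope, prev_step):
    """Step that jumps to the minimum of the parabola defined by two slopes.

    slope      : current dE/dw
    prev_slope : dE/dw before the previous step
    prev_step  : the previous weight change
    Note: undefined when slope == prev_slope (no curvature information).
    """
    return slope / (prev_slope - slope) * prev_step


# Usage on an exactly parabolic error E(w) = 3 * (w - 2)^2, whose slope is 6 * (w - 2).
def slope_of_parabola(w):
    return 6.0 * (w - 2.0)

w_prev, first_step = 0.0, 0.5          # arbitrary start and first step
w = w_prev + first_step
w_next = w + quickprop_step(slope_of_parabola(w), slope_of_parabola(w_prev), first_step)
```

On a true parabola a single step lands on the minimum (here w = 2). On a real error surface the parabola is only a local approximation, which is one reason the method is sensitive to its control parameters.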
No free lunch theorem
Averaged over all possible learning problems, no learning algorithm is better
than any other. They all perform the same. This includes random search.
• There is always a catch. All modifications that seem to make things better
must have a drawback!
• If your algorithm is worse than another on a subset of problems, you also
know that there must exist another subset for which your algorithm is
better.
Preprocessing
The choice of input and output representations is the problem. This is where the
user really solves the problem (example: the two-spirals problem).
• Distribute your representations!
Example: In classification, use as many outputs as there are classes. Train using
target vectors where all elements are 0 except for the element corresponding to
the current class, which is 1 (one-hot). The network will then approach a Bayesian
classifier (output i will approximate P(C_i | x)).
• Any prior knowledge on statistical distributions, symmetries, etc., should
be exploited in the preprocessing stage. Don't let the NN learn something you already know.
• Normalization/scaling – see Engelbrecht. Remember, though, that this may
affect performance (introduces bias).
• Normalization example: Normalizing inputs makes the network more
sensitive to fluctuations for some inputs over others. That bias may be the
reason for normalizing (give small range values a chance) but it is easy to
overcompensate.
• Scaling example: Logarithms of target values make the network more
sensitive (better) for low values than higher values.
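The three preprocessing ideas above (one-hot targets, input normalization, logarithmic target scaling) can be sketched in a few lines of NumPy. The function names are hypothetical, and zero-mean/unit-variance standardization is just one common choice of normalization, assumed here for illustration.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Target matrix with a 1 in the column of each sample's class, 0 elsewhere."""
    targets = np.zeros((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

def standardize(X):
    """Scale each input column to zero mean and unit variance.

    Assumes no column is constant (its standard deviation would be zero).
    Returns the statistics so the same scaling can be reused on new data.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std, mean, std

def log_targets(y):
    """Logarithmic target scaling; assumes strictly positive targets."""
    return np.log(y)
```

With log targets, the step from 1 to 2 is as large as the step from 100 to 200, so an error of fixed size in the scaled space corresponds to a smaller absolute error for low values. This is the sense in which the network becomes more sensitive for low values than for high ones.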
Exploiting prior knowledge
Examples of prior knowledge types:
• Initial guess
a. Choosing initial weights (or how to randomize them)
• Known decomposition of the problem into subproblems
a. Preprocessing
b. Network structure
c. Multitask learning
• Constraints
a. Modifying the objective function (and deriving a new learning rule)
• Regions
a. Preprocessing
b. Multitask learning
Multitask learning (extra output learning)
We may have additional data that we think could help the network.
Example: XOR is much easier (it becomes linearly separable) if we add an extra
input ab (the product of the two inputs, i.e. a AND b for binary inputs). But the
extra information then becomes required also after training.
Solution: Add ab – the hint function – as an extra output instead! The network is
trained to implement both functions (outputs) at once.
This figure is copyright by Olle Gällmo.
Restricts the freedom of the hidden layer, i.e. the number of models the network
can form of the original target function.
Effect: Faster training. Less overtraining. Less variance in both training time and
results. Required number of nodes closer to theory. Hint only needed during
training.
Example: XOR and function approximation experiments.
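A minimal sketch of the XOR case is given below. It first checks that XOR is linear in (a, b, ab), since XOR = a + b - 2ab for binary inputs, and then trains a small sigmoid network with the hint ab as a second output. The network size, learning rate, epoch count, seed, and all function names are assumptions of this sketch, and it makes no claim about how the experiments referred to above were set up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_with_extra_input(a, b):
    """With ab as a third input, one threshold unit solves XOR: a + b - 2ab > 0.5."""
    return (a + b - 2 * a * b > 0.5).astype(int)

# Inputs a, b. Target column 0 is XOR (the real task);
# column 1 is the hint ab, used only during training.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.column_stack([np.logical_xor(X[:, 0], X[:, 1]), X[:, 0] * X[:, 1]]).astype(float)

def train_with_hint(X, T, n_hidden=2, lr=0.5, epochs=2000, seed=0):
    """Batch gradient descent on mean squared error over both outputs."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 1.0, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, (n_hidden, T.shape[1]))
    b2 = np.zeros(T.shape[1])
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)               # hidden layer, shared by both outputs
        Y = sigmoid(H @ W2 + b2)               # outputs: [XOR, hint]
        losses.append(float(np.mean((Y - T) ** 2)))
        dY = (Y - T) * Y * (1 - Y) / len(X)    # proportional to the MSE gradient
        dH = (dY @ W2.T) * H * (1 - H)         # both outputs shape the hidden layer
        W2 -= lr * H.T @ dY
        b2 -= lr * dY.sum(axis=0)
        W1 -= lr * X.T @ dH
        b1 -= lr * dH.sum(axis=0)
    return (W1, b1, W2, b2), losses

def predict_xor(params, X):
    """After training, only the XOR output is used; the hint output is ignored."""
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)
    return sigmoid(H @ W2[:, 0] + b2[0])
```

The hint enters only through the extra target column, so nothing beyond a and b is needed at prediction time. Whether this particular configuration converges to a correct XOR depends on the initialization and is not guaranteed by the sketch.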