For each output, create two children, $x^+$ and $x^-$, trained to recognize the cases
where the parent makes mistakes of type $E^+$ and $E^-$, respectively. This is again a
classification problem, but a smaller one.

$x^+$ is connected to the parent node with a large negative weight.
$x^-$ is connected to the parent node with a large positive weight.
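
To make the role of the large correction weight concrete, here is a minimal sketch on XOR. It assumes that $E^+$ means the parent is wrongly on (outputs 1 when the target is 0) and $E^-$ that it is wrongly off. All weights are hand-picked for illustration rather than learned, and the names (`step`, `LARGE`, `child_plus`) are hypothetical.

```python
import numpy as np

def step(z):
    """Threshold unit: 1 where the net input is positive, else 0."""
    return (z > 0).astype(int)

# XOR training set: inputs and binary targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])

# Parent unit with hand-picked weights (an assumption); it computes OR.
w_parent, b_parent = np.array([1.0, 1.0]), -0.5
parent_out = step(X @ w_parent + b_parent)

# E+ : parent is wrongly on (outputs 1, target 0).
# E- : parent is wrongly off (outputs 0, target 1).
e_plus = (parent_out == 1) & (t == 0)
e_minus = (parent_out == 0) & (t == 1)

# Child x+ with hand-picked weights; it computes AND, which is exactly E+ here.
w_child, b_child = np.array([1.0, 1.0]), -1.5
child_plus = step(X @ w_child + b_child)

# A large negative weight lets the child switch the parent off.
LARGE = 10.0
corrected = step(X @ w_parent + b_parent - LARGE * child_plus)
print(corrected)  # [0 1 1 0]
```

In this case the parent never makes an $E^-$ mistake, so no $x^-$ child is needed, and the child's task (AND) is linearly separable even though the parent's task (XOR) is not.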

If the children cannot solve this task, let them create their own children, etc.
Result: A finite “tree” of neurons. Severe risk of overtraining.

Tip! Fahlman’s cascade correlation algorithm: a similar idea, but applicable to function approximation in general.

Second order methods

Use second order information (how the slope changes over time).

Quickprop

Requires epoch learning. It rests on two assumptions:

1. The error surface (the landscape) can be approximated locally by a
parabola.

2. The change in the slope $\frac{\partial E}{\partial w_{ji}}$ from the previous step is due only to
the change of $w_{ji}$.

Then, the current and previous slope together with the latest weight change $\Delta w_{ji}$
can be used to define a parabola. Jump directly to its minimum.

Very sensitive to the choice of control parameters, but can be extremely fast.
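
Under the two assumptions above, the slope is a linear function of the weight, so the jump to the minimum of the parabola follows from the two slopes alone. Writing $S(t)$ for the slope $\frac{\partial E}{\partial w_{ji}}$ at step $t$, the step is $\Delta w_{ji}(t) = \frac{S(t)}{S(t-1) - S(t)} \Delta w_{ji}(t-1)$. The sketch below applies this to a single weight and a made-up error function $E(w) = (w - 3)^2$. The function names are hypothetical, and the safeguards of Fahlman's full algorithm (such as a limit on how fast the step may grow) are left out.

```python
def quickprop_step(slope, prev_slope, prev_delta):
    """Jump to the minimum of the parabola fitted through two slopes.

    slope, prev_slope: dE/dw at the current and at the previous weight.
    prev_delta: the weight change that led from the previous to the current weight.
    """
    return slope / (prev_slope - slope) * prev_delta

def dE_dw(w):
    # Slope of the hypothetical error E(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

w_prev = 0.0
w = 1.0  # first step taken by any other means, e.g. plain gradient descent
delta = quickprop_step(dE_dw(w), dE_dw(w_prev), w - w_prev)
w_new = w + delta
print(w_new)  # 3.0, the exact minimum, because E really is a parabola here
```

On a real error surface the parabola is only an approximation. When the two slopes are nearly equal the denominator approaches zero and the step blows up, which is one reason the method is so sensitive to its control parameters.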

No free lunch theorem

Averaged over all possible learning problems, no learning algorithm is better
than any other. They all perform the same. This includes random search.

There is always a catch. Any modification that seems to make things better
must have a drawback!
If your algorithm is worse than another on a subset of problems, you also
know that there must exist another subset for which your algorithm is
better.
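
A toy illustration of the statement (not a proof): enumerate every possible binary target function on eight input patterns, let each learner see the labels of five patterns, and score it on the three unseen ones. The three learners and the train/test split are arbitrary choices made for this sketch.

```python
from itertools import product

N_INPUTS = 8          # all 3-bit input patterns, indexed 0..7
TRAIN = range(0, 5)   # patterns seen during training
TEST = range(5, 8)    # patterns never seen during training

def majority(train_labels):
    """Predict the most common training label for every unseen pattern."""
    return int(2 * sum(train_labels) > len(train_labels))

def anti_majority(train_labels):
    """Predict the least common training label."""
    return 1 - majority(train_labels)

def always_zero(train_labels):
    """Ignore the data entirely."""
    return 0

def off_training_set_score(learner):
    """Return (correct, total) summed over all 2^8 possible target functions."""
    correct = total = 0
    for target in product([0, 1], repeat=N_INPUTS):
        guess = learner([target[i] for i in TRAIN])
        correct += sum(guess == target[i] for i in TEST)
        total += len(TEST)
    return correct, total

scores = {f.__name__: off_training_set_score(f)
          for f in (majority, anti_majority, always_zero)}
print(scores)  # every learner scores (384, 768), i.e. exactly 50 %
```

The averaging is over all targets with equal weight and the score counts only unseen patterns. Real problems are not drawn uniformly from all possible functions, which is why some algorithms still do better in practice.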

Preprocessing

The choice of input and output representations is the problem. This is where the
user really solves the problem (example: two-spirals).

Figure: the two-spirals data set.
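
As a sketch of how much the representation matters, the code below generates two interleaved spirals and then re-expresses each point in polar coordinates. The parametrization (radius proportional to angle, second spiral rotated by 180 degrees, no noise) is an assumption made for this example and is not necessarily the one used in the standard benchmark. The names `two_spirals` and `unroll` are hypothetical.

```python
import numpy as np

A = 1.0  # spiral growth rate: radius = A * angle (assumed parametrization)

def two_spirals(n_per_class=100, turns=3.0):
    """Return points (N, 2) and labels (N,) for two interleaved spirals."""
    theta = np.linspace(0.5, turns * 2 * np.pi, n_per_class)
    r = A * theta
    first = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    second = -first  # same spiral rotated by 180 degrees
    X = np.vstack([first, second])
    y = np.concatenate([np.zeros(n_per_class, int), np.ones(n_per_class, int)])
    return X, y

def unroll(X):
    """Polar-coordinate feature: close to +1 on one spiral, -1 on the other."""
    r = np.hypot(X[:, 0], X[:, 1])
    angle = np.arctan2(X[:, 1], X[:, 0])
    return np.cos(angle - r / A)

X, y = two_spirals()
feature = unroll(X)
predicted = (feature < 0).astype(int)  # a single threshold now separates the classes
```

In the raw $(x, y)$ coordinates the two classes are heavily intertwined. After this preprocessing a single threshold unit is enough, so most of the problem has been solved before any training starts.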

Distribute your representations!
Example: In classification, use as many outputs as there are classes. Train using target vectors where all elements are 0 except for the element corresponding to the current class, which is 1 (one-hot encoding). The network will then approach a Bayesian classifier: output $i$ will approximate $P(C_i|x)$.
Any prior knowledge on statistical distributions, symmetries, etc., should
be exploited in the pre-processing stage. Don't let the NN learn something you already know.
Normalization/scaling – see Engelbrecht. Remember, though, that this may
affect performance (introduces bias).
Normalization example: Normalizing inputs makes the network more
sensitive to fluctuations for some inputs over others. That bias may be the
reason for normalizing (give small range values a chance) but it is easy to
overcompensate.
Scaling example: Taking logarithms of the target values makes the network more
sensitive (better) for low values than for high values.
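
The three representation choices above (one-hot targets, input normalization, logarithmic target scaling) can be sketched as follows. The helper names and the data values are made up for illustration, and zero-mean, unit-variance scaling is only one of several possible normalizations.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Target matrix with a single 1 per row, in the column of the class."""
    targets = np.zeros((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

def standardize(X):
    """Rescale each input column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

labels = np.array([2, 0, 1])
T = one_hot(labels, n_classes=3)  # rows: [0,0,1], [1,0,0], [0,1,0]

# Two inputs with very different ranges (made-up values).
X = np.array([[0.001, 1000.0],
              [0.002, 3000.0],
              [0.003, 2000.0]])
X_scaled = standardize(X)

# Log-scaled targets: equal ratios become equal differences, so an
# absolute error on the log scale is a relative error on the raw scale.
y = np.array([1.0, 10.0, 100.0, 1000.0])
y_log = np.log10(y)
```

After standardization both input columns fluctuate on the same scale, which is exactly the bias discussed above: the small-range input now gets as much influence as the large-range one, whether or not it deserves it.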

Exploiting prior knowledge

Examples of prior knowledge types:

Initial guess
a. Choosing initial weights (or how to randomize them)
Known decomposition of the problem into subproblems
a. Preprocessing
b. Network structure
c. Multitask learning
Constraints
a. Modifying the objective function (and deriving a new learning rule)
Regions
a. Preprocessing
b. Multitask learning

Multitask learning (extra output learning)

Idea: we have additional data which we think could help the network.

Example: XOR is much easier (becomes linearly separable) if we add an extra
input $ab$, the product of the two inputs $a$ and $b$. But the extra information
then becomes required after training as well.
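
A quick check of the linear-separability claim, reading ab as the product of the two binary inputs. The weights are hand-picked for illustration rather than learned.

```python
import numpy as np

# All four input patterns (a, b) and the XOR targets.
AB = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
xor = AB[:, 0] ^ AB[:, 1]

# Augment the inputs with the product a*b as a third input.
X = np.column_stack([AB, AB[:, 0] * AB[:, 1]])

# One threshold unit with hand-picked weights (an assumption):
# a + b - 2ab is exactly XOR on binary inputs.
w, bias = np.array([1.0, 1.0, -2.0]), -0.5
out = (X @ w + bias > 0).astype(int)
print(out)  # [0 1 1 0]
```

A single unit without hidden layer now reproduces XOR, but only as long as the product is supplied as an input every time the network is used.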

Solution: Add ab, the hint function, as an extra output instead! The network
is then trained to implement both functions (outputs) at once.

Multitask Learning for XOR + x

This figure is copyright by Olle Gällmo.

Restricts the freedom of the hidden layer, i.e. the number of models the network
can form of the original target function.

Effect: Faster training. Less overtraining. Less variance in both training time and
results. Required number of nodes closer to theory. Hint only needed during
training.
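
A minimal sketch of the setup, assuming a 2-2-2 sigmoid network trained with plain batch backpropagation on a squared error. The architecture, learning rate, random seed and number of epochs are arbitrary choices for this example, and convergence to a correct XOR solution is not guaranteed for every initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
# Column 0: the real target (XOR). Column 1: the hint (the product ab).
T = np.array([[0, 0], [1, 0], [1, 0], [0, 1]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros(2)  # input -> 2 hidden units
W2, b2 = rng.normal(0, 1, (2, 2)), np.zeros(2)  # hidden -> 2 outputs

def forward(X):
    H = sigmoid(X @ W1 + b1)
    return H, sigmoid(H @ W2 + b2)

def loss():
    return float(np.mean((forward(X)[1] - T) ** 2))

initial_loss = loss()
eta = 0.5  # learning rate (arbitrary choice)
for _ in range(5000):
    H, Y = forward(X)
    dY = (Y - T) * Y * (1 - Y)      # output deltas, both tasks at once
    dH = (dY @ W2.T) * H * (1 - H)  # hidden deltas receive error from both outputs
    W2 -= eta * H.T @ dY
    b2 -= eta * dY.sum(axis=0)
    W1 -= eta * X.T @ dH
    b1 -= eta * dH.sum(axis=0)
final_loss = loss()

xor_only = forward(X)[1][:, 0]  # after training, the hint output can simply be ignored
```

The hidden deltas `dH` sum the error signals from both outputs, which is how the hint constrains the hidden representation. Once training is done, only the first output column is read and the hint is no longer needed.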

Example: XOR and function approximation experiments.