Natural Computation Methods for Machine Learning Note 10

2 March 2020


In this note, we will talk about Radial Basis Function Networks and, very briefly, Deep Learning.

Radial Basis Function Networks

RBFNs are feedforward networks, where:

  • The hidden nodes (the RBF nodes) compute distances, instead of weighted sums.
  • The hidden layer activation functions are Gaussians (or similar), instead of sigmoids.

  • The output layer nodes are linear weighted sum units, i.e. a linear combination of Gaussians.

  • One hidden layer only.

RBFNs for classification

The hidden nodes now form hyperspheres instead of hyperplanes.

The weight vector of a hidden node represents the center of the sphere.


Local 'receptive fields' (regions) instead of global dividers

– An RBF node only reacts (significantly) for inputs within its region

Note: In Competitive Learning we also compute distances, to find the closest node. In RBFNs, the nodes do not compete – they are combined.

RBFN implementation

Each hidden node, j, computes h_{j}=f\left(r_{j}\right), where r_j is the distance between the current input vector x and the node’s weight vector, t_j (= the centre of a hypersphere):

r_{j}=\sqrt{\sum_{i=1}^{N}\left(x_{i}-t_{j i}\right)^{2}}

f\left(r_{j}\right) should have a maximum at 0, i.e. when the input vector is at the centre of the sphere, e.g. a Gaussian:

f\left(r_{j}\right)=e^{-r_{j}^{2} / 2 \sigma^{2}}

where \sigma is the standard deviation (width)
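As a concrete sketch of the two formulas above (pure Python, hypothetical helper names), a hidden layer of Gaussian RBF nodes can be computed like this:

```python
import math

def rbf_hidden_layer(x, centres, sigma):
    """Compute the RBF hidden activations for one input vector x.

    centres: list of weight vectors t_j (one per hidden node).
    sigma:   common width of the Gaussians.
    """
    activations = []
    for t in centres:
        # r_j: Euclidean distance between input x and centre t_j
        r = math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, t)))
        # Gaussian activation: maximal (1.0) when x is at the centre
        activations.append(math.exp(-r ** 2 / (2 * sigma ** 2)))
    return activations

# A node responds strongly near its centre and only weakly far from it:
h = rbf_hidden_layer([0.0, 0.0], [[0.0, 0.0], [3.0, 3.0]], sigma=1.0)
# h[0] is exactly 1.0 (input at the centre); h[1] is close to 0
```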

RBFN learning

Learning: To find the position and size of the spheres (hidden layer) and how to combine them (output layer)

We could train this supervised (deriving a new version of Backprop – everything is still differentiable!)

– Slow, and does not exploit the localized properties

The hidden and output layers are conceptually different now

– two separate learning problems

The output layer is just a layer of (linear) perceptrons
– can use the Delta rule
– but we must train the hidden layer first!

The hidden layer could be random and fixed (i.e. not trained at all)
– Any problem can be solved this way
– Requires very many nodes

Train unsupervised! (K-Means or Competitive Learning)
– Allows the basis functions to move around, to where they are most likely needed

Can later be fine-tuned by supervised learning (gradient descent/Backprop)

– Backprop can be optimized for this (only a few nodes have to be updated)
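The two-stage scheme above can be sketched in Python. Here the centres are assumed to be fixed already (e.g. found by K-Means, or picked among the training points), and only the linear output layer is trained with the Delta rule; all names are illustrative:

```python
import math

def gaussian(r, sigma):
    return math.exp(-r ** 2 / (2 * sigma ** 2))

def hidden(x, centres, sigma):
    # Hidden layer: one Gaussian activation per (fixed) centre
    return [gaussian(math.dist(x, t), sigma) for t in centres]

def train_output_layer(data, centres, sigma, eta=0.1, epochs=200):
    """Delta-rule training of the linear output weights only.

    data: list of (x, target) pairs; the centres are NOT updated here.
    """
    w = [0.0] * len(centres)
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            h = hidden(x, centres, sigma)
            y = sum(wi * hi for wi, hi in zip(w, h)) + b
            err = target - y
            # Delta rule: move each weight proportionally to error * input
            w = [wi + eta * err * hi for wi, hi in zip(w, h)]
            b += eta * err
    return w, b
```

For example, placing one centre on each of the four XOR points lets this linear output layer solve XOR, which a single linear unit on the raw inputs cannot.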

How to set \sigma

Often equal and fixed ( \sigma set as a global constant)

Two common ways to set \sigma:

1. The average distance between a basis function and its closest neighbour
2. \sigma=d / \sqrt{2 M}, where
   d = the maximum distance between two nodes
   M = the number of basis functions
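The second heuristic is easy to compute directly from the trained centres; a small sketch (hypothetical function name):

```python
import math

def sigma_from_centres(centres):
    """Heuristic width: sigma = d / sqrt(2M), where d is the maximum
    distance between any two centres and M is the number of centres."""
    M = len(centres)
    d = max(math.dist(a, b) for a in centres for b in centres)
    return d / math.sqrt(2 * M)

# Two centres at distance 5 -> sigma = 5 / sqrt(4) = 2.5
print(sigma_from_centres([[0.0, 0.0], [3.0, 4.0]]))  # 2.5
```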

Do not fine-tune \sigma by gradient descent

– it tends to make the widths very large, destroying the localized properties


RBFNs vs MLPs

RBFN hidden nodes form local regions, whereas the hyperplanes of an MLP are global:

  • MLPs often do better in regions where little data is available, and therefore also when little data is available overall.
  • MLPs usually require fewer hidden nodes and tend to generalize better.
  • RBFNs learn faster
    -- Sometimes, only the output layer has to be trained
    -- Even if the hidden nodes are trained, only a few of them are affected by a given input and need to be updated

  • MLPs often do better for problems with many input variables
    -- A consequence of the “curse of dimensionality”

  • RBFNs are less sensitive to the order of presentation
    -- Good for on-line learning (e.g. in reinforcement learning)

  • RBFNs make fewer false-positive classification errors
    -- The basis functions tend to respond with a low value for data far from their receptive fields.
    -- MLPs can respond with a very high output also for data from uncharted terrain (which is also why they may extrapolate better than RBFNs)

Regularization techniques

To avoid overfitting

Regularization = any method which tries to prevent this

  • Early stopping (to avoid training for too long)
  • Trying to minimize the network size (parameters/weights)
  • Noise injection (to force the network to generalize)
  • Weight decay (with or without removal of weights)
  • Lagrangian optimization (constraint terms in the obj.fcn.)
  • Multitask learning (constrains the hidden layer)
  • Averaging (over several approximations). Bagging in ML
  • Dropout


Dropout

For each training example, switch off hidden nodes at random (with probability 0.5)

When testing, use all hidden nodes, but multiply the hidden-to-output weights by 0.5

In effect, the network outputs the average of several networks (similar to bagging)

Can let inputs drop out too (but, if so, with lower prob.)

Takes longer to train
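A minimal sketch of this scheme (hypothetical names; here the test-time scaling by 1 − p is applied to the activations, which is equivalent to scaling the outgoing weights):

```python
import random

def dropout_forward(h, p=0.5, training=True):
    """Dropout on a layer of hidden activations h.

    Training: each activation is switched off with probability p.
    Testing:  all nodes are kept, but scaled by (1 - p) so that the
    expected input to the next layer matches training conditions.
    """
    if training:
        return [0.0 if random.random() < p else hi for hi in h]
    return [(1 - p) * hi for hi in h]

# At test time every activation is halved (for p = 0.5):
print(dropout_forward([2.0, 4.0], training=False))  # [1.0, 2.0]
```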

Deep Learning

Simplest definition: any neural network with more than one hidden layer

Why more layers, when one hidden layer should suffice?

• More layers -> more levels of abstraction
• More levels of abstraction -> automatic feature selection

Why now?

For a long time, deep networks were held back by:

  • Severe risk of overfitting (huge number of parameters)
  • Requires huge amounts of data
  • Long training times. Requires very fast computers
  • Vanishing gradients

Also, the big companies that own the data are now heavily involved.

Vanishing gradients

  • Backprop computes weight changes with the chain rule
  • Chain rule -> multiplying many (small) gradients
  • More layers -> longer chain -> very small gradients

Even worse

  • Backprop tends to push the nodes towards their extreme values (towards either end of the activation function)
  • For sigmoids, the derivative f’(S) is very close to 0 for large positive and negative values
  • f’(S) is part of the chain
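A quick numerical illustration of both points: the sigmoid derivative f’(S) is at most 0.25 (reached at S = 0), so even in the best case a ten-layer chain shrinks the gradient by a factor of about a million:

```python
import math

def sigmoid_deriv(s):
    f = 1 / (1 + math.exp(-s))
    return f * (1 - f)  # at most 0.25, reached at s = 0

# Best case: every node sits at s = 0, so each chain factor is 0.25.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_deriv(0.0)
print(grad)  # 0.25**10, roughly 9.5e-07

# Saturated node: the factor is nearly 0, killing the gradient entirely.
print(sigmoid_deriv(10.0) < 1e-4)  # True
```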

Deep Learning milestones

  • 2006 Hinton shows how unsupervised pre-training (using auto-encoders) makes deep nets trainable by gradient descent
  • 2010 Martens shows that deep nets can be trained with second-order methods, without requiring pre-training

  • 2010 It is discovered that the use of Rectified Linear Units makes a big difference for gradient descent (no more vanishing gradients!)

  • 2013 Sutskever et al show that regular gradient descent (e.g. Backprop) can outperform second-order methods, with clever selection of initial weights and momentum

  • 2014 Dropout is introduced as a regularization method

Rectified Linear Units (ReLUs)

y = max(x, 0)

• Very easy to compute
• Almost linear (but non-linear enough)
• Does not saturate (no vanishing gradients)
• Very simple derivative (0 for x<0, 1 for x>0)

• Derivative undefined for x=0
• Nodes with 0 output will not learn
• Dead units
• Probably not so good for shallow networks?

Common ReLU variants

Leaky ReLU

y=\begin{cases}x, & \text { if } x>0 \\ 0.01 x, & \text { otherwise }\end{cases}

Parametric ReLU (the slope a is learned)

y=\begin{cases}x, & \text { if } x>0 \\ a x, & \text { otherwise }\end{cases}

Softplus

y=\log \left(1+e^{x}\right)

Convolutional Neural Networks

to convolve = to fold

CNN characteristics
• Receptive fields (windows)
• Shared weights
• Pooling (subsampling)


Convolutional layer = layer of feature maps
Each feature map is a grid of identical feature detectors
(so each map looks for one feature, but all over the image at once)
Feature detector = Filter = Neuron

CNN Feature Maps


Here, each neuron looks for a horizontal line in the centre of its field.

The field/window is small, so the neuron only has a few weights (here 3x3+1(bias) = 10 weights)

All neurons in the map detect the same feature, so they all share those 10 weights

So, the total number of weights for the whole map is just 10

Stride = step length when moving the window.
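A minimal pure-Python sketch of one feature map (hypothetical names; the ReLU applied to each response is a common choice, not stated above). Note that the same filter weights are reused at every position, so the map has only k×k + 1 parameters regardless of image size:

```python
def feature_map(image, filt, bias=0.0, stride=1):
    """Slide one shared k x k filter over the image (valid positions only).

    Every output position uses the SAME weights, so a 3x3 filter plus bias
    costs just 10 parameters for the whole map.
    """
    k = len(filt)
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, stride):      # stride = step length
        row = []
        for j in range(0, w - k + 1, stride):
            s = bias
            for di in range(k):
                for dj in range(k):
                    s += filt[di][dj] * image[i + di][j + dj]
            row.append(max(s, 0.0))            # ReLU on the response
        out.append(row)
    return out

# A 3x3 all-ones filter on a 4x4 all-ones image gives a 2x2 map of 9s:
print(feature_map([[1] * 4 for _ in range(4)],
                  [[1] * 3 for _ in range(3)]))  # [[9.0, 9.0], [9.0, 9.0]]
```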

CNN Pooling (Subsampling)


  • Interleaved with the convolutional layers
  • Reduces the resolution of the preceding feature map
  • Same principle (moving window) but predefined function (for example max, average, …)
  • Effect depends on the function (e.g. for max, don’t care where the feature is within the pooling window)
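For the max case, a pooling layer is just a moving window keeping the strongest response; a sketch (illustrative names):

```python
def max_pool(fmap, window=2, stride=2):
    """Max-pooling: keep only the strongest response in each window,
    discarding *where* in the window the feature was detected."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(window) for dj in range(window))
             for j in range(0, w - window + 1, stride)]
            for i in range(0, h - window + 1, stride)]

# 2x2 max-pooling halves each dimension of a 4x4 map:
print(max_pool([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12],
                [13, 14, 15, 16]]))  # [[6, 8], [14, 16]]
```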



CNN = MLP with automatic feature extraction

Structure is deep and wide with a huge number of connections, but much fewer unique parameters due to shared weights

Question: Do Deep Nets Have to be Deep?

