Natural Computation Methods for Machine Learning Note 10

2 March 2020


In this note, we will talk about Radial Basis Function Networks and, very briefly, Deep Learning.

Radial Basis Function Networks

RBFNs are feedforward networks, where:

  • The hidden nodes (the RBF nodes) compute distances, instead of weighted sums.
  • The hidden layer activation functions are Gaussians (or similar), instead of sigmoids.

  • The output layer nodes are linear weighted sum units, i.e. a linear combination of Gaussians.

  • One hidden layer only.

RBFNs for classification

The hidden nodes now form hyperspheres instead of hyperplanes.

The weight vector of a hidden node represents the center of the sphere.


Local 'receptive fields' (regions) instead of global dividers

– An RBF node only reacts (significantly) for inputs within its region

Note: In Competitive Learning we also compute distances, to find the closest node. In RBFNs, the nodes do not compete – they are combined.

RBFN implementation

Each hidden node, j, computes h_{j}=f\left(r_{j}\right), where r_j is the distance between the current input vector x and the node’s weight vector, t_j (= the centre of a hypersphere):

r_{j}=\sqrt{\sum_{i=1}^{N}\left(x_{i}-t_{j i}\right)^{2}}

f\left(r_{j}\right) should have a maximum at 0, i.e. when the input vector is at the centre of the sphere, e.g. a Gaussian:

f\left(r_{j}\right)=e^{-r_{j}^{2} / 2 \sigma^{2}}

where \sigma is the standard deviation (width)
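As a concrete sketch of the two formulas above (pure Python, hypothetical helper names), a hidden layer of Gaussian RBF nodes can be computed like this:

```python
import math

def rbf_hidden_layer(x, centres, sigma):
    """Compute the RBF hidden activations for one input vector x.

    centres: list of weight vectors t_j (one per hidden node).
    sigma:   common width of the Gaussians.
    """
    activations = []
    for t in centres:
        # r_j: Euclidean distance between input x and centre t_j
        r = math.sqrt(sum((xi - ti) ** 2 for xi, ti in zip(x, t)))
        # Gaussian activation: maximal (1.0) when x is at the centre
        activations.append(math.exp(-r ** 2 / (2 * sigma ** 2)))
    return activations

# A node responds strongly near its centre and only weakly far from it:
h = rbf_hidden_layer([0.0, 0.0], [[0.0, 0.0], [3.0, 3.0]], sigma=1.0)
# h[0] is exactly 1.0 (input at the centre); h[1] is close to 0
```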

RBFN learning

Learning: To find the position and size of the spheres (hidden layer) and how to combine them (output layer)

We could train this supervised (deriving a new version of Backprop – everything is still differentiable!)

– Slow, and does not exploit the localized properties

The hidden and output layers are conceptually different now

– two separate learning problems

The output layer is just a layer of (linear) perceptrons
– can use the Delta rule
– but we must train the hidden layer first!

The hidden layer could be random and fixed (i.e. not trained at all)
– Any problem can be solved this way
– Requires very many nodes

Train unsupervised! (K-Means or Competitive Learning)
– Allows the basis functions to move around, to where they are most likely needed

Can later be fine-tuned by supervised learning (gradient descent/Backprop)

– Backprop can be optimized for this (only a few nodes have to be updated)
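The two-stage scheme above can be sketched in Python. Here the centres are assumed to be fixed already (e.g. found by K-Means, or picked among the training points), and only the linear output layer is trained with the Delta rule; all names are illustrative:

```python
import math

def gaussian(r, sigma):
    return math.exp(-r ** 2 / (2 * sigma ** 2))

def hidden(x, centres, sigma):
    # Hidden layer: one Gaussian activation per (fixed) centre
    return [gaussian(math.dist(x, t), sigma) for t in centres]

def train_output_layer(data, centres, sigma, eta=0.1, epochs=200):
    """Delta-rule training of the linear output weights only.

    data: list of (x, target) pairs; the centres are NOT updated here.
    """
    w = [0.0] * len(centres)
    b = 0.0
    for _ in range(epochs):
        for x, target in data:
            h = hidden(x, centres, sigma)
            y = sum(wi * hi for wi, hi in zip(w, h)) + b
            err = target - y
            # Delta rule: move each weight proportionally to error * input
            w = [wi + eta * err * hi for wi, hi in zip(w, h)]
            b += eta * err
    return w, b
```

For example, placing one centre on each of the four XOR points lets this linear output layer solve XOR, which a single linear unit on the raw inputs cannot.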

How to set \sigma

Often equal and fixed ( \sigma set as a global constant)

Two common ways to set \sigma:

1. The average distance between a basis function and its closest neighbour
2. \sigma=d / \sqrt{2 M}, where
   d = the maximum distance between two nodes
   M = the number of basis functions
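The second heuristic is easy to compute directly from the trained centres; a small sketch (hypothetical function name):

```python
import math

def sigma_from_centres(centres):
    """Heuristic width: sigma = d / sqrt(2M), where d is the maximum
    distance between any two centres and M is the number of centres."""
    M = len(centres)
    d = max(math.dist(a, b) for a in centres for b in centres)
    return d / math.sqrt(2 * M)

# Two centres at distance 5 -> sigma = 5 / sqrt(4) = 2.5
print(sigma_from_centres([[0.0, 0.0], [3.0, 4.0]]))  # 2.5
```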

Do not fine-tune \sigma by gradient descent

– it tends to make the widths very large, destroying the localized properties


RBFNs vs MLPs

RBFN hidden nodes form local regions, whereas the hyperplanes of an MLP are global:

  • MLPs often do better in regions where little data is available, and therefore also when little data is available overall.
  • MLPs usually require fewer hidden nodes and tend to generalize better.
  • RBFNs learn faster
    -- Sometimes, only the output layer has to be trained
    -- Even if the hidden nodes are trained, only a few of them are affected by a given input and need to be updated

  • MLPs often do better for problems with many input variables
    -- A consequence of the “curse of dimensionality”

  • RBFNs are less sensitive to the order of presentation
    -- Good for on-line learning (e.g. in reinforcement learning)

  • RBFNs make fewer false-positive classification errors
    -- The basis functions tend to respond with a low value for data far from their receptive fields.
    -- MLPs can respond with a very high output also for data from uncharted terrain (which is also why they may extrapolate better than RBFNs)

Regularization techniques

To avoid overfitting

Regularization = any method which tries to prevent this

  • Early stopping (to avoid training for too long)
  • Trying to minimize the network size (parameters/weights)
  • Noise injection (to force the network to generalize)
  • Weight decay (with or without removal of weights)
  • Lagrangian optimization (constraint terms in the obj.fcn.)
  • Multitask learning (constrains the hidden layer)
  • Averaging (over several approximations). Bagging in ML
  • Dropout


Dropout

For each training example, switch off hidden nodes at random (with probability 0.5)

When testing, use all hidden nodes, but multiply the hidden-to-output weights by 0.5

In effect, the network outputs the average of several networks (similar to bagging)

Can let inputs drop out too (but, if so, with lower prob.)

Takes longer to train
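A minimal sketch of this scheme (hypothetical names; here the test-time scaling by 1 − p is applied to the activations, which is equivalent to scaling the outgoing weights):

```python
import random

def dropout_forward(h, p=0.5, training=True):
    """Dropout on a layer of hidden activations h.

    Training: each activation is switched off with probability p.
    Testing:  all nodes are kept, but scaled by (1 - p) so that the
    expected input to the next layer matches training conditions.
    """
    if training:
        return [0.0 if random.random() < p else hi for hi in h]
    return [(1 - p) * hi for hi in h]

# At test time every activation is halved (for p = 0.5):
print(dropout_forward([2.0, 4.0], training=False))  # [1.0, 2.0]
```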

Deep Learning

Simplest definition: any neural network with more than one hidden layer

Why more layers, when one hidden layer should suffice?

• More layers -> more levels of abstraction
• More levels of abstraction -> automatic feature selection

Why now?

For a long time, deep networks were held back by:

  • Severe risk of overfitting (huge number of parameters)
  • Requires huge amounts of data
  • Long training times. Requires very fast computers
  • Vanishing gradients

Also, the big companies that own the data are now heavily involved.

Vanishing gradients

  • Backprop computes weight changes with the chain rule
  • Chain rule -> multiplying many (small) gradients
  • More layers -> longer chain -> very small gradients

Even worse

  • Backprop tends to push the nodes towards their extreme values (towards either end of the activation function)
  • For sigmoids, the derivative f’(S) is very close to 0 for large positive and negative values
  • f’(S) is part of the chain
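A quick numerical illustration of both points: the sigmoid derivative f’(S) is at most 0.25 (reached at S = 0), so even in the best case a ten-layer chain shrinks the gradient by a factor of about a million:

```python
import math

def sigmoid_deriv(s):
    f = 1 / (1 + math.exp(-s))
    return f * (1 - f)  # at most 0.25, reached at s = 0

# Best case: every node sits at s = 0, so each chain factor is 0.25.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_deriv(0.0)
print(grad)  # 0.25**10, roughly 9.5e-07

# Saturated node: the factor is nearly 0, killing the gradient entirely.
print(sigmoid_deriv(10.0) < 1e-4)  # True
```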

Deep Learning milestones

  • 2006 Hinton shows how unsupervised pre-training (using auto-encoders) makes deep nets trainable by gradient descent
  • 2010 Martens shows that deep nets can be trained with second-order methods, without requiring pre-training

  • 2010 It is discovered that the use of Rectified Linear Units makes a big difference for gradient descent (no more vanishing gradients!)

  • 2013 Sutskever et al show that regular gradient descent (e.g. Backprop) can outperform second-order methods, with clever selection of initial weights and momentum

  • 2014 Dropout is introduced as a regularization method

Rectified Linear Units (ReLUs)

y = max(x, 0)

• Very easy to compute
• Almost linear (but non-linear enough)
• Does not saturate (no vanishing gradients)
• Very simple derivative (0 for x<0, 1 for x>0)

• Derivative undefined for x=0
• Nodes with 0 output will not learn
• Dead units
• Probably not so good for shallow networks?

Common ReLU variants

Leaky ReLU

y=\begin{cases}x, & \text { if } x>0 \\ 0.01 x, & \text { otherwise }\end{cases}

Parametric ReLU (the slope a is learned)

y=\begin{cases}x, & \text { if } x>0 \\ a x, & \text { otherwise }\end{cases}

Softplus

y=\log \left(1+e^{x}\right)

Convolutional Neural Networks

to convolve = to fold

CNN characteristics
• Receptive fields (windows)
• Shared weights
• Pooling (subsampling)


Convolutional layer = layer of feature maps
Each feature map is a grid of identical feature detectors
(so each map looks for one feature, but all over the image at once)
Feature detector = Filter = Neuron

CNN Feature Maps


Here, each neuron looks for a horizontal line in the centre of its field.

The field/window is small, so the neuron only has a few weights (here 3x3+1(bias) = 10 weights)

All neurons in the map detect the same feature, so they all share those 10 weights

So, the total number of weights for the whole map is just 10

Stride = step length when moving the window.
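A minimal pure-Python sketch of one feature map (hypothetical names; the ReLU applied to each response is a common choice, not stated above). Note that the same filter weights are reused at every position, so the map has only k×k + 1 parameters regardless of image size:

```python
def feature_map(image, filt, bias=0.0, stride=1):
    """Slide one shared k x k filter over the image (valid positions only).

    Every output position uses the SAME weights, so a 3x3 filter plus bias
    costs just 10 parameters for the whole map.
    """
    k = len(filt)
    h, w = len(image), len(image[0])
    out = []
    for i in range(0, h - k + 1, stride):      # stride = step length
        row = []
        for j in range(0, w - k + 1, stride):
            s = bias
            for di in range(k):
                for dj in range(k):
                    s += filt[di][dj] * image[i + di][j + dj]
            row.append(max(s, 0.0))            # ReLU on the response
        out.append(row)
    return out

# A 3x3 all-ones filter on a 4x4 all-ones image gives a 2x2 map of 9s:
print(feature_map([[1] * 4 for _ in range(4)],
                  [[1] * 3 for _ in range(3)]))  # [[9.0, 9.0], [9.0, 9.0]]
```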

CNN Pooling (Subsampling)


  • Interleaved with the convolutional layers
  • Reduces the resolution of the preceding feature map
  • Same principle (moving window) but predefined function (for example max, average, …)
  • Effect depends on the function (e.g. for max, don’t care where the feature is within the pooling window)
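For the max case, a pooling layer is just a moving window keeping the strongest response; a sketch (illustrative names):

```python
def max_pool(fmap, window=2, stride=2):
    """Max-pooling: keep only the strongest response in each window,
    discarding *where* in the window the feature was detected."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj]
                 for di in range(window) for dj in range(window))
             for j in range(0, w - window + 1, stride)]
            for i in range(0, h - window + 1, stride)]

# 2x2 max-pooling halves each dimension of a 4x4 map:
print(max_pool([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12],
                [13, 14, 15, 16]]))  # [[6, 8], [14, 16]]
```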



CNN = MLP with automatic feature extraction

Structure is deep and wide with a huge number of connections, but much fewer unique parameters due to shared weights

Question: Do Deep Nets Have to be Deep?

