Natural Computation Methods for Machine Learning Note 10

March 2, 2020


In this note, we are going to talk about Deep Learning very briefly.

Radial Basis Function Networks

RBFNs are feedforward networks, where:

  • The hidden nodes (the RBF nodes) compute distances, instead of weighted sums.
  • The hidden layer activation functions are Gaussians (or similar), instead of sigmoids.

  • The output layer nodes are linear weighted sum units, i.e. a linear combination of Gaussians.

  • One hidden layer only.

RBFNs for classification

The hidden nodes now form hyperspheres instead of hyperplanes.

The weight vector of a hidden node represents the center of the sphere.

[Figure: rbfn]

Local 'receptive fields' (regions) instead of global dividers – an RBF node only reacts (significantly) to inputs within its region.
Note: in Competitive Learning we also compute distances, to find the closest node. In RBFNs, the nodes do not compete – they are combined.

RBFN implementation

Each hidden node j computes h_{j}=f\left(r_{j}\right), where r_j is the distance between the current input vector x and the node's weight vector t_j (= the centre of a hypersphere):

r_{j}=\sqrt{\sum_{i=1}^{N}\left(x_{i}-t_{j i}\right)^{2}}

f\left(r_{j}\right) should have a maximum at 0, i.e. when the input vector is at the centre of the sphere, e.g. a Gaussian:

f\left(r_{j}\right)=e^{-\left(r_{j} / \sigma\right)^{2}}

where \sigma is the standard deviation (width)
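
As a concrete illustration, here is a minimal NumPy sketch of this forward pass (not from the notes; the names `T`, `sigma`, `W`, `b` are illustrative):

```python
import numpy as np

def rbfn_forward(x, T, sigma, W, b):
    """Minimal RBFN forward pass.

    x     : input vector, shape (N,)
    T     : hidden-node centres t_j, shape (M, N)
    sigma : common width of the Gaussians
    W, b  : linear output layer, shapes (K, M) and (K,)
    """
    r = np.linalg.norm(x - T, axis=1)      # distance to each centre
    h = np.exp(-(r / sigma) ** 2)          # Gaussian: maximal when x is at the centre
    return W @ h + b                       # linear combination of the Gaussians
```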

RBFN learning

Learning: To find the position and size of the spheres (hidden layer) and how to combine them (output layer)

We could train the whole network supervised (deriving a new version of Backprop – everything is still differentiable!)

– Slow, and does not exploit the localized properties

The hidden and output layers are conceptually different now, so we get two separate learning problems.

The output layer is just a layer of (linear) perceptrons:
– We can use the Delta rule
– But we must train the hidden layer first!

The hidden layer could be random and fixed (i.e. not trained at all)
– Any problem can be solved this way
– Requires very many nodes

Better: train it unsupervised (K-Means or Competitive Learning)
– Allows the basis functions to move around, to where they are most likely needed

The result can later be fine-tuned by supervised learning (gradient descent/Backprop), as sketched below.
– Backprop can be optimized for this (only a few nodes have to be updated)
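
The sketch referred to above: a minimal two-stage recipe (a few K-Means iterations for the centres, then the Delta rule on the linear output layer). This is not from the notes; NumPy is assumed, and `train_rbfn`, `X` (inputs, one row per example) and `Y` (targets) are illustrative names:

```python
import numpy as np

def train_rbfn(X, Y, M, sigma, lr=0.05, epochs=100, seed=0):
    rng = np.random.default_rng(seed)

    # Stage 1 (unsupervised): place M centres with a few K-Means iterations
    T = X[rng.choice(len(X), M, replace=False)].astype(float)
    for _ in range(20):
        d = np.linalg.norm(X[:, None, :] - T[None, :, :], axis=2)   # distances to centres
        nearest = d.argmin(axis=1)                                   # index of closest centre
        for j in range(M):
            if np.any(nearest == j):
                T[j] = X[nearest == j].mean(axis=0)

    # Hidden-layer activations for the whole training set
    H = np.exp(-(np.linalg.norm(X[:, None, :] - T[None, :, :], axis=2) / sigma) ** 2)

    # Stage 2 (supervised): Delta rule on the output layer only
    W = np.zeros((Y.shape[1], M))
    for _ in range(epochs):
        for h, y in zip(H, Y):
            W += lr * np.outer(y - W @ h, h)                         # Delta rule update
    return T, W
```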

How to set \sigma

\sigma is often equal for all nodes and fixed (set as a global constant).

Two common ways to set \sigma:

  • The average distance between a basis function and its closest neighbor
  • \sigma=d / \sqrt{2 M}, where d is the maximum distance between two nodes and M is the number of basis functions

Do not fine-tune \sigma by gradient descent – it tends to make the widths very large, destroying the localized properties.

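For the second heuristic, a small helper (a sketch; `centres` stands for the trained centre vectors):

```python
import numpy as np

def sigma_heuristic(centres):
    """sigma = d / sqrt(2M), where d is the max distance between two centres."""
    M = len(centres)
    d = max(np.linalg.norm(a - b) for a in centres for b in centres)
    return d / np.sqrt(2 * M)
```
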
RBFN vs. MLP

RBFN hidden nodes form local regions; the hyperplanes of an MLP are global.

  • MLPs often do better in regions where little data is available, and therefore also when little data is available in total.
  • MLPs usually require fewer hidden nodes and tend to generalize better.
  • RBFNs learn faster
    – Sometimes, only the output layer has to be trained
    – Even if the hidden nodes are trained, only a few of them are affected by a given input and need to be updated
  • MLPs often do better for problems with many input variables
    – A consequence of the "curse of dimensionality"
  • RBFNs are less sensitive to the order of presentation
    – Good for on-line learning (e.g. in reinforcement learning)
  • RBFNs make fewer false-positive classification errors
    – The basis functions tend to respond with a low value for data far from their receptive fields
    – MLPs can respond with a very high output even for data from uncharted terrain (which is also why they may extrapolate better than RBFNs)

Regularization techniques

Goal: to avoid overfitting.

Regularization = any method which tries to prevent overfitting, for example:

  • Early stopping (to avoid training for too long)
  • Trying to minimize the network size (parameters/weights)
  • Noise injection (to force the network to generalize)
  • Weight decay (with or without removal of weights)
  • Lagrangian optimization (constraint terms in the objective function)
  • Multitask learning (constrains the hidden layer)
  • Averaging (over several approximations) – "bagging" in ML
  • Dropout

Dropout

For each training example, switch off hidden nodes at random (with probability 0.5).

When testing, use all hidden nodes, but scale the hidden-to-output weights by 50%.

In effect, the network outputs the average of several networks (similar to bagging), as sketched below.

Inputs can be dropped out too (but, if so, with a lower probability).

Dropout takes longer to train.
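
The sketch referred to above, for a single hidden layer (an illustration with assumptions: NumPy, ReLU hidden units, weight matrices `W1` and `W2`; only the 0.5 rate and the test-time scaling come from the notes):

```python
import numpy as np

def forward_train(x, W1, W2, p_drop=0.5):
    h = np.maximum(W1 @ x, 0.0)                               # hidden layer
    mask = np.random.default_rng().random(h.shape) >= p_drop  # switch off nodes at random
    return W2 @ (h * mask)

def forward_test(x, W1, W2, p_drop=0.5):
    h = np.maximum(W1 @ x, 0.0)
    return (W2 * (1 - p_drop)) @ h                            # all nodes, scaled weights
```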

Deep Learning

Simplest definition: any neural network with more than one hidden layer

Why more layers

…when one hidden layer should, in principle, suffice (a single hidden layer is already a universal approximator)?

• More layers -> more levels of abstraction
• More levels of abstraction -> automatic feature selection

Why now?

Deep learning was long held back by several obstacles:

  • Severe risk of overfitting (huge number of parameters)
  • The need for huge amounts of data
  • Long training times, requiring very fast computers
  • Vanishing gradients

These obstacles have now largely been overcome (see the milestones below) – and the big companies that own the data are heavily involved.

Vanishing gradients

  • Backprop computes weight changes with the chain rule
  • Chain rule -> multiplying many (small) gradients
  • More layers -> longer chain -> very small gradients

Even worse

  • Backprop tends to push the nodes towards their extreme values (towards either end of the activation function)
  • For sigmoids, the derivative f’(S) is very close to 0 for large positive and negative values
  • f’(S) is part of the chain
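
A tiny numerical illustration (not in the notes): the sigmoid derivative is at most 0.25, so even in the best case a chain through L layers multiplies in at most 0.25^L:

```python
# Best case for a sigmoid: f'(S) = 0.25 (at S = 0). Chaining L such factors:
for L in (1, 5, 10, 20):
    print(L, 0.25 ** L)   # 0.25, ~1e-3, ~1e-6, ~1e-12 – the gradient effectively vanishes
```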

Deep Learning milestones

  • 2006 Hinton shows how unsupervised pre-training (using auto-encoders) makes deep nets trainable by gradient descent
  • 2010 Martens shows that deep nets can be trained with second-order methods, without requiring pre-training

  • 2010 It is discovered that the use of Rectified Linear Units makes a big difference for gradient descent (no more vanishing gradients!)

  • 2013 Sutskever et al show that regular gradient descent (e.g. Backprop) can outperform second-order methods, with clever selection of initial weights and momentum

  • 2014 Dropout is introduced as a regularization method

Rectified Linear Units (ReLUs)

y = max(x, 0)

Pros
• Very easy to compute
• Almost linear (but non-linear enough)
• Does not saturate (no vanishing gradients)
• Very simple derivative (0 for x<0, 1 for x>0)

Cons
• Derivative undefined at x = 0
• Nodes with 0 output will not learn ("dead units")
• Probably not so good for shallow networks?
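
In code, the unit and its derivative are one line each (NumPy; the convention at x = 0 below is a common choice, not something the notes specify):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)          # y = max(x, 0)

def relu_grad(x):
    return (x > 0).astype(float)       # 0 for x < 0, 1 for x > 0 (0 at x = 0 by convention)
```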

Common ReLU variants

Leaky ReLU
y=\begin{cases} x, & \text{if } x>0 \\ 0.01x, & \text{otherwise} \end{cases}

Parametric ReLU
y=\begin{cases} x, & \text{if } x>0 \\ ax, & \text{otherwise} \end{cases}

Softplus
y=\log \left(1+e^{x}\right)
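
The same three variants as NumPy one-liners (a sketch; `a` is the learned slope of the Parametric ReLU):

```python
import numpy as np

def leaky_relu(x):
    return np.where(x > 0, x, 0.01 * x)

def parametric_relu(x, a):
    return np.where(x > 0, x, a * x)

def softplus(x):
    return np.log(1.0 + np.exp(x))     # smooth approximation of the ReLU
```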

Convolutional Neural Networks

to convolute = to fold

CNN characteristics
• Receptive fields (windows)
• Shared weights
• Pooling (subsampling)

Basics:

Convolutional layer = layer of feature maps
Each feature map is a grid of identical feature detectors
(so each map looks for one feature, but all over the image at once)
Feature detector = Filter = Neuron

CNN Feature Maps

[Figure: CNNFeatureMap]

Here, each neuron looks for a horizontal line in the centre of its field.

The field/window is small, so each neuron only has a few weights (here 3x3 + 1 (bias) = 10 weights).

All neurons in the map detect the same feature, so they all share those 10 weights.

So the total number of weights for the whole map is just 10.

Stride = the step length when moving the window.
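
A sketch of one feature map with a single shared 3x3 filter and bias, moved over the image with a given stride (NumPy; the ReLU activation is an assumption, the notes only describe the shared weights):

```python
import numpy as np

def feature_map(image, filt, bias, stride=1):
    """One feature map: the same small filter (+ bias) applied all over the image."""
    H, W = image.shape
    k = filt.shape[0]                              # e.g. 3 for a 3x3 filter
    rows = []
    for r in range(0, H - k + 1, stride):
        row = []
        for c in range(0, W - k + 1, stride):
            window = image[r:r + k, c:c + k]       # this neuron's receptive field
            row.append(max(np.sum(window * filt) + bias, 0.0))
        rows.append(row)
    return np.array(rows)
```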

CNN Pooling (Subsampling)

[Figure: CNNFeatureMap]

  • Interleaved with the convolutional layers
  • Reduces the resolution of the preceding feature map
  • Same principle (moving window) but with a predefined function (for example max, average, …)
  • The effect depends on the function (e.g. for max: don't care where the feature is within the pooling window)
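
And the corresponding pooling step, here with max and a 2x2 window (window size and stride are illustrative choices):

```python
import numpy as np

def max_pool(fmap, k=2, stride=2):
    """Reduce the resolution of a feature map: keep only the max in each window."""
    H, W = fmap.shape
    return np.array([[fmap[r:r + k, c:c + k].max()
                      for c in range(0, W - k + 1, stride)]
                     for r in range(0, H - k + 1, stride)])
```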

[Figure: CNNStructure]

conv->pool->conv->pool->…->MLP

CNN = an MLP with automatic feature extraction.

The structure is deep and wide, with a huge number of connections, but far fewer unique parameters thanks to shared weights (see the count below).
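
A back-of-the-envelope count for the 3x3 example from the feature-map section (the 28x28 input size is an illustrative assumption):

```python
unique_weights = 3 * 3 + 1                  # 10 weights, shared by every neuron in the map
map_size = (28 - 3 + 1) ** 2                # 26 x 26 = 676 neurons (stride 1, no padding)
connections = map_size * unique_weights     # 6760 connections, but still only 10 unique weights

# A fully connected layer with as many neurons would need one weight per connection:
fully_connected = map_size * (28 * 28 + 1)  # 530,660 parameters
print(unique_weights, connections, fully_connected)
```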

Question: Do Deep Nets Have to be Deep?

Tags: Deep Learning, machine learning, Natural Computation, Regularization
Last updated: March 4, 2020

