CGS-760: Neural Networks (May 2000)
Backpropagation Learning
Training neural networks by error propagation
The classical method for training a multilayer feed-forward neural network is the steepest-descent backpropagation algorithm. The basic idea of the backpropagation learning algorithm [1] is the repeated application of the chain rule to compute the influence of each weight in the network on an arbitrary error function E:

∂E/∂w_ij = (∂E/∂s_i) · (∂s_i/∂net_i) · (∂net_i/∂w_ij)

where w_ij is the weight from neuron j to neuron i, s_i is the output of neuron i, and net_i is the weighted sum of the inputs of neuron i.
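For concreteness (this expansion is not given in the notes, and it assumes a logistic activation together with the squared error used later for E_p), the three factors for an output neuron i are ∂E/∂s_i = −(d_i − s_i), ∂s_i/∂net_i = s_i(1 − s_i), and ∂net_i/∂w_ij = s_j, so that ∂E/∂w_ij = −(d_i − s_i) · s_i(1 − s_i) · s_j, where d_i is the desired output of neuron i and s_j is the output of the preceding neuron j.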
Once the partial derivative for each weight is known, the error function is minimized by performing a simple gradient descent:

w_ij(t+1) = w_ij(t) − η · ∂E/∂w_ij(t)

Obviously, the choice of the learning rate η, which scales the derivative, has an important effect on the time needed until convergence is reached.
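As a minimal sketch (not part of the original notes; the function names and the choice of a sigmoid activation with squared error are illustrative assumptions), one gradient-descent step for a single sigmoid output neuron could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_descent_step(w, s_j, d, eta=0.25):
    # s_j: outputs of the preceding neurons j feeding neuron i (its inputs)
    # w:   weights w_ij from those neurons j to neuron i
    # d:   desired output for neuron i
    net_i = np.dot(w, s_j)          # net_i, the weighted sum of the inputs
    s_i = sigmoid(net_i)            # s_i, the neuron output
    # Chain rule: dE/dw_ij = dE/ds_i * ds_i/dnet_i * dnet_i/dw_ij
    dE_dsi = -(d - s_i)             # from E = 1/2 (d - s_i)^2 (assumed error)
    dsi_dnet = s_i * (1.0 - s_i)    # derivative of the sigmoid
    grad = dE_dsi * dsi_dnet * s_j  # dnet_i/dw_ij = s_j
    return w - eta * grad           # one steepest-descent step scaled by eta
```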
A proposed way to alleviate this problem is to introduce a momentum term. This technique was popularized by Rumelhart et al. [2].
As explained above,
backpropagation can be expressed as a gradient descent method for training (or
learning) multilayer perceptron weights. Therefore, the rule for changing
weights can be presented as follows.
Condition:
For a given problem, {∀ x ∈ X | X = set of training vectors},
there is {∀ d ∈ D | d = the associated desired output vector and D = set of desired outputs associated with the training vectors in X}.
Now let the matrix-vector form of the instantaneous error E_p be defined as:

E_p = ½ (d_p − z_p)ᵀ (d_p − z_p) = ½ Σ_k (d_k,p − z_k,p)²

where d_k,p is the kth component of the pth desired output d_p, and z_p is the output produced when the pth training exemplar x_p is input to the multilayer perceptron.
The total error E_T, which is the sum of the errors over all input patterns, is defined as:

E_T = Σ_{p=1..P} E_p

where P is the cardinality of X.
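A small sketch of these two error measures (illustrative, not part of the notes; `forward` is an assumed function performing the network's forward pass):

```python
import numpy as np

def instantaneous_error(d_p, z_p):
    # E_p = 1/2 (d_p - z_p)^T (d_p - z_p) for one training exemplar
    diff = np.asarray(d_p) - np.asarray(z_p)
    return 0.5 * float(diff @ diff)

def total_error(X, D, forward):
    # E_T = sum of E_p over all P patterns in X; `forward` maps an input
    # vector x_p to the network output z_p
    return sum(instantaneous_error(d_p, forward(x_p)) for x_p, d_p in zip(X, D))
```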
It is important to note that the total error E_T is a function of both:
1) the training set of the network;
2) the weights in the network.
To increase the learning rate without leading to oscillation, the backpropagation learning rule may be defined as follows:

Δw(t) = −η · ∂E_p/∂w + α · Δw(t−1)

where η, the learning rate, is some small positive number between 0 and 1 (in practice 0.05 < η < 0.75); α, the momentum factor, is also a small positive number; and w represents any single weight in the network. In the above equation, Δw(t) is the change in the weight computed at time t.
Note:
If α ≠ 0, the training rule will be called the momentum method.
If α = 0, the training rule will be called instantaneous backpropagation.
If E_T is used in place of E_p, the training rule will be called the batch backpropagation method.
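A compact sketch of this update rule (an illustrative assumption, not the course's own code), covering the momentum (α ≠ 0), instantaneous (α = 0), and batch (gradient of E_T) variants:

```python
def backprop_update(w, grad_E, delta_w_prev, eta=0.25, alpha=0.9):
    # One weight update: delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)
    # grad_E: gradient of E_p (instantaneous) or E_T (batch) w.r.t. the weights
    # alpha = 0 gives instantaneous backpropagation; alpha != 0, the momentum method
    delta_w = -eta * grad_E + alpha * delta_w_prev
    return w + delta_w, delta_w
```

The caller keeps the returned delta_w and passes it back as delta_w_prev on the next step, which is what carries the momentum from one update to the next.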
[1] Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, Vol. 22, pp. 400-407, 1951.
[2] Rumelhart, D. E., McClelland, J. L. & the PDP Research Group. Parallel Distributed Processing. MIT Press, 1986.
Suggested Readings:
Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.
Samana Fatala, School of Engineering, Central Philippine University