CGS-760: Neural Networks (May 2000)
Backpropagation Learning
Training neural networks by error propagation
The classical method for training a multilayer feed-forward neural network is the steepest-descent backpropagation algorithm. The basic idea of the backpropagation learning algorithm [1] is the repeated application of the chain rule to compute the influence of each weight in the network on an arbitrary error function E:

∂E/∂w_ij = (∂E/∂s_i) · (∂s_i/∂net_i) · (∂net_i/∂w_ij)

where w_ij is the weight from neuron j to neuron i, s_i is the output of neuron i, and net_i is the weighted sum of the inputs of neuron i.
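For concreteness (this expansion is not given in the notes, and it assumes a logistic activation together with the squared error used later for E_p), the three factors for an output neuron i are ∂E/∂s_i = −(d_i − s_i), ∂s_i/∂net_i = s_i(1 − s_i), and ∂net_i/∂w_ij = s_j, so that ∂E/∂w_ij = −(d_i − s_i) · s_i(1 − s_i) · s_j, where d_i is the desired output of neuron i and s_j is the output of the preceding neuron j.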
Once the partial derivative for each weight is known, the error function is minimized by performing a simple gradient descent:

w_ij(t+1) = w_ij(t) − η · ∂E/∂w_ij(t)

Obviously, the choice of the learning rate η, which scales the derivative, has an important effect on the time needed until convergence is reached.
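As a minimal sketch (not part of the original notes; the function names and the choice of a sigmoid activation with squared error are illustrative assumptions), one gradient-descent step for a single sigmoid output neuron could look like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gradient_descent_step(w, s_j, d, eta=0.25):
    # s_j: outputs of the preceding neurons j feeding neuron i (its inputs)
    # w:   weights w_ij from those neurons j to neuron i
    # d:   desired output for neuron i
    net_i = np.dot(w, s_j)          # net_i, the weighted sum of the inputs
    s_i = sigmoid(net_i)            # s_i, the neuron output
    # Chain rule: dE/dw_ij = dE/ds_i * ds_i/dnet_i * dnet_i/dw_ij
    dE_dsi = -(d - s_i)             # from E = 1/2 (d - s_i)^2 (assumed error)
    dsi_dnet = s_i * (1.0 - s_i)    # derivative of the sigmoid
    grad = dE_dsi * dsi_dnet * s_j  # dnet_i/dw_ij = s_j
    return w - eta * grad           # one steepest-descent step scaled by eta
```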
A proposed way to alleviate this problem is to introduce a momentum term. This technique was popularized by Rumelhart et al. [2].
As explained above,
backpropagation can be expressed as a gradient descent method for training (or
learning) multilayer perceptron weights. Therefore, the rule for changing
weights can be presented as follows.
Condition:
For a given problem, {∀ x ∈ X | X = set of training vectors},
there is {∀ d ∈ D | d = the associated desired output vector and D = set of desired outputs associated with the training vectors in X}.
Now let the matrix-vector form of the instantaneous error E_p be defined as:

E_p = ½ (d_p − z_p)ᵀ (d_p − z_p) = ½ Σ_k (d_k,p − z_k,p)²

where d_k,p is the kth component of the pth desired output d_p, and z_p is the output produced when the pth training exemplar x_p is input to the multilayer perceptron.
The total error E_T, which is the sum of the errors over all input patterns, is defined as:

E_T = Σ_{p=1..P} E_p

where P is the cardinality of X.
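A small sketch of these two error measures (illustrative, not part of the notes; `forward` is an assumed function performing the network's forward pass):

```python
import numpy as np

def instantaneous_error(d_p, z_p):
    # E_p = 1/2 (d_p - z_p)^T (d_p - z_p) for one training exemplar
    diff = np.asarray(d_p) - np.asarray(z_p)
    return 0.5 * float(diff @ diff)

def total_error(X, D, forward):
    # E_T = sum of E_p over all P patterns in X; `forward` maps an input
    # vector x_p to the network output z_p
    return sum(instantaneous_error(d_p, forward(x_p)) for x_p, d_p in zip(X, D))
```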
It is important to note that the total error E_T is a function of both:
1) the training set of the network;
2) the weights in the network.
To increase the learning rate without leading to oscillation, the backpropagation learning rule may be defined as follows:

Δw(t) = −η · ∂E_p/∂w + α · Δw(t−1)

where η, the learning rate, is some small positive number between 0 and 1 (in practice 0.05 < η < 0.75); α, the momentum factor, is also a small positive number; and w represents any single weight in the network. In the above equation, Δw(t) is the change in the weight computed at time t.
Note:
If α ≠ 0, the training rule will be called the momentum method.
If α = 0, the training rule will be called instantaneous backpropagation.
If E_T is used in place of E_p, the training rule will be called the batch backpropagation method.
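A compact sketch of this update rule (an illustrative assumption, not the course's own code), covering the momentum (α ≠ 0), instantaneous (α = 0), and batch (gradient of E_T) variants:

```python
def backprop_update(w, grad_E, delta_w_prev, eta=0.25, alpha=0.9):
    # One weight update: delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)
    # grad_E: gradient of E_p (instantaneous) or E_T (batch) w.r.t. the weights
    # alpha = 0 gives instantaneous backpropagation; alpha != 0, the momentum method
    delta_w = -eta * grad_E + alpha * delta_w_prev
    return w + delta_w, delta_w
```

The caller keeps the returned delta_w and passes it back as delta_w_prev on the next step, which is what carries the momentum from one update to the next.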
[1] Robbins, H. & Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, Vol. 22, pp. 400-407, 1951.
[2] Rumelhart, D. E., McClelland, J. L. & the PDP Research Group. Parallel Distributed Processing. MIT Press, 1986.
Suggested Readings:
Bishop, C. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK, 1995.
Samana Fatala, School of Engineering, Central Philippine University