Improving the Learning Rate of the Back Propagation Algorithm by the Aitken Process

Abstract
The Back Propagation (BP) algorithm is used for training Feed Forward Multilayer Neural Networks (FFMNN). However, this algorithm often takes a long time to converge, since it may fall into a local minimum, so training the network can be slow. A suitable choice of the learning rate helps to avoid the slow convergence of BP and reduces the learning time. In this paper we derive a new adaptive learning rate for the BP algorithm; the derivation is based on Aitken's process. The most important feature of our approach is that computing the learning rate requires only first-order derivatives, which makes it suitable for large training sets and large networks. Its efficiency is demonstrated on standard test problems, including the SPECT heart, XOR, and function approximation problems.


Introduction
Many methods have been proposed to speed up the learning phase and to optimize the learning process in Feed Forward Neural Networks (FFNN) [Shahla, et al. (1997), Steven & Narciso (1999)]. Some of them apply special techniques for the initialization of the weights [Nguyen & Widrow (1990)].
Most of them apply higher-order gradient optimization routines to minimize an appropriate error function [Amir, et al. (2005), Livieris & Pintelas (2008), Moller (1993), Rumelhart, et al. (1986)], a multivariable function that depends on the weights of the network. However, there is still the problem of accelerating the learning process, especially when large training sets and large networks are used. Neural network training can be formulated as the minimization of a non-linear unconstrained optimization problem [Livieris, et al. (2009)]. The energy or cost function to be minimized is defined in the usual way as the squared difference between the desired and the actual responses of the output neurons over all P training samples. Let us assume a multilayer FFNN with N input and M output neurons. The number of hidden layers may be arbitrary, as may the number of neurons in each layer. The supervised learning of this network is equivalent to the minimization of the energy (error) function, which can be written as follows:

E(w) = (1/2) Σ_{p=1}^{P} Σ_{i=1}^{M} (O_i^p − T_i^p)²   ……… (1)

The variables O_i and T_i stand for the actual and desired responses of the i-th output neuron, respectively. The superscript p denotes the particular learning pattern. The vector w is composed of all weights in the network.
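To make the objective concrete, the following short Python sketch evaluates this sum-of-squared-errors function for a small one-hidden-layer network; the 2-4-1 architecture, activation choices, and variable names are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def forward(w, X):
        """Forward pass of a small 2-4-1 network; w packs all weights and biases."""
        W1, b1, W2, b2 = w                            # hidden weights/biases, output weights/biases
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))      # logistic hidden layer
        return H @ W2 + b2                            # linear output layer

    def error(w, X, T):
        """Energy function E(w): squared difference between actual and desired
        outputs, summed over all P training patterns (equation (1))."""
        O = forward(w, X)
        return 0.5 * np.sum((O - T) ** 2)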
Summation of the errors takes place over all M output neurons and all P learning data (x, T), where the N-dimensional vector x is the input vector and the M-dimensional vector T is the target vector associated with x. Back Propagation (BP) is a learning procedure that adjusts the weight vector w through steepest descent with respect to E in weight space:

w_{k+1} = w_k + γ d_k ,   d_k = −∇E(w_k)   ……… (2)

where γ is the learning rate, which is a constant γ ∈ (0,1), and w_k is the vector of weights at iteration (epoch) step k. Although the procedure is widely used, it has several disadvantages. First, convergence is fast only if the parameter setting is nearly optimal. Furthermore, the convergence rate is slow (linear) and decreases rapidly as the problem size increases. Finally, convergence is guaranteed only if the learning rate γ is small enough [Kuan & Hornik (1991), Rumelhart, et al. (1986)]. The main problem then is to determine a priori what "small enough" means. In other words, for a shallow minimum the learning rate is often too small, whereas for a narrow minimum it is often too large and the procedure never converges; therefore the BP algorithm with a constant learning rate (called classical BP, i.e. CBP) tends to be inefficient [Rumelhart, et al. (1986)].
The remainder of this paper is organized as follows. Section 2 presents a brief summary of Aitken's process, and section 3 presents the proposed BP algorithm (denoted AIBP). Section 4 reports our experimental results, and section 5 presents our concluding remarks. We summarize the CBP algorithm as follows.

CBP Algorithm:
Step (1): Initialization: set the maximum number of epochs, k = 1, γ ∈ (0,1) (here γ = 0.01), the error goal eg, and the stopping criterion ε > 0. Choose w_k randomly and compute E(w_k) and d_k = −∇E(w_k).
Step (2): Check for convergence: if E(w_k) ≤ eg or the maximum number of epochs is reached, stop.
Step (3): Update the weight vector: w_{k+1} = w_k + γ d_k, and compute E(w_{k+1}) and d_{k+1}.
Step (4): Set k = k + 1 and go to Step (2).
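A minimal sketch of this CBP loop in Python is given below, assuming generic E and grad_E callables that return the error and its gradient; the names and the simple stopping test are illustrative, not the toolbox implementation used later in the experiments.

    import numpy as np

    def cbp(w, E, grad_E, gamma=0.01, eg=1e-3, max_epochs=1000):
        """Classical Back Propagation (CBP): steepest descent with a fixed
        learning rate gamma, following equation (2)."""
        k = 1
        while E(w) > eg and k <= max_epochs:   # Step (2): convergence check
            d = -grad_E(w)                     # search direction d_k = -grad E(w_k)
            w = w + gamma * d                  # Step (3): w_{k+1} = w_k + gamma * d_k
            k += 1                             # Step (4)
        return w, k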

Acceleration with Aitken's Δ² Process
When a sequence or an iterative process converges slowly, a convergence acceleration process has to be used. It consists in transforming the slowly converging sequence into a new one which, under some assumptions, converges faster to the same limit. Aitken's process is the most well-known sequence transformation; it has been proved to be able to accelerate the convergence of linearly converging sequences [Clade & Michela (2007)]. Let w* denote the limit of the sequence [Clade & Michela (2007)]. Suppose that the sequence {w_k}, k = 1, 2, 3, ..., generated by equation (2) converges linearly, so that it satisfies

w_{k+1} − w* ≈ µ (w_k − w*) ,   |µ| < 1   ……… (3)

For two successive iterates this gives

w_{k+1} − w* = µ (w_k − w*)   ……… (4)
w_{k+2} − w* = µ (w_{k+1} − w*)   ……… (5)

Solving equations (4) and (5) for w* while eliminating µ leads to

w* = (w_k w_{k+2} − w_{k+1}²) / (w_{k+2} − 2 w_{k+1} + w_k)   ……… (6)

In general, the original assumption (3) will not be exactly true; nevertheless it is expected that the sequence {ŵ_k}, defined by

ŵ_k = (w_k w_{k+2} − w_{k+1}²) / (w_{k+2} − 2 w_{k+1} + w_k)   ……… (7)

converges more rapidly to w* than the original sequence {w_k}. The point ŵ_k is a better approximation of w* than w_k or w_{k+1}. Formula (7) can be written [3] in the equivalent form

ŵ_k = w_k − (w_{k+1} − w_k)² / (w_{k+2} − 2 w_{k+1} + w_k) = w_k − (∆w_k)² / (∆²w_k)   ……… (8)

We see that formula (8) is suitable when the terms w_k are real or complex numbers; for a vector sequence, the scalar transformation can be applied separately to each component, or suitable modifications are made.
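As a concrete illustration, the short Python sketch below applies the componentwise form of formula (8) to three successive iterates; it is a generic Aitken Δ² routine with an assumed small-denominator guard, not code from the paper.

    import numpy as np

    def aitken(w0, w1, w2, eps=1e-12):
        """Componentwise Aitken delta-squared extrapolation (formula (8)):
        w_hat = w0 - (w1 - w0)^2 / (w2 - 2*w1 + w0)."""
        d1 = w1 - w0                                 # forward difference
        d2 = w2 - 2.0 * w1 + w0                      # second difference
        safe = np.where(np.abs(d2) > eps, d2, eps)   # avoid division by zero
        return w0 - d1 ** 2 / safe

    # Example: the linearly converging sequence x_{k+1} = 0.5*x_k + 1 (limit 2)
    # gives iterates 0, 1, 1.5 from x0 = 0; Aitken recovers the limit exactly.
    print(aitken(np.array([0.0]), np.array([1.0]), np.array([1.5])))   # -> [2.]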

Proposed Learning Method
Derivation of the Method
In this section we present a modified Back Propagation (AIBP) algorithm obtained by a simple multiplicative modification of the learning rate. The idea is to modify the steepest descent method by introducing a relaxation of the following form:

w_{k+1} = w_k + α_k γ d_k

where γ is the learning rate used in classical BP (a constant), α_k ∈ (0,1) is the relaxation parameter, and d_k = −g_k = −∇E(w_k). To derive the value of α_k, assume that the iterates generated by (2) satisfy the linear-convergence assumption (3); then combining equations (2) and (8) yields equation (14). We see from equation (12) that to compute ŵ_k we need only two points, namely w_k and w_{k+1}. Therefore our algorithm can be stated as: given w_k, compute w_{k+1} using CBP; then the subsequent points are computed from the relaxed update above. To ensure the descent property of the direction d_k we choose α_k ∈ (0,1). We summarize our suggested algorithm (AIBP) as:

AIBP Algorithm:
Step (1): Initialization: set the maximum number of epochs, k = 1, γ ∈ (0,1), and the error goal eg. Choose w_k randomly, compute d_k = −∇E(w_k), and obtain w_{k+1} by one CBP step.
Step (2): Check for convergence: if E(w_k) ≤ eg or the maximum number of epochs is reached, stop.
Step (3): Compute the relaxation parameter α_k from equation (12) and update w_{k+1} = w_k + α_k γ d_k; compute E(w_{k+1}) and d_{k+1}.
Step (4): Set k = k + 1 and go to Step (2).
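Since the exact expressions of equations (12) and (14) are not reproduced above, the following Python sketch only illustrates one plausible reading of the scheme: a CBP step produces w_{k+1}, the componentwise Aitken formula (8) applied to the CBP iterates reduces to an expression in the two gradients (so only first-order information from two points is needed), and the result is recast as a relaxed update with α_k clipped to (0,1]. The function names, the componentwise treatment, and the clipping are assumptions for illustration, not the paper's exact AIBP formulas.

    import numpy as np

    def aibp_step(w, grad_E, gamma=0.01, eps=1e-12):
        """One illustrative AIBP-style step: Aitken's formula (8) applied
        componentwise to the CBP iterates, recast as the relaxed update
        w_{k+1} = w_k + alpha_k * gamma * d_k with alpha_k in (0,1]."""
        d0 = -grad_E(w)                         # d_k     = -grad E(w_k)
        w1 = w + gamma * d0                     # one CBP step (equation (2))
        d1 = -grad_E(w1)                        # d_{k+1} = -grad E(w_{k+1})
        # With w_{k+2} implied by a further CBP step, Delta w_k = gamma*d0 and
        # Delta^2 w_k = gamma*(d1 - d0), so formula (8) gives
        # w_hat = w + [d0/(d0 - d1)] * gamma * d0, i.e. alpha_k = d0/(d0 - d1).
        denom = d0 - d1
        denom = np.where(np.abs(denom) > eps, denom, eps)
        alpha = np.clip(d0 / denom, eps, 1.0)   # componentwise relaxation alpha_k
        return w + alpha * gamma * d0           # relaxed update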

Convergence Analysis:
In general there is no algorithm that is convergent in all cases; therefore, in the convergence analysis of algorithms, some mild assumptions are usually made. Under the following assumptions we show that our algorithm (AIBP) is globally convergent:
1. The error function E is bounded below on the level set S = {w : E(w) ≤ E(w_1)}.
2. E is a convex function on the convex set S, and α_k ∈ (0,1) for all k.
The following theorem (which is similar to the one given in [Andrei (2005)]) gives the convergence of the AIBP algorithm.
Then by Taylor's theorem we can expand E(w_k + α γ d_k) around w_k. Since E is assumed to be convex, it follows that Φ(α) = E(w_k + α γ d_k) is a convex function, while the resulting decrease bound is concave on (0, 2/(Mγ)) and attains its maximum value at α = 1/(Mγ). The remainder of the proof is similar to Theorem 2 in [Andrei (2005)] and is hence omitted. We conclude that the AIBP algorithm is globally convergent under the above assumptions.

Experiments and Results:
A computer simulation has been developed to study the performance of the learning algorithms. The simulations have been carried out using MATLAB (7.6). The performance of AIBP has been evaluated and compared with the batch versions of the classical Back Propagation (CBP), known as traingd in the Neural Network Toolbox (see appendix), and of the adaptive BP (ABP), traingda. Toolbox default values for the heuristic parameters of the above algorithms are used unless stated otherwise. The algorithms were tested using the same initial weights, initialized by the Nguyen-Widrow method [Nguyen & Widrow (1990)], and received the same sequence of input patterns. The weights of the network are updated only after the entire set of patterns to be learned has been presented.
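For readers unfamiliar with the Nguyen-Widrow scheme, the following Python sketch shows its usual form for a single hidden layer (random weights rescaled so that the hidden units' active regions cover the input space); the exact variant used by the MATLAB toolbox may differ in details.

    import numpy as np

    def nguyen_widrow_init(n_inputs, n_hidden, rng=np.random.default_rng()):
        """Nguyen-Widrow initialization for one hidden layer: draw random weights,
        then rescale each hidden unit so its active region spans the input space."""
        beta = 0.7 * n_hidden ** (1.0 / n_inputs)                  # scale factor
        W = rng.uniform(-0.5, 0.5, size=(n_hidden, n_inputs))
        W = beta * W / np.linalg.norm(W, axis=1, keepdims=True)    # rescale rows
        b = rng.uniform(-beta, beta, size=n_hidden)                # spread the biases
        return W, b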
For each of the test problems, a table summarizing the performance of the algorithms over the simulations that reached a solution is presented. The reported parameters are Min, the minimum number of epochs; Mean, the mean number of epochs; Max, the maximum number of epochs; Tav, the average total time; and Succ, the number of successful simulations out of 50 trials within the error function evaluation limit. If an algorithm fails to converge within the above limit, it is considered to have failed to train the FFNN, and its epochs are not included in the statistical analysis of the algorithm. One gradient evaluation and one error function evaluation are necessary at each epoch.
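A small Python helper of the following kind can reproduce these summary statistics from a list of (epochs, time, converged) records; the record format is simply the convention described above and is not taken from the paper.

    import numpy as np

    def summarize(trials):
        """Summarize training trials; each trial is (epochs, seconds, converged).
        Failed trials count only toward the success rate and are excluded from
        the epoch and time statistics."""
        ok = [(e, t) for e, t, c in trials if c]
        if not ok:
            return {"Succ": "0/%d" % len(trials)}
        epochs = np.array([e for e, _ in ok])
        times = np.array([t for _, t in ok])
        return {
            "Min": int(epochs.min()),
            "Mean": float(epochs.mean()),
            "Max": int(epochs.max()),
            "Tav": float(times.mean()),
            "Succ": "%d/%d" % (len(ok), len(trials)),
        }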

Problem (1): SPECT Heart Problem
This data set contains data instances derived from cardiac Single Photon Emission Computed Tomography (SPECT) images from the University of Colorado [Livieris, et al.]. Comparative results are shown in table (1).

Problem (2): Continuous Function Approximation:
The second test problem we consider is the approximation of a continuous trigonometric function f(x). The network architecture for this problem is a 1-15-1 FFNN (thirty weights, sixteen biases) trained to approximate the function f(x), where x ∈ [-π, π]; the network is trained until the sum of the squares of the errors becomes less than the error goal 0.001. The network is based on hidden neurons with logistic activations and biases and on a linear output neuron with bias. Comparative results are shown in table (2). From table (2), we conclude that the algorithm AIBP is the best algorithm with respect to the number of succeeded simulations, the number of epochs, and the time.

Problem (3): XOR Problem
The last problem we consider is the XOR Boolean function problem, which is regarded as a classical problem for FFNN training. The XOR function maps two binary inputs to a single binary output. As is well known, this function is not linearly separable. The network architecture for this binary classification problem consists of one hidden layer with 3 neurons and an output layer of one neuron. The termination criterion is set to E ≤ 0.001 within the limit of 1000 epochs. Table (3) summarizes the results of all algorithms over 50 simulations: the minimum number of epochs for each algorithm is listed in the first column (Min), the maximum number of epochs in the second column (Max), the third column contains the mean number of epochs (Mean), Tav is the average time over the 50 simulations, and the last column contains the percentage of successful runs of each algorithm in the 50 simulations. From table (3), we conclude that the algorithm AIBP is the best algorithm with respect to the number of succeeded simulations, the number of epochs, and the time.
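To make the setup concrete, the sketch below trains an illustrative 2-3-1 network on XOR with plain batch BP under the E ≤ 0.001 / 1000-epoch criterion; the initialization range, learning rate, and logistic output choice are assumptions, since the paper relies on the MATLAB toolbox for these details.

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    T = np.array([[0], [1], [1], [0]], dtype=float)       # XOR targets

    # 2-3-1 network: one hidden layer with 3 logistic neurons, one logistic output.
    W1 = rng.uniform(-0.5, 0.5, (2, 3)); b1 = np.zeros(3)
    W2 = rng.uniform(-0.5, 0.5, (3, 1)); b2 = np.zeros(1)

    gamma = 0.5                                            # illustrative learning rate
    for epoch in range(1, 1001):                           # 1000-epoch limit
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))           # hidden activations
        O = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))           # output activation
        E = 0.5 * np.sum((O - T) ** 2)                     # sum-of-squared errors
        if E <= 0.001:                                     # termination criterion
            break
        dO = (O - T) * O * (1 - O)                         # output-layer delta
        dH = (dO @ W2.T) * H * (1 - H)                     # hidden-layer delta
        W2 -= gamma * H.T @ dO; b2 -= gamma * dO.sum(0)    # batch weight updates
        W1 -= gamma * X.T @ dH; b1 -= gamma * dH.sum(0)
    print(epoch, E)                                        # epochs used and final error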