Conjugate Gradient Algorithm Based on Meyer Method for Training Artificial Neural Networks

The difference between the desired output and the actual output of a multilayer feed-forward neural network produces an error value that can be expressed as a function of the network weights. Training the network therefore becomes an optimization problem: minimizing the error function. This paper suggests a new formula for computing the learning rate, based on Meyer's formula, to modify the conjugate gradient algorithm (MCG) for training the FFNN. This method typically accelerates the Fletcher-Reeves (FRCG) and Polak-Ribière (PRCG) methods when applied to three different types of problems well known in artificial neural networks (namely, the XOR problem, function approximation, and the Monk1 problem) over 100 simulations.


Introduction
Learning systems such as multilayer feed-forward neural networks (MLFFNN) are parallel computational models composed of densely interconnected, adaptive processing units, characterized by an inherent propensity for learning from experience and for discovering new knowledge. Due to their excellent capability of self-learning and self-adapting, they have been successfully applied in many areas of artificial intelligence [Bishop (1995), Haykin (1994), Hmich, et al. (2011), Takeuchi, et al. (2003) and Wu, et al. (1995)] and are often found to be more efficient and accurate than other classification techniques [Lerner, et al. (1999)].
The operation of a feed-forward neural network (FNN) is usually based on the equations

$net_j^l = \sum_i w_{j,i}^l x_i^{l-1} + b_j^l$ ,   $x_j^l = f(net_j^l)$ ,

where $net_j^l$ is the sum of the weighted inputs for the j-th node in the l-th layer ($j = 1, 2, \ldots, N_l$), $w_{j,i}^l$ is the weight from the i-th neuron at the (l-1)-th layer to the j-th neuron at the l-th layer, $b_j^l$ is the bias of the j-th neuron at the l-th layer, and $x_j^l$ is the output of the j-th neuron belonging to the l-th layer. The problem of training a neural network is to iteratively adjust its weights in order to minimize the difference between the actual output of the network and the desired output of the training set [Rumelhart, et al. (1986)]. Finding such a minimum is equivalent to finding an optimal minimizer of the error function defined by

$E(w) = \frac{1}{2} \sum_j \sum_i (T_i - O_i)^2$ ,

where $T_i$ and $O_i$ are the desired and the actual output of the i-th neuron, respectively, the index j denotes the particular learning pattern, and the vector $w$ is composed of all weights in the net.
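To make the notation above concrete, the following minimal Python/NumPy sketch (not taken from the paper; the logistic activation, layer sizes and sample data are illustrative assumptions) computes the forward pass $x_j^l = f(net_j^l)$ and the sum-of-squares error $E(w)$.

```python
import numpy as np

def logistic(z):
    # Logistic activation f(net) = 1 / (1 + exp(-net))
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    # Propagate an input vector through every layer:
    # net^l = W^l x^{l-1} + b^l,  x^l = f(net^l)
    for W, b in zip(weights, biases):
        x = logistic(W @ x + b)
    return x

def sse(weights, biases, patterns, targets):
    # Sum-of-squares error over all training patterns j:
    # E(w) = 1/2 * sum_j sum_i (T_i - O_i)^2
    return 0.5 * sum(
        np.sum((t - forward(p, weights, biases)) ** 2)
        for p, t in zip(patterns, targets)
    )

# Illustrative 2-3-1 network with random weights (shapes chosen only for demonstration)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [rng.standard_normal(3), rng.standard_normal(1)]
patterns = [np.array([0.0, 1.0]), np.array([1.0, 1.0])]
targets = [np.array([1.0]), np.array([0.0])]
print("E(w) =", sse(weights, biases, patterns, targets))
```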
The back-propagation (BP) algorithm is the most widely used algorithm for training multilayer feed-forward neural networks. The standard back-propagation algorithm adjusts the weight vector $w$ using steepest descent with respect to $E$, such that

$w_{k+1} = w_k - \eta \, \nabla E(w_k)$ ,

where the constant $\eta$ is the learning rate, belonging to the interval (0,1), and $w_k$ is a vector representing the weights at iteration (epoch) step $k$. Since the steepest descent method has a slow convergence rate and the search for the global minimum often becomes trapped at a poor local minimum, the back-propagation algorithm takes an unendurable time to adapt the weights between the units in the network. For this reason many researchers have proposed improvements to this algorithm; several are based on new adaptive learning rates [Abbo & Tatal (2011), Abbo & Mohammed (2013), Jarmo, et al. (2003), Johansson, et al. (1990), Kostopoulos, et al. (2004), Plagianakos, et al. (1998), Sabeur & Farhat (2008), Zulhadi, et al. (2010)]. Others introduce a momentum term [Daniel, et al. (1997), Huajin, et al. (2011)], use alternative cost functions, or dynamically adapt the learning parameters [Shahla, et al. (1997), Steven & Narciso (1999)]. Many apply special techniques for the initialization of the weights [Nguyen & Widrow (1990)]. Most of them apply higher-order gradient optimization routines to minimize the error function appropriately [Amir, et al. (2005), Livieris, et al. (2009)].
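As a minimal illustration of the steepest-descent update $w_{k+1} = w_k - \eta \nabla E(w_k)$, the sketch below applies it to a toy error surface; the finite-difference gradient and all parameter values are illustrative stand-ins, not the paper's setup.

```python
import numpy as np

def numerical_gradient(E, w, eps=1e-6):
    # Central-difference approximation of the gradient of E at w
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (E(w + e) - E(w - e)) / (2.0 * eps)
    return g

def steepest_descent(E, w, eta=0.1, epochs=100):
    # Standard back-propagation style update: w <- w - eta * grad E(w)
    for _ in range(epochs):
        w = w - eta * numerical_gradient(E, w)
    return w

# Toy error surface standing in for the network error function
E = lambda w: np.sum((w - np.array([1.0, -2.0])) ** 2)
print(steepest_descent(E, np.zeros(2)))
```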
This paper is organized as follows. Section 2 gives a short description of conjugate gradient algorithms. Section 3 presents the proposed algorithm (the MCG algorithm). Section 4 reports our experimental results, in which the proposed method is compared with the FRCG and PRCG methods on three different types of problems.

Conjugate Gradient Methods
Conjugate gradient (CG) methods [Livieris & Pintelas (2008)] are among the most commonly used and efficient methods for large-scale optimization problems due to their speed and simplicity. In general, conjugate gradient methods play an important role in efficiently training neural networks because of their simplicity and very low memory requirements, since they require neither the evaluation of the Hessian matrix nor the impractical storage of an approximation of it. In the literature there is a variety of conjugate gradient methods [Birgin & Martinez (1999), Moller (1993), Livieris & Pintelas (2011) and Abbo (2010)] that have been intensively used for neural network training in several applications [Daniel, et al. (1997) and Zoutendijk (1970)].
The main idea for determining the search direction is a linear combination of the negative gradient vector at the current iteration with the previous search direction. The search direction can be expressed as

$d_{k+1} = -g_{k+1} + \beta_k d_k$ , with $d_0 = -g_0$ ,  (4)

where $g_k = \nabla E(w_k)$. Conjugate gradient methods differ in their way of defining the multiplier $\beta_k$. The most famous approaches were proposed by Fletcher-Reeves (FR), Polak-Ribière (PR) and Hestenes-Stiefel (HS):

$\beta_k^{FR} = \dfrac{\|g_{k+1}\|^2}{\|g_k\|^2}$ ,  $\beta_k^{PR} = \dfrac{g_{k+1}^T y_k}{\|g_k\|^2}$ ,  $\beta_k^{HS} = \dfrac{g_{k+1}^T y_k}{d_k^T y_k}$ ,  where $y_k = g_{k+1} - g_k$ .  (5)

The conjugate gradient method using the FR update was shown to be globally convergent [AL-Baali (1999)]. However, the corresponding methods using the PR or HS update are generally more efficient, even without satisfying the global convergence property. In the convergence analysis and implementation of CG methods, one often requires an inexact line search, such as the Wolfe line search. The standard Wolfe line search requires a step length $\alpha_k$ satisfying

$E(w_k + \alpha_k d_k) \le E(w_k) + c_1 \alpha_k \, g_k^T d_k$ ,  (6)
$g(w_k + \alpha_k d_k)^T d_k \ge c_2 \, g_k^T d_k$ ,  (7)

with $0 < c_1 < c_2 < 1$. Moreover, an important issue of CG algorithms is that when the search direction (4) fails to be a descent direction (by descent we mean $g_k^T d_k < 0$), we restart the algorithm using the negative gradient direction to guarantee convergence. A more sophisticated and popular restarting criterion is the Powell restart

$|g_{k+1}^T g_k| \ge 0.2 \, \|g_{k+1}\|^2$ ,
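The following sketch (an illustration, not the paper's code; all parameter values are assumptions) collects the direction update (4), the three multipliers in (5), the descent test, and the Wolfe conditions (6)-(7) in Python/NumPy form.

```python
import numpy as np

def beta_fr(g_new, g_old, d_old):
    # Fletcher-Reeves: ||g_{k+1}||^2 / ||g_k||^2
    return float(g_new @ g_new) / float(g_old @ g_old)

def beta_pr(g_new, g_old, d_old):
    # Polak-Ribiere: g_{k+1}^T (g_{k+1} - g_k) / ||g_k||^2
    return float(g_new @ (g_new - g_old)) / float(g_old @ g_old)

def beta_hs(g_new, g_old, d_old):
    # Hestenes-Stiefel: g_{k+1}^T (g_{k+1} - g_k) / d_k^T (g_{k+1} - g_k)
    y = g_new - g_old
    return float(g_new @ y) / float(d_old @ y)

def cg_direction(g_new, g_old, d_old, beta_rule=beta_fr):
    # d_{k+1} = -g_{k+1} + beta_k d_k ; restart with -g if the descent test fails
    d = -g_new + beta_rule(g_new, g_old, d_old) * d_old
    if g_new @ d >= 0.0:          # descent requires g^T d < 0
        d = -g_new                # restart along the negative gradient
    return d

def wolfe_conditions(E, grad, w, d, alpha, c1=1e-4, c2=0.5):
    # Standard Wolfe conditions: sufficient decrease (6) and curvature (7)
    sufficient = E(w + alpha * d) <= E(w) + c1 * alpha * float(grad(w) @ d)
    curvature = float(grad(w + alpha * d) @ d) >= c2 * float(grad(w) @ d)
    return sufficient and curvature

# Tiny demonstration on a quadratic E(w) = 0.5 w^T A w
A = np.array([[3.0, 0.2], [0.2, 1.0]])
E = lambda w: 0.5 * float(w @ A @ w)
grad = lambda w: A @ w
w = np.array([1.0, 1.0]); g_old = grad(w); d = -g_old
w = w + 0.1 * d; g_new = grad(w)
d_new = cg_direction(g_new, g_old, d, beta_pr)
print(d_new, wolfe_conditions(E, grad, w, d_new, 0.3))
```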
where $\|\cdot\|$ denotes the Euclidean norm. Another important issue for CG methods is that the search directions generated from equation (4) are conjugate if the objective function is convex and the line search is exact, i.e.

$d_i^T G \, d_j = 0$ for $i \neq j$ ,  (10)

where $G$ is the Hessian matrix of the objective function. Dai and Liao [Dai and Liao (2001)] showed that equation (10) can be written as

$d_{k+1}^T y_k = 0$ ,  (11)

which is called the pure conjugacy condition, and generalized it to

$d_{k+1}^T y_k = -t \, g_{k+1}^T s_k$ , with $t \ge 0$ and $s_k = w_{k+1} - w_k$ ,  (12)

for a general objective function with an inexact line search.
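A small numerical check of the conjugacy condition (10): running the recurrence (4) with exact line searches on a convex quadratic (the matrix below is an arbitrary example) should make the off-diagonal entries of $D^T G D$ vanish, as this sketch illustrates.

```python
import numpy as np

# On a convex quadratic E(w) = 0.5 w^T G w - b^T w with EXACT line search,
# the directions produced by (4) are mutually G-conjugate: d_i^T G d_j = 0 (i != j).
G = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
b = np.array([1.0, -2.0, 0.5])

w = np.zeros(3)
g = G @ w - b                          # gradient of the quadratic
d = -g
directions = []
for _ in range(3):
    alpha = -(g @ d) / (d @ (G @ d))   # exact line search step for a quadratic
    w = w + alpha * d
    g_new = G @ w - b
    beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves multiplier
    directions.append(d)
    d = -g_new + beta * d
    g = g_new

# All off-diagonal entries of D^T G D should be (numerically) zero
D = np.array(directions).T
print(np.round(D.T @ G @ D, 8))
```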

Suggested Conjugate Gradient Algorithm:
When a sequence or an iterative process is slowly converging, a convergence acceleration process has to be used; Aitken's process is the most well-known convergence acceleration for linearly converging sequences. Abbo and Mohammed in [Abbo & Mohammed (2012)] suggested a new conjugate gradient (NACG) algorithm to train neural networks based on Aitken's process, which guarantees sufficient descent with the Wolfe line search. This algorithm is summarized as follows:
Step 1: Initialize the weight vector $w_0$ and set $k = 0$.
Step 2: Calculate the error function value $E_k$ and its gradient $g_k$.
Step 3: If $\|g_k\| \le \varepsilon$, return "goal is met" and stop.
Step 4: Compute the multiplier $\beta_k$ and then compute the search direction $d_k = -g_k + \beta_k d_{k-1}$ (with $d_0 = -g_0$).
Step 5: Compute the learning rate $\alpha_k$ by a line search procedure satisfying the standard Wolfe conditions (6) and (7).
Step 6: Update the weights $w_{k+1} = w_k + \alpha_k d_k$. If the limit of epochs is reached, return "error goal not met" and stop; else set $k = k + 1$ and go to Step 2.
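Step 5 relies on a line search satisfying the Wolfe conditions. The sketch below shows one simple backtracking variant that shrinks a trial step until both conditions (6) and (7) hold; the constants $c_1$, $c_2$, the initial step and the shrink factor are illustrative choices, not values prescribed by the paper, and a plain backtracking loop is not guaranteed to satisfy the curvature condition in every case.

```python
import numpy as np

def wolfe_backtracking(E, grad, w, d, alpha0=1.0, c1=1e-4, c2=0.5,
                       shrink=0.5, max_trials=30):
    """Return a step length satisfying the standard Wolfe conditions,
    or the last trial value if none is found within max_trials."""
    E0, g0d = E(w), float(grad(w) @ d)
    alpha = alpha0
    for _ in range(max_trials):
        sufficient = E(w + alpha * d) <= E0 + c1 * alpha * g0d        # condition (6)
        curvature = float(grad(w + alpha * d) @ d) >= c2 * g0d        # condition (7)
        if sufficient and curvature:
            return alpha
        alpha *= shrink                                               # try a smaller step
    return alpha

# Example on a simple quadratic
E = lambda w: float(w @ w)
grad = lambda w: 2.0 * w
w = np.array([2.0, -1.0]); d = -grad(w)
print(wolfe_backtracking(E, grad, w, d))
```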

Accelerated with Meyer Process
A convergence acceleration process transforms a slowly converging sequence into a new one which, under some assumptions, converges faster to the same limit [Clade & Michela (2007)].
In mathematics and computer science there exists a large number of iterative algorithms whose goal is usually to reach a solution to a problem within a certain tolerance and within a given number of iterations. Iterating means going over a pattern of steps and procedures that can sometimes be complex and can take a substantial amount of time, even for fast modern computers [David and William (1982), Meyers A., and Mathews & Fink (1999)]. As we know, Aitken's process is the most well-known convergence acceleration for linearly converging sequences, and there are three equivalent forms of this method:

$\hat{x}_n = x_{n+2} - \dfrac{(x_{n+2} - x_{n+1})^2}{x_{n+2} - 2x_{n+1} + x_n}$ ,  (13a)

$\hat{x}_n = x_n - \dfrac{(x_{n+1} - x_n)^2}{x_{n+2} - 2x_{n+1} + x_n}$ ,  (13b)

$\hat{x}_n = \dfrac{x_n x_{n+2} - x_{n+1}^2}{x_{n+2} - 2x_{n+1} + x_n}$ .  (13c)

To obtain an iterative version, let us rewrite (13b) by replacing the three successive iterates with the two most recent accelerated values $y_{n-1}$, $y_n$ and one new evaluation $f(y_n)$, where $y_0$ is the initial guess, chosen sometimes randomly or with a rough estimate depending on the problem one is trying to solve, and $y_1 = f(y_0)$ is the first iteration of the original sequence. This shows the Aitken formula in another form:

$y_{n+1} = y_{n-1} - \dfrac{(y_n - y_{n-1})^2}{f(y_n) - 2y_n + y_{n-1}}$ .  (14)

Applying (14) to $y_0$ and $y_1$ determines $y_2$, and in general Meyer's iterative Aitken formula is obtained by repeating (14) for $n = 1, 2, \ldots$. It is very important to realize here that it has not been demonstrated that the sequence $\{y_n\}_{n \in \mathbb{N}}$ is convergent, nor that it respects the necessary conditions to be accelerated by the Aitken formula. Although in practice Meyer's formula works really well and its convergence is in many cases very impressive, it might be necessary to verify that it applies to the problem at hand. Nevertheless, the advantage of this method is that it does not require storing all previous values of the sequence and it does not require extra evaluations of the function $f$.
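The sketch below illustrates the acceleration idea on a linearly converging fixed-point iteration $y_{n+1} = f(y_n)$. The meyer_accelerate routine applies formula (14) to the two most recent accelerated values together with a single new evaluation of $f$; it is a reconstruction of the scheme described above (not code from the paper), and the cosine fixed-point problem is an arbitrary test case.

```python
import numpy as np

def aitken(x0, x1, x2):
    # Classical Aitken delta-squared formula (13b): combines three successive
    # terms of a linearly converging sequence into a better estimate.
    denom = x2 - 2.0 * x1 + x0
    return x2 if abs(denom) < 1e-15 else x0 - (x1 - x0) ** 2 / denom

def meyer_accelerate(f, y0, n_iter=10):
    # Reconstruction of the iterative (Meyer-style) scheme described above:
    # keep only the two latest accelerated values and one new call to f per step.
    y_prev, y_curr = y0, f(y0)            # y_1 = f(y_0)
    for _ in range(n_iter):
        y_next = aitken(y_prev, y_curr, f(y_curr))
        y_prev, y_curr = y_curr, y_next
    return y_curr

# Example: fixed point of f(y) = cos(y), approximately 0.739085
f = np.cos
y = 0.5
for _ in range(5):
    y = f(y)                              # plain fixed-point iteration
print("plain      :", y)
print("accelerated:", meyer_accelerate(f, 0.5, n_iter=5))
```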

Derivation of The Proposed Learning Method
One manner of escaping the disadvantages discussed above is the Meyer method. In this section we present a new CG algorithm (MCG), obtained by a simple multiplicative modification of the learning rate. The idea is to modify the learning rate by applying Meyer's iterative Aitken formula (14) to the sequence of learning rates, which yields the update formulas (15a) and (15b).

MCG Algorithm
Step 1: Initialize the weight vector $w_0$ and set $k = 0$.
Step 2: Calculate the error function value $E_k$ and its gradient $g_k$.
Step 3: If $\|g_k\| \le \varepsilon$, return "goal is met" and stop.
Step 4: Compute the multiplier $\beta_k$ and then compute the search direction $d_k = -g_k + \beta_k d_{k-1}$ (with $d_0 = -g_0$).
Step 5: Compute the learning rate $\alpha_k$ using equations (15a) and (15b), then use backtracking to satisfy the standard Wolfe conditions (6) and (7).
Step 6: Update the weights $w_{k+1} = w_k + \alpha_k d_k$. If the limit of epochs is reached, return "error goal not met" and stop; else set $k = k + 1$ and go to Step 2.
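For illustration only, the sketch below assembles the MCG steps for a generic error function $E$ and gradient, using the Fletcher-Reeves multiplier and a backtracking line search. Since equations (15a,b) are not reproduced in this text, the Meyer-style acceleration of the learning-rate sequence inside the loop is an assumed stand-in for the paper's formula, not the authors' exact rule.

```python
import numpy as np

def mcg_train(E, grad, w0, eps=1e-3, max_epochs=1000):
    """Sketch of the MCG iteration: CG direction plus line search, with the
    accepted learning rates accelerated by an Aitken/Meyer-style formula.
    The acceleration step below is an illustrative stand-in for (15a,b)."""
    w, g = w0.copy(), grad(w0)
    d = -g
    alphas = []                                    # history of accepted steps
    for epoch in range(max_epochs):
        if np.linalg.norm(g) <= eps:               # Step 3: goal is met
            return w, epoch
        # Step 5: backtracking until the sufficient-decrease condition holds
        alpha, E0, gTd = 1.0, E(w), float(g @ d)
        while E(w + alpha * d) > E0 + 1e-4 * alpha * gTd and alpha > 1e-12:
            alpha *= 0.5
        # Meyer-style modification of the learning rate (stand-in for (15a,b)):
        # accelerate the sequence of accepted steps and keep it only if it helps.
        alphas.append(alpha)
        if len(alphas) >= 3:
            a0, a1, a2 = alphas[-3:]
            denom = a2 - 2.0 * a1 + a0
            if abs(denom) > 1e-15:
                cand = a0 - (a1 - a0) ** 2 / denom
                if cand > 0 and E(w + cand * d) < E(w + alpha * d):
                    alpha = cand
        # Step 6: update the weights, then build the next CG direction
        w_new = w + alpha * d
        g_new = grad(w_new)
        beta = float(g_new @ g_new) / float(g @ g)  # Fletcher-Reeves choice
        d = -g_new + beta * d
        if float(g_new @ d) >= 0.0:                 # restart if not a descent direction
            d = -g_new
        w, g = w_new, g_new
    return w, max_epochs                            # error goal not met

# Toy usage on a convex quadratic
A = np.diag([1.0, 5.0, 20.0])
E = lambda w: 0.5 * float(w @ A @ w)
grad = lambda w: A @ w
w_star, epochs = mcg_train(E, grad, np.array([1.0, 1.0, 1.0]))
print(epochs, w_star)
```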

4. Experimental Results
In this section we present experimental results in order to study and evaluate the performance of our proposed conjugate gradient algorithm MCG on three classical artificial intelligence problems (the XOR problem, continuous function approximation, and the Monk1 problem).
In particular, we investigate the performance of the conjugate gradient methods with the Fletcher-Reeves update (FRCG) and the Polak-Ribière update (PRCG) (equation (5)), and compare them with our method MCG. All conjugate gradient methods have been implemented with the Wolfe line search conditions (6) and (7). The implementation has been carried out using Matlab (2007a) and the Matlab Neural Network Toolbox.
For each test problem, we present a table summarizing the performance of the algorithms over (100) simulations that reached a solution within a predetermined limit of epochs. The parameters used in all tables are as follows: Min, the minimum number of epochs; Mean, the mean number of epochs; Max, the maximum number of epochs; Tav, the average total time; FcEv, the number of function evaluations; and Succ, the number of simulations (out of 100 trials) that succeeded within the predetermined error limit.
It is worth mentioning that, for each algorithm, the networks received the same input samples and the same initial weights, and if an algorithm fails to converge within the above limit it is considered to have failed to train the FFNN, and its epochs are not included in the statistical analysis of the algorithm.

Problem (1): XOR Problem
The XOR problem is considered one of the well-known test functions for training neural networks. This function maps two binary inputs to a single binary output and, as is well known, it is not linearly separable. The network architecture for this binary classification problem consists of two hidden layers with 2 and 3 neurons, respectively, and one neuron in the output layer. The termination criterion is set to $E \le 0.001$ within the limit of 1000 epochs.
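A minimal sketch of the corresponding training setup follows; the initialization range is an illustrative assumption, and only the data and the 2-2-3-1 weight and bias shapes described above are taken from the text.

```python
import numpy as np

# XOR truth table: two binary inputs mapped to a single binary output
patterns = [np.array(p, dtype=float) for p in ((0, 0), (0, 1), (1, 0), (1, 1))]
targets  = [np.array([t], dtype=float) for t in (0, 1, 1, 0)]

# 2-2-3-1 architecture: two hidden layers with 2 and 3 neurons, one output neuron
layer_sizes = [2, 2, 3, 1]
rng = np.random.default_rng(1)
weights = [rng.uniform(-0.5, 0.5, (n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases  = [rng.uniform(-0.5, 0.5, n_out) for n_out in layer_sizes[1:]]

# Training proceeds until E(w) <= 0.001 or 1000 epochs are reached (see text)
print([W.shape for W in weights], [b.shape for b in biases])
```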
Table (1) summarizes the average performance of the presented algorithms for the XOR problem. Clearly, the FRCG and PRCG algorithms exhibit an excellent probability (93%) of successful training for the network architecture. Thus, computational cost is probably the most appropriate indicator for measuring the efficiency of the algorithms. The FRCG algorithm exhibits the best performance, since it reports the least average number of epochs to converge, the least time, and the fewest function evaluations.

Table (1): Comparative results of the XOR problem with fixed initial weights.

Problem (2): Continuous Function Approximation
The second test problem is continuous function approximation. We consider the approximation of a continuous trigonometric function $f(x)$, where $x \in [-\pi, \pi]$. The network architecture for this problem is a 1-15-1 FNN (thirty weights, sixteen biases), and the network is trained until the sum of the squares of the errors becomes less than the error goal 0.001, within the limit of 2000 epochs. The activation function of the hidden neurons is the logistic function with biases, and the output neuron uses a linear function with a bias.
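The data and network shapes for this problem can be set up as in the sketch below; since the exact trigonometric target is not reproduced in this text, the function used here is only a hypothetical stand-in, and the number of training samples is an assumption.

```python
import numpy as np

# The target is some continuous trigonometric function on [-pi, pi]; the exact
# expression is not reproduced here, so `target` below is only a stand-in.
target = np.sin                                           # hypothetical placeholder

x_train = np.linspace(-np.pi, np.pi, 50).reshape(-1, 1)   # sample count assumed
y_train = target(x_train)

# 1-15-1 network: 15 logistic hidden neurons with biases, linear output with bias
# (1*15 + 15*1 = 30 weights, 15 + 1 = 16 biases, as stated in the text)
rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((15, 1)), rng.standard_normal(15)
W2, b2 = rng.standard_normal((1, 15)), rng.standard_normal(1)

def predict(x):
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))              # logistic hidden layer
    return W2 @ h + b2                                    # linear output neuron

print(predict(x_train[0]))
```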
Table (2) presents the performance comparison of the algorithms FRCG, PRCG and MCG for the continuous function approximation problem. All algorithms exhibit an excellent probability (100%) of successful training for the network architecture using the same initial weights. Thus, computational cost is probably the most appropriate indicator for measuring the efficiency of the algorithms. The new method (MCG) improves on the results of FRCG and PRCG, since MCG significantly outperforms both algorithms in terms of the average number of epochs and the number of function evaluations.

Table (2): Comparative results of the function approximation problem with fixed initial weights.

Graphically, Figure (2) presents the performance profiles of our proposed conjugate gradient algorithm (MCG) together with FRCG and PRCG for the function approximation problem. The figure shows that our proposed method (MCG) is the best algorithm with respect to the number of epochs compared with the other algorithms.

Problem (3): Monk1 Problem
The Monk1 problem [Thrun, et al. (1991)] belongs to a collection of three binary classification problems based on an artificial robot domain, in which robots are described by six different attributes. These benchmarks consist of a numeric base of examples and a set of symbolic rules.
Monk1 consists of 124 patterns that were selected randomly from the data set for training, while the remaining 308 were used for generalization testing. Table (3) summarizes the average performance of the presented algorithms for the Monk1 problem. All algorithms exhibit an excellent probability (100%) of successful training for the network architecture when using the same initial weights. Thus, computational cost is probably the most appropriate indicator for measuring the efficiency of the algorithms.
In general, the MCG algorithm exhibits the best performance, since it reports the least average number of epochs to converge and demonstrates the highest success rate when training a feed-forward neural network; i.e., MCG is the best.




Figure (3) presents the performance profiles of our proposed algorithm (MCG) together with FRCG and PRCG for the Monk1 problem. The figure shows that our proposed method (MCG) is the best algorithm with respect to the number of epochs compared with the other algorithms.

Table (3): Comparative results of the Monk1 problem with fixed initial weights.

Algorithms   Min    Mean    Max     Tav       FcEv     Succ
FRCG         21.0   47.93   100.0   0.49831   83.36    100%
PRCG         17.0   36.44   68.0    0.44798   84.19    100%
MCG          15.0   32.9    69.0    0.98967   105.38   100%

