Numerical Experience with Damped Quasi-Newton Optimization Methods when the Objective Function is Quadratic

A class of damped quasi-Newton methods for nonlinear optimization has recently been proposed by extending the damped-technique of Powell for the BFGS method to the Broyden family of quasi-Newton methods. It has been shown that this damped class possesses the global and superlinear convergence property that a restricted class of 'undamped' methods has for convex objective functions in unconstrained optimization. To test this result, we applied several members of the Broyden family and their corresponding damped methods to a simple quadratic function and observed several useful features of the damped-technique. These observations and other numerical experience are described in this paper. The important role of the damped-technique is shown not only for enforcing the above convergence property, but also for improving the performance of efficient, inefficient and divergent undamped methods substantially (significantly in the latter case). Some appropriate ways of employing the damped-technique are therefore suggested.

B  by using a member of the Broyden family in terms of the differences in points and gradients 1 =, where k g denotes the gradient ( ).
k fx  This family consists of efficient, inefficient and divergent methods.The former two types of methods are usually defined sufficiently closely to the well known BFGS and DFP methods, respectively, while the latter type is defined sufficiently remotely away from these methods.For further details see for instance (Fletcher, 1987;Dennis and Schnabel, 1996;Nocedal and Wright, 1999).Al-Baali and Grandinetti (2009) show that the performance of the BFGS method can be improved if k y is modified before updating to the damped-technique ˆ= (1 ) , where k  is a parameter chosen appropriately and sufficiently large in the interval (0,1] .The resulting damped (D)-BFGS method is proposed by Powell (1978) for the Lagrangian function in constrained optimization and used many times with only values of 0.8 k   , see for example (Fletcher, 1987;Nocedal and Wright, 1999).
The aim of this paper is to show that small values of φ_k can be used to improve the behaviour not only of the BFGS method, but of all members of the Broyden family of methods for unconstrained optimization. To illustrate this possibility, we applied several members of this family and their corresponding damped methods to a simple quadratic function of two variables, using a certain initial point x_1 and Hessian approximation B_1. For particular choices of φ_k, it is shown that many damped methods work substantially better than the BFGS method. The paper is organized in the following way. Section 2 describes the Broyden and D-Broyden classes of methods, Section 3 provides some numerical results and Section 4 concludes the paper.

Damped quasi-Newton methods
We now briefly describe the Broyden and D-Broyden classes of methods. On each iteration of these methods, the current point x_k and a symmetric and positive-definite Hessian approximation B_k are available, given initially as x_1 and B_1. Using these data, a new point x_{k+1} = x_k + α_k d_k is calculated, where d_k = −B_k^{−1} g_k is the search direction and α_k denotes a steplength which is usually chosen such that the Wolfe conditions

    f_{k+1} ≤ f_k + σ_0 α_k g_k^T d_k,        (2)
    g_{k+1}^T s_k ≥ σ_1 g_k^T s_k,            (3)

hold, where f_k denotes f(x_k), σ_0 ∈ (0, 0.5) and σ_1 ∈ (σ_0, 1). The Hessian approximation is then updated to B_{k+1} by the damped Broyden update (4), in which y_k is replaced by ŷ_k, defined by (1) for a suitable value of φ_k.
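For concreteness, the Wolfe conditions (2)-(3) can be checked as in the following sketch (the names and the default values of σ_0 and σ_1 are our own choices within the stated ranges σ_0 ∈ (0, 0.5), σ_1 ∈ (σ_0, 1)):

```python
import numpy as np

def wolfe_ok(f, grad, x, d, alpha, sigma0=1e-4, sigma1=0.9):
    """Check the Wolfe conditions (2)-(3) for a step alpha along d."""
    g = grad(x)
    x_new = x + alpha * d
    armijo = f(x_new) <= f(x) + sigma0 * alpha * (g @ d)   # condition (2)
    curvature = grad(x_new) @ d >= sigma1 * (g @ d)        # condition (3)
    return armijo and curvature
```

Condition (3) rules out steplengths that are too short, while (2) rules out steps with insufficient decrease.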
k  This class of damped updates is reduced to the Broyden family if ˆ= kk yy (which corresponds to =1 k  ).Thus, if this equality holds for all iterations we obtain the Broyden family of methods.Otherwise, we obtain the D-Broyden class of methods.In particular, the choices =0  is a certain negative value sufficiently close to zero) converges globally and q-superlinearly for convex functions, the performance of the class becomes worse as k  increases.In general, the BFGS method is robust and the DFP method is inefficient, though the former method usually suffers from large eigenvalues of k B , see in particular (Powell, 1986;Byrd et al., 1992).Therefore several modification techniques have been introduced to the BFGS method, see for example (Yabe et al., 2007;Gratton and Toint, 2010).Here, we test the D-Broyden class of methods which is an extension of the D-BFGS update of Powell (1978) in the augmented Lagrangian and SQP methods for constrained optimization.For details on the latter case, see for instance (Fletcher, 1987;Nocedal and Wright, 1999).This damped method was applied to unconstrained optimization problems for the first time by Al-Baali (2003) who considered  is sufficiently large or small, respectively.Another useful feature of formula ( 6) is that T kk syremains sufficiently positive even when 0 T kk sy  (which may occur if the Wolfe condition (3) is not enforced on k s ).We also observe that formula (6) is reduced to the choice of Powell (1978) if 2 = 0.8  and 3 =.
For certain choices of these parameters, Al-Baali (2004) introduced formula (6) into the limited memory L-BFGS method of Nocedal (1980) and reported encouraging numerical results in certain cases. Other encouraging results for the D-BFGS method with formula (6) on unconstrained optimization have been reported by Al-Baali and Grandinetti (2009). The authors also show that this formula works better than other modified BFGS formulae, which modify y_k to a form like (8) for certain vectors; see for example Yabe et al. (2007) and the references therein. Because this form reduces to y_k when the objective function is quadratic, the proposed modified BFGS methods maintain the difficulty associated with the BFGS method in this case. Due to this feature and the above observations, we consider only formula (6) here and state the following investigations.
In practice, we observed that formula (6) with σ_2 sufficiently close to 0.6 seems to work well in some cases, but the performance of the D-BFGS method generally worsens as σ_2 decreases. This observation illustrates why large values of σ_2, such as 0.8 or 0.9, are usually recommended when the D-BFGS update is employed in methods for constrained optimization; see for example Fletcher (1987), Nocedal and Wright (1999) and Powell (1978, 2009). We will argue below that small values of φ_k are also useful in some cases, although large values of φ_k yield only small changes in y_k. Al-Baali (2011) extends the global convergence property of the restricted Broyden class of methods to the D-Broyden class, assuming that φ_k satisfies a condition of the form (9) involving two constants ν_1, ν_2 ∈ (0, 1).
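Since formula (6) reduces to the choice of Powell (1978) for σ_2 = 0.8, a minimal sketch of that original damping rule may help fix ideas (the function name is ours, and the general parameter σ_2 replaces Powell's constant 0.8):

```python
import numpy as np

def powell_phi(s, y, B, sigma2=0.8):
    """Powell's (1978) damping parameter.

    Leaves y unchanged when the curvature s'y is large enough; otherwise
    damps so that s'y_hat = (1 - sigma2) * s'Bs > 0.
    """
    sy = s @ y
    sBs = s @ (B @ s)
    if sy >= (1.0 - sigma2) * sBs:
        return 1.0                        # no damping needed
    return sigma2 * sBs / (sBs - sy)      # gives s'y_hat = (1 - sigma2) s'Bs
```

With this φ_k, the damped vector ŷ_k of (1) always satisfies s^T ŷ_k ≥ (1 − σ_2) s^T B_k s_k, so positive definiteness of the update is preserved even when s^T y_k ≤ 0.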
We observe that condition (9) adds another useful feature to the BFGS method over the other methods. Although prior to this result θ_k was never chosen outside the interval (θ̄, 1], except for a well defined SR1 method (see for example Fletcher, 1987), condition (9) allows any real value of θ_k. Recently, further restrictions on φ_k have been considered in terms of the scalars b_k = s_k^T B_k s_k / s_k^T y_k and h_k = y_k^T H_k y_k / s_k^T y_k, where H_k = B_k^{−1}. Since the Broyden family of updates reduces to a single update if b_k h_k = 1 (see for example Al-Baali, 1993), and since the first limit in (12) yields formula (6) (see Al-Baali, 1993), we consider employing it only when b_k h_k − 1 is sufficiently far from zero, that is, when the product b_k h_k is sufficiently larger than one, so that we obtain formula (13), where σ_4 ≥ 0.
Indeed, for some choices of these parameters, we observed that formula (13) works better than (6). We now consider the second limit of (12), which suggests using formula (14), where again σ_4 ≥ 0; this value can be substituted into (11) to obtain the corresponding value of φ_k. Since σ_4 is chosen by the user, only formulae (13) and (14) are considered here for illustration. We note that the above formulae for φ_k are independent of the updating parameter θ_k, so that the global convergence condition (9) may not hold for sufficiently large values of |θ_k|.
k  Therefore, in this case, this condition can be enforced by choosing k  sufficiently small.For example, the equality in (9) could be solved for k  which can be substituted into (11) to obtain the largest acceptable value of .k  In this way, all the damped algorithms which we consider below for ( ,1) kk    (the divergent Broyden options) solved the considered problem successfully.However, alternatively, replacing 1 Using formula (16), we observed that all the damped algorithms which we consider below for several values of k  from inside and outside the convex interval solved the problem successfully with performance significantly better than that of BFGS.
On the basis of the above discussions, we state the following outline for practical damped quasi-Newton methods.
We note that the strong Wolfe conditions in Step 3 imply the Wolfe conditions (2)-(3). If, on every iteration, the choice ŷ_k = y_k is used in Step 5 (which follows from (1) and (13) with σ_2 = 1), Algorithm 2.1 reduces to the 'undamped' Broyden family of methods. In particular, this choice with θ_k = 0, for all values of k, reduces the D-BFGS method to the standard BFGS method.
In practice, we observed that several values of σ_2 > 1/2 improve the performance of the BFGS and DFP methods (the latter significantly). If φ_k is chosen such that the usual positive-definiteness property holds (which we consider here), then any value of θ_k satisfies condition (9). Indeed, we observed that sufficiently small values of φ_k are sometimes useful, particularly when |θ_k| is far from zero. These features will be illustrated in the next section.

Algorithm 2.1 D-Broyden class
Step 0: Given a starting point x_1, a symmetric and positive-definite matrix B_1, positive values of σ_0 and σ_1, and a tolerance ε > 0, set k = 1.
Step 1: Terminate if ||g_k|| ≤ ε.
Step 2: Compute the search direction d_k = −B_k^{−1} g_k.
Step 3: Compute a steplength α_k such that the following strong Wolfe conditions hold:

    f_{k+1} ≤ f_k + σ_0 α_k g_k^T d_k,  |g_{k+1}^T d_k| ≤ −σ_1 g_k^T d_k,        (17)

and set x_{k+1} = x_k + α_k d_k.
Step 4: Compute s_k, y_k and the new gradient g_{k+1}.
Step 5: Choose values for φ_k and θ_k, and compute ŷ_k and B_{k+1}, which are defined by (1) and (4), respectively; set k := k + 1 and go to Step 1.

A numerical example
The question that has been most useful to the development of successful algorithms for unconstrained optimization is 'Does the method work well when the objective function is quadratic?' This statement is given by Powell (2009), who also states that the answer is very welcome and encouraging for the updating of second derivative matrices of quadratic models by a Broyden family method. Since this family usually suffers on ill-conditioned problems, we would like to apply some selected Broyden family methods to a quadratic function with a very ill-conditioned Hessian matrix.
Because quasi-Newton methods are invariant under linear transformations, the difficulty is unchanged if we choose the simple two-variable quadratic function of Powell (1986), whose Hessian is very ill-conditioned. Note that the former performance is maintained for the modified BFGS methods which use (8). The derivation of Powell (1986) for x_{k+1} has been extended to a restricted Broyden class of methods by Byrd et al. (1992), who also show that the performance of this class becomes worse as θ_k increases. Since these papers illustrate that these methods usually suffer from large eigenvalues of B_k rather than small ones, we will describe below the numerical results for a very large value of the ill-conditioning parameter.


In a similar manner, we have tested the D-Broyden class of methods by applying Algorithm 2.1 to the above problem. To define Step 0 of this algorithm, we let x_1 and B_1 be given as above, and we use the unit steplength α_k = 1 in Step 3 on all iterations, whether this value satisfies the Wolfe conditions or not. Nevertheless, these conditions hold in the limit if a quasi-Newton method converges superlinearly (Dennis and Moré, 1974). Thus the algorithms we consider below differ only in the choices of φ_k and θ_k as required in Step 5 of Algorithm 2.1. All experiments were run in Matlab. We will report the number of function evaluations (nfe) required to terminate Algorithm 2.1 in Step 1 with the Euclidean norm ||g_k|| ≤ ε. This number is the same as the number of gradient evaluations as well as the number of iterations. 'F' indicates that the algorithm was terminated before the latter inequality held, i.e. the algorithm failed to solve the problem. We also consider the BFGS method for comparison, which required nfe = 32 to solve the problem. We first tested the D-BFGS method with φ_k given by formulae (6), (13) and (14) for various choices of the parameters σ_2 and σ_4. We do not consider choices for σ_3 here, because the second case in (6) and (13) was not used, as the corresponding inequality involving σ_3 was not satisfied for any σ_3 > 0.
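The experiment just described can be reproduced in outline as follows. This is only a sketch under our own assumptions (unit steps, a generic ill-conditioned two-variable quadratic 0.5 x^T diag(1, η) x rather than Powell's exact example, our own x_1 and B_1, and Powell's damping rule standing in for formula (6)); it is not the authors' Matlab code.

```python
import numpy as np

def run_damped_bfgs(eta=1e8, tol=1e-6, max_nfe=200, sigma2=0.8):
    """Damped BFGS with unit steps on f(x) = 0.5*(x1^2 + eta*x2^2).

    Returns nfe, the number of gradient evaluations needed to reach
    ||g|| <= tol, or None on failure ('F' in the paper's tables).
    """
    A = np.diag([1.0, eta])               # very ill-conditioned Hessian
    x = np.array([1.0, 1.0])              # our (assumed) choice of x_1
    B = np.eye(2)                         # our (assumed) choice of B_1
    for nfe in range(1, max_nfe + 1):
        g = A @ x                         # gradient of the quadratic
        if np.linalg.norm(g) <= tol:
            return nfe
        s = -np.linalg.solve(B, g)        # unit steplength: s_k = d_k
        x = x + s
        y = A @ s                         # exact y_k for a quadratic
        sy, sBs = s @ y, s @ (B @ s)
        phi = 1.0 if sy >= (1 - sigma2) * sBs else sigma2 * sBs / (sBs - sy)
        yh = phi * y + (1 - phi) * (B @ s)            # damped-technique (1)
        Bs = B @ s
        B = B - np.outer(Bs, Bs) / sBs + np.outer(yh, yh) / (s @ yh)  # BFGS
    return None
```

Counting nfe this way matches the paper's convention, since one gradient evaluation is made per iteration.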


The results, expressed in terms of nfe, are presented in Table 1. We used (6) for 10 different values of σ_2 ∈ (0, 1).

 
For sufficiently large values of σ_2, D-BFGS required nfe = 32 to solve the problem, which is the same as that required by BFGS. The reason is that using σ_2 = 1 in formula (6) yields φ_k = 1, which (by the damped-technique (1)) reduces the D-BFGS update to the BFGS one. We also note that D-BFGS works slightly better than BFGS when σ_2 is close to 0.6, while it becomes worse as σ_2 decreases in (0, 0.5). Indeed, D-BFGS also failed to solve the problem when values of σ_2 ≤ 10^{-6} were used. This observation is somewhat expected on the basis of some results from the literature, because (to our knowledge) only values of σ_2 = 0.8 or 0.9 have been used in the D-BFGS method for constrained optimization during the last three decades; see for example Powell (2009).
This drawback of formula (6) with choices of σ_2 ≤ 0.5 can be avoided by employing it only when the BFGS update is sufficiently away from the rank-one update, that is, when b_k h_k is sufficiently away from 1. To illustrate this feature, we repeated the run using formula (13) with most of the above values of σ_2 and some values of σ_4 ≥ 0.

 
The results are given in Table 2. We do not report the results for σ_4 > 2 here, because we observed that D-BFGS required the same nfe as that required for σ_4 = 2.


It is clear that the damped-technique improves BFGS substantially when σ_4 is defined sufficiently close to 0.5 and σ_2 ≥ 0.6, and significantly for very small values of σ_2. It is rather surprising that D-BFGS performs this well for some of these choices. This table also shows that the choice σ_4 = 0 (or nearly so), which means that the damped-technique is employed on most iterations, is not desirable. The damped-technique seems to work well if the left limit in (12) is enforced only when the term under the right limit, b_k h_k − 1, is sufficiently away from zero. Therefore, it is worth testing formula (14), which enforces the right limit in (12). The results are given in Table 3. We observe that the damped-technique improves over BFGS as σ_4 decreases. For example, when σ_4 = 0.01, D-BFGS required only nfe = 8, which is very small compared to the 32 required by BFGS (Byrd et al., 1992). Note that the large nfe required for θ_k = 1 (the DFP option) is expected; see Powell (1986).
The remaining results, given in Table 4, correspond to the divergent Broyden options; it is also surprising that they solved the problem successfully. Note that the slow convergence of these algorithms is avoided as described in the following paragraph. We now repeat the test with formula (16). The results are given in Table 5. We observe the surprising result that all damped algorithms solve the problem successfully. For sufficiently small values of σ_4, the performance of the algorithms is significantly better than that of BFGS, and it improves as σ_4 decreases.

The choice σ_4 = 0 means that the damped-technique is always employed, except when the Broyden family of updating formulae reduces to the SR1 update. All algorithms approach an optimal method with nfe = 4, which is what a well defined SR1 method requires to solve the problem. Although σ_4 = 0 gives the best choice here, the value σ_4 = 0.95 might be typical for a general function, and further experiments on several test problems should be considered.
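The well defined SR1 update mentioned here (the single update to which the Broyden family reduces when b_k h_k = 1) has the standard symmetric rank-one form, sketched below with our own naming; it is 'well defined' only when the denominator is safely away from zero, and the usual safeguard skips the update otherwise:

```python
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """Symmetric rank-one update; skip it when the denominator is too small."""
    v = y - B @ s
    denom = v @ s
    if abs(denom) <= r * np.linalg.norm(s) * np.linalg.norm(v):
        return B                          # standard safeguard: keep B unchanged
    return B + np.outer(v, v) / denom
```

On a quadratic, SR1 enjoys a hereditary property: after steps spanning the space, B reproduces the exact Hessian, which is consistent with the very small nfe reported for the SR1 option.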
It is worth reporting that an application of some algorithms to the above quadratic problem (or its linear transformation) with some choices of x_1 and B_1 showed that, when |θ_k| is sufficiently away from zero, the damped algorithms work significantly better than the undamped Broyden methods. For θ_k = 0, we observed that generally the standard BFGS method is preferable to the D-BFGS algorithm unless further restrictions on the damped parameter φ_k are made or, similarly, the scalars σ_2, σ_3 and σ_4 are appropriately chosen. Indeed, Al-Baali and Grandinetti (2009) showed that the performance of the D-BFGS method with formula (6) for σ_2 = 0.9, σ_3 = 9 and further restrictions on φ_k is substantially better than that of the BFGS method. This performance is improved further when both the damped and self-scaling techniques are combined in a certain sense (Al-Baali and Khalfan, 2009). For large-scale problems, Al-Baali (2004) introduced the damped formula (6) into the limited memory L-BFGS method; see for example Nocedal and Wright (1999). Using σ_2 = 0.6, σ_3 = 3 and sufficiently large values of φ_k, Al-Baali (2004) reported encouraging numerical results in certain cases.

Conclusion
It is shown that the D-Broyden class of methods with appropriate choices for the damped-technique works very well when applied to the quadratic function of Powell (1986). In particular, the damped parameter (16) improves the performance of robust methods (e.g. BFGS) substantially and of inefficient methods (e.g. DFP) significantly, and it enforces fast convergence of divergent methods. The numerical results also demonstrate that small values of the damped parameter φ_k are useful in some cases.
Since this finding contradicts the well-known recommendation that large values φ_k ≥ 0.8 should be used in the D-BFGS update of methods for the Lagrangian function in constrained optimization (see for example Fletcher, 1987; Nocedal and Wright, 1999; Powell, 1978, 2009; Gill and Leonard, 2003), it is worth noting that smaller values of φ_k, obtained by formulae (13) and (14) which modify that of Powell (1978), should also be further investigated.
For a general function, these formulae can be used with y_k replaced by the right-hand side of (8), as in Al-Baali and Grandinetti (2009).

Acknowledgment
We would like to thank the two anonymous referees for making a number of valuable comments on a draft of this paper.
… known BFGS and DFP methods, respectively. Although the restricted class of quasi-Newton methods (defined for values of θ_k up to 1) …
Al-Baali (2011) extends the above superlinear convergence property to the damped methods and shows that … calculated by either the BFGS or DFP method with α_k = 1. He shows that, for large values of the ill-conditioning parameter, BFGS performs badly and DFP is far worse, although y_k is exact.
The damped-technique (1), with suitable values of φ_k, has the ability to correct the Hessian approximations successfully. Here, we report the results for several choices of φ_k applied to this problem. This result shows that small values of σ_2 (and hence of φ_k) are useful in practice.
We observed, as expected, that all the algorithms solved the problem successfully with nfe = 2 or 3, whether the damped-technique was employed or not. To see if the above results generalise to general functions, we applied Algorithm 2.1, employing the strong Wolfe conditions (17) with the usual values of the parameters, to … optimization problems and observed the following. When … This modified formula ensures that the value of φ_k, or its modification, lies in …

Table 3
To see if the above encouraging numerical results for D-BFGS generalise to other members of the D-Broyden class, we repeated the run for formula (14) with the same values of σ_4. The results in Table 4 were obtained by testing condition (9) with ν_1 = ν_2 = 0.05; when this condition fails, we reduce the value of φ_k such that the condition holds with equality. In this way convergence is enforced for all algorithms, which indeed solved the problem successfully. Thus the second row …