An Overview of Some Practical Quasi-Newton Methods for Unconstrained Optimization

Abstract: Quasi-Newton methods are among the most practical and efficient iterative methods for solving unconstrained minimization problems. In this paper we give an overview of some of these methods, focusing primarily on the Hessian approximation updates and on modifications aimed at improving their performance.


Introduction
In this paper we give an overview of some line search quasi-Newton methods for solving the unconstrained minimization problem
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1)$$
where $f$ is a twice continuously differentiable function. Emphasis will be on the Hessian approximation formulas used in these methods, and on techniques developed to improve their performance.
The basic iteration of a quasi-Newton method consists of the following. Starting with an initial approximation $x_1$ to a solution $x^*$ of (1), and an initial positive definite Hessian approximation $B_1$, calculate a new approximation at iteration $k$ by
$$x_{k+1} = x_k + \alpha_k d_k, \qquad (2)$$
where $\alpha_k$ is a steplength and $d_k$ is a search direction obtained by
$$d_k = -B_k^{-1} \nabla f(x_k), \qquad (3)$$
where $B_k$ is a positive definite $n \times n$ Hessian approximation matrix chosen from the general Broyden class of updates discussed in the next section.
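The steplength $\alpha_k$ is calculated such that the Wolfe conditions
$$f(x_{k+1}) \le f(x_k) + \sigma_0 \alpha_k g_k^T d_k, \qquad (4)$$
$$g_{k+1}^T d_k \ge \sigma_1 g_k^T d_k, \qquad (5)$$
where $g_k = \nabla f(x_k)$ and $0 < \sigma_0 < \sigma_1 < 1$, are satisfied. The first condition ensures sufficient reduction in $f$, and the second one guarantees that the steplength is not too small relative to the initial rate of decrease in $f$. In practice, the strong Wolfe conditions, that is (4) together with $|g_{k+1}^T d_k| \le -\sigma_1 g_k^T d_k$, are used. Under standard assumptions, iteration (2) is globally convergent (see e.g. Fletcher, 1987, Dennis and Schnabel, 1996, and Nocedal and Wright, 1999).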
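The Wolfe tests can be illustrated with a small check routine. This is only a sketch: the function name `wolfe_conditions`, the default constants `sigma0=1e-4` and `sigma1=0.9` (common textbook choices, not values taken from this paper), and the demonstration function are all assumptions made for illustration.

```python
def dot(u, v):
    """Inner product of two vectors stored as lists."""
    return sum(a * b for a, b in zip(u, v))


def wolfe_conditions(f, grad, x, d, alpha, sigma0=1e-4, sigma1=0.9):
    """Return (sufficient_decrease, curvature, strong_curvature) for a trial step.

    f and grad are callables for the objective and its gradient; d is assumed
    to be a descent direction at x, so that grad(x)^T d < 0.
    """
    x_new = [xi + alpha * di for xi, di in zip(x, d)]
    slope0 = dot(grad(x), d)        # g_k^T d_k
    slope1 = dot(grad(x_new), d)    # g_{k+1}^T d_k
    sufficient_decrease = f(x_new) <= f(x) + sigma0 * alpha * slope0
    curvature = slope1 >= sigma1 * slope0
    strong_curvature = abs(slope1) <= -sigma1 * slope0
    return sufficient_decrease, curvature, strong_curvature


def f_demo(x):
    """Hypothetical test function: a mildly ill-conditioned quadratic."""
    return x[0] ** 2 + 10.0 * x[1] ** 2


def grad_demo(x):
    return [2.0 * x[0], 20.0 * x[1]]


x0 = [1.0, 1.0]
d0 = [-g for g in grad_demo(x0)]    # steepest-descent direction
for alpha in (1e-6, 0.05, 0.2):
    print(alpha, wolfe_conditions(f_demo, grad_demo, x0, d0, alpha))
```

In this example a very short step passes the sufficient-decrease test but fails the curvature test, while an overly long step fails sufficient decrease, which is the behaviour the two conditions are meant to rule out.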
In the next section we discuss some well-known members of the Broyden class of Hessian approximation updates. In Section 3 we outline some approaches for improving the performance of some standard updates by modifications of the gradient difference $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$. In Section 4 we discuss self-scaling quasi-Newton methods aimed at handling ill-conditioned problems. Some quasi-Newton methods for large scale optimization are discussed in Section 5.

Quasi-Newton updates
The Broyden class of Hessian approximation updates is given by
$$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} + \theta \, (s_k^T B_k s_k) \, v_k v_k^T,$$
where $s_k = x_{k+1} - x_k$,
$$v_k = \frac{y_k}{y_k^T s_k} - \frac{B_k s_k}{s_k^T B_k s_k},$$
and $\theta$ is a parameter. This class includes as special cases the BFGS update, when $\theta = 0$; the DFP update, when $\theta = 1$; and the SR1 update, when $\theta = s_k^T y_k / (s_k^T y_k - s_k^T B_k s_k)$. (Another family of updates was proposed by Huang (1970), but is not discussed here since it was shown to be equivalent to the self-scaling Broyden family discussed in Section 4; see e.g. Fletcher, 1987.)
Powell (1976) showed that if $f$ is a convex function and the Wolfe conditions hold, then for any starting point $x_1$ and any positive definite initial matrix $B_1$, the BFGS method converges globally; and if furthermore the true Hessian $\nabla^2 f(x^*)$ is positive definite, then the rate of convergence is $q$-superlinear. This result was extended by Byrd, Nocedal and Yuan (1987) to the interval $0 \le \theta < 1$ (updates with $\theta \in [0, 1]$ are known as the convex class).
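As an illustration of the Broyden class, the following sketch implements the one-parameter update in plain Python, using the standard form $B_{k+1} = B_k - B_k s s^T B_k/(s^T B_k s) + y y^T/(y^T s) + \theta (s^T B_k s) v v^T$ with $v = y/(y^T s) - B_k s/(s^T B_k s)$. The helper names (`broyden_update`, `sr1_theta`) are hypothetical, and the small test data are made up; the sketch assumes $s^T B_k s > 0$ and $y^T s \ne 0$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def broyden_update(B, s, y, theta):
    """One member of the Broyden class applied to the matrix B (list of rows).

    theta = 0 gives BFGS and theta = 1 gives DFP.
    """
    Bs = matvec(B, s)
    sBs = dot(s, Bs)
    ys = dot(y, s)
    v = [yi / ys - bsi / sBs for yi, bsi in zip(y, Bs)]
    n = len(s)
    return [
        [
            B[i][j] - Bs[i] * Bs[j] / sBs + y[i] * y[j] / ys
            + theta * sBs * v[i] * v[j]
            for j in range(n)
        ]
        for i in range(n)
    ]


def sr1_theta(B, s, y):
    """Value of theta that turns the Broyden update into the SR1 update."""
    sBs = dot(s, matvec(B, s))
    ys = dot(y, s)
    return ys / (ys - sBs)


B0 = [[1.0, 0.0], [0.0, 1.0]]
s0 = [1.0, 0.0]
y0 = [2.0, 1.0]
for name, theta in (("BFGS", 0.0), ("DFP", 1.0), ("SR1", sr1_theta(B0, s0, y0))):
    B1 = broyden_update(B0, s0, y0, theta)
    # Every member satisfies the secant equation B1 s = y, since v^T s = 0.
    print(name, B1, matvec(B1, s0))
```

Because $v_k^T s_k = 0$, the last term does not affect the product $B_{k+1} s_k$, so all three updates printed above map `s0` to `y0`.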
Although Dixon (1972) showed that for a general nonlinear function $f$, all well-defined members of the Broyden family with $\theta \ne \bar{\theta}_k$ (the degenerate value at which the update becomes singular) generate the same sequence of iterates when used with exact line searches, numerical experience showed that only some updates, which we discuss next, work well in practice when inexact line searches are used, and that the performance deteriorates as $\theta$ increases above 0 (see e.g. Byrd, Liu and Nocedal, 1992).
The SR1 method enjoys some desirable features which are not shared by other standard updates. Fiacco and McCormick (1968) showed that for a positive definite quadratic function, if the SR1 update is used with linearly independent steps, and all the updates are well-defined, then the solution is reached in at most $n + 1$ iterations. Furthermore, if $n + 1$ iterations are required, then the final Hessian approximation $B_{n+1}$ is the actual Hessian. This quadratic termination property is not generally true for other members of the Broyden family, unless exact line searches are used.
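The Hessian-recovery part of this property is easy to observe numerically. The sketch below applies the SR1 update in its usual rank-one form, $B + r r^T/(r^T s)$ with $r = y - B s$, along $n$ linearly independent steps on a quadratic with Hessian $G$, so that $y = G s$. The matrix `G`, the unit-vector steps, and the skipping tolerance `r_tol` (a commonly used safeguard, not something specified in this paper) are assumptions chosen for illustration; no line search or minimization is performed.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def sr1_update(B, s, y, r_tol=1e-8):
    """Symmetric rank-one update; the update is skipped if r^T s is too small."""
    r = [yi - bsi for yi, bsi in zip(y, matvec(B, s))]
    denom = dot(r, s)
    if abs(denom) <= r_tol * dot(r, r) ** 0.5 * dot(s, s) ** 0.5:
        return [row[:] for row in B]   # not well-defined: keep B unchanged
    n = len(s)
    return [[B[i][j] + r[i] * r[j] / denom for j in range(n)] for i in range(n)]


# Hypothetical positive definite Hessian of a quadratic in three variables.
G = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
B = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
steps = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

for s in steps:
    y = matvec(G, s)          # gradient difference of a quadratic: y = G s
    B = sr1_update(B, s, y)

max_err = max(abs(B[i][j] - G[i][j]) for i in range(3) for j in range(3))
print(max_err)                # close to zero: B has recovered G
```

Each SR1 update preserves the secant equations of the earlier steps on a quadratic, which is why three independent steps are enough here to reproduce the $3 \times 3$ Hessian.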
For general functions, Conn, Gould and Toint (1991) proved that the sequence of SR1 Hessian approximations converges to the true Hessian at the solution provided that the steps are uniformly linearly independent, that the SR1 update denominator is always sufficiently different from zero, and that the iterates converge to a finite limit. Hence under these conditions the rate of convergence is $q$-superlinear. If the assumption of uniform linear independence is dropped, then as shown in Khalfan, Byrd and Schnabel (1993) the SR1 method converges $(n+1)$-step $q$-superlinearly provided that for all $k$, $B_k$ is positive definite and bounded.
On the other hand, Ge and Powell (1983) showed that the sequence of matrices generated by the BFGS method converges to a matrix not necessarily equal to the true Hessian.
In order to obtain a well-conditioned update, Davidon (1975) proposed a two-case choice of the parameter $\theta_k$, update (8). The first case of this formula is obtained by minimizing a condition number over $\theta$, and the second one is the SR1 formula. Practical experience, however, showed no significant improvement using this update (see e.g. Al-Baali, 1993 and Lukšan and Spedicato, 2000). Since updates from the preconvex class work well in practice (see e.g. Zhang and Tewarson, 1988 and Byrd, Liu and Nocedal, 1992), Al-Baali (1993) reported improved numerical performance using a switching BFGS/SR1 update, which selects between the BFGS and SR1 values of $\theta_k$ and preserves positive definiteness. Lukšan and Spedicato (2000) also reported competitive performance using this update. Other updates of the switching type are given in Al-Baali, Fuduli and Musmanno (2004).

Modifying gradient-difference vector
Many approaches have been proposed to improve the quasi-Newton Hessian approximation updates. In this section we outline some recently suggested updates obtained by modifying the vector $y_k$.
Zhang, Deng, and Chen (1999) suggested replacing $y_k$ in the BFGS formula by a modified vector $y_k^*$ that reduces to $y_k$ if $f$ is quadratic. The resulting modified BFGS method retains global and $q$-superlinear convergence for convex functions, and performs slightly better than the standard BFGS method on some test problems. Li and Fukushima (2001) suggested another modification of $y_k$, chosen so that ${y_k^*}^T s_k > 0$. They showed that for a general function, the resulting BFGS method with backtracking line search converges globally and superlinearly, under standard assumptions on the objective function. Their numerical experience also indicated some improvement in performance. For other modifications of this type, see Yuan (1991), Xu and Zhang (2001), Wei et al. (2004), and Zhang (2005). Modifying $y_k$ was originally suggested by Powell (1978), who proposed a BFGS method for constrained optimization in which $y_k$ is replaced by a combination of $y_k$ and $B_k s_k$ chosen to keep the update positive definite.
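A minimal sketch of a Powell-type modification follows. It uses the damping rule as it is commonly stated in textbooks, $y^* = \phi y + (1 - \phi) B s$ with $\phi$ chosen so that $s^T y^* \ge \delta \, s^T B s$; the threshold `delta=0.2` and the function name `damped_y` are assumptions for illustration and may differ in detail from the formula used in the works cited above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    return [dot(row, x) for row in A]


def damped_y(B, s, y, delta=0.2):
    """Return a damped gradient difference y* with s^T y* >= delta * s^T B s.

    B is assumed positive definite, so s^T B s > 0 for s != 0.
    """
    Bs = matvec(B, s)
    sBs = dot(s, Bs)
    sy = dot(s, y)
    if sy >= delta * sBs:
        phi = 1.0                               # curvature is adequate: keep y
    else:
        phi = (1.0 - delta) * sBs / (sBs - sy)  # damp towards B s
    return [phi * yi + (1.0 - phi) * bsi for yi, bsi in zip(y, Bs)]


B = [[1.0, 0.0], [0.0, 1.0]]
s = [1.0, 0.0]
y_bad = [-1.0, 0.5]            # violates the curvature condition s^T y > 0
y_star = damped_y(B, s, y_bad)
print(y_star, dot(s, y_star))  # s^T y* equals delta * s^T B s here
```

When the damping is active, $s^T y^* = \delta \, s^T B s > 0$ by construction, so a BFGS update with $y^*$ in place of $y$ remains positive definite.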
Other approaches for improving the Hessian approximation involve employing multiple quasi-Newton updates at each iteration, using information at the current and previous steps. Some improvement was observed for certain types of test problems (Khalfan, 1989, Ford and Tharmlikit, 2003, and Al-Baali et al., 2004).

Self-scaling quasi-Newton methods
The standard quasi-Newton methods that we have considered so far may have difficulties in solving some ill-conditioned problems. Powell (1986) showed that the BFGS and DFP methods behaved badly (the latter far worse) when applied to a simple ill-conditioned quadratic function. Moreover, Dai (2002) and Mascarenhas (2004) gave examples of nonconvex functions for which the BFGS method failed to converge to the solution of the problem. In this section we consider self-scaling Hessian approximation updates for handling ill-conditioned problems. Oren and Luenberger (1974) proposed the two-parameter class of self-scaling Hessian approximation updates
$$B_{k+1} = \tau \left( B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \theta \, (s_k^T B_k s_k) \, v_k v_k^T \right) + \frac{y_k y_k^T}{y_k^T s_k}, \qquad (11)$$
where $v_k$ is as in the Broyden class, and $\theta$ and $\tau$ are chosen such that the new update is optimally conditioned in some sense. For the parameters $\theta$ and $\tau$ the authors suggested intervals, (12), for the case when $f$ is a quadratic function with a positive definite Hessian $G$.
Using intervals (12), Oren and Spedicato (1976) proposed a class of updates, (13), which can be obtained by minimizing a condition number over $\theta$ (see Spedicato (1978)). Notice that substituting $\tau = 1$ in class (13) gives the Davidon optimally conditioned update, given in the first case of (8). Moreover, as shown in Al-Baali (1995), class (13) can also be obtained by minimizing the same condition number over $\tau$.
In order to include members from the nonconvex class such as the SR1 update, and maintain positive definiteness, Al-Baali (1995) and Hu and Storey (1994) proposed modified intervals, (14), expressed in terms of quantities (including $v_k$) defined by (9). The end points of intervals (14) define two self-scaling SR1 updates, which are the same updates obtained by Osborne and Sun (1988) and Wolkowicz (1996). Of these, the choice belonging to the preconvex class is preferable, since methods from the preconvex class work well in practice.
Most of these self-scaling updates, however, do not generally improve the performance of the unscaled methods, as reported by several authors. In fact, Shanno and Phua (1978) showed that the BFGS method worked better when scaling is used only for the initial Hessian approximation $B_1$. Moreover, Nocedal and Yuan (1993) showed that, compared with the (unscaled) BFGS method, the best self-scaling BFGS algorithm of Oren and Luenberger (1974), with $\theta = 0$ and $\tau_k = 1/b_k$, where $b_k = s_k^T B_k s_k / s_k^T y_k$, performs badly when used for solving a simple quadratic problem of two variables. They also showed that for the same problem, superlinear convergence is not obtained unless certain steplength values are used, which cannot be guaranteed in practice.
Al-Baali (1998), however, extended the global and superlinear convergence theory of Byrd, Liu and Nocedal (1992) for convex functions to the self-scaling class (11), on intervals for $\theta_k$ and $\tau_k$ defined in terms of constants $c_1, c_2, c_3 \in (0, 1)$, and reported that self-scaling methods from these intervals outperformed the corresponding unscaled methods. Further numerical testing reported by Al-Baali and Khalfan (2005) showed that these methods succeeded in solving more problems than the unscaled methods, especially when $\theta \ge 1$. Using scaling only when $\tau < 1$ also improves the performance of other self-scaling methods discussed above. For example, when the best self-scaling BFGS method of Oren and Luenberger (1974) is modified so that scaling is applied only when $\tau_k < 1$, that is, with $\tau_k = \min(1, 1/b_k)$, the resulting method outperformed even the standard BFGS method. Performance improvement was also reported, especially for the DFP method, by Contreras and Tapia (1993), and Yabe, Martines, and Tapia (2004), when a similar self-scaling approach was used for a certain type of problems.

Large-scale quasi-Newton methods
The storage and computational requirements of the methods we have considered so far are $O(n^2)$ for a problem of $n$ variables. Several modifications of quasi-Newton updates have been proposed to improve their efficiency when this cost is excessive. In this section we consider two approaches that are used widely in practice: the limited-memory method, which is suitable for problems in which the true Hessian is not sparse, and the partitioned method, which is the method of choice for problems with partially separable Hessians.

Limited-memory methods
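In these methods only a few vectors of length $n$ are used for approximating the inverse of the Hessian implicitly, instead of storing a full $n \times n$ matrix. For example, if we write the BFGS update of the inverse Hessian approximation $H_k = B_k^{-1}$ in the form
$$H_{k+1} = V_k^T H_k V_k + \rho_k s_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}, \quad V_k = I - \rho_k y_k s_k^T,$$
then the product of the inverse approximation with the gradient can be computed recursively from an initial matrix $D_k$ and a small number of the most recent pairs $(s_i, y_i)$, as shown for example in Nocedal and Wright (1999). This method is referred to as the L-BFGS method and, as shown in Liu and Nocedal (1989), it converges globally for convex functions (a superlinear rate is not established). A compact representation of limited-memory methods for general quasi-Newton updates is given by Byrd, Nocedal and Schnabel (1994).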
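The recursion is usually implemented as the well-known two-loop procedure described in Nocedal and Wright (1999). The sketch below is one possible rendering in plain Python; the function name `lbfgs_direction`, the list-based storage of the pairs (oldest first), and the choice of initial matrix $(s^T y / y^T y) I$ built from the most recent pair are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs_direction(g, s_list, y_list):
    """Return d = -H g, where H is the implicit L-BFGS inverse Hessian.

    s_list and y_list hold the stored pairs (s_i, y_i), oldest first, and
    each pair is assumed to satisfy the curvature condition y_i^T s_i > 0.
    """
    q = list(g)
    saved = []
    # First loop: newest pair to oldest.
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        q = [qi - a * yi for qi, yi in zip(q, y)]
        saved.append((a, rho))
    # Initial matrix: a multiple of the identity (identity if no pairs yet).
    if s_list:
        gamma = dot(s_list[-1], y_list[-1]) / dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]
    # Second loop: oldest pair to newest.
    for (s, y), (a, rho) in zip(zip(s_list, y_list), reversed(saved)):
        beta = rho * dot(y, r)
        r = [ri + (a - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


# Made-up pairs; the implicit H satisfies the secant equation H y = s for
# the most recent pair, so applying it to y_list[-1] returns -s_list[-1].
s_pairs = [[1.0, 0.0], [0.0, 1.0]]
y_pairs = [[2.0, 0.5], [0.5, 3.0]]
print(lbfgs_direction(y_pairs[-1], s_pairs, y_pairs))
```

Only the stored vectors are touched, so with $m$ pairs the cost is of order $mn$ operations per direction rather than the $O(n^2)$ of a dense update.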
Computational experience with $D_k = (1/h_k) I$, a multiple of the identity matrix, indicates that for large scale problems in which $\nabla^2 f(x)$ is not sparse, the L-BFGS method outperforms other methods such as the nonlinear conjugate gradient method, and that its performance improves substantially, in terms of computing time, as $n$ gets large. The method, however, may suffer from slow convergence, which costs more function evaluations, especially on very ill-conditioned problems (Nocedal and Wright, 1999). For large-scale least-squares problems, Al-Baali (2003a) considered a modified L-BFGS method using a vector $y_k^*$, similar to the one discussed in Section 3, instead of $y_k$, and reported substantial improvement in numerical performance.
Another limited-memory approach is based on the fact that the standard BFGS method accumulates approximate curvature in a sequence of expanding subspaces, which allows using a smaller reduced matrix, increasing in dimension at each iteration, to approximate the Hessian. This feature is used to define limited-memory reduced-Hessian methods that require half the storage of conventional limited-memory methods. For more on these methods see Gill and Leonard (2003) and Lukšan and Vlček (2006).

Partitioned methods
Every function $f$ with a sparse Hessian is partially separable, that is, it can be written in the form
$$f(x) = f_1(x) + f_2(x) + \cdots + f_m(x),$$
where each element function $f_i$ depends on only a few variables (see Griewank and Toint (1982)). In a partitioned method the Hessian of each element function $f_i$ is approximated using a quasi-Newton Hessian approximation, and the search direction is obtained from a linear system, (16), built from these approximations. If the element functions are convex, then the BFGS method converges globally, even if the system (16) is solved inexactly, as shown in Griewank (1991). The partitioned BFGS method performs well in practice provided that partial separability is fully exploited. Practical implementations, however, mostly use [...]
Under the standard assumptions on $f$, if $\nabla f$ is Lipschitz continuous, the steplength $\alpha_k$ satisfies the Wolfe conditions, and the matrices $B_k$ are positive definite and have a bounded condition number, then iteration (2) is globally convergent.
An update from the Broyden class preserves positive definiteness if $B_k$ is positive definite, the curvature condition $s_k^T y_k > 0$ holds, and $\theta > \bar{\theta}_k$, where $\bar{\theta}_k = 1/(1 - b_k h_k)$, $b_k = s_k^T B_k s_k / s_k^T y_k$ and $h_k = y_k^T B_k^{-1} y_k / s_k^T y_k$. Since $\bar{\theta}_k < 0$ (because $b_k h_k \ge 1$ by Cauchy's inequality), it clearly follows that any update with $\theta \ge 0$ (such as the BFGS and DFP updates) preserves positive definiteness if the curvature condition holds. The SR1 update preserves positive definiteness only if either $b_k < 1$ or $h_k < 1$, which may not hold even for quadratic functions.
Numerical experience with updates from the preconvex class, the interval $\bar{\theta}_k < \theta < 0$, indicated that several updates from this interval worked well in practice, especially modified versions that include the SR1 formula. The fast rate of convergence observed in the numerical experience with these updates suggests further study of their convergence properties.