SAS/STAT 9.1 User's Guide (Vol. 5)

This section contains details about the underlying theory and computations of PROC NLMIXED.

Modeling Assumptions and Notation

PROC NLMIXED operates under the following general framework for nonlinear mixed models. Assume that you have an observed data vector y_i for each of i subjects, i = 1, ..., s. The y_i are assumed to be independent across i, but within-subject covariance is likely to exist because each of the elements of y_i is measured on the same subject. As a statistical mechanism for modeling this within-subject covariance, assume that there exist latent random-effect vectors u_i of small dimension (typically one or two) that are also independent across i. Assume also that an appropriate model linking y_i and u_i exists, leading to the joint probability density function

p(y_i | X_i, φ, u_i) q(u_i | ξ)

where X_i is a matrix of observed explanatory variables and φ and ξ are vectors of unknown parameters.

Let θ = (φ, ξ) and assume that it is of dimension n. Then inferences about θ are based on the marginal likelihood function

m(θ) = ∏_{i=1}^{s} ∫ p(y_i | X_i, φ, u_i) q(u_i | ξ) du_i

In particular, the function

f(θ) = −log m(θ)

is minimized over θ numerically in order to estimate θ, and the inverse Hessian (second derivative) matrix at the estimates provides an approximate variance-covariance matrix for the estimate of θ. The function f(θ) is referred to both as the negative log likelihood function and as the objective function for optimization.

As an example of the preceding general framework, consider the nonlinear growth curve example in the Getting Started section. Here, the conditional distribution p(y_i | X_i, φ, u_i) is normal with mean

(b1 + u_i) / [1 + exp{−(d_ij − b2)/b3}]

and variance σ_e²; thus φ = (b1, b2, b3, σ_e²). Also, u_i is a scalar and q(u_i | ξ) is normal with mean 0 and variance σ_u²; thus ξ = σ_u².
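
For concreteness, a PROC NLMIXED specification of this kind of growth curve model might look like the following sketch. The data set name GROWTH and the variables HEIGHT, DAY, and TREE are illustrative assumptions, not part of the chapter's example; substitute your own names and starting values.

proc nlmixed data=growth qpoints=5;
   parms b1=190 b2=700 b3=350 s2u=1000 s2e=60;
   num = b1 + u1;                          /* random intercept enters the asymptote */
   den = 1 + exp(-(day - b2)/b3);          /* logistic growth curve in DAY */
   model height ~ normal(num/den, s2e);    /* conditional normal model */
   random u1 ~ normal(0, s2u) subject=tree;
run;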

The following additional notation is also found in this chapter. The quantity θ^(k) refers to the parameter vector at the kth iteration, the function g(θ) refers to the gradient vector ∇f(θ), and the matrix H(θ) refers to the Hessian ∇²f(θ). Other symbols are used to denote various constants or option values.

Integral Approximations

An important part of the marginal maximum likelihood method described previously is the computation of the integral over the random effects. The default method in PROC NLMIXED for computing this integral is adaptive Gaussian quadrature as described in Pinheiro and Bates (1995). Another approximation method is the first-order method of Beal and Sheiner (1982, 1988). A description of these two methods follows.

Adaptive Gaussian Quadrature

A quadrature method approximates a given integral by a weighted sum over predefined abscissas for the random effects. A good approximation can usually be obtained with an adequate number of quadrature points as well as appropriate centering and scaling of the abscissas. Adaptive Gaussian quadrature for the integral over u_i centers the integral at the empirical Bayes estimate of u_i, defined as the vector û_i that minimizes

−log [ p(y_i | X_i, φ, u_i) q(u_i | ξ) ]

with φ and ξ set equal to their current estimates. The final Hessian matrix from this optimization can be used to scale the quadrature abscissas.

Suppose (z_j, w_j; j = 1, ..., p) denote the standard Gauss-Hermite abscissas and weights (Golub and Welsch 1969, or Table 25.10 of Abramowitz and Stegun 1972). The adaptive Gaussian quadrature integral approximation is as follows.

where r is the dimension of u_i, Γ(X_i, θ) is the Hessian matrix from the empirical Bayes minimization, z_{j_1,...,j_r} is a vector with elements (z_{j_1}, ..., z_{j_r}), and

PROC NLMIXED selects the number of quadrature points adaptively by evaluating the log likelihood function at the starting values of the parameters until two successive evaluations have a relative difference less than the value of the QTOL= option. The specific search sequence is described under the QFAC= option. Using the QPOINTS= option, you can adjust the number of quadrature points p to obtain different levels of accuracy. Setting p = 1 results in the Laplacian approximation as described in Beal and Sheiner (1992), Wolfinger (1993), Vonesh (1992, 1996), Vonesh and Chinchilli (1997), and Wolfinger and Lin (1997).
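
For example (a sketch only; the data set MYDATA and the variables Y, X, and ID are placeholders), you can fix the number of quadrature points rather than letting the procedure choose it adaptively:

proc nlmixed data=mydata qpoints=7;
   parms beta0=0 beta1=0 s2u=1;
   eta = beta0 + beta1*x + u;       /* linear predictor with a random intercept */
   p = 1/(1 + exp(-eta));           /* logistic link */
   model y ~ binary(p);
   random u ~ normal(0, s2u) subject=id;
run;

Omitting QPOINTS= and specifying QTOL= or QFAC= instead retains the adaptive search but changes its tolerance and search sequence.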

The NOAD option in the PROC NLMIXED statement requests nonadaptive Gaussian quadrature. Here all û_i are set equal to zero, and the Cholesky root of the estimated variance matrix of the random effects is substituted for Γ(X_i, θ)^(−1/2) in the preceding expression for a_{j_1,...,j_r}. In this case derivatives are computed using the algorithm of Smith (1995). The NOADSCALE option requests the same scaling substitution but with the empirical Bayes û_i.

PROC NLMIXED computes the derivatives of the adaptive Gaussian quadrature approximation when carrying out the default dual quasi-Newton optimization.

First-Order Method

Another integral approximation available in PROC NLMIXED is the first-order method of Beal and Sheiner (1982, 1988) and Sheiner and Beal (1985). This approximation is used only in the case where p(y_i | X_i, φ, u_i) is normal, that is,

where n_i is the dimension of y_i, R_i is a diagonal variance matrix, and m_i is the conditional mean vector of y_i.

The first-order approximation is obtained by expanding m(X_i, φ, u_i) in a one-term Taylor series about u_i = 0, resulting in the approximation

m(X_i, φ, u_i) ≈ m(X_i, φ, 0) + Z_i(X_i, φ) u_i

where Z_i(X_i, φ) is the Jacobian matrix ∂m_i(X_i, φ, u_i)/∂u_i evaluated at u_i = 0.

Assuming that q(u_i | ξ) is normal with mean 0 and variance matrix G(ξ), the first-order integral approximation is computable in closed form after completing the square:

where V_i(X_i, θ) = Z_i(X_i, φ) G(ξ) Z_i(X_i, φ)^T + R_i(X_i, φ). The resulting approximation for f(θ) is then minimized over θ = (φ, ξ) to obtain the first-order estimates. PROC NLMIXED uses finite-difference derivatives of the first-order integral approximation when carrying out the default dual quasi-Newton optimization.
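
The method is requested with the METHOD= option. The following sketch is illustrative only; the data set PKDATA and the variables CONC, TIME, and PATIENT are assumed names for a normal-normal model:

proc nlmixed data=pkdata method=firo;
   parms beta1=10 beta2=0.5 s2u=0.1 s2e=1;
   pred = (beta1 + u)*exp(-beta2*time);   /* conditional mean; linear in the random effect u */
   model conc ~ normal(pred, s2e);
   random u ~ normal(0, s2u) subject=patient;
run;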

Optimization Algorithms

There are several optimization techniques available in PROC NLMIXED. You can choose a particular optimizer with the TECH=name option in the PROC NLMIXED statement.

Algorithm                                         TECH=
trust region method                               TRUREG
Newton-Raphson method with line search            NEWRAP
Newton-Raphson method with ridging                NRRIDG
quasi-Newton methods (DBFGS, DDFP, BFGS, DFP)     QUANEW
double-dogleg method (DBFGS, DDFP)                DBLDOG
conjugate gradient methods (PB, FR, PR, CD)       CONGRA
Nelder-Mead simplex method                        NMSIMP
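
For instance, to request the Newton-Raphson ridge optimizer instead of the default dual quasi-Newton method (the data set MYDATA and the trivial model below are placeholders):

proc nlmixed data=mydata tech=nrridg;
   parms mu=0 s2=1;
   model y ~ normal(mu, s2);   /* simple normal model for illustration */
run;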

No algorithm exists that always finds the global optimum of a general nonlinear minimization problem in a reasonable amount of time. Since no single optimization technique is invariably superior to others, PROC NLMIXED provides a variety of optimization techniques that work well in various circumstances. However, you can devise problems for which none of the techniques in PROC NLMIXED can find the correct solution. Moreover, nonlinear optimization can be computationally expensive in terms of time and memory, so you must be careful when matching an algorithm to a problem.

All optimization techniques in PROC NLMIXED use O(n²) memory except the conjugate gradient methods, which use only O(n) memory and are designed to optimize problems with many parameters. Since the techniques are iterative, they require the repeated computation of the function value (optimization criterion), the gradient vector (first-order partial derivatives), and, for some techniques, the (approximate) Hessian matrix (second-order partial derivatives).

However, since each of the optimizers requires different derivatives, some computational efficiencies can be gained. The following table shows, for each optimization technique, which derivatives are required (FOD: first-order derivatives; SOD: second-order derivatives).

Algorithm    FOD    SOD
TRUREG       x      x
NEWRAP       x      x
NRRIDG       x      x
QUANEW       x      -
DBLDOG       x      -
CONGRA       x      -
NMSIMP       -      -

Each optimization method employs one or more convergence criteria that determine when it has converged. The various termination criteria are listed and described in the PROC NLMIXED Statement section. An algorithm is considered to have converged when any one of the convergence criteria is satisfied. For example, under the default settings, the QUANEW algorithm converges if ABSGCONV < 1E−5, FCONV < 10^(−FDIGITS), or GCONV < 1E−8.
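
As a sketch (the data set MYDATA and the Poisson model are placeholders), you can tighten the gradient-based criteria so that the optimizer iterates longer before declaring convergence:

proc nlmixed data=mydata gconv=1e-10 absgconv=1e-6;
   parms b0=0 b1=0;
   lambda = exp(b0 + b1*x);     /* log-linear mean */
   model y ~ poisson(lambda);
run;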

Choosing an Optimization Algorithm

The factors that go into choosing a particular optimization technique for a particular problem are complex and may involve trial and error.

For many optimization problems, computing the gradient takes more computer time than computing the function value, and computing the Hessian sometimes takes much more computer time and memory than computing the gradient, especially when there are many decision variables. Unfortunately, optimization techniques that do not use some kind of Hessian approximation usually require many more iterations than techniques that do use a Hessian matrix, and as a result the total run time of these techniques is often longer. Techniques that do not use the Hessian also tend to be less reliable. For example, they can more easily terminate at stationary points rather than at global optima.

A few general remarks about the various optimization techniques are as follows.

Algorithm Descriptions

Some details about the optimization techniques are as follows.

Trust Region Optimization (TRUREG)

The trust region method uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function f(θ) have continuous first- and second-order derivatives inside the feasible region.

The trust region method iteratively optimizes a quadratic approximation to the nonlinear objective function within a hyperelliptic trust region with radius Δ that constrains the step size corresponding to the quality of the quadratic approximation. The trust region method is implemented using Dennis, Gay, and Welsch (1981), Gay (1983), and Moré and Sorensen (1983).

The trust region method performs well for small- to medium-sized problems, and it does not need many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

Newton-Raphson Optimization with Line Search (NEWRAP)

The NEWRAP technique uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region. If second-order derivatives are computed efficiently and precisely, the NEWRAP method may perform well for medium-sized to large problems, and it does not need many function, gradient, and Hessian calls.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. Otherwise, a combination of ridging and line search is performed to compute successful steps. If the Hessian is not positive definite, a multiple of the identity matrix is added to the Hessian matrix to make it positive definite (Eskow and Schnabel 1991).

In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation (LIS=2).

Newton-Raphson Ridge Optimization (NRRIDG)

The NRRIDG technique uses the gradient g(θ^(k)) and the Hessian matrix H(θ^(k)); thus, it requires that the objective function have continuous first- and second-order derivatives inside the feasible region.

This algorithm uses a pure Newton step when the Hessian is positive definite and when the Newton step reduces the value of the objective function successfully. If at least one of these two conditions is not satisfied, a multiple of the identity matrix is added to the Hessian matrix.

The NRRIDG method performs well for small- to medium-sized problems, and it does not require many function, gradient, and Hessian calls. However, if the computation of the Hessian matrix is computationally expensive, one of the (dual) quasi-Newton or conjugate gradient algorithms may be more efficient.

Since the NRRIDG technique uses an orthogonal decomposition of the approximate Hessian, each iteration of NRRIDG can be slower than that of the NEWRAP technique, which works with Cholesky decomposition. Usually, however, NRRIDG requires fewer iterations than NEWRAP.

Quasi-Newton Optimization (QUANEW)

The (dual) quasi-Newton method uses the gradient g(θ^(k)), and it does not need to compute second-order derivatives since they are approximated. It works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian; but, in general, it requires more iterations than the TRUREG, NEWRAP, and NRRIDG techniques, which compute second-order derivatives. QUANEW is the default optimization algorithm because it provides an appropriate balance between the speed and stability required for most nonlinear mixed model applications.

The QUANEW technique is one of the following, depending upon the value of the UPDATE= option.

• the original quasi-Newton algorithm, which updates an approximation of the inverse Hessian

• the dual quasi-Newton algorithm, which updates the Cholesky factor of an approximate Hessian (default)

You can specify four update formulas with the UPDATE= option:

• DBFGS performs the dual Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update of the Cholesky factor of the Hessian matrix. This is the default.

• DDFP performs the dual Davidon, Fletcher, and Powell (DFP) update of the Cholesky factor of the Hessian matrix.

• BFGS performs the original BFGS update of the inverse Hessian matrix.

• DFP performs the original DFP update of the inverse Hessian matrix.

In each iteration, a line search is performed along the search direction to find an approximate optimum. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size α satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit of the step size. Violating the left-side Goldstein condition can affect the positive definiteness of the quasi-Newton update. In that case, either the update is skipped or the iterations are restarted with an identity matrix, resulting in the steepest descent or ascent search direction. You can specify line-search algorithms other than the default with the LIS= option.

The QUANEW algorithm performs its own line-search technique. All options and parameters (except the INSTEP= option) controlling the line search in the other algorithms do not apply here. In several applications, large steps in the first iterations are troublesome. You can use the INSTEP= option to impose an upper bound for the step size α during the first five iterations. You can also use the INHESSIAN[=r] option to specify a different starting approximation for the Hessian. If you specify only the INHESSIAN option, the Cholesky factor of a (possibly ridged) finite difference approximation of the Hessian is used to initialize the quasi-Newton update process. The values of the LCSINGULAR=, LCEPSILON=, and LCDEACT= options, which control the processing of linear and boundary constraints, are valid only for the quadratic programming subroutine used in each iteration of the QUANEW algorithm.
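
The following sketch (data set, variables, and starting values are placeholders) combines several of these controls: the dual BFGS update, a damped first step, and the default line search made explicit:

proc nlmixed data=mydata tech=quanew update=dbfgs instep=0.5 lis=2;
   parms beta0=0 beta1=0 s2u=1;
   eta = beta0 + beta1*x + u;
   p = 1/(1 + exp(-eta));
   model y ~ binary(p);
   random u ~ normal(0, s2u) subject=id;
run;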

Double Dogleg Optimization (DBLDOG)

The double dogleg optimization method combines the ideas of the quasi-Newton and trust region methods. In each iteration, the double dogleg algorithm computes the step s^(k) as the linear combination of the steepest descent or ascent search direction and a quasi-Newton search direction.

The step is requested to remain within a prespecified trust region radius; refer to Fletcher (1987, p. 107). Thus, the DBLDOG subroutine uses the dual quasi-Newton update but does not perform a line search. You can specify two update formulas with the UPDATE= option:

• DBFGS performs the dual Broyden, Fletcher, Goldfarb, and Shanno (BFGS) update of the Cholesky factor of the Hessian matrix. This is the default.

• DDFP performs the dual Davidon, Fletcher, and Powell (DFP) update of the Cholesky factor of the Hessian matrix.

The double dogleg optimization technique works well for medium to moderately large optimization problems where the objective function and the gradient are much faster to compute than the Hessian. The implementation is based on Dennis and Mei (1979) and Gay (1983), but it is extended for dealing with boundary and linear constraints. The DBLDOG technique generally requires more iterations than the TRUREG, NEWRAP, or NRRIDG technique, which requires second-order derivatives; however, each of the DBLDOG iterations is computationally cheap. Furthermore, the DBLDOG technique requires only gradient calls for the update of the Cholesky factor of an approximate Hessian.

Conjugate Gradient Optimization (CONGRA)

Second-order derivatives are not required by the CONGRA algorithm and are not even approximated. The CONGRA algorithm can be expensive in function and gradient calls, but it requires only O(n) memory for unconstrained optimization. In general, many iterations are required to obtain a precise solution, but each of the CONGRA iterations is computationally cheap. You can specify four different update formulas for generating the conjugate directions by using the UPDATE= option:

• PB performs the automatic restart update method of Powell (1977) and Beale (1972). This is the default.

• FR performs the Fletcher-Reeves update.

• PR performs the Polak-Ribiere update.

• CD performs a conjugate-descent update of Fletcher (1987).

The default, UPDATE=PB, behaved best in most test examples. You are advised to avoid the option UPDATE=CD, which behaved worst in most test examples.

The CONGRA subroutine should be used for optimization problems with large n. For the unconstrained or boundary-constrained case, CONGRA requires only O(n) bytes of working memory, whereas all other optimization methods require O(n²) bytes of working memory. During n successive iterations, uninterrupted by restarts or changes in the working set, the conjugate gradient algorithm computes a cycle of n conjugate search directions. In each iteration, a line search is performed along the search direction to find an approximate optimum of the objective function. The default line-search method uses quadratic interpolation and cubic extrapolation to obtain a step size α satisfying the Goldstein conditions. One of the Goldstein conditions can be violated if the feasible region defines an upper limit for the step size. Other line-search algorithms can be specified with the LIS= option.

Nelder-Mead Simplex Optimization (NMSIMP)

The Nelder-Mead simplex method does not use any derivatives and does not assume that the objective function has continuous derivatives. The objective function itself needs to be continuous. This technique is quite expensive in the number of function calls, and it may be unable to generate precise results for n ≫ 40.

The original Nelder-Mead simplex algorithm is implemented and extended to boundary constraints. This algorithm does not compute the objective for infeasible points, but it changes the shape of the simplex, adapting to the nonlinearities of the objective function, which contributes to an increased speed of convergence. It uses special termination criteria.

Finite Difference Approximations of Derivatives

The FD= and FDHESSIAN= options specify the use of finite difference approximations of the derivatives. The FD= option specifies that all derivatives are approximated using function evaluations, and the FDHESSIAN= option specifies that second-order derivatives are approximated using gradient evaluations.

Computing derivatives by finite difference approximations can be very time consuming, especially for second-order derivatives based only on values of the objective function (FD= option). If analytical derivatives are difficult to obtain (for example, if a function is computed by an iterative process), you might consider one of the optimization techniques that uses first-order derivatives only (QUANEW, DBLDOG, or CONGRA).
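
For example (a sketch; the data set and model are placeholders), you can request central-difference approximations for all derivatives:

proc nlmixed data=mydata fd=central;
   parms a=1 b=1 s2e=1;
   pred = a*(1 - exp(-b*x));    /* exponential saturation curve */
   model y ~ normal(pred, s2e);
run;

Specifying FDHESSIAN=CENTRAL instead approximates only the second-order derivatives, using gradient evaluations.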

Forward Difference Approximations

The forward difference derivative approximations consume less computer time, but they are usually not as precise as approximations that use central difference formulas.

Central Difference Approximations

Central difference approximations are usually more precise, but they consume more computer time than approximations that use forward difference derivative formulas.

You can use the FDIGITS= option to specify the number of accurate digits in the evaluation of the objective function. This specification is helpful in determining an appropriate interval size h to be used in the finite difference formulas.

The step sizes h_j, j = 1, ..., n, are defined as follows.

The value of η is defined by the FDIGITS= option:

Hessian Scaling

The rows and columns of the Hessian matrix can be scaled when you use the trust region, Newton-Raphson, and double dogleg optimization techniques. Each element H_{i,j}, i, j = 1, ..., n, is divided by the scaling factor d_i d_j, where the scaling vector d = (d_1, ..., d_n) is iteratively updated in a way specified by the HESCAL=i option, as follows.

i = 0: No scaling is done (equivalent to d_i = 1).

i ≠ 0: First iteration and each restart iteration sets:

i = 1: Refer to Moré (1978):

i = 2: Refer to Dennis, Gay, and Welsch (1981):

i = 3: d_i is reset in each iteration:

In the preceding equations, ε is the relative machine precision or, equivalently, the largest double precision value that, when added to 1, results in 1.
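
For example (placeholders for the data set and model), Moré-type scaling is requested as follows when you use one of the second-order techniques:

proc nlmixed data=mydata tech=trureg hescal=1;
   parms a=1 b=1 s2e=1;
   pred = a*x/(b + x);          /* Michaelis-Menten-type mean function */
   model y ~ normal(pred, s2e);
run;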

Active Set Methods

The parameter vector θ ∈ R^n can be subject to a set of m linear equality and inequality constraints:

The coefficients a_{ij} and right-hand sides b_i of the equality and inequality constraints are collected in the m × n matrix A and the m-vector b.

The m linear constraints define a feasible region G in R^n that must contain the point θ* that minimizes the problem. If the feasible region G is empty, no solution to the optimization problem exists.

In PROC NLMIXED, all optimization techniques use active set methods. The iteration starts with a feasible point θ^(0), which you can provide or which can be computed by the Schittkowski and Stoer (1979) algorithm implemented in PROC NLMIXED. The algorithm then moves from one feasible point θ^(k−1) to a better feasible point θ^(k) along a feasible search direction s^(k),

θ^(k) = θ^(k−1) + α^(k) s^(k),   α^(k) > 0

Theoretically, the path of points θ^(k) never leaves the feasible region G of the optimization problem, but it can reach its boundaries. The active set A^(k) of point θ^(k) is defined as the index set of all linear equality constraints and those inequality constraints that are satisfied at θ^(k). If no constraint is active at θ^(k), the point is located in the interior of G, and the active set A^(k) = ∅ is empty. If the point θ^(k) in iteration k hits the boundary of inequality constraint i, this constraint i becomes active and is added to A^(k). Each equality constraint and each active inequality constraint reduces the dimension (degrees of freedom) of the optimization problem.

In practice, the active constraints can be satisfied only with finite precision. The LCEPSILON=r option specifies the range for active and violated linear constraints. If the point θ^(k) satisfies the condition

| a_i^T θ^(k) − b_i | ≤ t

where t = r (|b_i| + 1), the constraint i is recognized as an active constraint. Otherwise, the constraint i is either an inactive inequality or a violated inequality or equality constraint. Due to rounding errors in computing the projected search direction, error can be accumulated so that an iterate θ^(k) steps out of the feasible region.

In those cases, PROC NLMIXED may try to pull the iterate θ^(k) back into the feasible region. However, in some cases the algorithm needs to increase the feasible region by increasing the LCEPSILON=r value. If this happens, a message is displayed in the log output.

If the algorithm cannot improve the value of the objective function by moving from an active constraint back into the interior of the feasible region, it makes this inequality constraint an equality constraint in the next iteration. This means that the active set A^(k+1) still contains the constraint i. Otherwise, it releases the active inequality constraint and increases the dimension of the optimization problem in the next iteration.

A serious numerical problem can arise when some of the active constraints become (nearly) linearly dependent. PROC NLMIXED removes linearly dependent equality constraints before starting optimization. You can use the LCSINGULAR= option to specify a criterion r used in the update of the QR decomposition that determines whether an active constraint is linearly dependent relative to a set of other active constraints.

If the solution θ* is subject to n_act linear equality or active inequality constraints, the QR decomposition of the n × n_act matrix Â^T of the linear constraints is computed by Â^T = QR, where Q is an n × n orthogonal matrix and R is an n × n_act upper triangular matrix. The n columns of matrix Q can be separated into two matrices, Q = [Y, Z], where Y contains the first n_act orthogonal columns of Q and Z contains the last n − n_act orthogonal columns of Q. The n × (n − n_act) column-orthogonal matrix Z is also called the nullspace matrix of the active linear constraints Â^T. The n − n_act columns of the n × (n − n_act) matrix Z form a basis orthogonal to the rows of the n_act × n matrix Â.

At the end of the iteration process, PROC NLMIXED computes the projected gradient g_Z,

g_Z = Z^T g(θ*)

In the case of boundary-constrained optimization, the elements of the projected gradient correspond to the gradient elements of the free parameters. A necessary condition for θ* to be a local minimum of the optimization problem is

g_Z = Z^T g(θ*) = 0

The symmetric (n − n_act) × (n − n_act) matrix G_Z,

G_Z = Z^T H(θ*) Z

is called a projected Hessian matrix. A second-order necessary condition for θ* to be a local minimizer requires that the projected Hessian matrix is positive semidefinite.

Those elements of the n_act vector λ of first-order estimates of Lagrange multipliers,

that correspond to active inequality constraints indicate whether an improvement of the objective function can be obtained by releasing this active constraint. For minimization, a significant negative Lagrange multiplier indicates that a possible reduction of the objective function can be achieved by releasing this active linear constraint. The LCDEACT=r option specifies a threshold r for the Lagrange multiplier that determines whether an active inequality constraint remains active or can be deactivated. (In the case of boundary-constrained optimization, the Lagrange multipliers for active lower (upper) constraints are the negative (positive) gradient elements corresponding to the active parameters.)
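
As a brief sketch (the data set and variable names are placeholders), boundary constraints are supplied with the BOUNDS statement, and the LCEPSILON= and LCDEACT= options tune how the resulting active set is managed:

proc nlmixed data=mydata lcepsilon=1e-7;
   parms beta0=0 beta1=0 s2u=1 s2e=1;
   bounds s2u >= 0, s2e >= 0;    /* variance components kept nonnegative */
   pred = beta0 + beta1*x + u;
   model y ~ normal(pred, s2e);
   random u ~ normal(0, s2u) subject=id;
run;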

Line-Search Methods

In each iteration k, the (dual) quasi-Newton, conjugate gradient, and Newton-Raphson minimization techniques use iterative line-search algorithms that try to optimize a linear, quadratic, or cubic approximation of f along a feasible descent search direction s^(k),

θ^(k+1) = θ^(k) + α^(k) s^(k),   α^(k) > 0

by computing an approximately optimal scalar α^(k).

Therefore, a line-search algorithm is an iterative process that optimizes a nonlinear function f(α) of one parameter (α) within each iteration k of the optimization technique. Since the outside iteration process is based only on the approximation of the objective function, the inside iteration of the line-search algorithm does not have to be perfect. Usually, it is satisfactory that the choice of α significantly reduces (in a minimization) the objective function. Criteria often used for termination of line-search algorithms are the Goldstein conditions (refer to Fletcher 1987).

You can select various line-search algorithms by specifying the LIS= option. The line-search method LIS=2 seems to be superior when function evaluation consumes significantly less computation time than gradient evaluation. Therefore, LIS=2 is the default method for Newton-Raphson, (dual) quasi-Newton, and conjugate gradient optimizations.

You can modify the line-search methods LIS=2 and LIS=3 to be exact line searches by using the LSPRECISION= option and specifying the σ parameter described in Fletcher (1987). The line-search methods LIS=1, LIS=2, and LIS=3 satisfy the left-hand-side and right-hand-side Goldstein conditions (refer to Fletcher 1987). When derivatives are available, the line-search methods LIS=6, LIS=7, and LIS=8 try to satisfy the right-hand-side Goldstein condition; if derivatives are not available, these line-search algorithms use only function calls.

Restricting the Step Length

Almost all line-search algorithms use iterative extrapolation techniques that can easily lead them to (feasible) points where the objective function f is no longer defined or is difficult to compute. Therefore, PROC NLMIXED provides options restricting the step length α or trust region radius Δ, especially during the first main iterations.

The inner product g^T s of the gradient g and the search direction s is the slope of f(α) = f(θ + αs) along the search direction s. The default starting value α^(0) = α^(k,0) in each line-search algorithm (min_{α>0} f(θ + αs)) during the main iteration k is computed in three steps:

  1. The first step uses either the difference df = f^(k) − f^(k−1) of the function values during the last two consecutive iterations or the final step-size value α^(k−1) of the last iteration k − 1 to compute a first value for α^(k,0).

  2. During the first five iterations, the second step enables you to reduce α^(k,0) to a smaller starting value using the INSTEP=r option:

    α^(k,0) = min(α^(k,0), r)

    After more than five iterations, α^(k,0) is left at the value computed in the first step.

  3. The third step can further reduce the step length by

    α^(k,0) = min(α^(k,0), min(10, u))

    where u is the maximum length of a step inside the feasible region.

The INSTEP=r option enables you to specify a smaller or larger radius Δ of the trust region used in the first iteration of the trust region and double dogleg algorithms. The default initial trust region radius Δ^(0) is the length of the scaled gradient (Moré 1978). This step corresponds to the default radius factor of r = 1. In most practical applications of the TRUREG and DBLDOG algorithms, this choice is successful. However, for bad initial values and highly nonlinear objective functions (such as the EXP function), the default start radius can result in arithmetic overflows. If this happens, you can try decreasing values of INSTEP=r, 0 < r < 1, until the iteration starts successfully. A small factor r also affects the trust region radius Δ^(k+1) of the next steps because the radius is changed in each iteration by a factor 0 < c ≤ 4, depending on the ratio expressing the goodness of the quadratic function approximation. Reducing the radius Δ corresponds to increasing the ridge parameter λ, producing smaller steps directed more closely toward the (negative) gradient direction.

Computational Problems

Floating Point Errors and Overflows

Numerical optimization of a numerically integrated function is a difficult task, and the computation of the objective function and its derivatives can lead to arithmetic exceptions and overflows. A typical cause of these problems is parameters with widely varying scales. If the scaling of your parameters varies by more than a few orders of magnitude, the numerical stability of the optimization problem can be seriously reduced and result in computational difficulties. A simple remedy is to rescale each parameter so that its final estimated value has a magnitude near 1.

If parameter rescaling does not help, consider the following actions:

The line-search algorithms that work with cubic extrapolation are especially sensitive to arithmetic overflows. If an overflow occurs during a line search, you can use the INSTEP= option to reduce the length of the first trial step during the first five iterations, or you can use the DAMPSTEP or MAXSTEP option to restrict the step length of the initial α in subsequent iterations. If an arithmetic overflow occurs in the first iteration of the trust region or double dogleg algorithms, you can use the INSTEP= option to reduce the default trust region radius of the first iteration. You can also change the optimization technique or the line-search method.
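
For example (a sketch with placeholder names), the following call shortens the first trial steps and caps later steps, which can help keep the EXP function in the model from overflowing:

proc nlmixed data=mydata instep=0.1 maxstep=5;
   parms a=1 b=0.1 s2e=1;
   pred = exp(a + b*x);         /* exponential mean prone to overflow for large steps */
   model y ~ normal(pred, s2e);
run;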

Long Run Times

PROC NLMIXED can take a long time to run for problems with complex models, many parameters, or large input data sets. Although the optimization techniques used by PROC NLMIXED are some of the best ones available, they are not guaranteed to converge quickly for all problems. Ill-posed or misspecified models can cause the algorithms to use more extensive calculations designed to achieve convergence, and this can result in longer run times. So first make sure that your model is specified correctly, that your parameters are scaled to be of the same order of magnitude, and that your data reasonably match the model you are contemplating.

If you are using the default adaptive Gaussian quadrature algorithm and no iteration history appears at all, then PROC NLMIXED may be bogged down trying to determine the number of quadrature points at the first set of starting values. Specifying the QPOINTS= option bypasses this stage and proceeds directly to the iterations; however, be aware that the likelihood approximation may not be accurate if there are too few quadrature points.

PROC NLMIXED may also have difficulty determining the number of quadrature points if the initial starting values are far from the optimum values. To obtain more accurate starting values for the model parameters, one easy method is to fit a model with no RANDOM statement. You can then use these estimates as starting values, although you will still need to specify values for the random-effects distribution. For normal-normal models, another strategy is to use METHOD=FIRO. If you can obtain estimates using this approximate method, then they can be used as starting values for more accurate likelihood approximations.

If you are running PROC NLMIXED multiple times, you will probably want to include a statement like the following in your program:

ods output ParameterEstimates=pe;

This statement creates a SAS data set named PE upon completion of the run. In your next invocation of PROC NLMIXED, you can then specify

parms / data=pe;

to read in the previous estimates as starting values.
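
Putting these two pieces together, a typical warm-start workflow might look like the following sketch (the data set MYDATA, the variables, and the choice of a cheaper one-point approximation for the first run are illustrative assumptions):

* First run: a cheap approximation, saving the parameter estimates;
ods output ParameterEstimates=pe;
proc nlmixed data=mydata qpoints=1;
   parms beta0=0 beta1=0 s2u=1;
   eta = beta0 + beta1*x + u;
   p = 1/(1 + exp(-eta));
   model y ~ binary(p);
   random u ~ normal(0, s2u) subject=id;
run;

* Second run: a more accurate fit, starting from the saved estimates;
proc nlmixed data=mydata qpoints=7;
   parms / data=pe;
   eta = beta0 + beta1*x + u;
   p = 1/(1 + exp(-eta));
   model y ~ binary(p);
   random u ~ normal(0, s2u) subject=id;
run;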

To speed general computations, you should check over your programming statements to minimize the number of floating point operations. Using auxiliary variables and factoring amenable expressions can be useful changes in this regard.

Problems Evaluating Code for Objective Function

The starting point θ^(0) must be a point for which the programming statements can be evaluated. However, during optimization, the optimizer may iterate to a point θ^(k) where the objective function or its derivatives cannot be evaluated. In some cases, the specification of boundary constraints for parameters can avoid such situations. In many other cases, you can indicate that such a point is a bad point simply by returning an extremely large value for the objective function. In these cases, the optimization algorithm reduces the step length and stays closer to the point that was evaluated successfully in the previous iteration.
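
One way to do this when you specify the likelihood yourself with the GENERAL distribution is to trap regions where the log likelihood is undefined and return a hugely negative contribution instead. In the sketch below, the data set, variables, and the guard value 1E20 are illustrative assumptions:

proc nlmixed data=mydata;
   parms a=1 b=0.5 s2e=1;
   pred = a*exp(-b*x);
   /* guard: if the variance parameter wanders to a nonpositive value, */
   /* return a very small log likelihood instead of evaluating log(s2e) */
   if s2e > 1e-12 then
      ll = -0.5*((y - pred)**2/s2e + log(2*constant('pi')*s2e));
   else
      ll = -1e20;
   model y ~ general(ll);
run;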

No Convergence

There are a number of things to try if the optimizer fails to converge.

Convergence to Stationary Point

The gradient at a stationary point is the null vector, which always leads to a zero search direction. This point satisfies the first-order termination criterion. Search directions that are based on the gradient are zero, so the algorithm terminates. There are two ways to avoid this situation:

The signs of the eigenvalues of the (reduced) Hessian matrix contain information regarding a stationary point.

Precision of Solution

In some applications, PROC NLMIXED may result in parameter values that are not precise enough. Usually, this means that the procedure terminated at a point too far from the optimal point. The termination criteria define the size of the termination region around the optimal point. Any point inside this region can be accepted for terminating the optimization process. The default values of the termination criteria are set to satisfy a reasonable compromise between the computational effort (computer time) and the precision of the computed estimates for the most common applications. However, there are a number of circumstances in which the default values of the termination criteria specify a region that is either too large or too small.

If the termination region is too large, then it can contain points with low precision. In such cases, you should determine which termination criterion stopped the optimization process. In many applications, you can obtain a solution with higher precision simply by using the old parameter estimates as starting values in a subsequent run in which you specify a smaller value for the termination criterion that was satisfied at the former run.

If the termination region is too small, the optimization process may take longer to find a point inside such a region, or it may not even find such a point due to rounding errors in function values and derivatives. This can easily happen in applications in which finite difference approximations of derivatives are used and the GCONV and ABSGCONV termination criteria are too small to respect rounding errors in the gradient values.

Covariance Matrix

The estimated covariance matrix of the parameter estimates is computed as the inverse Hessian matrix, and for unconstrained problems it should be positive definite. If the final parameter estimates are subject to n_act > 0 active linear inequality constraints, the formulas for the covariance matrices are modified similarly to Gallant (1987) and Cramer (1986, p. 38) and are additionally generalized for applications with singular matrices.

There are several steps available that enable you to tune the rank calculations of the covariance matrix.

  1. You can use the ASINGULAR=, MSINGULAR=, and VSINGULAR= options to set three singularity criteria for the inversion of the Hessian matrix H . The singularity criterion used for the inversion is

    where d_{j,j} is the diagonal pivot of the matrix H, and ASING, VSING, and MSING are the specified values of the ASINGULAR=, VSINGULAR=, and MSINGULAR= options. The default values are

    • ASING: the square root of the smallest positive double precision value

    • MSING: 1E−12 if you do not specify the SINGHESS= option and max(10ε, 1E−4*SINGHESS) otherwise, where ε is the machine precision

    • VSING: 1E−8 if you do not specify the SINGHESS= option and the value of SINGHESS otherwise

    Note that, in many cases, a normalized matrix D^(−1)AD^(−1) is decomposed, and the singularity criteria are modified correspondingly.

  2. If the matrix H is found to be singular in the first step, a generalized inverse is computed. Depending on the G4= option, either a generalized inverse satisfying all four Moore-Penrose conditions is computed or a generalized inverse satisfying only two Moore-Penrose conditions is computed. If the number of parameters n of the application is less than or equal to G4=i, a G4 inverse is computed; otherwise, only a G2 inverse is computed. The G4 inverse is computed by the (computationally very expensive but numerically stable) eigenvalue decomposition, and the G2 inverse is computed by Gauss transformation. The G4 inverse is computed using the eigenvalue decomposition A = ZΛZ^T, where Z is the orthogonal matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues, Λ = diag(λ_1, ..., λ_n). The G4 inverse of H is set to

    where the diagonal matrix is defined using the COVSING= option.

    If you do not specify the COVSING= option, the nr smallest eigenvalues are set to zero, where nr is the number of rank deficiencies found in the first step.

For optimization techniques that do not use second-order derivatives, the covariance matrix is computed using finite difference approximations of the derivatives.

Prediction

The nonlinear mixed model is a useful tool for statistical prediction. Assuming a prediction is to be made regarding the ith subject, suppose that f(θ, u_i) is a differentiable function predicting some quantity of interest. Recall that θ denotes the vector of unknown parameters and u_i denotes the vector of random effects for the ith subject. A natural point prediction is f(θ̂, û_i), where θ̂ is the maximum likelihood estimate of θ and û_i is the empirical Bayes estimate of u_i described previously in Integral Approximations.

An approximate prediction variance matrix for (θ̂, û_i) is

where Ĥ is the approximate Hessian matrix from the optimization for θ̂, Γ̂ is the approximate Hessian matrix from the optimization for û_i, and (∂û_i/∂θ) is the derivative of û_i with respect to θ, evaluated at (θ̂, û_i). The approximate variance matrix for θ̂ is the standard one discussed in the previous section, and that for û_i is an approximation to the conditional mean squared error of prediction described by Booth and Hobert (1998).

The prediction variance for a general scalar function f(θ, u_i) is defined as the expected squared difference E[f(θ̂, û_i) − f(θ, u_i)]². PROC NLMIXED computes an approximation to it as follows. The derivative of f(θ, u_i) is computed with respect to each element of (θ, u_i) and evaluated at (θ̂, û_i). If a_i is the resulting vector, then the approximate prediction variance is a_i^T P a_i, where P denotes the approximate prediction variance matrix for (θ̂, û_i) described previously. This approximation is known as the delta method (Billingsley 1986; Cox 1998).
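
In practice, such predictions and their delta-method standard errors are requested with the PREDICT statement. The following sketch reuses the placeholder growth curve specification from earlier in this section:

proc nlmixed data=growth;
   parms b1=190 b2=700 b3=350 s2u=1000 s2e=60;
   num = b1 + u1;
   den = 1 + exp(-(day - b2)/b3);
   model height ~ normal(num/den, s2e);
   random u1 ~ normal(0, s2u) subject=tree;
   predict num/den out=pred;    /* subject-specific predictions with standard errors */
run;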

Computational Resources

Since nonlinear optimization is an iterative process that depends on many factors, it is difficult to estimate how much computer time is necessary to find an optimal solution satisfying one of the termination criteria. You can use the MAXTIME=, MAXITER=, and MAXFUNC= options to restrict the amount of CPU time, the number of iterations, and the number of function calls in a single run of PROC NLMIXED.
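
For example (placeholder data set and model), the following limits keep a single exploratory run from consuming excessive resources:

proc nlmixed data=mydata maxiter=200 maxfunc=2000 maxtime=3600;
   parms a=2 b=1;
   model y ~ gamma(a, b);       /* simple gamma model for illustration */
run;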

In each iteration k, the NRRIDG technique uses a symmetric Householder transformation to decompose the n × n Hessian matrix H,

H = V′ T V,   with V orthogonal and T tridiagonal,

to compute the (Newton) search direction s,

s^(k) = −H(θ^(k))^(−1) g(θ^(k))

The TRUREG and NEWRAP techniques use the Cholesky decomposition to solve the same linear system while computing the search direction. The QUANEW, DBLDOG, CONGRA, and NMSIMP techniques do not need to invert or decompose a Hessian matrix; thus, they require less computational resources than the other techniques.

The larger the problem, the more time is needed to compute function values and derivatives. Therefore, you may want to compare optimization techniques by counting and comparing the respective numbers of function, gradient, and Hessian evaluations.

Finite difference approximations of the derivatives are expensive because they require additional function or gradient calls:

Many applications need considerably more time for computing second-order derivatives (Hessian matrix) than for computing first-order derivatives (gradient). In such cases, a dual quasi-Newton technique is recommended, which does not require second-order derivatives.

Displayed Output

This section describes the displayed output from PROC NLMIXED. See the section ODS Table Names for details about how this output interfaces with the Output Delivery System.

Specifications

The NLMIXED procedure first displays the Specifications table, listing basic information about the nonlinear mixed model that you have specified. It includes the principal variables and estimation methods.

Dimensions

The Dimensions table lists counts of important quantities in your nonlinear mixed model, including the number of observations, subjects, parameters, and quadrature points.

Parameters

The Parameters table displays the information you provided with the PARMS statement and the value of the negative log likelihood function evaluated at the starting values.

Starting Gradient and Hessian

The START option in the PROC NLMIXED statement displays the gradient of the negative log likelihood function at the starting values of the parameters. If you also specify the HESS option, then the starting Hessian is displayed as well.

Iterations

The iteration history consists of one line of output for each iteration in the optimization process. The iteration history is displayed by default because it is important that you check for possible convergence problems. The default iteration history includes the following variables:

For the QUANEW method, the value of Slope should be significantly negative. Otherwise, the line-search algorithm has difficulty reducing the function value sufficiently. If this difficulty is encountered, an asterisk (*) appears after the iteration number. If there is a tilde (~) after the iteration number, the BFGS update is skipped, and very high values of the Lagrange function are produced. A backslash (\) after the iteration number indicates that Powell's correction for the BFGS update is used.

For methods using second derivatives, an asterisk (*) after the iteration number means that the computed Hessian approximation was singular and had to be ridged with a positive value.

For the NMSIMP method, only one line is displayed for several internal iterations. This technique skips the output for some iterations because some of the termination tests (StdDev and Size) are rather time consuming compared to the simplex operations, and they are performed only every five simplex operations.

The ITDETAILS option in the PROC NLMIXED statement provides a more detailed iteration history. Besides listing the current values of the parameters and their gradients, the following values are provided in addition to the default output:

An apostrophe (') trailing the number of active constraints indicates that at least one of the active constraints was released from the active set due to a significant Lagrange multiplier.

Fitting Information

The Fitting Information table lists the final minimized value of −2 times the log likelihood, the information criteria of Akaike (AIC) and Schwarz (BIC), and a finite-sample corrected version of AIC (AICC). The criteria are computed as follows:

AIC  = 2 f(θ̂) + 2p
AICC = 2 f(θ̂) + 2pn/(n − p − 1)
BIC  = 2 f(θ̂) + p log(s)

where f() is the negative of the marginal log likelihood function, θ̂ is the vector of parameter estimates, p is the number of parameters, n is the number of observations, and s is the number of subjects. Refer to Hurvich and Tsai (1989) and Burnham and Anderson (1998) for additional details.

Parameter Estimates

The Parameter Estimates table lists the estimates of the parameter values after successful convergence of the optimization problem or the final values of the parameters under nonconvergence. If the problem did converge, standard errors are computed from the final Hessian matrix. The ratio of the estimate to its standard error produces a t value, with approximate degrees of freedom computed as the number of subjects minus the number of random effects. A p-value and confidence limits based on this t distribution are also provided. Finally, the gradient of the negative log likelihood function is displayed for each parameter, and you should verify that each gradient is sufficiently small for nonconstrained parameters.

Covariance and Correlation Matrices

Following standard maximum likelihood theory (for example, Serfling 1980), the asymptotic variance-covariance matrix of the parameter estimates equals the inverse of the Hessian matrix. You can display this matrix with the COV option in the PROC NLMIXED statement. The corresponding correlation form is available with the CORR option.

Additional Estimates

The Additional Estimates table displays the results of all ESTIMATE statements that you specify, with the same columns as the Parameter Estimates table. The ECOV and ECORR options in the PROC NLMIXED statement produce tables displaying the approximate covariance and correlation matrices of the additional estimates. They are computed using the delta method (Billingsley 1986; Cox 1998). The EDER option in the PROC NLMIXED statement produces a table displaying the derivatives of the additional estimates with respect to the model parameters evaluated at their final estimated values.
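
For example (the parameter names follow the placeholder growth curve sketch given earlier), you can request derived quantities and their delta-method covariance matrix as follows:

proc nlmixed data=growth ecov;
   parms b1=190 b2=700 b3=350 s2u=1000 s2e=60;
   num = b1 + u1;
   den = 1 + exp(-(day - b2)/b3);
   model height ~ normal(num/den, s2e);
   random u1 ~ normal(0, s2u) subject=tree;
   estimate 'inflection point' b2;           /* day at which half the asymptote is reached */
   estimate 'variance ratio' s2u/s2e;        /* between- to within-subject variance */
run;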

ODS Table Names

PROC NLMIXED assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, Using the Output Delivery System.

Table 51.2: ODS Tables Produced in PROC NLMIXED

ODS Table Name        Description                                   Statement or Option
AdditionalEstimates   Results from ESTIMATE statements              ESTIMATE
ConvergenceStatus     Convergence status                            default
CorrMatAddEst         Correlation matrix of additional estimates    ECORR
CorrMatParmEst        Correlation matrix of parameter estimates     CORR
CovMatAddEst          Covariance matrix of additional estimates     ECOV
CovMatParmEst         Covariance matrix of parameter estimates      COV
DerAddEst             Derivatives of additional estimates           EDER
Dimensions            Dimensions of the problem                     default
FitStatistics         Fit statistics                                default
Hessian               Second derivative matrix                      HESS
IterHistory           Iteration history                             default
Parameters            Parameters                                    default
ParameterEstimates    Parameter estimates                           default
Specifications        Model specifications                          default
StartingHessian       Starting Hessian matrix                       START HESS
StartingValues        Starting values and gradient                  START
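
For example, the following statements (a sketch; MYDATA and the trivial model are placeholders) save the fit statistics and the covariance matrix of the parameter estimates to data sets:

ods output FitStatistics=fit CovMatParmEst=covb;
proc nlmixed data=mydata cov;
   parms mu=0 s2=1;
   model y ~ normal(mu, s2);
run;

The COV option must be specified for the CovMatParmEst table to be produced.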
