SAS/STAT 9.1 User's Guide (Vol. 6)

2017-07-07 02:10:07

CLUSTER | CLUSTERS variables ;

The CLUSTER statement names variables that identify the clusters in a clustered sample design. The combinations of categories of CLUSTER variables define the clusters in the sample. If there is a STRATA statement, clusters are nested within strata.

If your sample design has clustering at multiple stages, you should identify only the first-stage clusters, or primary sampling units (PSUs), in the CLUSTER statement. See the section 'Primary Sampling Units (PSUs)' on page 4281 for more information.

The CLUSTER variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the CLUSTER variables determine the CLUSTER variable levels. Thus, you can use formats to group values into levels. Refer to the discussion of the FORMAT procedure in the SAS Procedures Guide and to the discussions of the FORMAT statement and SAS formats in SAS Language Reference: Dictionary.

You can use multiple CLUSTER statements to specify cluster variables. The procedure uses all variables from all CLUSTER statements to create clusters.
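As a minimal sketch of how the CLUSTER statement fits into a stratified, clustered design, the following statements declare strata and first-stage clusters. The data set work.ToySurvey and the variables Region, School, Student, Exposure, Response, and SamplingWeight are hypothetical and are simulated only so that the step runs.

```sas
/* Hypothetical simulated data: 3 strata (Region) x 4 PSUs (School) x 10 students */
data work.ToySurvey;
   do Region = 1 to 3;
      do School = 1 to 4;
         do Student = 1 to 10;
            Exposure = 10 * ranuni(20170707);
            /* binary response with logit(p) = -1 + 0.3*Exposure */
            Response = (ranuni(20170707) < 1 / (1 + exp(1 - 0.3 * Exposure)));
            SamplingWeight = 5 + Region;
            output;
         end;
      end;
   end;
run;

proc surveylogistic data=work.ToySurvey;
   strata Region;            /* first-stage strata */
   cluster School;           /* PSUs; School values repeat but are nested within Region */
   weight SamplingWeight;
   model Response(event='1') = Exposure;
run;
```

Because clusters are nested within strata, School 1 in Region 1 and School 1 in Region 2 are treated as different clusters.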

CONTRAST Statement

CONTRAST 'label' row-description < , ... , row-description > < / options > ;

where a row-description is: effect values < , ... , effect values >

The CONTRAST statement provides a mechanism for obtaining customized hypothesis tests. It is similar to the CONTRAST statement in PROC LOGISTIC and PROC GLM, depending on the coding schemes used with any classification variables involved.

The CONTRAST statement enables you to specify a matrix, L, for testing the hypothesis $L\beta = 0$, where $\beta$ is the parameter vector. You must be familiar with the details of the model parameterization that PROC SURVEYLOGISTIC uses (for more information, see the PARAM= option in the section 'CLASS Statement' on page 4253). Optionally, the CONTRAST statement enables you to estimate each row, $l_i\beta$, of $L\beta$ and test the hypothesis $l_i\beta = 0$. Computed statistics are based on the asymptotic chi-square distribution of the Wald statistic.

There is no limit to the number of CONTRAST statements that you can specify, but they must appear after the MODEL statement.

The following parameters are specified in the CONTRAST statement:

label	identifies the contrast on the output. A label is required for every contrast specified, and it must be enclosed in quotes.
effect	identifies an effect that appears in the MODEL statement. The name INTERCEPT can be used as an effect when one or more intercepts are included in the model. You do not need to include all effects that are included in the MODEL statement.
values	are constants that are elements of the L matrix associated with the effect. To correctly specify your contrast, it is crucial to know the ordering of parameters within each effect and the variable levels associated with any parameter. The 'Class Level Information' table shows the ordering of levels within variables. The E option, described later in this section, enables you to verify the proper correspondence of values to parameters.

The rows of L are specified in order and are separated by commas. Multiple degree-of-freedom hypotheses can be tested by specifying multiple row-descriptions. For any of the full-rank parameterizations, if an effect is not specified in the CONTRAST statement, all of its coefficients in the L matrix are set to 0. If too many values are specified for an effect, the extra ones are ignored. If too few values are specified, the remaining ones are set to 0.

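When you use effect coding (by default or by specifying PARAM=EFFECT in the CLASS statement), all parameters are directly estimable (involve no other parameters). For example, suppose an effect-coded CLASS variable A has four levels. Then there are three parameters ($\alpha_1$, $\alpha_2$, $\alpha_3$) representing the first three levels, and the fourth parameter is represented by

$$-\alpha_1 - \alpha_2 - \alpha_3$$

To test the first versus the fourth level of A, you would test

$$\alpha_1 = -\alpha_1 - \alpha_2 - \alpha_3$$

or, equivalently,

$$2\alpha_1 + \alpha_2 + \alpha_3 = 0$$

which, in the form $L\beta = 0$, is

$$\begin{bmatrix} 2 & 1 & 1 \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix} = 0$$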

Therefore, you would use the following CONTRAST statement:

contrast '1 vs. 4' A 2 1 1;

To contrast the third level with the average of the first two levels, you would test

$$\frac{\alpha_1 + \alpha_2}{2} = \alpha_3$$

or, equivalently,

$$\alpha_1 + \alpha_2 - 2\alpha_3 = 0$$

Therefore, you would use the following CONTRAST statement:

contrast '1&2 vs. 3' A 1 1 -2;

Other CONTRAST statements are constructed similarly. For example,

contrast '1 vs. 2    ' A 1 -1 0;
contrast '1&2 vs. 4  ' A 3  3 2;
contrast '1&2 vs. 3&4' A 2  2 0;
contrast 'Main Effect' A 1  0 0,
                       A 0  1 0,
                       A 0  0 1;
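The coefficients for the '1&2 vs. 4' and '1&2 vs. 3&4' contrasts can be checked by the same substitution, writing the fourth-level effect as $-\alpha_1 - \alpha_2 - \alpha_3$ under effect coding. The derivation below is a worked check under that assumption, using $\alpha_1$, $\alpha_2$, $\alpha_3$ for the three effect-coded parameters of A.

```latex
\begin{align*}
\text{1\&2 vs. 4:}\quad
  \tfrac{1}{2}(\alpha_1 + \alpha_2) &= -\alpha_1 - \alpha_2 - \alpha_3
  \;\Longleftrightarrow\; 3\alpha_1 + 3\alpha_2 + 2\alpha_3 = 0 \\
\text{1\&2 vs. 3\&4:}\quad
  \tfrac{1}{2}(\alpha_1 + \alpha_2) &= \tfrac{1}{2}\bigl(\alpha_3 + (-\alpha_1 - \alpha_2 - \alpha_3)\bigr)
  \;\Longleftrightarrow\; 2\alpha_1 + 2\alpha_2 = 0
\end{align*}
```

The resulting rows, (3, 3, 2) and (2, 2, 0), are the values that appear after A in the corresponding CONTRAST statements.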

When you use the less-than-full-rank parameterization (by specifying PARAM=GLM in the CLASS statement), each row is checked for estimability. If PROC SURVEYLOGISTIC finds a contrast to be nonestimable, it displays missing values in corresponding rows in the results.

PROC SURVEYLOGISTIC handles missing level combinations of classification variables in the same manner as PROC LOGISTIC. Parameters corresponding to missing level combinations are not included in the model. This convention can affect the way in which you specify the L matrix in your CONTRAST statement. If the elements of L are not specified for an effect that contains a specified effect, then the elements of the specified effect are distributed over the levels of the higher-order effect just as the LOGISTIC procedure does for its CONTRAST and ESTIMATE statements. For example, suppose that the model contains effects A and B and their interaction A*B. If you specify a CONTRAST statement involving A alone, the L matrix contains nonzero terms for both A and A*B, since A*B contains A.

The degrees of freedom is the number of linearly independent constraints implied by the CONTRAST statement, that is, the rank of L.

You can specify the following options after a slash (/).

ALPHA=$\alpha$

sets the confidence level for confidence limits. The value of the ALPHA= option must be between 0 and 1, and the default value is 0.05. A confidence level of $\alpha$ produces $100(1-\alpha)\%$ confidence limits. The default of ALPHA=0.05 produces 95% confidence limits.

E

requests that the L matrix be displayed.

ESTIMATE=keyword

requests that each individual contrast (that is, each row, $l_i\beta$, of $L\beta$) or exponentiated contrast ($e^{l_i\beta}$) be estimated and tested. PROC SURVEYLOGISTIC displays the point estimate, its standard error, a Wald confidence interval, and a Wald chi-square test for each contrast. The significance level of the confidence interval is controlled by the ALPHA= option. You can estimate the contrast or the exponentiated contrast ($e^{l_i\beta}$), or both, by specifying one of the following keywords:

PARM	specifies that the contrast itself be estimated
EXP	specifies that the exponentiated contrast be estimated
BOTH	specifies that both the contrast and the exponentiated contrast be estimated

SINGULAR=number

tunes the estimability check. This option is ignored when the full-rank parameterization is used. If $v$ is a vector, define ABS($v$) to be the largest absolute value of the elements of $v$. For a row vector $l'$ of the contrast matrix L, define $c$ to be equal to ABS($l$) if ABS($l$) is greater than 0; otherwise, $c$ equals 1. If ABS($l' - l'T$) is greater than $c$ * number, then $l$ is declared nonestimable. The $T$ matrix is the Hermite form matrix $I_0^{-} I_0$, where $I_0^{-}$ represents a generalized inverse of the information matrix $I_0$ of the null model. The value for number must be between 0 and 1; the default value is 1E-4.
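The following statements sketch how the CONTRAST options can be combined. The data set work.ToyContrast and the variables PSU, Unit, A, and Y are hypothetical and simulated for illustration; the contrast row is the '1 vs. 4' example for an effect-coded, four-level variable A.

```sas
/* Hypothetical simulated data: CLASS variable A with four levels */
data work.ToyContrast;
   do PSU = 1 to 20;
      do Unit = 1 to 10;
         A = 1 + mod(Unit, 4);                   /* levels 1 to 4 */
         Y = (ranuni(4253) < 0.2 + 0.1 * A);     /* binary response */
         output;
      end;
   end;
run;

proc surveylogistic data=work.ToyContrast;
   cluster PSU;
   class A / param=effect;
   model Y(event='1') = A;
   /* E displays the L matrix; ESTIMATE=BOTH reports the contrast and its
      exponentiated value; ALPHA=0.1 requests 90% confidence limits */
   contrast '1 vs. 4' A 2 1 1 / e estimate=both alpha=0.1;
run;
```

Checking the displayed L matrix against the 'Class Level Information' table is a quick way to confirm that each value lines up with the intended parameter.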

FREQ Statement

FREQ variable ;

The variable in the FREQ statement identifies a variable that contains the frequency of occurrence of each observation. PROC SURVEYLOGISTIC treats each observation as if it appears n times, where n is the value of the FREQ variable for the observation. If it is not an integer, the frequency value is truncated to an integer. If the frequency value is less than 1 or missing, the observation is not used in the model fitting. When the FREQ statement is not specified, each observation is assigned a frequency of 1.

If you use the events/trials syntax in the MODEL statement, the FREQ statement is disallowed because the event and trial variables represent the frequencies in the data set.
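A minimal sketch of the FREQ statement follows, assuming aggregated input in which each row stands for several identical observations. The data set work.ToyFreq, its variables, and the counts are made up for illustration.

```sas
/* Hypothetical aggregated data: Count identical observations per row */
data work.ToyFreq;
   input Exposure Response Count;
   datalines;
0 0 33
0 1 12
1 0 18
1 1 27
;
run;

proc surveylogistic data=work.ToyFreq;
   freq Count;     /* each row is treated as Count observations */
   model Response(event='1') = Exposure;
run;
```

A row with a Count of 12.7 would be used as 12 observations, and a row with a Count below 1 or a missing Count would be dropped from the fit.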

MODEL Statement

MODEL events/trials = < effects >< / options > ;

MODEL variable < ( variable_options ) > = < effects >< /options > ;

The MODEL statement names the response variable and the explanatory effects, including covariates, main effects, interactions, and nested effects; see the section 'Specification of Effects' on page 1784 of Chapter 32, 'The GLM Procedure,' for more information. If you omit the explanatory variables, the procedure fits an intercept-only model. Model options can be specified after a slash (/).

Two forms of the MODEL statement can be specified. The first form, referred to as single-trial syntax, is applicable to binary, ordinal, and nominal response data. The second form, referred to as events/trials syntax, is restricted to the case of binary response data. The single-trial syntax is used when each observation in the DATA= data set contains information on only a single trial, for instance, a single subject in an experiment. When each observation contains information on multiple binary-response trials, such as the counts of the number of subjects observed and the number responding, then events/trials syntax can be used.

In the events/trials syntax, you specify two variables that contain count data for a binomial experiment. These two variables are separated by a slash. The value of the first variable, events , is the number of positive responses (or events). The value of the second variable, trials , is the number of trials. The values of both events and ( trials - events ) must be nonnegative and the value of trials must be positive for the response to be valid.

In the single-trial syntax, you specify one variable (on the left side of the equal sign) as the response variable. This variable can be character or numeric. Options specific to the response variable can be specified immediately after the response variable with a pair of parentheses around them.

For both forms of the MODEL statement, explanatory effects follow the equal sign. Variables can be either continuous or classification variables. Classification variables can be character or numeric, and they must be declared in the CLASS statement. When an effect is a classification variable, the procedure enters a set of coded columns into the design matrix instead of directly entering a single column containing the values of the variable.
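The events/trials form can be sketched as follows. The data set work.ToyGrouped and the variables PSU, Dose, Events, and Trials are hypothetical, with made-up counts chosen so that Events and Trials minus Events are nonnegative and Trials is positive in every row.

```sas
/* Hypothetical grouped data: Events positive responses out of Trials per row */
data work.ToyGrouped;
   input PSU Dose Events Trials;
   datalines;
1 1 2 20
1 2 5 20
2 1 3 25
2 2 9 25
3 1 1 15
3 2 6 15
;
run;

proc surveylogistic data=work.ToyGrouped;
   cluster PSU;
   model Events/Trials = Dose;   /* events/trials syntax: binary response only */
run;
```

With one row per subject instead, the single-trial form would name a single response variable on the left side of the equal sign.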

Response Variable Options

You specify the following options by enclosing them in a pair of parentheses after the response variable.

DESCENDING | DESC

reverses the order of response categories. If both the DESCENDING and ORDER= options are specified, PROC SURVEYLOGISTIC orders the response categories according to the ORDER= option and then reverses that order. See the 'Response Level Ordering' section on page 4269 for more detail.

EVENT='category' | keyword

specifies the event category for the binary response model. PROC SURVEYLOGISTIC models the probability of the event category. The EVENT= option has no effect when there are more than two response categories. You can specify the value (formatted if a format is applied) of the event category in quotes or you can specify one of the following keywords. The default is EVENT=FIRST.

FIRST

designates the first ordered category as the event

LAST

designates the last ordered category as the event

One of the most common sets of response levels is {0,1}, with 1 representing the event for which the probability is to be modeled. Consider the example where Y takes the values 1 and 0 for event and nonevent, respectively, and Exposure is the explanatory variable. To specify the value 1 as the event category, use the MODEL statement

model Y(event='1') = Exposure;

ORDER=DATA | FORMATTED | FREQ | INTERNAL

specifies the sorting order for the levels of the response variable. By default, ORDER=FORMATTED. For FORMATTED and INTERNAL, the sort order is machine dependent.

When the default ORDER=FORMATTED is in effect for numeric variables for which you have supplied no explicit format, the levels are ordered by their internal values.

The following table shows the interpretation of the ORDER= values.

Value of ORDER=	Levels Sorted By
DATA	order of appearance in the input data set
FORMATTED	external formatted value, except for numeric variables with no explicit format, which are sorted by their unformatted (internal) value
FREQ	descending frequency count; levels with the most observations come first in the order
INTERNAL	unformatted value

For more information on sorting order, see the chapter on the SORT procedure in the SAS Procedures Guide and the discussion of BY-group processing in SAS Language Reference: Concepts.

REFERENCE='category' | keyword

REF='category' | keyword

specifies the reference category for the generalized logit model and the binary response model. For the generalized logit model, each nonreference category is contrasted with the reference category. For the binary response model, specifying one response category as the reference is the same as specifying the other response category as the event category. You can specify the value (formatted if a format is applied) of the reference category in quotes or you can specify one of the following keywords. The default is REF=LAST.

FIRST

designates the first ordered category as the reference

LAST

designates the last ordered category as the reference
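As a sketch of the REF= response option with a nominal response, the following statements fit a generalized logit model in which each nonreference category is contrasted with 'Car'. The data set work.ToyNominal and the variables PSU, Unit, Age, and Choice are hypothetical and simulated.

```sas
/* Hypothetical simulated data: three-level nominal response Choice */
data work.ToyNominal;
   length Choice $ 5;
   do PSU = 1 to 15;
      do Unit = 1 to 12;
         Age = 20 + 40 * ranuni(4269);
         u = ranuni(4269);
         if u < 0.3 then Choice = 'Bus';
         else if u < 0.6 then Choice = 'Car';
         else Choice = 'Train';
         output;
      end;
   end;
   drop u;
run;

proc surveylogistic data=work.ToyNominal;
   cluster PSU;
   /* 'Car' is the reference category instead of the default REF=LAST */
   model Choice(ref='Car') = Age / link=glogit;
run;
```

Without REF='Car', the default REF=LAST would use 'Train', the last category in formatted order.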

Model Options

Model options can be specified after a slash (/). Table 69.1 summarizes the options available in the MODEL statement.

Table 69.1: Model Statement Options
Option	Description
Model Specification Options
LINK=	Specifies link function
NOINT	Suppresses intercept(s)
OFFSET=	Specifies offset variable
Convergence Criterion Options
ABSFCONV=	Specifies absolute function convergence criterion
FCONV=	Specifies relative function convergence criterion
GCONV=	Specifies relative gradient convergence criterion
XCONV=	Specifies relative parameter convergence criterion
MAXITER=	Specifies maximum number of iterations
NOCHECK	Suppresses checking for infinite parameters
RIDGING=	Specifies technique used to improve the log-likelihood function when its value is worse than that of the previous step
SINGULAR=	Specifies tolerance for testing singularity
TECHNIQUE=	Specifies iterative algorithm for maximization
Options for Adjustment to Variance Estimation
VADJUST=	Specifies variance estimation adjustment method
Options for Confidence Intervals
ALPHA=	Specifies $\alpha$ for the $100(1-\alpha)\%$ confidence intervals
CLPARM	Computes confidence intervals for parameters
CLODDS	Computes confidence intervals for odds ratios
Options for Display of Details
CORRB	Displays correlation matrix
COVB	Displays covariance matrix
EXPB	Displays exponentiated values of estimates
ITPRINT	Displays iteration history
NODUMMYPRINT	Suppresses 'Class Level Information' table
PARMLABEL	Displays parameter labels
RSQUARE	Displays generalized $R^2$
STB	Displays standardized estimates

The following list describes these options.

ABSFCONV=value

specifies the absolute function convergence criterion. Convergence requires a small change in the log-likelihood function in subsequent iterations,

$$|l^{(i)} - l^{(i-1)}| < \text{value}$$

where $l^{(i)}$ is the value of the log-likelihood function at iteration $i$. See the section 'Convergence Criteria' on page 4277.

ALPHA=$\alpha$

sets the level of significance $\alpha$ for $100(1-\alpha)\%$ confidence intervals for regression parameters or odds ratios. The value $\alpha$ must be between 0 and 1. By default, $\alpha$ is equal to the value of the ALPHA= option in the PROC SURVEYLOGISTIC statement, or $\alpha = 0.05$ if the option is not specified. This option has no effect unless confidence limits for the parameters or odds ratios are requested.

CLODDS

requests confidence intervals for the odds ratios. Computation of these confidence intervals is based on individual Wald tests. The confidence coefficient can be specified with the ALPHA= option. See the 'Wald Confidence Intervals for Parameters' section on page 4288 for more information.

CLPARM

requests confidence intervals for the parameters. Computation of these confidence intervals is based on the individual Wald tests. The confidence coefficient can be specified with the ALPHA= option. See the 'Wald Confidence Intervals for Parameters' section on page 4288 for more information.
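The following statements sketch how CLPARM, CLODDS, and ALPHA= work together. The data set work.ToyCL and the variables PSU, Unit, X, and Y are hypothetical and simulated.

```sas
/* Hypothetical simulated data: 12 PSUs with 10 units each */
data work.ToyCL;
   do PSU = 1 to 12;
      do Unit = 1 to 10;
         X = 10 * ranuni(4288);
         Y = (ranuni(4288) < 1 / (1 + exp(1 - 0.3 * X)));
         output;
      end;
   end;
run;

proc surveylogistic data=work.ToyCL;
   cluster PSU;
   /* ALPHA=0.1 gives 90% Wald limits for both parameters and odds ratios */
   model Y(event='1') = X / clparm clodds alpha=0.1;
run;
```

If neither CLPARM nor CLODDS is specified, the ALPHA= value has no visible effect on the output.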

CORRB

displays the correlation matrix of the parameter estimates.

COVB

displays the covariance matrix of the parameter estimates.

EXPB

EXPEST

displays the exponentiated values ($e^{\hat\beta_i}$) of the parameter estimates $\hat\beta_i$ in the 'Analysis of Maximum Likelihood Estimates' table for the logit model. These exponentiated values are the estimated odds ratios for the parameters corresponding to the continuous explanatory variables.

FCONV=value

specifies the relative function convergence criterion. Convergence requires a small relative change in the log-likelihood function in subsequent iterations,

$$\frac{|l^{(i)} - l^{(i-1)}|}{|l^{(i-1)}| + 10^{-6}} < \text{value}$$

where $l^{(i)}$ is the value of the log-likelihood at iteration $i$. See the section 'Convergence Criteria' on page 4277.

GCONV=value

specifies the relative gradient convergence criterion. Convergence requires that the normalized prediction function reduction is small,

$$\frac{g^{(i)\prime}\,[I^{(i)}]^{-1}\,g^{(i)}}{|l^{(i)}| + 10^{-6}} < \text{value}$$

where $l^{(i)}$ is the value of the log-likelihood function, $g^{(i)}$ is the gradient vector, and $I^{(i)}$ is the (expected) information matrix. All of these functions are evaluated at iteration $i$. This is the default convergence criterion, and the default value is 1E-8. See the section 'Convergence Criteria' on page 4277.

ITPRINT

displays the iteration history of the maximum-likelihood model fitting. The ITPRINT option also displays the last evaluation of the gradient vector and the final change in the -2 Log Likelihood.

LINK= keyword

L= keyword

specifies the link function linking the response probabilities to the linear predictors. You can specify one of the following keywords. The default is LINK=LOGIT.

CLOGLOG	the complementary log-log function. PROC SURVEYLOGISTIC fits the binary complementary log-log model for binary response and fits the cumulative complementary log-log model when there are more than two response categories. Aliases: CCLOGLOG, CCLL, CUMCLOGLOG.
GLOGIT	the generalized logit function. PROC SURVEYLOGISTIC fits the generalized logit model where each nonreference category is contrasted with the reference category. You can use the response variable option REF= to specify the reference category.
LOGIT	the cumulative logit function. PROC SURVEYLOGISTIC fits the binary logit model when there are two response categories and fits the cumulative logit model when there are more than two response categories. Aliases: CLOGIT, CUMLOGIT.
PROBIT	the inverse standard normal distribution function. PROC SURVEYLOGISTIC fits the binary probit model when there are two response categories and fits the cumulative probit model when there are more than two response categories. Aliases: NORMIT, CPROBIT, CUMPROBIT.

See the section 'Link Functions and the Corresponding Distributions' on page 4273 for details.

MAXITER= n

specifies the maximum number of iterations to perform. By default, MAXITER=25. If convergence is not attained in n iterations, the displayed output created by the procedure contains results that are based on the last maximum likelihood iteration.
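The convergence options can be combined in one MODEL statement, as in this sketch. The data set work.ToyFit and the variables PSU, Unit, X, and Y are hypothetical and simulated; the option values are arbitrary illustrations, not recommendations.

```sas
/* Hypothetical simulated data: 12 PSUs with 10 units each */
data work.ToyFit;
   do PSU = 1 to 12;
      do Unit = 1 to 10;
         X = 10 * ranuni(4277);
         Y = (ranuni(4277) < 1 / (1 + exp(1 - 0.3 * X)));
         output;
      end;
   end;
run;

proc surveylogistic data=work.ToyFit;
   cluster PSU;
   /* tighter relative gradient criterion, more iterations allowed,
      and the iteration history displayed */
   model Y(event='1') = X / gconv=1e-10 maxiter=50 itprint;
run;
```

The iteration history from ITPRINT shows whether the fit stopped because the criterion was met or because the MAXITER= limit was reached.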

NOCHECK

disables the checking process to determine whether maximum likelihood estimates of the regression parameters exist. If you are sure that the estimates are finite, this option can reduce the execution time if the estimation takes more than eight iterations. For more information, see the 'Existence of Maximum Likelihood Estimates' section on page 4277.

NODUMMYPRINT

NODESIGNPRINT

NODP

suppresses the 'Class Level Information' table, which shows how the design matrix columns for the CLASS variables are coded.

NOINT

suppresses the intercept for the binary response model or the first intercept for the ordinal response model.

OFFSET= name

names the offset variable. The regression coefficient for this variable will be fixed at 1.

PARMLABEL

displays the labels of the parameters in the 'Analysis of Maximum Likelihood Estimates' table.

RIDGING=ABSOLUTE | RELATIVE | NONE

specifies the technique used to improve the log-likelihood function when its value in the current iteration is less than that in the previous iteration. If you specify the RIDGING=ABSOLUTE option, the diagonal elements of the negative (expected) Hessian are inflated by adding the ridge value. If you specify the RIDGING=RELATIVE option, the diagonal elements are inflated by a factor of 1 plus the ridge value. If you specify the RIDGING=NONE option, the crude line search method of taking half a step is used instead of ridging. By default, RIDGING=RELATIVE.

RSQUARE

RSQ

requests a generalized $R^2$ measure for the fitted model. For more information, see the 'Generalized Coefficient of Determination' section on page 4280.

SINGULAR= value

specifies the tolerance for testing the singularity of the Hessian matrix (Newton-Raphson algorithm) or the expected value of the Hessian matrix (Fisher-scoring algorithm). The Hessian matrix is the matrix of second partial derivatives of the log likelihood. The test requires that a pivot for sweeping this matrix be at least this number times a norm of the matrix. Values of the SINGULAR= option must be numeric. By default, SINGULAR=1E-12.

STB

displays the standardized estimates for the parameters for the continuous explanatory variables in the 'Analysis of Maximum Likelihood Estimates' table. The standardized estimate of $\beta_i$ is given by $\hat\beta_i / (s/s_i)$, where $s_i$ is the total sample standard deviation for the $i$th explanatory variable and

$$s = \begin{cases} \pi/\sqrt{3} & \text{logistic} \\ 1 & \text{normal} \\ \pi/\sqrt{6} & \text{extreme-value} \end{cases}$$

For the intercept parameters and parameters associated with a CLASS variable, the standardized estimates are set to missing.

TECHNIQUE=FISHER | NEWTON

TECH=FISHER | NEWTON

specifies the optimization technique for estimating the regression parameters. NEWTON (or NR) is the Newton-Raphson algorithm and FISHER (or FS) is the Fisher-scoring algorithm. Both techniques yield the same estimates, but the estimated covariance matrices are slightly different except for the case when the LOGIT link is specified for binary response data. The default is TECHNIQUE=FISHER. See the section 'Iterative Algorithms for Model-Fitting' on page 4275 for details.

VADJUST=DF | MOREL | NONE < ( Morel-options ) >

VARADJ=DF | MOREL | NONE < ( Morel-options ) >

VARADJUST=DF | MOREL | NONE < ( Morel-options ) >

specifies an adjustment to the variance estimation (on page 4286) for the regression coefficients.

By default, PROC SURVEYLOGISTIC uses the degrees of freedom adjustment VADJUST=DF.

You can specify the VADJUST=MOREL option for the variance adjustment proposed by Morel (1989).

If you do not wish to use any variance adjustment, you can specify the VADJUST=NONE option.

You can specify the following Morel-options within parentheses after the VADJUST=MOREL option.

ADJBOUND=$\phi$
- sets the upper bound coefficient $\phi$ in the variance adjustment. This upper bound must be positive. By default, the procedure uses $\phi = 0.5$. See the section 'Adjustments to the Variance Estimation' on page 4286 for more details on how this upper bound is used in the variance estimation.

DEFFBOUND=$\delta$
- sets the lower bound $\delta$ of the estimated design effect in the variance adjustment. This lower bound must be positive. By default, the procedure uses $\delta = 1$. See the section 'Adjustments to the Variance Estimation' on page 4286 for more details on how this lower bound is used in the variance estimation.
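A sketch of the Morel adjustment with both Morel-options follows. The data set work.ToyMorel and its variables are hypothetical and simulated, the bound values are arbitrary, and the space-separated form of the options inside the parentheses is an assumption about the syntax.

```sas
/* Hypothetical simulated data: 3 strata x 4 PSUs x 10 units */
data work.ToyMorel;
   do Region = 1 to 3;
      do School = 1 to 4;
         do Student = 1 to 10;
            X = 10 * ranuni(4286);
            Y = (ranuni(4286) < 1 / (1 + exp(1 - 0.3 * X)));
            output;
         end;
      end;
   end;
run;

proc surveylogistic data=work.ToyMorel;
   strata Region;
   cluster School;
   /* Morel (1989) adjustment with non-default bounds (illustrative values) */
   model Y(event='1') = X / vadjust=morel(adjbound=0.3 deffbound=1.5);
run;
```

Replacing the option with VADJUST=NONE in the same step gives unadjusted variance estimates for comparison.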

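XCONV=value

specifies the relative parameter convergence criterion. Convergence requires a small relative parameter change in subsequent iterations,

$$\max_j |\delta_j^{(i)}| < \text{value}$$

where

$$\delta_j^{(i)} = \begin{cases} \beta_j^{(i)} - \beta_j^{(i-1)} & \text{if } |\beta_j^{(i-1)}| < 0.01 \\ \dfrac{\beta_j^{(i)} - \beta_j^{(i-1)}}{\beta_j^{(i-1)}} & \text{otherwise} \end{cases}$$

and $\beta_j^{(i)}$ is the estimate of the $j$th parameter at iteration $i$. See the section 'Iterative Algorithms for Model-Fitting' on page 4275.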

STRATA Statement

STRATA | STRATUM variables < / option > ;

The STRATA statement names variables that form the strata in a stratified sample design. The combinations of levels of STRATA variables define the strata in the sample.

If your sample design has stratification at multiple stages, you should identify only the first-stage strata in the STRATA statement. See the section 'Specification of Population Totals and Sampling Rates' on page 4280 for more information.

The STRATA variables are one or more variables in the DATA= input data set. These variables can be either character or numeric. The formatted values of the STRATA variables determine the levels. Thus, you can use formats to group values into levels. See the discussion of the FORMAT procedure in the SAS Procedures Guide.

You can specify the following option in the STRATA statement after a slash (/):

LIST

displays a 'Stratum Information' table, which includes values of the STRATA variables and sampling rates for each stratum. This table also provides the number of observations and number of clusters for each stratum and analysis variable. See the section 'Displayed Output' on page 4292 for more details.
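The LIST option can be sketched as follows. The data set work.ToyStrata and the variables Region, School, Student, X, and Y are hypothetical and simulated.

```sas
/* Hypothetical simulated data: 3 strata x 4 PSUs x 10 units */
data work.ToyStrata;
   do Region = 1 to 3;
      do School = 1 to 4;
         do Student = 1 to 10;
            X = 10 * ranuni(4292);
            Y = (ranuni(4292) < 1 / (1 + exp(1 - 0.3 * X)));
            output;
         end;
      end;
   end;
run;

proc surveylogistic data=work.ToyStrata;
   strata Region / list;    /* requests the 'Stratum Information' table */
   cluster School;
   model Y(event='1') = X;
run;
```

With this design, the table should contain one row per Region, each showing the counts of observations and clusters in that stratum.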

TEST Statement

< label: > TEST equation1 < , ... , < equationk >> < / option > ;

The TEST statement tests linear hypotheses about the regression coefficients. The Wald test is used to jointly test the null hypotheses ($H_0\colon L\beta = c$) specified in a single TEST statement. When $c = 0$ you should specify a CONTRAST statement instead.

Each equation specifies a linear hypothesis (a row of the L matrix and the corresponding element of the c vector); multiple equations are separated by commas. The label, which must be a valid SAS name, is used to identify the resulting output and should always be included. You can submit multiple TEST statements.

The form of an equation is as follows:

term < ± term ... > < = ± term < ± term ... >>

where term is a parameter of the model, or a constant, or a constant times a parameter. For a binary response model, the intercept parameter is named INTERCEPT; for an ordinal response model, the intercept parameters are named INTERCEPT, INTERCEPT2, INTERCEPT3, and so on. When no equal sign appears, the expression is set to 0. The following code illustrates possible uses of the TEST statement:

proc surveylogistic;
   model y= a1 a2 a3 a4;
   test1: test intercept + .5 * a2 = 0;
   test2: test intercept + .5 * a2;
   test3: test a1=a2=a3;
   test4: test a1=a2, a2=a3;
run;

Note that the first and second TEST statements are equivalent, as are the third and fourth TEST statements.

You can specify the following option in the TEST statement after a slash (/).

PRINT

displays intermediate calculations in the testing of the null hypothesis $H_0\colon L\beta = c$. This includes $L\hat{V}(\hat\beta)L'$ bordered by $(L\hat\beta - c)$ and $[L\hat{V}(\hat\beta)L']^{-1}$ bordered by $[L\hat{V}(\hat\beta)L']^{-1}(L\hat\beta - c)$, where $\hat\beta$ is the maximum likelihood estimator of $\beta$ and $\hat{V}(\hat\beta)$ is the estimated covariance matrix of $\hat\beta$.

For more information, see the 'Testing Linear Hypotheses about the Regression Coefficients' section on page 4288.
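As a sketch of a TEST statement with a nonzero right-hand side, the following statements test whether the slope equals 0.3 and display the intermediate calculations. The data set work.ToyTest, the variables PSU, Unit, X, and Y, and the label Slope are hypothetical.

```sas
/* Hypothetical simulated data: 12 PSUs with 10 units each */
data work.ToyTest;
   do PSU = 1 to 12;
      do Unit = 1 to 10;
         X = 10 * ranuni(42881);
         Y = (ranuni(42881) < 1 / (1 + exp(1 - 0.3 * X)));
         output;
      end;
   end;
run;

proc surveylogistic data=work.ToyTest;
   cluster PSU;
   model Y(event='1') = X;
   /* the slope of X is tested against the constant 0.3 */
   Slope: test X = 0.3 / print;
run;
```

Because the constant on the right side is not 0, this hypothesis calls for a TEST statement rather than a CONTRAST statement.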
