SAS/STAT 9.1 Users Guide Volume 2 only

Missing Values

Observations with missing values for any variable listed in the MODEL or POPULATION statement are omitted from the analysis.

If the WEIGHT variable for an observation has a missing value, the observation is by default omitted from the analysis. You can modify this behavior by specifying the MISSING= option in the MODEL statement. The option MISSING= value sets all missing weights to value and all missing cells to value . The option MISSING=SAMPLING causes all missing cells in a contingency table to be treated as sampling zeros.

Any observation with nonpositive weight is also, by default, omitted from the analysis. If it has zero weight, then you can specify the ZERO= option in the MODEL statement.

Input Data Sets

Data to be analyzed by PROC CATMOD must be in a SAS data set containing one of the following:

If you specify a WEIGHT statement, then PROC CATMOD uses the values of the WEIGHT variable as the frequency counts. If the READ function is specified in the RESPONSE statement, then the procedure expects the input data set to contain the values of response functions and their covariance matrix. Otherwise, PROC CATMOD assumes that the SAS data set contains raw data values.

Raw Data Values

If you use raw data, PROC CATMOD first counts the number of observations having each combination of values for all variables specified in the MODEL or POPULATION statements. For example, suppose the variables A and B each take on the values 1 and 2, and their frequencies can be represented as follows .

 

A=1

A=2

B=1

2

1

B=2

3

1

The SAS data set Raw containing the raw data might be as follows.

Observation

A

B

1

1

1

2

1

1

3

1

2

4

1

2

5

1

2

6

2

1

7

2

2

And the statements for PROC CATMOD would be

proc catmod data=Raw; model A=B; run;

For discussions of how to handle structural and random zeros with raw data as input data, see the 'Zero Frequencies' section on page 888 and Example 22.5 on page 919.

Frequency Counts

If your data set contains frequency counts, then use the WEIGHT statement in PROC CATMOD to specify the variable containing the frequencies. For example, you could create the Summary data set as follows.

data Summary; input A B Count; datalines; 1 1 2 1 2 3 2 1 1 2 2 1 ;

In this case, the corresponding statements would be

proc catmod data=Summary; weight Count; model A=B; run;

The data set Summary can also be created from data set Raw by using the FREQ procedure:

proc freq data=Raw; tables A*B / out=Summary; run;

Inputting Response Functions and Covariances Directly

If you want to read in the response functions and their covariance matrix, rather than have PROC CATMOD compute them, create a TYPE=EST data set. In addition to having one variable name for each function, the data set should have two additional variables: _TYPE_ and _NAME_ , both character variables of length 8. The variable _TYPE_ should have the value 'PARMS' when the observation contains the response functions; it should have the value 'COV' when the observation contains elements of the covariance matrix of the response functions. The variable _NAME_ is used only when _TYPE_ =COV, in which case it should contain the name of the variable that has its covariance elements stored in that observation. In the following data set, for example, the covariance between the second and fourth response functions is 0.000102.

data direct(type=est); input b1-b4 _type_ $ _name_ .; datalines; 0.590463 0.384720 0.273269 0.136458 PARMS . 0.001690 0.000911 0.000474 0.000432 COV B1 0.000911 0.001823 0.000031 0.000102 COV B2 0.000474 0.000031 0.001056 0.000477 COV B3 0.000432 0.000102 0.000477 0.000396 COV B4 ;

In order to tell PROC CATMOD that the input data set contains the values of response functions and their covariance matrix,

For example, suppose the response functions correspond to four populations that represent the cross-classification of two age groups by two race groups. You can use the FACTORS statement to identify these two factors and to name the effects in the model. The statements required to fit a main-effects model to these data are

proc catmod data=direct; response read b1-b4; model _f_=_response_; factors age 2, race 2 / _response_=age race; run;

Ordering of Populations and Responses

By default, populations and responses are sorted in standard SAS order as follows:

Suppose you specify the following statements:

data one; length A B $ 6; input A $ B $ wt @@; datalines; low low 23 low medium 31 low high 38 medium low 40 medium medium 42 medium high 50 high low 52 high medium 54 high high 61 ; proc catmod; weight wt; model A=B / oneway; run;

The ordering of populations and responses corresponds to the alphabetical order of the levels of the character variables. You can specify the ONEWAY option to display the ordering of the variables, while the 'Population Profiles' and 'Response Profiles' tables display the ordering of the populations and the responses, respectively.

Population Profiles

Response Profiles

Sample

B

Response

A

1

high

1

high

2

low

2

low

3

medium

3

medium

However, in this example, you may want to have the levels ordered in the natural order of ˜low,' ˜medium,' ˜high.' If you specify the ORDER=DATA option

proc catmod order=data; weight wt; model a=b / oneway; run;

then the ordering of populations and responses is as follows.

Population Profiles

Response Profiles

Sample

B

Response

A

1

low

1

low

2

medium

2

medium

3

high

3

high

Thus, you can use the ORDER=DATA option to ensure that populations and responses are ordered in a specific way. But since this also affects the definitions and the ordering of the parameters, you must exercise caution when using the _RESPONSE_ effect, the CONTRAST statement, or direct input of the design matrix.

An alternative method of ensuring that populations and responses are ordered in a specific way is to assign a format to your variables and specify the ORDER=FORMATTED option. The levels will be ordered according to their formatted values.

Another method is to replace any character variables with numeric variables and to assign formatted values such as ˜yes' and ˜no' to the numeric levels. Since ORDER=INTERNAL is the default ordering, PROC CATMOD orders the populations and responses according to the numeric values but displays the formatted values.

Specification of Effects

By default, the CATMOD procedure treats all variables as classification variables. As a result, there is no CLASS statement in PROC CATMOD. The values of a classification variable can be numeric or character. PROC CATMOD builds a set of effects-coded variables to represent the levels of the classification variable and then uses these to fit the model (for details, see the 'Generation of the Design Matrix' section on page 876). You can modify the default by using the DIRECT statement to treat numeric independent continuous variables as continuous variables. The classification variables, combinations of classification variables, and continuous variables arethenusedinfitting linear models to data.

The parameters of a linear model are generally divided into subsets that correspond to meaningful sources of variation in the response functions. These sources, called effects , can be specified in the MODEL, LOGLIN, FACTORS, REPEATED, and CONTRAST statements. Effects can be specified in any of the following ways:

The variables for crossed and nested effects remain in the order in which they are first encountered . For example, in the model

model R=B A A*B C(A B);

the effect A*B is reported as B*A since B appeared before A in the statement. Also, C(A B) is interpreted as C(A*B) and is therefore reported as C(B*A) .

Bar Notation

You can shorten the specification of multiple effects by using bar notation. For example, two methods of writing a full three-way factorial model are

proc catmod; model y=a b c a*b a*c b*c a*b*c; run;

and

proc catmod; model y=abc; run;

When you use the bar () notation, the right- and left-hand sides become effects, and the interaction between them becomes an effect. Multiple bars are permitted. The expressions are expanded from left to right, using rules 1 through 4 given in Searle (1971, p. 390):

You can also specify the maximum number of variables involved in any effect that results from bar evaluation by specifying that maximum number, preceded by an @ sign, at the end of the bar effect. For example, the specification A B C @ 2 would result in only those effects that contain 2 or fewer variables; in this case, the effects A, B, A*B, C, A*C , and B*C are generated.

Other examples of the bar notation are

A C(B)

is equivalent to

AC(B) A*C(B)

A(B)C(B)

is equivalent to

A(B) C(B) A*C(B)

A(B)B(DE)

is equivalent to

A(B) B(DE)

A B(A)C

is equivalent to

A B(A) C A*C B*C(A)

A B(A)C @2

is equivalent to

A B(A) C A*C

A B C D @2

is equivalent to

A B A*B C A*C B*C D A*D B*D C*D

For details on how the effects specified lead to a design matrix, see the 'Generation of the Design Matrix' section on page 876.

Output Data Sets

OUT= Data Set

For each population, the OUT= data set contains the observed and predicted values of the response functions, their standard errors, the residuals, and variables that describe the population and response profiles. In addition, if you use the standard response functions, the data set includes observed and predicted values for the cell frequencies or the cell probabilities, together with their standard errors and residuals.

Number of Observations

For the standard response functions, there are s — (2 q ˆ’ 1) observations in the data set for each BY group , where s is the number of populations, and q is the number of response functions per population. Otherwise, there are s — q observations in the data set for each BY group.

Variables in the OUT= Data Set

The data set contains the following variables:

BY variables

If you use a BY statement, the BY variables are included in the OUT= data set.

dependent variables

If the response functions are the default ones (generalized logits), then the dependent variables, which describe the response profiles, are included in the OUT= data set. When _TYPE_ =FUNCTION, the values of these variables are missing.

independent variables

The independent variables, which describe the population profiles, are included in the OUT= data set.

_NUMBER_

the sequence number of the response function or the cell probability or the cell frequency

_OBS_

the observed value

_PRED_

the predicted value

_RESID_

the residual (observed - predicted)

_SAMPLE_

the population number. This matches the sample number in the Population Profile section of the output.

_SEOBS_

the standard error of the observed value

_SEPRED_

the standard error of the predicted value

_TYPE_

specifies a character variable with three possible values. When _TYPE_ =FUNCTION, the observed and predicted values are values of the response functions. When _TYPE_ =PROB, they are values of the cell probabilities. When _TYPE_ =FREQ, they are values of the cell frequencies. Cell probabilities or frequencies are provided only when the default response functions are modeled . In this case, cell probabilities are provided by default, and cell frequencies are provided if you specify the option PRED=FREQ.

OUTEST= Data Set

This TYPE=EST output data set contains the estimated parameter vector and its estimated covariance matrix. If you specify both the ML and WLS options in the MODEL statement, the OUTEST= data set contains both sets of estimates. For each BY group, there are p + 1 observations in the data set for each estimation method, where p is the number of estimated parameters. The data set contains the following variables.

B1, B2, and so on

variables for the estimated parameters. The OUTEST= data set contains one variable for each estimated parameter.

BY variables

If you use a BY statement, the BY variables are included in the OUT= data set.

_METHOD_

the method used to obtain parameter estimates. For weighted least-squares estimation, _METHOD_ =WLS, and for maximum likelihood estimation, _METHOD_ =ML.

_NAME_

identifies parameter names . When _TYPE_ =PARMS, _NAME_ is blank, but when _TYPE_ =COV, _NAME_ has one of the values B1, B2, and so on, corresponding to the parameter names.

_STATUS_

indicates whether the estimates have converged

_TYPE_

identifies the statistics contained in the variables for parameter estimates (B1, B2, and so on). When _TYPE_ =PARMS, the variables contain parameter estimates; when _TYPE_ =COV, they contain covariance estimates.

The variables _METHOD_, _NAME_ , and _TYPE_ are character variables; the BY variables can be either character or numeric; and the variables for estimated parameters are numeric.

See Appendix A, 'Special SAS Data Sets,' for more information on special SAS data sets.

Logistic Analysis

In a logistic analysis, the response functions are the logits of the dependent variable.

PROC CATMOD can compute three different types of logits with the use of keywords in the RESPONSE statement. Other types of response functions can be generated by specifying appropriate transformations in the RESPONSE statement.

If the dependent variable has only two responses, then the cumulative logit and the adjacent-category logit are the negative of the generalized logit, as computed by PROC CATMOD. Consequently, parameter estimates obtained using these logits are the negative of those obtained using generalized logits. A simple logistic analysis of variance uses statements like the following:

proc catmod; model r=ab; run;

Logistic Regression

If the independent variables are treated quantitatively (like continuous variables), then a logistic analysis is known as a logistic regression . If you want PROC CATMOD to treat the independent variables as quantitative variables, specify them in both the DIRECT and MODEL statements, as follows.

proc catmod; direct x1 x2 x3; model r=x1 x2 x3; run;

Since the preceding statements do not include a RESPONSE statement, generalized logits are computed. See Example 22.3 for another example.

The parameter estimates from the CATMOD procedure are the same as those from a logistic regression program such as PROC LOGISTIC (see Chapter 42, 'The LOGISTIC Procedure'). The chi-square statistics and the predicted values are also identical. In the binary response case, PROC CATMOD can be made to model the probability of the maximum value by either (1) organizing the input data so that the maximum value occurs first and specifying ORDER=DATA in the PROC CATMOD statement or (2) specifying cumulative logits (CLOGITS) in the RESPONSE statement.

CAUTION: Computational difficulties may occur if you use a continuous variable with a large number of unique values in a DIRECT statement. See the 'Continuous Variables' section on page 870 for more details.

Cumulative Logits

If your dependent variable is ordinally scaled, you can specify the analysis of cumulative logits that take into account the ordinal nature of the dependent variable:

proc catmod; response clogits; direct x; model r=a x; run;

The preceding statements correspond to a simple analysis that addresses the question of existence of an association between the independent variables and the ordinal dependent variable. However, there are some commonly used models for the analysis of ordinal data (Agresti 1984) that address the structure of association (in terms of odds ratios), as well as its existence.

If the independent variables are class variables, a typical analysis for such a model uses the following statements:

proc catmod; weight wt; response clogits; model r=_response_ a b; run;

On the other hand, if the independent variables are ordinally scaled, you might specify numeric scores in variables x1 and x2 , and use the following statements:

proc catmod; weight wt; direct x1 x2; response clogits; model r=_response_ x1 x2; run;

Refer to Agresti (1984) for additional details of estimation, testing, and interpretation.

Continuous Variables

Computational difficulties may occur if you have a continuous variable with a large number of unique values and you use this variable in a DIRECT statement, since an observation often represents a separate population of size one. At this extreme of sparseness, the weighted least-squares method is inappropriate since there are too many zero frequencies. Therefore, you should use the maximum likelihood method. PROC CATMOD is not designed optimally for continuous variables; therefore, it may be less efficient and unable to allocate sufficient memory to handle this problem, as compared with a procedure designed specifically to handle continuous data. In these situations, consider using the LOGISTIC, GENMOD, or PROBIT procedure to analyze your data.

Log-Linear Model Analysis

When the response functions are the default generalized logits, then inclusion of the keyword _RESPONSE_ in every effect in the right-hand side of the MODEL statement induces a log-linear model. The keyword _RESPONSE_ tells PROC CATMOD that you want to model the variation among the dependent variables. You then specify the actual model in the LOGLIN statement.

When you perform log-linear model analysis, you can request weighted least-squares estimates, maximum likelihood estimates, or both. By default, PROC CATMOD calculates maximum likelihood estimates when the default response functions are used. The following table provides appropriate MODEL statements for the combinations of types of estimates.

Estimation Desired

MODEL Statement

Maximum likelihood (Newton-Raphson)

model a*b=_response_;

Maximum likelihood (Iterative Proportional Fitting)

model a*b=_response_ / ml=ipf;

Weighted least squares

model a*b=_response_ / wls;

Maximum likelihood and weighted least squares

model a*b=_response_ / wls ml;

CAUTION: sampling zeros in the input data set should be specified with the ZERO= option to ensure that these sampling zeros are not treated as structural zeros. Alternatively, you can replace cell counts for sampling zeros by some positive number close to zero (such as 1E-20) in a DATA step. Data containing sampling zeros should be analyzed with maximum likelihood estimation. See the ' Cautions ' section on page 887 and Example 22.5 on page 919 for further information and an illustration for both cell count data and raw data.

One Population

The usual log-linear model analysis has one population, which means that all of the variables are dependent variables. For example, the statements

proc catmod; weight wt; model r1*r2=_response_; loglin r1r2; run;

yield a maximum likelihood analysis of a saturated log-linear model for the dependent variables r1 and r2 .

If you want to fit a reduced model with respect to the dependent variables (for example, a model of independence or conditional independence), specify the reduced model in the LOGLIN statement. For example, the statements

proc catmod; weight wt; model r1*r2=_response_ / pred; loglin r1 r2; run;

yield a main-effects log-linear model analysis of the factors r1 and r2 . The output includes Wald statistics for the individual effects r1 and r2 , as well as predicted cell probabilities. Moreover, the goodness-of-fit statistic is the likelihood ratio test for the hypothesis of independence between r1 and r2 or, equivalently, a test of r1*r2 .

Multiple Populations

You can do log-linear model analysis with multiple populations by using a POPULATION statement or by including effects on the right-hand side of the MODEL statement that contain independent variables. Each effect must include the _RESPONSE_ keyword.

For example, suppose the dependent variables r1 and r2 are dichotomous, and the independent variable group has three levels. Then

proc catmod; weight wt; model r1*r2=_response_ group*_response_; loglin r1r2; run;

specifies a saturated model (three degrees of freedom for _RESPONSE_ and six degrees of freedom for the interaction between _RESPONSE_ and group ). From another point of view, _RESPONSE_* group can be regarded as a main effect for group with respect to the three response functions, while _RESPONSE_ can be regarded as an intercept effect with respect to the functions. In other words, these statements give essentially the same results as the logistic analysis:

proc catmod; weight wt; model r1*r2=group; run;

The ability to model the interaction between the independent and the dependent variables becomes particularly useful when a reduced model is specified for the dependent variables. For example,

proc catmod; weight wt; model r1*r2=_response_ group*_response_; loglin r1 r2; run;

specifies a model with two degrees of freedom for _RESPONSE_ (one for r1 and one for r2 ) and four degrees of freedom for the interaction of _RESPONSE_* group . The likelihood ratio goodness-of-fit statistic (three degrees of freedom) tests the hypothesis that r1 and r2 are independent in each of the three groups.

Iterative Proportional Fitting

You can use the iterative proportional fitting (IPF) algorithm to fit a hierarchical log-linear model with no independent variables and no population variables.

The advantage of IPF over the Newton-Raphson (NR) algorithm and over the weighted least squares (WLS) method is that, when the contingency table has several dimensions and the parameter vector is large, you can obtain the log-likelihood, the goodness-of-fit G 2 , and the predicted frequencies or probabilities without performing potentially expensive parameter estimation and covariance matrix calculations. This enables you to

Each iteration of the IPF algorithm is generally faster than an iteration of the NR algorithm; however, the IPF algorithm converges to the MLEs more slowly than the NR algorithm. Both NR and WLS are more general methods that are able to perform more complex analyses than IPF can.

Repeated Measures Analysis

If there are multiple dependent variables and the variables represent repeated measurements of the same observational unit, then the variation among the dependent variables can be attributed to one or more repeated measurement factors. The factors can be included in the model by specifying _RESPONSE_ on the right-hand side of the MODEL statement and using a REPEATED statement to identify the factors.

To perform a repeated measures analysis, you also need to specify a RESPONSE statement, since the standard response functions (generalized logits) cannot be used. Typically, the MEANS or MARGINALS response functions are specified in a repeated measures analysis, but other response functions may also be reasonable.

One Population

Consider an experiment in which each subject is measured at three times, and the response functions are marginal probabilities for each of the dependent variables. If the dependent variables each has k levels, then PROC CATMOD computes k ˆ’ 1 response functions for each time. Differences among the response functions with respect to these times could be attributed to the repeated measurement factor Time . To incorporate the Time variation into the model, specify

proc catmod; response marginals; model t1*t2*t3=_response_; repeated Time 3 / _response_=Time; run;

These statements induce a Time effect that has 2( k ˆ’ 1) degrees of freedom since there are k ˆ’ 1 response functions at each time point. Thus, for a dichotomous variable, the Time effect has two degrees of freedom.

Now suppose that at each time point, each subject has X-rays taken, and the X-rays are read by two different radiologists. This creates six dependent variables that represent the 3 — 2 cross-classification of the repeated measurement factors Time and Reader . A saturated model with respect to these factors can be obtained by specifying

proc catmod; response marginals; model r11*r12*r21*r22*r31*r32=_response_; repeated Time 3, Reader 2 / _response_=Time Reader Time*Reader; run;

If you want to fit a main-effects model with respect to Time and Reader , then change the REPEATED statement to

repeated Time 3, Reader 2 / _response_=Time Reader;

If you want to fit a main-effects model for Time but for only one of the readers, the REPEATED statement might look like

repeated Time $ 3, Reader $ 2 /_response_=Time(Reader=Smith) profile =('1' Smith, '1' Jones, '2' Smith, '2' Jones, '3' Smith, '3' Jones);

If Jones had been unavailable for a reading at time 3, then there would be only 5( k ˆ’ 1) response functions, even though PROC CATMOD would be expecting some multiple of 6 (= 3 — 2). In that case, the PROFILE= option would be necessary to indicate which repeated measurement profiles were actually represented:

repeated Time $ 3, Reader $ 2 /_response_=Time(Reader=Smith) profile =('1' Smith, '1' Jones, '2' Smith, '2' Jones, '3' Smith);

When two or more repeated measurement factors are specified, PROC CATMOD presumes that the response functions are ordered so that the levels of the rightmost factor change most rapidly . This means that the dependent variables should be specified in the same order. For this example, the order implied by the REPEATED statement is as follows, where the variable r ij corresponds to Time i and Reader j .

Response Function

Dependent Variable

Time

Reader

1

r 11

1

1

2

r 12

1

2

3

r 21

2

1

4

r 22

2

2

5

r 31

3

1

6

r 32

3

2

Thus, the order of dependent variables in the MODEL statement must agree with the order implied by the REPEATED statement.

Multiple Populations

When there are variables specified in the POPULATION statement or in the righthand side of the MODEL statement, these variables induce multiple populations. PROC CATMOD can then model these independent variables, the repeated measurement factors, and the interactions between the two.

For example, suppose that there are five groups of subjects, that each subject in the study is measured at three different times, and that the dichotomous dependent variables are labeled t1, t2 , and t3 . The following statements induce the computation of three response functions for each population:

proc catmod; weight wt; population Group; response marginals; model t1*t2*t3=_response_; repeated Time / _response_=Time; run;

PROC CATMOD then regards _RESPONSE_ as a variable with three levels corresponding to the three response functions in each population and forms an effect with two degrees of freedom. The MODEL and REPEATED statements tell PROC CATMOD to fit the main effect of Time .

In general, the MODEL statement tells PROC CATMOD how to integrate the independent variables and the repeated measurement factors into the model. For example, again suppose that there are five groups of subjects, that each subject is measured at three times, and that the dichotomous independent variables are labeled t1, t2 , and t3 . If you use the same WEIGHT, POPULATION, RESPONSE, and REPEATED statements as in the preceding program, the following MODEL statements result in the indicated analyses:

model t1*t2*t3=Group / averaged;

specifies the Group main effect (with four degrees of freedom).

model t1*t2*t3=_response_;

specifies the Time main effect (with two degrees of freedom).

model t1*t2*t3=_response_*Group;

specifies the interaction between Time and Group (with eight degrees of freedom).

model t1*t2*t3=_response_Group;

specifies both main effects, and the interaction between Time and Group (with a total of fourteen degrees of freedom).

model t1*t2*t3=_response_(Group);

specifies a Time main effect within each Group (with ten degrees of freedom).

However, the following MODEL statement is invalid since effects cannot be nested within _RESPONSE_:

model t1*t2*t3=Group(_response_);

Generation of the Design Matrix

Each row of the design matrix (corresponding to a population) is generated by a unique combination of independent variable values. Each column of the design matrix corresponds to a model parameter. The columns are produced from the effect specifications in the MODEL, LOGLIN, FACTORS, and REPEATED statements. For details on effect specifications, see the 'Specification of Effects' section on page 864.

This section is divided into three parts :

This section assumes that the default effect parameterization is used. Specifying the reference parameterization replaces the ' ˆ’ 1's with zeros in the design matrix for the main effects of classification variables, and makes appropriate changes to interaction terms.

You can display the design matrix by specifying the DESIGN option in the MODEL statement.

One Response Function Per Population

Intercept

When there is one response function per population, all design matrices start with a column of 1s for the intercept unless the NOINT option is specified or the design matrix is input directly.

Main Effects

If a class variable A has k levels, then its main effect has k ˆ’ 1 degrees of freedom, and the design matrix has k ˆ’ 1 columns that correspond to the first k ˆ’ 1 levels of A . The i th column contains a 1 in the i th row, a ˆ’ 1 in the last row, and 0s everywhere else. If ± i denotes the parameter that corresponds to the i th level of variable A , then the k ˆ’ 1 columns yield estimates of the independent parameters, ± 1 , ± i ,..., ± k ˆ’ 1 . The last parameter is not needed because PROC CATMOD constrains the k parameters to sum to zero. In other words, PROC CATMOD uses a full-rank center-point parameterization to build design matrices. Here are two examples.

Data Levels

Design Columns

A

A

1

1

2

1

3

ˆ’ 1

ˆ’ 1

B

B

1

1

2

ˆ’ 1

For an effect with three levels, such as A , PROC CATMOD produces two parameter estimates for each response function. By default, the first (corresponding to the first row in the 'Design Columns') estimates the effect of level 1 of A compared to the average effect of the three levels of A. The second (corresponding to the second row in the 'Design Columns') estimates the effect of level 2 of A compared to the average effect of the three levels of A. The sum-to-zero constraint requires the effect of level 3 of A to be the negative of the sum of the level 1 and 2 effects (as shown by the third row in the 'Design Columns').

Crossed Effects (Interactions)

Crossed effects (such as A*B ) are formed by the horizontal direct products of main effects, as illustrated in the following table.

Data Levels

Design Matrix Columns

A

B

A

B

A*B

1

1

1

1

1

1

2

1

ˆ’ 1

ˆ’ 1

2

1

1

1

1

2

2

1

ˆ’ 1

ˆ’ 1

3

1

ˆ’ 1

ˆ’ 1

1

ˆ’ 1

ˆ’ 1

3

2

ˆ’ 1

ˆ’ 1

ˆ’ 1

1

1

The number of degrees of freedom for a crossed effect (that is, the number of design matrix columns) is equal to the product of the numbers of degrees of freedom for the separate effects.

Nested Effects

The effect A(B) is read ' A within B ' and is the same as specifying an A main effect for every value of B . If n a and n b are the number of levels in A and B , respectively, then the number of columns for A(B) is( n a ˆ’ 1) n b when every combination of levels exists in the data. The following table gives an example.

Data Levels

Design Matrix Columns

B

A

A(B)

1

1

1

1

2

1

1

3

ˆ’ 1

ˆ’ 1

2

1

1

2

2

1

2

3

ˆ’ 1

ˆ’ 1

CAUTION: PROC CATMOD actually allocates a column for all possible combinations of values even though some combinations may not be present in the data. This may be of particular concern if the data are not balanced with respect to the nested levels.

Nested-by-value Effects

Instead of nesting an effect within all values of the main effect, you can nest an effect within specified values of the nested variable ( A ( B =1), for example). The four degrees of freedom for the A(B) effect shown in the preceding section can also be obtained by specifying the two separate nested effects with values.

Data Levels

Design Matrix Columns

B

A

A ( B =1)

A ( B =2)

1

1

1

1

2

1

1

3

ˆ’ 1

ˆ’ 1

2

1

1

2

2

1

2

3

ˆ’ 1

ˆ’ 1

Each effect has n a ˆ’ 1 degrees of freedom, assuming a complete combination. Thus, for the example, each effect has two degrees of freedom.

The procedure compares nested values to data values on the basis of formatted values. If a format is not specified for the variable, the procedure formats internal data values to BEST16, left-justified. The nested values specified in nested-by-value effects are also converted to a BEST16 formatted value, left-justified.

For example, if the numeric variable B has internal data values 1 and 2, then A ( B =1), A ( B =1.0), and A ( B =1E0) are all valid nested-by-value effects. However, if the data value 1 is formatted as ˜one', then A ( B ='one') is a valid effect, but A ( B =1) is not since the formatted nested value (1) does not match the formatted data value (one).

To ensure correct nested-by-value effects, look at the tables of population and response profiles. These are displayed by default, and they contain the formatted data values. In addition, the population and response profiles are displayed when you specify the ONEWAY option in the MODEL statement.

Direct Effects

To request that the actual values of a variable be inserted into the design matrix, declare the variable in a DIRECT statement, and specify the effect by the variable name. For example, specifying the effects X1 and X2 in both the MODEL and DIRECT statements results in the following.

Data Levels

Design Columns

X1

X2

X1

X2

1

1

1

1

2

4

2

4

3

9

3

9

Unless there is a POPULATION statement that excludes the direct variables, the direct variables help to define the sample populations. In general, the variables should not be continuous in the sense that every subject has a different value because this would induce a separate population for each subject (note, however, that such a strategy is used purposely for logistic regression).

If there is a POPULATION statement that omits mention of the direct variables, then the values of the direct variables must be identical for all subjects in a given population since there can only be one independent variable profile for each population.

Two or More Response Functions Per Population

When there is more than one response function per population, the structure of the design matrix depends on whether or not the model type is AVERAGED (see the AVERAGED option on page 842). The model type is AVERAGED if independent variable effects are averaged over the multiple responses within a population, rather than being nested in them.

The following subsections illustrate the effect of specifying (or not specifying) an AVERAGED model type. This section does not apply to log-linear models; for these models, see the 'Log-Linear Model Design Matrices' section on page 884.

Model Type Not AVERAGED

Ordering of Parameters

Model Type AVERAGED

Log-Linear Model Design Matrices

   

Design Matrix

Sample

Number Response Function

X

Y

X*Y

1

1

2

2

1

2

2

ˆ’ 2

1

3

2

ˆ’ 2

Design matrices for reduced models are constructed similarly. For example, suppose you request a main-effects log-linear model analysis of the factors X and Y :

proc catmod; model X*Y=_response_ / design; loglin X Y; run;

Since the main-effects log-linear model implies that

it follows that the MODEL statement yields

Therefore, the corresponding design matrix is as follows.

   

Design Matrix

Sample

Response Function Number

X

Y

1

1

2

2

1

2

2

1

3

2

Since it is difficult to tell from the final design matrix whether PROC CATMOD used the parameterization that you intended, the procedure displays the untransformed _RESPONSE_ matrix for log-linear models. For example, the main-effects model in the preceding example induces the display of the following matrix.

Response Function Number

_Response_

Matrix

1

2

1

1

1

2

1

ˆ’ 1

3

ˆ’ 1

1

4

ˆ’ 1

ˆ’ 1

You can suppress the display of this matrix by specifying the NORESPONSE option in the MODEL statement.

Cautions

Effective Sample Size

Since the method depends on asymptotic approximations, you need to be careful that the sample sizes are sufficiently large to support the asymptotic normal distributions of the response functions. A general guideline is that you would like to have an effective sample size of at least 25 to 30 for each response function that is being analyzed. For example, if you have one dependent variable and r = 4 response levels, and you use the standard response functions to compute three generalized logits for each population, then you would like the sample size of each population to be at least 75. Moreover, the subjects should be dispersed throughout the table so that less than 20 percent of the response functions have an effective sample size less than 5. For example, if each population had less than 5 subjects in the first response category, then it would be wiser to pool this category with another category rather than to assume the asymptotic normality of the first response function. Or, if the dependent variable is ordinally scaled, an alternative is to request the mean score response function rather than three generalized logits.

If there is more than one dependent variable, and you specify RESPONSE MEANS, then the effective sample size for each response function is the same as the actual sample size. Thus, a sample size of 30 could be sufficient to support four response functions, provided that the functions are the means of four dependent variables.

A Singular Covariance Matrix

If there is a singular (noninvertible) covariance matrix for the response functions in any population, then PROC CATMOD writes an error message and stops processing. You have several options available to correct this situation:

Zero Frequencies

There are two types of zero cells in a contingency table: structural and sampling. A structural zero cell has an expected value of zero, while a sampling zero cell may have nonzero expected value and may be estimable .

If you use the standard response functions and there are zero frequencies, you should use maximum likelihood estimation (the default is ML=NR) rather than weighted least-squares to analyze the data. For weighted least-squares analysis, the CATMOD procedure always computes the observed response functions and may need to take the logarithm of a zero proportion. In this case, PROC CATMOD issues a warning and then takes the log of a small value (0 . 5 /n i for the probability) in order to continue, but this can produce invalid results if the cells contain too few observations. Maximum likelihood analysis, on the other hand, does not require computation of the observed response functions and therefore yields valid results for the parameter estimates and all of the predicted values.

For a log-linear model analysis using WLS or ML=NR, PROC CATMOD creates response profiles only for the observed profiles. Thus, for any log-linear model analysis with one population (the usual case), the contingency table will not contain zeros, which means that all zero frequencies are treated as structural zeros. If there is more than one population, then a zero in the body of the contingency table is treated as a sampling zero (as long as some population has a nonzero count for that profile). If you fit the log-linear model using ML=IPF, the contingency table is incomplete and the zeros are treated like structural zeros. If you want zero frequencies that PROC CATMOD would normally treat as structural zeros to be interpreted as sampling zeros, you may specify the ZERO=SAMPLING and MISSING=SAMPLING options in the MODEL statement. Alternatively, you can specify ZERO=1E ˆ’ 20 and MISSING=1E ˆ’ 20.

Refer to Bishop, Fienberg, and Holland (1975) for a discussion of the issues and Example 22.5 on page 919 for an illustration of a log-linear model analysis of data that contain both structural and sampling zeros.

If you perform a weighted least-squares analysis on a contingency table that contains zero cell frequencies, then avoid using the LOG transformation as the first transformation on the observed proportions . In general, it may be better to change the response functions or to pool some of the response categories than to settle for the 0.5 correction or to use the ADDCELL= option.

Testing the Wrong Hypothesis

If you use the keyword _RESPONSE_ in the MODEL statement, and you specify MARGINALS, LOGITS, ALOGITS, or CLOGITS in your RESPONSE statement, you may receive the following warning message:

Warning: The _RESPONSE_ effect may be testing the wrong hypothesis since the marginal levels of the dependent variables do not coincide. Consult the response profiles and the CATMOD documentation.

The following examples illustrate situations in which the _RESPONSE_ effect tests the wrong hypothesis.

Zeros in the Marginal Frequencies

Suppose you specify the following statements:

data A1; input Time1 Time2 @@; datalines; 1 2 2 3 1 3 ; proc catmod; response marginals; model Time1*Time2=_response_; repeated Time 2 / _response_=Time; run;

One marginal probability is computed for each dependent variable, resulting in two response functions. The model is a saturated one: one degree of freedom for the intercept and one for the main effect of Time . Except for the warning message, PROC CATMOD produces an analysis with no apparent errors, but the 'Response Profiles' table displayed by PROC CATMOD is as follows.

Response Profiles

Response

Time1

Time2

1

1

2

2

1

3

3

2

3

Since RESPONSE MARGINALS yields marginal probabilities for every level but the last, the two response functions being analyzed are Prob(Time1=1) and Prob(Time2=2). Thus, the Time effect is testing the hypothesis that Prob(Time1=1)=Prob(Time2=2). What it should be testing is the hypothesis that

Prob(Time1=1) = Prob(Time2=1) Prob(Time1=2) = Prob(Time2=2) Prob(Time1=3) = Prob(Time2=3)

but there are not enough data to support the test (assuming that none of the probabilities are structural zeros by the design of the study).

The ORDER=DATA Option

Suppose you specify

data a1; input Time1 Time2 @@; datalines; 2 1 2 2 1 1 1 2 2 1 ; proc catmod order=data; response marginals; model Time1*Time2=_response_; repeated Time 2 / _response_=Time; run;

As in the preceding example, one marginal probability is computed for each dependent variable, resulting in two response functions. The model is also the same: one degree of freedom for the intercept and one for the main effect of Time . PROC CATMOD issues the warning message and displays the following 'Response Profiles' table.

Response Profiles

Response

Time1

Time2

1

2

1

2

2

2

3

1

1

4

1

2

Although the marginal levels are the same for the two dependent variables, they are not in the same order because the ORDER=DATA option specified that they be ordered according to their appearance in the input stream. Since RESPONSE MARGINALS yields marginal probabilities for every level except the last, the two response functions being analyzed are Prob(Time1=2) and Prob(Time2=1). Thus, the Time effect is testing the hypothesis that Prob(Time1=2)=Prob(Time2=1). What it should be testing is the hypothesis that

Prob(Time1=1) = Prob(Time2=1) Prob(Time1=2) = Prob(Time2=2)

Whenever the warning message appears, look at the 'Response Profiles' table or the 'One-Way Frequencies' table to determine what hypothesis is actually being tested . For the latter example, a correct analysis can be obtained by deleting the ORDER=DATA option or by reordering the data so that the (1,1) observation is first.

Computational Method

The notation used in PROC CATMOD differs slightly from that used in other literature. The following table provides a summary of the basic dimensions and the notation for a contingency table. See the 'Computational Formulas' section, which follows, for a complete description.

Summary of Basic Dimensions

s

=

number of populations or samples ( = number of rows in the underlying contingency table)

r

=

number of response categories (= number of columns in the underlying contingency table)

q

=

number of response functions computed for each population

d

=

number of parameters

Notation

j

denotes a column vector of 1s.

J

denotes a square matrix of 1s.

ˆ‘ k

is the sum over all the possible values of k .

n i

denotes the row sum ˆ‘ j n ij .

DIAG n ( p )

is the diagonal matrix formed from the first n elements of the vector p .

is the inverse of DIAG n ( p ).

DIAG ( A 1 , A 2 , ..., A k )

denotes a block diagonal matrix with the A matrices on the main diagonal.

Input data can be represented by a contingency table, as shown in Table 22.4.

Table 22.4: Input Data Represented by a Contingency Table
 

Response

 

Population

1

2

r

Total

1

n 11

n 12

n 1 r

n 1

2

n 21

n 22

n 2 r

n 2

s

n s 1

n s 2

n sr

n s

Computational Formulas

The following formulas are shown for each population and for all populations combined.

Source

Formula

Dimension

Probability Estimates

j th response

1 — 1

i th population

r — 1

all populations

sr — 1

Variance of Probability Estimates

i th population

r — r

all populations

V = DIAG ( V 1 , V 2 , , V s

sr — sr

Response Functions

i th population

F i = F ( p i )

q — 1

all populations

sq — 1

Derivative of Function with Respect to Probability Estimates

i th population

q — r

all populations

H = DIAG ( H 1 , H 2 , ..., H s )

sq — sr

Variance of Functions

i th population

S i = H i V i H i ²

q — q

all populations

S = DIAG ( S 1 , S 2 , ..., S s )

sq — sq

Inverse Variance of Functions

i th population

S i = ( S i ) ˆ’ 1

q — q

all populations

S ˆ’ 1 = DIAG ( S 1 , S 2 , ..., S s )

sq — sq

Derivative Table for Compound Functions: Y=F(G(p))

In the following table, let G ( p ) be a vector of functions of p , and let D denote ˆ‚ G / ˆ‚ p , which is the first derivative matrix of G with respect to p .

Function

Y = F ( G )

Derivative ( ˆ‚ Y / ˆ‚ p )

Multiply matrix

Y = A * G

A * D

Logarithm

Y = LOG ( G )

DIAG ˆ’ 1 ( G ) * D

Exponential

Y = EXP ( G )

DIAG ( Y ) * D

Add constant

Y = G + A

D

Default Response Functions: Generalized Logits

In the following table, subscripts i for the population are suppressed. Also denote for j = 1, ..., r ˆ’ 1 for each population i = 1 , ..., s .

Inverse of Response Functions for a Population

for j = 1, , r ˆ’ 1

Form of F and Derivative for a Population

F = KLOG(p) = (I r ˆ’ 1 , ˆ’ j) LOG(p)

Covariance Results for a Population

  • where V , H , and J are as previously defined.

S ˆ’ 1 = n(DIAG r ˆ’ 1 (p) ˆ’ qq ² ) where q = DIAG r ˆ’ 1 (p) j

The following calculations are shown for each population and then for all populations combined.

Source

Formula

Dimension

Design Matrix

i th population

X i

q — d

all populations

sq — d

Crossproduct of Design Matrix

i th population

C i = X i ² S i X i

d — d

all populations

C = X ² S ˆ’ 1 X = ˆ‘ i C i

d — d

In the following table, z p is the 100 p th percentile of the standard normal distribution.

Source

Formula

Dimension

Crossproduct of Design Matrix with Function

 

R = X ² S ˆ’ 1 F = ˆ‘ i X i ² S i F i

d — 1

Weighted Least-Squares Estimates

 

b = C ˆ’ 1 R = ( X ² S ˆ’ 1 X ) ˆ’ 1 ( X ² S ˆ’ 1 F )

d — 1

Covariance of Weighted Least-Squares Estimates

 

COV ( b ) = C ˆ’ 1

d — d

Wald Confidence Limits for Parameter Estimates

 

k = 1 , ..., d

Predicted Response Functions

 

sq — 1

Covariance of Predicted Response Functions

 

sq — sq

Residual Chi-Square

 

1 — 1

Source

Formula

Dimension

Chi-Square for H : L ² =

 

Q = ( Lb ) ² ( LC ˆ’ 1 L ² ) ˆ’ 1 ( Lb )

1 — 1

Maximum Likelihood Method

Let C be the Hessian matrix and G be the gradient of the log-likelihood function (both functions of and the parameters ² ). Let denote the vector containing the first r ˆ’ 1 sample proportions from population i , and let denote the corresponding vector of probability estimates from the current iteration. Starting with the least-squares estimates b of ² (if you use the ML and WLS options; with the ML option alone, the procedure starts with ), the probabilities ( b ) are computed, and b is calculated iteratively by the Newton-Raphson method until it converges (see the EPSILON= option on page 842). The factor » is a step-halving factor that equals one at the start of each iteration. For any iteration in which the likelihood decreases, PROC CATMOD uses a series of subiterations in which » is iteratively divided by two. The subiterations continue until the likelihood is greater than that of the previous iteration. If the likelihood has not reached that point after ten subiterations, then convergence is assumed, and a warning message is displayed.

Sometimes, infinite parameters may be present in the model, either because of the presence of one or more zero frequencies or because of a poorly specified model with collinearity among the estimates. If an estimate is tending toward infinity, then PROC CATMOD flags the parameter as infinite and holds the estimate fixed in subsequent iterations. PROC CATMOD regards a parameter to be infinite when two conditions apply:

The estimator of the asymptotic covariance matrix of the maximum likelihood predicted probabilities is given by Imrey, Koch, and Stokes (1981, eq. 2.18).

The following equations summarize the method:

where

Iterative Proportional Fitting

The algorithm used by PROC CATMOD for iterative proportional fitting is described in Bishop, Fienberg, and Holland (1975), Haberman (1972), and Agresti (2002). To illustrate the method, consider the observed three-dimensional table { n ijk } for the variables X, Y, and Z. The statements

model X*Y*Z = _response_ / ml=ipf; loglin XYZ@2;

request that PROC CATMOD use IPF to fit the hierarchical model

Begin with a table of initial cell estimates PROC CATMOD produces the initial estimates by setting the n sz structural zero cells to 0 and all other cells to n/ ( n c ˆ’ n sz ), where n is the total weight of the table and n c is the total number of cells in the table. Iteratively adjust the estimates at step s ˆ’ 1 to the observed marginal tables specified in the model by cycling through the following three-stage process to produce the estimates at step s .

The subscript ' ·' indicates summation over the missing subscript. The log-likelihood l s is estimated at each step s by

When the function ( l s ˆ’ 1 ˆ’ l s ) /l s ˆ’ 1 is less than 10 ˆ’ 8 , the iterations terminate. You can change the comparison value with the EPSILON= option, and you can change the convergence criterion with the CONV= option. The option CONV=CELL uses the maximum cell difference

as the criterion while the option CONV=MARGIN computes the maximum difference of the margins

Memory and Time Requirements

The memory and time required by PROC CATMOD are proportional to the number of parameters in the model.

Displayed Output

PROC CATMOD displays the following information in the 'Data Summary' table:

Except for the analysis of variance table, all of the following items can be displayed or suppressed, depending on your specification of statements and options.

ODS Table Names

PROC CATMOD assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in the following table. For more information on ODS, see Chapter 14, 'Using the Output Delivery System.'

Table 22.5: ODS Tables Produced in PROC CATMOD

ODS Table Name

Description

Statement

Option

ANOVA

Analysis of variance

MODEL

default

Contrasts

Contrasts

CONTRAST

default

ContrastEstimates

Analysis of Contrasts

CONTRAST

ESTIMATE=

ConvergenceStatus

Convergence status

MODEL

ML

CorrB

Correlation matrix of the estimates

MODEL

CORRB

CovB

Covariance matrix of the estimates

MODEL

COVB

DataSummary

Data summary

PROC

default

Estimates

Analysis of estimates

MODEL

default, unless NOPARM

MaxLikelihood

Maximum likelihood analysis

MODEL

ML and ITPRINT

OneWayFreqs

One-way frequencies

MODEL

ONEWAY

PopProfiles

Population profiles

MODEL

default, unless NOPROFILE

PredictedFreqs

Predicted frequencies

MODEL

PRED=FREQ

PredictedProbs

Predicted probabilities

MODEL

PREDICT or PRED=PROB

PredictedValues

Predicted values

MODEL

PREDICT or PRED=

ResponseCov

Response functions, covariance matrix

MODEL

COV

ResponseDesign

Response functions, design matrix

MODEL

DESIGN, unless NODESIGN

ResponseFreqs

Response frequencies

MODEL

FREQ

ResponseMatrix

_RESPONSE_ matrix

MODEL & LOGLIN

DESIGN, unless NORESPONSE

ResponseProbs

Response probabilities

MODEL

PROB

ResponseProfiles

Response profiles

MODEL

default, unless NOPROFILE

XPX

X '*Inv( S )* X matrix

MODEL

XPX, for WLS [*]

[*] WLS estimation is the default for response functions other than the default (generalized logits).

Категории