SAS/STAT 9.1 Users Guide, Volumes 1-7

Overview

The ACECLUS (Approximate Covariance Estimation for CLUStering) procedure obtains approximate estimates of the pooled within-cluster covariance matrix when the clusters are assumed to be multivariate normal with equal covariance matrices. Neither cluster membership nor the number of clusters need be known. PROC ACECLUS is useful for preprocessing data to be subsequently clustered by the CLUSTER or the FASTCLUS procedure.

Many clustering methods perform well with spherical clusters but poorly with elongated elliptical clusters (Everitt 1980, 77-97). If the elliptical clusters have roughly the same orientation and eccentricity, you can apply a linear transformation to the data to yield a spherical within-cluster covariance matrix, that is, a covariance matrix proportional to the identity. Equivalently, the distance between observations can be measured in the metric of the inverse of the pooled within-cluster covariance matrix. The remedy is difficult to apply, however, because you need to know what the clusters are in order to compute the sample within-cluster covariance matrix. One approach is to estimate iteratively both cluster membership and within-cluster covariance (Wolfe 1970; Hartigan 1975). Another approach is provided by Art, Gnanadesikan, and Kettenring (1982). They have devised an ingenious method for estimating the within-cluster covariance matrix without knowledge of the clusters. The method can be applied before any of the usual clustering techniques, including hierarchical clustering methods.

First, Art, Gnanadesikan, and Kettenring (1982) obtain a decomposition of the total-sample sum-of-squares-and-cross-products (SSCP) matrix into within-cluster and between-cluster SSCP matrices computed from pairwise differences between observations, rather than differences between observations and means. Then, they show how the within-cluster SSCP matrix based on pairwise differences can be approximated without knowing the number or the membership of the clusters. The approximate within-cluster SSCP matrix can be used to compute distances for cluster analysis, or it can be used in a canonical analysis similar to canonical discriminant analysis. For more information, see Chapter 21, 'The CANDISC Procedure.'

Art, Gnanadesikan, and Kettenring demonstrate by Monte Carlo calculations that their method can produce better clusters than the Euclidean metric even when the approximation to the within-cluster SSCP matrix is poor or the within-cluster covariances are moderately heterogeneous. The algorithm used by the ACECLUS procedure differs slightly from the algorithm used by Art, Gnanadesikan, and Kettenring. In the following sections, the PROC ACECLUS algorithm is described first; then, differences between PROC ACECLUS and the method used by Art, Gnanadesikan, and Kettenring are summarized.

Background

It is well known from the literature on nonparametric statistics that variances and, hence, covariances can be computed from pairwise differences instead of deviations from means. (For example, Puri and Sen (1971, pp. 51-52) show that the variance is a U statistic of degree 2.) Let X =( x ij ) be the data matrix with n observations (rows) and v variables ( columns ), and let x j be the mean of the j th variable. The sample covariance matrix S =( s jk ) is usually defined as

The matrix S can also be computed as

Let W =( w jk ) be the pooled within-cluster covariance matrix, q be the number of clusters, n c be the number of observations in the c th cluster, and

The matrix W is normally defined as

where x cj is the mean of the j th variable in cluster c . Let

The matrix W can also be computed as

If the clusters are not known, cannot be determined. However, an approximation to W can be obtained by using instead

where u is an appropriately chosen value and M = ( m jk ) is an appropriate metric. Let A = ( a jk ) be defined as

If all of the following conditions hold, A equals W :

If the clusters are of unequal size , A gives more weight to large clusters than W does, but this discrepancy should be of little importance if the population within-cluster covariance matrices are equal. There may be large differences between A and W if the cutoff u does not discriminate between pairs in the same cluster and pairs in different clusters. Lack of discrimination may occur for one of the following reasons:

In the former case, little can be done to remedy the problem. The remaining question concerns how to choose M and u . Consider M first. The best choice for M is W ˆ’ 1 , but W is not known. The solution is to use an iterative algorithm:

  1. Obtain an initial estimate of A , such as the identity or the total-sample covariance matrix. (See the INITIAL= option in the PROC ACECLUS statement for more information.)

  2. Let M equal A ˆ’ 1 .

  3. Recompute A using the preceding formula.

  4. Repeat steps 2 and 3 until the estimate stabilizes.

Convergence is assessed by comparing values of A on successive iterations. Let A i be the value of A on the i th iteration and A be the initial estimate of A . Let Z be a user -specified v — v matrix. (See the METRIC= option in the PROC ACECLUS statement for more information.) The convergence measure is

where ... indicates the Euclidean norm, that is, the square root of the sum of the squares of the elements of the matrix. In PROC ACECLUS, Z can be the identity or an inverse factor of S or diag( S ). Iteration stops when e i falls below a user-specified value. (See the CONVERGE= option or the MAXITER= option in the PROC ACECLUS statement for more information.)

The remaining question of how to choose u has no simple answer. In practice, you must try several different values. PROC ACECLUS provides four different ways of specifying u :

In most cases, the analysis should begin with the last method using values of p between 0.5 and 0.01 and using the full covariance matrix as the initial estimate of A .

Proportions p are transformed to distances t using the formula

where is the quantile (inverse cumulative distribution) function of an F random variable with v and n ˆ’ v degrees of freedom. The squared Mahalanobis distance between a single pair of observations sampled from a multivariate normal distribution is distributed as 2 v times an F random variable with v and n ˆ’ v degrees of freedom. The distances between two pairs of observations are correlated if the pairs have an observation in common. The quantile function is raised to the power given in the preceding formula to compensate approximately for the correlations among distances between pairs of observations that share a member. Monte Carlo studies indicate that the approximation is acceptable if the number of observations exceeds the number of variables by at least 10 percent.

If A becomes singular, step 2 in the iterative algorithm cannot be performed because A cannot be inverted. In this case, let Z be the matrix as defined in discussing the convergence measure, and let Z ² AZ = R ² R where R ² R = RR ² = I and = ( » jk ) is diagonal. Let be a diagonal matrix where , and 0 < g < 1 is a user-specified singularity criterion (see the SINGULAR= option in the PROC ACECLUS statement for more information). Then M is computed as ZR ² ( *) ˆ’ 1 RZ ² .

The ACECLUS procedure differs from the method used by Art, Gnanadesikan, and Kettenring (1982) in several respects.

Analyses of Fisher's (1936) iris data, consisting of measurements of petal and sepal length and width for fifty specimens from each of three iris species, are summarized in Table 16.1. The number of misclassified observations out of 150 is given for four clustering methods:

Table 16.1: Number of Misclassified and Unclassified Observations Using Fisher's (1936) Iris Data

Data

Clustering Method

k -means

Ward's

Average

Linkage

Centroid

raw data

16 [*]

16 [**]

25 + 12 [*]

14 [*]

standardized data

25

26

33+4

33+4

two standardized principal components

29

31

30+9

27+32

four standardized principal components

39

27

32+7

45+11

transformed by ACECLUS P=0.32

39

10+9

7+25

 

transformed by ACECLUS P=0.16

39

18+9

7+19

7+26

transformed by ACECLUS P=0.08

19

9

3+13

5+16

transformed by ACECLUS P=0.04

4

5

1+19

3+12

transformed by ACECLUS P=0.02

4

3

3

3

transformed by ACECLUS P=0.01

4

4

3

4

transformed by ACECLUS P=0.005

4

4

4

4

canonical variables

3

5

4

4+1

[*] A single number represents misclassi ed observations with no unclassi ed observations.

[**] Where two numbers are separated by a plus sign, the rst is the number of misclassied observations; the second is the number of unclassi ed observations.

Each hierarchical analysis is followed by the TREE procedure with NCL=3 to determine cluster assignments at the three-cluster level. Clusters with twenty or fewer observations are discarded by using the DOCK=20 option. The observations in a discarded cluster are considered unclassified.

Each method is applied to

Theoretically, the best results should be obtained by using the canonical variables from PROC CANDISC. PROC ACECLUS yields results comparable to PROC CANDISC for values of the PROPORTION= option ranging from 0.005 to 0.02. At PROPORTION=0.04, average linkage and the centroid method show some deterioration, but k -means and Ward's method continue to produce excellent classifications. At larger values of the PROPORTION= option, all methods perform poorly, although no worse than with four standardized principal components.

This example demonstrates the following:

Although experience with the Art, Gnanadesikan, and Kettenring and PROC ACECLUS methods is limited, the results so far suggest that these methods help considerably more often than they hinder the subsequent cluster analysis, especially with normal-mixture techniques such as k -means and Ward's minimum variance method.

Категории