SAS/STAT 9.1 User's Guide (Vol. 4)

Example 47.1. Cluster Analysis of Samples from Univariate Distributions

This example uses pseudo-random samples from a uniform distribution, an exponential distribution, and a bimodal mixture of two normal distributions. Results are presented in Output 47.1.1 through Output 47.1.3 as plots displaying both the true density and the estimated density, as well as cluster membership.
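For reference, the TRUE variable computed in the three DATA steps of this example corresponds to the densities below. This is a reading of the code (RANUNI, RANEXP, and the alternating means 0 and 0.5 produced by MOD(n,2)/2), not a statement taken from the procedure documentation; the symbol sigma stands for the DATA step variable of the same name.

```latex
\begin{aligned}
\text{uniform:}\quad & f(x) = 1, && 0 \le x \le 1,\\
\text{exponential:}\quad & f(x) = e^{-x}, && x \ge 0,\\
\text{normal mixture:}\quad & f(x) = \frac{1}{2\sigma\sqrt{2\pi}}
  \left[ e^{-\frac{1}{2}\left(x/\sigma\right)^{2}}
       + e^{-\frac{1}{2}\left((x-0.5)/\sigma\right)^{2}} \right], && \sigma = 0.125 .
\end{aligned}
```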

The following statements produce Output 47.1.1:

   options noovp ps=28 ls=95;
   title 'Modeclus Example with Univariate Distributions';
   title2 'Uniform Distribution';

   data uniform;
      drop n;
      true=1;
      do n=1 to 100;
         x=ranuni(123);
         output;
      end;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 3 by 0.5);
   axis2 minor=none;
   symbol9 v=none i=splines;

   proc modeclus data=uniform m=1 k=10 20 40 60 out=out short;
      var x;

   proc gplot data=out;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _K_;
   run;

   proc modeclus data=uniform m=1 r=.05 .10 .20 .30 out=out short;
      var x;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 2 by 0.5);

   proc gplot data=out;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _R_;
   run;

Output 47.1.1: Cluster Analysis of Sample from a Uniform Distribution

Modeclus Example with Univariate Distributions
Uniform Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   K      Clusters           Objects
------------------------------------
  10             6                 0
  20             3                 0
  40             2                 0
  60             1                 0

 

Modeclus Example with Univariate Distributions
Uniform Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   R      Clusters           Objects
------------------------------------
0.05             4                 0
 0.1             2                 0
 0.2             2                 0
 0.3             1                 0

The following statements produce Output 47.1.2:

   title2 'Exponential Distribution';

   data expon;
      drop n;
      do n=1 to 100;
         x=ranexp(123);
         true=exp(-x);
         output;
      end;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 2 by 0.5);
   axis2 minor=none;

   proc modeclus data=expon m=1 k=10 20 40 out=out short;
      var x;

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _K_;
   run;

   /*********************************************/
   proc modeclus data=expon m=1 r=.20 .40 .80 out=out short;
      var x;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 1 by 0.5);

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _R_;
   run;

   /*********************************************/
   title3 'Different Density-Estimation and Clustering Windows';

   proc modeclus data=expon m=1 r=.20 ck=10 20 40 out=out short;
      var x;

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _CK_;
   run;

   /*********************************************/
   title3 'Cascaded Density Estimates Using Arithmetic Means';

   proc modeclus data=expon m=1 r=.20 cascade=1 2 4 am out=out short;
      var x;

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _R_ _CASCAD_;
   run;

Output 47.1.2: Cluster Analysis of Sample from an Exponential Distribution

Modeclus Example with Univariate Distributions
Exponential Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   K      Clusters           Objects
------------------------------------
  10             5                 0
  20             3                 0
  40             1                 0

 

Modeclus Example with Univariate Distributions
Exponential Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   R      Clusters           Objects
------------------------------------
 0.2             8                 0
 0.4             6                 0
 0.8             1                 0

Modeclus Example with Different Density-Estimation and Clustering Windows

The MODECLUS Procedure

Cluster Summary

                                  Frequency of
                   Number of      Unclassified
   R        CK      Clusters           Objects
----------------------------------------------
 0.2        10             3                 0
 0.2        20             2                 0
 0.2        40             1                 0

Modeclus Example with Cascaded Density Estimates Using Arithmetic Means

The MODECLUS Procedure

Cluster Summary

                                  Frequency of
                   Number of      Unclassified
   R   Cascade      Clusters           Objects
----------------------------------------------
 0.2         1             8                 0
 0.2         2             8                 0
 0.2         4             7                 0

The following statements produce Output 47.1.3:

   title2 'Normal Mixture Distribution';

   data normix;
      drop n sigma;
      sigma=.125;
      do n=1 to 100;
         x=rannor(456)*sigma+mod(n,2)/2;
         true=exp(-.5*(x/sigma)**2)+exp(-.5*((x-.5)/sigma)**2);
         true=.5*true/(sigma*sqrt(2*3.1415926536));
         output;
      end;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 3 by 0.5);
   axis2 minor=none;

   proc modeclus data=normix m=1 k=10 20 40 60 out=out short;
      var x;

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _K_;
   run;

   /*********************************************/
   proc modeclus data=normix m=1 r=.05 .10 .20 .30 out=out short;
      var x;

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _R_;
   run;

   /*********************************************/
   title3 'Cascaded Density Estimates Using Arithmetic Means';

   proc modeclus data=normix m=1 r=.05 cascade=1 2 4 am out=out short;
      var x;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(0 to 2 by 0.5);

   proc gplot;
      plot density*x=cluster /frame cframe=ligr vzero nolegend
                              vaxis=axis1 haxis=axis2;
      plot2 true*x=9;
      by _R_ _CASCAD_;
   run;

Output 47.1.3: Cluster Analysis of Sample from a Bimodal Mixture of Two Normal Distributions

Modeclus Example with Normal Mixture Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   K      Clusters           Objects
------------------------------------
  10             7                 0
  20             2                 0
  40             2                 0
  60             1                 0

 

Modeclus Example with Normal Mixture Distribution

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   R      Clusters           Objects
------------------------------------
0.05             5                 0
 0.1             2                 0
 0.2             2                 0
 0.3             1                 0

Modeclus Example with Normal Mixture Distribution
Cascaded Density Estimates Using Arithmetic Means

The MODECLUS Procedure

Cluster Summary

                                  Frequency of
                   Number of      Unclassified
   R   Cascade      Clusters           Objects
----------------------------------------------
0.05         1             5                 0
0.05         2             4                 0
0.05         4             4                 0

Example 47.2. Cluster Analysis of Flying Mileages between Ten American Cities

This example uses distance data and illustrates the use of the TRANSPOSE procedure and the DATA step to fill in the upper triangle of the distance matrix. The results are displayed in Output 47.2.1 through Output 47.2.3.
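The fill-in technique can be tried on a small case first. The following is a minimal sketch, assuming a hypothetical three-city lower-triangular matrix; the data set names TRI, TRAN, and FULL, the variables A, B, C, and the distances are invented for illustration and are not part of this example. It uses explicitly subscripted arrays, but the idea is the same: transpose the matrix, then add each cell to its transposed counterpart with the SUM function, which ignores missing values.

```sas
/* Hypothetical 3-city lower-triangular distance matrix (illustration only) */
data tri(type=distance);
   input a b c city $;
   datalines;
0 . . A
3 0 . B
4 5 0 C
;

/* Transposing moves the lower triangle into the upper triangle */
proc transpose data=tri out=tran;
   var a b c;
run;

/* One-to-one merge: row i of TRI meets row i of the transposed matrix.  */
/* SUM ignores missing values, so each empty cell picks up its mirror.   */
data full(type=distance);
   merge tri tran;
   array d{3} a b c;
   array t{3} col1-col3;
   do j = 1 to 3;
      d{j} = sum(d{j}, t{j});
   end;
   drop j col1-col3 _name_;
run;

proc print data=full noobs;
run;
```

Because the diagonal is zero in both the matrix and its transpose, adding the two leaves the diagonal at zero, which is why no special handling of the diagonal is needed.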

Output 47.2.1: Clustering with K-Nearest-Neighbor Density Estimates

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure

Nearest Neighbor List

CITY               Neighbor                  Distance
---------------------------------------------------
ATLANTA            WASHINGTON D.C.        543.0000000
                   CHICAGO                587.0000000
---------------------------------------------------
CHICAGO            ATLANTA                587.0000000
                   WASHINGTON D.C.        597.0000000
---------------------------------------------------
DENVER             LOS ANGELES            831.0000000
                   HOUSTON                879.0000000
---------------------------------------------------
HOUSTON            ATLANTA                701.0000000
                   DENVER                 879.0000000
---------------------------------------------------
LOS ANGELES        SAN FRANCISCO          347.0000000
                   DENVER                 831.0000000
---------------------------------------------------
MIAMI              ATLANTA                604.0000000
                   WASHINGTON D.C.        923.0000000
---------------------------------------------------
NEW YORK           WASHINGTON D.C.        205.0000000
                   CHICAGO                713.0000000
---------------------------------------------------
SAN FRANCISCO      LOS ANGELES            347.0000000
                   SEATTLE                678.0000000
---------------------------------------------------
SEATTLE            SAN FRANCISCO          678.0000000
                   LOS ANGELES            959.0000000
---------------------------------------------------
WASHINGTON D.C.    NEW YORK               205.0000000
                   ATLANTA                543.0000000

 

Output 47.2.2: Clustering with Uniform Kernel Density Estimates

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure

Nearest Neighbor List

CITY               Neighbor                  Distance
---------------------------------------------------
ATLANTA            WASHINGTON D.C.        543.0000000
                   CHICAGO                587.0000000
                   MIAMI                  604.0000000
                   HOUSTON                701.0000000
                   NEW YORK               748.0000000
---------------------------------------------------
CHICAGO            ATLANTA                587.0000000
                   WASHINGTON D.C.        597.0000000
                   NEW YORK               713.0000000
---------------------------------------------------
HOUSTON            ATLANTA                701.0000000
---------------------------------------------------
LOS ANGELES        SAN FRANCISCO          347.0000000
---------------------------------------------------
MIAMI              ATLANTA                604.0000000
---------------------------------------------------
NEW YORK           WASHINGTON D.C.        205.0000000
                   CHICAGO                713.0000000
                   ATLANTA                748.0000000
---------------------------------------------------
SAN FRANCISCO      LOS ANGELES            347.0000000
                   SEATTLE                678.0000000
---------------------------------------------------
SEATTLE            SAN FRANCISCO          678.0000000
---------------------------------------------------
WASHINGTON D.C.    NEW YORK               205.0000000
                   ATLANTA                543.0000000
                   CHICAGO                597.0000000

 

The following statements produce Output 47.2.1:

   title 'Modeclus Analysis of 10 American Cities';
   title2 'Based on Flying Mileages';
   options ls=90;

   /* Each distance occupies a 5-column field; CITY starts in column 53 */
   data mileages(type=distance);
      input (ATLANTA CHICAGO DENVER HOUSTON LOSANGELES
             MIAMI NEWYORK SANFRAN SEATTLE WASHDC) (5.)
             @53 CITY $15.;
      datalines;
    0                                               ATLANTA
  587    0                                          CHICAGO
 1212  920    0                                     DENVER
  701  940  879    0                                HOUSTON
 1936 1745  831 1374    0                           LOS ANGELES
  604 1188 1726  968 2339    0                      MIAMI
  748  713 1631 1420 2451 1092    0                 NEW YORK
 2139 1858  949 1645  347 2594 2571    0            SAN FRANCISCO
 2182 1737 1021 1891  959 2734 2408  678    0       SEATTLE
  543  597 1494 1220 2300  923  205 2442 2329    0  WASHINGTON D.C.
   ;

   *-----Fill in Upper Triangle of Distance Matrix---------------;
   proc transpose out=tran;
      copy CITY;

   data mileages(type=distance);
      merge mileages tran;
      array var ATLANTA--WASHDC;
      array col col1-col10;
      drop col1-col10 _name_;
      do over var;
         var=sum(var,col);
      end;

   *-----Clustering with K-Nearest-Neighbor Density Estimates-----;
   proc modeclus data=mileages all m=1 k=3;
      id CITY;
   run;

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure
K=3 METHOD=1

Boundary Objects

                                  -Cluster Proportions-
CITY          Density    Cluster          1          2
DENVER   0.0001706485          2      0.486      0.514
HOUSTON  0.0001706485          1      0.600      0.400

Cluster Statistics

                       Maximum              Estimated
                     Estimated   Boundary      Saddle
Cluster  Frequency     Density  Frequency     Density
-----------------------------------------------------
      1          6  0.00027624          1  0.00017065
      2          4  0.00022124          1  0.00017065

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   K      Clusters           Objects
------------------------------------
   3             2                 0
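The densities printed for K=3 can be reproduced by hand. The worked arithmetic below is a sketch under two assumptions that are inferred from the printed values rather than quoted from the procedure documentation: the k neighbors include the city itself, so the radius r_i is the distance to the second city in the nearest neighbor list, and distance data are treated as one-dimensional, so the neighborhood "volume" is 2r_i. Here n = 10 is the number of cities.

```latex
\hat f_i = \frac{k}{n \cdot 2 r_i}, \qquad
\hat f_{\text{DENVER}} = \frac{3}{10 \cdot 2 \cdot 879} \approx 0.00017065, \qquad
\hat f_{\text{WASHINGTON D.C.}} = \frac{3}{10 \cdot 2 \cdot 543} \approx 0.00027624 .
```

The first value matches the density reported for DENVER among the boundary objects, and the second matches the maximum estimated density of cluster 1.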

The following statements produce Output 47.2.2:

   *------Clustering with Uniform Kernel Density Estimates--------;
   proc modeclus data=mileages all m=1 r=600 800;
      id CITY;
   run;

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure
R=600 METHOD=1

No Boundary Objects

Cluster Statistics

                       Maximum              Estimated
                     Estimated   Boundary      Saddle
Cluster  Frequency     Density  Frequency     Density
-----------------------------------------------------
      1          4  0.00033333          0           .
      2          2  0.00016667          0           .
      3          1  0.00008333          0           .
      4          1  0.00008333          0           .
      5          1  0.00008333          0           .
      6          1  0.00008333          0           .

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure
R=800 METHOD=1

No Boundary Objects

Cluster Statistics

                       Maximum              Estimated
                     Estimated   Boundary      Saddle
Cluster  Frequency     Density  Frequency     Density
-----------------------------------------------------
      1          6    0.000375          0           .
      2          3   0.0001875          0           .
      3          1   0.0000625          0           .

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure

Cluster Summary

                        Frequency of
         Number of      Unclassified
   R      Clusters           Objects
------------------------------------
 600             6                 0
 800             3                 0
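With a fixed radius, the maximum densities in the R=600 and R=800 tables are consistent with a simple count-over-volume estimate. In the sketch below, n_i is the number of cities within R miles of city i, counting the city itself, n = 10, and the interval length 2R plays the role of the volume; this reading is inferred from the printed values and should be treated as an assumption.

```latex
\hat f_i = \frac{n_i}{n \cdot 2R}: \qquad
R = 600:\ \frac{4}{10 \cdot 1200} \approx 0.00033333, \qquad
R = 800:\ \frac{6}{10 \cdot 1600} = 0.000375, \qquad
\text{isolated city, } R = 600:\ \frac{1}{10 \cdot 1200} \approx 0.00008333 .
```

For instance, WASHINGTON D.C. has three other cities within 600 miles (NEW YORK, ATLANTA, CHICAGO), giving n_i = 4, and ATLANTA has five other cities within 800 miles, giving n_i = 6.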

The following statements produce Output 47.2.3:

   *------Uniform Kernel Density Estimates, Clustering
          Neighborhoods extended to nearest neighbor--------------;
   proc modeclus data=mileages list m=1 ck=2 r=600 800;
      id CITY;
   run;

Output 47.2.3: Uniform Kernel Density Estimates, Clustering Neighborhoods Extended to Nearest Neighbor

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure
CK=2 R=600 METHOD=1

Cluster Statistics

                       Maximum              Estimated
                     Estimated   Boundary      Saddle
Cluster  Frequency     Density  Frequency     Density
-----------------------------------------------------
      1          6  0.00033333          0           .
      2          4  0.00016667          0           .

 

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure
CK=2 R=800 METHOD=1

Cluster Statistics

                       Maximum              Estimated
                     Estimated   Boundary      Saddle
Cluster  Frequency     Density  Frequency     Density
-----------------------------------------------------
      1          6    0.000375          0           .
      2          4   0.0001875          0           .

Modeclus Analysis of 10 American Cities
Based on Flying Mileages

The MODECLUS Procedure

Cluster Summary

                                  Frequency of
                   Number of      Unclassified
   R        CK      Clusters           Objects
----------------------------------------------
 600         2             2                 0
 800         2             2                 0

Example 47.3. Cluster Analysis with Significance Tests

This example uses artificial data containing two clusters. One cluster is from a circular bivariate normal distribution. The other is a ring-shaped cluster that completely surrounds the first cluster. Without significance tests, the ring is divided into several sample clusters for any degree of smoothing that yields reasonable density estimates. The JOIN= option puts the ring back together. Output 47.3.1 displays a short summary generated from the first PROC MODECLUS statement. Output 47.3.2 contains a series of tables produced from the second PROC MODECLUS statement. Because no p-value is specified in the JOIN= option of the second statement, joining continues until only one cluster remains (see the description of the JOIN= option on page 2866). The cluster memberships are then plotted as displayed in Output 47.3.3.
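The ring is built in the DATA step of this example by rescaling a standard bivariate normal point so that its direction is kept and its distance from the origin is replaced by a normal variate centered at 8. Read off the code (with the primed symbols introduced here only for the rescaled coordinates), the construction is:

```latex
\ell = \frac{z}{\sqrt{x^{2} + y^{2}}}, \qquad
(x', y') = \ell\,(x, y), \qquad
\sqrt{x'^{2} + y'^{2}} = |z|, \qquad z \sim N(8, 1).
```

The ring therefore has a mean radius of about 8 and a radial standard deviation of about 1, while the 30 points of the inner cluster are standard normal in each coordinate.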

   title 'Modeclus Analysis with the JOIN= option';
   title2 'A Normal Cluster Surrounded by a Ring Cluster';
   options ls=120 ps=38;

   data circle;
      keep x y;
      c=1;
      do n=1 to 30;
         x=rannor(5);
         y=rannor(5);
         output;
      end;
      c=2;
      do n=1 to 300;
         x=rannor(5);
         y=rannor(5);
         z=rannor(5)+8;
         l=z/sqrt(x**2+y**2);
         x=x*l;
         y=y*l;
         output;
      end;

   axis1 label=(angle=90 rotate=0) minor=none
         order=(-10 to 10 by 5);
   axis2 minor=none order=(-15 to 15 by 5);

   proc modeclus data=circle m=1 r=1 to 3.5 by .25 join=20 short;

   proc modeclus data=circle m=1 r=2.5 join out=out;

   proc gplot data=out;
      plot y*x=cluster/frame cframe=ligr vzero nolegend
                       vaxis=axis1 haxis=axis2;
      by _NJOIN_;
   run;

Output 47.3.1: Significance Tests with the JOIN=20 and SHORT Options

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure

Cluster Summary

      Number of                        Frequency of
       Clusters   Maximum  Number of   Unclassified
   R     Joined   P-value   Clusters        Objects
---------------------------------------------------
   1         36    0.9339          1            301
1.25         20    0.7131          1            301
 1.5         10    0.3296          1            300
1.75          5    0.1990          2              0
   2          5    0.0683          2              0
2.25          3    0.0504          2              0
 2.5          4    0.0301          2              0
2.75          3    0.0585          2              0
   3          5    0.0003          1              0
3.25          4    0.1923          2              0
 3.5          4    0.0000          1              0

 

Output 47.3.2: Significance Tests with the JOIN Option

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        103  0.00617328         22  0.00308664     39      19        0  2.495   0.5055
      2         71  0.00571029         20   0.0043213     36      27        9  1.193    0.999
      3         53  0.00509296         18  0.00401263     32      25       10  0.986   0.9999
      4         45  0.00478429         19  0.00354964     30      22       14  1.429   0.9924
      5         30  0.00462996          0           .     29       0        .  3.611   0.0301
      6         28  0.00370397         17  0.00354964     23      22        9  0.000        1

 

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        103  0.00617328         22  0.00308664     39      19        0  2.495   0.5055
      2         71  0.00571029         20   0.0043213     36      27        9  1.193    0.999
      3         53  0.00509296         18  0.00401263     32      25       10  0.986   0.9999
      4         73  0.00478429         13  0.00293231     30      18        0  1.588   0.9778
      5         30  0.00462996          0           .     29       0        .  3.611   0.0301

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        156  0.00617328         17  0.00246931     39      15        0  3.130   0.1318
      2         71  0.00571029         20   0.0043213     36      27        9  1.193    0.999
      3         73  0.00478429         13  0.00293231     30      18        0  1.588   0.9778
      4         30  0.00462996          0           .     29       0        .  3.611   0.0301

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        156  0.00617328         17  0.00246931     39      15        0  3.130   0.1318
      2        144  0.00571029         14  0.00293231     36      18        0  2.313   0.6447
      3         30  0.00462996          0           .     29       0        .  3.611   0.0301

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        300  0.00617328          0           .     39       0        .  4.246   0.0026
      2         30  0.00462996          0           .     29       0        .  3.611   0.0301

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure
R=2.5 METHOD=1

Cluster Statistics

                       Maximum              Estimated  ------Saddle Test: Version 92.7-------
                     Estimated   Boundary      Saddle   Mode  Saddle  Overlap          Approx
Cluster  Frequency     Density  Frequency     Density  Count   Count    Count      Z  P-value
---------------------------------------------------------------------------------------------
      1        300  0.00617328          0           .     39       0        .  4.246   0.0026

Modeclus Analysis with the JOIN= option
A Normal Cluster Surrounded by a Ring Cluster

The MODECLUS Procedure

Cluster Summary

      Number of                        Frequency of
       Clusters   Maximum  Number of   Unclassified
   R     Joined   P-value   Clusters        Objects
---------------------------------------------------
 2.5          0    1.0000          6              0
 2.5          1    0.9999          5              0
 2.5          2    0.9990          4              0
 2.5          3    0.6447          3              0
 2.5          4    0.0301          2              0
 2.5          5    0.0026          1             30

Output 47.3.3: Scatter Plots of Cluster Memberships by _NJOIN_

 

Example 47.4. Cluster Analysis: Hertzsprung-Russell Plot

This example uses computer-generated data to mimic a Hertzsprung-Russell plot (Struve and Zebergs 1962, p. 259) of the temperature and luminosity of stars. The data are plotted and displayed in Output 47.4.1; see Example 4 from Proc Modeclus in the SAS/STAT Sample Program Library for the complete data set. It appears that there are two main groups of stars and a collection of isolated stars. The long straggling group of points appearing diagonally across the figure represents the main group of stars; the more compact group in the top right-hand corner contains giant stars. The JOIN= option is specified at a 0.05 significance level with various smoothing parameters. The CK=5 option is specified in order to prevent the numerous outliers from forming separate clusters. The results from PROC MODECLUS are displayed in Output 47.4.2. The cluster memberships are then plotted by PROC GPLOT, as displayed in Output 47.4.3.

Output 47.4.1: Scatter Plot of Data

 

Output 47.4.2: Results from PROC MODECLUS

Hertzsprung-Russell Plot of Visible Stars
Computer-Generated Fake Data

The MODECLUS Procedure

Cluster Summary

            Number of                        Frequency of
             Clusters   Maximum  Number of   Unclassified
   R    CK     Joined   P-value   Clusters        Objects
---------------------------------------------------------
   1     5         14    0.0001          2              0
 1.5     5          6    0.0000          3              0
   2     5          4    0.0000          2              0
 2.5     5          2    0.0000          1              0

 

Output 47.4.3: Scatter Plots of Cluster Memberships by _R_

 

Notice in Output 47.4.3 that the graphic output from PROC GPLOT when _R_ = 2.5 is not available because only one cluster remains after joining at a 5% significance level, and the results are not written to the OUT= data set. See the description of the JOIN= option on page 2866 for more information.

   title 'Hertzsprung-Russell Plot of Visible Stars';
   title2 'Computer-Generated Fake Data';

   data hr;
      input x y @@;
      label x='-Temperature' y='-Luminosity';
      datalines;
    1.0 12.8   0.9 13.7   0.9 12.9   1.0 12.3   1.0 12.2
    2.6 10.9   2.4 10.9   2.5 11.2   2.3 11.5   2.6 12.0
    2.4 12.1   2.3 10.9   2.6 11.5   2.5 11.9   2.4 11.0
    3.4 11.1   3.3 11.2   3.4 11.1   3.4  9.9   3.2 10.4

    ... 150 lines omitted ...

   18.5 12.6  14.2 16.1  23.2  6.6  11.4 12.4  20.4 11.7
   20.9  8.1  18.9 13.7  16.9  9.7  15.5  9.9  18.3 14.2
   19.3 13.7  17.0 12.9  10.1 11.6  17.9 13.5  14.3  1.4
   13.1 -0.8   8.1 -0.9  20.0  7.0  21.0  8.5  15.6 13.2
   ;

   symbol1 value=circle c=white;
   symbol2 value=plus c=yellow;
   symbol3 value=triangle c=cyan;
   legend1 frame cframe=ligr cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;

   proc gplot;
      plot y*x/legend=legend1 frame cframe=ligr vzero
               vaxis=axis1 haxis=axis2;

   proc modeclus data=hr m=1 r=1 1.5 2 2.5 ck=5 join=.05 short out=out;
   run;

   title2 'MODECLUS Analysis';

   proc gplot;
      plot y*x=cluster/frame cframe=ligr vzero legend=legend1
                       vaxis=axis1 haxis=axis2;
      by _R_;
   run;

Example 47.5. Using the TRACE Option when METHOD=6

To illustrate how the TRACE option can help you to understand the clustering process when METHOD=6 is specified, the following data set is created with 12 observations.

   data test;
      input x @@;
      datalines;
   1 2 3 4 5 7.5 9 11.5 13 14.5 15 16
   ;

The first five observations seem to be close to each other, and the last five observations seem to be close to each other. Observation 6 is separated from the first five observations with a (Euclidean) distance of 2.5, and the same distance separates observation 7 from the last five observations. Observations 6 and 7 differ by 1.5.

Suppose METHOD=6 with a radius=2.5 is chosen for the cluster analysis. You can specify the TRACE option to understand how each observation is assigned.

The following statements produce Output 47.5.1 and Output 47.5.2:

   /*-- METHOD=6 with TRACE and THRESHOLD=0.5 (default) --*/
   proc modeclus method=6 r=2.5 trace short out=out;
      var x;
   run;

   data markobs;
      drop _r_ _method_ _obs_ density cluster;
      /* FUNCTION and STYLE must be long enough to hold 'label' and 'swiss' */
      length function style $ 8 text $ 2;
      retain xsys '2' ysys '2' hsys '1' when 'a';
      set out;
      /* create the text for obs */
      function='label';
      size=4;
      style='swiss';
      text=left(put(_obs_,2.));
      position='3';
      x=x;
      y=density;
      output;
   run;

   legend1 frame cframe=ligr cborder=black
           position=center value=(justify=center);
   axis1 label=(angle=90 rotate=0) minor=none;
   axis2 minor=none;
   title 'Plot of DENSITY*X=CLUSTER';

   proc gplot data=out;
      plot density*x=cluster/ annotate=markobs frame cframe=ligr
                              legend=legend1 vaxis=axis1 haxis=axis2;
   run;

Output 47.5.1: Partial Output of METHOD=6 with TRACE and Default THRESHOLD=

The MODECLUS Procedure
R=2.5 METHOD=6

Trace of Clustering Algorithm

                   Cluster
 Obs     Density   Old   New   Ratio
------------------------------------------------
   3   0.0833333    -1     1   M
   2   0.0666667     0     1   N
   4   0.0666667     0     1   N
   5   0.0666667     0     1   N
   1   0.0500000     0     1   N
   6   0.0500000     0     1   0.571
   7   0.0500000    -1     1   0.500
   9   0.0666667    -1     2   M
   8   0.0500000     0     2   N
  10   0.0666667    -1     2   S
  12   0.0500000     0     2   N
  11   0.0666667    -1     2   S
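The Density column of the trace is consistent with a uniform-kernel estimate in one dimension: the number of observations within R=2.5 of each point, counting the point itself, divided by n times the interval length 2R. The following is a minimal sketch that recomputes those counts with PROC SQL; the formula is inferred from the printed values, and the aliases A and B and the column names N_WITHIN and DENSITY_CHECK are illustrative only.

```sas
/* Same 12 observations as in this example */
data test;
   input x @@;
   datalines;
1 2 3 4 5 7.5 9 11.5 13 14.5 15 16
;

/* Self-join: pair every observation with all observations within R=2.5. */
/* Assumed estimate: count / (n * 2R), with n=12 and R=2.5.              */
proc sql;
   select a.x,
          count(*)                  as n_within,
          count(*) / (12 * 2 * 2.5) as density_check format=9.7
      from test as a, test as b
      where abs(a.x - b.x) <= 2.5
      group by a.x
      order by a.x;
quit;
```

For example, observation 3 (x=3) has five observations within 2.5 units, giving 5/60, which agrees with the 0.0833333 in the first row of the trace.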

 

Output 47.5.2: Density Plot

 

Notice that in Output 47.5.1, observation 7 is originally a seed (indicated by a value of -1 in the Old column) and then assigned to cluster 1. This is because the ratio of observation 7 to cluster 1 is 0.5 and is not less than the default value of THRESHOLD= (0.5).

If the value of the THRESHOLD= option is increased to 0.55, observation 7 should be excluded from cluster 1 and the cluster membership of observation 7 is changed.

The following statements produce Output 47.5.3 and Output 47.5.4:

   /*-- METHOD=6 with TRACE and THRESHOLD=0.55 --*/
   proc modeclus method=6 r=2.5 trace threshold=0.55 short out=out;
      var x;
   run;
      .
      .
      .
   (the DATA step and the PROC GPLOT statement are omitted because they are the same as in the previous job)

Output 47.5.3: Partial Output of METHOD=6 with TRACE and THRESHOLD=.55

The MODECLUS Procedure
R=2.5 METHOD=6

Trace of Clustering Algorithm

                   Cluster
 Obs     Density   Old   New   Ratio
------------------------------------------------
   3   0.0833333     1     1   M
   2   0.0666667     0     1   N
   4   0.0666667     0     1   N
   5   0.0666667     0     1   N
   1   0.0500000     0     1   N
   6   0.0500000     0     1   0.571
   9   0.0666667     1     2   M
   8   0.0500000     0     2   N
  10   0.0666667     1     2   S
  12   0.0500000     0     2   N
  11   0.0666667     1     2   S
   7   0.0500000     1     2   S

 

Output 47.5.4: Density Plot

 

In Output 47.5.3, observation 7 is a seed that is excluded by cluster 1 because its ratio to cluster 1 is less than 0.55. Being a neighbor of a member (observation 8) of cluster 2, observation 7 eventually joins cluster 2 even though it remains a SEED. (See Step 2.2 in the section METHOD=6 on page 2875.)
