MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)

Overview

Trend curves are a great help in understanding how two variables are related. Often, however, we need to understand how more than two variables are related. Looking at the correlation between any pair of variables can provide insights into how multiple variables move up and down in value together.

The correlation (usually denoted by r) between two variables (call them x and y) is a unit-free measure of the strength of the linear relationship between x and y. The correlation between any two variables is always between –1 and +1. Although the exact formula used to compute the correlation between two variables isn’t very important, being able to interpret the correlation between the variables is.

A correlation near +1 means that x and y have a strong positive linear relationship. That is, when x is larger than average, y tends to be larger than average, and when x is smaller than average, y also tends to be smaller than average. When a straight line is applied to the data, there will be a straight line with a positive slope that does a good job of fitting the points. As an example, for the data shown in Figure 46-1 (x=units produced and y=monthly production cost), x and y have a correlation of +0.90.

Figure 46-1: Correlation near +1, indicating that two variables have a strong positive linear relationship

On the other hand, a correlation near –1 means that there is a strong negative linear relationship between x and y. That is, when x is larger than average, y tends to be smaller than average, and when x is smaller than average, y tends to be larger than average. When a straight line is applied to the data, the line will have a negative slope that does a good job of fitting the points. As an example, for the data shown in Figure 46-2 (x= the price charged for a product and y= product demand), x and y have a correlation of –0.94.

Figure 46-2: Correlation near –1, indicating that two variables have a strong negative linear relationship

A correlation near 0 means that x and y have a weak linear relationship. That is, knowing whether x is larger or smaller than its mean tells you little about whether y will be larger or smaller than its mean. Figure 46-3 shows a graph of the dependence of unit sales (y) on years of sales experience (x). Years of experience and unit sales have a correlation of 0.003. In our data set, the average experience is 10 years. We see that when a person has more than 10 years of sales experience, his or her sales can be either low or high. We also see that when a person has fewer than 10 years of sales experience, sales can be low or high. Although experience and sales have little or no linear relationship, there is a strong non-linear relationship (see the fitted curve) between years of experience and sales. Correlation does not measure the strength of non-linear relationships.

Figure 46-3: Correlation near 0, indicating a weak linear relationship between two variables

Filling in the correlation matrix   As you can see in this example, Microsoft Office Excel 2007 left some entries in the correlation matrix blank. For example, the correlation between Microsoft and GE (which is equal to the correlation between GE and Microsoft) is omitted. If you want to fill in the entire correlation matrix, right-click the matrix, and then click Copy. Right-click a blank portion of the worksheet, and then click Paste Special. In the Paste Special dialog box, select Transpose. This flips the data on its side. Now right-click the flipped data, and click Copy. Right-click the original correlation matrix, and click Paste Special again. In the Paste Special dialog box, check the Skip Blanks box, and then click OK. The transposed data is copied to the original matrix, but does not copy the blanks from the transposed data. The full correlation matrix is shown in Figure 46-7.

Figure 46-7: Complete correlation matrix

Using the CORREL function   As an alternative to using the Correlation option of the Analysis Toolpak, you can use the CORREL function. For example, entering in cell I49 the formula CORREL(E52:E181,F52:F181) confirms that the correlation between monthly returns on Cisco (shown in column F) and GM (shown in column E) is 0.159.

Relationship between correlation and R2   In Chapter 45, “Estimating Straight Line Relationships,” we found an R2 value for units produced and monthly operating cost of 0.688. How is this value related to the correlation between units produced and monthly operating costs? The correlation between two sets of data is simply

for the trendline, where we choose the sign for the square root to be the same as the sign of the slope of the trendline. Thus the correlation between units produced and monthly operating cost for our Chapter 41, “Consolidating Data,” data is

Correlation and regression towards the mean   You have probably heard the phrase “regression towards the mean.” Essentially, this means that the predicted value of a dependent variable will be in some sense closer to its average value than the independent variable. More precisely, suppose we try to predict a dependent variable y from an independent variable x. If x is k standard deviations above average, then our prediction for y will be r×k standard deviations above average (here, r=correlation between x and y). Because r is between -1 and +1 this means that y is fewer standard deviations away from the mean than x. This is the real definition of “regression towards the mean.” See Problem 5 for an interesting application of the concept of regression towards the mean.

Категории