MicrosoftВ® Office ExcelВ® 2007: Data Analysis and Business Modeling (Bpg -- Other)

Overview

In Chapter 36, “Summarizing Data by Using Histograms,” I showed how you can describe data sets by using histograms. In this chapter, I’ll show how to describe a data set by using particular characteristics of the data, such as the mean, median, standard deviation, and variance-measures that Microsoft Office Excel 2007 groups together as descriptive statistics. You can obtain the descriptive statistics for a set of data by clicking Data Analysis on the Analysis tab of the Ribbon, and then selecting the Descriptive Statistics option. After you enter the relevant data and click OK, all the descriptive statistics of your data are displayed.

Mean The mean of a data set is written as x and is simply the average of all observations in the sample. Thus, if the data values were x1, x2,…, xn, then the following equation calculates the mean.

Here, n equals the number of observations in the sample, and xi is the ith observation in the sample. We find that Cisco’s mean monthly return was 5.6 percent per month.

It is always true that the sum of the deviations of all values from the mean equals 0. Thus, you can think of a data set’s mean as a “balancing point” for the data. Of course, without using the Descriptive Statistics option, we can obtain a sample’s mean in Excel by applying the AVERAGE function to the appropriate cell range.

Median The median of a sample is the “middle” observation, when the data is listed from smallest to largest. If a sample contains an odd number of observations, the median is the observation that has as many observations below it as above it. Thus, for a sample of 9, the median would be the fifth smallest (or fifth largest) observation. When a sample includes an even number of observations, you can simply average the two middle observations. Essentially, the median is the 50th percentile of the data. For example, the median monthly return on Cisco was 5 percent. We could also obtain this information by using the MEDIAN function.

Mode The mode is the most frequently occurring value in the sample. If no value occurs more than once, the mode does not exist. For GM, no monthly return occurred more than once for the years 1990–2000, so the mode does not exist. For Cisco, the mode was approximately 5.14 percent. You can also use the MODE function to compute the mode. If no data value occurs more than once, the MODE function returns #NA.

The mode is rarely used as a measure of central location. It is interesting to note, however, that for a symmetric data set, the mean, median, and mode are equal.

A natural question is whether the mean or median is a better measure of central location. Essentially, we use the mean as the best measure of central location if the data set does not exhibit an excessive skew. Otherwise, we use the median as the measure of central location. If a data set is highly skewed, extreme values distort the mean. In this case, the median is a better measure of a typical data set value. For example, the U.S. government reports median family income instead of mean family income because family income is highly positively skewed.

The skewness measure reported by the Descriptive Statistics output informs us whether a data set is highly skewed.

Thus, monthly returns of GM and Cisco exhibit a slight degree of positive skewness. Because the skewness measure for each data set is less than +1, the mean is a better measure of a typical return than the median. You can also use the SKEW function to compute the skew of a data set.

Kurtosis Kurtosis, which sounds like a disease, is not a very important measure. Kurtosis near 0 means a data set exhibits “peakedness” close to the normal (or standard Bell-shaped) curve. (I’ll discuss the normal curve in Chapter 58, “The Normal Random Variable.”) Positive kurtosis means that a data set is more peaked than a normal random variable, whereas negative kurtosis means that data is less peaked than a normal random variable. GM monthly returns are more peaked than a normal curve, whereas Cisco monthly returns are less peaked than a normal curve.

Sample variance and sample standard deviation The sample variance s2 is defined by the following formula.

You can think of the sample variance as the average squared deviation of the data from its mean. Intuitively, it seems like we should divide by n to compute a true average squared deviation, but technical reasons require us to divide by n–1.

Dividing the sum of the squared deviations by n–1 ensures that our sample variance will be an unbiased measure of the true variance of the population from which the sampled data was drawn.

The sample standard deviation s is just the square root of s2.

Here is an example of these computations for the three numbers 1, 3, and 5.

Then we find that

We find that the sample standard deviation of monthly returns for Cisco is 12.2% with a sample variance of 0.015%2. Naturally, %2 is hard to interpret, so we usually look at the sample standard deviation. For GM, the sample standard deviation is 8.97%.

Range The range of a data set is the largest number in the data set minus the smallest number. We find that the range in the monthly Cisco returns is equal to 54 percent and the range for GM monthly returns is 52 percent.

Using Conditional Formatting to Highlight Outliers

Figure 37-3: Outliers for Cisco highlighted with conditional formatting

To begin, I computed the lower cutoff for an outlier (mean–2s) in cell J69 and the upper cutoff for an outlier (mean+2s) in cell J70. Next, I selected the entire range of Cisco returns (cells F52:F181). I then go to the first cell in the range (F52), select Conditional Formatting on the Home tab, and select New Rule. Then in the New Formatting Rule dialog box, select Use A Formula To Determine Which Cells To Format and fill in the rest of the dialog box as shown in Figure 37-4.

Figure 37-4: Conditional formatting rules to select outliers, as shown in the New Formatting Rule dialog box

This condition ensures that if cell F52 is either more than 2s above or below the mean monthly Cisco return, the format we select (a red font color in this case) will be applied to cell F52. This formatting condition is automatically copied to our selected range. Note that all outliers show up in red.

Категории