Descriptive Statistics and Data Displays

Overview

Purpose of these tools

To provide basic information about the distribution and properties of a set of data

Deciding which tool to use

Statistical term conventions

The field of statistics is typically divided into two areas of study:

  1. Descriptive statistics represent a characteristic of a large group of observations (a population or a sample representing a population).

    • Ex: Mean and standard deviation are descriptive statistics about a set of data
  2. Inferential statistics draw conclusions about a population based upon analysis of sample data. A small set of numbers (a sample) is used to make inferences about a much larger set of numbers (the population).

    • Ex: You'll use inferential statistics when hypothesis testing (see Chapter 9)

Parameters are terms used to describe the key characteristics of a population.

In most cases, the data used in process improvement is a sample (a subset) taken from a population.

In general mathematics as well as in statistics, capital Greek letters are also used. These big letters serve as "operators" in equations, telling us what mathematical calculation to perform. In this book, you'll see a "capital sigma" in many equations:

∑ (capital sigma) indicates that the values should be added together (summing)
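Written out, the summing operator looks like the sketch below. The index i and the count n are conventional notation assumed here for illustration; the second expression shows how the same operator appears in the formula for the mean.

```latex
\sum_{i=1}^{n} X_i = X_1 + X_2 + \cdots + X_n
\qquad\text{and}\qquad
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
```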

Measures of central tendency (mean, median, mode)

Highlights

Mean (average)

The mean is the arithmetic average of a set of data.

Median

The median is the midpoint of a ranked order set of data.

To determine the median, arrange the data in ascending or descending order. The median is the value at the center (if there is an odd number of data points), or the average of the two middle values (if there is an even number of data points). The symbol for the median is X with a tilde (~) over it.

Mode

The mode of a set of data is the most frequently observed value(s).
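The three measures above can be checked quickly with Python's standard `statistics` module. This is a minimal sketch; the `scores` list is hypothetical illustration data, not the student scores from the original example.

```python
from statistics import mean, median, multimode

# Hypothetical scores in ascending order (NOT the data set from the original text)
scores = [12, 18, 25, 25, 31, 37, 40, 44, 52, 60]

score_mean = mean(scores)        # sum of the values divided by the number of values
score_median = median(scores)    # even count, so the average of the two middle values
score_modes = multimode(scores)  # list of the most frequently observed value(s)

print("mean:", score_mean)
print("median:", score_median)
print("mode(s):", score_modes)
```

`multimode` returns a list because a data set can have more than one mode, which matches the "value(s)" wording in the definition.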

Scores from 10 students arranged in ascending order:

  Tips 
  • While the mean is most frequently used, the median is occasionally helpful because it is not affected as much by outliers.

    • Ex: In the student scores data above, changing the "44" to a "99" would make the mean = 42.1 (up nearly 6 points) but the median would stay at 36. In that instance, the median would be far more representative of the data set as a whole.
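The effect of an outlier on the mean versus the median can be reproduced with a few lines of Python. The data below are hypothetical (the original scores are not reproduced here); only the idea of changing a "44" to a "99" is borrowed from the example above.

```python
from statistics import mean, median

# Hypothetical scores (NOT the original data set)
original = [12, 18, 25, 25, 31, 37, 40, 44, 52, 60]

# Change the 44 to a 99, as in the tip above
with_outlier = [99 if x == 44 else x for x in original]

mean_shift = mean(with_outlier) - mean(original)
median_shift = median(with_outlier) - median(original)

print("mean moves by:", round(mean_shift, 2))
print("median moves by:", median_shift)
```

The mean absorbs one-tenth of the 55-point change, while the median does not move because the two middle values are untouched.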

Measures of spread (range, variance, standard deviation)

Highlights

Range

Range is the difference between the largest and smallest values in a data set.

Variance

Variance tells you how far off the data values are from the mean overall.

  1. Calculate the mean of all the data points, Xbar
  2. Calculate the difference between each data point and the average (Xi - Xbar)
  3. Square those figures for all data points

    • Squaring ensures that every term is zero or positive. Otherwise, the positive and negative differences would cancel each other out and sum to zero
  4. Add the squared values together (a value called the sum of squares in statistics)
  5. Divide that total by n-1 (the number of data values minus 1)
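The five steps above can be sketched in Python as follows. The cycle times are hypothetical illustration data, and the variable names are chosen here for readability.

```python
# Hypothetical cycle times in minutes (illustration only)
data = [4.0, 7.0, 5.0, 9.0, 10.0]
n = len(data)

x_bar = sum(data) / n                        # step 1: mean of all data points
deviations = [x - x_bar for x in data]       # step 2: difference from the mean
squared = [d ** 2 for d in deviations]       # step 3: square each difference
sum_of_squares = sum(squared)                # step 4: add the squared values
sample_variance = sum_of_squares / (n - 1)   # step 5: divide by n - 1

print("sum of raw deviations:", sum(deviations))  # cancels to zero, hence step 3
print("sum of squares:", sum_of_squares)
print("sample variance:", sample_variance)
```

The result should agree with `statistics.variance(data)` from the standard library, which also divides by n - 1.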

Note that the equation above follows statistical conventions (p. 105) for describing sample statistics. Variance for a population uses a sigma as shown here.

Though more people are familiar with standard deviation (see below), variance has one big advantage: variances are additive while standard deviations are not. That means, for example, that the total variance for a process can be determined by adding together the variances for all the process steps (provided the steps vary independently of one another).

A drawback to using variance is that it is not in the same units of measure as the data points. Ex: for cycle times, the variance would be in units of "minutes squared," which doesn't make logical sense.

Standard deviation

Think of standard deviation as the "average distance from each data point to the mean." Calculate the standard deviation for a sample or population by doing the same steps as for the variance, then simply taking the square root. Here's how the equation would look for the 10 ages listed on the previous page:

Just as with variance, the standard deviation of a population is denoted with sigma instead of "s", as shown here:
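The equations themselves did not survive in this text. As a sketch, the standard textbook forms that match the variance steps above are given below; the symbols μ (population mean) and N (population size) are the usual conventions and are assumed here rather than taken from the original figures.

```latex
s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}
\qquad\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}
```

In both cases the quantity under the square root is the corresponding variance (s² for a sample, σ² for a population).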

The standard deviation is a handy measure of variability because it is stated in the same units as the data points. But as noted above, you CANNOT add standard deviations together to get a combined standard deviation for multiple process steps. If you want an indication of spread for a process overall, add together the variances for each step then take the square root.
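A short sketch of that last rule, using hypothetical standard deviations for three process steps that are assumed to vary independently:

```python
import math

# Hypothetical standard deviations (minutes) for three independent process steps
step_sds = [3.0, 4.0, 12.0]

# Correct approach: add the variances, then take the square root
total_variance = sum(sd ** 2 for sd in step_sds)
combined_sd = math.sqrt(total_variance)

# Incorrect approach: adding the standard deviations directly
naive_sum = sum(step_sds)

print("combined standard deviation:", combined_sd)
print("naive sum of standard deviations:", naive_sum)
```

Adding the standard deviations directly overstates the overall spread, which is why the variances are combined first.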

Boxplots

Highlights

To use boxplots…

Frequency plot (histogram)

Purpose

To evaluate the distribution of a set of data (to learn about its basic properties and to evaluate whether you can apply certain statistical tests)

When to use frequency plots

Types of frequency plots

Though they all basically do the same thing, there are several different types of frequency plots you may encounter:

  1. Dot plot

    Dot plots display a dot (or other mark) for each observation along a number line. If there are multiple occurrences of an observation, or if observations are too close together, then dots will be stacked vertically.

    • Dot plots are very easy to construct by hand, so they can be used "in the field" for relatively small sets of data.
    • Dot plots are typically used for data sets with fewer than 30 to 50 points. Larger data sets use histograms (see below) and box plots (see p. 110).
    • Unlike histograms, dot plots show you how often specific data values occur.
  2. Histogram

    Histograms display bars representing the count within different ranges of data rather than plotting individual data points. The groups represent non-overlapping segments in the range of data.

    • Ex: All the values between 0.5 and 1.49 might be grouped in an interval labeled "1," all the values between 1.5 and 2.49 might be grouped in an interval labeled "2," etc.
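The difference between the two plot types can be sketched in Python. The measurements are hypothetical, and `interval_label` is a helper name invented here to mimic the "0.5 to 1.49 is labeled 1" grouping described above.

```python
import math
from collections import Counter

# Hypothetical measurements (illustration only)
values = [0.7, 1.2, 1.2, 1.6, 2.1, 2.1, 2.1, 2.4, 3.3]

# Dot plot: one mark per observation, stacked where a value repeats
for value, count in sorted(Counter(values).items()):
    print(f"{value:>4} " + "*" * count)

def interval_label(x):
    """Hypothetical helper: 0.5 up to 1.5 maps to 1, 1.5 up to 2.5 maps to 2, etc."""
    return math.floor(x + 0.5)

# Histogram-style grouping: counts per interval, individual values no longer visible
interval_counts = Counter(interval_label(v) for v in values)
for label, count in sorted(interval_counts.items()):
    print(f"interval {label}: " + "#" * count)
```

The dot plot output still shows that 2.1 occurred three times; the grouped output only shows that five values fell in interval 2.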

How to create a histogram

  1. Take the difference between the min and max values in your observations to get the range of observed values
  2. Divide the range into evenly spaced intervals

    • This is often trickier than it seems. Having too many intervals will exaggerate the variation; too few intervals will obscure the amount of variation.
  3. Count the number of observations in each interval
  4. Create bars whose heights represent the count in each interval
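The four steps can be sketched in plain Python as shown below. The observations and the choice of four intervals are assumptions made for illustration; in practice the number of intervals is the judgment call noted in step 2. The bars are drawn sideways as text, so bar length stands in for bar height.

```python
# Hypothetical observations (illustration only)
observations = [2.0, 3.5, 4.1, 4.8, 5.0, 5.2, 5.9, 6.3, 7.7, 10.0]
num_intervals = 4  # assumed; too many or too few will distort the picture

# Step 1: range of observed values
low, high = min(observations), max(observations)
data_range = high - low

# Step 2: evenly spaced intervals
width = data_range / num_intervals

# Step 3: count the observations in each interval
counts = [0] * num_intervals
for x in observations:
    # The maximum value would land one past the end, so it goes in the last interval
    index = min(int((x - low) / width), num_intervals - 1)
    counts[index] += 1

# Step 4: draw one bar per interval
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:5.1f} to {start + width:5.1f} | " + "#" * count)
```

Rerunning the sketch with a different `num_intervals` is a quick way to see how the apparent variation changes with the interval choice.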

Interpreting histogram patterns

Histograms and dot plots tell you about the underlying distribution of the data, which in turn tells you what kind of statistical tests you can perform and also points out potential improvement opportunities.

  1. This first pattern is what a normal distribution would look like, with data more-or-less symmetric about a central mean.

  2. A histogram with two peaks is called bimodal. This usually indicates that there are two distinct pathways through the process. You need to define customer requirements for this process, investigate what accounts for the systematic differences, and improve the pathways to shift both paths towards the requirements.

  3. You may see a number of distributions that are skewed, meaning data values pile up towards one end and tail off towards the other end. This pattern is common with data such as time measurements (where a relatively small number of jobs can take much longer than the majority). This type of pattern occurs when the data have an underlying distribution that is not normal or when measurement devices or methods are inadequate. If a non-normal distribution is at work, you cannot use hypothesis tests or calculate control limits for this kind of data unless you take subgroup averages (see Central Limit Theorem, p. 114).

Normal distribution

In many situations, data follow a normal distribution (bell-shaped curve). One of the key properties of the normal distribution is the relationship between the shape of the curve and the standard deviation (σ for population; s for sample).

To use these probabilities, your data must be random, independent, and normally distributed.
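The relationship between the curve and the standard deviation can be computed directly. The sketch below uses the error function from Python's `math` module to find the share of a normal distribution that lies within 1, 2, and 3 standard deviations of the mean; `prob_within` is a helper name introduced here for illustration.

```python
import math

def prob_within(k):
    """Probability that a normally distributed value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} standard deviation(s): {prob_within(k):.4f}")
```

These are the familiar figures of roughly 68%, 95%, and 99.7%, and they apply only when the conditions stated above (random, independent, normally distributed data) hold.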

Non-normal distributions and the Central Limit Theorem

Highlights

Central Limit Theorem

Regardless of the shape of the parent population, the distribution of the means calculated from samples quickly approaches the normal distribution as shown below:
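A small simulation can illustrate the theorem. The sketch below assumes an exponential parent population (strongly skewed, chosen here only as an example) and uses skewness as a rough symmetry check; a normal distribution has a skewness of zero. The function names and sample sizes are illustrative choices.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so reruns give the same numbers

# Assumed parent population: exponential, clearly non-normal and right-skewed
population = [random.expovariate(1.0) for _ in range(20000)]

def sample_means(sample_size, num_samples=2000):
    """Means of repeated random samples drawn from the parent population."""
    return [mean(random.sample(population, sample_size)) for _ in range(num_samples)]

def skewness(values):
    """Rough skewness estimate: near 0 for a symmetric distribution, larger for a long right tail."""
    m, s = mean(values), stdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

for size in (1, 5, 30):
    means = sample_means(size)
    print(f"sample size {size:>2}: skewness of the sample means = {skewness(means):.2f}")
```

As the sample size grows, the skewness of the sample means should shrink towards zero, which is the behavior the theorem describes.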

Practical Rules of Thumb
