School of Six Sigma Descriptive Statistics
Overview
In this module we’re going to learn about descriptive statistics. By the end of this module you’ll know what descriptive statistics are as well as what the different measures of central tendency and dispersion are.
Definition of Descriptive Statistics
Let’s get started by explaining what descriptive statistics are and when they’re used.
If someone were to ask you to describe each of these people, you’d likely talk about things such as their height, eye color, hair color, and so on.
When we’re describing someone’s appearance we’re actually using a form of descriptive statistics that help us to describe our data.
Measures of Central Tendency
When it comes to describing our data we generally focus on two characteristics: measures of central tendency and measures of dispersion. Let’s explore the different measures of central tendency.
Mean
The first measure of central tendency is the mean, which is the arithmetic balance point or average of a distribution. Calculating the mean is straightforward. Here we see a data set consisting of 9 numbers: 3, 2, 6, 6, 8, 10, 6, 1, and 4. To calculate the mean, we first add the numbers up, which in our example equates to 46. We then divide this figure by the total number of data points, which is 9. When we divide 46 by 9 we learn that the mean, or the average, is approximately 5.1. We’ll use the mean to describe the central tendency when our data are normally distributed.
We can usually tell when our data are normally distributed by looking at them in a graph known as a histogram, which we’ll cover later in the course. Additionally, there are statistical tests we can run to determine whether our data are normal or not.
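To make the arithmetic concrete, here is a minimal sketch of the mean calculation in Python, using the same 9 numbers from the example. The variable names are our own choice for illustration, and the standard library's `statistics.mean` is shown only as a cross-check.

```python
import statistics

# The 9-number data set from the example
data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

total = sum(data)   # add the numbers up
n = len(data)       # count the data points
mean = total / n    # divide the sum by the count

print(total, n)        # 46 9
print(round(mean, 1))  # 5.1

# Cross-check against the standard library
print(round(statistics.mean(data), 1))  # 5.1
```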
Median
The second measure of central tendency is the median, which is the midpoint of a data set. Let’s use the same data set to learn how to calculate the median. The first thing we must do is arrange the numbers in ascending order, or smallest to largest. We then locate the midpoint of the data set which, in our example, is 6 since it lands in the middle of the data set. If we had an even number of data points we would simply average the two middle figures in order to arrive at the median.
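The steps just described can be sketched in Python as follows. The helper name `manual_median` is hypothetical, and the four-number list used for the even case is made up purely to illustrate averaging the two middle figures.

```python
import statistics

def manual_median(values):
    """Return the median by sorting and taking the middle value(s)."""
    ordered = sorted(values)  # arrange smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [3, 2, 6, 6, 8, 10, 6, 1, 4]
print(sorted(data))         # [1, 2, 3, 4, 6, 6, 6, 8, 10]
print(manual_median(data))  # 6

# Illustrative even-sized list (not from the module)
print(manual_median([1, 2, 3, 4]))  # 2.5

# Cross-check against the standard library
print(statistics.median(data))  # 6
```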
We’ll use the median to describe the central tendency when our data are not normally distributed. For example, in this histogram we can see that the data are skewed to the right; as such, the mean may not be reliable.
We often see the median used within the real estate industry to describe home prices since most neighborhoods have a few extremely expensive homes that artificially drive the average home price up. Using the median, which isn’t affected by a few outlier data points, makes the most sense. If you ever have a realtor speaking to you about the average home price you should explain to them why using the median is more appropriate.
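To see the effect of an outlier for ourselves, here is a small sketch using made-up home prices. These figures are hypothetical and chosen only for illustration: five similarly priced homes plus one very expensive home.

```python
import statistics

# Hypothetical home prices (illustrative only, not real data)
prices = [250_000, 260_000, 275_000, 280_000, 300_000, 2_500_000]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)

print(round(mean_price))  # pulled well above most of the homes
print(median_price)       # 277500.0

# The one expensive home drags the mean above every other price,
# while the median stays near the typical home.
print(mean_price > 300_000)  # True
```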
Mode
The last measure of central tendency is the mode, which is the most frequently occurring value in a list. The mode is useful when dealing with attributes data and is actually the statistic used to create things like Pareto charts.
Let’s learn how to determine the mode using the same data set as before. While not mandatory, it’s helpful to once again order the data in ascending order. We then note which value occurs the most, which in our example is 6. So those are the 3 primary measures of central tendency.
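As a minimal sketch, the mode of our example data set could be found in Python by counting how often each value occurs. `collections.Counter` is just one convenient way to do the counting, and the variable names are our own.

```python
from collections import Counter
import statistics

data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

# Count how many times each value appears
counts = Counter(data)

# most_common(1) returns the single most frequent (value, count) pair
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # 6 3

# Cross-check against the standard library
print(statistics.mode(data))  # 6
```

Note that a data set can have more than one most-frequent value; this sketch reports only one of them.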
Measures of Dispersion
Let’s now turn our attention to measures of dispersion. The 3 primary measures of dispersion are the range, variance, and standard deviation. These statistics help us to describe the variation or spread in our data.
Range
Let’s start with the range, which is the difference between the largest and smallest observation in a data set. Let’s calculate the range using the same data we’ve been working with. While it’s not mandatory, it’s easier if you order the data from smallest to largest. You then simply subtract the smallest value from the largest value. In this case our range is 10 – 1, or 9.
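In code, the range calculation is a one-liner. Here is a short sketch using the same data set; `data_range` is simply our own variable name.

```python
data = [3, 2, 6, 6, 8, 10, 6, 1, 4]

largest = max(data)   # 10
smallest = min(data)  # 1

# Range = largest observation minus smallest observation
data_range = largest - smallest
print(data_range)  # 9
```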
We typically use the range when our data are not normally distributed. In other words, when we decide to use the median as our measure of central tendency because our data are skewed or we have outliers that seem to be driving the average up, we’ll also use the range as the measure of dispersion or spread.
Sample Variance
Next we have the sample variance, which is the average squared distance between an observation and the mean. You’ll notice, since we’re speaking about a sample statistic, we’re using the Roman letter s. The math looks much worse than it really is, so let’s work through an example. For this example our data set consists of the following numbers: 3.8, 4.1, 3.9, and 4.4. When we add these numbers together and divide by 4 we learn that the sample mean is 4.05.
Believe it or not, this is all we need to calculate the sample variance, which is noted as a lower case s squared. In order to calculate the sample variance we simply subtract the mean from each data point before squaring it. We then add all these values together and divide the sum by the number of samples minus 1. The reason we subtract 1 from the number of samples is because of something called Bessel’s correction, which is meant to help us correct any potential bias in the estimation of the population variance.
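Written out, the calculation just described takes the following form. The symbols are our own notation and are assumptions rather than something fixed by this module: n is the number of samples, x_i is an individual data point, and x-bar is the sample mean.

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2
```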
This is what it looks like when we plug our values into the formula. For example, we take 3.8 minus 4.05 and square that and then add that to 4.1 minus 4.05 squared and so on. Once we work out the math we learn that our sample variance is 0.07.
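If you’d like to check the math yourself, here is a minimal Python sketch of the same steps using the four numbers from the example. The variable names are our own, and `statistics.variance` (which also divides by n minus 1) is included only as a cross-check.

```python
import statistics

sample = [3.8, 4.1, 3.9, 4.4]

n = len(sample)
mean = sum(sample) / n
print(round(mean, 2))  # 4.05

# Subtract the mean from each data point, then square the result
squared_deviations = [(x - mean) ** 2 for x in sample]

# Add the squared deviations and divide by n - 1 (Bessel's correction)
sample_variance = sum(squared_deviations) / (n - 1)
print(round(sample_variance, 2))  # 0.07

# Cross-check against the standard library
print(round(statistics.variance(sample), 2))  # 0.07
```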
Sample Standard Deviation
And last, we come to the sample standard deviation which is simply the square root of the sample variance. Staying with the example we just worked with, when we take the square root of 0.07 we learn that our sample standard deviation is 0.265.
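Here is a short sketch of that last step in Python, again using the four numbers from the variance example. `statistics.stdev` computes the sample standard deviation (the n minus 1 version) directly; the variable names are our own.

```python
import math
import statistics

sample = [3.8, 4.1, 3.9, 4.4]

# Sample variance first, then its square root
sample_variance = statistics.variance(sample)
sample_std = math.sqrt(sample_variance)
print(round(sample_std, 3))  # 0.265

# The standard library can also do both steps in one call
print(round(statistics.stdev(sample), 3))  # 0.265
```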
You might wonder why we bother calculating the sample standard deviation when we already know the sample variance. When we calculate the sample variance the differences are squared, meaning the units of the sample variance are not the same as the units of the actual data points. By taking the square root of the variance, the units of standard deviation match the original data points.
When our data are normally distributed, we’ll use the sample standard deviation as the measure of dispersion along with the sample mean as the measure of central tendency. But if our data are not normally distributed, we’ll typically use the range as the measure of dispersion along with the median as the measure of central tendency.