lecture 2 describing data ii ©. summarizing and describing data frequency distribution and the...
Post on 17-Dec-2015
Lecture 2
Describing Data II
Summarizing and Describing Data
Frequency distribution and the shape of the distribution
Measures of variability
1. Frequency distribution and the shape of the distribution
In the previous lecture, we saw that the mean of the household savings gives an inflated image of the saving of a “normal household”.
This was because the shape of the histogram was not symmetric.
It is important to look at how the observations are distributed.
Japanese household savings
[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen, in bins of 2,000 from "below 2,000" to "above 40,000". Vertical axis: percentage. Bar labels in bin order: 14.1, 10.6, 9.5, 8.2, 6.9, 6.2, 5.1, 4.5, 3.5, 3, 3, 2.7, 2, 2, 1.9, 1.7, 1.2, 1.3, 1, 1, and 10.7 for "above 40,000". Sample average = 17,280,000; median = 10,520,000.]
1-1 Frequency Distribution
The frequency table that we used in the previous lecture is also called the frequency distribution. A frequency distribution describes how the observations are distributed. When we plot the frequency table, the plot is called a histogram.
A histogram usually shows the number of observations in a specific range. However, sometimes, it shows the percentage of observations in a specific range.
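As a rough illustration of how a frequency table with both counts and percentages might be built, here is a minimal Python sketch. The savings figures and the variable names are made up for this example and are not the lecture's data; only the bin width of 2,000 mirrors the savings histogram.

```python
from collections import Counter

# Hypothetical savings figures in thousand yen (made up for illustration).
savings = [500, 1200, 1800, 2500, 3100, 3900, 4200, 5600, 7400, 12000]
bin_width = 2000  # same bin width as the savings histogram

# Map each observation to the lower edge of its bin, then count per bin.
counts = Counter((value // bin_width) * bin_width for value in savings)

n = len(savings)
frequency_table = []
for lower in sorted(counts):
    count = counts[lower]
    percent = 100 * count / n  # share of all observations in this bin
    frequency_table.append((lower, lower + bin_width, count, percent))
    print(f"{lower:>6}-{lower + bin_width:<6} count={count} percent={percent:.1f}")
```

Note that this sketch lists only the bins that contain at least one observation; a plotted histogram would also show the empty bins.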
1-2 Shape of the Distribution
The shape of the distribution refers to the shape of the Histogram.
1-3 Symmetric Distribution
The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. The shape of the distribution is symmetric if the shape of the histogram is symmetric.
Symmetric Distribution
[Figure: Symmetric Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]
Note: For a symmetric distribution, the mean and median are equal.
Symmetric Distribution: An example
The age distribution of the clients (from the previous lecture note) is nearly symmetric.
[Figure: Histogram of the clients' ages. Horizontal axis: clients' age range; vertical axis: frequency.]
1-4 Skewed Distribution
A distribution is skewed if the observations are not symmetrically distributed above and below the mean. A positively skewed (or skewed to the right) distribution has a tail that extends to the right in the direction of positive values. A negatively skewed (or skewed to the left) distribution has a tail that extends to the left in the direction of negative values.
Positively skewed distribution
[Figure: Positively Skewed Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]
Positively skewed distribution: An example
The household saving histogram (from the previous lecture) is an example of a positively skewed distribution.
[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen; vertical axis: percentage. Sample average = 17,280,000; median = 10,520,000.]
Positively skewed distribution: A note
For a positively skewed distribution the mean is greater than the median.
Negatively skewed distribution
[Figure: Negatively Skewed Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]
Note: For a negatively skewed distribution, the mean is less than the median.
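The relation between the mean and the median in the three notes can be checked on small samples. The sketch below uses made-up numbers, not the lecture's data, and only shows that the pattern holds for these particular samples.

```python
import statistics

# Hypothetical samples (made up for illustration).
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 4, 20]   # long tail toward large values
left_skewed = [-20, 4, 5, 5, 6, 6, 7]   # long tail toward small values

samples = [
    ("symmetric", symmetric),
    ("positively skewed", right_skewed),
    ("negatively skewed", left_skewed),
]
for name, data in samples:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean={mean:.2f} median={median}")
```

In the right-skewed sample the single large value pulls the mean above the median, and in the left-skewed sample the single small value pulls it below.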
2. Measures of Variability
Variance
Standard deviation
Example
The data “Sales at two different stores” contain daily sales data for two different stores. The data are collected for 60 days.
Store A’s average daily sales is 231,800 yen. Store B’s average daily sales is 230,500 yen.
Can we say that they are similar stores? Look at the following graphs.
Daily sales of the two stores
[Figure: Store A: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen. Average = 231,800 yen.]
[Figure: Store B: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen. Average = 230,500 yen.]
Daily sales of the two stores
The difference between the two stores is that Store A’s sales have much higher variation than Store B’s sales.
We need a measure of variability in data.
2-1 How to measure the variability (1)
Taking Store A’s data as an example, the variability of each observation can be seen from the difference between the observation and the mean.
But, how do we measure the overall variability of the data?
[Figure: Store A: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen. Average = 231,800 yen. Annotation: for each observation, you can compute the difference from the average.]
How to measure the variability (2)
Overall variability
How about taking the average of all differences?
This is not a good idea: the differences can be positive or negative, and the differences from the mean always sum to zero.
Therefore, we take the square of each difference. This is the first step to compute the “Variance”, a measure of overall variability.
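The point about the differences cancelling out can be seen on a tiny example. The sketch below uses five made-up daily sales figures (in thousand yen), not the Store A data.

```python
# Hypothetical daily sales in thousand yen (made up for illustration).
sales = [200, 250, 180, 300, 220]

mean = sum(sales) / len(sales)            # 230.0
deviations = [x - mean for x in sales]    # -30, 20, -50, 70, -10
squared = [d ** 2 for d in deviations]    # 900, 400, 2500, 4900, 100

print(sum(deviations))  # positive and negative differences cancel out
print(sum(squared))     # squared differences do not cancel
```

Because every squared difference is non-negative, their sum grows with the spread of the data instead of cancelling.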
[Figure: Store A: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen. Average = 231,800 yen. Annotation: for each observation, you can compute the difference from the average.]
2-2 Variance
A measure of variability
Variance is computed in the following way.
1. Subtract the mean from each observation (compute the difference between each observation and the mean. Note that the difference can be negative).
2. Then, square each difference.
3. Sum all the squared differences.
4. Divide the sum of squared differences by n-1 (the number of observations minus 1).
We will learn the reason why we divide the sum of squares by n-1 after we learn the concept of the expectation.
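The four steps above can be sketched in Python as follows. The function name `sample_variance` and the sales figures are assumptions made for this illustration; the result is compared with `statistics.variance` from the Python standard library, which also divides by n-1.

```python
import statistics

def sample_variance(data):
    """Sample variance following the four steps listed in the lecture."""
    n = len(data)
    mean = sum(data) / n
    differences = [x - mean for x in data]   # step 1: subtract the mean
    squared = [d ** 2 for d in differences]  # step 2: square each difference
    total = sum(squared)                     # step 3: sum the squared differences
    return total / (n - 1)                   # step 4: divide by n-1

# Hypothetical daily sales in thousand yen (made up for illustration).
sales = [200, 250, 180, 300, 220]

print(sample_variance(sales))      # 8800 / 4 = 2200.0
print(statistics.variance(sales))  # library result for comparison
```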
Computation of the variance: Exercise
Open the data “Computation of Variance”, and compute the variance of Store A’s daily sales
Compute the variance of Store B’s daily sales
Computation of the variance: Exercise
Store A: Average daily sales = 231.8 thousand yen, Variance = 4979.9
Store B: Average daily sales = 230.5 thousand yen, Variance = 335.9
Notice that the variance for Store A is higher than that for Store B. This is because the variation in the daily sales is higher for Store A.
Variance: note
In the previous slide, we did not use any unit of measurement for variance. (For example, we do not say that the variance for Store A is 4979.9 thousand yen.)
This is because, when we compute the variance, we square the data. Therefore, the unit of measurement for variance is “square of thousand yen”, which is not a meaningful unit.
Therefore, we use the Standard Deviation, another measure of variation.
2-3 A measure of variability: Standard deviation
Standard deviation is the square root of the variance.
Exercise: Compute the standard deviation of the daily sales for Store A and Store B.
$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$
Standard Deviation: Store sales data example
Standard deviation of Store A’s daily sales=70.57 thousand yen.
Standard deviation for Store B’s daily sales= 18.33 thousand yen.
This means that the daily sales of Store A typically deviate from their average by about 70.6 thousand yen, and the daily sales of Store B by about 18.3 thousand yen.
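The two standard deviations can be reproduced from the variances reported earlier in the lecture (4979.9 for Store A and 335.9 for Store B). The short sketch below only takes square roots; the variable names are chosen for this illustration.

```python
import math

# Variances reported in the lecture (daily sales measured in thousand yen).
variance_a = 4979.9
variance_b = 335.9

# Standard deviation is the square root of the variance.
sd_a = math.sqrt(variance_a)
sd_b = math.sqrt(variance_b)

print(round(sd_a, 2))  # Store A, in thousand yen
print(round(sd_b, 2))  # Store B, in thousand yen
```

Taking the square root brings the measure back to the original unit, thousand yen, which is why the standard deviation can be read alongside the mean.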
Standard deviation and variance as measures of risk (or uncertainty)
Often standard deviation and variance are used as measures of uncertainty or risk.
If you would like to work as a store manager, then Store B may be a better store to work for: although its average sales are almost the same as Store A’s, the uncertainty is lower (lower standard deviation).
Standard deviation and variance as measures of risk (or uncertainty)
In the store sales data, the average sales for both stores are similar.
However, on many other occasions, higher return (higher average sales) comes with higher risk (higher standard deviation).
One makes a decision by choosing a good combination of return and risk. For example, if you invest in a stock, you would choose a stock with a combination of return and risk that suits your preference.
Therefore, standard deviation and variance are important numerical measures for summarizing data when making decisions.
2-4. Understanding the mathematical notation of the variance
Most of the time, we only have sample data (not population data).
Variance computed from a sample is called sample variance. We denote sample variance by $s^2$.
When we have population data (which does not happen often), we can compute the population variance. We denote the population variance by $\sigma^2$.
Understanding the mathematical notation of sample variance

Observation id    Variable X
1                 $x_1$
2                 $x_2$
3                 $x_3$
...               ...
n                 $x_n$

The typical data we use comes in this format. Using this format, we would like to represent variance in a mathematical form.
Understanding the mathematical notation of sample variance

Obs id     Variable X     Each data - the mean     (Each data - the mean)^2
1          $x_1$          $x_1 - \bar{X}$          $(x_1 - \bar{X})^2$
2          $x_2$          $x_2 - \bar{X}$          $(x_2 - \bar{X})^2$
3          $x_3$          $x_3 - \bar{X}$          $(x_3 - \bar{X})^2$
:          :              :                        :
n          $x_n$          $x_n - \bar{X}$          $(x_n - \bar{X})^2$
Average    $\bar{X}$

The first steps of computing the variance are written in the table.
The variance can be computed by summing the last column and dividing the sum by (n-1).
Therefore, mathematically, a sample variance, $s^2$, can be written as follows.
Understanding the mathematical notation for sample variance
Mathematically, sample variance, denoted as $s^2$, can be written as
$$s^2 = \frac{(x_1-\bar{X})^2 + (x_2-\bar{X})^2 + (x_3-\bar{X})^2 + \cdots + (x_n-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}$$
Mathematical notation for population variance
Though not often, we may have population data. Then we can compute the population variance. We use the notation $\sigma^2$ to denote the population variance. We also use upper case $N$ to denote the number of observations. Writing $\mu$ for the population mean, the mathematical notation for the population variance is
$$\sigma^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + (x_3-\mu)^2 + \cdots + (x_N-\mu)^2}{N} = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$$
Unlike the case for sample variance, we do not have to divide the sum of squares by N-1. We simply divide it by N.
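The only difference between the two formulas is the divisor, which a short sketch can make concrete. The data below are made up for this illustration, and treating the same five numbers first as a sample and then as a whole population is purely for comparison. `statistics.variance` and `statistics.pvariance` are the standard-library functions for the sample and population versions.

```python
import statistics

# Hypothetical data (made up for illustration).
data = [200, 250, 180, 300, 220]

n = len(data)
mean = sum(data) / n
sum_of_squares = sum((x - mean) ** 2 for x in data)

sample_var = sum_of_squares / (n - 1)  # divide by n-1 when the data are a sample
population_var = sum_of_squares / n    # divide by N when the data are the population

print(sample_var, statistics.variance(data))       # sample variance, two ways
print(population_var, statistics.pvariance(data))  # population variance, two ways
```

For the same numbers the sample variance is always the larger of the two, because it divides the same sum of squares by a smaller number.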
2-5. Mathematical notation for the sample standard deviation
The sample standard deviation, $s$, is written as
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}}$$
Mathematical notation for population standard deviation
The population standard deviation, $\sigma$, is written as (with $\mu$ the population mean)
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}$$
2-6. Short-cut formula for sample variance
The short-cut formula for the sample variance is:
$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$$
Exercise
Compute the variance for the sales of Store A by applying the short-cut formula for sample variance, and show that this indeed coincides with our previous calculation.
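Before working through the exercise on the store data, the agreement between the two formulas can be checked on a small made-up sample. The sketch below is only an illustration under that assumption and does not use the Store A data.

```python
import math

# Hypothetical daily sales in thousand yen (made up for illustration).
data = [200, 250, 180, 300, 220]

n = len(data)
mean = sum(data) / n

# Definitional formula: sum of squared differences from the mean, divided by n-1.
definitional = sum((x - mean) ** 2 for x in data) / (n - 1)

# Short-cut formula: (sum of squares - n * mean^2), divided by n-1.
shortcut = (sum(x ** 2 for x in data) - n * mean ** 2) / (n - 1)

print(definitional, shortcut)
print(math.isclose(definitional, shortcut))
```

On a computer, the short-cut form subtracts two large numbers, so for data with a large mean and a small spread it may lose precision; the definitional form is usually the safer choice in code.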
Other Measures of Variability
1. The Range
The range in a set of data is the difference between the largest and smallest observations.
Other Measures of Central Tendency
2. Mode
The mode, if one exists, is the most frequently occurring observation in the sample or population.
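Both measures are quick to compute. The sketch below uses made-up client ages, not the data from the previous lecture, and relies on `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency.

```python
import statistics

# Hypothetical client ages (made up for illustration).
ages = [23, 31, 31, 35, 42, 42, 42, 58]

# Range: difference between the largest and smallest observations.
data_range = max(ages) - min(ages)

# Mode: the most frequently occurring observation(s).
modes = statistics.multimode(ages)

print(data_range)  # 58 - 23 = 35
print(modes)       # [42]

# When several values tie for the highest frequency, all of them are returned.
tied = statistics.multimode([1, 1, 2, 2])
print(tied)        # [1, 2]
```

Returning a list makes the "if one exists" caveat visible: more than one element means there is no single mode.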
This lecture note covers:
Textbook P23~P28: Frequency distribution
Textbook 3.1, 3.2: Measures of central tendency and variability