examining distributions - university of virginia graphs and pie charts describe the distribution of...
Examining Distributions - Introduction
Chapter 1
A variable records characteristics of individuals (i.e., objects of interest) in its values.
A variable’s distribution describes the counts or relative proportions of its values.
Variables
Examining Distributions - Describing Distributions with Graphs
Section 1.1
Bar graphs and pie charts describe the distribution of a categorical variable.
A Pareto chart is a bar graph with categories ordered by decreasing frequency.
Histograms are essentially bar graphs of a quantitative variable.
Stemplots are back-of-the-envelope histograms drawn with the digits of quantitative values.
Time plots graph time series values by time.
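The Pareto ordering in the list above can be sketched in a few lines: count the values of a categorical variable, then list the categories by decreasing frequency. The `colors` data below are hypothetical, made up only for illustration.

```python
from collections import Counter

# Hypothetical categorical data (not from the slides).
colors = ["red", "blue", "red", "green", "blue", "red"]

counts = Counter(colors)
# most_common() returns (category, count) pairs sorted by decreasing count,
# which is the bar order a Pareto chart uses.
pareto_order = counts.most_common()

total = sum(counts.values())
for category, count in pareto_order:
    # Show the count and the relative proportion of each value.
    print(f"{category:<6} {'#' * count} {count} ({count / total:.0%})")
```

The same counts in their original order would give an ordinary bar graph; only the sorting differs.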
Some graphical statistics
Histograms
Use equal bar-widths and “eyeball” for best picture
December 2004 state unemployment rates.
(Raw data in Table 1.1 of text.)
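A histogram with equal bar widths amounts to counting values in equal-width classes. The sketch below assumes hypothetical rates (the Table 1.1 values are not reproduced here), and the helper name `equal_width_counts` is an illustration, not something from the text.

```python
# Hypothetical rates in percent (not the Table 1.1 data).
rates = [3.1, 3.4, 4.0, 4.2, 4.5, 4.6, 5.0, 5.1, 5.3, 5.9, 6.2, 7.5]

def equal_width_counts(values, start, width, n_bins):
    """Count values in classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

start, width = 3.0, 1.0
counts = equal_width_counts(rates, start, width, n_bins=5)
for k, c in enumerate(counts):
    lo = start + k * width
    print(f"{lo:.1f}-{lo + width:.1f} | {'*' * c}")
```

Changing `width` and rerunning is one way to "eyeball" which picture shows the overall pattern best.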
Interpreting histograms
Too much detail
Visualize a smooth curve highlighting the overall pattern
Look for shape, center, and spread.
Distribution shapes
Symmetric distribution
Right-skewed distribution
Complex, multimodal distribution
Interpreting histograms
Look for deviations, like outliers.
Alaska Florida
Stemplot
December 2004 state unemployment rates.
(Raw data in Table 1.1 of text.)
Stem Leaves
Split stem
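A stemplot is simple enough to build by hand or in code. The sketch below assumes hypothetical rates with one decimal place (not the Table 1.1 data), using the whole-number part as the stem and the tenths digit as the leaf.

```python
from collections import defaultdict

# Hypothetical rates in percent, one decimal place (not the Table 1.1 data).
rates = [3.1, 3.4, 4.0, 4.2, 4.5, 4.6, 5.0, 5.1, 5.3, 5.9, 6.2, 7.5]

def stemplot(values):
    """Map each stem (whole part) to its sorted leaves (tenths digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        tenths = round(v * 10)            # 4.2 -> 42, avoids float truncation
        stem, leaf = divmod(tenths, 10)   # 42 -> stem 4, leaf 2
        stems[stem].append(leaf)
    return dict(stems)

plot = stemplot(rates)
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

A split stem would print each stem twice, once for leaves 0 to 4 and once for leaves 5 to 9.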
Examining Distributions - Describing Distributions with Numbers
Section 1.2
Measure of center: the mean
Heights (in.) of 25 women
Measure of center: the median
Step 1: Sort x1, …, xn.
Step 2.a: If n is odd, M = middle value
Step 2.b: If n is even, M = avg. of two middle values
M = 3.4
M = (3.3+3.4)/2 = 3.35
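The median steps translate directly into code. The two samples below are hypothetical, chosen only so that the odd case gives M = 3.4 and the even case gives M = 3.35 as on the slide.

```python
def median(values):
    """Median M: sort, then take the middle value or the average of the two middle values."""
    xs = sorted(values)                  # Step 1: sort
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:                       # Step 2.a: n odd -> middle value
        return xs[mid]
    return (xs[mid - 1] + xs[mid]) / 2   # Step 2.b: n even -> average of two middle values

# Hypothetical samples (the slide's raw data are not shown).
odd_sample = [4.1, 0.6, 3.4, 2.2, 6.1]
even_sample = [0.6, 2.2, 3.3, 3.4, 4.1, 6.1]

m_odd = median(odd_sample)
m_even = median(even_sample)
print(m_odd, round(m_even, 2))
```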
Comparisons
Left skew, symmetry, right skew
Observe:
The mean is “pulled” by outliers.
The median is resistant to outliers.
Measure of spread: the quartiles
The first quartile, Q1, is the median of values below M.
The third quartile, Q3, is the median of values above M.
Q1 = 2.2, M = 3.4, Q3 = 4.35
Five-number summary and boxplot
Min = 0.6, Q1 = 2.2, M = 3.4, Q3 = 4.35, Max = 6.1
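A minimal sketch of the five-number summary, using the quartile rule stated above (Q1 and Q3 are medians of the values below and above the position of M). The data are hypothetical, and `five_number_summary` is an illustrative name.

```python
def median(values):
    xs = sorted(values)
    n, mid = len(xs), len(xs) // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max)."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median's position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # values above it (skip M itself when n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

# Hypothetical data (not from the slides).
summary = five_number_summary([7, 1, 4, 2, 6, 3, 5])
print(summary)
```

Statistical software often uses slightly different quartile conventions, so its Q1 and Q3 may not match this rule exactly.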
Measure of spread: the standard deviation
Heights (in.) of 25 women
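$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the mean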
Note: Calculate by computer
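As a sketch of what "calculate by computer" involves, the snippet below computes the standard deviation step by step, assuming the usual sample definition with divisor n - 1, and compares it with Python's `statistics.stdev`. The heights are hypothetical; the 25 values from the slide are not reproduced here.

```python
import math
import statistics

# Hypothetical heights in inches (not the slide's 25 values).
heights = [62.0, 64.0, 64.5, 66.0, 61.0]

n = len(heights)
mean = sum(heights) / n
# Average squared deviation from the mean, using n - 1 in the denominator.
variance = sum((x - mean) ** 2 for x in heights) / (n - 1)
s = math.sqrt(variance)

# statistics.stdev uses the same n - 1 divisor.
print(mean, s, statistics.stdev(heights))
```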
Summarizing distributions
Five-number summary (Min, Q1, M, Q3, Max): resistant
Error bars: not resistant
Examining Distributions - The Normal Distributions
Section 1.3
Density curves
A density curve is a mathematical idealization of a histogram
Actual
Idealization
“Area under the curve” ≈ proportion of observations.
Other idealizations
Histogram Density curve
Median halves “area under the curve”
The mean is the balance point
Examples
Have easy mathematical formulas
No easy formula
Normal distributions
The normal curves: $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$ (an “exponential” function)
Properties:
Symmetric, single-peaked, and bell-shaped.
Indexed by µ and σ, denoted N(µ, σ)
µ ± σ mark inflection points.
Impact of µ and σ
Same µ, different σ
Different µ, same σ
The 68-95-99.7 Rule
If x is N(µ, σ):
68% of obs. within µ ± σ
95% of obs. within µ ± 2σ
99.7% of obs. within µ ± 3σ
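The three percentages in the rule can be checked numerically. This sketch uses `statistics.NormalDist` from the Python standard library (3.8+); because the rule is the same for every N(µ, σ), the standard normal is enough.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, N(0, 1)

# Proportion of observations within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sigma: {p:.4f}")
```

The printed values are about 0.6827, 0.9545 and 0.9973, which the rule rounds to 68%, 95% and 99.7%.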
Standardization
A z-score measures the location of x from µ in units of σ: z = (x – µ) / σ
Key property: If x is N(µ, σ) then z is N(0, 1).
Benefit: To calculate an “area under the curve” for N(µ, σ) translate to a z-score and use N(0, 1).
“Standard Normal” distribution
Example calculation: heights
Problem: Height, x, is N(64.5, 2.5).
For what proportion of individuals is x < 67?
Solution:
Ask: How far is c = 67 from µ = 64.5 in units of σ = 2.5?
(c – µ) / σ = (67 – 64.5) / 2.5 = 1
Translate: z = (x – µ) / σ is N(0, 1)
For what proportion of individuals is z < 1?
Calculate: normsdist(1) = 0.84
Example calculation: heights (cont.)
68-95-99.7 rule:
Proportion with -1 < z < 1 is 0.68
Equally divide the remaining 0.32 between z < -1 and z > 1 (0.16 each)
Proportion with z < 1 is 0.16 + 0.68 = 0.84
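The heights calculation can be reproduced in Python. Here `statistics.NormalDist` is swapped in for the spreadsheet function `normsdist` used on the slide; the sketch computes the proportion both directly and via the z-score.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)   # N(64.5, 2.5) from the problem

# How far is c = 67 from the mean, in units of the standard deviation?
z = (67 - heights.mean) / heights.stdev

p_direct = heights.cdf(67)       # proportion with x < 67
p_via_z = NormalDist().cdf(z)    # proportion with z < 1 under N(0, 1)

print(z, round(p_direct, 4), round(p_via_z, 4))
```

Both routes give about 0.8413, which the slide rounds to 0.84.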
Calculation of “area between”
Problem: Proportion with c1 < z < c2
Solution: (prop. with z < c2) – (prop. with z < c1)
Example: Proportion with 1.4 < z < 2.2.
normsdist(2.2) – normsdist(1.4)
= 0.9861 – 0.9192 = 0.0669
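The "area between" subtraction is a one-liner once a cumulative proportion function is available. In this sketch `proportion_between` is an illustrative helper name, and `NormalDist().cdf` stands in for `normsdist`.

```python
from statistics import NormalDist

def proportion_between(c1, c2, dist=NormalDist()):
    """(Proportion with z < c2) minus (proportion with z < c1)."""
    return dist.cdf(c2) - dist.cdf(c1)

p = proportion_between(1.4, 2.2)
print(round(p, 4))
```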
Backward calculations
Problem: For what c is p the proportion with z < c?
Solution: c = normsinv(p)
Examples:
normsinv(0.84) ≈ 1
normsinv(0.16) ≈ -1
(Figure: standard normal curve with areas 0.16, 0.68, 0.16.)
Example calculation: mpg
Problem: MPG, x, of compact cars is N(25.7, 5.88).
For what c does 10% of compact cars have x > c?
Solution: First, normsinv(0.90) = 1.28
Translate: z = (x – µ) / σ is N(0, 1)
10% of compact cars have z > 1.28 = (c – µ) / σ
Solve: 1.28 = (c – 25.7) / 5.88
⇒ c = 25.7 + (1.28)(5.88)
= 33.2
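The backward calculation for the mpg problem can be sketched with `NormalDist.inv_cdf`, used here in place of the spreadsheet function `normsinv`. The code solves for c by un-standardizing, then checks the answer against a direct inverse on N(25.7, 5.88).

```python
from statistics import NormalDist

mu, sigma = 25.7, 5.88   # N(25.7, 5.88) from the problem

# 10% above c means 90% below c, so look up the 0.90 point of N(0, 1).
z_cut = NormalDist().inv_cdf(0.90)

# Undo the standardization z = (c - mu) / sigma.
c = mu + z_cut * sigma

# The same cutoff, computed directly on the original scale.
c_direct = NormalDist(mu, sigma).inv_cdf(0.90)

print(round(z_cut, 2), round(c, 1))
```

The same function gives about 0.994 for p = 0.84, so values such as normsinv(0.84) = 1 are rounded.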
Examining Relationships Scatterplots
Section 2.1
Often, individuals are measured on more than one variable
Follow the same approach as before:
Plot data and calculate numerical summaries
Look for overall patterns and deviations
Consider suitability of mathematical models (later)
Examining relationships
Additional considerations:
Do some variables tend to vary together?
Do some variables explain variability in another?
Definitions:
A response variable measures or records an outcome of a study. (Also: y, dependent variable.)
An explanatory variable explains changes in the response variable. (Also: x, independent variable.)
Scatterplots
A scatterplot is a graph of two quantitative variables measured on the same set of individuals.
If appropriate: response variable on y-axis
explanatory variable on x-axis
Example
Beers Drunk   Blood Alcohol
5 0.10
2 0.03
9 0.19
7 0.10
3 0.07
3 0.02
4 0.07
5 0.09
8 0.12
3 0.04
5 0.06
5 0.05
6 0.10
7 0.09
1 0.01
4 0.05
Interpretation: form
Linear
Nonlinear
No relationship
Interpretation: direction
Negative: high x ↔ low y, low x ↔ high y
Positive: high x ↔ high y, low x ↔ low y
Interpretation: strength
A stronger relationship has points falling closer to a clear form
Perfect linear Less strong
An outlier (of the relationship) is a point that falls off the trend
Outlier
Outlier in x and y but not of the relationship
Outlier of the relationship
Examining Relationships - Correlation
Section 2.2
Measure of direction and strength: correlation
Beers Drunk   Blood Alcohol
5 0.10
2 0.03
9 0.19
7 0.10
3 0.07
3 0.02
4 0.07
5 0.09
8 0.12
3 0.04
5 0.06
5 0.05
6 0.10
7 0.09
1 0.01
4 0.05
Note: Calculate by computer
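As a sketch of the computer calculation, the snippet below computes r for the beers and blood alcohol data listed above. It assumes the standard definition of correlation, the average product of standardized x and y values with divisor n - 1; the `correlation` helper is an illustration.

```python
import statistics

# Data from the table above.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.10, 0.07, 0.02, 0.07, 0.09,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def correlation(xs, ys):
    """Sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

r = correlation(beers, bac)
print(f"r = {r:.2f}")
```

For these data r comes out at roughly 0.9, a strong positive linear relationship.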
Examples
Properties
-1 ≤ r ≤ 1, always
Response and explanatory variables are interchangeable
Unitless, and independent of variables’ units.
r is not resistant.
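These properties can be checked on a small example. The data below are hypothetical, and the `correlation` helper uses a sums-of-products form that is algebraically equivalent to the standardized-values definition.

```python
import math
import statistics

def correlation(xs, ys):
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = correlation(x, y)
r_swapped = correlation(y, x)   # x and y are interchangeable
# Changing units (e.g. inches to cm, proportion to percent) leaves r unchanged.
r_rescaled = correlation([2.54 * v for v in x], [100 * v for v in y])
# One extreme point is enough to change r drastically: r is not resistant.
r_outlier = correlation(x + [20], y + [0])

print(r, r_swapped, round(r_rescaled, 4), round(r_outlier, 2))
```

Here r is 0.8 before the extra point is added and negative afterwards.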
Properties (cont.)
Interprets only linear relationships
Linear: r is appropriate
Non-linear: r may mislead
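A small constructed example shows how r can mislead for a non-linear relationship. In the sketch below y is determined exactly by x (y = x²), yet r is zero because the relationship is not linear. The data are hypothetical.

```python
import math
import statistics

def correlation(xs, ys):
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a perfect but curved (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(r)
```

A scatterplot of these points would reveal the pattern immediately, which is why the plot comes before the numerical summary.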