part ii : descriptive statistics - binghamton...
Post on 18-Mar-2020
Chapter 3 - The Histogram
PART II : DESCRIPTIVE STATISTICS
Dr. Joseph Brennan
Math 148, BU
Dr. Joseph Brennan (Math 148, BU) Chapter 3 - The Histogram 1 / 37
Variables
Once a study has been designed and data collected, researchers begin to SUMMARIZE their data. Data may be summarized by plotting figures and computing certain summary measures to obtain important information about the data.
STATISTIC : A summary measure computed from the data.
Recall: Every data point is the value of the response VARIABLE measured on a unit. So we should think of a variable as the quantity that takes different values for different individuals.
Examples: gender, color of eyes, weight, bacteria count.
Variables
There are two types of variables, depending on their possible values: qualitative (categorical) and quantitative (numerical).
Quantitative variables are further divided into discrete and continuous.
VARIABLES
  Qualitative
  Quantitative
    Discrete
    Continuous
Variable Types
A qualitative variable places an individual into one of several groups or categories. Such variables are also called categorical variables.
The variable gender has two possible values, male and female.
The variable major has numerous values such as Mathematics, Biology, Physics, Economics, Chemistry, . . .
A quantitative variable takes numerical values for which arithmetic operations (such as adding and averaging) make sense. Quantitative variables are also called numerical variables.
NOTE: If unsure how to classify a variable, ask whether arithmetic on its values makes sense. We cannot average gender or major; therefore they are qualitative variables.
Quantitative Variables
Quantitative variables are divided into discrete and continuous:
A discrete quantitative variable takes on values that are separated from one another, i.e., between two adjacent values there is no other possible value.
The variable number of children is discrete. It takes on integer values; there cannot be 2.5 kids in a family.
A continuous quantitative variable takes values in a given interval. For ANY two values of the variable, we can always find another value that lies between the two.
Variables such as weight, time, and distance are continuous.
NOTE: The variable salary is continuous but essentially discrete if all salaries are rounded to the whole dollar.
Example 1
Classify each of the following variables as qualitative or quantitative (discrete or continuous):
Color of eyes: qualitative
Blood pressure: quantitative (continuous)
Weight (in lb): quantitative (continuous)
Residence (country): qualitative
Number of patients under a treatment: quantitative (discrete)
Zip code: qualitative
NOTE: Not all variables that take on numerical values are quantitative! A zip code is a number, but averaging zip codes is meaningless.
Variables in statistical studies
In statistical studies we have encountered three types of variables:
Treatment Variables
Response Variables
Confounding Variables
All the above types of variables can be either qualitative or quantitative. In the studies which we considered in Part I the treatment variable was usually qualitative:
In the fever example (Example 1, Part I) the treatment variable was drug with the values drug A and drug B.
In the smoking example (Example 2, Part I) the treatment variable was smoking status with the values Yes and No.
In Part III we will develop methods to analyze data from studies for which both the treatment and response variables are quantitative.
An Introduction to the Histogram
Data represents the values of the response variable measured from each unit.
The distribution of data is a list summarizing the observed values of the response variable and how often they were observed.
When the data is quantitative, whether discrete or continuous, a histogram may be used to display its distribution.
Example (Guinea pigs)
Taken from Moore and McCabe, Table 1.8, Chapter 1.
The table gives the survival times in days of 72 guinea pigs after they were injected with tubercle bacilli in a medical experiment.
 43  45  53  56  56  57  58  66  67  73
 74  79  80  80  81  81  81  82  83  83
 84  88  89  91  91  92  92  97  99  99
100 100 101 102 102 102 103 104 107 108
109 113 114 118 121 123 126 128 137 138
139 144 145 147 156 162 174 178 179 184
191 198 211 214 243 249 329 380 403 511
522 598
We aim to describe the distribution of the survival times.
Example (Guinea pigs)
First, plot the observations on a horizontal axis.
Figure : Guinea pigs survival times plotted on a horizontal axis.
We can see that the observations are not uniformly spread along the axis. In particular, there is a crowding of observations around 100.
Example (Guinea pigs)
The density histogram is a graph representing the density of observations along the horizontal axis. Such a histogram is constructed in three steps:
1) Density Histogram: Step 1. Break the range of values of a variable into adjacent intervals, which are called class intervals or bins.
Class interval
 40 ≤ survival time < 80
 80 ≤ survival time < 120
120 ≤ survival time < 160
160 ≤ survival time < 200
200 ≤ survival time < 250
250 ≤ survival time < 400
400 ≤ survival time < 600
No particular rule was used to choose the above class intervals.
Example (Guinea pigs)
2) Density Histogram: Step 2. Create the distribution table, which contains the count and percent (or proportion) of individuals in each class interval.
Class interval               Count   Proportion   Percent
 40 ≤ survival time < 80       12      0.1667      16.67%
 80 ≤ survival time < 120      32      0.4444      44.44%
120 ≤ survival time < 160      11      0.1528      15.28%
160 ≤ survival time < 200       7      0.0972       9.72%
200 ≤ survival time < 250       4      0.0556       5.56%
250 ≤ survival time < 400       2      0.0278       2.78%
400 ≤ survival time < 600       4      0.0556       5.56%
Total                          72      1.0001     100.01%
NOTE: The total percent is equal to 100.01% rather than 100% due to errors
introduced by rounding.
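The distribution table can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the original slides: the names `survival_times`, `edges`, and `distribution_table` are illustrative, and the left endpoint of each class interval is assumed to be included.

```python
# Survival times in days of the 72 guinea pigs (values copied from the table above).
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]

# Endpoints of the class intervals chosen in Step 1 (illustrative variable name).
edges = [40, 80, 120, 160, 200, 250, 400, 600]


def distribution_table(data, edges):
    """Return one (left, right, count, proportion) row per interval [left, right)."""
    n = len(data)
    rows = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        rows.append((left, right, count, count / n))
    return rows


table = distribution_table(survival_times, edges)
for left, right, count, proportion in table:
    print(f"{left:>3} <= survival time < {right:<3}  {count:>3}  "
          f"{proportion:.4f}  {proportion:.2%}")
```

Because the code keeps unrounded proportions, they sum to 1; the 1.0001 total in the table arises only after rounding each row to four decimals.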
Example (Guinea pigs)
3) Density Histogram: Step 3. Constructing the histogram.
1. On the horizontal axis mark the endpoints of the class intervals.
2. On each class interval plot a rectangle whose base covers the class interval and whose height is computed in the following way:
Bar height = (percentage of observations in class interval) / (width of class interval)
The HIGHER the bar, the GREATER the concentration (density) of observations in the corresponding class interval.
We assume that observations are spread uniformly within a class interval.
The units of measurement on the VERTICAL axis of the histogram are percent (or proportion) per unit width.
Example (Guinea pigs)
Density Histogram: Step 3, continued.
Class interval               Bar height
 40 ≤ survival time < 80     0.1667/40  = 0.0041675
 80 ≤ survival time < 120    0.4444/40  = 0.01111
120 ≤ survival time < 160    0.1528/40  = 0.00382
160 ≤ survival time < 200    0.0972/40  = 0.00243
200 ≤ survival time < 250    0.0556/50  = 0.001112
250 ≤ survival time < 400    0.0278/150 ≈ 0.0001853
400 ≤ survival time < 600    0.0556/200 = 0.000278
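The bar heights can be checked numerically. The sketch below is an illustration, not the slides' own procedure: it starts from the counts in the distribution table and divides each unrounded proportion by the interval width, so the results differ slightly from the table, which divides proportions already rounded to four decimals. The variable names are illustrative.

```python
# Class-interval endpoints and counts taken from the distribution table.
edges = [40, 80, 120, 160, 200, 250, 400, 600]
counts = [12, 32, 11, 7, 4, 2, 4]
n = sum(counts)  # 72 observations in total

heights = []
for left, right, count in zip(edges[:-1], edges[1:], counts):
    width = right - left            # width of the class interval in days
    proportion = count / n          # proportion of observations in the interval
    heights.append(proportion / width)  # density: proportion per day
    print(f"{left:>3} <= survival time < {right:<3}  "
          f"{proportion:.4f}/{width} = {heights[-1]:.7f}")
```

The second interval has the tallest bar, which matches the crowding of observations around 100 days.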
Figure 4. Histogram for guinea pig survival times.
Width of Class Intervals
Class intervals are generally chosen to have uniform width, unlike the uneven intervals in our guinea pig study. Most computer programs default to bins of equal width. What number of bins of equal width should we use?
There is no best number of bins, and different bin sizes can reveal different features of the data. Usually, the number of bins is chosen between 5 and 25. The larger the data set, the more bins should be used.
One frequently used rule to compute the number of bins k (of equal width) for a data set of size n is Sturges’ formula:
k = 1 + 3.322 log10(n)
The computed value of k is rounded to the nearest integer.
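As a quick check of the formula, here is a minimal Python sketch. The function name `sturges_bins` is illustrative, and the constant 3.322 is the one used in the formula above.

```python
import math


def sturges_bins(n):
    """Number of equal-width bins for n observations, using the formula above."""
    k = 1 + 3.322 * math.log10(n)
    return round(k)  # round to the nearest integer, as the slide prescribes


# The guinea pig data set has 72 observations.
print(sturges_bins(72))
```

For n = 72 the unrounded value is about 7.17, so the rule suggests 7 bins.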
Example (Guinea pigs)
We will replot the density histogram using the CrunchIt! program with its default number of bins.
CrunchIt! directions:
Open the data from CrunchIt!: http://crunchit2.bfwpub.com/crunchit2/ips5e/?section_id=
Click on Chapter 1 in the upper left corner. Choose Table 1.8 from the list.
On the grey panel click on Graphics → Histogram. Click on the variable and choose Density.
CrunchIt! produces the graph shown in Figure 5. It uses 12 class intervals of width 50 ranging from 0 to 600. NOTE that CrunchIt! uses a different rule from Sturges’ formula to find the number of bins in its default setting. By Sturges’ formula,
k = 1 + 3.322 log10(72) = 7.17 ≈ 7.
Histogram with Equal Bin Widths
Figure 5. Histogram for guinea pigs (CrunchIt!)
NOTE: Histograms in Figures 4 and 5 are similar, but in general the appearance
of a histogram can substantially change when you change the widths of class
intervals.
Four Histograms Plotted from the Same Data
The bin width and positioning of the bin edges can have a significant effect on the resulting histogram.
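To see this effect on the guinea pig data, the sketch below counts observations in equal-width bins for three different choices of width and starting edge. The particular widths and starting points are arbitrary choices made for illustration, and `equal_width_counts` is a hypothetical helper, not something from the slides or from CrunchIt!.

```python
# Survival times in days of the 72 guinea pigs (from the guinea pig example).
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]


def equal_width_counts(data, start, width, n_bins):
    """Count observations in n_bins intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        index = (x - start) // width  # which bin x falls into
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


width_100_from_0 = equal_width_counts(survival_times, start=0, width=100, n_bins=6)
width_100_from_25 = equal_width_counts(survival_times, start=25, width=100, n_bins=6)
width_200_from_0 = equal_width_counts(survival_times, start=0, width=200, n_bins=3)

print(width_100_from_0)
print(width_100_from_25)
print(width_200_from_0)
```

Shifting the edges by 25 days moves many observations from the second bin into the first, even though the bin width is unchanged.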
Endpoint Convention
What should we do if an observation happens to be on the boundary between two bars? In which class interval does the data point lie?
Each observation must be counted exactly once, so we need to choose between the left and right bars. The choice is arbitrary; however, your choice should be stated and rigorously adhered to.
The convention followed in Example 2 is that the left endpoint is included in the class interval, and the right endpoint is excluded.
NOTE: There are 2 data values of 80 days. In the histogram plotted in Figure 4 the first class interval is [40, 80) and the second is [80, 120), so both values of 80 are counted in the second interval. How does the histogram change if the endpoint convention is altered?
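One way to explore this question is to count the observations under both conventions. The following is a small sketch with illustrative function names; it uses the class intervals of the guinea pig example.

```python
# Survival times in days of the 72 guinea pigs (from the guinea pig example).
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]
edges = [40, 80, 120, 160, 200, 250, 400, 600]


def counts_left_closed(data, edges):
    """Counts for intervals [left, right): the left endpoint is included."""
    return [sum(1 for x in data if left <= x < right)
            for left, right in zip(edges[:-1], edges[1:])]


def counts_right_closed(data, edges):
    """Counts for intervals (left, right]: the right endpoint is included."""
    return [sum(1 for x in data if left < x <= right)
            for left, right in zip(edges[:-1], edges[1:])]


left_closed = counts_left_closed(survival_times, edges)
right_closed = counts_right_closed(survival_times, edges)
print(left_closed)
print(right_closed)
```

Only the two observations equal to 80 change bars: the first two counts go from 12 and 32 to 14 and 30, and the remaining counts are unaffected because no other observation lies on a boundary.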
Area Under the Density Histogram
The area of each bar is equal to the percent (or proportion) of observations in that bar.
In the histogram plotted in Figure 4 the area of the first bar is the following:
area of the first bar = base × height
= 40 (days) × 0.0041675 (proportion of observations per day)
= 0.1667 (proportion of all observations).
The total area under the density histogram is 1 (100%).
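The claim about the total area can be verified directly from the counts. This is a minimal sketch with illustrative variable names; it uses unrounded proportions, so the areas sum to 1 up to floating-point error.

```python
# Class-interval endpoints and counts from the guinea pig distribution table.
edges = [40, 80, 120, 160, 200, 250, 400, 600]
counts = [12, 32, 11, 7, 4, 2, 4]
n = sum(counts)

areas = []
for left, right, count in zip(edges[:-1], edges[1:], counts):
    base = right - left              # width of the bar in days
    height = (count / n) / base      # proportion of observations per day
    areas.append(base * height)      # area = base x height = proportion in the bar

total_area = sum(areas)
print(f"area of the first bar: {areas[0]:.4f}")
print(f"total area: {total_area:.4f}")
```

Each bar's area equals its proportion of the observations, which is why the areas must add up to 1 regardless of how the class intervals are chosen.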
Zero Bar Heights
Zero bar heights in histograms: In a histogram there is no horizontal space between the bars unless a class interval is empty (has no data), so that its bar has height equal to zero.
NOTE: The histogram in Figure 5 has 2 empty classes, for survival times in the ranges 250 - 300 and 450 - 500. These intervals contain no data points.
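The empty classes can be found by counting the observations in the 12 bins of width 50 from 0 to 600. The sketch below assumes the left endpoint of each bin is included; `equal_width_counts` is a hypothetical helper and is not CrunchIt!'s own method.

```python
# Survival times in days of the 72 guinea pigs (from the guinea pig example).
survival_times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]


def equal_width_counts(data, start, width, n_bins):
    """Count observations in n_bins intervals [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for x in data:
        index = (x - start) // width
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


width = 50
counts = equal_width_counts(survival_times, start=0, width=width, n_bins=12)

# Bins with a count of zero correspond to gaps between the bars.
empty_bins = [(width * i, width * (i + 1)) for i, c in enumerate(counts) if c == 0]
print(counts)
print(empty_bins)
```

The two empty bins reported are the ranges 250 - 300 and 450 - 500 named in the note.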
Different Types of Histograms
There are three main types of histograms:
A density histogram displays percents (or proportions) per unit width in the vertical direction.
In a frequency histogram the height of each bar is equal to the actual count of observations in the class interval.
In a relative frequency histogram the height of each bar is equal to the proportion or percentage of observations in the class interval.
We will mostly deal with density histograms in this course. Later in this unit we will approximate density histograms with density curves.
Example of a Frequency Histogram
This example is taken from Moore and McCabe.
The frequency histogram below shows the distribution of IQ scores for 60 fifth-grade students. On the y-axis we have the count of students in each class interval. The sum of all the bar heights equals 60, the number of tested students.
Example of a Relative Frequency Histogram
This example is taken from Moore and McCabe.
The relative frequency histogram below shows the distribution of the lengths of words used in Shakespeare’s plays.
Outliers in Data
Very often the extreme bars of the histogram correspond to outlying observations or outliers.
An outlier is an observation which falls outside the overall pattern of the histogram.
Rules exist to identify outliers, but in many cases it is just a matter of judgement.
Look for points that are clearly apart from the body of the data, not just the most extreme observations in a distribution.
Observation 598 in Example 2 (Guinea pigs) is clearly an outlier. Observations 511 and 522 are also potential outliers.
A formal rule for detecting outliers will be developed later in this unit.
Population Distribution and Data Distribution
Making a histogram is not an end in itself. The purpose of the histogram is to help us understand the data and make observations about the population from which the data was drawn.
There is the true population distribution of the variable of interest, which may be computed from census data.
The population distribution is usually unknown, since we usually cannot conduct a census. The histogram computed from the sample data shows the data distribution, which estimates the true population distribution.
Analyzing a Histogram
Once you’ve plotted the histogram (data distribution), look for the overall pattern and for outliers.
The overall pattern of a distribution can be described by its shape, center, and spread. We will learn the measures of a distribution’s center and spread in Chapter 4.
The shape of the distribution can be described:
by specifying the number of modes.
as symmetric or skewed.
Modes are major peaks in the distribution. Distributions with one, two, and three modes are called unimodal, bimodal, and trimodal, respectively. If a distribution has more than 3 modes, it is usually called multimodal.
Modes in Histograms
Figure : Unimodal and bimodal distributions.
Symmetric Distributions
A symmetric distribution has a histogram symmetric about the midpoint.
Imagine drawing a vertical line through the center of the histogram and folding the histogram in half around that line: the two halves should match up.
Even if the true population distribution is symmetric, we do not expect the histogram of the data to be perfectly symmetric.
The unimodal histogram on the previous slide is fairly symmetric, so it may correspond to a symmetric population distribution.
Many symmetric unimodal histograms look bell-shaped.
The unimodal histogram on the previous slide appears bell-shaped.
Many biological measurements on specimens from the same species and sex - lengths of bird flies, heights of adults - have symmetric bell-shaped distributions.
Some History
A symmetric bell-shaped distribution is also called a normal distribution or a Gaussian distribution.
It was described in 1809 by the famous mathematician Carl Friedrich Gauss (1777-1855).
As a general rule, test scores in large classes (like MAT148) tend to follow a normal distribution!
Analyzing the Histogram
Once you have plotted the histogram (data distribution), look for outliers and for the overall pattern, which is described by its shape, center, and spread. The shape of the distribution can be described:
by specifying the number of modes.
as symmetric or skewed.
Figure : Unimodal and bimodal distributions. The unimodal appears symmetric.
Symmetric and Skewed Distributions
Tails are the parts of a distribution away from the modes. There is the left tail (for the smallest values of the variable) and the right tail (for the largest values of the variable).
A distribution is right skewed if the right tail (larger values) is much longer than the left tail (smaller values).
A distribution is left skewed if the left tail (smaller values) is much longer than the right tail (larger values).
Example: Histogram of Income Distribution
Money amounts usually have right-skewed distributions. A few families have very large incomes compared to the majority of families, which skews the income distribution to the right.
Figure : Income distribution.
The Appearance of Symmetric and Skewed distributions
Example (English)
What is the most frequently used letter in the English language?
This is a relative frequency “histogram” generated (as per Wikipedia) from a
sample of about 2700 words taken from 3 different sources.
Alternatively, which letter is most likely to be at the end of a word?
Take a look at the end-of-word letter frequencies:
Letter e s d t n
Frequency 0.1917 0.1435 0.0923 0.0864 0.0786
Smooth Histogram Sketches
Many sources will represent histograms as smooth curves. However, we have defined histograms to be based upon bar graphs.
Histogram sketches are smooth curves drawn through the tops of the histogram bars and used to indicate the overall shape of a histogram.
Figure : A Histogram and its Smooth Histogram Sketch.