part ii : descriptive statistics - binghamton...

Chapter 3 - The Histogram

PART II : DESCRIPTIVE STATISTICS

Dr. Joseph Brennan

Math 148, BU

Dr. Joseph Brennan (Math 148, BU) Chapter 3 - The Histogram 1 / 37

Variables

Once a study has been designed and data collected, researchers begin to SUMMARIZE their data. Data may be summarized by plotting figures and computing certain summary measures to obtain important information about the data.

STATISTIC : A summary measure computed from the data.

Recall : Every data point is the value of the response VARIABLE measured on a unit. So we should think of a variable as the quantity that takes different values for different individuals.

Examples: gender, color of eyes, weight, bacteria count.

Variables

There are two types of variables, depending on their possible values: qualitative (categorical) and quantitative (numerical).

Quantitative variables are further divided into discrete and continuous.

Figure : Classification of variables. VARIABLES divide into Qualitative and Quantitative; Quantitative variables divide into Discrete and Continuous.

Variable Types

A qualitative variable places an individual into one of several groups or categories. Such variables are also called categorical variables.

The variable gender has two possible values: male and female.

The variable major has numerous values such as Mathematics, Biology, Physics, Economics, Chemistry, . . .

A quantitative variable takes numerical values for which arithmetic operations (such as adding and averaging) make sense. Quantitative variables are also called numerical variables.

NOTE: If unsure how to classify a variable, ask whether arithmetic on its values makes sense. We cannot average gender or major, therefore they are qualitative variables.

Quantitative Variables

Quantitative variables are divided into discrete and continuous:

A discrete quantitative variable takes on values which are spaced apart, i.e., for two adjacent values, there is no value that goes between them.

The variable number of children is discrete. It takes on integer values: there cannot be 2.5 kids in a family.

A continuous quantitative variable takes values in a given interval. For ANY two values of the variable, we can always find another value that can go between the two.

Variables such as weight, time, and distance are continuous.

NOTE: The variable salary is continuous but essentially discrete if all salaries are rounded to the whole dollar.

Example 1

Classify each of the following variables as qualitative or quantitative (discrete or continuous):

Color of eyes: qualitative

Blood pressure: quantitative (continuous)

Weight (in lb): quantitative (continuous)

Residence (country): qualitative

Number of patients under a treatment: quantitative (discrete)

Zip code: qualitative

NOTE: Not all the variables that take on numerical values are quantitative!

Variables in statistical studies

In statistical studies we have encountered three types of variables:

Treatment Variables

Response Variables

Confounding Variables

All the above types of variables can be either qualitative or quantitative. In the studies which we considered in Part I the treatment variable was usually qualitative:

In the fever example (Example 1, Part I) the treatment variable was drug with the values drug A and drug B.

In the smoking example (Example 2, Part I) the treatment variable was smoking status with the values Yes and No.

In Part III we will develop methods to analyze data from studies for which both the treatment and response variables are quantitative.

An Introduction to the Histogram

Data represents the values of the response variable measured from each unit.

The distribution of data is a list summarizing the observed values of the response variable and how often they were observed.

When the data is quantitative, whether discrete or continuous, a histogram may be used to display its distribution.
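To make the idea of a distribution concrete, here is a minimal Python sketch that tabulates how often each value occurs. The sample `children` (number of children in ten families) is hypothetical and used only for illustration.

```python
from collections import Counter

# Hypothetical sample: number of children in ten families (illustration only).
children = [0, 2, 1, 2, 3, 0, 2, 1, 1, 2]

# The distribution lists each observed value with how often it was observed.
distribution = Counter(children)

for value in sorted(distribution):
    count = distribution[value]
    print(f"{value} children: {count} families ({count / len(children):.0%})")
```

For a continuous variable, listing every distinct value is rarely useful, which is why values are grouped into intervals before plotting.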

Example (Guinea pigs)

Taken from Moore and McCabe, Table 1.8, Chapter 1.

The table gives the survival times in days of 72 guinea pigs after they were injected with tubercle bacilli in a medical experiment.

 43  45  53  56  56  57  58  66  67  73
 74  79  80  80  81  81  81  82  83  83
 84  88  89  91  91  92  92  97  99  99
100 100 101 102 102 102 103 104 107 108
109 113 114 118 121 123 126 128 137 138
139 144 145 147 156 162 174 178 179 184
191 198 211 214 243 249 329 380 403 511
522 598

We aim to describe the distribution of the survival times.

First, plot the observations on a horizontal axis.

Figure : Guinea pigs survival times plotted on a horizontal axis.

We can see that the observations are not uniformly spread along the axis. In particular, there is a crowding of observations around 100.

The density histogram is a graph representing the density of observations along the horizontal axis. Such a histogram is constructed in three steps:

1) Density Histogram: Step 1. Break the range of values of a variable into adjacent intervals, which are called class intervals or bins.

Class interval

 40 ≤ survival time <  80
 80 ≤ survival time < 120
120 ≤ survival time < 160
160 ≤ survival time < 200
200 ≤ survival time < 250
250 ≤ survival time < 400
400 ≤ survival time < 600

No particular rule was used to choose the above class intervals.

2) Density Histogram: Step 2. Create the distribution table which contains the count and percent (or proportion) of individuals in each class interval.

Class interval               Count   Proportion   Percent
 40 ≤ survival time <  80      12      0.1667      16.67%
 80 ≤ survival time < 120      32      0.4444      44.44%
120 ≤ survival time < 160      11      0.1528      15.28%
160 ≤ survival time < 200       7      0.0972       9.72%
200 ≤ survival time < 250       4      0.0556       5.56%
250 ≤ survival time < 400       2      0.0278       2.78%
400 ≤ survival time < 600       4      0.0556       5.56%
Total                          72      1.0001     100.01%

NOTE: The total percent is equal to 100.01% rather than 100% due to errors introduced by rounding.
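Steps 1 and 2 can be sketched in Python as follows. This is only an illustration under the left-endpoint-included convention; the names `times`, `edges`, and `distribution_table` are chosen here and are not part of the course materials.

```python
# Survival times (days) of the 72 guinea pigs, copied from the table above.
times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]

# Class-interval endpoints from Step 1; each interval is [left, right).
edges = [40, 80, 120, 160, 200, 250, 400, 600]


def distribution_table(data, edges):
    """Return one (left, right, count, proportion) row per class interval."""
    n = len(data)
    rows = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if left <= x < right)
        rows.append((left, right, count, count / n))
    return rows


table = distribution_table(times, edges)
for left, right, count, prop in table:
    print(f"{left:>3} <= t < {right:<3}  {count:>2}  {prop:.4f}  {100 * prop:.2f}%")
```

Because the proportions are kept unrounded here, they sum to exactly 1 up to floating-point error; the 1.0001 total appears only after rounding each row to four decimals.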

Density Histogram: Step 3. Constructing the histogram.

1. On the horizontal axis mark the endpoints of class intervals.

2. On each class interval plot a rectangle, whose base covers the class interval and whose height is computed in the following way:

$$\text{Bar height} = \frac{\text{percentage (or proportion) of observations in class interval}}{\text{width of class interval}}$$

The HIGHER the bar, the GREATER the concentration (density) of observations in the corresponding class interval.

We assume that observations are spread uniformly within a class interval.

The units of measurement on the VERTICAL axis of the histogram are percents (proportions) per unit width.

Density Histogram: Step 3, continued.

Class interval               Bar height
 40 ≤ survival time <  80    0.1667/40  = 0.0041675
 80 ≤ survival time < 120    0.4444/40  = 0.01111
120 ≤ survival time < 160    0.1528/40  = 0.00382
160 ≤ survival time < 200    0.0972/40  = 0.00243
200 ≤ survival time < 250    0.0556/50  = 0.001112
250 ≤ survival time < 400    0.0278/150 ≈ 0.0001853
400 ≤ survival time < 600    0.0556/200 = 0.000278
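The bar heights can be reproduced with a short Python sketch. It starts from the counts in the distribution table; the variable names are illustrative. The slide divides proportions already rounded to four decimals, so the last digits may differ slightly from the unrounded values computed here.

```python
# Class-interval endpoints and counts taken from the distribution table.
edges = [40, 80, 120, 160, 200, 250, 400, 600]
counts = [12, 32, 11, 7, 4, 2, 4]
n = sum(counts)  # 72 observations in total

heights = []
for left, right, count in zip(edges[:-1], edges[1:], counts):
    width = right - left
    proportion = count / n
    # Density: proportion of observations per unit of width (per day).
    heights.append(proportion / width)

for left, right, h in zip(edges[:-1], edges[1:], heights):
    print(f"[{left}, {right}): {h:.6f} per day")
```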

Figure : Figure 4. Histogram for guinea pigs survival times.

Width of Class Intervals

Class intervals are generally chosen to have uniform width, unlike the uneven intervals in our guinea pig study. Most computer programs default to bins of equal width. What number of bins of equal width should we use?

There is no best number of bins, and different bin sizes can reveal different features of the data. Usually, the number of bins is chosen between 5 and 25. The larger the data set, the greater the number of bins that should be used.

One frequently used rule to compute the number of bins k (of equal width) for a data set of size n is Sturges’ formula:

$$k = 1 + 3.322 \log_{10}(n)$$

We obtain k by rounding the result to the nearest integer.

We will replot the density histogram using the CrunchIt! program, using the default CrunchIt! number of bins.

CrunchIt! directions :

Open the data from CrunchIt!: http://crunchit2.bfwpub.com/crunchit2/ips5e/?section_id=

Click on Chapter 1 in the upper left corner. Choose Table 1.8 from the list.

On the Grey panel click on Graphics → Histogram. Click on the variable and choose Density.

CrunchIt! produces the graph shown in Figure 5. It uses 12 class intervals of width 50 ranging from 0 to 600. NOTE that CrunchIt! uses a different formula from Sturges’ rule to find the number of bins in its default setting. By Sturges’ formula

$$k = 1 + 3.322 \log_{10}(72) \approx 7.17 \approx 7.$$
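Sturges’ formula is easy to evaluate in code. The following Python sketch uses a helper name, `sturges_bins`, chosen here for illustration.

```python
import math


def sturges_bins(n):
    """Number of equal-width bins suggested by Sturges' formula."""
    k = 1 + 3.322 * math.log10(n)
    return round(k)  # round to the nearest integer


print(sturges_bins(72))  # guinea pig data set: 1 + 3.322 * log10(72) is about 7.17
```

The rule grows slowly with n: multiplying the sample size by 10 adds only about 3.3 bins.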

Histogram with Equal Bin Widths

Figure : Figure 5. Histogram for guinea pigs (CrunchIt!)

NOTE: Histograms in Figures 4 and 5 are similar, but in general the appearance of a histogram can substantially change when you change the widths of class intervals.

Four Histograms Plotted from the Same Data

The bin width and positioning of the bin edges can have a significant effect on the resulting histogram.

Endpoint Convention

What should we do if an observation happens to be on the boundary between two bars? In which class interval does the data point lie?

Each observation must be taken into account just once, so we need to choose between the left and right bars. The choice is arbitrary; however, your choice should be indicated and rigorously adhered to.

The convention followed in Example 2 is that the left endpoint is included in the class interval, and the right endpoint is excluded.

NOTE: There are 2 data values of 80 days. In the histogram plotted in Figure 4 the first class interval includes data values in [40, 80), and the second class interval contains observations in [80, 120). How does the histogram change by altering the endpoint convention?
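One way to explore the question is to count the observations in the first two class intervals under both conventions. The Python sketch below does this; the function name `count_in_interval` and its `closed` parameter are illustrative choices.

```python
# Survival times (days) of the 72 guinea pigs.
times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]


def count_in_interval(data, left, right, closed="left"):
    """Count observations in [left, right) or, if closed='right', in (left, right]."""
    if closed == "left":
        return sum(1 for x in data if left <= x < right)
    return sum(1 for x in data if left < x <= right)


for left, right in [(40, 80), (80, 120)]:
    a = count_in_interval(times, left, right, closed="left")
    b = count_in_interval(times, left, right, closed="right")
    print(f"{left}-{right}: left endpoint included = {a}, right endpoint included = {b}")
```

Only the two observations equal to 80 move, from the second bar to the first; no other observation in this data set falls on a class boundary.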

Area Under the Density Histogram

The area of each bar is equal to the percent (or proportion) of observations in that bar.

In the histogram plotted in Figure 4 the area of the first bar is the following.

$$\text{area of the first bar} = \text{base} \times \text{height} = 40\ (\text{days}) \times 0.0041675\ (\text{proportion of observations per day}) = 0.1667\ (\text{proportion of all observations}).$$

The total area under the density histogram is 1 (100%).
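The claim that the areas add up to 1 can be checked numerically. This Python sketch recomputes each bar's area as base times height from the counts in the distribution table; variable names are illustrative.

```python
edges = [40, 80, 120, 160, 200, 250, 400, 600]
counts = [12, 32, 11, 7, 4, 2, 4]
n = sum(counts)

areas = []
for left, right, count in zip(edges[:-1], edges[1:], counts):
    width = right - left
    height = (count / n) / width   # bar height in a density histogram
    areas.append(width * height)   # area = base x height = proportion in the bar

total_area = sum(areas)
print(f"first bar: {areas[0]:.4f}, total: {total_area:.4f}")
```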

Zero Bar Heights

Zero bar heights in histograms: In a histogram there is no horizontal space between the bars unless a class interval is empty (has no data), so that its bar has height equal to zero.

NOTE: The histogram in Figure 5 has 2 empty classes for survival times in the ranges 250 - 300 and 450 - 500. These intervals do not have any data point.
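The empty classes can be found by counting observations in the 12 equal-width bins described for Figure 5. The Python sketch below assumes bins of the form [0, 50), [50, 100), ..., [550, 600), i.e. the left endpoint included.

```python
# Survival times (days) of the 72 guinea pigs.
times = [
    43, 45, 53, 56, 56, 57, 58, 66, 67, 73,
    74, 79, 80, 80, 81, 81, 81, 82, 83, 83,
    84, 88, 89, 91, 91, 92, 92, 97, 99, 99,
    100, 100, 101, 102, 102, 102, 103, 104, 107, 108,
    109, 113, 114, 118, 121, 123, 126, 128, 137, 138,
    139, 144, 145, 147, 156, 162, 174, 178, 179, 184,
    191, 198, 211, 214, 243, 249, 329, 380, 403, 511,
    522, 598,
]

width = 50
n_bins = 12  # bins [0, 50), [50, 100), ..., [550, 600)

counts = [0] * n_bins
for x in times:
    counts[x // width] += 1  # integer division gives the bin index

# Bins with no observations have bar height zero.
empty = [(i * width, (i + 1) * width) for i, c in enumerate(counts) if c == 0]
print(empty)
```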

Different Types of Histograms

There are 3 main types of histograms:

A density histogram displays percents (or proportions) per unit width in the vertical direction.

In a frequency histogram the height of each bar is equal to the actual count of observations in the class interval.

In a relative frequency histogram the height of each bar is equal to the proportion or percentage of observations in the class interval.

We will mostly deal with density histograms in this course. Later in this unit we will approximate the density histograms with density curves.
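The three kinds of bar height can be compared side by side on the guinea pig class intervals. The Python sketch below is illustrative only; it shows that two intervals with the same count can have very different heights in a density histogram when their widths differ.

```python
edges = [40, 80, 120, 160, 200, 250, 400, 600]
counts = [12, 32, 11, 7, 4, 2, 4]
n = sum(counts)
widths = [right - left for left, right in zip(edges[:-1], edges[1:])]

frequency = counts                                   # frequency histogram
relative_frequency = [c / n for c in counts]         # relative frequency histogram
density = [p / w for p, w in zip(relative_frequency, widths)]  # density histogram

for i, (left, right) in enumerate(zip(edges[:-1], edges[1:])):
    print(f"[{left}, {right}): count={frequency[i]}, "
          f"proportion={relative_frequency[i]:.4f}, density={density[i]:.6f}")
```

The intervals [200, 250) and [400, 600) both contain 4 observations, but the wider interval gets a bar one quarter as tall in the density histogram.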

Example of a Frequency Histogram

This example is taken from Moore and McCabe.

The frequency histogram below shows the distribution of IQ scores for 60 fifth-grade students. On the y-axis we have the count of students in each class interval. The sum of all the bar heights equals 60, the number of tested students.

Example of a Relative Frequency Histogram

This example is taken from Moore and McCabe.

The relative frequency histogram below shows the distribution of the lengths of words used in Shakespeare’s plays.

Outliers in Data

Very often the extreme bars of the histogram correspond to outlying observations or outliers.

An outlier is an observation which falls outside of the overall pattern of the histogram.

Rules exist to identify outliers, but in many cases it is just a matter of judgement.

Look for points that are clearly apart from the body of data, not just the most extreme observations in a distribution.

Observation 598 in Example 2 (Guinea pigs) is clearly an outlier. Observations 511 and 522 are also potential outliers.

A formal rule for detecting outliers will be developed later in this unit.

Population Distribution and Data Distribution

Making a histogram is not an end in itself. The purpose of the histogram is to help us understand the data and make observations about the population from which the data was drawn.

There is the true population distribution of the variable of interest, which may be computed from the census data.

The population distribution is usually unknown since we usually cannot conduct a census. The histogram computed from the sample data shows the data distribution, which estimates the true population distribution.

Analyzing a Histogram

Once you’ve plotted the histogram (data distribution), look for the overall pattern and for outliers.

The overall pattern of a distribution can be described by its shape, center, and spread. We will learn the measures of a distribution’s center and spread in Chapter 4.

The shape of the distribution can be described:

by specifying the number of modes.

as symmetric or skewed.

Modes are major peaks in the distribution. Distributions with one, two and three modes are called unimodal, bimodal, and trimodal, respectively. If a distribution has more than 3 modes, it is usually called multimodal.

Modes in Histograms

Figure : Unimodal and bimodal distributions.

Symmetric Distributions

A symmetric distribution has a histogram symmetric about the midpoint.

Imagine drawing a vertical line through the center of the histogram and folding the histogram in half around that line: the two halves should match up.

Even if the true population distribution is symmetric, we do not expect the histogram of the data to be perfectly symmetric.

The unimodal histogram on the previous slide is fairly symmetric, so it may correspond to a symmetric population distribution.

Many symmetric unimodal histograms look bell-shaped. The unimodal histogram on the previous slide appears bell-shaped.

Many biological measurements on specimens from the same species and sex - lengths of bird flies, heights of adults - have symmetric bell-shaped distributions.

Some History

A symmetric bell-shaped distribution is also called a normal distribution or a Gaussian distribution.

It is named after the famous mathematician Carl Friedrich Gauss (1777-1855), who published it in 1809.

As a general rule, test scores in large classes (like MAT148) tend to follow a normal distribution!

Analyzing the Histogram

Once you have plotted the histogram (data distribution), look for outliers, and the overall pattern, which is described by its shape, center, and spread. The shape of the distribution can be described

by specifying the number of modes.

as symmetric or skewed.

Figure : Unimodal and bimodal distributions. The unimodal appears symmetric.

Symmetric and Skewed Distributions

Tails are the parts of a distribution away from modes. There is the left tail (for smallest values of the variable) and the right tail (for largest values of the variable).

A distribution is right skewed if the right tail (larger values) is much longer than the left tail (smaller values).

A distribution is left skewed if the left tail (smaller values) is much longer than the right tail (larger values).

Example: Histogram of Income Distribution

Money amounts usually have right-skewed distributions. A few families have very large incomes compared to the majority of families, which skews the income distribution.

Figure : Income distribution.

The Appearance of Symmetric and Skewed distributions

Example (English)

What is the most frequently used letter in the English language?

This is a relative frequency “histogram” generated (as per Wikipedia) from a sample of about 2700 words taken from 3 different sources.

Alternatively, which letter is most likely to be at the end of a word? Take a look at end-of-word letter frequencies:

Letter e s d t n

Frequency 0.1917 0.1435 0.0923 0.0864 0.0786
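Relative frequencies like those above can be computed by counting. The Python sketch below uses a short hypothetical phrase, not the sample behind the slide, so its numbers are for illustration only.

```python
from collections import Counter

# Hypothetical sample text (illustration only, not the data used on the slide).
text = "these trees are seen here"
words = text.split()

# Relative frequency of each letter over all letters in the sample.
letters = [ch for word in words for ch in word]
letter_freq = {ch: c / len(letters) for ch, c in Counter(letters).items()}

# Relative frequency of each letter as the LAST letter of a word.
final_counts = Counter(word[-1] for word in words)
final_freq = {ch: c / len(words) for ch, c in final_counts.items()}

most_common_letter = max(letter_freq, key=letter_freq.get)
print(most_common_letter, round(letter_freq[most_common_letter], 4))
print(final_freq)
```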

Smooth Histogram Sketches

Many sources will represent histograms as smooth curves. However, we have defined histograms to be based upon bar graphs.

Histogram sketches are smooth curves drawn through the tops of the histogram bars and used to indicate the overall shape of a histogram.

Figure : A Histogram and its Smooth Histogram Sketch.
