d lecture 3c histograms, formally defined (movie...

Post on 18-Aug-2020


Histograms

All time US top grossing movies, adjusted for inflation

The gross dollar amounts converted into millions

3 digit numbers are easier to work with than 9 digit numbers

To build a frequency distribution, you must group data into contiguous intervals, called bins

● The number of bins we choose affects the frequency distribution, and hence our interpretation of the data

● We can mask or highlight certain insights

A frequency distribution of gross amounts in millions of dollars, assuming 10 bins

Frequency table

A frequency distribution of gross amounts in millions of dollars, assuming 100 bins

First 15 rows of Frequency table

To build a frequency distribution, you must group data into contiguous intervals, called bins

● When you fix the number of bins, doing so also determines the width of the bins

○ The range of our data was about 1500

○ So with 10 bins, each one is of width 1500/10 = 150

○ And assuming 100 bins, each one is of width 1500/100 = 15
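The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `bin_width` is made up here, and 1500 is the approximate range quoted on the slide, not a value computed from the data.

```python
def bin_width(data_range: float, num_bins: int) -> float:
    """Width of each bin when a range is split into equal-width bins."""
    if num_bins <= 0:
        raise ValueError("num_bins must be positive")
    return data_range / num_bins


# Approximate range of the gross amounts, in millions (taken from the slide).
DATA_RANGE = 1500

print(bin_width(DATA_RANGE, 10))   # width of each bin with 10 bins
print(bin_width(DATA_RANGE, 100))  # width of each bin with 100 bins
```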

To build a frequency distribution, you must group data into contiguous intervals, called bins

● Instead of fixing the number of bins, we can fix the width of each bin, which indirectly determines the number of bins.

○ If we fix our bin width at 100, this yields 1500/100 = 15 bins.

● For now, let’s assume these 15 bins are of equal width

○ Let’s choose ranges starting and ending with 50, like 250 - 350, etc.

○ Let’s also make it so that our bins do not include data at their left (lower) endpoints, but do include data at their right (upper) endpoints.

■ E.g., let’s put 350 in the 250 - 350 bin, 450 in the 350 - 450 bin, etc.

■ N.B. We could just as well do the reverse!
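A minimal sketch of this endpoint convention, assuming the first bin starts at 250 and every bin has width 100 (the function name `bin_for` and its default arguments are illustrative, not from the lecture). Rounding up with `math.ceil` is what sends a value sitting exactly on an edge into the bin to its left.

```python
import math


def bin_for(value, start=250, width=100):
    """Return the bin (lower, upper] that contains value.

    Bins exclude their left endpoint and include their right endpoint.
    The defaults (start=250, width=100) are assumptions for illustration.
    """
    if value <= start:
        raise ValueError("value must be greater than the first left endpoint")
    k = math.ceil((value - start) / width)  # 1-based index of the bin
    lower = start + (k - 1) * width
    return (lower, lower + width)


print(bin_for(350))  # 350 is a right endpoint, so it stays in the 250 - 350 bin
print(bin_for(351))  # just past the edge, so it moves to the 350 - 450 bin
print(bin_for(450))  # again a right endpoint
```

Doing the reverse (including left endpoints, excluding right ones) would amount to swapping `math.ceil` for `math.floor` and adjusting the index accordingly.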

A frequency distribution of gross amounts in millions of dollars, assuming bin width is 100

Frequency table

Observations about movie revenues

● We see an initial jump in the frequency of movies that grossed around 300 million to around 400 million (adjusted) dollars

● The highest bar is for movies that grossed around 400 million (between 350 and 450) (adjusted) dollars, so the most movies fall in this range

● A small number of movies grossed more than 650 million

● The frequencies are “skewed to the right”. Equivalently, there is “a long right-hand tail”. This shape is common in distributions of income or rent.

Histograms, formally defined

A key difference between a bar graph and histogram

● Histograms can have bins of unequal width

● The data are highly concentrated in the range of 350 to 650 million dollars

● The data are more “spread out” beyond 650 million dollars

● So we could use only three bins

● And, this would still be a histogram!

[Figure: x-axis Gross (Million Dollars), y-axis Density]

This is still a histogram because...

The definition of a histogram is: a bar graph in which the area under each bar is the frequency

○ 100 x .36 = 36

○ 300 x .45 = 135

○ 1100 x .02636 = 29

The total area under all bars is the sample size (36 + 135 + 29 = 200).
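The relationship between counts, widths, and bar heights can be checked with a short sketch. The counts (36, 135, 29) and the bin edges (250, 350, 650, 1750) are the ones used in these slides; the function name `unnormalized_densities` is made up for illustration.

```python
# (lower edge, upper edge, number of movies) for the three bins
BINS = [(250, 350, 36), (350, 650, 135), (650, 1750, 29)]


def unnormalized_densities(bins):
    """Height of each bar so that its area (height x width) equals the count."""
    return [count / (upper - lower) for lower, upper, count in bins]


heights = unnormalized_densities(BINS)
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, BINS)]

print(heights)     # bar heights (densities)
print(sum(areas))  # total area, which should match the sample size
```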

The heights in a histogram are called densities.

[Figure: x-axis Gross (Million Dollars), y-axis Density]

Actually, even this is still a histogram because...

The real definition of a histogram is: a bar graph in which the area under each bar is the relative frequency (i.e., proportional to the frequency)

○ 100 x .0018 = .18

○ 300 x .00225 = .675

○ 1100 x .0001318 = .145

The total area under all bars is 1.

[Figure: x-axis Gross (Million Dollars), y-axis Density]
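As a rough check of the normalized version, each bar height can be computed as count / (sample size x width). The counts and bin edges below are the ones used in these slides; the function name `normalized_densities` is illustrative only.

```python
# (lower edge, upper edge, number of movies) for the three bins
BINS = [(250, 350, 36), (350, 650, 135), (650, 1750, 29)]


def normalized_densities(bins):
    """Height of each bar so that its area equals the bin's relative frequency."""
    n = sum(count for _, _, count in bins)  # sample size
    return [count / (n * (upper - lower)) for lower, upper, count in bins]


heights = normalized_densities(BINS)
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, BINS)]

print(heights)     # normalized bar heights
print(areas)       # relative frequency of each bin
print(sum(areas))  # total area, which should be 1 up to rounding
```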

Unnormalized vs. Normalized histograms

[Figure: two panels, each with x-axis Gross (Million Dollars) and y-axis Density]

Unnormalized histograms

● The area under each bar is equal to the number of data points that lie in the corresponding bin

● The total area under all bars is equal to the sample size

Normalized histograms

● The area under each bar is equal to the proportion of data points that lie in the corresponding bin

● The total area under all bars is equal to 1

Choosing a level of detail

● Some detail is lost by grouping values into bins

● Movies are unevenly distributed across the bin 350 - 650

● Sometimes it may be better to use a rough approximation rather than a finer level of detail (akin to using descriptive statistics)

[Figure: two plots, each with x-axis Gross (Million Dollars) and y-axis Density]

What is a histogram?

● A bar chart for plotting a frequency distribution

● The bins are always contiguous (even if some of them are empty), and their widths are drawn to scale

● The areas of the bars are proportional to the frequencies

○ The width of each bin is the magnitude of its range of outcomes

○ The height of each bin is its density, meaning frequency / width

● The sum of the areas is proportional to the sample size
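Putting the pieces together, here is a minimal sketch that goes from raw values to bar heights for bins of possibly unequal width. It assumes bins that exclude their left endpoint and include their right endpoint; the function name `histogram_densities` and the sample values are hypothetical, not data from the lecture.

```python
from bisect import bisect_left


def histogram_densities(values, edges, normalize=False):
    """Return one bar height per bin for bins (edges[i], edges[i+1]].

    Unnormalized: height = count / width, so the areas sum to the sample size.
    Normalized: height = count / (n * width), so the areas sum to 1.
    Values outside (edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_left(edges, v) - 1  # an edge value lands in the bin to its left
        if 0 <= i < len(counts):
            counts[i] += 1
    n = sum(counts)
    if normalize and n == 0:
        raise ValueError("no values fall inside the bins")
    scale = n if normalize else 1
    return [c / ((hi - lo) * scale) for c, lo, hi in zip(counts, edges, edges[1:])]


# Hypothetical gross amounts in millions, for illustration only.
sample = [300, 350, 400, 500, 650, 900]
edges = [250, 350, 650, 1750]

print(histogram_densities(sample, edges))
print(histogram_densities(sample, edges, normalize=True))
```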

Summary: Bar chart vs. Histogram?

Bar Chart

● Frequency distribution of categorical data

● All the bars in the chart have the same width

● The heights of the bars are proportional to the frequencies

Histogram

● Frequency distribution of quantitative data

● The bars in the chart can have different widths

● The areas of the bars are proportional to the frequencies

● N.B. If the width of all bins is 1, then the areas equal the heights

iClicker Q: How long do you hope to live?

A) 55-65

B) 65-75

C) 75-85

D) 85-95

E) 95+

iClicker Q: How long do you hope to live?

A) 85-90

B) 90-95

C) 95-100

D) 100-110

E) 110+

Extras

Graphing absolute frequencies (or counts)

A natural way to depict a distribution

Not a histogram

● A plot with varying widths becomes very misleading when using counts

● This plot does not take into account the difference in width of the bins

● The height of each bar is simply the number of movies in that bin

● This example exaggerates movies grossing at least 550 million dollars
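One way to see how much such a plot exaggerates is to compare each bar's share of the drawn area with the bin's actual share of the movies. The sketch below reuses the three-bin counts from the earlier slides (36, 135, 29), not the bins of the plot on this slide, whose counts are not given in the transcript; the function names are illustrative.

```python
# (lower edge, upper edge, number of movies) from the three-bin example
BINS = [(250, 350, 36), (350, 650, 135), (650, 1750, 29)]


def drawn_area_shares(bins):
    """Share of the total drawn area per bar when bar height = raw count."""
    areas = [(upper - lower) * count for lower, upper, count in bins]
    total = sum(areas)
    return [a / total for a in areas]


def true_shares(bins):
    """Actual share of the data points in each bin."""
    n = sum(count for _, _, count in bins)
    return [count / n for _, _, count in bins]


print(drawn_area_shares(BINS))  # what the eye sees when heights are counts
print(true_shares(BINS))        # what the data actually say
```

With counts as heights, the widest bin takes up a far larger share of the picture than its share of the movies, which is the distortion described above.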

Taking this to an extreme

With just two bins, the shape, and hence meaning, of the distribution is lost completely

iClicker Q: Which bin has the most movies in it?

A: first bin (250 - 350)

B: second bin (350 - 650)

C: third bin (650 - 1750)
