d lecture 3c histograms, formally defined (movie...

Histograms

Post on 18-Aug-2020

Page 1: d lecture 3c histograms, formally defined (movie data)cs.brown.edu/courses/cs100/lectures/lecture3c.pdf · Observations about movie revenues We see an initial jumpin the frequency

Histograms

All time US top grossing movies, adjusted for inflation

The gross dollar amounts are converted into millions

3-digit numbers are easier to work with than 9-digit numbers

To build a frequency distribution, you must group data into contiguous intervals, called bins

● The number of bins we choose affects the frequency distributions, and hence our interpretation of the data

● We can mask or highlight certain insights

A frequency distribution of gross amounts in millions of dollars, assuming 10 bins

Frequency table

A frequency distribution of gross amounts in millions of dollars, assuming 100 bins

First 15 rows of Frequency table

To build a frequency distribution, you must group data into contiguous intervals, called bins

● When you fix the number of bins, doing so also determines the width of the bins

○ The range of our data was about 1500
○ So with 10 bins, each one is of width 1500/10 = 150
○ And assuming 100 bins, each one is of width 1500/100 = 15
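The width arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the endpoints 250 and 1750 are an assumption taken from the bin ranges that appear on later slides.

```python
def bin_width(data_range, num_bins):
    """Width of each bin when the data range is split into equal bins."""
    return data_range / num_bins


def equal_width_edges(low, high, num_bins):
    """Return the num_bins + 1 edges of equal-width bins covering low..high."""
    width = (high - low) / num_bins
    return [low + i * width for i in range(num_bins + 1)]


print(bin_width(1500, 10))   # 150.0
print(bin_width(1500, 100))  # 15.0

# Assumed endpoints: 250 and 1750 (a range of 1500), split into 10 bins.
edges = equal_width_edges(250, 1750, 10)
print(edges[:3])             # [250.0, 400.0, 550.0]
```

Fixing the number of bins is enough to determine every edge once the lowest and highest values are known.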

To build a frequency distribution, you must group data into contiguous intervals, called bins

● Instead of fixing the number of bins, we can fix the width of each bin, which indirectly determines the number of bins.

○ If we fix our bin width at 100, this yields 1500/100 = 15 bins.

● For now, let’s assume these 15 bins are of equal width
○ Let’s choose ranges whose endpoints end in 50, like 250 - 350, etc.
○ Let’s also make it so that our bins do not include data at their left (lower) endpoints, but do include data at their right (upper) endpoints.
■ E.g., let’s put 350 in the 250 - 350 bin, 450 in the 350 - 450 bin, etc.
■ N.B. We could just as well do the reverse!
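A minimal sketch of this endpoint convention, assuming bins of width 100 that start at 250. The helper names and the sample values are hypothetical, chosen only to show where boundary values land.

```python
import math
from collections import Counter


def bin_label(value, start=250, width=100):
    """Return the (low, high) bin containing value, where each bin
    excludes its left endpoint and includes its right endpoint."""
    i = math.ceil((value - start) / width) - 1
    low = start + i * width
    return (low, low + width)


def frequency_table(values, start=250, width=100):
    """Count how many values fall in each bin, sorted by bin."""
    counts = Counter(bin_label(v, start, width) for v in values)
    return dict(sorted(counts.items()))


num_bins = round(1500 / 100)   # fixing the width at 100 yields 15 bins

# Hypothetical gross amounts in millions, not the lecture's data.
sample = [300, 350, 351, 450, 700]
print(bin_label(350))          # (250, 350)
print(bin_label(450))          # (350, 450)
print(frequency_table(sample))
```

Under this convention a value exactly at 250 would fall below the first bin; doing the reverse (including left endpoints) would swap `math.ceil` for `math.floor` and drop the `- 1`.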

A frequency distribution of gross amounts in millions of dollars, assuming bin width is 100

Frequency table

Observations about movie revenues
● We see an initial jump in the frequency of movies that grossed around 300 million to around 400 million (adjusted) dollars
● The highest bar is for movies that grossed around 400 million (between 350 and 450) (adjusted) dollars, so the most movies fall in this range
● A small number of movies grossed more than 650 million
● The frequencies are “skewed to the right”. Equivalently, there is “a long right-hand tail”. This shape is common in distributions of income or rent.

Histograms, formally defined

A key difference between a bar graph and histogram

● Histograms can have bins of unequal width
● The data are highly concentrated in the range of 350 to 650 million dollars
● The data are more “spread out” beyond 650 million dollars
● So we could use only three bins
● And, this would still be a histogram!

[Figure: histogram with three bins; x-axis: Gross (Million Dollars), y-axis: Density]

This is still a histogram because...

The definition of a histogram is: a bar graph in which the area under each bar is the frequency

○ 100 x .36 = 36
○ 300 x .45 = 135
○ 1100 x .02636 ≈ 29

The total area under all bars is the sample size (36 + 135 + 29 = 200).

The heights in a histogram are called densities.
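The area arithmetic on this slide can be checked with a short Python sketch. The bin ranges 250 - 350, 350 - 650 and 650 - 1750 are taken from the three-bin example elsewhere in the deck; the function name is just for illustration.

```python
# Three bins of unequal width, with the frequencies from the slide.
bins = [(250, 350), (350, 650), (650, 1750)]
frequencies = [36, 135, 29]


def densities(bins, frequencies):
    """Height of each bar: frequency divided by bin width."""
    return [f / (high - low) for (low, high), f in zip(bins, frequencies)]


heights = densities(bins, frequencies)

# Area of each bar is width times height, which recovers the frequency.
areas = [h * (high - low) for h, (low, high) in zip(heights, bins)]

print([round(h, 5) for h in heights])  # [0.36, 0.45, 0.02636]
print(round(sum(areas)))               # 200
```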

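[Figure: histogram with three bins, bar areas equal to frequencies; x-axis: Gross (Million Dollars), y-axis: Density]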

Actually, even this is still a histogram because...

The real definition of a histogram is: a bar graph in which the area under each bar is the relative frequency (i.e., the frequency divided by the sample size, and hence proportional to the frequency)

○ 100 x .0018 = .18
○ 300 x .00225 = .675
○ 1100 x .0001318 ≈ .145

The total area under all bars is 1.
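A sketch of the normalized version, reusing the same three bins and frequencies (36, 135, 29) from the previous slide. The function name is an assumption for this example.

```python
bins = [(250, 350), (350, 650), (650, 1750)]
frequencies = [36, 135, 29]


def normalized_densities(bins, frequencies):
    """Height of each bar: relative frequency divided by bin width."""
    n = sum(frequencies)  # sample size
    return [(f / n) / (high - low) for (low, high), f in zip(bins, frequencies)]


heights = normalized_densities(bins, frequencies)

# Each area is now a proportion, and the proportions sum to 1.
areas = [h * (high - low) for h, (low, high) in zip(heights, bins)]

print([round(a, 3) for a in areas])  # [0.18, 0.675, 0.145]
print(round(sum(areas), 10))         # 1.0
```

Dividing every frequency by the sample size rescales the heights but leaves the shape of the histogram unchanged.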

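[Figure: histogram with three bins, bar areas equal to relative frequencies; x-axis: Gross (Million Dollars), y-axis: Density]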

Unnormalized vs. Normalized histograms

[Figure: two panels, an unnormalized and a normalized histogram; x-axis: Gross (Million Dollars), y-axis: Density]

Unnormalized histograms
● The area under each bar is equal to the number of data points that lie in the corresponding bin
● The total area under all bars is equal to the sample size

Normalized histograms
● The area under each bar is equal to the proportion of data points that lie in the corresponding bin
● The total area under all bars is equal to 1

Choosing a level of detail

● Some detail is lost by grouping values into bins
● Movies are unevenly distributed across the bin 350 - 650
● Sometimes it may be better to use a rough approximation rather than a finer level of detail (akin to using descriptive statistics)

[Figure: two histograms of the movie data at different levels of detail; x-axis: Gross (Million Dollars), y-axis: Density]

What is a histogram?
● A bar chart for plotting a frequency distribution
● The bins are always contiguous (even if some of them are empty), and their widths are drawn to scale
● The areas of the bars are proportional to the frequencies
○ The width of each bin is the magnitude of its range of outcomes
○ The height of each bin is its density, meaning frequency / width
● The sum of the areas is proportional to the sample size
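Putting the definition together, here is a minimal sketch that goes from raw values and bin edges to counts and densities. It assumes bins that include their right endpoints (the convention chosen earlier in the lecture); the function name and the sample values are hypothetical.

```python
from bisect import bisect_left


def histogram(values, edges, normalized=False):
    """Return (counts, densities) for contiguous bins (edges[i], edges[i+1]].

    Values outside the outermost edges are ignored. Empty bins are kept,
    so the bins stay contiguous.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_left(edges, v) - 1  # right endpoint belongs to the bin
        if 0 <= i < len(counts):
            counts[i] += 1
    total = sum(counts)
    if normalized and total == 0:
        raise ValueError("cannot normalize an empty histogram")
    scale = total if normalized else 1
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [c / scale / w for c, w in zip(counts, widths)]
    return counts, densities


# Hypothetical gross amounts in millions, not the lecture's data.
sample = [300, 350, 400, 500, 650, 900]
edges = [250, 350, 650, 1750]

counts, dens = histogram(sample, edges)
_, norm_dens = histogram(sample, edges, normalized=True)
print(counts)  # [2, 3, 1]
print(dens[0], dens[1])  # 0.02 0.01
```

With `normalized=True` the bar areas (density times width) sum to 1; otherwise they sum to the number of values that fell inside the bins.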

Summary: Bar chart vs. Histogram?

Bar Chart

● Frequency distribution of categorical data

● All the bars in the chart have the same width

● The heights of the bars are proportional to the frequencies

Histogram

● Frequency distribution of quantitative data

● The bars in the chart can have different widths

● The areas of the bars are proportional to the frequencies

● N.B. If the width of all bins is 1, then the areas equal the heights

iClicker Q: How long do you hope to live?

A) 55-65

B) 65-75

C) 75-85

D) 85-95

E) 95+

iClicker Q: How long do you hope to live?

A) 85-90

B) 90-95

C) 95-100

D) 100-110

E) 110+

Extras

Graphing absolute frequencies (or counts)

A natural way to depict a distribution

Not a histogram
● A plot with varying widths becomes very misleading when using counts
● This plot does not take into account the difference in width of the bins
● The height of each bar is simply the number of movies in that bin
● This example exaggerates movies grossing at least 550 million dollars
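To see how much counts-as-heights can distort, here is a rough sketch. It does not use the bins of the plot on this slide; it assumes the three-bin example from earlier in the lecture (250 - 350, 350 - 650, 650 - 1750, with 36, 135 and 29 movies), and the variable names are made up for illustration.

```python
bins = [(250, 350), (350, 650), (650, 1750)]
counts = [36, 135, 29]

# If each bar's height is the raw count, the area it occupies on the page
# is count times width, so wide bins are inflated.
drawn_areas = [c * (high - low) for (low, high), c in zip(bins, counts)]

# Share of the plot's total area taken up by each bar.
visual_share = [a / sum(drawn_areas) for a in drawn_areas]

# Share of the movies that actually fall in each bin.
true_share = [c / sum(counts) for c in counts]

print(drawn_areas)                          # [3600, 40500, 31900]
print([round(s, 3) for s in visual_share])  # [0.047, 0.533, 0.42]
print([round(s, 3) for s in true_share])    # [0.18, 0.675, 0.145]
```

Under these assumptions the widest bin fills about 42% of the drawn area while holding only 14.5% of the movies, which is the kind of exaggeration the slide warns about.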

Taking this to an extreme

With just two bins, the shape, and hence meaning, of the distribution is lost completely

iClicker Q: Which bin has the most movies in it?

A: first bin (250 - 350)

B: second bin (350 - 650)

C: third bin (650 - 1750)