d lecture 3c histograms, formally defined (movie...
![Page 1: d lecture 3c histograms, formally defined (movie data)cs.brown.edu/courses/cs100/lectures/lecture3c.pdf · Observations about movie revenues We see an initial jumpin the frequency](https://reader036.fdocuments.us/reader036/viewer/2022071005/5fc2ea4f7d69c242330563d8/html5/thumbnails/1.jpg)
Histograms
All time US top grossing movies, adjusted for inflation
The gross dollar amounts converted into millions
3 digit numbers are easier to work with than 9 digit numbers
To build a frequency distribution, you must group data into contiguous intervals, called bins
● The number of bins we choose affects the frequency distributions, and hence our interpretation of the data
● We can mask or highlight certain insights
Selecting the number of bins
A frequency distribution of gross amounts in millions of dollars, assuming 10 bins
Frequency table
A frequency distribution of gross amounts in millions of dollars, assuming 100 bins
First 15 rows of Frequency table
To build a frequency distribution, you must group data into contiguous intervals, called bins
● When you fix the number of bins, doing so also determines the width of the bins
○ The range of our data was about 1500
○ So with 10 bins, each one is of width 1500/10 = 150
○ And assuming 100 bins, each one is of width 1500/100 = 15
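The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the name `bin_width` is hypothetical, and the range of 1500 is the approximate figure quoted on the slide, not a value computed from the movie data.

```python
# Approximate range of the data (in millions of dollars), as quoted on the slide.
DATA_RANGE = 1500


def bin_width(data_range, num_bins):
    """Width of each bin when the range is split into equal-width bins."""
    return data_range / num_bins


# Fixing the number of bins fixes the width of each bin.
for num_bins in (10, 100):
    print(num_bins, "bins -> width", bin_width(DATA_RANGE, num_bins))
```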
To build a frequency distribution, you must group data into contiguous intervals, called bins
● Instead of fixing the number of bins, we can instead fix the width of each bin, which indirectly determines the number.
○ If we fix our bin width at 100, this yields 1500/100 = 15 bins.
● For now, let’s assume these 15 bins are of equal width
○ Let’s choose ranges starting and ending with 50, like 250 - 350, etc.
○ Let’s also make it so that our bins do not include data at their left (lower) endpoints, but do include data at their right (upper) endpoints.
■ E.g., let’s put 350 in the 250 - 350 bin, 450 in the 350 - 450 bin, etc.
■ N.B. We could just as well do the reverse!
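One way to express this endpoint convention in code is sketched below. The function name `bin_for` and its parameters are hypothetical, and the first bin edge of 250 is an assumption based on the example ranges above. Each bin is treated as open on the left and closed on the right.

```python
import math


def bin_for(value, first_edge=250, width=100):
    """Return the (low, high) edges of the bin that contains value.

    Bins exclude their left endpoint and include their right endpoint,
    so a value sitting exactly on an edge goes into the bin to its left.
    """
    # Number of whole bin widths needed to reach or pass the value.
    k = math.ceil((value - first_edge) / width)
    high = first_edge + k * width
    return (high - width, high)


print(bin_for(350))  # lands in the 250 - 350 bin
print(bin_for(450))  # lands in the 350 - 450 bin
print(bin_for(351))  # just past 350, so it lands in the 350 - 450 bin
```

Swapping `math.ceil` for `math.floor` (and adjusting the edges accordingly) would give the reverse convention, where bins include their left endpoints instead.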
A frequency distribution of gross amounts in millions of dollars, assuming bin width is 100
Frequency table
Observations about movie revenues
● We see an initial jump in the frequency of movies that grossed around 300 million to around 400 million (adjusted) dollars
● The highest bar is for movies that grossed around 400 million (between 350 and 450) (adjusted) dollars, so the most movies fall in this range
● A small number of movies grossed more than 650 million
● The frequencies are “skewed to the right”. Equivalently, there is “a long right-hand tail”. This shape is common in distributions of income or rent.
Histograms, formally defined
A key difference between a bar graph and histogram
● Histograms can have bins of unequal width
● The data are highly concentrated in the range of 350 to 650 million dollars
● The data are more “spread out” beyond 650 million dollars
● So we could use only three bins
● And, this would still be a histogram!
[Figure: histogram with three bins. x-axis: Gross (Million Dollars); y-axis: Density]
This is still a histogram because...
The definition of a histogram is: a bar graph in which the area under each bar is the frequency
○ 100 x .36 = 36
○ 300 x .45 = 135
○ 1100 x .02636 ≈ 29
The total area under all bars is the sample size (36 + 135 + 29 = 200).
The heights in a histogram are called densities.
[Figure: x-axis: Gross (Million Dollars); y-axis: Density]
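The relationship between frequency, width, and density can be checked with a short Python sketch. The counts and widths come from the slide; the bin edges 250, 350, 650, and 1750 are an assumption consistent with those widths, and the variable names are illustrative only.

```python
# Assumed bin edges (millions of dollars) and the counts from the slide.
edges = [250, 350, 650, 1750]
counts = [36, 135, 29]

# Width of each bin is the distance between consecutive edges.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

# Density (bar height) is frequency divided by width.
densities = [c / w for c, w in zip(counts, widths)]

# Area of each bar is height times width, which recovers the frequency.
areas = [d * w for d, w in zip(densities, widths)]

print("widths:   ", widths)
print("densities:", [round(d, 5) for d in densities])
print("total area:", round(sum(areas)))
```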
Actually, even this is still a histogram because...
The real definition of a histogram is: a bar graph in which the area under each bar is the relative frequency (i.e., proportional to the frequency)
○ 100 x .0018 = .18
○ 300 x .00225 = .675
○ 1100 x .0001318 ≈ .145
The total area under all bars is 1.
[Figure: x-axis: Gross (Million Dollars); y-axis: Density]
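A normalized version of the same calculation might look like the sketch below. It reuses the counts and widths from the slides; the variable names are hypothetical. Dividing each count by the sample size gives the relative frequency, and dividing again by the bin width gives the normalized density.

```python
counts = [36, 135, 29]
widths = [100, 300, 1100]

n = sum(counts)  # sample size

# Relative frequency: the proportion of data points in each bin.
rel_freq = [c / n for c in counts]

# Normalized density: relative frequency divided by bin width.
norm_density = [c / (n * w) for c, w in zip(counts, widths)]

# Each bar's area (height times width) equals its relative frequency,
# so the areas sum to 1 up to floating-point rounding.
areas = [d * w for d, w in zip(norm_density, widths)]

print("relative frequencies:", rel_freq)
print("total area:", round(sum(areas), 10))
```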
Unnormalized vs. Normalized histograms
[Figure: two panels, unnormalized and normalized histograms. x-axis: Gross (Million Dollars); y-axis: Density]
Unnormalized histograms
● The area under each bar is equal to the number of data points that lie in the corresponding bin
● The total area under all bars is equal to the sample size
Normalized histograms
● The area under each bar is equal to the proportion of data points that lie in the corresponding bin
● The total area under all bars is equal to 1
Choosing a level of detail
● Some detail is lost by grouping values into bins
● Movies are unevenly distributed across the bin 350 - 650
● Sometimes it may be better to use a rough approximation rather than a finer level of detail (akin to using descriptive statistics)
[Figure: two panels. x-axis: Gross (Million Dollars); y-axis: Density]
What is a histogram?
● A bar chart for plotting a frequency distribution
● The bins are always contiguous (even if some of them are empty), and their widths are drawn to scale
● The areas of the bars are proportional to the frequencies
○ The width of each bin is the magnitude of its range of outcomes
○ The height of each bin is its density, meaning frequency / width
● The sum of the areas is proportional to the sample size
Summary: Bar chart vs. Histogram?
Bar Chart
● Frequency distribution of categorical data
● All the bars in the chart have the same width
● The heights of the bars are proportional to the frequencies
Histogram
● Frequency distribution of quantitative data
● The bars in the chart can have different widths
● The areas of the bars are proportional to the frequencies
● N.B. If the width of all bins is 1, then the areas equal the heights
iClicker Q: How long do you hope to live?
A) 55-65
B) 65-75
C) 75-85
D) 85-95
E) 95+
iClicker Q: How long do you hope to live?
A) 85-90
B) 90-95
C) 95-100
D) 100-110
E) 110+
Extras
Graphing absolute frequencies (or counts)
A natural way to depict a distribution
Not a histogram
● A plot with varying widths becomes very misleading when using counts
● This plot does not take into account the difference in width of the bins
● The height of each bar is simply the number of movies in that bin
● This example exaggerates movies grossing at least 550 million dollars
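To see why counts become misleading when widths vary, the sketch below compares the visual share of each bar with its true share of the data. As an assumption, it reuses the three-bin example from the earlier slides (counts 36, 135, 29 and widths 100, 300, 1100) rather than the bins shown in this plot, which are not recoverable from the transcript. The variable names are hypothetical.

```python
counts = [36, 135, 29]
widths = [100, 300, 1100]

# If counts are drawn as heights, each bar's area is count times width.
misleading_areas = [c * w for c, w in zip(counts, widths)]

# Share of the total drawn area that each bar occupies.
drawn_share = [a / sum(misleading_areas) for a in misleading_areas]

# Share of the movies that actually fall in each bin.
true_share = [c / sum(counts) for c in counts]

for d, t in zip(drawn_share, true_share):
    print(f"drawn share {d:.3f} vs true share {t:.3f}")
```

Under these assumed numbers, the widest bin takes up far more of the picture than its share of the movies, which is the kind of exaggeration the slide describes.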
Taking this to an extreme
With just two bins, the shape, and hence meaning, of the distribution is lost completely
iClicker Q: Which bin has the most movies in it?
A: first bin (250 - 350)
B: second bin (350 - 650)
C: third bin (650 - 1750)