Post on 16-Jul-2015
What is a Histogram?
A histogram is "a representation of a frequency
distribution by means of rectangles whose widths
represent class intervals and whose areas are
proportional to the corresponding frequencies."
Online Webster's Dictionary
That sounds complicated, but the concept really is
pretty simple. We graph groups of numbers
according to how often they appear. Thus if we have
the set {1,2,2,3,3,3,3,4,4,5,6}, we can graph them
like this:
This graph is pretty easy to make and gives us some
useful information about the set. For example, the graph
peaks at 3, which is also the median and the mode of
the set. The mean of the set is 3.27, also not far
from the peak. The shape of the graph gives us an
idea of how the numbers in the set are distributed
about the mean: the distribution of this graph is wide
compared to the size of the peak, indicating that values
in the set are only loosely bunched around the mean.
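To check these numbers yourself, here is a minimal Python sketch that counts how often each value appears and computes the mean, median and mode of the set above. It uses only the standard library; the text-based "graph" made of # characters is just an illustrative stand-in for a real plot.

```python
from collections import Counter
from statistics import mean, median, mode

data = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

# Count how often each value appears in the set.
counts = Counter(data)

# Print a simple text histogram: one row per value, one '#' per occurrence.
for value in sorted(counts):
    print(f"{value}: {'#' * counts[value]}")

print("mean   =", round(mean(data), 2))
print("median =", median(data))
print("mode   =", mode(data))
```

The longest row of # characters is the row for 3, which is the peak described above.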
How is a Real Histogram Made?
The example above is a little too simple. In most real data sets almost all numbers will be unique.
Consider the set {3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 45, 49}. A graph which shows
how many ones, how many twos, how many threes, etc. would be meaningless. Instead we bin the
data into convenient ranges. In this case, with a bin width of 10, we can easily group the data as
below.
Note: Changing the size of the bin changes the appearance of the graph and the conclusions you
may draw from it. The Shodor histogram activity allows you to change the bin size for a data set
and see the impact on the curve.
Data
Range    Frequency
0-10     1
10-20    3
20-30    6
30-40    3
40-50    2
Note that the median is 25 and that there is no mode;
the mean is 26.5.
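The binning step can be sketched in a few lines of Python. This is only an illustration: bin_counts is a hypothetical helper written for this example, and it assumes each bin includes its lower edge and excludes its upper edge (so a value of 20 would land in 20-30). Counting directly from the set listed above places three values (35, 36 and 37) in the 30-40 bin. The loop also repeats the count with a bin width of 20 to show how the bin size changes the picture.

```python
from statistics import mean, median

data = [3, 11, 12, 19, 22, 23, 24, 25, 27, 29, 35, 36, 37, 45, 49]

def bin_counts(values, bin_width):
    """Return {bin_start: count}, where each bin covers
    bin_start <= value < bin_start + bin_width.
    Bins that contain no values are simply left out."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width  # lower edge of the bin holding v
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Compare two bin widths for the same data.
for width in (10, 20):
    print(f"bin width {width}:")
    for start, n in bin_counts(data, width).items():
        print(f"  {start:>2}-{start + width:<2} {'#' * n} ({n})")

print("median =", median(data))
print("mean   =", round(mean(data), 1))
```

With a width of 20 the five bins collapse into three, and the peak looks much broader than it does with a width of 10.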
How Shall We Look at Histograms?
Of course, part of the power of histograms is that they allow us to analyze extremely large
datasets by reducing them to a single graph that can show primary, secondary and tertiary peaks
in data as well as give a visual representation of the statistical significance of those peaks. To get
an idea, look at these three histograms:
This plot represents data with a well-defined
peak that is close in value to the median and
the mean. While there are "outliers," they are
of relatively low frequency. Thus it can be said
that large deviations from the mean are rare in
this data set. If this were a mass
plot in particle physics, we'd say the mass is
understood with good precision.
In this plot the peak is still fairly close to the
median and the mean, but it is much less
well defined. It is harder to tell from the plot what
the exact location of the peak is. There are
almost as many values close to the peak as at
the peak itself, and outliers are frequent. As a
particle physics mass plot, this gives an
imprecise and uncertain mass for the particle.
Where are the median and the mean? It is hard
to tell; it also may not be relevant. There are
two peaks in this plot: a taller primary peak as
well as a shorter secondary peak. This could
indicate either very poor definition of one
signal in the data or, more likely, two signals.
In particle physics, this could show two
separate particles or, as is often the case, a
large signal with "background" particles and a
smaller signal (sometimes very small), called a
"bump," which shows the actual particle under
study.
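One rough way to pick out a primary and a secondary peak from binned data is to look for bins that are taller than both of their neighbors. The Python sketch below does just that. The function find_peaks is a hypothetical helper and the bin counts are made up for illustration; they are not taken from any real plot. This simple test says nothing about whether a bump is statistically significant, so treat it as a first look rather than an analysis.

```python
def find_peaks(counts):
    """Return the indices of bins that are strictly taller than both
    neighbors, ordered from the tallest peak to the shortest.
    The first and last bins are never reported as peaks."""
    peaks = []
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            peaks.append(i)
    return sorted(peaks, key=lambda i: counts[i], reverse=True)

# Hypothetical bin counts, invented for this example only.
counts = [2, 5, 9, 14, 9, 6, 4, 7, 5, 2, 1]

peaks = find_peaks(counts)
for rank, i in enumerate(peaks, start=1):
    print(f"peak {rank}: bin {i} with {counts[i]} entries")
```

Here the tallest bin plays the role of the primary peak and the smaller local maximum plays the role of the "bump."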
Resources
Sample Histogram - This is another example of how a histogram is made, with a focus on
the effect of bin size.
Shodor Histogram Page - This is a nice interactive histogram page in which you can
choose different sample histograms and vary the bin size.
Excel Help - To work with large datasets, it helps to use a spreadsheet. This tutorial
walks you through the process of making a histogram in MS Excel.
Histogram Problems - These are practice problems (with solutions) so that you can
construct and analyze histograms on your own.