02 Describing Distributions


  • 8/17/2019 02 Describing Distributions


    DESCRIBING DISTRIBUTIONS – REVIEW

    Topics Outline

    • Types of Data

    • Describing Categorical Variables

    • Describing Numerical Variables

    • Using the StatTools Add-In

    • Charts for Numerical Variables

    • Measures of Central Tendency

    • Measures of Shape

    • Measures of Variability

    • The Empirical Rule

    Types of Data

    Statistics is the science of data. Data are the statistician’s raw material, the material we use to interpret reality. There are several ways to categorize data. The main data types follow.

    1) Categorical Data (synonyms: qualitative, attribute)
    – can be numeric or nonnumeric (e.g. parts labels, license plate numbers)
    – cannot perform arithmetic operations with them

    2) Numerical Data (synonyms: quantitative, variable)
    – always numeric
    – can perform arithmetic operations with them
    – How many? – Discrete (e.g. SAT scores)
    – How much? – Continuous (e.g. car prices, distance)

    3) Cross-sectional Data – collected at the same or approximately the same point in time.
    Example – gasoline prices at 40 gas stations in Austin today.

    4) Time Series Data – collected over several time periods.
    Example – average gasoline prices in Austin for each month of this year.


    Describing Categorical Variables

    Because it is not appropriate to perform arithmetic operations on categorical data, there are only a few possibilities for describing a categorical variable, and these are all based on counting.

    Example 1

    Supermarket Sales

    The file Supermarket_Transactions.xlsx contains over 14,000 transactions made by supermarket customers over a period of approximately two years. Column B contains the date of the purchase, column C is a unique identifier for each customer, columns D–H contain information about the customer, columns I–K contain the location of the store, columns L–N contain information about the product purchased, and the last two columns indicate the number of items purchased and the amount paid.

    (a) Which of the variables are numerical and which are categorical?

    Children, Units Sold, and Revenue are numerical. Purchase Date is a date variable, and Customer ID is used only to identify customers. All of the others are categorical. This includes Annual Income, which has been binned into categories. Three of these categorical variables – Gender, Marital Status, and Homeowner – have only two categories. The others have more than two categories.

    (b) Create categorical summaries for Gender, Marital Status, Homeowner, and Annual Income.

    Each of the counts in column S can be obtained with Excel's COUNTIF function. For example, the formula in cell S3 is =COUNTIF($D$2:$D$14060, R3). This function takes two arguments, the data range and a criterion. To get the percentages in column T, each count is divided by the total number of observations. As a check, it is a good idea to sum these percentages. They should sum to 100% for each variable, as they do here.
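The same counting logic can be sketched outside Excel. The Python snippet below is only an illustration: the function name `categorical_summary` and the ten sample labels are hypothetical and are not taken from Supermarket_Transactions.xlsx.

```python
from collections import Counter

def categorical_summary(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)   # one count per category, like one COUNTIF per row
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}

# Hypothetical labels, not the real supermarket data
gender = ["F", "M", "F", "F", "M", "F", "M", "M", "F", "F"]
summary = categorical_summary(gender)

for category, (count, percent) in sorted(summary.items()):
    print(f"{category}: count={count}, percent={percent:.1f}%")

# The check suggested in the text: the percentages should sum to 100%
total_percent = sum(percent for _, percent in summary.values())
print(f"total percent = {total_percent:.1f}%")
```

Because every observation falls into exactly one category, the percentages must add up to 100%; a different total signals a counting error.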


    (c) Create a column chart for variable Gender.

    To get the top left chart:

    Highlight the range R2:S4.
    Insert, Column, select the type of chart

    To get the middle chart:

    Select the first chart
    Right click on the vertical axis
    Format Axis
    Axis Options
    Select Minimum Fixed and enter 0
    Select Maximum Fixed and enter 8000
    Close

    The charts for percentages and the pie charts can be constructed in a similar way.

    Please note that you get essentially the same chart regardless of whether you graph the counts or the percentages. However, be careful with misleading scales. The vertical scale of the top left chart starts well above 6000, which makes it appear that there are many more females than males. By resetting the vertical scale to start at 0, as in the two middle charts, you see more accurately that there are almost as many males as females.


    Describing Numerical Variables

    Different graphical and numerical procedures are used to summarize the most representative information of a numerical variable depending on the type of question asked and the nature of the data being summarized. Usually we are interested in the following two questions:

    1. What is the center of the data?
    2. How far from the center do the data tend to range?

    Example 2

    Baseball Salaries

    The file Baseball_Salaries.xlsx contains data on 818 Major League Baseball (MLB) players as of May 2009. There are four variables: the player's name, team, position, and salary. How can these 818 salaries be summarized?

    Using the StatTools Add-In

    1. Run the StatTools Add-In

    With Palisade DecisionTools suite already installed, there are two options to run the StatTools:

    1. If Excel is not currently running, you can launch Excel and StatTools by clicking on the Windows Start button and selecting the StatTools item from the Palisade DecisionTools group.

    2. If Excel is currently running, the first procedure will load StatTools on top of Excel.

    You will know that StatTools is loaded when you see its tab and the associated ribbon.

    If you want to unload StatTools without closing Excel, you can choose the Unload StatTools item from the Utilities dropdown list.


    2. Choose an Application Setting

    Utilities
    Application Settings
    Reports
    Placement
    Choose Active Workbook (places the results on a new worksheet) or choose Query for Starting Cell (lets you choose the cell where your results will start)
    OK, Yes

    3. Define a StatTools data set.

    Click anywhere within the data set Baseball_Salaries.xlsx.

    Data Set Manager
    Yes

    You will see the dialog box shown to the right.

    StatTools makes several guesses about your data set. They are generally correct, but you can always override them. For now, simply click on OK.


    4. Get summary measures.

    Choose:
    Summary Statistics
    One-Variable Summary

    You will see the dialog box shown to the right.

    (If you see two columns of variables in the top pane, click on the Format button and select Stacked.)

    In the top section, select Salary.
    In the bottom section, select all of the measures.

    Click on the “double-check” button to the left of the OK button if you want to designate a different place for the results.
    OK

    The results are placed in a new worksheet automatically named One Var Summary.

    Notes:

    1. The cells in the output are “live”; that is, if you change the data for any of the salaries, the summary measures will update automatically.

    2. If you open a file with StatTools outputs but StatTools is not loaded, you may see #VALUE! errors in the cells. This can be fixed by closing the file, loading StatTools, and opening the file again.

    3. If you compare StatTools results to the measures obtained with Excel functions (see the tab Excel Summary Measures in the file Baseball_Salaries_Finished.xlsx), you will see some slight discrepancies. The reason is that StatTools is using its own statistical functions based on best practices from the statistical literature. Don’t be overly concerned about these discrepancies. Both sets of measures provide the same basic picture of how the salaries are distributed.


    Charts for Numerical Variables

    There are many graphical techniques available for representing numerical variables – histograms, dot plots, stem and leaf diagrams, box plots, Pareto diagrams, ogive plots, Q-Q plots, etc. We will discuss histograms and box plots.

    Histograms

    A histogram is the most common type of chart for showing the distribution of a numerical variable. It is based on binning the variable. That is, we arrange the data in ascending order; then we divide the range of the data into classes of equal width and count the number of observations within each class. The histogram is then a column chart of the counts in the various classes (with no gaps between the bars). The height of each bar depends on the type of histogram:
    – frequency histogram: the number of data points in that class interval
    – relative frequency histogram: the number of data points in that class interval divided by the number of observations
    – percent frequency histogram: the percent of data points in that class interval
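The binning step can be illustrated with a short Python sketch. This is not how StatTools chooses its bins; the helper `histogram_bins`, the number of classes, and the sample values are assumptions made only for illustration.

```python
def histogram_bins(data, num_bins):
    """Count observations in num_bins equal-width classes spanning the data range."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("all observations are equal; equal-width classes are undefined")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # The maximum value is placed in the last class rather than in a new one
        index = min(int((x - lo) / width), num_bins - 1)
        counts[index] += 1
    return lo, width, counts

# Hypothetical observations
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
lo, width, counts = histogram_bins(data, 4)

n = len(data)
relative = [c / n for c in counts]        # bar heights of a relative frequency histogram
percent = [100 * c / n for c in counts]   # bar heights of a percent frequency histogram

for i, c in enumerate(counts):
    print(f"[{lo + i * width:g}, {lo + (i + 1) * width:g}): "
          f"count={c}, relative={relative[i]:.2f}, percent={percent[i]:.0f}%")
```

The three kinds of histogram have the same shape; only the scale of the vertical axis changes.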

    Constructing a histogram with StatTools:

    Designate a StatTools data set (which has already been done for the salary data)
    Summary Graphs
    Histogram
    Select the Salary variable
    OK

    The resulting histogram, along with the bin data it is based on, is shown below.


    The histogram shows that the salaries are skewed to the right. The vast majority of the players are in the lowest two categories, and the salaries of the stars account for the long tail to the right.

    Box Plots

    A box plot (also called a box-whisker plot) is an alternative type of chart for showing the distribution of a variable.

    Constructing a box plot with StatTools:

    Designate a StatTools data set (which has already been done for the salary data)
    Summary Graphs
    Box-Whisker Plot
    Select the Salary variable
    Click Include Key Describing Plot Elements (optional)
    OK

    Here is the box plot.


    The box plot of baseball salaries is typical of an extremely right-skewed distribution.

    The mean is much larger than the median as we explained earlier; there is virtually no whisker out of the left side of the box (because the first quartile is barely above the minimum value), and there are many outliers to the right (the stars). In fact, many of these outliers overlap one another.

    StatTools also provides a generic box plot (not drawn to scale) as a learning device to remind you what the different elements of a box plot mean. You don't need to continue asking for this once you are familiar with box plots.

    As this generic diagram indicates, the box itself extends, left to right, from the 1st quartile to the 3rd quartile. This means that it contains the middle 50% of the data and its length is equal to the interquartile range. The line inside the box is positioned at the median, and the x inside the box is positioned at the mean. The lines (whiskers) coming out either side of the box extend to 1.5 IQRs (interquartile ranges) from the quartiles. These generally include most of the data outside the box. More distant values, called outliers, are denoted separately with small squares. They are hollow for “mild” outliers, and solid for “extreme” outliers, as indicated in the explanation.
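The elements of a box plot can be computed with a few lines of Python. The sketch below rests on two assumptions that do not come from the text: quartiles are computed with the "inclusive" method of Python's `statistics.quantiles` (StatTools and Excel may interpolate differently), and "extreme" outliers are taken to lie more than 3 IQRs beyond the quartiles, a common convention. The sample values are hypothetical.

```python
import statistics

def box_plot_elements(data, mild=1.5, extreme=3.0):
    """Return quartiles, whisker limits, and outliers for a box plot.

    Assumption: "extreme" outliers lie more than 3 IQRs beyond the quartiles;
    the text only fixes the 1.5 IQR whisker limit.
    """
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - mild * iqr, q3 + mild * iqr
    low_far, high_far = q1 - extreme * iqr, q3 + extreme * iqr
    mild_outliers = [x for x in data
                     if low_far <= x < low_fence or high_fence < x <= high_far]
    extreme_outliers = [x for x in data if x < low_far or x > high_far]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "whisker_limits": (low_fence, high_fence),
            "mild_outliers": mild_outliers,
            "extreme_outliers": extreme_outliers}

# Hypothetical right-skewed sample
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 40]
elements = box_plot_elements(sample)
for name, value in elements.items():
    print(f"{name}: {value}")
```

For this sample the right whisker limit falls between 9 and 20, so 20 is flagged as a mild outlier and 40 as an extreme one, the same pattern (on a much smaller scale) as the stars in the salary data.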

    Note: The height of the box is irrelevant in StatTools’s box plots.

    Box plots are especially suited to comparing two or more data sets (e.g. salaries for men versus salaries for women). In doing so, you should use the same scale for all the box plots.

    We can also use a box plot to identify the approximate shape of the distribution of a data set. Employing box plots to identify the shape of a distribution is most useful with large data sets. For small data sets, box plots can be unreliable in identifying distribution shape.


    Measures of Central Tendency

    Measures of central tendency describe the “average” or “typical” piece of information in a set of data. The most important measures of center are the mean, the median and the mode.

    Mean \(\bar{x}\)

    The mean \(\bar{x}\) of a set of observations \(x_1, \ldots, x_n\) is computed by summing all the data and then dividing by the number of observations \(n\):

    \[
    \bar{x} = \frac{x_1 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
    \]

    The mean can be calculated in Excel with the AVERAGE function.

    For our example, the average salary for all players is a whopping $3,260,059.

    The mean uses all of the observations, and each observation affects the mean. Extremely large or small data points (possibly outliers) can cause the mean to be pulled toward the extreme data.

    Even though the mean is sensitive to extreme values, it is still the most widely used measure of center. This is due to the fact that the mean has valuable mathematical properties that make it convenient for use with inferential statistical analysis. For example, the deviations from the mean always sum to zero:

    \[
    \sum_{i=1}^{n} (x_i - \bar{x}) = 0
    \]
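The zero-sum property of the deviations can be checked numerically. In the Python sketch below, the five salaries are hypothetical and the helper `mean` simply mirrors the definition of the mean given earlier.

```python
def mean(data):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(data) / len(data)

# Hypothetical salaries, not taken from Baseball_Salaries.xlsx
salaries = [400_000, 400_000, 1_000_000, 2_500_000, 33_000_000]

x_bar = mean(salaries)
deviations = [x - x_bar for x in salaries]

print(f"mean = {x_bar:,.0f}")
print(f"sum of deviations = {sum(deviations):.6f}")
```

Note how the single large salary pulls the mean far above four of the five observations, while the positive and negative deviations still cancel.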

    Median

    The median is the “middle” observation when the data are arranged from smallest to largest.
    If \(n\) is odd, the median is the middle number.
    If \(n\) is even, the median is usually defined as the average of the two middle observations.
    If there are no ties, 50% of the observations are smaller than the median and 50% are larger.

    The median can be calculated in Excel with the MEDIAN function.

    For our example, the median salary is $1,151,000. In words, half of the players make less than this, and half make more.
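The odd and even cases of the definition can be written out in a few lines of Python. This is an illustrative sketch with hypothetical inputs, not the implementation behind Excel's MEDIAN function.

```python
def median(data):
    """Middle observation of the sorted data; average of the two middle ones if n is even."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # n even: average of the two middle values

print(median([5, 1, 3]))      # odd number of observations
print(median([4, 1, 3, 2]))   # even number of observations
```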

    Mode

    The mode is the most frequently occurring value in a set of observations and it can be calculated in Excel with the MODE function.

    For our example, the mode is $400,000 and it occurs 70 times. In other words, close to 10% of the players earn $400,000.

    One data set can have many modes. Single mode distributions are called unimodal; distributions with two modes are called bimodal. (In Excel 2010, MODE.SNGL can be used to find a single mode, and MODE.MULT to find multiple modes.)
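A minimal Python sketch of finding every mode of a data set follows. The function name `modes` and the sample lists are hypothetical; the sketch returns all values that share the highest count, so it handles unimodal and bimodal data alike.

```python
from collections import Counter

def modes(data):
    """Return (sorted list of all most frequent values, their common count)."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, c in counts.items() if c == top), top

print(modes([4, 4, 5]))         # one mode: unimodal
print(modes([1, 2, 2, 3, 3]))   # two modes: bimodal
```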


    Measures of Shape

    A symmetrical distribution (e.g. the normal distribution) arises in situations when the observations become increasingly more frequent at the intermediate values. Distributions that are not symmetrical in form, those that tail off either to the right or the left, are referred to as skewed distributions.

    In our example, a few stars have really large salaries, and no players have really small salaries. Alternatively, the largest salaries are much farther to the right of the mean than the smallest salaries are to the left of the mean. We say that these salaries are skewed to the right (or positively skewed) because the skewness is due to the really large salaries. If the skewness were due to really small values (as might occur if we were examining temperature lows in Antarctica), then we would call it skewness to the left (or negatively skewed).

    The most commonly used measures of shape are:
    – skewness (calculated with Excel’s SKEW function)
    – kurtosis (calculated with Excel’s KURT function)

    The skewness is a measure of the symmetry (or more precisely, of the lack of symmetry) of a distribution and its magnitude increases as the degree of skewness increases:

    skewness = 0 for symmetric distributions
    skewness < 0 for left-skewed distributions
    skewness > 0 for right-skewed distributions

    For the baseball data, it is approximately 2.1 indicating a right-skewed distribution.

    The kurtosis measures the “fatness” of the tails of the distribution relative to the central portion of the distribution. As calculated by Excel's KURT function, the kurtosis of any normal distribution is 0. Positive kurtosis indicates a “spiky” distribution and negative kurtosis indicates a “flat” distribution.

    For the baseball salaries, the kurtosis is about 5.1, indicating fat tails.

    Note: For kurtosis, StatTools provides an index that is 3 for a normal distribution. That is why the kurtosis of 8.1 in the StatTools output is 3 higher than the value of about 5.1 given above.
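Both shape measures can be sketched in Python. The code assumes the bias-corrected sample formulas documented for Excel's SKEW and KURT functions, written in terms of the sample standard deviation \(s\); the function names and sample data are hypothetical, and the results may differ slightly from StatTools, which uses its own functions.

```python
import math

def _n_mean_sd(data):
    """Return the number of observations, the mean, and the sample standard deviation."""
    n = len(data)
    x_bar = sum(data) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))
    return n, x_bar, s

def sample_skewness(data):
    """Bias-corrected sample skewness (needs at least 3 observations)."""
    n, x_bar, s = _n_mean_sd(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in data)

def sample_excess_kurtosis(data):
    """Bias-corrected sample kurtosis, scaled so that a normal distribution
    gives about 0 (needs at least 4 observations)."""
    n, x_bar, s = _n_mean_sd(data)
    fourth = sum(((x - x_bar) / s) ** 4 for x in data)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]      # hypothetical data with a long right tail

print(sample_skewness(symmetric))         # essentially zero
print(sample_skewness(right_skewed))      # positive
print(sample_excess_kurtosis(symmetric))  # negative: flatter than a normal distribution
```

Adding 3 to the value returned by `sample_excess_kurtosis` gives an index on the scale where a normal distribution scores 3, which is the convention the note above attributes to StatTools.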

    Comparing the mean and the median

    The mean salary \(\bar{x}\) = $3,260,059 is higher than the median salary $1,151,000.

    As you might expect, the vast majority of baseball players have relatively modest salaries that are dwarfed by the astronomical salaries of a few stars. Because it is an average, the mean is strongly influenced by these really large values, so it is quite high. In contrast, the median is completely unaffected by the magnitude of the really large salaries, so it is much smaller. For example, the median would not change by a single cent if Alex Rodriguez made $33 trillion instead of his measly $33 million, but the mean would increase to more than $40 billion (an extra $33 trillion spread over 818 players adds about $40 billion to the mean).

    Generally, the median provides a better measure of center than the mean when there are some extremely large or small observations, that is, when the data are skewed to the right or to the left. For this reason, the median is used as the measure of center for US household income.
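The different sensitivity of the two measures is easy to demonstrate. In the Python sketch below, the five salaries are hypothetical; the largest one is multiplied by 1,000 to mimic a single star's salary exploding.

```python
import statistics

# Hypothetical salaries; only the largest value differs between the two lists
salaries = [400_000, 450_000, 1_150_000, 3_000_000, 33_000_000]
inflated = salaries[:-1] + [33_000_000_000]

for label, data in (("original", salaries), ("inflated", inflated)):
    print(f"{label}: mean = {statistics.mean(data):,.0f}, "
          f"median = {statistics.median(data):,.0f}")
```

The median is the same in both lists because the middle observation has not moved, while the mean grows by one fifth of the increase in the top salary.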


    When a data distribution is basically symmetrical in form, the mean and median will have very nearly the same value. In a skewed distribution the mean tends to get dragged toward the tail of the distribution, toward a few unusually high values (for a right-skewed distribution) or unusually low values (for a left-skewed distribution).

    Knowing only the values of the mean and the median of a distribution, we can generally (but not always!) guess the shape of the distribution:

    mean = median →  symmetrical distribution
    mean > median →  right (or positively) skewed distribution
    mean < median →  left (or negatively) skewed distribution

    Measures of Variability

    (The devil is in the deviations!)

    An average by itself does not adequately describe a data set. For example, if you learn that the mean (or median) salary in some company is $100,000, this tells you something about the “typical” salary, but it tells you nothing about how spread out the salaries are, that is, their variability.

    That is why measures of variability are generally reported with the measures of central tendency.

    A small value for a measure of spread indicates that the data are concentrated around the mean; therefore, the mean is a good representative of the data set. On the other hand, a large measure of spread indicates that the mean is not a good representative of the data set.

    The most commonly used measures of variability are:
    range, quartiles, interquartile range (IQR), variance, standard deviation

    Range

    The range is the difference between the largest and the smallest observations:

    Range = Maximum – Minimum

    For the baseball salaries, this range is $32.6 million. It certainly tells us how spread out the salaries are, but it is too sensitive to the extremes. For example, if Alex Rodriguez’s salary increased to $43 million, the range would increase by $10 million – just because of one player.

    Less sensitive measures are the quartiles and the interquartile range.

    Quartiles

    Quartiles are values that separate an ordered (from smallest to largest) data set into four equal classes, that is, into quarters. There are three quartiles:

    first quartile \(Q_1\) – 25% of the observations lie at or below it

    second quartile \(Q_2\) = median – 50% of the observations lie at or below it

    third quartile \(Q_3\) – 75% of the observations lie at or below it

    Please note that \(Q_1\) is the median of the observations that are less than the median, and \(Q_3\) is the median of the observations that are greater than the median.

    For the baseball data, 25% of the players make within $20,000 of the league minimum, and more than a quarter of all players make more than $4 million. In fact, more than 1% of the players make well over $18 million, with Alex Rodriguez topping the list at $33 million. And they say it’s just a game!


    Interquartile range (IQR)

    The interquartile range is defined as the third quartile minus the first quartile:

    \[
    \text{IQR} = \text{interquartile range} = Q_3 - Q_1
    \]

    It describes the amount of variation in the middle half of the data. In other words, it is the range of the middle 50% of the data.

    For the baseball data, the IQR is $3,817,950. If you excluded the 25% of players with the lowest salaries and the 25% with the highest salaries, this IQR would be the range of the remaining 50% of the salaries.
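The range, the quartiles, and the IQR can be computed together. The Python sketch below uses the "median of each half" description of the quartiles given above, splitting the sorted data by position (which matches that description when there are no ties at the median). Excel and StatTools interpolate quartiles somewhat differently, so small discrepancies are to be expected; the function name and data are hypothetical.

```python
import statistics

def quartiles(data):
    """Q1, Q2, Q3 using the median-of-halves rule on the sorted data."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # observations below the median position
    upper = ordered[half + n % 2:]    # observations above the median position
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

# Hypothetical observations
data = [2, 4, 4, 5, 7, 8, 9, 10, 12]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1

print(f"range = {data_range}, Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

Changing the largest observation from 12 to 120 would increase the range by 108 but leave the quartiles and the IQR untouched, which is why they are described as less sensitive measures.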

    Variance

    The variance measures spread by looking at how far the observations are from the mean.

    It is essentially the average of the squared deviations from the mean \(\bar{x}\):

    \[
    s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
    \]

    Note that we divide by \(n - 1\) rather than \(n\). The factor \(n - 1\) is called the degrees of freedom. The strong explanation for this strange divisor can be given by means of statistical estimation theory (\(s^2\) is an unbiased estimator of \(\sigma^2\)). The intuitive explanation is that since the deviations \((x_i - \bar{x})\) always sum to exactly 0, \(\sum_{i=1}^{n} (x_i - \bar{x}) = 0\), knowing \(n - 1\) of them determines the last one.

    In other words, we have only \(n - 1\) independent pieces of information. Some calculators offer a choice between dividing by \(n\) and dividing by \(n - 1\). Please be sure to use \(n - 1\).

    Please note that variance is expressed in squared units. For example, the huge value shown as a variance in the output for the baseball data is in squared dollars.

    The variance can be calculated in Excel with the VAR function.
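The effect of the divisor can be seen in a small Python sketch. The function `sample_variance` and the data are hypothetical; in Python's standard library, `statistics.variance` divides by \(n - 1\) and `statistics.pvariance` divides by \(n\).

```python
import statistics

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Hypothetical observations with mean 5
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_variance(data))        # divides by n - 1
print(statistics.variance(data))    # library version, also n - 1
print(statistics.pvariance(data))   # divides by n, so it is smaller
```

With only eight observations the two divisors give noticeably different answers; for a data set as large as the 818 salaries the difference is small.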

    Standard Deviation

    Variance is an important measure of variability. However, it is not expressed in the same units as the observations, which makes it hard to interpret. This problem can be solved by working with the square root of the variance.

    The square root of the variance \(s^2\) is denoted by \(s\) and is called the standard deviation:

    \[
    s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
    \]

    The standard deviation can be calculated in Excel with the STDEV function.
    The standard deviation for the baseball salaries is slightly above $4.36 million.


    Both variance and standard deviation provide the same information; one can always be obtained from the other. However, the standard deviation is always expressed in the same units as the raw data and is considered the most important measure of variation.

    The standard deviation is roughly the amount a typical observation varies from the mean. Small values of \(s\) indicate that all the data are clustered near the mean. Large values of \(s\) indicate that the data are spread out from the mean. Standard deviation is always zero or greater than zero. It is zero only when all the observations are equal, and consequently there is no variation.

    The Empirical Rule

    All approximately normal (symmetric and bell-shaped) distributions satisfy the following rule:

    • About 68% of all observations fall within 1 standard deviation of the mean

    • About 95% of all observations fall within 2 standard deviations of the mean

    • About 99.7% of all observations fall within 3 standard deviations of the mean

    The empirical rule holds (up to rounding of the percentages) for distributions that are exactly normal. Real data are never exactly normal.

    Can the empirical rule be applied to the baseball salaries?

    The answer is that you can always try, but because of obvious skewness in the salary data, the assumption of an approximately normal distribution is not satisfied and the rule will not be very accurate. Nevertheless, if you calculate the percentage of observations falling within one, two, and three standard deviations of the mean, you will get 85.33%, 92.54%, and 98.04%. These three percentages, according to the empirical rule, should be about 68%, 95%, and 99.7%. The second and the third percentage are not way off, but the first is not even close.
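The check described above can be sketched in Python. The helper `share_within` and the ten data points are hypothetical and deliberately right-skewed; they are not the baseball salaries, so the percentages differ from those quoted in the text.

```python
import statistics

def share_within(data, k):
    """Percentage of observations within k sample standard deviations of the mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)   # sample standard deviation (divides by n - 1)
    inside = sum(1 for x in data if abs(x - x_bar) <= k * s)
    return 100.0 * inside / len(data)

# Hypothetical right-skewed data: many small values and one very large one
data = [1, 1, 1, 1, 1, 1, 1, 1, 2, 20]

for k, expected in ((1, 68), (2, 95), (3, 99.7)):
    print(f"within {k} SD: {share_within(data, k):.1f}% "
          f"(empirical rule: about {expected}%)")
```

As with the salaries, the share within one standard deviation is far above 68%, because the single large value inflates the standard deviation while most observations stay close to the mean.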