Basic Statistics: Overview


A concise overview of basic descriptive statistics

Transcript of Basic Statistics: Overview

  • A CONCISE OVERVIEW OF BASIC STATISTICS

    The goal of descriptive statistics is to describe a population (a set of individuals or entities), or some characteristics of this population, by collecting and studying data on all of the set's elements or on the elements of a certain subset of the population (which we will call a sample). The characteristic of the population that we want to study is called the variable. For example: the number of items sold at each of the 12 fine art auctions in Santa Monica, one for every month in 2014. Here the population is the set of all these Santa Monica fine art auctions in 2014; the variable is the number of items that were sold.

    1. Measures of location (or: of place): central tendencies of a set of data

    a. Mean, Median, Mode ("the three M's")

    For a list of data $[x_1, x_2, x_3, \ldots, x_N]$ (listing the values measured for each of the individuals or entities in the population or the sample of size N), the (arithmetic) mean or average is given by the formula $\frac{1}{N}\sum_{i=1}^{N} x_i$. In the case of a population the mean is usually denoted by the Greek letter $\mu$. In the case of a sample one usually writes $\bar{X}$ for the mean.

    Example: These are the numbers of items sold at the Santa Monica fine art auctions in 2014: [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]. This is a population: we are given the data for all the auctions in 2014. The mean of these data is:

    $\mu = \frac{14 + 22 + 16 + 11 + 23 + 16 + 18 + 21 + 14 + 19 + 9 + 26}{12} = \frac{209}{12} \approx 17{,}42$.

    The median is the midpoint of the values, after these have been put in increasing (or decreasing) order. Example (continued): Order the list of values. We get 9, 11, 14, 14, 16, 16, 18, 19, 21, 22, 23, 26. With an odd number of data values, the median is the central number in the ordered sequence. Because here the number of data values is even, there is no single central number. We then take the average of the two most central ones. In this case, these are 16 and 18. So the median of our data set is 17.

    The mode or modal value is the value that appears most frequently in a set of data.

  • Example (continued): Unlike the mean and the median, the mode does not have to be unique. In our example both 14 and 16 occur twice; the other values appear only once. In such cases one sometimes speaks of a bimodal data set.
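    The three measures of location can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable name `items_sold` is our own choice, not part of the original note.

```python
import statistics

# Items sold at the twelve 2014 auctions (data from the example above).
items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

mean = statistics.mean(items_sold)        # sum of the values divided by N
median = statistics.median(items_sold)    # average of the two central values when N is even
modes = statistics.multimode(items_sold)  # all values that share the highest count

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"modes  = {modes}")
```

    `multimode` is used rather than `mode` because it returns every most frequent value, which is what we need for a bimodal data set.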

    b. Quartiles, deciles, percentiles

    The median is the measure of location that identifies the center of a collection of observations. It divides the ordered sequence of our data into two equal parts. In a similar manner we can divide the ordered (from small to large) sequence of data into four, ten or a hundred equal parts.

    Quartiles divide the sequence of data into four equal parts. The first or lower quartile, Q1, is the value below which 25% of our data occur. The second quartile, Q2, is the value below which 50% of our data occur; it is equal to the median. The third or upper quartile, Q3, is the value below which 75% of our data occur.

    Deciles divide the sequence of data into ten equal parts. The first decile, D1, is the value below which 10% of our data occur. The fifth decile, D5, is equal to the median. The ninth decile, D9, is the value below which 90% of our data occur.

    Percentiles divide the sequence of data into a hundred equal parts. The fiftieth percentile, P50, is equal to the median.

    To find the location of a given percentile, we use the formula $L_p = (N + 1)\frac{p}{100}$, where N is the size of our sequence and p the desired percentile.

    Example (continued): The location of the median in a sequence of 12 values is equal to that of the 50th percentile, which is $L_{50} = 13 \cdot \frac{50}{100} = 6{,}5$. This means that the median lies halfway between the 6th and the 7th value in the sequence.

    The location of the first quartile is that of the 25th percentile: $L_{25} = 13 \cdot \frac{25}{100} = 3{,}25$. We find the value of Q1 between the 3rd and the 4th value. In our sequence of data that is between 14 and 14. Therefore Q1 = 14.

    The location of the third quartile is that of the 75th percentile: $L_{75} = 13 \cdot \frac{75}{100} = 9{,}75$. We find the value of Q3 between the 9th and the 10th value. In our sequence of data that is between 21 and 22. We use linear interpolation to find the precise value: $Q_3 = 21 + 0{,}75 \cdot (22 - 21) = 21{,}75$.
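    The location rule and the interpolation step can be sketched in a few lines of Python. The function name `percentile` and the clamping of locations that fall outside the data are our own assumptions; the note itself only covers locations inside the sequence.

```python
import math

def percentile(sorted_values, p):
    """Value at percentile p, using the location rule L_p = (N + 1) * p / 100."""
    n = len(sorted_values)
    location = (n + 1) * p / 100           # 1-based position, possibly fractional
    # Assumption: clamp extreme percentiles so the location stays inside the data.
    location = min(max(location, 1), n)
    lower = math.floor(location)           # position of the value just below
    fraction = location - lower            # how far we move towards the next value
    if lower == n:
        return float(sorted_values[-1])
    low = sorted_values[lower - 1]
    high = sorted_values[lower]
    return low + fraction * (high - low)   # linear interpolation

items_sold = sorted([14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26])

q1 = percentile(items_sold, 25)
q2 = percentile(items_sold, 50)
q3 = percentile(items_sold, 75)
print(q1, q2, q3)
```

    Python's built-in `statistics.quantiles` with its default `method="exclusive"` places the cut points at the same (N + 1)-based locations, so it can serve as a cross-check.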

    2. Measures of dispersion

    a. Range

  • The range of a collection of data is the difference between the greatest and the least value. Example (continued): The range of the data in our example is 26 − 9 = 17.

    b. Variance and standard deviation

    Measures of dispersion indicate the degree to which numerical data are spread out. Variance and standard deviation are based on the squares of the differences between each of the values and the data's arithmetic mean. There is a subtle but nevertheless important difference between the formulas used to calculate these measures in the case of a population and in the case of a sample.

    Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.

    Sample variance: $S^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{X})^2}{N - 1}$; standard deviation: $S = \sqrt{S^2}$.

    It is sometimes useful to express the standard deviation relative to the mean, as the ratio of the standard deviation to the mean. This is called the coefficient of variation (cv) (also named variation coefficient or relative standard deviation, rsd).

    In the case of a population: $cv = \frac{\sigma}{\mu}$; in the case of a sample: $cv = \frac{S}{\bar{X}}$.

    Example (continued): In our example the data are those of a population. The population variance is $\sigma^2 = \frac{\sum_{i=1}^{12} (x_i - 17{,}42)^2}{12} \approx 23{,}41$. The population standard deviation is $\sigma = \sqrt{23{,}41} \approx 4{,}84$. The variation coefficient is $\frac{4{,}84}{17{,}42} \approx 0{,}28$.
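    To see the difference between the population and the sample formulas side by side, here is a minimal sketch with Python's standard `statistics` module. The variable names are our own; the data are those of the running example.

```python
import statistics

items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

mu = statistics.fmean(items_sold)

# Population formulas: divide the sum of squared deviations by N.
pop_var = statistics.pvariance(items_sold)
pop_sd = statistics.pstdev(items_sold)

# Sample formulas: divide by N - 1 instead.
sample_var = statistics.variance(items_sold)
sample_sd = statistics.stdev(items_sold)

# Coefficient of variation: standard deviation relative to the mean.
cv = pop_sd / mu

print(f"population: variance = {pop_var:.2f}, sd = {pop_sd:.2f}, cv = {cv:.2f}")
print(f"sample:     variance = {sample_var:.2f}, sd = {sample_sd:.2f}")
```

    Because the example data form a population, the first line of output is the one that applies here; the sample values are shown only to illustrate the effect of dividing by N − 1.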

    c. Interquartile & interdecile range

    The interquartile range (also called midspread or middle fifty) is the difference between the upper and the lower quartile: $IQR = Q_3 - Q_1$. It contains the most centrally placed 50% of our data. (The smaller the IQR, the smaller our data's dispersion.) Similarly, the interdecile range contains the most centrally placed 80% of our data: $IDR = D_9 - D_1$. It is the width of the interval containing the central 80% of our data.
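    As an illustration, both ranges can be computed for the auction data of the running example. This sketch relies on `statistics.quantiles` with `method="exclusive"`, which uses (N + 1)-based locations with linear interpolation; the decile values are our own computation and do not appear in the note.

```python
import statistics

items_sold = [14, 22, 16, 11, 23, 16, 18, 21, 14, 19, 9, 26]

# Three cut points that split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(items_sold, n=4, method="exclusive")

# Nine cut points that split the ordered data into ten parts.
deciles = statistics.quantiles(items_sold, n=10, method="exclusive")
d1, d9 = deciles[0], deciles[-1]

iqr = q3 - q1   # width of the central 50% of the data
idr = d9 - d1   # width of the central 80% of the data

print(f"IQR = {iqr:.2f}, IDR = {idr:.2f}")
```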

  • A graphical representation of the list of data, their distribution and their tendencies in a boxplot or box and whisker diagram:

    3. Frequency distributions

    A statistical analysis of large sets of data will in general start by organizing the observed data into a certain number of classes or intervals, in many cases (but not always) of a constant, fixed width. This is called a frequency distribution. It counts how many of our data fall within each class. Example: The following table counts the number of documented art works depicting Venus or Aphrodite, created in France, within time periods of half a century, from 1500 to 2000. (source: K. Bender)

    Period       Midpoint   Frequency   Frequency percentage   Cumulative frequency percentage
    1500-1549    [1525]         9              0,32%                     0,32%
    1550-1599    [1575]        60              2,11%                     2,43%
    1600-1649    [1625]       137              4,82%                     7,25%
    1650-1699    [1675]       210              7,39%                    14,64%
    1700-1749    [1725]       483             17%                       31,63%
    1750-1799    [1775]       785             27,62%                    59,25%
    1800-1849    [1825]       383             13,48%                    72,73%
    1850-1899    [1875]       327             11,51%                    84,24%
    1900-1949    [1925]       286             10,06%                    94,3%
    1950-1999    [1975]       162              5,7%                    100%
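    The percentage rows follow directly from the frequencies. The following is a minimal sketch of that calculation in Python (the note itself uses Excel); the variable names are our own.

```python
from itertools import accumulate

# Half-century classes and their frequencies, as in the table.
periods = [f"{start}-{start + 49}" for start in range(1500, 2000, 50)]
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

total = sum(frequencies)

# Share of each class, and running share up to and including each class.
percentages = [100 * f / total for f in frequencies]
cumulative = [100 * c / total for c in accumulate(frequencies)]

for period, f, pct, cum in zip(periods, frequencies, percentages, cumulative):
    print(f"{period}  {f:4d}  {pct:6.2f}%  {cum:6.2f}%")
```

    Python prints decimal points where the table uses decimal commas; the values are otherwise the same.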

    (You can find the data and the calculations on the Excel worksheet ExcelWorksheet_1.) A frequency distribution of a data set is usually visualized by means of a histogram, which can easily be generated, e.g. in Excel, from the list of classes (the bins) and the corresponding frequencies of values within each of these bins. Here is the histogram visualization of the temporal distribution of French Venus art works, as generated in Excel:

  • Given a frequency distribution, the only possible estimation of the range of the data obtained for our population or sample is the difference between the lowest and the highest possible value. We can estimate the mean of the data by using the center (midpoint) of a class as its value, and then using the frequencies of each of the classes as weights in a weighted average of these midpoints: $\bar{X} = \frac{\sum_{i=1}^{c} f_i m_i}{\sum_{i=1}^{c} f_i}$. Here c is the number of classes; $f_i$ is the frequency of class i; $m_i$ is the center (midpoint) of class i.

    Example (continued): In the distribution of the number of French Venus art works there are 10 classes, with midpoints 1525, 1575, 1625, 1675, 1725, 1775, 1825, 1875, 1925, and 1975. If we are specifically interested in this estimation of the mean as an estimation of the mean age of the documented French Venus art works, we can use age midpoints, counting back from 2000. The midpoint 1525 will correspond to an age of 475 years, 1575 to an age of 425 years, et cetera. Then the estimated mean age will be:

    $\bar{X}_{\text{france}} = \frac{9 \cdot 475 + 60 \cdot 425 + \cdots + 162 \cdot 25}{2842} = \frac{592250}{2842} \approx 208{,}39$.
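    The weighted average of the age midpoints can be reproduced as follows. This is a sketch in Python rather than the Excel calculation the note refers to; the variable names are our own.

```python
# Frequencies of the ten half-century classes, 1500-1549 up to 1950-1999.
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

# Class midpoints 1525, 1575, ..., 1975, converted to ages counted back from 2000.
midpoints = list(range(1525, 2000, 50))
ages = [2000 - m for m in midpoints]

# Weighted average: each age midpoint is weighted by the frequency of its class.
weighted_sum = sum(f * a for f, a in zip(frequencies, ages))
mean_age = weighted_sum / sum(frequencies)

print(f"estimated mean age = {mean_age:.2f} years")
```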

    We use linear interpolation to estimate other measures of location, like the median and the quartiles. Example (continued): From the row of cumulative frequency percentages in the frequency distribution we learn that 31,63% of the French Venus art works date from before 1750, and that 59,25% date from before 1800. The median Me (the date before which 50% of the documented works were created) must therefore lie somewhere between 1750 and 1800. Applying linear interpolation we find as an estimation of this median:

    $\frac{Me - 1750}{50} = \frac{50 - 31{,}63}{59{,}25 - 31{,}63} \;\Rightarrow\; Me = 1750 + 50 \cdot \frac{18{,}37}{27{,}62} \approx 1783{,}3$.

    I.e. the median age will be 2000 − 1783,3 = 216,7 years.
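    The interpolation can also be carried out on the raw frequencies instead of the rounded percentages. The sketch below does this; the function name `grouped_quantile` and its parameters are hypothetical, and each class is assumed to cover a full 50-year interval starting at its lower bound. Because it works from unrounded counts, its result can differ in the last digit from a hand calculation with rounded percentages.

```python
def grouped_quantile(lower_bounds, width, frequencies, fraction):
    """Estimate the value below which `fraction` of the data lie,
    interpolating linearly inside the class that contains it."""
    total = sum(frequencies)
    target = fraction * total        # number of observations below the quantile
    cumulative = 0                   # observations in all earlier classes
    for lower, freq in zip(lower_bounds, frequencies):
        if freq > 0 and cumulative + freq >= target:
            # Move into the class in proportion to the observations still needed.
            return lower + width * (target - cumulative) / freq
        cumulative += freq
    return lower_bounds[-1] + width  # fraction of 1 or more: upper end of the data

lower_bounds = list(range(1500, 2000, 50))
frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]

median_year = grouped_quantile(lower_bounds, 50, frequencies, 0.50)
median_age = 2000 - median_year
print(f"median year = {median_year:.1f}, median age = {median_age:.1f} years")
```

    The same function accepts other fractions, for example 0.25 or 0.75 for the quartiles.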

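  • Similarly, we can determine the first and the third quartiles, Q1 and Q3. The cumulative frequency percentages show that 14,64% of the works date from before 1700 and 31,63% from before 1750, so Q1 lies between 1700 and 1750:

    $\frac{Q_1 - 1700}{50} = \frac{25 - 14{,}64}{31{,}63 - 14{,}64} \;\Rightarrow\; Q_1 = 1700 + 50 \cdot \frac{10{,}36}{16{,}99} \approx 1730{,}5$.

    I.e., we estimate that one quarter of the French Venus art works is more than about 269,5 years old.

    Likewise, 72,73% of the works date from before 1850 and 84,24% from before 1900, so Q3 lies between 1850 and 1900:

    $\frac{Q_3 - 1850}{50} = \frac{75 - 72{,}73}{84{,}24 - 72{,}73} \;\Rightarrow\; Q_3 = 1850 + 50 \cdot \frac{2{,}27}{11{,}51} \approx 1859{,}9$.

    I.e., we also estimate that about a quarter of the French Venus art works is less than about 140 years old. Finally, we estimate that half of all the documented French Venus art works came into being between about 1730 and 1860 (the interquartile range, or IQR, is about 129 years). As before, we may visualize this descriptive analysis of our data in a box-plot: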

    ( There are many tools that help you quickly generate such boxplot images, for example, online at http://www.imathas.com/stattools/boxplot.html )

    As for the mean, we also use the midpoints of the classes to estimate the variance and the standard deviation:

    Population variance: $\sigma^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\sum_{i=1}^{c} f_i}$; standard deviation: $\sigma = \sqrt{\sigma^2}$.

    Sample variance: $S^2 = \frac{\sum_{i=1}^{c} f_i (m_i - \bar{X})^2}{\left(\sum_{i=1}^{c} f_i\right) - 1}$; standard deviation: $S = \sqrt{S^2}$.

    Example (continued): The frequency distribution was obtained from population data. Therefore the estimated variance is $\frac{\sum_{i=1}^{10} f_i (m_i - 208{,}39)^2}{2842} \approx 9046{,}83$, where $m_i$ denotes the age midpoint of class i; the estimated standard deviation is $\sqrt{9046{,}83} \approx 95{,}11$ years.

    The estimated coefficient of variation is $\frac{95{,}11}{208{,}39} \approx 0{,}46$, i.e. 46%: on average, the age of a French Venus art work will differ by 46% from the estimated mean age of 208,4 years.
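    The grouped estimates of variance, standard deviation and coefficient of variation can be verified with a short script. This is a sketch under the same assumptions as the text: every work in a class is treated as if its age were the class's age midpoint, and the population formula (division by the total frequency) is used. The variable names are our own.

```python
import math

frequencies = [9, 60, 137, 210, 483, 785, 383, 327, 286, 162]
ages = [2000 - m for m in range(1525, 2000, 50)]   # age midpoints 475, 425, ..., 25

total = sum(frequencies)
mean_age = sum(f * a for f, a in zip(frequencies, ages)) / total

# Population variance of grouped data: squared deviations weighted by frequency.
variance = sum(f * (a - mean_age) ** 2 for f, a in zip(frequencies, ages)) / total
std_dev = math.sqrt(variance)
cv = std_dev / mean_age

print(f"variance = {variance:.2f}")
print(f"standard deviation = {std_dev:.2f} years")
print(f"coefficient of variation = {cv:.2f}")
```

    For sample data, the divisor `total` in the variance line would be replaced by `total - 1`.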

  • Exercise: On the Excel worksheet ExcelWorksheet_1 you will find the data that were used above as well as all the calculations, detailed in Excel. The worksheet also contains similar data sets (source: K. Bender) for the Venus iconography between 1500 and 2000 in Italy (it), in the Low Countries (lc) and in Germany, Switzerland and Central European countries (gsce). In order to train yourself in the use of Excel to quickly perform a basic descriptive statistical analysis of a set of data:

    a. Perform a descriptive analysis similar to the one given in this note, for each of these three supplementary data sets. Summarize your findings in a short text report; include a histogram and a boxplot.

    b. Determine the total distribution of Venus art works in Europe between 1500 and 2000 (see the rightmost columns on the worksheet). Again perform the descriptive analysis for this full distribution, and summarize your findings in a short text report with histogram and boxplot.

    c. Compare and interpret the results.