MEASURES OF LOCATION AND SPREAD - Writings,...

Biostatistics for medical students: written by Dr. Nasih Othman, Sulaimani Polytechnic University 2012


Page 1: MEASURES OF LOCATION AND SPREAD - Writings, …infonas.net/wp-content/uploads/2013/11/CH_LocationSpread.pdf · MEASURES OF LOCATION AND SPREAD ... To calculate sum of each interval


MEASURES OF LOCATION AND SPREAD

Frequency distributions and the other methods of data summarization and presentation explained in the previous lectures provide a fairly detailed description of the data and how it is distributed in the sample. For categorical variables this is usually enough. For quantitative variables we have more methods to summarize and present the data. Since quantitative variables are numbers (whether discrete or continuous), we can order them and summarize them in terms of how they are clustered and spread out in the sample. Quantitative variables can be summarized in terms of the location of the values (measures of location, or measures of central tendency) and how they are spread in the sample (measures of spread, or variation).

MEASURES OF LOCATION (Measures of Central Tendency)

Measures of location tell us where the values of a variable are located when the data is ordered. There are three measures of location: the median, the mode and the mean. Each has its own advantages and disadvantages, which depend on the type of data being summarized.

Median

When we order the values in ascending or descending order, the median is the value that divides the distribution into two equal parts, so that there is the same number of observations above and below it.

For example, the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

To calculate the median, we rearrange the values in ascending order (Table 1). Observation number 8 (27 years) is the middle observation, i.e. there are 7 observations on either side of it, so the median age is 27 years.

When there is an even number of data values, there is no single middle value. In this case the median is the average of the central pair of values, i.e. we add up the two central values and divide the result by 2. For example, in Table 2 there are 16 observations, so there is no single middle value. The median for this data is calculated from the two values in the middle of the data, i.e. observations 8 and 9:

Median age = (27 + 28)/2 = 55/2 = 27.5 years

Median for Frequency Distributions

The median for a frequency distribution is simply the value at which the cumulative relative frequency reaches 50%.

Table 1          Table 2
ID   Age         ID   Age
1    17          1    17
2    18          2    18
3    19          3    19
4    22          4    22
5    22          5    22
6    23          6    23
7    25          7    25
8    27          8    27
9    28          9    28
10   30          10   30
11   33          11   33
12   36          12   36
13   39          13   39
14   42          14   42
15   44          15   44
                 16   46
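The two median calculations above can be checked with a short Python sketch. The function name median_by_hand is an illustrative choice and not part of the lecture. It follows the ordering rule described in the text and is compared with the median function from Python's standard statistics module.

```python
from statistics import median

# Ages of the 15 women from the example (Table 1)
ages_15 = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Table 2 adds a 16th observation (46 years)
ages_16 = ages_15 + [46]


def median_by_hand(values):
    """Median via the ordering rule in the text (illustrative helper)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle observation exists
        return ordered[mid]
    # Even n: average the central pair of values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_by_hand(ages_15))  # 27
print(median_by_hand(ages_16))  # 27.5

# The standard library gives the same answers
print(median(ages_15), median(ages_16))
```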


Mode

The mode of a distribution is simply the value that occurs most frequently. A distribution may have more than one mode. In the example above, 22 occurs twice and every other value occurs once, so 22 is the mode.

Mean

The mean is the average of all values. It is calculated as the sum of all values divided by the number of observations. If each of the n observations (n is the sample size) has a value $x_i$, then the mean is:

$\bar{x} = \frac{\sum x_i}{n}$

Example: the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

Mean age of the women = sum of all ages/n = (17 + 25 + 36 + 23 + 44 + 39 + 19 + 22 + 30 + 33 + 42 + 28 + 27 + 22 + 18)/15 = 425/15 = 28.3 years

The mean age of the sample is 28.3 years.

Mean for Frequency Distributions

If we have grouped data from a frequency table and do not have the individual values, we can still calculate the mean: calculate the total for each interval (frequency × midpoint), add up the totals for all intervals, and divide by the sample size. The result is an approximation, because every value in an interval is treated as if it were equal to the midpoint. If f is the frequency of each interval, the mean is:

$\bar{x} = \frac{\sum (f \times \text{midpoint})}{n}$

Table 1 below displays grouped data for the Hb of 50 women. To calculate the sum for each interval we first calculate the midpoint of the interval (column 3) and multiply it by the frequency (column 2) to obtain the sum of the values for that interval (column 4).

Mean Hb = [(4 × 8.5) + (7 × 9.5) + (18 × 10.5) + (13 × 11.5) + (3 × 12.5) + (4 × 13.5) + (1 × 14.5)]/50

Mean Hb = 545/50 = 10.9 gm

Therefore the mean Hb of the 50 women is 10.9 gm.

Table 1. Calculation of mean Hb of 50 women from a frequency distribution table

Hb            Frequency   Mid-point   Sum of interval
8-8.9         4           8.5         34
9-9.9         7           9.5         66.5
10-10.9       18          10.5        189
11-11.9       13          11.5        149.5
12-12.9       3           12.5        37.5
13-13.9       4           13.5        54
14 and over   1           14.5        14.5
Total         50                      545
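As a quick check of the arithmetic in this section, the following Python sketch recomputes the mode and mean of the 15 ages and the grouped mean of the Hb data in Table 1. The function name grouped_mean and the variable names are illustrative assumptions, not part of the lecture. The midpoint of 14.5 for the open interval "14 and over" is taken from the table as given.

```python
from statistics import mean, multimode

# Ages of the 15 women from the example
ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# multimode returns every value tied for the highest frequency
print(multimode(ages))       # [22]
print(round(mean(ages), 1))  # 28.3

# Grouped Hb data from Table 1 as (frequency, midpoint) pairs
hb_groups = [
    (4, 8.5), (7, 9.5), (18, 10.5), (13, 11.5),
    (3, 12.5), (4, 13.5), (1, 14.5),
]


def grouped_mean(groups):
    """Mean from grouped data: sum of (frequency x midpoint) over sample size."""
    n = sum(f for f, _ in groups)
    total = sum(f * midpoint for f, midpoint in groups)
    return total / n


print(grouped_mean(hb_groups))  # 10.9
```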


Properties of the Mean, Median & Mode

1. The mean, mode and median will be similar if the data is normally distributed (symmetrically distributed around the mean). If the distribution is skewed (not symmetrical), the three measures will be different.

2. The mean is sensitive to outliers; the median and the mode are not. An outlier is an extreme value, i.e. a value which is far from the rest of the values. If there are outliers in the data, the mean will be pulled towards them, while the median and the mode are largely unaffected.

3. The mode may be affected by small changes in the data, but the mean and the median are not greatly affected by small changes.

Which measure should we use? Generally, if the data distribution is not symmetrical or there are outliers, the median is a better measure of location than the mean. When we want to perform statistical analysis for inference, the mean is more flexible and useful. But if the data is not symmetrically distributed (not normally distributed), then even for statistical inference it is usually better to use the median.
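The sensitivity of the mean to outliers (property 2) can be illustrated with a small Python sketch. The extra value of 95 years is a hypothetical outlier added only for illustration. It is not part of the survey data.

```python
from statistics import mean, median

# Ages of the 15 women from the earlier example
ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Hypothetical outlier, added for illustration only
ages_with_outlier = ages + [95]

print(round(mean(ages), 1), median(ages))            # 28.3 27
print(mean(ages_with_outlier), median(ages_with_outlier))  # 32.5 27.5
```

One extreme value moves the mean by about 4 years, while the median moves by only half a year.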


MEASURES OF SPREAD

If we look at a set of quantitative data displayed as a frequency distribution or a graph, we can say whether the observations are widely spread out from the mean or clustered around it. But this is not enough; it is usually necessary to describe this variability as a numerical value. Such a value is called a measure of spread. A measure of spread, reported along with the mean, provides a more informative summary of a data set. There are three main ways to summarize the variability of a set of data (three measures of spread):

1. Range: gives the range of all values
2. Percentiles: report what values are located at certain percentages of the whole data
3. The standard deviation: calculates a single numerical measure of the spread around the mean

Each measure has its own advantages, but the standard deviation is the most useful in statistical calculations.

Range

The simplest way to describe the spread of a set of observations is to report the range from the minimum value to the maximum. The range therefore tells us the lowest value and the highest value, and hence the difference between them. The problem is that it reports the most extreme values, which may not represent the majority of the data. The actual distribution of the values between these two extremes is not summarized in any way.

Example: the ages of 15 women in a survey were as follows:

17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18

To calculate the range we first order the values from minimum to maximum, then identify the smallest and the largest values and report them.

17, 18, 19, 22, 22, 23, 25, 27, 28, 30, 33, 36, 39, 42, 44

The range is 17-44 years (17 to 44 years). This means that the ages of the women are spread out from 17 to 44 years, including 44. Sometimes when we report the range we also report the interval (the difference between the maximum and the minimum). For example, the difference between 44 and 17 (44 - 17) is 27 years. Then we say the range was 27 years (17-44).

Percentiles

A percentile (or centile) is the value below which a given percentage of the data falls. For example, in the graph below of the height of a group of people, the 5th percentile is 145 cm, meaning that 5% of the group had a height below 145 cm. The 95th percentile is 165 cm, which means that 95% of the group had a height below 165 cm. By specifying these two percentiles we give a range in which 90% of the data lies.


[Figure: distribution of the height of a group of people. Horizontal axis: height in cm, from 140 to 170.]
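The range calculation and the definition of a percentile can be sketched in Python as follows. The helper percent_below is a hypothetical name introduced here for illustration. It simply counts what percentage of the observations fall below a given value, which is the idea behind a percentile. The age data is reused because the lecture does not list the individual heights behind the graph.

```python
# Ages of the 15 women from the range example
ages = [17, 25, 36, 23, 44, 39, 19, 22, 30, 33, 42, 28, 27, 22, 18]

# Range: report the minimum, the maximum and the interval between them
lowest, highest = min(ages), max(ages)
print(f"Range: {lowest}-{highest} years, interval {highest - lowest} years")


def percent_below(values, cutoff):
    """Percentage of observations strictly below the cutoff (illustrative helper)."""
    below = sum(1 for v in values if v < cutoff)
    return 100 * below / len(values)


# 7 of the 15 women are younger than 27 years
print(round(percent_below(ages, 27), 1))  # 46.7
```

Statistical software usually works in the other direction (from a percentage to a value) and may interpolate between observations, so results for small samples can differ slightly between methods.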


Standard Deviation

The most common way of quantifying the variability of a distribution is to calculate its standard deviation. This method uses all the observations, by accounting for all deviations from the mean. By deviations we mean the differences between each observation and the mean. The standard deviation is a sort of average of all the deviations.

Mathematically, if each observation has a value $X_i$ (where i = 1 to n), then its distance from the mean value, $\bar{X}$, is $(X_i - \bar{X})$. With n observations we have n such distances. We could calculate the average of these distances by summing all the observed deviations and dividing by n:

Average deviation = $\frac{\sum (X_i - \bar{X})}{n}$

However, simply calculating the average deviation is not sufficient. In fact this equation will always give an average deviation of zero, because the positive deviations from the mean always exactly balance the negative deviations. What we are interested in is the magnitude of the deviations. If we square the deviations before summing them, we will always get a positive quantity. Dividing this sum by the number of observations minus one then gives a measure of the average squared deviation from the mean, known as the variance.

Variance, $S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$

Note: in this equation we use n - 1, not n, as the denominator, because we are estimating the population variance from a sample.

The problem with the variance is that it is squared, and so it is not in the same unit as the original data. For example, for the height of individuals it will be in square cm, which is a unit of area, not height. If we take the square root of the variance we get a measure of variability in the same units as the raw data. This quantity is called the standard deviation and tells us roughly the average distance of the observations in a dataset from the mean.

Standard deviation, $S = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$

Example: calculate the variance and standard deviation for the following data on the weight of 10 people in kg:

61, 75, 65, 58, 78, 82, 70, 72, 91, 77

First calculate the mean weight:

$\bar{X} = \frac{\sum X_i}{n}$ = (61 + 75 + 65 + 58 + 78 + 82 + 70 + 72 + 91 + 77)/10 = 729/10 = 72.9 kg

Then calculate the variance by the formula:

Variance, $S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}$


Variance = $\frac{(58-72.9)^2 + (61-72.9)^2 + (65-72.9)^2 + (70-72.9)^2 + (72-72.9)^2 + (75-72.9)^2 + (77-72.9)^2 + (78-72.9)^2 + (82-72.9)^2 + (91-72.9)^2}{9} = \frac{892.9}{9} = 99.2$ kg²

Then calculate the standard deviation by taking the square root of the variance:

$S = \sqrt{\text{variance}} = \sqrt{99.2} = 9.96$ kg

What does this mean? The standard deviation for the data was about 10 kg, meaning that, roughly speaking, a typical observation was about 10 kg away from the mean (either more or less than the mean).

How is normally distributed data spread out in relation to the standard deviation? For data that is normally distributed:

• About 68% of the data lies within 1 standard deviation of the mean
• About 95% of the data lies within 2 standard deviations of the mean
• About 99.7% of the data lies within 3 standard deviations of the mean

These proportions apply to all normal distributions, regardless of the total number of data values or the width of the distribution. The standard deviation therefore helps to summarize the distribution of the data, and it plays an important role in statistical data analysis.
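The proportions quoted for the normal distribution can be reproduced with the NormalDist class from Python's standard statistics module. This is a sketch for checking the figures, not part of the lecture, and the function name share_within is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
# within 1 SD: 68.3%
# within 2 SD: 95.4%
# within 3 SD: 99.7%
```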

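Returning to the weight example, the steps of the variance and standard deviation calculation can be retraced in Python. The variable names below are illustrative. The sketch also shows that the plain deviations sum to zero (up to floating-point rounding), which is why they are squared before averaging, and compares the hand calculation with the variance and stdev functions of the standard statistics module, both of which use n - 1 as the denominator.

```python
from math import sqrt
from statistics import stdev, variance

# Weights of the 10 people in kg
weights = [61, 75, 65, 58, 78, 82, 70, 72, 91, 77]

n = len(weights)
mean_weight = sum(weights) / n
print(round(mean_weight, 1))  # 72.9

# Deviations of each observation from the mean
deviations = [x - mean_weight for x in weights]

# Positive and negative deviations cancel out (zero up to rounding error)
print(abs(sum(deviations)) < 1e-9)  # True

# Sample variance: sum of squared deviations divided by n - 1
sum_of_squares = sum(d ** 2 for d in deviations)
sample_variance = sum_of_squares / (n - 1)
print(round(sample_variance, 1))  # 99.2

# Standard deviation: square root of the variance, back in kg
sample_sd = sqrt(sample_variance)
print(round(sample_sd, 2))  # 9.96

# The standard library functions agree with the hand calculation
print(round(variance(weights), 1), round(stdev(weights), 2))
```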