CHAPTER 4

VARIABILITY ANALYSES

Chapter 3 introduced the mode, median, and mean as tools for summarizing the

information provided in a distribution of data. Measures of central tendency are often useful to

researchers, but they do not yield a complete picture of the set of data. When used alone, there is

a danger that measures of central tendency may tend to distort or mislead rather than clarify the

meaning of the data. Measures of central tendency identify the “middle” of a distribution, but do

not provide any information about how broadly values are scattered around that central point.

This chapter introduces three common measures of variability: range, variance, and

standard deviation, which are commonly used to supplement the information provided by

measures of central tendency. Each of these statistical tools is used to evaluate the degree of

variation (sometimes referred to as spread or dispersion) within a distribution of data.

Why is it important to evaluate the degree of variability in the data in addition to

identifying the middle of the distribution? Consider a hypothetical example. A researcher

wishes to compare the income levels of residents in two countries identified as Country A and

Country B. The average income level for an individual in both countries is $20,000. Based on

the information presented thus far, it would appear that income levels, and standards of living,

should be approximately equal in each country. In our hypothetical example, however, Country A

has a small group of individuals who control virtually all of the resources of the country and

dominate a much larger group of individuals making up a lower class which lives on the brink of

starvation. Country B has a very different distribution of wealth within society. There is a

degree of variability in the income levels of citizens, but there are few who are extremely wealthy

and few who are extremely poor. Essentially, most individuals living in Country B have an

annual income that is within a few thousand dollars of the national average while the income

levels of most individuals living in Country A are either extremely high or extremely low.

Figure 4:1 provides a graphic illustration of the income distributions in both countries.

FIGURE 4:1
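The two-country comparison can be made concrete with a small sketch. The income figures below are invented purely for illustration (they do not come from the text) and are chosen so that both countries average $20,000. The names `country_a` and `country_b` are likewise hypothetical.

```python
from statistics import mean

# Hypothetical annual incomes in dollars, invented for illustration only.
country_a = [1_000] * 9 + [191_000]          # many very poor, one very rich
country_b = [18_000, 19_000, 19_500, 20_000, 20_000,
             20_000, 20_000, 20_500, 21_000, 22_000]  # clustered near the average

for name, incomes in (("A", country_a), ("B", country_b)):
    spread = max(incomes) - min(incomes)     # highest income minus lowest income
    print(f"Country {name}: mean = {mean(incomes):,.0f}, "
          f"highest minus lowest = {spread:,}")
```

Both countries report the same mean, yet the gap between the highest and lowest incomes differs enormously, which is exactly the information a measure of central tendency cannot convey.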

A review of the distributions of values presented in earlier chapters will demonstrate the

tendency for the values within frequency distributions to vary substantially. Distributions of data

differ in shape, dispersion, and a variety of other factors that may not be accurately reflected

when the only statistics employed are measures of central tendency. If there were no dispersion

within a distribution of incomes, everyone would have the same income. Everyone would be the average.

Examination of statistical data in almost any context clearly demonstrates that this is not the case.

Observed values tend to differ from case to case with some varying a great deal from the average

while others fall at or near that position within the overall distribution of data. The distance

between a particular observed value and the mean of the overall distribution is referred to as

deviation and is represented by the formula $X - \bar{X}$, where X represents the

specific observed value and $\bar{X}$ represents the mean of the distribution.

The presence of substantial deviation from the mean among individual observations is

especially noticeable in the context of research involving issues such as income. In a national

distribution of incomes taken at random, researchers will encounter those who are extremely rich,

extremely poor, and average. Of course, there are also those whose incomes fall somewhere

between these groups. In this example, it can be said that the incomes are

dispersed or scattered throughout the distribution. The presence of dispersion in the data has

the effect of reducing the degree to which measures of central tendency can be considered

representative. If few of the actual values in the distribution are anywhere near the points

identified as the mean, median, or mode, the ability of these statistics to describe the data in a

distribution will be weakened dramatically.

Range (R), variance ($s^2$), and standard deviation (s) are the primary measures of

variability discussed in this chapter. The simplest and easiest of these measures is range (R).

Range represents the difference between the largest and the smallest value in the distribution. It

is easy to compute because it merely requires the researcher to subtract the smallest value in the

distribution from the largest. The range is only a rough measure of variability at best. The major

problem with the range is that either of the values used to calculate it may be extreme values or

outliers. For example, in a distribution of values 40, 50, 60, 70, 80, 90, 200, the range (R) is

200 - 40 = 160. On the other hand, for a distribution of values 40, 50, 60, 70, 80, 90, the range

is 90 - 40 = 50. Adding a single value (200) to the distribution causes the range to change

dramatically. Because range is determined entirely by the two most extreme values in the

distribution, it provides no information about the dispersion of any of the other values included

within the distribution of data. The presence of this limitation makes it impossible to use range

as more than a generalized estimate of the variability present in a distribution. Despite the crude

nature of this statistic, range can be applied in some contexts. For example, stock brokers very

frequently refer to the fact that a stock like IBM has had a price range from $70 to $135 per share

in one year. The range is depicted daily in the stock market report as the high and low prices for

the year. Investors may be more interested in the range than other statistics. If the range of a

distribution is small, the group is said to be more homogeneous. However, if the range for a

distribution is large, the group is more heterogeneous. The homogeneous group has fewer

extreme values while the heterogeneous group has extreme values at one end of the distribution

or both.
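The range calculation, and its sensitivity to a single outlier, can be sketched in a few lines. The helper name `value_range` is an assumption introduced here for illustration; the two data sets are the ones used in the text.

```python
def value_range(values):
    """Return the range: the largest value minus the smallest value."""
    return max(values) - min(values)

with_outlier = [40, 50, 60, 70, 80, 90, 200]
without_outlier = [40, 50, 60, 70, 80, 90]

print(value_range(with_outlier))     # 160
print(value_range(without_outlier))  # 50
```

Only the two extreme values enter the calculation, so the five or six values in between could be rearranged freely without changing the result.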

When researchers need to develop more precise measures of dispersion, they almost

always employ variance and standard deviation. These statistics provide information about the

level of dispersion within a distribution of data by evaluating the deviation of each value from

the mean of the overall distribution. This information is then used to compute variance and

standard deviation which are designed to produce a statistic representing an average deviation

from the mean. The reaction of most students to this concept is that it should be relatively easy

to compute the average deviation from the mean by merely summing all of the deviations from

that point and dividing by the number in the distribution. Unfortunately, the “easy” way does not

work in this context. Such a strategy is doomed to failure because the sum of all deviations from

the mean will always be zero. Some values will fall below the mean of the distribution,

producing a negative value for deviation. Others will be located above the mean, producing a

positive value. Because the mean represents the center of the distribution, the result of this

process will be a value of zero which cannot help the researcher to evaluate the level of

dispersion present within the data.
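A quick check illustrates why the "easy" approach fails. This sketch reuses the six values from the range example; any other data set would show the same behavior, because deviations from the mean always cancel.

```python
from statistics import mean

values = [40, 50, 60, 70, 80, 90]
x_bar = mean(values)                       # 65

# Deviation of each value from the mean: negative below, positive above.
deviations = [x - x_bar for x in values]

print(deviations)        # [-25, -15, -5, 5, 15, 25]
print(sum(deviations))   # 0
```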

Variance and standard deviation address the problem created by the fact that deviations

from the mean always sum to zero by squaring the value of each deviation to eliminate the

positive or negative sign carried with it. Variance ($s^2$) represents the average of the squared

deviations from the mean. It is calculated by determining the deviation of each observed value

from the mean. Each of these values is then squared. The squared values are added together to

produce a sum of squared deviations. This sum of squared deviations is then divided by the

number of cases in the distribution to produce an average squared deviation from the mean. The

formula for calculating variance of a population is:

$$s^2 = \frac{\sum (X - \bar{X})^2}{N}$$

where:

$s^2$ = variance

$\sum (X - \bar{X})^2$ = the sum of all squared deviations from the mean

N = the number of cases or observations in the distribution

The formula above is used for the calculation of variance when a researcher has observed

values for the entire population. When research is focused on a sample, or representative subset,

of the population, a slightly modified form of the formula is employed. The formula for the

variance of a sample is: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$. The only change to the formula is in the

denominator. The upper case N (which represents the population size) is replaced with a lower

case n (which represents the sample size). The formula for variance of a sample also requires the

researcher to subtract one from the size of the sample. This has the effect of increasing the value

of variance slightly and compensates for the tendency of a sample to understate the variability of its population.

For this class, the formula for the variance of a sample will always be employed when

performing calculations.
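The difference between the two denominators can be seen side by side. The sketch below is only an illustration: it reuses the six values from the range example (not data the text uses for variance) and cross-checks the hand calculation against Python's standard `statistics` module.

```python
import statistics

values = [40, 50, 60, 70, 80, 90]
x_bar = sum(values) / len(values)                  # mean = 65.0

# Sum of squared deviations from the mean.
sum_sq = sum((x - x_bar) ** 2 for x in values)     # 1750.0

population_variance = sum_sq / len(values)         # divide by N
sample_variance = sum_sq / (len(values) - 1)       # divide by n - 1

print(round(population_variance, 2))   # 291.67
print(sample_variance)                 # 350.0

# The standard library implements the same two formulas.
assert abs(population_variance - statistics.pvariance(values)) < 1e-9
assert abs(sample_variance - statistics.variance(values)) < 1e-9
```

Dividing by n - 1 always yields the larger of the two results, and the gap shrinks as the sample grows.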

The principal difficulty with the concept of variance lies with interpretation of results.

Variance is expressed in squared units which often make little sense to the researcher or observer

of the research process. For example, the meaning of a result indicating a variance of four

thousand squared dollar units is not very easy to understand and interpret. In addition, the

process of squaring each deviation tends to increase the apparent distance between each value

and the mean of the distribution.

Standard deviation (s) is closely related to variance and seeks to alleviate some of the

difficulties of interpreting it. The only difference between the two is that the process of

calculating standard deviation adds a final step of taking the square root of the variance. This

step is designed to partially reverse the effect produced by squaring all deviations from the mean

and to convert the result of the process of calculation back into more meaningful units. The

formula for the standard deviation (s) of a sample is: $s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$. As you can see,

the only difference in the formulas is the addition of the square root sign in the formula for

standard deviation.
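The final square-root step can be sketched as follows. The six values are again borrowed from the range example purely for illustration; if they were dollar amounts, the variance would be in squared dollars and the standard deviation back in dollars.

```python
from math import sqrt
import statistics

values = [40, 50, 60, 70, 80, 90]
n = len(values)
x_bar = sum(values) / n

# Sample variance (squared units), then its square root (original units).
sample_variance = sum((x - x_bar) ** 2 for x in values) / (n - 1)   # 350.0
s = sqrt(sample_variance)

print(round(s, 2))   # 18.71

# statistics.stdev uses the same n - 1 denominator.
assert abs(s - statistics.stdev(values)) < 1e-9
```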

Standard deviation describes the overall characteristics of a group or distribution of

values. If a group is extremely heterogeneous, so that the raw values vary widely around their

mean, the standard deviation will be large. If the values are tightly clustered around the mean, the

standard deviation will be small and the group can be best described as homogeneous. In a

perfectly homogeneous group, the standard deviation is zero because there is no variance. All of

the values in the distribution are the same; therefore, everyone is the average or mean. None of

the values in the distribution differ from the typical value or mean. Later in the text, standard

deviation will be used as the unit of measurement along the baseline of a normal curve.

The process of calculating the variance (and standard deviation) for a distribution of

values can be simplified using a solution matrix. The first step is to organize the data in the

matrix and calculate the mean of the distribution in the same manner that has been employed in

previous chapters. A new column is then added to the matrix which is labeled $(X - \bar{X})$. The

values for this column are computed by taking each raw value in the distribution and subtracting

the distribution mean from it. A second new column is then added to the solution matrix and

labeled $(X - \bar{X})^2$. Values for this column are calculated by squaring each of the values for

deviation calculated in the previous column. A final new column is added and labeled

$f(X - \bar{X})^2$. This represents the squared deviation for each value multiplied by the frequency

with which that value occurs within the distribution. The values for the final column are then summed

and divided by N for a population (or by n - 1 for a sample) to yield variance. Standard deviation is

calculated by simply taking the square root of the variance.

To illustrate the sequential statistical steps for finding the standard deviation of a

population with a distribution of values, a standard deviation will be obtained for the following

data. X = 10, 10, 20, 20, 20, 30, 30, 30, 30, 40, 40, 40, 50, 50, 60, 60. Each individual sequential

step is a very logical progression and is clearly identifiable in Figure 4:2. This solution matrix

has a series of columns. These columns are used to assist one in identifying the specific

calculations required to solve for standard deviation.

FIGURE 4:2

X     f    fX     cf    $X - \bar{X}$    $(X - \bar{X})^2$    $f(X - \bar{X})^2$
60    2    120    16        26.25             689.06               1378.12
50    2    100    14        16.25             264.06                528.12
40    3    120    12         6.25              39.06                117.18
30    4    120     9        -3.75              14.06                 56.24
20    3     60     5       -13.75             189.06                567.18
10    2     20     2       -23.75             564.06               1128.12

The standard deviation for this distribution, which is the square root of the variance, is

15.36. This suggests that values typically deviate from the mean by approximately 15.36 units. To

calculate the values that fall 15.36 units above and below the mean of 33.75, the following procedure is

used:

$\bar{X} - s = 33.75 - 15.36 = 18.39$

$\bar{X} + s = 33.75 + 15.36 = 49.11$

The range of values that fall within one standard deviation unit of the mean (from -1s to

+1s) is 18.39 to 49.11. The size of the standard deviation suggests that this group is somewhat

heterogeneous.
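The solution matrix in Figure 4:2 can be reproduced with a short sketch. The dictionary name `freq` and the other variable names are assumptions made for illustration; the values and frequencies are those shown in the figure, and the data are treated as a population (dividing by N), as in the worked example.

```python
from math import sqrt

# Frequency table from Figure 4:2, written as value -> frequency.
freq = {60: 2, 50: 2, 40: 3, 30: 4, 20: 3, 10: 2}

n = sum(freq.values())                           # number of cases: 16
sum_fx = sum(x * f for x, f in freq.items())     # sum of the fX column: 540
x_bar = sum_fx / n                               # mean: 33.75

# Sum of the final column: frequency times squared deviation.
sum_sq = sum(f * (x - x_bar) ** 2 for x, f in freq.items())

variance = sum_sq / n                            # population formula (divide by N)
s = sqrt(variance)

print(x_bar)                                     # 33.75
print(round(s, 2))                               # 15.36
print(round(x_bar - s, 2), round(x_bar + s, 2))  # 18.39 49.11
```

The unrounded column total is 3775.0. Adding the rounded entries printed in the figure gives 3774.96, a difference too small to change the standard deviation of 15.36.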

The distribution of values used for the preceding example represented observations for

the entire population. Likewise, the calculations used in that example, which divide by N, are

designed for instances in which researchers are working with the entire population of cases in a

particular area of study. In those cases in which the values used to calculate variance and

standard deviation involve a sample rather than the entire population, a minor modification to the

formulas for these two statistics is necessary. The formula for the variance and standard

deviation of a sample is modified by replacing "N" with "n-1". Since all of the values for a

particular characteristic of a population are rarely available to the researcher, most of the

calculations will be for sample variance and sample standard deviation, which use "n-1" as a part of the

formulas. Dividing by "n-1" removes some of the bias that arises when a sample is used to

estimate the variability of a population. When this is done, the standard deviation of the sample, which is an

estimate of the unknown population standard deviation, is generally

considered to be more accurate.

It has been suggested in this chapter that the greater the variability around a mean, the

larger the standard deviation is likely to be. The normal (or bell) curve is one of the most useful

tools in understanding the meaning of this statistical concept. The basic assumption of the

normal curve is that the characteristic of the population is normally distributed within the

population. The word "normal" only means that this distribution is frequently found. Data from

a wide variety of sources have similar frequency distributions. The normal curve is a theoretical

distribution developed by statisticians during the eighteenth and nineteenth centuries. For these

distributions a normal or bell-shaped curve generally is the result. If the characteristic is not

normally distributed, the curve may instead be positively or negatively skewed. The normal curve

is illustrated in Figure 4:3.

FIGURE 4:3

NORMAL FREQUENCY-DISTRIBUTION CURVE

The normal curve is a symmetrical distribution with a certain set of characteristics. As the figure

above indicates, 68.26% of values within a “normal” distribution are located within one standard

deviation unit of the mean (-1s to +1s). 95.44% of values fall within two standard deviation units

of the mean (-2s to +2s) in a normal distribution and 99.74% of cases fall within three standard

deviation units (-3s to +3s) of the mean. These represent defining characteristics of a normal

distribution and are always present when the data conform to the normal curve. Although these may seem

to be very simple concepts at this point, understanding them is critical to future efforts in this

class. Several later concepts such as probability rely heavily on assumptions about the normal

curve and make it imperative that students work to grasp this concept at an early stage in the

learning process.
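The percentages attached to the normal curve can be checked numerically. The sketch below relies on a standard identity that the text does not state: the share of a normal distribution lying within k standard deviations of the mean equals erf(k / √2). The function name `share_within` is an assumption introduced for illustration.

```python
from math import erf, sqrt

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k}s of the mean: {share_within(k):.2%}")
# within 1s of the mean: 68.27%
# within 2s of the mean: 95.45%
# within 3s of the mean: 99.73%
```

These results differ from the 68.26%, 95.44%, and 99.74% quoted above only in the last digit. The quoted figures are the kind obtained by doubling entries from a printed normal-curve table rounded to four decimal places.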

An additional statistic associated with the concept of standard deviation is the z score or

the standard score. The z score measures how many standard deviation units an individual raw

value is from its mean. The sign of each z score also indicates the direction that any given raw

value deviates from the mean of a distribution on a scale of standard deviation units. A positive z

score indicates that a value is higher than the mean while a negative z score indicates that the

score falls below the mean value. For example, a z score of plus 2.4 indicates that the raw score

falls 2.4 standard deviations above the mean. The z score permits one to determine how

individual values relate to the overall distribution and the overall mean. The z score is obtained

by calculating deviation for a particular value (distance from the mean) and dividing the result by

the standard deviation as indicated in the following formula:

$$z = \frac{X - \bar{X}}{s}$$

Suppose a researcher had a sample of major crimes committed in New York City during a 24-hour

day. The mean number of crimes committed was 200 per day with a standard deviation of 50.

Assuming the number of crimes committed per day is normally distributed, one could translate the number of crimes

for a particular day, 100, into a z, or standard, score in the following manner:

$$z = \frac{X - \bar{X}}{s} = \frac{100 - 200}{50} = -2$$

The number of crimes committed in New York City on a particular day (100) is two standard

deviation units below the mean of 200 crimes per day. How many standard deviations relative to the

mean is 300 crimes committed in New York City on a particular day? It is evident from this illustration

that the z score is an excellent way to measure the relationship between the crime on an individual day

(X) and the average number ($\bar{X}$) of crimes committed per day for the entire year. As Figure 4:4

indicates, each standard deviation is 50 crimes.
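The z score calculation for the crime example can be sketched as a small function. The name `z_score` and its parameter names are assumptions made for illustration; the mean of 200 and standard deviation of 50 come from the example.

```python
def z_score(x, mean, sd):
    """Number of standard deviation units that x lies from the mean."""
    return (x - mean) / sd

print(z_score(100, 200, 50))   # -2.0  (two standard deviations below the mean)
print(z_score(300, 200, 50))   # 2.0   (two standard deviations above the mean)
print(z_score(200, 200, 50))   # 0.0   (exactly at the mean)
```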

FIGURE 4:4

By way of review, it should be noted that if the z score is a negative value, it is below the mean;

a positive value is above the mean. A z score of zero is equal to the mean of the distribution. These

ideas are illustrated in Figures 4:5 and 4:6.

FIGURE 4:5

TWO STANDARD DEVIATIONS ABOVE THE MEAN

To determine the dispersion within a distribution of values, calculate the range, variance and standard deviation. Using these descriptive statistics, the relative homogeneity and heterogeneity of the values in the distribution can be determined.

FIGURE 4:6

TWO STANDARD DEVIATIONS BELOW THE MEAN

From consulting the standard normal curve, what percentage of the values fall between plus two and

minus two standard deviation units from the mean? One finds that 95.44% of the values fall within this

range of values.

In this chapter, range, variance, and standard deviation, which are measures of variability, have

been discussed. The range is a quick but rough indicator of variability which is easily found by

calculating the difference between the highest and lowest values of a distribution. The variance of a

distribution is the average of the square of the differences of the individual values from their mean. The

standard deviation is the positive square root of the variance. In standard deviation, one has a reliable

measure of variability which can be used for more complex sequential statistical steps that will be

explored in other chapters of the text. In this chapter, the student was also introduced to z scores as they

relate to the normal curve. A z score expresses how far a raw value deviates from its mean in standard

deviation units. The full meaning of z scores and the normal curve will also be explored further in the

next chapter on probability. One should now review the sequential statistical steps for finding standard

deviation and z scores which appear at the end of this chapter and provide a summary of these concepts.

A major summary idea:

SEQUENTIAL STATISTICAL STEPS

FINDING STANDARD DEVIATIONS AND z SCORES

Step 1: What is the first step in analyzing the data? Create a solution matrix. Be sure to account for all of the columns.

Step 2: What is the cumulative frequency of the distribution? What is the total number of values in the distribution?

Step 3: What is the sum of the frequencies times each raw value in the distribution? Add each of the values in the distribution.

Step 4: What is the mean of the distribution? Divide the sum of the values ($\sum fX$) by the number (n) of values in the distribution.

Step 5: How far does each raw value (X) vary or deviate from the mean? Subtract the mean from each value.

Step 6: What is the square of each of the deviation scores? Square each of the deviation scores in that column of the solution matrix.

Step 7: What are the frequencies times the squared deviation scores? Multiply the squared deviation score by the number of times the value occurs in the distribution.

Step 8: What is the sum of the deviation scores squared? Add the values in the frequencies times the deviation scores squared column.

Step 9: What is the variance of the distribution? Divide the sum of the deviation scores squared by n - 1 for a sample.

Step 10: What is the standard deviation of the distribution? Take the square root of the variance.

Step 11: What is the z score of any value in the distribution? Subtract the mean from the value and divide by the standard deviation.
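The eleven steps above can be sketched as a single function. This is only an illustration under stated assumptions: the function name `describe`, its `sample` flag, and the dictionary layout are hypothetical, and the data are the frequency table from Figure 4:2, treated here as a sample so that Step 9 divides by n - 1.

```python
from math import sqrt

def describe(freq, sample=True):
    """Follow the sequential steps for a {value: frequency} table."""
    n = sum(freq.values())                                   # Step 2: total number of values
    sum_fx = sum(x * f for x, f in freq.items())             # Step 3: sum of f times X
    mean = sum_fx / n                                        # Step 4: mean
    # Steps 5-8: deviation, squared deviation, times frequency, then summed.
    sum_sq = sum(f * (x - mean) ** 2 for x, f in freq.items())
    variance = sum_sq / (n - 1 if sample else n)             # Step 9: variance
    sd = sqrt(variance)                                      # Step 10: standard deviation
    z = {x: (x - mean) / sd for x in freq}                   # Step 11: z score per value
    return mean, variance, sd, z

# Step 1: organize the data as a frequency table (values from Figure 4:2).
mean, variance, sd, z = describe({60: 2, 50: 2, 40: 3, 30: 4, 20: 3, 10: 2})
print(mean, round(variance, 2), round(sd, 2))   # 33.75 251.67 15.86
```

Because the data are treated as a sample here, the standard deviation (15.86) is slightly larger than the population value of 15.36 obtained in the worked example. Passing `sample=False` reproduces the population result.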

X f

500 1

450 2

400 5

390 6

300 8

210 4

150 3

55 2

30 1

EXERCISES - CHAPTER 4

(1) Define the following terms:¹

(A) mean (B) median (C) mode (D) range (E) variance (F) standard deviation

(2) Treat these data as sample data: X = 2, 4, 5, 6, 7, 8, 10, 11, 12, 15, 20. Find the mean, variance, standard deviation, z score for 3, z score for 9 and the median. Be sure to lay out in matrix format.

(3) Treat these data as sample data: X = 26, 28, 28, 35, 35, 35, 36, 36, 36, 36, 39, 39, 39, 39, 39, 40, 40, 43, 43, 50, 50, 50, 55, 59, 60. Find the mean, variance, standard deviation, mode and skew. Be sure to lay out in matrix format.

(4) Treat these data as sample data: X = 62, 61, 58, 56, 54, 51, 50, 48, 43, 41, 32, 21, 17, 13, 10, 5. X is a sample of scores on a statistics exam. What is the average score? What is the standard deviation? How many standard deviation units is 61 above the mean and 17 below the mean? What is the median? What is the range?

(5) Find the mean, median, mode, standard deviation, and z scores for 325 and 50. Draw a frequency polygon for these data. Show all work. What is the skew?

¹ Some of these terms are in previous chapters.

x f

200-210 1

100-199 4

50-99 5

30-49 2

10-29 2

1-9 1

(6) Treat these data as sample data: X = 40, 42, 38, 33, 22, 21, 15, 10, 10, 6, 5, 3, 2. Find the mean, variance, standard deviation, z score for 7, z score for 9 and the median. Be sure to lay out in matrix format and show all work.

(7) Create a solution matrix for each of the following and calculate the mean, mode, median, variance and standard deviation.

(A) $X_1$ = 25.60, 28.36, 42.21, 58.26, 62.21, 75.36, 85.31, 92.50, 107.68, 110.70

(B) $X_2$ = 5, 10, 10, 10, 10, 20, 20, 20, 20, 20, 30, 30, 40, 40, 40, 40, 50, 50, 50, 60, 60

(C) $Y_1$ = 90.10, 75.60, 60.30, 50.20, 45.60, 30.70, 20.30, 10.20, 6.40, 5.10

(D) $Y_2$ = 100, 90, 90, 80, 80, 80, 80, 70, 70, 70, 70, 70, 60, 50, 50, 40, 40, 30, 20, 10

(E) $X_3$ = 130, 125, 120, 115, 110, 100, 90, 65, 40, 30, 20, 10, 5, 4

(F) $X_4$ = 6-8, 6-8, 9-11, 12-14, 12-14, 15-17, 15-17, 15-17, 18-20, 18-20, 21-23, 21-23, 24-26, 24-26, 27-29

(G) $Y_3$ = 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 11, 12

(H) $Y_4$ = 50-48, 50-48, 51-53, 47-45, 47-45, 47-45, 44-43, 44-43, 44-43, 44-43, 42-40, 42-40, 39-37, 39-37, 36-34

(I) $X_5$ = 11, 10, 10, 10, 9, 9, 9, 8, 8, 8, 8, 8, 7, 7, 7, 7, 7, 6, 6, 5, 5, 4, 4, 3

(J) $Y_5$ = 26, 30, 31, 42, 51, 51, 62, 62, 62, 62, 70, 70, 70, 70, 81, 81, 81, 92, 92, 102, 105, 107, 111, 113, 121

(8) Find the mean, mode, median, z score for 55 and 150, and percentile rank for 60.
