
All rights reserved. © Kevin Olding 2013

Statistics 1

Data Presentation and Measures of Central Tendency and Dispersion - MEI

Kevin Olding

Contents

1. Types of data
2. Frequency tables for grouped and ungrouped data
3. Pie Charts, Bar Charts and Vertical Line Charts
4. Histograms
5. Stem and leaf diagrams
6. The Range and the Median
7. Quartiles, and the Inter-Quartile Range
8. Box and whisker plots (Boxplots), Outliers (first definition)
9. Cumulative frequency, cumulative frequency curves, percentiles
10. Reading off the median, quartiles and percentiles from a cumulative frequency curve
11. Mean, mean square deviation, root mean square deviation
12. Variance and standard deviation, outliers (second definition)
13. Calculating the mean and variance from an ungrouped frequency table
14. Estimating the mean and median from a grouped frequency table
15. Linear Coding
16. Advantages and Disadvantages of the mean, median, mode and midrange
17. Skewness


1. Types of data

There are three types of data:

1. Qualitative data, which consists of descriptions using names

Head or Tail

Black or White

Labour, Conservative or Liberal Democrat

2. Discrete data, which consists of numerical values in cases where we can make a list of the possible

values. Often this list can be very short, for example the outcomes of the roll of a die, {1,2,3,4,5,6}.

Sometimes it will be longer and can potentially be infinite, for example the number of Tails that appear

before the first Head when I repeatedly toss a coin has outcomes {0,1,2,3,...}.

3. Continuous data, which consists of numerical values in cases where it is not possible to make a list of all possible outcomes, for example measurements of physical quantities such as weight, height and time.

Note 1: Continuous data can often appear discrete, for example because the limitations of measuring instruments force us to round to, say, the nearest 10 grams.

Note 2: Discrete data can sometimes appear continuous. For example, amounts of money feel like they could be

continuous but usually we only think in terms of multiples of 1p. Sometimes we might have fractions of 1p, for

example in a financial transaction but there is always a limit to the subdivisions allowed. In fact we often treat such

data as continuous simply because it is convenient to do so.

Note 3: It could be argued that there really is no such thing as continuous data, as there is always a theoretical limit on the accuracy of a measurement. For example, a measurement of the mass of an object might ultimately be limited to multiples of some part of the object at the atomic level. We leave it to the philosophers to debate such issues. For us, when treating data as continuous is useful we will do so.

2. Frequency tables for grouped and ungrouped data

Ungrouped data:

The table below describes the A‐level results of a group of students.

Number of A grades Frequency

4 62

3 38

2 12

1 7

0 1

From this table we can see that the most common number of A grades was 4, with 62 students getting 4 As. We say

that the mode is 4 or that 4 is the modal value.
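As a quick check of this reading, here is a minimal Python sketch that picks out the mode from the frequency table above. The variable names are our own and purely illustrative.

```python
# Frequency table from the notes: number of A grades -> number of students.
frequencies = {4: 62, 3: 38, 2: 12, 1: 7, 0: 1}

# The mode is the value with the largest frequency.
mode = max(frequencies, key=frequencies.get)
total_students = sum(frequencies.values())

print("Mode:", mode, "with frequency", frequencies[mode])
print("Total number of students:", total_students)
```

Note that the mode is the value (4), not its frequency (62).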


Grouped data:

Sometimes it can be useful to group data into classes, especially when there are many different values. The

information becomes more concise than the raw data, but the disadvantage is that the original data has been lost.

Example: FTSE 100 Share prices

Price of share Frequency

0‐49p 7

50p – 99p 9

£1‐£1.99 28

£2‐£3.99 42

£4+ 14

Class boundaries

Sometimes it is obvious which class a piece of data falls into. For example a share of price £1.45 fits into the third

class above. But what about a share that is trading at 49.5p? (Shares do trade at non‐integer values!)

There is no correct answer to this question. If we are designing a frequency table ourselves we should be careful to write the boundaries without ambiguity; for example, we could have written $0 \le x < 50$ and $50 \le x < 100$ pence for the first two classes. If we are presented with an ambiguous table that someone else has constructed, we just have to make the best guess we can as to what their intentions were.

How to manipulate statistics 1

Statistics can be used to clarify, describe and summarise data in a useful and informative way. They can also be used to deceive, or be presented in a way which is misleading.

The following tables describe the wages paid to workers in two different factories which each employ 150 people.

Which would you rather work in?

The Fair Factory with Fun‐loving Foremen

£ per hour Frequency

0‐£7.50 13

£7.51‐£50 137

The Sad Store of Selfish Slave-drivers

£ per hour Frequency

0‐£8.50 133

£8.51‐£50 17


In fact both factories are the same. Here is the ungrouped data:

£ per hour Frequency

7 13

8 120

10 15

25 2
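To see how the same data produces both tables, the following Python sketch regroups the ungrouped wages using each factory's class boundary. The function name `group` and its arguments are our own, chosen only for illustration.

```python
# Ungrouped data from the notes: wage in pounds per hour -> number of workers.
wages = {7: 13, 8: 120, 10: 15, 25: 2}


def group(freq_by_value, boundary):
    """Split a frequency table into 'up to boundary' and 'above boundary'."""
    low = sum(f for value, f in freq_by_value.items() if value <= boundary)
    high = sum(f for value, f in freq_by_value.items() if value > boundary)
    return low, high


fair_factory = group(wages, 7.50)  # classes 0-7.50 and 7.51-50
sad_store = group(wages, 8.50)     # classes 0-8.50 and 8.51-50

print("Boundary at 7.50:", fair_factory)
print("Boundary at 8.50:", sad_store)
```

Moving the boundary by just £1 moves the 120 workers on £8 per hour from the upper class to the lower class, which is what changes the impression given.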

3. Pie Charts, Bar Charts and Vertical Line Charts

You are expected to be able to construct and interpret these basic diagrams for Statistics 1, but they are not

included in these notes. Please ask your teacher if there is anything you are unsure about.

4. Histograms

Grouped data can be displayed in a histogram. For example, consider the data below which shows the

marks gained by a group of students in an examination:

Mark ($x$ %)         Frequency
$0 \le x < 30$       4
$30 \le x < 50$      12
$50 \le x < 70$      37
$70 \le x < 100$     14

The data has been grouped into intervals or classes and we can calculate the class widths by comparing the endpoints. For example, the class $50 \le x < 70$ has width $70 - 50 = 20$. The frequency density is then calculated as

$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$

Mark ($x$ %)         Frequency    Class width    Frequency density (students/mark)
$0 \le x < 30$       4            30             $4/30 \approx 0.13$
$30 \le x < 50$      12           20             $12/20 = 0.6$
$50 \le x < 70$      37           20             $37/20 = 1.85$
$70 \le x < 100$     14           30             $14/30 \approx 0.47$

Now we can draw the histogram, which will have the Mark on the horizontal axis and Frequency Density on

the vertical axis

Histograms can seem similar to bar charts, but there are some important differences.

In a histogram:

a. The vertical axis is always labelled "Frequency Density"

b. There are no gaps between the bars


c. The area of each bar is proportional to the frequency that it represents.

The highest bar in the histogram represents the interval $50 \le x < 70$ and this is called the modal class. Note that this interval has the highest frequency density. In this example it also has the highest frequency, but this need not be the case. It is the frequency density that matters.

Because the frequency density was calculated as

$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$

if we have a histogram and we want to retrieve the frequencies for each class we can rearrange this to give

$\text{Frequency} = \text{Frequency density} \times \text{Class width}$

For example, the frequency of the class $30 \le x < 50$ above is $0.6 \times 20 = 12$, as we can check from the table.
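The frequency density calculation, the modal class and the reverse calculation can be checked with a short Python sketch. The tuple layout `(lower, upper, frequency)` and the function name are our own assumptions for illustration, not part of the course notation.

```python
# Each class is (lower boundary, upper boundary, frequency), taken from the table.
classes = [(0, 30, 4), (30, 50, 12), (50, 70, 37), (70, 100, 14)]


def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)


densities = [frequency_density(*c) for c in classes]

# The modal class is the one with the highest frequency density.
modal_class = max(classes, key=lambda c: frequency_density(*c))

# Going back the other way: frequency = frequency density x class width.
recovered = [round(d * (upper - lower))
             for d, (lower, upper, _) in zip(densities, classes)]

for (lower, upper, f), d in zip(classes, densities):
    print(f"{lower}-{upper}: frequency {f}, density {d:.2f} students/mark")
print("Modal class:", modal_class[0], "to", modal_class[1])
```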

Note: Frequency density has units which should be included on the histogram, here students per mark. If it is more convenient you could instead calculate students per 10 marks and alter the units accordingly. Only use an unusual unit if you are confident; it is usually best just to use the basic calculation.

[Figure: Histogram to show the marks of a group of students. Horizontal axis: Mark; vertical axis: Frequency Density (students/mark).]


Note 2: In this diagram, not all of the classes have the same width, but you could still construct a histogram in the same way if all of the classes do happen to have the same width.

How to manipulate statistics 2

Here is a histogram for another class of students. Which class do you think has done better overall based on these

histograms?

At first glance it looks like the second class has done a lot better than the first, with much higher frequency density

for the higher marks than the lower marks. But in fact both histograms are based on exactly the same data. The

only differences are:

a. In the second histogram the third and fourth classes have been combined to produce a $50 \le x < 100$ class with frequency $37 + 14 = 51$ and frequency density $51/50 = 1.02$.

b. The frequency density axis in the second histogram has been altered so that it goes up to 1.2 rather than 2. When the two histograms are compared side by side, this gives an impression of relatively higher frequency densities in the second.

Beware of cunning and devious (or simply lazy and ignorant) makers of graphs when reading newspapers as visual

representations of data can often be more misleading than they are helpful.

More examples of this sort of deception can be found in the excellent little book 'How to Lie With Statistics' by

Darrell Huff. You will be surprised how many examples of bad statistics you can easily find having read this.

Statistics and graphs are increasingly used to support arguments in professional contexts and honing your skills in

this way will give you a real advantage both in challenging incorrect examples and in not making the same mistakes

yourself.

[Figure: Histogram for the second class of students. Horizontal axis: Mark; vertical axis: Frequency Density, with the scale running up to 1.2.]


5. Stem and leaf diagrams

Here are the marks obtained by 20 students in a test out of 100:

68 45 22 14 92 55 58 53 78 71

16 39 80 42 72 72 88 31 12 89

It is sensible here to choose intervals 10‐19, 20‐29, 30‐39, ...., 90‐99 for this data and we can represent the entries by

a stem and leaf diagram. Grouping in tens is common in stem and leaf diagrams but you could choose other

groupings so long as your Key makes this clear. The first five entries above can be represented as:

Stem | Leaf
   1 | 4
   2 | 2
   3 |
   4 | 5
   5 |
   6 | 8
   7 |
   8 |
   9 | 2

Key: 2 | 2 = 22

And we can fill in the rest of the data to complete the diagram

   1 | 4 6 2
   2 | 2
   3 | 9 1
   4 | 5 2
   5 | 5 8 3
   6 | 8
   7 | 8 1 2 2
   8 | 0 8 9
   9 | 2

Key: 2 | 2 = 22

The words stem and leaf at the top are optional but the Key is a vital part of the diagram. It is also important that the numbers in the leaves are uniformly spaced out, as the diagram should give a visual representation of the data.

The diagrams above are unordered stem and leaf diagrams. Ordered stem and leaf diagrams are more useful and so usually the data is rearranged so that it is in ascending numerical order as follows:

   1 | 2 4 6
   2 | 2
   3 | 1 9
   4 | 2 5
   5 | 3 5 8
   6 | 8
   7 | 1 2 2 8
   8 | 0 8 9
   9 | 2

Key: 2 | 2 = 22

Sometimes we want to compare two sets of data and in this case we can form a back to back stem and leaf diagram.

Suppose I want to compare the class above to another, whose test results were:

25 23 13 42 59 18 21 32 44 10

50 44 48 32 25 14 14 68 18 15

To make the diagram, we simply add these data to the left hand side of the stem as follows:

        Class 2 |   | Class 1
  8 8 5 4 4 3 0 | 1 | 2 4 6
        5 5 3 1 | 2 | 2
            2 2 | 3 | 1 9
        8 4 4 2 | 4 | 2 5
            9 0 | 5 | 3 5 8
              8 | 6 | 8
                | 7 | 1 2 2 8
                | 8 | 0 8 9
                | 9 | 2

Key: 2 | 2 = 22

This diagram allows us to quickly compare the two sets of data ‐ we can see immediately that the second class has

not done as well.
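For larger data sets it can be helpful to build the ordered diagram by computer. The following is a minimal Python sketch, assuming stems in tens as in the diagrams above; the function name `stem_and_leaf` is our own.

```python
from collections import defaultdict

class_1 = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
           16, 39, 80, 42, 72, 72, 88, 31, 12, 89]
class_2 = [25, 23, 13, 42, 59, 18, 21, 32, 44, 10,
           50, 44, 48, 32, 25, 14, 14, 68, 18, 15]


def stem_and_leaf(values, stems=range(1, 10)):
    """Return {stem: ordered list of leaves}, grouping the values in tens."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    # Keep stems with no leaves so the diagram retains its shape.
    return {stem: leaves[stem] for stem in stems}


diagram_1 = stem_and_leaf(class_1)
diagram_2 = stem_and_leaf(class_2)

for stem, leaf_list in diagram_1.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaf_list))
print("Key: 2 | 2 = 22")
```

For a back to back diagram, the leaves in `diagram_2` would be printed in reverse order to the left of the stem.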

6. The Range and the Median

Let us consider again the first class of students from the section above. The ordered stem and leaf diagram

we produced earlier will now be useful to us so here it is again:

   1 | 2 4 6
   2 | 2
   3 | 1 9
   4 | 2 5
   5 | 3 5 8
   6 | 8
   7 | 1 2 2 8
   8 | 0 8 9
   9 | 2

Key: 2 | 2 = 22

The lowest score was 12 and the highest was 92. We could say that the data range from 12 to 92 and we define the range of the data as 92 minus 12, or 80.

Range = Maximum value - Minimum value

It is useful to be able to talk about the middle data point too, and this is called the median. If there were an extra piece of data there would be 21 pieces of data, and then the 11th would have 10 pieces of data below it and 10 above it, so we could say that this value was the median. Here we have an even number of pieces of data, so there is no exact middle value. The 10th and 11th data points are 55 and 58 and so the best we can do is to say that the median is the average of 55 and 58, or 56.5.

Warning! We must be very careful using the word 'average' as I have done above as it could mean either

the mean, the median or even the mode. The usual meaning of 'average' is the mean, which we will come

to again later, but be very cautious when you hear someone talk about the 'average' as people often use

whichever of the mean or median is most convenient for the point they are trying to make, especially on

television or radio shows where there is little opportunity to view their detailed calculations!

7. Quartiles, and the Inter‐Quartile Range

The median splits the data into a top half and a bottom half, but there is another half that is very interesting: the middle half. We are often interested in how spread out this most central half of the data is. This spread is called the Inter-Quartile Range, or IQR.

Before we can work out the IQR we must first calculate the Lower Quartile (LQ or $Q_1$) and the Upper Quartile (UQ or $Q_3$). The Upper Quartile is the median of the top half of the data, and the Lower Quartile is the median of the bottom half of the data.

So, returning to our example above, if we focus on the bottom half of the data we have 10 values and so the

Lower Quartile is the median of these.

   1 | 2 4 6
   2 | 2
   3 | 1 9
   4 | 2 5
   5 | 3 5 8
   6 | 8
   7 | 1 2 2 8
   8 | 0 8 9
   9 | 2

Key: 2 | 2 = 22

As there are 10 values, we must average the 5th and 6th values to give $LQ = \frac{31 + 39}{2} = 35$.

Similarly, the Upper Quartile is $UQ = \frac{72 + 78}{2} = 75$.

Summary: How to find the median

Odd number of pieces of data - take the middle value

Even number of pieces of data - take the average of the two middle values

We can then define the Inter‐Quartile Range as

Inter-Quartile Range = Upper Quartile - Lower Quartile

IQR = UQ - LQ

So in our example, the IQR is 75 ‐ 35 = 40.

Note: In case you are wondering about the notation $Q_1$ and $Q_3$ for the lower and upper quartiles, $Q_2$ is often used to denote the median.

Note: If we have an odd number of pieces of data, as in the data set below with 9 data points, we must make a decision. Here the median is 58 and when calculating the Lower Quartile (and the Upper Quartile) we have to choose whether to make the LQ the median of the first five pieces of data (including the median) or the first four pieces of data (excluding the median).

13 32 47 55 58 59 69 71 74

There is no statistical convention on this, and it is rarely important in practice, but for the sake of consistency we will choose to exclude the median for calculations. So, here the Lower Quartile would be based on 13, 32, 47, 55 and would be $LQ = \frac{32 + 47}{2} = 39.5$.
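The rules in this section can be collected into a short Python sketch. It follows the convention chosen in these notes (quartiles are the medians of the two halves, excluding the median itself when the number of data points is odd). Other textbooks and software packages use different conventions, so a calculator or spreadsheet may give slightly different quartiles. The function names are our own.

```python
def median(sorted_values):
    """Median of a list that is already in ascending order."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                          # odd: the middle value
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2  # even: average of the two middle values


def quartiles(values):
    """Return (LQ, median, UQ) using the 'median of each half' rule."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[len(data) - half:]   # excludes the median when n is odd
    return median(lower_half), median(data), median(upper_half)


marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]

lq, med, uq = quartiles(marks)
iqr = uq - lq
print("LQ =", lq, " median =", med, " UQ =", uq, " IQR =", iqr)

# The nine-value data set from the note on odd numbers of data points.
odd_example = quartiles([13, 32, 47, 55, 58, 59, 69, 71, 74])
print("Odd example (LQ, median, UQ):", odd_example)
```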

8. Box and whisker plots (Boxplots), Outliers (first definition)

Now that we know how to calculate the quartiles, we can create a five-point summary of our data and use it to draw another visual representation of the data. A five-point summary is just a list of the following five values. The numbers are from the same example as above. These values are useful since roughly one quarter of the data lie between each consecutive pair of values.

$Q_0$    Minimum           12
$Q_1$    Lower Quartile    35
$Q_2$    Median            56.5
$Q_3$    Upper Quartile    75
$Q_4$    Maximum           92


A box and whisker plot is constructed by first drawing a box from the Lower Quartile to the Upper Quartile with a vertical line to represent the median.

We then add the whiskers which go from the Lower Quartile to the Minimum and from the Upper Quartile to the

Maximum.

Having a scale on the horizontal axis is very important, but there is no vertical scale so it doesn't matter how fat or

thin your box is or how long any of the vertical lines are so long as the finished picture looks sensible.

There is one last component to the box and whisker plot, which is how we represent outliers. An outlier is a piece

of data which is far enough away from the centre to make us question whether or not it really is a valid data point.

For example, suppose that there was another piece of data in our list of marks, 155. We know that the test was out of 100 and so this cannot be a real piece of data and must have been put in the table as a result of a human error. It is probably really 15 or 55 but someone has mistyped it. Outliers are not always as obvious as this and so we have to devise some methods for identifying potential outliers. There are two tests and we will come to the second later. For now, we will define an outlier as follows:

Outlier - First Definition (IQR Definition)

An outlier is a piece of data which is more than one and a half Inter-Quartile Ranges below the Lower Quartile or above the Upper Quartile.

So we work out the IQR, here 40, and multiply it by 1.5 to give 60. Then a piece of data is an outlier if it is more than 60 above the UQ or more than 60 below the LQ. That is, it is an outlier if on the boxplot it would be more than 1.5 IQRs from the box. Here the boundaries for outliers would be

below $LQ - 1.5 \times IQR = 35 - 60 = -25$

or above $UQ + 1.5 \times IQR = 75 + 60 = 135$

and so we do not have any outliers in our dataset. If the value of 155 were included though, it would be above 135 and so would be an outlier. We would represent it on our boxplot with a cross, as follows:

Note: Of course, if 155 were included the five-point summary would also change to incorporate it. This diagram is just to illustrate how to draw the outlier if there is one.
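The first definition of an outlier can be sketched in Python as follows. As in the note above, the quartiles are held fixed at the values calculated without the mistyped mark, purely to illustrate the test; the function names are our own.

```python
def iqr_fences(lower_quartile, upper_quartile, k=1.5):
    """Return the (lower, upper) boundaries beyond which data count as outliers."""
    iqr = upper_quartile - lower_quartile
    return lower_quartile - k * iqr, upper_quartile + k * iqr


def iqr_outliers(values, lower_quartile, upper_quartile):
    """List the values lying outside the 1.5 x IQR boundaries."""
    low, high = iqr_fences(lower_quartile, upper_quartile)
    return [v for v in values if v < low or v > high]


marks = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
         16, 39, 80, 42, 72, 72, 88, 31, 12, 89]

fences = iqr_fences(35, 75)
outliers_original = iqr_outliers(marks, 35, 75)
outliers_with_error = iqr_outliers(marks + [155], 35, 75)

print("Boundaries:", fences)
print("Outliers in original data:", outliers_original)
print("Outliers with 155 included:", outliers_with_error)
```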

9. Cumulative frequency, cumulative frequency curves, percentiles

A group of students were asked about their journey times into school in the morning and the results were as follows:

Time ($x$ minutes)    Frequency
$0 \le x < 5$         24
$5 \le x < 10$        32
$10 \le x < 15$       48
$15 \le x < 20$       23
$20 \le x < 30$       18
$30 \le x < 45$       10
$45 \le x < 60$       5

We might be interested in knowing how many students take less than 15 minutes to travel to school, or how many take less than 45 minutes, and the answers to these questions are called cumulative frequencies. So, for example, for the $15 \le x < 20$ category the Frequency is 23 and the Cumulative Frequency is the total of all the Frequencies in the first four categories, up to and including the $15 \le x < 20$ category. So here it is $24 + 32 + 48 + 23 = 127$. You can calculate the cumulative frequencies quickly by starting with the first category and repeatedly adding on the frequency for the next category.

Time ($x$ minutes)    Frequency    Cumulative Frequency
$0 \le x < 5$         24           24
$5 \le x < 10$        32           56
$10 \le x < 15$       48           104
$15 \le x < 20$       23           127
$20 \le x < 30$       18           145
$30 \le x < 45$       10           155
$45 \le x < 60$       5            160


Now we have calculated the cumulative frequencies for each category we can draw a cumulative frequency curve to

represent the data, which will show us the number of students who take less than or equal to a given time to travel

to school. We plot the cumulative frequencies against the values at the upper end of each category, so for example

the Cumulative Frequency of 127 is plotted at 20 because we know there are 127 students who take less than or

equal to 20 minutes to travel to school.
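The running total and the points to plot can be produced with a few lines of Python. This is only a sketch; it uses `itertools.accumulate` from the standard library, and the list names are our own.

```python
from itertools import accumulate

# Upper end of each class (minutes) and the class frequencies from the table.
upper_bounds = [5, 10, 15, 20, 30, 45, 60]
frequencies = [24, 32, 48, 23, 18, 10, 5]

# Repeatedly add on the next frequency to get the cumulative frequencies.
cumulative = list(accumulate(frequencies))

# Each cumulative frequency is plotted against the upper end of its class,
# starting from (0, 0) because no-one takes less than no minutes.
points = [(0, 0)] + list(zip(upper_bounds, cumulative))

print(cumulative)
print(points)
```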

We plot all of the cumulative frequencies and join them with a smooth curve to form the cumulative frequency curve.

Note, we also plot the point $(0, 0)$ here at the bottom left as we know that no-one takes less than no minutes to travel to school!

Warning! Cumulative frequency curves do not necessarily start at $(0, 0)$. If the first class had been $5 \le x < 10$ (i.e. we know that no student takes less than 5 minutes to travel to school), we would start the curve at $(5, 0)$ instead.

Rather than joining the points with a smooth curve, we could instead join them with straight lines. In this case the diagram is called a cumulative frequency polygon and nothing else is different.

There is no particular rule as to which to use and here you can see that it has made little difference to the resulting graph. If in doubt, draw a curve.

[Figure: Cumulative frequency curve showing journey times to school. Horizontal axis: Travel time; vertical axis: Cumulative Frequency.]

[Figure: Cumulative frequency polygon showing journey times to school. Horizontal axis: Travel time; vertical axis: Cumulative Frequency.]


10. Reading off the median, quartiles and percentiles from a cumulative frequency curve

We can use a cumulative frequency curve to approximate the median and lower and upper quartiles of the data.

The median is the middle value. Here there are 160 values, so the median should have roughly 80 of the pieces of data below it and 80 above it. Hence the median has a cumulative frequency of 80 and if we read off the value with a cumulative frequency of 80 we can approximate the median.

Note: The precise reader might suggest reading off 80.5 instead of 80, and this would in fact be slightly better, but

when we are dealing with cumulative frequency curves the number of pieces of data is usually large and since the

reading will only give us an approximation anyway it is common just to read off at 80 for convenience.

We can also read off the Lower and Upper quartiles in the same way, by looking for values with Cumulative

Frequencies one quarter and three quarters of the total number of pieces of data, so here 40 and 120 respectively.

We can also use a cumulative frequency diagram to read off percentiles in a similar way. For example, the 95th percentile is the value below which 95% of the pieces of data lie. The median then is the 50th percentile, the Lower Quartile is the 25th percentile and the Upper Quartile is the 75th percentile.
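Reading off a value from the graph can be imitated numerically. The Python sketch below assumes the points are joined with straight lines (a cumulative frequency polygon), so its answers may differ a little from readings taken off a hand-drawn smooth curve. The function name `read_off` is our own.

```python
# Points of the cumulative frequency polygon for the journey times data.
points = [(0, 0), (5, 24), (10, 56), (15, 104),
          (20, 127), (30, 145), (45, 155), (60, 160)]


def read_off(points, target_cf):
    """Return the time at which the polygon reaches the given cumulative frequency."""
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1:
            if c1 == c0:
                return x0
            # Linear interpolation along the straight line segment.
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("cumulative frequency is outside the range of the data")


total = points[-1][1]                         # 160 students
median = read_off(points, total / 2)          # cumulative frequency 80
lower_quartile = read_off(points, total / 4)  # cumulative frequency 40
upper_quartile = read_off(points, 3 * total / 4)   # cumulative frequency 120
percentile_95 = read_off(points, 95 * total / 100)  # cumulative frequency 152

print(median, lower_quartile, round(upper_quartile, 2), percentile_95)
```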


11. Mean, mean square deviation, root mean square deviation

We have now met two measures of 'central tendency' or 'average':

a. the mode, the most common value; and

b. the median, the middle value

and you are already familiar with the third, the mean, which is calculated by adding up all the data and dividing by the number of pieces of data you have. For example, the mean of 1, 4, 9, 16, 25 and 36 is

$\frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6} \approx 15.17$

The mean is often denoted by $\bar{x}$ (or $\bar{y}$ or $\bar{z}$ or similar; it's the bar that denotes the mean, the letter is arbitrary). We can write

$\bar{x} = \frac{\sum x}{n}$

where the $\Sigma$ (sigma) means 'sum', or 'add up all the...', and $n$ is the number of pieces of data. So to get the mean, we add up all the pieces of data, $x$, and divide by the total number of pieces of data.

We have also met some measures of dispersion (or spread), namely the range and the inter‐quartile range. We

now consider some other such measures based on the mean rather than the median.

For illustration, let us consider the following set of 10 data points:

5 12 45 89 123 158 232 288 314 404

Adding up the data tells us that $\sum x = 1670$ and so the mean is $\bar{x} = \frac{1670}{10} = 167$.

One way we might think about measuring spread is to look at the average of the differences between the data points and the mean. A quick consideration of this will make you realise that the positive and negative differences will cancel out, so this will not be useful. Some statisticians simply ignore the signs and consider the modulus (or absolute value) of the differences from the mean, but we will adopt the more common route of looking at the average squared difference from the mean.

For example, the data point 45 is $45 - 167 = -122$ from the mean, so its squared difference from the mean is $(-122)^2 = 14884$. For each piece of data $x$, the squared difference from the mean is $(x - \bar{x})^2$ and it is useful to introduce notation for the sum of these values:

$S_{xx} = \sum (x - \bar{x})^2$

Even with a calculator, this could be quite time consuming to calculate, but fortunately there is another way we can calculate $S_{xx}$, which is simply to add up all of the values squared, that is $\sum x^2$, and to subtract $n$ times the mean squared, $n\bar{x}^2$. The equivalence of the two is not too difficult to justify, but we will skip a proof here to avoid distraction and accept that another way of writing $S_{xx}$ is:

$S_{xx} = \sum x^2 - n\bar{x}^2$

Note: Be careful, this means $S_{xx} = \left(\sum x^2\right) - n\bar{x}^2$, i.e. we add up all the $x^2$ but just subtract $n\bar{x}^2$ once.

So in our example, we already know $n = 10$ and $\bar{x} = 167$, so $n\bar{x}^2 = 278890$, and

$\sum x^2 = 5^2 + 12^2 + 45^2 + 89^2 + 123^2 + 158^2 + 232^2 + 288^2 + 314^2 + 404^2 = 448788$

and so

$S_{xx} = 448788 - 278890 = 169898$

This is the total of the squared differences from the mean. We began by looking for the average of the squared differences from the mean, and so we must divide by the number of pieces of data $n$. This gives us the mean square deviation, here $\frac{169898}{10} = 16989.8$. By square rooting this value, we can compensate for the fact that we have squared all the differences, to give a value which is closer to the size of the differences themselves. The square root of the mean square deviation is called the root mean square deviation, here $\sqrt{16989.8} \approx 130.3$.
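The calculation above can be reproduced with a short Python sketch, which also confirms numerically that the two ways of calculating $S_{xx}$ agree for this data set. The variable names are our own.

```python
from math import sqrt

data = [5, 12, 45, 89, 123, 158, 232, 288, 314, 404]

n = len(data)
mean = sum(data) / n

# S_xx from the definition: sum of squared differences from the mean.
sxx_direct = sum((x - mean) ** 2 for x in data)

# S_xx from the shortcut: sum of squares minus n times the mean squared.
sum_of_squares = sum(x * x for x in data)
sxx_shortcut = sum_of_squares - n * mean ** 2

msd = sxx_shortcut / n   # mean square deviation
rmsd = sqrt(msd)         # root mean square deviation

print("mean =", mean)
print("S_xx =", sxx_direct, "(direct),", sxx_shortcut, "(shortcut)")
print("msd =", msd, " rmsd =", round(rmsd, 1))
```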

12. Variance and standard deviation, outliers (second definition)

For technical reasons, the formulae for mean square deviation and root mean square deviation are often altered to divide by $(n - 1)$ instead of $n$. The resulting statistics are called the variance and standard deviation respectively. Nothing in the calculations changes apart from this, and so the two summary boxes below look very similar.

Summary: mean square deviation (msd) and root mean square deviation (rmsd)

mean square deviation $= \frac{S_{xx}}{n}$

root mean square deviation $= \sqrt{\frac{S_{xx}}{n}}$

where $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2$

Variance and Standard Deviation

variance $= s^2 = \frac{S_{xx}}{n - 1}$

standard deviation $= s = \sqrt{\frac{S_{xx}}{n - 1}}$

where $S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - n\bar{x}^2$

Note: The letter $s$ is used to denote the standard deviation and $s^2$ the variance. In another part of the Statistics 1 course we will meet standard deviation and variance in the context of random variables, where the Greek $\sigma$ and $\sigma^2$ will be used. In the context of data, the Roman $s$ should always be used.


In section 8, we considered a method of classifying pieces of data as outliers based on the quartiles and the inter-quartile range. There is also a method of identifying outliers based on the mean and the standard deviation. There is no set rule as to which definition to use; we usually use whichever is most convenient given the statistics we are able to calculate (or have already calculated) from the data.

When we look for outliers, we are trying to identify pieces of data which are not genuine. The fact that a piece of

data passes one or both of the tests for an outlier raises a suspicion, but we then must consider qualitative

information to decide whether or not to exclude the data from our considerations. For example, we previously

considered a mark of 155%. Our statistical procedure flagged the data as an outlier and then our knowledge that

you cannot score over 100% on the test led us to exclude the data point. If we were to consider lottery prizes and

had a dataset which included lots of 0s, a couple of 10s and one entry of 65487 then the 65487 would certainly be

flagged as a statistical outlier. We would not want to exclude this piece of data however, as it simply shows that the

person has won a large prize, information that is highly relevant to the data set.

Outlier ‐ Second Definition (Variance Definition)

An outlier is a piece of data which is more than two standard deviations above or below the mean.
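The second definition can be sketched in Python in the same spirit as the first. The sketch below uses the divisor $(n - 1)$ for the standard deviation, as in the box above, and applies the test to the ten data points from section 11 and to the test marks with the mistyped 155 added. The function names are our own.

```python
from math import sqrt


def standard_deviation(values):
    """Standard deviation s, dividing S_xx by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    s_xx = sum((x - mean) ** 2 for x in values)
    return sqrt(s_xx / (n - 1))


def sd_outliers(values, k=2):
    """List the values more than k standard deviations from the mean."""
    mean = sum(values) / len(values)
    s = standard_deviation(values)
    return [x for x in values if abs(x - mean) > k * s]


data = [5, 12, 45, 89, 123, 158, 232, 288, 314, 404]
marks_with_error = [68, 45, 22, 14, 92, 55, 58, 53, 78, 71,
                    16, 39, 80, 42, 72, 72, 88, 31, 12, 89, 155]

s = standard_deviation(data)
print("s =", round(s, 1))
print("Outliers in section 11 data:", sd_outliers(data))
print("Outliers in marks with 155:", sd_outliers(marks_with_error))
```

Remember that being flagged by the test only raises a suspicion; we still need to consider the context before excluding a piece of data.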


13. Calculating the mean and variance from an ungrouped frequency table

A group of students were asked how many siblings they have and the results were as follows:

Number of siblings ($x$)    Frequency ($f$)
0                           17
1                           23
2                           18
3                           9
4                           3

If we want to calculate the mean and variance, one way to do this would be to write all the data out in a list

0, 0, ..., 0, 1, 1, ..., 1, 2, 2, ..., 2, 3, 3, ..., 3, 4, 4, 4

(seventeen 0s, twenty-three 1s, eighteen 2s, nine 3s and three 4s) and to proceed as before. We would need to add up all the data for the mean, and add up all the squares of the data points for the calculation of the variance. But this is tedious, and since so many of the data points are the same we can speed up the process. For example there are 18 '2's and so the sum of the squares of these data is $18 \times 2^2$. Hence we fill in the table as follows:

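Number of siblings ($x$)    Frequency ($f$)    $xf$    $x^2$    $x^2 f$
0                           17                 0       0        0
1                           23                 23      1        23
2                           18                 36      4        72
3                           9                  27      9        81
4                           3                  12      16       48

Totals: $n = \sum f = 70$, $\sum xf = 98$, $\sum x^2 f = 224$

We can now quickly calculate the mean

$\bar{x} = \frac{\sum xf}{n} = \frac{98}{70} = 1.4$

and the variance

$s^2 = \frac{S_{xx}}{n - 1} = \frac{\sum x^2 f - n\bar{x}^2}{n - 1} = \frac{224 - 70 \times 1.4^2}{69} \approx 1.26$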

The values of $\sum xf$, $\sum x^2 f$ etc. are called summary statistics and will sometimes be calculated for you in exam questions so you don't have to carry out too many routine calculations under timed conditions, but you cannot rely on this and so you should be able to do the calculations from scratch if necessary.

Modern calculators have statistical functions which can eliminate some of the work in carrying out these calculations, but you need to practise in advance to know how to use these and you should only use the calculator if you are confident you know what it is doing. Most calculators have many different settings and you should do at least one calculation both by hand and using the calculator to make sure you have chosen the correct ones!
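In the same spirit of checking a calculation in more than one way, here is a minimal Python sketch that computes the summary statistics, mean and variance directly from the frequency table. The variable names are our own.

```python
# Frequency table from the notes: number of siblings x -> frequency f.
table = {0: 17, 1: 23, 2: 18, 3: 9, 4: 3}

# Summary statistics.
n = sum(table.values())
sum_xf = sum(x * f for x, f in table.items())
sum_x2f = sum(x * x * f for x, f in table.items())

mean = sum_xf / n
s_xx = sum_x2f - n * mean ** 2
variance = s_xx / (n - 1)

print("n =", n, " sum xf =", sum_xf, " sum x^2 f =", sum_x2f)
print("mean =", round(mean, 2), " variance =", round(variance, 2))
```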


14. Estimating the mean and median from a grouped frequency table

The amount of pocket money received by a group of children was recorded as follows:

Amount of pocket money (£$x$)    Frequency
$0 \le x < 5$                    8
$5 \le x < 10$                   32
$10 \le x < 20$                  24
$20 \le x < 50$                  6

We cannot calculate the mean of this data precisely since we do not have the raw data; we only know which category each of the pieces of data falls into. The best we can do then is to make an estimate of the mean, based on the assumption that each of the pieces of data falls exactly in the middle of the interval it falls in.

Amount of pocket money (£$x$)    Frequency, $f$    Mid-interval value, $m$    $mf$
$0 \le x < 5$                    8                 2.5                        20
$5 \le x < 10$                   32                7.5                        240
$10 \le x < 20$                  24                15                         360
$20 \le x < 50$                  6                 35                         210

We can then estimate the mean as

$\frac{\sum mf}{n} = \frac{20 + 240 + 360 + 210}{8 + 32 + 24 + 6} = \frac{830}{70} \approx 11.86$

that is, approximately £11.86.

Similarly, we could estimate the variance (or standard deviation or msd or rmsd) in the same way, by assuming that

all of the pieces of data fall at the relevant mid‐interval values and completing the calculation as before.

We could also estimate the median from this data. There are 70 pieces of data, so we would like to average the 35th and the 36th pieces of data. We do not have the exact data, but we can tell from the information given that both of these values lie in the $5 \le x < 10$ interval. Furthermore, we know that 32 pieces of data lie in this interval, and that there are 8 in the interval below. Hence the 35th and 36th data points will be the 27th and 28th data points in the $5 \le x < 10$ interval. Hence we can make an estimate of the median as:

$5 + \frac{27.5}{32} \times (10 - 5) \approx 9.30$ to 2 decimal places.

If desired, we could also estimate the quartiles and other percentiles in the same way.
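Both estimates can be sketched in Python as follows. The sketch assumes, as in the working above, that the data are spread evenly within each class and that the median sits at position $(n + 1)/2$, halfway between the 35th and 36th values. The tuple layout and function name are our own.

```python
# Each class is (lower boundary, upper boundary, frequency), in pounds.
classes = [(0, 5, 8), (5, 10, 32), (10, 20, 24), (20, 50, 6)]

n = sum(f for _, _, f in classes)

# Estimated mean: treat every value as lying at its mid-interval value.
estimated_mean = sum((lower + upper) / 2 * f for lower, upper, f in classes) / n


def estimate_median(classes):
    """Estimate the median by interpolating within the class that contains it."""
    total = sum(f for _, _, f in classes)
    position = (total + 1) / 2   # 35.5 here: between the 35th and 36th values
    below = 0                    # number of data points in earlier classes
    for lower, upper, f in classes:
        if below + f >= position:
            return lower + (position - below) / f * (upper - lower)
        below += f
    raise ValueError("no data")


estimated_median = estimate_median(classes)

print("Estimated mean:", round(estimated_mean, 2))
print("Estimated median:", round(estimated_median, 2))
```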


15. Linear Coding

Suppose you are a professional who charges a fixed call out rate and then an additional fee per hour you spend on a job. You have gathered data on the amount of time you have spent on jobs in the last few months and calculated the mean and standard deviation.

Suppose your call out rate is £50 and you then charge £80 an hour for your services. If $\bar{x}$ is the mean number of hours you spend on a job and $s_x^2$ the variance, what are the mean and variance of the amount of money you receive, which we shall call $\bar{y}$ and $s_y^2$?

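Mean: Each time you do a job you get £50 plus £80 times the number of hours the job takes. So if you take $x$ hours on a job you get paid $y = 50 + 80x$ pounds. If you do $n$ jobs, the mean number of hours is $\bar{x} = \frac{\sum x}{n}$ and so the mean pay you receive is:

$$
\begin{aligned}
\bar{y} = \frac{\sum y}{n} &= \frac{50n + 80\sum x}{n} \\
&= 50 + 80\,\frac{\sum x}{n} \\
&= 50 + 80\bar{x}
\end{aligned}
$$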

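Variance: The variance of the number of hours is $s_x^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ and so the variance of the pay you receive is:

$$
\begin{aligned}
s_y^2 = \frac{\sum (y - \bar{y})^2}{n - 1} &= \frac{\sum \bigl(50 + 80x - (50 + 80\bar{x})\bigr)^2}{n - 1} \\
&= \frac{\sum (80x - 80\bar{x})^2}{n - 1} \\
&= 80^2\,\frac{\sum (x - \bar{x})^2}{n - 1} \\
&= 80^2 s_x^2
\end{aligned}
$$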

Similarly, the standard deviation of the pay you receive is $s_y = \sqrt{s_y^2} = 80 s_x$.

This method is called linear coding and allows us to calculate the values we are interested in without first going back

and calculating the amount of money we have received for each job.

In general, if we have a data set $x$ and apply the linear coding $y = ax + b$, the following results are true:

Mean: $\bar{y} = a\bar{x} + b$

Variance: $s_y^2 = a^2 s_x^2$

Standard deviation: $s_y = |a|\, s_x$ (which is simply $a s_x$ when $a$ is positive)

The results for the variance and standard deviation also hold for the msd and rmsd respectively.
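The linear coding results can be checked numerically. The sketch below uses a small made-up list of job lengths (`hours` is hypothetical data, not taken from these notes) and Python's `statistics` module, whose `variance` and `stdev` functions use the n − 1 divisor.

```python
from math import isclose
from statistics import mean, stdev, variance

hours = [1.5, 2.0, 3.5, 4.0, 6.0]  # hypothetical job lengths in hours
a, b = 80, 50                      # £80 per hour plus a £50 call out rate

# Long way round: code every data point, then summarise the coded data
pay = [a * x + b for x in hours]

# Short way: apply the linear coding results to the summary statistics of x
coded_mean = a * mean(hours) + b
coded_variance = a ** 2 * variance(hours)
coded_sd = abs(a) * stdev(hours)

print(isclose(mean(pay), coded_mean))          # True
print(isclose(variance(pay), coded_variance))  # True
print(isclose(stdev(pay), coded_sd))           # True
```

Note that the constant `b` changes the mean but has no effect on the variance or standard deviation, since adding a constant shifts every value by the same amount.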


16. Advantages and Disadvantages of the mean, median, mode and midrange

In common language the word 'average' usually means the mean. However, the mean, median, mode and midrange

are all averages, and which is the most suitable and useful depends both on the particular data we are looking at and

the reason we want an average. This is why some of the advantages in the table below also appear as potential

disadvantages. Using an inappropriate average can give a misleading or unrepresentative figure as an average.

Listed below are some possible advantages and disadvantages of each. This list is by no means comprehensive.

| | Advantages | Disadvantages |
|---|---|---|
| Mean | All pieces of data are used in the calculation. Useful when the total quantity is also of interest. | Will be skewed (i.e. changed significantly) by the presence of a few exceptionally small or large pieces of data. |
| Median | Will not be significantly skewed by outliers. This is why the median is often used for average salaries, for example, where there are often one or two people who earn a lot more than everybody else. | Does not take all pieces of data fully into account and ignores the presence of exceptionally high or low data points which may be of significance. |
| Mode | Can be useful where there are just a few possible values, for example the number of A grades students have achieved at A-Level. | Can be very misleading, especially when there are a large number of different values which all have low frequencies. |
| Midrange | There are very few situations in which the midrange is the best average to use. One advantage is that it is easy to calculate. | Only takes into account the lowest and highest pieces of data. As such it is heavily affected by outliers. |

Note: The midrange is defined as

$$\text{Midrange} = \frac{\text{Minimum Value} + \text{Maximum Value}}{2}$$

It is literally in the middle of the range of values in the data set, hence the name!
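To see how differently the four averages can behave, here is a short illustration using made-up salary data (the list `salaries`, in thousands of pounds, is hypothetical) in which one person earns far more than everybody else.

```python
from statistics import mean, median, mode

# Hypothetical salaries in thousands of pounds; 150 is an outlier
salaries = [18, 20, 20, 22, 25, 27, 150]

# The midrange is halfway between the smallest and largest values
midrange = (min(salaries) + max(salaries)) / 2

print(f"mean     = {mean(salaries):.1f}")  # mean     = 40.3
print(f"median   = {median(salaries)}")    # median   = 22
print(f"mode     = {mode(salaries)}")      # mode     = 20
print(f"midrange = {midrange}")            # midrange = 84.0
```

The single large salary pulls the mean well above most of the data and moves the midrange even further, while the median and mode stay close to the typical values.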


17. Skewness

Distributions can be symmetrical, or skewed either positively or negatively.

In a positively skewed distribution, the mean is pulled in the positive direction by the presence of some values which are significantly larger than the bulk of the distribution. The median is not as badly affected by a few large values and

stays nearer the bulk of the distribution.

In a negatively skewed distribution, the reverse is true. So there are some values significantly smaller than the bulk of

the distribution which pull, or skew, the mean in the negative direction from the median.

If a distribution has no skew then it is described as symmetrical.

The above diagrams are all frequency diagrams. Some students have found the pictures below useful for remembering positive and negative skew. How would you feel if you were either of the cyclists below?

[Pictures: two cyclists, captioned "Feeling positive!" and "Feeling negative!"]


We can also identify skewness from histograms, boxplots and cumulative frequency diagrams:

[Table of diagrams: a frequency diagram/histogram, a cumulative frequency diagram and a boxplot for each of positive skew, symmetrical and negative skew.]
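The relationship between the mean and the median described in this section can also be checked numerically. The sketch below is a rough rule of thumb only, not a formal measure of skewness: the function name `skew_direction` and the three small data sets are made up for illustration, and the function simply reports which side of the median the mean lies on.

```python
from statistics import mean, median

def skew_direction(data):
    """Rule of thumb: compare the mean with the median."""
    difference = mean(data) - median(data)
    if difference > 0:
        return "positive"     # mean pulled above the median by large values
    if difference < 0:
        return "negative"     # mean pulled below the median by small values
    return "none indicated"   # mean equals median, as for symmetrical data

# Hypothetical data sets
right_tail = [1, 2, 2, 3, 3, 3, 4, 10, 15]
left_tail = [1, 6, 12, 13, 13, 13, 14, 14, 15]
balanced = [1, 2, 3, 4, 5]

print(skew_direction(right_tail))  # positive
print(skew_direction(left_tail))   # negative
print(skew_direction(balanced))    # none indicated
```

A diagram is still the more reliable guide, since the mean and median can coincide for data that are not symmetrical.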
