Data description Peter Shaw. One variable or many? In your research you will almost certainly end up...


Data description

Peter Shaw


One variable or many?

In your research you will almost certainly end up measuring many different things:
‘Survey the plants’ means collecting 10-50 columns of data
‘Analyse the soil’ means 5-10 variables
‘Take body measurements’ means 5-30 variables.

This lecture is essentially about how to explore each of those variables, one by one, to tell a reader about the range or distribution of values it contains. This tells a reader about how important the variable is and what sort of tests may be run on it (parametric or non-parametric?).

But this does not treat your dataset as a unified object. There is a powerful branch of data description called Ordination, which essentially asks for a description of ALL variables at the same time.


Things to do with data:

This is an infinite morass of statistical techniques, but one fundamental division is paramount and must be understood.

DESCRIPTIVE <---------------------> INFERENTIAL

Descriptive statistics aim to condense out the useful/important essence of a (usually large) body of data. Calculate an average, plot a graph showing the range of values etc.

Inferential statistics requires that the user sets up a formal hypothesis, then invokes a procedure which ends up with a probability value by which the hypothesis may be judged.


Why bother with data descriptions?

Standard format: Abstract, Introduction, Methods, Results, Discussion, References

I have lost count of the number of times that students have got this far then dived straight into the fancier analyses – Correlations or Anovas usually, without bothering to tell the reader anything about the data they are analysing.


Standard format: Abstract, Introduction, Methods, Results, Discussion, References

1: Describe your data: units, indications of typical values + variability.

2: Analyse relationships within your data


General ground rules:

1: What do the data mean? What are the units?
2: Eyeball it!
3: Summon up the formal procedures, by PC or calculator

Pb, ppm in white paint exposed on a nursery door:
16207, 14833, 29524, 18436, 26236


Graphing data

This is a huge topic; entire books have been written. One unifying point: a good graph is the best way to present data.

I am going to show you several histograms today.

These show the distribution of values within a dataset.

[Figure: example histogram; x-axis: size of value, y-axis: number of observations]


What could you want to know about a dataset?

In order of decreasing likelihood:
What magnitude of numbers are you dealing with?
What sort of spread have you got?
What is the nature of the distribution of results?
?other? (your turn!)


Magnitude summaries

In plain English, what sort of values am I dealing with?

In statspeak, you require Measures of Central Tendency

There are 3 such measures you need to learn, of which 2 are actually useful!
Mean
Median
Mode


Mode

This is simply the commonest value in the data. Many real datasets of continuous measurements have no useful mode, as all values are different.

As such, the Mode is easily the least useful technique for data description, but is always mentioned in the books so you may as well learn it!

Pb, ppm in white paint exposed on a nursery door:
16207, 14833, 29524, 18436, 26236
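To see why the mode says so little about data like these, here is a minimal sketch that counts how often each paint reading occurs. It uses Python's standard `collections.Counter`; the variable names are just illustrative.

```python
from collections import Counter

# Pb (ppm) in white paint from the nursery door, as listed on the slide
paint_pb = [16207, 14833, 29524, 18436, 26236]

counts = Counter(paint_pb)
# most_common(1) gives the value with the highest count, and that count
value, frequency = counts.most_common(1)[0]

if frequency == 1:
    print("Every value occurs once, so there is no meaningful mode")
else:
    print(f"Mode = {value} (occurs {frequency} times)")
```

With measurements recorded to this precision, repeats are rare, which is the usual situation for continuous data.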


Median

This is the middle of the dataset, defined as the point below which half the data points lie, and above which half the data lie.

How to find it:
Sort data into ascending order 1..N
If N is odd, median is the ((N+1)/2)th value
If N is even, median is half way between the (N/2)th and ((N/2)+1)th value

Pb, ppm in white paint exposed on a nursery door (sorted):
14833
16207
18436  <- median
26236
29524
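The rule above translates almost word for word into code. This is a small sketch in Python, not the only way to do it; the extra reading of 20000 in the second call is a made-up value, added only to show the even-N case.

```python
def median(values):
    """Median by the sort-and-pick-the-middle rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # N odd: the ((N+1)/2)th value; subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # N even: half way between the (N/2)th and ((N/2)+1)th values
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

paint_pb = [16207, 14833, 29524, 18436, 26236]
print(median(paint_pb))            # 18436
print(median(paint_pb + [20000]))  # 19218.0 (hypothetical sixth reading)
```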


Median, contd

The median is an under-rated tool, often preferable to the more widely used mean, because it gives a sensible answer whatever the shape of the data distribution.

It is a special case of a more general descriptive technique known as centiles.

The median is the 50th centile of a dataset, meaning that 50% of the data points lie below it.


The Mean

The ‘Mean’ is the name given by statisticians to what everyone else calls the ‘average’! Often given symbol μ.

Easy to calculate: add up the numbers and divide by N

μ = Σx/N

Your calculator should have this built in as a stats function.

It is often NOT the middle of the data. This happens when data are asymmetrically distributed.
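A short sketch of the calculation, in Python. The first dataset is the five paint readings from the earlier slide; the second, `skewed`, is an invented example chosen only to show how one extreme value pulls the mean away from the middle of the data.

```python
def mean(values):
    """Arithmetic mean: add up the numbers and divide by N."""
    return sum(values) / len(values)

paint_pb = [16207, 14833, 29524, 18436, 26236]
print(mean(paint_pb))                     # 21047.2

# Hypothetical asymmetrical data: one extreme reading drags the mean upwards,
# while the middle value of the sorted data stays where it was.
skewed = [2, 3, 3, 4, 100]
print(mean(skewed))                       # 22.4
print(sorted(skewed)[len(skewed) // 2])   # 3, the median
```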


[Figure: a symmetrical distribution (number of observations against size of value); mean and median about the same]

[Figure: an asymmetrical distribution, with median and mean marked. Note that the mean is misleading here]


[Figure: two histograms, Distribution A and Distribution B; x-axis: size of value, y-axis: number of observations]

Two data sets. In which one are you more likely to guess the next value correctly?


This leads onto.. Measures of dispersion

These are indicators of how tightly clumped data are.
There is a proliferation of such indices, but they divide into 2 families:
Non-parametric, based on centiles
Parametric, based on variance and giving rise to standard deviations etc.


Centiles

1: Sort data into ascending order (this is a PC job for big datasets)
2: To get the 25th centile, find the number below which 25% of the data lie
3: To get the 75th centile, find the number below which 75% of the data lie, etc

UNSORTED: more paint Pb data, ppm
2734, 3404, 5000, 4641, 16207, 14833, 1515, 1667, 29524, 18436, 26236, 7255, 5800, 10588, 9462, 6368, 5122, 6585, 6846, 4143

SORTED: paint Pb data, ppm
1: 1515
2: 1667
3: 2734
4: 3404
5: 4143
*** 25th centile here
6: 4641
7: 5000
8: 5122
9: 5800
10: 6368
*** 50th centile = median here
11: 6585
12: 6846
13: 7255
14: 9462
15: 10588
*** 75th centile here
16: 14833
17: 16207
18: 18436
19: 26236
20: 29524
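The three steps above can be sketched in Python as follows. Where the cut falls exactly between two data points (as it does for all three centiles marked in the sorted list) this sketch takes their midpoint. That is an assumption: it is one of several conventions in use, so a stats package may give slightly different answers.

```python
import math

def centile(values, percent):
    """Value below which `percent` % of the data lie (one simple convention)."""
    ordered = sorted(values)
    n = len(ordered)
    position = n * percent / 100          # how many points should lie below
    if position == int(position) and 0 < position < n:
        # The cut falls exactly between two data points: take their midpoint
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise take the data point that the cut lands on
    k = min(max(math.ceil(position), 1), n)
    return ordered[k - 1]

paint_pb = [2734, 3404, 5000, 4641, 16207, 14833, 1515, 1667, 29524, 18436,
            26236, 7255, 5800, 10588, 9462, 6368, 5122, 6585, 6846, 4143]

q25 = centile(paint_pb, 25)   # between the 5th and 6th sorted values
q50 = centile(paint_pb, 50)   # the median, between the 10th and 11th
q75 = centile(paint_pb, 75)   # between the 15th and 16th
print(q25, q50, q75)          # 4392.0 6476.5 12710.5
print(q75 - q25)              # 8318.5, the spread of the middle half of the data
```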


The inter-quartile range

Is the difference between the 25th and 75th centiles of a distribution.

This means that it is the range covered by the middle half of the data.

[Figure: histogram (number of observations against size of value) with the 25th, 50th and 75th centiles marked; the interquartile range runs from the 25th to the 75th]


Boxplots

These are under-rated, but extremely helpful tools for examining the distribution of data.

[Figure: boxplot on a 0 to 100 scale, labelled with lowest value, 25th centile, median, 75th centile and highest value]


Standard deviations and all that..

The parametric family of measures of dispersion have messy-looking formulae, but luckily are easily obtained from calculators or PCs.

They are based on a measure misleadingly called the sum of squares of the data (SS).

The origin of SS is as follows:
For each data point Xi calculate (Xi - mean)*(Xi - mean) [This square >= 0]
Add up all these squares = SS

Formula: $SS = \sum_i (x_i - \mu)^2$

Luckily there is a simpler(?) formula: $SS = \sum_i x_i^2 - \left(\sum_i x_i \cdot \sum_i x_i\right) / N$

[Figure: data points X1, X2 and X3 and the mean, on a 0 to 1000 scale]
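The two formulae for SS give the same answer, which is easy to check. The sketch below, in Python, computes SS both ways; the eight-value dataset is invented purely because its mean (5) makes the arithmetic easy to follow by hand.

```python
def sum_of_squares_definition(values):
    """SS as defined: sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)

def sum_of_squares_shortcut(values):
    """SS by the shortcut: sum(x^2) - (sum(x) * sum(x)) / N."""
    total = sum(values)
    return sum(x * x for x in values) - (total * total) / len(values)

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5
print(sum_of_squares_definition(data))  # 32.0
print(sum_of_squares_shortcut(data))    # 32.0
```

The shortcut needs only running totals of x and x squared, so the data never have to be stored. With very large values and a small spread it can lose precision in floating-point arithmetic, so the definition is the safer choice on a PC.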


Variance etc

Having got the Sum of Squares:

Variance is the mean value of the squares that make up SS.

Variance = SS/N

An alternative formula is also used: Variance = SS/(N-1)

This estimates the variance of the whole population, while /N gives the variance just for the sample taken.

Geographers tend to prefer Variance = SS/N

Biologists tend to prefer Variance = SS/(N-1)
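Both versions are a one-line change from each other, as this Python sketch shows. The dataset is invented for illustration (its SS is 32), and the last two lines use the standard library's `statistics.pvariance` and `statistics.variance`.

```python
import statistics

def variances(values):
    """Return (SS/N, SS/(N-1)) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return ss / n, ss / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical data, SS = 32
var_n, var_n_minus_1 = variances(data)
print(var_n)                       # 4.0
print(round(var_n_minus_1, 3))     # 4.571

# The standard library offers both versions under different names
print(statistics.pvariance(data))  # SS/N
print(statistics.variance(data))   # SS/(N-1)
```

Beware the naming: Python calls the SS/N version the "population" variance (it is exact if your data are the whole population) and the SS/(N-1) version the "sample" variance (an estimate of the population value from a sample).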


Standard deviation

Is the square root of variance.

This has the useful property that sd has the same units as the raw data and will be commensurate with the interquartile range. (Roughly, for normally distributed data, IQR = 1.35 * sd)

Because there are 2 ways to calculate variance, there are 2 s.d.s

$Sd = (SS/N)^{1/2}$. This is labelled σ on many calculators

or

$Sd = (SS/(N-1))^{1/2}$. This is labelled s on many calculators
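A minimal Python sketch of both standard deviations, using an invented dataset whose SS is 32. The second half checks how the interquartile range relates to the sd for an exactly Normal distribution, using the standard library's `statistics.NormalDist`; real data will only follow this roughly.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5, SS = 32

mean = statistics.mean(data)
ss = sum((x - mean) ** 2 for x in data)
sd_n = math.sqrt(ss / len(data))                 # the sigma key
sd_n_minus_1 = math.sqrt(ss / (len(data) - 1))   # the s key
print(sd_n, round(sd_n_minus_1, 4))              # 2.0 2.1381

# For an exactly Normal distribution the IQR is a fixed multiple of the sd
normal = statistics.NormalDist(mu=0, sigma=1)
iqr_in_sd_units = normal.inv_cdf(0.75) - normal.inv_cdf(0.25)
print(round(iqr_in_sd_units, 3))                 # 1.349
```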


How to use your calculator

Why is he telling me this? I already know!

OK then, what’s this hierarchy? +- < */ < y^x < ()

Use this to calculate
123*456+789*112
(109*256+103*876)/(22*44+89*78)

The solution to the ‘grains of rice on a chessboard’ problem is $2^{64} - 1$ (ie 2*2*2………..….*2 - 1), which is?
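If you want to check your calculator answers, most programming languages follow the same hierarchy. A quick sketch in Python, where `**` is the power operator:

```python
# Python follows the same hierarchy: brackets first, then powers (**),
# then * and /, then + and -.
first = 123 * 456 + 789 * 112
second = (109 * 256 + 103 * 876) / (22 * 44 + 89 * 78)
rice = 2 ** 64 - 1          # the power is evaluated before the subtraction

print(first)                # 144456
print(round(second, 4))     # 14.9345
print(rice)                 # 18446744073709551615
```

Python integers have no size limit, so the chessboard answer is printed exactly; a calculator will show it rounded, as roughly 1.8 followed by an exponent of 19.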


Stats mode on your calculator

If you have buttons saying N, Σx, sd then your machine has stats functions.

This means it has special registers called N, Σx and Σx², which keep running totals as you enter data.

Put into stat mode
Enter the number 7 by hitting the Σ or M+ or Xi button
Optional, but for your education find the contents of the special registers (Kout or recall): N = 1, Σx = 7, Σx² = 49
Enter the number 2
Now you find that N = 2, Σx = 9, Σx² = 53
Now the Mean button will give you the mean 4.5, the sd buttons the sds (2.5 for sd/n, 3.53.. for sd/n-1). Easy!
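What the calculator is doing internally can be imitated in a few lines. The class below is a sketch in Python, with names of my own choosing (`StatRegisters`, `enter`), not a description of any particular calculator's firmware. It keeps only the three running totals and derives the mean and both sds from them.

```python
import math

class StatRegisters:
    """Running totals N, sum(x) and sum(x^2), as kept by a calculator in stat mode."""

    def __init__(self):
        self.n = 0
        self.sum_x = 0.0
        self.sum_x2 = 0.0

    def enter(self, x):
        """Equivalent of keying in a number and pressing the data-entry button."""
        self.n += 1
        self.sum_x += x
        self.sum_x2 += x * x

    def mean(self):
        return self.sum_x / self.n

    def sd(self, sample=False):
        """sd from the registers alone; sample=True divides by N-1 instead of N."""
        ss = self.sum_x2 - (self.sum_x * self.sum_x) / self.n
        divisor = self.n - 1 if sample else self.n
        return math.sqrt(ss / divisor)

calc = StatRegisters()
calc.enter(7)
print(calc.n, calc.sum_x, calc.sum_x2)             # 1 7.0 49.0
calc.enter(2)
print(calc.n, calc.sum_x, calc.sum_x2)             # 2 9.0 53.0
print(calc.mean())                                 # 4.5
print(calc.sd(), round(calc.sd(sample=True), 2))   # 2.5 3.54
```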


Your turn!

Water content of heathland soils, %:

8.53 17.53 39.14 32.00 20.53 21.07 26.20 23.80 12.53 20.80 31.33 28.87 14.00

For the numbers listed here:

Find mean, median, mode, and interquartile range

Find both standard deviations, by your calculator’s inbuilt functions or by the formulae:

$SS = \sum_i x_i^2 - \left(\sum_i x_i \cdot \sum_i x_i\right) / N$

Then $Sd = (SS/N)^{1/2}$ or $Sd = (SS/(N-1))^{1/2}$
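Once you have had a go by hand, a script like the one below can be used to check your answers. It is a sketch in Python using the standard `statistics` module. The interquartile range is left out deliberately, because the answer depends on which centile convention you adopt.

```python
import statistics

# Water content of heathland soils, %, as listed in the exercise
water = [8.53, 17.53, 39.14, 32.00, 20.53, 21.07, 26.20,
         23.80, 12.53, 20.80, 31.33, 28.87, 14.00]

n = len(water)
sum_x = sum(water)
sum_x2 = sum(x * x for x in water)
ss = sum_x2 - (sum_x * sum_x) / n      # shortcut formula from the exercise

print("N       =", n)
print("mean    =", round(sum_x / n, 2))
print("median  =", statistics.median(water))
print("modes   =", statistics.multimode(water))   # lists every value if none repeats
print("sd(n)   =", round((ss / n) ** 0.5, 2))
print("sd(n-1) =", round((ss / (n - 1)) ** 0.5, 2))
```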


[Figure: histogram of LOI, x-axis 5.0 to 90.0; Std. Dev = 27.97, Mean = 29.3, N = 69.00]

[Figure: histogram of LOGLOI, x-axis .63 to 1.88; Std. Dev = .44, Mean = 1.26, N = 69.00]

Often real data don’t follow the Normal curve but are skewed – here organic content in heath soils

Try log-transforming the data. Here the same data after calculating log of the numbers – not perfect, but clearly more symmetrical

Distribution shape
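The effect of a log transformation can be tried on any skewed dataset. The LOI values themselves are not listed in these slides, so the sketch below uses the 20 paint Pb readings from the centiles slide as a stand-in. The `skewness` function is a simple moment-based measure that I have added for illustration; base-10 logs are assumed.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness: 0 for symmetrical data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / sd) ** 3 for x in values) / n

# Stand-in data: the 20 paint Pb readings (ppm) from the centiles slide,
# NOT the heath-soil LOI data, which the slides do not list.
paint_pb = [2734, 3404, 5000, 4641, 16207, 14833, 1515, 1667, 29524, 18436,
            26236, 7255, 5800, 10588, 9462, 6368, 5122, 6585, 6846, 4143]
log_pb = [math.log10(x) for x in paint_pb]   # log10 needs strictly positive values

for label, data in (("raw", paint_pb), ("log10", log_pb)):
    print(label,
          "mean =", round(statistics.mean(data), 3),
          "median =", round(statistics.median(data), 3),
          "skewness =", round(skewness(data), 2))
```

After the transformation the mean and median sit much closer together and the skewness is nearer zero, which is what "more symmetrical" means in practice.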


How to decide about normality?

Inspect histogram + fitted normal curve.
Inspect a cumulative “P-P curve” with predicted normal distribution.
Run the Kolmogorov-Smirnov test.

[Figure: Normal P-P Plot of LOI; x-axis: Observed Cum Prob, y-axis: Expected Cum Prob, both from 0.00 to 1.00]

[Figure: Normal P-P Plot of LOGLOI; x-axis: Observed Cum Prob, y-axis: Expected Cum Prob, both from 0.00 to 1.00]
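A P-P plot compares the observed cumulative probability of each data point with the value expected from a fitted normal curve. The Kolmogorov-Smirnov statistic is simply the largest gap between the two. Here is a sketch of that calculation in Python. Since the LOI data are not listed in the slides, it uses the paint Pb readings from the centiles slide as a stand-in. Fitting the curve with the data's own mean and (N-1) sd is an assumption on my part.

```python
import math
import statistics

def ks_distance_from_normal(values):
    """Largest gap between the data's cumulative curve and a fitted Normal one."""
    ordered = sorted(values)
    n = len(ordered)
    # Fit the Normal curve using the data's own mean and sd (N-1 version)
    fitted = statistics.NormalDist(statistics.mean(ordered), statistics.stdev(ordered))
    largest_gap = 0.0
    for i, x in enumerate(ordered, start=1):
        expected = fitted.cdf(x)   # expected cumulative probability at x
        # The observed curve steps from (i-1)/n up to i/n at x; check both sides
        largest_gap = max(largest_gap, i / n - expected, expected - (i - 1) / n)
    return largest_gap

# Stand-in data: paint Pb readings (ppm), not the LOI data shown in the plots
paint_pb = [2734, 3404, 5000, 4641, 16207, 14833, 1515, 1667, 29524, 18436,
            26236, 7255, 5800, 10588, 9462, 6368, 5122, 6585, 6846, 4143]
log_pb = [math.log10(x) for x in paint_pb]

d_raw = ks_distance_from_normal(paint_pb)
d_log = ks_distance_from_normal(log_pb)
print("largest gap, raw data:  ", round(d_raw, 3))
print("largest gap, log10 data:", round(d_log, 3))
```

A smaller gap means the points on the P-P plot hug the diagonal more closely.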


One-Sample Kolmogorov-Smirnov Test

                                            LOI        LOGLOI
N                                           69         69
Normal Parameters (a,b)   Mean              29.2806    1.2603
                          Std. Deviation    27.9695    .4409
Most Extreme Differences  Absolute          .217       .086
                          Positive          .217       .080
                          Negative          -.183      -.086
Kolmogorov-Smirnov Z                        1.804      .716
Asymp. Sig. (2-tailed)                      .003       .685

a. Test distribution is Normal.
b. Calculated from data.

The Kolmogorov-Smirnov test examines whether data can be assumed to come from a chosen distribution – here the normal.

LOI is almost certainly NOT normally distributed (p = .003).

LogLOI may or may not be normal, but the test tells us that deviations from normality at least this large would occur about 7 times in 10 (p = .685) in randomly chosen normal data.
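The significance values in the table can be roughly reproduced from the Z values. The sketch below, in Python, assumes that Z is the largest difference multiplied by the square root of N, and that "Asymp. Sig." comes from the limiting Kolmogorov distribution. The printed numbers are consistent with both assumptions, but I have not confirmed that this is how SPSS computes them. The function name is my own.

```python
import math

def kolmogorov_asymptotic_p(z, terms=100):
    """Two-tailed asymptotic significance for a Kolmogorov-Smirnov Z value."""
    if z <= 0:
        return 1.0
    # Kolmogorov's series: p = 2 * sum over k>=1 of (-1)^(k-1) * exp(-2 k^2 z^2)
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                for k in range(1, terms + 1))
    return min(max(2 * total, 0.0), 1.0)

n = 69
print(round(math.sqrt(n) * 0.217, 3))            # 1.803, close to the table's Z for LOI
print(round(kolmogorov_asymptotic_p(1.804), 3))  # 0.003
print(round(kolmogorov_asymptotic_p(0.716), 3))  # 0.684 (table: .685, from unrounded Z)
```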


Kolmogorov test in SPSS

Typical SPSS: it does the same test in 2 ways in different bits of the menu structure, and uses different algorithms to assess significance. I use the basic version:

Analyse – non parametric stats – 1 sample KS

But it also hides under

Analyse – descriptive statistics – explore – plots, then click the box labelled “Normality plots with tests”. This well-hidden version uses a modified significance test (the Lilliefors correction), which really threw me the first time I met it!