SCM300 Survey Design

Lecture 3: Summarising Data

For use in fall semester 2015
Lecture notes were originally designed by Nigel Halpern. This lecture set may be modified during the semester.

Last modified: 4-8-2015

Lecture Aim & Objectives

Aim
• To investigate pictorial & statistical methods of analysing quantitative data

Objectives
• Pictorial representation of quantitative data
• Statistical representation of quantitative data

Pictorial Representation

• Levels of measurement
• Tables & frequency distributions
• Charts, plots, graphs & pie-charts

3 (4) Levels of Measurement

• Nominal variables
• Ordinal variables
• Interval (& ratio) variables

Nominal

• Categories
  – e.g. gender (m/f), responses (y/n), class of travel (b/l)
• Usually presented as frequencies & categories or %’s
  – e.g. 45% male, 55% female
• Measure the existence (or not) of a characteristic
  – But contain limited information

Ordinal

• Ordered categories or preferences
  – e.g. ranked responses from a Likert scale
  – e.g. finishers in a race (1st, 2nd, 3rd, etc.)
  – e.g. preferred aircraft
• Measure intensity, order or degree
  – But still limited as they don’t imply distances
    • i.e. the distance between 1st & 2nd is unknown

Interval & Ratio

• Ordered & scaled (on equal intervals)
  – e.g. age in years, temperature
• Measures differences between values
  – Interval: arbitrary zero
    • e.g. temperature (+/-)
  – Ratio: absolute zero indicating absence of that variable
    • e.g. age, income
• High analytical capabilities
  – e.g. can compare means, unlike for nominal or ordinal data

Levels of Measurement Summary

Variable type    Description                                                      Examples
Nominal          Classification of responses into mutually exclusive categories   Male/Female; Yes/No
Ordinal          Categories are rank ordered                                      1st/2nd/3rd; Likert
Interval/Ratio   Distances between items on scale are equal                       Temperature; Age

Your turn…..

• What levels of measurement would be derived from each of the following questions?
  1. Gender (male/female)
  2. Age in years and months (state years/months)
  3. Do you smoke (yes/no)
  4. How many cigarettes, on average, do you smoke a day (state no.)
  5. Number of full years you’ve been smoking (state no.)
  6. How many minutes exercise do you do, on average, each day (less than 30mins / 30-59mins / 60+mins)
  7. To what extent do you think that smoking is bad for your health (strongly agree / tend to agree / neither / tend to disagree / strongly disagree)
  8. Rank the cigarette brands in order of quality (B&H, Silk Cut, Marlboro)

Variable type    Description
Nominal          Classification of responses into mutually exclusive categories
Ordinal          Categories are rank ordered
Interval/Ratio   Distances between items on scale are equal

Tables

• Most straightforward pictorial representation
• Good method of storing information
• Summarises &/or shows patterns in data
• Easily made using word-processing or spreadsheets
• Confusing if constructed poorly
• Confusing if they try to show too much

Table Considerations

• Should be clear & appropriate
• Should be chosen with a purpose in mind
  – Not just for the sake of it
• Must include a title & a source of data
• Must be referenced & discussed in the text
  – Don’t assume that everyone will understand them

Table Clarity

• Use a common system of data presentation
• Use percentages rather than raw scores for clarity & comparative capabilities

The above points are particularly relevant if the table includes more than one variable calculated using different units of measurement (AKA ‘cross-tabulation’)

Table 1. Passengers at LGW, LHR & MAN, 1999

Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     18,607                43,407               62,014
C1                      14,345                52,400               66,745
C2                      1,386                 21,508               22,894
D/E                     312                   13,035               13,347
Total                   34,650                130,350              165,000

Data from a survey of pax at LGW, LHR & MAN (CAA, 2000):
- 34,650 Business Pax: A/B=18,607; C1=14,345; C2=1,386; D/E=312
- 130,350 Leisure Pax: A/B=43,407; C1=52,400; C2=21,508; D/E=13,035

Use percentages instead?

Table 1. Passengers at LGW, LHR & MAN, 1999 (%)

Socio-economic status   Business passengers   Leisure passengers   Total
A/B                     54                    33                   38
C1                      41                    40                   40
C2                      4                     17                   14
D/E                     1                     10                   8
Total                   21                    79                   100

Note: each status row is a % of its column total; the Total row shows each passenger type as a % of all passengers.

Data from a survey of pax at LGW, LHR & MAN (CAA, 2000):
- 34,650 Business Pax: A/B=18,607; C1=14,345; C2=1,386; D/E=312
- 130,350 Leisure Pax: A/B=43,407; C1=52,400; C2=21,508; D/E=13,035

Easier to interpret?
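
The step from raw counts to percentages in Table 1 can be reproduced in a few lines of Python. This is a minimal sketch and not part of the original lecture: the names `counts` and `column_percentages` are illustrative assumptions. It treats each status row as a percentage of its column total and the Total row as each passenger type's share of all passengers, which matches the figures in the table.

```python
# Raw counts from Table 1 (CAA, 2000), keyed by socio-economic status.
counts = {
    "A/B": {"business": 18607, "leisure": 43407},
    "C1": {"business": 14345, "leisure": 52400},
    "C2": {"business": 1386, "leisure": 21508},
    "D/E": {"business": 312, "leisure": 13035},
}


def column_percentages(table):
    """Return each cell as a rounded % of its column total, plus a Total row."""
    groups = ["business", "leisure"]
    col_totals = {g: sum(row[g] for row in table.values()) for g in groups}
    grand_total = sum(col_totals.values())
    result = {}
    for status, row in table.items():
        pct = {g: round(100 * row[g] / col_totals[g]) for g in groups}
        pct["total"] = round(100 * sum(row.values()) / grand_total)
        result[status] = pct
    # Total row: each passenger type as a share of all passengers.
    result["Total"] = {g: round(100 * col_totals[g] / grand_total) for g in groups}
    result["Total"]["total"] = 100
    return result


percentages = column_percentages(counts)
for status, pct in percentages.items():
    print(f"{status:<6}{pct['business']:>5}{pct['leisure']:>5}{pct['total']:>5}")
```

Rounding to whole numbers keeps the table readable, at the cost of columns that may not sum to exactly 100.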

Frequency Distributions

• Standard frequency distribution
• Univariate frequency distribution
• Grouped frequency distribution
• Relative & cumulative frequency distribution

Standard Frequency

• Standard frequency distribution
  – Presents data
    • e.g. “How many return flights did you take last year?”
    • Answers from 50 pax as a standard frequency distribution:

Number of return flights taken last year:
7 3 10 3 2 4 3 3 6 3 5 2 3 4 2 5 4 3 6 8 4 12 1 3 4
15 5 1 3 1 4 2 3 5 2 3 8 3 4 4 6 3 5 2 4 2 3 2 5 1

Univariate Frequency

• Univariate frequency distribution
  – Lists data more clearly & with their frequency
  – Important for large sample sizes

Flights   Frequency
1         4
2         8
3         14
4         9
5         6
6         3
7         1
8         2
10        1
12        1
15        1
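
As a rough illustration (not part of the original slides), the univariate frequency table can be tallied from the 50 raw answers with Python's `collections.Counter`. The variable names `flights` and `frequency` are assumptions made for this sketch.

```python
from collections import Counter

# Answers from the 50 passengers, in the order given on the slide.
flights = [
    7, 3, 10, 3, 2, 4, 3, 3, 6, 3, 5, 2, 3, 4, 2, 5, 4, 3, 6, 8, 4, 12, 1, 3, 4,
    15, 5, 1, 3, 1, 4, 2, 3, 5, 2, 3, 8, 3, 4, 4, 6, 3, 5, 2, 4, 2, 3, 2, 5, 1,
]

# Counter tallies how often each distinct value occurs.
frequency = Counter(flights)

print("Flights Frequency")
for value in sorted(frequency):
    print(f"{value:>7} {frequency[value]:>9}")
```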

Grouped Frequency

• Grouped frequency distribution
  – Groups all data according to categories
  – Further improves clarity

Flights   Grouped frequency
1-3       26
4-6       18
7-9       3
10-12     2
13+       1
Total     50

Relative & Cumulative Frequency

• Relative & cumulative frequency distributions
  – Relative: each category as a % of the total
  – Cumulative: add each relative frequency to the preceding cumulative total

Flights   Grouped   Relative (%)   Cumulative (%)
1-3       26        52             52
4-6       18        36             88
7-9       3         6              94
10-12     2         4              98
13+       1         2              100
Total     50        100
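
The grouped, relative and cumulative columns can be derived from the same 50 raw answers. The following is a minimal sketch under the assumption that the bands are those used in the table (1-3, 4-6, 7-9, 10-12, 13+); the helper `in_band` and the other names are illustrative, not from the lecture.

```python
from itertools import accumulate

# Answers from the 50 passengers, in the order given on the slide.
flights = [
    7, 3, 10, 3, 2, 4, 3, 3, 6, 3, 5, 2, 3, 4, 2, 5, 4, 3, 6, 8, 4, 12, 1, 3, 4,
    15, 5, 1, 3, 1, 4, 2, 3, 5, 2, 3, 8, 3, 4, 4, 6, 3, 5, 2, 4, 2, 3, 2, 5, 1,
]

# (label, lower bound, upper bound); None means the band has no upper limit.
bands = [("1-3", 1, 3), ("4-6", 4, 6), ("7-9", 7, 9), ("10-12", 10, 12), ("13+", 13, None)]


def in_band(value, low, high):
    """True if value falls inside the inclusive band [low, high]."""
    return value >= low and (high is None or value <= high)


# Grouped: number of answers in each band.
grouped = [sum(in_band(v, low, high) for v in flights) for _, low, high in bands]
# Relative: each band as a % of all answers.
relative = [100 * count / len(flights) for count in grouped]
# Cumulative: running total of the relative percentages.
cumulative = list(accumulate(relative))

for (label, _, _), g, r, c in zip(bands, grouped, relative, cumulative):
    print(f"{label:<6}{g:>4}{r:>6.0f}{c:>6.0f}")
```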

Too many numbers…?

Charts, Plots, Graphs & Pie-charts

• Simple bar charts
• Compound bar charts
• Histograms
• Scatter or dot plots
• Line graphs
• Pie-charts

Charts, Plots, Graphs & Pie-charts: Pros & Cons

• Easily made using word-processing or spreadsheets
• Ease of creation can lead to over-elaborate charts at the expense of clarity

Charts, Plots, Graphs & Pie-charts: Considerations

• Should be clear & appropriate
• Should be chosen with a purpose in mind
  – Not just for the sake of it
• Typically include
  – Title
  – Labelled axes
  – Key that explains the different segments
  – Source of data
• Must be referenced & discussed in the text
  – Do not assume that everyone will understand them
• Data type will restrict which method is chosen

Simple Bar Charts

• Simple bar charts
  – Horizontal or vertical charts of separate bars that represent size of data

Student results for SCM300 in 2007

Grade                0-39%   40-49%   50-59%   60-69%   70+%
Number of students   5       9        15       7        3

Simple Bar Charts

[Bar chart. y-axis: Number of students (0 to 16); x-axis: Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+]

Figure 1. Student results for SCM300 in 2007

Compound Bar Charts

• Compound bar charts
  – Show proportions/relative size of groups
  – Bars will always have same height when % are used but not when figures are used
  – For 3+ components, pie-charts may be better

Student results for SCM300 in 2007

         0-39%   40-49%   50-59%   60-69%   70%+
Male     4       7        7        2        0
Female   1       2        8        5        3
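
To see why percentage-based compound bars always have the same height, each grade band can be converted to male and female shares. This is an illustrative sketch only: the names `bands`, `male`, `female` and `shares` are assumptions, and the figures are those in the table above.

```python
bands = ["0-39", "40-49", "50-59", "60-69", "70+"]
male = [4, 7, 7, 2, 0]
female = [1, 2, 8, 5, 3]

# For each band, store (male %, female %) of the students in that band.
shares = {}
for band, m, f in zip(bands, male, female):
    total = m + f
    shares[band] = (100 * m / total, 100 * f / total)

for band, (male_pct, female_pct) in shares.items():
    # The two shares always stack to 100%, whatever the band's raw total.
    print(f"{band:<6} male {male_pct:5.1f}%  female {female_pct:5.1f}%")
```

When raw figures are plotted instead, the bar heights follow the band totals (5, 9, 15, 7 and 3 students), so the bars differ in height.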

Compound Bar Charts

[Compound bar chart. y-axis: Number of students (0% to 100%); x-axis: Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+; key: Female, Male]

Figure 1. Student results for SCM300 in 2007

Histograms

• Histograms
  – Similar to bar charts but a better indication of variation & distribution
  – Bars are connected instead of separate

Histograms

[Histogram. y-axis: Number of students (0 to 16); x-axis: Grade (%) in bands 0-39, 40-49, 50-59, 60-69, 70+]

Figure 1. Student results for SCM300 in 2007

This figure indicates repeat visits to Norway & tourists’ interest in returning, but is it easy to understand…..?

Scatter or Dot Plots

• Scatter or dot plots
  – Illustrate the exact distribution of data
  – Can be used to illustrate continuous data
    • BUT a line graph may be better
  – Effective for 2 related variables

Scatter or Dot Plots

[Scatter plot of Passengers against Aircraft movements; axis ticks run from 0 to 600 000]

Figure 1. Passengers & Aircraft Movements at HiMolde Airport

Line Graphs

• Line graphs
  – Show trends over time
    • e.g. patterns, peaks & troughs, rates of incline/decline
  – Can show more than 1 variable at a time
    • This can indicate possible relationships
    • e.g. see next slide

Pie Charts

• Pie-charts
  – Segments represent cases in each category
  – Best for 3-6 categories (no more, no less)
  – Labelling & shading sometimes difficult
  – Combining categories may improve clarity but loses detail

Pie Charts

[Pie chart: Other 26%, Car park 21%, Retail 42%, Catering 11%]

Pie Charts

Too many pies……..?

Charts, Plots, Graphs & Pie-charts: Summary

Variable type    Bar                Pie                Line
Nominal          Yes                Yes                No
Ordinal          Yes                Yes                No
Interval/ratio   Yes (if grouped)   Yes (if grouped)   Yes

Statistical Representation

• Measures of central tendency
• Measures of dispersion
• Normal distribution & skew

Measures of Central Tendency

• Raw data can be confusing & meaningless
• Measures of central tendency
  – AKA measures of location or average
  – Present the data in 1 single number
• 3 different measures depend on intention or data
  – See next slide

Measures of Central Tendency

Measure: Mode
  Definition: Most commonly occurring value in a data set. Can be misleading if it is an extreme value, & there may be multiple modes (e.g. a bimodal distribution).
  Data: Any

Measure: Median
  Definition: Central value representing the central point of a data set. When there is an even number of values, take the two middle values and find the mid-point between them. Extremes don’t distort it, but the data has to be ordered from lowest to highest in order to calculate it.
  Data: Ordinal or interval/ratio

Measure: Mean
  Definition: Average value in a data set. Advantage is that it uses all values in a data set. Disadvantages are that it can only be used with interval/ratio data and that, when there are few values in the data set, it can be distorted by extremes.
  Data: Interval/ratio

Example: Age of students

Mean     22
Median   20
Mode     19

19 20 36 19 19 24 37 20 21 20
19 19 19 19 20 25 20 26 20 19
19 19 19 19 20 19 24 25 20 20
26 25 19 20 19 18 19 28 22 19
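
The three measures for the student ages can be checked with Python's standard `statistics` module. This is a sketch added for illustration; the variable names are assumptions. The exact mean of the 40 ages is 21.5, which the slide reports rounded to 22.

```python
import statistics

# Ages of the 40 students, row by row as shown on the slide.
ages = [
    19, 20, 36, 19, 19, 24, 37, 20, 21, 20,
    19, 19, 19, 19, 20, 25, 20, 26, 20, 19,
    19, 19, 19, 19, 20, 19, 24, 25, 20, 20,
    26, 25, 19, 20, 19, 18, 19, 28, 22, 19,
]

mean_age = statistics.mean(ages)      # sum of all values / number of values
median_age = statistics.median(ages)  # mid-point of the two middle values (even count)
mode_age = statistics.mode(ages)      # most commonly occurring value

print(f"Mean {mean_age}, Median {median_age}, Mode {mode_age}")
```

The few students in their late twenties and thirties pull the mean above the median and the mode, which is the distortion by extremes described for the mean.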

Measures of Dispersion

• Measures of central tendency don’t show:
  – How closely related values are (i.e. clustered)
  – How representative they are of the data set
  – The range of values
  – The degree of distortion by extreme values

Salaries of office staff at HiMolde Airways:
  £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
  £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Mean salary at HiMolde Airways = £23k (£184k/8)
Mean salary at HiMolde Airport = £23k (£184k/8)

Measures of Dispersion

• Range
• Inter-quartile range
• Standard deviation

Range

• Simplest & crudest measure of dispersion
• Indicates spread of data
  – Places values in ascending order
  – Then subtracts smallest from the largest value
• Extreme values affect (determine) the outcome
• Range gives a greater insight into a data set
  – But gives no indication of the clustering of individual values

Range

Salaries of office staff at HiMolde Airways:
  £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
  £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Range of salaries at HiMolde Airways = £38k - £11k = £27k
Range of salaries at HiMolde Airport = £26k - £20k = £6k
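
A short sketch of the range calculation for the two salary lists (values in £k). The function name `value_range` is an assumption for illustration, not something from the lecture.

```python
# Salaries in thousands of pounds (£k), as listed on the slide.
airways = [11, 15, 15, 18, 25, 30, 32, 38]
airport = [20, 21, 22, 23, 23, 24, 25, 26]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


print(f"Airways range: £{value_range(airways)}k")
print(f"Airport range: £{value_range(airport)}k")
```

Both companies share a mean salary of £23k, yet the ranges differ widely because only the two most extreme values in each list enter the calculation.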

Inter-Quartile Range

• Most appropriate when using ordinal data
• Divides values into 4 equal parts (quartiles)
  – Is an extension of the idea of the median
• Represents the middle 50% of the values that fall between the 1st & 3rd quartiles
• Not affected by extremes
  – BUT doesn’t utilise all values
    • It discards 50% of the values & therefore provides a limited picture of the degree of clustering

Inter-Quartile Range

[Diagram: values ordered from Min. value to Max. value and split at Q1, Q2 (median value) & Q3 into four groups of 25% of cases each; the inter-quartile range runs from Q1 to Q3]

Inter-Quartile Range

Salaries of office staff at HiMolde Airways:
  £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
  £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
IQ Range of salaries at HiMolde Airways = £15k-£31k
IQ Range of salaries at HiMolde Airport = £22k-£24k
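
Several conventions exist for locating quartiles, and the slides do not say which one was used. The sketch below assumes the "median of each half" convention, where Q1 is the median of the lower half of the ordered values and Q3 the median of the upper half; the function name `quartiles` is illustrative. Under that assumption it gives Q1 = 15 and Q3 = 31 for HiMolde Airways, and Q1 = 21.5 and Q3 = 24.5 for HiMolde Airport, which is close to (but not identical with) the rounded airport figures quoted above.

```python
import statistics

# Salaries in thousands of pounds (£k), as listed on the slide.
airways = [11, 15, 15, 18, 25, 30, 32, 38]
airport = [20, 21, 22, 23, 23, 24, 25, 26]


def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves convention.

    Assumes at least two values. For an odd number of values the
    middle value is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return statistics.median(lower_half), statistics.median(upper_half)


for name, salaries in (("Airways", airways), ("Airport", airport)):
    q1, q3 = quartiles(salaries)
    print(f"{name}: Q1 = £{q1}k, Q3 = £{q3}k, width = £{q3 - q1}k")
```

Other conventions (for example interpolating between neighbouring values) can shift the quartiles slightly, so small differences between software packages are expected.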

Standard Deviation

• Widely used in quantitative research
• Most useful measure of dispersion
• Utilises all data in the distribution
• Compares each value in the distribution with the mean
  – It examines the variance of the data around the mean
  – Therefore saying something about how representative the mean is for the data set

Standard Deviation

• Smaller SD = less variation
  – i.e. data is more concentrated around the mean
  – Greater SD = greater variation
• However
  – Size of SD is in part a reflection of the size of the mean
    • So a large SD may simply be the product of a large mean
    • Because of this, both figures should be quoted
• Extreme numbers can distort the outcome
  – BUT have less of an impact than when using the range

Standard Deviation

Salaries of office staff at HiMolde Airways:
  £11k, £15k, £15k, £18k, £25k, £30k, £32k, £38k
Salaries of office staff at HiMolde Airport:
  £20k, £21k, £22k, £23k, £23k, £24k, £25k, £26k
Standard deviation of salaries at HiMolde Airways = £10k
Standard deviation of salaries at HiMolde Airport = £2k
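
The quoted standard deviations can be approximated with `statistics.stdev`, which uses the sample formula (dividing by n - 1). That choice is an assumption: the slides do not state which formula was used, but the sample version gives about 9.7 for HiMolde Airways and exactly 2.0 for HiMolde Airport, consistent with the rounded figures above. The population version (`statistics.pstdev`, dividing by n) gives roughly 9.1 and 1.9.

```python
import statistics

# Salaries in thousands of pounds (£k), as listed on the slide.
airways = [11, 15, 15, 18, 25, 30, 32, 38]
airport = [20, 21, 22, 23, 23, 24, 25, 26]

for name, salaries in (("Airways", airways), ("Airport", airport)):
    mean = statistics.mean(salaries)
    sample_sd = statistics.stdev(salaries)       # divides by n - 1
    population_sd = statistics.pstdev(salaries)  # divides by n
    # Quote the mean alongside the SD, as the lecture recommends.
    print(f"{name}: mean £{mean}k, sample SD £{sample_sd:.2f}k, "
          f"population SD £{population_sd:.2f}k")
```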

Central Tendency & Dispersion Summary

                   Nominal       Ordinal                Interval/Ratio
Example            Male/Female   1st/2nd/3rd            Temperature
Central tendency   Mode          Median                 Mean
Dispersion         N/a           Inter-quartile range   Standard deviation

Normal Distribution & Skew

• Normal distribution
• Skew

Normal Distribution

• Normal if
  – Mean, median & mode coincide
  – Distribution is the same either side of the central values
    • e.g. see next slide
• Often referred to as a bell-shaped curve
  – 50% of the cases can be found either side of the central value
  – Values tend to be clustered around the mean
    • i.e. very few extreme values

Normal Distribution

[Bell-shaped curve: mean, median & mode coincide at the centre, with 50% of cases on either side]

Normal Distribution

• A normal distribution has certain properties
  – About 68% of cases fall within 1 SD either side of the mean
  – About 95% within 2 SDs
  – About 99.7% within 3 SDs
    • e.g. see next slide
• Other % values can be calculated using statistical tables
  – Found in some statistics books
• Normal distribution is important for sampling & hypothesis testing
  – Many statistical tests assume data will be normally distributed

[Normal curve marked from -3sd to +3sd around the mean: 68.26% of cases within ±1sd, 95.44% within ±2sd, 99.7% within ±3sd]

• Normal distribution is an ‘ideal’ type of distribution
• However, it is unlikely that data sets will be normal
• When they are not normal, they are ‘skewed’
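
The percentages of cases within 1, 2 and 3 SDs of the mean need not be looked up in tables. For a normal distribution, the share within k SDs equals erf(k / √2), where erf is the error function available in Python's `math` module. The sketch below is an illustration added here, with `share_within` as an assumed helper name. It gives about 68.27%, 95.45% and 99.73%; printed tables sometimes show slightly different last digits because of rounding.

```python
import math


def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"Within {k} SD of the mean: {100 * share_within(k):.2f}%")
```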

Skew

• +ve skew
  – Data set has a few very large values
    • i.e. most values cluster to the left
  – The mean will be larger than the median
• -ve skew
  – Data set has a few very small values
    • i.e. most values cluster to the right
  – The mean will be smaller than the median

Positive Skew

[Skewed curve: the median lies to the left of the mean]

Negative Skew

[Skewed curve: the mean lies to the left of the median]

Skew

• Skew is typically found where
  – Sample sizes are small
  – Bias has been introduced in the sampling process
• Skewed distributions can be determined
  – Visually using a histogram
  – Statistically by calculating a co-efficient of skewness (sk)

Co-efficient of Skewness

• Indicates the direction of the skew (+ve or –ve)
• Greater co-efficient = greater skew
• Normal distribution will have a co-efficient of 0

$$ sk = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}} $$
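
As a worked illustration of the co-efficient, the formula can be applied to the two salary lists used earlier in the lecture. This is a sketch under stated assumptions: the function name `skewness_coefficient` is invented here, and the sample standard deviation (`statistics.stdev`) is used because the slides do not specify which version of the SD to take.

```python
import statistics

# Salaries in thousands of pounds (£k), from the dispersion examples.
airways = [11, 15, 15, 18, 25, 30, 32, 38]
airport = [20, 21, 22, 23, 23, 24, 25, 26]


def skewness_coefficient(values):
    """sk = 3 * (mean - median) / standard deviation (sample SD assumed)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)
    return 3 * (mean - median) / sd


sk_airways = skewness_coefficient(airways)
sk_airport = skewness_coefficient(airport)

print(f"Airways sk = {sk_airways:.2f}")  # mean above median: +ve skew
print(f"Airport sk = {sk_airport:.2f}")  # mean equals median: no skew
```

The Airways salaries have a mean (23) above the median (21.5), so the co-efficient is positive. The Airport salaries are symmetric around 23, so the co-efficient is 0.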

Summary

• Pictorial representation of quantitative data
  – 3 (4) levels of measurement
    • Nominal
    • Ordinal
    • Interval / ratio
  – Range of pictorial representation available
    • Choice is determined by the level of measurement

Summary

• Statistical representation of quantitative data
  – 3 measures of central tendency
    • Mean, median, mode
    • Choice is determined by the level of measurement
  – 3 measures of dispersion
    • Range, inter-quartile range, SD
    • Choice is determined by the level of measurement
  – Normal distribution & skew represent the distribution of responses

Recommended Reading

• Chapters 1-3 in Gaur, A.S. and Gaur, S.S. (2006). Statistical Methods for Practice and Research: A Guide to Data Analysis Using SPSS. New Delhi: Response Books.

“Thank you for your attention”

Questions.…….