Statistic I: Data Collection & Handling

©[email protected] 2013

Collect, Explore & Summarise

Dr Azmi Mohd Tamil

Dept of Community Health

Universiti Kebangsaan Malaysia

FK6163


Data Collection

Data collection begins after deciding on the study design and the sampling strategy.


Data Collection

Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.


Data Collection

Information is collected on certain characteristics, attributes and the qualities of interest from the samples

These data may be quantitative or qualitative in nature.


Data Collection Techniques

• Use available information
• Observation
• Interviews
• Questionnaires
• Focus group discussion


Using Available Information

Existing records:
• Hospital records - case notes
• National registry of births & deaths
• Census data
• Data from other surveys


Disadvantages of using existing records

• Incomplete records
• Cause of death may not be verified by a physician/MD
• Missing vital information
• Difficult to decipher
• May not be representative of the target group - only severe cases go to hospital


Disadvantages of using existing records

• Delayed publication - obsolete data
• Different methods of data recording between institutions, states and countries, making comparison & pooling of data difficult
• Comparisons across time are difficult due to differences in classification, diagnostic tools, etc.


Advantages of using existing records

• Cheap
• Convenient
• In some situations, it is the only data source, e.g. accidents & suicides


Observation

Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena

• Done using defined scales
• Participant observation, e.g. PEF and asthma symptom diary
• Non-participant observation, e.g. cholesterol levels


Interviews

Oral questioning of respondents either individually or as a group.

Can be loosely structured or highly structured using a questionnaire


Administering Written Questionnaires

Self-administered:
• via mail
• by gathering respondents in one place and getting them to fill in the questionnaire
• by hand-delivering the questionnaires and collecting them later

Large non-response can distort results


Questionnaires

• Influenced by the education & attitude of the respondent, especially when self-administered
• Interviewers need to be trained
• Open-ended vs closed-ended questions
• The need for pre-testing or a pilot study


Focus group discussion

Selecting parties relevant to the research question at hand and discussing it with them in focus groups

Examples in your own field of interest?


Plan for data collection

• Permission to proceed
• Logistics - who will collect what, when and with what resources
• Quality control


Accuracy & Reliability

Accuracy - the degree to which a measurement actually measures the characteristic it is supposed to measure

Reliability is the consistency of replicate measures


Reliability & Accuracy


Accuracy & Reliability

Both are reduced by random error and systematic error from the same sources of variability:
• the data collectors
• the respondents
• the instrument


Strategies to enhance precision & accuracy

• Standardise procedures and measurement methods
• Training & certifying the data collectors
• Repetition
• Blinding


Introduction

The method of exploring and summarising data differs according to the types of variables


Dependent/Independent

Independent variables:
• Frequency of Exercise
• Food Intake

Dependent variable:
• Obesity


Explore

It is the first step in the analytic process:
• to explore the characteristics of the data
• to screen for errors and correct them
• to look for distribution patterns - normal distribution or not

The data may require transformation before further analysis using parametric methods, or may need analysis using non-parametric techniques.


Data Screening

By running frequencies, we may detect inappropriate responses

How many in the audience have 15 children and currently pregnant with the 16th?

PARITY

Parity   Frequency   Percent
1        67          30.7
2        44          20.2
3        36          16.5
4        22          10.1
5        21          9.6
6        8           3.7
7        3           1.4
8        7           3.2
9        5           2.3
10       3           1.4
11       1           .5
15       1           .5
Total    218         100.0
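Running frequencies can be sketched in a few lines of Python. The counts below are copied from the PARITY table; the cut-off `MAX_PLAUSIBLE` is a hypothetical value chosen only for illustration, and any flagged value would still need checking against the original questionnaire.

```python
from collections import Counter

# Parity frequencies copied from the PARITY table: {parity: count}.
parity_counts = {1: 67, 2: 44, 3: 36, 4: 22, 5: 21, 6: 8,
                 7: 3, 8: 7, 9: 5, 10: 3, 11: 1, 15: 1}

# Expand the table back into one value per respondent.
parity = [value for value, count in parity_counts.items() for _ in range(count)]

# "Running frequencies": count how often each value occurs.
frequencies = Counter(parity)
total = sum(frequencies.values())

# Hypothetical plausibility cut-off, assumed for this sketch only.
MAX_PLAUSIBLE = 12
suspicious = sorted(v for v in frequencies if v > MAX_PLAUSIBLE)

for value in sorted(frequencies):
    percent = 100 * frequencies[value] / total
    flag = "  <- check against source" if value in suspicious else ""
    print(f"{value:>6} {frequencies[value]:>10} {percent:>8.1f}{flag}")
```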


Data Screening

See whether the data make sense or not.

E.g. Parity 10 but age only 25.


Data Screening

By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data

Descriptive Statistics

Variable               N     Minimum   Maximum   Mean    Std. Deviation
Pre-pregnancy weight   184   32        484       53.05   33.37
Valid N (listwise)     184


Interpreting the Box Plot

Parts of the box plot, from top to bottom:
• Outlier
• Largest non-outlier
• Upper quartile
• Median
• Lower quartile
• Smallest non-outlier
• Outlier

The whiskers extend up to 1.5 times the box width (the interquartile range) from both ends of the box and end at an observed value. Three times the box width marks the boundary between "mild" and "extreme" outliers.

"mild" = closed dots; "extreme" = open dots


Data Screening

We can also make use of graphical tools such as the box plot to detect wrong data entry.

[Figure: box plot of pre-pregnancy weight (N = 184); y-axis from 0 to 600, with several cases flagged as outliers]
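The box plot rule (1.5 box widths for the whiskers, 3 box widths separating "mild" from "extreme" outliers) can be sketched as follows. This is a minimal illustration, not the exact routine of any statistics package: the quartiles come from Python's `statistics.quantiles` with its default method, and the weights are hypothetical values invented for the example, not the study data.

```python
import statistics

def classify_outliers(values):
    """Split values into mild and extreme outliers using box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # default "exclusive" method
    iqr = q3 - q1  # the "box width"
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    mild = [v for v in values
            if (outer_low <= v < inner_low) or (inner_high < v <= outer_high)]
    extreme = [v for v in values if v < outer_low or v > outer_high]
    return mild, extreme

# Hypothetical weights, invented for illustration only.
weights = [40, 45, 48, 50, 52, 54, 55, 57, 60, 62, 65, 90, 484]
mild, extreme = classify_outliers(weights)
print("mild:", mild, "extreme:", extreme)
```

A value such as 484 falls far beyond the outer fence, which is the kind of entry that should be checked against the original record.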


Data Cleaning

• Identify the extreme/wrong values
• Check with the original data source – i.e. the questionnaire
• If incorrect, do the necessary correction.
• Correction must be done before transformation, recoding and analysis.


Parameters of Data Distribution

• Mean – central value of the data
• Standard deviation – measure of how the data scatter around the mean
• Symmetry (skewness) – the degree to which the data pile up on one side of the mean
• Kurtosis – the peakedness of the distribution, i.e. how far the data scatter from the mean


Normal distribution

The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.

The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.

However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.


Normal distribution

If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;

a range of two standard deviations above and two below (± 2sd) about 95.4% of the observations; and

of three standard deviations above and three below (± 3sd) about 99.7% of the observations.

[Figure: normal curve with bands covering 68.3%, 95.4% and 99.7% of the observations]
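These percentages can be checked numerically. The sketch below relies on the standard result that the proportion of a Normal distribution lying within k standard deviations of the mean equals erf(k/√2); the function name `coverage_within` is made up for this example.

```python
import math

def coverage_within(k):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k} sd: {coverage_within(k) * 100:.1f}%")
```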


Normality

Why bother with normality?

Because it dictates the type of analysis that you can run on the data


Normality - Why? Parametric


Normality - Why? Non-parametric


Normality-How?

Explored graphically
• Histogram
• Stem & Leaf
• Box plot
• Normal probability plot
• Detrended normal plot

Explored statistically
• Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
• Skewness (0)
• Kurtosis (0)
  – + leptokurtic
  – 0 mesokurtic
  – - platykurtic


Kolmogorov-Smirnov

In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) came up with an approach for comparing distributions that did not make use of parameters.

This is known as the Kolmogorov-Smirnov test.


Skewness

Skewed to the right indicates the presence of large extreme values

Skewed to the left indicates the presence of small extreme values


Kurtosis

For symmetrical distributions only.

Describes the shape of the curve

Mesokurtic - average shaped

Leptokurtic - narrow & slim

Platykurtic - flat & wide


Skewness & Kurtosis

Skewness typically ranges from -3 to 3. The acceptable range for normality is skewness lying between -1 and 1.

Normality should not be based on skewness alone; the kurtosis measures the “peakedness” of the bell curve (see Fig. 4).

Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
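As a rough sketch of this rule of thumb, the code below computes moment-based skewness and excess kurtosis and checks both against the -1 to 1 range. The 20 values are the data set used in this deck's worked examples. Statistical packages usually apply small-sample corrections, so their figures may differ slightly from these simple moment formulas, which are an assumption of this sketch.

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) from simple central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a Normal distribution
    return skewness, excess_kurtosis

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

skew, kurt = shape_statistics(data)
within_range = -1 <= skew <= 1 and -1 <= kurt <= 1
print(f"skewness={skew:.2f}, kurtosis={kurt:.2f}, within ±1: {within_range}")
```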


Normality - Examples: Graphically

[Figure: histogram of Height; x-axis from 140.0 to 167.5, y-axis frequency from 0 to 60; Mean = 151.6, Std. Dev = 5.26, N = 218]


Q-Q Plot

This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).

If the distributional shapes differ, then the points will plot along a curve instead of a line.

Take note that the interest here is the central portion of the line; severe deviations there mean non-normality. Deviations at the “ends” of the curve signify the existence of outliers.


Normality - Examples: Graphically

[Figure: Normal Q-Q Plot of Height (Observed Value, 130 to 170, against Expected Normal, -3 to 3) and Detrended Normal Q-Q Plot of Height (Observed Value, 130 to 170, against Dev from Normal, -.2 to .6)]


Normal distribution: mean = median = mode


Normality - Examples: Statistically

Tests of Normality

          Kolmogorov-Smirnov(a)
          Statistic   df    Sig.
Height    .060        218   .052

a. Lilliefors Significance Correction

Descriptives: Height

                                    Statistic   Std. Error
Mean                                151.65      .356
95% Confidence Interval for Mean
  Lower Bound                       150.94
  Upper Bound                       152.35
5% Trimmed Mean                     151.59
Median                              151.50
Variance                            27.649
Std. Deviation                      5.258
Minimum                             139
Maximum                             168
Range                               29
Interquartile Range                 8.00
Skewness                            .148        .165
Kurtosis                            .061        .328

• Shapiro-Wilk: only if the sample size is less than 100.
• Normal distribution: mean = median = mode
• Skewness & kurtosis within ±1
• p > 0.05, so normal distribution


K-S Test


K-S Test

The K-S test is very sensitive to the sample size of the data.

For small samples (n < 20, say), the likelihood of getting p < 0.05 is low.

For large samples (n > 100), a slight deviation from normality will be reported as a non-normal distribution.


Guide to deciding on normality


Normality Transformation

[Figure: Normal Q-Q Plots of PARITY (Observed Value, 0 to 16) and of LN_PARIT (Observed Value, -.5 to 3.0), each against Expected Normal (-2 to 3)]


Types of Transformations

• Square root
• Logarithm
• Inverse
• Reflect and square root
• Reflect and logarithm
• Reflect and inverse
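A minimal sketch of how such transformations might be applied is given below, using the parity frequencies from the data screening table and comparing skewness before and after a natural-log transformation. The `reflect` helper uses one common convention (largest value + 1, minus each value), which is an assumption here; the skewness formula is the simple moment-based one, not necessarily what a statistics package reports.

```python
import math

def skewness(values):
    """Moment-based skewness (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def reflect(values):
    """Reflect values (assumed convention: max + 1 - x) before transforming."""
    top = max(values) + 1
    return [top - x for x in values]

# Parity frequencies from the data screening table: {parity: count}.
parity_counts = {1: 67, 2: 44, 3: 36, 4: 22, 5: 21, 6: 8,
                 7: 3, 8: 7, 9: 5, 10: 3, 11: 1, 15: 1}
parity = [v for v, c in parity_counts.items() for _ in range(c)]

sqrt_parity = [math.sqrt(x) for x in parity]     # square root
ln_parity = [math.log(x) for x in parity]        # logarithm
inverse_parity = [1 / x for x in parity]         # inverse

print(f"skewness of PARITY:     {skewness(parity):.2f}")
print(f"skewness of ln(PARITY): {skewness(ln_parity):.2f}")
```

On these data the log transformation brings the skewness from well above 1 to within the -1 to 1 range.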


Summarise

Summarise a large set of data by a few meaningful numbers.

Single variable analysis
• For the purpose of describing the data
• Example: in one year, what kind of cases are treated by the Psychiatric Dept?
• Tables & diagrams are usually used to describe the data
• For numerical data, measures of central tendency & spread are usually used


Frequency Table

Race      F     %
Malay     760   95.84%
Chinese   5     0.63%
Indian    0     0.00%
Others    28    3.53%
TOTAL     793   100.00%

• Illustrates the frequency observed for each category


Frequency Distribution Table

Age        No.   %
0-0.99     25    3.26%
1-4.99     78    10.18%
5-14.99    140   18.28%
15-24.99   126   16.45%
25-34.99   112   14.62%
35-44.99   90    11.75%
45-54.99   66    8.62%
55-64.99   60    7.83%
65-74.99   50    6.53%
75-84.99   16    2.09%
85+        3     0.39%
TOTAL      766

• With more than 20 observations, data are best presented as a frequency distribution table.

• Columns are divided into class & frequency.

• The modal class can be determined using such tables.


Measurement of Central Tendency & Spread


Measures of Central Tendency

• Mean
• Mode
• Median


Measures of Variability

• Standard deviation
• Inter-quartile range
• Skewness & kurtosis


Mean

The average of the data collected

To calculate the mean, add up the observed values and divide by the number of them.

A major disadvantage of the mean is that it is sensitive to outlying points


Mean: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Total of x = 648
n = 20
Mean = 648/20 = 32.4
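The calculation, and the sensitivity of the mean to outlying points, can be sketched as follows. The "typo" of 580 in place of 58 is a hypothetical data-entry error invented for this illustration.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

# Mean = total of x divided by n.
mean = sum(data) / len(data)

# Hypothetical data-entry error: the last value typed as 580 instead of 58.
with_typo = data[:-1] + [580]
mean_with_typo = sum(with_typo) / len(with_typo)

print(f"mean = {mean}, mean with one wrong value = {mean_with_typo}")
```

A single wrong entry moves the mean from 32.4 to 58.5, which is why data screening comes before summarising.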


Measures of variation - standard deviation

Tells us how closely the scores in a dataset cluster around the mean. A large S.D. indicates more varied scores.

A summary measure of the differences of each observation from the mean.

If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.

Consequently the squares of the differences are added.


sd: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Mean = 32.4; n = 20
Total of (x-mean)^2 = 3050.8
Variance = 3050.8/19 = 160.5684
sd = 160.5684^0.5 = 12.67

x       (x-mean)^2    x       (x-mean)^2
12      416.16        32      0.16
13      376.36        35      6.76
17      237.16        37      21.16
21      129.96        38      31.36
24      70.56         41      73.96
24      70.56         43      112.36
26      40.96         44      134.56
27      29.16         46      184.96
27      29.16         53      424.36
30      5.76          58      655.36
TOTAL   1405.8        TOTAL   1645
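The same steps can be reproduced in a short sketch: sum the squared differences from the mean, divide by n - 1 to get the variance, then take the square root. The variable names are chosen for this example only.

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

n = len(data)
mean = sum(data) / n

# Total of (x - mean)^2, the sum of squared differences.
sum_sq = sum((x - mean) ** 2 for x in data)

# Sample variance divides by n - 1; the sd is its square root.
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(f"sum of squares = {sum_sq:.1f}, variance = {variance:.4f}, sd = {sd:.2f}")
```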


Median

the ranked value that lies in the middle of the data

the point which has the property that half the data are greater than it, and half the data are less than it.

if n is even, average the (n/2)th and the (n/2 + 1)th ranked observations

"robust" to outliers


Median:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

(20+1)/2 = 10.5, so the median lies midway between the 10th value, which is 30, and the 11th value, which is 32.

Therefore the median is (30 + 32)/2 = 31
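The rule for odd and even n can be written as a small function. This is a plain sketch of the definition given here; Python's built-in `statistics.median` follows the same rule.

```python
def median(values):
    """Middle ranked value; for even n, the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

print(median(data))
```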


Measures of variation - quartiles

The range is very susceptible to what are known as outliers

A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.


Quartiles

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

25th percentile: 24; (24+24)/2
50th percentile: 31; (30+32)/2; = median
75th percentile: 42; (41+43)/2 (interpolating at position 0.75(n+1) = 15.75 instead gives 42.5)
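Quartile conventions differ slightly between textbooks and software, so the sketch below computes them two ways: as medians of the lower and upper halves of the ordered data, and with Python's `statistics.quantiles`, whose default method interpolates at position p(n+1). The helper name `quartiles_by_halves` is made up for this example.

```python
import statistics

def quartiles_by_halves(values):
    """Quartiles as the medians of the lower half, whole set and upper half."""
    ordered = sorted(values)
    half = len(ordered) // 2  # for odd n the middle value is left out of both halves
    lower, upper = ordered[:half], ordered[-half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

by_halves = quartiles_by_halves(data)
interpolated = statistics.quantiles(data, n=4)  # default "exclusive" method

print("medians of halves:", by_halves)
print("interpolated:     ", interpolated)
```

Both methods agree on the 25th and 50th percentiles here and differ only slightly on the 75th.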


Mode

The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.

It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.


Mode: Example

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Modes are 24 (10%) & 27 (10%)
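A short sketch of finding the modes together with their share of the observations, using `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest frequency:

```python
import statistics

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

modes = statistics.multimode(data)

# Percentage of observations taken up by each mode.
mode_percent = {m: 100 * data.count(m) / len(data) for m in modes}

print(modes, mode_percent)
```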


Mean or Median?

Which measure of central tendency should we use?

If the distribution is normal, the mean and SD should be presented; otherwise the median and IQR are more appropriate.


Normal distribution: use mean & SD

Not normal distribution: use median & IQR


Presentation

Qualitative & Quantitative Data

Charts & Tables


Presentation

Qualitative Data


Graphing Categorical Data: Univariate Data

Categorical Data
• Tabulating Data
  – The Summary Table
• Graphing Data
  – Pie Charts
  – Pareto Diagram
  – Bar Charts

[Figure: example charts for the categories Stocks, Bonds, Savings and CD]


Bar Chart

[Figure: bar chart of Type of work (Field work, Office work, Housewife) against Percent (0 to 80); bar labels 20, 11 and 69]


Pie Chart

[Figure: pie chart with slices for Malay, Chinese and Others]


Tabulating and Graphing Bivariate Categorical Data

Contingency tables:

Table 1: Contingency table of pregnancy induced hypertension and SGA (count)

Pregnancy induced   SGA
hypertension        Normal   SGA   Total
No                  103      94    197
Yes                 5        16    21
Total               108      110   218
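As an illustration of tabulating bivariate categorical data, the sketch below rebuilds the margins of Table 1 from its four cells and adds the percentage of SGA within each hypertension group. The row percentages are an extra summary computed for this example; they do not appear in the original table.

```python
# Cell counts from Table 1: rows = pregnancy induced hypertension, columns = SGA status.
table = {
    "No":  {"Normal": 103, "SGA": 94},
    "Yes": {"Normal": 5,   "SGA": 16},
}

row_totals = {pih: sum(cells.values()) for pih, cells in table.items()}
col_totals = {
    status: sum(cells[status] for cells in table.values())
    for status in ("Normal", "SGA")
}
grand_total = sum(row_totals.values())

# Percentage of SGA babies within each hypertension group (row percentages).
sga_percent = {pih: 100 * cells["SGA"] / row_totals[pih] for pih, cells in table.items()}

for pih in table:
    print(f"PIH {pih}: {sga_percent[pih]:.1f}% SGA (n = {row_totals[pih]})")
```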


Tabulating and Graphing Bivariate Categorical Data

Side by side charts

[Figure: side-by-side bar chart of Count (0 to 120) by Pregnancy induced hypertension (No, Yes), with separate bars for Normal and SGA; bar labels 103, 94 and 16]


Presentation

Quantitative Data


Tabulating and Graphing Numerical Data

Numerical Data
• Ordered Array
  – Raw data: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21
  – Ordered: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem and Leaf Display
  – 2 | 144677
  – 3 | 028
  – 4 | 1
• Frequency Distributions & Cumulative Distributions
  – Tables
  – Histograms
  – Polygons
  – Ogive

[Figure: example histogram and ogive, x-axis from 10 to 60]
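The ordered array and the stem and leaf display can be produced with a few lines of code. This sketch assumes non-negative whole numbers, with the tens digit as the stem and the units digit as the leaf; the function name is made up for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group sorted whole numbers by tens digit (stem); units digits are the leaves."""
    stems = defaultdict(str)
    for v in sorted(values):
        stems[v // 10] += str(v % 10)
    return dict(stems)

raw = [41, 24, 32, 26, 27, 27, 30, 24, 38, 21]

ordered = sorted(raw)
display = stem_and_leaf(raw)

print(ordered)
for stem, leaves in display.items():
    print(f"{stem} | {leaves}")
```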


Tabulating Numerical Data: Frequency Distributions

• Sort raw data in ascending order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 15)
• Compute class interval (width): 10 (46/5 then round up)
• Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
• Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
• Count observations & assign to classes


Frequency Distributions and Percentage Distributions

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Class         Midpoint   Freq   %
10.0 - 19.9   14.95      3      15%
20.0 - 29.9   24.95      6      30%
30.0 - 39.9   34.95      5      25%
40.0 - 49.9   44.95      4      20%
50.0 - 59.9   54.95      2      10%
TOTAL                    20     100%
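The steps for building this table can be sketched as below. Taking the first class boundary as the minimum rounded down to a multiple of the class width is an assumption of this sketch that happens to reproduce the boundaries 10, 20, ..., 60. The cumulative counts at the end are an extra, of the kind an ogive would plot.

```python
import math
from itertools import accumulate

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

num_classes = 5
data_range = max(data) - min(data)            # 58 - 12 = 46
width = math.ceil(data_range / num_classes)   # 46/5 = 9.2, rounded up to 10
start = (min(data) // width) * width          # assumed lower boundary: 10

# Count observations and assign them to classes.
counts = [0] * num_classes
for x in data:
    counts[(x - start) // width] += 1

percents = [100 * c / len(data) for c in counts]
# Midpoints of the class limits 10.0-19.9, 20.0-29.9, ... as used in the table.
midpoints = [round((start + i * width + start + (i + 1) * width - 0.1) / 2, 2)
             for i in range(num_classes)]
cumulative = list(accumulate(counts))         # running totals, as plotted in an ogive

for mid, c, p in zip(midpoints, counts, percents):
    print(f"{mid:>6} {c:>3} {p:>5.0f}%")
```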


Graphing Numerical Data: The Histogram

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of Age; x-axis labelled with the class midpoints 14.95, 24.95, 34.95, 44.95 and 54.95, bars meeting at the class boundaries; y-axis Frequency (0 to 7); bar heights 3, 6, 5, 4 and 2; no gaps between bars]


Graphing Numerical Data: The Frequency Polygon

Data in ordered array: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: frequency polygon plotted at the class midpoints 14.95, 24.95, 34.95, 44.95 and 54.95; y-axis from 0 to 7]


Linear Regression Line


Survival Function

[Figure: survival function; x-axis DURATION (0 to 7), y-axis Cum Survival (0.0 to 1.2); censored observations marked]


Principles of Graphical Excellence

Presents data in a way that provides substance, statistics and design

Communicates complex ideas with clarity, precision and efficiency

Gives the largest number of ideas in the most efficient manner

Almost always involves several dimensions

Tells the truth about the data
