Statistic I: Data Collection & Handling
![Page 1: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/1.jpg)
©[email protected] 2013
Collect, Explore & Summarise
Dr Azmi Mohd Tamil
Dept of Community Health
Universiti Kebangsaan Malaysia
FK6163
![Page 2: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/2.jpg)
Data Collection
Data collection begins after deciding on the design of the study and the sampling strategy.
![Page 3: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/3.jpg)
Data Collection
Sample subjects are identified and the required individual information is obtained in an item-wise and structured manner.
![Page 4: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/4.jpg)
Data Collection
Information is collected on certain characteristics, attributes and the qualities of interest from the samples
These data may be quantitative or qualitative in nature.
![Page 5: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/5.jpg)
Data Collection Techniques
• Use available information
• Observation
• Interviews
• Questionnaires
• Focus group discussion
![Page 6: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/6.jpg)
Using Available Information
Existing Records
• Hospital records - case notes
• National registry of births & deaths
• Census data
• Data from other surveys
![Page 7: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/7.jpg)
Disadvantages of using existing records
• Incomplete records
• Cause of death may not be verified by a physician/MD
• Missing vital information
• Difficult to decipher
• May not be representative of the target group - only severe cases go to hospital
![Page 8: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/8.jpg)
![Page 9: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/9.jpg)
Disadvantages of using existing records
• Delayed publication - obsolete data
• Different methods of data recording between institutions, states and countries, making comparison & pooling of data difficult
• Comparisons across time are difficult due to differences in classification, diagnostic tools etc.
![Page 10: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/10.jpg)
Advantages of using existing records
• Cheap
• Convenient
• In some situations, it is the only data source, e.g. accidents & suicides
![Page 11: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/11.jpg)
Observation
Involves systematically selecting, watching & recording behaviour and characteristics of living beings, objects or phenomena
• Done using defined scales
• Participant observation e.g. PEF and asthma symptom diary
• Non-participant observation e.g. cholesterol levels
![Page 12: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/12.jpg)
Interviews
Oral questioning of respondents either individually or as a group.
Can be loosely structured, or highly structured using a questionnaire.
![Page 13: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/13.jpg)
Administering Written Questionnaires
Self-administered:
• via mail
• by gathering respondents in one place and getting them to fill it in
• by hand-delivering the questionnaires and collecting them later
Large non-response can distort results.
![Page 14: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/14.jpg)
Questionnaires
• Influenced by education & attitude of respondent, esp. for self-administered questionnaires
• Interviewers need to be trained
• Open ended vs close ended questions
• The need for pre-testing or a pilot study
![Page 15: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/15.jpg)
Focus group discussion
Selecting relevant parties to the research questions at hand and discussing with them in focus groups
Examples in your own field of interest?
![Page 16: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/16.jpg)
Plan for data collection
• Permission to proceed
• Logistics - who will collect what, when and with what resources
• Quality control
![Page 17: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/17.jpg)
Accuracy & Reliability
Accuracy - the degree to which a measurement actually measures the characteristic it is supposed to measure
Reliability is the consistency of replicate measures
![Page 19: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/19.jpg)
Accuracy & Reliability
Both are reduced by random error and systematic error from the same sources of variability:
• the data collectors
• the respondents
• the instrument
![Page 20: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/20.jpg)
Strategies to enhance precision & accuracy
• Standardise procedures and measurement methods
• Training & certifying the data collectors
• Repetition
• Blinding
![Page 21: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/21.jpg)
Introduction
The method of exploring and summarising data differs according to the types of variables.
![Page 22: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/22.jpg)
Dependent/Independent
Independent Variables: Frequency of Exercise, Food Intake
Dependent Variable: Obesity
![Page 23: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/23.jpg)
![Page 24: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/24.jpg)
Explore
It is the first step in the analytic process:
• to explore the characteristics of the data
• to screen for errors and correct them
• to look for distribution patterns - normal distribution or not
The data may require transformation before further analysis using parametric methods, or may need analysis using non-parametric techniques.
![Page 25: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/25.jpg)
Data Screening
By running frequencies, we may detect inappropriate responses
How many in the audience have 15 children and are currently pregnant with the 16th?
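PARITY

| Valid | Frequency | Percent |
|---|---|---|
| 1 | 67 | 30.7 |
| 2 | 44 | 20.2 |
| 3 | 36 | 16.5 |
| 4 | 22 | 10.1 |
| 5 | 21 | 9.6 |
| 6 | 8 | 3.7 |
| 7 | 3 | 1.4 |
| 8 | 7 | 3.2 |
| 9 | 5 | 2.3 |
| 10 | 3 | 1.4 |
| 11 | 1 | .5 |
| 15 | 1 | .5 |
| Total | 218 | 100.0 |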
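The screening step above can be sketched in Python. This is a minimal illustration, not the SPSS procedure used in the slides: the counts are copied from the PARITY table, while the function names and the plausibility limit of 12 are assumptions made only for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(v, c, round(100 * c / total, 1)) for v, c in sorted(counts.items())]

def flag_implausible(values, lowest, highest):
    """Return the sorted distinct values that fall outside [lowest, highest]."""
    return sorted({v for v in values if not lowest <= v <= highest})

# Parity counts copied from the frequency table in the slide.
parity_counts = {1: 67, 2: 44, 3: 36, 4: 22, 5: 21, 6: 8,
                 7: 3, 8: 7, 9: 5, 10: 3, 11: 1, 15: 1}
parity = [value for value, n in parity_counts.items() for _ in range(n)]

for value, count, percent in frequency_table(parity):
    print(f"{value:>5} {count:>5} {percent:>6}")

# The upper limit of 12 is a hypothetical plausibility cut-off, not from the slides.
suspicious = flag_implausible(parity, lowest=0, highest=12)
print("Values to check against the questionnaires:", suspicious)
```

Any flagged value is only a candidate error; it still has to be checked against the original record.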
![Page 26: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/26.jpg)
Data Screening
See whether the data make sense or not.
E.g. Parity 10 but age only 25.
![Page 27: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/27.jpg)
![Page 28: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/28.jpg)
![Page 29: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/29.jpg)
Data Screening
By looking at measures of central tendency and range, we can also detect abnormal values for quantitative data
Descriptive Statistics

| | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| Pre-pregnancy weight | 184 | 32 | 484 | 53.05 | 33.37 |
| Valid N (listwise) | 184 | | | | |
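A rough Python sketch of the same check follows. The weights are hypothetical values invented for illustration (the slide's raw data are not available), and the 30 to 150 kg limits are an assumed plausible range rather than anything stated in the slides.

```python
import statistics

def describe(values):
    """Summarise a numeric variable the way a descriptives table does."""
    return {
        "n": len(values),
        "minimum": min(values),
        "maximum": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }

def out_of_range(values, lowest, highest):
    """Return (index, value) pairs that fall outside [lowest, highest]."""
    return [(i, v) for i, v in enumerate(values) if not lowest <= v <= highest]

# Hypothetical weights in kg; 484 mimics the kind of entry error shown above.
weights = [48.0, 52.5, 61.0, 484, 45.5, 57.0]
summary = describe(weights)
# The 30-150 kg limits are an assumed plausible range for this illustration.
suspect = out_of_range(weights, lowest=30, highest=150)
print(summary)
print("Entries to verify:", suspect)
```

A maximum far above any plausible weight, together with an inflated standard deviation, is the signal to go back to the source record.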
![Page 30: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/30.jpg)
Interpreting the Box Plot
Labels on the box plot, from top to bottom: outlier, largest non-outlier, upper quartile, median, lower quartile, smallest non-outlier, outlier.
The whiskers extend up to 1.5 times the box width from both ends of the box and end at an observed value. Three times the box width marks the boundary between "mild" and "extreme" outliers.
"mild" = closed dots; "extreme" = open dots
![Page 31: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/31.jpg)
Data Screening
We can also make use of graphical tools such as the box plot to detect wrong data entry.
[Figure: box plot of Pre-pregnancy weight, N = 184, vertical axis from 0 to 600, with several cases flagged as outliers]
![Page 32: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/32.jpg)
Data Cleaning
• Identify the extreme/wrong values
• Check with the original data source, i.e. the questionnaire
• If incorrect, do the necessary correction.
• Correction must be done before transformation, recoding and analysis.
![Page 33: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/33.jpg)
Parameters of Data Distribution
• Mean - central value of the data
• Standard deviation - measure of how the data scatter around the mean
• Symmetry (skewness) - the degree to which the data pile up on one side of the mean
• Kurtosis - how peaked or flat the distribution is around the mean
![Page 34: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/34.jpg)
Normal distribution
The Normal distribution is represented by a family of curves defined uniquely by two parameters, which are the mean and the standard deviation of the population.
The curves are always symmetrically bell shaped, but the extent to which the bell is compressed or flattened out depends on the standard deviation of the population.
However, the mere fact that a curve is bell shaped does not mean that it represents a Normal distribution, because other distributions may have a similar sort of shape.
![Page 35: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/35.jpg)
Normal distribution
If the observations follow a Normal distribution, a range covered by one standard deviation above the mean and one standard deviation below it includes about 68.3% of the observations;
a range of two standard deviations above and two below (± 2 sd) about 95.4% of the observations; and
of three standard deviations above and three below (± 3 sd) about 99.7% of the observations.
[Figure: Normal curve with the 68.3%, 95.4% and 99.7% ranges marked]
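These percentages can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `coverage_within` is an assumption made for this example.

```python
from statistics import NormalDist

def coverage_within(k):
    """Proportion of a Normal distribution lying within k sd of the mean."""
    standard = NormalDist(mu=0, sigma=1)
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {coverage_within(k) * 100:.1f}%")
# within 1 sd: 68.3%
# within 2 sd: 95.4%
# within 3 sd: 99.7%
```

The result does not depend on the particular mean and standard deviation, which is why the standard Normal distribution (mean 0, sd 1) is enough.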
![Page 36: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/36.jpg)
Normality
Why bother with normality?
Because it dictates the type of analysis that you can run on the data.
![Page 39: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/39.jpg)
Normality-How?
Explored graphically
• Histogram
• Stem & Leaf
• Box plot
• Normal probability plot
• Detrended normal plot

Explored statistically
• Kolmogorov-Smirnov statistic, with Lilliefors significance level, and the Shapiro-Wilk statistic
• Skewness (0)
• Kurtosis (0)
– positive: leptokurtic
– zero: mesokurtic
– negative: platykurtic
![Page 40: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/40.jpg)
Kolmogorov-Smirnov
In the 1930s, Andrei Nikolaevich Kolmogorov (1903-1987) and N.V. Smirnov (his student) developed an approach for comparing distributions that did not make use of parameters.
This is known as the Kolmogorov-Smirnov test.
![Page 41: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/41.jpg)
Skewness
Skewed to the right indicates the presence of large extreme values
Skewed to the left indicates the presence of small extreme values
![Page 42: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/42.jpg)
Kurtosis
For symmetrical distribution only.
Describes the shape of the curve
Mesokurtic - average shaped
Leptokurtic - narrow & slim
Platykurtic - flat & wide
![Page 43: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/43.jpg)
Skewness & Kurtosis
• Skewness ranges from -3 to 3.
• The acceptable range for normality is skewness lying between -1 and 1.
• Normality should not be based on skewness alone; the kurtosis measures the "peakedness" of the bell curve (see Fig. 4).
• Likewise, the acceptable range for normality is kurtosis lying between -1 and 1.
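As a rough illustration of this rule of thumb, the sketch below computes moment-based skewness and excess kurtosis. It is an assumption-laden sketch: SPSS applies small-sample adjustments to both statistics, so its values will differ somewhat from these simple formulas, and the function names are invented for this example.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetrical sample."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Normal curve scores about 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

def within_rule_of_thumb(values, limit=1.0):
    """Apply the -1 to 1 screening rule to both statistics."""
    return abs(skewness(values)) <= limit and abs(excess_kurtosis(values)) <= limit

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
print(round(skewness(ages), 3), round(excess_kurtosis(ages), 3))
print("Passes the -1 to 1 screen:", within_rule_of_thumb(ages))
```

A positive skewness indicates a longer right tail and a negative value a longer left tail, matching the description of skewed distributions earlier in the slides.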
![Page 44: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/44.jpg)
![Page 45: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/45.jpg)
Normality - Examples: Graphically
[Figure: histogram of Height, horizontal axis from 140.0 to 167.5, frequency axis from 0 to 60; Mean = 151.6, Std. Dev = 5.26, N = 218]
![Page 46: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/46.jpg)
Q-Q Plot
This plot compares the quantiles of a data distribution with the quantiles of a standardised theoretical distribution from a specified family of distributions (in this case, the normal distribution).
If the distributional shapes differ, then the points will plot along a curve instead of a line.
Take note that the interest here is the central portion of the line; severe deviations there mean non-normality. Deviations at the "ends" of the curve signify the existence of outliers.
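The points of a normal Q-Q plot can be computed as sketched below. The plotting position (i - 0.5)/n is one common convention and is an assumption here; statistical packages, SPSS included, may use a slightly different formula.

```python
from statistics import NormalDist

def normal_qq_points(values):
    """Return (observed value, expected normal z-score) pairs for a Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    standard = NormalDist()
    # (i - 0.5) / n is an assumed plotting position, one of several in use.
    return [(x, standard.inv_cdf((i - 0.5) / n))
            for i, x in enumerate(ordered, start=1)]

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
for observed, expected in normal_qq_points(ages):
    print(f"{observed:>4} {expected:>7.3f}")
```

Plotting the expected value against the observed value gives the Q-Q plot; points close to a straight line suggest an approximately normal distribution.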
![Page 47: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/47.jpg)
Normality - Examples: Graphically
[Figure: Normal Q-Q Plot of Height, Expected Normal (-3 to 3) against Observed Value (130 to 170)]
[Figure: Detrended Normal Q-Q Plot of Height, Dev from Normal (-.2 to .6) against Observed Value (130 to 170)]
![Page 49: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/49.jpg)
Normality - Examples: Statistically

Tests of Normality

| | Kolmogorov-Smirnov(a) Statistic | df | Sig. |
|---|---|---|---|
| Height | .060 | 218 | .052 |

a. Lilliefors Significance Correction

Descriptives (Height)

| | Statistic | Std. Error |
|---|---|---|
| Mean | 151.65 | .356 |
| 95% Confidence Interval for Mean, Lower Bound | 150.94 | |
| 95% Confidence Interval for Mean, Upper Bound | 152.35 | |
| 5% Trimmed Mean | 151.59 | |
| Median | 151.50 | |
| Variance | 27.649 | |
| Std. Deviation | 5.258 | |
| Minimum | 139 | |
| Maximum | 168 | |
| Range | 29 | |
| Interquartile Range | 8.00 | |
| Skewness | .148 | .165 |
| Kurtosis | .061 | .328 |

• Shapiro-Wilk: only if the sample size is less than 100.
• Normal distribution: mean = median = mode
• Skewness & kurtosis within ±1
• p > 0.05, so normal distribution
![Page 51: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/51.jpg)
K-S Test
• Very sensitive to the sample size of the data.
• For small samples (n<20, say), the likelihood of getting p<0.05 is low.
• For large samples (n>100), a slight deviation from normality will result in the data being reported as not normally distributed.
![Page 53: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/53.jpg)
Normality Transformation
[Figure: Normal Q-Q Plot of PARITY, Expected Normal (-2 to 3) against Observed Value (0 to 16)]
[Figure: Normal Q-Q Plot of LN_PARIT, Expected Normal (-2 to 3) against Observed Value (-.5 to 3.0)]
![Page 54: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/54.jpg)
TYPES OF TRANSFORMATIONS
• Square root
• Logarithm
• Inverse
• Reflect and square root
• Reflect and logarithm
• Reflect and inverse
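A minimal sketch of these six transformations follows. Two assumptions are made that the slides do not state: reflection is taken as (largest value + 1) minus each score, a common convention, and the logarithm is the natural log, as suggested by the LN_PARIT variable name. All values must be positive for the logarithm and inverse to be defined.

```python
import math

def reflect(values):
    """Reflect scores so the largest becomes 1 (assumed convention: max + 1 - x)."""
    pivot = max(values) + 1
    return [pivot - x for x in values]

TRANSFORMS = {
    "square root": math.sqrt,
    "logarithm": math.log,  # natural log, assumed from the LN_ prefix
    "inverse": lambda x: 1 / x,
}

def transform(values, kind, reflected=False):
    """Apply one of the listed transformations, optionally after reflecting."""
    data = reflect(values) if reflected else list(values)
    return [TRANSFORMS[kind](x) for x in data]

parity = [1, 1, 2, 2, 3, 4, 6, 11]  # hypothetical right-skewed values
print(transform(parity, "logarithm"))
print(transform(parity, "square root", reflected=True))
```

The reflected versions are usually applied to left-skewed data; note that reflection reverses the ordering of the scores, which affects how results are interpreted.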
![Page 55: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/55.jpg)
Summarise
Summarise a large set of data by a few meaningful numbers.
Single variable analysis
• For the purpose of describing the data
• Example: in one year, what kind of cases are treated by the Psychiatric Dept?
• Tables & diagrams are usually used to describe the data
• For numerical data, measures of central tendency & spread are usually used
![Page 56: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/56.jpg)
Frequency Table
| Race | F | % |
|---|---|---|
| Malay | 760 | 95.84% |
| Chinese | 5 | 0.63% |
| Indian | 0 | 0.00% |
| Others | 28 | 3.53% |
| TOTAL | 793 | 100.00% |

• Illustrates the frequency observed for each category
![Page 57: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/57.jpg)
Frequency Distribution Table
| Age | No. | % |
|---|---|---|
| 0-0.99 | 25 | 3.26% |
| 1-4.99 | 78 | 10.18% |
| 5-14.99 | 140 | 18.28% |
| 15-24.99 | 126 | 16.45% |
| 25-34.99 | 112 | 14.62% |
| 35-44.99 | 90 | 11.75% |
| 45-54.99 | 66 | 8.62% |
| 55-64.99 | 60 | 7.83% |
| 65-74.99 | 50 | 6.53% |
| 75-84.99 | 16 | 2.09% |
| 85+ | 3 | 0.39% |
| TOTAL | 766 | |

• With more than 20 observations, data are best presented as a frequency distribution table.
• Columns are divided into class & frequency.
• The modal class can be determined using such tables.
![Page 60: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/60.jpg)
Measures of Variability
• Standard deviation
• Interquartile range
• Skewness & kurtosis
![Page 61: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/61.jpg)
Mean
• The average of the data collected
• To calculate the mean, add up the observed values and divide by the number of them.
• A major disadvantage of the mean is that it is sensitive to outlying points
![Page 62: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/62.jpg)
Mean: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Total of x = 648
n = 20
Mean = 648/20 = 32.4
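The same calculation in Python, as a small sketch. The second part swaps the largest value for a hypothetical entry error (580 instead of 58) to show how sensitive the mean is to a single outlying point.

```python
ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

total = sum(ages)
n = len(ages)
mean = total / n
print(total, n, mean)  # 648 20 32.4

# Hypothetical entry error: 58 typed as 580.
with_error = ages[:-1] + [580]
mean_with_error = sum(with_error) / len(with_error)
print(mean_with_error)  # 58.5
```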
![Page 63: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/63.jpg)
Measures of variation - standard deviation
Tells us how closely the scores in a dataset cluster around the mean. A large S.D. indicates more varied scores.
a summary measure of the differences of each observation from the mean.
If the differences themselves were added up, the positive would exactly balance the negative and so their sum would be zero.
Consequently the squares of the differences are added.
![Page 64: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/64.jpg)
![Page 65: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/65.jpg)
sd: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Mean = 32.4; n = 20
Total of (x-mean)^2 = 3050.8
Variance = 3050.8/19 = 160.5684
sd = 160.5684^0.5 = 12.67
x (x-mean)^2 x (x-mean)^2
12 416.16 32 0.16
13 376.36 35 6.76
17 237.16 37 21.16
21 129.96 38 31.36
24 70.56 41 73.96
24 70.56 43 112.36
26 40.96 44 134.56
27 29.16 46 184.96
27 29.16 53 424.36
30 5.76 58 655.36
TOTAL 1405.8 TOTAL 1645
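The worked example can be reproduced step by step in Python. This sketch follows the slide's sample formula (dividing by n - 1 = 19) and cross-checks the result against `statistics.stdev` from the standard library.

```python
import math
import statistics

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

mean = sum(ages) / len(ages)
squared_differences = [(x - mean) ** 2 for x in ages]
sum_of_squares = sum(squared_differences)
variance = sum_of_squares / (len(ages) - 1)  # sample variance, n - 1 = 19
sd = math.sqrt(variance)

print(round(sum_of_squares, 1), round(variance, 4), round(sd, 2))
# 3050.8 160.5684 12.67

# The library function uses the same n - 1 formula.
print(round(statistics.stdev(ages), 2))  # 12.67
```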
![Page 66: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/66.jpg)
Median
the ranked value that lies in the middle of the data
the point which has the property that half the data are greater than it, and half the data are less than it.
If n is even, average the (n/2)th and the (n/2 + 1)th ranked observations.
"robust" to outliers
![Page 67: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/67.jpg)
Median: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
(20+1)/2 = 10.5, so the median lies between the 10th value, which is 30, and the 11th value, which is 32.
Therefore the median is (30 + 32)/2 = 31
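A short sketch of the same rule in Python. The function is written out by hand to mirror the steps; the last lines repeat the hypothetical entry error of 580 to show that the median, unlike the mean, does not move.

```python
def median(values):
    """Middle ranked value; the average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
print(median(ages))  # 31.0

# Hypothetical entry error: 58 typed as 580. The median is unchanged.
print(median(ages[:-1] + [580]))  # 31.0
```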
![Page 68: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/68.jpg)
Measures of variation - quartiles
The range is very susceptible to what are known as outliers
A more robust approach is to divide the distribution of the data into four, and find the points below which are 25%, 50% and 75% of the distribution. These are known as quartiles, and the median is the second quartile.
![Page 69: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/69.jpg)
Quartiles
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
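• 25th percentile: 24; (24+24)/2
• 50th percentile: 31; (30+32)/2; = median
• 75th percentile: 42.5 by interpolation at position (20+1) × 0.75 = 15.75, i.e. 41 + 0.75 × (43 - 41); simply averaging the 15th and 16th values, (41+43)/2, gives 42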
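Quartile values depend on the method used. The sketch below contrasts two approaches on the same data: taking the median of each half, and the interpolation at position (n + 1) × p that `statistics.quantiles` applies by default. The helper name `quartiles_by_halves` is an assumption for this example.

```python
import statistics

def quartiles_by_halves(values):
    """Quartiles as the medians of the lower half, the whole set and the upper half."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

print(quartiles_by_halves(ages))         # (24.0, 31.0, 42.0)
print(statistics.quantiles(ages, n=4))   # [24.0, 31.0, 42.5]
```

The upper quartile of 42.5 quoted for this data set corresponds to the interpolation method, while averaging 41 and 43 gives 42. Either is defensible as long as the method is reported consistently.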
![Page 70: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/70.jpg)
Mode
The most frequently occurring number. E.g. 3, 13, 13, 20, 22, 25: mode = 13.
It is usually more informative to quote the mode accompanied by the percentage of times it happened; e.g., the mode is 13 with 33% of the occurrences.
![Page 71: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/71.jpg)
Mode: Example
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Modes are 24 (10%) & 27 (10%)
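A small sketch that finds every mode and reports each with its share of the observations, using `statistics.multimode` from the standard library.

```python
import statistics

ages = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

modes = statistics.multimode(ages)
shares = {m: 100 * ages.count(m) / len(ages) for m in modes}
print(modes)   # [24, 27]
print(shares)  # {24: 10.0, 27: 10.0}
```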
![Page 72: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/72.jpg)
Mean or Median?
Which measure of central tendency should we use?
If the distribution is normal, present the mean with the sd; otherwise the median with the IQR is more appropriate.
![Page 76: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/76.jpg)
Graphing Categorical Data: Univariate Data
Categorical Data
• Tabulating Data: The Summary Table
• Graphing Data: Pie Charts, Bar Charts, Pareto Diagram
[Figure: example bar chart, pie chart and Pareto diagram for the categories Stocks, Bonds, Savings and CD]
![Page 77: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/77.jpg)
Bar Chart
[Figure: bar chart of Type of work (Field work, Office work, Housewife) against Percent, axis from 0 to 80; bar labels 20, 11 and 69]
![Page 79: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/79.jpg)
Tabulating and Graphing Bivariate Categorical Data
Contingency tables:
Table 1: Contingency table of pregnancy induced hypertension and SGA (Count)

| Pregnancy induced hypertension | Normal | SGA | Total |
|---|---|---|---|
| No | 103 | 94 | 197 |
| Yes | 5 | 16 | 21 |
| Total | 108 | 110 | 218 |
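The margins of Table 1 can be rebuilt from the four cell counts, as in the sketch below. The row percentages are an extra descriptive step added for illustration; they are not shown in the slides.

```python
# Cell counts copied from Table 1 (rows: pregnancy induced hypertension).
table = {
    "No": {"Normal": 103, "SGA": 94},
    "Yes": {"Normal": 5, "SGA": 16},
}

row_totals = {row: sum(cells.values()) for row, cells in table.items()}
column_totals = {
    column: sum(cells[column] for cells in table.values())
    for column in ("Normal", "SGA")
}
grand_total = sum(row_totals.values())

# Percentage of SGA babies within each hypertension group.
sga_percent = {
    row: round(100 * cells["SGA"] / row_totals[row], 1)
    for row, cells in table.items()
}

print(row_totals, column_totals, grand_total)
print(sga_percent)  # {'No': 47.7, 'Yes': 76.2}
```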
![Page 80: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/80.jpg)
Tabulating and Graphing Bivariate Categorical Data
Side by side charts
[Figure: side-by-side bar chart of Count (0 to 120) by Pregnancy induced hypertension (No, Yes), with separate bars for Normal and SGA; bar labels 103, 94 and 16]
![Page 82: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/82.jpg)
Tabulating and Graphing Numerical Data
Numerical Data
• Ordered Array: 41, 24, 32, 26, 27, 27, 30, 24, 38, 21 becomes 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• Stem and Leaf Display:
2 144677
3 028
4 1
• Frequency Distributions and Cumulative Distributions: Tables, Histograms, Polygons, Ogive
[Figure: example histogram and ogive over the classes 10 to 60]
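The ordered array and the stem and leaf display can be produced as sketched below. The sketch assumes whole-number values with the tens digit as the stem and the units digit as the leaf; the function name is invented for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines with the tens digit as stem and units digits as leaves."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(str(leaf))
    return [f"{stem} {''.join(leaves)}" for stem, leaves in sorted(stems.items())]

raw = [41, 24, 32, 26, 27, 27, 30, 24, 38, 21]
print(sorted(raw))  # the ordered array
for line in stem_and_leaf(raw):
    print(line)
# 2 144677
# 3 028
# 4 1
```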
![Page 83: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/83.jpg)
Tabulating Numerical Data: Frequency Distributions
Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 10 (46/5 then round up)
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 14.95, 24.95, 34.95, 44.95, 54.95
Count observations & assign to classes
![Page 84: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/84.jpg)
Frequency Distributions and Percentage Distributions
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Class Midpoint Freq %
10.0 - 19.9 14.95 3 15%
20.0 - 29.9 24.95 6 30%
30.0 - 39.9 34.95 5 25%
40.0 - 49.9 44.95 4 20%
50.0 - 59.9 54.95 2 10%
TOTAL 20 100%
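The steps for building this table can be sketched in Python. The starting boundary of 10 and the choice of 5 classes follow the slides; the function name and the (lower, upper, count, percent) row layout are assumptions for this example.

```python
import math

def frequency_distribution(values, start, width, classes):
    """Return (lower, upper, count, percent) rows for equal-width classes."""
    counts = [0] * classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < classes:
            counts[index] += 1
    total = len(values)
    return [(start + i * width, start + (i + 1) * width, c, 100 * c / total)
            for i, c in enumerate(counts)]

ages = sorted([12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
               32, 35, 37, 38, 41, 43, 44, 46, 53, 58])

data_range = ages[-1] - ages[0]          # 58 - 12 = 46
classes = 5
width = math.ceil(data_range / classes)  # 46/5 = 9.2, rounded up to 10
rows = frequency_distribution(ages, start=10, width=width, classes=classes)
for lower, upper, count, percent in rows:
    print(f"{lower} to under {upper}: {count} ({percent:.0f}%)")
```

Each class here includes its lower boundary and excludes its upper boundary, which matches the 10.0 - 19.9, 20.0 - 29.9 style of class labels for these whole-number values.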
![Page 85: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/85.jpg)
Graphing Numerical Data: The Histogram
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: histogram of Age against Frequency (0 to 7); bars at class midpoints 14.95, 24.95, 34.95, 44.95 and 54.95 with heights 3, 6, 5, 4 and 2; class boundaries marked; no gaps between bars]
![Page 86: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/86.jpg)
Graphing Numerical Data: The Frequency Polygon
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
[Figure: frequency polygon plotted at the class midpoints 14.95, 24.95, 34.95, 44.95 and 54.95; frequency axis from 0 to 7]
![Page 88: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/88.jpg)
Survival Function
[Figure: survival function, Cum Survival (0.0 to 1.2) against DURATION (0 to 7), with censored observations marked]
![Page 89: Statistic I: Data Collection & Handling](https://reader033.fdocuments.us/reader033/viewer/2022052905/55849863d8b42adf458b4c14/html5/thumbnails/89.jpg)
Principles of Graphical Excellence
Presents data in a way that provides substance, statistics and design
Communicates complex ideas with clarity, precision and efficiency
Gives the largest number of ideas in the most efficient manner
Almost always involves several dimensions
Tells the truth about the data