Data Handling II: Describing and Depicting your Data
Dr Yanzhong Wang
Lecturer in Medical Statistics
Division of Health and Social Care Research
King's College London
Email: [email protected]
Drug Development Statistics & Data Management
2
Types of data
• Quantitative data
  – continuous, discrete
  – distributions may be symmetric or skewed
• Qualitative (categorical) data
  – binary
  – nominal, ordinal
3
Skewed Distributions
[Figure: two frequency histograms. Positively skewed data: long tail to right. Negatively skewed data: long tail to left.]
4
Symmetric Distribution
[Figure: curve of a symmetric distribution. x-axis: 0 to 6; y-axis: 0 to .4.]
5
Summary statistics
• ‘Where the data are’ - location
  – mean, median, mode, geometric mean
• Used to describe baseline data and main outcomes
• ‘How variable the data are’ - spread
  – standard deviation, variance, range, interquartile range, 95% range
• Needed (primarily) to describe baseline data in RCT and cohort study
6
Definition of the Mean
The mean of a sample of values is the arithmetic average and is determined by dividing the sum of the values by the number of the values.
7
Definition of the Median
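The median is the middle value of the ordered data (with an even number of values, the average of the two middle values).
It is not affected by skewness or outliers, but in theory it is less precise than the mean.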
Ordered Blood Glucose Values
2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0 4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5 4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0
8
Definition of the Mode
The mode is the most frequent value.
9
2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0 4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5 4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0
Ordered Blood Glucose Values
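As a rough illustration, the three measures of location defined above can be computed for the 40 ordered blood glucose values with Python's standard `statistics` module. The variable names (`glucose`, `mean_value` and so on) are assumptions made for this sketch and do not come from the slides.

```python
import statistics

# The 40 ordered blood glucose values (mmol/litre) listed on the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

mean_value = statistics.mean(glucose)      # sum of the values / number of values
median_value = statistics.median(glucose)  # average of the 20th and 21st values here
mode_value = statistics.mode(glucose)      # most frequent value

print(f"mean   = {mean_value:.3f}")
print(f"median = {median_value:.1f}")
print(f"mode   = {mode_value:.1f}")
```

For these data the mean (about 4.06) and the median (4.0) are close together, while the mode (3.6, which occurs four times) sits somewhat lower.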
10
Location = Central Tendency
[Figure: histogram of the blood glucose values. x-axis: blood glucose (mmol/litre), 2 to 6; y-axis: count, 0 to 7.]
• Arithmetic mean - outlier prone
• Median - only uses relative magnitudes
• Mode - not necessarily central (categorical data)
11
Relation of mean, median and mode
• If the distribution is unimodal (has only one mode) then:
  – Mean = median = mode for a symmetric distribution.
  – Mean > median > mode for a positively skewed distribution.
  – Mean < median < mode for a negatively skewed distribution.
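A small sketch of the ordering above, using a made-up positively skewed sample (the values in `sample` are hypothetical and were chosen so that the pattern holds; it is a typical pattern rather than a guarantee for every data set). Mirroring the sample about zero gives a negatively skewed sample with the reverse ordering.

```python
import statistics

# Hypothetical positively skewed sample (long tail to the right); not from the slides.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]

mean_value = statistics.mean(sample)
median_value = statistics.median(sample)
mode_value = statistics.mode(sample)
print("positively skewed:", mean_value, median_value, mode_value)

# Mirror the sample about zero to obtain a negatively skewed sample.
mirrored = [-x for x in sample]
mirrored_mean = statistics.mean(mirrored)
mirrored_median = statistics.median(mirrored)
mirrored_mode = statistics.mode(mirrored)
print("negatively skewed:", mirrored_mean, mirrored_median, mirrored_mode)
```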
12
Serum Triglyceride Levels from Cord Blood of 282 Babies
[Figure: histogram. x-axis: serum triglyceride levels, 0.2 to 1.7; y-axis: count, 0 to 80.]
13
Log(Serum Triglyceride Levels) from Cord Blood of 282 Babies
[Figure: histogram. x-axis: log(serum triglyceride) levels, -1.9 to 0.5; y-axis: count, 0 to 35.]
14
Definition of the Geometric Mean
The geometric mean of a sample of n values is determined by multiplying all the values together and taking the nth root (for only two values this is the more familiar square root).
15
Geometric Mean
• A common example of when the geometric mean is the appropriate average is the averaging of growth rates.
• Another method: take the log of each value, find the arithmetic mean and anti-log the result.
$\exp\left(\frac{\log(0.15) + \dots + \log(1.66)}{40}\right) = 0.467$
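The two routes to the geometric mean described above (nth root of the product, and anti-log of the mean log) can be checked against each other. This is a minimal sketch: the growth factors in `growth` are hypothetical illustrative values, not data from the slides.

```python
import math
import statistics

# Hypothetical yearly growth factors (1.10 means +10%); not from the slides.
growth = [1.10, 1.50, 0.80, 1.20]
n = len(growth)

# Method 1: multiply all n values together and take the nth root.
gm_root = math.prod(growth) ** (1 / n)

# Method 2: take the log of each value, average, then anti-log the result.
gm_log = math.exp(sum(math.log(x) for x in growth) / n)

# The standard library provides the same calculation (Python 3.8+).
gm_lib = statistics.geometric_mean(growth)

print(f"root of product: {gm_root:.4f}")
print(f"anti-log of mean log: {gm_log:.4f}")
print(f"statistics.geometric_mean: {gm_lib:.4f}")
print(f"arithmetic mean: {statistics.mean(growth):.4f}")
```

The log route is usually preferred for long lists of values, because the running product in the first method can overflow or underflow.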
Serum Triglyceride Levels from Cord Blood of 282 Babies
[Figure: histogram. x-axis: serum triglyceride levels, 0.2 to 1.7; y-axis: count, 0 to 80.]
Mean = 0.506
Median = 0.460
Geometric mean = 0.467
17
Why measures of variability are important
Production of Aspirin
• New production process of 100 mg tabs
• Random sample from new process
  – 96 97 100 101 101 mg, mean 99 mg
• Random sample from old process
  – 88 93 100 104 110 mg, mean 99 mg
• Same means, but the new process is better because it is less variable
18
Definition of Range
The range of a sample of values is the largest value minus the smallest value.
• New process: the range is 101 - 96 = 5
• Old process: the range is 110 - 88 = 22
• Range is simple, BUT
  – only uses min and max
  – gets larger as sample size increases
19
Definition of Inter-quartile Range
The inter-quartile range of a sample of values is the difference between the upper and lower quartiles. The lower quartile is the value which is greater than ¼ of the sample and less than ¾ of the sample. Conversely, the upper quartile is the value which is greater than ¾ of the sample and less than ¼ of the sample.
20
Ordered Blood Glucose Values
2.2 2.9 3.3 3.3 3.3 3.4 3.4 3.4 3.6 3.6 3.6 3.6 3.7 3.7 3.8 3.8 3.8 3.9 4.0 4.0 4.0 4.1 4.1 4.1 4.2 4.3 4.4 4.4 4.4 4.5 4.6 4.7 4.7 4.7 4.8 4.9 4.9 5.0 5.1 6.0
1/4 of 40 = 10
3/4 of 40 = 30
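A minimal sketch of the range and inter-quartile range for the 40 blood glucose values. Software packages use several slightly different quartile conventions; the `method="inclusive"` option of `statistics.quantiles` used here is one of them and is an assumption of this example, not the rule used on the slides.

```python
import statistics

# The 40 ordered blood glucose values (mmol/litre) listed on the slides.
glucose = [
    2.2, 2.9, 3.3, 3.3, 3.3, 3.4, 3.4, 3.4, 3.6, 3.6,
    3.6, 3.6, 3.7, 3.7, 3.8, 3.8, 3.8, 3.9, 4.0, 4.0,
    4.0, 4.1, 4.1, 4.1, 4.2, 4.3, 4.4, 4.4, 4.4, 4.5,
    4.6, 4.7, 4.7, 4.7, 4.8, 4.9, 4.9, 5.0, 5.1, 6.0,
]

# Range: largest value minus smallest value.
data_range = max(glucose) - min(glucose)

# quantiles(..., n=4) returns the three cut points that split the data into quarters.
lower_q, median_value, upper_q = statistics.quantiles(glucose, n=4, method="inclusive")
iqr = upper_q - lower_q

print(f"range = {data_range:.1f}")
print(f"lower quartile = {lower_q:.3f}, upper quartile = {upper_q:.3f}")
print(f"inter-quartile range = {iqr:.3f}")
```

The 10th and 11th ordered values are both 3.6, so the lower quartile is 3.6 under any convention. The 30th and 31st values are 4.5 and 4.6, so the upper quartile falls between those two numbers depending on the convention chosen.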
21
Inter-Quartile Range
[Figure: histogram of the blood glucose values with the lower quartile, upper quartile and inter-quartile range marked. x-axis: blood glucose (mmol/litre), 2 to 6; y-axis: count, 0 to 7.]
22
Standard deviation
• Neither measure uses the numerical values - only relative magnitudes
• A measure accounting for the values is the standard deviation
• Consider the aspirin data from the new process 96 97 100 101 101 (mean 99 mg)
• Determine deviations from mean -3 -2 1 2 2
• Square, add, average and square-root: $\sqrt{\frac{9 + 4 + 1 + 4 + 4}{5}} = \sqrt{4.4} = 2.098$
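The steps above (deviations, square, add, average, square-root) can be sketched for both aspirin samples. The helper name `sd_divide_by_n` is hypothetical. Note that this calculation, like the slide, divides by n; many packages report the version that divides by n - 1 by default, which is shown for comparison.

```python
import math
import statistics

new_process = [96, 97, 100, 101, 101]  # mg, from the slides
old_process = [88, 93, 100, 104, 110]  # mg, from the slides

def sd_divide_by_n(values):
    """Square the deviations from the mean, add, average and square-root."""
    mean = sum(values) / len(values)
    squared_deviations = [(v - mean) ** 2 for v in values]
    return math.sqrt(sum(squared_deviations) / len(values))

sd_new = sd_divide_by_n(new_process)
sd_old = sd_divide_by_n(old_process)

print(f"new process: sd = {sd_new:.3f}")
print(f"old process: sd = {sd_old:.3f}")

# statistics.pstdev divides by n (same as above); statistics.stdev divides by n - 1.
print(f"pstdev (n)    = {statistics.pstdev(new_process):.3f}")
print(f"stdev (n - 1) = {statistics.stdev(new_process):.3f}")
```

The old process has the same mean but a much larger standard deviation, which matches the conclusion drawn from the ranges.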
23
Measures of scatter/dispersion – ‘how variable the data are’
• Range – smallest to biggest value
  – increases with sample size
• Standard deviation – measure of variation around the mean
  – affected by skewness and outliers
• Variance = square of standard deviation
• Interquartile range (IQR) – from 25th centile to 75th centile
24
Plotting Data
• Histograms
• Stem and Leaf Plots
• Box Plots

Stem Leaf      #
  60 0         1
  58
  56
  54
  52
  50 00        2
  48 000       3
  46 0000      4
  44 0000      4
  42 00        2
  40 000000    6
  38 0000      4
  36 000000    6
  34 000       3
  32 000       3
  30
  28 0         1
  26
  24
  22 0         1
     ----+----+----+----+
Multiply Stem.Leaf by 10**-1

[Figure: box plot of blood glucose (mmol/litre), axis from 2 to 6.]
25
Mean and standard deviation
• Best description if distribution reasonably symmetric (and single mode)
• Give full description if data have Normal distribution
26
[Figure: three Normal distribution curves plotted for x from 0 to 10 (vertical axis 0 to .4): mean 3, s.d. 1; mean 5, s.d. 1; mean 5, s.d. 2.]
27
Properties of Normal distribution
• Symmetric distribution – mean, median and mode equal
• Completely specified by mean and standard deviation
• 95% of distribution contained within mean ± 1.96 standard deviations
• 68% within mean ± 1 standard deviation
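The 95% and 68% figures can be checked numerically with `statistics.NormalDist` from the Python standard library. The mean of 5 and s.d. of 2 used here are arbitrary illustrative choices; the proportions do not depend on them. The helper name `proportion_within` is hypothetical.

```python
from statistics import NormalDist

# Arbitrary illustrative Normal distribution: mean 5, s.d. 2.
dist = NormalDist(mu=5, sigma=2)

def proportion_within(k):
    """Proportion of the distribution lying within mean ± k standard deviations."""
    upper = dist.mean + k * dist.stdev
    lower = dist.mean - k * dist.stdev
    return dist.cdf(upper) - dist.cdf(lower)

within_196 = proportion_within(1.96)
within_1 = proportion_within(1)

print(f"within mean ± 1.96 sd: {within_196:.4f}")
print(f"within mean ± 1 sd:    {within_1:.4f}")
```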
28
Continuous data, not Normally distributed
• If symmetric use mean and standard deviation
• If skewed use median and IQR
Unless
• Positively skewed, but log transformation creates symmetric distribution – use geometric mean
29
Nominal categorical data
• Mode.
• % in each category, especially when binary.

Wheeze in last 12 months
         Frequency (n)       %
No            1945        75.2
Yes            642        24.8
Total         2587       100.0
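A short sketch of how the summary for a nominal variable is obtained from the raw counts in the wheeze table. The dictionary name `counts` is an assumption of this example.

```python
# Frequencies from the "Wheeze in last 12 months" table.
counts = {"No": 1945, "Yes": 642}

total = sum(counts.values())

# Percentage in each category, rounded to one decimal place as in the table.
percentages = {category: round(100 * n / total, 1) for category, n in counts.items()}

# The mode of a categorical variable is the most frequent category.
mode_category = max(counts, key=counts.get)

for category, n in counts.items():
    print(f"{category:<5} {n:>5} {percentages[category]:>6.1f}")
print(f"{'Total':<5} {total:>5} {100.0:>6.1f}")
print("mode:", mode_category)
```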
30
Ordinal categorical data
• Median and IQR if enough separate values.
• Otherwise as for nominal.
31
Discrete quantitative data
• As for continuous data if many values, as for ordinal data if fewer.
33
Difference Between Standard Deviation & Standard Error
34
Measure of Variability of the Sample Mean
• Range, inter-quartile range and standard deviation describe the variability of individual values in the population (or sample), not the variability of the sample mean.
• To understand the difference, carry out a sampling experiment using the Ritchie Index values.
35
Values of the Ritchie Index (Measure of Joint Stiffness) in 50 Untreated Patients
14 9 8 9 1 20 3 3 2 4 2 3 6 1 2 11 16 24 16 21 19 22 33 12 12 12 19 10 33 2 19 40 1 20 1 2 4 7 9 4 9 6 14 8 27 10 27 7 24 21
Mean = (14+…+21)/50 = 12.18
36
Location = Central Tendency
[Figure: histogram of the 50 Ritchie Index values in bins 0-5, 6-10, 11-15, 16-20, 21-25, 26-30, 31-35, 36-40; y-axis: 0 to 16.]
• Arithmetic mean - outlier prone
• Median - only uses relative magnitudes
• Mode - not necessarily central (categorical data)
37
Sampling Experiment
• Take a random sample (10) from the 50 values
• Determine the mean of the 10 values
• Repeat 50 times
• These means show variation - HOW LARGE IS IT?
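The sampling experiment can be simulated along the following lines. This is a sketch under stated assumptions: each sample of 10 is drawn without replacement (the slides do not say which scheme was used), and the seed value is arbitrary, so the numbers printed will not reproduce the slide results exactly.

```python
import random
import statistics

# The 50 Ritchie Index values from the slides.
ritchie = [
    14, 9, 8, 9, 1, 20, 3, 3, 2, 4,
    2, 3, 6, 1, 2, 11, 16, 24, 16, 21,
    19, 22, 33, 12, 12, 12, 19, 10, 33, 2,
    19, 40, 1, 20, 1, 2, 4, 7, 9, 4,
    9, 6, 14, 8, 27, 10, 27, 7, 24, 21,
]

rng = random.Random(1)  # fixed, arbitrary seed so that the run is repeatable

sample_means = []
for _ in range(50):                        # repeat 50 times
    sample = rng.sample(ritchie, 10)       # random sample of 10, without replacement
    sample_means.append(statistics.mean(sample))

original_mean = statistics.mean(ritchie)
original_sd = statistics.stdev(ritchie)    # divides by n - 1
means_sd = statistics.stdev(sample_means)  # spread of the 50 sample means

print(f"original values: mean {original_mean:.2f}, sd {original_sd:.2f}")
print(f"sample means:    mean {statistics.mean(sample_means):.2f}, sd {means_sd:.2f}")
```

The point of the experiment is visible in the last line of output: the sample means vary much less than the individual values do.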
38
Variations in Samples
[Figure: five histograms of Ritchie Index values (bins 0-5 to 36-40; counts 0 to 16). Panel means: 12.18, 10.00, 12.60, 13.40, 11.50.]
39
Ritchie Values
[Figure: histogram of the Ritchie Index values (bins 0-5 to 36-40; y-axis: 0 to 30).]
Original values (mean = 12.18; sd = 9.69)
40
Ritchie Values
Sampling Experiment – Sample Means
[Figure: histogram of the sample means on the Ritchie Index scale (bins 0-5 to 36-40; y-axis: 0 to 30).]
Sample means (mean = 12.21; sd = 2.97)
Original values (mean = 12.18; sd = 9.69)
41
Definition of the Standard Error
The standard deviation of the sampling distribution of the mean is called the standard error of the mean.
42
Increasing Sample Size
• Increased precision (smaller standard error)
• Less skewness
[Figure: two histograms of sample means on the Ritchie Index scale (bins 0-5 to 36-40; y-axis: 0 to 40), one for n=10 and one for n=15.]
n=10: sample means (mean = 12.21; sd = 2.97)
n=15: sample means (mean = 12.37; sd = 2.43)
43
Standard error of the mean as a function of the sample size
[Figure: standard error of the mean (0 to 10) plotted against sample size (0 to 40).]
$se = sd / \sqrt{n}$
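The relationship between standard error and sample size can be tabulated directly from the formula se = sd / √n. This sketch plugs in the sd of the original Ritchie Index values (9.69) quoted on the slides; the function name `standard_error` is hypothetical.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: sd divided by the square root of the sample size."""
    return sd / math.sqrt(n)

sd_ritchie = 9.69  # sd of the 50 original Ritchie Index values, as quoted on the slides

for n in (10, 15, 50):
    print(f"n = {n:2d}: se = {standard_error(sd_ritchie, n):.2f}")
```

For n = 10 and n = 15 the formula gives about 3.06 and 2.50, close to the sd values of 2.97 and 2.43 observed among the sample means in the sampling experiment. For n = 50 it gives about 1.37.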
44
Population of Gene Lengths, n=20,290
[Figure: frequency histogram. x-axis: gene length (# of nucleotides), 0 to 15000; y-axis: frequency, 0 to 3000.]
45
Samples of size: n=100
[Figure: frequency histogram. x-axis: gene length (# of nucleotides), 0 to 15000; y-axis: frequency, 0 to 300.]
46
Practical Confusion
• A mean is often reported in medical papers as
12.18 ± 1.37
what is 1.37?
sd or se?
Thanks!
Tea break