Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of...
-
Upload
esther-miller -
Category
Documents
-
view
219 -
download
0
Transcript of Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of...
Biostatistics, statistical software I.
Basic statistical concepts
Krisztina Boda PhD
Department of Medical Informatics, University of Szeged
INTERREG 2Krisztina Boda
What is biostatisics?
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.
Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and to agriculture.
INTERREG 3Krisztina Boda
Application of biostatistics
Research Design and analysis
of clinical trials in medicine
Public health, including epidemiology,
INTERREG 4Krisztina Boda
INTERREG 5Krisztina Boda
Biostatistical methods
Descriptive statistics Hypothesis tests (statistical tests)
They depend on: the type of data the nature of the problem the statistical model
INTERREG 6Krisztina Boda
The data set
A data set contains information on a number of individuals.
Individuals are objects described by a set of data, they may be people, animals or things. For each individual, the data give values for one or more variables.
A variable describes some characteristic of an individual, such as person's age, height, gender or salary.
INTERREG 7Krisztina Boda
The data-table Data of one experimental unit
(“individual”) must be in one record (row) Data of the answers to the same question
(variables) must be in the same field of the record (column)Number SEX AGE ....
1 1 20 .... 2 2 17 ....
. . . ...
INTERREG 8Krisztina Boda
Variables
Categorical (discrete)A discrete random variable X has finite number of possible values Gender Blood group Number of children …
ContinuousA continuous random variable X has takes all values in an interval of numbers. Concentration Temperature …
INTERREG 9Krisztina Boda
Types of data from two aspects
Based on the number of values they can have discrete (categorical) Continuous
Based on the property they represent Qualitative data
nominal data (they can be distinguished by their names )
ordinal data (there are categories of classification it may be possible to order)
Quantitative (or numerical) data
Example
Sex, blood-group, number of children
Age, temperature, concentration
Example
Qualitative data Sex, blood-group very good-good –
acceptable- wrong - very wrong-very wrong,low - normal - high,
Quantitative (or numerical) data
Age, number of children
INTERREG 10Krisztina Boda
Distribution of variables
Discrete: the distribution of a categorical variable describes what values it takes and how often it takes these values.
Continuous: the distribution of a continuous variable describes what values it takes and how often these values fall into an interval.
SEX
SEX
femalemale
Fre
qu
en
cy
14
12
10
8
6
4
2
0
age in years
65.055.045.035.025.015.05.0
Histogram
Fre
qu
en
cy
10
8
6
4
2
0
INTERREG 11Krisztina Boda
The distribution of a continuous variable, exampleValues: Categories:
Frequencies
20.00 0-10417.00 11-20522.00 21-30728.00 31-401 9.00 41-501 5.00 51-60226.0060.0035.0051.0017.0050.00 9.0010.0019.0022.0025.0029.0027.0019.00
0
1
2
3
4
5
6
7
8
0-10 11-20 21-30 31-40 41-50 51-60
Age
Freq
uenc
y
INTERREG 12Krisztina Boda
The length of the intervals (or the number of intervals) affect a histogram
0
1
2
3
4
5
6
7
8
0-10 11-20 21-30 31-40 41-50 51-60
age
cou
nt
0
1
2
3
4
5
6
7
8
9
10
0-20 21-40 41-60
ageco
un
t
INTERREG 13Krisztina Boda
The overall pattern of a distribution
The center, spread and shape describe the overall pattern of a distribution.
Some distributions have simple shape, such as symmetric and skewed. Not all distributions have a simple overall shape, especially when there are few observations.
A distribution is skewed to the right if the right side of the histogram extends much farther out then the left side.
INTERREG 14Krisztina Boda
Histogram of body weights (kg)
Jelenlegi testsúlya /kg/
87.582.577.572.567.562.557.552.547.542.537.532.5
Hisztogram
Jelenlegi testsúlyok300
200
100
0
Std. Dev = 8.74 Mean = 57.0N = 1090.00
INTERREG 15Krisztina Boda
Outliers
Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them (real data, typing mistake or other).
Jelenlegi testsúlya
110.0
105.0
100.0
95.0
90.0
85.0
80.0
75.0
70.0
65.0
60.0
55.0
50.0
45.0
40.0
10
8
6
4
2
0
Std. Dev = 13.79
Mean = 62.1
N = 43.00
INTERREG 16Krisztina Boda
Describing distributions with numbers
Measures of central tendency: the mean, the mode and the median are three commonly used measures of the center.
Measures of variability : the range, the quartiles, the variance, the standard deviation are the most commonly used measures of variability .
Measures of an individual: rank, z score
INTERREG 17Krisztina Boda
Measures of the center
Mean:
Mode: is the most frequent number
Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order
Example: 1,2,4,1 Mean Mode Median
xx x x
n
x
nn
ii
n
1 2 1...
INTERREG 18Krisztina Boda
Measures of the center
Mean:
Mode: is the most frequent number
Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order
Example: 1,2,4,1 Mean=8/4=2 Mode=1 Median
First sort data1 1 2 4
Then find the element(s) in the middle
If the sample size is odd, the unique middle element is the median
If the sample size is even, the median is the average of the two central elements
1 1 2 4
Median=1.5
xx x x
n
x
nn
ii
n
1 2 1...
INTERREG 19Krisztina Boda
Example The grades of a test written by 11 students were
the following: 100 100 100 63 62 60 12 12 6 2 0. A student indicated that the class average was
47, which he felt was rather low. The professor stated that nevertheless there were more 100s than any other grade. The department head said that the middle grade was 60, which was not unusual.The mean is 517/11=47, the mode is 100, the median is 60.
INTERREG 20Krisztina Boda
Relationships among the mean(m), the median(M) and the mode(Mo)
A symmetric curve
A curve skewed to the right
A curve skewed to the left
m=M=Mo
Mo<M< m
M < M < Mo
INTERREG 21Krisztina Boda
Measures of variability (dispersion) The range is the difference between the largest number
(maximum) and the smallest number (minimum). Percentiles (5%-95%): 5% percentile is the value
below which 5% of the cases fall. Quartiles: 25%, 50%, 75% percentiles
The variance=
The standard deviation: 1
)(1
2
2
n
xxSD
n
ii
iancen
xxSD
n
ii
var1
)(1
2
INTERREG 22Krisztina Boda
Example Data: 1 2 4 1, in ascending order: 1 1 2 4 Range: max-min=4-1=3 Quartiles: Standard deviation:
Percentiles
1.0000 1.5000 3.5000
1.0000 1.5000 3.0000
Weighted Average(Definition 1)
Tukey's Hinges
25 50 75
Percentiles
414.1236
1
)(1
2
n
xxSD
n
ii1 1-2=-1 1
1 1-2=-1 1
2 2-2=0 0
4 4-2=2 4
Total 0 6
xxi 2)( xxi ix
INTERREG 23Krisztina Boda
The meaning of the standard deviation
A measure of dispersion around the mean. In a normal distribution, 68% of cases fall within one standard deviation of the mean and 95% of cases fall within two standard deviations.
For example, if the mean age is 45, with a standard deviation of 10, 95% of the cases would be between 25 and 65 in a normal distribution.
INTERREG 24Krisztina Boda
The use of sample characteristics in summary tables
Center Dispersion Publish
Mean Standard deviation,
Standard error
Mean (SD)
Mean SD
Mean SE
Mean SEM
Median Min, max
5%, 95%s percentile
25 % , 75% (quartiles)
Med (min, max)
Med(25%, 75%)
INTERREG 25Krisztina Boda
Displaying data Categorical data
bar chart pie chart
Continuous data histogram box-whisker plot mean-standard
deviation plot scatter plot
His togram (kerd97.STA 20v *43c )
SULY
No o
f obs
NEM: f iú
35 40 45 50 55 60 65 70 75 80 85 90 950
2
4
6
8
10
12
NEM: lány
35 40 45 50 55 60 65 70 75 80 85 90 95
Oszlopdiagram
Apja legmagasabb iskolai végzettsége
nincs válaszfelsőfokú végzettség
gimnáziumi érettségiszakközépiskolai ére
szakmunkásképző8 ált.
8 ált.-nal kevesebb
Per
cent
40
30
20
10
0
Kördiagram
Apja iskolai végzettsége
nincs válasz
felsőfokú végzettség
gimnáziumi érettségi
szakközépiskolai ére
szakmunkásképző
8 ált.
8 ált.-nal kevesebb
Box Plot (kerd97 20v *43c )
Median 25% -75% Min-Max Ex tremes
f iú lány
NEM
30
40
50
60
70
80
90
100
SU
LY
Mean Plot (kerd97 20v *43c )
Mean Mean±SD
f iú lány
NEM
45
50
55
60
65
70
75
80
85
SU
LY
Szóródási diagram
Kivánatosnak tartott testsúlya /kg/
100806040
Jele
nleg
i tes
tsúl
ya /
kg/
120
100
80
60
40
20
0
INTERREG 26Krisztina Boda
Distribution of body weightsThe distribution is skewed in case of girls
1. Leíró statisztika
boys girls
INTERREG 27Krisztina Boda
His togram (kerd97.STA 20v *43c )
SULY
No o
f obs
NEM: f iú
35 40 45 50 55 60 65 70 75 80 85 90 950
2
4
6
8
10
12
NEM: lány
35 40 45 50 55 60 65 70 75 80 85 90 95NEM = 2.00
40 60 80
Jelenlegi testsúlya
SULY
NEM = 1.00
65 70 75 80 85
Jelenlegi testsúlya
SULY
INTERREG 28Krisztina Boda
Mean-dispersion diagrams
Mean + SD Mean + SE Mean + 95% CI
Mean Plot (kerd97 20v *43c )
Mean Mean±SD
f iú lány
NEM
45
50
55
60
65
70
75
80
85
SU
LY
Mean Plot (kerd97 20v *43c )
Mean Mean±SE
f iú lány
NEM
45
50
55
60
65
70
75
80
85
SU
LYMean Plot (kerd97 20v *43c )
Mean Mean±0.95 Conf . Interv al
f iú lány
NEM
45
50
55
60
65
70
75
80
85
SU
LY
Mean SE
Mean SDMean 95% CI
INTERREG 29Krisztina Boda
Box diagramBox Plot (kerd97 20v *43c )
Median 25% -75% Non-Outlier Range Ex tremes
f iú lány
NEM
30
40
50
60
70
80
90
100
SU
LY
Box Plot (kerd97 20v *43c )
Median 25% -75% Min-Max Ex tremes
f iú lány
NEM
30
40
50
60
70
80
90
100
SU
LY
A box plot, sometimes called a box-and-whisker plot displays the median, quartiles, and minimum and maximum observations .
INTERREG 30Krisztina Boda
Transformations of data valuesAddition, subtraction
Adding (or subtracting) the same number to each data value in a variable shifts each measures of center by the amount added (subtracted).
Adding (or subtracting) the same number to each data value in a variable does not change measures of dispersion.
INTERREG 31Krisztina Boda
Transformations of data valuesMultiplication, division
Measures of center and spread change in predictable ways when we multiply or divide each data value by the same number.
Multiplying (or dividing) each data value by the same number multiplies (or divides) all measures of center or spread by that value.
INTERREG 32Krisztina Boda
Proof.The effect of linear transformations
Let the transformation be x ->ax+b Mean:
Standard deviation:
bxan
nbxxxan
baxbaxbaxn
baxnn
n
ii
)...(... 21211
SDan
xxa
n
xxa
n
xaax
n
bxabax
n
bxabax
n
ii
n
ii
n
ii
n
ii
n
ii
1
)(
1
)(
1
)(
1
))((
1
))()((
1
2
1
22
1
2
1
2
1
2
INTERREG 33Krisztina Boda
Example: the effect of transformations
Sample data (xi)
Addition
(xi +10)
Subtraction (xi -10)
Multiplication (xi *10)
Division
(xi /10)
1 11 -9 10 0.1
2 12 -8 20 0.2
4 14 -6 40 0.4
1 11 -9 10 0.1
Mean=2 12 -8 20 0.2
Median=1.5 11.5 -8.5 15 0.15
Range=3 3 3 30 0.3
St.dev.≈1.414 ≈1 .414 ≈ 1.414 ≈ 14.14 ≈ 0.1414
INTERREG 34Krisztina Boda
Special transformation: standardisation
The z score measures how many standard deviations a sample element is from the mean. A formula for finding the z score corresponding to a particular sample element xi is
, i=1,2,...,n. We standardize by subtracting the mean and
dividing by the standard deviation. The resulting variables (z-scores) will have
Zero mean Unit standard deviation No unit
zx x
sii
INTERREG 35Krisztina Boda
Example: standardisation
Sample data (xi) Standardised data (zi)
1 -1
2 0
4 2
1 1
Mean 2 0
St. deviation ≈1 .414 1
INTERREG 36Krisztina Boda
Review questions and exercises
Problems to be solved by hand-calculations ..\Handouts\Problems hand I.doc
Solutions ..\Handouts\Problems hand I solutions.doc
Problems to be solved using computer ..\Handouts\Problems comp I.doc
INTERREG 37Krisztina Boda
Useful WEB pages
http://www-stat.stanford.edu/~naras/jsm http://www.ruf.rice.edu/~lane/rvls.html http://my.execpc.com/~helberg/statistics.html http://www.math.csusb.edu/faculty/stanton/m26
2/index.html