Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of...

37
Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged

Transcript of Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of...

Page 1: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

Biostatistics, statistical software I.

Basic statistical concepts

Krisztina Boda PhD

Department of Medical Informatics, University of Szeged

Page 2: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 2Krisztina Boda

What is biostatisics?

Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data.

Biostatistics or biometry is the application of statistics to a wide range of topics in biology. It has particular applications to medicine and to agriculture.

Page 3: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 3Krisztina Boda

Application of biostatistics

Research Design and analysis

of clinical trials in medicine

Public health, including epidemiology,

Page 4: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 4Krisztina Boda

Page 5: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 5Krisztina Boda

Biostatistical methods

Descriptive statistics Hypothesis tests (statistical tests)

They depend on: the type of data the nature of the problem the statistical model

Page 6: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 6Krisztina Boda

The data set

A data set contains information on a number of individuals.

Individuals are objects described by a set of data, they may be people, animals or things. For each individual, the data give values for one or more variables.

A variable describes some characteristic of an individual, such as person's age, height, gender or salary.

Page 7: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 7Krisztina Boda

The data-table Data of one experimental unit

(“individual”) must be in one record (row) Data of the answers to the same question

(variables) must be in the same field of the record (column)Number SEX AGE ....

1 1 20 .... 2 2 17 ....

. . . ...

Page 8: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 8Krisztina Boda

Variables

Categorical (discrete)A discrete random variable X has finite number of possible values Gender Blood group Number of children …

ContinuousA continuous random variable X has takes all values in an interval of numbers. Concentration Temperature …

Page 9: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 9Krisztina Boda

Types of data from two aspects

Based on the number of values they can have discrete (categorical) Continuous

Based on the property they represent Qualitative data

nominal data (they can be distinguished by their names )

ordinal data (there are categories of classification it may be possible to order)

Quantitative (or numerical) data

Example

Sex, blood-group, number of children

Age, temperature, concentration

Example

Qualitative data Sex, blood-group very good-good –

acceptable- wrong - very wrong-very wrong,low - normal - high,

Quantitative (or numerical) data

Age, number of children

Page 10: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 10Krisztina Boda

Distribution of variables

Discrete: the distribution of a categorical variable describes what values it takes and how often it takes these values.

Continuous: the distribution of a continuous variable describes what values it takes and how often these values fall into an interval.

SEX

SEX

femalemale

Fre

qu

en

cy

14

12

10

8

6

4

2

0

age in years

65.055.045.035.025.015.05.0

Histogram

Fre

qu

en

cy

10

8

6

4

2

0

Page 11: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 11Krisztina Boda

The distribution of a continuous variable, exampleValues: Categories:

Frequencies

20.00 0-10417.00 11-20522.00 21-30728.00 31-401 9.00 41-501 5.00 51-60226.0060.0035.0051.0017.0050.00 9.0010.0019.0022.0025.0029.0027.0019.00

0

1

2

3

4

5

6

7

8

0-10 11-20 21-30 31-40 41-50 51-60

Age

Freq

uenc

y

Page 12: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 12Krisztina Boda

The length of the intervals (or the number of intervals) affect a histogram

0

1

2

3

4

5

6

7

8

0-10 11-20 21-30 31-40 41-50 51-60

age

cou

nt

0

1

2

3

4

5

6

7

8

9

10

0-20 21-40 41-60

ageco

un

t

Page 13: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 13Krisztina Boda

The overall pattern of a distribution

The center, spread and shape describe the overall pattern of a distribution.

Some distributions have simple shape, such as symmetric and skewed. Not all distributions have a simple overall shape, especially when there are few observations.

A distribution is skewed to the right if the right side of the histogram extends much farther out then the left side.

Page 14: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 14Krisztina Boda

Histogram of body weights (kg)

Jelenlegi testsúlya /kg/

87.582.577.572.567.562.557.552.547.542.537.532.5

Hisztogram

Jelenlegi testsúlyok300

200

100

0

Std. Dev = 8.74 Mean = 57.0N = 1090.00

Page 15: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 15Krisztina Boda

Outliers

Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them (real data, typing mistake or other).

Jelenlegi testsúlya

110.0

105.0

100.0

95.0

90.0

85.0

80.0

75.0

70.0

65.0

60.0

55.0

50.0

45.0

40.0

10

8

6

4

2

0

Std. Dev = 13.79

Mean = 62.1

N = 43.00

Page 16: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 16Krisztina Boda

Describing distributions with numbers

Measures of central tendency: the mean, the mode and the median are three commonly used measures of the center.

Measures of variability : the range, the quartiles, the variance, the standard deviation are the most commonly used measures of variability .

Measures of an individual: rank, z score

Page 17: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 17Krisztina Boda

Measures of the center

Mean:

Mode: is the most frequent number

Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order

Example: 1,2,4,1 Mean Mode Median

xx x x

n

x

nn

ii

n

1 2 1...

Page 18: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 18Krisztina Boda

Measures of the center

Mean:

Mode: is the most frequent number

Median: is the value that half the members of the sample fall below and half above. In other words, it is the middle number when the sample elements are written in numerical order

Example: 1,2,4,1 Mean=8/4=2 Mode=1 Median

First sort data1 1 2 4

Then find the element(s) in the middle

If the sample size is odd, the unique middle element is the median

If the sample size is even, the median is the average of the two central elements

1 1 2 4

Median=1.5

xx x x

n

x

nn

ii

n

1 2 1...

Page 19: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 19Krisztina Boda

Example The grades of a test written by 11 students were

the following: 100 100 100 63 62 60 12 12 6 2 0. A student indicated that the class average was

47, which he felt was rather low. The professor stated that nevertheless there were more 100s than any other grade. The department head said that the middle grade was 60, which was not unusual.The mean is 517/11=47, the mode is 100, the median is 60.

Page 20: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 20Krisztina Boda

Relationships among the mean(m), the median(M) and the mode(Mo)

A symmetric curve

A curve skewed to the right

A curve skewed to the left

m=M=Mo

Mo<M< m

M < M < Mo

Page 21: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 21Krisztina Boda

Measures of variability (dispersion) The range is the difference between the largest number

(maximum) and the smallest number (minimum). Percentiles (5%-95%): 5% percentile is the value

below which 5% of the cases fall. Quartiles: 25%, 50%, 75% percentiles

The variance=

The standard deviation: 1

)(1

2

2

n

xxSD

n

ii

iancen

xxSD

n

ii

var1

)(1

2

Page 22: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 22Krisztina Boda

Example Data: 1 2 4 1, in ascending order: 1 1 2 4 Range: max-min=4-1=3 Quartiles: Standard deviation:

Percentiles

1.0000 1.5000 3.5000

1.0000 1.5000 3.0000

Weighted Average(Definition 1)

Tukey's Hinges

25 50 75

Percentiles

414.1236

1

)(1

2

n

xxSD

n

ii1 1-2=-1 1

1 1-2=-1 1

2 2-2=0 0

4 4-2=2 4

Total 0 6

xxi 2)( xxi ix

Page 23: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 23Krisztina Boda

The meaning of the standard deviation

A measure of dispersion around the mean. In a normal distribution, 68% of cases fall within one standard deviation of the mean and 95% of cases fall within two standard deviations.

For example, if the mean age is 45, with a standard deviation of 10, 95% of the cases would be between 25 and 65 in a normal distribution.

Page 24: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 24Krisztina Boda

The use of sample characteristics in summary tables

Center Dispersion Publish

Mean Standard deviation,

Standard error

Mean (SD)

Mean SD

Mean SE

Mean SEM

Median Min, max

5%, 95%s percentile

25 % , 75% (quartiles)

Med (min, max)

Med(25%, 75%)

Page 25: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 25Krisztina Boda

Displaying data Categorical data

bar chart pie chart

Continuous data histogram box-whisker plot mean-standard

deviation plot scatter plot

His togram (kerd97.STA 20v *43c )

SULY

No o

f obs

NEM: f iú

35 40 45 50 55 60 65 70 75 80 85 90 950

2

4

6

8

10

12

NEM: lány

35 40 45 50 55 60 65 70 75 80 85 90 95

Oszlopdiagram

Apja legmagasabb iskolai végzettsége

nincs válaszfelsőfokú végzettség

gimnáziumi érettségiszakközépiskolai ére

szakmunkásképző8 ált.

8 ált.-nal kevesebb

Per

cent

40

30

20

10

0

Kördiagram

Apja iskolai végzettsége

nincs válasz

felsőfokú végzettség

gimnáziumi érettségi

szakközépiskolai ére

szakmunkásképző

8 ált.

8 ált.-nal kevesebb

Box Plot (kerd97 20v *43c )

Median 25% -75% Min-Max Ex tremes

f iú lány

NEM

30

40

50

60

70

80

90

100

SU

LY

Mean Plot (kerd97 20v *43c )

Mean Mean±SD

f iú lány

NEM

45

50

55

60

65

70

75

80

85

SU

LY

Szóródási diagram

Kivánatosnak tartott testsúlya /kg/

100806040

Jele

nleg

i tes

tsúl

ya /

kg/

120

100

80

60

40

20

0

Page 26: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 26Krisztina Boda

Distribution of body weightsThe distribution is skewed in case of girls

1. Leíró statisztika

boys girls

Page 27: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 27Krisztina Boda

His togram (kerd97.STA 20v *43c )

SULY

No o

f obs

NEM: f iú

35 40 45 50 55 60 65 70 75 80 85 90 950

2

4

6

8

10

12

NEM: lány

35 40 45 50 55 60 65 70 75 80 85 90 95NEM = 2.00

40 60 80

Jelenlegi testsúlya

SULY

NEM = 1.00

65 70 75 80 85

Jelenlegi testsúlya

SULY

Page 28: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 28Krisztina Boda

Mean-dispersion diagrams

Mean + SD Mean + SE Mean + 95% CI

Mean Plot (kerd97 20v *43c )

Mean Mean±SD

f iú lány

NEM

45

50

55

60

65

70

75

80

85

SU

LY

Mean Plot (kerd97 20v *43c )

Mean Mean±SE

f iú lány

NEM

45

50

55

60

65

70

75

80

85

SU

LYMean Plot (kerd97 20v *43c )

Mean Mean±0.95 Conf . Interv al

f iú lány

NEM

45

50

55

60

65

70

75

80

85

SU

LY

Mean SE

Mean SDMean 95% CI

Page 29: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 29Krisztina Boda

Box diagramBox Plot (kerd97 20v *43c )

Median 25% -75% Non-Outlier Range Ex tremes

f iú lány

NEM

30

40

50

60

70

80

90

100

SU

LY

Box Plot (kerd97 20v *43c )

Median 25% -75% Min-Max Ex tremes

f iú lány

NEM

30

40

50

60

70

80

90

100

SU

LY

A box plot, sometimes called a box-and-whisker plot displays the median, quartiles, and minimum and maximum observations .

Page 30: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 30Krisztina Boda

Transformations of data valuesAddition, subtraction

Adding (or subtracting) the same number to each data value in a variable shifts each measures of center by the amount added (subtracted).

Adding (or subtracting) the same number to each data value in a variable does not change measures of dispersion.

Page 31: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 31Krisztina Boda

Transformations of data valuesMultiplication, division

Measures of center and spread change in predictable ways when we multiply or divide each data value by the same number.

Multiplying (or dividing) each data value by the same number multiplies (or divides) all measures of center or spread by that value.

Page 32: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 32Krisztina Boda

Proof.The effect of linear transformations

Let the transformation be x ->ax+b Mean:

Standard deviation:

bxan

nbxxxan

baxbaxbaxn

baxnn

n

ii

)...(... 21211

SDan

xxa

n

xxa

n

xaax

n

bxabax

n

bxabax

n

ii

n

ii

n

ii

n

ii

n

ii

1

)(

1

)(

1

)(

1

))((

1

))()((

1

2

1

22

1

2

1

2

1

2

Page 33: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 33Krisztina Boda

Example: the effect of transformations

Sample data (xi)

Addition

(xi +10)

Subtraction (xi -10)

Multiplication (xi *10)

Division

(xi /10)

1 11 -9 10 0.1

2 12 -8 20 0.2

4 14 -6 40 0.4

1 11 -9 10 0.1

Mean=2 12 -8 20 0.2

Median=1.5 11.5 -8.5 15 0.15

Range=3 3 3 30 0.3

St.dev.≈1.414 ≈1 .414 ≈ 1.414 ≈ 14.14 ≈ 0.1414

Page 34: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 34Krisztina Boda

Special transformation: standardisation

The z score measures how many standard deviations a sample element is from the mean. A formula for finding the z score corresponding to a particular sample element xi is

, i=1,2,...,n. We standardize by subtracting the mean and

dividing by the standard deviation. The resulting variables (z-scores) will have

Zero mean Unit standard deviation No unit

zx x

sii

Page 35: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 35Krisztina Boda

Example: standardisation

Sample data (xi) Standardised data (zi)

1 -1

2 0

4 2

1 1

Mean 2 0

St. deviation ≈1 .414 1

Page 36: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 36Krisztina Boda

Review questions and exercises

Problems to be solved by hand-calculations ..\Handouts\Problems hand I.doc

Solutions ..\Handouts\Problems hand I solutions.doc

Problems to be solved using computer ..\Handouts\Problems comp I.doc

Page 37: Biostatistics, statistical software I. Basic statistical concepts Krisztina Boda PhD Department of Medical Informatics, University of Szeged.

INTERREG 37Krisztina Boda

Useful WEB pages

http://www-stat.stanford.edu/~naras/jsm http://www.ruf.rice.edu/~lane/rvls.html http://my.execpc.com/~helberg/statistics.html http://www.math.csusb.edu/faculty/stanton/m26

2/index.html