Bio Statistics Basics

Biostatistics - I

Presented by :

Kush Pathak

Contents

• Introduction
• History
• Applications and uses of biostatistics in science
• Common statistical terms
• Common symbols used
• Data -
(a) Collection and types
(b) Presentation
(c) Analysis
(d) Interpretation
• Limitations
• Conclusion
• References

Introduction

• There are three kinds of lies: lies, damn lies, and statistics. (Benjamin Disraeli / Mark Twain).

• The word statistics conveys a variety of meanings to people. It is best known for the handling of data, both in general and in research. The word “statistics” comes from the Italian word ‘statista’, meaning ‘statesman’, or the German word “Statistik”, meaning a political state.

• The subject has two main sources: (1) government records and (2) mathematics.

• John Graunt (1620 - 1674) was the father of health statistics.

Definitions

Statistics : The science of collecting, summarizing, presenting, analyzing and interpreting data is called statistics.

Biostatistics : The method of collecting, organizing, analyzing, tabulating and interpreting data related to living organisms and human beings is called biostatistics.

[Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya; 2006. 824]

HISTORY

John Graunt (1620 - 1674)

Father of Health Statistics

THE HISTORY OF STATISTICS HAS ITS ROOTS IN BIOLOGY

Sir Francis Galton

Pioneer of fingerprint classification

Study of heredity of quantitative traits

Regression & correlation

Karl Pearson

Polymath

-Studied genetics

-Correlation coefficient

-χ² (chi square) test

-Standard deviation

Sir Ronald Fisher

The Genetical Theory of Natural Selection

Founder of population genetics.

Analysis of variance

Likelihood.

P-value

APPLICATIONS AND USES OF BIOSTATISTICS IN SCIENCE

• IN PHYSIOLOGY AND ANATOMY :

– To define the limits of normality for variables such as height, weight, Blood Pressure etc. in a population.

– Variation more than natural limits may be pathological i.e. abnormal due to play of certain external factors.

– To find correlation between two variables like height and weight.

• IN PHARMACOLOGY

– To find the action of drugs

– To compare the action of two drugs or two successive dosages of same drug

– To find the relative potency of a new drug with respect to a standard drug

• IN MEDICINE

– To compare the efficacy of a particular drug, operation or line of treatment

– To find association between two attributes such as cancer and smoking

– To identify signs and symptoms of disease

• IN COMMUNITY MEDICINE AND PUBLIC HEALTH

– To test usefulness of vaccine in the field

– In epidemiologic studies the role of causative factors is statistically tested

• FOR STUDENTS :

– By learning the methods in biostatistics a student learns to evaluate articles published in medical and dental journals or papers read in medical and dental conferences.

– He also understands the basic methods of observation in his clinical practice and research.

Common Statistical terms

• VARIABLES :

A characteristic that takes different values for different persons, places or things.

A quantity that varies between limits, e.g. height, weight, blood pressure, age etc.

Denoted as X and, for an ordered series, as X1, X2, X3, ..., Xn

Sigma (∑) stands for the summation of results or observations.

• CONSTANT :

Quantities that do not vary such as π = 3.141, e = 2.718

These do not require statistical study.

e.g. in biostatistics, mean, standard deviation are considered constant for a population.

• OBSERVATION :

An event and its measurement, such as blood pressure and its value of 120 mm Hg

• OBSERVATIONAL UNIT :

Source that gives observations, such as object or person etc.

In medical statistics, the term individual or subject is used more often.

• DATA:

Set of values recorded on one or more observational units.

• POPULATION :

Population includes all persons, events and objects under study.

It may be finite or infinite.

• SAMPLE :

Defined as a part of a population generally selected so as to be representative of the population whose variables are under study.

• PARAMETER

It is a constant that describes a population, e.g. in a college 40% of the students are girls. This describes the population, hence it is a parameter.

• STATISTIC

A statistic is a value that describes a sample, e.g. in a sample of 200 students of the same college, 45% are girls. This 45% is a statistic, as it describes the sample and may differ from one sample to another.

• ATTRIBUTE

A characteristic on the basis of which the population can be divided into categories or classes, e.g. gender, caste, religion.

Commonly used symbols

= Equal to

< Less than

> Greater than

Z Number of standard deviations from the mean

% Percentage

r Pearson’s correlation coefficient

ρ Spearman’s rank correlation coefficient

d.f. or f Degrees of freedom

K Number of groups or classes

P Probability

O Observed number

E Expected number

DATA

Set of values recorded on one or more observational units is called data.

It is of two types :

QUALITATIVE (discrete) data

QUANTITATIVE (continuous) data

Collection of health information

A. Census :

The United Nations defines a census as “the total process of collecting, compiling and publishing demographic, economic and social data pertaining at the specified time or times, to all persons in a country or delimited territory”.

It is an important source of health information.

The first regular census in India was taken in 1881, and censuses have since been taken at 10 year intervals.

The primary function of a census is to provide demographic information such as the total count of the population and its breakdown into groups and sub groups, such as age and sex distribution.

The population census provides the basic data (by age and sex) needed to compute vital statistical rates and other health, demographic and socio economic indicators.

B. Registration of vital events :

The United Nations defines a vital event registration system as including “legal registration, statistical recording and reporting of the occurrence of, and the collection, compilation, presentation, analysis and distribution of statistics pertaining to vital events i.e. live births, deaths, fetal deaths, marriages, divorces, adoption, legitimations, recognitions, annulments and legal separations.”

It keeps a continuous check on demographic changes.

In 1873, the Govt. of India passed the Births, Deaths and Marriages Registration Act. Even so, the registration system in India tended to be very unreliable, the data being grossly deficient in regard to accuracy, timeliness, completeness and coverage.

Due to this other actions were taken :

o The Central Births and Deaths Registration act, 1969 :

The Central Births and Deaths Registration Act was promulgated in 1969 and came into force on 1st April 1970.

The time limit for registering births is 14 days and that for deaths is 7 days. In case of default, a fine of Rs. 50 can be imposed.

o Lay Reporting :

It is defined as, “Collection of information, its use and transmission to other levels of health system by non professional health workers.”

Some countries have attempted to employ first line health workers(e.g. village health guides) to record births and deaths in a community.

C. Sample Registration system (SRS) :

It is a dual record system consisting of continuous enumeration of births and deaths by an enumerator and an independent survey every 6 months by an investigator-supervisor.

It was initiated in the mid 1960s to provide reliable estimates of birth and death rates at the national and state levels.

It is a major source of health information.

D. Notification of diseases :

Its primary purpose is to effect prevention and/or control of the diseases.

Also a valuable source of morbidity data.

Diseases which are considered to be serious menaces to public health are included in the list of notifiable diseases.

Limitations :
(a) Covers only a small part of the total sickness in the community.
(b) The system suffers from a good deal of under reporting.
(c) Many cases, especially atypical and subclinical cases, escape notification due to non-recognition.

E. Hospital records :

They constitute a basic and primary source of information about diseases prevalent in the community.

Drawbacks :
(a) Provide information on only those patients who seek medical care.
(b) Admission policy may vary from hospital to hospital.
(c) The population served by a hospital cannot be defined.

F. Disease Registers :

Provides a permanent record of diseases and morbidity caused due to them.

If reporting system is effective and the coverage is on a national basis, register can provide useful data on morbidity and disease specific mortality.

G. Record Linkage :

Record linkage is the process of bringing together records relating to one individual, the records originating at different times or places.

Medical record linkage implies the assembly and maintenance, for each individual in a population, of a file of the more important records relating to his or her health.

Problem : the volume of data accumulated. Therefore, in practice, record linkage has been applied only on a limited scale, e.g. twin studies, measurement of morbidity, chronic diseases etc.

H. Environmental health data :

These statistics now provide data on various aspects of air, water and noise pollution; harmful food additives; industrial intoxicants etc.

I. Health manpower statistics :

Relates to physicians, dentists, pharmacists, veterinarians, nurses, technicians etc.

Their records are maintained by state medical/ dental/ nursing councils and directorates of medical education.

J. Population surveys :

Carried out for epidemiological studies by trained teams to find incidence or prevalence of health or disease in a community.

Provide useful information on :
• Changing trends in health status.
• Timely warning of public health hazards.
• Feedback expected to modify policy and system.

Health surveys can be classified as :
(a) Health interview (face to face) survey
(b) Health examination survey
(c) Health records survey
(d) Mailed questionnaire survey

K. Non- quantifiable information :

Health planners also need non quantifiable information, e.g. health policies, health legislation, public attitudes, programme costs, procedures and technologies.

Types of Data

• Qualitative or discrete data :

When the data is collected on the basis of attributes or qualities like sex, malocclusion, cavities etc., it is called qualitative data.

The number of persons having the same attribute is variable and is counted.

e.g. Out of 100 people, 75 have diabetes, 15 have T.B. and 10 have anemia.

Here diabetes, T.B. and anemia are attributes which cannot be measured in figures. Only the number of people having each of them can be determined.

• Quantitative or continuous data :

When the data is collected through measurement, e.g. using calipers, it is called quantitative data.

In such a classification there are two variables :
- Characteristic – such as height
- Frequency – i.e. number of persons with the same characteristic and in the same range

• e.g. The height of one person is 150 cm and that of another is 160 cm, and both are of the same age and sex.

The number of persons with a height of 150 cm, or in the range 150 – 152 cm, may be 10 and the number with 160 cm, or in the range 160 – 162 cm, may be 20.

Thus we find out the characteristic and the frequency. Both vary from person to person as well as from group to group.

Presentation

Tabulation

Drawings

Tabulation :
• Is the most common method
• Data presentation is in the form of columns and rows
• It can be of the following types
– Simple tables
– Frequency distribution tables

• Simple tables :

Month and Year    Number of biopsies performed in Oral Pathology department
January 2010      15
June 2010         21
December 2010     26

• Frequency Distribution tables :

In a frequency distribution table, the data is first split into convenient groups (class intervals) and the number of items (frequency) which occur in each group is shown in the adjacent column.

No. of biopsies sent from different departments to Oral Pathology department

Year and month   Oral surgery   Oral Medicine   Cons and Endo   Pediatric Dept.   Perio.   Private Clinics
January 2010     6              2               3               1                 1        2
June 2010        11             NIL             2               2                 2        4
Dec 2010         19             NIL             1               2                 1        3
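As a rough illustration of splitting data into class intervals and counting the frequency in each, the sketch below groups the 20 diastolic blood pressure readings that appear in the Mode example of this presentation. The class width of 5 and the starting limit of 70 are assumptions chosen only for the example, and the function name is hypothetical.

```python
from collections import Counter

# Diastolic blood pressure readings (mm Hg) of 20 individuals,
# taken from the Mode example in this presentation.
readings = [85, 75, 81, 79, 71, 95, 75, 77, 75, 90,
            71, 75, 79, 95, 75, 77, 84, 75, 81, 75]

CLASS_WIDTH = 5   # assumed class-interval width
LOWEST = 70       # assumed lower limit of the first class


def class_interval(value, lowest=LOWEST, width=CLASS_WIDTH):
    """Return the (lower, upper) limits of the class containing value."""
    lower = lowest + ((value - lowest) // width) * width
    return (lower, lower + width - 1)


# Count how many readings fall into each class interval.
frequency_table = Counter(class_interval(v) for v in readings)

for (lower, upper), freq in sorted(frequency_table.items()):
    print(f"{lower}-{upper}: {freq}")
```

Each printed line is one row of the frequency distribution table: the class interval followed by its frequency.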

• Charts and Drawings :

Useful method of presenting statistical data

Powerful impact on imagination of the people

• Presentation of quantitative data is done through graphs. They are :
Histogram
Frequency polygon
Frequency curve
Line chart or graph
Cumulative frequency diagram
Scatter or dot diagram

• Presentation of qualitative data is done through diagrams. They are :
Bar diagram
Pie or sector diagram
Pictogram or picture diagram
Map diagram or spot map

Histograms

Pictorial presentation of frequency distribution.

Consists of series of rectangles.

Class intervals are given on the horizontal axis and frequencies on the vertical axis

Area of rectangle is proportional to the frequency

[Figure: histogram of biopsies per month, Jan/10 to Dec/10, frequency scale 0 to 20; series: O.S, O.M, Cons, Pedo, Perio, Private]

Frequency Polygon

Obtained by joining midpoints of histogram blocks at the height of frequency by straight lines usually forming a polygon.

Frequency curve :

When the number of observations is very large and the class interval is reduced, the frequency polygon loses its angulations and becomes a smooth curve known as the frequency curve.

Line Chart

Line diagrams are used to show the trend of events with the passage of time.

[Figure: line chart of biopsies per month, Jan/10 to Dec/10, frequency scale 0 to 20; series: O.S, O.M, Cons, Pedo, Perio, Private]

Cumulative frequency diagram

Graphical representation of cumulative frequency.

The cumulative frequency of a class is obtained by adding its frequency to the frequencies of all previous classes.
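A minimal sketch of the running total behind a cumulative frequency diagram is given below. It reuses the three biopsy counts from the simple table earlier in this presentation and treats them as three successive classes purely for illustration.

```python
from itertools import accumulate

# Biopsy counts from the simple table (January, June, December 2010),
# treated here as successive classes for illustration only.
classes = ["January 2010", "June 2010", "December 2010"]
frequencies = [15, 21, 26]

# Each cumulative value is the class frequency plus all previous frequencies.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label}: frequency {freq}, cumulative frequency {cum}")
```

The last cumulative value always equals the total number of observations, which is a quick check on the arithmetic.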

Scatter or Dot diagram

Shows relationship between two variables.

If the dots are clustered showing a straight line, it shows a relationship of linear nature.

[Figure: scatter diagram; axis scales 3.5 to 7 and 0 to 8; series: Y-Values]

Bar Chart

• Length of bars drawn vertical or horizontal is proportional to frequency of variable.

• Suitable scale is chosen.

• Bars are usually equally spaced.

• They are of three types :

-Simple bar chart

-Multiple bar chart

-Component bar chart

Simple bar chart

[Figure: simple bar chart of biopsies per month, Jan/10 to Dec/10, scale 0 to 30]

Multiple bar chart :

Two or more variables are grouped together

[Figure: multiple bar chart of biopsies per month by department, Jan/10 to Dec/10, scale 0 to 20; series: O.S, O.M, CONS, PEDO, PERIO, PRIVATE]

Component bar chart :

Bars are divided into two or more parts, each part representing a certain item and being proportional to the magnitude of that item.

[Figure: component bar chart of biopsies per month by department, Jan/10 to Dec/10, scale 0 to 30; series: O.S, O.M, CONS, PEDO, PERIO, PRIVATE]

Pie chart

• In this, the frequencies of the groups are shown as segments of a circle.

• The degree of the angle denotes the frequency.

• The angle is calculated as $\text{angle} = \dfrac{\text{class frequency}}{\text{total observations}} \times 360^\circ$
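To make the angle formula concrete, the sketch below applies it to the January 2010 row of the departmental biopsy table from this presentation. The dictionary keys and variable names are illustrative only.

```python
# January 2010 biopsies by department, from the table in this presentation.
january_biopsies = {
    "Oral surgery": 6,
    "Oral medicine": 2,
    "Cons and Endo": 3,
    "Pediatric": 1,
    "Perio": 1,
    "Private clinics": 2,
}

total = sum(january_biopsies.values())

# angle = class frequency x 360 / total observations
angles = {dept: count * 360 / total for dept, count in january_biopsies.items()}

for dept, angle in angles.items():
    print(f"{dept}: {angle:.0f} degrees")
```

The angles of all segments add up to 360 degrees, so the segments fill the whole circle.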

[Figure: pie chart of biopsies; segments labelled 01/01/2010 and 06/01/2010]

Pictogram

• Popular method of presenting data to the common man.

Spot map or Map diagram

• These maps are prepared to show geographic distribution of frequencies of characteristics.

Analysis

• Average value in a distribution is the one central value around which all the other observations are concentrated.

• Average value helps : To find most characteristic value of a set of measurements.

To find which group is better off by comparing the average of one group with that of another.

[K. Park. Preventive and social medicine, 20th edition. McGraw-Hill Medical; 2009. 749]

• Most commonly used averages are :
Mean
Median
Mode

Mean

• Refers to arithmetic mean.

• Individual observations are first added together, and then divided by the number of observations.

• Addition of the observations is called ‘summation’ and is denoted by ∑ or S.

• Individual observations are denoted by x, the number of observations by n and the mean by $\bar{x}$ (‘x bar’).

• $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum x}{n}$

• e.g. The diastolic blood pressure of 10 individuals was 83, 75, 81, 79, 71, 95, 75, 77, 84, 90. The total was 810, which was then divided by 10, giving a mean of 81.0

• Advantages – It is easy to calculate.

• Disadvantages – Influenced by extreme values.
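The arithmetic in the example above can be reproduced with a few lines of code. This is only a sketch of the summation and division steps, using the ten diastolic readings from the slide.

```python
# Diastolic blood pressure (mm Hg) of the 10 individuals in the example.
diastolic = [83, 75, 81, 79, 71, 95, 75, 77, 84, 90]

total = sum(diastolic)   # summation of the observations
n = len(diastolic)       # number of observations
mean = total / n         # arithmetic mean (x bar)

print(f"sum = {total}, n = {n}, mean = {mean}")
```

Replacing one reading with an extreme value (say 150 instead of 95) and rerunning shows how strongly the mean is pulled by a single outlier.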

Median

• When all the observations are arranged in either ascending or descending order, the middle observation is known as the median.

• In case of an even number of observations, the average of the two middle values is taken.

• The median is a better indicator of the central value when extreme values are present, as it is not affected by them.

Diastolic Blood Pressure (unarranged) : 83, 75, 81, 79, 71, 95, 75, 77, 84

Diastolic Blood Pressure (arranged) : 71, 75, 75, 77, 79 (median), 81, 83, 84, 95

In case there are 10 values instead of 9 :

Diastolic Blood Pressure (unarranged) : 83, 75, 81, 79, 71, 95, 75, 77, 84, 90

Diastolic Blood Pressure (arranged) : 71, 75, 75, 77, 79, 81, 83, 84, 90, 95

Median = (79 + 81) / 2 = 80
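Both median examples above can be checked with Python's standard `statistics` module, which sorts the values and picks the middle one (or averages the two middle ones). The variable names are illustrative.

```python
import statistics

# The 9 diastolic readings from the first example.
nine_values = [83, 75, 81, 79, 71, 95, 75, 77, 84]

# The same readings with a tenth value (90) added, as in the second example.
ten_values = nine_values + [90]

median_nine = statistics.median(nine_values)   # middle value of the sorted list
median_ten = statistics.median(ten_values)     # average of the two middle values

print(sorted(nine_values), "median:", median_nine)
print(sorted(ten_values), "median:", median_ten)
```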

Mode

• The most frequently occurring observation, or most ‘fashionable’ value, in a series of observations is called the mode.

• E.g. diastolic blood pressure of 20 individuals is 85, 75, 81, 79, 71, 95, 75, 77, 75, 90, 71, 75, 79, 95, 75, 77, 84, 75, 81, 75.

• Here the most frequently occurring value is 75.

• Advantages :

It is easy to understand. Not affected by extreme items.

• Disadvantages : Exact location is often uncertain and not clearly defined.

[Therefore, mode is not often used in biological or medical statistics.]
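A short sketch of finding the mode by counting how often each value occurs, using the 20 diastolic readings from the example above:

```python
from collections import Counter

# Diastolic blood pressure (mm Hg) of the 20 individuals in the example.
readings = [85, 75, 81, 79, 71, 95, 75, 77, 75, 90,
            71, 75, 79, 95, 75, 77, 84, 75, 81, 75]

counts = Counter(readings)

# most_common(1) returns the value with the highest count and that count.
mode_value, mode_count = counts.most_common(1)[0]

print(f"mode = {mode_value}, occurring {mode_count} times")
```

If two values shared the highest count, this approach would report only one of them, which reflects the disadvantage noted above that the mode is not always clearly defined.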

Interpretation

• Test of Significance :

• Whatever the sampling procedure or the care taken while selecting a sample, the sample statistics will differ from the population parameters.

• Variations between 2 samples drawn from the same population may also occur.

• Differences in the results obtained by two research workers for the same investigation may also be observed.

• So, it becomes important to find out the significance of this observed variation, i.e. whether it is due to
– chance or biological variation (statistically not significant) OR
– the influence of some external factors (statistically significant)

• To test whether the variation observed is of significance, various tests of significance are done.

• Tests of significance can be broadly classified as

Parametric tests

Non parametric tests

Parametric Tests

• Parametric tests are those tests in which certain assumptions are made about the population :

The population from which the sample is drawn has a normal distribution.

The variances of the samples do not differ significantly.

The observations are truly numerical, so arithmetic procedures such as addition, division and multiplication can be used.

• Since these tests make assumptions about the population parameters, they are called parametric tests.

• These are usually used to test the difference between means.

• They are :
– Student t test (paired or unpaired)
– ANOVA

ANOVA

Analysis of variance

• Investigations may not always be confined to comparison of 2 samples only.

• e.g. we might like to compare the difference in vertical dimension obtained using 2 or more methods like phonetics, swallowing.

• In such cases, where more than 2 samples are used, ANOVA can be used.

• Also, when measurements are influenced by several factors playing their role, e.g. factors affecting retention of a denture, ANOVA can be used.

• ANOVA helps to decide which factors are more important

• Requirements

– Data for each group are assumed to be independent and normally distributed

– Sampling should be at random

• One way ANOVA :

– Where only one factor affects the result between groups

• Two way ANOVA

– Where we have 2 factors that affect the result or outcome.

• Multi way ANOVA

– Where three or more factors affect the result or outcome between groups

Student t test

• It was given by W. S. Gosset, whose pen name was ‘Student’.

• There are two types of student t Test.

1. Unpaired t test

2. Paired t test

Unpaired t test

• Applied to unpaired data, i.e. observations made on individuals of 2 separate groups, to find the significance of the difference between 2 means.

• Used when the sample size is less than 30.

• e.g. difference in accuracy of an impression made using two different impression materials

• Steps in unpaired t test are :

Calculate the means of the two samples.

Calculate the combined standard deviation.

Calculate the standard error of the difference between the means, which is given by

$SE = SD \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$

Calculate the observed difference between the means, $\bar{X}_1 - \bar{X}_2$.

Calculate t value = observed difference / standard error.

Determine the degrees of freedom, which for a single sample is one less than the number of observations in the sample (n – 1).

Here the combined degrees of freedom will be = (n1 – 1) + (n2 – 1)

• Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.

• P < 0.05 states that the difference is significant.

• P > 0.05 states that the difference is not significant.
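The steps above can be sketched in code as follows. The two groups of measurements are hypothetical values invented for illustration, the function name is an assumption, and the combined standard deviation is computed here as the usual pooled estimate (sum of squared deviations of both groups divided by the combined degrees of freedom), which the slide does not spell out.

```python
import math


def unpaired_t(sample1, sample2):
    """Return (t value, degrees of freedom) following the steps listed above."""
    n1, n2 = len(sample1), len(sample2)

    # Step 1: means of the two samples.
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2

    # Step 2: combined (pooled) standard deviation -- assumed formula.
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    df = (n1 - 1) + (n2 - 1)
    combined_sd = math.sqrt((ss1 + ss2) / df)

    # Step 3: standard error of the difference between the means.
    se = combined_sd * math.sqrt(1 / n1 + 1 / n2)

    # Steps 4 and 5: observed difference and t value.
    t_value = (mean1 - mean2) / se
    return t_value, df


# Hypothetical measurements for two impression materials (illustration only).
material_a = [10, 12, 14, 16, 18]
material_b = [8, 9, 10, 11, 12]

t_value, df = unpaired_t(material_a, material_b)

# Tabulated two-tailed critical value of t for P = 0.05 at 8 degrees of freedom.
CRITICAL_T_DF8 = 2.306

print(f"t = {t_value:.3f}, df = {df}")
print("significant" if abs(t_value) > CRITICAL_T_DF8 else "not significant")
```

Comparing the t value with the tabulated critical value is equivalent to reading P < 0.05 or P > 0.05 from the t table; a statistics library would normally be used to obtain the exact P value.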

Paired t test

• It is applied to paired data of observations from one sample only.

• Used when the sample size is less than 30.

• Each individual gives a pair of observations, e.g. observations before and after taking a drug.

• The steps involved are :

Calculate the difference in each pair of observations, i.e. before and after = x1 – x2 = y

Calculate the mean of these differences = $\bar{y}$

• Calculate the SD of the differences

• Calculate $SE = SD / \sqrt{n}$

• Determine $t = \bar{y} / SE$

• Determine the degrees of freedom.

• Since there is one sample, df = n – 1

• Refer to the t table and find the probability of the t value corresponding to the degrees of freedom.

• P < 0.05 states that the difference is significant.

• P > 0.05 states that the difference is not significant.
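A minimal sketch of the paired t test steps listed above. The before and after readings are hypothetical values made up for the example, and the function name is an assumption.

```python
import math


def paired_t(before, after):
    """Return (t value, degrees of freedom) for paired observations."""
    # Difference within each pair: y = x1 - x2.
    differences = [b - a for b, a in zip(before, after)]
    n = len(differences)

    # Mean of the differences (y bar).
    mean_diff = sum(differences) / n

    # Standard deviation of the differences (n - 1 in the denominator).
    sd = math.sqrt(sum((d - mean_diff) ** 2 for d in differences) / (n - 1))

    # Standard error and t value.
    se = sd / math.sqrt(n)
    t_value = mean_diff / se
    return t_value, n - 1


# Hypothetical systolic pressures before and after a drug (illustration only).
before_drug = [120, 122, 118, 130, 126]
after_drug = [116, 120, 115, 124, 125]

t_value, df = paired_t(before_drug, after_drug)

# Tabulated two-tailed critical value of t for P = 0.05 at 4 degrees of freedom.
CRITICAL_T_DF4 = 2.776

print(f"t = {t_value:.3f}, df = {df}")
print("significant" if abs(t_value) > CRITICAL_T_DF4 else "not significant")
```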

Non Parametric tests

• In many biological investigations the research worker may not know the nature of the distribution or other required values of the population.

• Also, some biological measurements may not be true numerical values, hence arithmetic procedures are not possible in such cases.

• In such cases distribution free or non parametric tests are used, in which no assumptions are made about the population parameters, e.g.
– Mann Whitney test
– Chi square test
– Phi coefficient test
– Fisher’s Exact test
– Sign test
– Friedman’s test

Chi square test

• The chi square test, unlike the z and t tests, is a non parametric test.

• The test involves calculation of a quantity called chi square.

• Chi square is denoted by χ²

• It was developed by Karl Pearson

• The most important applications of the chi square test in medical statistics are
– Test of proportion
– Test of association
– Test of goodness of fit

• Test of proportion
– Used as an alternate test to find the significance of difference in 2 or more than 2 proportions

• Test of association
– To measure the probability of association between 2 discrete attributes, e.g. smoking and cancer

• Test of goodness of fit
– Tests whether the observed values of a character differ from the expected values by chance or due to the play of some external factor
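As an illustration of the test of association, the sketch below computes chi square for a 2 x 2 table using the standard formula $\chi^2 = \sum \dfrac{(O - E)^2}{E}$, where O and E are the observed and expected numbers from the symbol list. The counts are hypothetical and the function name is an assumption.

```python
def chi_square(table):
    """Return (chi square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected number if the two attributes were not associated.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts (illustration only):
#                 cancer   no cancer
observed = [[30, 70],    # smokers
            [10, 90]]    # non-smokers

chi2, df = chi_square(observed)

# Tabulated critical value of chi square for P = 0.05 at 1 degree of freedom.
CRITICAL_CHI2_DF1 = 3.84

print(f"chi square = {chi2:.2f}, df = {df}")
print("significant" if chi2 > CRITICAL_CHI2_DF1 else "not significant")
```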

Stages in performing Tests of Significance

• State the null hypothesis

• State the alternative hypothesis

• Accept or reject the null hypothesis

• Finally determine the p value

• State the null hypothesis :

The null hypothesis is a hypothesis of no difference between the statistic of a sample and the parameter of the population, or between the statistics of two samples.

It nullifies the claim that the experimental result is different from or better than the one observed already.

• State the alternative hypothesis :

It states that the sample result is different, i.e. larger or smaller than the value for the population, or that the statistic of one sample is different from that of the other.

• Accept or reject the null hypothesis :

The null hypothesis is accepted or rejected depending on whether the result falls in the zone of acceptance or the zone of rejection.

If the result of a sample falls within the area of mean ± 2 SE, the null hypothesis is accepted. This area of the normal curve is called the zone of acceptance for the null hypothesis.

If the result of a sample falls beyond the area of mean ± 2 SE, the null hypothesis of no difference is rejected and the alternative hypothesis is accepted. This area of the normal curve is called the zone of rejection for the null hypothesis.

• Finally determining the P value :

The P value is determined using any of the previously mentioned tests.

If p > 0.05, the difference is attributed to chance and is not statistically significant, but if p < 0.05 the difference is attributed to some external factor and is statistically significant.
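The two decision rules described in these stages (mean ± 2 SE, and the 0.05 cut-off for p) can be written as small helper functions. This is a sketch only: the function names and the numbers in the usage lines are hypothetical.

```python
def zone(sample_result, reference_mean, standard_error, k=2):
    """Say whether a sample result lies within mean +/- k SE (k = 2 as above)."""
    lower = reference_mean - k * standard_error
    upper = reference_mean + k * standard_error
    if lower <= sample_result <= upper:
        return "zone of acceptance"
    return "zone of rejection"


def is_significant(p_value, alpha=0.05):
    """Apply the customary rule: p below 0.05 is statistically significant."""
    return p_value < alpha


# Hypothetical values: reference mean 81, standard error 1.5,
# so the zone of acceptance runs from 78 to 84.
print(zone(83, 81, 1.5))       # inside the limits
print(zone(85, 81, 1.5))       # outside the limits
print(is_significant(0.03))    # below the 0.05 cut-off
print(is_significant(0.20))    # above the 0.05 cut-off
```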

Probability or p value

Concept of probability is very important in statistics.

Probability is the chance of occurrence of any event or of any permutation or combination of events.

It is denoted by p for a sample and P for a population.

In various tests of significance we are often interested to know whether the observed difference between 2 samples is due to chance (sampling variation) or due to some external factor.

For this, the probability or p value is used.

P ranges from 0 to 1

0 = there is no chance that the observed difference could be due to sampling variation

1 = it is absolutely certain that the observed difference between 2 samples is due to sampling variation

However such extreme values are rare.

P = 0.4 i.e. the chances that the difference is due to sampling variation are 4 in 10

Obviously the chances that it is not due to sampling variation will be 6 in 10.

The essence of any test of significance is to find out the p value and draw an inference.

If p value is 0.05 or more

It is customary to accept that the difference is due to chance (sampling variation). The observed difference is said to be statistically not significant.

If p value is less than 0.05

The observed difference is taken to be not due to chance but due to the role of some external factors.

The observed difference here is said to be statistically significant.

Sampling

• When a large population is to be studied, it is often impractical to include each and every member, as it would be time consuming, costly and laborious. So, sampling is done.

• Sampling is a process by which some units of a population are selected for study and, by subjecting them to statistical computation, conclusions are drawn about the population from which these units are drawn.

• The sample taken should be representative of the entire population.

• It should be sufficiently large.

• It should be unbiased.

• Such a sample will have its statistics almost equal to the parameters of the entire population.

• Two main characteristics of a representative sample are :

Precision

Unbiased character

Precision

• Precision depends on the sample size.

• Ordinarily the sample size should not be less than 30.

• $\text{Precision} = \dfrac{\sqrt{n}}{s}$

• n = sample size, s = standard deviation

• Precision is directly proportional to the square root of the sample size: the greater the sample size, the greater the precision.

• Thus, to obtain precision, the sample size needs to be increased.
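A quick numerical check of the precision formula above, using a hypothetical standard deviation of 10: because precision grows with the square root of n, the sample size has to be quadrupled to double the precision.

```python
import math


def precision(n, s):
    """Precision = sqrt(n) / s, with n the sample size and s the standard deviation."""
    return math.sqrt(n) / s


# Hypothetical standard deviation, used only for illustration.
S = 10

for n in (25, 100, 400):
    print(f"n = {n:3d}: precision = {precision(n, S)}")
```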

Unbiased character

• The sample should be unbiased, i.e. every individual should have an equal chance of being selected in the sample.

• Thus a standard random sampling method should be used.

• Non sampling errors can be taken care of by :
Using standardized instruments and criteria
Single, double or triple blind trials
Use of a control group
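One simple way to give every individual an equal chance of selection, as required above, is simple random sampling without replacement. The sketch below assumes a hypothetical register of 200 students numbered 1 to 200 and draws a sample of 30; the fixed seed is there only to make the example repeatable.

```python
import random

# Hypothetical sampling frame: 200 students identified by serial number.
population = list(range(1, 201))

SAMPLE_SIZE = 30   # ordinarily not less than 30, as noted under Precision

# A seeded generator keeps this example reproducible; omit the seed in practice.
rng = random.Random(42)

# sample() draws without replacement, each member being equally likely.
sample = rng.sample(population, SAMPLE_SIZE)

print(sorted(sample))
```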

Limitations

• Statistics has several limitations :

• It gives statistical and not substantive answers.

• The statistical conclusion refers to groups and not individuals.

• It only summarizes but does not interpret data.

• Statistics can be misused by selective presentation of desired results.

• Computation is not an end in itself. It is a tool that can be used well or can be misused.

• A human must have a clear idea of what is required of the computer and must instruct it accordingly.

• The human must also be able to intelligently interpret the output from the computer.

• All who tinker with computers must remember the adage ‘rubbish in/rubbish out’.

Conclusion

• Health information systems are the best means of getting reliable, relevant, up to date, adequate and reasonably complete information for health managers at all levels.

• Although they are a very helpful source for collection of data, it has been very difficult to get information where it matters most, i.e. at the community level.

• So, action should be taken in this direction and these systems should be used more frequently for better and clearer results, mainly in research involving large populations.

References

• K. Park. Preventive and social medicine, 20th edition. McGraw-Hill Medical; 2009. 743 – 756

• Soben Peter. Essentials of preventive and community dentistry, 2nd edition. New Delhi : Arya; 2006. 21 – 50

• B. K. Mahajan. Methods in biostatistics for medical students and research workers, 6th edition. New Delhi : Jaypee Brothers; 2006. 1 – 39
