Probability&Statistics - based models August 1, 2007 MathFest 2007 San Jose, CA Raina Robeva –...


Probability & Statistics - based models

August 1, 2007 MathFest 2007 San Jose, CA

Raina Robeva – Sweet Briar College

Probability & Statistics - based models

Introduction

Quantitative Traits (Limit Theorems)

Luria-Delbruck Experiments

Evaluating risks from time series data

Elementary Probability

$(\Omega, \mathcal{F}, P)$ - Probability Space

Random Variables: $X = X(\omega)$, $\omega \in \Omega$

Histograms

Elementary Probability

Set of all outcomes - $\Omega$

Examples:

1) Flipping a coin: $\Omega = \{H, T\}$

2) Rolling a die: $\Omega = \{1, 2, 3, 4, 5, 6\}$

3) Rolling two dice: $\Omega = \{(i, j) : 1 \le i, j \le 6\}$, 36 ordered pairs

Elementary Probability

Elementary Events – the elements of $\Omega$

Events – the subsets of $\Omega$: $A, B, C, \ldots$

Definition of Probability:

$P(A) = \dfrac{|A|}{|\Omega|}$, where $|A|$ = number of elements of $A$

How do we find probabilities?

We Count!
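The counting definition can be tried directly on a small sample space. The sketch below assumes equally likely outcomes and uses two dice as the example; the helper name `probability` and the two events are illustrative choices, not part of the slides.

```python
from fractions import Fraction
from itertools import product


def probability(event, outcomes):
    """P(A) = |A| / |Omega| for a finite set of equally likely outcomes."""
    favorable = [w for w in outcomes if event(w)]
    return Fraction(len(favorable), len(outcomes))


# Omega for rolling two dice: all ordered pairs (i, j) with 1 <= i, j <= 6
two_dice = list(product(range(1, 7), repeat=2))

# Two example events (illustrative): the sum is 7, and both dice show the same face
p_sum_7 = probability(lambda w: w[0] + w[1] == 7, two_dice)
p_double = probability(lambda w: w[0] == w[1], two_dice)

print(len(two_dice), p_sum_7, p_double)
```

Using `Fraction` keeps the answers exact, which makes the "we count" idea visible in the result.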

Chromosomes and Genes

Genes are found on chromosomes and code for a specific trait

The possible alternative forms of the genes are called alleles.

Chromosomes are large DNA molecules found in the cell’s nucleus

Each gene has a specified place on the chromosome called a locus.

The human Chromosome 11 contains 28 genes. The first 5 genes from the tip of the short arm form a cluster of genes that encode components of hemoglobin

Problem

One gene, two types of alleles: a (recessive) and A (dominant)

$\Omega = \{aa, aA, Aa, AA\}$ - All possible sequences of length 2 comprised of a and A

k = number of dominant alleles (0, 1, or 2)

If E = “exactly k dominant alleles”, find P(E).

$|E| = 1$ when $k = 0$; $|E| = 2$ when $k = 1$; $|E| = 1$ when $k = 2$

$P(E) = \dfrac{|E|}{|\Omega|}$

Problem (cont.)

$\Omega = \{aa, aA, Aa, AA\}$

$P(E) = \dfrac{|E|}{|\Omega|} = \begin{cases} 1/4, & \text{when } k = 0 \\ 1/2, & \text{when } k = 1 \\ 1/4, & \text{when } k = 2 \end{cases}$

Gregor Mendel – experiments with peas

Round - dominant; Wrinkled - recessive

Parental Generation (P): round x wrinkled

First Filial Generation ($F_1$): only round peas in $F_1$

Second Filial Generation ($F_2$, from $F_1$ x $F_1$): 3:1 ratio of round vs. wrinkled in $F_2$

Phenotypic Ratios

1:3 (1:2:1)

$P(\text{round}) = 1 - \tfrac{1}{4} = 75\%$

$P(\text{wrinkled}) = \tfrac{1}{4} = 25\%$

Quantitative Traits (1909)

Herman Nilsson – Ehle

Parental Generation (P)

First Filial Generation ($F_1$): All of intermediate color

Second Filial Generation ($F_2$, from $F_1$ x $F_1$): Two new shades appear

Phenotypic Ratios

1 : 4 : 6 : 4 : 1

1 : 6 : 15 : 20 : 15 : 6 : 1

…

Quantitative Traits – Examples

Polygenic Hypothesis

n genes, two types of alleles: a and A

N = 2n – total positions

k = number of dominant alleles (0, 1, 2, …, N)

If E = “exactly k dominant alleles”, find P(E) = ?

Polygenic Hypothesis – set of outcomes

[Figure: four genes, labeled 1 to 4, with two allele positions each.]

$\Omega$ - All possible sequences of length 8 comprised of a and A, $|\Omega| = 2^8$

In general, $|\Omega| = 2^N = 2^{2n}$

Polygenic Hypothesis

Alleles a and A are equally likely

N = 2n – total positions

k = number of dominant alleles (0, 1, 2, …, N)

If E = “exactly k dominant alleles”, find P(E).

$|E| = \dbinom{N}{k} = \dfrac{N!}{k!\,(N-k)!}$

$|\Omega| = 2^N$

$P(E) = \dfrac{|E|}{|\Omega|} = \dfrac{\binom{N}{k}}{2^N}$
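As a quick check of the counting argument, the following sketch computes $|E| = \binom{N}{k}$ both from the formula and by brute-force enumeration of all sequences of a and A. The function names are illustrative, and the enumeration is only practical for small N.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p_exactly_k(N, k):
    """P(E) = C(N, k) / 2^N for N equally likely allele positions."""
    return Fraction(comb(N, k), 2 ** N)


def count_by_enumeration(N, k):
    """Count sequences of length N over {a, A} with exactly k dominant alleles."""
    return sum(1 for seq in product("aA", repeat=N) if seq.count("A") == k)


# Two genes (N = 4) and three genes (N = 6)
ratios_two_genes = [comb(4, k) for k in range(5)]
ratios_three_genes = [comb(6, k) for k in range(7)]

print(ratios_two_genes)    # [1, 4, 6, 4, 1]
print(ratios_three_genes)  # [1, 6, 15, 20, 15, 6, 1]
print(p_exactly_k(4, 2))   # 3/8
```

The two lists reproduce the phenotypic ratios 1 : 4 : 6 : 4 : 1 and 1 : 6 : 15 : 20 : 15 : 6 : 1 quoted for quantitative traits.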

Example: Nilsson-Ehle (1909)

Nilsson – Ehle: Two genes (n = 2), N = 2n = 4 number of alleles

X – number of a alleles in the N loci

$P(X = k) = \dfrac{\binom{N}{k}}{2^N} = \dfrac{\binom{4}{k}}{16}$

Random Variables

$X = X(\omega)$

Continuous – X can be any value from an interval

X is “known” when we know:

the distribution function F(x) = P(X < x);

the probability density function f(x) = d/dx [F(x)], so that

$F(x) = \int_{-\infty}^{x} f(t)\,dt$

Discrete – X takes integer values

X is “known” when we know P(X=k) for all possible k

Common Discrete Random Variables

Bernoulli (p): X takes values k = 0, 1

P(X=1) = p; P(X=0) = 1-p

Binomial Bin(N, p): X takes values k = 0, 1, 2, …, N

$P(X = k) = \dbinom{N}{k} p^k (1-p)^{N-k}$

Poisson Po($\lambda$): X takes values k = 0, 1, 2, 3, …

$P(X = k) = \dfrac{\lambda^k e^{-\lambda}}{k!}$

[Figure: binomial histograms for N = 20, p = 0.5; N = 20, p = 0.2; N = 20, p = 0.7.]
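The binomial and Poisson probabilities can be evaluated straight from their formulas. This is a minimal sketch using only the standard library; the comparison at the end (N = 1000, p = 0.003 against a Poisson with parameter Np = 3) is an added illustration of the well-known rare-event approximation, with parameter values chosen arbitrarily.

```python
from math import comb, exp, factorial


def binomial_pmf(k, N, p):
    """P(X = k) for X ~ Bin(N, p)."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return lam ** k * exp(-lam) / factorial(k)


# The probabilities of Bin(20, 0.5) add up to 1
total = sum(binomial_pmf(k, 20, 0.5) for k in range(21))

# Many trials with a small success probability: Bin(N, p) is close to Po(N * p)
for k in range(6):
    print(k, round(binomial_pmf(k, 1000, 0.003), 4), round(poisson_pmf(k, 3.0), 4))
```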

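Common Continuous Random Variables

Exponential: X takes values $x \in (0, \infty)$

$F(x) = 1 - e^{-\lambda x}$

$f(x) = \lambda e^{-\lambda x}$

Gaussian (Normal): X takes values $x \in (-\infty, \infty)$

$N(\mu, \sigma)$: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$N(0, 1)$: $f(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$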

Bell - Shaped Distr. of Quantitative Traits

Traits are controlled not by one but by several different genes. The genes are independent and contribute cumulatively to the expression of the characteristic (Polygenic Hypothesis)

Distribution of the trait is Binomial (2n, p), where n is the number of genes and p is the frequency of the non-contributing allele in the population.

Distribution is approximately Gaussian.

Further “smoothing” by environmental factors

[Figure: binomial histograms for N = 8, p = 0.2; N = 20, p = 0.5; N = 50, p = 0.7.]

When Np is large and N(1-p) is large, then

Binomial (N, p) ~ Normal $(Np, \sqrt{Np(1-p)})$
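One way to see how good the approximation is: compare each binomial probability with the normal density that has mean Np and standard deviation $\sqrt{Np(1-p)}$, evaluated at the same integer. The sketch below reports the largest pointwise gap for the three parameter pairs shown on the slide; the function names are illustrative.

```python
from math import comb, exp, pi, sqrt


def binomial_pmf(k, N, p):
    """P(X = k) for X ~ Bin(N, p)."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def max_abs_gap(N, p):
    """Largest difference between the Bin(N, p) pmf and the matching normal density."""
    mu, sigma = N * p, sqrt(N * p * (1 - p))
    return max(abs(binomial_pmf(k, N, p) - normal_pdf(k, mu, sigma))
               for k in range(N + 1))


for N, p in [(8, 0.2), (20, 0.5), (50, 0.7)]:
    print(N, p, round(max_abs_gap(N, p), 4))
```

The gap is largest for N = 8, p = 0.2, where Np = 1.6 is not large, which is the case the condition on Np and N(1-p) is meant to exclude.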

Why the “bell-shaped” distribution of quantitative traits?

Moivre (1667 - 1754)

Laplace (1749 - 1827)

Central Limit Theorem

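Aggregate Characteristics

Mean Value: $E(X) = \sum_k k\,P(X = k)$ (discrete); $E(X) = \int x f(x)\,dx$ (continuous)

Variance: $Var(X) = E\big[(X - E(X))^2\big] = E(X^2) - [E(X)]^2$; the Standard Deviation is $\sqrt{Var(X)}$

Moments of order m: $E(X^m) = \sum_k k^m P(X = k)$ (discrete); $E(X^m) = \int x^m f(x)\,dx$ (continuous)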

Examples

Binomial (N, p): $E(X) = Np$, $Var(X) = Npq$, where $q = 1 - p$

Poisson ($\lambda$): $E(X) = \lambda$, $Var(X) = \lambda$

Gaussian ($\mu$, $\sigma$): $E(X) = \mu$, $Var(X) = \sigma^2$

Poisson Distribution Arises When…

Events of low intensity occurring in time

[Figure: time axis from 0 to t.]

X(t) – the number of events that have occurred in [0,t]

X(t) has a Poisson distribution with parameter $\lambda t$:

$P(X(t) = k) = \dfrac{(\lambda t)^k e^{-\lambda t}}{k!}$

Average number of events per unit time = $\lambda$

Poisson Distribution Arises When…

Events of low intensity occurring independently of one another

X – the number of events that have occurred in a unit surface/volume over time t

X has a Poisson distribution with parameter $\lambda t$:

$P(X = k) = \dfrac{(\lambda t)^k e^{-\lambda t}}{k!}$

Average number of events per unit surface/volume per unit time = $\lambda$

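The Law of Large Numbers (1713)

If X is a random variable with $E(X) = \mu$, and $X_1, X_2, \ldots, X_n$ are independent observations of X, then

$\dfrac{X_1 + X_2 + \cdots + X_n}{n} \to \mu$, as $n \to \infty$,

or, equivalently,

$\bar{X}_n \to \mu$, as $n \to \infty$.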

Example – Ordinary Coin Toss Game

1. Toss a coin

2. If Heads, win $1

3. If Tails, win nothing

4. Let $X_i$ be your win for game i

5. Average payback to you: $E(X_i) = (1/2) \cdot 0 + (1/2) \cdot 1 = \$0.50$

6. By the Law of Large Numbers, $\dfrac{X_1 + X_2 + \cdots + X_n}{n} \to 0.5$, as $n \to \infty$.

Simulation Example
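A simulation of the coin toss game could look like the sketch below. It is not the demo used in the talk; the function name, the seed, and the number of games are arbitrary choices, and Python's `random` module stands in for the coin.

```python
import random


def running_average_of_wins(n_games, seed=0):
    """Play the $1-on-Heads game n_games times; return the average win after each game."""
    rng = random.Random(seed)
    total = 0
    averages = []
    for i in range(1, n_games + 1):
        total += 1 if rng.random() < 0.5 else 0  # Heads pays $1, Tails pays nothing
        averages.append(total / i)
    return averages


averages = running_average_of_wins(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(averages[n - 1], 4))
```

The early averages can be far from 0.5; the later ones settle near the expected payback of $0.50.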

Example – St. Petersburg Game

1. Toss a coin

2. If Heads, win $2

3. If Tails, keep tossing until it falls Heads

4. If first Heads on N-th toss, win $\$2^N$

H: $2; TH: $4; TTH: $8; TTTH: $16; etc.

5. With probability $1/2^N$ we win $\$2^N$

6. Average payback to you:

$2\left(\tfrac{1}{2}\right) + 2^2\left(\tfrac{1}{2}\right)^2 + 2^3\left(\tfrac{1}{2}\right)^3 + \cdots = 1 + 1 + 1 + \cdots$

St. Petersburg Game – a sample run
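A sample run of the St. Petersburg game can be generated along the following lines. This is a sketch under the rules listed above (first Heads on toss N pays $2^N$), not the original demo; names, seed, and run length are illustrative.

```python
import random


def play_once(rng):
    """Toss until the first Heads; return the payoff 2**N, where N is the toss count."""
    tosses = 1
    while rng.random() < 0.5:  # Tails (probability 1/2): keep tossing
        tosses += 1
    return 2 ** tosses


def sample_run(n_games, seed=1):
    """Return the list of payoffs and the running average payoff per game."""
    rng = random.Random(seed)
    payoffs, averages, total = [], [], 0
    for i in range(1, n_games + 1):
        win = play_once(rng)
        total += win
        payoffs.append(win)
        averages.append(total / i)
    return payoffs, averages


payoffs, averages = sample_run(20_000)
for n in (100, 1_000, 10_000, 20_000):
    print(n, round(averages[n - 1], 2), max(payoffs[:n]))
```

Unlike the ordinary coin toss game, the running average does not settle down: occasional very large payoffs keep pushing it upward, which is what an infinite expected payback looks like in a finite run.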

Random Processes (Temporal Stochastic Models)

Random Process: X(t) – Random variable that changes in time

When t = 0, 1, 2, … – Discrete Random Process

When t changes continuously – Continuous Random Process

In addition, since for any value of t, X(t) can be a discrete or a continuous random variable, there are four possibilities for the process {X(t), t}.

{X(t), t} is defined through its probability distribution $p^i_x(t) = P(X(t) = x \mid X(0) = i)$.

For example, if X(t) can take values x = 0, 1, 2, …, then $p^i(t) = [p^i_0(t), p^i_1(t), p^i_2(t), \ldots]$ is the probability distribution of X.

Single Population Immigration-Death Process

Deterministic Model

X(t) = population size at time t

I = rate of immigration

a = per capita death rate

$\dfrac{dX}{dt} = I - aX$

Stochastic Model (Kolmogorov – Chapman DE)

$X(t + \Delta t) = x$ can happen when:

X(t) = x and no change over $\Delta t$. (Event A)

X(t) = x + 1 and one death over $\Delta t$. (Event B)

X(t) = x - 1 and one immigration over $\Delta t$. (Event C)

Probability for more than unit change over $\Delta t$ is $o(\Delta t)$. (Event D)

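Kolmogorov – Chapman Equations

$p_n(t) = \Pr(X(t) = n)$

$p_n(t + \Delta t) = a(n+1)\Delta t\, p_{n+1}(t) + I \Delta t\, p_{n-1}(t) + [1 - (I + an)\Delta t]\, p_n(t) + o(\Delta t)$

The four terms are P(B), P(C), P(A), and P(D), in that order.

Subtract $p_n(t)$, divide by $\Delta t$, and let $\Delta t \to 0$:

$\dfrac{d}{dt} p_n(t) = a(n+1)\, p_{n+1}(t) + I\, p_{n-1}(t) - (I + an)\, p_n(t), \quad n > 0$

$\dfrac{d}{dt} p_0(t) = -I\, p_0(t) + a\, p_1(t), \quad n = 0$

Demo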

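How are the Stochastic and Deterministic Models Related?

Define $\bar{X} = E(X) = \sum_n n\, p_n(t)$

Multiply the K-C equation by n and sum over n:

$\dfrac{d}{dt} p_n(t) = a(n+1)\, p_{n+1}(t) + I\, p_{n-1}(t) - (I + an)\, p_n(t), \quad n > 0$

$\dfrac{d}{dt} \bar{X} = a \sum_n n(n+1)\, p_{n+1}(t) + I \sum_n n\, p_{n-1}(t) - \sum_n n(I + an)\, p_n(t)$

$\dfrac{d}{dt} \bar{X} = a \sum_n [(n-1)n - n^2]\, p_n(t) + I \sum_n [(n+1) - n]\, p_n(t)$

$\dfrac{d}{dt} \bar{X} = -a \sum_n n\, p_n(t) + I \cdot 1 = I - a\bar{X}$

The mean value of the stochastic process X satisfies the deterministic equation

$\dfrac{d}{dt} \bar{X} = I - a\bar{X}$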
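The relation between the two models can be checked numerically. The sketch below integrates the equations $\frac{d}{dt} p_n = a(n+1)p_{n+1} + I p_{n-1} - (I + an)p_n$ with a plain Euler step and compares the resulting mean with the deterministic steady state I/a. Several things here are assumptions made for illustration: the parameter values, the truncation of the state space at `n_max` (immigration out of the top state is switched off so that probability is conserved), and the step size.

```python
def euler_step(p, I, a, dt):
    """One Euler step of the immigration-death equations on states 0..n_max."""
    n_max = len(p) - 1
    new_p = []
    for n in range(n_max + 1):
        from_death = a * (n + 1) * p[n + 1] if n < n_max else 0.0
        from_immigration = I * p[n - 1] if n > 0 else 0.0
        leave_rate = (I if n < n_max else 0.0) + a * n  # no immigration out of n_max
        new_p.append(p[n] + dt * (from_death + from_immigration - leave_rate * p[n]))
    return new_p


def solve_master_equation(I, a, n_max, t_end, dt):
    """Start from an empty population (p_0 = 1) and integrate up to time t_end."""
    p = [1.0] + [0.0] * n_max
    for _ in range(round(t_end / dt)):
        p = euler_step(p, I, a, dt)
    return p


def mean_population(p):
    """Mean of the distribution: sum over n of n * p_n."""
    return sum(n * pn for n, pn in enumerate(p))


I_RATE, DEATH_RATE = 2.0, 0.5  # illustrative values, not from the slides
p_final = solve_master_equation(I_RATE, DEATH_RATE, n_max=40, t_end=20.0, dt=0.002)
print(round(mean_population(p_final), 3), I_RATE / DEATH_RATE)
```

With these values the deterministic equation dX/dt = I - aX has the steady state I/a = 4, and the mean of the stochastic model approaches the same number.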

Luria-Delbruck Experiments

Darwinian Model - mutations are equally likely to occur at any moment in time.

Lamarckian Model - mutations evolve only in response to an environmental cue.

When do mutations occur?

Luria-Delbruck Experiments (1943)

Grow a large number of bacterial cultures, each started from a small number of cells.

Plate the cultures on nutrient agar plates on which a large amount of a virus has been plated first. Incubate.

Luria SE & Delbruck M. Mutations of Bacteria from Virus Sensitivity to Virus Resistance. Genetics 28:491(1943).

Control

Two opposing hypotheses

Hypothesis 1 (Acquired Immunity, Directed Mutation): A small number of bacteria mutate to acquire resistance only after they are exposed to the virus. Survival confers immunity not only to the individual but also to its offspring, and the colonies grow.

Hypothesis 2 (Mutation + Selection): Mutations occur randomly, but the probability that a bacterium mutates from sensitive to resistant is small. This mutation is completely independent of the presence of the virus. When the bacteria are added to the plates, the mutants are already resistant to the virus. Only these mutants proliferate into colonies on the plate.

Count the Number of Colonies

What is the Distribution of the Mutant Cells at the time of plating?

Under the Directed Mutation Hypothesis: Poisson

$E(X) = Var(X)$, so $Var(X)/E(X) = 1$

Under the Mutation + Selection Hypothesis: Non-Poisson (Luria-Delbruck Distribution)

$Var(X) \gg E(X)$; $Var(X)$ is very large

Large variation in the number of mutants
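The contrast between the two hypotheses shows up clearly in a small simulation. The sketch below is an illustration under simplifying assumptions that are not in the slides: every culture starts from one cell and doubles for a fixed number of generations, mutation happens only at division, mutants breed true, and the parameter values are arbitrary. It compares the variance-to-mean ratio of the mutant counts under each hypothesis.

```python
import random
from statistics import mean, variance


def binomial_draw(rng, n, p):
    """Number of successes in n independent trials with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def mutants_random_mutation(rng, generations, p):
    """Mutation + selection: daughters mutate at division; mutant clones keep doubling."""
    normal, mutant = 1, 0
    for _ in range(generations):
        daughters = 2 * normal
        new_mutants = binomial_draw(rng, daughters, p)
        normal = daughters - new_mutants
        mutant = 2 * mutant + new_mutants
    return mutant


def mutants_directed(rng, generations, q):
    """Directed mutation: each final cell becomes resistant on exposure, independently."""
    return binomial_draw(rng, 2 ** generations, q)


def variance_to_mean(counts):
    return variance(counts) / mean(counts)


rng = random.Random(2007)
CULTURES, GENERATIONS = 300, 10  # illustrative sizes
random_counts = [mutants_random_mutation(rng, GENERATIONS, 0.002) for _ in range(CULTURES)]
directed_counts = [mutants_directed(rng, GENERATIONS, 0.02) for _ in range(CULTURES)]

ratio_random = variance_to_mean(random_counts)
ratio_directed = variance_to_mean(directed_counts)
print(round(ratio_directed, 2), round(ratio_random, 2))
```

Under directed mutation the ratio stays close to 1, as for a Poisson count. Under random mutation a few cultures with an early mutation carry very large clones, so the variance is many times the mean.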

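What is the average number of resistant cells under continuous mutation?

Assume that mutation can only occur at the time of division

Assume that each cell can mutate with a constant probability p

For each generation i: average number of mutant cells in generation i, and expected number of mutants at the end from this generation

i = 0: $p$; $p\,2^N$

i = 1: $2p$; $2p \cdot 2^{N-1} = p\,2^N$

i = 2: $2^2 p$; $2^2 p \cdot 2^{N-2} = p\,2^N$

i = 3: $2^3 p$; $2^3 p \cdot 2^{N-3} = p\,2^N$

i = 4: $2^4 p$; $2^4 p \cdot 2^{N-4} = p\,2^N$

i = 5: $2^5 p$; $2^5 p \cdot 2^{N-5} = p\,2^N$

i = 6: $2^6 p$; $2^6 p \cdot 2^{N-6} = p\,2^N$

$E(X) = p\,2^N + p\,2^N + \cdots + p\,2^N$ (one term $p\,2^N$ for each generation)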

Mutation.xls

AcqIm.xls

Biological ESTEEM

Lea and Coulson (1949)

Lea, D.E. and Coulson, C.A. (1949) The distribution of the number of mutants in bacterial populations. J. Genetics 49, 264-285

Theorem. Let $X_t$ denote the number of mutant cells in the culture at time t. If p is the probability for a single cell to mutate and $m = p\,2^n$, then the probability generating function of the distribution, defined by

$\phi(x, m) = \sum_k P(X_t = k)\, x^k$,

has the form

$\phi(x, m) = (1 - x)^{m(1-x)/x}$

More recent work on the Luria-Delbruck distribution

Evaluating risk from time series data

Glucose Variability and Risk Assessment in Diabetes

Heart Rate Variability and the Risk for Neonatal Sepsis

Sixteen Million people in the United States have Diabetes Mellitus.

In both human and economic terms, diabetes is one of the nation's most costly diseases. Diabetes is the leading cause of kidney failure, blindness in adults, and amputations. It is a major risk factor for heart disease, stroke, and birth defects. Diabetes shortens average life expectancy by up to 15 years, and costs our nation in excess of $100 billion annually in health-related expenditures, more than any other single chronic disease. Diabetes spares no group, affecting young and old, all races and ethnic groups, the rich and the poor.

Blood Glucose Fluctuation Characteristics Quantified from Self-Monitoring Data

Definitions

• Type 1 Diabetes, also referred to as Insulin Dependent Diabetes Mellitus (IDDM), is the type of diabetes in which the pancreas produces no insulin or extremely small amounts;

• Type 2 Diabetes is the type of diabetes in which the body doesn’t use its insulin effectively or doesn’t produce enough insulin;

• Insulin is a hormone secreted by the pancreas that regulates metabolism of glucose;

• Blood Glucose (BG) is the concentration of glucose in the bloodstream;

• The BG levels are measured in mg/dl (USA) and in mmol/L (most elsewhere);

• The two scales are directly related by: 18 mg/dl = 1 mmol/L (1 mM).
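Since the rest of the talk moves between the two BG scales, a pair of conversion helpers may be handy. This is a minimal sketch using the rounded factor 18 mg/dl = 1 mmol/L from the definitions above; the function names are illustrative.

```python
MGDL_PER_MMOLL = 18.0  # rounded conversion factor: 18 mg/dl = 1 mmol/L


def mgdl_to_mmol(bg_mgdl):
    """Convert a blood glucose reading from mg/dl to mmol/L."""
    return bg_mgdl / MGDL_PER_MMOLL


def mmol_to_mgdl(bg_mmol):
    """Convert a blood glucose reading from mmol/L to mg/dl."""
    return bg_mmol * MGDL_PER_MMOLL


# The target range 70-180 mg/dl expressed in mmol/L
print(round(mgdl_to_mmol(70), 1), round(mgdl_to_mmol(180), 1))  # 3.9 10.0
```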

Target Blood Glucose Range: 70-180 mg/dl (DCCT, 1993)

[Figure: BG fluctuations around the target range, with food and insulin driving BG toward hyperglycemia or hypoglycemia, and counter-regulation acting against severe hypoglycemia.]

Severe Hypoglycemia

• Severe hypoglycemia (SH) is defined as a low BG resulting in stupor, seizure, or unconsciousness that precludes self-treatment (The Diabetes Control and Complications Trial Research Group, 1997). Four percent of the deaths among individuals with IDDM are attributed to SH (DCCT Study Group, 1991).

• Although most severe hypoglycemic episodes are not fatal, there remain numerous negative sequelae leading to compromised occupational and scholastic functioning, social embarrassment, poor judgment, serious accidents, and possible permanent cognitive dysfunction (Gold AE et al., 1993; Deary et al., 1993; Lincoln et al., 1996).

• Fear of severe hypoglycemia is identified as the major barrier to improved metabolic control (Cryer et al., 1994).

BG Fluctuations: T1DM

[Figure: BG time series; vertical axis 0 to 600, horizontal axis 0 to 30.]

BG Fluctuations: T2DM

[Figure: BG time series; vertical axis 0 to 600, horizontal axis 0 to 30.]

Average Glycemia and Glucose Variability

[Figure: Blood Glucose (mg/dl, 0 to 400) vs. Time (days, 0 to 30) in two panels. Person A: HbA1c = 8.0%. Person B: HbA1c = 8.0%.]

Blood Glucose (BG) Monitoring Systems

Self-Monitoring BG Devices (typically 3-10 measurements/24 hours)

Continuous BG Monitoring Systems (up to 288 measurements/24 hours)

The Distribution of the BG Levels: (Mean = 6.7, SD = 3.6; normality hypothesis is rejected, P < 0.05)

[Figure: histogram of Frequency vs. BG (mM, 0 to 20), marking the hypoglycemia, target, and hyperglycemia ranges, the clinical center and the numerical center, the standard data range, and the data range if symmetrization is used.]

Symmetrization of the BG Scale:

Assumptions:
A1: The transformed whole BG range should be symmetric around 0.
A2: The transformed target BG range should be symmetric around 0.

Transformation: $f(BG, \alpha, \beta) = (\ln BG)^{\alpha} - \beta$, with $\alpha, \beta > 0$

That satisfies the conditions:
A1: $f(33.3, \alpha, \beta) = -f(1.1, \alpha, \beta)$ and A2: $f(10, \alpha, \beta) = -f(3.9, \alpha, \beta)$.

Which leads to the equations:

$(\ln 33.3)^{\alpha} - \beta = -[(\ln 1.1)^{\alpha} - \beta]$

$(\ln 10.0)^{\alpha} - \beta = -[(\ln 3.9)^{\alpha} - \beta]$

$\gamma[(\ln 33.3)^{\alpha} - \beta] = -\gamma[(\ln 1.1)^{\alpha} - \beta] = \sqrt{10}$ (scaling)

When solved numerically:

$\alpha = 1.033$, $\beta = 1.871$ and $\gamma = 1.774$ (when BG is in mM)

$\alpha = 1.084$, $\beta = 5.381$ and $\gamma = 1.509$ (when BG is in mg/dl)
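The numerical solution for BG in mM can be reproduced with a few lines of bisection. The sketch assumes the transformation has the form $\gamma[(\ln BG)^{\alpha} - \beta]$ with the endpoints 1.1 and 33.3 mM for the whole range, 3.9 and 10 mM for the target range, and a scaling that maps the range ends to $\pm\sqrt{10}$; the symbol names and the bracketing interval [1, 2] for the exponent are choices made here, not taken from the talk.

```python
from math import log, sqrt

BG_MIN, BG_MAX = 1.1, 33.3           # whole BG range (mM)
TARGET_LOW, TARGET_HIGH = 3.9, 10.0  # target BG range (mM)


def symmetry_gap(alpha):
    """A1 and A2 both fix beta as a half-sum; the gap is zero when the two agree."""
    whole = log(BG_MAX) ** alpha + log(BG_MIN) ** alpha
    target = log(TARGET_HIGH) ** alpha + log(TARGET_LOW) ** alpha
    return whole - target


def solve_parameters(lo=1.0, hi=2.0, tol=1e-10):
    """Bisection on alpha, then beta from A1 and gamma from the scaling condition."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if symmetry_gap(lo) * symmetry_gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    alpha = (lo + hi) / 2
    beta = (log(BG_MAX) ** alpha + log(BG_MIN) ** alpha) / 2
    gamma = sqrt(10) / (log(BG_MAX) ** alpha - beta)
    return alpha, beta, gamma


alpha, beta, gamma = solve_parameters()
print(round(alpha, 3), round(beta, 3), round(gamma, 3))
```

The three values agree, to about three decimals, with the mM parameters quoted above.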

Symmetrization Function:

f(BG) = 1.774 * (ln(BG)^1.033 - 1.871)

[Figure: f(BG), from -3.5 to 3.5, plotted against BG (mM, 1 to 34), with the clinical center and the numerical center marked.]

Distribution of the Transformed BG Levels:

[Figure: histogram of Frequency (0 to 50) vs. f(BG) (-2.5 to 2.5), marking the hypoglycemia, target, and hyperglycemia ranges over the symmetrized data range; the clinical and numerical centers coincide.]

Defining the Low and High Blood Glucose Indices:

The BG risk function: $r(BG) = 10 \cdot f(BG)^2$

Let $x_1, x_2, \ldots, x_n$ be a series of n BG readings, and let

rl(BG) = r(BG) if f(BG) < 0 and 0 otherwise;
rh(BG) = r(BG) if f(BG) > 0 and 0 otherwise.

The Low Blood Glucose [Risk] Index (LBGI) and the High BG [Risk] Index (HBGI) are then defined as:

$LBGI = \dfrac{1}{n} \sum_{i=1}^{n} rl(x_i)$

$HBGI = \dfrac{1}{n} \sum_{i=1}^{n} rh(x_i)$
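The definitions translate almost line by line into code. The sketch below uses the symmetrization f(BG) = 1.774 * (ln(BG)^1.033 - 1.871) for BG in mM; the function names, the input guard, and the sample readings are illustrative assumptions, and the readings are made-up numbers rather than patient data.

```python
from math import log

ALPHA, BETA, GAMMA = 1.033, 1.871, 1.774  # parameters for BG measured in mM


def f(bg):
    """Symmetrizing transformation of a BG reading (mM)."""
    if bg <= 1.0:
        # ln(bg) must be positive for the non-integer power to be real
        raise ValueError("BG reading must be above 1 mM")
    return GAMMA * (log(bg) ** ALPHA - BETA)


def risk(bg):
    """BG risk function r(BG) = 10 * f(BG)^2."""
    return 10 * f(bg) ** 2


def lbgi(readings):
    """Low BG Index: average of r over all readings, counting only those with f < 0."""
    return sum(risk(x) for x in readings if f(x) < 0) / len(readings)


def hbgi(readings):
    """High BG Index: average of r over all readings, counting only those with f > 0."""
    return sum(risk(x) for x in readings if f(x) > 0) / len(readings)


readings = [3.2, 4.5, 6.3, 8.0, 11.5, 14.0]  # hypothetical self-monitoring data (mM)
print(round(lbgi(readings), 2), round(hbgi(readings), 2))
```

Because the sums are divided by the total number of readings n, the two indices add up to the average risk of the whole series.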

Symmetrization of the BG Measurement Scale

[Figure: r(BG), from 0 to 100, vs. the Transformed BG Scale (-3 to 3); the curve y = 10 * x^2 shows Low BG Risk on the hypoglycemia side and High BG Risk on the hyperglycemia side, with the target range around the clinical and numerical center.]

Risk Analysis of Blood Glucose Data: Theory and Algorithms

• Evaluation of HbA1c

• Assessment of Long-Term Risk for [Severe] Hypoglycemia

• Assessment of Short-Term Risk for [Severe] Hypoglycemia

• Predicts 40% of SH episodes for the subsequent 6 months;

• Predicts 50% of imminent SH episodes (24 hours);

• The technology has been licensed by Lifescan Inc, Milpitas, CA;

The Blood Glucose Risk Function: (As Defined on the Original Blood Glucose Scale)

[Figure: r(BG), from 0 to 100, vs. BG Level (mM, 0 to 34), showing Low BG Risk, the Target Range, and High BG Risk.]

Heart Rate Variability and the Risk for Neonatal Sepsis

• 4 million births
• 40,000 very low birth weight (VLBW, <1500 grams) infants
• 15,000 NICU beds
• 400,000 NICU admissions

Neonatal Sepsis: A Major Public Health Problem

• Risk of sepsis is high
– 25 - 40% of VLBW infants develop sepsis while in the neonatal intensive care unit

• Significant mortality and morbidity
– In VLBW infants, sepsis doubles the risk of dying
– Length of stay is increased by 1 month
– Health care costs are increased

Current Practice for Infants at Risk for Sepsis

• Nurse relates that infant in NICU is “not acting right” or “looks a little off”

• Physicians must take the cautious approach, suspecting sepsis

• Assessment includes invasive tests:
– CBC, blood culture, urine culture, lumbar puncture

• Intervention: antibiotics

Problems with Current Medical Practice

• Nurses' and physicians’ subjective assessments are neither sensitive nor specific

• Diagnostic tests have important limitations:
– invasive
– not performed until infant has clinical signs
– various CBC components range from 11% to 77%

Need for Better Risk Assessment for Neonatal Sepsis

• Tremendous need for continuous non-invasive monitoring for sepsis

• Any device that adds objective information about infant’s state of health from continuous risk assessment monitoring would be helpful

[Figure: Magnitude of RR interval (msec, 300 to 600) vs. Time (RR interval number, 0 to 4096), panels A, B, and C.]

[Figure: histograms of the difference from the median (msec, -20 to 120) for panels A, B, and C, with the median marked. Panel A: Sample Asymmetry = 1.37, R1 = 42, R2 = 57.5. Panel B: Sample Asymmetry = 2.97, R1 = 27, R2 = 79.5. Panel C: Sample Asymmetry = 11.8, R1 = 45.5, R2 = 538.5.]