Probabilistic modeling and molecular phylogeny Anders Gorm Pedersen Molecular Evolution Group


Page 1

Probabilistic modeling and molecular phylogeny

Anders Gorm Pedersen

Molecular Evolution Group
Center for Biological Sequence Analysis
Technical University of Denmark (DTU)

Page 2

What use are models?

• “But that’s just a model. That tells me nothing about reality.”

Page 3

What is a model?

Page 4

What is a model?

Page 5

What is a model?

• Model = hypothesis !!!

• Hypothesis (as used in most biological research):
  – Precisely stated, but qualitative
  – Allows you to make qualitative predictions

• Arithmetic model:
  – Mathematically explicit (parameters)
  – Allows you to make quantitative predictions

Page 6

The scientific method

Observation of data

Hypothesis about how system works

Prediction(s) about system behavior

Page 7

The scientific method

Observation of data

Model of how system works

Prediction(s) about system behavior

Page 8

A model (hypothesis) is a simplified picture of reality

"[…] We actually made a map of the country, on the scale of a mile to the mile!"

"Have you used it much?" I inquired.

"It has never been spread out, yet," said Mein Herr: "the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we use the country itself, as its own map, and I assure you it does nearly as well."

Lewis Carroll, "Sylvie and Bruno Concluded", 1893

Page 9

The maximum likelihood approach I

• Starting point: You have some observed data and a probabilistic model for how the observed data was produced

• Example:
  – Data: result of tossing coin 10 times - 7 heads, 3 tails
  – Model: coin has probability p for heads, 1-p for tails. The probability of observing h heads among n tosses is:

$$P(h \text{ heads}) = \binom{n}{h} p^h (1-p)^{n-h}$$

• Goal: You want to find the best estimate of the (unknown) parameter value based on the observations (here the only parameter is “p”).
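The binomial probability on this slide can be evaluated directly. The following is a minimal sketch for illustration; the function name `prob_heads` and the trial values of p are assumptions made for this example and do not come from the slides.

```python
import math

def prob_heads(h, n, p):
    """Probability of exactly h heads in n tosses when P(heads) = p."""
    return math.comb(n, h) * p**h * (1 - p)**(n - h)

# Observed data from the slide: 7 heads in 10 tosses.
# The trial values of p below are arbitrary and only for illustration.
for p in (0.3, 0.5, 0.7):
    print(f"p = {p}: P(7 heads) = {prob_heads(7, 10, p):.4f}")
```

Comparing the printed values shows how strongly the probability of the same data depends on the assumed value of p, which is the idea the maximum likelihood approach builds on.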

Page 10

The maximum likelihood approach II

Likelihood = Probability (Data | Model)

Maximum likelihood: Best estimate is the set of parameter values which gives the highest possible likelihood.
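For the coin-tossing model the maximum can also be found analytically. The short derivation below is a standard calculus argument added for illustration; the symbol $\hat{p}$ for the estimate is notation introduced here, not taken from the slides.

```latex
\begin{align*}
L(p) &= \binom{n}{h} p^{h} (1-p)^{n-h} \\
\ln L(p) &= \ln \binom{n}{h} + h \ln p + (n-h) \ln (1-p) \\
\frac{d \ln L}{dp} &= \frac{h}{p} - \frac{n-h}{1-p} = 0
\quad\Longrightarrow\quad \hat{p} = \frac{h}{n}
\end{align*}
```

With 7 heads in 10 tosses this gives $\hat{p} = 0.7$.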

Page 11

Maximum likelihood: coin tossing example

Data: result of tossing coin 10 times - 7 heads, 3 tails

Model: coin has probability p for heads, 1-p for tails.

$$P(h \text{ heads}) = \binom{10}{h} p^h (1-p)^{10-h}$$
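One simple way to locate the maximum numerically is to evaluate the likelihood on a grid of candidate values of p. The sketch below assumes a grid spacing of 0.01, chosen only for illustration; the function name `likelihood` is likewise hypothetical.

```python
import math

def likelihood(p, h=7, n=10):
    """Likelihood of heads-probability p given h heads in n tosses."""
    return math.comb(n, h) * p**h * (1 - p)**(n - h)

# Evaluate the likelihood on a grid of candidate values 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=likelihood)

print(f"ML estimate of p on the grid: {best_p:.2f}")
print(f"Likelihood at that value:     {likelihood(best_p):.4f}")
```

The grid maximum falls at p = 0.70, the observed fraction of heads.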

Page 12

Probabilistic modeling applied to phylogeny

• Observed data: multiple alignment of sequences

H.sapiens globin     A G G G A T T C A
M.musculus globin    A C G G T T T - A
R.rattus globin      A C G G A T T - A

• Probabilistic model parameters (simplest case):
  – Nucleotide frequencies: π_A, π_C, π_G, π_T
  – Tree topology and branch lengths
  – Nucleotide-nucleotide substitution rates (or substitution probabilities):
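As one concrete instance of substitution probabilities, the sketch below uses the Jukes-Cantor (JC69) model, in which all substitutions are equally likely and all base frequencies are 1/4. Choosing JC69 is an assumption made for this example and is not necessarily the matrix shown on the slide; the function name `jc69_prob` and the branch length 0.1 are likewise hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, d):
    """JC69 probability that base i is replaced by base j along a branch
    of length d (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    if i == j:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

# Substitution probability matrix for a hypothetical branch length of 0.1.
d = 0.1
for i in BASES:
    row = "  ".join(f"{jc69_prob(i, j, d):.4f}" for j in BASES)
    print(f"{i}  {row}")
```

Each row sums to 1, and for very long branches every entry approaches 1/4, the equilibrium base frequency under this model.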

Page 13

Other models of nucleotide substitution

Page 14

More elaborate models of evolution

• Codon-codon substitution rates (64 x 64 matrix of codon substitution rates)

• Different mutation rates at different sites in the gene (the “gamma-distribution” of mutation rates)

• Molecular clocks (same mutation rate on all branches of the tree).

• Different substitution matrices on different branches of the tree (e.g., strong selection on branch leading to specific group of animals)

• Etc., etc.

Page 15

Computing the probability of one column in an alignment given tree topology and other parameters

A A G G A T T C A
A A G G T T T - A
A C G G A T T - A
A G G G T T T - A

[Tree figure: tips A, A, G, C; ancestral states T and C at the two internal nodes; branch lengths t1 to t5]

Columns in alignment contain homologous nucleotides

Assume tree topology, branch lengths, and other parameters are given. Assume ancestral states were T and C. Start computation at any internal or external node.

$$\Pr = \pi_T \, P_{TA}(t_1) \, P_{TA}(t_2) \, P_{TC}(t_3) \, P_{CG}(t_4) \, P_{CC}(t_5)$$
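The product on this slide can be computed once a substitution model and branch lengths are fixed. The sketch below assumes the Jukes-Cantor (JC69) model with equal base frequencies and uses made-up branch lengths, since the slide gives no numeric values; `jc69_prob` is a hypothetical helper name.

```python
import math

def jc69_prob(i, j, d):
    """JC69 probability of going from base i to base j along a branch
    of length d (expected substitutions per site)."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

# Equal base frequencies, as JC69 assumes.
PI = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical branch lengths t1..t5 (not taken from the slide).
t1, t2, t3, t4, t5 = 0.1, 0.1, 0.05, 0.2, 0.2

# Ancestral states fixed to T and C; tips are A, A, G, C.
pr = (PI["T"]
      * jc69_prob("T", "A", t1)
      * jc69_prob("T", "A", t2)
      * jc69_prob("T", "C", t3)
      * jc69_prob("C", "G", t4)
      * jc69_prob("C", "C", t5))

print(f"Pr(column | ancestors T and C) = {pr:.3e}")
```

This is the probability for one particular pair of ancestral states only; it is a single term of the full column probability.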

Page 16

Computing the probability of an entire alignment given tree topology and other parameters

• Probability must be summed over all possible combinations of ancestral nucleotides. (Here we have two internal nodes giving 16 possible combinations)

• Probabilities of individual columns are multiplied to give the overall probability of the alignment, i.e., the likelihood of the model.

• Often the log of the probability is used (log likelihood)
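The summation over ancestral states and the multiplication over columns can be sketched as follows. The example assumes the Jukes-Cantor (JC69) model, a four-tip tree with two internal nodes, made-up branch lengths, and a made-up gap-free alignment; all names (`jc69_prob`, `column_likelihood`) are hypothetical, and the brute-force sum is used only because the tree is tiny.

```python
import math
from itertools import product

BASES = "ACGT"
PI = 0.25  # equal base frequencies under JC69

def jc69_prob(i, j, d):
    """JC69 probability of going from base i to base j along branch length d."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def column_likelihood(tips, t):
    """Probability of one column, summed over both ancestral states.

    Tree shape: internal node x has tips[0] and tips[1] as children,
    internal node y has tips[2] and tips[3]; branch t[2] joins x and y.
    """
    a, b, c, d = tips
    t1, t2, t3, t4, t5 = t
    total = 0.0
    for x, y in product(BASES, repeat=2):  # 4 x 4 = 16 combinations
        total += (PI * jc69_prob(x, a, t1) * jc69_prob(x, b, t2)
                  * jc69_prob(x, y, t3)
                  * jc69_prob(y, c, t4) * jc69_prob(y, d, t5))
    return total

branch_lengths = (0.1, 0.1, 0.05, 0.2, 0.2)  # hypothetical values
columns = ["AAGC", "AAAA", "GGGG"]           # hypothetical alignment columns

# Multiplying column probabilities equals summing their logarithms.
log_likelihood = sum(math.log(column_likelihood(col, branch_lengths))
                     for col in columns)
print(f"lnL = {log_likelihood:.4f}")
```

Working with the sum of logs avoids numerical underflow, which would otherwise occur quickly when many small column probabilities are multiplied.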

Page 17

Node 1   Node 2   Group no.
A        A        01, 02
A        C        03, 04
A        G        05, 06
A        T        07, 08
C        A        09, 10
C        C        11, 12
C        G        13, 14
C        T        15, 16
G        A        17, 18
G        C        19, 20
G        G        21, 22
G        T        23, 24
T        A        25, 26
T        C        27, 28
T        G        29, 30
T        T        teacher

[Tree figure: tips A, A, G, C; internal nodes 1 and 2]

Page 18

Maximum likelihood phylogeny

• Data:
  – sequence alignment
• Model parameters:
  – nucleotide frequencies, nucleotide substitution rates, tree topology, branch lengths.

Choose random initial values for all parameters, compute likelihood

Change parameter values slightly in a direction so likelihood improves

Repeat until maximum found

Results:
(1) ML estimate of tree topology
(2) ML estimate of branch lengths
(3) ML estimate of other model parameters
(4) Measure of how well model fits data (likelihood).
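The iterative search described on this slide can be illustrated on a one-parameter problem. The sketch below applies a simple hill-climbing loop to the coin-tossing likelihood (7 heads in 10 tosses) rather than to a tree, since a full tree search is much longer; the step sizes, the seed, and the function names are assumptions made for this example, and real phylogeny programs use more sophisticated optimizers.

```python
import math
import random

def log_likelihood(p, h=7, n=10):
    """Log-likelihood of p for h heads in n tosses (constant term omitted)."""
    return h * math.log(p) + (n - h) * math.log(1 - p)

def hill_climb(step=0.1, min_step=1e-6, seed=1):
    rng = random.Random(seed)
    p = rng.uniform(0.05, 0.95)          # random initial parameter value
    current = log_likelihood(p)
    while step > min_step:
        improved = False
        for candidate in (p - step, p + step):
            if 0.0 < candidate < 1.0:
                value = log_likelihood(candidate)
                if value > current:      # accept only moves that improve the fit
                    p, current, improved = candidate, value, True
                    break
        if not improved:
            step /= 2.0                  # no better neighbor: refine the step size
    return p

p_hat = hill_climb()
print(f"Hill-climbing estimate of p: {p_hat:.4f}")
```

Because this likelihood has a single peak, the search ends near p = 0.7 from any starting point. Tree likelihoods can have several local maxima, so in practice searches are often repeated from different starting values.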

Page 19

Selecting the best model: the likelihood ratio test

• The fit of two alternative models can be compared using the ratio of their likelihoods:

$$LR = \frac{P(\text{Data} \mid M_1)}{P(\text{Data} \mid M_2)} = \frac{L_{M_1}}{L_{M_2}}$$

• Note that LR > 1 if model 1 has the highest likelihood

• For nested models it can be shown that

$$\Delta = \ln(LR^2) = 2 \ln(LR) = 2 \, (\ln L_{M_1} - \ln L_{M_2})$$

follows a χ² distribution with degrees of freedom equal to the number of extra parameters in the most complicated model.

This makes it possible to perform stringent statistical tests to determine which model (hypothesis) best describes the data

Page 20

Asking biological questions in a likelihood ratio testing framework

• Fit two alternative, nested models to the data.

• Record optimized likelihood and number of free parameters for each fitted model.

• Test if alternative (parameter-rich) model is significantly better than null-model, given number of additional parameters (n_extra):

  1. Compute Δ = 2 x (lnL_Alternative - lnL_Null)
  2. Compare Δ to χ² distribution with n_extra degrees of freedom

• Depending on models compared, different biological questions can be addressed (presence of molecular clock, presence of positive selection, difference in mutation rates among sites or branches, etc.)
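The two-step test can be sketched in a few lines. The log-likelihoods and the number of extra parameters below are made-up values for illustration, and `chi2_sf` is a hypothetical helper that computes the upper tail of the χ² distribution for integer degrees of freedom from its closed-form series, so that the example needs only the standard library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable
    with integer degrees of freedom df >= 1."""
    if x <= 0:
        return 1.0
    chi = math.sqrt(x)
    z = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)  # normal density at chi
    if df % 2 == 1:
        total = math.erfc(chi / math.sqrt(2.0))        # two-sided normal tail
        term = chi
        for r in range(1, (df - 1) // 2 + 1):
            if r > 1:
                term *= x / (2 * r - 1)
            total += 2.0 * z * term
        return total
    term = 1.0
    series = 1.0
    for r in range(1, (df - 2) // 2 + 1):
        term *= x / (2 * r)
        series += term
    return math.sqrt(2.0 * math.pi) * z * series

# Hypothetical optimized log-likelihoods and parameter counts.
lnL_null, lnL_alternative, n_extra = -1234.5, -1228.1, 2

delta = 2.0 * (lnL_alternative - lnL_null)
p_value = chi2_sf(delta, n_extra)
print(f"delta = {delta:.2f}, df = {n_extra}, p = {p_value:.4g}")
```

A small p-value indicates that the extra parameters of the alternative model improve the fit more than expected by chance under the null model.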

Page 21

Model Selection Using the Akaike Information Criterion (AIC)

• Fit any number of alternative, nested or non-nested models to the data.

• Record log likelihood (lnL) and number of free parameters (K) for each fitted model.

• For each model compute AIC according to this formula:

AIC = -2 x lnL + 2 x K

• Models can now be ranked according to AIC: Lower AIC is better.

• Intuitive interpretation: AIC favors models that show a reasonable compromise between model fit (high lnL) and model complexity (low K)

• AIC is firmly based on information theory (briefly, it is an estimate of the relative Kullback-Leibler distances between the true model and the fitted models. Shorter distances are better)

• There is no concept of significance in the AIC framework. This is A Good Thing. From AIC it is possible to compute so-called Akaike weights which can be interpreted as conditional model probabilities
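The AIC ranking and the Akaike weights can be computed as sketched below. The model names, log-likelihoods, and parameter counts are made-up values for illustration. The weights use the standard definition: each model's weight is exp(-0.5 x (AIC - AIC_min)), normalized so that the weights sum to 1.

```python
import math

# Hypothetical fitted models: name -> (log likelihood, number of free parameters)
models = {
    "M_a": (-1234.5, 5),
    "M_b": (-1228.1, 7),
    "M_c": (-1227.9, 12),
}

def aic(lnL, k):
    """Akaike Information Criterion: -2 x lnL + 2 x K."""
    return -2.0 * lnL + 2.0 * k

aics = {name: aic(lnL, k) for name, (lnL, k) in models.items()}

# Akaike weights: relative likelihood of each model, normalized to sum to 1.
best = min(aics.values())
relative = {name: math.exp(-0.5 * (a - best)) for name, a in aics.items()}
total = sum(relative.values())
weights = {name: r / total for name, r in relative.items()}

# Rank models from lowest (best) to highest AIC.
for name in sorted(aics, key=aics.get):
    print(f"{name}: AIC = {aics[name]:.1f}, weight = {weights[name]:.3f}")
```

In this made-up example the model with the highest log-likelihood (M_c) is not ranked first, because its gain in fit does not outweigh the penalty for its additional parameters.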

Page 22

Positive selection I: synonymous and non-synonymous mutations

• 20 amino acids, 61 codons
  – Most amino acids encoded by more than one codon
  – Not all mutations lead to a change of the encoded amino acid
  – “Synonymous mutations” are rarely selected against

[Figure: single-nucleotide neighbors of the codon CTA (Leu)]

Position 1: ATA (Ile), GTA (Val), TTA (Leu): 1/3 synonymous, 2/3 non-synonymous nucleotide site
Position 2: CAA (Gln), CCA (Pro), CGA (Arg): 1 non-synonymous nucleotide site
Position 3: CTC (Leu), CTG (Leu), CTT (Leu): 1 synonymous nucleotide site
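The site counting shown for CTA can be reproduced for any codon from the standard genetic code. In the sketch below the function name `synonymous_fractions` is hypothetical, and a mutation to a stop codon is counted as non-synonymous, which is an assumption made for this example.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_fractions(codon):
    """For each codon position, the fraction of the three possible
    single-nucleotide changes that leave the amino acid unchanged."""
    result = []
    for pos in range(3):
        same = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                same += 1
        result.append(Fraction(same, 3))
    return result

# CTA (Leu): position 1 is 1/3 synonymous, position 2 is fully
# non-synonymous, position 3 is fully synonymous.
print(synonymous_fractions("CTA"))
```

Summing these fractions over all codons of a gene gives its number of synonymous sites; the remainder are non-synonymous sites.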

Page 23

Positive selection II: non-synonymous and synonymous mutation rates contain information about selective pressure

• dN: rate of non-synonymous mutations per non-synonymous site
• dS: rate of synonymous mutations per synonymous site

• Recall: Evolution is a two-step process:
  (1) Mutation (random)
  (2) Selection (non-random)

• Randomly occurring mutations will lead to dN/dS=1.
• Significant deviations from this most likely caused by subsequent selection.

• dN/dS < 1: Higher rate of synonymous mutations: negative (purifying) selection

• dN/dS > 1: Higher rate of non-synonymous mutations: positive selection
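The definitions of dN and dS translate into a simple ratio of per-site rates. The sketch below is deliberately simplified: the counts are made-up values, the rates are raw proportions without any correction for multiple substitutions at the same site, and the interpretation ignores statistical significance. The function names are hypothetical.

```python
def dn_ds(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Ratio of non-synonymous changes per non-synonymous site (dN)
    to synonymous changes per synonymous site (dS)."""
    dn = nonsyn_changes / nonsyn_sites
    ds = syn_changes / syn_sites
    return dn / ds

def interpret(omega):
    """Qualitative reading of a dN/dS ratio (no significance test)."""
    if omega < 1.0:
        return "negative (purifying) selection"
    if omega > 1.0:
        return "positive selection"
    return "consistent with random mutation only"

# Hypothetical counts for a pair of aligned coding sequences.
omega = dn_ds(nonsyn_changes=12, nonsyn_sites=700,
              syn_changes=20, syn_sites=300)
print(f"dN/dS = {omega:.3f}: {interpret(omega)}")
```

Dividing by the number of sites matters because a typical gene has far more non-synonymous than synonymous sites, so raw counts of changes are not directly comparable.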

Page 24

Today’s exercise: positive selection in HIV?

• Fit two alternative models to HIV data:
  – M0: one, common dN/dS ratio in entire sequence
  – M3: three distinct classes with different dN/dS ratios

• Use likelihood ratio test to examine if M3 is significantly better than M0.

• If that is the case: is there a class of codons with dN/dS>1 (positive selection)?

• If M3 significantly better than M0 AND if some codons have dN/dS>1 then you have statistical evidence for positive selection.

• Most likely reason: immune escape (i.e., sites must be in epitopes)

Codons showing dN/dS > 1: likely epitopes