Distance-Based Genome Rearrangement Phylogeny

40
Distance-Based Distance-Based Genome Rearrangement Phylogeny Genome Rearrangement Phylogeny Li-San Wang University of Texas at Austin and University of Pennsylvania

description

Distance-Based Genome Rearrangement Phylogeny. Li-San Wang University of Texas at Austin and University of Pennsylvania. Genomes As Signed Permutations. 1. 6. 1 –5 3 4 -2 -6 or 5 –1 6 2 -4 -3 etc. 2. 5. 4. 3. Genomes Evolve by Rearrangements. - PowerPoint PPT Presentation

Transcript of Distance-Based Genome Rearrangement Phylogeny

Page 1: Distance-Based  Genome Rearrangement Phylogeny

Distance-Based Distance-Based Genome Rearrangement PhylogenyGenome Rearrangement Phylogeny

Li-San Wang

University of Texas at Austinand

University of Pennsylvania

Page 2: Distance-Based  Genome Rearrangement Phylogeny

2

Genomes As Signed Permutations

1 –5 3 4 -2 -6 or5 –1 6 2 -4 -3 etc.

1

5

3

4

2

6

Page 3: Distance-Based  Genome Rearrangement Phylogeny

3

Genomes Evolve by Rearrangements

1 2 3 4 5 6 7 8 9 10

1 2 –6 –5 -4 -3 7 8 9 10

1 2 7 8 3 4 5 6 9 10

1 2 7 8 –6 -5 -4 -3 9 10

Inversion:

Transposition:

Inverted Transposition:

Page 4: Distance-Based  Genome Rearrangement Phylogeny

4

Phylogeny Reconstruction

FN: false negative (missing edge)FP: false positive (incorrect edge)

50% error rate

S1 S2 S3 S4 S5

FN

S1

S2

S3S4

S5FP

S1 S2 S3

S4 S5

Page 5: Distance-Based  Genome Rearrangement Phylogeny

5

Results

5% Error

Page 6: Distance-Based  Genome Rearrangement Phylogeny

6

Outline

• Genome Rearrangement Evolution• True evolutionary distance estimators• Distance-based phylogeny reconstruction• Simulation study: accuracy of tree reconstruction

Page 7: Distance-Based  Genome Rearrangement Phylogeny

7

Our Model: the Generalized Nadeau-Taylor Model [STOC’01]

• Three types of events: − Inversions (INV)− Transpositions (TRP)− Inverted Transpositions (ITP)

• Events of the same type are equiprobable• Probabilities of the three types have fixed ratio

• We focus on signed circular genomes in this talk.

Page 8: Distance-Based  Genome Rearrangement Phylogeny

8

Additive Distance Matrix and True Evolutionary Distance (T.E.D.)

S2 S3 S4 S5

S1 0 9 15 14 17S2 0 14 13 16S3 0 13 16S4 0 13 13

75

4

5

8

S1

S2

S3

S4

S5

S1

S5 0

Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time.

Page 9: Distance-Based  Genome Rearrangement Phylogeny

9

Error Tolerance of Neighbor Joining

Theorem [Atteson 1999]Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have

then neighbor joining returns T.

2

1|| ijij dD

Page 10: Distance-Based  Genome Rearrangement Phylogeny

10

Edit Distances Between Genomes

• (INV) Inversion distance [Hannenhalli & Pevzner 1995]− Computable in linear time [Moret et al 2001]

• (BP) Breakpoint distance [Watterson et al. 1982]− Computable in linear time− NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999]

1 2 3 4 5 6 7 8 9 10

1 2 3 -8 -7 -6 4 5 9 10

A =

B =

BP(A,B)=3

Page 11: Distance-Based  Genome Rearrangement Phylogeny

11

NJ(BP) and NJ(INV)

120 genes, 160 leavesUniformly Random Trees

Transpositions/inverted transpositions only

Inversion only

Page 12: Distance-Based  Genome Rearrangement Phylogeny

12

BP and INV

INV vs K(120 genes)

(K: Actual number of inversions) (Inversion-only evolution)

BP/2 vs K

Page 13: Distance-Based  Genome Rearrangement Phylogeny

13

Objectives

• Better techniques for estimating true evolutionary distances− Mathematical guarantees whenever possible− Good accuracy in simulation− Robust to model violations

• Main goal: improve topological accuracy of tree reconstructions using neighbor joining and weighbor

Page 14: Distance-Based  Genome Rearrangement Phylogeny

14

New Estimators

• Exact-IEBP• Approx-IEBP• EDE

IEBP: Inverting Expected BreakPoint distanceEDE: Empirically Derived Estimator (inverts the expected inversion distance)

Page 15: Distance-Based  Genome Rearrangement Phylogeny

15

Distance-Based Methods

WeighborINV

NJBP

EDE

IEBP(Exact-, Approx-)

Page 16: Distance-Based  Genome Rearrangement Phylogeny

16

BP/2

K

K

BP

/2

(1)

(2)

Estimate True Evolutionary DistancesUsing BP

BP/2 vs K (120 genes)

(K: Actual number of inversions) (Inversion-only evolution)

To use the scatter plot to estimate the actual number of events (K):

1. Compute BP/2

2. From the curve, look up the corresponding valueof K

Page 17: Distance-Based  Genome Rearrangement Phylogeny

17

Using Breakpoints to Estimate True Evolutionary Distances

• Compute fn(k)= E[BP(G0,Gk)]

(i.e. the expected number of breakpoints after k random events; n is the number of genes)

• Given two genomes G and G’:− Compute breakpoint distance d=BP(G,G’)

− Find k so that fn (k) is closest to d

• Challenge: finding fn (k)

Page 18: Distance-Based  Genome Rearrangement Phylogeny

18

True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data

T.E.D. Estimator

Exact-IEBP [WABI’01]

Approx-IEBP [STOC’01]

EDE [ISMB’01]

Based on the Expectation of

Breakpoint distance (Exact)

Breakpoint distance (Approx.)

Inversion distance (Approx.)

Derivation Analytical Analytical Empirical

Model knowledge

Required Required Inversion-only

Running Time O(n3) O(log n) O(1)

IEBP: Inverting the Expected BreakPoint distanceEDE: Empirically Derived Estimator

Page 19: Distance-Based  Genome Rearrangement Phylogeny

19

Approx-IEBP [Wang & Warnow, STOC’01]

Page 20: Distance-Based  Genome Rearrangement Phylogeny

20

True Evolutionary Distance Estimators

IEBP vs K(120 genes)

(K: Actual number of inversions) (Inversion-only evolution)

BP vs K

Page 21: Distance-Based  Genome Rearrangement Phylogeny

21

True Evolutionary Distance Estimators

BP INV

IEBP

EDE

Page 22: Distance-Based  Genome Rearrangement Phylogeny

22

Regression Formula for E(INV)

• Let n be the number of genes, and k be the number of inversions

• We use nonlinear regression to obtain easily computable formulas for E(INV) and Var(INV):

-> b=0.5956, c=0.4577

20

2

[ ( , )]~ ( ) min{ , } ( )kE INV G G x bx k

f x x xn x cx b n

0)0( f1. 1)0(' f2.

0 ( )f x x 3.

4.1( )f y : 0 1y y exists for all

Page 23: Distance-Based  Genome Rearrangement Phylogeny

23

Absolute Difference Plot

• Motivation: Recall the criterion of Atteson’s Theorem. For all i,j:

2

1|| ijij dD

Page 24: Distance-Based  Genome Rearrangement Phylogeny

24

Error Tolerance of Neighbor Joining

Theorem [Atteson 1999]Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have

then neighbor joining returns T.

2

1|| ijij dD

Page 25: Distance-Based  Genome Rearrangement Phylogeny

25

Absolute Difference Plot

• Motivation: Recall the criterion of Atteson’s Theorem. For all i,j:

• Absolute difference plot:

− Compute d=d(G,G’)− Plot |d-k| (y-axis) vs k (x-axis)

G G’

NT Model, k rearrangement events

2

1|| ijij dD

Page 26: Distance-Based  Genome Rearrangement Phylogeny

26

120 Genes, Inversion-only Model

Page 27: Distance-Based  Genome Rearrangement Phylogeny

27

120 Genes, Inv:Transp=1:1

Page 28: Distance-Based  Genome Rearrangement Phylogeny

28

120 Genes, Transp Only

Page 29: Distance-Based  Genome Rearrangement Phylogeny

29

Using True Evolutionary Distance Helps

120 genes160 taxaUniformly random treesTranspositions/invertedtranspositions only(180 runs per figure)

5%

Page 30: Distance-Based  Genome Rearrangement Phylogeny

30

Variance of True Evolutionary Distance Estimators

• There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) − Weighbor [Bruno et al. 2000]

uses the variance of good t.e.d.s, and yield more accurate trees than NJ.

• Variance estimates for the t.e.d.s [Wang WABI’02]− Weighbor(IEBP),

Weighbor(EDE) K vs Exact-IEBP (120 genes)

Page 31: Distance-Based  Genome Rearrangement Phylogeny

31

Using True Evolutionary Distance Helps

120 genes160 taxaUniformly random treesTranspositions/invertedtranspositions only(180 runs per figure)

5%

Page 32: Distance-Based  Genome Rearrangement Phylogeny

32

Robustness

• EDE− Assumes inversion-only− When used with NJ or Weighbor, gives very

accurate results even under transposition-only evolution

• IEBP?

Page 33: Distance-Based  Genome Rearrangement Phylogeny

33

IEBP is Robust to Model Violations

120 genes, 160 taxaUniformly Random Trees(alpha,beta)=(0,0) (inversion only)

NJ(Exact-IEBP)

Page 34: Distance-Based  Genome Rearrangement Phylogeny

34

Beta Splitting Model [Aldous 1995]

-2 -1.5 -1 0 20

Comb UniformlyRandom

Aldous’suggestion

Yule CompletelyBalanced

Page 35: Distance-Based  Genome Rearrangement Phylogeny

35

Effect of Beta on Accuracy(120 Genes, 160 Taxa)

(Beta= -1.5: Uniform, -1: Aldous, 0: Yule)

Page 36: Distance-Based  Genome Rearrangement Phylogeny

36

Number of Genes

Page 37: Distance-Based  Genome Rearrangement Phylogeny

37

Number of Taxa

Page 38: Distance-Based  Genome Rearrangement Phylogeny

38

Observations

• Number of taxa has little effect on the accuracy of trees (at least under our way of model tree generation)

• Datasets with more genes give more accurate trees

• Topology has effect on the accuracy of trees for NJ(BP) and NJ(INV); using corrected distances reduces this effect.

Page 39: Distance-Based  Genome Rearrangement Phylogeny

39

Summary

• True evolutionary distance estimators for gene order data

• When used with Neighbor Joining and Weighbor, produce highly accurate trees

• How model trees are generated has effect on the accuracy of tree reconstruction methods

Page 40: Distance-Based  Genome Rearrangement Phylogeny

40

Acknowledgements

• University of Texas Tandy Warnow Robert K. Jansen Randy Linder Stacia Wyman

• University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan

• Central Washington University Linda Raubeson

• University of Canterbury Mike Steel

• NSF• Packard Foundation