Combining with phylogeny

Combining with phylogeny. Wafa Jobran. Seminar in Bioinformatics, Technion, spring 2005.

description

Combining with phylogeny. Wafa Jobran, Seminar in Bioinformatics, Technion, spring 2005. Schedule. Genome representation. GNT model. Distance-based methods. True evolutionary distance. BP and IEBP variance. INV and EDE variance. Simulation. Representing a chromosome.

Transcript of Combining with phylogeny

Page 1: Combining with phylogeny

Combining with phylogeny

Wafa Jobran

Seminar in Bioinformatics

Technion, spring 2005

Page 2: Combining with phylogeny

Schedule

• Genome representation.

• GNT model.

• Distance-based methods.

• True evolutionary distance.

• BP and IEBP variance.

• INV and EDE variance.

• Simulation.

Page 3: Combining with phylogeny

Representing a chromosome

• A chromosome is represented by an ordering (linear or circular) of signed genes.

• We assign the same number to the same gene in each genome.

• In a linear genome, the sign indicates which strand the gene is located on.

• In a circular genome, we break the circle between two neighboring genes and choose either the clockwise or the counterclockwise direction as the positive direction.

Page 4: Combining with phylogeny

Representing a chromosome: example

[Figure: a circular genome with genes 1, 2, 3]

• Some of the linear representations of this genome:

(1,2,3), (2,3,1) or (-1,-3,-2)
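The linear representations of a circular genome can be enumerated mechanically. The following is a minimal sketch, assuming a circular genome is given as a Python list of signed integers; the function name `linear_representations` is illustrative and not part of the original slides.

```python
def linear_representations(circular):
    """Return every linear reading of a signed circular gene order."""
    forward = list(circular)
    # Reading the circle in the opposite direction reverses the order
    # and flips the sign of every gene.
    backward = [-g for g in reversed(circular)]
    reps = set()
    for strand in (forward, backward):
        for cut in range(len(strand)):
            # Break the circle just before position `cut`.
            reps.add(tuple(strand[cut:] + strand[:cut]))
    return reps


reps = linear_representations([1, 2, 3])
print(sorted(reps))
```

For the three-gene example, the set contains the three representations listed on the slide together with the other rotations in either reading direction.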

Page 5: Combining with phylogeny

The generalized Nadeau-Taylor model ("GNT")

• We are particularly interested in the following three types of rearrangements along the edges:

1. Inversions.

Page 6: Combining with phylogeny

Inversions:

Starting with genome

$G = (g_1, g_2, \ldots, g_n)$

an inversion between indices a and b, $1 \le a < b \le n+1$, produces:

$(g_1, g_2, \ldots, g_{a-1}, -g_b, \ldots, -g_a, g_{b+1}, \ldots, g_n)$

Page 7: Combining with phylogeny

The generalized Nadeau-Taylor model ("GNT")

• We are particularly interested in the following three types of rearrangements along the edges:

1. Inversions.

2. Transpositions.

Page 8: Combining with phylogeny

Transpositions:

Starting with genome

$G = (g_1, g_2, \ldots, g_n)$

a transposition on the three indices a, b, c, with $1 \le a < b \le n$ and $2 \le c \le n+1$, $c \ne a$ and $c \ne b$, produces:

$(g_1, \ldots, g_{a-1}, g_{b+1}, \ldots, g_c, g_a, g_{a+1}, \ldots, g_b, g_{c+1}, \ldots, g_n)$.

Page 9: Combining with phylogeny

The generalized Nadeau-Taylor model ("GNT")

• We are particularly interested in the following three types of rearrangements along the edges:

1. Inversions.

2. Transpositions.

3. Inverted transpositions.

Page 10: Combining with phylogeny

Inverted transposition:

Starting with genome $G = (g_1, g_2, \ldots, g_n)$

an inverted transposition on the three indices a, b, c, with $1 \le a < b \le n$ and $2 \le c \le n+1$, $c \ne a$ and $c \ne b$, produces:

$(g_1, \ldots, g_{a-1}, g_{b+1}, \ldots, g_c, -g_b, -g_{b-1}, \ldots, -g_a, g_{c+1}, \ldots, g_n)$.

Page 11: Combining with phylogeny

Examples:

• G = (1 2 3 4 5 6 7 8 9 10)

Inversion with a=4, b=6:

G → (1 2 3 -6 -5 -4 7 8 9 10)

Transposition with a=4, b=6, c=8:

G → (1 2 3 7 8 4 5 6 9 10)

Inverted transposition with a=4, b=6, c=8:

G → (1 2 3 7 8 -6 -5 -4 9 10)
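The three rearrangements can be written as short list operations. This is a sketch under stated assumptions: genomes are Python lists of signed integers, indices are 1-based and inclusive as in the worked examples, and the transposition helpers only handle the case c > b shown on the slides. The function names are illustrative.

```python
def inversion(g, a, b):
    """Reverse genes a..b (1-based, inclusive) and flip their signs."""
    segment = [-x for x in reversed(g[a - 1:b])]
    return g[:a - 1] + segment + g[b:]


def transposition(g, a, b, c):
    """Move genes a..b so that they follow gene c (assumes b < c)."""
    if not 1 <= a <= b < c <= len(g):
        raise ValueError("this sketch requires 1 <= a <= b < c <= n")
    return g[:a - 1] + g[b:c] + g[a - 1:b] + g[c:]


def inverted_transposition(g, a, b, c):
    """Like transposition, but the moved segment is also inverted."""
    if not 1 <= a <= b < c <= len(g):
        raise ValueError("this sketch requires 1 <= a <= b < c <= n")
    segment = [-x for x in reversed(g[a - 1:b])]
    return g[:a - 1] + g[b:c] + segment + g[c:]


G = list(range(1, 11))
print(inversion(G, 4, 6))
print(transposition(G, 4, 6, 8))
print(inverted_transposition(G, 4, 6, 8))
```

Run on G = (1 2 ... 10) with a=4, b=6, c=8, the three calls reproduce the three example genomes.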

Page 12: Combining with phylogeny

The generalized Nadeau-Taylor model ("GNT")

• We are particularly interested in the following three types of rearrangements along the edges:

1. Inversions.

2. Transpositions.

3. Inverted transpositions.

• Different inversions have equal probability, and so do different transpositions and different inverted transpositions.

Page 13: Combining with phylogeny

The generalized Nadeau-Taylor model ("GNT"), continued

• Each model tree has two parameters:

$\alpha$ is the probability that a rearrangement event is a transposition.

$\beta$ is the probability that a rearrangement event is an inverted transposition.

$1-\alpha-\beta$ is the probability that a rearrangement event is an inversion.
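Drawing the type of a single event from these three probabilities might look as follows. This is only a sketch: the function name, the string labels, and the argument names `p_transposition` and `p_inverted_transposition` (the two model parameters) are assumptions made for illustration.

```python
import random


def sample_event_type(p_transposition, p_inverted_transposition, rng=random):
    """Draw one GNT event type; the remaining probability mass is an inversion."""
    if p_transposition < 0 or p_inverted_transposition < 0:
        raise ValueError("probabilities must be non-negative")
    if p_transposition + p_inverted_transposition > 1:
        raise ValueError("the two probabilities must sum to at most 1")
    u = rng.random()  # uniform in [0, 1)
    if u < p_transposition:
        return "transposition"
    if u < p_transposition + p_inverted_transposition:
        return "inverted_transposition"
    return "inversion"


print(sample_event_type(0.25, 0.25))
```

Setting both parameters to 0 gives the inversion-only model used later in the talk.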

Page 14: Combining with phylogeny

Reconstructing the true tree T

• Every edge e in T is associated with a number $k_e$, the actual number of rearrangements along edge e.

• The true evolutionary distance (t.e.d.) between two leaves $G_i$ and $G_j$ in T is $k_{ij} = \sum_{e \in P_{ij}} k_e$, where $P_{ij}$ is the simple path on T between $G_i$ and $G_j$.

• Using good estimates of true evolutionary distances between genomes greatly improves the performance of distance-based methods.

• A phylogenetic tree T on a set of taxa S is a tree representation of the evolutionary history of S: T is a tree leaf-labeled by S such that the internal nodes reflect past speciation events.
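The true evolutionary distance between two leaves is a sum of edge counts along the unique path joining them. A minimal sketch, assuming the tree is given as a dictionary from node pairs to the number of events on that edge; the node names and edge counts below are made up for illustration.

```python
def true_evolutionary_distance(edges, start, goal):
    """Sum the event counts along the simple path between two nodes of a tree."""
    adjacency = {}
    for (u, v), k in edges.items():
        adjacency.setdefault(u, []).append((v, k))
        adjacency.setdefault(v, []).append((u, k))
    # Depth-first search; in a tree, never stepping back to the parent
    # is enough to avoid revisiting nodes.
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, k in adjacency.get(node, []):
            if nxt != parent:
                stack.append((nxt, node, dist + k))
    raise ValueError("no path between the two nodes")


# Hypothetical tree with four leaves G1..G4 and two internal nodes x, y.
edges = {("G1", "x"): 3, ("G2", "x"): 5, ("x", "y"): 2,
         ("G3", "y"): 4, ("G4", "y"): 1}
print(true_evolutionary_distance(edges, "G1", "G3"))
```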

Page 15: Combining with phylogeny

Reconstructing the true tree T: distance-based methods

• NJ ("neighbor joining"). It consists of two main steps that are repeated until the tree is completed:

1. Choose a pair of taxa to be joined and replaced by a single new node representing their immediate common ancestor.

2. Infer the distances from the new node to all other nodes.

NJ is widely used because of its elegance and speed, and because it is guaranteed to reproduce the correct tree when given exact distances.

• BioNJ.

• Weighbor ("weighted neighbor joining"). It works as neighbor joining does, but when choosing a pair of taxa to join it takes into account that errors in distance estimates are exponentially larger for longer distances. It does so by using the variance.

BioNJ and Weighbor use the variance of good t.e.d. estimates and yield more accurate trees than NJ.
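The two repeated steps of neighbor joining can be sketched for a single iteration. The slides do not spell out the selection rule; the code below assumes the standard NJ criterion, which picks the pair minimizing Q(i,j) = (n-2)·d(i,j) − Σ_k d(i,k) − Σ_k d(j,k), and the standard update d(u,k) = (d(i,k) + d(j,k) − d(i,j))/2 for the new node u. The function name and the small distance matrix are illustrative.

```python
def nj_step(d):
    """Perform one neighbor-joining step on a symmetric distance matrix.

    Returns the chosen pair (i, j) and the distances from the new node
    to every remaining taxon.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    best_q, pair = None, None
    # Step 1: choose the pair that minimizes the Q criterion.
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best_q, pair = q, (i, j)
    i, j = pair
    # Step 2: infer distances from the new node to all other taxa.
    new_dist = {k: (d[i][k] + d[j][k] - d[i][j]) / 2
                for k in range(n) if k not in pair}
    return pair, new_dist


example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, new_dist = nj_step(example)
print(pair, new_dist)
```

BioNJ and Weighbor keep this overall loop but bring the variance of the distance estimates into the update or selection step, which is why variance formulas for the genomic distances matter later in the talk.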

Page 16: Combining with phylogeny

Estimating true evolutionary distance (t.e.d.) using genome rearrangements

The assumption is that the genomes have evolved from a common ancestor under the GNT model of evolution.

Page 17: Combining with phylogeny

Estimating true evolutionary distance (t.e.d.) using genome rearrangements

• The edit distance:

The edit distance between two gene orders is the minimum cost over all sequences of events from the given set that transform one gene order into the other.

For example, the inversion distance is the edit distance when only inversions are permitted and all inversions have weight 1.

Page 18: Combining with phylogeny

Estimating true evolutionary distance using genome rearrangements

• The edit distance.

• The breakpoint distance:

The number of breakpoints in G relative to G'.

Given two genomes G and G', a breakpoint in G is an ordered pair of genes $(g_a, g_b)$ such that $g_a$ and $g_b$ appear consecutively in that order in G, but neither $(g_a, g_b)$ nor $(-g_b, -g_a)$ appears consecutively in that order in G'.

For example:

G = (1,2,3,4,5)

G' = (1,-4,-3,2,5)

There are three pairs of genes that are adjacent in G but not in G': (1,2), (2,3) and (4,5), so the breakpoint distance is 3.
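The breakpoint definition translates directly into a set lookup. A minimal sketch, assuming linear genomes given as lists of signed integers with no extra end markers; the function name is illustrative.

```python
def breakpoint_distance(g, g_prime):
    """Count the adjacencies of g that are not preserved in g_prime."""
    preserved = set()
    for x, y in zip(g_prime, g_prime[1:]):
        preserved.add((x, y))
        # The same adjacency read on the opposite strand.
        preserved.add((-y, -x))
    return sum((x, y) not in preserved for x, y in zip(g, g[1:]))


print(breakpoint_distance([1, 2, 3, 4, 5], [1, -4, -3, 2, 5]))
```

For the example genomes, the adjacency (3,4) survives as (-4,-3), and the other three adjacencies of G are breakpoints.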

Page 19: Combining with phylogeny

Estimating true evolutionary distance using genome rearrangements

• The edit distance.

• The breakpoint distance.

• Exact-IEBP (Inverting the Expected BreakPoint distance):

It replaces the approximation in the IEBP method by computing the expected breakpoint distance exactly.

To compute the Exact-IEBP estimator $\hat{k}(G, G')$ for the true evolutionary distance between two genomes G and G':

1. For all k = 1,…,r (where r is some integer large enough to bring a genome to random), compute $E[BP(G_0, G_k)]$, where $G_k$ is $G_0$ after k events.

2. To compute $k' = \hat{k}(G, G')$, with $0 \le k' \le r$:

a. Compute the BP distance b = BP(G, G').

b. Find the integer k', $0 \le k' \le r$, such that $|E[BP(G_0, G_{k'})] - b|$ is minimized.
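Step 2 is a nearest-value lookup in the table built in step 1. The sketch below shows only that lookup. The table `toy_curve` is a simple saturating placeholder chosen for illustration, not the exact expectation that Exact-IEBP computes, and the function name is an assumption.

```python
def invert_expected_bp(expected_bp, b):
    """Return the k whose tabulated expected breakpoint distance is closest to b.

    expected_bp[k] holds E[BP(G0, Gk)] for k = 0..r.
    """
    return min(range(len(expected_bp)), key=lambda k: abs(expected_bp[k] - b))


# Placeholder table (assumption): a saturating curve for n = 120 genes.
n, r = 120, 300
toy_curve = [n * (1 - (1 - 2 / n) ** k) for k in range(r + 1)]

observed_b = 68
print(invert_expected_bp(toy_curve, observed_b))
```

Because the expected breakpoint distance flattens out as k grows, small changes in b near saturation map to large changes in the estimate, which is one reason the variance of the estimator is studied later.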

Page 20: Combining with phylogeny

Estimating true evolutionary distance using genome rearrangements

• The edit distance.

• The breakpoint distance.

• Exact-IEBP (Inverting the Expected BreakPoint distance).

• EDE (Empirically Derived Estimator):

We estimate the true evolutionary distance by inverting the expected inversion distance.

Given two genomes that have the same set of n genes and inversion distance d between them, we define the EDE distance as $n f^{-1}(d/n)$, where n is the number of genes and f is an approximation to the expected inversion distance normalized by the number of genes.

Page 21: Combining with phylogeny

Experiments: accuracy of the estimators by absolute difference

• GNT model with 120 genes.

• Starting with the unrearranged genome $G_0$, we apply k events to it to obtain the genome $G_k$, where k = 1,…,300. For each value of k we simulate 500 runs, then we compute the five distances.
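A scaled-down version of this experiment can be sketched for one of the distances. The code assumes the inversion-only case, linear genomes, and uniformly random inversions, and it uses far fewer runs than the 500 per value of k described above; all function names are illustrative.

```python
import random


def random_inversion(g, rng):
    """Apply one uniformly random inversion to a linear signed gene order."""
    i, j = sorted(rng.sample(range(len(g) + 1), 2))  # two distinct cut points
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def breakpoints(g0, g):
    """Count adjacencies of g0 that are lost in g."""
    kept = set()
    for x, y in zip(g, g[1:]):
        kept.add((x, y))
        kept.add((-y, -x))
    return sum((x, y) not in kept for x, y in zip(g0, g0[1:]))


def mean_breakpoints(n, k, runs, seed=0):
    """Average breakpoint distance between G0 and Gk over several runs."""
    rng = random.Random(seed)
    g0 = list(range(1, n + 1))
    total = 0
    for _ in range(runs):
        g = g0
        for _ in range(k):
            g = random_inversion(g, rng)
        total += breakpoints(g0, g)
    return total / runs


for k in (10, 50, 150):
    print(k, mean_breakpoints(120, k, runs=50))
```

Comparing the printed averages with k gives a feel for how the raw breakpoint distance relates to the actual number of events as k grows.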

Page 22: Combining with phylogeny

Accuracy of the estimators by absolute difference

• Both the BP and INV distances underestimate the actual number of events.

• EDE slightly overestimates the actual number of events.

• The IEBP and Exact-IEBP distances are both unbiased.

Page 23: Combining with phylogeny


Page 24: Combining with phylogeny

• Now we will find the variance of the breakpoint distance in an approximating model.

• We will find the variance of the IEBP estimator.

• We will find the variance of the inversion and EDE distances.

• Based on these variance estimators we will see four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE and Weighbor-EDE.

Page 25: Combining with phylogeny

Variance of the breakpoint distance

Page 26: Combining with phylogeny

Deriving the variance (BP)

• Difficulties:

1. Even the expected BP distance between G and G' with n genes after k rearrangements in the GNT model is still an unsimplified sum.

2. The breakpoints are not independent (under any evolution model).

• Solution: an approximating model.

Page 27: Combining with phylogeny

The approximating model

• We motivate the approximating model by the case of inversion-only evolution on a signed circular genome.

• Let n be the number of genes and b the number of breakpoints of the current genome G.

Page 28: Combining with phylogeny

The approximating model

• When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion:

1. Neither of the two endpoints of the inversion is a breakpoint.

The number of breakpoints is increased by 2.

There are $\binom{n-b}{2}$ such inversions.

Page 29: Combining with phylogeny

The approximating model

• When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion:

1. Neither of the two endpoints of the inversion is a breakpoint.

Example: G = (1,2,3,4,5,6,7,8,9,10)

G' = (1,2,-5,-4,-3,6,7,8,9,10)

The endpoints: 8, 9

G'' = (1,2,-5,-4,-3,6,7,-9,-8,10)

Page 30: Combining with phylogeny

The approximating model

• When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion:

2. Exactly one of the two endpoints of the inversion is a breakpoint.

The number of breakpoints is increased by 1.

There are b(n-b) such inversions.

Page 31: Combining with phylogeny

The approximating model

• When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion:

2. Exactly one of the two endpoints of the inversion is a breakpoint.

Example: G = (1,2,3,4,5,6,7,8,9,10)

G' = (1,2,-5,-4,-3,6,7,8,9,10)

The endpoints: 6, 8

G'' = (1,2,-5,-4,-3,-8,-7,-6,9,10)

Page 32: Combining with phylogeny

The approximating model

• When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion:

3. The two endpoints of the inversion are two breakpoints.

There are $\binom{b}{2}$ such inversions, and 3 subcases.

Page 33: Combining with phylogeny

The approximating model

• Case 3: the two endpoints of the inversion are two breakpoints.

Let $g_i$ and $g_{i+1}$ be the left and right genes at the left breakpoint, and let $g_j$ and $g_{j+1}$ be the left and right genes at the right breakpoint. There are three subcases.

• Before the inversion: $(\ldots, g_i, g_{i+1}, \ldots, g_j, g_{j+1}, \ldots)$

• After the inversion: $(\ldots, g_i, -g_j, \ldots, -g_{i+1}, g_{j+1}, \ldots)$

Page 34: Combining with phylogeny

The approximating model

• Case 3: the two endpoints of the inversion are two breakpoints.

Let $g_i$ and $g_{i+1}$ be the left and right genes at the left breakpoint, and let $g_j$ and $g_{j+1}$ be the left and right genes at the right breakpoint. There are three subcases:

A. Neither $(g_i, -g_j)$ nor $(-g_{i+1}, g_{j+1})$ is an adjacency in $G_0$.

The number of breakpoints is unchanged.

Page 35: Combining with phylogeny

The approximating model

• Case 3: the two endpoints of the inversion are two breakpoints.

Let $g_i$ and $g_{i+1}$ be the left and right genes at the left breakpoint, and let $g_j$ and $g_{j+1}$ be the left and right genes at the right breakpoint. There are three subcases:

B. Exactly one of $(g_i, -g_j)$ and $(-g_{i+1}, g_{j+1})$ is an adjacency in $G_0$.

The number of breakpoints is decreased by 1.

Page 36: Combining with phylogeny

The approximating model

• Case 3: the two endpoints of the inversion are two breakpoints.

Let $g_i$ and $g_{i+1}$ be the left and right genes at the left breakpoint, and let $g_j$ and $g_{j+1}$ be the left and right genes at the right breakpoint. There are three subcases:

C. Both $(g_i, -g_j)$ and $(-g_{i+1}, g_{j+1})$ are adjacencies in $G_0$.

The number of breakpoints is decreased by 2.

Page 37: Combining with phylogeny

The approximating model

• Case 3: the two endpoints of the inversion are two breakpoints.

When b ≥ 3, out of the $\binom{b}{2}$ inversions from case 3, cases 3(B) and 3(C) account for at most b inversions, because for every breakpoint there is only one specific inversion that can cancel it.

This means that, given that an inversion belongs to case 3, with probability at least $1 - b/\binom{b}{2} = (b-3)/(b-1)$ it does not change the breakpoint distance. This probability is close to 1 when b is large.

Page 38: Combining with phylogeny

The approximating model

Case    Change in BP    Number of inversions
1       +2              $\binom{n-b}{2}$
2       +1              b(n-b)
3.A     0
3.B     -1
3.C     -2

Cases 3.A, 3.B and 3.C together cover $\binom{b}{2}$ inversions, of which 3.B and 3.C account for at most b.

• Therefore, when n is large, we can drop cases 3(B) and 3(C) without drastically affecting the distribution of the breakpoint distance.
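The case counts can be checked numerically: together they must cover all $\binom{n}{2}$ endpoint pairs. The sketch below tabulates them for given n and b and also evaluates the lower bound $1 - b/\binom{b}{2}$ on the share of case 3 inversions that leave the breakpoint count unchanged. The function names are illustrative.

```python
from math import comb


def inversion_case_counts(n, b):
    """Number of inversions in each case for n genes and b breakpoints."""
    return {
        "case 1 (+2)": comb(n - b, 2),  # neither endpoint at a breakpoint
        "case 2 (+1)": b * (n - b),     # exactly one endpoint at a breakpoint
        "case 3": comb(b, 2),           # both endpoints at breakpoints
    }


def case3_unchanged_lower_bound(b):
    """Lower bound on the fraction of case 3 inversions of type 3.A (b >= 3)."""
    return 1 - b / comb(b, 2)


counts = inversion_case_counts(120, 30)
print(counts, sum(counts.values()), comb(120, 2))
print(case3_unchanged_lower_bound(30))
```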

Page 39: Combining with phylogeny

The approximating model

• Approximating box model: boxes correspond to breakpoints.

• Let us be given n boxes, initially empty.

• At each iteration two boxes are chosen at random.

• We place a ball into each of these two boxes if it is empty.

• The number of nonempty boxes after k iterations, $b_k$, can be used to estimate the number of breakpoints after k rearrangement events are applied to an unrearranged genome.
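The box model is easy to simulate. A minimal sketch, assuming the two boxes are drawn without replacement at each iteration (the slide only says "randomly"); the function name, the seed and the number of runs are illustrative choices.

```python
import random


def simulate_boxes(n, k, rng):
    """Return the number of nonempty boxes after k iterations of the box model."""
    nonempty = set()
    for _ in range(k):
        # Choose two distinct boxes; each one is nonempty from now on.
        nonempty.update(rng.sample(range(n), 2))
    return len(nonempty)


rng = random.Random(0)
runs = [simulate_boxes(120, 50, rng) for _ in range(200)]
print(sum(runs) / len(runs))
```

Each iteration can add at most two nonempty boxes and never removes one, which mirrors dropping cases 3(B) and 3(C) from the inversion analysis.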

Page 40: Combining with phylogeny

The approximating model

• This model can also be extended to approximate the GNT model: at each iteration, with probability $1-\alpha-\beta$ we choose 2 boxes, and with probability $\alpha+\beta$ we choose 3 boxes, where $\alpha$ and $\beta$ are the probabilities of a transposition and of an inverted transposition.

Page 41: Combining with phylogeny

Derivation of the variance

• Let $S = \left(\dfrac{x_1x_2 + x_1x_3 + \cdots + x_{n-1}x_n}{\binom{n}{2}}\right)^{k}$ in the inversion-only model.

- Each term corresponds to the ways of choosing two boxes k times, where the total number of times box i is chosen is the power of $x_i$, and the coefficient of that term is the total probability of these ways.

- For example, the coefficient of $x_1^3 x_2 x_3^2$ is the probability of choosing box 1 three times, box 2 once, and box 3 twice.

Page 42: Combining with phylogeny

Derivation of the variance

• If transpositions and inverted transpositions are present (with probabilities $\alpha$ and $\beta$):

$$S = \left(\frac{1-\alpha-\beta}{\binom{n}{2}} \sum_{1 \le i < j \le n} x_i x_j + \frac{\alpha+\beta}{\binom{n}{3}} \sum_{1 \le i < j < l \le n} x_i x_j x_l\right)^{k}$$

• Let $u_i$ be the coefficient of the terms with i distinct symbols; $\binom{n}{i} u_i$ is the probability that i boxes are nonempty after k iterations.

• To solve for $u_i$ exactly for all k is difficult and unnecessary. Instead we can find the expectation and variance of $b_k$ directly.

Page 43: Combining with phylogeny
Page 44: Combining with phylogeny
Page 45: Combining with phylogeny

Expectation and variance of $b_k$

• Let $S(a_1, a_2, \ldots, a_n)$ be the value of S when $x_i = a_i$ for all i.

• Let $S_j = S(1, 1, \ldots, 1, 0, \ldots, 0)$, with j 1's.

Results for the inversion-only model:

1. $E\,b_k = n(1 - S_{n-1})$

2. $\mathrm{Var}\,b_k = nS_{n-1} - n^2 S_{n-1}^2 + n(n-1)S_{n-2}$

Page 46: Combining with phylogeny

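Expectation and variance of $b_k$

• Results for the GNT model (with $\alpha$ and $\beta$ the probabilities of a transposition and of an inverted transposition):

1. $\dfrac{d}{dk} E\,b_k = -\dfrac{1}{k}\, nS_{n-1} \ln S_{n-1} = -nS_{n-1} \ln\left(1 - \dfrac{2+\alpha+\beta}{n}\right)$

2. $\mathrm{Var}\,b_k = nS_{n-1} - n^2 S_{n-1}^2 + n(n-1)S_{n-2}$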
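These closed forms can be evaluated directly, since $S_j$ is the probability that every one of the k iterations chooses only among j fixed boxes. The sketch below assumes boxes are chosen without replacement and uses a single argument `p3` for the probability of choosing three boxes (the combined probability of transpositions and inverted transpositions); `p3 = 0` gives the inversion-only model. The function names are illustrative.

```python
from math import comb


def s_j(n, j, k, p3=0.0):
    """Probability that all k iterations choose only among j fixed boxes."""
    two_boxes = comb(j, 2) / comb(n, 2)
    three_boxes = comb(j, 3) / comb(n, 3)
    return ((1 - p3) * two_boxes + p3 * three_boxes) ** k


def box_mean_var(n, k, p3=0.0):
    """Expectation and variance of the number of nonempty boxes b_k."""
    s1 = s_j(n, n - 1, k, p3)  # a given box is still empty
    s2 = s_j(n, n - 2, k, p3)  # two given boxes are both still empty
    mean = n * (1 - s1)
    var = n * s1 - (n * s1) ** 2 + n * (n - 1) * s2
    return mean, var


print(box_mean_var(120, 50))
print(box_mean_var(120, 50, p3=0.2))
```

A quick sanity check is k = 1: exactly two boxes (or three, when `p3 = 1`) are nonempty, so the variance must vanish.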

Page 47: Combining with phylogeny

Estimating the true evolutionary distance

• To estimate the true evolutionary distance we use Exact-IEBP.

• The variance of $\hat{k}(b_k)$ can be approximated using a common statistical technique called the delta method:

$$\mathrm{Var}\big(\hat{k}(b_k)\big) \approx \left(\frac{d}{dk} E\,b_k\right)^{-2} \mathrm{Var}\,b_k = \frac{1 - nS_{n-1} + (n-1)\dfrac{S_{n-2}}{S_{n-1}}}{nS_{n-1}\left(\ln\left(1 - \dfrac{2+\alpha+\beta}{n}\right)\right)^{2}}$$

where $\alpha$ and $\beta$ are the probabilities of a transposition and of an inverted transposition.
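The delta-method approximation divides the variance of $b_k$ by the squared slope of its expectation. The sketch below evaluates both the closed form and the ratio it comes from, under the same box-model assumptions: `p3` stands for the combined probability of transpositions and inverted transpositions, boxes are chosen without replacement, and all function names are illustrative.

```python
from math import comb, log


def s_j(n, j, k, p3=0.0):
    """Probability that all k iterations choose only among j fixed boxes."""
    return ((1 - p3) * comb(j, 2) / comb(n, 2)
            + p3 * comb(j, 3) / comb(n, 3)) ** k


def var_bk(n, k, p3=0.0):
    """Variance of the number of nonempty boxes after k iterations."""
    s1, s2 = s_j(n, n - 1, k, p3), s_j(n, n - 2, k, p3)
    return n * s1 - (n * s1) ** 2 + n * (n - 1) * s2


def var_khat(n, k, p3=0.0):
    """Delta-method approximation to the variance of the estimated k."""
    s1, s2 = s_j(n, n - 1, k, p3), s_j(n, n - 2, k, p3)
    rate = log(1 - (2 + p3) / n)
    return (1 - n * s1 + (n - 1) * s2 / s1) / (n * s1 * rate ** 2)


print(var_bk(120, 50), var_khat(120, 50))
```

The approximate variance grows as k approaches saturation, because the expected breakpoint curve becomes flat there and its slope appears squared in the denominator.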

Page 48: Combining with phylogeny

Accuracy of the estimators for the variance

[Figures: $\mathrm{Var}(BP_k)$ and $\mathrm{Var}(\hat{k}(b_k))$]

Each figure consists of two sets of curves, corresponding to the values from simulation and from theoretical estimation.

• The number of genes is 120.

• The number of rearrangement events, k, ranges from 1 to 220.

• The evolutionary model is inversion-only GNT.

• For each k, 500 runs.

Page 49: Combining with phylogeny

Variance of the inversion and EDE distances

• The EDE distance:

Given two genomes that have the same set of n genes and inversion distance d between them, we define the EDE distance as $n f^{-1}(d/n)$, where n is the number of genes and f is an approximation to the expected inversion distance normalized by the number of genes.

Page 50: Combining with phylogeny

Variance of the inversion and EDE distances

• Let x be the normalized number of inversions (k/n).

• We simulate the inversion-only GNT model to evaluate the relationship between the inversion distance and the actual number of inversions applied. Regression on the simulation results suggests a=1, b=0.5956, and c=0.4577 in

$$f(x) = \min\left\{x, \frac{ax^2 + bx}{x^2 + cx + b}\right\}$$

• Let y = d/n. Then

$$f^{-1}(y) = \max\left\{y, \frac{-(b - cy) + \sqrt{(b - cy)^2 + 4(a - y)by}}{2(a - y)}\right\}$$
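With f and its inverse written out, the EDE distance is a few lines of arithmetic. The sketch below uses the regression constants quoted above. Returning infinity when y reaches a (where the inverse is undefined because f never reaches a) is a choice made for this illustration, not something specified on the slides.

```python
from math import inf, sqrt

A, B, C = 1.0, 0.5956, 0.4577  # regression constants a, b, c


def f(x):
    """Approximate expected inversion distance per gene after n*x inversions."""
    return min(x, (A * x * x + B * x) / (x * x + C * x + B))


def f_inverse(y):
    """Invert f; y is the inversion distance divided by the number of genes."""
    if y >= A:
        return inf  # assumption: signal saturation instead of raising
    root = (-(B - C * y) + sqrt((B - C * y) ** 2 + 4 * (A - y) * B * y)) / (2 * (A - y))
    return max(y, root)


def ede_distance(d, n):
    """EDE estimate of the number of events from inversion distance d."""
    return n * f_inverse(d / n)


print(ede_distance(60, 120), ede_distance(110, 120))
```

For small distances the max picks y itself, so EDE leaves the inversion distance unchanged; the correction only takes effect once the inversion distance is a sizeable fraction of n.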

Page 51: Combining with phylogeny

Variance of the inversion and EDE distances

• Using the same technique:

- A regression formula with parameters q, u, v, w and t gives the standard deviation of the inversion distance, normalized by the number of genes, after nx inversions are applied.

- q=-0.6998, u=0.1684, v=0.1573, w=-1.3893 and t=0.8224.

- Var(EDE) can be obtained using the delta method on Var(INV).

Page 52: Combining with phylogeny

Simulation study

Page 53: Combining with phylogeny

The accuracy of the new methods

• We use the original Weighbor and BioNJ implementations and modify them so that they use the new variance formulas.

• The following four distance estimators are used with neighbor joining: BP, INV, Exact-IEBP and EDE.

• According to past simulation studies, NJ(EDE) has the best accuracy, followed closely by NJ(Exact-IEBP).

Page 54: Combining with phylogeny

Quantifying error

• Given an inferred tree, we assess its "topological accuracy" by computing "false negatives" with respect to the "true tree".

• False negative edge:

– Let T be the true tree and T' the inferred tree. An edge e in T is "missing" in T' if T' does not contain an edge defining the same bipartition on the leaf set.

– The external edges are trivial in the sense that they are present in every tree with the same set of leaves.

• The false negative rate is the number of false negative edges in T' with respect to T divided by the number of internal edges in T.
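The false negative rate reduces to a set difference once each internal edge is stored as the bipartition it induces on the leaves. A minimal sketch, assuming each tree is already given as a collection of such bipartitions; the helper names and the two five-leaf trees are made up for illustration.

```python
def bipartition(side, leaves):
    """Represent an internal edge by the unordered split it induces on the leaves."""
    side = frozenset(side)
    return frozenset({side, frozenset(leaves) - side})


def false_negative_rate(true_splits, inferred_splits):
    """Fraction of internal edges of the true tree missing from the inferred tree."""
    true_splits = set(true_splits)
    missing = true_splits - set(inferred_splits)
    return len(missing) / len(true_splits)


leaves = {"A", "B", "C", "D", "E"}
# Hypothetical true tree ((A,B),C,(D,E)) and inferred tree ((A,C),B,(D,E)).
true_tree = [bipartition({"A", "B"}, leaves), bipartition({"D", "E"}, leaves)]
inferred_tree = [bipartition({"A", "C"}, leaves), bipartition({"D", "E"}, leaves)]
print(false_negative_rate(true_tree, inferred_tree))
```

Storing a split as an unordered pair of leaf sets means the same edge compares equal no matter which side was used to describe it.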

Page 55: Combining with phylogeny

"False negatives"

[Figure: false negative rates; 120 genes, 160 genomes]

Weighbor-EDE has the best accuracy of all the methods.

Page 56: Combining with phylogeny

• When we compare methods based on the breakpoint distance with methods based on the inversion distance, the inversion-based methods are always better.

• This suggests that INV is a better statistic than BP for the true evolutionary distance under the GNT model, even when transpositions and inverted transpositions are present.

Page 57: Combining with phylogeny

Running time

• NJ, BioNJ-IEBP and BioNJ-EDE all finish within 1 second for all settings of the simulation on the Pentium workstation running Linux. However, Weighbor-IEBP and Weighbor-EDE take about 30 minutes to finish for 160 genomes.

Page 58: Combining with phylogeny

Conclusion

• We studied the variance of the breakpoint and inversion distances under the generalized Nadeau-Taylor model.

• We used these results to obtain four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE. Of these, Weighbor-IEBP and Weighbor-EDE yield very accurate phylogenetic trees and are robust against errors in the model parameters.
