1. 2 Rooting the tree and giving length to branches.
Date posted: 19-Dec-2015
Rooting the tree and giving length to branches
Rooted vs. unrooted trees
[Figure: a rooted tree and an unrooted tree on taxa 1, 2, 3.]
The position of the root does not affect the MP score.
Rooted vs. Unrooted.
Exercise:
Draw all alternative rootings of the MP tree. Evaluate one of them, and show that the MP score does not change.
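Gene number 1, Option number 1.
[Figure: tree with leaves s1, s4, s3, s2, s5 carrying character states 1, 1, 1, 0, 0; the remaining labels in the figure are 1, 0, 1.]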
More intuition for why rooting does not change the score.
The change will always be on the same branch, no matter where the root is positioned.
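The claim can be checked on a small case. The sketch below counts changes with Fitch's set-based procedure, which is an assumption here since the slides do not name the scoring algorithm. The topology is hypothetical; only the leaf names and the states 1, 1, 1, 0, 0 come from the slide. The three nested tuples are three rootings of the same unrooted tree.

```python
def fitch(tree, states):
    """Return (state set at this node, number of changes below it) for one character."""
    if isinstance(tree, str):                 # leaf: observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:                                # children can agree: no change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # one change needed

# Leaf states taken from the slide; the tree shape is an assumption.
states = {"s1": 1, "s4": 1, "s3": 1, "s2": 0, "s5": 0}

# Three rootings of the unrooted tree ((s1,s4), s3, (s2,s5)).
rootings = [
    ((("s1", "s4"), "s3"), ("s2", "s5")),     # root on the branch to (s2,s5)
    (("s1", "s4"), ("s3", ("s2", "s5"))),     # root on the branch to (s1,s4)
    ("s1", ("s4", ("s3", ("s2", "s5")))),     # root on the branch to s1
]

scores = [fitch(tree, states)[1] for tree in rootings]
print(scores)  # [1, 1, 1]
```

In all three rootings the single change falls on the branch separating (s2, s5) from the rest.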
How can we root the tree – we want rooted trees!
Gorilla gorilla (gorilla)
Homo sapiens (human)
Pan troglodytes (chimpanzee)
Gallus gallus (chicken)
Evaluate all 3 possible UNROOTED trees:
((Human, Chimp), (Chicken, Gorilla))
((Human, Gorilla), (Chimp, Chicken))
((Human, Chicken), (Chimp, Gorilla))
MP tree
HOW MANY TREES
How many rooted trees
TR = “TREE ROOTED”
N=2, TR(2) = 1
N=3, TR(3) = 3
N=4, TR(4) = 15
[Figure: all rooted trees on taxa a, b (1 tree), on a, b, c (3 trees), and on a, b, c, d (15 trees).]
[Figure: the rooted tree on a, b and the positions where “c” can be attached.]
2 branches. 3 possible places to add “c”.
4 branches. 5 possible places to add “d”.
6 branches. 7 possible places to add “e”.
The number of branches increases by 2 with each added taxon, so the branch counts form an arithmetic series: 0, 2, 4, 6, 8, … With A(n) = A(1) + (n-1)d, A(1) = 0 and d = 2, we get A(n) = (n-1)*2 = 2n-2.
Each time, the new taxon can be added in BR(n)+1 places: on any of the BR(n) branches, or above the root. [BR(n) = number of branches in a rooted tree with n sequences; TR(n) = number of rooted trees with n sequences]
TR(n+1) = TR(n)*(BR(n)+1) = TR(n)*(2n-1)
TR(5) = TR(4)*7 = TR(3)*5*7 = TR(2)*3*5*7 = 1*3*5*7
TR(n) = 1*3*5*7*…*(2n-3)
n! = 1*2*3*4*5*6*…*n = n factorial.
$$TR(n) = 1\cdot 3\cdot 5\cdot 7\cdots(2n-3) = \frac{1\cdot 2\cdot 3\cdot 4\cdots(2n-3)}{2\cdot 4\cdot 6\cdot 8\cdots(2n-4)} = \frac{(2n-3)!}{(2\cdot 1)(2\cdot 2)(2\cdot 3)\cdots(2(n-2))} = \frac{(2n-3)!}{2^{n-2}\,(1\cdot 2\cdot 3\cdots(n-2))} = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
$$TR(n) = 1\cdot 3\cdot 5\cdot 7\cdots(2n-3) = \frac{(2n-3)!}{2^{n-2}\,(n-2)!} = (2n-3)!!$$
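As a quick numerical check of the counting argument, the sketch below computes TR(n) twice: once by repeatedly multiplying by (2k-1), and once from the factorial form (2n-3)! / (2^(n-2) (n-2)!). The function names are illustrative only.

```python
from math import factorial

def tr_recursive(n):
    """TR(n) from TR(k+1) = TR(k) * (2k - 1), starting at TR(2) = 1."""
    count = 1
    for k in range(2, n):
        count *= 2 * k - 1
    return count

def tr_closed(n):
    """TR(n) = (2n-3)! / (2^(n-2) * (n-2)!)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

# The two forms agree for every n tried here.
for n in range(2, 11):
    assert tr_recursive(n) == tr_closed(n)

print([tr_recursive(n) for n in (2, 3, 4, 5)])  # [1, 3, 15, 105]
```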
How many unrooted trees
Ex: show that the number of unrooted trees is given by 1*3*5*…*(2n-5), where n is the number of sequences.
Open questions: a closed formula does not exist, though a recursion formula exists (Felsenstein 1987, Schroder, 1870). There are other results about the asymptotic rate at which the numbers rise, and other results concerning the number of tree shapes, etc.
HEURISTIC SEARCH
There are many trees…
We cannot go over all the trees. Instead, we will look for a way to find the best tree without checking every one. These are approximate solutions…
Finding the maximum is the same thing as finding the minimum
Say we have a computer procedure that, given a function, finds its minimum, and we want to find the maximum of a function f(x). We can just find the minimum of -f(x); this is minus the maximum of f(x).
Example.
f(0) = 3; f(1) = 7; f(2) = -5; f(3) = 0; max f(x) = 7; argmax f(x) = 1.
-f(0) = -3; -f(1) = -7; -f(2) = 5; -f(3) = 0; min(-f(x)) = -7; argmin(-f(x)) = 1.
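The same example as a short sketch, using only a minimizer (Python's built-in `min`) to recover the maximum and its position:

```python
# Values of f from the example above.
f = {0: 3, 1: 7, 2: -5, 3: 0}

# Minimize -f instead of maximizing f.
x_star = min(f, key=lambda x: -f[x])      # argmin of -f(x), which is argmax of f(x)
max_value = -min(-f[x] for x in f)        # minus the minimum of -f(x)

print(x_star, max_value)  # 1 7
```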
[Figure sequence: the search starts from a tree with score 1700. Its neighbors have scores 1825, 1710, 1410 and 1695. From the tree with score 1825, the neighbors have scores 1828, 1910 and 1800. The search ends at a max score of 2900.]
Problem number 1: local maximum
[Figure: score landscape with points at 2100, 2900 (local max) and 3100 (global max).]
This algorithm is “greedy”: it seizes the first improvement encountered.
One way to avoid local maxima is to start from many random starting points.
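A minimal sketch of greedy search with restarts on a toy one-dimensional landscape. The scores reuse numbers from the slides, but their arrangement in a line and the "neighbor = adjacent index" rule are assumptions made only for illustration; real tree search uses tree rearrangements as neighbors.

```python
import random

# Hypothetical landscape: position i has score SCORES[i]; neighbors are i-1 and i+1.
SCORES = [1700, 1825, 1910, 2900, 2100, 3100, 1410]

def hill_climb(start):
    """Greedy search: move to the first better neighbor found, stop when none is better."""
    pos = start
    while True:
        for nxt in (pos - 1, pos + 1):
            if 0 <= nxt < len(SCORES) and SCORES[nxt] > SCORES[pos]:
                pos = nxt
                break
        else:                      # no neighbor improves the score: a (local) maximum
            return pos

def best_of_restarts(starts):
    """Run the greedy search from every start and keep the best end point."""
    return max((hill_climb(s) for s in starts), key=lambda p: SCORES[p])

print(SCORES[hill_climb(0)])                       # 2900: stuck on the local maximum
print(SCORES[best_of_restarts(range(len(SCORES)))])  # 3100: some start reaches the global maximum

# Random starting points, as suggested above (the result depends on the draw).
random_starts = random.sample(range(len(SCORES)), k=4)
print(SCORES[best_of_restarts(random_starts)])
```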
Several options to define a neighbor.
[Figure: Option 1 and Option 2.]
Nearest-neighbor interchange
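[Figure: a tree with subtrees A, B, C, D around one internal branch, and the two trees obtained by exchanging subtrees across that branch.]
Each internal branch defines two neighbors.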
How many neighbors do we check each time?
For unrooted trees of n taxa, we have 2n-3 branches. However, only internal branches are interesting, thus we have n-3. Each defines two neighbors, thus the total number of neighbors in each NNI cycle is 2n-6.
[Figure: a tree on taxa A, B, C, D, E with its internal branches and external branches marked.]
NNI is possible only on internal branches.
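The counting argument can be sketched in a few lines. The tuple encoding of the four subtrees around an internal branch is a hypothetical representation chosen for brevity, not a full tree data structure.

```python
def nni_neighbors(a, b, c, d):
    """The internal branch separates subtrees (a, b) from (c, d).
    Return the two alternative groupings reachable by one NNI."""
    return [((a, c), (b, d)), ((a, d), (b, c))]

def nni_neighbor_count(n):
    """Number of NNI neighbors of an unrooted binary tree with n taxa."""
    total_branches = 2 * n - 3
    internal_branches = total_branches - n    # n of the branches are external
    return 2 * internal_branches              # two neighbors per internal branch

print(nni_neighbors("A", "B", "C", "D"))
print(nni_neighbor_count(5))  # 4
```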
I am greedy
Greedy variants
(1) Most greedy: Start searching your neighbors. If you find something better, move there and start the search again.
(2) Just greedy: Check ALL your neighbors. Move to the one with the highest score.
(3) Smart greedy: Try all NNIs of trees that are tied for the best score.
There are many other variants of the greedy search that will not be discussed in this course.
SPR = SUBTREE PRUNING AND REGRAFTING
1. Choose a branch and cut it in 2.
2. Remove the sticky end from one subtree.
3. Connect the remaining sticky end to one branch in the other subtree.
[Figure: SPR applied to a tree with taxa A, B, C, D, E.]
TBR = TREE BISECTION AND RECONNECTION
1. Choose a branch and cut it in 2.
2. Remove the sticky end from both subtrees.
3. Connect the remaining 2 subtrees anywhere.
[Figure: TBR applied to a tree with taxa A, B, C, D, E, F.]
Sequential addition
1. Start with a 3-taxa tree.
2. Evaluate all possible additions of the next taxon.
One can do rearrangements in each addition step to increase efficiency.
[Figure: taxa D and E are added in turn to a tree of A, B, C. Red: best addition.]
Star decomposition
1. Start with an n-taxa star tree.
2. In each step, find the best pair of taxa to separate from the star’s root.
One can do rearrangements in each step to increase efficiency.
[Figure: a star tree on taxa A, B, C, D, E from which the pair (C,B) is separated. Red: best pair to group together.]
Simulated Annealing
Another method to avoid local maxima.
The idea in the simulated annealing is to relax the greediness by allowing steps to go downhill. For example we pick up one NNI neighbor randomly. If it is uphill – we move there. If it is downhill, we move there with a certain probability p.
We can control the probability p. In the beginning of the search allow p to be high. As the search progresses, reduce p (i.e., make the search more greedy).
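$$p(\Delta E, T) = \begin{cases} 1 & \text{if } \Delta E \ge 0 \\ e^{\Delta E / T} & \text{else} \end{cases}$$
Here ΔE is the change in score when moving to the neighbor (negative for a downhill step) and T is a parameter that is lowered during the search in order to reduce p.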
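A minimal sketch of the acceptance rule, assuming the common form in which a downhill step with score change ΔE < 0 is accepted with probability e^(ΔE/T). The function names and the temperature values are illustrative assumptions, not taken from the slides.

```python
import math
import random

def accept_probability(delta, temperature):
    """Probability of moving to a neighbor whose score differs by delta."""
    if delta >= 0:                          # uphill (or equal): always move
        return 1.0
    return math.exp(delta / temperature)    # downhill: move with probability < 1

def accept_move(current_score, neighbor_score, temperature, rng=random):
    """Decide at random whether to move to the neighbor."""
    delta = neighbor_score - current_score
    return rng.random() < accept_probability(delta, temperature)

# The same downhill step becomes less likely as the temperature is lowered.
for temperature in (100.0, 10.0, 1.0):
    print(temperature, accept_probability(-10, temperature))
```

Lowering the temperature over time makes the search behave more and more like the greedy search.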
Branch and Bound
There are many trees…
We cannot go over all the trees. The heuristic searches above give approximate solutions… But what if we want to make sure we find the global maximum?
There is a way that is more efficient than just going over all possible trees. It is called BRANCH AND BOUND, a general technique in computer science that can be applied to phylogeny.
BRANCH AND BOUND
To exemplify the BRANCH AND BOUND (BNB) method, we will use an example not connected to evolution. Later, when the general BNB method is understood, we will see how to apply this method to finding the MP tree. We will present the shortest Hamiltonian path (SHP) problem.
THE SHP PROBLEM (adapted to Israel).
A guard has to visit n check-points on a map. The problem is to find the shortest route (including the starting point) that goes through all points.
Naïve approach: (say for 5 points). You have 5 starting points. For each such starting point you have 4 possible “next steps”. For each such combination of starting point and first step, you have 3 possible second steps, etc. All together we have 5*4*3*2*1 possible solutions = 5!.
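The naïve approach can be sketched directly with permutations. The coordinates below are hypothetical check-points invented for the example; the route is treated as an open path that visits every point once.

```python
from itertools import permutations
from math import dist

# Hypothetical map coordinates for 5 check-points (not from the slides).
POINTS = {1: (0, 0), 2: (3, 0), 3: (3, 4), 4: (0, 4), 5: (1, 2)}

def route_length(route):
    """Total length of an open route visiting the points in the given order."""
    return sum(dist(POINTS[a], POINTS[b]) for a, b in zip(route, route[1:]))

routes = list(permutations(POINTS))        # 5*4*3*2*1 = 120 candidate routes
best = min(routes, key=route_length)       # go over all of them, keep the shortest

print(len(routes))  # 120
print(best, round(route_length(best), 3))
```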
THE SHP TREE
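[Figure: search tree over the check-points 1 to 5. The first level lists the 5 possible starting points, the second level lists the 4 remaining points under each of them, and so on down to complete routes.]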
THE SHP NAÏVE APPROACH
Each solution can be represented as a permutation:
(1,2,3,4,5)
(1,2,3,5,4)
(1,2,4,3,5)
(1,2,4,5,3)
(1,2,5,3,4)
…
We can go over the list and find the one giving the shortest route.
However, for 15 points for example, there are 1,307,674,368,000 permutations.
The rate of increase of the number of solutions is too big (more than exponential).
THE SHP HEURISTIC APPROACH
Start from a random point. Go to the closest point that has not been visited yet, and repeat.
This approach does not work so well…
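A small sketch of why the closest-point heuristic can fail. The three collinear check-points are a hypothetical instance constructed so that the heuristic, started at "a", ends up longer than the optimum found by exhaustive search.

```python
from itertools import permutations
from math import dist

# Hypothetical check-points on a straight road (not from the slides).
POINTS = {"a": (0, 0), "b": (2, 0), "c": (-3, 0)}

def route_length(route):
    """Total length of an open route visiting the points in the given order."""
    return sum(dist(POINTS[p], POINTS[q]) for p, q in zip(route, route[1:]))

def nearest_neighbor_route(start):
    """Heuristic: always go to the closest point not yet visited."""
    route, remaining = [start], set(POINTS) - {start}
    while remaining:
        here = POINTS[route[-1]]
        nxt = min(remaining, key=lambda p: dist(here, POINTS[p]))
        route.append(nxt)
        remaining.remove(nxt)
    return route

greedy = nearest_neighbor_route("a")
optimum = min(permutations(POINTS), key=route_length)

print(greedy, route_length(greedy))     # ['a', 'b', 'c'] 7.0
print(optimum, route_length(optimum))   # ('b', 'a', 'c') 5.0
```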
Computation times
The question is the relationship between computation time and n.
In very good cases, the computation time scales linearly with n: the computation time is increased by a constant each time n is increased by 1.
In polynomial time, the function relating computation time to n is a polynomial. For example, CT(n) = 7n^2.
No matter what polynomial function we have, exponential functions like 2^n will overtake it for large enough n.
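For the example polynomial CT(n) = 7n^2, the point where 2^n overtakes it can be found by direct search. A minimal sketch:

```python
def ct_polynomial(n):
    """The example polynomial computation time, CT(n) = 7 n^2."""
    return 7 * n ** 2

def ct_exponential(n):
    """An exponential computation time, 2^n."""
    return 2 ** n

# Smallest n at which the exponential is larger than the polynomial.
crossover = next(n for n in range(1, 100) if ct_exponential(n) > ct_polynomial(n))
print(crossover)  # 10

for n in (5, 10, 20, 30):
    print(n, ct_polynomial(n), ct_exponential(n))
```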
NP-complete
Computer science theory shows that there is a class of problems that appear not to have polynomial-time solutions. All these NP-complete problems are equivalent, in the sense that if one ever finds a polynomial solution to one of them, one can solve all of them in polynomial time. Although it has never been proven that there is no polynomial solution to these problems (the biggest open question in computer science), most people believe this to be the case.
NP-hard
There is another class of problems: the NP-hard problems. These are at least as hard as the NP-complete problems. No polynomial solution is known for them, and even if the NP-complete problems could be solved in polynomial time, this would not necessarily help in solving every NP-hard problem in polynomial time.
The SHP is one such NP-hard problem!
Estimating the parsimony score of a tree is not NP-complete.
[Figure: tree whose nodes carry the characters G, A, C, A, G.]
4^(n-2) possible reconstructions.
n = number of sequences
n-2 = number of internal nodes
One could go over all 4^(n-2) possible assignments of characters to internal nodes to find the MP score. However, we have previously shown that although the naïve solution is exponential, a linear time algorithm exists.
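The exhaustive enumeration can be sketched for one site on a small unrooted tree. The tree shape and the placement of the characters G, A, C, A, G on the leaves are assumptions made for illustration; with n = 5 sequences there are n-2 = 3 internal nodes and 4^3 = 64 assignments.

```python
from itertools import product

# Hypothetical unrooted tree: leaves 0..4, internal nodes 5, 6, 7.
LEAF_STATES = {0: "G", 1: "A", 2: "C", 3: "A", 4: "G"}
INTERNAL = [5, 6, 7]
EDGES = [(0, 5), (1, 5), (5, 6), (2, 6), (6, 7), (3, 7), (4, 7)]

def changes(assignment):
    """Number of branches whose two ends carry different characters."""
    state = {**LEAF_STATES, **dict(zip(INTERNAL, assignment))}
    return sum(state[u] != state[v] for u, v in EDGES)

# Go over every assignment of A, C, G, T to the internal nodes.
all_assignments = list(product("ACGT", repeat=len(INTERNAL)))
mp_score = min(changes(a) for a in all_assignments)

print(len(all_assignments))  # 64
print(mp_score)              # 3
```

The number of assignments grows as 4^(n-2), which is why the linear time algorithm is preferred in practice.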
BNB SOLUTION TO SHP
[Figure: the SHP search tree over check-points 1 to 5, as on the slide “THE SHP TREE”.]
Shortest path found so far = 15
Score here already 16: no point in checking the rest of the subtree
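A minimal sketch of branch and bound for the SHP problem: routes are extended point by point, and a partial route is abandoned as soon as it is already at least as long as the best complete route found so far. The coordinates are hypothetical, and the exhaustive search at the end is included only to confirm that the pruning does not lose the optimum.

```python
from itertools import permutations
from math import dist, inf, isclose

# Hypothetical map coordinates for 5 check-points (not from the slides).
POINTS = {1: (0, 0), 2: (3, 0), 3: (3, 4), 4: (0, 4), 5: (1, 2)}

def branch_and_bound():
    """Return (best length, best route, number of search-tree nodes examined)."""
    best_len, best_route, examined = inf, None, 0

    def extend(route, length):
        nonlocal best_len, best_route, examined
        examined += 1
        if length >= best_len:             # bound: cannot beat the current record
            return
        if len(route) == len(POINTS):      # complete route: new record
            best_len, best_route = length, route
            return
        for p in POINTS:                   # branch: try every unvisited point next
            if p not in route:
                extend(route + (p,), length + dist(POINTS[route[-1]], POINTS[p]))

    for start in POINTS:
        extend((start,), 0.0)
    return best_len, best_route, examined

def route_length(route):
    return sum(dist(POINTS[a], POINTS[b]) for a, b in zip(route, route[1:]))

best_len, best_route, examined = branch_and_bound()
brute_force = min(route_length(r) for r in permutations(POINTS))

print(best_route, round(best_len, 3), examined)
print(isclose(best_len, brute_force))  # True
```

The full search tree for 5 points has 5 + 20 + 60 + 120 + 120 = 325 nodes; the `examined` count shows how many of them the bound still required.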
Back to finding the MP tree
Finding the MP tree is NP-Hard…
BNB helps, though it is still exponential…
56
The MP search tree1
2
34 is added to branch 1.
1
2
34
1
2
34
1
2
3
4
5 is added to branch 2.There are 5 branches
57
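The MP search tree
4 is added to branch 1.
[Figure: search tree of MP scores. The 3-taxon tree scores 30. The three 4-taxon trees score 43, 55 and 39. The 5-taxon trees below them score 52 54 52 53 58 (below 43), 61 56 59 61 69 (below 55) and 53 51 42 47 47 (below 39).]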
MP-BNB
[Figure sequence: the search tree is traversed step by step. The best record is first 52. The five trees below the 4-taxon tree scoring 55 (61 56 59 61 69) are then dropped from the figure, and the best record improves to 51 and then to 42.]
Best record = 52, then 51, then 42
Best tree: MP score = 42
Total trees visited: 14
MP-BNB – an improvement
[Figure: the same search tree, showing only the scores 30; then 43, 55, 39; then 53 51 42 47 47.]
Evaluate all 3 first.
The “bound” after searching this subtree will be 42.
Total trees visited: 9
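The two visit counts can be reproduced with a short sketch. The nested encoding below is hypothetical, and the grouping of the fifteen 5-taxon scores under the 4-taxon trees scoring 43, 55 and 39 is inferred from the slides rather than stated in them. Pruning relies on the fact that adding a taxon can never lower the parsimony score.

```python
from math import inf

# (score, children). Grouping of the leaves under 43 / 55 / 39 is an inference.
SEARCH_TREE = (30, [
    (43, [(52, []), (54, []), (52, []), (53, []), (58, [])]),
    (55, [(61, []), (56, []), (59, []), (61, []), (69, [])]),
    (39, [(53, []), (51, []), (42, []), (47, []), (47, [])]),
])

def mp_branch_and_bound(tree, best_first=False):
    """Return (best complete-tree score, number of trees scored)."""
    best = inf
    visited = 1                               # the starting 3-taxon tree

    def expand(node):
        nonlocal best, visited
        score, children = node
        if not children:                      # complete tree: update the record
            best = min(best, score)
            return
        visited += len(children)              # every child tree is scored
        ordered = sorted(children, key=lambda c: c[0]) if best_first else children
        for child in ordered:
            if child[1] and child[0] >= best: # bound: partial tree already too costly
                continue
            expand(child)

    expand(tree)
    return best, visited

print(mp_branch_and_bound(SEARCH_TREE))                   # (42, 14)
print(mp_branch_and_bound(SEARCH_TREE, best_first=True))  # (42, 9)
```

With `best_first=True` the most promising 4-taxon tree (score 39) is searched first, so the bound of 42 is reached early and both other subtrees are skipped.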