Introduction to Bioinformatics Tutorial 4 Multiple Alignment and Phylogeny.

Date posted: 19-Dec-2015

Introduction to Bioinformatics

Tutorial 4

Multiple Alignment and Phylogeny

ClustalW Input

[Screenshot: ClustalW input form. Callouts: Fast alignment?, Scoring matrix, Alignment format, Fast alignment options, Gap scoring, Phylogenetic trees, Input sequences]

ClustalW Output (1)

[Screenshot: ClustalW output. Callouts: Input sequences, Pairwise alignment scores, Building alignment, Final score]

ClustalW Output (2)

[Screenshot: ClustalW alignment. Callouts: Sequence names, Sequence positions]

Match strength in decreasing order: * : .

Phylogenetic Trees

• Represent closeness between many entities
  – In our case, genomic or protein sequences

[Figure: tree with leaves human, chimp and monkey. Callouts: Observed entity, Unobserved commonality, Distance representation]

Rooting Trees

• A tree can be hung from a root
  – Adds directional information
  – Requires addition of an ‘outgroup’

[Figure: tree with leaves human, chimp, monkey and pig. Callouts: "We know this is furthest", "So we hang the tree from where it joins"]

Phylogeny and Evolution

Evolutionary Time

Speciation

Number of mutations

Common Ancestor

Tree Reconstruction

• Build tree based on organism sequences

• Distance-based methods
  – Use pairwise alignment scores to build tree
  – Ignores sequences after initial alignments

• Character-based methods
  – Learn a tree with intermediate sequences that minimizes total number of mutations
  – Slower but generally better results
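To make "minimizes total number of mutations" concrete, here is a minimal sketch of Fitch's small-parsimony count, which scores one fixed tree rather than searching over trees. The nested-tuple tree encoding, the function names and the four short sequences are assumptions made up for illustration; they are not taken from the tutorial.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum mutation count) for one column.

    tree: a leaf name (str) or a 2-tuple (left, right) of subtrees.
    states: dict mapping each leaf name to its character in this column.
    """
    if isinstance(tree, str):  # leaf: observed character, no mutations yet
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:  # the two children can share a state: no extra mutation
        return common, left_cost + right_cost
    # the children disagree: one mutation is needed on one of the two branches
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all columns of an alignment (dict name -> sequence)."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences, for illustration only.
alignment = {"human": "ACGT", "chimp": "ACGA", "monkey": "ATGA", "pig": "TTGA"}
tree_a = ((("human", "chimp"), "monkey"), "pig")
tree_b = ((("human", "pig"), "chimp"), "monkey")
print(parsimony_score(tree_a, alignment), parsimony_score(tree_b, alignment))
```

A character-based method would compare such scores across many candidate trees and keep the lowest, which is one reason these methods are slower than distance-based ones.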

Distance-based Example (1)

Pairwise matrix:

       2    3    4
  1    7    4    5
  2         2    2
  3              2

[Figure: tree with leaves 1, 2, 3, 4]

Distance-based Example (2)

Pairwise matrix:

        3    4
  12   -3   -1
  3          2

[Figure: tree with leaves 3, 4 and joined pair 1, 2]

Distance-based Example (3)

Pairwise matrix:

        34
  12    -4

[Figure: tree with joined pairs 3, 4 and 1, 2]

Newick Tree Format

(CFTR_SHEEP:0.01457, (CFTR_HUMAN:0.16153, (CFTR_MOUSE:0.70599, (CFTR_RABIT:2.76042, (CFTR_SQUAC:1.27192, CFTR_XENLA:0.28818) :3.42183) :0.77076) :0.65873):0.73937,CFTR_BOVIN:0.00953);
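As a rough illustration of how this format nests, the sketch below parses the string above into (name, branch length, children) tuples. It is a minimal reader that assumes unquoted names and no comments, not a full Newick parser; `parse_newick` and `leaves` are names chosen here for illustration.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, length, children) tuples."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and the final ';'
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip '('
            children.append(node())
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(node())
            pos += 1  # skip ')'
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]  # empty for unnamed internal nodes
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return name, length, children

    return node()


def leaves(tree):
    """List (name, branch length) for every leaf, left to right."""
    name, length, children = tree
    if not children:
        return [(name, length)]
    return [leaf for child in children for leaf in leaves(child)]


newick = (
    "(CFTR_SHEEP:0.01457, (CFTR_HUMAN:0.16153, (CFTR_MOUSE:0.70599, "
    "(CFTR_RABIT:2.76042, (CFTR_SQUAC:1.27192, CFTR_XENLA:0.28818) :3.42183) "
    ":0.77076) :0.65873):0.73937,CFTR_BOVIN:0.00953);"
)
tree = parse_newick(newick)
for name, length in leaves(tree):
    print(name, length)
```

Each pair of parentheses becomes one internal node, and the number after a colon is the length of the branch leading to that node or leaf.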

Phylodendron Input

[Screenshot: Phylodendron input form. Callouts: Graphical style, Newick tree description, Tree size, Orientation]

Calculation of HIV/SIV Neighbor-joining tree

Why phylogenetic analyses? Mutations accumulate in the genomes of pathogens, especially viruses, during the spread of an infection. This can be used to document the history of transmission events. Phylogenetic analysis of these mutations can be used not only to reconstruct the history of a pathogen's spread through host populations but also to make predictions about its future progress.

The unsolved HIV/SIV relationship

One interesting case where phylogenetic tree building is useful is the unsolved HIV/SIV relationship: HIV-1, HIV-2 and SIV. AIDS (acquired immunodeficiency syndrome) is caused by two different human viruses:

• HIV-1, groups M and O
• HIV-2, subtypes A to E

There are many related viruses in a variety of non-human primates. These related viruses are called SIV (simian immunodeficiency viruses).

Calculation of HIV/SIV Neighbor-joining tree

Phylogenetic studies have shown that primate lentiviruses are all in the same clade. Within this clade there are five major lineages (the subscripts denote the host):

• HIV-1 and SIVCPZ (chimpanzee)
• HIV-2, SIVSM (sooty mangabey) and SIVMAC (captive macaque)
• SIVAGM (African green monkey)
• SIVMND (mandrill)
• SIVSYK (Sykes' monkey)

The NJ tree in our example is based on the polyprotein sequence from HIV-1, HIV-2 and SIV, with HTLV-1 as an outgroup. HTLV-1 (human T-lymphotropic virus type 1) is another human retroviral pathogen that originated from related simian viruses.

Calculation of HIV/SIV Neighbor-joining tree

Step by step summary:

1. Define all taxa and calculate all pairwise distances.
2. Pick two nodes in the star (i and j) for which the distance is minimal.
3. Define a new node (x) and calculate ri and rj.
4. Calculate dix and djx, thereby joining x to i and j respectively.
5. Remove i and j from the star and insert x instead.
6. Calculate dxm for all m in the star.

Continue until the star has been resolved and root the tree in a final step.
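The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the program used for the HIV/SIV tree: it picks the pair by the rate-corrected distance d_ij - r_i - r_j, as standard neighbor joining does, and it updates distances with d_xm = (d_im + d_jm - d_ij)/2. The function name, the `node0`, `node1`, ... labels and the four-taxon test distances are hypothetical. Rooting is left out.

```python
def neighbor_joining(names, dist):
    """Resolve a star tree by neighbor joining.

    names: list of taxon labels.
    dist: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns the edges of the unrooted tree as (node, node, branch_length).
    """
    dist = dict(dist)  # work on a copy; new nodes are added to it
    star = list(names)
    edges = []
    next_id = 0

    def d(a, b):
        return dist[frozenset((a, b))]

    while len(star) > 2:
        L = len(star)
        # r_i = sum of distances from i to all other nodes, divided by L - 2
        r = {a: sum(d(a, k) for k in star if k != a) / (L - 2) for a in star}
        # assumed criterion: smallest rate-corrected distance d_ij - r_i - r_j
        i, j = min(
            ((a, b) for idx, a in enumerate(star) for b in star[idx + 1:]),
            key=lambda p: d(p[0], p[1]) - r[p[0]] - r[p[1]],
        )
        x = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        d_ix = (d(i, j) + r[i] - r[j]) / 2
        d_jx = d(i, j) - d_ix
        edges += [(i, x, d_ix), (j, x, d_jx)]
        # distances from the new node x to every node m left in the star
        for m in star:
            if m not in (i, j):
                dist[frozenset((x, m))] = (d(i, m) + d(j, m) - d(i, j)) / 2
        star = [m for m in star if m not in (i, j)] + [x]

    a, b = star  # the last two nodes are joined directly
    edges.append((a, b, d(a, b)))
    return edges


# Hypothetical additive distances for four taxa, for illustration only.
pairs = {("A", "B"): 5, ("A", "C"): 7, ("A", "D"): 5,
         ("B", "C"): 8, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(p): v for p, v in pairs.items()}
for edge in neighbor_joining(["A", "B", "C", "D"], dist):
    print(edge)
```

Because the test distances come from an exact tree, the recovered leaf branch lengths should match the lengths used to build them.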

Step1 minimum

Step 1 (cont.)

The calculation starts with the star:

The branch lengths between nodes 5 and 10 and between nodes 6 and 10 are calculated with these formulas:

In this case L = 9 and the new node is x = 10.

$r_i = r_5 = \sum_k d_{5k} / (L-2) = 3.22406 / (9-2) = 0.46058$

$r_j = r_6 = \sum_k d_{6k} / (L-2) = 3.22758 / (9-2) = 0.461083$

$d_{ix} = d_{5,10} = (d_{5,6} + r_5 - r_6) / 2 = (0.06088 + 0.46058 - 0.461083) / 2 = 0.0301886$

$d_{jx} = d_{6,10} = d_{5,6} - d_{5,10} = 0.06088 - 0.0301886 = 0.0306914$
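The arithmetic of this step can be checked with a short helper. This is only a sketch: the function name and argument names are made up here, and the inputs are the row sums and the distance quoted on the slide.

```python
def nj_branch_lengths(sum_i, sum_j, d_ij, L):
    """Return (r_i, r_j, d_ix, d_jx) for joining nodes i and j into a new node x.

    sum_i, sum_j: sums of the distances from i and from j to all other nodes.
    d_ij: distance between i and j.
    L: number of nodes currently in the star.
    """
    r_i = sum_i / (L - 2)
    r_j = sum_j / (L - 2)
    d_ix = (d_ij + r_i - r_j) / 2
    d_jx = d_ij - d_ix
    return r_i, r_j, d_ix, d_jx


# Values from step 1: nodes 5 and 6, L = 9, new node 10.
r5, r6, d5_10, d6_10 = nj_branch_lengths(3.22406, 3.22758, 0.06088, L=9)
print(round(r5, 5), round(r6, 6), round(d5_10, 7), round(d6_10, 7))
```

The same helper can be called with the sums, distance and L of any later step.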

Step 1 (cont.)

Step2 minimum

Step2 (cont.)

Calculation of the new branches: 

In this case L = 8 and the new node is x = 11.

$r_i = r_3 = \sum_k d_{3k} / (L-2) = 2.715455 / (8-2) = 0.452576$

$r_j = r_4 = \sum_k d_{4k} / (L-2) = 2.50096 / (8-2) = 0.416827$

$d_{ix} = d_{3,11} = (d_{3,4} + r_3 - r_4) / 2 = (0.125 + 0.452576 - 0.416827) / 2 = 0.080375$

$d_{jx} = d_{4,11} = d_{3,4} - d_{3,11} = 0.125 - 0.080375 = 0.044625$

Step 2 (cont.)

Step3 minimum

Step3 (cont.)

Calculation of the new branches: 

In this case L = 7 and the new node is x = 12.

$r_i = r_2 = \sum_k d_{2k} / (L-2) = 2.252265 / (7-2) = 0.450453$

$r_j = r_{11} = \sum_k d_{11,k} / (L-2) = 2.108208 / (7-2) = 0.4216415$

$d_{ix} = d_{2,12} = (d_{2,11} + r_2 - r_{11}) / 2 = (0.109705 + 0.450453 - 0.4216415) / 2 = 0.069258$

$d_{jx} = d_{11,12} = d_{2,11} - d_{2,12} = 0.109705 - 0.069258 = 0.040447$

Step 3 (cont.)

Step 7

In this case L = 3 and the new node is x = 16:

$r_{13} = 0.843684$

$r_{15} = 0.728574$

$d_{13,16} = 0.131758$

$d_{15,16} = 0.016648$

Step 7 (cont.)

Because node 9 is the outgroup, the root will be placed between node 9 and the other nodes. The distance between node 9 and the first internal node is 0.563519.

Conclusions

HIV-2 (H2) is more closely related to SIV (S) from sooty mangabey than to HIV-1 (H1).

HIV-1 seems to be more closely related to SIV from chimpanzee.

This means that HIV-1 and HIV-2 originated independently from two different SIV strains.

There must have been a cross-species transmission from chimpanzee SIV to human HIV-1.

There also seems to have been a cross-species transmission from human to MAN/MAC.

Conclusions

As one can see, the branch between H2-ROD A and the two SIV taxa has low support: only 56% of the trees have this topology. Therefore the transmission events from human to non-human primates are very uncertain.
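Support percentages of this kind are usually obtained by bootstrapping: alignment columns are resampled with replacement, a tree is rebuilt from each resampled alignment, and the share of trees containing a given branch is reported. The sketch below shows only the resampling and the percentage. The tree-building step is left out, and the function names and the toy alignment are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # the same column choices are applied to every sequence, so columns stay intact
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def support(replicate_has_branch):
    """Percentage of replicate trees that contain the branch of interest."""
    return 100.0 * sum(replicate_has_branch) / len(replicate_has_branch)


# Hypothetical toy alignment, for illustration only.
toy = {"seq1": "ACGTAC", "seq2": "ACGAAC", "seq3": "TCGAAT"}
replicate = bootstrap_alignment(toy, random.Random(0))
print(replicate)
```

A branch found in 56 of 100 replicate trees would get a support value of 56%.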

Exercise

In this exercise you will perform a phylogenetic analysis of the human globin sequences. You will compare your results to the currently prevailing knowledge on the globin family, according to the following summary:

Myoglobin and hemoglobins diverged from one another before the emergence of worms, about 800 million years ago. The hemoglobins diverged into two families (the α-family and β-family) following a gene duplication about 450 million years ago, which is before the emergence of mammals. The α-family diverged into the zeta, theta and alpha genes, and the β-family diverged into the beta, gamma_G, gamma_A, delta and epsilon genes, all following a series of gene duplications. The most recent duplication was that of gamma_G from gamma_A, which occurred around the separation of the simians (humans, chimps, gorillas, etc.) from the prosimians (such as lemurs and lorises), about 55 million years ago. (adapted from Graur and Li, 1999)

Exercise (cont.)

1. Reconstruct the phylogenetic tree of the human globins using neighbor joining. Make sure the tree is properly rooted (by defining an outgroup) according to the information in the above summary. Point out where the hemoglobins and myoglobin diverged, and where the α-family and β-family diverged.

2. Which of the following groups are monophyletic according to the tree you obtained: (i) alpha, beta, delta; (ii) alpha, theta, zeta; (iii) epsilon, beta, delta?

3. Bootstrap the tree you built with 1000 bootstrap iterations. Display the tree with its bootstrap values. On which branch was the lowest bootstrap value obtained? Explain what this means.
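For question 2, recall that a group is monophyletic in a rooted tree when some subtree contains exactly the members of the group and nothing else. A minimal sketch of that check follows. The nested-tuple tree encoding and the labels A to D are hypothetical and deliberately unrelated to the globin data.

```python
def leaf_set(tree):
    """Set of leaf names under a tree given as a leaf name or a tuple of subtrees."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, group):
    """True if some subtree of the rooted tree contains exactly the leaves in group."""
    group = set(group)
    if leaf_set(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical rooted tree (((A,B),C),D), for illustration only.
example = ((("A", "B"), "C"), "D")
print(is_monophyletic(example, {"A", "B"}))   # A and B form their own subtree
print(is_monophyletic(example, {"A", "C"}))   # any subtree with A and C also holds B
```

The answer depends on where the tree is rooted, which is why the outgroup in question 1 matters.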