Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides...
![Page 1: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/1.jpg)
http://creativecommons.org/licenses/by-sa/2.0/
![Page 2: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/2.jpg)
CIS786, Lecture 6
Usman Roshan
Some of the slides are based upon material by David Wishart of University of Alberta and Ron Shamir of Tel Aviv University
![Page 3: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/3.jpg)
Previously…
![Page 4: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/4.jpg)
Iterated local search: Recursive-Iterative-DCM3
[Diagram labels: Recursive-DCM3; Output of Recursive-DCM3; Local search; Local optimum]
![Page 5: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/5.jpg)
13921 Proteobacteria rRNA
![Page 6: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/6.jpg)
How to run Rec-I-DCM3 then?
• Unanswered question: what about better TNT heuristics? Can Rec-I-DCM3 improve upon them?
• Rec-I-DCM3 improves upon default TNT but we don’t know what happens for better TNT heuristics.
• Therefore, for a large-scale analysis, first determine the best settings of the software (e.g. TNT or PAUP*) on the dataset, then use it in conjunction with Rec-I-DCM3 with various subset sizes
![Page 7: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/7.jpg)
Maximum likelihood
![Page 8: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/8.jpg)
Maximum likelihood
• Four problems
– Given data, tree, edge lengths, and ancestral states, find likelihood of tree: polynomial time
– Given data, tree and edge lengths, find likelihood of tree: polynomial time dynamic programming
– Given data and tree, find likelihood: unknown complexity
– Given data, find tree with best likelihood: unknown complexity
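The second problem (data, tree and edge lengths given) is the one solved by dynamic programming over the tree. A minimal sketch of that computation follows, assuming the Jukes-Cantor substitution model, a uniform prior at the root, and ungapped sequences. None of these choices is stated on the slide, and the toy sequences and edge lengths are made up for illustration.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a is replaced by base b over edge length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partial(node, site):
    """Map each base x to P(leaf data below node at this site | node has base x).

    A node is a leaf sequence (str) or a tuple of (child, edge_length) pairs.
    """
    if isinstance(node, str):
        return {x: 1.0 if node[site] == x else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        below = partial(child, site)
        for x in BASES:
            # Sum over the unknown base y at the child end of the edge.
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in BASES)
    return result

def log_likelihood(root, n_sites):
    """Sum per-site log-likelihoods, with a uniform prior on the root base."""
    total = 0.0
    for site in range(n_sites):
        p = partial(root, site)
        total += math.log(sum(0.25 * p[x] for x in BASES))
    return total

# Hypothetical toy input: made-up ungapped sequences and edge lengths.
inner = (("AAGA", 0.1), ("AAGG", 0.1))
root = ((inner, 0.05), ("TAGA", 0.2))
print(log_likelihood(root, 4))
```

Each site costs a constant amount of work per tree node, which is why this problem is polynomial while the search over trees is the hard part.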
![Page 9: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/9.jpg)
Sequential RAxML
Compute randomized parsimony starting tree with dnapars from PHYLIP
Apply exhaustive subtree rearrangements
Iterate while tree improves
![Page 10: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/10.jpg)
Subtree Rearrangements
[Figure: tree with subtrees ST1 to ST6]
Need to optimize all branches?
![Page 11: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/11.jpg)
Idea: Lazy Subtree Rearrangements
[Figure: tree with subtrees ST1 to ST6]
![Page 12: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/12.jpg)
Idea: Lazy Subtree Rearrangements
[Figure: tree with subtrees ST1 to ST6]
![Page 13: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/13.jpg)
Comparison across all datasets

| Dataset size | Improvement as % | Steps improvement | Max p | Avg p |
|---|---|---|---|---|
| 2025 (ARB) | -0.002% | -6 | 0.56 | 0.36 |
| 2415 Bininda-Emonds | 0.004% | 23 | 0.48 | 0.2 |
| 6673 (RG) | 1.251% | 6877 | 1 | 0.29 |
| 7769 (RG) | 2.338% | 13290 | 1 | 0.33 |
| 8780 (ARB) | 0.03% | 270 | 0.55 | 0.23 |

![Page 14: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/14.jpg)
Parallel Rec-I-DCM3
[Diagram labels: Recursive-DCM3; Output of DCM3; Local search; Local optimum]
(1) Solve subproblems in parallel
(2) Merge subtrees in the proper subtree order
Use parallel RAxML developed by Du and Stamatakis
![Page 15: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/15.jpg)
P-Rec-I-DCM3 vs Rec-I-DCM3

| Dataset | Parallel LH | Sequential LH | Improvement in steps | Improvement (as a %) |
|---|---|---|---|---|
| 500 rbcL (Zilla) | -99945 | -99967 | 22 | 0.022% |
| 2560 rbcL (Kallersjo) | -354944 | -355088 | 144 | 0.041% |
| 4114 16s Actinobacteria (RDP) | -383108 | -383524 | 416 | 0.11% |
| 6281 ssu rRNA Eukaryotes (ERNA) | -1270379 | -1270785 | 406 | 0.032% |
| 6458 16s Firmicutes Bacteria (RDP) | -900875 | -902077 | 1202 | 0.13% |
| 7769 rRNA 3-dom+2org (Gutell) | -540334 | -541019 | 685 | 0.13% |

![Page 16: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/16.jpg)
Parallel performance limits
• Performance appears sub-optimal because of significant load imbalance caused by different subproblem sizes
• Optimal speedup = (total subproblem time) / (minimum possible parallel time, i.e. the time of the longest subproblem)
• Dataset 3
– 19 subproblems of which 3 require at least 5K seconds (max is 5569 seconds)
– Optimal speedup: 37353/5569 = 6.71
• Dataset 6
– 43 subproblems of which longest takes 12164 seconds
– Optimal speedup: 63620/12164 = 5.23
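
| Dataset | Processors | Base | Global | Overall |
|---|---|---|---|---|
| 3 | 4 | 1.9 | 2.6 | 2.2 |
| 3 | 8 | 5.5 | 5 | 5.3 |
| 3 | 16 | 6.7 | 5.7 | 6.2 |
| 6 | 4 | 3.2 | 1.95 | 2.2 |
| 6 | 8 | 4.8 | 2.5 | 3 |
| 6 | 16 | 5.4 | 2.8 | 3.3 |
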
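The two optimal speedups on this slide can be checked with a few lines of arithmetic. The sketch below assumes the bound is simply total sequential work divided by the longest single subproblem, since no number of processors can finish sooner than that subproblem. The function name is illustrative only.

```python
def optimal_speedup(total_time, longest_subproblem_time):
    """Upper bound on speedup when subproblems cannot be split further."""
    return total_time / longest_subproblem_time

# Totals and longest-subproblem times (in seconds) as quoted on the slide.
print(round(optimal_speedup(37353, 5569), 2))   # Dataset 3
print(round(optimal_speedup(63620, 12164), 2))  # Dataset 6
```

Adding processors beyond this point cannot help, because the longest subproblem alone determines the wall-clock time.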
![Page 17: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/17.jpg)
Summary of last time
• Rec-I-DCM3 in detail
• Rec-I-DCM3(TNT)
• Maximum likelihood (ML) problem
• RAxML for solving ML
• Rec-I-DCM3(RAxML)
• Parallel Rec-I-DCM3(RAxML)
![Page 18: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/18.jpg)
Sequencing Successes
T7 bacteriophage: completed in 1983; 39,937 bp, 59 coded proteins
Escherichia coli: completed in 1998; 4,639,221 bp, 4293 ORFs
Saccharomyces cerevisiae: completed in 1996; 12,069,252 bp, 5800 genes
![Page 19: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/19.jpg)
Sequencing Successes
Caenorhabditis elegans: completed in 1998; 95,078,296 bp, 19,099 genes
Drosophila melanogaster: completed in 2000; 116,117,226 bp, 13,601 genes
Homo sapiens: completed in 2003; 3,201,762,515 bp, 31,780 genes
![Page 20: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/20.jpg)
Genomes to Date
• 8 vertebrates (human, mouse, rat, fugu, zebrafish)
• 3 plants (Arabidopsis, rice, poplar)
• 2 insects (fruit fly, mosquito)
• 2 nematodes (C. elegans, C. briggsae)
• 1 sea squirt
• 4 parasites (plasmodium, guillardia)
• 4 fungi (S. cerevisiae, S. pombe)
• 200+ bacteria and archaebacteria
• 2000+ viruses
![Page 21: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/21.jpg)
So what do we do with all this sequence data?
![Page 22: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/22.jpg)
So what do we do with all this sequence data?
Comparative bioinformatics
![Page 23: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/23.jpg)
DNA Sequence Evolution
[Figure: a phylogenetic tree on a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestral sequence AAGACTT leads to the intermediate sequences AAGGCTT and T_GACTT, then to _GGGCTT, TAGACCTT and A_CACTT, and finally to the present-day sequences below.]
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
ACCTT (Cat)
ACACTTC (Lion)
GGCTT (Mouse)
![Page 24: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/24.jpg)
Sequence alignments
They tell us about
• Function or activity of a new gene/protein
• Structure or shape of a new protein
• Location or preferred location of a protein
• Stability of a gene or protein
• Origin of a gene or protein
• Origin or phylogeny of an organelle
• Origin or phylogeny of an organism
• And more…
![Page 25: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/25.jpg)
Pairwise alignment
• How to align two sequences?
![Page 26: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/26.jpg)
Pairwise alignment
• How to align two sequences?
• We use dynamic programming
• Treat DNA sequences as strings over the alphabet {A, C, G, T}
![Page 27: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/27.jpg)
Pairwise alignment
![Page 28: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/28.jpg)
Dynamic programming
Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
![Page 29: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/29.jpg)
Dynamic programming
Time and space complexity is O(mn)
Define V(i,j) to be the optimal pairwise alignment score between S[1..i] and T[1..j] (|S|=m, |T|=n)
![Page 30: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/30.jpg)
Tabular computation of scores
![Page 31: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/31.jpg)
Traceback to get alignment
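The table fill and traceback can be sketched as follows. The recurrence itself is on the slide image and not in this transcript, so the scoring used here (match +1, mismatch -1, linear gap penalty -1) is an assumption, as is the function name.

```python
def global_align(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_s, aligned_t)."""
    m, n = len(s), len(t)
    # V[i][j] = best score for aligning s[:i] with t[:j]
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = i * gap
    for j in range(1, n + 1):
        V[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = V[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            V[i][j] = max(diag, V[i - 1][j] + gap, V[i][j - 1] + gap)

    # Traceback: walk from V[m][n] to V[0][0], re-deriving which case gave each cell.
    a, b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + (
            match if s[i - 1] == t[j - 1] else mismatch
        ):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            a.append(s[i - 1]); b.append("_"); i -= 1
        else:
            a.append("_"); b.append(t[j - 1]); j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))

score, top, bottom = global_align("ACCTT", "ACACTTC")  # cat and lion sequences
print(score)
print(top)
print(bottom)
```

Both the table and the traceback use O(mn) space here; when several alignments tie, this traceback returns just one of them.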
![Page 32: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/32.jpg)
Local alignment
Finding optimally aligned local regions
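A minimal sketch of the local variant, assuming the standard Smith-Waterman modification of the global recurrence: every cell is floored at 0 so an alignment may start anywhere, and the answer is the maximum over the whole table. The scoring values and the example strings are assumptions, not taken from the slides.

```python
def local_align_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between any substring of s and any substring of t."""
    best = 0
    prev = [0] * (len(t) + 1)  # previous row of the table
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # The extra 0 lets a local alignment restart at this cell.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

# Made-up strings sharing the region GACTT.
print(local_align_score("TTTGACTTAAA", "CCGACTTCC"))
```

Only two rows are kept because this sketch reports the score alone; recovering the aligned region would need the full table or the position of the maximum.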
![Page 33: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/33.jpg)
Local alignment
![Page 34: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/34.jpg)
Database searching
• Suppose we have a set of 1,000,000 sequences
• You have a query sequence q and want to find the m closest ones in the database---that means 1,000,000 pairwise alignments!
• How to speed up pairwise alignments?
![Page 35: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/35.jpg)
FASTA
• FASTA was the first software for quick searching of a database
• Introduced the idea of searching for k-mers
• Can be done quickly by preprocessing database
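The preprocessing idea can be illustrated with a k-mer index: every k-mer in the database is stored once with its locations, so a query only needs lookups instead of a full alignment against each sequence. This is a simplified sketch rather than the actual FASTA implementation; the function names, the choice k=3 and the query string are assumptions, and the database reuses the five example sequences from this lecture.

```python
from collections import defaultdict

def build_kmer_index(database, k):
    """Map each k-mer to the list of (sequence id, position) where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((seq_id, pos))
    return index

def kmer_hits(query, index, k):
    """Count k-mer occurrences shared between the query and each database sequence."""
    counts = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for seq_id, _ in index.get(query[pos:pos + k], ()):
            counts[seq_id] += 1
    return dict(counts)

database = ["TAGGCCTT", "TAGCCCTTA", "ACCTT", "ACACTTC", "GGCTT"]
index = build_kmer_index(database, k=3)
print(kmer_hits("TAGGCTT", index, k=3))
```

Sequences with many hits become candidates for a full dynamic programming alignment; the rest of the database is skipped.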
![Page 36: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/36.jpg)
FASTA: combine high scoring hits into diagonal runs
![Page 37: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/37.jpg)
BLAST
Key idea: search for k-mers (short matching substrings) quickly by preprocessing the database.
![Page 38: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/38.jpg)
BLAST
This key idea can also be used for speeding up pairwise alignments when doing multiple sequence alignments
![Page 39: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/39.jpg)
Biologically realistic scoring matrices
• PAM and BLOSUM are most popular
• PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins
• BLOSUM is more recent and computed from blocks of sequences with sufficient similarity
![Page 40: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/40.jpg)
PAM
• We need to compute the probability transition matrix M which defines the probability of amino acid i converting to j
• Examine a set of closely related sequences which are easy to align---for PAM 1572 mutations between 71 families
• Compute probabilities of change and background probabilities by simple counting
![Page 41: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/41.jpg)
PAM
• In this model the unit of evolution is the amount of evolution that will change 1 in 100 amino acids on average
• The scoring matrix S_ab is the ratio of M_ab to p_b
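As a hedged note on how these pieces are usually combined: the transition matrix M for one unit of evolution is extended to n units by taking its n-th matrix power, and the ratio is commonly reported as a log-odds score. The factor 10 and the base-10 logarithm below follow the usual Dayhoff convention and are not stated on the slide; p_b is the background probability of amino acid b.

```latex
M^{(n)} = M^{n}, \qquad
S_{ab} = 10 \, \log_{10} \frac{M^{(n)}_{ab}}{p_b}
```

A positive S_ab means that a is paired with b more often than chance would predict, and a negative value means less often.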
![Page 42: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/42.jpg)
PAM Mij matrix (x10000)
![Page 43: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/43.jpg)
Multiple sequence alignment
• “Two sequences whisper, multiple sequences shout out loud”---Arthur Lesk
• Computationally very hard---NP-hard
![Page 44: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/44.jpg)
Formally…
![Page 45: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/45.jpg)
Multiple sequence alignment
Unaligned sequences
GGCTT
TAGGCCTT
TAGCCCTTA
ACACTTC
ACCTT
Aligned sequences
_G__GCTT_
TAGGCCTT_
TAGCCCTTA
A__CACTTC
A__C_CTT_
Conserved regions help us to identify functionality
![Page 46: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/46.jpg)
Sum of pairs score
![Page 47: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/47.jpg)
Sum of pairs score
• What is the sum of pairs score of this alignment?
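The alignment this question refers to is on the slide image, so the sketch below instead scores the five-sequence alignment shown earlier in the lecture. The pairwise cost function is an assumption (mismatch 1, residue against gap 1, identical characters including gap against gap 0), so the result is specific to that choice.

```python
from itertools import combinations

def column_cost(a, b, mismatch=1, gap=1):
    """Cost of one aligned pair of characters; '_' denotes a gap."""
    if a == b:
        return 0  # match, or gap aligned with gap
    if a == "_" or b == "_":
        return gap
    return mismatch

def sum_of_pairs(alignment):
    """Sum the induced pairwise alignment costs over all pairs of rows."""
    total = 0
    for row1, row2 in combinations(alignment, 2):
        total += sum(column_cost(x, y) for x, y in zip(row1, row2))
    return total

alignment = [
    "TAGGCCTT_",  # Human
    "TAGCCCTTA",  # Monkey
    "A__C_CTT_",  # Cat
    "A__CACTTC",  # Lion
    "_G__GCTT_",  # Mouse
]
print(sum_of_pairs(alignment))  # 45 under the assumed unit costs
```

With k sequences there are k(k-1)/2 pairs, so evaluating a given alignment is cheap; the hard part is finding the alignment that optimizes this score.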
![Page 48: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/48.jpg)
Tree alignment score
![Page 49: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/49.jpg)
Tree alignment score
![Page 50: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/50.jpg)
Tree Alignment
TAGGCCTT (Human)
TAGCCCTTA (Monkey)
ACCTT (Cat)
ACACTTC (Lion)
GGCTT (Mouse)
![Page 51: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/51.jpg)
Tree Alignment
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
Internal node sequences: TAGGCCTT_, A__CACTT_, TGGGGCTT_, AGGGACTT_
Edge costs: 0, 2, 2, 1, 1, 3, 3, 2
Tree alignment score = 14
![Page 52: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/52.jpg)
Tree Alignment---depends on tree
TAGGCCTT_ (Human)
TAGCCCTTA (Monkey)
A__C_CTT_ (Cat)
A__CACTTC (Lion)
_G__GCTT_ (Mouse)
Internal node sequences: TA_CCCTT_, TA_CCCTTA, TA_CCCTT_, TA_CCCTTA
Edge costs: 2, 3, 1, 4, 1, 0, 4, 0
Tree alignment score = 15
Switch monkey and cat
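Both scores can be reproduced by summing, over every tree edge, the number of columns in which the two endpoint sequences differ. The tree drawings are not in this transcript, so the edge lists below are a reconstruction: they are one topology consistent with the edge costs listed on each slide, and should be read as an assumption.

```python
def hamming(a, b):
    """Number of columns where two aligned sequences differ (gaps count as characters)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def tree_alignment_score(edges):
    """Sum of edge costs over a list of (sequence, sequence) tree edges."""
    return sum(hamming(u, v) for u, v in edges)

human, monkey = "TAGGCCTT_", "TAGCCCTTA"
cat, lion, mouse = "A__C_CTT_", "A__CACTTC", "_G__GCTT_"

# First tree: human with monkey, cat with lion (assumed topology).
a, b, c, d = "TAGGCCTT_", "A__CACTT_", "TGGGGCTT_", "AGGGACTT_"
tree1 = [(human, a), (monkey, a), (cat, b), (lion, b),
         (mouse, c), (a, c), (c, d), (b, d)]

# Second tree: monkey and cat switched (assumed topology).
a2, b2, c2, d2 = "TA_CCCTT_", "TA_CCCTTA", "TA_CCCTT_", "TA_CCCTTA"
tree2 = [(human, a2), (cat, a2), (monkey, b2), (lion, b2),
         (mouse, c2), (a2, c2), (c2, d2), (b2, d2)]

print(tree_alignment_score(tree1))  # 14
print(tree_alignment_score(tree2))  # 15
```

The same five leaf sequences give different scores on the two trees, which is the sense in which the tree alignment score depends on the tree.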
![Page 53: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/53.jpg)
Profiles
• Before we see how to construct multiple alignments, how do we align two alignments?
• Idea: summarize an alignment using its profile and align the two profiles
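A profile can be sketched as one frequency table per alignment column. The representation below (a list of dictionaries over the alphabet A, C, G, T plus the gap character) is an illustrative assumption; real tools differ in details such as sequence weighting.

```python
def profile(alignment, alphabet="ACGT_"):
    """Return, for each column, the relative frequency of every character."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    return [
        {ch: sum(row[col] == ch for row in alignment) / n_rows for ch in alphabet}
        for col in range(n_cols)
    ]

# Two aligned sequences from the lecture example (human and monkey).
p = profile(["TAGGCCTT_", "TAGCCCTTA"])
print(p[0])  # first column: both rows have T
print(p[3])  # fourth column: G in one row, C in the other
```

Two profiles can then be aligned with the same dynamic programming scheme as two sequences, scoring a pair of columns by comparing their frequency tables.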
![Page 54: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/54.jpg)
Profile alignment
![Page 55: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/55.jpg)
Iterative alignment (heuristic for sum-of-pairs)
• Pick a random sequence from input set S
• Do (n-1) pairwise alignments and align to closest one t in S
• Remove t from S and compute profile of alignment
• While sequences remaining in S
– Do |S| pairwise alignments and align to closest one t
– Remove t from S
![Page 56: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/56.jpg)
Iterative alignment
• Once alignment is computed randomly divide it into two parts
• Compute profile of each sub-alignment and realign the profiles
• If sum-of-pairs of the new alignment is better than the previous then keep, otherwise continue with a different division until specified iteration limit
![Page 57: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/57.jpg)
Progressive alignment
• Idea: perform profile alignments in the order dictated by a tree
• Given a guide-tree do a post-order search and align sequences in that order
• Widely used heuristic
• Can be used for solving tree alignment
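The order of profile alignments dictated by a guide tree can be sketched with a post-order traversal. The guide tree below, written as nested pairs, is hypothetical, and the sketch only records which groups would be merged at each step; the profile alignment itself is left out.

```python
def merge_order(tree):
    """Post-order traversal of a guide tree given as nested 2-tuples of leaf names.

    Returns the list of (left_group, right_group) merges in the order
    a progressive aligner would perform them.
    """
    order = []

    def visit(node):
        if isinstance(node, str):
            return [node]
        left, right = node
        left_leaves = visit(left)
        right_leaves = visit(right)
        # Both children are already aligned, so their alignments can be merged.
        order.append((left_leaves, right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return order

# Hypothetical guide tree over the five example species.
guide = ((("Human", "Monkey"), "Mouse"), ("Cat", "Lion"))
for left_group, right_group in merge_order(guide):
    print(left_group, "+", right_group)
```

Each merge aligns two sub-alignments (or single sequences), so a tree with k leaves needs k-1 profile alignments.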
![Page 58: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/58.jpg)
Simultaneous alignment and phylogeny reconstruction
• Given unaligned sequences produce both alignment and phylogeny
• Known as the generalized tree alignment problem---MAX-SNP hard
• Iterative improvement heuristic:
– Take starting tree
– Modify it using say NNI, SPR, or TBR
– Compute tree alignment score
– If better then select tree otherwise continue until reached a local minimum
![Page 59: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/59.jpg)
Median alignment
• Idea: iterate over the phylogeny and align every triplet of sequences---takes O(m^3) time (in general, for n sequences it takes O(2^n m^n) time)
• Same profiles can be used as in progressive alignment
• Produces better tree alignment scores (as observed in experiments)
• Iteration continues for a specified limit
![Page 60: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/60.jpg)
Popular alignment programs
• ClustalW: most popular, progressive alignment
• MUSCLE: fast and accurate, progressive and iterative combination
• T-COFFEE: slow but accurate, consistency based alignment (align sequences in multiple alignment to be close to the optimal pairwise alignment)
• PROBCONS: slow but highly accurate, probabilistic consistency progressive based scheme
• DIALIGN: very good for local alignments
![Page 61: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/61.jpg)
MUSCLE
![Page 62: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/62.jpg)
MUSCLE
![Page 63: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/63.jpg)
MUSCLE
Profile sum-of-pairs score
Log expectation score used by MUSCLE
![Page 64: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/64.jpg)
Evaluation of multiple sequence alignments
• Compare to benchmark “true” alignments
• Use simulation
• Measure conservation of an alignment
• Measure accuracy of phylogenetic trees
• How well does it align motifs?
• More…
![Page 65: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/65.jpg)
BAliBASE
• Most popular benchmark of alignments
• Alignments are based upon structure
BAliBASE currently consists of 142 reference alignments, containing over 1000 sequences. Of the 200,000 residues in the database, 58% are defined within the core blocks. The remaining 42% are in ambiguous regions that cannot be reliably aligned. The alignments are divided into four hierarchical reference sets, reference 1 providing the basis for construction of the following sets. Each of the main sets may be further sub-divided into smaller groups, according to sequence length and percent similarity.
![Page 66: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/66.jpg)
BAliBASE
• The sequences included in the database are selected from alignments in either the FSSP or HOMSTRAD structural databases, or from manually constructed structural alignments taken from the literature. When sufficient structures are not available, additional sequences are included from the HSSP database (Schneider et al., 1997). The VAST Web server (Madej, 1995) is used to confirm that the sequences in each alignment are structural neighbours and can be structurally superimposed. Functional sites are identified using the PDBsum database (Laskowski et al., 1997) and the alignments are manually verified and adjusted, in order to ensure that conserved residues are aligned as well as the secondary structure elements.
![Page 67: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/67.jpg)
BAliBASE
• Reference 1 contains alignments of (less than 6) equi-distant sequences, ie. the percent identity between two sequences is within a specified range. All the sequences are of similar length, with no large insertions or extensions. Reference 2 aligns up to three "orphan" sequences (less than 25% identical) from reference 1 with a family of at least 15 closely related sequences. Reference 3 consists of up to 4 sub-groups, with less than 25% residue identity between sequences from different groups. The alignments are constructed by adding homologous family members to the more distantly related sequences in reference 1. Reference 4 is divided into two sub-categories containing alignments of up to 20 sequences including N/C-terminal extensions (up to 400 residues), and insertions (up to 100 residues).
![Page 68: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/68.jpg)
Comparison of alignments on BAliBASE
![Page 69: Http://creativecommons.org/licenses/by-sa/2.0/. CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.](https://reader030.fdocuments.us/reader030/viewer/2022032523/56649d765503460f94a5832c/html5/thumbnails/69.jpg)
Next time…
• Comparison of alignments under simulation
• Heuristics for simultaneous alignment and phylogeny reconstruction
• Comparison of alignments for motif detection---functional sites in proteins
• Performance of alignments for phylogeny reconstruction