Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof....
-
Upload
sophie-goodrick -
Category
Documents
-
view
220 -
download
0
Transcript of Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof....
![Page 1: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/1.jpg)
Whole genome alignments
Genome 559: Introduction to Statistical and Computational Genomics
Prof. James H. Thomas
![Page 2: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/2.jpg)
Review
• What a score matrix is and how to calculate and use one.
• Why an affine gap penalty is desirable.
• How to align sequences using dynamic programming.
• How to calculate and interpret p-values and E-values for pair alignments and database searches.
![Page 3: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/3.jpg)
Whole genome alignments
Why?
![Page 4: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/4.jpg)
known gap in
assembly
averaged conservation
for 17 genomes
individual genome
alignments, darker = higher
scoring
alignment discontinuity (e.g. translocation
break point)
questionable
alignment segment
sequence present but unalignable
UCSC Browser track
![Page 5: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/5.jpg)
GQSQVGQGPPCPHHRCTTCCPDGCHFEPQVCMCDWESCCEEGGQSEVRQGPQCPYHKCIKCQPDGCHYEPTVCICREKPCDEKG
![Page 6: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/6.jpg)
How are genome-wide alignments made?
• mouse and human genomes are each about 3x109 nucleotides.
• how many calculations would a dynamic programming alignment have to make?
• at a minimum - 3 integer additions and 3 inequality tests for each DP matrix position
(by the way, there are other problems too, including assuming colinearity)
![Page 7: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/7.jpg)
• Most common method is the BLAST search (Basic Local Alignment Search Tool). Only the initial step is substantially different from dynamic alignment.
• Search sequence is broken into small words (usually 3 residues long for proteins). 20 * 20 * 20 = 8,000 words. These act as seeds for searches.
• The target dataset is pre-indexed to indicate the positions in the database sequences that match each search word above some score threshold (using a global score matrix such as BLOSUM62).
Making large searches faster
![Page 8: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/8.jpg)
...VFEWVHLLP... WIY
• Target sequences around each indexed word hit are retrieved and the initial match is extended in both directions:
your sequencedatabase (many sites)
• For example, the search sequence word “WVH” might score above threshold with these indexed sequences:
Indexed word Score WVH 23 WIH 22 WVY 17 WIY 16
BLAST searches (cont.)
![Page 9: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/9.jpg)
Schematic of indexed matches
Result – instead of aligning these 3 amino acids to everything, they are aligned only with the tiny fraction of sequence regions that are good candidates for a valid alignment.
(note- blast actually looks for two such matches close to each other)
![Page 10: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/10.jpg)
Extension and scoring
...QSVFEWVHLLPGA... ..WIY..
...QSVFEWVHLLPGA... ..WIYQ..
...QSVFEWVHLLPGA... ..WIYQK..
...QSVFEWVHLLPGA... ..WIYQKA..
Total Score:
16
13
11
10
Match Score:
16
-3
-2
-1
[mention gap variant]
![Page 11: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/11.jpg)
Extension termination
• Extension is continued until the cumulative score drops below some threshold (usually 0).
• This permits the match to cross a region of marginal similarity or frank mismatching (e.g. a small intron in tblastn) if it flanks a region of high similarity.
• Extensions whose maximal cumulative score is above some threshold are kept for reporting to user.
• For web interfaces, various formatting, links, and overviews are added and reported according to user settings (it is also fairly easy to download and run your own blast).
![Page 12: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/12.jpg)
Key to speed: word matching and prior indexing
• Though gapped blast local alignment is slow (like dynamic programming), only a very small part of total search space is analyzed.
• Because the positions of all database word matches are indexed and stored prior to the blast search, the relevant parts of search space are reached quickly.
• Tradeoff is in accuracy and certainty – occasionally matches will be missed (when they are distant enough and dispersed enough that no local word pairs match well enough).
![Page 13: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/13.jpg)
genome A
genome BDP alignment region
M x N manageable
BLAST matches
Dynamic programming after BLAST matching
![Page 14: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/14.jpg)
![Page 15: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/15.jpg)
Defining what a “tree” means
rooted tree (all real trees are rooted):unrooted tree (used when the root isn’t known):
time
ancestral sequence
time vaguely radiates out from somewhere near the center
…divergence time is the sum of (horizontal) branch lengths
sequences(leaves or tips)branch
points
branches
root
![Page 16: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/16.jpg)
A tree has topology and distances
Are these different trees?
![Page 17: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/17.jpg)
The number of tree topologies grows extremely fast
3 leaves3 branches1 internal node1 topology(3 insertions)
4 leaves5 branches2 internal nodes3 topologies (x3)(5 insertions)
5 leaves7 branches3 internal nodes15 topologies (x5)(7 insertions)
In general, an unrooted tree with N leaves has:2N – 3 branchesN – 2 internal nodes~ O(N!) topologies 3 5 7 ... 2 5N
![Page 18: Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.](https://reader038.fdocuments.us/reader038/viewer/2022110319/56649c885503460f94940707/html5/thumbnails/18.jpg)
There are many rooted trees for each unrooted tree
For each unrooted tree, there are 2N - 3 times as many rooted trees, where N is the number of leaves (# internal branches = 2N – 3).
20 leaves - 564,480,989,588,730,591,336,960,000,000 topologies