Sequence Comparison Algorithms
Ellen Walker
Bioinformatics
Hiram College
The Problem
• We have two sequences that we want to compare, based on edit distance
• Edit distance = number of changes to get from one string to the other
– Insertions
– Deletions
– Changes
Example
• LOVE => MONEY
• 1. Replace L by M
• 2. Replace V by N
• 3. Add Y at the end
L O V E –
M O N E Y
Brute Force Solution
• Try all possible alignments between the strings
• Looking at one string:
– Every possible shift (space before or after)
– Every possible gap (space within)
– Gaps of various lengths, bounded by the size of the longest string
How many possibilities are there?
• Consider only single insertions:
• _ M _ O _ N _ E _ Y _
– There are N+1 places to insert, where N is the length of the string
• At each place you have 2 choices (insert or not)
– Therefore, just this subset already has 2^(N+1) possibilities
– So, brute force is exponential!
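The count can be checked directly with a small sketch. The function name `insertion_patterns` is illustrative and not part of the lecture; it enumerates every insert-or-not choice for the N+1 slots around a string.

```python
from itertools import product

def insertion_patterns(s):
    """Yield every way to place at most one gap ('_') in each of the
    len(s) + 1 slots before, between, and after the characters of s."""
    slots = len(s) + 1
    for choice in product([False, True], repeat=slots):
        out = []
        for i, ch in enumerate(s):
            if choice[i]:          # gap in the slot just before character i
                out.append("_")
            out.append(ch)
        if choice[-1]:             # gap in the final slot, after the last character
            out.append("_")
        yield "".join(out)

patterns = list(insertion_patterns("MONEY"))
print(len(patterns))  # 2 ** (5 + 1) = 64
```

Each extra character doubles the count, and this is before allowing longer gaps or gaps in the second string.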
Dynamic Programming
• Score possibilities in an alignment matrix
• Value of any square in the matrix depends on:
– Value above (if “vertical gap”)
– Value beside (if “horizontal gap”)
– Value diagonally above (if match or mismatch)
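Global Alignment Matrix
(– = horizontal gap, | = vertical gap, \ = match or mismatch; values are consistent with match = +1, mismatch = -1, gap = -1)
              M      O      N      E      Y
       0   – -1   – -2   – -3   – -4   – -5
L   | -1   \ -1   \ -2   – -3   – -4   – -5
O   | -2   \ -2   \  0   – -1   – -2   – -3
V   | -3   \ -3   | -1   \ -1   \ -2   – -3
E   | -4   \ -4   | -2   \ -2   \  0   – -1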
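A minimal sketch of how such a global matrix can be filled, assuming match = +1, mismatch = -1, and gap = -1 (values inferred from the LOVE/MONEY matrix, not stated on the slide). This style of fill is commonly called Needleman-Wunsch; the function and parameter names are illustrative.

```python
def global_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment score matrix: rows follow a, columns follow b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs one gap per character.
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # vertical gap
            left = score[i][j - 1] + gap    # horizontal gap
            score[i][j] = max(diag, up, left)
    return score

matrix = global_alignment_matrix("LOVE", "MONEY")
for row in matrix:
    print(row)
```

The bottom-right cell holds the score of the best global alignment; for LOVE and MONEY under these assumed scores it is -1 (two matches, two mismatches, one gap).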
Local Alignment Matrix
          M    O    N    E    Y
     0    0    0    0    0    0
L    0    0    0    0    0    0
O    0    0  \ 1    0    0    0
V    0    0    0    0    0    0
E    0    0    0    0  \ 1    0
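The local variant differs from the global one in two ways: the borders start at 0, and no cell is allowed to go below 0. A sketch in the Smith-Waterman style, again assuming match = +1, mismatch = -1, gap = -1 (the names are illustrative):

```python
def local_alignment_matrix(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a local alignment matrix; return (matrix, best score, best cell)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]   # borders stay at 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            # Clamping at 0 lets an alignment restart anywhere.
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    return score, best, best_pos

local, best, best_pos = local_alignment_matrix("LOVE", "MONEY")
for row in local:
    print(row)
```

For LOVE and MONEY the only nonzero cells are the two isolated matches (O/O and E/E), each scoring 1.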
Computing the Alignment Matrix
• For each square:
– Take the best of vertical gap, horizontal gap, and (mis)match score (minimum for edit distance, maximum for an alignment score): O(1)
• There are N*M squares, where N and M are the lengths of the strings
• Therefore, time and space are both O(N*M) or (for short) O(N^2)
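For the edit-distance form of the problem, each square takes the minimum of three costs. A short sketch, assuming every insertion, deletion, and change costs 1 (the function name is illustrative):

```python
def edit_distance(a, b):
    """Number of insertions, deletions, and changes needed to turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i              # delete all i characters of a's prefix
    for j in range(cols):
        dist[0][j] = j              # insert all j characters of b's prefix
    for i in range(1, rows):        # N*M squares, O(1) work in each
        for j in range(1, cols):
            change = dist[i - 1][j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            delete = dist[i - 1][j] + 1
            insert = dist[i][j - 1] + 1
            dist[i][j] = min(change, delete, insert)
    return dist[-1][-1]

print(edit_distance("LOVE", "MONEY"))  # 3
```

The result for LOVE and MONEY agrees with the three-step example: two replacements and one addition.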
But, what is N?
• If we’re matching genomes, N is huge!
• N^2 is too much time and space!
• How can we save further?
Ordering the Computations
• Each cell can be computed when the ones above, diagonally above, and to the left are computed
– Left-to-right, top to bottom (row major)
– Top-to-bottom, left to right (column major)
– Across a diagonal wavefront
Saving Space: Row Major
• A row major computation really only needs two rows (the one above, and the current row).
• After each computation, the current row becomes the row above
• Savings: space is O(N) instead of O(N^2)
• Cost: Insufficient information for traceback
– Do a new alignment, limited to a region around the result.
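The two-row idea can be sketched as follows, assuming the same illustrative scores as before (match = +1, mismatch = -1, gap = -1). Only the final score is returned, since the rows needed for traceback have been thrown away.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using O(len(b)) space instead of a full matrix."""
    prev = [j * gap for j in range(len(b) + 1)]       # the row above
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)               # the current row
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr                                   # current row becomes the row above
    return prev[-1]

print(global_score_two_rows("LOVE", "MONEY"))  # -1
```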
Saving Time: Wavefront
• Use a parallel processor (effectively N machines at a time)
• Each reverse diagonal is computed at once
• Time is now O(N), but cost is N processors instead of 1
• A computer science theoretician would say “no savings”, but if you’re the one waiting, you might disagree!
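The wavefront order can be illustrated on a single processor. In this sketch (illustrative names, assumed scores of match = +1, mismatch = -1, gap = -1) the inner loop visits the cells of one reverse diagonal; none of them depends on another cell of the same diagonal, so on a parallel machine that loop could run all at once. The code itself is sequential and only demonstrates the ordering.

```python
def global_matrix_by_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment matrix one reverse diagonal (i + j = d) at a time."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for d in range(1, rows + cols - 1):               # one step per diagonal
        # Cells on diagonal d need only diagonals d-1 and d-2, so they are independent.
        for i in range(max(0, d - cols + 1), min(rows, d + 1)):
            j = d - i
            if i == 0:
                score[i][j] = j * gap
            elif j == 0:
                score[i][j] = i * gap
            else:
                diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
                score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score

wave = global_matrix_by_wavefront("LOVE", "MONEY")
print(wave[-1][-1])  # -1
```

There are N + M - 1 diagonals after the origin cell, which is where the O(N) parallel time comes from.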
Saving More Time: Partial Search
• In local alignment, large areas have 0’s.
• Mismatches adjacent to 0’s are also 0’s.
• To get “reasonably large” values, you need longer sequences (BLAST “words”) in common
• So, only search near where there are common subsequences
Finding Common Subsequences
• Pick a sequence length.
• For each subsequence of that length, find all occurrences in each sequence
• If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j)
– (i, j) is called the seed
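One way to find exact-match seeds is to index every length-k subsequence of the first sequence and then look up each length-k subsequence of the second. This is a sketch under that assumption; `find_seeds` and `k` are illustrative names, and the sequences are made up.

```python
from collections import defaultdict

def find_seeds(seq1, seq2, k):
    """Return sorted (i, j) pairs where seq1[i:i+k] == seq2[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(seq1) - k + 1):
        index[seq1[i:i + k]].append(i)        # where each k-long word occurs in seq1
    seeds = []
    for j in range(len(seq2) - k + 1):
        for i in index.get(seq2[j:j + k], []):
            seeds.append((i, j))              # common word: matrix cell (i, j) is a seed
    return sorted(seeds)

seeds = find_seeds("GATTACA", "ATTAC", 3)
print(seeds)  # [(1, 0), (2, 1), (3, 2)]
```

The three seeds here lie on one diagonal (i - j = 1), which is what a longer common region looks like in the matrix.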
BLAST’s Generalization
• Consider a threshold T and a sequence S
• The neighborhood of the sequence S is all sequences that score T or better against S
• BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore)
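A brute-force sketch of a neighborhood, assuming a toy scoring of +1 per matching position and -1 per mismatching position over a DNA alphabet. A real search would score with a substitution matrix such as BLOSUM62; the simple scores and the function name here are assumptions for illustration only.

```python
from itertools import product

def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All same-length words that score at least `threshold` against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch for c, w in zip(candidate, word))
        if score >= threshold:
            result.append("".join(candidate))
    return result

print(neighborhood("ACG", 3))        # ['ACG']
print(len(neighborhood("ACG", 1)))   # 10: the word itself plus 9 one-mismatch words
```

Lowering the threshold from 3 to 1 grows the neighborhood from 1 word to 10, which means more seeds and more of the matrix to explore.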
Consequences of Choices
• Higher T’s are faster, but ignore more potential matches
• Longer sequences are less common
– Smaller neighborhoods for a given T
– Fewer areas to search
– More likelihood of missing good alignments
T vs Sequence Size
• Longer sequences have higher maximum scores (unless normalized)
• But, longer sequences (tend to) have more likelihood of mismatches?
Too Many Seeds
• If we pick a sequence length and threshold that are sufficiently sensitive, we still might have too many seeds for reasonable alignment times.
• Two-seed solution:
– Only consider areas of the table that contain two seeds (diagonals) separated by a limited distance
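A sketch of the two-seed filter, under the assumption that "two seeds (diagonals)" means two distinct seeds on the same diagonal (equal i - j) whose starting rows are at most `max_distance` apart. The function name, the parameter, and the sample seeds are hypothetical.

```python
from collections import defaultdict

def two_hit_pairs(seeds, max_distance):
    """Return pairs of neighbouring seeds that share a diagonal and lie close together."""
    by_diagonal = defaultdict(list)
    for i, j in set(seeds):                    # drop duplicate seeds
        by_diagonal[i - j].append((i, j))
    pairs = []
    for hits in by_diagonal.values():
        hits.sort()
        for first, second in zip(hits, hits[1:]):   # consecutive seeds on the diagonal
            if second[0] - first[0] <= max_distance:
                pairs.append((first, second))
    return sorted(pairs)

example_seeds = [(0, 0), (4, 4), (20, 20), (3, 7)]
print(two_hit_pairs(example_seeds, 5))  # [((0, 0), (4, 4))]
```

Only the region around the surviving pair would be explored; the isolated seeds at (20, 20) and (3, 7) are ignored.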
Extending Alignments
• A seed region is a small alignment
• We want to “grow” the alignments (especially if we can connect to others(!))
• To grow an alignment, use Smith-Waterman to compute neighboring values
• Question: when to stop growing?
Score Changes During Growing
• As an alignment is extended, its score changes
– Score increases when sub-matches connect
– Score decreases when extended into unrelated area
• Often score must decrease before increasing!
When to Stop?
• Consider current score, compared to maximum score so far
• When the current score gets sufficiently small relative to the maximum, then stop
• This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work)
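The stopping rule can be sketched with a simplified, gap-free extension to the right of a seed. This is not the full Smith-Waterman growth described above; it assumes an exact-match seed of length k at (i, j), scores of +1 and -1, and a hypothetical `max_drop` parameter for how far the score may fall below its maximum.

```python
def extend_right(seq1, seq2, i, j, k, max_drop, match=1, mismatch=-1):
    """Extend a length-k seed at (i, j) rightwards; return (best score, best length)."""
    score = best = k * match          # the seed itself is assumed to match exactly
    length = best_len = k
    while i + length < len(seq1) and j + length < len(seq2):
        score += match if seq1[i + length] == seq2[j + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length      # new maximum so far
        elif best - score > max_drop:
            break                               # fallen too far below the maximum
    return best, best_len

s1 = "ACGTACGTTTTT"
s2 = "ACGTTCGTAAAA"
print(extend_right(s1, s2, 0, 0, 4, max_drop=2))  # (6, 8)
print(extend_right(s1, s2, 0, 0, 4, max_drop=0))  # (4, 4)
```

With `max_drop=0` the extension stops at the first mismatch and misses the longer match beyond it; with `max_drop=2` it rides out the dip, reaches a score of 6, and then stops once the unrelated tail drags the score down.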
One more “trick”
• Suppose that there is a “standard” sequence that many people want to align against
• Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations
• When someone does a search, the seeding part has already been done
Offline vs. Online Algorithms
• Offline Algorithms
– Execute “standardized” part of algorithm in advance, and save result
– This is like compilation of a program
• Online Algorithm
– Use the tables or databases you built offline to answer a specific query
– This is like running a program
– User sees only time taken by Online Algorithm
Common Offline/Online Applications
• Web searching
– Offline: build indexes of sites vs. keywords
– Online: retrieve sites from the index
• Neural networks
– Offline: train the network on many examples of the problem, set the weights
– Online: run the network once (with fixed weights) on the specific example
Summary
• Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2))
• BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!)
Using BLAST Well
• Importance of setting parameters
– Sequence length
– Score threshold
– Distance (for two-hit method)
– Stopping condition (for growing seeded alignments)
Exercises
• Given the BLOSUM62 matrix at http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
– What is the neighborhood of HID with threshold 5? 10? 15?
• Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)