Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.

Sequence Comparison Algorithms

Ellen Walker

Bioinformatics

Hiram College

The Problem

• We have two sequences that we want to compare, based on edit distance

• Edit distance = the minimum number of changes needed to get from one string to the other
– Insertions
– Deletions
– Changes (substitutions)

Example

• LOVE => MONEY

• 1. Replace L by M

• 2. Replace V by N

• 3. Add Y at the end

L O V E –

M O N E Y
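The count of three edits can be checked with a short dynamic-programming sketch. The function name `edit_distance` is illustrative, and a cost of 1 for every insertion, deletion, and substitution is assumed.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if the letters match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("LOVE", "MONEY"))  # 3
```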

Brute Force Solution

• Try all possible alignments between the strings

• Looking at one string,
– Every possible shift (space before or after)
– Every possible gap (space within)
– Gaps of various lengths, bounded by the size of the longest string

How many possibilities are there?

• Consider only single insertions:

• _ M _ O _ N _ E _ Y _
– There are N+1 places to insert, where N is the length of the string

• At each place you have 2 choices (insert or not)
– Therefore, just this subset already has 2^(N+1) possibilities
– So, brute force is exponential!
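As a quick check of the counting argument, the sketch below enumerates every way of putting at most one gap in each slot of MONEY. The helper name `insertion_patterns` is made up for this illustration.

```python
from itertools import product


def insertion_patterns(s: str):
    """Yield every way of placing at most one gap character in each slot of s."""
    slots = len(s) + 1  # before, between, and after the letters
    for choice in product((False, True), repeat=slots):
        out = []
        for k, insert in enumerate(choice):
            if insert:
                out.append("_")
            if k < len(s):
                out.append(s[k])
        yield "".join(out)


patterns = list(insertion_patterns("MONEY"))
print(len(patterns))  # 64 == 2 ** (5 + 1)
```

Adding one letter to the string doubles the count, which is the exponential growth the slide refers to.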

Dynamic Programming

• Score possibilities in an alignment matrix

• Value of any square in the matrix depends on:
– Value above (if “vertical gap”)
– Value beside (if “horizontal gap”)
– Value diagonally above (if match or mismatch)

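Global Alignment Matrix

(Each cell shows the move that produced it, then its score: –– horizontal gap, | vertical gap, \ match or mismatch. The values are consistent with match +1, mismatch -1, gap -1.)

                M       O       N       E       Y
        0   –– -1   –– -2   –– -3   –– -4   –– -5
L    | -1    \ -1    \ -2   –– -3   –– -4   –– -5
O    | -2    \ -2    \  0   –– -1   –– -2   –– -3
V    | -3    \ -3    | -1    \ -1    \ -2   –– -3
E    | -4    \ -4    | -2    \ -2    \  0   –– -1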

Local Alignment Matrix

          M     O     N     E     Y
    0     0     0     0     0     0
L   0     0     0     0     0     0
O   0     0   \ 1     0     0     0
V   0     0     0     0     0     0
E   0     0     0     0   \ 1     0

Computing the Alignment Matrix

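• For each square:
– Take the best of the vertical-gap, horizontal-gap, and (mis)match scores: O(1)
– “Best” is the maximum when scoring alignments as in the matrices above, or the minimum when counting edit distance

• There are N*M squares, where N and M are the lengths of the strings

• Therefore, time and space are both O(N*M) or (for short) O(N^2)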
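The fill procedure can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) are an assumption inferred from the LOVE/MONEY matrices, and `alignment_matrix` is an illustrative name. The global variant is commonly known as Needleman-Wunsch and the local variant as Smith-Waterman.

```python
def alignment_matrix(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Fill the (len(a)+1) x (len(b)+1) score matrix; rows follow a, columns follow b."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are charged along the first row and column.
        for j in range(1, cols):
            m[0][j] = j * gap
        for i in range(1, rows):
            m[i][0] = i * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = m[i - 1][j] + gap      # vertical gap
            left = m[i][j - 1] + gap    # horizontal gap
            best = max(diag, up, left)
            # Local alignment never lets a score drop below zero.
            m[i][j] = max(best, 0) if local else best
    return m


g = alignment_matrix("LOVE", "MONEY")
print(g[-1][-1])                      # -1, the global alignment score
s = alignment_matrix("LOVE", "MONEY", local=True)
print(max(max(row) for row in s))     # 1, the best local score
```

Each cell does a constant amount of work, so the two nested loops give the O(N*M) time and space stated above.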

But, what is N?

• If we’re matching genomes, N is huge!

• N^2 is too much time and space!

• How can we save further?

Ordering the Computations

• Each cell can be computed when the ones above, diagonally above, and to the left are computed
– Left-to-right, top to bottom (row major)
– Top-to-bottom, left to right (column major)
– Across a diagonal wavefront

Saving Space: Row Major

• A row major computation really only needs two rows (the one above, and the current row).

• After each computation, the current row becomes the row above

• Savings: space is O(N) instead of O(N^2)

• Cost: Insufficient information for traceback
– Do a new alignment, limited to a region around the result.
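A minimal sketch of the two-row idea, again assuming match +1, mismatch -1, gap -1; the function name is illustrative. It returns only the final score, because the discarded rows are exactly what a traceback would need.

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score using O(len(b)) memory; no traceback is possible."""
    above = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diag = above[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diag, above[j] + gap, current[j - 1] + gap))
        above = current  # the current row becomes the row above
    return above[-1]


print(global_score_two_rows("LOVE", "MONEY"))  # -1
```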

Saving Time: Wavefront

• Use a parallel processor (effectively N machines at a time)

• Each reverse diagonal is computed at once

• Time is now O(N), but cost is N processors instead of 1

• A computer science theoretician would say “no savings” (the total work is still O(N^2)), but if you’re the one waiting, you might disagree!
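The wavefront order can be illustrated sequentially. The sketch below (illustrative name, assumed scoring of match +1, mismatch -1, gap -1) visits the cells one reverse diagonal at a time and counts the waves; it does not actually run in parallel, it only shows that the cells within one wave never depend on each other.

```python
def wavefront_score(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global matrix one anti-diagonal at a time; return (score, waves)."""
    rows, cols = len(a) + 1, len(b) + 1
    m = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        m[0][j] = j * gap
    for i in range(rows):
        m[i][0] = i * gap
    waves = 0
    # Interior cells with i + j == d depend only on diagonals d - 1 and d - 2,
    # so all cells of one wave could be handed to separate processors.
    for d in range(2, rows + cols - 1):
        for i in range(max(1, d - cols + 1), min(rows - 1, d - 1) + 1):
            j = d - i
            diag = m[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            m[i][j] = max(diag, m[i - 1][j] + gap, m[i][j - 1] + gap)
        waves += 1
    return m[-1][-1], waves


print(wavefront_score("LOVE", "MONEY"))  # (-1, 8)
```

For strings of lengths N and M there are N + M - 1 waves, which is where the O(N) running time with enough processors comes from.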

Saving More Time: Partial Search

• In local alignment, large areas have 0’s.

• Mismatches adjacent to 0’s are also 0’s.

• To get “reasonably large” values, you need longer sequences (BLAST “words”) in common

• So, only search near where there are common subsequences

Finding Common Subsequences

• Pick a sequence length.

• For each subsequence of that length, find all occurrences in each sequence

• If i is the index in one sequence and j is the index in the other sequence, then fill in the region of the alignment matrix near (i, j); the position (i, j) is called the seed
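One way to find exact-match seeds is to index the words of one sequence in a dictionary and look up the words of the other. This is a sketch under that assumption; `find_seeds` and the example strings are made up for illustration.

```python
from collections import defaultdict


def find_seeds(a: str, b: str, k: int):
    """Return sorted (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(a) - k + 1):
        index[a[i:i + k]].append(i)  # every length-k word of a and where it starts
    seeds = []
    for j in range(len(b) - k + 1):
        for i in index.get(b[j:j + k], []):
            seeds.append((i, j))
    return sorted(seeds)


print(find_seeds("GATTACA", "ATTACCA", 3))  # [(1, 0), (2, 1), (3, 2)]
```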

BLAST’s Generalization

• Consider a threshold T and a sequence S

• The neighborhood of the sequence S is all sequences that score at or better than T against S

• BLAST uses neighborhoods to set seeds (areas of the alignment matrix to explore)
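The neighborhood definition can be made concrete with a toy scoring scheme. The sketch below assumes a DNA alphabet with match +2 and mismatch -1, values chosen only for illustration; for proteins a substitution matrix such as BLOSUM62 would supply the per-letter scores instead.

```python
from itertools import product


def neighborhood(word, threshold, alphabet="ACGT", match=2, mismatch=-1):
    """All words of the same length scoring >= threshold against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch for c, w in zip(candidate, word))
        if score >= threshold:
            result.append("".join(candidate))
    return result


print(len(neighborhood("ACG", 6)))  # 1: only ACG itself reaches the maximum score
print(len(neighborhood("ACG", 3)))  # 10: ACG plus the 9 words with one mismatch
```

Lowering the threshold enlarges the neighborhood, so more positions become seeds.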

Consequences of Choices

• Higher T’s are faster, but ignore more potential matches

• Longer sequences are less common
– Smaller neighborhoods for a given T
– Fewer areas to search
– More likelihood of missing good alignments

T vs Sequence Size

• Longer sequences have higher maximum scores (unless normalized)

• But, longer sequences (tend to) have more likelihood of mismatches?

Too Many Seeds

• If we pick a sequence length and threshold that is sufficiently sensitive, we still might have too many seeds for reasonable alignment times.

• Two-seed solution:
– Only consider areas of the table that contain two seeds (diagonals) separated by a limited distance
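A simplified sketch of the two-seed filter: seeds with the same value of i - j lie on one diagonal, so grouping by that value and comparing neighbors is enough. The function name and the `max_distance` parameter are hypothetical, and details such as requiring the two hits not to overlap are left out.

```python
from collections import defaultdict


def two_hit_seeds(seeds, max_distance):
    """Keep seeds that share a diagonal with another seed at most max_distance away."""
    by_diagonal = defaultdict(list)
    for i, j in seeds:
        by_diagonal[i - j].append(i)  # seeds with equal i - j lie on one diagonal
    kept = set()
    for diag, starts in by_diagonal.items():
        starts.sort()
        for first, second in zip(starts, starts[1:]):
            if 0 < second - first <= max_distance:
                kept.add((first, first - diag))
                kept.add((second, second - diag))
    return sorted(kept)


seeds = [(0, 0), (5, 5), (40, 40), (3, 9), (7, 2)]
print(two_hit_seeds(seeds, max_distance=10))  # [(0, 0), (5, 5)]
```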

Extending Alignments

• A seed region is a small alignment

• We want to “grow” the alignments (especially if we can connect to others!)

• To grow an alignment, use Smith-Waterman to compute neighboring values

• Question: when to stop growing?

Score Changes During Growing

• As an alignment is extended, its score changes
– Score increases when sub-matches connect
– Score decreases when extended into unrelated area

• Often score must decrease before increasing!

When to Stop?

• Consider current score, compared to maximum score so far

• When the current score gets sufficiently small relative to the maximum, then stop

• This is another parameter with a tradeoff (stop too soon and get smaller results, stop too late and do useless work)
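The stopping rule can be sketched for the simplest case: extending a seed to the right along its diagonal with no gaps. This is a simplification of the Smith-Waterman growth described earlier; the names `extend_right` and `drop_limit` and the scores (match +1, mismatch -1) are assumptions made for this illustration.

```python
def extend_right(a, b, i, j, k, drop_limit, match=1, mismatch=-1):
    """Extend a length-k seed at (i, j) to the right without gaps.

    Returns (best_score, best_length) of the extended alignment.
    """
    score = best = k * match          # the seed itself is an exact match
    length = best_length = k
    while i + length < len(a) and j + length < len(b):
        score += match if a[i + length] == b[j + length] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > drop_limit:
            break                     # fallen too far below the best score: stop
    return best, best_length


a = "ACGTTTGACCTTTT"
b = "ACGTAAGACCGGGG"
print(extend_right(a, b, 0, 0, 4, drop_limit=1))  # (4, 4): stopped too soon
print(extend_right(a, b, 0, 0, 4, drop_limit=2))  # (6, 10): crossed the dip to GACC
```

With the tighter limit the extension gives up during the two mismatches; with the looser one the score first decreases and then climbs past its earlier maximum.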

One more “trick”

• Suppose that there is a “standard” sequence that many people want to align against

• Run the seeding algorithm with different sequence lengths and thresholds and save the resulting seed locations

• When someone does a search, the seeding part has already been done

Offline vs. Online Algorithms

• Offline Algorithms
– Execute “standardized” part of algorithm in advance, and save result
– This is like compilation of a program

• Online Algorithm
– Use the tables or databases you built offline to answer a specific query
– This is like running a program
– User sees only time taken by Online Algorithm
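The split might look like this for seeding against a standard sequence. Both function names and the reference string are hypothetical, and only exact-match words are indexed, with no thresholds.

```python
from collections import defaultdict


def build_indexes(reference: str, word_lengths):
    """Offline step: index every word of each requested length once."""
    indexes = {}
    for k in word_lengths:
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        indexes[k] = dict(index)
    return indexes


def lookup_seeds(indexes, query: str, k: int):
    """Online step: find seeds for one query using only the saved index."""
    index = indexes[k]
    return [(i, j)
            for j in range(len(query) - k + 1)
            for i in index.get(query[j:j + k], [])]


saved = build_indexes("GATTACAGATTACA", word_lengths=(3, 4))  # done once, in advance
print(lookup_seeds(saved, "TTAC", 4))  # [(2, 0), (9, 0)]
print(lookup_seeds(saved, "CAG", 3))   # [(5, 0)]
```

The cost of scanning the reference is paid once in `build_indexes`; each later query only pays for dictionary lookups.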

Common Offline/Online Applications

• Web searching
– Offline: build indexes of sites vs. keywords
– Online: retrieve sites from the index

• Neural networks
– Offline: train the network on many examples of the problem, set the weights
– Online: run the network once (with fixed weights) on the specific example

Summary

• Smith-Waterman is exact, accurate, and time-consuming (even though it uses dynamic programming to get down to O(N^2))

• BLAST speeds up the search process, but is no longer exact, so it can miss good alignments (even the best one!)

Using BLAST Well

• Importance of setting parameters
– Sequence length
– Score threshold
– Distance (for two-hit method)
– Stopping condition (for growing seeded alignments)

Exercises

• Given the BLOSUM62 matrix at http://www.ncbi.nlm.nih.gov/Class/BLAST/BLOSUM62.txt
– What is the neighborhood of HID with threshold 5? 10? 15?

• Create two random sequences of 20 bases each (flip two coins for each base: HH=A, TT=T, HT=C, TH=G)
