Space-Efficient Sequence Alignment

Bioinformatics 202
University of California, San Diego
Lecture Notes No. 7
Dr. Pavel Pevzner (prepared by Iman Famili)



Outline

New computational ideas for sequence comparison:

• Divide-and-conquer technique
• Recursive programs
• Hash tables


Edit Graphs

• Sequence alignment finds similarities between two sequences.

• The alignment is done by constructing an “edit graph”. Every alignment corresponds to a path from the source to the sink, so finding the best alignment is a longest path problem.

• There are 3 types of edges in the edit graph: horizontal (H), diagonal (D), and vertical (V), corresponding to insertion (I), match/mismatch (M), and deletion (D), respectively.

• Every edge of the edit graph (i.e. every movement) has a weight corresponding to the penalty or premium for that action.

• The best path is the one with the maximum total weight (the longest path).

[Figure: Edit graph for the sequences TGCATA and ATCTGAT, with the source and sink marked. The legend distinguishes deletions, mismatches, insertions, and matches.]
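The longest-path view of alignment can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `longest_path_alignment` and the edge weights (match +1, mismatch −1, insertion/deletion −1) are assumptions chosen for the example. It fills the full matrix of path weights and then backtracks from the sink.

```python
def longest_path_alignment(v, w, match=1, mismatch=-1, indel=-1):
    """Longest source-to-sink path in the edit graph of v (rows) and w (columns).

    The edge weights are illustrative assumptions, not values from the notes.
    Returns (score, gapped v, gapped w).
    """
    n, m = len(v), len(w)
    # s[i][j] = weight of the longest path from the source (0, 0) to vertex (i, j)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel  # only vertical edges reach column 0
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel  # only horizontal edges reach row 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + diag,  # diagonal edge
                          s[i - 1][j] + indel,     # vertical edge
                          s[i][j - 1] + indel)     # horizontal edge

    # Backtrack from the sink (n, m) to the source, re-deriving which edge was used.
    i, j, top, bottom = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = match if v[i - 1] == w[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + diag:
            top.append(v[i - 1]); bottom.append(w[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + indel:
            top.append(v[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(w[j - 1]); j -= 1
    return s[n][m], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = longest_path_alignment("ATCTGAT", "TGCATA")
print(score)
print(top)
print(bottom)
```

Because the whole matrix `s` is stored, this version uses memory proportional to the number of vertices in the edit graph.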


Computational Complexity of Dynamic Programming

Sequence alignment is limited by:

• Time:
  – Four operations are needed at each vertex.
  – The required time is proportional to the number of edges in the edit graph, i.e. O(nm), where n and m are the sequence lengths.

• Space:
  – The required memory is proportional to the number of vertices in the edit graph, i.e. O(nm).


Computational Complexity of Dynamic Programming

  – To compute only the score of the alignment, we need to keep just 2 columns in memory at any time, because the score of each cell in the dynamic programming (DP) matrix depends only on three previously computed cells (above, left, and diagonal). Computing the score therefore requires only linear memory.

  – To recover the alignment itself (backtracking through the matrix), however, quadratic memory is needed, O(nm), or O(n²) for two sequences of length n, since all the scores are needed to trace back the best alignment.

[Figure: only 2 columns are needed to determine the score of each cell (forward calculation); all columns are needed for calculating the best alignment (backtracking).]
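The two-column idea can be sketched as follows. This is an illustrative example under assumed edge weights (match +1, mismatch −1, insertion/deletion −1); the function name `alignment_score_linear_space` is hypothetical. It returns only the score, since the information needed for backtracking is discarded as each column is overwritten.

```python
def alignment_score_linear_space(v, w, match=1, mismatch=-1, indel=-1):
    """Score of the best alignment of v and w, keeping only two columns in memory.

    The edge weights are illustrative assumptions, not values from the notes.
    """
    n = len(v)
    # prev holds column j-1 of the DP matrix; it starts as column 0.
    prev = [i * indel for i in range(n + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * indel] + [0] * n  # row 0 is reached by horizontal edges only
        for i in range(1, n + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(prev[i - 1] + diag,  # diagonal edge
                          prev[i] + indel,     # horizontal edge from column j-1
                          curr[i - 1] + indel) # vertical edge within column j
        prev = curr  # column j-1 is no longer needed
    return prev[n]


print(alignment_score_linear_space("ATCTGAT", "TGCATA"))
```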


Space-Efficient Sequence Alignment

To reduce the space complexity of sequence alignment:

• Find the middle vertex between the source and the sink. For every row i, compute the scores $s_{*,m/2}$ of the longest paths from (0,0) to (i,m/2) and the scores $s^{reverse}_{*,m/2}$ of the longest paths from (i,m/2) to (n,m). The middle vertex is the vertex (i,m/2) that maximizes $s_{i,m/2} + s^{reverse}_{i,m/2}$; a longest path from the source to the sink passes through it.

• Repeat this process recursively for the two smaller problems: from the source to the middle vertex, and from the middle vertex to the sink.

[Figure: the edit graph from the source (0,0) to the sink (n,m). The middle vertex (i, m/2) is found in column m/2; middle vertices are then found in the two resulting sub-rectangles, and so on.]


Space-Efficient Sequence Alignment

• The computing time is proportional to the area of the rectangles searched. The total time to find all the middle vertices is therefore: area + area/2 + area/4 + … ≤ 2 × area, so the running time remains O(nm).

• The space complexity is linear, O(n).

• Pseudocode for this algorithm is:

Path (source, sink)
  If source and sink are in consecutive columns
    output the longest path from the source to the sink
  Else
    middle ← middle vertex between source and sink
    Path (source, middle)
    Path (middle, sink)
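The recursion in the pseudocode can be sketched in Python as below. This is a hedged illustration rather than the lecture's implementation: the names `last_column` and `path` and the edge weights (match +1, mismatch −1, insertion/deletion −1) are assumptions, and the sketch slices strings for brevity, so it is not strictly linear in space the way an index-based version would be.

```python
def last_column(v, w, match=1, mismatch=-1, indel=-1):
    """Scores of the longest paths from (0, 0) to every vertex (i, len(w)), using two columns."""
    prev = [i * indel for i in range(len(v) + 1)]
    for j in range(1, len(w) + 1):
        curr = [j * indel] + [0] * len(v)
        for i in range(1, len(v) + 1):
            diag = match if v[i - 1] == w[j - 1] else mismatch
            curr[i] = max(prev[i - 1] + diag, prev[i] + indel, curr[i - 1] + indel)
        prev = curr
    return prev


def path(v, w, match=1, mismatch=-1, indel=-1):
    """Return a best alignment of v (rows) and w (columns) as two gapped strings."""
    n, m = len(v), len(w)
    if n == 0:
        return "-" * m, w
    if m == 0:
        return v, "-" * n
    if m == 1:
        # Source and sink are in consecutive columns: either pair w with the
        # best character of v (one diagonal edge) or leave w opposite a gap.
        best = max(range(n), key=lambda r: match if v[r] == w else mismatch)
        diag = match if v[best] == w else mismatch
        if diag >= 2 * indel:
            return v, "-" * best + w + "-" * (n - 1 - best)
        return v + "-", "-" * n + w
    mid = m // 2
    # Longest paths from the source to column mid, and (on reversed strings)
    # from column mid to the sink.
    forward = last_column(v, w[:mid], match, mismatch, indel)
    backward = last_column(v[::-1], w[mid:][::-1], match, mismatch, indel)
    # The middle vertex is the row that maximizes the sum of both scores.
    i = max(range(n + 1), key=lambda r: forward[r] + backward[n - r])
    left = path(v[:i], w[:mid], match, mismatch, indel)
    right = path(v[i:], w[mid:], match, mismatch, indel)
    return left[0] + right[0], left[1] + right[1]


top, bottom = path("ATCTGAT", "TGCATA")
print(top)
print(bottom)
```

Each call to `last_column` keeps only two columns, which is what makes the middle vertex cheap to find in memory; the alignment itself is assembled from the pieces returned by the two recursive calls.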


String Matching: naïve approach

Let’s say we want to compare a query sequence of length 10 against a database of length, for example, 10^9, and we want to find exact occurrences of the query in the database. We can:

1. Move along the database one base at a time and compare the query with the database at each position (this takes a long time):

[Figure: a query of length 10 sliding along a database of length 10^9.]

In effect, we move along the diagonals of the alignment between the query and the database.
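A minimal sketch of the naïve scan, assuming the query and the database are plain Python strings; the function name `naive_find` is hypothetical. Every start position in the database is tried, so the work grows with the product of the two lengths.

```python
def naive_find(query, database):
    """Return every start position where query occurs exactly in database."""
    hits = []
    # Slide the query along the database one base at a time.
    for start in range(len(database) - len(query) + 1):
        if database[start:start + len(query)] == query:
            hits.append(start)
    return hits


print(naive_find("ATA", "GATATATC"))
```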


String Matching: hashing

2. Create a hash table of all the substrings of the database that have the same length as the query, and look up the query in the hash table.
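The hashing approach might look like the sketch below, which uses a Python dictionary as the hash table. The names `build_index` and `lookup` are hypothetical, and storing a list of start positions per substring is a design choice made for this example.

```python
from collections import defaultdict


def build_index(database, length):
    """Map every substring of the given length to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(database) - length + 1):
        index[database[start:start + length]].append(start)
    return index


def lookup(index, query):
    """Return the start positions of query, or an empty list if it is absent."""
    return index.get(query, [])


index = build_index("GATATATC", 3)
print(lookup(index, "ATA"))
print(lookup(index, "CCC"))
```

Building the table takes one pass over the database, after which each query is answered by a single lookup instead of a full scan.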


Approximate String Matching

• Now, if the query has length 1000 instead of 10, we can apply the same method by dividing the query into overlapping substrings, each 10 bases long, and combining the resulting matches.

• String matching in this fashion may be done using filtration/verification algorithms, which are described next.


Filtration/Verification Method

• Let’s say we want to find a string in a database with up to 2 mismatches or, in general, match a query of length p against a text of length n with up to k mismatches.

• The query matching problem is to find all pairs of substrings of a given length l, one from the query and one from the text, that match with at most k mismatches. Filtration/verification algorithms are used to perform this task.

• Filtration/verification algorithms involve a two-stage process. First, a set of positions in the text that are potentially similar to the query is preselected. Second, each potential position is verified: it is accepted if it has at most k mismatches and rejected if it has more than k.

[Figure: from a potential match, walk in both directions while the number of mismatches is at most k.]


Filtration/Verification Method

• The filtration/verification algorithm is done in 2 steps:

1. Potential match detection: find all exact matches of s-tuples between the query and the text, for s = ⌊l/(k+1)⌋, where l is the length of the substrings being matched and k is the maximum number of mismatches allowed. Such matches are sparse: they rarely happen by chance.

2. Potential match verification: verify each potential match by extending it to the left and to the right until either (i) the first k+1 mismatches are found or (ii) the beginning or end of the query or the text is reached.

• This is the idea behind BLAST and FASTA.
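The two steps can be sketched for a simplified case: finding where a whole query occurs in the text with at most k mismatches. This is an assumption-laden illustration, not the lecture's algorithm or the BLAST/FASTA implementations. The function name `approximate_matches` is hypothetical, the query length plays the role of the substring length, and verification counts mismatches across the aligned window rather than extending left and right. The seed length `len(query) // (k + 1)` follows from the pigeonhole principle: k mismatches cannot touch all k + 1 disjoint blocks of the query.

```python
from collections import defaultdict


def approximate_matches(query, text, k):
    """Start positions in text where query matches with at most k mismatches."""
    length = len(query)
    seed = length // (k + 1)  # seed length from the pigeonhole argument
    if seed == 0:
        raise ValueError("query is too short for this many mismatches")

    # Step 1, potential match detection: index every seed-length tuple of the
    # text, then look up k + 1 disjoint blocks of the query.
    index = defaultdict(list)
    for j in range(len(text) - seed + 1):
        index[text[j:j + seed]].append(j)
    candidates = set()
    for block in range(k + 1):
        offset = block * seed
        for j in index.get(query[offset:offset + seed], []):
            start = j - offset  # where the whole query would begin
            if 0 <= start <= len(text) - length:
                candidates.add(start)

    # Step 2, potential match verification: count mismatches at each candidate
    # and stop as soon as mismatch number k + 1 is found.
    hits = []
    for start in sorted(candidates):
        mismatches = 0
        for a, b in zip(query, text[start:start + length]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(start)
    return hits


print(approximate_matches("GATTACA", "CCGATCACATTGATTACA", 1))
```

Only positions that share an exact seed with the query are verified, which is why the filtration step pays off when exact seed matches are rare.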