Pairwise Sequence Alignment (I)

Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm


Page 1: Pairwise Sequence Alignment (I)

Pairwise Sequence Alignment (I)

(Lecture for CS498-CXZ Algorithms in Bioinformatics)

Sept. 22, 2005

ChengXiang Zhai

Department of Computer Science

University of Illinois, Urbana-Champaign

Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm

Page 2: Pairwise Sequence Alignment (I)

Comparing Genes in Two Genomes

• Small islands of similarity corresponding to similarities between exons

• Such comparisons are quite common in biology research

Page 3: Pairwise Sequence Alignment (I)

Alignment of sequences is one of the most basic and most important problems in bioinformatics…

Page 4: Pairwise Sequence Alignment (I)

Outline

• Defining the problem of alignment

• The longest common subsequence problem

• Dynamic programming algorithms for alignment

Page 5: Pairwise Sequence Alignment (I)

Aligning Two Strings

Given the strings:

• v = ATGTTAT

• w = ATCGTAC

One possible alignment of the strings:

AT_GTTAT_

ATCGT_A_C

1st row – string v with space symbols “_” inserted

2nd row – string w with space symbols “_” inserted

Page 6: Pairwise Sequence Alignment (I)

Aligning Two Strings (cont’d)

Another way to represent each row shows the number of symbols of the sequence present up to a given position. For example the above sequences can be represented as:

v:    A T _ G T T A T _
    0 1 2 2 3 4 5 6 7 7

w:    A T C G T _ A _ C
    0 1 2 3 4 5 5 6 6 7
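The counting rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `prefix_counts` and the `gap` parameter are made up for this example and do not come from the lecture.

```python
def prefix_counts(row, gap="_"):
    """Return the number of non-gap symbols seen up to each position.

    The list starts with 0 (nothing consumed yet) and has one more
    entry than the alignment row has columns.
    """
    counts = [0]
    for symbol in row:
        # A real symbol advances the count by one; a gap leaves it unchanged.
        counts.append(counts[-1] + (1 if symbol != gap else 0))
    return counts


print(prefix_counts("AT_GTTAT_"))  # [0, 1, 2, 2, 3, 4, 5, 6, 7, 7]
print(prefix_counts("ATCGT_A_C"))  # [0, 1, 2, 3, 4, 5, 5, 6, 6, 7]
```

The two printed lists are the two number rows shown on this slide.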

Page 7: Pairwise Sequence Alignment (I)

Alignment Matrix

Both rows of the alignment can be represented in the resulting matrix:

    A T _ G T T A T _
    A T C G T _ A _ C

  0 1 2 2 3 4 5 6 7 7   (symbols of v used so far)
  0 1 2 3 4 5 5 6 6 7   (symbols of w used so far)

Page 8: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

  0 1 2 2 3 4 5 6 7 7
    A T _ G T T A T _
    A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7

Path so far: (0,0), (1,1)

Page 9: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

  0 1 2 2 3 4 5 6 7 7
    A T _ G T T A T _
    A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7

Path so far: (0,0), (1,1), (2,2)

Page 10: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

  0 1 2 2 3 4 5 6 7 7
    A T _ G T T A T _
    A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7

Path so far: (0,0), (1,1), (2,2), (2,3), (3,4)

Page 11: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

  0 1 2 2 3 4 5 6 7 7
    A T _ G T T A T _
    A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7

Path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

- End Result -
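Reading the path off an alignment is mechanical, and a short Python sketch may make that concrete. The name `alignment_to_path` and the `gap` parameter are hypothetical, chosen only for this illustration.

```python
def alignment_to_path(row_v, row_w, gap="_"):
    """Convert two equal-length alignment rows into edit-graph vertices (i, j).

    i counts the symbols of v consumed so far, j the symbols of w.
    """
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_v, row_w):
        if a != gap:
            i += 1  # this column consumes a symbol of v
        if b != gap:
            j += 1  # this column consumes a symbol of w
        path.append((i, j))
    return path


print(alignment_to_path("AT_GTTAT_", "ATCGT_A_C"))
# [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
```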

Page 12: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

Every path in the edit graph corresponds to an alignment:
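As a rough illustration of this correspondence, the sketch below turns a path back into an alignment. It assumes each step moves by (1,1), (1,0) or (0,1); the name `path_to_alignment` is invented for this example.

```python
def path_to_alignment(v, w, path, gap="_"):
    """Rebuild the two alignment rows from a list of edit-graph vertices."""
    row_v, row_w = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):    # diagonal edge: v[i0] is aligned with w[j0]
            row_v.append(v[i0])
            row_w.append(w[j0])
        elif step == (1, 0):  # only a symbol of v is consumed: gap in w
            row_v.append(v[i0])
            row_w.append(gap)
        elif step == (0, 1):  # only a symbol of w is consumed: gap in v
            row_v.append(gap)
            row_w.append(w[j0])
        else:
            raise ValueError(f"not an edit-graph edge: {step}")
    return "".join(row_v), "".join(row_w)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4),
        (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
print(path_to_alignment("ATGTTAT", "ATCGTAC", path))
# ('AT_GTTAT_', 'ATCGT_A_C')
```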

Page 13: Pairwise Sequence Alignment (I)

How to Score an Alignment?

• Simplest

– Every match scores 1

– Every mismatch scores 0

– An alignment is scored based on the number of common symbols

– Leads to the longest common subsequence problem

• More sophisticated

– ?

– ?

– To be covered later

Page 14: Pairwise Sequence Alignment (I)

Alignments in Edit Graph (cont’d)

• Horizontal (→) and vertical (↓) edges represent indels in v and w. Score 0.

• Diagonal (↘) edges represent exact matches. Score 1.

Page 15: Pairwise Sequence Alignment (I)

Alignments in Edit Graph (cont’d)

The score of the alignment path in the graph is 5.
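The score can be checked by counting matching columns. The sketch below assumes the simple scheme from the earlier slide (match 1, everything else 0); the name `simple_score` is hypothetical.

```python
def simple_score(row_v, row_w, gap="_"):
    """Count the columns in which both rows hold the same non-gap symbol."""
    return sum(1 for a, b in zip(row_v, row_w) if a == b and a != gap)


print(simple_score("AT_GTTAT_", "ATCGT_A_C"))  # 5
```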

Page 16: Pairwise Sequence Alignment (I)

The Longest Common Subsequence (LCS) Problem

• Find the longest subsequence common to two strings.

Input: Two strings, v and w.

Output: The longest common subsequence of v and w.

A subsequence is not necessarily consecutive

v = ATGTTAT, w = ATCGTAC

v = AT_GTTAT
    || || |
w = ATCGT_AC

Longest common subsequence: “ATGTA”, which corresponds to the best alignment

Page 17: Pairwise Sequence Alignment (I)

How to solve the LCS problem efficiently?

Page 18: Pairwise Sequence Alignment (I)

Brute Force Approach

• Enumerate all the sequences up to length min(|v|,|w|)

• For each one, check to see if it is a subsequence of v and w

• Very expensive. (How many sequences do we have to enumerate?)
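To get a feel for the cost, here is a minimal brute-force sketch. It assumes a DNA alphabet of four letters and tries candidates from longest to shortest so it can stop at the first hit. The names `is_subsequence`, `brute_force_lcs` and `candidate_count` are made up for this illustration.

```python
from itertools import product


def is_subsequence(candidate, text):
    """True if candidate can be obtained from text by deleting symbols."""
    remaining = iter(text)
    # Each membership test consumes the iterator up to the matched symbol,
    # so the symbols of candidate must appear in text in the same order.
    return all(symbol in remaining for symbol in candidate)


def brute_force_lcs(v, w, alphabet="ACGT"):
    """Enumerate candidate strings, longest first, and return the first
    one that is a subsequence of both v and w."""
    for length in range(min(len(v), len(w)), 0, -1):
        for letters in product(alphabet, repeat=length):
            candidate = "".join(letters)
            if is_subsequence(candidate, v) and is_subsequence(candidate, w):
                return candidate
    return ""


def candidate_count(max_length, alphabet_size=4):
    """Number of non-empty strings of length at most max_length."""
    return sum(alphabet_size ** k for k in range(1, max_length + 1))


print(brute_force_lcs("ATGTTAT", "ATCGTAC"))  # ATGTA
print(candidate_count(7))                     # 21844
```

The number of candidates grows as 4^k with the length k, so this approach is only workable for very short strings.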

Page 19: Pairwise Sequence Alignment (I)

The Idea of Dynamic Programming

• Think of an alignment as a path in an edit graph

• We only need to keep track of the best alignment (i.e., the longest common subsequence)

• Score a longer alignment based on shorter alignments

Page 20: Pairwise Sequence Alignment (I)

Alignment as a Path in the Edit Graph

    0 1 2 2 3 4 5 6 7 7
v =   A T _ G T T A T _
w =   A T C G T _ A _ C
    0 1 2 3 4 5 5 6 6 7

Path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Use each cell to store the best alignment so far…

Page 21: Pairwise Sequence Alignment (I)

Alignment: Dynamic Programming

Use this scoring algorithm

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

Page 22: Pairwise Sequence Alignment (I)

Dynamic Programming Example

• There are no matches in the beginning of the sequence

• Set the first row (i = 0) and the first column (j = 0) to all zeros

Page 23: Pairwise Sequence Alignment (I)

Dynamic Programming Example

\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{value from NW, plus 1, if } v_i = w_j \\
s_{i-1,j} & \text{value from North (top)} \\
s_{i,j-1} & \text{value from West (left)}
\end{cases}
\]

Keep track of the best alignment score and the path contributing to it

Page 24: Pairwise Sequence Alignment (I)

Alignment: Backtracking

Arrows show where the score originated from:

↑ if from the top

← if from the left

↖ if vi = wj (from the upper left)

Page 25: Pairwise Sequence Alignment (I)

Dynamic Programming Example

Continuing with the scoring algorithm gives this result.

Page 26: Pairwise Sequence Alignment (I)

LCS Algorithm

LCS(v, w)    (n = |v|, m = |w|)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 0 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j},
                      s_{i,j-1},
                      s_{i-1,j-1} + 1, if v_i = w_j }
      b_{i,j} ← “↑” if s_{i,j} = s_{i-1,j}
                “←” if s_{i,j} = s_{i,j-1}
                “↖” if s_{i,j} = s_{i-1,j-1} + 1
  return (s_{n,m}, b)
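The pseudocode translates fairly directly into Python. The sketch below is one possible rendering, not the lecture's own code. It assumes 0-based Python strings, so v_i becomes `v[i - 1]`, and it records the backtracking arrows as the strings "↑", "←" and "↖". The name `lcs_table` is hypothetical.

```python
def lcs_table(v, w):
    """Fill the score table s and the backtracking table b for LCS."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at zero: an empty prefix has no matches.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
            # Record where the score came from; ties go to the top,
            # then the left, in the same order as the pseudocode.
            if best == s[i - 1][j]:
                b[i][j] = "↑"
            elif best == s[i][j - 1]:
                b[i][j] = "←"
            else:
                b[i][j] = "↖"
    return s, b


s, b = lcs_table("ATGTTAT", "ATCGTAC")
print(s[-1][-1])  # 5
```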

Page 27: Pairwise Sequence Alignment (I)

Now What?

• LCS(v,w) created the alignment grid

• Now we need a way to read the best alignment of v and w

• Follow the arrows backwards from the (|v|,|w|) cell
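One way to sketch the backward walk in Python is shown below. As a simplification, it does not store the arrows; it re-derives each arrow from the score table by comparing a cell with its top and left neighbours. The name `lcs_with_alignment` is hypothetical, and when several best alignments exist the tie-breaking here (top, then left, then diagonal) picks just one of them.

```python
def lcs_with_alignment(v, w, gap="_"):
    """Return (lcs, row_v, row_w) for one best alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)

    # Walk back from the (|v|, |w|) cell to (0, 0).
    i, j = n, m
    common, row_v, row_w = [], [], []
    while i > 0 or j > 0:
        if i > 0 and s[i][j] == s[i - 1][j]:    # score came from the top
            row_v.append(v[i - 1])
            row_w.append(gap)
            i -= 1
        elif j > 0 and s[i][j] == s[i][j - 1]:  # score came from the left
            row_v.append(gap)
            row_w.append(w[j - 1])
            j -= 1
        else:                                   # diagonal step: a match
            common.append(v[i - 1])
            row_v.append(v[i - 1])
            row_w.append(w[j - 1])
            i -= 1
            j -= 1
    # Everything was collected back to front, so reverse it.
    return ("".join(reversed(common)),
            "".join(reversed(row_v)),
            "".join(reversed(row_w)))


lcs, row_v, row_w = lcs_with_alignment("ATGTTAT", "ATCGTAC")
print(lcs)  # ATGTA
print(row_v)
print(row_w)
```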

Page 28: Pairwise Sequence Alignment (I)

LCS Runtime

• Creating the n×m matrix of best scores from vertex (0,0) to all other vertices takes O(nm) time.

• Why O(nm)? The pseudocode fills the matrix with one “for” loop nested inside another, and each cell is computed in constant time.

Page 29: Pairwise Sequence Alignment (I)

How do we improve the scoring of alignments?

Can we still find an alignment efficiently?

We’ll talk about these later…

Page 30: Pairwise Sequence Alignment (I)

The LCS Recurrence Revisited

• The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for indels is 0:

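\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1, & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
\]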
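Written this way, the two scores become explicit parameters. The sketch below only illustrates that reading of the formula. The names `alignment_score`, `match_score` and `indel_score` are assumptions made for this example, and the defaults (1 and 0) reproduce the LCS recurrence above. Accumulating `indel_score` along the borders is also an assumption; with the default of 0 the borders stay all zero, as in the LCS algorithm.

```python
def alignment_score(v, w, match_score=1, indel_score=0):
    """Best score under the rewritten recurrence, with both scores
    exposed as parameters (defaults give the LCS length)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Border cells can only be reached by indels.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel_score
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [s[i - 1][j] + indel_score,   # from the top
                          s[i][j - 1] + indel_score]   # from the left
            if v[i - 1] == w[j - 1]:
                candidates.append(s[i - 1][j - 1] + match_score)
            s[i][j] = max(candidates)
    return s[n][m]


print(alignment_score("ATGTTAT", "ATCGTAC"))  # 5
```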

Page 31: Pairwise Sequence Alignment (I)

What You Should Know

• How an alignment corresponds to a path in an edit graph

• How the LCS problem corresponds to alignment with a simple scoring method

• How the dynamic programming algorithm solves the LCS problem (= simple alignment)