Comp. Genomics

Comp. Genomics Recitation 1


Page 1: Comp. Genomics

Comp. Genomics

Recitation 1

Page 2: Comp. Genomics

Outline

• Sequence alignment
• End-space free alignment
• Alignment with gaps

Page 3: Comp. Genomics

Alignment basic step

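The last column of an alignment of x[1..i] with y[1..j] (here xi = G, yj = C) is one of three cases:

• xi aligned with yj: column G / C
• xi aligned with a gap: column G / -
• A gap aligned with yj: column - / C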

Page 4: Comp. Genomics

Global alignment

• All of x has to be aligned with all of y

• Therefore, every gap is “paid for”
• The solution score is found in one cell

[Figure: DP matrix with labels “Alignment score here” (bottom-right cell) and “Traceback all the way” (back to the top-left cell)]

Page 5: Comp. Genomics

Global alignment

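• Input: Sequences x, y
• Output: Maximum score alignment
• F(i,j) – score of aligning x[1..i] with y[1..j]
• σ(a,b) – score of aligning character a with character b; σ(a,-) and σ(-,b) are gap scores
• Base conditions:

$$F(i,0) = \sum_{k=1}^{i} \sigma(x_k, -), \qquad F(0,j) = \sum_{k=1}^{j} \sigma(-, y_k)$$

• Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$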
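The global alignment recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function name `global_align` and the simple scoring (one score for a match, one for a mismatch, a fixed per-character gap score) are assumptions made for the example.

```python
def global_align(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment with a fixed per-character gap score (illustrative scoring)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    # Base conditions: a prefix aligned against nothing pays for every gap.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Traceback all the way from the bottom-right cell to the top-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = global_align("ACGT", "AGT")
print(score, ax, ay)
```

With the default scores, aligning ACGT with AGT needs one gap, so the best score is three matches plus one gap.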

Page 6: Comp. Genomics

Local alignment

• Local alignment: a substring of x aligned with a substring of y
• Gaps outside the substrings are “costless”
• Solution equals the maximum score cell in the DP matrix
• Base conditions:

$$F(i,0) = 0, \qquad F(0,j) = 0$$

• Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \\ 0 \end{cases}$$
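A minimal Python sketch of the local alignment recurrence follows. The function name `local_align` and the match/mismatch/gap values are assumptions chosen for illustration; a real protein alignment would use a substitution matrix such as BLOSUM50 instead.

```python
def local_align(x, y, match=2, mismatch=-1, gap=-2):
    """Local alignment: returns the best score and the two aligned substrings."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the maximum cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("xxxACGTyyy", "zzACGTww"))
```

The only change from the global version is the extra 0 option, the zero borders, and starting the traceback at the maximum cell rather than the corner.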

Page 7: Comp. Genomics

Local alignment example

         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0   0   0   0   0   0   0   0   0   0   0
A    0   0   0   5   0   5   0   0   0   0   0
W    0   0   0   0   2   0  20  12   4   0   0
H    0  10   2   0   0   0  12  18  22  14   6
E    0   2  16   8   0   0   4  10  18  28  20
A    0   0   8  21  13   5   0   4  10  20  27
E    0   0   6  13  18  12   4   0   4  16  26

Scoring: match and mismatch scores from BLOSUM50; gap: -8

Best local alignment (score 28, the maximum cell):

AWGHE
AW-HE

Page 8: Comp. Genomics

Overlap matches (end space free alignment)

• Something between global and local
• Consider aligning a gene x to a (bacterial) genome y
• Gaps in the beginning and end of x and y are costless
• But all of x should be aligned
• Base conditions:

$$F(i,0) = 0, \qquad F(0,j) = 0$$

• Recurrence relation, for 1 ≤ i ≤ n, 1 ≤ j ≤ m:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(i-1,j) + \sigma(x_i, -) \\ F(i,j-1) + \sigma(-, y_j) \end{cases}$$

• The optimal solution is found in the last row or last column (not necessarily at the bottom right corner)
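The end-space free variant differs from global alignment only in the base conditions and in where the answer is read. The following is a small sketch under assumed scores; `overlap_score` is a hypothetical name, and only the score (no traceback) is computed.

```python
def overlap_score(x, y, match=1, mismatch=-1, gap=-2):
    """End-space free alignment score: leading and trailing gaps are free."""
    n, m = len(x), len(y)
    # Zero borders: skipping a prefix of x or of y costs nothing.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # The answer is the best cell in the last row or the last column.
    best_last_row = max(F[n])
    best_last_col = max(F[i][m] for i in range(n + 1))
    return max(best_last_row, best_last_col)


# The suffix CGT of the first string overlaps the prefix CGT of the second.
print(overlap_score("AAACGT", "CGTTTT"))
```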

Page 9: Comp. Genomics

Handling weird gaps

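• Affine gap: different costs for opening a “new” gap and for extending an “old” one
• When the last column of the alignment of x[1..i] with y[1..j] (xi = G, yj = C) is a gap column (G / - or - / C), we now care whether the preceding column was also a gap
• Two new things to keep track of, hence two additional matrices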

Page 10: Comp. Genomics

Alignment with Affine Gap Penalty

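Three matrices, one for each type of last column (Wg is the gap opening score, Ws the gap extension score; Wg, Ws < 0):

• M(i,j): x[1..i] aligned with y[1..j], with xi aligned with yj
• Ix(i,j): x[1..i] aligned with y[1..j], with xi aligned with a gap
• Iy(i,j): x[1..i] aligned with y[1..j], with yj aligned with a gap

Base conditions:

$$M(0,0) = 0, \qquad M(i,0) = I_x(i,0) = W_g + iW_s, \qquad M(0,j) = I_y(0,j) = W_g + jW_s$$

Recursive computation:

$$M(i,j) = \max \begin{cases} M(i-1,j-1) + \sigma(x_i, y_j) \\ I_x(i-1,j-1) + \sigma(x_i, y_j) \\ I_y(i-1,j-1) + \sigma(x_i, y_j) \end{cases}$$

$$I_x(i,j) = \max \begin{cases} M(i-1,j) + W_g + W_s \\ I_x(i-1,j) + W_s \end{cases}$$

and symmetrically

$$I_y(i,j) = \max \begin{cases} M(i,j-1) + W_g + W_s \\ I_y(i,j-1) + W_s \end{cases}$$

The optimal solution is the maximum of the relevant cells in the three matrices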
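The three-matrix computation can be sketched as follows. This is an illustrative version, not the slide's exact formulation: the function name and default scores are assumptions, and border states that cannot occur (for example M(i,0) for i > 0) are set to minus infinity, a common variant of the base conditions.

```python
import math


def affine_global_score(x, y, match=1.0, mismatch=-1.0, w_g=-2.0, w_s=-1.0):
    """Global alignment score where a gap of length L costs w_g + L * w_s."""
    n, m = len(x), len(y)
    neg = -math.inf
    M = [[neg] * (m + 1) for _ in range(n + 1)]   # last column: x_i with y_j
    Ix = [[neg] * (m + 1) for _ in range(n + 1)]  # last column: x_i with a gap
    Iy = [[neg] * (m + 1) for _ in range(n + 1)]  # last column: a gap with y_j
    M[0][0] = 0.0
    for i in range(1, n + 1):
        Ix[i][0] = w_g + i * w_s
    for j in range(1, m + 1):
        Iy[0][j] = w_g + j * w_s
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Open a new gap from M, or extend the gap already open.
            Ix[i][j] = max(M[i - 1][j] + w_g + w_s, Ix[i - 1][j] + w_s)
            Iy[i][j] = max(M[i][j - 1] + w_g + w_s, Iy[i][j - 1] + w_s)
    return max(M[n][m], Ix[n][m], Iy[n][m])


# Six matches and one gap of length 3: 6 + (-2 + 3 * -1) = 1
print(affine_global_score("AAACCCGGG", "AAAGGG"))
```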

Page 11: Comp. Genomics

When do constant and affine gap costs differ?

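• Consider aligning AGAGACTGACGCTTA with ATATTA

Scoring schemes (a match scores 0 in both):

• Constant penalty: mismatch -5, gap -1 per gap character
• Affine penalty: mismatch -5, gap open -3, gap extend -0.5 (a gap of length L costs -3 - 0.5(L-1))

Alignment 1 (four short gaps):

AGAGACTGACGCTTA
----A-T-A---TTA

Constant penalty: -9. Affine penalty: -14.5.

Alignment 2 (one mismatch, one long gap):

AGAGACTGACGCTTA
ATA---------TTA

Constant penalty: -14. Affine penalty: -12.

The constant penalty prefers alignment 1, while the affine penalty prefers alignment 2.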
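The four scores in this example can be checked by scoring each alignment column by column. The sketch below assumes that a match scores 0 and that the first character of a gap pays the open score while each further character pays the extend score; `score_alignment` is a hypothetical helper name.

```python
def score_alignment(top, bottom, mismatch, gap_open, gap_extend, match=0.0):
    """Score a given alignment; the first gap character pays gap_open, later ones gap_extend."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0.0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            total += gap_extend if row == prev_gap else gap_open
            prev_gap = row
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


x = "AGAGACTGACGCTTA"
for y in ("----A-T-A---TTA", "ATA---------TTA"):
    constant = score_alignment(x, y, mismatch=-5, gap_open=-1, gap_extend=-1)
    affine = score_alignment(x, y, mismatch=-5, gap_open=-3, gap_extend=-0.5)
    print(y, constant, affine)
```

A constant penalty is the special case where the open and extend scores are equal.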

Page 12: Comp. Genomics

Question

• Given two sequences x and y, the fragmentation number of x,y is the minimal k such that x and y can be broken into substrings x1,x2,...,xk ; y1,y2,...,yk and every xi is a substring of the corresponding yi

• Suggest an algorithm for finding the fragmentation number of two sequences

Page 13: Comp. Genomics

Solution

• Global alignment with the following modifications:
  • No penalty for gaps at the ends of y
  • Gaps are only allowed in x (characters of x may not be skipped)
  • Mismatches are not allowed (score -∞)
  • Affine gap score, with open cost 1 and extension cost 0

Page 14: Comp. Genomics

Question

• How do we align two sequences with a bound k on the maximal number of gaps?

• Analyze the complexity

Page 15: Comp. Genomics

Solution

We will divide every cell in the alignment matrix into 2k sub-cells. The meaning of a sub-cell is as follows:

k cells with superscript 1:

$$M^1_l(i,j) = \text{cost of the optimal alignment with } l \text{ gaps opened and } l \text{ gaps closed}$$

k cells with superscript 2:

$$M^2_l(i,j) = \text{cost of the optimal alignment with } l \text{ gaps opened and } l-1 \text{ gaps closed}$$

Page 16: Comp. Genomics

Solution

• The update rule for sub-cells with superscript 1:

$$M^1_l(i,j) = \max \begin{cases} M^1_l(i-1,j-1) + \sigma(x_i, y_j) \\ M^2_l(i-1,j-1) + \sigma(x_i, y_j) \end{cases}$$

• The update rule for sub-cells with superscript 2:

$$M^2_l(i,j) = \max \begin{cases} M^2_l(i,j-1) + \sigma(-, y_j) \\ M^2_l(i-1,j) + \sigma(x_i, -) \\ M^1_{l-1}(i-1,j) + \sigma(x_i, -) \\ M^1_{l-1}(i,j-1) + \sigma(-, y_j) \end{cases}$$
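One way to read the sub-cell scheme is as two tables per gap count l: a "closed" table for alignments whose last column is not a gap, and an "open" table for alignments whose last column is a gap. The sketch below follows that reading under assumed scores; the names `bounded_gap_score`, `closed` and `opened` are illustrative, and it fills O(k·n·m) cells with constant work per cell.

```python
import math


def bounded_gap_score(x, y, k, match=1, mismatch=-1, gap=-2):
    """Best global alignment score using at most k gaps (maximal runs of gap columns)."""
    n, m = len(x), len(y)
    neg = -math.inf
    # closed[l][i][j]: l gaps opened, all closed (last column is not a gap).
    # opened[l][i][j]: l gaps opened, the last one still open (last column is a gap).
    closed = [[[neg] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    opened = [[[neg] * (m + 1) for _ in range(n + 1)] for _ in range(k + 1)]
    closed[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for l in range(k + 1):
                if i > 0 and j > 0:
                    s = match if x[i - 1] == y[j - 1] else mismatch
                    closed[l][i][j] = max(closed[l][i - 1][j - 1],
                                          opened[l][i - 1][j - 1]) + s
                if l > 0:
                    best = neg
                    if i > 0:  # extend the open gap, or open gap number l
                        best = max(best, opened[l][i - 1][j], closed[l - 1][i - 1][j])
                    if j > 0:
                        best = max(best, opened[l][i][j - 1], closed[l - 1][i][j - 1])
                    opened[l][i][j] = best + gap
    return max(max(closed[l][n][m], opened[l][n][m]) for l in range(k + 1))


print(bounded_gap_score("ACGT", "AGT", 0))  # no gap allowed: no alignment exists
print(bounded_gap_score("ACGT", "AGT", 1))
```

As in the two-state update rules, a run of gap columns that switches from one sequence to the other is counted as a single gap here.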

Page 17: Comp. Genomics

What about arbitrary gap functions?

• If the gap cost is an arbitrary function of its length, γ(k)

• When computing Mij, we need to look at all possible gap lengths “back” along row i and column j

Page 18: Comp. Genomics

Alignment with arbitrary gap functions

Recursive Computation:

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + \sigma(x_i, y_j) \\ F(k,j) + \gamma(i-k), & k = 0, \ldots, i-1 \\ F(i,k) + \gamma(j-k), & k = 0, \ldots, j-1 \end{cases}$$
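A direct, cubic-time sketch of this recurrence is shown below. The gap function is passed in as a Python callable `gamma`; that parameter, the function name and the match/mismatch scores are assumptions for illustration.

```python
import math


def general_gap_score(x, y, gamma, match=1, mismatch=-1):
    """Global alignment score where a gap of length L scores gamma(L)."""
    n, m = len(x), len(y)
    neg = -math.inf
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = F[i - 1][j - 1] + s
            for k in range(i):  # gap of length i - k ending at x_i
                best = max(best, F[k][j] + gamma(i - k))
            for k in range(j):  # gap of length j - k ending at y_j
                best = max(best, F[i][k] + gamma(j - k))
            F[i][j] = best
    return F[n][m]


# A linear gamma reproduces ordinary global alignment.
print(general_gap_score("ACGT", "AGT", lambda length: -2 * length))
# A gap score that ignores length: six matches and one gap.
print(general_gap_score("AAACCCGGG", "AAAGGG", lambda length: -3))
```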

Page 19: Comp. Genomics

Complexity

Suppose the two sequences are of length n.

$$T(n) = \sum_{i=0}^{n} \sum_{j=0}^{n} (i + j + 1) = (n+1) \sum_{i=0}^{n} i + (n+1) \sum_{j=0}^{n} j + (n+1)^2 = 2(n+1) \frac{n(n+1)}{2} + (n+1)^2 = O(n^3)$$

Page 20: Comp. Genomics

LCS

• Longest common non-contiguous subsequence:
• Use global alignment with similarity scores
  • +1 for match
  • 0 for indel
  • -∞ for mismatches
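The scoring scheme above turns the global alignment table directly into an LCS table. A minimal sketch (the function name `lcs` is an assumption) that also recovers one longest common subsequence by traceback:

```python
import math


def lcs(a, b):
    """Longest common subsequence via global alignment: +1 match, 0 indel, -inf mismatch."""
    n, m = len(a), len(b)
    # Indels score 0, so the first row and column are all 0.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 if a[i - 1] == b[j - 1] else -math.inf
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j], F[i][j - 1])
    # Traceback: matched columns spell out the subsequence.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and F[i][j] == F[i - 1][j - 1] + 1:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("PRIDE", "PARADE"))
```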

Page 21: Comp. Genomics

Exercise: Shortest common supersequence

• A is called a non-contiguous supersequence of B if B is a non-contiguous subsequence of A.

• e.g., YABADABADU is a non-contiguous supersequence of BABU (yaBAdaBadU, with the letters of BABU capitalized)

• Problem: Given A and B, find their shortest common supersequence

Page 22: Comp. Genomics

Solution

• For A=“PRIDE”, B=“PARADE”:
• Compute LCS using global alignment:

A=P-R-IDE
B=PARA-DE

• PARAIDE – shortest common supersequence
• Notice that PRDE is the longest common subsequence of A and B.
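Reading the alignment column by column, and emitting the shared character for a match and the single non-gap character otherwise, gives the supersequence. The sketch below does this with a standard LCS table; the function names are illustrative, and when several shortest supersequences exist it returns just one of them.

```python
def shortest_common_supersequence(a, b):
    """Build one shortest common supersequence by merging along an LCS alignment."""
    n, m = len(a), len(b)
    L = [[0] * (m + 1) for _ in range(n + 1)]  # L[i][j] = LCS length of a[:i], b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    out = []  # built back to front
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1  # matched column: emit once
        elif L[i - 1][j] >= L[i][j - 1]:
            out.append(a[i - 1]); i -= 1          # character only in a
        else:
            out.append(b[j - 1]); j -= 1          # character only in b
    out.extend(reversed(a[:i]))  # leftover prefix of a, if any
    out.extend(reversed(b[:j]))  # leftover prefix of b, if any
    return "".join(reversed(out))


def is_subsequence(s, t):
    """True if s is a (non-contiguous) subsequence of t."""
    it = iter(t)
    return all(c in it for c in s)


scs = shortest_common_supersequence("PRIDE", "PARADE")
print(scs, len(scs))
```

The length of the result is always |A| + |B| minus the LCS length, here 5 + 6 - 4 = 7.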

Page 23: Comp. Genomics

Exercise: Finding repeats

• Basic objective: find a pair of subsequences within a string x with maximum similarity

• Simple (albeit wrong) idea: Find an optimal alignment of x with itself! (Why is this wrong?)
• But using local alignment is still a good idea

Page 24: Comp. Genomics

Variant #1

• First variant: the two sequences may not overlap

• Solution: Absence of overlap means that there exists an index k such that one substring is in x[1..k] and the other in x[k+1..n]
  • Check local alignments between x[1..k] and x[k+1..n] for all 1 ≤ k < n
  • Pick the highest-scoring alignment

• Complexity: O(n^3) time and O(n) space
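A sketch of this split-and-align search is given below. It computes scores only, keeping two rows of the DP table at a time so that the space stays linear; the function names and the scoring values are assumptions for the example.

```python
def local_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, keeping only two DP rows."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


def best_nonoverlapping_repeat(x, match=2, mismatch=-1, gap=-2):
    """Try every split point k; n - 1 local alignments of quadratic cost each."""
    best = 0
    for k in range(1, len(x)):
        best = max(best, local_score(x[:k], x[k:], match, mismatch, gap))
    return best


# ACGT occurs twice without overlap: four matches at 2 each.
print(best_nonoverlapping_repeat("ACGTNNACGT"))
```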

Page 25: Comp. Genomics

Variant #1, Pictorially

Page 26: Comp. Genomics

Variant #2

• Second variant: the two sequences must be consecutive (tandem repeat)

• Solution: Similar to variant #1, but somewhat “ends-free”: seek a global alignment between x[1..k] and x[k+1..n], with
  • No penalties for gaps in the beginning of x[1..k]
  • No penalties for gaps in the end of x[k+1..n]

• Complexity: O(n^3) time and O(n) space
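For the tandem variant, the per-split alignment changes in two places: the first column is free (a prefix of x[1..k] may be skipped), and the answer is the best cell of the last row (a suffix of x[k+1..n] may be skipped). The following is a score-only sketch under assumed scoring; the function names are hypothetical.

```python
def suffix_prefix_score(a, b, match=2, mismatch=-1, gap=-2):
    """Align a suffix of a with a prefix of b; skipped ends of a (front) and b (back) are free."""
    prev = [j * gap for j in range(len(b) + 1)]  # F(0,j): the start of b must be paid for
    for ca in a:
        cur = [0]  # F(i,0) = 0: skipping a prefix of a is free
        for j, cb in enumerate(b, 1):
            s = match if ca == cb else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return max(prev)  # best cell in the last row: the rest of b is skipped for free


def best_tandem_repeat(x, match=2, mismatch=-1, gap=-2):
    """Return (score, k) for the best pair of adjacent repeats meeting at position k."""
    best, best_k = 0, 0
    for k in range(1, len(x)):
        score = suffix_prefix_score(x[:k], x[k:], match, mismatch, gap)
        if score > best:
            best, best_k = score, k
    return best, best_k


# ACGT is immediately followed by a second copy of ACGT.
print(best_tandem_repeat("NNACGTACGTNN"))
```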

Page 27: Comp. Genomics

Variant #2, Pictorially