Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.

SequenceAlignment Algorithms

Morten NielsenDepartment of systems biology,

Why learn about alignment algorithms?• All publicly available alignment

programs do the same thing– Sequence alignment using amino acids

substitution matrices and affine gap penalties

• This is fast but not optimal– Protein alignment is done much more

accurate using sequence profiles and position specific gap penalties (price for gaps depends on the structure)

• Must implement your own alignment algorithm to do this

Outline

• What you have been told is not entirely true :-)– Alignment algorithms are more complex

• The true sequence alignment algorithm story– The slow algorithm (O3)– The fast algorithm (O2)

Sequence alignmentThe old Story

Pairwise alignment: the solution

”Dynamic programming” (the Needleman-Wunsch algorithm)

Match score = 1Mismatch score -1Gap penalty = -2

Alignment depicted as path in matrix

T C G C A

TCGCATC-CA

TCGCAT-CCA

Score=2

Score=0

Match score = 1Mismatch score = -1

Gap penalty = -2

Alignment depicted as path in matrix

T C G C A

Meaning of point in matrix: all residues up to this point have been aligned (but there are many different possible paths).

Position labeled “x”: TC aligned with TC

--TC -TC TCTC-- T-C TC

Dynamic programming: computation of scores

T C G C A

Any given point in matrix can only be reached from three possible positions (you cannot “align backwards”).

=> Best scoring alignment ending in any given point in the matrix can be found by choosing the highest scoring of the three possibilities.

T C G C A

score(x,y) = max

score(x,y-1) - gap-penalty

T C G C A

score(x,y) = max

score(x-1,y-1) + substitution-score(x,y)

T C G C A

score(x,y) = max

score(x-1,y) - gap-penalty

T C G C A

Each new score is found by choosing the maximum of three possibilities. For each square in matrix: keep track of where best score came from.

Fill in scores one row at a time, starting in upper left corner of matrix, ending in lower right corner.

score(x,y) = max

score(x-1,y) - gap-penalty

Dynamic programming: example

A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1

Gaps: -2

A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1

Gaps: -2

A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1

Gaps: -2

A C G TA 1 -1 -1 -1C -1 1 -1 -1G -1 -1 1 -1T -1 -1 -1 1

Gaps: -2

T C G C A: : : :T C - C A1+1-2+1+1 = 2

A now the truth

T C G C A

What about j-2 and a gap extension?

And now the true algorithm

V L I L P

Start from here

How does it work (score matrix d)?

V L I L P

1 4 1 -3

1 5 2 5 -4

-3 -4 -3 -4 10

Blosum 50 matrixd matrix

How does it work? (The slow way O3)

V L I L P

17 10 4

10 15 5

4 5 10

000000

Start from here!

Gap-open W1 = -5Gap-extension U = -1

d matrixD matrix

How does it work? (The slow way O3)

V L I L P

17 10 4

10 15 5

4 5 10

000000

• Check all positions in (green) row and column to check score for gap extension.

• CPU intensive (O3)

d matrixD matrix

And now you!

How does it work? Fill out the D matrix

V L I L P

11 17 10 4

9 10 15 5

2 4 5 10

000000

d matrixD matrix

How does it work (The slow way O3)

V L I L P

18 14 9 3

11 15 17 10 4

8 9 10 15 5

2 3 4 5 10

000000

d matrixD matrix

And now the fast algorithm (O2)

Database (m)

Open a gap Extending a gap

Affine gap penalties

And now the fast algorithm (O2)

Database (m)

Open a gap Extending a gap

And now the true algorithm (cont.)

Database (m)

How does it work (D,Q, and P-matrices)D

V L I L P

000000

V L I L P

2 3 4 5 10

000000

W1 = -5U = -1

How does it work (D,Q,P, and E-matrices)D

V L I L P

000000

V L I L P

2 3 4 5 10

000000

W1 = -5U = -1

How does it work (D,Q, and P-matrices)D Q

V L I L P

-1 -1 -1 -1 -1

000000

V L I L P

2 3 4 5 10

000000

W1 = -5U = -1

How does it work (D,Q,P, and E-matrices)D Q

V L I L P

-1 -1 -1 -1 -1

000000

V L I L P

2 3 4 5 10

000000

W1 = -5U = -1

How does it work (D, Q, and P-matrices)D Q

V L I L P

-1 -1 -1 -1

000000

V L I L P

2 3 4 5 10

000000

PV L I L P

000000

How does it work (D, Q, and P-matrices)D Q

V L I L P

-1 -1 -1 -1 -1

000000

V L I L P

2 3 4 5 10

000000

PV L I L P

000000

How does it work. Eij-matrix.(Keeping track on the path)

eij = 1 matcheij = 2 gap-opening database (move vertical) eij = 3 gap-extension database(move vertical) eij = 4 gap-opening query (move horizontal) eij = 5 gap-extension query(move horizontal)

V L I L P

000000

The fast algorithm (cont.)

Database (m)

eij = 1 matcheij = 2 gap-opening database eij = 3 gap-extension database eij = 4 gap-opening query eij = 5 gap-extension query

V L I L P

-1 -1 -1 -1 -1

000000

V L I L P

2 3 4 5 10

000000

PV L I L P

000000

V L I L P

000000

V L I L P

10 12 9 3

3 4 5 10 4

-2 -2 -1 0 5

-1 -1 -1 -1 -1

000000

V L I L P

1 1 3 3

5 1 1 1 3

5 1 4 1 2

5 5 5 4 1

000000

V L I L P

9 4 -2 -1

11 12 5 -1 -1

8 9 10 0 -1

2 3 4 5 -1

000000

V L I L P

18 14 9 3

11 15 17 10 4

8 9 10 15 5

2 3 4 5 10

000000

And the alignment

V L I L P

18 14 9 3

11 15 17 10 4

8 9 10 15 5

2 3 4 5 10

000000

V L I L P

1 1 3 3

5 1 1 1 3

5 1 4 1 2

5 5 5 4 1

000000

VLILPVL-LP

And the alignment. Gap extensions

D-matrix m

V L I PL

13 17 9 3

9 14 12 10 4

7 8 9 15 5

1 2 3 5 10

000000

V L I L P

1 1 3 3

1 1 1 1 3

5 1 5 1 2

5 5 5 4 1

000000

E-matrix

And the alignment

V L I PL

13 17 9 3

9 14 12 10 4

7 8 9 15 5

1 2 3 5 10

000000

V L I L P

1 1 3 3

1 1 1 1 3

5 1 5 1 2

5 5 5 4 1

000000

15 -5 -1 = 9?

And the alignment

V L I PL

13 17 9 3

9 14 12 10 4

7 8 9 15 5

1 2 3 5 10

000000

V L I L P

1 1 3 3

1 1 1 1 3

5 1 5 1 2

5 5 5 4 1

000000

5 -5 -1 = 9?

And the alignment

V L I PL

13 17 9 3

9 14 12 10 4

7 8 9 15 5

1 2 3 5 10

000000

V L I L P

1 1 3 3

1 1 1 1 3

5 1 5 1 2

5 5 5 4 1

000000

VLIALPVL--LP

15 -5 -1 = 9? ***

5 -5 -1 = 9?

And now you!

Summary

• Alignment is more complicated than what you have been told.

• Simple algorithmic tricks allow for alignment in O2 time

• More heuristics to improve speed– Limit gap length– Look for high scoring

regions

Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.

Documents

Transcript of Sequence Alignment Algorithms Morten Nielsen Department of systems biology, DTU.

Immune system overview in 10 minutes The non-immunologist guide to the immune system Morten Nielsen Department of Systems Biology DTU.

Artificiel Neural Networks 2 Morten Nielsen Department of Systems Biology , DTU

Blosum matrices What are they? Morten Nielsen BioSys, DTU

Protein Fold recognition Morten Nielsen, CBS, BioCentrum, DTU.

Sequence encoding, Cross Validation Morten Nielsen BioSys, DTU

Protein structure and homology modeling Morten Nielsen, CBS, BioCentrum, DTU.

Morten Granzau Nielsen, Dansk Industri

Protein Fold recognition Morten Nielsen, CBS, Department of Systems Biology, DTU.

Hidden Markov Models, HMM’s Morten Nielsen, CBS, BioSys, DTU.

Dealing with Sequence redundancy Morten Nielsen BioSys, DTU

Optimization methods Morten Nielsen Department of Systems Biology, DTU.

Artificial Neural Networks 2 Morten Nielsen Depertment of Systems Biology, DTU.

Sequence motifs, information content, logos, and Weight matrices Morten Nielsen, CBS, BioCentrum, DTU.

CENTER FOR BIOLOGICAL SEQUENCE ANALYSISTECHNICAL UNIVERSITY OF DENMARK DTU Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS,

Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU IIB-INTECH, UNSAM, Argentina.

Algorithms in Bioinformatics Morten Nielsen BioSys, DTU.

Dealing with Sequence redundancy Morten Nielsen Department of Systems Biology, DTU

Prediction of T cell epitopes using artificial neural networks Morten Nielsen, CBS, BioCentrum, DTU.

Artificial Neural Networks 1 Morten Nielsen Department of Systems Biology, DTU.

Artificiel Neural Networks 2 Morten Nielsen Department of Systems Biology, DTU.