Sequence comparison and Phylogeny


Page 1: Sequence comparison and Phylogeny

Michael Schroeder
BioTechnological Center, TU Dresden

Biotec

Sequence comparison and Phylogeny

based on Chapter 4

Lesk, Introduction to Bioinformatics

Page 2: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 2

Contents

Motivation
Sequence comparison and alignments
Dot plots
Dynamic programming
Substitution matrices
Dynamic programming: Local and global alignments and gaps
BLAST
Significance of alignments
Multiple sequence alignments
Phylogenetic trees

Page 3: Sequence comparison and Phylogeny

Motivation

Where do we come from? Recent African origin vs. multi-regional hypothesis

In 1999, encephalitis caused by the West Nile virus broke out in New York. How did the virus get to New York?

How did the nucleus get into eukaryotic cells?

To answer such questions, we need sequence comparison and phylogenetic trees.

Page 4: Sequence comparison and Phylogeny

Sequence Similarity Searches

Sequence similarity can be a clue to a common evolutionary ancestor, e.g. globin genes in chimpanzees and humans,

or to a common function, e.g. the v-sis oncogene of simian sarcoma virus, which leads to cancer in monkeys, and the seemingly unrelated growth-stimulating factor PDGF (platelet-derived growth factor), which stimulates cell growth (first success of the similarity idea, 1983).

In general: if an unknown sequence is found, deduce its function/structure indirectly by finding similar sequences whose function/structure is known.

Assumption: evolution changes sequences “slowly”, often maintaining the main features of a sequence’s function/structure.

Page 5: Sequence comparison and Phylogeny


Sequence alignment

Substitutions, insertions and deletions can be interpreted in evolutionary terms

But: distinguish chance similarity and real biological relationship

CCGTAA

CCGTAT

TCGTAGTAGTAC

TCGTAC

TCGTAA

TTGTAA

Page 6: Sequence comparison and Phylogeny


Evolution

Convergent evolution: same sequence evolved from different ancestors

Back evolution - mutate to a previous sequence

CCGTAA

CCGTAT

TCGTAGTAGTAC

TCGTAC

TCGTAA

TAGTAC CCGTAA

TAGTAA

Page 7: Sequence comparison and Phylogeny

Similarity vs. Homology

Any two sequences can be similar
Sequences are homologous if they evolved from a common ancestor
Homologous sequences:
  Orthologs: similar biological function
  Paralogs: different biological function (after gene duplication), e.g. lysozyme and α-lactalbumin, a mammalian regulatory protein
Assumption: similarity is an indicator of homology
Note: the altered function of the expressed protein determines whether the organism survives to reproduce and hence passes on the altered gene

Page 8: Sequence comparison and Phylogeny

Sequence alignments

Given two or more sequences, we wish to

Measure their similarity
Determine the residue-residue correspondences
Observe patterns of conservation and variability
Infer evolutionary relationships

Page 9: Sequence comparison and Phylogeny

What is the best alignment?

Uninformative:
-------gctgaacg
ctataatc-------

Without gaps:
gctgaacg
ctataatc

With gaps:
gctga-a--cg
--ct-ataatc

Another one:
gctg-aa-cg
-ctataatc-

Formally: the best alignment requires a minimal number of edit operations (insertions, deletions, replacements)

We need a method to systematically explore and compute alignments

Page 10: Sequence comparison and Phylogeny

Scores for an alignment

Percentage of matches
Score each match, mismatch, gap opening, gap extension

Example: match +1, mismatch -1, gap opening -3, gap extension -1

Uninformative: 0%, score = -21
-------gctgaacg
ctataatc-------

Without gaps: 25%, score = -4
gctgaacg
ctataatc

With gaps: 0%, score = -23
gctga-a--cg
--ct-ataatc

Another one: 50%, score = -12
gctg-aa-cg
-ctataatc-

Page 11: Sequence comparison and Phylogeny

Scores for an alignment

Percentage of matches
Score each match, mismatch, gap opening, gap extension

Example: match +2, mismatch -1, gap opening -1, gap extension -1

Uninformative: 0%, score = -17
-------gctgaacg
ctataatc-------

Without gaps: 25%, score = -2
gctgaacg
ctataatc

With gaps: 0%, score = -15
gctga-a--cg
--ct-ataatc

Another one: 50%, score = 1
gctg-aa-cg
-ctataatc-
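The scoring of a gapped alignment can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function name `score_alignment` is hypothetical, and it assumes that a run of l gap characters costs gap opening plus l times gap extension, and that the example strings are two 8-letter sequences, gctgaacg and ctataatc, written one after the other.

```python
def score_alignment(top, bottom, match, mismatch, gap_open, gap_extend):
    """Return (score, percent identity) of a gapped pairwise alignment.

    Assumption: a run of l gap characters in one row costs
    gap_open + l * gap_extend, and no column has a gap in both rows.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    matches = 0
    gap_row = None  # row that held the gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            current = "top" if a == "-" else "bottom"
            if current != gap_row:  # a new gap run starts in this column
                score += gap_open
            score += gap_extend
            gap_row = current
        else:
            gap_row = None
            if a == b:
                score += match
                matches += 1
            else:
                score += mismatch
    return score, 100 * matches // len(top)


alignments = {
    "uninformative": ("-------gctgaacg", "ctataatc-------"),
    "without gaps": ("gctgaacg", "ctataatc"),
    "with gaps": ("gctga-a--cg", "--ct-ataatc"),
    "another one": ("gctg-aa-cg", "-ctataatc-"),
}
for name, (top, bottom) in alignments.items():
    first = score_alignment(top, bottom, 1, -1, -3, -1)   # first scheme
    second = score_alignment(top, bottom, 2, -1, -1, -1)  # second scheme
    print(name, first, second)
```

Changing the four parameters changes which alignment ranks best, which is the point of comparing the two schemes.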

Page 12: Sequence comparison and Phylogeny


Dot plots

Page 13: Sequence comparison and Phylogeny

Dot plots

A convenient way of comparing two sequences visually
Use a matrix: put one sequence on the X-axis and the other on the Y-axis
Cells with identical characters are filled with a ‘1’, non-identical ones with a ‘0’ (simplest scheme; weights could be used instead)
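The simplest dot plot scheme described above can be sketched in a few lines. This is an illustrative sketch only; the function names `dot_plot` and `render` are hypothetical and not from the slides.

```python
def dot_plot(seq_x, seq_y):
    """Matrix with one row per character of seq_y and one column per
    character of seq_x; a cell is 1 if the two characters are identical."""
    return [[1 if a == b else 0 for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Text rendering: seq_x across the top, seq_y down the left side."""
    lines = [" " + seq_x]
    for b, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        dots = "".join("*" if cell else " " for cell in row).rstrip()
        lines.append(b + dots)
    return "\n".join(lines)


print(render("dorothycrowfoothodgkin", "dorothyhodgkin"))
```

Runs of dots along a diagonal correspond to stretches that the two sequences have in common.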

Page 14: Sequence comparison and Phylogeny

Dot plots

[Figure: empty dot plot grid with DOROTHYCROWFOOTHODGKIN on one axis and DOROTHYHODGKIN on the other]

Page 15: Sequence comparison and Phylogeny

Dot plots

[Figure: dot plot of DOROTHYCROWFOOTHODGKIN against DOROTHYHODGKIN, with a dot wherever the letters match]

Page 16: Sequence comparison and Phylogeny

Interpreting dot plots

What do identical sequences look like?
What do unrelated sequences look like?
What do distantly related sequences look like?
What does a reversed sequence look like? Relevant for the detection of stems in RNA structure
What does a palindrome look like? Relevant for restriction enzymes
What do repeats look like?
What do a protein with domains A and B and another one with domains B and C look like?

Page 17: Sequence comparison and Phylogeny

Dot plot for identical sequences

[Figure: dot plot of DOROTHYHODGKIN against itself]

Page 18: Sequence comparison and Phylogeny

Dotplot for unrelated sequences

[Figure: dot plot of DOROTHYHODGKIN against an unrelated sequence]

Page 19: Sequence comparison and Phylogeny

Dotplot for distantly related sequences

[Figure: dot plot of DOROTHYHODGKIN against TIMOTHYJENKIN]

Page 20: Sequence comparison and Phylogeny

Dotplot for reverse sequences

[Figure: dot plot of DOROTHYHODGKIN against its reverse, NIKGDOHYHTOROD]

Page 21: Sequence comparison and Phylogeny

Dotplot for reverse sequences

Relevant to identify stems in RNA structures
Plot the sequence against its reverse complement

Page 22: Sequence comparison and Phylogeny

Palindromes and restriction enzymes

Madam, I'm Adam
Able was I ere I saw Elba (supposedly said by Napoleon)
Doc, note: I dissent. A fast never prevents a fatness. I diet on cod.

Because DNA is double-stranded and the strands run antiparallel, a DNA palindrome is a double-stranded sequence that reads the same 5’ to 3’ on both strands

The HindIII cutting site:

5'-AAGCTT-3'
3'-TTCGAA-5'

The EcoRI cutting site:

5'-GAATTC-3'
3'-CTTAAG-5'
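The DNA notion of a palindrome can be tested by comparing a strand with its reverse complement. A minimal sketch follows; the helper names `reverse_complement` and `is_dna_palindrome` are hypothetical, and only the four unambiguous bases are handled.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_dna_palindrome(seq):
    """True if both strands read the same in the 5' to 3' direction."""
    return seq.upper() == reverse_complement(seq)


for site in ("AAGCTT", "GAATTC", "GATTACA"):
    print(site, is_dna_palindrome(site))
```

Both restriction sites shown above pass the check, whereas an arbitrary sequence such as GATTACA does not.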

Page 23: Sequence comparison and Phylogeny

Dotplot of a Palindrome

[Figure: dot plot of MADAM against itself]

Page 24: Sequence comparison and Phylogeny

Dotplot of repeats

[Figure: dot plot of TWENTYONETWENTYTWO against itself]

Page 25: Sequence comparison and Phylogeny

Dotplot of Repeats/Palindrome

[Figure: dot plot of MADAMIMADAM against itself]

Page 26: Sequence comparison and Phylogeny

Dotplot for shared domain

[Figure: dot plot of DOROTHYHODGKIN against DOROTHYMILLER]

Page 27: Sequence comparison and Phylogeny

Result: Dot plot

 dorothycrowfoothodgkin
d*                *
o * *     *  **  *
r  *     *
o * *     *  **  *
t    *         *
h     *         *
y      *
h     *         *
o * *     *  **  *
d*                *
g                  *
k                   *
i                    *
n                     *

Page 28: Sequence comparison and Phylogeny


Page 29: Sequence comparison and Phylogeny

Dotplots

Window size 15; dot if at least 6 matches in window
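Filtering a dot plot with a window can be sketched as follows. This is an illustration under stated assumptions: the window is anchored at its first position and runs along the diagonal, only complete windows are evaluated, and a dot is drawn when at least `min_matches` positions agree. The function name and parameter names are hypothetical.

```python
def windowed_dot_plot(seq_x, seq_y, window=15, min_matches=6):
    """plot[j][i] is True if the diagonal window starting at seq_x[i]
    and seq_y[j] contains at least min_matches identical positions."""
    n_cols = max(len(seq_x) - window + 1, 0)
    n_rows = max(len(seq_y) - window + 1, 0)
    plot = []
    for j in range(n_rows):
        row = []
        for i in range(n_cols):
            hits = sum(seq_x[i + k] == seq_y[j + k] for k in range(window))
            row.append(hits >= min_matches)
        plot.append(row)
    return plot


# With window 1 and threshold 1 this reduces to the plain dot plot.
print(windowed_dot_plot("MADAM", "MADAM", window=1, min_matches=1)[0])
```

Larger windows with a moderate threshold suppress isolated chance matches, while a window of 1 shows every single identical pair.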

Page 30: Sequence comparison and Phylogeny

Window size 15; dot if at least 6 matches in window

Page 31: Sequence comparison and Phylogeny

>gi|1942644|pdb|1MEG| Crystal Structure Of A Caricain D158e Mutant In Complex With E-64

Length = 216

Score = 271 bits (693), Expect = 1e-73
Identities = 142/216 (65%), Positives = 168/216 (77%), Gaps = 4/216 (1%)

Query: 1 IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRS 60
+PE VDWR+KGAVTPV++QGSCGSCWAFSAV T+EGI KIRTG L + SEQEL+DC+RRS
Sbjct: 1 LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSEQELVDCERRS 60

Query: 61 YGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
+GC GGYP AL+ VA+ GIH R+ YPY+ Q CR+++ G KT GV +VQP N+G
Sbjct: 61 HGCKGGYPPYALEYVAKNGIHLRSKYPYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120

Query: 121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV----GYGPNYILIKNS 176
LL +IA QPVSVV+++ G+ FQLY+GGIF GPCG KV+HAV AV G YILIKNS
Sbjct: 121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGYGKSGGKGYILIKNS 180

Query: 177 WGTGWGENGYIRIKRGTGNSYGVCGLYTSSFYPVKN 212
WGT WGE GYIRIKR GNS GVCGLY SS+YP KN
Sbjct: 181 WGTAWGEKGYIRIKRAPGNSPGVCGLYKSSYYPTKN 216

1 lpenvdwrkk gavtpvrhqg scgscwafsa vatveginki rtgklvelse qelvdcerrs
61 hgckggyppy aleyvakngi hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha vtavgygksg gkgyilikns
181 wgtawgekgy irikrapgns pgvcglykss yyptkn

Page 32: Sequence comparison and Phylogeny

Window size 15; dot if at least 6 matches in window

Page 33: Sequence comparison and Phylogeny

>gi|2624670|pdb|1AIM| Cruzain Inhibited By Benzoyl-Tyrosine-Alanine-Fluoromethylketone

Length = 215

Score = 121 bits (303), Expect = 3e-28
Identities = 78/202 (38%), Positives = 107/202 (52%), Gaps = 13/202 (6%)

Query: 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSEQELLDCDRRSY 61
P VDWR +GAVT VK+QG CGSCWAFSA+ +E + L SEQ L+ CD+
Sbjct: 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFLAGHPLTNLSEQMLVSCDKTDS 61

Query: 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRYCRSREKGPYAAKTDGVRQVQP 115
GC+GG +A + + Q ++ ++YPY EG+ C + A T V Q
Sbjct: 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPPCTTSGHTVGATITGHVELPQD 121

Query: 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYGPN----YI 171
Q A ++ N PV+V + A+ + Y GG+ +DH V VGY + Y
Sbjct: 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVSEALDHGVLLVGYNDSAAVPYW 178

Query: 172 LIKNSWGTGWGENGYIRIKRGT 193
+IKNSW T WGE GYIRI +G+
Sbjct: 179 IIKNSWTTQWGEEGYIRIAKGS 200

Page 34: Sequence comparison and Phylogeny

Window size 15; dot if at least 6 matches in window

Page 35: Sequence comparison and Phylogeny

gi|7546546|pdb|1EF7|B Chain B, Crystal Structure Of Human Cathepsin X

Length = 242

Score = 52.0 bits (123), Expect = 2e-07
Identities = 60/231 (25%), Positives = 94/231 (40%), Gaps = 34/231 (14%)

Query: 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVTIEGIIKIRTGNL---NQYSEQ 51
+P+ DWR V + +NQ CGSCWA ++ + I I+ S Q
Sbjct: 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSAMADRINIKRKGAWPSTLLSVQ 60

Query: 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEGVQRYCR--------SREKGPY 103
++DC C GG S Q+GI Y+ + C + K +
Sbjct: 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQAKDQECDKFNQCGTCNEFKECH 119

Query: 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNK-V 157
A + + +V Y + + AN P+S + A + Y GGI+ +
Sbjct: 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMATER-LANYTGGIYAEYQDTTYI 178

Query: 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI-----KRGTGNSYGV 199
+H V+ G+ G Y +++NSWG WGE G++RI K G G Y +
Sbjct: 179 NHVVSVAGWGISDGTEYWIVRNSWGEPWGERGWLRIVTSTYKDGKGARYNL 229

Page 36: Sequence comparison and Phylogeny

Window size 5; dot if at least 2 matches in window

Page 37: Sequence comparison and Phylogeny

Window size 1; dot if 1 match in window

Page 38: Sequence comparison and Phylogeny


Dynamic programming

Page 39: Sequence comparison and Phylogeny

From Dotplots to Alignments

Obvious best alignment:

DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN

[Figure: dot plot of DOROTHYCROWFOOTHODGKIN against DOROTHYHODGKIN]

Page 40: Sequence comparison and Phylogeny

From Dotplots to Alignments

Find the “best” path from the top left corner to the bottom right
Moving “east” corresponds to a “-” in the second sequence
Moving “south” corresponds to a “-” in the first sequence
Moving “southeast” corresponds to a match if the characters are the same, or a mismatch otherwise

Can we automate this?

Page 41: Sequence comparison and Phylogeny

From Dotplots to Alignments

Algorithm (Dynamic Programming):
Insert a row 0 and a column 0 initialised with 0
Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
Consider
  The value of the cell north
  The value of the cell west
  The value of the cell northwest if the row/column characters mismatch
  1 + the value of the cell northwest if the row/column characters match
Put down the maximum of these values as the value for the current cell
Trace back the path with the highest values from the bottom right to the top left and output the alignment
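The table-filling part of this algorithm can be sketched as below. The function name `fill_table` is hypothetical; the sequences ATCTGAT (rows) and TGCATA (columns) are the ones used in the worked example on the following slides.

```python
def fill_table(row_seq, col_seq):
    """Fill the dynamic programming table: row 0 and column 0 are 0,
    every other cell is the maximum of north, west and northwest
    (northwest gets +1 if the row and column characters match)."""
    n, m = len(row_seq), len(col_seq)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            bonus = 1 if row_seq[i - 1] == col_seq[j - 1] else 0
            s[i][j] = max(s[i - 1][j], s[i][j - 1], s[i - 1][j - 1] + bonus)
    return s


table = fill_table("ATCTGAT", "TGCATA")
for label, row in zip(" ATCTGAT", table):
    print(label, row)
```

The bottom right cell holds the number of matches in a best alignment; the traceback step is not included in this sketch.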

Page 42: Sequence comparison and Phylogeny

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0
1 A
2 T
3 C
4 T
5 G
6 A
7 T

Page 43: Sequence comparison and Phylogeny

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

Insert a row 0 and column 0 initialised with 0

Page 44: Sequence comparison and Phylogeny

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

Consider
  Value north
  Value west
  Value northwest if the row/column characters mismatch
  1 + value northwest if the row/column characters match
Put down the maximum of these values for the current cell

Page 45: Sequence comparison and Phylogeny

From Dotplots to Alignments

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4

Page 46: Sequence comparison and Phylogeny

Reading the Alignment

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4

-tgcat-a-
at-c-tgat

Page 47: Sequence comparison and Phylogeny

Reading the Alignment: there is more than one possibility

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4

---tgcata
atctg-at-

Page 48: Sequence comparison and Phylogeny

Formally: Longest Common Subsequence (LCS)

What is the length $s(V,W)$ of the longest common subsequence of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$?

Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t} = w_{j_t}$ for $1 \le t \le k$

How? Dynamic programming: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

\[
s_{i,j} = \max \left\{ \begin{array}{ll}
s_{i-1,j} & \\
s_{i,j-1} & \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{array} \right.
\]

Then $s(V,W) = s_{n,m}$ is the length of the LCS

Page 49: Sequence comparison and Phylogeny

Example LCS

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0
1 A
2 T
3 C
4 T
5 G
6 A
7 T

Page 50: Sequence comparison and Phylogeny

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

Initialisation: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$

Page 51: Sequence comparison and Phylogeny

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0
3 C   0
4 T   0
5 G   0
6 A   0
7 T   0

Computing each cell:

\[
s_{i,j} = \max \left\{ \begin{array}{ll}
s_{i-1,j} & \\
s_{i,j-1} & \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{array} \right.
\]

Page 52: Sequence comparison and Phylogeny

Example LCS:

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0  0  0  0  1  1  1
2 T   0  1  1  1  1  2  2
3 C   0  1  1  2  2  2  2
4 T   0  1  1  2  2  3  3
5 G   0  1  2  2  2  3  3
6 A   0  1  2  2  3  3  4
7 T   0  1  2  2  3  4  4

Computing each cell:

\[
s_{i,j} = \max \left\{ \begin{array}{ll}
s_{i-1,j} & \\
s_{i,j-1} & \\
s_{i-1,j-1} + 1, & \text{if } v_i = w_j
\end{array} \right.
\]

Page 53: Sequence comparison and Phylogeny

LCS Algorithm

LCS(V,W)
  For i = 0 to n
    s[i,0] = 0
  For j = 0 to m
    s[0,j] = 0
  For i = 1 to n
    For j = 1 to m
      If v[i] = w[j] and s[i-1,j-1] + 1 ≥ s[i-1,j] and s[i-1,j-1] + 1 ≥ s[i,j-1] Then
        s[i,j] = s[i-1,j-1] + 1
        b[i,j] = North West
      Else if s[i-1,j] ≥ s[i,j-1] Then
        s[i,j] = s[i-1,j]
        b[i,j] = North
      Else
        s[i,j] = s[i,j-1]
        b[i,j] = West
  Return s and b

Complexity: LCS has quadratic complexity: O(n m)

Page 54: Sequence comparison and Phylogeny

Printing the alignment of LCS

PRINT-LCS(b,V,i,j)
  If i = 0 or j = 0 Then Return
  If b[i,j] = North West Then
    PRINT-LCS(b,V,i-1,j-1)
    Print v[i]
  Else if b[i,j] = North Then
    PRINT-LCS(b,V,i-1,j)
  Else
    PRINT-LCS(b,V,i,j-1)
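The LCS and PRINT-LCS pseudocode can be turned into runnable code along these lines. This is a sketch rather than the author's implementation: the names `lcs_tables` and `read_lcs` are hypothetical, indices are 0-based on the strings, and the traceback is written as a loop instead of a recursion.

```python
def lcs_tables(v, w):
    """Return the score table s and the backtracking table b."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + 1
            if v[i - 1] == w[j - 1] and diag >= s[i - 1][j] and diag >= s[i][j - 1]:
                s[i][j], b[i][j] = diag, "NW"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "N"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "W"
    return s, b


def read_lcs(b, v, i, j):
    """Follow the backtracking pointers from cell (i, j) and collect the
    characters of one longest common subsequence."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "NW":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "N":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


v, w = "ATCTGAT", "TGCATA"
s, b = lcs_tables(v, w)
print(s[len(v)][len(w)], read_lcs(b, v, len(v), len(w)))
```

Other tie-breaking rules in the table-filling step would lead to a different, equally long subsequence.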

Page 55: Sequence comparison and Phylogeny

Rewards/Penalties

We can use different schemes:
-1 for insert/delete/mismatch
+1 for match

… Consider
  -1 + the value of the cell north
  -1 + the value of the cell west
  -1 + the value of the cell northwest if the row/column characters mismatch
  +1 + the value of the cell northwest if the row/column characters match
Put down the maximum of these values as the value for the current cell
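The +1/-1 scheme can be sketched as follows. The function name `score_table` and its parameter names are hypothetical. The sketch keeps row 0 and column 0 at 0, as in the earlier slides, which differs from the usual Needleman-Wunsch initialisation where the borders accumulate gap penalties.

```python
def score_table(v, w, match=1, mismatch=-1, gap=-1):
    """Fill the table with rewards and penalties: moving north or west
    costs a gap penalty, moving northwest scores a match or a mismatch."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j] + gap,
                          s[i][j - 1] + gap,
                          s[i - 1][j - 1] + step)
    return s


for label, row in zip(" ATCTGAT", score_table("ATCTGAT", "TGCATA")):
    print(label, row)
```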

Page 56: Sequence comparison and Phylogeny

Reading the Alignment

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0 -1 -1 -1  1  0  1
2 T   0  1  0 -1  0  2  1
3 C   0  0  0  1  0  1  1
4 T   0  1  0  0  0  1  0
5 G   0  0  2  1  0  0  0
6 A   0 -1  1  1  2  1  1
7 T   0  1  0  0  1  3  2

---tgcata
atctg-at-

Page 57: Sequence comparison and Phylogeny

Rewards/Penalties

Let’s refine the schemes:

Transition mutations are more common: purine<->purine (a<->g), pyrimidine<->pyrimidine (t<->c)
Transversions (purine<->pyrimidine) are less common

Use a substitution matrix to rate mismatches:

-2 for insert/delete
Mismatch/match according to substitution matrix

… Consider
  -2 + the value of the cell north
  -2 + the value of the cell west
  Corresponding value of the substitution matrix + the value of the cell northwest
Put down the maximum of these values as the value for the current cell

     C   G   T   A
C    2  -2   0  -2
G   -2   2  -2   0
T    0  -2   2  -2
A   -2   0  -2   2
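The refined scheme replaces the fixed match/mismatch values by a lookup in the substitution matrix. A minimal sketch, assuming the nucleotide matrix given above (match 2, transition 0, transversion -2) and a gap penalty of -2; the names `SUBST` and `score_table_subst` are hypothetical, and row 0 and column 0 stay at 0 as in the earlier slides.

```python
BASES = "CGTA"
ROWS = [
    [2, -2, 0, -2],   # C
    [-2, 2, -2, 0],   # G
    [0, -2, 2, -2],   # T
    [-2, 0, -2, 2],   # A
]
# Dictionary form of the matrix: SUBST[("A", "G")] is the A/G score.
SUBST = {(a, b): ROWS[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def score_table_subst(v, w, subst, gap=-2):
    """Fill the table using a substitution matrix for the northwest move."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + gap,
                          s[i][j - 1] + gap,
                          s[i - 1][j - 1] + subst[(v[i - 1], w[j - 1])])
    return s


for label, row in zip(" ATCTGAT", score_table_subst("ATCTGAT", "TGCATA", SUBST)):
    print(label, row)
```

The same function works for proteins if `subst` is filled from an amino acid matrix such as PAM 250.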

Page 58: Sequence comparison and Phylogeny

Reading the Alignment

      0  1  2  3  4  5  6
         T  G  C  A  T  A
0     0  0  0  0  0  0  0
1 A   0 -2  0 -2  2  0  2
2 T   0  2  0  0  0  4  2
3 C   0  0  0  2  0  2  2
4 T   0  2  0  0  0  2  0
5 G   0  0  4  2  0  0  2
6 A   0 -2  2  2  4  2  2
7 T   0  2  0  2  2  6  4

---tgcata
atctg-at-

Page 59: Sequence comparison and Phylogeny

Substitution matrices

Page 60: Sequence comparison and Phylogeny

How to derive a substitution matrix for amino acids?

Amino acids can be classified by physicochemical properties

[Figure: Venn diagram grouping the amino acids into hydrophobic, polar, acidic, basic and aromatic classes]

Page 61: Sequence comparison and Phylogeny

PAM 250 matrix

Cys 12
Ser 0 2
Thr -2 1 3
Pro -3 1 0 6
Ala -2 1 1 1 2
Gly -3 1 0 -1 1 5
Asn -4 1 0 -1 0 0 2
Asp -5 0 0 -1 0 1 2 4
Glu -5 0 0 -1 0 0 1 3 4
Gln -5 -1 -1 0 0 -1 1 2 2 4
His -3 -1 -1 0 -1 -2 2 1 1 3 6
Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
Lys -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
Met -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
Ile -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
Leu -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
Val -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
Phe -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
    C S T P A G N D E Q H R K M I L V F Y W

>0: likely mutation
0: random mutation
<0: unlikely

Page 62: Sequence comparison and Phylogeny


Cys 12 0 -2 -3 -2 -3 -4 -5 -5 -5 -3 -4 -5 -5 -2 -6 -2 -4 0 -8

Ser 0 2 1 1 1 1 1 0 0 -1 -1 0 0 -2 -1 -3 -1 -3 -3 -2

Thr -2 1 3 0 1 0 0 0 0 -1 -1 -1 0 -1 0 -2 0 -3 -3 -5

Pro -3 1 0 6 1 -1 -1 -1 -1 0 0 0 -1 -2 -2 -3 -1 -5 -5 -6

Ala -2 1 1 1 2 1 0 0 0 0 -1 -2 -1 -1 -1 -2 0 -4 -3 -6

Gly -3 1 0 -1 1 5 0 1 0 -1 -2 -3 -2 -3 -3 -4 -1 -5 -5 -7

Asn -4 1 0 -1 0 0 2 2 1 1 2 0 1 -2 -2 -3 -2 -4 -2 -4

Asp -5 0 0 -1 0 1 2 4 3 2 1 -1 0 -3 -2 -4 -2 -6 -4 -7

Glu -5 0 0 -1 0 0 1 3 4 2 1 -1 0 -2 -2 -3 -2 -5 -4 -7

Gln -5 -1 -1 0 0 -1 1 2 2 4 3 1 1 -1 -2 -2 -2 -5 -4 -5

His -3 -1 -1 0 -1 -2 2 1 1 3 6 2 0 -2 -2 -2 -2 -2 0 -3

Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 3 0 -2 -3 -2 -4 -4 2

Lys -5 0 0 -1 -1 -2 1 0 0 1 0 3 5 0 -2 -3 -2 -5 -4 -3

Met -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6 2 4 2 0 -2 -4

Ile -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 2 4 1 -1 -5

Leu -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 2 2 -1 -2

Val -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 -1 -2 -6

Phe -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9 7 0

Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10 0

Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17

  C S T P A G N D E Q H R K M I L V F Y W

 

Average -2.8 -0.5 -0.7 -1.2 -0.9 -1.6 -0.7 -1.1 -1.1 -0.8 -0.3 -0.7 -0.9 -0.8 -0.8 -1.4 -0.8 -1.9 -1.5 -3.1

StDev 4 1.5 1.7 2.6 1.9 2.7 2.1 3 2.8 2.6 2.3 2.6 2.5 2.6 2.4 3 2.3 4.1 3.8 5.4

Page 63: Sequence comparison and Phylogeny


Page 64: Sequence comparison and Phylogeny

PAM 250: Interpretation

Immutable:
  Cysteine (Avg=-2.8): known to have several unique, indispensable functions
    Attachment site of heme group in cytochrome and of iron sulphur FeS in ferredoxins
    Cross links in proteins such as chymotrypsin or ribonuclease
    Seldom without unique function
  Glycine (Avg=-1.6): small size may be advantageous
Mutable:
  Serine often functions in the active site, but can be easily replaced
Self-alignment:
  Tryptophan with itself scores very high, as W occurs rarely

Page 65: Sequence comparison and Phylogeny

Point Accepted Mutations PAM

Substitution matrix using an explicit evolutionary model of how amino acids change over time
Use the parsimony method to determine the frequency of mutations
Entry in PAM matrix: likelihood ratio for residues a and b: probability a-b is a mutation / probability a-b is chance
PAM x: two sequences V, W have an evolutionary distance of x PAM if a series of accepted point mutations (and no insertions/deletions) converts V into W, averaging x point mutations per 100 residues
Mutations here = mutations in the DNA
Because of silent mutations and back mutations, x can be > 100
PAM 250 most commonly used

Page 66: Sequence comparison and Phylogeny

PAM and Sequence Similarity

PAM           0   30   80  110  200  250
% identity  100   75   60   50   25   20

Page 67: Sequence comparison and Phylogeny

PAM

Dayhoff, Eck, Park: A model of evolutionary change in proteins, 1978

Accepted point mutation = substitution of an amino acid accepted by natural selection

Assumption: X replacing Y is as likely as Y replacing X

Used cytochrome c, hemoglobin, myoglobin, virus coat proteins, chymotrypsinogen, glyceraldehyde 3-phosphate dehydrogenase, clupeine, insulin, ferredoxin

Sequences which are too distantly related have been omitted, as they are more likely to contain multiple mutations per site

Page 68: Sequence comparison and Phylogeny

PAM: Step 1

Step 1: Construct a multiple alignment

Example

ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL

Page 69: Sequence comparison and Phylogeny

PAM: Step 2

Create a phylogenetic tree (parsimony method)

ACGCTAFKI
  A->G: GCGCTAFKI
    A->G: GCGCTGFKI
    A->L: GCGCTLFKI
  I->L: ACGCTAFKL
    C->S: ASGCTAFKL
    G->A: ACACTAFKL

Page 70: Sequence comparison and Phylogeny

PAM: Step 3

Note the following variables:
Residue frequency $r_i$ is the number of occurrences of amino acid i in the sequences, e.g. $r_A = 10$ and $r_G = 10$
Number of residues $r$ is the overall number of amino acids in all sequences, e.g. $r = 63$
Substitutability $s_i$ is the number of substitutions in the tree involving amino acid i, e.g. $s_A = 4$
Substitution frequency $s_{i,j}$ is the number of substitutions involving amino acids i and j (i.e. the number of i->j and j->i), e.g. $s_{A,G} = 3$
Number of substitutions $s$ is the overall number of substitutions, $s = 6$

Page 71: Sequence comparison and Phylogeny

PAM: Step 4

Relative mutability is the number of times the amino acid is substituted by any other amino acid in the tree divided by the total number of substitutions that could have affected the residue

Note, it is assumed that substitutions in both directions are equally likely

$m_i = 100 \times (s_i \times r_i) / (2 s \times r)$

Example: $m_A = 100 \times (4 \times 10) / (2 \times 6 \times 63) = 5.3$

Denominator: how many residues, how many substitutions overall?

Numerator: how many i? How many substitutions with i?

Page 72: Sequence comparison and Phylogeny

PAM: Step 5

Compute mutation probability $M_{i,j} = m_j \times s_{i,j} / s_j$

Example: $M_{G,A} = 5.3 \times 3 / 4 = 3.975$

Page 73: Sequence comparison and Phylogeny

PAM: Step 6

Finally the entry in the PAM Matrix:

$R_{i,j} = \log ( M_{i,j} / ( r_i / r ) )$

Example: $R_{G,A} = \log ( 3.975 / (10/63) ) = 1.4$
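The numbers in steps 3 to 6 can be reproduced from the example alignment and tree. The following is a sketch under stated assumptions: the substitution list is read off the example tree, the logarithm is taken to base 10 (which matches the value 1.4), and all function names are hypothetical.

```python
import math
from collections import Counter

sequences = ["ACGCTAFKI", "GCGCTAFKI", "ACGCTAFKL", "GCGCTGFKI",
             "GCGCTLFKI", "ASGCTAFKL", "ACACTAFKL"]
# Substitutions on the branches of the example tree.
substitutions = [("A", "G"), ("I", "L"), ("A", "G"),
                 ("A", "L"), ("C", "S"), ("G", "A")]

residue_count = Counter("".join(sequences))   # r_i
r = sum(residue_count.values())               # total number of residues
s = len(substitutions)                        # total number of substitutions


def substitutability(i):
    """s_i: number of substitutions involving amino acid i."""
    return sum(i in pair for pair in substitutions)


def substitution_frequency(i, j):
    """s_ij: number of substitutions i->j or j->i."""
    return sum(set(pair) == {i, j} for pair in substitutions)


def relative_mutability(i):
    """m_i = 100 * (s_i * r_i) / (2 * s * r), as on the slide."""
    return 100 * substitutability(i) * residue_count[i] / (2 * s * r)


def mutation_probability(i, j):
    """M_ij = m_j * s_ij / s_j."""
    return relative_mutability(j) * substitution_frequency(i, j) / substitutability(j)


def pam_entry(i, j):
    """R_ij = log10(M_ij / (r_i / r)); base 10 is an assumption."""
    return math.log10(mutation_probability(i, j) / (residue_count[i] / r))


print(round(relative_mutability("A"), 1))
print(round(mutation_probability("G", "A"), 3))
print(round(pam_entry("G", "A"), 1))
```

Because the sketch does not round the relative mutability before reusing it, the intermediate value for the mutation probability differs slightly from the slide's 3.975, while the final entry still rounds to 1.4.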

Page 74: Sequence comparison and Phylogeny

PAM: Step 6

For the entries on the diagonal

$m_j$ = relative mutability
$1 - m_j$ = relative immutability

$R_{j,j} = \log ( (1 - m_j) / ( r_j / r ) )$

Page 75: Sequence comparison and Phylogeny

BLOSUM

Different approach to PAM
BLOcks SUbstitution Matrix (based on BLOCKS database)
Generation of BLOSUM x
  Group highly similar sequences and replace them by a representative sequence
  Only consider sequences with no more than x % similarity
  Align sequences (no gaps)
  For any pair of amino acids a,b and for all columns c of the alignment, let q(a,b) be the number of co-occurrences of a,b in all columns c
  Let p(a) be the overall probability of a occurring
  BLOSUM entry for a,b is log2 ( q(a,b) / ( p(a)*p(b) ) )
BLOSUM 50 and BLOSUM 62 widely used
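The log-odds idea behind a BLOSUM entry can be illustrated on a toy block. This is a simplified sketch and not the published BLOSUM procedure: the clustering step is skipped, q(a,b) is taken as the relative frequency of ordered residue pairs within columns, and p(a) is derived from those pair frequencies. The function name `blosum_like_scores` and the toy block are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Log-odds scores log2(q(a,b) / (p(a) * p(b))) for an ungapped block
    given as a list of equally long sequences."""
    pair_count = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            # Count each pair in both orders so that the table is symmetric.
            pair_count[(a, b)] += 0.5
            pair_count[(b, a)] += 0.5
    total = sum(pair_count.values())
    q = {pair: count / total for pair, count in pair_count.items()}
    p = Counter()
    for (a, _), freq in q.items():
        p[a] += freq
    return {(a, b): math.log2(freq / (p[a] * p[b]))
            for (a, b), freq in q.items()}


scores = blosum_like_scores(["AC", "AD", "CC"])
for pair in sorted(scores):
    print(pair, round(scores[pair], 2))
```

Pairs that co-occur in a column more often than expected from the residue frequencies get a positive score, rarer pairs a negative one.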

Page 76: Sequence comparison and Phylogeny

LCS Algorithm (Longest Common Subsequence) Revisited

Algorithm (Dynamic Programming) with Substitution Matrix:
Insert a row 0 and a column 0 initialised with 0
Starting from the top left, move down row by row from row 1 and right column by column from column 1, visiting each cell
Consider
  The value of the cell north
  The value of the cell west
  s + the value of the cell northwest, where s is the value in the substitution matrix for the residues in row/column
Put down the maximum of these values as the value for the current cell
Trace back the path with the highest values from the bottom right to the top left and output the alignment

Page 77: Sequence comparison and Phylogeny

LCS Revisited: Formally

What is the score $s(V,W)$ of the best alignment of two sequences $V = v_1 \ldots v_n$ and $W = w_1 \ldots w_m$?

Find sequences of indices $1 \le i_1 < \ldots < i_k \le n$ and $1 \le j_1 < \ldots < j_k \le m$ such that $v_{i_t}$ is aligned with $w_{j_t}$ for $1 \le t \le k$

How? Dynamic programming: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

\[
s_{i,j} = \max \left\{ \begin{array}{l}
s_{i-1,j} \\
s_{i,j-1} \\
s_{i-1,j-1} + t
\end{array} \right.
\]

where $t$ is the value for $v_i$ and $w_j$ in the substitution matrix

Then $s(V,W) = s_{n,m}$ is the score of the best alignment, generalising the length of the LCS

Page 78: Sequence comparison and Phylogeny

Dynamic programming revisited: local and global alignments and gaps

Page 79: Sequence comparison and Phylogeny

Evolution and Alignments

Alignments can be interpreted in evolutionary terms
Identical letters are aligned
  Interpretation: part of the same ancestral sequence and not changed
Non-identical letters are aligned (substitution)
  Interpretation: mutation
Gaps
  Interpretation: insertions and deletions (indels)

Page 80: Sequence comparison and Phylogeny

Evolution and Alignments

Specific problems aligning DNA:
“Frame shift”:
  DNA triplets code for amino acids
  An indel of one nucleotide shifts the whole sequence of triplets
  Thus it may have a global effect and change all coded amino acids
Silent mutation:
  Substitution in DNA leaves the encoded amino acid unchanged
Nonsense mutation:
  Substitution to a stop codon

Page 81: Sequence comparison and Phylogeny

Local and Global Alignments

Global alignment (Needleman-Wunsch) algorithm finds the overall best alignment
  Example: members of a protein family, e.g. globins are very conserved and have the same length in different organisms from fruit fly to humans
Local alignment (Smith-Waterman) algorithm finds the locally best alignment
  Most widely used, as
  e.g. genes from different organisms retain similar exons, but may have different introns
  e.g. the homeobox gene, which regulates embryonic development, occurs in many species, but is very different apart from one region called the homeodomain
  e.g. proteins share some domains, but not all

Page 82: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 82

Local Alignment

LCS s(V,W) computes the globally best alignment.
Often it is better to maximise locally, i.e. compute the maximal $s(v_i \dots v_{i'}, w_j \dots w_{j'})$ over all substrings of V and W.
Can we adapt the algorithm?
  Global alignment = longest path in matrix s from (0,0) to (n,m)
  Local alignment = longest path in matrix s from any (i,j) to any (i',j')
  Modify the definition of s by adding an edge of weight 0 from the source to every other vertex, creating a free "jump" to any starting position (i,j)

Page 83: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 83

Local Alignment

Modify the definition of s as follows: $s_{i,0} = s_{0,j} = 0$ for all $1 \le i \le n$ and $1 \le j \le m$, and

$$
s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + t \end{cases}
$$

where t is the value for $v_i$, $w_j$ in the substitution matrix.

Then $s(V,W) = \max_{i,j} \{ s_{i,j} \}$ is the length of the local LCS.

This computes the longest path in the edit graph.
Several local alignments may have biological significance (consider e.g. two multi-domain proteins whose domains are re-ordered).
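A minimal Python sketch of this local recurrence follows. The function name and the scoring values (match +1, mismatch -1) are assumptions for illustration; the optional `gap` parameter is also an addition, and with its default of 0 the code corresponds to the recurrence as stated on the slide.

```python
def local_alignment_score(v: str, w: str, match: int = 1,
                          mismatch: int = -1, gap: int = 0) -> int:
    """Best local alignment score of v and w (score only, no traceback)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed substitution values instead of a full matrix.
            t = match if v[i - 1] == w[j - 1] else mismatch
            # The extra 0 is the free "jump": an alignment may start anywhere.
            s[i][j] = max(0,
                          s[i - 1][j] - gap,
                          s[i][j - 1] - gap,
                          s[i - 1][j - 1] + t)
            best = max(best, s[i][j])
    return best
```

Unlike the global case, the result is the maximum over all cells rather than the value in the corner cell (n,m).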

Page 84: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 84

Aligning with Gap Penalties

A gap is a sequence of spaces in an alignment.
So far, we have considered only insertion and deletion of single nucleotides or amino acids, creating alignments with many gaps.
So far, the score of a gap of length l is l.
Because insertion/deletion of monomers is an evolutionarily slow process, large numbers of gaps do not make sense.
Instead, whole substrings will be deleted or inserted.
We can generalise the score of a gap to a score function A + B l, where A is the penalty to open the gap and B is the penalty to extend the gap.

Page 85: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 85

Aligning with Gap Penalties

High gap penalties result in shorter, lower-scoring alignments with fewer gaps.
Lower gap penalties give higher-scoring, longer alignments with more gaps.
Gap opening penalty A mainly influences the number of gaps.
Gap extension penalty B mainly influences the length of gaps.
E.g. if interested in close relationships, choose A and B above the default values; for distant relationships, choose them below the default values.

Page 86: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 86

Aligning with Gap Penalties

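Adapt the definition of s as follows:

$$
s^{\mathrm{del}}_{i,j} = \max \begin{cases} s^{\mathrm{del}}_{i-1,j} - B \\ s_{i-1,j} - (A+B) \end{cases}
$$

$$
s^{\mathrm{ins}}_{i,j} = \max \begin{cases} s^{\mathrm{ins}}_{i,j-1} - B \\ s_{i,j-1} - (A+B) \end{cases}
$$

$$
s_{i,j} = \max \begin{cases} 0 \\ s^{\mathrm{del}}_{i,j} \\ s^{\mathrm{ins}}_{i,j} \\ s_{i-1,j-1} + t \end{cases}
$$

where t is the value for $v_i$, $w_j$ in the substitution matrix.

Then $s(V,W) = \max_{i,j} \{ s_{i,j} \}$ is the length of the local LCS with gap penalties A and B.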
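The three coupled recurrences translate directly into three tables. The following sketch assumes illustrative values (A = 2, B = 1, match +2, mismatch -1) and a hypothetical function name; it returns only the score, not the alignment.

```python
NEG_INF = float("-inf")


def local_affine_score(v: str, w: str, A: float = 2, B: float = 1,
                       match: float = 2, mismatch: float = -1) -> float:
    """Best local alignment score where a gap of length l costs A + B*l."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    s_del = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends in a deletion
    s_ins = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignment ends in an insertion
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an open gap (cost B) or open a new one (cost A + B).
            s_del[i][j] = max(s_del[i - 1][j] - B, s[i - 1][j] - (A + B))
            s_ins[i][j] = max(s_ins[i][j - 1] - B, s[i][j - 1] - (A + B))
            t = match if v[i - 1] == w[j - 1] else mismatch
            s[i][j] = max(0, s_del[i][j], s_ins[i][j], s[i - 1][j - 1] + t)
            best = max(best, s[i][j])
    return best
```

For example, aligning AAAACCCCGGGG with AAAAGGGG under these values gives eight matches (16) minus one gap of length 4 (2 + 4 x 1 = 6), i.e. a score of 10, which beats either ungapped block of four matches (8).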

Page 87: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 87

FASTA and BLAST

Page 88: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 88

Motivation

As in dot plots, the underlying data structure for dynamic programming is a table.
Given two sequences of length n, dynamic programming takes time proportional to $n^2$.
Given a database with m sequences, comparing a query sequence to the whole database takes time proportional to $m n^2$.
What does this mean?
  Imagine you need to fill in the tables by hand and it takes 10 seconds to fill in one cell.
  Assume there are 1,000,000 sequences, each 100 amino acids long.
  How long does it take?

Page 89: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 89

1,000,000 x 100 x 100 x 10 sec = $10^{11}$ sec = 27,777,778 h = 1,157,407 days = ca. 3170 years

Even if a computer does not take 10 sec, but just 0.1 ms to fill in one cell, it would still take 12 days.
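The arithmetic can be checked with a short calculation. The helper name `search_time_seconds` is hypothetical; the numbers are the ones given above.

```python
def search_time_seconds(num_sequences: int, seq_len: int,
                        seconds_per_cell: float) -> float:
    """Time to fill one seq_len x seq_len table per database sequence."""
    return num_sequences * seq_len * seq_len * seconds_per_cell


by_hand = search_time_seconds(1_000_000, 100, 10)       # 10 s per cell
by_machine = search_time_seconds(1_000_000, 100, 1e-4)  # 0.1 ms per cell

years_by_hand = by_hand / 3600 / 24 / 365
days_by_machine = by_machine / 3600 / 24
```

The machine figure comes out at roughly 11.6 days, which the slide rounds to 12.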

We cannot do anything about the database size, but can we do something about the table size?

Page 90: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 90

An idea: Prune the search space

Page 91: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 91

Another idea

Did we formulate the problem correctly?
Do we need the alignments for all sequences in the database?
No, only for "reasonable" hits: introduce a threshold.
A "reasonable" alignment will contain short stretches of perfect matches.
Find these first, then extend and connect them as well as possible.

Page 92: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 92

FASTA and BLAST

FASTA and BLAST are faster than dynamic programming (5 times and 50 times, respectively).
Underlying idea for a heuristic: high-scoring alignments will contain short stretches of identical letters, called words.
FASTA and BLAST first search for matches of words of a given length and score threshold:
  BLAST for words of length 3 for proteins and 11 for DNA
  FASTA for words of length 2 for proteins and 6 for DNA
Next, matches are extended to local (BLAST) and global (FASTA) alignments.

Page 93: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 93

FASTA and BLAST

More formally:
If the strings $V = v_1 \dots v_m$ and $W = w_1 \dots w_m$ match with at most k mismatches, then they share a p-tuple for $p = \lfloor m/(k+1) \rfloor$, i.e. $v_i \dots v_{i+p-1} = w_j \dots w_{j+p-1}$ for some $1 \le i,j \le m-p+1$. (Reason: k mismatches cannot hit all of the k+1 disjoint blocks of length p.)
Filtration algorithm, which detects all matching words of length m with up to k mismatches:
  Potential match detection: find all matches of p-tuples of V and W (can be done in linear time by inserting them into a hash table).
  Potential match verification: verify each potential match by extending it to the left and right until either the first k+1 mismatches are found or the beginning or end of the sequences is reached.
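The two phases can be sketched as follows for the common setting of searching a pattern of length m in a longer text. This is an illustration under assumptions, not the FASTA or BLAST implementation: the function name is hypothetical, and verification is done by counting mismatches over the whole window, which gives the same hits as extending left and right until k+1 mismatches are seen.

```python
from collections import defaultdict


def approx_matches(pattern: str, text: str, k: int) -> list[int]:
    """Start positions in text where pattern occurs with at most k mismatches."""
    m = len(pattern)
    p = m // (k + 1)
    if p == 0:
        raise ValueError("pattern too short for k mismatches")

    # Potential match detection: hash every p-tuple of the text.
    index = defaultdict(list)
    for j in range(len(text) - p + 1):
        index[text[j:j + p]].append(j)

    hits = set()
    # k+1 disjoint blocks of the pattern; at least one must match exactly.
    for block in range(k + 1):
        i = block * p
        for j in index.get(pattern[i:i + p], ()):
            start = j - i
            if start < 0 or start + m > len(text):
                continue
            # Potential match verification: count mismatches in the window.
            window = text[start:start + m]
            if sum(a != b for a, b in zip(pattern, window)) <= k:
                hits.add(start)
    return sorted(hits)
```

Only windows that share an exact p-tuple with the pattern are ever verified, which is where the speed-up over filling a full table comes from.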

Page 94: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 94

Example for BLAST

Search SWISSPROT for Immunoglobulin:

SWISS_PROT:C79A_HUMAN P11912

Page 95: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 95

Example for BLAST

Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for P11912

Database: All non-redundant SwissProt sequences

1,292,592 sequences; 412,925,052 total letters

Page 96: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 96

Example for BLAST Distribution of Hits:

Page 97: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 97

Example for BLAST: Top Hits

Sequences producing significant alignments:                     Score  E-value
gi|547896|sp|P11912|C79A_HUMAN B-cell antigen receptor comp...    473  e-133
gi|728993|sp|P40293|C79A_BOVIN B-cell antigen receptor comp...    312  3e-85
gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor comp...    278  5e-75
gi|728994|sp|P40259|C79B_HUMAN B-cell antigen receptor comp...     55  1e-07
gi|125781|sp|P01618|KV1_CANFA IG KAPPA CHAIN V REGION GOM          38  0.019
gi|125361|sp|P17948|VGR1_HUMAN Vascular endothelial growth ...     37  0.042
gi|549319|sp|P35969|VGR1_MOUSE Vascular endothelial growth ...     36  0.052
gi|114764|sp|P15530|C79B_MOUSE B-cell antigen receptor comp...     36  0.064
gi|1718161|sp|P53767|VGR1_RAT Vascular endothelial growth f...     35  0.080
gi|125735|sp|P01681|KV01_RAT Ig kappa chain V region S211          35  0.095
gi|1730075|sp|P01625|KV4A_HUMAN IG KAPPA CHAIN V-IV REGION LEN     34  0.26
gi|1718188|sp|P52583|VGR2_COTJA Vascular endothelial growth...     33  0.28
gi|125833|sp|P06313|KV4B_HUMAN IG KAPPA CHAIN V-IV REGION J...     33  0.30
gi|125806|sp|P01658|KV3F_MOUSE IG KAPPA CHAIN V-III REGION ...     33  0.30
gi|125808|sp|P01659|KV3G_MOUSE IG KAPPA CHAIN V-III REGION ...     33  0.30
gi|1172451|sp|Q05793|PGBM_MOUSE Basement membrane-specific ...     33  0.33
gi|125850|sp|P01648|KV5O_MOUSE Ig kappa chain V-V region HP...     33  0.36
gi|125830|sp|P06312|KV40_HUMAN Ig kappa chain V-IV region p...     33  0.38
gi|2501738|sp|Q06639|YD03_YEAST Putative 101.7 kDa transcri...     33  0.41

Page 98: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 98

Example for BLAST: Alignment

>gi|126779|sp|P11911|C79A_MOUSE B-cell antigen receptor complex associated protein alpha-chain precursor (IG-alpha) (MB-1 membrane glycoprotein) (Surface-IGM-associated protein) (Membrane-bound immunoglobulin associated protein) (CD79A)
Length = 220

Score = 278 bits (711), Expect = 5e-75
Identities = 150/226 (66%), Positives = 165/226 (73%), Gaps = 6/226 (2%)

Query: 1   MPGGPGVLQALPATIFLLFLLSAVYLGPGCQALWMHKVPASLMVSLGEDAHFQCPHNSSN 60
           MPGG          + LL  LS   LGPGCQAL +   P SL V+LGE+A   C  N+
Sbjct: 1   MPGG----LEALRALPLLLFLSYACLGPGCQALRVEGGPPSLTVNLGEEARLTC-ENNGR 55

Query: 61  NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQEGNESYQQSCG 120
           N N+TWW  L  N TWPP  LGPG+   G L    VNK+ G    C+V E N   ++SCG
Sbjct: 56  NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTGCQVIE-NNILKRSCG 114

Query: 121 TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKLGLDAGD 180
           TYLRVR P PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK G+D  D
Sbjct: 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEKFGVDMPD 174

Query: 181 EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP 226
           +YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG+L+IGD QLEKP
Sbjct: 175 DYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGNLHIGDAQLEKP 220

Page 99: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 99

Example for BLAST: Lineage Report

root
. cellular organisms
. . Eukaryota [eukaryotes]
. . . Fungi/Metazoa group [eukaryotes]
. . . . Bilateria [animals]
. . . . . Coelomata [animals]
. . . . . . Gnathostomata [vertebrates]
. . . . . . . Tetrapoda [vertebrates]
. . . . . . . . Amniota [vertebrates]
. . . . . . . . . Eutheria [mammals]
. . . . . . . . . . Homo sapiens (man) ---------------------- 473 33 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Bos taurus (bovine) ..................... 312 2 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Mus musculus (mouse) .................... 278 31 hits [mammals] B-cell antigen receptor complex associated protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs) ................. 37 1 hit [mammals] IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat) ........... 35 7 hits [mammals] Vascular endothelial growth factor receptor 1 precursor (VE
. . . . . . . . . . Oryctolagus cuniculus (domestic rabbit) . 29 1 hit [mammals] IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica ------------------------- 33 2 hits [birds] Vascular endothelial growth factor receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens) .................. 31 4 hits [birds] CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA PRECURSOR (CNTFR
. . . . . . . . Xenopus laevis (clawed frog) ---------------- 30 2 hits [amphibians] Neural cell adhesion molecule 1, 180 kDa isoform precursor
. . . . . . . Heterodontus francisci ------------------------ 28 1 hit [sharks and rays] Myelin P0 protein precursor (Myelin protein zero) (Myelin p
. . . . . . Drosophila melanogaster ------------------------- 30 2 hits [flies] Neuroglian precursor
. . . . . Caenorhabditis elegans ---------------------------- 29 1 hit [nematodes] Hypothetical protein F59B2.12 in chromosome III
. . . . Saccharomyces cerevisiae (brewer's yeast) ----------- 33 1 hit [ascomycetes] Putative 101.7 kDa transcriptional regulatory protein in PR
. . . Marchantia polymorpha --------------------------------- 29 1 hit [liverworts] Succinate dehydrogenase cytochrome b560 subunit (Succinate
. . Agrobacterium tumefaciens str. C58 ---------------------- 28 1 hit [a-proteobacteria] Formamidopyrimidine-DNA glycosylase (Fapy-DNA glycosylase)
. Human adenovirus type 3 ----------------------------------- 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN
. Human adenovirus type 7 ................................... 30 1 hit [viruses] EARLY E3 20.5 KD GLYCOPROTEIN

Page 100: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 100

How good is an alignment?

Be careful: Fitch/Smith found 17 alignments for alpha- and beta-chains in chicken haemoglobins. Only one is the correct one (according to the structure).

Given an alignment, how good is it?
Percentage of matching residues, i.e. number of matches divided by the length of the shorter sequence.
  Advantage: independent of sequence length
  E.g.
    AT-C-TGAT
    -TGCAT-A-
    4/6 = 66.67%
More general: also consider gaps, extensions, …
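As a small illustration of the percentage measure, the sketch below counts matching columns in a gapped alignment and divides by the length of the shorter ungapped sequence. The function name is hypothetical and gaps are assumed to be written as "-".

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Matches divided by the length of the shorter sequence, in percent."""
    # A column is a match if both rows carry the same residue (not a gap).
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    len_a = len(aligned_a.replace("-", ""))
    len_b = len(aligned_b.replace("-", ""))
    return 100.0 * matches / min(len_a, len_b)
```

For the alignment of ATCTGAT with TGCATA shown on the slide this gives 4 matches over a shorter length of 6.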

Page 101: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 101

Blast Raw Score

R = a I + b X - c O - d G, where
  I is the number of identities in the alignment and a is the reward for each identity
  X is the number of mismatches in the alignment and b is the "reward" for each mismatch
  O is the number of gaps and c is the penalty for each gap
  G is the number of "-" characters in the alignment and d is the penalty for each of them

The values for a, b, c, d appear at the bottom of a BLAST report. For BLASTn they are a=1, b=-3, c=5, d=2.

Page 102: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 102

Example

Query: 1   atgctctggccacggcacttgcgga
           |||||||||||||||  |||| |||
Sbjct: 107 atgctctggccacggatcttgtgga

           tcccagggtgatctgtgcacctgcgata 53
           |||||   |||| |||||||||||||||
           tccca---tgatatgtgcacctgcgata 156

R = 1 x 46 + (-3) x 4 - 5 x 1 - 2 x 3 = 23
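The counts I, X, O and G can be read off the aligned strings mechanically. The sketch below is an illustration with a hypothetical function name; it uses the parameter values quoted for BLASTn (a=1, b=-3, c=5, d=2) as defaults and treats a run of consecutive gap columns as one gap.

```python
def blast_raw_score(query: str, sbjct: str,
                    a: int = 1, b: int = -3, c: int = 5, d: int = 2) -> int:
    """Raw score R = a*I + b*X - c*O - d*G of a gapped pairwise alignment."""
    identities = mismatches = gap_opens = gap_chars = 0
    in_gap = False
    for q, s in zip(query, sbjct):
        if q == "-" or s == "-":
            gap_chars += 1
            if not in_gap:          # first "-" of a run opens a new gap
                gap_opens += 1
            in_gap = True
        else:
            in_gap = False
            if q == s:
                identities += 1
            else:
                mismatches += 1
    return a * identities + b * mismatches - c * gap_opens - d * gap_chars


query = "atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata"
sbjct = "atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata"
```

Applied to the example alignment, the function counts 46 identities, 4 mismatches, 1 gap and 3 gap characters.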

So, given the scores: how significant is the alignment?

Page 103: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 103

Significance of an alignment

Significance of an alignment needs to be defined with respect to a control population

Pairwise alignment: how can we get a control population?
  Generate sequences randomly? Not a good model of real sequences.
  Chop up both sequences and randomly reassemble them.
Database search: how can we get a control population?
  Control = whole database
Align the sequence to the control population and see how good the result is in comparison.
This is captured by Z-scores, P-values and E-values.

Page 104: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 104

Z-score

Z-score normalises the score S:
Let m be the mean of the population and std its standard deviation, then Z-score = (S - m) / std.
A Z-score of 0 is no better than average, hence the alignment might have occurred by chance.
The higher the Z-score, the better.
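Given the scores of a control population, the Z-score is a one-line computation. The sketch below uses a hypothetical function name and assumes the population standard deviation is meant (`statistics.pstdev`); the control scores in the usage note are invented numbers.

```python
import statistics


def z_score(score: float, control_scores: list[float]) -> float:
    """(S - m) / std, with m and std taken from the control population."""
    m = statistics.mean(control_scores)
    std = statistics.pstdev(control_scores)  # population standard deviation
    return (score - m) / std
```

For instance, with invented control scores 2, 4, 4, 4, 5, 5, 7, 9 (mean 5, standard deviation 2), an alignment score of 10 has a Z-score of 2.5.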

Page 105: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 105

P-value

P-value: probability of obtaining a score ≥ S
Range: 0 ≤ P ≤ 1
Let m be the number of sequences in the control population with score ≥ S.
Let p be the size of the control population.
Then P-value = m / p.
Rule of thumb:
  $P \le 10^{-100}$: exact match
  $10^{-100} \le P \le 10^{-50}$: nearly identical (SNPs)
  $10^{-50} \le P \le 10^{-10}$: homology certain
  $10^{-5} \le P \le 10^{-1}$: usually distant relative
  $P > 10^{-1}$: probably insignificant

Page 106: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 106

E-values

E-value also takes the database into account.
E-value = expected frequency of a score ≥ S
Range: 0 ≤ E ≤ m, where m is the size of the database
Relationship to P: E = m P
E-values are calculated from
  the bit score
  the length of the query
  the size of the database

Page 107: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 107

BLAST Bit score

The bit score normalizes the raw score R to make scores under different settings comparable.
The bit score S is obtained from the raw score R as follows:
  S = (lambda x R - ln(K)) / ln(2), where lambda = 1.37 and K = 0.711
Example: S = (1.37 x 23 - ln(0.711)) / ln(2) = 46

Page 108: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 108

E-value

The E-value is then calculated as follows: $E = m \times n \times 2^{-S}$, where
  m is the effective length of the query
  n is the effective length of the database
  S is the bit score
(The effective length takes into account that an alignment cannot start at the end of a sequence.)
Example: m = 34 (19 nucleotides fewer than the 53 submitted), n = 5,854,611,841
Result: E = 0.003
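Both formulas can be checked numerically with the values from the example. The function names are hypothetical; lambda and K are the values quoted on the bit-score slide.

```python
import math


def bit_score(raw: float, lam: float = 1.37, K: float = 0.711) -> float:
    """S = (lambda * R - ln K) / ln 2."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_value(m: int, n: int, bits: float) -> float:
    """E = m * n * 2^(-S) for effective query length m and database length n."""
    return m * n * 2.0 ** (-bits)


S = bit_score(23)                    # raw score of the example alignment
E = e_value(34, 5_854_611_841, S)
```

The unrounded bit score is just under 46, and the resulting E-value rounds to 0.003.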

Page 109: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 109

Precision and Recall

How good are BLAST and FASTA?
  True positives, tp = hits which are biologically meaningful
  False positives, fp = hits which are not biologically meaningful
  True negatives, tn = non-hits which are not biologically meaningful
  False negatives, fn = non-hits which are biologically meaningful
Minimise fp and fn.
Recall: tp/(tp+fn) (meaningful hits / all meaningful)
Precision: tp/(tp+fp) (meaningful hits / all hits)
But: since no objective data are available, it is difficult to judge BLAST and FASTA's sensitivity and specificity.
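The two ratios are easy to compute once hits have been labelled. The sketch below uses a hypothetical function name and returns 0.0 when a denominator is empty, which is a convention chosen here rather than something the slide specifies.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from counts of labelled hits."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With invented counts of 8 meaningful hits, 2 spurious hits and 4 missed relatives, precision is 8/10 and recall is 8/12.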

Page 110: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 110

Multiple Sequence Alignments

Page 111: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 111

Multiple Sequence Alignment

Align more than two sequences.
Choice of sequences:
  If too closely related, then largely redundant
  If very distantly related, then difficult to generate a good alignment
Additionally use colour for residues with similar properties:

Colour   Property            Residues
Yellow   Small polar         Gly, Ala, Ser, Thr
Green    Hydrophobic         Cys, Val, Ile, Leu, Pro, Phe, Tyr, Met, Trp
Magenta  Polar               Asn, Gln, His
Red      Negatively charged  Asp, Glu
Blue     Positively charged  Lys, Arg

Page 112: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 112

Thioredoxins: WCGPC[K or R] motif

Page 113: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 113

Thioredoxins: Gly/Pro = turn

Page 114: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 114

Thioredoxins: every second hydrophobic = beta strand

Page 115: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 115

Thioredoxins: ca. every 4th hydrophobic = alpha helix

Page 116: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 116

Page 117: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 117

Profiles, PSI-Blast, HMM

Page 118: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 118

Profiles

Derive a profile from a multiple sequence alignment.
Useful to
  Align distantly related sequences
  Find conserved regions, which may indicate an active site
  Classify subfamilies within homologues
How can a profile be used to search?
  Insist on the profile (such as WCGPC)? Too strict
  Use the frequency distribution of the profile…

Page 119: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 119

Consider frequencies

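Score for VDFSAS = 13+16+16+7+16+7 = 75
Score for ADATAA = 1+16+0+1+16+0 = 34
Not good to pick up distant relationships
Better: combine with substitution matrix
Result: position specific substitution matrix

[Table: residue counts for each of the 20 amino acids (columns A C D E F G H I K L M N P Q R S T V W Y) at alignment positions 25 to 30 (rows); each row sums to 16; column layout lost in extraction]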
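Scoring a word against a frequency profile is a sum of per-position counts. The sketch below is an illustration only: `PROFILE` contains just the counts that can be read off the two worked sums above (all other residues are treated as 0, so scores for other words would be underestimates), and the position labels 25 to 30 are taken from the table residue.

```python
# Partial profile: position -> {residue: count}. Only the entries used in
# the two worked examples are known; everything else defaults to 0.
PROFILE = {
    25: {"V": 13, "A": 1},
    26: {"D": 16},
    27: {"F": 16},
    28: {"S": 7, "T": 1},
    29: {"A": 16},
    30: {"S": 7},
}


def profile_score(word: str, profile: dict = PROFILE, start: int = 25) -> int:
    """Sum of the counts of each residue at its alignment position."""
    return sum(profile[start + k].get(res, 0) for k, res in enumerate(word))
```

ADATAA scores poorly because several of its residues were never observed at those positions, even though alanine may be a plausible substitute there; this is the motivation for combining the counts with a substitution matrix.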

Page 120: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 120

PSI-Blast

The globin family (oxygen transport) of proteins occurs in many species.
Proteins have the same function and structure.
But there are pairs of members of the family sharing less than 10% identical residues.
PSI-BLAST idea: a score via intermediaries may be better than the score from direct comparison.

[Figure: three proteins A, B, C; A and B share 50% identity, B and C share 50%, A and C only 10%]

Page 121: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 121

PSI-BLAST

PSI-BLAST:
1. BLAST
2. Collect top hits
3. Build multiple sequence alignment from significant local matches
4. Build profile
5. Re-probe database with profile
6. Go back to 2.

Page 122: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 122

PSI-BLAST

But beware of PSI-BLAST:
  False positives propagate and spread through iterations.
  If protein A consists of domains D and E, protein B of domains E and F, and protein C of domain F, then PSI-BLAST will relate A and C although they do not share any domain.

Page 123: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 123

Hidden Markov Model

Procedure to generate sequences
State transition system with three types of states:
  Deletion
  Insertion
  Match, which emits residues
Follow probability distribution for successor state
Train model on multiple sequence alignment

[Figure: state diagram from start to end with three columns, each containing a del, a match and an ins state]

Page 124: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 124

Summary

Evolutionary model: indels and substitutions
Homology vs. similarity
Dot plots
  Easy visual exploration, but not scalable
Dynamic programming
  Local, global, gaps
Substitution matrices (PAM, BLOSUM)
BLAST and FASTA
  Scores and significance
Multiple sequence alignments
  Profiles, PSI-BLAST, HMM

Page 125: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 125

Phylogeny

Page 126: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 126

Motivation

How did the nucleus get into the eucaryotic cells?

Where do we come from? Recent-Africa vs. multi-regional hypothesis

In 1999, encephalitis caused by the West Nile virus broke out in New York. How did the virus come to New York?

Page 127: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 127

How did the nucleus get into the eucaryotic cells?

Simple experiment:
  BLAST classes of genes with related functions in yeast (a eucaryote) against Bacteria and against Archaea
  and count the number of significant hits

Page 128: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 128

How did the nucleus get into the eucaryotic cells?

Mitochondria and energy metabolism: significantly more hits in Bacteria
Cell organisation: significantly more hits in Archaea
Fundamental result without any experiment!

[Figure legend: blue = Bacteria, grey = Archaea]

Page 129: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 129

Phylogeny

Taxonomists aim to classify and group organisms

E.g. Aristotle, De Partibus Animalium:
  "Ought we, for instance, to begin by discussing each separate species – man, lion, ox, and the like – taking each kind in hand independently of the rest, or ought we rather to deal first with the attributes which they have in common in virtue of some common element of their nature, and proceed from this as a basis for consideration of them separately?"

Page 130: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 130

Schools of Taxonomists

Goal: create taxonomy
Approach:
  Phenotype
  Phylogeny
3 schools:
  Phenotype only
  Evolutionary taxonomists: phenotype (+ phylogeny)
  Cladists: phylogeny (+ phenotype)

Page 131: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 131

Practical Application: West Nile virus in NY

West Nile virus occurs mainly in Africa
Transmitted by insects and birds
How did the virus get to NY in 1999?
Hundreds of DNA samples taken
All 99.8% identical: a single entry to NY!
A phylogenetic tree allows the origin to be deduced

Page 132: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 132

Example: West Nile virus in NY

How can the trees be constructed?

Page 133: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 133

Three Methods to Generate Phylogenetic Trees

Distance-based
  Hierarchical clustering
Character-based
  Parsimony
  Maximum likelihood

Page 134: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 134

Distance-based Approach

Single alignment:

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata

Score: 46 matches, 4 mismatches, 1 gap, 3 gap extensions, e.g. score = 46x1 - 4x1 - 1x2 - 3x1 = 37

Approach:
  Define a distance between two sequences, e.g. the percentage of mismatches in their alignment
  Construct a tree which iteratively groups together the sequences with minimal distances

Page 135: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 135

Hierarchical Clustering

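Distance matrix D (upper triangle):

      1   2   3   4   5
1     0   2   6  10   9
2         0   5   9   8
3             0   4   5
4                 0   3
5                     0

After merging 1 and 2 (distance 2):

        (1,2)   3   4   5
(1,2)     0     5   9   8
3               0   4   5
4                   0   3
5                       0

After merging 4 and 5 (distance 3):

        (1,2)   3   (4,5)
(1,2)     0     5     8
3               0     4
(4,5)                 0

After merging 3 and (4,5) (distance 4):

            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0

[Figure: dendrogram over leaves 1 2 3 4 5 with height axis from 0 to 5]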

Page 136: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 136

Hierarchical Clustering

Given a distance matrix D = (d_ij) with 1 ≤ i,j ≤ n
Result: a binary tree of clusters
Init:
  ToDo := {}
  For all i in { 1,…, n } do
    Let t_i be a tree without children, i.e. a leaf
    ToDo := ToDo ∪ { t_i }
Main loop:
  While |ToDo| > 1 do
    Find i,j with i ≠ j such that d_ij is minimal
    Add a new column and row labelled k := (i,j) to D
    For all indices h of D apart from k,i,j do
      d_h,k = d_k,h := min { d_h,i , d_h,j } // min = single linkage
    Let t_k be a new tree with children t_i and t_j
    ToDo := ( ToDo ∪ { t_k } ) - { t_i , t_j }
    Remove columns and rows i,j from D
Complexity: O(n^2) with a suitable implementation; the loop as written rescans D for each of the n-1 merges and takes O(n^3)
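The procedure can be sketched in Python as follows. This is a naive illustration with hypothetical names, not an optimised implementation: clusters are represented as nested tuples, distances are stored in a dictionary keyed by unordered pairs, and the closest pair is found by rescanning all pairs in every round.

```python
from itertools import combinations


def hierarchical_clustering(labels, distances, linkage=min):
    """Return (tree, merges); linkage=min is single, max is complete linkage."""
    d = dict(distances)      # frozenset({x, y}) -> distance
    todo = list(labels)
    merges = []              # (new cluster, distance at which it was formed)
    while len(todo) > 1:
        # Find the closest pair of current clusters.
        i, j = min(combinations(todo, 2), key=lambda pair: d[frozenset(pair)])
        k = (i, j)
        merges.append((k, d[frozenset((i, j))]))
        # Distance from every remaining cluster h to the new cluster k.
        for h in todo:
            if h != i and h != j:
                d[frozenset((h, k))] = linkage(d[frozenset((h, i))],
                                               d[frozenset((h, j))])
        todo = [t for t in todo if t != i and t != j] + [k]
    return todo[0], merges


# Distance matrix of the five-sequence example.
pairs = {(1, 2): 2, (1, 3): 6, (1, 4): 10, (1, 5): 9, (2, 3): 5,
         (2, 4): 9, (2, 5): 8, (3, 4): 4, (3, 5): 5, (4, 5): 3}
D = {frozenset(p): v for p, v in pairs.items()}
tree, merges = hierarchical_clustering([1, 2, 3, 4, 5], D)
```

On the five-sequence example the merges happen at distances 2, 3, 4 and 5, reproducing the sequence of reduced matrices shown earlier. Passing `linkage=max` switches to complete linkage.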

Page 137: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 137

Hierarchical Clustering: How to define distance between clusters?

Single linkage: d_h,k = d_k,h := min { d_h,i , d_h,j }
  Example: distance of (A,B) to C is 1
Complete linkage: d_h,k = d_k,h := max { d_h,i , d_h,j }
  Example: distance of (A,B) to C is 2
Average linkage: d_h,k = d_k,h := 0.5 d_h,i + 0.5 d_h,j
  Example: distance of (A,B) to C is 1.5
Are dendrograms always the same, independent of the linkage method?

    A   B   C
A   0   1   2
B       0   1
C           0

[Figure: two dendrograms over leaves A B C]

Page 138: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 138

Parsimony method

Approach: generate the "smallest" tree containing all the sequences as leaves

Seq  1 2 3 4 5 6
a    G G G G G G
b    G G G A G T
c    G G A T A G
d    G A T C A T

[Figure: tree with leaves a GGGGGG, b GGGAGT, c GGATAG, d GATCAT; branches labelled with site and substitution: 3 G->A, 4 G->T, 5 G->A, 2 G->A, 3 T->A, 4 G->A, 4 T->C, 6 G->T, 6 G->T]

Page 139: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 139

Parsimony

Generate smallest tree
Informative vs. non-informative sites
Build pairs with fewest possible substitutions
Example:
  3 possible trees: ((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
  Sites 1, 2, 3, 4 are not informative
  Sites 5, 6 are informative
    5: ((a,b),(c,d))
    6: ((a,c),(b,d))

Seq  1 2 3 4 5 6
a    G G G G G G
b    G G G A G T
c    G G A T A G
d    G A T C A T
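Which sites are informative can be determined mechanically. The sketch below assumes the usual definition (a site is informative if at least two different characters each occur in at least two sequences); the function name is hypothetical.

```python
from collections import Counter


def informative_sites(seqs: dict[str, str]) -> list[int]:
    """1-based positions where at least two characters occur twice or more."""
    sites = []
    for pos, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


sequences = {"a": "GGGGGG", "b": "GGGAGT", "c": "GGATAG", "d": "GATCAT"}
```

For the four example sequences only sites 5 and 6 qualify: at all other sites every candidate tree needs the same number of substitutions, so they cannot discriminate between topologies.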

Page 140: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 140

Maximum likelihood

Assigns quantitative probabilities to mutation events
Reconstructs ancestors for all nodes in the tree
Assigns branch lengths based on probabilities of the mutational events
For each possible tree topology, the assumed substitution rates are varied to find the parameters that give the highest likelihood of producing the observed data

Page 141: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 141

Problems

Character-based methods tend to be better (based on paleontological data)
All make assumptions:
  No back mutations
  Same evolutionary rate

Page 142: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 142

Assessing Quality: Bootstrapping

Given a tree obtained from one of the methods above
Generate multiple alignment
For a number of iterations:
  Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
  Generate a tree for the new sequences
  Compare this new tree with the given tree
  For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased
Bootstrap value = percentage of trees containing the same cluster
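The resampling loop can be sketched as below. This is an illustration under assumptions: all names are hypothetical, and the tree-building step is left to a caller-supplied function `build_clusters` that takes an alignment and returns the set of clusters (as frozensets of sequence indices) in the resulting tree.

```python
import random


def resample_columns(alignment: list[str], rng: random.Random) -> list[str]:
    """Draw as many columns as the alignment has, with replacement."""
    length = len(alignment[0])
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def bootstrap_support(alignment, build_clusters, iterations=100, seed=0):
    """Percentage of replicate trees containing each cluster of the given tree."""
    rng = random.Random(seed)
    reference = build_clusters(alignment)   # clusters of the given tree
    counts = {cluster: 0 for cluster in reference}
    for _ in range(iterations):
        replicate = build_clusters(resample_columns(alignment, rng))
        for cluster in reference:
            if cluster in replicate:
                counts[cluster] += 1
    return {c: 100.0 * n / iterations for c, n in counts.items()}
```

Any of the tree-building methods above can be plugged in as `build_clusters`, as long as it reports clusters in the same representation for the original and the resampled alignment.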

Page 143: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 143

Where do we come from?

Recent-Africa hypothesis
  Homo sapiens came from Africa 100,000 to 200,000 years ago
Multi-regional hypothesis
  Ancestors of Homo sapiens left Africa ca. 2,000,000 years ago
Which one’s right?

Page 144: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 144

Experiment

Mitochondrial DNA from 53 humans in different regions sequenced

Outgroup = Mitochondrial DNA of chimpanzee

Page 145: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 145

A nice phylogeny (Nature 2004)

Nature October 2004 Volume 431 No. 7012

Page 146: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 146

Why Mitochondria?

Simple genetic structure
  No repetitions
  No pseudogenes
  No introns
No recombination

Page 147: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 147

Molecular Clock

Based on genetic and paleontological data, the most recent common ancestor (mrca) of chimpanzee and Homo sapiens dates back 5,000,000 years
Molecular clock: $1.7 \times 10^{-8}$ nucleotide changes per site and year
Assumption: equal distribution, no silent mutations
Diversity in Africa: $3.7 \times 10^{-3}$ nucleotide changes per site
Diversity outside Africa: $1.7 \times 10^{-3}$ nucleotide changes per site
Estimated expansion 1925 generations ago = ca. 40,000 years
Mrca of all humans: 171,500 +/- 50,000 years ago
Mrca of African and non-African: 52,000 +/- 27,500 years ago

Experiment supports recent-Africa hypothesis

Page 148: Sequence comparison and Phylogeny

By Michael Schroeder, Biotec, 2004 148

Summary

Schools of taxonomists
Assumptions made
Methods
  Distance-based
  Character-based