Homology and sequence alignment.


Page 1: Homology and sequence alignment.


Page 2: Homology and sequence alignment.

Homology

Homology = similarity between objects due to common ancestry.

Hund = Dog, Schwein = Pig

Page 3: Homology and sequence alignment.

Sequence homology

Similarity between sequences as a result of common ancestry.

VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

Page 4: Homology and sequence alignment.


Sequence alignment

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

Page 5: Homology and sequence alignment.

Why align?

VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV

1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).

2. Required for evolutionary studies (e.g., tree reconstruction).

3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

Page 6: Homology and sequence alignment.

Sequence alignment

If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree whose two leaves are VLSPAV-WAKV and VLSEAVLWAKV]

Page 7: Homology and sequence alignment.

Perfect match

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree whose two leaves are VLSPAV-WAKV and VLSEAVLWAKV]

A perfect match suggests that no change has occurred since the common ancestor (although this is not always the case).

Page 8: Homology and sequence alignment.

A substitution

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree whose two leaves are VLSPAV-WAKV and VLSEAVLWAKV]

A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it occurred).

Page 9: Homology and sequence alignment.

Indel

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree with leaves VLSPAV-WAKV and VLSEAVLWAKV and ancestor VLSEAVLWAKV]

Option 1: The ancestor had the L and it was lost in the lineage leading to VLSPAV-WAKV. In such a case, the event was a deletion.

Page 10: Homology and sequence alignment.

Indel

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

[Figure: tree with leaves VLSPAV-WAKV and VLSEAVLWAKV and ancestor VLSEAVWAKV; the inserted L is marked]

Option 2: The ancestor was shorter and the L was inserted in the lineage leading to VLSEAVLWAKV. In such a case, the event was an insertion.

Page 11: Homology and sequence alignment.

Indel

VLSPAV-WAKV
VLSEAVLWAKV

Deletion? Insertion?

Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

Page 12: Homology and sequence alignment.

Global vs. Local

• Global alignment: finds the best alignment across the entire length of the two sequences.

• Local alignment: finds regions of similarity in parts of the sequences.

Global alignment forces alignment in regions which differ:

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment returns only the regions of good alignment:

ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ

Page 13: Homology and sequence alignment.


Global alignment

PTK2 protein tyrosine kinase 2 of human and rhesus monkey

Page 14: Homology and sequence alignment.

Proteins are composed of domains

[Figure: domain diagram of human PTK2, with labels Domain A, Protein tyrosine kinase domain, and Domain B]

Page 15: Homology and sequence alignment.

Protein tyrosine kinase domain

In leukocytes, a different tyrosine kinase gene is expressed.

[Figure: domain diagram with labels Domain X, Protein tyrosine kinase domain, and Domain A]

Page 16: Homology and sequence alignment.

[Figure: domain diagrams of Leukocyte TK and PTK2, with labels Domain X, Domain A, Domain B, and a Protein tyrosine kinase domain in each]

The sequence similarity is restricted to a single domain.

Page 17: Homology and sequence alignment.


Global alignment of PTK and LTK

Page 18: Homology and sequence alignment.


Local alignment of PTK and LTK

Page 19: Homology and sequence alignment.


Conclusions

Use global alignment when the two sequences share the same overall sequence arrangement.

Use local alignment to detect regions of similarity.
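As a rough illustration of the difference, the sketch below scores the two sequences from the Global vs. Local slide both ways. It assumes the standard dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local), which these slides do not spell out, and it uses placeholder scores of +1 for a match, -2 for a mismatch, and -1 for a gap.

```python
def global_score(a, b, match=1, mismatch=-2, gap=-1):
    """Needleman-Wunsch: best score when all of a is aligned to all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Smith-Waterman: best score over any pair of substrings (floor at 0)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


a, b = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print(global_score(a, b), local_score(a, b))
```

With these placeholder scores the global score comes out negative (-2), because the differing middle region has to be paid for, while the local score (4) reflects only the best matching stretch.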

Page 20: Homology and sequence alignment.


How alignments are computed

Page 21: Homology and sequence alignment.

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Page 22: Homology and sequence alignment.

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes:
2 mismatches
4 indels (gaps)
10 perfect matches

Page 23: Homology and sequence alignment.

Choosing an alignment for a pair of sequences

AAGCTGAATTCGAA
AGGCTCATTTCTGA

Many different alignments are possible for two sequences:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

Page 24: Homology and sequence alignment.

Scoring system (naïve)

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

A higher score means a better alignment.

Perfect match: +1

Mismatch: -2

Indel (gap): -1
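The naïve scheme can be checked with a few lines of code. This is a minimal sketch, assuming each alignment is written as two rows of equal length with "-" marking a gap; the function name `naive_score` is made up for illustration.

```python
def naive_score(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum per-column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            score += gap        # indel column
        elif x == y:
            score += match      # identical characters
        else:
            score += mismatch   # substitution
    return score


print(naive_score("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(naive_score("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Both values agree with the scores on the slide, so under this scheme the first alignment is preferred.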

Page 25: Homology and sequence alignment.

Alignment scoring: scoring of sequence similarity

Assumes independence between positions: each position is considered separately.

Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)

Total score = sum of position scores. It can be positive or negative.

Page 26: Homology and sequence alignment.

Scoring system

• In the example above, the choice of +1 for a match, -2 for a mismatch, and -1 for a gap is quite arbitrary.

• Different scoring systems yield different alignments.

• We want a good scoring system…

Page 27: Homology and sequence alignment.


DNA scoring matrices

Can take into account biological phenomena such as:

• Transition-transversion
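One way to picture such a matrix: transitions (A<->G between purines, C<->T between pyrimidines) are penalized less than transversions (purine<->pyrimidine). The sketch below is illustrative only; the score values +1, -1, and -2 are assumptions, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides (illustrative values)."""
    if x == y:
        return match
    pair = {x, y}
    # A substitution within the same chemical class is a transition.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return transition
    return transversion


for pair in [("A", "A"), ("A", "G"), ("C", "T"), ("A", "C")]:
    print(pair, dna_score(*pair))
```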

Page 28: Homology and sequence alignment.

Amino-acid scoring matrices

• Take into account physico-chemical properties

Page 29: Homology and sequence alignment.

Scoring gaps (I)

In advanced algorithms, two gaps of one amino acid each are scored differently from one gap of two amino acids. This is done by charging a gap opening penalty for each gap that is opened and a smaller gap extension penalty for each position that lengthens it.

Gap extension penalty < Gap opening penalty
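A small sketch of this kind of gap scoring follows. The penalty values (-5 to open, -1 to extend) are assumptions, as is the convention that the first gap position pays the opening penalty and each further position pays the extension penalty; real programs differ in the exact convention.

```python
def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Penalty for one contiguous gap of `length` positions."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


two_single_gaps = 2 * affine_gap_penalty(1)  # two separate gaps of length 1
one_double_gap = affine_gap_penalty(2)       # one gap of length 2
print(two_single_gaps, one_double_gap)       # -10 -6
```

Under these assumed values a single longer gap is cheaper than several short ones, which matches the intuition that one indel event can insert or delete several residues at once.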

Page 30: Homology and sequence alignment.

Homology versus chance similarity

How to check if the score is significant?

A. Take the two sequences and compute their alignment score.

B. Take one of the sequences, shuffle it randomly, and compute its score against the second sequence. Repeat 100,000 times.

If the score in A is in the top 5% of the scores in B, the similarity is significant.
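The shuffling procedure can be sketched as follows. The scoring function here is a simple Smith-Waterman local score with placeholder values (+1, -2, -1), chosen only to keep the example self-contained; the slides do not say which scorer is used. The default number of shuffles is reduced from 100,000 so the example runs quickly.

```python
import random


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Smith-Waterman local alignment score (placeholder scoring values)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_p_value(a, b, n_shuffles=1000, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)
    observed = local_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        if local_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / n_shuffles


score, p = shuffle_p_value("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(score, p)  # call the similarity significant if p <= 0.05
```

Shuffling keeps the amino-acid composition fixed, so the test asks whether the order of residues, and not just their frequencies, explains the observed score.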

Page 31: Homology and sequence alignment.


How close?

• Rule of thumb:

• Proteins are homologous if they are at least 25% identical (length >100)

• DNA sequences are homologous if they are at least 70% identical

Page 32: Homology and sequence alignment.


Twilight zone

• < 25% identity in proteins – may be homologous and may not be….

• (Note that 5% identity will be obtained completely by chance!)

Page 33: Homology and sequence alignment.

Searching a sequence database

Idea: To find sequences homologous to a sequence of interest, compute its pairwise alignment against every known sequence in a database and keep the best-scoring significant hits.

The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database.

Page 34: Homology and sequence alignment.


Some terminology

• Query sequence - the sequence with which we are searching

• Hit – a sequence found in the database, suspected as homologous

Page 35: Homology and sequence alignment.


Query sequence: DNA or protein?

• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.

• Which is preferable?

Page 36: Homology and sequence alignment.

Protein is better!

• Selection (and hence conservation) works (mostly) at the protein level:

CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
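The example can be reproduced with a tiny translation sketch. The codon table below is deliberately partial (only the four codons that appear in the example, taken from the standard genetic code), and the helper names are made up for illustration.

```python
# Partial codon table: only the codons used in the slide's example.
CODON_TABLE = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon using the partial table above."""
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    return "-".join(CODON_TABLE[c] for c in codons)


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


print(translate("CTTTCA"), translate("TTGAGT"))  # Leu-Ser Leu-Ser
print(identity("CTTTCA", "TTGAGT"))              # 1 of 6 nucleotides match
```

The two DNA sequences share only one of six nucleotides, yet they encode the same dipeptide, which is why a protein-level comparison can detect similarity that a DNA-level comparison misses.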

Page 37: Homology and sequence alignment.


Query type

• Nucleotides: a four letter alphabet

• Amino acids: a twenty letter alphabet

• Two random DNA sequences will, on average, have 25% identity

• Two random protein sequences will, on average, have 5% identity
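These two figures follow from a simple calculation, assuming every letter of the alphabet is equally likely (a simplification, since real nucleotide and amino-acid frequencies are not uniform). The chance that two independent random letters agree is the sum of the squared letter frequencies.

```python
def expected_identity(frequencies):
    """Probability that two independently drawn letters are identical."""
    return sum(p * p for p in frequencies)


dna = expected_identity([1 / 4] * 4)        # four-letter alphabet
protein = expected_identity([1 / 20] * 20)  # twenty-letter alphabet
print(round(dna, 2), round(protein, 2))     # 0.25 0.05
```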

Page 38: Homology and sequence alignment.


Conclusion

The amino-acid sequence is often preferable for homology search

Page 39: Homology and sequence alignment.

How do we search a database?

• If each pairwise alignment takes 1/10 of a second, and the database contains 10^7 sequences, one search will take 10^6 seconds, or about 11.6 days.

• At least 150,000 searches are performed per day, and GenBank holds more than 82,000,000 sequence records.
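The arithmetic behind the estimate, as a short sketch (the function name is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60


def exhaustive_search_days(n_sequences, seconds_per_alignment):
    """Days needed to align one query against every database sequence."""
    return n_sequences * seconds_per_alignment / SECONDS_PER_DAY


print(round(exhaustive_search_days(10**7, 0.1), 1))  # 11.6
```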

Page 40: Homology and sequence alignment.

Conclusion

• Running the exact pairwise alignment algorithm between the query and every database entry is too slow.

Page 41: Homology and sequence alignment.

Heuristic

• Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (though the result is usually not too bad) but runs much faster than the exact algorithm.

Page 42: Homology and sequence alignment.


BLAST

Page 43: Homology and sequence alignment.

BLAST

• BLAST: Basic Local Alignment Search Tool

• A heuristic for searching a database for similar sequences

Page 44: Homology and sequence alignment.

DNA or Protein

• All types of searches are possible

Query: DNA or protein

Database: DNA or protein

blastn: nucleotide query vs. nucleotide database
blastp: protein query vs. protein database
blastx: translated query vs. protein database
tblastn: protein query vs. translated nucleotide database
tblastx: translated query vs. translated database

Translated databases: TrEMBL, GenPept

Page 45: Homology and sequence alignment.


BLAST - underlying hypothesis

• The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them

• The heuristic:

1. Discard irrelevant sequences

2. Perform exact local alignment only with the remaining sequences

Page 46: Homology and sequence alignment.

How do we discard irrelevant sequences quickly?

• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)

• Save the words in a look-up table that can be searched quickly

WTDFGYPAILKGGTAC

WTD, TDF, DFG, FGY, GYP, …
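The word-splitting and look-up steps can be sketched as below. This is a simplified illustration, not BLAST's actual data structure: the table is a plain dictionary from each word to the places where it occurs, and the sequence name "s1" is hypothetical.

```python
from collections import defaultdict


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_lookup(database, w=3):
    """Map each word to a list of (sequence name, position) occurrences."""
    table = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            table[word].append((name, pos))
    return table


print(words("WTDFGYPAILKGGTAC")[:5])  # ['WTD', 'TDF', 'DFG', 'FGY', 'GYP']
table = build_lookup({"s1": "WTDFGYPAILKGGTAC"})
print(table["FGY"])                   # [('s1', 3)]
```

Looking a query word up in such a table takes roughly constant time, which is what makes it possible to skip database sequences that share no words with the query.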

Page 47: Homology and sequence alignment.


BLAST: discarding sequences

• When the user enters a query sequence, it is also divided into words

• Search the database for consecutive neighboring words

Page 48: Homology and sequence alignment.

Neighbor words

• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level

Example, for the word GFB: GFC (20), GPC (11), WAC (5)

Page 49: Homology and sequence alignment.

E-value

• The number of times we theoretically expect to find an alignment with a score ≥ Y when searching a random sequence against a random database.

Theoretically, we could trust any result with an E-value ≤ 1.

In practice, BLAST uses estimations:

• E-values of 10^-4 and lower indicate significant homology.
• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
• E-values between 10^-2 and 1 do not indicate good homology.

Page 50: Homology and sequence alignment.

Web servers for pairwise alignment

Page 51: Homology and sequence alignment.

BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.

• Does not use an exact algorithm but a heuristic

Page 52: Homology and sequence alignment.

Back to NCBI

Page 53: Homology and sequence alignment.

BLAST – bl2seq

Page 54: Homology and sequence alignment.

Bl2Seq - query

blastn: nucleotide
blastp: protein

Page 55: Homology and sequence alignment.

Bl2seq results

Page 56: Homology and sequence alignment.

Bl2seq results

[Figure labels: Match, Dissimilarity, Gaps, Similarity, Low complexity]

Page 57: Homology and sequence alignment.

BLAST – programs

Query: DNA Protein

Database: DNA Protein

Page 58: Homology and sequence alignment.

BLAST – Blastp

Page 59: Homology and sequence alignment.

Blastp - results

Page 60: Homology and sequence alignment.

Blastp – results (cont.)

Page 61: Homology and sequence alignment.

BLAST scores:

• Bit score: a score for the alignment according to the number of similarities, identities, etc.

• Expect value (E-value): the number of alignments with the same or a better score that one can "expect" to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
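The link between the two scores can be sketched with the standard Karlin-Altschul relation E = m x n x 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The slides do not state this formula, and real BLAST uses corrected "effective" lengths, so treat the numbers below as rough illustrations; the lengths used are made up.

```python
def e_value(bit_score, query_length, database_length):
    """Expected number of chance hits scoring at least `bit_score` bits."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10^9 database residues.
for bits in (30, 50, 80):
    print(bits, e_value(bits, 300, 10**9))
```

Each additional bit halves the E-value, and a larger database raises it, which is why the same alignment can be significant in a small database and not in a large one.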

Page 62: Homology and sequence alignment.

Blastp – acquiring sequences

Page 63: Homology and sequence alignment.

Blastp – acquiring sequences

Page 64: Homology and sequence alignment.

Multiple sequence alignment

Similar to pairwise alignment, BUT n sequences are aligned instead of just 2.

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Page 65: Homology and sequence alignment.

Multiple sequence alignment

MSA = Multiple Sequence Alignment
Each row represents an individual sequence.
Each column represents the 'same' position.

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Page 66: Homology and sequence alignment.

Conserved positions

• Columns in which all the sequences contain the same amino acids or nucleotides

• Important for the function or structure

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG

Page 67: Homology and sequence alignment.


Consensus sequence

A T C T T G T

A A C T T G T

A A C T T C T

A A C T T G T

A consensus sequence holds the most frequent character of the alignment at each column
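A minimal sketch of computing a consensus from the four rows above. Ties are broken by whichever character appears first in the column, which is an arbitrary choice made here for simplicity.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))


alignment = ["ATCTTGT", "AACTTGT", "AACTTCT", "AACTTGT"]
print(consensus(alignment))  # AACTTGT
```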

Page 68: Homology and sequence alignment.

Profile = PSSM = Position Specific Score Matrix

A T C T T G
A A C T T G
A A C T T C

Position   1     2     3     4     5     6
A          1     0.67  0     0     0     0
C          0     0     1     0     0     0.33
G          0     0     0     0     0     0.67
T          0     0.33  0     1     1     0
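The table above holds per-column character frequencies, which can be computed as in the sketch below. Note that profiles used in practice usually store log-odds scores derived from such frequencies; this sketch stops at the frequencies shown on the slide.

```python
def profile(rows, alphabet="ACGT"):
    """Per-column frequency of each character in an alignment."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return {
        ch: [sum(row[i] == ch for row in rows) / n_rows for i in range(n_cols)]
        for ch in alphabet
    }


p = profile(["ATCTTG", "AACTTG", "AACTTC"])
for ch, freqs in p.items():
    print(ch, [round(f, 2) for f in freqs])
```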

Page 69: Homology and sequence alignment.

Alignment methods

No practical method is guaranteed to find the optimal MSA; all commonly used methods are heuristics:

• Progressive/hierarchical alignment (Clustal)

• Iterative alignment (MAFFT, MUSCLE)

Page 70: Homology and sequence alignment.

Progressive alignment

Sequences: A, B, C, D, E

First step: Compute the pairwise alignments for all against all (10 pairwise alignments for 5 sequences). The similarities are converted to distances and stored in a table:

     A    B    C    D    E
A
B    8
C    15   17
D    16   14   10
E    32   31   31   32

Page 71: Homology and sequence alignment.

Second step: Cluster the sequences to create a tree (guide tree), which:

• represents the order in which pairs of sequences are to be aligned
• places similar sequences as neighbors in the tree
• places distant sequences far from each other in the tree

     A    B    C    D    E
A
B    8
C    15   17
D    16   14   10
E    32   31   31   32

[Figure: guide tree over the sequences A, B, C, D, E]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
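To make the clustering step concrete, the sketch below builds a guide tree from the distance table using average-linkage clustering. This particular clustering rule is an assumption chosen for simplicity; the slides do not say which method is used, and real programs may use other rules (for example neighbor joining).

```python
from itertools import combinations


def guide_tree(dist):
    """Average-linkage clustering; `dist` maps (x, y) leaf pairs to distances."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    leaves = sorted({name for pair in dist for name in pair})
    # Each cluster is (subtree, list of member leaves).
    clusters = [(name, [name]) for name in leaves]

    def average_distance(c1, c2):
        total = sum(d[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Merge the two closest clusters into one subtree.
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append(((c1[0], c2[0]), c1[1] + c2[1]))
    return clusters[0][0]


DIST = {("A", "B"): 8, ("A", "C"): 15, ("B", "C"): 17, ("A", "D"): 16,
        ("B", "D"): 14, ("C", "D"): 10, ("A", "E"): 32, ("B", "E"): 31,
        ("C", "E"): 31, ("D", "E"): 32}
print(guide_tree(DIST))  # ('E', (('A', 'B'), ('C', 'D')))
```

Under this rule A and B (distance 8) are joined first, then C and D (distance 10), then the two pairs, and E is added last. The nested tuples give the order in which the alignments would be merged.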

Page 72: Homology and sequence alignment.

Third step:

1. Align the most similar (neighboring) pairs (sequence against sequence)

[Figure: guide tree over the sequences A, B, C, D, E]

Page 73: Homology and sequence alignment.

Third step:

2. Align pairs of pairs

[Figure: guide tree over the sequences A, B, C, D, E; labels: sequence, profile]

Page 74: Homology and sequence alignment.

Third step:

[Figure: guide tree over the sequences A, B, C, D, E; labels: sequence, profile]

Main disadvantages:

• Sub-optimal tree topology

• Misalignments resulting from globally aligning pairs of sequences.

Page 75: Homology and sequence alignment.

Iterative alignment

[Figure: cycle over the sequences A, B, C, D, E: pairwise distance table, then guide tree, then MSA, then back to the pairwise distance table]

Iterate until the MSA does not change (convergence).

Page 76: Homology and sequence alignment.


Case study: Using homology searching

• The human kinome

Page 77: Homology and sequence alignment.


Kinases and phosphatases

Page 78: Homology and sequence alignment.

Multi-tasking enzymes

• Signal transduction
• Metabolism
• Transcription
• Cell-cycle
• Differentiation
• Function of nervous and immune system
• …
• And more

Page 79: Homology and sequence alignment.


How many kinases in the human genome?

• 1950’s, discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase

• 1970’s, advent of cloning and sequencing produced a speculation that the vertebrate genome encodes as many as 1,001 kinases

Page 80: Homology and sequence alignment.

How many kinases in the human genome?

• 2001: the human genome sequence …

• As well: the databases GenBank, Swiss-Prot, and dbEST

• How can we find out how many kinases are out there?

Page 81: Homology and sequence alignment.


The human kinome

• In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:

1. Search and cross-reference all these databases for all kinases

2. Characterize all found kinases

Page 82: Homology and sequence alignment.

ePKs and aPKs

Eukaryotic protein kinases (ePKs; the majority): sequence homology of the catalytic domain; additional regulatory domains are non-homologous.

Atypical protein kinases (aPKs): no sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs.

Page 83: Homology and sequence alignment.

The search

• Several profiles were built, based on the catalytic domain of:

(a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain

(b) each subfamily of known aPKs

• HMM-profile searches and PSI-BLAST searches were performed

Page 84: Homology and sequence alignment.

The results…

• 478 ePKs
• 40 aPKs

• Total of 518 kinases in the human genome (about half of the prediction from the 1970s) [1.7% of human genes]