Lecture 4 Protein Sequence Alignment -...

Introduction to Bioinformatics for Medical Research Gideon Greenspan [email protected] Lecture 4 Protein Sequence Alignment


Page 1: Lecture 4 Protein Sequence Alignment - Technionbioinfo.cs.technion.ac.il/courses/biomed/lectures/... · –Needleman-Wunsch algorithm. 4 Alignment Recap (2) •Global vs Local Alignment

Introduction to Bioinformatics for Medical Research

Gideon Greenspan [email protected]

Lecture 4 Protein Sequence Alignment


Protein Sequence Alignment

• Alignment Recap
• Needleman-Wunsch Algorithm
• Genomes to Proteins
• Scoring Matrices
• PAM
• BLOSUM
• Genomic vs Protein


Alignment Recap (1)

• Two input sequences
  – Add dashes to make same length

• Score for each position in alignment
  – Positive for match, negative for mismatch
  – Final score is sum of position scores

• Dashes placed so as to maximize score
  – Needleman-Wunsch algorithm
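The scoring rule on this slide can be sketched in a few lines of Python. The slide only fixes the signs of the scores; the values used here (+2 for a match, -1 for a mismatch, -1 for a dash) are assumptions borrowed from the worked example later in the lecture, and the function name `score_alignment` is purely illustrative.

```python
def score_alignment(top: str, bottom: str,
                    match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Sum the per-position scores of two equal-length, dash-padded sequences."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # a dash aligned against a residue
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# One of the alignments shown later in the lecture for ACGCTG vs CATGT.
print(score_alignment("ACGCTG-", "-C-ATGT"))
```

Needleman-Wunsch then searches over all possible placements of dashes for the one that maximizes this sum.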


Alignment Recap (2)

• Global vs Local Alignment
  – Smith-Waterman algorithm

• Gap scores
  – Affine gap model

• Complexity
  – Indexing to reduce alignments

• FASTA and BLAST algorithms
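As a reminder of what the affine gap model changes, here is a minimal sketch comparing it with a linear per-dash penalty. It assumes the convention in which a gap of k positions costs an opening charge plus k extension charges; some tools charge only k - 1 extensions, so treat the exact formula as an assumption. The default values 11 and 1 are taken from the gap costs listed in the Provisional Guidelines slide.

```python
def linear_gap_cost(length: int, per_position: int = 1) -> int:
    """Every dash costs the same, however the dashes are grouped."""
    return per_position * length


def affine_gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """One-off opening cost plus a per-position extension cost (assumed convention)."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


# One gap of five positions is much cheaper than five separate single gaps.
print(affine_gap_cost(5), 5 * affine_gap_cost(1))
```

The design point is that a single insertion or deletion event of several residues is penalized less than the same number of dashes scattered along the alignment.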


Needleman-Wunsch Alignment

• Global alignment between sequences
  – Compare entire sequence against another

• Create scoring table
  – Sequence A across top, B down left

• Cell at row i and column j contains the score of the best alignment between the first j elements of A and the first i elements of B
  – Global alignment score is bottom right cell
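The table construction described above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `nw_table` is hypothetical, and the scores (+2 match, -1 mismatch, -1 gap) are assumed from the worked example that follows. A runs across the top (columns) and B down the left (rows), so row i corresponds to a prefix of B and column j to a prefix of A.

```python
def nw_table(a: str, b: str,
             match: int = 2, mismatch: int = -1, gap: int = -1) -> list[list[int]]:
    """Fill the Needleman-Wunsch scoring table (A = columns, B = rows)."""
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against dashes.
    for j in range(1, cols):
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, rows):
        table[i][0] = table[i - 1][0] + gap
    # Every other cell takes the best of its three neighbours.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[j - 1] == b[i - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align a[j-1] with b[i-1]
                table[i - 1][j] + gap,       # align b[i-1] with a dash
                table[i][j - 1] + gap,       # align a[j-1] with a dash
            )
    return table


table = nw_table("ACGCTG", "CATGT")
for row in table:
    print(" ".join(f"{value:3d}" for value in row))
print("global alignment score:", table[-1][-1])
```

The bottom right cell holds the global alignment score; the table itself is what the traceback step later walks through.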


Needleman-Wunsch Example

Sequences: A = ACGCTG, B = CATGT

• Score of best alignment between AC and CATG: -1
• …between AC and CATGT: -2
• …between ACG and CATG: 2
• Calculate score between ACG and CATGT: ?

Needleman-Wunsch Example

Sequences: A = ACGCTG, B = CATGT

• -1 from before plus -1 for mismatch of G against T → -2
• 2 from before plus -1 for gap (– against T) → 1
• -2 from before plus -1 for gap (G against –) → -3
• Cell gets highest score of -2, 1, -3 → 1
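The same step can be written as a recurrence. The notation is an assumption introduced here, not the lecture's: F(i,j) is the cell in row i (prefix of B) and column j (prefix of A), s(x,y) is +2 for a match and -1 for a mismatch, and d = 1 is the gap penalty. The second line plugs in the three neighbouring values from the example above.

```latex
F(i,j) = \max \begin{cases}
  F(i-1,\,j-1) + s(a_j, b_i) \\
  F(i-1,\,j) - d \\
  F(i,\,j-1) - d
\end{cases}
\qquad
F(5,3) = \max(-1 - 1,\; 2 - 1,\; -2 - 1) = 1
```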

       0  A1  C2  G3  C4  T5  G6
   0   0
  C1
  A2
  T3
  G4
  T5

       0  A1  C2  G3  C4  T5  G6
   0   0  -1
  C1
  A2
  T3
  G4
  T5

A: A
B: -

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1
  A2
  T3
  G4
  T5

A: ACGCTG
B: ------

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1
  A2  -2
  T3  -3
  G4  -4
  T5  -5

A: -----
B: CATGT

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1
  A2  -2
  T3  -3
  G4  -4
  T5  -5

A: A
B: C

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1   1
  A2  -2
  T3  -3
  G4  -4
  T5  -5

A: AC
B: -C

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1   1   0
  A2  -2
  T3  -3
  G4  -4
  T5  -5

A: ACG
B: -C-

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1   1   0  -1
  A2  -2
  T3  -3
  G4  -4
  T5  -5

A: ACGC
B: -C--

A: ACGC
B: ---C

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1   1   0  -1  -2  -3
  A2  -2   1   0   0
  T3  -3
  G4  -4
  T5  -5

A: ACG
B: -CA

       0  A1  C2  G3  C4  T5  G6
   0   0  -1  -2  -3  -4  -5  -6
  C1  -1  -1   1   0  -1  -2  -3
  A2  -2   1   0   0  -1  -2  -3
  T3  -3   0   0  -1  -1   1   0
  G4  -4  -1  -1   2   1   0   3
  T5  -5  -2  -2   1   1   3   2


Cells on the traceback paths:

       0  A1  C2  G3  C4  T5  G6
   0   0  -1
  C1  -1       1   0
  A2       1       0  -1
  T3           0           1
  G4               2   1       3
  T5                       3   2

Optimal alignment recovered by traceback:

A: ACGCTG-
B: -C-ATGT

Optimal alignment recovered by traceback:

A: ACGCTG-
B: -CA-TGT

Optimal alignment recovered by traceback:

A: -ACGCTG
B: CATG-T-
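The traceback step can be sketched as a recursive walk from the bottom right cell back to the top left, following every neighbour that could have produced the current cell's score. This is an illustrative sketch under the same assumed scores as the example (+2 match, -1 mismatch, -1 gap); the function names are hypothetical, and enumerating every co-optimal path can take exponential time on long sequences, so practical tools usually report just one.

```python
def nw_table(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch scoring table with A across the top and B down the left."""
    table = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    for j in range(1, len(a) + 1):
        table[0][j] = table[0][j - 1] + gap
    for i in range(1, len(b) + 1):
        table[i][0] = table[i - 1][0] + gap
        for j in range(1, len(a) + 1):
            pair = match if a[j - 1] == b[i - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + gap,
                              table[i][j - 1] + gap)
    return table


def all_tracebacks(a, b, match=2, mismatch=-1, gap=-1):
    """Return every optimal global alignment as (aligned_a, aligned_b)."""
    table = nw_table(a, b, match, mismatch, gap)
    results = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            results.append((top, bottom))
            return
        if i > 0 and j > 0:
            pair = match if a[j - 1] == b[i - 1] else mismatch
            if table[i][j] == table[i - 1][j - 1] + pair:   # diagonal move
                walk(i - 1, j - 1, a[j - 1] + top, b[i - 1] + bottom)
        if i > 0 and table[i][j] == table[i - 1][j] + gap:  # dash in A
            walk(i - 1, j, "-" + top, b[i - 1] + bottom)
        if j > 0 and table[i][j] == table[i][j - 1] + gap:  # dash in B
            walk(i, j - 1, a[j - 1] + top, "-" + bottom)

    walk(len(b), len(a), "", "")
    return results


for aligned_a, aligned_b in all_tracebacks("ACGCTG", "CATGT"):
    print(aligned_a)
    print(aligned_b)
    print()
```

For the example sequences the walk branches wherever two neighbours tie, which is why more than one alignment reaches the same optimal score.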


From Genomes to Proteins

• 3-nucleotide ‘codons’ for each amino acid
  – 4^3 = 64 possible codon values
  – Only 20 amino acids → degeneracy

• Start and stop codons
  – Start codon determines reading frame

• Silent, missense and nonsense mutations
• Different codes, e.g. for mitochondria
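A small sketch may make these points concrete. It builds the standard genetic code from its usual compact string form (bases in TCAG order, `*` for stop), translates from the first ATG, and classifies single-codon changes. The helper names `translate` and `classify_mutation` are illustrative, and the classification is deliberately simplified: it does not treat loss of a stop codon as a separate case.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate from the first ATG (which sets the reading frame) to the first stop."""
    start = dna.find("ATG")
    if start < 0:
        return ""
    protein = []
    for k in range(start, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[k:k + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def classify_mutation(before: str, after: str) -> str:
    """Classify a codon change as silent, missense or nonsense (simplified)."""
    old, new = CODON_TABLE[before], CODON_TABLE[after]
    if old == new:
        return "silent"
    return "nonsense" if new == "*" else "missense"


print(len(CODON_TABLE), "codons,", len(set(AMINO_ACIDS) - {"*"}), "amino acids")
print(translate("CCATGGCCGAATAAGG"))
print(classify_mutation("GAA", "GAG"),
      classify_mutation("GAA", "GAT"),
      classify_mutation("TAC", "TAA"))
```

Degeneracy shows up directly in the table: several codons map to the same letter, which is what makes silent mutations possible.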


The Standard Genetic Code


Scoring Matrices (1)

• The standard scoring scheme for aligning nucleotides could be expressed as a matrix:

       A   C   G   T
   A  +2  -1  -1  -1
   C  -1  +2  -1  -1
   G  -1  -1  +2  -1
   T  -1  -1  -1  +2


Scoring Matrices (2)

• But we could take account of relative likelihood of transitions and transversions:

       A   C   G   T
   A  +1  -5  -1  -5
   C  -5  +1  -5  -1
   G  -1  -5  +1  -5
   T  -5  -1  -5  +1

The main diagonal is the axis of symmetry.
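Both nucleotide matrices can be generated from a simple rule rather than typed in by hand. The sketch below is illustrative (the function names are assumptions): a transition swaps two purines (A, G) or two pyrimidines (C, T), and every other substitution is a transversion. The score values are the ones shown in the two matrices above.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes."""
    return x != y and ((x in PURINES) == (y in PURINES))


def uniform_matrix(match: int = 2, mismatch: int = -1) -> dict:
    """The standard scheme: every mismatch scores the same."""
    return {(x, y): match if x == y else mismatch for x in BASES for y in BASES}


def ts_tv_matrix(match: int = 1, transition: int = -1, transversion: int = -5) -> dict:
    """Penalize transversions more heavily than transitions."""
    matrix = {}
    for x in BASES:
        for y in BASES:
            if x == y:
                matrix[x, y] = match
            elif is_transition(x, y):
                matrix[x, y] = transition
            else:
                matrix[x, y] = transversion
    return matrix


matrix = ts_tv_matrix()
print("     " + "   ".join(BASES))
for x in BASES:
    print(x, " ".join(f"{matrix[x, y]:+3d}" for y in BASES))
```

Because the score of x against y equals the score of y against x, the matrix is symmetric about its main diagonal, so only half of it needs to be specified.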


Amino Acid Matrices

• For aligning amino acids, we need a scoring matrix of 20 rows × 20 columns

• Matrices represent biological processes
  – Mutation causes changes in sequence
  – Evolution tends to conserve protein function
  – Similar function requires similar amino acids

• Could base matrix on amino acid properties
  – In practice: based on empirical data


PAM

• Point Accepted Mutation (often expanded as Percent Accepted Mutation)
  – Margaret Dayhoff (1978)

• Based on very similar protein sequences
  – 34 known protein superfamilies

• Phylogenetic trees from global alignment
  – Count number of observed changes
  – Observed changes are by definition ‘accepted’ by selection


PAM Matrices

• PAM1 defines unit of evolutionary change
  – One percent accepted mutation, i.e. one amino acid difference per 100 residues on average

• Any PAMn calculated from PAM1
  – PAMn = PAM1 raised to the nth power (matrix multiplication)

• PAMn is not n differences per 100 residues
  – One amino acid can change several times
  – PAM250 is in common use
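To see why PAMn does not mean n differences per 100 residues, here is a toy calculation. It does not use Dayhoff's data: `toy_pam1` is a hypothetical mutation matrix in which each of 20 residue types changes with probability 1% per step and all 19 replacements are equally likely, which is an assumption made only so the example is self-contained.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def matrix_power(m, n):
    """Raise a square matrix to the n-th power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        n >>= 1
    return result


STATES = 20    # residue types
CHANGE = 0.01  # toy assumption: 1% chance that a residue changes in one PAM step
toy_pam1 = [[1.0 - CHANGE if i == j else CHANGE / (STATES - 1)
             for j in range(STATES)] for i in range(STATES)]

for n in (1, 60, 120, 250):
    unchanged = matrix_power(toy_pam1, n)[0][0]
    print(f"toy PAM{n}: {100 * (1 - unchanged):.1f}% of residues differ")
```

Because a position can change more than once, or change back, the observed difference grows more slowly than n percent and levels off below 100%. The real PAM matrices are built from observed substitution counts rather than a uniform model, so the actual percentages differ from this toy output.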


Selecting a PAM Matrix

• For PAMn, higher n suitable for sequences which are longer or less similar
  – PAM120 recommended for general use
  – PAM60 for close relations
  – PAM250 for distant relations

• If uncertain, try several different matrices
  – PAM40, PAM120, PAM250 recommended


BLOSUM

• Blocks Substitution Matrix
  – Steven and Jorja G. Henikoff (1992)

• Based on BLOCKS database
  – Families of proteins with identical function
  – Highly conserved protein domains

• Ungapped local alignment to identify motifs
  – Counts amino acids observed in same column
  – Symmetrical model of substitution


BLOSUM Matrices

• Different BLOSUMn matrices are calculated independently from BLOCKS

• Value n defines how sequences are grouped within family before counting amino acids
  – For BLOSUMn, sequences which are more than n percent identical are considered as one

• Purpose: to prevent bias in favor of closely related protein sequences
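The grouping step can be illustrated with a simplified sketch. The block of sequences below is invented for the example, the function names are hypothetical, and single-linkage grouping is an assumption about how "considered as one" is applied; the actual BLOSUM construction also weights each cluster's contribution when counting amino acid pairs, which is not shown here.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned (ungapped) positions with identical residues."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(block: list[str], n: float) -> list[list[str]]:
    """Group sequences linked by more than n percent identity (single linkage)."""
    clusters: list[list[str]] = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            close = any(percent_identity(seq, other) > n for other in cluster)
            (linked if close else rest).append(cluster)
        merged = [seq] + [s for cluster in linked for s in cluster]
        clusters = rest + [merged]
    return clusters


# Hypothetical ungapped block of four aligned sequences.
block = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHLRV", "MNPQRSTVWY"]
for n in (75, 85, 95):
    print(f"n = {n}: {len(cluster_block(block, n))} clusters")
```

A lower n merges more of the closely related sequences into one cluster, so the resulting counts are dominated by more divergent pairs.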


Selecting a BLOSUM Matrix

• For BLOSUMn, higher n suitable for sequences which are more similar
  – BLOSUM62 recommended for general use
  – BLOSUM80 for close relations
  – BLOSUM45 for distant relations

• BLOSUM unsuitable for short sequences
  – Use PAM30 instead


PAM vs BLOSUM

                     PAM            BLOSUM
Underlying model     Evolution      Domains
Source sequences     Very similar   Similar
Source alignment     Global         Local
Higher n →           Distant        Close
Matrix calculation   Dependent      Independent
Effective for        Ancestry       Domains


Provisional Guidelines

Query Length   Matrix     Gap Costs
< 35           PAM30      9, 1
35 … 50        PAM70      10, 1
50 … 85        BLOSUM80   10, 1
> 85           BLOSUM62   11, 1
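These guidelines amount to a simple lookup on query length, sketched below. The function name is hypothetical, the gap costs are read as (opening, extension), and the handling of the boundary values is an assumption, since the table does not say which row a query of exactly 50 residues belongs to.

```python
def suggest_parameters(query_length: int) -> tuple[str, tuple[int, int]]:
    """Return (matrix name, (gap open, gap extend)) for a protein query length."""
    if query_length < 35:
        return "PAM30", (9, 1)
    if query_length <= 50:      # assumption: 35 and 50 both fall in this row
        return "PAM70", (10, 1)
    if query_length <= 85:      # assumption: 85 falls in this row
        return "BLOSUM80", (10, 1)
    return "BLOSUM62", (11, 1)


for length in (20, 40, 60, 300):
    matrix, (gap_open, gap_extend) = suggest_parameters(length)
    print(f"{length:4d} residues: {matrix}, gap costs {gap_open}, {gap_extend}")
```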


Other Scoring Matrices

• Genetic code changes
  – Distance between amino acid codons

• Amino acid properties
  – Particular properties for application

• Matrices from specific families
  – Example: transmembrane proteins

• Multi-site substitution model


Amino Acid Properties


Genomic vs Protein Alignment

                     Genomic     Protein
Mechanism            Mutation    Selection
Interspecies range   Closer      Further
Scoring matrix       simple      complex

Remaining cell text, placement unclear in this extraction: Sequence, Relationship, Phylogeny, Function, Different genes, Different proteins, exponential, 1, 16


BLAST Variations

Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic
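"Translated genomic" in this table means the nucleotide sequence is compared at the protein level after translation in all six reading frames: three offsets on the given strand and three on its reverse complement. The sketch below shows that translation step only, using the standard genetic code; the function names are illustrative and it is not a description of how BLAST itself is implemented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse, to read the opposite strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate_frame(dna: str, offset: int) -> str:
    """Translate every complete codon starting at the given offset (0, 1 or 2)."""
    return "".join(CODON_TABLE[dna[k:k + 3]]
                   for k in range(offset, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translate both strands in all three offsets, keyed as +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate_frame(seq, offset)
    return frames


for name, protein in six_frame_translation("ATGGCCGAATAA").items():
    print(name, protein)
```

Each of the six translations is then treated like a protein sequence, which is why tblastx, with translation on both the query and database sides, involves the most comparisons of the five programs.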