Dot Plots, Path Matrices, Score Matrices


Page 1: Dot Plots, Path Matrices, Score Matrices

[Figure: dot plot of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST).]

Diagonal lines give equivalent residues.

Page 2: Dot Plots, Path Matrices, Score Matrices

[Figure: dot plot of Sequence A (VILSTRIVHVNSILPSTN) against Sequence B (VILSTRIVILPEFST).]

Identical residues score 1.

The highest scoring path across the matrix gives the best alignment.
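The identity scoring described above can be sketched in a few lines of Python. This is only an illustration: the function names `dot_plot` and `render` are hypothetical, and placing Sequence A along the columns and Sequence B along the rows is an assumption about the slide layout.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with one row per residue of seq_b and one column per
    residue of seq_a; a cell is 1 where the two residues are identical."""
    return [[1 if a == b else 0 for a in seq_a] for b in seq_b]


def render(seq_a, seq_b):
    """Draw the matrix as text, marking identical residues with '*'."""
    lines = ["  " + " ".join(seq_a)]
    for residue, row in zip(seq_b, dot_plot(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Sequences as read from the slide axes.
    print(render("VILSTRIVHVNSILPSTN", "VILSTRIVILPEFST"))
```

Printing the rendering for the two slide sequences should show the first eight residues (VILSTRIV) as an unbroken diagonal of `*` marks.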

Page 3: Dot Plots, Path Matrices, Score Matrices

[Figure: dot plot of Sequence A (VILSLVILPQRSLVVILSLVILALTV) against Sequence B (STVILSLVRNVILPQRILSLVISLAL), showing runs (tuples) of 3 residues. Numeric labels on the plot: 6, 6, 5, 6, 3, 3, 3, 6, 3.]

gap penalty = 3

SCORE = 20 - 9 = 11
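Filtering a dot plot for runs (tuples) of matching residues can be sketched as a scan along each diagonal. The function name `diagonal_runs` and the `min_len` parameter are hypothetical; the threshold of 3 follows the tuple size named on the slide.

```python
def diagonal_runs(seq_a, seq_b, min_len=3):
    """Find runs of identical residues along the diagonals of a dot plot.

    Returns (start_in_a, start_in_b, length) for every run of at least
    min_len consecutive matches.
    """
    runs = []
    # offset = (index in seq_a) - (index in seq_b) identifies one diagonal.
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        j, i = max(0, offset), max(0, -offset)
        length = 0
        while j < len(seq_a) and i < len(seq_b):
            if seq_a[j] == seq_b[i]:
                length += 1
            else:
                if length >= min_len:
                    runs.append((j - length, i - length, length))
                length = 0
            j += 1
            i += 1
        # A run may extend to the very end of the diagonal.
        if length >= min_len:
            runs.append((j - length, i - length, length))
    return runs


if __name__ == "__main__":
    seq_a = "VILSLVILPQRSLVVILSLVILALTV"
    seq_b = "STVILSLVRNVILPQRILSLVISLAL"
    for start_a, start_b, length in diagonal_runs(seq_a, seq_b):
        print(start_a, start_b, length, seq_a[start_a:start_a + length])
```

With the two slide sequences, one of the reported runs is the six-residue stretch VILSLV starting at position 0 of Sequence A and position 2 of Sequence B.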

Page 4: Dot Plots, Path Matrices, Score Matrices

Alignment from Dot Plot

VILSLV ILPQRSLVVILSLVI LALTV

STVILSLVNVILPQR ILSLVISLAL

score = 20

sequence identity = 20/26 ≈ 77%
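The arithmetic behind the score and the identity figure can be sketched as follows. The gapped strings are an assumed reconstruction (the exact gap widths did not survive in the transcript); they were chosen so that the ungapped sequences match the two lines above. Counting each internal gap once, regardless of its length, and ignoring end overhangs are also assumptions, as is the function name `score_alignment`.

```python
import re


def score_alignment(aligned_a, aligned_b, gap_penalty=3):
    """Return (matches, internal_gaps, score) for two gapped strings.

    score = matches - gap_penalty * internal_gaps, where each run of '-'
    between the first and last aligned residue pair counts as one gap.
    """
    assert len(aligned_a) == len(aligned_b)
    pairs = list(zip(aligned_a, aligned_b))
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    # Columns where both sequences have a residue; overhangs lie outside them.
    both = [k for k, (x, y) in enumerate(pairs) if x != "-" and y != "-"]
    lo, hi = both[0], both[-1] + 1
    gaps = len(re.findall(r"-+", aligned_a[lo:hi])) + len(re.findall(r"-+", aligned_b[lo:hi]))
    return matches, gaps, matches - gap_penalty * gaps


if __name__ == "__main__":
    # Assumed reconstruction of the gapped alignment.
    a = "--VILSLV--ILPQRSLVVILSLVI-LALTV"
    b = "STVILSLVNVILPQR----ILSLVISLAL--"
    matches, gaps, score = score_alignment(a, b)
    print(matches, gaps, score)
    print(f"identity = {matches}/26 = {100 * matches / 26:.0f}%")
```

Under these assumptions the sketch gives 20 matches and 3 gaps, so 20 - 3 × 3 = 11, and 20 identical residues out of 26 is about 77%.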

Page 5: Dot Plots, Path Matrices, Score Matrices

Path or Score Matrix

    A H C N I R Q C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

[Inset: residue substitution matrix with axes A, L, V, K, R, H, … and entries of 1 and 0.]

Page 6: Dot Plots, Path Matrices, Score Matrices

Needleman & Wunsch

    A H C N I R Q C L C R P M
A   1 0 0 0 0 0 0 0 0 0 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
I   0 0 0 0 1 0 0 0 0 0 0 0 0
N   0 0 0 1 0 0 0 0 0 0 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
K   0 0 0 0 0 0 0 0 0 0 0 0 0
C   0 0 1 0 0 0 0 1 0 1 0 0 0
R   0 0 0 0 0 1 0 0 0 0 1 0 0
H   0 1 0 0 0 0 0 0 0 0 0 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

Page 7: Dot Plots, Path Matrices, Score Matrices

Needleman & Wunsch Algorithm

• Accumulate the matrix by adding to each cell the highest score in the column or row to the right and below it

• Find the highest scoring path in the matrix by:

  • starting in the top left corner

  • moving down across the matrix from cell to cell

  • choosing the highest scoring cell at each move

  • the path cannot go back on itself or cross the same row or column twice

Page 8: Dot Plots, Path Matrices, Score Matrices

Accumulating the Matrix

• Add to the score in the cell the highest score from a cell in the row or column to the right and below

[Diagram: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m]
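The accumulation step can be sketched as below. This is a minimal reading of the rule stated above, not a reference implementation: identical residues score 1, there is no gap penalty, and the function name `accumulate` is hypothetical. Indices here increase to the right and downwards, so the cells "to the right and below" cell (i, j) are taken to be row i+1 from column j+1 onwards and column j+1 from row i+1 onwards; the slide diagram labels the same cells with its own index convention.

```python
def accumulate(seq_a, seq_b):
    """Build the accumulated path matrix, seq_a across and seq_b down."""
    rows, cols = len(seq_b), len(seq_a)
    # Start from the identity matrix: 1 for identical residues, else 0.
    S = [[1 if seq_b[i] == seq_a[j] else 0 for j in range(cols)]
         for i in range(rows)]
    # Work from the bottom right towards the top left. The last row and
    # last column have nothing to the right and below, so they stay as is.
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            row_below = S[i + 1][j + 1:]
            next_column = [S[k][j + 1] for k in range(i + 1, rows)]
            S[i][j] += max(row_below + next_column)
    return S


if __name__ == "__main__":
    seq_a = "AHCNIRQCLCRPM"
    seq_b = "AICINRCKCRHP"
    print("    " + " ".join(seq_a))
    for residue, row in zip(seq_b, accumulate(seq_a, seq_b)):
        print(residue + "   " + " ".join(str(v) for v in row))
```

For the sequences used in these slides the top left cell comes out as 8, as in the accumulated matrix shown in the deck; a couple of cells in row N may differ by one from the printed values under this reading of the rule.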

Page 9: Dot Plots, Path Matrices, Score Matrices

Sequence A (across), Sequence B (down)

    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 5 4 3 3 3 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

Page 10: Dot Plots, Path Matrices, Score Matrices

Possible Moves in Finding a Path across the Matrix

• start in the leftmost or topmost row

• move to the highest scoring cell in the row or column to the right and below

[Diagram: cell i,j with candidate cells i-1,j-1; i-n,j-1; i-1,j-m]
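The moves listed above can be sketched as a path search over the accumulated matrix. Several things here are assumptions rather than part of the slides: the function names, the tie-breaking rule (the first candidate examined wins, scanning the row below from left to right and then the next column from top to bottom), and the handling of unaligned ends as gaps. The accumulation helper is repeated so the block runs on its own.

```python
def accumulate(seq_a, seq_b):
    """Accumulated path matrix (identity scores, no gap penalty)."""
    rows, cols = len(seq_b), len(seq_a)
    S = [[1 if seq_b[i] == seq_a[j] else 0 for j in range(cols)]
         for i in range(rows)]
    for i in range(rows - 2, -1, -1):
        for j in range(cols - 2, -1, -1):
            S[i][j] += max(S[i + 1][j + 1:]
                           + [S[k][j + 1] for k in range(i + 1, rows)])
    return S


def trace_path(seq_a, seq_b):
    """Follow the highest scoring path and return two gapped strings."""
    S = accumulate(seq_a, seq_b)
    rows, cols = len(seq_b), len(seq_a)
    # Start at the best cell in the top row or leftmost column.
    starts = [(0, j) for j in range(cols)] + [(i, 0) for i in range(1, rows)]
    i, j = max(starts, key=lambda c: S[c[0]][c[1]])
    out_a = list(seq_a[:j]) + ["-"] * i
    out_b = ["-"] * j + list(seq_b[:i])
    while True:
        out_a.append(seq_a[j])
        out_b.append(seq_b[i])
        if i == rows - 1 or j == cols - 1:
            break
        # Candidates: row i+1 to the right, then column j+1 further down.
        moves = [(i + 1, k) for k in range(j + 1, cols)]
        moves += [(k, j + 1) for k in range(i + 2, rows)]
        ni, nj = max(moves, key=lambda c: S[c[0]][c[1]])
        # Residues skipped by the move are aligned against gaps.
        out_a.extend(seq_a[j + 1:nj])
        out_b.extend("-" * (nj - j - 1))
        out_b.extend(seq_b[i + 1:ni])
        out_a.extend("-" * (ni - i - 1))
        i, j = ni, nj
    # Whatever remains after the last aligned pair is an overhang.
    out_a.extend(seq_a[j + 1:])
    out_b.extend("-" * (cols - j - 1))
    out_b.extend(seq_b[i + 1:])
    out_a.extend("-" * (rows - i - 1))
    return "".join(out_a), "".join(out_b)


if __name__ == "__main__":
    top, bottom = trace_path("AHCNIRQCLCRPM", "AICINRCKCRHP")
    print(top)
    print(bottom)
```

Because several cells can tie for the highest score, a different tie-breaking rule can give a different but equally scoring alignment.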

Page 11: Dot Plots, Path Matrices, Score Matrices

Sequence A (across), Sequence B (down)

    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 5 4 3 3 3 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

Page 12: Dot Plots, Path Matrices, Score Matrices

Sequence A (across), Sequence B (down)

    A H C N I R Q C L C R P M
A   8 7 6 6 5 4 4 3 3 2 1 0 0
I   7 7 6 6 6 4 4 3 3 2 1 0 0
C   6 6 7 6 5 4 4 4 3 3 1 0 0
I   6 6 6 5 6 4 4 3 3 2 1 0 0
N   5 5 5 6 5 5 4 3 3 3 1 0 0
R   4 4 4 4 4 5 4 3 3 2 2 0 0
C   3 3 4 3 3 3 3 4 3 3 1 0 0
K   3 3 3 3 3 3 3 3 3 2 1 0 0
C   2 2 3 2 2 2 2 3 2 3 1 0 0
R   2 1 1 1 1 2 1 1 1 1 2 0 0
H   1 2 1 1 1 1 1 1 1 1 1 0 0
P   0 0 0 0 0 0 0 0 0 0 0 1 0

A H C N I - R Q C L C R - P M

A I C - I N R - C K C R H P M

Page 13: Dot Plots, Path Matrices, Score Matrices

Searching Sequence Databases

Can you inherit functional information?

Do fast scans using approximate methods, e.g. BLAST or PSI-BLAST

Align proteins carefully using a dynamic programming method: Needleman & Wunsch, Smith & Waterman

Scan against sequence profiles (or HMMs) in secondary databases, e.g. Pfam, Gene3D, InterPro

Align query sequence against family relatives using: ClustalW, Jalview, MUSCLE, MAFFT

Page 14: Dot Plots, Path Matrices, Score Matrices

Profile Based Sequence Search Methods

by comparing related sequences within a protein family, one can identify patterns of conserved residues

even the most distant members of the family should have these patterns of conserved residues

one can make a profile which encapsulates these patterns and use it to detect more distantly related sequences

highly conserved positions usually correspond to the buried core or to functional residues within the active site

Page 15: Dot Plots, Path Matrices, Score Matrices

Iterated Application of BLAST: PSI-BLAST

Altschul et al. (1997)

• first constructs a multiple alignment of all the related sequences identified by BLAST

• then estimates the residue frequencies at each position to construct a score matrix: Position Specific Score Matrices (PSSMs), also known as weight matrices or profiles

Page 16: Dot Plots, Path Matrices, Score Matrices

PSI-BLAST

Altschul et al. (1997)

[Diagram: a query sequence is searched against the UniProt Database; PSI-BLAST aligns the matched sequences and builds a profile; further iterations pull out more distant sequence relatives.]

Page 17: Dot Plots, Path Matrices, Score Matrices

PSI-BLAST

Use the Multiple Alignment to Calculate Residue Frequencies

[Figure: multiple alignment of the query, its relatives and a putative relative, with positions P1 … P5, P6 … Pn.]

the residue frequencies at each position are used to calculate the scores for aligning a query sequence against the pattern

three times more powerful than BLAST!!
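Turning per-position residue frequencies into alignment scores can be sketched as a small log-odds calculation. This is an illustration only, not PSI-BLAST's actual weighting scheme: the uniform background frequency, the simple pseudocount, the toy alignment and the function names `build_pssm` and `score` are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        # Gap characters are ignored when counting residues.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, query):
    """Score an ungapped query of the same length against the profile."""
    return sum(column[res] for column, res in zip(pssm, query))


if __name__ == "__main__":
    # Hypothetical toy alignment of three relatives.
    relatives = ["VILS", "VILT", "VLLS"]
    profile = build_pssm(relatives)
    for candidate in ("VILS", "VLLT", "WWWW"):
        print(candidate, round(score(profile, candidate), 2))
```

Residues that are frequent at a position receive positive scores and residues never observed there receive negative ones, so a query matching the conserved pattern scores higher than an unrelated one.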

Page 18: Dot Plots, Path Matrices, Score Matrices

[Figure: path matrix or score matrix for the sequence AICINRCKCRHP, scored with a position specific substitution matrix (rows A, L, V, …, R, H; example entries 10, 10, 20, 70, 90, …).]

Page 19: Dot Plots, Path Matrices, Score Matrices

Multiple Alignment

• direct extensions of the standard DP approach for the alignment of 2 sequences are computationally infeasible for more than 3 sequences

• practical heuristic solutions are based on the idea that sequences are evolutionarily related and can be aligned using an underlying phylogenetic tree

this is known as progressive alignment

Page 20: Dot Plots, Path Matrices, Score Matrices

4 sequences A, B, C, D

(1) Pairwise Alignment

6 pairwise comparisons, then cluster analysis

[Figure: guide tree with leaves B, D, A, C; aligned pairs BD and AC.]

(2) Multiple Alignment following the tree from 1

Align most similar pair: gaps to optimise alignment

Align next most similar pair

Align alignments - preserve gaps: new gap to optimise alignment of BD with AC
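The first stage, all pairwise comparisons followed by choosing which pairs to align first, can be sketched as below. The four toy sequences are hypothetical, ungapped percent identity stands in for a proper pairwise DP alignment score, and the greedy pairing is a simplification of real cluster analysis; the function names are assumptions too.

```python
from itertools import combinations


def identity(seq_x, seq_y):
    """Fraction of identical positions between two ungapped sequences."""
    same = sum(x == y for x, y in zip(seq_x, seq_y))
    return same / min(len(seq_x), len(seq_y))


def pairwise_scores(seqs):
    """Score every pair of named sequences: n(n-1)/2 comparisons."""
    return {(p, q): identity(seqs[p], seqs[q])
            for p, q in combinations(sorted(seqs), 2)}


def merge_order(seqs):
    """Greedily pick the most similar pair, then the next most similar
    pair among the sequences not yet used."""
    scores = pairwise_scores(seqs)
    used, order = set(), []
    for p, q in sorted(scores, key=scores.get, reverse=True):
        if p not in used and q not in used:
            order.append((p, q))
            used.update((p, q))
    return order


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the slides.
    seqs = {"A": "VILSTRIV", "B": "MKLPEFST", "C": "VILSTKLV", "D": "MKLPEFSS"}
    print(len(pairwise_scores(seqs)), "pairwise comparisons")
    print(merge_order(seqs))
```

For four sequences this gives the 6 pairwise comparisons mentioned above; with these toy sequences B and D are paired first and A and C second, after which the two pair alignments would be aligned to each other.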

Page 21: Dot Plots, Path Matrices, Score Matrices

Multiple Alignment

• start by aligning the most closely related pairs using DP and gradually align these groups together, keeping the gaps that appear in earlier alignments fixed

• alternatively, can add sequences one at a time to a growing multiple alignment

the heuristic approach is not guaranteed to find the optimum alignment - but it is soundly based, biologically

Page 22: Dot Plots, Path Matrices, Score Matrices

ClustalW

Higgins, 1997

• since the choice of parameters used can have a significant effect on the alignment for very distant sequences, ClustalW addresses this problem by:

  position specific gap opening and extension penalties

  using different amino acid substitution matrices - one for close relatives, one for distant

More recent resources:

MAFFT

MUSCLE

Jalview

Page 23: Dot Plots, Path Matrices, Score Matrices

ClustalW: Gap penalties

• where structure is known, one would want to increase the gap penalty within helices and strands and decrease it between them - forcing gaps to occur more frequently in loops

• if no structure is known, can use simple rules which depend on the residues occurring and the frequencies of gaps

e.g. use lower gap penalties where gaps already occur
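The idea of lowering the gap penalty where gaps already occur can be sketched as a per-column scaling. This is not ClustalW's actual formula: the linear scaling rule, the `base_penalty` and `min_factor` parameters, the toy alignment and the function name are all hypothetical.

```python
def column_gap_penalties(alignment, base_penalty=10.0, min_factor=0.3):
    """Return one gap opening penalty per alignment column.

    Columns without gaps keep base_penalty; a column made up entirely of
    gaps would drop to base_penalty * min_factor, with linear scaling in
    between.
    """
    n_seqs = len(alignment)
    penalties = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / n_seqs
        factor = 1.0 - (1.0 - min_factor) * gap_fraction
        penalties.append(base_penalty * factor)
    return penalties


if __name__ == "__main__":
    # Hypothetical toy alignment with gaps concentrated in two columns.
    alignment = ["VIL-ST", "VILPST", "VI--ST"]
    for position, penalty in enumerate(column_gap_penalties(alignment)):
        print(position, round(penalty, 2))
```

A structure-aware variant would raise the penalty inside helices and strands instead, following the first rule above.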

Page 24: Dot Plots, Path Matrices, Score Matrices

Searching Protein Family Databases

Secondary databases (as opposed to primary sequence databases) group proteins into related families

Families are usually represented by a sequence profile or sequence model (Hidden Markov Model, HMM) derived from a multiple sequence alignment of the relatives

Page 25: Dot Plots, Path Matrices, Score Matrices

HMMs for Protein Domain Family Recognition

Pfam, SUPERFAMILY, Gene3D: Hidden Markov Models (HMMs)

• sequence is aligned using a probabilistic model of interconnecting match, delete or insert states

• contains statistical information on observed and expected positional variation - "fingerprint of a protein family"

[Diagram: profile HMM with begin (B) and end (E) states and match (Mi), delete (Di) and insert (Ii) states.]

Page 27: Dot Plots, Path Matrices, Score Matrices

Pfam-A: 10,340 curated families with annotation

Pfam-B: 224,303 families derived from ADDA (50% clearly related to a Pfam-A)

Coverage   Sequences   Residues
UniProt    74%         51%
PDB        94%         76%

[Chart legend: Pfam-A, Pfam-B, Other]

Page 28: Dot Plots, Path Matrices, Score Matrices

Pfam:

[Diagram: a manually curated SEED alignment of representative members is used to build a profile-HMM (HMMer-2.0); the profile-HMM is used to search UniProt, giving the automatically made FULL alignment.]

Page 29: Dot Plots, Path Matrices, Score Matrices

Protein

Pfam classification

Protein fold, etc.

Page 30: Dot Plots, Path Matrices, Score Matrices

Protein

Family

Protein fold, etc.

Pfam classification

Page 31: Dot Plots, Path Matrices, Score Matrices

Protein

Clan

Family

Protein fold, etc.

Pfam classification

Page 32: Dot Plots, Path Matrices, Score Matrices
Page 33: Dot Plots, Path Matrices, Score Matrices
Page 34: Dot Plots, Path Matrices, Score Matrices
Page 35: Dot Plots, Path Matrices, Score Matrices