Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

43
Aidan Budd, EMBL Heidelb Pairwise Alignments and Sequence Similarity-Based Searching

Transcript of Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Page 1: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Pairwise Alignments and Sequence Similarity-Based

Searching

Page 2: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Residues:Monomers within a polymer (polypeptide or polynucleotide) chain

residues

residues

"Anatomy" of a Sequence Alignment

WKKLGSNVG

WGKVKNVDsequences

Sequences:List of residues in a polymer chain... ...listed in the same order they occur within the polymer

Page 3: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

1:1 residue correspondences/relationships

sequences

residues

residues

"Anatomy" of a Sequence Alignment

WKKLGSNVG

WGKVKNVD

1:1 residue correspondences/relationshipsCorrespondences between

•a single residue in one sequence and •a single residue in another sequence

Page 4: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Could perhaps say there is a "1:2" relationship between this residue and these residues

Residue has no equivalent in the top sequencei.e. no residue in the top sequence has a 1:1 relationship with this residue

1:1 residue correspondences/relationships

"Anatomy" of a Sequence Alignment

WKKLGSNVG

WGKVKNVD

However, alignments focus on 1:1 relationships

Page 5: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Often represented using a grid/matrix:

W K K L G S N V GW G K V K N V D

Sequence Alignment Within a Grid

WKKLGSNVG

WGKVKNVD

One sequence per row

--

Gap characters (usually "-") indicate that the sequence contains no residues 'equivalent' to other residues in that column

-

Residues in the same column are 'equivalent'

Page 6: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Multiple Sequence Alignment

W K K L G S N V GW G K V - - N V D- A K V - - - V D

Page 7: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Pairwise Sequence Alignment

Raise your hands:•Who has ever seen a "pairwise alignment" before?•Who has ever used/encountered one in their research?

Pairwise sequence alignments are a crucial component of many bioinformatic analyses and tools

In particular they lie at the heart of the tools most commonly used to predict function of protein/DNA/RNA molecules i.e. to generate hypotheses for the function of key biological entities that can be tested by wet-lab experiments

Page 8: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Building Sequence Alignments

Page 9: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

How They Used to Align Sequences...

Courtesy of Geoff Barton, Dundee

Page 10: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Pairwise Alignments

How we can build them today "by hand" using JalView

Why it's useful to look at this, despite the many automatic methods

Sometimes we're right and the automatic methods are wrong and we can spot this

Useful to get you thinking about how we choose a good from a bad alingment, and to identify sequences that are easier of more difficult to align (as these are the same kinds of sequences that automatic tools find easy/difficult to align, so we know better what to expect from such tools

Page 11: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Pairwise Alignments

How we can build them today "by hand" using JalViewhttp://tinyurl.com/protBioinf2011

During exercises, write down:

1.Features that describe a relatively "good" alignment, thinking in terms of•sizes of gaps•numbers of gaps•properties of residues in the same column (the same as each other? different?)1.Instructions on how to change a "bad" alignment into a better one2.Characteristics of sequences that are more difficult/take more time to align than othersDescribe what you've written to your neighbour - have you come up with the same answers?

Page 12: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Pairwise Alignments

1.Features that describe a relatively "good" alignment:• small, few gaps• residues in the same column either the same amino acid, or

ones with similar physical/chemical property1.We build a "good" alignment by inserting as few, as short as

possible gaps into the alignment, so that as many columns as possible contain the same/similar residues

2.Sequences that are more difficult/take more time to align than others when:

• They are more divergent (e.g. when the best alignment contains fewer "identical" residues)

• They are longer• They have very different lengths• They are DNA rather than the corresponding coding

sequences• They contain "repeated" regionsPart of the reason why these situations cause problems is that, rather than there clearly being one best alignment, we find several that are "equally" good (or bad)

Page 13: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

1:1 Relationships Between Residues in the Same Column

Thinking in terms of (i) protein structure and (ii) evolution, what defines the relationship that you expect residues in the same column of a good/correct alignment to share with each other?

Think about this on your own for a minute.Perhaps there's just a single word you might want to use to describe this sought-after relationship?

Try and explain your answers to your neighbours

Try and write down a definition of this relationship

Page 14: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Alternative Interpretations of MSAs (Evolutionary and

Structural)

Page 15: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

•Structurally equivalent/similar

•Evolutionary equivalent/related/homologous

Residues in the same column either:

Different applications assume different types of equivalence

Different types of similarity not necessarily equivalent

“Equivalence”/similarity of residues

Page 16: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Structural Similarity

Bacterial toxins 1ji6 and 1i5p

Unaligned Structurally Aligned

Page 17: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Structural Similarity

Chain1: 68 ELIGLQANIREFNQQVDNF 1111111111111111111Chain2: 70 ELQGLQNNFEDYVNALNSW

Residues with a similar structural context may lie almost on top of each other within a structural alignment. Clearly, the dark green and red side chains have more similar structural contexts than they do with the adjacent light-coloured side chains

Page 18: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Chain 1: 16 KVGSLIGKR---ILSELWGIIFPSGST 111111111 11111111111 111Chain 2: 16 VVGVPFAGALTSFYQSFLNTIWP-SDA

Structural Similarity

Page 19: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Structural equivalence

Some regions of the structures do not have structurally equivalent residues in the other structure

1i5p: DNFLNPTQN----PVPLSITSSVN 111111 111111111111ji6: NSWKKTPLSLRSKRSQDRIRELFS

Alignment gaps are a sure sign of such residues

Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!

Page 20: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

“Evolutionary Equivalence”

AGWYTI

AGWWTI

AGWYTI

AGWWTI

AAWYTI

AGWWTI

AAQQQWYTI

Mutation / Substituti

onY-W

Substitution G-A

QQQ Insertio

n

AGWYTI

AGWYTI

Two copies of

gene generate

d

AGWYTI AGWYTI

AGWYTI

AGWWTI

AGWYTI

AGWWTI

AAWYTI

AG---WWTI

AAQQQWYTI

Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence with any changes due only to point substitutions

Page 21: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Quiz - Evolutionary Interpretation of Alignments

Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?

KGE--------PGIGL------PGKGIPG-----------DPAFGDPGRGIPGEVLGAQ-----------PG

ZKGEPG------IGL------PGKGIPG---------DPAFGDPGRGIPGEVLGAQ---------PG

YKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

X

Page 22: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Quiz - Evolutionary Interpretation of Alignments

KGE--------PGIGL------PGKGIPG-----------DPAFGDPGRGIPGEVLGAQ-----------PG

"True" alignment given history described aboveRGIPGEVLGAQPGKGIPGDPAFGDPG---KGEPGIGLPG

PRANK

Page 23: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

RGIPGEVLGAQPGKGIPGDPAFGDPG---KGEPGIGLPG

PRANKKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

MAFFTK---GEPGIGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

CLUSTALX

Different automatic MSA software gives different results

All are different from the "true" alignment (assuming the scenario of transformation on the previous slide is true)...

... because that scenario is very unlikely under the models of evolutionary transformation incorporated within these tools

KGE--------PGIGL------PGKGIPG-----------DPAFGDPGRGIPGEVLGAQ-----------PG

ZKGEPG------IGL------PGKGIPG---------DPAFGDPGRGIPGEVLGAQ---------PG

YKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

X

Quiz - Evolutionary Interpretation of Alignments

Page 24: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Interpreting Alignments

• Special 1:1 relationship between residues in the same column

• Structural: very similar structural context

• Evolutionary: any difference between residues in the same column due to point substitution (not to any other kind of mutation e.g. deletion followed by insertion)

• Structural and Evolutionary equivalence need not necessarily be the same

• Not all residues have 1:1 equivalents in other sequences

Page 25: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

nef1/fyn1PDB:1efn

Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 1:Structural equivalence without evolutionary equivalence

Structural alignment of SH3-interaction motifs from nef and ncf1

ncf1/ncf4PDB:1w70

aligned ncf1/nef1 SH3 interaction motifs

Page 26: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 2b:Sequences differ by ONE amino acid residue and have different folds

Proc Natl Acad Sci U S A. 2009 Dec 15;106(50):21149-54.A minimal sequence code for switching protein structure and function.Alexander PA, He Y, Chen Y, Orban J, Bryan PN.PMID: 19923431

GA95 GB95

Page 27: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Quiz - Numbers of Insertions

(a) 2 (b) 1 (c) 0 (d) 3

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

Page 28: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Quiz - Numbers of Insertions

If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions

If and alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

Page 29: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Quiz - Numbers of Insertions

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?We can ALWAYS explain observed sequence length

diversity with:•0 insertions (all length variation due to deletion)•0 deletions (all length variation due to insertion)•a combination of insertions and deletions

Perhaps we should instead focus on inferring the most likely scenario?(Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)

Page 30: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Identifying Good Alignments

Page 31: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Distinguishing Better from Worse Alignments

An "objective" way of choosing which of two alignments (between the same pair of sequences) is "better" (more likely to be correct) would help us identify "good" alignments

Scoring schemes/rules have been developed for this purpose

These aim to assign higher scores to alignments that are more similar to true alignments

NOTE - we are optimising against some "average"

protein - thse may be different for different proteins, howver it's still demonstrably useful

higher score

lower score

Page 32: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Scoring Schemes for Aligned Residues

SeqA N A N

SeqB N - G

3.8 0.4

SeqA N A N

SeqB N G -

3.8 0.5 score = 3.8 + 0.5 = 4.3

score = 3.8 + 0.4 = 4.2

Commonly, scoring schemes (for columns containing no gaps) assign a score for each column...

...the score for the full alignment is then the sum of the individual column scores

Below we calculate, in this way, a score for two alignments of the same sequencesFor each column, the score assigned depends on which residues are found in the column

Aligned residue scores are taken from an amino acid substitution matrix - see next slide...

Page 33: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

Gonnet PAM250

SeqA N A N

SeqB N - G

3.8 0.4

SeqA N A N

SeqB N G -

3.8 0.5 score = 3.8 + 0.5 = 4.3

score = 3.8 + 0.4 = 4.2

Matrix provides a score for alignment of each different pair of amino acids

Page 34: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

Gonnet PAM250

Values in some cells in this matrix are positive

Although more cells contain negative values

Page 35: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

Gonnet PAM250

Values are obtained analysing alignments between sequences that we are very confident:•have similar 3D structures•evolutionarily related by processes of:

•point substitution•insertion•deletion

Page 36: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

One set of matrices (the BLOSUM series) are built by analysing many ungapped regions ("blocks") of alignments of fairly similar sequences

Block IPB000108A

ID P67PHOX; BLOCK

AC IPB000108A; distance from previous block=(7,386)

DE Neutrophil cytosol factor 2 signature

BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146

NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9

NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9

O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

YKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40

Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 9

Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9

Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 9

Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9

Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9

Q6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22

Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95

Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37

Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9

O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 20

Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 10

Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

Q6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43

Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13

Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25

Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16

Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16

Q4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16

O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100

Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10

Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27

//

http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A

Values are obtained analysing alignments between sequences that we are very confident:•have similar 3D structures•evolutionarily related by processes of:

•point substitution•insertion•deletion

Page 37: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

Block IPB000108A

ID P67PHOX; BLOCK

AC IPB000108A; distance from previous block=(7,386)

DE Neutrophil cytosol factor 2 signature

BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146

NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9

NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9

O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

YKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40

Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 9

Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9

Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 9

Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9

Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9

Q6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22

Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95

Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37

Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9

O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 20

Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 10

Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

Q6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43

Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13

Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25

Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16

Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16

Q4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16

O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100

Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10

Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27

//

http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A

Key parameters estimated in analysis:

•frequency (qij) with which residues i and j are found in the same column (averaged over all analysed blocks)

•frequency with which each pair of residues are found in same column if all sequences randomised

•where pi is the frequency with which residue i is present in the complete dataset, this is: pi * pj

If two residues i and j are more often found in the same column in blocks compared to randomised sequences thenqij / pi*pj > 1 and log(qij / pi*pj) > 0

Page 38: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Amino Acid Substitution Matrices

Block IPB000108A

ID P67PHOX; BLOCK

AC IPB000108A; distance from previous block=(7,386)

DE Neutrophil cytosol factor 2 signature

BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146

NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9

NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9

O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

YKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40

Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 9

Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9

Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 9

Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9

Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9

Q6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22

Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95

Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37

Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9

O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 20

Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 10

Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10

Q6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43

Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13

Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25

Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16

Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16

Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16

Q4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16

O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100

Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10

Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27

//

http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A

Key parameters estimated in analysis:

•frequency (qij) with which residues i and j are found in the same column (averaged over all analysed blocks)

•frequency with which each pair of residues are found in same column if all sequences randomised

•where pi is the frequency with which residue i is present in the complete dataset, this is: pi * pj

qij / pi*pj < 1 and log(qij / pi*pj) < 0

If two residues i and j are less often found in the same column in blocks compared to randomised sequences then

Page 39: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

aa Substitution Matrix Values

What range of scores would you expect alignments that are more similar to alignments found in blocks, compared to alignments between random sequences, to have:

A. scores < 0

B. 0 < scores < 1

C. 0 > scores

Values in matrix cells are proportional to log(qij / pi*pj)

SeqA N A N

SeqB N G -

3.8 0.5score = 3.8 + 0.5 = 4.3

Page 40: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Sequence Similarity Searching

Query sequence

Database of

sequences

...

Build optimal alignment between query sequence and each database sequence

Calculate score for each optimal alignment

For each alignment, calculate how many alignments between randomised (structurally dissimilar/evolutionarily unrelated) sequences you would expect with a score the same or greater than the score for this alignment

This number is the E-value

Page 41: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Interpreting E-values

E-value of 0.001 means that you would expect this score to be found for an alignment between randomised sequences once in every 1000 similar searches

For each of values below, decide whether it it impossible that it be reported as the result of such a search:A.1B.0C.0.00001D.1000000000E.-1

Page 42: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Interpreting E-values

You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA

What is the most likely E-value for the highest-scoring alignment?

Page 43: Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Aidan Budd, EMBL Heidelberg

Interpreting E-values

You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA

You carry out several variants of this search, as described below. For each search, decide:•the most likely value of the E-value for the highest-scoring alignment•the most likely fraction of the score of highest-scoring alignment compared to that of the initial search (5 times larger? 5 times smaller etc.)

•delete the second half of seqA, search against dbB•query seqA against a database (dbC) which contains two copies of every sequence in dbB•query seqA against a database (dbD) from which every second sequence in dbB has been removed from