Sequence Alignment and Phylogeny


Page 1: Sequence Alignment and Phylogeny

Sequence Alignment and Phylogeny

Dr Peter Smooker, [email protected]

B I O I N F O R M A T I C S
| | |           | | |     |
B I O L O G Y - M A T H - S

Page 2: Sequence Alignment and Phylogeny

Uses of alignments

1. To determine the relationship (i.e. distance) between two sequences (pairwise alignment)

2. To search databanks for the presence of homologues

3. To look for sequence conservation in families of proteins

4. To use molecular approaches to phylogeny

Page 3: Sequence Alignment and Phylogeny

Comments/Caveats

• When sequences are aligned, we assume they share a common ancestor

• Protein fold is more conserved than protein sequence

• DNA sequences are less informative than protein sequences

• Two sequences can always be aligned; we need to determine what counts as a meaningful result

Page 4: Sequence Alignment and Phylogeny

Homology

• Proteins or genes are defined as homologous if they can be said to have shared an ancestor

• Genes or proteins are either homologs or they are not; there is no such thing as percent homology, only percent identity or percent similarity of the sequences

Page 5: Sequence Alignment and Phylogeny

“Ologies”

• Homology - descent from a common ancestor

• Orthology - descent from a speciation event

• Paralogy - descent from a duplication event

• Xenology - descent from a horizontal transfer event

Page 6: Sequence Alignment and Phylogeny

When Is Homology Real?

• As a general rule, in a pairwise alignment:

>25% identical aa’s: proteins will have a similar folding pattern, most likely homologous

18-25% identical: twilight zone, tantalizing

<18% identical: cannot determine from alignment
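
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the two inputs are assumed to be already aligned (equal length, "-" for gaps), and leaving gap columns out of the count is one convention among several.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of compared (non-gap) columns that hold the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are left out of the count
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def classify(pid):
    """Map percent identity onto the three bands given on the slide."""
    if pid > 25:
        return "likely homologous"
    if pid >= 18:
        return "twilight zone"
    return "cannot determine from alignment"


print(classify(percent_identity("ag-tcc", "cgctca")))
```

The thresholds apply to protein alignments; the short DNA strings are used here only to exercise the code.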

Page 7: Sequence Alignment and Phylogeny

Measuring Sequence Similarity

• Two measures of the distance between two strings:

1. Hamming distance: strings equal length, number of positions with mismatches

2. Levenshtein distance: not equal length, number of edit operations to change one string to the other

Page 8: Sequence Alignment and Phylogeny

agtc
cgta      Hamming distance = 2

ag-tcc
cgctca    Levenshtein distance = 3
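
Both distances are easy to compute directly. The sketch below is a minimal illustration (function names are my own): Hamming distance counts mismatching positions, and Levenshtein distance uses the usual dynamic-programming recurrence with unit cost for substitution, insertion and deletion.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # the slide's first example
print(levenshtein("agtcc", "cgctca"))  # the slide's second example, gap removed
```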

Page 9: Sequence Alignment and Phylogeny

Protein Alignments-Substitution Matrices

• When sequences diverge over time, they accumulate mutations- some are deleterious, some are neutral, some are advantageous

• Some changes are more likely than others

• This can be examined and the relative probability of a change occurring calculated

• Substitution matrices have been developed

Page 10: Sequence Alignment and Phylogeny

Matrices.

• PAM = Point Accepted Mutation (also read as Percent Accepted Mutation)

• Matrices are derived from families of proteins with a set level of identity.

• PAM matrices were proposed by Margaret Dayhoff. They are based on sequences with > 85% identity. The PAM 1 matrix was computed from these, then extrapolated for larger evolutionary distances

Page 11: Sequence Alignment and Phylogeny

PAM Matrices

PAM          0     30    80    110   200   250
% identity   100   75    50    40    25    20

• The PAM250 matrix corresponds to proteins of average 20% identity (the lowest we can reasonably be confident about). It was derived by extrapolating the observed substitution frequencies. PAM250 refers to 250 accepted substitutions per 100 amino acid positions; because some positions change more than once, about 20% of positions are still identical.

Page 12: Sequence Alignment and Phylogeny

Definition of PAM from BLAST literature

• http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

• One "PAM" corresponds to an average change in 1% of all amino acid positions. After 100 PAMs of evolution, not every residue will have changed: some will have mutated several times, perhaps returning to their original state, and others not at all. Thus it is possible to recognize as homologous proteins separated by much more than 100 PAMs. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates.
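
The point that 100 PAMs does not mean every residue has changed can be illustrated by extrapolating a one-step matrix through repeated multiplication. The sketch below is a toy under loud assumptions: a two-letter alphabet and a made-up matrix with a 1% chance of change per step, not Dayhoff's 20 x 20 PAM 1 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: each letter changes with probability 0.01.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

# Probability that a position shows its original letter after 100 steps,
# counting positions that changed and later changed back.
same_after_100 = mat_pow(TOY_PAM1, 100)[0][0]
print(round(same_after_100, 2))
```

In this toy model well over half of the positions still show their original letter after 100 steps, which is the effect the quoted definition describes.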

Page 13: Sequence Alignment and Phylogeny

BLOSUM Matrices

• Developed by S and JG Henikoff

• Made use of a much larger amount of data

• Based on the BLOCKS database of aligned protein domains

• http://www.blocks.fhcrc.org/

• Sequences with identities higher than a threshold are clustered and weighted as a single sequence. For example, in the common BLOSUM62 matrix, sequences more than 62% identical are clustered, so the scores reflect substitutions between sequences that are at most 62% identical

Page 14: Sequence Alignment and Phylogeny

BLOCKS

• The substitutions in each aligned column are identified and a score for each substitution calculated and inserted into the matrix.

Page 15: Sequence Alignment and Phylogeny

Which Matrix to use?

• In BLASTP, the following matrices are offered:

• PAM 30

• PAM 70

• BLOSUM 80

• BLOSUM 62 (default)

• BLOSUM 45

• In PAM, greater numbers = more evolutionary distance. Reverse for BLOSUM

Page 16: Sequence Alignment and Phylogeny

Which Matrix to use?

• Generally, BLOSUM matrices perform better than PAM matrices for local alignment searches

• Use the matrix appropriate for the task: if you expect a close match, use a low PAM or high BLOSUM number

• If you use the default (usually BLOSUM 62) and find nothing, go to a matrix derived from a more evolutionarily distant dataset

Page 17: Sequence Alignment and Phylogeny

Scoring

Score of mutation i > j = log( observed i > j / expected i > j )

Expected i > j is simply calculated from the frequencies of the amino acids

Result is multiplied by 10. Scores are added.
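
The scoring rule can be sketched as follows. Several things here are assumptions rather than slide content: the logarithm is taken as base 10, the expected frequency is taken as the simple product of the two amino acid frequencies, and all function names are hypothetical. The four matrix entries are copied from the PAM250 table in this deck.

```python
import math


def log_odds_score(observed, freq_i, freq_j, scale=10):
    """Scaled log-odds score for the substitution i > j.

    observed: observed frequency of the aligned pair (i, j)
    freq_i, freq_j: background frequencies of the two amino acids
    Assumption: base-10 logarithm, expected = freq_i * freq_j.
    """
    expected = freq_i * freq_j
    return round(scale * math.log10(observed / expected))


# A few PAM250 entries from the table in this deck, keyed by sorted pair.
PAM250_SUBSET = {("A", "A"): 2, ("R", "R"): 6, ("I", "L"): 2, ("C", "W"): -8}


def alignment_score(a, b, matrix=PAM250_SUBSET):
    """Add the per-position scores of two gap-free aligned sequences."""
    return sum(matrix[tuple(sorted((x, y)))] for x, y in zip(a, b))


# A pair seen twice as often as chance predicts gets a positive score.
print(log_odds_score(observed=0.02, freq_i=0.1, freq_j=0.1))
print(alignment_score("ARLW", "ARIC"))
```

Because the scores are logarithms, adding them along the alignment corresponds to multiplying the underlying odds ratios.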

Page 18: Sequence Alignment and Phylogeny

PAM250   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A        2
R       -2   6
N        0   0   2
D        0  -1   2   4
C       -2  -4  -4  -5   4
Q        0   1   1   2  -5   4
E        0  -1   1   3  -5   2   4
G        1  -3   0   1  -3  -1   0   5
H       -1   2   2   1  -3   3   1  -2   6
I       -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L       -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K       -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M       -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F       -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P        1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S        1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T        1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W       -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y       -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V        0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Page 19: Sequence Alignment and Phylogeny

• Scores above 0 indicate substitutions that are observed more often than expected by chance; different aa’s that give a high +ve score are usually functionally equivalent

• Scores below 0 indicate that those substitutions are rarely observed

Page 20: Sequence Alignment and Phylogeny

• Hydrophilic

Page 21: Sequence Alignment and Phylogeny

• These aa’s are hydrophobic (except glycine, often put in a class by itself).

Page 22: Sequence Alignment and Phylogeny

Interpreting scores- BLAST output

Page 23: Sequence Alignment and Phylogeny
Page 24: Sequence Alignment and Phylogeny

Significance

• Two values are given- the Bit score and the E-value.

• The E-value is the number of matches with at least that score that would be expected by chance in a database of that size. The lower the E-value, the more likely the match is real

• The bit score is derived from the raw score (calculated from the BLOSUM or PAM lookup matrix) but is normalised

Page 25: Sequence Alignment and Phylogeny

Bit Score

• Bit scores are normalised with respect to the scoring system. Hence they can be compared across different searches (using different matrices)

• In particular:

To convert a raw score S into a normalized score S' expressed in bits, one uses the formula S' = (lambda*S - ln K)/(ln 2), where lambda and K are parameters dependent upon the scoring system (substitution matrix and gap costs) employed
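
The conversion can be written out directly. In the sketch below the lambda and K values are illustrative placeholders of the kind BLAST reports for BLOSUM62 with gap costs; the real values should be read from the statistics section of each BLAST report. The relation E = m * n * 2^(-S') is the standard BLAST statistics formula linking bit score to E-value and is not taken from the slide; m and n stand for query length and database length.

```python
import math


def bit_score(raw_score, lam, k):
    """S' = (lambda * S - ln K) / ln 2, as given on the slide."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(bits, query_len, db_len):
    """E = m * n * 2**(-S'): chance hits expected at this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder parameters (assumption, not from the slide).
LAMBDA, K = 0.267, 0.041

bits = bit_score(raw_score=100, lam=LAMBDA, k=K)
print(round(bits, 1), expected_hits(bits, query_len=300, db_len=10**8))
```

Each extra bit halves the expected number of chance hits, which is why bit scores from searches with different matrices can be compared.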

Page 26: Sequence Alignment and Phylogeny

Multiple Sequence Alignment

• To quote Lesk

• “One amino acid sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud”

Page 27: Sequence Alignment and Phylogeny

Multiple Sequence Alignment

• Multiple sequence alignments can offer a considerable amount of information over a pairwise alignment.

– Regions of similarity (especially distant similarity) can be detected

– Regions of functional significance can often be detected

– Evolutionary relationships can be examined, and trees drawn.

Page 28: Sequence Alignment and Phylogeny

MSA’s are computationally expensive

• If we use dynamic programming, then rather than a 2D array as for pairwise comparison we have an n-dimensional array. Computational time grows as M^n, where M is the sequence length and n is the number of sequences. Difficult for n=4, impractical for higher values.

• Use a heuristic approach. Most common is the CLUSTAL algorithm
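
To make the cost argument concrete, here is a minimal sketch of the pairwise case (a global alignment score filled into a 2D array) together with a count of how many cells the n-dimensional version would need. The match, mismatch and gap values are arbitrary illustrative choices, and the function names are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b, filled into a 2D table."""
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap  # a prefix aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = dp[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            dp[i][j] = max(diagonal, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[-1][-1]


def dp_cells(seq_len, n_seqs):
    """Cells in an n-dimensional table for n sequences of length seq_len."""
    return (seq_len + 1) ** n_seqs


print(global_alignment_score("agtc", "cgta"))
for n in (2, 3, 4, 5):
    print(n, dp_cells(300, n))
```

For 300-residue proteins the table grows by a factor of about 300 with every added sequence, which is why heuristics are used for multiple alignment.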

Page 29: Sequence Alignment and Phylogeny

Progressive Alignment

• Iterative pairwise alignment

• Two most similar sequences aligned first, then next most similar to that pair, etc.

• A very popular progressive alignment algorithm is CLUSTAL W

Page 30: Sequence Alignment and Phylogeny

CLUSTAL W- Steps

• A matrix of pairwise distances between all sequences is constructed. This determines the similarity between all sequences to be aligned.

• A guide tree (dendrogram), or inferred phylogeny, is built

• The alignment is constructed based on the guide tree.

• Generally results in a near-optimal alignment
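
The first steps above can be sketched as follows. This is not the CLUSTAL W implementation: the distance used is a plain fraction of mismatching positions over equal-length fragments (CLUSTAL W derives distances from pairwise alignments), and only the first join of the guide tree is shown. The fragments are the first nine residues of three sequences from the cathepsin L alignment later in this deck.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Pairwise distances keyed by (name, name), names in sorted order."""
    return {(m, n): p_distance(seqs[m], seqs[n])
            for m, n in combinations(sorted(seqs), 2)}


def closest_pair(distances):
    """The pair a progressive aligner would align first."""
    return min(distances, key=distances.get)


FRAGMENTS = {
    "FgCatL1-a": "GNMGCSGGL",
    "FgCatL1-b": "GNYGCMGGL",
    "FhCatL8": "GNHGCGGGW",
}

print(closest_pair(distance_matrix(FRAGMENTS)))
```

A full guide tree repeats this step, replacing the joined pair with a cluster and recomputing distances until one tree remains.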

Page 31: Sequence Alignment and Phylogeny

CLUSTAL W

• A major problem in MSA is the selection of an appropriate matrix for alignments consisting of divergent and closely related sequences

• CLUSTAL W (weighted) assigns weights to a sequence dependent on how divergent it is from the two most closely related sequences

• Adapts gap penalties and scoring matrix to suit

Page 32: Sequence Alignment and Phylogeny

An example (from our research)

• Some definitions:

• Phylogeny: Evolutionary history (“tree of life”)

• Molecular phylogeny: Determined using sequence data

• Bootstrapping: A statistical process to evaluate phylogenetic trees. The data is resampled 1000 times (generally) and the support for each branch determined

• Homology modelling: Predicting the structure of a protein based on the experimentally derived structure of a homologue
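
Bootstrapping of an alignment can be sketched as resampling its columns with replacement. The example below is a simplification under stated assumptions: instead of rebuilding a tree for each replicate, it only asks how often two chosen sequences remain the unique closest pair, which stands in for "support for a branch". All names are hypothetical.

```python
import random


def resample_columns(seqs, rng):
    """Draw alignment columns with replacement, keeping the length."""
    length = len(seqs[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(s[p] for p in picks) for s in seqs]


def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))


def pair_support(seqs, i, j, replicates=1000, seed=0):
    """Fraction of replicates in which seqs i and j are the closest pair."""
    rng = random.Random(seed)
    wanted = (min(i, j), max(i, j))
    hits = 0
    for _ in range(replicates):
        sample = resample_columns(seqs, rng)
        target = mismatches(sample[i], sample[j])
        others = [mismatches(sample[a], sample[b])
                  for a in range(len(sample))
                  for b in range(a + 1, len(sample))
                  if (a, b) != wanted]
        if all(target < d for d in others):
            hits += 1
    return hits / replicates


alignment = ["GNMGCSGGL", "GNYGCMGGL", "GNHGCGGGW"]
print(pair_support(alignment, 0, 1))
```

A value near 1 means the grouping survives almost every resampling; a real analysis would build a tree from each replicate and count how often each branch reappears.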

Page 33: Sequence Alignment and Phylogeny

Fasciola- Liver Fluke

NEJ Adult

Page 34: Sequence Alignment and Phylogeny

Liver fluke (Fasciola spp.)

• Trematode (flatworm) parasite

• Infects ruminants, humans

• Has a complex life-cycle

• Secretes proteins (excretory/secretory material)

• Major secreted protein is cathepsin L in adults

Page 35: Sequence Alignment and Phylogeny

Cysteine proteases

• Digest proteins: cleave between adjacent amino acids.

• Not random cleavage, different proteases show a preference for different targets.

Page 36: Sequence Alignment and Phylogeny

There are a number of Fasciola cathepsin L sequences known.

• At least 30 full sequences now known

• Only one contains an indel

• Protein sequences 46-99% identical

Page 37: Sequence Alignment and Phylogeny

What are the differences between the two classes of CatL that account for the substrate specificity?

Presumed to be due to changes affecting the S2 subsite of the enzyme.

Page 38: Sequence Alignment and Phylogeny

Homology Modelling

• FhCatL modelled on the known crystal structure of human CatL.

• Models of CatL2 and CatL5 (functional equivalent of CatL1) compared, especially around the S2 subsite of the enzyme.

Page 39: Sequence Alignment and Phylogeny

Homology Modelling

• Three substitutions in residues lining the S2 subsite were observed (L5 -> L2)

• L69Y: Makes substantial contacts with the P2 Phe

• N161T: Side chain points away from pocket

• G163A: Bottom of pocket, no substantial contact with P2 Phe

Page 40: Sequence Alignment and Phylogeny

GRASP electrostatic surface potential

The architecture around the S2 pocket is substantially influenced by a Y or L at position 69.

Made mutant, expressed in yeast, performed kinetic analysis.

L2 L5

Page 41: Sequence Alignment and Phylogeny

Conclusions

• The L69Y change does affect the substrate specificity

• 69Y allows increased catalysis of substrates with a P2 proline

• There are other, more subtle changes between L5 and L2

Page 42: Sequence Alignment and Phylogeny

What about the other enzymes- CLUSTAL W

Page 43: Sequence Alignment and Phylogeny

What amino acid is at #69?

Page 44: Sequence Alignment and Phylogeny

Residues 61-120:

FgCatL1-a  GNMGCSGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-b  GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-c  GNFGCNGGLMENACEYLKRFGLETESSYPYRAVEGPCRYNKQLGVAKVTGYYMVHSGDEV
FgCatL1-d  GNHGCGGGYMENAYEYLKHSGLETDSYYPYQAVEGPCQYDGRLAYAKVTDYYTVHSGDEV
FgCatL1-e  GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEDQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL1-f  GNNGCRGGLMEIAYEYLRRFGLEIESTYPYRAVEGPCRYDRRLGVAKVTGYYIVHSGDEV
FgCatL2    GNMGCSGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FgCatL3    GNINCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FhCatL1    GNNGCGGGLMENAYQYLKQFGLETESSYPYTAVGGQCRYNKQLGVAKVTGYYTVQSGSEV
FhCatL2    GNYGCGGGYMENAYEYLKHNGLETESYYPYQAVEGPCQYDGRLAYAKVTGYYTVHSGDEI
FhCatL3    GNNGCSGGLMENAYQYLKQFGLETESSYPYTAVEGQCRYNKQLGVAKVTGYYTVHSGSEV
FhCatL4    GNYGCNGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGDEV
FhCatL5    GNYGCNGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGDEV
FhCatL6    GNYGCMGGLMENAYEYLKQFGLETESSYPYTAVEGQCRYNRQLGVAKVTDYYTVHSGSEV
FhCatL7    GNYGCGGGYMENAYEYLKHNGLETESYYPYQAVEGPCQYDGRLAYAKVTGYYTVHSGDEI
FhCatL8    GNHGCGGGWMENAYKYLKNSGLETASYYPYQAVEYQCQYRKELGVAKVTGAYTVHSGDEM
FhCatL9    GNNGCSGGLMENAYEYLKRFGLETESSYPYRAVEGQCRYNEQLGVAKVTGYYTVHSGSEV
FhCatL10   GNHGCGGGWMENAYKYLKNSGLETASDYPYQGWEYQCQYRKELGVAKVTGAYTVHSGDEM
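
Reading off residue 69 by eye is error-prone, so a small lookup helps. The sketch below assumes the block starts at residue 61, as labelled on the slide, and that there are no gaps before position 69. Only the first 20 residues of three of the sequences are included, and the helper name is hypothetical.

```python
ALIGNMENT_START = 61  # number of the first residue shown on the slide

# First 20 residues of three sequences from the alignment above.
FRAGMENTS = {
    "FhCatL5": "GNYGCNGGLMENAYEYLKRF",
    "FhCatL2": "GNYGCGGGYMENAYEYLKHN",
    "FhCatL8": "GNHGCGGGWMENAYKYLKNS",
}


def residue_at(seq, position, start=ALIGNMENT_START):
    """Residue at a numbered position of a gap-free aligned fragment."""
    index = position - start
    if not 0 <= index < len(seq):
        raise IndexError("position lies outside the fragment")
    return seq[index]


for name, seq in FRAGMENTS.items():
    print(name, residue_at(seq, 69))
```

Applied to every row of the alignment, the same lookup groups the enzymes by the residue they carry at position 69.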

Page 45: Sequence Alignment and Phylogeny

Fasciola CatL’s form a monophyletic clade

• Fasciola sequences aligned to the family of papain-like cysteine proteases

• 100% bootstrap support for clade

• All Fasciola sequences arose after divergence from Schistosoma

• Probably all parasitic catLs have diverged after speciation (Sajid and McKerrow)

Page 46: Sequence Alignment and Phylogeny
Page 47: Sequence Alignment and Phylogeny

Relationship of Fasciola enzymes

• Tree constructed using 18 full-length sequences

• Resolved into 4 distinct clades

Page 48: Sequence Alignment and Phylogeny

AA69    Predicted Substrate

L69     Phe-Arg

Y69     Pro-Arg

L69     Phe-Arg

W69     ??-Arg

Page 49: Sequence Alignment and Phylogeny
Page 50: Sequence Alignment and Phylogeny

Evolutionary Timeframe

• First observed divergence (clade A) 135 MYA

• F. hepatica and F. gigantica predicted to diverge approx. 19-25 MYA

• Confirmed by constructing a neighbour-joining tree using Glutathione-S transferase sequences: 19 +/- 5.2 MYA

Page 51: Sequence Alignment and Phylogeny

Practice runs- 1. Blast

• Go to the BLAST server at NCBI

• http://www.ncbi.nlm.nih.gov/BLAST/

• Note the different “flavours” of BLAST that can be performed.

• Go to Protein-Protein BLAST. Look at the format and the searching parameters.

• Paste in sequence 1 and run the BLAST

Page 52: Sequence Alignment and Phylogeny

Sequence 1

• What is it? (note that a conserved domain is detected)

• From what organism (should be 100% match)?

• What is the organism that has the closest relative?

• What is meant by “positives”?

Page 53: Sequence Alignment and Phylogeny

• For interest, use sequence 2 to run a BLAST. This is the mRNA sequence from which the protein sequence is translated. (note- choose your BLAST flavour carefully!)

• Is the same result obtained?

Page 54: Sequence Alignment and Phylogeny

Practice runs- 2. CLUSTAL W

• Go to http://www.ebi.ac.uk/clustalw/

• Upload (or paste) Seq3.txt, run the tool

• Does the dendrogram resemble that previously demonstrated?