Multiple Sequence Alignments. Multiple Alignments Generating multiple alignments Web servers...

Multiple Sequence Alignments


Multiple Alignments

Generating multiple alignments
  Web servers

Analyzing a multiple alignment
  What makes a ‘good’ multiple alignment?
  What can it tell us, why is it useful?

Adjusting a multiple alignment
  Alignment editors and HowTo
  Demonstration and practice


What is a Multiple Alignment?

A comparison of sequences: “multiple sequence alignment”

A comparison of equivalents:
  Structurally equivalent positions
  Functionally equivalent residues
  Secondary structure elements
  Hydrophobic regions, polar residues


Generating multiple alignments

Pairwise sequence alignment is easy with sufficiently closely related sequences.

Below a certain level of identity, sequence alignment may become uncertain: the twilight zone for amino acid sequences lies at about 30% identity.

In or below the twilight zone it is good to make use of additional information, e.g. from evolution.

A multiple alignment of diverse sequences is more informative than a pairwise alignment: residues conserved over longer periods of time are under stronger evolutionary constraints.
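As a rough illustration of the identity threshold mentioned above, the sketch below computes percent identity for two already-aligned sequences. The choice to ignore gapped columns in the denominator is an assumption (other conventions exist), and the two fragments are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gapped columns are ignored (an assumption, see lead-in)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


TWILIGHT_ZONE = 30.0  # approximate threshold for amino acid sequences

# Hypothetical aligned fragments, for illustration only.
pid = percent_identity("MKT-LLKV", "MRTALL-V")
print(f"{pid:.1f}% identity, in twilight zone: {pid <= TWILIGHT_ZONE}")
```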


Multiple Sequence Alignment Algorithms

Multiple sequence alignment relies on heuristic methods in practice: with exact dynamic programming, computation time grows exponentially with the number of sequences.

Different methods/algorithms:
  Segment-based (DiAlign, …)
  Iterative (HMMs, DiAlign, PRRP, …)
  Progressive (ClustalW, T-Coffee, MUSCLE, …)
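To make the cost of exact dynamic programming concrete, here is a small back-of-the-envelope sketch. It assumes all sequences have the same length and counts only the cells of the alignment lattice and the neighbours each cell depends on; the sequence length of 300 is an arbitrary illustrative value.

```python
def dp_cells(seq_len: int, n_seqs: int) -> int:
    """Cells in the full dynamic-programming lattice for n_seqs sequences."""
    return (seq_len + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """Each cell depends on 2**n - 1 neighbours (residue or gap per sequence)."""
    return 2 ** n_seqs - 1


# Illustrative length of 300 residues (an assumption, not from the slides).
for n in (2, 3, 5, 10):
    print(n, dp_cells(300, n), predecessors_per_cell(n))
```

Two sequences need about 9 x 10^4 cells; each extra sequence multiplies that by roughly the sequence length, which is why heuristics are used instead.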


Progressive Alignment

Step 1: Calculate all pairwise alignments and calculate distances for all pairs of sequences.

Step 2: Construct a guide tree, joining the most similar sequences first, using Neighbour Joining.

Pairwise distances (Step 1), from which the guide tree (Step 2) is built:

     A  B  C  D  E
  B  2
  C  4  4
  D  6  6  6
  E  6  6  6  4
  F  8  8  8  8  8
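The sketch below builds a guide tree from the six-sequence distance table on this slide (read as a lower-triangular matrix for A to F). Note that it uses simple average-linkage clustering (UPGMA), not Neighbour Joining as named on the slide; UPGMA is swapped in only because it is shorter to show. The function names are illustrative.

```python
from itertools import combinations

# Pairwise distances from the slide, read as a lower-triangular matrix.
DIST = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}


def leaf_distance(x, y):
    """Look up a distance regardless of the order of the two labels."""
    return DIST.get((x, y), DIST.get((y, x)))


def average_distance(c1, c2):
    """Mean leaf-to-leaf distance between two clusters (UPGMA linkage)."""
    total = sum(leaf_distance(a, b) for a in c1[1] for b in c2[1])
    return total / (len(c1[1]) * len(c2[1]))


def upgma(leaves):
    """Return (nested tree string, list of merges) by average linkage."""
    # Each cluster is (tree string so far, tuple of leaf labels it contains).
    clusters = [(name, (name,)) for name in leaves]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1[0], c2[0], average_distance(c1, c2)))
        clusters = [c for c in clusters if c is not c1 and c is not c2]
        clusters.append((f"({c1[0]},{c2[0]})", c1[1] + c2[1]))
    return clusters[0][0], merges


tree, merges = upgma("ABCDEF")
print(tree)
for left, right, dist in merges:
    print(f"join {left} + {right} at distance {dist}")
```

The most similar pair (A, B) is joined first and the most distant sequence (F) last, which is the order in which a progressive aligner would then work through the tree.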


Step 3: From the tree, assign a weight to each sequence:
  We want to down-weight nearly identical sequences and up-weight the most divergent ones.

Step 4: Align sequences, starting at the leaves of the guide tree:
  Pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile).

Caveat: errors introduced early cannot be corrected by subsequent information.
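One way to derive the weights of Step 3 is to share each branch length equally among the sequences below it, which is how I understand ClustalW-style weighting to work; treat the scheme as an assumption here. The tree and its branch lengths are hypothetical, chosen so that A and B are close relatives and C is divergent.

```python
def count_leaves(node):
    """A node is a leaf name (str) or a tuple of (child, branch_length) pairs."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def leaf_weights(node, inherited=0.0):
    """Sum branch lengths from root to leaf, sharing each branch among its leaves."""
    if isinstance(node, str):
        return {node: inherited}
    weights = {}
    for child, length in node:
        share = length / count_leaves(child)
        weights.update(leaf_weights(child, inherited + share))
    return weights


# Hypothetical guide tree: ((A:1, B:1):1, C:2)
TREE = (
    ((("A", 1.0), ("B", 1.0)), 1.0),
    ("C", 2.0),
)

weights = leaf_weights(TREE)
print(weights)
```

A and B split the branch they share, so each ends up with a lower weight than the divergent sequence C.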


Web servers

ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/

T-Coffee: http://www.ebi.ac.uk/Tools/t-coffee/

MUSCLE: http://www.ebi.ac.uk/Tools/muscle/

DiAlign: http://dialign.gobics.de/

... and more at http://helix.nih.gov/apps/bioinfo/msa.html.


ClustalW features

Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.

Reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary structure.

Insertions and deletions are more common in loop regions than in the core of the protein!


T-Coffee features

More accurate than ClustalW

Instead of amino acid substitution matrices, uses consistency in a library of pairwise alignments

(Figure: graph of residue positions, with residues i and j marked.)

Vertices represent positions in a protein sequence. Edges represent pairwise alignments between protein sequences.

If residues i and j have many common neighbours, their consistency is high.
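The "common neighbours" idea can be sketched as a library extension step: the weight of an aligned residue pair is increased by the support it receives through residues of other sequences that are aligned to both. The min-of-two-weights rule follows my understanding of T-Coffee's triplet extension; the residues and weights below are invented for illustration.

```python
from collections import defaultdict


def extend_library(primary):
    """Add to each pair the support it gets via shared neighbours."""
    weight = {}
    neighbours = defaultdict(set)
    for (x, y), w in primary.items():
        weight[frozenset((x, y))] = w  # pairs are unordered
        neighbours[x].add(y)
        neighbours[y].add(x)
    extended = {}
    for (x, y), w in primary.items():
        support = sum(
            min(weight[frozenset((x, z))], weight[frozenset((z, y))])
            for z in neighbours[x] & neighbours[y]
        )
        extended[(x, y)] = w + support
    return extended


# Residues are (sequence name, position); weights are hypothetical.
A1, B1, B2, C1 = ("A", 1), ("B", 1), ("B", 2), ("C", 1)
primary = {
    (A1, B1): 80,  # supported through C1
    (A1, C1): 70,
    (C1, B1): 90,
    (A1, B2): 20,  # competing pairing with no common neighbour
}

extended = extend_library(primary)
for pair, w in extended.items():
    print(pair, w)
```

The pairing A1/B1 gains weight because C1 is aligned to both residues, while the unsupported alternative A1/B2 keeps its low primary weight.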


MUSCLE

Fast implementation

Sometimes more accurate than ClustalW or T-Coffee


Example

Let’s build a multiple alignment for the following sequences:

>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVWVHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYVNEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|2984094
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGHGDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWVDKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGP
EVIDANRDYLRWVQRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEIFAELVAYNGGQLPTCLA
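Before submitting sequences like these to a web server, it can help to check that they parse as FASTA (a ">" header line followed by one or more sequence lines). The minimal parser below is a sketch; the two records are truncated to their first 20 residues purely to keep the example short.

```python
def parse_fasta(text: str) -> dict:
    """Map each FASTA header (without '>') to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


# Truncated copies of the first two example sequences (illustration only).
EXAMPLE = """\
>query
MKNTLLKLGV
CVSLLGITPF
>gi|2984094
MGGFLFFFLLVLFSFSSEYP
"""

records = parse_fasta(EXAMPLE)
for name, seq in records.items():
    print(name, len(seq))
```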


ClustalW at EBI

Many options: CPU mode, full/fast alignment, window length in fast mode, … gap penalties.


ClustalW at EBI

Automatic display of:
  Score table
  Alignment (optional colouring)
  Guide tree

Link to Jalview alignment editor!


A note on the example

It is atypical: it uses only three sequences. One should use more in order to extract reliable information.

It illustrates a common mistake: it uses sequences that are too closely related. One should use sequences that are as divergent and diverse as possible in order to extract relevant information.


A Good Multiple Alignment?

Difficult to define…

Good ones look pretty!
  Aligned secondary structures
  Strongly conserved residues / regions
  Comparison with known structure helps

Bad ones look chaotic and random.


A Good Multiple Alignment?

(Figure: example alignment with conservation, quality and consensus annotation rows.)


Multiple Alignment Features

Barton (1993): “The position of insertions and deletions suggests regions where surface loops exist…

Conserved glycine or proline suggests a β-turn…

Residues with hydrophobic properties conserved at i, i+2, i+4 (etc) separated by unconserved or hydrophilic residues suggests a surface β-strand…

A short run of hydrophobic amino acids (4 or 5 residues) suggests a buried β-strand…

Pairs of conserved hydrophobic amino acids separated by pairs of unconserved or hydrophilic residues suggests an α-helix with one face packed in the protein core. Similarly, an i, i+3, i+4, i+7 pattern of conserved residues.”

Cysteine is a rare amino acid and often forms disulphide bonds (pairs of conserved cysteines).

Charged residues (histidine, aspartate, glutamate, lysine, arginine) and other polar residues embedded in a conserved region indicate functional importance.


Quality Assessment

Bad residues
  Large distance from column consensus

Bad columns
  Average distance from consensus is high – “entropy”

Bad regions
  Profile scores

Bad quality doesn’t always mean badly aligned!
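As a sketch of the "entropy" measure for columns, the code below computes the Shannon entropy (in bits) of each alignment column: 0 for a fully conserved column, higher for more variable ones. Treating a gap as just another symbol is an assumption, and the three-column alignment is hypothetical.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy in bits of the symbols in one alignment column."""
    total = len(column)
    return sum(
        (n / total) * math.log2(total / n)
        for n in Counter(column).values()
    )


def alignment_entropies(rows):
    """Entropy of every column; rows must all have the same length."""
    return [column_entropy("".join(col)) for col in zip(*rows)]


# Hypothetical alignment: conserved, two-state and fully variable columns.
rows = ["MKV", "MKL", "MRI", "MRA"]
entropies = alignment_entropies(rows)
print(entropies)
```

As the slide warns, a high-entropy column is not necessarily misaligned; it may simply be a variable position.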

(Figure: example sequence fragments RINAIEVMAKLIQ, LI, MIILVEIVLAM, PERMKIDQGQNMW, DLVTWDYAASLDF, DNPGGACRTTLID.)


Quality Assessment

Profiles
  A profile holds scores for each residue type (plus gaps) over every column of a multiple alignment
  Concepts:
    • Consensus sequence
    • Amino acid similarity

Some multiple alignment programs use profiles to build or add to an alignment

Any alignment, or even one sequence, can be a profile (one sequence isn’t a very good one…)
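A very simple profile can be sketched as per-column residue counts, from which a consensus sequence follows directly. Real profiles hold scores that also reflect amino acid similarity; plain counts are a simplification assumed here, and the alignment is hypothetical.

```python
from collections import Counter


def build_profile(rows):
    """One Counter of symbols (residues and gaps) per alignment column."""
    return [Counter(col) for col in zip(*rows)]


def consensus(profile):
    """Most frequent symbol in each column."""
    return "".join(counts.most_common(1)[0][0] for counts in profile)


# Hypothetical four-sequence alignment.
rows = ["MKVA", "MKLA", "MRL-", "MKLA"]
profile = build_profile(rows)
print(consensus(profile))
print(profile[1])  # counts for the second column
```

A single sequence also yields a profile this way, but every column then has just one count, which illustrates why one sequence is not a very good profile.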


What can we do with a multiple alignment?

Identify subgroups (phylogeny)
  Intra-group sequence conservation
  Evolutionary relatedness (view tree)

Identify motifs (functionality)
  Evolutionary signals
  Highly conserved residues indicate functional or structural significance!

Widen search for related proteins
  MA better than single sequence
  Consensus sequence / profile useful

RPDDWHLHLR
GGIDTHVHFI
GFTLTHEHIC
PFVEPHIHLD
PKVELHVHLD
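Reading the motif example above as five aligned 10-residue segments (an assumption about the original figure), the fully conserved columns can be picked out with a few lines of code.

```python
def conserved_columns(rows):
    """0-based indices of columns in which every row has the same residue."""
    return [
        index
        for index, column in enumerate(zip(*rows))
        if len(set(column)) == 1
    ]


# The 50-letter example split into five rows of 10 (assumed layout).
rows = [
    "RPDDWHLHLR",
    "GGIDTHVHFI",
    "GFTLTHEHIC",
    "PFVEPHIHLD",
    "PKVELHVHLD",
]

columns = conserved_columns(rows)
print(columns)
print([rows[0][i] for i in columns])
```

Under that reading, two histidine columns separated by one variable position are the only fully conserved ones, i.e. an H-x-H motif.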


What do we want to do?

Build a homology model?
  Accuracy

Perform phylogenetic analysis?
  Completeness

Functional analysis of a protein family?
  Diversity


Building the initial alignment

Fetch related sequences and run alignment
  ClustalW, DiAlign, T-Coffee, MUSCLE …

Fetch a multiple alignment from a database and add sequences of interest
  Pfam, ProDom, ADDA …

Start from a motif-finding procedure
  MEME, Pratt, Gibbs Sampler …


Adjusting the alignment

1. Filter alignment:
   Remove any redundancy
   Remove unrelated sequences
   Remove unwanted domains
   Recalculate alignment if necessary

2. Look for conserved motifs, adjust any misalignments. Try different colour schemes and thresholds.

3. One step at a time…
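The redundancy-removal part of step 1 can be sketched as a greedy filter: keep a sequence only if it is not too similar to any sequence already kept. The 90% identity cut-off, the function names and the three aligned rows are all assumptions made for illustration.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns without gaps."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def filter_redundant(alignment: dict, max_identity: float = 0.9) -> dict:
    """Greedily drop rows that exceed max_identity to an already kept row."""
    kept = {}
    for name, row in alignment.items():
        if all(identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


# Hypothetical aligned rows: s2 duplicates s1, s3 is divergent.
alignment = {
    "s1": "MKTLLKVGLC",
    "s2": "MKTLLKVGLC",
    "s3": "MKTVFIL-SM",
}

kept = filter_redundant(alignment)
print(list(kept))
```

After sequences are removed, some columns may contain only gaps, which is one reason to recalculate the alignment if necessary.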


Jalview Alignment Editor

Clamp, M., Cuff, J., Searle, S. M. and Barton, G. J. (2004), "The Jalview Java Alignment Editor", Bioinformatics, 20, 426-7.


Colouring your alignment

HYDROPHOBIC / POLAR: hydrophobic to polar

BURIED INDEX: buried to surface

β-STRAND LIKELIHOOD: probable to unlikely

HELIX LIKELIHOOD: probable to unlikely


Colouring your alignment

By conservation thresholds:


Colouring your alignment

Conservation index

Amino acid property classification schema, e.g. Livingstone & Barton (1993)


Sequence Features


Check PDB Structures

Load MA with sequence(s) for known PDB structure

View >> Feature Settings >> Fetch DAS Features (wait...)

OR Right-click >> Associate Structure with Sequence >> Discover PDB ids (quicker)

Right-click sequence name >> View PDB Entry

Structure opens in new window – residues acquire MA colours

Highlight residues by hovering mouse over alignment or structure

Label residues by clicking on structure


Compare Alignment to Structure

Crucial way of checking alignment!

Where are gaps / insertions / deletions?
  In secondary structures: bad
  In surface loops: okay

Where are our key / functional residues?
  Are they in a probable active site?
  Check they are clustered
  Check they are accessible, not buried


Demonstration and Practice

1. Start Jalview (click here)

2. Tools >> Preferences >>
   Visual: select Maximise Window, unselect Quality, set Font Size to 8 or 9, Colour >> Clustal, uncheck Open File
   Editing: check Pad Gaps When Editing

3. File >> Input Alignment >> from URL (use this one)

4. Get used to the controls – selecting and deselecting sequences/groups (drag mouse), dragging sequences/groups (use shift/ctrl), selecting sequence regions, hiding sequences/groups, removing columns and regions… Then explore menus and tools.

5. Now load this alignment – I’ve messed up a good alignment, and now I’d like you to correct it! There are two groups of sequences and one single sequence to adjust.


6. View >> Feature Settings >> DAS Settings
   Select Uniprot, dssp, cath, Pfam, PDBsum_ligands, PDBsum_DNAbinding, then click ‘Save as default’
   Click Fetch DAS Features (then click yes at prompt) ...
   Move mouse over alignment and read information about features
   Move mouse over sequence names to check for PDB ids

7. Open a PDB structure (choose any)

8. View >> uncheck Show All Chains, then use up-arrow key to increase structure size.

9. Hover mouse over the structure (see how residues are highlighted in the sequence), then do the same for the sequence. Select residues in the structure by clicking them – a label will appear. Click again to remove the label.

10. Check position of insertions & deletions using this method.