Previous Lecture: Hypothesis Testing 2: Comparing Samples

Multiple Alignment

Stuart M. Brown

NYU School of Medicine

Learning Objectives

Understand the need for multiple alignment methods in biology

Optimal methods (dynamic programming) are not practical to align many sequences

Progressive pairwise approach

Profile alignments

Editing alignments

Sequence Logos

Reasons for aligning sets of sequences

Organize data to reflect sequence homology

Estimate evolutionary distance

Infer phylogenetic trees from homologous sites

Highlight conserved sites/regions (motifs)

Highlight variable sites/regions

Uncover changes in gene structure

Look for evidence of selection

Summarize information

Pairwise Alignment

The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.

The best solution seems to be an approach called Dynamic Programming.

Dynamic Programming

Dynamic Programming is a general programming technique.

It is applicable when a large search space can be structured into a succession of stages, such that:

– the initial stage contains trivial solutions to sub-problems

– each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage

– the final stage contains the overall solution
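As a minimal sketch of these three stages for two sequences, the Python below fills a Needleman-Wunsch style scoring matrix. The function name and the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not values taken from the lecture.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Fill a global alignment matrix and return (best score, matrix)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        matrix[i][0] = i * gap
    for j in range(1, cols):
        matrix[0][j] = j * gap

    # Later stages: each cell is computed from exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            matrix[i][j] = max(
                matrix[i - 1][j - 1] + pair,  # align the two letters
                matrix[i - 1][j] + gap,       # gap in seq_b
                matrix[i][j - 1] + gap,       # gap in seq_a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return matrix[-1][-1], matrix


score, matrix = global_alignment_score("GTA", "AGT")
print(score)
for row in matrix:
    print(row)
```

Only the score is computed here; recovering the alignment itself would need a traceback through the matrix, which is left out to keep the sketch short.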

Multiple Alignments

Making an optimal alignment between two sequences is computationally straightforward, but aligning a large number of sequences using the same method is almost impossible.

The problem increases exponentially with the number of sequences involved, so it becomes computationally expensive (and inefficient) for large numbers of sequences.

Longer Sequences

     A   G   T

G   -1  -1  -2

T   -2  -2  -1

A   -2  -3  -3

What happens to the number of cells in the matrix when we add another base to one sequence? How about to both?

# cells = $L_1 \times L_2$, or $L^2$ if we use 2 sequences of the same length.

So the amount of computing grows with the square of the sequence length: bad but not terrible, because the compute time for each cell remains constant.

     A   G   T   A

G   -1  -1  -2   ?

T   -2  -2  -1   ?

A   -2  -3  -3   ?

C    ?   ?   ?   ?

Align Three Sequences by Dynamic programming

So how many cells (that contain values that must be computed) do we add for each additional sequence? It's a power function! For N sequences of length L: # of cells = $2^N \times L^N$

This is very bad for computing alignments of a lot of sequences!

Georg Fullen, VSNS Biocomputing, Univ. Munster

If the calculation takes 1 nanosecond per cell, then for 6 sequences of length 100, we'll have a running time of $2^6 \times 100^6 \times 10^{-9}$ seconds (64000 seconds, about 18 hours). Just add 2 more sequences, and the running time is $2^8 \times 100^8 \times 10^{-9} = 2.56 \times 10^9$ seconds (about 81 years).
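The growth can be checked with a few lines of Python. This is only a sketch of the slide's estimate: it assumes the cost model stated above (cells = 2^N × L^N, 1 nanosecond per cell), and the function names are made up for illustration.

```python
def dp_cells(n_sequences, length):
    """Cost estimate for N sequences of length L, using the 2^N * L^N model."""
    return (2 ** n_sequences) * (length ** n_sequences)


def dp_seconds(n_sequences, length, seconds_per_cell=1e-9):
    """Estimated running time in seconds at a fixed cost per cell."""
    return dp_cells(n_sequences, length) * seconds_per_cell


for n in (2, 3, 4, 6, 8):
    seconds = dp_seconds(n, 100)
    print(f"{n} sequences: {seconds:.3g} s = {seconds / 86400:.3g} days")
```

Each extra sequence of length 100 multiplies the estimate by 200, which is why exact multiple alignment stops being practical after only a handful of sequences.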

Global vs. Local Multiple Alignments

Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.

Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there. This creates inconsistent gap regions between aligned blocks.

Optimal Alignment

For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

Determining what alignment is best for a given set of sequences is really up to the judgment of the investigator.

Progressive Pairwise Methods

Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise (global) alignments, averages them into a consensus (actually a profile), then adds new sequences one at a time to the aligned set.

This is an approximate (heuristic) method!

Perform quick pairwise alignments, score all similarities, build a distance tree

Align first pair of sequences (most similar pair)

Build a profile of aligned sequences

Align each new sequence to profile, rebuild profile

Do the progressive alignments in a sensible order
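One way to picture "a sensible order" is the sketch below. It is not how any particular program does it: the shared k-mer (Jaccard) score is an assumed stand-in for the quick pairwise alignments, no tree is built, and the function names are hypothetical. It starts with the most similar pair and then repeatedly adds the sequence closest to anything already chosen.

```python
from itertools import combinations


def kmer_similarity(seq_a, seq_b, k=2):
    """Crude pairwise score: fraction of distinct k-mers shared by both sequences."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


def progressive_order(sequences, k=2):
    """Return sequence names in the order they would join the alignment."""
    similarity = {
        frozenset(pair): kmer_similarity(sequences[pair[0]], sequences[pair[1]], k)
        for pair in combinations(sequences, 2)
    }
    # Start with the most similar pair.
    order = sorted(max(similarity, key=similarity.get))
    remaining = set(sequences) - set(order)
    # Then keep adding the sequence most similar to one already placed.
    while remaining:
        nearest = max(
            sorted(remaining),
            key=lambda name: max(similarity[frozenset((name, done))] for done in order),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


example = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG", "d": "ACGTTTTT"}
print(progressive_order(example))
```

In a real progressive aligner the order comes from the guide tree, so whole groups of already-aligned sequences can be merged, not just single sequences added to one growing set.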

Profile Alignment

Can represent two (or more) aligned sequences as the frequency of each letter at each position.

Can slide a new sequence along this profile and calculate a similarity score at each position using a score function that gives value for a match equal to the weighted frequency of that letter in the profile.

Very similar to using a lookup table (PAM or BLOSUM) for amino acid similarities

Can use the same method to align two profiles with each other
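The idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions: frequencies are unweighted (every sequence counts equally), gaps are counted as ordinary letters, the new sequence slides without gaps, and the function names are invented for the example.

```python
from collections import Counter


def build_profile(aligned):
    """Per-column letter frequencies for equal-length aligned sequences."""
    n = len(aligned)
    return [
        {letter: count / n for letter, count in Counter(column).items()}
        for column in zip(*aligned)
    ]


def slide_scores(profile, seq):
    """Score seq at every ungapped offset along the profile.

    A letter scores the frequency of that letter in the profile column
    it lands on, and 0 if the column does not contain it.
    """
    scores = []
    for offset in range(len(profile) - len(seq) + 1):
        scores.append(
            sum(profile[offset + i].get(letter, 0.0) for i, letter in enumerate(seq))
        )
    return scores


profile = build_profile(["ACGT", "ACGA", "TCGT"])
print(profile[0])
print(slide_scores(profile, "CG"))
```

A fuller version would combine these frequencies with a substitution matrix, so that a similar but non-identical amino acid still earns partial credit, and would allow gaps through dynamic programming rather than a plain slide.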

CLUSTAL

CLUSTAL is the most popular multiple alignment program

Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.

It can re-align just selected sequences or selected regions in an existing alignment

It can compute phylogenetic trees from a set of aligned sequences.

Unix command line program

Website: http://www.ebi.ac.uk/Tools/clustalw2/index.html

There are also Mac and PC versions with a nice graphical interface (CLUSTALX).

Clustal Algorithm

Perform pairwise alignments and calculate distances for all pairs of sequences

Construct guide tree (dendrogram) joining the most similar sequences using Neighbour Joining

Align sequences, starting at the leaves of the guide tree. This involves the pair-wise comparisons as well as comparison of single sequence with a group of seqs (profile)

CLUSTALW2 at the EBI website (now replaced by Clustal Omega)

http://www.ebi.ac.uk/Tools/clustalw2/index.html

Clustal Parameters

Scoring Matrix

Gap opening penalty

Gap extension penalty

Protein gap parameters

Additional algorithm parameters

Secondary structure penalties

Score Matrices

Pairwise matrices and multiple alignment matrix series

For Proteins: PAM (Dayhoff), BLOSUM (Henikoff), Gonnet (default), user defined

For DNA: transition (A<->G, C<->T) weight. Zero in Clustal means transitions are scored as mismatches; one means transitions are scored as matches. It should be low for distantly related sequences.

Gap Penalties

Linear gap penalties

Affine gap penalties: $p = o + l \cdot e$, where o is the gap opening penalty, e is the gap extension penalty, and l is the gap length

Penalize multiple nearby gaps

Protein specific penalties (on by default)

– Increase the probability of gaps associated with certain residues

– Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
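The difference between linear and affine penalties is easy to see in a small Python sketch of the formula p = o + l·e. The function names and the numeric values (opening 10, extension 0.5) are chosen only for illustration and are not presented as Clustal defaults.

```python
def linear_gap_penalty(length, per_residue):
    """Linear penalty: every gap position costs the same."""
    return length * per_residue


def affine_gap_penalty(length, opening, extension):
    """Affine penalty p = o + l * e for a single gap of the given length."""
    if length <= 0:
        return 0
    return opening + length * extension


# One gap of 10 positions versus two separate gaps of 5 positions each.
one_long_gap = affine_gap_penalty(10, opening=10, extension=0.5)
two_short_gaps = 2 * affine_gap_penalty(5, opening=10, extension=0.5)
print(one_long_gap, two_short_gaps)
```

With an affine penalty the single long gap is cheaper than the two short ones, whereas a linear penalty would score both cases the same. That matches the biological expectation that one insertion or deletion event is more likely than several independent ones.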

Algorithm parameters

Slow-accurate pair-wise alignment

Do alignment from guide tree

Reset gaps before aligning (iteration)

Delay divergent sequences (%)

Additional displays

Column Scores

Low quality regions

Exceptional residues

ClustalX is not optimal

There are known areas in which ClustalX performs badly, e.g.

– errors introduced early cannot be corrected by subsequent information

– alignments of sequences of differing lengths cause strange guide trees and unpredictable effects

– edges: ClustalX does not penalise gaps at edges

There are alternatives to ClustalX available

Other Multiple Alignment Tools

MUSCLE

http://www.ebi.ac.uk/Tools/muscle/index.html

(builds progressive alignment, then improves by additional re-alignment of problem pairs)

T-COFFEE

http://www.ebi.ac.uk/Tools/t-coffee/

(uses both local and global pairwise alignments; SLOWER!)

MSA

http://www.ebi.ac.uk/Tools/t-coffee/

Multiple Alignment Tips

Align pairs of sequences using an optimal method

Progressive alignment programs such as ClustalX for multiple alignment

Choose representative sequences to align carefully

Choose sequences of comparable lengths

Progressive alignment programs may be combined

Review alignment by eye and edit

If you have a choice align amino acid sequences rather than nucleotides

Alignment of coding regions

Nucleotide sequences are much harder to align accurately than proteins

Protein coding sequences can be aligned using the protein sequences, e.g. BioEdit: toggle translation to amino acid, call clustalw to align, edit alignment by hand, toggle back to nucleotide

In-frame nucleotide alignments can be used, e.g. to determine non-synonymous and synonymous distances separately
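The "toggle back to nucleotide" step amounts to threading each coding sequence through its aligned protein. A minimal Python sketch, assuming the coding DNA starts in frame at the first aligned residue (the function name is hypothetical and this is not BioEdit's code):

```python
def protein_to_codon_alignment(aligned_protein, coding_dna, gap="-"):
    """Map an aligned protein back onto its coding DNA, codon by codon.

    Every residue is replaced by its codon and every gap by three gap
    characters, so the nucleotide alignment stays in frame.
    """
    n_residues = sum(1 for letter in aligned_protein if letter != gap)
    if len(coding_dna) < 3 * n_residues:
        raise ValueError("coding DNA is too short for the aligned protein")

    codons = iter(coding_dna[3 * i:3 * i + 3] for i in range(n_residues))
    return "".join(gap * 3 if letter == gap else next(codons) for letter in aligned_protein)


print(protein_to_codon_alignment("M-VH", "ATGGTGCAC"))  # ATG---GTGCAC
```

Because gaps are always inserted in multiples of three, every column triplet in the result is one codon, which is what separate synonymous and non-synonymous distance calculations need.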

Editing Multiple Alignments

There are a variety of tools that can be used to modify and display a multiple alignment.

These programs can be very useful in formatting and annotating an alignment for publication.

An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by an alignment program.

Many different file formats exist for alignments: Clustal, Phylip, MSF, MEGA

Consensus Sequences

Simplest Form: A single sequence which represents the most common amino acid/base in that position

Y D D G A V - E A L

Y D G G - - - E A L

F E G G I L V E A L

F D - G I L V Q A V

Y E G G A V V Q A L

Consensus: Y D G G A/I V/L V E A L
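The simple majority rule behind this example can be sketched in Python. The sketch assumes gaps are ignored when counting and that ties are reported as alternatives separated by "/", as in the example above; the function name is invented for illustration.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Most common letter per column; ties are joined with '/'."""
    result = []
    for column in zip(*aligned):
        counts = Counter(letter for letter in column if letter != gap)
        if not counts:
            result.append(gap)  # the column holds only gaps
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so tied letters appear top to bottom.
        result.append("/".join(letter for letter, n in counts.items() if n == top))
    return result


alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))  # Y D G G A/I V/L V E A L
```

Other rules, such as requiring a minimum percentage before a letter is called, change only the selection step inside the loop.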

Clustal Format

CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
             *:***: **.*.*:* : . :

Phylip Format (Interleaved)

7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL

Phylip Format (Sequential)

3 100
Rat    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse  ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTTGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGCTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG

Mega Format

#mega
TITLE: No title

#Rat      ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse    ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit   ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human    ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken  ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog     ---ATGGGTTTGACAGCACATGATCGT---CAGCT

PILEUP Output

> more myseqs.msf

          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP

Editing a multiple sequence alignment

It is NOT “cheating” to edit a multiple sequence alignment

– heuristic alignment is approximate

Incorporate additional knowledge if possible

Alignment editors help to keep the data organized and help to prevent unwanted mistakes

Alignment editors

The MACAW and SeqVu programs for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality.

BioEdit, Seaview, Jalview (web based)

Many “comprehensive” molecular biology programs include multiple alignment functions: Sequencher, MacVector, DS Gene, and Vector NTI all include a built-in version of CLUSTAL

EMBOSS tools

emma = clustal

plotcon = PLOTSIMILARITY

showalign = PRETTY

prettyplot ≈ PRETTYBOX

JalView

Install on your machine or run as a Java WebStart application

Check out CINEMA (Colour INteractive Editor for Multiple Alignments)

It is an editor created completely in JAVA (old browsers beware)

It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/

Analysis of Alignments

Once you have a multiple alignment, what can you do with it?

1) Identify regions of similarity and difference

- Conserved regions may be functionally important, and/or sites for inclusive (cross species) primer design

- Variable regions may be functionally important, and/or sites for gene/allele-specific primer design

2) Create a sequence logo

3) Build a Phylogenetic Tree (next week)

Format a Multiple Alignment

1) PLOTSIMILARITY (a graph of overall similarity across the alignment); EMBOSS = plotcon

2) Show match to consensus = showalign

3) Shade by similarity = prettyplot/Boxshade

• The concept of a consensus sequence is implied by any multiple alignment. There can be various rules for building the consensus: simple majority rules, plurality by a specific %, etc.

• The alignment may look nicer by showing how each letter matches the consensus – highlight the differences.

Plurality: 2.00  Threshold: 4  AveWeight 0.55  AveMatch 2.91  AvMisMatch -2.00

PRETTY of: @pretty.list October 7, 1998 10:35 ..

          1                                                   50
fa10.ugly .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly    Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly  GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly  GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly  GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly  Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly  ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly  GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly   ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

          301                                              349
fa10.ugly aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly    krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly  irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly  irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly  VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly  VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly  VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly  VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly   VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------

Boxshade

http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade

http://www.ch.embnet.org/software/BOX_form.html

Shade each letter of the alignment based on its match to the consensus

– highlights conserved regions

– much more informative for protein alignments (shades of grey for similar amino acids)

Sequence Logos

http://weblogo.threeplusone.com/create.cgi

http://weblogo.berkeley.edu/logo.cgi

T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, Vol. 18, No 20, p. 6097-6100.

http://genome.tugraz.at/Logo/

Seq Logos are based on Information Theory

Height of the letter corresponds to the amount of information present at that position in an aligned region (motif)

DNA has a max of 2 bits (log2 of 4), protein has a max of about 4.3 bits (log2 of 20)

If many bases/amino acids are present at an alignment position, there is very little information
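The information at one alignment column can be sketched in Python as the maximum possible entropy minus the observed entropy. This is a simplified version: gaps are ignored, the small-sample correction used by logo programs is left out, and the function names are illustrative.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4, gap="-"):
    """Information in bits: log2(alphabet size) minus the column's entropy."""
    residues = [letter for letter in column if letter != gap]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4, gap="-"):
    """Height of each letter in the logo: its frequency times the column information."""
    residues = [letter for letter in column if letter != gap]
    if not residues:
        return {}
    info = column_information(column, alphabet_size, gap)
    total = len(residues)
    return {letter: (n / total) * info for letter, n in Counter(residues).items()}


print(column_information("AAAA"))  # fully conserved DNA column
print(column_information("ACGT"))  # all four bases equally frequent
print(letter_heights("AACC"))
```

For protein alignments the same functions apply with `alphabet_size=20`, which gives the maximum of log2(20), roughly 4.3 bits.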

We will explore using motifs next week.

Summary

Understand the need for multiple alignment methods in biology

Optimal methods (dynamic programming) are not practical to align many sequences

Progressive pairwise approach

Profile alignments

Editing alignments

Sequence Logos

Next Lecture: Sequence Motifs
