Previous Lecture: Hypothesis Testing 2: Comparing Samples
Multiple Alignment
Stuart M. Brown
NYU School of Medicine
Learning Objectives
Understand the need for multiple alignment methods in biology
Optimal methods (dynamic programming) are not practical to align many sequences
Progressive pairwise approach
Profile alignments
Editing alignments
Sequence Logos
Reasons for aligning sets of sequences
• Organize data to reflect sequence homology
• Estimate evolutionary distance
• Infer phylogenetic trees from homologous sites
• Highlight conserved sites/regions (motifs)
• Highlight variable sites/regions
• Uncover changes in gene structure
• Look for evidence of selection
• Summarize information
Pairwise Alignment
The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
The best solution seems to be an approach called Dynamic Programming.
Dynamic Programming
Dynamic Programming is a general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that:
• the initial stage contains trivial solutions to sub-problems
• each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
• the final stage contains the overall solution
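The three stages above can be illustrated with a minimal sketch of a global pairwise alignment score matrix (Needleman-Wunsch style). The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration; they are not taken from the lecture and will not reproduce the example matrices shown on later slides.

```python
def global_alignment_matrix(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Fill a dynamic programming matrix for global pairwise alignment.

    The scoring scheme is a hypothetical example, not the lecture's.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against gaps only).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on a fixed number (three) of earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two letters
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Final stage: the bottom-right cell holds the overall optimal score.
    return score


if __name__ == "__main__":
    matrix = global_alignment_matrix("AGT", "GTA")
    for row in matrix:
        print(row)
    print("optimal score:", matrix[-1][-1])
```

The sketch only computes the score; recovering the alignment itself would need a traceback through the matrix, which is omitted here.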
Multiple Alignments
Making an optimal alignment between two sequences is computationally straightforward, but aligning a large number of sequences using the same method is almost impossible.
The problem increases exponentially with the number of sequences involved, so it becomes computationally expensive (and inefficient) for large numbers of sequences.
Longer Sequences
     A   G   T
G   -1  -1  -2
T   -2  -2  -1
A   -2  -3  -3
What happens to the number of cells in the matrix when we add another base to one sequence? How about to both?
# cells = L1 x L2, or L^2 if we use 2 sequences of the same length L.
So the amount of computing grows with the square of the sequence length: bad, but not terrible, because the compute time for each cell remains constant.
     A   G   T   A
G   -1  -1  -2   ?
T   -2  -2  -1   ?
A   -2  -3  -3   ?
C    ?   ?   ?   ?
Align Three Sequences by Dynamic programming
So how many cells (containing values that must be computed) do we add for each additional sequence? It's a power function! For N sequences of length L: amount of computation = 2^N x L^N (about L^N cells, each computed from up to 2^N - 1 neighbouring cells)
This is very bad for computing alignments of a lot of sequences!
Georg Fullen, VSNS Biocomputing, Univ. Munster
If the calculation takes 1 nanosecond per cell, then for 6 sequences of length 100 we'll have a running time of 2^6 x 100^6 x 10^-9 seconds (64,000 seconds, about 18 hours). Just add 2 more sequences, and the running time is 2^8 x 100^8 x 10^-9 = 2.6 x 10^9 seconds (~81 years).
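The running-time estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function names are made up for illustration, and the 1 nanosecond per cell figure is the slide's own assumption.

```python
def dp_operations(n_sequences, length):
    """Approximate work for exact N-way dynamic programming: 2^N x L^N."""
    return (2 ** n_sequences) * (length ** n_sequences)


def running_time_seconds(n_sequences, length, seconds_per_cell=1e-9):
    """Convert the operation count to seconds (1 ns per cell by default)."""
    return dp_operations(n_sequences, length) * seconds_per_cell


if __name__ == "__main__":
    seconds_per_year = 365.25 * 24 * 3600
    for n in (2, 3, 6, 8):
        t = running_time_seconds(n, 100)
        print(f"N={n}: {t:.3g} seconds = {t / seconds_per_year:.3g} years")
```

Running the loop shows how each additional sequence multiplies the cost by a factor of 2 x L, which is why exact methods stop being practical after a handful of sequences.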
Global vs. Local Multiple Alignments
Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there. This creates inconsistent gap regions between aligned blocks.
Optimal Alignment
For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
Determining what alignment is best for a given set of sequences is really up to the judgment of the investigator.
Progressive Pairwise Methods
Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise (global) alignments, averages them into a consensus (actually a profile), then adds new sequences one at a time to the aligned set.
This is an approximate (heuristic) method!
• Perform quick pairwise alignments, score all similarities, build a distance tree
• Align first pair of sequences (most similar pair)
• Build a profile of aligned sequences
• Align each new sequence to profile, rebuild profile
• Do the progressive alignments in a sensible order
Profile Alignment
• Can represent two (or more) aligned sequences as the frequency of each letter at each position.
• Can slide a new sequence along this profile and calculate a similarity score at each position, using a score function that gives a match a value equal to the weighted frequency of that letter in the profile.
• Very similar to using a lookup table (PAM or BLOSUM) for amino acid similarities
• Can use the same method to align two profiles with each other
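A minimal sketch of the profile idea described above: build per-column letter frequencies from an existing alignment, then slide a new sequence along the profile and score each offset. It assumes unweighted frequencies, no gaps in the new sequence, and a match score equal to the plain frequency; real programs also weight sequences and use substitution matrices. All names here are hypothetical.

```python
from collections import Counter


def build_profile(aligned_seqs):
    """Return one {letter: frequency} dict per alignment column."""
    n = len(aligned_seqs)
    profile = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        profile.append({letter: count / n for letter, count in counts.items()})
    return profile


def score_at_offset(profile, seq, offset):
    """Sum the profile frequency of each letter of seq placed at this offset."""
    return sum(
        profile[offset + i].get(letter, 0.0) for i, letter in enumerate(seq)
    )


def best_offset(profile, seq):
    """Slide seq along the profile and return the highest-scoring offset."""
    offsets = range(len(profile) - len(seq) + 1)
    return max(offsets, key=lambda off: score_at_offset(profile, seq, off))


if __name__ == "__main__":
    alignment = ["ACGTA", "ACGTT", "ACCTA"]  # toy example data
    profile = build_profile(alignment)
    print(profile)
    print("best offset for 'GT':", best_offset(profile, "GT"))
```

Treating the gap character as just another letter, as this sketch does, is a simplification; a fuller version would apply gap penalties when placing the new sequence.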
CLUSTAL
• CLUSTAL is the most popular multiple alignment program
• Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
• It can re-align just selected sequences or selected regions in an existing alignment
• It can compute phylogenetic trees from a set of aligned sequences.
• Unix command line program
• Website: http://www.ebi.ac.uk/Tools/clustalw2/index.html
• There are also Mac and PC versions with a nice graphical interface (CLUSTALX).
Clustal Algorithm
Perform pairwise alignments and calculate distances for all pairs of sequences
Construct guide tree (dendrogram) joining the most similar sequences using Neighbour Joining
Align sequences, starting at the leaves of the guide tree. This involves the pair-wise comparisons as well as comparison of single sequence with a group of seqs (profile)
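The first two steps can be sketched roughly as follows. To stay short, this example makes two simplifications that differ from Clustal: distances are the fraction of mismatched positions between equal-length sequences (no real pairwise alignment), and the tree is built by simple average-linkage clustering (UPGMA-style) rather than Neighbour Joining. Function names and data are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions that differ between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs):
    """Build a nested-tuple guide tree from a {name: sequence} dict.

    Uses average-linkage clustering, NOT Neighbour Joining.
    """
    dist = {
        frozenset((n1, n2)): p_distance(seqs[n1], seqs[n2])
        for n1, n2 in combinations(seqs, 2)
    }
    # Each cluster (a name or a nested tuple) maps to its member names.
    clusters = {name: [name] for name in seqs}

    def cluster_distance(c1, c2):
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Join the two most similar clusters first.
        c1, c2 = min(
            combinations(clusters, 2), key=lambda pair: cluster_distance(*pair)
        )
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return next(iter(clusters))


if __name__ == "__main__":
    toy = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "TTTTAAAA", "d": "TTTTAATT"}
    print(guide_tree(toy))
```

The nested tuples give the order in which a progressive aligner would combine sequences and profiles, starting from the innermost pairs.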
http://www.ebi.ac.uk/Tools/clustalw2/index.html
(now replaced by Clustal Omega)
CLUSTALW2 at the EBI website
Clustal Parameters
• Scoring Matrix
• Gap opening penalty
• Gap extension penalty
• Protein gap parameters
• Additional algorithm parameters
• Secondary structure penalties
Score Matrices
Pairwise matrices and multiple alignment matrix series
For Proteins: PAM (Dayhoff), BLOSUM (Henikoff), GONNET (default), user defined
For DNA: transition (A<->G, C<->T) weight. Zero in Clustal means transitions are scored as mismatches; one means transitions are scored as matches. It should be low for distantly related sequences.
Gap Penalties
• Linear gap penalties
• Affine gap penalties: p = o + l.e (o = gap opening penalty, e = gap extension penalty, l = gap length)
• Gap opening / Gap extension
• Penalizes multiple nearby gaps
• Protein specific penalties (on by default)
  - Increase the probability of gaps associated with certain residues
  - Increase the chances of gaps in loop regions (> 5 hydrophilic residues)
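The difference between linear and affine gap penalties can be shown with a short sketch of the formula p = o + l.e. The numeric values used in the example (opening 10, extension 0.2) are assumptions for illustration only, and the residue-specific and hydrophilic-region adjustments listed above are not modelled.

```python
def linear_gap_penalty(length, per_position):
    """Linear model: every gap position costs the same."""
    return length * per_position


def affine_gap_penalty(length, opening, extension):
    """Affine model: p = o + l * e, charged once per gap of length l."""
    if length <= 0:
        return 0.0
    return opening + length * extension


if __name__ == "__main__":
    opening, extension = 10.0, 0.2  # hypothetical example values
    one_long_gap = affine_gap_penalty(4, opening, extension)
    two_short_gaps = 2 * affine_gap_penalty(2, opening, extension)
    print("one gap of length 4:", one_long_gap)
    print("two gaps of length 2:", two_short_gaps)
```

Under the affine model a single long gap is much cheaper than several short ones covering the same number of positions, which matches the biological intuition that one insertion or deletion event is more likely than many.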
Algorithm parameters
• Slow-accurate pair-wise alignment
• Do alignment from guide tree
• Reset gaps before aligning (iteration)
• Delay divergent sequences (%)
Additional displays
• Column Scores
• Low quality regions
• Exceptional residues
ClustalX is not optimal
There are known areas in which ClustalX performs badly, e.g.
• errors introduced early cannot be corrected by subsequent information
• alignments of sequences of differing lengths cause strange guide trees and unpredictable effects
• edges: ClustalX does not penalise gaps at edges
There are alternatives to ClustalX available
Other Multiple Alignment Tools
MUSCLE http://www.ebi.ac.uk/Tools/muscle/index.html
(builds progressive alignment, then improves by additional re-alignment of problem pairs)
T-COFFEE http://www.ebi.ac.uk/Tools/t-coffee/
(uses both local and global pairwise alignments: SLOWER!)
MSA
Multiple Alignment Tips
• Align pairs of sequences using an optimal method
• Progressive alignment programs such as ClustalX for multiple alignment
• Choose representative sequences to align carefully
• Choose sequences of comparable lengths
• Progressive alignment programs may be combined
• Review alignment by eye and edit
• If you have a choice, align amino acid sequences rather than nucleotides
Alignment of coding regions
Nucleotide sequences are much harder to align accurately than proteins
Protein coding sequences can be aligned using the protein sequences, e.g. BioEdit: toggle translation to amino acid, call clustalw to align, edit alignment by hand, toggle back to nucleotide
In-frame nucleotide alignments can be used, e.g. to determine non-synonymous and synonymous distances separately
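The "toggle back to nucleotide" step can be sketched as threading each coding sequence onto its aligned protein, three bases per residue and three dashes per gap. This is a hypothetical helper, not BioEdit's implementation; it assumes the coding sequence is in frame, has no stop codon, and has exactly one codon per residue, and it does not verify that the codons actually translate to the given residues.

```python
def backtranslate_alignment(aligned_protein, coding_dna, gap="-"):
    """Map an aligned protein row back onto its codons (in-frame DNA row)."""
    if len(coding_dna) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    n_residues = sum(1 for residue in aligned_protein if residue != gap)
    if len(codons) != n_residues:
        raise ValueError("number of codons does not match number of residues")

    remaining = iter(codons)
    row = []
    for residue in aligned_protein:
        # A protein gap becomes a three-base gap, keeping the reading frame.
        row.append(gap * 3 if residue == gap else next(remaining))
    return "".join(row)


if __name__ == "__main__":
    # Toy example: protein alignment row "MV-H" with its 9-base coding sequence.
    print(backtranslate_alignment("MV-H", "ATGGTGCAT"))
```

Because gaps are always inserted in multiples of three, the resulting nucleotide alignment stays in frame and can be used for codon-based analyses.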
Editing Multiple Alignments
There are a variety of tools that can be used to modify and display a multiple alignment.
These programs can be very useful in formatting and annotating an alignment for publication.
An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by an alignment program.
Many different file formats exist for alignments: Clustal, Phylip, MSF, MEGA
Consensus Sequences
Simplest Form: A single sequence which represents the most common amino acid/base in that position

Y D D G A   V   - E A L
Y D G G -   -   - E A L
F E G G I   L   V E A L
F D - G I   L   V Q A V
Y E G G A   V   V Q A L

Y D G G A/I V/L V E A L   (consensus)
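The simple majority rule behind this consensus can be sketched in a few lines. The function below is an illustrative assumption, not a specific program's algorithm: it ignores gaps, takes the most common letter in each column, and joins ties with a slash.

```python
from collections import Counter


def consensus(aligned_seqs, gap="-"):
    """Return the most common non-gap letter per column; ties joined by '/'."""
    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(letter for letter in column if letter != gap)
        if not counts:
            result.append(gap)  # column contains only gaps
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so ties are listed in order of appearance.
        winners = [letter for letter in counts if counts[letter] == top]
        result.append("/".join(winners))
    return result


if __name__ == "__main__":
    alignment = [
        "YDDGAV-EAL",
        "YDGG---EAL",
        "FEGGILVEAL",
        "FD-GILVQAV",
        "YEGGAVVQAL",
    ]
    print(" ".join(consensus(alignment)))
```

Other consensus rules, such as requiring a minimum percentage before a letter is reported, would only change how `winners` is chosen.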
Clustal Format
CLUSTAL X (1.81) multiple sequence alignment

CAS1_BOVIN   MKLLILTCLVAVALARPKHPIKHQGLPQ--------EVLNEN-
CAS1_SHEEP   MKLLILTCLVAVALARPKHPIKHQGLSP--------EVLNEN-
CAS1_PIG     MKLLIFICLAAVALARPKPPLRHQEHLQNEPDSRE--------
CAS1_HUMAN   MRLLILTCLVAVALARPKLPLRYPERLQNPSESSE--------
CAS1_RABBIT  MKLLILTCLVATALARHKFHLGHLKLTQEQPESSEQEILKERK
CAS1_MOUSE   MKLLILTCLVAAAFAMPRLHSRNAVSSQTQ------QQHSSSE
CAS1_RAT     MKLLILTCLVAAALALPRAHRRNAVSSQTQ-------------
             *:***: **.*.*:* : . :
Phylip Format (Interleaved)
7 100
SOMA_BOVIN MMAAGPRTSL LLAFALLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_SHEEP MMAAGPRTSL LLAFTLLCLP WTQVVGAFPA MSLSGLFANA VLRAQHLHQL
SOMA_RAT_P -MAADSQTPW LLTFSLLCLL WPQEAGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_MOUSE -MATDSRTSW LLTVSLLCLL WPQEASAFPA MPLSSLFSNA VLRAQHLHQL
SOMA_RABIT -MAAGSWTAG LLAFALLCLP WPQEASAFPA MPLSSLFANA VLRAQHLHQL
SOMA_PIG_P -MAAGPRTSA LLAFALLCLP WTREVGAFPA MPLSSLFANA VLRAQHLHQL
SOMA_HUMAN -MATGSRTSL LLAFGLLCLP WLQEGSAFPT IPLSRLFDNA MLRAHRLHQL

           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTFKEFER TYIPEGQRYS -IQNTQVAFC FSETIPAPTG KNEAQQKSDL
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KEEAQQRTDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDM
           AADTYKEFER AYIPEGQRYS -IQNAQAAFC FSETIPAPTG KDEAQQRSDV
           AFDTYQEFEE AYIPKEQKYS FLQNPQTSLC FSESIPTPSN REETQQKSNL
Phylip Format (Sequential)
3 100
Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTTAATGGCCGTGGTGGCTGGAGTGGCCAGTGCCCTGGCTCACAAGTACCACTAA
Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGTCTCTTGCCTTGGGGAAAGGTGAACTCCGATGAAGTTGGTGGTGAGGCCCTGGG
Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGCGGTCACTGCTGGGGCAAGGTGAATGTGGAAGAAGTTGGTGGTGAGGCCCTGGG
Mega Format
#mega
TITLE: No title

#Rat       ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Mouse     ATGGTGCACCTGACTGATGCTGAGAAGGCTGCTGT
#Rabbit    ATGGTGCATCTGTCCAGT---GAGGAGAAGTCTGC
#Human     ATGGTGCACCTGACTCCT---GAGGAGAAGTCTGC
#Oppossum  ATGGTGCACTTGACTTTT---GAGGAGAAGAACTG
#Chicken   ATGGTGCACTGGACTGCT---GAGGAGAAGCAGCT
#Frog      ---ATGGGTTTGACAGCACATGATCGT---CAGCT
          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Mmuirf3   ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~ ~~~~~~~~~~
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
PILEUP Output
> more myseqs.msf
Editing a multiple sequence alignment
• It is NOT “cheating” to edit a multiple sequence alignment: heuristic alignment is approximate
• Incorporate additional knowledge if possible
• Alignment editors help to keep the data organized and help to prevent unwanted mistakes
Alignment editors
• The MACAW and SeqVu programs for Macintosh, and GeneDoc and DCSE for PCs, are free and provide excellent editor functionality.
• BioEdit, Seaview, Jalview (web based)
• Many “comprehensive” molecular biology programs include multiple alignment functions: Sequencher, MacVector, DS Gene and Vector NTI all include a built-in version of CLUSTAL
EMBOSS tools
emma = clustal
plotcon = PLOTSIMILARITY
showalign = PRETTY
prettyplot ≈ PRETTYBOX
SeqVu
JalView
Install on your machine
or run as a Java WebStart application
Check out CINEMA (Colour INteractive Editor for Multiple Alignments)
• It is an editor created completely in JAVA (old browsers beware)
• It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
Analysis of Alignments
Once you have a multiple alignment, what can you do with it?
1) Identify regions of similarity and difference
- Conserved regions may be functionally important, and/or sites for inclusive (cross species) primer design
- Variable regions may be functionally important, and/or sites for gene/allele-specific primer design
2) Create a sequence logo
3) Build a Phylogenetic Tree (next week)
Format a Multiple Alignment
1) PLOTSIMILARITY (a graph of overall similarity across the alignment); EMBOSS = plotcon
2) Show match to consensus = showalign
3) Shade by similarity = prettyplot/Boxshade
• The concept of a consensus sequence is implied by any multiple alignment. There can be various rules for building the consensus: simple majority rules, plurality by a specific %, etc.
• The alignment may look nicer by showing how each letter matches the consensus – highlight the differences.
Plurality: 2.00 Threshold: 4
AveWeight 0.55 AveMatch 2.91 AvMisMatch -2.00

PRETTY of: @pretty.list October 7, 1998 10:35 ..

           1                                                   50
fa10.ugly  .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly  .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly  .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly     Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly   GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly   Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly   ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly   GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly    ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus  G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

           301                                               349
fa10.ugly  aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly  aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly  aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly     krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly   irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly   irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly   VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly   VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly   VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly   VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly    VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus  VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------
Boxshade
http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade
http://www.ch.embnet.org/software/BOX_form.html
Shade each letter of the alignment based on its match to the consensus
- highlights conserved regions
- much more informative for protein alignments (shades of grey for similar amino acids)
Sequence Logos
http://weblogo.threeplusone.com/create.cgi
http://weblogo.berkeley.edu/logo.cgi
T. D. Schneider and R. M. Stephens. Sequence logos: a new way to display
consensus sequences. Nucleic Acids Research, Vol. 18, No 20, p. 6097-6100.
http://genome.tugraz.at/Logo/
Seq Logos are based on Information Theory
• Height of the letter corresponds to the amount of information present at that position in an aligned region (motif)
• DNA has a max of 2 bits per position (log2 of 4 bases); protein has a max of about 4.3 bits (log2 of 20 amino acids)
• If many bases/amino acids are present at an alignment position, there is very little information
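The information content of one alignment column can be sketched as the maximum possible information (log2 of the alphabet size) minus the observed Shannon entropy, following the idea in Schneider and Stephens. This simplified version is an assumption for illustration: it ignores gaps and leaves out the small-sample correction that logo programs normally apply.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4, gap="-"):
    """Information (bits) at one column: log2(alphabet size) minus entropy."""
    letters = [c for c in column if c != gap]
    if not letters:
        return 0.0
    total = len(letters)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(letters).values()
    )
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4, gap="-"):
    """Height of each letter in the logo: its frequency times the column information."""
    letters = [c for c in column if c != gap]
    info = column_information(column, alphabet_size, gap)
    return {c: (n / len(letters)) * info for c, n in Counter(letters).items()}


if __name__ == "__main__":
    for col in ("AAAA", "AACC", "ACGT"):
        print(col, column_information(col), letter_heights(col))
    print("protein maximum:", math.log2(20))
```

A fully conserved DNA column reaches the 2-bit maximum, while a column with all four bases equally represented carries no information, which is why variable positions appear as very short stacks in a logo.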
We will explore using motifs next week.
Summary
Understand the need for multiple alignment methods in biology
Optimal methods (dynamic programming) are not practical to align many sequences
Progressive pairwise approach
Profile alignments
Editing alignments
Sequence Logos
Next Lecture: Sequence Motifs