Sequence Alignments
Wednesday 6th October 2010
Aidan Budd
Structural and Computational Biology Unit
EMBL Heidelberg, Germany
Why focus on sequence alignment?
•Required for the development of almost all bioinformatics tools
•Enables sequence similarity to be measured in a way that provides confident prediction of structural-similarity/evolutionary-relatedness between sequences
•Similarity to sequences with known function provides KEY source of novel hypotheses for function of novel sequences
•Usually the first analysis of any 'novel' sequence involves aligning it to many other sequences
Have any of you ever used a tool to align sequences?
Sequence Alignment
Sequence Alignment:
Arrangement of two or more sequences in a matrix/grid
Biological Sequence Alignment - Rows
• Residues in the same row are from the same biological macromolecule (protein or nucleic acid)
• Residues are arranged in the order they occur in the macromolecule
• N to C terminal in proteins
• 5' to 3' in nucleic acids
Biological Sequence Alignment - Columns
• Elements/cells in the matrix only ever contain a single character (either a residue or a "blank"/"gap" character)
• Residues in the same column share a special 1:1 relationship that they don't share with residues in any other column
Biological Sequence Alignment - Gaps
• Gaps specify pairs of residues in the same sequence
• i.e. the residues flanking the gap within the same sequence
• Between the flanking residues, other sequences in the alignment have residues for which there is no 1:1 relationship in the gapped sequence
Biological Sequence Alignment - Gaps
Gap-only columns provide no information about relationships between residues in the alignment
• Removing sequences sometimes leaves all-gap columns
• We usually remove these "empty" columns
• (as this doesn't change the 1:1 relationships described by the rest of the alignment)
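As an illustration, removing gap-only columns can be sketched in a few lines of Python. The function name and the toy alignment are assumptions made for this example and are not part of the course material.

```python
def drop_gap_only_columns(rows, gap="-"):
    """Return the alignment rows with every all-gap column removed."""
    if not rows:
        return []
    # Keep the index of each column that contains at least one residue.
    keep = [i for i in range(len(rows[0])) if any(row[i] != gap for row in rows)]
    return ["".join(row[i] for i in keep) for row in rows]


# Toy alignment: only the second row has residues in the three middle columns.
alignment = ["AG---WWTI", "AAQQQWYTI", "AG---WYTI"]
# Removing the second sequence leaves three gap-only columns behind.
remaining = [alignment[0], alignment[2]]
cleaned = drop_gap_only_columns(remaining)
print(cleaned)
```

The residues that remain keep the same column partners as before, so the 1:1 relationships are unchanged.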
Pairwise Sequence Alignments
>EMBLCDS:BAB82422 BAB82422.1 Ephydatia fluviatilis protein tyrosine kinase Length = 1488
Score = 89.7 bits (98), Expect = 3e-15
Identities = 104/138 (75%), Gaps = 2/138 (1%)
Strand = Plus / Plus

Query: 85  aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag 143
           ||||| ||| | |||| | ||||| || || |||  |||    ||||| || ||||||||
Sbjct: 703 aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag 761

Query: 144 ggtggccatcaaaaccctgaagcctggcacgatgtctccagaggccttcctgcaggaggc 203
            |||||| | || ||||| || || ||||| ||||||   ||||  ||||||||||||||
Sbjct: 762 tgtggccgttaagaccctcaaaccaggcacaatgtctgtcgaggagttcctgcaggaggc 821

Query: 204 ccaggtcatgaagaagct 221
                |||||||||||||
Sbjct: 822 aagcatcatgaagaagct 839
Pairwise Alignment: Alignment of two sequences
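The "Identities" and "Gaps" figures in a report like the one above can be recomputed by comparing the two rows column by column. The following Python sketch does this for the first block of the alignment only (query 85-143); the function name is an assumption for illustration, and this is not how BLAST itself is implemented.

```python
def identity_and_gaps(query, subject, gap="-"):
    """Count identical columns and gapped columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != gap)
    gaps = sum(1 for q, s in zip(query, subject) if gap in (q, s))
    return identities, gaps, len(query)


# First block of the alignment shown above (query 85-143, subject 703-761).
query = "aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag"
subject = "aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag"
identities, gaps, columns = identity_and_gaps(query, subject)
print(f"Identities = {identities}/{columns}, Gaps = {gaps}/{columns}")
```

Summing the same counts over all three blocks gives the totals reported in the header.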
Build your own sequence alignment
•Sequences are usually aligned automatically
•Pairwise: BLAST, FASTA, Smith-Waterman etc.
•Multiple: MUSCLE, PRANK, CLUSTAL etc.
•Also possible 'manually' using tools such as JalView
•Trying some manual alignment can help understand how/why automatic methods are used
JalView Demo and Exercises
•Loading sequences
•Changing the way the sequences are displayed
•Manual editing of alignments
•Adding/removing sequences to an alignment
•Exporting sequences/alignments from JalView for use in another application
Multiple Sequence Alignment
Multiple Sequence Alignment (MSA): Alignment of 3+ sequences
Multiple Sequence Alignment
MSAs describe a set of pairwise alignments
An MSA of n sequences implies n(n-1)/2 pairwise alignments
For example, an MSA of 3 sequences implies 3 pairwise alignments
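A minimal Python sketch of how the pairwise alignments implied by an MSA can be read off: take each unordered pair of rows (n(n-1)/2 pairs for n sequences) and drop the columns in which both rows have a gap. The sequence names and the toy MSA are made up for illustration.

```python
from itertools import combinations


def implied_pairwise_alignments(msa, gap="-"):
    """Yield (name_a, name_b, row_a, row_b) for each pair of sequences in an MSA."""
    for (name_a, row_a), (name_b, row_b) in combinations(msa.items(), 2):
        # Columns where both rows are gaps carry no information for this pair.
        cols = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
        yield name_a, name_b, "".join(a for a, _ in cols), "".join(b for _, b in cols)


# Toy MSA with hypothetical sequence names.
msa = {"seq1": "AG---WWTI", "seq2": "AAQQQWYTI", "seq3": "AG---WYTI"}
pairs = list(implied_pairwise_alignments(msa))
n = len(msa)
print(len(pairs), n * (n - 1) // 2)
for name_a, name_b, row_a, row_b in pairs:
    print(name_a, name_b, row_a, row_b)
```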
“Equivalence”/similarity of residues
Residues in the same column are either:
•Structurally equivalent/similar
•Evolutionarily equivalent/related/homologous
Different applications assume different types of equivalence
Different types of similarity are not necessarily equivalent
Structural equivalence
Demonstration:
http://www.embl-heidelberg.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html#DemonstratingStructuralEquivalence
Bacterial toxins 1ji6 and 1i5p
1i5p: 1 YVAPVVGTVSSFLLKKVGSLIGKR
        111111111111111111111111
1ji6: 1 DAVGTGISVVGQILGVVGVPFAGA
Structural equivalence
Residues in the same alignment column are "structurally equivalent"
i.e. they should be the residues whose locations, with respect to the rest of the structure, are most similar in the two structures
Such residues will have similar structural/functional "roles" in the two proteins e.g. form similar side-chain interactions
Structural equivalence
Some regions of the structures do not have structurally equivalent residues in the other structure
1i5p: DNFLNPTQN----PVPLSITSSVN
      111111       11111111111
1ji6: NSWKKTPLSLRSKRSQDRIRELFS
Alignment gaps are a sure sign of such residues
Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!
“Evolutionary Equivalence”
[Figure: evolution of two present-day sequences from the ancestral sequence AGWYTI]
Two copies of gene generated: AGWYTI, AGWYTI
Mutation / Substitution Y-W: AGWYTI to AGWWTI
Substitution G-A: AGWYTI to AAWYTI
QQQ Insertion: AAWYTI to AAQQQWYTI
Resulting alignment:
AG---WWTI
AAQQQWYTI
Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence with any changes due only to point substitutions
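To make this interpretation concrete, the short Python sketch below labels each column of the example alignment AG---WWTI / AAQQQWYTI as identical, substitution or indel. The function name and labels are illustrative assumptions.

```python
def classify_columns(row_a, row_b, gap="-"):
    """Label each column of a pairwise alignment as identical, substitution or indel."""
    labels = []
    for a, b in zip(row_a, row_b):
        if gap in (a, b):
            labels.append("indel")
        elif a == b:
            labels.append("identical")
        else:
            labels.append("substitution")
    return labels


labels = classify_columns("AG---WWTI", "AAQQQWYTI")
print(labels)
```

The two substitution columns correspond to the G-A and Y-W events, and the three indel columns to the QQQ insertion.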
Quiz - Evolutionary Interpretation of Alignments
Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?
X
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

Y
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG

Z
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG
Quiz - Evolutionary Interpretation of Alignments
PRANK
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG

MAFFT
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

CLUSTALX
K---GEPGIGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG
Different automatic MSA software gives different results
They're all "wrong"....
Because their model of the evolutionary process is very different from the (very strange...) one under which I told you they evolved
Interpreting Alignments
• Special 1:1 relationship between residues in the same column
• Structural: Very similar structural environment
• Evolutionary: Related by point mutations/no mutations from the same residue in the ancestral sequence
• Structural and Evolutionary equivalence need not necessarily be the same
• Not all residues have 1:1 equivalents in other structures
Non-Equivalence of Evolutionary and Structural Alignments
Demonstration 1: Structural equivalence without evolutionary equivalence
Structural alignment of SH3-interaction motifs from nef and ncf1
nef/fyn1 PDB:1efn
ncf1 PDB:1w70
aligned ncf1/nef1 SH3 interaction motifs
Non-Equivalence of Evolutionary and Structural Alignments
Demonstration 2: Evolutionary equivalence without structural equivalence
Human Lymphotactin adopts different folds depending on the conditions
PDB:2jp1
PDB:1j8i
Mis-alignment
“Gold-standard” structural alignment
“Gold-standard” structural alignment (with CLUSTALX default colouring)
Mis-alignment of same region - residues that are known to be functionally equivalent are NOT in the same columns
Indicating Difficult-To-Align/Non-Equivalent Alignment Regions
1st3
1sbc
1sbt
2prk
Code such regions differently, e.g. with upper-case characters
Create alignment to minimise number of gaps in alignment (typical solution - particularly for MSA software)
Only include in same column characters known/believed structurally equivalent
Quiz - Numbers of Insertions
The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?
(a) 2 (b) 1 (c) 0 (d) 3
If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions
If an alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion
We can ALWAYS explain observed sequence length diversity with:
•0 insertions (all length variation due to deletion)
•0 deletions (all length variation due to insertion)
•a combination of insertions and deletions
Perhaps we should instead focus on inferring the most likely scenario?
(Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)
Building an Alignment
1. Formulate the question
2. Collect initial sequence set
3. Automatically align sequences
4. Examine alignment
5. Discard sequences
6. Include additional sequences
7. Manually adjust/edit alignment
Formulate the Question
Collect Initial Sequences
1. Search databases for appropriate sequences
•Keyword-based
•Sequence-similarity-based
2. Get pre-aligned set of sequences
•PFAM pfam.sanger.ac.uk
•ENSEMBL www.ensembl.org
•HOMSTRAD http://tardis.nibio.go.jp/homstrad/
•TREEFAM http://www.treefam.org/
Pre-Calculated Alignments
Demonstration and Exercise
Alignments associated with SRC_HUMAN
•Ensembl
•TreeFam
•PFAM
•HOMSTRAD
Automatic Alignment
Many different tools available
Choice depends on
•availability
•size of dataset
•specific application
Examine Alignment: CLUSTALX Colouring Scheme
•Only one of many possible colouring schemes
•Clearly highlights differing patterns of conservation in different columns
•Designed for red/green colour-blindness
•Same colour for amino acids with similar properties
•Residues only coloured if a given % of residues in column share the property
•EXCEPTION - P and G are always coloured
Hydrophobic: L V I M F W A C
Polar: N T S Q
Acidic: D E
Basic: K R
Secondary-structure breaking: G P
Large Aromatic Polar: H Y
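The idea behind the scheme can be sketched in Python as follows. This is a simplified illustration only: the residue groups are those listed above, but the single 60% threshold and all names are assumptions, and the real CLUSTALX rules are more detailed than this.

```python
# Residue groups as listed on the slide; the 60% threshold below is an
# illustrative assumption, not the actual CLUSTALX rule set.
GROUPS = {
    "hydrophobic": set("LVIMFWAC"),
    "polar": set("NTSQ"),
    "acidic": set("DE"),
    "basic": set("KR"),
    "structure-breaking": set("GP"),
    "aromatic-polar": set("HY"),
}


def group_of(residue):
    """Return the property group of a residue, or None for gaps/unknowns."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None


def colour_column(column, threshold=0.6):
    """Return one group label (or None) per residue in an alignment column."""
    groups = [group_of(r) for r in column]
    coloured = []
    for g in groups:
        if g is None:
            coloured.append(None)
        elif g == "structure-breaking":
            coloured.append(g)  # P and G are always coloured
        elif groups.count(g) / len(column) >= threshold:
            coloured.append(g)
        else:
            coloured.append(None)
    return coloured


print(colour_column("LLIVK"))
print(colour_column("LDKGN"))
```

In the first column four of five residues are hydrophobic, so they are coloured and the lone K is not; in the second column only G is coloured.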
Demo and Exercise: Comparing Automatic Alignment Results
•Downloading reference alignments from BaliBase
•Align them using several different tools
•Compare automatic alignments with reference alignment
•Different tools give different results
•Tools often make mistakes
•Experience using several different tools
•Practise looking at/examining alignments
Alignment Tool Selection
Default, first choice tool
•Many (50+) Protein Sequences: MUSCLE (Fast and fairly accurate)
•Few (<50) Protein Sequences: PROBCONS (Most accurate available)
•Disordered Protein Regions: MAFFT
•Few DNA sequences: PRANK (Very accurate)
Unhappy with initial alignment
•Change parameters
•Use different (more accurate) tool
Examine Alignment
•One of the most important stages in the analysis
•Involves deciding whether or not the MSA is ready for use in down-stream analyses
Adjusting the Initial Alignment
Exercise/Demo
•MSA analysis from start to finish
Examine Alignment: What to look out for
Mis-Aligned Regions
•Sequences/Regions important for the analysis?
•YES - Edit alignment manually to fix mis-alignment
•NO - Discard sequences/Ignore region
Common Patterns - buried beta-strand
http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str
Response regulator receiver domain
Common Patterns - amphiphilic partially-buried alpha-helices
http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str
Response regulator receiver domain
Common Patterns - amphiphilic partially-buried alpha-helices
http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=subt&disp=str
subtilases
Common Patterns - amphiphilic beta strands
ubiquitin conjugating enzyme
Common Patterns - non-globular sequence
Different, more strongly biased (away from equal representation of each of the 20 amino acids), sequence composition
Low sequence complexity, shown here in order of increasing complexity:
TTT TTTTT TTTTTTT
THE THEHT THETHET
THE HAPPY PROTEIN
More variable sequence (more substitutions, more gaps than globular/structured regions)
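One simple way to put a number on compositional bias is the Shannon entropy of the residue composition. The Python sketch below applies it to the three letter strings from the slide (with spaces removed). Using entropy here is an illustrative assumption; dedicated low-complexity filters use more elaborate measures.

```python
from collections import Counter
from math import log2


def composition_entropy(sequence):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * log2(n / total) for n in counts.values())


# The letter strings from the slide, with spaces removed.
examples = ["TTTTTTTTTTTTTTT", "THETHEHTTHETHET", "THEHAPPYPROTEIN"]
entropies = [composition_entropy(s) for s in examples]
for s, h in zip(examples, entropies):
    print(f"{s}: {h:.2f} bits")
```

Entropy is zero for the single-letter string and rises as the composition becomes more even across more letters.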
Common Patterns - short linear motifs
LIG_CYCLIN_1: [RK].L.{0,1}[FYLIVMP]
•Mostly occur in disordered protein regions
•Often show greater conservation than neighbouring sequence
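Motif patterns written like this are regular expressions, so they can be searched for directly. A minimal Python sketch using the standard `re` module follows; the peptide is made up for illustration and the function name is an assumption.

```python
import re

# LIG_CYCLIN_1 pattern as given on the slide.
LIG_CYCLIN_1 = re.compile(r"[RK].L.{0,1}[FYLIVMP]")


def find_motif(sequence, pattern=LIG_CYCLIN_1):
    """Return (start, matched_text) for each non-overlapping match (0-based start)."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# Hypothetical peptide, made up for illustration.
hits = find_motif("AAKRRLFGAA")
print(hits)
```

A pattern this short matches by chance quite often, which is why conservation in an alignment is useful supporting evidence.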
Caution
Absence of patterns that characterise different structures (e.g. helices, motifs, etc.) doesn’t mean the structures are not present!
It only means that either the structure is showing different evolutionary dynamics from those shown here, or insufficient/too much diversity is present in the alignment, obscuring tell-tale patterns of conservation
ClustalX helps to spot misaligned regions
BaliBase kinase2_ref5 as misaligned by clustalx
Quality->“Show Low-Scoring Segments”
NB other software is available to help in this way
Demo - edit/correct misalignments
•View alignment with ClustalX
•Quality -> Show low-scoring segments
•Edit using JalView
•http://www.jalview.org/examples/editing.html
An aside - different tools give different alignments
BaliBase kinase2_ref5
clustalx
probcons
mafft
prank
Examine Alignment: What to look out for
Missing regions
•Sequences important/vital for the analysis?
•YES
•Try and determine why sequence is missing
•If due to error, attempt to correct it
•NO
•Discard sequences
Examine Alignment: What to look out for
Apparently nonsense/unrelated regions
(With CLUSTALX “Quality” -> “Show Low-Scoring Segments” switched on)
•Sequences important/vital for the analysis?
•YES
•Try and determine why sequence is different
•If due to error, attempt to correct it
•NO
•Discard sequences
Examine Alignment: What to look out for
Lots of very similar sequences
•Slow down calculations
•If there are no more divergent sequences in the alignment, there may not be enough information for successful analysis
•Remove most of them, leaving most “interesting” sequences e.g. keep human, remove chimp, guinea-pig
•Find alternative data-sources to supplement initial sequence set e.g. ENSEMBL for access to predicted peptides not yet released elsewhere
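Thinning out near-identical sequences can be sketched as a greedy filter on pairwise percent identity. In this Python sketch the 80% threshold, the function names and the toy alignment are all assumptions for illustration; which sequences are "interesting" enough to keep remains a judgement for the analyst.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identity over columns where both aligned rows have a residue."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def filter_redundant(msa, max_identity=80.0):
    """Greedily keep sequences that are not too similar to any already kept."""
    kept = {}
    for name, row in msa.items():
        if all(percent_identity(row, other) <= max_identity for other in kept.values()):
            kept[name] = row
    return kept


# Toy alignment with hypothetical names; input order decides which copy is kept.
msa = {
    "human": "AGWYTIKLMN",
    "chimp": "AGWYTIKLMN",
    "guinea_pig": "AGWYTIKLMS",
    "yeast": "AAQYSIRLVN",
}
print(list(filter_redundant(msa)))
```

Because the filter is greedy, listing the preferred sequences first (here human) ensures they are the ones retained.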
Demo: Comparing Automatic Alignment Tools
•1aboA_ref1 from BaliBase
•http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/align_index.html
•http://www.ebi.ac.uk/mafft/
•http://www.ebi.ac.uk/Tools/muscle/
•http://www.ebi.ac.uk/t-coffee/
Difficult Alignments - low similarity between sequences
Possible Strategies:
•Collect more sequences to reduce lowest pairwise similarity
•Use structural information if possible
Typical Problems
•Difficult to align throughout the sequence - more problems in local regions of lower conservation
BaliBase: 1aboA_ref1
Difficult Alignments - “orphan” sequences
Possible Strategies:
•Collect more sequences to reduce lowest pairwise similarity
•Use structural information if possible
Typical Problems
•Orphan difficult to align correctly - typically it is forced to have similar conserved block structure as other sequences
Reference alignment
ClustalX alignment
BaliBase: 1aboA_ref2
Difficult Alignments - Several Subfamilies
Possible Strategies:
•Align each subfamily on its own initially
•Then align the subfamily alignments directly to each other
•Add additional “bridging” sequences if you can find them
Difficult Alignments - Insertions/Extensions
clustalx alignment
BaliBase Reference alignment
Possible Strategies:
•Initially align sequences without insertion
•Then align (perhaps manually) sequence(s) with the insertion
Difficult Alignments - Repeats
BaliBase Reference alignment
Possible Strategies:
•Determine repeat structure of your sequences
•Decide which order
Typical Problems
•Repeats are aligned “out of order”
Difficult Alignments - Rearrangement/Shuffling
BaliBase: cellulase_1_ref8
Possible Strategies:
•Determine repeat/domain structure of your sequences before aligning them
•Extract sequences of individual domains, and only include sequences of the same domain in the same alignment
Typical Problems
•Most software prefers to minimise gaps - so completely misalign domains as seen left
Difficult Alignments - Short Linear Motifs
LIG_AP_GAE_1: [DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD]
LIG_CYCLIN_1: [RK].L.{0,1}[FYLIVMP]
Possible Strategies:
•Align regions suspected of containing motifs separately from the rest of the proteins
•Use MAFFT (works best for regions of this kind)
Typical Problems
•Most software finds it difficult to identify short regions of conservation in a region of low conservation
Summary
•It’s not “Is my alignment correct”
•Rather “Where is my alignment wrong”!
•The most important steps in building an alignment are:
• Specifying an appropriate question
•Critically examining the alignments you build
•Quality of the alignment influences quality of down-stream analyses
•Mis-alignment introduces systematic errors in downstream analyses
•There is no “best” alignment tool - which one you use depends on your question
•It can cost lots of time to make a good alignment!!