Sequence Alignments Wednesday 6th October 2010 Aidan Budd Structural and Computational Biology Unit EMBL Heidelberg, Germany


Sequence Alignments

Wednesday 6th October 2010

Aidan Budd

Structural and Computational Biology Unit

EMBL Heidelberg, Germany


Why focus on sequence alignment?

•Required for the development of almost all bioinformatics tools

•Enables sequence similarity to be measured in a way that provides confident prediction of structural-similarity/evolutionary-relatedness between sequences

•Similarity to sequences with known function provides KEY source of novel hypotheses for function of novel sequences

•Usually the first analysis of any 'novel' sequence involves aligning it to many other sequences

Have any of you ever used a tool to align sequences?


Sequence Alignment

Sequence Alignment:

Arrangement of two or more sequences in matrix/grid


Biological Sequence Alignment - Rows

• Residues in the same row are from the same biological macromolecule (protein or nucleic acid)

• Residues are arranged in the order they occur in the macromolecule

• N to C terminal in proteins

• 5' to 3' in nucleic acids


Biological Sequence Alignment - Columns

• Elements/cells in the matrix only ever contain a single character (either a residue or a "blank"/"gap" character)

• Residues in the same column share a special 1:1 relationship that they don't share with residues in any other column
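
As a minimal sketch of the row and column rules above, an alignment can be held as a list of equal-length strings, one string per row. The example rows, the `GAP` constant and the helper names are illustrative assumptions, not part of the original slides.

```python
GAP = "-"  # assumed gap character

# Illustrative two-row alignment; each string is one row (one macromolecule).
alignment = [
    "AG---WWTI",
    "AAQQQWYTI",
]


def is_rectangular(rows):
    """True if every row has the same number of cells (a proper grid)."""
    return len({len(row) for row in rows}) == 1


def column(rows, index):
    """Return the characters sharing column `index` (the 1:1 relationship)."""
    return [row[index] for row in rows]


def ungapped(row):
    """Recover the original sequence by dropping gap characters."""
    return row.replace(GAP, "")
```

Reading down `column(alignment, 5)` gives the residues that the alignment claims are equivalent, while `ungapped` returns a row to the order in which its residues occur in the macromolecule.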


Biological Sequence Alignment - Gaps

• Gaps specify pairs of residues in the same sequence

• i.e. the residues flanking the gap within the same sequence

• Between the flanking residues, other sequences in the alignment have residues for which there is no 1:1 relationship in the gapped sequence


Biological Sequence Alignment - Gaps

Gap-only columns provide no information about relationships between residues in the alignment

• Removing sequences sometimes leaves all-gap columns

• We usually remove these "empty" columns, which does not affect the alignment

• (as this doesn't change the 1:1 relationships described by the rest of the alignment)
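
Removing gap-only columns can be sketched in a few lines. This assumes the alignment is a list of equal-length strings and that "-" is the gap character; the function name and the example rows are hypothetical.

```python
def drop_gap_only_columns(rows, gap="-"):
    """Remove columns in which every cell is a gap character."""
    # zip(*rows) walks the alignment column by column.
    kept = [col for col in zip(*rows) if any(cell != gap for cell in col)]
    if not kept:
        return ["" for _ in rows]
    # Transpose the surviving columns back into rows.
    return ["".join(cells) for cells in zip(*kept)]


full = ["AG---WWTI", "AAQQQWYTI", "AA---WYTI"]
# Dropping the second sequence leaves three all-gap columns behind.
subset = [full[0], full[2]]
cleaned = drop_gap_only_columns(subset)
```

Every remaining column still pairs the same residues as before, which is why the step is safe.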


Pairwise Sequence Alignments

Pairwise Alignment: Alignment of two sequences

>EMBLCDS:BAB82422 BAB82422.1 Ephydatia fluviatilis protein tyrosine kinase
Length = 1488

Score = 89.7 bits (98), Expect = 3e-15
Identities = 104/138 (75%), Gaps = 2/138 (1%)
Strand = Plus / Plus

Query: 85  aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag 143
           ||||| ||| | |||| | ||||| || || |||  |||    ||||| || ||||||||
Sbjct: 703 aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag 761

Query: 144 ggtggccatcaaaaccctgaagcctggcacgatgtctccagaggccttcctgcaggaggc 203
            |||||| | || ||||| || || ||||| ||||||   ||||  ||||||||||||||
Sbjct: 762 tgtggccgttaagaccctcaaaccaggcacaatgtctgtcgaggagttcctgcaggaggc 821

Query: 204 ccaggtcatgaagaagct 221
                |||||||||||||
Sbjct: 822 aagcatcatgaagaagct 839
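
The "Identities" and "Gaps" figures in BLAST-style output can be recomputed directly from the two aligned rows. The sketch below assumes "-" as the gap character; the function name is hypothetical, and the test strings are short pieces of the hit shown for this slide.

```python
def alignment_stats(query, subject, gap="-"):
    """Return (identities, gap_columns, alignment_length) for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query, subject))
    # A column is an identity if both cells hold the same residue.
    identities = sum(1 for q, s in columns if q == s and q != gap)
    # A column counts as a gap column if either cell is a gap.
    gap_columns = sum(1 for q, s in columns if gap in (q, s))
    return identities, gap_columns, len(columns)


# Last block of the hit (Query 204-221 against Sbjct 822-839).
last_block = alignment_stats("ccaggtcatgaagaagct", "aagcatcatgaagaagct")
# First ten columns of the hit, which include one gap in the query.
first_ten = alignment_stats("aagct-gggc", "aagcttgggg")
```

Summing the per-block results over the whole hit is one way to check a reported identity percentage by hand.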


Build your own sequence alignment

•Sequences are usually aligned automatically

•Pairwise: BLAST, FASTA, Smith-Waterman etc.

•Multiple: MUSCLE, PRANK, CLUSTAL etc.

•Also possible 'manually' using tools such as JalView

•Trying some manual alignment can help understand how/why automatic methods are used


JalView Demo and Exercises

•Loading sequences

•Changing the way the sequences are displayed

•Manual editing of alignments

•Adding/removing sequences to an alignment

•Exporting sequences/alignments from JalView for use in another application


Multiple Sequence Alignment

Multiple Sequence Alignment (MSA): Alignment of 3+ sequences


Multiple Sequence Alignment

MSAs describe a set of pairwise alignments

An MSA of n sequences implies n(n-1)/2 pairwise alignments

For example, an MSA of 4 sequences implies 4 x 3 / 2 = 6 pairwise alignments
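
To make this concrete, here is a minimal sketch that extracts every pairwise alignment implied by an MSA: one per unordered pair of rows, n(n-1)/2 in total. The sequence names and rows are invented for illustration, and "-" is assumed to be the gap character.

```python
from itertools import combinations


def implied_pairwise_alignments(msa, gap="-"):
    """Map each pair of sequence names to the pairwise alignment the MSA implies."""
    pairs = {}
    for (name_a, row_a), (name_b, row_b) in combinations(msa.items(), 2):
        # Columns that are gaps in both rows carry no information for this pair.
        columns = [(a, b) for a, b in zip(row_a, row_b)
                   if not (a == gap and b == gap)]
        pairs[(name_a, name_b)] = (
            "".join(a for a, _ in columns),
            "".join(b for _, b in columns),
        )
    return pairs


msa = {"seq1": "AG---WWTI", "seq2": "AAQQQWYTI", "seq3": "AA---WYTI"}
pairwise = implied_pairwise_alignments(msa)
```

With three rows the function returns three pairwise alignments, and the pair (seq1, seq3) loses the columns that only seq2 occupies.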


“Equivalence”/similarity of residues

Residues in the same column are either:

•Structurally equivalent/similar

•Evolutionarily equivalent/related/homologous

Different applications assume different types of equivalence

Different types of similarity are not necessarily equivalent


Structural equivalence

Demonstration:

http://www.embl-heidelberg.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html#DemonstratingStructuralEquivalence

Bacterial toxins 1ji6 and 1i5p


Structural equivalence

1i5p: 1 YVAPVVGTVSSFLLKKVGSLIGKR
        111111111111111111111111
1ji6: 1 DAVGTGISVVGQILGVVGVPFAGA

Residues in the same alignment column are "structurally equivalent"

i.e. they should be the residues in the two structures whose locations with respect to the rest of the structure are most similar in the two structures

Such residues will have similar structural/functional "roles" in the two proteins e.g. form similar side-chain interactions


Structural equivalence

Some regions of the structures do not have structurally equivalent residues in the other structure

1i5p: DNFLNPTQN----PVPLSITSSVN
      111111       11111111111
1ji6: NSWKKTPLSLRSKRSQDRIRELFS

Alignment gaps are a sure sign of such residues

Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!


“Evolutionary Equivalence”

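[Figure: evolutionary history of two present-day sequences, reconstructed from the slide's diagram labels]

Ancestral sequence: AGWYTI

Two copies of gene generated: AGWYTI, AGWYTI

Copy 1, Mutation / Substitution Y-W: AGWYTI -> AGWWTI

Copy 2, Substitution G-A: AGWYTI -> AAWYTI

Copy 2, QQQ Insertion: AAWYTI -> AAQQQWYTI

Resulting alignment:

AG---WWTI
AAQQQWYTI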

Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence with any changes due only to point substitutions
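
One way to see why this rule yields the gapped alignment is to track, for every residue, which ancestral position it descends from. The sketch below is an illustration under simple assumptions, not a real alignment method: each residue carries a hypothetical key `(ancestral_position, insertion_rank)`, substitutions keep the key, and insertions create new keys. It handles a single insertion per site only.

```python
def substitute(seq, index, new_char):
    """Point substitution: the residue changes, its ancestry key does not."""
    key, _ = seq[index]
    return seq[:index] + [(key, new_char)] + seq[index + 1:]


def insert_after(seq, index, new_chars):
    """Insertion: new residues get new keys, so they share no ancestral column."""
    (pos, rank), _ = seq[index]
    new = [((pos, rank + k), c) for k, c in enumerate(new_chars, start=1)]
    return seq[:index + 1] + new + seq[index + 1:]


def align_by_ancestry(*seqs, gap="-"):
    """Place residues with the same ancestry key in the same column."""
    keys = sorted({key for seq in seqs for key, _ in seq})
    rows = []
    for seq in seqs:
        lookup = dict(seq)
        rows.append("".join(lookup.get(key, gap) for key in keys))
    return rows


ancestor = [((i, 0), c) for i, c in enumerate("AGWYTI")]
copy_1 = substitute(ancestor, 3, "W")        # Y -> W
copy_2 = substitute(ancestor, 1, "A")        # G -> A
copy_2 = insert_after(copy_2, 1, "QQQ")      # QQQ insertion
rows = align_by_ancestry(copy_1, copy_2)
```

Real sequences do not come with ancestry keys attached, which is why alignment tools have to infer these relationships from the residues alone.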


Quiz - Evolutionary Interpretation of Alignments

Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?

Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG

Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG

X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG


Quiz - Evolutionary Interpretation of Alignments

Different automatic MSA software gives different results

PRANK:
RGIPGEVLGAQPG
KGIPGDPAFGDPG
---KGEPGIGLPG

MAFFT:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

CLUSTALX:
K---GEPGIGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG

They're all "wrong"....

Because their model of the evolutionary process is very different from the (very strange...) one under which I told you they evolved

Z:
KGE--------PGIGL------PG
KGIPG-----------DPAFGDPG
RGIPGEVLGAQ-----------PG

Y:
KGEPG------IGL------PG
KGIPG---------DPAFGDPG
RGIPGEVLGAQ---------PG

X:
KGEPG---IGLPG
KGIPGDPAFGDPG
RGIPGEVLGAQPG


Interpreting Alignments

• Special 1:1 relationship between residues in the same column

• Structural: Very similar structural environment

• Evolutionary: Related by point mutations/no mutations from the same residue in the ancestral sequence

• Structural and Evolutionary equivalence need not necessarily be the same

• Not all residues have 1:1 equivalents in other structures


Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 1: Structural equivalence without evolutionary equivalence

Structural alignment of SH3-interaction motifs from nef and ncf1

nef/fyn1 PDB:1efn

ncf1 PDB:1w70

aligned ncf1/nef1 SH3 interaction motifs


Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 2: Evolutionary equivalence without structural equivalence

Human Lymphotactin adopts different folds depending on the conditions

PDB:2jp1

PDB:1j8i


Mis-alignment

“Gold-standard” structural alignment

“Gold-standard” structural alignment (with CLUSTALX default colouring)

Mis-alignment of same region - residues that are known to be functionally equivalent are NOT in the same columns


Indicating Difficult-To-Align/Non-Equivalent Alignment Regions

[Figure: alignment of 1st3, 1sbc, 1sbt and 2prk]

Code such regions differently, e.g. with upper-case characters

Create alignment to minimise number of gaps in alignment (typical solution - particularly for MSA software)

Only include in same column characters known/believed structurally equivalent


Quiz - Numbers of Insertions

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

(a) 2 (b) 1 (c) 0 (d) 3


Quiz - Numbers of Insertions

If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions

If an alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?


Quiz - Numbers of Insertions

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

We can ALWAYS explain observed sequence length diversity with:

•0 insertions (all length variation due to deletion)

•0 deletions (all length variation due to insertion)

•a combination of insertions and deletions

Perhaps we should instead focus on inferring the most likely scenario?

(Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)


Building an Alignment

1. Formulate the question

2. Collect initial sequence set

3. Automatically align sequences

4. Examine alignment

5. Discard sequences

6. Include additional sequences

7. Manually adjust/edit alignment


Formulate the Question


Collect Initial Sequences

1. Search databases for appropriate sequences

•Keyword-based

•Sequence-similarity-based

2. Get pre-aligned set of sequences

•PFAM pfam.sanger.ac.uk

•ENSEMBL www.ensembl.org

•HOMSTRAD http://tardis.nibio.go.jp/homstrad/

•TREEFAM http://www.treefam.org/


Pre-Calculated Alignments

Demonstration and Exercise

Alignments associated with SRC_HUMAN

•Ensembl

•TreeFam

•PFAM

•HOMSTRAD


Automatic Alignment

Many different tools available

Choice depends on

•availability

•size of dataset

•specific application


Examine Alignments: CLUSTALX Colouring Scheme

•Only one of many possible colouring schemes

•Clearly highlights differing patterns of conservation in different columns

•Designed to remain readable with red/green colour-blindness


Examine Alignment: CLUSTALX Colouring Scheme

•Same colour for amino acids with similar properties

•Residues only coloured if a given % of residues in column share the property

•EXCEPTION - P and G are always coloured


Examine Alignment: CLUSTALX Colouring Scheme

Hydrophobic: L V I M F W A C

Polar: N T S Q

Acidic: D E

Basic: K R

Secondary-structure breaking: G P

Large Aromatic Polar: H Y
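
The colouring rule described on these slides can be sketched as follows. This is a simplification of what the slides state, not the actual ClustalX implementation: the residue groups come from the slide, but the single `threshold` value of 0.6 and all function names are assumptions made for illustration.

```python
GROUPS = {
    "hydrophobic": set("LVIMFWAC"),
    "polar": set("NTSQ"),
    "acidic": set("DE"),
    "basic": set("KR"),
    "structure-breaking": set("GP"),
    "aromatic-polar": set("HY"),
}
ALWAYS_COLOURED = set("GP")  # the slide's stated exception


def group_of(residue):
    """Return the property group of a residue, or None (e.g. for gaps)."""
    for name, members in GROUPS.items():
        if residue in members:
            return name
    return None


def colour_column(column, threshold=0.6):
    """Return, per cell, the group to colour with, or None for no colour."""
    result = []
    for residue in column:
        group = group_of(residue)
        if group is None:
            result.append(None)
            continue
        # Fraction of cells in this column that share the residue's property.
        share = sum(1 for other in column if group_of(other) == group) / len(column)
        coloured = residue in ALWAYS_COLOURED or share >= threshold
        result.append(group if coloured else None)
    return result


conserved = colour_column("LLIVK")
variable = colour_column("GASTD")
```

A column dominated by one property lights up, while a mixed column stays mostly uncoloured apart from any G or P.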


Demo and Exercise: Comparing Automatic Alignment Results

•Downloading reference alignments from BaliBase

•Align them using several different tools

•Compare automatic alignments with reference alignment

• Different tools give different results

• Tools often make mistakes

• Experience using several different tools

• Practise looking at/examining alignments
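
One common way to put a number on "compare automatic alignments with reference alignment" is a sum-of-pairs style score: the fraction of residue pairs aligned in the reference that the test alignment also aligns. The sketch below is a generic illustration of that idea, not the BaliBase scoring program itself; the function names and toy alignments are assumptions.

```python
from itertools import combinations


def aligned_pairs(rows, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (row_index, residue_index_within_sequence).
    """
    counters = [0] * len(rows)
    pairs = set()
    for column in zip(*rows):
        present = []
        for row_index, cell in enumerate(column):
            if cell != gap:
                present.append((row_index, counters[row_index]))
                counters[row_index] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_score(test_rows, reference_rows):
    """Fraction of reference residue pairs recovered by the test alignment."""
    reference = aligned_pairs(reference_rows)
    return len(aligned_pairs(test_rows) & reference) / len(reference)


reference = ["AG-WT", "AGQWT"]
automatic = ["AGW-T", "AGQWT"]
score = sum_of_pairs_score(automatic, reference)
```

Here the automatic alignment misplaces one W, so it recovers three of the four reference pairs.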


Alignment Tool Selection

Many (50+) Protein Sequences: MUSCLE (Fast and fairly accurate)

Few (<50) Protein Sequences: PROBCONS (Most accurate available)

Disordered Protein Regions: MAFFT

Few DNA sequences: PRANK (Very accurate)

Unhappy with initial alignment: Change parameters, or use different (more accurate) tool

Default, first choice tool
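
The selection guide on this slide can be written down as a small lookup function. This is only a sketch of the slide's recommendations: the function name, the argument names and the order in which the conditions are checked are assumptions, and cases the slide does not cover (for example many DNA sequences) return `None` rather than a guess.

```python
def suggest_tool(n_sequences, molecule="protein", disordered=False):
    """Suggest an alignment tool following the slide's guide, or None."""
    if molecule == "dna":
        # The slide only covers the "few DNA sequences" case.
        return "PRANK" if n_sequences < 50 else None
    if disordered:
        return "MAFFT"
    if n_sequences >= 50:
        return "MUSCLE"
    return "PROBCONS"
```

For instance, `suggest_tool(200)` suggests MUSCLE and `suggest_tool(10, disordered=True)` suggests MAFFT. If the first result is unsatisfactory, the slide's advice is to change parameters or try a different tool.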



Examine Alignment

•One of the most important stages in the analysis

•Involves deciding whether or not the MSA is ready for use in down-stream analyses


Adjusting the Initial Alignment


Exercise/Demo

•MSA analysis from start to finish


Examine Alignment: What to look out for

Mis-Aligned Regions

•Sequences/Regions important for the analysis?

•YES - Edit alignment manually to fix mis-alignment

•NO - Discard sequences/Ignore region


Common Patterns - buried beta-strand

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Response regulator receiver domain


Common Patterns - amphiphilic partially-buried alpha-helices

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Response regulator receiver domain


Common Patterns - amphiphilic partially-buried alpha-helices

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=subt&disp=str

subtilases


Common Patterns - amphiphilic beta strands

ubiquitin conjugating enzyme


Common Patterns - non-globular sequence

Different, more strongly biased sequence composition (compared with equal representation of each of the 20 amino acids)

Low sequence complexity. Increasing complexity:

TTT TTTTT TTTTTTT

THE THEHT THETHET

THE HAPPY PROTEIN

More variable sequence (more substitutions, more gaps than globular/structured regions)
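
The slide does not say how complexity is measured. One simple stand-in, used here purely for illustration, is the Shannon entropy of the letter composition: a string built from one letter scores 0, and the score rises as more letters are used more evenly. The function name is hypothetical, and the three strings are the slide's examples with spaces removed.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence):
    """Entropy (in bits) of the character composition of a sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((n / total) * log2(n / total) for n in counts.values())


low = shannon_entropy("TTTTTTTTTTTTTTT")
medium = shannon_entropy("THETHEHTTHETHET")
high = shannon_entropy("THEHAPPYPROTEIN")
```

Composition-based entropy ignores the order of the letters, so dedicated low-complexity filters usually apply a measure like this within a sliding window.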


Common Patterns - short linear motifs

[RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1

Mostly occur in disordered protein regions

Often show greater conservation than neighbouring sequence

Caution

Absence of patterns that characterise different structures (e.g. helices, motifs, etc.) doesn’t mean the structures are not present!

Only means that either the structure is showing different evolutionary dynamics from those shown here, or insufficient/too much diversity is present in the alignment, obscuring tell-tale patterns of conservation
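
The motif pattern shown on this slide is written as a regular expression, so it can be searched for directly. In the sketch below the pattern is taken from the slide, while the function name and the test sequence are made up for illustration.

```python
import re

# LIG_CYCLIN_1 pattern as given on the slide.
CYCLIN_MOTIF = re.compile(r"[RK].L.{0,1}[FYLIVMP]")


def find_motifs(sequence):
    """Return (start, matched_text) for each non-overlapping motif match."""
    return [(m.start(), m.group()) for m in CYCLIN_MOTIF.finditer(sequence)]


hits = find_motifs("GSKALQFTT")  # hypothetical peptide
```

Patterns this short match by chance quite often, so a hit is a hypothesis to check against conservation and structural context rather than evidence of function.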


ClustalX helps to spot misaligned regions

BaliBase kinase2_ref5 as misaligned by clustalx

Quality->“Show Low-Scoring Segments”

NB other software is available to help in this way


Demo - edit/correct misalignments

•View alignment with ClustalX

•Quality -> Show low-scoring segments

•Edit using JalView

•http://www.jalview.org/examples/editing.html


An aside - different tools give different alignments

clustalx

probcons


An aside - different tools give different alignments

BaliBase kinase2_ref5

mafft

prank


Examine Alignment: What to look out for

Missing regions

•Sequences important/vital for the analysis?

•YES

•Try and determine why sequence is missing

•If due to error, attempt to correct it

•NO

•Discard sequences


Examine Alignment: What to look out for

Apparently nonsense/unrelated regions

•Sequences important/vital for the analysis?

•YES

•Try and determine why sequence is different

•If due to error, attempt to correct it

•NO

•Discard sequences

With CLUSTALX “Quality”->“Show Low-Scoring Segments” switched on


Examine Alignment: What to look out for

Lots of very similar sequences

•Slow down calculations

•If there are no more divergent sequences in the alignment, there may not be enough information for successful analysis

•Remove most of them, leaving most “interesting” sequences e.g. keep human, remove chimp, guinea-pig

• Find alternative data-sources to supplement initial sequence set e.g. ENSEMBL for access to predicted peptides not yet released elsewhere
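
Thinning out near-identical sequences can be sketched as a greedy filter over the aligned rows. Everything here is an illustrative assumption: the function names, the 0.9 identity cut-off and the toy sequences. Sequences listed first are the ones kept, so put the most "interesting" ones (e.g. human) first.

```python
def pairwise_identity(row_a, row_b, gap="-"):
    """Fraction of identical residues over columns where neither row has a gap."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not compared:
        return 0.0
    return sum(a == b for a, b in compared) / len(compared)


def filter_redundant(msa, max_identity=0.9):
    """Keep a sequence only if it is not too similar to any already kept."""
    kept = {}
    for name, row in msa.items():
        if all(pairwise_identity(row, other) <= max_identity
               for other in kept.values()):
            kept[name] = row
    return kept


msa = {
    "human": "AGWYTIKL",
    "chimp": "AGWYTIKL",   # identical to human, so it adds no information
    "yeast": "AAWFTLRL",
}
reduced = filter_redundant(msa)
```

After filtering, any columns left containing only gaps can be removed before further analysis.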


Demo: Comparing Automatic Alignment Tools

•1aboA_ref1 from BaliBase

•http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/align_index.html

•http://www.ebi.ac.uk/mafft/

•http://www.ebi.ac.uk/Tools/muscle/

•http://www.ebi.ac.uk/t-coffee/


Difficult Alignments - low similarity between sequences

Possible Strategies:

•Collect more sequences to reduce lowest pairwise similarity

•Use structural information if possible

Typical Problems

•Difficult to align throughout the sequence - more problems in local regions of lower conservation

BaliBase: 1aboA_ref1


Difficult Alignments - “orphan” sequences

Possible Strategies:

•Collect more sequences to reduce lowest pairwise similarity

•Use structural information if possible

Typical Problems

•Orphan difficult to align correctly - typically it is forced to have similar conserved block structure as other sequences

Reference alignment

ClustalX alignment

BaliBase: 1aboA_ref2


Difficult Alignments - Several Subfamilies

Possible Strategies:

•Align each subfamily on its own initially

•Then align the subfamily alignments directly to each other

•Add additional “bridging” sequences if you can find them


Difficult Alignments - Insertions/Extensions

clustalx alignment

BaliBase Reference alignment

Possible Strategies:

•Initially align sequences without insertion

•Then align (perhaps manually) sequence(s) with the insertion


Difficult Alignments - Repeats

BaliBase Reference alignment

Possible Strategies:

•Determine repeat structure of your sequences

•Decide which order

Typical Problems

•Repeats are aligned “out of order”


Difficult Alignments - Rearrangement/Shuffling

BaliBase: cellulase_1_ref8

Possible Strategies:

•Determine repeat/domain structure of your sequences before aligning them

•Extract sequences of individual domains, and only include sequences of the same domain in the same alignment

Typical Problems

•Most software prefers to minimise gaps - so completely misalign domains as seen left


Difficult Alignments - Short Linear Motifs

[DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD] LIG_AP_GAE_1

[RK].L.{0,1}[FYLIVMP] LIG_CYCLIN_1

Possible Strategies:

•Align regions suspected of containing motifs separately from the rest of the proteins

•Use MAFFT (works best for regions of this kind)

Typical Problems

•Most software finds it difficult to identify short regions of conservation in a region of low conservation


Summary

•It’s not “Is my alignment correct”

•Rather “Where is my alignment wrong”!

•The most important steps in building an alignment are:

• Specifying an appropriate question

•Critically examining the alignments you build

•Quality of the alignment influences quality of down-stream analyses

•Mis-alignment introduces systematic errors in downstream analyses

•There is no “best” alignment tool - which one you use depends on your question

•It can cost lots of time to make a good alignment!!