Sequence Alignments Wednesday 6th October 2010 Aidan Budd Structural and Computational Biology Unit...

Sequence Alignments

Wednesday 6th October 2010

Aidan BuddStructural and Computational Biology

UnitEMBL Heidelberg, Germany

Why focus on sequence alignment?

•Required for the development of almost all bioinformatics tools

•Enables sequence similarity to be measured in a way that provides confident prediction of structural-similarity/evolutionary-relatedness between sequences

•Similarity to sequences with known function provides KEY source of novel hypotheses for function of novel sequences

•Usually the first analysis of any 'novel' sequences involves aligning it to many sequences

Have any of you ever used a tool to align sequences?

Sequence Alignment

Sequence Alignment:

Arrangement of two or more sequences in matrix/grid

Biological Sequence Alignment - Rows

• Residues in the same row are from the same biological macromolecule (protein or nucleic acid)

• Residues are arranged in the order they occur in the macromolecule

• N to C terminal in proteins

• 5' to 3' in nucleotides

Biological Sequence Alignment - Columns

• Elements/cells in the matrix only ever contain a single character (either a residue or a "blank"/"gap" character)

• Residues in the same column share a special 1:1 relationship that they don't share with residues in any other column

Biological Sequence Alignment - Gaps

• Gaps specify pairs of residues in the same sequence

• i.e. the residues flanking the gap within the same sequence

• Between the flanking residues, other sequences in the alignment have residues for which there is no 1:1 relationship in the gapped sequence


Gap-only column provide no information about relationships between residues in the alignment


Gap-only column provide no information about relationships between residues in the alignment

• Removing sequences sometimes leaves all-gap columns

• We usually remove these "Empty" columns does not effect

• (as this doesn't change the 1:1 relationships described by the rest of the alignment)

Pairwise Sequence Alignments

>EMBLCDS:BAB82422 BAB82422.1 Ephydatia fluviatilis protein tyrosine kinase Length = 1488

Score = 89.7 bits (98), Expect = 3e-15 Identities = 104/138 (75%), Gaps = 2/138 (1%) Strand = Plus / Plus

Query: 85 aagct-gggccagggctgctttggcgaggtgtggatggggacctggaacggtaccaccag 143 ||||| ||| | |||| | ||||| || || ||| ||| ||||| || ||||||||Sbjct: 703 aagcttggggcggggcag-tttggtgaagtttgggagggtgtgtggaatgggaccaccag 761

Query: 144 ggtggccatcaaaaccctgaagcctggcacgatgtctccagaggccttcctgcaggaggc 203 |||||| | || ||||| || || ||||| |||||| |||| ||||||||||||||Sbjct: 762 tgtggccgttaagaccctcaaaccaggcacaatgtctgtcgaggagttcctgcaggaggc 821

Query: 204 ccaggtcatgaagaagct 221 |||||||||||||Sbjct: 822 aagcatcatgaagaagct 839

Pairwise Alignment: Alignment of two sequences

Build your own sequence alignment

•Sequences are usually aligned automatically

•Pairwise: BLAST, FASTA, Smith-Waterman etc.

•Multiple: MUSCLE, PRANK, CLUSTAL etc.

•Also possible 'manually' using tools such as JalView

•Trying some manual alignment can help understand how/why automatic methods are used

JalView Demo and Exercises

•Loading sequences

•Changing the way the sequences are displayed

•Manual editing of alignments

•Adding/removing sequences to an alignment

•Exporting sequences/alignments from JalView for use in another application

Multiple Sequence Alignment

Multiple Sequence Alignment (MSA): Alignment of 3+ sequences

Multiple Sequence Alignment

MSAs describe a set of pairwise alignments

An MSA of n sequences

n(n+1)

2pairwise alignmentsFor example:

implies

•Structurally equivalent/similar

•Evolutionary equivalent/related/homologous

Residues in the same column either:

Different applications assume different types of equivalence

Different types of similarity not necessarily equivalent

“Equivalence”/similarity of residues

Structural equivalence

Demonstration:

http://www.embl-heidelberg.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html#DemonstratingStructuralEquivalence

Bacterial toxins 1ji6 and 1i5p

1i5p: 1 YVAPVVGTVSSFLLKKVGSLIGKR 1111111111111111111111111ji6: 1 DAVGTGISVVGQILGVVGVPFAGA


Residues in the same alignment column are "structurally equivalent"

i.e. they should be the residues in the two structures whose location with respect to the rest of the structure are most similar in the two structureSuch residues will have similar structural/functional "roles" in the two proteins e.g. form similar side-chain interactions


Some regions of the structures do not have structurally equivalent residues in the other structure

1i5p: DNFLNPTQN----PVPLSITSSVN 111111 111111111111ji6: NSWKKTPLSLRSKRSQDRIRELFS

Alignment gaps are a sure sign of such residues

Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!

“Evolutionary Equivalence”

AGWYTI

AGWWTI

AGWYTI

AGWWTI

AAWYTI

AGWWTI

AAQQQWYTI

Mutation / Substituti

onY-W

Substitution G-A

QQQ Insertio

n

AGWYTI

AGWYTI

Two copies of

gene generate

d

AGWYTI AGWYTI

AGWYTI

AGWWTI

AGWYTI

AGWWTI

AAWYTI

AG---WWTI

AAQQQWYTI

Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence with any changes due only to point substitutions

Quiz - Evolutionary Interpretation of Alignments

Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events?

KGE--------PGIGL------PGKGIPG-----------DPAFGDPGRGIPGEVLGAQ-----------PG

ZKGEPG------IGL------PGKGIPG---------DPAFGDPGRGIPGEVLGAQ---------PG

YKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

X

Quiz - Evolutionary Interpretation of Alignments

RGIPGEVLGAQPGKGIPGDPAFGDPG---KGEPGIGLPG

PRANKKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

MAFFTK---GEPGIGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

CLUSTALX

Different automatic MSA software gives different results

They're all "wrong"....

Because their model of evolutionary process is very divergent from the (very strange...) one under which I told you they evolved

KGE--------PGIGL------PGKGIPG-----------DPAFGDPGRGIPGEVLGAQ-----------PG

ZKGEPG------IGL------PGKGIPG---------DPAFGDPGRGIPGEVLGAQ---------PG

YKGEPG---IGLPGKGIPGDPAFGDPGRGIPGEVLGAQPG

X

Interpreting Alignments

• Special 1:1 relationship between residues in the same column

• Structural: Very similar structural environment

• Evolutionary: Related by point mutations/no mutations from the same residue in the ancestral sequence

• Structural and Evolutionary equivalence need not necessarily be the same

• Not all residues have 1:1 equivalents in other structures

Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 1:Structural equivalence without evolutionary equivalence

Structural alignment of SH3-interaction motifs from nef and ncf1

nef/fyn1 PDB:1efn

ncf1PDB:1w70

aligned ncf1/nef1 SH3 interaction motifs

Non-Equivalence of Evolutionary and Structural Alignments

Demonstration 2:Evolutionary equivalence without structural equivalence

Human Lymphotactin adopts different folds depending on the conditions

PDB:2jp1

PDB:1j8i

Mis-alignment

“Gold-standard” structural alignment

“Gold-standard” structural alignment (with CLUSTALX default colouring)

Mis-alignment of same region - residues that are known to be functionally equivalent are NOT in the same coloumns

Indicating Difficult-To-Align/Non-Equivalent

Alignment Regions

1st3

1sbc

1sbt

2prk

Differently code regions which e.g. with upper-case characters

Create alignment to minimise number of gaps in alignment (typical solution - particularly for MSA software)

Only include in same column characters known/believed structurally equivalent

Quiz - Numbers of Insertions

(a) 2 (b) 1 (c) 0 (d) 3

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?


If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions

If and alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion

The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?


The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?We can ALWAYS explain observed sequence length

diversity with:•0 insertions (all length variation due to deletion)•0 deletions (all length variation due to insertion)•a combination of insertions and deletions

Perhaps we should instead focus on inferring the most likely scenario?(Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)

Building an Alignment

1. Formulate 1. Formulate the questionthe question

2. Collect 2. Collect initial initial sequence setsequence set

3. 3. Automatically Automatically align align sequencessequences

4. 4. Examine Examine alignmealignment nt 5. 5.

Discard Discard SequencSequenceses

6. Include 6. Include additional additional sequencessequences

7. Manually 7. Manually adjust/edit adjust/edit alignmentalignment

Building an Alignment








Formulate the Question








1.Search databases for appropriate sequences

•Keyword-based

•Sequence-similarity-based

2.Get pre-aligned set of sequences

•PFAM pfam.sanger.ac.uk

•ENSEMBL www.ensembl.org

•HOMSTRAD http://tardis.nibio.go.jp/homstrad/

•TREEFAM http://www.treefam.org/

Collect Initial Sequences

Pre-Calculated Alignments

Demonstration and Exercise

Alignments associated with SRC_HUMAN

•Ensembl

•TreeFam

•PFAM

•HOMSRAD








Many different tools available

Choice depends on

•availability

•size of dataset

•specific application

Automatic Alignment

•Only one of many possible colouring schemes

•Highlights well differing patterns of conservation in different columns

•Designed for red/green colour-blindness

Examine Alignments: CLUSTALX Colouring Scheme

•Same colour for amino acids with similar properties

•Residues only coloured if a given % of residues in column share the propertyEXCEPTION - P and G are always coloured

Examine Alignment: CLUSTALX Colouring Scheme

Hydrophobic: L V I M F W A CPolar: N T S Q

Acidic: D E

Basic: K R Secondary-structure breaking: G P

Large Aromatic Polar: H Y

Examine Alignment: CLUSTALX Colouring Scheme

Demo and Exercise: Comparing Automatic Alignment

Results

•Downloading reference alignments from BaliBase

•Align them using several different tools

•Compare automatic alignments with reference alignment

• Different tools give different results• Tools often make mistakes• Experience using several different tools• Practise looking at/examining alignments

Many (50+) Protein Sequences

MUSCLE (Fast and fairly accurate)

Few (<50) Protein Sequences

PROBCONS (Most accurate available)

Disordered Protein Regions

MAFFT

Few DNA sequences

PRANK (Very accurate)

Unhappy with initial alignment

Change parameters

Use different (more accurate) tool

Alignment Tool Selection

Default, first choice tool

Many (50+) Protein Sequences

MUSCLE (Fast and fairly accurate)

Few (<50) Protein Sequences

PROBCONS (Most accurate available)

Disordered Protein Regions

MAFFT

Few DNA sequences

PRANK (Very accurate)

Unhappy with initial alignment

Change parameters

Use different (more accurate) tool

Alignment Tool Selection








•One of the most important stages in the analysis

•Involves deciding whether or not MSA ready for use in down-stream analyses

Examine Alignment








Adjusting the Initial Alignment

Exercise/Demo

•MSA analysis from start to finish


4. 4. Examine Examine alignmenalignment t

5. 5. Discard Discard SequencSequenceses



•Sequences/Regions important for the analysis?

•YES - Edit alignment manually to fix mis-alignment

•NO - Discard sequences/Ignore region

Mis-Aligned Regions

KK

Examine Alignment:What to look out for

Common Patterns - buried beta-strand

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Response regulator receiver domain

Common Patterns - amphiphilic partially-buried alpha-helices

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=response_reg&disp=str

Response regulator receiver domain

Common Patterns - amphiphilic partially-buried alpha-helices

http://tardis.nibio.go.jp/cgi-bin/homstrad/showpage.cgi?family=subt&disp=str

subtilases

Common Patterns - amphiphilic beta strands

ubiquitin conjugating enzyme

Common Patterns - non-globular sequence

Different, more strongly biasesd (from equal representation of each of the 20 amino acids), sequence composition

Low sequence complexity THE HAPPY PROTEIN

TTT TTTTT TTTTTTTTHE THEHT THETHET

IncreasingComplexity

More variable sequence (more substitutions, more gaps than globular/structured regions)

Common Patterns - short linear motifs

[RK].L.{0,1}[FYLIVMP]LIG_CYCLIN_1Mostly occur in disordered protein regionsOften show greater conservation than neighbouring sequence

CautionAbsence of patterns that characterise different structrures (e.g. helices, motifs, etc.) doesn’t mean the structures are not present!Only means that either the sturcture is showing different evolutionary dynamics from those shown here, or insufficient/too-much diversity is present in alignment, obscuring tell-tale patterns of conservation

Clustalx helps spotting misaligned regions

BaliBase kinase2_ref5 as misaligned by clustalx

Quality->“Show Low-Scoring Segments”

NB other software is available to help in this way

Demo - edit/correct misalignments

•View alignment with ClustalX

•Quality -> Show low-scoring segments

•Edit using JalView

•http://://www.jalview.org/examples/editing.html

An aside - different tools give different

alignments

clustalx

probcons

An aside - different tools give different

alignments

BaliBase kinase2_ref5

mafft

prank






Missing regions•Sequences important/vital for the

analysis?

•YES

•Try and determine why sequence is missing

•If due to error, attempt to correct it

•NO

•Discard sequences







Apparently nonsense/unrelated regions•Sequences important/vital for the

analysis?

•YES

•Try and determine why sequence is different

•If due to error, attempt to correct it

•NO

•Discard sequences

With CLUSTALX “”Quality”->”Show Low-Scorring Segments” switched on







Lots of very similar sequences•Slow down calculations

•If there are no more divergent sequences in the alignment, there may not be enough information for successful analysis

•Remove most of them, leaving most “interesting” sequences e.g. keep human, remove chimp, guinea-pig

• Find alternative data-sources to supplement initial sequence set e.g. ENSEMBL for access to predicted peptides not yet relasesed elsewhere


Demo: Comparing Automatic Alignment Tools

•1aboA_ref1 from BaliBase

•http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE2/align_index.html

•http://www.ebi.ac.uk/mafft/

•http://www.ebi.ac.uk/Tools/muscle/

•http://www.ebi.ac.uk/t-coffee/

Difficult Alignments - low similarity between sequences

Possible Strategies: •Collect more sequences to reduce lowest pairwise similarity•Use structural information if possible

Typical Problems•Difficult to align throughout the sequence - more problems in local regions of lower conservation

BaliBase: 1aboA_ref1

Difficult Alignments - “orphan” sequences

Possible Strategies: •Collect more sequences to reduce lowest pairwise similarity•Use structural information if possible

Typical Problems•Orphan difficult to align correctly - typically it is forced to have similar conserved block structure as other sequences

Reference alignment

ClustalX alignment

BaliBase: 1aboA_ref2

Difficult Alignments - Several Subfamilies

Possible Strategies: •Align each subfamiliy on its own initially•Then align the subfamiliy alignments directly to each other•Add additional “bridging” sequences if you can find them

Difficult Alignments - Insertions/Extensions

clustalx alignment

BaliBase Reference alignment

Possible Strategies: •Initially align sequences without insertion•Then align (perhaps manually) sequence(s) with the insertion

Difficult Alignments - Repeats

BaliBase Reference alignment

Possible Strategies: •Determine repeat structure of your sequences•Decide which order

Typical Problems•Repeats are aligned “out of order”

Difficult Alignments - Rearrangement/Shuffling

BaliBase: cellulase_1_ref8

Possible Strategies: •Determine repeat/domain structure of your sequences before aligning them•Extract sequences of individual domains, and only include sequences of the same domain in the same alignment

Typical Problems•Most software prefers to minimise gaps - so completely misalign domains as seen left

Difficult Alignments - Short Linear Motifs

[DE][DES][DEGAS]F[SGAD][DEAP][LVIMFD]LIG_AP_GAE_1

[RK].L.{0,1}[FYLIVMP]LIG_CYCLIN_1

Possible Strategies: •Align regions suspected of containing motifs separately from the rest of the proteins•Use MAFFT (works best for regions of this kind)

Typical Problems•Most software finds it difficult to identify short regions of conservation in a region of low conservation

Summary

•It’s not “Is my alignment correct”

•Rather “Where is my alignment wrong”!

•The most important steps in building an alignment are:

• Specifying an appropriate question

•Critically examining the alignments you build

•Quality of the alignment influences quality of down-stream analyses

•Mis-alignment introduces systematic errors in downstream analyses

•There is no “best” alignment tool - which one you use depends on your question

•It can cost lots of time to make a good alignment!!

Sequence Alignments Wednesday 6th October 2010 Aidan Budd Structural and Computational Biology Unit...

Documents

Transcript of Sequence Alignments Wednesday 6th October 2010 Aidan Budd Structural and Computational Biology Unit...