
Pairwise and multiple sequence alignments

Alain Schenkel

Tuomas Hätinen

Bioinformatics group, Institute of Biotechnology, University of Helsinki

Protein Analysis Workshop 2006

Overview

Motivation – Why alignments?

Sequence comparison: dotplot, the alignment problem

Pairwise alignment algorithms: exact algorithms, heuristic algorithms, database searches

Multiple sequence alignments

Web tools: build alignments using the SRS or EBI servers; Blast at NCBI, EBI; PairsDB, …

Motivation

Proteins perform most of the functions required in biological systems:

Signaling (kinases, ...)

Enzymes (proteases, …)

Structural (collagen, elastin, …)

Immune system (antibodies, ...)

Storage and transport (hemoglobin, …)

…

Large amount of information available in current databanks.

Goal: infer the function of a newly discovered sequence by comparing it to annotated sequences.

Does it make sense?

All functional information is ultimately contained within the sequence.

Proteins are evolutionarily related:

Selective pressure acts on function, and thus on residues with a functional role (eg, active-site or key structural residues are conserved).

Modular nature of proteins.

Two sequences have the same structure if corresponding residues are similar enough at the physico-chemical level.

Application of sequence alignments

Determining function of newly discovered genetic or protein sequences.

Identification of functional patterns/domains.

Predicting structure of proteins.

Determining evolutionary relationships among genes, proteins, and entire species.

Aligning and comparing sequences, and searching databases for similar sequences: a cornerstone of bioinformatics!

Sequence Comparison

• Alignment

• Dotplots

• The pairwise alignment problem

Pairwise alignment

Pairwise alignment = identification of residue-residue correspondence.

?????     101 AGVIGTILLISYGIRRLIKKSPSDVKP 115
              ||:||.|||::|..|||.|:.|:||.|
GLP_HORSE  60 AGIIGIILLLAYVSRRLRKRPPADVPP  86

What criteria should we use to obtain biologically meaningful alignments?

For the alignment to be meaningful, the correspondence should reflect the functional, evolutionary, or other relationship (if any) between the sequences.

Some terminology

Identity:

percentage of pairs of identical residues between two aligned sequences.

Similarity:

percentage of pairs of similar residues between two aligned sequences.

One must define what similar means. Eg:

- substitutions observed in well-studied, evolutionarily related protein families,

- physico-chemical amino acid properties: hydropathy, size, …

Homology:

two sequences are homologous if and only if they have a common ancestor.

It is either yes or no.

Not to be confused with similarity!
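As a minimal sketch of the identity definition above, the following Python function computes the percentage of identical residue pairs between two aligned rows. The function name and the choice to skip gap columns are assumptions made for illustration; other conventions for the denominator exist. The two rows are the ones from the pairwise alignment shown earlier.

```python
def percent_identity(row1, row2):
    """Percentage of identical residue pairs between two aligned rows.

    Assumption: columns containing a gap ('-') are skipped; other
    conventions for the denominator exist.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    # Keep only the columns where both rows carry a residue.
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)


# The two aligned rows from the example alignment (no gaps in this one).
query = "AGVIGTILLISYGIRRLIKKSPSDVKP"
glp_horse = "AGIIGIILLLAYVSRRLRKRPPADVPP"
print(f"{percent_identity(query, glp_horse):.1f}% identity")
```

Similarity could be computed the same way once a definition of "similar" (eg, a set of residue groups) has been chosen.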

Dotplots

The simplest way of comparing two sequences: a dot is placed where both sequence elements are identical.

Gives an overview of all possible alignments.

Each diagonal indicates a possible (ungapped) alignment.

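Sequence 1 (horizontal): ATCTTCGAT. Sequence 2 (vertical): TACGAT.

    A T C T T C G A T
T   . ● . ● ● . . . ●
A   ● . . . . . . ● .
C   . . ● . . ● . . .
G   . . . . . . ● . .
A   ● . . . . . . ● .
T   . ● . ● ● . . . ●

One possible alignment:

ATCTTCGAT
   | ||||
---TACGAT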

Filtering Out the Noise in Dotplots

Dots may be scored according to a sliding window and a similarity cutoff to reduce noise:

The smaller the window, the more noise.

With large windows, the sensitivity for short sequences is reduced.

LETVHKKLYAGQYQNAGQFCDDIWLMLDNA
| |   ||  ||||   |  || |||  |
LSTIKRKLDTGQYQEPWQYVDDVWLMFNN

[Figure: dotplot of the two sequences above with window size = 5, similarity cutoff = 3]

Using Dotmatcher from SRS

SRS at EBI: http://srs.ebi.ac.uk/

SRS at EMBnet Austria: http://emb2.bcc.univie.ac.at:8080/srs/

... or any servers listed at http://downloads.lionbio.co.uk/publicsrs.html

Check out the SRS version (bottom of page): different versions index different databases, so the search results might differ depending on the version.




DotmatcherP (for proteins)

Enter sequences in FASTA format!

Advanced options: Change default window size, threshold score and scoring matrix

DotmatcherP

Comparing a protein with itself: identification of repeated protein domains.

Eg: Drosophila melanogaster SLIT

DotmatcherP

Comparing two different sequences: identification of conserved protein domains. Using the default parameters, window size = 10 and threshold = 23:

DotmatcherP

If we lower the window size and the threshold, we observe lots of noise. Eg, with window size = 5, threshold = 10:

Another Dotplot server: Dotlet

Has more options and provides more flexibility than Dotmatcher. Some very useful features:

If only one sequence is entered, dotlet automatically compares it against itself (finding repeats, low complexity regions, etc.).

Same application for both nucleic acid and protein sequences.

When comparing nucleic acid to nucleic acid, dotlet will reverse complement one of the sequences and perform a second comparison. This makes it possible, eg, to see structures like stem-loops.

Possible to compare a protein to a nucleic acid sequence. The nucleic acid sequence is translated in the three forward frames and pixels are set to the highest of the scores. This makes it possible, eg, to detect introns/exons, frameshifts, etc.

Dotlet

At http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html

Let´s find repeated domains in the following sequence:

>SLIT_DROME (P24014)
MAAPSRTTLMPPPFRLQLRLLILPILLLLRHDAVHAEPYSGGFGSSAVSSGGLGSVGIHIPGGGVGVITEARCPRVCSCTGLNVDCSHRGLTSVPRKISADVERLELQGNNLTVIYETDFQRLTKLRMLQLTDNQIHTIERNSFQDLVSLERLDISNNVITTVGRRVFKGAQSLRSLQLDNNQITCLDEHAFKGLVELEILTLNNNNLTSLPHNIFGGLGRLRALRLSDNPFACDCHLSWLSRFLRSATRLAPYTRCQSPSQLKGQNVADLHDQEFKCSGLTEHAPMECGAENSCPHPCRCADGIVDCREKSLTSVPVTLPDDTTDVRLEQNFITELPPKSFSSFRRLRRIDLSNNNISRIAHDALSGLKQLTTLVLYGNKIKDLPSGVFKGLGSLRLLLLNANEISCIRKDAFRDLHSLSLLSLYDNNIQSLANGTFDAMKSMKTVHLAKNPFICDCNLRWLADYLHKNPIETSGARCESPKRMHRRRIESLREEKFKCSWGELRMKLSGECRMDSDCPAMCHCEGTTVDCTGRRLKEIPRDIPLHTTELLLNDNELGRISSDGLFGRLPHLVKLELKRNQLTGIEPNAFEGASHIQELQLGENKIKEISNKMFLGLHQLKTLNLYDNQISCVMPGSFEHLNSLTSLNLASNPFNCNCHLAWFAECVRKKSLNGGAARCGAPSKVRDVQIKDLPHSEFKCSSENSEGCLGDGYCPPSCTCTGTVVACSRNQLKEIPRGIPAETSELYLESNEIEQIHYERIRHLRSLTRLDLSNNQITILSNYTFANLTKLSTLIISYNKLQCLQRHALSGLNNLRVVSLHGNRISMLPEGSFEDLKSLTHIALGSNPLYCDCGLKWFSDWIKLDYVEPGIARCAEPEQMKDKLILSTPSSSFVCRGRVRNDILAKCNACFEQPCQNQAQCVALPQREYQCLCQPGYHGKHCEFMIDACYGNPCRNNATCTVLEEGRFSCQCAPGYTGARCETNIDDCLGEIKCQNNATCIDGVESYKCECQPGFSGEFCDTKIQFCSPEFNPCANGAKCMDHFTHYSCDCQAGFHGTNCTDNIDDCQNHMCQNGGTCVDGINDYQCRCPDDYTGKYCEGHNMISMMYPQTSPCQNHECKHGVCFQPNAQGSDYLCRCHPGYTGKWCEYLTSISFVHNNSFVELEPLRTRPEANVTIVFSSAEQNGILMYDGQDAHLAVELFNGRIRVSYDVGNHPVSTMYSFEMVADGKYHAVELLAIKKNFTLRVDRGLARSIINEGSNDYLKLTTPMFLGGLPVDPAQQAYKNWQIRNLTSFKGCMKEVWINHKLVDFGNAQRQQKITPGCALLEGEQQEEEDDEQDFMDETPHIKEEPVDPCLENKCRRGSRCVPNSNARDGYQCKCKHGQRGRYCDQGEGSTEPPTVTAASTCRKEQVREYYTENDCRSRQPLKYAKCVGGCGNQCCAAKIVRRRKVRMVCSNNRKYIKNLDIVRKCGCTKKCY









http://ekhidna.biocenter.helsinki.fi/how/week1/day2/data/dotseq.html

1. Enter sequence

2. Enter name for sequence (optional)

3. Repeat for second sequence (optional)

4. Select scoring matrix, window size and zoom

5. Click "compute"!

Each pixel corresponds to a residue in the horizontal sequence and to a residue in the vertical sequence

The pixel's color depends on how similar the two sequences are around these two positions

Possible to scroll the dotplot here

Possible to scroll the alignment here

Residues that match well in the alignment are coloured blue

Tuning of grayscale in order to make background noise disappear

Dotlet reverse complements one of the sequences, so stem-loops can be detected

Dotplot - Summary

Comparing a sequence with itself can be used to identify:

Repeated domains,

Regions of low complexity (eg, …GYCAAAAAAAAALK…).

Comparing two protein sequences can be used to identify:

Local regions of similarity,

Conserved protein domains.

Dotplot - Summary

Good: visual detection of feature/similarity,

exploring the sequence organisation.

Bad: resolving regions of low similarity,

does not provide an alignment (no insertions/deletions).

To obtain an alignment, we need a method for lining up the diagonals in a dotplot.

    G A T C T A
G   1
A     2
T       3
C         4
A             5

GATCTA
GATC_A

The Pairwise Alignment Problem

Line up the diagonals using edit operations:

substitution (mutation)

gap or indel (insertion/deletion)

seq1 IGTILLISYGIRRLIKKSPSDVKP----LPSPDTDVP
     || |||  |  ||| |  | || |     || |   |
seq2 IGIILLLAYVSRRLRKRPPADVPPPASTVPSADAPPP

[Figure labels: substitution, deletion, insertion, gap; dotplot axes: sequence 1, sequence 2]

But there are many ways to align 2 sequences, so we need to score alignments to decide which is the best.

Scoring the Edit Operations

For example:

identical: +10 (it´s good)

substitution: +2 for S-A, -1 for K-P, …

gap: -3

PSDVKP--P
| || |  |
PADVPPPAP

Score: +50 + 2 - 1 + 2*(-3) = 45
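The score above can be reproduced with a short Python sketch. The function name and the dictionary layout are illustrative assumptions; only the two substitution scores given on the slide are filled in, so any other residue pair raises a `KeyError` rather than being guessed.

```python
IDENTITY_SCORE = 10
GAP_SCORE = -3
# Only the two substitution scores stated in the example are known here.
SUBSTITUTION_SCORES = {
    frozenset("SA"): 2,
    frozenset("KP"): -1,
}


def score_alignment(row1, row2):
    """Sum the edit-operation scores column by column."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP_SCORE            # one gap position
        elif a == b:
            total += IDENTITY_SCORE       # identical residues
        else:
            total += SUBSTITUTION_SCORES[frozenset((a, b))]
    return total


print(score_alignment("PSDVKP--P", "PADVPPPAP"))
```

Five identities (+50), one S-A pair (+2), one K-P pair (-1) and two gap positions (-6) add up to the 45 computed on the slide.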

Choosing an appropriate scoring scheme is where biological information is introduced (eg, reward the evolutionarily most likely alignment).

Standard notation:

| for identical

: for very similar (eg, size and hydropathy)

. for somewhat similar (eg, size or hydropathy)

Gap penalty

Few long gaps are better than many small gaps.

Different scores for:

gap opening, eg: -5

gap extension, eg: L*(-1), with L = length of extension

The gap opening penalty is larger than the gap extension penalty.

Few long gaps:

TIL--------LISYGIRRLIK
TILKKSPSDVKLISYGIRRLIK

Many small gaps:

IG-TI--LYDL-SYYAG---IR
IGKIIPRL--LVAY--VLIGSR

[Figure labels: gap opening, gap extension; gap score = -5 -6]
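The effect of separate opening and extension penalties can be illustrated with a small Python sketch. It assumes that the first position of a gap pays the opening penalty (-5) and each further position pays the extension penalty (-1), so that a gap of length 1 costs -5 and a gap of length 2 costs -6. That reading of "length of extension" is an assumption, and the function names are hypothetical.

```python
import re


def gap_score(row, opening=-5, extension=-1):
    """Affine gap score of one aligned row.

    Assumption: the first gap position pays `opening`, every additional
    position in the same gap pays `extension`.
    """
    total = 0
    for run in re.findall(r"-+", row):    # each maximal run of '-' is one gap
        total += opening + extension * (len(run) - 1)
    return total


def alignment_gap_score(row1, row2):
    """Total gap score over both rows of a pairwise alignment."""
    return gap_score(row1) + gap_score(row2)


# One long gap of 8 positions.
few_long = ("TIL" + "-" * 8 + "LISYGIRRLIK", "TILKKSPSDVKLISYGIRRLIK")
# Six short gaps spread over both rows.
many_small = ("IG-TI--LYDL-SYYAG---IR", "IGKIIPRL--LVAY--VLIGSR")

print(alignment_gap_score(*few_long))
print(alignment_gap_score(*many_small))
```

Under these example penalties the single long gap is penalised far less than the six short ones, which is the behaviour the slide asks for.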

Gap penalty

Can also consider a special penalty for gaps at the beginning/end of the alignment (eg, zero penalty).

Need to be careful in adjusting the gap score to the substitution score:

too strong a penalty gives no gaps,

too weak a penalty gives too many gaps.

Insertions and deletions have been found to occur in nature at a significantly lower frequency than substitutions.

Residue Substitution

A substitution score for each aa pair gives a substitution matrix.

Most used: based on evolutionary relationships.

Two types: PAM series, BLOSUM series.

PAM (Point Accepted Mutation, also expanded as Percent Accepted Mutation)

PAM1: built from mutations observed in carefully selected sets of closely related proteins (1572 observed changes in 71 families). (1978)

Idea: each observed substitution is the result of 1 mutation (not many).

PAMn: iterate PAM1 n times to obtain substitution rates between more divergent sequences.
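"Iterating PAM1 n times" amounts to raising a mutation probability matrix to the n-th power. The Python sketch below shows this on a made-up 3-letter alphabet; the numbers in `TOY_PAM1` are hypothetical and are not Dayhoff's values, and the published PAM scoring matrices are additionally converted to log-odds scores, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, n):
    """Multiply `matrix` by itself n times (n = 0 gives the identity)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Hypothetical 1-step mutation probabilities for a toy 3-letter alphabet:
# row = original residue, column = replacement, each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print(["%.3f" % p for p in row])
```

The diagonal entries shrink as n grows, reflecting that fewer residues are expected to remain unchanged between more divergent sequences.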

Which PAM to use at a given sequence identity:

PAM:         0    30   80   110  200  250
% identity:  100  75   60   50   25   20

[Figure: PAM250 matrix]

BLOSUM (BLOcks SUbstitution Matrix)

Based on a larger data set than PAM.

More recent than PAM. (1992)

Different approach than PAM:

not based on an explicit evolutionary model,

built from aa substitutions observed in a set of conserved aa patterns called blocks.

BLOSUMn: from blocks in which sequences at least n% identical are clustered together.

BLOSUM62: empirically shown to be among the best at detecting weak similarity.

[Figure: BLOSUM62 matrix]

Tips for using substitution matrices

Generally, BLOSUM matrices perform better than PAM for local similarity searches.

For database searches, the most commonly used matrix is BLOSUM62.

When comparing closely related proteins, one should use lower PAM or higher BLOSUM matrices; for distantly related proteins, higher PAM or lower BLOSUM matrices.

Caution: substitution matrices are statistical in nature. In a given alignment, a substitution may or may not correspond to an actual mutation.

Less divergent                    More divergent
BLOSUM 80       BLOSUM 62        BLOSUM 45
PAM 1           PAM 120          PAM 250

Pairwise alignment algorithms

• Exact algorithms

• Heuristic algorithms

• Database scanning

Pairwise Alignment Algorithms

Given a scoring scheme, an alignment algorithm tries to find the best

alignment between 2 sequences according to that scheme.

Exact algorithms: guaranteed to return an alignment with the best possible score.

Heuristic algorithms: not guaranteed to return the best alignment, but they are quicker (and hopefully still return good alignments).

Two types of alignment:

Global: forced over the entire length of the 2 sequences.

Local: between substrings of the 2 sequences.

Global vs Local Alignment

Global alignments:

are sensitive to gap penalties,

do not take into account the modular nature of proteins,

can be used to compare 2 proteins with the same function (in, eg, human/mouse).

Local alignments:

are sensitive to the modular nature of proteins. They can be used to:

look for conserved domains or motifs in 2 proteins,

search for local similarities in large sequences,

database searches,

scanning an entire genome with a short sequence.

Exact Algorithms: Dynamic Programming

Exhaustive search among all possible alignments is not possible (eg, for 2 sequences of 100 and 95 residues: 55 million alignments with 5 gaps).

Problem solved by dynamic programming:

1. initialize top row and left column,

2. compute best local scores iteratively,

3. keep track of where each best local score comes from,

4. trace back to obtain the best alignments.

Several best solutions may exist: an alignment reported to you may be one among a number of possibilities.
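The four steps above can be sketched as a small global alignment routine in Python. It is a minimal illustration, not the implementation used by any of the servers: the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions, and instead of storing pointers the traceback re-derives where each score came from. When several best alignments exist, it reports only one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # Step 1: initialise top row and left column with accumulated gap costs.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    # Step 2: fill each cell with the best of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Steps 3 and 4: trace back from the bottom-right corner.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("GATCTA", "GATCA")
print(best)
print(top)
print(bottom)
```

The table has (n+1) x (m+1) cells, so the work grows with the product of the two sequence lengths rather than with the number of possible alignments.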

How can we find the best alignment between 2 sequences?

[Figure: dynamic programming matrix with the best global score marked. The example is from www.pasteur.fr]

Example of 2 best solutions:

ATTCTCTGA-TAC--TGA

ATTCTCTGA-TA--CTGA

Global Alignment Servers (Exact Algorithm)

Server at SRS: NeedleP (http://srs.ebi.ac.uk/, under Tools).

Server at EBI: EMBOSS-Align.

These servers use the Needleman-Wunsch algorithm (1970).

Let´s submit the following sequences to http://www.ebi.ac.uk/emboss/align/index.html:

>uniprot|P35858|ALS_HUMAN Insulin-like growth factor-binding protein complex
MALRKGGLALALLLLSWVALGPRSLEGADPGTPGEAEGPACPAACVCSYDDDADELSVFCSSRNLTRLPDGVPGGTQALWLDGNNLSSVPPAAFQNLSSLGFLNLQGGQLGSLEPQALLGLENLCHLHLERNQLRSLALGTFAHTPALASLGLSNNRLSRLEDGLFEGLGSLWDLNLGWNSLAVLPDAAFRGLGSLRELVLAGNRLAYLQPALFSGLAELRELDLSRNALRAIKANVFVQLPRLQKLYLDRNLIAAVAPGAFLGLKALRWLDLSHNRVAGLLEDTFPGLLGLRVLRLSHNAIASLRPRTFKDLHFLEELQLGHNRIRQLAERSFEGLGQLEVLTLDHNQLQEVKAGAFLGLTNVAVMNLSGNCLRNLPEQVFRGLGKLHSLHLEGSCLGRIRPHTFTGLSGLRRLFLKDNGLVGIEEQSLWGLAELLELDLTSNQLTHLPHRLFQGLGKLEYLLLSRNRLAELPADALGPLQRAFWLDVSHNRLEALPNSLLAPLGRLRYLSLRNNSLRTFTPQPPGLERLWLEGNPWDCGCPLKALRDFALQNPSAVPRFVQAICEGDDCQPPAYTYNNITCASPPEVVGLDLRDLSEAHFAPC

>uniprot|O08770|GPV_RAT Platelet glycoprotein V precursor (GPV) (CD42D).
MLRSVLLSAVLSLVGAQPFPCPKTCKCVVRDAVQCSGGSVAHIAELGLPTNLTHILLFRMDRGVLQSHSFSGMTVLQRLMLSDSHISAIDPGTFNDLVKLKTLRLTRNKISHLPRAILDKMVLLEQLFLDHNALRDLDQNLFQKLLNLRDLCLNQNQLSFLPANLFSSLGKLKVLDLSRNNLTHLPQGLLGAQIKLEKLLLYSNRLMSLDSGLLANLGALTELRLERNHLRSIAPGAFDSLGNLSTLTLSGNLLESLPPALFLHVSWLTRLTLFENPLEELPEVLFGEMAGLRELWLNGTHLRTLPAAAFRNLSGLQTLGLTRNPLLSALPPGMFHGLTELRVLAVHTNALEELPEDALRGLGRLRQVSLRHNRLRALPRTLFRNLSSLVTVQLEHNQLKTLPGDVFAALPQLTRVLLGHNPWLCDCGLWPFLQWLRHHLELLGRDEPPQCNGPESRASLTFWELLQGDQWCPSSRGLPPDPPTENALKAPDPTQRPNSSQSWAWVQLVARGESPDNRFYWNLYILLLIAQATIAGFIVFAMIKIGQLFRTLIREELLFEAMGKSSN



choose scoring matrix

gap penalties

NeedleP at SRS

options for gap penalties

choose scoring matrix (optional)

Local Alignment Servers (Exact Algorithm)

Server at EMBnet: LALIGN, uses the SIM algorithm (1991). http://www.ch.embnet.org/software/LALIGN_form.html

Servers at SRS (http://srs.ebi.ac.uk/, under Tools):

WaterP. Uses the Smith-Waterman algorithm (1981).

MatcherP. Can be used to find several local alignments between 2 sequences. Slower than WaterP.

Server at EBI (Smith-Waterman algorithm): http://www.ebi.ac.uk/emboss/align/index.html
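For comparison with the global case, here is a minimal Python sketch of Smith-Waterman local alignment. The scoring values (match +2, mismatch -1, linear gap -2) are arbitrary assumptions and the function name is illustrative; the servers above use substitution matrices and separate opening and extension penalties. The differences from global alignment are that scores are never allowed to drop below zero and that the traceback starts at the highest cell and stops at the first zero.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the 0.
            h[i][j] = max(0,
                          h[i - 1][j - 1] + pair,
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# Two made-up sequences sharing one embedded segment.
print(smith_waterman("XXHEAGAWGHEEYY", "PPPHEAGAWGHEEQQ"))
```

Only the shared segment is reported; the unrelated flanking residues are left out, which is what makes local alignment suitable for modular proteins.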












Heuristic Algorithms

Motivations:

Exact algorithms are exhaustive but computationally expensive.

Exact algorithms are impractical for comparing a query sequence to millions of other sequences in a database (database scanning),

and so database scanning requires faster alignment algorithms (at the cost of optimality).

Heuristic Algorithms

Probing a database with a query is similar to aligning a query with a very long sequence: we need fast local alignment methods.

Main idea:

Use dynamic programming, but limited to (sub-)sequences which are likely to produce interesting alignments with the query.

Heuristic part of the algorithm: eliminate uninteresting sequences from the search (need to make a guess).

Algorithms:

FASTA: Lipman-Pearson (1985).

BLAST (Basic Local Alignment Search Tool): Altschul et al. (1990).

BLAST Overview

Many versions for different query-database cases:

blastp: protein query - protein database

blastn: nucleotide query - nucleotide database

blastx: nucleotide query (translated to protein) - protein database

tblastn: protein query - nucleotide database (translated to protein)

tblastx: nucleotide query (translated to protein) - nucleotide database (translated to protein)

Comes in many flavours. Fast and reliable. Easy to use.

BLAST Overview

BLAST computes “an alignment”, not necessarily the exact optimal alignment.

Given the query and the database (long sequence):

Find all words of length k (typical: k=4) that match the query with a high enough score.

Look for subsequences in the database that contain these words.

Extend subsequences to see if match score can be increased.

Compute total score when no more extensions are possible.

Rank the alignments.

How should the different matched (sub-)sequences be ranked?
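The word-lookup and extension steps can be sketched in Python as follows. This is a deliberately simplified illustration, not BLAST itself: it seeds on exact words only and extends while residues are identical, whereas BLAST also accepts similar words scored with a substitution matrix and tolerates mismatches during extension. The function names and the ranking by length are assumptions made for the example.

```python
from collections import defaultdict


def index_words(subject, k):
    """Map every word of length k in the subject to its start positions."""
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)
    return index


def seed_and_extend(query, subject, k=4):
    """Return ungapped exact matches as (query_start, subject_start, length)."""
    index = index_words(subject, k)
    hits = set()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            # Extend the seed to the left while residues are identical.
            start_q, start_s = i, j
            while (start_q > 0 and start_s > 0
                   and query[start_q - 1] == subject[start_s - 1]):
                start_q -= 1
                start_s -= 1
            # Extend the seed to the right in the same way.
            end_q, end_s = i + k, j + k
            while (end_q < len(query) and end_s < len(subject)
                   and query[end_q] == subject[end_s]):
                end_q += 1
                end_s += 1
            hits.add((start_q, start_s, end_q - start_q))
    # Rank the extended matches, longest first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


print(seed_and_extend("XXHEAGAWGHEEYY", "PPPHEAGAWGHEEQQ"))
```

Because the subject is indexed once, each query word is looked up directly instead of being compared against every position, which is where the speed-up over exact algorithms comes from.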

Significance of Alignments

Raw scores cannot be used to rank alignments: a bad but long alignment may have a higher score than a good but short alignment.

We need a normalized scoring scheme that allows us to compare alignments and evaluate their biological significance.

Idea:

Probe the database with random sequences.

This gives a distribution of scores (it follows the extreme-value distribution).

Establish a threshold for significance.

Extreme-Value Distribution

[Figure: score distribution for random sequences (x-axis: score), with the score of our query marked. The tail of the distribution beyond that score is the P-value, the probability that the score of our query is no better than random.]

Difficulty: finding a significance threshold.

Quantifying the Significance of Alignments

For an alignment with raw score S:

P-value:

The probability of an alignment occurring with score S or better if the aligned-against sequence is random.

The lower the P-value, the more significant the alignment.

E-value:

Expected number of alignments with scores equivalent to or better than S to occur by chance only.

The lower the E-value, the more significant the alignment.

E-value = P-value * size of database.
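The idea of probing with random sequences can be sketched in Python by shuffling one sequence many times and counting how often the shuffled versions score at least as well as the real one. This is an empirical estimate in the spirit of shuffling tools such as PRSS, not the extreme-value calculation used by BLAST; the scoring values, the number of shuffles, the add-one correction and all function names are assumptions made for the example.

```python
import random


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (score only, two rows kept in memory)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            value = max(0, previous[j - 1] + pair,
                        previous[j] + gap, current[j - 1] + gap)
            current.append(value)
            best = max(best, value)
        previous = current
    return best


def empirical_p_value(query, subject, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)             # fixed seed for reproducibility
    observed = sw_score(query, subject)
    letters = list(subject)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)              # same composition, random order
        if sw_score(query, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


def e_value(p_value, database_size):
    """E-value = P-value * size of database, as stated above."""
    return p_value * database_size


p = empirical_p_value("HEAGAWGHEEHEAGAWGHEE", "HEAGAWGHEEHEAGAWGHEE")
print(p, e_value(p, 1000))
```

The estimate cannot go below 1 / (n_shuffles + 1), so very small P-values require either many more shuffles or a fitted distribution.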

Rough Guide for P-values and E-values

P-value (reported by many programs): 0 ≤ P-value ≤ 1

E-value (reported by some programs, eg PSI-Blast): 0 ≤ E-value ≤ size of database

P ≤ 10^-100: exact match

10^-100 < P < 10^-50: sequences very nearly identical, eg alleles or SNPs

10^-50 < P < 10^-10: closely related sequences, homology certain

10^-5 < P < 10^-1: usually distant relatives

P > 10^-1: match probably insignificant

E ≤ 0.02: sequences probably homologous

0.02 ≤ E ≤ 1: homology can’t be ruled out

E > 1: this match would be obtained by chance

Heuristic Algorithms Servers

Pairwise alignment:

BLAST: http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi

Database screening:

FASTA: http://www.ebi.ac.uk/fasta33/, SRS, …

BLAST:

- SRS (at EBI or ...)

- http://www.ncbi.nlm.nih.gov/BLAST/

- http://www.ebi.ac.uk/blast/index.html

- http://www.ch.embnet.org/software/bBLAST.html

- http://www.ch.embnet.org/software/aBLAST.html

Evaluating the significance of an alignment:

PRSS: http://www.ch.embnet.org/software/PRSS_form.html


http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+packageInfo+-id+6KHEG1RNU4o+-package+BLAST


BLAST Servers

Blast has many options (choice of database, substitution matrix, …), in a basic or an advanced section.

BLAST interfaces are different:

NCBI: excellent help pages and tutorial

SRS: easy multiple alignment access

EMBnet: simple text + graphical output.

Remark: there is a server with a powerful implementation of Smith-Waterman for database screening: http://www.ebi.ac.uk/MPsrch/. It runs about 50 times slower, but is more sensitive and returns fewer false positives than Blast.


BLAST at NCBI

Let´s submit the following query sequence at http://www.ncbi.nlm.nih.gov/BLAST/:

>1IGR:A INSULIN-LIKE GROWTH FACTOR RECEPTOR
EICGPGIDIRNDYQQLKRLENCTVIEGYLHILLISKAEDYRSYRFPKLTVITEYSLGDLFPNLTVIRGWKLFYNYALVIFEMTNLKDIGLYNLRNITRGAIRIEKNADLCYLSTVDWSLILDAVSNNYIVGNKPPKECGDLCPGTMEEKPMCEKTTINNEYNYRCWTTNRCQKMCPSTCGKRACTENNECCHPECLGSCSAPDNDTACVACRHYYYAGVCVPACPPNTYRFEGWRCVDRDFCANILSAESSDSEGFVIHDGECMQECPSGFIRNGSQSMYCIPCEGPCPKVCEEEKKTKTIDSVTSAQMLQGCTIFKGNLLINIRRGNNIASELENFMGLIEVVTGYVKIRHSHALVSLSFLKNLRLILGEEQLEGNYSFYVLDNQNLQQLWDWDHRNLTIKAGKMYFAFNPKLCVSEIYRMEEVTGTKGRQSKGDINTRNNGERASCESDVDDDDKEQKLISEEDLN



We paste our sequence here and launch the search

substitution matrix

Conserved domains

Graphical overview of hits – coloured according to similarity

Hits

Alignment for each of the hits

E value: Expectation value.

Expected # of alignments with scores equivalent to or better than S to occur by chance. The lower the E value, the more significant the score.

Bit score: S’

The value S’ is derived from the raw alignment score S, but statistical properties of the scoring system have been taken into account. Because bit scores are normalised w.r.t. scoring system, they can be used to compare alignment scores from different searches.

NCBI Blast output help: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Blast_output.html


BLAST at SRS EBI

SRS EBI: View results using BlastAlignment

Alignments are displayed

BLAST at EMBnet

Graphic output on/off

BLAST Variants

PHI-Blast: Pattern-Hit Initiated Blast:

Searches for proteins that contain a specified pattern AND are similar to the query sequence in the neighborhood of the pattern.

Patterns must follow the syntax of PROSITE.

PSI-Blast: Position-Specific Iterated Blast:

More sensitive, ie better at detecting distant relationships, than BLAST.

Computes position-specific scoring matrices (PSSMs) to score matches between query and database sequences. (Blast uses precomputed substitution matrices, eg BLOSUM62.)

PSI-BLAST

Repeatedly searches the target databases.

At each round: compute a multiple alignment of the high-scoring sequences to generate a new PSSM for the next round of searching.

Iterates until no new sequences are found (or until a maximal number of iterations is reached).

Rules of thumb for pairwise alignment

Use server defaults in the absence of any other information.

Adjust the substitution matrix to the expected divergence of the 2 sequences. Use BLOSUM62 if there is no a priori information.

For distantly related sequences, use PSI-Blast rather than BLAST.

There are many ways of aligning 2 sequences.

A returned alignment is not the absolute truth.

Inspect the alignment from the biologist´s perspective.

PairsDB

A database of pre-computed Blast and Psi-Blast alignments.

Continually updated.

Source databases: Uniprot, PDB, EMBL, Worm database, ENSEMBL, NCBI genomes, RefSeq.

PairsDB thus provides a quick and easy way to explore protein sequences and their relationships.

PairsDB

NRDB90: non-redundant database at 90%, etc.

[Figure: sequence databases (Uniprot, PDB, ...); remove redundancy at 90% to obtain NRDB90, then NRDB80, NRDB70, NRDB40, NRDB30, ...; BlastP all-on-all and Psi-Blast all-on-all each produce a set of alignments.]

PairsDB: http://www.csc.fi/cgi-bin/pairsdb/pairsdb.cgi

PairsDB

Multiple sequence alignment

• Motivation

• Algorithms overview

• Clustalw

• Clustal-X

Multiple Sequence Alignment

Given a set of N ≥ 3 sequences, we want to find the best way of aligning these sequences simultaneously.

A multiple alignment does not reflect the level of pairwise similarity between pairs of sequences.

-----------------NC------------------------------- 142
-----------------ACF------------------------------ 141
---------------IRGCRL----------------------------- 147
---------------MAECWSHGSNSVFPF-------------------- 158
VTPSVKPSHASQEVKLHDSTSYAQNPFLSLLGKPIVPAQAPIKPQSKPPS 792
------------------CEAQ---------------------------- 142
----------------VACNLRSLSPVRSPRGFLTG-------------- 179

Motivations

Pairwise sequence alignment is easy with sufficiently closely related sequences.

Below a certain level of identity, sequence alignment may become uncertain: the twilight zone for aa sequences is ~ 30%.

In or below the twilight zone it is good to make use of additional information, eg, from evolution.

Motivations

A multiple alignment of diverse sequences is more informative than a pairwise alignment: residues conserved over a longer period of time are under stronger evolutionary constraints.

Reasons for aligning sets of sequences:

organize data to reflect sequence homology,

estimate evolutionary distance,

infer phylogenetic trees from homologous sites,

highlight variable and conserved sites/regions,

determine substitution frequencies,

pattern/domain identification,

helpful for protein structure prediction.

An alignment of 8 fragments of immunoglobulin:

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

Alignment highlights:

Conserved residues: one of the cysteines forming the disulphide bridges, and the tryptophan.

Conserved regions (e.g. Q.PG).

Patterns (e.g. dominance of hydrophobic residues at positions 1 and 3). The alternating hydrophobicity pattern is typical for the surface beta-strand at the beginning of each fragment.

Consensus Sequence

Simplest Form:

A single sequence which represents the most common amino acid/base at each position:

Y D D G A V - E A L
Y D G G - - - E A L
F E G G I L V E A L
F D - G I L V Q A V
Y E G G A V V Q A L
-----------------------
Y D G G A/I V/L V E A L
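The simplest form of consensus can be computed column by column, as in the Python sketch below. The function name is illustrative, and two conventions are assumed: gaps are ignored when counting, and ties are reported with a slash in order of first appearance, as in the A/I and V/L columns above.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Most common residue per column; ties are joined with '/'."""
    result = []
    for column in zip(*rows):
        # Count residues in this column, ignoring gaps.
        counts = Counter(residue for residue in column if residue != gap)
        if not counts:
            result.append(gap)            # column made of gaps only
            continue
        top = max(counts.values())
        # Counter keeps insertion order, so ties appear in row order.
        result.append("/".join(r for r, c in counts.items() if c == top))
    return result


alignment = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(alignment)))
```

A plain majority vote discards how strongly each position is conserved; profiles and PSSMs keep the per-column counts instead.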

Multiple Sequence Alignments Algorithms

Multiple sequence alignment in practice uses heuristic methods: with dynamic programming, computational time quickly explodes as the number of sequences increases.

Different methods/algorithms:

Segment-based (DiAlign, T-Coffee…).

Iterative (HMMs, SAGA, DiAlign, PRRP, …).

Progressive (Clustalw, T-Coffee, PileUp, …).

ClustalW:

The Clustal method was first described by D.G. Higgins and P.M. Sharp (1988); the ClustalW version followed in 1994.

Can be used for nucleotide or amino acid sequences.

Clustalw Algorithm

Step 1: Calculate all pairwise alignments and calculate distances for all pairs of sequences.

    A  B  C  D  E
B   2
C   4  4
D   6  6  6
E   6  6  6  4
F   8  8  8  8  8

Step 2: Construct a guide tree joining the most similar sequences using Neighbour Joining.

[Figure: distance matrix (Step 1) and guide tree (Step 2)]

Clustalw Algorithm

Step 3: From the tree, assign a weight to each sequence: we want to down-weight nearly identical sequences and up-weight the most divergent ones.

Step 4: Align sequences, starting at the leaves of the guide tree: pairwise comparisons as well as comparison of a single sequence with a group of sequences (profile).

Clustalw Algorithm

Some features:

Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.

Reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary structure.

Insertions and deletions are more common in loop regions than in the core of the protein!

Clustalw

Clustalw is not optimal. There are known areas in which Clustalw performs badly, for example:

errors introduced early cannot be corrected by subsequent information,

alignments of sequences of differing lengths cause strange guide trees and unpredictable effects.

Also use other programs, slower but better depending on the situation:

T-Coffee: http://www.ch.embnet.org/software/TCoffee.html

DiAlign: http://dialign.gobics.de/

POA: http://www.bioinformatics.ucla.edu/poa/

SAGA

... and more at http://helix.nih.gov/apps/bioinfo/msa.html.

ClustalW Servers

Servers:

EBI: http://www.ebi.ac.uk/clustalw/

SRS: eg, http://srs.ebi.ac.uk/, under Tools, multiple alignments

EMBnet: http://www.ch.embnet.org/software/ClustalW.html

Let’s build a multiple alignment for the following sequences :

>query
MKNTLLKLGVCVSLLGITPFVSTISSVQAERTVEHKVIKNETGTISISQLNKNVWVHTELGYFSGEAVPSNGLVLNTSKGLVLVDSSWDDKLTKELIEMVEKKFKKRVTDVIITHAHADRIGGMKTLKERGIKAHSTALTAELAKKNGYEEPLGDLQSVTNLKFGNMKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSASSKDLGNVADAYVNEWSTSIENVLKRYGNINLVVPGHGEVGDRGLLLHTLDLLK
>gi|2984094
MGGFLFFFLLVLFSFSSEYPKHVKETLRKITDRIYGVFGVYEQVSYENRGFISNAYFYVADDGVLVVDALSTYKLGKELIESIRSVTNKPIRFLVVTHYHTDHFYGAKAFREVGAEVIAHEWAFDYISQPSSYNFFLARKKILKEHLEGTELTPPTITLTKNLNVYLQVGKEYKRFEVLHLCRAHTNGDIVVWIPDEKVLFSGDIVFDGRLPFLGSGNSRTWLVCLDEILKMKPRILLPGHGEALIGEKKIKEAVSWTRKYIKDLRETIRKLYEEGCDVECVRERINEELIKIDPSYAQVPVFFNVNPVNAYYVYFEIENEILMGE
>gi|115023|sp|P10425|
MKKNTLLKVGLCVSLLGTTQFVSTISSVQASQKVEQIVIKNETGTISISQLNKNVWVHTELGYFNGEAVPSNGLVLNTSKGLVLVDSSWDNKLTKELIEMVEKKFQKRVTDVIITHAHADRIGGITALKERGIKAHSTALTAELAKKSGYEEPLGDLQTVTNLKFGNTKVETFYPGKGHTEDNIVVWLPQYQILAGGCLVKSAEAKNLGNVADAYVNEWSTSIENMLKRYRNINLVVPGHGKVGDKGLLLHTLDLLK
>gi|115030|sp|P25910|
MKTVFILISMLFPVAVMAQKSVKISDDISITQLSDKVYTYVSLAEIEGWGMVPSNGMIVINNHQAALLDTPINDAQTEMLVNWVTDSLHAKVTTFIPNHWHGDCIGGLGYLQRKGVQSYANQMTIDLAKEKGLPVPEHGFTDSLTVSLDGMPLQCYYLGGGHATDNIVVWLPTENILFGGCMLKDNQATSIGNISDADVTAWPKTLDKVKAKFPSARYVVPGHGDYGGTELIEHTKQIVNQYIESTSKP
>gi|282554|pir||S25844
MTVEVREVAEGVYAYEQAPGGWCVSNAGIVVGGDGALVVDTLSTIPRARRLAEWVDKLAAGPGRTVVNTHFHGDHAFGNQVFAPGTRIIAHEDMRSAMVTTGLALTGLWPRVDWGEIELRPPNVTFRDRLTLHVGERQVELICVGPAHTDHDVVVWLPEERVLFAGDVVMSGVTPFALFGSVAGTLAALDRLAELEPEVVVGGHGPVAGPEVIDANRDYLRWVQRLAADAVDRRLTPLQAARRADLGAFAGLLDAERLVANLHRAHEELLGGHVRDAMEIFAELVAYNGGQLPTCLA


ClustalW at EBI

Many options: CPU mode,

full/fast alignment,

window length in fast mode,

…

gap penalties.

ClustalW at EBI

Automatic display of:

Score table

Alignment (optional colouring)

Guide tree

Link to Jalview alignment editor! (More on Jalview at end of week.)

Running Clustalw from SRS (Columbia University)

Running Clustalw from SRS

View results using: *complete entries*

View results using: ClustalwAli

Clustal-X

Windows or Linux interface for the ClustalW multiple sequence alignment program.

Integrated environment for performing multiple sequence and profile alignments and analyzing the results.

A versatile coloring scheme:

allows highlighting of conserved features in the alignment,

fully customizable.

Does not have gap penalty options as versatile as those of the servers.

Start with sequences in FASTA format (or an existing alignment in Clustal format).

[Do Alignment] on the alignment menu.

Clustal-X

Using Clustal-X

Clustal X input: can read FASTA format (and 6 others)

Output: alignment (coloured) and consensus line:

* indicates a single, fully conserved residue

: indicates that one of the following ‘strong’ groups is fully conserved:

STA, NEQK, NHQK, NDEQ, QHRK, MILV, MILF, HY, FYW

. indicates that one of the following ‘weaker’ groups is fully conserved:

CSA, ATV, SAG, STNK, STPA, SGND, SNDEQK, NDEQHK, NEQHRK, FVLIM, HFY
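The three consensus symbols can be derived from the two group lists above with a short Python sketch. The function name and the treatment of gap-containing columns (no symbol) are assumptions made for illustration; the group lists are copied from the slide.

```python
STRONG_GROUPS = "STA NEQK NHQK NDEQ QHRK MILV MILF HY FYW".split()
WEAK_GROUPS = ("CSA ATV SAG STNK STPA SGND SNDEQK NDEQHK "
               "NEQHRK FVLIM HFY").split()


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column)
    if "-" in residues:
        return " "                        # assumption: gaps get no symbol
    if len(residues) == 1:
        return "*"                        # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"                        # all residues in one strong group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."                        # all residues in one weaker group
    return " "


def consensus_line(rows):
    """Build the symbol line under a multiple alignment."""
    return "".join(conservation_symbol(column) for column in zip(*rows))


print(repr(consensus_line(["ASCW", "ATSK", "AACW"])))
```

Checking the strong groups before the weaker ones matters, since a column such as S/T/A fits both the strong group STA and the weaker group STPA.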

Residues are coloured by type by default, but colouring scheme is customizable.

Source: ClustalX help search on google: => http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html




Using Clustal-X with JalView

Proteins: 1MBD (myoglobin), 4HHB-B (hemoglobin), 1ECD (hemoglobin)

• Feed sequences to Clustal-X to compute alignments, trees, ...

• Feed an alignment to JalView to edit the alignment.

The most hydrophobic residues according to this table are coloured red and the most hydrophilic ones are coloured blue. The colours of the in between residues are varying shades of purple according to whereabouts they are on the scale.

A note on the example

It is atypical: it uses only three sequences. One should use more in order to extract reliable information.

It illustrates a common mistake: it uses too closely related sequences. One should use sequences as divergent and diverse as possible in order to extract relevant information.

References

Tutorials:

Blast: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html

Clustal-X: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html

Sequence analysis:

D.W. Mount: Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press, 2004 (2nd edition)…
