
Alignments and alignment reliability

The first critical step in sequence analysis – the know how

Eyal Privman and Osnat Penn

Tel Aviv University

COST Training School

Rehovot, 2010

What are alignments good for?

To compare sequences:
• Find homology
• Similar sequence, similar function

To learn about sequence evolution:
• Mismatch = point mutation
• Gap = indel (insertion or deletion)
• Reconstruct phylogenetic tree
• Infer selection forces, e.g., detecting positive selection

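Sequence evolution

[Figure: a toy example of sequences evolving along a tree, with time points 30 MYA, 5 MYA, and Today, and the present-day species Human, Chimp, and Mouse.]

Sequences shown at successive time points:

ATGAAATAA
ATGTTTTAA   ATGCCCAAATAA
ATGTTTTAA   ATGTTT   ATGCCCAAATAA

Alignment of the present-day sequences:

ATG---TTTTAA
ATG---TTT---
ATGCCCAAATAA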

Alignment and phylogeny are mutually dependent

[Figure: diagram linking unaligned sequences, sequence alignment, the MSA, phylogeny reconstruction, and a tree (scale bar 0.4), with the label "Inaccurate tree building".]

Alignment and phylogeny are both challenging

• 25% of residues are aligned incorrectly (based on BAliBASE: a large representative set of proteins)
• 5% of tree branches are wrong (based on simulations of 100 protein sequences)

Making an alignment

For 2 sequences: use exact methods.

For more sequences:
• Exact methods are not feasible (too slow)
• We use heuristic methods
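The slides do not name a specific exact method. As an illustration only, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch algorithm), which is one standard exact method for two sequences. The scoring values (match, mismatch, gap) are arbitrary assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment.

    The scoring scheme is an illustrative assumption, not taken from the slides.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = needleman_wunsch("ATGTTTTAA", "ATGCCCAAATAA")
print(best)
print(row_a)
print(row_b)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths. Extending the same idea to many sequences multiplies the table dimensions, which is why the slides turn to heuristics.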

Progressive alignment

First step: compute pairwise distances

For the five sequences A, B, C, D, E: compute the pairwise alignments for all against all (10 pairwise alignments). The similarities are converted to distances and stored in a table:

      A    B    C    D    E
A
B     8
C    15   17
D    16   14   10
E    32   31   31   32
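A minimal sketch of this first step, under stated assumptions: the distance used here is the fraction of differing positions in a pairwise alignment (a "p-distance"), which is only one possible way to convert similarity to distance and is not necessarily what any particular program uses. The five toy sequences are made up for the example, and `identity_align` is a hypothetical stand-in for a real pairwise aligner that only works because the toy sequences already have equal length.

```python
from itertools import combinations


def p_distance(row_a, row_b):
    """Fraction of differing positions among columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_table(seqs, pairwise_align):
    """seqs: dict name -> sequence. pairwise_align(a, b) returns two aligned rows.

    Returns a dict {(name1, name2): distance} with one entry per unordered pair.
    """
    table = {}
    for n1, n2 in combinations(sorted(seqs), 2):
        row1, row2 = pairwise_align(seqs[n1], seqs[n2])
        table[(n1, n2)] = p_distance(row1, row2)
    return table


def identity_align(a, b):
    """Hypothetical placeholder aligner: assumes a and b are already aligned."""
    return a, b


# Made-up toy sequences of equal length (not from the slides).
toy = {
    "A": "ATGAAATAA",
    "B": "ATGAAGTAA",
    "C": "ATGTTTTAA",
    "D": "ATGTTCTAA",
    "E": "CTGCCCTGA",
}
table = distance_table(toy, identity_align)
print(len(table))  # number of pairwise comparisons
```

For five sequences the table has 5 x 4 / 2 = 10 entries, matching the 10 pairwise alignments mentioned on the slide.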

Second step: build a guide tree

Cluster the sequences to create a tree (guide tree):

• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

[Figure: guide tree with leaves drawn in the order A, D, C, B, E.]

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
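To make the clustering step concrete, here is a minimal sketch that builds a guide tree from the distance table shown on the slides. The slides do not say which clustering method is used; average-linkage clustering (UPGMA) is assumed here purely for illustration, and the resulting topology need not match the tree drawn on the slide.

```python
from itertools import combinations


def upgma(names, leaf_dist):
    """Average-linkage clustering (UPGMA), assumed here for illustration.

    names: list of leaf labels.
    leaf_dist: dict {frozenset({a, b}): distance} for every pair of leaves.
    Returns the tree as nested 2-tuples of labels.
    """
    # Each cluster is (subtree, list of leaves under it).
    clusters = [(name, [name]) for name in names]

    def average_distance(c1, c2):
        total = sum(leaf_dist[frozenset((x, y))] for x in c1[1] for y in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: average_distance(clusters[p[0]], clusters[p[1]]))
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Distances copied from the table on the slide.
slide_table = {
    ("A", "B"): 8,
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}
leaf_dist = {frozenset(pair): d for pair, d in slide_table.items()}
guide_tree = upgma(["A", "B", "C", "D", "E"], leaf_dist)
print(guide_tree)
```

With these distances the closest pair (A, B at distance 8) is joined first, then C and D (distance 10), then the two pairs, and E last. This is an alignment order, not an evolutionary claim, which is the point of the warning above.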

Third step: align sequences in a bottom-up order

1. Align the most similar (neighboring) pairs
2. Align pairs of pairs
3. Align sequences clustered to pairs of pairs deeper in the tree

[Figure: guide tree with leaves A, D, C, B, E next to the aligned rows Sequence A to Sequence E.]
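The bottom-up order can be sketched as a recursion over the guide tree: align the two children of each node, then treat the result as a fixed block when moving up. The code below is a deliberately simplified illustration, not the algorithm of MAFFT, PRANK, or any other program. The scoring values and the averaging of column scores are assumptions, and the guide tree used in the example is assumed rather than computed.

```python
def align_profiles(p, q, match=1, mismatch=-1, gap=-2):
    """Globally align two profiles (lists of equal-length gapped rows)."""
    def col_score(i, j):
        # Average pairwise score between column i of p and column j of q.
        total = 0
        for r in p:
            for s in q:
                x, y = r[i], s[j]
                if x == "-" or y == "-":
                    total += 0 if x == y else gap
                else:
                    total += match if x == y else mismatch
        return total / (len(p) * len(q))

    n, m = len(p[0]), len(q[0])
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + col_score(i - 1, j - 1),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back, recording which column of p and/or q forms each new column.
    cols = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + col_score(i - 1, j - 1):
            cols.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            cols.append((i - 1, None)); i -= 1
        else:
            cols.append((None, j - 1)); j -= 1
    cols.reverse()
    new_p = ["".join(r[a] if a is not None else "-" for a, _ in cols) for r in p]
    new_q = ["".join(s[b] if b is not None else "-" for _, b in cols) for s in q]
    return new_p + new_q


def progressive_align(tree, seqs):
    """tree: nested 2-tuples of names. Returns (names, aligned rows)."""
    if isinstance(tree, str):
        return [tree], [seqs[tree]]
    left_names, left_rows = progressive_align(tree[0], seqs)
    right_names, right_rows = progressive_align(tree[1], seqs)
    return left_names + right_names, align_profiles(left_rows, right_rows)


# Present-day sequences from the sequence evolution slide; the tree is assumed.
seqs = {"Human": "ATGTTTTAA", "Chimp": "ATGTTT", "Mouse": "ATGCCCAAATAA"}
names, rows = progressive_align((("Human", "Chimp"), "Mouse"), seqs)
for name, row in zip(names, rows):
    print(f"{name:6s} {row}")
```

Note that gaps placed when aligning Human with Chimp are never revisited when Mouse is added. This greedy behavior is the reason an early mistake can propagate into the final MSA.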

Multiple sequence alignment (MSA): progressive alignment

[Figure: overview of the procedure. Sequences A to E → pairwise distance table → guide tree → MSA, with an arrow labeled "Iterative".]

Multiple sequence alignment (MSA)

Several advanced MSA programs are available. Today we will use two:

• MAFFT: fastest and one of the most accurate
• PRANK: distinct from all other MSA programs because of its correct treatment of insertions/deletions

MAFFT

• Web server & download: http://align.bmr.kyushu-u.ac.jp/mafft/online/server/
• Efficiency-tuned variants: quick & dirty or slow but accurate

Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Research, 2002, Vol. 30, No. 14, 3059-3066.

Choosing a MAFFT strategy

[Figure: MAFFT strategies arranged from quick & dirty to slow but accurate.]

L-INS-i

ooooooooooooooooooooooooooooooooXXXXXXXXXXX-XXXXXXXXXXXXXXX------------------

--------------------------------XX-XXXXXXXXXXXXXXX-XXXXXXXXooooooooooo-------

------------------ooooooooooooooXXXXX----XXXXXXXX---XXXXXXXooooooooooo-------

--------ooooooooooooooooooooooooXXXXX-XXXXXXXXXX----XXXXXXXoooooooooooooooooo

--------------------------------XXXXXXXXXXXXXXXX----XXXXXXX------------------

G-INS-i

XXXXXXXXXXX-XXXXXXXXXXXXXXX

XX-XXXXXXXXXXXXXXX-XXXXXXXX

XXXXX----XXXXXXXX---XXXXXXX

XXXXX-XXXXXXXXXX----XXXXXXX

XXXXXXXXXXXXXXXX----XXXXXXX

E-INS-i

oooooooooXXX------XXXX---------------------------------XXXXXXXXXXX-XXXXXXXXXXXXXXXooooooooooooo

---------XXXXXXXXXXXXXooo------------------------------XXXXXXXXXXXXXXXXXX-XXXXXXXX-------------

-----ooooXXXXXX---XXXXooooooooooo----------------------XXXXX----XXXXXXXXXXXXXXXXXXooooooooooooo

---------XXXXX----XXXXoooooooooooooooooooooooooooooooooXXXXX-XXXXXXXXXXXX--XXXXXXX-------------

---------XXXXX----XXXX---------------------------------XXXXX---XXXXXXXXXX--XXXXXXXooooo--------

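The slides use the MAFFT web server. For readers who work with a local installation instead (an assumption, not part of the workshop), the sketch below builds the command lines that the MAFFT manual gives for the three strategies named above. It only assembles and prints the commands; running them requires `mafft` to be installed, and the input file name is simply the one used in the exercise.

```python
import shlex

# Flag combinations as documented in the MAFFT manual for each strategy.
MAFFT_STRATEGIES = {
    "auto": ["--auto"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "G-INS-i": ["--globalpair", "--maxiterate", "1000"],
    "E-INS-i": ["--ep", "0", "--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Return the argument list for running MAFFT with the given strategy."""
    return ["mafft", *MAFFT_STRATEGIES[strategy], fasta_path]


for strategy in MAFFT_STRATEGIES:
    # MAFFT writes the alignment to standard output, so redirect it to a file.
    print(shlex.join(mafft_command(strategy, "trim5a.AA.fas")), "> output.aln")
```

The argument list can be passed to `subprocess.run` once MAFFT is available locally.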

MAFFT output

Saving the output:

• Choose a format: Clustal, Fasta, or click "Reformat" to convert to a selection of other formats
• Save page as a text file, e.g. save as "phylip" file and upload to PhyML for reconstructing the tree
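If the conversion has to be done by hand, the following minimal sketch turns an aligned FASTA text into a relaxed, sequential PHYLIP layout (a header line with the number of sequences and the alignment length, then one "name sequence" line per sequence). Whether a given program accepts this relaxed layout, or requires the strict 10-character name field, is an assumption to check against that program's documentation.

```python
def read_fasta(text):
    """Parse FASTA text into an ordered dict name -> sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep only the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def to_phylip(records):
    """Format aligned sequences as relaxed sequential PHYLIP text."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records.items():
        lines.append(f"{name} {seq}")
    return "\n".join(lines) + "\n"


# Tiny made-up alignment, only to show the layout.
example = ">s1\nATG-\n>s2 some description\nAT\nGC\n"
print(to_phylip(read_fasta(example)), end="")
```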

A colored view of the alignment

PhyML: tree reconstruction

• The most widely used maximum likelihood (ML) program
• Web server & download: http://www.atgc-montpellier.fr/phyml/

Classical alignment errors for HIV env

PRANK Web server: http://www.ebi.ac.uk/goldman-srv/webPRANK/

PRANK output

If you need a different format – copy the results to the READSEQ sequence converter: http://www-bimas.cit.nih.gov/molbio/readseq/

1. Download and save the sequences file from Osnat's homepage (you can google "Osnat Penn" and look for the workshop materials under "Teaching"). Save the file as "trim5a.AA.fas" (File > "Save page as"). This file contains 20 protein sequences in FASTA format.

2. Run the PRANK web server to create a protein alignment:
   a. In the "Default alignment" section browse for "trim5a.AA.fas".
   b. Run (press the "Start alignment" button).

3. While you wait: copy the sequences into the MAFFT web server and run the "automatic" "moderately accurate" strategy. Which strategy did MAFFT choose for you? Click on the "Fasta format" link, save as "trim5a.AA.mafft.aln" (File > "Save page as"), and try the "Jalview" button.

4. When PRANK finishes, click on the "Show Fasta file" button and save the MSA under the name "trim5a.AA.prank.aln".

Sources of alignment errors

Progressive alignment algorithms are greedy heuristics

• Co-optimal solutions: Heads-or-Tails (HoT) scores (Landan & Graur 2007)
• Guide-tree errors: GUIDANCE scores (Penn, Privman et al. MBE 2010)

GUIDANCE: Guide-tree based alignment confidence scores

[Figure: the GUIDANCE procedure. Base MSA → bootstrap sampling of NJ trees → Tree 1, Tree 2, …, Tree 99, Tree 100 → progressive alignment → MSA 1, MSA 2, …, MSA 99, MSA 100 → GUIDANCE scores, shown on a scale from 1 (confident) to 0 (uncertain).]

Penn, Privman et al. MBE 2010

http://guidance.tau.ac.il
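The idea of scoring the base MSA against alignments built from perturbed guide trees can be sketched as follows. This is a simplified illustration and not the scoring defined in the GUIDANCE paper: here a column of the base MSA simply gets the fraction of alternative MSAs that contain exactly the same column. The alternative MSAs are supplied by hand, and rows are assumed to be in the same sequence order in every MSA.

```python
def columns(msa):
    """Describe each column by which residue of each sequence it contains.

    msa: list of equal-length gapped rows.
    Returns a list of tuples; entry r is the residue index in row r, or None for a gap.
    """
    counters = [0] * len(msa)
    cols = []
    for c in range(len(msa[0])):
        col = []
        for r, row in enumerate(msa):
            if row[c] == "-":
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def column_scores(base_msa, alternative_msas):
    """Fraction of alternative MSAs that reproduce each base column exactly."""
    alt_sets = [set(columns(m)) for m in alternative_msas]
    return [sum(col in s for s in alt_sets) / len(alt_sets)
            for col in columns(base_msa)]


# Made-up toy alignments of the same two sequences (ACGT and ACAGT).
base = ["AC-GT", "ACAGT"]
alternatives = [["AC-GT", "ACAGT"],   # identical to the base MSA
                ["A-CGT", "ACAGT"]]   # gap placed one column earlier
print(column_scores(base, alternatives))
```

Columns that every alternative alignment reproduces score 1, and columns that depend on where the gap was placed score lower, which mirrors the confident-to-uncertain scale in the figure.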

[Figure, panel (a): GUIDANCE score per column (axes: Column, GUIDANCE score; color legend from Confident to Uncertain) for an alignment of HIV1 group M, SIV chimp, HIV1 group O, HIV1 group N, SIV cerco, and SIV gorilla, with the extracellular, transmembrane, and cytoplasmic domains marked.]

[Figure, panel (b): the same kind of plot for HIV1 group M, SIV chimp, and HIV1 group O only.]

1. Run the GUIDANCE web server to calculate confidence scores for the MAFFT alignment:
   a. In the "Upload your sequence file" window browse for "trim5a.AA.fas".
   b. Choose "Amino Acids" in the "Sequences Type" option.
   c. To speed up the run, change the "Number of bootstrap repeats" in the "Advanced options" section to 30. Note that this is not recommended for real-life analyses.
   d. Run (press the "Submit" button).

Detecting selection forces: positive selection

Empirical findings: variation among genes

"Important" proteins evolve slower than "unimportant" ones

[Figure: Histone 3 protein]

Empirical findings: variation among sites

Functional sites evolve slower than nonfunctional sites

Silent and non-silent mutations

Silent: UUU -> UUC (both encode phenylalanine)

Non-silent: UUU -> CUU (phenylalanine to leucine)
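The two examples above can be checked mechanically with the standard genetic code. The sketch below builds the codon table from a compact string (codons ordered with U, C, A, G at each position) and classifies a codon change as silent or non-silent; the function name `classify` is just a label chosen for this example.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons in UCAG order at each position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify(codon_before, codon_after):
    """Return "silent" if both codons encode the same amino acid, else "non-silent"."""
    same = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "silent" if same else "non-silent"


print(classify("UUU", "UUC"))  # F -> F
print(classify("UUU", "CUU"))  # F -> L
```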

For most proteins, the rate of silent substitutions is much higher than the non-silent rate

This is called purifying selection = conservation

There are rare cases where the non-silent rate is much higher than the silent rate

This is called positive selection

Positive Selection

Examples:
• Pathogen proteins evading the host immune system
• Proteins of the immune system detecting pathogen proteins
• Pathogen proteins that are drug targets
• Proteins that are products of gene duplication
• Proteins involved in the reproductive system

http://selecton.tau.ac.il

Selecton results

False positive predictions

Selecton uses an MSA as input

The MSA may contain unreliable regions → errors in Selecton computations → errors in the positive selection inference
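One possible way to act on this, offered as an assumption rather than a step taken from the slides, is to drop low-confidence columns before running a downstream analysis. The sketch below keeps only columns whose confidence score reaches a cutoff. The scores and the cutoff value are made up for the example, and the function works on single columns, so a codon alignment would need to be filtered in groups of three.

```python
def mask_unreliable_columns(msa, scores, cutoff):
    """Keep only the columns whose score is at least the cutoff.

    msa: list of equal-length gapped rows.
    scores: one confidence score per column.
    Returns (filtered rows, indices of the columns that were kept).
    """
    if any(len(row) != len(scores) for row in msa):
        raise ValueError("need exactly one score per alignment column")
    keep = [i for i, s in enumerate(scores) if s >= cutoff]
    filtered = ["".join(row[i] for i in keep) for row in msa]
    return filtered, keep


# Made-up alignment and scores; 0.9 is an arbitrary illustrative cutoff.
msa = ["AC-GT", "ACAGT"]
scores = [1.0, 0.5, 0.5, 1.0, 1.0]
filtered, kept = mask_unreliable_columns(msa, scores, cutoff=0.9)
print(filtered, kept)
```

Returning the kept indices makes it possible to map any site reported by the downstream analysis back to its column in the original MSA.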

1. Go to the GUIDANCE results of the last exercise.

2. Which columns are not well aligned? Are these sites also predicted to evolve under positive selection? See Selecton results in: http://selecton.tau.ac.il/results/1268662868/colors.html

Summary

Different alignment programs may produce different MSAs.

Alignment uncertainty may cause errors in downstream analyses such as positive selection analysis.

GUIDANCE can detect alignment errors.

Thanks for your attention!
