Phylogenetic analysis for molecular sequence data

João C. Setubal University of São Paulo

August 2012

1 8/23/2012 J. C. Setubal

Outline

1. What is the biological question?
2. What input sequences should be used?
3. Analysis pipeline: steps and components
4. Output visualization
5. Output interpretation

Biological questions

• How do oomycete species relate to one another and to other species?

• What is the history of a particular gene?
– Gene trees vs. species trees
– Lateral gene transfer

• Other questions

Credit: www.apsnet.org

Taxonomy is not phylogeny: class Oomycota

• Kingdom: Chromalveolata
• Phylum: Heterokontophyta
• Class: Oomycota
• Orders (& families)
• Lagenidiales
– Lagenidiaceae
– Olpidiosidaceae
– Sirolpidiaceae
• Leptomitales
– Leptomitaceae
• Peronosporales
– Albuginaceae
– Peronosporaceae
– Pythiaceae
• Rhipidiales
– Rhipidaceae
• Saprolegniales
– Ectrogellaceae
– Haliphthoraceae
– Leptolegniellaceae
– Saprolegniaceae
• Thraustochytriales

Phytophthora

Input sequences

• They should belong to the same homologous family (Cf. Friday lecture)

Pipeline

1. Multiple sequence alignment (MSA)
2. Alignment editing
3. Phylogeny reconstruction
4. Visualization

Multiple Sequence Alignment

• Generalization of pairwise alignment
– Optimum vs. approximation
– All practical programs for MSA produce approximations
• DNA or amino acids
– DNA is more sensitive, but the 3rd codon position is less informative
– Amino acids allow more distant proteins to be included
• Scoring matrices: BLOSUM, PAM
• Aligned sites (a column) should be homologous
• Output formats: Clustal, FASTA, MSF, NEXUS, PHYLIP
– http://molecularevolution.org/resources/fileformats/converting
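Converting between the alignment formats listed above is mostly bookkeeping. The following is a minimal sketch, assuming an already aligned FASTA input and the "relaxed" sequential PHYLIP layout (full names, one sequence per line). The function names `read_fasta` and `to_phylip` are illustrative and do not come from any of the programs named in these slides.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Write aligned records in relaxed sequential PHYLIP format."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(name) for name, _ in records)
    # Header line: number of sequences, then alignment length.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        lines.append(f"{name.ljust(width)}  {seq}")
    return "\n".join(lines) + "\n"


example = """>seqA
ACG-TT
>seqB
ACGATT
>seqC
AC--TT
"""
print(to_phylip(read_fasta(example)))
```

Strict PHYLIP limits names to 10 characters, so a program that expects the strict variant would need the names truncated or padded accordingly.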

Programs for MSA

• MUSCLE
– Edgar, R.C. (2004) Nucleic Acids Res. 32(5):1792-1797
– www.drive5.com/muscle
• MAFFT
– Katoh, Misawa, Kuma, Miyata (2002) Nucleic Acids Res. 30:3059-3066
– mafft.cbrc.jp/alignment/software/
• ClustalW/X
• Cobalt (NCBI)
• T-Coffee

Input sequences

• Should be related to each other
• Cannot be too long (less than ~10 kb)
• Not too many (fewer than ~100)
• (numbers vary depending on the program and on the computer)
• FASTA format is best
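These rules of thumb can be checked before launching an alignment. The sketch below assumes the input has already been read into (name, sequence) pairs. The default limits simply restate the approximate numbers above and, as noted, should be adjusted to the program and computer at hand. `check_input` is a hypothetical helper name.

```python
def check_input(records, max_length=10_000, max_count=100):
    """Return a list of warnings for a list of (name, sequence) pairs."""
    warnings = []
    if len(records) >= max_count:
        warnings.append(
            f"{len(records)} sequences (rule of thumb: fewer than {max_count})"
        )
    for name, seq in records:
        if len(seq) >= max_length:
            warnings.append(
                f"{name}: {len(seq)} residues (rule of thumb: fewer than {max_length})"
            )
    return warnings


records = [("geneA", "ATG" * 50), ("geneB", "ATG" * 4000)]
for message in check_input(records):
    print(message)
```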

Alignment editing

Credit: R. Dixon

Alignment editing

• Certain columns may be uninformative
• Sometimes humans can see better alignments
• Manual editing
– Jalview: www.jalview.org
– Waterhouse et al. (2009) Bioinformatics 25(9):1189-1191
– SeaView: http://pbil.univ-lyon1.fr/software/seaview.html
– Gouy M., Guindon S. & Gascuel O. (2010) Molecular Biology and Evolution 27(2):221-224
• Automatic editing: Gblocks
– http://molevol.cmima.csic.es/castresana/Gblocks_server.html
– Castresana, J. (2000) Molecular Biology and Evolution 17, 540-552
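To make the idea of automatic editing concrete, here is a deliberately simple column filter that drops alignment columns made up mostly of gaps. It is only an illustration: it is not the Gblocks algorithm, and the 50% gap threshold and the name `filter_columns` are assumptions chosen for the example.

```python
def filter_columns(records, max_gap_fraction=0.5):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    names = [name for name, _ in records]
    seqs = [seq for _, seq in records]
    keep = []
    for column in zip(*seqs):
        gap_fraction = column.count("-") / len(column)
        if gap_fraction <= max_gap_fraction:
            keep.append(column)
    # Transpose the kept columns back into one string per sequence.
    rows = ["".join(chars) for chars in zip(*keep)] if keep else [""] * len(seqs)
    return list(zip(names, rows))


alignment = [
    ("seqA", "AC-GT-A"),
    ("seqB", "AC-GTTA"),
    ("seqC", "A--GT-A"),
]
for name, seq in filter_columns(alignment):
    print(name, seq)
```

In this example the all-gap third column and the mostly-gap sixth column are removed, while the second column (one gap out of three) is kept.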

JALVIEW http://www.jalview.org/

Phylogeny reconstruction

Credit: R. Dixon

Topology and branch lengths: a tree and a cladogram [figure: panels A and B; cladogram version]

Credit: Wattam et al. 2011

Branch lengths: # of substitutions per site

Unrooted tree (no outgroup)

http://itol.embl.de

Rooted tree: needs outgroup

Phylogeny reconstruction methods

• Distance
– Distance matrix
• Parsimony
– Minimize mutations along branches
• Maximum likelihood
– Searches for the most likely tree under a probabilistic model
• Bayesian inference
– Also probabilistic, but using a Bayesian approach
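As an illustration of the input to distance methods, the sketch below builds a pairwise distance matrix from a small DNA alignment. It uses the proportion of differing sites (p-distance) with the Jukes-Cantor (JC69) correction, which is one common choice among several. The helper names and the toy sequences are assumptions made for the example.

```python
import math


def p_distance(a, b):
    """Proportion of differing sites, ignoring columns with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jc69_distance(p):
    """JC69-corrected distance, in expected substitutions per site."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(records):
    """Symmetric matrix of JC69 distances for (name, sequence) pairs."""
    n = len(records)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = jc69_distance(p_distance(records[i][1], records[j][1]))
            matrix[i][j] = matrix[j][i] = d
    return matrix


records = [("seqA", "ACGTACGT"), ("seqB", "ACGTACGA"), ("seqC", "ACCTAGGA")]
matrix = distance_matrix(records)
for (name, _), row in zip(records, matrix):
    print(name, [round(d, 4) for d in row])
```

A matrix like this is what neighbor-joining or UPGMA would take as input.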

Running time considerations

• In the last century, distance and parsimony methods were dominant
– The others were too slow

• Now Maximum Likelihood has become a “standard”

Models of evolution

• All methods other than distance methods must rely on models for the evolution of sequences

Evolution of models for DNA evolution

http://authors.library.caltech.edu/5456/1/hrst.mit.edu/hrs/evolution/public/models/sequence.html

Protein evolution

• Amino acid substitution matrices
– PAM
– BLOSUM
– WAG: Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699

Models in PhyML

• DNA
– JC69, K80, F81, F84, HKY85, TN93, GTR, custom
• Amino acids
– LG, WAG, Dayhoff, JTT, Blosum62, mtREV, rtREV, cpREV, DCMut, VT, mtMAM, custom

Phylogeny reconstruction programs

• PHYLIP
– Joe Felsenstein
– http://evolution.genetics.washington.edu/phylip.html
• PAUP
– David Swofford
– http://paup.csit.fsu.edu/
• Distance
– Neighbor-joining, UPGMA
• Parsimony
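UPGMA, one of the distance methods named above, can be sketched in a few lines: it repeatedly joins the two closest clusters and averages their distances to every other cluster, weighted by cluster size. This sketch returns only the topology, as nested tuples, and omits branch lengths. It is not taken from PHYLIP or PAUP, and the toy distance matrix is invented for illustration.

```python
def upgma(names, matrix):
    """Cluster taxa with UPGMA; return the topology as nested tuples."""
    # Each cluster id maps to (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Join the closest pair of clusters.
        a, b = min(dist, key=dist.get)
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average.
        new_dist = {}
        for c in clusters:
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Discard distances that involve the two merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

UPGMA implicitly assumes a constant rate of evolution across lineages, which is one reason neighbor-joining is often preferred among distance methods.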

Maximum likelihood

• RAxML
– A. Stamatakis
– http://www.exelixis-lab.org/
• PhyML
– O. Gascuel et al. Systematic Biology, 59(3):307-21, 2010
– http://www.atgc-montpellier.fr/phyml/
• FastTree
– Morgan N. Price in Adam Arkin’s group
– http://www.microbesonline.org/fasttree/
– “FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory”

A performance data point

• An ML tree for about 500 protein sequences, each about 300 aa long
• RAxML or PhyML took about 10 hours
• FastTree took less than 1 hour

Bayesian inference

• MrBayes
• Ronquist and Huelsenbeck. Bioinformatics. 2003 19(12):1572-4.
• http://mrbayes.sourceforge.net/
• Slower compared to RAxML and PhyML

Tree visualization: formats

• Newick, NEXUS
• (((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);

• http://molecularevolution.org/resources/treeformats
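The Newick string above encodes the tree as nested parentheses, with a branch length after each colon. The following is a minimal recursive parser for that string, a sketch that handles only plain labels and branch lengths (no quoted names or comments). `parse_newick` and `leaves` are illustrative names.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts: name, length, children."""
    # Remove all whitespace and the trailing semicolon.
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            while True:
                children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                    continue
                pos += 1  # skip ")"
                break
        # Read the optional "label:length" that follows the node.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        label, _, length = text[start:pos].partition(":")
        return {
            "name": label,
            "length": float(length) if length else None,
            "children": children,
        }

    return parse_node()


def leaves(node):
    """List leaf names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaves(child)]


newick = (
    "(((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,"
    "(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,"
    "erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,"
    "(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);"
)
tree = parse_newick(newick)
print(len(leaves(tree)), "leaves:", ", ".join(leaves(tree)))
```

The top-level node of this tree has three children, the usual way of writing an unrooted tree in Newick.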

Tree visualization

• Interactive Tree of Life: http://itol.embl.de
• http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software

All-in-one: phylogeny.fr

Phylogeny.fr (2)

Building your tree locally: SeaView

Interpretation

• Trees are just hypotheses
• They can suffer from GIGO (garbage in, garbage out)
• The most likely tree may not be the true tree
• Confidence in the topology
– Bootstrap values: should be above 0.7 (70%)
– Costly to compute
– PhyML provides approximate bootstrap values that are much faster to compute
• It’s always a good idea to try more than one reconstruction method
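Bootstrap values come from resampling the alignment: each replicate draws columns with replacement until it is as long as the original, a tree is built from each replicate, and the support for a branch is the fraction of replicate trees that contain it. The sketch below covers only the resampling step. The toy alignment, the fixed seed, and the count of 100 replicates are assumptions for illustration.

```python
import random


def bootstrap_replicate(records, rng):
    """Resample alignment columns with replacement (one replicate)."""
    length = len(records[0][1])
    picks = [rng.randrange(length) for _ in range(length)]
    # Apply the same column picks to every sequence.
    return [(name, "".join(seq[i] for i in picks)) for name, seq in records]


alignment = [("seqA", "ACGTAC"), ("seqB", "ACGTTC"), ("seqC", "ACCTAC")]
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]
for name, seq in replicates[0]:
    print(name, seq)
```

Because a tree must be rebuilt for every replicate, the cost grows with the number of replicates, which is why bootstrap values are costly to compute.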

Supermatrix approach

• Good for obtaining robust species tree when complete or nearly complete genomes are available (phylogenomics)

• Find all families that have exactly one representative from each genome

• MSA for each family
• Concatenate all MSAs
• Build tree based on concatenated alignment
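The family selection and concatenation steps can be sketched as follows. The example assumes each family has already been aligned, so the strings below stand for rows of per-family MSAs. The genome and family names are invented, and `single_copy_families` and `concatenate` are hypothetical helper names.

```python
def single_copy_families(families, genomes):
    """Keep families with exactly one sequence from each genome."""
    wanted = sorted(genomes)
    return {
        family: dict(members)
        for family, members in families.items()
        if sorted(genome for genome, _ in members) == wanted
    }


def concatenate(selected, genomes):
    """Join the per-family alignments into one supermatrix row per genome."""
    order = sorted(selected)  # same family order for every genome
    return {g: "".join(selected[family][g] for family in order) for g in genomes}


genomes = ["gA", "gB", "gC"]
families = {
    "fam1": [("gA", "ATG-"), ("gB", "ATGC"), ("gC", "ATGA")],
    # Two copies in gB, so this family is skipped.
    "fam2": [("gA", "CC"), ("gB", "CT"), ("gB", "CA"), ("gC", "CC")],
    "fam3": [("gA", "GGA"), ("gB", "GGT"), ("gC", "G-A")],
    # Missing from gB, so this family is skipped.
    "fam4": [("gA", "TT"), ("gC", "TA")],
}
selected = single_copy_families(families, genomes)
supermatrix = concatenate(selected, genomes)
for genome, row in supermatrix.items():
    print(genome, row)
```

The concatenated rows can then be passed to any of the reconstruction programs listed earlier.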

Ciccarelli et al, Science, 2006

Eisen & Wu, Genome Biology, 2008

The bane of species trees: Horizontal Gene Transfer

• Likely when gene tree differs from species tree
• Can be detected by other methods
– Sequence composition deviation
– Genomic islands
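One simple form of sequence composition deviation is GC content: a gene whose GC content is far from the genome average is a candidate for recent transfer. The sketch below flags such genes with a z-score. The threshold of 2 standard deviations and the toy sequences are assumptions for illustration, and practical methods typically use richer signals such as codon usage or k-mer composition.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def composition_outliers(genes, threshold=2.0):
    """Names of genes whose GC content deviates strongly from the mean."""
    values = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(list(values.values()))
    sigma = pstdev(list(values.values()))
    if sigma == 0:
        return []  # all genes have identical GC content
    return [name for name, gc in values.items() if abs(gc - mu) / sigma > threshold]


# Nine genes at 50% GC and one GC-rich gene standing in for a transferred one.
genes = {f"gene{i}": "ATGC" * 5 for i in range(1, 10)}
genes["island"] = "GGCC" * 5
print(composition_outliers(genes))
```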

Network models for gene sharing

• Current research topic

Kloesges et al, Molecular Biology and Evolution, 2011

Review: Tal Dagan. Phylogenomic networks. Trends in Microbiology, 19(10), 483-491, 2011

Additional Resource

• http://www.megasoftware.net/

Books

• Bioinformatics. Baxevanis and Ouellette (Eds.) Wiley-Interscience, 2005 (3rd edition), ch. 14

• D. Mount. Bioinformatics. CSHL Press, 2004 (2nd edition), ch. 7

• The phylogenetic handbook. Lemey, Salemi and Vandamme (Eds.) Cambridge University Press, 2009 (2nd edition)
