Phylogenetic analysis for molecular sequence data
João C. Setubal University of São Paulo
August 2012
1 8/23/2012 J. C. Setubal
Outline
1. What is the biological question?
2. What input sequences should be used?
3. Analysis pipeline: steps and components
4. Output visualization
5. Output interpretation
Biological questions
• How do oomycete species relate to one another and to other species?
• What is the history of a particular gene?
– Gene trees vs. species trees
– Lateral gene transfer
• Other questions
Credit: www.apsnet.org
Taxonomy is not phylogeny: class Oomycota
• Kingdom: Chromalveolata
• Phylum: Heterokontophyta
• Class: Oomycota
• Orders (& families)
• Lagenidiales
– Lagenidiaceae
– Olpidiopsidaceae
– Sirolpidiaceae
• Leptomitales
– Leptomitaceae
• Peronosporales
– Albuginaceae
– Peronosporaceae
– Pythiaceae
• Rhipidiales
– Rhipidiaceae
• Saprolegniales
– Ectrogellaceae
– Haliphthoraceae
– Leptolegniellaceae
– Saprolegniaceae
• Thraustochytriales
Phytophthora
Input sequences
• They should belong to the same homologous family (Cf. Friday lecture)
Pipeline
1. Multiple sequence alignment (MSA)
2. Alignment editing
3. Phylogeny reconstruction
4. Visualization
Multiple Sequence Alignment
• Generalization of pairwise alignment
– Optimum vs. approximation
– All practical MSA programs produce approximations (heuristics), since computing an optimal alignment is intractable for more than a handful of sequences
• DNA or amino acids
– DNA is more sensitive, but the 3rd codon position is less informative
– Amino acids allow more distant proteins to be included
• Scoring matrices: BLOSUM, PAM
• Aligned sites (a column) should be homologous
• Output formats: Clustal, FASTA, MSF, NEXUS, PHYLIP
– http://molecularevolution.org/resources/fileformats/converting
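As an illustration of the output formats listed above, here is a minimal sketch that reads an aligned FASTA text and writes it as relaxed sequential PHYLIP. The function names and the toy sequences are made up for this example; real converters handle many more cases (strict 10-character names, interleaved layout, and so on).

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping sequence name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the name
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def to_phylip(alignment):
    """Write an aligned dict as relaxed sequential PHYLIP text."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        # Every row of an alignment must have the same number of columns
        raise ValueError("all aligned sequences must have the same length")
    width = max(len(n) for n in alignment) + 2
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        lines.append(name.ljust(width) + seq)
    return "\n".join(lines) + "\n"


# Toy alignment (hypothetical sequences)
example = """>seqA
ACG-TA
>seqB
ACGTTA
>seqC
AC--TA
"""
alignment = parse_fasta(example)
phylip_text = to_phylip(alignment)
print(phylip_text)
```

The length check is the programmatic counterpart of "aligned sites form columns": if rows differ in length, the input is not an alignment.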
Programs for MSA
• MUSCLE
– Edgar, R.C. (2004) Nucleic Acids Res. 32(5):1792-1797
– www.drive5.com/muscle
• MAFFT
– Katoh, Misawa, Kuma, Miyata (2002) Nucleic Acids Res. 30:3059-3066
– mafft.cbrc.jp/alignment/software/
• ClustalW/X
• Cobalt (NCBI)
• T-Coffee
Input sequences
• Should be related to each other
• Should not be too long (less than ~10 kb)
• Should not be too many (fewer than ~100)
• (These numbers vary depending on the program and on the computer)
• FASTA format is best
Alignment editing
Credit: R. Dixon
• Certain columns may be uninformative
• A human can sometimes spot a better alignment than the program produced
• Manual editing
– Jalview: www.jalview.org
– Waterhouse et al. (2009) Bioinformatics 25(9):1189-1191
– SeaView: http://pbil.univ-lyon1.fr/software/seaview.html
– Gouy M., Guindon S. & Gascuel O. (2010) Molecular Biology and Evolution 27(2):221-224
• Automatic editing: Gblocks
– http://molevol.cmima.csic.es/castresana/Gblocks_server.html
– Castresana, J. (2000) Molecular Biology and Evolution 17:540-552
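To make "uninformative columns" concrete, the sketch below drops alignment columns that are mostly gaps. This is a deliberately simplified filter and not the Gblocks algorithm; the function name, the 0.5 threshold, and the toy sequences are assumptions chosen for illustration.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Return a copy of the alignment without columns whose gap fraction
    exceeds max_gap_fraction (a hypothetical, tunable cutoff)."""
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(n_cols):
        gaps = sum(1 for s in seqs if s[col] == "-")
        if gaps / len(seqs) <= max_gap_fraction:
            keep.append(col)

    return {n: "".join(s[c] for c in keep) for n, s in zip(names, seqs)}


# Toy alignment (hypothetical sequences); the 4th column is 75% gaps
alignment = {
    "seqA": "ACG-TA",
    "seqB": "ACGTTA",
    "seqC": "AC--TA",
    "seqD": "A---TA",
}
trimmed = drop_gappy_columns(alignment)
for name, seq in trimmed.items():
    print(name, seq)
```

Automatic editors use richer criteria (conservation, flanking positions, block length), so results from a real tool will generally differ from this filter.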
JALVIEW http://www.jalview.org/
Phylogeny reconstruction
Credit: R. Dixon
Topology and branch lengths: a tree and its cladogram version (figure panels A and B)
Credit: Wattam et al. 2011
Branch lengths: number of substitutions per site
Rooted tree: needs an outgroup
Phylogeny reconstruction methods
• Distance
– Builds the tree from a matrix of pairwise distances between sequences
• Parsimony
– Minimize mutations along branches
• Maximum likelihood
– Searches for the tree that maximizes the likelihood of the alignment under a probabilistic model
• Bayesian inference
– Also probabilistic, but using a Bayesian approach
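The parsimony criterion, "minimize mutations along branches", can be sketched with Fitch's algorithm, which counts the minimum number of changes a fixed binary tree needs to explain each alignment column. The tree encoding (nested tuples), the function names, and the toy data are assumptions for this example; searching over candidate trees, which is the expensive part, is not shown.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for one column.

    tree   -- a leaf name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping leaf name -> character in this column
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed
        return common, left_cost + right_cost
    # Children disagree: one change is required on this pair of branches
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    n_cols = len(next(iter(alignment.values())))
    total = 0
    for col in range(n_cols):
        states = {name: seq[col] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


# Toy data (hypothetical): two candidate topologies for four taxa
alignment = {"A": "AAC", "B": "AAC", "C": "GTC", "D": "GTT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment))  # 3
print(parsimony_score(tree_2, alignment))  # 5
```

Under parsimony, tree_1 is preferred because it explains the same columns with fewer changes. Note that this sketch treats a gap character as just another state.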
Running time considerations
• In the 20th century, distance and parsimony methods were dominant; the others were too slow
• Now maximum likelihood has become a “standard”
Models of evolution
• Except for distance methods, all methods rely on a model of sequence evolution
Evolution of models for DNA evolution
http://authors.library.caltech.edu/5456/1/hrst.mit.edu/hrs/evolution/public/models/sequence.html
Protein evolution
• Amino acid substitution matrices
– PAM
– BLOSUM
– WAG (Whelan and Goldman (2001) Mol. Biol. Evol. 18:691-699)
Models in PhyML
• DNA
– JC69, K80, F81, F84, HKY85, TN93, GTR, custom
• Amino acids
– LG, WAG, Dayhoff, JTT, Blosum62, mtREV, rtREV, cpREV, DCMut, VT, mtMAM, custom
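As a small worked example of what a substitution model does, take JC69, the simplest DNA model in the list above. Under JC69 the observed fraction of differing sites p underestimates the true number of substitutions per site d, and the correction is d = -(3/4) ln(1 - 4p/3). The sketch below computes both; the function names and sequences are made up for illustration.

```python
import math


def p_distance(seq1, seq2):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site) from p."""
    if p >= 0.75:
        # At p = 0.75 the sequences look random under JC69: distance is undefined
        raise ValueError("sequences are saturated; JC69 distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy sequences (hypothetical): 1 difference in 10 sites
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
d = jc69_distance(p)
print(f"p = {p:.3f}, JC69 d = {d:.4f}")
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site that leave no visible trace.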
Phylogeny reconstruction programs
• PHYLIP
– Joe Felsenstein
– http://evolution.genetics.washington.edu/phylip.html
• PAUP
– David Swofford
– http://paup.csit.fsu.edu/
• Distance
– Neighbor-joining, UPGMA
• Parsimony
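Of the two distance methods named above, UPGMA is the easier one to sketch: repeatedly merge the two closest clusters and average their distances to everything else, weighted by cluster size. The code below is a minimal illustration that returns only the topology as a Newick string; the input encoding (a dict keyed by frozenset pairs) and the toy distances are assumptions, and this is not how PHYLIP or PAUP implement it.

```python
from itertools import combinations


def upgma(dist):
    """Cluster taxa by UPGMA and return a Newick topology string.

    dist -- dict mapping frozenset({a, b}) -> distance between leaves a and b
    """
    leaves = sorted({name for pair in dist for name in pair})
    clusters = {name: 1 for name in leaves}  # cluster label -> number of leaves
    d = dict(dist)

    while len(clusters) > 1:
        # Find the closest pair of current clusters
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance from the merged cluster to each remaining cluster:
        # size-weighted average of the two old distances
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b

    return next(iter(clusters)) + ";"


# Toy distance matrix (hypothetical values)
dist = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 6.0,
    frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0,
    frozenset(("B", "D")): 10.0,
    frozenset(("C", "D")): 10.0,
}
print(upgma(dist))  # (((A,B),C),D);
```

UPGMA implicitly assumes that all lineages evolve at the same rate; neighbor-joining drops that assumption, which is one reason it is usually preferred for real data.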
Maximum likelihood
• RAxML
– A. Stamatakis
– http://www.exelixis-lab.org/
• PhyML
– O. Gascuel et al. (2010) Systematic Biology 59(3):307-21
– http://www.atgc-montpellier.fr/phyml/
• FastTree
– Morgan N. Price in Adam Arkin’s group
– http://www.microbesonline.org/fasttree/
– “FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory”
A performance data point
• An ML tree for about 500 protein sequences, each about 300 aa long
• RAxML or PhyML took about 10 hours
• FastTree took less than 1 hour
Bayesian inference
• MrBayes
• Ronquist and Huelsenbeck (2003) Bioinformatics 19(12):1572-4
• http://mrbayes.sourceforge.net/
• Slower than RAxML and PhyML
Tree visualization: formats
• Newick, NEXUS
• (((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);
• http://molecularevolution.org/resources/treeformats
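To show how the Newick string above encodes a tree, here is a minimal recursive parser sketch. It covers only the subset used in the example (nested parentheses, names, and ":length" suffixes) and does not handle quoted labels or comments; the function names and the dict layout are assumptions for illustration.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with
    keys "name", "length" and "children"."""
    text = "".join(text.split()).rstrip(";")  # drop whitespace and final ';'
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # skip "("
            children.append(node())
            while peek() == ",":
                pos += 1  # skip ","
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        name = read_until(",():")
        length = None
        if peek() == ":":
            pos += 1  # skip ":"
            length = float(read_until(",()"))
        return {"name": name, "length": length, "children": children}

    root = node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return root


def leaf_names(node):
    """List leaf names in left-to-right order."""
    if not node["children"]:
        return [node["name"]]
    return [n for child in node["children"] for n in leaf_names(child)]


newick = (
    "(((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,"
    "(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,"
    "erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,"
    "(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);"
)
tree = parse_newick(newick)
print(leaf_names(tree))
```

The root of this example has three children, which is how Newick conventionally writes an unrooted tree.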
Tree visualization
• Interactive Tree of Life: http://itol.embl.de
• http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software
All-in-one: phylogeny.fr
Phylogeny.fr (2)
Building your tree locally: SeaView
Interpretation
• Trees are just hypotheses
• They can suffer from GIGO (garbage in, garbage out)
• The most likely tree may not be the true tree
• Confidence in the topology
– Bootstrap values: should be above 0.7 (70%)
– Bootstrap values are costly to compute
– PhyML provides approximate bootstrap values that are much faster to compute
• It’s always a good idea to try more than one reconstruction method
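Bootstrap values come from resampling alignment columns with replacement, rebuilding a tree from each resampled alignment, and counting how often a given clade reappears. The sketch below shows the resampling step and the support calculation; the tree-building step is left out, so the clade sets are hypothetical stand-ins, as are the function names and the toy alignment.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement (same columns for every row)."""
    n_cols = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def clade_support(replicate_clades, clade):
    """Fraction of replicate trees that contain the given clade.

    replicate_clades -- one set of frozensets (clades) per replicate tree
    """
    clade = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if clade in clades)
    return hits / len(replicate_clades)


# Toy alignment (hypothetical sequences)
alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGAAC"}
rng = random.Random(42)  # fixed seed so the example is reproducible
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]

# Hypothetical clade sets, standing in for trees rebuilt from three replicates
replicate_clades = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
]
support = clade_support(replicate_clades, {"A", "B"})
print(f"support for clade (A,B): {support:.2f}")
```

Because each replicate requires a full tree reconstruction, the cost grows linearly with the number of replicates, which is why the procedure is expensive for ML methods.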
Supermatrix approach
• Good for obtaining a robust species tree when complete or nearly complete genomes are available (phylogenomics)
• Find all families that have exactly one representative from each genome
• Build an MSA for each family
• Concatenate all MSAs
• Build a tree based on the concatenated alignment
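The family selection and concatenation steps above can be sketched as follows. The data layout (families as lists of (genome, gene) pairs, alignments as per-family dicts keyed by genome) and all names are assumptions for this example; the per-family alignments are assumed to have been computed already by an MSA program.

```python
def single_copy_families(families, genomes):
    """Return ids of families with exactly one member from each genome.

    families -- dict: family id -> list of (genome, gene_id) pairs
    genomes  -- list of distinct genome names
    """
    selected = []
    for family_id, members in families.items():
        member_genomes = sorted(genome for genome, _ in members)
        if member_genomes == sorted(genomes):
            selected.append(family_id)
    return selected


def concatenate(alignments, family_ids, genomes):
    """Join the per-family alignments end to end, one row per genome."""
    return {
        genome: "".join(alignments[fam][genome] for fam in family_ids)
        for genome in genomes
    }


# Toy input (hypothetical genomes, genes and aligned sequences)
genomes = ["g1", "g2", "g3"]
families = {
    "famA": [("g1", "a1"), ("g2", "a2"), ("g3", "a3")],
    "famB": [("g1", "b1"), ("g1", "b1x"), ("g2", "b2"), ("g3", "b3")],  # paralog in g1
    "famC": [("g1", "c1"), ("g2", "c2")],                               # missing in g3
    "famD": [("g1", "d1"), ("g2", "d2"), ("g3", "d3")],
}
alignments = {
    "famA": {"g1": "ACG", "g2": "ACC", "g3": "A-G"},
    "famD": {"g1": "TT", "g2": "TA", "g3": "TT"},
}

chosen = single_copy_families(families, genomes)
supermatrix = concatenate(alignments, chosen, genomes)
print(chosen)
print(supermatrix)
```

Requiring exactly one representative per genome excludes families with paralogs or missing members, which keeps every row of the supermatrix the same length.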
Ciccarelli et al, Science, 2006
Eisen & Wu, Genome Biology, 2008
The bane of species trees: Horizontal Gene Transfer
• Likely when a gene tree differs from the species tree
• Can be detected by other methods
– Sequence composition deviation
– Genomic islands
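One simple form of "sequence composition deviation" is GC content: a gene whose GC fraction is far from the genome-wide average is a candidate for recent transfer. The sketch below flags such genes with a z-score; the function names, the cutoff of 2.0, and the toy genes are assumptions, and real detection tools use richer signals (codon usage, oligonucleotide frequencies).

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G and C among the A, C, G, T characters of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("G") + seq.count("C")) / acgt


def atypical_genes(genes, z_threshold=2.0):
    """Return names of genes whose GC content deviates from the mean
    by more than z_threshold standard deviations (hypothetical cutoff)."""
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    values = list(gc.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all genes have identical composition
    return [name for name, value in gc.items()
            if abs(value - mu) / sigma > z_threshold]


# Toy genome (hypothetical): five genes at 50% GC, one at 100% GC
genes = {f"gene{i}": "ACGT" * 3 for i in range(1, 6)}
genes["gene6"] = "GC" * 6
flagged = atypical_genes(genes)
print(flagged)  # ['gene6']
```

A composition outlier is only a hint: highly expressed or otherwise unusual native genes can also deviate, so candidates should be checked against gene-tree evidence.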
Network models for gene sharing
• Current research topic
Kloesges et al, Molecular Biology and Evolution, 2011
Review: Tal Dagan. Phylogenomic networks. Trends in Microbiology, 19(10), 483-491, 2011
Additional Resource
• http://www.megasoftware.net/
Books
• Bioinformatics. Baxevanis and Ouellette (Eds.) Wiley-Interscience, 2005 (3rd edition), ch. 14
• D. Mount. Bioinformatics. CSHL Press, 2004 (2nd edition), ch. 7
• The phylogenetic handbook. Lemey, Salemi and Vandamme (Eds.) Cambridge University Press, 2009 (2nd edition)