  • Phylogenetic analysis for molecular sequence data

    João C. Setubal University of São Paulo

    August 2012

    1 8/23/2012 J. C. Setubal

  • Outline

    1. What is the biological question?
    2. What input sequences should be used?
    3. Analysis pipeline: steps and components
    4. Output visualization
    5. Output interpretation

  • Biological questions

    • How do oomycete species relate to one another and to other species?

    • What is the history of a particular gene?
      – Gene trees vs. species trees
      – Lateral gene transfer

    • Other questions

  • [Image] Credit: www.apsnet.org

  • Taxonomy is not phylogeny: class Oomycota

    • Kingdom: Chromalveolata
    • Phylum: Heterokontophyta
    • Class: Oomycota
    • Orders (& families)
      • Lagenidiales
        – Lagenidiaceae
        – Olpidiopsidaceae
        – Sirolpidiaceae
      • Leptomitales
        – Leptomitaceae
      • Peronosporales
        – Albuginaceae
        – Peronosporaceae
        – Pythiaceae
      • Rhipidiales
        – Rhipidiaceae
      • Saprolegniales
        – Ectrogellaceae
        – Haliphthoraceae
        – Leptolegniellaceae
        – Saprolegniaceae
      • Thraustochytriales

    Phytophthora

  • Input sequences

    • They should belong to the same homologous family (Cf. Friday lecture)

  • Pipeline

    1. Multiple sequence alignment (MSA)
    2. Alignment editing
    3. Phylogeny reconstruction
    4. Visualization

  • Multiple Sequence Alignment

    • Generalization of pairwise alignment
      – Optimum vs. approximation
      – All practical programs for MSA produce approximations

    • DNA or amino acids
      – DNA is more sensitive, but the 3rd codon position is less informative
      – Amino acids allow more distant proteins to be included

    • Scoring matrices: BLOSUM, PAM

    • Aligned sites (a column) should be homologous
    • Output formats: Clustal, FASTA, MSF, NEXUS, PHYLIP
      – http://molecularevolution.org/resources/fileformats/converting
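
Since MSA generalizes pairwise alignment, a small sketch of the pairwise case may help. The function below computes the optimal global alignment score with the standard Needleman-Wunsch dynamic program. The function name and the match/mismatch/gap values are illustrative assumptions; a real analysis would take substitution scores from a matrix such as BLOSUM or PAM.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global (Needleman-Wunsch) alignment score of a and b.

    The scoring values are placeholders, not a real substitution matrix.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # a[i-1] aligned with a gap
                curr[j - 1] + gap,           # b[j-1] aligned with a gap
            ))
        prev = curr
    return prev[-1]
```

The table grows with the product of the two sequence lengths. Extending the same exact recurrence to many sequences at once becomes impractical quickly, which is consistent with the point above that practical MSA programs produce approximations.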

  • Programs for MSA

    • MUSCLE
      – Edgar, R.C. (2004) Nucleic Acids Res. 32(5):1792-1797
      – www.drive5.com/muscle
    • MAFFT
      – Katoh, Misawa, Kuma, Miyata (2002) Nucleic Acids Res. 30:3059-3066
      – mafft.cbrc.jp/alignment/software/
    • ClustalW/X
    • Cobalt (NCBI)
    • T-Coffee

  • Input sequences

    • Should be related to each other
    • Cannot be too long (less than ~10 kb)
    • Not too many (less than ~100)
    • (numbers vary depending on program and on computer)
    • FASTA format is best
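
The checklist above can be sketched as a small pre-flight check on a FASTA file. This is a minimal illustration: the function names are made up for this example, and the default limits simply mirror the rough numbers on the slide, which vary by program and computer.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the ID; invent one if it is empty.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {rec_id: "".join(parts) for rec_id, parts in records.items()}


def check_input(records, max_count=100, max_length=10_000):
    """Return warnings for inputs that exceed the rule-of-thumb limits."""
    warnings = []
    if len(records) > max_count:
        warnings.append(f"{len(records)} sequences (more than {max_count})")
    for rec_id, seq in records.items():
        if len(seq) > max_length:
            warnings.append(f"{rec_id} has length {len(seq)} (more than {max_length})")
    return warnings
```

An empty list from `check_input` only means the size limits are respected. Whether the sequences are actually related to each other still has to be judged separately.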

  • Alignment editing

    Credit: R. Dixon

  • Alignment editing

    • Certain columns may be uninformative
    • Sometimes humans can see better alignments
    • Manual editing
      – Jalview: www.jalview.org
        • Waterhouse et al. (2009) Bioinformatics 25(9):1189-1191
      – Seaview: http://pbil.univ-lyon1.fr/software/seaview.html
        • Gouy M., Guindon S. & Gascuel O. (2010) Molecular Biology and Evolution 27(2):221-224
    • Automatic editing: Gblocks
      – http://molevol.cmima.csic.es/castresana/Gblocks_server.html
      – Castresana, J. (2000) Molecular Biology and Evolution 17, 540-552
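
To make "uninformative columns" concrete, here is a deliberately simple filter that drops alignment columns made up mostly of gaps. It is not the Gblocks algorithm, which applies more elaborate conservation and block-length criteria. The function name and the 0.5 threshold are assumptions chosen for illustration.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Keep only columns whose fraction of '-' characters is at most max_gap_fraction.

    alignment maps sequence names to aligned strings of equal length.
    """
    names = list(alignment)
    rows = [alignment[name] for name in names]
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # zip(*rows) iterates over the alignment column by column.
    kept = [
        column for column in zip(*rows)
        if column.count("-") / len(rows) <= max_gap_fraction
    ]
    # Transpose the kept columns back into one string per sequence.
    trimmed = ["".join(chars) for chars in zip(*kept)] if kept else [""] * len(rows)
    return dict(zip(names, trimmed))
```

For example, with `{"s1": "AC-GT", "s2": "A--GT", "s3": "ACTG-"}` and the default threshold, only the third column (two gaps out of three) is removed.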

  • JALVIEW http://www.jalview.org/

  • Phylogeny reconstruction

    Credit: R. Dixon

  • Topology and branch lengths: a tree and a cladogram

    [Figure with panels A and B, including a cladogram version. Credit: Wattam et al. 2011]

    Branch lengths: number of substitutions per site

  • Unrooted tree (no outgroup)

    http://itol.embl.de

  • Rooted tree: needs outgroup

  • Phylogeny reconstruction methods

    • Distance
      – Distance matrix
    • Parsimony
      – Minimize mutations along branches
    • Maximum likelihood
      – Searches for the most likely tree under a probabilistic model
    • Bayesian inference
      – Also probabilistic, but using a Bayesian approach
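
As an illustration of the parsimony idea (minimize mutations along branches), the sketch below scores a fixed rooted binary tree with Fitch's algorithm, which counts the minimum number of state changes needed at each site. The nested-tuple tree encoding and the function names are assumptions made for this example. Real programs also search over many candidate trees, which is not shown here.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change is needed here.
        return common, left_changes + right_changes
    # The children disagree: one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )
```

With the toy alignment `{"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACT"}`, the tree `(("A", "B"), ("C", "D"))` needs 2 changes while `(("A", "C"), ("B", "D"))` needs 4, so parsimony prefers the first.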

  • Running time considerations

    • In the last century, distance and parsimony methods were dominant
      – The others were too slow
    • Now maximum likelihood has become a “standard”

  • Models of evolution

    • Except for distance methods, all other methods must rely on models for the evolution of sequences

  • Evolution of models for DNA evolution

    http://authors.library.caltech.edu/5456/1/hrst.mit.edu/hrs/evolution/public/models/sequence.html

  • Protein evolution

    • Amino acid substitution matrices
      – PAM
      – BLOSUM
      – WAG
        • Whelan and Goldman (2001) Mol. Biol. Evol. 18, 691-699

  • Models in PhyML

    • DNA
      – JC69, K80, F81, F84, HKY85, TN93, GTR, custom
    • Amino acids
      – LG, WAG, Dayhoff, JTT, Blosum62, mtREV, rtREV, cpREV, DCMut, VT, mtMAM, custom
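
To show what the simplest DNA model in this list looks like in practice, here is a sketch of JC69 (Jukes-Cantor), which assumes equal base frequencies and equal rates for all substitutions. It gives the site probabilities as a function of the distance d (expected substitutions per site), the log-likelihood of a pairwise alignment, and a maximum likelihood estimate of d found by a crude grid search and compared with the known closed form. The function names, grid step, and upper bound are illustrative assumptions and say nothing about how PhyML itself is implemented.

```python
import math


def jc69_prob(same, d):
    """JC69 probability that a site ends in the same base (same=True) or in one
    specific different base (same=False) after d substitutions per site."""
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


def log_likelihood(n_sites, n_diff, d):
    """Log-likelihood of a gap-free pairwise alignment with n_diff mismatches."""
    # Each site pattern has probability (base frequency 1/4) * (transition probability).
    return ((n_sites - n_diff) * math.log(0.25 * jc69_prob(True, d))
            + n_diff * math.log(0.25 * jc69_prob(False, d)))


def ml_distance_grid(n_sites, n_diff, step=0.001, max_d=3.0):
    """Pick the grid value of d with the highest likelihood (crude search)."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: log_likelihood(n_sites, n_diff, d))


def jc69_distance(p):
    """Closed-form JC69 distance for an observed mismatch fraction p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

For 100 sites with 10 mismatches, the grid search and the closed form should agree to within about one grid step. Tree-building programs optimize the same kind of likelihood, but over all branch lengths and tree topologies at once.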

  • Phylogeny reconstruction programs

    • PHYLIP
      – Joe Felsenstein
      – http://evolution.genetics.washington.edu/phylip.html
    • PAUP
      – David Swofford
      – http://paup.csit.fsu.edu/
    • Distance
      – Neighbor-joining, UPGMA
    • Parsimony
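
The distance approach can be sketched in two steps: compute a pairwise distance matrix from the alignment, then cluster. The example below uses the uncorrected p-distance and UPGMA, and returns only the topology as a Newick string, without branch lengths. Function names are assumptions for this illustration. PHYLIP and PAUP offer corrected distances and neighbor-joining, which this sketch does not reproduce.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned, ungapped sites at which a and b differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(alignment):
    """Cluster sequences by UPGMA and return the topology as a Newick string."""
    sizes = {name: 1 for name in alignment}  # cluster label -> number of leaves
    dist = {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }
    while len(sizes) > 1:
        closest = min(dist, key=dist.get)
        a, b = sorted(closest)
        merged = f"({a},{b})"
        for other in sizes:
            if other in closest:
                continue
            # Size-weighted average of the distances from a and b to the other cluster.
            dist[frozenset((merged, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        # Forget every distance that involves one of the two merged clusters.
        dist = {pair: d for pair, d in dist.items() if a not in pair and b not in pair}
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes)) + ";"
```

UPGMA implicitly assumes that all lineages evolve at the same rate. Neighbor-joining relaxes that assumption, which is one reason it is often preferred among distance methods.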

  • Maximum likelihood

    • RAxML
      – A. Stamatakis
      – http://www.exelixis-lab.org/
    • PhyML
      – O. Gascuel et al. Systematic Biology, 59(3):307-21, 2010
      – http://www.atgc-montpellier.fr/phyml/
    • FastTree
      – Morgan N. Price in Adam Arkin’s group
      – http://www.microbesonline.org/fasttree/
      – “FastTree can handle alignments with up to a million of sequences in a reasonable amount of time and memory”

  • A performance data point

    • An ML tree for about 500 protein sequences, about 300 aa in length each
    • RAxML or PhyML took about 10 hours
    • FastTree took less than 1 hour

  • Bayesian inference

    • MrBayes
    • Ronquist and Huelsenbeck. Bioinformatics. 2003 19(12):1572-4.
    • http://mrbayes.sourceforge.net/
    • Slower compared to RAxML and PhyML

  • Tree visualization: formats

    • Newick, NEXUS
    • (((erHomoC:0.28006,erCaelC:0.22089):0.40998,(erHomoA:0.32304,(erpCaelC:0.58815,((erHomoB:0.5807,erCaelB:0.23569):0.03586,erCaelA:0.38272):0.06516):0.03492):0.14265):0.63594,(TRXHomo:0.65866,TRXSacch:0.38791):0.32147,TRXEcoli:0.57336);
    • http://molecularevolution.org/resources/treeformats
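
To show how the Newick string above encodes a tree, here is a minimal recursive parser. Parentheses group the children of a node, commas separate siblings, `name:length` attaches a branch length, and a semicolon ends the tree. The function names and the tuple representation are assumptions for this sketch, and it skips parts of the format such as quoted labels and bracketed comments.

```python
def parse_newick(text):
    """Parse a Newick string into nested (name, branch_length, children) tuples."""
    text = "".join(text.split())  # ignore whitespace and line breaks
    pos = 0

    def peek():
        return text[pos:pos + 1]

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        name = read_until(",():;")  # empty for unnamed internal nodes
        length = None
        if peek() == ":":
            pos += 1
            length = float(read_until(",();"))
        return name, length, children

    tree = node()
    if text[pos:] != ";":
        raise ValueError("expected a single ';' at the end")
    return tree


def leaf_names(tree):
    """Return the leaf labels of a parsed tree, left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]
```

Applied to the example tree on this slide, `leaf_names` returns the ten sequence labels from `erHomoC` to `TRXEcoli`. The root has three children, which is the usual Newick way of writing an unrooted tree.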

  • Tree visualization

    • Interactive Tree of Life: http://itol.embl.de
    • http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software

  • All-in-one: phylogeny.fr

  • Phylogeny.fr (2)
