PHASE: a Software Package for PHylogenetics And Sequence Evolution

Version 1.1, April 24, 2003

Copyright 2002, 2003 by the University of Manchester.

PHASE is distributed under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Howsun Jow and Vivek Gowri-Shankar

∗bug report: [email protected]

Why is PHASE different from other phylogenetic programs?

This package is designed specifically for use with RNA sequences that have a conserved secondary structure, e.g., rRNA and tRNA. It is well known that compensatory substitutions occur in the paired regions of RNA secondary structures; this means that substitutions occurring on one side of a pair are correlated with substitutions on the other side. Most phylogenetic programs assume that each site in a molecule evolves independently of the others, but this assumption is not valid for RNA genes.

Substitution models of sequence evolution that consider pairs of sites rather than single sites are implemented in this package, along with the standard nucleotide substitution models in common use. When an RNA molecule with a secondary structure is used in conjunction with an RNA substitution model, PHASE requires a structure-based alignment of the sequences with the consensus secondary structure indicated in bracket and dot notation at the top of the alignment. We assume that you can provide this structure.

It is now commonplace to perform combined analyses of heterogeneous sequence data when nucleotides with different patterns of evolution are sequenced for a set of studied species. With PHASE it is possible to use several substitution models simultaneously (for paired and/or unpaired sites), for example when analysing protein-coding genes or when both stems and loops of RNA genes are used.

PHASE provides a Markov Chain Monte Carlo sampler to generate large numbers of possible phylogenetic trees with probability proportional to their likelihood. This is a Bayesian statistical method that allows posterior probabilities to be generated for alternative trees and alternative clades. These posterior probabilities provide a sound statistical measure of support of alternative phylogenetic hypotheses, and they remove the need for bootstrapping. Where many alternative arrangements of a given set of species exist, it is possible to calculate posterior probabilities for all the alternative arrangements of these species in a convenient way.

Standard Maximum Likelihood techniques for inferring the optimal tree with any of the DNA or RNA evolution models are also implemented.

The program’s features include:

• Bayesian estimation of phylogenies and substitution model parameters

• standard ML search algorithms for inferring the optimal tree with optional topology constraints

• 6, 7 and 16 state RNA models

• standard 4 state DNA models

• invariant and discrete gamma model for substitution rate heterogeneity between sites

• mixing of molecular data types in a single analysis

Journal publications:

• C. Hudelot, V. Gowri-Shankar, H. Jow, M. Rattray and P. Higgs. “RNA-based Phylogenetic Methods: Application to Mammalian Mitochondrial RNA Sequences”. Molecular Phylogenetics and Evolution (in press, 2003).

• H. Jow, C. Hudelot, M. Rattray and P. Higgs. “Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution”. Molecular Biology and Evolution, 19(9):1591-1601 (2002).

Acknowledgements

Howsun Jow and Vivek Gowri-Shankar carried out this work as PhD students at Manchester University under the supervision of Magnus Rattray. We gratefully acknowledge contributions to the design, documentation and testing from Paul Higgs and Cendrine Hudelot. The PHASE software was developed as part of a BBSRC funded research project into RNA-based phylogenetic methods (investigators: Paul Higgs and Magnus Rattray).

Contents

Why is PHASE different from other phylogenetic programs?
Acknowledgements

Introduction
  How to read this manual?
  Acquiring and installing the software
    MS-Windows installation
    Unix-like system installation
  Description of programs in the PHASE package
    optimise and mlphase
    mcmcphase and consensus
    likelihood
    simulate
    analyser
  Running the programs

1 Using programs in the PHASE package
  1.1 Inputs/outputs in PHASE
    1.1.1 Data file format
    1.1.2 Control file format
    1.1.3 Tree file format
    1.1.4 Substitution model parameters file format
    1.1.5 Parameters displayed on the screen and output of each program
    1.1.6 Clade file format
  1.2 Control files
    1.2.1 Structure of the control files
    1.2.2 Datafile block
    1.2.3 Model block
  1.3 Using the programs in the PHASE package
    1.3.1 likelihood
    1.3.2 optimise
    1.3.3 simulate
    1.3.4 mlphase
    1.3.5 mcmcphase
    1.3.6 analyser

2 Elements of phylogenetic theory
  2.1 Phylogenetic trees
    2.1.1 Unrooted phylogenies
    2.1.2 String representation of a tree
    2.1.3 Branch lengths
  2.2 Nucleotide substitution models
    2.2.1 A Markov model of substitution
    2.2.2 Transition matrices
    2.2.3 Nucleotide substitution models implemented in PHASE
  2.3 Paired-site substitution models
    2.3.1 RNA secondary structure
    2.3.2 Theory of compensatory substitutions
    2.3.3 Base-paired substitution models implemented in PHASE
  2.4 Refinements to substitution models
    2.4.1 Invariant and discrete gamma models
    2.4.2 The MIXED model
  2.5 Bayesian phylogenetics
    2.5.1 Bayes’ theorem
    2.5.2 Markov chain Monte-Carlo (MCMC)
    2.5.3 Priors and proposals
    2.5.4 Pitfalls of Markov chain Monte-Carlo techniques

A Some examples of control files
  A.1 Control file for likelihood
  A.2 Control file for optimise
  A.3 Control file for simulate (1)
  A.4 Control file for simulate (2)
  A.5 Control file for mlphase (1)
  A.6 Control file for mlphase (2)
  A.7 Control file for mcmcphase (1)
  A.8 Control file for mcmcphase (2)

Bibliography

Introduction

How to read this manual?

People with a good background in phylogenetic inference might be interested only in the first chapter, which explains how to use PHASE. The second chapter contains a few elements of the theory of phylogenetic inference with some valuable information about PHASE that can make technical details in the first chapter clearer. Experienced phylogeneticists might find it useful to read the RNA substitution models section (2.3) to learn about RNA substitution models and the Bayesian phylogenetics section (2.5) if they are not familiar with Markov Chain Monte-Carlo (MCMC) techniques.

Once you have read the short description of the programs in this introduction, you can try them straightaway with the examples provided. However, be warned that inferences using the mammals dataset of 69 species and the maximum likelihood inference with the primates (primates-rna-ml.control) require at least one day. For a first trial you should use the other control files instead.

The first chapter of this manual should be used as a reference only and to clarify obscure points about PHASE programs. The HTML version of these pages is probably more convenient for finding useful information.

Acquiring and installing the software

PHASE can be downloaded from http://www.bioinf.man.ac.uk/resources; it is currently available for Windows and Unix/Linux platforms.

MS-Windows installation

Download the archive phase-1.1-MSWin-exec.zip and decompress it into the directory of your choice, for instance c:\Phase\. PHASE does not require any other installation procedure and you can therefore test the software straightaway with the provided example files.

Unix-like system installation

For Unix and Linux systems we recommend that you compile the program yourself. However, if the process fails and you cannot produce a working executable, you can try the precompiled Linux version in the archive phase-1.1-linux-i586-exec.tgz.

To compile the program yourself:

• decompress and extract the archive into the directory of your choice

tar -xvvzf phase-1.1.tgz

• enter the newly created phase-1.1 directory

cd phase-1.1

• compile with the provided Makefile

make

We assume here that you have a recent version of the default C++ compiler g++ on your platform. You cannot compile PHASE with gcc v2.96 or older. You can check the gcc version installed on your system by typing “g++ -v”. You might want to (or might have to) edit the makefiles in order to adapt them to your specific system configuration. In that case please have a look at the readme file first.

PHASE uses the BLAS and LAPACK library routines. Unless your system is equipped with optimised versions of these mathematical libraries, in which case you are strongly advised to modify the makefile, generic versions will be built during the compilation process. The g77 compiler and the libg2c library are required but they should already be present on your system.

Description of programs in the PHASE package

The PHASE (PHylogenetics and Sequence Evolution) package consists of two main programs, mlphase and mcmcphase.

• mlphase performs maximum-likelihood inference

• mcmcphase is a Bayesian phylogenetic inference program

There are five other smaller programs in the package:

• analyser checks the content of the molecular sequences

• likelihood computes the likelihood given a specified evolution model

• simulate generates sequences according to a specified evolution model

• optimise is a smaller version of mlphase without tree search capabilities

• consensus is used with mcmcphase to summarize the results of an MCMC run.

Below we summarize the behaviour of these programs. Please refer to the first chapter in order to learn how to use them.

optimise and mlphase

The mlphase program is a maximum-likelihood phylogenetic inference program similar to dnaml in PHYLIP¹ and baseml in PAML². The mlphase program has a broad range of functionalities and can be used with a large number of evolutionary substitution models including those which take into account the RNA secondary structure in the evolution of RNA sequences (see sections 2.2 and 2.3).

The mlphase program has two main modes of operation:

1. Optimisation of user-defined trees: estimation of maximum likelihood (ML) branch lengths and, optionally, evolutionary model parameters, given a set of labelled molecular sequences, for a user-defined set of phylogenetic tree topologies.

2. Maximum-likelihood tree search: the program aims at finding the model (tree topology, associated branch lengths and, optionally, sequence evolution model parameters) that yields the highest likelihood. The user can choose the topology search algorithm to be used among the three available:

• Simple exhaustive search: all the possible phylogenies are considered.

• Branch and bound search: non-optimal phylogenies are rejected before evaluation.

• Heuristic search via stepwise addition: greedy search for the best topology.

Constraints can be placed on the phylogenetic tree topologies that are considered during ML inference in order to reduce the search space and the computation time.
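To give a sense of why an exhaustive search is only practical for small data sets, and why topology constraints help, recall the standard combinatorial result that n labelled species admit (2n − 5)!! distinct unrooted bifurcating topologies for n ≥ 3. The sketch below is a general illustration in Python, not a description of how mlphase enumerates trees; the function name is hypothetical.

```python
def count_unrooted_topologies(n_species: int) -> int:
    """Number of distinct unrooted bifurcating topologies for n labelled species.

    Uses the double factorial (2n - 5)!! = 3 * 5 * ... * (2n - 5), valid for n >= 3.
    For fewer than three species there is a single (trivial) topology.
    """
    count = 1
    for factor in range(3, 2 * n_species - 4, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_unrooted_topologies(n))
```

The count grows so quickly that fixing even a few clades in advance removes a very large share of the candidate topologies.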

¹ http://evolution.genetics.washington.edu/phylip.html
² http://abacus.gene.ucl.ac.uk/software/paml.html

The optimise program is a simpler version of mlphase and is provided for convenience. This program returns the ML branch lengths and ML evolutionary parameters of a fixed user-defined tree topology, for instance a consensus tree found with an MCMC run. This is equivalent to the first mode of mlphase with only one tree. The optimise program requires fewer parameters than mlphase; it is simpler to use and allows quick experimentation with different initial parameters when entrapment in a local maximum of the likelihood is suspected.

mcmcphase and consensus

The mcmcphase program performs Bayesian phylogenetic inference. It uses a Markov Chain Monte Carlo algorithm to sample from the posterior probability distribution of phylogenetic tree topology, branch lengths and sequence evolution model parameters. For an explanation of Bayesian phylogenetics and a description of the MCMC sampling algorithms used in mcmcphase, please consult the Bayesian Phylogenetics section (section 2.5) or Jow et al. (2002).

The consensus program is used to exploit the results of an MCMC run. This program produces two consensus models (using the mean and the median of the parameters in the sample) and can return consensus branch lengths for any supplied topology, e.g., a PHYLIP-style consensus tree, if similar topologies were sampled during the run.

likelihood

The likelihood program computes the likelihood of a phylogeny with respect to any of the implemented substitution models.

simulate

The program simulate generates molecular sequences according to a user-specified tree (topology & branch lengths) and substitution model (type & parameters).

analyser

At the moment analyser outputs basic statistics about a sequence data file. It can also be used to locate the sites with too many gaps in the sequences, in case you decide to remove them. The analyser program can be quite useful to validate your secondary structure alignment and to set a maximum limit for the mismatch frequency at each site (see section 2.3.1).

Running the programs

Programs in the PHASE package are run through the command line under both Unix-like systems and MS-Windows systems. For Windows operating systems, you have to open an MS-DOS command window to use them. Click on “Run...” in the “Start...” menu and type cmd in the newly opened dialog box. You might have to type command instead of cmd depending on your MS-Windows version. Once the command window is opened, you have to move to the directory where you extracted the software. At the shell prompt, you can type, for example, cd c:\Phase\. You can then run any program of the PHASE package.

Run the programs by typing their name followed by the arguments they require. In most cases, PHASE’s programs take one argument, which is the name of a control file (see section 1.2). You can type for instance:

mcmcphase control\hiv-dna-mcmc-1.control ³

or,

optimise control/primates-rna-optimise-7a.control

After installation, if the examples are all present, these commands should work.

³ Please note that the use of the ‘\’ or ‘/’ characters is dependent on your operating system. On Unix systems you might have to type “./” before the program name.

1 - Using programs in the PHASE package

1.1 Inputs/outputs in PHASE

1.1.1 Data file format

All molecular sequence data used by the PHASE programs are stored in a common format. The data file format is similar to the PHYLIP data file format but has a few minor modifications.

A data file is divided into four sections but two of them are not compulsory. Comments can be included by preceding the commented lines with a hash (#) symbol. The entire commented line is ignored by the program. Taking a look at the example data files in the package (*.dna, *.rna, and *.mix in the data directory) will make the following explanations easier to understand.

File content

The first non-comment section of the data file is a single line containing

1. the number of species

2. the length of the molecular sequences

3. a code which can be either DNA for usual unpaired molecular sequences or RNA for base-paired molecular sequences. In fact the purpose of this code is to indicate whether a pairing mask (see below) is present in the data file and you can use the code RNA even if some nucleotides are unpaired.

For example, the line

5 100 DNA

at the beginning of a data file indicates that there are five non-base-paired sequences of length 100 in the file.

For convenience a third code, MIXED, can be used instead of RNA when the user is using a concatenation of RNA loops and stems (see section 2.3.1) but should be avoided in other cases. More details on the specific meaning of the code MIXED are given in the class section below. The lines,

10 300 RNA

and,

10 300 MIXED

both indicate that there are ten sequences of length 300 in the file and that a pairing mask is associated with them.
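The header line described above can be read with a few lines of Python. This is only an illustrative sketch: the names DataHeader and parse_header are hypothetical, and it assumes the code is written in upper case exactly as in the examples.

```python
from typing import NamedTuple


class DataHeader(NamedTuple):
    n_species: int          # number of species (sequences) in the file
    length: int             # length of the molecular sequences
    code: str               # "DNA", "RNA" or "MIXED"
    has_pairing_mask: bool  # True when a pairing mask is expected to follow


def parse_header(line: str) -> DataHeader:
    """Parse the first non-comment line of a data file, e.g. '5 100 DNA'."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected three fields, got {len(fields)}: {line!r}")
    n_species, length, code = int(fields[0]), int(fields[1]), fields[2]
    if code not in ("DNA", "RNA", "MIXED"):
        raise ValueError(f"unknown sequence code: {code!r}")
    # RNA and MIXED both announce a pairing mask; DNA means there is none.
    return DataHeader(n_species, length, code, code != "DNA")


if __name__ == "__main__":
    print(parse_header("5 100 DNA"))
    print(parse_header("10 300 MIXED"))
```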

Pairing mask

The second section of the data file is the pairing mask. This mask is only required when sequences contain some base-paired nucleotides (in that case the code should be RNA or MIXED). In the case of fully unpaired sequences, i.e., when the DNA code is used, the pairing mask must not be provided. The pairing mask is in the form of a mathematical expression consisting of round brackets. Corresponding brackets indicate that the bases at those positions in the sequence form a base-pair in the RNA secondary structure. Unpaired sites can be indicated with a dot “.” or a hyphen “-”. For example a sequence ACCAGAUGGU with a pairing mask (((.(.)))) indicates that the sequence is made of the base-pairs AU-CG-CG-GU and unpaired sites A-A.
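The bracket matching described above can be made explicit with a small stack-based routine. The following Python sketch is an illustration only (parse_pairing_mask is a hypothetical name, not part of PHASE); it reproduces the ACCAGAUGGU example from the text.

```python
def parse_pairing_mask(mask: str):
    """Return (pairs, unpaired) as 0-based positions for a bracket/dot mask.

    '(' and ')' mark the two sides of a base-pair; '.' and '-' mark unpaired sites.
    """
    pairs, unpaired, open_positions = [], [], []
    for position, symbol in enumerate(mask):
        if symbol == "(":
            open_positions.append(position)
        elif symbol == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {position + 1}")
            pairs.append((open_positions.pop(), position))
        elif symbol in ".-":
            unpaired.append(position)
        else:
            raise ValueError(f"unexpected symbol {symbol!r} in pairing mask")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1] + 1}")
    return sorted(pairs), unpaired


if __name__ == "__main__":
    sequence = "ACCAGAUGGU"
    pairs, unpaired = parse_pairing_mask("(((.(.))))")
    print([sequence[i] + sequence[j] for i, j in pairs])  # base-pairs
    print([sequence[i] for i in unpaired])                # unpaired sites
```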

Molecular sequences

The third section of the data file contains the molecular sequences. Indels (-) and ambiguities (purine (R), pyrimidine (Y), unknown (N or ?)) are allowed. Sequences can be written in one of two formats. The first is the non-interleaved format. This consists of an identifying label for each sequence followed by the whole sequence. An example is:

2 8 DNA
Mouse ACCGUGGU
UCCAUAAA
Rat ACUGUGGC
UCGAUAUA

There can be no spaces in the label though the sequence itself can be formatted into blocks using multiple lines and spaces. An alternate way of specifying the sequences is using the interleaved format. This enables the sequences to be split into homologous blocks. The non-interleaved example given above could equivalently be written:

2 8 DNA
Mouse ACCG
Rat ACUG
UGGUUCCAUAAA
UGGCUCGAUAUA

Notice that only the first interleaved block should contain labels. Subsequent interleaved blocks are assumed to have the same labels and to be in the same order.
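The way interleaved blocks are stitched back together can be sketched in Python as follows. This is an illustration under stated assumptions, not PHASE's own reader: merge_interleaved is a hypothetical name, the header line and pairing mask are assumed to have been removed already, and every species is assumed to occupy exactly one line per block.

```python
def merge_interleaved(lines, n_species):
    """Concatenate interleaved blocks into one full sequence per label.

    The first n_species lines carry 'label sequence'; later lines carry only
    sequence fragments, in the same species order as the first block.
    """
    rows = [ln.split() for ln in lines
            if ln.strip() and not ln.lstrip().startswith("#")]
    if len(rows) % n_species != 0:
        raise ValueError("number of sequence lines is not a multiple of n_species")
    labels, fragments = [], []
    for index, fields in enumerate(rows):
        if index < n_species:
            labels.append(fields[0])                  # label has no spaces
            fragments.append(["".join(fields[1:])])   # spaces inside sequence dropped
        else:
            fragments[index % n_species].append("".join(fields))
    return {label: "".join(parts) for label, parts in zip(labels, fragments)}


if __name__ == "__main__":
    block_lines = ["Mouse ACCG", "Rat ACUG", "UGGUUCCAUAAA", "UGGCUCGAUAUA"]
    print(merge_interleaved(block_lines, n_species=2))
```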

Class section

The fourth section is not compulsory and is used when performing a combined analysis of heterogeneous data sets (e.g., loops and stems of an RNA molecule, protein coding genes with three codon positions or concatenated data of different genes with different evolutionary patterns). You can safely skip this section if you plan to study DNA sequences or RNA helices only (i.e., no “.” in the pairing mask) with only one appropriate nucleotide/base-pair substitution model.

The aim of this section is to assign each nucleotide/pair to a class. Each class is expected to have a different pattern of evolution. This section consists of a sequence of integers which correspond to the class of each nucleotide. For instance, the class section of a protein coding gene may look like:

... 2 3 1 2 3 1 2 3 1 2 3 1 2 ...

When the data file contains a class section, programs in the PHASE package expect it to comply with the following set of rules:

• class labels are separated by a space

• classes are labelled from 1 to K, where K is the number of distinct classes

• the number of labels equals the length of the sequences

• when used in conjunction with a base-paired structure, the two components of a paired site are in the same class.

Since PHASE is specifically designed for the analysis of RNA sequences with secondary structure, the most common use of the class section should be the obvious separation of unpaired and base-paired sites into two distinct classes. The code MIXED can replace the code RNA to avoid a tiresome task and let PHASE know that it can simply use the provided pairing mask to build the class section (e.g., (((.())))..) implies 2 2 2 1 2 2 2 2 2 1 1 2). When the code MIXED is used the class section is not compulsory and the unpaired and paired sites will respectively be attributed to the classes 1 and 2 automatically¹.
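The class-section rules and the automatic MIXED assignment can be sketched in Python as follows. Both function names are hypothetical and this is not PHASE's own validation code; the pairing mask passed to the checker is assumed to be balanced, and the example reuses the mask (((.(.)))) from the pairing mask section.

```python
def classes_from_mask(mask):
    """Automatic assignment: class 1 for unpaired sites, class 2 for paired sites."""
    return [2 if symbol in "()" else 1 for symbol in mask]


def check_class_section(labels, seq_length, mask=None):
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    if len(labels) != seq_length:
        problems.append("number of labels differs from the sequence length")
    if labels and set(labels) != set(range(1, max(labels) + 1)):
        problems.append("classes must be labelled from 1 to K without gaps")
    if mask is not None and len(mask) == len(labels):
        open_positions = []
        for position, symbol in enumerate(mask):
            if symbol == "(":
                open_positions.append(position)
            elif symbol == ")" and open_positions:
                partner = open_positions.pop()
                if labels[partner] != labels[position]:
                    problems.append(
                        f"paired sites {partner + 1} and {position + 1} "
                        "are in different classes")
    return problems


if __name__ == "__main__":
    mask = "(((.(.))))"
    labels = classes_from_mask(mask)
    print(" ".join(str(label) for label in labels))
    print(check_class_section(labels, len(mask), mask))
```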

Usually classes are used to determine the model of sequence evolution that PHASE uses for each nucleotide. Each class in the data file is treated by its own model of nucleotide substitution during the phylogenetic inference. The models are defined later in the model section of the control file (see 1.2.3). Let us just point out here that if you use the MIXED type for your data with the automatic assignment, i.e., without the class section, you have to make sure that your first and second models are respectively a nucleotide substitution model and a base-pair substitution model when you declare your models of evolution. We will return to this point later on.

¹ When the class section is present, the code MIXED is equivalent to RNA: the user assignment prevails over the automatic one.

1.1.2 Control file format

Most programs in the package use a control file. The purpose of this file is to assign a specific task to the program, i.e., the sequences to analyse, the assumed substitution model, and other specific parameters. Control files are the key to using the software and two sections are devoted to them. Section 1.2 describes the structure of this file and the features common to many programs in the package. Section 1.3 presents the specific parameters for each program.

1.1.3 Tree file format

PHASE can output trees into a file and sometimes the user has to provide a file which contains one or more trees. A tree file is simply a file with one or more phylogenies written in the computer-readable format described in the tree representation section (2.1.2).

1.1.4 Substitution model parameters file format

With a model parameters file, one can provide initial values for the parameters of the substitution models used. PHASE can also create a model parameters file to store the results concerning a substitution model after a run (these could be Maximum Likelihood Estimate (MLE) parameters or Mean Posterior Estimate (MPE) parameters).

Model parameters file content

The content of this file is highly dependent on the substitution model used and we cannot describe it in general terms. The fields used to assign a value to each parameter are hopefully quite self-explanatory as long as you know the underlying substitution model. You might need to have a look at the transition matrices section (2.2.2) to understand the PHASE concept of “rate ratios” in substitution models. Each “Rate ratio i” parameter in this file stands for the parameter αi in the transition matrix of the corresponding model. Transition matrices for all implemented substitution models are given in section 2.2.3 for DNA models and in section 2.3.3 for RNA models.

Producing a model parameters file

Model parameters files and control files share the same structural elements. Some examples can be found in the data directory (*.model). Although it is quite easy to understand the content of a model parameters file without explanations when reading it, you might find it harder to produce your own file from scratch and without guidance if you want to initialise a substitution model with specific values. It is possible to use the simulate program to generate a stub of this file for each model implemented in the PHASE package. This skeleton can be modified easily to suit your needs. See section 1.3.3 for details.

1.1.5 Parameters displayed on the screen and output of each program

Each program in the package will output information on the screen, and one or more files to store the results permanently. The outputs will be reviewed individually for each program in section 1.3.

The content displayed on the screen is usually quite easy to understand, but you might be a bit confused by the parameters of substitution models. PHASE outputs two kinds of matrices on the screen:

• one “rate ratios” matrix R

• one transition matrix Q

These matrices are described in the transition matrices section (2.2.2). Other parameters have a straightforward meaning.

1.1.6 Clade file format

The user is allowed to specify some invariant clades to reduce the number of possible topologies when using mlphase. A clade file contains a list of monophyletic clades in newick format (see section 2.1.2). All studied species must appear once (and only once) in the file, either alone or in a clade. Here is a simple clade file example for 6 species:

(Specie5,Specie6);
(Specie4,(Specie3,Specie2));
Specie1;
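The "once and only once" rule can be checked before launching a long run. The Python sketch below is an illustration, not part of PHASE: check_clade_file is a hypothetical name, and it assumes that clade entries contain only labels, brackets, commas and semicolons (no branch lengths).

```python
import re
from collections import Counter


def check_clade_file(text, species):
    """Compare the labels found in a clade file with the expected species.

    Returns three sorted lists: species missing from the file, species that
    appear more than once, and labels that are not in the species list.
    """
    counts = Counter(re.findall(r"[^\s(),;]+", text))
    missing = sorted(set(species) - set(counts))
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    unknown = sorted(set(counts) - set(species))
    return missing, duplicated, unknown


if __name__ == "__main__":
    clades = "(Specie5,Specie6);\n(Specie4,(Specie3,Specie2));\nSpecie1;\n"
    expected = [f"Specie{i}" for i in range(1, 7)]
    print(check_clade_file(clades, expected))
```

Three empty lists mean that every species appears exactly once.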

1.2 Control files

Most programs in the PHASE package have their options set using a simple text file. We call this file the control file. Although the content of this file may differ for each program in the package, its structure remains the same. Some control files are provided as examples with the package (*.control in the control directory). The easiest and safest way to use PHASE is to copy one of these examples and to adapt it to your needs.

1.2.1 Structure of the control files

A control file contains logical blocks (e.g., DATAFILE block, MODEL block, ...) and control lines. Lines preceded by a hash (#) symbol are considered comments and ignored. Comments can be placed anywhere.

A control line is used to define a parameter and give it a value. It has the format:

label = value

The order in which control lines are provided in the control file is not important but they must appear in the right block. Note that PHASE is case sensitive: “Tree file” and “Tree File” are two different labels. At the moment no warning is issued if the user mistypes an optional parameter. Please check your control files against the provided examples, otherwise PHASE might miss some important parameters without you noticing it.

A block is a container. It contains control lines but can also contain other blocks. The block BLOCKNAME begins with the tag:

BLOCKNAME

and ends with the tag:

\BLOCKNAME

Tags must be placed alone on their line. By convention, block names are all uppercase.
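The block and control-line structure lends itself to a small recursive-descent style reader. The Python sketch below is only an illustration of that structure (parse_control is a hypothetical name, not a PHASE function). It assumes tags are written exactly as shown in this section, i.e. the bare block name to open and the name preceded by a backslash to close; adjust it if the example control files shipped with the package delimit tags differently.

```python
def parse_control(text):
    """Parse control-file text into nested dicts: block name -> {label: value}."""
    root = {}
    stack = [(None, root)]  # (block name, block content), innermost block last
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if "=" in line:
            # control line: "label = value", stored in the innermost open block
            label, value = line.split("=", 1)
            stack[-1][1][label.strip()] = value.strip()
        elif line.startswith("\\"):
            # end tag: must close the innermost open block
            name = line[1:]
            if stack[-1][0] != name:
                raise ValueError(f"unexpected end tag for block {name!r}")
            stack.pop()
        else:
            # begin tag: open a new block nested in the current one
            block = {}
            stack[-1][1][line] = block
            stack.append((line, block))
    if len(stack) != 1:
        raise ValueError(f"block {stack[-1][0]!r} is never closed")
    return root


if __name__ == "__main__":
    example = "\n".join([
        "# sequences to analyse",
        "DATAFILE",
        "Data file = data/sequences.dna",
        "Interleaved data file = yes",
        "\\DATAFILE",
        "MODEL",
        "Model = REV",
        "\\MODEL",
    ])
    print(parse_control(example))
```

Labels are kept exactly as typed, which mirrors the case sensitivity mentioned above: a reader of this kind cannot tell a mistyped optional label from a valid one.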

In the remainder of this document, parameters of the control files are colored depending on their status. Compulsory parameters are in red and you must provide a value for them. Optional parameters are in green and they do not need to appear in the control file. Often, a default value will be assumed for optional parameters. Some fields are dependent on the presence and/or values of other parameters and their presence (or absence) is compulsory under certain conditions. These conditional parameters are in orange.

1.2.2 Datafile block

Almost all programs in the PHASE package require a DATAFILE block to parse analysed sequences. As stated previously, the DATAFILE block begins with the tag DATAFILE alone on a line and ends with the tag \DATAFILE alone on a line. The DATAFILE block contains some necessary information which is not included in the data file itself (see section 1.1.1 for the format of this file); it contains the following control lines:

• Data file: the location of the molecular sequences file to be used.

Data file = data/sequences.dna

• Interleaved data file: a yes/no option that specifies whether the molecular data is interleaved.

Interleaved data file = yes

• Outgroup: the label of the outgroup sequence (see section 2.1.1). The inference techniques used in PHASE produce unrooted phylogenies and using an outgroup in your study is not required. However, PHASE requires this parameter to produce a unique newick representation (2.1.2) for unrooted trees.

Outgroup = Mole

• Heterogeneous data models: a yes/no parameter which specifies whether the data file contains a class section. The default value is no and the class section of your data file will be ignored if you forget this field.

Heterogeneous data models = yes

1.2.3 Model block

Most programs in the PHASE package require the specification of a substitution model for sequence evolution. This is the purpose of the MODEL block. The MODEL block is delimited by the MODEL and \MODEL tags. It contains the name of the substitution model followed by parameters (and sometimes blocks) specific to the model (see section 2.2 for background information on substitution models of nucleotide evolution).

Simple substitution model

Depending on the data to be analysed, the PHASE package can be used with a wide variety of DNA substitution models or RNA-specific base-paired models (see sections 2.2.3 and 2.3.3 for a review of these models). The content of the MODEL block is the same for all these models and the parameters are:

• Model: the model’s name, by convention it should be all upper case.Model = REV

Nucleotide substitution models implemented include JC69, K80, HKY85, TN93 and REV.Base-paired substitution models implemented include RNA6A, RNA6B, RNA7A, RNA7D,RNA16A.

• Discrete gamma distribution of rates: the discrete gamma model (see section 2.4.1) can be used toaccount for among site rate variation. Use yes/no values to turn this option on/off. When a discretegamma model is used, PHASE expects the number of gamma categories to be specified. By defaultthe discrete gamma model is not used.

Discrete gamma distribution of rates = yes

• Number of gamma categories: when the discrete gamma model is used, you have to provide an integer to specify the desired number of discrete gamma categories.

Number of gamma categories = 5

• Invariant sites: alternatively, or in conjunction with the discrete gamma model, the user can allow a proportion of sites to be invariant, i.e., with zero rate of evolution. The default value is no.

Invariant sites = yes

Mixed model for combined analyses of heterogeneous data

To study heterogeneous sequences, several models are required. The mixed model (see section 2.4.2) allows these models to work concurrently.

• Model: this field contains the name of the model, which is MIXED.

Model = MIXED

• Number of models: the number of models used concurrently. If a class section was provided with the data file then the number of models should be the same as the number of classes. If you used the flag MIXED in your data file and did not provide a class section then this parameter has to be set to 2 and the two models must be a DNA substitution model and a base-paired substitution model respectively.

Number of models = 3

• MODELi block: each model used in the mixed model must be defined in its own block. If the number of models is n then the MODEL block must contain n blocks named MODEL1, MODEL2, . . . , MODELn. The content of these blocks is the same as for a simple substitution model block.

MODEL
Model = MIXED
Number of models = 3
MODEL1
    Model = REV
    Invariant sites = yes
\MODEL1
MODEL2
    Model = RNA7A
    Discrete gamma distribution of rates = yes
    Number of gamma categories = 5
\MODEL2
MODEL3
    Model = RNA7D
    Invariant sites = no
    Discrete gamma distribution of rates = no
\MODEL3
\MODEL
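Because the Number of models field must agree with the number of MODELi blocks, it can be convenient to generate a MIXED model block rather than type it by hand. The Python sketch below is only an illustration: the function render_mixed_model is hypothetical and not part of PHASE, and the indentation it emits is cosmetic.

```python
def render_mixed_model(models):
    """Return the text of a MIXED MODEL block for a list of sub-model field dicts."""
    lines = ["MODEL", "Model = MIXED", f"Number of models = {len(models)}"]
    for i, fields in enumerate(models, start=1):
        lines.append(f"MODEL{i}")                       # opening tag of sub-block i
        lines.extend(f"    {name} = {value}" for name, value in fields.items())
        lines.append(f"\\MODEL{i}")                     # closing tag of sub-block i
    lines.append("\\MODEL")
    return "\n".join(lines)


block = render_mixed_model([
    {"Model": "REV", "Invariant sites": "yes"},
    {"Model": "RNA7A",
     "Discrete gamma distribution of rates": "yes",
     "Number of gamma categories": 5},
    {"Model": "RNA7D"},
])
print(block)
```

Deriving the count from the list of sub-models means the two can never disagree.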

1.3 Using the programs in the PHASE package

Each program in the PHASE package requires a specific control file, the content of which is described here. As in the previous section, compulsory parameters appear in red, optional parameters in green and conditional parameters, which depend on the others, in orange.

1.3.1 likelihood

Using likelihood

The likelihood program is used to compute the likelihood of a model of evolution (i.e., tree + parameterised substitution model) given a set of studied sequences. To use likelihood, one has to provide a phylogeny for the taxa under investigation (i.e., topology and branch lengths) and a substitution model for nucleotide evolution with user-defined parameters. To use likelihood, type at the command-line:

likelihood likelihood-control-file

where likelihood-control-file is a valid control file for the likelihood program. For verification purposes, likelihood outputs the phylogenetic tree used on the screen before the likelihood value. Unlike most other PHASE programs, likelihood does not send any results to a file.

Control file for likelihood

An example of a valid control file for likelihood can be found in appendix A.1. In its control file, the likelihood program requires the specification of:

• a DATAFILE block: see the data file block section (1.2.2).

• a MODEL block: see the model block section (1.2.3).

• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (section 2.1.2), with branch length values.

Tree file = data/mammals-consensus.tree

• Model parameters file: the name of the file containing parameter values for the model defined in the MODEL block above. The simulate program can help you to produce this file.

Model parameters file = data/mammals-consensus.model

1.3.2 optimise

Using optimise

The program optimise is used to compute maximum-likelihood estimates (MLE) for the branch lengths and substitution model parameters of a given model of evolution (i.e., a fixed tree topology and a specified substitution model with free parameters). One can specify some initial values for branch lengths and substitution model parameters to speed up the convergence or to detect trapping in local maxima of the likelihood function. To use optimise, type at the command-line:

optimise optimise-control-file

where optimise-control-file is a valid control file for the optimise program.

When launched, optimise displays the initial tree and the initial likelihood on the screen and begins the optimisation. Once it is finished, the ML substitution model parameters are printed on the screen and saved in the “.output” file with the ML tree and the value of the maximum likelihood. The ML tree is also saved in the “.tree” file and a “.model” file (see input section 1.1.4) is created to store the MLE for the substitution model parameters.

Control file for optimise

An example of a valid control file for optimise can be found in appendix A.2. The control file of the optimise program must/may provide:



• Tree file: the name of the file containing the phylogeny, i.e., a tree in the Newick format (see section 2.1.2) with optional initial branch length values.

Tree file = mammals69-mix-consensus.tree

• Random seed: the integer value provided with this field is used to initialise the random number generator (used to draw random initial branch lengths if they are not provided).

Random seed = 1

• Starting model parameters file: the name of a file containing initial values for the parameters of the substitution model used. If this field is not provided, the analysed sequences are used to initialise the model.

Starting model parameters file = data/hiv.model

• Output file: the basename for the three files basename.tree, basename.model and basename.output. They contain the results generated by optimise.

Output file = mammals69-mix-optimise

1.3.3 simulate

Using simulate

Simulate is used:

1. to generate examples of “.model” files for all the substitution models implemented in PHASE. A “.model” file (see section 1.1.4) is used to provide initial or fixed values for the model parameters to some programs in the package.

2. to generate molecular sequences which evolved from a random initial one according to a specified model of evolution, i.e., phylogeny and substitution model.

To use simulate, type at the command-line:

simulate simulate-control-file

where simulate-control-file is a valid control file for the simulate program. In its first mode of operation, simulate creates a single “.model” file and you can modify this file with your own initial values. In its second mode of operation, simulate displays on screen the tree used to generate the actual sequences. This tree was either provided by the user or randomly created by the program. In the latter case the tree is saved in a file specified by the user. Finally, the likelihood of the generated molecular sequences given the model is printed on the screen and simulate saves the sequences in a file specified by the user. The format of this file is described in the data file format section (1.1.1). If the MIXED model described in section 2.4.2 is used, heterogeneous sequences are generated in sequential order.

Control file for simulate

In appendices A.3 and A.4, example control files are provided for the first and the second mode of operation respectively.

The control file of the simulate program must provide:

• Retrieve the name of the model’s parameters: a boolean field to specify the user’s aim. Use yes for the first mode of usage mentioned above and no for the second mode.

Retrieve the name of the model’s parameters = no

• Model parameters file: if simulate is used to generate an example of a substitution model parameters file, the parameters are saved in a file having the name provided. When simulate is used to generate sequences, the user must provide parameters for the substitution model and they are read from the given file.

Model parameters file = simulate.model

The following fields may be required when simulate is used to generate sequences.

• Random seed: the integer value provided with this field is used to initialise the random number generator.

Random seed = 1

• Random tree and Tree file: simulate can either generate a random tree or use a supplied phylogeny. If Random tree is equal to yes then simulate generates a random tree and saves it in the specified file. If Random tree is equal to no then simulate parses the user tree from the specified file.

Random tree = no
Tree file = 8-species.tree

• Number of species and Maximum branch length: when the Random tree field is set to yes, the user must provide the number of species and the maximum value for branch lengths in the generated tree.

Number of species = 10
Maximum branch length = .4

• Number of symbols from class i: you have to specify the number of symbols (e.g., number of nucleotides or number of paired sites) you want to generate for each class in your final sequence.

Number of symbols from class 1 = 100
Number of symbols from class 2 = 100
Number of symbols from class 3 = 100
Number of symbols from class 4 = 500
Number of symbols from class 5 = 300

• Structure for the elements of class i: simulate can add a structure in the generated data file, in which case you have to specify the appropriate structure for the elements of each class.

Structure for the elements of class 1 = .
Structure for the elements of class 2 = .
Structure for the elements of class 3 = .
Structure for the elements of class 4 = .
Structure for the elements of class 5 = ()

• Data file type and Total length of the raw sequences: simulate produces an input file following the format defined in the data file format section (1.1.1). To produce this file, you have to specify yourself the type and the length written in the first line (see section 1.1.1). With the 5 classes described above:

Data file type = RNA
Total length of the raw sequences = 1400 #(100+100+100+500+300*2)

• Output file: the name of the file where generated sequences are saved.

Output file = simulated-data/codons and rna.sequences
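The total length in the example above follows from the per-class settings: a symbol of an unpaired class (structure ".") occupies one nucleotide, while a symbol of a paired class (structure "()") occupies two. The short Python sketch below reproduces this arithmetic; the function name and the data layout are illustrative assumptions, not part of PHASE.

```python
# (number of symbols, structure string) for classes 1 to 5 of the example
classes = [
    (100, "."),
    (100, "."),
    (100, "."),
    (500, "."),
    (300, "()"),
]


def total_raw_length(classes):
    """Sum the nucleotides needed: one per '.' symbol, two per '()' symbol."""
    return sum(count * len(structure) for count, structure in classes)


print(total_raw_length(classes))  # 1400
```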

1.3.4 mlphase

Using mlphase

The mlphase program can be used:

1. to find the maximum-likelihood estimates for branch lengths and, optionally, evolutionary model parameters for a user-defined set of topologies.

2. to find the phylogeny and, optionally, evolutionary model parameters that yield the maximum likelihood. Three algorithms are provided for topology search:

• Simple exhaustive search

• Branch-and-bound exhaustive search

• Heuristic stepwise addition

In the first mode of operation, mlphase operates like optimise but several trees can be considered at once. In the second mode of operation, when mlphase performs a branch-and-bound search or an exhaustive search, the ten phylogenies (and associated substitution model parameters) with the highest likelihood are returned. These two search algorithms return the best tree unless they become trapped in local optima during the optimisation process. The heuristic stepwise addition returns only one tree. It is less likely to find the optimal tree but it is computationally feasible with a larger number of taxa. Be warned that the optimiser might occasionally crash unexpectedly; changing the initial values can overcome that (hopefully rare) problem.

To reduce the search space and the computation time, constraints can be placed on the phylogenetic tree topologies considered during ML inference. With a clade file (see section 1.1.6) one can specify invariant monophyletic clade topologies which should be preserved during phylogenetic inference. The program will look for an optimal topology consistent with these clade arrangements. To use mlphase, type at the command-line:

mlphase mlphase-control-file

where mlphase-control-file is a valid control file for mlphase. The mlphase program saves the results of an inference in a single file. Results are also displayed on screen during the run.

Control file for mlphase

Please see the examples in appendices A.5 and A.6. These control files show the two main modes of operation. The control file of the mlphase program contains:



• a FUNCTION block dependent on the operating mode of mlphase (see below)

• Random seed: the seed for the random number generator.

Random seed = 13

• Output file: the name of the file where the results are sent.

Output file = results/hiv-mlphase.output

The FUNCTION block contains specific parameters according to the mode of operation. At the moment, mlphase can “Optimise user-defined phylogenetic trees” or “Search for ML topology”.

When the user wants to optimise a set of defined trees the FUNCTION block contains the following fields:

• Function: the parameter to specify the mode of operation.

Function = Optimise user-defined phylogenetic trees

• Trees file: the name of the file containing the phylogenies, i.e., a set of trees in the Newick format (section 2.1.2) with optional initial branch length values.

Trees file = primates.phylogenies

• Number of trees: the user has to specify the number of trees in the previous file.

Number of trees = 4

• Optimise model parameters: set this field to no if the model parameters are to be considered fixed, set it to yes if you want to optimise them.

Optimise model parameters = no

• User’s model parameters file: if the parameters are constant one must provide values for them. This field is for the name of the file containing the parameters for the model defined in the MODEL block. If provided when not required, the content of this file is used to initialise the parameters of the model before optimisation.

User’s model parameters file = data/hiv-REV.model

When looking for the ML tree, the FUNCTION block contains:

• Function: the parameter to specify the mode of operation.

Function = Search for ML topology

• Topology search: this field specifies the search algorithm used to determine the phylogenies with the highest likelihood. At the moment the search algorithms implemented are Simple exhaustive search, Branch-and-bound exhaustive search and Heuristic stepwise addition.

Topology search = Heuristic stepwise addition

• User defined monophyletic clades and Clade file: set the first field to yes if you want to constrain the search in the topology space. The second field is the name of your clade file (see section 1.1.6).

User defined monophyletic clades = yes
Clade file = primates.clades

• Optimise model parameters: set this field to no if the model parameters are to be considered fixed, set it to yes if you want to optimise them.

Optimise model parameters = yes

• User’s model parameters file: if the parameters are constant one must provide values for them. This field is for the name of the file containing the parameters for the model defined in the MODEL block.

User’s model parameters file = data/primates-RNA7A.model

1.3.5 mcmcphase

Using mcmcphase

The mcmcphase program performs Bayesian estimation of phylogenies (see section 2.5) and uses Markov chain Monte Carlo to produce large samples from the posterior probability density. To use mcmcphase, simply type at the command-line:

mcmcphase mcmcphase-control-file

where mcmcphase-control-file is a valid control file for the mcmcphase program.

The mcmcphase program saves the results of an inference in many files. Be warned that it might require a large amount of disk space for large studies (around 90 Mb for 70 species and 50000 samples).

• .besttree and .bestmodel files: the phylogeny and the parameters of the substitution model when the best state (i.e., the state with the highest likelihood) was visited. It is not necessarily one of the sampled configurations and this state might have been visited during the burnin period. The best configuration is not very important in an MCMC analysis but a “strange” best state quickly indicates that something went wrong. The tree and the model can also be used as starting points in maximum likelihood inference.

• .mp file: the file with the sampled parameters of the substitution model(s). Each sample occupies one line. The parameters are, in order,

– the proportion of invariant sites if an invariant category is used (+I models)

– the gamma shape parameter (α) if the discrete gamma model is used (+dGX models)

– the frequencies of the states as they appear in the substitution matrix

– the rate ratios

When a MIXED model is used, substitution model parameters are printed sequentially. Except for the first model, each set of parameters is preceded by the average substitution rate of the model. The average substitution rate for the first model is always 1.0 and therefore this value is not reported.

• .samples file: the sampled topologies. This file can be used with another phylogenetic package to produce a consensus tree. To avoid wasting disk space, mcmcphase outputs the sampled topologies using an index for each species according to its order of appearance in the data file.

• .bl file: the branch lengths for the previous topologies (for use with other PHASE programs).

• .output file: a file with similar content to the screen output.

• .plot file: the evolution of the likelihood during the run. Sampling of these values starts at the beginning of the run, i.e., likelihood values are stored during the burnin too.

Using consensus

The consensus program is used to exploit the large sample of states produced by mcmcphase. The program still lacks the ability to produce a consensus tree by itself and requires that tree from the user. Many phylogenetic programs can build a consensus tree from the sample of topologies produced by mcmcphase in the “.samples” file. You can use the consense program of PHYLIP (footnote 2) for instance. To use consensus, simply type at the command-line:

consensus mcmcphase-control-file consensus-topology-file

where mcmcphase-control-file is the control file that was used by the mcmcphase program to produce the results and consensus-topology-file is the file which contains the consensus topology. Since mcmcphase outputs the topologies using numbers instead of the names of the species, consensus expects the consensus topology to be given with numbers too.

The consensus program retrieves the model used and the location of the sample files from the control file. Two consensus substitution models are produced using respectively the mean and median values of the sample. The consensus topology is used to produce a consensus tree with branch lengths. The branch lengths of the states whose topology is identical to the consensus topology are used. For each branch, the consensus length is simply the mean value of all the lengths. The consensus program cannot return a consensus tree if the consensus topology has never been visited. In such a case, we suggest you use optimise to produce ML branch lengths.

2 http://evolution.genetics.washington.edu/phylip.html
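The branch-length averaging performed by consensus can be pictured with the following Python sketch. It is an illustration only, not the code used by the program: the function name, the representation of a sampled state as a (topology, branch lengths) pair and the use of plain strings to compare topologies are all assumptions.

```python
from statistics import mean


def consensus_branch_lengths(samples, consensus_topology):
    """Average each branch over the samples that share the consensus topology.

    samples: list of (topology, {branch label: length}) pairs.
    Returns None when the consensus topology was never visited.
    """
    matching = [lengths for topology, lengths in samples
                if topology == consensus_topology]
    if not matching:
        return None
    return {branch: mean(lengths[branch] for lengths in matching)
            for branch in matching[0]}


samples = [
    ("(1,2,(3,4));", {"1": 0.25, "2": 0.5}),
    ("(1,3,(2,4));", {"1": 9.0, "2": 9.0}),   # different topology, ignored
    ("(1,2,(3,4));", {"1": 0.75, "2": 1.5}),
]
print(consensus_branch_lengths(samples, "(1,2,(3,4));"))
```

When the function returns None, the situation corresponds to the case where the manual recommends running optimise on the consensus topology instead.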

Control file for mcmcphase

Please see the examples provided in appendices A.7 and A.8. In the control file of mcmcphase one can/must have:



• a PERTURBATION block: control block for the mixing properties of mcmcphase (see below).

• Random seed: the seed for the random number generator.

Random seed = 1

• Burnin iterations: the number of “burnin” cycles (i.e., cycles before the beginning of the sampling). During the “burnin”, only likelihood values are stored.

Burnin iterations = 150000

• Sampling iterations: the number of cycles for sampling.

Sampling iterations = 600000

• Sampling period: the number of cycles between extraction of two consecutive samples.

Sampling period = 20

• Random start model parameters and User’s starting model parameters file: to reduce the necessary “burnin” time, the chain can be initialised with some user-specified model parameters. Otherwise the sequences are used to initialise the substitution model.

Random start model parameters = no
User’s starting model parameters file = data/primates-RNA7A.model

• Random start tree and User’s starting tree file: similarly, one can choose to initialise the chain randomly or with a user-defined topology. We do not encourage the use of an initial user-defined topology but this option can be useful to quickly gain an idea of what results can be expected.

Random start tree = yes
User’s starting tree file = this field is ignored in this case

• Output file: the basename for all the output files (basename.besttree, basename.bestmodel, basename.mp, basename.samples, basename.bl, basename.output and basename.plot).

Output file = results/hiv-dna

• Output format: the format used for the topologies in the .samples file; it can be phylip (with a semicolon at the end) or bambe (without semicolon).

Output format = phylip
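The run-length settings above determine how long the chain runs and roughly how many samples are written. The Python lines below work through the example values; the variable names are illustrative, and the exact sample count may differ by one depending on whether the first or last cycle is sampled, which this manual does not specify.

```python
burnin_iterations = 150_000
sampling_iterations = 600_000
sampling_period = 20

# One sample is extracted every `sampling_period` cycles of the sampling phase.
approx_samples = sampling_iterations // sampling_period

# Cycles run in total, burnin included.
total_cycles = burnin_iterations + sampling_iterations

print(approx_samples, total_cycles)  # 30000 750000
```

Since the output files grow with the number of samples, this estimate is worth making before choosing the sampling period for a large study.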

PERTURBATION block

The PERTURBATION block contains the mixing parameters used for the proposals. The following mixing parameters are relative to the branches:

• Initial branch step proposal parameter: the initial standard deviation of the normal distribution used to modify the branch lengths. This proposal parameter is modified during the “burnin”.

Initial branch step proposal parameter = 0.1

• Branch length upper bound: the upper bound used for the uniform prior distribution of branch lengths.

Branch length upper bound = 3.5

The PERTURBATION block also contains mixing parameters for the proposals of substitution model parameters. These parameters are dependent on the substitution model used. With simple substitution models, i.e., any nucleotide or base-paired model used alone, the following parameters are added in the PERTURBATION block to perturb the frequencies:

• Frequencies, proposal priority: an integer value to specify how often we try to perturb the frequencies with respect to other parameters. This parameter is usually compulsory except for models with fixed frequencies (i.e., JC69 and K80). Use the value 0 to prevent the perturbation (i.e., if you want the frequencies to remain equal to the empirical frequencies or to the values provided in an initial substitution model).

Frequencies, proposal priority = 1

• Frequencies, proposal minimum acceptance rate and Frequencies, proposal maximum acceptance rate: during the “burnin”, mcmcphase will try to adapt the proposal step so as to reach an acceptance rate within the specified range. By default this range is [0.21, 0.25]. If you want to change the default values, provide the two parameters. Use the range [0.0, 1.0] to turn off the dynamic adaptation of the proposal step.

Frequencies, proposal minimum acceptance rate = 0.0
Frequencies, proposal maximum acceptance rate = 1.0

• Frequencies, initial Dirichlet tuning parameter: the initial proposal parameter (the higher the Dirichlet parameter, the smaller the step). This parameter is compulsory if you turned off the dynamic adaptation of the step. Allowed values are in the range [100.0, 100000.0] and the default value is 1000.0.

Frequencies, initial Dirichlet tuning parameter = 1500.0

Similar parameters are used for the rate ratios:

• Rate ratios, proposal priority: an integer value to specify how often we try to perturb each rate ratio with respect to other parameters. Rate ratios are treated individually but they share the same priority. This parameter is usually compulsory except for models with fixed rates (i.e., JC69). You can use the value 0 to have constant rate ratios but you should not do that unless you provide initial parameters with a “.model” file.

Rate ratios, proposal priority = 2

• Rate ratios, proposal minimum acceptance rate and Rate ratios, proposal maximum acceptance rate: rate ratios are treated individually but they share the same range for the acceptance rate. The default range is [0.21, 0.25] and you can turn off the dynamic modification of the step with the range [0.0, 1.0].

Rate ratios, proposal minimum acceptance rate = 0.3
Rate ratios, proposal maximum acceptance rate = 0.6

• Rate Ratios, initial step: allowed initial values are in the range (0, 5.0]. The default value (when the dynamic step option is on) is 0.2. You cannot provide a different initial step for each rate ratio at the moment.

Rate Ratios, initial step = 0.25

Similarly, you can/must use the following parameters when using a +I (invariant category) and/or a +dG (discrete gamma categories) model:

• Gamma parameter, proposal priority: compulsory if a gamma model is used.

• Gamma parameter, proposal minimum acceptance rate: default value = 0.21

• Gamma parameter, proposal maximum acceptance rate: default value = 0.25

• Gamma parameter, initial step: default value = 0.2

and,

• Invariant parameter, proposal priority: compulsory if an invariant model is used.

• Invariant parameter, proposal minimum acceptance rate: default value = 0.21

• Invariant parameter, proposal maximum acceptance rate: default value = 0.25

• Invariant parameter, initial step: default value = 0.05

The upper bound for the uniform prior distribution of each rate ratio is 1000.0. The upper bound for the uniform prior distribution of the gamma parameter is 200000.0. Be aware that there might be something wrong if you reach that upper-bound value during an MCMC inference, i.e., there may be insufficient data to properly estimate the parameter.

PERTURBATION Block for MIXED models

When a mixed model is used, the previous proposal parameters are required for each model and must be enclosed in separate blocks. You must complete a PERTURBATIONi block for each model in the PERTURBATION block with the previously described parameters (see appendix A.8). Parameters which are specific to the MIXED model appear in the PERTURBATION block.

• Model i priority: specify a priority for each model with respect to the other models and the priority used for the average rates.

Model 1 priority = 8
Model 2 priority = 24

• Average rates, proposal priority: specify a priority for the perturbation of the model average substitution rate.

Average rates, proposal priority = 1

• Average rates, proposal minimum acceptance rate and Average rates, proposal maximum acceptance rate: the acceptance range mcmcphase aims for, [0.21, 0.25] by default.

Average rates, proposal minimum acceptance rate = 0.15
Average rates, proposal maximum acceptance rate = 0.25

• Average rates, initial step: allowed initial values are in the range (0, 1.0]. The default value (when the dynamic step option is on) is 0.2. You cannot provide a specific initial step for each average substitution rate at the moment.

Average rates, initial step = 0.14

The upper-bound for the uniform prior distribution of the average substitution rate ratios is 100.0.

Proposals priority

Each cycle, mcmcphase perturbs one branch length, and a topology change is tried every ten cycles (see the Bayesian phylogenetics algorithms in section 2.5). Each cycle, mcmcphase will also try to modify the substitution model. Let us consider the following example with the MIXED model and three substitution models:

PERTURBATION
#PERTURBATION OF THE TREE :
    ...
#PERTURBATION OF THE MODEL :
    Model 1 priority = 8
    Model 2 priority = 24
    Model 3 priority = 7
    Average rates, proposal priority = 1
    Average rates, initial step = .3
    Average rates, proposal minimum acceptance rate = .15
    Average rates, proposal maximum acceptance rate = .20
    PERTURBATION1
        Frequencies, proposal priority = 1
        Rate ratios, proposal priority = 2
        Gamma parameter, proposal priority = 1
    \PERTURBATION1
    PERTURBATION2
        Frequencies, proposal priority = 3
        Rate ratios, proposal priority = 4
        Gamma parameter, proposal priority = 1
        Invariant parameter, proposal priority = 1
    \PERTURBATION2
    PERTURBATION3
        Frequencies, proposal priority = 4
        Rate ratios, proposal priority = 2
        Gamma parameter, proposal priority = 1
    \PERTURBATION3
\PERTURBATION

The mcmcphase program will modify the average substitution rate of the second model with the probability given in (1.1a); the probability of modifying the average substitution rate of the third model is the same. The average substitution rate of the first model is the reference: it is never modified and remains equal to 1.0. With the probability given in (1.1b), mcmcphase will modify the parameters of substitution model i (i.e., any parameters but the average substitution rate). The priorities inside the corresponding PERTURBATIONi block are then used; the figures inside each PERTURBATIONi block do not have any effect outside the block.

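$$P(\text{average rate}) = \frac{\text{Average rates, proposal priority}}{\text{total priority}} \tag{1.1a}$$

$$P(\text{model } i) = \frac{\text{Model } i \text{ priority}}{\text{total priority}} \tag{1.1b}$$

$$\text{total priority} = (N - 1) \times \text{Average rates, proposal priority} + \sum_{i=1}^{N} \text{Model } i \text{ priority} \tag{1.1c}$$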

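In our example, if the second model is selected for the next modification, we define $\text{priority}_{\text{total}}$:

$$\text{priority}_{\text{total}} = \text{priority}_{\text{frequencies}} + \text{priority}_{\text{gamma}} + \text{priority}_{\text{invariant}} + \text{number}_{\text{rate ratios}} \times \text{priority}_{\text{rate ratios}}$$

and we modify either all the frequencies (probability 1.2a), or one of the rate ratios (probability 1.2b each), or the invariant parameter (probability 1.2c), or the gamma shape parameter (probability 1.2d).

$$P = \frac{\text{priority}_{\text{frequencies}}}{\text{priority}_{\text{total}}} \tag{1.2a}$$

$$P = \frac{\text{priority}_{\text{rate ratios}}}{\text{priority}_{\text{total}}} \tag{1.2b}$$

$$P = \frac{\text{priority}_{\text{invariant}}}{\text{priority}_{\text{total}}} \tag{1.2c}$$

$$P = \frac{\text{priority}_{\text{gamma}}}{\text{priority}_{\text{total}}} \tag{1.2d}$$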
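The probabilities defined in (1.1) and (1.2) can be evaluated for the three-model example with a few lines of Python. This is a worked illustration rather than PHASE code. The number of rate ratios of the second model is not stated in the example, so the value 5 used below is purely hypothetical.

```python
from fractions import Fraction

model_priority = {1: 8, 2: 24, 3: 7}     # "Model i priority" fields
average_rates_priority = 1               # "Average rates, proposal priority"

# Equation (1.1c): the first model's average rate is never perturbed, hence N - 1.
n_models = len(model_priority)
total_priority = (n_models - 1) * average_rates_priority + sum(model_priority.values())

# Equations (1.1a) and (1.1b)
p_average_rate = Fraction(average_rates_priority, total_priority)  # for model 2, and for model 3
p_model = {i: Fraction(p, total_priority) for i, p in model_priority.items()}

# Inside PERTURBATION2, once model 2 has been selected.
n_rate_ratios = 5                        # hypothetical, depends on the model used
frequencies, rate_ratios, gamma, invariant = 3, 4, 1, 1
block_total = frequencies + gamma + invariant + n_rate_ratios * rate_ratios

p_frequencies = Fraction(frequencies, block_total)   # (1.2a)
p_each_rate_ratio = Fraction(rate_ratios, block_total)  # (1.2b), for each rate ratio
p_invariant = Fraction(invariant, block_total)       # (1.2c)
p_gamma = Fraction(gamma, block_total)               # (1.2d)

print(total_priority, p_model[2], block_total, p_each_rate_ratio)
```

With these settings the total priority is 41, so the second model is chosen with probability 24/41 and each of the two perturbable average rates with probability 1/41.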

1.3.6 analyser

The analyser program does not require any control file. To use analyser, type at the command-line:

analyser

or

analyser data-file

where data-file is a file following the data file format described in section 1.1.1. The analyser program needs you to provide the fields usually used in the DATAFILE block (see section 1.2.2) in order to parse your sequences. Once this is done, you will be prompted for the class to check if the data file contains heterogeneous sites, and analyser will then require a “.lump” file.

The “.lump” file is used:

1. to match sites to a given state, e.g., indels (-) and purine (R) can be lumped to an ambiguity state (X).

2. to choose the state(s) used for the cut-off (see below).

Two “.lump” files are provided, data/dna.lump and data/rna.lump. The first one is used with single nucleotides: ambiguities (i.e., -, R, Y and X) are lumped in a state X used for the cut-off. The second one is used with paired sites: mismatches (e.g., AC, UU, . . .) are lumped into the single state MM and ambiguities (e.g., C-, UR, . . .) are lumped into the state XX. Both states are used for the cut-off.

Once the “.lump” file is provided, analyser outputs statistics for each state and requires a value for the cut-off (between 0 and 1.0). For each site the frequency of “cut-off states” is computed and the sites above the cut-off threshold provided are displayed on the screen.
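The cut-off test can be sketched in Python as follows, using the single-nucleotide lumping described for data/dna.lump. This is an illustration under stated assumptions, not the analyser implementation: the function name, the in-memory representation of the alignment and the 1-based site numbering are all hypothetical.

```python
# Lumping for single nucleotides: -, R, Y and X all map to the cut-off state X.
LUMP = {"-": "X", "R": "X", "Y": "X", "X": "X"}
CUTOFF_STATES = {"X"}


def sites_above_cutoff(sequences, cutoff):
    """Return (site, frequency) for sites whose cut-off state frequency exceeds cutoff."""
    flagged = []
    for site, column in enumerate(zip(*sequences), start=1):
        lumped = [LUMP.get(state, state) for state in column]
        frequency = sum(state in CUTOFF_STATES for state in lumped) / len(lumped)
        if frequency > cutoff:
            flagged.append((site, frequency))
    return flagged


sequences = ["AC-GU", "A-RGU", "ACYGU", "AC-G-"]   # four aligned sequences
print(sites_above_cutoff(sequences, 0.2))
```

Sites flagged in this way are candidates for removal before a phylogenetic analysis, since they carry mostly gaps or ambiguous states.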

2 - Elements of phylogenetic theory

2.1 Phylogenetic trees

Usually a sketch of a tree-like structure is used to describe evolution; the evolutionary tree represents the hierarchical relationships among species arising through evolution. Ancestral species are located near the root of the tree and contemporary species are the leaves. Almost all methods accept the appropriateness of a tree-like model to describe the evolution of species but one must keep in mind that it is a strong assumption in itself.

2.1.1 Unrooted phylogenies

Since the data for the ancestors are usually missing, the phylogenetic trees produced by PHASE are only schematic trees comprising a set of nodes linked together by branches. Terminal nodes, usually called tips or leaves, are known sequences of existing organisms or contemporary taxa. Internal nodes are bifurcation points between genetically isolated groups.

The analytical techniques used in PHASE result in the inference of an unrooted, strictly bifurcating tree. The location of the common ancestor of all the species under study (i.e., the earliest point in time) cannot be identified by our inference method. An unrooted, strictly bifurcating tree can be seen as a kind of network where all the internal nodes are linked to exactly three other nodes, either internal nodes or leaves. To produce a neat tree-like structure, one or more outgroup species, known to be genetically isolated from all the others, should be used to root the tree (see figure 2.1).

Figure 2.1: Two equivalent representations of the same unrooted tree

2.1.2 String representation of a tree

All trees used by the programs follow the newick¹ standard, though the grammar in PHASE is more limited. The newick format uses the recursive definition of a tree to represent phylogenies in a computer readable form with nested parentheses. The tree in figure 2.1 can be written:

(Outgroup, gorilla, (human, chimpanzee));

¹ http://evolution.genetics.washington.edu/phylip/newicktree.html


However, one must be aware that this representation is not unique; the following one works as well:

(human, (Outgroup, gorilla), chimpanzee);

Sometimes, when an outgroup is provided, the rooted representation is used in PHASE:

(Outgroup, (gorilla, (human, chimpanzee)));
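The equivalence of these strings can be checked mechanically. The sketch below is not the PHASE parser: it handles only names, commas and parentheses (no branch lengths), and the helper names `parse_newick`, `leaves` and `splits` are illustrative. It compares the non-trivial bipartitions of the leaf set induced by each internal branch, which identify an unrooted topology.

```python
def parse_newick(text):
    """Parse a newick string without branch lengths into nested tuples of leaf names."""
    text = text.strip().rstrip(";").replace(" ", "")
    position = 0

    def peek():
        return text[position] if position < len(text) else ""

    def node():
        nonlocal position
        if peek() == "(":
            position += 1  # consume "("
            children = [node()]
            while peek() == ",":
                position += 1  # consume ","
                children.append(node())
            if peek() != ")":
                raise ValueError("expected ')' at position %d" % position)
            position += 1  # consume ")"
            return tuple(children)
        start = position
        while peek() and peek() not in ",()":
            position += 1
        return text[start:position]

    tree = node()
    if position != len(text):
        raise ValueError("unbalanced or trailing characters in newick string")
    return tree


def leaves(tree):
    """Set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaves, one per internal branch."""
    everything = leaves(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = everything - side
        if len(side) > 1 and len(other) > 1:
            found.add(frozenset([side, other]))
        for child in node:
            visit(child)

    visit(tree)
    return found


unrooted_a = splits(parse_newick("(Outgroup, gorilla, (human, chimpanzee));"))
unrooted_b = splits(parse_newick("(human, (Outgroup, gorilla), chimpanzee);"))
rooted = splits(parse_newick("(Outgroup, (gorilla, (human, chimpanzee)));"))
different = splits(parse_newick("(Outgroup, human, (gorilla, chimpanzee));"))
print(unrooted_a == unrooted_b == rooted, unrooted_a == different)
```

The first three strings yield the same single split, {human, chimpanzee} against {Outgroup, gorilla}, whereas the last string, which groups gorilla with chimpanzee, describes a different topology.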

2.1.3 Branch lengths

The branch lengths usually represent the evolutionary distances between two consecutive nodes. We tend to split the phylogenetic tree into two parts: its topology (i.e., the pattern of branching) and its associated edge lengths. The expected rate of evolutionary change is assumed constant across all lineages in a phylogeny and the length of a branch is scaled to the expected number of substitutions per site along that branch. These lengths can be integrated in the string representation seen in section 2.1.2; for instance we can write:

(Outgroup : 0.35, gorilla : 0.25, (human : 0.3, chimpanzee : 0.2));

2.2 Nucleotide substitution models

Substitution models are a description of the way sequences evolve in time by nucleotide replacements. The PHASE package provides a wide range of substitution models. These consist of standard nucleotide substitution models as well as specific base-pair substitution models.

2.2.1 A Markov model of substitution

Replacements within DNA sequences can be described and modelled by a Markov process with four states. Each state represents one base: Adenine, Cytosine, Guanine or Thymine (see figure 2.2). Several assumptions are made in order to make phylogenetic reconstructions computationally feasible. First, each nucleotide is supposed to evolve independently of the evolution of other sites and of its past history. We suppose there is no interaction between sites and we treat them independently. Further, the Markov process of substitution is assumed to be the same across all sites (spatial homogeneity). Finally, the process is assumed to remain constant over time (stationary) and time homogeneous, i.e., nucleotide frequencies and substitution rates can be assumed constant through time and across all sites in an alignment.

Figure 2.2: Markov model for nucleotide evolution in DNA sequences


One might concede that the assumptions made for the nucleotide evolutionary process are not strictly valid. Actual data show some discrepancies, e.g., heterogeneous selection pressure, unequal base frequencies among species, etc. We can relax these assumptions and allow for substitution rate variation across sites with the gamma model of Yang (1994), described in section 2.4.1. It is also possible to use multiple substitution processes simultaneously when heterogeneous data are analysed (see section 2.4.2).

In spite of their name, DNA models can naturally be used for the treatment of the loops within RNA sequences (see figure 2.3). In RNA loops, nucleotides are not subject to any structural constraints and they are assumed to evolve independently from other sites. Therefore, the use of similar Markov models for nucleotide evolution in RNA loops is appropriate.

2.2.2 Transition matrices

The mathematical expression of a DNA Markov model uses a matrix Q of substitution rates in which each element $r_{ij}$ represents the rate of substitution from nucleotide i to nucleotide j. The diagonal elements of the instantaneous rate matrix must satisfy the equation

$$r_{ii} = -\sum_{j \neq i} r_{ij} \tag{2.1}$$

so that each row of Q sums to zero. The process must be homogeneous and stationary; if $\pi_A$, $\pi_C$, $\pi_G$ and $\pi_T$ are the four equilibrium base frequencies, then the rates must obey the following constraint:

$$\pi_i r_{ij} = \pi_j r_{ji} \quad \forall i, j \tag{2.2}$$

also known as the time-reversibility constraint. To enforce this constraint we define $\alpha_{ij}$ so that

$$Q_{ij} = r_{ij} = m_r \times \pi_j \alpha_{ij} \quad \forall i,\ \forall j \neq i \tag{2.3}$$

where $m_r$ is a constant factor described later. The time-reversibility condition is satisfied with a symmetric choice of $\alpha_{ij}$. In practice, PHASE uses one of these $\alpha_{ij}$ parameters as a reference and sets its value to 1.0. Depending on the model, other parameters (we call them rate ratios) are fixed or inferred during an analysis.

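With Q we can compute the transition probability matrix over time t.²

$$\frac{dP(t)}{dt} = P(t) \times Q$$

$$P(t) = \exp(Qt) = \exp(\pi_j \alpha_{ij} \times m_r \times t)$$

Here exp denotes the matrix exponential, and the last expression writes Q through its off-diagonal elements $m_r \times \pi_j \alpha_{ij}$.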

The transition probability matrix $P(t) = (p_{ij}(t))$ is used to compute the probability that nucleotide i will be nucleotide j after time t (i can be equal to j). The “rate ratios” matrix in PHASE refers to the matrix $\alpha_{ij}$ and the “transition rates” matrix refers to Q.
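As an illustration of P(t) = exp(Qt), the sketch below computes the matrix exponential with a plain scaling-and-squaring Taylor series and checks it against the well-known closed form for the Jukes-Cantor model. It is not the numerical method used by PHASE (which this manual does not describe); the function names and the choice of 1/3 for every off-diagonal rate (so that the total rate of leaving a state is 1) are assumptions made for the example.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(q, t, squarings=10, terms=12):
    """Approximate exp(q * t): Taylor series on q*t/2^squarings, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    scaled = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Jukes-Cantor rate matrix: every off-diagonal rate is 1/3, every row sums to zero.
rate = 1.0 / 3.0
jc69 = [[-1.0 if i == j else rate for j in range(4)] for i in range(4)]

t = 0.5
p = mat_exp(jc69, t)
# Closed form for Jukes-Cantor with off-diagonal rate 1/3.
expected_same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
expected_diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
print(p[0][0], expected_same)
```

Each row of P(t) sums to one, and P(t) tends to the equilibrium frequencies (0.25 each) as t grows.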

The inference methods used do not permit the separation of $m_r$, a factor proportional to the average substitution rate of the model, and t, the branch lengths of the evolutionary tree (see section 2.1), which reflect an amount of change. The longer the branch, the bigger the evolutionary distance between its two incident nodes. We have to impose a scaling on the branch length. In practice, we fix the average rate of substitutions of our model to be one per “unit of time”. This is done by adding a constraint for the factor $m_r$.

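$$\sum_{i=1}^{n_{\mathrm{states}}} \sum_{j \neq i} \pi_i r_{ij} = m_r \times \sum_{i=1}^{n_{\mathrm{states}}} \sum_{j \neq i} \pi_i \pi_j \alpha_{ij} = 1.0 \tag{2.4}$$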

This last constraint does not hold when multiple substitution models are used simultaneously in the MIXED model. The average substitution rate of the first model is still fixed equal to 1.0 but the average substitution rate of the other models is now a free parameter.
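The construction of Q from frequencies and rate ratios, including the scaling factor, can be sketched as follows for a single (non-MIXED) model. The example uses the HKY85 parameterisation presented in section 2.2.3 (rate ratio 1.0 for transitions, α1 for transversions); the function name, the dictionary layout and the numerical values are illustrative assumptions, not PHASE code.

```python
def hky85_rate_matrix(freqs, alpha1):
    """Build a normalised HKY85 rate matrix as a dict keyed by (from_base, to_base).

    freqs  -- equilibrium frequencies, e.g. {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    alpha1 -- transversion rate ratio (the transition rate ratio is the reference, 1.0)
    Returns (q, m_r) where m_r makes the average substitution rate equal to 1.
    """
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    # Unscaled off-diagonal entries pi_j * alpha_ij.
    raw = {}
    for i in bases:
        for j in bases:
            if i != j:
                ratio = 1.0 if (i, j) in transitions else alpha1
                raw[i, j] = freqs[j] * ratio
    # Choose m_r so that sum_i sum_{j != i} pi_i * m_r * pi_j * alpha_ij = 1.
    m_r = 1.0 / sum(freqs[i] * rate for (i, j), rate in raw.items())
    q = {pair: m_r * rate for pair, rate in raw.items()}
    for i in bases:
        # Diagonal entries make every row sum to zero.
        q[i, i] = -sum(q[i, j] for j in bases if j != i)
    return q, m_r


freqs = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
q, m_r = hky85_rate_matrix(freqs, alpha1=0.5)
average_rate = sum(freqs[i] * q[i, j] for i in "ACGT" for j in "ACGT" if i != j)
print(m_r, average_rate)
```

With equal frequencies and α1 = 1 the function reduces to the Jukes-Cantor case, for which m_r = 4/3.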

² The term “branch length” would be more correct.


2.2.3 Nucleotide substitution models implemented in PHASE

One can refer to Whelan et al. (2001) for a comprehensive review of the following substitution models and their hierarchical relationships. The transition rate matrices of these models highlight their differences. They are presented by increasing complexity, i.e., ordered according to their number of free parameters (equilibrium frequencies and/or rates). In nucleotide substitution models, the A↔G transition is used as a reference by PHASE, $\alpha_{AG} = \alpha_{GA} = 1$.

JC69 model (Jukes-Cantor, 69)

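The Jukes-Cantor model assumes equal base frequencies and equal mutation rates; therefore it does not have any free parameter. $\pi_i = \frac{1}{4}\ \forall i$, $\alpha_{ij} = 1.0\ \forall i, \forall j \neq i$

$$
Q = m_r \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & 0.25 & 0.25 & 0.25 \\
C & 0.25 & * & 0.25 & 0.25 \\
G & 0.25 & 0.25 & * & 0.25 \\
T & 0.25 & 0.25 & 0.25 & *
\end{array}
$$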

Table 2.1: JC69 transition matrix

K80 model (Kimura, 80)

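The Kimura model assumes equal base frequencies and accounts for the difference between transitions and transversions with one parameter. $\pi_i = \frac{1}{4}\ \forall i$, $\alpha_{\mathrm{transition}} = 1.0$, $\alpha_{\mathrm{transversion}} = \alpha_1$

$$
Q = m_r \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & 0.25\alpha_1 & 0.25 & 0.25\alpha_1 \\
C & 0.25\alpha_1 & * & 0.25\alpha_1 & 0.25 \\
G & 0.25 & 0.25\alpha_1 & * & 0.25\alpha_1 \\
T & 0.25\alpha_1 & 0.25 & 0.25\alpha_1 & *
\end{array}
$$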

Table 2.2: K80 transition matrix

HKY85 model (Hasegawa-Kishino-Yano, 85)

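The HKY85 model does not assume equal base frequencies and accounts for the difference between transitions and transversions with one parameter. $\alpha_{\mathrm{transition}} = 1.0$, $\alpha_{\mathrm{transversion}} = \alpha_1$

$$
Q = m_r \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_1 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_1 & \pi_T \\
G & \pi_A & \pi_C\alpha_1 & * & \pi_T\alpha_1 \\
T & \pi_A\alpha_1 & \pi_C & \pi_G\alpha_1 & *
\end{array}
$$

Table 2.3: HKY85 transition matrix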


TN93 model (Tamura-Nei, 93)

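The TN93 model has four frequency parameters. It accounts for the difference between transitions and transversions and differentiates the two kinds of transitions (purine↔purine and pyrimidine↔pyrimidine). $\alpha_{AG} = \alpha_{GA} = 1.0$, $\alpha_{\mathrm{transversion}} = \alpha_1$, $\alpha_{CT} = \alpha_{TC} = \alpha_2$

$$
Q = m_r \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_1 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_1 & \pi_T\alpha_2 \\
G & \pi_A & \pi_C\alpha_1 & * & \pi_T\alpha_1 \\
T & \pi_A\alpha_1 & \pi_C\alpha_2 & \pi_G\alpha_1 & *
\end{array}
$$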

Table 2.4: TN93 transition matrix

REV model (Yang, 94)

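The REV model is the most general model for nucleotide substitution subject to the time-reversibility constraint. It has four frequencies and five rate parameters.

$$
Q = m_r \times
\begin{array}{c|cccc}
  & A & C & G & T \\ \hline
A & * & \pi_C\alpha_1 & \pi_G & \pi_T\alpha_2 \\
C & \pi_A\alpha_1 & * & \pi_G\alpha_3 & \pi_T\alpha_4 \\
G & \pi_A & \pi_C\alpha_3 & * & \pi_T\alpha_5 \\
T & \pi_A\alpha_2 & \pi_C\alpha_4 & \pi_G\alpha_5 & *
\end{array}
$$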

Table 2.5: REV transition matrix

2.3 Paired-site substitution models

RNA substitution models are an attempt to add biological realism to the evolution model. The assumption that each nucleotide site evolves independently must be modified for RNA molecules. Paired-site substitution models can account for the secondary structure of these molecules.

2.3.1 RNA secondary structure

In the double helical structure of the DNA molecule, two complementary nucleotide strands are held together by hydrogen bonds between the Watson-Crick pairs A-T and C-G. RNA molecules usually come as single strands, but left in their environment they fold into their tertiary structure because of the same hydrogen bonding mechanism. Helices, also known as stems, are formed intra-molecularly.

There are 16 possible base-pairings; however, of these, only six (AU, GU, GC, UA, UG, CG) are stable enough to form actual base-pairs. The rest are called mismatches and occur at very low frequencies in helices. RNA molecules, such as ribosomal RNAs and transfer RNAs, have an important role. Their structure cannot easily be disrupted without impact on their function and lethal consequences, so selection acts to maintain the secondary structure. Yet the primary structure of the stems (i.e., their nucleotide sequence) can still vary, and in fact we observe that RNA helical regions are quite variable in sequence. The nature of the bases is not important and substitutions are possible as long as they preserve the secondary structure. One could model the evolution of stems using the DNA models described above, but there may be a substantial bias in results because paired substitutions would seem far less probable than they are in reality (see Jow et al., 2002). Statistics become invalid and this can have an effect on inferred phylogenies.

Figure 2.3: An RNA molecule secondary structure

The secondary structure is left unchanged when complementary substitutions occur in the DNA gene coding for the RNA molecule. The process can be a single step process (double substitution) or a two step process (two single substitutions). These two processes are described in the theory of compensatory substitutions section below.

2.3.2 Theory of compensatory substitutions

From the individual sequence viewpoint, complementary mutations are a two-step process typically involving a U-G or a G-U pair as a transition state. These pairs are thermodynamically less stable than Watson-Crick pairs but they are still more likely to arise than any other mismatches. Nonetheless, in phylogenetic studies we are not considering individual copies of a gene; we are rather modelling consensus sequences for a large number of individuals. From the population genetics viewpoint, evolution in stems can occur either by two single substitutions or by simultaneous compensatory substitutions (see, e.g., Higgs, 1998; Savill et al., 2001). The first mechanism is the fixation of the slightly deleterious UG or GU pair in the population before the second mutation occurs. The second mechanism happens when natural selection against intermediate mutants is too strong. In such a case, deleterious pairs are kept low in frequency until a second mutation takes place by chance in one of the sporadic mutant sequences. Afterwards, the new neutral variant may replace the original one due to drift in gene frequencies (see figure 2.4).

Figure 2.4: Substitution mechanisms for paired-sites


Therefore, even if simultaneous mutations are very unlikely to occur in a single organism, it is reasonable, although not compulsory, to allow double substitutions in models from the population point of view. Experimental results obtained with PHASE confirm this. Since natural selection against intermediate mutants with any mismatch pair other than U-G or G-U is usually much stronger, one can notice two groups of states within which rapid interchange occurs, while interchange between the two groups, although possible, is really slow (see figure 2.5).

Figure 2.5: Mutation rate between paired-sites

2.3.3 Base-paired substitution models implemented in PHASE

Like DNA models, RNA substitution models are Markov models but they consider pairs of nucleotides as their elementary states rather than single sites. The PHASE software contains 16-state models to account for the 16 possible pairs that can be formed with 4 bases. These models have a lot of parameters and you might prefer the 6-state and 7-state models, where mismatch pairs are respectively discarded or lumped into a single state MM. The time-reversibility constraint and the average mutation rate are set as they were for DNA models. One can refer to Savill et al. (2001) for a more complete review of the following substitution models and their hierarchical relationships. With base-paired models, PHASE uses the rate of the double transition AU↔GC as a reference for the rate ratios.

RNA6A model

Six state models completely ignore mismatches and consider substitutions between the six stable base-pairs only. Mismatch pairs are assigned to one of the 6 states in some deterministic fashion (the treatment of a mismatch is quite similar to the treatment of a gap with the DNA models). The RNA6A model is the most general six state model, with 15 rate parameters and 6 frequencies (and 2 constraints), as shown in table 2.6.

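$$
Q = m_r \times
\begin{array}{c|cccccc}
   & AU & GU & GC & UA & UG & CG \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_4 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_5 & \pi_{UA}\alpha_6 & \pi_{UG}\alpha_7 & \pi_{CG}\alpha_8 \\
GC & \pi_{AU} & \pi_{GU}\alpha_5 & * & \pi_{UA}\alpha_9 & \pi_{UG}\alpha_{10} & \pi_{CG}\alpha_{11} \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_6 & \pi_{GC}\alpha_9 & * & \pi_{UG}\alpha_{12} & \pi_{CG}\alpha_{13} \\
UG & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_7 & \pi_{GC}\alpha_{10} & \pi_{UA}\alpha_{12} & * & \pi_{CG}\alpha_{14} \\
CG & \pi_{AU}\alpha_4 & \pi_{GU}\alpha_8 & \pi_{GC}\alpha_{11} & \pi_{UA}\alpha_{13} & \pi_{UG}\alpha_{14} & *
\end{array}
$$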

Table 2.6: RNA6A transition matrix

RNA6B model (Tillier, 94)

The RNA6B model (Tillier, 1994) is formed by restriction of the RNA6A model. The RNA6B model has only 3 rate parameters and 6 frequencies; it uses a rate of single transitions $\alpha_1$ and a rate of double transversions $\alpha_2$. The reference for the rate ratios is the rate of double transitions. The transition rate matrix is given in table 2.7.


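$$
Q = m_r \times
\begin{array}{c|cccccc}
   & AU & GU & GC & UA & UG & CG \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_1 & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
GC & \pi_{AU} & \pi_{GU}\alpha_1 & * & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & * & \pi_{UG}\alpha_1 & \pi_{CG} \\
UG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA}\alpha_1 & * & \pi_{CG}\alpha_1 \\
CG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA} & \pi_{UG}\alpha_1 & *
\end{array}
$$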

Table 2.7: RNA6B transition matrix

RNA7A model

The RNA7A model is the most general of the seven state models. It has 21 rate parameters (including the reference rate AU↔GC) and 7 frequencies. All mismatches are treated as a single state MM. The RNA7A model is described by the following rate matrix (table 2.8).

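$$
Q = m_r \times
\begin{array}{c|ccccccc}
   & AU & GU & GC & UA & UG & CG & MM \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_4 & \pi_{MM}\alpha_5 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_6 & \pi_{UA}\alpha_7 & \pi_{UG}\alpha_8 & \pi_{CG}\alpha_9 & \pi_{MM}\alpha_{10} \\
GC & \pi_{AU} & \pi_{GU}\alpha_6 & * & \pi_{UA}\alpha_{11} & \pi_{UG}\alpha_{12} & \pi_{CG}\alpha_{13} & \pi_{MM}\alpha_{14} \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_7 & \pi_{GC}\alpha_{11} & * & \pi_{UG}\alpha_{15} & \pi_{CG}\alpha_{16} & \pi_{MM}\alpha_{17} \\
UG & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_8 & \pi_{GC}\alpha_{12} & \pi_{UA}\alpha_{15} & * & \pi_{CG}\alpha_{18} & \pi_{MM}\alpha_{19} \\
CG & \pi_{AU}\alpha_4 & \pi_{GU}\alpha_9 & \pi_{GC}\alpha_{13} & \pi_{UA}\alpha_{16} & \pi_{UG}\alpha_{18} & * & \pi_{MM}\alpha_{20} \\
MM & \pi_{AU}\alpha_5 & \pi_{GU}\alpha_{10} & \pi_{GC}\alpha_{14} & \pi_{UA}\alpha_{17} & \pi_{UG}\alpha_{19} & \pi_{CG}\alpha_{20} & *
\end{array}
$$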

Table 2.8: RNA7A transition matrix

RNA7D model (Tillier, 98)

The RNA7D model (Tillier and Collins, 1998) is a biologically plausible restriction of the RNA7A model. The restrictions in the 7D model are analogous to the restrictions made in the 6B model. There is one more frequency parameter for the mismatch state and one more rate ratio parameter for the substitution rates involving this state. The reference for the rate ratios is the rate of double transitions. This model is described by the following rate matrix (table 2.9).

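$$
Q = m_r \times
\begin{array}{c|ccccccc}
   & AU & GU & GC & UA & UG & CG & MM \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_1 & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
GC & \pi_{AU} & \pi_{GU}\alpha_1 & * & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{MM}\alpha_3 \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & * & \pi_{UG}\alpha_1 & \pi_{CG} & \pi_{MM}\alpha_3 \\
UG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA}\alpha_1 & * & \pi_{CG}\alpha_1 & \pi_{MM}\alpha_3 \\
CG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA} & \pi_{UG}\alpha_1 & * & \pi_{MM}\alpha_3 \\
MM & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_3 & \pi_{GC}\alpha_3 & \pi_{UA}\alpha_3 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_3 & *
\end{array}
$$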

Table 2.9: RNA7D transition matrix

RNA16A model

PHASE contains a general 16-state model (RNA16); however, this model has 119 + 15 free parameters and is not well suited for phylogenetic inference, especially maximum-likelihood inference. RNA16A is a simplified 16-state model: it reduces the complexity of the RNA16 model by cutting down the number of rate parameters from 120 to 5. It uses a rate of single transitions $\alpha_1$, a rate of double transversions $\alpha_2$, a mismatch↔non-mismatch transition rate $\alpha_3$ for transitions requiring only one substitution, and a mismatch↔mismatch transition rate $\alpha_4$ for transitions that also require only one substitution. The reference rate is the rate of double transitions. Some base-pair substitutions are not allowed (null substitution rate). The transition matrix for the RNA16A model is given in table 2.10.
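The rules above can be written as a small classification function. This is a sketch of how the entries of table 2.10 follow from the description, not code taken from PHASE; the function name `rna16a_rate_class` and the returned labels are illustrative.

```python
STABLE_PAIRS = {"AU", "GU", "GC", "UA", "UG", "CG"}
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "U"), ("U", "C")}


def rna16a_rate_class(source, target):
    """Return which rate ratio applies to the substitution source -> target.

    The result is one of "reference", "alpha1", "alpha2", "alpha3", "alpha4" or "zero".
    """
    if source == target:
        raise ValueError("diagonal entries are fixed by the row-sum constraint")
    changes = [(a, b) for a, b in zip(source, target) if a != b]
    source_stable = source in STABLE_PAIRS
    target_stable = target in STABLE_PAIRS
    if len(changes) == 2:
        # Double substitutions are only allowed between two stable pairs.
        if source_stable and target_stable:
            if all(change in TRANSITIONS for change in changes):
                return "reference"  # double transition, e.g. AU <-> GC
            return "alpha2"         # double transversion, e.g. AU <-> UA
        return "zero"
    # Exactly one of the two bases changes.
    if source_stable and target_stable:
        return "alpha1"             # single transition, e.g. AU <-> GU
    if source_stable or target_stable:
        return "alpha3"             # mismatch <-> stable pair
    return "alpha4"                 # mismatch <-> mismatch


for pair in [("AU", "GC"), ("AU", "GU"), ("AU", "UA"), ("AU", "AA"), ("AA", "AG"), ("AU", "GA")]:
    print(pair, rna16a_rate_class(*pair))
```

Multiplying the selected rate ratio by the frequency of the target pair and by $m_r$ gives the corresponding off-diagonal entry of Q.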

2.4 Refinements to substitution models

In this section we introduce some refinements made to the substitution models described above.

2.4.1 Invariant and discrete gamma models

Substitution rates vary across the sites of a sequence for many real datasets, if not all. Including the heterogeneity of rates in substitution models is widely recognized as an important factor in the fit to data. One attempt to take this acknowledged biological fact into account is to suppose that a proportion of sites are invariant while the others evolve at the same single rate. PHASE provides this invariant model. One extra parameter in the model governs the proportion of sites with zero rate of evolution.

Models that allow continuous variability of mutation rates over sites are more realistic and the gamma model of Yang (1994) outperforms the invariant model. The discrete gamma model is implemented in PHASE. The continuous rate distribution is approximated with a discrete distribution which is computationally tractable, and sites are divided into k equally probable rate categories. A single parameter α governs the shape of this distribution and the substitution rates for all categories. The mean E(r) of the gamma distribution is the average mutation rate of our substitution model as stated earlier and its variance is $V(r) = E(r)^2/\alpha$. A small α suggests that rates differ significantly between sites, with few sites having high rates and others being practically invariant; on the contrary, a large α models weak rate heterogeneity (see figure 2.6). When α → +∞, the gamma model reduces to the single rate model. The computational requirement of the discrete gamma model is roughly linear, i.e., the application of a discrete gamma model with k categories is about k times slower than the use of a model where rate heterogeneity is not considered.
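The variance formula quoted above follows directly from the standard moments of a gamma distribution. The short derivation below assumes the usual shape/rate parameterisation (shape α, rate β); the symbol β does not appear elsewhere in this manual and is introduced only for this step.

```latex
% Gamma density with shape \alpha and rate \beta (\beta is an auxiliary symbol):
%   f(r) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} r^{\alpha-1} e^{-\beta r}, \quad r > 0
\begin{aligned}
E(r) &= \frac{\alpha}{\beta}, \qquad V(r) = \frac{\alpha}{\beta^{2}} \\
\beta &= \frac{\alpha}{E(r)} \;\Longrightarrow\;
V(r) = \alpha \left(\frac{E(r)}{\alpha}\right)^{2} = \frac{E(r)^{2}}{\alpha} \\
E(r) &= 1 \;\Longrightarrow\; V(r) = \frac{1}{\alpha}
\end{aligned}
```

With the mean fixed to 1, α = 0.5 gives a variance of 2 (strong heterogeneity), whereas α = 10 gives a variance of 0.1 (rates close to uniform).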

2.4.2 The MIXED model

Since the current trend in phylogenetic analysis is to use several genes and/or several sorts of sequences at once, models were designed for combined analyses of heterogeneous sequence data from the same set of species (Yang, 1996). PHASE allows the use of multiple substitution models simultaneously to treat this kind of sequence, each model having its own independent set of parameters. The average mutation rate of the first model is still set to 1.0 but the average mutation rates of the others are now free parameters of the model. The MIXED model for combined analysis of heterogeneous data is equivalent to the model with proportional branch lengths described in Yang (1996).

2.5 Bayesian phylogenetics

2.5.1 Bayes’ theorem

A Bayesian approach to phylogeny reconstruction requires the definition of a parameter space Ω which contains the set of all possible combined states $\phi = \{\tau_i, \nu_i, \theta\}$, where the symbol $\tau_i$ labels the ith possible tree topology, $\nu_i$ are the branch lengths associated with this topology and θ is a set of allowed parameters for our evolutionary model (e.g., rate ratios $\alpha_{ij}$, nucleotide or base-pair frequencies $\pi_i$, gamma distribution parameter α, ...). According to Bayes’ theorem, we can calculate the posterior probability of the combined state φ given sequence data X,

$$p(\phi|X) = \frac{P(X|\phi)\,p(\phi)}{\sum_{i=1}^{N_s} \int d\nu_i \int d\theta\, P(X|\phi)\,p(\phi)} \tag{2.5}$$

where $N_s$ is the number of possible tree topologies for a data set containing s species, P(X|φ) is the likelihood of the data and p(φ) is the prior probability density associated with state φ.


Figure 2.6: Probability density function of several gamma distributions of rate heterogeneity with mean E(r) = 1

2.5.2 Markov chain Monte-Carlo (MCMC)

Computing the denominator of equation 2.5 is infeasible for realistically sized problems. A Markov chain Monte-Carlo method is therefore used. The standard Metropolis-Hastings MCMC algorithm can construct a Markov chain in our state space Ω by iterating a two step process (Metropolis et al., 1953; Hastings, 1970). First, a new state φ′ is drawn from the current state $\phi_n$ according to some proposal mechanism. The proposed state is then accepted or rejected with a probability which depends on the ratio of the posterior probabilities of the two states φ′ and $\phi_n$ and on the proposal.

After a “burnin” period, the chain converges to an equilibrium, under quite weak conditions. After discarding an initial portion of the chain, states are distributed according to the posterior probability density p(φ|X). PHASE can produce a large sample from the posterior probability density and with this sample one can compute the posterior probability of any identifiable phylogenetic feature of interest. For instance, the posterior probability of a specific topology is simply given by the fraction of times this topology appears in our MCMC sample. Similarly, we can fit a posterior probability density curve to the gamma distribution parameter.
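The two-step process and the use of the sample can be illustrated on a toy problem. The sketch below is not mcmcphase: it samples three hypothetical "topologies" whose unnormalised posterior weights (1, 2 and 7) are invented for the example, uses a symmetric proposal (so the proposal ratio cancels), discards a burnin period and estimates posterior probabilities as sample fractions.

```python
import random
from collections import Counter


def metropolis_sample(weights, n_steps, burn_in, seed=1):
    """Sample states with probability proportional to `weights` (unnormalised posterior)."""
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    sample = []
    for step in range(n_steps):
        # Step 1: propose a new state (uniform over all states, hence symmetric).
        proposed = rng.choice(states)
        # Step 2: accept with probability min(1, posterior ratio).
        ratio = weights[proposed] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposed
        # Only states visited after the burnin period are kept.
        if step >= burn_in:
            sample.append(current)
    return sample


# Hypothetical unnormalised posterior weights; the true posterior is 0.1, 0.2, 0.7.
weights = {"topology_1": 1.0, "topology_2": 2.0, "topology_3": 7.0}
sample = metropolis_sample(weights, n_steps=60000, burn_in=10000)
counts = Counter(sample)
posterior = {state: counts[state] / len(sample) for state in weights}
print(posterior)
```

Note that the normalising constant (the denominator of equation 2.5) is never needed: only ratios of posterior values enter the acceptance step.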

2.5.3 Priors and proposals

Uniform priors ?

We have no strong evidence for any particular prior and we therefore choose a simple factorized prior $p(\phi) = p(\theta)\,p(\nu_i)\,P(\tau_i)$. We assume a uniform prior on all trees, $P(\tau_i) = 1/N_s$; we use a flat Dirichlet distribution prior for frequency parameters (i.e., all sets of frequencies are equally likely as long as they sum to one) and we choose a uniform positive prior for substitution rate parameters, gamma distribution parameter and branch lengths. Consequently, for all pairs of possible states (φ, φ′), the priors are equal. One should set upper limits in the case of uniform priors, but since these parameters usually remain between reasonable limits during simulations, these boundaries should not have any effect on experimental results unless unreasonable values are chosen. It is good practice to check whether these upper boundaries are reached while monitoring the convergence of the parameters.


Proposals for the parameters

For the proposal step, we have to balance the desire to move globally through the parameter space Ω with the need to make computationally feasible moves in areas of high probability. Therefore we split up the process and apply a suitable proposal to the variables at each iteration. For the frequency parameters we adopt a Dirichlet proposal distribution centred at the current frequency vector, as used by Larget and Simon (1999). For the gamma distribution parameter and the substitution rate ratios, a normal proposal distribution centred at the current value is used, with reflecting boundaries at zero and at the upper limit defined above. Distant moves in Ω might result in a low acceptance rate, whereas small modifications will prevent a full inspection of highly probable areas. The parameters used to move in the state space must therefore be carefully chosen for proper mixing (quick convergence and good sampling from the posterior probability density). With mcmcphase, these parameters can be adjusted during the burnin period.
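A normal proposal with reflecting boundaries can be sketched as follows. The function names `reflect` and `propose`, the step size and the upper limit used in the example are assumptions for illustration; the actual tuning parameters of mcmcphase are not shown here.

```python
import random


def reflect(value, lower, upper):
    """Fold a value back into [lower, upper] by reflecting it at both boundaries."""
    width = upper - lower
    if width <= 0:
        raise ValueError("upper must be greater than lower")
    # Position within one full back-and-forth period of length 2 * width.
    offset = (value - lower) % (2 * width)
    if offset > width:
        offset = 2 * width - offset
    return lower + offset


def propose(current, step_size, lower, upper, rng):
    """Draw from a normal distribution centred at `current`, then reflect into range."""
    return reflect(current + rng.gauss(0.0, step_size), lower, upper)


rng = random.Random(0)
# Hypothetical rate ratio close to zero, with an assumed upper limit of 5.0.
proposals = [propose(0.1, 0.5, 0.0, 5.0, rng) for _ in range(1000)]
print(min(proposals), max(proposals))
```

A larger `step_size` produces more distant moves (and typically a lower acceptance rate), while a smaller one explores the neighbourhood of the current value more slowly, which is the trade-off described above.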

Proposals for the tree

The tree topology is perturbed every ten cycles with either the nearest neighbor interchange (NNI) proposal shown in figure 2.7 or the subtree pruning and re-grafting (SPR) proposal (Swofford et al., 1996) shown in figure 2.8.

Figure 2.7: The nearest neighbor interchange algorithm (Jow et al., 2002)

Figure 2.8: The subtree pruning and regrafting algorithm (Jow et al., 2002)


Each cycle, a randomly chosen branch length x is modified by a quantity δ drawn from a normal distribution centred at zero. When the branch length becomes negative, special rules which can lead to a topology change are applied (Jow et al., 2002). If the branch is an internal branch then one of the two nearest neighbor topologies is proposed, each with equal probability; this is the nearest neighbor interchange described above. The new internal branch length is set to y = |x + δ| (see figure 2.9). If the branch is a terminal branch, we cannot apply the NNI algorithm and we simply use a reflecting boundary. The new proposed length is y = |x + δ|.

Figure 2.9: The continuous change algorithm when x + δ < 0 (Jow et al., 2002)

The acceptance rates for the SPR and the NNI proposals are usually quite low. The “local” NNI proposal, induced by a branch length modification, has a better acceptance rate.

2.5.4 Pitfalls of Markov chain Monte-Carlo techniques

One can doubt that maximum-likelihood algorithms always find the true global maximum of the likelihood function. Similarly, with MCMC techniques, the Markov chain can fail to converge to the stationary distribution of the posterior probabilities. A possible reason for this is the failure to visit all highly probable regions of the parameter space because of local maxima in the likelihood curve. However, poor proposal mechanisms and/or failure to run the chain long enough are usually the main causes of sample defects (see Huelsenbeck et al., 2002). Unfortunately it is not always easy to identify these traps. We can only recommend doing long runs, monitoring the convergence of several model parameters (since monitoring the likelihood alone is not enough), and repeating the experiment using different random starting trees to check that all the chains give similar results (i.e., substitution model parameters, consensus tree, likelihood, ...).


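$$
Q = m_r \times
\begin{array}{c|cccccccccccccccc}
   & AU & GU & GC & UA & UG & CG & AA & AG & AC & GA & GG & CA & CC & CU & UC & UU \\ \hline
AU & * & \pi_{GU}\alpha_1 & \pi_{GC} & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & \pi_{AA}\alpha_3 & \pi_{AG}\alpha_3 & \pi_{AC}\alpha_3 & 0 & 0 & 0 & 0 & \pi_{CU}\alpha_3 & 0 & \pi_{UU}\alpha_3 \\
GU & \pi_{AU}\alpha_1 & * & \pi_{GC}\alpha_1 & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & 0 & 0 & 0 & \pi_{GA}\alpha_3 & \pi_{GG}\alpha_3 & 0 & 0 & \pi_{CU}\alpha_3 & 0 & \pi_{UU}\alpha_3 \\
GC & \pi_{AU} & \pi_{GU}\alpha_1 & * & \pi_{UA}\alpha_2 & \pi_{UG}\alpha_2 & \pi_{CG}\alpha_2 & 0 & 0 & \pi_{AC}\alpha_3 & \pi_{GA}\alpha_3 & \pi_{GG}\alpha_3 & 0 & \pi_{CC}\alpha_3 & 0 & \pi_{UC}\alpha_3 & 0 \\
UA & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & * & \pi_{UG}\alpha_1 & \pi_{CG} & \pi_{AA}\alpha_3 & 0 & 0 & \pi_{GA}\alpha_3 & 0 & \pi_{CA}\alpha_3 & 0 & 0 & \pi_{UC}\alpha_3 & \pi_{UU}\alpha_3 \\
UG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA}\alpha_1 & * & \pi_{CG}\alpha_1 & 0 & \pi_{AG}\alpha_3 & 0 & 0 & \pi_{GG}\alpha_3 & 0 & 0 & 0 & \pi_{UC}\alpha_3 & \pi_{UU}\alpha_3 \\
CG & \pi_{AU}\alpha_2 & \pi_{GU}\alpha_2 & \pi_{GC}\alpha_2 & \pi_{UA} & \pi_{UG}\alpha_1 & * & 0 & \pi_{AG}\alpha_3 & 0 & 0 & \pi_{GG}\alpha_3 & \pi_{CA}\alpha_3 & \pi_{CC}\alpha_3 & \pi_{CU}\alpha_3 & 0 & 0 \\
AA & \pi_{AU}\alpha_3 & 0 & 0 & \pi_{UA}\alpha_3 & 0 & 0 & * & \pi_{AG}\alpha_4 & \pi_{AC}\alpha_4 & \pi_{GA}\alpha_4 & 0 & \pi_{CA}\alpha_4 & 0 & 0 & 0 & 0 \\
AG & \pi_{AU}\alpha_3 & 0 & 0 & 0 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_3 & \pi_{AA}\alpha_4 & * & \pi_{AC}\alpha_4 & 0 & \pi_{GG}\alpha_4 & 0 & 0 & 0 & 0 & 0 \\
AC & \pi_{AU}\alpha_3 & 0 & \pi_{GC}\alpha_3 & 0 & 0 & 0 & \pi_{AA}\alpha_4 & \pi_{AG}\alpha_4 & * & 0 & 0 & 0 & \pi_{CC}\alpha_4 & 0 & \pi_{UC}\alpha_4 & 0 \\
GA & 0 & \pi_{GU}\alpha_3 & \pi_{GC}\alpha_3 & \pi_{UA}\alpha_3 & 0 & 0 & \pi_{AA}\alpha_4 & 0 & 0 & * & \pi_{GG}\alpha_4 & \pi_{CA}\alpha_4 & 0 & 0 & 0 & 0 \\
GG & 0 & \pi_{GU}\alpha_3 & \pi_{GC}\alpha_3 & 0 & \pi_{UG}\alpha_3 & \pi_{CG}\alpha_3 & 0 & \pi_{AG}\alpha_4 & 0 & \pi_{GA}\alpha_4 & * & 0 & 0 & 0 & 0 & 0 \\
CA & 0 & 0 & 0 & \pi_{UA}\alpha_3 & 0 & \pi_{CG}\alpha_3 & \pi_{AA}\alpha_4 & 0 & 0 & \pi_{GA}\alpha_4 & 0 & * & \pi_{CC}\alpha_4 & \pi_{CU}\alpha_4 & 0 & 0 \\
CC & 0 & 0 & \pi_{GC}\alpha_3 & 0 & 0 & \pi_{CG}\alpha_3 & 0 & 0 & \pi_{AC}\alpha_4 & 0 & 0 & \pi_{CA}\alpha_4 & * & \pi_{CU}\alpha_4 & \pi_{UC}\alpha_4 & 0 \\
CU & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_3 & 0 & 0 & 0 & \pi_{CG}\alpha_3 & 0 & 0 & 0 & 0 & 0 & \pi_{CA}\alpha_4 & \pi_{CC}\alpha_4 & * & 0 & \pi_{UU}\alpha_4 \\
UC & 0 & 0 & \pi_{GC}\alpha_3 & \pi_{UA}\alpha_3 & \pi_{UG}\alpha_3 & 0 & 0 & 0 & \pi_{AC}\alpha_4 & 0 & 0 & 0 & \pi_{CC}\alpha_4 & 0 & * & \pi_{UU}\alpha_4 \\
UU & \pi_{AU}\alpha_3 & \pi_{GU}\alpha_3 & 0 & \pi_{UA}\alpha_3 & \pi_{UG}\alpha_3 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \pi_{CU}\alpha_4 & \pi_{UC}\alpha_4 & *
\end{array}
$$

Table 2.10: RNA16A transition matrix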


Appendices


Appendix A - Some examples of control files

A.1 Control file for likelihood

######################## The Sequence Alignment Section ############
DATAFILE
#The name of your data file
Data file = data/mammals69.mix

#The format of your data file (interleaved or not)
Interleaved data file = no

#The species used to root the tree
Outgroup = 26
\DATAFILE

######################## The Evolutionary Model Section ############
MODEL
#the name of your model
Model = MIXED

#since we are using the mixed model we provide the number of models
Number of models = 2

#and we define each substitution model inside its own block.
#the file "data/mammals69.mix" is a RNA sequence with loops and stems
#we did not specify a class section but the code used was MIXED
#therefore the first model must be the DNA model for the loop
#and the second model must be the RNA model for the helices.
MODEL1

#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3

\MODEL1
MODEL2

#RNA model : RNA7A + dg4 + I
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = yes

\MODEL2
\MODEL

####################### The tree & model Section ####################

#To evaluate the likelihood of a phylogeny you must provide
#1.a phylogeny file (tree with branch lengths)
Tree file = data/mammals69-mix-consensus.tree

#2.the parameters for the model you defined above
Model parameters file = data/mammals69-mix-consensus.model
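The layout of these control files (comment lines starting with "#", "field = value" lines, and blocks opened by a name and closed by a backslash followed by the same name) can be read with a few lines of code. The sketch below is a hypothetical reader written for illustration only; it is not the parser used by the PHASE programs, and the exact grammar they accept may differ.

```python
def parse_control(text):
    """Read a control file into nested dictionaries (blocks become sub-dictionaries)."""
    root = {}
    stack = [("", root)]  # (block name, dictionary) pairs, innermost block last
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("\\"):
            name = line[1:]
            if len(stack) == 1 or stack[-1][0] != name:
                raise ValueError("unexpected end of block: " + line)
            stack.pop()
        elif "=" in line:
            field, value = line.split("=", 1)
            stack[-1][1][field.strip()] = value.strip()
        else:
            block = {}
            stack[-1][1][line] = block
            stack.append((line, block))
    if len(stack) != 1:
        raise ValueError("block not closed: " + stack[-1][0])
    return root


# Shortened version of the control file above, used as sample input.
sample = r"""
DATAFILE
#The name of your data file
Data file = data/mammals69.mix
Outgroup = 26
\DATAFILE
MODEL
Model = MIXED
Number of models = 2
MODEL1
Model = REV
Number of gamma categories = 3
\MODEL1
MODEL2
Model = RNA7A
Invariant sites = yes
\MODEL2
\MODEL
Tree file = data/mammals69-mix-consensus.tree
"""
config = parse_control(sample)
print(config["MODEL"]["MODEL2"]["Model"])
```

Keeping each block as its own dictionary avoids clashes between fields such as "Model", which appears both in the MODEL block and in each of its sub-blocks.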

A.2 Control file for optimise

######################## The Sequence Alignment Section ############
DATAFILE
#The name of your data file
Data file = data/primates.rna

#the format of your data file (interleaved or not)
Interleaved data file = no

#the species used to root the tree
Outgroup = 14
\DATAFILE

######################## The Evolutionary Model Section ############
MODEL
#model : RNA16A + dG4
Model = RNA16A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
\MODEL

####################### The tree & model Section ####################
#the phylogeny to optimise
Tree file = data/primates.tree

#an optional field to choose the initial model parameters (and check
#whether the method always converge to the same tree)
#Starting model parameters file =

#a random seed to initialise the branch lengths randomly in case the phylogeny
#provided does not hold this information
Random seed = 1

#the base name of the three output files (base.output, base.model, base.tree)
Output file = results/primates-rna-optimise/primates-rna-optimise-RNA16A

A.3 Control file for simulate (1)




MODEL1
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3

\MODEL1
MODEL2


\MODEL2
\MODEL

######################## The Simulate Section ############

#to produce an example of '.model' file for the specified model set this field
#to 'yes'
Retrieve the name of the model's parameters = yes

#the following field is the name of the '.model' file to create
Model parameters file = data/simulate.model

A.4 Control file for simulate (2)



MODEL1
#DNA model : REV + dg3
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 3

\MODEL1
MODEL2


\MODEL2
\MODEL

######################## The Simulate Section ############

#to simulate some sequences set this field to 'no'
Retrieve the name of the model's parameters = no

#the file with the user-specified parameters of the substitution model
Model parameters file = data/simulate.model

#Initialise the random number generator with a seed
Random seed = 1

#Random tree or user-specified tree ?
Random tree = yes

#parameters used if a random tree is generated
Number of species = 8
Maximum branch length = .5

#if Random tree == yes the tree will be saved with that file name
#if Random tree == no the tree is read from that file
Tree file = simulated-data/random-8species.tree

#generate sequences:
#for each model you have to specify the desired number of symbols
#if you are not using a MIXED model fill this field for the class 1 only
Number of symbols from class 1 = 600
Number of symbols from class 2 = 1200

#if you need a secondary structure fill the following fields
Structure for the elements of class 1 = .
Structure for the elements of class 2 = ()

#to produce a complete PHASE input file, you have to specify the type and the
#final length yourself
Data file type = RNA
Total length of the raw sequences = 3000

#the name of the file where your sequences are saved, please check this file
#before use
Output file = simulated-data/simulated-8species.mix

A.5 Control file for mlphase (1)

####################### The Data Section ###########################DATAFILE#The name of your data fileData file = data/hiv.dna




######################## The Evolutionary Model Section ############
MODEL
#model : REV + dG4
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
\MODEL

####################### The Function Section ###########################
FUNCTION
Function = Optimise user-defined phylogenetic trees

#The file with the trees
Trees file = data/hiv-dna.trees
#The number of trees in this file
Number of trees = 2

#Optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes

# The name of the file containing initial substitution model parameters,
# if the previous field is set to no (ie, fixed parameters for the model) this
# field is compulsory
#### User's model parameters file =

\FUNCTION

# Random seed for the random number generator
Random seed = 2

# The next control line sets the output file
Output file = results/hiv-dna-ml/hiv-dna-ml.output
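
The control files in this appendix follow a simple line-oriented layout: comment lines start with "#", fields are written as "name = value", and block markers such as MODEL and \MODEL stand alone on a line. The sketch below reads the fields of such a file into a dictionary. It is an illustration under those assumptions only; `parse_fields` is a hypothetical helper and says nothing about how PHASE itself reads its control files.

```python
def parse_fields(text):
    """Collect 'name = value' fields from a control file in the layout above.

    Hypothetical helper, not part of PHASE. Comment lines and block
    markers are skipped; a repeated field name overwrites the earlier one.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including '####' disabled fields)
        if "=" not in line:
            continue  # block marker, e.g. MODEL or its closing marker
        name, _, value = line.partition("=")
        fields[name.strip()] = value.strip()
    return fields


sample = r"""
MODEL
#model : REV + dG4
Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
\MODEL
FUNCTION
Function = Optimise user-defined phylogenetic trees
Number of trees = 2
#### User's model parameters file =
\FUNCTION
Random seed = 2
"""

fields = parse_fields(sample)
print(fields["Model"])            # REV
print(fields["Number of trees"])  # 2
```

Because the result is a flat dictionary, this sketch is only suitable for control files without repeated field names; a MIXED model with MODEL1 and MODEL2 blocks would need per-block dictionaries.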

A.6 Control file for mlphase (2)

####################### The Data Section ###########################
DATAFILE
#The name of your data file
Data file = data/primates.rna



######################## The Evolutionary Model Section ############
MODEL
#model : RNA7A + dG3
Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
\MODEL


####################### The Function Section ###########################
FUNCTION
Function = Search for ML topology

#Monophyletic clades ?
User defined monophyletic clades = yes
Clade file = data/primates.clades

#The search method for tree topology :
# 'Simple exhaustive search', 'Branch-and-bound exhaustive search' or
# 'Heuristic stepwise addition'
Topology search = Branch-and-bound exhaustive search

#optimise the substitution model parameters simultaneously ?
Optimise model parameters = yes

#a field to choose initial model parameters, this field is compulsory if
#the previous field is set to no
#### User's model parameters file =

\FUNCTION

# Random seed for the random number generator
Random seed = 2

# The next control line sets the output file
Output file = results/primates-rna-ml/primates-rna-ml.output
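
Each block in these control files is opened by a keyword (for example FUNCTION) and closed by the same keyword preceded by a backslash. A quick way to catch a forgotten closing marker before running a program is to check the nesting with a stack. The sketch below does this under the assumption that every non-comment line without "=" is a block marker; `check_blocks` is a hypothetical helper, not a PHASE tool.

```python
def check_blocks(text):
    """Return a list of problems; an empty list means every block is closed.

    Hypothetical helper, not part of PHASE. Assumes any non-comment line
    without '=' is a block marker (NAME opens, backslash + NAME closes).
    """
    stack, problems = [], []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#") or "=" in line:
            continue  # blank line, comment, or ordinary field
        if line.startswith("\\"):
            name = line[1:]
            if stack and stack[-1] == name:
                stack.pop()  # closes the innermost open block
            else:
                problems.append(f"line {number}: unexpected \\{name}")
        else:
            stack.append(line)  # opens a new block
    problems.extend(f"unclosed block {name}" for name in stack)
    return problems


good = r"""
MODEL
Model = MIXED
MODEL1
Model = REV
\MODEL1
\MODEL
"""

bad = r"""
FUNCTION
Function = Search for ML topology
"""

print(check_blocks(good))  # []
print(check_blocks(bad))   # ['unclosed block FUNCTION']
```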

A.7 Control file for mcmcphase (1)

######################## The Sequence Alignment Section ############
DATAFILE
#The name of your data file
Data file = simulated-data/suzuki-arranged.dna

#The species used to root the tree
Outgroup = 1

#Is there a class section in your data file ?
Heterogeneous data models = yes
\DATAFILE

######################## The Evolutionary Model Section ############
MODEL
#model : K80 + dG3
Model = K80
Discrete gamma distribution of rates = yes
Number of gamma categories = 3
Invariant sites = no
\MODEL


######################## The Perturbation Section ############
PERTURBATION
#Initial branch step proposal parameter
Initial branch step proposal parameter = 0.1

#Upper bound for the branch length uniform distribution
Branch length upper bound = 1.5

#priority for the frequencies perturbation
Frequencies, proposal priority = 1
#optional initial parameter for the perturbation
Frequencies, initial Dirichlet tuning parameter = 500.0

#priority for the rate ratios perturbation
Rate ratios, proposal priority = 1
#optional, initial rate ratio step proposal parameter
Rate ratios, initial step = 0.3
#optional, set the lower bound for the acceptance rate
Rate ratios, proposal minimum acceptance rate = 0.2
#optional, set the upper bound for the acceptance rate
Rate ratios, proposal maximum acceptance rate = 0.6

#priority for the gamma shape parameter perturbation
Gamma parameter, proposal priority = 1
#do not adapt the proposal parameter during the burnin period
Gamma parameter, proposal minimum acceptance rate = .0
Gamma parameter, proposal maximum acceptance rate = 1.0
# the initial proposal step is required because it is fixed
Gamma parameter, initial step = .1

#priority for the invariant parameter (% of invariant sites) perturbation
Invariant parameter, proposal priority = 1

\PERTURBATION

######################## The program Section ############
#initialise the random number generator
Random seed = 1

#number of burnin iterations
Burnin iterations = 150000

#number of sampling iterations
Sampling iterations = 300000

#sample every 20 cycles
Sampling period = 20

#initialise the chain with user-defined substitution parameters ?
Random start model parameters = yes
#### User's starting model parameters file =

#initialise the chain with a given tree ?
Random start tree = yes
#### User's starting tree file =

#The base name for the output files (base.output, base.bestmp, base.besttree,
# base.samples, base.mp, base.bl, base.plot)
Output file = results/simulation-mix-mcmc/simulation-suzuki-mcmc

#the format for the 'base.samples' file (phylip or bambe)
Output format = phylip
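
The iteration settings above determine how long the chain runs and how many samples end up in the output. As a rough check, assuming the burn-in iterations are run first and discarded, and that one sample is recorded every `Sampling period` cycles of the sampling phase (the function name below is hypothetical):

```python
# Hypothetical helper, not part of PHASE.
# Assumptions: burn-in is discarded, and exactly one sample is kept
# every `period` iterations of the sampling phase.
def chain_summary(burnin, sampling, period):
    return {
        "total_iterations": burnin + sampling,
        "samples": sampling // period,
    }


summary = chain_summary(burnin=150000, sampling=300000, period=20)
print(summary["total_iterations"])  # 450000
print(summary["samples"])           # 15000
```

The same arithmetic applies to any other combination of burn-in, sampling iterations, and sampling period.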

A.8 Control file for mcmcphase (2)

######################## The Sequence Alignment Section ############
DATAFILE
Data file = data/mammals69.mix
Interleaved data file = no
Outgroup = 26
\DATAFILE

######################## The Evolutionary Model Section ############
MODEL
Model = MIXED
Number of models = 2
MODEL1

Model = REV
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no

\MODEL1
MODEL2

Model = RNA7A
Discrete gamma distribution of rates = yes
Number of gamma categories = 4
Invariant sites = no

\MODEL2
\MODEL

######################## The MCMC PERTURBATION Section ############
PERTURBATION
#PERTURBATION OF THE TREE :
Initial branch step proposal parameter = 0.03
Branch length upper bound = 1.7

#PERTURBATION OF THE MODEL :
Model 1 priority = 8
Model 2 priority = 24

Average rates, proposal priority = 1
Average rates, initial step = .3
Average rates, proposal minimum acceptance rate = .15
Average rates, proposal maximum acceptance rate = .20

PERTURBATION1
Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1

\PERTURBATION1
PERTURBATION2

Frequencies, proposal priority = 1
Rate ratios, proposal priority = 1
Gamma parameter, proposal priority = 1
\PERTURBATION2

\PERTURBATION

Random seed = 1

Burnin iterations = 40000

Sampling iterations = 100000

Sampling period = 10

Random start model parameters = no
User's starting model parameters file = data/mammals69-mix-consensus.model

Random start tree = no
User's starting tree file = data/mammals69-mix-consensus.tree

Output file = results/mammals69-mix-mcmc/mammals69-mix-mcmc-preinit
Output format = phylip
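
The two model priorities in the perturbation section of this control file (8 and 24) can be read as relative weights. Assuming, and this is only an assumption made for illustration, that a perturbation is proposed with probability proportional to its priority, the share of model proposals going to each model can be worked out as follows (`relative_shares` is a hypothetical helper):

```python
from fractions import Fraction


# Hypothetical helper, not part of PHASE.
# Assumption: proposal probability is proportional to the priority value.
def relative_shares(priorities):
    total = sum(priorities.values())
    return {name: Fraction(value, total) for name, value in priorities.items()}


shares = relative_shares({"Model 1": 8, "Model 2": 24})
print(shares["Model 1"], shares["Model 2"])  # 1/4 3/4
```

Under that reading, the second model would be perturbed three times as often as the first.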


Bibliography

Hasegawa, M. et al. 1985. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol., 22:160–174.

Hastings, W. 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109.

Higgs, P. 1998. Compensatory neutral mutation and the evolution of RNA. Genetica, 102:91–101.

Hudelot, C., V. Gowri-Shankar, H. Jow, M. Rattray, and P. Higgs. 2003. RNA-based phylogenetics methods: Application to mammalian mitochondrial RNA sequences. Mol. Phyl. Evol.

Huelsenbeck, J., B. Larget, R. Miller, and F. Ronquist. 2002. Potential applications and pitfalls of Bayesian inference of phylogeny. Syst. Biol., 51(5):673–688.

Jow, H., C. Hudelot, M. Rattray, and P. Higgs. 2002. Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol. Biol. Evol., 19(9):1591–1601.

Jukes, T. and C. Cantor. 1969. Evolution of protein molecules. In Mammalian Protein Metabolism, volume 3, Pp. 21–132. Munro, H.H., ed.

Kimura, M. 1980. A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16:111–120.

Larget, B. and D. Simon. 1999. Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Molecular Biology and Evolution, 16(6):750–759.

Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. 1953. Equations of states calculations for fast computing machines. Journal of Chemical Physics, 21:1087–1091.

Savill, N., D. Hoyle, and P. Higgs. 2001. RNA sequence evolution with secondary structure constraints: Comparison of substitution rate models using maximum likelihood methods. Genetics, 157:399–411.

Swofford, D. L., G. Olsen, P. Waddell, and D. Hillis. 1996. Phylogenetic inference. In Molecular Systematics (2nd edition), Pp. 407–515. Hillis, D.M.

Tamura, K. and M. Nei. 1993. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol., 10(3):512–526.

Tillier, E. and R. Collins. 1998. High apparent rate of simultaneous compensatory basepair substitutions in ribosomal RNA. Genetics, 148:1993–2002.


Tillier, E. R. M. 1994. Maximum likelihood with multiparameter models of substitution. Journal of Molecular Evolution, 39:409–417.

Whelan, S., P. Lio, and N. Goldman. 2001. Molecular phylogenetics: state-of-the art methods for looking into the past. TRENDS in Genetics, 17(5):262–272.

Yang, Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods. J. Mol. Evol., 39:306–314.

Yang, Z. 1996. Maximum likelihood models for combined analyses of multiple sequence data. J. Mol. Evol., 42:587–596.
