Phylogeny Notes PDF

1

Chapter Y Bioinformatics of Phylogeny The focus of this chapter is to address the bioinformatics of phylogeny. Relying on the classical and modern views on evolution pertinent to origin and diversity of species, details on phylogenetic concepts and phylogenetic analysis are presented. Relevant software classifications and available phylogenetic programs are discussed. y.1 EVOLUTIONARY THEORY The historical visit in 1837 to Galapagos Islands by Charles Darwin, led him to observe the unique features of those islands in each being different from other in their natural fauna and all of the islands had distinct populations of finches largely adapted to the natural ambient of each individual island. Further, a set of birds with certain types of beaks flourished on particular islands implying that a selection process had been acting on the populations. Darwin presented his observations in his celebrated book, The Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life [y.1] with a bold proposition that “all the organic beings which have ever lived on this earth have descended from some one primordial form”. He supported his hypothesis with different species of finch observed by him on the Gallapagos islands and indicated an extensive examples from nature such as the vestigial wings on flightless beetles and argued that these could only be remnants of some common ancestor.

Thus emerged, the evolutionary history of all organisms based on Darwin’s general framework on how to infer relationships among species using morphological, paleontological, and bio-geographical details. Ever since then systematically, grew the art of understanding the relationships among species with a scale of diversity of life seen on the earth.

Evolution is also concurrently known as biological, genetic or organic evolution. It refers to the changes and modifications observed “in the inherited traits of a population of organisms through

2

successive generations”. Here the “trait” a specific attribute of anatomical, biochemical or behavioral characteristics stemming from gene-environment interaction.

The observed changes are due to two settings of a complex system: (i) Interactions between a set of processes, which introduce variations into a population; (ii) and, another set of processes eliminating the variations. In this context of such polemically differing interacting processes, the variants having specified traits “become more, or less, common”.

In the biology evolution, the variation indicated above is caused by mutation, which introduces genetic modifications. Further, the changes are heritable conveyed through generations via reproduction giving rise “to alternative traits in organisms”. Yet another source of variation is possible -due to genetic recombination, wherein the genes could get shuffled into new combinations leading to organisms exhibiting different traits. It has been also observed that variation can also be enhanced under certain circumstances. This happens as a result of transfer of genes between species.

When an organism takes up permanent residence within another with the two becoming a single functional organism, it is known as endosymbiosis (Mitochondria and plastids are believed to have resulted from endosymbiosis). Rarely (but significantly), the variations indicated may occur due to “wholesale incorporation of genomes through endosymbiosis”.

Largely, the variants may become more common or rarer in a population due to natural selection. The reason is that the natural selection process allows two streams of traits: Traits that aid survival and reproduction to become more common and those that hinder survival and reproduction to become rarer.

In Nature, the available resources are limited and organisms may produce excessive offspring, more than their environment can support. As such, natural selection comes into play inasmuch as only a small proportion of individuals in each generation will survive and reproduce. Over the generations, natural selection process (selectively) filters the heritable variation in traits retaining successively the beneficial changes through differential survival and reproduction. The underlying iterative considerations adjust the traits making them better suited to the ambient of the organism. Such

3

adjustments termed as adaptations. (However, it is cautioned that not all change could be adaptive).

The so-called genetic drift is also a causative mechanism for evolution leading to random changes seen in the prevalence of common traits in a population. This genetic drift is very concerning whenever traits do not strongly influence survival particularly across a small populations (where chance or probabilistic norms would play a disproportionate role in the frequency of traits passed on to the next generation.

The concept of genetic drift is popular in the applications of the so-called neutral theory of molecular evolution, and it plays a role in the molecular clocks of phylogenetic studies.

Speciation is a key component in evolution. It refers to the context in which a single ancestral species splits and diversifies into multiple new species via several modes of occurrence. A long stretch of speciation events is responsible for the descend of all living (and extinct) species from a common ancestor. These events are marks set on a diverse "tree of life" (considered to have grown over the 3.5 billion years supporting the life on Earth). Visibly, speciation is seen in anatomical, genetic and other similarities between groups of organisms, geographical proliferation of related species, the fossil record etc. y.1.1 Natural Selection The underlying mechanism through which evolution takes place is called natural selection. It refers to a process that acts on populations of individual species in order to produce the diversity of life as seen now. Natural selection hypothesizes that, in Nature any population produces a gamut of offspring more than the environment can support. As such, there is always an inevitable struggle for survival within the population; and, those individuals that are “best adapted to the environment” bear a better chance of survival and to produce offspring. A continued natural selection process acted over a large numbers of generations, leads to a population, which is highly adapted to its particular environmental conditions. In certain situations and ambient, a population may become reproductively isolated producing a new species.

4

In the biological ecosystem, there are factors that are biotic (species) meaning related to life (such as plants, animals, fungi, bacteria etc.). To start with, the locale of such living factors is barren and unoccupied. Subsequently, new organisms colonize the environment. However, their successful proliferation (survival and reproduction) relies on favorable ambient conditions in the area. Such environmental conditions are dictated by abiotic factors such as habitat (pond, lake, ocean, desert, mountain etc.) and weather features like temperature, rain, snow etc. When a variety of species are present in such an ecosystem, the consequent actions of these species can affect/influence the lives of fellow species in the area; and, these influencing factors are deemed biotic factors. Thus, biotic and abiotic factors combine to create a system or more precisely, a complex biological ecosystem that represents a community of living and nonliving things considered as a unit.

Natural selection is not braced over the population across the time-scale. Most often, its template for fortuitous individuals gets constantly changed as dictated by how external (exogenous) factors or abiotic influences (such as climatic conditions, emergence of new predators, accessibility to food etc.). Further, no matter to what extent a population has adapted to its environment, it does not, however, provides any guarantees on the future survival of the species. An example of this phenomenology refers to the Irish Elk, which is known for its formidable antlers whose existence are suspected to be have been due to constant sexual selection among fighting males attempting to gain access to females. However, as the climate changed quickly at the end of the last ice-age, the combination of a general reduction in food supply and increased hunting by humans led the species to suffer from a condition similar to osteoporosis, which is thought to have resulted eventually to its extinction [Stuart et al., 2004].

Said correctly, “nothing in biology makes sense except in the light of evolution” [Dobzhansky, T. 1973. Nothing in Biology Makes Sense Except in the Light of Evolution. The American Biology Teacher, 35:125-129].Thus, the theory of evolution points out the plausibility of a unique, unifying and cohesive force that explains the origin and existence of all forms of life. As stated in [FSU], “it is to

5

the life sciences what the long sought holy grail of the unified field theory is to astrophysics”.

Any form of life is a descendent from a common ancestral origin. Evolution of species is not a static process but it dynamically changes over time. Darwin’s description of this process conforms to being a variation sorted out through drift and selection of lineages that diverge. In short, evolution implies a “descent with modification.” making the organisms to bear a history; and, the modifications observed are stochastical epochs (of that history) stemming from the statistical nature of mutational changes causing damage or otherwise.

y.1.2 Modern views on Evolution: Synthetic Theory With his enunciated theory of evolution, Darwin could not, however, explain the underlying process and operational aspects of natural selection. This was mainly due to the absence of genetic considerations vis-à-vis inheritance that was later proposed by Mendel (in 1866) revealing the secrets of heredity and variability that prevail in populations; and, only in 20th century, neo-Darwinism came into being with the synthetic theory of evolution that blends the fields of genetics and evolution [Huxley, 1974)]. This neoteric concept formalizes genetic mutation as the essential ingredient, in formulating the observed variations solicited by the occurrence of natural selection. Further, with the advent of protein sequence data (in the early 1960’s), it became evident that the level of mutation among populations could be much higher than that expected by advantageous mutations [Barrell and Sanger, 1967]. Subsequently, a number of theories were ushered in to explain the aforesaid observation. For example, the so-called Kimura’s neutral theory of evolution specifies that most mutations at the genetic level are neutral meaning that they are neither advantageous nor disadvantageous toward natural selection) and determined largely by the mutation rate and effective population size [Kimura, 1968]. This concept concurs with the theory of natural selection wherein it is proposed that only a minute fraction of mutations are adaptive and eventually affect the fitness of an individual. Considering a fairly large platform of population, substitutions (namely, change of one amino acid or nucleotide base to another) occur as a very gradual

6

(temporally long) process and support many different mutations but only a very small proportion actually gets sustained in the overall population. In all, in addition to indels and substitutions, rearrangements of genetic materials in the species along the evolutionary time-scale are also possible.

Implicit knowledge of genetic science has been in vogue even in the prehistoric times through selective breeding and domestication of animals; and, the early 20th century saw a graceful emergence of the subject genetics (with “geno” meaning “to give birth” in Greek)) with the proposal by William Bateson to describe the study of inheritance, variation, and heredity. As described in earlier chapters, subsequent contribution by Watson and Crick (in 1953 relevant to the physical and chemical structure of DNA that resides in each and every cell of the living system) provided details on the exact format of instructions pertinent to the duplication of the organism in terms of the genetic information associated with the chain of simple molecules of the DNA.

In simple terms, as described in earlier chapters, the prevalence of genetic information and any possibility of its corruption can be summarized as follows: With the two complimentary sugar-phosphate strands bound around each other helically (with each strand consisting of a chain of nucleotides, namely, Adenine (A), Guanine (G), Cytosine (C) and Thymine (T), a set of genetic information pertinent to an organism prevails in the DNA (and RNA). It is referred to as the genome. The DNA with its genomic profile is copied and passed from one generation to the next. But, the replication may not always be perfect with a number of errors depicting unexpected insertions and deletions (shortly known as indels) popping up with a particular nucleotide is missing from its locale (in the DNA chain) or a nucleotide is inserted in the replicated stretch of DNA. Further, a mutation occurs whenever one nucleotide is substituted with another nucleotide at the same position.

Referring to the post genomic era, rapid sequencing methods [Sanger and Coulson, 1975], invention of the polymerase chain reaction, PCR [Saiki et al., 1985] and automation of DNA sequencing [Hunkapiller et al., 1991] came into realities and scientists today can sequence whole prokaryote genomes in few of hours [Margulies et al., 2005] enabling the whole genomes being

7

viewed while performing the so-called phylogenetic analysis described in the following sections. In summary, modern view on evolution synthesizes Mendelism and Darwinism; (starting in the 1930’s,); however, deemphasis on phylogenetics. Enter Zimmerman (1920’s...) and Hennig (1950’s...) Phylogenetic Systematics — the Cladists. Enter the molecule: Perhaps the first reference to molecules as a means for deciphering phylogeny is y.2 PHYLOGENY Considering the diversity of life with the plethora of estimated 5 to 100 million species of organisms living on Earth, an evidential implication of details gathered (from morphological, biochemical, and gene sequence data) “suggests that all organisms on Earth are genetically related, and the genealogical relationships of living things can be represented by a vast evolutionary tree”…; and, this tree of life depicts the phylogeny of organisms. (The term “phylon’ in Greek means a combining form of a race or a tribe; and as such, phylogeny or phylogenesis implies a race history of an animal or a plant type). In essence, phylogeny is a collection of information about the origin and diversity of species. That is, considering the history of lineages associated with organisms dynamically changing through time, it implies that “different species arise from previous forms via descent, and that all organisms, from the smallest microbe to the largest plants and vertebrates, are connected by the passage of genes along the branches of the phylogenetic tree that links all of life”. Thus the evolutionary history depicts the history of development of biological organisms, functions, molecules etc. through random mutations under selective (invariably nonrandom) pressure. As such, the evolutionary history presumes that an existing organism has descended from ancestral organisms; and, inevitable mutations take place at nucleotide level leading to the observed biological diversity. [Hennig, W. 1965. Phylogenetic Systematics. Ann. Rev. Entomol. 10,97-116, Phylogenetic Systematics. (tr. D. Davis and R. Zangerl), Univ. of Illinois Press, Urbana 1966, reprinted 1979, Zuckerkandl and Pauling(1965) “Molecules as Documents of Evolutionary History.” Journal of Theoretical Biology. 8:357-366].

8

In short, living systems bear a history. If so, can it be uncovered and probed into? The answer is borne in the art of phylogeny. In short it explains that, given a parental strain of a species, how (perhaps why), this strain possibly diverges or splits into two other strain over a period of time. In a broader sense, considering a family of (presumably related) sequences, a phylogenetic analysis determines how this family might have derived and.emerged across the phases of evolution. By placing the sequences as out(most) branches on a tree, it depicts the evolutionary relationships among the sequences and the associated branching relationships (in the inner part of the tree) reflect the extent to which different sequences are related.

Darwin recognized that all life that has ever existed is related through the process of natural selection and by a “great Tree of Life” [Darwin, 1859] and his recognition today reverberates in reconstructing the tree of life so as to discover the last universal common ancestor of all life [Woese et al., 1990]. What is the shape of the tree of life? This query and inquisitiveness has led to many versions and revisions on the proposed aspects of the tree of life.

Constructing just the tree of life alone is not however, the end or limit of probing the evolutionary processes of life. There are a number of avenues of fruitful deductions of phylogenetic analyses through evolutionary considerations. Useful and need-based techniques have been developed in the recent times to build phylogenetic trees across the cross-section of biology and evolutionary considerations for example, in estimating the timing of the ancestor of the HIV virus [Korber et al., 2000], investigating the origins of deadly flu strains [Worobey et al., 2002] and in investigating the genetic mechanisms of malaria [Kedzierski et al., 2002].

y.2.1 Convergent and Divergent Evolutions As seen above, in evolutionary process, unpredictably species can adapt a particular phenotype (meaning any observable feature of an organism, which has stemmed from one ore more genes); and, such physical features assume the suit with a purpose regardless of genetic variations or changes accommodating the evolutionary pressure felt for such changes.

9

Acquisition of the same biological trait in unrelated lineages defines the convergent evolution process. As a classical example, the can be indicated as a result of convergent evolution in action. In spite of the fact that the last common ancestor of birds and bats did not have wings, presently as seen these species are capable of flying. The mechanism of flight places constraints on the wing shape making the wings as of almost similar shapes across all bird species. This similarity can also be due to shared ancestry, inasmuch as evolution works only with a given entity of preexistence. Hence, wings depict the morphology seen with limbs modified (as evidenced by their bone structure). Convergent evolution lead to specific traits termed analogous structures. (These are in contrast to homologous structures, which have a common origin). The wings of bat and pterodactyl (a kind of “winged lizard”) depict analogous structures. But the bat wing is homologous to human as well as other mammal forearms in the sense that an ancestral state is shared despite serving different functions. Homoplasy denotes the similarity in species of different ancestry which is the result of convergent evolution.

Further, convergent evolution is similar to, but distinct from what are known as the phenomena of evolutionary relay and parallel evolution. Evolutionary relay portrays how independent species acquire similar characteristics through their evolution in similar ecosystems, at different times seen for example as the dorsal fins of extinct ichthyosaurs and sharks; and, parallel evolution takes place whenever two independent species evolve simultaneously in the habitat of same ecology. Also they acquire similar characteristics, (as for instance seen in extinct browsing-horses and paleotheres).

The opposite of convergent evolution is divergent evolution. Here, related species evolve different traits. This can happen at molecular level as a result of random mutation unrelated to adaptive changes. Similar to convergence in evolution, a rationale for divergent evolution can be seen vis-à-vis the accumulation of differences between groups leading to the formation of new species. That is, a “divergent evolution results from diffusion of the same species adapting to different environments, leading to natural selection defining the success of specific mutations”. For example, the vertebrate limb is an example of divergent evolution. Though, the limb in several species has a common origin, it has diverged

10

somewhat in overall structure and function. That is, divergent evolution is somewhat readily observable in organisms in certain higher-level characters of structure and function. The divergent evolution can be applied to molecular biology characteristics as well. That is, it can be seen with respect to a pathway in two or more organisms or cell types applied to genes and proteins, (such as nucleotide sequences or protein sequences that derive from two or more homologous genes). Further, divergent evolution prevails both in orthologous and paralogous genes. (The orthologous genes result from a speciation event and paralogous genes result from gene duplication within a population). The existence of possible divergent evolution in paralogous genes indicates occurrence feasibility between two genes within a species.

In summary, similarity seen in the case of divergent evolution is due to the common origin, (such as divergence from a common ancestral structure or function that has not yet completely obscured the underlying similarity). In contrast, a convergent evolution arises whenever some sort of ecological or physical drivers are pushing toward a similar solution, (even though the structure or function has arisen independently), such as different characters converging on a common, similar solution from different points of origin. This includes analogous structures as well. y.2.2 Phylogenetic Trees Studies on phenotype precede sequencing efforts exercised on DNA and/or proteins. In 1960-70’s, the evolutionary positioning of species were largely done on the basis of anatomical features and by the art of taxonomy (depicting the science of plant and animal classification).

Subsequently, phylogeny is indicated towards deciphering the history of descent of a group of organisms. It is described usually by branching tree-like diagram (known as phylogenetic tree). This phylogenetic approach offers rigorous mathematical background and computational methods can be used to infer the trees or better known as dentograms. The underlying inference is based on proteins and DNA that are homologous. Relevant measure is therefore unaffected by environmental effects on phenotypical changes. In short,

11

phylogenetic tree enables understanding the evolutionary changes at the molecular level.

In constructing the tree of life each branch is called a clade and living organisms are placed as leaves at the tips of the branches; and, their evolutionary history can be represented by a series of ancestors shared hierarchically via different subsets of organisms (that are seen surviving today). As mentioned before, these organisms seen alive are depicted as the leaves on the dentograms, leading to down-tracing their history back to the branches encountering their various ancestors. On time-scale these ancestral existence, denote thousands or millions or even possibly hundreds of millions of years ago. Thus, in short a notion exists that all of life is genetically connected through a mammoth phylogenetic tree.

Does it mean that there could be a common ancestor for humans and beetles? Plausibly, yes! Relevant common ancestral organism could have been some sort of a worm; and, somewhere along the evolutionary time-frame, this species of ancestral worm divided into two separate (worm) species, which thereafter divided again (and again), each division (or speciation) resulting in new, independently evolving lineages along the branches of the tree with the end result being two possible leaves – human and beetles! (Vow!!)

Across various ancestral forms derived, the new lineages retain mostly their ancestral features. However, these would gradually get modified and supplemented with necessary traits that make them congenially adjust to (and survive in) the environment of their habitation. The pedagogy of phylogeny of organisms explains relevant similarities and differences among plants, animals, and microorganisms that have evolved through the formalism of the tree of life. Thus, the phylogenetic tree (Figure y.1) offers a systematic framework for various sub-disciplines of biology in understanding the organizational aspects of biological knowledge.

12

Fig. y.1 An evolutionary tree-of-life: Concept of phylogeny implies that existing species (depicted as end leaves) can be linked by an underlying structure in a tree-like manner As is evident from its name, a phylogenetic tree (Figure y.1) is a conceptual representation of the tree-of-life by a tree structure with species located at the leaf-nodes of the tree. In essence, it is a diagrammatic representation of the evolutionary relationships among a set of species. Phylogenetic trees can be deduced from molecular sequences (like DNA or amino acid sequences) by comparing similarities between such sequences. Branches in the tree depict that a particular species has evolved from another through internal nodes, (which denote the epochs of speciation). Further, the events of speciation are supposedly taking place in a dichotomous (binary) fashion. It implies that any speciation event results in two new species.

Quantitatively, the length of a branch can be regarded as some measure of evolutionary distance between two nodes.

13

Implicitly, it indicates the degree of dissimilarity of two (DNA) sequences in the branched parts of the tree at speciation node. Constructing a phylogenetic tree is based on the following heuristics:

� Species exhibit variations or diversity � However, a set of similar species can be identified and

grouped � Such grouped species can be connected (or linked) to a

common ancestor.

First as defined earlier, dentogram is a tree representation where all the leaves are of equidistant from the root, implicitly representing the passage of time of sequences up to the leaf-tips from the root. Descriptively, the components of a phylogenetic tree and the set of terminology associated in constructing such structures is as follows:

� Tree: This is a line-diagram that provides a visual means of

representation for a group of sequences or species and indicates their time-series of origin. A tree consists of nodes, branches and leaves. It is a mathematical structure that models the actual evolutionary history of a group of sequences or organisms.

14

Figure y.2 Components of a phylogenetic tree. (OTU: Operational taxonomic units – leaves; and, root is the common ancestor of all OTUs

� Nodes: A tree consists of nodes connected by branches. Nodes depict ends of the tree and are represented by a small circle. A node can be internal or at an end (in which case, it is called a “leaf” that is, the leaf is the loose-end (terminal) node of a branch in the tree. Internal nodes denote hypothetical nodes and an unique (internal) node can be identified as the root of the tree depicting the ancestor of all the sequences. The terminal nodes represent sequences or organisms (for which the phylogenetic data are compiled and known). Each

Ancestral root

Branch length

Branch

OTU

OTU

OTU

Node

Outer group

15

terminal node (leaf) is designated as an operational taxonomical unit (OTU). That is, an OTU depicts a terminal node in phylogenetic analysis and represents an organism and a group of such OTUs constitute a clade representing a set of several sequences and their common ancestral nodes. A bifurcating node explicitly carries two distinct lineages arising from it

Figure y.a Types of nodes in a phylogenetic tree

� Types of trees: There are two versions of trees, namely rooted

and unrooted trees as illustrated in Figure y.y

Figure y.x Phylogenetic tree configurations: (a) Rooted tree and (b) unrooted tree

Root node

Internal node

Terminal nodes

(a) (b)

Bifurcating node

16

Rooted tree implies a structure in which the direction of evolution is specified with respect to a “root” or branch-off site of single node. That is, the tree structure is shown with a designated “root” depicting an ancestral origin); and, via adequate divergence across the phases of evolution, multitudes species (denoted as different leaves (OTUs) of the tree) have evolved. An unrooted tree simply displays the underlying connections or links between the species. That is, no specified root node of ancestral implication is shown on this structure. The nodes shown simply denote the mutual relativeness (such as branch lengths between them. This branch length measures the extent of divergence between the nodes). Scaling the tree (or scaled trees) implies elucidating the differences between adjoining nodes. It is done by determining the length of branches involved. Gene tree: This denotes a tree structure that results from analyzing homologous genes As regard to the two versions of tree as above, the number of branches and nodes can be obtained in terms of the number of OTUs (M). Relevant details are shown in Table y.y.

Number of branches

Nodes

Interior M - 2 M - 1 Rooted Total 2M - 2 2M - 1

Interior M - 3 M - 2 Unrooted Total 2M - 3 2M - 2

� Tree network: While trees signify only one path between any

pair of nodes, a tree network has more than one path between any pair of nodes as shown in Figure y.r

17

Figure y.r (a) A simple tree and (b) a tree network mesh

� Newick formatted phylogenetic tree: Newick tree format (or alternatively known as Newick notation or New Hampshire tree format) is a mathematical way to represent graph-theoretical trees with edge-lengths using parentheses and commas. When an unrooted tree is represented in Newick format, an arbitrary node is chosen as its root. Whether rooted or unrooted, typically the representation of the tree is rooted on an internal node and rarely (but permissibly) rooting a tree on a leaf node is done. Newick format is a shorthand notation and is illustrated in Figure y.x

Figure y.t Newick tree format � A rooted binary tree that is rooted on an internal node has

exactly two immediate descendant nodes for each internal node. An unrooted binary tree that is rooted on an arbitrary

(a) (b)

{{A, B} {C, D}} {{A, B}{C, D} {E, F}}

18

internal node has exactly three immediate descendant nodes for the root node, and each other internal node has exactly two immediate descendant nodes. A binary tree rooted from a leaf has at most one immediate descendant node for the root node, and each internal node has exactly two immediate descendant nodes

� Cladistics: It literally means "branch" and forms biological systematics to classify species of organisms into hierarchical monophyletic groups. More specifically, it is defined as the study of the pathways of evolution. How many branches there are among a group of organisms, which branch connects to which other branch and what is the branching sequence are queries of interest to cladists. Typically, cladistics strives to identify monophyletic clades- a group that represents a species and all its descendants. Closely related clades are called sister groups; and other groupings are known as paraphyletic depicting common ancestor and some of its descendants and polyphyletic, which signifies sister groups but not the common ancestor). Further, in cladistic sense, a monophyletic group is group of organisms (taxon) constituting a clade consisting of an ancestor and all its descendants

� Taxon (and taxa): A taxon (with taxa being its plural) as mentioned above is a group of (one or more) organisms, (which a taxonomist adjudges to be a unit). However, in phylogenetic nomenclature of cladistic approach, do require taxa need not be monophyletic (consisting of all descendants of some ancestor). Here taxa is not the basic unit and "clades" is used instead. That is, a clade is a special form of taxon in phylogenetic sense.

� Cladogram: A tree-like network that expresses ancestor-descendant relationships is called a cladogram. Thus, it describes the topology of a rooted phylogenetic tree via relative ancestral origins of sequences, but without any branch length considerations. In essence, cladistic classifications of trees illustrate cladograms (Figure y.n) with

19

the intention to reflect the relative recency of common ancestry or the sharing of homologous features

Figure y.n A cladogram showing the relative order of common ancestry

� Additive tree: It is a cladogram specified with branch lengths. It is also known as phylograms or metric trees. An example of additive tree is shown in Figure y.y

Figure y.y Example of an additive tree. The numbers shown are some hypothetical branch lengths

Ultrametric tree: It is a dendogram denoting a specific type of an additive tree in which the tips of the tree are all equidistant from the root as illustrated in Figure y.r where the relative order of common ancestry can be seen

1 2

3 4

5

6

7

20

�

Figure y.n An example of ultrametric tree. (All the tips of the tree are of equidistance from the root. In this case, the

equidistance is equal to four)

� Cladistic terminology: It includes (i) apomorphy or derived character shared by a species and its descendants, but not in the ancestral species; (ii) synapomorphy, which is used to define an ancestor and its descendants and (iii) pleisomorphy defining ancestral characteristics

� Phenetics is the study of relationships among a group of organisms on the basis of the degree of similarity between them, be that similarity molecular, phenotypic, or anatomical

� Phenogram: It is a tree-like network expressing phenetic relationships

� Pleisomorphy: This defines some characteristics pertinent to the ancestor, which are sequel in all further sequences. And it is a character-state present in both outgroups as well as in the ancestors

� Homoplasy: This refers to similarity that has evolved independently without being indicative of common phylogenetic origin. Similarity seen in species of different ancestry is the result of convergent evolution and it denotes the homoplasy

� Convergent evolution: It describes the acquisition of the same biological trait in unrelated lineages

� Polytomies: Soft polytomy implies a lack of information about the order of divergence; and, hard poytomy hypotheses that multiple divergenses occurred simultaneously

1

2

1

4

1 2

1

21

� Autapomorphy is a derived trait seen uniquely in a particular taxon

� Synapomorphy: In contrast with autapomorphy, synapomorphy denotes the characteristics shared with the ancestor in a specific phylogeny and then derived from the ancestor. In cladistics defined earlier, a synapomorphy or synapomorphic character denotes a trait, which is shared ("symmorphy") by two or more taxa and their last common ancestor, whose ancestor in turn does not possess the trait. A synapomorphy is thus an apomorphy (or a derived characteristic of a clade)

Phylogenetic trees based on morphological features Phylogenetic trees based on numerical taxonomy do not indicate the buried subjectivity of evolution. For example, in a particular phylogenetic analysis, the numerical approach does not say the relative importance of, say for example, skin color and tail-length. So, a phylogenetic tree that displays morphological aspects across the diverging species of evolution can be indicated. An example of such morphological features is that a chimp has furs and a bird does not. A morphological tree can be drawn as shown in Figure y.3.

22

Figure y.3: A morphological tree of phylogeny y.2.2 Development of a Phylogenetic Tree Consistent with the phylogenetic terminology, and in conceiving a phylogenetic tree with the components illustrated in Figure y.2, the underlying assumptions are as follows: (i) Each species in the tree (or the taxon) bears a relation to a common ancestor; (ii) the phylogenetic tree in essence is a branching of structure, that is a bifurcating tree and (iii) along the time of evolution that frames the phylogenetic tree mutations have taken place randomly. These assumptions lead to eventual phylogenetic inferences of scientific interest. Making such phylogenetic inferences therefore, relies on models that specify the process of evolution and the development of the associated tree. To elucidate these models, the set of a priori details required are based on the following queries relevant to the sequences adopted in the modeling pursuits:

Jaws

Lungs

Claws, Nails

Feathers

Furs, Mammary

glands

?

23

� Are the sequences in hand are correct? � Are they (true) representatives of the evolution process

involved? � Are they homologous? � Is the multiple alignment of the sequences question correct?

Homologs or homologous structures are defined as those that are derived from a common ancestral structure in two related species. In contrast, the other possibilities are: Orthologs: These also bear a common ancestor, carry similar functional and structural attributes but they are separated by speciation, which depicts the phenomenon wherein a common ancestor gives birth to two subgroups that slowly drift away to become distinct species. Paralogs: These are homologs separated by a duplication event, meaning that within a genome, a gene had been duplicated; and, one of the duplicates retained the original function and the other duplicate could have assumed a new (or related) function. Xenalogs: These result from a lateral transfer between two organisms, where a lateral transfer is a direct DNA transfer between two species. Hence, one of the genes contains a gene that does not have the same history as the genome in which it is inserted. In this horizontal gene transfer, the result is hard-to-tell similar functions being observed.

Formation of orthologs and paralogs via duplication and speciation are indicated by an illustration in Figure y.4.

αααα

αααα

αααα1111

ββββ

ββββ1111 αααα2222 ββββ2222

Duplication

Speciation

24

Figure y.4 Consistent with the occurrence of duplication and speciation as illustrated, the set {αααα1111, α, α, α, α2222, , , , ββββ1111, β, β, β, β2222} depict orthologs and the set {αααα1111, β, β, β, β2222} denote paralogs. y.2 Methods of Phylogenetic Tree Construction The method of constructing a phylogenetic tree involves a set of procedures, which can be illustrated in a pseudocode format given in Table y.1 Table y.1: Pseudocode formatted description of constructing a phylogenetic tree: Phylogenetic analysis ______________________________________________ CONSTRUCTING A PHYLOGENETIC TREE: // Initial: Choose the test sequences For choosing the sequence, CALL SUBROUTINE I write SeQ:

← Database search and list the test sequences

perform MSeQA:

← Perform multiple sequence alignment go to: SUBROUTINE II

← Multiple alignment preparation next // Choose the model of evolution by: check Similarity SIM: if SIM defines “strong similarity”,

then Goto PM

← PM corresponds to parsimony methods Perform SUBROUTINE III

or else, check SIM:

if SIM defines moderate similarity, then go to DM

25

← DM corresponds to distance methods or else, check SIM:

if SIM defines low or no similarity, then go to ML

← ML corresponds to maximum-likelihood method

ENDIF next PERFORM TREE-BUILDING/RECONSTRUCTION

← This refers to making of the required phylogenetic tree. This is done with the set of multiple sequences aligned and prepared

go to: SUBROUTINE III next EVALUATE THE QUALITY OF THE END-RESULT

← This is done by applying consensus methods to the set of trees constructed. Consensus method verifies the reliable tree that truly depicts the evolution history of the sequences addressed.

go to: SUBROUTINE IV END: ; ______________________________________________

SUBROUTINE I: Choosing the query sequence

← This refers to choosing homologous test sequences

← Relevant choice is based on: Check: The selection is not a

sequence fragment. Incomplete sequences are not friendly toward

26

multiple sequence alignment nor tree reconstruction. At least same fragment is used for all multiple sequences

Check: The sequences chosen are not xenologs. Unless the purpose is to study xenologs, avoid including genes that result from lateral transfer.

Check: The selection is not a recombinant sequence. (Some proteins result from a combination of multiple proteins (as is common in viruses. Such proteins carry two ancestors (instead of one)not being compatible for regular tree reconstruction.

Check: Whether the sequences are pertinent to large and complex families containing various domains and repeats.(Working on smaller and more uniform subsets is preferable)

Check: Whether the sequences are pertinent to nucleotides or proteins

Check: Are they from closely related species (in conformance to an extent of being at least 70% identical with or without the underlying mutations being high)

if… the sequences do not

conform to an extent of being at least 70% identical as above (meaning more divergent)

then,Goto to: Use protein sequence or conserved

27

nucleotides(such as ribosomal RNA)

or else… Use the chosen nucleotide sequence

return Selected sequence next Goto to: Perform MSeQA: Subroutine II ______________________________________________ Subroutine II: MSeQA: Multiple sequence alignment and preparation for tree construction ← Alignments are an essential pre-requisite to many further analyses of protein families such a homology modeling, phylogenetic reconstruction or simply to illustrate conserved and variable sites within a family.

Sequence alignment is an important and useful procedure in bioinformatics. A multiple sequence alignment is a scheme of writing once sequence on top of each other where the parallel residues in any one position are deemed to have a common evolutionary origin. Sequences are placed in rows on top of each other and aligned so that homologous residues are placed in the same column. Multiple sequence alignment is computationally intensive. As shown before, more than four sequences would render the alignment almost incomputable.

Typical multiple alignment programs are: Clustal, T-Coffee, MAFFT, Muscle..

28

← Subroutine II involves two steps (i) Retrieving homologs (ii) Arranging multiple sequences

(alignment procedure) and preparing the multiple sequences for tree construction , which refers to “cleaning up” the chosen multiple sequences for alignment by a set of procedures

Step (i)- Retrieve- Given a sequence, its homolog can be retrieved as follows: 1. Access NCBI BLAST by typing the URL

http://www.ncbi.nlm.nih.gov/blast into the address line of the web browser

2. There are two options: By selecting protein Blast option, paste in the query protein sequence. (That is, Search protein database using a protein query - Algorithms: blastp, psi-blast, phi-blast). Otherwise, by selecting nucleotide Blast option, paste in the query nucleotide sequence.(That is, Search a nucleotide database using a nucleotide query- Algorithms: blastn, megablast, discontiguous megablast

3. Exercise search limitations on the BLAST run. (In practice, a selection of sequences from the same species as the query sequence can be searched along with a selection of sequences from different species. This can be done by using two separate BLAST searches (using different search limitations).

29

4. Next, the query (protein or nucleotide) sequence is pasted in the box below ‘Enter Query Sequence’ and the name of the species that query sequence comes from is typed in the Organism box. This will return hits from species of interest only. (To obtain results from other species run BLAST without any limitations.

5. With the same search limitations selected before, continue with the BLAST search by clicking the BLAST button. After a while (depending on the search time), the FASTA format sequence of six ORF can be retrieved from the BLAST output.

Step (ii)- Arrange/Prepare: This refers to placing the chosen multiple sequences one below the other forming a column of sequences as an alignment procedure and preparing the multiple sequences for tree construction . The preparation involves “cleaning up” the chosen multiple sequences for alignment by a set of procedures Example1: Multiple DNA sequences of

hypothetical homologous species

Example 2: Multiple protein sequences of

AAGCA-AGGTAAATGCATGCATGGA- -AGTCCTGGAATGGTA

AGAT- - AGGTAAATGCAGCTAGCAT-AAGTCCTGGACCGGAT

GCAATTAGGTAAAACCAAGGTACCT- -AGTCCTGGAGAGATA

GTGATTAGGTAAAACCAACGCAACGCAGTCCTGGACGTAGG

30

hypothetical homologous species

Prepare: This refers to removing certain sets of columns in the arranged multiple sequences shown in the above examples as shaded sections. ← Criteria for such removals?

o Sections of gap-free columns are mostly retained. Gaps invariably cause phylogenetic tree-forming.(In tree-construction, Programs like ClustalW follows complete-deletion policy ignoring every column that contains the gap)

o Extremities of the multiple sequences are removed inasmuch as N-terminus and C-terminus tend to be poorly conserved and as such, they do not well aligned

o Gap-rich columns can be removed. They often spond to loops. As such, even when a program returns an alignment with gap-rich columns, it may not be a meaningful

o Most informative blocks should be retained. Ideally, in building a tree, high alignment of sequences possessing low level of identity is preferable

ASLIFR- SDAYS SDAV- -

KNRTVIPVWNEGF

CSAVFASDAYS

CVVVF- SDAY

KKRTVIKNSVNPVWNE

KKRTVVIKNSVNNVWN D

S KKRTVIIKNSVNPVWNE

Q- - DQSSL

DQGSLHVVVKKENGS

HVVVKMQEYERN

QSSLHVV ELL- - - - -

DQSSLHVVVKTKEESE

AGVA- - SDAYS

31

since it would contain a trace of the family history

o “Good blocks” (typically with 20 – 30 amino acids long with a few conserved positions are useful in realizing a correct tree

← Programs for column removal

- Removal of columns that are unlikely to be correctly aligned can be done via T-Coffee server. T-Coffee is a progressive method for sequence alignment [C. Notredame, D. Higgins and J. Heringa, T-Coffee: “A novel method for multiple sequence alignments", Journal of Molecular Biology, 302, 205-217, (2000]

- Editing multiple alignments can be done with Jalview. It is a multiple alignment editor written in Java. It is used widely in a variety of web pages (for example, the EBI Clustalw server and the Pfam protein domain database and is available as a general purpose alignment editor.

← Programs for multiple alignment Clustal, T-Coffee, MAFFT, Muscle, …

return Aligned multiple sequence End _______________________________________________________ Building the phylogenetic tree Phylogenetic tree construction relies on two considerations: (i) An approach based on similarities and (ii) another approach based on dissimilarities of the observed data. The similarity aspect is concerned with the relatedness in the compared entities. It is divided into two types: (a) phenetic version and (b) cladistic version. The phonetic version is character-based phenogram, for example,

32

comparing a set of plants of related characteristics in terms of their associated characters (such as, petals, sepals, anthers, ovary, size or style etc.). Depending on the extent of comparable characters between them, they are declared as being related closely or not. Cladistics is a method of hypothesizing relationships among organisms in reconstructing evolutionary trees. The basis of a cladistic analysis is data on the characters, or traits, of the organisms and these characters could be anatomical and physiological characteristics, behaviors, or genetic sequences. The result of a cladistic analysis is a tree, which represents a supported hypothesis about the relationships among the organisms.

Summarizing the contrasts and comparisons between phenetics versus cladistics, the former offers relationships among a group of organisms on the basis of the degree of similarity between them. (This similarity may refer to molecular, phenotypic, or anatomical features); hence, a tree-like network (called a phenogram.) is evolved so as to express the underlying phenetic relationships.

Cladistics refers to the pathways of evolution. In this context, number of branches that prevail among a group of organisms, specifying which branch connects to which other branch and knowing the associated branching sequence are queries of interest. Relevant tree-like network that expresses such ancestor-descendant relationships is a cladogram or the topology of a rooted phylogenetic tree.

A phenogram is an indicator of cladistic relationships. But, it is not per se identical to a cladogram. (Only when exists a linear relationship between the time of divergence and the degree of genetic (or morphological) divergence, the two types of trees may become identical to each other.

A class of phylogeny is molecular phylogenetics, (also known as molecular systematics), uses the structure of molecules in order to gain information on the evolutionary relationships for an organism. The result of such molecular phylogenetic analysis can be expressed in a phylogenetic tree. In pursuing molecular phylogeny, the associated classification of methods involves distance and character-state approaches.

33

Methods of distance-based approach are based on certain distance measures, such as the number of nucleotide or amino-acid substitutions. Examples of this distance-based method are: The unweighted pair group (UPGMA) method, the transformed-distance method, and the neighbors-relation method.

The heuristic logic behind the distance method is that evolutionary distance is a tree metric and hence defines a tree. Typically, evolutionary distances are computed for all pairs of taxa; and the tree is constructed by considering the relationships among these distance data (fitting a tree to the matrix).

The methods of character-state approach rely on the state of the character, namely (i) the nucleotide or amino acid at a particular site; and, (ii) the presence or absence of an indel at a certain DNA location. An example of character-state methods is the maximum parsimony method is a character-state method.

Yet another statistical strategy of phylogenetic tree reconstruction using molecular data refers to maximum likelihood method, which uses all the information available in the sequence. Before getting into the details of aforesaid molecular phylogeny methods, some basic aspects of tree construction are outlined below. The process of building a tree, in essence, refers to making of or reconstructing the required phylogenetic tree structure using the multiple sequences aligned and prepared as indicated before. With any given set of (N) multiple sequences, corresponding number of possible trees (M) is such that, for high values of N will give extremely large values of M. (For example, M = 1 for N = 3; M = 3 for N = 4; …; M = 1,027,025 for N = 10; …, 2.8 × 1074 and so on as will be explained later). Therefore, unless only a small number of sequences is considered, the total number of feasible trees would increase to an enormous extent. As such, with the sequence data gathered, the multiple alignment is done only on a limited number of sequences (leading to a suboptimal number of trees being reconstructed). As indicated above, building the tree implies assessing the underlying phylogeny. This assessment can be done by two approaches, namely, (i) distance-based approach and (ii) clustering-based approach.

34

The distance-based approach refers to introducing a “weight” concept to the basic tree structures indicated earlier. The underlying concept of distance-based pursuit is as follows: Considering a rooted tree, the root is the most recent ancestor in the tree and the path from the root to a leaf signifies the evolutionary path. Such rooted trees are often represented with a root vertex as shown in Figure y.x (a) emphasizing that the root corresponds to the ancestral species. In contrast, the unrooted tree (Figure y.x(b)), bears no assumption as regard to the position of an evolutionary ancestor (root) in the tree. That is, no assumption about the origin of species prevails in unrooted trees.

The concept of weight in distance-based methods can be illustrated as in Figure y.x (c) where, for example, there are six leaf-nodes (vertices), I, II, III, IV, V and VI; and, a positive weight (or length) is assigned between any two consecutive nodes. This length, for example, may depict the number of mutations on the evolutionary path.

Quantitatively, the length (d) of the path between any two vertices can be specifies as the sum of the weights in the path between them. In Figure y.x(c), for example, d between nodes I and V is given by: d = 13 + 13 + 14 + 18 + 12 = 70. In general, given a weighted tree (T) with n leaves (end-nodes), computation of the path di, j (T) between any two leaves (i, j) can be done as indicated by the above example.

35

Figure y.cc Weighted unrooted (star topology) tree Now, considering an inverse problem, suppose a (N × N) distance matrix ∆i, j = [di, j (T)] (for every two leaves (i, j) is available (mostly via biological experiments). A method is then required to search for a tree T that has n leaves and consistent with the data in hand. Whenever the matrix size is small (say 3 × 3) and it is symmetric and non-negative, construction of the tree could be trivial. But for larger matrix sizes, the number of trees to be constructed becomes unwieldy. This could be seen from the following algorithmic relations: Given ν nodes, the number of rooted trees that can be designed is εR = (2ν – 3)!! = (2ν – 3)(2ν – 5)(2ν – 7) … ∞; and, number of unrooted trees is, εUR = (2ν – 5)!! = (2ν – 5)(2ν – 7)(2ν – 9) … ∞. Illustrated in Figure y.y is a couple of simple examples.

I VI

II III

IV

V

13 13

13

14

15

18

12 10

36

Figure y.y: Given the number of nodes (ν), realization of εR and εUR: (a) With ν = 2 in a rooted tree and (b) with ν = 3 in an unrooted (star topology) tree. In the distance-based approach, [N. Saitou and M. Nei, The Neighbor-Joining Method: A New Method for Reconstruction of Phylogenetic Trees, Molecular Biology and Evolution, 4, 1987, 406-425], evolutionary distances are calculated for all pairs of taxa and phylogenetic tree is constructed by means of an algorithm, which establishes some functional relationships among distance values. Hence a distance matrix is deduced which is a table that contains the “distances” (or counts on the number of evolutionary events) that separate each pair of sequences in the data set of aligned multiple sequences. Popularly, the distance matrix is conceived via UPGMA, which is simple toward tree reconstruction. It essentially assumes that the rate of evolution is nearly constant among different evolutionary lineages. That is, evolutionary distance is proportional to the divergence period time. Essentially, distance matrices of phylogeny are non-parametric schemes originally adopted for phenetic data using a matrix of pairwise distances. The distances obtained thereof are then used to make a tree, (that is a phylogram, depicting the informative branch lengths that carry the information on the underlying evolutionary process).

Normally as said before, the distance matrix is a result of biological experiments (such as immunological studies) complied as measured values.It can also be elucidated from:

� Morphometric analysis

(a) (b)

Root

Leaf

37

� Pairwise distance formulations (like Euclidean distance between discrete morphological characters)

� Genetic distance calculations from sequence, restriction fragment

� Allozyme data.

Raw distance values vis-à-vis phylogenetic character data, can be decided by simple counts on the number of pairwise differences in character states. (Specifically described as Manhattan distance, the raw distance values in question conform to what is known as taxicab geometry, proposed by Hermann Minkowski in the 19th century. It refers to a form of geometry wherein the conventional Euclidean geometry metric is supplanted by a new metric in which the distance between two points is the sum of the (absolute) differences of their coordinates. The taxicab metric is also known as rectilinear distance). Manhattan distance or Manhattan length, or its variations depict the geometry of grid layout of most streets on the island of Manhattan. Relevant length of the shortest path that a taxicab could take between two points in the city is equal to the distance between the points in taxicab geometry).

Inasmuch as the distance-matrix approach requires "genetic distance" evaluation between the sequences being classified, they need multiple sequence alignment (MSA) as an input. This genetic distance is often defined as the fraction of mismatches at aligned positions, with gaps either ignored or counted as mismatches [D. M. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY:2004].

Further, distance methods imply constructing an all-to-all matrix from the sequence query set. This describes the distance between each sequence pair. The constructed phylogenetic tree via distance matrix renders closely-related sequences under the same interior node and the branch lengths reproduce to a close extent the observed distances between sequences. Further, the type of tree reconstructed can be either rooted or unrooted version depending on the type of algorithm adopted. Distance methods lay foundation for progressive and iterative types of MSA. But, such methods do not use efficiently the information about any local high-variation regions

38

that may appear across multiple sub-trees [J. Felsenstein J., Inferring Phylogenies, Sinauer Associates, Sunderland, MA: 2004].

The genetic distance concept is adopted as a data clustering strategy in a method known as the neighbor-joining approach, which enables reconstruction of unrooted trees. In its approach, neighbor-joining does not assume a constant rate of evolution (that is., a molecular clock) across lineages. In other words evolutionary divergence time cannot be found from mutations and as said before, mutation rates are not constant. (In contrast, in the UPGMA to be described later rooted trees are reconstructed using a constant-rate assumption of an ultrametric tree in which, as said earlier, the distances from the root to every branch tip are equal).

Neighbor-joining is based on the minimum-evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm. However, neighbor-joining may not find the true tree topology with least total branch length because it is a greedy algorithm that constructs the tree in a step-wise fashion. Even though it is sub-optimal in this sense, it has been extensively tested and usually finds a tree that is quite close to the optimal tree. Nevertheless, it has been largely superseded in phylogenetics by methods that do not rely on distance measures and offer superior accuracy under most conditions.

The main virtue of neighbor-joining relative to these other methods is its computational efficiency. That is, neighbor-joining is a polynomial-time algorithm. It can be used on very large data sets for which other means of phylogenetic analysis (e.g. minimum evolution, maximum parsimony and maximum likelihood) are computationally prohibitive. Unlike the UPGMA algorithm for phylogenetic tree reconstruction, neighbor-joining does not assume that all lineages evolve at the same rate (molecular clock hypothesis) and produces an unrooted tree. Rooted trees can be created by using the outgroup and the root can then effectively be placed on the point in the tree where the edge from the outgroup connects.

Furthermore, neighbor-joining is statistically consistent under many models of evolution. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability. Atteson proved that if each entry in the distance matrix differs from

39

the true distance by less than half of the shortest branch length in the tree, then neighbor joining will construct the correct tree. UPGMA method: An algorithmic procedure to construct a phylogenetic tree The unweighted pair-group method with arithmetic mean (UPGMA) is a method of tree construction developed originally in the context of constructing taxonomic phenograms of depicting trees that reflect the phenotypic similarities between OTUs. The underlying consideration refers to the relationships between organisms viewed in terms of similarity seen between them (instead of probing their genealogy). That is similar organisms can be grouped (clustered) together (per the old adage of birds of the same feather flock together!) A method of such grouping effort is exercised by the UPGMA. In this phenetic approach the groups are identified in making clusters of varying degrees of similarity (that is, on the basis of least distant to most distant relationships). It makes the tree clock-like (ultrametric). Computationally, tree construction by UPGMA is fast. Hence, UPGMA can be adopted arbitrarily with large data set. But a single tree that is obtained in a broad sense of similarity and does not look for whether the relationships considered are historically significant. UPGMA can also be used to construct phylogenetic trees if the rates of evolution are approximately constant among the different lineages. For this purpose either the number of observed nucleotide or amino-acid substitutions can be used.

In summary, UPGMA employs a sequential clustering algorithm, in which local topological relationships are identified in order of similarity and the phylogenetic tree is built in a ladder format. First, identified from among all the OTUs are those two OTUs that are most similar to each other and then these two are combined as a new single OTU. This combined OTU is called a composite OTU. Subsequently, a distance matrix is constructed with the composite OTU plus the rest of the OTUs. Again, the pair with the highest similarity is identified to make the new composite and so

40

on, until only two OTUs are left out. Relevant exercise is illustrated in the following example.

Example x.1 Consider a set of six OTUs {α, β, χ, δ, ε, φ} whose sequences are listed below:

α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C T G G

β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T

χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G

δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C T C G

ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

φ : T C A G G C C G C G T G G T G T TG T C G TC T G C G A G A A T T G G C T C G

Assuming that the above set of OTUs had the following common ancestral root (R), corresponding evolutionary distance (ED) matrix can be constructed as illustrated as illustrated in Figure x.

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G

Root αααα

ββββ

1 1 1 1

χχχχ

δδδδ

εεεε

φφφφ

Taxa/OTUs

α β χ δ ε φα β χ δ ε φα β χ δ ε φα β χ δ ε φ

0

2 0

4 4 0

6 6 6 0

6 6 6 4 0

8 8 8 8 8 0

αααα

ββββ

χχχχ

δδδδ

εεεε

φφφφ

Evolutionary distance (ED) ED matrix (EDM)

41

Figure x Construction of the evolutionary distance matrix for the OTU-set of Example 1. The ED values indicated in the matrix of Figure 1 correspond to the extent (number) of changes in the nucleotide residue set of R in going to each taxonal leaf (OTU). That is, ED corresponds to number of dissimilarities observed between the sequences being compared. This is illustrated in the following set of tables (Tables 1 - x):

Check I: Distance of four (4) of each OTU from the root R. (Dissimilar residues are shown bold)

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C T G G

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G

χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C T C G

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

R T A A G G C T G C G T G G T G T T G T C G T C T G C G A G A T A T G G C T C G φ T C A G G C C G C G T G G T G T T G T C G T C T G C G A G A A T T G G C T C G

Check II: Distance of subsequent OTUs from αααα (Dissimilar residues are shown bold)

α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T CT α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T CG α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C TCG α G A A C G C T G C G T G G T G T A G T C GT C T G C G A G A T A T G G C TGG

42

ε G A A G G T T G C G T G G T GT T G T C TG C T G C G A G A T A T G G C T C G α G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C TGG Φ T C A G G C C G C G T G G T G T T G T C G TC T G C G A G A A T T G G C TCG

Check III: Distance of subsequent OTUs from ββββ (Dissimilar residues are shown bold)

β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G

β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C T C G

β G A A C G C T G C G T G G T G T AG T C G T C T G C G A G A T A T G G C T C T ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

β G A A C G C T G C G T G G T G T A G T C G T C T G C G A G A T A T G G C T C T φ T C A G G C C G C G T G G T G T T G T C G TC T G C G A G A A T T G G C T C G

Check IV: Distance of subsequent OTUs from χχχχ (Dissimilar residues are shown bold)

χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G δ C A A T G C T C C G T G G T G T T G T C G T C T G C G A T A T A T G G C T C G

χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

χ G A A C G C T G C G T C G T G T T G T C G T C T G T G A G A T A T G G C T C G φ T C A G G C C G C G T G G T G T TG T C G TC T G C G A G A A T T G G C T C G

Check V: Distance of subsequent OTUs from δδδδ (Dissimilar residues are shown bold)

δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C T C G ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G

43

δ G A A G C C T G C G T G T G G T T G T C G T C T G C G A G A T A T G G C T C G φ T C A G G C C G C G T G G T G T T G T C G TC T G C G A G A A T T G G C T C G

Check VI: Distance of subsequent OUT from εεεε

(Dissimilar residues are shown bold)

ε G A A G G T T G C G T G G T GT T G T C T G C T G C G A G A T A T G G C T C G φ T C A G G C C G C G T G G T G T TG T C G TC T G C G A G A A T T G G C T C G

With reference to the evolutionary distances (EDs) indicated in the ED matrix (EDM) of Figure 1, the problem in hand is to construct an additive tree using clustering algorithm of UPGMA. For this purpose, as mentioned earlier the number of observed nucleotide or amino-acid substitutions can be used in the UPGMA sequential clustering algorithm where local topological relationships are identified in the order of similarity. Hence, building of the phylogenetic tree is done via following steps: Step I: Given an EDM, the two OTUs that are most similar (bearing closest ED) are identified from among all the OTUs and these two OTUs are treated as a new single composite OTU. As shown in Figure y.1, the OTU-pair α, and β is the chosen pair by virtue of their smallest ED of 2 in the EDM; and, the composite OTU is {α, β}.

Fig. y.1: Selection of two OTUs α and β that are most similar in the

given EDM (bearing the smallest ED equal to 2) Step II: With the composite OTU {α, β} introduced in the EDM, the new matrix is constructed by choosing again the most similar pair and clustering them together as illustrated in Figure y.2. Relevant set of EDs is elucidated as follows:

Taxa/OTUs

Root αααα

ββββ

1 1 1 1

αααα

2 ββββ

ED EDM

44

ED between {α, β}and χ = ½ × [ED(α and χ) + ED (β and χ)] ED between {α, β}and δ = ½ × [ED(α and δ) + ED (β and δ)] ED between {α, β}and ε = ½ × [ED(α and ε) + ED (β and ε)] ED between {α, β}and φ = ½ × [ED(α and φ) + ED (β and φ)]

Fig. y.2: Construction of the new EDM with the composite OTU, {α

β} Step III: As in the previous steps, among the new group of OTUs, the pair with the highest similarity is identified and the corresponding composite OTU is specified. This procedure is repeated until we only two OTUs are left out. Corresponding EDM constructions are shown in Figures y.3-y.x.

Root αααα

ββββ

1 1 1 1

δδδδ

εεεε

Taxa/OTUs


0

4 0

6 6 0

α βα βα βα β

χχχχ

δδδδ

εεεε

ED EDM

χχχχ

6 6 4 0

φφφφ 8 8 8 8 0 φφφφ

45

Fig. y.3: Construction of the new EDM with the composite OTUs,

{ δε} and {δε}


{ αβχ} and {δε}

Root αααα

ββββ

1 1 1 1

δδδδ

εεεε

Taxa/OTUs


0

6 0

α β χα β χα β χα β χ

δ εδ εδ εδ ε

ED EDM

χχχχ

φφφφ 8 8 0 φφφφ

Root αααα

ββββ

1 1 1 1

δδδδ

εεεε

Taxa/OTUs


0

4 0

6 6 0

α βα βα βα β

χχχχ

δ εδ εδ εδ ε

ED EDM

χχχχ

φφφφ 8 8 8 0 φφφφ

46


{ αβχ δε} The resulting ultrametric tree for the test EDM is illustrated in Figure y.5.

Fig. y.4: Final UPGMA-based ultrametric phylogenetic tree for the

test EDM

Root αααα

ββββ

1 1 1 1

χχχχ

δδδδ

εεεε

φφφφ

Taxa/OTUs


0

8 0

α β χ δ εα β χ δ εα β χ δ εα β χ δ ε

φφφφ


Root

αααα

ββββ

1 1 1 1

χχχχ

δδδδ

εεεε

φφφφ

Taxa/OTUs

Evolutionary distance (ED)

47

Problem y.x The EDM with metrics of distances depicting a hypothetical ultrametric phylogenetic tree is shown below. Suppose the following sequence depicts the common ancestral root (R) corresponding to the given EDM, determine the sequences that can be specified for the OTU-set {a, b, c, d, e, f, g} of the tree. R : G A A T G TT G C G T G G T G T T G T G G T C T G C G A G A T A T A A C

T C G AATGCCT

Figure x A hypothetical EDM supplied for Problem y.x 1. The shortcomings of the UPGMA clustering method are: (i) It is sensitive to unequal evolutionary rates implying that whenever one of the OTUs has experienced more number of mutations over time relative to others, the resulting topology of the tree will be erroneous. (Why?); (ii) clustering is feasible only when the data is ultrametric and (iii) ultrametric distances are constrained by the so-called “three-point conditions”. (the three-point condition stipulates that for any given set of three taxa {A, B, C}, the two largest distances are equal;

Root a b

1 1 1 1

c d e f

Taxa/OTUs

a b c d e f g 0 2 0 4 4 0 6 6 6 0

6 6 6 4 2 0 8 8 8 8 8 8 0

a b c d e f


g g

6 6 6 4 0

48

that is, ED(A-C) ≤ maximum of [ED(A-B), ED(B-C)]. The two largest distances being equal signifies that the evolutionary rate is the same for all branches. Should this condition of rate constancy fails among lineages, an erroneous topology would result in. Problem y.x Consider an EDM with metrics of distances as shown below. For the ED values of the matrix indicated, relevant topology is as shown.

Figure y.a EDM of an phylogenetic tree having no evolution rate constancy Determine the three-point condition and show that the tree reconstructed by considering the evolutionary history via UPGMA leads to a wrong topology illustrated below implying that UPGMA on unequal rates of mutation will show a completely different topology from the original EDM-specific tree.

αααα

ββββ

1 1 1 1 1 1 1

χχχχ

δδδδ

εεεε

φφφφ

Taxa/OTUs


0

5 0

4 7 0

7 10 7 0

6 9 6 5 0

8 11 8 9 8 0

αααα

ββββ

χχχχ

δδδδ

εεεε

φφφφ


49

Figure y.b UPGMA-based topology for the EDM of Figure y.a (Solution hint: Considering the divergence phases of α and β, the taxon β has faced mutations at a much higher rate than the taxon α. Check therefore, the three-point criterion is violated and the possible UPGMA-based tree is erroneous. In such cases the neighborhood-joining procedure (described below) will yield the correct topology).

Example y.y An EDM indicated below corresponds to real data matrix for five ribosomal RNA sequences. Each value denotes the estimated number of nucleotide residue substitutions per position separating the corresponding pair of the presently existing sequences.

1 2 3 4 5 Node

Taxa BSu Bst Lvi Amo Mlu 1 Bsu 0 0.172 0.215 0.309 0.233 2 Bst 0 0.299 0.340 0.206 3 Lvi 0 0.280 0.394 4 Amo 0 0.429 5 Mlu 0

Construct an ultrametric tree.

Root

αααα

χχχχ

ββββ

δδδδ

εεεε

φφφφ

Taxa/OTUs

0.5

0.5

1.0

2.0

2.0

3.0

2.5

2.5

4.5

1.5

50

Solution:

Neighbor-joining (NJ) method: An algorithmic procedure to find the shortest tree The NJ method is a method for reconstructing phylogenetic trees, and computing the lengths of the branches of this tree. In each stage, the two nearest nodes of the tree are chosen and defined as neighbors in the tree. This is recursively followed until all of the nodes are paired together. The algorithm was originally developed by Saitou and Nei [ ] with subsequent corrections on the proof of the algorithm, plus some minor changes (to the algorithm) due to Studier and Kepler [ ].

Bsu

Bst

Mlu

Lvi

Amu

51

Figure y.x Illustration of neighbors in the phylogenetic trees Phylogenetic neighbors are defined as a pair of OTUs connected at a single node connecting them. For instance, consider a tree illustrated in Figure y.x with the (non-internal) node set {α, β, χ, δ} and {X, Y} depicts an internal node pair. The set of branch-lengths is {la, lb, lc, ld} as illustrated. Further, α and β are neighbors since they are connected through a common internal node. Likewise χ and δ are also neighbors. (However, as evidently {α, δ} and {β, χ} are not neighbors).

The underlying algorithm in constructing, assumes that the trees are additive trees. An additive tree means a tree where, for example, in Figure y.x, the distance between nodes α and β is equal to the distance between nodes α and X plus the distance between nodes β and X. Stipulation of 4-point condition A four-point condition can be stipulated to specify the general observation that neighbors are closer than non-neighbors. Referring to Figure y.y, dαχ + dβδ = dαδ + dβχ = la + lb+ lc+ ld + 2x = dαβ + dχδ + 2x and the 4-point conditions thereof are:

la

x

αααα ββββ

δδδδ χχχχ

lb

ld lc

X

Y

52

[Sum of all distances between neighboring nodes: for example, (dαβ + dχδ) in the sets {α,β} and {χ, δ}] < [Sum of all distances between non-neighbor nodes: for example, (dαχ + dβδ) in the sets {α, χ} and {β, δ}] [Sum of all distances between neighboring nodes: for example, (dαβ + dχδ) in the sets {α,β} and {χ, δ}] < [Sum of all distances between non-neighbor nodes: for example, (dαχ + dβδ) in the sets {α, δ} and {β, χ}] Suppose, there are N OTUs with a matrix in which evolutionary distances between every pair of OTUare known. Then the process of building the tree starts with no initial clustering. That is, every OTU is considered as being in an equal relationship with every other OTU. This leads to initiating a non-hierarchical (star-like) structure shown in Figure y.t(a).

Figure y.t (a) Star-like, non-hierarchical structure initialized; and (b) the tree realized with the given set of OTUs Let the distance between OTU-pair i and j be dij and lab be the distance between nodes a and b. The sum of the branch lengths for the nonhierarchical star-like tree centered at X in Figure y.t(a) is then given by:

N

o iX ij(i < j)i = 1

1S = = d

N 1

− ∑ ∑l (y.c)

1 7 8

2

3 5 4

6

1 7

8

2

3

5

4

6

Xx

Y

(a) (b)

Xx

53

The above relation is due to each branch being counted (N – 1) times when all distances are added. Next step of tree construction involves identifying the first pair of OTUs to be clustered implying that they diverge from the same node in the tree. Considering an example (with eight OTUs as above) forming the star as shown in Figure y.t(a), suppose 1 and 2 constitute the first pair being clustered at X; and, rest of the OTUs are clustered at another node Y as illustrated in Figure y.t(b). Now, for this tree that has {1, 2} OTU pair diverging from a single node (X) and other, {3, 4, 5, 6, 7, 8} OTU set from another node (Y), the “length” (S12) of the tree can be determined. Relevant algorithm is as follows: The length of the tree (Sij) for a node pair {i, j} can be deduced in terms of number of OTUs (N) and pairwise distances di j for every two leaves (taxa/OTUs) (i, j). Hence, with respect to Figure y.t(b), the branch length between X and Y (lXY) is given by:

[ ]XY 1k 2k 1X 2X(k 3, N)

N

iYi=3

1 = (d d ) (N 2)(

{2(N 2)}

2

=

+ − − + −

−

∑

∑

l l l

l

(y.z)

where 1k 2k(k 3, N)

1(d d )

{2(N 2)} =

+ − ∑ term specifies the sum of

all distances that include lXY.

And, [ ]N

1X 2X iYi=3

(N 2)( and 2

− + ∑l l l are terms that

depict irrelevant entities being excluded in computing lXY. Further, if the interior branch (X to Y) in Figure y.t (b), is removed, there will be two independent star-like trees, one for the OUT-pair {1 and 2} and the other for the remaining set of (N – 2) OTUs. Corresponding

branch lengths, (l1X + l2X) and iYi = 3

N ∑ l can be deduced by

applying equation (y.c). That is,

54

l1X + l2X = d12 (y.a1) and,

N

iY ij(3 i < j)i = 3

1 = d

N 3 ≤

− ∑ ∑l (y.a2)

Adding these branch lengths, the sum of all branch lengths of the tree configuration of Figure y.t(b) namely S12 is obtained as follows:

[ ]N

12 XY 1X 2X iYi=3

1k 2k(k 3, N)

1k 2k 12(k 3, N)

ij(3 i < j)

S = (

1(d d )

{2(N 2)}

1 1(d d ) + d

{2(N 2)} 2

1+ d

(N 2)

=

=

≤

+ + +

+ + −

= + −

−

∑

∑

∑

∑

l l l l

(y.zz) where the three terms on the right-hand side denote explicitly, the following:

1k 2k(k 3, N)

1 (d d )

{2(N 2)} =

+ = − ∑ Mean distance from OTU1 and

OTU2 to the rest of the OTUs 3, 4,…, N.

121

d2 =

Semi-pairwise distance from

OTU1 to OTU2

ij(3 i < j)

1 d

N 2 ≤ = −

∑ Average of all pairwise distances

between the OTUs 3 to N.

55

(Equation (y.zz) refers to the sum of least-squares estimate of branch-lengths as proved in [Soito]).

Thus S12 is the total tree-length when OTU1 and OTU2 considered to form the first cluster. Likewise tree-lengths S1j → S13, S14, …, S1N can be calculated and the lowest of these would indicate the true pair of OTUs to be first clustered (implying the first neighbors to be joined). These first two OTU neighbors are then merged to depict a single (composite) OTU.

For example, suppose S12 is found to be smallest among all values of the set {Sij}. Then OTUs 1 and 2 are designated as a pair of (nearest) neighbors, and these are joined to form a combined OTU (1-2). And, the distance between this combined OTU with respect to any other OTU j is given by: (1 2)j 1j 2jD = (D + D )/2− for (3 ≤ j ≤ N).

Now, the number of OTUs involved is reduced by one; and for the resulting new distance matrix, the previous procedure is applied so as to find the next pair of (nearest) neighbors. This cycle of operation iterated until the number of OTUs becomes three upon a single unrooted tree.

In the tree construction, another computation needed refers to determining the branch lengths. This can be done via so-called Fitch-Margoliash algorithm [W. M. Fitch and E. Margoliash, Construction of phylogenetic trees, vol. 55, Science, 1967, 279-284]. Relevant considerations are as follows: Suppose, for example, the OUT pair {1, 2} is the first pair being joined in the tree of Figure y.t, corresponding l1X and l2X are estimated by: l1X = (d12 + d1Z − d2Z)/2 (y.ua)

l2X = (d12 + d2Z − d1Z)/2 (y.ub) where Z denotes a group of all OTUs (leaves) excluding the OTU pair {1, 2}. It represents intuitively the whole cluster of all included OTUs as shown in Figure y.b.

56

Figure y.b Cluster (Z) of all the leaves excluding {1 and } Further, d1Z and d2Z are distances between (1 and Z) and (2 and Z) respectively. Explicitly, they are determined by [Nei 1987]:

N

1Z 1ii = 3

1 d = d

N 2

−

∑ (y.uc)

N

2Z 2ii = 3

1 d = d

N 2

−

∑ (y.ud)

And, l1X and l2X denote least-squares estimates for the tree of the illustration in Figure y.t(b).

The neighbor-joining method in essence, is an iterative algorithm, which assumes an additive tree or summing procedure. Its each iteration is indicated in the following pseudocode: __________________________________________________ Phylogenetic tree construction: Pseudocode on NJ method Input distance matrix

← Given a set of multiple sequences data ← Call subroutine for sequence alignment:

Global multiple alignment and selecting reliable data of aligned sequences devoid of any gaps; that is, any gap

1 7 8

2

3

5

4

6

x y

Z

57

present is skipped or ignored – Working alignment

Build distance matrix between sequences Define a distance parameter (d i, j ) between

the nodes 1 through N (N = 8 in the example of Figure y.t(a))

← Distance parameter can be defined by different notions, such as: Jukes-Cantor method, Kimura method, p-distance method etc.

← As an example, the p-distance method defines the distance between two sequences as equal to the ratio of count of different positions (of a nucleotide, for example) to total number of positions

← The distance matrix is symmetric and so, only the values above/below the diagonal need to be computed; that is, the matrix is symmetric regardless of the diagonal and it is suffice to show normally only the top half (or lower half) filled.

← enter the distance parameter as elements in a matrix.

← Resulting distance matrix(DM) is used to identify the first branch of the tree as consisting of the two OTUs that have the shortest distance between them

Example of an(N × N)distance matrix

i

d11 d 12 . . . d1N d21 d2N . . . d ij . . .

j

dN1 dN1 . . . dNN

58

← The columns and rows of this matrix

denote nodes; and, the values (d i,j) of the matrix elements represent the distance between nodes i and j.

← An example of a (5 × 5) distance matrix with numerical values. (This corresponds to real data matrix for five ribosomal RNA sequences. Each value denotes the estimated number of nucleotide residue substitutions per position separating the corresponding pair of the presently existing sequences).

1 2 3 4 5 Node

Taxa BSu Bst Lvi Amo Mlu 1 Bsu 0 0.172 0.215 0.309 0.233 2 Bst 0 0.299 0.340 0.206 3 Lvi 0 0.280 0.394 4 Amo 0 0.429 5 Mlu 0

Initialize Step 1 Start off with a star tree (Figure y.y1)

Figure y.y1: Star-like tree initialized Step 2 Define/identify the neighbors

← two nodes with the lowest value in the matrix are chosen and they are defined

1 5

2 3

4

59

as neighbors. In the example shown above nodes 1 (Bsu) and 2 (Bst) show lowest value (0.172) in the distance matrix. So, they are the “nearest” and identified as “neighbors”.

Merge these neighbors to form a single composite (new) OTU and label it as (1-2) Step 3 Compute the branch lengths for between this composite node (1-2) and rest of the other nodes

← This computation involve staking the unweighted arithmetic mean of all the pairwise distances of the new OTU(1-2)with respect to all other OTUs (namely, 3, 4, 5. Hence,

→ the distance between (1-2) and 3 is: d (1-2)3 = (d 13 + d23)/2



Step 4 Construct the resulting table as shown below:

1 2 3 4 5 Node Taxa Bsu Bst Lvi Amo Mlu

1 Bsu 0 0.172 0.215 0.309 0.233 2 Bst 0 0.299 0.340 0.206 3 Lvi 0 0.280 0.394 4 Amo 0 0.429 5 Mlu 0

60

1-2 3 4 5 Node

Taxa Bsu-BSt

Lvi Amo Mlu

1-2 Bsu-Bst 0 0.257 0.325 0.220 3 Lvi 0 0.280 0.394 4 Amo 0 0.429 5 Mlu 0

Step 5 Repeat the process from Step 2 again looking for the 2 nearest nodes, and constructing the resultant matrix

← Now OTU 5 (Mlu) and the composite OTU (1-2( Bsu/Bst )constitute the nearest neighbors

← Construct the new matrix with: d (1-2)5 = (d 15 + d25)/2 = (d 1x

+ d2x) d [(1-2)5]3 = (d 1x + d13)/2 + (d 2x + d23)/2

d[(1-2)5]4 = (d 1x + d14)/2 + (d 2x + d24)/2

(1-2)5 3 4 Node Taxa (Bsu/Bst)Mlu Lvi Amo

(1-2)5 (Bsu/Bst)Mlu 0 0.282 0.435 3 Lvi 0 0.280 4 Amo 0

Step 6 Repeat the process from Step 2 again looking for the 2 nearest nodes, and constructing the resultant matrix

← Now composite OTU(1-2)5(Bsu-Bst)Mlu and the composite OTU(3-4)(Lvi-Amo) form nearest neighbors

← Construct the new matrix with: d(1-2)5 =(d 1x

+ d2x)

61

← d[(1-2)5](3-4) = (d 1x + d1x-3 +d 2x + d2x-3 )/4

(1-2)5 (3-4) Node Taxa (BSu-BSt)Mlu (Lvi-Amo)

(1-2)5 (Bsu-Bst)Mlu 0 0.289 (3-4) (Lvi-Amo) 0

______________________________________ Inasmuch no prior information is available on _______________________________________________________

Reconstruction of a phylogenetic tree: Iterative algorithm of neighbor-joining (NJ) method Input: a (n × n) distance matrix ∆i, j = [d i,j (T)] for every two leaves (taxa) (i,j) Do: Iteration (I) over k = 1, 2, ..., r over r taxa ← Each I involves… Calculate: Q-matrix based on the current distance matrix.

← For a given distance d(i, j) between taxa i and j and using the corresponding distance matrix relating r taxa, Q-matrix is specified as follows:

← r r

i, jk = 1 k = 1

Q = (r 2)d(i, j) d(i, k) d(j, k)− − −∑ ∑ (y.1)

Determine: The pair of taxa with the lowest value in Q Create: A node on the tree that joins this pair of taxa. ← the closest neighbors are joined

62

Calculate: The distance of each of the taxa in the pair to this new node

Determine: The distance of all taxa outside of this pair to the new node

Go To: I (k = 1, 2, …, r), considering the

pair of joined neighbors as a single taxon and using the distances calculated in the previous step.

Stop: Termination of r th iteration End _______________________________________________________

Example y.1 (of NJ method) Suppose there are three sequences (α, β, γ) with distances specified by:

β γ α 25 30 β 35

The problem is to construct an unrooted tree shown below with the branch lengths a, b and c indicated.

Solution: From the distance matrix given:

α to β: a + b = 25 α to γ: a + c = 30

a

c

b

αααα ββββ γγγγ

63

β to γ: b + c = 35 The three linear, independent equations lead to, a = 10, b = 15 and c = 20. Example y.z A hypothetical set of multiple sequences is given below: αααα CAG CGT TGG GCG ATG GCA ACC… ββββ CAG CGT TGG GCG ACG GTA ATC… δδδδ CAG CAT TGA ATG ATG ATA ATC… εεεε CAG CAT TGA GTG ATA ATA ATC… Construct the distance matrix between the sequences and hence construct a unrooted tree using the distance matrix derived. Solution The distances depict number of changes seen in the nucleotides (or aminoacids) between any two sequences under comparison. Hence, the required matrix is:

αααα ββββ δδδδ εεεε αααα 3 7 8 ββββ 6 7 δδδδ 3 εεεε

Unrooted tree evolved from the given sequences

64

The distance scoring used can be empirical. However, only a single tree is realized. In building this single tree, the raw pairwise distances may not always be perfectly additive. That is, with reference to the above example, the sum of the three simultaneous equations (= sum of the branch lengths) can be seen exactly equal to the sum of pairwise distances. Such exactness may not prevail in raw data. Such situation may arise when there is undetected homoplasy. (Homoplasy is a character shared by a set of species but not present in their common ancestor). It may miss homoplasies across long distances, meaning long evolutionary distances may be underestimated.

Neighbor-joining can be regarded as a star decomposition technique of heuristic nature. It is computationally not very intensive and may produce reasonable trees. But its tree search does not pose ant optimality criterion, and as such it, no guarantee is given to say that the “recovered tree is the one that best fits the data”. More should attempt to produce a starting tree, and then perform a tree search exercising an optimality criterion so as to ensure that the best tree is obtained. Example y.2 (Deducing a Q-matrix) Given a set of four taxa {α, β, α, β, α, β, α, β, δ, εδ, εδ, εδ, ε} as shown, deduce the corresponding Q-matrix.

αααα ββββ δδδδ εεεε αααα

1 2

1 2

4

αααα ββββ

δδδδ εεεε

65

ββββ 7 δδδδ 11 6 εεεε 14 9 7

Solution:

Q- matrix:

αααα ββββ δδδδ εεεε αααα ββββ − 40 δδδδ − 34 − 34 εεεε − 34 − 34 − 40

In the example above, two pairs of taxa have the lowest value, namely − 40. We can select either of them for the second step of the algorithm. We follow the example assuming that we joined taxa α and β together. Adding more branches is iterative. If more sequences are included, corresponding branches will be introduced in the tree. This method scales linearly with the number of sequences. This involves adding a new node to the tree. A. Distance of the pair members to the new node Suppose for each neighbor in the pair just joined, calculation of the distance to the new node with reference to the taxa set, {θ and φ: the paired taxa; ω: the newly generated node) can be done by using the following algorithm:

k = r k = r

k = 1 k = 1

d(θ, ) 1d(θ, ω) = + d(θ, k) d( , k)

2 2(r 1)

φ − φ − ∑ ∑ (y.2)

B. Distance of the other taxa to the new node

For each taxon not considered above, the distance to the new node is obtained from:

66

[ ] [ ]1 1d( , k) = d(θ, k) d(θ, u) d( , k) d( , u)

2 2ω − + φ − φ (y.3)

where, as said before ω is the new node, and θ and φ are the members of the pair just joined; and, k is the node for which it is required to calculate the distance. and f and g are the members of the pair just joined

Problem y.1: (a) In the Example y.2 determine: (i) the distance between α and

the new node; and (ii) the distance between β and the new node. Use equation (y.2). (Answers: (i) 6; (ii) 1)

(b) In the Example y.1 determine: (i) the distance between δ and

the new node; and (ii) the distance between ε and the new node. Use equation (y.3). (Answers: (i) 5; (ii) 8)

Next step on summing procedure Using the considerations as above, the following matrix can be evolved iteratively for Example y.3 (with αβ acting as a new taxon) with the result of Problem y.1(b):

ααααββββ δδδδ εεεε ααααββββ δδδδ 5 εεεε 8 7

In the iteration, this matrix as above can be considered as the original distance matrix. (With reference to Example y.2, one more step of the recursion will enable the complete tree. Try!) Neighbor-joining (NJ) topology gives the least total branch length is preferred at each step of the algorithm. Hence it specifies minimum-evolution criterion for phylogenetic trees, But the NJ solution may not lead to the true tree topology with this least total branch length specification inasmuch as it is a greedy algorithm. That is, it develops the tree in a step-wise fashion. It is suboptimal in this sense,

67

but often it finds a tree close to optimal topology. Yet there are other methods that do not rely on distance measures and give results with better accuracy.

NJ method is computationally efficient and conforms to polynomial-time algorithm. Very large data sets can be handled by NJ approach. (Other phylogenetic analyses like minimum evolution, maximum parsimony and maximum likelihood are computationally intensive. Though NJ computation leads to an unrooted topology, rooted trees can be obtained by using an outgroup; and, the root can be appropriately inserted at a point in the tree where the edge from the outgroup connects. That is, inasmuch as the phylogenetic inference via NJ method yields unrooted tree, from this result itself it is not possible to tell which of the OTUs had branched off before all the others. So, in order to root a tree one should add an “outgroup” to the data set. An outgroup implies an OTU for which external information (for example, paleontological information) is available that indicates that the outgroup branched off before all other taxa. (See Figure y.1). In practice, distance-matrix methods suggest the inclusion of at least one outgroup sequence, (which is known to be only distantly related to the sequences of interest in the query set). This procedure conforms to an experimental control (contrasting an experimental group versus a control group). When the chosen outgroup is appropriate, it will show much greater genetic distance (and hence will possess longer branch) than any other sequence in the set; as such, it will appear near the root of a rooted tree or indicates the nearest node as the possible root-node in a unrooted topology (Figure y.1).

Further, in choosing an appropriate outgroup requires the selection of a sequence that is moderately related to the sequences of interest. Suppose an outgroup implicates too close a relationship. Then, this outgroup will lose its significance. Likewise, when an outgroup depicts too far a relationship, it simply adds noise to the analysis. In prescribing an outgroup, care must be exercises so as to omit cases where the species from which the sequences were taken are distantly related; but, the gene encoded by the sequences is rather highly conserved across lineages. Another confounding aspect of outgroup selection and usage is horizontal gene transfer (between, otherwise divergent bacteria).

68

NJ strategy is statistically consistent usually under many models of evolution. Hence, given data of sufficient length, neighbor-joining will reconstruct the true tree with high probability. It is shown in [K. Atteson, The performance of neighbor-joining methods of phylogenetic reconstruction, Algorithmica, 25, 1999, 251-278] that, if each entry in the distance matrix differs from the true distance by less than half of the shortest branch length in the tree, then NJ will lead to constructing the correct tree.In order to correct for the enhanced inaccuracy in measuring distances between distantly-related sequences, an alternative procedure indicated for tree reconstruction using distance matrix refers to what is known as the Fitch-Margoliash (FM) method [W. M. Fitch and E. Margoliash, Construction of phylogenetic trees. Science 155, 1967, 279–284]. It is described below. Fitch-Margoliash method This method uses the genetic distance and follows a weighted least squares method for clustering. That is, more weight in the tree construction process is given to closely-related sequences. This approach would reduce the inaccuracy associated in the distance measurement on distantly-related sequences. In the relevant computation, the distances used as input to the algorithm are first normalized so as to prevent large deviational artifacts in (computing) relationships between closely-related versus distantly-related groups. Further, the distances adopted for this method must be linear. This linearity criterion for distances requires that the expected values of the branch lengths for two individual branches must equal the expected value of the sum of the two branch distances. This property applies to biological sequences only when they have been corrected for the possibility of back mutations at individual sites. (This correction can be done via substitution matrix such as that derived from the Jukes-Cantor model of DNA evolution. The distance correction is, however, necessary in practice only when the evolution rates differ among branches).

Though the least-squares criterion advocated enables the distances in FM approach to be more accurate, it however, burdens the computational efforts more heavily than the NJ methods. Further, in addition to least-squares criterion, another improvement that

69

corrects for correlations between distances (stemming from many closely-related sequences in the data set) can be envisaged in FM technique with increased computational cost. Finding the optimal least-squares tree with any correction factor is NP-complete [W. H. E.Day, Computational complexity of inferring phylogenies from dissimilarity matrices. Bulletin of Mathematical Biology 49, 1986, 461-467].so heuristic search methods like those used in maximum-parsimony analysis are applied to the search through tree space. (Suppose given a solution to a problem and it is verifiable quickly (in polynomial time). Then, such a set of problems is known as NP (nondeterministic polynomial time) in the theory of computational complexity. Relevant class of complexity is NP-complete (abbreviated NP-C or NPC). Further, the class of problems are such that, if the problem can be solved quickly (in polynomial time), then so can every problem in NP. In searching for a quick problem-verification, however, there is no known efficient way to locate a solution a priori. In NP-complete problems the fast solution to them is rather unknown. That is, the time required to solve the problem using any currently known algorithm would increase considerably as the size of the problem grows. As such, mostly NP-complete problems are addressed by using approximation algorithms).

Notwithstanding the details perceived from a group of query sequences, suppose some independent information as regard to the relationship between such sequences or groups is available. It can be used profitably in curtailing the (rooted and/or unrooted) tree search-space.

In general, pairwise distance data are an underestimate of the path-distance between taxa on a phylogram. Pairwise distances effectively "cut corners" in a manner analogous to geographic distance: the distance between two cities may be 100 miles "as the crow flies," but a traveler may actually be obligated to travel 120 miles because of the layout of roads, the terrain, stops along the way, etc. Between pairs of taxa, some character changes that took place in ancestral lineages will be undetectable, because later changes have erased the evidence (often called multiple hits and back mutations in sequence data). This problem is common to all phylogenetic estimation, but it is particularly acute for distance methods, because only two samples are used for each distance calculation; other

70

methods benefit from evidence of these hidden changes found in other taxa not considered in pairwise comparisons. For nucleotide and amino acid sequence data, the same stochastic models of nucleotide change used in maximum likelihood analysis can be employed to "correct" distances, rendering the analysis "semi-parametric."

Several simple algorithms exist to construct a tree directly from pairwise distances, including UPGMA and neighbor joining (NJ), but these will not necessarily produce the best tree for the data. To counter potential complications noted above, and to find the best tree for the data, distance analysis can also incorporate a tree-search protocol that seeks to satisfy an explicit optimality criterion. Two optimality criteria are commonly applied to distance data, minimum evolution (ME) and least squares inference. Least squares is part of a broader class of regression-based methods lumped together here for simplicity. These regression formulae minimize the residual differences between path-distances along the tree and pairwise distances in the data matrix, effectively "fitting" the tree to the empirical distances. In contrast, ME accepts the tree with the shortest sum of branch lengths, and thus minimizes the total amount of evolution assumed. ME is closely akin to parsimony, and under certain conditions, ME analysis of distances based on a discrete character dataset will favor the same tree as conventional parsimony analysis of the same data. Phylogeny estimation using distance methods has produced a number of controversies. UPGMA assumes an ultrametric tree (a tree where all the path-lengths from the root to the tips are equal). If the rate of evolution were equal in all sampled lineages (a molecular clock), and if the tree were completely balanced (equal numbers of taxa on both sides of any split, to counter the node density effect), UPGMA should not produce a biased result. These expectations are not met by most datasets, and although UPGMA is somewhat robust to their violation, it is not commonly used for phylogeny estimation.

Neighbor-joining is a form of star decomposition and, as a heuristic method, is generally the least computationally intensive of these methods. It is very often used on its own, and in fact quite frequently produces reasonable trees. However, it lacks any sort of tree search and optimality criterion, and so there is no guarantee that the

71

recovered tree is the one that best fits the data. A more appropriate analytical procedure would be to use NJ to produce a starting tree, then employ a tree search using an optimality criterion, to ensure that the best tree is recovered.

Many scientists eschew distance methods. In some cases, this is for esoteric philosophical reasons. A commonly-cited reason is that distances are inherently phenetic rather than phylogenetic, in that they do not distinguish between ancestral similarity (symplesiomorphy) and derived similarity (synapomorphy). This criticism is not entirely fair: most currently implementations of parsimony, likelihood, and Bayesian phylogenetic inference use time-reversible character models, and thus accord no special status to derived or ancestral character states. Under these models, the tree is estimated unrooted; rooting, and consequently determination of polarity, is performed after the analysis. The primary difference between these methods and distances is that parsimony, likelihood, and Bayesian methods fit individual characters to the tree, whereas distance methods fit all the characters at once. There is nothing inherently less phylogenetic about this approach.

More practically, distance methods are avoided because the relationship between individual characters and the tree is lost in the process of reducing characters to distances. Since these methods do not use character data directly, and information locked in the distribution of character states can be lost in the pairwise comparisons. Also, some complex phylogenetic relationships may produce biased distances. On any phylogram, branch lengths will be underestimated because some changes cannot be discovered at all due to failure to sample some species due to either experimental design or extinction (a phenomenon called the node density effect). However, even if pairwise distances from genetic data are "corrected" using stochastic models of evolution as mentioned above, they may more easily sum to a different tree than one produced from analysis of the same data and model using maximum likelihood. This is because pairwise distances are not independent; each branch on a tree is represented in the distance measurements of all taxa it separates. Error resulting from any characteristic of that branch that might confound phylogeny (stochastic variability, change in

72

evolutionary parameters, an abnormally long or short branch length) will be propagated through all of the relevant distance measurements. The resulting distance matrix may then better fit an alternate (presumably less optimal) tree.

Despite these potential problems, distance methods are extremely fast, and they often produce a reasonable estimate of phylogeny. They also have certain benefits over the methods that use characters directly. Notably, distance methods allow use of data that may not be easily converted to character data, such as DNA-DNA hybridization assays. They also permit analyses that account for the possibility that the rate at which particular nucleotides are incorporated into sequences may vary over the tree, using LogDet distances. For some network-estimation methods (notably NeighborNet), the abstraction of information about individual characters in distance data are an advantage. When considered character-by character, conflict between character and a tree due to reticulation cannot be told from conflict due either to homoplasy or error. However, pronounced conflict in distance data, which represents an amalgamation of many characters, is less likely due to error or homoplasy unless the data are strongly biased, and is thus more likely to be a result of reticulation.

Distance methods are extremely popular among some molecular systematists, a substantial number of whom use NJ without an optimization stage almost exclusively. With the increasing speed of character-based analyses, some of the advantages of distance methods will probably wane. However, the nearly-instantaneous NJ implementations, the ability to incorporate an evolutionary model in a speedy analysis, LogDet distances, network estimation methods, and the occasional need to summarize relationships in with a single number all mean that distance methods will probably stay in the mainstream for a long time to come.

The clustering-based approach Since the raw data is a string of character states (for example, the nucleotide sequence), the character-state methods are indicated as powerful than distance methods; and, in transforming such character-state data into distance-matrices some information is lost.

73

However, while the maximum parsimony method indeed uses the raw data, it usually uses only a small fraction of the available data. For instance, in the example that we have used only three sites are informative and were used while six sites are excluded from the analysis. For this reason, this method is often less efficient than some distance-matrix methods (e.g., see Saitou and Nei 1986). Of course, if the number of informative sites is large, the maximum parsimony method is generally very effective. Distance methods only give one tree, while parsimony analyses many trees and may suggest multiple, equally likely trees, none of which is necessarily the right one. The maximum likelihood method does not suffer from these limitations. It uses the entire sequence information, it analyses many trees and proposes the tree with the highest likelihood. Moreover, the method seems to be robust and relatively insensitive to violations of the evolutionary model used or to unequal rates of evolution or nucleotide bias. The method, however, is not suitable for large datasets due to its CPU-intensive nature. Subroutine III

← Making of the required phylogenetic tree: Mainly, phylogenetic tree-reconstruction is done by two computer-based methods: (i) Use of ClustalW, or (ii) use of Phylip. (There are also other programs: PAUP*, PhyloWin, CLC, BioEdit, MacClade etc.) The constructed tree can be visualized by: TreeView, NJplot, CLC…

Program

Simple

or sophisticated?

Free

or commercial?

Remarks

ClustalW

Easy, fast and

Good tool with different analysis methods.

74

hustle-free

Essentially, it is a multiple sequence alignment (MSA) tool enabling assembling of a tree; but, it is not a phylogenetic tree. It is a guide tree that ClustalW uses to m ake MSA. Only with EBI ClustalW a phylogenetic tree is enables.

Phylip

Sophisticated and involves running several programs Free

Contains many programs. Both distance- and character- based analyses are feasibleBoot- strapping is possible. No graphical interface.

PAUP*

Apparently good Commercial

Good tool with different analyses methods.

MacClade

Commercial

Good tool with different analysis methods. Works on Macintosh only.

BioEdit

Simple: by selecting the sequences in the alignment and cho osing the wanted phylogeny, the result is given. No need to learn the command line Free package

Phylogeny methods are built-in. Phylip routines can be called. Boot-strapping is not facilitated

75

PhyloWin

Fairly simple Free

It enables maximum-likel ihood on nucleotide sequences. Allows maximum parsimony computation.

CLC Work-bench

Commercial

Has routine to make distance-based trees It enable neighbor- joining and UPGMA

TreeView

Does visualization of trees. Its link: Wiki

NJplot Does visua lization of trees. Its link: wiki

______________________________________________ This involves statistical methods in reconstructing the tree from molecular data. The tree-reconstruction process conforms to optimum phylogenetic estimation. As seen before that depending on the number of sequences taken, there are a number of feasible tree configurations. Therefore, methods of favoring one phylogenetic tree over another have been developed [Felsenstein, 2004]. Each of these existing methods can be classed into one of three main categories: parsimony method, distance-based or distance-matrix method and statistical methods (specifically, the maximum likelihood method). Thus, in the case of inferring phylogenetic trees from paralogous sequences, internal nodes in the tree represent duplication events. Relevantly, the phylogenetic trees are viewed as unrooted trees inasmuch as the choice of which two nodes between which to place the root is not obvious. There are two techniques that are usually

76

used to infer where the root of the tree is located. One approach is outgroup rooting (Felsenstein, 2004). Whenever three monophyletic groups of organisms are compared, and two of them are more closely related to each other than either is to the third, the third group is known as the outgroup. Generally this technique is used to determine the locations of the root of the tree as the outgroup is assumed a priori to have diverged earlier from the common ancestor than the other groups in the tree. Another method is to assume a molecular clock when constructing the tree. By assuming a molecular clock, we are saying that evolution on all lineages of the tree happened at the same rate (homogeneous rate of evolution). This makes the distance from root to tip equal for all paths emanating from the root. Therefore, we can find the root of the tree by locating the place in the tree where the distance along all lineages to all tips is approximately equal (Felsenstein, 2004). Figure 1.3: A rooted binary tree of animals (a) and its unrooted equivalent (b).

Phylogeny Notes PDF

Documents

Transcript of Phylogeny Notes PDF