Stuart M. Brown

1 Stuart M. Brown New York University School of Medicine With adaptations by H. Geller (GMU) presented by Molecular Phylogenetics Computing Evolution and Artificial Life

Transcript of Stuart M. Brown

Page 1: Stuart M. Brown

1

Stuart M. Brown
New York University School of Medicine
With adaptations by H. Geller (GMU)

presented by

Molecular Phylogenetics Computing Evolution and Artificial Life

Page 2: Stuart M. Brown

2

Topics

• Life’s Levels of Organization
• Emergent Properties
• Molecular Evolution
• Calculating Distances
• Clustering Algorithms
• Cladistic Methods
• Computer Software

Page 3: Stuart M. Brown

3

Recall Properties of Life

Living organisms:
– are composed of cells
– are complex and ordered
– respond to their environment
– can grow and reproduce
– obtain and use energy
– maintain internal balance
– allow for evolutionary adaptation

Page 4: Stuart M. Brown

4

Levels of Organization: Cellular Organization

cells, organelles, molecules, atoms

The cell is the basic unit of life.

Page 5: Stuart M. Brown

5

Levels of Organization: Organismal Level

organism, organ systems, organs, tissues

Page 6: Stuart M. Brown

6

Levels of Organization: Population Level

ecosystem, community, species, population

Page 7: Stuart M. Brown

7

Levels of Organization

Each level of organization builds on the level below it but often demonstrates new features.

Emergent properties: new properties present at one level that are not seen in the previous level

Page 8: Stuart M. Brown

8

Evolution

The theory of evolution is the foundation upon which all of modern biology is built.

From anatomy to behavior to genomics, the scientific method requires an appreciation of changes in organisms over time.

It is impossible to evaluate relationships among gene sequences without taking into consideration the way these sequences have been modified over time.

Page 9: Stuart M. Brown

9

“Nothing in biology makes sense except in the light of evolution.”

- Theodosius Dobzhansky, 1973

Page 10: Stuart M. Brown

10

Relationships

Similarity searches and multiple alignments of sequences naturally lead to the question:

“How are these sequences related?”

and more generally:

“How are the organisms from which these sequences come related?”

Page 11: Stuart M. Brown

11

The purpose of a phylogenetic tree is to illustrate how a group of objects (usually genes or organisms) are related to one another.

Page 12: Stuart M. Brown

12

Taxonomy

• The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology.
• Taxonomy is the art of classifying things into groups (a quintessential human behavior), established as a mainstream scientific field by Carolus Linnaeus (1707-1778).

Page 13: Stuart M. Brown

13

Page 14: Stuart M. Brown

14

iClicker Question

• When a plant or animal dies, the remains are usually lost.
– A True
– B False

Page 15: Stuart M. Brown

15

iClicker Question

• It is not possible to document the transition from one species to another with the fossil record.
– A True
– B False

Page 16: Stuart M. Brown

16

iClicker Question

• The fossil record is very complete.
– A True
– B False

Page 17: Stuart M. Brown

17

iClicker Question

• How many species of early life forms are estimated to be in the fossil record?
– A 1 out of every 10
– B 1 out of every 1000
– C 1 out of every 10,000
– D 1 out of every 100,000

Page 18: Stuart M. Brown

18

iClicker Question

• Most species that have lived on Earth have died out and are now extinct.
– A True
– B False

Page 19: Stuart M. Brown

19

iClicker Question

• Vestigial organs are:
– A Internal features that serve no useful function
– B Organs attached to the vestigial bone
– C Internal organs with an evolutionary link to the gills of fish
– D A musical instrument produced in Vestig, Italy

Page 20: Stuart M. Brown

20

Charles Darwin

Served as naturalist on mapping expedition around coastal South America.

Used many observations to develop his ideas

Proposed that evolution occurs by natural selection

Page 21: Stuart M. Brown

21

Voyage of the Beagle

Page 22: Stuart M. Brown

22

Charles Darwin

Evolution: modification of a species over generations; “descent with modification”

Natural Selection: individuals with superior physical or behavioral characteristics are more likely to survive and reproduce than those without such characteristics

Page 23: Stuart M. Brown

23

Darwin’s Evidence

Similarity of related species
- Darwin noticed variations in related species living in different locations

Page 24: Stuart M. Brown

24

Darwin’s Evidence

Population growth vs. availability of resources
- population growth is geometric
- increase in food supply is arithmetic
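The contrast between geometric and arithmetic growth can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `first_shortfall` and all starting values are hypothetical and are not taken from Darwin or Malthus.

```python
def first_shortfall(pop0, growth_factor, food0, food_step, max_generations=1000):
    """Return the first generation in which the population exceeds the food supply.

    All parameters are hypothetical illustration values, in arbitrary units.
    """
    population, food = pop0, food0
    for generation in range(max_generations + 1):
        if population > food:
            return generation
        population *= growth_factor   # geometric: multiply every generation
        food += food_step             # arithmetic: add a fixed amount every generation
    return None  # no shortfall within the simulated horizon


# A population that doubles against a food supply that grows by a constant step.
print(first_shortfall(pop0=10, growth_factor=2, food0=100, food_step=100))
```

However generous the starting food supply, a multiplying population eventually overtakes a supply that only grows by addition, which is the point of the slide.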

Page 25: Stuart M. Brown

25

Darwin’s Evidence

Population growth vs. availability of resources
- Darwin realized that not all members of a population survive and reproduce.
- Darwin based these ideas on the writings of Thomas Malthus.

Page 26: Stuart M. Brown

26

Post-Darwin Evolution Evidence

Fossil record
- New fossils are found all the time
- Earth is older than previously believed

Mechanisms of heredity
- Early criticisms of Darwin’s ideas were resolved by Mendel’s theories of genetic inheritance.

Page 27: Stuart M. Brown

27

Post-Darwin Evolution Evidence

Comparative anatomy
- Homologous structures have the same evolutionary origin, but different structure and function.
- Analogous structures have similar structure and function, but different evolutionary origin.

Page 28: Stuart M. Brown

28

Homologous Structures

Page 29: Stuart M. Brown

29

Post-Darwin Evolution Evidence

Molecular Evidence
- Our increased understanding of DNA and protein structures has led to the development of more accurate phylogenetic trees.

Page 30: Stuart M. Brown

30

Time’s Story of Life

• First cell
  – Natural selection
  – Mutations
• Mutations
  – Most not beneficial
• Environment
  – Impacts evolution
• Eukaryotes
• Colonies
• Hard Shell
  – Cambrian explosion

Page 31: Stuart M. Brown

31

Geological Time

Page 32: Stuart M. Brown

32

Mass Extinctions and the Rate of Evolution

• Rate of extinction
  – 10%-20% extinct in 5-6 million years
• Mass extinctions
  – 30%-90% extinct
• Mechanisms
  – asteroid
• Evolution
  – Gradualism
  – Punctuated equilibrium

Page 33: Stuart M. Brown

33

The Evolution of Human Beings

Page 34: Stuart M. Brown

34

iClicker Question

• Approximately how many “major” mass extinctions do biogeologists recognize since the Cambrian period?
– 5
– 50
– 5000

Page 35: Stuart M. Brown

35

iClicker Question

• A structure, process, or behavior that helps an organism survive and pass on its genes is called
– A an adaptation
– B evolution
– C survival of the fittest

Page 36: Stuart M. Brown

36

iClicker Question

• The concept of natural selection depends on which fact(s)?
– A Life evolved from simple cells and the biggest ones were most likely to survive.
– B Better camouflaged animals are less likely to be eaten and they are more likely to produce offspring.
– C Every population contains some genetic diversity and many more individuals are born than can possibly survive.
– D A and B
– E B and C

Page 37: Stuart M. Brown

37

iClicker Question

• Human beings and the great apes had a common ancestor about:
– A 7 to 8 thousand years ago
– B 1 to 2 million years ago
– C 7 to 8 million years ago
– D 1 to 2 billion years ago

Page 38: Stuart M. Brown

38

Phylogenetics

• Evolutionary theory states that groups of similar organisms are descended from a common ancestor.
• Phylogenetic systematics (cladistics) is a method of taxonomic classification that groups organisms based on their evolutionary history.
• It was developed by Willi Hennig, a German entomologist, in 1950.

Page 39: Stuart M. Brown

39

Cladistics and Phenetics

• Cladistic approach: Trees are drawn based on the conserved characters
• Phenetic approach: Trees are based on some measure of distance between the leaves
• Molecular phylogenies are inferred from molecular (usually sequence) data
  – either cladistic (e.g. gene order) or phenetic

Page 40: Stuart M. Brown

40

Cladistic Methods

• Evolutionary relationships are documented by creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences.
• Cladistic methods construct a tree (cladogram) by considering the various possible pathways of evolution and choosing from among these the best possible tree.
• A phylogram is a tree with branch lengths that are proportional to evolutionary distances.

Page 41: Stuart M. Brown

41

Page 42: Stuart M. Brown

42

Algorithm classes used to infer phylogeny from sequence

• Distance methods
• Parsimony
• Likelihood
• Probabilistic methods

Page 43: Stuart M. Brown

43

Molecular Evolution

• Phylogenetics often makes use of numerical data (numerical taxonomy), which can be scores for various “character states”, such as the size of a visible structure, or DNA sequences.
• Similarities and differences between organisms can be coded as a set of characters, each with two or more alternative character states.
• In an alignment of DNA sequences, each position is a separate character, with four possible character states, the four nucleotides.

Page 44: Stuart M. Brown

44

DNA is a good tool for taxonomy

DNA sequences have many advantages over classical types of taxonomic characters:
– Character states can be scored unambiguously
– Large numbers of characters can be scored for each individual
– Information on both the extent and the nature of divergence between sequences is available (nucleotide substitutions, insertion/deletions, or genome rearrangements)

Page 45: Stuart M. Brown

45

A  aat tcg ctt cta gga atc tgc cta atc ctg
B  ... ..a ..g ..a .t. ... ... t.. ... ..a
C  ... ..a ..c ..c ... ..t ... ... ... t.a
D  ... ..a ..a ..g ..g ..t ... t.t ..t t..

Each nucleotide difference is a character
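In the alignment above, a dot means "same base as sequence A at this position". A minimal Python sketch for expanding that notation and counting the differing characters might look like the following; the names `expand` and `differing_sites` are illustrative helpers, not part of any phylogenetics package.

```python
REFERENCE = "aat tcg ctt cta gga atc tgc cta atc ctg"   # sequence A from the slide
DOTTED = {
    "B": "... ..a ..g ..a .t. ... ... t.. ... ..a",
    "C": "... ..a ..c ..c ... ..t ... ... ... t.a",
    "D": "... ..a ..a ..g ..g ..t ... t.t ..t t..",
}


def expand(reference, dotted):
    """Replace each '.' with the reference base at the same position."""
    return "".join(r if d == "." else d for r, d in zip(reference, dotted))


def differing_sites(seq1, seq2):
    """0-based positions (spaces removed) where two aligned sequences differ."""
    a, b = seq1.replace(" ", ""), seq2.replace(" ", "")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


alignment = {"A": REFERENCE}
alignment.update({name: expand(REFERENCE, row) for name, row in DOTTED.items()})

for name, seq in alignment.items():
    # Each differing site is one character that separates this sequence from A.
    print(name, seq, len(differing_sites(REFERENCE, seq)))
```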

Page 46: Stuart M. Brown

46

Sequences Reflect Relationships

After working with sequences for a while, one develops an intuitive understanding that for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. These differences can be quantified.

Given a set of gene sequences, it should be possible to reconstruct the evolutionary relationships among genes and among organisms.

Page 47: Stuart M. Brown

47

Page 48: Stuart M. Brown

48

What Sequences to Study?

• Different sequences accumulate changes at different rates; choose a level of variation that is appropriate to the group of organisms being studied.
  – Proteins (or protein coding DNAs) are constrained by natural selection, so they are better for very distant relationships
  – Some sequences are highly variable (rRNA spacer regions, immunoglobulin genes), while others are highly conserved (actin, rRNA coding regions)
  – Different regions within a single gene can evolve at different rates (conserved vs. variable domains)

Page 49: Stuart M. Brown

49

Orthologs vs. Paralogs

• When comparing gene sequences, it is important to distinguish between identical vs. merely similar genes in different organisms.
• Orthologs are homologous genes in different species with analogous functions.
• Paralogs are similar genes that are the result of a gene duplication.
  – A phylogeny that includes both orthologs and paralogs is likely to be incorrect.
  – Sometimes phylogenetic analysis is the best way to determine if a new gene is an ortholog or paralog to other known genes.

Page 50: Stuart M. Brown

50

[Figure: an ancestral gene (globin) undergoes duplication to give genes A (hemoglobin) and B (myoglobin); after speciation, mouse and human each carry a copy of both genes (labeled A1, B1, A2, B2).]

Page 51: Stuart M. Brown

51

Disclaimers

Before describing any theoretical or practical aspects of phylogenetics, it is necessary to give some disclaimers. This area of computational biology is an intellectual minefield!

Neither the theory nor the practical applications of any algorithms are universally accepted throughout the scientific community.

The application of different software packages to a data set is very likely to give different answers; minor changes to a data set are also likely to profoundly change the result.

Page 52: Stuart M. Brown

52

Page 53: Stuart M. Brown

53

A modern revision of the seals and sea lions

Page 54: Stuart M. Brown

54

Genes vs. Species

• Relationships calculated from sequence data represent the relationships between genes; these are not necessarily the same as the relationships between species.
• Your sequence data may not have the same phylogenetic history as the species from which they were isolated.
• Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (hybridization, vector mediated DNA movement, or direct uptake of DNA).

Page 55: Stuart M. Brown

55

Cladistic vs. Phenetic

Within the field of taxonomy there are two different methods and philosophies of building phylogenetic trees: cladistic and phenetic.

• Phenetic methods construct trees (phenograms) by considering the current states of characters without regard to the evolutionary history that brought the species to their current phenotypes.
  – Remember that phenotype is the outward, physical manifestation of the organism, and genotype is the internally coded, inheritable information.
• Cladistic methods rely on assumptions about ancestral relationships as well as on current data.
  – A clad, or clade, is a branch of a phylogenetic tree.

Page 56: Stuart M. Brown

56

Darwin was a Cladist

“The natural system based on descent with modification … the characters that naturalists consider as showing true affinity are those which have been inherited from a common parent, and in so far as all true classification is genealogical; that community of descent is the common bond that naturalists have been seeking.”

- Charles Darwin, Origin of Species, 1859

Page 57: Stuart M. Brown

57

Phenetic Methods

• Computer algorithms based on the phenetic model rely on Distance Methods to build trees from sequence data.
• Phenetic methods count each base of sequence difference equally, so a single event that creates a large change in sequence (insertion/deletion or recombination) will move two sequences far apart on the final tree.
• Phenetic approaches generally lead to faster algorithms and they often have nicer statistical properties for molecular data.
• The phenetic approach is popular with molecular evolutionists because it relies heavily on objective character data (such as sequences) and it requires relatively few assumptions.

Page 58: Stuart M. Brown

58

Distance Measurements

• It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals.
• The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances.
• In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix.
• Distance methods give a single measurement of the amount of evolutionary change between two sequences since divergence from a common ancestor.

Page 59: Stuart M. Brown

59

Distance methods

Calculate the distance CORRECTING FOR MULTIPLE HITS

The Distance Matrix

7
          Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
Rat       0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
Mouse     0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
Rabbit    0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
Human     0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
Opossum   0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
Chicken   0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
Frog      0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000

Page 60: Stuart M. Brown

60

Computing a Distance Matrix

Reading sequences...

gtr1_human: 548 total, 548 read
gtr2_human: 548 total, 548 read
gtr3_human: 548 total, 548 read
gtr4_human: 548 total, 548 read
gtr5_human: 548 total, 548 read

Computing distances using Kimura method...
1 x 2:  48.61    1 x 3:  45.50
1 x 4:  65.74    1 x 5: 107.70
2 x 3:  61.53    2 x 4:  74.57
2 x 5: 113.82    3 x 4:  68.93
3 x 5: 104.43    4 x 5: 110.86

Matrix 1
             1       2       3       4       5
____________________________________________________________
| 1 |     0.00   48.61   45.50   65.74  107.70
| 2 |             0.00   61.53   74.57  113.82
| 3 |                     0.00   68.93  104.43
| 4 |                             0.00  110.86
| 5 |                                     0.00

Page 61: Stuart M. Brown

61

DNA Distances

• Distances between pairs of DNA sequences are relatively simple to compute as the sum of all base pair differences between the two sequences.
  – this type of algorithm can only work for pairs of sequences that are similar enough to be aligned
• Generally all base changes are considered equal
• Insertion/deletions are generally given a larger weight than replacements (gap penalties).
• It is also possible to correct for multiple substitutions at a single site, which is common in distant relationships and for rapidly evolving sites.
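As a rough illustration of the simplest DNA distance, the sketch below counts mismatches between aligned sequences and divides by the alignment length (often called the p-distance), then fills a pairwise matrix. The four sequences are the expanded forms of the A to D alignment shown earlier in the deck; the function names are illustrative, gaps are not handled, and no gap penalties or multiple-hit corrections are applied.

```python
from itertools import combinations

# Sequences A-D from the earlier alignment slide, written out in full.
SEQS = {
    "A": "aat tcg ctt cta gga atc tgc cta atc ctg",
    "B": "aat tca ctg cta gta atc tgc tta atc cta",
    "C": "aat tca ctc ctc gga att tgc cta atc tta",
    "D": "aat tca cta ctg ggg att tgc ttt att ttg",
}


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    a, b = seq1.replace(" ", ""), seq2.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs):
    """Symmetric dict-of-dicts matrix of pairwise p-distances."""
    matrix = {name: {name: 0.0} for name in seqs}
    for a, b in combinations(seqs, 2):
        matrix[a][b] = matrix[b][a] = p_distance(seqs[a], seqs[b])
    return matrix


for name, row in distance_matrix(SEQS).items():
    print(name, " ".join(f"{row[other]:.3f}" for other in SEQS))
```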

Page 62: Stuart M. Brown

62

Page 63: Stuart M. Brown

63

Correction for multiple hits

• Only differences can be observed directly, not distances
• All distance methods rely (crucially) on this
• A great many models are used for nucleotide sequences (e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
• aa sequences are infinitely more complicated!
• Can take account of different rates of evolution at sites (e.g. gamma distribution)
• Accuracy falls off drastically for highly divergent sequences
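Two of the models named above have simple closed forms that turn observed differences into estimated distances. The sketch below uses the standard Jukes-Cantor (JC) formula d = -(3/4) ln(1 - 4p/3) and the Kimura two-parameter (K2P) formula d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where p is the proportion of differing sites and P and Q are the proportions of transition and transversion differences. The helper names are illustrative, and gaps or ambiguous bases are not handled.

```python
import math

PURINES = {"a", "g"}


def proportions(seq1, seq2):
    """Return (P, Q): proportions of transition and transversion differences."""
    a, b = seq1.replace(" ", "").lower(), seq2.replace(" ", "").lower()
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1       # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1     # purine<->pyrimidine
    return transitions / len(a), transversions / len(a)


def jukes_cantor(p):
    """JC distance from the observed proportion of differing sites (needs p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """K2P distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


seq_a = "aat tcg ctt cta gga atc tgc cta atc ctg"
seq_d = "aat tca cta ctg ggg att tgc ttt att ttg"
P, Q = proportions(seq_a, seq_d)
print("observed p:", P + Q)
print("JC distance:", jukes_cantor(P + Q))
print("K2P distance:", kimura_2p(P, Q))
```

Both corrected distances come out larger than the raw proportion of differences, and the gap widens as sequences diverge, which is one way to see why accuracy degrades for highly divergent sequences.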

Page 64: Stuart M. Brown

64

Amino Acid Distances

• Distances between amino acid sequences are a bit more complicated to calculate.
• Some amino acids can replace one another with relatively little effect on the structure and function of the final protein while other replacements can be functionally devastating.
• From the standpoint of the genetic code, some amino acid changes can be made by a single DNA mutation while others require two or even three changes in the DNA sequence.
• In practice, what has been done is to calculate tables of frequencies of all amino acid replacements within families of related protein sequences in the databanks, e.g. PAM and BLOSUM.

Page 65: Stuart M. Brown

65

The PAM 250 scoring matrix

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2
R  -2   6
N   0   0   2
D   0  -1   2   4
C  -2  -4   4  -5   4
Q   0   1   1   2  -5   4
E   0  -1   1   3  -5   2   4
G   1  -3   0   1  -3  -1   0   5
H  -1   2   2   1  -3   3   1  -2   6
I  -1  -2  -2  -2  -2  -2  -2  -3  -2   5
L  -2  -3  -3  -4  -6  -2  -3  -4  -2   2   6
K  -1   3   1   0  -5   1   0  -2   0  -2  -3   5
M  -1   0  -2  -3  -5  -1  -2  -3  -2   2   4   0   6
F  -4  -4  -4  -6  -4  -5  -5  -5  -2   1   2  -5   0   9
P   1   0  -1  -1  -3   0  -1  -1   0  -2  -3  -1  -2  -5   6
S   1   0   1   0   0  -1   0   1  -1  -1  -3   0  -2  -3   1   3
T   1  -1   0   0  -2  -1   0   0  -1   0  -2   0  -1  -2   0   1   3
W  -6   2  -4  -7  -8  -5  -7  -7  -3  -5  -2  -3  -4   0  -6  -2  -5  17
Y  -3  -4  -2  -4   0  -4  -4  -5   0  -1  -1  -4  -2   7  -5  -3  -3   0  10
V   0  -2  -2  -2  -2  -2  -2  -1  -2   4   2  -2   2  -1  -1  -1   0  -6  -2   4

Dayhoff, M, Schwartz, RM, Orcutt, BC (1978) A model of evolutionary change in proteins. in Atlas of Protein Sequence and Structure, vol 5, sup. 3, pp 345-352. M. Dayhoff ed., National Biomedical Research Foundation, Silver Spring, MD.

Page 66: Stuart M. Brown

66

Clustering Algorithms

• Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences.
  – Start with a matrix of pairwise distances
  – Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa.

Page 67: Stuart M. Brown

67

Minimum Evolution

• The total length of all branches in the tree should be a minimum
• It has been shown that the minimum evolution tree is expected to be the true tree, provided branch lengths are corrected for multiple hits

Page 68: Stuart M. Brown

68

UPGMA

• The simplest of the distance methods is the UPGMA (Unweighted Pair Group Method using Arithmetic averages)
• The PHYLIP programs DNADIST and PROTDIST calculate absolute pairwise distances between a group of sequences. Then the GCG program GROWTREE uses UPGMA to build a tree.
• Many multiple alignment programs such as PILEUP use a variant of UPGMA to create a dendrogram of DNA sequences, which is then used to guide the multiple alignment algorithm.
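A minimal sketch of UPGMA clustering follows, applied to the seven-species distance matrix shown earlier in the deck. It is a textbook-style illustration, not the code used by GROWTREE or PILEUP: it returns only the tree topology as nested tuples, omits branch heights, and the function name `upgma` is an assumption for this example.

```python
NAMES = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
MATRIX = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]


def upgma(names, matrix):
    """Cluster taxa by repeatedly merging the closest pair; returns nested tuples."""
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                        # cluster id -> leaf count
    dist = {(i, j): matrix[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        (i, j), _ = min(dist.items(), key=lambda item: item[1])
        new_size = sizes[i] + sizes[j]
        # Distance from the merged cluster to each remaining cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        for k in trees:
            if k in (i, j):
                continue
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            dist[(k, next_id)] = (sizes[i] * d_ik + sizes[j] * d_jk) / new_size
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        merged = (trees[i], trees[j])
        for old in (i, j):
            del trees[old], sizes[old]
        trees[next_id], sizes[next_id] = merged, new_size
        next_id += 1
    return next(iter(trees.values()))


print(upgma(NAMES, MATRIX))
```

Because UPGMA implicitly assumes the same rate of change on every branch, the grouping it produces for slowly and quickly evolving lineages should be treated with caution.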

Page 69: Stuart M. Brown

69

Neighbor Joining

• The Neighbor Joining method is the most popular way to build trees from distance measurements (Saitou and Nei 1987, Mol. Biol. Evol. 4:406)
  – Neighbor Joining corrects the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch of a tree.
  – The distance matrix is adjusted for differences in the rate of evolution of each taxon (branch).
  – Neighbor Joining has given the best results in simulation studies and it is the most computationally efficient of the distance algorithms (N. Saitou and T. Imanishi, Mol. Biol. Evol. 6:514 (1989))
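The rate adjustment can be sketched for a single Neighbor Joining step. In the commonly taught formulation, each pair (i, j) is scored with Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), and the pair with the smallest Q is joined. The code below performs only that first step; the function name `nj_pick_pair` and the five-taxon matrix are made-up illustration values, not data from the slides.

```python
def nj_pick_pair(matrix):
    """Return (i, j, branch_i, branch_j) for the first pair Neighbor Joining would join."""
    n = len(matrix)
    totals = [sum(row) for row in matrix]       # summed distance of each taxon to all others
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            # Subtracting the totals corrects for taxa that are far from everything,
            # which is how unequal rates of evolution are taken into account.
            q = (n - 2) * matrix[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from the two joined taxa to their new internal node.
    branch_i = matrix[i][j] / 2 + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = matrix[i][j] - branch_i
    return i, j, branch_i, branch_j


# Hypothetical distances between five taxa a-e.
EXAMPLE = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(EXAMPLE))
```

Note that the pair joined first is not necessarily the pair with the smallest raw distance. A full implementation would replace the joined pair with the new node, recompute distances, and repeat until the tree is resolved.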

Page 70: Stuart M. Brown

70

Neighbour Joining

[Figure: neighbour-joining diagram with eight numbered taxa (1-8)]

Page 71: Stuart M. Brown

71

Cladistic Methods

• For character data about the physical traits of organisms (such as morphology of organs etc.) and for deeper levels of taxonomy, the cladistic approach is almost certainly superior.
• Cladistic methods are often difficult to implement with molecular data because the assumptions are generally not all satisfied.

Page 72: Stuart M. Brown

72

Cladistic methods

• Cladistic methods are based on the assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers).
• These methods work best if a specific tree, or at least an ancestral sequence, is already known so that comparisons can be made between a finite number of alternate trees rather than calculating all possible trees for a given set of sequences.

Page 73: Stuart M. Brown

73

Parsimony

• Parsimony is the most popular method for reconstructing ancestral relationships.
  – The name derives from “parsimonious”, used to mean the least number (stingiest)
• Parsimony allows the use of all known evolutionary information in building a tree
  – In contrast, distance methods compress all of the differences between pairs of sequences into a single number

Page 74: Stuart M. Brown

74

Building Trees with Parsimony

• Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes that are needed to explain the observed data.
• The best tree is the one that requires the fewest base changes for all sequences to derive from a common ancestor.

Page 75: Stuart M. Brown

75

Building Trees with Parsimony

• Check each topology
• Count the minimum number of changes required to explain the data
• Choose the tree with the smallest number of changes
• Usually performs well with closely related sequences, but often performs badly with very distantly related sequences
• With distantly related sequences homoplasy (similarity due to convergent evolution, but independent origins) becomes a major problem

Page 76: Stuart M. Brown

76

Parsimony Example

• Consider four sequences: ATCG, TTCG, ATCC, and TCCG
• Imagine a tree that branches at the first position, grouping ATCG and ATCC on one branch, TTCG and TCCG on the other branch.
• Then each branch splits, for a total of 3 nodes on the tree (Tree #1)

Page 77: Stuart M. Brown

77

Tree #1

Tree #2

Compare Tree #1 with one that first divides ATCC onto its own branch, then splits off ATCG, and finally divides TTCG from TCCG (Tree #2).

Trees #1 and #2 both have three nodes, but when all of the distances back to the root (# of nodes crossed) are summed, the total is equal to 8 for Tree #1 and 9 for Tree #2.
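For comparison, the sketch below scores the same four sequences with Fitch's algorithm, a standard way to count the minimum number of base changes a tree requires. This is a different count from the node-depth sum used in the slide, and the tree encodings and function names are assumptions for this example. Under Fitch counting, Tree #1 and Tree #2 tie, because they differ only in where the root is placed, while an alternative grouping of ATCG with TTCG needs one more change.

```python
def fitch_site(tree, site):
    """Return (possible ancestral states, minimum changes) for one alignment column.

    A tree is either a leaf (the sequence string itself) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_states, left_changes = fitch_site(tree[0], site)
    right_states, right_changes = fitch_site(tree[1], site)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: at least one substitution is needed at this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, length):
    """Total minimum number of changes over all alignment columns."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


TREE_1 = (("ATCG", "ATCC"), ("TTCG", "TCCG"))
TREE_2 = ("ATCC", ("ATCG", ("TTCG", "TCCG")))
TREE_3 = (("ATCG", "TTCG"), ("ATCC", "TCCG"))   # alternative grouping, not in the slides

for label, tree in [("Tree #1", TREE_1), ("Tree #2", TREE_2), ("Alternative", TREE_3)]:
    print(label, parsimony_score(tree, 4))
```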

Page 78: Stuart M. Brown

78

Maximum Likelihood

• Requires a model of evolution
• Each substitution has an associated likelihood given a branch of a certain length
• A function is derived to represent the likelihood of the data given the tree, branch lengths and additional parameters
• The likelihood function is maximized
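The smallest possible case, two sequences joined by a single branch, already shows the idea. The sketch below assumes the Jukes-Cantor model, writes down the log-likelihood of the observed data as a function of branch length, and searches a grid for the branch length with the highest likelihood (equivalently, the lowest negative log-likelihood). Function names, the grid step, and the example counts are illustrative assumptions.

```python
import math


def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of n_diff differing sites out of n_sites under Jukes-Cantor,

    for two sequences separated by d expected substitutions per site.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay        # probability a site ends in the same base
    p_other = 0.25 - 0.25 * decay       # probability of one specific different base
    return (n_sites * math.log(0.25)    # equal base frequencies at the starting sequence
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_other))


def best_branch_length(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid search for the branch length that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_sites, n_diff))


# Example: 9 differences observed across 30 aligned sites.
print(best_branch_length(n_sites=30, n_diff=9))
```

For this two-sequence case the grid optimum agrees with the closed-form Jukes-Cantor distance; for real trees, the same kind of search has to be carried out over every branch length and over tree topologies, which is why the method is so slow.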

Page 79: Stuart M. Brown

79

Maximum Likelihood

• The method of Maximum Likelihood attempts to reconstruct a phylogeny using an explicit model of evolution.
• This method works best when it is used to test (or improve) an existing tree.
• Even with simple models of evolutionary change, the computational task is enormous, making this the slowest of all phylogenetic methods.

Page 80: Stuart M. Brown

80

Models can be made more parameter rich to increase their realism

• The most common additional parameters are:
  – A correction to allow different substitution rates for each type of nucleotide change
  – A correction for the proportion of sites which are unable to change
  – A correction for variable site rates at those sites which can change
• The values of the additional parameters will be estimated in the process

Page 81: Stuart M. Brown

81

Ancestral Sequences

• Maximum likelihood predicts ancestral sequences
  – at branch points in the tree (nodes)
  – can provide information about the timing of the acquisition of a novel trait or mutation
• PAML (Phylogenetic Analysis by Maximum Likelihood)
  – Confidence intervals provided
  – Selection can be inferred

Page 82: Stuart M. Brown

82

Assumptions for Maximum Likelihood

• The model assumes frequencies of DNA transitions (C<->T, A<->G) and transversions (C or T<->A or G).
• The assumptions for protein sequence changes are taken from the PAM matrix, and are quite likely to be violated in “real” data.
• Since each nucleotide site is assumed to evolve independently, the likelihood is calculated separately for each site. The product of the likelihoods for each site provides the overall likelihood of the observed data.

Page 83: Stuart M. Brown

83

The Molecular Clock

• For a given protein the rate of sequence evolution is approximately constant across lineages (Zuckerkandl and Pauling, 1965)
• This would allow speciation and duplication events to be dated accurately based on molecular data
• Local and approximate molecular clocks are more reasonable
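Under a strict clock, dating reduces to simple arithmetic: if two lineages each accumulate substitutions at rate r per site per year, a pairwise distance d corresponds to a split roughly d / (2r) years ago. The sketch below only illustrates that arithmetic; the function name and the numbers are hypothetical, not measured rates.

```python
def divergence_time(distance, rate_per_site_per_year):
    """Years since two lineages split, assuming a strict molecular clock.

    distance: substitutions per site separating the two sequences.
    The factor of 2 is there because both lineages have been changing since the split.
    """
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical values: 0.02 substitutions per site at 1e-9 per site per year.
print(divergence_time(0.02, 1e-9))
```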

Page 84: Stuart M. Brown

84

Rooting the Tree

• In an unrooted tree the direction of evolution is unknown
• The root is the hypothesized ancestor of the sequences in the tree
• The root can either be placed on a branch or at a node
• You should start by viewing an unrooted tree

Page 85: Stuart M. Brown

85

Page 86: Stuart M. Brown

86

Page 87: Stuart M. Brown

87

Rooting Using an Outgroup

• The outgroup should be a sequence (or set of sequences) known to be less closely related to the rest of the sequences than they are to each other
• It should ideally be as closely related as possible to the rest of the sequences while still satisfying the first condition
• The root must be somewhere between the outgroup and the rest (either on the node or in a branch)

Page 88: Stuart M. Brown

88

Are there Correct trees?

• Despite all of these caveats, it is actually quite simple to use computer programs to calculate phylogenetic trees for data sets.
• Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, etc., can the true, correct tree be found and proven to be scientifically valid?
• Unfortunately, it is impossible to ever conclusively state what is the "true" tree for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data is gathered.

Page 89: Stuart M. Brown

89

Is my tree correct?

Bootstrap values

• Bootstrapping is a statistical technique that can use random re-sampling of data to determine sampling error for tree topologies
• Leave-one-out methods
  – (leave out a row, not a species)
• Agreement among the resulting trees is summarized with a majority-rule consensus tree
• Each branch of the tree is labelled with the % of bootstrap trees where it occurred.
• 80% is good, less than 50% is bad
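The resampling step can be sketched as follows, assuming the common form of the bootstrap in which alignment columns are drawn with replacement. To stay short, the sketch replaces the tree-building step with a much cruder question, "which two sequences are closest in this replicate?", so the percentage it reports is only a stand-in for real branch support. All function names and the default of 200 replicates are assumptions.

```python
import random


def bootstrap_alignment(seqs, rng):
    """Resample alignment columns with replacement; all sequences share the same draw."""
    length = len(next(iter(seqs.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in seqs.items()}


def closest_pair(seqs):
    """Pair of sequence names with the fewest differences (simplified stand-in for a tree)."""
    names = sorted(seqs)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    return min(pairs, key=lambda p: sum(x != y for x, y in zip(seqs[p[0]], seqs[p[1]])))


def support(seqs, pair, replicates=200, seed=0):
    """Percentage of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_alignment(seqs, rng)) == pair
               for _ in range(replicates))
    return 100.0 * hits / replicates


SEQS = {
    "A": "aattcgcttctaggaatctgcctaatcctg",
    "B": "aattcactgctagtaatctgcttaatccta",
    "C": "aattcactcctcggaatttgcctaatctta",
    "D": "aattcactactggggatttgctttattttg",
}
print(support(SEQS, closest_pair(SEQS)))
```

In a real analysis each replicate alignment would be passed through the full tree-building method, and support would be tallied per branch of the consensus tree.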

Page 90: Stuart M. Brown

90

Non-Synonymous Substitutions

• There is MORE information hidden in alignments
• For each DNA substitution, we can observe if it changes the corresponding amino acid
  – due to the redundancy of the genetic code, a SYNONYMOUS (Ks) substitution does not change the AA
  – a NON-SYNONYMOUS (Ka) substitution changes the AA at that codon
• [Need to correct the # of observed Ka and Ks for the possible number of each kind of changes that could occur in each codon]
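A first, uncorrected look at this can be sketched by translating aligned codons with the standard genetic code and sorting the differing codons into synonymous and non-synonymous. The sketch counts codons rather than individual substitutions and does not apply the normalization mentioned in the bracketed note, so it does not yield Ka or Ks themselves; the function name is illustrative and the sequences are A and B from the earlier alignment slide.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def count_codon_changes(seq1, seq2):
    """Return (synonymous, non_synonymous) counts of differing codons.

    Assumes both sequences are aligned, in frame, and free of gaps.
    """
    a = seq1.replace(" ", "").upper()
    b = seq2.replace(" ", "").upper()
    synonymous = non_synonymous = 0
    for i in range(0, min(len(a), len(b)) - 2, 3):
        codon1, codon2 = a[i:i + 3], b[i:i + 3]
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1        # same amino acid despite the DNA change
        else:
            non_synonymous += 1    # amino acid replaced
    return synonymous, non_synonymous


seq_a = "aat tcg ctt cta gga atc tgc cta atc ctg"
seq_b = "aat tca ctg cta gta atc tgc tta atc cta"
print(count_codon_changes(seq_a, seq_b))
```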

Page 91: Stuart M. Brown

91

Ka/Ks

• Neutral mutations will change all bases at an equal rate, so Ka/Ks = 1
• Conserved sequences will have Ka/Ks < 1 [this is true for the vast majority of protein coding sequences]
• Ka/Ks > 1 is a signature of positive selection (AA changes occur at a faster rate than expected by chance)
  – discovery of a gene under positive selection by Ka/Ks > 1 is a very big deal

[The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Nekrutenko A, Makova KD, Li WH. Genome Res. 2002 Jan;12(1):198-202.]

Page 92: Stuart M. Brown

92

Ka/Ks varies within a gene

Page 93: Stuart M. Brown

93

Computer Software for Phylogenetics

Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.

– PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data. Command line only, hard to use. (Several free web servers provide a functional user interface)
– CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining. Very easy to use, but NJ may not always be the best method to handle your data.

Page 94: Stuart M. Brown

94

Other useful software

• Mega - (free, Windows only) alignment, build trees, estimate rates of evolution
• Mesquite - (free, Mac & Win) advanced analysis of trees created by other programs
• Phylowin - (free, Mac & Win) builds trees from a distance matrix (NJ, parsimony, max likelihood)
• PAUP - (Commercial, Mac & Win)
  – sophisticated, but fairly easy to use
  – Includes NJ, Parsimony, and Max. Likelihood
  – Also does bootstrapping
• Phylodendron - (web) redraw trees

Page 95: Stuart M. Brown

95

Other Web Resources

• Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at:
  http://evolution.genetics.washington.edu/phylip/software.html
• Introduction to Phylogenetic Systematics, Peter H. Weston & Michael D. Crisp, Society of Australian Systematic Biologists
  http://www.science.uts.edu.au/sasb/WestonCrisp.html
• University of California, Berkeley Museum of Paleontology (UCMP)
  http://www.ucmp.berkeley.edu/clad/clad4.html

Page 96: Stuart M. Brown

96

Software Hazards

• There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
• Moving sequences into different programs can be a major hassle due to incompatible file formats.
• Just because a program can perform a given computation on a set of data does not mean that that is the appropriate algorithm for that type of data.

Page 97: Stuart M. Brown

97

Molecular Phylogeny Conclusions

• Given the huge variety of methods for computing phylogenies, how can the biologist determine what is the best method for analyzing a given data set?
  – Published papers that address phylogenetic issues generally make use of several different algorithms and data sets in order to support their conclusions.
  – In some cases different methods of analysis can work synergistically
  – Neighbor Joining methods generally produce just one tree, which can help to validate a tree built with the parsimony or maximum likelihood method
  – Using several alternate methods can give an indication of the robustness of a given conclusion.

Page 98: Stuart M. Brown

98

Recall What is Life?

• State of functional activity and continual change, before death (defined complementarily as end-of-life).
• Characterized by the capability to:
  • Reproduce itself,
  • Adapt to an environment in a quest for survival, and
  • Take actions independent of exterior agents.

Page 99: Stuart M. Brown

99

Nature as a special case of Life

• The biology of nature has so far been the scientific study of life on Earth based on carbon-chain chemistry.

• However, nothing restricts the study of the properties of life to carbon-chain chemistry; it is merely the only form of life so far available for study.

• Further motivation to study life as a generic concept comes from the hypothesis that we are perhaps just one possible combination of atoms that makes life possible. We haven’t met other examples (aliens).

Page 100: Stuart M. Brown

100

…which brings us to Artificial-Life

• The lack of any available non-carbon-based life forms motivates us to create an artificial environment and a set of rules for life to evolve.

• Artificial Life (ALife or AL) is the study of non-organic organisms, beyond the creations of nature, that possess the essential properties of life as we understand it, and whose environment is artificially created in an alternative medium, which very often is a logical device like the computer.

Page 101: Stuart M. Brown

101

ALife as a Synthesis approach

• Rather than being an analytical study of “natural” life, A-Life is a synthesis approach to studying any form of life.

• We have:

– An artificially created environment (usually) within computers,

– A fairly universal set of rules and properties of life, derived from the one example we have of life: natural life.

Page 102: Stuart M. Brown

102

So what is the motivation?

• A-Life could have been dubbed yet another approach to studying intelligent life, had it not been for the emergent properties of life, which motivate scientists to explore the possibility of artificially creating life and expecting the unexpected.

• Recall that an emergent property arises when something becomes more than the sum of its parts. For example, half a human cannot function without the other half, but together the two halves are capable of very complex behavior (not a representative example).

Page 103: Stuart M. Brown

103

So where does A-Life fit in?

• The A-Life concept helps to:

– Study existing natural life forms by trying to simulate the generic rules they follow, the environmental parameters (such as entropy/chaos), and the seed, i.e. the initial set of elements to which the rules of life apply under the given environmental conditions, in order to understand evolution in nature.

– Create new life within the digital world by creating a new set of external parameters, seeds, and rules of evolution, and letting life find a way.

Page 104: Stuart M. Brown

104

So is A-Life = AI?

Artificial Life | Artificial Intelligence

Concept: late 1980s | Concept: 1960s

Grounded in Biology, Physics, Chemistry, Mathematics | Pursued primarily in Comp. Sci., Engineering & Psychology

Studies intelligence as part of life itself | Studies intelligent behavior in isolation

Bottom-up approach: study synthesis | Top-down approach: focus is on results

Views life-as-it-could-be | Views life-as-it-is

Both seem to approach similar problems, but…

Page 105: Stuart M. Brown

105

A-Life: Emergence

• What you get when something is more than the sum of its parts.

• Human thought relies on nearly all the cells that make up the brain. Single cells are incapable of thought; thought is an emergent property of these cells coming together and interacting to give complex results. This is the motivation behind cellular automata (CA) and neural networks (NN).

• Extreme example: Earth as one living thing, with the whole of nature in dynamic equilibrium, each part having a bearing on the others.

Page 106: Stuart M. Brown

106

A-Life: Entropy

• Second Law of Thermodynamics: when two systems are joined together, the entropy (or chaos) of the combined system is at least as great as the sum of the entropies of the individual systems.

• This roughly applies to all systems, including those that exchange information.

• Life is all about fighting against entropy: as other systems lose information to their surroundings, life not only keeps hold of its information but also increases it.

Page 107: Stuart M. Brown

107

A-Life: Complexity

• Life is a complex system: a dynamic system that can keep changing and evolving over a long period of time without dying.

• If the amount of information exchange in a system is varied from low to high, it gives Fixed, Periodic, and Chaotic systems, in that order. Somewhere in between, a system exhibits complex behavior.

• Accordingly, each unit in a system either dies, freezes, pulsates, or behaves in a complex manner.

Fixed | No change, no death

Periodic | Change, no evolution, no death

Chaotic | Change, evolution, death

Complex | Change, evolution, no death
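The four kinds of behavior can be seen, as a loose analogy, in one-dimensional two-state cellular automata. The Python sketch below uses Wolfram's rule numbering (an assumption brought in for illustration, not taken from these slides): rule 4 leaves a single live cell frozen, rule 108 makes a block of three cells pulsate with period 2, rule 30 is a standard example of chaotic behavior, and rule 110 is a standard example of complex behavior. The function names and the row width are arbitrary choices.

```python
def step(cells, rule):
    """Apply one synchronous update of a 1-D, 2-state, nearest-neighbour CA.

    `rule` is a Wolfram rule number (0-255); the row wraps around at its ends.
    """
    n = len(cells)
    new = []
    for i in range(n):
        left, centre, right = cells[i - 1], cells[i], cells[(i + 1) % n]
        index = 4 * left + 2 * centre + right  # neighbourhood as a 3-bit number
        new.append((rule >> index) & 1)        # look up that bit of the rule
    return new


def run(rule, cells, steps):
    """Return the rows produced by `steps` updates, starting row included."""
    rows = [list(cells)]
    for _ in range(steps):
        rows.append(step(rows[-1], rule))
    return rows


def show(rows):
    """Print each row with '#' for a live cell and '.' for a dead one."""
    for row in rows:
        print("".join("#" if c else "." for c in row))


if __name__ == "__main__":
    width = 31
    single = [0] * width
    single[width // 2] = 1          # one live cell in the middle
    triple = [0] * width
    triple[14:17] = [1, 1, 1]       # three adjacent live cells
    for rule, start in [(4, single), (108, triple), (30, single), (110, single)]:
        print(f"rule {rule}")
        show(run(rule, start, 12))
```

Running it prints a small space-time diagram for each rule, with time going down the page, so the frozen, pulsating, and irregular patterns can be compared side by side.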

Page 108: Stuart M. Brown

108

A-Life: Chaos Theory

• Chaos theory explains apparent randomness: many apparently random events are not truly random. They are just the iteration of simple rules on existing states (and possibly previous states), generating complex behavior. They live on the edge of total chaos.

• Most natural processes are chaotic: sea, wind.

• Some man-made processes are chaotic: financial markets.

• Lack of knowledge of all the rules, inputs, and the seed prevents us from determining the exact state of such a system at a given point, but knowledge of some of the dominant rules and inputs makes it possible to predict the general behavior of the system.

• This lack of knowledge of all the parameters leads us to perceive the system’s behavior as random.
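A standard textbook illustration of this point is the logistic map, x → r·x·(1 − x), which is not mentioned in these slides and is used here only as an assumed example. The Python sketch below iterates the same simple rule from two seeds that differ by one part in ten billion and prints how far apart the two orbits are; the function name `logistic_orbit` and the chosen seeds are illustrative.

```python
def logistic_orbit(x0, r=4.0, steps=50):
    """Iterate the logistic map x -> r * x * (1 - x) starting from seed x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


if __name__ == "__main__":
    a = logistic_orbit(0.2)
    b = logistic_orbit(0.2 + 1e-10)   # almost, but not exactly, the same seed
    for n in (0, 10, 20, 30, 40, 50):
        print(n, abs(a[n] - b[n]))
```

The rule is fully deterministic, yet the tiny uncertainty in the seed grows until the two orbits are unrelated. Knowing the rule still tells us the general behavior (the values stay between 0 and 1), but not the exact state at a late step.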

Page 109: Stuart M. Brown

109

A-Life: Current research areas

• Mathematical, philosophical, and biological foundations; social and ethical implications of A-Life

• Cellular Automata

• Neural Networks

• Genetic Algorithms

• Origin, Self-organization, Repair and Replication

• Evolutionary / Adaptive Dynamics

• Autonomous, Adaptive and Evolving Robots

• Software Agents (good/evil)

• Emergent Collective Behaviors, Swarms

• Synthetic/Artificial Chemistry/Biology/Materials

• Applications: Finance, Economics, Gaming, MEMS, etc.

Page 110: Stuart M. Brown

110

ALife: Foundation/Implications

• Research on Foundation tries to answer questions about the motivation behind such a ground-breaking concept, using our existing knowledge base in math, chemistry, biology, philosophy of life, etc. The question is “How, why and where can the ALife approach succeed (or fail)?”

• Research on Implications tries to understand and explain how the extension of life as a generic concept impacts our understanding of the very basics of natural life, shattering (or possibly not affecting) many a belief about God, creation and destruction. The question here is “How does ALife fit in (if at all) to the present-day social setup of morals and ethics, often laid out by the various religious texts?”

Page 111: Stuart M. Brown

111

ALife: Cellular Automata

• Inspired by the way natural biological cells behave and interact with their neighboring cells by following rules set out by the DNA code in them.

• A cellular automaton (CA) is an N-dimensional array of ‘cells’ that interact with their neighboring cells according to a predetermined set of rules to generate actions, which in turn may trigger a new series of reactions in themselves or their neighbors.

• The best-known example is Conway’s Life, a 2-state 2-D CA with simple rules (see the rules below) applied to all cells simultaneously to create generations of cells from an initial pattern.

• Different initial patterns generate different behavioral patterns: some die away (unstable), some blink (periodic), and the rest show complex behavior by continuing to live and evolve.
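A minimal Python sketch of Conway's Life follows, using the standard rules: a live cell survives with 2 or 3 live neighbors, a dead cell comes alive with exactly 3, and every other cell is dead in the next generation. Storing the live cells as a set of (row, column) pairs on an unbounded grid, and the names `life_step` and `run`, are choices made for this illustration.

```python
from collections import Counter


def life_step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # For every cell next to a live cell, count its live neighbours.
    counts = Counter(
        (r + dr, c + dc)
        for r, c in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # Birth with exactly 3 neighbours; survival with 2 or 3.
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in live)}


def run(live, generations):
    """Apply life_step repeatedly and return the final set of live cells."""
    for _ in range(generations):
        live = life_step(live)
    return live


if __name__ == "__main__":
    pair = {(0, 0), (0, 1)}                            # dies away
    blinker = {(1, 0), (1, 1), (1, 2)}                 # blinks with period 2
    glider = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}  # keeps moving
    print("pair after 1 generation:", run(pair, 1))
    print("blinker after 1 generation:", sorted(run(blinker, 1)))
    print("blinker after 2 generations:", sorted(run(blinker, 2)))
    print("glider after 4 generations:", sorted(run(glider, 4)))
```

The three starting patterns show the behaviors listed above: the pair dies of isolation, the blinker alternates between a horizontal and a vertical bar, and the glider reappears every four generations shifted one cell diagonally.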

Conway’s Life: Rules

A living cell with 0-1 neighbors dies of isolation

A living cell with 4+ neighbors dies from overcrowding

All other cells are unaffected