Fast and Accurate Reconstruction of Evolutionary Trees: a Model-based Study

Ming-Yang Kao

Department of Computer Science, Northwestern University

Evanston, Illinois

U. S. A.

Perspectives

Use biology ideas to solve computer science problems

Use computer science tools to solve biology problems

[Figure: biology and computer science, with one direction marked "this talk"]

Use Biology to Solve CS Problems

• DNA Computing

• DNA Self-Assembly

• Genetic Algorithms

• Neural Networks

• Others

Use CS to Solve Biology Problems

• Bioinformatics or Computational Biology

data mining

(this talk)

• Related fields: computational neuroscience, computational ecology, medical informatics, … many more ...

Example Research Areas of Bioinformatics

• DNA sequencing

• DNA microarray analysis

• DNA self-assembly for nano-structures

• DNA word design

• RNA secondary structure prediction

• Protein sequencing (my talk #4)

• Proteomics

• Protein database search

• Protein sequence design (my talk #3)

• Protein landscape analysis

• Phylogeny reconstruction (this talk)

• Phylogeny comparison (my talk #1)

Evolutionary Trees

definition: a tree with distinct labels at leaves

leaf labels: species, organisms, DNAs, RNAs, proteins, features, etc.

ancestral species

bird plum peach

rice wheat

present-day species (Just a joke!)

Evolutionary Trees

leaf labels: DNA sequences

bird plum peach

rice wheat

AAGT CCAG CCAT

CGGG CGGC

(Just a joke!)

Problem Formulation

bird plum peach

rice wheat

AAGT CCAG CCAT

CGGG CGGC

Input: DNA sequences of present-day species

Output: the true evolutionary tree

Question: What is “true”? Need a model!

(Just a joke!)

A Fundamental Problem of Biology

Since the time of Charles Darwin,

Problem: reconstruct the evolutionary history of all known species.

Importance:

• intellectually fascinating

• practical benefits – medicine, food …

• Charles Robert Darwin: 1809-1882

• Origin of Species: 1859

Main Difficulties

• Availability of data

Hundreds of millions of species --- unlikely to be all available any time soon or ever.

But DNA sequences of more and more species are becoming available.

• Extracting information from data

focus of this talk

Today’s Technical Focus

bird plum peach

rice wheat

AAGT CCAG CCAT

CGGG CGGC

Input: DNA sequences of present-day species

Output: the true evolutionary tree

Question: What is “true”? Need a model!

Collaborators: Csuros & Kim

Main Result

An algorithm that constructs an evolutionary tree from biomolecular sequences

• Provable high accuracy

• Short sequence length

• Optimal running time

• Optimal memory space

Outline of Technical Discussion

1. Define the model of evolution.

2. Formulate the computational problem.

3. Discuss the theoretical performance of our algorithm.

4. Discuss the empirical performance.

5. Describe and analyze the algorithm.

6. Further research.

Outline of Technical Discussion (1)


Model of Evolution

Intuitions

ACGTACT

AGGAGAA

CAGGAGTTTTAA

Mutation occurs probabilistically.

1. edge length ~ time

2. edge length ~ mutation probability

3. edge length ~ dissimilarity (or distance)

AGTTCCT

Jukes-Cantor Model of Evolution (1)

Edge Mutation Probability

$0 < f \le p \le g < 3/4$

• No insertion or deletion.

• X = A with probability 1 - 0.6 = 0.4

• X = C, G, or T with probability 0.6/3 = 0.2
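The two bullets above describe one site crossing an edge with mutation probability 0.6. A minimal sketch of that step follows, assuming the parent character is A as in the example; the names `mutate_site`, `mutate_sequence`, and `site_distribution` are hypothetical and not from the talk.

```python
import random

ALPHABET = "ACGT"

def mutate_site(base, p, rng):
    """Jukes-Cantor step: keep `base` with probability 1 - p, otherwise
    switch to one of the three other characters, each with probability p/3."""
    if rng.random() < p:
        return rng.choice([c for c in ALPHABET if c != base])
    return base

def mutate_sequence(seq, p, rng):
    """Apply the same edge mutation probability independently at every site."""
    return "".join(mutate_site(b, p, rng) for b in seq)

def site_distribution(base, p):
    """Exact distribution of the child character for one site."""
    return {c: (1 - p if c == base else p / 3) for c in ALPHABET}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    print(site_distribution("A", 0.6))
    print(mutate_sequence("AGTTCAGG", 0.6, rng))
```

For parent character A and p = 0.6, `site_distribution` gives 0.4 for A and 0.2 for each of C, G, and T, matching the arithmetic on the slide.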

Independent Mutations along All Edges

i.i.d. mutations at every character

[Figure: example with sequences AGTT and CAGG and edge mutation probabilities 0.7 and 0.65]


Problem Formulation

[Figure: the true tree (not known to the algorithm), with example sequences CAGGT, GTACT, CGTGT, ATCGT and edge mutation probabilities 0.6, 0.7, 0.2, 0.5, 0.7, 0.1]

• Pick any sequence for the root (also unknown to the algorithm).

• Generate the other sequences.

Input: $S_1, S_2, \ldots, S_5$, but not the other sequences, nor the tree.

Output: the true tree (unrooted)
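The generation procedure can be sketched as follows. This is an illustration under stated assumptions: the tree `TREE`, its node names, and the helpers `mutate`, `generate`, and `leaves_only` are hypothetical, and the tree is smaller than the one in the figure.

```python
import random

ALPHABET = "ACGT"

def mutate(seq, p, rng):
    """Jukes-Cantor mutation of a whole sequence along one edge."""
    out = []
    for base in seq:
        if rng.random() < p:
            out.append(rng.choice([c for c in ALPHABET if c != base]))
        else:
            out.append(base)
    return "".join(out)

# Hypothetical model tree: node -> list of (child, edge mutation probability).
TREE = {
    "root": [("u", 0.2), ("S3", 0.5)],
    "u": [("S1", 0.6), ("S2", 0.7)],
}

def generate(tree, root, length, rng):
    """Pick a random root sequence, then mutate it edge by edge down the tree.
    Returns the sequences at all nodes, internal ones included."""
    seqs = {root: "".join(rng.choice(ALPHABET) for _ in range(length))}
    stack = [root]
    while stack:
        node = stack.pop()
        for child, p in tree.get(node, []):
            seqs[child] = mutate(seqs[node], p, rng)
            stack.append(child)
    return seqs

def leaves_only(tree, seqs):
    """Keep only the leaf sequences, which are all the algorithm gets to see."""
    return {name: s for name, s in seqs.items() if name not in tree}

if __name__ == "__main__":
    all_seqs = generate(TREE, "root", 5, random.Random(0))
    print(leaves_only(TREE, all_seqs))
```

Only the output of `leaves_only` corresponds to the input of the reconstruction problem; the root sequence, the internal sequences, and `TREE` itself stay hidden.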

Computational Objectives

Input: DNA sequences $S_1, S_2, \ldots, S_5$

Output: the true tree

Minimize:

• running time

• memory space

• probability of incorrect output

• sample size, i.e., length of the input sequences


Triplets

• A triplet is formed by three leaves.

• P is the center of XYZ: the node where the paths between X, Y, and Z meet.

G-depth of Triplet

$d_{XY}$ = # of edges between X and Y

$d_{XYZ} = \max\{d_{XY}, d_{YZ}, d_{ZX}\}$

[Figure: example triplet labeled 5, 8, 7]
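The g-depth of a triplet can be computed directly from edge counts. The sketch below assumes the tree is given as an adjacency dictionary; the tree `TREE` and the function names are hypothetical.

```python
from collections import deque

def edge_counts_from(tree, source):
    """Breadth-first search: number of edges from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in tree[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def triplet_g_depth(tree, x, y, z):
    """g-depth of triplet XYZ: the largest of the three pairwise edge counts."""
    from_x = edge_counts_from(tree, x)
    from_y = edge_counts_from(tree, y)
    return max(from_x[y], from_y[z], from_x[z])

# Hypothetical unrooted tree with leaves A..D and internal nodes u, v.
TREE = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}

if __name__ == "__main__":
    print(triplet_g_depth(TREE, "A", "B", "C"))
```

In this small tree, A and B are two edges apart and each is three edges from C, so the triplet ABC has g-depth 3.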

G-depth of a Tree

the smallest d such that the triplets of g-depth at most d cover the entire tree

g-depth = 4

the best case


g-depth = 2 log n

the worst case


• at most 2 log n

• can be O(1)

Our New Result (1)

Our New Result (2)

polynomial sample size

Our New Result (3)

provable high accuracy

Our New Result (4)

optimal time & space

Comparison with Previous Results

this talk


Experimental Study Design

• Step 1 -- Pick a model tree T.

• Step 2 -- Use T to generate sequences.

• Step 3 -- Use an algorithm to reconstruct a tree T’ from the sequences (without knowing T).

• Step 4 -- Compare T’ and T.

Wrong and Right Edges

true tree

reconstructed tree
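One way to score Step 4 is to compare the leaf bipartitions induced by internal edges. The sketch below assumes that an edge of the reconstructed tree counts as wrong when its bipartition does not occur in the true tree; the slides do not spell out the definition, so this is an assumption, and `leaf_splits` and `wrong_edge_fraction` are hypothetical names.

```python
def leaf_splits(tree):
    """Set of leaf bipartitions induced by the internal edges of an unrooted
    tree given as an adjacency dict (leaves are the nodes of degree 1)."""
    leaves = frozenset(n for n, nbrs in tree.items() if len(nbrs) == 1)
    splits = set()
    for u in tree:
        for v in tree[u]:
            if len(tree[u]) == 1 or len(tree[v]) == 1:
                continue  # edges touching a leaf give the same split in any tree
            # Collect the leaves reachable from v without crossing back to u.
            side, stack, seen = set(), [v], {u, v}
            while stack:
                node = stack.pop()
                if node in leaves:
                    side.add(node)
                for nxt in tree[node]:
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
            side = frozenset(side)
            splits.add(frozenset((side, leaves - side)))
    return splits

def wrong_edge_fraction(true_tree, reconstructed):
    """Fraction of internal edges of the reconstruction absent from the truth."""
    true_splits = leaf_splits(true_tree)
    rec_splits = leaf_splits(reconstructed)
    if not rec_splits:
        return 0.0
    return len(rec_splits - true_splits) / len(rec_splits)

# Hypothetical four-leaf trees: the truth groups AB|CD, the reconstruction AC|BD.
TRUE_TREE = {
    "A": ["u"], "B": ["u"], "C": ["v"], "D": ["v"],
    "u": ["A", "B", "v"], "v": ["C", "D", "u"],
}
RECONSTRUCTED = {
    "A": ["p"], "C": ["p"], "B": ["q"], "D": ["q"],
    "p": ["A", "C", "q"], "q": ["B", "D", "p"],
}
```

Because the comparison uses only leaf names, the internal nodes of the two trees may be labeled differently.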

Experiment #1

• the 135-taxon African-Eve tree (courtesy of Huson and Maddison)

• algorithms compared: HGT and bioNJ (Olivier Gascuel)

• parameters: sequence length and percentage of wrong edges

• edge mutation probabilities: between 0.47 and 0.088

• # of simulations = 20 per sequence length

• more experiments in progress

135-taxon African Eve Tree

Results of Experiment #1

Experiment #2

• a 1892-taxon tree of eukaryotes

• algorithms compared: HGT and bioNJ

• parameters: sequence length and percentage of wrong edges

• edge mutation probabilities: between 0.47 and 0.088

• # of simulations = 20 per sequence length

• more experiments in progress

• several variants of the basic HGT

Results of Experiment #2


Our New Result (4)

1. Describe the HGT algorithm.

2. Prove the sample size bound (and high probability for accuracy).

3. Prove the optimal time & space.

Outline of Technical Discussion (5/1)

Closeness and Distance of Two Leaves

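[Figure: leaves X and Y in an example with sequences AGTT and CAGG and edge mutation probabilities 0.7 and 0.65, together with the definitions of the closeness and the distance of X and Y]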

The larger the closeness, the more accurately we can estimate the distance.

Closeness is multiplicative. Distance is additive!
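The multiplicative and additive behavior can be checked numerically. The sketch below assumes the usual Jukes-Cantor forms: an edge with mutation probability p has closeness 1 - 4p/3, and distance is the negative logarithm of closeness. These forms and all function names are assumptions made for illustration.

```python
import math

def edge_closeness(p):
    """Assumed Jukes-Cantor closeness of an edge with mutation probability p."""
    return 1.0 - 4.0 * p / 3.0

def distance(closeness):
    """Assumed distance: negative logarithm of closeness."""
    return -math.log(closeness)

def path_closeness(probs):
    """Closeness multiplies along the edges of a path."""
    result = 1.0
    for p in probs:
        result *= edge_closeness(p)
    return result

def estimated_closeness(seq_x, seq_y):
    """Plug-in estimate from two aligned sequences of equal length."""
    differing = sum(a != b for a, b in zip(seq_x, seq_y))
    return 1.0 - 4.0 * differing / (3.0 * len(seq_x))

if __name__ == "__main__":
    sigma = path_closeness([0.7, 0.65])
    print(sigma, distance(sigma))
    print(distance(edge_closeness(0.7)) + distance(edge_closeness(0.65)))
```

With edge probabilities 0.7 and 0.65 the path closeness is roughly 0.009. A closeness this near zero makes the logarithm very sensitive to sampling noise in `estimated_closeness`, which is one way to read the remark that larger closeness gives more accurate distance estimates.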

Closeness = Cubic Root of Determinant

[Figure: matrix indexed by the characters A, C, G, T]

Closeness of Triplet

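[Figure: triplet example with sequences AGTT, CAGG, GGTG, GTTG and edge mutation probabilities 0.7 and 0.65]

Closeness $\sigma_{XYZ}$ of triplet XYZ in terms of the pairwise closeness values:

$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$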

The larger the closeness, the more accurately we can estimate the three pairwise distances.

Assemble Triplets Into Tree via Distance Additivity (I)

Assemble Triplets Into Tree via Distance Additivity (II)
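Distance additivity lets the three pairwise distances of a triplet be split into the distances from each leaf to the triplet's center P. The sketch below shows only that arithmetic; `center_distances` is a hypothetical name, and how the talk's algorithm uses these values to attach a new leaf is not reproduced here.

```python
def center_distances(d_xy, d_yz, d_zx):
    """Distances from X, Y, Z to the center P of the triplet, assuming
    additivity: d_xy = d_xp + d_yp, and similarly for the other two pairs."""
    d_xp = (d_xy + d_zx - d_yz) / 2.0
    d_yp = (d_xy + d_yz - d_zx) / 2.0
    d_zp = (d_yz + d_zx - d_xy) / 2.0
    return d_xp, d_yp, d_zp

if __name__ == "__main__":
    # Illustrative input values, chosen only to make the arithmetic easy to follow.
    print(center_distances(5, 8, 7))
```

For pairwise distances 5, 8, and 7, the three leaf-to-center distances are 2, 3, and 5, and each pair of them sums back to the corresponding pairwise distance.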


How to Choose Triplets to Minimize Errors?

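Closeness $\sigma_{XYZ}$ of triplet XYZ in terms of the pairwise closeness values:

$\frac{1}{\sigma_{XYZ}} = \frac{1}{\sigma_{XY}} + \frac{1}{\sigma_{YZ}} + \frac{1}{\sigma_{ZX}}$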

The larger the closeness, the more accurately we can estimate the three pairwise distances.

Greedy Strategy!

Harmonic Greedy Triplet (HGT)

Over-Simplified Outline of HGT

• Stage 1: T’ ← the triplet ABC with the largest closeness.

• Stage 2: Repeat the following steps until T’ contains all the leaves.

Step 2(1): Pick a triplet XYZ with the largest closeness where X, Y are in T’ but Z is not.

Step 2(2): Incorporate XYZ into T’ to add Z into T’.
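The selection order in this outline can be sketched as a plain greedy loop. This is not the talk's implementation: it assumes the triplet closeness is the harmonic combination of the three pairwise values, it searches all candidate triplets by brute force, and it leaves out the tree surgery of Step 2(2). The names `triplet_closeness` and `greedy_triplet_order` are hypothetical.

```python
from itertools import combinations

def triplet_closeness(sigma, x, y, z):
    """Assumed harmonic combination of the three pairwise closeness values."""
    return 1.0 / (1.0 / sigma[frozenset((x, y))]
                  + 1.0 / sigma[frozenset((y, z))]
                  + 1.0 / sigma[frozenset((z, x))])

def greedy_triplet_order(leaves, sigma):
    """Return the triplets chosen by the over-simplified outline, in order.
    Only the choice of triplets is modeled; T' is tracked as a set of leaves."""
    # Stage 1: start from the triplet with the largest closeness.
    first = max(combinations(leaves, 3),
                key=lambda t: triplet_closeness(sigma, *t))
    in_tree = set(first)
    chosen = [first]
    # Stage 2: repeatedly add the leaf Z reached by the closest triplet XYZ
    # with X and Y already in T' and Z not yet in T'.
    while len(in_tree) < len(leaves):
        candidates = [
            (x, y, z)
            for x, y in combinations(sorted(in_tree), 2)
            for z in leaves if z not in in_tree
        ]
        best = max(candidates, key=lambda t: triplet_closeness(sigma, *t))
        in_tree.add(best[2])
        chosen.append(best)
    return chosen
```

Here `sigma` is a dictionary from unordered leaf pairs (as `frozenset`) to pairwise closeness estimates. The brute-force search is far slower than the optimal running time claimed in the talk and is only meant to make the greedy rule concrete.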

Our New Result (4/1)


Polynomial Sequence Length (1)

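The closeness of a triplet XYZ is bounded below in terms of $(1 - \frac{4}{3} g)^{d}$, where d is the g-depth of XYZ: larger closeness corresponds to smaller g-depth.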

Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.

Proof:

• The largest closeness such that the triplets with same or larger closeness cover the true tree T.

• The smallest g-depth such that the triplets with same or smaller g-depths cover the true tree T.

Polynomial Sequence Length (2)

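For the last triplet XYZ used, d is the g-depth of the tree, so the closeness of XYZ is bounded below in terms of $(1 - \frac{4}{3} g)^{d}$.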

Lemma 1: The g-depth of the last triplet used in HGT is the g-depth of the true tree T.

Lemma 2: The sequence length needed is bounded in terms of the closeness of XYZ, where XYZ is the last triplet used.

Our New Result (4/2)


Optimal Time/Space for the First Triplet

• Stage 1:

Fix an arbitrary leaf A.

T’ ← the triplet ABC with the largest closeness.

• Stage 2:

Repeat the following steps until T’ contains all the leaves.

Step 2(2): Incorporate XYZ into T’.

Optimal Time/Space for the Other Leaves

partially reconstructed tree

not yet recovered

only need to consider the triplets formed by one of X, Y, one of B, C, and one of ...


Further Research

• more general models of evolution

• practical implementations

Main Difficulties

• Availability of data

Hundreds of millions of species --- unlikely to be all available any time soon or ever.

But DNA sequences of more and more species are becoming available.

• Extracting information from data

focus of this talk

Do the genomes of all green plants contain enough information for the reconstruction of their evolutionary tree?

• genome size of eukaryotes: base pairs

• # of green plant species: several

If so, does this impose any necessary structure on the information or the tree? If so, how do we determine and use that structure?

Beyond All Computational Considerations

1010116

What do you think?

The End.

Thank You!

Data Mining Flowchart

true tree (unknown)

collect & process individual sequences

compare & align multiple sequences

tree reconstruction algorithms

tree verification (compare & refine)

evolution models

generate sequences

further process

parameters

distance or characters

trees

information

refine

today’s focus

parameters
