Information Theoretic Approach to Whole Genome Phylogenies David Burstein Igor Ulitsky Tamir Tuller Benny Chor School Of Computer Science Tel Aviv University


Page 2: Tree of Life

“I believe it has been with the tree of life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications” ...
Charles Darwin, 1859

Page 3: Accepted Evolutionary Model: Trees

Initial period: primordial soup, where “you are what you eat”.
  Recombination events.
  Horizontal transfers.
Formation of distinct taxa.
  Speciation events induce a tree-like evolution.

Page 4: Phylogenetic Trees Based on What?

1. Morphology
2. Single genes
3. Whole genomes

Page 5: Whole Genome Phylogenies: Motivation

Cons for single-gene trees
  Require preprocessing
  Gene duplications
  Often too sensitive
Pros for whole-genome trees
  Fully automatic
  More information
  Seems essential in viruses
What about proteome trees?
  Less “noise”, but do require preprocessing

Page 6: Whole Genome Phylogenies: Challenges

Very large inputs: up to 5G bp long
Extreme length variability (5G to 1M bp)
No meaningful alignment
Different segments experienced different evolutionary processes

Page 7: Previous Approaches

Genome rearrangements (Hannenhalli & Pevzner 1995, …)
Gene/domain contents (Snel et al. 1999, …)
Li et al. (2001): “Kolmogorov complexity”
Otu et al. (2003): “Lempel-Ziv compression”
Qi et al. (2004): composition vectors

Common approach (ours too):
  Compute pairwise distances
  Build a tree from the distance matrix (e.g. using Neighbor Joining, Saitou and Nei 1987)

Page 8: Genome Rearrangements

Emphasis on finding the best sequence of rearrangements
Drawbacks
  Requires manual definition of blocks
  Disregards changes within the block

Page 9: Gene/Domain Content

Each genome is represented as an equal-length Boolean vector
Various tree construction methods
Drawbacks
  Requires gene/domain definition/knowledge
  Disregards most of the genetic information

Page 10: Ming Li et al.: “Kolmogorov Complexity”

Kolmogorov complexity (KC) is a wonderful measure
But … it is not computable
“Approximate” KC by compression
Drawbacks
  Justification of the “approximation”
  Compression of one human chromosome reportedly took 24 hours (sloooow)

Page 11: Otu et al.: “Lempel-Ziv Distance”

Run LZ compression on Genome A.
Use the Genome A dictionary to compress Genome B.
Log compression ratio (B given A vs. B given B) ≈ distance(B, A)
Easy to implement
Linear running time
Drawback: dictionary size effects

Page 12: Qi et al.: Composition Vector

Calculate distributions of the K-tuples.
For K=1: nucleotide/amino acid frequencies.
For K=5: 4^5 (20^5) possible 5-tuples.
Various methods for scoring distances.
Report K=5 as seemingly optimal.

[Figure: example 5-tuples (ACCGT, GGTAC, ATTGC, AACGG, GCTAT, ATGCG, GTTGC) listed for Genome A and for Genome B]
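The K-tuple distributions described on this slide can be sketched in a few lines. This is only an illustration of the counting step: the function names are hypothetical, and cosine distance is an assumed stand-in for the "various methods for scoring distances", not necessarily the scoring used by Qi et al.

```python
from collections import Counter
from math import sqrt


def composition_vector(seq, k):
    """Relative frequency of every K-tuple that occurs in seq (overlapping windows)."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse frequency vectors (assumed scoring choice)."""
    dot = sum(value * v.get(kmer, 0.0) for kmer, value in u.items())
    norm_u = sqrt(sum(value * value for value in u.values()))
    norm_v = sqrt(sum(value * value for value in v.values()))
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    vec_a = composition_vector("ACCGTGGTACATTGC", 5)
    vec_b = composition_vector("AACGGGCTATATGCG", 5)
    print(cosine_distance(vec_a, vec_b))
```

Storing only the tuples that actually occur keeps the vector sparse, which matters for proteins, where 20^5 possible 5-tuples would otherwise be allocated.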

Page 13: Our Approach: Average Common Substring (ACS)

For every position in Genome A, find the longest common substring in Genome B, i.e. the longest substring starting at that position that also occurs somewhere in Genome B.

Genome A: AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGTAGGCTACGCCCTTT


Page 19: Our Approach: ACS (cont.)

For every position $i$ in Genome A, find the length $\ell(i)$ of the longest common substring in Genome B, i.e. the longest substring starting at position $i$ that also occurs somewhere in Genome B. In this case, $\ell(i) = 5$ (the match is shown in brackets).
ACS = average of $\ell(i)$ over all positions = $L(\text{Genome A}, \text{Genome B})$

Genome A: AGGCTTAGATCG[AGGCT]AGGATCCCCTTAGCG
Genome B: AAAGCTACCTGGATGAAGGT[AGGCT]GCGCCCTTT
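The ACS definition can be written down directly as a brute-force sketch. The function names are hypothetical, and the two strings are the Genome A and Genome B sequences recovered from this slide; the quadratic loop is meant for illustration only.

```python
def match_lengths(a, b):
    """For each start i in a, the length of the longest prefix of a[i:] that occurs in b."""
    lengths = []
    for i in range(len(a)):
        k = 0
        # Extend the candidate match one letter at a time while it still occurs in b.
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        lengths.append(k)
    return lengths


def acs(a, b):
    """Average common substring length L(a, b); assumes a is non-empty."""
    return sum(match_lengths(a, b)) / len(a)


if __name__ == "__main__":
    genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
    genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
    print(match_lengths(genome_a, genome_b)[12])  # the highlighted position (0-based index 12)
    print(acs(genome_a, genome_b))
```

Note that L(a, b) averages over positions of a only, so in general L(a, b) differs from L(b, a).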

Page 20: From ACS to Our Distance: Intuition

High $L(A, B)$ indicates higher similarity.
Should normalize to account for the length of B.

Page 21: From ACS to Our Distance: Intuition

High $L(A, B)$ indicates higher similarity.
Should normalize to account for the length of B:

$$\frac{L(A, B)}{\log |B|}$$

Still, we want distance rather than similarity.


Page 24: From ACS to Our Distance: Intuition

High $L(A, B)$ indicates higher similarity.
Should normalize to account for the length of B.
Still, we want distance rather than similarity.
And want to have $D(A, A) = 0$:

$$\tilde{D}(A \| B) = \frac{\log |B|}{L(A, B)} - \frac{\log |A|}{L(A, A)}$$

Finally, we want to ensure symmetry:

$$D_s(A, B) = \tilde{D}(A \| B) + \tilde{D}(B \| A)$$
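The steps on this slide (normalize by log length, invert, subtract the self term, symmetrize) can be sketched as follows. The function names are hypothetical; the natural logarithm and the convention of returning infinity when two sequences share no letter are assumptions, since the slides fix neither.

```python
from math import inf, log


def acs(a, b):
    """Brute-force L(a, b): mean over starts i of the longest prefix of a[i:] found in b."""
    total = 0
    for i in range(len(a)):
        k = 0
        while i + k < len(a) and a[i:i + k + 1] in b:
            k += 1
        total += k
    return total / len(a)


def directed_distance(a, b):
    """log|b| / L(a, b) - log|a| / L(a, a); zero when b equals a."""
    shared = acs(a, b)
    if shared == 0:
        return inf  # assumed convention: nothing in common at all
    return log(len(b)) / shared - log(len(a)) / acs(a, a)


def acs_distance(a, b):
    """Symmetric distance D_s: the sum of both directed terms."""
    return directed_distance(a, b) + directed_distance(b, a)


if __name__ == "__main__":
    genome_a = "AGGCTTAGATCGAGGCTAGGATCCCCTTAGCG"
    genome_b = "AAAGCTACCTGGATGAAGGTAGGCTGCGCCCTTT"
    print(acs_distance(genome_a, genome_b))
    print(acs_distance(genome_a, genome_a))
```

Every suffix of A matches itself, so L(A, A) = (|A| + 1) / 2 and the subtracted self term shrinks quickly as genomes grow; its main role is to make the distance of a sequence to itself exactly zero.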

Page 25: Comparison to Human (H)

Species                  Proteome size   L(H,*)   Ds(H,*)
Mus musculus (mouse)     12x10^6         22.97    2.11
Arabidopsis thaliana     11x10^6         5.29     5.56
S. cerevisiae (yeast)    2x10^6          4.82     8.97
E. coli                  0.9x10^6        4.57     9.13

Page 26: What Good is this Weird Measure?

1) Our “ACS distance” is related to an information theoretic measure that is close to the Kullback-Leibler relative entropy between two distributions.
2) The proof of the pudding is in the eating: we will show this “weird measure” is empirically good.

Page 27: An Info Theoretic Measure

Define $\tilde{D}(p \| q)$ = number of bits required to describe distribution p, given q:

$$\tilde{D}(p \| q) = \lim_{l \to \infty} \frac{1}{l} \sum_{x \in X^l} p(x) \log \frac{1}{q(x)} = \lim_{l \to \infty} \frac{1}{l} E_p\left[\log \frac{1}{q(X)}\right]$$

$\tilde{D}(p \| q)$ is closely related to the Kullback-Leibler relative entropy:

$$D(p \| q) = E_p\left[\log \frac{p(X)}{q(X)}\right]$$
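To see why the two measures are "closely related", it may help to look at the simplest case of memoryless sources (an assumption made only for this illustration, so that no limit is needed). Splitting the logarithm shows that $\tilde{D}$ is the relative entropy plus the entropy $H(p)$ of the source itself:

```latex
\tilde{D}(p \| q)
  = \sum_{x} p(x) \log \frac{1}{q(x)}
  = \sum_{x} p(x) \log \frac{1}{p(x)} + \sum_{x} p(x) \log \frac{p(x)}{q(x)}
  = H(p) + D(p \| q)
```

In particular $\tilde{D}(p \| p) = H(p)$, which is generally nonzero, whereas $D(p \| p) = 0$.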

Page 28: An Info Theoretic Measure

Both $D(p \| q)$ and $\tilde{D}(p \| q)$ are common “distance measures” between two probability distributions p and q.
Both “distances” are neither symmetric nor satisfy the triangle inequality.

Page 29: Relations Between ACS and $\tilde{D}(p \| q)$

Suppose p and q are Markovian probability distributions on strings, and A, B are generated by them. Abraham Wyner (1993) showed that, w.h.p.,

$$\lim_{|A|,|B| \to \infty} \tilde{D}(A \| B) = \lim_{|A|,|B| \to \infty} \frac{\log |B|}{L(A, B)} = \tilde{D}(p \| q)$$

$$\lim_{|A|,|B| \to \infty} D_s(A, B) = \lim_{|A|,|B| \to \infty} \left[ \tilde{D}(A \| B) + \tilde{D}(B \| A) \right] = \tilde{D}(p \| q) + \tilde{D}(q \| p)$$

Page 30: Implementation and Complexity

Computing the distance $D_s(g_1, g_2)$ of two genomes of length k:
Naïve implementation requires O(k^2) time (a disaster on billion-letter genomes).
With suffix trees/arrays, the total time for computing $D_s(g_1, g_2)$ is O(k) (much nicer).
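A linear-time computation of the per-position match lengths can be sketched with an index over Genome B. The slide names suffix trees/arrays; this sketch swaps in a suffix automaton, a related index that is shorter to write, so it is an illustration of the idea rather than the authors' implementation. All function names are hypothetical.

```python
def build_suffix_automaton(text):
    """Suffix automaton of text: transitions, suffix links, longest length per state."""
    trans, link, longest = [{}], [-1], [0]
    last = 0
    for ch in text:
        cur = len(trans)
        trans.append({})
        link.append(-1)
        longest.append(longest[last] + 1)
        p = last
        while p != -1 and ch not in trans[p]:
            trans[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = trans[p][ch]
            if longest[p] + 1 == longest[q]:
                link[cur] = q
            else:
                # Split state q so that the new state's suffix link has the right length.
                clone = len(trans)
                trans.append(dict(trans[q]))
                link.append(link[q])
                longest.append(longest[p] + 1)
                while p != -1 and trans[p].get(ch) == q:
                    trans[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return trans, link, longest


def match_lengths_linear(a, b):
    """For each start i in a, the length of the longest prefix of a[i:] occurring in b."""
    # The automaton naturally yields "longest match ending here"; running it on the
    # reversed strings turns that into "longest match starting here".
    trans, link, longest = build_suffix_automaton(b[::-1])
    state, length, ends = 0, 0, []
    for ch in reversed(a):
        while state != 0 and ch not in trans[state]:
            state = link[state]
            length = longest[state]
        if ch in trans[state]:
            state = trans[state][ch]
            length += 1
        else:
            length = 0
        ends.append(length)
    return ends[::-1]


def acs_linear(a, b):
    """L(a, b) in time linear in len(a) + len(b) for a fixed alphabet."""
    return sum(match_lengths_linear(a, b)) / len(a)
```

For genome-scale inputs a compact suffix array would be a more memory-friendly choice than Python dictionaries; the sketch only shows that the quadratic scan is avoidable.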

Page 31: Results and Comparisons

Many genomes and proteomes
Small ribosomal subunit ML tree
Compare to other whole-genome methods
Quantitative and qualitative evaluation

Page 32: Four Datasets Used

Benchmark dataset: 75 species
191 species (all non-viral proteomes in NCBI)
1,865 viral genomes
34 mitochondrial DNA of mammals (same as Li et al.)

Page 33: Benchmark Dataset: 75 Species

Genomes and proteomes of archaea, bacteria and eukarya
Tree topologies reconstructed from the distance matrix using Neighbor Joining (Saitou and Nei 1987)
Reference tree and distance matrix obtained from the RDP (Ribosomal Database Project)

Page 34: Results: Quantitative Evaluations

Benchmark dataset
  Genomes/proteomes of 75 species from archaea, bacteria and eukarya.
Methods tested:
  ACS (ours)
  “Lempel-Ziv complexity” (Otu and Sayood)
  K-mer composition vectors (Qi et al.)

[Diagram: Tested Methods, an example distance matrix, a tree over leaves A to E, Tree Evaluation]

Example distance matrix:

     A     B     C     D     E
A    0     1.2   2.3   4.6   3.5
B    1.2   0     3.4   2.4   5.3
C    2.3   3.4   0     3.4   5.3
D    4.6   2.4   3.4   0     4.0
E    3.5   5.3   5.3   4.0   0

Page 35: Results: Quantitative Evaluations

Tree evaluation
  Reference tree: “accepted” tree obtained from the Ribosomal Database Project (Cole et al. 2003)
  Tree distance: Robinson-Foulds (1981)

[Diagram: Tested Methods, example distance matrix and tree over leaves A to E, Tree Evaluation (same as the previous slide)]

Page 36: Robinson-Foulds Distance

Each tree edge partitions the species into 2 sets.
Search for partitions that exist in only one of the trees.

[Figure: Tree A and Tree B, each over leaves A, B, C, D, E, with edges labeled x and y. The partition {A,B} | {C,D,E} is common to both trees.]

Page 37: Robinson-Foulds Distance

Each tree edge partitions the species into 2 sets.
Search for partitions that exist in only one of the trees.

[Figure: Tree A and Tree B, each over leaves A, B, C, D, E, with edges labeled x and y. The partition {D,E} | {A,B,C} of Tree A is not in Tree B.]

Page 38: Robinson-Foulds Distance

Distance = number of edges inducing partitions that exist in only one of the trees.
For n leaves, the distance ranges from 0 through 2n-6.

[Figure: Tree A and Tree B, each over leaves A, B, C, D, E, with edges labeled x and y. The partition {D,E} | {A,B,C} of Tree A is not in Tree B.]
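The partition counting can be sketched on trees written as nested tuples. The function names and the tuple encoding are hypothetical. Tree A below follows the two partitions named on the slides; the second partition of Tree B, {C,D} | {A,B,E}, is an assumption read off the leaf layout of the figure.

```python
def splits(tree):
    """Non-trivial bipartitions of the leaf set induced by the internal edges of tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(visit(child) for child in node))

    all_leaves = visit(tree)

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(collect(child) for child in node))
        rest = all_leaves - below
        # Edges next to a single leaf (or the root itself) give trivial partitions.
        if len(below) > 1 and len(rest) > 1:
            found.add(frozenset([below, rest]))
        return below

    collect(tree)
    return found


def robinson_foulds(tree_1, tree_2):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(tree_1) ^ splits(tree_2))


if __name__ == "__main__":
    tree_a = (("A", "B"), "C", ("D", "E"))
    tree_b = (("A", "B"), "E", ("D", "C"))  # assumed topology, see the note above
    print(robinson_foulds(tree_a, tree_b))
```

Storing each partition as an unordered pair of leaf sets makes the comparison independent of where a tree happens to be rooted. For five leaves the bound 2n-6 gives a maximum of 4.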

Page 39: Robinson-Foulds Distance - Results

Benchmark set has n=75 species, so max distance is 144.

Method               Genomes   Proteomes
ACS (our method)     108       76
Composition vector   110       92
LZ complexity        118       126

Page 40: All Proteomes Dataset

191 proteomes from NCBI Genome
11 Eukarya, 19 Archaea, 161 Bacteria
Compared to NCBI Taxonomy


Page 43: Viral Forest

1865 viral genomes from EBI
Split into super-families:
  dsDNA
  ssDNA
  dsRNA
  ssRNA positive
  ssRNA negative
  Retroids
  Satellite nucleic acid

Page 44: Retroid Tree

83 reverse-transcriptases:
  Hepatitis B viruses
  Circular dsDNA
  ssRNA

[Figure: retroid tree, with labels “Avian” and “Mammalian”]

Page 45: ssRNA Negative Tree

Each segment treated separately
174 segments of 74 viruses

Page 46: Mammalian mtDNA Tree

[Figure: tree, with labels “Avian” and “Mammalian”]

Page 47: Throwing Branch Lengths In

Intelligent Design?

Page 48: General Insights

Proteomes vs. genomes
Overlapping vs. non-overlapping
Triangle inequality held in all cases

Page 49: Additional Directions Attempted

Naïve introduction of mismatches
Division into segments
Weighted combinations of genome and proteome data
Bottom line (subject to change): simple is beautiful.

Page 50: Summary

Whole genome phylogeny based on the ACS method
Effective algorithm
Information theoretic justification
Successful reconstruction of known phylogenies

Page 51: Future Work

Additional datasets
Statistical significance
Improved branch length estimation
Better time and space complexities

Page 52: Questions?