Beiko cms final

Post on 26-Jul-2015

Robert Beiko

When trees can’t agree

The human microbiome: an ecosystem unlike any other

Human gut microbiome: 2-3 million genes

Typically > 160 “species” at any given time

Human: ~25,000 genes

Qin et al., Nature (2010)

Microbial communities

http://upload.wikimedia.org/wikipedia/commons/2/2d/Bacteria_%28251_31%29_Airborne_microbes.jpg

Photo courtesy of Emma Allen-Vercoe, University of Guelph

Lachnospiraceae bacterium 3-1-57 CT1 “Lachnozilla”

Meehan and Beiko (2014) Genome Biol Evol

Lachnospiraceae – commonly thought of as “Good bacteria”

[Figure: Sizes of Assembly and Draft Genomes of Class Clostridia. Axis: Number of Protein-Coding Genes (0 to 8000). “Zilla” is marked with a “?”.]

W. Ford Doolittle, Sci Am (1999)

PNAS, 2012

“…pathogen-driven inflammatory responses in the gut can generate transient enterobacterial blooms in which conjugative transfer occurs at unprecedented rates.”

PLoS Biol, 2007

“…lateral gene transfer, mobile elements, and gene amplification have played important roles in affecting the ability of gut-dwelling Bacteroidetes to vary their cell surface, sense their environment, and harvest nutrient resources present in the distal intestine.”

Gene transfer matters

The genomics toolkit: Gene profiles

Gene 1 Gene 2 Gene 3 Gene 4 Gene 5

The genomics toolkit: “Species” trees

The genomics toolkit: Gene trees

Do this for ALL genes

Representing and understanding microbial relationships

1. Matrix-based approaches

2. Phylogenetic reconciliation

3. Gene distributions and “microbial identity”

1. The tyranny of distance

From profile to distance matrix

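[Figure: profile of genomes A to F (rows) against Gene 1, Gene 2, Gene 3, Gene 4, ..., Gene n (columns)]

Per-gene similarities for the pair (A, B): S_1 = 0.91, S_2 = 0.82, S_3 = 0.72, S_4 = 0.89

d_{A,B} = 1.0 - \frac{1}{n} \sum_{g=1}^{n} S_g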

     A      B      C
A    0      0.165  0.252
B    0.165  0      0.297
C    0.252  0.297  0
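
The step from per-gene similarities to a single distance can be sketched in a few lines of Python. This is a minimal illustration of the formula on this slide, assuming the four similarity values listed belong to the pair (A, B); the function name `profile_distance` is made up for the example.

```python
def profile_distance(similarities):
    """Return 1 minus the mean per-gene similarity (d = 1.0 - (1/n) * sum(S_g))."""
    if not similarities:
        raise ValueError("need at least one per-gene similarity")
    return 1.0 - sum(similarities) / len(similarities)


# Similarity values shown on the slide, assumed here to be the (A, B) pair.
s_ab = [0.91, 0.82, 0.72, 0.89]
d_ab = profile_distance(s_ab)
print(round(d_ab, 3))
```

With these four values the result rounds to 0.165, which agrees with the (A, B) entry of the matrix.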

Neighbor-joining

Start with a ‘star’ tree

At each iteration, split off the pair of taxa that minimizes the total sum of branch lengths in the tree

Choose groups x and y to minimize the Q-criterion:

[Equation annotations: distance matrix entry for (x, y); weighted distance from x and from y to all leaves]

Continue until binary tree is obtained

Saitou and Nei (1987)
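
The pair-selection step can be sketched as follows. The equation itself did not survive in this transcript, so the sketch assumes the commonly used form of the criterion, Q(x, y) = (n - 2) d(x, y) - sum_k d(x, k) - sum_k d(y, k), where n is the current number of groups. The 5-taxon matrix is a toy example, not data from the talk, and the function names are illustrative.

```python
def q_matrix(d):
    """Compute Q(i, j) = (n - 2) * d[i][j] - rowsum(i) - rowsum(j) for all i < j.

    d is a symmetric distance matrix given as a list of lists.
    """
    n = len(d)
    row_sums = [sum(row) for row in d]
    q = {}
    for i in range(n):
        for j in range(i + 1, n):
            q[(i, j)] = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
    return q


def pick_pair(d):
    """Return the index pair (i, j) with the smallest Q value."""
    q = q_matrix(d)
    return min(q, key=q.get)


# Toy distances for five taxa (hypothetical, for illustration only).
toy = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair = pick_pair(toy)
print(pair, q_matrix(toy)[pair])
```

Note that the pair chosen here, (0, 1), is not the pair with the smallest raw distance, (3, 4): the row-sum terms reward pairs that are jointly far from everything else. A full neighbor-joining run would then merge the chosen pair, update the matrix, and repeat.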

Neighbor-net: Building a splits graph

Bryant and Moulton, Mol Biol Evol (2003)

Neighbor-net is guaranteed to produce a circular set of splits

This will produce a planar graph

Neighbor-net of 298 microbial genomes

Beiko, Biol Direct (2011)

Limitations of neighbor-net

• Neighbor-net still imposes a constraint on the relationships among genomes: “long-distance” connections cannot be shown

?

Explicit connections between genomes

• Make each genome a vertex in a graph G = (V, E)

V = {A,B,C,D,E,F,…}

E = {{A,B},…}

For some threshold t: {A,B} ∈ E iff d_{A,B} ≤ t, or if some other condition is satisfied

[Figure: vertices A and B joined by an edge with weight w_{A,B}]
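
A minimal sketch of the thresholding rule, using the three pairwise distances from the A/B/C matrix shown earlier in the talk. The threshold value 0.26 and the choice of edge weight w = 1 - d are assumptions made for this example only; the slides do not specify how w_{A,B} is defined.

```python
def threshold_graph(dist, t):
    """Keep an edge {a, b} whenever its distance is at most t.

    dist maps (a, b) pairs to distances. Returns {frozenset({a, b}): weight}.
    The weight 1 - d is an illustrative assumption, not the talk's definition.
    """
    edges = {}
    for (a, b), d in dist.items():
        if a != b and d <= t:
            edges[frozenset((a, b))] = 1.0 - d
    return edges


# Pairwise distances from the distance-matrix slide.
dist = {("A", "B"): 0.165, ("A", "C"): 0.252, ("B", "C"): 0.297}
edges = threshold_graph(dist, t=0.26)  # t = 0.26 is a hypothetical threshold
for edge, w in edges.items():
    print(sorted(edge), round(w, 3))
```

At this threshold the edges {A, B} and {A, C} are kept and {B, C} is dropped, so unlike a tree or a circular split system, any pair of genomes can be connected directly.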

Linear programming

• Weighting networks based on straight genome-genome similarity highlights close relatives, redundancy

• LP introduces a weighting scheme that constrains connections and promotes distinct relationships

P. aeruginosa, P. fluorescens, P. lePewtida, P. syringae, P. entomophila, P. stutzeri, P. mendocina

Holloway and Beiko, BMC Evol Biol (2010)

“Plume”

Some like it hot

Pyrococcus furiosus optimal growth temperature: 100°C

Kunin et al. (2005) Genome Res

Networks

Networks!!!!

Dagan et al. (2008) PNAS

2. Inferring and comparing trees

Phylogenetic tree reconciliation

Species tree S

Gene tree G

Lateral gene transfer

Subtree prune and regraft

Whidden et al., Syst Biol (2014)

For two rooted trees, dSPR is equal to the number of components in a maximum agreement forest (MAF), minus 1

So building a MAF is equivalent to inferring the minimum number of SPR events needed to reconcile a species tree with a gene tree

Problem is NP-hard

dSPR = 1

MAF components = 2

Bordewich and Semple, Ann Combinatorics (2005)

T1 T2

Case 1 (separate components)

Case 2 (one pendant node)

Case 3 (several pendant nodes)

Chris’s algorithm

Fixed-parameter tractability

• Problem is dominated by Case 3 (3 alternatives)

• Cut all candidate edges at each step = linear-time 3-approximation

• Decision problem: to decide if SPR distance ≤ k

• Problem is exponential in SPR distance, NOT number of leaves

therefore FPT

Chris Whidden + Norbert Zeh

In practice

SPR Supertrees

Supertree: a tree that satisfies some optimality criterion with respect to a set of input trees

SPR supertree: given a set of gene trees, find a tree that minimizes the total number of SPR operations vs. all gene trees

Building an SPR supertree: assemble an initial tree, then propose SPR operations and evaluate each resulting tree's total SPR distance to the input trees

Whidden et al., 2014
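
The search loop described above has the shape of a greedy local search. The sketch below shows only that shape and is not the published implementation: `neighbors` and `dist` are hypothetical placeholders for "all trees one SPR move away" and "SPR distance", and the toy run uses integers in place of trees so the example stays self-contained.

```python
def total_distance(candidate, inputs, dist):
    """Sum of distances from one candidate to every input (the supertree objective)."""
    return sum(dist(candidate, item) for item in inputs)


def local_search(start, inputs, neighbors, dist):
    """Greedy search: move to the first neighbor that lowers the total distance."""
    current = start
    best = total_distance(current, inputs, dist)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):
            score = total_distance(cand, inputs, dist)
            if score < best:
                current, best, improved = cand, score, True
                break
    return current, best


# Toy stand-ins (assumptions): integers play the role of trees,
# +/-1 steps play the role of SPR moves, |a - b| plays the role of SPR distance.
result = local_search(
    start=0,
    inputs=[1, 2, 9],
    neighbors=lambda x: (x - 1, x + 1),
    dist=lambda a, b: abs(a - b),
)
print(result)
```

In the real setting each call to `dist` is itself an NP-hard SPR distance computation, which is why the fixed-parameter algorithm matters for making this loop practical.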

Why SPR supertrees?

1. Explicit representation of LGT events

2. Branches broken in MAF → implied LGT events. Can build graph of connections

244 bacterial genomes

40,631 gene trees

= Bacterial SPR supertree

LGT patterns for Clostridium

Whidden et al., 2014

3. Taming Lachnozilla (taming in progress)

http://en.wikipedia.org/wiki/File:Godzilla_%2754_design.jpg

What makes LachnoZilla LachnoZilla?

C. difficile….

“Virulence-associated protein”

Mobile DNA

Phylogenetic profile based on extremely good matches to other genomes (> 95% ID, > 95% coverage) = “recent” LGT events
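
A minimal sketch of how such a profile could be assembled from sequence-search hits. The tuple layout, gene names, and genome names are hypothetical; only the two cutoffs (> 95% identity, > 95% coverage) come from the slide.

```python
def recent_lgt_profile(hits, min_identity=95.0, min_coverage=95.0):
    """Map each gene to the set of other genomes holding a near-identical match.

    hits is an iterable of (gene, genome, percent_identity, percent_coverage).
    Both cutoffs are strict, matching the "> 95%" wording on the slide.
    """
    profile = {}
    for gene, genome, identity, coverage in hits:
        if identity > min_identity and coverage > min_coverage:
            profile.setdefault(gene, set()).add(genome)
    return profile


# Hypothetical hits, for illustration only.
hits = [
    ("gene_a", "genome_1", 99.2, 98.0),  # kept
    ("gene_a", "genome_2", 96.5, 80.0),  # dropped: coverage too low
    ("gene_b", "genome_2", 95.0, 99.0),  # dropped: identity not above 95
    ("gene_b", "genome_3", 97.1, 96.4),  # kept
]
profile = recent_lgt_profile(hits)
print(profile)
```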

LZ & friends

279 genomes

Conserved marker-gene tree

Ben Wright

LachnoZilla (and friends) genome graph

!

Close relative (expected)

Distant relative (not so expected) (big genome though!)

Selective sharing

Gene-centric graphs

[Table: presence (×) of each LZ gene in Genome 1 to Genome 6. Gene 1: × ×. Gene 2: ×. Gene 3: × ×. Gene 4: × × ×. Column positions not recoverable.]

Edge weights are proportional to similarity of distribution

Use graph clustering to divide up completely connected, weighted graph

[Figure: weighted graph with nodes Gene 1, Gene 2, Gene 3, Gene 4]
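
One way to turn presence/absence profiles into edge weights is sketched below. Jaccard similarity is an assumption chosen for the example; the slides say only that weights are proportional to similarity of distribution. The profiles are hypothetical too: they match the number of marks per gene in the table, but the genome assignments are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared genomes divided by all genomes in which either gene occurs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_graph(profiles):
    """Return a completely connected graph as {(gene, gene): weight}."""
    return {
        (g, h): jaccard(profiles[g], profiles[h])
        for g, h in combinations(sorted(profiles), 2)
    }


# Hypothetical distributions across Genome 1..6 (positions are assumed).
profiles = {
    "Gene 1": {"Genome 1", "Genome 2"},
    "Gene 2": {"Genome 5"},
    "Gene 3": {"Genome 1", "Genome 2"},
    "Gene 4": {"Genome 4", "Genome 5", "Genome 6"},
}
weights = gene_graph(profiles)
for pair, w in weights.items():
    print(pair, round(w, 3))
```

Under these assumed profiles, Gene 1 and Gene 3 get weight 1.0 and would land in the same cluster; the clustering step itself is left out of the sketch.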

Legionaminic acid

Acetylneuraminic acid

(pathogen associated)

Bacteroides pectinophilus

Butyrivibrio proteoclasticus

Eubacterium plexicaudatum

Roseburia

Neighbors

Weirdly named isolates

Lachnozilla in graph form (it all makes sense now)

Mystery isolate #1 (made-up example)

Mystery isolate #2 (made-up example)

Questions

Representations

Clear inference

From pattern to understanding

FIN