ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA

Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon

Page 2

Traditional methods for building phylogeny

Requirements:

• High coverage

• Assembly

• Detection of putative orthologous genes

• Alignment

• Phylogeny from tiny portion of the whole genome

• Genome scale multi-sequence alignment is difficult

Page 3

Alignment-free methods for building phylogeny

• Typically from assembled genomes

• De novo assembly with short reads?

• Mainly on closely related prokaryotic genomes

• No confidence assessment (e.g. bootstrapping)

Page 4

Overview

• Assembly and Alignment-Free method (AAF)

• Calculate phylogenetic distances using whole genome short read sequencing data

• Method validation

  • Genome complexity

  • Different genome sizes

  • Sequencing errors

  • Range of sequencing coverage

• 12 mammal species

• 21 tropical tree species

• Comparison with andi

Page 5

AAF method

• Calculate pairwise genetic distances between each pair of samples using the number of evolutionary changes between their genomes, which are represented by the number of k-mers that differ between genomes.

• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix
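As a rough illustration of the first step, the sketch below counts the k-mers shared by each pair of samples. The function names (`kmer_set`, `shared_kmer_counts`) and the toy sequences are hypothetical, and reverse complements and read-level filtering are ignored; it is not the AAF implementation.

```python
from itertools import combinations


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_counts(samples, k):
    """For each pair of samples, return (shared k-mers, k-mers in a, k-mers in b)."""
    sets = {name: kmer_set(seq, k) for name, seq in samples.items()}
    result = {}
    for a, b in combinations(sorted(sets), 2):
        result[(a, b)] = (len(sets[a] & sets[b]), len(sets[a]), len(sets[b]))
    return result


# Toy sequences (hypothetical) that differ by a single substitution.
samples = {"A": "ACGTACGGTC", "B": "ACGTACGATC"}
counts = shared_kmer_counts(samples, k=4)
```

Every k-mer that overlaps the substituted position is lost from the shared set, which is why the number of differing k-mers tracks the number of evolutionary changes.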

Page 6

AAF method - Evolutionary model

• The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd).

• If only substitutions occurred and all k-mers are unique, then all the species will have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt.

• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt

• Insertion: loss of (k – 1) or gain of (l + k – 1) k-mers

• Deletion: loss of (l + k – 1) or gain of (k – 1) k-mers

• Greater effect
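Setting exp(−kd) equal to ns/nt and solving for d gives d = −(1/k) ln(ns/nt). A minimal sketch of that calculation follows; the function name `kmer_distance` is hypothetical, and only the substitution-only case described above is covered.

```python
import math


def kmer_distance(ns, nt, k):
    """Solve exp(-k * d) = ns / nt for d, i.e. d = -(1/k) * ln(ns / nt)."""
    if not 0 < ns <= nt:
        raise ValueError("need 0 < ns <= nt")
    return -math.log(ns / nt) / k


# Hypothetical counts: half of the k-mers are shared, with k = 21.
d = kmer_distance(ns=50_000, nt=100_000, k=21)
```

When ns equals nt the estimated distance is zero, and it grows as the shared fraction shrinks.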

Page 7

K-mer sensitivity and homoplasy

• No assembly -> not all indels identified

• If k-mer covers multiple substitutions

• Shorter k-mers -> better sensitivity

• Shorter k-mers -> same k-mers from evolutionarily different regions

  • Homoplasy

Page 8

K-mer homoplasy

• k=15

• Genome size > 5x10^8 => same k-mers randomly in other species

• May incorrectly inflate the proportion of shared k-mers

• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size
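To see why genome size drives the choice of k, the sketch below estimates how likely a given k-mer is to appear by chance in a random genome. It assumes independent, uniformly distributed bases and ignores reverse complements and GC bias, so it is a back-of-the-envelope model rather than the prediction used by the authors; the function names and the 1% threshold are hypothetical.

```python
import math


def random_kmer_presence(genome_size, k):
    """Probability that a given k-mer occurs at least once by chance
    in a random genome with uniform base composition (Poisson approximation)."""
    return 1.0 - math.exp(-genome_size / 4 ** k)


def smallest_k(genome_size, max_presence=0.01):
    """Smallest k for which the chance occurrence probability drops to max_presence or below."""
    k = 1
    while random_kmer_presence(genome_size, k) > max_presence:
        k += 1
    return k


# Genome size taken from the slide; the threshold is an arbitrary choice.
p15 = random_kmer_presence(5e8, 15)
k_needed = smallest_k(5e8, max_presence=0.01)
```

Under these assumptions a sizeable fraction of 15-mers is expected to occur by chance in a genome of 5x10^8 bases, and a few extra bases of k are enough to make chance matches rare.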

Page 9

ph

• Prediction of the ratio ns/nt

• Large genomes and small k -> ph = 1

  • All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.

• GC content

• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.

Page 10

Mathematical prediction

Page 11

Random ancestral sequence

Page 12

Real (non-random) sequence

Page 13

Assembly-free

• Sampling error caused by low genome coverage

• The actual number of k-mers will be under-represented given low sequencing coverage

• Sequencing errors

  • Loss of true k-mers and the gain of false k-mers

  • Filtering = remove singletons
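The singleton filter can be sketched as a k-mer count followed by a threshold. The helper names and the toy reads below are hypothetical, and a real pipeline would use a dedicated k-mer counter rather than an in-memory `Counter`.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_singletons(counts, min_count=2):
    """Keep only k-mers seen at least min_count times (drops singletons by default)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy reads (hypothetical): the third read carries one erroneous base.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = filter_singletons(count_kmers(reads, k=4))
```

The k-mers created by the erroneous base appear only once and are removed, while true k-mers sampled by more than one read are kept.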

Page 14

Seq errors p=observed/true

Coverage of 5-8x is sufficient to observe all true k-mers when filtering

=> Tip corrections
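One way to reason about how coverage interacts with singleton filtering is a simple Poisson model: a true k-mer survives the filter only if it is sampled at least twice. The sketch below assumes error-free reads of fixed length and Poisson-distributed k-mer counts; the function name and parameters are hypothetical, and it does not reproduce the analysis behind the slide.

```python
import math


def prob_kmer_survives_filter(coverage, read_length, k, min_count=2):
    """P(a true k-mer is observed at least min_count times) under a Poisson model."""
    # Effective k-mer coverage: a read of length L contains L - k + 1 k-mers.
    lam = coverage * (read_length - k + 1) / read_length
    p_below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                  for i in range(min_count))
    return 1.0 - p_below


# Hypothetical settings: 100 bp reads and k = 21 at two coverage levels.
p_low = prob_kmer_survives_filter(coverage=5, read_length=100, k=21)
p_high = prob_kmer_survives_filter(coverage=8, read_length=100, k=21)
```

Under this model the fraction of true k-mers that pass the filter rises quickly with coverage, and whatever fraction is still missed is the kind of shortfall a tip correction has to account for.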

Page 15

Filter only singletons?

Page 16

Filter only singletons?

Page 17

Bootstrapping

Nonparametric bootstrap

1) Resample original reads with replacement

2) “Block bootstrap” – take rows with probability 1/k

OR

Two-stage parametric bootstrap

• Estimate the variances in distances between species caused by sampling and evolutionary variation

• Independent of genome size
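The two nonparametric steps can be sketched as follows. The shared k-mer table layout (one row per k-mer, one presence/absence column per species) and all names are assumptions made for illustration; the 1/k sampling rate presumably compensates for the fact that overlapping k-mers are not independent.

```python
import random


def resample_reads(reads, rng):
    """Step 1: draw len(reads) reads with replacement."""
    return rng.choices(reads, k=len(reads))


def block_bootstrap_rows(kmer_table, k, rng):
    """Step 2: keep each row of the k-mer table independently with probability 1/k."""
    return [row for row in kmer_table if rng.random() < 1.0 / k]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

reads = ["ACGTAC", "CGTACG", "GTACGT"]
replicate_reads = resample_reads(reads, rng)

# Hypothetical table: (k-mer, present in species A, present in species B).
kmer_table = [("ACGT", 1, 1), ("CGTA", 1, 0), ("GTAC", 0, 1), ("TACG", 1, 1)]
replicate_rows = block_bootstrap_rows(kmer_table, k=4, rng=rng)
```

Distances would then be recomputed from each replicate and a tree built per replicate, so that node support can be read off the collection of trees.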

Page 18

Bushbaby (galago)

Page 19

Tarsier

Page 20

Recently published phylogeny of primates

Page 21

Assembled genomes, k=19

Page 22

Assembled genomes, k=21

Page 23

Simulated reads

Page 24

Simulated reads

Page 25

Real data – tropical trees

Intsia palembanica

Page 26

Advantages

• Low coverage requirements

• Low computational demands

  • 12 primates: 25GB RAM, 12 threads

Limitations

• Loss of k-mer sensitivity

• Deep nodes

• Location of mutations

Page 27

Distance computing for 73 Escherichia strains

• AAF: 32 + 76 min = 1 h 48 min

• andi: 21 min

Page 28

AAF andi