ASSEMBLY AND ALIGNMENT-FREE METHOD OF PHYLOGENY RECONSTRUCTION FROM NGS DATA
Huan Fan, Anthony R. Ives, Yann Surget-Groba and Charles H. Cannon
Traditional methods for building phylogeny
Requirements:
• High coverage
• Assembly
• Detection of putative orthologous genes
• Alignment
• Phylogeny from a tiny portion of the whole genome
• Genome-scale multi-sequence alignment is difficult
Alignment-free methods for building phylogeny
• Typically from assembled genomes
• De novo assembly with short reads?
• Mainly on closely related prokaryotic genomes
• No confidence assessment (e.g. bootstrapping)
Overview
• Assembly and Alignment-Free method (AAF)
• Calculate phylogenetic distances using whole genome short read sequencing data
• Method validation
• Genome complexity
• Different genome sizes
• Sequencing errors
• Range of sequencing coverage
• 12 mammal species
• 21 tropical tree species
• Comparison with andi
AAF method
• Calculate pairwise genetic distances between each pair of samples using the number of evolutionary changes between their genomes, which is represented by the number of k-mers that differ between genomes.
• Phylogenetic relationships among the genomes are then reconstructed from the pairwise distance matrix
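The k-mer comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the AAF implementation: the function names `kmer_set` and `pairwise_shared` and the toy sequences are hypothetical, reverse complements are ignored, and taking nt as the smaller of the two k-mer totals is an assumption.

```python
from itertools import combinations

def kmer_set(seq, k):
    """Return the set of distinct k-mers in a sequence (one strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def pairwise_shared(samples, k):
    """For every pair of samples, return (ns, nt).

    ns = number of k-mers shared by the pair.
    nt = total number of k-mers, taken here as the smaller of the two
         totals (an assumption made for this sketch).
    """
    sets = {name: kmer_set(seq, k) for name, seq in samples.items()}
    table = {}
    for a, b in combinations(sorted(sets), 2):
        ns = len(sets[a] & sets[b])
        nt = min(len(sets[a]), len(sets[b]))
        table[(a, b)] = (ns, nt)
    return table

# Toy sequences, invented for illustration only.
samples = {
    "A": "ACGTACGGTCA",
    "B": "ACGTACGATCA",
    "C": "TTGTACGGTCA",
}
table = pairwise_shared(samples, k=4)
for pair, (ns, nt) in table.items():
    print(pair, "shared:", ns, "total:", nt)
```

The resulting (ns, nt) pairs are the raw material for a pairwise distance matrix, which a distance-based tree-building method can then consume.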
AAF method - Evolutionary model
• The probability that no mutation will occur within a given k-mer between species A and B is exp(−kd), where d is the evolutionary distance per nucleotide.
• If only substitutions occurred and all k-mers are unique, then all the species will have the same total number of k-mers, nt, and the maximum likelihood estimate of exp(−kd) is ns/nt.
• Mutations will decrease the number of shared k-mers, ns, between species relative to the total number of k-mers, nt
• Insertion (of length l): loss of (k – 1) or gain of (l + k – 1) k-mers
• Deletion (of length l): loss of (l + k – 1) or gain of (k – 1) k-mers
• Greater effect
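Setting exp(−kd) equal to ns/nt and solving for d gives d = −(1/k) ln(ns/nt). The sketch below simply evaluates that expression; the function name `aaf_distance` is hypothetical, and the input check is an addition for the example rather than part of the method.

```python
import math

def aaf_distance(ns, nt, k):
    """Estimate d from exp(-k*d) = ns/nt, i.e. d = -ln(ns/nt) / k."""
    if not 0 < ns <= nt:
        raise ValueError("need 0 < ns <= nt")
    return -math.log(ns / nt) / k

# Round trip: choose d, derive the expected shared fraction, recover d.
k = 21
d_true = 0.01
nt = 1000
ns = nt * math.exp(-k * d_true)   # expected number of shared k-mers
d_est = aaf_distance(ns, nt, k)
print(round(d_est, 6))
```

Because the distance depends on the logarithm of ns/nt, it grows quickly as the shared fraction approaches zero, so estimates from very few shared k-mers should be treated with caution.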
K-mer sensitivity and homoplasy
• No assembly -> not all indels identified
• If k-mer covers multiple substitutions
• Shorter k-mers -> better sensitivity
• Shorter k-mers -> same k-mers from evolutionarily different regions
• Homoplasy
K-mer homoplasy
• k=15
• Genome size > 5 × 10^8 => same k-mers randomly in other species
• May incorrectly inflate the proportion of shared k-mers
• The optimal k for phylogenetic reconstruction is the k which is just large enough to greatly reduce k-mer homoplasy for a given genome size
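To get a feel for how k and genome size interact, the sketch below estimates the chance that a specific k-mer turns up in an unrelated genome purely at random. It assumes a random genome with equal base frequencies and a Poisson approximation, which is a simplification and not the model used in the talk; the function names and the 1% threshold are hypothetical choices for illustration.

```python
import math

def chance_kmer_probability(genome_size, k):
    """Approximate probability that a given k-mer occurs at least once
    in a random genome of the given size (equal base frequencies,
    single strand, Poisson approximation)."""
    return 1.0 - math.exp(-genome_size / 4 ** k)

def smallest_k(genome_size, max_probability=0.01, k_max=64):
    """Smallest k whose chance-occurrence probability is at or below
    max_probability. The threshold is an arbitrary illustrative choice."""
    for k in range(1, k_max + 1):
        if chance_kmer_probability(genome_size, k) <= max_probability:
            return k
    raise ValueError("no k up to k_max satisfies the threshold")

genome_size = 5e8
p15 = chance_kmer_probability(genome_size, 15)
print(f"k=15: chance occurrence probability ~ {p15:.2f}")
print("smallest k under the 1% threshold:", smallest_k(genome_size))
```

Under these assumptions, with k = 15 and a genome of 5 × 10^8 bases, roughly a third of k-mers would be expected to appear by chance, which illustrates why a larger k is needed for large genomes. Biased GC content would make chance matches more likely than this uniform model suggests.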
ph
• Prediction of the ratio ns/nt
• Large genomes and small k: ph = 1
• All possible k-mers occur in both species. This problem is exacerbated if GC content is biased, which will inflate the average similarity in genomic k-mer composition.
• GC content
• Sufficiently large k will overcome homoplasy, regardless of the evolutionary distance between species.
Mathematical prediction
Random ancestral sequence
Real (non-random) sequence
Assembly-free
• Sampling error caused by low genome coverage
• The actual number of k-mers will be under-represented given low sequencing coverage
• Sequencing errors
• Loss of true k-mers and the gain of false k-mers
• Filtering = remove singletons
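Singleton filtering can be sketched as follows: count every k-mer across the reads and keep only those seen at least twice. The function names and toy reads are hypothetical, and reverse complements are ignored for brevity.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count occurrences of each k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def filter_singletons(counts, min_count=2):
    """Keep k-mers observed at least min_count times (drops singletons)."""
    return {kmer for kmer, n in counts.items() if n >= min_count}

# Toy reads: the third read carries one base that differs from the others,
# standing in for a sequencing error.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = filter_singletons(count_kmers(reads, k=4))
print(sorted(kept))
```

In this toy case the two k-mers created by the differing base occur only once and are dropped. The same filter would also discard a true k-mer that happened to be sampled only once, which is why filtering interacts with coverage.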
Seq errors p=observed/true
Coverage of 5-8× is sufficient to observe all true k-mers when filtering
=> Tip corrections
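A rough way to see why moderate coverage can be enough is to model the number of times a true k-mer is sampled as Poisson-distributed and ask how often it survives singleton filtering. The Poisson model, the conversion from read coverage to k-mer coverage, and the function names are all assumptions of this sketch, not taken from the talk.

```python
import math

def kmer_coverage(read_coverage, read_length, k):
    """Expected k-mer coverage given read coverage: each read of length L
    contributes L - k + 1 k-mers."""
    return read_coverage * (read_length - k + 1) / read_length

def prob_survives_filter(lam, min_count=2):
    """P(count >= min_count) for a Poisson count with mean lam."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(min_count))
    return 1.0 - below

for lam in (2, 5, 8):
    print(f"k-mer coverage {lam}: P(kept) = {prob_survives_filter(lam):.4f}")

# Hypothetical example: 8x read coverage, 100 bp reads, k = 21.
print(kmer_coverage(8, 100, 21))
```

Under this model, about 96% of true k-mers are seen at least twice at a k-mer coverage of 5 and more than 99% at a coverage of 8, which is broadly in line with the 5-8 range stated above.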
Filter only singletons?
Bootstrapping
Nonparametric bootstrap
1) Resample original reads with replacement
2) “Block bootstrap” – take rows with probability 1/k
OR
Two-stage parametric bootstrap
• Estimate the variances in distances between species caused by sampling and evolutionary variation
• Independent of genome size
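The two nonparametric resampling steps can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, "rows" is taken to mean rows of a table with one row per k-mer, and each row is kept independently with probability 1/k.

```python
import random

def bootstrap_reads(reads, rng):
    """Step 1: draw len(reads) reads with replacement."""
    return [rng.choice(reads) for _ in reads]

def block_bootstrap_rows(rows, k, rng):
    """Step 2: keep each row independently with probability 1/k,
    preserving the original order."""
    return [row for row in rows if rng.random() < 1.0 / k]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

reads = ["ACGTAC", "CGTACG", "GTACGT", "TACGTA"]
resampled = bootstrap_reads(reads, rng)

rows = list(range(1000))  # stand-in for 1000 k-mer table rows
kept = block_bootstrap_rows(rows, k=20, rng=rng)
print(len(resampled), len(kept))
```

Each bootstrap replicate would then be run through the same distance and tree-building steps as the original data, and support for a node read off as the fraction of replicates that recover it.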
Bushbaby (galago)
Tarsier
Recently published phylogeny of primates
Assembled genomes, k=19
Assembled genomes, k=21
Simulated reads
Real data – tropical trees
Intsia palembanica
Advantages
• Low coverage requirements
• Low computational demands
• 12 primates: 25 GB RAM, 12 threads
Limitations
• Loss of k-mer sensitivity
• Deep nodes
• Location of mutations
Distance computing for 73 Escherichia strains
• AAF: 32 min + 76 min = 1 h 48 min
• andi: 21 min
Figure panels: AAF, andi