The use of short-read next generation sequences to recover the evolutionary histories in multi-individual samples
Systematic biology presentation
Yuantong Ding, Dec. 6
Outline
• Background
• Workflow
• Sequence comparison
• Tree comparison
• Summary & future work
Can short-reads successfully recover phylogeny?
• Next generation sequencing (NGS)
  • Low-cost
  • High-throughput
  • Short-read
Multi-individual sample → Short-reads → Reconstructed sequence → Phylogeny?
Background · Workflow · Sequence comparison · Tree comparison · Summary
Simulation process
• Original genealogy and original haplotypes: simulated by SerialSimCoal with a coalescent model
• Consensus sequence
• Short-reads: simulated by MetaSim with the 454 error model
• Mapping: alignment built by SHRiMP and SSAHA
• Reconstructed haplotypes: reconstructed by ShoRAH
• NJ trees (original and reconstructed haplotypes): built by PAUP*
• Compare tree topology
• Compare number and similarity of haplotypes
6 parameters used
• Effective population size N
• Sample size n
• Mutation rate μ
• Sequence length l
• Number of short-reads Sr_N
• Length of short-reads Sr_l

N       n    μ          l      Sr_N    Sr_l
3000    10   5.00E-05   1200   5000    200
5000    20   1.00E-05   2000   10000   400
10000   40   5.00E-06   5000   30000   -
All 486 combinations of these parameters were simulated
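The count of 486 follows from five parameters with three values each and one (Sr_l) with two values: 3^5 × 2 = 486. A minimal sketch of enumerating that grid, with the values copied from the parameter table; the dictionary layout and the key name `mu` (for μ) are illustrative choices, not part of the original workflow:

```python
from itertools import product

# Parameter values as listed in the table; "mu" stands for the mutation rate μ.
parameter_values = {
    "N": [3000, 5000, 10000],
    "n": [10, 20, 40],
    "mu": [5e-05, 1e-05, 5e-06],
    "l": [1200, 2000, 5000],
    "Sr_N": [5000, 10000, 30000],
    "Sr_l": [200, 400],
}

# One dict per simulated setting: the Cartesian product of all value lists.
combinations = [
    dict(zip(parameter_values, values))
    for values in product(*parameter_values.values())
]

print(len(combinations))  # 3**5 * 2 = 486
```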
Different numbers of haplotypes
Similar sequences
Can reconstructed haplotypes still capture some phylogenetic information?
• When the number of haplotypes differs, it is impossible to recover the true phylogenetic tree
Assuming the true number of haplotypes in the sample is known:
• Select the most similar reconstructed sequences to build a phylogenetic tree
• Calculate the symmetric difference
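One way to read the selection step is to pick, for each original haplotype, the reconstructed haplotype with the fewest mismatches. The sketch below assumes aligned sequences of equal length and uses Hamming distance as the similarity measure; the slides do not state which measure was used, and the function names and toy sequences are hypothetical:

```python
def hamming(a, b):
    """Number of mismatching positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def most_similar(originals, reconstructed):
    """Map each original haplotype name to its closest reconstructed haplotype name."""
    return {
        name: min(reconstructed, key=lambda r: hamming(seq, reconstructed[r]))
        for name, seq in originals.items()
    }


# Toy data, invented for illustration only.
originals = {"A": "ACGTACGT", "B": "TCGTTCGA"}
reconstructed = {"r1": "ACGTACGA", "r2": "TCGTTCGA", "r3": "GGGGGGGG"}

matches = most_similar(originals, reconstructed)
print(matches)
```

This version allows the same reconstructed haplotype to be chosen for two originals; a one-to-one assignment would need an extra matching step.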
• Cluster (k-means) the reconstructed haplotypes into n groups
• Build a tree with the consensus sequence of each group
• Calculate tree balance statistics
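The consensus step for one cluster can be sketched as a per-column majority vote. This assumes the group's sequences are already aligned to equal length and that cluster membership has been decided by the k-means step; how the original analysis encoded sequences for k-means or broke ties is not stated in the slides, and the function name is hypothetical:

```python
from collections import Counter


def consensus(sequences):
    """Majority-rule consensus of aligned, equal-length sequences."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    # For each alignment column, keep the most frequent base.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*sequences)
    )


# Toy cluster, invented for illustration only.
group = ["ACGT", "ACGA", "ACGT"]
print(consensus(group))
```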
Method for tree comparison
Symmetric difference for rooted and labeled trees
Tree 1 (tips A, B, C): clades (BC), (ABC)
Tree 2 (tips B, A, C): clades (AC), (ABC)
Symmetric difference = 2
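For rooted, labeled trees the symmetric difference can be computed by listing the clades (tip sets under each internal node) of both trees and counting the clades found in only one of them. A minimal sketch, assuming trees are written as nested Python tuples with tip labels as strings; this representation and the function names are illustrative and not taken from PAUP*:

```python
def clades(tree):
    """Return (tips under this node, set of all clades in this subtree)."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()  # a tip contributes no clade
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)  # the clade defined by this internal node
    return tips, found


def symmetric_difference(tree1, tree2):
    """Count clades present in exactly one of the two rooted trees."""
    _, clades1 = clades(tree1)
    _, clades2 = clades(tree2)
    return len(clades1 ^ clades2)


# The three-tip example: clades (BC),(ABC) versus (AC),(ABC).
tree1 = ("A", ("B", "C"))
tree2 = ("B", ("A", "C"))
print(symmetric_difference(tree1, tree2))  # 2
```

Only (BC) and (AC) are unshared, which gives the value 2 for the three-tip example.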
Tree balance statistics for rooted and unlabeled trees
N_i is the number of internal nodes between tip i and the root
N_bar is the mean of N_i over all tips
e.g. i = A, N_A = 2; N_bar = (2+2+2+3+3)/5 = 2.4
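The N_bar calculation can be sketched on a five-tip tree whose tips have N_i = 2, 2, 2, 3, 3. The sketch assumes the root is counted as an internal node (this is what gives N_A = 2 for a tip two branches below the root) and that trees are nested tuples; the tree shape and the function names are hypothetical, chosen only to reproduce the example's numbers:

```python
def tip_depths(tree, depth=0):
    """Map each tip to the number of internal nodes on its path to the root (root included)."""
    if not isinstance(tree, tuple):
        return {tree: depth}
    depths = {}
    for child in tree:
        depths.update(tip_depths(child, depth + 1))
    return depths


def n_bar(tree):
    """Mean of N_i over all tips."""
    depths = tip_depths(tree)
    return sum(depths.values()) / len(depths)


# Illustrative tree with tip depths A=2, B=2, C=2, D=3, E=3.
tree = (("A", "B"), ("C", ("D", "E")))
print(tip_depths(tree))
print(n_bar(tree))  # (2+2+2+3+3)/5
```

The statistic depends only on tree shape, so relabeling the tips leaves N_bar unchanged, which is why it applies to unlabeled trees.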
Different topology of most similar sequence tree
Different balance statistics of k-means cluster tree
       N_bar                    I_c
n      org    rec    P          org    rec    P
10     4.8    4.7    0.002      0.74   0.67   0.0004
20     7.5    6.9    9.2e-09    0.57   0.47   1.52e-10
40     10.6   9.6    1.2e-08    0.40   0.33   1.94e-09
(org = original, rec = reconstructed)
Summary & future work
• Haplotype reconstruction typically failed to estimate the correct number of haplotypes
• Consequently, it was not possible to recover the true phylogenetic trees
• Even assuming the true number of haplotypes is known, the chance of recovering the true tree topology is still small
• Future work: other reconstruction methods, using multiple reference sequences when mapping…
References
• Anderson, C.N.K., Ramakrishnan, U. et al., 2005. Serial SimCoal: A population genetic model for data from multiple populations and points in time. Bioinformatics 21, 1733-1734.
• Johnson, P.L., Slatkin, M., 2006. Inference of population genetic parameters in metagenomics: a clean look at messy data. Genome Res 16, 1320-1327.
• Richter, D.C., Ott, F. et al., 2008. MetaSim - A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 3, e3373.
• Suzuki, S., Ono, N., Furusawa, C., Ying, B.-W., Yomo, T., 2011. Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms. PLoS ONE 6, e19534.
• Zagordi, O., Bhattacharya, A. et al., 2011. ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC Bioinformatics 12, 119.
• David, M., Dzamba, M. et al., 2011. SHRiMP2: Sensitive yet Practical Short Read Mapping. Bioinformatics 27(7).
• Ning, Z., Cox, A.J., Mullikin, J.C., 2001. SSAHA: a fast search method for large DNA databases. Genome Res 11, 1725-1729.