2013 caltech-edrn-talk

Transcript of 2013 caltech-edrn-talk

1. C. Titus Brown
   Assistant Professor
   CSE, MMG, BEACON
   Michigan State University
   May 1, [email protected]
   ...approaches to reference-free variant calling

2. Open, online science
   Much of the software and approaches I'm talking about today are available:
   - khmer software: github.com/ged-lab/khmer/
   - Blog: http://ivory.idyll.org/blog/
   - Twitter: @ctitusbrown

3. Outline & Overview
   - Motivation: lots of data, analyzed with offline approaches.
   - Reference-based vs. reference-free approaches.
   - Single-pass algorithms for lossy compression; application to resequencing data.

4. Shotgun sequencing
   It was the best of times, it was the wor
   , it was the worst of times, it was the
   isdom, it was the age of foolishness
   mes, it was the age of wisdom, it was th

   It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

   ...but for lots and lots of fragments!

5. Sequencers produce errors
   It was the Gest of times, it was the wor
   , it was the worst of timZs, it was the
   isdom, it was the age of foolisXness
   , it was the worVt of times, it was the
   mes, it was Ahe age of wisdom, it was th
   It was the best of times, it Gas the wor
   mes, it was the age of witdom, it was th
   isdom, it was tIe age of foolishness

   It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness

6. Three basic problems
   Resequencing, counting, and assembly.

7. Three basic problems
   Resequencing & counting, and assembly.

8. Resequencing analysis
   We know a reference genome, and want to find variants (blue) in a background of errors (red).

9. Counting
   We have a reference genome (or gene set) and want to know how much of it we have. Think gene expression/microarrays, copy number variation.

10. Noisy observations -> information
   (Same error-containing read fragments as slide 5.)

11. Three types of data scientists (Bob Grossman, U. Chicago, at XLDB 2012):
   1. Your data gathering rate is slower than Moore's Law.
   2. Your data gathering rate matches Moore's Law.
   3. Your data gathering rate exceeds Moore's Law.

12. http://www.genome.gov/sequencingcosts/

13. Three types of data scientists:
   1. Your data gathering rate is slower than Moore's Law. => Be lazy; all will work out.
   2. Your data gathering rate matches Moore's Law. => You need to write good software, but all will work out.
   3. Your data gathering rate exceeds Moore's Law. => You need serious help.

14. Random sampling => deep sampling needed
   Typically 10-100x coverage is needed for robust recovery (at ~3 Gbp per human genome, that is up to ~300 Gbp of data).

15. Applications in cancer genomics
   - Single-cell cancer genomics will advance: e.g. ~60-300 Gbp of data for each of ~1000 tumor cells.
   - Infer the phylogeny of the tumor => mechanistic insight.
   - Current approaches are computationally intensive and data-heavy.

16. Current variant calling approach
   Map reads to reference -> "pileup" and do variant calling -> downstream diagnostics.

17. Drawbacks of reference-based approaches
   - Fairly narrowly defined heuristics.
   - Allelic mapping bias: mapping is biased towards the reference allele.
   - Ignorant of unexpected novelty:
     - indels, especially large indels, are often ignored;
     - structural variation is not easily retained or recovered;
     - true novelty is discarded.
   - Most implementations are multi-pass on big data.

18. Challenges
   - Considerable amounts of noise in the data (0.1-1% error).
   - Reference-based approaches have several drawbacks:
     - dependent on the quality/applicability of the reference;
     - detection of true novelty (SNPs vs. indels; SVs) is problematic.
   => The first major data reduction step (variant calling) is extremely lossy in terms of potential information.

19. A software & algorithms approach: can we develop lossy compression approaches that
   1. reduce data size & remove errors => efficient processing?
   2. retain all information? (think JPEG)
   If so, then we can store only the compressed data for later reanalysis. Short answer is: yes, we can.
   [Figure: raw data (~10-100 GB) -> analysis -> "information" (~1 GB) -> database & integration; compression yields ~2 GB.]

20. [Same figure, annotated: raw data -> save in cold storage; compressed data -> save for reanalysis, investigation.]

21. My lab at MSU: theoretical => applied solutions
   Theoretical advances in data structures and algorithms -> practically useful & usable implementations, at scale -> demonstrated effectiveness on real data.

22. 1. Time- and space-efficient k-mer counting
   - To add an element: increment the associated counter at all hash locales.
   - To get a count: retrieve the minimum counter across all hash locales.
   http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
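Slide 22 states the counting structure in two rules (a count-min sketch). Below is a minimal, illustrative Python rendering of those two rules only; the table sizes and the MD5-based hashing are assumptions made for the example, not khmer's actual implementation.

    # Illustrative count-min sketch for k-mer counting. khmer's real counting
    # tables, their sizes, and its hash functions differ from this toy version.
    import hashlib

    class CountMinSketch:
        def __init__(self, table_sizes=(999983, 999979, 999961)):
            # One counter table per hash function; sizes are arbitrary here.
            self.table_sizes = table_sizes
            self.tables = [[0] * size for size in table_sizes]

        def _locales(self, kmer):
            # Derive one hash locale per table from a single digest.
            digest = int(hashlib.md5(kmer.encode()).hexdigest(), 16)
            return [(i, (digest >> (32 * i)) % size)
                    for i, size in enumerate(self.table_sizes)]

        def add(self, kmer):
            # "Increment the associated counter at all hash locales."
            for i, locus in self._locales(kmer):
                self.tables[i][locus] += 1

        def get(self, kmer):
            # "Retrieve the minimum counter across all hash locales."
            # Collisions can only inflate counters, so the minimum is the
            # tightest available estimate of the true count.
            return min(self.tables[i][locus] for i, locus in self._locales(kmer))

    def add_kmers(read, k, sketch):
        # Feed every k-mer of a read into the sketch.
        for i in range(len(read) - k + 1):
            sketch.add(read[i:i + k])

Counts are never underestimated and are only occasionally overestimated (when a k-mer collides in every table), which is what lets memory be traded against counting accuracy.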
23. 2. Compressible assembly graphs (NOVEL)
   [Figure panels labeled 1%, 5%, 10%, 15%.] Pell et al., PNAS, 2012.

24. 3. Online, streaming, lossy compression (NOVEL)
   Transcriptomes, microbial genomes (incl. MDA), and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. The core algorithm is single-pass and low-memory. Brown et al., arXiv, 2012.

25.-30. Digital normalization
   (A series of figure slides stepping through the approach.)

31. Digital normalization approach
   A digital analog to cDNA library normalization, diginorm (sketched in the first code example below):
   - is reference free;
   - is single pass: looks at each read only once;
   - does not collect the majority of errors;
   - keeps all low-coverage reads & retains all information;
   - smooths out coverage across regions.

32. Can we apply this algorithmically efficient technique to variants? Yes.
   Single pass, reference free, tunable, streaming, online variant calling.

33. Align reads to assembly graph (Dr. Jason Pell)

34. Reference-free variant calling
   Align read to graph -> novelty? retain -> saturated? count & discard -> output variant at saturation (online) -> downstream diagnostics.

35. Coverage is adjusted to retain signal.

36. Reference-free variant calling
   - Streaming & online algorithm; single pass.
   - For real-time diagnostics, can be applied as bases are emitted from the sequencer.
   - Reference free: independent of reference bias.
   - Coverage of variants is adaptively adjusted to retain all signal.
   - Parameters are easily tuned, although the theory still needs to be developed:
     - high sensitivity (e.g. C=50 in 100x coverage) => poor compression;
     - low sensitivity (C=20) => good compression.
   - Can subtract the reference => novel structural variants. (See: Cortex, Zam Iqbal.)
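The digital normalization loop of slide 31 is small enough to sketch directly. The version below reuses the illustrative CountMinSketch class from the earlier example and follows the description in Brown et al. (arXiv 1203.4802): estimate a read's coverage as the median count of its k-mers and keep it only while that estimate is below a cutoff; the values of k and the cutoff here are illustrative.

    # Minimal digital normalization sketch: reference free, single pass.
    # Assumes the illustrative CountMinSketch class from the example above.
    def median_kmer_count(read, k, sketch):
        counts = sorted(sketch.get(read[i:i + k])
                        for i in range(len(read) - k + 1))
        return counts[len(counts) // 2] if counts else 0

    def diginorm(reads, k=20, cutoff=20):
        sketch = CountMinSketch()
        for read in reads:
            if median_kmer_count(read, k, sketch) < cutoff:
                # Estimated coverage is still low: keep the read and record
                # its k-mers so later reads see the updated coverage.
                for i in range(len(read) - k + 1):
                    sketch.add(read[i:i + k])
                yield read
            # Otherwise the region is already sampled to ~cutoff depth; the
            # read is discarded, and its errors never enter the table.

Because reads are discarded only once their region is already well sampled, low-coverage reads (the signal) are kept, while most error k-mers from deeply covered regions are never even stored.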
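Slides 34-36 describe aligning each read to an assembly graph and emitting a variant, online, once its coverage saturates. The graph alignment itself is beyond a few lines, but the "count until saturation, then emit without a second pass" idea can be caricatured as follows; this is a toy that reports saturated k-mers rather than actual variant calls, again reusing the illustrative CountMinSketch, with C standing in for the tunable saturation parameter from slide 36.

    # Toy illustration of online emission at saturation. The real approach
    # retains reads that align to novel paths in the assembly graph; this
    # sketch does not attempt that and simply watches k-mer counts.
    def emit_at_saturation(reads, k=20, C=20):
        sketch = CountMinSketch()
        for read in reads:
            for i in range(len(read) - k + 1):
                kmer = read[i:i + k]
                sketch.add(kmer)
                # A k-mer observed C times is very unlikely to be a one-off
                # sequencing error, so report it as putative signal as soon
                # as it saturates (at most once per k-mer), with no second
                # pass over the data.
                if sketch.get(kmer) == C:
                    yield kmer

In the approach on the slides, it is novelty relative to the graph that decides what is retained, and what is emitted at saturation is a variant call rather than a raw k-mer.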
37. Concluding thoughts
   - This approach could provide significant and substantial practical and theoretical leverage on a challenging problem.
   - It provides a path to the future:
     - many-core implementation; distributable?
     - decreased memory footprint => cloud/rental computing can be used for many analyses.
   - Still early days, but funded.
   - Our other techniques are already in use: ~dozens of labs are using digital normalization.

38. References & reading list
   - Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 2012. (PubMed 22231483)
   - Nordstrom et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotech. 2013. (PubMed 23475072)
   - Brown et al., Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv 1203.4802.
   Note: this talk is online at slideshare.net, c.titus.brown.

39. Acknowledgements
   Lab members involved: Adina Howe (w/ Tiedje), Jason Pell, Arend Hintze, Rosangela Canino-Koning, Qingpeng Zhang, Elijah Lowe, Likit Preeyanon, Jiarong Guo, Tim Brom, Kanchan Pavangadkar, Eric McDonald, Chris Welcher.
   Collaborators: Jim Tiedje, MSU; Billie Swalla, UW; Janet Jansson, LBNL; Susannah Tringe, JGI.
   Funding: USDA NIFA; NSF IOS; BEACON.
   Thank you for the invitation!