Emerging challenges in data-intensive genomics
BioFrontiers Symposium, May 28, 2014
Mikael Huss, SciLifeLab / Stockholm University, Sweden
Where I work
INTEGRATIVE AND TECHNOLOGY DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY
SciLifeLab – an infrastructure for massive biology
Science 328, 805 (14 May 2010)
Inaugurated mid-2010
Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.
Approximately 700 researchers
More than 100 researchers in bioinformatics and systems biology
Co-directors Prof Mathias Uhlén (KTH), Jan Andersson (KI), Gunnar von Heijne (SU)
Big is relative
… but some people are willing to go out on a limb
“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further” (OLRAC SPS)
http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815
“There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research)
http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html
Genomics big data in context: Throughput
Genomics big data in context: Storage
Genomics big data in context: Heterogeneity
“The size of the data is not the whole story.
If the data are uniform, they can almost always be compressed and filtered with traditional methods.
You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”
(adapted from Aleksi Kallio)
DNA sequencing with and without a reference
Reference-based
Analogous to matching words and sentence fragments to a book. Called “alignment” or “mapping”.
Algorithmically: Matching strings to an index.
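As a rough illustration of "matching strings to an index", here is a minimal seed-and-verify sketch. It assumes a plain k-mer hash index; production aligners use compressed indexes and far more careful scoring. The names `build_kmer_index` and `map_read` and the toy sequences are made up for this example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Return reference start positions where the read fits with few mismatches."""
    hits = set()
    # Each k-mer of the read acts as a seed that is looked up in the index.
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Verify the candidate placement by counting mismatches.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)

# Toy data, not a real genome.
reference = "ACGTTGCATGACCGTAGGCTA"
index = build_kmer_index(reference, k=4)
print(map_read("CATGACCG", reference, index, k=4))
```

The index is built once and reused for every read, which is why reference-based analysis scales to many millions of reads.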
Reference-free
Analogous to reconstructing a book from scratch based on only the words and sentence fragments. Called (de novo) “assembly”.
Algorithmically: finding the best path through a very complicated graph.
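One common way to picture the graph formulation is a de Bruijn graph, in which every k-mer seen in the reads becomes an edge between its prefix and its suffix. The sketch below assumes error-free reads and a repeat-free toy sequence, so the "best path" reduces to a walk that uses every edge once (found here with Hierholzer's algorithm). Real assemblers also have to cope with sequencing errors, repeats and uneven coverage. All function names are illustrative.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Each distinct k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(graph) | set(in_deg)
    # Start at a node with one more outgoing than incoming edge, if there is one.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def assemble(reads, k):
    """Spell out the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

# Three overlapping toy reads covering a 10-base sequence.
print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], k=4))
```

With repeats, the graph branches and many walks become possible, which is where the "very complicated graph" of the slide comes from.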
Reference-based example
http://www.slideshare.net/gcoates/next-generation-genomcs-petascale-data-in-the-life-sciences
Find genetic variants relative to human reference genome
Community genomics
Sequencing environmental samples: ocean, soil, etc.
Metagenomics
Continuously monitoring environmental DNA
Discovering new bacterial strains, viruses, antibiotic resistance genes
Metagenomics
Human microbiome
Estimated that there are 3-10 times as many bacterial cells as human cells in the body
Also, viruses and bacteriophages
Diagnostics
“NGS saves a young life”, http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html
Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/
Joe DeRisi (UCSF)
14-year-old boy came in with various symptoms for which the underlying problem was hard to diagnose
In the end, a 1 cubic centimeter brain biopsy was taken and sequenced on a MiSeq instrument, which identified a pathogen (Leptospira)
Sequencing took ~1 day including lab work; analysis took 1.5 h
Image from Charles Chiu, UCSF
Real-time/streaming bioinformatics needed!
The unknown
http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html
“Biological dark matter”
“The unknown continent”
According to one estimate, less than 1% of the viral diversity has been explored!
The unknown
In a recent paper on soil metagenomics, Titus Brown and colleagues report that:
80% of the 398 billion sequences could not be assembled into putative genes
Of the putative proteins encoded by the sequences that could be assembled into putative genes, 60% could not be matched to anything in the databases!
Hunting viral pathogens
Many academic groups and companies are trying to identify viruses that might be involved in a variety of diseases in humans and animals.
“Needle in a haystack” problem. Real-life example:
- Sequence human or animal tissue samples (~30-40 million sequences).
- Filter out host DNA in the computer.
- Try to match the rest of the sequences to databases of known viruses.
- For whatever is left, assemble sequences de novo and match the assembled “genes” to “everything” out there (= NCBI’s NT and other databases).
- End up with ~20,000 putative genes that don’t resemble anything in the databases.
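The first three steps above can be sketched as a simple triage loop. This is only an illustration: shared k-mer fraction stands in for real alignment, the de novo assembly step is left out, and the function names, threshold and toy sequences are all assumptions rather than the pipeline actually used.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_fraction(read, ref_kmers, k):
    """Fraction of the read's k-mers that also occur in the reference."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & ref_kmers) / len(read_kmers)

def triage_reads(reads, host_genome, known_viruses, k=5, threshold=0.5):
    """Split reads into host, known-virus and unknown bins."""
    host_kmers = kmers(host_genome, k)
    virus_kmers = {name: kmers(seq, k) for name, seq in known_viruses.items()}
    bins = {"host": [], "known_virus": [], "unknown": []}
    for read in reads:
        # Step 1: filter out anything that looks like host DNA.
        if shared_fraction(read, host_kmers, k) >= threshold:
            bins["host"].append(read)
            continue
        # Step 2: try to match what remains against known viruses.
        best = max(
            virus_kmers,
            key=lambda name: shared_fraction(read, virus_kmers[name], k),
            default=None,
        )
        if best is not None and shared_fraction(read, virus_kmers[best], k) >= threshold:
            bins["known_virus"].append((read, best))
        else:
            # Step 3: whatever is left would go on to de novo assembly.
            bins["unknown"].append(read)
    return bins

# Toy sequences, purely hypothetical.
host = "ACGTACGTTAGCCGATAGCTAGGCTA"
viruses = {"virusA": "TTTTGGGGCCCCAAAATTTGGGCCC"}
reads = ["CGTTAGCCGA", "GGGGCCCCAA", "CATCATCATCAT"]
print(triage_reads(reads, host, viruses))
```

The "unknown" bin is the interesting one here: it corresponds to the sequences that end up as putative genes with no database match.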
Public data
We realized there is a lot of data online, although scattered around.
Can use the raw or assembled sequences from these studies as part of our own studies.
Also by combining different data sets and their metadata, we may get clues about what the unknown things are.
Problems:
1) Sequence comparisons take a long time – need more efficient algorithms.
2) Publicly available data is scattered and disorganized, and much that could be public isn’t.
Wishlist
- Everybody who is doing metagenomics is finding a lot of unknown stuff!
- Make as much sequence data as possible available
- Make all the sequences findable and queryable so that we can identify commonalities between data sets
- String matching algorithms better adapted for “big-data” use cases in genomics:
- Real-time (streaming) matching, for diagnostics and environmental monitoring
- More efficient matching of sequences to huge reference indexes (“every known sequence”)
- Develop more reference-free methods for discovering new organisms and genes
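To make the real-time (streaming) item concrete, here is a small sketch that processes reads one at a time as they arrive and reports an organism as soon as enough reads carry one of its signature k-mers, instead of waiting for the run to finish. It assumes standard four-line FASTQ records and a hypothetical signature table; `stream_detect`, the `pathogen_x` label and the thresholds are invented for the example.

```python
import io
from collections import Counter

def fastq_sequences(handle):
    """Yield sequences one at a time from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line (ignored in this sketch)
        yield sequence

def stream_detect(handle, signatures, k, min_hits):
    """Yield (organism, reads_seen) the first time an organism reaches min_hits reads.

    signatures maps a k-mer to the organism it is taken to be diagnostic for.
    """
    hits = Counter()
    reported = set()
    for reads_seen, seq in enumerate(fastq_sequences(handle), start=1):
        matched = {
            signatures[seq[i:i + k]]
            for i in range(len(seq) - k + 1)
            if seq[i:i + k] in signatures
        }
        for organism in matched:
            hits[organism] += 1
            if hits[organism] >= min_hits and organism not in reported:
                reported.add(organism)
                yield organism, reads_seen

# A simulated stream; in practice this could be a pipe from the instrument.
stream = io.StringIO(
    "@r1\nAAGATTACATT\n+\nIIIIIIIIIII\n"
    "@r2\nCCCCCCCCCCC\n+\nIIIIIIIIIII\n"
    "@r3\nGATTACAGGGG\n+\nIIIIIIIIIII\n"
    "@r4\nTTGATTACAAA\n+\nIIIIIIIIIII\n"
)
for organism, reads_seen in stream_detect(stream, {"GATTACA": "pathogen_x"}, k=7, min_hits=2):
    print(organism, "detected after", reads_seen, "reads")
```

Because both functions are generators, memory use stays constant no matter how long the stream runs.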
Efforts towards these goals
“we want to support automated data exploration in ways that are simply not possible today” C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)
Jeff Jonas:
“Data finds data”
“The data is the query”
Using the dataset itself, or a statistical description of it, as a query
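One possible reading of "a statistical description as a query" is to summarize each data set as a small k-mer sketch and rank other data sets by sketch similarity. The example below uses a bottom-s MinHash sketch as that summary. This is an assumed technique chosen for illustration, not something the talk or the quoted authors prescribe, and the function names, parameters and toy data are hypothetical.

```python
import hashlib

def kmer_hashes(sequences, k):
    """Hash every k-mer in a data set to a 64-bit integer."""
    hashes = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            digest = hashlib.sha1(seq[i:i + k].encode()).digest()
            hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes

def sketch(sequences, k=8, size=64):
    """Bottom-`size` MinHash sketch: the smallest k-mer hashes of the data set."""
    return sorted(kmer_hashes(sequences, k))[:size]

def similarity(sketch_a, sketch_b, size=64):
    """Estimate k-mer Jaccard similarity from two bottom-`size` sketches."""
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not union_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in union_bottom) / len(union_bottom)

def rank_datasets(query, collection, k=8, size=64):
    """Use the query data set itself as the query: rank others by similarity."""
    query_sketch = sketch(query, k, size)
    scores = {
        name: similarity(query_sketch, sketch(seqs, k, size), size)
        for name, seqs in collection.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data sets, purely illustrative.
query = ["ACGTTGCATGACCGTAGGCTAACGGATCCA"]
collection = {
    "different": ["TTTTTTTTTTTTTTTTTTTTTTTTTTTTTT"],
    "same": ["ACGTTGCATGACCGTAGGCTAACGGATCCA"],
}
print(rank_datasets(query, collection))
```

The sketches are tiny compared with the raw reads, so they could be precomputed and shared even when the underlying data cannot easily be moved.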
Efforts towards these goals
Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency
Helix.io
Genetic classification startup
Competitions as a way to drive innovation
Sage Bionetworks’ competition platform
Can build directly on each other’s code!
SAGE/DREAM breast cancer challenge
Winner of the Innocentive challenge
http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html
CLARITY challenge
Identifying possible disease causal genetic variants in three children
Summary
DNA sequencing has great potential for improved diagnostics and pathogen discovery
We need more efforts in real-time sequence analysis for diagnostics and monitoring
We need better ways to publish and connect data sets online to enable more efficient and unbiased discovery
Online collaboration can help both through open data and online competitions
@mikaelhuss
http://followthedata.wordpress.com
Acknowledgements
Research environment
Thomas Svensson + the rest of the WABI group
Joakim Lundeberg + his group members
Helpful comments
Petter Holme
Stefania Giacomello
Mattias Andersson
Metagenomics discussions
Anders Andersson + group
Hilja Strid
Joakim Larsson
Johan Bengtsson-Palme
+ the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!
@mikaelhuss
http://followthedata.wordpress.com
Extra slides
Why hasn’t Hadoop caught on in genomics?
Hadoop is almost synonymous with big data in the corporate world
Ideas:
– Existing computing infrastructure is sufficient
– Or, focused on supercomputing solutions rather than commodity servers
– The programming skills and training are not there
– Many problems not parallelisable
– Not enough flexibility for exploratory analysis