Emerging challenges in data-intensive genomics

BioFrontiers Symposium, May 28, 2014

Mikael Huss, SciLifeLab / Stockholm University, Sweden

Where I work

INTEGRATIVE AND TECHNOLOGY DRIVEN RESEARCH IN HIGH-THROUGHPUT BIOLOGY

SciLifeLab – an infrastructure for massive biology

Science 328, 805 (14 May 2010)

Inaugurated mid-2010

Hosted by three universities in Stockholm: Karolinska Institutet (medical faculty), Royal Institute of Technology (technical) and Stockholm University (natural science). SciLifeLab node in Uppsala.

Approximately 700 researchers

More than 100 researchers in bioinformatics and systems biology

Co-directors Prof Mathias Uhlén (KTH), Jan Andersson (KI), Gunnar von Heijne (SU)

Big is relative

… but some people are willing to go out on a limb

“Where is the cut-off? The line in the sand is 5TB of unstructured data or 7.5-10TB of structured data, which cannot be reduced any further” (OLRAC SPS)

http://www.itweb.co.za/index.php?option=com_content&view=article&id=111815

“There is no such thing as biomedical big data” (Will Bush, Vanderbilt University Center for Human Genetic Research)

http://gettinggeneticsdone.blogspot.se/2014/02/no-such-thing-biomedical-bigdata.html

Genomics big data in context: Throughput

Genomics big data in context: Storage

Genomics big data in context: Heterogeneity

“The size of the data is not the whole story.

If the data are uniform, they can almost always be compressed and filtered with traditional methods.

You do not get a ‘big data’ processing challenge until other factors, such as variety, non-uniformity and continuous growth, are added to a large data set.”

(adapted from Aleksi Kallio)

DNA sequencing with and without a reference

Reference-based

Analogous to matching words and sentence fragments to a book. Called “alignment” or “mapping”.

Algorithmically: Matching strings to an index.
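To make "matching strings to an index" concrete, here is a minimal sketch in Python. It assumes a tiny in-memory reference and a plain k-mer hash index with seed-and-verify matching; the function names and parameters are hypothetical, and production aligners use far more compact index structures than this.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Seed-and-verify: look up each k-mer of the read in the index, then
    check the full read against the reference at each candidate start.
    Returns a sorted list of (start, mismatches) tuples."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    hits = []
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTA"  # toy reference, not real data
    index = build_kmer_index(reference, k=4)
    print(map_read("GCATGCCG", reference, index, k=4))
```

The index is built once and reused for every read, which is why the cost of mapping is dominated by lookups rather than by scanning the whole reference.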

Reference-free

Analogous to reconstructing a book from scratch based on only the words and sentence fragments. Called (de novo) “assembly”.

Algorithmically: finding the best path through a very complicated graph.
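As an illustration of the graph view, the sketch below builds a small de Bruijn graph from overlapping reads and walks an Eulerian path through it. This is one common formulation of assembly, chosen here as an assumption; it presumes error-free reads and that an Eulerian path exists, and all names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """One edge per distinct k-mer: (k-1)-mer prefix -> (k-1)-mer suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ATGGCG", "GGCGTG", "CGTGCA"], k=4))
```

With a smaller k the same reads produce a graph with branching nodes and more than one valid path, which hints at why repeats and sequencing errors make real assembly graphs so complicated.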

Reference-based example

http://www.slideshare.net/gcoates/next-generation-genomcs-petascale-data-in-the-life-sciences

Find genetic variants relative to human reference genome
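A toy version of variant finding, assuming the reads have already been aligned to the reference without gaps: pile up the bases observed at each position and report positions where the dominant base differs from the reference. The thresholds and the function name are hypothetical choices for illustration, not those of any particular variant caller.

```python
from collections import Counter


def call_snvs(reference, aligned_reads, min_depth=2, min_fraction=0.8):
    """aligned_reads is a list of (start, sequence) pairs, assumed gap-free.
    Returns (position, reference_base, observed_base, depth) tuples."""
    pileup = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                pileup[pos][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            variants.append((pos, reference[pos], base, depth))
    return variants


if __name__ == "__main__":
    reference = "ACGTACGTAC"  # toy reference
    reads = [(0, "ACGTTCGT"), (2, "GTTCGTAC"), (3, "TTCGT")]
    print(call_snvs(reference, reads))
```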

Community genomics: sequencing environmental samples (ocean, soil, etc.)

Metagenomics

Continuously monitoring environmental DNA

Discovering new bacterial strains, viruses, antibiotic resistance genes

Metagenomics

Human microbiome

Estimated that there are 3-10 times as many bacterial cells as human cells in the body

Also, viruses and bacteriophages

Diagnostics

“NGS saves a young life”, http://omicsomics.blogspot.se/2014/02/ngs-saves-young-life.html

Storified tweets about this story: http://nextgenseek.com/2014/02/ngs-in-critical-care-a-feel-good-story/

Joe DeRisi (UCSF)

14-year-old boy came in with various symptoms for which the underlying problem was hard to diagnose

In the end, a 1 cubic centimeter brain biopsy was taken and sequenced on a MiSeq instrument, which identified a pathogen (Leptospira)

Sequencing took ~1 day including lab work, analysis took 1.5 h

Image from Charles Chiu, UCSF

Real-time/streaming bioinformatics needed!

The unknown

http://www.ted.com/talks/nathan_wolfe_what_s_left_to_explore.html

“Biological dark matter”

“The unknown continent”

According to one estimate, less than 1% of the viral diversity has been explored!

The unknown

In a recent paper on soil metagenomics, Titus Brown and colleagues report that:

80% of the 398 billion sequences could not be assembled into putative genes

Of the sequences that could be assembled into putative genes (and thus putative proteins), 60% of the proteins could not be matched to anything in the databases!

Hunting viral pathogens

Many academic groups and companies are trying to identify viruses that might be involved in a variety of diseases in humans and animals.

“Needle in a haystack” problem. Real-life example:

- Sequence human or animal tissue samples (~30-40 million sequences).
- Filter out host DNA in the computer.
- Try to match the rest of the sequences to databases of known viruses.
- For whatever is left, assemble sequences de novo and match the assembled “genes” to “everything” out there (= NCBI’s NT and other databases).
- End up with ~20,000 putative genes that don’t resemble anything in the databases.
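The filtering steps above can be sketched as a triage function that bins each read as host, known virus, or unknown. This is a deliberately simplified stand-in: it assumes shared k-mers are a good enough proxy for a match, whereas a real pipeline would use an aligner and database searches. The names, the value of k, and the threshold are all hypothetical.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_fraction(read, reference_kmers, k):
    """Fraction of the read's k-mers that also occur in the reference set."""
    read_kmers = kmers(read, k)
    if not read_kmers:
        return 0.0
    return len(read_kmers & reference_kmers) / len(read_kmers)


def triage_reads(reads, host_seqs, virus_seqs, k=5, threshold=0.5):
    """Bin reads in the order of the pipeline: host first, then known
    viruses, and whatever is left is 'unknown' (the input to assembly)."""
    host_kmers = set().union(*(kmers(s, k) for s in host_seqs))
    virus_kmers = set().union(*(kmers(s, k) for s in virus_seqs))
    bins = {"host": [], "known_virus": [], "unknown": []}
    for read in reads:
        if shared_fraction(read, host_kmers, k) >= threshold:
            bins["host"].append(read)
        elif shared_fraction(read, virus_kmers, k) >= threshold:
            bins["known_virus"].append(read)
        else:
            bins["unknown"].append(read)
    return bins


if __name__ == "__main__":
    bins = triage_reads(
        reads=["AAACCCCC", "TACGTACG", "GATTACAGATTACA"],
        host_seqs=["AAAAACCCCCGGGGG"],   # toy host sequence
        virus_seqs=["TTTTTACGTACGTAC"],  # toy known-virus sequence
    )
    print(bins)
```

Only the "unknown" bin moves on to de novo assembly, which is where the needle-in-a-haystack character of the problem shows up.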

Public data

We realized there is a lot of data online, although scattered around.

Can use the raw or assembled sequences from these studies as part of our own studies.

Also, by combining different data sets and their metadata, we may get clues about what the unknown things are.

Problems:

1) Sequence comparisons take a long time – need more efficient algorithms.
2) Publicly available data is scattered and disorganized, and much that could be public isn’t.

Wishlist

- Everybody who is doing metagenomics is finding a lot of unknown stuff!
- Make as much sequence data as possible available
- Make all the sequences findable and queryable so that we can identify commonalities between data sets

- String matching algorithms better adapted for “big-data” use cases in genomics:

- Real-time (streaming) matching, for diagnostics and environmental monitoring

- More efficient matching of sequences to huge reference indexes (“every known sequence”)

- Develop more reference-free methods for discovering new organisms and genes
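One way to picture the real-time (streaming) item on this wishlist: process reads one at a time as they come off the instrument and raise a flag as soon as enough evidence has accumulated, rather than waiting for the full run. The sketch below assumes a single shared k-mer counts as support for a target, which is far too permissive for real use; the target name, sequence, and thresholds are hypothetical.

```python
def kmers(seq, k):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stream_detect(reads, targets, k=5, min_hits=3):
    """Consume an iterable of reads lazily and yield (reads_seen, name)
    the first time a target has been supported by min_hits reads."""
    target_kmers = {name: kmers(seq, k) for name, seq in targets.items()}
    hits = {name: 0 for name in targets}
    reported = set()
    for n, read in enumerate(reads, start=1):
        read_kmers = kmers(read, k)
        for name, tk in target_kmers.items():
            if name in reported:
                continue
            if read_kmers & tk:
                hits[name] += 1
                if hits[name] >= min_hits:
                    reported.add(name)
                    yield n, name


if __name__ == "__main__":
    targets = {"pathogen_x": "ACGTACGTTAGCCGATAGGCTA"}  # made-up sequence
    incoming = iter([
        "TTTTTTTT", "ACGTACGT", "GGGGGGGG", "TAGCCGAT",
        "CCCCCCCC", "GATAGGCT", "ACGTTAGC",
    ])
    # next() returns as soon as the first detection is made; the
    # remaining reads in the stream are never touched.
    print(next(stream_detect(incoming, targets)))
```

Because the function is a generator, the caller decides when to stop, which is the property that matters for diagnostics and monitoring.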

Efforts towards these goals

“we want to support automated data exploration in ways that are simply not possible today” C Titus Brown (http://ivory.idyll.org/blog/2014-moore-ddd-round2-final.html)

Jeff Jonas:

“Data finds data”

“The data is the query”

Using the dataset itself, or a statistical description of it, as a query
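A small sketch of what "the data is the query" could look like: summarize each data set by its set of k-mers and rank a collection of data sets by Jaccard similarity to the query data set. The choice of k-mer sets and Jaccard similarity as the "statistical description" is an assumption made for illustration, and the data set names are invented. At real scale one would likely swap the exact sets for compact sketches such as MinHash.

```python
def kmer_set(sequences, k):
    """All distinct k-mers across the sequences of one data set."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_datasets(query, collection, k=4):
    """Return (name, similarity) pairs, most similar data set first."""
    q = kmer_set(query, k)
    scores = [
        (name, jaccard(q, kmer_set(seqs, k)))
        for name, seqs in collection.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    collection = {
        "soil_a": ["ACGTACGTTT"],   # invented example data sets
        "ocean_b": ["GGGGCCCCGG"],
    }
    for name, score in rank_datasets(["ACGTACGTAC"], collection):
        print(name, round(score, 3))
```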

Efforts towards these goals

Competition hosted by Innocentive on behalf of the US Defense Threat Reduction Agency

Helix.io

Genetic classification startup

Competitions as a way to drive innovation

Sage Bionetworks’ competition platform

Can build directly on each other’s code!

SAGE/DREAM breast cancer challenge

Winner of the Innocentive challenge: http://www.newton.ac.uk/programmes/MTG/seminars/2014032415301.html

CLARITY challenge: identifying possible disease-causing genetic variants in three children

Summary

DNA sequencing has great potential for improved diagnostics and pathogen discovery

We need more efforts in real-time sequence analysis for diagnostics and monitoring

We need better ways to publish and connect data sets online to enable more efficient and unbiased discovery

Online collaboration can help both through open data and online competitions

@mikaelhuss

http://followthedata.wordpress.com

Acknowledgements

Research environment: Thomas Svensson + the rest of the WABI group, Joakim Lundeberg + his group members

Helpful comments: Petter Holme, Stefania Giacomello, Mattias Andersson

Metagenomics discussions: Anders Andersson + group, Hilja Strid, Joakim Larsson, Johan Bengtsson-Palme

+ the readers of my blog and all the data enthusiasts in Stockholm and elsewhere!

Extra slides

Why hasn’t Hadoop caught on in genomics?

Hadoop is almost synonymous with big data in the corporate world

Ideas:

- Existing computing infrastructure is sufficient
- Or, focused on supercomputing solutions rather than commodity servers
- The programming skills and training are not there
- Many problems not parallelisable
- Not enough flexibility for exploratory analysis
