Transcript of "2014 aus-agta"

Australasian Genomics Technology Association 2014 talk

Page 1: 2014 aus-agta

WHAT’S AHEAD FOR BIOLOGY? THE DATA INTENSIVE FUTURE

C. Titus Brown

[email protected]

Assistant Professor, Michigan State University

(In January, moving to UC Davis / VetMed.)

Talk slides on slideshare.net/c.titus.brown

Page 2: 2014 aus-agta

The Data Deluge (a traditional requirement for these talks)

Page 3: 2014 aus-agta

The short version

• Data gathering & storage are growing by leaps and bounds!

• Biology is completely unprepared for this at every level:
  • Technical and infrastructure
  • Cultural
  • Training

• Our funding/incentivization/prioritization structures are also largely unprepared.

• This is a huge missed opportunity!!

(What does Titus think we should be doing?)

Page 4: 2014 aus-agta

Challenges:

1. Dealing with Big Data (my current research)

2. Interpreting the unknowns (future research)

3. Accelerating research with better data/methods/results sharing.

4. Expanding the role of exploratory data analysis in biology. (career windmill)

Page 5: 2014 aus-agta

1. Dealing with Big Data

A. Lossy compression

B. Streaming algorithms

Page 6: 2014 aus-agta

Looking forward 5 years…

Navin et al., 2011

Page 7: 2014 aus-agta

Some basic math:
• 1,000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp per cell…
• …or 120 Tbp of data in total.

• A HiSeq X Ten can do the sequencing in ~3 weeks.

• The variant calling will require ~2,000 CPU weeks…

• …so, given ~2,000 computers, we can do it all in one month.
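As a quick sanity check, the arithmetic behind those numbers (assuming a ~3 Gbp haploid human genome, which the slide leaves implicit):

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumption: ~3 Gbp haploid human genome (not stated on the slide).
haploid_genome_bp = 3e9
coverage = 40                # per-cell sequencing depth
n_cells = 1000

per_cell_bp = haploid_genome_bp * coverage      # 1.2e11 bp = 120 Gbp
total_bp = per_cell_bp * n_cells                # 1.2e14 bp = 120 Tbp
print(f"{per_cell_bp / 1e9:.0f} Gbp per cell; {total_bp / 1e12:.0f} Tbp total")

cpu_weeks = 2000             # slide's estimate for variant calling
machines = 2000
print(f"~{cpu_weeks / machines:.0f} week of wall-clock compute on {machines} machines")
```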

Page 8: 2014 aus-agta

Similar math applies to:
• pathogen detection in blood;
• environmental sequencing;
• sequencing rare DNA from circulating blood.

Two issues:

• volume of data & compute infrastructure;

• latency for clinical applications.

Page 9: 2014 aus-agta

Approach A: Lossy compression

Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.

(Reduce volume of data & compute infrastructure requirements)

Pages 10–14: 2014 aus-agta

Lossy compression

[Image slides illustrating lossy compression with a JPEG example; see http://en.wikipedia.org/wiki/JPEG]

Pages 15–20: 2014 aus-agta

Digital normalization

[Image slides illustrating the digital normalization process]
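For readers unfamiliar with the method: digital normalization streams through the reads and discards any read whose estimated coverage is already high, bounding memory and compute. Below is a minimal sketch of the idea; the production implementation (the khmer software) uses a memory-efficient probabilistic counting structure rather than the exact dictionary here, and the parameter values are illustrative:

```python
# Minimal sketch of digital normalization (illustrative, not the khmer code).
from collections import defaultdict

def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below `cutoff`."""
    counts = defaultdict(int)   # exact k-mer counts; khmer uses a sketch instead
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        # Estimate this read's coverage as the median count of its k-mers.
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if median < cutoff:
            kept.append(read)
            for km in kmers:    # only kept reads contribute to the counts
                counts[km] += 1
    return kept
```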

Page 21: 2014 aus-agta

e.g., de novo assembly now scales with richness, not diversity.

• 10–100 fold decrease in memory requirements
• 10–100 fold speedup in analysis

Brown et al., arXiv, 2012

Page 22: 2014 aus-agta

Hey, cool, our approach and software are used by Illumina for long-read sequencing!

Page 23: 2014 aus-agta

Our general strategy: compressive prefilters

Page 24: 2014 aus-agta

Approach B: streaming data analysis

See also eXpress, Roberts et al., 2013.

(Reduce latency for clinical applications)

Page 25: 2014 aus-agta

Current variant calling approaches are multipass

Page 26: 2014 aus-agta

Streaming graph-based approaches can detect information saturation
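To make the idea concrete, here is a minimal sketch of streaming saturation detection; it assumes exact k-mer counts and illustrative parameter names, and is not the actual implementation. As reads arrive, it tracks what fraction of recent reads are already well covered, signalling when little new information remains in the stream:

```python
# Minimal sketch of streaming saturation detection (illustrative only).
from collections import defaultdict, deque

def saturation_stream(reads, k=20, cutoff=20, window=1000):
    """Yield a rolling estimate of the fraction of 'already seen' reads."""
    counts = defaultdict(int)
    recent = deque(maxlen=window)   # 1 = saturated read, 0 = novel read
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        recent.append(1 if median >= cutoff else 0)
        for km in kmers:
            counts[km] += 1
        yield sum(recent) / len(recent)  # near 1.0 => information saturated
```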

Page 27: 2014 aus-agta

Approach supports compute-intensive interludes – remapping, etc.

Rimmer et al., 2014

Page 28: 2014 aus-agta

Streaming with bases

Page 29: 2014 aus-agta

Integrate sequencing and analysis.
Decrease latency!

Page 30: 2014 aus-agta

So, how do we deal with Big Data issues?

• Fairly record cost of data analysis (running software & cost of computational infrastructure)

• This incentivizes development of better approaches!

• Lossy compression, streaming, …??

• Think 5 years ahead, rather than 2 years behind!

• Pay attention to workflows, software lifecycle, etc. etc.

(See ABiC 2014 talk :)

Page 31: 2014 aus-agta

2. Dealing with the unknowns

Page 32: 2014 aus-agta

“What is the function of …?”

We can observe almost everything at a DNA/RNA level!

But:
• Experimentally based functional annotations are sparse;
• Most genes play multiple roles and are generally annotated for only one;
• Model organisms are phylogenetically quite limited and biased;
• …and there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)

Page 33: 2014 aus-agta

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery

being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Slide courtesy Erich Schwarz

The problem of lopsided gene characterization:e.g., the brain "ignorome"

Page 34: 2014 aus-agta

How do we systematically broaden our functional understanding of genes?

1. More experimental work! Population studies, perturbation studies, good ol’ fashioned molecular biology, etc.

2. Integrate modeling, to see where we have (or lack) sufficient knowledge for a particular phenotype.

3. Sequence it all and let the bioinformaticians sort it out!

What I think will work best: a tight integration of all three approaches (c.f. physics) – hypothesis-driven investigation, modeling, and exploratory data science.

See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html

Page 35: 2014 aus-agta

3. Accelerating research with better sharing of results, data, methods.

Our current journal system is a 20th century solution to a 17th century problem.

- Paraphrased from Cameron Neylon

(Note: 20th century was LAST century)

Page 36: 2014 aus-agta

3. Accelerating research with better sharing of results, data, methods.

We could accelerate research with better sharing.

Recent example re rare diseases:

http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2

“The current academic publication system does patients an enormous disservice.” – Daniel MacArthur

There are many barriers to better communication of results, data, and methods, but most of them are cultural, not technical. (Much harder!)

Page 37: 2014 aus-agta

Preprints

• Many fields (including bioinformatics and, increasingly, genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress.

• Biology is behind the times!

See:

1. Haldane’s Sieve (blog discussion of preprints)

2. Evidence that preprints confer a massive citation advantage in physics (http://arxiv.org/abs/0906.5418)

Page 38: 2014 aus-agta

Current model for data sharing

In a data-limited world, this kind of made sense.

Page 39: 2014 aus-agta

Current model for data sharing

This model ignores the fact that data often has multiple (unrealized or serendipitous) uses.

(Among many other problems ;)

Page 40: 2014 aus-agta

The train wreck ahead

When data is cheap and interpretation is expensive, most data doesn’t get published, and is therefore lost.

(Program managers are not fans of this.)

Page 41: 2014 aus-agta

Data sharing challenges:

• There are few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…).

• Sharing data in a usable form is still rather difficult.

• Submitting data to archival services is, in many cases, surprisingly difficult.

• There are few methods for gaining recognition for data sharing prior to publication of conclusions.

Page 42: 2014 aus-agta

The Ocean Cruise Model

DeepDOM – photo courtesy E. Kujawinski, WHOI

One really expensive cruise, many data collectors, shared data.

Page 43: 2014 aus-agta

Sage Bionetworks / “walled garden”

Collaborative data sharing policy with restricted access to outsiders;

Central platform with analysis provenance tracking;

A model for the future of biomedical research?

See, e.g., "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Omberg et al., 2014.

Page 44: 2014 aus-agta

Distributed cyberinfrastructure to encourage sharing?

ivory.idyll.org/blog/2014-moore-ddd-talk.html

Pages 45–46: 2014 aus-agta

Better metadata collection is needed!

Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable?

[Diagram: Who? What? Who?]

Page 47: 2014 aus-agta

Better metadata collection is needed!

We need to track sample origin, phenotype/environmental conditions, etc.

[Diagram: Sample information / The –omic data / Phenotype]

This will facilitate discovery, serendipity, re-analysis, and cross-validation.
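As a purely hypothetical illustration of the kind of record this implies (all field names and values here are invented for the sketch, not any existing standard):

```python
# Hypothetical per-sample metadata record linking sample origin,
# conditions, phenotype, and -omic data; every field name is illustrative.
sample_record = {
    "sample_id": "S0001",
    "origin": {"site": "surface seawater", "collected": "2014-10-01"},
    "conditions": {"temperature_C": 18.5, "pH": 8.1},
    "phenotype": "bloom-associated community",
    "omic_data": ["SRR0000001"],   # accession(s) for the sequence data
}
print(sample_record["sample_id"], "->", sample_record["omic_data"])
```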

Page 48: 2014 aus-agta

Data and software citation

There are now methods for:

• assigning DOIs to data (which makes it citable) – figshare, Dryad;

• data publications – GigaScience, SIGS, Scientific Data;

• software citations – Zenodo, Mozilla Science Lab/GitHub;

• software publications – F1000Research.

Will this address the need to incentivize data and methods sharing? Probably not, but it’s a good start ;)

Page 49: 2014 aus-agta

4. Exploratory data analysis

Old model:

Page 50: 2014 aus-agta

New model

Your data is most useful when combined with everyone else’s.

Page 51: 2014 aus-agta

Given enough publicly accessible data…

Page 52: 2014 aus-agta

But: we face a lack of training.

The lack of training in data science is the biggest challenge facing biology.

Students! There’s a great future in data analysis!

Also see:

Page 53: 2014 aus-agta

Data integration?

Once you have all the data, what do you do?

"Business as usual simply cannot work."

Looking at millions to billions of genomes.

(David Haussler, 2014)

Illumina estimate: 228,000 human genomes will be sequenced in 2014, mostly by researchers.

http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/

Page 54: 2014 aus-agta

Looking to the future

For the senior scientists and funders amongst us:

• How do we incentivize data sharing and training?

• How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate biology?

See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html

The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (but at 1% of the size).

Page 55: 2014 aus-agta

Thanks for listening!

Page 56: 2014 aus-agta

combine.org.au

Annual Student Symposium
Friday 28th November 2014
Parkville, Victoria

Now accepting abstracts for talks and posters. Talk abstracts close 31st October.

For Australian students and early career researchers in bioinformatics and computational biology.