Transcript of "2014 aus-agta"

Australasian Genomics Technology Association 2014 talk

Page 1: 2014 aus-agta

WHAT’S AHEAD FOR BIOLOGY? THE DATA INTENSIVE FUTURE

C. Titus Brown

[email protected]

Assistant Professor, Michigan State University

(In January, moving to UC Davis / VetMed.)

Talk slides on slideshare.net/c.titus.brown

Page 2: 2014 aus-agta

The Data Deluge (a traditional requirement for these talks)

Page 3: 2014 aus-agta

The short version

• Data gathering & storage are growing by leaps and bounds!

• Biology is completely unprepared for this at every level:
  • Technical and infrastructure
  • Cultural
  • Training

• Our funding/incentivization/prioritization structures are also largely unprepared.

• This is a huge missed opportunity!!

(What does Titus think we should be doing?)

Page 4: 2014 aus-agta

Challenges:

1. Dealing with Big Data (my current research)

2. Interpreting the unknowns (future research)

3. Accelerating research with better data/methods/results sharing.

4. Expanding the role of exploratory data analysis in biology. (career windmill)

Page 5: 2014 aus-agta

1. Dealing with Big Data

A. Lossy compression

B. Streaming algorithms

Page 6: 2014 aus-agta

Looking forward 5 years…

Navin et al., 2011

Page 7: 2014 aus-agta

Some basic math:
• 1,000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp per cell…
• …or 120 Tbp of data in total.

• A HiSeq X Ten can do the sequencing in ~3 weeks.

• The variant calling will require ~2,000 CPU weeks…

• …so, given ~2,000 computers, we can do it all in one month.
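As a quick sanity check, the arithmetic behind those numbers (assuming a ~3 Gbp haploid human genome, which the slide leaves implicit):

```python
# Back-of-the-envelope check of the slide's numbers.
# Assumption: ~3 Gbp haploid human genome (not stated on the slide).
haploid_genome_bp = 3e9
coverage = 40                # per-cell sequencing depth
n_cells = 1000

per_cell_bp = haploid_genome_bp * coverage      # 1.2e11 bp = 120 Gbp
total_bp = per_cell_bp * n_cells                # 1.2e14 bp = 120 Tbp
print(f"{per_cell_bp / 1e9:.0f} Gbp per cell; {total_bp / 1e12:.0f} Tbp total")

cpu_weeks = 2000             # slide's estimate for variant calling
machines = 2000
print(f"~{cpu_weeks / machines:.0f} week of wall-clock compute on {machines} machines")
```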

Page 8: 2014 aus-agta

Similar math applies to:
• pathogen detection in blood;
• environmental sequencing;
• sequencing rare DNA from circulating blood.

Two issues:

• volume of data & compute infrastructure;

• latency for clinical applications.

Page 9: 2014 aus-agta

Approach A: Lossy compression

Lossy compression can substantially reduce data size while retaining the information needed for later (re)analysis.

(Reduce volume of data & compute infrastructure requirements)

Pages 10–14: 2014 aus-agta

Lossy compression

[Image slides illustrating lossy compression with a JPEG example; see http://en.wikipedia.org/wiki/JPEG]

Pages 15–20: 2014 aus-agta

Digital normalization

[Image slides illustrating the digital normalization process]
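For readers unfamiliar with the method: digital normalization streams through the reads and discards any read whose estimated coverage is already high, bounding memory and compute. Below is a minimal sketch of the idea; the production implementation (the khmer software) uses a memory-efficient probabilistic counting structure rather than the exact dictionary here, and the parameter values are illustrative:

```python
# Minimal sketch of digital normalization (illustrative, not the khmer code).
from collections import defaultdict

def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below `cutoff`."""
    counts = defaultdict(int)   # exact k-mer counts; khmer uses a sketch instead
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        # Estimate this read's coverage as the median count of its k-mers.
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        if median < cutoff:
            kept.append(read)
            for km in kmers:    # only kept reads contribute to the counts
                counts[km] += 1
    return kept
```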

Page 21: 2014 aus-agta

e.g., de novo assembly now scales with richness, not diversity.

• 10–100 fold decrease in memory requirements
• 10–100 fold speedup in analysis

Brown et al., arXiv, 2012

Page 22: 2014 aus-agta

Hey, cool, our approach and software are used by Illumina for long-read sequencing!

Page 23: 2014 aus-agta

Our general strategy: compressive prefilters

Page 24: 2014 aus-agta

Approach B: streaming data analysis

See also eXpress, Roberts et al., 2013.

(Reduce latency for clinical applications)

Page 25: 2014 aus-agta

Current variant calling approaches are multipass

Page 26: 2014 aus-agta

Streaming graph-based approaches can detect information saturation
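To make the idea concrete, here is a minimal sketch of streaming saturation detection; it assumes exact k-mer counts and illustrative parameter names, and is not the actual implementation. As reads arrive, it tracks what fraction of recent reads are already well covered, signalling when little new information remains in the stream:

```python
# Minimal sketch of streaming saturation detection (illustrative only).
from collections import defaultdict, deque

def saturation_stream(reads, k=20, cutoff=20, window=1000):
    """Yield a rolling estimate of the fraction of 'already seen' reads."""
    counts = defaultdict(int)
    recent = deque(maxlen=window)   # 1 = saturated read, 0 = novel read
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        median = sorted(counts[km] for km in kmers)[len(kmers) // 2]
        recent.append(1 if median >= cutoff else 0)
        for km in kmers:
            counts[km] += 1
        yield sum(recent) / len(recent)  # near 1.0 => information saturated
```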

Page 27: 2014 aus-agta

Approach supports compute-intensive interludes – remapping, etc.

Rimmer et al., 2014

Page 28: 2014 aus-agta

Streaming with bases

Page 29: 2014 aus-agta

Integrate sequencing and analysis.
Decrease latency!

Page 30: 2014 aus-agta

So, how do we deal with Big Data issues?

• Fairly record cost of data analysis (running software & cost of computational infrastructure)

• This incentivizes development of better approaches!

• Lossy compression, streaming, …??

• Think 5 years ahead, rather than 2 years behind!

• Pay attention to workflows, software lifecycle, etc. etc.

(See ABiC 2014 talk :)

Page 31: 2014 aus-agta

2. Dealing with the unknowns

Page 32: 2014 aus-agta

“What is the function of …?”

We can observe almost everything at a DNA/RNA level!

But:
• Experimentally based functional annotations are sparse;
• Most genes play multiple roles and are generally annotated for only one;
• Model organisms are phylogenetically quite limited and biased;
• …and there is little or no $$$ or reputation gain for characterizing novel genes (nor is it straightforward or easy to do so!)

Page 33: 2014 aus-agta

"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery

being associated with greater research momentum—a genomic bandwagon effect."

Ref.: Pandey et al. (2014), PLoS One 11, e88889. Slide courtesy Erich Schwarz

The problem of lopsided gene characterization:e.g., the brain "ignorome"

Page 34: 2014 aus-agta

How do we systematically broaden our functional understanding of genes?

1. More experimental work! Population studies, perturbation studies, good ol’ fashioned molecular biology, etc.

2. Integrate modeling, to see where we have (or lack) sufficient knowledge for a particular phenotype.

3. Sequence it all and let the bioinformaticians sort it out!

What I think will work best: a tight integration of all three approaches (c.f. physics) – hypothesis-driven investigation, modeling, and exploratory data science.

See also: ivory.idyll.org/blog/2014-function-of-unknown-genes.html

Page 35: 2014 aus-agta

3. Accelerating research with better sharing of results, data, methods.

Our current journal system is a 20th century solution to a 17th century problem.

- Paraphrased from Cameron Neylon

(Note: 20th century was LAST century)

Page 36: 2014 aus-agta

3. Accelerating research with better sharing of results, data, methods.

We could accelerate research with better sharing.

Recent example re rare diseases:

http://www.newyorker.com/magazine/2014/07/21/one-of-a-kind-2

“The current academic publication system does patients an enormous disservice.” – Daniel MacArthur

There are many barriers to better communication of results, data, and methods, but most of them are cultural, not technical. (Much harder!)

Page 37: 2014 aus-agta

Preprints

• Many fields (including bioinformatics and, increasingly, genomics) routinely share papers prior to publication. This facilitates reproduction, dissemination, and ultimately progress.

• Biology is behind the times!

See:

1. Haldane’s Sieve (blog discussion of preprints)

2. Evidence that preprints confer a massive citation advantage in physics (http://arxiv.org/abs/0906.5418)

Page 38: 2014 aus-agta

Current model for data sharing

In a data-limited world, this kind of made sense.

Page 39: 2014 aus-agta

Current model for data sharing

This model ignores the fact that data often has multiple (unrealized or serendipitous) uses.

(Among many other problems ;)

Page 40: 2014 aus-agta

The train wreck ahead

When data is cheap and interpretation is expensive, most data doesn’t get published, and is therefore lost.

(Program managers are not fans of this.)

Page 41: 2014 aus-agta

Data sharing challenges:

• There are few immediate or career rewards for sharing data; incentives are almost entirely punitive (if you DON’T…).

• Sharing data in a usable form is still rather difficult.

• Submitting data to archival services is, in many cases, surprisingly difficult.

• There are few methods for gaining recognition for data sharing prior to publication of conclusions.

Page 42: 2014 aus-agta

The Ocean Cruise Model

DeepDOM – photo courtesy E. Kujawinski, WHOI

One really expensive cruise, many data collectors, shared data.

Page 43: 2014 aus-agta

Sage Bionetworks / “walled garden”

Collaborative data sharing policy with restricted access to outsiders;

Central platform with analysis provenance tracking;

A model for the future of biomedical research?

See, e.g., "Enabling transparent and collaborative computational analysis of 12 tumor types within The Cancer Genome Atlas," Omberg et al., 2014.

Page 44: 2014 aus-agta

Distributed cyberinfrastructure to encourage sharing?

ivory.idyll.org/blog/2014-moore-ddd-talk.html

Pages 45–46: 2014 aus-agta

Better metadata collection is needed!

Suppose the NSA could EITHER track who was calling whom, OR what they were saying – which would be more valuable?

[Diagram: Who? What? Who?]

Page 47: 2014 aus-agta

Better metadata collection is needed!

We need to track sample origin, phenotype/environmental conditions, etc.

[Diagram: Sample information / The –omic data / Phenotype]

This will facilitate discovery, serendipity, re-analysis, and cross-validation.
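As a purely hypothetical illustration of the kind of record this implies (all field names and values here are invented for the sketch, not any existing standard):

```python
# Hypothetical per-sample metadata record linking sample origin,
# conditions, phenotype, and -omic data; every field name is illustrative.
sample_record = {
    "sample_id": "S0001",
    "origin": {"site": "surface seawater", "collected": "2014-10-01"},
    "conditions": {"temperature_C": 18.5, "pH": 8.1},
    "phenotype": "bloom-associated community",
    "omic_data": ["SRR0000001"],   # accession(s) for the sequence data
}
print(sample_record["sample_id"], "->", sample_record["omic_data"])
```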

Page 48: 2014 aus-agta

Data and software citation

There are now methods for:

• assigning DOIs to data (which makes it citable) – figshare, Dryad;

• data publications – GigaScience, SIGS, Scientific Data;

• software citations – Zenodo, Mozilla Science Lab/GitHub;

• software publications – F1000Research.

Will this address the need to incentivize data and methods sharing? Probably not, but it’s a good start ;)

Page 49: 2014 aus-agta

4. Exploratory data analysis

Old model:

Page 50: 2014 aus-agta

New model

Your data is most useful when combined with everyone else’s.

Page 51: 2014 aus-agta

Given enough publicly accessible data…

Page 52: 2014 aus-agta

But: we face a lack of training.

The lack of training in data science is the biggest challenge facing biology.

Students! There’s a great future in data analysis!

Also see:

Page 53: 2014 aus-agta

Data integration?

Once you have all the data, what do you do?

"Business as usual simply cannot work."

Looking at millions to billions of genomes.

(David Haussler, 2014)

Illumina estimate: 228,000 human genomes will be sequenced in 2014, mostly by researchers.

http://www.technologyreview.com/news/531091/emtech-illumina-says-228000-human-genomes-will-be-sequenced-this-year/

Page 54: 2014 aus-agta

Looking to the future

For the senior scientists and funders amongst us:

• How do we incentivize data sharing and training?

• How do we fund the meso- and micro-scale cyberinfrastructure development that will accelerate biology?

See: ivory.idyll.org/blog/2014-nih-adds-up-meeting.html

The NIH and NSF are exploring this; the Moore and Sloan foundations are simply doing it (but at 1% of the size).

Page 55: 2014 aus-agta

Thanks for listening!

Page 56: 2014 aus-agta

combine.org.au

Annual Student Symposium
Friday 28th November 2014
Parkville, Victoria

Now accepting abstracts for talks and posters. Talk abstracts close 31st October.

For Australian students and early career researchers in bioinformatics and computational biology.