EPA 2013 Air Sensors Meeting Big Data Talk

Post on 16-Jun-2015

https://sites.google.com/site/airsensors2013/final-materials

BIG DATA (IN BIOLOGY): INTEGRATING LARGE, FAST-MOVING, HETEROGENEOUS DATASETS

Adina Howe

Argonne National Laboratory

Michigan State University

EPA Air Sensors 2013: Data Quality and Applications

March 19, 2013

Introduction – My perspective

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Engineering

Microbial Ecology

Bioinformatics

THE DATA DELUGE

An exponential landscape

Next-generation sequencing growth outpacing computational resources

Stein, Genome Biology, 2010

Log scale!
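
Why the log scale matters can be shown with a short worked relation. This is only a sketch: the symbols D (sequence data volume), C (computational capacity), and the doubling times T_d and T_c are assumptions introduced here for illustration, not values taken from the cited figure.

```latex
% Assume both quantities grow exponentially, with doubling times T_d < T_c.
D(t) = D_0 \, 2^{t/T_d}, \qquad C(t) = C_0 \, 2^{t/T_c}

% On a log axis each curve is a straight line whose slope is 1/T:
\log_2 D(t) = \log_2 D_0 + \frac{t}{T_d}, \qquad
\log_2 C(t) = \log_2 C_0 + \frac{t}{T_c}

% The ratio of data to compute therefore grows exponentially as well:
\frac{D(t)}{C(t)} = \frac{D_0}{C_0} \, 2^{\,t\left(\frac{1}{T_d} - \frac{1}{T_c}\right)}
```

Under these assumptions, two lines that diverge steadily on a log plot correspond to a gap that multiplies, rather than adds, with each passing year.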

Effects of low cost sequencing…

1995 First free-living bacterium sequenced

for billions of dollars and years of analysis

A personal genome can now be mapped in a few days for hundreds to a few thousand dollars

Effects of low cost sequencing on research

Sboner et al., Genome Biology, 2011

Technology

Core competency

Value added

RETHINKING

What it takes to deliver

Technical obstacles in the big data deluge

• Access to the data and its value

• Access to the resources

Democratization of both data and resource access

“80% of awards and 50% of $$ are for grants < $350,000”

Root causes:

• Data volume and velocity “clog”

• Data is very heterogeneous

• Previous efforts are difficult to integrate

• Innovation is necessary but hard

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Social obstacles are the most difficult.

• A shift in costs does not mean a shift in expectations

• “Give me the answer so I can get back to work.”

• A culture of sharing (data, time, and tools)

• Evolution of necessary training

• Creating teams that can communicate across domains

• Incentives are not strong enough

• Patterns for success (useful data sharing and collaboration) are not apparent or well understood.

POSSIBLE SOLUTIONS

Common solutions: been there, done that

http://xkcd.com/927/

What would an ideal solution look like?

• Flexible access to data, tools, and resources

• Cost effective, consistent, reusable (scalable)

• Rapid exploration

• Incentives to participate, share, communicate

• Community sandbox (vs lab-specific)

• Painless

Platform which supports an “ecology” of databases, interfaces, and analysis software.

The success of organization: Amazon

• > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services.

• Continually changing/updating data sets.

• Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.

• For example, the Amazon.com website is itself built from over 150 independent services…

• Amazon routinely deploys new services and functionality.

http://highscalability.com/amazon-architecture

https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

Amazon development guideline:

Colloquially said, “You should eat your own dogfood.”

Design and implement the database and database functionality to meet your own needs; only use the functionality you’ve explicitly made available to everyone.

To adapt to research: database functionality should be designed in tight integration with researchers who are using it, both at a user interface level and programmatically.
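
The "dogfood" guideline can be illustrated with a minimal sketch. Everything named here (SampleStore, mean_reading, the site identifiers) is hypothetical and invented for illustration; it is not Amazon's or any real project's API. The point is only that the developers' own analysis goes through the same public methods an outside user would call.

```python
from dataclasses import dataclass, field


@dataclass
class SampleStore:
    """Hypothetical data service; put/get/ids are the only supported API."""
    _records: dict = field(default_factory=dict)

    def put(self, sample_id: str, reading: float) -> None:
        self._records[sample_id] = reading

    def get(self, sample_id: str) -> float:
        return self._records[sample_id]

    def ids(self) -> list:
        return sorted(self._records)


def mean_reading(store: SampleStore) -> float:
    """Internal analysis written against the public API only.

    It never reads store._records directly, so an external user calling
    the same methods gets exactly what the developers rely on.
    """
    ids = store.ids()
    if not ids:
        raise ValueError("store is empty")
    return sum(store.get(i) for i in ids) / len(ids)


store = SampleStore()
store.put("site-a", 12.0)
store.put("site-b", 18.0)
print(mean_reading(store))  # 15.0
```

If the internal analysis needed something the public methods could not provide, the guideline would suggest extending the public API rather than reaching into private state.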

If the “customers” aren’t integrated into the development loop:

http://blog.thingsdesigner.com/uploads/id/tree_swing_development_requirements.jpg

DOE Knowledgebase (KBase)

• Emerging software and data environment to enable researchers

• Service-oriented architecture where biological data are integrated into a single data model, with KBase services loosely coupled to achieve various functions

• Open development environments for community contribution (public data, services, software)

• Provides robust and scalable infrastructure (with some level of support)

https://kbase.us
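
The idea of integrating heterogeneous data into a single data model, with loosely coupled services on top, might be sketched as follows. This is not KBase's actual data model or API: the Observation class, both input layouts, the field names, and the unit conversion are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical shared data model that every service reads and writes."""
    source: str
    sample_id: str
    value: float
    unit: str


def from_lab_csv_row(row: dict) -> Observation:
    # Assumed layout: a lab export with string columns "id" and "conc_mg_l".
    return Observation("lab_csv", row["id"], float(row["conc_mg_l"]), "mg/L")


def from_sensor_json(msg: dict) -> Observation:
    # Assumed layout: a sensor reporting micrograms per litre; convert to mg/L.
    return Observation("sensor", msg["sample"], msg["ug_per_l"] / 1000.0, "mg/L")


def summarize(observations: list) -> dict:
    """Downstream service that depends only on Observation, not on source formats."""
    by_sample = {}
    for obs in observations:
        by_sample.setdefault(obs.sample_id, []).append(obs.value)
    return {k: sum(v) / len(v) for k, v in by_sample.items()}


records = [
    from_lab_csv_row({"id": "s1", "conc_mg_l": "2.0"}),
    from_sensor_json({"sample": "s1", "ug_per_l": 4000}),
]
print(summarize(records))  # {'s1': 3.0}
```

Adding a new data source in this sketch means writing one more adapter function; the summarize service does not change, which is the sense in which the pieces are loosely coupled.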

KBase uses a service-oriented architecture

http://kbase.us/files/6913/4990/5274/Infrastructure.pptx.pdf

Higher level functions

DOE KBase Investment

“…may also apply for additional supplemental funding of up to $300,000 per year for development of systems biology and –omics data driven applications in collaboration with the DOE Systems Biology Knowledgebase.”

Free tutorials and workshops are provided for the community.

Advice for the next round…

Data generator:

• Managing expectations and value

Developer:

• “Eat your own dogfood”

Data analyzer:

• Analyze with reproducibility in mind

Access

Training

Communication

Platform / Teams
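
One way to "analyze with reproducibility in mind" is to store a provenance record next to every result. The sketch below is an assumption about what such a record could contain (input checksum, parameters, interpreter version, operating system); the function names and the toy filter are hypothetical and not taken from the talk.

```python
import hashlib
import json
import platform
import sys


def provenance_record(input_bytes: bytes, params: dict) -> dict:
    """Collect what someone would need to rerun this analysis later."""
    return {
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "params": params,
        "python": sys.version.split()[0],
        "platform": platform.system(),
    }


def analyze(input_bytes: bytes, params: dict) -> dict:
    # Toy analysis: count whitespace-separated values at or above a threshold.
    values = [float(x) for x in input_bytes.decode().split()]
    kept = [v for v in values if v >= params["min_value"]]
    return {
        "n_kept": len(kept),
        "provenance": provenance_record(input_bytes, params),
    }


result = analyze(b"0.5 1.5 2.5", {"min_value": 1.0})
print(json.dumps(result, indent=2, sort_keys=True))
```

Because the checksum and parameters travel with the result, a collaborator can confirm they are rerunning the same input with the same settings before comparing numbers.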

Big data is a community problem and solution

Resources

• Amazon interviews

http://highscalability.com/amazon-architecture

• Titus Brown’s blog post on heterogeneous data integration

http://ivory.idyll.org/blog/software-architecture-for-heterogeneous-data-integration.html

• KBase website

http://www.kbase.us

• Software Carpentry – “helping scientists build better software”

http://software-carpentry.org

Thanks!

Please feel free to contact me:

http://adina.github.com

adina@anl.gov

http://cheezburger.com/6983817216