The Discovery Cloud: Accelerating Science via Outsourcing and Automation

Description

Director's Colloquium at Los Alamos National Laboratory, September 18, 2014. We have made much progress over the past decade toward harnessing the collective power of IT resources distributed across the globe. In high-energy physics, astronomy, and climate science, thousands of researchers work daily within virtual computing systems of global scope. But we now face a far greater challenge: exploding data volumes and powerful simulation tools mean that many more researchers (ultimately, perhaps most) will soon require capabilities not so different from those used by such big-science teams. How are we to meet these needs? Must every lab be filled with computers and every researcher become an IT specialist? Perhaps the solution is rather to move research IT out of the lab entirely: to leverage the "cloud" (whether private or public) to achieve economies of scale and reduce cognitive load. In this talk, I explore the past, present, and potential future of large-scale outsourcing and automation for science.

Transcript of The Discovery Cloud: Accelerating Science via Outsourcing and Automation

The Discovery Cloud! Accelerating science via outsourcing and automation

Ian Foster, Argonne National Laboratory and University of Chicago

foster@anl.gov

ianfoster.org

[Diagram] The discovery process: iterative and time-consuming. A cycle: pose question → hypothesize explanation → design experiment → collect data → analyze data → identify patterns → test hypothesis → publish results.

We've got no money, so we've got to think

Ernest Rutherford

Civilization advances by extending the number of important operations which we can perform without thinking about them

Alfred North Whitehead (1911)

About 85% of my “thinking” time was spent getting into a position to think, to make a decision, to learn something I needed to know

J.C.R. Licklider, 1960

Automation is required to apply more sophisticated methods at larger scales

Outsourcing is needed to achieve economies of scale in the use of automated methods

Outsourcing and automation: (1) The Grid

A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to computational capabilities

Foster and Kesselman, 1998

Higgs discovery "only possible because of the extraordinary achievements of … grid computing" (Rolf Heuer, CERN Director-General)

Tens of petabytes, hundreds of institutions, thousands of scientists, hundreds of thousands of CPUs, billions of tasks

Outsourcing and automation: (2) The Cloud

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction

NIST, 2011

[Diagram] Tripit exemplifies process automation. On a timeline: I book flights and a hotel; Tripit and other services record the flights, suggest a hotel, record the hotel, get the weather, prepare maps, share the info, monitor prices, and monitor the flight.

How the "business cloud" works

Platform services:
• Database, analytics, application, deployment, workflow, queuing
• Auto-scaling, Domain Name Service, content distribution
• Elastic MapReduce, streaming data analytics
• Email, messaging, transcoding; many more

Infrastructure services:
• Computing, storage, networking
• Elastic capacity
• Multiple availability zones
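These services are all programmable. As a minimal sketch (not from the talk), here is how one might provision storage and compute on demand using the AWS boto3 SDK; the bucket name, AMI ID, and instance type are hypothetical placeholders:

```python
# Minimal sketch: on-demand infrastructure via boto3.
# The bucket name, AMI ID, and instance type below are hypothetical.
import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# Elastic storage: a bucket for staging experimental data.
s3.create_bucket(Bucket="my-lab-staging-data")

# Elastic compute: launch analysis workers only when needed.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical analysis image
    InstanceType="m5.large",
    MinCount=1,
    MaxCount=4,  # scale out for a burst of work
)
```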

The Intelligence Cloud

Process automation for science, on a timeline: run experiment → collect data → move data → check data → annotate data → share data → find similar data → link to literature → analyze data → publish data.
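A minimal sketch (not from the talk) of what chaining these stages might look like in Python; every step function is a hypothetical stub standing in for a real instrument, transfer, or analysis service:

```python
# Hypothetical stubs: each would wrap a real instrument or service call.
def collect_data(run):  run["raw"] = "detector frames"; return run
def move_data(run):     run["location"] = "analysis cluster"; return run
def check_data(run):    run["checked"] = True; return run
def annotate_data(run): run["metadata"] = {"sample": "sample-42"}; return run
def analyze_data(run):  run["result"] = "fitted parameters"; return run
def publish_data(run):  run["doi"] = "10.0000/placeholder"; return run

PIPELINE = [collect_data, move_data, check_data,
            annotate_data, analyze_data, publish_data]

def run_pipeline(run_id):
    """Push one experimental run through every stage, unattended."""
    run = {"id": run_id}
    for step in PIPELINE:
        run = step(run)
        print(f"{run_id}: {step.__name__} done")
    return run

run_pipeline("run-001")
```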

Automate and outsource: the Discovery Cloud

[Diagram] Data flows from sources (next-gen genome sequencer, telescope, simulation) through staging, ingest, and analysis into a community repository, with archive, mirror, and registry services.

In millions of labs worldwide, researchers struggle with massive data, advanced software, complex protocols, and burdensome reporting.

Globus research data management services

www.globus.org

"I need to easily, quickly, and reliably mirror [portions of] my data to other places."

[Diagram] Endpoints: research computing HPC cluster, lab server, campus home filesystem, desktop workstation, personal laptop, XSEDE resource, public cloud.

“I need to easily and securely share my data with colleagues.”

“I need to get data from a scientific instrument to my analysis server.”

[Diagram] Instruments: next-gen sequencer, light sheet microscope, MRI, Advanced Light Source.

Globus transfer and sharing; identity and group management; data discovery and publication.

25,000 users; 60 PB and 3 billion files transferred; 8,000 endpoints.
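Scripted, these services make mirroring a one-call operation. Here is a minimal sketch using the Globus Python SDK (globus_sdk); the access token, endpoint IDs, and paths are hypothetical placeholders:

```python
# Minimal sketch: mirroring a directory between two Globus endpoints.
# Token, endpoint IDs, and paths are hypothetical placeholders.
import globus_sdk

authorizer = globus_sdk.AccessTokenAuthorizer("ACCESS_TOKEN")
tc = globus_sdk.TransferClient(authorizer=authorizer)

task = globus_sdk.TransferData(
    tc,
    source_endpoint="aaaaaaaa-1111-2222-3333-444444444444",       # lab server
    destination_endpoint="bbbbbbbb-5555-6666-7777-888888888888",  # HPC cluster
    label="mirror raw data",
    sync_level="checksum",  # copy only files that changed
)
task.add_item("/data/run-001/", "/project/mirror/run-001/", recursive=True)

result = tc.submit_transfer(task)
print("transfer task id:", result["task_id"])
```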

The Globus Galaxies platform: science as a service

Globus Galaxies platform: tool and workflow execution, publication, discovery, sharing; identity management; data management; task scheduling.

Infrastructure services: EC2, EBS, S3, SNS, Spot, Route 53, CloudFormation.

Built on the platform: eMatter materials science, FACE-IT, PDACS.

Globus Genomics: flexible, scalable, affordable genomics analysis for all biologists

Ravi Madduri, Paul Davé, Dina Sulakhe, Alex Rodriguez

[Diagram] Data endpoints: sequencing centers, public data storage, local cluster/cloud, sequencing center, research lab.

Globus provides a high-performance, fault-tolerant, secure file transfer service between all data endpoints.

[Diagram] Data management and data analysis: FASTQ reads and a reference genome flow through alignment and variant calling (using tools such as Picard and GATK), with results stored in Galaxy data libraries.
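As a rough illustration (not the platform's actual implementation), the alignment and variant-calling stages can be sketched as shell steps driven from Python; file names are placeholders, and exact tool options vary by version:

```python
# Illustrative only: the shape of an alignment -> variant-calling pipeline.
# File names are placeholders; tool options vary across versions.
import subprocess

def run(cmd):
    print("running:", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("bwa mem ref.fa sample.fastq > sample.sam")                    # alignment
run("samtools sort sample.sam -o sample.bam")                      # sort
run("samtools index sample.bam")
run("gatk HaplotypeCaller -R ref.fa -I sample.bam -O sample.vcf")  # variants
```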

Globus Genomics on Amazon EC2

• Analytical tools are automatically run on scalable compute resources when possible
• Globus integrated within Galaxy
• Web-based UI
• Drag-and-drop workflow creation
• Easily modify workflows with new tools

Galaxy-based workflow management
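Galaxy workflows can also be driven programmatically. A minimal sketch using the BioBlend client library for Galaxy's API; the server URL, API key, and workflow and dataset IDs are hypothetical placeholders:

```python
# Minimal sketch: invoking a Galaxy workflow via BioBlend.
# URL, key, and IDs are hypothetical placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="API_KEY")

history = gi.histories.create_history(name="exome-run-001")

# Map the workflow's first input step to an already-uploaded dataset.
inputs = {"0": {"src": "hda", "id": "DATASET_ID"}}

invocation = gi.workflows.invoke_workflow(
    workflow_id="WORKFLOW_ID",
    inputs=inputs,
    history_id=history["id"],
)
print("invocation state:", invocation["state"])
```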

[Diagram] Connections among endpoints and Globus Genomics, labeled with transfer protocols: FTP, SCP, HTTP.

It’s proving popular

Dobyns Lab, Cox Lab, Volchenboum Lab, Olopade Lab, Nagarajan Lab

2.5 million core hours used in the first six months of 2014

Costs are remarkably low. Pricing includes:
• Estimated compute
• Storage (one month)
• Globus Genomics platform usage
• Support

Data services as community resources: metagenomics.anl.gov and kbase.us.

Linking simulation and experiment to study disordered structures

Diffuse scattering images from Ray Osborn et al., Argonne

[Diagram] Sample → experimental scattering; material composition (La 60%, Sr 40%) → simulated structure → simulated scattering.

[Diagram] A closed loop: detect errors (seconds to minutes) → select experiments (minutes to hours) → simulations driven by experiments (minutes to days) → contribute to knowledge base. A knowledge base of past experiments, simulations, literature, and expert knowledge supports knowledge-driven decision making and evolutionary optimization.
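A minimal sketch (not the actual codes) of the evolutionary-optimization idea: candidate structure models are scored by the mismatch between simulated and measured scattering, then selected and mutated. The simulator and data here are hypothetical stand-ins:

```python
# Toy evolutionary optimization of a structure model against measured
# scattering. simulate_scattering and EXPERIMENT are hypothetical stand-ins.
import random

def simulate_scattering(params):
    # Placeholder for a real diffuse-scattering simulation code.
    return [p * 2.0 for p in params]

EXPERIMENT = [1.2, 0.4, 2.2]  # hypothetical measured pattern

def mismatch(params):
    sim = simulate_scattering(params)
    return sum((s - e) ** 2 for s, e in zip(sim, EXPERIMENT))

def evolve(generations=50, pop_size=20):
    pop = [[random.uniform(0, 2) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=mismatch)            # best-fitting models first
        parents = pop[: pop_size // 2]    # selection
        children = [[p + random.gauss(0, 0.05) for p in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]  # mutation
        pop = parents + children
    return min(pop, key=mismatch)

print("best-fit parameters:", evolve())
```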

Integrate data movement, management, workflow, and computation to accelerate data-driven applications

New data, computational capabilities, and methods create opportunities and challenges.

Integrate statistics and machine learning to assess many models and calibrate them against "all" relevant data.

New computer facilities enable on-demand computing and high-speed analysis of large quantities of data.

A lab-wide data architecture and facility

Immediate assessment of alignment quality in near-field high-energy diffraction microscopy

[Figure] Alignment quality before and after.

Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer

[Chart] One APS data node reaching 125 destinations; same node on a 1 Gbps link.

Accelerate discovery via automation and outsourcing

And at the same time:
– Enhance reproducibility
– Encourage entrepreneurial science
– Democratize access and contributions
– Enhance collaboration

The Discovery Cloud!

My work is supported by:

U.S. DEPARTMENT OF ENERGY

Questions?

foster@anl.gov