Open Analytics Environment

Post on 10-May-2015


I summarize requirements for an "Open Analytics Environment" (aka "the Cauldron"), and some work being performed at the University of Chicago and Argonne National Laboratory towards its realization.

Transcript of Open Analytics Environment

Ian Foster

Computation Institute

Argonne National Lab & University of Chicago

Towards an Open Analytics Environment

2

The Computation Institute

A joint institute of Argonne and the University of Chicago, focused on furthering system-level science via the development and use of advanced computational methods.

Solutions to many grand challenges facing science and society today require the analysis and understanding of entire systems, not just individual components. They require not reductionist approaches but the synthesis of knowledge from multiple levels of a system, whether biological, physical, or social (or all three).

www.ci.uchicago.edu

Faculty, fellows, staff, students, computers, projects.

3

The Good Old Days: Astronomy ~1600

Figure labels: 30 years, ? years, 10 years, 6 years, 2 years

4

Astronomy, from 1600 to 2000

Automation: 10^-1 to 10^8 Hz data capture

Community: 10^0 to 10^4 astronomers (10^6 amateur)

Computation: 10^-1 to 10^15 Hz peak

Data: 10^6 to 10^15 B aggregate

Literature: 10^1 to 10^5 pages/year
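To make the scale of these changes concrete, the following minimal sketch converts each pair of figures into orders of magnitude and an implied average annual growth rate over the 400 years. It assumes the slide's values pair up as automation 10^-1 to 10^8 Hz, community 10^0 to 10^4, computation 10^-1 to 10^15 Hz, data 10^6 to 10^15 B, and literature 10^1 to 10^5 pages/year; the dictionary keys and function names are illustrative only, and steady compound growth is a simplification.

```python
import math

# (value ~1600, value ~2000), as read off the slide; the pairing of labels
# to values is an assumption made for this sketch.
GROWTH = {
    "automation_hz": (1e-1, 1e8),
    "community_astronomers": (1e0, 1e4),
    "computation_hz": (1e-1, 1e15),
    "data_bytes": (1e6, 1e15),
    "literature_pages_per_year": (1e1, 1e5),
}
YEARS = 400  # 1600 to 2000


def orders_of_magnitude(start, end):
    """Number of factors of ten between start and end."""
    return math.log10(end / start)


def annual_growth_rate(start, end, years=YEARS):
    """Constant yearly growth rate that turns start into end over `years`."""
    return (end / start) ** (1.0 / years) - 1.0


for name, (start, end) in GROWTH.items():
    oom = orders_of_magnitude(start, end)
    rate = annual_growth_rate(start, end)
    print(f"{name}: {oom:.0f} orders of magnitude, about {rate:.1%} per year")
```

Even the fastest-growing quantity here averages out to a modest yearly rate over four centuries; the difficulty for analysts comes from how much of that growth is concentrated in recent decades.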

5

Biomedical Research ~1600

6

Biomedical Research ~2000

...atcgaattccaggcgtcacattctcaattcca...

DNA sequences, alignments

MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...

Protein sequence, 2º structure, 3º structure

Protein-Protein Interactions

metabolism, pathways, receptor-ligand, 4º structure

Polymorphism and Variants

genetic variants, individual patients, epidemiology

Physiology, Cellular biology, Biochemistry, Neurobiology, Endocrinology, etc.

ESTs, Expression patterns, Large-scale screens

Genetics and Maps

Linkage, Cytogenetic, Clone-based

Scale labels in the figure: >10^6, >10^6, >10^9, >10^6, >10^5, >10^9

From John Wooley

7

Growth of Sequences and Annotations since 1982

Folker Meyer, Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade, CTWatch, August 2006.

8

The Analyst in Denial

“I just need a bigger disk (and workstation)”

9

An Open Analytics Environment

Results out

Data in

Programs & rules in

“No limits”: Storage, Computing, Format, Program

Allowing for: Versioning, Provenance, Collaboration, Annotation

10

o·pen [oh-puhn] adjective

having the interior immediately accessible

relatively free of obstructions to sight, movement, or internal arrangement

generous, liberal, or bounteous

in operation; live

readily admitting new members

not constipated

11

What Goes In (1)

12

What Goes In (2)

Rules

Workflows

Dryad

MapReduce

Parallel programs

SQL

BPEL

Swift

SCFL

R

MATLAB

Octave

13

How it Cooks

Virtualization: Run any program, store any data

Indexing: Automated maintenance

Provisioning: Policy-driven allocation of resources to competing demands
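The slide does not say which allocation policy is intended. As one possible reading of "policy-driven allocation of resources to competing demands", the sketch below implements weighted max-min fair sharing: each group gets a share proportional to its weight, nobody receives more than it asked for, and leftover capacity is redistributed. The function name, the group names, and the choice of policy are all assumptions made for illustration.

```python
def allocate(capacity, demands, weights):
    """Weighted max-min fair allocation (progressive filling).

    capacity: total units available (e.g. cores)
    demands:  {group: units requested}
    weights:  {group: relative priority}, the hypothetical "policy"
    """
    allocation = {name: 0.0 for name in demands}
    active = {name for name, demand in demands.items() if demand > 0}
    remaining = float(capacity)

    while active and remaining > 1e-9:
        total_weight = sum(weights[name] for name in active)
        satisfied = set()
        granted = 0.0
        for name in active:
            share = remaining * weights[name] / total_weight
            need = demands[name] - allocation[name]
            give = min(share, need)  # never exceed the request
            allocation[name] += give
            granted += give
            if allocation[name] >= demands[name] - 1e-9:
                satisfied.add(name)
        remaining -= granted
        if not satisfied:
            break  # everyone took a full share; capacity is used up
        active -= satisfied  # redistribute leftovers among the rest

    return allocation


example = allocate(
    capacity=100,
    demands={"genomics": 80, "astro": 30, "econ": 10},
    weights={"genomics": 1, "astro": 1, "econ": 1},
)
print(example)
```

With equal weights, the two small requests are met in full and the large one absorbs whatever capacity is left, which is the behavior one would usually want from a shared facility.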

14

What Comes Out

Data

15

Analysis as (Collaborative) Process

Transform, Annotate, Search, Add to, Tag

Visualize, Discover

Extend, Group, Share

16

Centralized or Distributed?

Both

17

Towards an Open Analysis Environment: (1) Applications

Astrophysics

Cognitive science

East Asian studies

Economics

Environmental science

Epidemiology

Genomic medicine

Neuroscience

Political science

Sociology

Solid state physics

18

Towards an Open Analysis Environment: (2) Hardware

SiCortex: 6K cores, 6 Top/s

IBM BG/P: 160K cores, 500 Top/s

PADS

10-40 Gbit/s

19

PADS: Petascale Active Data Store

500 TB reliable storage (data & metadata)

180 TB, 180 GB/s

17 Top/s analysis

Data ingest

Dynamic provisioning

Parallel analysis

Remote access

Offload to remote data centers

Diverse users

Diverse data sources

1000 TB tape backup
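A quick back-of-the-envelope check shows what these numbers imply for "active" data. The sketch assumes that "180 TB, 180 GB/s" describes the capacity and aggregate read bandwidth of the analysis store, that "17 Top/s" means 17 × 10^12 operations per second, and that units are decimal; the variable names are illustrative.

```python
# Figures from the slide, in decimal units (assumed interpretation).
analysis_store_bytes = 180e12      # 180 TB
aggregate_bandwidth_bps = 180e9    # 180 GB/s
analysis_ops_per_s = 17e12         # 17 Top/s

# Time to stream the entire analysis store once at full bandwidth.
full_scan_seconds = analysis_store_bytes / aggregate_bandwidth_bps

# Operations available per byte while streaming at full bandwidth.
ops_per_byte = analysis_ops_per_s / aggregate_bandwidth_bps

print(f"full scan: {full_scan_seconds:.0f} s (~{full_scan_seconds / 60:.1f} min)")
print(f"compute budget: ~{ops_per_byte:.0f} operations per byte read")
```

Under these assumptions the whole working set can be re-read in well under an hour, which is what makes interactive, repeated analysis over the full dataset plausible.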

20

Towards an Open Analysis Environment: (3) Methods

HPC systems software (MPICH, PVFS, etc.)

Collaborative data tagging (GLOSS)

Data integration (XDTM)

HPC data analytics and visualization

Loosely coupled parallelism (Swift, Hadoop)

Dynamic provisioning (Falkon)

Service authoring (Introduce, caGrid, gRAVI)

Provenance recording and query (Swift)

Service composition and workflow (Taverna)

Virtualization management

Distributed data management (GridFTP, etc.)

21

Tagging & Social Networking

GLOSS: Generalized Labels Over Scientific data Sources

22

XDTM: XML Data Typing & Mapping

Logical / Physical

./group23
drwxr-xr-x 4 yongzh users 2048 Nov 12 14:15 AA
drwxr-xr-x 4 yongzh users 2048 Nov 11 21:13 CH
drwxr-xr-x 4 yongzh users 2048 Nov 11 16:32 EC

./group23/AA:
drwxr-xr-x 5 yongzh users 2048 Nov 5 12:41 04nov06aa
drwxr-xr-x 4 yongzh users 2048 Dec 6 12:24 11nov06aa

./group23/AA/04nov06aa:
drwxr-xr-x 2 yongzh users 2048 Nov 5 12:52 ANATOMY
drwxr-xr-x 2 yongzh users 49152 Dec 5 11:40 FUNCTIONAL

./group23/AA/04nov06aa/ANATOMY:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:29 coplanar.hdr
-rw-r--r-- 1 yongzh users 16777216 Nov 5 12:29 coplanar.img

./group23/AA/04nov06aa/FUNCTIONAL:
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0001.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0001.img
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0002.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0002.img
-rw-r--r-- 1 yongzh users 496 Nov 15 20:44 bold1_0002.mat
-rw-r--r-- 1 yongzh users 348 Nov 5 12:32 bold1_0003.hdr
-rw-r--r-- 1 yongzh users 409600 Nov 5 12:32 bold1_0003.img

23

fMRI Type Definitions

type Study { Group g[ ]; }

type Group { Subject s[ ]; }

type Subject { Volume anat; Run run[ ]; }

type Run { Volume v[ ]; }

type Volume { Image img; Header hdr; }

type Image {};

type Header {};

type Warp {};

type Air {};

type AirVector { Air a[ ]; }

type NormAnat { Volume anat; Warp aWarp; Volume nHires; }
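The idea behind XDTM is that logical types such as Run and Volume are defined separately from the files that happen to hold them. The following sketch is not XDTM itself; it is a small Python analogue, under the assumption that a Volume is a matching `.hdr`/`.img` pair and a Run is the ordered set of such pairs sharing a file-name prefix. The `map_run` helper and its parameters are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Volume:
    img: str  # path to the image file
    hdr: str  # path to the header file


@dataclass
class Run:
    v: list = field(default_factory=list)  # ordered Volumes


def map_run(paths, prefix):
    """Map physical file names onto a logical Run.

    Files named <prefix>_NNNN.hdr and <prefix>_NNNN.img are paired into
    Volumes; anything else (for example .mat files) is ignored.
    """
    by_stem = {}
    for path in map(PurePosixPath, paths):
        if path.stem.startswith(prefix + "_") and path.suffix in (".hdr", ".img"):
            by_stem.setdefault(path.stem, {})[path.suffix] = str(path)

    run = Run()
    for stem in sorted(by_stem):  # zero-padded indices sort correctly
        parts = by_stem[stem]
        if ".hdr" in parts and ".img" in parts:
            run.v.append(Volume(img=parts[".img"], hdr=parts[".hdr"]))
    return run


functional_dir = "./group23/AA/04nov06aa/FUNCTIONAL"
listing = [
    f"{functional_dir}/{name}"
    for name in (
        "bold1_0001.hdr", "bold1_0001.img",
        "bold1_0002.hdr", "bold1_0002.img", "bold1_0002.mat",
        "bold1_0003.hdr", "bold1_0003.img",
    )
]
run = map_run(listing, "bold1")
print(len(run.v), "volumes;", "first image:", run.v[0].img)
```

Analysis code written against `Run` and `Volume` then never needs to know the directory layout, so the same script can be applied to data organized differently by supplying a different mapper.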

24

High-Performance Data Analytics

Functional MRI

Ben Clifford, Mihael Hategan, Mike Wilde, Yong Zhao

25

SwiftScript for fMRI Data Analysis

(Run snr) functional ( Run r, NormAnat a, Air shrink ) {
    Run yroRun = reorientRun( r , "y" );
    Run roRun = reorientRun( yroRun , "x" );
    Volume std = roRun[0];
    Run rndr = random_select( roRun, 0.1 );
    AirVector rndAirVec = align_linearRun( rndr, std, 12, 1000, 1000, "81 3 3" );
    Run reslicedRndr = resliceRun( rndr, rndAirVec, "o", "k" );
    Volume meanRand = softmean( reslicedRndr, "y", "null" );
    Air mnQAAir = alignlinear( a.nHires, meanRand, 6, 1000, 4, "81 3 3" );
    Warp boldNormWarp = combinewarp( shrink, a.aWarp, mnQAAir );
    …
}

(Run or) reorientRun (Run ir, string direction) {
    foreach Volume iv, i in ir.v {
        or.v[i] = reorient(iv, direction);
    }
}
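The `foreach` in `reorientRun` expresses that each volume can be processed independently, which is what lets the runtime execute the iterations in parallel. A rough Python analogue, assuming a thread pool stands in for the Swift runtime and a string-returning `reorient` stands in for the real application, might look like this:

```python
from concurrent.futures import ThreadPoolExecutor


def reorient(volume, direction):
    """Hypothetical stand-in for the external reorient application."""
    return f"{volume}.reoriented_{direction}"


def reorient_run(volumes, direction, max_workers=4):
    """Apply reorient to every volume; iterations are independent.

    Executor.map returns results in input order, mirroring the indexed
    assignment or.v[i] = reorient(iv, direction) in the SwiftScript.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda v: reorient(v, direction), volumes))


run = ["bold1_0001", "bold1_0002", "bold1_0003"]
yro_run = reorient_run(run, "y")     # first pass, as in the script
ro_run = reorient_run(yro_run, "x")  # second pass on the results
print(ro_run)
```

The sketch only captures the data-parallel shape of the loop; it does not model how Swift dispatches each call to remote resources or tracks the files involved.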

26

Provenance Data Model

Invocation: dvID, host, start, duration, exitcode, stats

Call: nmspace, name, version

Procedure: nmspace, name, version

FormalArg: argname, type, direction

ActualArg: argname, value

Workflow: wfid, fromDV, toDV

Dataset: nmspace, name

Annotation: object, pred, type/val, user, date

Relationship labels (cardinalities 1 and *): passes, executes, calls, binds, references, describes, uses, includes
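To show how such a model supports provenance queries, here is a minimal sketch using Python dataclasses. It covers only Procedure, Call, and Invocation, with the attribute names from the slide; the links between them (an Invocation executes a Call, a Call calls a Procedure) are assumptions, since the transcript does not preserve which entities each relationship connects. The `stats` attribute is omitted and the query helper is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    nmspace: str
    name: str
    version: str


@dataclass(frozen=True)
class Call:
    nmspace: str
    name: str
    version: str
    procedure: Procedure  # assumed "calls" relationship


@dataclass(frozen=True)
class Invocation:
    dvID: str
    host: str
    start: float     # seconds since some epoch
    duration: float  # seconds
    exitcode: int
    call: Call       # assumed "executes" relationship


def failed_hosts(invocations, procedure_name):
    """Hosts on which any invocation of the named procedure failed."""
    return sorted({
        inv.host
        for inv in invocations
        if inv.exitcode != 0 and inv.call.procedure.name == procedure_name
    })


reorient = Procedure("fmri", "reorient", "1.0")
call = Call("fmri", "reorientRun.step", "1.0", reorient)
log = [
    Invocation("dv1", "node07", 0.0, 12.5, 0, call),
    Invocation("dv2", "node11", 1.0, 3.2, 1, call),
    Invocation("dv3", "node11", 5.0, 2.9, 1, call),
]
print(failed_hosts(log, "reorient"))
```

A production system would keep these records in a database and query them declaratively; the point here is only that recording who ran what, where, and with what outcome makes questions like this answerable after the fact.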

27

Multi-level Scheduling

Specification: SwiftScript, Abstract computation, Virtual Data Catalog, SwiftScript Compiler

Scheduling: Execution Engine (Karajan w/ Swift Runtime), Swift runtime callouts, Status reporting

Execution: Virtual Node(s), Worker Nodes, launcher, file1, file2, file3, AppF1, AppF2, Provenance data, Provenance collector

Provisioning: Falkon Resource Provisioner, Amazon EC2

28

DOCK on SiCortex

CPU cores: 5760

Power: 15,000 W

Tasks: 92160

Elapsed time: 12821 sec

Compute time: 1.94 CPU years

(does not include ~800 sec to stage input data)

Ioan Raicu, Zhao Zhang
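The DOCK figures above can be cross-checked with a little arithmetic. The sketch below derives utilization, mean task length, and energy use from the reported numbers; it assumes a 365-day year when converting CPU years to seconds and ignores the ~800 s of input staging, as the slide does. All derived quantities are estimates, not figures from the talk.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year

# Reported on the slide.
cores = 5760
power_watts = 15_000
tasks = 92_160
elapsed_s = 12_821
compute_cpu_years = 1.94

# Derived quantities.
compute_cpu_s = compute_cpu_years * SECONDS_PER_YEAR
available_cpu_s = cores * elapsed_s           # core-seconds on offer
utilization = compute_cpu_s / available_cpu_s
mean_task_s = compute_cpu_s / tasks
tasks_per_core = tasks / cores
energy_kwh = power_watts * elapsed_s / 3.6e6  # joules to kWh

print(f"utilization: {utilization:.1%}")
print(f"mean task length: {mean_task_s:.0f} s, {tasks_per_core:.0f} tasks per core")
print(f"energy: {energy_kwh:.1f} kWh")
```

Roughly 16 tasks of about eleven minutes each per core, at a bit over 80% utilization, is the kind of workload where per-task dispatch overhead matters and lightweight task dispatch pays off.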

29

LIGO Gravitational Wave Observatory

>1 Terabyte/day to 8 sites

770 TB replicated to date: >120 million replicas

MTBF = 1 month

Sites labeled on the map: Birmingham, Cardiff, AEI/Golm

Ann Chervenak et al., ISI; Scott Koranda et al., LIGO
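For a sense of what these replication figures mean in network terms, the sketch below converts them to sustained rates. It assumes decimal units, treats the ">" figures as lower bounds, and assumes each of the 8 sites receives the full 1 TB/day stream, which the slide does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 3600

daily_bytes = 1e12   # >1 Terabyte/day (lower bound)
sites = 8
replicated_bytes = 770e12
replicas = 120e6     # >120 million (lower bound)

# Sustained rate needed to deliver one day's data to one site.
per_site_mbit_s = daily_bytes * 8 / SECONDS_PER_DAY / 1e6

# Aggregate outbound rate if every site gets the full stream (assumption).
aggregate_mbit_s = per_site_mbit_s * sites

# Upper bound on the mean replica size, since the replica count is a floor.
mean_replica_mb = replicated_bytes / replicas / 1e6

print(f"per site: at least {per_site_mbit_s:.1f} Mbit/s sustained")
print(f"aggregate: at least {aggregate_mbit_s:.0f} Mbit/s")
print(f"mean replica size: at most {mean_replica_mb:.1f} MB")
```

The bandwidth is unremarkable; the challenge suggested by the numbers is the sheer count of relatively small files that must be tracked and kept consistent across sites.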

30

Lag Plot for Data Transfers to Caltech

Credit: Kevin Flasch, LIGO

31

SIDGrid: B. Bertenthal et al., U.Chicago, IU, UIC

32

Social Informatics Data Grid (SIDgrid)

TeraGrid PADS …

SIDgrid

Collaborative, multi-modal analysis of cognitive science data

Diverse experimental data & metadata

Browse data, Search, Content preview, Transcode, Download, Analyze

33

ELAN

SIDGrid Portal

34

35

A Community Integrated Model for Economic and Resource Trajectories for Humankind (CIM-EARTH)

Dynamics, foresight, uncertainty, resolution, …

Agriculture, transport, taxation, …

Data (global, local, …)

(Super)computers

CIM-EARTH Framework

Community process

Open code, data

36

Alleviating Poverty in Thailand: Modeling Entrepreneurship

Consider only wealth, access to capital

Consider also distance to 6 major cities

Rob Townsend, Victor Zhorin, et al.

Match: High / Low

37

Text Mining

38

GeneWays

Online Journals

Pathways

GeneWays

Andrey Rzhetsky et al.

Screening 250,000 journal articles

2.5M reasoning chains

4M statements

39

Identify Genes

Phenotype 1 Phenotype 2 Phenotype 3 Phenotype 4

Predictive Disease Susceptibility

Physiology

Metabolism Endocrine

Proteome

Immune Transcriptome

Biomarker Signatures

Morphometrics

Pharmacokinetics

Ethnicity, Environment

Age, Gender

Evidence Integration: Genetics & Disease Susceptibility

Source: Terry Magnuson

40

James Evans, U.Chicago

Arabidopsis articles

41

An Open Analytics Environment

Results out

Data in

Programs & rules in

“No limits”: Storage, Computing, Format, Program

Allowing for: Versioning, Provenance, Collaboration, Annotation

42