Active Data PDSW'13

30
Introduction Active Data Discussion Conclusion Active Data A Data-Centric Approach to Data Life-Cycle Management Anthony Simonet 1 Gilles Fedak 1 Matei Ripeanu 2 Samer Al-Kiswany 2 1 Inria, ENS Lyon, University of Lyon 2 University of British Columbia November 18th, 2013 A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 1/20

description

 

Transcript of Active Data PDSW'13

Page 1: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Active DataA Data-Centric Approach to Data Life-Cycle Management

Anthony Simonet1 Gilles Fedak1

Matei Ripeanu2 Samer Al-Kiswany2

1Inria, ENS Lyon, University of Lyon 2University of British Columbia

November 18th, 2013

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 1/20

Page 2: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Outline

Introduction

Data Life Cycle Management

Use-case

Requirements

Active Data

Active Data: principles & features

Exemple: Globus Online and iRODS

Discussion

Advantages

Limitations

Conclusion

Related works

Conclusion

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 2/20

Page 3: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Big Data

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 3/20

I Science and Industry have become data-intensiveI Volume of data produced by science and industry grows exponentiallyI How to store this deluge of data?I How to extract knowledge and sense?I How to make data valuable?

I Some examplesI CERN’s Large Hadron Collider: 1.5PB/weekI Large Synoptic Survey Telescope, Chile: 30 TB/nightI Billion edge social network graphsI Searching and mining the Web

Page 4: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Data Life Cycle

Data Life Cycle

I Creation/Acquisition

I Transfer

I Replication

I Disposal/Archiving

Definition

The life cycle is the course of operational stages through whichdata pass from the time when they enter a system to the timewhen they leave it.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 4/20

Page 5: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Data Life Cycle Management

Complicated scenarios

I Execution of workflow

I Complex interactions between software

I Need to quickly react to operational events

Ad-hoc task-centric approaches

I Hard to program, maintain and debug

I No formal specification

I Complicates interactions between systems

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 5/20

Page 6: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Data Life Cycle Use-case

Example: the Advanced Photon Source at Argonne National LabI 100TB of raw data per dayI Raw data are preprocessed and registered in a Globus dataset

catalogI Data are analyzed by various applicationsI Results are stored in the dataset catalog and shared

Instrument(Beamline)

LocalStorage

Transfer

MetadataCatalog

Extract &Register Metadata

RemoteData Center

Transfer

AcademicCluster

Analysis

More analysis

Upload result

Register result metadata

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 6/20

Page 7: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Use-case

Task Centric

Vs

Data CentricI Independent scripts I Express data-dependancies

I Hard to program, maintain, verify I Cross data-center coordination

I Coarse granularity I User-level fault-tolerance

I Incremental processing

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 7/20

Page 8: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Requirements

Challenges: a perfect system should. . .

I Simply represent the life cycle of data distributed acrossdifferent data centers and systems

I Simplify DLM modeling and reasoning

I Hide the complexity resulting from using differentinfrastructures and systems

I Be easy to integrate with existing systems

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 8/20

Page 9: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Active Data principles

System programmers expose their system’s internal data life cyclewith a model based on Petri Nets.A life cycle model is made of Places and Transitions

•Created

t1

Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

Each token has a unique identifier, corresponding to the actualdata item’s.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 9/20

Page 10: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Active Data principles

System programmers expose their system’s internal data life cyclewith a model based on Petri Nets.A life cycle model is made of Places and Transitions

Created

t1

•Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

A transition is fired whenever a data state changes.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 9/20

Page 11: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Active Data principles

System programmers expose their system’s internal data life cyclewith a model based on Petri Nets.A life cycle model is made of Places and Transitions

Created

t1

•Written

t2

Read

t3

t4

Terminated

public void handler () {

computeMD5 ();

}

Code may be plugged by clients to transitions.It is executed whenever the transition is fired.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 9/20

Page 12: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Active Data features

The Active Data programming model and runtime environment:

I Allows to react to life cycle progression

I Exposes transparently distributed data sets

I Can be integrated with existing systems

I Has scalable performance and minimum overhead overexisting systems

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 10/20

Page 13: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Implementation

I Prototype implemented in Java (' 2,800 LOC)

I Client/Service communication is Publish/SubscribeI 2 types of subscription:

I Every transitions for a given data itemI Every data items for a given transition

Active DataService

Client

Client

subscribeClient subscribe

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 11/20

Page 14: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Implementation

I Several ways to publish transitionsI Instrument the codeI Read the logsI Rely on an existing notification system

I The service orders transitions by time of arrival

Active DataService

Client

publish transition

Client

subscribeClient subscribe

publish transi

tion

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 11/20

Page 15: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Implementation

I Clients run transition handler code locallyI Transition handlers are executed

I SeriallyI In a blocking wayI In the order transitions were published

Active DataService

Client

publish transition

Client

subscribenotify

Client subscribe

notify

publish transi

tion

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 11/20

Page 16: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Performance evaluation: Throughput

10 50 100 200 300 400 450 500 550# clients

5,000

10,000

15,000

20,000

25,000

30,000

35,000

Tra

nsi

tions

per

seco

nd

Figure: Average number of transitions per second handled by the ActiveData Service

Clients publish 10,000 transitions in a row without pausing.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 12/20

Page 17: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Performance evaluation: Throughput

10 50 100 200 300 400 450 500 550# clients

5,000

10,000

15,000

20,000

25,000

30,000

35,000

Tra

nsi

tions

per

seco

nd

Figure: Average number of transitions per second handled by the ActiveData Service

The prototype scales up to 30,000 transitions per seconds.

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 12/20

Page 18: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Data Provenance

Definition

The complete history of data life cycle derivations and operations.

I Assess the quality of data

I Keep track of the origin of data over time

I Specialized Provenance Aware Storage Systems

−→ What about heterogeneous systems?

Example with Globus Online and iRODS

File transfer service Data store and metadata catalog

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 13/20

Page 19: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Data Provenance

Definition

The complete history of data life cycle derivations and operations.

I Assess the quality of data

I Keep track of the origin of data over time

I Specialized Provenance Aware Storage Systems−→ What about heterogeneous systems?

Example with Globus Online and iRODS

File transfer service Data store and metadata catalog

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 13/20

Page 20: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Data Provenance

Definition

The complete history of data life cycle derivations and operations.

I Assess the quality of data

I Keep track of the origin of data over time

I Specialized Provenance Aware Storage Systems−→ What about heterogeneous systems?

Example with Globus Online and iRODS

File transfer service Data store and metadata catalog

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 13/20

Page 21: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Globus Online and iRODS

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 14/20

Data events coming from Globus Online and iRODS

Created

GetPut

Terminated

t5

t9

t6

t7

t8 t10

iRODS

Id: {GO: 7b9e02c4-925d-11e2,iRODS: 10032}

public void handler () {

annotate ();

}

Created

t1 t2

Succeeded Failed

t3 t4

Terminated

Globus Online

Id: {GO: 7b9e02c4-925d-11e2}

public void handler () {

iput (...);

}

Page 22: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Globus Online and iRODS

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 14/20

Data events coming from Globus Online and iRODS

Created

GetPut

Terminated

t5

t9

t6

t7

t8 t10

iRODS

Id: {GO: 7b9e02c4-925d-11e2,iRODS: 10032}

public void handler () {

annotate ();

}

Created

t1 t2

Succeeded Failed

t3 t4

Terminated

Globus Online

Id: {GO: 7b9e02c4-925d-11e2}

public void handler () {

iput (...);

}

Page 23: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Globus Online and iRODS

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 14/20

Data events coming from Globus Online and iRODS

Created

GetPut

Terminated

t5

t9

t6

t7

t8 t10

iRODS

Id: {GO: 7b9e02c4-925d-11e2,iRODS: 10032}

public void handler () {

annotate ();

}

Created

t1 t2

Succeeded Failed

t3 t4

Terminated

Globus Online

Id: {GO: 7b9e02c4-925d-11e2}public void handler () {

iput (...);

}

Page 24: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Globus Online and iRODS

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 14/20

Data events coming from Globus Online and iRODS

Created

Get

• Put

Terminated

t5

t9

t6

t7

t8 t10

iRODS

Id: {GO: 7b9e02c4-925d-11e2,iRODS: 10032}

public void handler () {

annotate ();

}

Created

t1 t2

Succeeded Failed

t3 t4

Terminated

Globus Online

Id: {GO: 7b9e02c4-925d-11e2}public void handler () {

iput (...);

}

Page 25: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Exemple: Globus Online and iRODS

$ imeta ls -d test/out_test_4628

AVUs defined for dataObj test/out_test_4628:

attribute: GO_FAULTS

value: 0

----

attribute: GO_COMPLETION_TIME

value: 2013 -03 -21 19:28:41Z

----

attribute: GO_REQUEST_TIME

value: 2013 -03 -21 19:28:17Z

----

attribute: GO_TASK_ID

value: 7b9e02c4 -925d-11e2 -97ce -123139404 f2e

----

attribute: GO_SOURCE

value: go#ep1/~/ test

----

attribute: GO_DESTINATION

value: asimonet#fraise /~/ out_test_4628

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 15/20

Page 26: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Advantages

I Simple and graphical way to program DLM operations

I Allows to formally verify some properties of data life cycles

I Easy coordination between systems

I Easy to scale

I Easy to debug

I Easy fault tolerance

I Fine-grain interaction with data life cycle

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 16/20

Page 27: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Limitations

I Complexity to reason in terms of life cycle events

I Lack of standard

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 17/20

Page 28: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Related works

Data-centric parallel processingI Programing models:

I MapReduce and higher level abstractions: PigLatin, TwisterI Incremental systems: MapReduce-Online, Percolator, Chimera,

NepheleI Other models with implicit parallelism: Swift, Dryad, Allpairs

I Storage systemsI BitDewI MosaStoreI Provenance Aware Storage SystemsI Active Storage

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 18/20

Page 29: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Conclusion

Active Data is. . .

I Data-centric & Event-driven

I System-level data integration

What’s next?

I Advanced representation of operations that consume andproduce data: represent data derivation

I Data collection abilities

I Distributed implementation of the Publish/Subscribe layer

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 19/20

Page 30: Active Data PDSW'13

Introduction Active Data Discussion Conclusion

Thank you!

Questions?

Inria booth #2116

A. Simonet(Inria) Active Data (PDSW’13) November 18th, 2013 20/20