
Introduction to the Kepler Workflow System

Matthew B. Jones, National Center for Ecological Analysis and Synthesis (NCEAS)

University of California, Santa Barbara

Software Tools for Sensor Networks: A Workshop sponsored by NCEAS, LTER, and DataONE

May 1-5, 2012

Abstract

• Scientific workflows capture the transformation of data that are produced and consumed by disparate analysis and modeling software systems. Kepler is an open-source system for authoring and executing workflows, providing access to data and services from a variety of networks and systems. By versioning data, workflows, and executions, Kepler allows full reconstruction of the analyses used in scientific papers, even if those analyses are conducted using a variety of commercial and custom software. Kepler promotes reproducible science by allowing users to publish these workflows, data products, and execution traces to remote repositories to be shared with other users.

Diverse Analysis and Modeling

• Wide variety of analyses used in ecology and environmental sciences
– Statistical analyses and trends
– Rule-based models
– Dynamic models (e.g., continuous time)
– Individual-based models (agent-based)
– many others

• Implemented in many frameworks
– implementations are black boxes
– learning curves can be steep
– difficult to couple models

Analysis/Modeling Challenges

• Manual process to work with multiple analytical systems

• Data are discovered outside of tools and imported manually

• Difficult to understand models at a glance

• Difficult to revise analyses except in scripted systems

• No accepted way to publish models to share with colleagues

• Little re-use of components – many re-inventions

• Difficult to use multiple computers for one analysis/model
– Only a few experts use grid computing

Reproducible Science

• Analytical transparency
– open systems
– works across analysis packages
– documents algorithms completely

• Automated analysis for repeatability
– must be scriptable
– must be able to handle data dynamically

• Archived and shared analysis and model runs

• Current analytical practices are difficult to manage

• Model the steps used by researchers during analysis
– Graphical model of flow of data among processing steps

• Each step often occurs in different software
– Matlab, R, SAS, C/C++, Fortran, Swarm, ...
– Each component can ‘wrap’ external systems, presenting a unified view

• Refer to these graphs as ‘Scientific Workflows’

Models as ‘scientific workflows’

[Diagram: example workflow, Data -> Clean -> Analyze/Model -> Graph, with a Source (e.g., data) and a Sink (e.g., display) linked by channels A, B, and C]

Scientific workflows

• What are scientific workflows?
– Graphical model of data flow among processing steps
– Inputs and outputs of components are precisely defined
– Components are modular and reusable
– Flow of data controlled by a separate execution model
– Support for hierarchical models

[Diagram: Source (e.g., data) -> Processor (e.g., regression) -> Sink (e.g., display), connected by channels carrying tokens A, B, and C]
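The Source -> Processor -> Sink graph described above can be sketched as a minimal dataflow pipeline. This is illustrative Python only, not Kepler's actual API; the values and the doubling step are hypothetical.

```python
# Minimal dataflow sketch of the Source -> Processor -> Sink graph:
# tokens flow from a data source, through a processing step, to a sink.
# (Illustrative only -- not Kepler's actual API.)

def source():
    """Produce data tokens (e.g., values read from a dataset)."""
    yield from [1.0, 2.0, 3.0, 4.0]

def processor(tokens):
    """Transform each token (e.g., a unit conversion or model step)."""
    for t in tokens:
        yield t * 2.0

def sink(tokens):
    """Consume tokens (e.g., display or archive the results)."""
    return list(tokens)

print(sink(processor(source())))  # [2.0, 4.0, 6.0, 8.0]
```

Because each stage only consumes and produces tokens, the processor can be swapped out without touching the source or the sink, which is the modularity the bullets above describe.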


Outline

• Overview of Kepler
• Features
– Data Access
– Workflow archiving and sharing
– Grid Computing support
• Open source community

Overview of Kepler

• Goals
• Produce an open-source scientific workflow system
– enable scientists to design, share, and execute scientific workflows
• Support scientists in a variety of disciplines
– e.g., biology, ecology, oceanography, astronomy

• Important features
• access to scientific data
• flexible framework that works across analytical packages
• simplify distributed computing using computing grids
• clear documentation of analysis and models
• effective user interface for workflow design
• provenance tracking for results
• model archiving and sharing

Kepler use cases represent many science domains

• Ecology
– SEEK: Ecological Niche Modeling
– COMET: environmental science
– REAP: Parasite invasions using sensor networks

• Geosciences
– GEON: LiDAR data processing
– GEON: Geological data integration

• Molecular biology
– SDM: Gene promoter identification
– ChIP-chip: genome-scale research
– CAMERA: metagenomics

• Oceanography
– REAP: SST data processing
– LOOKING: ocean observing CI
– NORIA: ocean observing CI
– ROADNet: real-time data modeling
– Ocean Life project

• Physics
– CPES: Plasma fusion simulation
– FermiLab: particle physics

• Phylogenetics
– ATOL: Processing Phylodata
– CiPRES: phylogenetic tools

• Chemistry
– Resurgence: Computational chemistry
– DART: X-Ray crystallography

• Library Science
– DIGARCH: Digital preservation
– Cheshire digital library: archival

• Conservation Biology
– SanParks: Thresholds of Potential Concerns

Anatomy of a Kepler Workflow

• Actors
• Channels
• Ports
• Tokens: int, string, record{..}, array[..], ..
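The actor / channel / port / token vocabulary above can be sketched with a queue standing in for a channel. This is an illustrative Python sketch, not Kepler's actual classes; the actor names are hypothetical.

```python
from queue import Queue

# Sketch of the actor / port / channel / token vocabulary: an actor
# reads tokens from its input ports and writes tokens to its output
# ports; a channel is modeled here as a simple queue.
# (Illustrative only -- not Kepler's actual classes.)

channel = Queue()  # the channel connecting an output port to an input port

def ramp_actor(out_port, n):
    """Actor with one output port: emits n integer tokens."""
    for i in range(n):
        out_port.put(i)  # each put() sends one int token down the channel

def display_actor(in_port, n):
    """Actor with one input port: consumes n tokens."""
    return [in_port.get() for _ in range(n)]

ramp_actor(channel, 4)
print(display_actor(channel, 4))  # [0, 1, 2, 3]
```

The point of the separation is that either actor can be replaced independently, and the execution model (who fires when) is decided outside the actors themselves.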

Kepler scientific workflow system


Data source from repository


res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)

R processing script


Run Management
– Each execution recorded
– Provenance of derived data recorded
– Can archive runs and derived data
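The run-management idea above (record each execution and the provenance of its derived data) can be illustrated with a small record structure. The field names, workflow name, and dataset identifier here are hypothetical, not Kepler's actual provenance schema.

```python
# Illustrative shape of a run record: each execution notes which inputs
# were read, which steps ran, and which derived products they produced.
# All names below are hypothetical -- not Kepler's provenance schema.

run = {
    "workflow": "baro_vs_airtemp",                # hypothetical workflow name
    "inputs":   ["knb:dataset.123.2"],            # hypothetical data source id
    "steps":    ["RExpression: lm(BARO ~ T_AIR)"],
    "outputs":  ["regression_plot.png"],
}

def lineage(run, product):
    """Answer 'what produced this derived product?' from the record."""
    if product in run["outputs"]:
        return {"inputs": run["inputs"], "steps": run["steps"]}
    return None

print(lineage(run, "regression_plot.png")["inputs"])
# ['knb:dataset.123.2']
```

Archiving such records alongside the workflow is what lets an analysis be reconstructed later, as the abstract describes.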

A Simple Kepler Workflow

Component Tab

Workflow Run Manager

Searchable Component List

Component Documentation

Data preparation

FORTRAN code

MATLAB code

Data Access

Accessing Data in Kepler

• File system (e.g., CSV files)
• Catalog searches (e.g., KNB)
• Remote databases (e.g., PostgreSQL)
• Web services
• Data access protocols (e.g., OPeNDAP)
• Streaming data (e.g., DataTurbine)
• Specialized repositories (e.g., SRB)
• etc., and extensible

Direct Data Access to Data Repositories

• Search for metadata term (“ADCP”)
• Drag to workflow area to create data source
• 398 hits for ‘ADCP’ located in search

OPeNDAP

• Directly access OPeNDAP servers
• Apply OPeNDAP constraints for remote data subsetting
• Current work: searchable catalogs across OPeNDAP servers
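OPeNDAP subsetting works by appending a constraint expression to the dataset URL, so only the requested slice of a remote array crosses the network. A minimal sketch of building such a URL follows; the server address and variable name are hypothetical.

```python
# OPeNDAP constraint expressions name a variable and index ranges
# ([start:stride:stop] per dimension), appended to the dataset URL.
# Server URL and variable name below are hypothetical examples.

def constrained_url(base_url, variable, *ranges):
    """Build an OPeNDAP hyperslab constraint: var[start:stride:stop]..."""
    idx = "".join("[%d:%d:%d]" % (start, stride, stop)
                  for (start, stride, stop) in ranges)
    return "%s?%s%s" % (base_url, variable, idx)

url = constrained_url("http://example.org/opendap/sst.nc.dods",
                      "sst", (0, 1, 10), (20, 1, 30))
print(url)
# http://example.org/opendap/sst.nc.dods?sst[0:1:10][20:1:30]
```

The server applies the constraint and returns only the subset, which is what makes remote subsetting of large datasets practical from within a workflow.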

Gene sequences via web services

• Web service executes remotely (e.g., in Japan)

• Gene sequence returned in XML format

• Extracted sequence can be returned for further processing

• This entire workflow can be wrapped as a re-usable component so that the details of extracting sequence data are hidden unless needed.
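The extraction step described above (pull just the sequence out of the XML the web service returns) can be sketched with the standard XML library. The element names and sequence value here are hypothetical, not a specific service's schema.

```python
import xml.etree.ElementTree as ET

# The web-service step returns a gene record as XML; a downstream
# component extracts just the sequence for further processing.
# Element names and values below are hypothetical.

response = """<geneRecord>
  <id>AB000001</id>
  <sequence>ATGGCGTTAGC</sequence>
</geneRecord>"""

# Parse the response and pull out the text of the <sequence> element.
sequence = ET.fromstring(response).findtext("sequence")
print(sequence)  # ATGGCGTTAGC
```

Wrapping the call-plus-extraction pair as one composite component hides the XML handling from workflow users, as the slide notes.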

Benthic Boundary Layer Project: Kilo Nalu, Hawaii

Benthic Boundary Layer Geochemistry and Physics at the Kilo Nalu Observatory
G. Pawlak, M. McManus, F. Sansone, E. De Carlo, A. Hebert and T. Stanton

NSF Award #OCE-0536607-000

• Research instruments are part of cabled array at the Kilo Nalu Observatory
• Deployed off of Point Panic, Honolulu Harbor, Hawai’i
• Goal: Measure the interactions between physical oceanographic forcing, sediment alteration, and modification of sediment-seawater fluxes

Accessing sensor streams at Kilo Nalu

[Plot: water temperature (bottom, 10 m ADCP) in degrees C vs. time, 01:00 to 17:00, ranging roughly 24.2 to 24.5 degrees C]


Streaming data from observatory via DataTurbine server


now <- Sys.time()
Epoch <- now - as.numeric(now)
timeval <- Epoch + timestamps
posixtmedian = median(timeval)
mediantime = as.numeric(posixtmedian)
meantemp = mean(data)

Support application scripts in R, Matlab, etc.
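The R snippet above reduces one window of streamed samples to a median timestamp and a mean temperature. The same reduction in outline, as an illustrative Python sketch with hypothetical sample values:

```python
import statistics

# One window of streamed samples, reduced to a representative time
# (median timestamp) and a summary value (mean temperature), mirroring
# mediantime and meantemp in the R script. Sample values are hypothetical.

timestamps = [1.0, 2.0, 3.0, 4.0, 5.0]        # seconds since epoch
temps      = [24.2, 24.3, 24.4, 24.3, 24.2]   # degrees C

median_time = statistics.median(timestamps)   # corresponds to mediantime
mean_temp   = statistics.mean(temps)          # corresponds to meantemp
print(median_time, round(mean_temp, 2))       # 3.0 24.28
```

Using the median timestamp rather than the mean makes the representative time robust to irregular sample arrival, which matters for live sensor streams.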


Modular components, easily saved and shared


Graphs and derived data can be archived and displayed


Composite actors aid comprehension


• Save components for later re-use

• Share components via external repositories

Workflow archiving and sharing

Archiving isn’t just for data...

• Kepler can archive and version:
– Analysis code and workflows
– Results and derived data (e.g., data tables, graphs, maps)
– Derived data lineage
• What data were used as inputs
• What processes were used to generate the derived products

Run Management & Sharing

• Provenance subsystem monitors data tokens

Scheduling remote execution

Viewing remote runs

Grid Computing

• Support for several grid technologies
– Ad-hoc Kepler networks (Master-Slave)
– Globus grid jobs
– Hadoop Map-Reduce
– SSH plumbed-HPC
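The common shape behind the options listed above (master-slave networks and map-reduce alike) is: a master partitions the data, workers process their chunks, and partial results are combined. A minimal sketch, with a hypothetical sum-of-squares analysis and threads standing in for remote workers:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of splitting one analysis across workers: partition the data,
# map a per-chunk computation over the partitions, then reduce the
# partial results. (Illustrative only; real grid jobs run on separate
# machines via Globus, Hadoop, or an ad-hoc Kepler network.)

def mapper(chunk):
    """Per-worker work: here, a partial sum of squares."""
    return sum(x * x for x in chunk)

def run_distributed(data, n_workers=2):
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(mapper, chunks)   # map step, one task per chunk
    return sum(partials)                      # reduce step

print(run_distributed([1, 2, 3, 4]))  # 30
```

The workflow system's job is to make this partition/map/reduce pattern available without requiring scientists to become grid-computing experts.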


Open Source Community

Open Kepler Collaboration

• http://kepler-project.org

• Open-source
– BSD License

• Collaborators
– UCSB, UCD, UCSD, UCB, Gonzaga, many others

Ptolemy II

Community Contribution: Kepler/WEKA

from Peter Reutemann

Community Contribution:Science Pipes

from Paul Allen, Cornell Lab of Ornithology

In summary…

• Typical analytical models are complex and difficult to comprehend and maintain

• Scientific workflows provide
– An intuitive visual model
– Structure and efficiency in modeling and analysis
– Abstractions to help deal with complexity
– Direct access to data
– Means to publish and share models

• Kepler is an evolving but effective tool for scientists
– Kepler/CORE award funds transition from research prototype to production software tool

• Mix analytical systems
– Matlab, R, C code, FORTRAN, other executables, ...

• Understand models
– visually depict how the analysis works

• Directly access data
• Utilize Grid and Cloud computing
• Share and version models
– allow sharing of analytical procedures
– document precise versions of data and models used

• Provide provenance information
– provenance is critical to science
– workflows are metadata about scientific process

Advantages of Scientific Workflows

Workflows promote reproducible science

• Scientific Workflows are metadata about process

• Document data analysis and models
– provide provenance for data derivation
– allow sharing of analytical details

• Publishing and citing workflows supports reproducibility of scientific results

NCEAS’ model for Open Science

From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962


Questions?

• http://www.nceas.ucsb.edu/ecoinformatics/

• http://kepler-project.org

Acknowledgments

• This material is based upon work supported by:
• The National Science Foundation (9980154, 9904777, 0131178, 9905838, 0129792, and 0225676)
• The National Center for Ecological Analysis and Synthesis
• The Andrew W. Mellon Foundation
• Kepler contributors: SEEK, REAP, Kepler/CORE, Ptolemy II, SDM/SciDAC projects
• For many shared conversations and a shared vision for Kepler:
– Bertram Ludaescher and Tim McPhillips, UC Davis
– Ilkay Altintas, UC San Diego
– Mark Schildhauer, UC Santa Barbara
– Shawn Bowers, Gonzaga University
– Christopher Brooks, UC Berkeley

Extra slides

Sensor Network Management

Real-time Environment for Analytical Processing

• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System

http://reap.ecoinformatics.org/

REAP goals

• For scientists
– capabilities for designing and executing complex analytical models over near real-time and archived data sources

• For data-grid engineers
– monitoring and management capabilities of underlying sensor networks

• For outside users
– access to observatory data and results of models, approachable to non-scientists

Sensor sites: topology and monitoring
