
The Kepler Project
Overview, Status, and Future Directions

Matthew B. Jones
on behalf of the Kepler Project team

National Center for Ecological Analysis and Synthesis
University of California, Santa Barbara

SWDB Aug 29, 2004

The Kepler Project

• Goals

  • Produce an open-source scientific workflow system
    • enable scientists to design scientific workflows and execute them
  • Support scientists in a variety of disciplines
    • e.g., biology, ecology, astronomy
  • Important features
    • access to scientific data
    • flexible means for executing complex analyses
    • enable use of Grid-based approaches to distributed computation
    • semantic models of scientific tasks
    • effective UI for workflow design


Kepler Collaboration

• Open-source
  • Builds on Ptolemy II
• Collaborators
  • SEEK Project
  • SciDAC SDM Center
  • Ptolemy Project
  • GEON Project
  • ROADNet Project
  • Resurgence Project
• Goals
  • Create powerful analytical tools that are useful across disciplines
    • Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, …



Usage statistics

• Source code access
  • 154 people accessed source code
  • 30 members have write permission
• Projects using Kepler:
  • SEEK (ecology)
  • SciDAC (molecular bio, ...)
  • CPES (plasma simulation)
  • GEON (geosciences)
  • CiPRes (phylogenetics)
  • CalIT2
  • ROADNet (real-time data)
  • LOOKING (oceanography)
  • CAMERA (metagenomics)
  • Resurgence (computational chemistry)
  • NORIA (ocean observing CI)
  • NEON (ecology observing CI)
  • ChIP-chip (genomics)
  • COMET (environmental science)
  • Cheshire Digital Library (archival)
  • Digital preservation (DIGARCH)
  • Cell Biology (Scripps)
  • DART (X-Ray crystallography)
  • Ocean Life
  • Assembling the Tree of Life project
  • Processing Phylodata (pPOD)
  • FermiLab (particle physics)
• Kepler downloads
  • Total = 9204
  • Beta = 6675
  • [Chart legend: red = Windows, blue = Macintosh]


Kepler advances

• Data and Actor search
• EarthGrid data access system
• Kepler Component Library
  • Kepler Archive (KAR) format
  • Integrated support for LSID identifiers for all objects
• Object Manager and cache
• Web service execution
• RExpression & MatlabExpression actors
• Redesigned user interface
• Authentication subsystem
• Null-value handling


More advances

• Documentation
• Collection-oriented workflows (COMAD)
• Domain-specific actors for case studies
  • e.g., GARP, phylogenetics actors
• Provenance system
• Grid computing support
  • NIMROD, Globus, ssh, ...
• Semantics support
  • annotation, search, workflow validation, integration


Distributed execution

• Opportunities for parallel execution
  • Fine-grained parallelism
  • Coarse-grained parallelism
    • Few or no cycles
    • Limited dependencies among components
    • ‘Trivially parallel’
    • Many science problems fit this mold
      • parameter sweep, iteration of stochastic models
• Current ‘plumbing’ approaches to distributed execution
  • workflow acts as a controller
    • stages data resources
    • writes job description files
    • controls execution of jobs on nodes
  • requires expert understanding of the Grid system
• Scientists need to focus on just the computations
  • try to avoid plumbing as much as possible
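To make the "trivially parallel" case concrete, here is a minimal Python sketch of a parameter sweep over a stochastic model. It is not Kepler code: `run_model`, `parameter_sweep`, and the toy logistic-growth model are hypothetical names invented for illustration, and a local thread pool stands in for the Grid nodes a real deployment would use.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_model(growth_rate, seed=0, steps=50):
    """Hypothetical stochastic model: noisy logistic growth (illustration only)."""
    rng = random.Random(seed)  # private RNG, so runs do not share state
    n = 10.0
    for _ in range(steps):
        n += growth_rate * n * (1 - n / 1000.0) + rng.gauss(0, 0.1)
        n = max(n, 0.0)
    return n


def parameter_sweep(rates, max_workers=4):
    """Run the model once per parameter value.

    No run depends on another, so the runs can execute in any order on any
    worker; that independence is what makes the problem coarse-grained parallel.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_model, rates))
    return dict(zip(rates, results))


if __name__ == "__main__":
    for rate, final_size in parameter_sweep([0.1, 0.2, 0.3]).items():
        print(rate, round(final_size, 2))
```

The scientist writes only `run_model`; everything in `parameter_sweep` is the "plumbing" that the slide argues should be hidden.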


Distributed Kepler

• Higher-order component for executing a model on one or more remote nodes
• Master and slave controllers handle setup and communication among nodes, and establish data channels
• Extremely easy for scientist to utilize
  • requires no knowledge of grid computing systems

[Figure: Master Controller and Slave Controller, with IN and OUT ports]
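The idea of a higher-order component with master and slave controllers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kepler's implementation: `distribute` and `worker_controller` are hypothetical names, threads stand in for remote nodes, and two in-process queues stand in for the IN and OUT data channels.

```python
import queue
import threading


def worker_controller(model, inbox, outbox):
    """Plays the slave controller: read tokens from IN, run the model, write to OUT."""
    while True:
        item = inbox.get()
        if item is None:  # sentinel from the master: no more work
            break
        index, token = item
        outbox.put((index, model(token)))


def distribute(model, tokens, n_workers=2):
    """Plays the master controller; higher-order because it takes the model as an argument."""
    inbox, outbox = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker_controller, args=(model, inbox, outbox))
        for _ in range(n_workers)
    ]
    for w in workers:
        w.start()
    for index, token in enumerate(tokens):
        inbox.put((index, token))
    for _ in workers:
        inbox.put(None)  # one shutdown sentinel per worker
    for w in workers:
        w.join()
    # All workers have finished, so the OUT channel now holds every result.
    results = [None] * len(tokens)
    while not outbox.empty():
        index, value = outbox.get()
        results[index] = value
    return results


if __name__ == "__main__":
    print(distribute(lambda x: x * x, [1, 5, 2]))
```

The caller supplies only the model and the input tokens; setup, communication, and teardown stay inside `distribute`, which is the property the slide highlights.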


Data Management

• Need for integrated management of external data
  • EarthGrid access is partial, needs refactoring
  • Include other data sources, such as JDBC, OpeNDAP, etc.
  • Data needs to be a first-class object in Kepler, not just represented as an actor
  • Need support for data versioning to support provenance
• e.g., Need to pass data by reference
  • workflows contain large data tokens (100’s of megabytes)
  • intelligent handling of unique identifiers (e.g., LSID)

[Figure: actors A and B; one token carries the value {1,5,2}, the other carries the reference ref-276 to {1,5,2}]
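Passing data by reference can be sketched as follows. This is a hedged illustration, not Kepler's token API: `DataStore`, `RefToken`, `actor_a`, and `actor_b` are hypothetical names, and the in-memory dictionary stands in for whatever cache or archive would actually hold the data.

```python
from dataclasses import dataclass


class DataStore:
    """Hypothetical in-memory store that maps identifiers to data objects."""

    def __init__(self):
        self._objects = {}
        self._counter = 0

    def put(self, data):
        self._counter += 1
        ref = f"ref-{self._counter}"  # a real system might mint an LSID here
        self._objects[ref] = data
        return ref

    def get(self, ref):
        return self._objects[ref]


@dataclass(frozen=True)
class RefToken:
    """A small token that carries only an identifier, not the data itself."""
    ref: str


def actor_a(store):
    """Producer: registers its (possibly large) output and emits a reference."""
    data = [1, 5, 2]
    return RefToken(store.put(data))


def actor_b(store, token):
    """Consumer: resolves the reference only when it needs the data."""
    return sum(store.get(token.ref))


if __name__ == "__main__":
    store = DataStore()
    token = actor_a(store)
    print(token, actor_b(store, token))
```

Only the short identifier travels between the two actors, so the cost of moving a token no longer depends on the size of the data it names.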


New projects: REAP

• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System

• Extend Kepler to:
  • Manage and monitor sensor networks
  • Consume data from sensors
  • Integrate sensor data handling with data archive handling
• Terrestrial ecology and oceanography use cases

PIs: Jones, Altintas, Estrin, Seabloom, Gallagher, Cornillon, Hosseini, Ludäscher, Schildhauer, Reichman, Baru, Potter, Borer

Institutions: UCSB, UCD, UCSD, UCLA, OSU, OpeNDAP


REAP breakdown


New projects: ChIP-chip

• A Collaborative Scientific Workflow Environment for Accelerating Genome-Scale Biological Research
  • CS/IT: Ludäscher, Bowers, McPhillips
  • Bio: Peggy Farnham, Mark Bieda
• Integrate a web-based "experiment workspace" environment with a flexible scientific workflow system
• Support rapid prototyping and easy addition of new "methods"
  • templates
    • details of how key steps are carried out are left open until runtime, followed by late binding of one or more specific algorithms or data
    • which is the best motif-finding algorithm?
    • parts of a workflow have a similar set of steps
    • need to compare results from parallel analyses
• Support client-server (i.e., enterprise) deployments for group/lab-wide collaboration
  • different people have different roles [Software dev (Tim), Bioinformatics specialist (Mark), Biologist (Peggy)]
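A workflow template with late binding can be sketched in Python as a function whose key step is passed in at run time. Everything here is a hypothetical illustration: `run_template`, `most_common_kmer`, and `most_common_prefix` are invented names, and the two toy finders are simple stand-ins, not implementations of MEME, MDscan, or any other real motif-finding algorithm.

```python
from collections import Counter


def most_common_kmer(sequences, k=3):
    """Toy finder 1: the k-mer that occurs most often anywhere in the sequences."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    return counts.most_common(1)[0][0]


def most_common_prefix(sequences, k=3):
    """Toy finder 2: the most frequent k-length prefix."""
    return Counter(seq[:k] for seq in sequences).most_common(1)[0][0]


def run_template(sequences, motif_finders):
    """Template: normalize input -> <motif finder, bound at run time> -> collect.

    The shared steps are written once; the open step is filled by each entry
    in motif_finders, so results from alternative algorithms line up for
    side-by-side comparison.
    """
    cleaned = [s.strip().upper() for s in sequences]
    return {name: finder(cleaned) for name, finder in motif_finders.items()}


if __name__ == "__main__":
    results = run_template(
        ["ttacgacg", "ttacgaa", "ggacgcc"],
        {"kmer": most_common_kmer, "prefix": most_common_prefix},
    )
    print(results)
```

Adding a new "method" then means adding one entry to the dictionary passed to `run_template`; the template itself does not change.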


[Figure from Bowers and McPhillips. Panels and labels:
• External components and services
  • ChIP-chip Data Analysis (ChIPOTle, HMM, …)
  • Motif Finding Algorithms (MEME, MDscan, …)
  • Visualization Packages and Statistics Tools
  • Public Databases & Services (GenBank, David, TransFac, …)
• Component Specification (wrapping, integration, and creation of components)
• Workflow Specification (workflow design and template creation)
• Workflow Automation (configuration and execution support)
  • Configuration Management, Execution Management, Monitoring Support, Provenance Tracking
  • (1) select design template and configure
  • (2) generate optimized executable workflow
• Experiment Workspace (setup, run, and manage)
  • Setup “protocol”, Import/Export Data, Data Display, Visualization, Run Experiment
• Repositories: component, design, provenance, experiment
• Roles: Peggy (“biologist”), Mark (“bioinformatics specialist”), Tim (“software developer”)
• Kepler Workflow Engine]


Kepler C.O.R.E. proposal

• Development of Kepler CORE -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips
• Goals
  • Reliable
    • refactored build
    • more modular design
    • improved engineering practices
  • Independently extensible
  • Open architecture, open project
    • improved governance


Kepler C.O.R.E. -- Extensibility


Kepler C.O.R.E. -- Governance


Kepler C.O.R.E. -- Sustainability

• How does Kepler persist?

  • Now, via research grants
    • unsustainable for production purposes
  • Future
    • new models for financial support
      • support contracts?
      • extension contracts?
      • new science domains?
      • continued research dollars?
      • foundations?
    • exploring a 501(c)(3) organization that can sustain Kepler and similar open-source initiatives


Acknowledgements

• Funding
  • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, 0225676, and 0619060.
  • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
  • The Andrew W. Mellon Foundation
  • The Department of Energy
• Collaborators
  • NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas, University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• Kepler contributors
  • SEEK, Ptolemy II, SDM/SciDAC, GEON, ROADNet, EOL, Resurgence