Post on 09-Jul-2020

Tigres: Template Interfaces for Agile Parallel Data-Intensive Science

Lavanya Ramakrishnan

LRamakrishnan@lbl.gov

http://tigres.lbl.gov

(CS Biased) View of Workflow Challenges: Gene2Life Molecular Biology Analysis

•  Mostly simple sequential workflow

•  Repetitive

•  Tracking –  Provenance, metadata, etc.

•  “Iterative” –  Swap programs and data sets

•  Desktop to HPC/Cloud

[Figure: Gene2Life workflow. Input: nucleotide or amino acid. Programs: blast, clustalw, dnapars/protpars, drawgram. Stage labels: search, alignment, analysis, visualization. Output: tree files (.ps and .pdf files).]

(CS Biased) View of Workflow Challenges: MotifNetwork

•  Mid-sized “compute-intensive” workflow

•  Mix of single processor and multiprocessor tasks

•  Intermediate data formatting/logic

•  Move

•  Share

[Figure: MotifNetwork workflow. Task labels: Pre Interproscan, Interproscan (repeated), Post Interproscan, Motif. Annotations: N=135, 256 processors. Stage labels: data preparation, analysis, aggregation, processing.]

People use ad-hoc scripts, keep notes in text files and encode metadata in file names

Source: Jeff Tilson

Big Data is here …

•  Larger volumes of data

•  More dynamic content

•  Significant variety in data types

•  Large amounts of unstructured data

•  Increased rate of data arrival

•  Need for faster data processing rates

•  …..

… It is not getting easier

MapReduce and Hadoop Ecosystem

[Figure: Map and Reduce phases.]

•  Computation performed on large volumes of data in parallel

•  Provides scaling, data locality, fault tolerance

•  Higher-level tools have evolved for specific data analysis

•  There are challenges in using MapReduce/Hadoop for scientific workflows
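To make the map and reduce phases concrete, here is a minimal single-process sketch of the pattern as a word count. It does not use Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) and the sample records are illustrative assumptions; a real framework would distribute the map calls across nodes and supply the data locality and fault tolerance mentioned above.

```python
from collections import defaultdict
from functools import reduce


def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Reduce: combine all values that share one key."""
    return reduce(lambda a, b: a + b, values)


def map_reduce(records):
    """Run map over every record, group by key, then reduce each group."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return {key: reduce_phase(values) for key, values in shuffle(pairs).items()}


if __name__ == "__main__":
    # Hypothetical input: two records listing program names.
    print(map_reduce(["blast clustalw", "blast drawgram"]))
```

Each `map_phase` call depends only on its own record, which is what lets a framework run the calls in parallel.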

Tigres: Design templates for common scientific workflow patterns

[Figure: layers labeled Base Tigres templates, "LightSrc" Domain templates, Application "LightSrc-1", and Application "LightSrc-2"; activities labeled Create and Debug, Share, and Scale up.]

Implement templates as a library in an existing language

Tigres Templates

[Figure: the four base templates, Sequence, Parallel, Split, and Merge, each drawn as a graph of tasks (Task1 ... TaskN).]

Key Aspects of Tigres

•  Targeted for large-scale data-intensive workflows –  Motivated by “MapReduce” model –  No centralized managed model

•  Library model embedded in existing languages such as Python and C –  “Extend current scripting/programming tools” –  API-based, embedded in code

•  Light-weight execution framework –  “As easy to run as an MPI program on an HPC resource” –  No persistent services

•  Scientist-Centered Design Process –  Get feedback from user before writing all the code

Tigres Design Process

[Figure: Tigres design diagram with the labels Design, Execution Environment, API, Implementation, Optimizations, and Scientist-Centered Design Process.]

Create a workflow

1.  Define input types

2.  Define task

3.  Assign input values

4.  Repeat 1-3 above for other tasks

5.  Create appropriate input arrays

6.  Create appropriate task arrays

7.  Create (and run) the template

[Figure: example workflow graph with task labels Task1, Task2, Task3, Task6, Task40, Task45 ..., Task50 ..., Task55. Each task input is typed, e.g. input1_task1 {Type: Object_a}, input2_task1 {Type: int}.]

Templates

•  Sequence ( name, task_array, input_array ) –  e.g., output[ ] = Sequence(“my seq”, task_array_12, input_array_12)

•  Parallel ( name, task_array, input_array ) –  e.g., output[ ] = Parallel(“abc”, task_array_12, input_array_12)

•  Split ( name, split_task, split_input_values, task_array, task_array_in ) –  e.g., Split(“split”, task_x1, input_value_1, spl_t_arr, spl_i_arr)

•  Merge ( name, task_array, input_array, merge_task, merge_input_values ) –  e.g., Merge(“merge”, syn_t_arr, syn_i_arr, task_x1, input_value_1)
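The control flow behind the four templates can be sketched in plain Python. This is not the Tigres library and does not reproduce its signatures exactly: tasks are ordinary callables, `name` is kept only as a label, and the data-passing rules (each Sequence output feeds the next task, Split hands one output to every downstream task, Merge hands all outputs to one task) are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def sequence(name, tasks, first_input):
    """Run tasks one after another; each output feeds the next task (assumed)."""
    value = first_input
    for task in tasks:
        value = task(value)
    return value


def parallel(name, tasks, inputs):
    """Run task i on input i concurrently; return outputs in task order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, arg) for task, arg in zip(tasks, inputs)]
        return [future.result() for future in futures]


def split(name, split_task, split_input, tasks):
    """Run one task, then fan its output out to a set of parallel tasks."""
    shared = split_task(split_input)
    return parallel(name, tasks, [shared] * len(tasks))


def merge(name, tasks, inputs, merge_task):
    """Run a set of parallel tasks, then pass all outputs to one merge task."""
    return merge_task(parallel(name, tasks, inputs))


# Hypothetical tasks used only to exercise the sketch.
def double(x):
    return 2 * x


def increment(x):
    return x + 1


if __name__ == "__main__":
    print(sequence("my seq", [double, increment], 5))
    print(parallel("abc", [double, increment], [5, 5]))
    print(split("split", double, 5, [double, increment]))
    print(merge("merge", [double, increment], [5, 5], sum))
```

Because every template is itself a function returning a value, templates can be nested, for example a `parallel` call used as one step inside a `sequence`.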

Scientist-Centered Design Process

•  Use Google Docs for interactive step-by-step exercises with a “facilitator” and a “human compiler” –  white/black board didn’t work

•  Preparation –  terminology, basic template, example, exercise –  15 minutes preparation time

•  Testers (~6) –  Developers, web design/UI staff, application scientists

Impact of Scientist-Centered Design

•  Concept understanding by user

•  Changes to nomenclature

•  Support in C also important

•  Priorities for first prototype: Desktop to NERSC, Monitoring, Intermediate state management

What did we learn?

•  Documentation clarity was key –  Majority of our participants “coded by example”

•  Nomenclature was important –  Confusion with initial terminology

•  Keep API simple –  Dependencies/output: two different styles within and outside a template can be confusing

•  Support extended API –  Optional parameters, different programming styles

•  Execution semantics were important –  Monitoring, logging

It took days for first stub implementation rather than weeks (or months)!

Summary

•  “Scientist-friendly” programming API to manage workflows

•  Plan to test API with different user groups

Tigres Team

•  Core team –  Deb Agarwal (PI), Lavanya Ramakrishnan, Daniel Gunter –  Gilberto Pastorello, Valerie Hendrix, Ryan Rodriguez

•  CS Research groups –  Shane Canon –  John Shalf

•  Science research groups –  Cosmology: Alex Kim, Rollin Thomas, Stephen Bailey –  Gamma Ray: Dan Chivers –  Advanced Light Source: Dula Parkinson –  HEP: Paolo Calafiura –  Materials: Kristin Persson

Website: http://tigres.lbl.gov

Contact: LRamakrishnan@lbl.gov

Monitoring

•  Initialize –  init (tigres-destination, user-destination)

•  User Logging –  setLevel(level), where level is an enumeration from FATAL up to TRACE –  write(level, name, key-value pairs)

•  Query –  getStatus(type, names) –  getInfo(name, key-value-pairs)
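The logging and query calls above can be illustrated with a small in-memory stand-in. Everything here is hypothetical: the `Monitor` class, the snake_case method names, and the levels between FATAL and TRACE (the slide names only the two endpoints) are assumptions, and the sketch covers only level filtering, `write`, and a `getInfo`-style lookup, not `init` destinations or `getStatus`.

```python
from enum import IntEnum


class Level(IntEnum):
    # Only FATAL and TRACE appear on the slide; the levels in between are assumed.
    FATAL = 0
    ERROR = 1
    WARN = 2
    INFO = 3
    DEBUG = 4
    TRACE = 5


class Monitor:
    """Hypothetical in-memory stand-in for the monitoring calls."""

    def __init__(self, level=Level.INFO):
        self.level = level
        self.records = []

    def set_level(self, level):
        """Change the most verbose level that will still be recorded."""
        self.level = level

    def write(self, level, name, **pairs):
        """Store a record if it is at least as severe as the current level."""
        if level <= self.level:
            self.records.append({"level": level, "name": name, **pairs})

    def get_info(self, name, **pairs):
        """Return records for `name` whose fields match all given key-value pairs."""
        return [
            record for record in self.records
            if record["name"] == name
            and all(record.get(key) == value for key, value in pairs.items())
        ]


if __name__ == "__main__":
    log = Monitor()
    log.write(Level.INFO, "task_f1", state="RUN")
    log.write(Level.TRACE, "task_f1", state="DETAIL")  # dropped at the INFO level
    log.write(Level.INFO, "task_f1", state="DONE")
    print(log.get_info("task_f1", state="DONE"))
```

Storing key-value pairs rather than free-form strings is what makes the query side possible without parsing log text.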

Input and Task

•  InputTypes ( name, types[ ] ) –  e.g., input_type1 = InputTypes(“Types1”, {“int”, “string”})

•  InputValues ( name, values[ ] ) –  e.g., input_value_1 = InputValues(“Values1”, {1, “hello”})

•  InputArray ( name, input_values[ ] ) –  e.g., input_array_12 = InputArray(“Array12”, {input_value_1, input_value_2})

•  Task ( name, type, impl_name, input_types, env ) –  e.g., task_f1 = Task(“A”, FUNCTION, “myfunc”, input_type1)

•  TaskArray ( name, task[ ] ) –  e.g., task_array_xy = TaskArray(“xy”, {task_f1, task_x1})
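One way to picture how typed inputs and tasks fit together is the sketch below, which covers steps 1 to 3 of "Create a workflow" for a single task. These dataclasses are stand-ins that only borrow the slide's names; they are not the Tigres classes. In particular, `Task` here takes a Python callable directly instead of the slide's `type` and `impl_name` arguments, omits `env`, and its `run` method with type checking is an assumption added for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class InputTypes:
    name: str
    types: List[type]


@dataclass
class InputValues:
    name: str
    values: List[Any]


@dataclass
class Task:
    name: str
    impl: Callable[..., Any]  # stand-in for the slide's (type, impl_name) pair
    input_types: InputTypes

    def run(self, input_values: InputValues) -> Any:
        """Check the values against the declared types, then call the function."""
        expected = self.input_types.types
        if len(input_values.values) != len(expected):
            raise TypeError(f"{self.name}: expected {len(expected)} inputs")
        for value, expected_type in zip(input_values.values, expected):
            if not isinstance(value, expected_type):
                raise TypeError(
                    f"{self.name}: {value!r} is not {expected_type.__name__}"
                )
        return self.impl(*input_values.values)


def myfunc(count: int, text: str) -> str:
    """Hypothetical task body: repeat the text `count` times."""
    return text * count


if __name__ == "__main__":
    input_type1 = InputTypes("Types1", [int, str])         # step 1: define input types
    task_f1 = Task("A", myfunc, input_type1)               # step 2: define task
    input_value_1 = InputValues("Values1", [2, "hello"])   # step 3: assign input values
    print(task_f1.run(input_value_1))
```

Declaring the types separately from the values lets a mismatch be reported before the task body runs.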