Wildfire Distributed, Grid-Enabled Workflow Construction and Execution
6th June, 2007 WWWFG 2007
Wildfire: Distributed, Grid-Enabled Workflow Construction and Execution
Arun Krishnan, PhD
Assistant Professor, Institute for Advanced Biosciences, Keio University, Tsuruoka, Japan
Two Trends
● Bioinformatics analysis
  ● Increasingly complex analyses
  ● Several bioinformatics applications assembled into workflows
● Affordable HPC
  ● Commodity hardware assembled into Beowulf clusters
  ● Pooled hardware in Grids
  ● Parallel by design
Traditional solution: implement workflows as Perl scripts.
Difficult to program.
Difficult to maintain.
Difficult to port.
The Problem
● We need
  ● Tool for construction and execution of workflows on supercomputers
  ● User interface must be intuitive for non-HPC specialists
  ● Execution must support different supercomputing platforms
Solution
● Objectives:
  ● Coarse-grained parallel programming for Grid
  ● Exploit heterogeneity of Grid (s/w licences, data, h/w)
● Approach:
  ● An expressive workflow description language, GEL
    ● Sequential and parallel composition
    ● Conditional execution (if-then-else)
    ● Sequential iteration (while loop)
    ● Parameterised parallel composition (parameter sweeps)
    ● Parameterised sequential composition
Basic idea: can we do on the Grid what we do with shell scripts on a cluster?
for i in *; do blastp -d yeast -I "$i"; done
GEL: An Overview
Semantics
● A workflow has
  ● One input directory
  ● One or more output directories
● Workflow cannot modify its input directory
GEL: An Overview
Semantics (Job)
● Job (atomic workflow subunit)
  ● Characteristics
    ● Executable name
    ● Resource/system/software/data requirements
    ● One input directory
    ● One output directory
  ● Semantics
    ● Stage files into input directory
    ● Run executable
    ● Present output directory as result
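The three steps of the job semantics can be sketched in plain shell. This is only an illustration under stated assumptions: `run_job` and `upcase` are hypothetical names, not GEL syntax or part of Wildfire, and the "executable" is a toy function that receives the input and output directories as its two arguments.

```shell
# run_job EXE INDIR OUTDIR FILE...
# Hypothetical helper (not GEL syntax) following the three steps of a job.
run_job() {
    exe=$1 indir=$2 outdir=$3
    shift 3
    mkdir -p "$indir" "$outdir"
    for f in "$@"; do
        cp "$f" "$indir"/           # 1. stage files into the input directory
    done
    "$exe" "$indir" "$outdir"       # 2. run the executable
    ls "$outdir"                    # 3. present the output directory as the result
}

# Toy executable: writes an upper-cased copy of every staged file.
upcase() {
    for f in "$1"/*; do
        tr 'a-z' 'A-Z' < "$f" > "$2/$(basename "$f").up"
    done
}

work=$(mktemp -d)
printf 'acgt\n' > "$work/seq.txt"
run_job upcase "$work/in" "$work/out" "$work/seq.txt"
```

The job never writes into its input directory, which matches the rule that a workflow cannot modify its input.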
GEL: An Overview
Semantics (Conditional)
● Conditional (if E then A else B)
  ● E is a job for which we ignore the output files
  ● A and B are workflows
  ● Executing (if E then A else B) entails
    ● Execute E and observe stdout
    ● If stdout is non-empty then execute A
    ● If stdout is empty then execute B
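A minimal shell sketch of this rule, assuming guards and branches are ordinary shell functions. The names `gel_if`, `guard_nonempty`, `guard_empty`, `branch_a` and `branch_b` are made up for illustration. Note that the decision depends on the guard's stdout, not on its exit status.

```shell
# gel_if GUARD THEN ELSE
# Runs GUARD and looks only at its stdout: non-empty selects THEN, empty selects ELSE.
# Caveat of this sketch: $(...) strips trailing newlines, so a guard that
# prints nothing but newlines is treated as having empty stdout.
gel_if() {
    if [ -n "$("$1")" ]; then
        "$2"
    else
        "$3"
    fi
}

guard_nonempty() { echo "hits found"; }
guard_empty()    { :; }
branch_a()       { echo "A"; }
branch_b()       { echo "B"; }

gel_if guard_nonempty branch_a branch_b
gel_if guard_empty    branch_a branch_b
```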
GEL: An Overview
Semantics (sequential and parallel composition)
● Sequential composition (A;B)
  ● Execute A
  ● Copy files from all output directories of A into input directory of B
  ● Execute B
  ● Note: implicit merge of output directories of A
● Parallel composition (A||B)
  ● Execute A and B from input directories populated from the same files
  ● Output directories are those of A and B
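The two composition rules can be imitated with directories and background jobs on a single machine. The sketch below is an assumption-laden model, not the GEL interpreter: a workflow is taken to be a shell function called as `wf INPUT_DIR WORK_DIR` that prints its output directories one per line, and `gel_seq`, `gel_par` and the toy jobs are hypothetical names.

```shell
# Assumed convention: a workflow is a function "wf INPUT_DIR WORK_DIR" that
# prints the path of each of its output directories, one per line.
# Paths are assumed to contain no whitespace.

# Sequential composition A;B
gel_seq() (
    a=$1 b=$2 indir=$3 work=$4
    mkdir -p "$work/b_in"
    for d in $("$a" "$indir" "$work/a"); do
        cp "$d"/* "$work/b_in"/     # implicit merge of A's output directories
    done
    "$b" "$work/b_in" "$work/b"
)

# Parallel composition A||B
gel_par() (
    a=$1 b=$2 indir=$3 work=$4
    mkdir -p "$work/a_in" "$work/b_in"
    cp "$indir"/* "$work/a_in"/     # both inputs are populated from the same files
    cp "$indir"/* "$work/b_in"/
    "$a" "$work/a_in" "$work/a" > "$work/a.dirs" &
    "$b" "$work/b_in" "$work/b" > "$work/b.dirs" &
    wait                            # both branches must finish
    cat "$work/a.dirs" "$work/b.dirs"
)

# Toy atomic jobs
count_lines() { mkdir -p "$2"; cat "$1"/* | wc -l > "$2/lines.txt"; echo "$2"; }
count_chars() { mkdir -p "$2"; cat "$1"/* | wc -c > "$2/chars.txt"; echo "$2"; }
listing()     { mkdir -p "$2"; ls "$1" > "$2/listing.txt"; echo "$2"; }
both()        { gel_par count_lines count_chars "$1" "$2"; }

tmp=$(mktemp -d)
mkdir "$tmp/in"
printf 'ACGT\nTTGA\n' > "$tmp/in/seq.txt"
gel_seq both listing "$tmp/in" "$tmp/run"   # (count_lines || count_chars); listing
cat "$tmp/run/b/listing.txt"
```

The final listing names both `chars.txt` and `lines.txt`, showing that the two output directories of the parallel step were merged into the input of the next step.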
GEL: An Overview
Semantics (Sequential Iteration)
● Sequential iteration (while E do A)
  ● E is a job for which we ignore the output files
  ● (Standard while definition) Executing
    while E do A
    is semantically equivalent to executing
    if E then (A; while E do A)
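The unfolding rule translates almost literally into a recursive shell function. This is a sketch only: `gel_while`, `below_three` and `increment` are hypothetical names, and the guard is judged by its stdout, as in the conditional.

```shell
# gel_while GUARD BODY
# Direct transcription of: while E do A  ==  if E then (A; while E do A)
gel_while() {
    if [ -n "$("$1")" ]; then
        "$2"
        gel_while "$1" "$2"
    fi
}

state=$(mktemp)
echo 0 > "$state"

# Guard: prints a line (non-empty stdout) while the counter is below 3.
below_three() { [ "$(cat "$state")" -lt 3 ] && echo "continue"; }

# Body: increments the counter stored in the state file.
increment() { n=$(cat "$state"); echo $((n + 1)) > "$state"; }

gel_while below_three increment
cat "$state"
```

Unlike the parameterised loops, the number of iterations here is only known once the guard finally prints nothing.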
GEL: An Overview
Semantics (Parameterised parallel composition)
● Parameterised parallel composition (pfor x in xs do A(x))
  ● xs is a list expression
    ● E.g. 0:50:10 = [0, 10, 20, 30, 40, 50]
  ● Variable x is a bound variable
  ● Executing
    pfor x in (a0,xs) do A(x)
    is semantically equivalent to executing
    A(a0) || (pfor x in xs do A(x))
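A local shell sketch of a parameter sweep, assuming the list expression has the form start:end:step with a positive step, as the 0:50:10 example suggests. `expand_list`, `pfor` and `square` are hypothetical names, and background jobs stand in for whatever the real interpreter spawns.

```shell
# expand_list START:END:STEP
# Prints the values of the list expression, one per line (STEP assumed > 0).
expand_list() {
    start=${1%%:*}
    rest=${1#*:}
    end=${rest%%:*}
    step=${rest#*:}
    i=$start
    while [ "$i" -le "$end" ]; do
        echo "$i"
        i=$((i + step))
    done
}

# pfor BODY VALUE...
# Starts one instance of BODY per value and waits for all of them.
pfor() {
    body=$1
    shift
    for x in "$@"; do
        "$body" "$x" &
    done
    wait
}

out=$(mktemp -d)
# Each instance writes to its own file, so parallel instances do not interfere.
square() { echo $(( $1 * $1 )) > "$out/sq_$1.txt"; }

pfor square $(expand_list 0:50:10)
cat "$out/sq_30.txt"
```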
GEL: An Overview
Semantics (Parameterised sequential composition)
● Parameterised sequential composition (for x in xs do A(x))
  ● xs is a list expression, x is a bound variable
  ● Executing
    for x in (a0,xs) do A(x)
    is semantically equivalent to executing
    A(a0); (for x in xs do A(x))
  ● Know number of iterations before executing loop (cf. while loop)
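For contrast with the parallel sweep, a sequential version runs one instance at a time, in list order. Again this is a sketch with hypothetical names (`gel_for`, `append`), not GEL itself.

```shell
# gel_for BODY VALUE...
# Runs BODY once per value, strictly one after another. The number of
# iterations is fixed by the argument list before the loop starts.
gel_for() {
    body=$1
    shift
    for x in "$@"; do
        "$body" "$x"
    done
}

log=$(mktemp)
append() { echo "step $1" >> "$log"; }

gel_for append 1 2 3
cat "$log"
```

Because each step finishes before the next begins, the log order always follows the list order.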
So what did we do…?
● Grammar defined
  ● Sequential and parallel composition
  ● Sequential iteration
  ● Intrinsic jobs (e.g. file projection)
● Interpreters implemented
  ● Local machine: spawn jobs locally
  ● Clusters: spawn jobs using SGE, PBS and LSF
  ● Statically-scheduled Grid interpreter: GridFTP staging, GramJob spawn
● Required
  ● A GUI frontend
Wildfire…
Wildfire and GEL bring supercomputing power to the bioinformatician
Features
● Integrated environment
  ● Construct and execute workflows from the same interface
● User-friendly
  ● Drawing-analogy workflow construction
  ● Program options presented using Jemboss-style drop-down lists, buttons, textboxes, etc.
● Supercomputing support
  ● Shared memory multiprocessors
  ● Cluster schedulers
    ● PBS
    ● SGE
    ● LSF
  ● Grids
    ● Globus
W/F Construction: Drawing
● Double click on components to change options
● Draw arrows between components
● Drag components into containers
Yellow boxes are atomic components
Parallel bars denote parallel container
Parallel “foreach” repeats contents for each file matching pattern
An arrow denotes sequential dependence
W/F Construction: Components
● Wildfire has been pre-configured with EMBOSS applications
● Custom/new components can be added
W/F Execution
User uses Wildfire to create workflow as GEL script
Execution on (1) Grid, (2) cluster, or (3) local machine
[Figure: the GEL script and its data are dispatched to a Grid via Globus, to a cluster via LSF, or to a laptop via fork]
W/F Execution: GEL
• “Local”/SMP
  – Run programs directly
  – Use multiple processors if available
• Grid
  – Stage files using GridFTP
  – Execute programs using GRAM
• Beowulf Cluster
  – Submit job requests through queue manager
  – Use processors on compute nodes
  – Use job dependencies
    • PBS/Torque
    • Sun GridEngine (SGE)
    • Platform LSF
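On a cluster, a sequential dependence between two jobs is typically expressed through the queue manager's own dependency option. The sketch below only prints the submit command a back end might issue; it does not submit anything. `submit_cmd` is a hypothetical helper, the job ids are made up, and the exact invocation on a given site may differ. The options shown (`-W depend=afterok:` for PBS/Torque, `-hold_jid` for SGE, `-w 'done(...)'` for LSF) are the schedulers' standard dependency flags.

```shell
# submit_cmd BACKEND SCRIPT [AFTER_JOB_ID]
# Prints (does not run) a submit command; AFTER_JOB_ID, when given, makes the
# new job wait for that job to finish first.
submit_cmd() {
    backend=$1 script=$2 after=${3:-}
    case $backend in
        pbs) cmd="qsub"; [ -n "$after" ] && cmd="$cmd -W depend=afterok:$after" ;;
        sge) cmd="qsub"; [ -n "$after" ] && cmd="$cmd -hold_jid $after" ;;
        lsf) cmd="bsub"; [ -n "$after" ] && cmd="$cmd -w 'done($after)'" ;;
        *)   echo "unknown backend: $backend" >&2; return 1 ;;
    esac
    echo "$cmd $script"
}

submit_cmd pbs dice.sh
submit_cmd pbs blast.sh 41
submit_cmd sge blast.sh 41
submit_cmd lsf blast.sh 41
```

Leaving the ordering to the scheduler means the submitting process does not have to poll for job completion between steps.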
More workflow parallel features
Parallel container
● Denotes independent components
● Whole container is considered a component
Parallel for loop
● Loop variable $i iterates over values 0 to 3
● For each value of $i, an instance of its contents executes in parallel
Workflow: while loops
Component inside round disc is the loop guard
If loop guard evaluates to false, then the break branch is taken
If loop guard is true, then the true branch is taken, after which the loop guard is evaluated again
While loop allows for iterative workflows
Ex: Transcript Analysis
● Transcripts database from Mammalian Gene Collection
● Exons from chromosomes from NCBI GenBank
● BLAST each exon against transcript database to investigate splicing of transcripts
[Figure: exons are extracted from a chromosome and aligned with BLAST against the transcripts, producing alignments]
Transcript Analysis: 24 Chromosomes
● Human genome has 24 chromosomes (1-22, X, Y)
● How do we leverage parallel computing?
[Figure: the Chromosome, Exons, Transcripts, BLAST, Extract, Alignments pipeline repeated once per chromosome]
Transcript Analysis: Parallelism
Dice splits big exons file into several smaller files
Separate BLAST instances align the smaller files against transcript database
Alignments are stored in many files
One copy per chromosome
[Figure: for each chromosome (C/some), Extract produces an exons file, Dice splits it into smaller exons files, and several Blast instances align them against Transcripts to produce Results]
Transcript Analysis: Workflow
Parallel “foreach” container executes inner pipeline once for each file matching *.gbk.gz
Decompress chromosome data
Decompress transcripts file
Extract exons
Format database for BLAST query
Break up exons file into smaller files
Parallel “foreach” container executes blastall component for all files matching *_dice*.fna
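The "break up exons file" step can be approximated with a few lines of awk. This is a stand-in written for illustration, not the dice component used in the talk: the function name `dice`, its arguments and the round-robin policy are assumptions, and only the `*_dice*.fna` output naming is taken from the slide. The input is assumed to be a FASTA file whose first line is a `>` header.

```shell
# dice FASTA PARTS PREFIX
# Distributes FASTA records round-robin over PARTS files named
# PREFIX_dice0.fna, PREFIX_dice1.fna, ...
dice() {
    awk -v n="$2" -v p="$3" '
        /^>/ { k = count++ % n }            # header line: move to the next part
             { print > (p "_dice" k ".fna") }   # write the line to the current part
    ' "$1"
}

work=$(mktemp -d)
printf '>e1\nACGT\n>e2\nGGCC\n>e3\nTTAA\n' > "$work/exons.fna"
dice "$work/exons.fna" 2 "$work/exons"

# The parallel "foreach" would then start one BLAST job per matching file.
for f in "$work"/*_dice*.fna; do
    echo "would align: $(basename "$f")"
done
```

Splitting by record count is the simplest policy; it does not account for differences in sequence length between parts.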
Transcript Analysis: Execution Profile
● The execution profile shows when programs start and stop
● Note: “makespan” can be improved by balancing the duration of blast jobs (modify dice)
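To see why balancing helps, the makespan of a set of independent jobs can be estimated with a simple greedy rule: each job goes to the processor that is currently least loaded, and the makespan is the largest resulting load. The function `makespan` and the durations below are invented for illustration; they are not measurements from the talk, and a real scheduler may place jobs differently.

```shell
# makespan PROCS DURATION...
# Greedy list scheduling: assign each job, in the order given, to the least
# loaded processor, then report the largest load.
makespan() {
    procs=$1
    shift
    printf '%s\n' "$@" | awk -v p="$procs" '
        {
            best = 1
            for (i = 2; i <= p; i++)
                if (load[i] + 0 < load[best] + 0) best = i
            load[best] += $1
        }
        END {
            max = 0
            for (i = 1; i <= p; i++)
                if (load[i] + 0 > max) max = load[i]
            print max
        }'
}

makespan 2 50 10 10 10   # one oversized part dominates
makespan 2 20 20 20 20   # same total work, evenly sized parts
```

Both calls schedule 80 units of work on two processors, but the uneven split finishes at 50 while the even split finishes at 40.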
Summary
• End-User Requirements
  – Ease of construction
  – Ease of implementation
  – Ease of recovery
• Grid Scripting the way to go?
• Interfaces to grid-scripting?
Acknowledgments
• Bioinformatics Institute, Singapore
• Dr. Francis Tang
• Chua Ching Lian
• Liang-Yoong Ho