Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis Thesis Defense: Ashish Nagavaram Graduate student Computer Science and Engineering Advisor: Dr. Gagan Agrawal Committee: Dr. Rajiv Ramnath Dr. Michael Freitas


Transcript of Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Page 1: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Thesis Defense: Ashish Nagavaram
Graduate student, Computer Science and Engineering

Advisor: Dr. Gagan Agrawal
Committee: Dr. Rajiv Ramnath, Dr. Michael Freitas

Page 2: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Introduction

Cloud computing
• Resources on demand
• Pay-as-you-go
• Elasticity

Resource allocation on the cloud
• Dynamic resource allocation

Page 3: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Motivation

Use the elasticity of the cloud for executing scientific applications
• Over-provisioning and under-provisioning
• Avoid wastage of resources

No generalized scientific workflow exists to execute applications in a dynamic fashion

Allocate resources during the execution

Meet time constraints by using more resources

Page 4: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Background-MassMatrix

Developed by Dr. Hua Xu and Dr. Michael Freitas at Ohio State University

A database search program for rapid characterization of proteins and peptides
• Supports multiple data formats such as .mgf, .mzXML and raw data
• The input database is in .fasta or .BAS format

Page 5: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

MassMatrix Application Flow

[Flowchart] Inputs: theoretical protein database and MS/MS data input file
1. Digest the sequence
2. Has the sequence been searched before?
• yes: do not add it to the final result
• no: continue with the search
3. Full-scan search for finding matching peptides
4. Clear insignificant peptides
5. Statistical analysis to generate results
6. Results

Page 6: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Contributions (1/2)

Providing a framework for parallelization of the MassMatrix application

Creating a dynamic workflow
• Resources are allocated adaptively
• QOS is achieved by parameter prediction
• Gives the user control through a benefit function

Page 7: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Contributions (2/2)

Allows the user to specify the time constraint within which the application should complete

“A cloud-based Dynamic Workflow for Mass spectrometry Data Analysis” - Ashish Nagavaram, Gagan Agrawal, Michael Freitas, Gaurang Mehta 7th IEEE Conference on E-Science, Dec 2011

Page 8: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Outline

• Introduction
• Motivation
• Background
• Parallelization of MassMatrix
• Adaptive Resource Allocation
• Experimental Results
• Parameter Prediction
• Conclusion

Page 9: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parallel MassMatrix

Parallelize the full-scan search phase
• Takes the longest time to execute
• The rest of the phases are sequential

A split-merge approach is followed
• The user can specify the number of splits
• Splits are made based on specific tags
• The index is embedded in the file-split name
• Other options were also considered
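The split step can be sketched roughly as follows. This is a minimal illustration, not the thesis code: it assumes the "specific tags" are the `BEGIN IONS` / `END IONS` delimiters that bracket each spectrum in an .mgf file, and the helper names `split_mgf` and `split_name` as well as the `.splitN` naming pattern are hypothetical.

```python
from pathlib import Path


def split_mgf(text, n_splits):
    """Divide MGF text into at most n_splits chunks without cutting a spectrum.

    Assumption: spectra are delimited by BEGIN IONS / END IONS lines.
    Lines outside those delimiters (file-level headers) are ignored here.
    """
    spectra, current = [], []
    for line in text.splitlines():
        tag = line.strip()
        if tag == "BEGIN IONS":
            current = [line]
        elif tag == "END IONS" and current:
            current.append(line)
            spectra.append("\n".join(current))
            current = []
        elif current:
            current.append(line)

    # Hand out spectra in contiguous, near-equal groups so order is preserved.
    size, extra = divmod(len(spectra), n_splits)
    chunks, start = [], 0
    for i in range(n_splits):
        end = start + size + (1 if i < extra else 0)
        if start < end:
            chunks.append("\n".join(spectra[start:end]) + "\n")
        start = end
    return chunks


def split_name(input_name, index):
    """Embed the split index in the file name (hypothetical naming scheme)."""
    path = Path(input_name)
    return f"{path.stem}.split{index}{path.suffix}"
```

Keeping the index in the file name lets the merge step recover the original order of the splits from the names alone.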

Page 10: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parallel MassMatrix (contd.)

Only the input file is split
• Splitting the database leads to redundant results
• Splitting both the input and the database has the same problem

The intermediate files are written to disk
• Pointers are serialized
• Written as comma-separated values

A Python script keeps polling the job queue to check whether the parallel phase has completed
• Suspends the sequential phase until then
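The polling step might look like the sketch below. How the queue is actually queried is not stated on the slide, so `count_pending` is a hypothetical callable supplied by the caller (for example, a wrapper around a scheduler query); the poll interval is likewise an assumption.

```python
import time


def wait_for_parallel_phase(count_pending, poll_seconds=5.0, sleep=time.sleep):
    """Block until no parallel jobs remain, then return the number of polls that waited.

    count_pending: hypothetical callable returning how many split jobs are
    still queued or running. The sequential phase should start only after
    this function returns.
    """
    polls = 0
    while count_pending() > 0:
        polls += 1
        sleep(poll_seconds)  # wait before checking the queue again
    return polls
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays.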

Page 11: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parallel MassMatrix (Contd.)

The intermediate files are read back in and re-indexed while merging

The merging process is complicated
• Complex data structures (matrix of matrices)
• Have to get inside each data structure to maximize them
• Intermediate files are indexed among each other
• While re-indexing, both the local and the global index are maintained
• The data structures are also renumbered while merging
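The idea of keeping a local and a global index during the merge can be illustrated with a deliberately simplified sketch. The real intermediate structures (a matrix of matrices) are not described in enough detail to reproduce, so this assumes each split yields a flat list of records; `merge_splits` and the dictionary keys are hypothetical.

```python
def merge_splits(split_results):
    """Merge per-split record lists, given in split order, into one list.

    Each merged entry keeps its local index (position within its split)
    and receives a global index (position in the merged result).
    """
    merged = []
    offset = 0  # number of records contributed by earlier splits
    for split_id, records in enumerate(split_results, start=1):
        for local_index, record in enumerate(records):
            merged.append({
                "global": offset + local_index,
                "split": split_id,
                "local": local_index,
                "record": record,
            })
        offset += len(records)
    return merged
```

Because the offset only grows as splits are consumed in order, the global numbering matches what a single unsplit run would have produced for the same records.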

Page 12: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parallel MassMatrix (contd.)

Intermediate files are merged in order of the split they process

Unnecessary intermediate files are not loaded back
• Saves memory
• Helps in the case of large data files

Page 13: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

MassMatrix Flow (Parallel)

[Figure: flow diagram. Labels: Configuration File, Input File, Input Database, Python Script, split1, split2, ..., splitN, massmatrix (one per split), Merge, Sequential phase]

Page 14: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental results (Parallelization)

Experimental setup:
• 8-core Intel Xeon node with 6 GB of DDR400 RAM

The theoretical database used was 20 MB
• A .fasta format database is used

The code was run for 6 different datasets
• Each had 50,000 records on average
• All are in .mgf format

Experiments are run for 1, 2, 4 and 8 splits
• Run on a single node with 8 cores

Page 15: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental results (Parallelization)

Execution times when datasets are run for 1, 2, 4 and 8 splits

Page 16: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental results (Parallelization)

Execution times for datasets when run on 1, 2, 4 and 8 cores

Page 17: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Background (Pegasus)

Used to help create an adaptive version of MassMatrix
• A software system to manage workflows
• Manages resources on local, grid and cloud
• Provides APIs to create workflows

Creates a DAG to represent dependencies
• The DAG has a connection between nodes if there is a dependency

Creates a plan for the execution of the application
• Executes the application according to this plan
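To make the DAG idea concrete, the sketch below models the split, search, merge and sequential steps as a dependency graph and derives one valid execution order. It uses Python's standard `graphlib` module (Python 3.9+) rather than the Pegasus API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def build_workflow(n_splits):
    """Map each job to the set of jobs it depends on (its DAG predecessors)."""
    deps = {"split": set()}
    search_jobs = [f"search_{i}" for i in range(1, n_splits + 1)]
    for job in search_jobs:
        deps[job] = {"split"}          # every search needs the split output
    deps["merge"] = set(search_jobs)   # merge waits for all searches
    deps["sequential_phase"] = {"merge"}
    return deps


def execution_plan(deps):
    """Return one ordering in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())
```

The search jobs have no edges between them, which is what allows a scheduler to run them in parallel.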

Page 18: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Background (Condor)

Uses Wrangler to start nodes in the cloud
• New nodes are added to the cluster automatically
• Uses Amazon private and public keys to identify the user
• Configuration is specified in an XML file

Condor is the job scheduler used
• Developed at the University of Wisconsin
• Jobs are stored in a queue
• Jobs are submitted from the queue to the cluster in FIFO order
• Provides fault tolerance through checkpointing

Page 19: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

The Pegasus workflow

Pegasus workflow showing the workflow of MassMatrix Application

Page 20: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parallel Pegasus workflow

Pegasus workflow for parallel version of MassMatrix application

Page 21: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Adaptive Resource Allocation

An approach for dynamic resource allocation
• Decision based on the rate of execution
• Calculates the number of additional resources needed to meet the time constraint

Initial assumption that input is divided into equal splits

Decision made on the basis of execution time of initial N splits

Page 22: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Adaptive Resource Allocation (Contd.)

The code is initially run with N resources

For our case we used N=4

Let Tper_split be the execution time of a single split and Tconstraint be the user-specified time constraint

Then we can say that

Ttime_constraint = Tconstraint – ( 2 × Tper_split ) (1)

Page 23: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Adaptive Resource Allocation (Contd.)

Another N splits must have already started execution
• Hence we do not consider them in the calculation

Hence, if we use N resources, the predicted execution time is

Texecution_pred = Tper_split × ( split_count – 2 × N ) (2)

Page 24: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Adaptive Resource Allocation (Contd.)

Based on equations (1) and (2), we can calculate the number of nodes needed as

Nodesrequired = ( ( Texecution_pred / Ttime_constraint ) – 1 ) × N

where Nodesrequired is the number of additional nodes that need to be spawned
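The calculation can be tried out with a short sketch. It assumes the node count takes the form Nodesrequired = ((Texecution_pred / Ttime_constraint) – 1) × N, combined with equations (1) and (2); rounding up to a whole node and clamping at zero are additional assumptions, and the function name is hypothetical.

```python
import math


def additional_nodes(t_constraint, t_per_split, split_count, n):
    """Estimate how many extra nodes to spawn to meet the time constraint.

    t_constraint: user-specified time constraint
    t_per_split:  measured execution time of a single split
    split_count:  total number of splits
    n:            number of resources the run started with
    """
    # Equation (1): time left once two rounds of splits are accounted for.
    t_time_constraint = t_constraint - 2 * t_per_split
    if t_time_constraint <= 0:
        raise ValueError("time constraint cannot be met at this split time")

    # Equation (2): predicted execution time for the remaining splits.
    t_execution_pred = t_per_split * (split_count - 2 * n)

    # Assumed form of the node equation; round up, never return a negative count.
    needed = (t_execution_pred / t_time_constraint - 1) * n
    return max(0, math.ceil(needed))
```

For example, with a constraint of 100 time units, 10 units per split, 40 splits and N=4, equation (1) gives 80, equation (2) gives 320, and the estimate is (320 / 80 – 1) × 4 = 12 additional nodes.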

Page 25: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Adaptive Algorithm

Algorithm showing the steps involved in calculating the additional resources needed to meet the time constraint

Page 26: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Goals

To evaluate efficiency of our system with different datasets

The framework is effective
• Calculates the additional nodes required
• Meets the time constraints
• Tested for different time constraints

Page 27: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental results (Adaptive)

Experimental setup:
• Cloud infrastructure: Amazon EC2
• Submit host to submit jobs to the cloud
• Pegasus version 3.0.2
• Condor job scheduler version 7.5.6

Results for 2 datasets and different time constraints

Page 28: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Results (contd.)

Results obtained when the algorithm is run for different time constraints on dataset1

Page 29: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Results (contd.)

Results obtained for dataset2 when run with the same time constraints

Page 30: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Benefit function and Parameter prediction (QOS)

Motivation: Provide Quality of service

• Tradeoff between execution time and quality of results
• Quality depends on the parameter values
• Provide a way for the user to control the quality of results
• Quality is defined as an equation in terms of the parameters

User has flexibility to decide which parameter has more importance

Makes prediction such that execution time is as close as possible to time constraint

Page 31: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Benefit function and Parameter prediction (QOS)

Benefit function: an equation made of some or all parameters of the application
• We use this equation to set the parameter importance
• This is the minimal set of equations needed to obtain the required quality

The goal is to maximize this benefit function within the user-specified time constraint
• Calculated for different parameter combinations

Decisions are made using tables constructed from data of previous executions
• Hash tables are used

Page 32: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Benefit function and Parameter prediction (QOS)

Tables contain parameter combination to execution time mappings and vice versa

Multiple datasets can be used for prediction
• Parameters are mapped to the average execution time
• Reduces the error percentage
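A table-driven prediction of this kind could be sketched as below. It is only an illustration under stated assumptions: parameter combinations are plain tuples, the benefit function is any callable over such a tuple, ties are broken in favor of the execution time closest to the constraint, and the names `build_table` and `predict_parameters` are hypothetical.

```python
from statistics import mean


def build_table(runs):
    """Build a hash table: parameter combination -> average execution time.

    runs: iterable of (params, execution_time) pairs from previous
    executions, possibly collected over several datasets.
    """
    times = {}
    for params, seconds in runs:
        times.setdefault(params, []).append(seconds)
    return {params: mean(values) for params, values in times.items()}


def predict_parameters(table, benefit, time_constraint):
    """Pick the combination with the highest benefit that fits the constraint.

    Returns None when no recorded combination is expected to finish in time.
    """
    feasible = [(p, t) for p, t in table.items() if t <= time_constraint]
    if not feasible:
        return None
    # Highest benefit first; among equals, prefer the time nearest the constraint.
    best = max(feasible, key=lambda item: (benefit(item[0]), item[1]))
    return best[0]
```

A benefit function such as `lambda p: 2 * p[0] + p[1]` would weight the first parameter twice as heavily as the second, which is one way a user could express parameter importance.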

Page 33: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Parameter prediction process

Page 34: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Results

Experiments conducted on a Linux desktop machine with 2 cores and 1 GB of memory

The tables are populated using two datasets data1.mgf and data2.mgf

The parameter combinations are predicted for two other datasets data3.mgf and data4.mgf

Page 35: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Results

Parameter prediction results when run for different benefit functions and constraints

Page 36: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Experimental Results

Parameter Prediction results for a different Benefit Function

Page 37: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Conclusion

Presented a framework for dynamic execution of scientific workflows

A user-specified time constraint can be used to drive the allocation of resources

Effective dynamic allocation

Maximizing the benefit function
• Parameter prediction within this value
• Provide quality results based on user requirements

Page 38: Cloud based Dynamic workflow with QOS for Mass Spectrometry Data Analysis

Thank you