28 April, 2005 ISGC 2005, Taiwan The Efficient Handling of BLAST Applications on the GRID Hurng-Chun Lee (1) and Jakub Moscicki (2); (1) Academia Sinica Computing Centre, Taiwan; (2) CERN IT-GD-ED, Switzerland


28 April, 2005 ISGC 2005, Taiwan

The Efficient Handling of BLAST Applications on the GRID

Hurng-Chun Lee (1) and Jakub Moscicki (2)

(1) Academia Sinica Computing Centre, Taiwan
(2) CERN IT-GD-ED, Switzerland


Outline

• The considerations of distributing BLAST jobs
• The master-worker computing model of BLAST
  – mpiBLAST
• The Gridified BLAST
  – mpiBLAST-g2 vs. DIANE-BLAST

• Summary


The considerations of distributing BLAST jobs

• BLAST has been widely and routinely used for sequence analysis
• An essential component of most bioinformatics and life science applications
• Problem complexity ~ O(Sq × Sd)
  – Sq: the query size
  – Sd: the database size
• In most cases, Sd >> Sq
  – e.g. Sq ~ O(MB), Sd ~ O(GB)
  – The cost of moving the query is lower
• Database management, storage and sharing issues
  – Replication, archive
  – Privacy, security
• Other perspectives for service provision
  – scalability, robustness
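To make the "moving the query is cheaper" point concrete, here is a rough back-of-the-envelope sketch. The sizes are the order-of-magnitude values from this slide (query ~ MB, database ~ GB); the 10 MB/s bandwidth is purely an assumed figure for illustration, not a measurement.

```python
# Rough transfer-time comparison. The link bandwidth is an assumption;
# the sizes are order-of-magnitude values only (Sq ~ MB, Sd ~ GB).
MB = 10**6
GB = 10**9
ASSUMED_BANDWIDTH = 10 * MB  # bytes per second, hypothetical wide-area link


def transfer_seconds(size_bytes, bandwidth_bytes_per_s):
    """Time needed to move size_bytes over a link of the given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


query_time = transfer_seconds(1 * MB, ASSUMED_BANDWIDTH)
database_time = transfer_seconds(1 * GB, ASSUMED_BANDWIDTH)

print(f"query:    {query_time:.1f} s")
print(f"database: {database_time:.1f} s")
print(f"ratio:    {database_time / query_time:.0f}x")
```

Whatever the real bandwidth is, the ratio between the two times depends only on Sd / Sq, which is why it pays to ship queries to where the database already lives.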


The master-worker model of BLAST

• Database splitting is the easiest way to distribute BLAST jobs
• Fragmenting the database avoids memory swapping
• Each subtask can be 100% independent
• Each worker requests tasks from the master (pull model) and runs a normal BLAST search
• The individual results can be easily merged by the master process
• Report generation (BioSeq fetching)
• A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
  – The master-worker model can also be applied within each single-query search
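The pull model and result merging described on this slide can be sketched in a few lines. This is a toy illustration using threads and an in-memory queue, not how mpiBLAST or DIANE implement it; `search_fragment` is a hypothetical stand-in for a real BLAST search, and its scoring rule (number of shared letters) is invented for the example.

```python
import queue
import threading


def search_fragment(query, fragment):
    """Stand-in for a BLAST search on one database fragment.

    The "score" is just the number of distinct letters shared with the query.
    """
    return [(len(set(query) & set(seq)), seq) for seq in fragment]


def run_master_worker(query, fragments, n_workers=3):
    """Master fills a task list; workers pull tasks until the list is empty."""
    tasks = queue.Queue()
    for fragment in fragments:
        tasks.put(fragment)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fragment = tasks.get_nowait()  # pull the next task
            except queue.Empty:
                return  # nothing left to do
            hits = search_fragment(query, fragment)
            with lock:
                results.extend(hits)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Master merges the per-fragment hits: best score first,
    # ties broken by sequence so the output is deterministic.
    return sorted(results, key=lambda hit: (-hit[0], hit[1]))


if __name__ == "__main__":
    fragments = [["ACGT", "AAAA"], ["CCGG"], ["TTTT", "ACGG"]]
    for score, seq in run_master_worker("ACG", fragments):
        print(score, seq)
```

Because each fragment search is independent, the order in which workers pull tasks does not affect the merged result, only the elapsed time.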

[Figure: master-worker workflow. Labels: Database, formatdb, DB Fragments, Master, Task list, workers, Job requesting, blast search, Result merging, BioSeq fetching]


mpiBLAST LANL, US http://mpiblast.lanl.gov

• The MPI implementation of the BLAST master-worker model
• Advantages
  – High throughput
  – Load balancing
• Running in a local cluster
  – Performance and problem size are still limited by local computing power
  – Simultaneous I/O to a centralized database causes a performance bottleneck
  – Database sharing is still difficult


mpiBLAST-g2 ASCC, Taiwan and PRAGMA http://bits.sinica.edu.tw/mpiBlast/index_en.php

• A GT2-enabled parallel BLAST that runs on the Grid
  – GT2 GASSCOPY API
  – MPICH-g2
• An enhancement of mpiBLAST by ASCC
• Performs cross-cluster job execution
• Performs remote database sharing
• Helper tools for
  – database replication
  – automatic resource specification and job submission (with a static resource table)
  – multi-query job splitting and result merging
• Close link with the mpiBLAST development team
  – New mpiBLAST patches can be quickly applied to mpiBLAST-g2
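As an illustration of the multi-query job splitting listed among the helper tools, a splitter for FASTA input might look like the sketch below. It is not the mpiBLAST-g2 tool itself; the function name `split_fasta` is hypothetical, and the only assumption about the input is the standard FASTA convention that each record starts with a ">" header line.

```python
def split_fasta(text):
    """Split multi-record FASTA text into a list of single-record strings."""
    records = []
    current = []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


if __name__ == "__main__":
    multi = ">q1\nACGT\nGG\n>q2\nTTTT\n"
    for i, record in enumerate(split_fasta(multi)):
        print(f"--- query {i} ---")
        print(record, end="")
```

Each returned record can then be submitted as an independent single-query search, and the per-query reports concatenated in the original order.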


SC2004 mpiBLAST-g2 demonstration

[Figure: demonstration sites; surviving label: KISTI]


mpiBLAST-g2 current deployment

-- From PRAGMA GOC http://pragma-goc.rocksclusters.org


mpiBLAST-g2 Performance Evaluation (perfect case)

[Plots: Elapsed time and Speedup; curves: Searching + Merging, BioSeq fetching, Overall]

Database: est_human ~ 3.5 GBytes
Queries: 441 test sequences ~ 300 KBytes

• Overall speedup is approximately linear


mpiBLAST-g2 Performance Evaluation (worse case)

[Plots: Elapsed time and Speedup; curves: Searching + Merging, BioSeq fetching, Overall]

Database: drosophila NT ~ 122 MBytes
Queries: 441 test sequences ~ 300 KBytes

• The overall speedup is limited by the unscalable BioSeq fetching
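Why an unscalable BioSeq fetching step caps the overall speedup can be seen with a simple Amdahl-style model: the searching time is divided among the workers while the fetching time is not. The sketch below uses invented timings chosen only to contrast the two regimes; they are not the measured values behind the plots.

```python
def overall_speedup(t_search, t_fetch, n_workers):
    """Speedup when searching scales with workers but fetching does not."""
    t_serial = t_search + t_fetch
    t_parallel = t_search / n_workers + t_fetch
    return t_serial / t_parallel


def speedup_limit(t_search, t_fetch):
    """Upper bound on the speedup as the number of workers grows."""
    return (t_search + t_fetch) / t_fetch


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    cases = {
        "search dominates": (1000.0, 10.0),
        "fetch comparable": (100.0, 100.0),
    }
    for name, (t_search, t_fetch) in cases.items():
        for n in (2, 4, 8, 16):
            s = overall_speedup(t_search, t_fetch, n)
            print(f"{name:17s} n={n:2d} speedup={s:.2f}")
        print(f"{name:17s} limit={speedup_limit(t_search, t_fetch):.1f}")
```

With a large database the search term dominates and the curve stays close to linear; with a small database the fixed fetching time takes over and adding workers helps little.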


Issues of mpiBLAST-g2

• A single error crashes the whole job
  – This is inherent to MPICH
  – The error may be due to a transient problem in the loosely coupled Grid environment
• An MPI job starts only when all resources are available
  – Resources differ in their level of availability

Error recovery is required for
  – providing a robust application service on the Grid
  – using the Grid resources efficiently

Asynchronous task dispatching/pulling lets available resources be used immediately
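The difference between "start only when all resources are available" and "pull tasks as soon as a resource appears" can be illustrated with a small scheduling simulation. This is a sketch under simplifying assumptions (known task costs, no failures, no communication overhead); the availability times and costs in the example are invented.

```python
import heapq


def finish_time(start_times, task_costs):
    """Greedy pull scheduling: the earliest-free worker takes the next task."""
    free_at = list(start_times)
    heapq.heapify(free_at)
    finish = 0
    for cost in task_costs:
        start = heapq.heappop(free_at)  # worker that becomes free first
        end = start + cost
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


def gang_finish(available_at, task_costs):
    """MPI-style start: nobody works until the last resource is ready."""
    start = max(available_at)
    return finish_time([start] * len(available_at), task_costs)


def pull_finish(available_at, task_costs):
    """Asynchronous pulling: each worker starts as soon as it is available."""
    return finish_time(available_at, task_costs)


if __name__ == "__main__":
    available_at = [0, 0, 60]  # third worker is granted 60 s late (hypothetical)
    task_costs = [10] * 6      # six equal tasks of 10 s each (hypothetical)
    print("gang start:", gang_finish(available_at, task_costs))
    print("pull model:", pull_finish(available_at, task_costs))
```

In this example the two early workers finish all six tasks before the late worker even arrives, while the gang-started job has not yet begun.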


The DIANE http://cern.ch/diane

• DIstributed ANalysis Environment
• Lightweight distributed framework for parallel scientific applications in the master-worker model
  – A perfect match for the mpiBLAST computing model
• Current applications
  – BLAST for Genomic Sequence Analysis (DIANE-BLAST)
  – Geant4 Simulation for Radiotherapy and Astrophysics
  – Image Rendering
  – Data Analysis for High Energy Physics


DIANE Features

• Rapid prototyping
  – Python and CORBA
• Error recovery
  – Heartbeat worker health check
  – Resubmission of failed tasks
  – User-defined error recovery method
• No outbound connectivity required
  – Proxy for workers with only private IPs
• Job submitters for
  – Simple fork
  – Condor, LSF, SGE, PBS
  – GT2, LCG, gLite
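The heartbeat check and task resubmission listed under error recovery could work roughly as sketched below. This is not DIANE's actual API; the class `HeartbeatMonitor` and all of its method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow and test.

```python
class HeartbeatMonitor:
    """Toy master-side bookkeeping: heartbeats plus task resubmission."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a worker is dropped
        self.last_seen = {}     # worker id -> time of last heartbeat
        self.assigned = {}      # worker id -> task currently held
        self.pending = []       # tasks waiting for a worker

    def submit(self, task):
        self.pending.append(task)

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now

    def request_task(self, worker, now):
        """A worker pulls a task; the request also counts as a heartbeat."""
        self.heartbeat(worker, now)
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.assigned[worker] = task
        return task

    def complete(self, worker):
        """The worker reports its task as done; returns the finished task."""
        return self.assigned.pop(worker, None)

    def reap(self, now):
        """Drop silent workers and put their tasks back on the pending list."""
        dead = [w for w, seen in self.last_seen.items()
                if now - seen > self.timeout]
        for worker in dead:
            del self.last_seen[worker]
            task = self.assigned.pop(worker, None)
            if task is not None:
                self.pending.append(task)  # resubmission
        return dead


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=30)
    monitor.submit("fragment-0")
    monitor.submit("fragment-1")
    monitor.request_task("w1", now=0)
    monitor.request_task("w2", now=0)
    monitor.heartbeat("w1", now=25)  # w2 stays silent
    print("dropped:", monitor.reap(now=40))
    print("pending again:", monitor.pending)
```

A single lost worker therefore costs one task's worth of repeated work rather than the whole job.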

[Figure: DIANE architecture. Labels: Pull Model, Batch and Interactive, Distributed workers, planner, integrator]


DIANE-BLAST implementation

• Splitting mpiBLAST-g2 into DIANE components
  – Master (Planner and Integrator), Worker
• Wrapping each component with Python
  – Hooking the core BLAST C libraries into Python with SWIG
• Implementing the DIANE GT2 job submitter
  – For running workers on GT2-enabled clusters
• Reusing the databases deployed for mpiBLAST-g2
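The division of roles named on this slide (Planner, Integrator, Worker) might be pictured as three small Python classes. The sketch below only illustrates the split of responsibilities; the class and method names are assumptions and do not reproduce the DIANE interfaces, and the search function is passed in as a plain callable where the real system would call the SWIG-wrapped BLAST libraries.

```python
class Planner:
    """Master side: turns one search into one task per database fragment."""

    def __init__(self, query, fragment_names):
        self.query = query
        self.fragment_names = list(fragment_names)

    def tasks(self):
        return [{"query": self.query, "fragment": name}
                for name in self.fragment_names]


class Worker:
    """Worker side: runs the search for a single task."""

    def __init__(self, search):
        # search is any callable(query, fragment) -> list of (score, subject)
        self.search = search

    def execute(self, task):
        return self.search(task["query"], task["fragment"])


class Integrator:
    """Master side: merges per-task hits into the final report."""

    def __init__(self, top_n):
        self.top_n = top_n
        self.hits = []

    def add(self, hits):
        self.hits.extend(hits)

    def report(self):
        return sorted(self.hits, reverse=True)[:self.top_n]


if __name__ == "__main__":
    # Fake search for demonstration: the score is the fragment's last digit.
    def fake_search(query, fragment):
        return [(int(fragment[-1]), fragment)]

    planner = Planner("ACGT", ["db.1", "db.3", "db.2"])
    worker = Worker(fake_search)
    integrator = Integrator(top_n=2)
    for task in planner.tasks():
        integrator.add(worker.execute(task))
    print(integrator.report())
```

Keeping the planner and integrator separate from the worker is what allows the same search code to run unchanged on any resource a job submitter can reach.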


mpiBLAST-g2 vs. DIANE-BLAST The Speedup

• Query
  – Drosophila chromosome 4
  – size: 1.2 Mbp
• DB
  – Drosophila nucleotide sequence database
  – size: 1170 seq., 122 Mbp
  – no. fragments: 32
• Computing Resource
  – Available # of CPU: 12
  – PIII 1.4GHz
  – 1GByte Memory
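With 32 fragments and up to 12 CPUs, the fragment count itself puts a ceiling on the achievable speedup, because a fragment is the smallest unit of work. The sketch below works this out under two assumptions that the slide does not state: every fragment takes the same time to search, and all CPUs act as workers.

```python
import math


def max_speedup(n_fragments, n_cpus):
    """Upper bound on speedup when equal-cost fragments are the unit of work."""
    rounds = math.ceil(n_fragments / n_cpus)  # waves of fragment searches
    return n_fragments / rounds


if __name__ == "__main__":
    for cpus in (1, 2, 4, 8, 12, 16):
        print(f"{cpus:2d} CPUs -> at most {max_speedup(32, cpus):.2f}x")
```

Under these assumptions 12 CPUs need three waves (12 + 12 + 8 fragments), so the bound is 32 / 3, a little below 12; CPU counts that divide 32 evenly waste no capacity in the last wave.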

[Plots: Speedup of mpiBLAST-g2; Speedup of DIANE-BLAST]


mpiBLAST-g2 vs. DIANE-BLAST The Worker Lifeline

DIANE-BLAST task dispatching

• Handled by DIANE’s task thread

• Due to the bugs in the current DIANE release


mpiBLAST-g2 task dispatching

• mpiBLAST-g2 task handling logic



mpiBLAST-g2 vs. DIANE-BLAST Overall Comparisons

• mpiBLAST-g2
  – Master-Worker model implemented using MPICH-g2 libraries
  – Gridification efforts
    • Implementing database sharing with the GASSCOPY API
    • Recompilation with MPICH-g2 and GT2 libraries
  – Error recovery
    • Needs a fault-tolerant MPI
  – Cross cluster computation
    • Requires outbound connectivity on each worker
  – Performance/Throughput
    • In-cluster performance is as good as the original mpiBLAST

• DIANE-BLAST
  – Pluggable application for the DIANE Master-Worker framework
  – Gridification efforts
    • Through the gridified DIANE framework
  – Error recovery
    • Task resubmission
    • Tracking the health of each worker
  – Cross cluster computation
    • Using a proxy for workers with private IPs
  – Performance/Throughput
    • Performance can be tuned by controlling the job thread


Summary

• Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficiently handling BLAST jobs on the Grid
• Both implementations are based on the Master-Worker model for distributing BLAST jobs on the Grid
• mpiBLAST-g2 has good scalability and speedup in some cases
  – Requires a fault-tolerant MPI implementation for error recovery
  – In the unscalable cases, BioSeq fetching is the bottleneck
• DIANE-BLAST provides a flexible mechanism for error recovery
  – Any master-worker workflow can be easily plugged into this framework
  – The job thread control should be improved to achieve good performance and scalability


Thanks for your attention!!