28 April, 2005 ISGC 2005, Taiwan

The Efficient Handling of BLAST Applications on the GRID

Hurng-Chun Lee¹ and Jakub Moscicki²

¹ Academia Sinica Computing Centre, Taiwan
² CERN IT-GD-ED, Switzerland

Outline

• The considerations of distributing BLAST jobs
• The master-worker computing model of BLAST
– mpiBLAST
• The Gridified BLAST
– mpiBLAST-g2 vs. DIANE-BLAST
• Summary

The considerations of distributing BLAST jobs

• BLAST is widely and routinely used for sequence analysis
• It is an essential component of most bioinformatics and life science applications
• Problem complexity ~ O(Sq × Sd)
– Sq: the query size
– Sd: the database size
• In most cases, Sd >> Sq
– e.g. Sq ~ O(MB), Sd ~ O(GB)
– The cost of moving the query is lower
• Database management, storage and sharing issues
– Replication, archiving
– Privacy, security
• Other considerations for providing a service
– Scalability, robustness
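The claim that the query is cheaper to move than the database can be checked with a back-of-the-envelope estimate. The sketch below is illustrative only: the 100 Mbit/s link is a hypothetical value, and the sizes are simply the orders of magnitude quoted on the slide (query ~ MB, database ~ GB). Latency and protocol overhead are ignored.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float) -> float:
    """Time to ship size_bytes over a link of the given bandwidth (no latency, no overhead)."""
    return size_bytes * 8 / bandwidth_bits_per_s


# Assumed values, for illustration only.
LINK = 100e6       # hypothetical 100 Mbit/s wide-area link
QUERY_SIZE = 1e6   # Sq ~ O(MB)
DB_SIZE = 1e9      # Sd ~ O(GB)

query_time = transfer_seconds(QUERY_SIZE, LINK)
db_time = transfer_seconds(DB_SIZE, LINK)

print(f"moving the query:    {query_time:8.2f} s")
print(f"moving the database: {db_time:8.2f} s")
print(f"ratio:               {db_time / query_time:8.0f}x")
```

Under these assumptions the ratio of the two transfer times is just Sd / Sq, which is why sending queries to pre-deployed databases is the natural design.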

The master-worker model of BLAST

• Database splitting is the easiest way to distribute BLAST jobs
• Fragmenting the database avoids memory swapping
• Each subtask can be fully independent
• Each worker requests tasks from the master (pull model) and runs a normal BLAST search
• The individual results can be easily merged by the master process
• Report generation (BioSeq fetching)
• A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
– The master-worker model can also be applied within each single-query search
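The "trivial script" for splitting a multi-query search is not shown in the talk. A minimal sketch of what such a script could look like follows, assuming the queries arrive as multi-FASTA text; the function names and the example sequences are hypothetical.

```python
from typing import Iterator, List, Tuple


def split_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, one per query, from multi-FASTA text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def to_single_query_files(text: str) -> List[str]:
    """Return one FASTA-formatted string per query, ready to be written to separate files."""
    return [f">{h}\n{s}\n" for h, s in split_fasta(text)]


example = """>query1 hypothetical sequence
ACGTACGT
ACGT
>query2 another hypothetical sequence
TTGGCCAA
"""
for record in to_single_query_files(example):
    print(record, end="")
```

Each returned record can then be submitted as an independent single-query job.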

[Figure: master-worker BLAST workflow. Components: Database, DB Fragments, Master, Task list, workers. Steps: formatdb, job requesting, blast search, result merging, BioSeq fetching.]
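The workflow on this slide (fragment the database, let workers pull tasks, merge the results on the master) can be sketched in a few lines. This is a toy model under stated assumptions, not mpiBLAST code: threads stand in for remote workers, and `search_fragment` is a hypothetical stand-in that does exact substring matching instead of a real BLAST search.

```python
import queue
import threading


def search_fragment(query: str, fragment: list) -> list:
    """Stand-in for a BLAST search on one fragment: count exact substring matches."""
    return [(name, seq.count(query)) for name, seq in fragment if query in seq]


def split_database(db: list, n_fragments: int) -> list:
    """Split the database into roughly equal fragments (stands in for the fragmentation step)."""
    return [db[i::n_fragments] for i in range(n_fragments)]


def run_master(query: str, db: list, n_fragments: int, n_workers: int) -> list:
    tasks = queue.Queue()
    results = queue.Queue()
    for fragment in split_database(db, n_fragments):
        tasks.put(fragment)

    def worker():
        # Pull model: a worker asks for the next task as soon as it is free.
        while True:
            try:
                fragment = tasks.get_nowait()
            except queue.Empty:
                return
            results.put(search_fragment(query, fragment))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge step: combine the per-fragment hits and rank them.
    merged = []
    while not results.empty():
        merged.extend(results.get())
    return sorted(merged, key=lambda hit: (-hit[1], hit[0]))


db = [("seqA", "ACGTACGT"), ("seqB", "TTTT"), ("seqC", "GGACGTCC"), ("seqD", "ACGT")]
print(run_master("ACGT", db, n_fragments=3, n_workers=2))
```

Because the tasks are independent, a faster worker simply pulls more fragments, which is where the load balancing of the pull model comes from.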

mpiBLAST

LANL, US http://mpiblast.lanl.gov

• An MPI implementation of the BLAST master-worker model
• Advantages
– High throughput
– Load balancing
• Runs in a local cluster
– Performance and problem size are still limited by local computing power
– Simultaneous I/O to a centralized database causes a performance bottleneck
– Database sharing is still difficult

mpiBLAST-g2

ASCC, Taiwan and PRAGMA http://bits.sinica.edu.tw/mpiBlast/index_en.php

• A GT2-enabled parallel BLAST that runs on the Grid
– GT2 GASSCOPY API
– MPICH-g2
• An enhancement of mpiBLAST by ASCC
• Supports cross-cluster job execution
• Supports remote database sharing
• Helper tools for
– database replication
– automatic resource specification and job submission (with a static resource table)
– multi-query job splitting and result merging
• Close link with the mpiBLAST development team
– New mpiBLAST patches can be quickly applied to mpiBLAST-g2

SC2004 mpiBLAST-g2 demonstration

[Figure: demonstration testbed; the only label recoverable from the image is KISTI]

mpiBLAST-g2 current deployment

-- From PRAGMA GOC http://pragma-goc.rocksclusters.org

mpiBLAST-g2 Performance Evaluation (perfect case)

[Figure: two panels, "Elapsed time" and "Speedup"; curves: Searching + Merging, BioSeq fetching, Overall]

Database: est_human ~ 3.5 GBytes
Queries: 441 test sequences ~ 300 KBytes

• Overall speedup is approximately linear

mpiBLAST-g2 Performance Evaluation (worst case)

[Figure: two panels, "Elapsed time" and "Speedup"; curves: Searching + Merging, BioSeq fetching, Overall]

Database: drosophila NT ~ 122 MBytes
Queries: 441 test sequences ~ 300 KBytes

• The overall speedup is limited by the unscalable BioSeq fetching
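Why an unscalable BioSeq fetching step caps the overall speedup can be seen with a simple model in the spirit of Amdahl's law: the search time is divided among the workers while the fetching time stays constant. The functional form and the timings below are assumptions chosen for illustration; they are not fitted to the measurements in the talk.

```python
def speedup(n_workers: int, t_search: float, t_fetch: float) -> float:
    """Speedup when searching scales with the number of workers but fetching does not."""
    t_one = t_search + t_fetch
    t_n = t_search / n_workers + t_fetch
    return t_one / t_n


# Hypothetical timings in seconds, not taken from the talk's plots.
cases = {
    "search-dominated": (1000.0, 10.0),
    "fetch-dominated": (100.0, 100.0),
}
for name, (t_search, t_fetch) in cases.items():
    row = ", ".join(
        f"n={n}: {speedup(n, t_search, t_fetch):.2f}" for n in (1, 2, 4, 8, 16)
    )
    print(f"{name:17s} {row}")
```

In this model the speedup can never exceed (t_search + t_fetch) / t_fetch, so a small database with a comparatively expensive fetching step saturates quickly, while a large database stays close to linear over the same range of workers.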

Issues of mpiBLAST-g2

• A single error crashes the whole job
– This is inherent to MPICH
– The error may be due to a transient problem in the loosely coupled Grid environment
• An MPI job starts only when all resources are available
– Resources differ in their level of availability

Error recovery is required for
– providing a robust application service on the Grid
– using the Grid resources efficiently

Asynchronous task dispatching/pulling is needed to use the available resources immediately

DIANE

http://cern.ch/diane

• DIstributed ANalysis Environment
• Lightweight distributed framework for parallel scientific applications in the master-worker model
– A perfect match for the mpiBLAST computing model
• Current applications
– BLAST for Genomic Sequence Analysis (DIANE-BLAST)
– Geant4 Simulation for Radiotherapy and Astrophysics
– Image Rendering
– Data Analysis for High Energy Physics

DIANE Features

• Rapid prototyping
– Python and CORBA
• Error recovery
– Heartbeat worker health check
– Resubmission of failed tasks
– User-defined error recovery method
• No need for outbound connectivity
– Proxy for workers with only private IPs
• Job submitters for
– Simple fork
– Condor, LSF, SGE, PBS
– GT2, LCG, gLite

[Figure: DIANE architecture. Labels: Pull Model; Batch and Interactive; Distributed workers; master components: planner, integrator]
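The error-recovery features listed on this slide (heartbeat health check, resubmission of failed tasks) can be illustrated with a toy master. The sketch below does not use DIANE's API: the `Master` class, its method names and the logical timestamps are all hypothetical, and the clock is passed in explicitly so that the example is deterministic.

```python
from collections import deque


class Master:
    """Toy master: hands out tasks on request and re-queues tasks of silent workers."""

    def __init__(self, tasks, heartbeat_timeout: float):
        self.pending = deque(tasks)
        self.running = {}      # worker id -> task currently assigned
        self.last_seen = {}    # worker id -> time of last heartbeat
        self.done = []
        self.timeout = heartbeat_timeout

    def heartbeat(self, worker: str, now: float) -> None:
        self.last_seen[worker] = now

    def request_task(self, worker: str, now: float):
        """Pull model: a worker asks for work; returns None if nothing is pending."""
        self.heartbeat(worker, now)
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[worker] = task
        return task

    def report_result(self, worker: str, now: float) -> None:
        self.heartbeat(worker, now)
        self.done.append(self.running.pop(worker))

    def check_workers(self, now: float) -> list:
        """Re-queue tasks held by workers whose last heartbeat is older than the timeout."""
        lost = [w for w in self.running if now - self.last_seen[w] > self.timeout]
        for w in lost:
            self.pending.append(self.running.pop(w))
        return lost


master = Master(["frag0", "frag1", "frag2"], heartbeat_timeout=5.0)
master.request_task("w1", now=0.0)     # w1 gets frag0
master.request_task("w2", now=0.0)     # w2 gets frag1, then goes silent
master.report_result("w1", now=3.0)
master.request_task("w1", now=3.0)     # w1 gets frag2
master.heartbeat("w1", now=6.0)
lost = master.check_workers(now=6.0)   # w2 was last seen at 0.0, so frag1 is re-queued
master.report_result("w1", now=7.0)
master.request_task("w1", now=7.0)     # w1 picks up the resubmitted frag1
master.report_result("w1", now=9.0)
print("lost workers:", lost)
print("completed:", master.done)
```

The point of the design is that losing one worker costs only the task it was holding; the remaining workers keep pulling, and the job completes without a restart.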

DIANE-BLAST implementation

• Splitting mpiBLAST-g2 into DIANE components
– Master (Planner and Integrator), Worker
• Wrapping each component with Python
– Hooking the core BLAST C libraries into Python with SWIG
• Implementing the DIANE GT2 job submitter
– For running workers on GT2-enabled clusters
• Reusing the databases already deployed for mpiBLAST-g2

mpiBLAST-g2 vs. DIANE-BLAST: The Speedup

• Query
– Drosophila chromosome 4
– size: 1.2 Mbp
• DB
– Drosophila nucleotide sequence database
– size: 1170 seq., 122 Mbp
– no. of fragments: 32
• Computing resource
– Available # of CPUs: 12
– PIII 1.4 GHz
– 1 GByte memory

[Figure: two speedup plots, "Speedup of mpiBLAST-g2" and "Speedup of DIANE-BLAST"]

mpiBLAST-g2 vs. DIANE-BLAST: The Worker Lifeline

DIANE-BLAST task dispatching

• Handled by DIANE’s task thread

• Due to the bugs in the current DIANE release

mpiBLAST-g2 task dispatching

• mpiBLAST-g2 task handling logic

mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons

• mpiBLAST-g2
– Master-Worker model implemented using MPICH-g2 libraries
– Gridification efforts
  • Implementing database sharing with the GASSCOPY API
  • Recompilation with MPICH-g2 and GT2 libraries
– Error recovery
  • Needs a fault-tolerant MPI
– Cross-cluster computation
  • Requires outbound connectivity on each worker
– Performance/Throughput
  • In-cluster performance is as good as that of the original mpiBLAST

• DIANE-BLAST
– Pluggable application for the DIANE Master-Worker framework
– Gridification efforts
  • Through the gridified DIANE framework
– Error recovery
  • Task resubmission
  • Tracking the health of each worker
– Cross-cluster computation
  • Using a proxy for workers with private IPs
– Performance/Throughput
  • Performance can be tuned by controlling the job thread

Summary

• Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficiently handling BLAST jobs on the Grid
• Both implementations are based on the Master-Worker model for distributing BLAST jobs on the Grid
• mpiBLAST-g2 has good scalability and speedup in some cases
– It requires a fault-tolerant MPI implementation for error recovery
– In the unscalable cases, BioSeq fetching is the bottleneck
• DIANE-BLAST provides a flexible mechanism for error recovery
– Any master-worker workflow can be easily plugged into this framework
– The job thread control should be improved to achieve good performance and scalability

Thanks for your attention!!