28 April, 2005ISGC 2005, Taiwan The Efficient Handling of BLAST Applications on the GRID Hurng-Chun...

28 April, 2005 ISGC 2005, Taiwan

The Efficient Handling of BLAST Applications on the GRID

Hurng-Chun Lee (1) and Jakub Moscicki (2)

(1) Academia Sinica Computing Centre, Taiwan
(2) CERN IT-GD-ED, Switzerland

Outline

• The considerations of distributing BLAST jobs
• The master-worker computing model of BLAST
– mpiBLAST
• The Gridified BLAST
– mpiBLAST-g2 vs. DIANE-BLAST
• Summary

The considerations of distributing BLAST jobs

• BLAST has been widely and routinely used for sequence analysis

• It is an essential component of most bioinformatics and life science applications

• Problem complexity ~ O(Sq × Sd)
– Sq: the query size
– Sd: the database size
• In most cases, Sd >> Sq
– e.g. Sq ~ O(MB), Sd ~ O(GB)
– Moving the query costs less than moving the database
• Database management, storage and sharing issues
– Replication, archive
– Privacy, security
• Other considerations for providing a service
– Scalability, robustness
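The size argument above can be made concrete with a little arithmetic. The following is a minimal sketch only: the 1 MB query, 1 GB database and 10 MB/s link are assumed, illustrative values, not measurements from this work.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time needed to move size_bytes over a link of the given bandwidth."""
    return size_bytes / bandwidth_bytes_per_s


MB = 1024 ** 2
GB = 1024 ** 3

# Assumed values for illustration: 1 MB query, 1 GB database, 10 MB/s link.
query_s = transfer_seconds(1 * MB, 10 * MB)
database_s = transfer_seconds(1 * GB, 10 * MB)

print(f"query: {query_s:.1f} s, database: {database_s:.1f} s")
```

Under these assumptions the database takes about a thousand times longer to move than the query, which is why it pays to send queries to where the database already resides.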

The master-worker model of BLAST

• Database splitting is the easiest way to distribute BLAST jobs
• Fragmenting the database avoids memory swapping
• Each subtask can be 100% independent
• Each worker requests tasks from the master (pull model) and runs a normal BLAST search
• The individual results can be easily merged by the master process
• Report generation (BioSeq fetching)
• A multi-query BLAST search can be easily split into multiple independent single-query BLAST searches by a trivial script
– The master-worker model can also be applied within each single-query search
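The "trivial script" for splitting a multi-query search might look like the sketch below. It is not the helper tool shipped with mpiBLAST-g2; the function name `split_fasta` is hypothetical, and the input is assumed to be plain FASTA text in which every record starts with a ">" header line.

```python
def split_fasta(text: str) -> list[str]:
    """Split multi-FASTA text into one FASTA record per query."""
    records = []
    current = []
    for line in text.splitlines():
        # A new header closes the record collected so far.
        if line.startswith(">") and current:
            records.append("\n".join(current) + "\n")
            current = []
        if line.strip():  # skip blank lines
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


example = ">q1 first query\nACGT\nTTGA\n>q2 second query\nGGCC\n"
queries = split_fasta(example)
for record in queries:
    print(record, end="")
```

Each returned record can then be written to its own file and submitted as an independent single-query search.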

[Figure: the master-worker model of BLAST. Components: Database, DB Fragments, Master, Task list, workers. Steps: formatdb, job requesting, blast search, result merging, BioSeq fetching.]
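The pull model and the result merging can be sketched in a few lines. This is an illustration under stated assumptions, not the mpiBLAST code: threads stand in for remote workers, `toy_search` is a hypothetical stand-in for a real BLAST search, and a hit is assumed to be a `(name, score)` pair.

```python
import queue
import threading


def run_master_worker(fragments, search, n_workers=4):
    """Workers pull fragment tasks from a shared queue; the master merges hits."""
    tasks = queue.Queue()
    for fragment in fragments:
        tasks.put(fragment)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                fragment = tasks.get_nowait()  # pull the next task
            except queue.Empty:
                return  # task list exhausted
            hits = search(fragment)
            with lock:
                results.extend(hits)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Merging step: order all hits by descending score.
    return sorted(results, key=lambda hit: hit[1], reverse=True)


def toy_search(fragment):
    """Hypothetical stand-in for BLAST: the score is just the sequence length."""
    return [(name, len(seq)) for name, seq in fragment]


fragments = [[("s1", "ACGT"), ("s2", "AC")], [("s3", "ACGTAC")], [("s4", "A")]]
merged = run_master_worker(fragments, toy_search, n_workers=2)
print(merged)
```

Because each worker asks for the next fragment only when it is idle, faster workers naturally process more fragments, which is where the load balancing of the pull model comes from.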

mpiBLAST (LANL, US) http://mpiblast.lanl.gov

• The MPI implementation of BLAST master-worker model

• Advantages
– High throughput
– Load balancing
• Running in a local cluster
– Performance and problem size are still limited by local computing power
– Simultaneous I/O to a centralized database causes a performance bottleneck
– Database sharing is still difficult

mpiBLAST-g2 ASCC, Taiwan and PRAGMA http://bits.sinica.edu.tw/mpiBlast/index_en.php

• A GT2-enabled parallel BLAST that runs on the Grid
– GT2 GASSCOPY API
– MPICH-g2
• An enhancement of mpiBLAST by ASCC
• Supports cross-cluster job execution
• Supports remote database sharing
• Helper tools for
– database replication
– automatic resource specification and job submission (with a static resource table)
– multi-query job splitting and result merging
• Close link with the mpiBLAST development team
– New mpiBLAST patches can be quickly applied to mpiBLAST-g2

SC2004 mpiBLAST-g2 demonstration

mpiBLAST-g2 current deployment

-- From PRAGMA GOC http://pragma-goc.rocksclusters.org

mpiBLAST-g2 Performance Evaluation (perfect case)

Database: est_human ~ 3.5 GBytes
Queries: 441 test sequences ~ 300 KBytes

[Plots: elapsed time and speedup, with curves for Searching + Merging, BioSeq fetching, and Overall]

• Overall speedup is approximately linear

mpiBLAST-g2 Performance Evaluation (worst case)

Database: drosophila NT ~ 122 MBytes
Queries: 441 test sequences ~ 300 KBytes

[Plots: elapsed time and speedup, with curves for Searching + Merging, BioSeq fetching, and Overall]

• The overall speedup is limited by the unscalable BioSeq fetching
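One way to see why a non-scaling step caps the overall speedup is Amdahl's law, which is brought in here as an explanatory model and is not taken from the slides. The sketch assumes BioSeq fetching behaves as a fixed, serial fraction of the single-worker run time; the 20% figure is purely illustrative, not a measured value.

```python
def overall_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: speedup when only the non-serial part scales with workers."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


# Assumed serial fraction of 20% (illustrative only).
for n in (1, 2, 4, 8, 16):
    print(n, round(overall_speedup(0.2, n), 2))
```

With a serial fraction s the speedup can never exceed 1/s, however many workers are added; with s = 0 the model reduces to the linear speedup of the perfect case.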

Issues of mpiBLAST-g2

• A single error will crash the whole job
– This is the nature of MPICH
– The error might be due to a transient problem in the loosely coupled Grid environment
• An MPI job starts only when all resources are available
– Different levels of resource availability

Error recovery is required for
– providing a robust application service on the Grid
– efficiently using the Grid resources

Asynchronous task dispatching/pulling is needed to use the available resources immediately

DIANE http://cern.ch/diane

• DIstributed ANalysis Environment

• Lightweight distributed framework for parallel scientific applications in the master-worker model
– A perfect match for the mpiBLAST computing model
• Current applications
– BLAST for genomic sequence analysis (DIANE-BLAST)
– Geant4 simulation for radiotherapy and astrophysics
– Image rendering
– Data analysis for high energy physics

DIANE Features

• Rapid prototyping
– Python and CORBA
• Error recovery
– Heartbeat worker health check
– Resubmission of failed tasks
– User-defined error recovery method
• No need for outbound connectivity
– Proxy for workers with only a private IP
• Job submitters for
– Simple fork
– Condor, LSF, SGE, PBS
– GT2, LCG, gLite
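The heartbeat check and task resubmission listed under error recovery can be pictured roughly as follows. This is a generic sketch, not DIANE's actual API: the class `HeartbeatMonitor`, its methods and the 30-second timeout are all hypothetical names and values chosen for illustration.

```python
import time


class HeartbeatMonitor:
    """Tracks worker heartbeats and requeues tasks held by silent workers."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker id -> time of last heartbeat
        self.assigned = {}   # worker id -> task currently running
        self.pending = []    # tasks waiting for a worker

    def submit(self, task):
        self.pending.append(task)

    def heartbeat(self, worker_id, now=None):
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def pull(self, worker_id, now=None):
        """A worker asks for work; the request also counts as a heartbeat."""
        self.heartbeat(worker_id, now)
        if not self.pending:
            return None
        task = self.pending.pop(0)
        self.assigned[worker_id] = task
        return task

    def complete(self, worker_id):
        self.assigned.pop(worker_id, None)

    def reap(self, now=None):
        """Drop workers silent for longer than the timeout; requeue their tasks."""
        now = time.monotonic() if now is None else now
        dead = [w for w, seen in self.last_seen.items()
                if now - seen > self.timeout_s]
        for worker_id in dead:
            del self.last_seen[worker_id]
            task = self.assigned.pop(worker_id, None)
            if task is not None:
                self.pending.append(task)  # resubmission of the lost task
        return dead


monitor = HeartbeatMonitor(timeout_s=30)
monitor.submit("fragment-0")
monitor.submit("fragment-1")
monitor.pull("w1", now=0)        # w1 takes fragment-0
monitor.pull("w2", now=0)        # w2 takes fragment-1
monitor.heartbeat("w1", now=40)  # only w1 keeps reporting
dead = monitor.reap(now=45)      # w2 has been silent for 45 s
monitor.complete("w1")
retried = monitor.pull("w1", now=46)
print(dead, retried)
```

The point of the design is that a lost worker costs only the task it was holding, in contrast to an MPI job where a single failure ends the whole run.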

[Figure: DIANE architecture. Labels: pull model; batch and interactive; distributed workers; planner; integrator.]

DIANE-BLAST implementation

• Splitting mpiBLAST-g2 into DIANE components
– Master (Planner and Integrator), Worker
• Wrapping each component with Python
– Hooking the core BLAST C libraries with Python SWIG
• Implementing the DIANE GT2 job submitter
– For running workers on the GT2-enabled clusters
• Reusing the databases deployed for mpiBLAST-g2

mpiBLAST-g2 vs. DIANE-BLAST: The Speedup

• Query
– Drosophila chromosome 4
– size: 1.2 Mbp
• DB
– Drosophila nucleotide sequence database
– size: 1170 seq., 122 Mbp
– no. of fragments: 32
• Computing resource
– Available # of CPUs: 12
– PIII 1.4 GHz
– 1 GByte memory

Speedup of mpiBLAST-g2

Speedup of DIANE-BLAST
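The fragment count in this setup bounds the attainable speedup. The following is a back-of-the-envelope sketch, assuming every fragment takes the same search time and ignoring merging, fetching and communication costs; neither assumption comes from the measurements.

```python
import math


def max_speedup(n_fragments: int, n_workers: int) -> float:
    """Upper bound on speedup when equal-cost fragments are processed in rounds."""
    rounds = math.ceil(n_fragments / n_workers)
    return n_fragments / rounds


for n in (4, 8, 12, 16):
    print(n, round(max_speedup(32, n), 2))
```

With 32 fragments and 12 workers, three rounds are needed, so under these assumptions the speedup cannot exceed 32/3, about 10.7, even before any overhead is counted.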

mpiBLAST-g2 vs. DIANE-BLAST: The Worker Lifeline

DIANE-BLAST task dispatching

• Handled by DIANE’s task thread

• Due to the bugs in the current DIANE release


mpiBLAST-g2 task dispatching

• mpiBLAST-g2 task handling logic


mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons

• mpiBLAST-g2
– Master-worker model implemented using the MPICH-g2 libraries
– Gridification efforts
  • Implementing database sharing with the GASSCOPY API
  • Recompilation with the MPICH-g2 and GT2 libraries
– Error recovery
  • Needs a fault-tolerant MPI
– Cross-cluster computation
  • Requires outbound connectivity on each worker
– Performance/Throughput
  • In-cluster performance is as good as that of the original mpiBLAST

• DIANE-BLAST
– Pluggable application for the DIANE master-worker framework
– Gridification efforts
  • Through the gridified DIANE framework
– Error recovery
  • Task resubmission
  • Tracking the health of each worker
– Cross-cluster computation
  • Using a proxy for workers with private IPs
– Performance/Throughput
  • Performance can be tuned by controlling the job thread

Summary

• Two grid-enabled BLAST implementations (mpiBLAST-g2 and DIANE-BLAST) were introduced for efficiently handling BLAST jobs on the Grid
• Both implementations are based on the master-worker model for distributing BLAST jobs on the Grid
• mpiBLAST-g2 has good scalability and speedup in some cases
– It requires a fault-tolerant MPI implementation for error recovery
– In the unscalable cases, BioSeq fetching is the bottleneck
• DIANE-BLAST provides a flexible mechanism for error recovery
– Any master-worker workflow can be easily plugged into this framework
– The job thread control should be improved to achieve good performance and scalability

Thanks for your attention!!
