Exploiting performance characterization of BLAST in the grid

Cluster Comput (2010) 13: 385–395
DOI 10.1007/s10586-010-0121-z

Enis Afgan · Purushotham Bangalore

Received: 9 May 2008 / Accepted: 3 February 2010 / Published online: 20 February 2010
© Springer Science+Business Media, LLC 2010

Abstract Sequence analysis has become essential to the study of genomes and biological research in general. Basic Local Alignment Search Tool (BLAST) leads the way as the most accepted method for performing necessary query searches and analysis of discovered genes. Combating growing data sizes, with the goal of speeding up job runtimes, scientists are resorting to grid computing technologies. However, grid environments are characterized by a dynamic, heterogeneous, and transient state of available resources, which is a major hindrance to users trying to realize user-desired levels of service. This paper analyzes performance characteristics of NCBI BLAST on several resources and captures the influence of resource characteristics and job parameters on BLAST job runtime across those resources. Obtained results are summarized as a set of principles characterizing performance of NCBI BLAST across homogeneous and heterogeneous environments. These principles are then applied and verified through creation of a grid-enabled BLAST wrapper application called Dynamic BLAST. Results show runtime savings of up to 50% and resource utilization improvement of approximately 40%.

Keywords Grid computing · BLAST · Dynamic BLAST · Job parameterization · Performance benchmarking · Resource selection

E. Afgan · P. Bangalore (✉)
Department of Computer and Information Sciences, University of Alabama at Birmingham, 1300 University Blvd, CH #131, Birmingham, AL 35294-1170, USA
e-mail: [email protected]

E. Afgan
e-mail: [email protected]

1 Introduction

The sequence alignment process enables identification of regions of similarity among biomolecular sequences (i.e., DNA, RNA, or amino acid sequences), where high sequence similarity often implies significant functional, structural, and evolutionary information between genes. Finding such similarities enables derivation of inferences about gene functionality and ancestry. As a result, sequence alignment has grown to be an important aspect of today's biological research [1]. The alignment process consists of comparing multiple sequence queries by searching for series of matching individual characters or character patterns across sequences. One of the most widely used algorithms for sequence alignment is the Basic Local Alignment Search Tool (BLAST) [2, 3]. Although there exist many other sequence alignment algorithms (e.g., FASTA [4], SSEARCH [5], HMMer [6]), BLAST has gained popularity due to its emphasis on execution speed. At the same time, the advent of high-throughput sequencing technologies and large-scale projects, such as the Human Genome Project [7], has led to exponential growth of search target databases and thus exponential search times [1].

In order to cope with the prolonged search times, parallel computing techniques have been used to help BLAST jobs gain speedup on searches by distributing search jobs over a cluster of computers. There are two main methodologies for parallelizing BLAST searches, namely database segmentation and query segmentation [8]. Database segmentation methodology (employed by mpiBLAST [9] and TurboBLAST [10]) distributes a portion of the sequence database to each cluster node. Thus, each cluster node only needs to search a given query set against its portion of the sequence database. Alternatively, query segmentation (employed by [11, 12]) distributes portions of the user-submitted queries to available cluster nodes. Each compute node has access to the whole database and performs the search only on its portion of the query set, thus speeding up the overall search through parallelism. Figure 1 graphically depicts the two BLAST parallelization methodologies.

Fig. 1 Two models for parallelizing BLAST: (a) query splitting and (b) database splitting
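The query segmentation model described above can be illustrated with a minimal Python sketch. It assumes the queries arrive as FASTA-formatted text; the function names `parse_fasta` and `segment_queries`, the round-robin assignment, and the example sequences are illustrative assumptions rather than part of any tool cited here.

```python
from typing import List


def parse_fasta(text: str) -> List[str]:
    """Split FASTA-formatted text into individual records (header plus sequence)."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


def segment_queries(records: List[str], num_nodes: int) -> List[List[str]]:
    """Deal query records round-robin into one chunk per node.

    Every node is assumed to hold the complete database, so only the
    query set is partitioned.
    """
    chunks = [[] for _ in range(num_nodes)]
    for index, record in enumerate(records):
        chunks[index % num_nodes].append(record)
    return chunks


# Hypothetical five-query input split across two nodes.
example = ">q1\nMKTAYIAK\n>q2\nGAVLIP\n>q3\nMSTNPKPQ\n>q4\nWYFH\n>q5\nCCHH\n"
chunks = segment_queries(parse_fasta(example), num_nodes=2)
```

Each chunk can then be searched independently against the full database, which is why this model needs no communication between nodes.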

Nevertheless, executing alignment searches presents a challenge in terms of required resources because neither individual workstations nor individual clusters are capable of realizing the needed performance. For example, a small-scale search with approximately 1,000 search queries against the popular 1.6 GB nr database on a decent workstation (dual CPU, dual core Intel Xeon 3 GHz, 4 GB RAM) takes approximately 8 hours to complete using one thread per CPU. Because of the speed at which sequencing machines perform analysis, it is not uncommon to generate on the order of 250,000 queries in a single experiment. Performing a search with that amount of data incurs a prohibitively high computational cost. Resorting to publicly available resources such as NCBI [13] is still not enough because they impose a limit of one cumulative CPU-hour per user job [14].

In order to realize the benefits of increased execution speed offered by the parallel approaches and gain access to needed resources, scientists are turning to grid computing technologies (e.g., [15–17]). Grid computing [18] is a technology enabling virtualization and aggregation of distributed compute resources into a seamless resource pool. Advances in grid computing have brought unsurpassed computational capabilities to users' fingertips and offer an excellent foundation for reducing the execution time of BLAST jobs. Regrettably, heterogeneity of available resources (i.e., physical and logical differences that exist between individual resources) introduces additional complexities, forcing scientists to deal with these complexities directly.

Heterogeneous grid environments are complicated by dependencies between the application, the platform used for job execution, input data properties, and user preferences [19]. Because individual applications impose different requirements on underlying resources, such dependencies are application- and resource-specific. A result of these dependencies is that undesired effects, such as excessive and non-uniform job execution times, are often observed. Furthermore, because the application-resource relationship is poorly understood, resources are being underutilized. In order to understand and leverage existing dependencies, there is a need to understand this relationship.

This paper presents a performance characterization of the runtime behavior of the NCBI BLAST application across grid resources, followed by an experimental evaluation of the devised characterization. To the best of our knowledge, this paper represents the first instance of such work for the NCBI BLAST application. With the availability of such a characterization, users may understand and control submission of their jobs across grid resources through user-guided job parameterization. We define job parameterization as an understanding and selection of job parameters that can be fed to a job submission engine as requirements that are algorithm, input data, and resource dependent. Alternatively, an application-specific metascheduler (i.e., grid resource broker) may be developed that can automate the job parameterization process and realize goals desired by a user (e.g., minimize job runtime, maximize result accuracy).

There are three concrete contributions presented in this paper. They are as follows:

• Performance characterization of the NCBI BLAST application for homogeneous compute resources (i.e., compute clusters)

• Performance characterization of the NCBI BLAST application for heterogeneous compute resources

• Evaluation of the generated characterizations through implementation of a BLAST-specific grid metascheduler called Dynamic BLAST.

The results of this work enable targeted parameterization of a given BLAST job for any given grid resource availability. Presented results can be used directly by a user or to facilitate development of an automated BLAST-specific job submission tool. Lastly, because the BLAST application is representative of other scientific applications (e.g., no inter-task communication, largely compute intensive, data easily distributed), the results described could be adapted to other applications as they are ported and executed in heterogeneous environments.

The rest of this paper is organized as follows: Sect. 2 presents related and previous work. Section 3 builds on the conclusions of related works and derives BLAST-specific performance characterizations for homogeneous as well as heterogeneous resources. Section 4 presents an implementation of the proposed characterization and discusses obtained performance results. Section 5 provides the summary and conclusion.

2 Related work

Even though various implementations of sequence alignment exist (e.g., FASTA [4], SSEARCH [5] and SCANPS [20]), this work focuses on NCBI BLAST [13] because it represents the most widely used implementation; therefore, obtained results provide benefit to the widest audience. The presented performance characterization is based on results of earlier analyses of the BLAST application [8, 21–23].

In earlier studies, Afgan and Bangalore [21] and Wu [8] present empirical analyses of the BLAST application across a set of heterogeneous resources. The study presented by Afgan and Bangalore [21] focuses on the influence of resource heterogeneity on job runtime, while Wu's contribution [8] focuses on the influence of high-level job invocation parameters on BLAST job runtime (e.g., size of search database, input query length). The conclusions of these analyses point to the importance and effects that execution resources and job input parameters have on job runtime. Results of these analyses are incorporated into the performance characterization presented in this paper.

Tan et al. [22] and Sanchez et al. [23] present low-level analyses of a set of bioinformatics applications, including BLAST. In [22] Tan et al. present a performance analysis that focuses on architectural components of executing systems. Results indicate that BLAST (more specifically, the blastall program) is a CPU-bound application with good memory management. This implies that selecting a resource with higher CPU clock speed for executing BLAST jobs should yield better performance. Similarly, Sanchez et al. in [23] present a very detailed analysis of the influence that architectural components of a CPU have on the runtime of BLAST. Their conclusions imply that because the application is CPU-bound, the relative performance of a set of resources with respect to the BLAST application can be estimated quite effectively (within 10%) by using a standardized CPU benchmark, such as SPEC2000 [24].

In order to provide a more targeted (i.e., application-specific) benchmark of resources, and thus capture performance of a resource more closely, the BioPerf [25] project was introduced. BioPerf represents an easy-to-use suite of bioinformatics applications preconfigured with small input data sets. It is utilized to evaluate high-performance computer systems. Results of BioPerf are used to empirically obtain application-specific performance rates of a range of resources, which a job manager later uses to distribute workload in a fashion that best meets resource capabilities.

Lastly, GridBLAST [11] and GNARE [16] represent implementations of the BLAST application for the grid. Through a set of scripts, execution of BLAST across Globus-based [26] grid resources is enabled. These projects focus on enabling BLAST to execute across a set of resources, but they do not focus on leveraging resource capabilities to maximize job performance (i.e., minimize job turnaround time through effective task-to-resource mappings).

Overall, the work presented in this paper generalizes the works discussed above and delivers a performance characterization for the NCBI BLAST application that can be used during any job parameterization. Such an approach represents a high-level solution that allows a user (or a job submission tool) to directly utilize available conclusions during future job submissions. This is in contrast to focusing on low-level application execution characteristics with the aim of improving algorithmic properties of an application. Instead, the presented solution builds on the outcomes of the low-level application optimizations and can be used to implement a range of job submission tools or grid job metaschedulers, thus providing a general-purpose solution for execution of BLAST jobs across heterogeneous and distributed resources.

3 Characterizing BLAST performance

The query segmentation parallelization methodology for BLAST (described in the introduction) is a suitable parallelization method for grid environments because individual jobs may be divided, configured, and submitted to independent sites throughout a Virtual Organization (VO). Furthermore, there is no inter-process communication between instantiated tasks and the model offers high tolerance for partial task failure. As a result, this is the parallelization methodology adopted for the developed performance characterization.

While we recognize database segmentation as an important approach to parallelizing BLAST searches, performance comparison and analysis of this type of parallelization was not considered because of the increased difficulty of adapting such tightly coupled parallel algorithms to execute across grid resources. Because execution of this type of application is inherently limited to single resources, within a grid or not, individual resource performance can be carried over to grid resources by considering performance of dedicated systems and incorporating middleware overhead. For more details on implementation and performance analysis of this type of parallelism, we refer the interested reader to [9, 27, 28].

3.1 Characterizing BLAST performance in homogeneous environments

Overall, parameterization of a job in a heterogeneous environment can be partitioned into two components: (1) parameterization of the overall job in terms of resource selection and data distribution across those resources, and (2) parameterization of an individual resource upon it being selected. This section focuses on the second step and presents a characterization that can be instantiated when parameterizing a job on any single resource.

Based on earlier analyses (i.e., [8, 21]) and in the context of a homogeneous resource, three decisions need to be made:

• How many tasks to create?
• How to distribute data among created tasks?
• How to parameterize each task?

The number of tasks that can be created on a given resource is constrained by the number of Processing Slots (PSs), or nodes, available on the system. We refer to this number as n. Although this is a soft constraint, the focus of the proposed characterization is maximizing job performance by avoiding resource overloading, so the number of tasks created should not exceed the number of PSs available on a given resource. The number of created tasks can, however, be less than the total number of PSs. As shown by Afgan and Bangalore [21], depending on the architecture of a resource, certain job configurations result in improvements in resource utilization. Depending on the objective of a metascheduler, utilizing such information may be of interest when scheduling and parameterizing a job. Because such information is application-specific, it should be implemented at the level of the application-specific wrapper implementing the presented characterization. The general-purpose characterization presented here needs to be able to accommodate such a request. Therefore, the number of tasks to be created by a metascheduler is provided as a variable at the time of job instantiation and is denoted by T (thus supporting variable resource availability).

Once the number of tasks has been determined, there is a need to decide how to distribute the total amount of input data D among T tasks. Because a resource is composed of homogeneous nodes, it is adequate to divide the input across the nodes (and thus the tasks) in a proportional manner. Equation (1) captures this action, where d indicates the size of the input that should be assigned to each individual task (size is a BLAST-specific value and can refer to the number of input queries or the physical size of the overall input file):

$$d = \frac{D}{T} \qquad (1)$$

Variable d indicates a general size of input that should be assigned to an individual task t_i. However, assigning the actual d amount of data to t_i should be implemented by a BLAST-specific wrapper. More specifically, based on previous BLAST analysis (i.e., [21]), the amount of data assigned to each task t_i should be guided by the number of queries and the length of individual queries. It is therefore necessary to create a script (or a module) that can distribute the provided input data in the appropriate fashion. To accomplish this, the module is initially instructed as to how many data chunks to create and how much data to assign to each task; it then splits the job input data based on two metrics: the number of search queries and the physical file size. The outcome of splitting the input file in such fashion is a well-balanced workload distribution.
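As a rough illustration of (1) and of splitting by both query count and query size, the following Python sketch computes d for the two interpretations of D and then deals the queries to tasks. The serpentine dealing order is an assumption chosen for brevity, not the module used by the authors, and the names `per_task_share` and `split_balanced` as well as the query lengths are hypothetical.

```python
from typing import List


def per_task_share(total_size: float, num_tasks: int) -> float:
    """Equation (1): d = D / T."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return total_size / num_tasks


def split_balanced(query_lengths: List[int], num_tasks: int) -> List[List[int]]:
    """Deal queries, longest first, in a back-and-forth (serpentine) order.

    Task query counts differ by at most one, and long and short queries
    are spread over the tasks rather than clustered in a few of them.
    """
    tasks = [[] for _ in range(num_tasks)]
    ordered = sorted(query_lengths, reverse=True)
    for position, length in enumerate(ordered):
        round_number, offset = divmod(position, num_tasks)
        # Reverse direction on every other round to even out the totals.
        slot = offset if round_number % 2 == 0 else num_tasks - 1 - offset
        tasks[slot].append(length)
    return tasks


lengths = [900, 120, 450, 300, 80, 610, 240, 500]  # hypothetical query lengths
T = 4
d_queries = per_task_share(len(lengths), T)    # D as number of queries
d_residues = per_task_share(sum(lengths), T)   # D as total input size
tasks = split_balanced(lengths, T)
```

In this example every task receives exactly d_queries = 2 queries, while the per-task totals only approximate d_residues, which illustrates why the text asks the splitting module to consider both metrics.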

Lastly, each task should be parameterized to help realize the desired scheduling objective (e.g., minimize job runtime, minimize job cost). With the focus on minimizing job runtime, as shown in earlier experiments (i.e., [21]), BLAST realizes maximum performance when the number of threads instantiated (-a option) matches the number of Processing Elements (PEs) available on a PS. Depending on the scheduling policy of a given resource, this number may correspond to the number of PSs, but it may also differ (e.g., a single PS is a dual quad-core CPU node, so each user request for a single PS assigns a user job to one PS that actually contains 8 PEs). Such information is utilized when the characterization is instantiated.
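The thread setting discussed above can be sketched as a small helper that assembles a command line for the legacy `blastall` program, passing the number of PEs through the `-a` option. The helper name `blastall_command`, the file names, and the fallback to `os.cpu_count()` are assumptions made for illustration; a real wrapper would take the PE count from the resource description.

```python
import os
from typing import List, Optional


def blastall_command(query_file: str, database: str, output_file: str,
                     program: str = "blastp",
                     threads: Optional[int] = None) -> List[str]:
    """Build an argument list for the legacy NCBI `blastall` program."""
    if threads is None:
        # Assumed fallback: cores visible on this node; a scheduler may expose fewer.
        threads = os.cpu_count() or 1
    return ["blastall", "-p", program, "-d", database,
            "-i", query_file, "-o", output_file, "-a", str(threads)]


# One PS that contains 8 PEs, as in the dual quad-core example above.
cmd = blastall_command("chunk_0.fasta", "nr", "chunk_0.out", threads=8)
```

The resulting list can be handed to a job submission layer or to `subprocess.run` on the execution node.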

In conclusion, the presented characterization provides a standardized interface for a BLAST-specific wrapper to implement. The characterization supports notions of variable resource assignment, data distribution, and task parameterization. Within an implementation of the characterization, a general-purpose solution to the above actions and subsequent duties (e.g., task submission, data staging) can be implemented; however, it is the BLAST-specific wrapper that provides necessary information (e.g., number of tasks to create) and performs BLAST-specific actions (e.g., data distribution). As input, the characterization accepts a BLAST query input file and information about the available resource. The output is a set of task assignments across the given resource such that the input data is distributed in a fashion that maximizes resource utilization.

3.2 Characterizing BLAST performance in heterogeneous environments

At the grid job submission level, a characterization is needed that encompasses global job parameterization in terms of resource selection and data distribution. This section presents such a characterization. The presented characterization represents a general-purpose solution that can be instantiated, and it assumes the existence of an application-specific plug-in that provides and implements many of the necessary details.

The following notation is used throughout this section:

R    Set of normalized performance weights of resources available for executing job J
n_i  Number of Processing Slots (PSs) or nodes on resource i
P_i  Performance rate of an individual PS on resource i
W_i  Performance weight of resource i
E    Set of normalized performance weights of resources selected for executing job J
T_i  Number of tasks assigned to resource i
d_i  The size of the data chunk assigned to resource i
D    The total size of input for job J (as number of input queries or physical file size)

The following is the list of actions supported by the derived characterization:

• Understand heterogeneity of available resources
• Support notion of resource selection
• Understand the possibility for variable data assignment to selected resources

Capabilities of a resource can be measured by its size, and size can be quantified through the number of PSs available. Because of the heterogeneity of grid resources, size should not be the only measurement. Therefore, understanding resource heterogeneity involves accounting for the size of the resource and the resource's performance from the perspective of the BLAST application. Equation (2) captures the process of resource weight calculation.

$$W_i = n_i \ast P_i \qquad (2)$$

Note that (2) considers all of the available PSs on a given resource. However, this does not need to be the case, and an instance of the presented characterization can be invoked with a desired number of PSs. This can be captured in the above formulation by replacing the number of PSs on a resource, n_i, with the number of tasks T_i that must be created on a given resource.

The performance rate P_i of a resource can be obtained through one of several methods, namely:

• Calculate the theoretical peak performance of a resource based on the clock speed of a CPU

• Use a generic benchmark tool, such as the SPEC benchmark discussed in Sect. 2

• Use an application-specific benchmark (this may include analysis of historical runtime characteristics of a given application, executing the selected application with a smaller input data set prior to the job submission [8], use of application skeletons to estimate application performance [29], or use of an application-specific benchmark tool (e.g., [30])). The BioPerf project discussed earlier is an example of such a tool.
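The first option in the list above can be sketched as follows. The formula used here (clock frequency times instructions per cycle times cores per node) is an assumed, commonly used estimate of peak rate and is not stated in the text; the input values are the per-node figures listed in Table 1 (Sect. 4), and the function name `theoretical_peak` is hypothetical.

```python
def theoretical_peak(clock_ghz: float, instructions_per_cycle: int, cores: int) -> float:
    """Peak rate of one node in billions of instructions per second (assumed formula)."""
    return clock_ghz * instructions_per_cycle * cores


# Per-node figures as listed in Table 1.
peak_per_node = {
    "Cheaha": theoretical_peak(3.0, 4, 8),
    "Ferrum": theoretical_peak(2.33, 4, 8),
    "Olympus": theoretical_peak(3.2, 2, 2),
}
```

Such a figure is only a coarse stand-in for P_i, since it ignores memory behavior and any application-specific effects that a benchmark would capture.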

Because of the application-resource dependency explained in the introduction, and also shown in previous research [31], application-specific benchmarks offer the most accurate measure of a resource's performance as it pertains to the given application. It is thus desirable to analyze the runtime characteristics of BLAST directly as it moves across resources available in the grid. Nevertheless, when such performance data is not available, general-purpose approaches can be used (with a possibly reduced level of accuracy).

Once resource weights are obtained, they are normalized and used to represent the set of available resources whose individual performances can be directly compared (see (3)).

$$R = \{W_1, W_2, \ldots, W_m\} \qquad (3)$$
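Equations (2) and (3) can be illustrated with a short sketch. The text does not say how the weights are normalized, so dividing by their sum is an assumption (dividing by the largest weight would serve equally well); the resource names, slot counts, and rates are hypothetical.

```python
from typing import Dict


def resource_weights(slots: Dict[str, int], rates: Dict[str, float]) -> Dict[str, float]:
    """Equation (2): W_i = n_i * P_i for every resource i."""
    return {name: slots[name] * rates[name] for name in slots}


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale weights so they sum to 1 (one possible normalization)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {name: w / total for name, w in weights.items()}


# Hypothetical resources: slot counts n_i and per-slot rates P_i.
slots = {"A": 10, "B": 20, "C": 40}
rates = {"A": 4.0, "B": 2.0, "C": 0.5}
R = normalize(resource_weights(slots, rates))
```

Here resources A and B end up with equal weight even though B has twice as many slots, which is the point of combining size and per-slot performance.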

Resources represented by set R can now be effectively compared and the act of resource selection can be performed. The presented characterization realizes the notion of resource selection based on a certain threshold. The value of the threshold may be interpreted differently for different objectives being implemented by a tool instantiating this characterization (e.g., minimize runtime, minimize load imbalance across tasks). Some examples of conditions under which a resource is excluded are: resource performance is less than some constant, resource performance is less than 50% of the fastest resource, or the cost of a resource is greater than some constant. Equation (4) captures such functionality:

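$$E \subseteq R \quad \text{and} \quad \begin{cases} R_i \in E_i, & R_i \geq \text{threshold} \\ R_i \notin E_i, & R_i < \text{threshold} \end{cases} \qquad (4)$$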
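The threshold-based selection in (4) can be sketched as a simple filter over the normalized weights. The helper `relative_threshold`, which expresses the threshold as a fraction of the best weight, is one assumed interpretation among those the text allows; the weights and the 0.75 fraction are hypothetical.

```python
from typing import Dict


def select_resources(R: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Equation (4): keep resources whose normalized weight meets the threshold."""
    return {name: w for name, w in R.items() if w >= threshold}


def relative_threshold(R: Dict[str, float], fraction: float) -> float:
    """Threshold expressed as a fraction of the best weight in R."""
    return fraction * max(R.values())


R = {"A": 0.4, "B": 0.4, "C": 0.2}  # hypothetical normalized weights
E = select_resources(R, relative_threshold(R, 0.75))
```

With these values resource C falls below the threshold and is excluded, so E contains only A and B.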

Lastly, the metascheduling characterization needs to understand the notion of variable data distribution across selected resources based on the relative resource performance. Equation (5) provides a general formulation for distributing data relative to resource performance. The computed value d_i refers to the amount of data that should be assigned to resource i. For the BLAST application, this implies that input data partitioning should be performed at the granularity of an individual query. Furthermore, the partitioning module should account for the number of queries as well as the length of queries when performing the data distribution. As in the case of (2), the total number of PSs n_i in (5) can be replaced with the desired number of tasks to be created on resource i.

$$d_i = \frac{n_i}{\sum_{j=0}^{j<|E|} n_j} \ast D \ast W_i \qquad (5)$$
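The idea of performance-relative data distribution can be sketched as below. Note that this is a simplified, weight-proportional split (each resource receives D multiplied by its share of the total weight in E), not a literal transcription of (5); it also rounds to whole queries, since partitioning happens at the granularity of an individual query. The function name `distribute_queries` and the example weights are hypothetical.

```python
from typing import Dict


def distribute_queries(total_queries: int, E: Dict[str, float]) -> Dict[str, int]:
    """Give each selected resource a whole number of queries proportional to its weight."""
    total_weight = sum(E.values())
    exact = {name: total_queries * w / total_weight for name, w in E.items()}
    shares = {name: int(value) for name, value in exact.items()}
    # Hand out queries lost to rounding, largest fractional part first.
    leftover = total_queries - sum(shares.values())
    by_fraction = sorted(exact, key=lambda name: exact[name] - shares[name], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


E = {"A": 0.5, "B": 0.3, "C": 0.2}  # hypothetical weights of selected resources
d = distribute_queries(1001, E)
```

The largest-remainder step guarantees that the per-resource counts add up to D exactly, so no query is dropped or duplicated.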

Once the decisions discussed above are made at the level of a job, the performance characterization presented in Sect. 3.1 is instantiated to parameterize each individual task on each individual resource. Overall, the presented characterization captures the requirements of understanding and coping with heterogeneity of resources in grid environments. Implementations and instantiations of the presented characterization can thus rely on the definition of tasks and the steps required to submit a BLAST job in a controlled fashion. Specifically, the characterization accepts a BLAST query file and a list of available resources as input and produces a workload assignment across available resources as output. The produced workload assignment is based on performance of the BLAST application across the given resources so that resource utilization is maximized and job runtime minimized. The next section provides an example of an implementation of the presented characterization.

4 Experiments and results

This section presents runtime performance results obtained when the BLAST performance characterization derived above is implemented and instantiated. Job runtime characteristics for a BLAST wrapper application implementing the above characterization are compared to job runtime characteristics when using the GridWay metascheduler, the current state-of-the-art grid metascheduler. First, a brief overview of Dynamic BLAST, the BLAST wrapper application, is provided (for additional technical details on Dynamic BLAST see [15, 32]), followed by a description of the experimental setup. Performance results are presented last.

4.1 Overview of Dynamic BLAST

Because the presented BLAST performance characterization is aimed at heterogeneous environments, validation of the characterization was performed across a pool of grid resources (described in Sect. 4.2). From the topmost level, grid resources offer a heterogeneous compute environment where the ability of the characterization to capture resource characteristics is tested. Furthermore, because each of the selected resources was a compute cluster composed of multiple homogeneous nodes, the homogeneous portion of the presented characterization is tested within each resource (i.e., through a group of tasks assigned to a resource).

The performance characterization presented in Sect. 3 was instantiated in the form of a BLAST-specific wrapper application called Dynamic BLAST [15]. The high-level architecture of Dynamic BLAST and interacting components is shown in Fig. 2. As shown in the figure, a user submitting BLAST jobs to grid resources is completely abstracted from the low-level grid resource and infrastructure details. Authentication and authorization are moved outside Dynamic BLAST, where the user is required to have valid X.509 GSI proxy credentials [33] before job invocation. Resources available for job submission are discovered automatically through grid Monitoring and Discovery Services (MDS) [12]. GridAtlas [34] is a locally developed utility that monitors application-related (in this case, BLAST-related) parameters across various resources. Monitored information includes application installation and input data locations on various resources. Interaction with GridWay is performed through the Distributed Resource Management Application API (DRMAA) [35] standard, which is used for all job submission and monitoring activities. Dynamic BLAST handles resource selection and data distribution, but resource allocation, data staging, and job monitoring are all delegated to GridWay through the DRMAA API. GridWay provides a streamlined platform for high-level grid application development within Globus Toolkit [36] grids and was thus chosen as the application development platform.

Fig. 2 High-level diagram of interactions between grid components and Dynamic BLAST

4.1.1 Instantiating BLAST performance characterization

The previous section provides a brief and high-level overview of Dynamic BLAST. This section describes how the proposed BLAST performance characterization from Sect. 3 is implemented within Dynamic BLAST.

At the level of BLAST job scheduling across grid resources, the following three considerations were captured by the characterization:

1. Understand heterogeneity of available resources
2. Support notion of resource selection
3. Understand the possibility for variable data assignment to selected resources

Within Dynamic BLAST, understanding heterogeneity of available resources (i.e., (2)) is implemented either through a user-provided theoretical peak performance for a resource or through historical BLAST application performance data for a given resource. Application Performance Database (AppDB) [37] is the tool used for obtaining required resource performance data. Dynamic BLAST focuses on minimizing job runtime and thus resource selection (see (4)) is not implemented. Instead, Dynamic BLAST utilizes all of the available resources and performs appropriate data distribution (as described in the following paragraph) to minimize load imbalance across resources while maximizing resource utilization.

Fig. 3 Visualization of BLAST job input workload in terms of surface area. Both inputs represent the same workload

As described in Sect. 3, the runtime of BLAST jobs is dependent on job input in terms of the number of input queries and input query length. As a result, the size of input provided to a job (or a task) as workload can be captured and visualized in terms of surface area. For example, a small number of long input queries can represent the same workload as a large number of short queries. Figure 3 visually captures this property.

Within Dynamic BLAST, the above observation is implemented as a bin-packing problem [38] through the variable data distribution component of the presented performance characterization. The implemented algorithm distributes user-provided input queries across available resources so that a proportional number of short, medium, and long queries is assigned to each individual resource. A simple yet effective and efficient heuristic for this problem is the first fit decreasing algorithm (complexity is Θ(n log n), where n is the number of queries) [39], which assigns data elements across individual bins in decreasing order for as long as there is input. The amount of data assigned to individual resources (i.e., bins) is guided by the resource performance weight described in the previous paragraph.
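The first fit decreasing heuristic described above can be sketched as follows. This is an illustration under stated assumptions rather than the Dynamic BLAST source: each resource is treated as one bin whose capacity is its weight-proportional share of the total query length, and a query that fits nowhere is placed in the bin with the most remaining room. The function name, the resource names, and the query lengths are hypothetical.

```python
from typing import Dict, List


def first_fit_decreasing(query_lengths: List[int],
                         weights: Dict[str, float]) -> Dict[str, List[int]]:
    """Pack queries into one bin per resource, capacity proportional to weight."""
    total = sum(query_lengths)
    weight_sum = sum(weights.values())
    capacity = {name: total * w / weight_sum for name, w in weights.items()}
    load = {name: 0 for name in weights}
    bins: Dict[str, List[int]] = {name: [] for name in weights}

    for length in sorted(query_lengths, reverse=True):  # decreasing order
        for name in bins:  # first bin that still has room
            if load[name] + length <= capacity[name]:
                break
        else:
            # Assumed fallback: nothing fits, so use the bin with most room left.
            name = max(bins, key=lambda b: capacity[b] - load[b])
        bins[name].append(length)
        load[name] += length
    return bins


lengths = [800, 700, 600, 500, 400, 300, 200, 100]  # hypothetical query lengths
packed = first_fit_decreasing(lengths, {"fast": 2.0, "slow": 1.0})
```

In this example the resource with twice the weight receives twice the total query length. Sorting dominates the cost when the number of bins is small and fixed.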

At the level of BLAST job scheduling across homogeneous resources, the following three questions were captured by the characterization:

1. How many tasks to create?
2. How to distribute data among created tasks?
3. How to parameterize each task?

As elaborated on in Sect. 3, as well as in previous research [8, 21], the number of tasks to be created on a given resource should match the number of processing nodes available on the resource. Dynamic BLAST obtains the needed data automatically from MDS and instantiates a number of processes (i.e., tasks) that matches the number of available nodes. Based on (1), the input data within a resource is evenly divided among available processing nodes. However, due to the dependency of BLAST runtime on the structure of input files (i.e., query number and length), the bin-packing algorithm described earlier is also applied at the resource level.

Table 1 Architectural details of resources used during performed experiments

| Machine | Cheaha | Ferrum | Olympus |
|---|---|---|---|
| Processor | Intel Xeon E5450 | Intel Xeon E5345 | Intel Xeon |
| Clock frequency (GHz) | 3 | 2.33 | 3.2 |
| Instructions/cycle | 4 | 4 | 2 |
| No. of cores/node | 8 | 8 | 2 |
| Memory per node (GB) | 16 | 12 | 4 |
| No. of nodes available | 12 | 24 | 64 |
| Total no. of cores available | 96 | 192 | 128 |

As a result, the input file chunk assigned to a given resource is reorganized in a fashion that assigns a proportional number of short, medium, and long queries to individual resource nodes. Lastly, each task is parameterized so that the number of threads instantiated matches the number of processing cores available on a compute node of the resource. The number of available processing cores per compute node is obtained automatically at runtime from GridAtlas.

4.2 Experimental setup

The experiments testing the proposed BLAST performance characterization were performed across three heterogeneous resources available on UABgrid (technical resource details are provided in Table 1). The performance of the characterization implemented within Dynamic BLAST was compared to that of the GridWay metascheduler. GridWay represents the current state of the art in metascheduling; it is distributed with the Globus Toolkit and supports a submit-and-forget methodology [40]. Beyond validating the approach captured by the presented characterization, the presented performance results are intended to indicate the potential benefits of exploiting application-specific characteristics during job metascheduling.

Resources selected for executing BLAST jobs are located across three independent departments, each locally administered with applicable policies and procedures in place. The selected resources are significantly heterogeneous in terms of the number of nodes/cores available as well as node/core performance. Resource availability was not varied during the experiments, to keep a clear focus on the performance gained by exploiting the application-resource relationship, which is the focus of the proposed characterization. Nevertheless, the presented performance characterization of BLAST and, in turn, Dynamic BLAST are capable of handling and appropriately adjusting for varying resources and input data size. Specifically, these variations would be realized through instantiations of (4), where resource selection is performed, and (5), where data distribution is performed. Dynamic BLAST is currently not capable of adjusting execution parameters of a job if resource availability changes during job execution.

All of the utilized resources had a version of the BLAST application installed and the required search database available for use. Technical resource details are provided in Table 1. The popular nr database was used to perform BLAST searches. The nr database is a non-redundant protein database with entries from GenPept, Swissprot, PIR, PRF, PDB, and RefSeq. The version used was 1.6 GB in size and available from the National Center for Biotechnology Information (NCBI).¹

The query input file consisted of 4,096 search queries randomly selected from the Viral Bioinformatics Resource Center (VBRC)² database. The VBRC database contains the complete genomic sequences for all viral pathogens and related strains that are available for about half a dozen virus families. The total number of sequences in the database is around 40,000. The number of queries used to evaluate the performance of Dynamic BLAST is thus a small fraction (about 10%) of the total number of queries available in the VBRC database. Lastly, the reported performance values are averages of at least three runs using the given job parameters.

4.3 Performance results

The major difficulty in achieving optimal job performance with the data distribution across heterogeneous grid resources adopted by the described performance characterization stems from the need to minimize load imbalance across instantiated tasks while realizing the desired resource utilization [19]. To achieve this, resource capabilities in terms of the given application need to be understood [19]. The performance characterization presented in Sect. 3 copes with this load-balancing problem in two steps: (1) by acknowledging heterogeneous resource performance and leveraging it in the data distribution approach, and (2) by distributing job input data in a BLAST-specific fashion (i.e., considering the number and length of input queries, as described in Sect. 4.1.1).

¹ http://www.ncbi.nlm.nih.gov/
² http://www.biovirus.org/

Fig. 4 Input data distribution in terms of the number of input queries executed through GridWay

The performance of Dynamic BLAST, an implementation of the proposed characterization, was compared to the performance of a comparable static data distribution solution built on and supported by GridWay. For the GridWay job submission, job input queries were distributed among different resources based on the resources’ sizes (i.e., the total number of processing cores available within each resource). This approach embraces the simplicity of use of grid resources advocated by GridWay, minimizes requirements imposed on a user wishing to utilize grid resources for job execution, and represents a solution typically adopted by domain scientists not interested in fiddling with low-level infrastructure details. Equation (6) captures the adopted input data division, using notation from Sect. 3.2:

$$d_i = \frac{n_i}{\sum_{j=0}^{j<|E|} n_j} \cdot D \qquad (6)$$

For the performed job runs, the size of the job input file was measured in terms of the number of input queries. Applying (6) to the resource availability outlined in Table 1 yields the input data distribution shown in Fig. 4.
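As a worked instance of (6), the sketch below splits the 4,096 queries by the core counts listed in Table 1. The function name `split_by_cores` is hypothetical, and the largest-remainder rounding is an assumption, since the text does not state how fractional queries were handled.

```python
from typing import Dict


def split_by_cores(cores: Dict[str, int], total_queries: int) -> Dict[str, int]:
    """Apply d_i = n_i / sum_j(n_j) * D and round to whole queries."""
    total_cores = sum(cores.values())
    exact = {r: n * total_queries / total_cores for r, n in cores.items()}
    shares = {r: int(x) for r, x in exact.items()}
    # Assumption: hand leftover queries to the largest fractional parts so
    # that the shares add up to D exactly.
    leftover = total_queries - sum(shares.values())
    by_fraction = sorted(exact, key=lambda r: exact[r] - shares[r], reverse=True)
    for r in by_fraction[:leftover]:
        shares[r] += 1
    return shares


# Total core counts from Table 1; D = 4,096 input queries.
cores = {"Cheaha": 96, "Ferrum": 192, "Olympus": 128}
core_shares = split_by_cores(cores, 4096)
print(core_shares)
```

With these inputs the split comes to 945, 1,891, and 1,260 queries for Cheaha, Ferrum, and Olympus respectively; the values plotted in Fig. 4 may differ by a query or so depending on the rounding actually used.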

The job input data was divided between individual resources using the UNIX split utility based on the values shown in Fig. 4. Within each resource, input data was then divided evenly between available compute nodes (i.e., using (1)). Figure 5 presents a visualization of the query lengths of the job input file along with the associated division across resources and nodes within a resource. Note that only a portion of the entire job input file is shown to make the figure more readable. As is evident from the figure, lengths of individual queries vary greatly and are unevenly spread across the provided input file. As a result of applying the straightforward split utility, a disproportionate mix of queries (in terms of query length) is assigned to individual nodes.

For Dynamic BLAST and the presented BLAST performance characterization, the job input data division for individual resources is based on resource performance as well as resource size (i.e., (5)). For the performed experiments, the performance weight $P_i$ from (2) utilized in (5) was obtained from benchmark data. For the benchmark, we used the runtime of the entire 4,096-query input file. The obtained runtime results, the calculated normalized resource performance (i.e., (2)), and the data distribution based on (5) are provided in Table 2.
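The relationship between the benchmark runtimes and the normalized weights in Table 2 can be illustrated with a short sketch. Equations (2) and (5) are defined in Sect. 3 and are not reproduced here, so the code rests on two assumptions: that the weight is the fastest runtime divided by the resource's own runtime, and that queries are divided in proportion to that weight alone. Both function names are hypothetical.

```python
from typing import Dict


def normalized_performance(runtimes: Dict[str, float]) -> Dict[str, float]:
    """Scale every resource against the fastest one (fastest = 1.0)."""
    fastest = min(runtimes.values())
    return {r: fastest / t for r, t in runtimes.items()}


def split_by_weight(weights: Dict[str, float],
                    total_queries: int) -> Dict[str, float]:
    """Divide the queries in proportion to the performance weights."""
    total = sum(weights.values())
    return {r: total_queries * w / total for r, w in weights.items()}


# Benchmark runtimes in seconds, taken from Table 2.
runtimes = {"Cheaha": 1595, "Ferrum": 1025, "Olympus": 2234}
weights = normalized_performance(runtimes)
weight_shares = split_by_weight(weights, 4096)
```

Under these assumptions the weights come out at about 0.643, 1.00, and 0.459, in line with Table 2 (which lists 0.458 for Olympus). The proportional split gives roughly 1,253, 1,949, and 894 queries, close to but not identical with the 1,255, 1,949, and 892 reported in the table, so (5) evidently involves more than this simple proportion.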

In addition to the resource-performance-based input data distribution, the presented characterization indicated that the input data distribution should incorporate the number of input queries as well as query length into the workload calculation. As described in Sect. 4.1.1, Dynamic BLAST implements this component of the characterization in the form of the first-fit decreasing algorithm. By applying this algorithm to the job input data, a more even input data distribution is obtained (see Fig. 6). As presented in Fig. 6, individual nodes within a resource are assigned a more proportional amount of input data, which results in more even task and job runtimes (see Fig. 7).

Fig. 5 Visualization of query lengths across the job input file. This figure also shows the GridWay-based distribution of the job input file across resources. As can be seen, input queries are distributed across resources without regard for query length

Fig. 6 Visualization of query lengths for the job input file as distributed by Dynamic BLAST. Each tailing spike visible in the figure corresponds to the data assignment to an individual compute node on a given resource. By performing such input data reorganization to achieve consistent distribution of queries across resources as well as compute nodes, consistent node and resource performance is exhibited that reduces overall job load imbalance

Fig. 7 Runtime characteristics of the GridWay-based job and the Dynamic BLAST job implementing the BLAST performance characterization. Maximum refers to the maximum runtime of any one resource within a job and thus the overall runtime of a job

Table 2 Resource benchmark values, calculated resource weights, and newly derived job data distribution

Resource                         Cheaha  Ferrum  Olympus
Benchmark runtime data (sec)     1595    1025    2234
Normalized resource performance  0.643   1.00    0.458
Query distribution               1255    1949    892

The aim of the presented job model is minimization of load imbalance (as described in Sect. 4.1.1) because the overall runtime of a job is determined by the longest running task contained within that job. Figure 7 presents runtime results of the GridWay-based and the Dynamic BLAST-based jobs. As can be seen from the figure, the runtime of the Dynamic BLAST job, based on the presented BLAST performance characterization, is approximately half that of the GridWay-based job submission (determined by the longest running task, i.e., the maximum). Furthermore, load imbalance across resources is considerably smaller, which results from more adequate workload assignment across individual tasks. The obtained performance results indicate the benefits for job performance and resource utilization realized when utilizing the presented BLAST performance characterization.
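The two quantities discussed above can be written down in a few lines. The job runtime follows directly from the text (the longest running task); the imbalance metric is only one plausible definition, chosen here for illustration, and the runtimes in the example are made up rather than read from Fig. 7.

```python
from typing import Dict


def job_runtime(task_runtimes: Dict[str, float]) -> float:
    """A job finishes only when its slowest task finishes."""
    return max(task_runtimes.values())


def load_imbalance(task_runtimes: Dict[str, float]) -> float:
    """Share of the job runtime that an average task spends idle.

    Assumption: imbalance = (longest - mean) / longest, so 0.0 means all
    tasks finish together.
    """
    longest = max(task_runtimes.values())
    mean = sum(task_runtimes.values()) / len(task_runtimes)
    return (longest - mean) / longest


# Hypothetical per-resource runtimes in seconds, for illustration only.
example = {"resource_a": 100.0, "resource_b": 60.0, "resource_c": 80.0}
print(job_runtime(example), load_imbalance(example))
```

A lower imbalance value means less idle time on the faster resources, which is how a better data distribution translates into both a shorter job runtime and higher resource utilization.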

5 Summary and future work

The BLAST application is one of the most widely used bioinformatics applications today; scientists from various disciplines employ it in their daily routines. With the ever-growing size of search databases and thus longer search times, scientists are exploring new technologies, such as grid computing, to accommodate increased requirements in resource capabilities. Resources available in such environments, as well as individual resources directly available to users, are subject to performance variation due to job parameter selection. Such variations are further amplified in heterogeneous environments. Selection of inappropriate job parameters and parameter values results in resources being underutilized and users overpaying for the received service [19]. With the likely increase in use of grid-like resources (e.g., through the cloud computing paradigm [41]), the importance of understanding relationships between an application, underlying resources, and job parameters will only increase.

To give users an understanding of performance dependencies between the BLAST application and underlying hardware, this paper presented performance characterizations for BLAST. Characterizations for both homogeneous and heterogeneous environments were developed, enabling users to apply the presented information and maximize utilization of resources while minimizing the search times of their BLAST jobs. The derived characterizations were validated and tested through their implementation in a grid-enabled BLAST application. Results show significant benefit in overall job runtime as well as resource utilization.

In conclusion, as understanding of the shown parameter optimizations becomes more prominent and recognized, tools will emerge that fully automate the shown process and incorporate all the components (i.e., resource selection, application dependencies for a given resource, and user utility) into a single job submission interface. This can lead to automatic enablement and presentation of cost vs. time tradeoffs to users for their jobs. Although this is a long-term goal, in the near future the authors plan to expand the functionality of Dynamic BLAST to fully automate the process described above by continuing to develop and embed optimization methods and characterizations directly into the application.

Acknowledgements We thank our colleagues, namely Pavithran Sathyanarayana for development of the Perl script for distributing BLAST input queries and Shankar Changayil for providing us with the search sequences and appropriate explanations and descriptions. We would also like to thank Dr. Elliot Lefkowitz for his valuable input during this work. This work was made possible in part by a grant of high performance computing resources from the Department of Computer and Information Sciences at the University of Alabama at Birmingham, the School of Natural Sciences and Mathematics at the University of Alabama at Birmingham, the National Science Foundation Award CNS-0420614, and NIH/NIAID Contract No. HHSN266200400036C for the Viral Bioinformatics Resource Center (VBRC).

References

1. Bergeron, B.: Bioinformatics Computing, 1st edn. Prentice Hall, Upper Saddle River (2002)

2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)

3. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J.: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389–3402 (1997)

4. Pearson, W.R., Lipman, D.J.: Improved tools for biological sequence comparison. Proc. Nat. Acad. Sci. USA 85(16), 2444–2448 (1988)

5. Smith, T., Waterman, M.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)

6. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge (1998)

7. Human Genome Program: What is the human genome project? December 07 (2005). Available at http://www.ornl.gov/sci/techresources/Human_Genome/project/about.shtml. Retrieved: April 28, 2008

8. Wu, X., Tseng, C.-W.: Searching sequence databases using high-performance BLASTs. In: Zomaya, A.Y. (ed.) Parallel Computing for Bioinformatics and Computational Biology, pp. 211–232. Wiley, New York (2006)

9. Darling, A.E., Carey, L., Feng, W.-C.: The design, implementation, and evaluation of mpiBLAST. In: ClusterWorld Conference & Expo in conjunction with the 4th International Conference on Linux Clusters: The HPC Revolution 2003, San Jose, CA (2003)

10. Bjornson, R.D., Sherman, A.H., Weston, S.B., Willard, N., Wing, J.: TurboBLAST: a parallel implementation of BLAST built on the TurboHub. In: International Parallel and Distributed Processing Symposium: IPDPS 2002, Ft. Lauderdale, FL (2002)

11. Krishnan, A.: GridBLAST: a Globus-based high-throughput implementation of BLAST in a Grid computing framework. Concurr. Comput., Pract. Experience 17(13), 1607–1623 (2005)

12. Czajkowski, K., Fitzgerald, S., Foster, I., Kesselman, C.: Grid information services for distributed resource sharing. In: 10th IEEE Symp. on High Performance Distributed Computing (HPDC), Los Alamitos, CA, pp. 181–195 (2001)

13. NCBI: BLAST basic local alignment search tool, April 25, 2008. Available at http://blast.ncbi.nlm.nih.gov/Blast.cgi. Retrieved: April 28, 2008

14. NCBI: BLAST frequently asked questions (2008). Available at http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastFAQs#sigxcpu. Retrieved: April 28, 2008

15. Afgan, E., Bangalore, P.: Dynamic BLAST - a grid enabled BLAST. Int. J. Comput. Sci. Netw. Secur. 9(4), 149–157 (2009)

16. Sulakhe, D., Rodriguez, A., D’Souza, M., Wilde, M., Nefedova, V., Foster, I., Maltsev, N.: GNARE: an environment for grid-based high-throughput genome analysis. In: Fifth IEEE International Symposium on Cluster Computing and the Grid (CCGrid’05), pp. 455–462. Cardiff, UK (2005)

17. Gardner, M.K., Feng, W.-C., Archuleta, J., Lin, H., Ma, X.: Parallel genomic sequence-searching on an ad-hoc grid: experiences, lessons learned, and implications. In: Supercomputing, 2006 (SC ’06), pp. 22–36. Tampa, FL (2006)

18. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure, 1st edn. Morgan Kaufmann, San Francisco (1998)

19. Afgan, E., Bangalore, P.: Embarrassingly parallel jobs are not embarrassingly easy to schedule on the grid. In: International Conference for High Performance Computing, Networking, Storage and Analysis (SC08), Workshop on Many-Task Computing on Grids and Supercomputers, p. 10. Austin, TX (2008)

20. Barton, G.J.: SCANPS version 2.3.9 User Guide. University of Dundee, Scotland (2002)

21. Afgan, E., Bangalore, P.: Performance characterization of BLAST for the grid. In: 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), pp. 1394–1398. Boston, MA (2007)

22. Tan, G., Xu, L., Dai, Z., Feng, S., Sun, N.: A study of architectural optimization methods in bioinformatics applications. Int. J. High Perform. Comput. Appl. 21(3), 371–384 (2007)

23. Sanchez, F., Salami, E., Ramirez, A., Valero, M.: Performance analysis of sequence alignment applications. In: 2006 IEEE International Symposium on Workload Characterization, pp. 51–60. San Jose, CA (2006)

24. Standard Performance Evaluation Corporation, March 10, 2009. Available at http://www.spec.org/. Retrieved: March 19, 2009

25. Bader, D.A., Li, Y., Li, T., Sachdeva, V.: BioPerf: A benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. In: The IEEE International Symposium on Workload Characterization (IISWC 2005), pp. 163–173. Austin, TX (2005)

26. Globus: The Globus Resource Specification Language RSL v1.0 (2009). Available at http://www-unix.globus.org/api/c-globus-2.4/globus_gram_documentation/html/. Retrieved: April 2, 2009

27. Wang, C., Lefkowitz, E.J.: SS-Wrapper: a package of wrapper applications for similarity searches on Linux clusters. BMC Bioinformatics 5(171) (2004)

28. Dwan, C.: Bioinformatics Benchmarks on the Dual Core Intel Xeon Processor. The BioTeam, Inc., Cambridge (2006)

29. Sodhi, S., Subhlok, J.: Automatic construction and evaluation of performance skeletons. In: 19th International Parallel and Distributed Processing Symposium (IPDPS ’05), p. 10. Denver, CO (2005)

30. Nadeem, F., Prodan, R., Fahringer, T., Iosup, A.: Benchmarking grid applications for performance and scalability predictions. In: CoreGRID 2007 Workshop on Middleware, p. 14. Dresden, Germany (2007)

31. Tirado-Ramos, A., Tsouloupas, G., Dikaiakos, M., Sloot, P.: Grid resource selection by application benchmarking for computational haemodynamics applications. In: International Conference on Computational Science (ICCS) 2005, pp. 534–543. Kassel, Germany (2005)

32. Afgan, E., Bangalore, P.: Experiences with developing and deploying dynamic BLAST. In: 15th ACM Mardi Gras Conference, Workshop on Grid-Enabling Applications, pp. 38–48. Baton Rouge, LA (2008)

33. Foster, I., Kesselman, C., Tsudik, G., Tuecke, S.: A security architecture for computational grids. In: 5th ACM Conference on Computer and Communications Security, pp. 83–92. San Francisco, CA (1998)

34. Afgan, E., Bangalore, P., Duncan, D.: GridAtlas – a grid application and resource configuration repository and discovery service. In: International Conference on Cluster Computing, New Orleans, LA, pp. 1–10, Aug 31–Sep 4, 2009

35. Rajic, H., Brobst, R., Chan, W., Ferstl, F., Gardiner, J., Haas, A., Nitzberg, B., Tollefsrud, J.: Distributed resource management application API (DRMAA) specification 1.0 GFD-R-P.022. Global Grid Forum (GGF) (2004)

36. Foster, I., Kesselman, C.: The Globus toolkit. In: Foster, I., Kesselman, C. (eds.) The Grid: Blueprint for a New Computing Infrastructure, pp. 259–278. Morgan Kaufmann, San Francisco (1999)

37. Afgan, E., Bangalore, P., Mukkai, S., Yammanuru, S.: Design and implementation of a readily available historical application performance database (AppDB) for Grid. University of Alabama at Birmingham (UAB), Birmingham, AL UABCIS-TR-2008-0506-1, 6 May 2008

38. Dale, N., Teague, D.: C++ Plus Data Structures, 2nd edn. Jones & Bartlett, Boston (2001)

39. Leung, J.Y.-T. (ed.): Handbook of Scheduling: Algorithms, Models, and Performance Analysis, 1st edn., vol. 1. CRC, Boca Raton (2004)

40. GridWay: Job Template options. Feb 16, 2009. Available at http://www.gridway.org/documentation/stable5.4/user/gridway-user-functionality.html#id2578278. Retrieved: April 2, 2009

41. Buyya, R., Yeo, C.S., Venugopal, S., Broberg, J., Brandic, I.: Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst. 25(6), 599–616 (2009)

Enis Afgan is currently a postdoctoral research fellow at Emory University in the Department of Biology. His current focus is on enabling transparent and efficient execution of biological tools across distributed environments. In summer of 2009 he received a Ph.D. degree in grid computing from the University of Alabama at Birmingham. His research interests center on distributed computing (grid, cloud) with an emphasis on user-level scheduling, optimization methods, and performance modeling.

Purushotham Bangalore is an Associate Professor and Director of the Collaborative Computing Lab in the Department of Computer and Information Sciences at the University of Alabama at Birmingham. He received a B.E. degree in Computer Science and Engineering (October 1991) from Bangalore University, India, and an M.S. degree in Computer Science (May 1995) and a Ph.D. in Computational Engineering (May 2003) from Mississippi State University. His research includes metascheduling techniques for Grid Computing, grid enablement of multidisciplinary applications, and techniques to synthesize parallel programs for heterogeneous architectures.

