1 Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and...

22
1 Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications Mark K. Gardner (Virginia Tech) Wu-chun Feng (Virginia Tech) Jeremy Archuleta (U. Utah) Heshan Lin (NCSU) Xiaosong Ma (NCSU & ORNL) Nominated for Best Paper Award, SC 2006, Tampa, FL

Transcript of 1 Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and...

1

Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications

Mark K. Gardner (Virginia Tech)Wu-chun Feng (Virginia Tech)Jeremy Archuleta (U. Utah)

Heshan Lin (NCSU)Xiaosong Ma (NCSU & ORNL)

Nominated for Best Paper Award, SC 2006, Tampa, FL

2

Overview StorCloud Demo of SC|05

I/O throughput competition of real world scientific applications When: Sun., Nov. 13 to Thu., Nov. 17, 2005 Part of slides modified from StorCloud presentation “mpiBLAS

T on the GreenGene Distributed Supercomputer” (Wu Feng et. al.) Story

Built an ad-hoc grid (GreenGene) with 3048 Processor for intensive genomic sequence search (search NT against NT with mpiBLAST)

Team Institutions

LANL, NCSU, U. Utah, and Virginia Tech Vendors

Intel, Panta Systems, and Foundry Networks

3

GreenGene Grid

How?

Intel(Dupont)

SC2005Showroom

Floor

U.Utah

Va Tech

4

Outline

About BLAST and mpiBLAST Motivation Planning

Estimate resource requirements What kind of grid do we need

System design Hardware architecture Software architecture

Results Conclusion

5

What is BLAST?

Basic Local Alignment Sequence Tool Ubiquitous sequence database search tool used

in molecular biology Given a query DNA or amino-acid (AA) sequence,

BLAST Finds similar sequences in database Reports statistical significance of similarities between

query and database Newly sequenced genomes are typically

BLAST-searched against database of known genes Similar sequences may have similar functions in

a new organism

6

BLAST at the Core of Sequence DB Search

Widely used: Approximately 75%-90% of all compute cycles in life sciences are devoted

to BLAST searches But, it is:

Computationally demanding, O(n2) (variant of string matching algorithm) Requires seq database to be stored in memory to perform efficiently

Challenge: sequence databases growing exponentially

7

mpiBLAST Algorithm: Querying the Database

Open source BLAST parallelization (developed at LANL) Parallel approach: segment and distribute database across cluster Advantage: deliver super-linear speedup by avoiding repeated I/O Limitation: poor performance in handle search with large output

volume because of results merging bottleneck

8

mpiBLAST-PIO: Enhancing Efficiency

Optimizations transferred from pioBLAST Research prototype developed at NCSU and

ORNL [Lin et. al. IPDPS05] Dramatically improves search throughput

and scalability Using parallel I/O techniques to remove result

merging bottleneck Results buffered and outputted concurrently by workers

Enhancing output processing to reduce communication volume

Largely used in SC StorCloud demo

9

Why Sequence-Search the NT Database Against Itself?

From a Biological Perspective Aids in understanding of which genetic codes are unique and

which are redundant Enables a number of useful studies from organism

“barcoding” to gene function and evolution From a Computer Science Perspective

Provides pertinent demonstration of mpiBLAST/pio’s scalability to larger problems (NT is one of the largest seq databases)

Can potentially generate huge output data Enables realization of advanced indexing structure that

tracks relationships among sequences in the database Such indexing structures can provide

Up to 100x speedup in search times with little loss of sensitivity. Up to 20x compression of the database using phylogenetic

methods.

10

Resource Estimation

Why do we care? To evaluate the feasibility of the project To make better scheduling decision

What’s the complexity of the problem? Intuitively: estimation by seq length

NT composition

11

Sequence Length Based Estimation Simple linear extrapolation appears “mission impossible”

Because of “hard queries” intensive computation, large quantities of intermediate results

Fortunately, Weak correlation between sequence length and resource

requirements because of BLAST employs heuristics G1 sequences well behaved, large portion of sequences belong to G1 Search of hard queries can be speeded up with more memory

Sampling NT sequences search

12

Better Predictor?

Hit-based rather than length-based? Two phase BLAST search

First phase: find hits in word level Second phase: extend matched words in both

direction to find maximal segment pair (longest local matching substring)

Computation of first phase much less expensive then that of second phase

Modified BLAST algorithm to collect number of hits in the first phase

Attractive: utilizing internal knowledge of BLAST algorithm

13

Number of Hits Not a Better Predictor

Linear regression on data collected from 500 seqs Y: output size, execution time; X: length, # hits

Number of hits not necessary better Difference of mean square errors < 5% High correlation (0.9942) between number of hits and

sequence length Sequence length is much easier to collect

14

What Kind of Grid Do We Need?

Existing grid frameworks (such as Globus) not what we want Not available or well tested on Mac OS X and 64-

bit Linux OS mpiBLAST-PIO not ported to Globus High learning curve for installation and

configuration Home made grid software wrote from

scratch Just fit our needs Easy to deploy, allow full control

15

Hardware Architecture

Heterogeneous environment Interoperability is big concern

Cluster Organization

Architecture Memory

#Procs

File System

System X Virginia Tech Dual 2.3GHz PowerPC 970FX

4GB 2200 NFS

TunnelArch

Univ. of Utah Dual AMD Opteron 240 CPU

4GB 126 PVFS

TunnelArch

Univ. of Utah Dual AMD Opteron 244 CPU

2GB 128 PVFS

Dupon Intel Quad core N/A 512/256

NFS

Jarrel Intel Dual 3.4GHz Intel P4 2GB 20 NFS

Blade Center

Intel Dual 2.66GHz Intel Xeon 2GB 28 NFS

Panta Panta Systems

Four AMD Opteron 246HE 2GB 32 NFS

16

17

Software Architecture Hierarchical design

SuperMaster: assign queries, fetch results, load balancing GroupMaster: fetch queries, perform search How to choose group size?

Challenges: heterogeneity, scalability, fault tolerance

NT Replica NT Replica

GroupMaster GroupMaster GroupMaster

SuperMaster

NT Replica

18

Heterogeneity And Accessibility

Only use four existing, cross-platform tools Perl, ssh, rsync, bash 5 scripts, totaled only 458 lines Fast deployment in Unix like systems

Customize mpiBLAST-PIO System X need special care

Porting issues because of Mac OS and Power PC

Implement pseudo-parallel-write to improve output performance on NFS

19

Design for Scalability Managing thousands of procs efficiently with

loosely coupled, hierarchical design Reduce loads on SuperMaster Passive SuperMaster: easy to add group masters,

regroup processors, and avoid security hole Allow incremental system start

Hiding WAN latency by queuing queries in local Prevent “bubbles in the pipeline”

Ensuring data integrity with MD5 checksum A silent error every 500GB [Paxson 1999]

Alleviating network bandwidth constraint with compression (compression ration 1:5 ~ 1:7)

20

Fault Tolerance Serious: mean time failure < 10 hours in machines

with thousands of processors [Reed 2004] Re-execution rather than checkpoint-restart

Primary issue: query states management Maintain all query states in file system

21

Results Finished 1/7 NT in one day

Coalesced sequences into batches targeting 30 minutes search time

Execution statistics Output size: 600K ~ 7GB per batch, 284.2KB per seq Execution time: 6 secs ~ 1.6 hours, average 9 mins per

batch

22

Conclusion

Not be able to take advantage of existing grid software

Home made grid software did work Enables rapid development and deployment Portable to Unix like platforms

Identify hard queries for bio research Future work

Extend framework to support more general applications

Better resource estimation