1
Parallel Genomic Sequence-Searching on an Ad-Hoc Grid: Experiences, Lessons Learned, and Implications
Mark K. Gardner (Virginia Tech), Wu-chun Feng (Virginia Tech), Jeremy Archuleta (U. Utah),
Heshan Lin (NCSU), Xiaosong Ma (NCSU & ORNL)
Nominated for Best Paper Award, SC 2006, Tampa, FL
2
Overview
StorCloud Demo at SC|05
  I/O throughput competition for real-world scientific applications
  When: Sun., Nov. 13 to Thu., Nov. 17, 2005
  Some slides are adapted from the StorCloud presentation “mpiBLAST on the GreenGene Distributed Supercomputer” (Wu Feng et al.)
Story
  Built an ad-hoc grid (GreenGene) with 3048 processors for intensive genomic sequence search (searching NT against NT with mpiBLAST)
Team
  Institutions: LANL, NCSU, U. Utah, and Virginia Tech
  Vendors: Intel, Panta Systems, and Foundry Networks
4
Outline
About BLAST and mpiBLAST
Motivation
Planning
  Estimate resource requirements
  What kind of grid do we need?
System design
  Hardware architecture
  Software architecture
Results
Conclusion
5
What is BLAST?
Basic Local Alignment Search Tool
  Ubiquitous sequence database search tool used in molecular biology
Given a query DNA or amino-acid (AA) sequence, BLAST
  Finds similar sequences in the database
  Reports the statistical significance of similarities between the query and the database
Newly sequenced genomes are typically BLAST-searched against a database of known genes
  Similar sequences may have similar functions in a new organism
6
BLAST at the Core of Sequence DB Search
Widely used: approximately 75%-90% of all compute cycles in the life sciences are devoted to BLAST searches
But it is
  Computationally demanding: O(n^2) (a variant of a string-matching algorithm)
  Memory-hungry: the sequence database must be held in memory to perform efficiently
Challenge: sequence databases are growing exponentially
7
mpiBLAST Algorithm: Querying the Database
Open-source parallelization of BLAST (developed at LANL)
Parallel approach: segment the database and distribute it across the cluster
Advantage: delivers super-linear speedup by avoiding repeated I/O
Limitation: performs poorly on searches with large output volume because merging results becomes a bottleneck
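The segment-and-distribute idea on this slide can be illustrated with a small sketch. This is not mpiBLAST code: the function names (`segment_database`, `search_fragment`, `merge_results`) are hypothetical, the "search" is a toy 3-mer count standing in for BLAST, and the fragments are searched in a plain loop rather than on separate cluster nodes.

```python
from typing import Dict, List, Tuple

def segment_database(db: Dict[str, str], n_fragments: int) -> List[Dict[str, str]]:
    """Split a sequence database into fragments of roughly equal total length."""
    fragments: List[Dict[str, str]] = [dict() for _ in range(n_fragments)]
    sizes = [0] * n_fragments
    # Place the longest sequences first, always into the currently smallest fragment.
    for name, seq in sorted(db.items(), key=lambda kv: len(kv[1]), reverse=True):
        i = sizes.index(min(sizes))
        fragments[i][name] = seq
        sizes[i] += len(seq)
    return fragments

def search_fragment(query: str, fragment: Dict[str, str]) -> List[Tuple[str, int]]:
    """Toy stand-in for a BLAST search: score = number of 3-mers shared with the query."""
    words = {query[i:i + 3] for i in range(len(query) - 2)}
    hits = []
    for name, seq in fragment.items():
        score = sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] in words)
        if score:
            hits.append((name, score))
    return hits

def merge_results(per_fragment: List[List[Tuple[str, int]]]) -> List[Tuple[str, int]]:
    """Merge per-fragment hit lists into one list ordered by descending score."""
    merged = [hit for hits in per_fragment for hit in hits]
    return sorted(merged, key=lambda h: (-h[1], h[0]))

if __name__ == "__main__":
    db = {"s1": "ACGTACGT", "s2": "TTTTTT", "s3": "ACGAAA", "s4": "GGGG"}
    fragments = segment_database(db, 2)
    # In a real deployment each fragment would be searched by a different worker.
    print(merge_results([search_fragment("ACGTA", f) for f in fragments]))
```

The final merge step is the part that becomes a bottleneck when every worker produces a large hit list, which is the limitation the slide points out.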
8
mpiBLAST-PIO: Enhancing Efficiency
Optimizations transferred from pioBLAST
  Research prototype developed at NCSU and ORNL [Lin et al., IPDPS05]
Dramatically improves search throughput and scalability
  Uses parallel I/O techniques to remove the result-merging bottleneck
    Results are buffered and written concurrently by workers
  Enhances output processing to reduce communication volume
Largely used in the SC StorCloud demo
9
Why Sequence-Search the NT Database Against Itself?
From a Biological Perspective
  Aids in understanding which genetic codes are unique and which are redundant
  Enables a number of useful studies, from organism “barcoding” to gene function and evolution
From a Computer Science Perspective
  Provides a pertinent demonstration of mpiBLAST-PIO’s scalability to larger problems (NT is one of the largest sequence databases)
    Can potentially generate huge output data
  Enables realization of an advanced indexing structure that tracks relationships among sequences in the database
    Such indexing structures can provide
      Up to 100x speedup in search times with little loss of sensitivity
      Up to 20x compression of the database using phylogenetic methods
10
Resource Estimation
Why do we care?
  To evaluate the feasibility of the project
  To make better scheduling decisions
What is the complexity of the problem?
  Intuitively: estimate by sequence length
[Figure: NT composition]
11
Sequence Length Based Estimation
Simple linear extrapolation makes the job look like “mission impossible”
  Because of “hard queries”: intensive computation and large quantities of intermediate results
Fortunately,
  Sequence length and resource requirements are only weakly correlated, because BLAST employs heuristics
  G1 sequences are well behaved, and a large portion of sequences belong to G1
  Searches of hard queries can be sped up with more memory
[Figure: Sampling NT sequences search]
12
Better Predictor?
Hit-based rather than length-based?
Two-phase BLAST search
  First phase: find hits at the word level
  Second phase: extend matched words in both directions to find the maximal segment pair (longest local matching substring)
  The first phase is much less expensive to compute than the second
Modified the BLAST algorithm to collect the number of hits in the first phase
  Attractive: uses internal knowledge of the BLAST algorithm
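The two phases above can be sketched in a few lines. This is a simplified illustration, not the BLAST implementation: it uses exact word matches and exact-match extension, whereas real BLAST scores extensions with a substitution matrix and a drop-off threshold. The function names and the word size `w=3` are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

def find_word_hits(query: str, subject: str, w: int = 3) -> List[Tuple[int, int]]:
    """Phase 1: return (query_pos, subject_pos) for every shared word of length w."""
    index: Dict[str, List[int]] = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits

def extend_hit(query: str, subject: str, i: int, j: int, w: int = 3) -> Tuple[int, int, int]:
    """Phase 2: extend a word hit left and right while the characters keep matching."""
    left = 0
    while i - left - 1 >= 0 and j - left - 1 >= 0 and query[i - left - 1] == subject[j - left - 1]:
        left += 1
    right = 0
    while (i + w + right < len(query) and j + w + right < len(subject)
           and query[i + w + right] == subject[j + w + right]):
        right += 1
    # (start in query, start in subject, length of the matching segment)
    return (i - left, j - left, w + left + right)

def search(query: str, subject: str, w: int = 3) -> Tuple[int, Optional[Tuple[int, int, int]]]:
    """Return the phase-1 hit count and the longest extended segment."""
    hits = find_word_hits(query, subject, w)
    segments = {extend_hit(query, subject, i, j, w) for i, j in hits}
    best = max(segments, key=lambda s: s[2], default=None)
    return len(hits), best

if __name__ == "__main__":
    print(search("GGACGTACC", "TTACGTATT"))
```

The first element of the returned pair is the quantity the slide proposes as a predictor: it is available after the cheap first phase, before any extension work is done.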
13
Number of Hits Not a Better Predictor
Linear regression on data collected from 500 sequences
  Y: output size, execution time; X: length, number of hits
Number of hits is not necessarily better
  Difference in mean squared error < 5%
  High correlation (0.9942) between number of hits and sequence length
  Sequence length is much easier to collect
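A comparison of this kind could be set up roughly as follows. The sketch fits a one-variable least-squares line for each candidate predictor and reports its mean squared error, plus the Pearson correlation between the two predictors. The numbers in the usage section are invented purely to exercise the code; they are not the measurements from the 500-sequence sample.

```python
from typing import Sequence, Tuple

def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_mse(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit y = slope * x + intercept; returns (slope, intercept, MSE)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    mse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / n
    return slope, intercept, mse

if __name__ == "__main__":
    # Made-up sample: sequence length, phase-1 hit count, execution time (s).
    length = [200.0, 450.0, 900.0, 1500.0, 3000.0]
    hits = [410.0, 880.0, 1850.0, 2900.0, 6100.0]
    seconds = [1.1, 2.0, 4.3, 6.8, 14.5]
    print("MSE using length:", fit_mse(length, seconds)[2])
    print("MSE using hits:  ", fit_mse(hits, seconds)[2])
    print("corr(length, hits):", pearson(length, hits))
```

When the two predictors are themselves highly correlated, as the slide reports, their regression errors will be close, so the cheaper one to collect is the practical choice.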
14
What Kind of Grid Do We Need?
Existing grid frameworks (such as Globus) are not what we want
  Not available or not well tested on Mac OS X and 64-bit Linux
  mpiBLAST-PIO not ported to Globus
  Steep learning curve for installation and configuration
Homemade grid software written from scratch
  Fits our needs exactly
  Easy to deploy, allows full control
15
Hardware Architecture
Heterogeneous environment
  Interoperability is a big concern

Cluster       Organization   Architecture               Memory  #Procs   File System
System X      Virginia Tech  Dual 2.3GHz PowerPC 970FX  4GB     2200     NFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 240 CPU   4GB     126      PVFS
TunnelArch    Univ. of Utah  Dual AMD Opteron 244 CPU   2GB     128      PVFS
Dupon         Intel          Quad core                  N/A     512/256  NFS
Jarrel        Intel          Dual 3.4GHz Intel P4       2GB     20       NFS
Blade Center  Intel          Dual 2.66GHz Intel Xeon    2GB     28       NFS
Panta         Panta Systems  Four AMD Opteron 246HE     2GB     32       NFS
17
Software Architecture
Hierarchical design
  SuperMaster: assigns queries, fetches results, balances load
  GroupMaster: fetches queries, performs searches
  How to choose group size?
Challenges: heterogeneity, scalability, fault tolerance
[Diagram: one SuperMaster, three GroupMasters, three NT replicas]
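One way to picture the hierarchy is a pull model: the SuperMaster only holds a queue of query batches, and each GroupMaster asks for work when it is idle. The sketch below is an in-process simulation under that assumption. The class and method names are hypothetical, and in the actual system the two levels run on different sites and communicate over the network.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

Batch = Tuple[int, List[str]]  # (batch id, query sequences)

class SuperMaster:
    """Holds pending query batches and collects results; never contacts workers itself."""

    def __init__(self, batches: List[Batch]) -> None:
        self.pending: Deque[Batch] = deque(batches)
        self.results: Dict[int, List[str]] = {}

    def request_batch(self) -> Optional[Batch]:
        return self.pending.popleft() if self.pending else None

    def submit_result(self, batch_id: int, output: List[str]) -> None:
        self.results[batch_id] = output

class GroupMaster:
    """Pulls one batch at a time, runs the search for its group, and returns the output."""

    def __init__(self, super_master: SuperMaster, search: Callable[[str], str]) -> None:
        self.super_master = super_master
        self.search = search
        self.completed = 0

    def run_once(self) -> bool:
        batch = self.super_master.request_batch()
        if batch is None:
            return False  # no work left
        batch_id, queries = batch
        self.super_master.submit_result(batch_id, [self.search(q) for q in queries])
        self.completed += 1
        return True

if __name__ == "__main__":
    sm = SuperMaster([(0, ["acgt", "ttga"]), (1, ["ccat"]), (2, ["gatt"])])
    groups = [GroupMaster(sm, str.upper), GroupMaster(sm, str.upper)]
    busy = True
    while busy:
        busy = any([g.run_once() for g in groups])
    print(sm.results)
```

Because idle groups ask for the next batch themselves, faster groups naturally process more batches, which gives a simple form of load balancing without the SuperMaster tracking group speeds.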
18
Heterogeneity And Accessibility
Only four existing, cross-platform tools are used
  Perl, ssh, rsync, bash
  5 scripts, totaling only 458 lines
  Fast deployment on Unix-like systems
Customized mpiBLAST-PIO
  System X needs special care
    Porting issues because of Mac OS and PowerPC
  Implemented pseudo-parallel-write to improve output performance on NFS
19
Design for Scalability
Managing thousands of processors efficiently with a loosely coupled, hierarchical design
  Reduces load on the SuperMaster
  Passive SuperMaster: easy to add group masters, regroup processors, and avoid security holes
  Allows incremental system start
Hiding WAN latency by queuing queries locally
  Prevents “bubbles in the pipeline”
Ensuring data integrity with MD5 checksums
  A silent error every 500GB [Paxson 1999]
Alleviating network bandwidth constraints with compression (compression ratio 1:5 ~ 1:7)
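The integrity and bandwidth points can be combined in one small sketch: compress a result file before transfer, record an MD5 digest of the bytes that travel, and verify the digest before decompressing on the receiving side. The slides do not name the compressor, so `zlib` is an assumption here, and `pack` and `unpack` are hypothetical helper names.

```python
import hashlib
import zlib
from typing import Tuple

def pack(data: bytes) -> Tuple[bytes, str]:
    """Compress the payload and return it with the MD5 digest of the compressed bytes."""
    compressed = zlib.compress(data, 6)
    return compressed, hashlib.md5(compressed).hexdigest()

def unpack(compressed: bytes, expected_md5: str) -> bytes:
    """Verify the digest first; decompress only if the transfer arrived intact."""
    if hashlib.md5(compressed).hexdigest() != expected_md5:
        raise ValueError("checksum mismatch: payload was corrupted in transit")
    return zlib.decompress(compressed)

if __name__ == "__main__":
    output = b"ACGTACGTTTGACCA\n" * 2000  # stand-in for BLAST text output
    payload, digest = pack(output)
    print(len(output), "->", len(payload), "bytes, md5", digest)
    assert unpack(payload, digest) == output
```

A receiver that sees a mismatch can simply request the file again, which fits the re-execution approach used elsewhere in the design.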
20
Fault Tolerance
Serious: mean time to failure < 10 hours on machines with thousands of processors [Reed 2004]
Re-execution rather than checkpoint-restart
  Primary issue: managing query states
  All query states are maintained in the file system
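A minimal sketch of file-system-based state tracking, assuming one marker file per query batch and one directory per state. The directory layout and function names are illustrative, not taken from the original scripts. State changes use `os.rename`, which is atomic within a single POSIX file system, so a batch is always in exactly one state.

```python
import os
from typing import List, Optional

STATES = ("pending", "running", "done")

def init_state(root: str, batch_ids: List[str]) -> None:
    """Create the state directories and mark every batch as pending."""
    for state in STATES:
        os.makedirs(os.path.join(root, state), exist_ok=True)
    for batch_id in batch_ids:
        open(os.path.join(root, "pending", batch_id), "w").close()

def claim(root: str) -> Optional[str]:
    """Move one pending batch to running and return its id, or None if none is left."""
    for batch_id in sorted(os.listdir(os.path.join(root, "pending"))):
        try:
            os.rename(os.path.join(root, "pending", batch_id),
                      os.path.join(root, "running", batch_id))
            return batch_id
        except FileNotFoundError:
            continue  # another process claimed it first
    return None

def complete(root: str, batch_id: str) -> None:
    """Mark a running batch as done."""
    os.rename(os.path.join(root, "running", batch_id),
              os.path.join(root, "done", batch_id))

def requeue_running(root: str) -> List[str]:
    """After a failure, put every batch that was in flight back into pending."""
    moved = sorted(os.listdir(os.path.join(root, "running")))
    for batch_id in moved:
        os.rename(os.path.join(root, "running", batch_id),
                  os.path.join(root, "pending", batch_id))
    return moved
```

With this layout, recovery needs no checkpoint data: any batch found in `running` after a crash is simply searched again from the start.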
21
Results
Finished 1/7 of NT in one day
  Coalesced sequences into batches targeting a 30-minute search time
Execution statistics
  Output size: 600KB ~ 7GB per batch, 284.2KB per sequence
  Execution time: 6 secs ~ 1.6 hours, average 9 mins per batch
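Coalescing sequences toward a target search time can be done greedily once a per-sequence time estimate is available. The sketch below assumes such an estimator is passed in as `est_seconds`; the linear length-based estimator in the usage section and its coefficient are placeholders, not values from the project.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def coalesce(items: Sequence[T], est_seconds: Callable[[T], float],
             target: float = 1800.0) -> List[List[T]]:
    """Group items in order so that each batch's estimated time stays near the target.

    A batch is closed as soon as adding the next item would exceed the target.
    A single item that exceeds the target on its own gets a batch to itself.
    """
    batches: List[List[T]] = []
    current: List[T] = []
    total = 0.0
    for item in items:
        t = est_seconds(item)
        if current and total + t > target:
            batches.append(current)
            current, total = [], 0.0
        current.append(item)
        total += t
    if current:
        batches.append(current)
    return batches

if __name__ == "__main__":
    lengths = [1200, 800, 15000, 300, 400, 90000, 2500]
    # Placeholder estimator: 0.1 s per base.
    for batch in coalesce(lengths, lambda n: 0.1 * n):
        print(batch)
```

Since the estimate is only approximate, actual batch times will spread around the target, which is consistent with the wide range of execution times reported on the slide.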
22
Conclusion
Were not able to take advantage of existing grid software
Homemade grid software did work
  Enables rapid development and deployment
  Portable to Unix-like platforms
Identified hard queries for bio research
Future work
  Extend the framework to support more general applications
  Better resource estimation