Lichtenberg bosc2010 wordseeker
CONCURRENT BIOINFORMATICS SOFTWARE FOR
DISCOVERING GENOME-WIDE PATTERNS AND WORD-BASED GENOMIC SIGNATURES
Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch
The WordSeeker Tool
• Enumeration: Suffix Tree and Suffix Array, Radix Tree
• Scoring
• Clustering: Sequence Clustering, Word Clustering
• Conservation Analysis: PhastCons Score Extraction
• Location Distributions
• Sequence Coverage: minimum set of words necessary to cover all sequences
• Module Discovery: Enumerative
• Ranger Markup: Basic Functional Elements
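The slides do not say how the minimum covering word set is computed. As a rough illustration only, the sketch below uses the classic greedy set-cover heuristic, which gives an approximation and not a guaranteed minimum. The function name `greedy_word_cover` and the input layout (a mapping from each word to the set of sequence indices containing it) are assumptions made for this example, not part of WordSeeker.

```python
def greedy_word_cover(word_to_seqs, all_seqs):
    """Greedy set cover: repeatedly pick the word that covers the most
    still-uncovered sequences. Returns (chosen words, sequences left uncovered)."""
    uncovered = set(all_seqs)
    chosen = []
    while uncovered:
        # Iterating over sorted words makes ties resolve alphabetically.
        best = max(sorted(word_to_seqs),
                   key=lambda w: len(word_to_seqs[w] & uncovered))
        gained = word_to_seqs[best] & uncovered
        if not gained:
            break  # no remaining word occurs in the uncovered sequences
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical toy input: word -> indices of the sequences that contain it.
toy_words = {
    "ACG": {0, 1},
    "CGT": {1, 2},
    "GGA": {0, 1, 2},
    "TTT": {2, 3},
}
cover, missed = greedy_word_cover(toy_words, range(4))
```

In this toy input "GGA" is picked first because it covers three sequences, and "TTT" is then needed for the remaining one.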
Software Properties
• Google Code repository: http://code.google.com/p/word-seeker/
• GNU General Public License v3
• Doxygen code generator (internal documentation)
• SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk
Requirements
• G++ compiler version 4.1* or higher
• OpenMP headers
• MPI environment (distributed version)
For visualizations and other post-processing steps
• Perl 5.8.8
• TFBS (http://tfbs.genereg.net/), Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, Bio::SeqIO (all available through CPAN)
• Gnuplot version 4.2 or higher
Need for a Scalable Approach
Word Enumeration Module
Represents a set of biological input sequences based on some data structure
Keeps track of words, word counts, sequence counts, and word locations
Need to keep the data persistent in memory
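To make the bookkeeping above concrete, here is a minimal sketch that records word counts, sequence counts, and word locations for every word of a fixed length. It uses a plain hash table in place of the tree and array structures the tool is built on, and the function name `enumerate_words` and the record layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def enumerate_words(sequences, m):
    """For every length-m word, record its total count, the number of
    sequences it occurs in, and all (sequence index, offset) locations."""
    locations = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for offset in range(len(seq) - m + 1):
            locations[seq[offset:offset + m]].append((seq_idx, offset))
    table = {}
    for word, locs in locations.items():
        table[word] = {
            "word_count": len(locs),
            "sequence_count": len({s for s, _ in locs}),
            "locations": locs,
        }
    return table


# Toy input, not data from the talk.
table = enumerate_words(["ACGACG", "TACG"], 3)
```

Here `table["ACG"]` reports three occurrences spread over two sequences, at offsets 0 and 3 of the first sequence and offset 1 of the second.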
Word Scoring Module
Determines statistical scores for each word
Frequent lookups for words and substrings of words
Example: a Markov model of order m requires lookups for all substrings of up to length m for all words
Keep space complexity low
Keep time complexity for lookups low
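The lookup pattern behind Markov-based scoring can be sketched as follows. The slides do not give the scoring formula, so this example assumes the standard maximum-likelihood estimate of a word's expected count under a Markov model: the product of counts of its overlapping substrings of length k + 1, divided by the product of counts of its inner substrings of length k. The order is written `k` here to keep it apart from the word length, and both function names are hypothetical.

```python
from collections import Counter


def count_substrings(sequences, max_len):
    """Count every substring of length 1..max_len across all sequences."""
    counts = Counter()
    for seq in sequences:
        for length in range(1, max_len + 1):
            for i in range(len(seq) - length + 1):
                counts[seq[i:i + length]] += 1
    return counts


def expected_count(word, counts, k):
    """Expected occurrences of `word` under an order-k Markov model,
    estimated from substring counts (requires 1 <= k < len(word))."""
    if not 1 <= k < len(word):
        raise ValueError("order k must satisfy 1 <= k < len(word)")
    numerator = 1.0
    for i in range(len(word) - k):          # overlapping (k+1)-mers
        numerator *= counts[word[i:i + k + 1]]
    denominator = 1.0
    for i in range(1, len(word) - k):       # inner k-mers
        denominator *= counts[word[i:i + k]]
    return numerator / denominator if denominator else 0.0


# Toy input, not data from the talk.
counts = count_substrings(["ACGACGT"], 3)
expected_acg = expected_count("ACG", counts, 1)  # N(AC) * N(CG) / N(C)
```

Every scored word triggers several substring lookups, which is why the enumeration structure has to answer them cheaply.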
Enumeration Approaches
Total number of nucleotides in the input sequences: n
Word length: m
Radix Tree
Time/Space Complexity: $O(nm)$ • dependent on m
Time Complexity for lookups: $O(m)$ • fast lookup
Suffix Tree
Time/Space Complexity: $O(n)$ • independent of m • significant constant factor for space complexity
Time Complexity for lookups: $O(m)$ • fast lookup
Suffix Array
Time/Space Complexity: $O(n)$ • much lower constant factor for space complexity
Time Complexity for lookups: $O(\log n)$ • more expensive for lookups
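The suffix-array trade-off can be illustrated with a small sketch: the structure is just a sorted list of suffix start positions, and a word is located by binary search. The construction below is deliberately naive (it sorts full suffixes) and is only meant to show the lookup; it is not the tool's implementation, and both function names are assumptions. In this sketch each of the O(log n) search steps compares up to m characters.

```python
def build_suffix_array(text):
    """Naive construction for illustration: sort suffix start positions
    by the suffix they begin. Real tools use far faster builders."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, word):
    """Return sorted start positions of `word` using two binary searches
    over the suffix array (lower and upper bound of the matching block)."""
    m = len(word)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < word:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > word
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= word:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Toy input, not data from the talk.
text = "ACGACGT"
sa = build_suffix_array(text)
hits = find_occurrences(text, sa, "ACG")
```

All suffixes that start with the query word sit next to each other in the array, so the word count is simply the width of the block found by the two searches.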
Parallelization
Distributed Solution
• Tasks executed on different nodes
• Distributed Memory
Multi-core Solution
• Tasks executed on different cores
• Shared Memory
Parallel Software Properties
Shared Memory: OpenMP parallelization
• Simple, portable directives that compile even on unsupported architectures
• Simple loops are run in parallel on multiple processors
Distributed Memory: MPI parallelization
• Hardware optimizations and support for Fortran, C/C++, Perl
• Each node is provided a subset of the data to process
• "Smart" division of tasks is key
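The slides do not describe how the data is divided among nodes. One plausible reading, suggested by the later mention of prefixes, is to split the work by word prefix and balance the load. The sketch below assumes exactly that: it counts how many word start positions fall under each prefix and hands the heaviest prefixes to the least-loaded node first. The function names and the greedy balancing rule are illustrative assumptions, and no MPI calls are made; in a real run each entry of the result would correspond to one MPI rank.

```python
from collections import Counter
from itertools import product


def prefix_loads(sequences, prefix_len):
    """Count how many positions in the input start with each DNA prefix."""
    loads = Counter({"".join(p): 0 for p in product("ACGT", repeat=prefix_len)})
    for seq in sequences:
        for i in range(len(seq) - prefix_len + 1):
            prefix = seq[i:i + prefix_len]
            if prefix in loads:  # skip prefixes with ambiguity codes such as N
                loads[prefix] += 1
    return loads


def assign_prefixes(loads, n_nodes):
    """Greedy balancing: the heaviest prefix goes to the least-loaded node."""
    nodes = [{"prefixes": [], "load": 0} for _ in range(n_nodes)]
    for prefix, load in sorted(loads.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(nodes, key=lambda node: node["load"])
        target["prefixes"].append(prefix)
        target["load"] += load
    return nodes


# Toy input, not data from the talk.
loads = prefix_loads(["AAAACCGT"], 1)
plan = assign_prefixes(loads, 2)
```

With this skewed toy input, one node receives only the prefix "A" while the other takes "C", "G", and "T", so both end up with the same load.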
Results
Analyzed the Arabidopsis thaliana genome
• All segments and the full genome
• Multiple word lengths (1-20)
• Searched top words against AGRIS (repository of known elements in A. thaliana)
Characterized the Framework
• Speedup and runtime analysis
• Radix Trie and Suffix Tree
Memory Requirements for Arabidopsis thaliana
[Figure: Size Requirements of Arabidopsis thaliana Segments. x-axis: word length (1 to 20); y-axis: size (GB), 0 to 25. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome]
Conducted at the Ohio Supercomputer Center
Execution Times for Arabidopsis thaliana
[Figure: Time Requirements of Arabidopsis thaliana Segments. x-axis: word length (1 to 20); y-axis: time (s), 0 to 30000. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome]
Speedup, efficiency and timing using A. thaliana core promoter sequences.
Analyzing the Parallel System
[Figure: Shared and Distributed Memory Speedup. Panels: Radix Trie, Suffix Tree]
[Figure: Shared and Distributed Memory Efficiency. Panels: Radix Trie, Suffix Tree]
[Figure: Shared and Distributed Memory Performance. Panels: Radix Trie, Suffix Tree]
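For readers unfamiliar with the metrics in these plots, the sketch below uses the conventional definitions: speedup on p cores is the single-core time divided by the p-core time, and efficiency is speedup divided by p. The slides do not spell out the definitions used, so treat this as the usual convention. The timings in the example are made up to exercise the function and are not measurements from the talk.

```python
def speedup_and_efficiency(timings):
    """timings maps core count p -> wall-clock seconds and must include p = 1.
    Returns p -> (speedup, efficiency)."""
    baseline = timings[1]
    report = {}
    for p in sorted(timings):
        speedup = baseline / timings[p]
        report[p] = (speedup, speedup / p)
    return report


# Hypothetical timings for illustration only.
example = speedup_and_efficiency({1: 100.0, 2: 50.0, 4: 40.0})
```

An efficiency below 1 on four cores, as in this made-up case, means part of the added hardware is idle or spent on overhead.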
Scoring Speedup Contribution
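[Figure: Radix Tree Scoring Speedups. x-axis: cores (1->2, 1->4, 1->8); y-axis: speedup (0 to 6). Series: 5nt, 10nt, 20nt, 50nt, 100nt]
[Figure: Suffix Tree Scoring Speedups. x-axis: cores (1->2, 1->4, 1->8); y-axis: speedup (0 to 6). Series: 5nt, 10nt, 20nt, 50nt, 100nt]
[Figure: Radix Tree Runtime Speedups. x-axis: cores (1->2, 1->4, 1->8); y-axis: speedup (0 to 6). Series: 5nt, 10nt, 20nt, 50nt, 100nt]
[Figure: Suffix Tree Runtime Speedups. x-axis: cores (1->2, 1->4, 1->8); y-axis: speedup (0 to 4). Series: 5nt, 10nt, 20nt, 50nt, 100nt]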
Results: Pushing the limits
Summary
Parallel
• Shared memory on single nodes
• Distributed memory on 5 nodes
High-throughput
• Full genomes analyzed in under 5 hours
• Long word lengths: genomes approaching 20, smaller files often 100 or greater
Powerful analysis
• Detailed statistics
• Degeneracy via clustering
• Additional post-processing (scatter plots, logos, etc.)
Future Work
Post-processing
• Word distributions
• Sequence clustering
• GBrowse visualization
Further parallelization
• Within a node
• Greater distributed abstraction (more prefixes)
QUESTIONS?