Lichtenberg bosc2010 wordseeker

CONCURRENT BIOINFORMATICS SOFTWARE FOR DISCOVERING GENOME-WIDE PATTERNS AND WORD-BASED GENOMIC SIGNATURES

Jens Lichtenberg, Kyle Kurz, Xiaoyu Liang, Rami Al-Ouran, Lev Neiman, Lee Nau, Joshua Welch, Edwin Jacox, Thomas Bitterman, Klaus Ecker, Laura Elnitski, Frank Drews, Stephen Lee, Lonnie Welch

The WordSeeker Tool

Enumeration: Suffix Tree and Suffix Array, Radix Tree

Scoring

Clustering: Sequence Clustering, Word Clustering

Conservation Analysis: PhastCons Score Extraction

Location Distributions

Sequence Coverage: minimum set of words necessary to cover all sequences

Module Discovery Enumerative

Ranger Markup Basic Functional Elements
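The Sequence Coverage feature asks for a small set of words that together cover all input sequences. Finding a true minimum is the set cover problem, which is NP-hard in general, so the sketch below uses the standard greedy approximation instead. The source does not say which algorithm WordSeeker uses; the function name `greedy_word_cover` and the toy sequences are hypothetical.

```python
def greedy_word_cover(sequences, word_length):
    """Greedy approximation of a minimum set of words covering all sequences.

    A word "covers" a sequence if it occurs in it at least once.
    """
    # Map each word to the set of sequence indices that contain it.
    covers = {}
    for idx, seq in enumerate(sequences):
        for i in range(len(seq) - word_length + 1):
            covers.setdefault(seq[i:i + word_length], set()).add(idx)

    # Only sequences long enough to contain a word can ever be covered.
    uncovered = {i for i, s in enumerate(sequences) if len(s) >= word_length}
    chosen = []
    while uncovered:
        # Pick the word covering the most still-uncovered sequences
        # (ties broken alphabetically so the result is deterministic).
        best = min(covers, key=lambda w: (-len(covers[w] & uncovered), w))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


# Hypothetical toy input, not data from the talk.
toy = ["ACGTAC", "TTACGA", "GGGTTT"]
cover = greedy_word_cover(toy, 3)
```

The greedy choice does not guarantee the smallest possible set, but every sequence that is long enough to contain a word ends up covered by at least one chosen word.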

Software Properties

• Google Code repository: http://code.google.com/p/word-seeker/
• License: GNU General Public License v3
• Internal documentation: Doxygen code generator
• SVN for command-line access: http://word-seeker.googlecode.com/svn/trunk

Requirements
• G++ compiler version 4.1* or higher
• OpenMP headers
• MPI environment (distributed version)

For visualizations and other post-processing steps
• Perl 5.8.8
• TFBS (http://tfbs.genereg.net/)
• Set::Scalar, LWP::Simple, Parallel::ForkManager, GD::Graph::bars, Algorithm::Cluster, Bio::SeqIO (all available through CPAN)
• Gnuplot version 4.2 or higher


Need for a Scalable Approach

Word Enumeration Module

Represents a set of biological input sequences based on some data structure

Keeps track of words, word counts, sequence counts, and word locations

Need to keep the data persistent in memory
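As a minimal sketch of the bookkeeping the enumeration module performs, the example below records word counts, sequence counts, and word locations for a single word length. It uses a plain Python dictionary as a stand-in for the radix tree or suffix tree used by the tool; the function name `enumerate_words` and the record layout are assumptions made for illustration.

```python
from collections import defaultdict


def enumerate_words(sequences, word_length):
    """Return word -> {"count", "seq_count", "locations"} for one word length."""
    table = defaultdict(lambda: {"count": 0, "seq_count": 0, "locations": []})
    for seq_id, seq in enumerate(sequences):
        seen_here = set()
        for pos in range(len(seq) - word_length + 1):
            word = seq[pos:pos + word_length]
            entry = table[word]
            entry["count"] += 1                       # total occurrences
            entry["locations"].append((seq_id, pos))  # 0-based positions
            if word not in seen_here:                 # count each sequence once
                entry["seq_count"] += 1
                seen_here.add(word)
    return dict(table)


# Hypothetical toy input, not data from the talk.
words = enumerate_words(["ACGACG", "TACG"], 3)
```

Here `words["ACG"]` holds a count of 3, a sequence count of 2, and the locations `(0, 0)`, `(0, 3)`, and `(1, 1)`. Storing every location is what makes memory grow with the input, which is why the choice of data structure matters.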

Word Scoring Module

Determines statistical scores for each word

Frequent lookups for words and substrings of words

Example: a Markov model of order m requires lookups for all substrings of up to length m for all words

Keep space complexity low

Keep time complexity for lookups low
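To show why scoring needs so many substring lookups, here is a sketch of the expected count of a word under a Markov model, using the common maximum-likelihood estimator (product of the counts of the overlapping substrings of length order + 1, divided by the product of the counts of the interior substrings of length order). This may not be the exact statistic WordSeeker computes. The parameter is called `order` here to avoid clashing with the word length, and the helper `count_occurrences` rescans the sequences for clarity where the tool would query its index.

```python
def count_occurrences(sequences, word):
    """Count (possibly overlapping) occurrences of word across all sequences."""
    total = 0
    for seq in sequences:
        start = seq.find(word)
        while start != -1:
            total += 1
            start = seq.find(word, start + 1)
    return total


def markov_expected_count(sequences, word, order):
    """Expected count of `word` under a Markov model of the given order (>= 1)."""
    if order < 1 or order >= len(word):
        raise ValueError("order must satisfy 1 <= order < len(word)")
    numerator = 1.0
    denominator = 1.0
    last = len(word) - order  # number of substrings of length order + 1
    for i in range(last):
        # One lookup per substring of length order + 1.
        numerator *= count_occurrences(sequences, word[i:i + order + 1])
    for i in range(1, last):
        # One lookup per interior substring of length order.
        denominator *= count_occurrences(sequences, word[i:i + order])
    return numerator / denominator if denominator else 0.0


# Hypothetical toy input, not data from the talk.
expected = markov_expected_count(["ACGACGT"], "ACG", 1)
```

For the toy input, "AC" and "CG" each occur twice and "C" occurs twice, so the expected count of "ACG" is 2 * 2 / 2 = 2.0. Every scored word triggers several such lookups, so lookup cost dominates scoring time.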

Enumeration Approaches

Total number of nucleotides in the input sequences: n

Word length: m

Radix Tree

Time/Space Complexity: • dependent on m

Time Complexity for lookups: • fast lookup

Suffix Tree

Time/Space Complexity: • independent of m • significant constant factor for space complexity

Time Complexity for lookups: • fast lookup

Suffix Array

Time/Space Complexity: • much lower constant factor for space complexity

Time Complexity for lookups: • more expensive for lookups

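Radix Tree: time/space complexity $O(n \cdot m)$, lookup $O(m)$

Suffix Tree: time/space complexity $O(n)$, lookup $O(m)$

Suffix Array: time/space complexity $O(n)$, lookup $O(\log n)$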
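The logarithmic lookup cost of the suffix array comes from binary search over the sorted suffixes. The sketch below illustrates this for a single sequence; it builds the array by naive sorting (not a linear-time construction), and the function names are hypothetical. Each probe compares up to m characters, so the cost is logarithmic in n only if the word length is treated as a constant.

```python
def build_suffix_array(text):
    """Naive suffix array: sort suffix start positions lexicographically.

    Sorting full suffixes is simple but not linear time; linear-time
    construction algorithms exist and are omitted here.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_word(text, suffix_array, word):
    """Count occurrences of word via two binary searches over the suffix array."""
    m = len(word)

    def first_index(strictly_greater):
        # Find the first suffix whose m-character prefix is >= word
        # (or > word when strictly_greater is True).
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[suffix_array[mid]:suffix_array[mid] + m]
            if prefix < word or (strictly_greater and prefix == word):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # All suffixes starting with word form one contiguous block.
    return first_index(True) - first_index(False)


# Hypothetical toy input, not data from the talk.
text = "ACGACGT"
sa = build_suffix_array(text)
```

With this toy input, `count_word(text, sa, "ACG")` returns 2. Handling several input sequences would need an extra step, such as concatenating them with separator characters, which is left out here.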

Parallelization

Distributed Solution: tasks executed on different nodes (Distributed Memory)

Multi-core Solution: tasks executed on different cores (Shared Memory)

Parallel Software Properties

Shared Memory: OpenMP parallelization

Simple, portable directives that compile even on unsupported architectures

Simple loops are run in parallel on multiple processors

Distributed Memory: MPI parallelization

Hardware optimizations and support for Fortran, C/C++, Perl

Each node is provided a subset of the data to process

“Smart” division of tasks is key
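One plausible reading of the "smart" division is to split the word space by prefix, so that each node owns a disjoint set of words and the results can be merged without conflicts. This is an assumption: the source does not describe the actual scheme. The sketch simulates the nodes sequentially in plain Python instead of using MPI, and all function names are hypothetical.

```python
from collections import Counter
from itertools import product


def assign_prefixes(num_nodes, prefix_length, alphabet="ACGT"):
    """Deal all prefixes of the given length round-robin to the nodes."""
    prefixes = ["".join(p) for p in product(alphabet, repeat=prefix_length)]
    return [prefixes[node::num_nodes] for node in range(num_nodes)]


def count_for_node(sequences, word_length, my_prefixes):
    """Work done by one node: count only words starting with its prefixes."""
    mine = tuple(my_prefixes)
    counts = Counter()
    for seq in sequences:
        for pos in range(len(seq) - word_length + 1):
            word = seq[pos:pos + word_length]
            if word.startswith(mine):
                counts[word] += 1
    return counts


def distributed_count(sequences, word_length, num_nodes, prefix_length=1):
    """Simulate the nodes sequentially and merge their disjoint results."""
    if prefix_length > word_length:
        raise ValueError("prefix_length must not exceed word_length")
    merged = Counter()
    for my_prefixes in assign_prefixes(num_nodes, prefix_length):
        merged.update(count_for_node(sequences, word_length, my_prefixes))
    return merged


# Hypothetical toy input, not data from the talk.
seqs = ["ACGACGT", "TTACG"]
totals = distributed_count(seqs, 3, num_nodes=3)
```

Longer prefixes give more, smaller tasks, which makes it easier to balance load across nodes. Words containing characters outside the assumed alphabet (for example N) match no prefix and are skipped in this sketch.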

Results

Analyzed the Arabidopsis thaliana genome
• All segments and the full genome
• Multiple word lengths (1-20)
• Searched top words against AGRIS (repository of known elements in A. thaliana)

Characterized the Framework
• Speedup and runtime analysis
• Radix Trie and Suffix Tree

Memory Requirements for Arabidopsis thaliana

[Figure: Size Requirements of Arabidopsis thaliana Segments. x-axis: word length (1-20); y-axis: size (GB), 0-25. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome.]

Conducted at the Ohio Supercomputer Center

Execution Times for Arabidopsis thaliana

[Figure: Time Requirements of Arabidopsis thaliana Segments. x-axis: word length (1-20); y-axis: time (s), 0-30000. Series: Intronic Regions, 5' UTR, 3' UTR, Coding Sequences, Core Promoters, Proximal Promoters, Distal Promoters, Full Genome.]

Speedup, efficiency and timing using A. thaliana core promoter sequences.
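For reference, the usual definitions behind these measurements are speedup S_p = T_1 / T_p and efficiency E_p = S_p / p, where T_1 is the single-worker runtime and T_p the runtime on p workers. The sketch below assumes the talk uses these standard definitions; the timings are hypothetical and are not measurements from the talk.

```python
def speedup(serial_time, parallel_time):
    """Speedup S_p = T_1 / T_p."""
    return serial_time / parallel_time


def efficiency(serial_time, parallel_time, workers):
    """Efficiency E_p = S_p / p; 1.0 means perfect linear scaling."""
    return speedup(serial_time, parallel_time) / workers


# Hypothetical timings in seconds keyed by core count, not data from the talk.
timings = {1: 100.0, 2: 50.0, 4: 40.0, 8: 25.0}
report = {
    cores: (speedup(timings[1], seconds), efficiency(timings[1], seconds, cores))
    for cores, seconds in timings.items()
}
```

In this made-up example, 8 cores give a speedup of 4.0 and an efficiency of 0.5, meaning half of the added capacity is lost to overhead or serial work.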

Analyzing the Parallel System

Shared and Distributed Memory Speedup

Radix Trie Suffix Tree

Shared and Distributed Memory Efficiency


Shared and Distributed Memory Performance


Scoring Speedup Contribution

[Figures (runtime and scoring): Radix Tree Scoring Speedups, Suffix Tree Scoring Speedups, Radix Tree Runtime Speedups, Suffix Tree Runtime Speedups. x-axis: cores (1->2, 1->4, 1->8); y-axis: speedup. Legend: 5nt, 10nt, 20nt, 50nt, 100nt.]

Results: Pushing the limits

Summary

Parallel
• Shared memory on single nodes
• Distributed memory on 5 nodes

High-throughput
• Full genomes analyzed in under 5 hours
• Long word lengths: genomes approaching 20, smaller files often 100 or greater

Powerful analysis
• Detailed statistics
• Degeneracy via clustering
• Additional post-processing (scatter plots, logos, etc.)

Future Work

Post-processing
• Word distributions
• Sequence clustering
• Gbrowse visualization

Further parallelization
• Within a node
• Greater distributed abstraction (more prefixes)

QUESTIONS?
