ClawHMMer: Streaming Protein Search
Daniel Horn, Mike Houston, Pat Hanrahan
Stanford University
Data proliferation in biology
Protein databases:
- SWISS-PROT (200,000 annotated proteins)
- NCBI Nonredundant DB (2.5 million proteins)
- UniProtKB/TrEMBL (2.3 million proteins)
DNA sequence databases:
- DDBJ Japan (42 million genes)
- NCBI GenBank (10 million sequence records)
Protein matching
Problem: different proteins with the same function share common amino acid patterns
Solution: fuzzy string matching
[Figure: side-by-side placement of similar sequences, e.g. FRNTAP vs. FRNFTP, which match despite substitutions]
Current methods
BLAST [Altschul '90]
+ Fast
- Ad hoc gap penalties; matches depend on the penalty
HMMer [Krogh '93]
+ Robust probabilistic model
- Slow
New architectures
IBM Cell; graphics hardware
New architectures
Traditional CPUs:
- P4 3.0 GHz = 12 GFLOPS
- G5 2.5 GHz = 10 GFLOPS
New architectures offer more compute power:
- ATI X1800 XT = 120 GFLOPS
- NVIDIA G70 = 176 GFLOPS
- Cell = 250 GFLOPS
However, these new architectures are parallel.
Fragment Processor
- Texture memory (256-512 MB), 128+ constant values, 8 interpolated values
- 32 temporary read/write registers
- 16 floats written sequentially to texture memory (eventually)
Fragment Processors
- GPUs have 16-24 fragment processors; need ~10,000 threads executing in parallel to hide memory latency
- Processors share a program counter (PC); newer hardware has more (1 per 4 processors)
[Figure: fragment processors with read/write registers sharing one PC, reading texture memory (256-512 MB) and 128+ constant values, writing data sequentially to texture memory]
Brook for GPUs
- Abstraction for streaming architectures [Buck '04]; encourages data-parallel applications
- Streams: arrays of data
- Kernels: small functions that can run on a fragment processor; no access to global variables; fixed number of stream inputs, outputs, and lookup tables
- Mapping operation: runs a kernel on each element of an input stream and writes to an output stream
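Brook's model can be sketched in a few lines of Python (hypothetical names; Brook itself extends C and compiles kernels to shaders): a stream is an array, and a kernel is a side-effect-free function mapped over every element.

```python
# Minimal sketch of Brook's streaming model: a "stream" is an array of
# records, a "kernel" is a small function with no access to globals,
# and the mapping operation runs the kernel on every stream element.

def map_kernel(kernel, *input_streams):
    """Run `kernel` on corresponding elements of the input streams,
    producing an output stream."""
    return [kernel(*elems) for elems in zip(*input_streams)]

# Example kernel: scale-and-bias. Constants are passed explicitly,
# mirroring Brook's fixed inputs (kernels cannot read global variables).
def scale_bias(x, scale=2.0, bias=1.0):
    return scale * x + bias

data = [0.0, 1.0, 2.0, 3.0]        # input stream
out = map_kernel(scale_bias, data)  # mapping operation → [1.0, 3.0, 5.0, 7.0]
```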
Challenges
- The current search algorithm doesn't map well
- Requires lots of fine-grained parallelism: 10,000 kernel invocations in parallel
- Requires high arithmetic intensity: an 8:1 compute-to-fetch ratio
- Fragment processors have limited resources, so kernels need to be small
Hidden Markov Model
- A state machine represents unobservable world state (here: Rainy, Sunny)
- Transition probabilities:
  Rainy→Rainy = .7, Rainy→Sunny = .3
  Sunny→Sunny = .6, Sunny→Rainy = .4
- Observation (emission) probabilities:
  P(coat|Rainy) = .4, P(nothing|Rainy) = .1, P(umbrella|Rainy) = .5
  P(coat|Sunny) = .3, P(nothing|Sunny) = .6, P(umbrella|Sunny) = .1
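Written out as plain tables, the slide's weather HMM is just two dictionaries, and the "unobservable state" idea can be demonstrated by sampling observations from it (the `sample` helper is illustrative, not part of the talk):

```python
import random

# The two-state weather HMM from the slide.
transition = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission = {
    "Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
    "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1},
}

def sample(n, start="Sunny", seed=0):
    """Generate n observations; the hidden state sequence drives which
    emission table is used but is never returned to the observer."""
    rng = random.Random(seed)
    state, obs = start, []
    for _ in range(n):
        obs.append(rng.choices(list(emission[state]),
                               weights=list(emission[state].values()))[0])
        state = rng.choices(list(transition[state]),
                            weights=list(transition[state].values()))[0]
    return obs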
Viterbi algorithm
- Given an HMM and an observation sequence, Viterbi finds the most likely path through world states and the probability of that path
- Dynamic programming: fills a probability table, one entry per observation per state, taking the max over incoming transitions
- Each entry is the probability that the state machine takes the most likely path to the current state while emitting the observation sequence seen so far
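The table fill plus traceback described above can be made runnable on the slide's weather HMM (a sketch written for clarity, not speed):

```python
# Viterbi: fill the probability table per observation and per state,
# taking the max over incoming transitions, then trace back the path.

transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}

def viterbi(observations, start):
    states = list(start)
    prob, back = dict(start), []
    for obs in observations:
        nxt, ptr = {}, {}
        for s in states:
            # max over incoming transitions into s
            pred = max(states, key=lambda p: prob[p] * transition[p][s])
            nxt[s] = emission[s][obs] * prob[pred] * transition[pred][s]
            ptr[s] = pred                 # remember the winner for traceback
        prob, back = nxt, back + [ptr]
    best = max(states, key=lambda s: prob[s])
    path = [best]
    for ptr in reversed(back):            # walk the back-pointers
        path.insert(0, ptr[path[0]])
    return path, prob[best]

path, p = viterbi(["umbrella", "coat", "nothing"],
                  {"Rainy": 0.0, "Sunny": 1.0})
# path recovers the talk's answer: Sunny, Rainy, Rainy, Sunny
```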
Viterbi Example
Observations: Umbrella, Coat, Nothing
Initial probabilities: Sunny = 1.0, Rainy = 0.0
Day 1 (Umbrella):
  Rainy: p(Umbrella|Rainy) * max(0*.7, 1*.4) = .5 * .4 = .20
  Sunny: p(Umbrella|Sunny) * max(0*.3, 1*.6) = .1 * .6 = .06
[Figure: trellis of Rainy/Sunny states from Sunday through Wednesday]
Viterbi Example (continued)
Day 2 (Coat):
  Rainy: .4 * max(.2*.7, .06*.4) = .056
  Sunny: .3 * max(.2*.3, .06*.6) = .018
Day 3 (Nothing):
  Rainy: .1 * max(.056*.7, .018*.4) = .00392
  Sunny: .6 * max(.056*.3, .018*.6) = .01008
Viterbi Traceback Example
During the forward pass, record which predecessor won each max():
  Day 1, Rainy: 1.0*.4 > 0.0*.7 (came from Sunny)
  Day 2, Rainy: .2*.7 > .06*.4 (came from Rainy)
  Day 3, Sunny: .056*.3 > .018*.6 (came from Rainy)
Viterbi path: "Sunny, Rainy, Rainy, Sunny"
HMM Training
- Proteins in a family are aligned; the proposed alignment trains the probabilistic model
- Each alignment column becomes a state: Align (match), Delete, or Insertion
- Match states learn emission probabilities (Emit "F", Emit "R", Emit "N", Emit "T", Emit "P"); inserted residues such as "G" train the insertion states
[Figure: family members such as FRNTP, FRNTTP, and FGRNTTP stacked under a proposed alignment, with columns feeding the model's states]
Probabilistic model: HMM
[Figure: profile HMM with Start and End states; match states emitting "F", "R", "N", "T", "P"; "Junk" (insert) states; and Delete states, which emit no observation]
HMMer Searches a Database
- Run the Viterbi algorithm on each database protein against the query model
- If the probability score is high, perform a traceback to recover the alignment
[Figure: a database of protein sequences (FRNTTPFG, FNNNNPP, ...) streamed through the query profile HMM]
Viterbi Algorithm Code
for obs in sequence:                      //observation loop
  for s in states:                        //next-state loop
    for predecessor in states:            //transition loop
      if max-likelihood path to s came from predecessor:
        fill table[obs][s]
Parallelizing Viterbi Over a Database
parallel_for observ_seq in database:      //DB loop
  for obs in observ_seq:                  //observation loop
    for s in states:                      //state loop
      for predecessor in states:          //transition loop
        if max-likelihood path to s came from predecessor:
          fill table[observ_seq][obs][s]
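The outer parallel_for maps directly onto an executor. A sketch (hypothetical names; `score_sequence` stands in for the per-sequence Viterbi scorer, and threads merely illustrate the mapping — a GPU, cluster, or process pool supplies the real parallelism):

```python
from concurrent.futures import ThreadPoolExecutor

def score_sequence(seq):
    # Stand-in for the per-sequence Viterbi scorer; sums character
    # codes so the sketch stays self-contained.
    return sum(ord(c) for c in seq)

def search_database(database, workers=4):
    # parallel_for observ_seq in database: //DB loop
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_sequence, database))  # order preserved

scores = search_database(["FRNTTP", "FG", "FRNT"])
```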
Flexible parallelism
- Pulling the database loop inside reduces the size and state of the inner-loop code
- More flexibility in choosing:
  Coarse-grained parallelism: data parallel
  Fine-grained parallelism: SIMD instructions
Loop Reordering
for i = 1 to length(longest observ_seq):  //observation loop
  parallel_for observ_seq in database:    //DB loop
    // Kernel:
    for s in states:                      //next-state loop
      for pred in states:                 //transition loop
        if max-likelihood path to s came from pred:
          fill table[observ_seq][i][s]
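With the observation loop outermost, each step can update every database sequence at once, so the DB loop becomes a vector axis — which is what the GPU kernel exploits. A numpy sketch, using the weather HMM as a stand-in for a profile HMM and assuming (for simplicity) that all sequences have equal length:

```python
import numpy as np

T = np.array([[0.7, 0.3],                 # rows = from (Rainy, Sunny)
              [0.4, 0.6]])                # cols = to   (Rainy, Sunny)
E = {"umbrella": np.array([0.5, 0.1]),    # p(obs|Rainy), p(obs|Sunny)
     "coat":     np.array([0.4, 0.3]),
     "nothing":  np.array([0.1, 0.6])}

def viterbi_scores(database):
    prob = np.tile([0.0, 1.0], (len(database), 1))    # start in Sunny
    for i in range(len(database[0])):                 # observation loop
        # "kernel", vectorized over all sequences at observation i:
        incoming = prob[:, :, None] * T[None, :, :]   # (seq, from, to)
        emit = np.stack([E[seq[i]] for seq in database])
        prob = emit * incoming.max(axis=1)            # max over incoming
    return prob.max(axis=1)                           # best score per seq

scores = viterbi_scores([("umbrella", "coat", "nothing")] * 3)
```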
Further Optimizations
- Unroll the transition loop completely
- Unroll the state loop by 3: gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once
[Figure: profile HMM with Start/End, a row of Insert states, match states F R N T P, and a row of Delete states]
Further State Loop Unrolling Increases Arithmetic Intensity
- Process neighboring probability computations together (multi-triplet)
- Neighbors read similar inputs and can use each other's intermediate results
- Fewer fetches + the same math = higher arithmetic intensity
[Figure: profile HMM diagram, highlighting neighboring triplets]
Implementation on Graphics Hardware (GPU)
- Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck '04], downloaded as shader programs
- Amino acid sequences: read-only textures
- State probabilities: read/write textures
- Transition probabilities: constant registers
- Emission probabilities: small lookup table
- If necessary, traceback is performed on the CPU
Experimental Methodology
Tested systems:
- Sean Eddy's HMMer 2.3.2 [Eddy '03] on a 3.0 GHz Pentium 4
- Erik Lindahl's AltiVec v2 implementation [Lindahl '05] on a 2.4 GHz PowerPC G5
- Our Brook implementation on an ATI X1800 XT and an NVIDIA 7800 GT
Data set:
- NCBI Nonredundant database (2.5 million proteins)
- Representative HMMs from the Pfam database
Adenovirus Performance
- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25x better than HMMer on x86
- Careful optimization may improve x86 in the future
[Chart: relative performance on the Adenovirus HMM for P4, G5, X1800 XT, and 7800 GTX]
Performance
- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25x better than HMMer on x86
- Careful optimization may improve x86 in the future
[Chart: relative performance for P4, G5, X1800 XT, and 7800 GTX across the Colipase, Connexin50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs]
Performance analysis
- Comparison on an ATI X800 (17.5 GB/s, 8.3 Gops): the 1-triplet kernel is bandwidth limited; the 4-triplet kernel is instruction-issue limited
- Conclusions: the 4-triplet kernel achieves 90% of peak performance; unrolling the kernel is important
Scales Linearly
- Results on a 16-node cluster: Gig-E interconnect, with an ATI Radeon 9800 and dual CPUs per node
- Split the 2.5-million-protein database across nodes; scalability is linear
- Need 10,000 proteins per node to keep the GPUs busy
[Chart: relative performance vs. number of nodes (0-16), showing linear scaling]
HMMer-Pfam
- Solves the inverse problem of HMMer-search: given a protein of interest, search a database of HMMs
- Problem for HMMer-pfam: it requires a traceback for every HMM, and there is not enough memory on the GPU
- The GPU requires 10,000 elements in parallel, which requires storage for 10,000 * num_states * sequence_length floating-point probabilities
- Need an architecture with finer-grained parallelism
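A back-of-the-envelope version of that storage argument (the model size and sequence length below are hypothetical, chosen only to illustrate the scale; the 10,000-element figure and the formula are the slide's):

```python
# Traceback storage for 10,000 parallel queries:
# parallel * num_states * sequence_length floats.

def traceback_storage_bytes(parallel=10_000, num_states=600,
                            seq_length=400, bytes_per_float=4):
    return parallel * num_states * seq_length * bytes_per_float

GPU_MEMORY = 512 * 2**20        # high end of the 256-512 MB quoted earlier
needed = traceback_storage_bytes()
# needed is ~9.6 GB, far beyond the GPU's texture memory
```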
Summary
- Streaming version of HMMer, implemented on GPUs
- Performs well: 2-3x a PowerPC G5, 20-30x a Pentium 4
- Well suited to other parallel architectures, such as Cell
Acknowledgements
Funding: Sheila Vaidya & John Johnstone @ LLNL
People: Erik Lindahl, ATI folks, NVIDIA folks, Sean Eddy, Ian & the Brook team
Questions?
danielrh @ graphics.stanford.edu mhouston @ graphics.stanford.edu
Importance of Streaming HMMer Algorithm
- Scales to many types of parallel architectures (Cell, multicore, clusters, ...) by adjusting the width of the parallel_for loops
Cell:
- Addressable read/write memory and finer-grained parallelism
- Pfam becomes possible: tens, not thousands, of parallel DB queries
- On Cell, 16-64 database entries could be processed at once; each kernel could process all states and return only a probability
General HMM Performance
- With an arbitrary transition matrix, Viterbi is exactly like repeated matrix-vector multiplication: each matrix is the transition matrix scaled by the emission probabilities, and max() supplants the add() operation of the matrix-vector analogy
- The Brook for GPUs paper shows matrix-vector multiply is good for streaming architectures: each element is touched once, streaming over large quantities of memory, and the GPU memory controller is designed for this access mode
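The analogy can be checked in a few lines of numpy: one Viterbi step over the weather HMM, computed both as the explicit max-over-predecessors loop and as a matrix-vector product in which max() replaces the usual sum (the (max, *) semiring).

```python
import numpy as np

T = np.array([[0.7, 0.3],          # rows = from (Rainy, Sunny)
              [0.4, 0.6]])         # cols = to   (Rainy, Sunny)
e_umbrella = np.array([0.5, 0.1])  # p(umbrella|Rainy), p(umbrella|Sunny)
v = np.array([0.0, 1.0])           # current state probabilities

# Matrix form: transition matrix scaled by the emission probabilities,
# with max() supplanting the add() of an ordinary matrix-vector product.
M = T * e_umbrella[None, :]
step_matvec = (v[:, None] * M).max(axis=0)

# Explicit loop form of the same step, for comparison.
step_loop = np.array([e_umbrella[j] * max(v[i] * T[i, j] for i in range(2))
                      for j in range(2)])
```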
Filling Probability Table
[Figure: the probability table in texture memory. Database sequences A-D ("FRNF...", "FRNT...", "NTNF...", "TGPF...") are interleaved so that, for each observation column, the states of sequences A, B, C, D are stored adjacently (example probabilities .8 .2 .3 .1)]
Competing with CPU Cache
- The probability at each state requires a max() over incoming state probabilities: those live in memory on graphics hardware but in L1 cache on a CPU
- Only the final probability must be written to memory, and only tiny database entries must be read
- Luckily, the 12-way version is instruction limited, not bandwidth limited, on the GPU; the CPU is also instruction limited
Viterbi Example
Observations: Umbrella, Coat, Nothing
Initial probabilities: Rainy = 0, Sunny = 1.0
Day 1 (Umbrella):
  Rainy: p_emit(Rainy, Umbrella) * max(0*.7, 1*.4) = .5 * max(0, .4) = .2
  Sunny: p_emit(Sunny, Umbrella) * max(0*.3, 1*.6) = .1 * max(0, .6) = .06
Viterbi Example (continued)
Day 2 (Coat):
  Rainy: p_emit(Rainy, Coat) * max(.2*.7, .06*.4) = .4 * max(.14, .024) = .056
  Sunny: p_emit(Sunny, Coat) * max(.2*.3, .06*.6) = .3 * max(.06, .036) = .018
Viterbi Example (continued)
Day 3 (Nothing):
  Rainy: p_emit(Rainy, Nothing) * max(.056*.7, .018*.4) = .1 * max(.0392, .0072) = .00392
  Sunny: p_emit(Sunny, Nothing) * max(.056*.3, .018*.6) = .6 * max(.0168, .0108) = .01008
Performance
- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25x better than HMMer on x86
- Careful optimization may improve x86 in the future
[Chart: relative performance for P4, G5, X1800 XT, X800 XT PE, 9800 XT, 7800 GTX, and 6800 Ultra across the Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs]
Transition Probabilities
- Represent the chance of the world changing state; trained from example data (proteins in the family)
- Example trained values for one column: 25% chance of a correct match (.25), 50% chance of an insertion (.5), 25% chance of a deletion (.25)
[Figure: family sequences (FRNTP, FRNTTP, FGRNTTP, ...) above the profile HMM (Start/End, match states F R N T P, Junk and Del states), annotated with the trained transition probabilities]
HMMer Scores a Protein
- HMMer takes a database protein and the model and returns a probability score
- It also returns the probable (most likely) alignment of the protein to the model
[Figure: the sequence FRNTTPFG fed through the profile HMM, producing a probability score and its most likely alignment]
Probabilistic Model: HMM
- A specialized layout of states, trained so that it randomly generates amino-acid chains likely to belong to the desired family
- States drawn with a circle emit no observation
[Figure: Start/End states, a row of Insert states, match states F R N T P, and a row of Delete states]
GPU Limitations
- Requires high-granularity parallelism (10,000 elements); no room to store the entire state table
- Must download the database and read back results
- Few registers: the only fast read/write storage; this limits how many state triplets a kernel may process, as does the number of kernel outputs