ClawHMMer: Streaming Protein Search
Daniel Horn, Mike Houston, Pat Hanrahan
Stanford University


Data proliferation in biology

Protein databases:
- SWISS-PROT (200,000 annotated proteins)
- NCBI Nonredundant DB (2.5 million proteins)
- UniProtKB/TrEMBL (2.3 million proteins)

DNA sequence databases:
- DDBJ Japan (42 million genes)
- NCBI GenBank (10 million sequence records)

Protein matching

Problem: different proteins with the same function share common amino acid patterns.

Solution: fuzzy string matching.

[Figure: similar sequences such as FRNTAP and FRNFTP placed side by side; the common amino acid pattern is visible from the side-by-side placement]

Current methods

BLAST [Altschul '90]
+ Fast
- Ad hoc gap penalties: matches depend on the penalty

HMMer [Krogh '93]
+ Robust probabilistic model
- Slow

New architectures

[Images: IBM Cell and graphics hardware]

New architectures

Traditional CPUs:
- P4 3.0 GHz = 12 GFlops
- G5 2.5 GHz = 10 GFlops

New architectures offer more compute power:
- ATI X1800XT = 120 GFlops
- NVIDIA G70 = 176 GFlops
- Cell = 250 GFlops

However, new architectures are parallel.

Fragment Processor

[Diagram: a single fragment processor with 32 temporary read/write registers, fed by 128+ constant values and 8 interpolated values, reading from texture memory (256-512 MB); up to 16 floats are eventually written sequentially back to texture memory]

Fragment Processors

- GPUs have 16-24 fragment processors
- Need ~10,000 threads executing in parallel to hide memory latency
- Shared program counter (PC); newer hardware has more (1 per 4 processors)

[Diagram: several fragment processors, each with its own R/W registers, sharing one PC, 128+ constant values, and interpolants, all reading texture memory (256-512 MB); data is written sequentially back to texture memory]

Brook for GPUs

- Abstraction for streaming architectures [Buck '04]; encourages data-parallel applications
- Streams: arrays of data
- Kernels: small functions that can run on a fragment processor, with no access to global variables and a fixed number of stream inputs, outputs, and lookup tables
- Mapping operation: runs a kernel on each element of an input stream and writes to an output stream
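To make the model concrete, here is a minimal Python sketch of the stream/kernel/map idea (illustrative names only; Brook itself is a C-like language, and this is not its actual API):

    # A kernel: a small pure function with no access to globals.
    # The scalar 'a' plays the role of a constant register.
    def saxpy_kernel(x, y, a=2.0):
        return a * x + y

    # Streams are just arrays of data.
    x_stream = [1.0, 2.0, 3.0]
    y_stream = [4.0, 5.0, 6.0]

    # The mapping operation runs the kernel on each element of the
    # input streams and writes the results to an output stream.
    out_stream = list(map(saxpy_kernel, x_stream, y_stream))
    print(out_stream)  # [6.0, 9.0, 12.0]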

Challenges

- The current search algorithm doesn't map well
- Requires lots of fine-grained parallelism: 10,000 kernel invocations in parallel
- Requires high arithmetic intensity: an 8:1 compute-to-fetch ratio
- Fragment processors have limited resources, so kernels need to be small

Hidden Markov Model

- A state machine represents unobservable world state
- Transition probabilities: Rainy→Rainy = .7, Rainy→Sunny = .3, Sunny→Rainy = .4, Sunny→Sunny = .6
- Observation probabilities:
  P(coat|Rainy) = .4      P(coat|Sunny) = .3
  P(nothing|Rainy) = .1   P(nothing|Sunny) = .6
  P(umbrella|Rainy) = .5  P(umbrella|Sunny) = .1
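The slide's weather model is small enough to write out directly; here is a Python transcription (reused by the Viterbi sketches later in this deck):

    # States of the hidden world.
    states = ["Rainy", "Sunny"]

    # trans[s][t] = P(tomorrow is t | today is s)
    trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
             "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}

    # emit[s][o] = P(observing o | state s)
    emit = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}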

Viterbi Algorithm

- Given an HMM and an observation sequence, Viterbi finds the most likely path through world state and the probability of that path
- Dynamic programming: fills a probability table per observation, per state, taking a max over incoming transitions:
  table[i][s] = P(obs_i | s) · max over pred of ( table[i-1][pred] · P(pred→s) )
- table[i][s] is the probability of the state machine taking the most likely path ending in state s while emitting the first i observations

Viterbi Example

Observations: Umbrella, Coat, Nothing
Initial column (Sunday): Rainy = 0.0, Sunny = 1.0

Monday (Umbrella):
  Rainy = p(Umbrella|Rainy) · max(0.0·.7, 1.0·.4) = .5·.4 = .2
  Sunny = p(Umbrella|Sunny) · max(0.0·.3, 1.0·.6) = .1·.6 = .06

[Figure: trellis of Rainy/Sunny states over Sunday-Wednesday with transition probabilities .7, .3, .4, .6 on the arrows]

Viterbi Example (continued)

Tuesday (Coat): Rainy = .056, Sunny = .018
Wednesday (Nothing): Rainy = .00392, Sunny = .01008

Traceback comparisons:
  .01008 > .00392, so the path ends in Sunny
  .018·.6 < .056·.3, so Wednesday's Sunny came from Tuesday's Rainy
  .06·.4 < .2·.7, so Tuesday's Rainy came from Monday's Rainy

Viterbi Traceback Example

  1.0·.4 > 0.0·.7, so Monday's Rainy came from Sunday's Sunny

Viterbi path: "Sunny, Rainy, Rainy, Sunny"

HMM Training

[Figure: proteins in a family and a proposed alignment are combined into the probabilistic model: aligned columns become Align states (emitting "F", "R", "N", "T", "P"), extra residues such as "G" become Insertion states, and missing residues become Delete states]

Probabilistic Model: HMM

[Diagram: profile HMM with Start and End states, match states for F, R, N, T, P (each emits its amino acid, e.g. Emit "F"), "Junk" insert states, and Delete states with no emission]

HMMer Searches a Database

- Score the probabilistic model (the query) against every protein in the database
- If the probability score is high: perform traceback for the alignment

[Figure: the profile HMM query compared against a database of amino-acid sequences]

Viterbi Algorithm Code

for obs in sequence:                      // observation loop
    for s in states:                      // next-state loop
        for pred in states:               // transition loop
            if max-likelihood path to s came from pred:
                fill table[obs][s]
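A runnable Python transcription of this loop nest, including the traceback, on the weather model from earlier; it reproduces the example's path "Sunny, Rainy, Rainy, Sunny":

    trans = {"Rainy": {"Rainy": .7, "Sunny": .3},
             "Sunny": {"Rainy": .4, "Sunny": .6}}
    emit = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
            "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}

    def viterbi(sequence, init):
        states = list(init)
        table = [init]  # table[i][s]: probability of the best path ending in s
        back = []       # back[i][s]: best predecessor of s at step i
        for obs in sequence:                          # observation loop
            row, ptr = {}, {}
            for s in states:                          # next-state loop
                # transition loop: pick the max-likelihood predecessor
                best = max(states, key=lambda p: table[-1][p] * trans[p][s])
                row[s] = emit[s][obs] * table[-1][best] * trans[best][s]
                ptr[s] = best
            table.append(row)
            back.append(ptr)
        # traceback from the most likely final state
        path = [max(states, key=lambda s: table[-1][s])]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        path.reverse()
        return path, max(table[-1].values())

    print(viterbi(["umbrella", "coat", "nothing"],
                  {"Rainy": 0.0, "Sunny": 1.0}))
    # ~ (['Sunny', 'Rainy', 'Rainy', 'Sunny'], 0.01008)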

Parallelizing Viterbi Over a Database

parallel_for observ_seq in database:      // DB loop
    for obs in observ_seq:                // observation loop
        for s in states:                  // next-state loop
            for pred in states:           // transition loop
                if max-likelihood path to s came from pred:
                    fill table[observ_seq][obs][s]
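As a sketch, the parallel_for over the database maps directly onto a data-parallel map; here with Python's multiprocessing, assuming the viterbi() function from the previous sketch is defined in the same file:

    from multiprocessing import Pool

    database = [["umbrella", "coat", "nothing"],   # toy database of
                ["coat", "coat", "umbrella"]]      # observation sequences
    init = {"Rainy": 0.0, "Sunny": 1.0}

    if __name__ == "__main__":
        with Pool() as pool:
            # parallel_for observ_seq in database
            results = pool.starmap(viterbi, [(seq, init) for seq in database])
        for seq, (path, prob) in zip(database, results):
            print(seq, "->", path, prob)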

Flexible Parallelism

- Pulling the database loop inward reduces the size and state of the inner-loop code
- More flexibility in choosing:
  - coarse-grained parallelism (data parallel)
  - fine-grained parallelism (SIMD instructions)

Loop Reordering

for i = 1 to length(longest observ_seq):      // observation loop
    for s in states:                          // next-state loop
        parallel_for observ_seq in database:  // DB loop
            Kernel:
                for pred in states:           // transition loop
                    if max-likelihood path to s came from pred:
                        fill table[observ_seq][i][s]

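A numpy sketch of the reordered nest on the same weather model: the database dimension becomes the vectorized, data-parallel axis (the real kernels are Brook shader programs; here all toy sequences already have the length of the longest one):

    import numpy as np

    # Observations encoded as 0 = umbrella, 1 = coat, 2 = nothing,
    # padded to the length of the longest sequence.
    obs = np.array([[0, 1, 2],
                    [1, 1, 0],
                    [2, 0, 1],
                    [0, 0, 2]])
    n_seqs, max_len = obs.shape
    n_states = 2                                   # 0 = Rainy, 1 = Sunny

    trans = np.array([[.7, .3],                    # trans[pred, s]
                      [.4, .6]])
    emit = np.array([[.5, .4, .1],                 # emit[s, observation]
                     [.1, .3, .6]])

    table = np.zeros((n_seqs, max_len + 1, n_states))
    table[:, 0, 1] = 1.0                           # start: Rainy = 0, Sunny = 1

    for i in range(max_len):                       # observation loop
        for s in range(n_states):                  # next-state loop
            # kernel: transition max and emission, for every sequence at once
            incoming = table[:, i, :] * trans[:, s]
            table[:, i + 1, s] = emit[s, obs[:, i]] * incoming.max(axis=1)

    print(table[:, -1, :])   # per-sequence final Rainy/Sunny probabilities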

Further Optimizations: HMM

- Unroll the transition loop completely
- Unroll the state loop by 3: gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once

[Diagram: profile HMM with Start/End, Insert states, match states F R N T P, and Delete states; one Match/Insert/Delete triplet highlighted]

Further State-Loop Unrolling Increases Arithmetic Intensity

- Neighboring probability computations (multi-triplet) read similar inputs and use each other's intermediate results
- Fewer fetches + same math = higher arithmetic intensity

[Diagram: the same profile HMM with several adjacent triplets processed by one kernel]
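For reference, the three states in a triplet are one node's Match, Insert, and Delete states. A Python sketch of a single triplet update, following the standard profile-HMM recurrences (Durbin et al.); this illustrates the dependency structure, not ClawHMMer's actual kernel:

    def triplet_update(k, i, obs_i, M, I, D, t, e_match, e_insert):
        # M, I, D are DP tables indexed [i][k]; t holds transition
        # probabilities; e_match/e_insert are emission probabilities.
        # Match state: consumes observation obs_i, comes from node k-1.
        M[i][k] = e_match[k][obs_i] * max(M[i-1][k-1] * t["MM"][k-1],
                                          I[i-1][k-1] * t["IM"][k-1],
                                          D[i-1][k-1] * t["DM"][k-1])
        # Insert state: consumes obs_i, stays at node k.
        I[i][k] = e_insert[k][obs_i] * max(M[i-1][k] * t["MI"][k],
                                           I[i-1][k] * t["II"][k])
        # Delete state: emits nothing, comes from node k-1 at the same i.
        D[i][k] = max(M[i][k-1] * t["MD"][k-1],
                      D[i][k-1] * t["DD"][k-1])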

Implementation on Graphics Hardware (GPU)

- Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck '04], downloaded as shader programs
- Amino acid sequences → read-only textures
- State probabilities → read/write textures
- Transition probabilities → constant registers
- Emission probabilities → small lookup table
- If necessary, traceback is performed on the CPU
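Since each DP column depends only on the previous one, the read/write state-probability textures can be double buffered: read one, write the other, then swap. A Python sketch of that buffering pattern on the weather model (not the shader code itself):

    import numpy as np

    trans = np.array([[.7, .3], [.4, .6]])         # constant "registers"
    emit = np.array([[.5, .4, .1], [.1, .3, .6]])  # small lookup table
    obs = np.array([0, 1, 2])                      # umbrella, coat, nothing

    prev_col = np.array([0.0, 1.0])                # "texture" read this pass
    cur_col = np.empty(2)                          # "texture" written this pass

    for o in obs:                                  # one pass per observation
        for s in range(2):
            cur_col[s] = emit[s, o] * (prev_col * trans[:, s]).max()
        prev_col, cur_col = cur_col, prev_col      # swap the two buffers

    print(prev_col)   # final column, ~[0.00392, 0.01008]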

Experimental Methodology

Tested systems:
- Sean Eddy's HMMer 2.3.2 [Eddy '03] on a Pentium 4 3.0 GHz
- Erik Lindahl's AltiVec v2 implementation [Lindahl '05] on a 2.4 GHz PowerPC G5
- Our Brook implementation on an ATI X1800 XT and an NVIDIA 7800 GTX

Data set:
- NCBI Nonredundant database (2.5 million proteins)
- Representative HMMs from the Pfam database

Adenovirus Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on the Adenovirus HMM for P4, G5, X1800XT, and 7800 GTX]

Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on the Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs for P4, G5, X1800XT, and 7800 GTX]

Performance Analysis

Comparison [ATI X800; 17.5 GB/s; 8.3 Gops]:
- 1-triplet kernel: bandwidth limited
- 4-triplet kernel: instruction-issue limited

Conclusions:
- The 4-triplet kernel achieves 90% of peak performance
- Unrolling the kernel is important

Scales Linearly

- Results on a 16-node cluster (Gig-E interconnect; ATI Radeon 9800 and dual CPUs per node)
- Split the 2.5-million-protein database across the nodes: linear scalability
- Need 10,000 proteins per node to keep the GPUs busy

[Chart: relative performance vs. number of nodes (0-16), scaling linearly]

HMMer-Pfam

- Solves the inverse problem of HMMer-search: given a protein of interest, search a database of HMMs
- Problem for HMMer-pfam: it requires a traceback for every HMM, and there is not enough memory on the GPU
- The GPU requires 10,000 elements in parallel, which requires storage for 10,000 × num_states × sequence_length floating-point probabilities
- Need an architecture with finer-grained parallelism
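For scale: with, say, a 200-state model and 400-residue sequences (illustrative sizes, not from the slides), that is 10,000 × 200 × 400 = 800 million single-precision values, about 3.2 GB, versus the 256-512 MB of texture memory noted earlier.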

Summary

- Streaming version of HMMer, implemented on GPUs, performs well: 2-3x a PowerPC G5, 20-30x a Pentium 4
- Well suited to other parallel architectures, such as Cell

Acknowledgements

- Funding: Sheila Vaidya & John Johnstone @ LLNL
- People: Erik Lindahl, ATI folks, NVIDIA folks, Sean Eddy, Ian & the Brook team

Questions?
danielrh @ graphics.stanford.edu
mhouston @ graphics.stanford.edu

Importance of the Streaming HMMer Algorithm

- Scales to many types of parallel architectures by adjusting the width of the parallel_for loops: Cell, multicore, clusters, ...
- Cell: addressable read/write memory and fine-grained parallelism
  - Pfam possibility: tens, not thousands, of parallel DB queries
  - On Cell, 16-64 database entries could be processed at once; each kernel could process all states
  - Only returns a probability

General HMM Performance

- An arbitrary transition matrix makes Viterbi exactly like repeated matrix-vector multiplication: each matrix is the transition matrix scaled by the emission probability, and max() supplants add() in the matrix-vector analogy
- The Brook for GPUs paper shows matrix-vector multiply is good for streaming architectures: each element is touched once while streaming over large quantities of memory, and the GPU memory controller is designed for this access mode

Filling the Probability Table

[Figure: a sequence database (Sequence A: F R N F ..., Sequence B: F R N T ..., Sequence C: N T N F ..., Sequence D: T G P F ...) is processed in lockstep; the table holds per-sequence probabilities (e.g. .8 .2 .3 .1), with state values laid out interleaved across sequences: A B C D, A B C D, ...]

Competing with the CPU Cache

- The probability at each state requires a max() over incoming state probabilities, which sit in memory on graphics hardware but in L1 cache on the CPU
- Only the final probability must be written to memory, and only tiny database entries must be read
- Luckily, the 12-way version is instruction limited, not bandwidth limited, on the GPU; the CPU is also instruction limited

Viterbi Example (detail)

Observations: Umbrella, Coat, Nothing
Initial: Rainy = 0, Sunny = 1.0

First observation (Umbrella):
  Rainy = p_emit(Rainy, Umbrella) · max(0·.7, 1·.4) = .5·max(0, .4) = .2
  Sunny = p_emit(Sunny, Umbrella) · max(0·.3, 1·.6) = .1·max(0, .6) = .06

Viterbi Example (detail, continued)

Second observation (Coat):
  Rainy = p_emit(Rainy, Coat) · max(.2·.7, .06·.4) = .4·max(.14, .024) = .056
  Sunny = p_emit(Sunny, Coat) · max(.2·.3, .06·.6) = .3·max(.06, .036) = .018

Viterbi Example (detail, continued)

Third observation (Nothing):
  Rainy = p_emit(Rainy, Nothing) · max(.056·.7, .018·.4) = .1·max(.0392, .0072) = .00392
  Sunny = p_emit(Sunny, Nothing) · max(.056·.3, .018·.6) = .6·max(.0168, .0108) = .01008

Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 for P4, G5, X1800XT, X800XT PE, 9800XT, 7800 GTX, and 6800 Ultra]

Transition Probabilities

- Represent the chance of the world changing state
- Trained from example data (the proteins in the family)

[Figure: training sequences aligned to the profile HMM give, for example, a 25% chance of a correct match (.25), a 50% chance of an insertion (.5), and a 25% chance of a deletion (.25)]

HMMer Scores a Protein

- HMMer takes the probabilistic model and a database protein and returns a probability score
- It also returns the probable alignment of the protein to the model (the most likely alignment)

[Figure: HMMer + database protein → probability score and most likely alignment to the model]

Probabilistic Model: HMM

- Specialized layout of states
- Trained to randomly generate amino-acid chains that are likely to belong to the desired family
- States drawn with a circle emit no observation

[Diagram: profile HMM with Start/End, match states F R N T P, "Junk" insert states, and Delete states]

GPU Limitations

- High-granularity parallelism: needs 10,000 elements in flight
- No room to store the entire state table
- Download of the database and readback of results
- Few registers, the only mechanism for fast read/write storage: limits how many state triplets a kernel may process
- Limited number of kernel outputs
