ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.


Page 1: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

ClawHMMer
Streaming protein search

Daniel Horn, Mike Houston, Pat Hanrahan

Stanford University

Page 2: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Data proliferation in biology

Protein databases
• SWISS-PROT (200,000 annotated proteins)
• NCBI Nonredundant DB (2.5 million proteins)
• UniProtKB/TrEMBL (2.3 million proteins)

DNA sequences
• DDBJ Japan (42 million genes)
• NCBI GenBank (10 million sequence records)

Page 3: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Protein matching

Problem
• Different proteins, same function
  • Common amino acid patterns

Solution
• Fuzzy string matching

[Figure: FRNTAP is similar to FRNFTP, as can be seen from side-by-side placement.]

Page 4: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Current methods

BLAST [Altschul ‘90]
- Ad hoc gap penalties
  Matches depend on penalty
+ Fast

HMMer [Krogh ‘93]
+ Robust probabilistic model
- Slow

Page 5: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

New architectures

[Images: IBM Cell; graphics hardware]

Page 6: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

New architectures

Traditional CPUs
• P4 3.0 GHz = 12 GFlops
• G5 2.5 GHz = 10 GFlops

New architectures offer more compute power
• ATI X1800XT = 120 GFlops
• NVIDIA G70 = 176 GFlops
• Cell = 250 GFlops

However, new architectures are parallel

Page 7: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Fragment Processor

[Diagram: a fragment processor with 32 temporary R/W registers reads from texture memory (256-512 MB), 128+ constant values, and 8 interpolated values; it outputs 16 floats, which are eventually written sequentially to texture memory.]

Page 8: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Fragment Processors

Have 16-24 fragment processors

Need 10,000 threads executing in parallel
• Hides memory latency

Shared program counter (PC)
• Newer hardware has more (1 per 4 processors)

[Diagram: four fragment processors, each with its own R/W registers, share one PC; they read from texture memory (256-512 MB), 128+ constant values, and interpolants, and their data is written sequentially to texture memory.]

Page 9: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Brook for GPUs

Abstraction for streaming architectures [Buck ‘04]
• Encourages data-parallel applications

Streams
• Arrays of data

Kernel
• Small function that can run on a fragment processor
• No access to global variables
• Fixed number of stream inputs, outputs, lookup tables

Mapping operation
• Runs kernel on each element of input data stream
• Writes to output stream
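The stream, kernel, and mapping ideas above can be sketched in a few lines of Python. This is only an illustration of the abstraction, not Brook syntax; `map_kernel` and `scale_and_add` are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence


def map_kernel(kernel: Callable[..., float], *streams: Sequence[float]) -> List[float]:
    """Run `kernel` once per element position of the input streams."""
    if len({len(s) for s in streams}) != 1:
        raise ValueError("all input streams must have the same length")
    # Each invocation sees only its own elements, so the invocations are
    # independent and could run in parallel on a streaming processor.
    return [kernel(*elems) for elems in zip(*streams)]


def scale_and_add(a: float, b: float) -> float:
    """Example kernel: a small pure function that touches no global state."""
    return 2.0 * a + b


# Two input streams of equal length, one output stream.
out = map_kernel(scale_and_add, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

Because the kernel has no access to globals and a fixed set of inputs and outputs, the order in which elements are processed does not affect the output stream.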

Page 10: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Challenges

Current search algorithm doesn’t map well
• Lots of fine-grained parallelism is required
  • 10,000 kernel invocations in parallel
• High arithmetic intensity is required
  • 8:1 compute:fetch ratio

Fragment processors have limited resources
• Kernels need to be small

Page 11: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Hidden Markov Model

State machine represents unobservable world state

[Diagram: two states, Rainy and Sunny]

Page 12: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Transition probabilities

Rainy → Rainy: .7
Rainy → Sunny: .3
Sunny → Rainy: .4
Sunny → Sunny: .6

Page 13: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Observations

P(coat|Rainy)=.4

P(nothing|Rainy)=.1

P(umbrella|Rainy)=.5

P(coat|Sunny)=.3

P(nothing|Sunny)=.6

P(umbrella|Sunny)=.1

[Diagram: Rainy and Sunny states with transition probabilities .7, .3, .4, .6]
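For reference, the weather model can be written down as plain Python tables. This is a sketch: the names `STATES`, `TRANSITION`, `EMISSION`, and `rows_are_distributions` are introduced here, and the assignment of the four transition values to arcs is read off the Viterbi example slides (the expressions max(0*.7, 1*.4) for Rainy and max(0*.3, 1*.6) for Sunny).

```python
STATES = ("Rainy", "Sunny")

# TRANSITION[prev][next]: probability of moving from state prev to state next.
TRANSITION = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}

# EMISSION[state][observation]: P(observation | state), as listed on the slide.
EMISSION = {
    "Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
    "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1},
}


def rows_are_distributions(table, tol=1e-9):
    """Check that every row of a probability table sums to 1."""
    return all(abs(sum(row.values()) - 1.0) < tol for row in table.values())
```

Each row sums to 1, which is a quick sanity check that the arcs and emissions were transcribed consistently.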

Page 14: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi algorithm

Given an HMM and an observation sequence, Viterbi finds
• The most likely path through world state
• The probability of this path

Dynamic programming: fills a probability table
• Per observation
• Per state
• Max over incoming transitions

Each table entry is the probability of the state machine taking the most likely path to the current state while emitting the given observation sequence.

Page 15: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing (Monday through Wednesday; Sunday is the starting day)

Sunday (start): Rainy = 0.0, Sunny = 1.0

Monday (Umbrella):
Rainy = p(Umbrella|Rainy) · max(0 · .7, 1 · .4) = .5 · .4 = .2
Sunny = p(Umbrella|Sunny) · max(0 · .3, 1 · .6) = .1 · .6 = .06

Page 16: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing

         Sunday   Monday      Tuesday   Wednesday
                  (Umbrella)  (Coat)    (Nothing)
Rainy    0.0      .20         .056      .00392
Sunny    1.0      .06         .018      .01008

Viterbi Traceback Example

Wednesday: .01008 > .00392, so Sunny
Tuesday: .018 · .6 < .056 · .3, so Rainy
Monday: .06 · .4 < .2 · .7, so Rainy
Sunday: 1.0 · .4 > 0.0 · .7, so Sunny

Viterbi path: “Sunny, Rainy, Rainy, Sunny”

Page 17: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

HMM Training

[Figure: the proteins in the family (short example sequences over F, R, N, T, P, and G) are placed in a proposed alignment whose columns are marked Align, Delete, and Insertion; the proposed alignment is then turned into the probabilistic model.]

Page 18: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

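Probabilistic model: HMM

[Diagram: Start and End states joined by a chain of match states F, R, N, T, P, with Junk states and Del states alongside. Labels such as Emit “F”, Emit “R”, Emit “N”, Emit “T”, Emit “P”, and Emit “G” mark emitting states; No Emission marks the non-emitting ones.]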

Page 19: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

HMMer Searches a Database

If the probability score is high: perform traceback for alignment

[Figure: the query (the probabilistic model, with Start, End, match states F R N T P, Junk states, and Del states) is scored against each protein sequence in the database.]

Page 20: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Algorithm Code

for obs in sequence:                  // observation loop
    for s in states:                  // next state loop
        for predecessor in states:    // transition loop
            if max likelihood path to s came from predecessor:
                fill table[obs][s]
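A minimal runnable version of the three nested loops, applied to the Rainy/Sunny example, might look like the following. It is a sketch for a small HMM in which every state emits; the function name `viterbi` and the dictionary layout are choices made here, not HMMer's implementation.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return (probability, path) of the most likely state sequence.

    table[t][s] is the probability of the best path that ends in state s
    after emitting the first t observations.
    """
    table = [dict(initial)]
    back = []
    for obs in observations:                      # observation loop
        row, ptr = {}, {}
        for s in states:                          # next state loop
            # transition loop: find the predecessor with the best incoming path
            best = max(states, key=lambda p: table[-1][p] * transition[p][s])
            row[s] = emission[s][obs] * table[-1][best] * transition[best][s]
            ptr[s] = best
        table.append(row)
        back.append(ptr)
    # Traceback: start at the most likely final state and follow the pointers.
    last = max(states, key=lambda s: table[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return table[-1][last], path


states = ("Rainy", "Sunny")
initial = {"Rainy": 0.0, "Sunny": 1.0}            # Sunday is known to be sunny
transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}

prob, path = viterbi(["umbrella", "coat", "nothing"],
                     states, initial, transition, emission)
```

For the observations Umbrella, Coat, Nothing, `path` comes out as Sunny, Rainy, Rainy, Sunny, matching the traceback slide.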

Page 21: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Parallelizing Viterbi Over a Database

parallel_for observ_seq in database:          // DB loop
    for obs in observ_seq:                    // observation loop
        for s in states:                      // state loop
            for predecessor in states:        // transition loop
                if max likelihood path to s came from predecessor:
                    fill table[observ_seq][obs][s]

Page 22: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Flexible parallelism

Pull database loop inside
• Reduces size/state of inner loop code
• More flexibility in choosing
  • Coarse-grained parallelism: data parallel
  • Fine-grained parallelism: SIMD instructions

Page 23: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Loop Reordering

for i = 1 to length(longest observ_seq):        // observation loop
    for s in states:                            // next state loop
        parallel_for observ_seq in database:    // DB loop
            // Kernel
            for pred in states:                 // transition loop
                if max likelihood path to s came from pred:
                    fill table[observ_seq][i][s]
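The reordering can be sketched in Python with the observation loop outermost and the database loop just outside the kernel. This is an illustration under simplifying assumptions: every state emits, only the final score is kept (no traceback), and the names `kernel` and `score_database` are introduced here. The database loop is written as an ordinary `for`, but its iterations are independent, so it stands in for the `parallel_for`.

```python
def kernel(prev_row, obs, s, states, transition, emission):
    """Transition loop for one (sequence, state) pair: max over predecessors."""
    return emission[s][obs] * max(prev_row[p] * transition[p][s] for p in states)


def score_database(database, states, initial, transition, emission):
    """Score every sequence in lock step, observation loop outermost."""
    rows = [dict(initial) for _ in database]      # current table row per sequence
    new_rows = [dict(r) for r in rows]
    longest = max(len(seq) for seq in database)
    for i in range(longest):                      # observation loop
        for s in states:                          # next state loop
            for n, seq in enumerate(database):    # DB loop (parallel_for)
                if i < len(seq):                  # shorter sequences are finished
                    new_rows[n][s] = kernel(rows[n], seq[i], s,
                                            states, transition, emission)
        rows = [dict(r) for r in new_rows]
    return [max(r.values()) for r in rows]


states = ("Rainy", "Sunny")
initial = {"Rainy": 0.0, "Sunny": 1.0}
transition = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
              "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emission = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}

scores = score_database([["umbrella", "coat", "nothing"], ["umbrella"]],
                        states, initial, transition, emission)
```

Only one table row per sequence is live at any time, which is what keeps the per-invocation state small when no traceback is needed.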


Page 25: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Further Optimizations

• Unroll transition loop completely
• Unroll state loop by 3
  • Gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once

[Diagram: HMM with Start, End, and Junk states, Insert states, match states F R N T P, and Del states]

Page 26: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Further State Loop Unrolling Increases Arithmetic Intensity

Neighboring probability computations (multi-triplet)
• Read similar inputs
• Use each other’s intermediate results

Fewer fetches + same math = higher arithmetic intensity

[Diagram: HMM with Start, End, and Junk states, Insert states, match states F R N T P, and Del states]

Page 27: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Implementation on Graphics Hardware (GPU)

Wrote 1-triplet and 4-triplet state processing kernels in Brook for GPUs [Buck ‘04]
• Downloaded as shader program
• Amino acid sequences: read-only textures
• State probabilities: read/write textures
• Transition probabilities: constant registers
• Emission probabilities: small lookup table

If necessary, traceback is performed on the CPU

Page 28: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Experimental Methodology

Tested systems
• Sean Eddy’s HMMer 2.3.2 [Eddy ‘03]
  • Pentium IV 3.0 GHz
• Erik Lindahl’s AltiVec v2 implementation [Lindahl ‘05]
  • 2.4 GHz G5 PowerPC
• Our Brook implementation
  • ATI X1800 XT
  • NVIDIA 7800GT

Data set
• NCBI Nonredundant database with 2.5 million proteins
• Representative HMMs from the Pfam database

Page 29: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Adenovirus Performance

• ATI hardware outperforms PowerPC by a factor of 2+
• Graphics hardware performs 10-25 times better than HMMer on x86
• Careful optimization may improve x86 in the future

[Chart: relative performance on the Adenovirus HMM for P4, G5, X1800XT, and 7800 GTX; vertical axis from 0 to 26]

Page 30: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Performance

• ATI hardware outperforms PowerPC by a factor of 2+
• Graphics hardware performs 10-25 times better than HMMer on x86
• Careful optimization may improve x86 in the future

[Chart: relative performance on the Colipase, Connexin50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs for P4, G5, X1800XT, and 7800 GTX; vertical axis from 0 to 40]

Page 31: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Performance analysis

Comparison [ATI X800; 17.5 GB/s; 8.3 Gops]
• 1-triplet: bandwidth limited
• 4-triplet: instruction issue limited

Conclusions
• 4-triplet kernel achieves 90% of peak performance
• Unrolling the kernel is important

Page 32: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Scales Linearly

Results on 16-node cluster
• Gig-E interconnect
• ATI Radeon 9800
• Dual CPU

Split 2.5 million protein database across nodes
• Linear scalability
• Need 10,000 proteins per cluster to keep GPUs busy

[Chart: relative performance (0 to 16) versus number of nodes (0 to 16)]

Page 33: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

HMMer-Pfam

Solves the inverse problem of HMMer-search
• Given a protein of interest, search a database of HMMs

Problem for HMMer-pfam
• Requires traceback for every HMM
• Not enough memory on GPU
  • GPU requires 10,000 elements in parallel
  • Requires storage for 10,000 × num_states × sequence_length floating point probabilities

Need architecture with finer parallelism
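To see why the full tables do not fit, the storage product can be evaluated for some illustrative sizes. The values of `num_states` and `sequence_length` below are hypothetical, chosen only to show the order of magnitude, and 4-byte floats are assumed; the 512 MB figure is the upper end of the texture memory range quoted on the fragment processor slides.

```python
def table_bytes(parallel_elements, num_states, sequence_length, bytes_per_float=4):
    """Bytes needed to keep a full Viterbi table for every parallel element."""
    return parallel_elements * num_states * sequence_length * bytes_per_float


# Hypothetical sizes for illustration only.
needed = table_bytes(parallel_elements=10_000, num_states=300, sequence_length=400)
gpu_memory = 512 * 1024**2        # 512 MB of texture memory
fits = needed <= gpu_memory
```

With these assumed sizes the tables need about 4.8 GB, roughly nine times the available texture memory, so keeping every table for traceback is not feasible at this degree of parallelism.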

Page 34: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Summary

Streaming version of HMMer
• Implemented on GPUs
• Performs well
  • 2-3x a PowerPC G5
  • 20-30x a Pentium 4

Well suited for other parallel architectures
• Cell

Page 35: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Acknowledgements

Funding
• Sheila Vaidya & John Johnstone @ LLNL

People
• Erik Lindahl
• ATI folks
• NVIDIA folks
• Sean Eddy
• Ian & Brook Team

Page 36: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Questions?

danielrh @ graphics.stanford.edu
mhouston @ graphics.stanford.edu

Page 37: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Importance of Streaming HMMer Algorithm

Scales to many types of parallel architectures
• By adjusting the width of the parallel_for loops
• Cell, multicore, clusters, …

Cell
• Addressable read/write memory
• Fine-grained parallelism
  • Pfam possibility
• Tens, not thousands, of parallel DB queries
  • On Cell, 16-64 database entries could be processed
• Each kernel could process all states
  • Only returns a probability

Page 38: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

General HMM Performance

Arbitrary transition matrix
• Exactly like repeated matrix-vector multiply
• Each matrix is the transition matrix, scaled by emission probability
• Max() supplants the add() operation in the matrix-vector analogy

The Brook for GPUs paper shows matrix-vector multiply
• Good for streaming architectures
• Each element is touched once
• Streaming over large quantities of memory
  • GPU memory controller designed for this mode
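The matrix-vector analogy can be made concrete with a small sketch: one Viterbi step is a matrix-vector product in which max() replaces the sum, followed by a per-state scaling by the emission probability. The function name `max_times_step` and the list-of-lists layout are choices made here; the numbers are the Rainy/Sunny example with states indexed 0 = Rainy, 1 = Sunny.

```python
def max_times_step(prev, transition, emission_col):
    """One Viterbi step as a matrix-vector product with max in place of add.

    prev[p]           probability of the best path ending in state p
    transition[p][s]  probability of moving from state p to state s
    emission_col[s]   probability that state s emits the current observation
    """
    n = len(prev)
    return [
        emission_col[s] * max(prev[p] * transition[p][s] for p in range(n))
        for s in range(n)
    ]


T = [[0.7, 0.3],      # from Rainy: to Rainy, to Sunny
     [0.4, 0.6]]      # from Sunny: to Rainy, to Sunny
v0 = [0.0, 1.0]       # Sunday: Sunny with certainty
v1 = max_times_step(v0, T, [0.5, 0.1])   # Monday, observation Umbrella
```

Each entry of the transition matrix is read exactly once per step, which is the access pattern the slide describes as good for streaming architectures.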

Page 39: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Filling Probability Table

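[Figure: a sequence database holding Sequence A through Sequence D (for example F R N F …, F R N T …, N T N F …, T G P F …) next to the probability table being filled (sample entries .8 .2 .3 .1); table entries are interleaved across sequences, States: A B C D A B C D A B C D …]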

Page 40: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Competing with CPU Cache

On graphics hardware, the probability at each state requires a max() over incoming state probabilities held in memory

On the CPU, incoming state probabilities are in L1 cache
• Only the final probability must be written to memory
• Only tiny database entries must be read

Luckily, the 12-way version is instruction limited, not bandwidth limited, on the GPU
• The CPU is also instruction limited

Page 41: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing

Start: Rainy = 0, Sunny = 1.0

Rainy = pemit(Rainy,Umbrella) * max(0*.7, 1*.4) = .5 * max(0, .4) = .2
Sunny = pemit(Sunny,Umbrella) * max(0*.3, 1*.6) = .1 * max(0, .6) = .06

Page 42: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing

Previous column: Rainy = .2, Sunny = .06

Rainy = pemit(Rainy,Coat) * max(.2*.7, .06*.4) = .4 * max(.14, .024) = .056
Sunny = pemit(Sunny,Coat) * max(.2*.3, .06*.6) = .3 * max(.06, .036) = .018

Page 43: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing

Previous column: Rainy = .056, Sunny = .018

Rainy = pemit(Rainy,Nothing) * max(.056*.7, .018*.4) = .1 * max(.0392, .0072) = .00392
Sunny = pemit(Sunny,Nothing) * max(.056*.3, .018*.6) = .6 * max(.0168, .0108) = .01008

Page 44: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Performance

• ATI hardware outperforms PowerPC by a factor of 2+
• Graphics hardware performs 10-25 times better than HMMer on x86
• Careful optimization may improve x86 in the future

[Chart: relative performance on the Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs for P4, G5, X1800XT, X800XTPE, 9800XT, 7800 GTX, and 6800 Ultra; vertical axis from 0 to 40]

Page 45: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Transition Probabilities

• Represent the chance of the world changing state
• Trained from example data

[Figure: the proteins in the family next to the HMM (Start, End, match states F R N T P, Junk states, Del states); the outgoing transitions of one state are labelled 25% chance of correct match (.25), 50% chance of insertion (.5), and 25% chance of deletion (.25).]

Page 46: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

HMMer Scores a Protein

Returns probable alignment of protein to model

[Figure: HMMer takes the probabilistic model plus a database protein and produces a probability score plus the most likely alignment to the model.]

Page 47: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Probabilistic Model: HMM

• Specialized layout of states
• Trained to randomly generate amino-acid chains
  • Chains likely to belong in a desired family
• States with a circle emit no observation

[Diagram: HMM with Start, End, and Junk states, Insert states, match states F R N T P, and Del states]

Page 48: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

GPU Limitations

High granularity parallelism of 10,000 elements
• No room to store entire state table
• Download of database
• Readback of results

Few registers
• Only mechanism of fast read/write storage
• Limits how many state triplets a kernel may process

Number of kernel outputs

Page 49: ClawHMMer Streaming protein search Daniel Horn Mike Houston Pat Hanrahan Stanford University.

Viterbi Example

Observations: Umbrella, Coat, Nothing

         Sunday   Monday   Tuesday   Wednesday
Rainy    0        .2       .056      .00392
Sunny    1.0      .06      .018      .01008

Monday: Rainy = p(Umbrella | Rainy) · max(0 · .7, 1 · .4) = .5 · .4 = .2
Monday: Sunny = p(Umbrella | Sunny) · max(0 · .3, 1 · .6) = .1 · .6 = .06

Viterbi Traceback Example

.01008 > .00392
.018 · .6 < .056 · .3
.06 * .4 < .2 * .7
1.0 * .4 > 0.0 * .7

Viterbi path: “Sunny, Rainy, Rainy, Sunny”