ClawHMMer: Streaming Protein Search
Daniel Horn, Mike Houston, Pat Hanrahan
Stanford University


Data proliferation in biology

Protein databases:
- SWISS-PROT (200,000 annotated proteins)
- NCBI Nonredundant DB (2.5 million proteins)
- UniProtKB/TrEMBL (2.3 million proteins)

DNA sequence databases:
- DDBJ Japan (42 million genes)
- NCBI GenBank (10 million sequence records)

Protein matching

Problem: different proteins with the same function share common amino acid patterns.

Solution: fuzzy string matching.

[Figure: similar sequences such as FRNTAP and FRNFTP placed side by side; the common amino acid pattern is visible from the side-by-side placement]

Current methods

BLAST [Altschul '90]
+ Fast
- Ad hoc gap penalties: matches depend on the penalty

HMMer [Krogh '93]
+ Robust probabilistic model
- Slow

New architectures

[Images: IBM Cell and graphics hardware]

New architectures

Traditional CPUs:
- P4 3.0 GHz = 12 GFlops
- G5 2.5 GHz = 10 GFlops

New architectures offer more compute power:
- ATI X1800XT = 120 GFlops
- NVIDIA G70 = 176 GFlops
- Cell = 250 GFlops

However, new architectures are parallel.

Fragment Processor

[Diagram: a single fragment processor with 32 temporary read/write registers, fed by 128+ constant values and 8 interpolated values, reading from texture memory (256-512 MB); up to 16 floats are eventually written sequentially back to texture memory]

Fragment Processors

- GPUs have 16-24 fragment processors
- Need ~10,000 threads executing in parallel to hide memory latency
- Shared program counter (PC); newer hardware has more (1 per 4 processors)

[Diagram: several fragment processors, each with its own R/W registers, sharing one PC, 128+ constant values, and interpolants, all reading texture memory (256-512 MB); data is written sequentially back to texture memory]

Brook for GPUs

- Abstraction for streaming architectures [Buck '04]; encourages data-parallel applications
- Streams: arrays of data
- Kernels: small functions that can run on a fragment processor, with no access to global variables and a fixed number of stream inputs, outputs, and lookup tables
- Mapping operation: runs a kernel on each element of an input stream and writes to an output stream
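To make the model concrete, here is a minimal Python sketch of the stream/kernel/map idea (illustrative names only; Brook itself is a C-like language, and this is not its actual API):

    # A kernel: a small pure function with no access to globals.
    # The scalar 'a' plays the role of a constant register.
    def saxpy_kernel(x, y, a=2.0):
        return a * x + y

    # Streams are just arrays of data.
    x_stream = [1.0, 2.0, 3.0]
    y_stream = [4.0, 5.0, 6.0]

    # The mapping operation runs the kernel on each element of the
    # input streams and writes the results to an output stream.
    out_stream = list(map(saxpy_kernel, x_stream, y_stream))
    print(out_stream)  # [6.0, 9.0, 12.0]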

Challenges

- The current search algorithm doesn't map well
- Requires lots of fine-grained parallelism: 10,000 kernel invocations in parallel
- Requires high arithmetic intensity: an 8:1 compute-to-fetch ratio
- Fragment processors have limited resources, so kernels need to be small

Hidden Markov Model

- A state machine represents unobservable world state
- Transition probabilities: Rainy→Rainy = .7, Rainy→Sunny = .3, Sunny→Rainy = .4, Sunny→Sunny = .6
- Observation probabilities:
  P(coat|Rainy) = .4      P(coat|Sunny) = .3
  P(nothing|Rainy) = .1   P(nothing|Sunny) = .6
  P(umbrella|Rainy) = .5  P(umbrella|Sunny) = .1
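The slide's weather model is small enough to write out directly; here is a Python transcription (reused by the Viterbi sketches later in this deck):

    # States of the hidden world.
    states = ["Rainy", "Sunny"]

    # trans[s][t] = P(tomorrow is t | today is s)
    trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
             "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}

    # emit[s][o] = P(observing o | state s)
    emit = {"Rainy": {"coat": 0.4, "nothing": 0.1, "umbrella": 0.5},
            "Sunny": {"coat": 0.3, "nothing": 0.6, "umbrella": 0.1}}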

Viterbi Algorithm

- Given an HMM and an observation sequence, Viterbi finds the most likely path through world state and the probability of that path
- Dynamic programming: fills a probability table per observation, per state, taking a max over incoming transitions:
  table[i][s] = P(obs_i | s) · max over pred of ( table[i-1][pred] · P(pred→s) )
- table[i][s] is the probability of the state machine taking the most likely path ending in state s while emitting the first i observations

Viterbi Example

Observations: Umbrella, Coat, Nothing
Initial column (Sunday): Rainy = 0.0, Sunny = 1.0

Monday (Umbrella):
  Rainy = p(Umbrella|Rainy) · max(0.0·.7, 1.0·.4) = .5·.4 = .2
  Sunny = p(Umbrella|Sunny) · max(0.0·.3, 1.0·.6) = .1·.6 = .06

[Figure: trellis of Rainy/Sunny states over Sunday-Wednesday with transition probabilities .7, .3, .4, .6 on the arrows]

Viterbi Example (continued)

Tuesday (Coat): Rainy = .056, Sunny = .018
Wednesday (Nothing): Rainy = .00392, Sunny = .01008

Traceback comparisons:
  .01008 > .00392, so the path ends in Sunny
  .018·.6 < .056·.3, so Wednesday's Sunny came from Tuesday's Rainy
  .06·.4 < .2·.7, so Tuesday's Rainy came from Monday's Rainy

Viterbi Traceback Example

  1.0·.4 > 0.0·.7, so Monday's Rainy came from Sunday's Sunny

Viterbi path: "Sunny, Rainy, Rainy, Sunny"

HMM Training

[Figure: proteins in a family and a proposed alignment are combined into the probabilistic model: aligned columns become Align states (emitting "F", "R", "N", "T", "P"), extra residues such as "G" become Insertion states, and missing residues become Delete states]

Probabilistic Model: HMM

[Diagram: profile HMM with Start and End states, match states for F, R, N, T, P (each emits its amino acid, e.g. Emit "F"), "Junk" insert states, and Delete states with no emission]

HMMer Searches a Database

- Score the probabilistic model (the query) against every protein in the database
- If the probability score is high: perform traceback for the alignment

[Figure: the profile HMM query compared against a database of amino-acid sequences]

Viterbi Algorithm Code

for obs in sequence:                      // observation loop
    for s in states:                      // next-state loop
        for pred in states:               // transition loop
            if max-likelihood path to s came from pred:
                fill table[obs][s]
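A runnable Python transcription of this loop nest, including the traceback, on the weather model from earlier; it reproduces the example's path "Sunny, Rainy, Rainy, Sunny":

    trans = {"Rainy": {"Rainy": .7, "Sunny": .3},
             "Sunny": {"Rainy": .4, "Sunny": .6}}
    emit = {"Rainy": {"coat": .4, "nothing": .1, "umbrella": .5},
            "Sunny": {"coat": .3, "nothing": .6, "umbrella": .1}}

    def viterbi(sequence, init):
        states = list(init)
        table = [init]  # table[i][s]: probability of the best path ending in s
        back = []       # back[i][s]: best predecessor of s at step i
        for obs in sequence:                          # observation loop
            row, ptr = {}, {}
            for s in states:                          # next-state loop
                # transition loop: pick the max-likelihood predecessor
                best = max(states, key=lambda p: table[-1][p] * trans[p][s])
                row[s] = emit[s][obs] * table[-1][best] * trans[best][s]
                ptr[s] = best
            table.append(row)
            back.append(ptr)
        # traceback from the most likely final state
        path = [max(states, key=lambda s: table[-1][s])]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        path.reverse()
        return path, max(table[-1].values())

    print(viterbi(["umbrella", "coat", "nothing"],
                  {"Rainy": 0.0, "Sunny": 1.0}))
    # ~ (['Sunny', 'Rainy', 'Rainy', 'Sunny'], 0.01008)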

Parallelizing Viterbi Over a Database

parallel_for observ_seq in database:      // DB loop
    for obs in observ_seq:                // observation loop
        for s in states:                  // next-state loop
            for pred in states:           // transition loop
                if max-likelihood path to s came from pred:
                    fill table[observ_seq][obs][s]
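As a sketch, the parallel_for over the database maps directly onto a data-parallel map; here with Python's multiprocessing, assuming the viterbi() function from the previous sketch is defined in the same file:

    from multiprocessing import Pool

    database = [["umbrella", "coat", "nothing"],   # toy database of
                ["coat", "coat", "umbrella"]]      # observation sequences
    init = {"Rainy": 0.0, "Sunny": 1.0}

    if __name__ == "__main__":
        with Pool() as pool:
            # parallel_for observ_seq in database
            results = pool.starmap(viterbi, [(seq, init) for seq in database])
        for seq, (path, prob) in zip(database, results):
            print(seq, "->", path, prob)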

Flexible Parallelism

- Pulling the database loop inward reduces the size and state of the inner-loop code
- More flexibility in choosing:
  - coarse-grained parallelism (data parallel)
  - fine-grained parallelism (SIMD instructions)

Loop Reordering

for i = 1 to length(longest observ_seq):      // observation loop
    for s in states:                          // next-state loop
        parallel_for observ_seq in database:  // DB loop
            Kernel:
                for pred in states:           // transition loop
                    if max-likelihood path to s came from pred:
                        fill table[observ_seq][i][s]

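A numpy sketch of the reordered nest on the same weather model: the database dimension becomes the vectorized, data-parallel axis (the real kernels are Brook shader programs; here all toy sequences already have the length of the longest one):

    import numpy as np

    # Observations encoded as 0 = umbrella, 1 = coat, 2 = nothing,
    # padded to the length of the longest sequence.
    obs = np.array([[0, 1, 2],
                    [1, 1, 0],
                    [2, 0, 1],
                    [0, 0, 2]])
    n_seqs, max_len = obs.shape
    n_states = 2                                   # 0 = Rainy, 1 = Sunny

    trans = np.array([[.7, .3],                    # trans[pred, s]
                      [.4, .6]])
    emit = np.array([[.5, .4, .1],                 # emit[s, observation]
                     [.1, .3, .6]])

    table = np.zeros((n_seqs, max_len + 1, n_states))
    table[:, 0, 1] = 1.0                           # start: Rainy = 0, Sunny = 1

    for i in range(max_len):                       # observation loop
        for s in range(n_states):                  # next-state loop
            # kernel: transition max and emission, for every sequence at once
            incoming = table[:, i, :] * trans[:, s]
            table[:, i + 1, s] = emit[s, obs[:, i]] * incoming.max(axis=1)

    print(table[:, -1, :])   # per-sequence final Rainy/Sunny probabilities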

Further Optimizations: HMM

- Unroll the transition loop completely
- Unroll the state loop by 3: gain fine-grained data-level parallelism by processing 3 related states (a triplet) at once

[Diagram: profile HMM with Start/End, Insert states, match states F R N T P, and Delete states; one Match/Insert/Delete triplet highlighted]

Further State-Loop Unrolling Increases Arithmetic Intensity

- Neighboring probability computations (multi-triplet) read similar inputs and use each other's intermediate results
- Fewer fetches + same math = higher arithmetic intensity

[Diagram: the same profile HMM with several adjacent triplets processed by one kernel]
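For reference, the three states in a triplet are one node's Match, Insert, and Delete states. A Python sketch of a single triplet update, following the standard profile-HMM recurrences (Durbin et al.); this illustrates the dependency structure, not ClawHMMer's actual kernel:

    def triplet_update(k, i, obs_i, M, I, D, t, e_match, e_insert):
        # M, I, D are DP tables indexed [i][k]; t holds transition
        # probabilities; e_match/e_insert are emission probabilities.
        # Match state: consumes observation obs_i, comes from node k-1.
        M[i][k] = e_match[k][obs_i] * max(M[i-1][k-1] * t["MM"][k-1],
                                          I[i-1][k-1] * t["IM"][k-1],
                                          D[i-1][k-1] * t["DM"][k-1])
        # Insert state: consumes obs_i, stays at node k.
        I[i][k] = e_insert[k][obs_i] * max(M[i-1][k] * t["MI"][k],
                                           I[i-1][k] * t["II"][k])
        # Delete state: emits nothing, comes from node k-1 at the same i.
        D[i][k] = max(M[i][k-1] * t["MD"][k-1],
                      D[i][k-1] * t["DD"][k-1])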

Implementation on Graphics Hardware (GPU)

- Wrote 1-triplet and 4-triplet state-processing kernels in Brook for GPUs [Buck '04], downloaded as shader programs
- Amino acid sequences → read-only textures
- State probabilities → read/write textures
- Transition probabilities → constant registers
- Emission probabilities → small lookup table
- If necessary, traceback is performed on the CPU
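Since each DP column depends only on the previous one, the read/write state-probability textures can be double buffered: read one, write the other, then swap. A Python sketch of that buffering pattern on the weather model (not the shader code itself):

    import numpy as np

    trans = np.array([[.7, .3], [.4, .6]])         # constant "registers"
    emit = np.array([[.5, .4, .1], [.1, .3, .6]])  # small lookup table
    obs = np.array([0, 1, 2])                      # umbrella, coat, nothing

    prev_col = np.array([0.0, 1.0])                # "texture" read this pass
    cur_col = np.empty(2)                          # "texture" written this pass

    for o in obs:                                  # one pass per observation
        for s in range(2):
            cur_col[s] = emit[s, o] * (prev_col * trans[:, s]).max()
        prev_col, cur_col = cur_col, prev_col      # swap the two buffers

    print(prev_col)   # final column, ~[0.00392, 0.01008]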

Experimental Methodology

Tested systems:
- Sean Eddy's HMMer 2.3.2 [Eddy '03] on a Pentium 4 3.0 GHz
- Erik Lindahl's AltiVec v2 implementation [Lindahl '05] on a 2.4 GHz PowerPC G5
- Our Brook implementation on an ATI X1800 XT and an NVIDIA 7800 GTX

Data set:
- NCBI Nonredundant database (2.5 million proteins)
- Representative HMMs from the Pfam database

Adenovirus Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on the Adenovirus HMM for P4, G5, X1800XT, and 7800 GTX]

Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on the Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 HMMs for P4, G5, X1800XT, and 7800 GTX]

Performance Analysis

Comparison [ATI X800; 17.5 GB/s; 8.3 Gops]:
- 1-triplet kernel: bandwidth limited
- 4-triplet kernel: instruction-issue limited

Conclusions:
- The 4-triplet kernel achieves 90% of peak performance
- Unrolling the kernel is important

Scales Linearly

- Results on a 16-node cluster (Gig-E interconnect; ATI Radeon 9800 and dual CPUs per node)
- Split the 2.5-million-protein database across the nodes: linear scalability
- Need 10,000 proteins per node to keep the GPUs busy

[Chart: relative performance vs. number of nodes (0-16), scaling linearly]

HMMer-Pfam

- Solves the inverse problem of HMMer-search: given a protein of interest, search a database of HMMs
- Problem for HMMer-pfam: it requires a traceback for every HMM, and there is not enough memory on the GPU
- The GPU requires 10,000 elements in parallel, which requires storage for 10,000 × num_states × sequence_length floating-point probabilities
- Need an architecture with finer-grained parallelism
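For scale: with, say, a 200-state model and 400-residue sequences (illustrative sizes, not from the slides), that is 10,000 × 200 × 400 = 800 million single-precision values, about 3.2 GB, versus the 256-512 MB of texture memory noted earlier.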

Summary

- Streaming version of HMMer, implemented on GPUs, performs well: 2-3x a PowerPC G5, 20-30x a Pentium 4
- Well suited to other parallel architectures, such as Cell

Acknowledgements

- Funding: Sheila Vaidya & John Johnstone @ LLNL
- People: Erik Lindahl, ATI folks, NVIDIA folks, Sean Eddy, Ian & the Brook team

Questions?
danielrh @ graphics.stanford.edu
mhouston @ graphics.stanford.edu

Importance of the Streaming HMMer Algorithm

- Scales to many types of parallel architectures by adjusting the width of the parallel_for loops: Cell, multicore, clusters, ...
- Cell: addressable read/write memory and fine-grained parallelism
  - Pfam possibility: tens, not thousands, of parallel DB queries
  - On Cell, 16-64 database entries could be processed at once; each kernel could process all states
  - Only returns a probability

General HMM Performance

- An arbitrary transition matrix makes Viterbi exactly like repeated matrix-vector multiplication: each matrix is the transition matrix scaled by the emission probability, and max() supplants add() in the matrix-vector analogy
- The Brook for GPUs paper shows matrix-vector multiply is good for streaming architectures: each element is touched once while streaming over large quantities of memory, and the GPU memory controller is designed for this access mode

Filling the Probability Table

[Figure: a sequence database (Sequence A: F R N F ..., Sequence B: F R N T ..., Sequence C: N T N F ..., Sequence D: T G P F ...) is processed in lockstep; the table holds per-sequence probabilities (e.g. .8 .2 .3 .1), with state values laid out interleaved across sequences: A B C D, A B C D, ...]

Competing with the CPU Cache

- The probability at each state requires a max() over incoming state probabilities, which sit in memory on graphics hardware but in L1 cache on the CPU
- Only the final probability must be written to memory, and only tiny database entries must be read
- Luckily, the 12-way version is instruction limited, not bandwidth limited, on the GPU; the CPU is also instruction limited

Viterbi Example (detail)

Observations: Umbrella, Coat, Nothing
Initial: Rainy = 0, Sunny = 1.0

First observation (Umbrella):
  Rainy = p_emit(Rainy, Umbrella) · max(0·.7, 1·.4) = .5·max(0, .4) = .2
  Sunny = p_emit(Sunny, Umbrella) · max(0·.3, 1·.6) = .1·max(0, .6) = .06

Viterbi Example (detail, continued)

Second observation (Coat):
  Rainy = p_emit(Rainy, Coat) · max(.2·.7, .06·.4) = .4·max(.14, .024) = .056
  Sunny = p_emit(Sunny, Coat) · max(.2·.3, .06·.6) = .3·max(.06, .036) = .018

Viterbi Example (detail, continued)

Third observation (Nothing):
  Rainy = p_emit(Rainy, Nothing) · max(.056·.7, .018·.4) = .1·max(.0392, .0072) = .00392
  Sunny = p_emit(Sunny, Nothing) · max(.056·.3, .018·.6) = .6·max(.0168, .0108) = .01008

Performance

- ATI hardware outperforms the PowerPC by a factor of 2+
- Graphics hardware performs 10-25 times better than HMMer on x86
- Careful optimization may improve x86 in the future

[Chart: relative performance on Colipase, Connexin 50, Adenovirus, Arfaptin, PGI, and DUF#499 for P4, G5, X1800XT, X800XT PE, 9800XT, 7800 GTX, and 6800 Ultra]

Transition Probabilities

- Represent the chance of the world changing state
- Trained from example data (the proteins in the family)

[Figure: training sequences aligned to the profile HMM give, for example, a 25% chance of a correct match (.25), a 50% chance of an insertion (.5), and a 25% chance of a deletion (.25)]

HMMer Scores a Protein

- HMMer takes the probabilistic model and a database protein and returns a probability score
- It also returns the probable alignment of the protein to the model (the most likely alignment)

[Figure: HMMer + database protein → probability score and most likely alignment to the model]

Probabilistic Model: HMM

- Specialized layout of states
- Trained to randomly generate amino-acid chains that are likely to belong to the desired family
- States drawn with a circle emit no observation

[Diagram: profile HMM with Start/End, match states F R N T P, "Junk" insert states, and Delete states]

GPU Limitations

- High-granularity parallelism: needs 10,000 elements in flight
- No room to store the entire state table
- Download of the database and readback of results
- Few registers, the only mechanism for fast read/write storage: limits how many state triplets a kernel may process
- Limited number of kernel outputs
