Co-processors for speeding up drug design algorithms Advait Jain Priyanka Jindal Pulkit Gambhir...

Co-processors for speeding up drug design algorithms

Advait Jain, Priyanka Jindal, Pulkit Gambhir

Under the guidance of: Prof. M Balakrishnan, Prof. Kolin Paul

Objective
- To design FPGA-based hardware accelerators for speeding up the energy minimization process.

Approach to the problem
- Familiarization with the code
- Software profiling
  - Identifying bottleneck procedures/loops
  - Compiler level optimizations
- H/w - S/w partitioning
  - Where to partition
  - APIs to export
- Hardware Design
- Performance Analysis

Overall Control Flow

Bottleneck Functions

Split-up code       Eval_Energy_for_step (%)    Diff_Energy (%)
Non-bonded pairs    68.61                       29.10
Dihedrals           00.54                       00.56
Angles              00.17                       00.12
Bonded              00.00                       00.00

Bottleneck Functions
Functions: Eval energy, Eval Energy for step, Diff energy
- Iterate over list of bonds {O(N) elements}
- Iterate over list of angles {O(N) elements}
- Iterate over list of dihedrals {O(N) elements}
- Iterate over list of non-bonded pairs {O(N^2) elements}
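The loop over non-bonded pairs can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the per-pair coefficients A, B and C enter as a 12-6 Lennard-Jones term plus a Coulomb term, A/r^12 - B/r^6 + C/r, which the slides do not state. The names `nonbonded_energy` and `all_pairs` are hypothetical.

```python
import math

def nonbonded_energy(coords, pairs):
    """Sum the pair energy over a list of (a1, a2, A, B, C) nodes.

    Assumed form per pair: A/r^12 - B/r^6 + C/r (not confirmed by the slides).
    """
    total = 0.0
    for a1, a2, A, B, C in pairs:
        r2 = sum((p - q) ** 2 for p, q in zip(coords[a1], coords[a2]))
        r6 = r2 ** 3
        total += A / (r6 * r6) - B / r6 + C / math.sqrt(r2)
    return total

def all_pairs(n):
    """Every unordered atom pair: n*(n-1)/2 entries, hence the O(N^2) list."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)]

# Two atoms at distance 2 with hand-picked coefficients.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
energy = nonbonded_energy(coords, [(0, 1, 4096.0, 64.0, 2.0)])
```

The bond, angle and dihedral lists grow linearly with the number of atoms, while the pair list grows quadratically, which is why this loop dominates the profile.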

Molecule Size v/s Time (log plot)

Average Slope = 2.03
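Assuming both axes of the plot are logarithmic, a slope close to 2 means the run time grows roughly as the square of the molecule size. The sketch below shows one way such a slope could be estimated by least squares; the function name `loglog_slope` is hypothetical and the timings are made-up placeholders, not measurements from this work.

```python
import math

def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Placeholder timings that grow exactly as N^2 (not measured data).
sizes = [100, 200, 400, 800]
times = [1e-6 * n * n for n in sizes]
slope = loglog_slope(sizes, times)
```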

Energy v/s CG Steps (plot, annotated "We are here")

Non-bonded List
Node structure:
- Float A, B, C (4*3 bytes)
- Int a1, a2
C is a function of the charges q1 and q2 of the atoms.
- 471,282 distinct Cs (an index fits in 3 bytes)
A, B are a function of the radius and epsilon of the atoms.
- 192 distinct pairs of A,B (an index fits in 1 byte)

New Data Structure
- Vector of distinct Cs
- Vector of distinct (A,B) pairs
- 3d coordinates of atoms
New node structure:
- Int a1, a2
- Unsigned common_index

31
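A minimal sketch of the idea behind the new node structure: each distinct C and each distinct (A,B) pair is stored once, and every node keeps only an index into those vectors. The 24-bit/8-bit split of common_index used here is an assumption suggested by the "3 bytes" and "1 byte" figures; the slides do not give the actual bit layout. The helper names `index_of`, `compress` and `expand` are hypothetical.

```python
C_BITS, AB_BITS = 24, 8  # assumed split of the 32-bit common_index

def index_of(value, table, seen):
    """Return the position of value in table, appending it on first sight."""
    idx = seen.get(value)
    if idx is None:
        idx = len(table)
        table.append(value)
        seen[value] = idx
    return idx

def compress(old_nodes):
    """Turn (a1, a2, A, B, C) nodes into (a1, a2, common_index) nodes."""
    c_vec, ab_vec, c_seen, ab_seen = [], [], {}, {}
    new_nodes = []
    for a1, a2, A, B, C in old_nodes:
        ci = index_of(C, c_vec, c_seen)
        abi = index_of((A, B), ab_vec, ab_seen)
        assert ci < (1 << C_BITS) and abi < (1 << AB_BITS)
        new_nodes.append((a1, a2, (ci << AB_BITS) | abi))
    return new_nodes, c_vec, ab_vec

def expand(node, c_vec, ab_vec):
    """Recover (a1, a2, A, B, C) from a compressed node."""
    a1, a2, common = node
    A, B = ab_vec[common & ((1 << AB_BITS) - 1)]
    return a1, a2, A, B, c_vec[common >> AB_BITS]

old = [(0, 1, 1.0, 2.0, 0.5), (0, 2, 1.0, 2.0, 0.25), (1, 2, 3.0, 4.0, 0.5)]
new, c_vec, ab_vec = compress(old)
```

Both counts quoted above fit this assumed layout: 471,282 is below 2^24 and 192 is below 2^8.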

Result of new data structure
Molecule Size: 2008
VanderList: 2,008,417    AB_Vander list: 136    C_Vanderlist: 21,651

Old Data Structure:        2,008,417 * 20                          ~ 40 MB
New Data Structure:        2,008,417 * 12 + 136 * 8 + 21,651 * 4   ~ 24 MB
Projected Data Structure:  2,008,417 * 8 + 136 * 8 + 21,651 * 4    ~ 16 MB
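The three estimates can be reproduced with a few lines of arithmetic. The node sizes of 20, 12 and 8 bytes and the list lengths come from the figures above; how the projected 8-byte node would be reached is not stated in the slides (one guess is two 2-byte atom indices plus the 4-byte common_index). The function name `footprint` is hypothetical.

```python
N_NODES, N_AB, N_C = 2_008_417, 136, 21_651

def footprint(node_bytes, shared_vectors=True):
    """Bytes for the van-der list, plus the shared (A,B) and C vectors if used."""
    total = N_NODES * node_bytes
    if shared_vectors:
        total += N_AB * 8 + N_C * 4  # 8 bytes per (A,B) pair, 4 bytes per C
    return total

old = footprint(20, shared_vectors=False)  # 40,168,340 bytes
new = footprint(12)                        # 24,188,696 bytes
projected = footprint(8)                   # 16,155,028 bytes
```

The shared vectors add under 90 KB, so nearly all of the saving comes from the smaller node.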

Improvement in cache performance

Sorting to improve performance
- Consecutive nodes of the van-der list can point randomly anywhere in the C and (A,B) vectors
- Scope for further improving cache performance
- Radix sort on the van-der list
  - First bucket sort on the C-index
  - Second stable bucket sort on the (A,B)-index
- Sequential access of (A,B) vector
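The two-pass sort can be sketched as a least-significant-key-first radix sort. This is an illustration under assumptions: nodes are modelled as hypothetical tuples (a1, a2, c_index, ab_index), and the function names are invented for the example.

```python
def bucket_sort(nodes, key, n_buckets):
    """Stable bucket sort: nodes with equal keys keep their relative order."""
    buckets = [[] for _ in range(n_buckets)]
    for node in nodes:
        buckets[key(node)].append(node)
    return [node for bucket in buckets for node in bucket]

def sort_vander_list(nodes, n_c, n_ab):
    """First pass on the C-index, second stable pass on the (A,B)-index."""
    by_c = bucket_sort(nodes, lambda n: n[2], n_c)
    return bucket_sort(by_c, lambda n: n[3], n_ab)

# Hypothetical nodes: (a1, a2, c_index, ab_index).
nodes = [(0, 1, 2, 1), (0, 2, 0, 0), (1, 2, 1, 1), (0, 3, 1, 0)]
ordered = sort_vander_list(nodes, n_c=3, n_ab=2)
# ordered == [(0, 2, 0, 0), (0, 3, 1, 0), (1, 2, 1, 1), (0, 1, 2, 1)]
```

Because the second pass is stable, the result is ordered by (A,B)-index first and by C-index within each (A,B) group, so the (A,B) vector is read sequentially and the C vector is read in increasing order within each group.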

Cache Profiling (unsorted vs sorted)

              L1D refs          L1D misses             L2 refs
Unsorted      1,773,145,080     44,016,787             44,754,341
  Rd          1,451,802,230     43,429,781 (3%)        44,167,335
  Wr          321,342,785       587,006 (.1826%)       587,006
Sorted        1,842,686,500     29,287,877             30,152,893
  Rd          1,495,124,238     28,470,590 (1.9%)      29,335,606
  Wr          347,562,262       817,287 (.235%)        817,287

Test Case: Molecule of size 413 atoms with 25 SD and 100 CG steps

Converting to floating point
- All the code is written in double precision
- Double precision is difficult to replicate in hardware
- Need to test feasibility of conversion to single precision
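One inexpensive way to get a feel for what single precision loses is to round every intermediate result to a 32-bit float. The sketch below does this with the standard `struct` module; the helper name `f32` is hypothetical, and this is only an illustration of the effect, not the method used in this work.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# A single-precision significand holds 24 bits, so adding 1 to 2^24 is lost.
big = f32(2.0 ** 24)
single_sum = f32(big + 1.0)   # stays at 16777216.0
double_sum = 2.0 ** 24 + 1.0  # 16777217.0 in double precision

# Likewise, a relative change of 2^-24 vanishes in single precision.
single_step = f32(1.0 + 2.0 ** -24)  # 1.0
```

Small contributions added to a large running total can disappear in the same way.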

Single Precision
Functions in the call flow: minEnergyCG(), diffEnergy(), evalEnergy_for_step(), moveStep()
Diagram annotations: "Precision lost here"; "Instability introduced here, resulting in NaN"

Single Precision: Removed the instability
- Parabolic interpolation replaced by lnSearch() whenever the points are collinear.
- Time taken to evaluate the energy increased.
- Increase in the number of calls to evalEnergy_for_step().
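Why collinear points cause trouble can be seen from the textbook formula for the vertex of a parabola through three points: its denominator vanishes when the points lie on a line, and dividing by it yields infinities or NaN. The sketch below is an illustration under assumptions, not the project's code: `parabolic_step` and `choose_step` are hypothetical names, the tolerance is a placeholder, and `line_search` stands in for the line-search routine.

```python
def parabolic_step(x1, f1, x2, f2, x3, f3, rel_tol=1e-6):
    """Abscissa of the vertex of the parabola through three points.

    Returns None when the points are (nearly) collinear, where the
    denominator of the interpolation formula is (nearly) zero.
    """
    t1 = (x2 - x1) * (f2 - f3)
    t2 = (x2 - x3) * (f2 - f1)
    den = t1 - t2
    if abs(den) <= rel_tol * max(abs(t1), abs(t2)):
        return None
    num = (x2 - x1) * t1 - (x2 - x3) * t2
    return x2 - 0.5 * num / den

def choose_step(points, line_search):
    """Use the parabolic estimate when it exists, else fall back."""
    (x1, f1), (x2, f2), (x3, f3) = points
    step = parabolic_step(x1, f1, x2, f2, x3, f3)
    return step if step is not None else line_search()

# f(x) = (x - 2)^2 sampled at 0, 1 and 3 has its minimum at x = 2.
vertex = parabolic_step(0.0, 4.0, 1.0, 1.0, 3.0, 1.0)
# Three points on a straight line have no parabola vertex.
collinear = parabolic_step(0.0, 0.0, 1.0, 1.0, 2.0, 2.0)
```

Every fallback costs additional energy evaluations inside the line search, which matches the increase in calls noted above.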

Slow Float Vs Double: Time Plot

Control Flow

Single Precision (Molecule Size: 2008, SD: 100, CG: 150)

                                        Double    Slow Float
# of calls to EvalEnergyforStep()       642       893
  From minEnergyCG()                    450       450
  From lnSearch()                       192       443
# of calls to lnSearch()                100       177
evalEnergyForStep() per lnSearch()      1.92      2.5

Reducing the number of calls
- minEnergyCG: parabolic interpolation depends on which 3 points are chosen.
- lnSearch: iteratively calculates the step size. When to stop the iteration is determined by 2 tolerances.
- What we did:
  - Chose points for parabolic interpolation that are further apart.
  - Increased the tolerances until the time to minimize the energy was the same as for double.
  - Then profiled to check the actual energy.

Fast Float Vs Double: Time Plot

Fast Float Vs Double: Energy Plot

Our conclusions from this exercise
- Located the source of instability.
- However, converting to float increased the time required for the code to run.
- Increasing the tolerances made the code fast again.
- The energy in the float case did not agree well with the double computation.

Feedback from SCF-Bio team
- They are interested primarily in “relaxing” the molecule.
- The actual energy is not of any consequence.
- To check the float code, the metric should be the error between the molecular structures (float vs double).

New Checking Methodology
- Start Structure is relaxed into a Double Relaxed Structure and a Float Relaxed Structure
- RMS Distance between the two relaxed structures
- Acceptance: < 0.5
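The structural check can be sketched as a root-mean-square distance over corresponding atoms. This is a minimal illustration that assumes both relaxed structures list the same atoms in the same order and need no prior alignment, since they start from the same structure; the slides do not say whether any superposition is applied. The function names are hypothetical.

```python
import math

def rms_distance(struct_a, struct_b):
    """RMS distance between corresponding atoms of two structures."""
    if len(struct_a) != len(struct_b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(struct_a, struct_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(struct_a))

def is_acceptable(struct_a, struct_b, threshold=0.5):
    """Apply the acceptance criterion: RMS distance below the threshold."""
    return rms_distance(struct_a, struct_b) < threshold

# Toy coordinates: every atom is displaced by a vector of length 5.
double_relaxed = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
float_relaxed = [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)]
distance = rms_distance(double_relaxed, float_relaxed)
```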

RMS Distance vs CG Steps (plot, annotated "We are here")

Comparison with new metric

Tasks completed this semester
- Software Profiling
  - No. of calls
  - Cache misses
  - Effect of parameters
- Control Flow Analysis
  - Flow Diagram
  - Data parallelism
- Floating point precision requirement
- Exploring H/W Options
  - Platform Selection
  - S/W H/W Partitioning

Ongoing work + next semester
- Setting up building blocks
  - ZBT RAM access
  - PCI Interface
  - Floating Point Unit
- Combining blocks for a simple implementation
- Refining the implementation
  - Multiple compute engines
  - Multiple PCI cards
