
GPU acceleration for the pricing of the CMS spread option

Qasim Nasar-Ullah University College London

Gower Street London, United Kingdom

[email protected]

ABSTRACT

This paper presents a study on the pricing of a financial derivative using parallel algorithms which are optimised to run on a GPU. Our chosen financial derivative, the constant maturity swap (CMS) spread option, has an associated pricing model which incorporates several algorithmic steps, including: evaluation of probability distributions, implied volatility root-finding, integration and copula simulation. The novel aspects of the analysis are: (1) a fast new accurate double precision normal distribution approximation for the GPU (based on the work of Ooura), (2) a parallel grid search algorithm for calculating implied volatility and (3) an optimised data and instruction workflow for the pricing of the CMS spread option. The study is focused on 91.5% of the runtime of a benchmark (CPU based) model and results in a speed-up factor of 10.3 when compared to our single-threaded benchmark model. Our work is implemented in double precision using the NVIDIA GF100 architecture.

Categories and Subject Descriptors

D.1.3 [Concurrent Programming]: Parallel Programming; G.1.2 [Approximation]: Special function approximations; G.3 [Probability and Statistics]: Probabilistic algorithms (including Monte Carlo)

General Terms

Algorithms, Performance

Keywords

GPU, Derivative pricing, CMS spread option, Normal distribution, Parallel grid search

1. INTRODUCTION

Modern graphics processing units (GPUs) are high throughput devices with hundreds of processor cores. GPUs are able to launch thousands of threads in parallel and can be configured to minimise the effect of memory and instruction latency by an optimal saturation of the memory bus and arithmetic pipelines. Certain algorithms configured for the GPU are therefore expected to offer performance improvements over existing architectures. In this study we examine the application of GPUs to pricing a constant maturity swap (CMS) spread option.

The CMS spread option, a commonly traded fixed income derivative, makes payments to the option holder based on the difference (spread) between two CMS rates C1, C2 (e.g. the ten and two year CMS rates). Given a strike value K, the CMS spread option payoff can be given as [C1 − C2 − K]+, where [·]+ = max(·, 0). The product makes payments, based on the payoff equation, to the holder at regular intervals (e.g. three months) over the duration of the contract (e.g. 20 years). The CMS rates C1, C2 are recalculated at the start of each interval and the payoff is made at the end of each interval.

Prior to discussing our GPU based CMS spread option model in Sections 4 and 5, we use Sections 2 and 3 to present two algorithms that are used within our model. Section 2 presents an implementation of the standard normal cumulative distribution function based on the work of Ooura [15]. The evaluation of this function is central to numerous problems within computational finance and dominates the calculation time of the seminal Black Scholes formula [4]. We compare our algorithm to other implementations, discuss sources of performance gain and comment on the accuracy of our algorithm. In Section 3 we present a GPU algorithm that evaluates implied volatility through a parallel grid search. The calculation of implied volatility is regarded as one of the most common tasks within computational finance [12]. Our method is shown to be robust and is suited to the GPU when the number of implied volatility evaluations is of order 100 or less. In Section 4 we present a short mathematical model for the pricing of CMS spread options. In Section 5 we present our GPU based implementation, providing various example optimisations alongside a set of performance results.

2. NORMAL DISTRIBUTION FUNCTION ON THE GPU

The calculation of the standard normal cumulative distribution function, or normal CDF, occurs widely in computational finance. The normal CDF, Φ(x), can be expressed as:

Φ(x) = 1/√(2π) ∫_{−∞}^{x} e^{−t²/2} dt,    (1)

and is typically calculated from a numerical approximation of the error function erf(x) and the complementary error function erfc(x), which are shown in Figure 1. Approximations for erf(x) and erfc(x) are often restricted to positive values of x, which are related to Φ(x) by:



Figure 1: Overlaying the functions erf(x), erfc(x) alongside the normal CDF Φ(x). Due to the symmetry of these functions, algorithms typically restrict actual evaluations to positive values of x.

Φ(+x) = ½[1 + erf(x/√2)] = ½[2 − erfc(x/√2)],    (2)

Φ(−x) = ½[1 − erf(x/√2)] = ½ erfc(x/√2).    (3)

The majority of normal CDF algorithms we surveyed approximate the inner region of x (close to x = 0) using erf(x) and approximate the outer region of x using erfc(x); this minimises cancellation error. The optimum branch point separating the inner and outer regions and minimising cancellation error is x ≈ 0.47 [5]; this point is the intersection of erf(x) and erfc(x) shown in Figure 1. The algorithms implemented within our study are listed in Table 1 and have absolute accuracy < 10⁻¹⁵.
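For reference, (2) and (3) mean that the normal CDF can be obtained from a single erfc evaluation; a minimal CPU-side sketch using the standard C math library (not the GPU approximation developed below) is:

#include <math.h>

/* Phi(x) = 0.5 * erfc(-x / sqrt(2)), which follows from (2) and (3). */
static double norm_cdf_ref(double x)
{
    return 0.5 * erfc(-x * 0.70710678118654752440);
}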

An algorithmic analysis of various common approximations [1, 5, 10, 11] highlights areas of expected performance loss when implemented on GPUs. Firstly, the approximations are rational and so utilise at least one instance of the division operator (which has low throughput on the current generation of GPUs). The common presence of rational approximations, for example Padé approximants, stems from their superior numerical efficiency on traditional architectures [7]. Secondly, due to separate approximations over the range of x, the GPU may sequentially evaluate each approximation (known as branching) if the thread execution vector (known as a 'warp' of 32 threads in current NVIDIA architectures) contains values of x in different approximation regions. An algorithm falling into this class is the Cody algorithm [5], which is also considered a standard implementation within financial institutions and is used to benchmark our results.

Within our survey we identify the Ooura error function derf and complementary error function derfc [15] as being particularly suited to the GPU.

The Ooura error function derf is based on polynomial (as opposed to rational) approximations where each approximation utilises high throughput multiplication and addition arithmetic only. The algorithm uses two explicit if branches, each having access to five sets of coefficients. As a result the algorithm consists of ten separate polynomial approximations operating on ten distinct regions of x; the ten regions can be seen in Figure 2. Having hard-coded polynomial coefficients, as opposed to storage in another memory type, offered the best performance. It is worthwhile to note that the addition of a single exponential function exp or a single division operation increased the execution time of a single derf approximation by around 50% and 25% respectively.

Figure 2: The different polynomial approximations within the Ooura error function derf over x. We observe two explicit branches (each having five sub-branches). The domain of x will be expanded by a factor of √2 ≈ 1.4 when evaluating the normal CDF due to the transformation in (2) and (3).

In contrast, the Ooura complementary error function derfc uses a single polynomial approximation across the entire range of x, whilst utilising two instances of low throughput operations (namely exp and a division). We were unable to find a more parsimonious representation (in terms of low throughput operations) of the complementary error function within our survey.

We formulate a hybrid algorithm called ONORM to calculate the normal CDF (listed in Appendix A). The algorithm uses the innermost approximation of derf to cover the inner region [±√2] ≈ [±1.4] (where [±d] denotes −d < x < d) and derfc for the remaining outer region. The resulting branch point of x = 1.4 is greater than the optimum branch point of x = 0.47 and was chosen to maximise the interval evaluated by the higher performance derf approximation.

Our results are shown in Table 1, within which we compare the effects of uniform inputs of x, which increment gradually to minimise potential branching, against random inputs of x, which are randomised for increased branching. Our results show that ONORM offers its peak performance when x is in the inner range of [±1.4]. In this range it slightly outperforms derf (upon which ONORM is based) due to fewer control flow operations. Within our test samples we observe ONORM outperforming the Cody algorithm by factors ranging from 1.09 to 3.65.

Range of x                   [±0.2]            [±1.4]            [±10]
Access                       Random  Uniform   Random  Uniform   Random  Uniform
derf                         4.72    4.75      4.81    4.82      1.06    4.56
derfc                        2.22    2.24      2.23    2.24      2.23    2.24
Phi                          2.40    2.75      1.34    1.84      1.08    1.21
NV                           5.33    5.34      5.35    5.37      1.64    2.16
Cody                         4.36    4.36      1.34    2.39      0.96    1.63
ONORM                        4.77    4.81      4.88    4.83      1.71    2.28
ONORM speed-up vs. Cody (x)  1.09    1.10      3.65    2.02      1.78    1.40

Table 1: Calculations per second (×10⁹) for the Ooura derf, Ooura derfc, Marsaglia Phi, NV, Cody and our ONORM algorithms. Active threads per possible active threads, or 'occupancy', fixed at 0.5. Uniform access attempts to minimise possible branching; random access attempts to maximise possible branching. GPU used: NVIDIA M2070.

Figure 3: Absolute error AE of the derf algorithm for the range [±2].

Focusing on the random access results we see that when x is in [±10] ONORM performs slower than derfc due to each 'warp' executing both the inner derf and outer derfc approximations with probability ≈ 0.99. The performance difference is, however, limited since the inner derf approximation has around 45% of the cost of the derfc approximation. The use of random (against uniform) inputs significantly reduces the performance of Cody in the ranges [±1.4] and [±10], and of derf in the range [±10], highlighting the performance implications of branching.

We also comment on the Marsaglia Phi algorithm, which is based on an error function approximation [13]. It is a branchless algorithm which involves the evaluation of a Taylor series about the origin; a conditional while loop adds additional polynomial terms as x moves away from the origin. Within our GPU implementation we add a precalculated array of Taylor coefficients to eliminate all division operations. The algorithm performs a single-digit number of iterations close to the origin, with the iteration count growing rapidly as x moves towards the tails. We found that despite having extremely few iterations close to the origin, performance is limited by the presence of a single exp function. Our results also indicate that the Marsaglia algorithm is always dominated by the derf function (unless extensive branching occurs) and can perform at least as fast as the derfc function when x is within [±0.2].
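A minimal sketch of this variant (array name, length and termination test are ours; the series is Marsaglia's Φ(x) = ½ + φ(x)(x + x³/3 + x⁵/(3·5) + ...)) might look as follows:

/* Reciprocals 1/3, 1/5, 1/7, ... precomputed once so that the term
   recurrence needs no division (the array length is illustrative). */
__constant__ double inv_odd[128];   /* inv_odd[k] = 1.0 / (2*k + 3) */

__device__ double marsaglia_phi(double x)
{
    double q = x * x;
    double term = x, sum = x, prev;
    int k = 0;
    do {
        prev = sum;
        term *= q * inv_odd[k++];   /* x^(2k+1) / (1*3*...*(2k+1)) */
        sum += term;
    } while (sum != prev && k < 128);
    /* 0.5 + series * phi(x), where phi(x) = exp(-x*x/2) / sqrt(2*pi) */
    return 0.5 + sum * exp(-0.5 * q - 0.9189385332046727);
}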

A comparison is also made against the latest NVIDIA CUDA 4.1 implementation [14] of the error and complementary error functions: NV-Erf and NV-Erfc. As per the ONORM algorithm we can craft a hybrid algorithm NV which uses the innermost NV-Erf approximation for the inner region [±1.4] (consisting of multiplication and addition arithmetic only) and the branchless NV-Erfc approximation (consisting of three low throughput functions: an exp and two divisions) in the outer region. The inner approximation is more efficient than ONORM due to a smaller polynomial, whereas the outer approximation uses an additional division operation, yielding a loss in performance.

Figure 4: Absolute error AE of the derfc algorithm for the range [±2].

The accuracy of our GPU approximations can be measured against an arbitrary precision normal CDF implemented within Mathematica, CDF[NormalDistribution], referred to as Φ_actual(x). We measure absolute accuracy as:

AE = Φ(x) − Φ_actual(x),    (4)

and relative accuracy (which is amplified for increasingly negative values of x) as:

RE = Φ(x) / Φ_actual(x) − 1.    (5)

The ONORM branches combine to reduce cancellation error by using the inner region of derf and the outer region of derfc; this can be observed in Figures 3 and 4. However, having chosen our branch point as x = 1.4 rather than x = 0.47, we observe in the range −1.4 < x < −0.47 a small increase in the ONORM maximum relative error, to 3.23 × 10⁻¹⁶.

We also compare the accuracy of ONORM against the NV and Cody algorithms. Comparative relative error plots are shown in Figures 5 and 6. Over the range −22 < x < 9 the NV, Cody and ONORM algorithms exhibit comparable relative error.

Figure 5: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range −22 < x < 9. Lower values depict higher relative error.

Figure 6: Maximum bound of relative error RE for the NV, Cody and ONORM algorithms for the range [±2]. Lower values depict higher relative error.

As seen in Figure 6, within the inner region [±1.4], ONORM was inferior to both NV and Cody. It is apparent therefore that the inner region of ONORM can be improved in terms of speed and accuracy by utilising the error function approximation within NV-Erf. The maximum absolute errors of NV, Cody and ONORM were all less than 1.56 × 10⁻¹⁶.

3. IMPLIED VOLATILITY ON THE GPU

The evaluation of Black [3] style implied volatility occurs in many branches of computational finance. The Black formula (which is closely related to the seminal contribution in derivative pricing, the Black Scholes formula [4]) calculates an option price V as a function V(S, K, σ, T), where S is the underlying asset value, K is the strike value, σ is the underlying asset volatility and T is the time to maturity. The implied volatility calculation is based on a simple formula inversion, where the implied volatility σi is now a function σi(Vm, S, K, T), where Vm is an observed option price. Due to the absence of analytic methods to calculate implied volatility, an iterative root-finding method is typically employed to find σi such that:

V(S, K, σi, T) − Vm = 0.    (6)

         Thread block size g
δ        BS    16    32    64    128   512
10⁻⁶     27    7     6     5     4     3
10⁻⁸     34    9     7     6     5     4
10⁻¹⁰    40    10    8     7     6     5
10⁻¹²    47    12    10    8     7     6
10⁻¹⁴    54    14    11    9     8     6

Table 2: Iterations needed, sp, to calculate implied volatility based on a parallel grid search. Domain size d = 100. δ represents search accuracy. BS represents a binary search (sb).

The function V(·) appears well suited to efficient root finding based on Newton's method; for example, the function is monotonically increasing in σ, has a single analytic inflexion point with respect to σ, and has an analytic expression for its derivative with respect to σ (∂V/∂σ). However, within the context of implied volatility, ∂V/∂σ can tend to 0, resulting in non-convergence [12]. Predominant methods to evaluate implied volatility are therefore typically based on Newton with bisection or Brent-Dekker algorithms [16], the latter being preferred. GPU based evaluation of these functions (particularly Brent-Dekker) can result in a loss of performance. Firstly, the high register usage of these functions is generally suboptimal on the light-weight thread architecture of GPUs. Secondly, the algorithms may execute in a substantially increased runtime due to conditional branch points coupled with an unknown number of iterations to convergence. Finally, numerous contexts in computational finance (including the CMS spread option model) are concerned with obtaining the implied volatility of small groups of options, hence single-thread algorithms such as Newton and Brent-Dekker can result in severe GPU under-utilisation (assuming sequential kernel launching).

We therefore develop a parallel grid search algorithm that has the following properties: it uses a minimal number of low throughput functions, it is branchless, it executes in a fixed time frame, and it can be used to target optimum processor utilisation when the number of implied volatility evaluations is low.

The parallel grid search algorithm operates on the following principles: we assume the domain of σi is of size d and the required accuracy of σi is δ. The required accuracy can be guaranteed by searching over u units, where

u = d / δ.    (7)

Using a binary search method (which halves the search interval with each iteration) the number of required iterations sb is given by:

sb = ⌈log₂ u⌉,    (8)

where ⌈·⌉ is the ceiling function and i = ⌈x⌉ is the smallest integer i such that i ≥ x.

Alternately a parallel grid search can be employed using a GPU 'thread block' with g threads and g search areas (a thread block permits groups of threads to communicate via a 'shared memory' space). Using a parallel grid search, the number of required iterations sp is given by:

sp = ⌈log_g u⌉.    (9)


Figure 7: Calculations per second of a parallel grid search implied volatility for various thread block sizes (16, 32, 64, 96 and 128 threads). The number of implied volatility calculations is equal to the number of thread blocks.

Figure 8: Time taken for the evaluation of a parallel grid search implied volatility for various thread block sizes (16, 32, 64, 96 and 128 threads). The number of implied volatility calculations is equal to the number of thread blocks.

The number of iterations needed against various thread block sizes g to guarantee a given accuracy δ over a given domain size d can be estimated a priori, of which an example is shown in Table 2.
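For instance, a host-side helper of the following form (a minimal sketch; the function name is ours) reproduces the iteration counts of Table 2 from (7)-(9):

#include <math.h>

/* Iterations needed by a grid search over g points per iteration, for
   domain size d and required accuracy delta: ceil(log_g(d/delta)).
   g = 2 gives the binary search count (8); g > 2 gives (9). */
static int grid_search_iterations(double d, double delta, int g)
{
    double u = d / delta;                      /* units to resolve, (7) */
    return (int)ceil(log(u) / log((double)g));
}

/* Example: grid_search_iterations(100.0, 1e-6, 32) returns 6, matching
   the g = 32 column of Table 2. */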

The parallel grid search algorithm will thus calculate the implied volatility of b options by launching b thread blocks, with each thread block having g threads. It is noted that while numerous GPU algorithms seek to maximise thread execution efficiency (since the number of threads is fixed), we are primarily concerned with maximising thread block execution efficiency (since the number of thread blocks b, i.e. the number of options, is fixed). We offer a brief outline of the key steps within a CUDA implementation:

1. Within our listing we use the following example parameters which can be passed directly from the CPU, where base is a preprocessed variable to enable the direct use of high throughput bitshifting on the GPU and where the thread block size blocksize (g) is assumed to be a power of two:

left  = 1.0e-9;   // minimum volatility
delta = 1.0e-12;  // desired accuracy
base  = log2f(blocksize);  // blocksize is a power of two; base is stored as an integer

2. We begin a for loop over the total iterations iter as calculated in (9). The integer mult is used to collapse the grid size with each iteration; its successive values are g^(iter−1), g^(iter−2), ..., g^1, g^0. Subsequently we calculate our volatility guess vol (σi) and the error err as given by the left hand side of (6):

for (int i = 0; i < iter; i++)
{
    mult = 1 << (base * (iter - 1 - i));

    // vol guess over the entire interval
    vol = left + delta * mult * threadIdx.x;

    // calculation of error: V(vol, ...) - V_m
    err = price(vol, ...) - price_m;

The volatility guess now covers the entire search interval. If in (9) u is chosen not to be a power of g, our first iteration overstates our initial search size d (which extends out to the right of the left interval left). This does not affect the algorithm's accuracy and minimally affects performance due to a partially redundant additional iteration. Special care must be taken to ensure mult does not overflow due to excessive bitshifts; in our final implementation we avoided this problem by using additional integer multiples (e.g. mult2, mult3).

3. We found it optimal to declare separate shared memory arrays (prefixed sh_) to store the absolute error and the sign of the error. This prevents excessive usage of the absolute function fabs within the reduction. A static index is also populated to provide the left interval location for the next iteration:

// absolute error
sh_err[threadIdx.x] = fabs(err);

// sign of error
sh_sign[threadIdx.x] = signbit(err);

// static index
sh_index[threadIdx.x] = threadIdx.x;

4. After a parallel reduction to compute the index of the minimum absolute error (stored in sh_index[0]; a sketch of such a reduction is given after these steps), the left bracket is computed by checking the sign of the minimum error location using properties of the signbit function:

// V(vol, ...) - V_m > 0 at the minimum-error grid point
if (!sh_sign[sh_index[0]])
    left = left + (sh_index[0] - 1) * delta * mult;
// V(vol, ...) - V_m < 0 at the minimum-error grid point
else
    left = left + sh_index[0] * delta * mult;
}
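The parallel reduction referred to in step 4 is not listed in the paper itself; a minimal sketch of the kind of shared-memory argmin reduction assumed there (using the sh_err and sh_index arrays from step 3 and a power-of-two block size) could be:

// shared-memory argmin reduction: afterwards sh_index[0] holds the
// index of the smallest absolute error within the thread block
__syncthreads();
for (int s = blockDim.x / 2; s > 0; s >>= 1)
{
    if (threadIdx.x < s && sh_err[threadIdx.x + s] < sh_err[threadIdx.x])
    {
        sh_err[threadIdx.x]   = sh_err[threadIdx.x + s];
        sh_index[threadIdx.x] = sh_index[threadIdx.x + s];
    }
    __syncthreads();
}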

Our results showing the calculations per second for various thread block sizes are shown in Figure 7, where a number of effects are visible. Firstly, consider thread block sizes of 64, 96 and 128 threads. As we increase the number of options for which we compute implied volatility, the calculations per second become a strong linear function of the number of thread blocks launched, exhibited in the plateauing effect to the right of Figure 7. Secondly, consider thread block sizes of 16 and 32 threads. It is observed that these kernels maintain load imbalances where additional calculations can be undertaken at no additional cost. In our example peak performance was achieved by thread blocks of size 32. The lower peaks associated with 16 threads (as opposed to 32 threads) are a consequence of 16 threads requiring an additional two iterations, as given by (9).

By studying Figure 8 we observe how load imbalances are linked to the 'number of passes' taken through the GPU. We introduce this phenomenon as follows: the GPU used in this study consisted of 14 multiprocessors each accommodating up to 8 thread blocks, thus a maximum of 14 × 8 = 112 thread blocks are active on the GPU at any given instant. In our implementation this was achieved with 16, 32 and 64 threads per block, where we observe an approximate doubling of execution time as we vary from 112 to 113 total thread blocks. This is due to the algorithm scheduling an additional 'pass' through the GPU. Focusing on larger thread block sizes we see that a similar 'pass' effect is observed, limited to the left hand side of Figure 8. Due to the large differences in time, the number of passes should be carefully assessed before implementing this algorithm. The number of passes can be estimated as follows:

Number of passes = ⌈ b / (Tp · NM) ⌉,    (10)

where b is the total number of thread blocks requiring evaluation, Tp is the total number of thread blocks scheduled on each multiprocessor and NM is the total number of multiprocessors on the GPU. Tp should be obtained by hardware profiling tools, as the hardware may schedule different numbers of thread blocks per multiprocessor than would be expected by a static analysis of processor resources.
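As a small illustration of (10) (a sketch; the function name is ours, and Tp is assumed to come from profiling), an integer-ceiling helper reproduces the pass counts of Table 6:

/* Number of passes through the GPU, per (10). On the M2070 used here
   NM = 14; Tp must be obtained from hardware profiling. */
static int number_of_passes(int b, int Tp, int NM)
{
    return (b + Tp * NM - 1) / (Tp * NM);   /* integer ceiling */
}

/* Example: with Tp * NM = 82 and b = 192 this gives 3 passes, and with
   Tp * NM = 49 and b = 96 it gives 2 passes, as in Table 6. */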

In order to provide a comparison against a single-thread algorithm on the GPU (that is, the number of implied volatility calculations is equal to the number of GPU threads launched) we implement a parsimonious representation of Newton's method (we preprocess input data to ensure convergence). Although the number of Newton iterations may vary substantially, by comparing Figures 8 and 9 we conclude that the parallel grid search algorithm is likely to offer a comparable runtime when the number of implied volatility evaluations is of order 100.

Within our parallel grid search algorithm the size of the thread block is equal to the size of the parallel grid g. An alternate algorithm would instead accommodate multiple parallel sub-grids within a single thread block (with each sub-grid evaluating a single implied volatility). An increase in the number of sub-grids per thread block would result in a decrease in the number of blocks needed for evaluation whilst increasing the number of iterations and the amount of control flow needed. Such an approach is advantageous when dealing with large numbers of implied volatility evaluations. For instance, using Figure 8, when the number of evaluations is between 113 and 224, a thread block of 64 threads makes two passes; however, if the thread block were split into two parallel sub-grids (of 32 threads) only one pass would be needed. Thus, in this instance, the total execution time would approximately halve. The use of a single-thread algorithm is ultimately an idealised version of this effect where each sub-grid is effectively reduced to a single thread, drastically reducing the number of passes, offset against the increased time and complexity of evaluating on a single thread.

Figure 9: Time taken for the evaluation of ideal Newton gradient descent implied volatility; the lowest line represents five iterations, lines increment by one iteration, and the highest line represents 28 iterations. Results are obtained from the execution of a single thread block.

4. MATHEMATICAL MODEL FOR CMS SPREAD OPTION PRICING

The CMS spread option price is calculated by firstly estimating stochastic processes to describe the two underlying CMS rates C1, C2 and secondly using such stochastic processes to obtain (via a copula simulation) the expected final payoff [C1 − C2 − K]+. These two steps are repeated for each interval start date and the option price is subsequently evaluated by summing the discounted expected payoffs relating to each interval payment date. We describe the model evaluation in more detail using the following steps:

1. As stated, we first require a stochastic process to describe the underlying CMS rate. The first step to obtaining this process is to calculate the CMS rate C itself using the put-call parity rule. This calculates the CMS rate C as a function of the price of a CMS call option (CallK), a CMS put option (PutK) and a strike value K. We set K equal to an observable forward swap rate. Put-call parity results in the value of C as:

C = CallK − PutK + K.    (11)

In order to calculate the price of the CMS options (CallK, PutK) we follow a replication argument [8] whereby we decompose the CMS option price into a portfolio of swaptions R(k) which are evaluated at different strike values k (swaptions are options on a swap rate for which we have direct analytic expressions). The portfolio of swaptions can be approximated by an integral [8] of which the main terms are:

CallK ≈ ∫_K^∞ R(k) dk,    (12)


PutK ≈ ∫_{−∞}^{K} R(k) dk.    (13)

2. With the CMS rate C captured we also require information regarding the volatility smile effect [9]. The volatility smile describes changing volatilities as a result of changes in an option's strike value K. We therefore evaluate CMS call options (CallK) at various strikes surrounding C and calculate the corresponding option implied volatilities.

3. We calibrate a stochastic process incorporating the above strikes, prices and implied volatilities. We thus obtain unique stochastic processes, expressing the volatility smile effect, for each of the underlying rates C1, C2. The stochastic process is typically based on a SABR class model [9].

4. The price of a spread option contract based on two underlyings C1, C2 with the payoff [C1 − C2 − K]+ is:

∫∫_A [C1 − C2 − K]+ f(C1, C2) dC1 dC2,    (14)

where f(C1, C2) is a bivariate density function of both underlyings and A is the range of the given density function. Obtaining the bivariate density function is non-trivial and a standard approach is to instead calculate (14) using copula methods [6]. The copula method allows us to estimate a bivariate density function f(C1, C2) through a copula C. The copula is a function of the component univariate marginal distributions F1, F2 (which can be directly ascertained from our stochastic processes for C1, C2) and a dependency structure ρ (for example a historical correlation between C1, C2). The price of a spread option can thus be given as:

∫₀¹ ∫₀¹ [F1⁻¹(U1) − F2⁻¹(U2) − K]+ C(U1, U2) dU1 dU2,    (15)

where (U1, U2) are uniformly distributed random numbers on the unit square.

The integral is subsequently approximated by a copula simulation. This involves firstly obtaining a set of N two-dimensional uniformly distributed random numbers (U1, U2), secondly incorporating a dependency structure between (U1, U2) and finally obtaining a payoff by using the inverse marginal distributions F1⁻¹, F2⁻¹ on (U1, U2). The final result is the average of the N simulated payoffs (a rough sketch of this simulation step is given below).
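As an illustration of the simulation step only (not the paper's implementation: a Gaussian copula is assumed, the inverse marginals are taken as arbitrary callables, and a basic Box-Muller generator stands in for a proper random number source), a host-side sketch is:

#include <math.h>
#include <stdlib.h>

/* Standard normal CDF via the complementary error function. */
static double norm_cdf(double x)
{
    return 0.5 * erfc(-x * 0.70710678118654752440);
}

/* Gaussian copula Monte Carlo estimate of E[(C1 - C2 - K)^+].
   inv_F1, inv_F2 are the inverse marginal distributions (in the model
   these come from interpolated lookup tables); rho is the dependency
   parameter; N is the number of simulations. */
static double spread_option_copula(double (*inv_F1)(double),
                                   double (*inv_F2)(double),
                                   double rho, double K, long N)
{
    double sum = 0.0;
    for (long i = 0; i < N; i++)
    {
        /* two independent standard normals via Box-Muller */
        double a = ((double)rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double b = ((double)rand() + 1.0) / ((double)RAND_MAX + 2.0);
        double z1 = sqrt(-2.0 * log(a)) * cos(6.283185307179586 * b);
        double z2 = sqrt(-2.0 * log(a)) * sin(6.283185307179586 * b);

        /* impose the dependency structure, then map to the unit square */
        double u1 = norm_cdf(z1);
        double u2 = norm_cdf(rho * z1 + sqrt(1.0 - rho * rho) * z2);

        /* payoff from the inverse marginals */
        double payoff = inv_F1(u1) - inv_F2(u2) - K;
        sum += payoff > 0.0 ? payoff : 0.0;
    }
    return sum / (double)N;
}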

5. GPU MODEL IMPLEMENTATION

The mathematical steps of the previous section consist of four dominant computational tasks (in terms of the time taken for computation). Based on these tasks it is instructive to relabel our model into the following stages:

Integration An integration as shown in (12) and (13), relating to steps 1 and 2 in Section 4.

Figure 10: Model flowchart for the CMS spread option.

Calibration A calibration to obtain CMS rate processes, relating to step 3 in Section 4. This is not implemented within our GPU model.

Marginals The creation of lookup tables that represent discretised univariate marginal distributions F1, F2 for an input range of C1, C2. This allows us to evaluate the inverse marginal distributions F1⁻¹, F2⁻¹ on (U1, U2) through an interpolation method (sketched after this list), relating to step 4 in Section 4.

Copula Simulation of the copula based on (15), relating to step 4 in Section 4.
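The interpolation itself is not specified in the paper; a minimal sketch of a device-side inverse lookup on a discretised marginal (table layout and function name are ours) might be:

/* Inverse of a discretised marginal distribution by linear interpolation.
   F is a monotonically increasing table of CDF values at n equally spaced
   rate points between lo and hi; u lies in (0, 1). */
__device__ double inverse_marginal(const double* F, int n,
                                   double lo, double hi, double u)
{
    /* binary search for the bracket F[left] <= u < F[left + 1] */
    int left = 0, right = n - 1;
    while (right - left > 1)
    {
        int mid = (left + right) >> 1;
        if (F[mid] <= u) left = mid; else right = mid;
    }
    double step = (hi - lo) / (n - 1);
    double w = (u - F[left]) / (F[right] - F[left]);   /* linear weight */
    return lo + ((double)left + w) * step;
}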

When presenting the timing results of each of the above stages we include the cost of CPU-GPU memory transfers. A flowchart describing the evaluation of the CMS spread option model is shown in Figure 10, within which we have an additional CPU operation (Load Data) which loads a set of market data for the calibration stage. This function is overlapped with part 2 of the integration stage, hence there is no time penalty associated with this operation. As a result of our GPU implementation we obtain the performance results shown in Table 3. Within our results we set the number of intervals or start dates t = 96; more generally t is in the range [40, 200]. The speed-up impact of changing t is minimal since our benchmark model is a strong linear function of t, as are the calibration and copula stages which account for 96.9% of the execution time within our final GPU implementation. Results are based on an M2070 GPU and an Intel Xeon L5640 CPU with clock speed 2.26 GHz running on a single core. Compilation is under Microsoft Visual Studio with compiler flags for debugging turned on for both the CPU and GPU implementations. Preliminary further work suggests that the use of non-debug compiled versions results in a significantly larger proportional time reduction for the GPU implementation, indicating that the stated final speed-up results are an underestimate.


        Time    Speed-up    Main Kernel Stats (%)
        (ms)    (x)         Time    Replays   L1 Hit
V1      15.41   22.87       38.01   5.47      54.12
V2      9.50    37.10       41.78   5.78      53.18
V3      7.25    48.59       21.43   0.13      92.80

Table 4: Integration results: V1 = separate underlying evaluation, V2 = combined underlyings, V3 = preprocessing stage. 'Replays' and 'L1 hit' refer to local memory.

        Time    Speed-up    Main Kernel Stats (%)
        (ms)    (x)         Time    Replays   L1 Hit
V1      5.82    48.38       61.58   10.33     58.73
V2      4.76    59.17       75.76   10.33     58.73
V3      1.43    197.30      43.35   0.00      99.97
V4      1.35    207.84      40.33   0.04      96.77

Table 5: Marginal results: V1 = separate underlying evaluation, V2 = combined underlyings, V3 = preprocessing stage, V4 = optimum thread block size. 'Replays' and 'L1 hit' refer to local memory.

Within the context of our CMS model, the underestimate is considered negligible since the final speed-up is sufficiently close to Amdahl's [2] theoretical maximum. Our GPU model targets the integration, marginals and copula stages, accounting for 91.5% of the benchmark model runtime.

5.1 Optimisation examples and results

Our analysis focuses primarily on the integration and marginals stages, as the copula stage offered fewer avenues of optimisation due to its relative simplicity. The 'main kernel', or the GPU function that dominates runtime, is similar for both the integration and marginals stages. The main kernel is respectively used for the pricing of swaptions and the calculation of marginal distributions, both of which require the evaluation of SABR [9] type formulae. The performance bottleneck within our main kernel was the high number of arithmetic instructions; this was determined through a profiling analysis and timing based on code removal.

Within the integration and marginals stages we identify a grid of size t × 2 × n, where t is the number of start dates, 2 is the number of underlyings and n is the size of the integration grid or the number of points within the discretised marginal distribution F. Within the grid of t × 2 × n we observe significant data parallelism and target this grid as a base for our GPU computation. For the integration stage we set n = 82; more generally n is in the range [50, 100]. For the marginals stage we set n = 512; more generally n is in the range [250, 1000]. Our choice of n is based on error considerations outside the scope of this paper. We briefly describe a set of optimisation steps that were relevant in the context of our model; corresponding results are shown in Tables 4 and 5:

1. In our first implementation (V1) we parallelise calculations on a grid of size t × n and sequentially evaluate each underlying. We use kernels of two sizes: Type I kernels are of size t (which launch on a single thread block with t threads) and Type II kernels are of size t × n (which launch on t thread blocks each with n threads). Within the GF100 architecture we are limited to 1024 threads per block, thus we must ensure that n and t are ≤ 1024.

Figure 11: An illustration of occupancy affecting kernel execution time, when the number of passes given by (10) is very low (< 3).

2. In our second implementation (V2) we combine both underlyings such that Type I kernels are now of size t × 2 (which launch on a single thread block with t × 2 threads) and Type II kernels are of size t × 2 × n (which launch on t thread blocks each with n × 2 threads). For Type I kernels we found the additional underlying evaluation incurred no additional cost due to under-utilisation of the processor, thus reducing Type I kernel execution times by approximately half (assuming sequential kernel launching). For the main kernel in the integration stage (a Type II kernel) we found that a similar doubling of thread block sizing (from 82 to 164 threads) led to changed multiprocessor occupancy. As shown in Figure 11, small changes in processor occupancy can amplify performance differences when the number of passes given by (10) is small (this effect was also described in Section 3). As a consequence, we see in Table 6 a significant reduction in the execution time of our main kernel. Type II kernels in the marginals stage were not combined as this led to unfeasibly large thread block sizes and a loss of performance.

3. In our third implementation (V3) we undertook a preprocessing step to simplify the SABR type evaluations used by the main kernels. Although formulae will vary based on the particular SABR class model being employed, the inner-most code will typically involve several interdependent low throughput transformations of an underlying asset value or rate S, strike K, volatility σ and time to maturity T, in order to calculate a set of expressions of the form (as found in the Black [3] formula):

Φ( (ln(S/K) + (σ²/2)T) / (σ√T) ).    (16)


Stage          Benchmark Time         Final Time             Speed-up   Kernels
               (ms)        (%)        (ms)        (%)        (x)        Launched
Integration    352.45      12.30      7.25        2.60       48.59      45
Calibration    243.74      8.51       243.74      87.34      1.00       N/A
Marginals      281.49      9.82       1.35        0.49       207.84     6
Copula         1,987.40    69.37      26.73       9.58       74.34      1
Overall        2,865.08    100        279.08      100        10.27      52

Table 3: Benchmark and final performance results of the CMS spread option pricing model, where the number of intervals or start dates t = 96.

                                     V1        V2
Total evaluations, t × 2 × n         15744     15744
Thread block size                    82        164
Total blocks, b                      192       96
Occupancy                            0.313     0.375
Possible active blocks, Tp × NM      82        49
Passes needed by (10)                3         2
Measured time reduction              N/A       -32.2%

Table 6: Time reduction of the integration stage 'main kernel' through changes in the number of passes.

Within each grid of size n, parameter changes are strictly driven by changes in a single variable K. Within our optimisation efforts we therefore move our interdependent transformations from the grid of size t × 2 × n to a smaller preprocessing kernel with grid size t × 2. Hence, assuming we were to evaluate (16), our preprocessing kernel would calculate the terms A = ln(S) and B = σ√T. This would result in the idealised computational implementation of (16) as:

normalcdf((A - log(K))/B + 0.5*B).    (17)

As such the inner-most code conducts significantly fewer low throughput operations, resulting in a performance gain (a sketch of this two-kernel split is given after these steps). Such a preprocessing step can also be used to improve CPU performance. Within the GPU implementation, the preprocessing approach increased the amount of data needed by kernels from high latency global memory. However, since our main kernels have arithmetic bottlenecks, the additional memory transactions had little effect on computation time. This is in contrast to numerous GPU algorithms which have memory bottlenecks and for which additional memory transactions are likely to affect computation time. As a result of our preprocessing we observed a large reduction in local memory replays and an increase in the L1 local hit ratio, which we define below.

High levels of complex computation can often result in a single GPU thread running out of allocated registers used to store intermediate values. An overflow of registers is assigned to the local memory space, which has a small L1 cache; misses to this cache result in accesses to higher latency memory. Within our results we therefore wish to maximise the L1 local hit ratio, that is, the proportion of loads and stores to local memory that reside within the L1 cache. However, if the number of instructions associated with local memory cache misses is insignificant in comparison to the total number of instructions issued, we can somewhat ignore the effect of a low L1 local hit ratio. Therefore we also measure the local memory replays, that is, the number of local memory instructions caused by misses to the L1 cache as a percentage of the total instructions issued.

           Time (ms)   Speed-up (x)
Ver 1      4.56        436.27
Ver 2      18.51       107.39
Ver 3      12.03       165.22
Ver 4      26.73       74.34

Table 7: Copula results: Ver 1 = inverse normal CDF only, Ver 2 = Ver 1 + normal CDF, Ver 3 = Ver 1 + interpolation, Ver 4 = all components.

4. In our fourth implementation (V4) of the marginals stage we experimented with different sizes of thread blocks; as a result we obtained a small reduction in main kernel execution times (to 88% of V3). Integration stage kernels did not benefit from further optimisation.
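A minimal sketch of the two-kernel split referred to in step 3 (kernel, array and parameter names are ours, and the actual SABR preprocessing involves more terms than the A and B of (16)) is:

// Preprocessing kernel over the t x 2 grid: one thread per
// (start date, underlying) pair computes the strike-independent
// terms A = ln(S) and B = sigma * sqrt(T) once.
__global__ void preprocess(const double* S, const double* sigma,
                           const double* T, double* A, double* B,
                           int size)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < size)
    {
        A[j] = log(S[j]);
        B[j] = sigma[j] * sqrt(T[j]);
    }
}

// Inner-most evaluation over the t x 2 x n grid: only one log and one
// division remain per strike, as in (17); onorm is listed in Appendix A.
__device__ double black_term(double A, double B, double K)
{
    return onorm((A - log(K)) / B + 0.5 * B);
}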

As a result of our integration and marginals stage optimisations we observed speed-ups of 48.59x and 207.84x respectively against our benchmark implementation.

Within the copula stage we targeted a parallel grid consisting of the number of simulations N, which we launched sequentially for each start date t. The sequential evaluation was justified as N is a very large multiple of the total number of possible parallel threads per GPU. Within this stage we were limited by two numerical tasks: (1) the evaluation of the normal CDF and (2) a linear interpolation. The extent of these limitations is shown in Table 7, which presents timing results for different versions which implement only part of the algorithm. The evaluation of the normal CDF dominated and was conducted by the ONORM algorithm presented in Section 2. In regards to the interpolation, we were unable to develop algorithms that suitably improved performance. In particular we were unable to benefit from hardware texture interpolation, which is optimised for single precision contexts. Our final copula implementation resulted in a speed-up of 74.3x. The final GPU based model obtained an overall 10.27x speed-up, which is 87.3% of Amdahl's [2] maximum theoretical speed-up of 11.76x.


6. CONCLUSIONS

Calculation of the normal CDF through our proposed ONORM algorithm is well suited to the GPU architecture. ONORM exhibits comparable accuracy against the widely adopted Cody algorithm whilst being faster; thus it is likely to be the algorithm of choice for double precision GPU based evaluation of the normal CDF. The algorithm can be further improved by using the NV-Erf algorithm in the inner range.

Our parallel grid search implied volatility algorithm is applicable to GPUs when dealing with small numbers of implied volatility evaluations. The algorithm is robust, guarantees a specific accuracy and executes in a fixed time frame. For larger groups of options, the algorithm is unsuitable as computation time will grow linearly at a much faster rate than GPU alternatives which use a single thread per implied volatility calculation.

Within our GPU based CMS spread option model we highlighted the importance of managing occupancy for kernels with a low number of passes, whilst also obtaining a particular performance improvement through the use of preprocessing kernels. In our experience industrial models do not preprocess functions, due to concerns such as ease of maintenance and avoiding obfuscation, a position which needs to be challenged for GPU performance.

Further work will consider calibration strategies, traditionally problematic on GPUs due to the sequential nature of calibration algorithms (consisting of multi-dimensional optimisation). Further work will also consider the wider performance implications of GPU algorithms within large pricing infrastructures found within the financial industry.

7. ACKNOWLEDGEMENTS

I acknowledge the assistance of Ian Eames, Graham Barrett and anonymous reviewers. The study was supported by the UK PhD Centre in Financial Computing, as part of the Research Councils UK Digital Economy Programme, and BNP Paribas.

8. REFERENCES

[1] A. Adams. Algorithm 39: Areas under the normal curve. Computer Journal, 12(2):197-198, 1969.

[2] G. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, pages 483-485, 1967.

[3] F. Black. The pricing of commodity contracts. Journal of Financial Economics, 3(1-2):167-179, 1976.

[4] F. Black and M. Scholes. The pricing of options and corporate liabilities. Journal of Political Economy, 81(3):637-654, 1973.

[5] W. Cody. Rational Chebyshev approximations for the error function. Mathematics of Computation, 23(107):631-637, 1969.

[6] S. Galluccio and O. Scaillet. CMS spread products. In R. Cont, editor, Encyclopedia of Quantitative Finance, volume 1, pages 269-273. John Wiley & Sons, Chichester, UK, 2010.

[7] C. Gerald and P. Wheatley. Applied Numerical Methods. Addison-Wesley, Reading, MA, 2004.

[8] P. Hagan. Convexity conundrums: pricing CMS swaps, caps and floors. Wilmott Magazine, 4:38-44, 2003.

[9] P. Hagan, D. Kumar, and A. Lisniewski. Managing smile risk. Wilmott Magazine, 1:84-108, 2002.

[10] J. Hart, E. Cheney, C. Lawson, H. Maehly, C. Mesztenyi, J. Rice, H. Thacher Jr, and C. Witzgall. Computer Approximations. John Wiley & Sons, New York, NY, 1968.

[11] I. Hill. Algorithm AS 66: The normal integral. Applied Statistics, 22(3):424-427, 1973.

[12] P. Jackel. By implication. Wilmott Magazine, 26:60-66, 2006.

[13] G. Marsaglia. Evaluating the normal distribution. Journal of Statistical Software, 11(5):1-11, 2004.

[14] NVIDIA Corp. CUDA Toolkit 4.1. [Online] Available: http://developer.nvidia.com/cuda-toolkit-41, 2012.

[15] T. Ooura. Gamma / error functions. [Online] Available: http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html, 1996.

[16] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes. Cambridge University Press, New York, NY, 2007.

APPENDIX

A. LISTING OF THE ONORM ALGORITHM

/* Based on the derf and derfc functions of Takuya OOURA
   (email: ooura@mmm.t.u-tokyo.ac.jp)
   http://www.kurims.kyoto-u.ac.jp/~ooura/gamerf.html */

__device__ inline double onorm(double x)
{
    double t, y, u, w;

    x *= 0.7071067811865475244008;   /* scale by 1/sqrt(2) */
    w = x < 0 ? -x : x;

    if (w < 1)
    {
        /* inner region: innermost Ooura derf polynomial */
        t = w * w;
        y = ((((((((((((5.958930743e-11 * t - 1.13739022964e-9) * t
            + 1.466005199839e-8) * t - 1.635035446196e-7) * t
            + 1.6461004480962e-6) * t - 1.492559551950604e-5) * t
            + 1.2055331122299265e-4) * t - 8.548326981129666e-4) * t
            + 0.00522397762482322257) * t - 0.0268661706450773342) * t
            + 0.11283791670954881569) * t - 0.37612638903183748117) * t
            + 1.12837916709551257377) * w;
        y = 0.5 + 0.5 * y;
    }
    else
    {
        /* outer region: Ooura derfc; the sign of x is flipped so that
           the shared symmetry test below returns 1 - y for this branch */
        x = -x;
        t = 3.97886080735226 / (w + 3.97886080735226);
        u = t - 0.5;
        y = ((((((((((((((((((((((0.00127109764952614092 * u
            + 1.19314022838340944e-4) * u - 0.003963850973605135) * u
            - 8.70779635317295828e-4) * u + 0.00773672528313526668) * u
            + 0.00383335126264887303) * u - 0.0127223813782122755) * u
            - 0.0133823644533460069) * u + 0.0161315329733252248) * u
            + 0.0390976845588484035) * u + 0.00249367200053503304) * u
            - 0.0838864557023001992) * u - 0.119463959964325415) * u
            + 0.0166207924969367356) * u + 0.357524274449531043) * u
            + 0.805276408752910567) * u + 1.18902982909273333) * u
            + 1.37040217682338167) * u + 1.31314653831023098) * u
            + 1.07925515155856677) * u + 0.774368199119538609) * u
            + 0.490165080585318424) * u + 0.275374741597376782) * t * 0.5;
        y = y * exp(-x * x);
    }

    return x < 0 ? 1 - y : y;
}