Transcript of "Monte Carlo Simulation and its Efficient Implementation"
Robert Tong
28 January 2010
Experts in numerical algorithms and HPC services
Introduction to NAG
- Founded in 1970 as a co-operative project in the UK
- Operates as a commercial, not-for-profit organization
- Worldwide operations: Oxford & Manchester (UK), Chicago (US), Tokyo (Japan), Taipei (Taiwan)
- Over 3,000 customer sites worldwide
- Staff of ~100: over 50% technical, over 25 PhDs
- £7m+ financial turnover
Portfolio
- Numerical Libraries: highly flexible for use in many computing languages, programming environments and hardware platforms, and for high performance computing methods
- Connector Products for Excel, MATLAB and Maple: giving users of Excel and the mathematical software packages MATLAB and Maple access to NAG's library of highly optimized and often superior numerical routines, allowing easy integration
- NAG Fortran Compiler, and the GUI-based Windows compiler Fortran Builder
- Visualization and graphics software: build data visualization applications with NAG's IRIS Explorer
- Consultancy services
Outline
- Why use Monte Carlo simulation?
- Higher order methods and convergence
- GPU acceleration
- The need for numerical libraries
Why use Monte Carlo methods?
- Essential for high dimensional problems – many degrees of freedom
- For applications with uncertainty in inputs
- In finance: important in risk modelling, pricing/hedging derivatives
The elements of Monte Carlo simulation
- Derivative pricing: simulate a path of asset values, compute the payoff from the path, compute the option value
- Numerical components: pseudo-random number generator, discretization scheme
The demand for ever increasing performance
- In the past, faster solution was provided by increasing processor speeds. Want a quicker solution? Buy a new processor.
- At present, multi-core/many-core architectures arrive without increased processor clock speeds – a major challenge for existing numerical algorithms.
- The escalator has stopped... or gone into reverse! Existing codes may well run slower on multi-core.
Ways to improve performance in Monte Carlo simulation
1. Use higher order discretization
2. Keep low order (Euler) discretization – make use of multi-core potential, e.g. GPU (Graphics Processing Unit)
3. Use high order discretization on GPU
4. Use quasi-random sequence (Sobol', ...) and Brownian Bridge
5. Implement Sobol' sequence and Brownian Bridge on GPU
Higher order methods – 1 (work by Kai Zhang, University of Warwick, UK)
Higher order methods – 2
Numerical example – 1
Numerical example – 1a
Numerical example – 1b
Numerical example – 2a
Numerical example – 2b
(These slides show figures; only their titles survive in this transcript.)
GPU acceleration
- Retain low order Euler discretization
- Use multi-core GPU architecture to achieve speed-up
The Emergence of GPGPU Computing
- Initially, computation was carried out by the CPU (scalar, serial execution)
- The CPU evolved to add cache, SSE instructions, ...
- The GPU was added to speed up graphics display, driven by gaming needs: multi-core, SIMT, limited flexibility
- CPU and GPU have moved closer: the CPU has become multi-core, and the GPU has become General Purpose (GPGPU) – fully programmable
Current GPU architecture – e.g. NVIDIA Tesla
Tesla – processing power
- SM – Streaming Multiprocessor:
  - 8 x SP (Streaming Processor core)
  - 2 x Special Function Unit
  - MT – multithreaded instruction fetch and issue unit
  - Instruction cache
  - Constant cache (read only)
  - Shared memory (16 KB, read/write)
- C1060 adds double precision: 30 double precision cores, 240 single precision cores
Tesla C1060 memory (from M A Martinez-del-Amor et al. (2008), based on E Lindholm et al. (2008))
(figure not reproduced in this transcript)
Programming GPUs – CUDA and OpenCL
- CUDA (Compute Unified Device Architecture, developed by NVIDIA):
  - An extension of C to enable programming of GPU devices
  - Allows easy management of parallel threads executing on the GPU
  - Handles communication with the 'host' CPU
- OpenCL:
  - A standard language for multi-device programming
  - Not tied to a particular company
  - Will open up GPU computing
  - Incorporates elements of CUDA
First step – obtaining and installing CUDA
- Free download from http://www.nvidia.com/object/cuda_learn.html (see the Quickstart Guide)
- Requirements:
  - CUDA-capable GPU – GeForce 8, 9, 200 series; Tesla; many Quadro models
  - Recent version of the NVIDIA driver
  - CUDA Toolkit – the essential components to compile and build applications
  - CUDA SDK – example projects
- Update environment variables (Linux defaults shown):
  - PATH: /usr/local/cuda/bin
  - LD_LIBRARY_PATH: /usr/local/cuda/lib
- The CUDA compiler nvcc works with gcc (Linux) and MS VC++ (Windows)
Host (CPU) – Device (GPU) relationship
- The application program is initiated on the Host (CPU)
- Device 'kernels' execute on the GPU in SIMT (Single Instruction Multiple Thread) manner
- The Host program:
  - Transfers data from Host memory to Device (GPU) memory
  - Specifies the number and organisation of threads on the Device
  - Calls the Device 'kernel' as a C function, passing parameters
  - Copies output from the Device back to Host memory
Organisation of threads on the GPU
- An SM (Streaming Multiprocessor) manages up to 1024 threads
- Each thread is identified by an index
- Threads execute in warps of 32 threads
- Threads are grouped into blocks (the user specifies the number of threads per block)
- Blocks make up a grid
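The indexing rules above can be written out in plain C. The function names are ours; in CUDA the same arithmetic appears as blockIdx.x * blockDim.x + threadIdx.x.

```c
/* CPU illustration of CUDA thread identification: a thread's global
   index from its block/thread indices, and its warp within the block. */
enum { WARP_SIZE = 32 };

int global_thread_index(int block_idx, int thread_idx, int threads_per_block)
{
    /* same arithmetic as blockIdx.x * blockDim.x + threadIdx.x */
    return block_idx * threads_per_block + thread_idx;
}

int warp_within_block(int thread_idx)
{
    return thread_idx / WARP_SIZE;   /* threads 0-31 -> warp 0, 32-63 -> warp 1, ... */
}
```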
Memory hierarchy
- On the device:
  - Read/write per-thread: registers, local memory
  - Read/write per-block: shared memory
  - Read/write per-grid: global memory
  - Read-only per-grid: constant memory
- On the host (CPU):
  - Read/write per-grid: global memory, constant memory
CUDA terminology
- 'kernel' – a C function executing on the GPU
- __global__ declares a function as a kernel:
  - Executed on the Device
  - Callable only from the Host
  - void return type
- __device__ declares a function that is:
  - Executed on the Device
  - Callable only from the Device
Application to Monte Carlo simulation
- Monte Carlo paths lead to highly parallel algorithms
- Applications in finance, e.g. simulation based on an SDE for the return on an asset (drift + Brownian motion):

  dS_t = mu S_t dt + sigma S_t dW_t

- Requires a fast pseudo-random or quasi-random number generator
- Additional techniques improve efficiency: Brownian Bridge, stratified sampling, ...
Random Number Generators: choice of algorithm
- Must be highly parallel
- Implementation must satisfy statistical tests of randomness
- Some common generators do not guarantee randomness properties when split into parallel streams
- A suitable choice: MRG32k3a (L'Ecuyer)
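For reference, a serial plain-C sketch of the MRG32k3a recurrences, using the constants from L'Ecuyer's published generator. This is an illustration of the algorithm, not the NAG GPU code.

```c
#include <stdint.h>

/* MRG32k3a: two combined order-3 recurrences, double output. */
static const int64_t m1 = 4294967087LL;   /* 2^32 - 209   */
static const int64_t m2 = 4294944443LL;   /* 2^32 - 22853 */

typedef struct { int64_t x1[3], x2[3]; } mrg_state;

/* Seed all six state values with one nonzero seed < m2 (a simple
   seeding choice for illustration). */
void mrg_init(mrg_state *s, int64_t seed)
{
    for (int i = 0; i < 3; i++) { s->x1[i] = seed; s->x2[i] = seed; }
}

double mrg_next(mrg_state *s)
{
    /* x1_n = (1403580*x1_{n-2} - 810728*x1_{n-3}) mod m1 */
    int64_t p1 = (1403580LL * s->x1[1] - 810728LL * s->x1[0]) % m1;
    if (p1 < 0) p1 += m1;
    s->x1[0] = s->x1[1]; s->x1[1] = s->x1[2]; s->x1[2] = p1;

    /* x2_n = (527612*x2_{n-1} - 1370589*x2_{n-3}) mod m2 */
    int64_t p2 = (527612LL * s->x2[2] - 1370589LL * s->x2[0]) % m2;
    if (p2 < 0) p2 += m2;
    s->x2[0] = s->x2[1]; s->x2[1] = s->x2[2]; s->x2[2] = p2;

    /* combine the two streams modulo m1 */
    int64_t z = (p1 - p2) % m1;
    if (z <= 0) z += m1;
    return (double)z / (double)(m1 + 1);
}
```

All intermediate products fit comfortably in 64-bit integers, which is the property the GPU implementation exploits on the following slides.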
MRG32k3a: skip ahead
The generator combines two recurrences:

  x_{1,n} = (a_1 x_{1,n-2} + b_1 x_{1,n-3}) mod m_1
  x_{2,n} = (a_2 x_{2,n-1} + b_2 x_{2,n-3}) mod m_2

Each recurrence has the form (M Giles, note on implementation)

  y_n = A y_{n-1},   where y_n = (x_n, x_{n-1}, x_{n-2})^T

To skip ahead p places, precompute A^p in O(log p) operations on the CPU; then

  y_{n+p} = A^p y_n mod m
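A sketch of the skip-ahead idea, assuming the companion-matrix form above: for x_n = (a x_{n-2} + b x_{n-3}) mod m the matrix is A = [[0,a,b],[1,0,0],[0,1,0]], and A^p mod m is formed by binary exponentiation. The type and function names are ours; the 128-bit intermediate (a GCC/Clang extension) is needed because MRG32k3a's moduli are close to 2^32.

```c
#include <stdint.h>

typedef struct { int64_t e[3][3]; } mat3;

/* a*b mod m without overflow for m up to ~2^32 */
static int64_t mulmod(int64_t a, int64_t b, int64_t m)
{
    return (int64_t)((__int128)a * b % m);
}

mat3 mat_mul(mat3 A, mat3 B, int64_t m)
{
    mat3 C;
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            int64_t s = 0;
            for (int k = 0; k < 3; k++)
                s = (s + mulmod(A.e[i][k], B.e[k][j], m)) % m;
            C.e[i][j] = s;
        }
    return C;
}

/* A^p mod m in O(log p) matrix products (binary exponentiation) */
mat3 mat_pow(mat3 A, uint64_t p, int64_t m)
{
    mat3 R = {{{1,0,0},{0,1,0},{0,0,1}}};   /* identity */
    while (p > 0) {
        if (p & 1) R = mat_mul(R, A, m);    /* multiply in set bits */
        A = mat_mul(A, A, m);               /* square */
        p >>= 1;
    }
    return R;
}

/* apply a 3x3 matrix to a state vector y, mod m */
void mat_vec(mat3 A, const int64_t y[3], int64_t out[3], int64_t m)
{
    for (int i = 0; i < 3; i++) {
        int64_t s = 0;
        for (int k = 0; k < 3; k++)
            s = (s + mulmod(A.e[i][k], y[k], m)) % m;
        out[i] = s;
    }
}
```

This is how each GPU thread can be given its own well-separated subsequence: thread t starts from y_{t*p} = (A^p)^t y_0.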
MRG32k3a: modulus
The combined and individual recurrences require reductions of the form

  z_n = (x_{1,n} - x_{2,n}) mod m_1

- Can compute using a double precision divide – slow
- Use 64-bit integers (supported on the GPU) – avoids the divide
- Bit shift – faster (used in CPU implementations)
- Note: the speed of the different possibilities is subject to change as NVIDIA updates the floating point capability
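The first two options can be contrasted in a few lines. mod_via_double follows the style of L'Ecuyer's double-precision code (products up to ~6e15 are still exact in a 53-bit double); mod_via_int64 uses the native 64-bit remainder. Both function names are ours.

```c
#include <stdint.h>

#define M1 4294967087LL   /* MRG32k3a modulus m1 = 2^32 - 209 */

/* Reduction via double arithmetic: one floating-point divide. */
double mod_via_double(double a, double x)
{
    double p = a * x;
    long long k = (long long)(p / (double)M1);   /* the slow divide */
    p -= k * (double)M1;
    if (p < 0.0) p += (double)M1;                /* guard against rounding in k */
    return p;
}

/* Reduction via 64-bit integers: no floating point at all. */
int64_t mod_via_int64(int64_t a, int64_t x)
{
    return (a * x) % M1;
}
```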
MRG32k3a: memory coalescence
- GPU performance is limited by memory access
- Memory coalescence is required for fast transfer of data
- Order the RNG output to retain consecutive memory accesses

With N_t threads and N_p points per thread, the element x_{n,t,b} – the n-th element generated by thread t in block b – is stored at index

  t + N_t n + N_t N_p b

which corresponds to position

  n + N_p t + N_t N_p b

in the sequential ordering. (Implementation by M Giles)
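The two orderings above can be written out explicitly; a CPU sketch with our own function names:

```c
/* Coalesced storage index vs sequential stream position for RNG
   output, for element n of thread t in block b.
   Nt = threads per block, Np = points per thread. */
int storage_index(int n, int t, int b, int Nt, int Np)
{
    return t + Nt * n + Nt * Np * b;   /* consecutive threads -> consecutive addresses */
}

int sequential_index(int n, int t, int b, int Nt, int Np)
{
    return n + Np * t + Nt * Np * b;   /* position in the single logical stream */
}
```

The key property is visible in the first function: for fixed n and b, incrementing the thread index t moves to the adjacent memory address, which is exactly what a coalesced warp access needs.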
MRG32k3a: single – double precision
- L'Ecuyer's example implementation is in double precision floating point
- Double precision is available on high end GPUs – but its arithmetic operations are much slower in execution than single precision
- The GPU implementation works in integers, with the final output cast to double
- Note: output to single precision gives a sequence that does not pass the randomness tests
MRG32k3a: GPU benchmarking – double precision
- GPU – NVIDIA Tesla C1060
- CPU – serial version of the integer implementation running on a single core of a quad-core Xeon
- VSL – Intel Library MRG32k3a
- ICC – Intel C/C++ compiler
- VC++ – Microsoft Visual C++

  Samples/sec:
    GPU       3.00E+09
    CPU-ICC   3.46E+07
    CPU-VC++  4.77E+07
    VSL-ICC   9.35E+07
    VSL-VC++  9.32E+07
MRG32k3a: GPU benchmarking – single precision
- Note: for double precision, all sequences were identical
- For single precision, GPU and CPU output are identical; GPU and VSL differ, max abs err 5.96E-08
- Which output is preferred? Use statistical tests of randomness

  Samples/sec:
    GPU       3.49E+09
    CPU-ICC   3.58E+07
    CPU-VC++  5.24E+07
    VSL-ICC   1.02E+08
    VSL-VC++  9.75E+07
LIBOR Market Model on GPU
An equally weighted portfolio of 15 swaptions, each with the same maturity but different lengths and different strikes.
Numerical Libraries for GPUs
- The problem:
  - The time-consuming work of writing basic numerical components should not be repeated
  - The general user should not need to spend many days writing each application
- The solution:
  - Standard numerical components should be available as libraries for GPUs
NAG Routines for GPUs

nag_gpu_mrg32k3a_uniform
Example program: generate random numbers on GPU

    ...
    // Allocate memory on Host
    host_array = (double *)calloc(N, sizeof(double));
    // Allocate memory on GPU
    cudaMalloc((void **)&device_array, sizeof(double)*N);
    // Call GPU functions
    // Initialise random number generator
    nag_gpu_mrg32k3a_init(V1, V2, offset);
    // Generate random numbers
    nag_gpu_mrg32k3a_uniform(nb, nt, np, device_array);
    // Read back GPU results to host
    cudaMemcpy(host_array, device_array, sizeof(double)*N, cudaMemcpyDeviceToHost);
    ...
NAG Routines for GPUs

nag_gpu_mrg32k3a_next_uniform
Example program – kernel function

    __global__ void mrg32k3a_kernel(int np, FP *d_P)
    {
      unsigned int v1[3], v2[3];
      int n, i0;
      FP x;
      // initialisation for first point
      nag_gpu_mrg32k3a_stream_init(v1, v2, np);
      // now do points
      i0 = threadIdx.x + np*blockDim.x*blockIdx.x;
      for (n = 0; n < np; n++) {
        nag_gpu_mrg32k3a_next_uniform(v1, v2, x);
        d_P[i0] = x;
        i0 += blockDim.x;
      }
    }
Library issues: auto-tuning
- Performance is affected by the mapping of the algorithm to the GPU via threads, blocks and warps
- Implement a code generator to produce variants using the relevant parameters
- Determine the optimal performance
- Li, Dongarra & Tomov (2009)
Early Success with BNP Paribas
Working with the Fixed Income Research & Strategies Team (FIRST):
- NAG mrg32k3a works well in BNP Paribas' CUDA "Local Vol Monte-Carlo"
- Passes rigorous statistical tests for randomness properties (Diehard, Dieharder, TestU01)
- Performance is good
- Being able to match the GPU random numbers with the CPU version of mrg32k3a has been very valuable for establishing the validity of the output
BNP Paribas Results – local vol example
49
And with Bank of America Merrill Lynch

“The NAG GPU libraries are helping us enormously by providing us with fast, good quality algorithms. This has let us concentrate on our models and deliver GPGPU based pricing much more quickly.”
“Thank you for the GPU code, we have achieved speed ups of x120” – "A N Other Tier 1" Risk Group

In a simple uncorrelated loss simulation:
- Number of simulations: 50,000
- Time taken: 2.373606 seconds
- Simulations per second: 21,065
- Simulated default rate: 311.8472
- Theoretical default rate: 311.9125

24 trillion numbers in 6 hours
NAG routines for GPUs – 1
Currently available:
- Random Number Generator (L'Ecuyer mrg32k3a) with uniform, Normal, exponential and gamma distributions
- Sobol sequence for quasi-Monte Carlo (up to 19,000 dimensions)
- Brownian Bridge
NAG routines for GPUs – 2
Future plans:
- Random Number Generator – Mersenne Twister
- Linear algebra components for PDE option pricing methods
- Time series analysis – wavelets
- ...
Summary
- GPUs offer high performance computing for specific massively parallel algorithms such as Monte Carlo simulation
- GPUs are lower cost and require less power than corresponding CPU configurations
- Numerical libraries for GPUs will make these an important computing resource
- Higher order methods for GPUs are being considered
Acknowledgments
- Mike Giles (Mathematical Institute, University of Oxford) – algorithmic input
- Technology Strategy Board, through a Knowledge Transfer Partnership with the Smith Institute
- NVIDIA, for technical support and the supply of Tesla C1060 and Quadro FX 5800 hardware

See www.nag.co.uk/numeric/GPUs/
Contact: [email protected]