Massively Parallel Random Number Generators · Page 4 Massively Parallel Random Number Generators...
Transcript of Massively Parallel Random Number Generators · Page 4 Massively Parallel Random Number Generators...
Massively Parallel Random Number Generators
Holger Dammertz | September 15, 2010
Page 2 Massively Parallel Random Number Generators | Overview | September 15, 2010
Overview
Applications for Pseudo-Random NumbersI Monte Carlo Simulation, Integration
I Test and Content Generation
I These applications are often easily parallelizableI MRIP paradigm: multiple replications in parallel
Generating Pseudo-Random NumbersI Generate i.i.d. uniform random numbers (for example 32bit)
I Transform into (0,1)
I Additional transformation to target distribution (for example, normaldistributed)
Page 3 Massively Parallel Random Number Generators | Overview | September 15, 2010
Structure of a RNG
Formal Definition [L’E06]
(S ,µ, f ,U,g)
I S state space
I µ prob. distr. on S to select initial state (seed) s0 ∈ S
I f : S → S transition function
I g : S → U output function
I U = (0,1) output space
I si = f (si−1), i ≥ 1 and ui = g(si )
Page 4 Massively Parallel Random Number Generators | Overview | September 15, 2010
Structure of a RNG
In a parallel Implementation
(S ,µ, f ,U,g)
I S needed per generating stream (usually per thread)
I S if possible hold in fast memory
I S store in global memory after finishing for multiple calls
I f and g are device functions (usually in a single function)
I output space often unsigned int → transform needed
Example: Linear Congruence Generator LCG [Knu81]
I si ∈ S is an integer (for example 32bit), U = S ,g = id
I f : si = (asi−1 + c) mod m
I Needs well chosen a, c and m
Page 5 Massively Parallel Random Number Generators | Overview | September 15, 2010
Structure of a RNG
Required properties of a RNGI Speed
I Repeatability
I Minimal statistical bias
Additional propertiesI Random access on uiI Independent number streams
I Long period
Page 6 Massively Parallel Random Number Generators | Overview | September 15, 2010
Structure of a RNG
Why is a long period important?
From [SPM05]: for a cycle length of n a single simulation should use at most
16 3√n
random numbers (to trust the results of statistical simulation).Assume period of 248 and simple parallel cuda app with 4096 threads:
163√
248
4096≈ 256
random numbers per thread.
Page 7 Massively Parallel Random Number Generators | Overview | September 15, 2010
Choice of RNG parameters are important
Simple LCG, 210−3 points
Page 8 Massively Parallel Random Number Generators | Parallelization Techniques | September 15, 2010
Two different views on parallel random numbers
1) Single Stream for all ThreadsI Each thread computes parts of one problem
I Result should not depend on number of threads and should be repeatable
I Ideally RNG update is also parallel (→ speed)
Page 9 Massively Parallel Random Number Generators | Parallelization Techniques | September 15, 2010
Two different views on parallel random numbers
2) Each Thread uses it’s own RNGI Each thread computes individual solutions
I Need to guarantee independence of streams
I If you could have a true RNG both methods would behave exactly the same(but repeatability would be lost).
I Using Pseudo RNGs this needs to be explicitly designed in the program
Page 10 Massively Parallel Random Number Generators | Parallelization Techniques | September 15, 2010
Parallelizing RNGs
Random SeedingI Easy to implement
I Generally very bad parallelization methodI Need to seed valid statesI No guarantee of independence
I Can work for generators with a long period
I Better alternative: well chosen seeds (per thread)I For example: Mersenne Twister dcmt libraryI New GPU Mersenne Twister (MTGP) provides seed tables
ParametrizationI n different RNG configurations for n threads
I Needs to be especially developed and tested
I Often restricted to specific n
Page 11 Massively Parallel Random Number Generators | Parallelization Techniques | September 15, 2010
Parallelizing RNGs
Block SplittingI Can guarantee non overlapping sequences of length m
I m needs to be known in advance
I Starting states need to be known or computed
Page 12 Massively Parallel Random Number Generators | Parallelization Techniques | September 15, 2010
Parallelizing RNGs
Leap-FroggingI Needs RNG to be able to skip n numbers (or else quite inefficient)
Page 13 Massively Parallel Random Number Generators | Application Scenarios | September 15, 2010
Application Design Considerations
On the fly Computation
Compute RN when needed in kernel
I State needs to be stored per stream
I RNG uses additional resources (registers and memory)
I Needs properly parallelized implementation
Pre-computation
Store RN in main memory
I Only memory access per RN
I Easily parallelizable (same access as leap frog)
I Memory requirements can be huge
I Bandwidth need to be considered
Page 14 Massively Parallel Random Number Generators | Application Scenarios | September 15, 2010
Upload vs. Computation Example
Intel(R) Xeon(R) E5420 @ 2.50GHz, GTX 480, 256MB random numbers,Mersenne Twister: MTGP/SFMT, 214 threads, blocksize 256
PrecomputationI Upload CPU-computed RN
I Precompute on CPU (SFMT): 180msI Upload: 44ms
I Precompute on GPUI Precompute on GPU (MTGP): 9.18ms
I Consume (Asian Options): 783.5ms
Produce and consume (MTGP)I Single Kernel: 1058.1ms
Page 15 Massively Parallel Random Number Generators | Overview of some RNGs | September 15, 2010
Overview of some RNGs
LCGs
si = (asi−1 + c) mod m
I Combining multiple LCGs can give longer period
I Independent streams: Wichmann-Hill (273 threads)
Multiple Recursive Generator
si =
(k
∑ξ=1
aξ si−ξ
)mod m
I Larger period (for k=1 equal to LCG)
I Blocking: MRG32k3a [LSCK02]
Page 16 Massively Parallel Random Number Generators | Overview of some RNGs | September 15, 2010
Overview of some RNGs
RNGs based on Cryptographic functionsI Creates white noise from input
I MD5 [TW08]: hash function
I Tiny Encryption Algorithm (TEA) [ZOC10]
I Different configuration per thread (parametrization)I Transform counter and thread id
I Random Access in the sequence
Mersenne TwisterI New GPU version [Sai10]
I Long period: 32bit version provides 211213−1,223209−1,244497−1
I Good seeding strategies (see MTGPDC)
Page 17 Massively Parallel Random Number Generators | Testing of RNGs | September 15, 2010
Testing RNGs
Why use testsI Implementation of RNGs is very sensitive
I Before using any RNG implementation it should be tested
I Failed tests: very likely bad sequence
I Passed tests: guarantees nothing
Test SuitsI Use statistical tests to find flaws
I DIEHARD + NIST (both integrated in DIEHARDER [Bro09])
I TestU01 [LS07]
Page 18 Massively Parallel Random Number Generators | Our Collection | September 15, 2010
Collection of Parallel Random Number Generators
Selecting suitable RNGsI Domain specific problem
I Depends on current compiler and hardware
Our CollectionI CUDA implementation of several different RNGs
I Easy to use
I Currently tested on Linux
I Most of the code: MIT License
I Copy-paste ready code
Available athttp://mprng.sf.net
Page 19 Massively Parallel Random Number Generators | Our Collection | September 15, 2010
Collection of Parallel Random Number Generators
Integrated BenchmarkI Benchmarks compiles and runs on your machine
I Configure threads and blocks according to your target application
Integrated Test SuitsI DIEHARDER and TestU01
I Automatic report generation
Easy to extendI Integrate your own RNG
I Integrate your own tests
Page 20 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Comparisons: TestU01 SmallCrush Battery
RNG Name Bir
thd
ayS
pac
ings
Col
lisio
n
Gap
Sim
pP
oker
Cou
pon
Col
lect
or
Max
Oft
Max
Oft
AD
Wei
ghtD
istr
ib
Mat
rixR
ank
Ham
min
gIn
dep
Ran
dom
Wal
k1H
Ran
dom
Wal
k1M
Ran
dom
Wal
k1J
Ran
dom
Wal
k1R
Ran
dom
Wal
k1C
combinedLCGTaus P P P P P P P P P P P P P P P
drand48gpu P P P P P F F P P P F F F F P
kiss07 P P P P P P P P P P P P P P P
lfsr113 P P P P P P P P P P P P P P P
md5 P P P P P P P P P P P P P P P
mtgp P P P P P P P P P P P P P P P
park miller F F P P P F F P P P F F F F F
ranecu P F P P P F F P P P F F F F F
tea P P P P P P P P P P P P P P P
tt800 P P P P P P P P P P P P P P P
Page 21 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Raw Performance Test
0
0.005
0.01
0.015
0.02
0.025
0.03
0.035
combinedLCGTaus
drand48gpu
kiss07
lfsr113
mtgp
park_miller
ranecu
tt800
seco
nds
Raw Performance, GeForce GTX 480, 335544320 Random Numbers
Page 21 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Raw Performance Test
0
0.05
0.1
0.15
0.2
0.25
0.3
combinedLCGTaus
drand48gpu
kiss07
lfsr113
md5
mtgp
park_miller
ranecu
teatt800
seco
nds
Raw Performance, GeForce GTX 480, 335544320 Random Numbers
Page 22 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Performance Test: Asian Option Example from [HT07]
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
combinedLCGTaus
drand48gpu
kiss07
lfsr113
mtgp
park_miller
ranecu
tt800
seco
nds
Asian Options, GeForce GTX 480, 96 runs, 163840 simulations
Page 22 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Performance Test: Asian Option Example from [HT07]
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
combinedLCGTaus
drand48gpu
kiss07
lfsr113
md5
mtgp
park_miller
ranecu
teatt800
seco
nds
Asian Options, GeForce GTX 480, 96 runs, 163840 simulations
Page 23 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Quality vs. Speed
Reduced rounds of MD5/TEA RNGI For some applications speed more important than quality
I MD5 and TEA allow to reduce number of iterationsI Quality of random numbers may degradeI Avalanche effect
I MD5: passes most of the DIEHARDER tests after 16 rounds
Page 24 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Quality vs. Speed, TEA, DIEHARDER tests
0 5
10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95
100 105 110 115 120
2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
Die
hard
er te
sts
Number of rounds
Tiny Encryption Algorithm
weakpassed
Page 25 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Quality vs. Speed, TEA performance
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
Run
time
in m
s
Number of rounds
Raw Performance, 335544320 Random Numbers
GTX480, Tiny Encryption Algorithm
Page 26 Massively Parallel Random Number Generators | Some Results | September 15, 2010
Quality vs. Speed, MD5 performance
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 4 8 12 16 20 24 28 32 36 40 44 48 52 56 60 64
Run
time
in m
s
Number of rounds
Raw Performance, 335544320 Random Numbers
GTX480, MD5
Page 27 Massively Parallel Random Number Generators | Conclusion | September 15, 2010
Conclusion
I Many good (parallel) RNGs exist
I Several different properties
I Choice of fitting RNG application dependent
Some PicksI KISS
+ Simple code and state management− Random seeding: may be ok for non-critical applications
I MTGPU
+ Very good quality and sophisticated seeding+ Long period− Relatively complex code− Fixed block/thread layout
I MD5, TEA
+ Random access+ No Seeding− Slow
Page 28 Massively Parallel Random Number Generators | Conclusion | September 15, 2010
Conclusion
Precomputing on GPUI May be an alternative to in kernel computation
RNG CollectionI Always evaluate your RNG choice and implementation
I Our framework provides an easy platform for testinghttp://mprng.sf.net
Questions?
AcknowledgementsI Framework: Christoph Schied
I This work was supported by a NVIDIA Professor Partnership Award with Hendrik P.A.Lensch
Page 28 Massively Parallel Random Number Generators | Conclusion | September 15, 2010
Conclusion
Precomputing on GPUI May be an alternative to in kernel computation
RNG CollectionI Always evaluate your RNG choice and implementation
I Our framework provides an easy platform for testinghttp://mprng.sf.net
Questions?
AcknowledgementsI Framework: Christoph Schied
I This work was supported by a NVIDIA Professor Partnership Award with Hendrik P.A.Lensch
Page 28 Massively Parallel Random Number Generators | References | September 15, 2010
Robert G. Brown.Dieharder: A random number test suite.http://www.phy.duke.edu/~rgb/General/dieharder.php, 2009.
Lee Howes and David Thomas.Efficient Random Number Generation and Application Using CUDA.In GPU Gems 3, pages 805–830, 2007.
D.E. Knuth.The art of computer programming: Seminumerical algorithms, volume 2,1981.
P. L’Ecuyer.Uniform random number generation.Handbooks in Operations Research and Management Science, 13:55–81,2006.
P. L’Ecuyer and R. Simard.TestU01: AC library for empirical testing of random number generators.ACM Transactions on Mathematical Software (TOMS), 33(4):22, 2007.
Pierre L’Ecuyer, Richard Simard, E. Jack Chen, and W. David Kelton.An object-oriented random-number package with many long streams andsubstreams.
Page 28 Massively Parallel Random Number Generators | References | September 15, 2010
Oper. Res., 50(6):1073–1075, 2002.
Mutsuo Saito.A variant of mersenne twister suitable for graphic processors.CoRR, abs/1005.4973, 2010.
Marcus Schoo, Krzysztof Pawlikowski, and Donald C. Mcnickle.A survey and empirical comparison of modern pseudo-random numbergenerators for distributed stochastic simulations, 2005.
S. Tzeng and L.Y. Wei.Parallel white noise generation on a GPU via cryptographic hash.In Proceedings of the 2008 symposium on Interactive 3D graphics and games,pages 79–87. ACM, 2008.
F. Zafar, M. Olano, and A. Curtis.GPU Random Numbers via the Tiny Encryption Algorithm.2010.