Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU
Che-Rung Lee
National Tsing Hua University
joint work with Zhi-Hung Chen and Quey-Liang Kao
Second International Workshop on Accelerators and Hybrid Exascale Systems (AsHES)
May 25th, 2012
Outline
1 Determinant quantum Monte Carlo simulations
2 Matrix multiplication of sparse matrix exponentials
3 Parallel block checkerboard methods on multicore and GPU
4 Experiments and results
5 Concluding remarks
Computational Material Science
To study the properties of solid-state materials: magnetism, metal-insulator transition, high-temperature superconductivity, ...
The Hubbard model:
Energy operator H is associated with a lattice of particles.
The Boltzmann weight is expressed as a path integral
e^{-βH} ≈ e^{-τH(h_1)} e^{-τH(h_2)} ··· e^{-τH(h_L)}.
β = 1/T is the "imaginary time". τ = β/L is the discretized time step. {h_i} is the "Hubbard-Stratonovich field".
Determinant Quantum Monte Carlo (DQMC) Simulations
DQMC algorithm
1. Given a random HS field h = (h_{ℓ,i}) with entries ±1.
2. Until there are enough measurements:
   For ℓ = 1, ..., L and i = 1, ..., N:
   1. Propose a new HS configuration h′.
   2. Compute the ratio γ of the determinants of the new and old configurations.
   3. Generate a random number ρ ∈ [0, 1].
   4. If γ > ρ, accept h = h′.
   5. If the system is thermalized, sample the physical measurements of interest.
3. Aggregate the sampled measurements.
[Flowchart: random HS field → DQMC steps (warm-up) until thermalized → DQMC steps with measurements (sampling) until enough samples → aggregation.]
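To make the accept/reject rule concrete, here is a minimal Python/NumPy sketch of one Metropolis sweep over the HS field. The det_ratio callable is a hypothetical stand-in for the determinant-ratio computation (provided by the Green's function kernel described later); the dummy ratio in the usage lines exists only to exercise the loop.

import numpy as np

def dqmc_sweep(h, det_ratio, rng):
    """One Metropolis sweep over the Hubbard-Stratonovich field h (shape L x N).

    det_ratio(h, l, i) -> gamma, the new/old determinant ratio for flipping
    h[l, i]; this callable is a hypothetical stand-in for the DQMC kernel.
    """
    L, N = h.shape
    for l in range(L):
        for i in range(N):
            gamma = det_ratio(h, l, i)   # ratio of Boltzmann weights
            if gamma > rng.random():     # accept if gamma > rho, rho in [0, 1]
                h[l, i] = -h[l, i]       # flip the local HS spin
    return h

# usage sketch with a dummy ratio (always accepts)
rng = np.random.default_rng(0)
h = rng.choice([-1, 1], size=(96, 256))
h = dqmc_sweep(h, lambda h, l, i: 1.0, rng)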
DQMC Parallelization
The parallel Monte Carlo method can speed up DQMC simulations by parallelizing the sampling stage.
Coarse-grained parallelization (communication only happens before sampling and in aggregation).
Strong scalability if the number of desired samples is much larger than the number of processors.
[Flowchart: a single warm-up chain of DQMC steps until thermalized, followed by many independent DQMC sampling chains with measurements running in parallel, then aggregation.]
Computational Challenges
By Amdahl's law, the speedup of the parallel Monte Carlo method is limited by the warm-up stage, which is not parallelizable:

$$\mathrm{Speedup} = \frac{T_{\text{warmup}} + T_{\text{sampling}}}{T_{\text{warmup}} + T_{\text{sampling}}/p} \;\longrightarrow\; \frac{T_{\text{warmup}} + T_{\text{sampling}}}{T_{\text{warmup}}} \quad (p \to \infty).$$

The parallel Monte Carlo method does not scale with problem size, i.e., the number of particles and the discretized time length.
Coarse-grained parallelization does not fit well on multicore and GPU:
  The computation of each DQMC step is complicated.
  Execution slows down because of resource contention.
  Memory per core shrinks as the number of cores grows.
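As a back-of-the-envelope illustration of this bound (assuming every warm-up and sampling step has the same cost), the 10^3 warm-up and 10^4 sampling steps quoted later in the talk cap the achievable speedup at 11:

# Amdahl-style bound for parallel Monte Carlo: only the sampling stage parallelizes.
def mc_speedup(t_warmup, t_sampling, p):
    return (t_warmup + t_sampling) / (t_warmup + t_sampling / p)

t_w, t_s = 1e3, 1e4                 # warm-up vs. sampling work (equal cost per step assumed)
for p in (4, 16, 64, 10**6):
    print(p, round(mc_speedup(t_w, t_s, p), 2))
# prints 3.14, 6.77, 9.51, 11.0: the bound approaches (t_w + t_s) / t_w = 11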
Inside Each DQMC Step
[Flowchart: the same warm-up/sampling loop of DQMC steps as before.]
A DQMC step
1. Propose a local change: h → h′.
2. Draw a random number 0 < r < 1.
3. Accept the change if r < det(e^{-βH(h′)}) / det(e^{-βH(h)}).

Computational kernel: the Green's function calculation

G = (I + B_L ··· B_2 B_1)^{-1},

used for computing det(e^{-βH(h′)}) and the physical measurements.
Green’s Function Calculation
G = (I + B_L ··· B_2 B_1)^{-1}.

N: the number of particles; L: the number of time slices.
The time complexity of computing G is O(N³L).
For 10³ warm-up steps and 10⁴ sampling steps, it takes 15 hours.
For large simulations, N = O(10⁴) and L = O(10²), the projected execution time could be several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

  Matrix kernel                  Execution time
  Matrix-matrix multiplication   72.39%
  Pivoted QR decomposition       17.83%
  Matrix inversion                3.02%
  Others                          6.76%
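For illustration, a direct (unstabilized) evaluation of this kernel in NumPy is sketched below; it performs L dense N x N multiplications and one inversion, which is exactly where the O(N³L) cost and the GEMM-dominated profile come from. Production DQMC codes interleave pivoted QR factorizations for numerical stability, which this sketch omits, and the random B_l matrices are stand-ins.

import numpy as np

def greens_function(B_list):
    """Naive G = (I + B_L ... B_2 B_1)^(-1): O(N^3 L) work, no stabilization."""
    N = B_list[0].shape[0]
    prod = np.eye(N)
    for B in B_list:            # B_1 is applied first, B_L last
        prod = B @ prod         # dense N x N GEMM, O(N^3) each
    return np.linalg.inv(np.eye(N) + prod)

# usage sketch with random stand-ins for the B_l
rng = np.random.default_rng(0)
N, L = 256, 96
B_list = [np.eye(N) + 0.01 * rng.standard_normal((N, N)) for _ in range(L)]
G = greens_function(B_list)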
Outline
1 Determinant quantum Monte Carlo simulations
2 Matrix multiplication of sparse matrix exponentials
3 Parallel block checkerboard methods on multicore and GPU
4 Experiments and results
5 Concluding remarks
Matrix-matrix Multiplication
Some tuned results on multicore and on GPU (Fermi):
  DGEMM on a 4-core Intel Core i7-920 with MKL is about 40 Gflop/s (my laptop).
  SGEMM can reach 662 Gflop/s on Fermi [Jakub Kurzak, LAWN 245, 2010].
  DGEMM reaches 362 Gflop/s on Fermi [Guangming Tan et al., SC11].
That is great, but the running time grows cubically with problem size N.
Sparse-dense matrix multiplication takes only O(N²) time.
In the Green's function calculation, G = (I + B_L B_{L-1} ··· B_2 B_1)^{-1}, each B_i = e^A is a matrix exponential:

$$e^A = I + A + \frac{A^2}{2!} + \frac{A^3}{3!} + \cdots = \sum_{k=0}^{\infty} \frac{A^k}{k!}.$$

Matrix A is highly sparse, but e^A is dense.
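This fill-in is easy to see numerically. In the sketch below (a small nearest-neighbour hopping matrix on a ring, with hypothetical size and hopping value), A has two nonzeros per row while expm(A) comes out numerically dense:

import numpy as np
from scipy.linalg import expm

n, h = 16, 1.0
A = np.zeros((n, n))
for i in range(n):                           # ring lattice: couple site i to i+1
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = h

E = expm(A)
print(np.count_nonzero(A))                   # 32: two nonzeros per row
print(np.count_nonzero(np.abs(E) > 1e-12))   # 256: e^A is (numerically) fully dense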
Checkerboard Method
The checkerboard method approximates e^A by e^A ≈ e^{A_1} e^{A_2} ··· e^{A_k}, in which each e^{A_j} is sparse.

Theorem
If A_i is strictly sparse with zero diagonal, e^{A_i} has the same sparsity pattern as A_i with diagonal fill-in.

Definition (strictly sparse)
A matrix is strictly sparse if it contains at most one nonzero per row and per column.

Checkerboard method for computing e^A
1. Split A = A_1 + A_2 + ··· + A_k such that each A_i is strictly sparse.
2. Exponentiate each A_i.
3. Return e^{A_1} e^{A_2} ··· e^{A_k} as an approximation to e^A.
Example: One Dimensional Ring
$$A = \begin{bmatrix} 0 & h & & & h \\ h & 0 & h & & \\ & h & 0 & \ddots & \\ & & \ddots & \ddots & h \\ h & & & h & 0 \end{bmatrix}$$

[Figure: an 8-site ring; alternating links are labeled odd (1-2, 3-4, 5-6, 7-8) and even (2-3, 4-5, 6-7, 8-1).]

Splitting A into odd and even links gives

$$e^{A} \approx \begin{bmatrix} D & & \\ & \ddots & \\ & & D \end{bmatrix} \begin{bmatrix} \cosh(h) & & & & \sinh(h) \\ & D & & & \\ & & \ddots & & \\ & & & D & \\ \sinh(h) & & & & \cosh(h) \end{bmatrix}, \quad \text{where } D = \begin{bmatrix} \cosh(h) & \sinh(h) \\ \sinh(h) & \cosh(h) \end{bmatrix}.$$
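A minimal numerical check of this example (assuming the plain, non-block checkerboard split into odd and even links): build the ring's kinetic matrix, exponentiate the two strictly sparse parts with scipy, and compare their product with the exact exponential. The error is O(h²) since the two parts do not commute.

import numpy as np
from scipy.linalg import expm

n, h = 8, 0.05                           # ring size and hopping strength
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = h

A1, A2 = np.zeros_like(A), np.zeros_like(A)
for i in range(0, n, 2):                 # odd links: 1-2, 3-4, ... in 1-based numbering
    A1[i, i + 1] = A1[i + 1, i] = h
for i in range(1, n, 2):                 # even links: 2-3, ..., including the wrap-around
    j = (i + 1) % n
    A2[i, j] = A2[j, i] = h

E1, E2 = expm(A1), expm(A2)              # each stays sparse: 2x2 cosh/sinh blocks
err = np.linalg.norm(E1 @ E2 - expm(A)) / np.linalg.norm(expm(A))
print(err)                               # small relative error, shrinking like h^2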
Block Checkerboard Method
Exploiting the block structure of the sparse matrices yields better performance and accuracy.
[Plots: running time (seconds) vs. matrix size (12×12×12 to 20×20×20) for the exact, checkerboard, and block checkerboard methods; relative error vs. discretization parameter Δτ (10⁻⁴ to 10⁻¹) for the checkerboard and block checkerboard methods.]
The algorithm is similar, but the basic element is a block submatrix.
We will focus on the parallelization of the block checkerboard method.
Outline
1 Determinant quantum Monte Carlo simulations
2 Matrix multiplication of sparse matrix exponentials
3 Parallel block checkerboard methods on multicore and GPU
4 Experiments and results
5 Concluding remarks
Computational Difficulties
Combination of sparse-matrix and dense-matrix computation (a generalized SpMV):

  Dense matrix-matrix multiplication    compute bound
  Sparse matrix-vector multiplication   memory bound
  Sparse matrix-matrix multiplication   ?

Multiplication of a sequence of sparse matrices with different characteristics.
The matrix size is not large enough to reach good performance.
2D and 3D Torus Lattices
The kinetic matrices in our study come from 2D and 3D torus lattices.
Images are from http://www.trampelwurm.ch/schmidt/wilhelmtux/swissremix/html/Linuxfibel/netstruct.htm and
http://www.fujitsu.com/global/news/pr/archives/month/2009/20090717-01.html
2D Lattice and Kinetic Matrix
The kinetic matrix of a 2D lattice can be split into 3 block strictly sparse matrices, K_1, K_2, and K_3.
The checkerboard method approximates the matrix exponential by
e^{K_3/2} e^{K_2/2} e^{K_1} e^{K_2/2} e^{K_3/2}.
The figure shows the sparsity pattern of the kinetic matrix of a 4×4 2D lattice and of the matrix exponentials e^{K_1}, e^{K_2}, and e^{K_3}.
[Figure: sparsity patterns of (a) the kinetic matrix K, (b) expm(K1), (c) expm(K2), and (d) expm(K3) for a 4×4 2D lattice (16×16 matrices).]
3D Lattice and Kinetic Matrix
The kinetic matrix of a 3D lattice can be split into 5 block strictly sparse matrices, K_1, K_2, K_3, K_4, and K_5.
The checkerboard method approximates the matrix exponential by
e^{K_5/2} e^{K_4/2} e^{K_3/2} e^{K_2/2} e^{K_1} e^{K_2/2} e^{K_3/2} e^{K_4/2} e^{K_5/2}.
The figure shows the sparsity pattern of the kinetic matrix of a 4×4×4 3D lattice and of the matrix exponentials e^{K_1}, ..., e^{K_5}.
[Figure: sparsity patterns of (a) the kinetic matrix K and (b)-(f) expm(K1) through expm(K5) for a 4×4×4 3D lattice (64×64 matrices).]
eK1: Block Diagonal Matrix
In 2D and 3D problems, e^{K_1} is block diagonal:

$$e^{K_1} = \begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_k \end{bmatrix}.$$

For

$$B = \begin{bmatrix} B_{11} & B_{12} & \cdots & B_{1k} \\ B_{21} & B_{22} & \cdots & B_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ B_{k1} & B_{k2} & \cdots & B_{kk} \end{bmatrix},$$

the product e^{K_1}B is

$$e^{K_1}B = \begin{bmatrix} A_1 B_{11} & A_1 B_{12} & \cdots & A_1 B_{1k} \\ A_2 B_{21} & A_2 B_{22} & \cdots & A_2 B_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ A_k B_{k1} & A_k B_{k2} & \cdots & A_k B_{kk} \end{bmatrix}.$$
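In code, the block-diagonal factor is never formed as a full matrix; each small dense block hits only its own block row of B. A minimal sketch, assuming equal block sizes nb:

import numpy as np
from scipy.linalg import block_diag

def blockdiag_mult(A_blocks, B):
    """C = diag(A_1, ..., A_k) @ B, applying each A_i to its block row of B."""
    nb = A_blocks[0].shape[0]
    C = np.empty_like(B)
    for i, Ai in enumerate(A_blocks):        # every block row is independent
        C[i * nb:(i + 1) * nb, :] = Ai @ B[i * nb:(i + 1) * nb, :]
    return C

# usage sketch: verify against the explicit dense product
rng = np.random.default_rng(0)
k, nb = 4, 4
A_blocks = [rng.standard_normal((nb, nb)) for _ in range(k)]
B = rng.standard_normal((k * nb, k * nb))
assert np.allclose(blockdiag_mult(A_blocks, B), block_diag(*A_blocks) @ B)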
e^{K_i}, i ≠ 1: Bidiagonal Matrices

$$e^{K_i} = \begin{bmatrix} \ddots & & & & \\ & D_i & \cdots & C_{ij} & \\ & \vdots & \ddots & \vdots & \\ & C_{ji} & \cdots & D_j & \\ & & & & \ddots \end{bmatrix},$$

where the blocks C_{ij}, C_{ji} and D_i, D_j are themselves diagonal for e^{K_i}, i ≠ 1, so there are only two nonzeros per row and per column.

Let B = (B_{ℓm}) be partitioned into the same block structure. The coupled block rows i and j of F = e^{K_i}B are, for every block column m,

$$F_{im} = D_i B_{im} + C_{ij} B_{jm}, \qquad F_{jm} = C_{ji} B_{im} + D_j B_{jm},$$

so each output block requires only two diagonal scalings and one addition.
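Because every row of e^{K_i} (i ≠ 1) holds only two nonzeros, applying it to B costs O(N²): each output row is a scaled copy of one row of B plus a scaled copy of its partner row. A sketch, with plain vectors d and c holding the diagonals of the D and C blocks and a hypothetical pairing of rows:

import numpy as np

def paired_mult(d, c, pairs, B):
    """F = E @ B where E has d[r] on the diagonal and c[r] coupling row r to its partner."""
    F = d[:, None] * B                   # diagonal part: scale each row of B
    for p, q in pairs:                   # off-diagonal part: add the partner row
        F[p] += c[p] * B[q]
        F[q] += c[q] * B[p]
    return F

# usage sketch: rows (0,1), (2,3), ... coupled with cosh/sinh values as in e^{K_i}
rng = np.random.default_rng(0)
n, h = 8, 0.1
d = np.full(n, np.cosh(h))
c = np.full(n, np.sinh(h))
pairs = [(r, r + 1) for r in range(0, n, 2)]
B = rng.standard_normal((n, n))
F = paired_mult(d, c, pairs, B)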
Parallelize Algorithms for eK1B
Different algorithms need to be designed for different types of matrices. A block-based algorithm for e^{K_1}B is given below (a multicore sketch follows the pseudocode).

Parallelization of e^{K_1}B
For each B_ij do
  Compute C_ij = A_i B_ij in parallel.
End for each

For better cache behavior, finer-grained parallelization is needed.

Finer-grained parallelization of e^{K_1}B
For each B_ij do
  For each sub-block C_ij(k, ℓ) of C_ij do
    Compute C_ij(k, ℓ) in parallel.
  End for each
End for each
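One way to realize the coarse-grained loop above on a multicore CPU is to hand the independent (i, j) block products to a thread pool; NumPy's BLAS calls release the GIL, so the threads genuinely overlap. This is only an illustrative sketch (cache-aware tiling would follow the finer-grained loop), and the block layout is assumed square with equal block size nb:

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def expK1_mult_parallel(A_blocks, B, nb, workers=4):
    """C = e^{K1} @ B with e^{K1} = diag(A_1, ..., A_k); one task per (i, j) block."""
    k = len(A_blocks)
    C = np.empty_like(B)

    def task(i, j):
        rows = slice(i * nb, (i + 1) * nb)
        cols = slice(j * nb, (j + 1) * nb)
        C[rows, cols] = A_blocks[i] @ B[rows, cols]   # independent of all other blocks

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, i, j) for i in range(k) for j in range(k)]
        for f in futures:
            f.result()                                # re-raise any worker exception
    return C

# usage sketch
rng = np.random.default_rng(0)
k, nb = 8, 32
A_blocks = [rng.standard_normal((nb, nb)) for _ in range(k)]
B = rng.standard_normal((k * nb, k * nb))
C = expK1_mult_parallel(A_blocks, B, nb)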
Parallelize Algorithms for e^{K_i}B, i ≠ 1

Column-based parallelization: for better performance, aggregate the computation of the chain of e^{K_i} factors applied to B (a sketch follows the pseudocode).

Parallelization of e^{K_i}B, i ≠ 1
For each column b_i in B do
  Compute e^{K_2/2} e^{K_3/2} b_i (2D) or e^{K_2/2} ··· e^{K_5/2} b_i (3D) in parallel.
End for each

Divide b_i into segments that fit in cache.

Parallelization of e^{K_i}B, i ≠ 1
For each column b_i in B do
  For each segment b_i(:) of b_i do
    Compute e^{K_2/2} e^{K_3/2} b_i(:) or e^{K_2/2} ··· e^{K_5/2} b_i(:) in parallel.
  End for each
End for each
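A sketch of the column-based idea: keep each e^{K_i} (i ≠ 1) as a sparse matrix and sweep the whole chain over one column panel of B at a time, so the panel stays in cache while several factors are applied. The two factors built here are random stand-ins with the correct two-nonzeros-per-row pair structure, not the actual e^{K_2/2}, ..., e^{K_5/2} of a torus lattice.

import numpy as np
import scipy.sparse as sp

def apply_chain_by_panels(factors, B, panel=64):
    """Compute factors[0] @ factors[1] @ ... @ B, one column panel at a time."""
    C = np.empty_like(B)
    for s in range(0, B.shape[1], panel):
        P = B[:, s:s + panel]
        for E in reversed(factors):      # right-most factor is applied first
            P = E @ P                    # sparse (2 nnz/row) times panel: O(n * panel)
        C[:, s:s + panel] = P
    return C

def pair_factor(n, h, offset):
    """Stand-in sparse factor: cosh on the diagonal, sinh coupling paired rows."""
    rows, cols, vals = [], [], []
    for r in range(n):
        partner = (r + 1) % n if (r + offset) % 2 == 0 else (r - 1) % n
        rows += [r, r]
        cols += [r, partner]
        vals += [np.cosh(h), np.sinh(h)]
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, n))

# usage sketch
n, h = 256, 0.1
factors = [pair_factor(n, h, 0), pair_factor(n, h, 1)]
B = np.random.default_rng(0).standard_normal((n, n))
C = apply_chain_by_panels(factors, B)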
Outline
1 Determinant quantum Monte Carlo simulations
2 Matrix multiplication of sparse matrix exponentials
3 Parallel block checkerboard methods on multicore and GPU
4 Experiments and results
5 Concluding remarks
Experimental Setting
Experiment setting
We evaluated the parallel algorithms on a multicore CPU and on a GPU.
The matrices come from 2D and 3D tori.

Multicore CPU
Intel Core i7-950 with 8 GB RAM; peak performance 48.48 Gflop/s.
DGEMM from MKL 11 is used for the block matrix multiplications and for comparison.

GPU
GeForce GTX 480; peak performance 1344 Gflop/s.
CUDA 4 is used; DGEMM and SGEMM from the CUDA SDK serve as the comparison.
Results of 2D Problems on Multicore
[Plots: performance (Gflop/s) vs. number of threads (1 to 8) for problem sizes 1024/32, 4096/64, and 16384/128 (matrix size / lattice side).]
Lattice sizes: 32×32, 64×64, and 128×128.
Larger problems have more stable performance results.
The results for 32×32 are less stable than those of the other cases.
Hyperthreading has almost no effect.
Results of 3D Problems on Multicore
[Plots: performance (Gflop/s) vs. problem size / block size (64/4 to 13824/24) for different numbers of threads (1 to 4, and 1 to 8).]
Lattice sizes: 4×4×4, 8×8×8, 12×12×12, 16×16×16, 20×20×20, and 24×24×24.
The speedup is about 3 using 4 cores.
The Gflop/s of the e^{K_1} kernel is much better than the overall performance.
Hyperthreading helps, but its effect is diminishing.
Results of 2D Problems on GPU
[Plots: execution time (10⁻⁶ s, log scale) vs. problem size (64 to 4096) for the CUDA SDK and block checkerboard kernels in single and double precision; speedup vs. problem size for single and double precision.]
Lattice sizes: 8×8, 16×16, 32×32, and 64×64.
The speedup is up to 67 for single precision.
DGEMM from the CUDA SDK cannot finish the 64×64 case.
Results of 3D Problems on GPU
[Plots: execution time (10⁻⁶ s, log scale) vs. problem size (64, 512, 4096) for the CUDA SDK and block checkerboard kernels in single and double precision; speedup vs. problem size for single and double precision.]
Lattice sizes: 4×4×4, 8×8×8, 12×12×12, and 16×16×16.
The speedup is up to 147 for single precision.
DGEMM from the CUDA SDK cannot finish the 16×16×16 case.
Outline
1 Determinant quantum Monte Carlo simulations
2 Matrix multiplication of sparse matrix exponentials
3 Parallel block checkerboard methods on multicore and GPU
4 Experiments and results
5 Concluding remarks
Concluding Remarks
Scientific applications demand ever more computational power to enable larger and larger simulations. But to extract performance from modern HPC machines, which combine coarse-grained and fine-grained parallel architectures, application developers need to redesign their programs.

For DQMC simulations, the checkerboard method makes the algorithm more scalable with problem size. This paper presents a preliminary study of the parallelization of the checkerboard method on multicore and GPU. Further performance tuning is required.

Parallelization of the checkerboard method for general lattice geometries still needs to be studied, and the parallel checkerboard method needs to be integrated into a simulation package.