Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee ([email protected]) Parallel...

63
Parallelizing the Hamiltonian Computation in DQMC Simulations: Checkerboard Method for Sparse Matrix Exponentials on Multicore and GPU Che-Rung Lee National Tsing Hua University joint work with Zhi-Hung Chen and Quey-Liang Kao Second International Workshop on Accelerators and Hybrid Exascale Systems (AsHES) May 25th, 2012 Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 1 / 31

Transcript of Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee ([email protected]) Parallel...

Page 1: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelizing the Hamiltonian Computation in DQMCSimulations: Checkerboard Method for Sparse Matrix

Exponentials on Multicore and GPU

Che-Rung Lee

National Tsing Hua University

joint work with Zhi-Hung Chen and Quey-Liang Kao

Second International Workshop onAccelerators and Hybrid Exascale Systems (AsHES)

May 25th, 2012

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 1 / 31

Page 2: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Outline

1 Determinant quantum Monte Carlo simulations

2 Matrix multiplication of sparse matrix exponentials

3 Parallel block checkerboard methods on multicore and GPU

4 Experiments and results

5 Concluding remarks

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 2 / 31

Page 3: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Material Science

To study the properties of solid-state materials:magnetism, metal-insulator transition, hightemperature superconductivity, ...

The Hubbard model:

Energy operator H is associated with alattice of particles.

Boltzmann weight is expressed as a pathintegral

e−βH ≈ e−τH(h1)e−τH(h2) · · · e−τH(hL).

β = 1/T is the “imaginary time”.τ = β/L is the discretized time step.{hi} is the “Hubbard-Stratonovich field”.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 3 / 31

Page 4: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Material Science

To study the properties of solid-state materials:magnetism, metal-insulator transition, hightemperature superconductivity, ...

The Hubbard model:

Energy operator H is associated with alattice of particles.

Boltzmann weight is expressed as a pathintegral

e−βH ≈ e−τH(h1)e−τH(h2) · · · e−τH(hL).

β = 1/T is the “imaginary time”.τ = β/L is the discretized time step.{hi} is the “Hubbard-Stratonovich field”.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 3 / 31

Page 5: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Determinant Quantum Monte Carlo (DQMC) Simulations

DQMC algorithm

1 Given a random h = (h`,i) = (±1).2 Until there are enough measurements

For ` = 1, . . . , L and i = 1, . . . , N1 Propose a new HS config h′.2 Compute the ratio γ of the

determinants of new/old configs.3 Generate a random number ρ ∈ [0, 1].4 If γ > ρ, accept h = h′.5 If the system is thermalized, sample

the interested physical measurements.

3 Aggregate the sampled measurements.

DQMC step

Random HS field

thermalized

DQMC step

Measurements

yes

no

enoughsamples

no

Aggregation

yes

warm

upsam

pling

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 4 / 31

Page 6: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

DQMC Parallelization

Parallel Monte Carlo method canspeedup DQMC simulations byparallelizing the sampling stage.

Coarse-grained parallelization.(Communication only happens beforesampling and in aggregation.)

Strong scalability if the number ofdesired samplings is much larger thanthe number of processors.

DQMC step

Random HS field

thermalizedD

QM

C step

Measurem

ents

yes

no

Aggregation

warm

upsam

pling

DQ

MC

stepM

easurements

DQ

MC

stepM

easurements

...

...

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 5 / 31

Page 7: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

DQMC Parallelization

Parallel Monte Carlo method canspeedup DQMC simulations byparallelizing the sampling stage.

Coarse-grained parallelization.(Communication only happens beforesampling and in aggregation.)

Strong scalability if the number ofdesired samplings is much larger thanthe number of processors.

DQMC step

Random HS field

thermalizedD

QM

C step

Measurem

ents

yes

no

Aggregation

warm

upsam

pling

DQ

MC

stepM

easurements

DQ

MC

stepM

easurements

...

...

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 5 / 31

Page 8: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

DQMC Parallelization

Parallel Monte Carlo method canspeedup DQMC simulations byparallelizing the sampling stage.

Coarse-grained parallelization.(Communication only happens beforesampling and in aggregation.)

Strong scalability if the number ofdesired samplings is much larger thanthe number of processors.

DQMC step

Random HS field

thermalizedD

QM

C step

Measurem

ents

yes

no

Aggregation

warm

upsam

pling

DQ

MC

stepM

easurements

DQ

MC

stepM

easurements

...

...

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 5 / 31

Page 9: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Challenges

By Amdahl’s law, the speedup of parallel Monte Carlo method islimited by the warmup stage (non-parallelizable).

Speedup =Twarmup + TsamplingTwarmup + Tsampling/p

→Twarmup + Tsampling

Twarmup

Parallel Monte Carlo method does not scale with problem size, i.e.number of particles and discretized time length.

Coarse grained parallelization does not fit well on multicore and GPU.

The computation of each DQMC step is complicated.Slower execution because of resource contention.Memory per core is reduced with the number of cores.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 6 / 31

Page 10: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Challenges

By Amdahl’s law, the speedup of parallel Monte Carlo method islimited by the warmup stage (non-parallelizable).

Speedup =Twarmup + TsamplingTwarmup + Tsampling/p

→Twarmup + Tsampling

Twarmup

Parallel Monte Carlo method does not scale with problem size, i.e.number of particles and discretized time length.

Coarse grained parallelization does not fit well on multicore and GPU.

The computation of each DQMC step is complicated.Slower execution because of resource contention.Memory per core is reduced with the number of cores.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 6 / 31

Page 11: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Challenges

By Amdahl’s law, the speedup of parallel Monte Carlo method islimited by the warmup stage (non-parallelizable).

Speedup =Twarmup + TsamplingTwarmup + Tsampling/p

→Twarmup + Tsampling

Twarmup

Parallel Monte Carlo method does not scale with problem size, i.e.number of particles and discretized time length.

Coarse grained parallelization does not fit well on multicore and GPU.

The computation of each DQMC step is complicated.Slower execution because of resource contention.Memory per core is reduced with the number of cores.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 6 / 31

Page 12: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Inside Each DQMC Step

DQMC step

Random HS field

thermalized

DQMC step

Measurements

yes

no

enoughsamples

no

Aggregation

yes

warm

upsam

pling

A DQMC step

1 Propose a local change: h→ h′.

2 Throw a random number 0 < r < 1.

3 Accept the change if r < det(e−βH(h′))det(e−βH(h))

.

Computational Kernel: Green’s function cal-culation

G = (I +BL · · ·B2B1)−1.

for computation of det(e−βH(h′)) and phys-ical measurements.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 7 / 31

Page 13: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Inside Each DQMC Step

DQMC step

Random HS field

thermalized

DQMC step

Measurements

yes

no

enoughsamples

no

Aggregation

yes

warm

upsam

pling

A DQMC step

1 Propose a local change: h→ h′.

2 Throw a random number 0 < r < 1.

3 Accept the change if r < det(e−βH(h′))det(e−βH(h))

.

Computational Kernel: Green’s function cal-culation

G = (I +BL · · ·B2B1)−1.

for computation of det(e−βH(h′)) and phys-ical measurements.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 7 / 31

Page 14: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Green’s Function Calculation

G = (I +BL · · ·B2B1)−1.

N : the number of particles; L: the number of time slices.

Time complexity of computing G is O(N3L).

For 103 warmup steps and 104 sampling steps, it takes 15 hours.

For large simulations, N = O(104), L = O(102), the projectedexecution time could take several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

Matrix kernel Execution time

Matrix-matrix multiplication 72.39%Pivoted QR decomposition 17.83%Matrix inversion 3.02%Others 6.76%

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 8 / 31

Page 15: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Green’s Function Calculation

G = (I +BL · · ·B2B1)−1.

N : the number of particles; L: the number of time slices.

Time complexity of computing G is O(N3L).

For 103 warmup steps and 104 sampling steps, it takes 15 hours.

For large simulations, N = O(104), L = O(102), the projectedexecution time could take several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

Matrix kernel Execution time

Matrix-matrix multiplication 72.39%Pivoted QR decomposition 17.83%Matrix inversion 3.02%Others 6.76%

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 8 / 31

Page 16: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Green’s Function Calculation

G = (I +BL · · ·B2B1)−1.

N : the number of particles; L: the number of time slices.

Time complexity of computing G is O(N3L).

For 103 warmup steps and 104 sampling steps, it takes 15 hours.

For large simulations, N = O(104), L = O(102), the projectedexecution time could take several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

Matrix kernel Execution time

Matrix-matrix multiplication 72.39%Pivoted QR decomposition 17.83%Matrix inversion 3.02%Others 6.76%

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 8 / 31

Page 17: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Green’s Function Calculation

G = (I +BL · · ·B2B1)−1.

N : the number of particles; L: the number of time slices.

Time complexity of computing G is O(N3L).

For 103 warmup steps and 104 sampling steps, it takes 15 hours.

For large simulations, N = O(104), L = O(102), the projectedexecution time could take several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

Matrix kernel Execution time

Matrix-matrix multiplication 72.39%Pivoted QR decomposition 17.83%Matrix inversion 3.02%Others 6.76%

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 8 / 31

Page 18: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Green’s Function Calculation

G = (I +BL · · ·B2B1)−1.

N : the number of particles; L: the number of time slices.

Time complexity of computing G is O(N3L).

For 103 warmup steps and 104 sampling steps, it takes 15 hours.

For large simulations, N = O(104), L = O(102), the projectedexecution time could take several days to months.

Profile of a DQMC simulation (N = 256, L = 96)

Matrix kernel Execution time

Matrix-matrix multiplication 72.39%Pivoted QR decomposition 17.83%Matrix inversion 3.02%Others 6.76%

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 8 / 31

Page 19: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Outline

1 Determinant quantum Monte Carlo simulations

2 Matrix multiplication of sparse matrix exponentials

3 Parallel block checkerboard methods on multicore and GPU

4 Experiments and results

5 Concluding remarks

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 9 / 31

Page 20: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Matrix-matrix Multiplication

Some tuned result on multicore and on GPU (Fermi)

DGEMM on Intel Core i7-920 4 core with MKL is about 40 Gflop/s.(my laptop.)SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak LAWN 245,2010]DGEMM (362Gflop/s on Fermi) [Guangming Tan et. al. SC11]

It is great, but the running time grows cubically with problem size N .

Sparse-dense matrix multiplication takes only O(N2) time.

In the Green’s function calculation, G = (I +BLBL−1 · · ·B2B1)−1,

each Bi = eA is a matrix exponential.

eA = I +A+A2

2!+A3

3!+ · · · =

∞∑k=0

Ak

k!.

Matrix A is highly sparse, but eA is dense.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 10 / 31

Page 21: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Matrix-matrix Multiplication

Some tuned result on multicore and on GPU (Fermi)

DGEMM on Intel Core i7-920 4 core with MKL is about 40 Gflop/s.(my laptop.)SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak LAWN 245,2010]DGEMM (362Gflop/s on Fermi) [Guangming Tan et. al. SC11]

It is great, but the running time grows cubically with problem size N .

Sparse-dense matrix multiplication takes only O(N2) time.

In the Green’s function calculation, G = (I +BLBL−1 · · ·B2B1)−1,

each Bi = eA is a matrix exponential.

eA = I +A+A2

2!+A3

3!+ · · · =

∞∑k=0

Ak

k!.

Matrix A is highly sparse, but eA is dense.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 10 / 31

Page 22: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Matrix-matrix Multiplication

Some tuned result on multicore and on GPU (Fermi)

DGEMM on Intel Core i7-920 4 core with MKL is about 40 Gflop/s.(my laptop.)SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak LAWN 245,2010]DGEMM (362Gflop/s on Fermi) [Guangming Tan et. al. SC11]

It is great, but the running time grows cubically with problem size N .

Sparse-dense matrix multiplication takes only O(N2) time.

In the Green’s function calculation, G = (I +BLBL−1 · · ·B2B1)−1,

each Bi = eA is a matrix exponential.

eA = I +A+A2

2!+A3

3!+ · · · =

∞∑k=0

Ak

k!.

Matrix A is highly sparse, but eA is dense.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 10 / 31

Page 23: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Matrix-matrix Multiplication

Some tuned result on multicore and on GPU (Fermi)

DGEMM on Intel Core i7-920 4 core with MKL is about 40 Gflop/s.(my laptop.)SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak LAWN 245,2010]DGEMM (362Gflop/s on Fermi) [Guangming Tan et. al. SC11]

It is great, but the running time grows cubically with problem size N .

Sparse-dense matrix multiplication takes only O(N2) time.

In the Green’s function calculation, G = (I +BLBL−1 · · ·B2B1)−1,

each Bi = eA is a matrix exponential.

eA = I +A+A2

2!+A3

3!+ · · · =

∞∑k=0

Ak

k!.

Matrix A is highly sparse, but eA is dense.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 10 / 31

Page 24: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Matrix-matrix Multiplication

Some tuned result on multicore and on GPU (Fermi)

DGEMM on Intel Core i7-920 4 core with MKL is about 40 Gflop/s.(my laptop.)SGEMM can reach 662 Gflop/s on Fermi. [Jakub Kurzak LAWN 245,2010]DGEMM (362Gflop/s on Fermi) [Guangming Tan et. al. SC11]

It is great, but the running time grows cubically with problem size N .

Sparse-dense matrix multiplication takes only O(N2) time.

In the Green’s function calculation, G = (I +BLBL−1 · · ·B2B1)−1,

each Bi = eA is a matrix exponential.

eA = I +A+A2

2!+A3

3!+ · · · =

∞∑k=0

Ak

k!.

Matrix A is highly sparse, but eA is dense.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 10 / 31

Page 25: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Checkerboard Method

Checkerboard method can approximate eA by eA ≈ eA1eA2 · · · eAk , inwhich each eAj is sparse.

Theorem

If Ai is strictly sparse with zero diagonal, eAi has the same sparse patternas Ai with diagonal fill in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row andper column.

Checkerboard method for computing eA

1 Split A = A1 +A2 + · · ·+Ak such that each Ai is strictly sparse.

2 Exponentiate each Ai.

3 Return eA1eA2 · · · eAk as an approximation to eA.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 11 / 31

Page 26: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Checkerboard Method

Checkerboard method can approximate eA by eA ≈ eA1eA2 · · · eAk , inwhich each eAj is sparse.

Theorem

If Ai is strictly sparse with zero diagonal, eAi has the same sparse patternas Ai with diagonal fill in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row andper column.

Checkerboard method for computing eA

1 Split A = A1 +A2 + · · ·+Ak such that each Ai is strictly sparse.

2 Exponentiate each Ai.

3 Return eA1eA2 · · · eAk as an approximation to eA.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 11 / 31

Page 27: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Checkerboard Method

Checkerboard method can approximate eA by eA ≈ eA1eA2 · · · eAk , inwhich each eAj is sparse.

Theorem

If Ai is strictly sparse with zero diagonal, eAi has the same sparse patternas Ai with diagonal fill in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row andper column.

Checkerboard method for computing eA

1 Split A = A1 +A2 + · · ·+Ak such that each Ai is strictly sparse.

2 Exponentiate each Ai.

3 Return eA1eA2 · · · eAk as an approximation to eA.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 11 / 31

Page 28: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Checkerboard Method

Checkerboard method can approximate eA by eA ≈ eA1eA2 · · · eAk , inwhich each eAj is sparse.

Theorem

If Ai is strictly sparse with zero diagonal, eAi has the same sparse patternas Ai with diagonal fill in.

Definition (strictly sparse)

A matrix is strictly sparse if it contains at most one nonzero per row andper column.

Checkerboard method for computing eA

1 Split A = A1 +A2 + · · ·+Ak such that each Ai is strictly sparse.

2 Exponentiate each Ai.

3 Return eA1eA2 · · · eAk as an approximation to eA.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 11 / 31

Page 29: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Example: One Dimensional Ring

A =

0 h hh 0

. . .. . .

0 hh h 0

 

x  2   3  

5  

1  

6  7  8  

4  

Odd  link  Even  link  

eA ≈

D. . .

D

cosh(h) sinh(h)D

. . .

sinh(h) cosh(h)

where D =

[cosh(h) sinh(h)sinh(h) cosh(h)

].

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 12 / 31

Page 30: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Example: One Dimensional Ring

A =

0 h hh 0

. . .. . .

0 hh h 0

 

x  2   3  

5  

1  

6  7  8  

4  

Odd  link  Even  link  

eA ≈

D. . .

D

cosh(h) sinh(h)D

. . .

sinh(h) cosh(h)

where D =

[cosh(h) sinh(h)sinh(h) cosh(h)

].

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 12 / 31

Page 31: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain betterperformance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x200

20

40

60

80

100

120

Matrix size

Ru

nn

ing

tim

e (

se

co

nd

)

ExactCheckerboardBlock Checkerboard

10−4

10−3

10−2

10−1

10−8

10−6

10−4

10−2

100

Discretize parameter ∆ τR

ela

tive

err

or

CheckerboardBlock Checkerboard

The algorithm is similar, but the basic element is a block submatrix.

We will focus on the parallelization of block checkerboard method.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 13 / 31

Page 32: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain betterperformance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x200

20

40

60

80

100

120

Matrix size

Ru

nn

ing

tim

e (

se

co

nd

)

ExactCheckerboardBlock Checkerboard

10−4

10−3

10−2

10−1

10−8

10−6

10−4

10−2

100

Discretize parameter ∆ τR

ela

tive

err

or

CheckerboardBlock Checkerboard

The algorithm is similar, but the basic element is a block submatrix.

We will focus on the parallelization of block checkerboard method.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 13 / 31

Page 33: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Block Checkerboard Method

Exploring block structure of sparse matrices can obtain betterperformance and accuracy.

12x12x12 16x16x16 18x18x18 20x20x200

20

40

60

80

100

120

Matrix size

Ru

nn

ing

tim

e (

se

co

nd

)

ExactCheckerboardBlock Checkerboard

10−4

10−3

10−2

10−1

10−8

10−6

10−4

10−2

100

Discretize parameter ∆ τR

ela

tive

err

or

CheckerboardBlock Checkerboard

The algorithm is similar, but the basic element is a block submatrix.

We will focus on the parallelization of block checkerboard method.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 13 / 31

Page 34: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Outline

1 Determinant quantum Monte Carlo simulations

2 Matrix multiplication of sparse matrix exponentials

3 Parallel block checkerboard methods on multicore and GPU

4 Experiments and results

5 Concluding remarks

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 14 / 31

Page 35: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 36: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 37: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 38: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 39: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 40: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Computational Difficulties

Combination of sparse matrix and dense matrix computation.(generalized SPMV)

Dense matrix-matrix multiplication

Computational bound

Sparse matrix-vector multiplication

Memory bound

Sparse matrix-matrix multiplication

?

Multiplication of a sequence of sparse matrices of differentcharacteristics.

Matrix size is not large enough to reach good performance.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 15 / 31

Page 41: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

2D and 3D Torus Lattices

The kinetic matrices in our study are 2D and 3D torus lattices.

Images are from http://www.trampelwurm.ch/schmidt/wilhelmtux/swissremix/html/Linuxfibel/netstruct.htm and

http://www.fujitsu.com/global/news/pr/archives/month/2009/20090717-01.html

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 16 / 31

Page 42: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

2D Lattice and Kinetic Matrix

The kinetic matrix of 2D latticescan be split into 3 block strictlysparse matrices, K1,K2, and K3.

Checkerboard methodapproximates the matrixexponential by

eK3/2eK2/2eK1eK2/2eK3/2.

The figure shows the sparsepattern of the matrix of a 4× 4 2Dlattice is shown, and the matrixexponentials, eK1 , eK2 , and eK3 .

0 10

0

5

10

15

(a) kinetic matrix K0 10

0

5

10

15

(b) expm(K1)

0 10

0

5

10

15

(c) expm(K2)

0 10

0

5

10

15

(d) expm(K3)

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 17 / 31

Page 43: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

3D Lattice and Kinetic Matrix

The kinetic matrix of 3D latticescan be split into 5 block strictlysparse matrices, K1,K2,K3,K4,and K5.

Checkerboard methodapproximates the matrixexponential by

eK52 e

K42 e

K32 e

K22 eK1e

K52 e

K32 e

K42 e

K52 .

The figure shows the sparsepattern of the matrix of a 4× 42D lattice is shown, and thematrix exponentials,eK1 , eK2 , eK3 , eK4 , and eK5 .

0 50

0

20

40

60

(a) kinetic matrix K0 50

0

20

40

60

(b) expm(K1)

0 50

0

20

40

60

(c) expm(K2)

0 50

0

20

40

60

(d) expm(K3)

0 50

0

20

40

60

(e) expm(K4)

0 50

0

20

40

60

(f) expm(K5)

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 18 / 31

Page 44: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

eK1: Block Diagonal Matrix

In 2D and 3D problems, eK1 =

A1

A2

. . .

Ak

, block diagonal.

For B =

B11 B12 . . . B1k

B21 B22 . . . B2k...

.... . .

...Bk1 Bk2 . . . Bkk

, the product of eK1B is

eK1B =

A1B11 A1B12 . . . A1B1k

A2B21 A2B22 . . . A2B2k...

.... . .

...AkBk1 AkBk2 . . . AkBkk

.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 19 / 31

Page 45: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

eK1: Block Diagonal Matrix

In 2D and 3D problems, eK1 =

A1

A2

. . .

Ak

, block diagonal.

For B =

B11 B12 . . . B1k

B21 B22 . . . B2k...

.... . .

...Bk1 Bk2 . . . Bkk

, the product of eK1B is

eK1B =

A1B11 A1B12 . . . A1B1k

A2B21 A2B22 . . . A2B2k...

.... . .

...AkBk1 AkBk2 . . . AkBkk

.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 19 / 31

Page 46: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

eKi, i 6= 1: Bidiagonal Matrices

eKi =

. . .

Di Cij. . .

Cji Dj

. . .

.

where Ci,j and Dj are diagonalfor eKi , i 6= 1.

There are only two non-zeros per row/column.

Let B =

B11 B12 . . . B1k

B21 B22 . . . B2k...

.... . .

...Bk1 Bk2 . . . Bkk

.

The (i, j) block and the (j, i)block of F = eKiB are

Fij = DiBii + CijBijFji = DjBjj + CjiBji

.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 20 / 31

Page 47: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

eKi, i 6= 1: Bidiagonal Matrices

eKi =

. . .

Di Cij. . .

Cji Dj

. . .

.

where Ci,j and Dj are diagonalfor eKi , i 6= 1.

There are only two non-zeros per row/column.

Let B =

B11 B12 . . . B1k

B21 B22 . . . B2k...

.... . .

...Bk1 Bk2 . . . Bkk

.

The (i, j) block and the (j, i)block of F = eKiB are

Fij = DiBii + CijBijFji = DjBjj + CjiBji

.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 20 / 31

Page 48: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

eKi, i 6= 1: Bidiagonal Matrices

eKi =

. . .

Di Cij. . .

Cji Dj

. . .

.

where Ci,j and Dj are diagonalfor eKi , i 6= 1.

There are only two non-zeros per row/column.

Let B =

B11 B12 . . . B1k

B21 B22 . . . B2k...

.... . .

...Bk1 Bk2 . . . Bkk

.

The (i, j) block and the (j, i)block of F = eKiB are

Fij = DiBii + CijBijFji = DjBjj + CjiBji

.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 20 / 31

Page 49: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelize Algorithms for eK1B

Different algorithms need be designed from different types of matrices.

Block based algorithm for eK1B.

Parallelization of eK1B

For each Bij do

Parallel compute Ci,j = AiBij .

End for each

For cache effect, finer grained parallelization is needed.

Finer-grained parallelization of eK1B

For each Bij doFor each sub-block of Cij : Cij(k, `) do

Parallel compute Ci,j(k, `).

End for each

End for each

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 21 / 31

Page 50: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelize Algorithms for eK1B

Different algorithms need be designed from different types of matrices.Block based algorithm for eK1B.

Parallelization of eK1B

For each Bij do

Parallel compute Ci,j = AiBij .

End for each

For cache effect, finer grained parallelization is needed.

Finer-grained parallelization of eK1B

For each Bij doFor each sub-block of Cij : Cij(k, `) do

Parallel compute Ci,j(k, `).

End for each

End for each

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 21 / 31

Page 51: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelize Algorithms for eK1B

Different algorithms need be designed from different types of matrices.Block based algorithm for eK1B.

Parallelization of eK1B

For each Bij do

Parallel compute Ci,j = AiBij .

End for each

For cache effect, finer grained parallelization is needed.

Finer-grained parallelization of eK1B

For each Bij doFor each sub-block of Cij : Cij(k, `) do

Parallel compute Ci,j(k, `).

End for each

End for each

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 21 / 31

Page 52: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelize Algorithms for eKiB, i 6= 1

Column based parallelizationFor better performance, aggregate the computation of eKiB.

Parallelization of eKiB, i 6= 1

For each column bi in B do

Parallel compute eK2/2eK3/2bi or eK2/2 · · · eK5/2bi.

End for each

Divide bi to fit cache

Parallelization of eKiB, i 6= 1

For each column bi in B doFor each segment of bi(:) do

Parallel compute eK2/2eK3/2bi(:) or eK2/2 · · · eK5/2bi(:).

End for each

End for each

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 22 / 31

Page 53: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Parallelize Algorithms for eKiB, i 6= 1

Column based parallelizationFor better performance, aggregate the computation of eKiB.

Parallelization of eKiB, i 6= 1

For each column bi in B do

Parallel compute eK2/2eK3/2bi or eK2/2 · · · eK5/2bi.

End for each

Divide bi to fit cache

Parallelization of eKiB, i 6= 1

For each column bi in B doFor each segment of bi(:) do

Parallel compute eK2/2eK3/2bi(:) or eK2/2 · · · eK5/2bi(:).

End for each

End for each

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 22 / 31

Page 54: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Outline

1 Determinant quantum Monte Carlo simulations

2 Matrix multiplication of sparse matrix exponentials

3 Parallel block checkerboard methods on multicore and GPU

4 Experiments and results

5 Concluding remarks

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 23 / 31

Page 55: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Experimental Setting

Experiment setting

We experimented the parallel algorithms on multicore and GPU.

Matrices are from 2D and 3D toruses.

Multicore CPU

Intel Core i7 950, 8G RAM, whose peak performance is 48.48Gflop/s.

Using DGEMM in MKL 11 for block matrix multiplication and forcomparison.

GPU

GeForce GTX 480, whose peak performance is 1344GFlop/s.

Using CUDA 4 and DGEMM and SGEMM CUDA SDK forcomparison.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 24 / 31

Page 56: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Results of 2D Problems on Multicore

0.000  

5.000  

10.000  

15.000  

20.000  

25.000  

1   2   3   4   5   6   7   8  

Gflo

p/s  

Number  of  threads  

1024/32  

4096/64  

16384/128  

0  

5  

10  

15  

20  

25  

30  

35  

1   2   3   4   5   6   7   8  

Performan

ce  (G

flop/s)  

Number  of  threads  

1024/32  

4096/64  

16384/128  

Lattice size: 32× 32,64× 64, and 128× 128.

Larger problem hasmore stableperformance result.

The results of 32× 32are less stable thanthose of other cases.

Hyperthreading isalmost no effect.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 25 / 31

Page 57: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Results of 3D Problems on Multicore

0  

2  

4  

6  

8  

10  

12  

14  

64/4   512/8   1728/12   4096/16   8000/20   13824/24  

Gflo

p/s  

Problem  size/block  size  

1  

2  

3  

4  

0  

5  

10  

15  

20  

25  

64/4   512/8   1728/12   4096/16   8000/20   13824/24  

Gflo

p/s  

Problem  size/block  size  

1  

2  

3  

4  

6  

8  

Lattice size: 4× 4× 4,8× 8× 8, 12× 12× 12,16× 16× 16,20× 20× 20, and24× 24× 24.

The speedup is about 3using 4 cores.

The Gflop/s of eK1 ismuch better than thatof the overallperformance.

Hyperthreading isworking, but its effectis decreasing.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 26 / 31

Page 58: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Results of 2D Problems on GPU

1.00E+00  

1.00E+01  

1.00E+02  

1.00E+03  

1.00E+04  

1.00E+05  

1.00E+06  

64   256   1024   4096  

Execu&

on  &me  (1E-­‐6  s)  

Problem  size  

SDK  float  

SDK  double  

Blkckb  float  

Blkckb  double  

0.00  

10.00  

20.00  

30.00  

40.00  

50.00  

60.00  

70.00  

64   256   1024   4096  

Speedu

p  

Problem  size  

single  

double  

Lattice size: 8× 8,16× 16, 32× 32, and64× 64.

The speedup is up to67 for single precision.

DGEMM of CUDASDK cannot finish the64× 64 case.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 27 / 31

Page 59: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Results of 3D Problems on GPU

1.00E+00  

1.00E+01  

1.00E+02  

1.00E+03  

1.00E+04  

1.00E+05  

1.00E+06  

64   512   4096  

Execu&

on  &me  (1E-­‐6  s)  

Problem  size  

SDK  float  

SDK  double  

Blkckb  float  

Blkckb  double  

0  

20  

40  

60  

80  

100  

120  

140  

160  

64   512   4096  

Speedu

p  

Problem  size  

single  

double  

Lattice size: 4× 4× 4,8× 8× 8, 12× 12× 12,and 16× 16× 16.

The speedup is up to147 for single precision.

DGEMM of CUDASDK cannot finish the16× 16× 16 case.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 28 / 31

Page 60: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Outline

1 Determinant quantum Monte Carlo simulations

2 Matrix multiplication of sparse matrix exponentials

3 Parallel block checkerboard methods on multicore and GPU

4 Experiments and results

5 Concluding remarks

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 29 / 31

Page 61: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Concluding Remarks

Scientific applications request more and more computational power toenable larger and larger simulations. But to extract performance frommodern HPC machines, which hybrid coarse-grained and fine-grainedparallel architectures, application developers need to redesign theprogram.

For DQMC simulations, checkerboard method can make thealgorithm more scalable with problem size. This paper presents thepreliminary study of the parallelization of checkerboard method onmulticore and GPU. Further performance tunings are required.

Parallelization of general lattice geometry of checkerboard methodneed be studied. The integration of the parallel checkerboard methodto package needs be done.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 30 / 31

Page 62: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Concluding Remarks

Scientific applications request more and more computational power toenable larger and larger simulations. But to extract performance frommodern HPC machines, which hybrid coarse-grained and fine-grainedparallel architectures, application developers need to redesign theprogram.

For DQMC simulations, checkerboard method can make thealgorithm more scalable with problem size. This paper presents thepreliminary study of the parallelization of checkerboard method onmulticore and GPU. Further performance tunings are required.

Parallelization of general lattice geometry of checkerboard methodneed be studied. The integration of the parallel checkerboard methodto package needs be done.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 30 / 31

Page 63: Parallelizing the Hamiltonian Computation in DQMC ...Che-Rung Lee (cherung@cs.nthu.edu.tw) Parallel Checkerboard Method AsHES 2012 1 / 31. Outline 1 Determinant quantum Monte Carlo

Concluding Remarks

Scientific applications request more and more computational power toenable larger and larger simulations. But to extract performance frommodern HPC machines, which hybrid coarse-grained and fine-grainedparallel architectures, application developers need to redesign theprogram.

For DQMC simulations, checkerboard method can make thealgorithm more scalable with problem size. This paper presents thepreliminary study of the parallelization of checkerboard method onmulticore and GPU. Further performance tunings are required.

Parallelization of general lattice geometry of checkerboard methodneed be studied. The integration of the parallel checkerboard methodto package needs be done.

Che-Rung Lee ([email protected]) Parallel Checkerboard Method AsHES 2012 30 / 31