Performance Evaluation of GPU Accelerated Linear Solvers with … · 2018. 3. 30. · Performance...

Performance Evaluation of GPU Accelerated Linear Solvers with TCAD Examples

Ana Iontcheva
Senior Development Engineer - Numerics
Silvaco Inc, March 26, 2018

Outline

● Introduction - Who is Silvaco? What is TCAD?
● Linear Solvers for TCAD Simulations
● Performance Evaluation Results
● Conclusion

Silvaco

● Silvaco - Silicon Valley Company
● 34 years old, headquartered in Santa Clara, CA
● 16 offices worldwide, 7 development centers - US, Europe and Asia
● Largest privately held Electronic Design Automation company
● Develops advanced software tools for design and verification of semiconductor chips

EDA Design Flow

[Figure: EDA design flow diagram. Block labels: Layout, Spice, Parasitic extraction, Spice modeling, Process, Device, Measured Data, Schematic, LVS, DRC, TCAD, Parasitic reduction, Modeling Design & Verification, Reliability Analysis, Variation Analysis, Simulation.]

Technology Computer-Aided Design

TCAD - category of software tools for designing semiconductor devices

Applications:

- Memory - STTRAM, 3D NAND Flash
- Display - OLED, QLED, Flexible LCD
- Optical - Laser, Solar Cell, Photodiodes, TFT
- Advanced Process Development - FinFET (Fin Field Effect Transistor, a 3D device used in modern processors), Quantum models
- Radiation/Reliability - space exploration
- Power - Electric Cars, Consumer Appliances, Electric Railways, Medical Equipment, High-Power Equipment in Power Plants

TCAD Design Flow

[Figure: TCAD design flow diagram. Stages and tools: Structure Builder and Process Simulation (Victory Process); Device Simulation (Victory Device); 3D RC Extraction (Clever). Other labels: Layout, Automation and Optimization, 2D/3D Visualization.]

Technology Computer-Aided Design

● Process Simulation - models the fabrication steps of semiconductor devices. A device is built up as a series of stacked, very thin layers, usually made from different materials.

- Etching and deposition.
- Oxidation - process which converts silicon on the wafer to silicon dioxide. Silicon dioxide layers are used as insulators or masks for ion implantation.
- Ion implantation - introduces dopant impurities into crystalline silicon.
- Diffusion - the movement of impurity atoms in a semiconductor material at high temperatures. Diffusion is used to form the source, drain and channel regions in a transistor.

The output from the Process Simulation is a 2D or 3D structure which can be used in the Device Simulation.

● Device Simulation - models the electrical, thermal and optical behavior of semiconductor devices

Linear Solvers for TCAD Simulations

• Semiconductor process and device simulations require the solution of a system of PDEs discretized on a mesh.

• The nonlinear problem is solved with a nonlinear solver and at each iteration of the nonlinear solver a sparse linear system needs to be solved.

• A significant part of the overall computation time is spent on solving the linear systems so the performance of the linear solvers is essential.

• Accuracy is a very important requirement for the linear solvers.
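The structure described above, a nonlinear outer loop with one sparse linear solve per iteration, can be sketched in a few lines of Python with SciPy. This is only an illustration and not Silvaco's solver: the model problem (-u'' + u^3 = 1 on a 1D mesh), the function name newton_solve and all parameters are assumptions chosen to keep the example small, and a sparse direct solve stands in for the iterative solvers evaluated later.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve


def newton_solve(n=200, rtol=1e-8, max_iter=20):
    """Solve -u'' + u**3 = 1 on (0, 1), u(0) = u(1) = 0, with Newton's method.

    Hypothetical model problem: each Newton step solves one sparse linear system.
    """
    h = 1.0 / (n + 1)
    # Finite-difference matrix for -u'' on the n interior mesh points.
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr") / h**2
    f = np.ones(n)
    u = np.zeros(n)
    f_norm = np.linalg.norm(f)
    linear_solves = 0
    for _ in range(max_iter):
        F = A @ u + u**3 - f                      # nonlinear residual F(u)
        if np.linalg.norm(F) <= rtol * f_norm:
            break
        J = (A + sp.diags(3.0 * u**2)).tocsc()    # sparse Jacobian dF/du
        u = u + spsolve(J, -F)                    # the sparse linear solve
        linear_solves += 1
    rel_residual = np.linalg.norm(A @ u + u**3 - f) / f_norm
    return u, linear_solves, rel_residual


u, linear_solves, rel_residual = newton_solve()
print(f"{linear_solves} linear solves, relative residual {rel_residual:.2e}")
```

In a production simulator the spsolve call is where a preconditioned Krylov solver, on CPU or GPU, would be plugged in, which is why the linear solver dominates the run time.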

Evaluation Setup

• Our first step towards incorporating accelerators into our linear solvers was to evaluate third party linear solver libraries accelerated with GPUs and compare the results with our parallel linear solvers running on CPUs only. Our purpose was to identify the types of tools for which GPU acceleration is possible and get an approximate idea of the speedup that we would get.

• We tested the GPU accelerated linear solvers for sparse linear systems in Paralution [1] and Magma [2] on linear systems with matrices extracted from different modules of our Process Simulator Victory Process.

• SolverLib is a library of linear solvers developed at Silvaco. We tried various combinations of solvers, preconditioners, triangular solvers and parameters, and these results show the best-performing combination that we could come up with for each specific problem.

• For our testing we used an NVIDIA Tesla P100 16GB accelerator. SolverLib was run on an Intel Xeon E5-2699 CPU.

The numerical simulations needed for this work were performed on Microway's Tesla GPU accelerated compute cluster. The author is grateful for the HPC time provided by Microway, Inc.

1. PARALUTION Labs. Paralution v1.1, 2016. http://www.paralution.com
2. Magma 2.2.0. http://icl.cs.utk.edu/magma/software/

Victory Process - 3D stationary diffusion of oxygen in silicon oxide

laplace.mtx - non-symmetric
Source - Victory Process
n = 1 496 362
nnz = 11 640 245

Library     Solver    Preconditioner  Triangular solver  #iterations  Time    Residual
Paralution  BICGSTAB  Gauss-Seidel    -                  135          0.59 s  5.86e-10
Magma       BICGSTAB  ILU(0)          Jacobi             128          0.72 s  7.75e-10
SolverLib   BICGSTAB  ILU(0)          -                  57           6.87 s  6.92e-10

The laplace matrix comes from the finite difference approximation of a stationary diffusion equation (Laplace equation). The equation models 3D stationary diffusion of oxygen in silicon oxide during the thermal oxidation of silicon. In this example Paralution shows the best performance, probably due to the Gauss-Seidel preconditioner. Paralution is 11.6 times faster than SolverLib (0.59 s vs. 6.87 s); note that the SolverLib BICGSTAB solver is running on 1 core in this example. The Paralution and Magma results are close, with Paralution about 1.2 times faster. Magma did not offer a Gauss-Seidel preconditioner as an option, so the comparison between the two GPU libraries could not be calibrated further.
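For readers who want to reproduce the shape of this experiment on a small scale, the following is a minimal CPU-only sketch using SciPy. It is not the Paralution, Magma or SolverLib setup: the matrix is a small symmetric 7-point Laplacian built here as a stand-in for laplace.mtx (which is non-symmetric and far larger), SciPy's spilu is a threshold-based incomplete LU rather than ILU(0), and the helper names laplacian_3d and solve_bicgstab_ilu are made up for the example.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, bicgstab, spilu


def laplacian_3d(m):
    """7-point finite-difference Laplacian on an m x m x m grid (Dirichlet)."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    A = (sp.kron(sp.kron(T, I), I)
         + sp.kron(sp.kron(I, T), I)
         + sp.kron(sp.kron(I, I), T))
    return A.tocsc()


def solve_bicgstab_ilu(A, b):
    """Run incomplete-LU preconditioned BiCGSTAB with SciPy's default tolerance."""
    ilu = spilu(A)  # threshold ILU from SuperLU, an assumption, not ILU(0)
    M = LinearOperator(A.shape, matvec=ilu.solve, dtype=A.dtype)
    callbacks = 0

    def count(_xk):
        nonlocal callbacks
        callbacks += 1  # SciPy calls this once per completed iteration

    x, info = bicgstab(A, b, M=M, callback=count)
    if info != 0:
        raise RuntimeError(f"BiCGSTAB did not converge (info={info})")
    rel_residual = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
    return x, callbacks, rel_residual


A = laplacian_3d(10)            # 1000 unknowns, tiny compared to laplace.mtx
b = np.ones(A.shape[0])
x, callbacks, rel_residual = solve_bicgstab_ilu(A, b)
print(f"callbacks: {callbacks}, relative residual: {rel_residual:.2e}")
```

Reporting the iteration count, the time and the final residual together, as the table does, matters because a cheaper preconditioner can need more iterations and still finish sooner.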

Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon

The next three matrices come from the finite difference approximation of a Stokes equation (Navier-Stokes model of a liquid with high viscosity and hence with very low Reynolds number). The equation models a 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon.

This is the most time-consuming part of the simulation, so the speedup here is of particular importance.

We have matrices of 3 different sizes - about 1 million, 2 million and 4 million unknowns. We wanted to see whether we would get consistent results.


vas_stokes_1M.mtx - non-symmetric
n = 1 090 664
nnz = 32 767 207

Library     Solver                 Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                 ILU(0,1)        -                  1759         16.93 s  7.8e-10
Magma       IDR(8)                 ILU(0)          Jacobi             286          13.55 s  9.58e-10
SolverLib   PAM BICGSTAB - 16 CPU  ILU(1)          -                  434          37.5 s   6.2e-10


vas_stokes_2M.mtx - non-symmetric
n = 2 146 677
nnz = 65 129 037

Library     Solver                 Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                 ILU(0,1)        -                  1745         26.97 s  9.08e-10
Magma       IDR(8)                 ILU(0)          Jacobi             271          23.5 s   1.45e-09
SolverLib   PAM BICGSTAB - 16 CPU  ILU(1)          -                  385          62.8 s   8.45e-10


vas_stokes_4M.mtx - non-symmetric
n = 4 382 246
nnz = 131 577 61

Library     Solver                 Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                 ILU(0,1)        -                  216          61.76 s  1.46e-09
Magma       IDR(8)                 ILU(0)          Jacobi             33           57.9 s   2.04e-09
SolverLib   PAM BICGSTAB - 16 CPU  ILU(1)          -                  47           174.1 s  3.83e-10


We can see that the results are consistent. On all three examples Magma outperforms the other solvers. It should be noted here that the matrices are ill-conditioned, and Induced Dimension Reduction (IDR) with dimension 8 was used in Paralution and Magma.

For comparison we used Silvaco's parallel domain decomposition based solver PAM on 16 cores.

Magma's speedup compared to SolverLib on 16 cores is 2.77 on the 1 million sized problem, 2.67 on the 2 million sized problem and 3 on the 4 million sized problem, so overall close to 3 times. Paralution showed results that were close to Magma.
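The quoted speedups are simply the ratio of the SolverLib time to the GPU solver time for each matrix. A short Python check, using only the times from the three Stokes tables (the dictionary layout and the function name speedups are choices made for this sketch):

```python
# Wall-clock times in seconds, copied from the three Stokes tables.
times = {
    "1M": {"Paralution": 16.93, "Magma": 13.55, "SolverLib": 37.5},
    "2M": {"Paralution": 26.97, "Magma": 23.5, "SolverLib": 62.8},
    "4M": {"Paralution": 61.76, "Magma": 57.9, "SolverLib": 174.1},
}


def speedups(times, baseline="SolverLib"):
    """Return baseline_time / solver_time for every non-baseline solver."""
    return {
        case: {name: row[baseline] / t for name, t in row.items() if name != baseline}
        for case, row in times.items()
    }


for case, row in speedups(times).items():
    line = ", ".join(f"{name} {value:.2f}x" for name, value in row.items())
    print(f"{case}: {line}")
```

The same ratios for Paralution come out between roughly 2.2 and 2.8, which is what "close to Magma" refers to.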

Victory Device - example nv1

Source: Victory Device
Size: 75 468
nnz: 2 449 194
Very ill-conditioned matrix - huge condition number.
PAS running on 36 threads.
Sparse QR quad-double - NVIDIA

Solver     Total time  Residual error
PAS        0.938 s     1.9e-15
Sparse QR  722 s       5.13e-26

CLEVER - example nv7

Source: Clever
Size: 1 802 979
nnz: 25 180 017

PAM running on 16 MPI processes.
PCG with Jacobi preconditioner - on CPU and GPU.

Iterative solver  #iterations  Total time  Residual error
PAM               36           5.6 s       1.20e-09
PCG+Jacobi        493          1.9 s       1.27e-09

Comment: PCG is about 2.9 times faster.
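PCG with a Jacobi preconditioner is attractive on GPUs because applying the preconditioner is just an elementwise scaling by the inverse diagonal, so every step parallelizes trivially even though many more iterations are needed (493 vs. 36 above). A minimal CPU sketch with SciPy follows; the test matrix is a small symmetric positive definite stand-in with a varying diagonal, not the nv7 matrix, and the name jacobi_pcg is hypothetical.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg


def jacobi_pcg(A, b):
    """Conjugate gradients with a Jacobi (diagonal) preconditioner."""
    inv_diag = 1.0 / A.diagonal()
    # Applying the preconditioner is an elementwise product with 1/diag(A).
    M = LinearOperator(A.shape, matvec=lambda r: inv_diag * r, dtype=A.dtype)
    iterations = 0

    def count(_xk):
        nonlocal iterations
        iterations += 1

    x, info = cg(A, b, M=M, callback=count)  # SciPy's default tolerance
    if info != 0:
        raise RuntimeError(f"PCG did not converge (info={info})")
    rel_residual = np.linalg.norm(b - A @ x) / np.linalg.norm(b)
    return x, iterations, rel_residual


# Stand-in SPD matrix: 2D 5-point Laplacian plus a varying positive diagonal.
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = (sp.kron(T, I) + sp.kron(I, T) + sp.diags(np.linspace(0.1, 10.0, m * m))).tocsr()
b = np.ones(A.shape[0])

x, iterations, rel_residual = jacobi_pcg(A, b)
print(f"iterations: {iterations}, relative residual: {rel_residual:.2e}")
```

Conjugate gradients requires a symmetric positive definite matrix, so this approach applies to the RC extraction systems but not to the non-symmetric process simulation matrices above.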

Conclusion

● GPU accelerated linear solvers are a promising tool for accelerating Process Simulation and RC extraction tools and we have added such solvers into our library of linear solvers.

● We are continuing to work on developing a good GPU accelerated linear solver for our Device Simulation tools.
