Performance Evaluation of GPU Accelerated Linear Solvers with … · 2018. 3. 30. · Performance...
Performance Evaluation of GPU Accelerated Linear Solvers
with TCAD Examples
Ana Iontcheva
Senior Development Engineer - Numerics
Silvaco Inc, March 26, 2018
Outline
● Introduction - Who is Silvaco? What is TCAD?
● Linear Solvers for TCAD Simulations
● Performance Evaluation Results
● Conclusion
Silvaco
● Silvaco - Silicon Valley Company
● 34 years old, headquartered in Santa Clara, CA
● 16 offices worldwide, 7 development centers - US, Europe and Asia
● Largest privately held Electronic Design Automation company
● Develops advanced software tools for design and verification of semiconductor chips
EDA Design Flow
[Figure: EDA design flow diagram. Blocks: Layout, Spice, Parasitic extraction, Spice modeling, Process, Device, Measured Data, Schematic, LVS, DRC, TCAD, Parasitic reduction, Modeling Design & Verification, Reliability Analysis, Variation Analysis, Simulation]
Technology Computer-Aided Design
TCAD - category of software tools for designing semiconductor devices
Applications:
- Memory - STTRAM, 3D NAND Flash
- Display - OLED, QLED, Flexible LCD
- Optical - Laser, Solar Cell, Photodiodes, TFT
- Advanced Process Development - FinFET (Fin Field Effect Transistor, a 3D device used in modern processors), Quantum models
- Radiation/Reliability - space exploration
- Power - Electric Cars, Consumer Appliances, Electric Railways, Medical Equipment, High-Power Equipment in Power Plants
TCAD Design Flow
[Figure: TCAD design flow diagram. Blocks: Structure Builder, Process Simulation (Victory Process); Device Simulation (Victory Device); 3D RC Extraction (Clever); Layout; Automation and Optimization; 2D/3D Visualization]
Technology Computer-Aided Design
● Process Simulation - models the fabrication steps of semiconductor devices. A device is built as a series of stacked, very thin layers, usually made from different materials.
- Etching and deposition.
- Oxidation - a process which converts silicon on the wafer to silicon oxide. Silicon dioxide layers are used as insulators or masks for ion implantation.
- Ion implantation - introduces dopant impurities into crystalline silicon.
- Diffusion - the movement of impurity atoms in a semiconductor material at high temperatures. Diffusion is used to form the source, drain and channel regions in a transistor.
The output from the Process Simulation is a 2D or 3D structure which can be used in the Device Simulation.
● Device Simulation - models the electrical, thermal and optical behavior of semiconductor devices
Linear Solvers for TCAD Simulations
• Semiconductor process and device simulations require the solution of a system of PDEs discretized on a mesh.
• The nonlinear problem is solved with a nonlinear solver and at each iteration of the nonlinear solver a sparse linear system needs to be solved.
• A significant part of the overall computation time is spent on solving the linear systems so the performance of the linear solvers is essential.
• Accuracy is a very important requirement for the linear solvers.
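The nesting described above (a nonlinear solver whose every iteration solves a sparse linear system) can be sketched on a toy problem. The example below is an illustration only, not code from Victory Process or SolverLib: it assumes a hypothetical 1D model equation -u'' + u^3 = 1 with zero boundary values, discretized by finite differences, and uses Newton's method with a direct tridiagonal solve standing in for the sparse linear solver. All function names are made up for this sketch.

```python
import math


def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: direct solve of a tridiagonal system.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def residual(u, h):
    """Discrete residual F(u) of -u'' + u^3 = 1 with u = 0 on both boundaries."""
    n = len(u)
    f = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        f.append((-left + 2.0 * u[i] - right) / h**2 + u[i] ** 3 - 1.0)
    return f


def newton(n=50, tol=1e-8, max_iter=20):
    """Newton iteration; each step solves one sparse (tridiagonal) linear system."""
    h = 1.0 / (n + 1)
    u = [0.0] * n
    off = -1.0 / h**2
    for iteration in range(max_iter + 1):
        f = residual(u, h)
        norm = math.sqrt(sum(v * v for v in f))
        if norm < tol or iteration == max_iter:
            return u, iteration, norm
        # Jacobian of F at u: tridiagonal, with the nonlinear term on the diagonal.
        diag = [2.0 / h**2 + 3.0 * ui**2 for ui in u]
        du = solve_tridiagonal([off] * n, diag, [off] * n, [-v for v in f])
        u = [ui + di for ui, di in zip(u, du)]


if __name__ == "__main__":
    u, iterations, norm = newton()
    print(f"Newton steps: {iterations}, final residual norm: {norm:.2e}")
```

In a 3D process or device simulation the Jacobian is a large general sparse matrix rather than a tridiagonal one, which is why the direct solve in this sketch is replaced in practice by preconditioned iterative solvers.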
Evaluation Setup
• Our first step towards incorporating accelerators into our linear solvers was to evaluate third party linear solver libraries accelerated with GPUs and compare the results with our parallel linear solvers running on CPUs only. Our purpose was to identify the types of tools for which GPU acceleration is possible and get an approximate idea of the speedup that we would get.
• We tested the GPU accelerated linear solvers for sparse linear systems in Paralution [1] and Magma [2] on linear systems with matrices extracted from different modules of our Process Simulator Victory Process.
• SolverLib is a library of linear solvers developed at Silvaco. We tried various combinations of solvers, preconditioners, triangular solvers and parameters, and these results show the best performing combination that we could come up with for a specific problem.
• For our testing we used an NVIDIA Tesla P100 16GB accelerator. SolverLib was run on an Intel Xeon E5-2699 CPU.
The numerical simulations needed for this work were performed on Microway's Tesla GPU accelerated compute cluster. The author is grateful for the HPC time provided by Microway, Inc.
1. PARALUTION Labs. Paralution v1.1, 2016. http://www.paralution.com
2. Magma 2.2.0. http://icl.cs.utk.edu/magma/software/
Victory Process - 3D stationary diffusion of oxygen in silicon oxide
laplace.mtx - non-symmetric
Source - Victory Process
n = 1 496 362, nnz = 11 640 245

Library     Solver    Preconditioner  Triangular solver  #iterations  Time    Residual
Paralution  BICGSTAB  Gauss-Seidel    -                  135          0.59 s  5.86e-10
Magma       BICGSTAB  ILU(0)          Jacobi             128          0.72 s  7.75e-10
SolverLib   BICGSTAB  ILU(0)          -                  57           6.87 s  6.92e-10
The laplace matrix comes from the finite difference approximation of a stationary diffusion equation (Laplace equation). The equation models 3D stationary diffusion of oxygen in silicon oxide during the thermal oxidation of silicon. In this example Paralution shows the best performance, probably due to the Gauss-Seidel preconditioner. Paralution is 11.6 times faster than SolverLib; the SolverLib BICGSTAB solver is running on 1 core in this example. The Paralution and Magma results are close, with Paralution about 1.2 times faster. Magma did not offer a Gauss-Seidel preconditioner as an option, so the two libraries could not be compared with the same preconditioner.
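For readers unfamiliar with the solver named in the table, the following is a minimal textbook sketch of right-preconditioned BiCGSTAB. It is not the Paralution, Magma or SolverLib implementation: the small non-symmetric tridiagonal test matrix, the Jacobi (diagonal) preconditioner and all function names are assumptions chosen only to keep the example self-contained.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(x):
    """y = A x for a hypothetical non-symmetric tridiagonal A
    (diagonal 4, sub-diagonal -1.5, super-diagonal -0.5)."""
    n = len(x)
    y = []
    for i in range(n):
        value = 4.0 * x[i]
        if i > 0:
            value -= 1.5 * x[i - 1]
        if i < n - 1:
            value -= 0.5 * x[i + 1]
        y.append(value)
    return y


def jacobi(r):
    """Jacobi preconditioner: divide by the diagonal of A (constant 4 here)."""
    return [ri / 4.0 for ri in r]


def bicgstab(matvec, precond, b, tol=1e-10, max_iter=200):
    """Right-preconditioned BiCGSTAB; returns (solution, iterations used)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual for the zero initial guess
    r_hat = list(r)      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b))
    for iteration in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        y = precond(p)
        v = matvec(y)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if math.sqrt(dot(s, s)) <= tol * b_norm:
            # Converged after the first half step.
            return [xi + alpha * yi for xi, yi in zip(x, y)], iteration
        z = precond(s)
        t = matvec(z)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * yi + omega * zi for xi, yi, zi in zip(x, y, z)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, iteration
        rho = rho_new
    return x, max_iter


if __name__ == "__main__":
    exact = [1.0] * 50
    b = matvec(exact)
    x, iterations = bicgstab(matvec, jacobi, b)
    error = max(abs(xi - ei) for xi, ei in zip(x, exact))
    print(f"iterations: {iterations}, max error: {error:.2e}")
```

Swapping `jacobi` for a stronger preconditioner such as Gauss-Seidel or ILU(0) changes only the `precond` argument, which is the degree of freedom being varied across the three libraries in the table.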
Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon
The next three matrices come from the finite difference approximation of a Stokes equation
(a Navier-Stokes model of a liquid with high viscosity and hence with a very low Reynolds number).
The equation models a 3D viscous flow of silicon oxide and silicon nitride during the
thermal oxidation of silicon.
This is the most time consuming part of the simulation, so the speedup here is of particular importance.
We have matrices of 3 different sizes: about 1 million, 2 million and 4 million rows.
We wanted to see whether we would get consistent results.
Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon
vas_stokes_1M.mtx - non-symmetric
n = 1 090 664, nnz = 32 767 207

Library     Solver                   Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                   ILU(0,1)        -                  1759         16.93 s  7.8e-10
Magma       IDR(8)                   ILU(0)          Jacobi             286          13.55 s  9.58e-10
SolverLib   PAM BICGSTAB - 16 CPU    ILU(1)          -                  434          37.5 s   6.2e-10
Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon
vas_stokes_2M.mtx - non-symmetric
n = 2 146 677, nnz = 65 129 037

Library     Solver                   Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                   ILU(0,1)        -                  1745         26.97 s  9.08e-10
Magma       IDR(8)                   ILU(0)          Jacobi             271          23.5 s   1.45e-09
SolverLib   PAM BICGSTAB - 16 CPU    ILU(1)          -                  385          62.8 s   8.45e-10
Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon
vas_stokes_4M.mtx - non-symmetric
n = 4 382 246, nnz = 131 577 61

Library     Solver                   Preconditioner  Triangular solver  #iterations  Time     Residual
Paralution  IDR(8)                   ILU(0,1)        -                  216          61.76 s  1.46e-09
Magma       IDR(8)                   ILU(0)          Jacobi             33           57.9 s   2.04e-09
SolverLib   PAM BICGSTAB - 16 CPU    ILU(1)          -                  47           174.1 s  3.83e-10
Victory Process - 3D viscous flow of silicon oxide and silicon nitride during the thermal oxidation of silicon
We can see that the results are consistent. On all three examples Magma outperforms the other solvers.
It should be noted here that the matrices are ill-conditioned, and Induced Dimension Reduction (IDR) with dimension 8 was used in Paralution and Magma.
For comparison we used Silvaco’s parallel domain decomposition based solver PAM on 16 cores.
Magma’s speedup compared to SolverLib on 16 cores is 2.77 on the 1 million sized problem, 2.67 on the 2 million sized problem and 3 on the 4 million sized problem.
So overall the speedup is close to 3 times. Paralution showed results that were close to Magma.
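The speedup figures quoted above follow directly from the timing columns of the three tables: speedup is the SolverLib time divided by the Magma time. A short check, with the timings copied from the tables and variable names chosen only for this sketch:

```python
# Solve times in seconds, copied from the three Stokes tables.
timings = {
    "vas_stokes_1M": {"Magma": 13.55, "SolverLib": 37.5},
    "vas_stokes_2M": {"Magma": 23.5, "SolverLib": 62.8},
    "vas_stokes_4M": {"Magma": 57.9, "SolverLib": 174.1},
}


def speedup(baseline_seconds, accelerated_seconds):
    """How many times faster the accelerated run is than the baseline run."""
    return baseline_seconds / accelerated_seconds


speedups = {
    name: speedup(t["SolverLib"], t["Magma"]) for name, t in timings.items()
}

if __name__ == "__main__":
    for name, value in speedups.items():
        print(f"{name}: Magma is {value:.2f}x faster than SolverLib on 16 cores")
```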
Victory Device - example nv1
source: Victory Device, size: 75 468, nnz: 2 449 194
Very ill-conditioned matrix - huge condition number
PAS running on 36 threads.
Sparse QR quad-double - NVIDIA

Solvers    total time  residual error
PAS        0.938 s     1.9e-15
Sparse QR  722 s       5.13e-26
CLEVER - example nv7
source: Clever size: 1 802 979 nnz: 25 180 017
PAM running on 16 MPI processes
PCG with Jacobi preconditioner - on CPU and GPU.
Iterative solvers  # iterations  total time  residual error
PAM                36            5.6 s       1.20e-09
PCG+Jacobi         493           1.9 s       1.27e-09
Comment: PCG is about 2.9 times faster.
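PCG with a Jacobi preconditioner is simple enough to show in full, which also makes clear why it maps well to a GPU: each iteration is one sparse matrix-vector product plus a few vector operations. The sketch below is a plain textbook version run on the CPU, not the code used for the Clever measurement; the small symmetric positive definite tridiagonal test matrix and all names are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def make_problem(n):
    """Hypothetical SPD tridiagonal matrix: diagonal 2 + 0.1*i, off-diagonals -1."""
    diag = [2.0 + 0.1 * i for i in range(n)]

    def matvec(x):
        y = []
        for i in range(n):
            value = diag[i] * x[i]
            if i > 0:
                value -= x[i - 1]
            if i < n - 1:
                value -= x[i + 1]
            y.append(value)
        return y

    return diag, matvec


def pcg_jacobi(matvec, diag, b, tol=1e-10, max_iter=500):
    """Conjugate gradients preconditioned with the inverse diagonal of A."""
    x = [0.0] * len(b)
    r = list(b)                                  # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]     # apply Jacobi preconditioner
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for iteration in range(1, max_iter + 1):
        ap = matvec(p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, iteration
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


if __name__ == "__main__":
    diag, matvec = make_problem(50)
    exact = [1.0] * 50
    x, iterations = pcg_jacobi(matvec, diag, matvec(exact))
    error = max(abs(xi - ei) for xi, ei in zip(x, exact))
    print(f"iterations: {iterations}, max error: {error:.2e}")
```

A cheap preconditioner like Jacobi typically needs many more iterations than a stronger one, as the 493 versus 36 iteration counts in the table show, but each iteration is inexpensive and highly parallel.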
Conclusion
● GPU accelerated linear solvers are a promising tool for accelerating Process Simulation and RC extraction tools and we have added such solvers into our library of linear solvers.
● We are continuing to work on developing a good GPU accelerated linear solver for our Device Simulation tools.