“i7 Networks” Manjunath M Gowda CEO & co-founder, i7 Networks.
Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract....
Transcript of Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract....
55
~I7JJHH~llJVllCU MOS Device Simulation on a
Heinreichsberger
Abstract In this paper we present oW experience with the implementation of the three-dimensional semiconmiddot ductor derice simulator MINIMOS on a massively architecture the Connection Machine CM2 The eD1lpllaSlS of this WOllI is placed on herative methods for solving the very luge sparse lineu yltems of equations that
at each step of the nonlmeu solution procedure Both symmetric and nonsymmetric linear systems ue solved br conJugate gradient type iterative methods The implementation or the puallel preconditioner is the most crucial step Multicolor incomplete LV ractoriation preconditioners are compued with polynomial preconditioners Several __erical examples from the nonsymmetric lineu systems in MlNIMOS ue given
Comparisons with vector supercomputers are made
1 Overview
linear ~vctpTfI
also (4) discrete continuity equations have to cope
(electrons and by a damperl
In
the enormous numerical range of the I tnn
discrete continuity equations have to the nonlinear iteration resulting
Institute for Microelectronics Technical University Vienna Gusshausstrausse 27-29 A-I040
388
APPLICATIONS MODELING AND SIMULATION 389
TABLE 1
BiCGSTAB Algorithm
Choose 20 (eg Zo = 0) 70 = (b - Azo) Choose Yo such that (Yo1o) - 0 (eg Yo = 10 ) Po = Vo = 0 P- l = 1 Wo = 1 a =1 FOR n = 0 STEP 1 UNTIL convergence DO
Pn = (Yo rn) f3 = ~
Pft-l Wft
P+1 = 1n + f3 (p + wnv) V+1 = A pn+l a = (Yo Vn+ l ) bull = 1 - aVn +l t =A
~ +1 = M 7n +l = bull - Wn +lt 2+1 = Z + ap+1 + W+lS
END FOR
2 T he Linear Solvers The parallel solvers are of (bi-)conjugate gradient type A relative error of 10-3 for the symmetric and 10-8 for the nonsymmetric systems (motivated by numerical experiments only) has been found to be sufficient For the Poisson equation the MIC-CG method[9] with a modification factor of Q = 095 is the optimal choice to the best to our knowledge On the CM2 we use the CG in the version of Concus Golub and OLeary As preconditioner the reduced system (RS) main diagonal is used We shall denote this method by RS-CG In the case of locally constant carrier temperatures (see [6]) the coefficient matrices of the discrete continuity equations are diagonally similar to symmetric positive definite matrices and thus have a positive real spectrum In these linearized discrete systems the matrix coefficients vary rapidly resulting in a high condition number thus yielding a large iteration counts of the linear i terative solver Among the number of iterative methods investigated [6J variants of the bi-conjugate gradient method such as the biconjugate gradient squared (BiCGS) method and the the BiCGSTAB method [14] (see Table 1) seem to be the optimal choice In particular the BiCGSTAB procedure improves one of the main problems of BiCGS namely its erratic convergence behaviour which often yields significantly more iterations for convergence than necessary At the same time the rapid convergence of BiCGS is maintained BiCGSTAB needs two dot products more and one vector update less
3 Preconditioning Methods Efficient and robust preconditioning is the most critical issue for the iterative solution of the discrete semiconductor equations Split incomplete L U preconditioning enables an efficient implementation of the incomplete Choleski and the incomplete LU preconditioner a technique originally proposed in [3] However the ILU preconditioner ofthe nat urally ordered unknowns is rather sequential in nature and thus unattractive on a SIMD architecture Alternatives are polynomial preconditioners and ILU with multicolor orderings For multicolor ILU preconditioning the coefficient matrix is permuted according to some regular replication (coloring) pattern of t he unknowns [11] A partitioning of the unknowns into sets of different colors resulting in a block structure of the coefficient matrix uncouples the unknowns
InCUCCS R UltYi
n1 preconditioned quantities l11l1lrllIV
pr~ecc~nCllltlol)n4ers were originally proposed
with respect to some
J)reseIlt convergence uvuuu~u iteration of device 1
wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs
ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER
The vector length decreases as the colors increases At the same - usually - We have made using simulator
The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A
the same color convergence
DR HRBA= BBR DB
points) the reduced gtv~tfgtlrn is
prl~c(lnCllltl~gtm~r is then the main diagonal of system
=
resulting in a
scaling the unknowns by square root of lvu DR we may in the form
(DB - HBRHRB) ZB =bB
Approximately half virtual prC~e1S01~S In
during each stage
function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete
and is a fast operation
BiCGSTAB iterative procedure with various
with zero fill-in
order to lIlAtI1JIo
construction of least squares polynomials positive
equations The evaluation on an SIMD arCJlUect
obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111
FORTRAN A twoshy
APPLICATIONS MODELING AND SIMULATION 391
dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and
is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined
the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not
dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our
20 30 40 60 70 90 n
TABLB 2
RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)
ILU-BiCGS cnnllgtPfl
of the ILU(O) preconditioner the uU~lLcA
a BiCGSTAB solvers is expected
rU and ILU-BiCGSTAB In a recent work [13] high pelformance
392 HEINREICHSBERGER SELBERHERR AND
TABLB 3
Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4
5
Device 1
performance measurements 16k processors
is used for the Poisson for both
3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation
and transfer times to and n1IIunc section the implementation using
optimality The linear solvers execute the case of a mean convergence uclilOnlu~
cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J
6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required
Conclusions The implementation on the LgtOIUl10n UltlILUe
computationally most eXipeIlSle
olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow
rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon
UC~nl-eJlla computer with the
vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of
Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler
algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A
between -4 to 8 seems A one gigaflop a mean convergence
deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250
were presented It has been shown the MIC-CG
expect that a fully optimized
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull
APPLICATIONS MODELING AND SIMULATION 389
TABLE 1
BiCGSTAB Algorithm
Choose 20 (eg Zo = 0) 70 = (b - Azo) Choose Yo such that (Yo1o) - 0 (eg Yo = 10 ) Po = Vo = 0 P- l = 1 Wo = 1 a =1 FOR n = 0 STEP 1 UNTIL convergence DO
Pn = (Yo rn) f3 = ~
Pft-l Wft
P+1 = 1n + f3 (p + wnv) V+1 = A pn+l a = (Yo Vn+ l ) bull = 1 - aVn +l t =A
~ +1 = M 7n +l = bull - Wn +lt 2+1 = Z + ap+1 + W+lS
END FOR
2 T he Linear Solvers The parallel solvers are of (bi-)conjugate gradient type A relative error of 10-3 for the symmetric and 10-8 for the nonsymmetric systems (motivated by numerical experiments only) has been found to be sufficient For the Poisson equation the MIC-CG method[9] with a modification factor of Q = 095 is the optimal choice to the best to our knowledge On the CM2 we use the CG in the version of Concus Golub and OLeary As preconditioner the reduced system (RS) main diagonal is used We shall denote this method by RS-CG In the case of locally constant carrier temperatures (see [6]) the coefficient matrices of the discrete continuity equations are diagonally similar to symmetric positive definite matrices and thus have a positive real spectrum In these linearized discrete systems the matrix coefficients vary rapidly resulting in a high condition number thus yielding a large iteration counts of the linear i terative solver Among the number of iterative methods investigated [6J variants of the bi-conjugate gradient method such as the biconjugate gradient squared (BiCGS) method and the the BiCGSTAB method [14] (see Table 1) seem to be the optimal choice In particular the BiCGSTAB procedure improves one of the main problems of BiCGS namely its erratic convergence behaviour which often yields significantly more iterations for convergence than necessary At the same time the rapid convergence of BiCGS is maintained BiCGSTAB needs two dot products more and one vector update less
3 Preconditioning Methods Efficient and robust preconditioning is the most critical issue for the iterative solution of the discrete semiconductor equations Split incomplete L U preconditioning enables an efficient implementation of the incomplete Choleski and the incomplete LU preconditioner a technique originally proposed in [3] However the ILU preconditioner ofthe nat urally ordered unknowns is rather sequential in nature and thus unattractive on a SIMD architecture Alternatives are polynomial preconditioners and ILU with multicolor orderings For multicolor ILU preconditioning the coefficient matrix is permuted according to some regular replication (coloring) pattern of t he unknowns [11] A partitioning of the unknowns into sets of different colors resulting in a block structure of the coefficient matrix uncouples the unknowns
InCUCCS R UltYi
n1 preconditioned quantities l11l1lrllIV
pr~ecc~nCllltlol)n4ers were originally proposed
with respect to some
J)reseIlt convergence uvuuu~u iteration of device 1
wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs
ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER
The vector length decreases as the colors increases At the same - usually - We have made using simulator
The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A
the same color convergence
DR HRBA= BBR DB
points) the reduced gtv~tfgtlrn is
prl~c(lnCllltl~gtm~r is then the main diagonal of system
=
resulting in a
scaling the unknowns by square root of lvu DR we may in the form
(DB - HBRHRB) ZB =bB
Approximately half virtual prC~e1S01~S In
during each stage
function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete
and is a fast operation
BiCGSTAB iterative procedure with various
with zero fill-in
order to lIlAtI1JIo
construction of least squares polynomials positive
equations The evaluation on an SIMD arCJlUect
obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111
FORTRAN A twoshy
APPLICATIONS MODELING AND SIMULATION 391
dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and
is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined
the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not
dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our
20 30 40 60 70 90 n
TABLB 2
RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)
ILU-BiCGS cnnllgtPfl
of the ILU(O) preconditioner the uU~lLcA
a BiCGSTAB solvers is expected
rU and ILU-BiCGSTAB In a recent work [13] high pelformance
392 HEINREICHSBERGER SELBERHERR AND
TABLB 3
Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4
5
Device 1
performance measurements 16k processors
is used for the Poisson for both
3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation
and transfer times to and n1IIunc section the implementation using
optimality The linear solvers execute the case of a mean convergence uclilOnlu~
cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J
6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required
Conclusions The implementation on the LgtOIUl10n UltlILUe
computationally most eXipeIlSle
olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow
rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon
UC~nl-eJlla computer with the
vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of
Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler
algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A
between -4 to 8 seems A one gigaflop a mean convergence
deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250
were presented It has been shown the MIC-CG
expect that a fully optimized
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull
InCUCCS R UltYi
n1 preconditioned quantities l11l1lrllIV
pr~ecc~nCllltlol)n4ers were originally proposed
with respect to some
J)reseIlt convergence uvuuu~u iteration of device 1
wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs
ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER
The vector length decreases as the colors increases At the same - usually - We have made using simulator
The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A
the same color convergence
DR HRBA= BBR DB
points) the reduced gtv~tfgtlrn is
prl~c(lnCllltl~gtm~r is then the main diagonal of system
=
resulting in a
scaling the unknowns by square root of lvu DR we may in the form
(DB - HBRHRB) ZB =bB
Approximately half virtual prC~e1S01~S In
during each stage
function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete
and is a fast operation
BiCGSTAB iterative procedure with various
with zero fill-in
order to lIlAtI1JIo
construction of least squares polynomials positive
equations The evaluation on an SIMD arCJlUect
obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111
FORTRAN A twoshy
APPLICATIONS MODELING AND SIMULATION 391
dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and
is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined
the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not
dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our
20 30 40 60 70 90 n
TABLB 2
RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)
ILU-BiCGS cnnllgtPfl
of the ILU(O) preconditioner the uU~lLcA
a BiCGSTAB solvers is expected
rU and ILU-BiCGSTAB In a recent work [13] high pelformance
392 HEINREICHSBERGER SELBERHERR AND
TABLB 3
Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4
5
Device 1
performance measurements 16k processors
is used for the Poisson for both
3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation
and transfer times to and n1IIunc section the implementation using
optimality The linear solvers execute the case of a mean convergence uclilOnlu~
cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J
6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required
Conclusions The implementation on the LgtOIUl10n UltlILUe
computationally most eXipeIlSle
olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow
rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon
UC~nl-eJlla computer with the
vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of
Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler
algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A
between -4 to 8 seems A one gigaflop a mean convergence
deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250
were presented It has been shown the MIC-CG
expect that a fully optimized
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull
APPLICATIONS MODELING AND SIMULATION 391
dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and
is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined
the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not
dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our
20 30 40 60 70 90 n
TABLB 2
RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)
ILU-BiCGS cnnllgtPfl
of the ILU(O) preconditioner the uU~lLcA
a BiCGSTAB solvers is expected
rU and ILU-BiCGSTAB In a recent work [13] high pelformance
392 HEINREICHSBERGER SELBERHERR AND
TABLB 3
Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4
5
Device 1
performance measurements 16k processors
is used for the Poisson for both
3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation
and transfer times to and n1IIunc section the implementation using
optimality The linear solvers execute the case of a mean convergence uclilOnlu~
cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J
6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required
Conclusions The implementation on the LgtOIUl10n UltlILUe
computationally most eXipeIlSle
olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow
rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon
UC~nl-eJlla computer with the
vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of
Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler
algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A
between -4 to 8 seems A one gigaflop a mean convergence
deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250
were presented It has been shown the MIC-CG
expect that a fully optimized
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull
ILU-BiCGS cnnllgtPfl
of the ILU(O) preconditioner the uU~lLcA
a BiCGSTAB solvers is expected
rU and ILU-BiCGSTAB In a recent work [13] high pelformance
392 HEINREICHSBERGER SELBERHERR AND
TABLB 3
Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4
5
Device 1
performance measurements 16k processors
is used for the Poisson for both
3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation
and transfer times to and n1IIunc section the implementation using
optimality The linear solvers execute the case of a mean convergence uclilOnlu~
cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J
6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required
Conclusions The implementation on the LgtOIUl10n UltlILUe
computationally most eXipeIlSle
olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow
rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon
UC~nl-eJlla computer with the
vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of
Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler
algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A
between -4 to 8 seems A one gigaflop a mean convergence
deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250
were presented It has been shown the MIC-CG
expect that a fully optimized
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull
APPLICATIONS MODELING AND SIMULATION 393
least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma
an insufficient 3D mesh preconditioner can much more CUJllClnljI
An improved grid generation with _n~t
problem and is part of current UHcsvlg~JUUilli semiconductor equations is its
mrestlatlloD may not the ultimate answer ~UiI
are likely to on is OEUVtuc
statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site
At first double hardware is an is Dot only true for MINIMOS but
are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC
nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more
~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD
Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer
lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer
performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite
interconnection between and the supercomputer hardware
Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983
Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979
Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4
al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36
SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964
Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull
Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA
Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316
MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162
Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984
EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US
Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52
W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231
Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull