Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract....

6
55 MOS Device Simulation on a Heinreichsberger* Abstract. In this paper we present oW' experience with the implementation of the three-dimensional semicon· ductor derice simula.tor MINIMOS on a. massively architecture, the Connection Machine CM2. The eD1lpllaSlS of this WOl'lI: is placed on herative methods for solving the very luge sparse lineu .yltems of equations that at each step of the nonlmeu solution procedure. Both symmetric and nonsymmetric linear systems ue solved br co:nJugate gradient type iterative methods. The implementation or the puallel preconditioner is the most crucial step. Multicolor incomplete LV ractori.ation preconditioners are compued with polynomial preconditioners. Several __erical examples from the nonsymmetric lineu systems in MlNIMOS ue given Comparisons with vector supercomputers are made. 1. Overview. linear also (4). discrete continuity equations have to cope (electrons and by a damperl In the enormous numerical range of the " ....I .. t.nn discrete continuity equations have to the nonlinear iteration resulting *Institute for Microelectronics, Technical University Vienna, Gusshausstra:usse 27-29, A-I040 388

Transcript of Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract....

Page 1: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

55

~I7JJHH~llJVllCU MOS Device Simulation on a

Heinreichsberger

Abstract In this paper we present oW experience with the implementation of the three-dimensional semiconmiddot ductor derice simulator MINIMOS on a massively architecture the Connection Machine CM2 The eD1lpllaSlS of this WOllI is placed on herative methods for solving the very luge sparse lineu yltems of equations that

at each step of the nonlmeu solution procedure Both symmetric and nonsymmetric linear systems ue solved br conJugate gradient type iterative methods The implementation or the puallel preconditioner is the most crucial step Multicolor incomplete LV ractoriation preconditioners are compued with polynomial preconditioners Several __erical examples from the nonsymmetric lineu systems in MlNIMOS ue given

Comparisons with vector supercomputers are made

1 Overview

linear ~vctpTfI

also (4) discrete continuity equations have to cope

(electrons and by a damperl

In

the enormous numerical range of the I tnn

discrete continuity equations have to the nonlinear iteration resulting

Institute for Microelectronics Technical University Vienna Gusshausstrausse 27-29 A-I040

388

APPLICATIONS MODELING AND SIMULATION 389

TABLE 1

BiCGSTAB Algorithm

Choose 20 (eg Zo = 0) 70 = (b - Azo) Choose Yo such that (Yo1o) - 0 (eg Yo = 10 ) Po = Vo = 0 P- l = 1 Wo = 1 a =1 FOR n = 0 STEP 1 UNTIL convergence DO

Pn = (Yo rn) f3 = ~

Pft-l Wft

P+1 = 1n + f3 (p + wnv) V+1 = A pn+l a = (Yo Vn+ l ) bull = 1 - aVn +l t =A

~ +1 = M 7n +l = bull - Wn +lt 2+1 = Z + ap+1 + W+lS

END FOR

2 T he Linear Solvers The parallel solvers are of (bi-)conjugate gradient type A relative error of 10-3 for the symmetric and 10-8 for the nonsymmetric systems (motivated by numerical experiments only) has been found to be sufficient For the Poisson equation the MIC-CG method[9] with a modification factor of Q = 095 is the optimal choice to the best to our knowledge On the CM2 we use the CG in the version of Concus Golub and OLeary As preconditioner the reduced system (RS) main diagonal is used We shall denote this method by RS-CG In the case of locally constant carrier temperatures (see [6]) the coefficient matrices of the discrete continuity equations are diagonally similar to symmetric positive definite matrices and thus have a positive real spectrum In these linearized discrete systems the matrix coefficients vary rapidly resulting in a high condition number thus yielding a large iteration counts of the linear i terative solver Among the number of iterative methods investigated [6J variants of the bi-conjugate gradient method such as the biconjugate gradient squared (BiCGS) method and the the BiCGSTAB method [14] (see Table 1) seem to be the optimal choice In particular the BiCGSTAB procedure improves one of the main problems of BiCGS namely its erratic convergence behaviour which often yields significantly more iterations for convergence than necessary At the same time the rapid convergence of BiCGS is maintained BiCGSTAB needs two dot products more and one vector update less

3 Preconditioning Methods Efficient and robust preconditioning is the most critical issue for the iterative solution of the discrete semiconductor equations Split incomplete L U preconditioning enables an efficient implementation of the incomplete Choleski and the incomplete LU preconditioner a technique originally proposed in [3] However the ILU preconditioner ofthe nat urally ordered unknowns is rather sequential in nature and thus unattractive on a SIMD architecture Alternatives are polynomial preconditioners and ILU with multicolor orderings For multicolor ILU preconditioning the coefficient matrix is permuted according to some regular replication (coloring) pattern of t he unknowns [11] A partitioning of the unknowns into sets of different colors resulting in a block structure of the coefficient matrix uncouples the unknowns

InCUCCS R UltYi

n1 preconditioned quantities l11l1lrllIV

pr~ecc~nCllltlol)n4ers were originally proposed

with respect to some

J)reseIlt convergence uvuuu~u iteration of device 1

wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs

ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER

The vector length decreases as the colors increases At the same - usually - We have made using simulator

The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A

the same color convergence

DR HRBA= BBR DB

points) the reduced gtv~tfgtlrn is

prl~c(lnCllltl~gtm~r is then the main diagonal of system

=

resulting in a

scaling the unknowns by square root of lvu DR we may in the form

(DB - HBRHRB) ZB =bB

Approximately half virtual prC~e1S01~S In

during each stage

function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete

and is a fast operation

BiCGSTAB iterative procedure with various

with zero fill-in

order to lIlAtI1JIo

construction of least squares polynomials positive

equations The evaluation on an SIMD arCJlUect

obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111

FORTRAN A twoshy

APPLICATIONS MODELING AND SIMULATION 391

dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and

is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined

the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not

dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our

20 30 40 60 70 90 n

TABLB 2

RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)

ILU-BiCGS cnnllgtPfl

of the ILU(O) preconditioner the uU~lLcA

a BiCGSTAB solvers is expected

rU and ILU-BiCGSTAB In a recent work [13] high pelformance

392 HEINREICHSBERGER SELBERHERR AND

TABLB 3

Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4

5

Device 1

performance measurements 16k processors

is used for the Poisson for both

3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation

and transfer times to and n1IIunc section the implementation using

optimality The linear solvers execute the case of a mean convergence uclilOnlu~

cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J

6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required

Conclusions The implementation on the LgtOIUl10n UltlILUe

computationally most eXipeIlSle

olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow

rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon

UC~nl-eJlla computer with the

vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of

Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler

algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A

between -4 to 8 seems A one gigaflop a mean convergence

deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250

were presented It has been shown the MIC-CG

expect that a fully optimized

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull

Page 2: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

APPLICATIONS MODELING AND SIMULATION 389

TABLE 1

BiCGSTAB Algorithm

Choose 20 (eg Zo = 0) 70 = (b - Azo) Choose Yo such that (Yo1o) - 0 (eg Yo = 10 ) Po = Vo = 0 P- l = 1 Wo = 1 a =1 FOR n = 0 STEP 1 UNTIL convergence DO

Pn = (Yo rn) f3 = ~

Pft-l Wft

P+1 = 1n + f3 (p + wnv) V+1 = A pn+l a = (Yo Vn+ l ) bull = 1 - aVn +l t =A

~ +1 = M 7n +l = bull - Wn +lt 2+1 = Z + ap+1 + W+lS

END FOR

2 T he Linear Solvers The parallel solvers are of (bi-)conjugate gradient type A relative error of 10-3 for the symmetric and 10-8 for the nonsymmetric systems (motivated by numerical experiments only) has been found to be sufficient For the Poisson equation the MIC-CG method[9] with a modification factor of Q = 095 is the optimal choice to the best to our knowledge On the CM2 we use the CG in the version of Concus Golub and OLeary As preconditioner the reduced system (RS) main diagonal is used We shall denote this method by RS-CG In the case of locally constant carrier temperatures (see [6]) the coefficient matrices of the discrete continuity equations are diagonally similar to symmetric positive definite matrices and thus have a positive real spectrum In these linearized discrete systems the matrix coefficients vary rapidly resulting in a high condition number thus yielding a large iteration counts of the linear i terative solver Among the number of iterative methods investigated [6J variants of the bi-conjugate gradient method such as the biconjugate gradient squared (BiCGS) method and the the BiCGSTAB method [14] (see Table 1) seem to be the optimal choice In particular the BiCGSTAB procedure improves one of the main problems of BiCGS namely its erratic convergence behaviour which often yields significantly more iterations for convergence than necessary At the same time the rapid convergence of BiCGS is maintained BiCGSTAB needs two dot products more and one vector update less

3 Preconditioning Methods Efficient and robust preconditioning is the most critical issue for the iterative solution of the discrete semiconductor equations Split incomplete L U preconditioning enables an efficient implementation of the incomplete Choleski and the incomplete LU preconditioner a technique originally proposed in [3] However the ILU preconditioner ofthe nat urally ordered unknowns is rather sequential in nature and thus unattractive on a SIMD architecture Alternatives are polynomial preconditioners and ILU with multicolor orderings For multicolor ILU preconditioning the coefficient matrix is permuted according to some regular replication (coloring) pattern of t he unknowns [11] A partitioning of the unknowns into sets of different colors resulting in a block structure of the coefficient matrix uncouples the unknowns

InCUCCS R UltYi

n1 preconditioned quantities l11l1lrllIV

pr~ecc~nCllltlol)n4ers were originally proposed

with respect to some

J)reseIlt convergence uvuuu~u iteration of device 1

wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs

ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER

The vector length decreases as the colors increases At the same - usually - We have made using simulator

The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A

the same color convergence

DR HRBA= BBR DB

points) the reduced gtv~tfgtlrn is

prl~c(lnCllltl~gtm~r is then the main diagonal of system

=

resulting in a

scaling the unknowns by square root of lvu DR we may in the form

(DB - HBRHRB) ZB =bB

Approximately half virtual prC~e1S01~S In

during each stage

function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete

and is a fast operation

BiCGSTAB iterative procedure with various

with zero fill-in

order to lIlAtI1JIo

construction of least squares polynomials positive

equations The evaluation on an SIMD arCJlUect

obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111

FORTRAN A twoshy

APPLICATIONS MODELING AND SIMULATION 391

dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and

is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined

the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not

dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our

20 30 40 60 70 90 n

TABLB 2

RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)

ILU-BiCGS cnnllgtPfl

of the ILU(O) preconditioner the uU~lLcA

a BiCGSTAB solvers is expected

rU and ILU-BiCGSTAB In a recent work [13] high pelformance

392 HEINREICHSBERGER SELBERHERR AND

TABLB 3

Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4

5

Device 1

performance measurements 16k processors

is used for the Poisson for both

3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation

and transfer times to and n1IIunc section the implementation using

optimality The linear solvers execute the case of a mean convergence uclilOnlu~

cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J

6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required

Conclusions The implementation on the LgtOIUl10n UltlILUe

computationally most eXipeIlSle

olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow

rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon

UC~nl-eJlla computer with the

vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of

Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler

algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A

between -4 to 8 seems A one gigaflop a mean convergence

deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250

were presented It has been shown the MIC-CG

expect that a fully optimized

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull

Page 3: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

InCUCCS R UltYi

n1 preconditioned quantities l11l1lrllIV

pr~ecc~nCllltlol)n4ers were originally proposed

with respect to some

J)reseIlt convergence uvuuu~u iteration of device 1

wa~Konm scaling the Reduced System (RS) Least Squares poJynl)rnllaJs

ngt ltlrC1DEnLj~ SELBERHERR AND STIFTINGER

The vector length decreases as the colors increases At the same - usually - We have made using simulator

The results indicate that more than two colors are not favourable We therefore ImiPlemenlE~a a 2-0010r (red-blad) ILU preconditioner Partitioning the A

the same color convergence

DR HRBA= BBR DB

points) the reduced gtv~tfgtlrn is

prl~c(lnCllltl~gtm~r is then the main diagonal of system

=

resulting in a

scaling the unknowns by square root of lvu DR we may in the form

(DB - HBRHRB) ZB =bB

Approximately half virtual prC~e1S01~S In

during each stage

function is feasible due COE~mc~elll nunnCC5 ofthe uncoupled discrete

and is a fast operation

BiCGSTAB iterative procedure with various

with zero fill-in

order to lIlAtI1JIo

construction of least squares polynomials positive

equations The evaluation on an SIMD arCJlUect

obtained from the majority COIltiIlIUUy bull_______ section 5) The 2-norm ofthe relative eUlcJl(leson 111

FORTRAN A twoshy

APPLICATIONS MODELING AND SIMULATION 391

dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and

is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined

the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not

dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our

20 30 40 60 70 90 n

TABLB 2

RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)

ILU-BiCGS cnnllgtPfl

of the ILU(O) preconditioner the uU~lLcA

a BiCGSTAB solvers is expected

rU and ILU-BiCGSTAB In a recent work [13] high pelformance

392 HEINREICHSBERGER SELBERHERR AND

TABLB 3

Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4

5

Device 1

performance measurements 16k processors

is used for the Poisson for both

3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation

and transfer times to and n1IIunc section the implementation using

optimality The linear solvers execute the case of a mean convergence uclilOnlu~

cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J

6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required

Conclusions The implementation on the LgtOIUl10n UltlILUe

computationally most eXipeIlSle

olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow

rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon

UC~nl-eJlla computer with the

vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of

Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler

algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A

between -4 to 8 seems A one gigaflop a mean convergence

deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250

were presented It has been shown the MIC-CG

expect that a fully optimized

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull

Page 4: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

APPLICATIONS MODELING AND SIMULATION 391

dearly be The linear solvers RS-CG and the RS-BiCGSTAB methods have been benchmarked Some results are shown Table 2 Three tensor of 163 323 and

is solved on Sk 16k and 32k processors USing double pnClSlOn 32k processors we a megaflop rate 20f 120 for for the N 643 grid Note that only halI VJDI processors are active due to the red-black Obviously is determined

the 7 point which executes with roughly 200 JJlltOIltULUI (JQrlrespOl1diA1J OOiampUl aplPr(lmna1~el) 50 megaflops on 16k processors for the same ltplrorll COJnput3Ltlon is the reason why polynomial preconditioners are not

dotproduct achieves a rate of 300 megano]gts pr4~con4Itl(gtners compare unfavorable We use the RS preconditioner for our

20 30 40 60 70 90 n

TABLB 2

RS-CG 8Ai RS-BiCGSTAB BeAclmalu (mul pel ifusfiDII)

ILU-BiCGS cnnllgtPfl

of the ILU(O) preconditioner the uU~lLcA

a BiCGSTAB solvers is expected

rU and ILU-BiCGSTAB In a recent work [13] high pelformance

392 HEINREICHSBERGER SELBERHERR AND

TABLB 3

Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4

5

Device 1

performance measurements 16k processors

is used for the Poisson for both

3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation

and transfer times to and n1IIunc section the implementation using

optimality The linear solvers execute the case of a mean convergence uclilOnlu~

cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J

6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required

Conclusions The implementation on the LgtOIUl10n UltlILUe

computationally most eXipeIlSle

olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow

rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon

UC~nl-eJlla computer with the

vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of

Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler

algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A

between -4 to 8 seems A one gigaflop a mean convergence

deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250

were presented It has been shown the MIC-CG

expect that a fully optimized

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull

Page 5: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

ILU-BiCGS cnnllgtPfl

of the ILU(O) preconditioner the uU~lLcA

a BiCGSTAB solvers is expected

rU and ILU-BiCGSTAB In a recent work [13] high pelformance

392 HEINREICHSBERGER SELBERHERR AND

TABLB 3

Ptfform4f1ctt of MINIMOS 011 16k PfOCttIlllOf4

5

Device 1

performance measurements 16k processors

is used for the Poisson for both

3 performance statistics of the three simulations are listed The rows following meaning Row 1 the grid dimensions row 2 the number iterations Row 3 the and counts cnl for the continuity (CC) equation

and transfer times to and n1IIunc section the implementation using

optimality The linear solvers execute the case of a mean convergence uclilOnlu~

cOlrrespc)nlt1s to a 15 megaflop scalar nHl~ltlue of numerical results see [7J

6 Some Remarks been out in a relatively short time by porting onto the Connection Machine We have learned that the Connection Machine easy to use and to program although reo-programming of selected parts of the software is required

Conclusions The implementation on the LgtOIUl10n UltlILUe

computationally most eXipeIlSle

olrlpJlex ua v say tasks 100000 unknowns execute well on memory nlll1pv1 the matrix assembly on the end computer becomes slow

rnerlDllOlie the transfer time oUhe 7-band matrix the right side to and the SOIUllon

UC~nl-eJlla computer with the

vector from the Connection processors is in most cases slower than the solution IIVtt~lrn itself For a production code this is not acceptable Therefore an interaction of

Machines processors which involves computation and u of large sets data must be avoided lWLULLLl5 Machines is its FORTRAN compiler

algebra subroutines (such as multi wire nearest neighbour commushyCOIIDIIlUJDlCatlOn and computation) is under way A

between -4 to 8 seems A one gigaflop a mean convergence

deterioration factor of -4 corresponding to the IL U (0) preconditioned solvers to 250

were presented It has been shown the MIC-CG

expect that a fully optimized

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull

Page 6: Heinreichsberger* · ~'I7"JJHH~ll"'J.VllC:U MOS Device Simulation on a Heinreichsberger* Abstract. In . this paper we present oW' experience with the implementation of the three-dimensional

APPLICATIONS MODELING AND SIMULATION 393

least twice as The convergence loss factors iterative solvers in presented simulations are than three in some cases larger than 10 A convergence degradation of more than a factor ofl0 makes the quesshytion of a potential of the Connection Machine or similar architectures over doubtful Part of this convergence dlllemma

an insufficient 3D mesh preconditioner can much more CUJllClnljI

An improved grid generation with _n~t

problem and is part of current UHcsvlg~JUUilli semiconductor equations is its

mrestlatlloD may not the ultimate answer ~UiI

are likely to on is OEUVtuc

statements are meant to summarize what we believe a engineer at a semiconductor manufacturing site

At first double hardware is an is Dot only true for MINIMOS but

are candidates for Connection Machine ImplE~ml~ntatllons Machine system with a number of processors is QClIlaUlC

nr()V1ltles a powerful very large three-dimensional of processors may be subdivided thus enabling more

~bullbull t concurrently on the Connection Machine Concurrency of UlltSIVLlgt such as MINIMOS CAD

Uil~Dliiel of binary data to and from the Machines processors is a (JClnlllectlOn aA in a serveNlient manner The transfer

lonnllctlon Machine and the computer modern deviceprocess CAD environments are on a computer

performance workstations as tools and supercomputshysolvers The effectiveness of such a configuration quite

interconnection between and the supercomputer hardware

Rose Fichtner W Numerical Methods for Semiconductor Device Simulation IEEE 1031~lOn 1983

Gkenbaum A Rodrigue GH Approximating the Inverse of a Matm for Use in herative AII~ornbms on Vedors Vector Processors Comvtin VoL 22 pp 257middot2681979

Edcmstat Efficient Implementation of a C1us of Preconditioned CoDiugate Gradient SIAM JSciSatComvt Vol 2 No I Mar 1981 pp 1middot4

al Massiydy Parallel Algorithms for Three-Dimensional Device Simulation Proceedinp July 1990 pp 35-36

SelCconsisten~ Iterative Scheme for One-dimensional Steady State Transistor CaleulatiOlls pp 455-465 1964

Heii1lr1tidlSbmiddoteqer 0 S Stiftinger M Traar KP Fast IteratiTe Solution of Curiel Continuity EquaC~OJlSfor 3D Deriee Simnlation to appear in SIAM JSeiSfat Comutbullbull

Heinzeichsberger 0 MJNlMOS on the Connection Machine Technical Repon Febrnampry 1991 Institute for Microelectronics Technical UniTenity Vienna AUSTRIA

Jolwon OG Micchdli A Paul G Polynomial Pkeonditioners for CoDiugate Gradient Calculations JNumeAnal Vol No2 Apr 1983 pp 362-316

MeijeriDk H Vorst H An Iterative Solution Method for Linear Systems of which the Codlicient Maim is a Symmetric M-Matm Math Comp 31 (1917) pp 148-162

Oppe Jouberi WD and Kincaid D~ NSPCG Users Guide Center of Numerical Analysis Uni-Temt of Taas at Austin 1984

EL Ortega JM Multico1or ICCG Methods for Vector Computers SIAM JNme Anel VoL 24 6 Dec 1987 1394-1US

Sonncyeid CGS A LanclIos-T1Pc Sober for Systems SIAM Vol No1 Jan 1989 36-52

W High Performance Preconditioning on Supercomputers for the 3D Derice MINIM() Proceedinp Supercomputing 90 pp 224-231

Vorst HA A Fast and Smoothly Conftlging Variant of BiCG for the Solution of NODSymmetric Linear Systems to appear in SIAM JStiStct Compv~bull