
A 2589 Line Topology Optimization Code Written for the Graphics Card

Stephan Schmidt, Volker Schulz

July 23rd, 2009

S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 1 / 23

Outline

1 The Graphics Card
    Processing Unit
    Memory Management
    Example Applications

2 Linear Elasticity and Topology Optimization
    Displacements and Compliance
    Topology Optimization Problem
    SIMP Method

3 GPU Implementation

4 Results


Literature

S. Ananiev. On equivalence between optimality criteria and projected gradient methods with application to topology optimization problem. Multibody System Dynamics, 13(1):25–38, 2003.

M. P. Bendsøe and O. Sigmund. Topology Optimization – Theory, Methods and Applications. Springer, Berlin, Heidelberg, New York, 2nd edition, 2004.

M. Giles. Using NVIDIA GPUs for computational finance. http://people.maths.ox.ac.uk/~gilesm/hpc/.

O. Sigmund. A 99 line topology optimization code written in Matlab. Structural and Multidisciplinary Optimization, 21(2):120–127, 2001.



The 3D Accelerator Graphics Card

Traditionally:
  Highly specialized processor (triangular data types, sub-float precision, no integers)

Today:
  Unified shader: autonomous compute device
  SIMD / stream architecture
  Highly parallel, up to 512 threads, in-order execution
  Ideal for vector processing
  Special programming extensions (CUDA, OpenCL)
  Peta-flop supercomputers have GPU-like architecture


CPU vs GPU

CPU:
  Cores independent
  Memory accesses hidden by caches automatically
  10.6 GB/s to RAM (PC2-5300 DDR2), optimized for latency

GPU:
  Cores execute same instructions on different memory addresses (warp = 32)
  Memory access hidden by coalescence and parallelism by programmer
  141.7 GB/s to RAM, optimized for bandwidth


GPU Memory Hierarchy

Device Memory (RAM)
  Access time: up to 600 clock cycles (= 150 float add/mult)
  Remedy: coalescence: channel load/store instructions (zero padding, pitch)!
  Unknowns per thread should be a multiple of 4 or 2, but not 3

Shared Memory
  Delivers 32 bit per clock cycle
  16 KB in 16 banks: bank conflict when too much data is needed at the same time or on unstructured access (stride)
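The two remedies on this slide can be illustrated with a small model. The Python sketch below (a) rounds a row width up to an aligned pitch and (b) counts how many threads hit the same shared-memory bank for a given stride. The function names, the 64-byte alignment, and the group size of 16 threads are assumptions chosen for illustration; they are not taken from the talk's code.

```python
def padded_pitch(width_elems, elem_bytes=4, align_bytes=64):
    """Round a row of width_elems elements up to a multiple of align_bytes (zero padding)."""
    row_bytes = width_elems * elem_bytes
    return ((row_bytes + align_bytes - 1) // align_bytes) * align_bytes


def bank_conflict_degree(stride_words, threads=16, banks=16):
    """Largest number of threads mapped to one bank when thread t reads word t * stride_words."""
    hits = [0] * banks
    for t in range(threads):
        hits[(t * stride_words) % banks] += 1
    return max(hits)


if __name__ == "__main__":
    print(padded_pitch(181))         # 181 floats = 724 bytes, padded to 768
    print(bank_conflict_degree(1))   # 1: conflict-free
    print(bank_conflict_degree(2))   # 2: two-way conflict
    print(bank_conflict_degree(16))  # 16: all threads hit the same bank
```

A degree of 1 means the accesses can be served in one pass; a degree of d means they are serialized into d passes in this simple model.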


GPU Memory Hierarchy

Constant Memory
  Read only
  Cached, with broadcast if all threads access the same address

Registers
  8192 per block, access in 0 cycles
  Shared with shared memory, potential bank conflicts


Programming Paradigm

[Figure: a grid is made up of blocks; each block is made up of threads.]
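The grid / block / thread hierarchy can be read as an index computation: each thread derives a global index from its block index, the block size, and its index within the block. The Python sketch below mimics that arithmetic for a 1D launch. The function names and launch sizes are hypothetical; the variable names loosely follow CUDA's blockIdx, blockDim, and threadIdx.

```python
def global_thread_ids(num_blocks, threads_per_block):
    """Enumerate (block index, thread index, global index) for a 1D grid of 1D blocks."""
    ids = []
    for block_idx in range(num_blocks):
        for thread_idx in range(threads_per_block):
            global_idx = block_idx * threads_per_block + thread_idx
            ids.append((block_idx, thread_idx, global_idx))
    return ids


def blocks_needed(n_items, threads_per_block):
    """Number of blocks required so that each of n_items gets its own thread."""
    return (n_items + threads_per_block - 1) // threads_per_block
```

For example, 1000 items with 256 threads per block need 4 blocks, and the last block has idle threads that must be masked out by a bounds check.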


Summary GPU Computing

Good Problems:
  Dense matvec, sparse banded matvec
  Fractals
  FFT
  Compute-intensive PDE solvers (high-order FVM, spectral elements, Lattice Boltzmann)
  Structured meshes (Cartesian)

Bad Problems:
  Unstructured sparse matvec
  Unstructured mixed-element PDE schemes
  Data-intensive tasks

⇒ Numerical schemes should be designed with computer architecture in mind.


GPU Computing in Trier (with M. Siebenborn)

Finite volume solver for shallow water and Euler equations: JST scheme, scalar dissipation, structured body-fitted mesh


Linear Elasticity

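Deformation of a solid body under forces: displacement vector u ∈ R^3.

Linear strain tensor
\[
\epsilon_{ij} := \frac{1}{2} \left( \frac{\partial u_i}{\partial x_j} + \frac{\partial u_j}{\partial x_i} \right)
\]

Voigt notation for symmetric strain tensor
\[
\tilde{\epsilon} := (\epsilon_{11}, \epsilon_{22}, \epsilon_{33}, \epsilon_{12}, \epsilon_{13}, \epsilon_{23})^T =: Bu
\]

Cauchy stress tensor: Young's modulus E, Poisson's ratio ν
\[
\sigma = \frac{E}{(1+\nu)(1-2\nu)}
\begin{bmatrix}
1-\nu & \nu & \nu & & & \\
\nu & 1-\nu & \nu & & & \\
\nu & \nu & 1-\nu & & & \\
& & & 1-2\nu & & \\
& & & & 1-2\nu & \\
& & & & & 1-2\nu
\end{bmatrix}
\tilde{\epsilon} =: CBu
\]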
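As a concrete reading of the stress-strain relation, the following Python sketch builds the 6×6 matrix C for the strain ordering (ε11, ε22, ε33, ε12, ε13, ε23) and applies it to a strain vector. The function names and the sample values E = 210 and ν = 0.3 are illustrative assumptions.

```python
def elasticity_matrix(E, nu):
    """6x6 matrix C acting on (e11, e22, e33, e12, e13, e23) with tensor shear strains."""
    scale = E / ((1.0 + nu) * (1.0 - 2.0 * nu))
    C = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            C[i][j] = scale * ((1.0 - nu) if i == j else nu)  # normal-strain block
    for i in range(3, 6):
        C[i][i] = scale * (1.0 - 2.0 * nu)                    # shear entries on the diagonal
    return C


def stress(C, strain):
    """sigma = C * strain in Voigt notation."""
    return [sum(C[i][j] * strain[j] for j in range(6)) for i in range(6)]


if __name__ == "__main__":
    C = elasticity_matrix(E=210.0, nu=0.3)   # sample values, chosen for illustration
    print(stress(C, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```

With this convention a unit shear strain ε12 produces the shear stress E / (1 + ν), since the prefactor cancels the factor 1 − 2ν.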


Linear Elasticity and Finite Elements

Weak formulation

\[
a(u, v) = \int_{\Omega} (Bv)^T C B u \, dS = L(v) \quad \forall v \in V.
\]

Matrix notation
\[
K(\Omega)\, u = f
\]

Forces, loads, supports: f

Compliance
\[
c(u) = u^T f = u^T K u
\]
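The identity c(u) = u^T f = u^T K u holds whenever u solves K u = f. A tiny example makes this tangible: the Python sketch below uses a hypothetical chain of two unit springs, fixed at one end and loaded at the other, which is not an example from the talk.

```python
def solve2(K, f):
    """Solve a 2x2 linear system K u = f by Cramer's rule."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return [(f[0] * K[1][1] - K[0][1] * f[1]) / det,
            (K[0][0] * f[1] - K[1][0] * f[0]) / det]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(K, u):
    return [dot(row, u) for row in K]


# Two springs in series with stiffness 1, left end clamped, unit load at the free end.
K = [[2.0, -1.0], [-1.0, 1.0]]
f = [0.0, 1.0]
u = solve2(K, f)

compliance_from_load = dot(u, f)                  # u^T f
compliance_from_stiffness = dot(u, matvec(K, u))  # u^T K u
```

Both expressions evaluate to 2.0 here; a stiffer structure would deform less under the same load and give a smaller compliance.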


Topology Optimization Problem

Mathematical Problem

\[
\min_{(u,\Omega)} J(u,\Omega) := u^T K(\Omega)\, u
\]
subject to
\[
K(\Omega)\, u = f, \qquad \mathrm{Vol}(\Omega) = V_0
\]

How to deal with the unknown Ω?
  Level-set method: Ω is the zero level of a function θ
    Extract zero-level curve of θ ⇒ unstructured curve ⇒ unstructured discretization of Ω
  X-FEM
    Special treatment of bisected elements
  Solid Isotropic Material with Penalization (SIMP), aka homogenization approach


Solid Isotropic Material with Penalization (SIMP)

Overlay Ω with a Cartesian grid
Pseudo-density in each finite element, ρ = (ρ_1, …, ρ_N)^T:

\[
\min_{(u,\rho)} J(u,\rho) := u^T K(\rho)\, u
\]
subject to
\[
K(\rho)\, u = f, \qquad \sum_{e=1}^{N} \rho_e = V_0, \qquad \rho_e \in \{0,1\}
\]

Replace ρ_e ∈ {0,1} by ρ_e ∈ [0,1]:
\[
a(u, v) = \sum_{e=1}^{N} \int_{\Omega_e} (Bv)^T \rho_e^p \, C B u \, dS = L(v) \quad \forall v \in V
\]

Penalty parameter p
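The effect of the penalty parameter can be seen from a few numbers. In the Python sketch below, an element with density ρ contributes stiffness ρ^p but consumes volume ρ, so intermediate densities pay for more volume than the stiffness they deliver. The value p = 3 is an assumption commonly used in the SIMP literature; the slides do not state the value used.

```python
def simp_stiffness(rho, p=3.0):
    """Stiffness scaling rho^p of an element with pseudo-density rho."""
    return rho ** p


def stiffness_per_volume(rho, p=3.0):
    """Stiffness obtained per unit of volume spent, rho^p / rho."""
    return rho ** p / rho


if __name__ == "__main__":
    for rho in (0.25, 0.5, 0.75, 1.0):
        print(rho, simp_stiffness(rho), stiffness_per_volume(rho))
```

With p = 3, a half-dense element provides only one eighth of the full stiffness, so an optimizer under a volume constraint is pushed toward ρ_e near 0 or 1.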

Optimality Criteria

Lagrangian:

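\[
L = u^T K(\rho)\, u + \lambda \left( \sum_{e=1}^{N} \rho_e - V_0 \right) + \mu^T \left( K(\rho)\, u - f \right) + \sum_{e=1}^{N} \alpha_e (-\rho_e) + \sum_{e=1}^{N} \beta_e (\rho_e - 1)
\]

Optimality condition (self-adjoint in µ):
\[
-u_e^T \frac{\partial K_e}{\partial \rho_e} u_e + \lambda - \alpha_e + \beta_e = 0
\]
\[
\sum_{e=1}^{N} \rho_e - V_0 = 0
\]
\[
-\rho_e = 0 \quad \text{or} \quad \rho_e - 1 = 0
\]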


Optimality Criteria Method

Gradient of Lagrangian without constraint ρe ∈ {0,1}:

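\[
B_e := \frac{1}{\lambda} u_e^T \frac{\partial K_e}{\partial \rho_e} u_e = 1
\]

Update for ρ_e:

ρ_e ← { … }   [case distinction not recoverable from the extracted text; surviving fragment: "… 0, damping η = 0.5"]

Bisection for λ
OC update can be interpreted as a special projected gradient method for the ρ_e ∈ [0,1] constraint
Implemented in one-shot, i.e. inexact gradient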
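A common form of such an update is the heuristic used in Sigmund's 99 line code listed in the literature: scale ρ_e by B_e^η, clip to a move limit and to the box bounds, and find λ by bisection so that the volume constraint holds. The Python sketch below follows that standard scheme and is not necessarily the exact update of this talk. Only η = 0.5 comes from the slide; the move limit 0.2, the lower bound 0.001, the bisection bracket, and all names are assumptions.

```python
def oc_update(rho, sens, v0, move=0.2, eta=0.5, rho_min=1e-3):
    """One optimality-criteria step with bisection for the multiplier lambda.

    rho  : current pseudo-densities
    sens : u_e^T (dK_e / d rho_e) u_e for each element (non-negative)
    v0   : required value of sum(rho)
    """
    def candidate(lam):
        new = []
        for r, s in zip(rho, sens):
            value = r * (s / lam) ** eta          # rho_e * B_e^eta
            value = min(value, r + move, 1.0)     # upper move limit and upper bound
            value = max(value, r - move, rho_min) # lower move limit and lower bound
            new.append(value)
        return new

    lo, hi = 1e-9, 1e9                            # assumed bracket for lambda
    while (hi - lo) / (hi + lo) > 1e-10:
        mid = 0.5 * (lo + hi)
        if sum(candidate(mid)) > v0:
            lo = mid                              # too much material: increase lambda
        else:
            hi = mid
    return candidate(0.5 * (lo + hi))


if __name__ == "__main__":
    print(oc_update([0.5] * 4, [4.0, 3.0, 2.0, 1.0], v0=2.0))
```

Elements with larger sensitivity gain material and elements with smaller sensitivity lose it, while the total stays at v0.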


Finite Elements: CPU Code

Compute RefStiff[i][j] ∈ R^{24×24}
u^{k+1} = 0
for all finite elements T do
    for all vertices i of T do
        t = 0
        for all vertices j of T do
            ig = global index of i
            jg = global index of j
            t = t + ρ_T RefStiff[i][j] u^k[jg]
        end for
        u^{k+1}[ig] = u^{k+1}[ig] + t
    end for
end for

Cons:
  Requires 32 global load operations per element
  Requires 8 global store operations per element
  Final store must be atomic!
  Prohibitive for GPU!
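The element loop above can be tried out on a much smaller analog. The Python sketch below applies the same scatter-add pattern to a 1D chain of 2-node bar elements instead of 3D elements with a 24×24 reference matrix. The 1D setting, the scaling ρ^p with p = 3, and all names are simplifying assumptions for illustration.

```python
REF_STIFF = [[1.0, -1.0], [-1.0, 1.0]]  # reference stiffness of a 2-node bar element


def matvec_elementwise(rho, u, p=3.0):
    """y = K(rho) u, accumulated element by element (scatter-add)."""
    n_elems = len(rho)
    y = [0.0] * (n_elems + 1)
    for e in range(n_elems):          # for all finite elements T
        for i in range(2):            # for all vertices i of T
            t = 0.0
            for j in range(2):        # for all vertices j of T
                t += rho[e] ** p * REF_STIFF[i][j] * u[e + j]
            y[e + i] += t             # neighbouring elements write to the same entry
    return y


if __name__ == "__main__":
    print(matvec_elementwise([1.0, 1.0], [0.0, 1.0, 2.0]))  # [-1.0, 0.0, 1.0]
```

The line `y[e + i] += t` is the store that two adjacent elements can hit at the same time; run in parallel over elements, it would have to be atomic.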


Memory Coalescence

Memory access is expensive!
Strategy: matrix-free FEM with CG
Cartesian mesh: nx × ny × nz tensor mesh
Parallelism: process matvec per 2D slice and stream in k-plane
Partition 2D slice into warpsize × n blocks, where n is determined from available shared memory
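Matrix-free CG only needs the action v ↦ K v, never the assembled matrix. The Python sketch below shows a plain conjugate gradient loop that takes this action as a function. The operator used in the example is a 1D Laplacian stencil, chosen as a stand-in for the elasticity matvec; the function names and the tolerance are assumptions.

```python
def conjugate_gradient(apply_K, f, tol=1e-10, max_iter=1000):
    """Solve K u = f for a symmetric positive definite K, given only apply_K(v) = K v."""
    u = [0.0] * len(f)
    r = list(f)                              # residual f - K u for the start value u = 0
    d = list(r)                              # search direction
    rr = sum(x * x for x in r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        Kd = apply_K(d)                      # the only place where K is used
        alpha = rr / sum(a * b for a, b in zip(d, Kd))
        u = [ui + alpha * di for ui, di in zip(u, d)]
        r = [ri - alpha * kdi for ri, kdi in zip(r, Kd)]
        rr_new = sum(x * x for x in r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return u


def apply_laplacian(v):
    """Stencil (-1, 2, -1) with zero boundary values, applied without storing a matrix."""
    n = len(v)
    return [2.0 * v[i]
            - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


if __name__ == "__main__":
    print(conjugate_gradient(apply_laplacian, [1.0] * 5))
```

Since each iteration costs one operator application plus a few vector updates, the run time of the solver is dominated by how fast the matvec streams through memory.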


Finite Elements: GPU Code

Compute RefStiff[i][j] ∈ R^{24×24} and copy to constant memory
Partition x-y plane into warpsize × n patches, launch GPU blocks
Init shared memory, synchronize threads
for all k-planes do
    (i, j) = thread ID, Res = 0 in thread register
    Discard slice, load new one, synchronize
    for all elements T that have (i, j, k) as a vertex do
        (i2, j2, k2) = local index (i, j, k) has in T
        u_thread = u(i, j, k) from shared memory
        for all (i1, j1, k1) vertex of T do
            Res = Res + ρ_T RefStiff[(i1, j1, k1)][(i2, j2, k2)] u_thread
        end for
    end for
    Synchronize threads
    Upload Res from shared to global memory
end for
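The key change relative to an element loop is that each thread owns one node and accumulates its own result, so no two threads write to the same location. The Python sketch below shows a node-wise gather variant of this idea on a 1D chain of 2-node bar elements, where each "thread" reads the values of the vertices of its adjacent elements. The 1D setting, the gather formulation, the scaling ρ^p with p = 3, and all names are assumptions for illustration and are not a transcription of the GPU kernel.

```python
REF_STIFF = [[1.0, -1.0], [-1.0, 1.0]]  # reference stiffness of a 2-node bar element


def matvec_nodewise(rho, u, p=3.0):
    """y = K(rho) u, computed node by node (gather), one conceptual thread per node."""
    n_elems = len(rho)
    y = [0.0] * (n_elems + 1)
    for node in range(n_elems + 1):          # one thread per node
        res = 0.0                            # accumulator, a register on the GPU
        for e in (node - 1, node):           # elements that have this node as a vertex
            if 0 <= e < n_elems:
                i_loc = node - e             # local index of the node inside element e
                for j_loc in range(2):       # all vertices of element e
                    res += rho[e] ** p * REF_STIFF[i_loc][j_loc] * u[e + j_loc]
        y[node] = res                        # exactly one store per node, no atomics
    return y


if __name__ == "__main__":
    print(matvec_nodewise([1.0, 1.0], [0.0, 1.0, 2.0]))  # [-1.0, 0.0, 1.0]
```

The price of the conflict-free store is that each element's contribution is recomputed by every node that touches it, which trades extra arithmetic for fewer and better structured memory operations.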


3D Cantilever

180 × 180 × 360 mesh
46.5 · 10^6 unknowns


[Video: cant_opt.mp4]

3D Cantilever

[3D model: cantsmooth.u3d]

80 × 80 × 160 mesh
Full load in k-direction


Speed-Up

Time for 1000 CG iterations on 180 × 180 × 360 mesh

Device                     Time (hh:mm:ss.cc)   Factor vs. GTX280
GeForce GTX280             00:05:27.60            1.00
Core2Duo E6600, 1 core     05:18:27.38           58.33
Core2Duo E6600, 2 cores    02:51:28.55           31.41
Core2Duo T9600, 1 core     04:37:29.92           50.82
Core2Duo T9600, 2 cores    01:58:50.87           21.77

[Bar chart: 1000 CG iterations, 180x180x360 mesh; time axis from 00:00:00 to 06:00:00; bars for GeForce GTX280, Core2Duo E6600 (1 and 2 cores), Core2Duo T9600 (1 and 2 cores)]
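The factors follow directly from the timings: each CPU time divided by the GPU time. The short Python sketch below recomputes them from the values reported above, written with a decimal point; the helper names are hypothetical.

```python
def to_seconds(hms):
    """Convert a time string 'hh:mm:ss.cc' to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


TIMES = {
    "GeForce GTX280": "00:05:27.60",
    "Core2Duo E6600 1 Core": "05:18:27.38",
    "Core2Duo E6600 2 Core": "02:51:28.55",
    "Core2Duo T9600 1 Core": "04:37:29.92",
    "Core2Duo T9600 2 Core": "01:58:50.87",
}


def speedups(times, baseline="GeForce GTX280"):
    """Run time of each device relative to the baseline device."""
    base = to_seconds(times[baseline])
    return {name: to_seconds(t) / base for name, t in times.items()}


if __name__ == "__main__":
    for name, factor in speedups(TIMES).items():
        print(f"{name}: {factor:.2f}")
```

Going from one to two CPU cores reduces the factor by less than half in both cases, so the CPU runs do not scale perfectly with the core count either.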


Conclusions and Future Work

Conclusions
  GPU very fast for problems with specific structure
  Programming: easy to pick up, hard to master

Future Work
  Multigrid
  Fluid / structure interaction
  Multi-GPU
  Heterogeneous CPU / GPU parallelism
  Adaptive load balancing

Code available: http://www.mathematik.uni-trier.de/~schmidt/gputop

