A 2589 Line Topology Optimization Code Written for the Graphics … · 2018. 10. 30. · Outline 1...
Transcript of A 2589 Line Topology Optimization Code Written for the Graphics … · 2018. 10. 30. · Outline 1...
-
A 2589 Line Topology Optimization Code Written forthe Graphics Card
Stephan Schmidt, Volker Schulz
July 23rd, 2009
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 1 / 23
-
Outline
1 The Graphics CardProcessing UnitMemory ManagementExample Applications
2 Linear Elasticity and Topology OptimizationDisplacements and ComplianceTopology Optimization ProblemSIMP Method
3 GPU Implementation
4 Results
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 2 / 23
-
Literature
S. Ananiev.On equivalence between optimality criteria and projected gradientmethods with application to topology optimization problem.Multibody System Dynamics, 13(1):25–38, 2003.
M. P. Bendsøe and O. Sigmund.Topology Optimization – Theory, Methods and Applications.Springer, Berlin, Heidelberg, New York, 2nd edition, 2004.
M. Giles.Using NVIDIA GPUs for computational finance.http://people.maths.ox.ac.uk/~gilesm/hpc/.
O. Sigmund.A 99 line topology optimization code written in matlab.Structural and Multidisciplinary Optimization, 21(2):120–127, 2001.
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 3 / 23
http://people.maths.ox.ac.uk/~gilesm/hpc/
-
The 3D Accelerator Graphics Card
Traditionally:Highly specialized processor (triangular data types, sub floatprecision, no integers)
Today:Unified shader: Autonomous compute deviceSIMD / Stream architectureHighly parallel, up to 512 threads, in order executionIdeal for vector processingSpecial programming extensions (CUDA, OpenCL)Peta-Flop supercomputers have GPU-like architecture
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 4 / 23
-
CPU vs GPU
CPU:Cores independentMemory accesses hidden bycaches automatically10.6 GB/s to RAM (PC2-5300DDR2), optimized for latency
GPU:Cores execute sameinstructions on differentmemory address (warp = 32)Memory access hidden bycoalescence and parallelismby programmer141.7 GB/s to RAM, optimizedfor bandwidth
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 5 / 23
-
GPU Memory Hierarchy
Device Memory (RAM)Access time: Up to 600 clock cycles (= 150 float add/mult)Remedy: Coalescence: Channel load/store instructions (zeropadding, pitch)!!Unknowns per thread should be multiple of 4 or 2 but not 3
Shared MemoryDelivers 32 Bit per clock cycle16 KB in 16 Banks: Bank conflict when too much data needed atthe same time or unstructured access (stride)
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 6 / 23
-
GPU Memory Hierarchy
Constant MemoryRead onlyCached with broadcast if all threads access same address
Registers8192 per Block, access 0 cycleShared with Shared Memory, potential bank conflicts
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 7 / 23
-
Programming Paradigm
Grid Block
Block
Block
Thread
Thread
Thread
Thread
Thread
Thread
Thread
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 8 / 23
-
Summary GPU Computing
Good Problems:Dense matvec, Sparse banded matvecFractalsFFTCompute intensive PDE Solvers (High order FVM, SpectralElements, Lattice Boltzmann)Structured meshes (cartesian)
Bad Problems:Unstructured sparse matvecUnstructured mixed element PDE schemesData intensive tasks
⇒ Numerical schemes should be designed with computer architecturein mind.
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 9 / 23
-
GPU Computing in Trier (with M. Siebenborn)
Finite Volume solver for shallow water and Euler equations, JSTScheme, scalar dissipation, structured bodyfitted mesh
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 10 / 23
-
Linear Elasticity
Deformation of a solid body under forces: Displacement vectoru ∈ R3.Linear strain tensor
�ij :=12
(∂ui∂xj
+∂uj∂xi
)Voight notation for symmetric strain tensor
�̃ := (�11, �22, �33, �12, �13, �23)T =: Bu
Cauchy stress tensor: Young’s modulus E , Poisson’s ratio ν
σ =E
(1 + ν)(1− 2ν)
26666641− ν ν νν 1− ν νν ν 1− ν
1− 2ν1− 2ν
1− 2ν
3777775 �̃=: CBu
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 11 / 23
-
Linear Elasticity and Finite Elements
Weak formulation
a(u, v) =∫Ω
(Bv)T CBu dS = L(u) ∀v ∈ V .
Matrix notation
K (Ω)u = f
Forces, loads, supports: fCompliance
c(u) = uT f = uT Ku,
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 12 / 23
-
Topology Optimization Problem
Mathematical Problem
min(u,Ω)
J(u,Ω) : = uT K (Ω)u
subject toK (Ω)u = fVol(Ω) = V0
How to deal with the unknown Ω?Level-Set method: Ω is zero level of function θ
Extract zero-level curve of θ ⇒ unstructured curve⇒ unstructureddiscretization of Ω
X-FEMSpecial treatment of bisected elements
Solid Isotropic Material with Penalization (SIMP), akahomogenization approach
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 13 / 23
-
Solid Isotropic Material with Penalization (SIMP)
Overlay Ω with cartesian gridPseudo-Density in each Finite Element ρ = (ρ1, · · · , ρN)T :
min(u,ρ)
J(u, ρ) : = uT K (ρ)u
subject toK (ρ)u = f
N∑e=1
ρe = V0
ρe ∈ {0,1}Replace ρe ∈ {0,1} by ρe ∈ [0,1]
a(u, v) =N∑
e=1
∫Ωe
(Bv)TρpeCBu dS = L(u) ∀v ∈ V
Penalty parameter pS. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 14 / 23
-
Optimality Criteria
Lagrangian:
L =uT K (ρ)u + λ
(N∑
e=1
ρe − V0
)+ µT (K (ρ)u − f )
+N∑
e=1
αe(−ρe) +N∑
e=1
βe(ρe − 1)
Optimality condition (self-adjoint in µ):
−uTe∂Ke∂ρe
ue + λ− αe + βe = 0
N∑e=1
ρe − V0 = 0
−ρe = 0 or ρe − 1 = 0
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 15 / 23
-
Optimality Criteria Method
Gradient of Lagrangian without constraint ρe ∈ {0,1}:
Be :=1λ
uTe∂Ke∂ρe
ue = 1
Update for ρe:
ρe ←
8 0, damping η = 0.5Bisection for λOC-Update can be interpreted as special projected gradientmethod for ρe ∈ [0,1] constraintImplemented in One-Shot, i.e. inexact gradient
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 16 / 23
-
Finite Elements: CPU Code
1: Compute RefStiff[i][j] ∈ R24×242: uk+1 = 03: for all Finite Elements T do4: for all vertices i of T do5: t = 06: for all vertices j of T do7: ig = Global-Index i8: jg = Global-Index j9: t = t + ρT RefStiff[i][j]uk [jg]
10: end for11: uk+1[ig] = uk+1[ig] + t12: end for13: end for
Cons:Requires 32 global loadoperations per elementRequires 8 global storeoperations per elementFinal store must be atomic!Prohibitive for GPU!
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 17 / 23
-
Memory Coalescence
Memory access is expensive!Strategy: Matrix-free FEM withCGCartesian Mesh: nx × ny × nztensor meshParallelism: Process matvecper 2D slice and stream ink -planePartition 2D slice inwarpsize × n blocks, where nis determined from avaliableshared memory
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 18 / 23
-
Finite Elements: GPU Code
1: Compute RefStiff[i][j] ∈ R24×24 and copy to constant memory2: Partition x-y -plane in warpsize × n patches, launch GPU blocks3: Init shared memory, synchronize threads4: for all k -planes do5: (i , j) = Thread-ID, Res = 0 in thread register6: Discard slice, load new one, synchronize7: for all Elements T that have (i , j , k) as a vertex do8: (i2, j2, k2) = local index (i , j , k) has in T9: uthread = u(i , j , k) from shared memory
10: for all (i1, j1, k1) vertex of T do11: Res = Res + ρT RefStiff[(i1, j1, k1)][(i2, j2, k2)]uthread12: end for13: end for14: Synch threads15: Upload Res from shared to global memory16: end for
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 19 / 23
-
3D Cantilever
180× 180× 360 mesh46.5 · 106 unknowns
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 20 / 23
Lavf52.31.0
cant_opt.mp4Media File (video/mp4)
-
3D Cantilever
(cantsmooth.u3d)
80× 80× 160 meshFull load in k -direction
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 21 / 23
-
Speed-Up
Time for 1000 CG iterations on 180× 180× 360 mesh
00:05:27,60 1Core2Duo E6600 1 Core 05:18:27,38 58,33Core2Duo E6600 2 Core 02:51:28,55 31,41Core2Duo T9600 1 Core 04:37:29,92 50,82Core2Duo T9600 2 Core 01:58:50,87 21,77
1000 CG Iterations 180x180x360 MeshGeForce GTX280
00:00:00,00
01:12:00,00
02:24:00,00
03:36:00,00
04:48:00,00
06:00:00,00GeForce GTX280Core2Duo E6600 1 CoreCore2Duo E6600 2 CoreCore2Duo T9600 1 CoreCore2Duo T9600 2 Core
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 22 / 23
-
Conclusions and Future Work
ConclusionsGPU very fast for problems with specific structureProgramming: Easy to pick up, hard to master
Future WorkMultigridFluid / structure interactionMulti-GPUHeterogenous CPU / GPU parallelismAdaptive load balancing
Code avaliablehttp://www.mathematik.uni-trier.de/~schmidt/gputop
S. Schmidt, V. Schulz (University of Trier) GPU Topology Optimization July 23rd, 2009 23 / 23
http://www.mathematik.uni-trier.de/~schmidt/gputop
The Graphics CardProcessing UnitMemory ManagementExample Applications
Linear Elasticity and Topology OptimizationDisplacements and ComplianceTopology Optimization ProblemSIMP Method
GPU ImplementationResults