GPU Fluid Simulation
Neil Osborne
School of Computer and Information Science, ECU
Supervisors:
Adrian Boeing
Philip Hingston
Introduction
Project Aims
Why GPU (Graphics Processing Unit)?
Why SPH (Smoothed Particle Hydrodynamics)?
Smoothed Particle Hydrodynamics
GPU Architecture
Implementation
Results & Conclusions
Project Aims
Implement SPH fluid simulation on GPU
Identify GPU optimisations
Compare CPU vs. GPU performance
Why GPU (Graphics Processing Unit)?
Affordable and available
Enable interactivity
Parallel data processing on GPU
[Chart: NVIDIA GPU vs. Intel CPU peak GFLOPS growth, Jan 2003 - Jun 2008, covering NV30, NV35, NV40, G70, G71, G80, G80 Ultra, G92 and GT200 against a 3.0 GHz Core2 Duo and a 3.2 GHz Harpertown. © NVIDIA Corporation 2008]
Why SPH (Smoothed Particle Hydrodynamics)?
SPH can be applied to many applications concerned with fluid phenomena:
– aerodynamics
– weather
– beach erosion
– astronomy
Compute intensive
Same operations required for multiple particles
Maps well to GPU implementation
Smoothed Particle Hydrodynamics (SPH)
SPH is an interpolation method for particle systems
Distributes quantities in a local neighbourhood of each particle, using radially symmetrical smoothing kernels
Per-particle quantities:
– Density
– Pressure
– Viscosity
– Acceleration (x, y, z)
– Velocity (x, y, z)
– Position (x, y, z)
– Mass
[Diagram: a particle at position r with smoothing radius h and neighbouring particles r_j(1)...r_j(4); (r - r_j(4)) marks the distance to one neighbour]
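A minimal sketch of how these per-particle quantities might be stored on the host, assuming the flat float-array layout the CUDA kernels later in this deck operate on (three floats per particle for position, velocity and acceleration, one per particle for mass, density and pressure); the names are illustrative, not taken from the project code:

// Hypothetical host-side storage for n particles.
struct ParticleArrays {
    float *pos;      // n * 3 floats: x, y, z per particle
    float *vel;      // n * 3 floats
    float *accel;    // n * 3 floats
    float *mass;     // n floats
    float *density;  // n floats
    float *pressure; // n floats
};

ParticleArrays alloc_particles(int n)
{
    ParticleArrays p;
    p.pos      = new float[n * 3];
    p.vel      = new float[n * 3];
    p.accel    = new float[n * 3];
    p.mass     = new float[n];
    p.density  = new float[n];
    p.pressure = new float[n];
    return p;
}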
Smoothed Particle Hydrodynamics (SPH)
Our SPH equations are derived from the Navier-Stokes equations, which describe the dynamics of fluids
A_S(r) is interpolated by a weighted sum of contributions from all neighbouring particles:
A_S(r) = \sum_j m_j (A_j / \rho_j) W(r - r_j, h)
where:
A_S(r) = scalar quantity at location r
A_j = field quantity at particle j
m_j = mass of particle j
\rho_j = density at particle j
W = smoothing kernel with core radius h
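As an illustration of the interpolation sum above, a minimal C function that evaluates a smoothed quantity at particle i by brute force over all particles; the array layout and the placeholder kernel are assumptions for the sketch, not the project's actual code:

#include <math.h>

static float W(float r, float h) // placeholder: the poly6 kernel (see appendix)
{
    float q = h*h - r*r;
    return (315.0f / (64.0f * 3.14159265f * powf(h, 9))) * q * q * q;
}

// Evaluate A_S at particle i as a kernel-weighted sum over all particles j.
float interpolate_quantity(int i, int n,
                           const float *pos,      // n*3 floats
                           const float *quantity, // A_j, n floats
                           const float *mass,     // m_j, n floats
                           const float *density,  // rho_j, n floats
                           float h)
{
    float sum = 0.0f;
    for (int j = 0; j < n; j++) {
        float dx = pos[3*i]   - pos[3*j];
        float dy = pos[3*i+1] - pos[3*j+1];
        float dz = pos[3*i+2] - pos[3*j+2];
        float r  = sqrtf(dx*dx + dy*dy + dz*dz);
        if (r < h)
            sum += mass[j] * (quantity[j] / density[j]) * W(r, h);
    }
    return sum;
}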
VIDEO: SPH implementation
GPU: Architecture
[Diagram: CPU die (Control, Cache, a few ALUs, DRAM) compared with GPU die (many small processors, DRAM). © NVIDIA Corporation 2008]
More transistors are devoted to data processing rather than data caching and flow control
Each multiprocessor contains a number of processors
GPU: Grid structure
Host (PC)
– Runs application code
– Calls Device kernel functions serially
Device (GPU)
– Executes kernel functions
Grid
– Can have 1D or 2D arrangement of Blocks
Block
– Can have 1D, 2D, or 3D arrangement of Threads
Thread
– Executes its portion of the code
[Diagram: the Host launches Kernel 1 on Grid 1 (a 4x2 arrangement of Blocks) and Kernel 2 on Grid 2; Block (1,1) is expanded to show a 5x2 arrangement of Threads. © NVIDIA Corporation 2008]
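A minimal, generic CUDA sketch (not taken from the project code) of the host/device relationship described above: the host configures a grid of blocks and launches a kernel, and each thread derives a global index from its block and thread IDs:

// Generic CUDA pattern: one thread per data element.
__global__ void example_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if (i < n)
        data[i] += 1.0f;
}

void launch_example(float *d_data, int n)
{
    dim3 block(32);            // 32 threads per block
    dim3 grid((n + 31) / 32);  // enough blocks to cover n elements
    example_kernel<<<grid, block>>>(d_data, n); // called from the host, run on the device
}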
GPU: Memory
Shared
– Low latency
– (RW) access by all threads in block
Local
– Unqualified variables
– (RW) access by a thread
Global
– High latency – not cached
– (RW) access by all threads
Constant
– Cached in Global
– (RO) access by all threads
[Diagram: a Grid containing Block (0,0) and Block (1,0), each with its own Shared Memory and per-thread Registers and Local Memory; all blocks access Global, Constant and Texture Memory. © NVIDIA Corporation 2008]
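A small, generic sketch (assumed, not from the project code) showing how each of these memory spaces appears in CUDA C; the kernel assumes 32 threads per block:

__constant__ float sim_params[16];  // constant memory: read-only in kernels, set from the host

__global__ void memory_spaces_example(float *global_data) // global memory: high latency
{
    __shared__ float tile[32];      // shared memory: visible to every thread in the block
    float local_value;              // automatic variable: register (or local memory if spilled)

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    local_value = global_data[i] * sim_params[0];
    tile[threadIdx.x] = local_value;
    __syncthreads();                // make the shared tile visible to the whole block
    global_data[i] = tile[threadIdx.x];
}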
Implementation: Main Operations
Create data structures on Host to hold data values
Allocate Device memory to store our data
Copy data from Host to Device memory
Loop until user aborts:
– clear_step() – reset densities and accelerations
– update_density(), sum_density() – calculate densities & pressure
– update_force() – calculate viscosities & accelerations
– particle_integrate() – calculate velocities and positions
– collision_detection() – detect potential collisions
– Copy data from Device memory to Host
– Render particles using graphics engine
Free allocated Device memory
(The six simulation functions exist in both the CPU and GPU versions; the Device memory allocation and Host/Device copies apply to the GPU versions only. A sketch of the loop follows below.)
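A hedged sketch of how this main loop might look on the host for the GPU versions; the kernel argument lists, the render_particles helper and the launch configuration are assumptions for illustration, not the project's actual signatures:

// Hypothetical kernel declarations -- the real kernels take the full set of
// per-particle arrays; a single position array is used here for brevity.
__global__ void clear_step(float *pos, int n);
__global__ void update_density(float *pos, int n);
__global__ void sum_density(float *pos, int n);
__global__ void update_force(float *pos, int n);
__global__ void particle_integrate(float *pos, int n);
__global__ void collision_detection(float *pos, int n);
bool render_particles(const float *pos, int n); // draws the frame, returns false on quit

void run_simulation(float *d_pos, float *h_pos, int nparticles)
{
    dim3 block(32);
    dim3 grid(nparticles / 32);
    bool running = true;
    while (running) {
        clear_step<<<grid, block>>>(d_pos, nparticles);          // reset densities/accelerations
        update_density<<<grid, block>>>(d_pos, nparticles);      // densities & pressure
        sum_density<<<grid, block>>>(d_pos, nparticles);
        update_force<<<grid, block>>>(d_pos, nparticles);        // viscosities & accelerations
        particle_integrate<<<grid, block>>>(d_pos, nparticles);  // velocities and positions
        collision_detection<<<grid, block>>>(d_pos, nparticles); // potential collisions
        cudaMemcpy(h_pos, d_pos, sizeof(float) * nparticles * 3,
                   cudaMemcpyDeviceToHost);                      // copy results back to the host
        running = render_particles(h_pos, nparticles);           // draw with the graphics engine
    }
}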
Implementation: Versions
4 software implementations:
– CPU
– GPU V1 – 2D Grid, Global memory access
– GPU V2 – 1D Grid, Global memory access
– GPU V3 – 1D Grid, Shared memory access
Implementation: CPU - Nested Loop

C Function
void compare_particles(int n)
{
    int i, j;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (i == j) continue;
            statements;
        }
    }
}

void main()
{
    int nparticles = 2048;
    compare_particles(nparticles);
}
Implementation: GPU V1 - 2D Grid, Global Memory Access

CUDA kernel
__global__ void compare_particles(float *pos)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i != j){
        statements;
    }
}

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid2D(nparticles/blocksize, nparticles);
    compare_particles<<<Grid2D, dimBlock>>>(idataPos);
}
Implementation: GPU V1 - 2D Grid, Global Memory Access
[Diagram: Grid2D is 2048/32 = 64 blocks wide (x) by 2048 blocks tall (y), with 32 threads per block; idataPos holds particles 0 to n-1 in Global memory]
Each thread compares its own particle data in Global memory; all threads in all rows compare their own particle data in Global memory…
Implementation: GPU V1 - 2D Grid, Global Memory Access
[Diagram: same Grid2D layout as the previous slide]
…with the particle data (associated with the block row) in global memory.
Implementation: GPU V2 - 1D Grid, Global Memory Access

CUDA kernel
__global__ void compare_particles(float *pos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j;
    for (j = 0; j < n; j++){
        if (i != j){
            statements;
        }
    }
}

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid1D(nparticles/blocksize);
    compare_particles<<<Grid1D, dimBlock>>>(idataPos, N);
}
Implementation: GPU V2 - 1D Grid, Global Memory Access
[Diagram: Grid1D is 2048/32 = 64 blocks of 32 threads (x = i); idataPos holds particles 0 to n-1 in Global memory]
Each thread compares its own particle data in Global memory…
Implementation: GPU V2 - 1D Grid, Global Memory Access
[Diagram: same Grid1D layout]
…with the first particle data in global memory.
Implementation: GPU V2 - 1D Grid, Global Memory Access
[Diagram: same Grid1D layout]
Each thread compares its own particle data in Global memory…
Implementation: GPU V2 - 1D Grid, Global Memory Access
[Diagram: same Grid1D layout]
…with the second particle data in global memory, etc.
Implementation: GPU V3 - 1D Grid, Shared Memory Access

CUDA kernel
__global__ void compare_particles(float *pos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float posblock[32*3];
    __shared__ float accelblock[32*3];
    __shared__ float velblock[32*3];
    __shared__ float densblock[32];
    __shared__ float pressblock[32];
    __shared__ float massblock[32];
    // Copy global to shared statements here
    int j;
    for (j = 0; j < n; j++){
        if (i != j){
            statements;
        }
    }
}
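A sketch of what the "copy global to shared" step marked above might look like for the position array, assuming each of the block's 32 threads copies its own particle's three position floats and then synchronises; this is an illustration, not the project's actual code:

    // Hypothetical fill-in for the "copy global to shared" step:
    // each thread copies its own particle's position into the block's tile.
    int t = threadIdx.x;                 // 0..31 within the block
    posblock[t*3]     = pos[i*3];
    posblock[t*3 + 1] = pos[i*3 + 1];
    posblock[t*3 + 2] = pos[i*3 + 2];
    // ...same pattern for velocity, acceleration, density, pressure and mass
    __syncthreads();                     // wait until the whole tile is filled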
Implementation: GPU V3 - 1D Grid, Shared Memory Access

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid1D(nparticles/blocksize);
    compare_particles<<<Grid1D, dimBlock>>>(idataPos, N);
}
Implementation: GPU V3 - 1D Grid, Shared Memory Access
[Diagram: Grid1D of 2048/32 = 64 blocks of 32 threads; idataPos (particles 0 to n-1) in Global memory, with a 32-particle tile per block in Shared memory]
Each Block copies the particle data associated with its 32 threads into Shared memory
Implementation: GPU V3 - 1D Grid, Shared Memory Access
[Diagram: same layout as the previous slide]
Data in shared memory is compared to the first particle data in global memory. Calculations involving particles are quicker.
Implementation: GPU V3 - 1D Grid, Shared Memory Access
[Diagram: same layout as the previous slide]
Data in shared memory is compared to the second particle data in global memory. Global memory accesses are reduced.
Results: Kernel Timings (2048 particles)
sum_density (microseconds):
CPU 20.894, [GPU] V1 2.938, [GPU] V2 3.053, [GPU] V3 2.947
Results: Kernel Timings (2048 particles)
update_density (milliseconds):
CPU 33.989, [GPU] V1 30.424, [GPU] V2 15.676, [GPU] V3 8.921
Results: Kernel Timings (2048 particles)
update_force (milliseconds):
CPU 307.743, [GPU] V1 33.611, [GPU] V2 16.579, [GPU] V3 9.366
Results: Kernel Timings (2048 particles)
cudaMemcpy (microseconds):
CPU – (not applicable), [GPU] V1 17.595, [GPU] V2 17.677, [GPU] V3 17.587
Results: Kernel Timings (2048 particles)
Total (milliseconds):
CPU 342.538, [GPU] V1 64.123, [GPU] V2 32.342, [GPU] V3 18.369
Results: Performance comparison

Function/Kernel       CPU time                GPU time               GPU speedup
clear_step            49.751 microseconds     6.79 microseconds      7.3x faster
update_density        33.989 milliseconds     8.921 milliseconds     3.8x faster
sum_density           20.894 microseconds     2.947 microseconds     7.1x faster
update_force          307.743 milliseconds    9.366 milliseconds     32.8x faster
collision_detection   501.478 microseconds    19.952 microseconds    25.1x faster
particle_integrate    234.191 microseconds    34.454 microseconds    6.8x faster
Total                 342.538 milliseconds    18.369 milliseconds    18.6x faster
Results: Frames Per Second
[Chart: CPU vs. GPU frames per second for particle counts of 512, 800, 1152, 1568, 2048, 2592 and 3200, with one series each for CPU, [GPU] V1, [GPU] V2 and [GPU] V3; the GPU versions sustain higher frame rates as the particle count grows]
VIDEO of final GPU program
Results: Summary
CPU –
– Slowest
– Low FLOPs
– No parallel data processing
GPU V1 –
– Slow
– Too many threads
– Memory access issues
Results: Summary
GPU V2 –
– Faster
– Better balance of threads
– Global memory slows results
GPU V3 –
– Fastest
– Same thread balance
– Shared memory improves results
Conclusions
For parallel-data, compute-intensive applications, the GPU out-performs the CPU
The highly parallel nature of SPH fluid simulation is a good fit for the GPU
The optimal code for this simulation – 1D grid using shared memory
The benefits of shared memory must be balanced against internal mem-copy overheads.
Optimised code is complex and can introduce errors – the original code may become unrecognisable.
Future Work
Direct Rendering from GPU
– OpenGL interfaces
– Direct3D interfaces
Spatial Subdivision
– Uniform Grid (finite)
– Hashed Grid (infinite)
[Diagram: a 4x4 uniform grid with cells numbered 0–15 and six particles (0–5) placed in cells]
Questions?
Acknowledgements
Müller M., Charypar D., Gross M. (2003). Particle-Based Fluid Simulation for Interactive Applications. Eurographics Symposium on Computer Animation 2003.
SPH Survival Kit [n.d.]. Retrieved December 2008, from http://www.cs.umu.se/kurser/TDBD24/VT06/lectures/
Teschner M., Heidelberger B., Müller M., Pomeranets D., Gross M. Optimized Spatial Hashing for Collision Detection of Deformable Objects. Retrieved February 2009, from http://www.beosil.com/download/CollisionDetectionHashing_VMV03.pdf
NVIDIA CUDA Programming Guide 2.1 (NVIDIA_CUDA_Programming_Guide_2.1.pdf). NVIDIA. Retrieved February 2009, from http://sites.google.com/site/cudaiap2009/materials1/extras/online-resources
Appendix
SPH Equations
Density:
\rho_S(r) = \sum_j m_j W(r - r_j, h)
where:
m_j = mass of particle j
r - r_j = distance between particles
h = smoothing length
Smoothing kernel (poly6):
W_{poly6}(r, h) = \frac{315}{64 \pi h^9} (h^2 - r^2)^3, for 0 <= r <= h
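A small CUDA sketch of this kernel as a device function, assuming single-precision floats; it is an illustration rather than the project's implementation:

// Poly6 smoothing kernel from Müller et al. (2003), used for density.
__device__ float W_poly6(float r, float h)
{
    if (r < 0.0f || r > h) return 0.0f;
    float h2_r2 = h*h - r*r;
    return (315.0f / (64.0f * 3.14159265f * powf(h, 9))) * h2_r2 * h2_r2 * h2_r2;
}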
SPH Equations
Pressure:
f_i^{pressure} = -\sum_j m_j \frac{p_i + p_j}{2 \rho_j} \nabla W(r_i - r_j, h)
where:
m_j = mass of particle j
p_i, p_j = pressure at particles i and j
\rho_j = density of particle j
r_i - r_j = distance between particles
h = smoothing length
Smoothing kernel (spiky), gradient term:
\frac{45}{\pi h^6} (h - r)^2
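A sketch of the corresponding device function for the spiky kernel's gradient term, returning only the scalar factor (in use it would be applied along the normalised direction between the two particles); an illustration, not the project's code:

// Spiky kernel gradient term (Müller et al. 2003), used for the pressure force:
// 45/(pi*h^6) * (h - r)^2 for 0 <= r <= h.
__device__ float gradW_spiky(float r, float h)
{
    if (r < 0.0f || r > h) return 0.0f;
    float h_r = h - r;
    return (45.0f / (3.14159265f * powf(h, 6))) * h_r * h_r;
}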
SPH Equations
Viscosity:
f_i^{viscosity} = \mu \sum_j m_j \frac{v_j - v_i}{\rho_j} \nabla^2 W(r_i - r_j, h)
– Particle i checks neighbours in terms of its own moving frame of reference
– i is accelerated in the direction of the relative speed of the environment
where:
m_j = mass of particle j
v_j = velocity of particle j
v_i = velocity of particle i
\rho_j = density of particle j
r_i - r_j = distance between particles
h = smoothing length
Smoothing kernel (viscosity), Laplacian:
\nabla^2 W_{viscosity}(r, h) = \frac{45}{\pi h^6} (h - r)
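And a matching sketch for the Laplacian of the viscosity kernel, again as an illustrative device function:

// Laplacian of the viscosity kernel (Müller et al. 2003): 45/(pi*h^6) * (h - r).
__device__ float lapW_viscosity(float r, float h)
{
    if (r < 0.0f || r > h) return 0.0f;
    return (45.0f / (3.14159265f * powf(h, 6))) * (h - r);
}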
Implementation: Development Environment
Software
– MS Windows XP (SP3)
– MS Visual Studio 2005 Express (SP1)
– Irrlicht 1.4.2 (Graphics Engine)
– Nvidia CUDA 2.0
  CUDA (Compute Unified Device Architecture): a scalable parallel programming model and software environment for parallel computing, with minimal extensions to the familiar C/C++ environment
– Nvidia CUDA Visual Profiler 1.1.6
Implementation: Development Environment
Hardware
– CPU: Intel Core 2 Duo E8500 (3.16 GHz)
– Mainboard: Intel DP35DP (P35 chipset)
– Memory: 3GB DDR2 800MHz
– Graphics Card: Nvidia GTX9800
  GPU frequency: 675 MHz
  Shader clock frequency: 1688 MHz
  Memory clock frequency: 1100 MHz
  Memory bus width: 256 bits
  Memory type: GDDR3
  Memory quantity: 512 MB
Implementation: Host Operations - code

// create data structure on host
float *posData;
posData = new float[NPARTICLES*3];

// allocate device memory (particle positions)
float* idataPos;
cudaMalloc( (void**) &idataPos, sizeof(float)*NPARTICLES*3);

// copy data from host to device
cudaMemcpy(idataPos, posData, sizeof(float)*NPARTICLES*3, cudaMemcpyHostToDevice);

// execute the kernel
increment_pos<<< dimGrid, dimBlock >>>(idataPos);

// copy data from device back to host
cudaMemcpy(posData, idataPos, sizeof(float)*NPARTICLES*3, cudaMemcpyDeviceToHost);

// free device memory
cudaFree(idataPos);
Implementation: CPU - Nested Loop

C Function
void compare_particles(int n)
{
    int i, j;
    for (i = 0; i < n; i++){
        for (j = 0; j < n; j++){
            if (i == j) continue;
            statements;
        }
    }
}

void main()
{
    int nparticles = 2048;
    compare_particles(nparticles);
}
Implementation: GPU V1 - 2D Grid, Global Memory Access

CUDA kernel
__global__ void compare_particles(float *pos)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i != j){
        statements;
    }
}

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid2D(nparticles/blocksize, nparticles);
    compare_particles<<<Grid2D, dimBlock>>>(idataPos);
}
Implementation: GPU V2 - 1D Grid, Global Memory Access

CUDA kernel
__global__ void compare_particles(float *pos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j;
    for (j = 0; j < n; j++){
        if (i != j){
            statements;
        }
    }
}

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid1D(nparticles/blocksize);
    compare_particles<<<Grid1D, dimBlock>>>(idataPos, N);
}
Implementation: GPU V3 - 1D Grid, Shared Memory Access

CUDA kernel
__global__ void compare_particles(float *pos, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ float posblock[32*3];
    __shared__ float accelblock[32*3];
    __shared__ float velblock[32*3];
    __shared__ float densblock[32];
    __shared__ float pressblock[32];
    __shared__ float massblock[32];
    // Copy global to shared statements here
    int j;
    for (j = 0; j < n; j++){
        if (i != j){
            statements;
        }
    }
}
Implementation: GPU V3 - 1D Grid, Shared Memory Access

void main()
{
    int nparticles = 2048;
    int blocksize = 32;
    int dimBlock(blocksize);
    dim3 Grid1D(nparticles/blocksize);
    compare_particles<<<Grid1D, dimBlock>>>(idataPos, N);
}
Results: Kernel Timings (2048 particles)
particle_integrate (microseconds):
CPU 234.191, [GPU] V1 39.549, [GPU] V2 39.165, [GPU] V3 34.454
Results: Kernel Timings (2048 particles)
clear_step (microseconds):
CPU 49.751, [GPU] V1 6.806, [GPU] V2 6.765, [GPU] V3 6.790
Results: Kernel Timings (2048 particles)
collision_detection (microseconds):
CPU 501.478, [GPU] V1 21.302, [GPU] V2 19.882, [GPU] V3 19.952
Further Work: Uniform Grid
Particle interaction requires finding neighbouring particles – O(n²) comparisons
Solution: use a spatial subdivision structure
A uniform grid is the simplest possible subdivision:
– Divide the world into a cubical grid (cell size = particle size)
– Put particles in cells
– Only compare each particle with the particles in the same cell and in neighbouring cells (see the sketch below)
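A minimal sketch of mapping a particle position to a uniform-grid cell; the grid origin, cell size and grid dimensions are assumed for illustration and are not taken from the project:

// Hypothetical: compute the 1D cell index for a particle on a uniform grid
// with cubical cells of size cellSize and gridDim cells per axis.
__device__ int calcCellIndex(float3 pos, float3 gridOrigin,
                             float cellSize, int3 gridDim)
{
    int cx = (int)((pos.x - gridOrigin.x) / cellSize);
    int cy = (int)((pos.y - gridOrigin.y) / cellSize);
    int cz = (int)((pos.z - gridOrigin.z) / cellSize);
    // clamp to the finite grid
    cx = min(max(cx, 0), gridDim.x - 1);
    cy = min(max(cy, 0), gridDim.y - 1);
    cz = min(max(cz, 0), gridDim.z - 1);
    return (cz * gridDim.y + cy) * gridDim.x + cx;
}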
Further Work: Grid using sorting
[Diagram: a 4x4 grid with cells numbered 0–15; particles 0–5 are placed in cells 4, 6 and 9]
Unsorted list (Cell id, Particle id):
0: (4,3)  1: (6,2)  2: (9,0)  3: (4,5)  4: (6,4)  5: (6,1)
Sorted by Cell id:
0: (4,3)  1: (4,5)  2: (6,1)  3: (6,2)  4: (6,4)  5: (9,0)
array (cell, index): (0,-) (1,-) (2,-) (3,-) (4,0) (5,-) (6,2) (7,-) (8,-) (9,5) (10,-) … (15,-)
Density Array: indices 0 1 2 3 4 5 … n-1 hold the values for particles 3 5 1 2 4 0 …, i.e. reordered to match the sorted list.
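A CPU-side sketch (assumed, not from the project) of the sorting approach illustrated above: build (cell id, particle id) pairs, sort them by cell id, then record where each cell's run of particles starts:

#include <algorithm>
#include <vector>

struct CellEntry { int cell; int particle; };

static bool byCell(const CellEntry &a, const CellEntry &b) { return a.cell < b.cell; }

// Build, sort and index the (cell, particle) pairs shown above.
void build_sorted_grid(const int *cellOfParticle, int nParticles, int nCells,
                       std::vector<CellEntry> &entries, std::vector<int> &cellStart)
{
    entries.resize(nParticles);
    for (int p = 0; p < nParticles; p++) {
        entries[p].cell = cellOfParticle[p];
        entries[p].particle = p;
    }

    std::sort(entries.begin(), entries.end(), byCell);

    cellStart.assign(nCells, -1);              // -1 marks an empty cell
    for (int i = 0; i < nParticles; i++)
        if (cellStart[entries[i].cell] == -1)
            cellStart[entries[i].cell] = i;    // first sorted entry for this cell
}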
Further Work: Spatial Hashing (Infinite Grid)
We may not want particles to be constrained to a finite grid
Solution: use a fixed number of grid buckets, and store particles in buckets based on a hash function of the grid position
Pro: allows the grid to be effectively infinite
Con: hash collisions (multiple positions hashing to the same bucket) cause inefficiency
The choice of hash function can have a big impact
Further Work: Hash Function

__device__ uint calcGridHash(int3 gridPos)  // integer grid-cell coordinates
{
    const uint p1 = 73856093;  // some large primes
    const uint p2 = 19349663;
    const uint p3 = 83492791;
    uint n = (p1 * gridPos.x) ^ (p2 * gridPos.y) ^ (p3 * gridPos.z);
    n %= numBuckets;           // numBuckets = fixed number of grid buckets
    return n;
}
Further Work: Direct Rendering
Sending data back to the host for rendering by the Irrlicht graphics engine is costly in time.
Solution: make further use of the GPU's rendering capabilities –
– OpenGL interoperability
– Direct3D interoperability
– Texture memory
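A hedged sketch of the CUDA 2.x-era OpenGL interoperability path this refers to: register an OpenGL vertex buffer with CUDA, map it to obtain a device pointer, write particle positions into it from a kernel, then unmap and draw. The buffer id, kernel signature and helper names are assumptions, not the project's code:

#include <cuda_gl_interop.h>

__global__ void particle_integrate(float *pos, int n); // signature assumed for illustration

void register_particle_vbo(GLuint vbo)
{
    cudaGLRegisterBufferObject(vbo);             // once, at start-up
}

void render_from_gpu(GLuint vbo, int nparticles)
{
    float *d_pos = 0;
    cudaGLMapBufferObject((void**)&d_pos, vbo);  // device pointer into the VBO

    dim3 block(32);
    dim3 grid(nparticles / 32);
    particle_integrate<<<grid, block>>>(d_pos, nparticles); // write positions in place

    cudaGLUnmapBufferObject(vbo);                // hand the buffer back to OpenGL
    // the render path then draws the VBO, e.g. glDrawArrays(GL_POINTS, 0, nparticles)
}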