N-Body Simulations
description
Transcript of N-Body Simulations
![Page 1: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/1.jpg)
N-Body SimulationsKenneth Owens
![Page 2: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/2.jpg)
We wish to compute the interaction between particles (bodies) given their mass and positions
Simulation is performed in time steps◦ Forces between all bodies is computed O(n2)◦ Positions for all bodies are updated based on their
current kinematics and the interaction with other bodies O(n)
◦ Time moves forward by one step
The N-Body Problem
![Page 3: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/3.jpg)
The force between a body i and N other bodies is approximated as above by computing the interaction given their mass (m), the distance vector between them (r_ij), and a softening factor (ε).
This is computed for all bodies with all other bodies
Force Computation
![Page 4: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/4.jpg)
Euler Method: For each particle, a discrete timestep (dt) is used to approximate the continuous kinematic equation and update the position and velocity of each particle
Position Updatesdtdva dtadv dvvv ii 1 dtavv iii 1
dtdxv dtvdx dxxx ii 1
dtvxx iii 1
![Page 5: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/5.jpg)
Execute an n-body simulation on a distributed memory architecture with multiple GPUs on each node
Project Objective
![Page 6: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/6.jpg)
Sequential implementation of n-body simulation code◦ Written in C◦ Compiled using gcc-4.4 with –O3
MPI implementation◦ Written in C◦ Compiled using mpicc.mpich with gcc-4.4 using 0-3◦ Executed using mpirun.mpich on 2,5, an 10 nodes
GPU implementation◦ Written in C with CUDA extensions◦ Compiled using nvcc with gcc-4.4 using –O3◦ Executed on Nvidia 580s
MPI-GPU implementation◦ The MPI driver above was combined with the GPU kernel
implementation◦ Compiled but not tested for correctness
Project Accomplishments
![Page 7: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/7.jpg)
The main method of the driver calls nbody nbody calls two externally linked function
◦ compute_forces computes the interactions ◦ update_positions updates the kinematics
Driver Source Codevoid nbody(vector4d_t* positions, vector4d_t* velocities, vector4d_t* current_positions, vector4d_t* current_velocities, vector3d_t* accel, size_t size, value_t dt, value_t damping, value_t softening_squared){ compute_forces(positions,accel, size, positions, size, softening_squared); update_positions(positions, velocities, current_positions, current_velocities, accel, size, dt, damping); }
![Page 8: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/8.jpg)
Computes the pair-wise interaction◦ Hidden second loop in acceleration function
Sequential compute_forces
void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size, vector4d_t* sources, size_t sources_size, value_t softening_squared){ for (size_t i = 0; i < positions_size; i++) { forces[i] = acceleration(positions[i],sources,sources_size, forces[i], softening_squared); } }
![Page 9: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/9.jpg)
Computation for individual interaction written using c
Sequential Interaction Computation
vector3d_t interaction(vector3d_t acceleration, vector4d_t body1, vector4d_t body2, value_t softening_squared) { vector3d_t force; force.x = body1.x - body2.x; force.y = body1.y - body2.y; force.z = body1.z - body2.z; float distSqr = force.x * force.x + force.y * force.y + force.z * force.z; distSqr += softening_squared; float invDist = 1.0f / sqrt(distSqr); float invDistCube = invDist * invDist * invDist; float s = body2.w * invDistCube; acceleration.x += force.x * s; acceleration.y += force.y * s; acceleration.z += force.z * s; return acceleration;}
![Page 10: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/10.jpg)
Updates each position based on the computed forces
Sequential Kinematics Update
void update_positions(vector4d_t* positions, vector4d_t* velocities, vector4d_t* current_positions, vector4d_t* current_velocities, vector3d_t* acceleration, size_t size, value_t dt, value_t damping){ for(size_t i = 0; i < size; i++) { vector4d_t current_position = current_positions[i]; vector3d_t accel = acceleration[i]; vector4d_t current_velocity = current_velocities[i]; update_position(&positions[i], &velocities[i], current_position, current_velocity, accel, dt, damping); }}
![Page 11: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/11.jpg)
Implements the previously shown equations
Sequential Update Function
void update_position(vector4d_t* position, vector4d_t* velocity, vector4d_t current_position, vector4d_t current_velocity, vector3d_t acceleration,value_t dt, value_t damping){ current_velocity.x += acceleration.x * dt; current_velocity.y += acceleration.y * dt; current_velocity.z += acceleration.z * dt; current_velocity.x *= damping; current_velocity.y *= damping; current_velocity.z *= damping; current_position.x += current_velocity.x * dt; current_position.y += current_velocity.y * dt; current_position.z += current_velocity.z * dt; *position = current_position; *velocity = current_velocity;}
![Page 12: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/12.jpg)
Started with the implementation from GPU Gems http://http.developer.nvidia.com/GPUGems3/gpugems3_ch31.html
Modified the code to work with data sizes that are larger than 256 but that are not evenly divisible by 256
Added kinematics update Code no longer works for sizes less than 256
◦ Needed command line specification to control grid and block size anyway
CUDA Implementation
![Page 13: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/13.jpg)
Copies to device memory and execute the compute_force_gpu kernel◦ Note - cudaMemAlloc truncated to fit code
CUDA compute_forces
void compute_forces(vector4d_t* positions, vector3d_t* forces, size_t positions_size, vector4d_t* sources, size_t sources_size, value_t softening_squared){ ….. compute_forces_gpu<<< grid, threads, sharedMemSize >>>(device_positions, device_forces, positions_size, device_sources, sources_size, softening_squared ); cudaThreadSynchronize(); cudaMemcpy(forces, device_forces, positions_size * sizeof(float3), cudaMemcpyDeviceToHost); cudaFree((void**)device_positions); cudaFree((void**)device_sources); cudaFree((void**)device_forces); err = cudaGetLastError(); if( cudaSuccess != err) { fprintf(stderr, "Cuda error: %s: \n", cudaGetErrorString( err) );
![Page 14: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/14.jpg)
Every thread computes the acceleration for its position and moves to the next block◦ For our test sizes this only implemented cleanup
for strides not divisible by 256
CUDA compute_forces_gpu Kernel
__global__ voidcompute_forces_gpu(float4* positions, float3* forces,int size, float4* sources, int sources_size, float softening_squared ){ for(int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x; index < size; index += blockDim.x * gridDim.x) { float4 pos = positions[index]; forces[index] = acceleration(pos, sources, sources_size, forces[index], softening_squared);
![Page 15: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/15.jpg)
Uses float3 and float4 instead of home brewed vector types Shared memory is used 256 positions per block Each thread strides across the grid to update a single particle
CUDA force computation
__device__ float3acceleration(float4 position, float4* positions, int size, float3 acc, float softening_squared){ extern __shared__ float4 sharedPos[]; int p = blockDim.x; int q = blockDim.y; int n = size; int numTiles = n / (p * q); for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++) { sharedPos[threadIdx.x+blockDim.x*threadIdx.y] = positions[WRAP(blockIdx.x + tile,gridDim.x) * p + threadIdx.x]; __syncthreads(); // This is the "tile_calculation" function from the GPUG3 article. acc = gravitation(position, acc,softening_squared); __syncthreads(); } return acc;}
![Page 16: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/16.jpg)
Kernel strides in the same way as the force computation All threads update a single position simulaneously
CUDA update_positions
__global__ voidupdate_positions_gpu(float4* positions, float4* velocities, float4* current_positions, float4* current_velocities, float3* forces, int size, float dt, float damping){ for(int index = __mul24(blockIdx.x,blockDim.x) + threadIdx.x; index < size; index += blockDim.x * gridDim.x) { float4 pos = current_positions[index]; float3 accel = forces[index]; float4 vel = current_velocities[index]; vel.x += accel.x * dt; vel.y += accel.y * dt; vel.z += accel.z * dt; vel.x *= damping; vel.y *= damping; vel.z *= damping; // new position = old position + velocity * deltaTime pos.x += vel.x * dt; pos.y += vel.y * dt; pos.z += vel.z * dt; // store new position and velocity positions[index] = pos; velocities[index] = vel; }}
![Page 17: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/17.jpg)
O(n2)/p pipeline implementation◦ Particles are divided among processes◦ Particle positions are shared in a ring
communication topology◦ Force computation occurs for all particles by
sending the data around the ring◦ After all forces are computed each process
updates the kinematics of its own particles
MPI implementation
![Page 18: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/18.jpg)
Compiles with CPU and GPU implementations
Timings have only been collected for CPU
MPI Driver
for(size_t i = 0; i < time_steps; i++) { memcpy( sendbuf, current_positions, num_particles * sizeof(vector4d_t) ); for (pipe=0; pipe<size; pipe++) { if (pipe != size-1) { MPI_Isend( sendbuf, num_particles, mpi_vector4d_t, right, pipe, commring, &request[0] ); MPI_Irecv( recvbuf, num_particles, mpi_vector4d_t, left, pipe, commring, &request[1] ); } compute_forces(positions,accel, num_particles, positions, num_particles, softening_squared); if (pipe != size-1) MPI_Waitall( 2, request, statuses ); memcpy( sendbuf, recvbuf, num_particles * sizeof(vector4d_t) ); } update_positions(positions, velocities, current_positions, current_velocities, accel, num_particles, dt, damping); }
![Page 19: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/19.jpg)
Taken of float for sequential and gpu Taken on tux for mpi All used 10 iterations for time steps Wallclock time was collected for comparison Memory allocation time was omitted
◦ Except for device memory allocation and device data transfer
Timings where not collected for the code using MPI to distribute data over multiple nodes with multiple GPUs
Timings
![Page 20: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/20.jpg)
100 200 300 400 500 600 700 800 900 1000 1100 12000
50
100
150
200
250
300
350
sequential
sequential
Sequential Timings
![Page 21: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/21.jpg)
Sequential GFlops
1 2 3 4 5 6 7 8 9 10 11 120
0.002
0.004
0.006
0.008
0.01
0.012
Sequential
gflops
![Page 22: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/22.jpg)
GPU Timings
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
0
5
10
15
20
25
30
35
40
GPU
time
![Page 23: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/23.jpg)
GPU GFlops
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
0
50
100
150
200
250
GPU
gflops
![Page 24: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/24.jpg)
MPI Timings
100 200 300 400 500 600 700 800 900 1000 1100 12000
20
40
60
80
100
120
140
160
180
MPI
n=2n=5n=10
![Page 25: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/25.jpg)
MPI GFlops
100 200 300 400 500 600 700 800 900 1000 1100 12000
0.02
0.04
0.06
0.08
0.1
0.12
MPI
n=2n=5n=10
![Page 26: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/26.jpg)
We achieved several orders of magnitude speed-up going to a GPU
We achieved similar results to what was obtained in GPU gems
The sequential implementation was not optimal as it did not use SSE or multiple cores – much lower than the theoretical possible FLOPs for the Xeon CPU
The MPI driver showed that task level parallelism can be exploited using distributed memory computing
Conclusions
![Page 27: N-Body Simulations](https://reader036.fdocuments.us/reader036/viewer/2022062521/56816763550346895ddc3ff4/html5/thumbnails/27.jpg)
Run the MPI GPU version on Draco FMM (Fast Multiple Method) MPI
implementation Multi-device GPU implementation
To Do List