High Performance Computing

Introduction

Nowadays, the petroleum industry has sought to apply geomechanical analysis to problems such as wellbore stability and reservoir simulation. Analytical solutions apply only to cases with simple geometry, homogeneous material, and simple loading and boundary conditions. Most real engineering problems can only be solved through numerical analysis, and for stress analysis the Finite Element Method (FEM) is the most widely employed among the different numerical methods, such as the finite difference method, the boundary element method and the discrete element method. Stress analysis programs based on FEM are, by their nature, time-consuming for petroleum problems, which involve huge dimensions (the petroleum reservoir) and non-linear material properties (plasticity in rock). Wellbore stability when drilling through rock salt requires a creep law to evaluate the stress variation over time, so FEM programs that employ explicit time integration in a 3D model need many time steps and a large stiffness matrix, making the simulation time very long. Likewise, in petroleum reservoir simulation coupled with geomechanical analysis, the model (reservoir, overburden, sideburden and underburden) must be discretized with a large number of elements (on the order of 1,000,000 elements) and the simulation is carried out over time (transient analysis), so a single simulation can take weeks.

An alternative to reduce simulation time is High Performance Computing (HPC) technology. HPC refers to the use of highly optimized algorithms and high-performance hardware for tasks in which large amounts of data need to be processed in a short time. The HPC technologies most frequently employed are computer clusters, multicore central processing units (CPUs), cloud computing, staggered pin grid array (SPGA) and graphics processing unit (GPU) cards.

During the last decade, GPU manufacturers developed and produced increasingly powerful devices, driven mainly by the need to process complex games and visual simulations that are ever more detailed and realistic. This new generation of GPUs now has processing power far superior to that of CPUs. Figure 1 compares the performance, measured as the number of floating point operations per unit time, of NVIDIA GPUs and Intel CPUs. Another important measure of hardware performance is memory bandwidth (the rate at which data can be read from or stored in memory by a processor); it was also used to compare Intel CPUs and NVIDIA GPUs, as can be seen in Figure 2.

Figure 1: Number of floating point operations per unit time: GPU (NVIDIA) versus CPU (Intel) (CUDA C Programming Guide, 2011).

Figure 2: Memory bandwidth: GPU (NVIDIA) versus CPU (Intel) (CUDA C Programming Guide, 2011).
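As a rough illustration of how the quantity plotted in Figure 2 relates to an actual device, the short sketch below queries the GPU installed on the local machine through the CUDA runtime API and estimates its theoretical peak memory bandwidth from the reported memory clock and bus width. This is a minimal, self-contained example added here for clarity, not part of the original comparison.

// Estimate theoretical peak memory bandwidth of device 0 (compile with nvcc).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0

    // memoryClockRate is reported in kHz, memoryBusWidth in bits.
    // The factor 2 accounts for DDR memory (two transfers per clock).
    double peakGBs = 2.0 * prop.memoryClockRate * 1e3
                   * (prop.memoryBusWidth / 8.0) / 1e9;

    printf("Device 0: %s\n", prop.name);
    printf("Theoretical peak memory bandwidth: %.1f GB/s\n", peakGBs);
    return 0;
}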

The fundamental difference between CPUs and GPUs is the number of cores (the part of the processor that reads, writes and executes instructions): while CPUs usually have 2, 4 or 6 cores (Figure 3a), for example the Intel Core i3 to i7, many modern GPUs reach 2688 cores (Figure 3b), for example the NVIDIA GeForce GTX Titan.

Given the processing power of modern GPUs, the manufacturers NVIDIA and AMD started to invest in developer tools that can unlock the full parallelization potential of these devices. NVIDIA released CUDA (Compute Unified Device Architecture), a parallel architecture for general-purpose computing, in November 2006, and AMD released Brook+, a compiler developed by Stanford University, in October 2004. Both architectures are based on SIMD (single instruction, multiple data) and use programming languages similar to C. Both CUDA and Brook+ offer great flexibility to generate highly optimized applications, but the major drawback is that each architecture is compatible only with its own devices. An alternative to CUDA and Brook+ is OpenCL (Open Computing Language), a parallel computing architecture for heterogeneous devices developed by the Khronos Group, starting in June 2008, with the participation of several companies and institutions such as AMD, NVIDIA, Los Alamos National Laboratory, Motorola and Movidius. Thus, code developed in OpenCL can run on any GPU, instead of being specific to NVIDIA or AMD devices. In practice, however, the use of OpenCL is somewhat more complicated, since some functions and extensions are specific to each family of GPU.

Figure 3: (a) A CPU with 4 cores and (b) A GPU with hundreds of cores (CUDA C Programming Guide, 2011).
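A minimal sketch of the SIMD model described above, written in CUDA: a single kernel (one instruction stream) is executed by many threads in parallel, each operating on a different array element (multiple data). This is a generic illustration, not code taken from any of the cited architectures.

// Each GPU thread adds one pair of array entries: same instruction, different data.
#include <cuda_runtime.h>

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];                   // one element per thread
}

// Host-side launch (device allocation and error checking omitted for brevity):
//   int n = 1 << 20;
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);  // dA, dB, dC are device pointers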


Architecture of a GPU

An NVIDIA GPU is composed of a number of Streaming Multiprocessors (SMs), each containing a number of Streaming Processors (SPs). Figure 4 shows the architecture of the NVIDIA GTX 460M GPU used to evaluate the implementations made in this paper. This GPU has four SMs, each containing 48 SPs, for a total of 192 CUDA cores.

Figure 4: Architecture of the NVIDIA GTX 460M GPU.
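The SM count of an installed device can be read at run time with the CUDA runtime API, as in the short sketch below; for the GTX 460M (compute capability 2.1, with 48 CUDA cores per SM) it reports the four SMs mentioned above. This is an illustrative query added here, not part of the evaluation code.

// Print the number of Streaming Multiprocessors of device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // For compute capability 2.1 devices (e.g. GTX 460M), each SM has 48 CUDA cores.
    printf("%s: %d SMs, compute capability %d.%d\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor);
    return 0;
}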

Results

To evaluate the GPU algorithms for assembling the stiffness matrix and solving the linear equation system, the results of our finite element code, CHRONOS, implemented for four GPUs, were compared with a well-known finite element software used by industry. Figure 5 shows the processing times of CHRONOS and the well-known software for different numbers of elements. Table 1 compares CHRONOS and the well-known software in terms of processing time and performance, and Table 2 lists the hardware configuration used in this work.

Figure 5: Processing time: CHRONOS (GPUs) versus well-known software (CPU).
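The text does not detail the internals of CHRONOS, but a common building block in GPU finite element solvers is a sparse matrix-vector product over the assembled stiffness matrix, called repeatedly inside an iterative solver such as conjugate gradient. The sketch below shows such a kernel for a matrix stored in CSR format; it is an assumed, illustrative implementation only, not the code evaluated here.

// Sparse matrix-vector product y = K * x for a CSR-stored stiffness matrix K.
// Assumed illustrative kernel: one thread per matrix row.
__global__ void spmvCsr(int nRows,
                        const int* rowPtr, const int* colIdx,
                        const double* val, const double* x, double* y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
    if (row < nRows) {
        double sum = 0.0;
        for (int j = rowPtr[row]; j < rowPtr[row + 1]; ++j)
            sum += val[j] * x[colIdx[j]];              // accumulate K(row,:) * x
        y[row] = sum;
    }
}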

Table 1: Processing time and performance: GPUs (CHRONOS) versus CPU (well-known software).

Number of elements | Number of nodes | Number of DOF | Simulation time, well-known software (s) | Processing time, CHRONOS (s) | Performance (speedup)
108315  | 115920  | 347760  | 51    | 3   | 16
364635  | 381248  | 1143744 | 453   | 11  | 40
534105  | 556416  | 1669248 | 856   | 18  | 49
672549  | 696192  | 2088576 | 2201  | 22  | 100
941724  | 972672  | 2918016 | 3901  | 36  | 110
1158729 | 1193472 | 3580416 | 9957* | 52  | 192
2131089 | 2188032 | 6564096 | -     | 139 | -

Table 2: Hardware configuration.

CPU              | Intel Core i7-4770, 8 threads, 3.4 GHz, 32 GB memory
GPUs (4 devices) | NVIDIA GeForce GTX Titan, 2688 cores, 0.84 GHz, 6 GB memory

REFERENCES

NVIDIA, CUDA C Programming Guide, Version 4.0, 2011.