EPCC News 69: GPUs now with extra va-va-voom



GPU special issue

Transcript of EPCC News 69: GPUs now with extra va-va-voom

Page 1: EPCC News 69: GPUs now with extra va-va-voom

Issue 69, Spring 2011

2 EPCC at ISC’11

3 The rise of the GPU

5 New HECToR GPU service

6 Researching physics using many GPUs in parallel

7 Porting a particle transport code to GPGPU

8 Acoustic modelling using GPUs

9 DiGS: Distributed grid storage

10 MSc in High Performance Computing

12 Visit: Centre for High Performance Computing, Cape Town

In this issue...

The newsletter of EPCC, the supercomputing centre at the University of Edinburgh

GPUs: now with extra va-va-voom

Image courtesy of NVIDIA.

Page 2: EPCC News 69: GPUs now with extra va-va-voom

The Graphics Processing Unit, or GPU, has been an integral part of most home computer systems and games consoles for several years. The thirst for ever more realistic games has driven its development from a simple 2D accelerator for graphics-based applications to an extremely powerful unit aimed at 3D games.

The raw computational power of the modern GPU has, in recent years, led to an explosion of interest in its use for numerically intensive computing beyond the graphics domain. This interest is demonstrated by the release of dedicated General Purpose GPUs, or GPGPUs, by manufacturers such as NVIDIA and AMD. The adoption of GPGPU computing by the HPC community is clearly shown by the fact that three of the top four machines in the latest Top500 list employ GPGPUs.

As always, EPCC are at the forefront of the take-up of new technology, and this issue of EPCC News highlights some of our activities in the GPGPU area. I give a brief introduction to the GPU and its evolution into the GPGPU on the page opposite. We then feature some EPCC projects utilising GPGPUs: the Ludwig fluid dynamics package on page 6 and a particle transport code on page 7. More unusually, we also include an article on the use of GPGPUs in audio processing.

In an effort to make GPGPU technology available to more users we have an exciting announcement on page 5 about the HECToR service, hosted by EPCC.

Of course, EPCC’s involvement in its wide range of other activities continues apace. We have a review of a distributed grid storage system, DiGS, on page 9, and news from our MSc on page 10. Last, but by no means least, Gavin Pringle reports on his trip to the Centre for High Performance Computing in Cape Town, South Africa, on page 12.


Editorial: Jeremy Nowell

ISC 2011: EPCC and PlanetHPC hope to see you there. 19-23 June, Hamburg, Germany

The International Supercomputing Conference (ISC) is a key global conference and exhibition for high performance computing, networking and storage.

Now in its 26th year, ISC’11 will once again bring together over 2,000 like-minded HPC researchers, technology leaders, scientists and IT decision-makers. A world-class exhibition of supercomputing, storage and networking vendors also awaits visitors.

EPCC and PlanetHPC will be exhibiting in booth 152. We look forward to meeting you there!

www.isc11.org

Page 3: EPCC News 69: GPUs now with extra va-va-voom


The rise of the GPGPU: from pixels to petaflops. Jeremy Nowell

A brief history of the GPU

A Graphics Processing Unit, or GPU, is simply an accelerator, sometimes called a co-processor, designed to carry out specific graphics tasks faster than the main CPU in the system. It contains one or more microchips designed with a limited number of algorithms in mind. Graphics operations may, crudely, be split into two types – vector-based operations and raster operations. Vector-based operations are the manipulation of so-called graphics primitives – that is, objects such as lines, circles and arcs. A raster, or bitmap, is a structure representing individual image pixels such as those displayed on screen. Raster operations manipulate such bitmaps in various ways, such as scrolling a background image between display frames or overlaying a moving sprite on a background.

GPUs were developed during the 1970s and 80s; however, it was the Commodore Amiga in the mid 1980s which became the first mass-market personal computer to include a dedicated chipset capable of taking care of all the graphics functions. The chipset included the famous blitter chip, named after the acronym for Block Image Transfer. The blitter was responsible for manipulating large amounts of data corresponding to bitmap images. As well as being a popular games machine, the Amiga's advanced graphics capabilities led to its use in video processing, production and scene rendering. However, in many ways the Amiga was several years ahead of its time and it wasn't until the 1990s that the development of GPUs began in earnest. It was in this period that more advanced GPUs for IBM-compatible PCs started to be developed. The first of these were simple 2D accelerators, aimed at speeding up the performance of the user interface of the Windows operating system. At about the same time 3D computer games were starting to become popular, leading to the development of GPUs specifically aimed at 3D graphics processing. These accelerators became available in games consoles such as the Sony PlayStation and Nintendo 64, and on the PC thanks to graphics cards such as the 3dfx Voodoo.

The race was on. Fuelled by the thirst for ever more realistic computer games, the development of 3D GPUs has continued apace ever since. Today the market is largely dominated by two companies, AMD [1] and NVIDIA [2]. Modern graphics cards are responsible for the many different operations involved in producing a graphics scene, which, when taken together, are commonly referred to as the rendering pipeline. The input to the pipeline is in the form of information about primitives, typically triangles. The rendering process transforms and shades the primitives and maps them onto the screen for display.

The typical pipeline steps – shown overleaf – are:

• Vertex generation. The pipeline is initiated with a list of vertex descriptors, containing the scene position of the vertex of each primitive, the colour of the surface and the orientation of the vector normal to the surface.

• Vertex processing. Each vertex is transformed into screen space and shaded, taking into account the lighting of the scene. Nowadays this stage is generally application-programmable, giving the developer more control over the final output.

• Primitive generation. The vertices are assembled into triangles.

• Primitive processing. Again application programmable. Each input primitive is processed independently and may produce zero or more output primitives.

• Fragment generation or rasterization. Determination of which screen-space pixels are covered by each triangle. A fragment is generated for every pixel covered by every triangle. Thus where triangles overlap there will be several fragments per pixel.

• Fragment processing. The interaction of light with the fragment surfaces is simulated to determine the surface colour and opacity. In practice each fragment is shaded using either colour information from the vertices or more usually by applying textures. This stage is application programmable to allow a wide variation in textures.

• Pixel operations or composition. Fragments are assembled into a final image, taking into account the position of each fragment relative to the viewpoint and its opacity.

The rendering pipeline naturally lends itself to a form of processing called stream processing. A stream of data is passed through a series of computational kernels. The operations within each kernel are performed locally and independently on each element within the data stream. GPUs naturally developed in such a way as to exploit the inherent task-parallelism of streaming, with different processor resources being devoted to different stages of the pipeline. Since each stage of the pipeline typically performed a fixed function, processors could be developed to be as efficient as possible.

The NVIDIA GF100 architecture.

Page 4: EPCC News 69: GPUs now with extra va-va-voom

[Figure: The rendering pipeline. Vertices and primitives flow through the stages Vertex generation, Vertex processing, Primitive generation, Primitive processing, Fragment generation, Fragment processing and Pixel operations.]

Several stages of the rendering pipeline also lend themselves naturally to another form of parallelism: data-parallelism. For example, each vertex or image pixel can be processed independently of the others, but using the same algorithms, in other words using the common Single Instruction Multiple Data (SIMD) approach. The quest for realism in games has led to GPUs being developed to offer more freedom to the application programmer, such that three of the stages of the pipeline are now programmable, typically using application programming interfaces (APIs) such as OpenGL [3] or DirectX [4]. The data parallel approach has consequently evolved from SIMD to a more complicated Single Program Multiple Data (SPMD) model as hardware capabilities have increased. In the SPMD model, different branches may be followed within a rendering stage for different sections of the data.
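In modern GPGPU terms (introduced below), this data-parallel model maps directly onto a compute kernel in which one thread processes one data element. The sketch below is purely illustrative and uses CUDA syntax; the kernel name and the scaling operation are invented for the example.

    #include <cuda_runtime.h>

    // Illustrative data-parallel kernel: every thread applies the same
    // operation to its own element, here scaling one pixel intensity.
    __global__ void scale_pixels(float *pixels, float gain, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per pixel
        if (i < n)                                      // guard against overrun
            pixels[i] *= gain;
    }

    // Host-side launch (sketch): enough 256-thread blocks to cover all n pixels.
    // scale_pixels<<<(n + 255) / 256, 256>>>(d_pixels, 1.5f, n);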

The main drawback of allowing vertex, primitive and fragment processing to be programmable is one of load-balance. If an application programmer has made one of the three programmable stages very complex, it is easy for a bottleneck in the processing to occur. Recent generations of graphics cards have therefore moved towards a unified shader architecture, in which a single type of programmable unit is used for each of the three programmable stages. Many such units are employed and specialised GPU hardware allocates the processing resources as required to share out the work evenly. Modern GPUs typically contain tens or even hundreds of processing units, each unit further containing several Arithmetic Logic Units (ALUs) able to exploit the SIMD characteristic of much of the processing. A block diagram of a modern NVIDIA GPU architecture, illustrating the many individual processing cores, is shown on the previous page.

General-purpose computing on the GPU

The fact that modern GPUs typically contain hundreds of ALUs leads to units capable of over 1 TFlops, several times the performance of a typical CPU. This naturally led to an interest in using them for computationally intensive problems outside the traditional graphics domain. The first attempts to do this were extremely challenging: the application programmer had to map their problem onto the graphics pipeline and then program the GPU through graphics-specific APIs such as OpenGL or DirectX. Recently, new programming environments and APIs such as OpenCL [5], NVIDIA's CUDA [6] and Microsoft's DirectCompute (part of DirectX) have been developed. These, plus the move towards the unified shader architecture, have simplified the development process greatly, leading to an explosion in GPGPU programming. Manufacturers now release products aimed specifically at the HPC market, in particular the AMD FireStream range [7] and the NVIDIA Tesla range [8]. In the latest Top500 list [9], three of the fastest four machines employ GPGPUs, with the top spot taken by the Chinese Tianhe-1A system with a performance of 2.57 PFlops. One reason for their popularity is that, if used efficiently, GPGPUs can deliver performance similar to that of more traditionally specified machines while using much less power, so environmental impact and running costs are lower. This is illustrated by the Green500 list [10], which ranks machines according to their performance per unit of power used.

Challenges of GPGPU computing

Attractive as GPGPUs appear for scientific computing, significant challenges must be addressed to make best use of them and to achieve performance that makes their use more worthwhile than a standard CPU.

The primary issue is that of the application's scope for parallelism: it must demonstrate sufficient data-parallel characteristics such that it can be mapped to the GPU architecture and make full use of the many processing cores available. To hide memory latency it is also usually more efficient to have many more processing threads than available cores. The GPU hardware has highly efficient thread switching mechanisms so a core is not idle while a task is waiting for data to be fetched from memory.

The second challenge is making efficient use of the GPU memory through the application’s memory access patterns, where several problems may be encountered. The first is copying data between the main memory of the machine hosting the GPU and the device itself. This transfer must take place over the PCI Express bus and is quite expensive, so such transfers should be minimised wherever possible. The GPU device’s memory architecture also needs to be taken into account. For instance, on NVIDIA's Fermi architecture the memory is broken down into global, shared, local, constant and texture memory. This may be simplified somewhat into global memory, shared memory that is visible to all threads within a block, and per-thread private local memory. These areas all have different sizes, bandwidths and latencies, which need to be considered within the application.
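As a concrete illustration of minimising PCI Express traffic, the sketch below (hypothetical kernel and array names, written in CUDA under the simple assumption of an iterative element-wise update) copies the data to the device once, runs many kernel iterations on data held resident in device memory, and copies the result back only at the end.

    #include <cuda_runtime.h>

    // Placeholder element-wise kernel standing in for the real computation.
    __global__ void damp(float *a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] *= 0.999f;
    }

    void run(float *host_a, int n, int iterations)
    {
        float *dev_a;
        size_t bytes = n * sizeof(float);

        cudaMalloc(&dev_a, bytes);
        cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);   // one copy in

        for (int it = 0; it < iterations; ++it)     // data stays resident on the GPU
            damp<<<(n + 255) / 256, 256>>>(dev_a, n);

        cudaMemcpy(host_a, dev_a, bytes, cudaMemcpyDeviceToHost);   // one copy out
        cudaFree(dev_a);
    }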

Continues opposite.

Page 5: EPCC News 69: GPUs now with extra va-va-voom

5

Specific to large machines employing multiple GPUs is the problem of communication between the GPU devices. This may lead to a significant bottleneck.

Although the SPMD model is supported by modern GPUs, to make best use of the hardware, code branching should still be minimised as much as possible. In the NVIDIA CUDA model, threads are grouped together into ‘warps’. If a code branch occurs within a warp then all threads must execute both branches. Clearly this is suboptimal for performance.
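The sketch below (hypothetical kernels, not taken from any EPCC code) contrasts a branch that diverges within a warp with one that is uniform across each warp: in the first kernel the warp executes both paths one after the other, while in the second every warp takes a single path.

    // Divergent: even and odd threads sit in the same 32-thread warp, so the
    // warp serialises, executing the 'then' path and then the 'else' path.
    __global__ void divergent(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (i % 2 == 0) x[i] = x[i] * 2.0f;
            else            x[i] = x[i] + 1.0f;
        }
    }

    // Uniform per warp: the condition depends only on the warp index, so all
    // 32 threads of a warp take the same path and no serialisation occurs.
    __global__ void uniform_per_warp(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int warp = i / 32;
        if (i < n) {
            if (warp % 2 == 0) x[i] = x[i] * 2.0f;
            else               x[i] = x[i] + 1.0f;
        }
    }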

Taking these challenges into account, along with a consideration of the general GPU architecture, we can suggest the type of applications likely to make the best use of GPGPUs:

• Ensure that the application has substantial parallelism.

• Ensure that it has high computational requirements, ie a high ratio of arithmetic operations to memory operations (a rough example follows this list).

• Prefer throughput over latency.
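As a rough illustration of that ratio, consider the standard SAXPY operation below (used here purely as an example, not drawn from any code in this issue): each element involves two floating-point operations but twelve bytes of memory traffic, so the kernel is limited by memory bandwidth rather than by the GPU's arithmetic units, whereas kernels that reuse each loaded value many times are a much better fit.

    // SAXPY: y[i] = a * x[i] + y[i]
    // Per element: 2 flops (multiply + add) against 3 memory accesses of
    // 4 bytes each (load x, load y, store y), ie roughly 0.17 flops per byte.
    // Such low arithmetic intensity means performance is bound by memory
    // bandwidth, not by the GPU's floating-point throughput.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }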

If an application displays these characteristics it should make good use of a GPGPU. However, achieving optimal performance still requires a good understanding of the GPGPU architecture. Programming models and APIs such as CUDA and OpenCL are making this process easier; however, there is still some way to go and the GPGPU landscape is evolving rapidly.

But do not despair! With 20 years’ experience of parallel architectures and applications, EPCC is well placed to provide advice and guidance. Articles elsewhere in this issue describe some of our experiences with programming GPGPUs.

Towards the future

As a multi-billion pound industry, the commodity games market will continue to be the main driver of GPU development. However, unlike CPU developments, GPU developments have been much more disruptive to the application programmer, with rapidly changing features and APIs. If the emergent GPGPU compute APIs keep up with the pace of development of the hardware then hopefully such changes will be easier to deal with in future.

Historically, co-processors have eventually been absorbed into the main microprocessor itself or into an integrated chipset. This is already happening to some extent with the AMD Fusion chipset and desktop versions of the Intel Sandy Bridge architecture, although initially with lower performance than standalone GPU devices. One hope for this integration is for higher bandwidth communications between CPU and GPU, helping to reduce one of the bottlenecks. Of course CPUs are continually developing, with the trend being firmly set along the multi-core path – will CPUs and GPUs eventually merge technologies?

Will GPGPUs be a long-term feature of high performance computing? The answer is not obvious at the moment; however, their development, along with other trends such as multicore CPUs and other accelerators, does point to one thing – the Exascale machines of the future are likely to have millions of restricted-functionality compute cores, with less memory available to the developer and more complicated memory access patterns. Developers, their applications and, more importantly, the programming tools and models will have to evolve in order to get anything like reasonable performance out of them.

References

[1] www.amd.com
[2] www.nvidia.com
[3] www.opengl.org
[4] www.gamesforwindows.com/en-US/directx/
[5] www.khronos.org/opencl/
[6] www.nvidia.co.uk/object/what_is_cuda_new_uk.html
[7] www.amd.com/us/products/workstation/firestream/Pages/firestream.aspx
[8] www.nvidia.com/object/tesla_computing_solutions.html
[9] www.top500.org
[10] www.green500.org

Coming soon... New HECToR GPU service. Alan Gray

EPCC is soon to host a new GPU testbed resource, available as part of the HECToR service. The primary aim of the system is to allow researchers to gain vital experience with this disruptive architecture and prepare their applications for the future. The hardware, comprising several interconnected nodes, will provide a total of 13 NVIDIA ‘Fermi’ GPUs and a single AMD FireStream GPU. The testbed is funded by EPSRC.

Full details, including information on how to apply for access: www.hector.ac.uk/howcan/admin/apply/HECToRGPU.php

Page 6: EPCC News 69: GPUs now with extra va-va-voom


The Ludwig fluid dynamics package is a versatile parallel application capable of simulating the hydrodynamics of complex fluids through the use of Lattice Boltzmann models. It is used for cutting-edge research into condensed matter physics, including the search for new materials with special properties that could potentially impact everyday life.

The original code, developed at EPCC, is capable of exploiting the largest of supercomputers by taking advantage of many thousands of CPU cores in parallel. We have now performed work to enable Ludwig to efficiently exploit novel massively parallel GPU (graphics processing unit) accelerated architectures, which offer significant power-performance advantages over traditional systems, but – as explained elsewhere in this issue – are notoriously difficult to program when dealing with real, large-scale applications.

The work focused on the NVIDIA GPU architecture. The CUDA programming model was used to adapt a number of key computationally-expensive routines in Ludwig, such that they are offloaded to the powerful GPU. Careful tuning, which dramatically improved performance, included minimising CPU-GPU data transfer by keeping data resident on the GPU wherever possible, the introduction of a new data layout within the application to allow optimal use of GPU memory bandwidth and optimisations in certain key loops to exploit register usage.
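A typical example of such a data-layout change (shown here as a generic sketch, not the actual Ludwig data structures) is moving from an array-of-structures to a structure-of-arrays, so that consecutive threads read consecutive memory addresses and accesses coalesce.

    // Array-of-structures: thread i would read site[i].density, so neighbouring
    // threads touch addresses strided by sizeof(Site) and loads do not coalesce.
    struct Site { float density; float velocity[3]; };

    // Structure-of-arrays: neighbouring threads read neighbouring floats,
    // giving coalesced loads and much better use of memory bandwidth.
    struct Lattice {
        float *density;       // density[nsites], allocated in device memory
        float *velocity_x;    // velocity_x[nsites], and so on
        float *velocity_y;
        float *velocity_z;
    };

    __global__ void scale_density(Lattice lat, float factor, int nsites)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nsites)
            lat.density[i] *= factor;   // coalesced: thread i reads element i
    }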

Furthermore, as the work was targeted at multi-GPU machines, particular attention was given to optimising the communication patterns to minimise GPU-to-GPU data transfer overheads while retaining the code’s excellent parallel performance capability. New routines were developed, including the implementation of specialised data compression techniques, together with the overlapping of CUDA operations and MPI communications.
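The sketch below illustrates the general overlap pattern under simple assumptions (hypothetical kernel and buffer names; it is not the Ludwig halo-exchange code): boundary data is packed and copied back asynchronously in one CUDA stream and exchanged with MPI while the interior update proceeds concurrently in another stream.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void update_interior(float *field, int n)     // placeholder update
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) field[i] *= 0.99f;
    }

    __global__ void pack_halo(const float *field, float *halo, int halo_n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < halo_n) halo[i] = field[i];                   // placeholder packing
    }

    // h_halo should be pinned (cudaMallocHost) for a truly asynchronous copy.
    void step(float *d_field, int n, float *d_halo, float *h_halo, int halo_n,
              int left, int right, MPI_Comm comm)
    {
        cudaStream_t s_halo, s_bulk;
        cudaStreamCreate(&s_halo);
        cudaStreamCreate(&s_bulk);

        // Pack and download the halo in one stream...
        pack_halo<<<(halo_n + 255) / 256, 256, 0, s_halo>>>(d_field, d_halo, halo_n);
        cudaMemcpyAsync(h_halo, d_halo, halo_n * sizeof(float),
                        cudaMemcpyDeviceToHost, s_halo);

        // ...while the interior update runs concurrently in another stream.
        update_interior<<<(n + 255) / 256, 256, 0, s_bulk>>>(d_field, n);

        // Exchange halos over MPI once the download has completed; the received
        // halo would then be uploaded and unpacked in the same overlapped way.
        cudaStreamSynchronize(s_halo);
        MPI_Sendrecv_replace(h_halo, halo_n, MPI_FLOAT, right, 0, left, 0,
                             comm, MPI_STATUS_IGNORE);

        cudaStreamSynchronize(s_bulk);
        cudaStreamDestroy(s_halo);
        cudaStreamDestroy(s_bulk);
    }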

The new code is observed, for a binary fluid benchmark, to retain good scaling behaviour all the way up to 256 NVIDIA Fermi GPUs (the largest resource to which we currently have access). It provides a performance increase of around three to four times compared to the use of ‘traditional’ 12-core AMD Magny-Cours Opteron CPUs (with all 12 cores fully utilised). With this standard of performance we are approaching the level at which the large problem sizes required for leading-edge research may feasibly be tackled.

Future work will include the GPU adaptation of advanced functionality within Ludwig. We will also exploit this work by collaborating closely with Cray Ltd to help the development and maturation of their prototype implementation of the OpenMP accelerator directive model, a higher level GPU programming approach which promises to offer productivity advantages over CUDA.

We are very grateful to Lawrence Livermore National Laboratory and University of Cambridge HPCS for access to resources, and especially thankful to Dr Balint Joo at Jefferson Laboratory for his assistance with running benchmarks.

Researching physics using many GPUs in parallel. Alan Gray

Page 7: EPCC News 69: GPUs now with extra va-va-voom


Porting a particle transport code to GPGPU using hybrid MPI/OpenMP/CUDA programming models. Paul Graham

EPCC has been working with AWE to port one of its benchmark codes (Chimaera) to GPU, under the AWE Innovative Architecture evaluation project.

Chimaera is a particle transport code, using a wavefront algorithm. The original coding uses Fortran 90 and MPI and scales well to thousands of cores for large problem sizes. The challenge is to enable it to handle very much larger problems, which are likely to be beyond the scalability of the original code using conventional HPC systems. Over 90% of the run time is spent in one routine.

To evaluate the potential of innovative HPC architectures for AWE’s applications, an initial evaluation project used a serial driver program to call the computational kernel of the Chimaera wavefront algorithm, in order to simulate the performance within a single MPI task of the full code. This was done using the IBM Cell engine and an NVIDIA Tesla 870 GPU.

Initial performance results from both of these platforms were disappointing, but the cause was immediately identified as a lack of parallelism in the way the wavefront algorithm was coded. This was rectified by the re-introduction (from original coding a decade previously for the Cray Y-MP) of a 3D diagonal sliced algorithm.

The wavefront algorithm used in the code involves sweeping through a 3D array of spatial mesh points in eight directions corresponding to the eight vertices of a cube. For each sweep direction, the original coding had an outer loop on one spatial dimension (z) and inner loops on x and y. MPI communication occurred after each x/y ‘tile’, corresponding to a particular value of z, had been computed. The lack of parallelism arose because the algorithm was recursive in each of the x, y, and z directions. This was rectified by changing the order in which the computation was performed, so that instead of processing successive x/y tiles, successive ‘3D diagonal slices’ were processed. The first such slice has just one element (with xyz co-ordinates 1-1-1). The second has three (1-1-2, 1-2-1, and 2-1-1) and so on. For realistic problems, slices quickly become large, typically containing hundreds or thousands of mesh points.

Within each slice, all mesh points may be computed independently. This algorithm was initially developed for vectorisation on the Cray Y-MP but is clearly also ideally suited to the concurrent operation of threads required by a GPGPU.
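A minimal sketch of the idea (illustrative only, not the Chimaera code): for a sweep in the positive x, y and z directions, points on a diagonal slice satisfy x + y + z = s for a fixed slice number s, so each slice can be handed to a CUDA kernel in which every thread updates one mesh point of that slice independently, reading only neighbours that lie on the previously computed slice.

    // One thread per (x, y) pair on slice s; z is fixed by z = s - x - y.
    // Neighbours at (x-1, y, z), (x, y-1, z) and (x, y, z-1) all lie on slice
    // s-1, which was completed by the previous kernel launch, so there are
    // no dependencies between threads within a slice.
    __global__ void sweep_slice(float *mesh, int nx, int ny, int nz, int s)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int z = s - x - y;
        if (x < nx && y < ny && z >= 0 && z < nz) {
            int idx = (z * ny + y) * nx + x;
            float upstream = 0.0f;                 // placeholder physics
            if (x > 0) upstream += mesh[idx - 1];
            if (y > 0) upstream += mesh[idx - nx];
            if (z > 0) upstream += mesh[idx - nx * ny];
            mesh[idx] = 0.25f * (mesh[idx] + upstream);
        }
    }

    // Host side (sketch): launch one kernel per slice, in order of increasing s.
    // for (int s = 0; s <= (nx - 1) + (ny - 1) + (nz - 1); ++s)
    //     sweep_slice<<<grid, block>>>(d_mesh, nx, ny, nz, s);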

This algorithm was almost certainly appropriate for both the Cell and the GPGPU, but work was taken forward only on the GPGPU. With this algorithm in place, speedups compared with host processing in the range of 5x–10x were obtained for single precision arithmetic; work on the double precision version is ongoing.

It is hoped that the use of OpenMP to parallelise across the cores of the SMP node will further improve the overall performance. If we consider a system consisting of multiple N-core SMP nodes (where ‘core’ is taken to mean ‘host core plus GPU’) then, using only MPI and CUDA, the parallelism across the cores of a single SMP node can only come from having N MPI tasks running on the node, each with its own spatial decomposition, with MPI communications being done by memory-to-memory copy.

There are two possibilities for using OpenMP to improve on this scenario, both of which are being considered:

• Alter the decomposition to have a single ‘fat’ spatial decomposition with a single MPI task on the SMP node and use OpenMP either to parallelise on a non-recursive inner loop on ‘particle directions’ or on subsets of mesh points within large slices.

• Use the existing N ‘thin’ spatial decompositions, one per core, but have a single MPI task for the node. Within this single MPI task, assign a thin decomposition to each of N OpenMP threads and use the same structure and logic that was used for MPI. With this technique, it is hoped that increased efficiency would stem from using OpenMP to communicate shared memory pointers to what were originally MPI communications buffers, rather than performing memory-to-memory copies.

With either scheme, the OpenMP threads will be GPGPU-capable.
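A skeleton of the second scheme might look like the sketch below (a hypothetical structure, not the Chimaera code; it assumes at least one GPU per node): one MPI task per SMP node, one OpenMP thread per ‘thin’ decomposition, each thread attached to its own GPU, with MPI calls funnelled through the master thread.

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void sweep_subdomain(float *mesh, int npts)   // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < npts) mesh[i] *= 0.5f;
    }

    int main(int argc, char **argv)
    {
        int provided;
        // FUNNELED: only the master thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int ngpus = 0;
        cudaGetDeviceCount(&ngpus);

        #pragma omp parallel num_threads(ngpus)
        {
            int t = omp_get_thread_num();
            cudaSetDevice(t);                     // one thread drives one GPU

            int npts = 1 << 20;                   // size of this thread's thin decomposition
            float *d_mesh;
            cudaMalloc(&d_mesh, npts * sizeof(float));
            cudaMemset(d_mesh, 0, npts * sizeof(float));

            sweep_subdomain<<<(npts + 255) / 256, 256>>>(d_mesh, npts);
            cudaDeviceSynchronize();

            // Intra-node exchange becomes shared-memory pointer passing between
            // threads; only inter-node halos need MPI, issued by the master thread.

            cudaFree(d_mesh);
        }

        MPI_Finalize();
        return 0;
    }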

The project is due to finish in May 2011.

Page 8: EPCC News 69: GPUs now with extra va-va-voom


GPUs rock! Kostas Kavoussanakis, EPCC

Stefan Bilbao, School of Music, University of Edinburgh

A modular, user-designed percussion synthesis framework.

Digital sound synthesis and audio processing came into being in the late 1950s, exploiting results in speech synthesis. These techniques are still prevalent today, notably in use on mobile phones, and the sound they produce is noticeably artificial. Such systems relate directly to the mathematical formulae involved, and bear no resemblance to notions used by musicians. Sampling – the use of pre-recorded audio fragments – was introduced to make the sound more natural, however the inherent repetitiveness of the individual samples tends to spoil the result.

To address these problems, physical models of musical components (eg strings, bars, plates, membranes and rooms) and their interactions with excitation mechanisms such as reeds or bows can be used. Sound output by physical models not only exhibits a natural character, but can also go beyond existing instruments, allowing musicians to create a signature sound. Additionally, instruments are defined by geometrical and material parameters, and are played by sending in physically meaningful signals such as striking locations and forces, making the use of physical models much more intuitive.

Of particular interest is the ‘finite difference time domain’ methodology. The basic idea is long-established: approximate the system under consideration using a grid, and then use time-stepping methods to advance the solution. The approach has all the benefits expected from a physical model. It is applicable to various sound types and can produce high quality output. The parameters involved correspond to known physical qualities and the system is flexible enough to allow interaction with a user.
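For a vibrating string, the simplest such scheme is the standard explicit update of the 1D wave equation, sketched below in CUDA purely for illustration (this is the generic textbook scheme, not the project's actual code): the next time level at each grid point is computed from the current and previous levels, and one thread handles one grid point per time step.

    // Explicit finite-difference update for the 1D wave equation u_tt = c^2 u_xx
    // using three time levels. lambda = c * dt / dx must satisfy lambda <= 1
    // for stability; lambda2 below is lambda squared.
    __global__ void string_step(const float *u_prev, const float *u_curr,
                                float *u_next, float lambda2, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i > 0 && i < n - 1) {
            u_next[i] = 2.0f * u_curr[i] - u_prev[i]
                      + lambda2 * (u_curr[i + 1] - 2.0f * u_curr[i] + u_curr[i - 1]);
        }
    }

    // Host loop (sketch): rotate the three buffers each step and keep them
    // resident on the GPU, copying samples back only when audio output is needed.
    // for (int step = 0; step < nsteps; ++step) {
    //     string_step<<<blocks, threads>>>(d_prev, d_curr, d_next, lambda * lambda, n);
    //     float *tmp = d_prev; d_prev = d_curr; d_curr = d_next; d_next = tmp;
    // }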

The benefits of the finite difference method come at a high computational cost. While the gigaflop capabilities of mainstream desktop processors suffice for wind instruments, single strings and certainly speech, an HPC system such as HECToR would be necessary to handle small acoustic spaces. But it is the systems between these extremes – multicore and GPU systems – that are of greater interest: they are easily obtainable, capable of handling complex sounds such as plates, drums and pianos, and yet not currently exploited.

Last year EPCC worked with the University’s School of Music to investigate the feasibility of high-quality audio synthesis using physical models on GPUs. It was a six-month project, funded by the School of Informatics iDEA Lab [1]. The starting points for the project were three simple, short, but representative pieces of Matlab code, of increasing complexity:

• Code 1: a simple set of oscillators, each of which produces a basic sinusoidal output.

• Code 2: a basic finite-differences model of a vibrating string.

• Code 3: a basic model of plate reverberation as applied to an input signal.

The code fragments were treated successively, each ported first to C, and then to CUDA, running on the Tesla GPU boards hosted on the NESS machine at EPCC, with various accuracy and benchmarking tests carried out at each stage.

The original Matlab codes used double precision arithmetic. The project explored the use of single precision arithmetic in C and CUDA. We concluded that single precision expressions of these codes require advanced numerical analysis techniques, as they involve calculations exceeding the limits of single precision.

We benchmarked Matlab under representative conditions, using a 4-core desktop and double precision arithmetic (which performs better than single precision in Matlab and benefits from the use of built-in libraries). Because the known deficit of Teslas on double precision arithmetic has been overcome in current GPU implementations, we benchmarked CUDA (and C) in single precision, to show what can be achieved in terms of performance. In practice, double precision CUDA exhibited a very low performance penalty compared to single precision.

With respect to performance, the three Matlab codes exhibited ascending computational complexity. Code 1 is not computationally intensive unless the number of oscillators is increased to 1,000,000. Code 2 is also not computationally intensive, with C and Matlab approaching real-time execution. Code 3 is much more computationally intensive, with Matlab runtimes around 120 to 370 times longer than real time.

C, in general, performs better than Matlab, especially as the complexity increases. This is even more the case for CUDA, which, unlike Matlab and C, runs faster than real time on Code 2 for the benchmarked problems, and performs exceptionally better than Matlab and C on Code 3: it is between three and ten times slower than real time, but still between 12 and 37 times faster than Matlab. The conclusion is that CUDA and GPUs can be employed for sound synthesis problems of a complexity not realistically tackled by Matlab or even C.

In terms of future work, all three CUDA implementations could be improved. Codes 1 and 2 could be used to experiment with new techniques such as using texture memory and streaming. Code 3 probably has the most interest and could be accelerated by using more than one Tesla card. But the real test comes from simulating acoustic spaces, something that we hope to tackle in a follow-on project.

[1] http://idea.ed.ac.uk/

Page 9: EPCC News 69: GPUs now with extra va-va-voom


In the beginning there was the Internet, then came the World Wide Web, followed by the Grid. In the age of persistent, pervasive networks the methods used by scientists to accumulate, archive and access data are fundamentally different to a decade ago. At the start of the century, I was a post-doc in the Theoretical Physics group at Edinburgh, part of the UKQCD collaboration. The UKQCD collaboration uses supercomputers to gain understanding of the properties of fundamental particles – quarks and gluons – by numerical computation of the theory of the Strong nuclear force. This is known as Lattice Quantum Chromodynamics (LQCD) [1].

These computations result in the production of lots of data. I was struggling to manage how this data was stored and accessed because the volume, and in particular, the complexity of the data, was increasing. In 2002 UKQCD and EPCC formed a project called QCDgrid to solve UKQCD data management and archiving needs.

The project, now called the UKQCD Grid Service, has run in various forms for nearly nine years. The resulting software infrastructure – Distributed Grid Storage (DiGS) [2,3] – is not specific to LQCD data, but can in principle be used for any scientific data: for example it has been used for cell biology data. DiGS relies on three basic ideas:

• separation of data and metadata

• semantic access of the data via the metadata

• data integrity by distributed replication.

Metadata is data about the data, which could include the type, format, size, provenance and scientific context of the data. One important part of the metadata is the unique identifier for the data. This is the Logical File Name (LFN). The LFN specifies to which data file the metadata belongs. The abstraction of separating the metadata from the data is a powerful one and splits the problem into a portion which is domain specific (what metadata is required to adequately describe the data) and a portion which is generic (storing and accessing data). Semantic access via the metadata means a user only interacts with the DiGS service, by specifying which data is to be stored or accessed by considering the metadata. The DiGS service controls the location and movement of data.

Managing distributed resources is harder than a single resource, but DiGS turns this feature into a strength. By having multiple copies of the data at different sites, any failure at a single site, permanent or temporary, is mitigated as the data is still available. The data is robust and secure. The central DiGS service decides where to place and locate replicas and which replica to transfer to the user.

DiGS achieves this by employing two technologies. Firstly, XML and XML Schema are used to mark up the metadata. For LQCD data there is a standard scheme describing the metadata called QCDml [4]. XQuery is used to interrogate the XML data held in an XML database called eXist. These components form part of the Metadata catalogue service. The File catalogue service supports three different methods of managing access to different storage elements: Globus GridFTP, SRM and OMERO. The architecture of DiGS is shown above.

Since 2005 I have worked at EPCC, but I have been involved in the project since the start. DiGS is part of UKQCD's contribution to the International Lattice Data Grid [5], which joins together several national or regional data grids to share LQCD data internationally. I am the convener of the metadata working group of ILDG, and have a ‘senior user’ role for the UKQCD Grid service within EPCC. From the point of view of someone who uses rather than develops the software, I can comment on the effectiveness of DiGS. In fact it is now part of UKQCD's infrastructure, with people using it automatically in their day-to-day work.

The use of DiGS means that once the data is archived, it really is persistent and pervasive, and can be retrieved simply. One particular feature which is mostly unnoticed, but ultimately extremely useful for system managers, is replacing the storage nodes. They can simply be unplugged and a new storage system added. The Replica file service can deal with this change in a managed way if a node is ‘retired’, or it can cope with a catastrophic failure of a node. Having been involved with the ‘end-of-life’ of projects before, I can say that DiGS makes managing this process completely straightforward.

The architecture of DiGS.

DiGS: Distributed grid storage. Chris Maynard

Continues on p11.

Page 10: EPCC News 69: GPUs now with extra va-va-voom

Our MSc in High Performance Computing gives an excellent grounding in HPC technologies and their practical application. It aims to:

• Equip students with an understanding of HPC architectures and technologies.

• Equip students with expertise in advanced tools and techniques for HPC software development.

• Enable students to apply this knowledge in order to exploit modern parallel computing systems in key scientific and commercial application areas.

• Enable students to develop as HPC practitioners, able to apply current and emergent technologies in both industry and research.

• Enable students to develop skills in problem-solving, project management, independent and critical thinking, team work, professionalism and communication.

Staff who teach on the MSc have a wealth of expertise in HPC and research computing and work closely with colleagues from other academic disciplines. The MSc will appeal to students from physical sciences, computer science, engineering and mathematics who have a keen interest in computing and would like to learn about HPC and parallel programming.

Applications are welcome at any time, however we normally allocate our scholarships by the end of April each year.

Careers

Our students acquire skills that are applicable both to academic computational science research and to a wide range of careers in industry. Previous graduates have gone on to PhDs in areas that utilise HPC technologies, including astrophysics, biology, chemistry, geosciences, informatics and materials science. Others have gone directly into employment in a range of commercial areas, including software development, petroleum engineering, finance and HPC support.

What our students say

“Studying HPC has been really fulfilling and the possibility of using real HPC resources is by itself a reason to study at EPCC, but in addition I’ve also found many outstanding teachers during the year.” Pablo Barrio

“Very friendly staff and classmates, together with excellent facilities and a beautiful city created a welcoming and supportive environment. Coupling this with a teaching method that is one of the best I have encountered and interesting course content resulted in me learning a lot of useful information and developing important skills. I am very thankful to EPCC for such a wonderful experience.” Alan Richardson

More information: www.epcc.ed.ac.uk/msc
Contact us: [email protected]
Application details: www.ed.ac.uk/studying/postgraduate/applying


MSc in High Performance Computing

Training the next generation of computational science professionals.

The MSc in High Performance Computing (HPC) is a one-year postgraduate masters course taught by EPCC at the University of Edinburgh.

Page 11: EPCC News 69: GPUs now with extra va-va-voom


Degree programme

The MSc in HPC aims to give students:

• Expertise in advanced tools and techniques for HPC software development and numerical algorithms.

• The ability to apply this knowledge to key areas in physics, chemistry, engineering and environmental modelling.

• Interdisciplinary skills that are integral to computational science.

• Transferable skills in problem-solving, project management, independent & critical thinking, professionalism and communication.

The programme includes two semesters of taught courses followed by a four-month independent research project.

The core courses provide a broad-based coverage of the fundamentals of HPC and parallel computing; the optional courses concentrate on specialist areas relevant to computational science. The teaching and learning approaches have a strong practical focus, and students have access to leading-edge HPC platforms and technologies.

Taught courses

• HPC Architectures

• HPC Ecosystem

• Message-Passing Programming

• Threaded Programming

• Parallel Numerical Algorithms

• Parallel Programming Languages

• Performance Programming

• Advanced Parallel Programming

• Parallel Design Patterns

• Software Development

• Programming Skills

• Project Preparation

Selected optional courses from other MSc programmes are available, for example from the MSc in Computer Science and the MSc in Operational Research.

HECToR, the UK’s national academic computing service. Hosted by EPCC, this Cray XE system is one of the largest, fastest and most powerful supercomputers in Europe.

Most recently EPCC has been collaborating with scientists from Tsukuba University in Japan on metadata capture as part of the UKQCD Grid service.

The DiGS project has been a great success, making managing UKQCD's data much easier. The separation of the metadata from the data means the solution is a general one and can be used by other applications. A metadata scheme for describing the data is all that is required, which at its most basic is necessary to distinguish the data in one file from another. UKQCD continues to use the application, but the project itself has finished. Given how successful DiGS is, we hope to obtain funding to continue the project. Meanwhile, if you have a scientific application and need to manage your data, perhaps DiGS can help you! See the DiGS website for details.

[1] The UKQCD collaboration website: http://ukqcd.epcc.ed.ac.uk
[2] M.G. Beckett et al, Phil. Trans. R. Soc. A 367, N1897 (2009) 2471
[3] The DiGS website: http://www2.epcc.ed.ac.uk/~digs/
[4] C.M. Maynard and D. Pleiter, Nucl. Phys. B Proc. Suppl. 130 (2005) 213
[5] M.G. Beckett et al, Comp. Phys. Comm., DOI 10.1016/j.cpc.2011.01.027, arXiv:0910.1692 [hep-lat]

HPC simulations are important in many technological areas. This calculation, made at the University of Edinburgh, shows the structure of liquid crystals used in LCD displays. Image courtesy Oliver Henrich.

DiGS continued

Page 12: EPCC News 69: GPUs now with extra va-va-voom

EPCC is a European centre of expertise in developing high performance, novel computing solutions; managing advanced systems and providing HPC training. Our clients and partners include local and global industry, government institutions and academia.

EPCC’s combination of advanced computing resources and expertise is unique and unmatched by any European university.
www.epcc.ed.ac.uk
[email protected]

Our man in Africa. Gavin J. Pringle

In December 2010, I had the great honour of visiting South Africa as a guest of the Centre for High Performance Computing (CHPC) in Cape Town. I had been invited to talk at the CHPC’s annual National Conference and to teach MPI at the High Performance Computing School at the University of the Western Cape, just outside Cape Town.

CHPC runs the fastest supercomputer in Africa. Housed at the CSIR in Cape Town, it is a SUN Constellation system with 2000 Intel Nehalem 8-core processing units and 400 TB of storage.

The 2010 HPC School came first, running for eight days in total; I taught MPI for two days and then offered consultancy for a third. The course was for graduates, and there were around 50 attendees from all over Africa. A fifth of the students were from outside South Africa and were funded by either IBM South Africa or the Abdus Salam International Centre for Theoretical Physics (ICTP) in Trieste, Italy.

The CHPC National Conference followed, with around 300 participants from all parts of the globe at this ‘National’ meeting, from academia to industry, science councils to governmental bodies. The theme of the conference was ‘Advancing Research and Development through the National Cyber-infrastructure Initiatives’.

My talk [1] focused on how we at EPCC engage with industry to establish collaborative projects. Traditionally, HPC centres are funded solely through governmental bodies; however EPCC also secures funding from industry. I described the processes that we follow, such as our project management methods and other insider tips, so that other HPC centres may also be as successful as EPCC.

The meeting’s first plenary session also included talks from Danny Powell, of NCSA at University of Illinois at Urbana-Champaign, and Clement Onime, of ICTP, Trieste, Italy. You can download copies of all the plenary talks in PDF format at the Conference’s website [2].

The next day, I presented my dCSE-funded [3] work on porting and profiling OpenFOAM on HECToR [4], at the meeting’s High Performance Computer-Aided Science and Engineering workshop.

On the third day, we were given a tour of the Table Mountain National Park. There we saw ostriches on the beach and chacma baboons eating mussels directly out of the ocean.

All in all, I found the trip to be highly educational, opening my eyes to the high quality of HPC work done throughout South Africa and beyond. Thanks are due to Happy Sithole, Jeff Chen, Catherine Cress and Kevin Colville.

References
[1] blog.wizzy.com/post/Centre-for-High-Performance-Computing-national-meeting
[2] www.chpcconf.co.za
[3] www.hector.ac.uk/cse/distributedcse
[4] www.hector.ac.uk/cse/distributedcse/reports/openfoam