
Evaluation of the Performance of HPC Mini-Apps on the Cloud

Jamie Roberts

Submitted in accordance with the requirements for the degree of BSc Computer Science

2016/2017

School of Computing FACULTY OF ENGINEERING


The candidate confirms that the following have been submitted:

Type of Project: ____Empirical Study_____________

The candidate confirms that the work submitted is their own and the appropriate credit has been given where reference has been made to the work of others.

I understand that failure to attribute material which is obtained from another source may be considered as plagiarism.

(Signature of student) ___________________________

© 2017 The University of Leeds and Jamie Roberts

Item            Format                    Recipient(s) and Date

Deliverable 1   Report                    SSO (10/05/17)

Deliverable 2   Software codes and URL    Supervisor, assessor (10/05/17)


Summary

High performance computing (HPC) and cloud computing are two contrasting paradigms. This research aims to understand the possible benefits and drawbacks of executing a tightly coupled HPC application on a loosely coupled cloud architecture.

This work uses a relatively new form of application to do this, referred to as a mini-app. Mini-apps are designed as tools that are reflective of a real scientific application. As such, they are a reliable source when ascertaining an architecture's ability to execute their parent application effectively. Their use is deemed important to the further development of HPC architectures and parallel programming techniques.

Studies have shown that virtual machine placement, the choice of virtualisation technology, and application parameters can be used to improve the performance of HPC applications on cloud computing architectures. For this reason, these techniques are explored to uncover further aspects of cloud architecture and HPC applications that can be used to better understand these subjects and ultimately improve application performance when utilising cloud computing resources.

In this research we show how virtual machine (VM) placement affects application performance, and that the performance of programming models can be architecture dependent. This work concludes that HPC mini-apps are a suitable tool for analysing the performance of prototype computing architectures and that cloud computing is a viable candidate platform for the execution of the mini-app presented.


Acknowledgements

I would like to thank a number of people at The University of Leeds. First, my supervisor Dr Karim Djemame for his support in completing this project, and throughout my time here at the University. Dr Jie Xu for his valuable feedback on my progress with this project, and Dr Richard Kavanagh for providing the Zabbix data monitoring daemon. Also John Hodrien, who put up with the testbed throwing “all kinds of errors” so I could complete this work. I would also like to extend my thanks to all the developers that worked on any of the software used within this project.


Glossary of Terms

AMD - Advanced Micro Devices: Manufacturer of microprocessors, motherboards and chipsets.

API - Application Programming Interface: A set of functions that allow programmatic access to a resource.

ARC - Advanced Research Computing: Here this refers to the University of Leeds HPC facilities.

ARC1 - Advanced Research Computer 1: An HPC cluster at the University of Leeds.

ARC2 - Advanced Research Computer 2: An HPC cluster at the University of Leeds.

CLI - Command Line Interface: A means of interacting with a program using commands at a terminal.

CPU - Central Processing Unit: The main processing unit in a computer.

DDR - Double Data Rate: A memory design that allows two data transfers per clock cycle.

FLOPS - Floating Point Operations Per Second: A measure of floating-point arithmetic performance.

FPGA - Field Programmable Gate Array: An integrated circuit that can be programmed to the user's needs.

GHz - Gigahertz: A unit of measurement of clock frequency.

GPU - Graphics Processing Unit: A processing unit designed mainly for processing graphics.

HPC - High Performance Computing: A computing paradigm associated with massively parallel architectures and parallel program design.

HPCC - High Performance Computing Challenge: A benchmark suite for HPC systems.

HTC - High Throughput Computing: A computing paradigm concerned with the number of serial jobs completed.

I/O - Input/Output: Any communication with the outside world during program execution.

IP - Internet Protocol: The network-layer protocol used to address and route data across IP networks.

KVM - Kernel-based Virtual Machine: A kernel-based hardware virtualisation method.

LXC - Linux Containers: An operating-system-level virtualisation method.

MDS - MetaData Server: A server containing only metadata.

MIMD - Multiple Instruction Stream, Multiple Data Stream: A type of hardware architecture.

MISD - Multiple Instruction Stream, Single Data Stream: A type of hardware architecture.

MPI - Message Passing Interface: The standard programming model for distributed memory programming.

MPICH - Message Passing Interface Chameleon: Where Chameleon stands for the portable layer of this MPI implementation.

MVAPICH - An implementation of MPICH that is optimised for InfiniBand networks.

NFS - Network File System: A network shared file system.

NIST - National Institute of Standards and Technology: A body for standards in technology.

NORC - Naval Ordnance Research Calculator: The first computer built with performance in mind.

OSS - Object Storage Server: A Lustre server for storing data.

PBS - Portable Batch System: A topology-sensitive HPC scheduling program.

QDR - Quad Data Rate: A memory design that allows four data transfers per clock cycle.

RAM - Random Access Memory: The memory in a computer system.

RPC - Remote Procedure Call: A distributed programming method that allows the execution of remotely located code.

SGE - Son of Grid Engine: A topology-aware scheduler derived from Sun Grid Engine.

SIMD - Single Instruction Stream, Multiple Data Stream: A type of hardware architecture.

SoC - School of Computing: The School of Computing at the University of Leeds.

SPEC - Standard Performance Evaluation Corporation: A body that produces standardised performance benchmarks.

SSH - Secure Shell: A network protocol that provides secure network services.

SQL - Structured Query Language: A language used to query a database, based on set theory.

TPP - Thread Per Process: One OpenMP thread per MPI process.

UDP - User Datagram Protocol: A connectionless network protocol.

UK-MAC - United Kingdom Mini App Consortium: A consortium of academic institutions who design and create mini-apps.

VIM - Virtual Infrastructure Manager: A program that allows administrators and users to manage virtual and physical resources.

VM - Virtual Machine: An emulated computer system.

XML - Extensible Markup Language: A user-customisable markup language.

Page 8: School of Computing - Blackboard Learn€¦ · AMD Advanced Micro Devices Manufacturer of ... MPI Message Passing Interface The standard programming ... MPI 2 processors profile 60

- ! -viii

Table of Contents

Summary iii ......................................................................................................Acknowledgements iv .....................................................................................Glossary of Terms v ........................................................................................Table of Contents viii .......................................................................................Chapter 1: Introduction 1 ................................................................................

1.1 Aim 1 ...................................................................................................

1.2 Objectives 1 .........................................................................................

1.3 Deliverables 2 ......................................................................................

1.4 Report Structure 2 ...............................................................................

1.5 Methodology 2 .....................................................................................

1.5.1 Direct Experiments 3 ................................................................

1.5.2 Data Collection 3 ......................................................................

1.5.3 Data analysis 3 .........................................................................

1.5.4 Benchmarking 3 ........................................................................

1.6 Project Schedule 4 ..............................................................................

1.7 Summary 5 .........................................................................................

Chapter 2: Background & Literature Review 6 .............................................2.1 HPC 6 ..................................................................................................

2.1.1 Introduction 6 ............................................................................

2.1.2 Infrastructure and architecture 6 ...............................................

2.1.3 Networking 8 .............................................................................

2.1.4 File Systems 8 ..........................................................................

2.1.5 Scheduling 8 .............................................................................

2.1.6 Benchmarking 9 ........................................................................

2.1.7 Challenges 9 .............................................................................

2.2 Cloud Computing 10 ...........................................................................

2.2.1 Introduction 10 ..........................................................................

2.2.2 Infrastructure and architecture 11 .............................................

2.2.3 Virtualisation 12 ........................................................................

2.2.4 Virtual Infrastructure Managers. 13 ..........................................

2.3 Parallel applications 14 ........................................................................

Page 9: School of Computing - Blackboard Learn€¦ · AMD Advanced Micro Devices Manufacturer of ... MPI Message Passing Interface The standard programming ... MPI 2 processors profile 60

- ! -ix

2.3.1 Intro 14 .....................................................................................

2.3.2 Programming Models 14 ..........................................................

2.3.3 Performance analysis of parallel applications 15 .....................

2.3.4 mini-applications 15 ..................................................................

2.3.5 CloverLeaf 16 ...........................................................................

2.3.6 Loop tiling 16 ............................................................................

2.4 Related Work 16 ..................................................................................

2.5 Summary 18 ........................................................................................

Chapter 3: Experimental Design 19 ...............................................................3.1 Identification of parameters 19 ............................................................

3.2 Identification of metrics 20 ...................................................................

3.3 Scale Up 20 .........................................................................................

3.4 Placement 21 .......................................................................................

3.5 Virtualisation Overhead 21 ..................................................................

3.6 Scale Out 21 ........................................................................................

3.7 Containers 22 .....................................................................................

3.8 Loop Tiling - Single Slot 22 ..................................................................

3.9 Loop Tiling - 8 Nodes 23 ......................................................................

3.10 The Problem 23 .................................................................................

3.11 Summary 24 ......................................................................................

Chapter 4: Implementation 25 ........................................................................4.1 Hardware Specification 25 ...................................................................

Table 4.1 - Hardware Specification 25 ...............................................

4.2 Network Topology 26 ...........................................................................

4.3 Benchmarking ARC 26 ........................................................................

4.4 Cluster of Virtual Machines 27 .............................................................

4.4.1 Images 27 .................................................................................

4.4.2 Templates 27 ............................................................................

4.4.3 Networking 28 ...........................................................................

4.5 Containers 28 ......................................................................................

4.5.1 Overview 28 ..............................................................................

4.5.2 Docker Machine 29 ...................................................................

4.5.3 Boot2Docker 29 ........................................................................

4.5.4 Docker 29 ................................................................................

4.5.5 Docker images 29 .....................................................................

Page 10: School of Computing - Blackboard Learn€¦ · AMD Advanced Micro Devices Manufacturer of ... MPI Message Passing Interface The standard programming ... MPI 2 processors profile 60

- ! -x

4.6 Monitoring 30 .......................................................................................

4.6.1 Zabbix 30 ..................................................................................

4.6.2 Telegraf, InfluxDB and Grafana 30 ...........................................

4.7 CloverLeaf 30 ......................................................................................

4.7.1 Adaptations 30 ..........................................................................

4.7.2 Optimisations 30 .......................................................................

4.7.3 Profiling 31 ................................................................................

4.8 Analysis Of Data 31 .............................................................................

4.8.1 Python Program 31 ..................................................................

4.8.2 Java Program 31 ......................................................................

4.9 Summary 31 .......................................................................................

Chapter 5: Technical Evaluation 32 ...............................................................5.1 Scale-Up 32 .........................................................................................

5.1.1 OpenMP 32 ..............................................................................

5.1.2 MPI 33 ......................................................................................

5.1.3 Hybrid 35 ..................................................................................

5.2 Placement 36 .......................................................................................

5.3 Overhead 37 ........................................................................................

5.4 Scale-Out 38 ........................................................................................

5.5 Containers 38 ......................................................................................

5.6 Loop Tiling - 1 Slot 40 ..........................................................................

5.7 Loop Tiling - 8 Nodes 40 ......................................................................

5.8 Summary 41 .......................................................................................

Chapter 6: Project Evaluation 42 ....................................................................6.1 Project Objectives 42 ...........................................................................

6.2 Contributions and recommendations 42 ..............................................

6.3 Evaluation of Methodology 43 .............................................................

6.4 Positioning of project outcomes 44 ......................................................

6.5 Limitations 45 .....................................................................................

6.6 Future Work 45 ....................................................................................

6.6.1 Further Investigation Into Application Behaviour 45 .................

6.6.2 Optimisation 46 .........................................................................

6.6.3 Containers 46 ...........................................................................

6.6.4 Request / Load aware Scheduling 46 ......................................

6.6.5 Tiny core on VM 46 ...................................................................

Page 11: School of Computing - Blackboard Learn€¦ · AMD Advanced Micro Devices Manufacturer of ... MPI Message Passing Interface The standard programming ... MPI 2 processors profile 60

- ! -xi

6.6.6 Benchmark Containers on ARC 47 ...........................................

6.6.7 Increase Range of Applications 47 ...........................................

6.6.9 Application Characterisation 47 ................................................

Chapter 7: Conclusion 49 ...............................................................................List of References 50 .......................................................................................Appendix A

External Materials 58 ...............................................................................Figure a.3 - loop tiling 58 ...................................................................

Figure a.4 - Scale Up Wall Clock 59 ..................................................

Figure a.5 - Scale Up Efficiency 59 ...................................................

Figure a.6 - MPI 16 processors profile 59 .........................................

Figure a.7 - MPI 2 processors profile 60 ...........................................

Figure a.8 - CloverLeaf Profile 61 ......................................................

Appendix BEthical Issues addressed 63...................................................................


Chapter 1: Introduction

The use of cloud computing for HPC applications is a heavily researched topic, offering many socio-economic advantages that are attractive from both a research and a commercial perspective. The cost of dedicated hardware for HPC workloads is high in terms of time, investment and the skills required. Therefore, alternatives to this model are needed that allow potential users the opportunity to execute parallel applications without this inherent cost [42].

To explore this topic further, this empirical investigation uses an HPC mini-app. Mini-apps are a relatively new approach to scientific application development. They are small, self-contained applications that allow researchers to benchmark architectures and programmers to test new theories, in both cases removing the inherent complexity of prototyping with a full-scale HPC application [39]. This project considers CloverLeaf, an application that solves Eulerian-Lagrangian compressible hydrodynamics equations and is reflective of a code used extensively by the Atomic Weapons Establishment [40].

Container technologies are another active topic in computing research, and Docker [29] is a popular implementation of container-based virtualisation. Therefore, Docker will be used as a point of comparison to ascertain its suitability for executing the CloverLeaf mini-app.

This empirical investigation will use two contrasting architectures: the School of Computing (SoC) cloud testbed and the Advanced Research Computing (ARC) facility, both hosted by the University of Leeds. This will allow direct comparison of a cloud testbed and an HPC cluster.

1.1 Aim

The aim of this project is to provide a thorough performance analysis of CloverLeaf on two contrasting architectures. It is hoped this will uncover the possible drawbacks of, or benefits to be gained by, utilising the cloud to host this particular HPC mini-app. Furthermore, it is hoped the experiments conducted may help to bring to light the elements of virtualised infrastructure and the application characteristics that determine an application's potential for scalability on a cloud platform.

1.2 Objectives

1. Demonstrate an understanding of HPC, cloud computing and HPC mini-apps.

2. Design and implement a range of experiments to form a methodology suitable to study the effects of virtualisation, and potentially the use of container technologies on the performance of the CloverLeaf mini-app.


3. Supply a quantitative analysis of the performance of CloverLeaf on the school's testbed using metrics including speedup and efficiency.

4. Provide recommendations for the effective use of CloverLeaf on the cloud. If there is no such advice, then explain why this application is not suitable for parallel execution on a virtualised infrastructure.

5. Describe limitations of this project’s methodology, and realise opportunities for further study presented by this topic.

1.3 Deliverables

1. A report detailing analysis.

2. Any software created or amended for this project.

1.4 Report Structure

Chapter 2 will cover a review of HPC, cloud technologies and HPC mini-apps. This will describe cloud and HPC architectures and discuss relevant programming models. It will also explain the motivation for using HPC mini-apps and their relevance to current computing research. Chapter 2 will also review the literature that is most relevant to this area of research, covering a range of recent work involving HPC applications on cloud infrastructures. Chapter 3 will then describe the experiment design process, starting with defining the parameters for experimentation and then describing the individual experiments. Chapter 4 will focus on the implementation of these experiments: the hardware and topology of ARC and the SoC cloud testbed will be presented, and the techniques used to execute and monitor experiments will be explained. Chapter 5 will cover the technical evaluation, including a quantitative evaluation of results and a discussion of any observations made. Chapter 6 presents a project evaluation; this will include a critical evaluation of the project methodology, position this work with respect to existing work, and discuss opportunities for future work. Finally, Chapter 7 will offer the conclusion to this project.

1.5 Methodology

This study is an empirical investigation that will use direct experiments to generate data. This data will be analysed both visually and programmatically to derive understanding about the underlying architectures and programming models presented. These experiments will use ARC as a benchmark against which to compare the performance of the SoC testbed.


The methodology presented here is strongly motivated by a number of studies [42][43][44] that are discussed further in section 2.4 of this report.

1.5.1 Direct Experiments

The experiments are performed directly on two systems at the University of Leeds: the SoC cloud testbed and ARC2 (one of the University's HPC facilities). This provides a direct comparison between cloud and HPC. These experiments will utilise 32 cores and up to 8 nodes of each system in order to explore the effect of communication between nodes. It is considered that this will allow study at an appropriate scale given the limitations of the shared resources available and the time constraints of this project.

1.5.2 Data Collection

Data is collected from the results of running the CloverLeaf mini-app five times on each configuration to compare the overall performance of each system in terms of wall clock time, efficiency and scalability. The data is averaged over the five runs to allow for the variance between results expected on cloud infrastructure. Averaging over a greater number of timing runs may give more accurate results, but five runs is felt to be more appropriate when prototyping ideas, given the time limitations of this project. In order to monitor the underlying physical hosts on the cloud testbed, the Zabbix [2] monitoring tool is used to collect resource utilisation information from the Central Processing Unit (CPU), Random Access Memory (RAM) and network interfaces. To monitor virtual machines (VMs), a solution by InfluxData [3] is used; this includes a monitoring daemon known as Telegraf [3] and a time-series database called InfluxDB [3]. To visualise and query this data, a web server is hosted that runs the Grafana [4] graphing tool. This provides a browser-based real-time monitoring solution that is consistent with the values displayed by resource monitoring tools such as top [67].

1.5.3 Data analysis

To analyse the data collected from CloverLeaf, a Python application is used to calculate an average speedup and efficiency for each configuration. Data collected from the monitoring and profiling tools will be analysed visually in order to identify any potential performance bottlenecks or configurations that indicate improved performance.
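For illustration only, a minimal sketch of this averaging and speedup/efficiency calculation is given below. It is not the project's analysis program (which is written in Python), and the timing values are placeholders rather than measured results.

#include <stdio.h>

#define RUNS 5

/* Placeholder wall-clock times in seconds; in practice these would come
 * from the five CloverLeaf timing runs for a given configuration. */
static const double serial_runs[RUNS]   = {118.2, 119.0, 118.7, 118.4, 118.9};
static const double parallel_runs[RUNS] = { 31.0,  30.4,  30.9,  30.6,  30.7};

/* Arithmetic mean of n timings. */
static double mean(const double *t, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    return sum / n;
}

int main(void)
{
    const int p = 4;                        /* cores used in the parallel runs   */
    double ts = mean(serial_runs, RUNS);    /* average serial wall-clock time    */
    double tp = mean(parallel_runs, RUNS);  /* average parallel wall-clock time  */
    double speedup = ts / tp;
    double efficiency = speedup / p;

    printf("t_s = %.2f s, t_p = %.2f s, S = %.2f, E = %.2f\n",
           ts, tp, speedup, efficiency);
    return 0;
}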

1.5.4 Benchmarking

The application selected for benchmarking is the CloverLeaf mini-app. Mini-apps offer a more accurate view of a system's performance profile than micro-benchmarks, while removing the complexity of using a full HPC application. Therefore CloverLeaf is a suitable application for this purpose.

To analyse the performance of the cloud it is necessary to compare it to a benchmark time. The benchmark time used for this project is the wall clock time and efficiency provided by


the ARC HPC system hosted by the University of Leeds. This offers a direct comparison of these two architectures.

1.6 Project Schedule

A number of milestones were met throughout this project. They were identified as: literature review, experimental design, implementation of the research environment, data analysis, and presentation of results. This sequence forms the iterative approach that was actually taken towards experimentation.

A Gantt chart was created during the scope and planning phase of this project to map out the order of events. The tasks to be completed were identified and mapped onto a timeline covering the time available for this project. While this was adhered to where possible, the order and amount of time spent on each task was intended to be flexible given the findings of this exploratory project. The Gantt chart presented here shows the order initially estimated during planning.

[Gantt chart: the following tasks were mapped onto a twelve-week timeline]

Literature review; scope and planning; write methodology chapter (Chapter 3); write implementation chapter (Chapter 4); set up environment; write script to automate test runs; gather data (cloud); analyse data (cloud); profile application; analyse profiling; write quantitative evaluation (Chapter 5); act on analysis; test solutions; write critical evaluation (Chapter 6); write conclusion (Chapter 7); review report.


As this project explored topics that had not initially been planned, including containers and loop tiling, the iterative approach of literature review, experimental design, implementation of the research environment, data analysis, and presentation of results is a more accurate description of the timeline of events in this project.

1.7 Summary

Chapter 1 has introduced the problem to the reader, discussed an outline of the project methodology and given a schedule of actions to be taken in this project. The following chapter will discuss areas of cloud computing and HPC relevant to this project and present a case for the use of mini-apps as a tool to quantify the performance of cloud and HPC architectures.



Chapter 2: Background & Literature Review

2.1 HPC

2.1.1 Introduction

The idea of HPC, also termed 'supercomputing', is by no means new. The first supercomputer, IBM's Naval Ordnance Research Calculator (NORC), was presented in 1954 [5]. It was the first machine built with performance in mind. The current scale of HPC systems can be seen in the TOP500 list, where the current number one, the Chinese-built Sunway TaihuLight, is a system utilising 10,649,600 cores [6].

Supercomputing has been considered to be at the peta-scale since the launch of the first peta-scale machine in 2008 [7]. Advances in hardware engineering now provide many-core processors such as Knights Landing [8], whilst future solutions speak of 3D chip stacking [9]. The next aim for the field is to achieve exa-scale computing. To reach this scale of computing would be a considerable achievement, as it is believed it will allow simulation of the human brain [10].

AMD’s vision for exa-scale computing presents a conceptual model of a heterogeneous system architecture. According to their vision, advances in software and hardware engineering will be required to achieve this goal, and core to this success will be advances in memory design and heterogeneous hardware. Their concept aims to hide hardware heterogeneity from the programmer, and suggests pointers should be able to be passed from GPU to CPU as if they were one [11].

The use of HPC brings many benefits: problems can be solved at increased resolutions, providing deeper insight, and application execution time can be reduced considerably, providing a faster route to market for both commercial users and researchers [12]. However, while HPC is essential in many areas, it does not solve all problems; some applications are unsuitable for parallel execution, and homogeneous HPC architectures bring many challenges for application developers, system administrators and HPC users [13].

2.1.2 Infrastructure and architecture

Hardware for executing parallel workloads is classified under Flynn’s taxonomy [14] as Multiple Instruction stream, Multiple Data stream (MIMD), Single Instruction Multiple Data stream (SIMD) or Multiple Instruction Single Data stream (MISD).

SIMD devices include GPUs, which perform the same instruction on arrays of values due to constraints of the hardware. This provides data-level parallelism but not concurrency, offering many advantages in graphics and in areas that require repetitive calculation, such as physics [15].

for (i = 0; i < 100; i++)
    a[i] = 2 * b[i];

Figure 2.1: An example of data-level parallelism. If 100 processors are used this becomes simply a[i] = 2*b[i] on each processor, as there is no data dependency between iterations.

MISD allows multiple instruction streams to operate on a single data stream [14]. MISD is not a commonly used architecture, and SIMD devices are beyond the scope of this project, so neither will be discussed in further detail here. MIMD devices include clusters and supercomputers, but also any multi-core device, which currently includes many desktops, laptops and mobile devices.

This project will focus on MIMD devices, which can be broken down into two separate architectures defined by a processor's ability to access memory. A shared memory architecture requires that every core of every processing unit can "see", or directly access, all of the system memory over a bus, without using network communication. A distributed memory architecture is one in which two or more machines are networked, making it possible for all of the available CPUs to work in parallel on a single task if programmed to do so.

Figure 2.2: Distributed memory architecture. The figure shows a distributed memory architecture with 16 available cores, made up of two distinct machines linked by a network connection. Each of these machines displays a shared memory architecture with 8 available cores.

Currently, the increased use of devices such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs) [16] means that HPC is seeing a shift towards a heterogeneous hardware environment, and it is not uncommon to see both Advanced Micro Devices (AMD) [78] nodes and Intel nodes on a single cluster. More generally, however, HPC takes place on homogeneous infrastructure and uses a closely coupled architecture, such that the system appears to the user as one system: all nodes share a single file system and there are shared libraries installed on the machine. Users are not usually able to install programs on the machines without compiling from source and cannot gain root access. The operating system on each node is the same and, while not always the case, the hardware often consists of the same node replicated a number of times.

2.1.3 Networking

The InfiniBand [77] interconnects on ARC allow connection speeds of up to 40 Gb/s, in comparison to the 1 Gb/s given by Ethernet on the SoC testbed. The difference in GFlops/s experienced when executing Weather Research and Forecasting (WRF) [70], a weather simulation application, on between 4 and 24 nodes is given in figure a.1. Here we see that the difference between interconnects does not appear to affect the application adversely until the number of servers is greater than eight. In cases where an application is communication bound this effect may be noticeable at lower node counts; for applications requiring less input/output (I/O) and communication, Ethernet may tolerate higher node counts. The frequency of communication in the application and the size of the messages communicated, as well as the frequency and size of any I/O operations, will affect these results [17].

2.1.4 File Systems

HPC architecture is often dependent on the use of a parallel file system such as Lustre [18]. Lustre is an open-source, object-based file system, released under the GNU GPL, that is used on a number of the world's fastest computer systems. Lustre divides data into metadata and object data, storing each on dedicated servers. To deliver performance increases, Lustre uses asynchronous I/O, data striping and file replication. The way in which Lustre is configured is system dependent and is not discussed further here [19].

2.1.5 Scheduling

There are two types of HPC facility: capability facilities, where the concern is using the whole machine to solve one problem in as short a time as possible; and capacity facilities, which are concerned with time-sharing the machine across multiple users, each wanting to solve an individual problem. Each user may then be using a share of the machine ranging from one core to the whole machine, depending on the problem they are trying to solve [23]. In either case, utilisation of HPC facilities is usually high.


In the latter case the system requires a scheduling system to allow users to time-share the machine effectively. There are a number of scheduling systems that are topology aware, in that they will attempt to minimise switch hops or scatter problems across nodes depending on the user's preference; examples include Son of Grid Engine (SGE) [20], Slurm [21] and the Portable Batch System (PBS) [22]. The latter offers power awareness, providing a means of reducing the amount of energy consumed by the system. SGE is used on ARC; using the Command Line Interface (CLI) one can submit job scripts, request interactive job sessions, query queues and view the location of running jobs. Scheduling in a real-world environment is NP-complete, therefore algorithms for real-world scheduling problems often rely on heuristics rather than exact methods [23].

2.1.6 Benchmarking

The HPC community is clearly concerned with performance, and in light of this, benchmarking the performance of such machines is an area of discussion in the field. The TOP500 list documents the performance of the world's fastest machines and relies on the Linpack benchmark to do so [24]. Linpack solves a dense system of linear equations of the form Ax = b. Since this does not represent the workload of a typical application, there is often a difference between the peak performance shown by Linpack and the true performance of a system under the load of a real application. There is much criticism of the TOP500 for this reason, and a list of improvements is suggested by Kramer [25] that includes developing a more representative benchmarking suite, creating metrics to define system value in a monetary sense, and aggregating values into one statistic. Linpack is included in the High Performance Computing Challenge (HPCC) [71] benchmarks, along with STREAM [71], a synthetic memory benchmark that is used to measure memory access here.

2.1.7 Challenges

HPC presents many challenges, some of which are discussed in [13]. The authors refer to the HPC paradigm earlier defined as capacity HPC and describe the limitations of the HPC platform model that they feel are hindering users of second-generation HPC applications.

They highlight the current software provisioning approach as one of the problems that users face: shared libraries affect the whole user base if one library is updated, and getting new programs installed is difficult. The suggested workaround of compiling from source in the user's home directory is usually not without error, often causing conflicts of some kind.

Networking is also suggested to be a problem. Modern applications often rely on internet communication during execution or installation, and since HPC facilities use firewalls, and are often behind proxy servers, this is not always possible. Furthermore, it is suggested that even where this is possible, bandwidth often remains a problem.

The static resource allocation model is also criticised. It requires the user to know how long their application will run for, and how much memory and CPU it will require. This leads to a two-fold problem. The scheduler allocates resources depending on the amount requested, where a request for fewer resources is given higher priority; this leads the user to request as few resources as possible. However, if the user does not request enough resources, the application will fail and the resources used will be wasted for both the user and the facility. Capacity HPC is often a multi-tenant environment. Therefore, while resources in terms of CPU and memory are statically allocated, network bandwidth cannot be requested by the user; rather, the user's I/O speed reflects the traffic on the network at that time. This contrast means that, even when sufficient CPU and memory are allocated, slowdown may be experienced through limited network resources.

2.2 Cloud Computing

2.2.1 Introduction

The definition of cloud computing has been a much-discussed topic, and [26] presents a number of definitions of cloud computing. NIST provides a definition in [27], where the essential characteristics of cloud computing are stated as:

1. On-demand self-service: often given the power grid analogy; that is, services should be able to be accessed transparently by the user.

2. Broad network access: services should be accessible from anywhere over the internet.

3. Resource pooling: the cloud provider uses a pool of resources to service a group of distinct users in a multi-tenant model. The user need not be aware of where their resources are physically located.

4. Rapid elasticity: the potential to scale up and scale out when required, with resources released automatically when no longer required.

5. Measured service: resource utilisation should be monitored so both provider and user are aware of utilisation levels. Furthermore, cloud systems should use this monitoring to optimise resource utilisation.

This leads to heterogeneous, loosely coupled, distributed architectures that offer centralised control over often disparate resources, providing a platform for computing as a service with a number of service levels and deployment models, defined in [27] as:


1. SaaS: the user accesses the service through a client interface, either browser based or a desktop application. The user is unaware of, and has no control over, the infrastructure used to operate the service.

2. PaaS: allows the user to deploy their applications on the cloud. The user has no control over infrastructure but may control application-specific configuration settings.

3. IaaS: the consumer does not control or manage the underlying hardware but is able to deploy any software, including operating systems. They may control network configurations and manage storage devices.

Further distinction is given in [27] in terms of deployment models:

1. Private cloud: cloud infrastructure that is for the use of a single organisation. It may, however, be operated and owned by a third party, and be located on or off premises.

2. Community cloud: cloud infrastructure used by a community that shares concerns in some way. It may be owned and managed by a single organisation or by multiple organisations, and again be located on or off premises.

3. Public cloud: cloud infrastructure that can be accessed by anybody. The infrastructure is located on the premises of the cloud provider, which may be a business, government or academic organisation.

4. Hybrid cloud: a combination of two or more cloud infrastructures that are linked through middleware to allow cloud bursting and load balancing between infrastructures.

Many of the advantages offered by cloud computing, such as elasticity, rely on loose coupling and a low utilisation model, meaning that the cloud provider must be prepared to invest in servers that may not always be under full utilisation. Further advantages can be leveraged through hardware heterogeneity. As an example, Amazon allows users to build custom data centres out of the instance types offered, giving the user control over the physical infrastructure on which their virtual instances are deployed, to suit their own needs.

2.2.2 Infrastructure and architecture

The cloud architecture is a layered architecture. The hardware is at the base of the stack, offering hardware resources that may range from HPC infrastructure through to commodity servers. The host OS resides on the server as usual. On top of this the VIM is installed, potentially providing a choice of hypervisors and giving administrators the means to configure hosts for cloud installation and manage cloud infrastructure. Users of the system will use the VIM to deploy and manage virtual machines throughout their lifecycle, create images and templates for VMs, and contextualise their VMs. They may also manage the network configuration of individual VMs, using dynamically or statically allocated IP addresses, and create virtual networks. The VM itself contains the guest OS and any libraries and applications that the user may install. There are many alternative cloud architectures to the one depicted in figure 2.3, an example of which is given in the following section where Docker is introduced as a virtualisation method.

Figure 2.3 - A typical cloud architecture: hardware and a privileged VM on each host, a hypervisor running guest virtual machines (guest OS, libraries and application), with a Virtual Infrastructure Manager coordinating the hosts over SSH and shared storage. This architecture is reflective of the SoC cloud testbed.

2.2.3 Virtualisation

Virtualisation is a key technology in cloud computing. As a result, there is a wide variety of vendors offering virtualisation solutions that are implemented at a number of different levels. Two levels of virtualisation of interest to this work are the hardware abstraction layer and the operating system layer. At the hardware abstraction layer the underlying hardware is virtualised. This can be done in a number of different ways, including full virtualisation, para-virtualisation and hardware-assisted virtualisation. Full virtualisation emulates all of the underlying hardware, allowing an operating system to be run on the virtualised hardware without any modifications, whereas para-virtualisation requires software to be modified to run on the virtual host, as the interface that the operating system interacts with is not the same as that of a physical host. Operating system layer virtualisation is a lightweight virtualisation mechanism that allows multiple user spaces under one host operating system. This allows isolation of processes that can interact with the underlying operating system directly, without the need for hardware virtualisation. This removes the overhead of the virtual machine manager and the hypervisor required by hardware virtualisation.

Kernel-based Virtual Machine (KVM) [28] is an open-source hypervisor that performs virtualisation at the hardware abstraction layer using full virtualisation. KVM is a kernel module that is compiled into the Linux kernel, offering advantages such as increased usability and improved performance. KVM has been referred to as the best hypervisor for general deployment in an HPC environment [36].

The Docker [29] client is an operating-system-level virtualisation platform that leverages LXC [30] technology to provide a virtualisation ecosystem consisting of the Docker CLI and the Docker Hub repository. The CLI provides users with the means to manage images and deploy Docker services, while the Hub allows users to store Docker images remotely, relying on a system similar to the Git [75] CLI to push and pull images between host and repository. Docker provides many of the same advantages as VMs, such as an isolated, consistent and secure environment, while also offering further advantages such as reduced application overhead and a smaller image footprint due to the lack of need for replication. This allows Docker to be used as an approach to microservices, where each container is concerned with a very specific task, providing a modular approach to cloud deployment. This is achieved by using a single kernel to run multiple isolated user-space instances.

2.2.4 Virtual Infrastructure Managers

A Virtual Infrastructure Manager (VIM) allows centralised control over cloud resources from both administration and user perspectives. An administrator may use the VIM to partition resources for specific uses or to add and remove physical devices from the cloud, whereas a user may manage VMs throughout their lifecycle, or manage images and templates for their virtual instances.

This project uses the OpenNebula [31] VIM, an open-source solution with a relatively simple architecture. The core provides an XML-RPC API that can be used to access resources in an OpenNebula cloud programmatically. The Structured Query Language (SQL) database holds the state of entities within the OpenNebula environment. The VM manager executes the VM lifecycle displayed in Figure 2.5, and the host manager handles information about the host infrastructure. The tools layer hosts the scheduler used to deploy VMs onto the physical infrastructure and the CLI that administrators can use to interact with OpenNebula. At the base of the OpenNebula stack is the drivers layer: the transfer driver is used to clone, swap and delete images; the virtual machine driver executes tasks in the lifecycle of a virtual machine, such as shutdown and migrate; and the information driver executes scripts on physical hosts to gather information about them.


2.3 Parallel applications

2.3.1 Intro

Applications are often termed concurrent or parallel, and the two terms are frequently taken to mean the same thing. Here, parallel applications will be considered to be those concerned with solving a single problem as a number of distinct tasks, while concurrent programs will be considered to be programs in which two tasks may share processor time but communication between the tasks is not required and synchronisation is not always necessary, as they are performing distinct tasks.

HPC applications are generally parallel applications. This requires programming models that allow the programmer to take advantage of the different forms of hardware parallelisation and to utilise both shared and distributed memory architectures effectively. In contrast, applications that are particularly suited to cloud deployment are often reliant on concurrency or request-level parallelism in some way, and it is concurrency that cloud computing is synonymous with.

2.3.2 Programming Models

MPI is the standard for distributed memory programming on hardware that follows the MIMD model defined above. This programming model offers the programmer a great deal of control over exactly how parallelism is implemented, allowing experienced programmers to take full advantage of the underlying hardware by exposing the parallelism to the programmer. This flexibility does, however, add complexity to the task of developing MPI applications. There are many implementations available, including but not limited to OpenMPI [32] and MPICH [33] for Ethernet networks, and MVAPICH [34] and others for devices with InfiniBand interconnects [35]. In an MPI program parallelism takes place at the process level, meaning all processes are created at the start of program execution and all processes are killed at the end of execution. Processes are more heavyweight than threads in that they are more expensive to instantiate and have a larger communication overhead.

MPI is said to offer coarse-grained parallelism. This is the typical divide and conquer approach employed in many areas of programming: (1) identify tasks that can be divided into chunks; (2) solve sub-problems of equal size; (3) aggregate values later in execution. Typically MPI programs rely on a large collective communication to distribute values for these tasks across processes, and the values are aggregated, collectively or otherwise, at the end of the application, with point-to-point communications exchanging dependencies between processes after each iteration. This is a powerful approach but clearly requires extra work in the identification of tasks and the implementation of the underlying communication necessary.
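A minimal sketch of this coarse-grained pattern is shown below. It is illustrative only and is not taken from CloverLeaf; the problem, a simple partial-sum reduction, is an assumption made for brevity.

#include <mpi.h>
#include <stdio.h>

/* Each process works on an equal chunk of a global problem, then the
 * partial results are aggregated with a collective reduction. */
int main(int argc, char **argv)
{
    int rank, size;
    const long n = 1000000;          /* global problem size (assumed) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1. Divide the iteration space into equal chunks. */
    long chunk = n / size;
    long lo = rank * chunk;
    long hi = (rank == size - 1) ? n : lo + chunk;

    /* 2. Solve the sub-problem locally. */
    double local = 0.0;
    for (long i = lo; i < hi; i++)
        local += 1.0 / (double)(i + 1);

    /* 3. Aggregate the partial results on rank 0. */
    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", global);

    MPI_Finalize();
    return 0;
}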


While not discussed further here, there are other programming models available for distributed memory programming, including High Performance Fortran, Co-array Fortran and Unified Parallel C [35].

OpenMP is the standard in shared memory programming and offers a high-level parallel programming environment that uses lightweight threads rather than the processes used by MPI. It follows a fork-join threading model, such that one thread is created on instantiation and further threads are spawned from this initial thread. Threads allow the programmer to take advantage of loop-level parallelism, offering finer-grained parallelism than MPI.
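A minimal sketch of loop-level parallelism under the fork-join model follows; it is illustrative rather than project code, and the array size is an arbitrary assumption. It reuses the loop from Figure 2.1.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];   /* static arrays are zero-initialised */
    double sum = 0.0;

    /* Fork: the master thread spawns a team that shares the loop iterations. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * b[i];      /* the data-parallel loop from Figure 2.1 */
        sum += a[i];
    }
    /* Join: only the master thread continues past the parallel region. */

    printf("max threads = %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}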

2.3.3 Performance analysis of parallel applications

There are a number methods commonly used for analysing analysing parallel

algorithms[68].The speedup of a parallel algorithm can be calculated. Where ! is the serial

execution time and ! is the parallel execution time, speedup ! .

Ideally an efficient use of processes would mean that each time the number of processes is doubled the application runtime is halved, however this is rarely the case in practice. Therefore a way of quantifying the efficiency of parallel algorithms is required. So, where !

is the number of cores used to execute the algorithm, efficiency !

The scalability of parallel algorithms can be defined using two definitions. Strong scaling defined by Amdahl’s law[49] refers to the way an algorithm scales when executed with a number of processes! and fixed problem size ! And weak scaling, as defined by Gustafson[50] that refers to the way an algorithm scales with a number of processes ! and increasing problem size per processor. That is, when !
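As a brief worked example (the numbers are hypothetical, not measurements from this project): if a run takes $t_s = 120$ s on one core and $t_p = 40$ s on $p = 4$ cores, then

$$S = \frac{t_s}{t_p} = \frac{120}{40} = 3, \qquad E = \frac{S}{p} = \frac{3}{4} = 0.75.$$

Amdahl's law makes the strong-scaling limit explicit: if a fraction $f$ of the work can be parallelised, then $S(p) \le \frac{1}{(1-f) + f/p}$, so with $f = 0.9$ the speedup can never exceed 10, however many cores are used.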

2.3.4 mini-applications

A mini-app is selected as the test application due to the current relevance of mini-apps in HPC. Their usefulness is becoming clear to application developers as a way to test operations in isolation. The relatively small logic units, or kernels, that power these applications bring a modular approach to scientific application development [36]. The idea behind mini-apps is that they reflect the most compute-intensive part of a real HPC application. They are therefore built by the developers of the application they are designed to reflect, and used as a tool to develop methods that may be successful in the real HPC application [37]. Their role in system co-design and the benchmarking of systems is seen by some as key to the success of HPC in achieving exascale capabilities, and as such a validation methodology has been developed for showing a mini-app's relevance to its parent application, allowing users to feel confident that the results they obtain from a mini-app will remain relevant when utilising the parent application [38]. There are a number of mini-apps included



in the Mantevo [80] mini-app benchmarking suite, including CloverLeaf, MiniFE, and phdMesh.

Their effectiveness as a middle ground between a full-scale scientific application and a simple benchmark is of specific importance to this project. The HPC community is beginning to note a large difference between benchmark performance and the actual performance achieved by full scientific applications [39], and the same can be inferred for the cloud.

2.3.5 CloverLeaf

The project considers CloverLeaf, which is developed by the United Kingdom Mini-App Consortium (UK-MAC) [76]. It is around 4,500 lines in size and is meant to reflect a code used by the Atomic Weapons Establishment [40]. CloverLeaf solves the compressible Euler equations of hydrodynamics using a second-order-accurate explicit Eulerian-Lagrangian solver. To do this a mesh is used, and each kernel of the application traverses the mesh updating values [37]. Communications during the application are the result of boundary nodes being updated.
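A minimal sketch of this kind of boundary (halo) exchange in C with MPI, assuming for illustration a simple 1D domain decomposition (CloverLeaf itself decomposes the mesh in 2D): each rank owns local_n cells plus one halo cell at each end, and swaps one layer of boundary values with its neighbours after each update.

    #include <mpi.h>

    /* 1D halo exchange: field[0] and field[local_n + 1] are halo cells,
     * field[1..local_n] are owned by this rank. */
    void exchange_halos(double *field, int local_n, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        /* send leftmost owned cell left, receive right halo from the right */
        MPI_Sendrecv(&field[1],           1, MPI_DOUBLE, left,  0,
                     &field[local_n + 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send rightmost owned cell right, receive left halo from the left */
        MPI_Sendrecv(&field[local_n],     1, MPI_DOUBLE, right, 1,
                     &field[0],           1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }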

2.3.6 Loop tiling

Loop tiling, which can also be referred to as cache blocking, is a technique used to optimise the performance of HPC applications. The idea is that large arrays of data can be divided into chunks that fit into cache, offering improved performance due to fewer cache misses. This is a form of cache-aware programming and is a feature built into the CloverLeaf [1] mini-app, allowing the user control over the number of tiles per problem and the number of tiles within each of these tiles. This allows the user to investigate the performance of tiling combinations on their given platform. Any optimisations made in this way are platform specific; hence, while performance may improve on one platform, it may decrease on another. It is often this technique, and others such as manual loop unrolling, that optimised versions of HPC codes rely on [23]. Automated loop tiling in 2D and 3D using the OPS framework is discussed further in [69]. There is also a code example of loop tiling in figure a.3, and a simple sketch of the idea is given below.
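As an illustration only (the tile size and loop body are arbitrary, and this is not CloverLeaf code), a cache-blocked update of a 2D array in C might look as follows; TILE would be tuned so that a TILE x TILE block of each array fits into cache.

    #define N    4096
    #define TILE 128

    /* Loop tiling / cache blocking: the i-j iteration space is traversed in
     * TILE x TILE blocks so data is reused while still resident in cache. */
    void update_tiled(double (*a)[N], double (*b)[N])
    {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE && i < N; i++)
                    for (int j = jj; j < jj + TILE && j < N; j++)
                        b[i][j] = 0.5 * (a[i][j] + b[i][j]);
    }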

2.4 Related Work

In recent work [42] it is shown that cloud co-existing with HPC can improve job turnaround time by a factor of two and increase throughput by a factor of six. A selection of benchmarks and applications is used to test optimisations to granularity, network configuration and CPU affinity. Experiments are performed on LXC containers, thin clients and VMs; these are then compared, and a methodology is presented for a multi-platform, application-aware online job scheduling system. This work shows a well-founded yet optimistic view of cloud computing, and as such it motivates much of the work done here. It is concluded that cloud can


successfully complement supercomputers, and that awareness needs to be bi-directional between the two platforms for this to succeed.

In earlier works by the same lead author, HPC workloads are studied using CloudSim and the Amazon public cloud [43,44]. Here four classes of HPC applications are considered. They are defined as:

“ExtremeHPC: Extremely tightly coupled or topology- sensitive applications for which the best will be to provide dedicated nodes, example – IS.

SyncHPC: Sensitive to interference, but less compared to ExtremeHPC and can sustain small degree of interference to get consolidation benefits, examples – LU, ChaNGa.

AsyncHPC: asynchronous (and less communication sensitive) and can sustain more interference than SyncHPC, examples – EP, MapReduce applications.

NonHPC: Do not perform any communication, can sustain more interference, and can be placed on heterogeneous hardware, example – Web applications”

Gupta et al. [43][44] then approach the VM placement problem as discussed herein, utilising the same number of physical nodes and processors as used in this project; as such, this work motivates the approach taken here towards VM placement. VM consolidation is discussed and an HPC-aware scheduling algorithm is presented before being implemented in CloudSim. They conclude that the resource utilisation of HPC workloads can be affected by VM placement and that cloud VIMs would greatly benefit from HPC-aware scheduling. Further work is suggested in the form of effectively scheduling a mix of HPC and regular applications.

In [45], multi-physics applications are executed on a Eucalyptus-managed private cloud testbed. The execution of an MPI application, HybridMD, is benchmarked. The application is said to require intensive communication and computation, and is executed 50 times on each of four cloud configurations utilising between 8 and 32 cores and between 3 and 8 VMs. The authors propose a hybrid architecture for scientific applications and conclude, again, that VM placement and consolidation can have an effect on HPC applications. Further work is suggested to provide high-level checkpointing techniques and the migration of application components.

In [46] and [47], a comparison of virtualisation and containerisation techniques for high-performance computing, KVM and Docker are compared in terms of scaling up and scaling out. Scale-up experiments are performed up to 24 cores, and scaling out is shown up to 64


nodes using a selection of mini-apps, namely CloverLeaf, CoMD and MiniFE. They conclude that KVM carries an average performance overhead of 1.57% over bare metal, and Docker a -0.17% overhead over bare metal, when executing CloverLeaf for the scale-up and scale-out experiments respectively.

In [48], Younge et al. compare four full-virtualisation hypervisors: Xen, KVM, VMware and VirtualBox. They discuss the usability of each hypervisor, concluding that Xen and VirtualBox are the easiest to install, and that the full kernel modification required by Xen can cause future problems with patches and updates. They use FutureGrid, a geographically distributed, heterogeneous test platform. FutureGrid aims to build understanding of cloud computing, and as such it is possible to request bare-metal runs; this is used to capture the overhead of each of the hypervisors used. They run the HPCC and SPEC benchmarks 20 times, using 8-core VMs each with 16 GB of RAM. KVM outperforms Xen and VirtualBox for the Linpack, FFT and OpenMP benchmarks, and it is concluded that KVM is the best hypervisor overall for HPC. Xen offers some of the best results in terms of minimum times, but unexplained fluctuations are experienced. This work again offers a well-founded and optimistic view of virtualisation for HPC, and the ability of KVM to execute HPC workloads is confirmed in this work.

2.5 Summary

This chapter has explained the background research that has been done for this project, specifically covering the areas of HPC, cloud computing and HPC mini-apps. It has also covered the literature reviewed that is most relevant to this subject area.The next chapter will outline the process of experimental design used for this research.


Chapter 3: Experimental Design

One aim of these experiments is to compare the MPI, OpenMP and hybrid programming models. They will also allow VMs and containers to be compared, using KVM and Docker respectively. Experiments will be performed to show scaling up and scaling out. This will allow observation of the overhead incurred by the different virtualisation methods and will allow the scalability of the solution to be ascertained. Further, loop tiling optimisations will be performed to explore the effects of loop tiling on the CloverLeaf application when executed on the SoC cloud testbed.

3.1 Identification of parameters

HPC systems are reliant on user knowledge to obtain the best performance from their applications. This requires the user to know how their application performs under a variety of different conditions, including: 1) the compiler used to compile the source code, 2) any compiler optimisations used, 3) the implementation of the programming model, 4) the amount of memory required to run the application, 5) the programming model chosen, 6) the number of cores, 7) the number of nodes, 8) the problem size, 9) application variables.

Due to time constraints it is beyond the scope of this project to fully exhaust all of these options on both systems; therefore attempts are made to equalise parameters that are not open for experimentation. To this end the GNU compiler v5 [51] is used on both systems. CloverLeaf is a memory-bound code, and the amount of system memory affects performance; it is therefore felt that maximising memory while keeping the systems equal for equivalent experiments will offer the best performance available within the limitations of the hardware to be compared, whilst keeping experiments comparable.

Implementations of MPI used on each system are different due to the difference in interconnects. MPI implementations on the ARC system are optimised for InfiniBand, therefore ARC will use MVAPICH2 and the SoC testbed will use MPICH v3. This leaves a reduced set of parameters selected for benchmarking performance on ARC, these being: programming model m, number of cores p, and number of nodes n.

The parameters selected for experiment on ARC all exist as parameters in the cloud; however, the cloud offers a wider range of parameters for experimentation. These are identified as: 1) the number of virtual machines, 2) the number of cores per virtual machine, 3) the virtualisation method used, 4) the placement algorithm used, 5) the storage optimisation algorithm used.

It is not possible to exhaust all of these parameters, therefore a smaller set is selected, including the number of virtual machines, vm, and the virtualisation method, v. Therefore a suitable cloud configuration within the context of these experiments is some combination of:



vm = number of virtual machines per node

p = number of physical cores per virtual machine

V = virtualisation method

M = programming model

For the cloud experiments performed here, each physical core is mapped to exactly one virtual core when requests are made. Therefore the value of p is equal to the number of physical cores, which in turn is equal to the number of virtual cores. From these parameters a number of experiments were formed; these are detailed in the following sections.

3.2 Identification of metrics

The metrics used to determine the performance of CloverLeaf are wall clock time t, speedup S and efficiency E, as defined in section 2.3.3 of this report. While the interest of this project is performance, efficiency values will be used to assess the practicality of solutions.

A further metric that will be discussed is scalability. Scalability may mean a number of different things in the context of this project: the scalability of the algorithm will be compared on each architecture, and the application's ability to scale to a number of VMs and physical hosts will also be researched. It is important to note that these experiments will be done within the physical limits of the SoC cloud testbed. Therefore, wherever an ability to scale is shown, it is within the pre-defined limitations of the experiments, which will utilise up to 16 VMs, 32 cores and 8 physical nodes.

3.3 Scale Up

The aim of the scale up experiment is to quantify and allow comparison of the performance of a single 16 core node of both ARC and SoC testbed.

Table 3.1 - Scale-Up

n | vm | V | p
1 | 1 | KVM | 1, 2, 4, 8, 16

This experiment will be performed using three different implementations of CloverLeaf: the flat MPI version, the OpenMP version and the hybrid version. Utilising the hybrid implementation



of the program will add an extra dimension to the problem, and this will be explored. Therefore, where np represents the number of MPI processes and nt represents the number of OpenMP threads, all combinations where np x nt = 16 will be used. A sketch of such a hybrid configuration is given below.
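A minimal, illustrative hybrid MPI + OpenMP sketch in C (not CloverLeaf code): np MPI processes are launched by the MPI runtime, and each forks nt OpenMP threads, so that np x nt matches the cores available.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    /* Hybrid model: MPI between processes, OpenMP threads within each process. */
    int main(int argc, char **argv)
    {
        int provided, rank;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)   /* nt threads per rank */
        for (int i = 0; i < 1000000; i++)
            local += 1.0;

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f (nt = %d threads per rank)\n",
                   total, omp_get_max_threads());
        MPI_Finalize();
        return 0;
    }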

3.4 Placement

The aim of the placement experiment is to investigate the performance of one node of the SoC cloud testbed utilising all available cores while varying the number of VMs. This experiment will use the flat MPI and the hybrid implementations; the hybrid runs will use OpenMP inside each VM and MPI between VMs.

Table 3.2 - Placement

n | vm | V | p
1 | 1, 2, 4, 8, 16 | KVM | 16

3.5 Virtualisation Overhead

The aim is to investigate the overhead incurred for each virtual machine per physical host by comparing with results of previous experiments.

Table 3.3 - Virtualisation Overhead

n | vm | V | p
1 | 1, 2, 4, 8, 16 | KVM | 1

3.6 Scale Out

The aim is to determine the application's ability to scale across physical hosts while quantifying the cost of network communication on both the SoC cloud testbed and ARC.



Table 3.4 - Scale out

n | vm | V | p
2 | 2, 4, 8 | KVM | 32
4 | 4, 8 | KVM | 32
8 | 8 | KVM | 32

3.7 Containers

Aim: To determine the performance of the cloud test bed using a container as the virtualisation method.

Table 3.5 - Containers

n | vm | V | p
1 | 1 | Docker | 1, 2, 4, 8, 16

This parameter will be incremented by changing the number of processes requested when instantiating the VM.

As in the first scale-up experiment, this experiment will be performed using three different implementations of CloverLeaf: the flat MPI version, the OpenMP version and the hybrid version. Utilising the hybrid implementation of the program will add an extra dimension to the problem and this will be explored. Therefore, where np represents the number of MPI processes and nt represents the number of OpenMP threads, all combinations where np x nt = 16 will be used.

3.8 Loop Tiling - Single Slot

The aim of this experiment is to explore the effects of loop tiling on a single slot using test problem 4.


Table 3.6 - Loop Tiling - Single Slot Symmetric

n | vm | V | p | Tp | Tc
1 | 1 | KVM | 8 | 128, 256, 512, 1024, 2048, 4096 | 128, 256, 512, 1024, 2048, 4096

Of the values selected, the optimal tiling appeared at 1024 x 1024; therefore the possible combinations whose product is equal to 1024 x 1024 are investigated.

Table 3.7 - Loop Tiling - Single Slot Asymmetric

Tp: 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096
Tc: 131072, 65536, 32768, 16384, 8192, 4096, 2048, 512, 256
(combinations such that Tp x Tc = 1024 x 1024)

3.9 Loop Tiling - 8 Nodes

Aim: to explore the effects of loop tiling on a 32-core cluster using test problem 5.

Table 3.8 - Loop Tiling - 8 Nodes

n | vm | V | p | Tp | Tc
8 | 8 | KVM | 32 | 128, 256, 512, 1024, 2048, 4096 | 128, 256, 512, 1024, 2048, 4096

3.10 The Problem

The test problem chosen for these experiments is the socket test, which simulates a small, high-density region of ideal gas expanding into a larger, low-density region. The problem is 3,840 x 3,840 cells in size and executes for a fixed 87 time steps. For each experiment this problem is executed 5 times and the average of these 5 runs is given in the results for this project.


This problem would demonstrate a weak scaling run if the problem size were increased proportionally with p; however, it is used here with a fixed problem size to reduce the time frame of the following experiments.

3.11 Summary

In summary, this chapter has described the metrics that will be used to quantify the performance of CloverLeaf and identified the parameters that are to be used for experimentation. The aim of each of the experiments has been outlined and the parameters for each experiment have been defined. The next chapter will explain the methods used to implement these experiments and describe the programs and tools used to analyse the data collected.


Chapter 4: Implementation

This chapter will explain how the experiments detailed in the previous section will be put into effect. It will detail the hardware specification of the two test platforms, ARC and the SoC testbed. It will also explain technologies and tools used to implement the experimental environment.

4.1 Hardware Specification

Table 4.1 - Hardware Specification

Component | ARC | SoC testbed
Server | HP BL460 blade | Dell PowerEdge 1950
CPU | 2 x Intel E5-2670 | 2 x Intel Xeon E5-2630 v3
CPU base frequency | 2.6 GHz | 2.4 GHz
Turbo Boost frequency | 3.3 GHz | 3.2 GHz
L1 cache | 8 x 32 kB | 8 x 32 kB
L2 cache | 8 x 256 kB | 8 x 256 kB
Cache | 20 MB | 20 MB
Bus speed | 8 GT/s | 8 GT/s
Hyper-threading | Yes | Yes
Family | Sandy Bridge | Haswell
SPEC integer operations | 632 | 673
SPEC floating point | 481 | 571
Lithography | 32 nm | 22 nm
RAM architecture | DDR3 | DDR4
RAM capacity | 32 GB | -
RAM frequency | 1600 MHz | -
Network fabric | InfiniBand 40 Gbit/s | Ethernet 1 Gbit/s
Storage | Lustre parallel file system | NFS shared file system
Operating system | CentOS 6 | CentOS 6
Huge page size | 2048 kB | 2048 kB

Table 4.1 gives the hardware specification for a single node of each of the systems to be studied. The Intel Xeon E5-2630 v3 [52] processors used in the SoC testbed are from the Haswell family, two generations on from the E5-2670 Sandy Bridge [53] processors provided on the ARC system. Haswell processors have been shown to offer 1.9x the peak floating point performance of the Sandy Bridge processors [52][53]. This is reflected in the SPEC benchmark scores given in table 4.1.

Comparing RAM architectures shows the SoC testbed with DDR4 and ARC with DDR3. ARC is using the highest-frequency DDR3 RAM available; this, however, is the base frequency for DDR4, and the exact frequency has not been ascertained on the SoC testbed. As CloverLeaf is memory bound and memory performance is known to affect performance


[54], it was considered that this should be explored further. To do this, a single execution of the STREAM benchmark was performed on each system to compare memory access times. The results show that ARC in fact has better access times, perhaps highlighting the overhead of virtualisation. The full results are given in figure a.2.

4.2 Network Topology

The SoC cloud testbed is contained within one rack of servers, where each node is connected to every other node by a 1 Gb Ethernet connection. While this system has not been designed with HPC performance in mind, its size means that no consideration is required concerning switch hops caused by communication between server racks, as all nodes are in the same rack.

The ARC2 system uses InfiniBand for networking; this is a layered network referred to as a half-Clos topology. Each server has a 4X quad-data-rate (QDR) connection, with two 4X QDR links up to the core, able to transfer data at approximately 8 GB/s. The latency between servers is between 1.1 and 1.5 microseconds. To reduce communication cost, the compute nodes are arranged in the racks as a high-density block to minimise the diameter of the compute topology [55].

Figure 4.1 - ARC2 Topology

4.3 Benchmarking ARC

To access the ARC system, users log in to ARC via ssh using one of the login nodes; from here the user moves to the scratch storage provided at /nobackup (OSS 1 and OSS 2). This is the mount point for the Lustre parallel file system, providing storage for any programs and data required for execution. This storage solution is not backed up automatically, so measures


are put in place to ensure program data remains unaffected. Jobs are submitted to the cluster's SGE scheduler using the recommended node syntax. This allows the user to specify exclusive access to nodes, so that a node won't be used by other users while their job is executing; this is done to prevent noise from other users of the system entering the measurements. The scheduler is used on its optimal setting, meaning that it will attempt to minimise switch hops.

4.4 Cluster of Virtual Machines

4.4.1 Images

Two images have been prepared for use on the OpenNebula VIM: one master_image and one slave_image. Both of these images are based on a CentOS 7 [88] image that is available from the OpenNebula marketplace, and both use the KVM hypervisor. The KVM hypervisor was chosen as it has been shown to be the best performing hardware-level virtualisation hypervisor for HPC applications in a number of studies [48].

ssh [73] is required by MPI to launch processes on the remote machines, therefore this must be set up between the master_image and slave_image. This involves installing ssh with yum install openssh-server [56], then generating an ssh key on the master device using ssh-keygen -t dsa and copying it to the client using ssh-copy-id client. Each machine must be accessed once by ssh before password-less ssh is enabled using eval `ssh-agent` followed by ssh-add ~/.ssh/id_dsa.

The nfs-kernel-server package is installed on the master image, and a directory is created to be shared by the cluster with read/write permissions and no_subtree_check enabled, to prevent checking of parent directory permissions before allowing access. The folder is also shared with all IP addresses, such that any machine that attempts to mount the folder can do so. Some of these steps reduce the security of the cluster, but that is not a concern here.

4.4.2 Templates

A number of templates have been created as part of the preparation for this project. Initially a master and slave template are created.

Master
Memory: 16 GB
CPU: 16 cores
VCPU: 16 cores
Storage: Master image
Persistence: Yes
Network: Private VNet
Placement: Chosen host
IP address: Fixed

Slave
Memory: 16 GB
CPU: 16 cores
VCPU: 16 cores
Storage: Slave image
Persistence: No
Network: Private VNet
Placement: Chosen host
IP address: From pool


Figure 4.1 - Templates for master and slave devices.

These templates can then be duplicated and adapted for individual experiments, such that they specify the values required for that experiment. Using images and templates in this way, it is possible to deploy a cluster of 16 VMs using only two images and two templates.

To perform explicit placement experiments these templates are duplicated so that selected host information can be given to each machine.

4.4.3 Networking

Networking on the OpenNebula VIM on the SoC testbed is usually done by requesting access to a virtual network that assigns Internet Protocol (IP) addresses dynamically from a pre-selected pool. This system re-uses addresses, meaning that if a machine is taken down its IP address may be different the next time the cluster is instantiated. To solve this problem a private virtual network is created and the first address of this network is allocated to the master machine. As the slave images are deployed, they are deployed onto this private network of pre-allocated addresses.

4.5 Containers

4.5.1 Overview

To use docker on an OpenNebula managed cloud, one approach is to use a combination of tools provided by Docker to provision a container inside an existing VM.

Figure 4.5- Overview of container set-up



4.5.2 Docker Machine

Docker Machine [57] is an Ubuntu [74] based image imported from the OpenNebula Marketplace. A VM is created on the SoC testbed with this image, and this is then used as a tool to provision a Docker engine on the SoC cloud testbed. This is done using the docker-machine CLI and the provided OpenNebula driver, which together provide an interface to configure and manage Boot2Docker [58] instances. Once Google's Go and the OpenNebula driver are installed, a Docker engine can be deployed onto the OpenNebula platform using the Boot2Docker image.

4.5.3 Boot2Docker

Boot2Docker [58] is the image used for the VM that runs the Docker engine; it is essentially a VM using an in-memory operating system based on Tiny Core Linux [59]. This provides a lightweight OS that is ideal for hosting Docker containers.

To provision a single Boot2Docker instance from Docker Machine, a script is created that takes two arguments: the ID of the template to be used from the OpenNebula datastore, and the VM name that the user would like displayed in OpenNebula upon instantiation.

docker-machine create --driver opennebula \
    --opennebula-network-id 60 \
    --opennebula-xmlrpcurl https://csgate1.leeds.ac.uk:2633/RPC2 \
    --opennebula-user username --opennebula-password password \
    --opennebula-template-id $TEMPLATEID $VMNAME

Here the template ID refers to an OpenNebula template that uses the boot2docker image as its storage method.

4.5.4 Docker

Docker is installed on the Boot2Docker VM allowing the user to pull images from the docker hub and modify them for execution as a docker container containing a custom application.

4.5.5 Docker images

Two images have been used as Docker container images. The first is a CentOS 7 image provided by NERSC [60] for MPI development; the second is an image based on Alpine [61], a lightweight Linux distribution, pulled from the Docker Hub repository with MPI already installed [62]. To install the Fortran compiler and CloverLeaf, a Dockerfile is created, analogous to a Makefile. This file defines any folders to be copied from the local drive on launch, any contextualisation commands to be executed on launch, and the program to be executed. To keep a history of the changes made to the images, the Docker Hub repository


is used. As a result of this project there are two containers that can be accessed from the Docker Hub repository [75][76]. To use these images from the Docker engine, the user can run docker pull jay83/cloverleaf followed by docker run -it --entrypoint /bin/sh jay83/cloverleaf. This starts the container such that the user is presented with a shell, from which CloverLeaf can be executed.

4.6 Monitoring

4.6.1 Zabbix

Zabbix has been utilised in two ways in this project: first for monitoring through its user interface, giving instantaneous feedback on the underlying hardware, and secondly through the monitoring daemon provided by Dr Kavanagh. This has allowed insight into the utilisation level of physical resources; however, it was turned off during timing runs to prevent any interference with results.

4.6.2 Telegraf, InfluxDB and Grafana

To monitor individual virtual machines, Telegraf [3] is used. This is a monitoring daemon written in Go that collects system and network usage information from the host. This information is sent via UDP to InfluxDB [3], a time-series database that allows data to be tracked over time. This is then sent to the Grafana [4] server over localhost, so that the user can view and query the data on the Grafana dashboard using SQL-style queries. This was used to monitor the cluster in initial testing to gain insight into how the cluster was performing, and turned off for timing runs, as the extra communication is known to have a deleterious effect on MPI programs.

4.7 CloverLeaf

4.7.1 Adaptations

The Fortran control code of the CloverLeaf application has been adapted so that a named file is generated containing the information that is to be analysed. Rather than all output files being named clover.out, the named file contains only the variables that are required by the Python program. The adapted version of CloverLeaf is available at [77].

4.7.2 Optimisations

Tiling optimisations have been performed on two configurations. This is done by setting input variables in the input script to over-decompose the problem: Tp allows the user to select the number of tiles across the problem, and Tc allows the user to define the number of tiles within each of these Tp tiles, so the two variables offer a recursive sub-division of the problem.



4.7.3 Profiling

Profiling has been done using a number of tools in this project. Allinea MAP [63] has been used on ARC to determine the reasons for poor performance; this requires compiling the code for debugging and running the Allinea server application on the head node of the ARC system. The code is then submitted to the batch scheduler and the profiling data is displayed for the user to analyse. mpiP [64] has also been used to gain further insight into the reasons for this performance bottleneck. CloverLeaf has a built-in profiler that gives the time spent in each kernel; this is used to find where the application spends most of its time.

4.8 Analysis Of Data

4.8.1 Python Program

A Python script has been used to calculate the average speedup, efficiency and wall clock times that are given for each run. The script first averages the wall clock and start-up values over the five runs. From these averaged values it then calculates speedup and efficiency as a single value for the five runs, using the formulas given in section 2.3.3.

4.8.2 Java Program

A Java program has been created, using the OpenNebula Java API [65], that instantiates the cluster across eight nodes using the configuration provided by the optimal 32-core run. It is considered later in this report that this could be used as a means to characterise applications on an OpenNebula cloud.

4.9 Summary

In summary, this chapter has explained the implementation of the experiments presented in chapter 3. It has discussed the methods used to collect data from the ARC system, and the provisioning of images and templates suitable for setting up a scalable MPI cluster on the SoC cloud testbed. The use of container technologies on the SoC cloud testbed has been demonstrated, and the methods used to collect and analyse data have been detailed. The following chapter will form the technical evaluation of the experiments implemented in this chapter.


Chapter 5: Technical Evaluation

This chapter aims to analyse the data collected and provide a discussion of the quantitative analysis performed. The key metrics will be discussed, including efficiency, speedup and scalability. Any observations made will be highlighted, discussed and explained where possible.

5.1 Scale-Up

5.1.1 OpenMP

The hypothesis to be tested: Virtualisation comes at the cost of performance, therefore using one VM with a large number of processors should reduce the cost of virtualisation, while allowing the use of OpenMP as a programming model.

Figure 5.1 - Scale-Up OpenMP: wall clock time (s) against number of processors (1-16) for ARC2 and Cloud. The graph and the table below show results from the scale-up experiments using OpenMP.

p | S (ARC) | S (SoC) | E (ARC) | E (SoC)
2 | 1.88 | 2.05 | 0.94 | 1.03
4 | 3.56 | 3.66 | 0.89 | 0.92
8 | 3.66 | 4.30 | 0.46 | 0.54
16 | 8.39 | 3.60 | 0.52 | 0.22

Results for the scale-up experiment show that performance is as expected for both architectures up to 4 cores, yet at 8 cores is inefficient on both architectures. The cloud marginally outperforms ARC for both 2 and 8 cores and shows only a 1 second difference at 4 cores. At 16 cores ARC clearly outperforms the testbed. Monitoring shows that only 75% of the available processing power is being used by a single virtual machine with 16 cores when OpenMP is used; furthermore, there are only ever twelve active processes at one time. Reasons for this may include:


1. Hyper Threading

2. Ineffective use of the underlying hardware

3. Thread and CPU affinity

1. Hyper-threading does not appear to be enabled when lscpu is used to query the number of threads per processor on the VM; however, querying the head node of the cloud testbed reveals that the ht flag is set. It is not possible to ssh into the nodes of the cloud testbed, and as such it is not confirmed whether hyper-threading is enabled at the level of the physical host. If hyper-threading is enabled, it could be that two OpenMP threads are being assigned to the same physical core, resulting in a performance reduction due to the expense of hyper-threading [23].

2. Monitoring the physical host on which the VM is deployed shows that the node is being used by other processes. In this case monitoring shows that 1.4 of the 16 CPUs available are being utilised when the VM is scheduled; therefore it is impossible to get full utilisation of a testbed server. This is an effect of the multi-tenancy that is expected on cloud platforms.

3. Furthermore, thread affinity may be having an effect, in that more than one thread is being assigned to each core, or threads are being pinned to processors ineffectively. The effects of CPU affinity are shown by Gupta et al. in [30]. To explore this further, the OpenMP environment variable OMP_PROC_BIND is used. Experimentation here reveals marginally improved performance when using OMP_PROC_BIND=CLOSE; however, changing this to SPREAD also appears to improve performance, suggesting that setting the variable at all has an effect, but in no particular way.

Despite the poor performance of the testbed at 16 cores, the most efficient OpenMP-only run is given by a VM with 2 processors, and performance for 4 and 8 processors is also greater than on the ARC system. Hence this experiment has shown that, when executing CloverLeaf using OpenMP with between 2 and 8 processes, the SoC testbed would be a better option than ARC in terms of both application runtime and efficient use of the underlying hardware. However, if utilising 16 cores on one node, ARC offers greater performance in terms of runtime, while giving an efficiency value only 0.02 less than the cloud achieves with 8 cores.

5.1.2 MPI

Scaling up using MPI shows that ARC records a longer time for 2 processors than for one. CloverLeaf uses a mesh of values, where cells that border another process's sub-domain must be communicated to the neighbouring process. It was considered that communication between two processes causes an exchange of boundary values spanning the full dimension of the problem, hence creating a situation where communication is dominating


computation. However, performance on the cloud suggests that this is not the problem: if the communication pattern alone were causing this effect then it would also appear in the results taken from the cloud executions, and this is not the case.

Figure 5.2 - Scale-Up MPI: wall clock time (s) against number of processors (1-16) for ARC2 and Cloud.

The application was profiled using Allinea MAP (figures a.6 and a.7), which shows that the time spent in communication increases and that less time is spent in vectorised instructions. It was considered that this may be a result of both processes trying to access the same resources at the same time, causing a lock on resources and forcing the second process to wait. To find out if this is the case, mpiP is used to gain information about individual MPI calls. mpiP shows that MPI_Waitall becomes the top call site and the MPI function where most time is spent; the maximum time for an MPI_Waitall call is also shown to double. It was considered that CPU affinity may be affecting this result: since a whole node is requested, yet only two MPI processes are used, the batch scheduler may be assigning each of the MPI processes to a different processor. This would result in one node with two eight-core processors, each utilising only one core, and would cause unnecessary communication between the two processors. Hence communication does dominate, but as a result of the combination of both of these factors. To see if CPU affinity was having an effect, affinity was set such that the processes would be placed on neighbouring cores of the same CPU using the SGE scheduler. A setting recommended at [66] is applied, and a further setting that is known to assign processes to two adjacent cores on one CPU. However, neither of these options shows a performance



increase. Also, for completeness, this anomaly was noticed on isolated occasions during the testing stages using VMs on the cloud testbed, and again when using Docker.

Although it had been determined that the choice of compiler and MPI implementation were not to be explored, the code has been recompiled using the Intel compiler with both OpenMPI and Intel MPI. However, every combination has yielded either the same results or application failure. While performance on ARC is poor, monitoring shows between 94.3% and 100% usage per core throughout the runtime, for all cores, while running CloverLeaf in one VM with 16 physical cores assigned. This shows that the utilisation of virtual CPUs is greatly increased by using MPI on the cloud. In conclusion, this experiment shows that ARC is completely inefficient when executing this application. For the cloud, however, MPI yields much greater performance than OpenMP for both 8 and 16 cores, offering improved efficiency in both cases. Results for 16 cores still show an inefficient use of the underlying hardware; again it is assumed that this is an effect of multi-tenancy, process affinity and the cloud's loosely coupled architecture.

5.1.3 Hybrid

Most interestingly, on ARC we see that the time where p = 2 is the most efficient run of all of the runs on ARC, in contrast to the previous experiment; the problem presented with 2 processes no longer exists. The reason for this may be that OpenMP allows loop-level parallelism that is not offered by MPI, so OpenMP may be taking advantage of parallelism where MPI alone cannot. Communication overhead between OpenMP threads is also expected to be less than communication between MPI processes [23]. When profiled using Allinea, we see only two peaks showing large collective communications, rather than the eight seen for an MPI-only run. It is felt that this is due to the difference between the thread model and the process model used by each of these implementations [23].

Figure 5.2 - Hybrid: wall clock time (s) for each implementation (4 x 4, 2 x 8, 8 x 2, MPI, OMP, 16 x 1) and against number of processors (1-16) for ARC2 and Cloud.


Doubling from 8 to 16 cores offers an improvement of only 3 seconds, and an efficiency value of 0.43. The reasons for performance degradation on the cloud at 16 processes and 16 threads are the same as in the previous experiments. Similar performance on ARC indicates that this may also be the result of the increased initialisation and communication overhead incurred.

Table 5.1 - Final results for whole node access

Implementation | S (ARC2) | S (SoC) | E (ARC2) | E (SoC) | t (ARC2) | t (SoC)
MPI | 4.35 | 6.629 | 0.27 | 0.414 | 69.19 | 47.56
OMP | 8.39 | 3.60 | 0.52 | 0.224 | 35.89 | 87.61
TPP | 6.85 | 7.01 | 0.43 | 0.44 | 34.27 | 44.96

In summary, experiment one shows that only ARC is able to make effective use of one full node when executing CloverLeaf, given the constraint of using only one VM on the SoC cloud testbed. Returning to verify the OpenMP runs on the cloud gives a very similar pattern of scalability, confirming the pattern seen in these results, and the individual times are also very similar. The MPI and hybrid results, however, show a large variance against what is reported here: it appears that sometimes the application will continue to improve up to 16 cores, while at other times it will not. This variance in results is expected due to the cloud's loosely coupled architecture; each time a virtual machine is requested, the resources allocated to it will depend on the system load and the scheduler's actions at that time. Further tables of results for this experiment are given in a.4 and a.5.

5.2 Placement

Here each experiment uses 16 cores, divided equally between the VMs, to assess the effects of packing VMs into a node rather than using only one virtual machine.

Wall clock time t (s), speedup S and efficiency E against the number of VMs:

Hybrid
vm | t | S | E
1 | 78.20 | 4.03 | 0.25
2 | 58.97 | 5.35 | 0.33
4 | 81.74 | 3.86 | 0.24
8 | 48.91 | 6.45 | 0.40
16 | 44.96 | 7.01 | 0.44

MPI
vm | t | S | E
1 | 44.96 | 7.01 | 0.44
2 | 45.63 | 6.91 | 0.43
4 | 40.48 | 7.79 | 0.49
8 | 60.00 | 5.26 | 0.33
16 | 62.14 | 5.07 | 0.32


Figure 5.3 - Placement: wall clock time (s) against number of VMs (1-16) for the Hybrid and MPI implementations.

The optimal result within the bounds of this experiment is given by the use of 4 VMs and the MPI programming model. It was noted in profiling that this configuration is tiled such that Tp = Tc = 4; this means the problem is being tiled such that it matches the symmetry of the virtualised infrastructure. Another observation is that two eight-core processors are in use, and as such a 4 x 4 configuration potentially offers a better mapping of processes to virtual cores, and of virtual cores to physical cores, helping the scheduler to achieve better affinity at all levels, from physical host to virtual machine and from virtual CPU to process.

Efficiency for whole node access now shows a value of 0.49, in comparison to the 0.41 given by the flat MPI version with only 1 VM, also offering a 4.48 second saving in runtime. This is a 10% reduction in comparison to experiment 1.2 and shows that this is the most efficient way to use a whole single node on the cloud testbed. Furthermore, this configuration is only 3% less efficient than the optimal whole-node access time achieved by ARC in the scale-up experiments.

5.3 Overhead

Table 5.2 - Overhead

p | wall clock time (s), 1 VM with p cores | wall clock time (s), p VMs with 1 core each
2 | 167.14 | 150.24
4 | 89.95 | 100.34
8 | 61.93 | 61.87
16 | 47.56 | 62.14

Here we see the effects of adding up to 16 virtual machines, each with one CPU, to a single node, using MPI as the programming model. For two VMs, each with one CPU, we see the most efficient use of parallelism demonstrated by these experiments, and an improvement of 17 seconds over one VM. At 4 VMs the results favour the single machine, and at 8 VMs there is again a small reduction in runtime over that of 1 VM. This shows that the overhead for 1 VM with 8 cores and 8 VMs with 1 core is almost equal, given the variance in times seen throughout these experiments.

Once the node is under full utilisation the single VM offers much greater performance; monitoring shows that each of the 16 single-core VMs can only use 70% of its associated CPU. It is assumed this is due to the overhead of network virtualisation between 16 machines and the inherent cost of whole node access that has already been seen.



It is considered that this may be a result of affinity at the VM level, that is, affinity between the physical CPU and the hypervisor, or between the hypervisor and the virtual CPU. It is beyond the scope of this study to investigate this further, due to both time constraints and administrative rights.

5.4 Scale-Out

Table 5.3 - Scale out

Cloud configurations (n = physical nodes, vmn = VMs per node):
n | vmn | t (s) | S | E
2 | 1 | 60.74 | 5.19 | 0.16
2 | 2 | 58.03 | 5.43 | 0.17
2 | 4 | 61.58 | 5.12 | 0.16
4 | 1 | 28.23 | 11.17 | 0.35
4 | 2 | 34.71 | 9.09 | 0.28
8 | 1 | 27.73 | 11.37 | 0.36

Wall clock time (s) by platform and implementation:
n | ARC TPP | Cloud TPP | ARC MPI | Cloud MPI
2 | 24.24 | - | 28.09 | 58.03
4 | 16.23 | - | 28.07 | 28.23
8 | 11.27 | 24.48 | 28.08 | 27.73

Scaling out is initially performed using the MPI implementation. Interestingly, the cloud shows improved performance over ARC2 when using 8 physical hosts. The previous experiments have also shown the cloud outperforming ARC2 when using only the MPI implementation, and this confirms that result; the reasons for this remain unknown. While the cloud outperforms ARC2 for the optimal configuration, the deviation across the full range of cloud results for this experiment is 14.23 s with an average time of 48.66 s, whereas on ARC2 the deviation is 0.01 s and the mean time 28.08 s. Here we see the importance of the correct placement of resources, as concluded in [43][44][45]. Once the optimal cloud configuration within the bounds of this experiment had been found, a Thread Per Process (TPP) run was performed to see if this offered any improvement over the MPI-only run, as it did at lower core counts. Again TPP marginally outperforms the flat MPI implementation, confirming the results found in previous experiments. It is also important to note the difference in speedup achieved on ARC when scaled to 8 nodes: this shows how well the CloverLeaf application scales out on an HPC cluster and shows a strong difference between the two architectures.

5.5 Containers

Here containers are introduced as an alternative virtualisation method. Docker containers offer a shorter single-core runtime than both ARC and the VM. While it is expected that containers offer an improved runtime over a VM, the container showing shorter runtimes


than ARC suggests that executing applications on the bare metal of the cloud testbed would also be more effective than using ARC for single-core execution. This may be down to the cloud testbed's hardware advantages, in the form of a later-generation processor and DDR4 RAM. The results for 16 processes for both OMP and TPP show that using a container with 16 cores can offer over a 10% improvement in runtime compared to that of the VM, and comes within 10% of the runtime given by the ARC system.

Figure 5.4 - Containers: wall clock time (s) against number of processors (1-16) for ARC2, VM and Docker, for the TPP, MPI and OpenMP implementations.

Best wall clock time (s) at each core count, with the implementation that achieved it:
p | Docker | VM | ARC2
1 | 299.738 | 315.34 | 301.17
2 | 154.11 (TPP) | 151.78 (TPP) | 136.01 (OMP)
4 | 95.73 (TPP) | 86.10 (OMP) | 74.08 (TPP)
8 | 57.61 (TPP) | 64.90 (TPP) | 46.98 (TPP)
16 | 39.58 (TPP) | 44.96 (TPP) | 35.89 (OMP)


5.6 Loop Tiling - 1 Slot

Having noted the variance in times across all experiments, the problem is first executed with the default setting of 1 tile per MPI rank to obtain a base time. The range of results given is short of the actual range performed, but the interesting results are given in table 5.4. Here we see that runtime decreases by 14.77% when the problem is divided into 1024 tiles, each with 1024 tiles.

Table 5.4 - Loop Tiling - 1 Slot

Base time (8 cores, 1 tile per MPI rank), wall clock time (s):
Run 1: 65.32, Run 2: 64.93, Run 3: 67.28, Run 4: 61.87, Run 5: 65.33; Average: 64.95, Deviation: 1.95

Symmetric tiling, wall clock time (s):
Tp | 128 | 256 | 512 | 1024 | 2048 | 4096
Tc | 128 | 256 | 512 | 1024 | 2048 | 4096
Run 1 | 57.61 | 57.55 | 57.14 | 57.96 | 56.59 | 58.86
Run 2 | 59.38 | 59.00 | 54.14 | 56.98 | 59.70 | 62.72
Run 3 | 59.71 | 60.49 | 53.97 | 56.28 | 61.64 | 66.58
Run 4 | 62.68 | 63.40 | 55.14 | 52.09 | 54.66 | 58.85
Run 5 | 61.39 | 57.81 | 57.58 | 53.49 | 60.43 | 68.52
Average | 60.15 | 59.65 | 55.96 | 55.36 | 58.06 | 63.05
Deviation | 1.95 | 2.39 | 1.68 | 2.47 | 2.89 | 4.41

It is then considered that altering the symmetry of the tiling may also affect runtime; therefore the problem is tiled in all possible combinations of powers of 2 whose product is equal to 1024 x 1024. No improvement is made by altering the symmetry, and in fact it quickly makes results worse. However, it is noted that the combinations either side of 1024 x 1024 do offer similar results to the optimal run shown by this experiment.

Asymmetric tiling (experiment 7.2), wall clock time (s):
Tc | 131072 | 65536 | 32768 | 16384 | 8192 | 4096 | 2048 | 512 | 256
Tp | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 2048 | 4096
Run 1 | 66.86 | 70.41 | 65.03 | 58.66 | 63.82 | 63.67 | 58.44 | 53.06 | 56.97
Run 2 | 62.72 | 59.59 | 60.28 | 64.48 | 58.91 | 57.48 | 53.18 | 57.55 | 55.49
Run 3 | 59.07 | 63.68 | 64.07 | 59.43 | 59.89 | 60.78 | 54.01 | 57.02 | 53.94
Run 4 | 59.35 | 58.39 | 62.76 | 62.54 | 60.08 | 59.66 | 57.00 | 56.36 | 57.55
Run 5 | 62.87 | 61.64 | 59.67 | 67.96 | 62.99 | 57.10 | 54.33 | 56.49 | 56.45
Average | 62.17 | 62.74 | 62.36 | 62.61 | 62.36 | 59.74 | 55.39 | 56.10 | 55.99
Deviation | 3.18 | 4.74 | 2.33 | 3.80 | 2.14 | 2.67 | 2.23 | 1.76 | 1.42

5.7 Loop Tiling - 8 Nodes

Here we see that over-decomposing the problem causes the application to slow down. It is considered that this is because application granularity needs to be optimised for network communication rather than cache re-use in this instance [69].

32-core tiling (experiment 8), wall clock time (s):
Tiling | 4096 x 4096 | 512 x 512 | 128 x 128 | 64 x 64 | 32 x 32 | 8 x 4
Run 1 | 53.44 | 34.34 | 28.31 | 25.61 | 27.94 | 25.36
Run 2 | 52.46 | 33.04 | 27.60 | 26.32 | 28.05 | 26.75
Run 3 | 52.90 | 32.63 | 26.88 | 26.65 | 26.65 | 25.70
Run 4 | 53.61 | 33.14 | 28.42 | 26.84 | 26.84 | 25.43
Run 5 | 54.50 | 32.45 | 28.60 | 25.70 | 25.70 | 26.07
Average | 53.38 | 33.12 | 27.96 | 26.22 | 27.04 | 25.86
Deviation | 0.7726 | 0.739 | 0.7141 | 0.5527 | 0.9769 | 0.5694



5.8 Summary

This chapter has completed a quantitative analysis of the CloverLeaf mini-app on both of the architectures previously presented in this report. It has discussed the observations made, and attempted to explain the results where possible. The following chapter will detail a critical evaluation of the methodology presented here, position this work amongst recent works in the field, and discuss the limitations and further work presented by this project.



Chapter 6: Project Evaluation

6.1 Project Objectives

The project objectives given in chapter 1 are:

1. Demonstrate an understanding of HPC, cloud computing and HPC mini-apps. - This is demonstrated by sections 2.1, 2.2, 2.3 and 2.4 of this report, which contain the background and literature review for this project.

2. Design and implement a range of experiments to form a methodology suitable to study the effects of virtualisation, and potentially containerisation on the performance of the CloverLeaf mini-app. - A range of experiments designed for this project are detailed in chapter 3 of this report.

3. Supply a quantitative analysis of the performance of CloverLeaf on the school's testbed using the described performance analysis methodology. - Chapter 5 completes a quantitative analysis of the results collected in this research.

4. Provide recommendations for the effective use of CloverLeaf on the cloud; if there is no such advice, explain why this application is not suitable for parallel execution on a virtualised infrastructure. - Discussion in chapter 5 has already supplied some recommendations for the effective use of the CloverLeaf mini-app on both ARC and the SoC testbed. Further recommendations will be given in section 6.2 of this report.

5. Describe the limitations of this project's methodology, and identify opportunities for further study presented by this topic. - Details of the limitations presented by this project, and suggestions for further work, have not yet been discussed. They are included in sections 6.3 and 6.5 of this report.

6.2 Contributions and recommendations

1. This project confirms that VM placement can make a substantial difference to execution time, as found in [42][43][44]. This study has revealed that utilising 4 VMs of 4 cores each is an optimal placement for virtual machines when deploying onto a single node of the SoC testbed. While utilisation of a whole node cannot be recommended if performance is the main concern for the user, utilisation of whole nodes can lead to reduced energy consumption, so this finding is still significant to this project.

2. Due to the low communication overhead and low I/O costs associated with CloverLeaf, utilising more physical hosts can reduce the load on each CPU, meaning that runtime is reduced. Hence striping VMs across physical hosts is recommended. It cannot be confirmed that these results hold for more than 8 physical hosts.

Page 55: School of Computing - Blackboard Learn€¦ · AMD Advanced Micro Devices Manufacturer of ... MPI Message Passing Interface The standard programming ... MPI 2 processors profile 60

- ! -43

3. Docker has been shown to offer an improvement over a traditional VM in some cases; however, returning to validate these results produced variations in times of over 100% when using OpenMP. It is therefore not recommended to use OpenMP alone when executing parallel applications in a Docker container. Results here contrast with those seen in [46][47]. It is considered that this may be an effect of layered virtualisation, and that it may not hold if the Docker container is executed on bare metal.

4. In some cases, using one OpenMP thread per MPI process offers good performance; it is likely this is a result of the reduced cost of inter-thread communication and the loop-level parallelism offered by OpenMP. This has been consistent across all platforms, so it should be considered as an execution option when running CloverLeaf.

5. CloverLeaf allows the user to optimise the tiling pattern for there system, therefore time should be spent investigating this on there system. Optimisations made in this way are platform dependent therefore it can not be concluded that the successful configurations used here will be consistent on other systems.
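To illustrate the hybrid execution pattern referred to in recommendation 4, a minimal sketch of an MPI program using OpenMP loop-level parallelism is given below. It is illustrative only and is not taken from CloverLeaf; the array size N and the per-cell work are placeholders.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000  /* placeholder problem size per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *field = malloc(N * sizeof(double));
    for (int i = 0; i < N; ++i)
        field[i] = (double)i;

    double local_sum = 0.0;

    /* Loop-level parallelism within the rank: each MPI process
       spreads the loop over OMP_NUM_THREADS threads. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < N; ++i)
        local_sum += field[i] * 0.5;   /* stand-in for real cell work */

    /* Distributed-memory step: combine the per-rank results. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               size, omp_get_max_threads(), global_sum);

    free(field);
    MPI_Finalize();
    return 0;
}

Running such a program with OMP_NUM_THREADS=1 and one MPI process per core reproduces the one-thread-per-process configuration discussed above; increasing the thread count shifts work from message passing to shared-memory threads.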

6.3 Evaluation of Methodology

The methodology used here is motivated by the work in [42], [43] and [44], and as such it has utilised methods taken from these studies. Results are averaged over 5 runs, as in [42]; however, this appears to be insufficient to capture the full range of times that can be expected on the SoC test bed. It is noted that in [42] runs are not taken consecutively but at different times, with the system subjected to different workloads. Due to the duration of this project, and the fairly stable load of the testbed used, it has not been possible to do this here. As such, the results here do not carry the same validity as those in [30][45]. [45] averages results over 50 executions for each configuration; a middle ground between 5 and 50 may be more appropriate to gather sufficient information on the SoC cloud test bed.

The approach taken to parameter selection by this project is also clearly limited. An exhaustive search of all available parameters would be preferable for an empirical investigation, as in [50]. This limitation is due to time constraints and therefore cannot be remedied here, but it remains a criticism of the methodology.

Here we have used a private cloud that has a fairly low utilisation, and node syntax has been used on ARC with the optimal placement setting. This has allowed these experiments to capture the effects of infrastructure and architecture rather than the utilisation model of the two systems. The focus has been on optimising placement and conditions as far as possible; therefore this methodology does not fully capture the effects of multi-tenancy on either system. Further, it is impossible to fully quantify a platform's ability to process HPC workloads from timing runs of only one application. Ideally, a range of mini-apps covering a range of common HPC problems would be used.
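As an illustration of how the spread of repeated timings could be assessed when choosing a sample size, the following sketch computes the mean, sample standard deviation and coefficient of variation of a set of wall-clock times. The timing values shown are hypothetical and stand in for measurements read from the experiment output.

#include <math.h>
#include <stdio.h>

/* Summarise repeated wall-clock timings of one configuration.
   The values below are hypothetical placeholders. */
int main(void)
{
    double t[] = {44.96, 46.12, 45.30, 47.05, 44.51};   /* seconds */
    int n = sizeof(t) / sizeof(t[0]);

    double mean = 0.0;
    for (int i = 0; i < n; ++i)
        mean += t[i];
    mean /= n;

    double var = 0.0;
    for (int i = 0; i < n; ++i)
        var += (t[i] - mean) * (t[i] - mean);
    var /= (n - 1);                     /* sample variance */

    double sd = sqrt(var);
    printf("n=%d mean=%.2fs sd=%.2fs cv=%.1f%%\n",
           n, mean, sd, 100.0 * sd / mean);
    return 0;
}

If the coefficient of variation remains large after 5 runs, more repetitions would be needed before averaged configurations can be compared with confidence.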


There have been many attempts to verify results during this project. While this has not always confirmed the findings, it has provided insight into the highly dynamic nature of deploying HPC clusters on a cloud platform.

6.4 Positioning of project outcomes

The criteria selected to position this work are collected from the body of work reviewed in section 2.4 of this project. They are selected because they appear to be the tasks most commonly achieved by work in this field. The only criterion not covered by the closely related work is the comparison of multiple programming models on multiple architectures. It is selected as a criterion here because it is an important aspect of this work and is motivated by the work in [37].

This work is strongly motivated by the work in [42], [43] and [46]; however, it differs because layered virtualisation is explored. In addition, a variety of implementations of the same application are used to explore the differences between programming models on two contrasting architectures. This work by no means covers the range of questions posed in [42], nor does it cover as thorough a range of experiments. Using one application severely limits this study's findings in ascertaining the suitability of the SoC testbed for HPC applications.

Table 6.1 - Positioning of this work against [42], [43] and [46] (each x denotes one of the four works meeting the criterion)

Deployed multiple applications to fully benchmark infrastructure: x x x

Compared virtualisation methods: x x x

Benchmarked the performance of cloud against bare metal HPC: x x

Compared virtualisation with the host system: x x

Compared shared, distributed and hybrid programming models using the same application on different platforms: x

Conducted experiments into the placement of VMs: x x

Applied optimisations to HPC applications on the cloud: x x

Performed application profiling to identify performance characteristics: x x

Performed scale up and scale out experiments: x x x


6.5 Limitations

A range of limitations has been identified in this piece of research; these are listed below.

1. Experiments are limited to a range of parameters and metrics that do not fully convey the possible benefits and drawbacks of utilising cloud for HPC.

2. Docker Swarm has not been utilised in this project. Since Docker recommends a single process per container, it may be that performance can be improved by using a cluster of containers, as has been concluded for VMs.

3. HPC is concerned with large-scale deployment on massively parallel architectures, whereas experimentation here is limited to 32 cores. This somewhat limits the use of this research for those concerned with large core counts.

4. Working on a private cloud does not expose the experiments to the hardware heterogeneity of public cloud, or to the communication throughput and latency costs expected on such platforms; therefore it is not possible to conclude that any of these experiments would offer the same results on a public cloud.

5. It is not possible to ascertain the exact frequency of the CPUs on either system. Both systems incorporate turbo boost, and the exact frequency at any given time is unknown.

6. The overhead of virtualisation could be more accurately captured by benchmarking the bare metal of the SoC cloud testbed.

7. Isolation of servers on both ARC and the cloud testbed limits the noise entering the experiments; however, it does not fully capture the effects of multi-tenancy.

8. The range of applications is limited to one; clearly this does not cover the range of applications required in an HPC setting.

9. We have seen that CloverLeaf performs well when striped across physical hosts; therefore employing a combination of load-aware round-robin scheduling with data striping may offer improved performance for the 32-core cluster demonstrated here. The only reason this has not been explored is the time constraint, so it would make for a straightforward further investigation.

6.6 Future Work

6.6.1 Further Investigation Into Application Behaviour

The investigatory work that has been done here has used a variety of monitoring and profiling solutions. However, there are clearly more ways to analyse application behaviour than are presented here. The UK-MAC provide a set of tools that can be used for


investigating application behaviour. Of particular interest to this project would be the WM heatmap tool [79], which returns memory utilisation information to the user.

6.6.2 Optimisation

Gupta [42] shows that adjusting networking parameters, including datagram size, can be successful in the optimisation of HPC applications on cloud platforms. Furthermore, CloverLeaf is said to be affected by the huge page library, so an interesting area of further work may be investigating page size and datagram size together with tiling, to ascertain an optimal combination of network configuration and tiling.
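As a sketch of the kind of sweep this would involve, the following loop times a simple tiled array update for several candidate tile sizes. It is illustrative only: the grid size N, the kernel body and the candidate tile sizes are placeholders, and CloverLeaf's own tiling is controlled through its input deck rather than through code of this form. Page size and datagram size would be varied at the system and MPI level rather than in application code.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048                     /* placeholder grid edge length */

static inline int imin(int x, int y) { return x < y ? x : y; }

int main(void)
{
    double *a = calloc((size_t)N * N, sizeof(double));
    double *b = calloc((size_t)N * N, sizeof(double));
    int tiles[] = {16, 32, 64, 128};   /* candidate tile sizes */

    for (int t = 0; t < 4; ++t) {
        int T = tiles[t];
        clock_t start = clock();

        /* Tiled traversal: process the grid in T x T blocks to improve
           cache reuse, following the loop tiling shown in Figure a.3. */
        for (int jj = 0; jj < N; jj += T)
            for (int ii = 0; ii < N; ii += T)
                for (int j = jj; j < imin(N, jj + T); ++j)
                    for (int i = ii; i < imin(N, ii + T); ++i)
                        b[j * N + i] = 0.5 * (a[j * N + i] + a[j * N + (i + 1) % N]);

        /* CPU time for this tile size */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("tile=%4d time=%.3fs\n", T, secs);
    }

    free(a);
    free(b);
    return 0;
}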

6.6.3 Containers

It is possible to deploy containers directly onto the bare metal infrastructure of an OpenNebula cloud using OpenNebula's oneDock plugin. This is not installed on the university's systems, but would be expected to offer improved performance over the approach taken here by removing a virtualisation layer. Benchmarking of container solutions that are more popular in the scientific computing community, such as Shifter and Rancher, would also be of interest, as these solutions have shown better performance in some cases.

6.6.4 Request / Load aware Scheduling

The idea of a cloud-enabled HPC cluster is discussed in [41]. It is noted that a large number of users of ARC are actually running programs using only one process, and many more are using between 1 and 16. Clearly there is potential to load share between cloud and HPC here at the University of Leeds, even if only for those jobs where the number of processes is small.

6.6.5 Tiny Core Linux on a VM

Investigating the performance of cloud applications using an in-memory operating system such as Tiny Core Linux on a VM is another avenue. This was attempted during this project; effort was spent building the gfortran compiler from source, only to find that some of the required libraries are unavailable for Tiny Core Linux. I did, however, get MPI up and running with a small program written in C on Tiny Core Linux, so this would be an easy area to explore further. Building the libraries required so that a gfortran compiler is available is another option, but unfortunately beyond the remit of this project. It is also now known that a g95 compiler can be compiled from source successfully on some versions of Tiny Core Linux. This was not explored further and it is not confirmed that this will work on the boot2Docker release of Tiny Core.
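For reference, the kind of small MPI test program referred to above need be no more than the following (a minimal sketch, not the exact program used during the project):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    /* Each rank reports where it is running, confirming the MPI
       runtime can launch and communicate across the VM(s). */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun, this confirms that ranks can be started and can report the host they are running on, which is sufficient to verify that the MPI runtime works on a Tiny Core image.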


6.6.6 Benchmark Containers on ARC

It had been hoped there would be enough time during this project to benchmark a container solution against bare metal HPC by running a container-based solution called Singularity on the ARC systems. Due to the closure of the ARC1 facility this was not possible. It would allow the researcher to explore the effects of virtualisation on an InfiniBand network and offer a more accurate judgement of the effects of containerisation on HPC workloads. The image that has been used for this project is suitable for many HPC container solutions, so this is a relatively simple extension of this work.

6.6.7 Increase Range of Applications

This study has limited the number of applications to one; clearly a greater range of applications is required to fully understand the performance characteristics of the SoC cloud testbed. A range of mini-apps covering a spectrum of HPC problems could be selected from [80]. This selection might include:

1. Implicit unstructured partial differential equations: MiniFE

2. Explicit unstructured partial differential equations: phdMesh

3. Explicit structured partial differential equations: MiniGhost

4. Molecular dynamics: MiniMD

5. Hydrodynamics: CloverLeaf

6. Circuit simulation: MiniXyce

It is expected this would develop further understanding of the application characteristics that may define an application's ability to scale on the cloud.

6.6.8 Application Characterisation

In [42] it is suggested that applications should be characterised to ascertain their suitability for execution on a given platform. It is suggested here that a combination of the Java application that instantiates the cluster, the Python script that evaluates performance data, and the experiments performed could be used to drive an application-characterising system, such that an application could be benchmarked effectively on an OpenNebula cloud. Utilising the Java application to run experiments, such that the cluster is instantiated each time an experiment is performed, may offer a more accurate view of performance. This would also allow testing over a period of time and under different workloads.

Cluster performance statistics may be queried from the InfluxDB database; given access to the underlying resources, the same metrics could be gathered from the underlying hardware and aggregated into statistics that quantify the performance in terms of:


1. Performance

2. Efficiency

By extending this characterising tool to include a pricing model, and by using the energy monitors provided on the SoC testbed, further metrics such as:

3. Energy use

4. Cost

could be added, offering the user a choice of metric on which to schedule their application.


Chapter 7: Conclusion

This report has sought to explore the effects of different programming models when executed on different architectures. It has also compared virtualisation methods that are currently considered viable solutions for HPC on the cloud, and conducted experiments into virtual machine placement to ascertain the feasibility of using the cloud to execute the CloverLeaf mini-app. Further, we have seen how mini-apps can help researchers conduct such feasibility studies.

We have shown that the optimal choice of programming model is platform dependent, and furthermore that it depends on the configuration of that platform. We have also confirmed that virtual machine placement does affect the execution time of the CloverLeaf mini-app, and that a layered virtualisation method can improve application performance under the correct conditions. Further, we have shown that while this performance advantage can be gained, it is not a reliable gain.

This has demonstrated that loosely coupled cloud architectures can be suitable for the execution of the CloverLeaf mini-app, but that performance reductions will be seen if virtual machine placement is not optimal. The CloverLeaf mini-app has been key to the loop tiling experiments explored, and this has shown how mini-apps can be useful for exploring prototype architectures.

The findings from this research are important as they confirm that the cloud can be used for HPC if the application has the right characteristics and time is spent understanding the application being used.


List of References

[1] UK Mini-App Consortium. A Lagrangian-Eulerian hydrodynamics benchmark. [Online]. 2017. [Accessed 14th March 2017]. Available from: http://uk-mac.github.io/

[2] Zabbix. The enterprise-class monitoring solution for everyone. [Online]. 2001. [Accessed 14th March 2017]. Available From: http://www.zabbix.com/

[3] Influxdata. Data analytics solutions provider. [Online] 2017. [Accessed 14th March 2017]. available From: https://www.influxdata.com/

[4] Grafana. Open source time series analytics. [Online]. 2017. [Accessed 14th March 2017]. Available From: https://grafana.com/

[5] Eckert, W.J., Jones, R. Faster, Faster: A simple description of a giant electronic calculator and the problems it solves. McGraw-Hill, New York. 1955.

[6] Top500. The Top500 list for November 2016. [Online]. [Accessed 14th March 2017]. Available from: https://www.top500.org/lists/2016/11/

[7] Kormonicki, A., Mullen-Schulz, G., Landon, D. Roadrunner: Hardware and software overview. [Online] 2017. [Accessed 30th April 2017]. Available From: http://www.redbooks.ibm.com/redpapers/pdfs/redp4477.pdf

[8] Sodani, S., Gramunt, R., Corbal, J., Kim, H., Vinod, K., Chinthamani, S., Hutsell, S., Agarwal, R. and Liu, Y. Knights Landing: Second-Generation Intel Xeon Phi Product. [Online]. 2017. [Accessed 30th April 2017]. Available from: http://ieeexplore.ieee.org/document/7453080/

[9] Feldman, M. 2015. Life Beyond Moore's Law. [Online]. [Accessed 30th April 2017]. Available from: https://www.top500.org/news/life-beyond-moores-law/

[10] Mohr, B. 2014. The Human Brain Project. [Online]. [Accessed 30th April 2017]. Available from: https://www.top500.org/news/the-human-brain-project-will-push-the-boundaries-of-supercomputing/

[11] Schulte, M.J., Loh, G., Beckman, B., Brantley, W., Gurumurthi, S., Jayasena, N., Paul, I., Reinhardt, S. and Rodgers, G. Achieving Exascale Capabilities through Heterogeneous Computing. [Online]. [Accessed 30th April 2017]. Available from: http://ieeexplore.ieee.org/document/7155462/?part=1

[12] Nielsen, F. Introduction to HPC with MPI for Data Science. London, Springer. 2016.

[13] Weidner, O., Atkinson, M., Barker, A. and Vicente, R.F. Rethinking HPC Platforms: Challenges, Opportunities and Recommendations. [Online]. 2017. [Accessed 30th April 2017]. Available from: http://arxiv.org/abs/1702.05513.

[14] Flynn, M. J., Rudd, K.W. Parallel Architectures, ACM Computing Surveys. [Online].1996. 28(1). [Accessed 30th April 2017]. Available from: https://www.researchgate.net/publication/220267962_On_Parallel_architectures


[15] Gupta, S.R., Shinde, N. Architectures of Flynn’s taxonomy — comparison of methods. [Online]. [Accessed 30th April 2017]. Available from: http://ijiset.com/vol2/v2s9/IJISET_V2_I9_17.pdf

[16] Segal, O., Nasibeh, N. and Margala, M. 2014. High Level Programming of FPGAs for HPC and Data Centric Applications. [Online]. [Accessed 30th April 2017]. Available from: http://www.ieee-hpec.org/2014/CD/index_htm_files/FinalPapers/89.pdf

[17] HPC: Advisory Council. Interconnect Analysis: 10GigE and InfiniBand in High Performance Computing. [Online]. [Accessed 14th March 2017]. Available from: http://www.hpcadvisorycouncil.com/pdf/IB_and_10GigE_in_HPC.pdf

[18] Lustre, Homepage. [Online].[Accessed 29th March 2017]. Available from: http://lustre.org/

[19] Understanding Lustre Filesystem Internals. [Online]. 2009. [Accessed 29th March 2017]. Available from: http://wiki.old.lustre.org/images/d/da/Understanding_Lustre_Filesystem_Internals.pdf

[20] University of Liverpool. Son of Grid Engine Project Homepage. [Online]. 2017.[Accessed 14th March 2017]. Available from: https://arc.liv.ac.uk/trac/SGE

[21] Slurm. Workload Manager. [Online]. 2017. [Accessed 29th March 2017]. Available from: https://slurm.schedmd.com/

[22] PBS Professional. Open Source Project. [Online]. 2017. [Accessed 24th March 2017]. Available from: http://www.pbspro.org/

[23] Eijkhout, V. 2014. Introduction to High Performance Scientific Computing. [Accessed 14th March 2017]. Available from: http://pages.tacc.utexas.edu/~eijkhout/Articles/EijkhoutIntroToHPC.pdf

[24] Top 500. The list. [Online]. 2017. [Accessed 14th March 2017]. Available from: https://www.top500.org/project/

[25] Kramer, W. Proceedings of the 21st international conference on Parallel architectures and compilation techniques p. 223-230 [Online]. [Accessed 2nd May 2017] http://dl.acm.org/citation.cfm?id=2370816

[26] Vaquero, L.M., Rodero-Merino, L., Caceres, J. and Lindner, M. A Break in the Clouds: Towards a Cloud Definition. [Online]. [Accessed 29th March 2017]. Available from: http://www.research.ibm.com/haifa/projects/systech/reservoir/public/CloudDefinitionPaper.pdf

[27] Mell, P. and Grance, T. The NIST definition of cloud computing NIST Special Publication. [Online]. 2011, 145, pp. 7. Available from: http://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-145.pdf

[28] Linux. Kernel Virtual Machine. [Online]. 2017. [Accessed 2nd May 2017]. Available from: https://www.linux-kvm.org/page/Main_Page


[29] Docker. Homepage. [Online]. 2017. [Accessed 29th March 2017]. Available from: https://www.docker.com/

[30] Canonical. Linux Containers. [Online]. 2017. [Accessed 14th March 2017]. Available from: https://linuxcontainers.org/lxc/introduction/

[31] Open Nebula. OpenNebula Homepage. [Online]. 2017. [Accessed 14th March 2017]. Available from: https://opennebula.org/

[32] Open MPI. Open Source High Performance Computing. [Online]. 2017. [Accessed 2nd May 2017]. Available from: https://www.open-mpi.org/

[33] MPICH. Homepage. [Online]. 2017. [Accessed 29th March 2017]. Available from: https://www.mpich.org/

[34] MVAPICH. MPI over InfiniBand, Omni-Path, Ethernet/iWARP, and RoCE. [Online]. 2017.[Accessed 29th March 2017]. Available from: http://mvapich.cse.ohio-state.edu/

[35] Hager, G. and Wellein, G. Introduction to High Performance Computing for Scientists and Engineers. London, Taylor and Francis. [Online]. 2011. [Accessed 29th March 2017]. Available from: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf

[36] Heroux, M.A., Doerfler, D.W., Crozier, P.S., Willenbring, J.M., Edwards, H.C., Williams, A., Rajan, M., Keiter, E.R., Thorn, H.K. and Numrich, R.W. Improving Performance via Mini-applications (Mantevo Overview). [Online]. 2009. [Accessed 29th March 2017]. Available from: https://mantevo.org/MantevoOverview.pdf

[37] Mallinson, A.C., Beckingsale, D.A., Gaudin, W.P. and Herdman, J.A. CloverLeaf : Preparing Hydrodynamics Codes for Exascale. Proceedings of Cray User Group. (i). [Online]. 2013. [Accessed 30th March 2017]. Available from: https://cug.org/proceedings/cug2014_proceedings/includes/files/pap124.pdf

[38] Barrett, R.F., Borkar, S., Dosanjh, S.S., Hammond, S.D., Heroux, M.A., Hu, X.S., Luitjens, J., Parker, S.G., Shalf, J. and Tang, L. On the role of co-design in high performance computing. Advances in Parallel Computing. [Online]. 2013, 24, pp.141–155. Available from: https://pdfs.semanticscholar.org/56d8/b4d8bee2d1659d54b50399a4dd810fd9a862.pdf

[39] Hansen, R. R&D 100 awards Magazine. [Online]. 2013. [Accessed 29th March 2017]. Available from: https://str.llnl.gov/content/pages/october-2013/pdf/10.13.5.pdf

[40] The Atomic Weapons Establishment. Applying OpenACC to the CloverLeaf Hydrodynamics Mini-App. [Online]. 2017. [Accessed 29th March 2017]. Available from: http://www.cray.com/sites/default/files/resources/OpenACC_213462.7_OpenACC_Cloverleaf_CS_FNL.pdf


[41]Li, C., Chen, T., Chen, Y., Wang, S. and Computing, A.C. 2011. Formosa3 : a Cloud-Enabled HPC Cluster in NCHC. ,pp.192–198.

[42] Gupta, A., Faraboschi, P., Gioachin, F., Kale, L.V., Kaufmann, R., Lee, B., March, V., Milojicic, D. and Suen, C.H. Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud. IEEE Transactions on Cloud Computing. [Online]. 2014, 7161, pp.1–1. Available from: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6858018

[43] Gupta, A., Milojicic, D. and Kalé, L.V. Optimizing VM Placement for HPC in the Cloud. Proceedings of the 2012 Workshop on Cloud Services, Federation, and the 8th Open Cirrus Summit - FederatedClouds '12. [Online]. 2012. [Accessed 14th March 2017]. Available from: http://dl.acm.org/citation.cfm?doid=2378975.2378977

[44] Gupta, A., Kalé, L.V., Milojicic, D., Faraboschi, P. and Balle, S.M. HPC-Aware VM Placement in Infrastructure Clouds. Proceedings of the IEEE International Conference on Cloud Engineering, IC2E 2013. [Online]. 2013, pp.11–20. [Accessed 14th March 2017]. Available from: http://charm.cs.illinois.edu/newPapers/13-01/paper.pdf

[45] Hussein, M. and Moussa, M. H. High-performance execution of scientific Multi-Physics Coupled applications in a private cloud. [Online]. 2014. 4(2),pp.4–7. [Accessed 24th March 2017]. Available from: https://www.ijarcsse.com/docs/papers/Volume_4/2_February2014/V3I12-0377.pdf

[46] Zhou, Y., Subramaniam, B., Keahey, K. and Lange, J. Comparison of Virtualization and Containerization Techniques for High-Performance Computing. [Online]. 2017. [Accessed 14th March 2017]. Available from: http://www.prognosticlab.org/~jarusl/pubs/sc-2015-abstract.pdf

[47] Zhou, Y., Subramaniam, B., Keahey, K. and Lange, J. Comparison of Virtualisation and Containerisation Techniques for High Performance Computing. [Online]. 2017. [Accessed 29th March 2017]. Available from: http://sc15.supercomputing.org/sites/all/themes/SC15images/tech_poster/poster_files/post239s2-file2.pdf

[48] Younge, A.J., Henschel, R., Brown, J.T., Laszewski, G. Von, Qiu, J. and Fox, G.C. Analysis of virtualization technologies for high performance computing environments. 2011 IEEE 4th International conference on cloud computing. [Online]. 2011, pp.9–16. [Accessed 29th March 2017]. Available from: http://dl.acm.org/citation.cfm?id=2055437.2055572

[49] Amdahl, G.M. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings vol. 30 (Atlantic City, N.J., apr. 18–20). AFIPS Press, Reston, Va. [Online]. 1967, pp. 483–485. [Accessed 29th March 2017]. Available from: http://dl.acm.org/citation.cfm?id=1465560


[50] Gustafson, J.L. Reevaluating Amdahl's Law. Communications of the ACM. [Online]. 1988, 31(5), pp.532–533. [Accessed 30th March 2017]. Available from: http://dl.acm.org/citation.cfm?id=42415

[51] GNU. GCC 5 Release page. [Online]. 2017. [Accessed 29th March 2017]. Available from: https://gcc.gnu.org/gcc-5/

[52] Intel. Intel Xeon Processor E5-2630 v3 homepage. [Online]. 2017. [Accessed 14th March 2017]. Available from: http://ark.intel.com/products/83356/Intel-Xeon-Processor-E5-2630-v3-20M-Cache-2_40-GHz#@ordering

[53] Intel. Intel Xeon Processor E5-2670 homepage. [Online]. 2017. [Accessed 30th March 2017]. Available from: http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI

[54] Mantevo. CloverLeaf homepage. [Online]. 2017. [Accessed 25th April 2017]. Available from: https://github.com/Mantevo/CloverLeaf

[55] The University of Leeds. Advanced Research Computing. [Online]. 2017. [Accessed 19th March 2017]. Available from: http://arc.leeds.ac.uk/systems/arc2/

[56] Linux, NFS: Network File System [Online]. 2017. [Accessed 29th March 2017]. Available from: https://sourceforge.net/projects/nfs/

[57] Docker [Online]. 2017. [Accessed 25th April 2017]. Available from: https://docs.docker.com/machine/

[57] Docker Machine. Docker-Machine [Online]. 2017. [Accessed 19th March 2017]. Available from: https://github.com/OpenNebula/docker-machine-opennebula/blob/master/README.md

[58] Boot2Docker. Lightweight Linux for Docker.[Online]. 2017. [Accessed 19th March 2017]. Available from: http://boot2docker.io/

[59] Tiny Core. A modular based system with community built extensions [Online]. 2017. [Accessed 25th April 2017]. Available from: http://www.tinycorelinux.net/intro.html

[60] NERSC. Docker Hub: nersc/centos-mpi. [Online]. 2017. [Accessed 25th April 2017]. Available from: https://hub.docker.com/r/nersc/centos-mpi/

[61] Alpine. Alpine homepage. [Online]. 2017. [Accessed 13th March 2017]. Available from: https://alpinelinux.org/

[62] Nguyen, N. Alpine MPICH. [Online]. 2017. [Accessed 19th March 2017]. Available from: https://hub.docker.com/r/nlknguyen/alpine-mpich/

[63] ARM. Allinea MAP parallel profiler.[Online]. 2017. [Accessed 13th March 2017]. Available from: Available from: https://www.allinea.com/products/map

[64] mpiP. Parallel profiler. [Online]. 2017. [Accessed 21st April 2017]. Available from: http://mpip.sourceforge.net/


[65] OpenNebula. OpenNebula Cloud API Specification. [Online]. 2017. [Accessed 21st April 2017]. Available from: http://docs.opennebula.org/doc/4.12/oca/java/

[66] Love, D. Grid Engine Configurations [Online]. 2017. [Accessed 13th March 2017]. Available from: http://arc.liv.ac.uk/SGE/howto/sge-configs.html

[67] top. Linux man page for the top command. [Online]. Available from: https://linux.die.net/man/1/top

[68] Wilkinson, B. and Allen, M. Parallel Programming: Techniques and applications using Networked Workstations and Parallel Computers. 2nd ed. London, Pearson Education International. 2005. [Accessed 23rd March 2017]. Available from http://webpages.uncc.edu/abw/parallel/par_prog/PPBook.pdf

[69] Dragoi, C., Henzinger, T.A. and Zufferey, D. 2016. PSYNC: A Partially Synchronous Language for Fault-Tolerant Distributed Algorithms. POPL. [Online], pp.1–16. Available from: http://www.di.ens.fr/~cezarad/popl16.pdf

[70] WRF. The Weather Research and Forecasting Model [Online]. 2017. [Accessed 13th March 2017]. Available from: http://www.wrf-model.org/index.php

[71] HPCC. The HPC Challenge Benchmark. [Online]. 2017. [Accessed 10th May 2017]. Available from: http://icl.cs.utk.edu/hpcc/

[72] CentOS. CentOS homepage. [Online]. 2017. [Accessed 10th May 2017]. Available from: https://www.centos.org/download/

[73] SSH. The SSH protocol. [Online]. 2017. [Accessed 10th May 2017]. Available from: https://www.ssh.com/ssh/protocol/

[74] Ubuntu. The Ubuntu homepage. [Online]. 2017. [Accessed 10th May 2017]. Available from: https://www.ubuntu.com/

[75] Git. Git homepage. [Online]. 2017. [Accessed 10th May 2017]. Available from: https://git-scm.com/

[76] UK Mini-App Consortium. UK-MAC homepage. [Online]. 2017. [Accessed 10th May 2017]. Available from: http://uk-mac.github.io/

[77] Mellanox. Introduction to InfiniBand. [Online]. 2017. [Accessed 10th May 2017]. Available from: http://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf

[78] AMD. AMD Homepage [Online]. 2017. [Accessed 10th May 2017]. Available from: http://www.amd.com/en/home

[79] UK Mini-App Consortium. WM tools [Online]. 2017. [Accessed 10th May 2017]. Available from: https://github.com/UK-MaC/WMTools


[80] Mantevo. Mantevo homepage. [Online]. 2017. [Accessed 10th May 2017]. Available from: https://mantevo.org/


Appendix A: External Materials

Figure a.1 - WRF results

Figure a.2 - STREAM benchmark results

Function   ARC (MB/s)   SoC (MB/s)
Copy       6103.9       6040.0
Scale      5891.9       5715.1
Add        8391.3       7743.4
Triad      8083.8       7587.5

Figure a.3 - Loop tiling, where T is the tile size

Instead of

for(i=0; i<N; ++i){
    ...
}

the loop is tiled as

for(j=0; j<N; j+=T)
    for(i=j; i<min(N, j+T); ++i){
        ...
    }


Figure a.4 - Scale Up Wall Clock

Figure a.5 - Scale Up Efficiency

Figure a.6 - MPI 16 processors profile

Experiment 1: Wall Clock

Cores   VM             ARC2
2       151.78 (TPP)   136.01 (TPP)
4       86.10 (OMP)    74.08 (TPP)
8       64.90 (TPP)    46.98 (TPP)
16      44.96 (TPP)    34.27 (TPP)

Experiment 1: Efficiency

Cores   VM     ARC2
2       1.04   0.94
4       0.92   1.02
8       0.80   0.80
16      0.44   0.52


Figure a.7 - MPI 2 processors profile


Figure a.8 - CloverLeaf Profile

Profiler Output Time Percentage

Timestep 4.5982 5.5126

Ideal Gas 3.5903 4.3043

Viscosity 3.1375 3.7615

PdV 10.1378 12.1540

Revert 1.7525 2.1010

Acceleration 4.3621 5.2296

Fluxes 3.3004 3.9615

Cell Advection 13.6200 16.3287

Momentum Advection 28.7953 34.5219

Reset 2.9002 3.4769

Summary 0.2716 0.2609

Visit 0.0000 0.0000

Tile Halo Exchange 0.0013 0.0015

Self Halo Exchange 0.1453 0.1742

MPI Halo Exchange 6.8446 8.2058

Total 83.4071 99.9945

Rest 0.0046 0.0055


Appendix B: Ethical Issues Addressed

This work carries no ethical issues. It is noted that public access to HPC in the cloud does raise security concerns, and that research into nuclear weapons carries serious ethical issues; however, this work does not facilitate either.