CA670 - Concurrent Programming
Introduction
David Sinclair
Introduction
Overview
This module will introduce you to:
• Basic concepts
  • Resource contention
  • Synchronisation
  • Scalability & Performance
• Concurrency in Java
  • Java Threads
  • Synchronisation in Java
• Design Strategies
• OpenMP
• OpenCL
• Formal Specification of Concurrent Systems
Texts
Supplementary:
• D. Lea, Concurrent Programming in Java: Design Principles and Patterns, 1999, 978-0201310092
• S. Oaks and H. Wong, Java Threads, 2004, 978-0596007829
• B. Goetz, T. Peierls, J. Bloch, J. Bowbeer, D. Holmes, and D. Lea, Java Concurrency in Practice, 2006, 978-0321349606
• G. Barlas, Multicore and GPU Programming: An Integrated Approach, 2014, 978-0124171374
• C. A. R. Hoare, Communicating Sequential Processes, 1985, 978-0131532892
• R. Milner, Communicating and Mobile Systems: the π-calculus, 1999, 978-0521658690
Contact Details
Lecturer: David Sinclair
Office: L253
Phone: 5510
Email: [email protected]
Web: http://www.computing.dcu.ie/~davids
Course web page: http://www.computing.dcu.ie/~davids/CA670.html
How do I successfully complete this module?
The module mark is a straight weighted average of:
• 25% continuous assessment
  • 2 elements:
    • Java Threads (10%)
    • Multicore/GPGPU (15%)
• 75% end-of-semester examination
  • 3 hour exam
  • Answer 5 out of 6 questions.
What if I don’t successfully complete this module?
If you fail the module after the end-of-semester exams then you will need to repeat some elements of the assessment.

• If you just failed the exam, you can resit the exam in the Autumn.
• If you just failed the continuous assessment, then you must complete a resit assignment.
  • Contact me after the results are published for the resit assignment.
• If you failed both the exam and the continuous assessment, you must repeat both.

If you fail the module after the resit examination and continuous assessment, you must repeat all aspects of the module.
Introduction
A concurrent program is the interleaving of sets of sequential atomic instructions.

• These sequential processes execute at the same time, on the same or different computational units (processors, cores, GPUs).
• Interleaving means that at any given time each processor is executing one of the instructions of the sequential processes.
• The relative rate at which the instructions of each process are executed is not important.

Each sequential process consists of a series of atomic instructions.

• An atomic instruction is an instruction that, once it starts, proceeds to completion without interruption.
• Different processors have different atomic instructions, and this can have a big effect.
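These ideas can be made concrete with a minimal Java sketch (a hypothetical example, not from the notes). A plain `count++` is a read-modify-write sequence whose steps can interleave between threads, whereas `AtomicInteger.incrementAndGet()` executes as a single atomic instruction:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicDemo {
    // Shared counter updated atomically; a plain int here would race.
    static final AtomicInteger count = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                count.incrementAndGet(); // atomic: cannot be interleaved mid-update
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count.get()); // always 20000
    }
}
```

With a plain shared `int` and `count++` the printed total could be anything up to 20000, because the read and write steps of the two threads can interleave.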
Correctness
If P(a) is a property of the input (pre-condition), and Q(a, b) is a property of the input and output (post-condition), then correctness is defined as either:

Partial correctness: (P(a) ∧ terminates(Prog(a, b))) ⇒ Q(a, b)

Total correctness: P(a) ⇒ (terminates(Prog(a, b)) ∧ Q(a, b))

Totally correct programs terminate.
A totally correct specification of the incrementing task is:

a ∈ ℕ ⇒ (terminates(INC(a, a′)) ∧ a′ = a + 1)

where a′ is the value of a after INC executes.
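The pre- and post-condition can be checked mechanically at runtime. The sketch below (a hypothetical `inc` routine, assuming assertions are enabled with `java -ea`) encodes the pre-condition a ∈ ℕ and the post-condition a′ = a + 1 as assertions:

```java
public class IncSpec {
    // Hypothetical increment routine used to illustrate the specification.
    static int inc(int a) { return a + 1; }

    public static void main(String[] args) {
        int a = 41;
        assert a >= 0;          // pre-condition P(a): a is a natural number
        int aPrime = inc(a);    // Prog terminates (a single expression)
        assert aPrime == a + 1; // post-condition Q(a, a')
        System.out.println(aPrime); // 42
    }
}
```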
Correctness (2)
There are 2 types of correctness properties:

• Safety properties: These must always be true.
  • Mutual exclusion: Two processes must not interleave certain sequences of instructions. Typically this involves changing the state of a shared resource, e.g.:
    • updating the value of a shared variable;
    • 2 or more processes updating a shared file.
  • Absence of deadlock: Deadlock is when a non-terminating system cannot respond to any signal.
• Liveness properties: These must eventually be true.
  • Absence of starvation: Information sent is delivered.
  • Fairness: Any contention must be resolved.
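Mutual exclusion when updating a shared variable can be sketched in Java (a hypothetical `deposit` example, not from the notes): the `synchronized` block guarantees that the read-modify-write on `balance` is never interleaved with another thread's update:

```java
public class MutexDemo {
    static int balance = 0;
    static final Object lock = new Object();

    static void deposit(int amount) {
        synchronized (lock) {           // critical section: one thread at a time
            balance = balance + amount; // read-modify-write, safe under the lock
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 5_000; j++) deposit(1);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(balance); // always 20000 thanks to mutual exclusion
    }
}
```

Removing the `synchronized` block would make the final balance nondeterministic, violating the safety property.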
Correctness (3)
There are 4 different ways to specify fairness:

• Weak fairness: If a process continuously makes a request, eventually it will be granted.
• Strong fairness: If a process makes a request infinitely often, eventually it will be granted.
• Linear waiting: If a process makes a request, it will be granted before any other process is granted the request more than once.
• FIFO: If a process makes a request, it will be granted before any other process that makes a later request.
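In Java these policies surface directly in the standard library: `ReentrantLock` can be constructed in a fair mode which, per its documentation, favours granting the lock to the longest-waiting thread, approximating the FIFO policy above. A minimal sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

public class FairLockDemo {
    public static void main(String[] args) {
        ReentrantLock unfair = new ReentrantLock();    // default: no FIFO guarantee
        ReentrantLock fifo = new ReentrantLock(true);  // fair: longest-waiting thread next
        System.out.println(unfair.isFair()); // false
        System.out.println(fifo.isFair());   // true
    }
}
```

Fair locks trade throughput for the guarantee; the default (unfair) mode permits barging, which is closer to weak fairness in practice.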
Flynn’s Taxonomy
Originally developed in 1966, Flynn’s taxonomy characterisesparallel machines by the number of instruction and data streams.
SISD This simple sequential machine executes a single instruction stream on a single data stream. Most computers used to be SISD machines; today many processors come in a multicore configuration, but each individual core can be considered an SISD machine.

SIMD These machines execute a single instruction stream on multiple data streams. Vector machines and GPUs are examples of this classification.
Flynn’s Taxonomy (2)

MISD These machines execute multiple instruction streams on a single data stream. Quite rare, but can be found in fault-tolerant systems where the same data is processed by different instruction programs and the final output is determined by majority decision.

MIMD These machines execute multiple instruction streams on multiple data streams. This classification can be further broken down as:
• Shared memory MIMD: Each instruction stream can access a shared memory space. Speeds up communication, but at the possible cost of contention.
• Distributed memory MIMD: Also known as shared-nothing MIMD. Each instruction stream has its own local memory and communication between instruction streams is by exchanging messages.
Concurrent Programming - The Past(?)

Concurrent programming started out with specialist machines that had architectures based on either:

• Shared Memory, where multiple processors communicated with each other using a shared memory.
  • Communications were as fast as memory access, but there could be significant memory contention.
• Distributed Memory, where each of the multiple processors had their own local memory and they communicated with each other over a dedicated high-speed communications network.
  • Communication was as fast as the network, but slower than memory access. Less contention.

These machines were very expensive.

Cluster Computing was the first affordable architecture for concurrent computing.

• Standard PCs running modified Unix kernels connected by an Ethernet network.
Concurrent Programming - Today
Today, with the advent of multicore processors and General Purpose Graphical Processing Units (GPGPUs), also known as Graphical Processing Units (GPUs), concurrent processing is available at everyone’s desk.

The Heterogeneous Systems Architecture (HSA), developed by the HSA Foundation, identifies 2 major core types:

• The Latency Compute Unit (LCU), a generalisation of a CPU, maps naturally onto task parallelism.
• The Throughput Compute Unit (TCU), a generalisation of a GPU, maps naturally onto data parallelism.
Latency Compute Unit
A Latency Compute Unit is designed to process tasks as quickly as possible.

[Diagram: an LCU comprising local memory/registers, a large cache, a control unit, and multiple ALUs.]

LCUs are designed to reduce any latency in executing a task:

• A large cache to reduce memory access latency.
• Complicated control and ALU circuitry to provide:
  • powerful ALUs;
  • branch prediction; and
  • data forwarding.
Throughput Compute Unit
A Throughput Compute Unit is designed to process data as efficiently as possible in parallel.

[Diagram: a TCU comprising local memory/registers, minimal control logic and cache, and many simple ALUs.]

TCUs are designed to maximise the data throughput, usually at the cost of latency. This requires a large number of threads to compensate for the latency.

• Simple control and ALU circuitry that is:
  • very energy efficient; and
  • free of branch prediction and data forwarding.
Measuring the benefit of Concurrent Programming

The most obvious metric for measuring the benefit of concurrent computing is speedup:

speedup = t_seq / t_par

where t_seq is the execution time on a sequential machine and t_par is the execution time on a parallel machine.

Both t_seq and t_par can be heavily influenced by programmer skill, compiler choice, compiler optimisation switches, operating system, file system type and time of day (network traffic and workloads).

Therefore we should ensure that:

• Both the sequential and parallel programs are executed on identical hardware and software platforms under the same load conditions.
• The sequential program should be the fastest known solution.
Measuring the benefit of Concurrent Programming (2)

The speedup metric does not tell us if the effort in parallelising the program was “worth it” in terms of resources consumed, i.e. which is better: a speedup of 10 on 100 machines or a speedup of 8 on 16 machines?

efficiency = speedup / N = t_seq / (t_par × N)

where N is the number of “machines” and t_seq and t_par are as before.

Scalability, the ability to efficiently handle a growing amount of work, can be measured by the efficiency metric, which in this context is called the StrongScalingEfficiency(N) metric. Another metric for scalability is:

WeakScalingEfficiency(N) = t_seq / t′_par

where t′_par is the time to solve a problem N times larger than the problem solved in t_seq on a sequential machine.
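The three metrics can be computed directly. The sketch below uses hypothetical timings (100 s sequentially, 12.5 s on 16 machines, and 110 s for a 16-times-larger problem on 16 machines):

```java
import java.util.Locale;

public class ScalingMetrics {
    public static void main(String[] args) {
        double tSeq = 100.0;      // hypothetical sequential time (s)
        double tPar = 12.5;       // hypothetical parallel time on N machines (s)
        int n = 16;

        double speedup = tSeq / tPar;     // speedup = t_seq / t_par
        double strongEff = speedup / n;   // StrongScalingEfficiency(N)

        double tParLarge = 110.0;         // time for a problem N times larger on N machines
        double weakEff = tSeq / tParLarge; // WeakScalingEfficiency(N)

        System.out.println(speedup);                     // 8.0
        System.out.println(strongEff);                   // 0.5
        System.out.printf(Locale.ROOT, "%.3f%n", weakEff); // 0.909
    }
}
```

Here a speedup of 8 on 16 machines gives a strong-scaling efficiency of only 0.5, illustrating the “10 on 100 vs 8 on 16” comparison above.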
Measuring the benefit of Concurrent Programming (3)
Some general guidelines when developing a parallel algorithm:

• Start with the development of its sequential variant.
  • Not just any sequential algorithm, but the one that maps most naturally to a parallel implementation, e.g. for sorting you would not use quicksort, but bubble sort would be a “more naturally parallelisable” algorithm.
• Profile the sequential algorithm to identify the most “time consuming” parts.
• If these parts can be parallelised, estimate the benefit.
We will discuss appropriate design patterns later in the module.
Measuring the benefit of Concurrent Programming (4)
Some guidelines when measuring performance metrics.
• The duration of the whole execution should be measured.
• Results should be measured over multiple runs and presentedin terms of averages (and ideally standard deviations).
• Outliers should be excluded only if there is a valid explanation for excluding them.
• Results should be reported for various sizes of input data(scalability).
• The number of processes/threads should not exceed thenumber of hardware cores available.
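Averages and standard deviations over multiple runs can be computed as in the sketch below (hypothetical run times; a real experiment would capture each with `System.nanoTime()` around the whole execution):

```java
import java.util.Locale;

public class RunStats {
    public static void main(String[] args) {
        // Hypothetical wall-clock times (seconds) for five runs of the whole execution.
        double[] runs = {12.0, 11.5, 12.5, 12.0, 12.0};

        double sum = 0.0;
        for (double t : runs) sum += t;
        double mean = sum / runs.length;

        double sq = 0.0;
        for (double t : runs) sq += (t - mean) * (t - mean);
        double std = Math.sqrt(sq / (runs.length - 1)); // sample standard deviation

        System.out.println(mean);                   // 12.0
        System.out.printf(Locale.ROOT, "%.3f%n", std); // 0.354
    }
}
```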
Bad News - Amdahl’s Law
Gene Amdahl devised a simple thought experiment in 1967 to estimate the benefit that could be expected from parallel programming.

Assume that:

• A single sequential processor executes the task in time T.
• The fraction of the application that can be parallelised is α. The inherently sequential part of the application is 1 − α.
• The parallelisable part of the application can be evenly divided among the computational units.
• Communication between the parallelised elements incurs no overhead.
Bad News - Amdahl’s Law (2)
speedup = t_seq / t_par = T / ((1 − α)T + αT/N) = 1 / ((1 − α) + α/N)

As N → ∞, the upper bound on speedup is:

lim_{N→∞} speedup = 1 / (1 − α)

Similarly,

efficiency = t_seq / (N × t_par) = T / (N((1 − α)T + αT/N)) = 1 / (N(1 − α) + α)
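A quick numerical check of the law (hypothetical α = 0.95, i.e. 95% of the work parallelises): even with a million computational units the speedup cannot exceed 1/(1 − α) = 20.

```java
import java.util.Locale;

public class Amdahl {
    // Speedup predicted by Amdahl's law for parallel fraction alpha on n units.
    static double speedup(double alpha, int n) {
        return 1.0 / ((1.0 - alpha) + alpha / n);
    }

    public static void main(String[] args) {
        double alpha = 0.95; // hypothetical: 95% of the work parallelises
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 8));         // 5.926
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 1_000_000)); // ~20.000
        System.out.printf(Locale.ROOT, "%.3f%n", 1.0 / (1.0 - alpha));       // the bound
    }
}
```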
Bad News - Amdahl’s Law (3)
[Figure from G. Barlas, Multicore and GPU Programming: An Integrated Approach.]
Good News - Amdahl’s Law is Wrong!

Experience shows that this is not the case! The list of the top 500 supercomputers in 2014 is composed of machines that have thousands to millions of processing units.

In 1988 Gustafson and Barsis looked at the problem from the parallel program’s point of view, i.e. what could it do relative to its sequential implementation. Amdahl’s law implicitly assumes a fixed data set size, whereas the Gustafson-Barsis law implicitly assumes arbitrarily large data sets (that are increasing in size).

Assume:

• The parallel implementation executes in time T.
• The implementation spends the fraction α executing on all N machines; the fraction (1 − α) is executed sequentially.

speedup = t_seq / t_par = ((1 − α)T + NαT) / T = (1 − α) + Nα

efficiency = t_seq / (N × t_par) = (1 − α)/N + α
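The same numbers under the Gustafson-Barsis law (hypothetical α = 0.95, N = 8) give a scaled speedup that grows linearly with N rather than saturating:

```java
import java.util.Locale;

public class Gustafson {
    // Scaled speedup predicted by the Gustafson-Barsis law.
    static double speedup(double alpha, int n) {
        return (1.0 - alpha) + n * alpha;
    }
    static double efficiency(double alpha, int n) {
        return (1.0 - alpha) / n + alpha;
    }

    public static void main(String[] args) {
        double alpha = 0.95; // hypothetical fraction of parallel time spent on all machines
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 8));    // 7.650
        System.out.printf(Locale.ROOT, "%.3f%n", efficiency(alpha, 8)); // 0.956
    }
}
```

Compare with Amdahl’s prediction of at most 20 for the same α: here doubling N roughly doubles the scaled speedup.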
Good News - Amdahl’s Law is Wrong! (2)
[Figure from G. Barlas, Multicore and GPU Programming: An Integrated Approach.]