CA670 - Concurrent Programming
Introduction
David Sinclair
Introduction
Overview
This module will introduce you to:
• Basic concepts
  • Resource contention
  • Synchronisation
  • Scalability & Performance
• Concurrency in Java
  • Java Threads
  • Synchronisation in Java
• Design Strategies
• OpenMP
• OpenCL
• Formal Specification of Concurrent Systems
Texts
Supplementary:
• D. Lea, Concurrent Programming in Java: Design Principles and Patterns, 1999, 978-0201310092
• S. Oaks and H. Wong, Java Threads, 2004, 978-0596007829
• B. Goetz, T. Peierls, J. Bloch, J. Bowbeer, D. Holmes, and D. Lea, Java Concurrency in Practice, 2006, 978-0321349606
• G. Barlas, Multicore and GPU Programming: An Integrated Approach, 2014, 978-0124171374
• C. A. R. Hoare, Communicating Sequential Processes, 1985, 978-0131532892
• R. Milner, Communicating and Mobile Systems: the π-calculus, 1999, 978-0521658690
Contact Details
Lecturer: David Sinclair
Office: L253
Phone: 5510
Email: [email protected]
Web: http://www.computing.dcu.ie/~davids
Course web page: http://www.computing.dcu.ie/~davids/CA670.html
How do I successfully complete this module?
The module mark is a straight weighted average of:
• 25% continuous assessment
  • 2 elements:
    • Java Threads (10%)
    • Multicore/GPGPU (15%)
• 75% end-of-semester examination
  • 3 hour exam
  • Answer 5 out of 6 questions.
What if I don’t successfully complete this module?
If you fail the module after the end-of-semester exams then you will need to repeat some elements of the assessment.

• If you just failed the exam, you can resit the exam in the Autumn.
• If you just failed the continuous assessment, then you must complete a resit assignment.
  • Contact me after the results are published for the resit assignment.
• If you failed both the exam and the continuous assessment, you must repeat both.

If you fail the module after the resit examination and continuous assessment, you must repeat all aspects of the module.
Introduction
A concurrent program is the interleaving of sets of sequential atomic instructions.

• These sequential processes execute at the same time, on the same or different computational units (processors, cores, GPUs).
• Interleaving means that at any given time each processor is executing one of the instructions of the sequential processes.
• The relative rate at which the instructions of each process are executed is not important.

Each sequential process consists of a series of atomic instructions.

• An atomic instruction is an instruction that, once it starts, proceeds to completion without interruption.
• Different processors have different atomic instructions, and this can have a big effect.
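These ideas can be made concrete with a minimal Java sketch (a hypothetical example, not from the notes). A plain `count++` is a read-modify-write sequence whose steps can interleave between threads, whereas `AtomicInteger.incrementAndGet()` executes as a single atomic instruction:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicDemo {
    // Shared counter updated atomically; a plain int here would race.
    static final AtomicInteger count = new AtomicInteger(0);

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) {
                count.incrementAndGet(); // atomic: cannot be interleaved mid-update
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(count.get()); // always 20000
    }
}
```

With a plain shared `int` and `count++` the printed total could be anything up to 20000, because the read and write steps of the two threads can interleave.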
Correctness
If P(a) is a property of the input (pre-condition), and Q(a, b) is a property of the input and output (post-condition), then correctness is defined as either:

Partial correctness: (P(a) ∧ terminates(Prog(a, b))) ⇒ Q(a, b)

Total correctness: P(a) ⇒ (terminates(Prog(a, b)) ∧ Q(a, b))

Totally correct programs terminate.
A totally correct specification of the incrementing task is:

a ∈ ℕ ⇒ (terminates(INC(a, a′)) ∧ a′ = a + 1)

where a′ is the value of a after INC executes.
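The pre- and post-condition can be checked mechanically at runtime. The sketch below (a hypothetical `inc` routine, assuming assertions are enabled with `java -ea`) encodes the pre-condition a ∈ ℕ and the post-condition a′ = a + 1 as assertions:

```java
public class IncSpec {
    // Hypothetical increment routine used to illustrate the specification.
    static int inc(int a) { return a + 1; }

    public static void main(String[] args) {
        int a = 41;
        assert a >= 0;          // pre-condition P(a): a is a natural number
        int aPrime = inc(a);    // Prog terminates (a single expression)
        assert aPrime == a + 1; // post-condition Q(a, a')
        System.out.println(aPrime); // 42
    }
}
```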
Correctness (2)
There are 2 types of correctness properties:

• Safety properties: These must always be true.
  • Mutual exclusion: Two processes must not interleave certain sequences of instructions. Typically this involves changing the state of a shared resource, e.g.:
    • updating the value of a shared variable;
    • 2 or more processes updating a shared file.
  • Absence of deadlock: Deadlock is when a non-terminating system cannot respond to any signal.
• Liveness properties: These must eventually be true.
  • Absence of starvation: Information sent is delivered.
  • Fairness: Any contention must be resolved.
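Mutual exclusion when updating a shared variable can be sketched in Java (a hypothetical `deposit` example, not from the notes): the `synchronized` block guarantees that the read-modify-write on `balance` is never interleaved with another thread's update:

```java
public class MutexDemo {
    static int balance = 0;
    static final Object lock = new Object();

    static void deposit(int amount) {
        synchronized (lock) {           // critical section: one thread at a time
            balance = balance + amount; // read-modify-write, safe under the lock
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 5_000; j++) deposit(1);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(balance); // always 20000 thanks to mutual exclusion
    }
}
```

Removing the `synchronized` block would make the final balance nondeterministic, violating the safety property.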
Correctness (3)
There are 4 different ways to specify fairness:

• Weak fairness: If a process continuously makes a request, eventually it will be granted.
• Strong fairness: If a process makes a request infinitely often, eventually it will be granted.
• Linear waiting: If a process makes a request, it will be granted before any other process is granted the request more than once.
• FIFO: If a process makes a request, it will be granted before any other process that makes a later request.
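In Java these policies surface directly in the standard library: `ReentrantLock` can be constructed in a fair mode which, per its documentation, favours granting the lock to the longest-waiting thread, approximating the FIFO policy above. A minimal sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

public class FairLockDemo {
    public static void main(String[] args) {
        ReentrantLock unfair = new ReentrantLock();    // default: no FIFO guarantee
        ReentrantLock fifo = new ReentrantLock(true);  // fair: longest-waiting thread next
        System.out.println(unfair.isFair()); // false
        System.out.println(fifo.isFair());   // true
    }
}
```

Fair locks trade throughput for the guarantee; the default (unfair) mode permits barging, which is closer to weak fairness in practice.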
Flynn’s Taxonomy
Originally developed in 1966, Flynn’s taxonomy characterisesparallel machines by the number of instruction and data streams.
SISD This simple sequential machine executes a single instruction stream on a single data stream. Most computers used to be SISD machines; today many processors come in a multicore configuration, but each individual core can be considered an SISD machine.

SIMD These machines execute a single instruction stream on multiple data streams. Vector machines and GPUs are examples of this classification.
Flynn’s Taxonomy (2)

MISD These machines execute multiple instruction streams on a single data stream. Quite rare, but can be found in fault-tolerant systems where the same data is processed by different instruction programs and the final output is determined by majority decision.

MIMD These machines execute multiple instruction streams on multiple data streams. This classification can be further broken down as:
• Shared memory MIMD: Each instruction stream can access a shared memory space. Speeds up communication, but at the possible cost of contention.
• Distributed memory MIMD: Also known as shared-nothing MIMD. Each instruction stream has its own local memory and communication between instruction streams is by exchanging messages.
Concurrent Programming - The Past(?)

Concurrent programming started out with specialist machines that had architectures based on either:

• Shared Memory, where multiple processors communicated with each other using a shared memory.
  • Communications were as fast as memory access, but there could be significant memory contention.
• Distributed Memory, where each of the multiple processors had their own local memory and they communicated with each other over a dedicated high-speed communications network.
  • Communication was as fast as the network, but slower than memory access. Less contention.

These machines were very expensive.

Cluster Computing was the first affordable architecture for concurrent computing.

• Standard PCs running modified Unix kernels connected by an Ethernet network.
Concurrent Programming - Today
Today, with the advent of multicore processors and General Purpose Graphical Processing Units (GPGPUs), also known as Graphical Processing Units (GPUs), concurrent processing is available at everyone’s desk.

The Heterogeneous Systems Architecture (HSA), developed by the HSA Foundation, identifies 2 major core types:

• The Latency Compute Unit (LCU), a generalisation of a CPU, maps naturally onto task parallelism.
• The Throughput Compute Unit (TCU), a generalisation of a GPU, maps naturally onto data parallelism.
Latency Compute Unit
A Latency Compute Unit is designed to process tasks as quickly as possible.

[Diagram: an LCU comprising local memory/registers, a large cache, a control unit, and multiple ALUs.]

LCUs are designed to reduce any latency in executing a task:

• A large cache to reduce memory access latency.
• Complicated control and ALU circuitry to provide:
  • powerful ALUs;
  • branch prediction; and
  • data forwarding.
Throughput Compute Unit
A Throughput Compute Unit is designed to process data as efficiently as possible in parallel.

[Diagram: a TCU comprising local memory/registers, minimal control logic and cache, and many simple ALUs.]

TCUs are designed to maximise the data throughput, usually at the cost of latency. This requires a large number of threads to compensate for the latency.

• Simple control and ALU circuitry that is:
  • very energy efficient; and
  • free of branch prediction and data forwarding.
Measuring the benefit of Concurrent Programming

The most obvious metric for measuring the benefit of concurrent computing is speedup:

speedup = t_seq / t_par

where t_seq is the execution time on a sequential machine and t_par is the execution time on a parallel machine.

Both t_seq and t_par can be heavily influenced by programmer skill, compiler choice, compiler optimisation switches, operating system, file system type and time of day (network traffic and workloads).

Therefore we should ensure that:

• Both the sequential and parallel programs are executed on identical hardware and software platforms under the same load conditions.
• The sequential program should be the fastest known solution.
Measuring the benefit of Concurrent Programming (2)

The speedup metric does not tell us if the effort in parallelising the program was “worth it” in terms of resources consumed, i.e. which is better: a speedup of 10 on 100 machines or a speedup of 8 on 16 machines?

efficiency = speedup / N = t_seq / (t_par × N)

where N is the number of “machines” and t_seq and t_par are as before.

Scalability, the ability to efficiently handle a growing amount of work, can be measured by the efficiency metric, which in this context is called the StrongScalingEfficiency(N) metric. Another metric for scalability is:

WeakScalingEfficiency(N) = t_seq / t′_par

where t′_par is the time to solve a problem N times larger than the problem solved in t_seq on a sequential machine.
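The three metrics can be computed directly. The sketch below uses hypothetical timings (100 s sequentially, 12.5 s on 16 machines, and 110 s for a 16-times-larger problem on 16 machines):

```java
import java.util.Locale;

public class ScalingMetrics {
    public static void main(String[] args) {
        double tSeq = 100.0;      // hypothetical sequential time (s)
        double tPar = 12.5;       // hypothetical parallel time on N machines (s)
        int n = 16;

        double speedup = tSeq / tPar;     // speedup = t_seq / t_par
        double strongEff = speedup / n;   // StrongScalingEfficiency(N)

        double tParLarge = 110.0;         // time for a problem N times larger on N machines
        double weakEff = tSeq / tParLarge; // WeakScalingEfficiency(N)

        System.out.println(speedup);                     // 8.0
        System.out.println(strongEff);                   // 0.5
        System.out.printf(Locale.ROOT, "%.3f%n", weakEff); // 0.909
    }
}
```

Here a speedup of 8 on 16 machines gives a strong-scaling efficiency of only 0.5, illustrating the “10 on 100 vs 8 on 16” comparison above.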
Measuring the benefit of Concurrent Programming (3)
Some general guidelines when developing a parallel algorithm:

• Start with the development of its sequential variant.
  • Not just any sequential algorithm, but the one that maps most naturally to a parallel implementation, e.g. for sorting you would not use quicksort, but bubble sort would be a “more naturally parallelisable” algorithm.
• Profile the sequential algorithm to identify the most “time consuming” parts.
• If these parts can be parallelised, estimate the benefit.
We will discuss appropriate design patterns later in the module.
Measuring the benefit of Concurrent Programming (4)
Some guidelines when measuring performance metrics.
• The duration of the whole execution should be measured.
• Results should be measured over multiple runs and presentedin terms of averages (and ideally standard deviations).
• Outliers should be excluded only if there is a valid explanation for excluding them.
• Results should be reported for various sizes of input data(scalability).
• The number of processes/threads should not exceed thenumber of hardware cores available.
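Averages and standard deviations over multiple runs can be computed as in the sketch below (hypothetical run times; a real experiment would capture each with `System.nanoTime()` around the whole execution):

```java
import java.util.Locale;

public class RunStats {
    public static void main(String[] args) {
        // Hypothetical wall-clock times (seconds) for five runs of the whole execution.
        double[] runs = {12.0, 11.5, 12.5, 12.0, 12.0};

        double sum = 0.0;
        for (double t : runs) sum += t;
        double mean = sum / runs.length;

        double sq = 0.0;
        for (double t : runs) sq += (t - mean) * (t - mean);
        double std = Math.sqrt(sq / (runs.length - 1)); // sample standard deviation

        System.out.println(mean);                   // 12.0
        System.out.printf(Locale.ROOT, "%.3f%n", std); // 0.354
    }
}
```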
Bad News - Amdahl’s Law
Gene Amdahl devised a simple thought experiment in 1967 to estimate the benefit that could be expected from parallel programming.

Assume that:

• A single sequential processor executes the task in time T.
• The fraction of the application that can be parallelised is α. The inherently sequential part of the application is 1 − α.
• The parallelisable part of the application can be evenly divided among the computational units.
• Communication between the parallelised elements incurs no overhead.
Bad News - Amdahl’s Law (2)
speedup = t_seq / t_par = T / ((1 − α)T + αT/N) = 1 / ((1 − α) + α/N)

As N → ∞, the upper bound on speedup is:

lim_{N→∞} speedup = 1 / (1 − α)

Similarly,

efficiency = t_seq / (N × t_par) = T / (N((1 − α)T + αT/N)) = 1 / (N(1 − α) + α)
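A quick numerical check of the law (hypothetical α = 0.95, i.e. 95% of the work parallelises): even with a million computational units the speedup cannot exceed 1/(1 − α) = 20.

```java
import java.util.Locale;

public class Amdahl {
    // Speedup predicted by Amdahl's law for parallel fraction alpha on n units.
    static double speedup(double alpha, int n) {
        return 1.0 / ((1.0 - alpha) + alpha / n);
    }

    public static void main(String[] args) {
        double alpha = 0.95; // hypothetical: 95% of the work parallelises
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 8));         // 5.926
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 1_000_000)); // ~20.000
        System.out.printf(Locale.ROOT, "%.3f%n", 1.0 / (1.0 - alpha));       // the bound
    }
}
```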
Bad News - Amdahl’s Law (3)
[Figure from G. Barlas, Multicore and GPU Programming: An Integrated Approach.]
Good News - Amdahl’s Law is Wrong!

Experience shows that this is not the case! The list of the top 500 supercomputers in 2014 is composed of machines that have thousands to millions of processing units.

In 1988 Gustafson and Barsis looked at the problem from the parallel program’s point of view, i.e. what could it do relative to its sequential implementation. Amdahl’s law implicitly assumes a fixed data set size, whereas the Gustafson-Barsis law implicitly assumes arbitrarily large data sets (that are increasing in size).

Assume:

• The parallel implementation executes in time T.
• The implementation spends the fraction α executing on all N machines; the fraction (1 − α) is executed sequentially.

speedup = t_seq / t_par = ((1 − α)T + NαT) / T = (1 − α) + Nα

efficiency = t_seq / (N × t_par) = (1 − α)/N + α
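The same numbers under the Gustafson-Barsis law (hypothetical α = 0.95, N = 8) give a scaled speedup that grows linearly with N rather than saturating:

```java
import java.util.Locale;

public class Gustafson {
    // Scaled speedup predicted by the Gustafson-Barsis law.
    static double speedup(double alpha, int n) {
        return (1.0 - alpha) + n * alpha;
    }
    static double efficiency(double alpha, int n) {
        return (1.0 - alpha) / n + alpha;
    }

    public static void main(String[] args) {
        double alpha = 0.95; // hypothetical fraction of parallel time spent on all machines
        System.out.printf(Locale.ROOT, "%.3f%n", speedup(alpha, 8));    // 7.650
        System.out.printf(Locale.ROOT, "%.3f%n", efficiency(alpha, 8)); // 0.956
    }
}
```

Compare with Amdahl’s prediction of at most 20 for the same α: here doubling N roughly doubles the scaled speedup.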
Good News - Amdahl’s Law is Wrong! (2)
[Figure from G. Barlas, Multicore and GPU Programming: An Integrated Approach.]