Art of Multiprocessor Programming1 This work is licensed under a Creative Commons Attribution-...

Art of Multiprocessor Programming 1

This work is licensed under a

Creative Commons Attribution-ShareAlike 2.5 License.

• You are free:– to Share — to copy, distribute and transmit the work – to Remix — to adapt the work

• Under the following conditions:– Attribution. You must attribute the work to “The Art of

Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work).

– Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.

• For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to– http://creativecommons.org/licenses/by-sa/3.0/.

• Any of the above conditions can be waived if you get permission from the copyright holder.

• Nothing in this license impairs or restricts the author's moral rights.

Parallel and distributed

programmingAdam Piotrowski

Grzegorz Jabłoński

Lecture I

Parallel and distributed programming

lux.dmcs.pl/padp

Grading policy Lecture slides and material

Additional material

Course Overview

Introduction to multiprocessor programming Using OpenMP

Distributed programmingMPI, CORBA programming

Theoretical approche to multiprocessor

programmingFundamentals - Models, algorithms

Real-World programming - ArchitecturesTechniques

The Art of Multiprocessor Programming

by Maurice Herlihy and Nir Shavit

Using OpenMPPortable Shared Memory Parallel

Programming

by Barbara Chapman, Gabriele Jost and Ruud van der Pas

Still on some of your desktops: The Uniprocesor

memory

In the Enterprise: The Shared Memory

Multiprocessor(SMP)

BusBus

shared memory

cachecache

Traditional Scaling Process

User code

TraditionalUniprocessor

Speedup1.8x1.8x

3.6x3.6x

Time: Moore’s law

Multicore Scaling Process

User code

Multicore

Speedup 1.8x1.8x

3.6x3.6x

Unfortunately, not so simple…

Real-World Scaling Process

1.8x1.8x 2x2x 2.9x2.9x

User code

Multicore

Speedup

Parallelization and Synchronization require great care…

Art of Multiprocessor Programming

Sequential Computation

memory

object object

thread

Concurrent Computation

memory

object object

Asynchrony

• Sudden unpredictable delays– Cache misses (short)– Page faults (long)– Scheduling quantum used up (really

Model Summary

• Multiple threads– Sometimes called processes

• Single shared memory• Objects live in memory• Unpredictable asynchronous

delays

Parallel Primality Testing

• Challenge– Print primes from 1 to 1010

• Given– Ten-processor multiprocessor– One thread per processor

• Goal– Get ten-fold speedup (or close)

Load Balancing

• Split the work evenly• Each thread tests range of 109

…109 10102·1091

P0 P1 P9

Procedure for Thread i

void primePrint { int i = ThreadID.get(); // IDs in {0..9} for (j = i*109+1, j<(i+1)*109; j++) { if (isPrime(j)) print(j); }}

Issues

• Higher ranges have fewer primes• Yet larger numbers harder to test• Thread workloads

– Uneven– Hard to predict

• Need dynamic load balancing

Issues

• Higher ranges have fewer primes• Yet larger numbers harder to test• Thread workloads

– Uneven– Hard to predict

• Need dynamic load balancingre

jected

Shared Counter

each thread takes a number

int counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); }}

Counter counter = new Counter(1); void primePrint { long j = 0; while (j < 1010) { j = counter.getAndIncrement(); if (isPrime(j)) print(j); }}

Shared counterobject

Stop when every value taken

Increment & return each new

Counter Implementation

public class Counter { private long value;

public long getAndIncrement() { return value++; }}

Counter Implementation

public long getAndIncrement() { return value++; }} OK for single thread,

not for concurrent threads

What It Means

temp = value; value = value + 1; return temp;

Not so good…

Value… 1

write 2

write 3

write 2

Is this problem inherent?

If we could only glue reads and writes…

write read

Challenge

public long getAndIncrement() { temp = value; value = temp + 1; return temp; }}

Challenge

Make these steps atomic (indivisible)

Hardware Solution

public long getAndIncrement() { temp = value; value = temp + 1; return temp; }} ReadModifyWrite()

instruction

An Aside: Java™

public long getAndIncrement() { synchronized { temp = value; value = temp + 1; } return temp; }}

An Aside: Java™

Synchronized block

An Aside: Java™

Mutual Exclusion

Why do we care?

• We want as much of the code as possible to execute concurrently (in parallel)

• A larger sequential part implies reduced performance

• Amdahl’s law: this relation is not linear…

Amdahl’s Law

OldExecutionTimeNewExecutionTimeSpeedup=

…of computation given n CPUs instead of 1

Amdahl’s Law

1Speedup=

Amdahl’s Law

1Speedup=

Parallel fraction

Amdahl’s Law

1Speedup=

Parallel fraction

Sequential fraction

Amdahl’s Law

1Speedup=

Parallel fraction

Number of

processors

Sequential fraction

Example

• Ten processors• 60% concurrent, 40% sequential• How close to 10-fold speedup?

Example

Speedup=2.17=

Example

Speedup=3.57=

Example

Speedup=5.26=

Example

1099.0

Speedup=9.17=

The Moral

• Making good use of our multiple processors (cores) means – Finding ways to effectively parallelize

our code• Minimize sequential parts• Reduce idle time in which threads wait

without

Implementing a Counter

Make these steps indivisible using

Locks (Mutual Exclusion)

public interface Lock {

public void lock();

public void unlock();}

public void lock();

acquire lock

public void lock();

release lock

acquire lock

Using Locks

public class Counter { private long value; private Lock lock; public long getAndIncrement() { lock.lock(); try { int temp = value; value = value + 1; } finally { lock.unlock(); } return temp; }}

Using Locks

acquire Lock

Using Locks

Release lock(no matter what)

Using Locks

Critical section

Today: Revisit Mutual Exclusion

• Think of performance, not just correctness and progress

• Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware

• And get to know a collection of locking algorithms…

What Should you do if you can’t get a lock?

• Keep trying– “spin” or “busy-wait”– Good if delays are short

• Give up the processor– Good if delays are long– Always good on uniprocessor

What Should you do if you can’t get a lock?

• Keep trying– “spin” or “busy-wait”– Good if delays are short

• Give up the processor– Good if delays are long– Always good on uniprocessor

our focus

Basic Spin-Lock

Resets lock upon exit

spin lock

critical section

Basic Spin-Lock

Resets lock upon exit

spin lock

critical section

…lock introduces sequential bottleneck

Review: Test-and-Set

• Boolean value• Test-and-set (TAS)

– Swap true with current value– Return value tells if prior value was

true or false

• Can reset just by writing false

public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) {

boolean prior = value; value = newValue; return prior; }}

public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) {

boolean prior = value; value = newValue; return prior; }}

Swap old and new values

AtomicBoolean lock = new AtomicBoolean(false)…boolean prior = lock.getAndSet(true)

Swapping in true is called “test-and-set” or TAS

Test-and-Set Locks

• Locking– Lock is free: value is false– Lock is taken: value is true

• Acquire lock by calling TAS– If result is false, you win– If result is true, you lose

• Release lock by writing false

Test-and-set Lock

class TASlock { AtomicBoolean state = new AtomicBoolean(false);

void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Test-and-set Lock

Lock state is AtomicBoolean

Test-and-set Lock

Keep trying until lock acquired

Test-and-set Lock

Release lock by resetting state to false

Performance

• Experiment– n threads– Increment shared counter 1 million

• How long should it take?• How long does it take?

threads

no speedup because of sequential bottleneck

Mystery #1ti

threads

TAS lock

What is going on?

Bus-Based Architectures

memory

cachecache

memory

cachecache

Random access memory (10s of cycles)

memory

cachecache

Shared Bus•Broadcast medium•One broadcaster at a time•Processors and memory all “snoop”

memory

cachecache

Per-Processor Caches•Small•Fast: 1 or 2 cycles•Address & state information

Jargon Watch

• Cache hit– “I found what I wanted in my cache”– Good Thing™

Jargon Watch

• Cache hit– “I found what I wanted in my cache”– Good Thing™

• Cache miss– “I had to shlep all the way to memory

for that data”– Bad Thing™

Cave Canem

• This model is still a simplification– But not in any essential way– Illustrates basic principles

• Will discuss complexities later

Processor Issues Load Request

memory

cachecache

memory

cachecache

Gimmedata

Memory Responds

memory

cachecache

Got your data right here data

memory

cachecachedata

Gimmedata

memory

cachecachedata

Gimmedata

memory

cachecachedata

I got data

Other Processor Responds

memory

cachecache

I got data

datadata

Other Processor Responds

memory

cachecache

datadata

Modify Cached Data

memory

cachedata

Modify Cached Data

memory

cachedata

memory

Modify Cached Data

cachedata

memory

Modify Cached Data

What’s up with the other copies?

Cache Coherence

• We have lots of copies of data– Original copy in memory – Cached copies at processors

• Some processor modifies its own copy– What do we do with the others?– How to avoid confusion?

Write-Back Caches

• Accumulate changes in cache• Write back when needed

– Need the cache for something else– Another processor wants it

• On first modification– Invalidate other entries– Requires non-trivial protocol …

Write-Back Caches

• Cache entry has three states– Invalid: contains raw seething bits– Valid: I can read but I can’t write– Dirty: Data has been modified

• Intercept other load requests• Write back to memory before using cache

Invalidate

memory

cachedatadata

Invalidate

memory

cachedatadata

Mine, all mine!

Invalidate

memory

cachedatadata

Invalidate

memory

cachedata

Art of Multiprocessor Programming1 This work is licensed under a Creative Commons Attribution-...

Documents

Transcript of Art of Multiprocessor Programming1 This work is licensed under a Creative Commons Attribution-...

Version 2.21 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.Creative Commons Attribution-ShareAlike 4.0.

Photo by Valentina_A - Creative Commons Attribution-NonCommercial-ShareAlike License N04Created with Haiku Deck.

CC Creative Commons Attribution ShareAlike - 4 · CC Creative Commons Attribution ShareAlike - 4.0 (CC BY-SA 4.0) 2016 OXFAM. Any views or opinions presented in this manual are solely

Personas & Scenario Mapping This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.Creative Commons Attribution-NonCommercial-ShareAlike.

Photo by twm1340 - Creative Commons Attribution-ShareAlike ... › 2015 › 04 › bibframe-basics.pdf · Photo by twm1340 - Creative Commons Attribution-ShareAlike License N00 Created

This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/BiostatMedicalProductRegulation/biomed_lec… · Creative Commons Attribution-NonCommercial-ShareAlike

Photo by aresauburn™ - Creative Commons Attribution-ShareAlike License N06Created with Haiku Deck222

Photo by - Creative Commons Attribution-ShareAlike License with Haiku Deck.

Using Web 2.0 in Teaching This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License.Creative Commons Attribution-ShareAlike 2.5.

Adapted from: FatMax 2007. Licensed under a Creative Commons Attribution-NonCommercial- ShareAlike 2.5 LicenseCreative Commons Attribution-NonCommercial-

This work is licensed under aCreative Commons “Attribution ... · This work is licensed under aCreative Commons “Attribution-NonCommercial-ShareAlike 4.0 International” license.

Creative Commons Attribution-NonCommercial-ShareAlike ... › texts › Moebius-Noodles.pdf · Creative Commons Attribution-NonCommercial-ShareAlike license by Moebius Noodles. Others

Tefko Saracevic This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License Creative Commons Attribution-NonCommercial-ShareAlike.

Photo by Doun Dounell - Creative Commons Attribution … · · 2014-04-03Photo by Bestindiaedu.com - Creative Commons Attribution-ShareAlike License N08 Created with Haiku Deck

Photo by marcp_dmoz - Creative Commons Attribution-NonCommercial-ShareAlike License N05Created with Haiku Deck.

Photo by @notnixon - Creative Commons Attribution- … · 2015-12-11 · Photo by marcopako - Creative Commons Attribution- ShareAlike License N00. Created with Haiku Deck

Licensed under the Creative Commons Attribution-ShareAlike License.

Licensed under Creative Commons Attribution ShareAlike 4.0 International Problem Mapping.

This work is licensed under a Creative Commons Attribution ...ocw.jhsph.edu/courses/SocialBehavioralFoundations/... · This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike

Creative Commons Attribution-NonCommercial-ShareAlike ...ocw.jhsph.edu/courses/vulnerablepopulations/PDFs/WorkerPop-sec2_Fitzgerald.pdfCreative Commons BY-SA. Retrieved from flickr.com.