COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC...

56
COL730: An Introduc2on to Parallel Computa2on

Transcript of COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC...

Page 1: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

COL730:AnIntroduc2ontoParallelComputa2on

Page 2: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Focus of the Course

• Understand parallel eco-system➡ Parallel computer organization

• Parallel algorithms and data structures; Analysis

• Learn broad parallel programming styles➡ Shared memory programming , Distributed memory programming, Co-processors,

Hybrids

• Learn basic parallelization tools and techniques➡ Performance measurement

• Scheduling and Load balancing Inter-process communication Synchronization

• Get broad coverage for teaching “parallel programming”➡ Also prepare to delve into research

• C/C++ is assumed

Page 3: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Related Keywords

• Parallel

• Concurrent

• Distributed

• Supercomputing

• Grid computing

• Cloud computing

• Big data

Page 4: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Large Computational Problems

• Molecular simulation➡ Many million bodies ⇒ days/iteration

• Atmospheric simulation➡ 1km 3D-grid, each point interacts with neighbors➡ Days of simulation time

• Data Analytics

• Computational biology➡ Drug design➡ Gene sequencing

• Oil exploration➡ Months of processing of seismic data

• Financial processing➡ Market prediction, investing

Page 5: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Large Computational Problems

• Molecular simulation➡ Many million bodies ⇒ days/iteration

• Atmospheric simulation➡ 1km 3D-grid, each point interacts with neighbors➡ Days of simulation time

• Data Analytics

• Computational biology➡ Drug design➡ Gene sequencing

• Oil exploration➡ Months of processing of seismic data

• Financial processing➡ Market prediction, investing

NAMD, 1M atoms: 316PF/nsGROMACS, .25M molecules: 25 PF/nsLAMMPS: .5M atoms: 15PF/nsWeather: 1M PF full weather modeling of 2Weeks.

Page 6: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Large Computational Problems

• Molecular simulation➡ Many million bodies ⇒ days/iteration

• Atmospheric simulation➡ 1km 3D-grid, each point interacts with neighbors➡ Days of simulation time

• Data Analytics

• Computational biology➡ Drug design➡ Gene sequencing

• Oil exploration➡ Months of processing of seismic data

• Financial processing➡ Market prediction, investing

NAMD, 1M atoms: 316PF/nsGROMACS, .25M molecules: 25 PF/nsLAMMPS: .5M atoms: 15PF/nsWeather: 1M PF full weather modeling of 2Weeks.

Theano 1M images: 375pF

Page 7: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Transistors (K)Clock rate (MHz)Power (W)Instr/Clk (ILP)

Page 8: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.

Moore’s Law (1965)

Page 9: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”

Moore’s Law (1965)

Page 10: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”

Moore’s Law (1965)

1975:Revisedrateofcircuitcomplexitydoublingto18monthsgoingforward

“Thereisnoroomle3tosqueezeanythingoutbybeingclever.Goingforwardfromherewehavetodependonthetwosizefactors-biggerdiesandfinerdimensions.”

Page 11: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”

Moore’s Law (1965)

1975:Revisedrateofcircuitcomplexitydoublingto18monthsgoingforward

“Thereisnoroomle3tosqueezeanythingoutbybeingclever.Goingforwardfromherewehavetodependonthetwosizefactors-biggerdiesandfinerdimensions.”

2003:AnotherdecadeisprobablystraighMorward...Thereiscertainlynoendtocrea2vity.

Page 12: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”

Moore’s Law (1965)

1975:Revisedrateofcircuitcomplexitydoublingto18monthsgoingforward

“Thereisnoroomle3tosqueezeanythingoutbybeingclever.Goingforwardfromherewehavetodependonthetwosizefactors-biggerdiesandfinerdimensions.”

2003:AnotherdecadeisprobablystraighMorward...Thereiscertainlynoendtocrea2vity.2010:Int.TechnologyRoadmapforSemiconductors:growthtoslowattheendof2013,transistorcountsanddensi2eswilldoubleeverythreeyears.

Page 13: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Newlytrendytechnologyofcomputerchipswouldbecomedoublypowerfuleverytwoyears.

• Theywouldeventuallybesosmall,thattheycouldbeembeddedinhomes,carsand

• personalportablecommunica2onsequipment.“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”

Moore’s Law (1965)

1975:Revisedrateofcircuitcomplexitydoublingto18monthsgoingforward

“Thereisnoroomle3tosqueezeanythingoutbybeingclever.Goingforwardfromherewehavetodependonthetwosizefactors-biggerdiesandfinerdimensions.”

2003:AnotherdecadeisprobablystraighMorward...Thereiscertainlynoendtocrea2vity.2010:Int.TechnologyRoadmapforSemiconductors:growthtoslowattheendof2013,transistorcountsanddensi2eswilldoubleeverythreeyears.2016:IntelSECfiling—Thegapbetweensuccessivegenera2onsofchipswithnew,smallertransistorswillwiden.Withthetransistorsat14nm(10nmin2017),itisbecomingmoredifficulttoshrinkthemcost-effec2vely.

Page 14: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

Page 15: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

Page 16: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

Page 17: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

Accelerator

Page 18: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

C

L2L1

L3

Memory

Accelerator

Accelerator

Page 19: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Accelerator

Accelerator

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

Accelerator

Accelerator

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

Accelerator

Accelerator

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

Accelerator

Accelerator

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

CL2L1

L3

Memory

Network

• Intel Xeon E5 2699 v4➡ 22 cores @2.2GHz, .63TF

• nVIDIA P100➡ 3,584 cores @1.33GHz, 5.3TF

• Xeon Phi 7290➡ 72 cores @1.5GHz, 3.5TF

• Power ISeries 8286-42A➡ 12 cores @3.52 GHz, .34 TF

2017 State of the Art

Page 20: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Computers

courtesyXinhua

Page 21: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Computers

courtesyXinhua

2016: “Fastest supercomputer in the world” Sunway TaihuLight (10.649,600 cores) Linpack Performance (Rmax) 93 PFlop/s Theoretical Peak (Rpeak) 125 pFlop/s Power: 15.4 MW Memory: 1.3 pB Processor: Sunway SW26010 260C 1.45GHz Interconnect: Sunway Operating System: Sunway RaiseOS 2.0.5

Page 22: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Computers

courtesyXinhua

2016: “Fastest supercomputer in the world” Sunway TaihuLight (10.649,600 cores) Linpack Performance (Rmax) 93 PFlop/s Theoretical Peak (Rpeak) 125 pFlop/s Power: 15.4 MW Memory: 1.3 pB Processor: Sunway SW26010 260C 1.45GHz Interconnect: Sunway Operating System: Sunway RaiseOS 2.0.5

© Mark Richards 3C DDP-116

4k words 16-bit memory 294K adds/sec

Photo by Burroughs Corp D825

Upto 4 computer modules Upto 16 memory modules

(4KW each) 167K add/sec, 25K mul/sec

Page 23: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Computers

courtesyXinhua

Page 24: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Computers

courtesyXinhua

2017: “Fastest supercomputer in the world” Sunway TaihuLight (10,649,600 cores) Linpack Performance (Rmax) 93 PFlop/s Theoretical Peak (Rpeak) 125 pFlop/s Power: 15.4 MW Memory: 1.3 pB Processor: Sunway SW26010 260C 1.45GHz Interconnect: Sunway Operating System: Sunway RaiseOS 2.0.5

Page 25: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Why Parallel?• Can’t clock faster• Do more per clock (bigger ICs ...)

– Execute complex “special-purpose” instruction– Execute more simple instructions

• Even with more instructions per second• RAM access remain a bottleneck (~ +10% per year)• Multiple processors can access memory in parallel• Increased caching requirement

• HPC is also about• Parallel memory (larger and faster)• Parallel IO• Many HPC applications rely on this, more than the raw computation

speed

11

Page 26: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Page 27: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Page 28: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Page 29: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Page 30: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Page 31: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Execu2ngStoredPrograms

Sequen2alOPoperands

OPoperands

OPoperandsOPoperands

OPoperands

OPoperands

OPoperands

OPoperands

ParallelOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands OPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperandsOPoperands

Two threads of executions

Page 32: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Compiler/System Issues

• ILP: Even single thread not strictly in the ‘program order’ ➡ Pipelined: Different instructions in different stages➡ Superscalar: Multiple execution units execute different instructions in

“parallel”➡ Out-of-order execution➡ Speculative execution (Branch prediction)➡ Auto parallelization and vectorization

• Variables can be shared across multiple threads➡ With direct participation or obliviously➡ Memory transfer is not in terms of variables but “cache lines”➡ Need to “synchronize” access

• Execution of multiple threads need not be “sequentially consistent”

Page 33: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Programming in the ‘Parallel’• Understand target model (Semantics)

– Implications/Restrictions of constructs/features• Design for the target model

– Choice of granularity, synchronization primitive – Usually more of a performance issue

• Think concurrent– For each thread, other threads are ‘adversaries’

• At least with regard to timing– Process launch, Communication, Synchronization

• Clearly define pre and post conditions• Employ high-level constructs when possible

– Debugging is extra-hard without it

14

Page 34: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Learn Parallel Programming?• Let compiler extract parallelism

– In general, not successful so far– Too context sensitive– Many efficient serial data structures and algorithms are

parallel-inefficient– Even if compiler extracted parallelism from serial code,

it may not be what you want• Programmer must conceptualize and code

parallelism• Understand parallel algorithms and data structures

15

Page 35: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Communica2on

SharedMemory MessagePassing

P PP

Memory

P PPMem

ory

Mem

ory

Mem

ory

Interconnect

Page 36: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Shared Memory Architecture• Processors to access memory as global address space• Memory updates by one processor are visible to others

(eventually)– Memory consistency models

• UMA– Typically Symmetric Multiprocessors (SMP)– Equal access and access times to memory– Hardware support for Cache Coherency (CC-UMA)

• NUMA– Typically multiple SMPs, with access to each other’s memories– Not all processors have equal access time to all memories– CC-NUMA: Cache coherency is harder

17

Page 37: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

SharedMemory

UMA NUMA

P PP

MemoryController

P PPMem

ory

Mem

ory

Mem

ory

Interconnect

Memory

Page 38: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Pros/Cons of Shared Memory

+ Easier to program with global address space

+ Typically fast memory access(when hardware supported)

- Hard to scale- Adding CPUs (geometrically) increases traffic - Programmer initiated synchronization of

memory accesses

19

Page 39: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Distributed Memory Arch.• Communication network (typically between

processors, but also memory)– Ethernet, Infiniband, Custom made

• Processor-local memory• Access to another processor’s data through

well defined communication protocol– Implicit synchronization semantics

• Inter-process synchronization by programmer

20

Page 40: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Pros/Cons of Distributed Memory+ Memory is scalable with number of

processors+ Local access is fast (no cache coherency

overhead)+ Cost effective, with off-the-shelf processor/

network- Programs often more complex (no RAM

model)- Data communication is complex to manage

21

Page 41: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Parallel Task Decomposition• Data Parallel

– Perform f(x) for many x• Task Parallel

– Perform many functions fi• Pipeline

DataParallel

TaskParallel

22

Page 42: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Fundamental Questions

• Is the problem amenable to parallelization?– Are there (serial) dependencies

• What machine architectures are available?– Can they be re-configured?– Communication network

• Algorithm– How to decompose the problem into tasks– How to map tasks to processors

23

Page 43: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Measuring Performance

• What do you measure?➡ Elapsed wall-clock time

✦ But other processes can interfere➡ CPU time, CPU+System

✦ But waits/stalls caused by you are your waits➡ Multiple threads of control➡ Job throughput

• Interfering processes

• Cache performance warp benchmarks➡ Other processes interfere with caches➡ OS also caches file data in memory➡ Flush?

• Mean of multiple measurements (keep an eye on variance)

• Measure scalability➡ With input size, input location, processor count, memory size, network

• Use Profiler

Page 44: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Measuring Performance

• What do you measure?➡ Elapsed wall-clock time

✦ But other processes can interfere➡ CPU time, CPU+System

✦ But waits/stalls caused by you are your waits➡ Multiple threads of control➡ Job throughput

• Interfering processes

• Cache performance warp benchmarks➡ Other processes interfere with caches➡ OS also caches file data in memory➡ Flush?

• Mean of multiple measurements (keep an eye on variance)

• Measure scalability➡ With input size, input location, processor count, memory size, network

• Use Profiler

• clock()• gettimeofday()• times()• clock_gettime()• /usr/bin/time

Page 45: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Speedup,Sp=Exec2meusing1processorsystem(T1)

Exec2meusingpprocessors(Tp)

Efficiency= Spp

SimplePerformanceMetrics

25

Page 46: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Speedup,Sp=Exec2meusing1processorsystem(T1)

Exec2meusingpprocessors(Tp)

Efficiency= Spp

Cost,Cp=p×Tp

SimplePerformanceMetrics

25

Page 47: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Speedup,Sp=Exec2meusing1processorsystem(T1)

Exec2meusingpprocessors(Tp)

Efficiency= Spp

Cost,Cp=p×Tp

SimplePerformanceMetrics

Op2malifCp=T1

25

Page 48: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Speedup,Sp=Exec2meusing1processorsystem(T1)

Exec2meusingpprocessors(Tp)

Efficiency= Spp

Cost,Cp=p×Tp

SimplePerformanceMetrics

Op2malifCp=T1

Lookoutforinefficiency: T1=n3 Tp=n2.5,forp=n2 Cp=n4.5

25

Page 49: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

• f = fraction of the problem that is sequential➡(1 – f) = fraction that is parallel

• Best parallel time

• Speedup with p processors:

Amdahl’s Law

26

Tp = T1(f +1� f

p)

Page 50: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Amdahl’sLaw

• Onlyfrac2on(1-f)sharedbypprocessorsIncreasingpcannotspeed-upfracEonf

• Upperboundonspeedupatp=∞

f

Page 51: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Amdahl’sLaw

• Onlyfrac2on(1-f)sharedbypprocessorsIncreasingpcannotspeed-upfracEonf

• Upperboundonspeedupatp=∞

f

Convergesto0

Page 52: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Amdahl’sLaw

• Onlyfrac2on(1-f)sharedbypprocessorsIncreasingpcannotspeed-upfracEonf

• Upperboundonspeedupatp=∞

f

Convergesto0

Page 53: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Amdahl’sLaw

• Onlyfrac2on(1-f)sharedbypprocessorsIncreasingpcannotspeed-upfracEonf

• Upperboundonspeedupatp=∞

Example:f=2%,S∞→1/0.02=50

f

Convergesto0

Page 54: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Amdahl’sLaw

• Onlyfrac2on(1-f)sharedbypprocessorsIncreasingpcannotspeed-upfracEonf

• Upperboundonspeedupatp=∞

Example:f=2%,S∞→1/0.02=50

f

Convergesto0

Page 55: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Speedup versus the number of processing elements for adding a list of numbers.

Speedup can saturate and efficiency drops (Amdahl's law)

ScalingCharacteris2cs

source: Gramma et al.

Page 56: COL730: An Introduc2on to Parallel Computa2onsubodh/courses/COL730/pdfslides/...2016: Intel SEC filing — The gap between successive genera2ons of chips with new, smaller transistors

Isoefficiency: Measure of Scalability

• Rate at which the problem size must increase (as a function of the number of processors) to maintain a constant efficiency➡ Lower rate => More Scalable➡ E =

n = I(K, p)

T1(n)

pTp(n, p)=) f(n, p) = K