A Roadmap to Restoring Computing's Former Glory


Page 1: A Roadmap to Restoring Computing's Former Glory

A Roadmap to Restoring Computing's Former Glory

David I. August

Princeton University

(Not speaking for Parakinetics, Inc.)

Page 2: A Roadmap to Restoring Computing's Former Glory

Golden era of computer architecture

[Figure: SPEC CINT Performance (log. scale) versus Year, 1992 to 2012, across the CPU92, CPU95, CPU2000, and CPU2006 suites; annotated "~ 3 years behind"]

Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters

10 Cores!

10-Core Intel Xeon: “Unparalleled Performance”

Page 3: A Roadmap to Restoring Computing's Former Glory

P6 SUPERSCALAR ARCHITECTURE (CIRCA 1994)

Automatic Speculation

Automatic Pipelining

Parallel Resources

Automatic Allocation/Scheduling

Commit

Page 4: A Roadmap to Restoring Computing's Former Glory

MULTICORE ARCHITECTURE (CIRCA 2010)

Automatic Pipelining

Parallel Resources

Automatic Speculation

Automatic Allocation/Scheduling

Commit

Page 5: A Roadmap to Restoring Computing's Former Glory
Page 6: A Roadmap to Restoring Computing's Former Glory

Realizable parallelism

Parallel Library Calls

[Figure: two panels, each plotting Threads against Time]

Credit: Jack Dongarra

Page 7: A Roadmap to Restoring Computing's Former Glory

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law
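
As a quick worked figure (not taken from the slides), a doubling every 18 years can be converted to an annual rate. For contrast, the second line does the same for a doubling every 18 months, the hardware trend that Proebsting's Law is usually set against:

```latex
\[
  2^{1/18} = e^{\ln 2 / 18} \approx 1.039
  \quad\Rightarrow\quad \text{about } 4\% \text{ per year (compilers)}
\]
\[
  2^{12/18} = 2^{2/3} \approx 1.587
  \quad\Rightarrow\quad \text{about } 59\% \text{ per year (doubling every 18 months)}
\]
```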

Page 8: A Roadmap to Restoring Computing's Former Glory

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

[Diagram labels: Parallel Programming; Automatic Parallelization; Parallel Libraries; Computer Architecture]

Implicitly parallel programming with critique-based iterative, occasionally interactive, speculatively pipelined automatic parallelization

A Roadmap to restoring computing’s former glory.

Page 9: A Roadmap to Restoring Computing's Former Glory

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

[Diagram labels: New or Existing Sequential Code; New or Existing Libraries; Insight Annotation (twice); DSWP Family Optis; Speculative Optis; Other Optis; Complainer/Fixer; Machine Specific Performance Primitives; Parallelized Code; One Implementation]

Page 10: A Roadmap to Restoring Computing's Former Glory

Spec-PS-DSWP

[Figure: schedule across Cores 1 to 4 over time steps 0 to 5, with operations labeled LD:1 to LD:5, W:1 to W:4, and C:1 to C:3]

P6 SUPERSCALAR ARCHITECTURE

Page 11: A Roadmap to Restoring Computing's Former Glory

Example

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

[Figure: schedule of statement instances A1, B1, C1, D1, A2, B2, C2, D2 on Cores 1 to 3 over time]

Program Dependence Graph

[Figure: graph over nodes A, B, C, D; edge legend: Control Dependence, Data Dependence]

Page 12: A Roadmap to Restoring Computing's Former Glory

Example

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

Spec-DOALL

[Figure: schedule of statement instances A1, B1, C1, D1, A2, B2, C2, D2 on Cores 1 to 3 over time]

Program Dependence Graph

[Figure: graph over nodes A, B, C, D; edge legend: Control Dependence, Data Dependence]

Page 13: A Roadmap to Restoring Computing's Former Glory

Example

A: while (node) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

Spec-DOALL

[Figure: schedule of statement instances A1 to D1, A2 to D2, A3, and B3 on Cores 1 to 3 over time]

Program Dependence Graph

[Figure: graph over nodes A, B, C, D; edge legend: Control Dependence, Data Dependence]

Page 14: A Roadmap to Restoring Computing's Former Glory

Example

A: while (node) {        while (true) {
B:   node = node->next;
C:   res = work(node);
D:   write(res);
   }

Spec-DOALL

[Figure: schedule of statement instances A1 to A3 and B, C, D for iterations 1 to 4 on Cores 1 to 3 over time]

Program Dependence Graph

[Figure: graph over nodes A, B, C, D; edge legend: Control Dependence, Data Dependence]

197.parser

Slowdown

Page 15: A Roadmap to Restoring Computing's Former Glory

Spec-DOACROSS

[Figure: schedule of statement instances B, C, and D for successive iterations on Cores 1 to 3 over time]

Throughput: 1 iter/cycle

Spec-DSWP

[Figure: schedule of statement instances B, C, and D for successive iterations on Cores 1 to 3 over time]

Throughput: 1 iter/cycle
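
To make the pipelined schedule concrete, here is a minimal sketch of a plain, non-speculative DSWP-style pipeline for the example loop, written with POSIX threads. It is not the compiler's generated code. Stage 1 keeps the pointer chase (A, B) on one thread, stage 2 runs `work` (C), and stage 3 performs the write (D), so the loop-carried dependence never crosses threads. The queue type, the `res` field, the stand-in `work` body, and `run_pipeline` are all hypothetical names introduced for illustration. The sketch visits every node rather than reproducing the slide's exact statement order, and it omits speculation and parallel stages.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical list node; `res` carries the result from stage 2 to stage 3. */
typedef struct node { int value; int res; struct node *next; } node_t;

#define QCAP 8

/* Bounded blocking queue that passes node pointers between stages. */
typedef struct {
    node_t *slot[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} queue_t;

static void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void queue_push(queue_t *q, node_t *item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP) pthread_cond_wait(&q->not_full, &q->lock);
    q->slot[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static node_t *queue_pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->lock);
    node_t *item = q->slot[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

typedef struct {
    node_t *head;      /* input list */
    int *out;          /* output buffer filled by stage 3 */
    size_t out_len;    /* number of results written so far */
    queue_t q12, q23;  /* stage 1 -> 2 and stage 2 -> 3 */
} pipeline_t;

static int work(int v) { return v * v; }  /* placeholder for the slide's work() */

/* Stage 1 (A, B): the pointer chase stays on a single thread. */
static void *stage_traverse(void *arg) {
    pipeline_t *p = arg;
    for (node_t *n = p->head; n != NULL; n = n->next) queue_push(&p->q12, n);
    queue_push(&p->q12, NULL);  /* NULL marks the end of the stream */
    return NULL;
}

/* Stage 2 (C): computes the result for each node it receives. */
static void *stage_work(void *arg) {
    pipeline_t *p = arg;
    node_t *n;
    while ((n = queue_pop(&p->q12)) != NULL) {
        n->res = work(n->value);
        queue_push(&p->q23, n);
    }
    queue_push(&p->q23, NULL);
    return NULL;
}

/* Stage 3 (D): writes results in iteration order. */
static void *stage_write(void *arg) {
    pipeline_t *p = arg;
    node_t *n;
    while ((n = queue_pop(&p->q23)) != NULL) p->out[p->out_len++] = n->res;
    return NULL;
}

/* Runs the three stages on separate threads; returns the number of results. */
size_t run_pipeline(node_t *head, int *out) {
    assert(out != NULL);
    pipeline_t p = { .head = head, .out = out, .out_len = 0 };
    queue_init(&p.q12);
    queue_init(&p.q23);
    pthread_t t[3];
    pthread_create(&t[0], NULL, stage_traverse, &p);
    pthread_create(&t[1], NULL, stage_work, &p);
    pthread_create(&t[2], NULL, stage_write, &p);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return p.out_len;
}
```

A caller builds a linked list, passes its head and an output array with room for one result per node to `run_pipeline`, and links with the platform's pthread library. Because values flow in one direction only, a slower link between stages delays the first result but not the steady-state rate.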

Page 16: A Roadmap to Restoring Computing's Former Glory

Comparison: Spec-DOACROSS and Spec-DSWP

                     Spec-DOACROSS     Spec-DSWP
Comm. Latency = 1:   1 iter/cycle      1 iter/cycle
Comm. Latency = 2:   0.5 iter/cycle    1 iter/cycle

[Figure: two schedules of statement instances B, C, and D on Cores 1 to 3 over time, one per technique; the Spec-DSWP schedule is annotated with its pipeline fill time]
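
The throughput numbers in this comparison can be reproduced with a small cycle-count model. This is a simplified sketch under stated assumptions, not the measurement method behind the slide: each statement instance (B, C, D) is assumed to take one cycle, and a value produced on one core is assumed usable on another core `latency` cycles after the producing statement started. The function names are hypothetical.

```c
#include <assert.h>

/* Spec-DOACROSS: each iteration runs B, C, D on one core, but B of iteration
 * i+1 needs the value produced by B of iteration i on a different core, so
 * the communication latency is paid once per iteration. */
long doacross_cycles(long iterations, long latency) {
    assert(iterations >= 1 && latency >= 1);
    /* The last B starts at (iterations - 1) * latency; then B, C, D run. */
    return (iterations - 1) * latency + 3;
}

/* Spec-DSWP: all B instances run back to back on one core, and values only
 * flow forward to the C core and then the D core, so latency is paid once
 * per stage boundary (the pipeline fill time), not once per iteration. */
long dswp_cycles(long iterations, long latency) {
    assert(iterations >= 1 && latency >= 1);
    /* The last B starts at iterations - 1; C and D each start `latency`
     * cycles after the previous stage started; D then takes one cycle. */
    return (iterations - 1) + 2 * latency + 1;
}

/* Steady-state throughput in iterations per cycle under the same model. */
double doacross_throughput(long latency) { return 1.0 / (double)latency; }
double dswp_throughput(long latency) { (void)latency; return 1.0; }
```

Under this model, 100 iterations at latency 2 take 201 cycles with Spec-DOACROSS and 104 cycles with Spec-DSWP, while at latency 1 both take 102 cycles. That matches the 0.5 and 1 iter/cycle figures in the comparison.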

Page 17: A Roadmap to Restoring Computing's Former Glory

[Figure: Performance Speedup (X), axis from 0 to 50, versus (Number of Total Cores, Number of Nodes), from (1,1) to (128,32); series: TLS and Spec-PS-DSWP]

TLS vs. Spec-DSWP [MICRO 2010]

Geomean of 11 benchmarks on the same cluster

Page 18: A Roadmap to Restoring Computing's Former Glory

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

[Diagram labels: New or Existing Sequential Code; New or Existing Libraries; Insight Annotation (twice); DSWP Family Optis; Speculative Optis; Other Optis; Complainer/Fixer; Machine Specific Performance Primitives; Parallelized Code; One Implementation]

Page 19: A Roadmap to Restoring Computing's Former Glory

char *memory;

void * alloc(int size);

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

[Figure: Execution Plan placing calls alloc1 to alloc6 on Cores 1 to 3 over time]

Page 20: A Roadmap to Restoring Computing's Former Glory

char *memory;

@Commutative
void * alloc(int size);

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

[Figure: Execution Plan placing calls alloc1 to alloc6 on Cores 1 to 3 over time]

Page 21: A Roadmap to Restoring Computing's Former Glory

char *memory;

@Commutative
void * alloc(int size);

void * alloc(int size) {
  void * ptr = memory;
  memory = memory + size;
  return ptr;
}

[Figure: Execution Plan placing calls alloc1 to alloc6 on Cores 1 to 3 over time]

Easily Understood Non-Determinism!
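
One way to picture what the annotation permits is sketched below. It assumes that a commutative function may be called in any order provided each call executes atomically; the lock-based realization is an assumption for illustration, not the mechanism described in the talk. The names `arena`, `alloc_lock`, `alloc_many`, and `bytes_allocated` are hypothetical, and like the slide's version the allocator does no bounds checking.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static char arena[1 << 16];   /* hypothetical backing store */
static char *memory = arena;  /* bump pointer, as on the slide */
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each call runs atomically, so concurrent calls behave like some serial
 * order. Which caller receives which address is non-deterministic, but
 * every caller receives a distinct, non-overlapping region. */
void *alloc(int size) {
    pthread_mutex_lock(&alloc_lock);
    void *ptr = memory;
    memory = memory + size;
    pthread_mutex_unlock(&alloc_lock);
    return ptr;
}

/* Total bytes handed out so far. */
size_t bytes_allocated(void) {
    pthread_mutex_lock(&alloc_lock);
    size_t used = (size_t)(memory - arena);
    pthread_mutex_unlock(&alloc_lock);
    return used;
}

/* Thread body for experimentation: performs 100 allocations of 8 bytes. */
void *alloc_many(void *unused) {
    (void)unused;
    for (int i = 0; i < 100; i++) alloc(8);
    return NULL;
}
```

Running `alloc_many` on several threads always advances the bump pointer by the same total, even though the addresses each thread receives vary from run to run. That is the kind of non-determinism a caller of an allocator already tolerates.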

Page 22: A Roadmap to Restoring Computing's Former Glory

[MICRO ’07, Top Picks ’08; Automatic: PLDI ’11]

~50 of ½ Million LOCs modified in SpecINT 2000

Mods also include Non-Deterministic Branch

Page 23: A Roadmap to Restoring Computing's Former Glory

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

[Diagram labels: New or Existing Sequential Code; New or Existing Libraries; Insight Annotation (twice); DSWP Family Optis; Speculative Optis; Other Optis; Complainer/Fixer; Machine Specific Performance Primitives; Parallelized Code; One Implementation]

Page 24: A Roadmap to Restoring Computing's Former Glory

[Figure: search tree over orderings of the Sum Reduction, Unroll, and Rotate optimizations, with resulting speedups labeled 0.90X, 0.10X, 30.0X, 1.1X, 0.8X, and 1.5X]

Iterative Compilation [Cooper ’05; Almagor ’04; Triantafyllis ’05]

Page 25: A Roadmap to Restoring Computing's Former Glory

PS-DSWP Complainer

Page 26: A Roadmap to Restoring Computing's Former Glory

Red Edges: Deps between malloc() & free()
Blue Edges: Deps between rand() calls
Green Edges: Flow Deps inside Inner Loop
Orange Edges: Deps between function calls

Unroll

Sum Reduction

Rotate

PS-DSWP Complainer

Who can help me?

Programmer Annotation

Page 27: A Roadmap to Restoring Computing's Former Glory

PS-DSWP Complainer

Sum Reduction

Page 28: A Roadmap to Restoring Computing's Former Glory

PS-DSWP Complainer

Sum Reduction

PROGRAMMER: Commutative

Page 29: A Roadmap to Restoring Computing's Former Glory

PS-DSWP Complainer

Sum Reduction

PROGRAMMER: Commutative

LIBRARY: Commutative

Page 30: A Roadmap to Restoring Computing's Former Glory

PS-DSWP Complainer

Sum Reduction

PROGRAMMER: Commutative

LIBRARY: Commutative

[Figure: plot with x-axis ticks from 1 to 64 and y-axis ticks from 0 to 50; series: Parallel HMMER V2 and HMMER with Commutative]

Scalable Speedup!

Page 31: A Roadmap to Restoring Computing's Former Glory

Multicore Needs:

1. Automatic resource allocation/scheduling, speculation/commit, and pipelining.
2. Low overhead access to programmer insight.
3. Code reuse. Ideally, this includes support of legacy codes as well as new codes.
4. Intelligent automatic parallelization.

[Diagram labels: New or Existing Sequential Code; New or Existing Libraries; Insight Annotation (twice); DSWP Family Optis; Speculative Optis; Other Optis; Complainer/Fixer; Machine Specific Performance Primitives; Parallelized Code; One Implementation]

Page 32: A Roadmap to Restoring Computing's Former Glory

Performance relative to Best Sequential

128 Cores in 32 Nodes with Intel Xeon Processors [MICRO 2010]

Page 33: A Roadmap to Restoring Computing's Former Glory

Restoration of Trend

Page 34: A Roadmap to Restoring Computing's Former Glory

“Compiler Advances Double Computing Power Every 18 Years!” – Proebsting’s Law

Compiler Technology

Architecture/Devices

Era of DIY:
• Multicore
• Reconfigurable
• GPUs
• Clusters

Compiler technology inspired class of architectures?

Page 35: A Roadmap to Restoring Computing's Former Glory

The End