6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall...

73
Spring 2010 Spring 2010 Parallelization Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Transcript of 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall...

Page 1: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Spring 2010Spring 2010

Parallelization

Saman Amarasinghe Computer Science and Artificial Intelligence Laboratory

Massachusetts Institute of Technology

Page 2: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

c eas a a e at o O tu t es

Outline

• Why ParallelismWhy Parallelism

• Parallel Execution

• Parallelizing Compilers

D d A l i• Dependence Analysis

• Increasing Parallelization Opportunitiesg ppo

Page 3: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Moore’s Law

ItaniumItanium 2

1,000,000,000From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

X-1

1/78

0)

52%/year ??%/yearP2

P3P4

10,000,000

100,000,000

Num

be

man

ce (v

s. V

AX

386

486

PentiumP2

1,000,000

er of Transistors

11978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014 2016

Per

for

25%/year

8086

286 100,000

10,000

s

Saman Amarasinghe 3 6.035 ©MIT Fall 2006From David Patterson

Page 4: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Uniprocessor Performance (SPECint)

ItaniumItanium 2

p ( )1,000,000,000

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

P2

P3P4

10,000,000

100,000,000

Num

be

386

486

PentiumP2

1,000,000

er of Transistors

8086

286 100,000

10,000

s

Saman Amarasinghe 4 6.035 ©MIT Fall 2006From David Patterson

Page 5: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Multicores Are Here!

512512 Picochip Ambric Pi hi PC102 AM2045

256 Cisco CSR-1

128 Intel TflopsTflops

64

32 Raza Cavium # of# of Raw Raw XLRXLR O tOcteon

16cores 8 Niagara Cell

4 Boardcom 1480 Boardcom 1480 Opteron 4P

4 Xeon MP Xbox360

2 Power4 PA-8800 Opteron Tanglewood

PExtreme Power6Yonah

1 4004 8080 8086 286 386 486 Pentium P2 P3 Itanium

P4P41 8008 Athlon Itanium 2

1970 1975 1980 1985 1990 1995 2000 2005 20??

Saman Amarasinghe 5 6.035 ©MIT Fall 2006

Page 6: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Issues with Parallelism • Amdhal’s Law

– Any computation can be analyzed in terms of a portion that must be executed sequentially Ts and a portion that can be must be executed sequentially, Ts, and a portion that can be executed in parallel, Tp. Then for n processors:

– T(n) = Ts + Tp/n – T() = Ts, thus maximum speedup (Ts + Tp) /Ts T() Ts, thus maximum speedup (Ts + Tp) /Ts

• Load Balancing – The work is distributed among processors so that all processors

are kept busy when parallel task is executed.

• Granularity – The size of the parallel regions between synchronizations or

the ratio of computation (useful work) to communication

Saman Amarasinghe 6 6.035 ©MIT Fall 2006

the ratio of computation (useful work) to communication (overhead).

Page 7: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

c eas a a e at o O tu t es

Outline

• Why ParallelismWhy Parallelism

• Parallel Execution

• Parallelizing Compilers

D d A l i• Dependence Analysis

• Increasing Parallelization Opportunitiesg ppo

Page 8: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Types of Parallelismyp

• Instruction Level P ll li (ILP) Scheduling and Hardware Parallelism (ILP)

• Task Level Parallelism

Scheduling and Hardware

Mainly by hand (TLP)

• Loop Level Parallelism • Loop Level Parallelism (LLP) or Data Parallelism Hand or Compiler Generated

• Pipeline Parallelism

• Divide and Conquer

Hardware or Streaming

Recursive functions

Saman Amarasinghe 8 6.035 ©MIT Fall 2006

Divide and Conquer Parallelism

Recursive functions

Page 9: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Why Loops?y p

• 90% of the execution time in 10% of the code – Mostly in loops

• If parallel, can get good performance – Load balancing

• Relatively easy to analyze

Saman Amarasinghe 9 6.035 ©MIT Fall 2006

Page 10: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Programmer Defined Parallel Loopg p

• FORALL • FORACROSS – No “loop carried – Some “loop carried

dependences” dependences” Fully parallel – Fully parallel

Saman Amarasinghe 10 6.035 ©MIT Fall 2006

Page 11: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Parallel Execution • Example

FORPAR I = 0 to N FORPAR I 0 to N A[I] = A[I] + 1

• Block Distribution: Program gets mapped into /Iters = ceiling(N/NUMPROC);

FOR P = 0 to NUMPROC-1 FOR I = P*Iters to MIN((P+1)*Iters, N)A[I] = A[I] + 1

• SPMD (Single Program, Multiple Data) Code If(myPid == 0) {

…… Iters = ceiling(N/NUMPROC);

}Barrier();FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)

Saman Amarasinghe 11 6.035 ©MIT Fall 2006

FOR I myPid Iters to MIN((myPid+1) Iters, N)A[I] = A[I] + 1

Barrier();

Page 12: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Parallel Execution • Example

FORPAR I = 0 to N FORPAR I 0 to N A[I] = A[I] + 1

• Block Distribution: Program gets mapped into /Iters = ceiling(N/NUMPROC);

FOR P = 0 to NUMPROC-1 FOR I = P*Iters to MIN((P+1)*Iters, N)A[I] = A[I] + 1

• Code fork a function Iters = ceiling(N/NUMPROC);ParallelExecute(func1);ParallelExecute(func1); … void func1(integer myPid){

FOR I = myPid*Iters to MIN((myPid+1)*Iters, N)

Saman Amarasinghe 12 6.035 ©MIT Fall 2006

FOR I myPid Iters to MIN((myPid+1) Iters, N)A[I] = A[I] + 1

}

Page 13: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Stac ate o S a ed

Parallel Execution

• SPMD – Need to get all the processors execute the

control flow E t h i ti h d d d t• Extra synchronization overhead or redundant computation on all processors or both

– Stack: Private or Shared?

• Fork – Local variables not visible within the function

• Either make the variables used/defined in the loop body global or pass and return them as arguments

Saman Amarasinghe 13 6.035 ©MIT Fall 2006

body global or pass and return them as arguments • Function call overhead

Page 14: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Parallel Thread Basics • Create separate threads

– Create an OS threadCreate an OS thread • (hopefully) it will be run on a separate core

– pthread_create(&thr, NULL, &entry_point, NULL)

– Overhead in thread creation • Create a separate stack • Get the OS to allocate a threadGet the OS to allocate a thread

• Thread pool – Create all the threads (= num cores) at the beginning( ) g g – Keep N-1 idling on a barrier, while sequential execution – Get them to run parallel code by each executing a

function – Back to the barrier when parallel region is done

Saman Amarasinghe 14 6.035 ©MIT Fall 2006

Page 15: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

c eas a a e at o O tu t es

Outline

• Why ParallelismWhy Parallelism

• Parallel Execution

• Parallelizing Compilers

D d A l i• Dependence Analysis

• Increasing Parallelization Opportunitiesg ppo

Page 16: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Parallelizing Compilersg p

• Finding FORALL Loops out of FOR loopsg p p

• ExamplesExamples FOR I = 0 to 5 A[I] = A[I] + 1

FOR I = 0 to 5 A[I] = A[I+6] + 1

For I = 0 to 5 A[2*I] = A[2*I + 1] + 1

Saman Amarasinghe 16 6.035 ©MIT Fall 2006

A[2 I] A[2 I + 1] + 1

Page 17: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

– =

Iteration Spacep• N deep loops n-dimensional discrete cartesian space

– Normalized loops: assume step size = 1

0 1 2 3 4 5 6 7 J

i [i1, i2, i3,…, in]

FOR I = 0 to 6 FOR J = I to 7

0 1 2 3 4 5 6 7 J 0 1 22

I 3 4 5

• Iterations are represented as coordinates in iteration space – i ̅ = [i i i i ]

5 6

Saman Amarasinghe 17 6.035 ©MIT Fall 2006

Page 18: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Iteration Spacep• N deep loops n-dimensional discrete cartesian space

– Normalized loops: assume step size = 1

0 1 2 3 4 5 6 7 J FOR I = 0 to 6

FOR J = I to 7

0 1 2 3 4 5 6 7 J 0 1 22 3 4 5

I

• Iterations are represented as coordinates in iteration space • Sequential execution order of iterations Lexicographic order

5 6

• Sequential execution order of iterations Lexicographic order [0,0], [0,1], [0,2], …, [0,6], [0,7],

[1,1], [1,2], …, [1,6], [1,7], [2,2], …, [2,6], [2,7],

Saman Amarasinghe 18 6.035 ©MIT Fall 2006

……… [6,6], [6,7],

Page 19: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Iteration Spacep• N deep loops n-dimensional discrete cartesian space

– Normalized loops: assume step size = 1

0 1 2 3 4 5 6 7 J FOR I = 0 to 6

FOR J = I to 7

0 1 2 3 4 5 6 7 J 0 1 22 3 4 5

I

• Iterations are represented as coordinates in iteration space • Sequential execution order of iterations Lexicographic order

5 6

• Sequential execution order of iterations Lexicographic order • Iteration i̅ is lexicograpically less than j̅ , i̅ < j̅ iff

there exists c s.t. i1 = j1, i2 = j2,… ic-1 = jc-1 and ic < jc

Saman Amarasinghe 19 6.035 ©MIT Fall 2006

Page 20: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Iteration Spacep• N deep loops n-dimensional discrete cartesian space

– Normalized loops: assume step size = 1

0 1 2 3 4 5 6 7 J FOR I = 0 to 6

FOR J = I to 7

0 1 2 3 4 5 6 7 J 0 1 22 3 4 5

I

• An affine loop nest – Loop bounds are integer linear functions of constants, loop constant

5 6

Loop bounds are integer linear functions of constants, loop constant variables and outer loop indexes

– Array accesses are integer linear functions of constants, loop constant variables and loop indexes

Saman Amarasinghe 20 6.035 ©MIT Fall 2006

Page 21: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Iteration Spacep• N deep loops n-dimensional discrete cartesian space

– Normalized loops: assume step size = 1 0 1 2 3 4 5 6 7 J

FOR I = 0 to 6 FOR J = I to 7

0 1 2 3 4 5 6 7 J 0 1 22 3 4 5

I

• Affine loop nest Iteration space as a set of liner inequalities

5 6

Affine loop nest Iteration space as a set of liner inequalities 0 ≤ I

I ≤ 6 I ≤ J

Saman Amarasinghe 21 6.035 ©MIT Fall 2006

I ≤ J J ≤ 7

Page 22: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Data Spacep

• M dimensional arrays m-dimensional discrete cartesian space h b– a hypercube

Integer A(10) 0 1 2 3 4 5 6 7 8 9

0 1 2 3 4 5

Float B(5, 6)

0 1 2 3 4 5 0 1 22 3 4

Saman Amarasinghe 22 6.035 ©MIT Fall 2006

Page 23: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

a

Dependencesp• True dependence

a = = a

• Anti dependence = a

a =

• Output dependence a = a =

• Definition: Data dependence exists for a dynamic instance i and j iffData dependence exists for a dynamic instance i and j iff – either i or j is a write operation – i and j refer to the same variable – i executes before j

Saman Amarasinghe 23 6.035 ©MIT Fall 2006

• How about array accesses within loops?

Page 24: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

c eas a a e at o O tu t es

Outline

• Why ParallelismWhy Parallelism

• Parallel Execution

• Parallelizing Compilers

D d A l i• Dependence Analysis

• Increasing Parallelization Opportunitiesg ppo

Page 25: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Array Accesses in a loopy p FOR I = 0 to 5 A[I] = A[I] + 1

0 1 2 3 4 5 6 7 8 9 10 1112 0 1 2 3 4 5 Iteration Space Data Space

Saman Amarasinghe 25 6.035 ©MIT Fall 2006

Page 26: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Array Accesses in a loopy p FOR I = 0 to 5 A[I] = A[I] + 1

0 1 2 3 4 5 6 7 8 9 10 1112 0 1 2 3 4 5 Iteration Space Data Space

= A[I] A[I]

= A[I]A[I] A[I]

= A[I] A[I]A[I]

= A[I] A[I]

= A[I]

Saman Amarasinghe 26 6.035 ©MIT Fall 2006

[ ] A[I]

= A[I] A[I]

Page 27: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Array Accesses in a loopy p FOR I = 0 to 5 A[I+1] = A[I] + 1

0 1 2 3 4 5 6 7 8 9 10 1112 0 1 2 3 4 5 Iteration Space Data Space

= A[I] A[I+1]

= A[I]A[I] A[I+1]

= A[I] A[I+1]A[I+1]

= A[I] A[I+1]

= A[I]

Saman Amarasinghe 27 6.035 ©MIT Fall 2006

[ ] A[I+1]

= A[I] A[I+1]

Page 28: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Array Accesses in a loopy p FOR I = 0 to 5 A[I] = A[I+2] + 1

0 1 2 3 4 5 6 7 8 9 10 1112 0 1 2 3 4 5 Iteration Space Data Space

= A[I+2] A[I]

= A[I+2]A[I+2] A[I]

= A[I+2] A[I]A[I]

= A[I+2] A[I]

= A[I+2]

Saman Amarasinghe 28 6.035 ©MIT Fall 2006

[ ] A[I]

= A[I+2] A[I]

Page 29: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Array Accesses in a loopy p FOR I = 0 to 5 A[2*I] = A[2*I+1] + 1

0 1 2 3 4 5 6 7 8 9 10 1112 0 1 2 3 4 5 Iteration Space Data Space

= A[2*+1] A[2*I]

= A[2*I+1]A[2*I+1] A[2*I]

= A[2*I+1] A[2*I]A[2 I]

= A[2*I+1] A[2*I]

= A[2*I+1]

Saman Amarasinghe 29 6.035 ©MIT Fall 2006

[ ] A[2*I]

= A[2*I+1] A[2*I]

Page 30: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Distance Vectors

• A loop has a distance d if there exist a data p dependence from iteration i to j and d = j-i

0 5

FOR I = 0 to 5

FOR I = 0 to 5 A[I] = A[I] + 1

0dv

1d FOR I 0 to 5 A[I+1] = A[I] + 1

1dv

FOR I = 0 to 5 A[I] = A[I+2] + 1

2dv

Saman Amarasinghe 30 6.035 ©MIT Fall 2006

FOR I = 0 to 5 A[I] = A[0] + 1

*2,1 dv

Page 31: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Multi-Dimensional DependencepFOR I = 1 to n FOR J 1 t

J FOR J = 1 to n A[I, J] = A[I, J-1] + 1

I

1 0

dv

Saman Amarasinghe 31 6.035 ©MIT Fall 2006

Page 32: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Multi-Dimensional DependencepFOR I = 1 to n FOR J 1 t

J FOR J = 1 to n A[I, J] = A[I, J-1] + 1

I

1 0

dv

FOR I = 1 to n FOR J 1 to n

J

FOR J = 1 to n A[I, J] = A[I+1, J] + 1 I

1

Saman Amarasinghe 32 6.035 ©MIT Fall 2006

0 1

dv

Page 33: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Outline

• Dependence Analysisp y • Increasing Parallelization Opportunities

Page 34: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

What is the Dependence?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I

0 1 2 3 4 5 0 11 2 3 4

Saman Amarasinghe 34 6.035 ©MIT Fall 2006

4

Page 35: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

What is the Dependence?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I

0 1 2 3 4 5 0 11 2 3 4

Saman Amarasinghe 35 6.035 ©MIT Fall 2006

4

Page 36: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

What is the Dependence?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I 1

1

1dv

Saman Amarasinghe 36 6.035 ©MIT Fall 2006

Page 37: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

What is the Dependence?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I

FOR I = 1 to n O 1 t

J

FOR J = 1 to n A[I] = A[I-1] + 1 I

Saman Amarasinghe 37 6.035 ©MIT Fall 2006

Page 38: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

What is the Dependence?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I 1 dv=[1, -1]

1

1dv

FOR I = 1 to n O 1 t

J

FOR J = 1 to n B[I] = B[I-1] + 1 I

1111

Saman Amarasinghe 38 6.035 ©MIT Fall 2006

* 1

,3

1 ,

2 1

,1

1 dv

Page 39: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

What is the Dependence?

FOR i = 1 to N-1

p J

FOR j = 1 to N-1A[i,j] = A[i,j-1] + A[i-1,j]; I

01

1 0

,0 1

dv

Saman Amarasinghe 39 6.035 ©MIT Fall 2006

Page 40: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Recognizing FORALL Loops

• Find data dependences in loop FFor every paiir of array acceses tto the same array –

g g p

f th If the first access has at least one dynamic instance (an iteration) in which it refers to a location in the array that the second access also refers to in at least one of the later dynamic instancesalso refers to in at least one of the later dynamic instances (iterations). Then there is a data dependence between the statements

– (Note that same array can refer to itself – output dependences) output dependences)(Notethat same array can refer to itself

• Definition

– Loop carried dependence: Loop-carried dependence:dependence that crosses a loop boundary

• If there are no loop carried dependences parallelizableIf there are no loop carried dependences parallelizable

Saman Amarasinghe 40 6.035 ©MIT Fall 2006

Page 41: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Data Dependence Analysisp y

• I: Distance Vector method • II: Integer Programming

Saman Amarasinghe 41 6.035 ©MIT Fall 2006

Page 42: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Distance Vector Method

• The ith loop is parallelizable for allp p dependence d = [d1,…,di,..dn] either

one of d1,…,di-1 is > 0 or

all d1,…,di = 0

Saman Amarasinghe 42 6.035 ©MIT Fall 2006

Page 43: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

es

Is the Loop Parallelizable?p

FOR I = 0 to 5 A[I] A[I] + 1

0dv Yes A[I] = A[I] + 1

FOR I = 0 to 5 A[I+1] = A[I] + 1

1dv No

FOR I = 0 to 5 2dv No A[I] = A[I+2] + 1

Saman Amarasinghe 43 6.035 ©MIT Fall 2006

FOR I = 0 to 5 A[I] = A[0] + 1

*dv No

Page 44: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Are the Loops Parallelizable?pFOR I = 1 to n FOR J 1 t

J FOR J = 1 to n A[I, J] = A[I, J-1] + 1

I

1 0

dv Yes No

FOR I = 1 to n FOR J 1 to n

J

FOR J = 1 to n A[I, J] = A[I+1, J] + 1 I

1

Saman Amarasinghe 44 6.035 ©MIT Fall 2006

0 1

dv No Yes

Page 45: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Are the Loops Parallelizable?pFOR I = 1 to n FOR J = 1 to n

J FOR J 1 to n A[I, J] = A[I-1, J+1] + 1

I 1 dv=[1, -1]

1

1dv No

Yes

FOR I = 1 to n O 1 t

J

FOR J = 1 to n B[I] = B[I-1] + 1 I

1 No

Saman Amarasinghe 45 6.035 ©MIT Fall 2006

* 1

dv No Yes

Page 46: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Integer Programming Methodg g g

• Example FOR I 0 t 5FOR I = 0 to 5 A[I+1] = A[I] + 1

• Is there a loop-carried dependence between A[I+1] and A[I] – Is there two distinct iterations iw and ir such that A[iw+1] is the same

location as A[ir]location as A[ir] – integers iw, ir 0 ≤ iw, ir ≤ 5 iw ir iw+ 1 = ir

• Is there a dependence between A[I+1] and A[I+1]Is there a dependence between A[I+1] and A[I+1] – Is there two distinct iterations i1 and i2 such that A[i1+1] is the same

location as A[i2+1] – integers i i 0 ≤ i i ≤ 5 i i i + 1 = i +1

Saman Amarasinghe 46 6.035 ©MIT Fall 2006

integers i1, i2 0 ≤ i1, i2 ≤ 5 i1 i2 i1+ 1 i2 +1

Page 47: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Integer Programming Methodg g g FOR I = 0 to 5 A[I+1] = A[I] + 1

• FormulationFormulation• – an integer vector i̅ such that  i̅ ≤ b̅ where

 is an integer matrix and b̅ is an integer vector isan integer matrix and b is an integer vector

Saman Amarasinghe 47 6.035 ©MIT Fall 2006

Page 48: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Iteration SpacepFOR I = 0 to 5 A[I+1] = A[I] + 1

• N deep loops n-dimensional discrete cartesian space

0 1 2 3 4 5 6 7 J 0 • Affine loop nest Iteration

space as a set of liner inequalities

0 1 2 3I 0 ≤ I

I ≤ 6 I ≤ J

3 4 5 6

I

Saman Amarasinghe 48 6.035 ©MIT Fall 2006

J ≤ 7 6

Page 49: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Integer Programming Method

g g g FOR I = 0 to 5 A[I+1] = A[I] + 1

• FormulationFormulation – an integer vector i̅ such that  i̅ ≤ b̅ where

 is an integer matrix and b̅ is an integer vector isan integer matrix and b is an integer vector

• Our problem formulation for A[i] and A[i+1] integers i i 0 ≤ i i ≤ 5 iiw iir iw++ 11 = i= ir integers iw, ir 0 ≤ iw, ir ≤ 5 i

– iw ir is not an affine function • divide into 2 problems • Problem 1 with iw < ir and problem 2 with ir < iw

• If either problem has a solution there exists a dependence

– How about iw+ 1 = irw r • Add two inequalities to single problem

iw+ 1 ≤ ir, and ir ≤ iw+ 1

Saman Amarasinghe 49 6.035 ©MIT Fall 2006

Page 50: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Integer Programming Formulationg g g FOR I = 0 to 5

• Problem 1 A[I+1] = A[I] + 1

0 ≤ iw

iw ≤ 5

0 ≤ irir ≤ 5

i iiw < ir iw+ 1 ≤ iri ≤ iiw+ 1ir ≤ + 1

Saman Amarasinghe 50 6.035 ©MIT Fall 2006

Page 51: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

-

Integer Programming Formulationg g g FOR I = 0 to 5

• Problem 1 A[I+1] = A[I] + 1

0 ≤ iw -iw ≤ 0

iw ≤ 5 iw ≤ 5

0 ≤ ir -ir ≤ 0

ir ≤ 5 ir ≤ 5

i < iir iiw i ≤iw - ir ≤ -11

iw+ 1 ≤ ir iw - ir ≤ -1

i ≤ iiw+ 1+ 1 -iiw + ir ≤ 1ir ≤ + i ≤ 1

Saman Amarasinghe 51 6.035 ©MIT Fall 2006

Page 52: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

- -

Integer Programming Formulationg g g

• Problem 1 Â b̅ 0 ≤ iw -iw ≤ 0 -1 0 0 iw ≤ 5 iw ≤

5

1 0 5 0 ≤ ir -ir ≤ 0 0 -1 0 ir ≤ 5 ir ≤

5

0 1 5 i i i i ≤ 1 1 1 1iw < ir iw - ir ≤ -1 1 -1 -1 iw+ 1 ≤ ir iw - ir ≤ -1 1 -1 -1 i ≤ i + 1 -i + i ≤ 1 -1 1 1ir ≤ iw+ 1 iw + ir ≤ 1 1 1 1

• and problem 2 with ir < iw

Saman Amarasinghe 52 6.035 ©MIT Fall 2006

and problem 2 with ir < iw

Page 53: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Generalization • An affine loop nest

FOR i1 = fl1(c1…ck) to Iu1(c1…ck)FOR i2 = fl2(i1,c1…ck) to Iu2(i1,c1…ck) …… FOR in = fln(i1…in-1,c1…ck) to Iun(i1…in-1,c1…ck)

A[fa1(i1…in,c1…ck), fa2(i1…in,c1…ck),…,fam(i1…in,c1…ck)]

• Solve 2*n problems of the formSolve 2*n problems of the form • i1 = j1, i2 = j2,…… in-1 = jn-1, in < jn • i1 = j1, i2 = j2,…… in-1 = jn-1, jn < in • i1 = j1, i2 = j2,…… in-1 < jn-11 j1 2 j2 n 1 jn 1 • i1 = j1, i2 = j2,…… jn-1 < in-1

………………… • i1 = j1, i2 < j2 • i1 = j1, j2 < i2

Saman Amarasinghe 53 6.035 ©MIT Fall 2006

1 j1 j2 2

• i1 < j1 • j1 < i1

Page 54: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

c eas a a e at o O tu t es

Outline

• Why ParallelismWhy Parallelism

• Parallel Execution

• Parallelizing Compilers

D d A l i• Dependence Analysis

• Increasing Parallelization Opportunitiesg ppo

Page 55: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Increasing Parallelization OpportunitiesOpportunities

• Scalar Privatization • Reduction Recognition • Induction Variable IdentificationInduction Variable Identification • Array Privatization • Loop Transformations • Granularity of Parallelismy• Interprocedural Parallelization

Saman Amarasinghe 55 6.035 ©MIT Fall 2006

Page 56: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Scalar Privatization

• ExamplepFOR i = 1 to n X = A[i] * 3;B[i] XB[i] = X;

• Is there a loop carried dependence?Is there a loop carried dependence? • What is the type of dependence?

Saman Amarasinghe 56 6.035 ©MIT Fall 2006

Page 57: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Privatization

• Analysis: – Any anti- and output- loop-carried dependences

Eli i t b i i i l l t t• Eliminate by assigning in local context FOR i = 1 to n

integer Xtmp;[i] * 3Xtmp = A[i] * 3;

B[i] = Xtmp;

• Eliminate by expanding into an array FOR i = 1 to n

Xtmp[i] = A[i] * 3;

Saman Amarasinghe 57 6.035 ©MIT Fall 2006

p B[i] = Xtmp[i];

Page 58: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Privatization

• Need a final assignment to maintain the correct value after the loop nestvalue after the loop nest

• Eliminate by assigning in local contextEliminate by assigning in local context FOR i = 1 to n

integer Xtmp;Xtmp = A[i] * 3;B[i] = Xtmp;if(i == n) X = Xtmp

Eli i t b di i t• Eliminate by expanding into an array FOR i = 1 to n

Xtmp[i] = A[i] * 3;B[i] = Xtmp[i];

Saman Amarasinghe 58 6.035 ©MIT Fall 2006

B[i] Xtmp[i];X = Xtmp[n];

Page 59: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Another Examplep

• How about loop-carried truepdependences?

• Example• Example FOR i = 1 to n X = X + A[i];[ ];

• Is this loop parallelizable?p p

Saman Amarasinghe 59 6.035 ©MIT Fall 2006

Page 60: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Reduction Recognitiong

• Reduction Analysis: – Only associative operations – The result is never used within the loop

• Transformation Integer Xtmp[NUMPROC];Barrier();FOR i = myPid*Iters to MIN((myPid+1)*Iters, n)

Xtmp[myPid] = Xtmp[myPid] + A[i];Barrier()Barrier();If(myPid == 0) {

FOR p = 0 to NUMPROC-1X = X + Xtmp[p];

Saman Amarasinghe 60 6.035 ©MIT Fall 2006

X X + Xtmp[p]; …

Page 61: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Induction Variables

• Example FOR i = 0 to N

A[i] = 2^i;

• After strength reductionAfter strength reduction t = 1 FOR i = 0 to N

A[i] = t;A[i] t; t = t*2;

• What happened to loop carried dependences?pp p p • Need to do opposite of this!

– Perform induction variable analysis

Saman Amarasinghe 61 6.035 ©MIT Fall 2006

– Rewrite IVs as a function of the loop variable

Page 62: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Array Privatizationy

• Similar to scalar privatization

• However, analysis is more complex – Array Data Dependence Analysis:

Checks if two iterations access the same location – Array Data Flow Analysis:Array Data Flow Analysis:

Checks if two iterations access the same value

• Transformations – Similar to scalar privatization

P i t f h d ith

Saman Amarasinghe 62 6.035 ©MIT Fall 2006

– Private copy for each processor or expand with an additional dimension

Page 63: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Loop Transformations • A loop may not be parallel as is

E l

p J

• Example FOR i = 1 to N-1 FOR j = 1 to N-1

I

j A[i,j] = A[i,j-1] + A[i-1,j];

Saman Amarasinghe 63 6.035 ©MIT Fall 2006

Page 64: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Loop Transformations • A loop may not be parallel as is

E l

p J

• Example FOR i = 1 to N-1 FOR j = 1 to N-1

I

j A[i,j] = A[i,j-1] + A[i-1,j]; J

After loop Skewing I

oldnew ii 11

• After loop Skewing FOR i = 1 to 2*N-3 FORPAR j = max(1,i-N+2) to min(i, N-1)

oldnew jj 10

Saman Amarasinghe 64 6.035 ©MIT Fall 2006

A[i-j+1,j] = A[i-j+1,j-1] + A[i-j,j];

Page 65: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Granularity of Parallelism y

FOR j = 1+ myPid*Iters to MIN((myPid+1)*Iters, n-1)

Barrier();

• Example FOR i = 1 to N 1

J FOR i 1 to N-1

FOR j = 1 to N-1A[i,j] = A[i,j] + A[i-1,j];

I • Gets transformed into

FOR i = 1 to N-1 Barrier();

j 1 id* (( id 1)* 1) A[i,j] = A[i,j] + A[i-1,j];

• Inner loop parallelism can be expensive – Startup and teardown overhead of parallel regions – Lot of synchronization Lot of synchronization – Can even lead to slowdowns

Saman Amarasinghe 65 6.035 ©MIT Fall 2006

Page 66: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Granularity of Parallelism y

• Inner loop parallelism can be expensivep p p

• Solutions – Don’t parallelize if the amount of work withinDon t parallelize if the amount of work within

the loop is too small oror – Transform into outer-loop parallelism

Saman Amarasinghe 66 6.035 ©MIT Fall 2006

Page 67: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

=

Outer Loop Parallelismp

• Example FOR i = 1 to N 1

J FOR i 1 to N-1

FOR j = 1 to N-1A[i,j] = A[i,j] + A[i-1,j]; I

• After Loop Transpose FOR j = 1 to N-1

i 1 1

I FOR i = 1 to N-1

A[i,j] = A[i,j] + A[i-1,j];

• Get mapped into J

• Get mapped into Barrier();FOR j = 1+ myPid*Iters to MIN((myPid+1)*Iters, n-1)

FOR i = 1 to N-1

Saman Amarasinghe 67 6.035 ©MIT Fall 2006

A[i,j] = A[i,j] + A[i-1,j];Barrier();

Page 68: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Unimodular Transformations

• Interchange, reverse and skewg ,• Use a matrix transformation

Inew = A Iold Inew A Iold

• Interchange

oldnew ii

01 10

Interchange

Reverse

oldnew jj 01

oldnew ii 01

• Reverse

oldnew jj 10

ii 11 • Skew Saman Amarasinghe 68 6.035 ©MIT Fall 2006

old

old

new

new

j i

j i

10 11

Page 69: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Legality of Transformationsg y • Unimodular transformation with matrix A is valid iff.

For all dependence vectors vFor all dependence vectors v the first non-zero in Av is positive

• Example 0101

FOR i = 1 to N-1 FOR j = 1 to N-1A[i,j] = A[i,j] + A[i-1,j];

10 01

1 0

,0 1

dv

• Interchange

01 10

A

01 10

10 01

01 10

• Reverse

10 01

A

10 01

10 01

10 01

• Skew Saman Amarasinghe 69 6.035 ©MIT Fall 2006

10 11

A

10 11

10 01

10 11

Page 70: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Interprocedural Parallelizationp

• Function calls will make a loop unparallelizatble – Reduction of available parallelism – A lot of inner-loop parallelism

• Solutions I t d l A l i– Interprocedural Analysis

– Inlining

Saman Amarasinghe 70 6.035 ©MIT Fall 2006

Page 71: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Interprocedural Parallelizationp

• Issues S f ti d ti – Same function reused many times

– Analyze a function on each trace Possibly exponential – Analyze a function once unrealizable path problem

• Interprocedural Analysis Need to update all the analysis – Need to update all the analysis

– Complex analysis – Can be expensive

• Inlining – Works with existing analysis

Saman Amarasinghe 71 6.035 ©MIT Fall 2006

Works with existing analysis – Large code bloat can be very expensive

Page 72: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

Summaryy

• Multicores are here – Need parallelism to keep the performance gains – Programmer defined or compiler extracted parallelism

• Automatic parallelization of loops with arrays – Requires Data Dependence Analysis – Iteration space & data space abstraction

A i t bl

– An integer programmiing problem

• Many optimizations that’ll increase parallelism Many optimizations that ll increase parallelism

Saman Amarasinghe 72 6.035 ©MIT Fall 2006

Page 73: 6.035 lecture 17, Parrallelization · Saman Amarasin ghe 3 6.035 ©MITFrom David PattersonFall 2006. Uniprocessor Performance (SPECint) Itanium Itanium 2 p(1,000,000,000 From Hennessy

MIT OpenCourseWarehttp://ocw.mit.edu

6.035 Computer Language Engineering Spring 2010

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.