18.337 Parallel Computing’s Challenges


Old Homework (emphasized for effect)

• Download a parallel program from somewhere.
  – Make it work

• Download another parallel program
  – Now, …, make them work together!


SIMD

• SIMD (Single Instruction, Multiple Data) refers to parallel hardware that can execute the same instruction on multiple data. (Think of the addition of two vectors: one add instruction applies to every element of the vector. A small sketch follows this list.)

– Term was coined with one element per processor in mind, but with today’s deep memories and hefty processors, large chunks of the vectors would be added on one processor.

– Term was coined with a broadcasting of an instruction in mind, hence the single instruction, but today’s machines are usually more flexible.

– Term was coined with A+B and elementwise A×B in mind, so nobody really knows for sure whether matmul or fft counts as SIMD, but these operations can certainly be built from SIMD operations.

• Today, it is not unusual to refer to a SIMD operation (sometimes, but not always, historically synonymous with data-parallel operations, though this feels wrong to me) when the software appears to run “lock-step,” with every processor executing the same instruction.

– Usage: “I hear that machine is particularly fast when the program primarily consists of SIMD operations.”

– Graphics processors such as NVIDIA’s seem to run fastest on SIMD-type operations, but current research (and old research too) pushes the limits of SIMD.
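As a concrete illustration of “same instruction, multiple data,” here is a minimal MATLAB sketch (MATLAB to match the course’s later use cases; the vector length is an arbitrary assumption). The vectorized add is the canonical SIMD / data-parallel operation; the loop below it computes the same result one element at a time.

% Elementwise vector addition: one conceptual "add" applied across all elements.
n = 1e6;
x = randn(n, 1);
y = randn(n, 1);
z = x + y;                % vectorized; the hardware/library may use SIMD units

% The same result expressed as a scalar loop, one element per iteration.
z_loop = zeros(n, 1);
for k = 1:n
    z_loop(k) = x(k) + y(k);
end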


The natural question may not be the most important

• How do I parallelize x?
  – First question many students ask
  – Answer often either one of
    • Fairly obvious
    • Very difficult
  – Can miss the true issues of high performance
    • These days people are often good at exploiting locality for performance
    • People are not very good about hiding communication and anticipating data movement to avoid bottlenecks
    • People are not very good about interweaving multiple functions to make the best use of resources
  – Usually misses the issue of interoperability
    • Will my program play nicely with your program?
    • Will my program really run on your machine?


Class Notation

• Vectors: small roman letters: x, y, …
• Vectors have length n if possible
• Matrices: large roman (sometimes Greek) letters: A, B, X, Λ, Σ
• Matrices are n x n, or maybe m x n, but almost never n x m. Could be p x q.
• Scalars may be small Greek letters or small roman letters – may not be as consistent


Algorithm Example: FFTs

• For now, think of an FFT as a “black box”: y = FFT(x) takes as input, and returns as output, a vector of length n, defined (but not computed) as a matrix times a vector: y = F_n x, where (F_n)_jk = e^(-2πi jk/n) for j, k = 0, …, n-1.
• Important Use Cases
  – Column fft: fft(X), fft(X,[ ],1) (MATLAB)
  – Row fft: fft(X,[ ],2) (MATLAB)
  – 2d fft (do a row and a column): fft2(X)
    • fft2(X) = row_fft(col_fft(X)) = col_fft(row_fft(X))
(A quick MATLAB check of both facts follows.)
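Here is a quick MATLAB check, a minimal sketch with arbitrary sizes, that the black-box fft really is the matrix F_n applied to x, and that fft2 is a column fft followed by a row fft:

% Form the DFT matrix F_n (never actually formed in practice) and compare.
n = 8;
j = (0:n-1)';                 % row indices as a column
k = 0:n-1;                    % column indices as a row
F = exp(-2i*pi*j*k/n);        % (F_n)_jk = e^(-2*pi*i*j*k/n)
x = randn(n, 1);
norm(F*x - fft(x))            % essentially zero (roundoff)

% fft2 = row fft of the column fft (the order can also be swapped).
X = randn(6, 4);
norm(fft2(X) - fft(fft(X, [], 1), [], 2))   % essentially zero (roundoff)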


How to implement a column FFT?

• Put block columns on each processor

• Do local column FFTs

[Figure: the matrix split into block columns, one block per processor P0, P1, P2]

Local column FFTs may be “column at a time” or “pipelined”

In the case of the FFT a fast local package is probably available, but that may not be true for other operations. Also, as MIT students have been known to do, you might try to beat the packages. (A serial sketch of the block-column layout follows.)
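A minimal MATLAB sketch of the idea, run serially with a cell array standing in for three processors (the matrix size and the three-way split are arbitrary assumptions):

% Block columns "live" on P0, P1, P2; each does local column FFTs,
% with no communication at all.
X = randn(8, 6);
blocks = {X(:, 1:2), X(:, 3:4), X(:, 5:6)};   % one block column per "processor"
for p = 1:numel(blocks)
    blocks{p} = fft(blocks{p});               % fft of a matrix = column FFTs
end
norm([blocks{:}] - fft(X, [], 1))             % matches the global column fft up to roundoff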


A closer look at column fft

• Put block columns on each processor

• Where were the columns? Where are they going?
• The cost of the above can be very expensive in performance. Can we hide it somewhere?

[Figure: block columns across processors P0, P1, P2]


What about row fft
• Suppose block columns on each processor
• Many transpose, then apply a column FFT, and transpose back (a serial sketch of this follows below)
• This thinking is simple and doable
• Not only simple, but it encourages the paradigm of
  – 1) do whatever, 2) get good parallelism, and 3) do whatever
• Harder to decide whether to do rows in parallel or to interweave the transposing of pieces with the start of the computation
  – May give more performance, but nobody to my knowledge has done a good job of this yet. You could be the first.

[Figure: block columns across processors P0, P1, P2]
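A serial MATLAB sketch of the transpose approach (the matrix size is an arbitrary assumption; on a real machine the transposes are where the data movement and communication would happen):

% Row FFT via: transpose, local column FFTs, transpose back.
X = randn(8, 6);
Y = fft(X.').';                 % .' is the non-conjugate transpose
norm(Y - fft(X, [], 2))         % matches the row fft done directly, up to roundoff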


Not load balanced column fft?
• Suppose block columns on each processor

• To load balance or to not load balance, that is the question

• Traditional wisdom says this is badly load balanced and parallelism is lost, but there is a cost to moving the data which may or may not be worth the gain in load balancing (a back-of-envelope sketch follows below)

[Figure: block columns distributed unevenly across processors P0, P1, P2]
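A back-of-envelope MATLAB sketch of the tradeoff. All of the timings below are made-up assumptions purely for illustration; the point is only that the answer depends on the ratio of data-movement cost to compute cost.

% 6 columns all start on P0; P1 and P2 start empty.
t_col  = 1.0;    % assumed time for one local column FFT
t_move = 1.5;    % assumed time to ship one column to another processor

% Option 1: leave the data alone; P0 does all 6 columns while P1 and P2 idle.
unbalanced = 6 * t_col;

% Option 2: ship 2 columns each to P1 and P2 (4 sends serialized at P0),
% then every processor does 2 local column FFTs.
balanced = 4 * t_move + 2 * t_col;

[unbalanced, balanced]   % with these numbers, moving the data does not pay off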


2d fft
• Suppose block columns on each processor

• Can do columns, transpose, rows, transpose
• Can do transpose, rows, transpose, columns
• Can be fancier?

[Figure: block columns across processors P0, P1, P2]
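A serial MATLAB check that both orderings give the 2d FFT (sizes again arbitrary; in a distributed setting each transpose is a global data movement):

X = randn(8, 6);
Y1 = fft(fft(X).').';     % columns, transpose, columns (the original rows), transpose back
Y2 = fft(fft(X.').');     % transpose, columns (the original rows), transpose back, columns
norm(Y1 - fft2(X))        % essentially zero (roundoff)
norm(Y2 - fft2(X))        % essentially zero (roundoff)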


So much has to do with access to memory and data movement

• The conventional wisdom is that it’s all about locality. This remains partially true and partially not quite as true as it used to be.


http://www.cs.berkeley.edu/~samw/research/talks/sc07.pdf


A peek inside an FFT (more later in the semester)

[Figure: a look inside a parallel FFT run, annotated “Time wasted on the telephone”]


Tracing Back the data dependency


New term for the day: MIMD

• MIMD (Multiple Instruction stream, Multiple Data stream) refers to most current parallel hardware, where each processor can independently execute its own instructions. The importance of MIMD over SIMD emerged in the early 1990s, as commodity processors became the basis of much parallel computing.

• One may also refer to a MIMD operation in an implementation, if one wishes to emphasize non-homogeneous execution. (Often contrasted with SIMD; a tiny sketch follows.)
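A tiny sketch of MIMD-style execution in MATLAB. This assumes the Parallel Computing Toolbox (parpool, spmd, labindex), which is an assumption here rather than anything the slide specifies; the point is only that different workers run different instruction streams at the same time.

parpool(3);                               % assumes the Parallel Computing Toolbox
spmd
    if labindex == 1
        out = fft(randn(1024, 1));        % worker 1 runs an FFT
    else
        out = sort(randn(1024, 1));       % the other workers do something else entirely
    end
end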


Importance of Abstractions

• Ease of use requires that the very notion of a processor really should be buried underneath the user

• Some think that the very requirements of performance require the opposite

• I am fairly sure the above bullet is more false than true – you can be the ones to figure this all out!