College of Nanoscale Science and Engineering

A uniform algebraically-based approach to computational physics and efficient programming

James E. Raynolds, College of Nanoscale Science and Engineering

University at Albany, State University of New York, Albany, NY 12309

Lenore Mullin, Computer Science

University at Albany, State University of New York, Albany, NY 12309


Matrix Example

In Fortran 90: D = A + B + C

First temporary computed: temp1 = B + C

Second temporary: temp2 = A + temp1

Last operation: D = temp2


Matrix Example (cont)

Intermediate temporaries consume memory and add to processing operations

Solution: compose index operations

Loop over i, j (no temporaries):

D(i, j) = A(i, j) + B(i, j) + C(i, j)
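The difference between the two evaluation orders can be sketched in a few lines. This is a minimal illustration in Python on small nested lists, not the authors' Fortran or generated code; the function names `add_with_temporaries` and `add_fused` are hypothetical.

```python
def add_with_temporaries(A, B, C):
    """Evaluate D = A + B + C with two full-size intermediate arrays."""
    rows, cols = len(A), len(A[0])
    temp1 = [[B[i][j] + C[i][j] for j in range(cols)] for i in range(rows)]
    temp2 = [[A[i][j] + temp1[i][j] for j in range(cols)] for i in range(rows)]
    return temp2  # D = temp2


def add_fused(A, B, C):
    """Compose the index operations: one pass over (i, j), no intermediate arrays."""
    rows, cols = len(A), len(A[0])
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            D[i][j] = A[i][j] + B[i][j] + C[i][j]
    return D


A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
C = [[100, 200], [300, 400]]
D_temp = add_with_temporaries(A, B, C)
D_fused = add_fused(A, B, C)
```

Both functions return the same D; the fused version allocates only the result.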


Need for formalism

Few problems are as simple as D = A + B + C

Formalism designed to handle extremely complicated situations systematically

Goal: composition of algorithms

• For example: radar processing is a composition of numerous algorithms: QR(FFT(X)).

• Optimizations are classically done sequentially even when parallel processors and nodes are used: FFT (or DFT?) then QR.

• Optimizations can be carried out across algorithms, processors, and memories.


MoA and Psi Calculus

Basic Properties:

• An index calculus: psi function.

• Shape polymorphic functions and operators:
  – Operations are defined using shapes and psi.
  – MoA defines some useful operations and functions.
  – As long as shapes define functions and operations, any new function or operation may be defined and reduced.

• Fundamental type is the array:
  – Scalars are 0-dimensional arrays.

• Denotational Normal Form (DNF) = reduced form in Cartesian coordinates (independent of data layout: row major, column major, regular sparse, …)

• Operational Normal Form (ONF) = reduced form for 1-d memory layout(s).
  – Defines how to build the code on processor/memory hierarchies. ONF reveals loops and control.
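To make the distinction between layout-independent Cartesian indexing and a 1-d memory layout concrete, here is a small sketch, assuming a dense array stored either row major or column major. The helper names `row_major_offset` and `column_major_offset` are hypothetical and are not MoA notation; the sketch only illustrates that one Cartesian index maps to different 1-d offsets under different layouts.

```python
def row_major_offset(index, shape):
    """Offset of a Cartesian index in a dense row-major 1-d layout."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


def column_major_offset(index, shape):
    """Offset of the same Cartesian index in a dense column-major 1-d layout."""
    offset = 0
    for i, n in zip(reversed(index), reversed(shape)):
        offset = offset * n + i
    return offset


shape = (2, 3)
# The value 10*i + j lives at Cartesian index (i, j), whatever the layout.
values = {(i, j): 10 * i + j for i in range(2) for j in range(3)}

row_store = [None] * 6
col_store = [None] * 6
for idx, value in values.items():
    row_store[row_major_offset(idx, shape)] = value
    col_store[column_major_offset(idx, shape)] = value

# Same Cartesian index, different 1-d offsets, same element.
same_element = all(
    row_store[row_major_offset(idx, shape)] == col_store[column_major_offset(idx, shape)]
    for idx in values
)
```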


Applications: Levels of Processor/Memory Hierarchy

• Can be modeled by increasing dimensionality of data array.
  – Additional dimension for each level of the hierarchy.
  – Envision data as reshaped/transposed to reflect mapping to increased dimensionality.
  – An index calculus automatically transforms algorithm to reflect restructured data array.
  – Data, layout, data movement, and scalarization automatically generated based on MoA descriptions and Psi Calculus definitions of array operations, functions and their compositions.
  – Arrays are any dimension, even 0, i.e. scalars.


Processor/Memory Hierarchy (continued)

• Math and indexing operations in same expression

• Framework for design space search
  – Rigorous and provably correct
  – Extensible to complex architectures

Approach: Mathematics of Arrays

Example: “raising” array dimensionality

y = conv(x): intricate math, intricate memory accesses (indexing)

[Figure: memory hierarchy (Main Memory, L2 Cache, L1 Cache) plotted against parallelism (processors P0, P1, P2)]

Map x: < 0 1 2 … 35 > onto processors P0, P1, P2:

P0: < 0 1 2 > < 3 4 5 > < 6 7 8 > < 9 10 11 >

P1: < 12 13 14 > < 15 16 17 > < 18 19 20 > < 21 22 23 >

P2: < 24 25 26 > < 27 28 29 > < 30 31 32 > < 33 34 35 >
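The mapping of x onto the three processors can be read as a reshape of the 36-element vector into a 3-dimensional array. The sketch below assumes a contiguous block distribution (12 consecutive elements per processor, in 4 groups of 3); that assignment and the helper `reshape` are assumptions made for illustration.

```python
def reshape(shape, flat):
    """Reshape a flat list into nested lists (row-major); len(flat) must equal the product of shape."""
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [reshape(shape[1:], flat[k * step:(k + 1) * step]) for k in range(shape[0])]


x = list(range(36))

# One added dimension per hierarchy level: processor, block, element within block.
mapped = reshape([3, 4, 3], x)

# Element e of block b on processor p is x[12*p + 3*b + e].
consistent = all(
    mapped[p][b][e] == x[12 * p + 3 * b + e]
    for p in range(3) for b in range(4) for e in range(3)
)
```

Here `mapped[1]` is the data assigned to P1 under the stated assumption.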


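Manipulation of an array

Given a 3 by 5 by 4 array:

\[
A =
\begin{bmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \\ 16 & 17 & 18 & 19 \end{bmatrix},
\begin{bmatrix} 20 & 21 & 22 & 23 \\ 24 & 25 & 26 & 27 \\ 28 & 29 & 30 & 31 \\ 32 & 33 & 34 & 35 \\ 36 & 37 & 38 & 39 \end{bmatrix},
\begin{bmatrix} 40 & 41 & 42 & 43 \\ 44 & 45 & 46 & 47 \\ 48 & 49 & 50 & 51 \\ 52 & 53 & 54 & 55 \\ 56 & 57 & 58 & 59 \end{bmatrix}
\]

Shape vector: \(\rho A = \langle 3\ 5\ 4 \rangle\)

Index vector: \(i = \langle 2\ 1\ 3 \rangle\)

Used to select: \(i\,\psi\,A = \langle 2\ 1\ 3 \rangle\,\psi\,A = 47\)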
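The selection above can be reproduced with a short sketch that models the 3 by 5 by 4 array as nested Python lists. The names `psi` and `shape_of` are hypothetical stand-ins for ψ and ρ; the behavior for an index vector shorter than the shape (returning a subarray) is included as an assumption about how a partial index is treated.

```python
def shape_of(array):
    """Shape vector of a regular nested list (a stand-in for rho)."""
    shape = []
    while isinstance(array, list):
        shape.append(len(array))
        array = array[0]
    return shape


def psi(index, array):
    """Select with an index vector; a shorter index vector yields a subarray."""
    result = array
    for i in index:
        result = result[i]
    return result


# Plane p, row r, column c holds 20*p + 4*r + c, matching the array on the slide.
A = [[[20 * p + 4 * r + c for c in range(4)] for r in range(5)] for p in range(3)]

shape = shape_of(A)          # [3, 5, 4]
selected = psi([2, 1, 3], A) # 47
row = psi([2, 1], A)         # the row containing 47
```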


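More Definitions

Reverse: Given an array \(\xi\), the reversal \(\phi\xi\) is given through indexing:

\[
\langle i \rangle\,\psi\,(\phi\xi) = \langle (\rho\xi)[0] - (i + 1) \rangle\,\psi\,\xi
\]

Examples:

\[
\vec{v} = \langle 0\ 1\ 2\ 3\ 4\ 5 \rangle, \qquad \phi\vec{v} = \langle 5\ 4\ 3\ 2\ 1\ 0 \rangle
\]

\[
\xi^{2} = \begin{bmatrix} 0 & 1 \\ 2 & 3 \\ 4 & 5 \\ 6 & 7 \end{bmatrix}, \qquad
\phi\xi^{2} = \begin{bmatrix} 6 & 7 \\ 4 & 5 \\ 2 & 3 \\ 0 & 1 \end{bmatrix}
\]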
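The indexing definition of reverse translates directly into code. A minimal sketch, assuming arrays are nested Python lists and that the reversal acts on the first axis only, as in the two examples; `reverse_first_axis` is a hypothetical name.

```python
def reverse_first_axis(xi):
    """Element i of the result is element (rho xi)[0] - (i + 1) of xi."""
    n = len(xi)  # (rho xi)[0], the length of the first axis
    return [xi[n - (i + 1)] for i in range(n)]


v = [0, 1, 2, 3, 4, 5]
xi2 = [[0, 1], [2, 3], [4, 5], [6, 7]]

reversed_v = reverse_first_axis(v)
reversed_xi2 = reverse_first_axis(xi2)
```

No data is moved until the result is materialized; the definition is purely a statement about indices.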


Some Psi Calculus Operations Built Using ψ & Shapes

| Operation | Arguments | Definition |
|---|---|---|
| take | Vector A, int N | Forms a Vector of the first N elements of A |
| drop | Vector A, int N | Forms a Vector of the last (A.size-N) elements of A |
| rotate | Vector A, int N | Forms a Vector of the last N elements of A concatenated to the other elements of A |
| cat | Vector A, Vector B | Forms a Vector that is the concatenation of A and B |
| unaryOmega | Operation Op, Dimension D, Array A | Applies unary operator Op to D-dimensional components of A (like a for all loop) |
| binaryOmega | Operation Op, Dimension Adim, Array A, Dimension Bdim, Array B | Applies binary operator Op to Adim-dimensional components of A and Bdim-dimensional components of B (like a for all loop) |
| reshape | Vector A, Vector B | Reshapes B into an array having A.size dimensions, where the length in each dimension is given by the corresponding element of A |
| iota | int N | Forms a vector of size N, containing values 0 . . N-1 |

Legend (operation categories): index permutation, operators, restructuring, index generation
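The vector operations listed above can be mimicked on Python lists. This is a sketch that follows the wording of the definitions for non-negative N only; it is not the MoA formal definition, and unaryOmega and binaryOmega are left out. Argument order for `reshape` follows the table (shape vector A first, data vector B second).

```python
def iota(n):
    """Vector of size n containing 0 .. n-1."""
    return list(range(n))


def take(a, n):
    """First n elements of a."""
    return a[:n]


def drop(a, n):
    """Last (len(a) - n) elements of a."""
    return a[n:]


def rotate(a, n):
    """Last n elements of a concatenated to the other elements of a."""
    split = len(a) - n
    return a[split:] + a[:split]


def cat(a, b):
    """Concatenation of a and b."""
    return a + b


def reshape(shape, b):
    """Reshape flat vector b into len(shape) dimensions; len(b) must equal the product of shape."""
    if len(shape) == 1:
        return list(b)
    step = len(b) // shape[0]
    return [reshape(shape[1:], b[k * step:(k + 1) * step]) for k in range(shape[0])]


v = iota(6)
first_two = take(v, 2)
rest = drop(v, 2)
rotated = rotate(iota(5), 2)
matrix = reshape([2, 3], v)
```

Note that `cat(take(v, n), drop(v, n))` gives back `v` for any n between 0 and the length of v.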


New FFT algorithm: record speed

Maximize in-cache operations through use of repeated transpose-reshape operations

Similar to partitioning for parallel implementation

Do as many operations in cache as possible

Re-materialize the array to achieve locality

Continue processing in cache and repeat process


Example

Assume cache size c = 4; input vector length n = 32; number of rows r = n/c = 8

Generate vector of indices: \(\vec{v} = \iota(n) = \langle 0\ 1\ 2\ \ldots\ 31 \rangle\)

Use re-shape operator \(\hat{\rho}\) to generate a matrix


Starting Matrix

Each row is of length equal to the size “c”

Standard butterfly applied to each row as...

\[
A \equiv \langle r\ c \rangle\,\hat{\rho}\,\vec{v} =
\begin{bmatrix}
0 & 1 & 2 & 3 \\
4 & 5 & 6 & 7 \\
8 & 9 & 10 & 11 \\
12 & 13 & 14 & 15 \\
16 & 17 & 18 & 19 \\
20 & 21 & 22 & 23 \\
24 & 25 & 26 & 27 \\
28 & 29 & 30 & 31
\end{bmatrix}
\]
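A short sketch of why only some butterfly stages fit in the rows of the starting matrix. It assumes radix-2 butterflies that pair element k with element k XOR s for a power-of-two stride s; that pairing model and the helper `strides_within_rows` are assumptions for illustration, not taken from the slides.

```python
c = 4            # cache size (row length)
n = 32           # input vector length
r = n // c       # number of rows

v = list(range(n))                              # iota(n)
A = [v[i * c:(i + 1) * c] for i in range(r)]    # <r c> reshape of v


def strides_within_rows(matrix, n):
    """Power-of-two strides s < n for which every k shares a row with k XOR s."""
    row_of = {k: i for i, row in enumerate(matrix) for k in row}
    strides = []
    s = 1
    while s < n:
        if all(row_of[k] == row_of[k ^ s] for k in range(n)):
            strides.append(s)
        s *= 2
    return strides


in_row_strides = strides_within_rows(A, n)
```

Under this model the strides 1 and 2 stay inside a row of length 4, while strides 4, 8 and 16 reach across rows.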


Next transpose

To continue further would induce cache misses so transpose and reshape.

Transpose-reshape operation composed over indices (only result is materialized).

The transpose is:

\[
A^{T} =
\begin{bmatrix}
0 & 4 & 8 & 12 & 16 & 20 & 24 & 28 \\
1 & 5 & 9 & 13 & 17 & 21 & 25 & 29 \\
2 & 6 & 10 & 14 & 18 & 22 & 26 & 30 \\
3 & 7 & 11 & 15 & 19 & 23 & 27 & 31
\end{bmatrix}
\]


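Resulting Transpose-Reshape

Materialize the transpose-reshaped array B

Carry out butterfly operation on each row

Weights are re-ordered

Access patterns are standard...

\[
B \equiv \langle r\ c \rangle\,\hat{\rho}\,(A^{T}) =
\begin{bmatrix}
0 & 4 & 8 & 12 \\
16 & 20 & 24 & 28 \\
1 & 5 & 9 & 13 \\
17 & 21 & 25 & 29 \\
2 & 6 & 10 & 14 \\
18 & 22 & 26 & 30 \\
3 & 7 & 11 & 15 \\
19 & 23 & 27 & 31
\end{bmatrix}
\]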
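The phrase "composed over indices" can be illustrated by building B in two ways: by materializing the transpose and then reshaping, and by a single index formula that reads straight from the original vector. The index formula below is derived here for this r = 8, c = 4 example and is a sketch; the function names are hypothetical.

```python
r, c = 8, 4
v = list(range(r * c))
A = [v[i * c:(i + 1) * c] for i in range(r)]


def transpose_reshape_materialized(m):
    """Build the transpose explicitly, flatten it, and reshape back to r by c."""
    rows, cols = len(m), len(m[0])
    transposed = [[m[i][j] for i in range(rows)] for j in range(cols)]  # temporary
    flat = [x for row in transposed for x in row]
    return [flat[k * cols:(k + 1) * cols] for k in range(rows)]


def transpose_reshape_composed(vec, rows, cols):
    """Same result from one composed index expression; no transposed temporary."""
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            p = cols * i + j                       # flat position in the result
            out[i][j] = vec[cols * (p % rows) + p // rows]
    return out


B_materialized = transpose_reshape_materialized(A)
B_composed = transpose_reshape_composed(v, r, c)
```

Only the result array is written in the composed version, which is the point of reducing the expression before generating loops.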


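Transpose-Reshape again

As before: to proceed further would induce cache misses so:

Do the transpose-reshape again (composing indices)

The transpose is:

\[
B^{T} =
\begin{bmatrix}
0 & 16 & 1 & 17 & 2 & 18 & 3 & 19 \\
4 & 20 & 5 & 21 & 6 & 22 & 7 & 23 \\
8 & 24 & 9 & 25 & 10 & 26 & 11 & 27 \\
12 & 28 & 13 & 29 & 14 & 30 & 15 & 31
\end{bmatrix}
\]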


Last step (in this example)

Materialize the composed transpose-reshaped array C

\[
C \equiv \langle r\ c \rangle\,\hat{\rho}\,(B^{T}) =
\begin{bmatrix}
0 & 16 & 1 & 17 \\
2 & 18 & 3 & 19 \\
4 & 20 & 5 & 21 \\
6 & 22 & 7 & 23 \\
8 & 24 & 9 & 25 \\
10 & 26 & 11 & 27 \\
12 & 28 & 13 & 29 \\
14 & 30 & 15 & 31
\end{bmatrix}
\]

Carry out the last step of the FFT

This last step corresponds to cycles of length 2 involving elements 0 and 16, 1 and 17, etc.
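The pairing in the last step can be checked by applying the transpose-reshape twice to the starting 8 by 4 matrix and reading adjacent pairs from each row of C. A minimal sketch; `transpose_reshape` is a hypothetical helper that materializes every step, which the composed approach in the slides avoids.

```python
def transpose_reshape(m):
    """Walk the transpose of m in row-major order and reshape to m's own shape."""
    rows, cols = len(m), len(m[0])
    flat = [m[i][j] for j in range(cols) for i in range(rows)]
    return [flat[k * cols:(k + 1) * cols] for k in range(rows)]


A = [[4 * i + j for j in range(4)] for i in range(8)]
B = transpose_reshape(A)
C = transpose_reshape(B)

# Adjacent pairs in each row of C: positions (0, 1) and (2, 3).
pairs = [(row[k], row[k + 1]) for row in C for k in (0, 2)]
```

Every pair has the form (k, k + 16), so the final stage touches only elements that sit next to each other in a row.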


Final Transpose

Data has been permuted numerous times

• Multiple reshape-transposes

We could reverse the transformations

• There would be multiple steps, multiple writes.

Viewing the problem as an n-cube (hypercube for radix 2) allows us to use the number of reshape-transposes as an argument to rotate (or shift) of a vector generated from the dimension of the hypercube.

• This rotated vector is used as an argument to binary transpose.

• Permutes everything at once.

• Express algebraically, Psi reduce to DNF then ONF for a generic design.

• ONF has only two loops no matter what dimension hypercube (or n-cube for radix = n) we start with.
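For the n = 32, c = 4 example, the accumulated effect of k transpose-reshapes can be written as one permutation. The sketch below treats each 5-bit position as a point of the radix-2 hypercube and rotates its bits by 2k places; expressing the hypercube axis rotation as a bit rotation is an interpretation made here for illustration, and all function names are hypothetical.

```python
def transpose_reshape(m):
    """Walk the transpose of m in row-major order and reshape to m's own shape."""
    rows, cols = len(m), len(m[0])
    flat = [m[i][j] for j in range(cols) for i in range(rows)]
    return [flat[k * cols:(k + 1) * cols] for k in range(rows)]


def rotate_left(p, shift, bits):
    """Rotate the low `bits` bits of p left by `shift` places."""
    shift %= bits
    mask = (1 << bits) - 1
    return ((p << shift) | (p >> (bits - shift))) & mask


def stepwise(k, r=8, c=4):
    """Flat contents after k successive transpose-reshapes of iota(r*c)."""
    m = [[c * i + j for j in range(c)] for i in range(r)]
    for _ in range(k):
        m = transpose_reshape(m)
    return [x for row in m for x in row]


def all_at_once(k, n=32, c=4):
    """Same contents from a single permutation: position p holds rotate_left(p, 2k)."""
    bits = n.bit_length() - 1        # hypercube dimension, 5 for n = 32
    per_step = c.bit_length() - 1    # bits moved per transpose-reshape, 2 for c = 4
    return [rotate_left(p, k * per_step, bits) for p in range(n)]


agree = all(stepwise(k) == all_at_once(k) for k in range(6))
```

Because the permutation depends only on k, undoing all the reshape-transposes needs one pass over the data rather than k passes.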


Speed enhancement over previous record

[Figure: Enhancement Ratio (vertical axis, 0 to 5) versus Log_2(size of FFT) (horizontal axis, 1 to 22)]


FFT Time vs. size

[Figure: Time (sec) (vertical axis, logarithmic, 0.0001 to 100) versus log_2(FFT size) (horizontal axis, 1 to 22); series: optimized, not optimized]


Summary

• All operations have been carried out in cache at the price of re-arranging the data

• Data blocks can be of any size (powers of the radix): need not equal the cache size

• Optimum performance: tradeoff between reduction of cache misses and cost of transpose-reshape operations

• Number of transpose-reshape operations determined by the data block size (cache size)

• Record performance: up to factor of 4 better than libraries


Science Direct 25 Hottest Articles


Book under review at Springer


New paper at J. Comp. Phys.
