High Performance on the J90 Systems David Turner & Tom DeBoni NERSC User Services Group April 1999.


High Performance on the J90 Systems

David Turner & Tom DeBoni

NERSC User Services Group

April 1999


13 April, 1999 High Performance on the J90 Systems 2

Philosophical Ramblings

Design for optimization?

Where to start?

When to stop?


J90 Potential

STREAM benchmark results: sustainable memory bandwidth

(http://www.cs.virginia.edu/stream)

John McCalpin, SGI

Kernel  Operation          bytes/iter  FLOPS/iter
COPY    a(i)=b(i)                  16           0
TRIAD   a(i)=b(i)+q*c(i)           24           2


STREAM Results

Machine       ncpus  COPY (MB/s)  TRIAD (MB/s)  MFLOPS
Cray_C90         16     105497.0      103812.0  8651.0
Cray_C90          8      55071.9       63229.6  5269.1
Cray_C90          1       6965.4        9500.7   791.7

Cray_J932        16      16298.2       14995.9  1249.7
Cray_J932         8       9995.2        8941.3   745.1
Cray_J932         1       1433.6        1270.0   105.8

Cray_T3E-900     16       7497.0        8828.0   735.7
Cray_T3E-900      8       3747.0        4471.0   372.6
Cray_T3E-900      1        484.0         568.0    47.3

SGI_Origin_2K    16       5560.0        5240.0   436.7
SGI_Origin_2K     8       2570.0        2740.0   228.3
SGI_Origin_2K     1        332.0         358.0    29.8

Sun_UE_10000     16       2371.0        2905.0   242.1
Sun_UE_10000      8       1271.0        1546.0   128.8
Sun_UE_10000      1        164.0         202.0    16.8
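The MFLOPS column appears to follow from the TRIAD column and the per-iteration counts for the TRIAD kernel (24 bytes moved and 2 floating-point operations per iteration), assuming the COPY and TRIAD rates are in MB/s as STREAM reports them. The following is a minimal sketch that checks this relationship against the three Cray_J932 rows; the program and variable names are illustrative and not from the original slides.

```fortran
PROGRAM TRIAD_MFLOPS
  IMPLICIT NONE
  ! TRIAD a(i)=b(i)+q*c(i): 3 words * 8 bytes = 24 bytes, 2 FLOPS per iteration
  REAL, PARAMETER :: BYTES_PER_ITER = 24.0, FLOPS_PER_ITER = 2.0
  ! TRIAD rates for Cray_J932 on 16, 8 and 1 CPUs (assumed to be MB/s)
  REAL, DIMENSION(3) :: TRIAD = (/ 14995.9, 8941.3, 1270.0 /)
  REAL, DIMENSION(3) :: MFLOPS
  INTEGER :: K

  ! MB/s divided by bytes/iter gives millions of iterations per second;
  ! multiplying by FLOPS/iter gives MFLOPS
  MFLOPS = TRIAD / BYTES_PER_ITER * FLOPS_PER_ITER

  DO K = 1, 3
    PRINT '(F10.1, F10.1)', TRIAD(K), MFLOPS(K)
  END DO
END PROGRAM TRIAD_MFLOPS
```

To one decimal place the computed values are 1249.7, 745.1 and 105.8, which agree with the MFLOPS column for the J932.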


STREAM Results (cont.)

Machine                  COPY (MB/s)  TRIAD (MB/s)  MFLOPS
Cray_C90                      6965.4        9500.7   791.7
Cray_J932                     1433.6        1270.0   105.8
Compaq_AlphaServer_DS20       1077.0        1323.0   110.2
IBM_RS6000-397                 778.8         882.4    73.5
Cray_T3E-900                   484.0         568.0    47.3
SGI_Origin_2K                  332.0         358.0    29.8
Generic_440BX_400              304.0         315.4    26.3
Sun_Ultra2-2200                228.5         189.9    25.9
Sun_UE_10000                   164.0         202.0    16.8
Apple_Mac_G3_266               137.1         137.1    11.4


Tools

F90 (with lots of options)

ja
    ja
    ./name
    ja -cst -n name

hpm

prof

flowview

atexpert


Program “SLOW”

PROGRAM SLOW

  IMPLICIT NONE
  INTEGER, PARAMETER :: DIMSIZE=8000000
  REAL, DIMENSION(DIMSIZE) :: X, Y, Z
  INTEGER :: I, J

  X = RANF()
  Y = RANF()
  DO J = 1, 10
    DO I = 1, DIMSIZE
      Z(I)=LOG(SIN(X(I))**2+COS(Y(I))**4)
    END DO
    PRINT *, Z(DIMSIZE-1)
  ENDDO
  STOP

END PROGRAM SLOW


No Optimization

f90 -O0 -r6 -O,msgs,negmsgs -o slow slow.f90

x = RANF()

cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=8

A loop starting at line 8 was vectorized.

y = RANF()

cf90-6204 f90:VECTOR SLOW,File = slow.f90, Line=9

A loop starting at line 9 was vectorized.


Moderate Optimization

f90 -O1 -r6 -O,msgs,negmsgs -o slow slow.f90

do j = 1, 10

cf90-6286 f90:VECTOR SLOW,File = slow.f90,Line=10

A loop starting at line 10 was not vectorized because it contains input/output operations at line 14.

DO i = 1, DIMSIZE

cf90-6204 f90:VECTOR SLOW,File = slow.f90,Line=11

A loop starting at line 11 was vectorized.

z(i) = LOG(SIN(x(i))**2 + COS(y(i))**4)

cf90-6001 f90:SCALAR SLOW,File=slow.f90,Line=12

An exponentiation was replaced by optimization. This may cause numerical differences.


High Optimization

f90 -O3 -r6 -O,msgs,negmsgs -o slow slow.f90

cf90-6502 f90:TASKING SLOW,File=slow.f90,Line=10

A loop starting at line 10 was not tasked because it contains input/output operations at line 14.

cf90-6403 f90:TASKING SLOW,File=slow.f90,Line=11

A loop starting at line 11 was tasked.


Optimization Results

Opt  NCPUS   Elapsed      User     Sys
  0         768.7530  583.6793  7.1886
  1          89.0162   82.1009  1.1936
  2         104.7003   81.5687  1.0003
  3      1  107.0177   81.6185  1.2994
  3      2   44.6562   81.7050  1.4069
  3      3   41.3401   81.5320  1.3099
  3      4   24.8146   81.8099  1.2968


2 CPU Speedup

(Concurrent CPUs * Connect seconds = CPU seconds)
 ---------------   ---------------   -----------
        1        *       5.4300    =     5.4300
        2        *      38.1300    =    76.2600

(Concurrent CPUs * Connect seconds = CPU seconds)
     (Avg.)            (total)         (total)
 ---------------   ---------------   -----------
       1.88      *      43.5600    =    81.6900


3 CPU Speedup

(Concurrent CPUs * Connect seconds = CPU seconds)
 ---------------   ---------------   -----------
        1        *       9.2200    =     9.2200
        2        *      13.5500    =    27.1000
        3        *      15.0700    =    45.2100

(Concurrent CPUs * Connect seconds = CPU seconds)
     (Avg.)            (total)         (total)
 ---------------   ---------------   -----------
       2.15      *      37.8400    =    81.5300


4 CPU Speedup

(Concurrent CPUs * Connect seconds = CPU seconds)
 ---------------   ---------------   -----------
        1        *       2.0400    =     2.0400
        2        *       1.7700    =     3.5400
        3        *       5.3200    =    15.9600
        4        *      15.0700    =    60.2800

(Concurrent CPUs * Connect seconds = CPU seconds)
     (Avg.)            (total)         (total)
 ---------------   ---------------   -----------
       3.38      *      24.2000    =    81.8200
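Each of these reports can be cross-checked with a little arithmetic: the total connect time is the sum of the per-level connect seconds, the total CPU time is the sum of (concurrent CPUs × connect seconds), and the average concurrency is the ratio of the two. The following is a minimal sketch using the 4-CPU figures; the program and variable names are illustrative.

```fortran
PROGRAM CONCURRENCY
  IMPLICIT NONE
  INTEGER, PARAMETER :: NLEVELS = 4
  ! Connect seconds spent with 1, 2, 3 and 4 CPUs running concurrently
  REAL, DIMENSION(NLEVELS) :: CONNECT = (/ 2.04, 1.77, 5.32, 15.07 /)
  REAL :: TOTAL_CONNECT, TOTAL_CPU
  INTEGER :: K

  TOTAL_CONNECT = SUM(CONNECT)
  TOTAL_CPU = 0.0
  DO K = 1, NLEVELS
    ! K concurrent CPUs for CONNECT(K) seconds consume K*CONNECT(K) CPU seconds
    TOTAL_CPU = TOTAL_CPU + REAL(K) * CONNECT(K)
  END DO

  PRINT '(A, F8.2)', 'Total connect seconds: ', TOTAL_CONNECT
  PRINT '(A, F8.2)', 'Total CPU seconds:     ', TOTAL_CPU
  PRINT '(A, F8.2)', 'Average concurrency:   ', TOTAL_CPU / TOTAL_CONNECT
END PROGRAM CONCURRENCY
```

The three printed values correspond to the 24.20, 81.82 and 3.38 on the summary line of the 4-CPU report. Most of the connect time (15.07 of 24.20 seconds) is spent with all four CPUs busy, which is why the average concurrency stays above 3.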


Useful F90 Options

-e (0 or i) - initializes storage or flags use of uninitialized vars
-e n - flags nonstandard Fortran usage
-e v - make all variables static
-g - same as -G0
-G (0 or 1) - sets debugging level to statement or block
-m (0 - 4) - message verbosity (0 gives most output)
-N (72, 80, or 132) - source line length
-O - optimization levels
    0, 1, 2, 3, aggress, fastint, msgs, negmsgs, inline(0-3), scalar(0-3), task(0-3), vector(0-3)
-r (0-6, …) - listing levels (6 is EVERYthing)
-R (a, b, c) - runtime checking: args, array bounds, indexing


Using flowtrace/flowview

f90 -O1 -ef -o slow slow.f90
./slow
flowview -Luch > slow.flow

Routine       Tot Time  Percentage   Accum%
------------  --------  ----------  -------
SUB2          5.66E+01       69.02    69.02
SUB1          2.43E+01       29.63    98.65
SLOW          1.11E+00        1.35   100.00


Using prof

f90 -O1 -l prof -o slow slow.f90

./slow

prof -x ./slow > slow.prof

profview slow.prof


profview Output


Optimization Strategies

• First, let the compiler do it
• Vectorize and scalar optimize, then parallelize
    • Vectorization can give you a factor of 10 speedup
    • Scalar optimization can improve performance by 10-50%
    • Parallelism will give you a linear speedup, max
    • Memory contention inhibits gains from parallelism
• Let the compiler advise you
• Add directives where appropriate
    • Be sure you tell the truth
    • Check your answers


Scalar Optimization

Subroutine or function inlining

Fast (32-bit) integers

-Oallfastint

-Ofastint

Use INTERFACE specifications if passing array sections
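One way to read the last point: an explicit interface lets a routine declare an assumed-shape dummy argument, so a strided array section can be passed through a descriptor instead of a contiguous temporary copy. Whether a copy is actually avoided depends on the compiler. The following is a minimal sketch in standard Fortran 90; the routine name SCALE2 and the data are hypothetical.

```fortran
PROGRAM SECTION_DEMO
  IMPLICIT NONE

  ! Explicit interface: required because SCALE2 has an assumed-shape dummy
  INTERFACE
    SUBROUTINE SCALE2(V)
      REAL, DIMENSION(:), INTENT(INOUT) :: V
    END SUBROUTINE SCALE2
  END INTERFACE

  REAL, DIMENSION(10) :: A
  INTEGER :: I

  A = (/ (REAL(I), I = 1, 10) /)

  ! Pass a strided section: elements 1, 3, 5, 7, 9
  CALL SCALE2(A(1:10:2))

  PRINT '(10F6.1)', A
END PROGRAM SECTION_DEMO

SUBROUTINE SCALE2(V)
  IMPLICIT NONE
  ! Assumed-shape dummy: extent and stride come from the caller
  REAL, DIMENSION(:), INTENT(INOUT) :: V
  V = 2.0 * V
END SUBROUTINE SCALE2
```

After the call, the odd-indexed elements of A are doubled and the even-indexed elements are unchanged.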


Vectorization


Inhibitors to Vectorization

Function or subroutine references
    Inline
    Push loop
    Split loop

Backwards data dependencies
    Reorder loop, use temporary vector

I/O statements

Character or bit manipulations

Branches into loop or backward out of loop


Nonvectorizable Code

DO I = 1, N

CALL CALC(X(I), Y(I), Z(I))

ENDDO

...

SUBROUTINE CALC(X, Y, Z)

Z = ALOG(SQRT((SIN(X) * COS(Y)) ** X))

RETURN

END


Inlining

DO I = 1, N

Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))

ENDDO


Pushing

CALL CALC(X, Y, Z, N)

...

SUBROUTINE CALC(X, Y, Z, N)

DIMENSION X(N), Y(N), Z(N)

DO I = 1, N

Z(I) = ALOG(SQRT((SIN(X(I))*COS(Y(I)))**X(I)))

ENDDO

RETURN

END


Splitting

DO I = 1, N

A(I) = ABS(CALC(C(I)))

B(I) = A(I) ** T * SQRT(C(I))

A(I) = SIN(ALOG(C(I)))

ENDDO


Splitting (cont.)

EXTERNAL CALC

DO I = 1, N

A(I) = ABS(CALC(C(I)))

ENDDO

DO I = 1, N

B(I) = A(I) ** T * SQRT(C(I))

A(I) = SIN(ALOG(C(I)))

ENDDO


Scalar Recurrence

DIMENSION A(1000), C(1000)

DO J = 1, M

S = BB

DO I = 1, N

S = S * C(I)

A(I) = A(I) + S

ENDDO

ENDDO

<cf90-8135,Scalar,Line=7> Loop starting at line 7 was unrolled 16 times.


Scalar Recurrence (cont.)

DIMENSION A(1000), C(1000), S(1000)
DO I = 1, M
  S(I) = BB
ENDDO
DO I = 1, N
  DO J = 1, M
    S(J) = S(J) * C(I)
    A(I) = A(I) + S(J)
  ENDDO
ENDDO

Loop starting at line 5 was unrolled 2 times.

A loop starting at line 5 was vectorized.

A loop starting at line 9 was vectorized.
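The rewrite promotes the scalar S to a vector with one element per J iteration and interchanges the loops, so the recurrence on S no longer runs along the inner loop. The following is a minimal sketch that runs both forms on small, made-up data and compares the results; the sizes, the value of BB and the array names A1, A2 and SV are assumptions for illustration only.

```fortran
PROGRAM RECURRENCE_CHECK
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 8, M = 4
  REAL, PARAMETER :: BB = 1.5
  REAL, DIMENSION(N) :: C, A1, A2
  REAL, DIMENSION(M) :: SV
  REAL :: S
  INTEGER :: I, J

  C = (/ (1.0 + 0.1 * REAL(I), I = 1, N) /)
  A1 = 0.0
  A2 = 0.0

  ! Original form: scalar S carries a value from one I iteration to the next
  DO J = 1, M
    S = BB
    DO I = 1, N
      S = S * C(I)
      A1(I) = A1(I) + S
    END DO
  END DO

  ! Rewritten form: S promoted to the vector SV, loops interchanged
  SV = BB
  DO I = 1, N
    DO J = 1, M
      SV(J) = SV(J) * C(I)
      A2(I) = A2(I) + SV(J)
    END DO
  END DO

  PRINT *, 'Max difference:', MAXVAL(ABS(A1 - A2))
END PROGRAM RECURRENCE_CHECK
```

Both forms apply the same operations to each element in the same order, so the reported difference should be zero or at rounding level. In the rewritten form the inner J loop updates independent elements of SV and accumulates into a single A(I), which is the kind of loop the vectorization messages refer to.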


Compiler Vector Directives

CDIR$ directive

!DIR$ directive

VECTOR, NOVECTOR

Turn vectorization on or off until end of program unit.

IVDEP

Ignore vector dependencies in next loop.
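As an illustration of where IVDEP might apply, consider a loop whose index offset is only known at run time. The following is a minimal sketch; the offset K, the array sizes and the reference check are assumptions for illustration. Compilers that do not recognize the directive treat the !DIR$ line as an ordinary comment.

```fortran
PROGRAM IVDEP_DEMO
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  REAL, DIMENSION(2*N) :: A, REF
  INTEGER :: I, K

  K = 5                  ! stands in for a run-time offset; must be >= 0
  A = (/ (REAL(I), I = 1, 2*N) /)
  REF = A                ! untouched copy used to check the result

  ! The compiler cannot know the sign of K in general, so it must assume
  ! that A(I+K) might have been written by an earlier iteration.
!DIR$ IVDEP
  DO I = 1, N
    A(I) = A(I+K) + 1.0
  END DO

  ! With K >= 0, every A(I+K) that is read still holds its original value
  PRINT *, 'Matches reference:', ALL(A(1:N) == REF(1+K:N+K) + 1.0)
END PROGRAM IVDEP_DEMO
```

The directive is a promise from the programmer, not a check. If K could be negative the loop would contain a genuine recurrence, and asserting IVDEP could produce wrong answers.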


Parallel Computing

Multitasking, microtasking, autotasking, parallel processing, multiprocessing, etc.

This is “fine-grained” parallelism

Parallelism mostly comes from loop slicing

One possible goal: parallelize outer loop(s), vectorize inner loop(s)

F90 is capable of autotasking, but it can always benefit from help


Parallelism


Parallelism, cont.


Data “Scoping”

DIMENSION A(N)

SUM = 0.0

DO I = 1, N

TEMP = DEEP_THOUGHT(A,I)

SUM = SUM + TEMP * A(I)

ENDDO

A, N Shared, read-only everywhere

I, TEMP Private, read-write everywhere

SUM Shared, read-write everywhere


Compiler Tasking Directives

DIMENSION A(N)

SUM = 0.0

!MIC$ DOALL SHARED(A,N),PRIVATE(I,TEMP)

DO I = 1, N

TEMP = DEEP_THOUGHT(A,I) * A(I)

!MIC$ GUARD

SUM = SUM + TEMP

!MIC$ ENDGUARD

ENDDO


Threshold Test

DIMENSION A(N)

SUM = 0.0

!MIC$ DOALL VECTOR

!MIC$ IF(N.GT.1000)

!MIC$ SHARED(A,N),PRIVATE(I,TEMP)

DO I = 1, N

TEMP = DEEP_THOUGHT(A,I)

!MIC$ GUARD

SUM = SUM + TEMP * A(I)

!MIC$ ENDGUARD

ENDDO


Helping F90 with Parallelism

DIMENSION A(N), SUM(NumTasks)

!MIC$ DOALL SHARED(A,N),PRIVATE(J,I,TEMP)
DO J = 1, NumTasks
  SUM(J) = 0.0
!MIC$ CNCALL
  DO I = 1, N
    SUM(J) = SUM(J) + DEEP_THOUGHT(A,I,J) * A(I)
  ENDDO
ENDDO

TSUM = 0.0
DO J = 1, NumTasks
  TSUM = TSUM + SUM(J)
ENDDO
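The idea here is that each task accumulates into its own element of a partial-sum array, so no GUARD is needed inside the loop, and the partial sums are combined serially at the end. The following is a minimal, directive-free sketch of that pattern in standard Fortran 90. It makes two assumptions that differ from the slide: the I range is split into contiguous chunks, one per task, and a hypothetical function WEIGHT stands in for DEEP_THOUGHT.

```fortran
PROGRAM PARTIAL_SUMS
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000, NUMTASKS = 4
  REAL, DIMENSION(N) :: A
  REAL, DIMENSION(NUMTASKS) :: PSUM
  REAL :: TSUM, SERIAL
  INTEGER :: I, J, LO, HI

  A = (/ (REAL(I), I = 1, N) /)

  ! On the J90 a tasking directive would precede this outer loop.
  ! Each J owns PSUM(J) and its own chunk of I, so updates never collide.
  DO J = 1, NUMTASKS
    LO = (J - 1) * N / NUMTASKS + 1
    HI = J * N / NUMTASKS
    PSUM(J) = 0.0
    DO I = LO, HI
      PSUM(J) = PSUM(J) + WEIGHT(I) * A(I)
    END DO
  END DO

  ! Serial combination of the per-task partial sums
  TSUM = SUM(PSUM)

  ! Plain serial sum for comparison
  SERIAL = 0.0
  DO I = 1, N
    SERIAL = SERIAL + WEIGHT(I) * A(I)
  END DO

  PRINT *, 'Partial-sum total:', TSUM
  PRINT *, 'Serial total:     ', SERIAL

CONTAINS

  ! Hypothetical stand-in for DEEP_THOUGHT
  REAL FUNCTION WEIGHT(I)
    INTEGER, INTENT(IN) :: I
    WEIGHT = 1.0 / REAL(I)
  END FUNCTION WEIGHT

END PROGRAM PARTIAL_SUMS
```

The two totals may differ in the last few digits because the additions are grouped differently, which is one reason to check answers after restructuring a reduction.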


Helping F90 with Directives

• Useful compiler directives for tasking
    • CASE, ENDCASE
    • CNCALL
    • DOALL
    • DOPARALLEL, ENDDO
    • GUARD, ENDGUARD
    • MAXCPUS
    • NUMCPUS
    • PERMUTATION
    • PARALLEL, ENDPARALLEL
• These all begin with !MIC$
• NOTE: There are also OpenMP directives...


Helping F90 with Directives, cont.

Directive Parameters

AUTOSCOPE

IF

MAXCPUS

PRIVATE

SAVELAST

SHARED

Directive Work Distribution

CHUNKSIZE

GUIDED

NCPUS_CHUNKS

NUMCHUNKS

SINGLE

VECTOR

These all augment !MIC$ directives

NOTE: There are also OpenMP directive parameters...


atexpert

f90 -eX -O3 -r6 -o slow slow.f90

setenv NCPUS 1

./slow

atexpert


atexpert Output


atexpert Output, cont.


atexpert Output, cont.