Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the...

43
Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James Demmel (taken from David Culler, Lecture 18, CS267, 1997) http://www.cs.berkeley.edu/~demmel/ cs267_Spr99
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    215
  • download

    1

Transcript of Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the...

Page 1: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.1

CS 267 Applications of Parallel Computers

Lecture 28: LogP and the

Implementation and Modeling of Parallel Sorts

James Demmel

(taken from David Culler,

Lecture 18, CS267, 1997)

http://www.cs.berkeley.edu/~demmel/cs267_Spr99

Page 2: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.2

Practical Performance Target (circa 1992)

• Sort one billion large keys in one minute on one thousand processors.

• Good sort on a workstation can do 1 million keys in about 10 seconds

– just fits in memory

– 16 bit Radix Sort

• Performance unit: µs per key per processor– s ~ 10 for single Sparc 2

Page 3: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.3

Studies on Parallel Sorting

Sorting Networks

PRAM Sorts

MEM

p p p°°°

Sorting on Network Y

P

M

network

P

M

P

M°°°

LogP SortsSorting onMachine X

Page 4: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.4

The Study

Interesting ParallelSorting Algorithms

Analyze under LogP

Parametersfor CM-5

Estimate ExecutionTime

Implement in Split-C

Execute on CM-5

Compare

??

(Bitonic, Column, Histo- radix, Sample)

Page 5: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.5

LogP

Page 6: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.6

Deriving the LogP Model

° Processing

– powerful microprocessor, large DRAM, cache => P

° Communication

+ significant latency (100's of cycles) => L

+ limited bandwidth (1 – 5% of memory bw) => g

+ significant overhead (10's – 100's of cycles) => o- on both ends

– no consensus on topology

=> should not exploit structure

+ limited capacity– no consensus on programming model

=> should not enforce one

Page 7: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.7

LogP

Interconnection Network

MPMPMP° ° °

P ( processors )

Limited Volume( L/ g to or from

a proc)

o (overhead)

L (latency)

og (gap)

• Latency in sending a (small) mesage between modules

• overhead felt by the processor on sending or receiving msg

• gap between successive sends or receives (1/BW)

• Processors

Page 8: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.8

Using the Model

° Send n messages from proc to proc in time 2o + L + g(n-1)

– each processor does o n cycles of overhead

– has (g-o)(n-1) + L available compute cycles

° Send n messages from one to many

in same time

° Send n messages from many to one

in same time

– all but L/g processors block

so fewer available cycles

o L o

o og

Ltime

P

P

Page 9: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.9

Use of the Model (cont)

° Two processors sending n words to each other (i.e., exchange) in time

2o + L + max(g,2o) (n-1) max(g,2o) + L

° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts , e.g., transpose, fft, cyclic-to-block, block-to-cyclic

same

exercise: what’s wrong with the formula above?

Assumes optimal pattern of send/receive, so could underestimate time

Page 10: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.10

LogP "philosophy"

• Think about:

• – mapping of N words onto P processors

• – computation within a processor, its cost, and balance

• – communication between processors, its cost, and balance

• given a charaterization of processor and network performance

• Do not think about what happens within the network

This should be good enough!

Page 11: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.11

Typical Sort

Exploits the n = N/P grouping

° Significant local computation

° Very general global communication / transformation

° Computation of the transformation

Page 12: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.12

Split-C

Global Address Space

P0 Pprocs-1P1

local

• Explicitly parallel C

• 2D global address space– linear ordering on local spaces

• Local and Global pointers – spread arrays too

• Read/Write

• Get/Put (overap compute and comm)– x := G; . . .

– sync();

• Signaling store (one-way)– G :– x; . . .

– store_sync(); or all_store_sync();

• Bulk transfer

• Global comm.

Page 13: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.13

Basic Costs of operations in Split-C

• Read, Write x = *G, *G = x 2 (L + 2o)

• Store *G :– x L + 2o

• Get x := *G o

.... 2L + 2o sync(); o

– with interval g

• Bulk store (n words with words/message)

2o + (n-1)g + L

• Exchange 2o + 2L + (nL/g) max(g,2o)

• One to many

• Many to one

Page 14: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.14

LogP model

• CM5:– L = 6 µs

– o = 2.2 µs

– g = 4 µs

– P varies from 32 to 1024

• NOW– L = 8.9

– o = 3.8

– g = 12.8

– P varies up to 100

• What is the processor performance?

Page 15: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.15

Sorting

Page 16: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.16

Local Sort Performance (11 bit radix sort of 32 bits numbers)

Log N/P

µs

/ K

ey

0

1

2

3

4

5

6

7

8

9

10

0 5 10 15 20

31

25.1

16.9

10.4

6.2

Entropy inKey Values

Entropy = -i pi log pi ,

pi = Probability of key i

<--------- TLB misses ---------->

Page 17: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.17

Local Computation Parameters - EmpiricalParameter Operation µs per key Sort

Swap Simulate cycle butterfly per key 0.025 lg N Bitonic

mergesort Sort bitonic sequence 1.0

scatter Move key for Cyclic-to-block 0.46

gather Move key for Block-to-cyclic 0.52 if n<=64k or P<=64 Bitonic & Column

1.1 otherwise

local sort Local radix sort (11 bit) 4.5 if n < 64K

9.0 - (281000/n)

merge Merge sorted lists 1.5 Column

copy Shift Key 0.5

zero Clear histogram bin 0.2 Radix

hist produce histogram 1.2

add produce scan value 1.0

bsum adjust scan of bins 2.5

address determine desitination 4.7

compare compare key to splitter 0.9 Sample

localsort8 local radix sort of samples 5.0

Page 18: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.18

Bottom Line (Preview)

N/P

us/

key

0.00

20.00

40.00

60.00

80.00

100.00

120.00

140.00

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Bitonic 1024

Bitonic 32

Column 1024

Column 32

Radix 1024

Radix 32

Sample 1024

Sample 32

• Good fit between predicted and measured (10%)

• Different sorts for different sorts– scaling by processor, input size, sensitivity

• All are global / local hybrids– the local part is hard to implement and model

Page 19: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.19

Odd-Even Merge - classic parallel sort

N values to be sorted

A0 A1 A2 A3 AM-1 B0 B1 B2 B3 BM-1

Treat as two lists ofM = N/2

Sort each separately

A0 A2 … AM-2 B0 B2 … BM-2

Redistribute intoeven and odd sublists A1 A3 … AM-1 B1 B3 … BM-1

Merge into twosorted lits

E0 E1 E2 E3 EM-1 O0 O1 O2 O3 OM-1

Pairwise swaps ofEi and Oi will put itin order

Page 20: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.20

Where’s the Parallelism?

E0 E1 E2 E3 EM-1 O0 O1 O2 O3 OM-1

1xN

1xN

4xN/4

2xN/2

Page 21: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.21

Mapping to a Butterfly (or Hypercube)A0 A1 A2 A3 B0 B1 B2 B3

A0 A1 A2 A3

A0 A1 A2 A3

B0B1 B3 B2

B2B3 B1 B0

A0 A1 A2 A3 B2B3 B1 B0

Reverse Orderof one list viacross edges

two sorted sublists

Pairwise swapson way back2 3 4 8 7 6 5 1

2 3 4 7 6 5 81

2 4 6 81 3 5 7

1 2 3 4 5 6 7 8

Page 22: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.22

Bitonic Sort with N/P per node

all_bitonic(int A[PROCS]::[n])

sort(tolocal(&A[ME][0]),n,0)

for (d = 1; d <= logProcs; d++)

for (i = d-1; i >= 0; i--) {

swap(A,T,n,pair(i));

merge(A,T,n,mode(d,i));

}

sort(tolocal(&A[ME][0]),n,mask(i));

sortswap

A bitonic sequence decreases and then increases (or vice versa)Bitonic sequences can be merged like monotonic sequences

Page 23: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.23

Bitonic Sort

lg N/p stages are local sort

Block Layout

remaining stages involve Block-to-cyclic, local merges (i - lg N/P cols)cyclic-to-block, local merges ( lg N/p cols within stage)

Page 24: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.24

Analysis of Bitonic

• How do you do transpose?

• Reading Exercise

Page 25: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.25

Bitonic Sort: time per key

Predicted

N/P

us/

key

0

10

20

30

40

50

60

70

80

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Measured

N/P

us/

key

0

10

20

30

40

50

60

70

80

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

512

256

128

64

32

Page 26: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.26

Bitonic: Breakdown

Predicted

N/P

us/k

ey

0

10

20

30

40

50

60

70

80

16384

32768

65536

131072

262144

524288

1048576

Measured

N/P

us/k

ey

0

10

20

30

40

50

60

70

80

16384

32768

65536

131072

262144

524288

1048576

Remap B-C

Remap C-B

Mergesort

Swap

Localsort

P= 512, random

Page 27: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.27

Bitonic: Effect of Key Distributions

Entropy (bits)

µs/k

ey

0

5

10

15

20

25

30

35

40

45

0 6 10 17 25 31

Swap

Merge Sort

Local Sort

Remap C-B

Remap B-C

P = 64, N/P = 1 M

Page 28: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.28

Column Sort

(3) Sort

(2) Transpose - block to cyclic

(1) Sort

(4) Transpose- cyclic to block w/o scatter

(6) shift

(5) Sort

(8) Unshift

(7) mergework efficient

Treat datalike n x P array,with n >= P^2,I.e. N >= P^3

Page 29: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.29

Column Sort: Times

Predicted

N/P

us/

key

0

5

10

15

20

25

30

35

40

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Measured

N/P

us/

key

0

5

10

15

20

25

30

35

40

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

512

256

128

64

32

Only works for N >= P^3

Page 30: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.30

Column: Breakdown

Predicted

N/P

us/k

ey

0

5

10

15

20

25

30

35

40

16384

32768

65536

131072

262144

524288

1048576

Measured

N/P

us/k

ey

0

5

10

15

20

25

30

35

40

16384

32768

65536

131072

262144

524288

1048576

Sort1

Sort2

Sort3

Merge

Trans

Untrans

Shift

Unshift

P= 64, random

Page 31: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.31

Column: Key distributions

Entropy (bits)

µs

/ key

0

5

10

15

20

25

30

35

0 6 10 17 25 31

Merge

Sorts

Remaps

Shifts

P = 64, N/P = 1M

Page 32: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.32

Histo-radix sortP

n=N/P

Per pass:

1. compute local histogram

2. compute position of 1st

member of each bucket in

global array

– 2^r scans with end-around

3. distribute all the keys

Only r = 8,11,16 make sense

for sorting 32 bit numbers

2^r23

Page 33: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.33

Histo-Radix Sort (again)

Local Data

Local Histograms

Each Passform local histogramsform global histogramglobally distribute data

P

Page 34: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.34

Radix Sort: Times

Predicted

N/P

us/

key

0

20

40

60

80

100

120

140

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Measured

N/P

us/

key

0

20

40

60

80

100

120

140

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

512

256

128

64

32

Page 35: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.35

Radix: Breakdown

N/P

us/k

ey

0

20

40

60

80

100

120

140

1638

4

3276

8

6553

6

1E+

05

3E+

05

5E+

05

1E+

06

Dist

GlobalHist

LocalHist

Dist-m

GlobalHist-m

LocalHist-m

Page 36: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.36

Radix: Key distribution

Entropy

µs /

ke

y

0

10

20

30

40

50

60

70

0 6

10

17

25

31

Cycl

ic

118

Dist

Global Hist

Local Hist

Slowdown due to contentionin redistribution

Page 37: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.37

Radix: Stream Broadcast Problem

n

(P-1) ( 2o + L + (n-1) g ) ? Need to slow first processor to pipeline well

Page 38: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.38

What’s the right communication mechanism?

• Permutation via writes– consistency model?

– false sharing?

• Reads?

• Bulk Transfers?– what do you need to change in the algorithm?

• Network scheduling?

Page 39: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.39

Sample Sort

1. compute P-1 values of keys that

would split the input into roughly equal pieces.

– take S~64 samples per processor

– sort PS keys

– take key S, 2S, . . . (P-1)S

– broadcast splitters

2. Distribute keys based on splitters

3. Local sort

[4.] possibly reshift

Page 40: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.40

Sample Sort: Times

Predicted

N/P

us/

key

0

5

10

15

20

25

30

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Measured

N/P

us/

key

0

5

10

15

20

25

30

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

512

256

128

64

32

Page 41: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.41

Sample Breakdown

N/P

us/

key

0

5

10

15

20

25

30

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Split

Sort

Dist

Split-m

Sort-m

Dist-m

Page 42: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.42

Comparison

N/P

us/

key

0.00

20.00

40.00

60.00

80.00

100.00

120.00

140.00

1638

4

3276

8

6553

6

1310

72

2621

44

5242

88

1048

576

Bitonic 1024

Bitonic 32

Column 1024

Column 32

Radix 1024

Radix 32

Sample 1024

Sample 32

• Good fit between predicted and measured (10%)

• Different sorts for different sorts– scaling by processor, input size, sensitivity

• All are global / local hybrids– the local part is hard to implement and model

Page 43: Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James.

Culler 1997CS267 L28 Sort.43

Conclusions

• Distributed memory model leads to hybrid global / local algorithms

• LogP model is good enough for the global part– bandwidth (g) or overhead (o) matter most

– including end-point contention

– latency (L) only matters when BW doesn’t

– g is going to be what really matters in the days ahead (NOW)

• Local computational performance is hard!– dominated by effects of storage hierarchy (TLBs)

– getting trickier with multilevels » physical address determines L2 cache behavior

– and with real computers at the nodes (VM)

– and with variations in model» cycle time, caches, . . .

• See http://www.cs.berkeley.edu/~culler/papers/sort.ps

• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps– disk-to-disk parallel sorting