Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the...

CS 267 Applications of Parallel Computers

Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts

James Demmel

(taken from David Culler, Lecture 18, CS267, 1997)

http://www.cs.berkeley.edu/~demmel/cs267_Spr99


Practical Performance Target (circa 1992)

• Sort one billion large keys in one minute on one thousand processors.

• Good sort on a workstation can do 1 million keys in about 10 seconds

– just fits in memory

– 16 bit Radix Sort

• Performance unit: µs per key per processor

– ~10 µs for single Sparc 2
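The performance unit can be checked with a little arithmetic. The sketch below converts the two figures quoted on this slide into µs per key per processor; the function name is illustrative and not from the lecture.

```python
def us_per_key_per_processor(total_keys, seconds, processors):
    """Time budget per key on each processor, in microseconds."""
    keys_per_processor = total_keys / processors
    return seconds * 1e6 / keys_per_processor


# Target: one billion keys, one minute, one thousand processors.
target = us_per_key_per_processor(1e9, 60, 1000)

# Workstation baseline: one million keys in about 10 seconds.
workstation = us_per_key_per_processor(1e6, 10, 1)

print(target, workstation)
```

Under these numbers the target allows each processor roughly six times the per-key budget of the single-workstation sort, and that margin has to absorb all communication.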


Studies on Parallel Sorting

Sorting Networks

PRAM Sorts

[Diagram: processors p ... p sharing a single memory MEM]

Sorting on Network Y

[Diagram: processor/memory (P/M) nodes connected by a network]

LogP Sorts

Sorting on Machine X


The Study

Interesting Parallel Sorting Algorithms (Bitonic, Column, Histo-radix, Sample)

Analyze under LogP

Parameters for CM-5

Estimate Execution Time

Implement in Split-C

Execute on CM-5

Compare ??


LogP


Deriving the LogP Model

° Processing

– powerful microprocessor, large DRAM, cache => P

° Communication

+ significant latency (100's of cycles) => L

+ limited bandwidth (1 – 5% of memory bw) => g

+ significant overhead (10's – 100's of cycles) => o

– on both ends

– no consensus on topology

=> should not exploit structure

+ limited capacity

– no consensus on programming model

=> should not enforce one


LogP

[Diagram: P processor/memory (P/M) modules attached to an interconnection network, annotated with o (overhead), g (gap), L (latency), and limited volume (L/g to or from a proc)]

• L: latency in sending a (small) message between modules

• o: overhead felt by the processor on sending or receiving a message

• g: gap between successive sends or receives (1/BW)

• P: processors


Using the Model

° Send n messages from proc to proc in time 2o + L + g(n-1)

– each processor does o·n cycles of overhead

– has (g-o)(n-1) + L available compute cycles

° Send n messages from one to many in same time

° Send n messages from many to one in same time

– all but L/g processors block, so fewer available cycles

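[Timing diagram: two processors P on a time axis, showing the send overhead o, the latency L, the receive overhead o, and the gap g between successive messages]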
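The point-to-point formulas on this slide are easy to evaluate. The following is a minimal sketch, assuming all parameters are in the same time unit; the function names are illustrative, and the sample values are the CM-5 parameters quoted later in these slides (L = 6 µs, o = 2.2 µs, g = 4 µs).

```python
def send_time(n, L, o, g):
    """Time to send n small messages from one processor to another:
    2o + L + g(n-1)."""
    return 2 * o + L + g * (n - 1)


def overhead_per_processor(n, o):
    """Overhead paid by each endpoint for n messages: o*n."""
    return o * n


def available_compute(n, L, o, g):
    """Time left for computation during the transfer: (g-o)(n-1) + L.
    Only meaningful when g >= o, as the slide's formula assumes."""
    return (g - o) * (n - 1) + L


if __name__ == "__main__":
    L, o, g = 6.0, 2.2, 4.0  # CM-5 values from the slides, in microseconds
    for n in (1, 10, 100):
        print(n, send_time(n, L, o, g), available_compute(n, L, o, g))
```

For a single message the cost is 2o + L; for long streams the g(n-1) term dominates, which is why the gap (bandwidth) matters more than latency for bulk transfers.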


Use of the Model (cont)

° Two processors sending n words to each other (i.e., exchange) in time

2o + L + max(g,2o) (n-1) max(g,2o) + L

° P processors each sending n words to all processors (n/P each) in a static, balanced pattern without conflicts, e.g., transpose, FFT, cyclic-to-block, block-to-cyclic

same time as the exchange above

exercise: what’s wrong with the formula above?

Assumes optimal pattern of send/receive, so could underestimate time
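As a hedged illustration, the first expression on this slide, 2o + L + max(g, 2o)(n-1), can be evaluated for the two machines whose parameters appear later in the deck. The function names are illustrative, and the NOW values are assumed here to be in µs like the CM-5 values. As the exercise notes, the estimate assumes an optimal send/receive pattern and so may underestimate real time.

```python
def exchange_time(n, L, o, g):
    """Two processors exchanging n words: 2o + L + max(g, 2o)(n-1).
    Each processor pays o to send and o to receive per word, so the
    per-word interval is the larger of the gap g and 2o."""
    return 2 * o + L + max(g, 2 * o) * (n - 1)


def limiting_factor(o, g):
    """Which term sets the per-word interval in an exchange."""
    return "overhead" if 2 * o > g else "gap"


if __name__ == "__main__":
    machines = {
        "CM-5": dict(L=6.0, o=2.2, g=4.0),
        "NOW": dict(L=8.9, o=3.8, g=12.8),  # units assumed to be microseconds
    }
    for name, m in machines.items():
        print(name, limiting_factor(m["o"], m["g"]), exchange_time(100, **m))
```

With these parameters the CM-5 exchange is limited by overhead (2o > g), while the NOW exchange is limited by the gap.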


LogP "philosophy"

• Think about:

– mapping of N words onto P processors

– computation within a processor, its cost, and balance

– communication between processors, its cost, and balance

• given a characterization of processor and network performance

• Do not think about what happens within the network

This should be good enough!


Typical Sort

Exploits the n = N/P grouping

° Significant local computation

° Very general global communication / transformation

° Computation of the transformation


Split-C

Global Address Space

[Diagram: global address space spanning processors P0, P1, ..., Pprocs-1, each with its own local region]

• Explicitly parallel C

• 2D global address space

– linear ordering on local spaces

• Local and Global pointers

– spread arrays too

• Read/Write

• Get/Put (overlap compute and comm)

– x := G; . . .

– sync();

• Signaling store (one-way)

– G :- x; . . .

– store_sync(); or all_store_sync();

• Bulk transfer

• Global comm.


Basic Costs of operations in Split-C

Operation       Syntax            Cost
• Read, Write   x = *G, *G = x    2 (L + 2o)
• Store         *G :- x           L + 2o
• Get           x := *G           o
                ....              2L + 2o
                sync();           o

– with interval g

• Bulk store (n words with words/message)

2o + (n-1)g + L

• Exchange 2o + 2L + (nL/g) max(g,2o)

• One to many

• Many to one


LogP model

• CM5:

– L = 6 µs

– o = 2.2 µs

– g = 4 µs

– P varies from 32 to 1024

• NOW

– L = 8.9

– o = 3.8

– g = 12.8

– P varies up to 100

• What is the processor performance?
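To make the parameters concrete, the sketch below plugs both machines into two of the Split-C cost formulas from the previous slide: read/write at 2(L + 2o) and store at L + 2o. The dictionary layout and function names are illustrative. The slide gives no unit for the NOW values, so treating them as µs is an assumption.

```python
CM5 = {"L": 6.0, "o": 2.2, "g": 4.0}    # microseconds, from the slide
NOW = {"L": 8.9, "o": 3.8, "g": 12.8}   # unit assumed to be microseconds


def read_write_cost(m):
    """Blocking read or write: a full round trip, 2(L + 2o)."""
    return 2 * (m["L"] + 2 * m["o"])


def store_cost(m):
    """One-way signaling store: L + 2o."""
    return m["L"] + 2 * m["o"]


if __name__ == "__main__":
    for name, m in (("CM-5", CM5), ("NOW", NOW)):
        print(name, read_write_cost(m), store_cost(m))
```

On either machine a blocking read costs twice a one-way store, which is one reason the sorts in this lecture favor stores and bulk transfers over reads.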


Sorting


Local Sort Performance (11 bit radix sort of 32-bit numbers)

[Figure: µs per key (0 to 10) vs. log N/P (0 to 20), one curve per entropy in key values: 31, 25.1, 16.9, 10.4, and 6.2 bits. An annotation marks the range of N/P affected by TLB misses.]

Entropy = -Σ_i p_i log p_i, where p_i = probability of key i
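The entropy formula can be sketched as follows. This is a minimal illustration that estimates p_i from the observed frequency of each key value and uses log base 2, so the result is in bits as on the plot axes. Whether the lecture's measurements computed entropy exactly this way is an assumption, and the function name is illustrative.

```python
import math
from collections import Counter


def key_entropy_bits(keys):
    """Empirical entropy -sum(p_i * log2(p_i)) of a list of keys,
    with p_i taken as the observed frequency of key value i."""
    counts = Counter(keys)
    total = len(keys)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    print(key_entropy_bits([0, 1, 2, 3] * 100))  # four equally likely keys
    print(key_entropy_bits([7] * 400))           # a single repeated key
```

Uniformly random 32-bit keys sit at the high-entropy end of the plots; a stream of identical keys has entropy zero.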


Local Computation Parameters - Empirical

Parameter    Operation                          µs per key                    Sort
swap         Simulate cycle butterfly per key   0.025 lg N                    Bitonic
mergesort    Sort bitonic sequence              1.0
scatter      Move key for cyclic-to-block       0.46
gather       Move key for block-to-cyclic       0.52 if n <= 64K or P <= 64   Bitonic & Column
                                                1.1 otherwise
local sort   Local radix sort (11 bit)          4.5 if n < 64K
                                                9.0 - (281000/n) otherwise
merge        Merge sorted lists                 1.5                           Column
copy         Shift key                          0.5
zero         Clear histogram bin                0.2                           Radix
hist         Produce histogram                  1.2
add          Produce scan value                 1.0
bsum         Adjust scan of bins                2.5
address      Determine destination              4.7
compare      Compare key to splitter            0.9                           Sample
localsort8   Local radix sort of samples        5.0
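The piecewise entries in this table translate directly into small cost functions. The sketch below encodes the local sort and gather rows. It assumes that "64K" means 65536 and that the second local sort expression applies for n at or above that threshold; the function names are illustrative.

```python
K64 = 64 * 1024  # assumption: "64K" on the slide means 65536


def local_sort_us_per_key(n):
    """Empirical cost of the 11-bit local radix sort, in microseconds per key."""
    if n < K64:
        return 4.5
    return 9.0 - 281000 / n


def gather_us_per_key(n, P):
    """Empirical cost of moving one key for block-to-cyclic remap."""
    if n <= K64 or P <= 64:
        return 0.52
    return 1.1


if __name__ == "__main__":
    for n in (16384, 65536, 262144, 1048576):
        print(n, round(local_sort_us_per_key(n), 3), gather_us_per_key(n, 512))
```

The local sort cost roughly doubles between small and large n, consistent with the memory-hierarchy effects (TLB misses) noted for the local sort measurements.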


Bottom Line (Preview)

[Figure: µs per key (0 to 140) vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sort, each on 1024 and 32 processors]

• Good fit between predicted and measured (10%)

• Different sorts for different sorts

– scaling by processor, input size, sensitivity

• All are global / local hybrids

– the local part is hard to implement and model


Odd-Even Merge - classic parallel sort

N values to be sorted: A0 A1 A2 A3 ... AM-1 B0 B1 B2 B3 ... BM-1

Treat as two lists of M = N/2

Sort each separately

Redistribute into even and odd sublists: A0 A2 … AM-2 B0 B2 … BM-2 and A1 A3 … AM-1 B1 B3 … BM-1

Merge into two sorted lists: E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1

Pairwise swaps of Oi and Ei+1 will put it in order (E0 and OM-1 are already in place)
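The steps on this slide can be sketched sequentially in Python. This follows Batcher's standard formulation, in which the final compare-exchange pairs O_i with E_{i+1}, while E_0 and the last O are already in place. It assumes the input length is a power of two, and the function names are illustrative.

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-two length."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Merge the even-indexed and the odd-indexed elements separately.
    e = odd_even_merge(a[0::2], b[0::2])
    o = odd_even_merge(a[1::2], b[1::2])
    # Final stage: compare-exchange O_i with E_{i+1}.
    out = [e[0]]
    for i in range(len(o) - 1):
        out.append(min(o[i], e[i + 1]))
        out.append(max(o[i], e[i + 1]))
    out.append(o[-1])
    return out


def odd_even_merge_sort(x):
    """Sort a list whose length is a power of two."""
    if len(x) <= 1:
        return list(x)
    half = len(x) // 2
    return odd_even_merge(odd_even_merge_sort(x[:half]),
                          odd_even_merge_sort(x[half:]))


if __name__ == "__main__":
    print(odd_even_merge([1, 5, 6, 7], [2, 3, 4, 8]))
    print(odd_even_merge_sort([8, 3, 5, 1, 7, 2, 6, 4]))
```

The two recursive merges are independent, and so are all compare-exchanges in the final stage, which is where the parallelism comes from.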


Where’s the Parallelism?

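[Diagram: the merge of E0 E1 E2 E3 ... EM-1 and O0 O1 O2 O3 ... OM-1 unfolded into stages of independent subproblems, labeled 1 x N, 2 x N/2, 4 x N/4, and 1 x N]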


Mapping to a Butterfly (or Hypercube)

[Diagram: two sorted sublists, A0 A1 A2 A3 and B0 B1 B2 B3, mapped onto a butterfly. The order of one list (B) is reversed via cross edges, and pairwise swaps are performed on the way back. Worked example: the sequence 2 3 4 8 7 6 5 1 ends as 1 2 3 4 5 6 7 8.]
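The butterfly pattern can be sketched as a recursive bitonic merge: reverse one sorted sublist, then compare-exchange elements at distance n/2, n/4, ..., 1. The example uses the values from the slide's diagram; the function name is illustrative and the length is assumed to be a power of two.

```python
def bitonic_merge(seq):
    """Sort a bitonic sequence (power-of-two length) into ascending order."""
    n = len(seq)
    if n == 1:
        return list(seq)
    half = n // 2
    # One butterfly stage: compare-exchange element i with element i + n/2.
    lo = [min(seq[i], seq[i + half]) for i in range(half)]
    hi = [max(seq[i], seq[i + half]) for i in range(half)]
    # Both halves are again bitonic and every key in lo <= every key in hi.
    return bitonic_merge(lo) + bitonic_merge(hi)


if __name__ == "__main__":
    a = [2, 3, 4, 8]
    b = [1, 5, 6, 7]
    bitonic = a + b[::-1]          # reverse one list: 2 3 4 8 7 6 5 1
    print(bitonic_merge(bitonic))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each stage touches pairs whose indices differ in exactly one bit, which is why the computation maps directly onto butterfly or hypercube edges.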


Bitonic Sort with N/P per node

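all_bitonic(int A[PROCS]::[n])
{
    /* sort the n local keys first */
    sort(tolocal(&A[ME][0]), n, 0);
    for (d = 1; d <= logProcs; d++)
        for (i = d-1; i >= 0; i--) {
            swap(A, T, n, pair(i));      /* exchange keys with the partner processor */
            merge(A, T, n, mode(d,i));   /* merge local keys with the received keys */
        }
    sort(tolocal(&A[ME][0]), n, mask(i));
}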

A bitonic sequence decreases and then increases (or vice versa).

Bitonic sequences can be merged like monotonic sequences.


Bitonic Sort

Block Layout

The first lg N/P stages are a local sort.

The remaining stages involve:

– block-to-cyclic remap, local merges (i - lg N/P cols)

– cyclic-to-block remap, local merges (lg N/P cols within stage)


Analysis of Bitonic

• How do you do transpose?

• Reading Exercise


Bitonic Sort: time per key

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 80) vs. N/P (16384 to 1048576), one curve each for P = 512, 256, 128, 64, and 32]


Bitonic: Breakdown

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 80) vs. N/P (16384 to 1048576), broken down into Remap B-C, Remap C-B, Mergesort, Swap, and Localsort. P = 512, random keys.]


Bitonic: Effect of Key Distributions

[Figure: µs per key (0 to 45) vs. key entropy (0, 6, 10, 17, 25, 31 bits), broken down into Swap, Merge Sort, Local Sort, Remap C-B, and Remap B-C. P = 64, N/P = 1 M.]


Column Sort

Treat data like an n x P array, with n >= P^2, i.e. N >= P^3

(1) Sort

(2) Transpose - block to cyclic

(3) Sort

(4) Transpose - cyclic to block w/o scatter

(5) Sort

(6) Shift

(7) Merge

(8) Unshift

Work efficient
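The size restriction is worth working through, since it decides which (N, P) combinations column sort can handle at all. The sketch below checks the slide's condition n >= P^2 with n = N/P; the function names are illustrative.

```python
def min_keys_per_processor(P):
    """Smallest n = N/P allowed by the slide's condition n >= P^2."""
    return P * P


def column_sort_applies(N, P):
    """True when N keys on P processors satisfy n >= P^2, i.e. N >= P^3."""
    return N // P >= P * P


if __name__ == "__main__":
    for P in (32, 64, 128, 256, 512, 1024):
        print(P, min_keys_per_processor(P))
```

For P = 32 the bound is only 1024 keys per processor, but for P = 1024 it is 1048576 keys per processor, which is the largest N/P shown on the plots in these slides.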


Column Sort: Times

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 40) vs. N/P (16384 to 1048576), one curve each for P = 512, 256, 128, 64, and 32]

Only works for N >= P^3


Column: Breakdown

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 40) vs. N/P (16384 to 1048576), broken down into Sort1, Sort2, Sort3, Merge, Trans, Untrans, Shift, and Unshift. P = 64, random keys.]


Column: Key distributions

[Figure: µs per key (0 to 35) vs. key entropy (0, 6, 10, 17, 25, 31 bits), broken down into Merge, Sorts, Remaps, and Shifts. P = 64, N/P = 1M.]


Histo-radix sort

[Diagram: P processors, each holding n = N/P keys and a histogram of 2^r buckets]

Per pass:

1. compute local histogram

2. compute position of 1st member of each bucket in global array

– 2^r scans with end-around

3. distribute all the keys

Only r = 8, 11, 16 make sense for sorting 32 bit numbers


Histo-Radix Sort (again)

[Diagram: local data and local histograms on each of the P processors]

Each pass:

– form local histograms

– form global histogram

– globally distribute data
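The three steps of a pass can be simulated sequentially. In this minimal sketch each simulated processor is a Python list of n keys. The global scan visits buckets in order and, within a bucket, processors in order, which keeps every pass stable. This is an illustration of the technique, not the Split-C implementation measured in the lecture, and the function and variable names are illustrative.

```python
def histo_radix_sort(blocks, key_bits=32, r=8):
    """Sort non-negative integer keys spread over P equal-sized blocks."""
    P, n = len(blocks), len(blocks[0])
    buckets = 1 << r
    passes = -(-key_bits // r)  # ceil(key_bits / r)
    for p in range(passes):
        shift = p * r
        # 1. Each processor computes a local histogram of the current digit.
        hist = [[0] * buckets for _ in range(P)]
        for proc, keys in enumerate(blocks):
            for k in keys:
                hist[proc][(k >> shift) & (buckets - 1)] += 1
        # 2. Global position of the first key of each (bucket, processor).
        start = [[0] * buckets for _ in range(P)]
        pos = 0
        for b in range(buckets):
            for proc in range(P):
                start[proc][b] = pos
                pos += hist[proc][b]
        # 3. Distribute every key to its global position (block layout).
        out = [[None] * n for _ in range(P)]
        for proc, keys in enumerate(blocks):
            for k in keys:
                b = (k >> shift) & (buckets - 1)
                g = start[proc][b]
                start[proc][b] += 1
                out[g // n][g % n] = k
        blocks = out
    return blocks


if __name__ == "__main__":
    data = [[170, 45, 75, 90], [802, 24, 2, 66]]
    print(histo_radix_sort(data, key_bits=16, r=8))
```

For 32-bit keys the pass count is 4, 3, and 2 for r = 8, 11, and 16; a radix between those values adds histogram bins without removing a pass.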


Radix Sort: Times

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 140) vs. N/P (16384 to 1048576), one curve each for P = 512, 256, 128, 64, and 32]


Radix: Breakdown

[Figure: µs per key (0 to 140) vs. N/P (16384 to about 1E+06), with series Dist, GlobalHist, and LocalHist, and their counterparts Dist-m, GlobalHist-m, and LocalHist-m]


Radix: Key distribution

[Figure: µs per key (0 to 70) vs. key entropy (0, 6, 10, 17, 25, 31 bits, plus a cyclic distribution), broken down into Dist, Global Hist, and Local Hist. One off-scale bar is labeled 118.]

Slowdown due to contention in redistribution


Radix: Stream Broadcast Problem

[Diagram: a stream of n words being broadcast]

(P-1)(2o + L + (n-1)g) ?

Need to slow first processor to pipeline well


What’s the right communication mechanism?

• Permutation via writes

– consistency model?

– false sharing?

• Reads?

• Bulk Transfers?

– what do you need to change in the algorithm?

• Network scheduling?


Sample Sort

1. compute P-1 values of keys that would split the input into roughly equal pieces.

– take S~64 samples per processor

– sort PS keys

– take key S, 2S, . . . (P-1)S

– broadcast splitters

2. Distribute keys based on splitters

3. Local sort

[4.] possibly reshift
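The steps above can be sketched as a sequential simulation, with each processor represented by a list of keys. Sampling with replacement, the fixed seed, and the use of `bisect` for the splitter lookup are assumptions made for this illustration; the optional reshift step is omitted, so output blocks may differ in size.

```python
import bisect
import random


def sample_sort(blocks, S=64, seed=0):
    """Return P sorted blocks whose concatenation is globally sorted."""
    rng = random.Random(seed)
    P = len(blocks)
    # 1. Take S samples per processor, sort the P*S samples, and pick
    #    keys S, 2S, ..., (P-1)S as the P-1 splitters.
    samples = sorted(rng.choice(keys) for keys in blocks for _ in range(S))
    splitters = [samples[j * S] for j in range(1, P)]
    # 2. Distribute each key to the processor chosen by the splitters.
    dest = [[] for _ in range(P)]
    for keys in blocks:
        for k in keys:
            dest[bisect.bisect_right(splitters, k)].append(k)
    # 3. Local sort on every processor.
    return [sorted(d) for d in dest]


if __name__ == "__main__":
    rng = random.Random(42)
    blocks = [[rng.randrange(2**32) for _ in range(1000)] for _ in range(8)]
    result = sample_sort(blocks)
    print([len(b) for b in result])  # block sizes are only roughly equal
```

How uneven the output blocks are depends on the sample size S and on the key distribution, which is why step [4.] lists a possible reshift.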


Sample Sort: Times

[Figure: two panels, Predicted and Measured, each showing µs per key (0 to 30) vs. N/P (16384 to 1048576), one curve each for P = 512, 256, 128, 64, and 32]


Sample Breakdown

[Figure: µs per key (0 to 30) vs. N/P (16384 to 1048576), with series Split, Sort, and Dist, and their counterparts Split-m, Sort-m, and Dist-m]


Comparison

[Figure: µs per key (0 to 140) vs. N/P (16384 to 1048576) for Bitonic, Column, Radix, and Sample sort, each on 1024 and 32 processors]

• Good fit between predicted and measured (10%)

• Different sorts for different sorts

– scaling by processor, input size, sensitivity

• All are global / local hybrids

– the local part is hard to implement and model


Conclusions

• Distributed memory model leads to hybrid global / local algorithms

• LogP model is good enough for the global part

– bandwidth (g) or overhead (o) matter most

– including end-point contention

– latency (L) only matters when BW doesn’t

– g is going to be what really matters in the days ahead (NOW)

• Local computational performance is hard!

– dominated by effects of storage hierarchy (TLBs)

– getting trickier with multilevels

» physical address determines L2 cache behavior

– and with real computers at the nodes (VM)

– and with variations in model

» cycle time, caches, . . .

• See http://www.cs.berkeley.edu/~culler/papers/sort.ps

• See http://now.cs.berkeley.edu/Papers2/Postscript/spdt98.ps

– disk-to-disk parallel sorting
