Transcript of 2a.1 Evaluating Parallel Programs, Cluster Computing, UNC-Charlotte, B. Wilkinson.

Page 1:

2a.1

Evaluating Parallel Programs

Cluster Computing, UNC-Charlotte, B. Wilkinson.

Page 2:

Sequential execution time, ts: estimate by counting the computational steps of the best sequential algorithm.

Parallel execution time, tp: in addition to the number of computational steps, tcomp, we need to estimate the communication overhead, tcomm:

tp = tcomp + tcomm

Page 3:

Computational Time

Count the number of computational steps. When more than one process is executed simultaneously, count the computational steps of the most complex process. Generally, tcomp is a function of n and p, i.e.

tcomp = f (n, p)

Often break down computation time into parts. Then

tcomp = tcomp1 + tcomp2 + tcomp3 + …

Analysis is usually done assuming that all processors are identical and operate at the same speed.
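
As a hedged illustration (our own example, not from the slides): adding n numbers by giving each of the p processors n/p numbers to sum locally, then combining the p partial sums along a binary tree, takes

tcomp = n/p + log2 p

computational steps, which is indeed a function of both n and p.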

Page 4:

Communication Time

Depends on many factors, including network structure.

As a first approximation, use

tcomm = tstartup + ntdata

tstartup -- startup time, essentially time to send a message with no data. Assumed to be constant.

tdata -- transmission time to send one data word, also assumed constant, and there are n data words.
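
The model is easy to evaluate directly. Below is a minimal C sketch (our own, with hypothetical constants; not from the slides) of the first approximation tcomm = tstartup + n tdata:

#include <stdio.h>

/* First-approximation communication model: startup cost plus
   per-word transmission cost for n data words. */
double comm_time(double t_startup, double t_data, int n)
{
    return t_startup + n * t_data;
}

int main(void)
{
    /* hypothetical values, in units of one computational step */
    double t_startup = 100.0, t_data = 1.0;
    printf("tcomm for n = 1000 words: %.1f steps\n",
           comm_time(t_startup, t_data, 1000));
    return 0;
}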

Page 5:

Idealized Communication Time

[Figure: idealized communication time grows linearly with the number of data items (n); the intercept on the time axis is the startup time.]

The equation for the communication time ignores the fact that the source and destination may not be directly linked in a real system, so that the message may pass through intermediate nodes.

It is also assumed that the overhead incurred by including information other than data in the packet is constant and can be folded into the startup time.

Page 6:

Final communication time, tcomm

Summation of communication times of all sequential messages from one process, i.e.

tcomm = tcomm1 + tcomm2 + tcomm3 + …

The communication patterns of all processes are assumed to be the same and to take place together, so that only one process need be considered.

Both tstartup and tdata are measured in units of one computational step, so that tcomp and tcomm can be added together to obtain the parallel execution time, tp.

Page 7:

Communication Time of Broadcast/Gather

• If the broadcast is done through a single shared wire, as in classical Ethernet, the time complexity is O(1) for a single data item and O(w) for w data items.

• If a binary tree is used as the underlying network structure and a 1-to-N fan-out broadcast is used, what is the communication cost for p final destinations (leaf nodes) with messages of w data items? We assume the left and right children receive the message from their parent sequentially; however, at each level, the different parent nodes send out the message at the same time.

Page 8:

1-to-N fan-out Broadcast

• The communication time depends on the number of levels in the tree and the number of nodes at each level.

• For a binary tree with p final destinations at the leaf level:

tcomm = 2 (log2 p) (tstartup + w tdata)
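
This model can also be coded directly; a minimal C sketch (our own names and structure, not from the slides):

#include <math.h>

/* Binary-tree 1-to-N fan-out broadcast model. Each of the log2(p)
   levels costs two sequential sends (left child, then right child);
   sends by different parents at the same level overlap. */
double broadcast_time(int p, int w, double t_startup, double t_data)
{
    double levels = log2((double)p);   /* tree depth to reach p leaves */
    return 2.0 * levels * (t_startup + w * t_data);
}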

Page 9:

Benchmark Factors

With ts, tcomp, and tcomm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation:

Speedup factor, S(p) = ts / tp

Computation/communication ratio = tcomp / tcomm

Both are functions of the number of processors, p, and the number of data elements, n.
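
For example (our own numbers): if ts = 1000 steps and a parallel version runs in tp = 250 steps on p = 8 processors, the speedup factor is S(p) = 1000/250 = 4, against an ideal maximum of p = 8.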

Page 10:

These factors give an indication of the scalability of the parallel solution with increasing number of processors and problem size.

The computation/communication ratio highlights the effect of communication as the problem size and system size increase.

We want computation, not communication, to be the dominant factor: then, as n increases, communication can be ignored and adding more processors improves performance.

Page 11:

Example

• Adding n numbers using two computers, each adding n/2 numbers.

• Numbers initially held in one computer (Computer 1).

Computer 1 sends n/2 numbers to Computer 2: tcomm = tstartup + (n/2)tdata

Both computers add up their n/2 numbers: tcomp = n/2

Computer 2 sends its result back to Computer 1: tcomm = tstartup + tdata

Computer 1 adds the two partial sums: tcomp = 1
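
A hedged MPI sketch of this example (our own code, not the slides'; run with exactly two processes; N is a hypothetical problem size):

#include <stdio.h>
#include <mpi.h>

#define N 1000 /* hypothetical n */

int main(int argc, char *argv[])
{
    int rank, i;
    double x[N], partial = 0.0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { /* Computer 1 */
        for (i = 0; i < N; i++) x[i] = 1.0; /* sample data */
        /* send n/2 numbers: tstartup + (n/2)tdata */
        MPI_Send(&x[N/2], N/2, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        for (i = 0; i < N/2; i++) partial += x[i]; /* add n/2 numbers */
        /* receive result back: tstartup + tdata */
        MPI_Recv(&sum, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += partial; /* add partial sums */
        printf("sum = %f\n", sum);
    } else if (rank == 1) { /* Computer 2 */
        MPI_Recv(x, N/2, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        for (i = 0; i < N/2; i++) partial += x[i]; /* add n/2 numbers */
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}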

Page 12:

Overall

tcomm = 2tstartup + (n/2 + 1)tdata = O(n)

tcomp = n/2 + 1 = O(n)

Computation/communication ratio = O(1), i.e., communication does not become insignificant as n increases.

Page 13:

Another problem

Computation time complexity = O(n2)

Communication time complexity = O(n)

Computation/communication ratio = O(n), so computation dominates as n increases.

Page 14:

Cost

Cost = (execution time) x (number of processors)

Cost of sequential computation = ts

Cost of parallel computation = tp x p

Cost-optimal algorithm

When the parallel computation cost is proportional to the sequential computation cost:

Cost = tp x p = k x ts

where k is a constant.

Page 15:

Example

Suppose ts = O(n log n) for the best sequential algorithm, where n = number of data items and p = number of processors.

Cost optimal if:

tp = O(n log n)/p = O((n/p) log n)

Not cost optimal if:

tp = O(n^2/p)

A parallel algorithm is cost-optimal if parallel time complexity times the number of processors equals the sequential time complexity.
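
To check the two cases above, multiply by p (our own one-line check): p x O((n/p) log n) = O(n log n), which matches ts, whereas p x O(n^2/p) = O(n^2), which grows faster than O(n log n).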

Page 16:

Evaluating Programs: Measuring the Execution Time

• Time-complexity analysis gives insight into a parallel algorithm and is useful for comparing different algorithms, but we also want to know how the algorithm actually performs on a real system.

• We can measure the elapsed time between two points in the code, in seconds.

– System calls, such as clock(), time(), gettimeofday(), or MPI_Wtime()

– Example:

time_t t1, t2; /* requires <time.h> */
double elapsed_time;

time(&t1); /* L1: record start time */
/* ... section of code being timed ... */
time(&t2); /* L2: record end time */
elapsed_time = difftime(t2, t1); /* elapsed time in seconds */
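
For a parallel program, MPI_Wtime() returns the wall-clock time in seconds as a double, usually with much finer resolution than time(). A minimal sketch (our own fragment, not the slides' code):

double start, end;

start = MPI_Wtime(); /* record start time */
/* ... section of code being timed ... */
end = MPI_Wtime(); /* record end time */
printf("elapsed time = %f seconds\n", end - start);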

Page 17:

Communication Time by the Ping-Pong Method

• Point-to-point communication time of a specific system can be found using the ping-pong method.

• One process, say p0, sends a message to another process, say p1. Immediately upon receiving the message, p1 sends the message back to p0. The round-trip time is divided by two to obtain an estimate of the one-way communication time. For example, at p0:

time(&t1);
send(&x, p1);
recv(&x, p1);
time(&t2);
elapsed_time = 0.5 * difftime(t2, t1);
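
The same measurement in MPI, using MPI_Wtime() (a hedged sketch of our own, run with exactly two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double x = 0.0, t1, t2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) { /* p0: time the round trip */
        t1 = MPI_Wtime();
        MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        t2 = MPI_Wtime();
        printf("one-way time estimate: %f s\n", 0.5 * (t2 - t1));
    } else if (rank == 1) { /* p1: echo the message back */
        MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}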

Page 18:

Profiling

• A profile of a program is a histogram or graph showing the time spent in different parts of the program, or the number of times certain source statements are executed.

• It helps identify the hot spots: places in the program visited many times during execution. These places should be optimized first.
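
For example (one common approach, not mentioned on the slide), a flat profile can be obtained with the GNU profiler gprof:

cc -pg -o prog prog.c  # compile with profiling instrumentation
./prog                 # run the program; writes gmon.out
gprof prog gmon.out    # print the profile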

Page 19:

Program Profile Histogram

[Figure: program profile histogram, with statement number or region of the program on the horizontal axis.]