CS 484

Transcript of CS 484

Page 1: CS 484

CS 484

Page 2: CS 484

Designing Parallel Algorithms

Designing a parallel algorithm is not easy. There is no recipe or magical ingredient, except creativity.

We can benefit from a methodical approach: a framework for algorithm design.

Most problems have several parallel solutions, which may be totally different from the best sequential algorithm.

Page 3: CS 484

PCAM Algorithm Design

Four stages to designing a parallel algorithm:
- Partitioning
- Communication
- Agglomeration
- Mapping

P & C focus on concurrency and scalability. A & M focus on locality and performance.

Page 4: CS 484

PCAM Algorithm Design

- Partitioning: computation and data are decomposed.
- Communication: coordinate task execution.
- Agglomeration: combine tasks for performance.
- Mapping: assign tasks to processors.

Page 5: CS 484
Page 6: CS 484

Partitioning

- Ignore the number of processors and the target architecture.
- Expose opportunities for parallelism.
- Divide up both the computation and the data.
- Can take two approaches: domain decomposition or functional decomposition.

Page 7: CS 484

Domain Decomposition

- Start algorithm design by analyzing the data.
- Divide the data into small pieces, approximately equal in size.
- Then partition the computation by associating it with the data.
- Communication issues may arise as one task needs the data from another task.

Page 8: CS 484

Domain Decomposition

Evaluate the definite integral

$$\int_0^1 \frac{4}{1+x^2}\,dx$$

[Figure: plot of the integrand over the interval from 0 to 1.]

Page 9: CS 484

Split up the domain


Page 10: CS 484

Split up the domain


Page 11: CS 484

Split up the domain

[Figure: the interval split at 0.25, 0.5, and 0.75 into four subdomains.]

Now each task simply evaluates the integral in its own range.

All that is left is to sum up each task's answer for the total.
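As a concrete sketch of this decomposition (my own illustration, not from the slides; the four-way split and the midpoint rule are assumptions), each task integrates 4/(1+x²) over its own quarter of [0, 1], and the partial results are summed at the end:

```python
from concurrent.futures import ProcessPoolExecutor

def f(x):
    # Integrand from the slides; its integral over [0, 1] is pi.
    return 4.0 / (1.0 + x * x)

def integrate_range(a, b, steps=100_000):
    # Midpoint rule over this task's subdomain [a, b].
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

if __name__ == "__main__":
    # Partition: four equal subdomains, one per task.
    starts = [0.0, 0.25, 0.5, 0.75]
    ends   = [0.25, 0.5, 0.75, 1.0]
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = pool.map(integrate_range, starts, ends)
    print(sum(partials))  # ~3.14159...
```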

Page 12: CS 484

Domain Decomposition

Consider dividing up a 3-D grid. What issues arise?

Other issues:
- What if your problem has more than one data structure?
- Different problem phases?
- Replication?

Page 13: CS 484
Page 14: CS 484
Page 15: CS 484

Functional Decomposition

- Focus on the computation.
- Divide the computation into disjoint tasks.
- Avoid data dependencies among tasks.
- After dividing the computation, examine the data requirements of each task.

Page 16: CS 484

Functional Decomposition

- Not as natural as domain decomposition.
- Consider search problems.
- Often functional decomposition is very useful at a higher level, e.g. climate modeling: ocean simulation, hydrology, atmosphere, etc. (see the sketch below).
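As a toy sketch of this idea (component names and update rules are invented for illustration), two model components run as separate tasks and exchange data each timestep:

```python
import queue
import threading

# One channel per direction of coupling between the two components.
atmos_to_ocean = queue.Queue()
ocean_to_atmos = queue.Queue()

def atmosphere(steps):
    state = 1.0
    for _ in range(steps):
        atmos_to_ocean.put(state)        # send forcing to the ocean task
        sst = ocean_to_atmos.get()       # receive sea-surface temperature
        state = 0.9 * state + 0.1 * sst  # toy update rule

def ocean(steps):
    state = 0.0
    for _ in range(steps):
        forcing = atmos_to_ocean.get()   # receive atmospheric forcing
        ocean_to_atmos.put(state)        # send state back
        state = 0.5 * (state + forcing)  # toy update rule

tasks = [threading.Thread(target=atmosphere, args=(10,)),
         threading.Thread(target=ocean, args=(10,))]
for t in tasks: t.start()
for t in tasks: t.join()
```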

Page 17: CS 484

Partitioning Checklist

- Did you define a LOT of tasks?
- Did you avoid redundant computation and storage?
- Are tasks approximately equal in size?
- Does the number of tasks scale with the problem size?
- Have you identified several alternative partitioning schemes?

Page 18: CS 484

Communication

The information flow between tasks is specified in this stage of the design.

Remember: tasks execute concurrently. Data dependencies may limit concurrency.

Page 19: CS 484

Communication

- Define channels: link the producers with the consumers.
- Consider the costs, both intellectual and physical.
- Distribute the communication.
- Specify the messages that are sent.

Page 20: CS 484

Communication Patterns

- Local vs. global
- Structured vs. unstructured
- Static vs. dynamic
- Synchronous vs. asynchronous

Page 21: CS 484

Local Communication

Communication within a neighborhood. The choice of algorithm determines the communication pattern.

Page 22: CS 484

Global Communication

Not localized. Examples: all-to-all, master-worker.

[Figure: the values 5, 3, 7, 2, 1 all funneling into a single summation task.]

Page 23: CS 484

Avoiding Global Communication

Distribute the communication and computation.

[Figure: the same values 5, 3, 7, 2, 1 summed via a chain of partial sums, one per task, instead of funneling into a single task.]
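A sketch of the distributed version (one value per task, a simple left-to-right chain of partial sums; the loop below simulates the message passing):

```python
values = [5, 3, 7, 2, 1]   # one value held by each of five tasks

# Instead of every task sending its value to one central task,
# each task adds its value to the running sum it receives from its
# left neighbor and forwards the result to its right neighbor.
running = 0
for task_id, v in enumerate(values):
    running += v   # work performed at task `task_id`
    # message: task_id sends `running` to task_id + 1
print(running)     # 18 emerges at the last task
```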

Page 24: CS 484

Divide and Conquer

Partition the problem into two or more subproblems; partition each subproblem, and so on (see the sketch below).

[Figure: a binary tree of partial sums over the same values.]

Results in a structured nearest-neighbor communication pattern.
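A sketch of the divide-and-conquer sum (recursive halving; the two recursive calls are independent, so they could run as concurrent tasks):

```python
def tree_sum(vals):
    # Base case: a single value needs no combining.
    if len(vals) == 1:
        return vals[0]
    mid = len(vals) // 2
    # The two halves are independent subproblems; in a parallel
    # implementation they would be separate tasks, and the final `+`
    # is a nearest-neighbor communication between them.
    return tree_sum(vals[:mid]) + tree_sum(vals[mid:])

print(tree_sum([5, 3, 7, 2, 1]))  # 18
```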

Page 25: CS 484

Structured Communication

Each task's communication resembles every other task's communication. Is there a pattern?

Page 26: CS 484

Unstructured Communication

No regular pattern that can be exploited. Examples: unstructured grids, resolution changes.

Complicates the next stages of design.

Page 27: CS 484

Synchronous Communication

Both producers and consumers are aware of when communication is required. Explicit and simple.

[Figure: producer and consumer exchanging data at timesteps t = 1, t = 2, t = 3.]

Page 28: CS 484

Asynchronous Communication

The timing of sends and receives is unknown; there is no pattern.

Consider a very large data structure shared by many tasks. Options:
- Distribute it among the computational tasks, which periodically poll for requests.
- Define a dedicated set of read/write tasks to serve requests.
- Use shared memory.

A polling sketch follows.
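A minimal polling sketch (names and timings invented for illustration): one task owns a shard of the large structure and alternates between its own computation and serving read requests:

```python
import queue
import threading

# Hypothetical setup: this task owns one shard of a large distributed
# structure and polls for read requests between chunks of local work.
requests = queue.Queue()
shard = {i: i * i for i in range(100)}   # this task's piece of the data
done = threading.Event()

def owner_task():
    while not done.is_set():
        # ... do a chunk of local computation here ...
        try:  # poll: serve any request that arrived meanwhile
            key, reply = requests.get(timeout=0.01)
            reply.put(shard.get(key))
        except queue.Empty:
            pass

def reader_task():
    reply = queue.Queue()
    requests.put((7, reply))   # ask the owning task for entry 7
    print(reply.get())         # 49
    done.set()                 # shut the owner down

threads = [threading.Thread(target=owner_task),
           threading.Thread(target=reader_task)]
for t in threads: t.start()
for t in threads: t.join()
```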

Page 29: CS 484

Problems to Avoid

- A centralized algorithm: distribute the computation and the communication.
- A sequential algorithm: seek concurrency; divide and conquer into small, equal-sized subproblems.

Page 30: CS 484

Communication Design Checklist

- Is communication balanced? All tasks should communicate about the same amount.
- Is communication limited to neighborhoods? Restructure global communication to local if possible.
- Can communications proceed concurrently?
- Can the algorithm proceed concurrently? Find the algorithm with the most concurrency. Be careful!!!

Page 31: CS 484

Agglomeration

- The partitioning and communication steps were abstract; agglomeration moves toward the concrete.
- Combine tasks so they execute efficiently on some parallel computer.
- Consider replication.

Page 32: CS 484

Agglomeration Goals

- Reduce communication costs by increasing computation, i.e. by changing (usually increasing) granularity.
- Retain flexibility for mapping and scaling.
- Reduce software engineering costs.

Page 33: CS 484

Changing Granularity

- A large number of tasks does not necessarily produce an efficient algorithm.
- We must consider the communication costs.
- Reduce communication by having fewer tasks send fewer messages (batching).

Page 34: CS 484

Surface-to-Volume Effects

- Communication is proportional to the surface area of the subdomain.
- Computation is proportional to the volume of the subdomain.
- Increasing the computation per task will often decrease the relative communication. (A worked example follows.)
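A worked example (the cubic subdomain is my own illustration): a subdomain of side n in a 3-D grid computes n³ cells but exchanges only its 6n² boundary cells, so the communication-to-computation ratio falls as 6/n when tasks are made coarser:

```python
# Surface-to-volume ratio for a cubic subdomain of side n.
for n in [4, 8, 16, 32]:
    volume = n ** 3        # cells computed (computation)
    surface = 6 * n ** 2   # boundary cells exchanged (communication)
    print(n, surface / volume)  # ratio = 6/n: halves every time n doubles
```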

Page 35: CS 484

How many messages total? How much data is sent?

Page 36: CS 484

How many messages total? How much data is sent?

Page 37: CS 484

Replicating Computation

- Trade off replicated computation for reduced communication.
- Replication will often reduce execution time as well.

Page 38: CS 484

Summation of N Integers

[Figure: s = sum step, b = broadcast step.]

How many steps?

Page 39: CS 484

Using Replication (Butterfly)
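A sketch of the butterfly pattern (my own illustration; assumes the task count is a power of two): at stage d, each task exchanges its partial sum with the task whose index differs in bit d, so after log₂ P stages every task holds the full sum, with no separate broadcast:

```python
def butterfly_sum(values):
    # values[i] is the integer initially held by task i; len(values)
    # must be a power of two.
    p = len(values)
    partial = list(values)
    d = 1
    while d < p:
        # Stage: task i exchanges its partial sum with partner i XOR d
        # and adds the two together.
        partial = [partial[i] + partial[i ^ d] for i in range(p)]
        d *= 2
    return partial  # every task now holds the total (replicated result)

print(butterfly_sum([5, 3, 7, 2, 6, 1, 4, 8]))  # [36, 36, ..., 36]
```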

Page 40: CS 484

Using Replication

Butterfly to Hypercube

Page 41: CS 484

Avoid Communication

- Look for tasks that cannot execute concurrently because of communication requirements.
- Replication can help accomplish two operations at the same time, e.g. summation and broadcast.

Page 42: CS 484

Preserve Flexibility

- Create more tasks than processors.
- Overlap communication and computation.
- Don't incorporate unnecessary limits on the number of tasks.

Page 43: CS 484

Agglomeration Checklist

- Did you reduce communication costs by increasing locality?
- Do the benefits of replication outweigh its costs?
- Does replication compromise scalability?
- Does the number of tasks still scale with problem size?
- Is there still sufficient concurrency?

Page 44: CS 484

Mapping

- Specify where each task is to execute.
- Mapping may need to change depending on the target architecture.
- The general mapping problem is NP-complete.

Page 45: CS 484

Mapping

Goal: reduce execution time.
- Concurrent tasks ---> different processors
- High communication ---> same processor

Mapping is a game of trade-offs.

Page 46: CS 484

Mapping

Many domain-decomposition problems make mapping easy: grids, arrays, etc.

Page 47: CS 484

Mapping

Algorithms based on unstructured or complex domain decompositions are difficult to map.

Page 48: CS 484

Other Mapping Problems

- Variable amounts of work per task
- Unstructured communication
- Heterogeneous processors: different speeds, different architectures

Solution: LOAD BALANCING

Page 49: CS 484

Load Balancing

- Static: determined a priori, based on work, processor speed, etc.
- Probabilistic: random assignment.
- Dynamic: restructure the load during execution.
- Task scheduling (functional decomposition).

Page 50: CS 484

Static Load Balancing

- Based on a priori knowledge.
- Goal: equal WORK on all processors.
- Algorithms: basic, recursive bisection.

Page 51: CS 484

Basic

Divide up the work based on the work required and the processor speeds. With total work $R$ and processor speeds $p_i$, processor $i$ receives

$$r_i = R \, \frac{p_i}{\sum_j p_j}$$
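A small sketch of that proportional split (function and variable names are mine):

```python
def static_shares(total_work, speeds):
    # Give processor i work proportional to its speed:
    # r_i = R * p_i / sum_j(p_j).
    s = sum(speeds)
    return [total_work * p / s for p in speeds]

print(static_shares(1000, [1.0, 1.0, 2.0]))  # [250.0, 250.0, 500.0]
```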

Page 52: CS 484

Recursive Bisection

- Divide the work in half recursively (see the sketch below).
- Based on physical coordinates.
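A sketch of recursive coordinate bisection (my own illustration; assumes the number of parts is a power of two): split the points at the median coordinate, alternating dimensions, and recurse on each half:

```python
def bisect(points, n_parts, dim=0):
    # Recursively split 2-D points into n_parts groups of roughly
    # equal size by the median coordinate, alternating the dimension.
    if n_parts == 1:
        return [points]
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    nxt = (dim + 1) % 2
    return (bisect(pts[:mid], n_parts // 2, nxt) +
            bisect(pts[mid:], n_parts // 2, nxt))

pts = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.7), (0.3, 0.5),
       (0.9, 0.1), (0.6, 0.6), (0.2, 0.3), (0.7, 0.4)]
for part in bisect(pts, 4):
    print(part)   # four groups of two spatially nearby points each
```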

Page 53: CS 484

Dynamic Algorithms

Adjust the load when an imbalance is detected. Can be local or global.

Page 54: CS 484

Task Scheduling

Many tasks with weak locality requirements. Manager-worker model.

Page 55: CS 484

Task Scheduling

- Manager-worker (sketched below)
- Hierarchical manager-worker: uses submanagers
- Decentralized: no central manager; a task pool on each processor; less of a bottleneck
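A minimal manager-worker sketch (the process pool plays the manager, handing the next task to whichever worker goes idle; the sleep stands in for variable-sized work):

```python
from concurrent.futures import ProcessPoolExecutor
import time

def do_task(size):
    # Simulated task with variable cost; workers that draw small tasks
    # finish early and are handed new ones, balancing the load.
    time.sleep(size * 0.01)
    return size * size

if __name__ == "__main__":
    tasks = [5, 1, 9, 2, 7, 3, 8, 4]
    with ProcessPoolExecutor(max_workers=3) as manager:
        print(list(manager.map(do_task, tasks)))
```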

Page 56: CS 484

Mapping Checklist

- Is the load balanced?
- Are there communication bottlenecks?
- Is it necessary to adjust the load dynamically?
- Can you adjust the load if necessary?
- Have you evaluated the costs?

Page 57: CS 484

PCAM Algorithm Design

- Partition: domain or functional decomposition
- Communication: link producers and consumers
- Agglomeration: combine tasks for efficiency
- Mapping: divide up the tasks for balanced execution

Page 58: CS 484

Example: Atmosphere Model

Simulate atmospheric processes: wind, clouds, etc.

Solves a set of partial differential equations describing the fluid behavior.

Page 59: CS 484

Representation of Atmosphere

Page 60: CS 484

Data Dependencies

Page 61: CS 484

Partition & Communication

Page 62: CS 484

Agglomeration

Page 63: CS 484

Mapping

Page 64: CS 484

Mapping