Revisiting a slide from the syllabus: CS 525 will cover

Page 1

Revisiting a slide from the syllabus: CS 525 will cover

• Parallel and distributed computing architectures
  – Shared memory processors
  – Distributed memory processors
  – Multi-core processors

• Parallel programming
  – Shared memory programming (OpenMP)
  – Distributed memory programming (MPI)
  – Thread-based programming

• Parallel algorithms
  – Sorting
  – Matrix-vector multiplication
  – Graph algorithms

• Applications
  – Computational science and engineering
  – High-performance computing

Page 2

Revisiting a slide from the syllabus: CS 525 will cover

• Parallel and distributed computing architectures
  – Shared memory processors
  – Distributed memory processors
  – Multi-core processors

• Parallel programming
  – Shared memory programming (OpenMP)
  – Distributed memory programming (MPI)
  – Thread-based programming

• Parallel algorithms
  – Sorting
  – Matrix-vector multiplication
  – Graph algorithms

• Applications
  – Computational science and engineering
  – High-performance computing

This lecture hopes to touch on each of the highlighted aspects via one running example.

Page 3

Intel Nehalem: an example of a multi-core shared-memory architecture

A few of the features…

• 2 sockets
• 4 cores per socket
• 2 hyper-threads per core ⇒ 16 threads in total
• Proc speed: 2.5 GHz
• 24 GB total memory
• Cache:
  – L1 (32 KB, on-core)
  – L2 (2x6 MB, on-core)
  – L3 (8 MB, shared)
• Memory: virtually globally shared (from the programmer's point of view)
• Multithreading: simultaneous (multiple instructions from ready threads are executed in a given cycle)

(Figure: block diagram of the Nehalem system)
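A quick way to see these hardware threads from software is the OpenMP runtime API. A minimal sketch (my own example, not from the slides; assumes a compiler with OpenMP support, e.g. gcc -fopenmp):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* On the Nehalem box above this would report 16 logical
           processors: 2 sockets x 4 cores x 2 hyper-threads. */
        printf("logical processors:  %d\n", omp_get_num_procs());
        printf("default max threads: %d\n", omp_get_max_threads());
        return 0;
    }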

Page 4

Overview of Programming Models

• Programming models provide support for expressing concurrency and synchronization

• Process-based models assume that all data associated with a process is private by default, unless otherwise specified

• Lightweight processes and threads assume that all memory is global
  – A thread is a single stream of control in the flow of a program

• Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads
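As a minimal illustration of the thread model (a sketch of my own, not from the slides): all threads share the process's global memory, and synchronization must be expressed explicitly.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;   /* global memory, visible to all threads */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread is a single stream of control in the program. */
    void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);     /* explicit synchronization */
        shared_counter++;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %d\n", shared_counter);   /* prints 4 */
        return 0;
    }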

Page 5

Advantages of Multithreading (threaded programming)

• Threads provide software portability
• Multithreading enables latency hiding
• Multithreading takes scheduling and load-balancing burdens away from programmers
• Multithreaded programs are significantly easier to write than message-passing programs
  – Becoming increasingly widespread in use due to the abundance of multicore platforms

Page 6

Shared Address Space Programming APIs

• The POSIX Thread (Pthread) API
  – Has emerged as the standard threads API
  – Low-level primitives (relatively difficult to work with)

• OpenMP
  – A directive-based API for programming shared address space platforms (has become a standard)
  – Used with Fortran, C, and C++
  – Directives provide support for concurrency, synchronization, and data handling without having to explicitly manipulate threads
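For contrast with the Pthreads sketch earlier, here is a minimal OpenMP version (an illustrative sketch; the loop and variable names are my own, not from the slides). A single directive expresses the concurrency, scheduling, and reduction; no explicit thread creation or locking is needed.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int n = 1000000;
        double sum = 0.0;

        /* The directive parallelizes the loop and combines the
           per-thread partial sums safely via the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);

        printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }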

Page 7

Parallelizing Graph Algorithms

• Challenges:
  – Runtime is dominated by memory latency rather than processor speed
  – Little work is done while visiting a vertex or an edge
    • Little computation to hide memory access cost
  – Access patterns are determined only at runtime
    • Prefetching techniques are inapplicable
  – There is poor data locality
    • Difficult to obtain good memory system performance

• For these reasons, parallel performance
  – on distributed-memory machines is often poor
  – on shared-memory machines is often better

We consider graph coloring here as an example of a graph algorithm to parallelize on shared-memory machines.

Page 8

Graph coloring

• Graph coloring is an assignment of colors (positive integers) to the vertices of a graph such that adjacent vertices get different colors

• The objective is to find a coloring with the least number of colors

• Examples of applications:
  – Concurrency discovery in parallel computing (illustrated in the figure on the slide)
  – Sparse derivative computation
  – Frequency assignment
  – Register allocation, etc.

Page 9

A greedy algorithm for coloring

• Graph coloring is NP-hard to solve optimally (and even to approximate)
• The following GREEDY algorithm gives very good solutions in practice (see the sketch below)

Complexity of GREEDY: O(|E|), thanks to the way the array forbiddenColors is used.

color is a vertex-indexed array that stores the color of each vertex; forbiddenColors is a color-indexed array used to mark colors that are impermissible for a vertex.
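The GREEDY listing on the original slide does not survive in this transcript; the following C sketch reconstructs the standard greedy coloring it describes, using the color and forbiddenColors arrays named above. The CSR-style graph representation (adjIndex, adj) is my own assumption.

    #include <stdlib.h>

    /* Greedy coloring of a graph in CSR form: the neighbors of vertex v
       are adj[adjIndex[v] .. adjIndex[v+1]-1]. color[v] == 0 means "not
       yet colored"; colors are 1, 2, 3, ...  forbiddenColors[c] is
       stamped with v (rather than reset for every vertex), which keeps
       the total work over all vertices at O(|E|). */
    void greedy_color(int n, const int *adjIndex, const int *adj, int *color) {
        /* A vertex has at most n distinct neighbor colors, so n+2 slots
           are always enough. */
        int *forbiddenColors = malloc((n + 2) * sizeof(int));
        for (int c = 0; c <= n + 1; c++) forbiddenColors[c] = -1;
        for (int v = 0; v < n; v++) color[v] = 0;

        for (int v = 0; v < n; v++) {
            /* Mark colors of already-colored neighbors as forbidden for v. */
            for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++) {
                int w = adj[j];
                if (color[w] > 0) forbiddenColors[color[w]] = v;
            }
            /* Assign the smallest color not carrying v's stamp. */
            int c = 1;
            while (forbiddenColors[c] == v) c++;
            color[v] = c;
        }
        free(forbiddenColors);
    }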

Page 10

Parallelizing Greedy Coloring

• Desired goal: parallelize GREEDY such that
  – the parallel runtime is roughly O(|E|/p) when p processors (threads) are used
  – the number of colors used is nearly the same as in the serial case

• Difficult to achieve, since GREEDY is inherently sequential

• Challenge: come up with a way to create concurrency in a nontrivial way

Page 11

A potentially “generic” parallelization technique

• “Standard” partitioning
  – Break up the given problem into p independent subproblems of almost equal sizes
  – Solve the p subproblems concurrently

  The main work lies in the decomposition step, which is often no easier than solving the original problem.

• “Relaxed” partitioning
  – Break up the problem into p subproblems of almost equal sizes that are not necessarily entirely independent
  – Solve the p subproblems concurrently
  – Detect inconsistencies in the solutions concurrently
  – Resolve any inconsistencies

  This can be used successfully if the resolution in the fourth step involves only local adjustments.

Page 12

“Relaxed Partitioning” applied towards parallelizing Greedy coloring

• Speculation and iteration:
  – Color as many vertices as possible concurrently, tentatively tolerating potential conflicts; detect and resolve the conflicts afterwards (iteratively)

Page 13

Parallel Coloring on Shared Memory Platforms (using Speculation and Iteration)

Lines 4 and 9 of the algorithm can be parallelized using the OpenMP directive #pragma omp parallel for (see the sketch below).
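The Algorithm 2 listing itself is an image on the original slide and does not survive here; the following C/OpenMP sketch is my own reconstruction of the speculation-and-iteration scheme it describes (all function and variable names are assumptions). The two parallelized loops, speculative coloring and conflict detection, play the role of the lines referred to above; color_vertex is the greedy step from the earlier sketch.

    #include <omp.h>
    #include <stdlib.h>

    /* Tentatively color one vertex greedily (as in the serial sketch). */
    static void color_vertex(int v, const int *adjIndex, const int *adj,
                             int *color, int *forbidden) {
        for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++)
            if (color[adj[j]] > 0)
                forbidden[color[adj[j]]] = v;   /* stamp, don't reset */
        int c = 1;
        while (forbidden[c] == v) c++;
        color[v] = c;
    }

    /* Speculation and iteration: color the worklist in parallel
       (tolerating races), detect conflicts in parallel, and repeat on
       the conflicting vertices until none remain. */
    void iterative_parallel_color(int n, const int *adjIndex,
                                  const int *adj, int *color) {
        int *work = malloc(n * sizeof(int));
        int *next = malloc(n * sizeof(int));
        int workSize = n;
        for (int v = 0; v < n; v++) { work[v] = v; color[v] = 0; }

        while (workSize > 0) {
            /* Phase 1: speculative greedy coloring (parallel loop). */
            #pragma omp parallel
            {
                int *forbidden = malloc((n + 2) * sizeof(int));
                for (int c = 0; c <= n + 1; c++) forbidden[c] = -1;
                #pragma omp for
                for (int i = 0; i < workSize; i++)
                    color_vertex(work[i], adjIndex, adj, color, forbidden);
                free(forbidden);
            }

            /* Phase 2: conflict detection (parallel loop). Of two
               equally colored neighbors, the higher-numbered vertex is
               queued for recoloring in the next round. */
            int nextSize = 0;
            #pragma omp parallel for
            for (int i = 0; i < workSize; i++) {
                int v = work[i];
                for (int j = adjIndex[v]; j < adjIndex[v + 1]; j++) {
                    int w = adj[j];
                    if (color[w] == color[v] && v > w) {
                        int k;
                        #pragma omp atomic capture
                        k = nextSize++;
                        next[k] = v;
                        break;
                    }
                }
            }
            int *tmp = work; work = next; next = tmp;  /* swap worklists */
            workSize = nextSize;
        }
        free(work); free(next);
    }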

Page 14

Sample experimental results of Algorithm 2 (Iterative) on Nehalem: I

Graph (RMAT-G): 16.7M vertices; 133.1M edges; degree: avg = 16, max = 1,278, variance = 416

Page 15

Sample experimental results of Algorithm 2 (Iterative) on Nehalem: II

Graph (RMAT-B): 16.7M vertices; 133.7M edges; degree: avg = 16, max = 38,143, variance = 8,086

Page 16

Recap

• Saw an example of a multi-core architecture
  – Intel Nehalem

• Took a bird's-eye view of parallel programming models
  – OpenMP (highlighted)

• Saw an example of the design of a graph algorithm
  – Graph coloring

• Saw some experimental results