Introduction to Parallel Computingkarypis/parbook/Lectures/GK... · 2005-01-08 · Sample Sort...

Introduction to Parallel Computing George Karypis Sorting



Introduction to Parallel Computing

George Karypis

Sorting


Outline

• Background
• Sorting Networks
• Quicksort
• Bucket-Sort & Sample-Sort


Background

Input Specification
• Each processor has n/p elements.
• An ordering of the processors.

Output Specification
• Each processor will get n/p consecutive elements of the final sorted array.
• The “chunk” is determined by the processor ordering.

Variations
• Unequal number of elements on output.
• In general, this is not a good idea, and it may require a shift to obtain an equal-size distribution.


Basic Operation: Compare-Split Operation

Single element per processor

Multiple elements per processor
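
The two cases can be sketched in Python as follows. This is a minimal illustration, not the slide's own code: the function names are hypothetical, both blocks are assumed to be already sorted and of equal size, and `sorted` stands in for the linear-time merge a real implementation would use.

```python
def compare_exchange(a, b):
    """One element per processor: the lower-ranked processor keeps the
    smaller element and the higher-ranked processor keeps the larger."""
    return (a, b) if a <= b else (b, a)


def compare_split(block_lo, block_hi):
    """Multiple elements per processor: the two processors exchange their
    sorted blocks and merge them; the lower-ranked processor keeps the
    smaller half and the higher-ranked processor keeps the larger half."""
    merged = sorted(block_lo + block_hi)  # stands in for a linear-time merge
    k = len(block_lo)
    return merged[:k], merged[k:]


print(compare_exchange(8, 3))                         # (3, 8)
print(compare_split([1, 6, 8, 11], [2, 9, 12, 13]))   # ([1, 2, 6, 8], [9, 11, 12, 13])
```

After a compare-split, every element on the lower-ranked processor is less than or equal to every element on the higher-ranked one, and both blocks remain sorted.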


Sorting Networks

Sorting is one of the fundamental problems in Computer Science.
For a long time researchers have focused on the problem of “how fast can we sort n elements?”
• Serial: n log(n) lower bound for comparison-based sorting.
• Parallel: O(1), O(log(n)), O(???)

Sorting networks
• Custom-made hardware for sorting!
• Hardware & algorithm.
• Mostly of theoretical interest but fun to study!


Elements of Sorting Networks

Key Idea:
• Perform many comparisons in parallel.

Key Elements:
• Comparators: devices with two input wires and two output wires. They take two elements on the input wires and output them in sorted order on the output wires.
• Network architecture: the arrangement of the comparators into interconnected comparator columns, similar to multi-stage networks.

Many sorting networks have been developed.
• Bitonic sorting network: Θ(log² n) columns of comparators.


Bitonic Sequence

Bitonic sequences are graphically represented by lines as follows:

[Figure: line diagrams of bitonic sequences]


Why Bitonic Sequences?

A bitonic sequence can be “easily” sorted in increasing/decreasing order.

[Figure: a bitonic split takes a bitonic sequence s and produces two subsequences s1 and s2]

• Every element of s1 will be less than or equal to every element of s2.
• Both s1 and s2 are bitonic sequences.
• So how can a bitonic sequence be sorted?
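
A minimal Python sketch of the standard bitonic split, assuming an even-length bitonic input: element i is compared with element i + n/2, the minima form s1 and the maxima form s2. The function name and the sample input are illustrative, not taken from the slide.

```python
def bitonic_split(s):
    """Split an even-length bitonic sequence s into (s1, s2)."""
    half = len(s) // 2
    s1 = [min(s[i], s[i + half]) for i in range(half)]
    s2 = [max(s[i], s[i + half]) for i in range(half)]
    return s1, s2


# Illustrative bitonic input: increases up to 95, then decreases.
s = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
s1, s2 = bitonic_split(s)
print(s1)  # [3, 5, 8, 9, 10, 12, 14, 0]
print(s2)  # [95, 90, 60, 40, 35, 23, 18, 20]
```

Both halves are again bitonic and max(s1) <= min(s2), so applying the split recursively to s1 and s2 leaves the whole sequence sorted.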


An example


Bitonic Merging Network

A comparator network that takes as input a bitonic sequence and performs a sequence of bitonic splits to sort it.

+BM[n]: a bitonic merging network that sorts an n-element bitonic sequence in increasing order.

-BM[n]: a similar network that sorts in decreasing order.
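
A software analogue of the merging network can be sketched recursively. This is an illustration under the assumption that the input is bitonic and its length is a power of two; the function name and the `ascending` flag are hypothetical, with `ascending=True` playing the role of +BM[n] and `ascending=False` the role of -BM[n].

```python
def bitonic_merge(s, ascending=True):
    """Sort a bitonic sequence whose length is a power of two."""
    n = len(s)
    if n <= 1:
        return list(s)
    half = n // 2
    # One comparator column: a bitonic split of the current sequence.
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return bitonic_merge(lo, True) + bitonic_merge(hi, True)
    return bitonic_merge(hi, False) + bitonic_merge(lo, False)


bitonic_input = [3, 5, 8, 9, 10, 12, 14, 20, 95, 90, 60, 40, 35, 23, 18, 0]
print(bitonic_merge(bitonic_input))
print(bitonic_merge(bitonic_input, ascending=False))
```

Each level of recursion corresponds to one column of comparators, so an n-element merge uses log2(n) columns.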


Are we done?

Given a set of elements, how do we re-arrange them into a bitonic sequence?

Key Idea:
• Use successively larger bitonic networks to transform the set into a bitonic sequence.
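
The key idea can be sketched as a recursive sort: sort the first half in increasing order and the second half in decreasing order, which yields a bitonic sequence, then merge it. This is a minimal sketch assuming the input length is a power of two; the helper names `_merge` and `bitonic_sort` are hypothetical.

```python
def _merge(s, ascending):
    """Sort a bitonic sequence (power-of-two length) with bitonic splits."""
    if len(s) <= 1:
        return list(s)
    half = len(s) // 2
    lo = [min(s[i], s[i + half]) for i in range(half)]
    hi = [max(s[i], s[i + half]) for i in range(half)]
    if ascending:
        return _merge(lo, True) + _merge(hi, True)
    return _merge(hi, False) + _merge(lo, False)


def bitonic_sort(a, ascending=True):
    """Sort an arbitrary sequence whose length is a power of two."""
    if len(a) <= 1:
        return list(a)
    half = len(a) // 2
    first = bitonic_sort(a[:half], True)     # increasing half
    second = bitonic_sort(a[half:], False)   # decreasing half
    return _merge(first + second, ascending)  # first + second is bitonic


data = [10, 20, 5, 9, 3, 8, 12, 14, 90, 0, 60, 40, 23, 35, 95, 18]
print(bitonic_sort(data))
```

The recursion bottoms out at pairs, which are merged into bitonic sequences of length 4, then 8, and so on up to n.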


An example


Complexity

How many columns of comparators are required to sort n = 2^l elements?

i.e., depth d(n) of the network?
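
One way to answer the question is the following worked recurrence, a sketch that assumes the usual construction: two networks for n/2 elements operate side by side, followed by a merging network BM[n] with log2(n) comparator columns.

```latex
\begin{align*}
d(n) &= d(n/2) + \log_2 n, \qquad d(2) = 1 \\
d(2^l) &= \sum_{i=1}^{l} i = \frac{l(l+1)}{2}
        = \frac{\log_2^2 n + \log_2 n}{2} = \Theta(\log^2 n)
\end{align*}
```

For example, n = 16 (l = 4) needs 1 + 2 + 3 + 4 = 10 columns.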


Bitonic Sort on a Hypercube

One-element-per-processor case
• How do we map the algorithm onto a hypercube?
• What is the comparator?
• How do the wires get mapped?

What can you say about the pairs of wires that are inputs to the various comparators?
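
One common answer, sketched below as a serial simulation, maps wire k to the processor labeled k. Every comparator then connects two labels that differ in exactly one bit, that is, two hypercube neighbors, and the comparator becomes a compare-exchange along that dimension. The function name is hypothetical and the number of processors is assumed to be a power of two.

```python
def hypercube_bitonic_sort(values):
    """Simulate bitonic sort with one element per hypercube processor."""
    n = len(values)
    d = n.bit_length() - 1
    assert n >= 1 and n == 1 << d, "number of processors must be a power of two"
    a = list(values)
    for i in range(d):                  # stage i builds sorted runs of 2^(i+1)
        for j in range(i, -1, -1):      # step j communicates along dimension j
            nxt = a[:]                  # all processors act on the same old state
            for label in range(n):
                partner = label ^ (1 << j)          # neighbor along dimension j
                bit_next = (label >> (i + 1)) & 1   # direction of this run
                bit_j = (label >> j) & 1
                if bit_next != bit_j:
                    nxt[label] = max(a[label], a[partner])  # keep the larger
                else:
                    nxt[label] = min(a[label], a[partner])  # keep the smaller
            a = nxt
    return a


print(hypercube_bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Each inner step is a single exchange between direct neighbors, so the number of communication steps equals the depth of the sorting network.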


Illustration


Communication Pattern


Algorithm

Complexity?


Bitonic Sort on a Mesh

One-element-per-processor case
• How do the wires get mapped?

Which one is better? Why?


Row-Major Shuffled Mapping

Complexity?

Can we do better?

What is the lower bound of sorting on a mesh?

communication performed by each process
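
The slide's figure is not available, so the sketch below assumes that "row-major shuffled" means the bit-interleaved layout: the even-numbered bits of a wire label give the column and the odd-numbered bits give the row. The function names are hypothetical. Under that assumption, wires that differ in bit j sit 2^(j // 2) mesh hops apart, so the low-order bits, which are compared most often, stay close together.

```python
def shuffled_row_major(wire, bits):
    """Map a wire label of `bits` bits (bits even) to (row, col) on a
    2^(bits/2) x 2^(bits/2) mesh by de-interleaving the label's bits."""
    row = col = 0
    for k in range(bits // 2):
        col |= ((wire >> (2 * k)) & 1) << k       # even bits form the column
        row |= ((wire >> (2 * k + 1)) & 1) << k   # odd bits form the row
    return row, col


def mesh_distance(wire, j, bits):
    """Mesh hops between `wire` and the wire that differs from it in bit j."""
    r1, c1 = shuffled_row_major(wire, bits)
    r2, c2 = shuffled_row_major(wire ^ (1 << j), bits)
    return abs(r1 - r2) + abs(c1 - c2)


# Layout of the 16 wires on a 4 x 4 mesh.
grid = [[None] * 4 for _ in range(4)]
for w in range(16):
    r, c = shuffled_row_major(w, 4)
    grid[r][c] = w
for row in grid:
    print(row)
print([mesh_distance(0, j, 4) for j in range(4)])  # [1, 1, 2, 2]
```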


More than one element per processor

Hypercube

Mesh


Bitonic Sort Summary


Quicksort


Parallel Formulation

How about recursive decomposition?
• Is it a good idea?

We need to do the partitioning of the array around a pivot element in parallel.

What is the lower bound of parallel quicksort?

What will it take to achieve this lower bound?


Optimal for CRCW PRAM

• One element per processor.
• Arbitrary resolution of the concurrent writes.
• Views the sorting as a two-step process:
  (i) Constructing a binary tree of pivot elements.
  (ii) Obtaining the sorted sequence by performing an inorder traversal of this binary tree.
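
The two steps can be imitated serially as sketched below. This is a simulation under stated assumptions, not a PRAM program: each loop over `active` stands for one parallel step, "the last writer wins" models the arbitrary resolution of concurrent writes, and ties between equal keys are broken by index. All names are hypothetical.

```python
def crcw_quicksort(a):
    """Sort by building a binary tree of pivots, then reading it inorder."""
    n = len(a)
    if n == 0:
        return []
    left = [None] * n
    right = [None] * n
    root = 0                      # stands for the winner of the first write
    parent = [root] * n
    active = [i for i in range(n) if i != root]

    def goes_left(i, p):
        return (a[i], i) < (a[p], p)   # index breaks ties between equal keys

    while active:
        # Concurrent-write step: every active element tries to become a
        # child of its current pivot; one writer per slot survives.
        for i in active:
            p = parent[i]
            if goes_left(i, p):
                left[p] = i
            else:
                right[p] = i
        # Read step: winners become pivots, losers move under the winner.
        still_active = []
        for i in active:
            p = parent[i]
            winner = left[p] if goes_left(i, p) else right[p]
            if winner != i:
                parent[i] = winner
                still_active.append(i)
        active = still_active

    # Step (ii): iterative inorder traversal of the pivot tree.
    out, stack, node = [], [], root
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = left[node]
        node = stack.pop()
        out.append(a[node])
        node = right[node]
    return out


print(crcw_quicksort([33, 21, 13, 54, 82, 33, 40, 72]))
```

The number of iterations of the `while active` loop equals the height of the pivot tree, which is why balanced pivots matter for the parallel running time.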


Building the Binary Tree

Complexity?


Practical Quicksort

Shared-memory
• Data resides on a shared array.
• During a partitioning each processor is responsible for a certain portion.

Array Partitioning:
• Select & broadcast pivot.
• Local re-arrangement. Is this required?
• Global re-arrangement.


Efficient Global Rearrangement
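
The slide's figure is missing; a common way to make the global rearrangement efficient, assumed here, is to compute prefix sums over the per-processor counts of "smaller" and "larger" elements so that every processor knows where to write without conflicts. The sketch below simulates this serially; `parallel_partition` and the sample data are illustrative.

```python
from itertools import accumulate


def parallel_partition(blocks, pivot):
    """blocks[i] is the portion of the shared array owned by processor i.
    Returns the rearranged array and the number of elements <= pivot."""
    # Local re-arrangement: each processor splits its own portion.
    smaller = [[x for x in b if x <= pivot] for b in blocks]
    larger = [[x for x in b if x > pivot] for b in blocks]
    s_counts = [len(s) for s in smaller]
    l_counts = [len(l) for l in larger]
    # Exclusive prefix sums give each processor its write offsets.
    s_start = [0] + list(accumulate(s_counts))[:-1]
    total_small = sum(s_counts)
    l_start = [total_small + off
               for off in [0] + list(accumulate(l_counts))[:-1]]
    # Global re-arrangement: every processor writes to a disjoint range.
    out = [None] * (total_small + sum(l_counts))
    for i in range(len(blocks)):
        out[s_start[i]:s_start[i] + s_counts[i]] = smaller[i]
        out[l_start[i]:l_start[i] + l_counts[i]] = larger[i]
    return out, total_small


blocks = [[7, 13, 18, 2], [17, 1, 14, 20], [6, 10, 15, 9],
          [3, 16, 19, 4], [11, 12, 5, 8]]
arr, k = parallel_partition(blocks, pivot=7)
print(arr[:k])  # [7, 2, 1, 6, 3, 4, 5]
print(arr[k:])
```

Because the write ranges are disjoint, the copies in the final loop could all proceed in parallel; the two sides of the array are then handled recursively by separate groups of processors.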


Practical Quicksort Complexity

Complexity for message-passing is similar, assuming that the all-to-all personalized communication is not cross-bisection bandwidth limited.


A word on Pivot Selection

Selecting pivots that lead to balanced partitions is important for:
• the height of the tree
• effective utilization of processors


Sample Sort

Generalization of bucket sort with data-driven sampling
• n/p elements per processor.
• Each processor sorts its local elements.
• Each processor selects p-1 equally spaced elements from its own list.
• The combined set of p(p-1) elements is sorted and p-1 equally spaced elements are selected from that list.
• Each processor splits its own list according to these splitters into p buckets.
• Each processor sends its ith bucket to the ith processor.
• Each processor merges the elements that it receives.
• Done.
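
The steps above can be sketched as a serial simulation in which `blocks[i]` holds the elements of processor i. The function name, the exact sample positions, and the tie handling (`bisect_right`, so elements equal to a splitter go to the lower bucket) are assumptions of this sketch; every block is assumed to be non-empty.

```python
from bisect import bisect_right
from heapq import merge


def sample_sort(blocks):
    """Return the final per-processor lists after sample sort."""
    p = len(blocks)
    # Each processor sorts its local elements.
    local = [sorted(b) for b in blocks]
    # Each processor selects p-1 equally spaced elements from its list.
    samples = []
    for b in local:
        m = len(b)
        samples.extend(b[(k * m) // p] for k in range(1, p))
    # The combined p(p-1) samples are sorted; p-1 splitters are chosen.
    samples.sort()
    splitters = [samples[k * (p - 1)] for k in range(1, p)]
    # Each processor splits its list into p buckets using the splitters.
    buckets = []
    for b in local:
        cuts = [0] + [bisect_right(b, s) for s in splitters] + [len(b)]
        buckets.append([b[cuts[k]:cuts[k + 1]] for k in range(p)])
    # Processor dst receives bucket dst from everyone and merges them.
    return [list(merge(*(buckets[src][dst] for src in range(p))))
            for dst in range(p)]


blocks = [[22, 7, 13, 18, 2, 17, 1, 14],
          [20, 6, 10, 24, 15, 9, 21, 3],
          [16, 19, 23, 4, 11, 12, 5, 8]]
result = sample_sort(blocks)
for rank, chunk in enumerate(result):
    print(rank, chunk)
```

Note that the output sizes may differ slightly between processors, since the splitters only approximate the true quantiles of the data.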


Sample Sort Illustration


Sample Sort Complexity

Assumes a serial sort