PatternsInParallelComputing2014 - College of...



Patterns in parallel computing

Oregon State University EECS, June 4, 2014


Michael Wrinn, Ph.D., Research Director, University Research Office, Intel

Disclaimer: the views expressed in this talk are those of the speaker, not his employer. This is an academic-style talk, not intended to address details of any particular Intel product or plan.

Parallel Programming Principle and Practice

Lecture 5: Shared Memory Programming with OpenMP

Jin, Hai

School of Computer Science and Technology

Huazhong University of Science and Technology

Example Problem: Numerical Integration


Parallel Pi: OpenMP solution

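The code on this slide did not survive extraction; below is a minimal sketch of the standard approach, assuming midpoint-rule integration of 4/(1+x²) over [0, 1] with an OpenMP reduction (num_steps and all other details are illustrative, not taken from the slide).

#include <cstdio>

int main() {
    const long num_steps = 100000000;      // number of integration intervals
    const double step = 1.0 / num_steps;   // width of each interval
    double sum = 0.0;

    // Each thread accumulates its share of the midpoint-rule sum;
    // the reduction combines the partial sums into one total.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < num_steps; ++i) {
        double x = (i + 0.5) * step;       // midpoint of interval i
        sum += 4.0 / (1.0 + x * x);        // integrand: 4/(1+x^2)
    }

    double pi = step * sum;                // pi = integral of 4/(1+x^2) over [0,1]
    std::printf("pi ~= %.12f\n", pi);
    return 0;
}

Compile with, e.g., g++ -fopenmp.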

Case Study: Akari, a logic puzzle from Nikoli

Goal: Place chess rooks on open squares such that:
• No two rooks attack each other
• Numbered squares surrounded by specified number of rooks
• All open squares are “covered” by one or more rooks
• Black squares block attack of rooks


Akari Algorithm

Input: board size and list of numbered and black squares

• Sort numbered squares by value; plain black squares after numbered
• Place rooks around all “4” squares
• Get next numbered square in list
• Try all rook combinations around square, via recursive call


  – “3” square => 4 combinations
  – “2” square => 6 combinations
  – “1” square => 4 combinations

• If no more numbered squares, compile list of all open squares
• Using backtracking:
  – Try rook in/out next open square from list
  – Solution reached when no more open squares


Search Tree – placeThree() example


Conventional analysis: where to parallelize?

What might hotspot analysis show?
• If move generation is fast, very little time in BT()
• Likely DeadEnd() or Solution() (at the leaves of the search tree) would report most execution time

function BT(c) {
    if (DeadEnd(c)) return;        // prune: this partial solution cannot be extended
    if (Solution(c)) Output(c);    // record a complete solution
    else
        foreach (s = next moves from c) {
            BT(s);                 // recurse on each candidate move
        }
}


Typically not much parallelism to be exploited in checking partial solutions

Example: Akari solver
• countBlanks() is 96% of serial time
• Simple for-loop over board squares
  – 360 squares for dataset used

Akari Execution – Hotspot Parallelism

int board::countBlanks()
{
    // find total number of blanks and non-covered squares
    int i, c = 0;
    #pragma omp parallel for reduction(+:c)
    for (i = 0; i < rows*cols; ++i)
        if (B[0][i] == ' ' || B[0][i] == 'n' || B[0][i] == 'N')
            ++c;
    return c;
}

Code version           Time (sec)   Speedup
serial                 29.557       1.0
parallel (32 cores)    36.841       0.8

Architecting Parallel Software: Design Patterns in Practice

A brief look at work developed by:
• Kurt Keutzer, UC Berkeley
• Tim Mattson, Intel
• many contributors at UC Berkeley, UIUC, Intel

How do we fix the parallel programming problem?

Focus on software architecture and time-tested designs that have been shown to work.

Given a good design, a programmer can create quality software regardless of the language.

Design patterns are a way to put these approaches into writing.

Our Pattern Language (2011): the full pattern map

Structural Patterns: Pipe and Filter, Agent and Repository, Process Control, Event Based / Implicit Invocation, Model View Controller, Iterative Refinement, Map Reduce, Layered Systems, Puppeteer, Arbitrary Static Task Graph

Computational Patterns: Graph Algorithms, Graphical Models, Dynamic Programming, Dense Linear Algebra, Sparse Linear Algebra, Unstructured Grids, Structured Grids, Finite State Machines, Backtrack Branch and Bound, N-Body Methods, Circuits, Spectral Methods, Monte Carlo

Algorithm Strategy Patterns: Task Parallelism, Recursive Splitting, Data Parallelism, Pipeline, Discrete Event, Geometric Decomposition, Speculation

Implementation Strategy Patterns:
• Program structure: SPMD, Strict Data Parallel, Fork/Join, Actors, Master/Worker, Graph Partitioning, Loop Parallelism, BSP, Task Queue
• Data structure: Distributed Array, Shared Data, Shared Queue, Shared Hash Table

Parallel Execution Patterns:
• Advancing “program counters”: MIMD, SIMD, Thread Pool, Task Graph, Data Flow, Digital Circuits
• Coordination: Message Passing, Collective Communication, Mutual Exclusion, Point-to-Point Sync, Collective Sync, Transactional Memory


Our Pattern Language: layered view

Applications
Structural Patterns and Computational Patterns
Parallel Algorithm Strategy Patterns
Implementation Patterns
Execution Patterns

Identify the SW Structure: Structural Patterns
• Pipe-and-Filter
• Agent-and-Repository
• Process-Control
• Event-Based/Implicit-Invocation
• Puppeteer
• Model-View-Controller
• Iterative-Refinement
• Map-Reduce
• Layered-Systems
• Arbitrary-Static-Task-Graph

These define the software structure but do not describe what is computed

Analogy: structural patterns are the layout of the factory plant; computational patterns are the machinery of the factory

Structural Pattern: Pipe and Filter
• Filters embody computation
  – Only see inputs and produce outputs
  – No global or shared state
• Pipes embody communication
• May have feedback
(Figure: Filters 1 through 7 connected by pipes)

Examples?

Examples: pipe and filter

Almost every large software program has a pipe and filter structure at the highest level

Logic optimizer, image retrieval system, compiler
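To make the structure concrete, here is a minimal pipe-and-filter sketch in C++ (illustrative, not from the slides): each filter is a self-contained function from input to output with no shared state, and the pipeline simply feeds one filter's output to the next.

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A filter only sees its input and produces an output; no global or shared state.
using Filter = std::function<std::string(const std::string&)>;

// The "pipes": run the data through each filter in order.
std::string run_pipeline(const std::vector<Filter>& filters, std::string data) {
    for (const auto& f : filters)
        data = f(data);          // output of one filter feeds the next
    return data;
}

int main() {
    std::vector<Filter> pipeline = {
        [](const std::string& s) { return s + " -> tokenized"; },   // Filter 1
        [](const std::string& s) { return s + " -> optimized"; },   // Filter 2
        [](const std::string& s) { return s + " -> emitted"; }      // Filter 3
    };
    std::cout << run_pipeline(pipeline, "source") << "\n";
    return 0;
}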


Structural Pattern: MapReduce

To us, it means:
• A map stage, where data is mapped onto independent computations
• A reduce stage, where the results of the map stage are summarized (i.e., reduced)

Examples?

Examples of Map Reduce

General structure:
• Map a computation across distributed data sets
• Reduce the results to find the best (or worst), maxima (or minima)

Speech recognition:
• Map: HMM computation to evaluate word match
• Reduce: find the most likely word sequences

Support vector machines (ML):
• Map: evaluate distance from the frontier
• Reduce: find the greatest outlier from the frontier
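A minimal map-reduce sketch in C++ (illustrative, not from the slides): the map stage applies an independent computation to each element, and the reduce stage summarizes the results, here by taking a maximum as in the "greatest outlier" example above.

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    std::vector<double> samples = {0.3, 1.7, -2.4, 0.9};

    // Map stage: independent computation per element (here, distance from 0).
    std::vector<double> mapped(samples.size());
    std::transform(samples.begin(), samples.end(), mapped.begin(),
                   [](double x) { return std::abs(x); });

    // Reduce stage: summarize the mapped results (here, the maximum).
    double greatest = *std::max_element(mapped.begin(), mapped.end());

    std::cout << "greatest outlier distance = " << greatest << "\n";
    return 0;
}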

Identify Key Computations

(Figure: matrix mapping application domains to computational patterns, the “dwarves”)

Apps: Embedded, SPEC, DB, Games, ML, HPC, CAD, Health, Image, Speech, Music, Browser

Dwarves: Graph Algorithms, Graphical Models, Backtrack / B&B, Finite State Machines, Circuits, Dynamic Programming, N-Body, Unstructured Grid, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Monte Carlo

These define the key computations, but do not describe how they are implemented

OPL: Parallel Algorithm Strategy


Parallel Algorithm Strategy Patterns

Parallel algorithm strategies:

These patterns define high-level strategies to exploit concurrency within a computation for execution on a parallel computer.

They address the different ways concurrency is naturally expressed within a problem or application.

How does the software architecture map onto parallel algorithms?

Algorithm Strategy Patterns: Task Parallelism, Recursive Splitting, Data Parallelism, Pipeline, Discrete Event, Geometric Decomposition, Speculation
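As one concrete illustration of the Recursive Splitting strategy (an illustrative sketch, not from the slides), an OpenMP task example that splits a range in half until it is small enough to sum directly:

#include <cstdio>

// Recursive splitting: divide the problem until it is small, then solve serially.
long sum_range(const int* a, long lo, long hi) {
    if (hi - lo < 10000) {                     // small enough: solve directly
        long s = 0;
        for (long i = lo; i < hi; ++i) s += a[i];
        return s;
    }
    long mid = lo + (hi - lo) / 2, left = 0, right = 0;
    #pragma omp task shared(left)              // one half becomes a new task
    left = sum_range(a, lo, mid);
    right = sum_range(a, mid, hi);             // other half handled by this task
    #pragma omp taskwait                       // join before combining results
    return left + right;
}

int main() {
    const long n = 1 << 20;
    int* a = new int[n];
    for (long i = 0; i < n; ++i) a[i] = 1;

    long total = 0;
    #pragma omp parallel
    #pragma omp single                         // one thread seeds the recursion
    total = sum_range(a, 0, n);

    std::printf("total = %ld (expected %ld)\n", total, n);
    delete[] a;
    return 0;
}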



Implementation Strategy Patterns

Implementation strategies:

These are the structures that are realized in source code to support (a) how the program itself is organized and (b) common data structures specific to parallel programming.

How do parallel algorithms map onto source code in a parallel programming language?

Implementation Strategy Patterns:
• Program structure: SPMD, Strict Data Parallel, Fork/Join, Actors, Master/Worker, Graph Partitioning, Loop Parallelism, BSP, Task Queue
• Data structure: Distributed Array, Shared Data, Shared Queue, Shared Hash Table
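A minimal sketch of the Shared-Queue data-structure pattern in C++ (illustrative, not from the slides): a mutex-protected queue that worker threads drain, the core of a master/worker or task-queue implementation.

#include <atomic>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

// Shared queue: all access goes through one lock, so any thread can push or pop safely.
class SharedQueue {
public:
    void push(int task) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(task);
    }
    std::optional<int> pop() {                 // empty optional means "no work left"
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty()) return std::nullopt;
        int t = q_.front();
        q_.pop();
        return t;
    }
private:
    std::mutex m_;
    std::queue<int> q_;
};

int main() {
    SharedQueue queue;
    for (int i = 1; i <= 100; ++i) queue.push(i);   // master fills the queue

    std::atomic<long> total{0};
    std::vector<std::thread> workers;
    for (int w = 0; w < 4; ++w)
        workers.emplace_back([&] {                  // workers drain the queue
            while (auto task = queue.pop())
                total += *task;
        });
    for (auto& t : workers) t.join();

    std::cout << "sum of tasks = " << total << "\n"; // 5050
    return 0;
}

Compile with, e.g., g++ -std=c++17 -pthread.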


Parallel Execution Patterns

Parallel Execution Patterns:  

These are the approaches often embodied in a runtime system that supports the execution of a parallel program.

How is the source code realized as an executing program running on the target parallel processor?

Parallel Execution Patterns:
• Advancing “program counters”: MIMD, SIMD, Thread Pool, Task Graph, Data Flow, Digital Circuits
• Coordination: Message Passing, Collective Communication, Mutual Exclusion, Point-to-Point Sync, Collective Sync, Transactional Memory
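To make two of these runtime-level coordination mechanisms concrete (an illustrative sketch, not from the slides), a short OpenMP fragment using mutual exclusion (a critical section) and collective synchronization (a barrier):

#include <cstdio>
#include <omp.h>

int main() {
    int counter = 0;

    #pragma omp parallel
    {
        // Mutual exclusion: only one thread at a time updates the shared counter.
        #pragma omp critical
        counter += 1;

        // Collective synchronization: no thread proceeds until all have arrived.
        #pragma omp barrier

        #pragma omp single
        std::printf("threads that checked in: %d of %d\n",
                    counter, omp_get_num_threads());
    }
    return 0;
}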

Compelling Application: Fast, Robust Pediatric MRI

Pediatric MRI is difficult:
• Children cannot sit still or hold their breath
• Low tolerance for long exams
• Anesthesia is costly and risky

We would like to accelerate MRI acquisition.

Advanced MRI techniques exist, but they require data- and compute-intensive algorithms for image reconstruction.

Reconstruction must be fast, or time saved in accelerated acquisition is lost in computing reconstruction

Slow reconstruction times are a non‐starter for clinical use

SW architecture of image reconstruction: game-changing speedup

100X faster reconstruction

Higher‐quality, faster MRI

This image: 8 month‐old patient with cancerous mass in liver

256 x 84 x 154 x 8 data size

Serial Recon: 1 hour

Parallel Recon: 1 minute

Fast enough for clinical use

Software currently deployed at Lucile Packard Children's Hospital for clinical study of the reconstruction technique  


Software Design Patterns: earlier lesson

Early days of OO (1994):
• Perception: Object oriented? Isn’t that just an academic thing?
• Usage: specialists only. Mainstream regards with indifference or anxiety.
• Performance: not so good.

Now:
• Perception: OO = programming. Isn’t this how it was always done?
• Usage: cosmetically widespread, some key concepts actually deployed.
• Performance: so-so, masked by CPU advances until now.

Parallel Design Patterns: does history rhyme? (a prediction from 2011)

Then (2011):
• Perception: Parallel programming? Isn’t that just an HPC thing?
• Usage: specialists only. Mainstream regards with indifference or anxiety.
• Performance: very good, for the specialists.

Soon:
• Perception: PP = programming. Isn’t this how it was always done?
• Usage: widespread, key concepts actually deployed.
• Performance: broadly sufficient. Application domains greatly expanded.

UC Berkeley: Computational Characteristics of (big data) application areas (2014)

(Figure: matrix mapping big-data application areas to computational patterns)

Apps: Web Search, Social Networks, Database, Big Data Analytics, Visualization, Genomics, UAV & Wearable Vision, Big Data Multimedia, Advertising, Finance

Patterns: Graph Algorithms, Graphical Models, Backtrack / B&B, Finite State Machines, Circuits, Dynamic Programming, N-Body, Unstructured Grid, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Monte Carlo

Forrest Iandola                [email protected]

Genomics: Architecture for doing it all in a day (April 31, 2014)

Gans Srinivasa, Sr. Principal Engineer; Karthik Gururaj, Mohammad Ghodrat, Mishali Naik, Paolo Narvaez, and Genomics Team

DCG-HSS

Overview

(Figure: end-to-end genomics analysis pipeline)

• Primary Analysis: FASTQ for individual
• Secondary Analysis (DNA/RNA pipeline + more): BAM for individual, HC variants; big data starts here
• Joint Genotyping: HC variants across samples; store; population/disease studies
• Systems Biomedicine: pathway networks to predict various drug responses (graph models, Bayesian models & ML); cell response & methylation data; models for drug treatment and response
• Development of Personalized Treatments: + imaging + ML + drug sensitivity assays + mechanistic learning and systems + correlation work and knowledge systems

Workload characteristics: imaging, parallelism, statistical work, accelerated algorithms; various algorithms, big data, statistical and math puzzles, in-memory processing

Heterogeneous computing: Xeon + accelerators + GPU + in-memory computing / data store

DNA Pipeline: BWA + GATK, whole genome sample, ~65x coverage. Collaborating with our partners and the medical community.

Baseline pipeline:

Step                                      # of Threads   Runtime (hours)
Read Alignment (bwa)                      16             8
Sampe (bwa)                               1              24
Import (samtools)                         1              11
Sort + Index (samtools)                   1              14.5
MarkDuplicates (picardtools) + Index      1              11.5
UnifiedGenotyper* (GATK)                  16             7.5
SomaticIndelDetector (GATK)               1              3
RealignerTargetCreator (GATK)             16             0.8
IndelRealigner* (GATK) + Index            1              17.5
BaseRecalibrator* (GATK)                  1              62
PrintReads* (GATK) + Index + Flagstat     1              25
TOTAL (hours)                                            177

Optimized pipeline (process-level and thread-level parallelism, SW + HW; redesign of MarkDuplicates + Merge Bam Align: 30-36 hours):

Step                                      # of Threads   Runtime (hours)
Read Alignment (bwa mem)                  24             7
View (samtools)                           24             2
Sort + Index (samtools)                   24             3
MarkDuplicates (picardtools) + Index      1              11
RealignerTargetCreator (GATK)             24             1
IndelRealigner* (GATK) + Index            24             6.5
BaseRecalibrator (GATK)                   24             1.3
PrintReads* (GATK) + Index + Flagstat     24             12.3
TOTAL (hours)                                            44

Algorithmic improvement: 6X improvement so far; 4X without major code change, and the rest with code changes.


Pair HMM Acceleration using AVX

• Computation kernel and bottleneck in GATK Haplotype Caller

• AVX enables 8 floating point SIMD operations in parallel

• 2 Ways to vectorize HMM computation

• Intra-Sequence – Parallelize computation within one HMM matrix operation. Run multiple (8) computations concurrently along diagonal

• Inter-Sequence – Perform multiple (8) HMM matrix operations at once
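A toy sketch of the inter-sequence idea (illustrative only; this is not the GATK kernel, and the recurrence is a stand-in for the real Pair HMM update): eight independent cell updates, one per read/haplotype pair, packed into a single AVX register and advanced in lockstep.

#include <cstdio>
#include <immintrin.h>   // AVX intrinsics

// Illustrative only: a toy recurrence standing in for one Pair HMM cell update.
// The inter-sequence scheme packs 8 independent read/haplotype pairs into one
// AVX register and advances all 8 in lockstep.
void toy_inter_sequence_step(const float* prev,  // previous-cell values, 8 lanes
                             const float* emis,  // emission terms, 8 lanes
                             float trans,        // shared transition probability
                             float* out)         // updated cell values, 8 lanes
{
    __m256 p = _mm256_loadu_ps(prev);            // load 8 previous cells
    __m256 e = _mm256_loadu_ps(emis);            // load 8 emission terms
    __m256 t = _mm256_set1_ps(trans);            // broadcast the transition prob
    __m256 r = _mm256_mul_ps(_mm256_mul_ps(p, t), e);   // prev * trans * emission
    _mm256_storeu_ps(out, r);                    // store 8 updated cells at once
}

int main() {
    float prev[8] = {1, 2, 3, 4, 5, 6, 7, 8};    // one lane per read/haplotype pair
    float emis[8] = {0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f, 0.5f};
    float out[8];
    toy_inter_sequence_step(prev, emis, 0.9f, out);
    for (float v : out) std::printf("%.3f ", v);
    std::printf("\n");
    return 0;
}

Compile with, e.g., g++ -mavx.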


Configuration                 Time (seconds)   Speedup (vs. C++ / vs. Java)
Serial C++                    1540             1x / 9x
1 core with AVX (Intra)       340              4.5x / 40.7x
1 core with AVX (Inter)       285              5.4x / 48.6x
24 cores with AVX (Inter)     14.3             108x / 970x
24 cores hybrid (Inter)       15.7             98x / 882x


UC Berkeley: Evaluating Hardware Choices

Application drivers: software radio, computer vision, machine learning, cancer genomics, graph processing, multimedia analysis, interactive cloud

What to target? What are the productivity/cost/performance tradeoffs?

Hardware options: CPU, GPU, FPGA, ASIC; memories, interconnects, I/O; platforms from racks (10s of kW) to embedded (10 kW to 1 W) to mobiles (1 W)

UC Berkeley: High-level pattern constrains the space of reasonable low-level mappings

(Insert latest OPL chart showing path)

UC Berkeley: Specializers, pattern-specific and platform-specific compilers

(Figure: App 1, App 2, App 3 mapped through Dense, Sparse, and Graph Traversal specializers onto clusters, CPUs, and GPUs)

Allow maximum efficiency and expressibility in specializers by avoiding mandatory intermediary layers

Postscript: patterns in big data



Graph abstraction: essential for data-driven problems

Flashback to 1998

Why?

First Google advantage: a Graph Algorithm & a System to Support it!

Example 1: PageRank (40M webpages, 1.4 billion links)
• Hadoop: 5.5 hrs
• Twister (in-memory MapReduce): 1 hr
• GraphLab: 8 min

Hadoop results from [Kang et al. '11]; Twister results from [Ekanayake et al. '10]
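For reference, a minimal serial power-iteration PageRank sketch in C++ (illustrative, not from the slides); graph frameworks such as GraphLab schedule essentially this per-vertex update over the edges of the graph instead of forcing it through repeated MapReduce passes.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Tiny example graph as adjacency lists: out_edges[u] = vertices u links to.
    std::vector<std::vector<int>> out_edges = {{1, 2}, {2}, {0}, {0, 2}};
    const int n = static_cast<int>(out_edges.size());
    const double d = 0.85;                       // damping factor

    std::vector<double> rank(n, 1.0 / n), next(n);
    for (int iter = 0; iter < 50; ++iter) {
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (int u = 0; u < n; ++u)              // each vertex spreads its rank
            for (int v : out_edges[u])           // along its out-edges
                next[v] += d * rank[u] / out_edges[u].size();
        rank.swap(next);
    }

    for (int v = 0; v < n; ++v)
        std::printf("rank[%d] = %.4f\n", v, rank[v]);
    return 0;
}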

Example 2: Never Ending Learner Project (CoEM)
• Hadoop: 95 cores, 7.5 hrs
• Distributed GraphLab: 32 EC2 machines, 80 secs (0.3% of Hadoop time)

Two orders of magnitude faster, two orders of magnitude cheaper

Example 3: Triangle Counting on the Twitter Graph (34.8 billion triangles)
• Hadoop [WWW'11]: 1636 machines, 423 minutes
• 64 machines: 15 seconds

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW'11

Why?? Wrong abstraction! (Broadcast O(degree²) messages per vertex)

Wrapping up (pontificating)...

Patterns:
• use ‘em
• use the right ones
• beware hasty/naive parallelization (pattern understanding will help you here)

Patterns facilitate good design; their utility continues to grow.