Tim Mattson - Intel Developer Zone · 2013-12-17 · Inference Engine: Beam Search with Graph...

Disruptive technologies: a software response

Tim Mattson, Principal Engineer, Intel

Top 500: total number of processors (1993-2000)

[Figure: total number of processors across the Top 500 list per year, 1993 to 2000; vertical axis from 0 to 140,000.]

The early years … SIMD MPP fades and clusters take over.

Source: the “June lists” from www.top500.org

Top 500: total number of processors (1993-2009)

[Figure: total number of processors across the Top 500 list per year, 1993 to 2009; vertical axis from 0 to 4,500,000. Annotation: “Disruptive technology trend!”]

Source: the “June lists” from www.top500.org

The culprit … many core chips. And it’s getting worse!

IBM Cell: 1 CPU + 6 cores
NVIDIA Tesla C1060: 240 cores
Intel Terascale research chip: 80 cores
ATI RV770: 160 cores

3rd party names are the property of their owners. Source: SC09 OpenCL tutorial

… and they are extremely disruptive

Given:
- Large core counts (N) → Concurrency ~ N
- Little’s law → Concurrency >> N … to balance network latencies
- Amdahl’s law → Horrendously tiny non-scalable fractions
- Heterogeneous cores / hierarchical parallelism → Complex mapping of algorithms onto HW

HPC masochists (like me) welcome this trend, but can we realistically expect normal programmers to adapt?

Maybe with a fancy new high level language?

No! We must respond to the HW disruption with a “new” approach.
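The Amdahl's law and Little's law constraints on this slide can be made concrete with a little arithmetic. The following is only an illustrative sketch: the function names and the sample numbers are assumptions chosen for the example, not figures from the talk.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Amdahl's law: speedup on n_cores when serial_fraction of the work is not parallel."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def max_serial_fraction(target_efficiency: float, n_cores: int) -> float:
    """Largest serial fraction that still achieves speedup = target_efficiency * n_cores.

    Derived by solving 1 / (s + (1 - s) / N) = e * N for s,
    which gives s = (1 / e - 1) / (N - 1).
    """
    return (1.0 / target_efficiency - 1.0) / (n_cores - 1)


def littles_law_concurrency(latency_s: float, throughput_per_s: float) -> float:
    """Little's law: operations that must be in flight = latency x throughput."""
    return latency_s * throughput_per_s
```

Under these assumptions, `amdahl_speedup(0.01, 240)` is about 70.8, so a 1% serial fraction wastes most of a 240-core chip, and `max_serial_fraction(0.5, 240)` shows that even 50% efficiency needs a serial fraction near 0.4%. Likewise, hiding a 1 microsecond latency at 10^9 operations per second requires roughly 1000 operations in flight.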

How is the rest of the world adapting to many core?

The gaming industry: early adopters of many core technology.

Game development … gross generalizations:
- Time and money: 1-3 years for 1-20 million dollars.
- A “blockbuster” game has revenues that rival those from a major Hollywood movie.
- Major games take teams of 50 to 100 people.
- Only a quarter of the people are programmers.

The heart of a game is the “Game Engine”

[Diagram labels: Front End, Input, control, network, Media (disk, optical media), assets, Sim, Animation, Audio, Render, each with its data/state; game state and its internally managed data structures.]

[Diagram, continued. Additional labels: Physics, Particle System, AI, Collision Detection, Time Loop, Sim.]

Sim is the integrator: it integrates from one time step to the next, calling update methods for the other modules inside a central time loop.
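The central time loop described on this slide can be sketched as follows. This is a minimal illustration, not code from any real game engine: the `Module` and `Sim` classes and their methods are hypothetical names invented for the example.

```python
class Module:
    """Hypothetical engine module (physics, AI, ...) with an update method."""

    def __init__(self, name):
        self.name = name
        self.updates = 0      # how many times the Sim has called update()
        self.sim_time = 0.0   # simulated time this module has advanced through

    def update(self, dt):
        # A real module would advance its own data/state here.
        self.updates += 1
        self.sim_time += dt


class Sim:
    """The integrator: advances every module by one time step per loop iteration."""

    def __init__(self, modules, dt):
        self.modules = modules
        self.dt = dt
        self.steps = 0

    def run(self, n_steps):
        for _ in range(n_steps):          # the central time loop
            for module in self.modules:   # call update methods of the other modules
                module.update(self.dt)
            self.steps += 1
        return self.steps * self.dt       # total simulated time
```

For example, `Sim([Module("physics"), Module("ai")], dt=0.5).run(4)` advances both modules four times and returns a simulated time of 2.0.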

Many-core coping in the gaming industry

The key is to compartmentalize tasks … a separation of concerns:
- Technology programmers: a small number (<10%) of high-priced experts optimize the Sim and performance-critical game engine modules for specific platforms. They use C and assembly code.
- Productivity programmers: the rest of the team focuses on the artwork, the story line, and putting the basic components together to get the final product “out the door”. They use C++, scripting languages, and frameworks.

This is a trend reaching across the software industry …
- Programming by importing functionality from existing modules.
- Scripting languages specialized through custom modules to increase productivity across a software team.
- Partial solutions that are specialized to solve specific problems in a domain.


You can see this trend in the languages people use

Source: O'Reilly Books

You know a trend is real when it shows up in web-comics!

“You’re flying! How?”
“Python! I just typed import antigravity”

Source: www.xkcd.com

Modular software, frameworks and parallel computing

As parallelism goes mainstream and reaches extreme levels of scalability … we can’t retrain all the world’s programmers to handle disruptive scalable hardware.

Modular software development techniques and frameworks will save the day:
- Framework: a software platform providing partial solutions to a class of problems that can be specialized to solve a specific problem.

This is not a new idea:
- Cactus
- Common Component Architecture
- SIERRA
- … and many others

… we just need to push it further and deeper.

We need a systematic way to build a useful collection of frameworks for scalable computing.

Par Lab Research Overview
Easy to write correct software that runs efficiently on manycore

[Diagram, top to bottom:]
Applications: Personal Health, Image Retrieval, Hearing, Music, Speech, Parallel Browser
Design Pattern Language (OPL)
Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks
Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
Correctness: Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay
OS: Legacy OS, OS Libraries & Services, Hypervisor
Arch.: Intel Multicore/GPGPU, RAMP Manycore

[Par Lab Research Overview diagram repeated, with two annotations:]
Productivity Layer: high level, safe, concurrency through high level abstractions
Efficiency Layer: low level, risky, hardware details fully exposed

UCB Par Lab Research Overview

Split programmers into two groups:
- Productivity programmers: domain experts who know the application but are not trained in the nuances of concurrency.
- Efficiency programmers: computer scientists fully versed in concurrency and mapping software onto the features of a particular hardware platform.

The programming environment must support this separation of concerns.

To get frameworks right … start with an understanding of software architecture

[Figure labels: 13 dwarves; PLPP: Pattern Language of Parallel Programming]

A Design Pattern Language for Engineering Parallel applications

[Diagram, layered from top to bottom:]

Applications

Structural Patterns: Pipe-and-Filter, Agent-and-Repository, Process-Control, Event-Based/Implicit-Invocation, Puppeteer, Model-View-Controller, Iterative-Refinement, Map-Reduce, Layered-Systems, Arbitrary-Static-Task-Graph

Computational Patterns: Graph-Algorithms, Dynamic-Programming, Dense-Linear-Algebra, Sparse-Linear-Algebra, Unstructured-Grids, Structured-Grids, Graphical-Models, Finite-State-Machines, Backtrack-Branch-and-Bound, N-Body-Methods, Circuits, Spectral-Methods, Monte-Carlo

Concurrent Algorithm Strategy Patterns: Task-Parallelism, Recursive-splitting, Data-Parallelism, Pipeline, Discrete-Event, Geometric-Decomposition, Speculation

Implementation Strategy Patterns
  Program structure: SPMD, Strict-Data-Par., Fork/Join, Actors, Master/Worker, Graph-Partitioning, Loop-Par., BSP, Task-Queue
  Data structure: Shared-Queue, Shared-Hash-Table, Distributed-Array, Shared-Data

Parallel Execution Patterns
  Advancing “program counters”: MIMD, SIMD, Thread-Pool, Speculation, Task-Graph, Data-Flow, Digital-Circuits
  Coordination: Message-Passing, Collective-Comm., Mutual-Exclusion, Point-To-Point-Sync., Collective-Sync., Transactional-Mem.

Source: Keutzer and Mattson, Intel Technology Journal, Vol. 13, Issue 4, 2010.

Connecting Patterns and Frameworks

- A pattern is a well-known solution to a recurring problem … it captures expert knowledge in a form that can be peer-reviewed, refined, and passed on to others.
- An architecture defines the organization and structure of a software system … it can be defined by a hierarchical composition of patterns.
- A framework is a software environment based on an architecture … a partial solution for a domain of problems that can be specialized to solve specific problems.

Patterns → Architectures → Frameworks

Pattern/framework example: Speech Recognition Application

[Diagram: Voice Input → Speech Feature Extractor → Speech Features → Inference Engine → Word Sequence (“… I think therefore I am”). The Inference Engine uses the Acoustic Model, Pronunciation Model, and Language Model.]

Large Vocabulary Continuous Speech Recognition (LVCSR)
- Computes the most likely word sequence
- Based on Hidden Markov model (HMM) inference
- Using extracted speech features and the recognition network

Source: Keutzer GSRC program review, Sept. 2009

Inference Engine: Beam Search with Graph Traversal

Software Architecture

Top level: Extract and Recognize
  Feature Extraction (extract audio features from input), then Recognition

Speech Feature Extraction: Signal Processing
  STFT → Noise Reduction → Log-Mel-Filterbank → DCT

Inference Engine
  Iterate through inputs one time step at a time
  In each iteration, perform beam search algorithm steps
  In each step, consider alternative interpretations
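The inference-engine structure on this slide (iterate over time steps, run beam search steps in each iteration, weigh alternative interpretations in each step) can be illustrated with a small Viterbi-style decoder that prunes with a beam. This is a sketch under stated assumptions, not the engine from the talk: the two-state toy HMM, its probabilities, and the names `prune` and `beam_search` are all invented for the example.

```python
import math

# Hypothetical toy HMM (not from the talk): two hidden states, three symbols.
STATES = ["Healthy", "Fever"]
INIT = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
         "Fever": {"Healthy": 0.4, "Fever": 0.6}}
EMIT = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
        "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}


def prune(hyps, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best one."""
    best = max(score for score, _ in hyps.values())
    return {s: h for s, h in hyps.items() if h[0] >= best - beam_width}


def beam_search(observations, beam_width):
    """Return (most likely state path found, its probability) under beam pruning."""
    # Active hypotheses: state -> (log score, best path ending in that state).
    active = {s: (math.log(INIT[s] * EMIT[s][observations[0]]), [s])
              for s in STATES}
    active = prune(active, beam_width)
    for obs in observations[1:]:                     # one time step at a time
        candidates = {}
        for prev, (score, path) in active.items():   # alternative interpretations
            for nxt, p in TRANS[prev].items():
                new_score = score + math.log(p * EMIT[nxt][obs])
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        active = prune(candidates, beam_width)       # drop unlikely hypotheses
    best_state = max(active, key=lambda s: active[s][0])
    score, path = active[best_state]
    return path, math.exp(score)
```

With a very wide beam nothing is pruned and the search is exact Viterbi decoding; a narrow beam trades a possible loss of the optimal path for less work per time step.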


Patterns Used

Top level: Extract and Recognize
  Pattern: Pipe & Filter (Feature Extraction, Recognition)

Speech Feature Extraction: Signal Processing
  Pattern: Pipe & Filter (STFT, Noise Reduction, Log-Mel-Filterbank, DCT)

Inference Engine
  Patterns: Iterative Pattern, Pipe & Filter, MapReduce

Framework (Top Level)

Top level Framework: Extract and Recognize
  Feature Extraction, Recognition

Customization points and options:
  Input Sound Format: .wav, .wmp
  Feature Dimensions: 39 dim, 60 dim
  Recognition Network Customizations: Linear Lexical Network, WFST Network
  Output Sequence Format: sequence, lattice

Key: Fixed Components, User Customizations

Framework (Recognize)

Top level Framework: Extract and Recognize, with the customization points Input Sound Format, Feature Dimensions, Recognition Network Customizations, and Output Sequence Format.

Customization points and options inside Recognize:
  Pruning Threshold Heuristic: Absolute, Predicative
  Observation Probability Computation: Full, Approx
  Word Transition Modeling: Statistical Transitions, Deterministic Transitions

(*) Transition Probability computation and pruning heuristic

Key: Fixed Components, User Customizations

Can Frameworks be efficient enough?

Frameworks and other high level abstractions add overhead. Can they be efficient enough for the extremely low overheads required by Amdahl's law?

Yes … The key is to:
- Auto-tune
- Collapse layers of abstractions

Autotuning DGEMM with ATLAS (n = 500)

ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.

ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes, Chin, D.

[Figure: bar chart of MFLOPS (axis 0.0 to 900.0) by architecture, comparing Vendor BLAS, ATLAS BLAS, and F77 BLAS. Architectures: AMD Athlon-600, DEC ev56-533, DEC ev6-500, HP9000/735/135, IBM PPC604-112, IBM Power2-160, IBM Power3-200, Pentium Pro-200, Pentium II-266, Pentium III-550, SGI R10000ip28-200, SGI R12000ip30-270, Sun UltraSparc2-200.]

Source: Jack Dongarra
8/21/2009, James Demmel, Motifs

Auto-tuning: Itanium 2 and The Need for Search

SpMV with BCSR algorithm, explore space of options for row and column blocking

[Figure: Mflop/s for each row and column block size. Labels: Reference; Best: 4x2.]

Source: Demmel and Hoemmen, UCB CS294 lecture, spring 2008
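To make the tuning parameters concrete, here is a small pure-Python sketch of sparse matrix-vector multiply (SpMV) in block compressed sparse row (BCSR) format with r x c blocks. It is an illustration under stated assumptions, not the tuned code behind the figure: the function names are invented, the input is a dense list of rows for simplicity, and the matrix dimensions are assumed to be multiples of r and c.

```python
def dense_to_bcsr(a, r, c):
    """Convert a dense matrix (list of rows) to BCSR, keeping only nonzero r x c blocks."""
    n_rows, n_cols = len(a), len(a[0])
    assert n_rows % r == 0 and n_cols % c == 0, "dimensions must divide evenly"
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(n_rows // r):
        for bj in range(n_cols // c):
            # Flatten the block in row-major order.
            block = [a[bi * r + i][bj * c + j] for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                col_idx.append(bj)
                blocks.append(block)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bcsr_spmv(row_ptr, col_idx, blocks, r, c, x):
    """Compute y = A @ x for A stored in BCSR with r x c blocks."""
    y = [0.0] * ((len(row_ptr) - 1) * r)
    for bi in range(len(row_ptr) - 1):
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj, block = col_idx[k], blocks[k]
            for i in range(r):
                acc = 0.0
                for j in range(c):
                    acc += block[i * c + j] * x[bj * c + j]
                y[bi * r + i] += acc
    return y


def stored_values(blocks, r, c):
    """Number of values stored, including explicit zeros used to fill blocks."""
    return len(blocks) * r * c
```

Larger blocks reduce index overhead and let a compiled kernel keep values in registers, but they also store explicit zeros (compare `stored_values` for 1x1 and 2x2 blocks on the same matrix). Which block shape wins depends on the matrix and the machine, which is why the choice is searched rather than fixed.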


SpMV with BCSR algorithm, explore row and column blocking options (this slide and the next)

Source: Demmel and Hoemmen, UCB CS294 lecture, spring 2008

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking performance profiles. Panel titles: Power3 - 17%, Power4 - 16%, Itanium 1 - 8%, Itanium 2 - 33%. Scale labels, in the order extracted: 252 Mflop/s, 122 Mflop/s, 820 Mflop/s, 459 Mflop/s, 247 Mflop/s, 107 Mflop/s, 1.2 Gflop/s, 190 Mflop/s.]

Source: Demmel and Hoemmen, UCB CS294 lecture, spring 2008

Auto-tuning

- The parameter space is huge … so we need to build tools that help us search parameter space to find optimal solutions.
- We need to automate the process of instrumenting a code for auto-tuning.
- We need to generalize this to applications beyond linear algebra.
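The simplest form of searching a parameter space is to time every candidate and keep the fastest. The sketch below assumes a hypothetical tunable kernel, `blocked_dot`, whose only parameter is a block size; in pure Python the block size has little real effect, so the kernel is merely a stand-in for generated code variants, and `autotune` is an invented name, not a tool from the talk.

```python
import time


def blocked_dot(a, b, block):
    """Dot product computed in chunks of `block` elements (the tunable parameter)."""
    total = 0.0
    for start in range(0, len(a), block):
        stop = min(start + block, len(a))
        partial = 0.0
        for i in range(start, stop):
            partial += a[i] * b[i]
        total += partial
    return total


def autotune(kernel, candidates, args, repeats=3):
    """Exhaustive search: time kernel(*args, param) for each candidate, keep the fastest.

    Returns (best_param, timings), where timings maps each candidate to its
    best observed wall-clock time in seconds.
    """
    timings = {}
    for param in candidates:
        best = float("inf")
        for _ in range(repeats):   # repeat to reduce timing noise
            t0 = time.perf_counter()
            kernel(*args, param)
            best = min(best, time.perf_counter() - t0)
        timings[param] = best
    return min(timings, key=timings.get), timings
```

Exhaustive search only works for small spaces; for the huge spaces mentioned above, the same interface could sit in front of a heuristic or model-guided search instead.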

Can Frameworks be efficient enough?

Frameworks and other high level abstractions add overhead. Can they be efficient enough for the extremely low overheads required by Amdahl's law?

Yes … The key is to:
- Auto-tune
- Collapse layers of abstractions and adapt to target platform at runtime.

So… What is Intel Ct Technology?

Ct adds parallel collection objects & methods to C++
- It is a library interface and is fully ANSI/ISO-compliant (works with ICC, VC++, GCC)

Ct abstracts away architectural details
- Vector ISA width / Core count / Memory model / Cache sizes
- Focus on what to do, not how to do it
- Sequential semantics

Ct is designed to be dynamically retargetable to SSE, AVX, LRB, …

Ct is safe, by default
- … but with expert controls to override for performance

Programmers think sequential, not parallel

Source: Anwar Ghuloum, Intel, Nov. 13, 2009.

http://www.intel.com/go/ct

Operations over parallel collections

Regular Vecs: Vec, Vec2D, Vec3D
Irregular Vecs: VecIndexed, VecNested
Vec<Tuple<…>>
& growing… Priorities: VecSparse, Vec2DSparse, VecND

Operations: repeatCol, shuffle, transpose, swapRows, shift, rotate, scatter, …


Parallel Operations on Ct Collections

Vector Processing (linear algebra, global data movement/communication)

    Vec<F32> A, B, C, D;
    A += B/C * D;

Kernel Processing (embarrassingly parallel, shaders, image processing)

    Elt<F32> kernel(Elt<F32> a, b, c, d) {
        return a + (b/c)*d;
    }
    …
    A = map(kernel)(A, B, C, D);

Native/Intrinsic Coding (instructions shown on the slide: CMP, VPREFETCH, FMADD, INC, JMP)

    NVec<F32> native(NVec<F32> …) {
        __asm__ {
            …
        };
    }
    …
    A = map(native)(A, B, C, D);

The Ct Runtime Automates This Transformation

Or Programmers Can Choose Desired Level of Abstraction
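The relationship between the vector-processing style and the kernel-processing style can be illustrated outside Ct. The sketch below is plain Python over lists, not the Ct API: `vector_form`, `kernel`, and `kernel_form` are hypothetical names, and no parallel execution happens here. It only shows that the whole-collection expression `A += B/C * D` and an elementwise kernel mapped over the collections compute the same result.

```python
def vector_form(a, b, c, d):
    """Whole-collection style: each operator is applied across entire collections."""
    quotient = [bi / ci for bi, ci in zip(b, c)]          # B / C
    product = [qi * di for qi, di in zip(quotient, d)]    # (B / C) * D
    return [ai + pi for ai, pi in zip(a, product)]        # A + (B / C) * D


def kernel(a, b, c, d):
    """Elementwise style: the computation for a single element."""
    return a + (b / c) * d


def kernel_form(a, b, c, d):
    """Apply the scalar kernel to every position, analogous to map(kernel)(A, B, C, D)."""
    return list(map(kernel, a, b, c, d))
```

In both styles the programmer writes sequential-looking code and says nothing about threads or vector width; deciding how to execute it in parallel is left to a runtime.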


The Ct VM

[Diagram: Ct’s Hardware Abstraction Layer]
Entry points: Ct API (Average C++ Developer); VM IR (Language Implementor); CVI (Hand Tuning); Other Languages!
Ct+ Opcode API
Ct JIT/Compiler
IA-based Virtual ISA
Backend JIT/Compiler, Task/Threading Runtime, Memory Manager, Debug/Perf Svcs
Targets: C, C2, Ci*; LRB; Hybrid; Other Back-ends

Conclusions

- Many core is a fundamental disruption to business as usual.
- We can’t turn everyone into HPC-masochist programmers … we have to build software frameworks that support a separation of concerns:
  - Hardcore experts build programming frameworks … emphasize efficiency.
  - Domain expert programmers assemble applications within a framework … emphasize productivity.
- Can we make this efficient? Yes:
  - Auto-tuning to search parameter space to find the optimal performance point.
  - Dynamic code generation to collapse abstraction overhead and a JIT to support aggressive program-driven optimization.

Acknowledgements

Kurt Keutzer (UCB), Ralph Johnson (UIUC) and our community of patterns writers:

Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, Katya Gonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, Youngmin Yi, Jeff Anderson-Lee, Joel Jones, Terry Ligocki, and Sam Williams.

The development of our pattern language has also received a boost from Par Lab faculty, particularly:

Krste Asanovic, Jim Demmel, and David Patterson.
