Tim Mattson - Intel Developer Zone · 2013-12-17 · Inference Engine: Beam Search with Graph...
Transcript of Tim Mattson - Intel Developer Zone · 2013-12-17 · Inference Engine: Beam Search with Graph...
Disruptive technologies: a software response
Tim MattsonPrincipal EngineerIntel [email protected]
2
Top 500: total number of processors (1993-2000)
0
20000
40000
60000
80000
100000
120000
140000
1993 1994 1995 1996 1997 1998 1999 2000
The early years … SIMD MPP fades and clusters take over.
Source: the “June lists” from www.top500.org
3
Top 500: total number of processors (1993-2009)
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
4500000
1993 1995 1997 1999 2001 2003 2005 2007 2009
Disruptive technology
trend!
Source: the “June lists” from www.top500.org
4
The culprit … many core chips. And its getting worse!
IBM CellIBM Cell
NVIDIA Tesla C1060Intel Intel TerascaleTerascaleresearch chipresearch chip
ATI RV770
3rd party names are the property of their owners.Source: SC09 OpenCL tutorial
80 cores
240 cores
1 CPU + 6 cores
160 cores
5
… and they are extremely disruptive
HPC masochists (like me) welcome this trend, but can we realistically expect normal programmers to adapt?
Maybe with a fancy new high level language?
No! We must respond to the HW disruption with a “new” approach.No! We must respond to the HW
disruption with a “new” approach.
→
→
→
→
Complex mapping of algorithms onto HWo Heterogeneous-cores/ hierarchical-parallelism
Horrendously tiny non-scalable fractionso Amdahl’s law
Concurrency >> N … to balance network latencies
o Little’s law
Concurrency ~ No Large core counts (N)
Given:
6
How is the rest of the world adapting to many core?
The gaming industry: Early adopters of many core technology.Game development … gross generalizations:
Time and money: 1-3 years for 1-20 million dollars.A “blockbuster” game has revenues that rival that from a major Hollywood movie.
Major games take teams with 50 to 100 people.Only a quarter of the people are are programmers.
7
The heart of a game is the “Game Engine”
InputSim
data/stateAudio
Data/state
Render
Data/state
assetsMedia (disk, optical media)
control
Front End
Data
networkAnimation
game state and its internally managed data structures.
8
The heart of a game is the “Game Engine”
InputSim
data/stateAudio
Data/state
Render
Data/state
assetsMedia (disk, optical media)
control
Front End
Data
networkAnimation
game state and its internally managed data structures.
Physics Particle System
AICollision Detection
Time Loop
Sim is the integrator and integrates from one time step to the next; calling update
methods for the other modules
inside a central time loop.
Sim is the integrator and integrates from one time step to the next; calling update
methods for the other modules
inside a central time loop.
Sim
9
Many-core Coping in the gaming industryThe key is to compartmentalize tasks … a separation of concerns:
A small number (<10%) of high priced experts optimize the SIM and performance-critical game engine modules for specific platforms
Technology programmers … use C and assembly code.The rest of the team focuses on the art-work, the story line and putting the basic components together to get the final product “out the door”.
Productivity programmers … use C++, scripting languages, and frameworks.
This is a trend reaching across the software industry …Programming by importing functionality from existing modulesScripting languages specialized through custom modules to increase productivity across a software team.Partial solutions that are specialized to solve specific problems in a domain.
10
You can see this trend in the languages people use
Source: O'Reilly Books
11
You know a trend is real when it shows up in web-comics!
YouYou’’re flying!re flying!How?How?
Python! I Just Typed Python! I Just Typed
import antigravityimport antigravity
Source: Source: www.xkcd.comwww.xkcd.com
12
Modular software, frameworks and parallel computing
As parallelism goes mainstream and reaches extreme levels of scalability … We can’t retrain all the world’s programmers to handle disruptive scalable hardware.Modular software development techniques and Frameworks will savethe day:
Framework: A software platform providing partial solutions to a class of problems that can be specialized to solve a specific problem.
This is not a new idea:CactusCommon Component ArchitectureSIERRA… and many others
… we just need to push it further and deeper.
We need a systematic way to build a useful collection of frameworks for Scalable computing
We need a systematic way to build a useful collection of frameworks for Scalable computing
13
Personal Health
Image Retrieval
Hearing, Music Speech Parallel
BrowserDesign Pattern Language (OPL)
Sketching
Legacy Code Schedulers Communication &
Synch. PrimitivesEfficiency Language Compilers
Par Lab Research OverviewEasy to write correct software that runs efficiently on manycore
Legacy OS
Intel Multicore/GPGPU
OS Libraries & Services
RAMP Manycore
HypervisorOS
Arch.
Productivity
Layer
Efficiency
Layer Cor
rect
ness
Applications
Composition & Coordination Language (C&CL)
Parallel Libraries
Parallel Frameworks
Static Verification
Dynamic Checking
Debuggingwith Replay
Directed Testing
Autotuners
C&CL Compiler/Interpreter
Efficiency Languages
Type Systems
14
Personal Health
Image Retrieval
Hearing, Music Speech Parallel
BrowserDesign Pattern Language (OPL)
Sketching
Legacy Code Schedulers Communication &
Synch. PrimitivesEfficiency Language Compilers
Par Lab Research OverviewEasy to write correct software that runs efficiently on manycore
Legacy OS
Intel Multicore/GPGPU
OS Libraries & Services
RAMP Manycore
HypervisorOS
Arch.
Productivity
Layer
Efficiency
Layer Cor
rect
ness
Applications
Composition & Coordination Language (C&CL)
Parallel Libraries
Parallel Frameworks
Static Verification
Dynamic Checking
Debuggingwith Replay
Directed Testing
Autotuners
C&CL Compiler/Interpreter
Efficiency Languages
Type Systems
High level, safe, concurrency through high level abstractionsHigh level, safe, concurrency
through high level abstractions
Low level, risky, hardware details fully exposed
Low level, risky, hardware details fully exposed
15
Personal Health
Image Retrieval
Hearing, Music Speech Parallel
BrowserDesign Pattern Language (OPL)
Sketching
Legacy Code Schedulers Communication &
Synch. PrimitivesEfficiency Language Compilers
UCB Par Lab Research OverviewEasy to write correct software that runs efficiently on manycore
Legacy OS
Intel Multicore/GPGPU
OS Libraries & Services
RAMP Manycore
HypervisorOS
Arch.
Productivity
Layer
Efficiency
Layer Cor
rect
ness
Applications
Composition & Coordination Language (C&CL)
Parallel Libraries
Parallel Frameworks
Static Verification
Dynamic Checking
Debuggingwith Replay
Directed Testing
Autotuners
C&CL Compiler/Interpreter
Efficiency Languages
Type Systems
Split programmers into two groups
Productivity programmers:Domain experts who know the application but are not trained in the nuances of concurrency
Efficiency programmers: Computer scientists fully versed in concurrency and mapping software onto the features of a particular hardware platform.
Programming environment must support this separation of concerns.
Split programmers into two groups
Productivity programmers:Domain experts who know the application but are not trained in the nuances of concurrency
Efficiency programmers: Computer scientists fully versed in concurrency and mapping software onto the features of a particular hardware platform.
Programming environment must support this separation of concerns.
Logo for UCB Logo for UCB ParLabParLab
1613 dwarves
To get frameworks right … start with an understanding of software architecture
PLPP: Pattern language of
Parallel Programming
17
Graph-Algorithms
Dynamic-Programming
Dense-Linear-Algebra
Sparse-Linear-Algebra
Unstructured-Grids
Structured-Grids
Model-View-Controller
Iterative-Refinement
Map-Reduce
Layered-Systems
Arbitrary-Static-Task-Graph
Pipe-and-Filter
Agent-and-Repository
Process-Control
Event-Based/Implicit-Invocation
Puppeteer
Graphical-Models
Finite-State-Machines
Backtrack-Branch-and-Bound
N-Body-Methods
Circuits
Spectral-Methods
Monte-Carlo
Applications
Task-ParallelismRecursive-splitting
Data-ParallelismPipeline
Discrete-Event Geometric-Decomposition
SPMDStrict-Data-Par.
Fork/JoinActorsMaster/WorkerGraph-Partitioning
Distributed-ArrayShared-Data
Shared-QueueShared-Hash-Table
MIMDSIMD
Structural Patterns Computational Patterns
Parallel Execution Patterns
Concurrent Algorithm Strategy Patterns
Implementation Strategy Patterns
Message-PassingCollective-Comm.Mutual-Exclusion
Thread-PoolSpeculation
Data structureProgram structure
CoordinationAdvancing “program counters”
Point-To-Point-Sync.Collective-Sync.Transactional-Mem.
Loop-Par.BSPTask-Queue
Task-GraphData-FlowDigital-Circuits
Speculation
A Design Pattern Language for Engineering Parallel applications
Source: Keutzer and Mattson, Intel Technology Journal, Vol. 13, Issue 4, 2010.
18
Connecting Patterns and Frameworks
A pattern is a well known solution to a recurring problem … captures expert knowledge in a form that can be peer-reviewed, refined, and passed-on to others. An architecture defines the organization and structure of a software system … it can be defined by a hierarchical composition of patterns.A framework is a software environment based on an architecture … a partial solution for a domain of problems that can be specialized to solve specific problems.
18
Patterns Patterns →→ Architectures Architectures →→ Frameworks Frameworks
19
Speech Feature Extractor
Pattern/framework example:Speech Recognition Application
Inference Engine
Voice Input
SpeechFeatures
WordSequence
…
I think therefore I am
Acoustic Model
Pronunciation Model
Language Model
Large Vocabulary Continuous Speech Recognition (LVCSR)Computes the most likely word sequence Based on the Hidden Markov model (HMM) inferenceUsing extracted speech features and the recognition network
Source: Keutzer GSRC program review, Sept. 2009Source: Keutzer GSRC program review, Sept. 2009
20
Inference Engine: Beam Search with Graph Traversal
Speech Feature Extractor
Inference Engine
Voice Input
SpeechFeatures
WordSequence
… I think therefore I am
Acoustic Model
Pronunciation Model
Language Model
Software Architecture
Speech Feature Extraction: Signal Processing Iterative through inputs
one time step at a time
In each iteration, perform beam search
algorithm steps
In each step, consider alternative interpretations
Top level: Extract and Recognize
FeatureExtraction Recognition
STFT
Noise Reduction
Log‐Mel‐Filterbank
DCT
Extractthen
Recognize
Extractaudio
features from input
Source: Keutzer GSRC program review, Sept. 2009Source: Keutzer GSRC program review, Sept. 2009
21
Inference Engine: Beam Search with Graph Traversal
Patterns Used
Iterative Pattern
Pipe & Filter
MapReduce
Top level: Extract and Recognize
Pipe & Filter
Feature
Extraction
Recognition
Speech Feature Extraction: Signal Processing
Pipe & Filter
STFT
Noise Reduction
Log‐Mel‐Filterbank
DCT
Source: Keutzer GSRC program review, Sept. 2009Source: Keutzer GSRC program review, Sept. 2009
Speech Feature Extractor
Inference Engine
Voice Input
SpeechFeatures
WordSequence
… I think therefore I am
Acoustic Model
Pronunciation Model
Language Model
22
Framework (Top Level)
Top level Framework: Extract and Recognize
FeatureExtraction Recognition
FeatureDimensions
39 dim
60 dim
Output SequenceFormat
sequence
lattice
InputSoundFormat
.wav
.wmp
Recognition Network Customizations
Linear Lexical Network
WFST Network
Key:
Fixed Components
User Customizations
Source: Keutzer GSRC program review, Sept. 2009Source: Keutzer GSRC program review, Sept. 2009
Speech Feature Extractor
Inference Engine
Voice Input
SpeechFeatures
WordSequence
… I think therefore I am
Acoustic Model
Pronunciation Model
Language Model
23
Top level Framework: Extract and Recognize
FeatureExtraction RecognitionFeature
Dimensions
Output SequenceFormat
InputSoundFormat
Recognition Network
Customizations
Framework (Recognize)
Inference Engine: Beam Search with Graph Traversal
Pruning Threshold Heuristic
Absolute
Predicative
Observation ProbabilityComputation
Full
Approx
(*)
* Transition Probability computation and pruning heuristic
Word Transition Modeling:
Statistical Transitions
Deterministic Transitions
Key:
Fixed Components
User Customizations
Source: Keutzer GSRC program review, Sept. 2009Source: Keutzer GSRC program review, Sept. 2009
24
Can Frameworks be efficient enough?
Frameworks and other high level abstracts add overhead.Can they be efficient enough for the extremely low overheads required by Amdahl's law?Yes … The key is to:
Auto-turningcollapse layers of abstractions
258/21/2009 James Demmel Motifs: 25
Autotuning DGEMM with ATLAS (n = 500)
ATLAS is faster than all other portable BLAS implementations and it is comparable with machine-specific libraries provided by the vendor.ATLAS written by C. Whaley, inspired by PHiPAC, by Asanovic, Bilmes,Chin,D.
0.0
100.0
200.0
300.0
400.0
500.0
600.0
700.0
800.0
900.0
AMD Ath
lon-600
DEC ev56
-533
DEC ev6-5
00HP90
00/73
5/135
IBM PPC60
4-112
IBM Power2
-160
IBM Power3
-200
Pentiu
m Pro-20
0Pen
tium II-
266
Pentiu
m III-55
0
SGI R10
000ip
28-20
0
SGI R12
000ip
30-27
0
Sun U
ltraS
parc2-2
00
Architectures
MFL
OPS
Vendor BLASATLAS BLASF77 BLAS
Source: Jack Dongarra
26
Auto-tuning: Itanium 2 and The Need for Search
Reference
Best: 4x2
Mflop/s
Mflop/s
SpMV with BCSR algorithm, explore space of options for row and column blocking
Source: Demmel and Hoemmen, UCB CS294 lecture, spring 2008
27
SpMV with BCSR algorithm, explore row and column blocking options (this slide and the next)
Source: Demmel and Hoemmen, UCB CS294 lecture, spring 2008
28
Register Profiles: IBM and Intel IA-64
Power3 - 17% Power4 - 16%
Itanium 2 - 33%Itanium 1 - 8%
252 Mflop/s
122 Mflop/s
820 Mflop/s
459 Mflop/s
247 Mflop/s
107 Mflop/s
1.2 Gflop/s
190 Mflop/sSource: Demmel and Hoemmen, UCB CS294 lecture, spring 2008
29
Auto-tuningThe parameter space is huge … so we need to build tools that help us search parameter space to find optimal solutionsWe need to automate the process of instrumenting a code for auto-tuningWe need to generalize this to applications beyond linear algebra.
30
Can Frameworks be efficient enough?
Frameworks and other high level abstracts add overhead.Can they be efficient enough for the extremely low overheads required by Amdahl's law?Yes … The key is to:
Auto-turningcollapse layers of abstractions and adapt to target platform at runtime.
31
So…What is Intel Ct Technology?
Ct adds parallel collection objects & methods to C++Library interface and is fully ANSI/ISO-compliant (works with ICC, VC++, GCC)
Ct abstracts away architectural detailsVector ISA width / Core count / Memory model / Cache sizesFocus on what to do, not how to do itSequential semanticsCt is designed to be dynamically retargetable to SSE, AVX, LRB, …
Ct is safe, by default…but with expert controls to override for performance
Programmers think sequential, not parallel
11/23/2009 31Source: Anwar Ghuloum, Intel, Nov. 13, 2009.
http://www.intel.com/go/ct
32
Operations over parallel collections
Regular Vecs
Vec3D
Irregular Vecs
VecIndexed
VecNested
Vec
Vec2D
Vec<Tuple<…>>
& growing…Priorities: VecSparse, Vec2DSparse, VecND
repeatCol, shuffle, transpose, swapRows, shift, rotate, scatter, …
11/23/2009 32Source: Anwar Ghuloum, Intel, Nov. 13, 2009.
33
Parallel Operations on Ct Collections
Vector Processing
Vec<F32> A, B, C, D;
A += B/C * D;
Native/Intrinsic Coding
CMP
VPREFETCH
FMADD
INC
JMP
NVec<F32>native(NVec<F32> …) {
__asm__ {
…
};
}
…
Vec<F32> A, B, C, D;
A = map(native)(A, B, C, D);
The Ct Runtime Automates This TransformationThe Ct Runtime Automates This Transformation
Or Programmers Can Choose Desired Level of AbstractionOr Programmers Can Choose Desired Level of Abstraction
Linear algebra, global data movement/communication
Kernel Processing
Elt<F32> kernel(Elt<F32> a, b, c, d) {
return a + (b/c)*d;
}
…
Vec<F32> A, B, C, D;
A = map(kernel)(A, B, C, D);
Embarrassingly parallel, shaders, image processing
11/23/2009 33Source: Anwar Ghuloum, Intel, Nov. 13, 2009.
34
The Ct VM
C, C2, Ci*C, C2, Ci* LRBLRB HybridHybrid
Ct JIT/CompilerCt JIT/Compiler
IA-based Virtual ISAIA-based Virtual ISA
Task/Threading RuntimeTask/Threading RuntimeBackend JIT/CompilerBackend JIT/Compiler
Memory ManagerMemory Manager
Debug/PerfSvcs
Debug/PerfSvcs
Ct’sHardware
AbstractionLayer
Other Languages!
Other Back-ends
Ct API (Average C++
Developer)
VM IR (Language
Implementor)
CVI (Hand
Tuning)
Ct+ Opcode APICt+ Opcode API
11/23/2009 34Source: Anwar Ghuloum, Intel, Nov. 13, 2009.
35
ConclusionsMany core is a fundamental disruption to business as usual.We can’t turn everyone into HPC-masochist programmers …we have to build software frameworks that support a separation of concerns:
Hardcore experts build programming frameworks … emphasize efficiencyDomain expert programmers assembly applications within a framework … emphasize productivity.
Can we make this efficient?Yes
Auto-tuning to search parameter space to find optimal performance point.Dynamic code generation to collapse abstraction overhead and a JIT to support aggressive program-driven optimization.
11/23/2009 35
36
Acknowledgements Kurt Keutzer (UCB), Ralph Johnson (UIUC) and our community of patterns writers:
Hugo Andrade, Chris Batten, Eric Battenberg, Hovig Bayandorian, Dai Bui, Bryan Catanzaro, Jike Chong, Enylton Coelho, KatyaGonina, Yunsup Lee, Mark Murphy, Heidi Pan, Kaushik Ravindran, Sayak Ray, Erich Strohmaier, Bor-yiing Su, Narayanan Sundaram, Guogiang Wang, Youngmin Yi., Jeff Anderson-Lee, Joel Jones, Terry Ligocki, and Sam Williams.
The development of our pattern language has also received a boost from Par Lab faculty — particularly:
Krste Asanovic, Jim Demmel, and David Patterson.