Scaling Big Data Analytics with
Moore’s Law
Kunle Olukotun
EE and CS
Data Trends
• Increasing volume, variety and complexity of data
• Challenge: enable data-driven discovery
  • Deliver the capability to mine, search and analyze this data in near real time
Microprocessor Trends
Moore’s Law
Power Wall
End of sequential performance
Heterogeneous Computing Platforms
Graphics Processing Unit (GPU)
> 8 TFLOPS, SIMD
Accelerators
Programmable Logic
> 9 TFLOPS
Cluster
1000s of nodes
Multicore, Multi-socket NUMA
10s of cores, SIMD
> 1 TB DRAM
Parallelism and specialization
MPI Map Reduce
Verilog VHDL
CUDA OpenCL
Threads OpenMP
Specialized Parallel Programming
Cluster
Multicore CPU Multi-socket
Graphics Processing Unit (GPU)
Programmable Logic
Custom computing
MPI: Message Passing Interface
Specialized Programmers ⇒ Scarce
Data Analytics Programming Challenge
Multicore
GPU
Pthreads OpenMP
CUDA OpenCL
Predictive Analytics
Data ETL
Data Query
Graph Analysis
Cluster MPI Map Reduce
FPGA Verilog VHDL
Data Analytics Application
Data Analytics Programming Challenge
Multicore
GPU
Pthreads OpenMP
CUDA OpenCL
Predictive Analytics
Data ETL
Data Query
Graph Analysis
High-Performance Domain Specific
Languages
Cluster MPI Map Reduce
FPGA Verilog VHDL
Data Analytics Application
Domain Specific Languages
• Domain Specific Languages (DSLs)
  • Programming language with restricted expressiveness for a particular domain
  • High-level, usually declarative, and deterministic
Structured Query Language
High Performance DSLs for Data Analytics
Graph Analysis
Prediction Recommendation
Data Transformation
Query Proc. OptiQL
Graph Alg. OptiGraph
Machine Learning OptiML
Data Extraction
OptiWrangle
Applications
HP DSLs
Heterogeneous Hardware
DSL
Compiler
DSL
Compiler
DSL
Compiler
DSL
Compiler
Multicore GPU FPGA Cluster
Scaling the HP DSL Approach
• Many potential DSLs
• How do we quickly create high-performance implementations for DSLs we care about?
• Enable expert programmers to easily create new DSLs
  • Make optimization knowledge reusable
  • Simplify the compiler generation process
• A few DSL developers enable many more DSL users
  • Leave expert programming to experts!
Delite DSL
Framework
Delite: DSL Infrastructure
Graph Analysis
Prediction Recommendation
Data Transformation
Query Proc. OptiQL
Graph Alg. OptiGraph
Machine Learning OptiML
Data Extraction
OptiWrangle
Applications
HP DSLs
Heterogeneous Hardware
DSL
Compiler
DSL
Compiler
DSL
Compiler
DSL
Compiler
Multicore GPU FPGA Cluster
Delite: A Framework for High Performance DSLs
• A compiler tool-chain for high performance embedded DSLs
  • Libraries on steroids (generative programming)
• Built on top of Lightweight Modular Staging (LMS) to build an intermediate representation (IR) from Scala application code
• Provides extensible reusable components
  • Parallel patterns for structured computation
  • Delite structs for structured data
  • Transformers for domain-specific optimizations
• Delite optimizes DSL code and generates target code
  • Scala, C++, CUDA, OpenCL, and clusters
Parallel Patterns
Most data analytic computations can be expressed as functional parallel patterns on collections (e.g. sets, arrays, tables, n-d matrices)
Nested parallel patterns
[Figure: illustrations of map, zip, reduce, and groupBy (grouping into key1, key2, key3)]
Map, Zip, Filter, FlatMap, Reduce, GroupBy, Join, Sort, …
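To make the pattern vocabulary concrete, here is a minimal sequential sketch in plain Python (not Delite or OptiML code). The helper names `p_map`, `p_zip`, `p_reduce`, `p_group_by`, and `row_sums` are hypothetical and chosen only for illustration. The point is that each pattern states what is computed per element and leaves the schedule open, which is what would allow a framework to parallelize it.

```python
from collections import defaultdict
from functools import reduce


def p_map(f, xs):
    """map: apply f to every element independently."""
    return [f(x) for x in xs]


def p_zip(f, xs, ys):
    """zip: combine two collections element by element."""
    return [f(x, y) for x, y in zip(xs, ys)]


def p_reduce(f, zero, xs):
    """reduce: fold a collection with an associative operator."""
    return reduce(f, xs, zero)


def p_group_by(key, xs):
    """groupBy: bucket elements by a key function."""
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)


def row_sums(matrix):
    """Nested pattern: an outer map whose body is an inner reduce."""
    return p_map(lambda row: p_reduce(lambda a, b: a + b, 0, row), matrix)
```

For example, `row_sums([[1, 2], [3, 4]])` evaluates an inner reduce once per row; the outer map and the inner reduce are both candidates for parallel execution.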
Delite Overview
Key elements
• DSLs embedded in Scala
• IR created using type-directed staging
• Domain specific optimization
• General parallelism and locality optimizations
• Optimized mapping to HW targets
Opti{Wrangler,QL,ML,Graph}
Optimized Code Generators
Scala C++ CUDA OpenCL MPI HDL
Generic analyses & transformations
parallel data
Parallel patterns
K. J. Brown et al., “A heterogeneous parallel framework for domain-specific languages,” PACT, 2011.
Domain specific analyses & transformations
domain data
domain ops
DSL 1
•••
domain data
domain ops
DSL n
OptiML: Overview
• Provides a familiar (MATLAB-like) language and API for writing ML applications
  • Ex. val c = a * b (a, b are Matrix[Double])
• Implicitly parallel data structures
  • Base types
    • Vector[T], Matrix[T], Graph[V,E], Stream[T]
  • Subtypes
    • TrainingSet, IndexVector, Image, …
• Implicitly parallel control structures
  • sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
  • Allow anonymous functions with restricted semantics to be passed as arguments of the control structures
K-means Clustering in OptiML
untilconverged(kMeans, tol) { kMeans =>
  // assign each sample to the closest mean
  val clusters = samples.groupRowsBy { sample =>
    // calculate distances to current means
    kMeans.mapRows(mean => dist(sample, mean)).minIndex
  }
  // move each cluster centroid to the mean of the points assigned to it
  val newKmeans = clusters.map(c => c.sum / c.length)
  newKmeans
}
• No explicit map-reduce, no key-value pairs (e.g. MR)
• No distributed data structures (e.g. Spark RDDs)
• No annotations for hardware design
• Efficient multicore and GPU execution
• Efficient cluster and NUMA implementation
• Efficient FPGA hardware
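For readers without an OptiML installation, the same three steps can be sketched in plain Python. This is an illustrative sequential analogue, not the OptiML implementation. The names `kmeans_step` and `until_converged`, the choice to keep the old centroid for an empty cluster, and the maximum-coordinate-change convergence test are all assumptions made for this sketch.

```python
def dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_step(samples, means):
    """One iteration: group samples by nearest mean, then recompute centroids."""
    clusters = {}
    for sample in samples:
        # calculate distances to current means
        distances = [dist(sample, mean) for mean in means]
        # assign each sample to the closest mean (a groupBy on the min index)
        idx = min(range(len(means)), key=distances.__getitem__)
        clusters.setdefault(idx, []).append(sample)
    new_means = []
    for i, old in enumerate(means):
        points = clusters.get(i)
        if not points:
            # assumption: an empty cluster keeps its previous centroid
            new_means.append(list(old))
        else:
            # move the centroid to the mean of the points assigned to it
            new_means.append([sum(col) / len(points) for col in zip(*points)])
    return new_means


def until_converged(means, tol, step, max_iter=100):
    """Repeat `step` until no coordinate moves by `tol` or more (assumed criterion)."""
    for _ in range(max_iter):
        new_means = step(means)
        delta = max(abs(a - b) for m, n in zip(means, new_means) for a, b in zip(m, n))
        means = new_means
        if delta < tol:
            break
    return means
```

A call such as `until_converged(init, 1e-6, lambda m: kmeans_step(samples, m))` mirrors the shape of the OptiML program: the loop body is a groupBy followed by a map, with no explicit threads or key-value pairs.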
Mapping Nested Parallel Patterns to GPUs
m=Matrix.rand(nR,nC)
v=m.sumCols
m=Matrix.rand(nR,nC)
v=m.sumRows
map(i)
reduce(j)
sumCols sumRows
limited parallelism
non-coalesced memory
[Figure: normalized execution time (axis 0 to 60) for two groups of matrix sizes [64K,1K], [8K,8K], [1K,64K]; mappings compared: 1D, thread-block/thread, warp-based, MultiDim]
HyoukJoong Lee et al., “Locality-Aware Mapping of Nested Parallel Patterns on GPUs,” MICRO'14
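The sumRows and sumCols kernels above are both a map(i) over a reduce(j); they differ only in which index walks the fast dimension of memory. The sketch below is plain Python rather than GPU code, and it assumes a row-major flat array and one thread per outer iteration (roughly the "1D" mapping). The `address` helper is hypothetical. It computes which element each outer iteration touches at a given inner step, so the stride between neighbouring threads can be inspected.

```python
def sum_rows(m, n_rows, n_cols):
    """map(i) over rows, reduce(j) over the columns of row i."""
    return [sum(m[i * n_cols + j] for j in range(n_cols)) for i in range(n_rows)]


def sum_cols(m, n_rows, n_cols):
    """map(j) over columns, reduce(i) over the rows of column j."""
    return [sum(m[i * n_cols + j] for i in range(n_rows)) for j in range(n_cols)]


def address(pattern, outer, inner, n_cols):
    """Flat row-major index touched by outer iteration `outer` at inner step `inner`."""
    if pattern == "sumRows":
        return outer * n_cols + inner  # outer index selects the row
    return inner * n_cols + outer      # sumCols: outer index selects the column


def neighbour_stride(pattern, n_cols):
    """Distance in elements between adjacent outer iterations at the same inner step."""
    return address(pattern, 1, 0, n_cols) - address(pattern, 0, 0, n_cols)
```

Under these assumptions, adjacent outer iterations of sumRows are `n_cols` elements apart while those of sumCols are one element apart, and the amount of outer parallelism is `n_rows` versus `n_cols`. Which mapping performs well therefore depends on both the access pattern and the matrix shape.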
MSM Builder Using OptiML with Vijay Pande
Markov State Models (MSMs)
MSMs are a powerful means of modeling the structure and dynamics of molecular systems, like proteins
x86 ASM
high prod, low perf
low prod, high perf
high prod, high perf
Distributed Heterogeneous Execution
• Separate Memory Regions
  • NUMA
  • Clusters
  • FPGAs
• Partitioning Analysis
  • Multidimensional arrays
  • Decide which data structures / parallel ops to partition across abstract memory regions
• Nested Pattern Transformations
  • Optimize patterns for distributed and heterogeneous architectures
Delite parallel data
Delite parallel patterns
DSL Application
local data, partitioned data
Heterogeneous Code Generation & Distributed Runtime
scheduled patterns
scheduled, transformed patterns
local data, partitioned data
Nested Pattern Transformations
Partitioning & Stencil Analysis
Heterogeneous Cluster Performance
4 node local cluster: 3.4 GB dataset
Multi-socket NUMA Performance
Multi-Terabyte Main Memory
1–48 threads, 4 sockets
[Figure: speedup vs. number of threads (1, 12, 24, 48) for Gene, GDA, LogReg, TPC-H Q1, k-means, Triangle, and PageRank; speedup annotations on the plots: 30x, 45x, 10x, 10x, 11x, 5x, 10x]
SIMD Parallelism (Intel AVX2)
• Single Instruction Multiple Data (SIMD)
• SIMD parallelism is keeping up with Moore’s law: doubling per generation
• Precision vs. parallelism
SIMD Precision SIMD Parallelism
Statistical vs. Hardware Efficiency with Chris Ré
Same statistical efficiency
Improved hardware efficiency
• 8-bit gives about 3x speed up!
• Lower precision is possible
• Good match to specialized/reconfigurable HW?
BUCKWILD! same statistical efficiency with greater hardware efficiency
[Figure: Logistic Regression using SGD; loss (log scale, 0.01 to 10) vs. # iterations (0 to 20) for 32-bit and 8-bit precision]
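As a rough illustration of what 8-bit SGD for logistic regression can look like, the sketch below stores the weights as signed 8-bit fixed-point integers and rounds each update stochastically. It is not the BUCKWILD! implementation. The fixed-point `scale`, the use of stochastic rounding, and the function names `quantize` and `sgd_logreg_lowprec` are assumptions made for this example.

```python
import math
import random


def quantize(x, scale, rng, bits=8):
    """Round x / scale to a signed `bits`-bit integer with stochastic rounding."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    scaled = x / scale
    floor = math.floor(scaled)
    # round up with probability equal to the fractional part (unbiased before clamping)
    q = floor + (1 if rng.random() < scaled - floor else 0)
    return max(lo, min(hi, q))


def sgd_logreg_lowprec(data, dims, scale=1 / 32, lr=0.1, epochs=20, seed=0):
    """SGD for logistic regression with weights held as 8-bit integers (sketch)."""
    rng = random.Random(seed)
    w_q = [0] * dims  # quantized weights; real value is w_q[i] * scale
    for _ in range(epochs):
        for x, y in data:  # y is 0 or 1
            z = sum(q * scale * xi for q, xi in zip(w_q, x))
            p = 1.0 / (1.0 + math.exp(-z))
            # gradient step in full precision, then re-quantize each weight
            w_q = [quantize(q * scale - lr * (p - y) * xi, scale, rng)
                   for q, xi in zip(w_q, x)]
    return [q * scale for q in w_q]
```

The hardware-efficiency argument is that four times as many 8-bit values as 32-bit values fit in one SIMD register; this sketch only shows the arithmetic and makes no performance claim.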
FPGAs in the Datacenter?
• FPGA-based accelerators
  • Recent commercial interest from Baidu, Microsoft, and Intel
  • Key advantage: performance, performance/Watt
  • Key disadvantage: lousy programming model
• Verilog and VHDL are a poor match for software developers
  • But, high quality designs
• High level synthesis (HLS) tools with C interface
  • Medium/low quality designs
  • Need architectural knowledge to build good accelerators
  • Not enough information in compiler IR to perform access pattern and data layout optimizations
  • Cannot synthesize complex data paths with nested parallelism
Our Approach to FPGA Design
Parallel Patterns
Pattern Transformations: Fusion, Pattern Tiling, Code Motion
Tiled Parallel Pattern IR
Hardware Generation: Memory Allocation, Template Selection, Metapipeline Analysis
MaxJ HGL
Bitstream Generation
FPGA Configuration
High-level Parallel Patterns
Data Locality improved with parallel pattern tiling transformations
Nested Parallelism exploited with hierarchical pipelines and double buffers
Generate MaxJ to generate VHDL
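The combination of tiling and double buffering can be sketched in software. The example below is plain Python, not generated MaxJ or VHDL, and `tiles` and `tiled_sum` are hypothetical names. It splits a reduction into fixed-size tiles and simulates a two-stage pipeline in which tile t+1 is loaded into one buffer while tile t is reduced from the other. The returned schedule records which tile each stage handled at each step.

```python
def tiles(n, tile):
    """Half-open (start, end) ranges that cover 0..n in chunks of `tile`."""
    return [(s, min(s + tile, n)) for s in range(0, n, tile)]


def tiled_sum(data, tile):
    """Sum `data` tile by tile through a simulated load/compute pipeline."""
    buffers = [None, None]  # double buffer: one side loads while the other computes
    ranges = tiles(len(data), tile)
    schedule = []
    total = 0
    for step in range(len(ranges) + 1):
        loaded = computed = None
        if step < len(ranges):
            # stage 1: load tile `step` into buffer step % 2
            start, end = ranges[step]
            buffers[step % 2] = data[start:end]
            loaded = step
        if step >= 1:
            # stage 2: reduce tile step - 1 from the other buffer
            total += sum(buffers[(step - 1) % 2])
            computed = step - 1
        schedule.append((loaded, computed))
    return total, schedule
```

In hardware the two stages of one step would run concurrently; here they run one after the other within a step, and the schedule only shows that they never touch the same buffer.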
K-Means Hardware
[Figure: K-means hardware datapath. Stages: samples tile load; kmeans tile load; vector distance (norm) units; scalar distance (tree +); (MinDist, Idx); accumulate (+) and increment (Inc); divide (/); new kmeans tile store. Buffers: kmeansBlock buffer; samplesBlock double buffers; minIdx double buffer; sum buffer; count buffer; new kmeans double buffer]
Similar to (and more general than) hand-written designs [1]
[1] Hussain et al., “FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data”, AHS 2011
1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results
Impact of Tiling and Metapipelining
Delite DSL
Framework
Scaling Big Data Analytics with Moore’s Law
Graph Analysis
Prediction Recommendation
Data Transformation
Query Proc. OptiQL
Graph Alg. OptiGraph
Machine Learning OptiML
Data Extraction
OptiWrangle
Applications
HP DSLs
Heterogeneous Hardware Multicore GPU FPGA Cluster
Parallel data Parallel patterns
Analyses & Transformations
✓ ✓ ✓ ✓
Collaborators & Funding
• Faculty
  • Pat Hanrahan
  • Martin Odersky (EPFL)
  • Chris Ré
  • Tiark Rompf (Purdue/EPFL)
• PhD Students
  • Chris Aberger
  • Kevin Brown
  • Hassan Chafi
  • Zach DeVito
  • Chris De Sa
  • Nithin George (EPFL)
  • David Koeplinger
  • HyoukJoong Lee
  • Victoria Popic
  • Raghu Prabhakar
  • Aleksander Prokopec (EPFL)
  • Vera Salvisberg (EPFL)
  • Arvind Sujeeth
• Funding
  • PPL: Oracle Labs, Nvidia, Intel, AMD, Huawei, SAP
  • NSF
  • DARPA
Scaling Big Data Analytics with Moore’s Law
n Power
n Performance
n Programmability
n Portability
Modern HW (Multicore, SIMD, GPU, FPGA, NUMA)
High Performance DSLs (OptiML, OptiQL, …)
Delite
• Increasing volume, variety and complexity of data
• Heterogeneity of modern hardware
• Serious challenges for data analytics