Scaling Big Data Analytics with Moore’s Law

Kunle Olukotun

EE and CS

Data Trends

•  Increasing volume, variety and complexity of data
•  Challenge: enable data-driven discovery
   •  Deliver the capability to mine, search and analyze this data in near real time

Microprocessor Trends

Moore’s Law

Power Wall

End of sequential performance

Heterogeneous Computing Platforms

Graphics Processing Unit (GPU)

> 8 TFLOPS, SIMD

Accelerators

Programmable Logic

> 9 TFLOPS

Cluster: 1000s of nodes

Multicore, multi-socket NUMA: 10s of cores, SIMD; > 1 TB DRAM

Parallelism and specialization

Specialized Parallel Programming

•  Cluster: MPI (Message Passing Interface), MapReduce
•  Multicore CPU, multi-socket: Threads, OpenMP
•  Graphics Processing Unit (GPU): CUDA, OpenCL
•  Programmable Logic (custom computing): Verilog, VHDL

Specialized Programmers ⇒ Scarce

Data Analytics Programming Challenge

Data Analytics Application
•  Predictive Analytics
•  Data ETL
•  Data Query
•  Graph Analysis

High-Performance Domain Specific Languages

Target platforms and their programming models
•  Multicore: Pthreads, OpenMP
•  GPU: CUDA, OpenCL
•  Cluster: MPI, MapReduce
•  FPGA: Verilog, VHDL

Domain Specific Languages

•  Domain Specific Languages (DSLs)
   •  Programming language with restricted expressiveness for a particular domain
   •  High-level, usually declarative, and deterministic

Structured Query Language

High Performance DSLs for Data Analytics

Applications: Data Transformation, Graph Analysis, Prediction, Recommendation

HP DSLs (each with its own DSL compiler)
•  Data Extraction: OptiWrangler
•  Query Proc.: OptiQL
•  Graph Alg.: OptiGraph
•  Machine Learning: OptiML

Heterogeneous Hardware: Multicore, GPU, FPGA, Cluster

Scaling the HP DSL Approach

•  Many potential DSLs
•  How do we quickly create high-performance implementations for DSLs we care about?
•  Enable expert programmers to easily create new DSLs
   •  Make optimization knowledge reusable
   •  Simplify the compiler generation process
•  A few DSL developers enable many more DSL users
   •  Leave expert programming to experts!

Delite: DSL Infrastructure

Applications: Data Transformation, Graph Analysis

HP DSLs (each with its own DSL compiler)
•  Data Extraction: OptiWrangler
•  Query Proc.: OptiQL

Delite DSL Framework

Heterogeneous Hardware: Multicore, GPU, FPGA, Cluster

Delite: A Framework for High Performance DSLs

•  A compiler tool-chain for high performance embedded DSLs
   •  Libraries on steroids (generative programming)
•  Built on top of Lightweight Modular Staging (LMS) to build an intermediate representation (IR) from Scala application code
•  Provides extensible reusable components
   •  Parallel patterns for structured computation
   •  Delite structs for structured data
   •  Transformers for domain-specific optimizations
•  Delite optimizes DSL code and generates target code
   •  Scala, C++, CUDA, OpenCL, and clusters

Parallel Patterns

•  Most data analytic computations can be expressed as functional parallel patterns on collections (e.g. sets, arrays, tables, n-d matrices)
•  Nested parallel patterns
•  Map, Zip, Filter, FlatMap, Reduce, GroupBy, Join, Sort, …

[Figure: map, zip, reduce and groupBy (elements grouped under key1, key2, key3)]
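The semantics of a few of these patterns can be illustrated with ordinary sequential Python. This is only a sketch, not Delite's implementation: the helper names (`p_map`, `p_zip`, `p_reduce`, `p_group_by`, `dot`, `row_sums`) are hypothetical, and a framework such as Delite would execute each pattern in parallel instead of in a loop.

```python
from collections import defaultdict
from functools import reduce


def p_map(f, xs):
    """map: apply f to every element independently."""
    return [f(x) for x in xs]


def p_zip(f, xs, ys):
    """zip: combine two collections element by element."""
    return [f(x, y) for x, y in zip(xs, ys)]


def p_reduce(f, xs, zero):
    """reduce: fold a collection with an associative operator."""
    return reduce(f, xs, zero)


def p_group_by(key, xs):
    """groupBy: bucket elements by a key function."""
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)


def dot(xs, ys):
    """Dot product written as a zip followed by a reduce."""
    return p_reduce(lambda a, b: a + b, p_zip(lambda x, y: x * y, xs, ys), 0)


def row_sums(matrix):
    """Nested pattern: a map over rows with a reduce inside each row."""
    return p_map(lambda row: p_reduce(lambda a, b: a + b, row, 0), matrix)
```

For instance, `dot([1, 2, 3], [4, 5, 6])` evaluates to 32, and `row_sums` is the kind of nested map/reduce that the later GPU mapping slides discuss.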

Delite Overview

Key elements
•  DSLs embedded in Scala
•  IR created using type-directed staging
•  Domain specific optimization
•  General parallelism and locality optimizations
•  Optimized mapping to HW targets

Compilation stack (figure)
•  Opti{Wrangler, QL, ML, Graph}
•  DSL 1 … DSL n: domain ops, domain data
•  Domain specific analyses & transformations
•  Parallel patterns, parallel data
•  Generic analyses & transformations
•  Optimized Code Generators: Scala, C++, CUDA, OpenCL, MPI, HDL

K. J. Brown et al., “A heterogeneous parallel framework for domain-specific languages,” PACT, 2011.

OptiML: Overview

•  Provides a familiar (MATLAB-like) language and API for writing ML applications
   •  Ex. val c = a * b (a, b are Matrix[Double])
•  Implicitly parallel data structures
   •  Base types: Vector[T], Matrix[T], Graph[V,E], Stream[T]
   •  Subtypes: TrainingSet, IndexVector, Image, …
•  Implicitly parallel control structures
   •  sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
   •  Allow anonymous functions with restricted semantics to be passed as arguments of the control structures

K-means Clustering in OptiML

untilconverged(kMeans, tol) { kMeans =>
  // calculate distances to current means and
  // assign each sample to the closest mean
  val clusters = samples.groupRowsBy { sample =>
    kMeans.mapRows(mean => dist(sample, mean)).minIndex
  }
  // move each cluster centroid to the mean of the points assigned to it
  val newKmeans = clusters.map(c => c.sum / c.length)
  newKmeans
}

•  No explicit map-reduce, no key-value pairs (e.g. MR)
•  No distributed data structures (e.g. Spark RDDs)
•  No annotations for hardware design
•  Efficient multicore and GPU execution
•  Efficient cluster and NUMA implementation
•  Efficient FPGA hardware
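To make the structure of the OptiML program concrete, here is a minimal sequential sketch of the same loop in plain Python. It is not OptiML and not how Delite executes the program. The assumptions are mine: `dist` is taken to be squared Euclidean distance, a cluster that receives no samples keeps its previous mean, convergence is measured as the total squared movement of the means, and `max_iter` is a hypothetical safety bound.

```python
def dist(a, b):
    """Squared Euclidean distance between two equal-length vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_index(values):
    """Index of the smallest value."""
    return min(range(len(values)), key=lambda i: values[i])


def kmeans_step(samples, means):
    """One iteration: group samples by closest mean, then average each group."""
    clusters = {}
    for sample in samples:
        idx = min_index([dist(sample, m) for m in means])
        clusters.setdefault(idx, []).append(sample)
    new_means = []
    for i, old in enumerate(means):
        rows = clusters.get(i)
        if not rows:
            new_means.append(list(old))  # assumption: empty clusters stay put
        else:
            new_means.append([sum(col) / len(rows) for col in zip(*rows)])
    return new_means


def until_converged(means, tol, step, max_iter=100):
    """Repeat step until the means move by at most tol (or max_iter is reached)."""
    for _ in range(max_iter):
        new_means = step(means)
        delta = sum(dist(a, b) for a, b in zip(means, new_means))
        means = new_means
        if delta <= tol:
            break
    return means
```

A small usage example: with `samples = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]` and initial means `[[0.0, 0.0], [10.0, 10.0]]`, calling `until_converged(init, 1e-9, lambda m: kmeans_step(samples, m))` settles on the two means `[0.0, 0.5]` and `[10.0, 10.5]`.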

Mapping Nested Parallel Patterns to GPUs

m = Matrix.rand(nR, nC)
v = m.sumCols

m = Matrix.rand(nR, nC)
v = m.sumRows

Both are a map(i) with a nested reduce(j).

[Figure: normalized execution time of sumCols and sumRows for matrix sizes [64K,1K], [8K,8K] and [1K,64K] under four mappings: 1D, thread-block/thread, warp-based and MultiDim. Annotations: limited parallelism; non-coalesced memory.]

HyoukJoong Lee et al., “Locality-Aware Mapping of Nested Parallel Patterns on GPUs,” MICRO'14
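The two operations can be written out as nested patterns to see why their memory behaviour differs. The sketch below is sequential Python, not GPU code, and rests on assumptions the slide does not state: `sumRows` is taken to produce one sum per row and `sumCols` one sum per column, and the matrix is stored row-major in a flat list. The function names are hypothetical.

```python
def sum_rows(flat, n_rows, n_cols):
    """Outer map over rows i, inner reduce over columns j."""
    return [sum(flat[i * n_cols + j] for j in range(n_cols)) for i in range(n_rows)]


def sum_cols(flat, n_rows, n_cols):
    """Outer map over columns j, inner reduce over rows i."""
    return [sum(flat[i * n_cols + j] for i in range(n_rows)) for j in range(n_cols)]


def strides(op, n_cols):
    """Return (outer_stride, inner_stride) in elements for a row-major matrix.

    outer_stride: address distance between neighbouring outer (map) indices.
    inner_stride: address distance between consecutive inner (reduce) steps.
    """
    if op == "sumRows":
        return (n_cols, 1)
    if op == "sumCols":
        return (1, n_cols)
    raise ValueError("unknown op: " + op)
```

Under these assumptions the two operations have mirrored strides, so a single fixed assignment of GPU threads to the outer loop cannot give unit-stride (coalesced) access for both. Which loop level adjacent threads are assigned to is the mapping decision the figure compares.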

MSM Builder Using OptiML with Vijay Pande

Markov State Models (MSMs)
MSMs are a powerful means of modeling the structure and dynamics of molecular systems, like proteins.

[Figure: productivity vs. performance of the implementations. Labels: x86 ASM; high prod, low perf; low prod, high perf; high prod, high perf.]

Distributed Heterogeneous Execution

•  Separate Memory Regions
   •  NUMA
   •  Clusters
   •  FPGAs
•  Partitioning Analysis
   •  Multidimensional arrays
   •  Decide which data structures / parallel ops to partition across abstract memory regions
•  Nested Pattern Transformations
   •  Optimize patterns for distributed and heterogeneous architectures

Compilation flow (figure)
•  DSL Application
•  Delite parallel data, Delite parallel patterns
•  Partitioning & Stencil Analysis: local data, partitioned data, scheduled patterns
•  Nested Pattern Transformations: local data, partitioned data, scheduled, transformed patterns
•  Heterogeneous Code Generation & Distributed Runtime

Heterogeneous Cluster Performance

4 node local cluster: 3.4 GB dataset

Multi-socket NUMA Performance
Multi-Terabyte Main Memory

1 to 48 threads, 4 sockets

[Figure: speedup vs. thread count (1, 12, 24, 48) for Gene, GDA, LogReg, TPCHQ1, k-means, Triangle and PageRank. Speedup labels shown in the figure: 30x, 45x, 10x, 10x, 11x, 5x, 10x.]

SIMD Parallelism (Intel AVX2)

•  Single Instruction Multiple Data (SIMD)
•  SIMD parallelism is keeping up with Moore’s law: doubling per generation
•  Precision vs. parallelism

[Figure labels: SIMD Precision, SIMD Parallelism]
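The precision vs. parallelism trade-off is simple arithmetic: a fixed-width vector register holds more elements when each element uses fewer bits. A minimal sketch, assuming the 256-bit register width of AVX2 (the function name `simd_lanes` is hypothetical):

```python
AVX2_REGISTER_BITS = 256  # width of an AVX2 vector register


def simd_lanes(element_bits, register_bits=AVX2_REGISTER_BITS):
    """Number of elements of the given width that fit in one vector register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits
```

With these numbers, 32-bit elements give 8 lanes and 8-bit elements give 32 lanes, so dropping from 32-bit to 8-bit precision raises the ideal lane count by a factor of four. Measured speedups are typically lower than this ideal ratio.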

Statistical vs. Hardware Efficiency with Chris Ré

Same statistical efficiency

Improved hardware efficiency

•  8-bit gives about 3x speed up!

•  Lower precision is possible

•  Good match to specialized/reconfigurable HW?

BUCKWILD! same statistical efficiency with greater hardware efficiency

[Figure: Logistic Regression using SGD. Loss (log scale, 0.01 to 10) vs. # iterations (0 to 20) for 32-bit and 8-bit precision.]
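One way to picture low-precision SGD is to keep the model weights as signed 8-bit fixed-point values and round after every update. The sketch below is an illustration under my own assumptions, not the BUCKWILD! implementation: the fixed-point step `SCALE`, the use of stochastic rounding, and the function names are all hypothetical choices.

```python
import math
import random

SCALE = 1 / 32  # assumed fixed-point step: one int8 unit represents 1/32


def quantize(x, rng, scale=SCALE):
    """Stochastically round x to a multiple of scale, clamped to the int8 range."""
    scaled = x / scale
    low = math.floor(scaled)
    # Round up with probability equal to the fractional part (unbiased before clamping).
    q = low + (1 if rng.random() < scaled - low else 0)
    return max(-128, min(127, q))


def dequantize(q, scale=SCALE):
    """Convert an int8 fixed-point value back to a float."""
    return q * scale


def sgd_step(weights_q, features, label, lr, rng):
    """One logistic-regression SGD step on int8 weights; label is 0 or 1."""
    w = [dequantize(q) for q in weights_q]
    z = sum(wi * xi for wi, xi in zip(w, features))
    pred = 1.0 / (1.0 + math.exp(-z))
    grad = [(pred - label) * xi for xi in features]
    return [quantize(wi - lr * gi, rng) for wi, gi in zip(w, grad)]
```

Usage: `sgd_step([0, 0], [1.0, 1.0], 1, 1.0, random.Random(0))` moves both weights from 0 to the fixed-point value 16, which represents 0.5. Because the weights are small integers, more of them fit in each SIMD register, which is where the hardware-efficiency gain would come from.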

FPGAs in the Datacenter?

•  FPGA-based accelerators
   •  Recent commercial interest from Baidu, Microsoft, and Intel
   •  Key advantage: Performance, Performance/Watt
   •  Key disadvantage: lousy programming model
•  Verilog and VHDL are a poor match for software developers
   •  But, high quality designs
•  High level synthesis (HLS) tools with C interface
   •  Medium/low quality designs
   •  Need architectural knowledge to build good accelerators
   •  Not enough information in compiler IR to perform access pattern and data layout optimizations
   •  Cannot synthesize complex data paths with nested parallelism

Our Approach to FPGA Design

Compilation flow (figure)
•  Parallel Patterns
•  Pattern Transformations: Fusion, Pattern Tiling, Code Motion
•  Tiled Parallel Pattern IR
•  Hardware Generation: Memory Allocation, Template Selection, Metapipeline Analysis
•  MaxJ HGL
•  Bitstream Generation
•  FPGA Configuration

•  High-level Parallel Patterns
•  Data Locality improved with parallel pattern tiling transformations
•  Nested Parallelism exploited with hierarchical pipelines and double buffers
•  Generate MaxJ to generate VHDL
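Tiling and pipelining across tiles can be sketched in software terms. The Python below is an illustration only, not the generated hardware or the compiler's algorithm: `tiled_sum` shows a reduce split into an outer loop over tiles and an inner reduce within a tile, and `metapipeline_schedule` lists which stage works on which tile at each time step when stages overlap. The stage names and function names are hypothetical.

```python
def tiles(data, tile_size):
    """Split data into consecutive tiles of at most tile_size elements."""
    for start in range(0, len(data), tile_size):
        yield data[start:start + tile_size]


def tiled_sum(data, tile_size):
    """Reduce tile by tile: each tile stands for one load into an on-chip buffer."""
    total = 0
    for tile in tiles(data, tile_size):  # outer pattern over tiles
        total += sum(tile)               # inner pattern within one tile
    return total


def metapipeline_schedule(n_tiles, stages=("load", "compute", "store")):
    """For each time step, list the (stage, tile) pairs that are active together.

    Stage s works on tile t - s at step t, so while one tile is being computed
    the next one is already being loaded. Double buffers between stages are what
    would let neighbouring stages touch different tiles at the same time.
    """
    steps = []
    for t in range(n_tiles + len(stages) - 1):
        active = [(stage, t - s) for s, stage in enumerate(stages) if 0 <= t - s < n_tiles]
        steps.append(active)
    return steps
```

For three tiles and three stages the schedule has five steps instead of the nine a fully sequential execution would need, and the tiled reduce returns the same value as an untiled one.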

K-Means Hardware

[Figure: generated k-means hardware. Blocks: kmeans Tile Load; samples Tile Load; Vector Dist (Norm) units; Scalar Dist (Tree +); (MinDist, Idx); Inc; +; /; New kmeans Tile Store. Buffers: kmeansBlock buffer; samplesBlock double buffers; minIdx double buffer; sum buffer; count buffer; new kmeans double buffer.]

1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., “FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data”, AHS 2011

Impact of Tiling and Metapipelining

Scaling Big Data Analytics with Moore’s Law

Applications: Data Transformation, Graph Analysis

HP DSLs
•  Data Extraction: OptiWrangler
•  Query Proc.: OptiQL

Delite DSL Framework: parallel data, parallel patterns, analyses & transformations

Heterogeneous Hardware: Multicore ✓ GPU ✓ FPGA ✓ Cluster ✓

Collaborators & Funding

•  Faculty
   •  Pat Hanrahan
   •  Martin Odersky (EPFL)
   •  Chris Ré
   •  Tiark Rompf (Purdue/EPFL)
•  PhD Students
   •  Chris Aberger
   •  Kevin Brown
   •  Hassan Chafi
   •  Zach DeVito
   •  Chris De Sa
   •  Nithin George (EPFL)
   •  David Koeplinger
   •  HyoukJoong Lee
   •  Victoria Popic
   •  Raghu Prabhakar
   •  Aleksander Prokopec (EPFL)
   •  Vera Salvisberg (EPFL)
   •  Arvind Sujeeth
•  Funding
   •  PPL: Oracle Labs, Nvidia, Intel, AMD, Huawei, SAP
   •  NSF
   •  DARPA

Scaling Big Data Analytics with Moore’s Law

•  Power
•  Performance
•  Programmability
•  Portability

Modern HW (Multicore, SIMD, GPU, FPGA, NUMA)

High Performance DSLs (OptiML, OptiQL, …)

Delite

•  Increasing volume, variety and complexity of data
•  Heterogeneity of modern hardware
•  Serious challenges for data analytics
