Scaling Big Data Analytics with Moore’s Law

Kunle Olukotun, EE and CS

Transcript of Scaling Big Data Analytics with Moore’s Law

Page 1: Scaling Big Data Analytics with Moore’s Law

Kunle Olukotun

EE and CS

Page 2: Data Trends

•  Increasing volume, variety and complexity of data

•  Challenge: enable data-driven discovery
    •  Deliver the capability to mine, search and analyze this data in near real time

Page 3: Microprocessor Trends

Moore’s Law

Power Wall

End of sequential performance

Page 4: Heterogeneous Computing Platforms

Graphics Processing Unit (GPU)

> 8 TFLOPS, SIMD

Accelerators

Programmable Logic

> 9 TFLOPS

Multicore: 10s of cores, SIMD

Cluster: 1000s of nodes

Multi-socket NUMA: > 1 TB DRAM

Parallelism and specialization

Page 5: Specialized Parallel Programming

Cluster: MPI, MapReduce

Multicore CPU, multi-socket: Threads, OpenMP

Graphics Processing Unit (GPU): CUDA, OpenCL

Programmable Logic: Verilog, VHDL

Custom computing

MPI: Message Passing Interface

Page 6: Specialized Programmers ⇒ Scarce

Page 7: Data Analytics Programming Challenge

Data Analytics Application: Predictive Analytics, Data ETL, Data Query, Graph Analysis

Multicore: Pthreads, OpenMP

GPU: CUDA, OpenCL

Cluster: MPI, MapReduce

FPGA: Verilog, VHDL

Page 8: Data Analytics Programming Challenge

Data Analytics Application: Predictive Analytics, Data ETL, Data Query, Graph Analysis

High-Performance Domain Specific Languages

Multicore: Pthreads, OpenMP

GPU: CUDA, OpenCL

Cluster: MPI, MapReduce

FPGA: Verilog, VHDL

Page 9: Domain Specific Languages

•  Domain Specific Languages (DSLs)
    •  Programming language with restricted expressiveness for a particular domain
    •  High-level, usually declarative, and deterministic

Example: Structured Query Language (SQL)

Page 10: High Performance DSLs for Data Analytics

Applications: Graph Analysis, Prediction, Recommendation, Data Transformation

HP DSLs:
•  Data Extraction: OptiWrangle
•  Query Proc.: OptiQL
•  Graph Alg.: OptiGraph
•  Machine Learning: OptiML

DSL Compiler (one per DSL)

Heterogeneous Hardware: Multicore, GPU, FPGA, Cluster

Page 11: Scaling the HP DSL Approach

•  Many potential DSLs

•  How do we quickly create high-performance implementations for DSLs we care about?

•  Enable expert programmers to easily create new DSLs
    •  Make optimization knowledge reusable
    •  Simplify the compiler generation process

•  A few DSL developers enable many more DSL users
    •  Leave expert programming to experts!

Page 12: Delite: DSL Infrastructure

Applications: Graph Analysis, Prediction, Recommendation, Data Transformation

HP DSLs:
•  Data Extraction: OptiWrangle
•  Query Proc.: OptiQL
•  Graph Alg.: OptiGraph
•  Machine Learning: OptiML

DSL Compiler (one per DSL)

Delite DSL Framework

Heterogeneous Hardware: Multicore, GPU, FPGA, Cluster

Page 13: Delite: A Framework for High Performance DSLs

•  A compiler tool-chain for high performance embedded DSLs
    •  Libraries on steroids (generative programming)

•  Built on top of Lightweight Modular Staging (LMS) to build an intermediate representation (IR) from Scala application code

•  Provides extensible reusable components
    •  Parallel patterns for structured computation
    •  Delite structs for structured data
    •  Transformers for domain-specific optimizations

•  Delite optimizes DSL code and generates target code
    •  Scala, C++, CUDA, OpenCL, and clusters

Page 14: Parallel Patterns

Most data analytic computations can be expressed as functional parallel patterns on collections (e.g. sets, arrays, tables, n-d matrices)

Nested parallel patterns

[Figure: map, zip, reduce, groupBy (key1, key2, key3)]

Map, Zip, Filter, FlatMap, Reduce, GroupBy, Join, Sort, …
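
The patterns named on this slide have close analogues in the Scala standard collections. The sketch below is only an illustration using plain sequential Scala, not Delite code; the object and method names are invented for this example.

```scala
object ParallelPatternsSketch {
  // Map: apply a function to every element independently.
  def scale(xs: Vector[Double], k: Double): Vector[Double] = xs.map(_ * k)

  // Zip: combine two collections element by element.
  def add(xs: Vector[Double], ys: Vector[Double]): Vector[Double] =
    xs.zip(ys).map { case (x, y) => x + y }

  // Reduce: combine all elements with an associative operator
  // (throws on an empty collection).
  def total(xs: Vector[Double]): Double = xs.reduce(_ + _)

  // GroupBy: bucket elements by a key function.
  def byParity(xs: Vector[Int]): Map[Int, Vector[Int]] = xs.groupBy(_ % 2)

  // Nested pattern: a map over rows whose body is a reduce over columns.
  def rowSums(m: Vector[Vector[Double]]): Vector[Double] =
    m.map(_.reduce(_ + _))
}
```

Because each pattern states what is computed per element rather than how the loop runs, a compiler is free to choose the parallel schedule for the target hardware.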

Page 15: Delite Overview

Key elements
•  DSLs embedded in Scala
•  IR created using type-directed staging
•  Domain specific optimization
•  General parallelism and locality optimizations
•  Optimized mapping to HW targets

[Diagram: Opti{Wrangler, QL, ML, Graph}; DSL 1 … DSL n, each with domain ops and domain data; Domain specific analyses & transformations; Parallel patterns and parallel data; Generic analyses & transformations; Optimized Code Generators: Scala, C++, CUDA, OpenCL, MPI, HDL]

K. J. Brown et al., “A heterogeneous parallel framework for domain-specific languages,” PACT, 2011.

Page 16: OptiML: Overview

•  Provides a familiar (MATLAB-like) language and API for writing ML applications
    •  Ex. val c = a * b (a, b are Matrix[Double])

•  Implicitly parallel data structures
    •  Base types
        •  Vector[T], Matrix[T], Graph[V,E], Stream[T]
    •  Subtypes
        •  TrainingSet, IndexVector, Image, …

•  Implicitly parallel control structures
    •  sum{…}, (0::end) {…}, gradient { … }, untilconverged { … }
    •  Allow anonymous functions with restricted semantics to be passed as arguments of the control structures

Page 17: K-means Clustering in OptiML

    untilconverged(kMeans, tol) { kMeans =>
      // calculate distances to current means
      // assign each sample to the closest mean
      val clusters = samples.groupRowsBy { sample =>
        kMeans.mapRows(mean => dist(sample, mean)).minIndex
      }
      // move each cluster centroid to the mean of the points assigned to it
      val newKmeans = clusters.map(c => c.sum / c.length)
      newKmeans
    }

•  No explicit map-reduce, no key-value pairs (e.g. MR)
•  No distributed data structures (e.g. Spark RDDs)
•  No annotations for hardware design
•  Efficient multicore and GPU execution
•  Efficient cluster and NUMA implementation
•  Efficient FPGA hardware
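
For readers without an OptiML installation, the same loop can be approximated with plain Scala collections. This is a sequential sketch, not OptiML: `groupBy` stands in for `groupRowsBy`, `step` and `run` are names invented here, squared Euclidean distance is an assumed choice for `dist`, and the `maxIter` cap and the handling of empty clusters are additions the slide does not mention. It also assumes a non-empty list of means.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def dist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // One iteration: assign each sample to its closest mean, then recompute means.
  def step(samples: Vector[Point], means: Vector[Point]): Vector[Point] = {
    val clusters: Map[Int, Vector[Point]] = samples.groupBy { sample =>
      val ds = means.map(mean => dist(sample, mean))
      ds.indexOf(ds.min)
    }
    means.indices.toVector.map { i =>
      clusters.get(i) match {
        // Component-wise mean of the points assigned to cluster i.
        case Some(pts) => pts.transpose.map(col => col.sum / pts.length)
        // Keep a mean that attracted no samples.
        case None => means(i)
      }
    }
  }

  // Repeat until no mean moves by more than tol (squared distance)
  // or maxIter iterations have run.
  def run(samples: Vector[Point], init: Vector[Point],
          tol: Double, maxIter: Int = 100): Vector[Point] = {
    var means = init
    var iter = 0
    var moved = Double.MaxValue
    while (moved > tol && iter < maxIter) {
      val next = step(samples, means)
      moved = means.zip(next).map { case (a, b) => dist(a, b) }.max
      means = next
      iter += 1
    }
    means
  }
}
```

For example, four one-dimensional samples 0, 1, 10, 11 with initial means 0 and 10 settle at means 0.5 and 10.5 after one update.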

Page 18: Mapping Nested Parallel Patterns to GPUs

    m = Matrix.rand(nR, nC)
    v = m.sumCols

    m = Matrix.rand(nR, nC)
    v = m.sumRows

map(i), reduce(j)

limited parallelism

non-coalesced memory

[Figure: normalized execution time of sumCols and sumRows for matrix sizes [64K,1K], [8K,8K], [1K,64K]; mappings compared: 1D, thread-block/thread, warp-based, MultiDim]

HyoukJoong Lee et al., “Locality-Aware Mapping of Nested Parallel Patterns on GPUs,” MICRO'14
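
Both operations on this slide are an outer map whose body is a reduce, and they differ only in which index the reduce walks. The sketch below makes the memory access pattern explicit for a row-major layout. It is plain sequential Scala with invented names (`Mat`, `NestedPatternSketch`), and it assumes `sumRows` means one sum per row and `sumCols` one sum per column, which the slide does not spell out.

```scala
object NestedPatternSketch {
  // Row-major matrix: element (i, j) lives at data(i * nC + j).
  final case class Mat(nR: Int, nC: Int, data: Array[Double]) {
    def apply(i: Int, j: Int): Double = data(i * nC + j)
  }

  // Assumed meaning: one sum per row. Outer map over i, inner reduce over j.
  // The inner reduce walks contiguous addresses (stride 1).
  def sumRows(m: Mat): Vector[Double] =
    Vector.tabulate(m.nR)(i => (0 until m.nC).map(j => m(i, j)).sum)

  // Assumed meaning: one sum per column. Outer map over j, inner reduce over i.
  // The inner reduce jumps nC elements per step (stride nC).
  def sumCols(m: Mat): Vector[Double] =
    Vector.tabulate(m.nC)(j => (0 until m.nR).map(i => m(i, j)).sum)
}
```

Which loop level is assigned to GPU threads decides both how much parallelism is available (the size of that dimension) and whether neighbouring threads touch neighbouring addresses, which is the trade-off the figure compares across matrix shapes.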

Page 19: MSM Builder Using OptiML with Vijay Pande

Markov State Models (MSMs)

MSMs are a powerful means of modeling the structure and dynamics of molecular systems, like proteins

x86 ASM

high prod, low perf

low prod, high perf

high prod, high perf

Page 20: Distributed Heterogeneous Execution

•  Separate Memory Regions
    •  NUMA
    •  Clusters
    •  FPGAs

•  Partitioning Analysis
    •  Multidimensional arrays
    •  Decide which data structures / parallel ops to partition across abstract memory regions

•  Nested Pattern Transformations
    •  Optimize patterns for distributed and heterogeneous architectures

[Diagram: DSL Application (Delite parallel data, Delite parallel patterns) → Partitioning & Stencil Analysis (local data, partitioned data, scheduled patterns) → Nested Pattern Transformations (local data, partitioned data, scheduled, transformed patterns) → Heterogeneous Code Generation & Distributed Runtime]

Page 21: Heterogeneous Cluster Performance

4 node local cluster: 3.4 GB dataset

Page 22: Multi-socket NUMA Performance

Multi-Terabyte Main Memory

1 to 48 threads, 4 sockets

[Figure: speedup vs. number of threads (1, 12, 24, 48) for Gene, GDA, LogReg, TPCHQ1, k-means, Triangle, PageRank; speedup labels on the plots: 30x, 45x, 10x, 10x, 11x, 5x, 10x]

Page 23: SIMD Parallelism (Intel AVX2)

•  Single Instruction Multiple Data (SIMD)

•  SIMD parallelism is keeping up with Moore’s law: doubling per generation

•  Precision vs. parallelism

[Table columns: SIMD Precision, SIMD Parallelism]
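
The precision vs. parallelism trade-off is simple arithmetic: a fixed-width vector register holds more elements when each element is narrower. A small sketch, using the 256-bit register width of AVX2 (the helper name `lanes` is invented here):

```scala
object SimdLanesSketch {
  // AVX2 operates on 256-bit vector registers.
  val Avx2RegisterBits = 256

  // Number of elements processed per instruction for a given element width.
  def lanes(elementBits: Int, registerBits: Int = Avx2RegisterBits): Int = {
    require(elementBits > 0 && registerBits % elementBits == 0,
      "element width must divide register width")
    registerBits / elementBits
  }
}
```

Under this model a 256-bit register holds 4 lanes at 64 bits, 8 at 32 bits, 16 at 16 bits and 32 at 8 bits, so halving the precision doubles the lane count.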

Page 24: Statistical vs. Hardware Efficiency with Chris Ré

Same statistical efficiency

Improved hardware efficiency

•  8-bit gives about 3x speed up!

•  Lower precision is possible

•  Good match to specialized/reconfigurable HW?

BUCKWILD! same statistical efficiency with greater hardware efficiency

[Figure: Logistic Regression using SGD; loss (log scale, 0.01 to 10) vs. # iterations (0 to 20) for 32-bit and 8-bit]
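
The slide does not say how values are reduced to 8 bits. One common choice for low-precision SGD is fixed-point quantization with unbiased (stochastic) rounding, sketched below purely as an assumption; `quantize8`, `dequantize8` and the `step` parameter are hypothetical names, not part of any library mentioned in the talk.

```scala
import scala.util.Random

object LowPrecisionSketch {
  // Quantize x to a signed 8-bit fixed-point code with the given step size.
  // Stochastic rounding: round up with probability equal to the fractional
  // part, so the expected result equals x when x is within range.
  // Out-of-range values are clamped to [-128, 127].
  def quantize8(x: Double, step: Double, rng: Random): Byte = {
    val scaled = x / step
    val lower = math.floor(scaled)
    val roundUp = rng.nextDouble() < (scaled - lower)
    val q = if (roundUp) lower + 1 else lower
    math.max(-128.0, math.min(127.0, q)).toByte
  }

  // Map an 8-bit code back to a real value.
  def dequantize8(q: Byte, step: Double): Double = q * step
}
```

Each stored value then occupies a quarter of the space of a 32-bit float, which is where the extra SIMD lanes and memory bandwidth come from.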

Page 25: FPGAs in the Datacenter?

•  FPGA-based accelerators
    •  Recent commercial interest from Baidu, Microsoft, and Intel
    •  Key advantage: Performance, Performance/Watt
    •  Key disadvantage: lousy programming model

•  Verilog and VHDL poor match for software developers
    •  But, high quality designs

•  High level synthesis (HLS) tools with C interface
    •  Medium/low quality designs
    •  Need architectural knowledge to build good accelerators
    •  Not enough information in compiler IR to perform access pattern and data layout optimizations
    •  Cannot synthesize complex data paths with nested parallelism

Page 26: Our Approach to FPGA Design

[Diagram: Parallel Patterns → Pattern Transformations (Fusion, Pattern Tiling, Code Motion) → Tiled Parallel Pattern IR → Hardware Generation (Memory Allocation, Template Selection, Metapipeline Analysis) → MaxJ HGL → Bitstream Generation → FPGA Configuration]

High-level Parallel Patterns

Data Locality improved with parallel pattern tiling transformations

Nested Parallelism exploited with hierarchical pipelines and double buffers

Generate MaxJ to generate VHDL
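
The tiling idea can be shown on a single reduce. The sketch below is plain sequential Scala with invented names, not the compiler's actual transformation or its generated hardware: the tiled version splits the input into fixed-size tiles, copies each tile into a small local buffer (standing in for on-chip memory), reduces it there, and combines the per-tile results.

```scala
object TilingSketch {
  // Untiled pattern: a single reduce over the whole (off-chip) array.
  def sum(xs: Array[Double]): Double = xs.foldLeft(0.0)(_ + _)

  // Tiled pattern: an outer reduce over tiles, with an inner reduce per tile.
  def tiledSum(xs: Array[Double], tileSize: Int): Double = {
    require(tileSize > 0, "tileSize must be positive")
    xs.grouped(tileSize).map { tile =>
      val buffer = tile.clone()   // "load tile" stage
      buffer.foldLeft(0.0)(_ + _) // "compute on tile" stage
    }.foldLeft(0.0)(_ + _)
  }
}
```

Here the load and compute stages run one after the other. In a metapipelined design with double buffers, the load of the next tile can overlap with the computation on the current one. Floating-point results may differ slightly between the two versions because tiling changes the order of additions.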

Page 27: K-Means Hardware

[Diagram: k-means hardware design. Blocks: samples Tile Load, kmeans Tile Load, Vector Dist (Norm) units, Scalar Dist (Tree +), (MinDist, Idx), +, Inc, /, New kmeans Tile Store. Buffers: kmeansBlock buffer, samplesBlock double buffers, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer]

1. Load kmeans
2. Metapipeline: Calculate sum and count
3. Metapipeline: Calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., “FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data”, AHS 2011

Page 28: Impact of Tiling and Metapipelining

Page 29: Scaling Big Data Analytics with Moore’s Law

Applications: Graph Analysis, Prediction, Recommendation, Data Transformation

HP DSLs:
•  Data Extraction: OptiWrangle
•  Query Proc.: OptiQL
•  Graph Alg.: OptiGraph
•  Machine Learning: OptiML

Delite DSL Framework: Parallel data, Parallel patterns, Analyses & Transformations

Heterogeneous Hardware: Multicore ✓, GPU ✓, FPGA ✓, Cluster ✓

Page 30: Collaborators & Funding

•  Faculty
    •  Pat Hanrahan
    •  Martin Odersky (EPFL)
    •  Chris Ré
    •  Tiark Rompf (Purdue/EPFL)

•  PhD Students
    •  Chris Aberger
    •  Kevin Brown
    •  Hassan Chafi
    •  Zach DeVito
    •  Chris De Sa
    •  Nithin George (EPFL)
    •  David Koeplinger
    •  HyoukJoong Lee
    •  Victoria Popic
    •  Raghu Prabhakar
    •  Aleksander Prokopec (EPFL)
    •  Vera Salvisberg (EPFL)
    •  Arvind Sujeeth

•  Funding
    •  PPL: Oracle Labs, Nvidia, Intel, AMD, Huawei, SAP
    •  NSF
    •  DARPA

Page 31: Scaling Big Data Analytics with Moore’s Law

•  Power

•  Performance

•  Programmability

•  Portability

Modern HW (Multicore, SIMD, GPU, FPGA, NUMA)

High Performance DSLs (OptiML, OptiQL, …)

Delite

•  Increasing volume, variety and complexity of data
•  Heterogeneity of modern hardware
•  Serious challenges for data analytics