Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator....

32
Weld: A Common Runtime for Data Analytics Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia Stanford InfoLab, *MIT CSAIL

Transcript of Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator....

Page 1: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Weld: A Common Runtime for Data Analytics

Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia

Stanford InfoLab, *MIT CSAIL

Page 2: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Motivation

Modern data apps combine many disjoint processing libraries & functions» Relational, statistics, machine learning, …» E.g. PyData stack

+ Great results leveraging work of 1000s of authors

– No optimization across these functions

Page 3: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

How Bad is This Problem?

Growing gap between memory/processing makes traditional way of combining functions worse

data = pandas.parse_csv(string)

filtered = pandas.dropna(data)

avg = numpy.mean(filtered)

parse_csv

dropna

mean

5-30x slowdowns in NumPy, Pandas, TensorFlow, etc

Page 4: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

How We Solve Thismachine learningSQL graph

algorithms

CPU GPU

Common Runtime

Page 5: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

How We Solve Thismachine learningSQL graph

algorithms

CPU GPU

Weld IR

Backends

Runtime API

OptimizerWeldruntime

Page 6: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Runtime APIUses lazy evaluation to collect work across libraries

data = lib1.f1()lib2.map(data,

item => lib3.f2(item))

User Application Weld Runtime

Combined IRprogram

Optimizedmachine code

110111001110101101111

IR fragmentsfor each function

RuntimeAPI

f1

map

f2

Data in application

Page 7: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Weld IR

Designed to meet three goals:

1. Library composition: support complete workloads such as nested parallel calls

2. Ability to express optimizations: e.g. loop fusion, vectorization, loop tiling

3. Explicit parallelism and targeting parallel hardware

Page 8: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Why Don’t Compilers Solve This?

Languages and intermediate representations (IRs) make it hard to optimize across libraries»Main abstraction is shared memory

(must worry about aliasing, order, etc)

Most compilers don’t model parallel operations»Makes high performance code generation for

heterogeneous parallel hardware even more difficult

Page 9: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Weld IR

Small, powerful design inspired by “monad comprehensions”

Parallel loops: iterate over a dataset

Builders: declarative objects for producing results» E.g. append items to a list, compute a sum» Can be implemented differently on different hardware

Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof

Page 10: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

BuildersHardware independent and explicitly parallel

Three operators:

merge(builder, value): Merge a value into the builder and return a new builder

result(builder): destroy the builder and return a value

for(data, builders, func): iterate over data, potentially merging values into one or more builders in parallel

Page 11: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Examples

Implement functional operators using builders

def map(data, f):builder = new vecbuilder[int]for x in data:

merge(builder, f(x))result(builder)

def reduce(data, zero, func): builder = new merger[zero, func]for x in data:

merge(builder, x)result(builder)

Page 12: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Example Optimization: Fusionsquares = map(data, x => x * x)sum = reduce(data, 0, +)

bld1 = new vecbuilder[int]bld2 = new merger[0, +]for x in data:

merge(bld1, x * x)merge(bld2, x)

Loops can be merged into one pass over data

Page 13: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Optimizer

Cost Based Optimizer similar to an RDBMS

Builder Implementations: How to implement a particular builder (e.g., global vs. local hash tables)

Transforms: Should expressions be fused, vectorized, inlined, etc.

Quantifies choices among optimizations using data from the program

Page 14: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Cost Model

Inspired by cost models for in-memory databases

+ modeling nested parallelism and choices among implementations of data structures

Models cache contention, costs of atomic instructions, etc.

Page 15: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Implementation

Prototype with APIs in Scala and Python» LLVM and Voodoo for code gen

Integrations: TensorFlow, NumPy, Pandas, Spark

Page 16: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Results: Individual WorkloadsSQL (TPC-H) PageRank

0 2 4 6 8

10 12

1 2 4 8 12

Ru

ntim

e [

secs

]

Number of threads

GraphMatHand-opt

Weld

Word2VecQ1 Q3

Q6 Q12

0

0.2

0.4

0.6

0.8

1

1.2

1 4 12

Ru

ntim

e [

secs

]

Number of threadsHyPer

H.o.Weld

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7

1 4 12

Ru

ntim

e [

secs

]

Number of threadsHyPer

H.o.Weld

0

0.05

0.1

0.15

0.2

0.25

0.3

1 4 12

Runtim

e [se

cs]

Number of threadsHyPer

H.o.Weld

0

0.1

0.2

0.3

0.4

0.5

0.6

1 4 12

Runtim

e [se

cs]

Number of threadsHyPer

H.o.Weld 0

5

10

15

20

25

Ru

ntim

e [

secs

] TFTF-Op

Weld

TF-Op = C++ operator

Page 17: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

TPC-H Logistic RegressionVector Sum

Results: Existing Frameworks

0 5

10 15 20 25 30 35 40 45

TPC-H Q1 TPC-H Q6

Ru

ntim

e [

secs

]

Workload

SparkSQLWeld

0 0.02 0.04 0.06 0.08 0.1

0.12 0.14 0.16 0.18 0.2

Ru

ntim

e [

secs

]

NPNExpr

Weld

Integration effort: 500 lines glue, 30 lines/operator

0.1

1

10

100

1000

LR (1T) LR (12T)

Ru

ntim

e [

secs

; lo

g1

0]

Workload

TFHand-opt

Weld

1 Core 12 Cores

Page 18: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Results: Modeling Costs

Takeaway: Cost curves *resemble* actual runtimes

Page 19: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Results: Cross-Library Optimization

0.01

0.1

1

10

100

Runt

ime

(sec,

log1

0)

CurrentWeld, no CLOWeld, CLOWeld, 12 core

Pandas + NumPy

290x

31x

0.0

0.5

1.0

1.5

2.0

Runt

ime

(sec)

Scala UDFWeld

Spark SQL UDF

14x

Page 20: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Conclusion

The way we compose software will have to changeto efficiently use modern hardware

Lots of open questions and design decisions!» Leveraging specialized hardware, domain info, …

Open source: soon!

Page 21: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Related Work

HyPer, LegoBase, Tupleware: target relational algebra and serial UDFs; no nested parallelism

LLVM, OpenCL: low-level shared-memory model

NESL, parallel FPs: not closed under optimizations

DSLs: Weld focuses on integration with existing libraries and cross-library optimization

Page 22: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Observation

Many analytics algorithms can be written with a few “embarrassingly parallel” operators» See how many run on MapReduce / Spark

Focus on these instead of general programs

Page 23: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

The Goal

CPUs GPUs …

common IR

machine learningSQL graph

algorithms

optimizer

Page 24: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Results on GPUs

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8

Q1 Q3 Q6 Q12

Ru

ntim

e [

secs

]

Number of threadsOcelot Weld

SQL (TPC-H)

0

5.9

Runtim

e [se

cs]

TFBond

NearestNeighbors

0

0.7

Runtim

e [se

cs]

Bond (CPU)LightSpMV

Bond (GPU)

PageRank

Page 25: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Example Transformationsdef query(products: vec[{dept:int, price:int}]):sum = 0for p in products:if p.dept == 20: sum += p.price

def query(dept: vec[int], price: vector[int]):sum = 0for i in 0..len(users):if dept[i] == 20: sum += price[i]

for i in 0..len(products) by 4:sum += price[i..i+4] * (dept[i..i+4] == [20,20,20,20])

row-to-column

vectorization

Page 26: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Weld Results: TPC-H Q6

0.53

0.140.08 0.11

0.03 0.030.00

0.10

0.20

0.30

0.40

0.50

0.60

Python Java C HyPer Database

Optimized Weld

Run

time

(sec

)

Page 27: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Effect of Optimizations

0.23

0.08

0.03

0.00

0.05

0.10

0.15

0.20

0.25

Row-Oriented Program

After Row-To-Column

After Vectorization

Run

time

(sec

)

Transformations usableon any Weld program

Page 28: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

How Weld Fits Into Applications

User Application

Weld Runtime

CombinedIR program

Machinecode

110111001110101101111010

IR fragmentsfor each function

RuntimeAPI

f1 map

f2

Data in application

Optimizedprogram

data = lib1.f1()lib2.map(data,

item => lib3.f2(item))

Page 29: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Example: Spark + NumPy

Select some users using Spark SQL

Score them using NumPy

Compute average using Spark

data = spark.sql(“select user.featuresfrom userswhere age > 20”)

scores = data.map(lambda vec: scoreMatrix * vec)

average = scores.mean()

Page 30: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Example: Spark + NumPy

Weld IR Block 1

Weld IR Block 2

Weld IR Block 3

Aggregation

SQL OperationMatrix Ops

Matrix Ops

Matrix Ops

Matrix Ops

CombinedProgram

data = spark.sql(“select user.featuresfrom userswhere age > 20”)

scores = data.map(lambda vec: scoreMatrix * vec)

average = scores.mean()

Page 31: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Runtime API

data = lib1.f1()lib2.map(data,

item => lib3.f2(item))

User Application Weld Runtime

Combined IRprogram

Optimizedmachine code

110111001110101101111

IR fragmentsfor each function

RuntimeAPI

f1

map

f2

Data in application

Uses lazy evaluation to collect work across libraries

Page 32: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25

Supported Optimizations

Loop Fusion Loop Tiling

Row-to-Column Constant Folding

Common Subexpressions Branch Predication

Inlining Vectorization

Insert free() Calls …