Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator....
Transcript of Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator....
![Page 1: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/1.jpg)
Weld: A Common Runtime for Data Analytics
Shoumik Palkar, James Thomas, Anil Shanbhag*, Deepak Narayanan, Malte Schwarzkopf*, Holger Pirk*, Saman Amarasinghe*, Matei Zaharia
Stanford InfoLab, *MIT CSAIL
![Page 2: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/2.jpg)
Motivation
Modern data apps combine many disjoint processing libraries & functions» Relational, statistics, machine learning, …» E.g. PyData stack
+ Great results leveraging work of 1000s of authors
– No optimization across these functions
![Page 3: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/3.jpg)
How Bad is This Problem?
Growing gap between memory/processing makes traditional way of combining functions worse
data = pandas.parse_csv(string)
filtered = pandas.dropna(data)
avg = numpy.mean(filtered)
parse_csv
dropna
mean
5-30x slowdowns in NumPy, Pandas, TensorFlow, etc
![Page 4: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/4.jpg)
How We Solve Thismachine learningSQL graph
algorithms
CPU GPU
…
Common Runtime
…
![Page 5: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/5.jpg)
How We Solve Thismachine learningSQL graph
algorithms
CPU GPU
…
…
Weld IR
Backends
Runtime API
OptimizerWeldruntime
![Page 6: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/6.jpg)
Runtime APIUses lazy evaluation to collect work across libraries
data = lib1.f1()lib2.map(data,
item => lib3.f2(item))
User Application Weld Runtime
Combined IRprogram
Optimizedmachine code
110111001110101101111
IR fragmentsfor each function
RuntimeAPI
f1
map
f2
Data in application
![Page 7: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/7.jpg)
Weld IR
Designed to meet three goals:
1. Library composition: support complete workloads such as nested parallel calls
2. Ability to express optimizations: e.g. loop fusion, vectorization, loop tiling
3. Explicit parallelism and targeting parallel hardware
![Page 8: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/8.jpg)
Why Don’t Compilers Solve This?
Languages and intermediate representations (IRs) make it hard to optimize across libraries»Main abstraction is shared memory
(must worry about aliasing, order, etc)
Most compilers don’t model parallel operations»Makes high performance code generation for
heterogeneous parallel hardware even more difficult
![Page 9: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/9.jpg)
Weld IR
Small, powerful design inspired by “monad comprehensions”
Parallel loops: iterate over a dataset
Builders: declarative objects for producing results» E.g. append items to a list, compute a sum» Can be implemented differently on different hardware
Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof
![Page 10: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/10.jpg)
BuildersHardware independent and explicitly parallel
Three operators:
merge(builder, value): Merge a value into the builder and return a new builder
result(builder): destroy the builder and return a value
for(data, builders, func): iterate over data, potentially merging values into one or more builders in parallel
![Page 11: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/11.jpg)
Examples
Implement functional operators using builders
def map(data, f):builder = new vecbuilder[int]for x in data:
merge(builder, f(x))result(builder)
def reduce(data, zero, func): builder = new merger[zero, func]for x in data:
merge(builder, x)result(builder)
![Page 12: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/12.jpg)
Example Optimization: Fusionsquares = map(data, x => x * x)sum = reduce(data, 0, +)
bld1 = new vecbuilder[int]bld2 = new merger[0, +]for x in data:
merge(bld1, x * x)merge(bld2, x)
Loops can be merged into one pass over data
![Page 13: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/13.jpg)
Optimizer
Cost Based Optimizer similar to an RDBMS
Builder Implementations: How to implement a particular builder (e.g., global vs. local hash tables)
Transforms: Should expressions be fused, vectorized, inlined, etc.
Quantifies choices among optimizations using data from the program
![Page 14: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/14.jpg)
Cost Model
Inspired by cost models for in-memory databases
+ modeling nested parallelism and choices among implementations of data structures
Models cache contention, costs of atomic instructions, etc.
![Page 15: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/15.jpg)
Implementation
Prototype with APIs in Scala and Python» LLVM and Voodoo for code gen
Integrations: TensorFlow, NumPy, Pandas, Spark
![Page 16: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/16.jpg)
Results: Individual WorkloadsSQL (TPC-H) PageRank
0 2 4 6 8
10 12
1 2 4 8 12
Ru
ntim
e [
secs
]
Number of threads
GraphMatHand-opt
Weld
Word2VecQ1 Q3
Q6 Q12
0
0.2
0.4
0.6
0.8
1
1.2
1 4 12
Ru
ntim
e [
secs
]
Number of threadsHyPer
H.o.Weld
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
1 4 12
Ru
ntim
e [
secs
]
Number of threadsHyPer
H.o.Weld
0
0.05
0.1
0.15
0.2
0.25
0.3
1 4 12
Runtim
e [se
cs]
Number of threadsHyPer
H.o.Weld
0
0.1
0.2
0.3
0.4
0.5
0.6
1 4 12
Runtim
e [se
cs]
Number of threadsHyPer
H.o.Weld 0
5
10
15
20
25
Ru
ntim
e [
secs
] TFTF-Op
Weld
TF-Op = C++ operator
![Page 17: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/17.jpg)
TPC-H Logistic RegressionVector Sum
Results: Existing Frameworks
0 5
10 15 20 25 30 35 40 45
TPC-H Q1 TPC-H Q6
Ru
ntim
e [
secs
]
Workload
SparkSQLWeld
0 0.02 0.04 0.06 0.08 0.1
0.12 0.14 0.16 0.18 0.2
Ru
ntim
e [
secs
]
NPNExpr
Weld
Integration effort: 500 lines glue, 30 lines/operator
0.1
1
10
100
1000
LR (1T) LR (12T)
Ru
ntim
e [
secs
; lo
g1
0]
Workload
TFHand-opt
Weld
1 Core 12 Cores
![Page 18: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/18.jpg)
Results: Modeling Costs
Takeaway: Cost curves *resemble* actual runtimes
![Page 19: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/19.jpg)
Results: Cross-Library Optimization
0.01
0.1
1
10
100
Runt
ime
(sec,
log1
0)
CurrentWeld, no CLOWeld, CLOWeld, 12 core
Pandas + NumPy
290x
31x
0.0
0.5
1.0
1.5
2.0
Runt
ime
(sec)
Scala UDFWeld
Spark SQL UDF
14x
![Page 20: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/20.jpg)
Conclusion
The way we compose software will have to changeto efficiently use modern hardware
Lots of open questions and design decisions!» Leveraging specialized hardware, domain info, …
Open source: soon!
![Page 21: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/21.jpg)
Related Work
HyPer, LegoBase, Tupleware: target relational algebra and serial UDFs; no nested parallelism
LLVM, OpenCL: low-level shared-memory model
NESL, parallel FPs: not closed under optimizations
DSLs: Weld focuses on integration with existing libraries and cross-library optimization
![Page 22: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/22.jpg)
Observation
Many analytics algorithms can be written with a few “embarrassingly parallel” operators» See how many run on MapReduce / Spark
Focus on these instead of general programs
![Page 23: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/23.jpg)
The Goal
CPUs GPUs …
common IR
machine learningSQL graph
algorithms
optimizer
![Page 24: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/24.jpg)
Results on GPUs
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
Q1 Q3 Q6 Q12
Ru
ntim
e [
secs
]
Number of threadsOcelot Weld
SQL (TPC-H)
0
5.9
Runtim
e [se
cs]
TFBond
NearestNeighbors
0
0.7
Runtim
e [se
cs]
Bond (CPU)LightSpMV
Bond (GPU)
PageRank
![Page 25: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/25.jpg)
Example Transformationsdef query(products: vec[{dept:int, price:int}]):sum = 0for p in products:if p.dept == 20: sum += p.price
def query(dept: vec[int], price: vector[int]):sum = 0for i in 0..len(users):if dept[i] == 20: sum += price[i]
for i in 0..len(products) by 4:sum += price[i..i+4] * (dept[i..i+4] == [20,20,20,20])
row-to-column
vectorization
![Page 26: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/26.jpg)
Weld Results: TPC-H Q6
0.53
0.140.08 0.11
0.03 0.030.00
0.10
0.20
0.30
0.40
0.50
0.60
Python Java C HyPer Database
Optimized Weld
Run
time
(sec
)
![Page 27: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/27.jpg)
Effect of Optimizations
0.23
0.08
0.03
0.00
0.05
0.10
0.15
0.20
0.25
Row-Oriented Program
After Row-To-Column
After Vectorization
Run
time
(sec
)
Transformations usableon any Weld program
![Page 28: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/28.jpg)
How Weld Fits Into Applications
User Application
Weld Runtime
CombinedIR program
Machinecode
110111001110101101111010
IR fragmentsfor each function
RuntimeAPI
f1 map
f2
Data in application
Optimizedprogram
data = lib1.f1()lib2.map(data,
item => lib3.f2(item))
![Page 29: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/29.jpg)
Example: Spark + NumPy
Select some users using Spark SQL
Score them using NumPy
Compute average using Spark
data = spark.sql(“select user.featuresfrom userswhere age > 20”)
scores = data.map(lambda vec: scoreMatrix * vec)
average = scores.mean()
![Page 30: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/30.jpg)
Example: Spark + NumPy
Weld IR Block 1
Weld IR Block 2
Weld IR Block 3
Aggregation
SQL OperationMatrix Ops
Matrix Ops
Matrix Ops
Matrix Ops
CombinedProgram
data = spark.sql(“select user.featuresfrom userswhere age > 20”)
scores = data.map(lambda vec: scoreMatrix * vec)
average = scores.mean()
![Page 31: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/31.jpg)
Runtime API
data = lib1.f1()lib2.map(data,
item => lib3.f2(item))
User Application Weld Runtime
Combined IRprogram
Optimizedmachine code
110111001110101101111
IR fragmentsfor each function
RuntimeAPI
f1
map
f2
Data in application
Uses lazy evaluation to collect work across libraries
![Page 32: Weld: A Common Runtime for Data Analytics Talks/Shoumik_seminar.pdf · Weld TF-Op = C++ operator. TPC-H Vector Sum Logistic Regression Results: Existing Frameworks 0 5 10 15 20 25](https://reader034.fdocuments.us/reader034/viewer/2022050400/5f7e2f994b54f453e12c09d6/html5/thumbnails/32.jpg)
Supported Optimizations
Loop Fusion Loop Tiling
Row-to-Column Constant Folding
Common Subexpressions Branch Predication
Inlining Vectorization
Insert free() Calls …