[Page 1]
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
Yongjoo Park Jingyi Qing Xiaoyang Shen Barzan Mozafari
University of Michigan
[Page 2]
Machine learning workloads are slow and costly
More data ⇒ slower training
• 1 hour 35 minutes for 37M training examples

Often training multiple models
• New data becoming available
• Feature engineering
Criteo dataset, Logistic Regression with the L-BFGS optimization algorithm:

| Number of examples | 50K | 100K | 500K | 1M | 10M | 37M |
|---|---|---|---|---|---|---|
| Training time | 79 s | 95 s | 2.8 min | 4.4 min | 29 min | 1.6 hr |
[Page 3]
Key Question: Can sampling accelerate ML training?
ML training = iterative gradient computation:

grad = (1/N) ∑i=1..N g(xi | θt) ≈ (1/n) ∑i=1..n g(xi | θt)

SQL analytics (approximate query processing) [Park et al. SIGMOD’18]:

avg(X) = (1/N) ∑i=1..N Xi ≈ (1/n) ∑i=1..n Xi

A platform-independent approach. Do similar properties hold for ML?
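The sampled-gradient idea above can be sketched in a few lines of numpy. This is a toy logistic-regression setup with illustrative names, not BlinkML's code: the full gradient averages g(xi | θt) over all N examples, while the approximation averages over a uniform random sample of n examples.

```python
import numpy as np

def logistic_grad(theta, X, y):
    # Average gradient of the logistic-regression loss: (1/m) sum_i g(x_i | theta)
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
N, d = 100_000, 5
X = rng.normal(size=(N, d))
theta_true = rng.normal(size=d)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

theta_t = np.zeros(d)
full_grad = logistic_grad(theta_t, X, y)              # (1/N) sum over all N examples

n = 1_000                                             # uniform random sample, n << N
idx = rng.choice(N, size=n, replace=False)
approx_grad = logistic_grad(theta_t, X[idx], y[idx])  # (1/n) sum over the sample
```

With n = 1,000 out of N = 100,000, the sampled gradient already tracks the full gradient closely while touching 1% of the data, which is exactly the property that makes sampling attractive for iterative training.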
[Page 4]
Three key challenges
Model quality guarantee
• No closed-form solution: grad(θN) = (1/N) ∑i=1..N g(xi | θN) = 0
• CLT or Hoeffding is NOT directly applicable

Generalization
• Logistic Regression ≠ Principal Component Analysis

Efficiency
• Training too many approximate models can take longer than training a single full model
[Page 5]
Our core contribution
A system for efficient, quality-guaranteed ML training
It supports models trained via maximum likelihood estimation:
1. Linear Regression
2. Logistic Regression (the #1 classifier according to the 2017 Kaggle survey [https://www.kaggle.com/surveys/2017])
3. Probabilistic PCA
4. Generalized Linear Models, ...
We bring Fisher’s theory to practice, and apply it in a novel way for quality-guaranteed, sampling-based ML
[Page 6]
To put things into context
Our position: uniform random sampling is effective!
1. simple
2. no a priori knowledge required
3. still significant speedups
Generalize AQP to multivariate models
Much different from the work on biased/importance sampling
Orthogonal to AutoML
[Page 7]
<SystemOverview>...
[Page 8]
BlinkML: interface
Inputs:
1. model type (e.g., Logistic Regression)
2. training set (of size N)
3. accuracy (e.g., 98%)

Output: a 98%-accurate approximate model (in place of the full model).

Accuracy 1 − ε means Ex[𝟏(full(x) ≠ approx(x))] ≤ ε with high probability (e.g., 95%).

(What does accuracy mean? What is this process? See the following slides.)
[Page 9]
BlinkML: internal workflow
Step 1: profile model/data complexity
Step 2: estimate min sample size
Crucial component (= AccEstimator): estimate the accuracy of the approx model without the full model
Step 3: train an approximate model
[Page 10]
BlinkML: architecture
• Convex Optimization (scipy.optimize)
• Gradient Computation (numpy & pyspark)
• BlinkML Core (= AccEstimator)
• Model Specification (e.g., LinearReg)

Benefits:
1. ease of implementation
2. compatibility with existing ecosystems
3. distributed computation
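A minimal sketch of how the layers above might fit together, using only the off-the-shelf pieces the architecture names (numpy for gradients, scipy.optimize for convex optimization). The model specification here, a toy logistic regression on random data, is illustrative and not BlinkML's actual code:

```python
import numpy as np
from scipy.optimize import minimize

# Model specification layer: a toy logistic regression on random data.
rng = np.random.default_rng(0)
m, d = 2_000, 4
X = rng.normal(size=(m, d))
y = (rng.random(m) < 0.5).astype(float)

def nll(theta):
    # Average negative log-likelihood; log(1+e^z) computed stably.
    z = X @ theta
    return np.mean(np.logaddexp(0.0, z) - y * z)

def grad(theta):
    # Gradient computation layer (numpy).
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (p - y) / m

# Convex optimization layer (scipy.optimize with L-BFGS).
res = minimize(nll, np.zeros(d), jac=grad, method="L-BFGS-B")
theta_hat = res.x
```

Reusing scipy/numpy/pyspark this way is what buys the three benefits listed above: little new code, drop-in compatibility, and distributed gradient sums via pyspark when the data is large.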
[Page 11]
...</SystemOverview>
[Page 12]
<QualityGuarantee>...
[Page 13]
Goal: bounding the prediction difference
The expected prediction difference:
diff(full, approx) = Ex[𝟏(approx(x) ≠ full(x))]

Our goal: diff(full, approx) ≤ ε with high probability
(e.g., ε = 0.01 → 99% same predictions, for classification tasks)
How can we estimate diff(full, approx)?
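Given both models, diff(full, approx) has a direct empirical estimate: the fraction of points on which the two classifiers disagree. A minimal sketch, with made-up parameter vectors standing in for the full and approximate models:

```python
import numpy as np

def diff(pred_full, pred_approx):
    # Empirical estimate of E_x[1(approx(x) != full(x))]:
    # the fraction of points where the two models disagree.
    return float(np.mean(pred_full != pred_approx))

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
theta_full = np.array([1.0, -2.0, 0.5])    # imagined full-model parameters
theta_approx = theta_full + 0.01           # slightly perturbed approx parameters

predict = lambda theta: (X @ theta > 0).astype(int)
d = diff(predict(theta_full), predict(theta_approx))
```

The catch, as the next slides explain, is that this estimator needs the full model's parameters θN, which is exactly what BlinkML avoids computing.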
[Page 14]
Difference in params → diff(full, approx)

A model f(x; θ) is a function of its parameters. A logistic regression model predicts:
1 (positive) if 1/(1 + exp(−θᵀx)) > 0.5
0 (negative) otherwise

The decision boundary is determined by θ.

If we know θN and θn, we can infer diff(f(x; θN), f(x; θn)).
BUT, we don’t know θN. How can we infer the difference?
[Page 15]
Infer probabilistically w/ Monte Carlo simulation
We estimate diff(full, approx) using samples from Pr(θN)
How do we obtain samples from Pr(θN)?
diff(full, approx) = Ex[𝟏(f(x; θN) ≠ f(x; θn))]

| Sampled θN | θN,1 | θN,2 | θN,3 | θN,4 | θN,5 |
|---|---|---|---|---|---|
| diff(full, approx) | 0.01 | 0.005 | 0.015 | 0.01 | 0.008 |

We say diff(full, approx) ≤ 0.01 with 80% probability (4/5).
BlinkML uses thousands of samples for accurate estimation
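The Monte Carlo estimate above can be sketched as follows. The 0.005-scaled normal used to draw candidate full-model parameters is a stand-in for illustration; BlinkML derives the actual distribution of θN from Fisher's theory (next slide):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 3))
theta_n = np.array([1.0, -2.0, 0.5])       # approx-model parameters (given)
predict = lambda th: (X @ th > 0).astype(int)
approx_pred = predict(theta_n)

# Draw many candidate full-model parameters theta_N and compute one
# diff(full, approx) value per draw.
eps, num_samples = 0.01, 1_000
diffs = np.array([
    np.mean(predict(theta_n + 0.005 * rng.normal(size=3)) != approx_pred)
    for _ in range(num_samples)
])

prob = float(np.mean(diffs <= eps))        # estimated Pr[diff(full, approx) <= eps]
```

If `prob` exceeds the requested confidence (e.g., 95%), the approximate model is declared good enough; otherwise a larger sample size is needed.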
[Page 16]
Obtain samples from Fisher + optimization
Based on Fisher’s theory, we get:

θN − θn ∼ Normal(0, αn H⁻¹ J H⁻¹)

where θN is the param of the full model, θn is the param of the approx model, and the right-hand side is a multivariate normal distribution; H⁻¹ captures model complexity and J captures data variance.

The size of the covariance matrix, O(#features²), makes sampling slow. We employ mathematical tricks:

z ∼ N(0, I) → L z ∼ N(0, L Lᵀ)

We obtain L such that L Lᵀ = H⁻¹ J H⁻¹ directly from gradients using the information matrix equality, so sampling = matrix multiplication.
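The trick above, drawing z ∼ N(0, I) and multiplying by a factor L with L Lᵀ equal to the covariance, reduces each parameter sample to one matrix multiplication. A sketch with a made-up positive-definite matrix standing in for αn H⁻¹ J H⁻¹ (BlinkML gets L directly from gradients; here we simply take a Cholesky factor to demonstrate the sampling step):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# Stand-in for alpha_n * H^-1 J H^-1: any symmetric positive-definite matrix.
A = rng.normal(size=(d, d))
cov = A @ A.T + d * np.eye(d)

L = np.linalg.cholesky(cov)          # factor with L @ L.T == cov
Z = rng.normal(size=(d, 100_000))    # z ~ N(0, I), one column per sample
samples = (L @ Z).T                  # each row ~ Normal(0, cov)

# Sanity check: the empirical covariance of the samples recovers cov.
emp_cov = np.cov(samples, rowvar=False)
```

Because each sample is just `L @ z`, drawing the thousands of θN candidates needed for the Monte Carlo estimate costs a single batched matrix multiplication.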
[Page 17]
Recap of our quality guarantee mechanism
1. Obtain a parameter θn and a factor L
2. Obtain samples of full model parameters θN
3. Compute many diff(full, approx) using samples
4. Ensure diff(full, approx) ≤ ε with high probability
We must train an approximate model to obtain θn
In our paper: BlinkML performs this by training at most TWO approx models
For a certain sample size n:
[Page 18]
...</QualityGuarantee>
[Page 19]
<Experiments>...
[Page 20]
Models and datasets
| Model | Dataset | # of examples | # of features |
|---|---|---|---|
| Linear Regression | Gas | 4M | 57 |
| | Power | 2M | 114 |
| Logistic Regression | Criteo | 46M | 998,922 |
| | HIGGS | 11M | 28 |
| Max Entropy Classifier | MNIST | 8M | 784 |
| | Yelp | 5M | 100,000 |
| Probabilistic PCA | MNIST | 8M | 784 |
| | HIGGS | 11M | 28 |
Publicly available GB-scale machine learning datasets. The number of features ranges from 28 to roughly 1 million.
[Page 21]
BlinkML offers large speedups

[Figure: Speedup vs. Requested Accuracy (80%–99%) for Logistic Regression on HIGGS; speedup axis 0–1500, with a zoomed-in panel for 96%–99% on a 0–60 speedup axis.]

Speedups adjust based on requested accuracy
[Page 22]
BlinkML offers large speedups

[Figure: Speedup vs. Requested Accuracy (80%–99%): Logistic Regression on HIGGS (speedup axis 0–1500) and Max Entropy Classifier on Yelp (speedup axis 0–20).]

Speedups adjust based on model/data complexity (see a more systematic study in our paper)
[Page 23]
Approx. models satisfy requested accuracy
[Figure: 95% bound of actual accuracy vs. requested accuracy (80%–99%) for Logistic Regression on HIGGS and Max Entropy Classifier on Yelp; the bound meets or exceeds the requested accuracy throughout.]

Accuracy guarantees were conservative (which is not bad)
[Page 24]
Faster hyperparameter searching with BlinkML
[Figure: Test accuracy (50%–80%) vs. runtime (10–60 min) during hyperparameter search, comparing "With BlinkML (+Scipy)" against "With Only Scipy".]

BlinkML found the best model at iteration #91 (in 6 mins, test accuracy 75%).
BlinkML: 961 models trained in 30 mins (sample sizes: 10K–9M).
Regular: 3 models trained in 30 mins; could not find the best model within 1 hour.
Logistic Regression, Criteo
[Page 25]
...</Experiments>
[Page 26]
Summary
1. Extended sampling-based analytics to commonly used ML
2. Our approach offers probabilistic quality guarantees
3. Core: uncertainty on params → uncertainty on predictions
4. Empirical studies show that min sample size automatically adjusts
[Page 27]
What’s next?
Can we extend this approach to other models?
• SVM, ensemble models, deep neural nets, ...

Run BlinkML directly on SQL engines?
• Relational DBs are well optimized for structured data
• No need to move/migrate data

Propagating errors to downstream applications
• Formal semantics required
[Page 28]
Thank you!