Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture...
Transcript of Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture...
![Page 1: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/1.jpg)
Lecture 12: Model Serving
CSE599W: Spring 2018
![Page 2: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/2.jpg)
Deep Learning Applications
“That drink will get you to 2800 calories for today”
“I last saw your keys in the store room”
“Remind Tom of the party”
“You’re on page 263 of this book”
Intelligent assistant Surveillance / Remote assistance Input keyboard
![Page 3: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/3.jpg)
Model Serving Constraints● Latency constraint
○ Batch size cannot be as large as possible when executing in the cloud○ Can only run lightweight model in the device
● Resource constraint○ Battery limit for the device○ Memory limit for the device○ Cost limit for using cloud
● Accuracy constraint○ Some loss is acceptable by using approximate models○ Multi-level QoS
![Page 4: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/4.jpg)
Runtime Environment
Sensor Processor Radio Cloud
4
CameraTouch screenMicrophone
CPUGPUFPGAASIC
Wifi4G / LTEBluetooth
CPU serverGPU serverTPU / FPGA server
streaming
![Page 5: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/5.jpg)
Resource usage for a continuous vision app
Imager Processor Radio
Omnivision OV274090mW
Tegra K1 GPU290GOPS@10W= 34pJ/OP
Qualcomm SD810 LTE>800mWAtheros 802.11 a/g15Mbps@700mW= 47nJ/b
Cloud
Amazon EC2CPU c4.large 2x400GFLOPS $0.1/hGPU g2.2xlarge 2.3TFLOPS $0.65/h
5Huge gap between workload and budget
BudgetDevice power Cloud cost
30% of 10Wh for 10h = 300mW $10 person/year
Workload Deep learning 300GFLOPS @ 30GFLOPs/frame, 10fps
Compute power
9GFLOPS 3.5GFLOPS (GPU) / 8GFLOPS (CPU)
![Page 6: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/6.jpg)
Outline● Model compression
● Serving System
![Page 7: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/7.jpg)
Model Compression● Tensor decomposition● Network pruning● Quantization● Smaller model
![Page 8: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/8.jpg)
Matrix decomposition
* *
Fully-connected layer● Memory reduction:● Computation reduction:
Merge into one matrix
M
N
M
R
R
R
N
![Page 9: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/9.jpg)
Tensor decompositionConvolutional layer● Memory reduction: ● Computation reduction:
* Kim, Yong-Deok, et al. "Compression of deep convolutional neural networks for fast and low power mobile applications." ICLR (2016).
bounded by
![Page 10: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/10.jpg)
Decompose the entire model
![Page 11: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/11.jpg)
Fine-tuning after decomposition
![Page 12: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/12.jpg)
Accuracy & Latency after Decomposition
![Page 13: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/13.jpg)
Network pruning: Deep compression
* Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding." ICLR (2016).
![Page 14: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/14.jpg)
Network pruning: prune the connections
![Page 15: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/15.jpg)
Network pruning: weight sharing1. Use k-means clustering to
identify the shared weights for each layer of a trained network. Minimize
2. Finetune the neural network using shared weights.
![Page 16: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/16.jpg)
Network pruning: accuracy
![Page 17: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/17.jpg)
Network pruning: accuracy vs compression
![Page 18: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/18.jpg)
Quantization: XNOR-Net
* Mohammad Rastegari, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks." ECCV (2016).
![Page 19: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/19.jpg)
Quantization: binary weights
![Page 20: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/20.jpg)
Quantization: binary input and weights
![Page 21: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/21.jpg)
Quantization: accuracy
![Page 22: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/22.jpg)
Quantization
![Page 23: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/23.jpg)
Quantization
![Page 24: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/24.jpg)
Smaller model: Knowledge distillation● Knowledge distillation: use a teacher model (large model) to train a
student model (small model)
* Romero, Adriana, et al. "Fitnets: Hints for thin deep nets." ICLR (2015).
![Page 25: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/25.jpg)
Smaller model: accuracy
![Page 26: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/26.jpg)
DiscussionWhat are the implications of these model compression techniques for serving?
● Specialized hardware for sparse models○ Song Han, et al. “EIE: Efficient Inference Engine on Compressed Deep Neural Network.”
ISCA 2016
● Accuracy and resource trade-off○ Han, Seungyeop, et al. "MCDNN: An Approximation-Based Execution Framework for Deep
Stream Processing Under Resource Constraints." MobiSys (2016).
![Page 27: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/27.jpg)
Serving system
![Page 28: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/28.jpg)
Serving system● Goals:
○ High flexibility for writing applications○ High efficiency on GPUs○ Satisfy latency SLA
● Challenges○ Provide common abstraction for different frameworks○ Achieve high efficiency
■ Sub-second latency SLA that limits the batch size■ Model optimization and multi-tenancy causes long tail
![Page 29: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/29.jpg)
FrontendFrontend
Nexus: efficient neural network serving system…requests
unit query
Frontend
30
BackendM2
GPU
scheduler
Backend
GPU
scheduler
M3M1
BackendM2
GPU
scheduler
BackendM2
GPU
scheduler
Backend
GPU
scheduler
M4M3
Remarks• Frontend runtime library allows
arbitrary app logic• Packing models to achieve
higher utilization• A GPU scheduler allows new
batching primitives• A batch-aware global scheduler
allocates GPU cycles for each model
Global Scheduler
App 1 App 2 App 3 Control flow
Nexus RuntimeLoad Balancer Dispatch Approx. lib
Application Logic
![Page 30: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/30.jpg)
Flexibility: Application runtimeclass ModelHandler: # return output future def Execute(input)
class AppBase: # return ModelHandler def GetModelHandler(framework, model, version, latency_sla) # Load models during setup time, implemented by developer def Setup() # Process requests, implemented by developer def Process(request)
31
Async RPC, execute the model remotely
Send load model request to global scheduler
![Page 31: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/31.jpg)
Application example: Face recognitionclass FaceRecApp(AppBase): def Setup(self): self.m1 = self.GetModelHandler(“caffe”, “vgg_face”, 1, 100) self.m2 = self.GetModelHandler(“mxnet”, “age_net”, 1, 100)
def Process(self, request): ret1 = self.m1.Execute(request.image) ret2 = self.m2.Execute(request.image) return Reply(request.user_id, ret1[“name”], ret2[“age”])
32
Load model from different framework
Execute concurrently on remote GPUs
Force to synchronize when accessing future data
![Page 32: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/32.jpg)
Application example: Traffic Analysisclass TrafficApp(AppBase): def Setup(self): self.det = self.GetModelHandler(“darknet”, “yolo9000”, 1, 300) self.r1 = self.GetModelHandler(“caffe”, “vgg_face”, 1, 150) self.r2 = self.GetModelHandler(“caffe”,“googlenet_car”, 1, 150)
def Process(self, request): persons, cars = [], [] for obj in self.det.Execute(request.image): if obj[“class”] == “person”: persons.append(self.r1.Execute(request.image[obj[“rect”]]) elif obj[“class”] == “car”: cars.append(self.r2.Execute(request.image[obj[“rect”]]) return Reply(request.user_id, persons, cars)
33
![Page 33: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/33.jpg)
High Efficiency
• For high request rate, high latency SLA workload, saturate GPU efficiency by using large batch size
34
Request rate
Latency SLA
low
high
highlow
Workload characteristic
batch size >= 32
VGG16 throughput vs batch size
GPU saturate region
![Page 34: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/34.jpg)
High Efficiency
• Suppose we can choose a different batch size for each op (layer), and allocate dedicated GPUs for each op.
35
Request rate
Latency SLA
low
high
highlow
Workload characteristic
...
op1 op2 opn
GPU
GPU
GPU
VGG16 conv2_1 (winograd) VGG16 fc6
batch size 16
![Page 35: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/35.jpg)
Split Batching
•
36
![Page 36: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/36.jpg)
Split Batching
37
VGG16 optimal batch size for each segment for latency SLA 15ms
Batch size 3 for entire model
40%
13%
![Page 37: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/37.jpg)
High Efficiency•
38
Request rate
Latency SLA
low
high
highlow
Workload characteristic
batching
exec
exec
time
batch k
batch k+1
GPU is idle during this period
Execute multiple models on one GPU
![Page 38: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/38.jpg)
Execute multiple models on a single GPU
39
A1 A2 A3 A4 A5
batch k
Model A requests
GPU execution A1 A2 A3 A4 A5
batch k+1
worst latency
![Page 39: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/39.jpg)
Execute multiple models on a single GPU
40
A1 A2 A3 A4 A5
batch k
Model A requests
GPU execution A1 A2 A3 A4 A5
batch k+1
A1 A2 A3 A4 A5B1 B2 B3 B4 B5
Model A requests
GPU execution
Model B requests
A1 B1 ...
Operations of model A and B interleave
worst latency
worst latency
Next batch
A5
![Page 40: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/40.jpg)
Execute multiple models on a single GPU
41
A1 A2 A3 B1 B2 B3 B4 B5
Model A requests
GPU execution
Model B requests
...A4 A5 A1 A2 A3 A4 A5
batch kA batch kBbatch kA+1
worst latency
Use larger batch size as latency is reduced and predictive
![Page 41: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/41.jpg)
High EfficiencySolution depends
• If saturate GPU in temporal domain due to low latency: allocate dedicated GPU(s)
• If not: can use multi-batching to share GPU cycles with other models
42
Request rate
Latency SLA
low
high
highlow
Workload characteristic
![Page 42: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/42.jpg)
Prefix batching for model specialization
• Specialization (long-term / short-term) re-trains last a few layers
43A new model
![Page 43: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/43.jpg)
Prefix batching for model specialization
• Specialization (long-term / short-term) re-trains last a few layers
• Prefix batching allows batch execution for common prefix
44A new model
Different suffixes execute individually
Common prefix can be batched together
![Page 44: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/44.jpg)
Meet Latency SLA: Global scheduler
• First apply split batching and prefix batching if possible• Multi-batching: bin-packing problem to pack models into GPUs• Bin-packing optimization goal
• Minimize the resource usage (number of GPUs)
• Constraint• Requests have to be served within latency SLA
• Degrees of freedom• Split workloads into smaller tasks• Change the batch size
45
![Page 45: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/45.jpg)
Best-fit decreasing algorithms
46
![Page 46: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/46.jpg)
Best-fit decreasing algorithms
47
Bin to fit in execution cycles
![Page 47: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/47.jpg)
Best-fit decreasing algorithms
48
Bin to fit in execution cycles
...
sort
All residue workloads
![Page 48: Lecture 12: Model Serving - University of Washingtondlsys.cs.washington.edu/pdf/lecture12.pdfLecture 12: Model Serving CSE599W: Spring 2018 Deep Learning Applications “That drink](https://reader034.fdocuments.us/reader034/viewer/2022042218/5ec4b0cef1e42a1c4c314356/html5/thumbnails/48.jpg)
Reference● Kim, Yong-Deok, et al. "Compression of deep convolutional neural networks for fast and low
power mobile applications." ICLR (2016).● Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning,
Trained Quantization and Huffman Coding." ICLR (2016).● Romero, Adriana, et al. "Fitnets: Hints for thin deep nets." ICLR (2015).● Mohammad Rastegari, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional
Neural Networks." ECCV (2016).● Crankshaw, Daniel, et al. "Clipper: A Low-Latency Online Prediction Serving System." NSDI
(2017).