Post on 07-Aug-2020

SVE: Distributed Video Processing at Facebook Scale

Qi Huang, Petchean Ang, Peter Knowles, Tomasz Nykiel, Iaroslav Tverdokhlib, Amit Yajurvedi, Paul Dapolito IV, Xifan Yan, Maxim Bykov, Chuen Liang, Mohit Talwar, Abhishek Mathur, Sachin Kulkarni, Matthew Burke, and Wyatt Lloyd

Facebook, University of Southern California, Cornell, Princeton

Presentation by Jonas Umland

Introduction

● Every day:
  ○ 8B video views
  ○ 500M users watch 100M hours of video
  ○ Many tens of millions of uploads

Overview

● Legacy design (MES) vs new design (SVE)
● Performance comparison
● DAG execution system
● Overload control
● Production lessons

Full Video Pipeline

Tasks 153 22 18 >1000

Production Video Applications

Monolithic Encoding Script

Design Goals for a New Engine

Fast Robust Flexible

SVE Architecture Overview

SVE Architecture - Preprocessor

● Validation
● Splitting video into chunks for old clients
● DAG generation
● Storing input video
● Caching

SVE Architecture - Scheduler & Workers

● Scheduler
  ○ Receiving the DAG from the preprocessor
  ○ Scheduling tasks
  ○ Queuing tasks when no worker is available (high- and low-priority queues)

● Worker
  ○ Executing tasks
  ○ Fetching data from the preprocessor or intermediate storage
  ○ Writing to intermediate storage
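The scheduler behaviour listed above can be sketched in a few lines. This is only a toy model, not SVE's implementation: the class name `Scheduler`, its methods, and the rule that the high-priority queue is always drained first are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy scheduler: hand a task to an idle worker, else queue it by priority."""
    idle_workers: deque = field(default_factory=deque)
    high: deque = field(default_factory=deque)  # high-priority tasks
    low: deque = field(default_factory=deque)   # low-priority tasks

    def submit(self, task, high_priority):
        """Dispatch immediately if a worker is idle; otherwise enqueue."""
        if self.idle_workers:
            return (self.idle_workers.popleft(), task)
        (self.high if high_priority else self.low).append(task)
        return None

    def worker_idle(self, worker):
        """Called when a worker finishes: pull the next task, high priority first."""
        for queue in (self.high, self.low):
            if queue:
                return (worker, queue.popleft())
        # Nothing is waiting, so remember the worker for the next submit().
        self.idle_workers.append(worker)
        return None
```

With one idle worker, the first submitted task is dispatched at once; later tasks wait in their queues, and a high-priority task submitted afterwards still runs before an older low-priority one.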

SVE Architecture Overview - Intermediate Storage

● Caching of application metadata
● Caching of video/audio data
● Storing DAG state
● Automatically freeing data

Overlap Upload and Encoding

Parallel Processing

Video Sync (Durably Storing)

Overall latency improvement

DAG Execution System

Dynamic DAG Generation

● Processing tasks depend on video properties
● Enables performance testing
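To make "processing tasks depend on video properties" concrete, here is a minimal sketch of per-video DAG generation. The function `build_dag`, the task names, and the resolution ladder are hypothetical and are not taken from SVE; the sketch only shows how the set of tasks can change with the input.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def build_dag(has_audio, height, targets=(240, 720, 1080)):
    """Return a DAG as {task: set of prerequisite tasks} for one video."""
    dag = {"split": set()}
    encodes = []
    for target in targets:
        # Only encode at resolutions that do not exceed the source height.
        if target <= height:
            name = f"encode_{target}p"
            dag[name] = {"split"}
            encodes.append(name)
    if has_audio:
        # The audio task exists only when the upload has an audio track.
        dag["encode_audio"] = {"split"}
        encodes.append("encode_audio")
    dag["combine"] = set(encodes)
    return dag


dag = build_dag(has_audio=True, height=720)
order = list(TopologicalSorter(dag).static_order())
```

A 720p upload with audio yields five tasks, while a silent 240p upload yields three; in both cases `split` runs first and `combine` runs last.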

Fault Tolerance Strategies

Component       Strategy
Client device   Anticipate intermittent uploads
Front-end       Replicate state externally
Preprocessor    Replicate state externally
Scheduler       Synchronously replicate state externally
Worker          Replicate in time
Task            Many retries
Storage         Replicate on multiple disks

Retry Tasks After Recoverable Error

Attempts                          Success rate
First try                         99.788%
2 local retries                   99.795%
1 retry on different worker       99.901%
6 retries on different workers    99.995%
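The retry ladder in the table can be sketched as follows. This is an illustrative assumption, not the production policy: the names `run_with_retries` and `RecoverableError` are hypothetical, and the sketch assumes one attempt per additional worker.

```python
class RecoverableError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def run_with_retries(task, workers, local_retries=2, remote_retries=6):
    """Try the first worker 1 + local_retries times, then other workers once each."""
    first, *others = workers
    attempts = [first] * (1 + local_retries) + others[:remote_retries]
    last_error = None
    for worker in attempts:
        try:
            return task(worker)
        except RecoverableError as err:
            last_error = err  # remember the failure and move to the next attempt
    raise last_error
```

Retrying on a different worker matters because some failures are tied to one machine. In the table, two local retries add very little (99.788% to 99.795%), while a single retry elsewhere reaches 99.901%.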

Failure of 20% of Preprocessors in a Region

Gradual Failure of 5% of Workers in a Region

Mitigate overload

1) Delay latency-insensitive tasks
2) Delay latency-sensitive tasks and notify engineer
3) Redirect portion of video uploads to different region
4) Delay video processing
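The four steps read as an escalation ladder. The sketch below models that under a stated assumption: the steps are cumulative, so each higher overload level keeps the earlier mitigations active. The integer `severity` scale and the function name `active_mitigations` are hypothetical and do not come from the slides.

```python
MITIGATIONS = [
    "delay latency-insensitive tasks",
    "delay latency-sensitive tasks and notify an engineer",
    "redirect a portion of video uploads to a different region",
    "delay video processing",
]


def active_mitigations(severity):
    """Return the mitigations in force; severity 0 means no overload."""
    # Clamp so out-of-range values map to "none" or "all steps".
    severity = max(0, min(severity, len(MITIGATIONS)))
    return MITIGATIONS[:severity]
```

For example, `active_mitigations(2)` returns the first two steps, and any severity of 4 or more returns all four.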

Overload Control in Practice

Production Lessons

● Mismatch for livestreaming
● Failures from global inconsistency
● Failures from regional inconsistency
● Continuous sandboxing

Summary

● Three additional forms of parallelism to improve latency
● DAG execution system
● Robust to overload and faults
● Large-scale production insights

Sources

Most images are extracted from the paper:

https://www.cs.princeton.edu/~wlloyd/papers/sve-sosp17.pdf

And from Qi Huang's Talk:

www.qhuangcs.com/slides/sosp_sve.pptx