![Page 1](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/1.jpg)
Serverless Data Analytics with Flint
YOUNGBIN KIM AND JIMMY LIN
![Page 2](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/2.jpg)
Large-scale analytical data processing
o Spark adoption is booming
o Many use cases
![Page 3](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/3.jpg)
Large-scale analytical data processing
o Spark adoption is booming
o Many use cases
=> Requirement: Such frameworks must be pre-installed on a cluster before they can be used for analytics
  ◦ On-premises data center or a cluster of virtual instances in the cloud
![Page 4](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/4.jpg)
Large-scale analytical data processing
o Problem: Cluster management can be difficult
  ◦ Monitoring the health of worker nodes
  ◦ Troubleshooting a variety of issues
  ◦ Fixing/replacing underperforming nodes
May not be feasible for many small startups/researchers with limited resources!
![Page 5](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/5.jpg)
Large-scale analytical data processing
o How about scaling?
![Page 6](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/6.jpg)
Managed big data frameworks
o Current solution: Managed big data frameworks
o Example: Amazon Elastic MapReduce (EMR)
o Advantages:
  ◦ Reduces the burden of cluster management
  ◦ Saves costs (clusters are terminated automatically)
o Limitations:
  ◦ Time is wasted in cluster initialization/rescaling/teardown
  ◦ Need to choose the details of the managed cluster
o There are still management overheads and idle costs.
![Page 7](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/7.jpg)
Serverless analytics
o Serverless analytics to the rescue!
[Figure: cluster diagram with a Cluster Manager and two Worker Nodes, each holding an Executor, a Cache, and Tasks]
![Page 8](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/8.jpg)
Flint
o Flint: a prototype execution engine for serverless PySpark
o PySpark with a serverless backend by simply specifying a config file
o No costs for idle capacity
o Simplicity
o Use cases: ad hoc analytics and exploratory data analysis
![Page 9](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/9.jpg)
Flint architecture
o Spark tasks are executed in AWS Lambda
o Intermediate data are held in Amazon’s Simple Queue Service (SQS)
o Reuses as many existing Spark components as possible
  ◦ Query planning and optimization
  ◦ Many different types of RDD transformations
![Page 10](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/10.jpg)
Workflow
![Page 11](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/11.jpg)
Flint architecture
[Figure: Flint architecture diagram. Labeled components: Client (Spark Context, Flint Scheduler Backend); Amazon Web Services: Lambda (Flint Executors in an Intermediate Stage and a Final Stage), SQS (Queues), S3 (Input Partitions), Output Partitions. Legend: Data Movement, Control Flow.]
![Page 12](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/12.jpg)
Flint architecture
o The Flint scheduler coordinates Flint executors to execute a particular physical plan
  ◦ Function registration
  ◦ Queue initialization
  ◦ Serialization
  ◦ Invocation using a thread pool
  ◦ Processing the response from an executor
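The serialization, thread-pool invocation, and response-processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not Flint's actual scheduler: `invoke_executor` is a hypothetical local stand-in for a synchronous Lambda invocation, the task fields are made up, and function registration and queue initialization are not modeled.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def invoke_executor(payload: str) -> str:
    """Hypothetical stand-in for invoking one executor function synchronously."""
    task = json.loads(payload)
    # A real executor would run the task here; this stub only acknowledges it.
    return json.dumps({"task_id": task["task_id"], "status": "complete"})


def run_stage(tasks, max_concurrency=80):
    """Serialize tasks, invoke one executor per task, and parse the responses."""
    payloads = [json.dumps(t) for t in tasks]  # serialization
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map returns results in the same order as the inputs
        responses = list(pool.map(invoke_executor, payloads))
    return [json.loads(r) for r in responses]  # response processing


# Hypothetical task descriptions, one per input partition
tasks = [{"task_id": i, "partition": f"part-{i}"} for i in range(4)]
results = run_stage(tasks)
```

A thread pool suits this pattern because each invocation mostly waits on a remote call, so many invocations can be in flight from a single driver process.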
![Page 13](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/13.jpg)
Flint executor
o The Flint executor is a Python process running inside an AWS Lambda function
  ◦ Each serverless compute function invocation processes a single task
  ◦ Simplifies the communication requirement between an executor and a driver
  ◦ Less affected by the limitation of execution time
![Page 14](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/14.jpg)
Remote storage for shuffling
◦ No permanent storage
◦ Small ephemeral disk space (~512 MB)
◦ Execution time limitation
=> Cannot guarantee that the Flint executors from the previous stage are still alive to pass data
◦ Communication between Lambda functions
  ◦ Amazon’s Simple Queue Service (SQS)
    ◦ Highly scalable
    ◦ Reliable
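To make the idea of shuffling through remote queues concrete, here is a small sketch under stated assumptions. `InMemoryQueues` is a hypothetical in-process stand-in for a set of SQS queues (one per reduce partition), and the hash-partitioning scheme is an assumption for illustration rather than Flint's documented behavior.

```python
import zlib
from collections import defaultdict, deque


class InMemoryQueues:
    """Hypothetical stand-in for remote queues, one per reduce partition."""

    def __init__(self, num_partitions):
        self.queues = [deque() for _ in range(num_partitions)]

    def send(self, partition, message):
        self.queues[partition].append(message)

    def drain(self, partition):
        queue = self.queues[partition]
        while queue:
            yield queue.popleft()


def partition_for(key, num_partitions):
    # crc32 is stable across processes, unlike Python's salted str hash()
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_task(records, queues, num_partitions):
    """Previous-stage task: route each (key, value) pair to its partition's queue."""
    for key, value in records:
        queues.send(partition_for(key, num_partitions), (key, value))


def reduce_task(partition, queues):
    """Next-stage task: read one queue and sum the values per key."""
    totals = defaultdict(int)
    for key, value in queues.drain(partition):
        totals[key] += value
    return dict(totals)


NUM_PARTITIONS = 2
queues = InMemoryQueues(NUM_PARTITIONS)
map_task([("cash", 1), ("credit", 1)], queues, NUM_PARTITIONS)
map_task([("cash", 1)], queues, NUM_PARTITIONS)

merged = {}
for p in range(NUM_PARTITIONS):
    merged.update(reduce_task(p, queues))
```

Because the map tasks only write to the queues and the reduce tasks only read from them, the two stages never need to be alive at the same time, which is the property the slide is after.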
![Page 15](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/15.jpg)
Experimental Evaluation
◦ A Spark cluster running the Databricks Unified Analytics Platform (Standard)
  ◦ 11 m4.2xlarge instances (one driver and ten workers) - 80 vCores
◦ 80 max concurrent invocations (~80 vCores)
![Page 16](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/16.jpg)
Experimental Evaluation
◦ NYC taxi dataset (215 GB)
  ◦ Pick-up and drop-off dates/times, trip distance, payment type, tip amount
◦ Queries inspired by an exploratory data analysis task described in a popular blog post by Todd Schneider
![Page 17](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/17.jpg)
Experimental Evaluation
◦ Q0: Line count
◦ Q1: Taxi drop-offs at the Goldman Sachs headquarters (hourly aggregation)
![Page 18](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/18.jpg)
Experimental Evaluation
◦ Q2: Similar to Q1, but for Citigroup headquarters
◦ Q3: Goldman Sachs taxi drop-offs with tips greater than $10
◦ Q4: Cash vs. credit card payments
◦ Q5: Yellow taxi vs. green taxi, monthly aggregation
◦ Q6: Effect of precipitation on taxi trips
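A query shaped like Q1 or Q2 (filter drop-offs to a location, then aggregate by hour) might look like the sketch below. It is written in plain Python so it runs without a cluster; in PySpark the same shape would typically be expressed with RDD `filter`, `map`, and `reduceByKey`. The three-column record layout, the bounding-box coordinates, and the sample rows are all made up for illustration and are not taken from the dataset or from the talk.

```python
from collections import Counter
from datetime import datetime

# Hypothetical bounding box: (min_lon, min_lat, max_lon, max_lat)
BOX = (-74.0150, 40.7140, -74.0140, 40.7150)


def parse(line):
    """Parse an assumed 'dropoff_datetime,longitude,latitude' record."""
    dropoff_time, lon, lat = line.split(",")
    when = datetime.strptime(dropoff_time, "%Y-%m-%d %H:%M:%S")
    return when, float(lon), float(lat)


def in_box(lon, lat, box=BOX):
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def dropoffs_by_hour(lines):
    """Count drop-offs inside the box, keyed by hour of day."""
    counts = Counter()
    for line in lines:
        when, lon, lat = parse(line)
        if in_box(lon, lat):
            counts[when.hour] += 1
    return dict(counts)


# Made-up sample rows; the last one falls outside the box
sample = [
    "2015-06-01 08:15:00,-74.0145,40.7145",
    "2015-06-01 08:45:00,-74.0144,40.7146",
    "2015-06-01 19:05:00,-74.0145,40.7144",
    "2015-06-01 09:30:00,-73.9857,40.7484",
]
hourly = dropoffs_by_hour(sample)
```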
![Page 19](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/19.jpg)
Experimental Evaluation
![Page 20](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/20.jpg)
Experimental Evaluation
• Q1: Taxi drop-offs at the Goldman Sachs headquarters (hourly aggregation)
• Q3: Goldman Sachs taxi drop-offs with tips greater than $10
The level of concurrency is a tradeoff between latency and cost
![Page 21](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/21.jpg)
Lambda Limitations
◦ Most serverless platforms currently have several limitations
  ◦ Memory size (e.g. 3008 MB for AWS)
  ◦ Execution time limitation (5 ~ 9 minutes)
  ◦ Cold start problem
![Page 22](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/22.jpg)
Lambda Limitations
Execution time limitation
[Figure: handling the execution time limit. The Driver sends an executor a task { bucket: ..., range: 0-1000, … }. The executor hits its Timeout partway through the Input records and responds { status: incomplete, range: 0-719, … }. The remaining input, { bucket: ..., range: 720-1000, … }, is processed by another executor, which responds { status: complete, ... }.]
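The hand-off shown on this slide can be sketched as a driver loop that re-dispatches whatever range an executor did not finish. This is an illustration under assumptions, not Flint's code: `run_executor` is a hypothetical local stand-in for one Lambda invocation, the time limit is modeled as a record budget rather than wall-clock time, and `example-bucket` is a placeholder name.

```python
def run_executor(task, budget):
    """Hypothetical executor: processes records until its budget runs out."""
    start, end = task["range"]
    last = min(end, start + budget - 1)  # last record index reached before "timeout"
    status = "complete" if last == end else "incomplete"
    return {"status": status, "range": (start, last)}


def run_task(task, budget):
    """Driver side: keep dispatching the unfinished range to a fresh executor."""
    if budget < 1:
        raise ValueError("budget must allow at least one record per invocation")
    responses = []
    current = dict(task)
    while True:
        response = run_executor(current, budget)
        responses.append(response)
        if response["status"] == "complete":
            return responses
        _, last = response["range"]
        # The next executor starts right after the last processed record
        current = {**current, "range": (last + 1, task["range"][1])}


responses = run_task({"bucket": "example-bucket", "range": (0, 1000)}, budget=720)
```

With a budget of 720 records, the first response covers 0-719 and is marked incomplete, and a second invocation finishes 720-1000, mirroring the ranges on the slide.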
![Page 23](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/23.jpg)
Lambda Limitations
![Page 24](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/24.jpg)
Lambda Limitations
Other constraints
◦ Memory (3008 MB)
◦ Request size: 6 MB
  ◦ Metadata
◦ Response size: 6 MB
  ◦ Collect
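One practical consequence of the 6 MB request and response limits is that a driver or executor may want to measure a serialized payload before sending it. The check below is a minimal sketch: JSON encoding and the reading of 6 MB as 6 × 1024 × 1024 bytes are assumptions, and what to do when a payload does not fit (for example, with a large `collect` result) is not covered by the slide and is left out.

```python
import json

# Assumed interpretation of the 6 MB limit listed on the slide
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def payload_size(obj):
    """Size in bytes of the object once JSON-encoded as UTF-8."""
    return len(json.dumps(obj).encode("utf-8"))


def fits_in_payload(obj, limit=MAX_PAYLOAD_BYTES):
    return payload_size(obj) <= limit


# Made-up payloads: a tiny response and one of roughly 7 MB
small = {"status": "complete", "rows": [1, 2, 3]}
large = {"status": "complete", "rows": ["x" * 1024] * (7 * 1024)}
small_ok = fits_in_payload(small)
large_ok = fits_in_payload(large)
```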
![Page 25](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/25.jpg)
Related Work
o Iris:
  ◦ The origin of Flint (a course project, UWaterloo, Fall 2016)
  ◦ Distributed computation framework supporting a subset of the Spark API
  ◦ In-browser data analytics backed by a serverless backend
o Amazon Athena
  ◦ Per-query pricing with zero idle costs
  ◦ Only supports SQL
  ◦ Presto distributed SQL engine
o Databricks Serverless
  ◦ Automatically managed pools of cloud resources
    ◦ Auto-configured & auto-scaled
![Page 26](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/26.jpg)
Related Work
o PyWren
  ◦ Framework built from scratch on top of serverless compute functions and persistent storage
o Qubole Spark on Serverless
  ◦ Ported the existing Spark executor infrastructure onto AWS Lambda, whereas Flint is a from-scratch implementation
    ◦ Communication model
    ◦ AWS Lambda limitations
![Page 27](https://reader033.fdocuments.us/reader033/viewer/2022042204/5ea61747a0035a42982b4b8b/html5/thumbnails/27.jpg)
Future Work
◦ Intensive shuffling tasks
◦ Robustness
◦ Higher level libraries (e.g. MLlib)