Spark Uber Development Kit
-
Upload
jen-aman -
Category
Data & Analytics
-
view
2.244 -
download
0
Transcript of Spark Uber Development Kit
![Page 1: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/1.jpg)
Kelvin Chu, Hadoop Platform, UberGang Wu, Hadoop Platform, Uber
Spark Uber Development Kit
Spark Summit 2016June 07, 2016
![Page 2: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/2.jpg)
“Transportation as reliable as running water, everywhere, for everyone”
Uber Mission
![Page 3: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/3.jpg)
About Us
● Hadoop team of Data Infrastructure at Uber● Schema systems● HDFS data lake● Analytics engines on Hadoop● Spark computing framework and toolings
![Page 4: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/4.jpg)
Execution Environment Complexity
![Page 5: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/5.jpg)
Cluster Sizes
20 times
![Page 6: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/6.jpg)
YARN Mesos
![Page 7: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/7.jpg)
Docker JVM
![Page 8: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/8.jpg)
Parquet ORC
Sequence Text
![Page 9: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/9.jpg)
Home Built Services
Hive Kafka ELK
![Page 10: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/10.jpg)
Consequence:Pretty hard for beginners, sometimes hard for experienced users too.
![Page 11: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/11.jpg)
Goals:Multi-Platform: Abstract out environmentSelf-Service: Create and run Spark jobs super easilyReliability: Prevent harm to infrastructure systems
![Page 12: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/12.jpg)
Engineers SRE
API
Tools
![Page 13: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/13.jpg)
Engineers SRE
API
Tools
EasySelf-Service
Multi-Platform
No HarmReliability
![Page 14: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/14.jpg)
• SCBuilder• Kafka dispersal
• SparkPlug
Engineers SRE
API
Tools
![Page 15: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/15.jpg)
SCBuilderEncapsulate cluster environment details
● Builder Pattern for SparkContext● Incentive for users:
○ performance optimized (default can’t pass 100GB)○ debug optimized (history server, event logs)○ don’t need to ask around YARN, history servers, HDFS configs
● Best practices enforcement:○ SRE approved CPU and memory settings○ resource efficient serialization
![Page 16: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/16.jpg)
Kafka DispersalKafka as data sink of RDD result
● Incentive for users:○ RDD as first class citizen => parallelization○ built-in HA
● Best practices enforcement:○ rate limiting○ message integrity by schema○ bad messages tracking
publish(data: RDD, topic: String, schemaId: Int, appId: String)
![Page 17: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/17.jpg)
SparkPlugKickstart job development
● A collection of popular job templates○ Two commands to run the first job in Dev
● One use case per template○ e.g. Ozzie + SparkSQL + Incremental processing○ e.g. Incremental processing + Kafka dispersal
● Best Practices○ built-in unit tests, test coverage, Jenkins○ built-in Kafka, HDFS mocks
![Page 18: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/18.jpg)
Example Spark Application Data Flow Chart
Query UI
ELK SearchDashboards
Report
HIVE
Topic 1
Topic 1
Topic 1
Topic 2
Topic 2
Topic 2
Topic 3
Topic 3
Topic 4HIVE
SharedDB
HIVECustom
DB
AppWebUI
App
![Page 19: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/19.jpg)
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
TERM_NAME_XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
AB
Title: Job
message
TITLE TITLE TITLE: WORKFLOW ENV NAME
TITLE: WORKFLOW ENV ID TITLE: LOCATION
![Page 20: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/20.jpg)
• Geo-spatial processing• SCBuilder• Kafka dispersal
• SparkPlug
Engineers SRE
API
Tools
![Page 21: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/21.jpg)
Commonly used UDFs
GeoSpatial UDF
within(trip_location, city_shape) contains(geofence, auto_location)
Find if a car is inside a city Find all autos in one area
![Page 22: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/22.jpg)
Commonly used UDFs
GeoSpatial UDF
overlaps(trip1, trip2) intersects(trip_location, gas_location)
Find trips that have similar routes
Find all gas stations a trip route has passed by
![Page 23: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/23.jpg)
Objective: associate all trips with city_id for a single day.
SELECT trip.trip_id, city.city_id
FROM trip JOIN city
WHERE contains(city.city_shape, trip.start_location)
AND trip.datestr = ‘2016-06-07’
Spatial JoinCommon query at Uber
![Page 24: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/24.jpg)
Spatial JoinProblem
It takes nearly ONE WEEK to run at Uber’s data scale.
1. Spark does not have broadcast join optimization for non-equation join.
2. Not scalable, only one executor is used for cartesian join.
![Page 25: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/25.jpg)
Spatial Join
Build a UDF to broadcast geo-spatial index
![Page 26: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/26.jpg)
Spatial JoinRuntime Index Generation
Index data is small but change often (city table)
Get fields from geo tables (city_id and city_shape)
Build QuadTree or RTree index at Spark Driver
1. Build Index
![Page 27: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/27.jpg)
Spatial JoinExecutor Execution
UDF code is part of the Spark UDK jar.
⇒ get_city_id(location), returns city_id of a location
Use the broadcasted spatial index for fast spatial retrieval
2. Broadcast Index
![Page 28: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/28.jpg)
Spatial JoinRuntime UDF Generation
SELECT
trip_id, get_city_id(start_location)
FROM
trip
WHERE
datestr = ‘2016-06-07’
3. Rewrite Query (2 mins only! compared to 1 week before)
![Page 29: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/29.jpg)
• Geo-spatial processing• SCBuilder• Kafka dispersal
• SparkChamber• SparkPlug
Engineers SRE
API
Tools
![Page 30: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/30.jpg)
Spark Debugging
1. Tons of local log files across many machines.
2. Overall file size is huge and difficult to be handled by a single machine.
3. Painful for debugging, which log is useful?
![Page 31: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/31.jpg)
Spark ChamberDistributed Log Debugger for Spark
Extend Spark Shell by Hooks.
Easy to adopt for Spark developers.
Interactive
![Page 32: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/32.jpg)
Spark Chamber Session
![Page 33: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/33.jpg)
my_user_name
![Page 34: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/34.jpg)
![Page 35: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/35.jpg)
![Page 36: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/36.jpg)
![Page 37: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/37.jpg)
![Page 38: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/38.jpg)
Spark ChamberDistributed Log Debugger for Spark
1. Get all recent Spark Application IDs.
2. Get first exception, all exceptions grouped by types sorted by time, etc.
3. Display CPU, memory, I/O metrics.
4. Dive into a specific driver/executor/machine
5. Search
Features
![Page 39: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/39.jpg)
Spark ChamberDistributed Log Debugger for Spark
Developer mode: debug developer’s own Spark job.
SRE mode: view and check all users’ Spark job information.
Security
![Page 40: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/40.jpg)
YARN aggregates log files on HDFS Files are named after host names
All application IDs of the same user are under same place.
One machine has one log file, regardless of # executors on that machine.
Spark ChamberEnable Yarn Log Aggregation
/ tmp / logs / username / logs // tmp / logs / username /
username
username
username
username
username
username
username
username
username
username
username
username
username
username
![Page 41: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/41.jpg)
Spark ChamberUse Spark to debug Spark
Extend the Spark Shell by Hooks:
1. For ONE application Id, distribute log files to different executors.
2. Extract each lines and save into DataFrame.
3. Sort log dataframe by time and hostname.
4. Retrieve target log via SparkSQL DataFrame APIs.
![Page 42: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/42.jpg)
• Geo-spatial processing• SCBuilder• Kafka dispersal
• SparkChamber• SparkPlug
• SparkChamber
Engineers SRE
API
Tools
Future Work
![Page 43: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/43.jpg)
Spark ChamberSRE version - Cluster wide insights
● Dimensions - Jobs○ All○ Single team○ Single engineer
● Dimensions - Time○ Last month, week, day
● Dimensions - Hardware○ Specific rack, pod
![Page 44: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/44.jpg)
Spark ChamberSRE version - Analytics and Machine Learning
● Analytics○ Resource auditing○ Data access auditing
● Machine Learning○ Failures diagnostics○ Malicious jobs detection○ Performance optimization
![Page 45: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/45.jpg)
• Geo-spatial processing• SCBuilder• Kafka dispersal• Hive table registration (Didn’t cover today)• Incremental processing (Didn’t cover
today)• Debug logging• Metrics• Configurations• Data Freshness
• Resource usage
• SparkChamber• SparkPlug• Unit testing (Didn’t cover today)• Oozie integration (Didn’t cover today)
• SparkChamber• Resource usage auditing• Data access auditing• Machine learning on jobs
Engineers SRE
API
Tools
Future Work
![Page 46: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/46.jpg)
Today, Tuesday, June 74:50 PM – 5:20 PMRoom: Ballroom B
SPARK: INTERACTIVE TO PRODUCTION
Dara Adib, Uber
![Page 47: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/47.jpg)
Tomorrow, Wednesday, June 85:25 PM – 5:55 PM
Room: Imperial
Locality Sensitive Hashing by Spark
Alain Rodriguez, Fraud Platform, UberKelvin Chu, Hadoop Platform, Uber
![Page 48: Spark Uber Development Kit](https://reader031.fdocuments.us/reader031/viewer/2022030310/58f9a90d760da3da068b6a6f/html5/thumbnails/48.jpg)
Thank you
Proprietary and confidential © 2016 Uber Technologies, Inc. All rights reserved. No part of this document may be
reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or
by any information storage or retrieval systems, without permission in writing from Uber. This document is intended
only for the use of the individual or entity to whom it is addressed and contains information that is privileged,
confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified
that the information contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person
other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.