Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)
Uploaded by: spark-summit
![Page 1: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/1.jpg)
Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications
Kelvin Chu @ Uber
![Page 2: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/2.jpg)
About Myself
• Started with Spark 0.7
• Co-created Spark Job Server at Ooyala
• Working at Uber since August 2014
![Page 3: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/3.jpg)
About Uber
• Founded in 2010
• One Tap to Request a Ride
• Build Software Platform for Driver Partners and Riders
![Page 4: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/4.jpg)
• 311 Cities
• 58 Countries
• Hundreds of thousands of driver partners
• Millions of riders
• 1+ million trips around the world every day
![Page 5: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/5.jpg)
Data Platform Team
• Second Engineer
• Part of Data Engineering
• Members with diverse backgrounds in Hadoop, HBase, Oozie, Spark, Voldemort, YARN, etc.
![Page 6: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/6.jpg)
Data Lake
![Page 7: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/7.jpg)
Sqoop on Spark for Data Ingestion
5:45pm Today Room 3
Veena Basavaraj (Uber), Vinoth Chandar (Uber)
![Page 8: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/8.jpg)
Challenges
• Shared by Many Teams
  • Different technical backgrounds
  • Producers
  • Consumers
• Many Use Cases
  • Different SLAs
![Page 9: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/9.jpg)
Spark YARN
Parquet
![Page 10: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/10.jpg)
Why Spark?
• Easy to Use
• Ecosystem
  • Batch jobs
  • SparkSQL
  • MLlib
![Page 11: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/11.jpg)
YARN
• Resource Management
  • Allocation
  • Teams/Jobs Isolation
  • Cluster Optimization
• Hadoop Kerberos Security
![Page 12: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/12.jpg)
[Figure: Spark Jobs, Resource Scheduler (Placement, Optimization), Real Machines]
![Page 13: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/13.jpg)
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
![Page 14: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/14.jpg)
Resource Queues
• Resource Isolation
  • CPU & Memory
  • I/O in the future
• Hierarchical queues
• Priorities as Weights
• Allocate different teams and users to queues
• Queue placement policies
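The "Priorities as Weights" idea can be illustrated with a small, self-contained sketch. This is not YARN's scheduler code; it only shows how weights could translate into proportional shares of a resource. The object name `QueueShares`, the queue names, and the memory figure are all hypothetical.

```scala
// Hypothetical illustration only: not YARN's Fair Scheduler implementation.
object QueueShares {
  // A queue with an illustrative name and a scheduling weight.
  final case class Queue(name: String, weight: Double)

  // Split `totalMemoryMb` across queues in proportion to their weights.
  // Fractions of a megabyte are truncated.
  def fairShares(queues: Seq[Queue], totalMemoryMb: Long): Map[String, Long] = {
    val totalWeight = queues.map(_.weight).sum
    require(totalWeight > 0, "at least one queue must have a positive weight")
    queues.map(q => q.name -> (totalMemoryMb * q.weight / totalWeight).toLong).toMap
  }
}
```

With weights 2, 1 and 1, the first queue would receive half of the memory and the other two a quarter each. A real scheduler also redistributes unused capacity and enforces minimum and maximum limits, which this sketch leaves out.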
![Page 15: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/15.jpg)
High Availability
• Cluster Mode
  • Spark Context in Application Master
• Automatic Retry
  • Default: Once
• Executor failure handled by Spark
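As a rough analogy for "Automatic Retry, Default: Once", the sketch below re-runs a failed job a bounded number of times. In the setup described on the slide the retry is performed by YARN re-launching the Application Master, not by user code; the `Retry` object and its `withRetries` helper are hypothetical names used only for illustration.

```scala
import scala.util.{Failure, Try}

// Hypothetical helper: an analogy for a bounded automatic retry.
object Retry {
  // Run `job`; if it throws, re-run it up to `maxRetries` more times.
  // Returns the first success, or the last failure once retries are exhausted.
  @annotation.tailrec
  def withRetries[A](maxRetries: Int)(job: () => A): Try[A] =
    Try(job()) match {
      case Failure(_) if maxRetries > 0 => withRetries(maxRetries - 1)(job)
      case result                       => result
    }
}
```

Calling `Retry.withRetries(1)(job)` mirrors the "once" default: one initial attempt plus a single retry.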
![Page 16: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/16.jpg)
HA Tests Passed
• Kill active YARN Resource Manager
• Kill YARN Node Manager
• Kill the job Application Master
• Kill random Spark executors
• Kill YARN history server
• Kill Spark history server
• Results:
  • Existing Spark jobs finished
  • New jobs can be submitted
![Page 17: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/17.jpg)
SPARK-6751: use version 1.3+ or set the flag spark.eventLog.overwrite
![Page 18: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/18.jpg)
Security
• Critical in Multi-Tenancy
• Only cluster manager
• Hadoop Kerberos Security
• Authentication
• Authorization handled by HDFS
![Page 19: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/19.jpg)
• SPARK-5342
  • Delegation tokens expire in 7 days
  • Spark Streaming
  • Resolved in v1.4
• SPARK-5111
  • HiveContext
![Page 20: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/20.jpg)
![Page 21: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/21.jpg)
![Page 22: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/22.jpg)
Data Locality
• Executors are started before the input data is located
  • No Data Locality
• Pass data locations to SparkContext

val locations = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(new Configuration(), classOf[ParquetInputFormat[_]], new Path("..."))))
val sc = new SparkContext(conf, locations)

Second argument: the preferred locations
![Page 23: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/23.jpg)
Parquet
• Schema
• Columnar file format
  • Column pruning
  • Filter predicate push down
• Strong Spark support
  • SparkSQL
  • ParquetInputFormat
![Page 24: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/24.jpg)
Schema
• Contract
  • Multiple teams
  • Producers
  • Consumers
• Data to persist in a typed manner
  • Analytics
• Serve as documentation
  • Develop new applications faster
• Prevent a lot of bugs
![Page 25: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/25.jpg)
Schema Evolution
• Schema merging in Spark v1.3
  • SparkSQL
• Schema evolution
  • Merge old and new compatible versions
  • No “Alter table …”
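To make "merge old and new compatible versions" concrete, here is a minimal sketch of one possible compatibility rule: two schema versions merge if every field they share has the same type, and the result is the union of their fields. This is not Spark's implementation. The `SchemaMerge` object, the flat name-to-type model, and the field names in the usage note are simplifying assumptions.

```scala
// Hypothetical illustration only: not how SparkSQL merges Parquet schemas.
object SchemaMerge {
  // A schema is modelled as field name -> type name (a deliberate simplification).
  type Schema = Map[String, String]

  // Union the fields of both versions; fail if a shared field changed its type.
  def merge(older: Schema, newer: Schema): Either[String, Schema] = {
    val conflicts = older.keySet.intersect(newer.keySet).filter(f => older(f) != newer(f))
    if (conflicts.nonEmpty)
      Left("incompatible types for: " + conflicts.toSeq.sorted.mkString(", "))
    else
      Right(older ++ newer)
  }
}
```

For example, merging `Map("id" -> "long", "city" -> "string")` with `Map("id" -> "long", "fare" -> "double")` yields a three-field schema, while changing `id` from `long` to `string` is rejected.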
![Page 26: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/26.jpg)
Schema Tools
• Big Investment
• Services
  • Creating and retrieving schema
  • Validating schema evolution
• Libraries for producers and consumers
  • Multiple languages
![Page 27: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/27.jpg)
Speed
2 to 4 times FASTER
![Page 28: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/28.jpg)
• Columnar file format
• Column pruning
• Wide Table
• Filter predicate push down
• Compression
![Page 29: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/29.jpg)
Spark UDK
• Uber Development Kit
• Specific to Uber Environment
• Help users get their jobs up and running quickly
• UDK doesn't wrap Spark API
  • We embrace it!
![Page 30: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/30.jpg)
Template Class
• Memory
  • executor-memory
  • driver-memory
  • spark.yarn.executor.memoryOverhead
  • spark.yarn.driver.memoryOverhead
  • spark.kryoserializer.buffer.max.mb
  • spark.driver.maxResultSize
• CPU
  • num-executors
  • executor-cores
• High Availability
  • spark.eventLog.overwrite
• spark.serializer set to org.apache.spark.serializer.KryoSerializer
• spark.speculation
• parquet.enable.binaryString
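One way such a template could be organized is as a map of default settings that individual jobs override. The sketch below is an assumption about the shape of such a template, not the UDK's actual code. The `JobTemplate` object is a hypothetical name and every value is a placeholder. The property keys are standard Spark 1.x names, with the `spark-submit` flags from the slide written as their configuration-property equivalents (for example `executor-memory` as `spark.executor.memory`).

```scala
// Hypothetical template: all values are placeholders, not Uber's settings.
object JobTemplate {
  val defaults: Map[String, String] = Map(
    // Memory
    "spark.executor.memory"              -> "4g",
    "spark.driver.memory"                -> "2g",
    "spark.yarn.executor.memoryOverhead" -> "512",
    // CPU
    "spark.executor.instances"           -> "10",
    "spark.executor.cores"               -> "2",
    // High availability
    "spark.eventLog.overwrite"           -> "true",
    // Serialization and speculation
    "spark.serializer"                   -> "org.apache.spark.serializer.KryoSerializer",
    "spark.speculation"                  -> "true"
  )

  // Job-specific overrides win over the template defaults.
  def settings(overrides: Map[String, String] = Map.empty): Map[String, String] =
    defaults ++ overrides
}
```

The resulting map could then be applied to a job's configuration, for instance with `new SparkConf().setAll(JobTemplate.settings(Map("spark.executor.memory" -> "8g")))`.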
![Page 31: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/31.jpg)
• Default for Uber environment
  • e.g. HBase
• Default high performance and failover settings
• Specific Spark version
• Data store API
• API for common computation
• UDF
• Logging
![Page 32: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/32.jpg)
Uber Use Cases
![Page 33: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/33.jpg)
Inference | Cleaning | Parquet
• ETL
  • JSON in gzip
  • Avro
• Schema Inference
  • SparkSQL
![Page 34: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/34.jpg)
• Data Cleaning by Inferred Schema
• Conversion to Parquet
• Validation
  • Sampling
  • SparkSQL
![Page 35: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/35.jpg)
Analytics
• SparkSQL on Data Lake
• Business metrics
• Data validation
• Spark Job Server
  • Caching for multiple queries via REST
![Page 36: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/36.jpg)
MLlib
• Decision Tree
  • Random Forest
  • Gradient-Boosted Trees
• K-Means
![Page 37: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/37.jpg)
• Powerful Algorithms in Many Areas
• Easy-to-use API
• SPARK-3727: More prediction functionality
  • Estimated probability
  • Multiple ways of aggregating predictions
![Page 38: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/38.jpg)
Spatial Analysis
![Page 39: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/39.jpg)
![Page 40: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/40.jpg)
Summary
• Motivation
• YARN
• Parquet
• Some Use Cases
![Page 41: Spark as a Platform to Support Multi-Tenancy and Many Kinds of Data Applications-(Kelvin Chu, Uber)](https://reader030.fdocuments.us/reader030/viewer/2022032506/55ce9da7bb61ebb5288b45fe/html5/thumbnails/41.jpg)
Spark Job Server Community Gathering
Today. You are welcome to join us!