Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with...
-
Upload
spark-summit -
Category
Data & Analytics
-
view
479 -
download
3
Transcript of Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with...
![Page 1: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/1.jpg)
LESSONS LEARNED DEVELOPING AND MANAGING MASSIVE (300TB+) APACHE SPARK PIPELINES IN PRODUCTIONBrandon Carl
![Page 2: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/2.jpg)
![Page 3: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/3.jpg)
MARCH 15, 2016
"SEE THE MOMENTS YOU CARE ABOUT FIRST"
![Page 4: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/4.jpg)
![Page 5: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/5.jpg)
MACHINE LEARNING
![Page 6: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/6.jpg)
MACHINE LEARNING LIFECYCLE
Training Examples
Machine Learning
ModelMake
PredictionsMeasure
Outcomes
Ranking Events
Client Events
![Page 7: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/7.jpg)
WHY SPARK?
• Performance • Testability • Modularity • Serialized Logging
![Page 8: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/8.jpg)
SERIALIZED LOGGING
{ "id": 123, "scores": { "modelA": 0.2345, "modelB": 0.0012 }, "features": { 1001: 0.9934, 1002: 0.1923 }}
![Page 9: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/9.jpg)
SERIALIZED LOGGING
struct Candidate { 1: i64 id; 2: map<string, double> scores; 3: map<i64, double> features;}
new Candidate() .setId(id) .setScores(scores) .setFeatures(features)
![Page 10: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/10.jpg)
CHANGES OVER TIME
![Page 11: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/11.jpg)
CHANGES OVER TIME
• RDD • Dataset • Training Data Joiner
![Page 12: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/12.jpg)
TRAINING DATA JOINER
class MyTrainingDataJoiner(spark: SparkSession) extends TrainingDataJoiner {val labels: Map[String, LabelFunction] = ???
}
case class Output(id: Long, label_value: Double)
![Page 13: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/13.jpg)
MANAGING MASSIVE SCALE
![Page 14: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/14.jpg)
MANAGING MASSIVE SCALE - PEOPLE
![Page 15: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/15.jpg)
AUTOMATE EVERYTHING
![Page 16: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/16.jpg)
SIMPLE INTERFACE
![Page 17: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/17.jpg)
SIMPLE INTERFACE
RankingEvent .read('input_table', '2017-10-25') .filter(...) .map(...) .write('output_table', '2017-10-25')
![Page 18: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/18.jpg)
MANAGING MASSIVE SCALE - DATA
![Page 19: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/19.jpg)
PLAN FOR GROWTH
![Page 20: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/20.jpg)
PERSIST TO HDFS
![Page 21: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/21.jpg)
PERSIST TO HDFS
Source Data Map/Filter
Join Output
Source Data Map/Filter
![Page 22: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/22.jpg)
PERSIST TO HDFS
Source Data Map/Filter Temporary
Table
Source Data Map/Filter Temporary
Table
Join Output
![Page 23: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/23.jpg)
KRYO SERIALIZATION
![Page 24: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/24.jpg)
KRYO SERIALIZATION
new SparkConf() .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .set("spark.kryo.registrationRequired", "true") .registerKryoClasses(Array(classOf[...], ...))
![Page 25: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/25.jpg)
BIG-O MATTERS
![Page 26: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/26.jpg)
BIG-O MATTERS
final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))
![Page 27: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/27.jpg)
BIG-O MATTERS
final def withName(s: String): Value = values .find(_.toString == s) .getOrElse(throw new NoSuchElementException(...))
![Page 28: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/28.jpg)
DATA STRUCTURES MATTER
![Page 29: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/29.jpg)
DATA STRUCTURES MATTER
• AnyRefMap • IntMap • LongMap • fastutil (http://fastutil.di.unimi.it)
![Page 30: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/30.jpg)
DATA SKEW MATTERS
![Page 31: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/31.jpg)
TEST ON SAMPLED DATA
![Page 32: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/32.jpg)
![Page 33: Lessons Learned Developing and Managing Massive (300TB+) Apache Spark Pipelines in Production with Brandon Carl](https://reader033.fdocuments.us/reader033/viewer/2022042707/5a1529d07f8b9a65768b4745/html5/thumbnails/33.jpg)