End-to-End Data Pipelines with Apache Spark
![Page 1: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/1.jpg)
End-to-End Data Pipelines with Apache Spark
Burak Yavuz
December 27, 2015
![Page 2: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/2.jpg)
Who Am I?
• Software Engineer at Databricks
• MS Management Science & Eng. @ Stanford University
• BS Mechanical Eng. @ Bogazici University, Istanbul
• Contributor to Spark Core, MLlib, SQL, and Streaming
• Maintainer of Spark Packages
![Page 3: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/3.jpg)
Outline
• Intro - Spark & Ecosystem
• Build an End-to-End Data Product
  • Step 1: Understand your Data
    • SparkSQL - DataFrames
  • Step 2: Build your Service
    • SparkMLlib - ML Pipelines
  • Step 3: Monitor your Service
    • Spark Streaming
    • Kafka
![Page 4: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/4.jpg)
Timeline of Spark
• 2010: a research paper
• 2010-13: a project under github/mesos
• 2013-14: Apache incubating -> TLP
• 2014: the most active project in the ASF
![Page 5: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/5.jpg)
Apache Spark
![Page 6: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/6.jpg)
Spark Ecosystem
• 770 contributors
• 6000+ forks on GitHub
• 14000+ commits!
https://github.com/apache/spark
![Page 7: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/7.jpg)
http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
![Page 8: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/8.jpg)
http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
![Page 9: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/9.jpg)
http://go.databricks.com/hubfs/DataBricks_Surveys_-_Content/Spark-Survey-2015-Infographic.pdf
![Page 10: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/10.jpg)
![Page 11: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/11.jpg)
Spark Packages
http://spark-packages.org
• a community index of 3rd-party packages
• helps users find packages
• helps package developers meet users
• users provide feedback through voting and commenting
• index maintained by Databricks
[Diagram labels: 3rd Party Packages, Community]
![Page 12: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/12.jpg)
Types of Packages Currently Available
• Data Source Connectors
  • spark-avro, spark-redshift, spark-mongodb, spark-sequoiadb, spark-cassandra-connector, …
• Deployment Scripts
  • spark_azure, spark_gce, sbt-spark-ec2
• Machine Learning Algorithms
  • spark-hash, spark-mrmr-feature-selection, streaming-matrix-factorization, generalized-kmeans-clustering
• and many more…
![Page 13: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/13.jpg)
What’s new in Spark 1.6
• Dataset API
• Automatic memory configuration
• Optimized state storage in Spark Streaming
• Pipeline persistence in Spark ML
![Page 14: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/14.jpg)
Demo
Source Code: http://brkyvz.github.io/spark-pipeline
Scenario: As an e-commerce company, we would like to recommend products that users may like in order to increase sales and profit.
Dataset: http://jmcauley.ucsd.edu/data/amazon/ - 18 GB - 82.83 million reviews
We will use a subset with 24 million reviews
![Page 15: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/15.jpg)
![Page 16: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/16.jpg)
![Page 17: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/17.jpg)
Recommendation Engines
• Finding Similar Items
  • Clustering using:
    • Metadata
    • Matrix Factorization
  • Frequent Itemsets
• Ranking
  • Rating Prediction using:
    • Matrix Factorization
![Page 18: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/18.jpg)
Architecture
[Diagram. Components: Web Service 1, Web Service 2, Web Service 3, Cassandra, Database, Spark, ML Model. Flow labels: Sales Data, Rating Data, Sales + Ratings, Recommendations, Request]
![Page 19: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/19.jpg)
Step 1: Understand your Data
![Page 20: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/20.jpg)
Step 2: Build your Service
![Page 21: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/21.jpg)
Solution Proposal
Use Matrix Factorization to understand customers and items.
Then:
1) Predict the rating for a product for a given user
2) Find similar products, and show top k
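The two steps above can be sketched in plain Python, without Spark. This is only an illustration under stated assumptions: the ratings list, user names, and item names are hypothetical, and the factors are trained with simple stochastic gradient descent, whereas Spark MLlib's recommender uses alternating least squares. It is not the demo's code.

```python
import math
import random

# Hypothetical toy data: (user, item, rating) triples on a 1-5 scale.
RATINGS = [
    ("alice", "laptop", 5.0), ("alice", "mouse", 5.0),
    ("alice", "novel", 1.0), ("alice", "poetry", 1.0),
    ("bob", "laptop", 5.0), ("bob", "mouse", 5.0),
    ("bob", "novel", 1.0), ("bob", "poetry", 1.0),
    ("carol", "laptop", 1.0), ("carol", "mouse", 1.0),
    ("carol", "novel", 5.0), ("carol", "poetry", 5.0),
    ("dave", "laptop", 1.0), ("dave", "mouse", 1.0),
    ("dave", "novel", 5.0),
    # ("dave", "poetry") is deliberately missing: it is the pair we predict.
]


def factorize(ratings, rank=2, epochs=300, lr=0.02, reg=0.02, seed=0):
    """Learn one factor vector per user and per item with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    user_f = {u: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for u in users}
    item_f = {i: [rng.uniform(-0.1, 0.1) for _ in range(rank)] for i in items}
    for _ in range(epochs):
        for user, item, rating in ratings:
            pu, qi = user_f[user], item_f[item]
            err = rating - sum(a * b for a, b in zip(pu, qi))
            for k in range(rank):
                pu_k, qi_k = pu[k], qi[k]
                # Gradient step on squared error with L2 regularization.
                pu[k] += lr * (err * qi_k - reg * pu_k)
                qi[k] += lr * (err * pu_k - reg * qi_k)
    return user_f, item_f


def predict(user_f, item_f, user, item):
    """Step 1: the predicted rating is the dot product of the two factor vectors."""
    return sum(a * b for a, b in zip(user_f[user], item_f[item]))


def cosine(a, b):
    """Cosine similarity between two factor vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_items(item_f, item, k):
    """Step 2: rank the other items by cosine similarity and keep the top k."""
    scores = [(other, cosine(item_f[item], vec))
              for other, vec in item_f.items() if other != item]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


user_factors, item_factors = factorize(RATINGS)
print("predicted rating, dave/poetry:",
      round(predict(user_factors, item_factors, "dave", "poetry"), 2))
print("items most similar to laptop:", similar_items(item_factors, "laptop", k=2))
```

Both steps reuse the same learned factors: rating prediction pairs a user vector with an item vector, while similarity search compares item vectors with each other.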
![Page 22: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/22.jpg)
Matrix Factorization
https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
![Page 23: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/23.jpg)
Matrix Factorization
https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
![Page 24: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/24.jpg)
https://databricks-training.s3.amazonaws.com/slides/Spark_Summit_MLlib_070214_v2.pdf
![Page 25: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/25.jpg)
Step 3: Monitor your Service
![Page 26: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/26.jpg)
Apache Kafka
• Distributed messaging system
• High-throughput
• Fast
• Scalable
• Durable
• http://kafka.apache.org/
![Page 27: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/27.jpg)
Architecture
[Diagram. Components: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming]
![Page 28: End-to-End Data Pipelines with Apache Spark](https://reader035.fdocuments.us/reader035/viewer/2022062413/58a651571a28ab6e368b6a4b/html5/thumbnails/28.jpg)
Architecture
[Diagram. Components: Web Service 1, Web Service 2, Web Service 3, Kafka, Spark Streaming]