Introduction to Spark (Intern Event Presentation)
Transcript of Introduction to Spark (Intern Event Presentation)
Introduction to Spark
Matei Zaharia Databricks Intern Event, August 2015
What is Apache Spark?
Fast and general computing engine for clusters
Makes it easy and fast to process large datasets
• APIs in Java, Scala, Python, R
• Libraries for SQL, streaming, machine learning, …
• 100x faster than Hadoop MapReduce for some apps
About Databricks
Founded by creators of Spark in 2013
Offers a hosted cloud service built on Spark
• Interactive workspace with notebooks, dashboards, jobs
Community Growth

[Chart: Contributors / Month to Spark, 2010–2015]

Most active open source project in big data
Spark Programming Model
Write programs in terms of transformations on distributed datasets
Resilient Distributed Datasets (RDDs)
• Collections of objects stored in memory or disk across a cluster
• Built via parallel transformations (map, filter, …)
• Automatically rebuilt on failure
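The lineage idea behind RDDs can be sketched in plain Python: transformations only record a recipe for the data, and an action replays that recipe, which is also how a lost partition can be rebuilt after a failure. The `ToyRDD` class below is purely illustrative, not Spark's implementation.

```python
# Toy illustration of RDD lineage: transformations build a recipe,
# actions run it, and the recipe can be replayed after a failure.
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute  # lineage: a function that regenerates the data

    @staticmethod
    def from_list(data):
        return ToyRDD(lambda: iter(list(data)))

    def map(self, f):
        # lazy: no data moves yet, we just extend the lineage
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # an "action": actually runs the whole lineage
        return list(self._compute())

nums = ToyRDD.from_list(range(10))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [0, 4, 16, 36, 64]
```

Calling `collect()` twice recomputes from the source both times; Spark's `cache()` is what short-circuits that replay.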
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
```python
lines = spark.textFile("hdfs://...")                    # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # Transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "MySQL" in s).count()         # Action
messages.filter(lambda s: "Redis" in s).count()
# . . .
```

[Diagram: the Driver sends tasks to three Workers; each Worker reads one block (Block 1–3), builds its cached partition (Cache 1–3), and returns results to the Driver]
Result: full-text search of Wikipedia in 0.5 sec (vs 20s for on-disk data)
Example: Logistic Regression
Iterative algorithm used in machine learning

[Chart: Running Time (s) vs Number of Iterations (1–30), Hadoop vs Spark]

Hadoop: 110 s / iteration
Spark: 80 s first iteration, 1 s further iterations
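To see why this workload rewards in-memory caching, here is a minimal logistic regression by gradient descent in plain Python (toy data, not MLlib): every iteration rescans the full training set, which is exactly the data Spark keeps cached between passes.

```python
import math

# Minimal logistic regression via gradient descent on toy 1-D data
# (feature, bias). Each iteration reads the whole dataset again.
data = [((0.0, 1.0), 0), ((1.0, 1.0), 0), ((3.0, 1.0), 1), ((4.0, 1.0), 1)]
w = [0.0, 0.0]
lr = 0.5

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

for _ in range(200):             # each iteration scans the full dataset
    grad = [0.0, 0.0]
    for x, y in data:
        err = predict(w, x) - y  # prediction error drives the gradient
        for j in range(2):
            grad[j] += err * x[j]
    for j in range(2):
        w[j] -= lr * grad[j]

print([round(predict(w, x)) for x, _ in data])
```

On Hadoop MapReduce, each of those 200 passes would re-read the input from disk; Spark pays the load cost once (the 80 s first iteration) and then iterates over cached data (1 s per further iteration).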
On-Disk Performance: Time to sort 100 TB

2013 Record (Hadoop): 2100 machines, 72 minutes
2014 Record (Spark): 207 machines, 23 minutes

Source: Daytona GraySort benchmark, sortbenchmark.org
Higher-Level Libraries

Built on the Spark core:
• Spark SQL: structured data
• Spark Streaming: real-time
• MLlib: machine learning
• GraphX: graph processing
Higher-Level Libraries
```python
# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
```
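As a sketch of what the `KMeans.train(points, k)` step computes, here is Lloyd's algorithm for k-means in plain Python on toy 2-D points. No Spark is involved; MLlib runs the same alternation of assignment and update steps in parallel across the cluster.

```python
import random

# Lloyd's algorithm: alternate assigning points to their nearest center
# and moving each center to the mean of its assigned points.
def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize from random data points
    for _ in range(iters):
        # assignment step: nearest center for each point
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # update step: move each center to its cluster's mean
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = sorted(kmeans(pts, 2))
print(centers)  # one center near each of the two point clusters
```

Like logistic regression, this is iterative, so the training data benefits from staying cached in memory across passes.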
Demo
Spark Community

Over 1000 production users, clusters up to 8000 nodes

Many talks online at spark-summit.org
Ongoing Work
Speeding up Spark through code generation and binary processing (Project Tungsten)
R interface to Spark (SparkR)
Real-time machine learning library
Frontend and backend work in Databricks (visualization, collaboration, auto-scaling, …)
Thank you. We’re hiring!