Fast and Expressive Big Data Analytics with Python
Transcript of Fast and Expressive Big Data Analytics with Python
Matei Zaharia
UC Berkeley / MIT
spark-project.org
What is Spark?
Fast and expressive cluster computing system interoperable with Apache Hadoop
Improves efficiency through:
» In-memory computing primitives
» General computation graphs
Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
Project History
Started in 2009, open sourced in 2010
17 companies now contributing code
» Yahoo!, Intel, Adobe, Quantifind, Conviva, Bizo, …
Entered Apache incubator in June
Python API added in February
An Expanding Stack
Spark is the basis for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
Spark Streaming (real-time), Shark (SQL), GraphX (graph), MLbase (machine learning), …
More details: amplab.berkeley.edu
This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out
Why a New Programming Model?
MapReduce simplified big data processing, but users quickly found two problems:
Programmability: tangle of map/reduce functions
Speed: MapReduce is inefficient for apps that share data across multiple steps
» Iterative algorithms, interactive queries
Data Sharing in MapReduce

[Diagram: each iteration reads its input from HDFS and writes its result back to HDFS; likewise, every interactive query re-reads the same input from HDFS before producing its result]

Slow due to data replication and disk I/O
What We’d Like

[Diagram: one-time processing loads the input into distributed memory; iterations and interactive queries then share the data in memory]

10-100× faster than network and disk
Spark Model
Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
» Collections of objects that can be stored in memory or on disk across a cluster
» Built via parallel transformations (map, filter, …)
» Automatically rebuilt on failure
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                    # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))  # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()           # action
messages.filter(lambda s: "bar" in s).count()
. . .

[Diagram: the driver ships tasks to workers; each worker reads its HDFS block, caches its slice of messages, and returns results]

Result: full-text search of Wikipedia in 2 sec (vs 30 sec for on-disk data)
Result: scaled to 1 TB of data in 7 sec (vs 180 sec for on-disk data)
Fault Tolerance
RDDs track the transformations used to build them (their lineage) to recompute lost data:

messages = textFile(...).filter(lambda s: "ERROR" in s)
                        .map(lambda s: s.split("\t")[2])

[Lineage graph: HadoopRDD (path = hdfs://…) → FilteredRDD (func = lambda s: …) → MappedRDD (func = lambda s: …)]
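As a plain-Python sketch of the lineage idea (a toy model, not PySpark's actual classes), each dataset below stores only the function that recomputes it, so collecting replays the chain from the source exactly as recovery of a lost partition would:

```python
class ToyRDD:
    """Toy model of lineage: each RDD stores how to recompute itself."""
    def __init__(self, compute):
        self._compute = compute  # zero-arg function that rebuilds the data

    def map(self, f):
        return ToyRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._compute() if pred(x)])

    def collect(self):
        # No cache here: every collect() replays the lineage from the
        # source, which is what recomputation after a failure does.
        return self._compute()

lines = ToyRDD(lambda: ["ERROR\tfs\tdisk full", "INFO\tok\tfine"])
messages = lines.filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2])
print(messages.collect())  # → ['disk full']
```

Real RDDs additionally cache computed partitions and only fall back to lineage replay when a partition is lost.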
Example: Logistic Regression
Goal: find a line separating two sets of points

[Diagram: + and – points in the plane; a random initial line is iteratively adjusted toward the target separator]
Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w
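To see the loop run end to end without a cluster, here is a local NumPy sketch of the same full-batch update; the synthetic dataset, D, N, the class offset, and the iteration count are made up for the example, and the per-point gradient is the standard one for the logistic loss with labels in {-1, +1}:

```python
import numpy as np

# Local sketch of the logistic-regression loop, without Spark.
# D, N, the 3.0 offset, and iterations are illustrative choices.
np.random.seed(0)
D, N, iterations = 2, 200, 20

# Labels in {-1, +1}; points shifted so the classes are linearly separable.
y = np.where(np.arange(N) < N // 2, 1.0, -1.0)
x = np.random.randn(N, D) + 3.0 * y[:, None]

w = np.random.rand(D)
for i in range(iterations):
    margins = y * x.dot(w)
    # Gradient of the logistic loss for each point, summed over the
    # dataset -- the vectorized analogue of the map(...).reduce(...) step.
    coeffs = (1.0 / (1.0 + np.exp(-margins)) - 1.0) * y
    gradient = (coeffs[:, None] * x).sum(axis=0)
    w -= gradient

print("Final w: %s" % w)
```

On separable data like this, w quickly aligns with the separating direction, so the final w classifies the training points correctly.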
Logistic Regression Performance

[Chart: running time (s) vs. number of iterations (1-30), Hadoop vs. PySpark]
Hadoop: 110 s / iteration
PySpark: 80 s first iteration, 5 s further iterations
Demo
Supported Operators
map, filter, groupBy, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, flatMap, take, first, partitionBy, pipe, distinct, save, ...
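Several of these operators act on RDDs of (key, value) pairs. As an illustration of the semantics only (plain Python, no Spark), this toy reduce_by_key merges values per key the way reduceByKey does:

```python
from operator import add

def reduce_by_key(pairs, func):
    """Pure-Python model of RDD.reduceByKey: merge values per key with func."""
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(reduce_by_key(pairs, add))  # → [('a', 9), ('b', 6)]
```

The real operator does the same merge per partition and then once more after the shuffle, which is why func must be associative.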
Spark Community
1000+ meetup members
60+ contributors
17 companies contributing
This Talk
Spark programming model
Examples
Demo
Implementation
Trying it out
Overview
Spark core is written in Scala
PySpark calls the existing scheduler, cache, and networking layers (a 2K-line wrapper)
No changes to Python

[Diagram: your app drives the PySpark client, which talks to Spark workers; each worker forks Python child processes to run your functions]
Main PySpark author: Josh Rosen
cs.berkeley.edu/~joshrosen
Object Marshaling
Uses the pickle library for both communication and cached data
» Much cheaper than Python objects in RAM
Lambda marshaling library by PiCloud
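A small illustration of the trade-off, using only the standard library (this is not PySpark's internal wire format): pickle turns a record into a compact byte string, but plain pickle refuses lambdas, which is the gap PiCloud's lambda-marshaling library fills:

```python
import pickle

# Round-trip an ordinary record, the way cached RDD elements are serialized.
record = ("ERROR", "2013-06-01", "disk failure")
blob = pickle.dumps(record, protocol=2)
assert pickle.loads(blob) == record
print("%d-byte pickle" % len(blob))

# Plain pickle cannot serialize a lambda; an extended marshaler
# (like PiCloud's) ships the function's code object instead.
try:
    pickle.dumps(lambda s: s.startswith("ERROR"))
    print("lambda pickled")
except Exception as exc:
    print("lambda not picklable:", type(exc).__name__)
```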
Job Scheduler
Supports general operator graphs
Automatically pipelines functions
Aware of data locality and partitioning

[Diagram: a DAG of RDDs A-G connected by map, union, groupBy, and join, split into stages 1-3; cached data partitions are marked]
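The pipelining point can be illustrated in plain Python (a toy sketch, not the scheduler's actual mechanism): consecutive narrow transformations such as two maps are fused into one function, so each partition is traversed once instead of once per operator:

```python
def pipeline(*funcs):
    """Fuse per-element functions so a partition is traversed once."""
    def fused(x):
        for f in funcs:
            x = f(x)
        return x
    return fused

# Two chained maps become a single pass over the partition.
fused = pipeline(lambda s: s.strip(), lambda s: s.upper())
partition = ["  error: disk\n", "  info: ok\n"]
print([fused(s) for s in partition])  # → ['ERROR: DISK', 'INFO: OK']
```

Wide operators like groupBy and join cannot be fused this way because they need data from other partitions, which is why they start new stages in the diagram above.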
Interoperability
Runs in standard CPython, on Linux / Mac
» Works fine with extensions, e.g. NumPy
Input from local file system, NFS, HDFS, S3
» Only text files for now
Works in IPython, including the notebook
Works in doctests – see our tests!
Getting Started
Visit spark-project.org for video tutorials, online exercises, and docs
Easy to run in local mode (multicore), on standalone clusters, or on EC2
Training camp at Berkeley in August (free video): ampcamp.berkeley.edu
Getting Started
Easiest way to learn is the shell:

$ ./pyspark
>>> nums = sc.parallelize([1, 2, 3])  # make RDD from array
>>> nums.count()
3
>>> nums.map(lambda x: 2 * x).collect()
[2, 4, 6]
Conclusion
PySpark provides a fast and simple way to analyze big datasets from Python
Learn more or contribute at spark-project.org
Look for our training camp on August 29-30!
My email: [email protected]
Behavior with Not Enough RAM

[Chart: iteration time (s) vs. % of working set in memory – cache disabled: 68.8; 25%: 58.1; 50%: 40.7; 75%: 29.7; fully cached: 11.5]
The Rest of the Stack
Spark is the foundation for a wide set of projects in the Berkeley Data Analytics Stack (BDAS):
Spark Streaming (real-time), Shark (SQL), GraphX (graph), MLbase (machine learning), …
More details: amplab.berkeley.edu
Performance Comparison

[Charts comparing BDAS components with other systems:
SQL – response time (s): Impala (disk), Impala (mem), Redshift, Shark (disk), Shark (mem)
Streaming – throughput (MB/s/node): Storm, Spark
Graph – response time (min): Hadoop, Giraph, GraphLab, GraphX]