Distributed Computing with Open-Source...
Transcript of Distributed Computing with Open-Source...
![Page 1: Distributed Computing with Open-Source Softwarestanford.edu/~rezab/slides/infosys_reza.pdfDistributed Computing" with Open-Source Software Presented(at(Infosys(OSSmosis(Problem Data](https://reader030.fdocuments.us/reader030/viewer/2022041017/5ecaf6e731e6bc613a32fcd4/html5/thumbnails/1.jpg)
Reza Zadeh
Distributed Computing with Open-Source Software
Presented at Infosys OSSmosis
Problem
Data growing faster than processing speeds
Only solution is to parallelize on large clusters
» Wide use in both enterprises and web industry
How do we program these things?
Outline
Data flow vs. traditional network programming
Limitations of MapReduce
Spark computing engine
Machine Learning Example
Current State of Spark Ecosystem
Data flow vs. traditional network programming
Traditional Network Programming
Message-passing between nodes (e.g. MPI)
Very difficult to do at scale:
» How to split problem across nodes?
• Must consider network & data locality
» How to deal with failures? (inevitable at scale)
» Even worse: stragglers (nodes that have not failed but run slowly)
» Ethernet networking not fast
» Have to write programs per machine
Rarely used in commodity datacenters
Data Flow Models
Restrict the programming interface so that the system can do more automatically
Express jobs as graphs of high-level operators
» System picks how to split each operator into tasks and where to run each task
» Run parts twice for fault recovery
Biggest example: MapReduce
[Diagram: three Map tasks feeding two Reduce tasks]
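As a rough illustration of the map and reduce operators above, here is a minimal single-process sketch in plain Python. It is not Hadoop's API: the names `map_phase`, `shuffle`, `reduce_phase`, `word_mapper` and `sum_reducer` are hypothetical, and a real engine would run each phase as many tasks across machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical user code: only these two functions are job-specific.
def word_mapper(line):
    for word in line.split():
        yield word, 1


def sum_reducer(key, values):
    return sum(values)


lines = ["map reduce map", "reduce map"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

The user supplies only the mapper and the reducer; how the work is split and where it runs is left to the system, which is the restriction the slide describes.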
Example MapReduce Algorithms
Matrix-vector multiplication
Power iteration (e.g. PageRank)
Gradient descent methods
Stochastic SVD
Tall skinny QR
Many others!
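To make the first item concrete, one common way to phrase sparse matrix-vector multiplication as a map step and a reduce step is sketched below in plain Python. The function name `matvec_mapreduce` and the `(row, col, value)` entry format are assumptions for illustration, and the vector is assumed small enough to be available to every mapper.

```python
from collections import defaultdict


def matvec_mapreduce(entries, v):
    """Compute y = A v where A is given as (row, col, value) triples."""
    # Map: each nonzero entry a_ij emits the partial product (i, a_ij * v[j]).
    mapped = ((i, a * v[j]) for i, j, a in entries)
    # Shuffle + reduce: sum the partial products that share a row index.
    sums = defaultdict(float)
    for i, partial in mapped:
        sums[i] += partial
    return dict(sums)


# A = [[1, 2], [3, 4]] stored as triples, v = [1, 1]
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)]
v = [1.0, 1.0]
y = matvec_mapreduce(entries, v)
print(y)
```

Power iteration repeats this product many times, which is where the multi-pass cost of MapReduce starts to matter.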
Why Use a Data Flow Engine?
Ease of programming
» High-level functions instead of message passing
Wide deployment
» More common than MPI, especially “near” data
Scalability to the very largest clusters
Examples: Pig, Hive, Scalding, Storm
Limitations of MapReduce
Limitations of MapReduce
MapReduce is great at one-pass computation, but inefficient for multi-pass algorithms
No efficient primitives for data sharing
» State between steps goes to distributed file system
» Slow due to replication & disk storage
Example: Iterative Apps
[Diagram: iterative job: Input, file system read, iter. 1, file system write, file system read, iter. 2, file system write, . . .]
[Diagram: interactive queries: Input, file system read, query 1 / query 2 / query 3 / . . ., result 1 / result 2 / result 3 / . . .]
Commonly spend 90% of time doing I/O
Result
While MapReduce is simple, it can require asymptotically more communication or I/O
Spark computing engine
Spark Computing Engine
Extends a programming language with a distributed collection data-structure
» “Resilient distributed datasets” (RDD)
Open source at Apache
» Most active community in big data, with 50+ companies contributing
Clean APIs in Java, Scala, Python
Community: SparkR
Resilient Distributed Datasets (RDDs)
Main idea: Resilient Distributed Datasets
» Immutable collections of objects, spread across cluster
» Statically typed: RDD[T] has objects of type T

    val sc = new SparkContext()
    val lines = sc.textFile("log.txt") // RDD[String]

    // Transform using standard collection operations (lazily evaluated)
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split('\t')(2))

    messages.saveAsTextFile("errors.txt") // kicks off a computation
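The split between lazily evaluated transformations and an action that kicks off the computation can be imitated in a few lines of plain Python. This is a toy sketch, not how Spark is implemented: the class `LazyDataset`, the helper `read_log` and the sample log lines are all hypothetical.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, ops=()):
        self._source = source  # zero-argument callable returning an iterable
        self._ops = ops        # tuple of ("map" | "filter", function) steps

    def map(self, f):
        return LazyDataset(self._source, self._ops + (("map", f),))

    def filter(self, f):
        return LazyDataset(self._source, self._ops + (("filter", f),))

    def collect(self):
        """Action: read the source and run every recorded step now."""
        items = iter(self._source())
        for kind, f in self._ops:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)


reads = []  # records each time the (hypothetical) log is actually read


def read_log():
    reads.append(1)
    return ["ERROR\tdisk\tfull", "INFO\tok\tfine", "ERROR\tnet\tdown"]


lines = LazyDataset(read_log)
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
reads_before_action = len(reads)  # nothing has been read yet

result = messages.collect()       # the action triggers the whole pipeline
print(reads_before_action, result)
```

Because each transformation returns a new object and leaves its input untouched, the datasets stay immutable, which matches the first bullet above.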
Key Idea
Resilient Distributed Datasets (RDDs)
» Collections of objects across a cluster with user controlled partitioning & storage (memory, disk, ...)
» Built via parallel transformations (map, filter, …)
» The world only lets you make RDDs such that they can be automatically rebuilt on failure
Python, Java, Scala, R

    // Scala:
    val lines = sc.textFile(...)
    lines.filter(x => x.contains("ERROR")).count()

    // Java (better in Java 8!):
    JavaRDD<String> lines = sc.textFile(...);
    lines.filter(new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("ERROR");
      }
    }).count();
Fault Tolerance
RDDs track lineage info to rebuild lost data

    # Keep the (type, count) pairs whose count exceeds 10
    (file.map(lambda rec: (rec.type, 1))
         .reduceByKey(lambda x, y: x + y)
         .filter(lambda kv: kv[1] > 10))

[Diagram: lineage from the input file through map and reduce]
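A toy model of lineage-based recovery, in plain Python, may help here. It is a sketch under strong assumptions, not Spark's mechanism: the class `Dataset` and its methods `map_partitions`, `get` and `lose` are hypothetical, only per-partition (narrow) dependencies are modeled, and the source data is assumed to live in stable storage.

```python
class Dataset:
    """Toy partitioned dataset that remembers how each partition was derived."""

    def __init__(self, partitions=None, parent=None, fn=None):
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: function applied to each parent partition
        if partitions is not None:
            self.cache = dict(enumerate(partitions))
            self.num_partitions = len(partitions)
        else:
            self.cache = {}
            self.num_partitions = parent.num_partitions

    def map_partitions(self, fn):
        """Derive a new dataset; nothing is computed until a partition is requested."""
        return Dataset(parent=self, fn=fn)

    def get(self, i):
        """Return partition i, recomputing it from the parent if it is missing."""
        if i not in self.cache:
            self.cache[i] = self.fn(self.parent.get(i))
        return self.cache[i]

    def lose(self, i):
        """Simulate a node failure that drops partition i from memory."""
        self.cache.pop(i, None)


source = Dataset(partitions=[[1, 2], [3, 4]])
doubled = source.map_partitions(lambda part: [2 * x for x in part])

before = doubled.get(1)
doubled.lose(1)
lost = 1 not in doubled.cache
after = doubled.get(1)  # rebuilt from lineage, not restored from a replica
print(before, lost, after)
```

Only the lost partition is recomputed; the other partitions and the rest of the job are unaffected.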
Machine Learning example
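Logistic Regression

    import numpy
    from numpy import exp

    # readPoint, D and iterations are assumed to be defined elsewhere
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        # Sum the per-point gradients of the logistic loss
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print("Final w: %s" % w)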
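The same map-then-reduce loop can be tried without a cluster. The sketch below is plain Python on a hypothetical four-point dataset with labels in {-1, +1}; it uses the standard logistic-loss gradient, (1 / (1 + exp(-y w·x)) - 1) · y · x per point, and a fixed step size of 1. The helpers `dot`, `add` and `point_gradient` are illustrative names, not part of any library.

```python
import math
from functools import reduce

# Hypothetical toy data: (features, label) with label in {-1, +1}.
points = [
    ([1.0, 2.0], 1),
    ([2.0, 1.5], 1),
    ([-1.0, -1.5], -1),
    ([-2.0, -1.0], -1),
]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]


def point_gradient(w, x, y):
    """Gradient of log(1 + exp(-y * w.x)) with respect to w for one point."""
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


w = [0.0, 0.0]
for _ in range(20):
    # "map" computes one gradient per point, "reduce" sums them.
    gradient = reduce(add, map(lambda p: point_gradient(w, p[0], p[1]), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(w, predictions)
```

Every iteration rereads the same `points`, which is why keeping the dataset in memory between iterations pays off.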
Logistic Regression Results
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
100 GB of data on 50 m1.xlarge EC2 machines
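A quick back-of-the-envelope check of those per-iteration figures is sketched below. It assumes running time is exactly linear in the number of iterations, which is a simplification, so the totals are extrapolations and not measured values. The function names `hadoop_time` and `spark_time` are hypothetical.

```python
def hadoop_time(iterations, per_iteration=110):
    """Every iteration rereads its input from the file system."""
    return per_iteration * iterations


def spark_time(iterations, first=80, later=1):
    """The first iteration loads and caches the data; later ones reuse it."""
    if iterations <= 0:
        return 0
    return first + later * (iterations - 1)


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_time(n), spark_time(n))
```

Under this model the gap widens with every extra iteration, since the one-time loading cost is paid only once.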
Behavior with Less RAM
[Chart: iteration time (s) vs. % of working set in memory]
0%: 68.8 s
25%: 58.1 s
50%: 40.7 s
75%: 29.7 s
100%: 11.5 s
Benefit for Users
Same engine performs data extraction, model training and interactive queries
[Diagram: with separate engines, each of parse, train and query is wrapped in its own DFS read and DFS write; with Spark, a single DFS read feeds parse, train and query]
State of the Spark ecosystem
Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
[Chart: contributors in past year]
Project Activity
[Chart: commits in the past 6 months for MapReduce, YARN, HDFS, Storm and Spark]
Continuing Growth
[Chart: contributors per month to Spark; source: ohloh.net]
A General Platform
Spark Core
Spark Streaming (real-time)
Spark SQL (structured)
GraphX (graph)
MLlib (machine learning)
Standard libraries included with Spark
Conclusion
Data flow engines are becoming an important platform for numerical algorithms
While early models like MapReduce were inefficient, new ones like Spark close this gap
More info: spark.apache.org