www.cwi.nl/~boncz/bads
Big Data Infrastructures & Technologies
Spark and MLLIB
OVERVIEW OF SPARK
What is Spark?
• Fast and expressive cluster computing system interoperable with Apache Hadoop
• Improves efficiency through:
  – In-memory computing primitives
  – General computation graphs
• Improves usability through:
  – Rich APIs in Scala, Java, Python
  – Interactive shell
Up to 100× faster (2-10× on disk)
Often 5× less code
The Spark Stack
• Spark is the basis of a wide set of projects in the Berkeley Data Analytics
Stack (BDAS)
[stack diagram: on top of the Spark core sit Spark Streaming (real-time),
GraphX (graph), Spark SQL, MLlib (machine learning), …]
More details: amplab.berkeley.edu
Why a New Programming Model?
• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
– More complex, multi-pass analytics (e.g. ML, graph)
– More interactive ad-hoc queries
– More real-time stream processing
• All 3 need faster data sharing across parallel jobs
Data Sharing in MapReduce

[iterative jobs: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → . . .]
[interactive queries: Input → HDFS read → query 1 / query 2 / query 3 → result 1 / result 2 / result 3 . . .]

Slow due to replication, serialization, and disk IO
Data Sharing in Spark

[iterative jobs: Input → iter. 1 → iter. 2 → . . . via distributed memory]
[interactive queries: Input → one-time processing into distributed memory → query 1 / query 2 / query 3 . . .]

~10× faster than network and disk
Spark Programming Model • Key idea: resilient distributed datasets (RDDs)
– Distributed collections of objects that can be cached in memory
across the cluster
– Manipulated through parallel operators
– Automatically recomputed on failure
• Programming interface
– Functional APIs in Scala, Java, Python
– Interactive use from Scala shell
Lambda Functions
A lambda is an implicit (anonymous) function definition: functional programming!

errors = lines.filter(lambda x: x.startswith("ERROR"))
messages = errors.map(lambda x: x.split('\t')[2])

Equivalent named function (C-style pseudocode):
bool detect_error(string x) {
    return x.startswith("ERROR");
}
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                      # base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # action
messages.filter(lambda x: "bar" in x).count()
. . .

[diagram: the Driver ships tasks to Workers; each Worker reads its HDFS block
(Block 1-3), computes and caches messages (Cache 1-3), and returns results]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
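A plain-Python sketch of this filter/map pipeline (the log lines and their tab-separated layout are invented for illustration; eager local lists stand in for Spark's lazy, distributed RDDs):

```python
# Plain-Python analogue of the log-mining pipeline above (no Spark needed).
log_lines = [
    "ERROR\t2014-01-01\tdisk failure on node7",
    "INFO\t2014-01-01\tjob started",
    "ERROR\t2014-01-02\tfoo service timeout",
]

errors = [x for x in log_lines if x.startswith("ERROR")]   # filter
messages = [x.split('\t')[2] for x in errors]              # map
# messages.cache() has no analogue here: a Python list is already materialized

foo_count = len([m for m in messages if "foo" in m])       # action: count
print(foo_count)  # → 1
```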
Fault Tolerance
RDDs track lineage info to rebuild lost data

file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda pair: pair[1] > 10)   # pair = (type, count)

[lineage graph: Input file → map → reduce → filter]
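To make lineage-based recovery concrete, here is a toy sketch (an assumption-laden illustration, not Spark's actual implementation): each dataset remembers its parent and the function applied, so a lost result can be recomputed from the source.

```python
# Toy lineage tracking: a lost "partition" is rebuilt by re-running the
# recorded transformation chain from its parent.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent, self.fn = parent, fn
        self._data = data            # None until computed (or after "loss")

    def map(self, fn):
        return ToyRDD(parent=self, fn=lambda part: [fn(x) for x in part])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda part: [x for x in part if pred(x)])

    def compute(self):
        if self._data is None:       # recompute from lineage if missing
            self._data = self.fn(self.parent.compute())
        return self._data

source = ToyRDD(data=[1, 2, 3, 4, 5])
result = source.map(lambda x: x * 10).filter(lambda x: x > 20)
print(result.compute())   # → [30, 40, 50]

result._data = None       # simulate losing the computed data
print(result.compute())   # recovered via lineage → [30, 40, 50]
```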
Example: Logistic Regression

[chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30);
Hadoop: 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s]
Spark in Scala and Java

// Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
Supported Operators
• map, filter, groupBy, sort, union
• join, leftOuterJoin, rightOuterJoin
• reduce, count, fold
• reduceByKey, groupByKey, cogroup
• cross, zip, sample, take, first
• partitionBy, mapWith, pipe, save, ...
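A hedged sketch of what two of these key-based operators compute, emulated on a plain list of (key, value) pairs (no Spark, no partitioning or shuffle shown):

```python
# Emulating groupByKey and reduceByKey semantics on local (key, value) pairs.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

def group_by_key(kv):
    groups = defaultdict(list)        # collect all values per key
    for k, v in kv:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(kv, fn):
    out = {}                          # fold values per key as they arrive
    for k, v in kv:
        out[k] = fn(out[k], v) if k in out else v
    return out

print(group_by_key(pairs))                         # {'a': [1, 3, 5], 'b': [2, 4]}
print(reduce_by_key(pairs, lambda x, y: x + y))    # {'a': 9, 'b': 6}
```

In Spark, reduceByKey is preferred over groupByKey for aggregations because partial sums can be combined before the shuffle.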
Software Components
• Spark client is a library in the user program (1 instance per app)
• Runs tasks locally or on a cluster
  – Mesos, YARN, standalone mode
• Accesses storage systems via the Hadoop InputFormat API
  – Can use HBase, HDFS, S3, …

[diagram: Your application → SparkContext → local threads / cluster manager →
Workers running Spark executors, on top of HDFS or other storage]
Task Scheduler
• General task graphs
• Automatically pipelines functions
• Data locality aware
• Partitioning aware to avoid shuffles

[diagram: a three-stage task graph (Stage 1-3) over RDDs A-F with map, filter,
groupBy, and join operators; cached partitions are marked and skipped]
Spark SQL
• Columnar SQL analytics engine for Spark
  – Supports both SQL and complex analytics
  – Up to 100× faster than Apache Hive
• Compatible with Apache Hive
  – HiveQL, UDF/UDAF, SerDes, Scripts
  – Runs on existing Hive warehouses
• In use at Yahoo! for fast in-memory OLAP
Hive Architecture

[diagram: Client (CLI, JDBC) → Driver (SQL Parser → Query Optimizer →
Physical Plan Execution) → MapReduce → HDFS, with the Hive Catalog as metadata store]
Spark SQL Architecture

[diagram: same as Hive, but Physical Plan Execution runs on Spark with a
Cache Manager instead of MapReduce]

[Engle et al., SIGMOD 2012]
What Makes it Faster?
• Lower-latency engine (Spark is fine with 0.5 s jobs)
• Support for general DAGs
• Column-oriented storage and compression
• New optimizations (e.g. map pruning)
Other Spark Stack Projects
• Spark Streaming: stateful, fault-tolerant stream processing (out since Spark 0.7)

  sc.twitterStream(...)
    .flatMap(_.getText.split(" "))
    .map(word => (word, 1))
    .reduceByWindow("5s", _ + _)

• MLlib: library of high-quality machine learning algorithms (out since 0.8)
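A plain-Python sketch of the windowed word count above: tweets arrive in 1-second micro-batches, and counts are aggregated over a sliding 5-second window (the batch contents are invented for illustration):

```python
# Emulating flatMap + map + reduceByWindow("5s", _ + _) with a bounded deque
# of per-batch word counters; the oldest batch falls out of the window
# automatically once it is full.
from collections import Counter, deque

WINDOW_BATCHES = 5                      # a "5s" window at 1 batch per second
window = deque(maxlen=WINDOW_BATCHES)

def on_batch(tweets):
    # flatMap: split tweets into words; map + count: one Counter per batch
    window.append(Counter(w for t in tweets for w in t.split(" ")))
    return sum(window, Counter())       # reduceByWindow: merge batch counts

on_batch(["spark is fast", "spark streaming"])
counts = on_batch(["hello spark"])
print(counts["spark"])   # → 3
```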
Performance

[SQL chart: response time (s, 0-25) for Impala (disk), Impala (mem), Redshift,
Spark SQL (disk), Spark SQL (mem)]
[Streaming chart: throughput (MB/s/node, 0-35) for Storm vs. Spark]
[Graph chart: response time (min, 0-30) for Hadoop, Giraph, GraphX]
What it Means for Users
• Separate frameworks: HDFS read → ETL → HDFS write → HDFS read → train →
  HDFS write → HDFS read → query → HDFS write → …
• Spark: HDFS read → ETL → train → query (no intermediate HDFS writes)
Conclusion
• Big data analytics is evolving to include:
– More complex analytics (e.g. machine learning)
– More interactive ad-hoc queries
– More real-time stream processing
• Spark is a fast platform that unifies these apps
• More info: spark-project.org
SPARK MLLIB
What is MLlib?
MLlib is a Spark subproject providing machine learning primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
What is MLlib? Algorithms:
• classification: logistic regression, linear support vector machine (SVM), naive Bayes
• regression: generalized linear regression (GLM)
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
Collaborative Filtering

Alternating Least Squares (ALS)
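A hedged toy of the ALS idea with rank-1 factors (real MLlib ALS uses higher rank and regularization; the tiny 2×2 ratings matrix is invented): holding the item factors fixed, each user factor has a closed-form least-squares solution, and vice versa, so the two solves are alternated until the factors settle.

```python
# Rank-1 alternating least squares on a tiny ratings dict keyed by
# (user, item); predictions are the product of user and item factors.
ratings = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 4.0, (1, 1): 1.0}
n_users, n_items = 2, 2
u = [1.0] * n_users        # user factors
v = [1.0] * n_items        # item factors

for _ in range(20):        # alternate closed-form least-squares solves
    for i in range(n_users):   # fix v, solve for u[i]
        num = sum(r * v[j] for (ui, j), r in ratings.items() if ui == i)
        den = sum(v[j] ** 2 for (ui, j), _ in ratings.items() if ui == i)
        u[i] = num / den
    for j in range(n_items):   # fix u, solve for v[j]
        num = sum(r * u[i] for (i, vj), r in ratings.items() if vj == j)
        den = sum(u[i] ** 2 for (i, vj), _ in ratings.items() if vj == j)
        v[j] = num / den

pred = u[0] * v[1]         # predicted rating for user 0, item 1
print(round(pred, 2))
```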
Collaborative Filtering in Spark MLlib

trainset = (
    sc.textFile("s3n://bads-music-dataset/train_*.gz")
      .map(lambda l: l.split('\t'))
      .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

model = ALS.train(trainset, rank=10, iterations=10)  # train

testset = (  # load testing set
    sc.textFile("s3n://bads-music-dataset/test_*.gz")
      .map(lambda l: l.split('\t'))
      .map(lambda l: Rating(int(l[0]), int(l[1]), int(l[2]))))

# apply model to testing set (only first two cols) to predict
predictions = (
    model.predictAll(testset.map(lambda p: (p[0], p[1])))
         .map(lambda r: ((r[0], r[1]), r[2])))
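Once predictions are keyed by (user, item) as above, model quality is commonly reported as RMSE against the held-out ratings. A plain-Python version of that join-and-score step (the tiny dicts are invented data; in Spark the inner join would be expressed with RDD.join):

```python
# Compute RMSE over the (user, item) keys present in both actual ratings
# and model predictions.
import math

actual = {(1, 10): 4.0, (1, 11): 2.0, (2, 10): 5.0}
predicted = {(1, 10): 3.5, (1, 11): 2.5, (2, 10): 5.0}

# inner join on (user, item), then squared error per matched pair
errs = [(actual[k] - predicted[k]) ** 2 for k in actual if k in predicted]
rmse = math.sqrt(sum(errs) / len(errs))
print(round(rmse, 3))  # → 0.408
```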
Spark MLlib – ALS Performance

System     Wall-clock time (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481

• Dataset: Netflix data
• Cluster: 9 machines
• MLlib is an order of magnitude faster than Mahout
• MLlib is within a factor of 2 of GraphLab
Spark Implementation of ALS
• Workers load data
• Models are instantiated at workers
• At each iteration, models are shared via join between workers
• Good scalability
• Works on large datasets

[diagram: Master coordinating Workers]
Spark SQL + MLlib
MLlib Pointers
• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• Github: https://github.com/apache/spark
• Mailing lists: [email protected], [email protected]