Spark-101
1
Transcript of Spark-101
Apache Spark Introduction
Sameer Kohli
2
What is Apache Spark?
• Apache Spark is a general-purpose, high-performance in-memory execution engine.
• It provides high-level APIs in Scala, Java, and Python. Users can get started with the interactive shells provided for Scala and Python.
• Spark serves as the foundation for many higher-level projects, such as Spark SQL, MLlib (machine learning), GraphX (graph), and Spark Streaming (streaming).
• Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern.
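To make the DAG idea concrete, here is a minimal sketch of a three-step pipeline. It assumes Spark 1.3 or later on the classpath and runs in local mode; the object name, the sample `orders` and `users` data, and the `build` helper are hypothetical and exist only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PipelineSketch {
  // Builds a small multi-step pipeline. Each transformation adds a node to
  // the graph Spark records; nothing executes until an action is called.
  def build(sc: SparkContext): RDD[(String, Double)] = {
    // Hypothetical in-memory inputs standing in for real data sources.
    val orders = sc.parallelize(Seq((1, 20.0), (2, 5.0), (1, 7.5))) // (userId, amount)
    val users  = sc.parallelize(Seq((1, "ann"), (2, "bob")))        // (userId, name)

    val totals = orders.reduceByKey(_ + _)                          // step 1: total per user
    totals.join(users)                                              // step 2: attach names
      .map { case (_, (total, name)) => (name, total) }             // step 3: reshape
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("pipeline-sketch")
    val sc = new SparkContext(conf)
    try {
      val named = build(sc)
      println(named.toDebugString)                     // prints the recorded lineage
      named.collect().sortBy(_._1).foreach(println)    // action: runs the whole graph
    } finally sc.stop()
  }
}
```

The two inputs feed a shared join, so the plan is a graph rather than a linear chain of map and reduce stages.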
3
Role of Apache Spark in the Big Data Eco-system
4
Spark Community Efforts
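Data Sharing in MapReduce
[Figure: iterative job (Input, iter. 1, iter. 2, ...) with an HDFS read and an HDFS write around each iteration; interactive queries (query 1, query 2, query 3 producing result 1, result 2, result 3), each starting with an HDFS read of the input]
Slow due to replication, serialization, and disk I/O
Data Sharing in Spark
[Figure: iterative job (Input, iter. 1, iter. 2, ...) sharing data through distributed memory; interactive queries (query 1, query 2, query 3, ...) served from distributed memory after one-time processing of the input]
10-100× faster than network and disk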
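The iterative case can be sketched as follows: the input is cached once and every iteration reads it from memory instead of going back to storage. This is an illustrative sketch, assuming Spark 1.3 or later in local mode; the object name, the toy data, the learning rate, and the `fitSlope` helper are all hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  // Fits y = w * x by gradient descent over one cached dataset.
  def fitSlope(sc: SparkContext, iterations: Int): Double = {
    // Toy (x, y) points with y = 2x, cached so each iteration reuses them.
    val points = sc.parallelize(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0))).cache()
    val n = points.count()   // first action: materializes the cached copy

    var w = 0.0
    for (_ <- 1 to iterations) {
      val current = w        // copy to a val so the closure captures a plain value
      // Mean gradient of the squared error (w * x - y)^2 with respect to w.
      val grad = points.map { case (x, y) => 2 * (current * x - y) * x }.sum() / n
      w -= 0.05 * grad
    }
    w
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("iterative-sketch")
    val sc = new SparkContext(conf)
    try println(fitSlope(sc, 50)) finally sc.stop()
  }
}
```

In a MapReduce-style setup, each pass of the loop would be a separate job that reads its input from HDFS and writes its output back; here only the small gradient value travels back to the driver.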
7
Spark Programming Model
Key idea: Resilient Distributed Datasets (RDDs)
• Distributed collections of objects that can be cached in memory across cluster nodes.
• Manipulated through various parallel operators.
• Automatically rebuilt on failure.
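The three properties above can be observed in a small experiment. The sketch below assumes Spark 2.0 or later (for `longAccumulator`) in local mode. It uses `unpersist` as a stand-in for losing cached partitions, since a real node failure is hard to stage in a few lines. The object name and the `run` helper are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the number of times the map function had run after each of
  // three actions: first computation, cached read, recomputation.
  def run(sc: SparkContext): (Long, Long, Long) = {
    val evaluations = sc.longAccumulator("evaluations")

    val squares = sc.parallelize(1 to 4, numSlices = 2) // distributed collection
      .map { n => evaluations.add(1); n * n }           // parallel operator
      .cache()                                          // keep results in memory

    squares.count()                     // computes and caches the partitions
    val afterFirst = evaluations.sum

    squares.count()                     // served from the cache
    val afterCached = evaluations.sum

    squares.unpersist(blocking = true)  // drop the cached copy (simulated loss)
    squares.count()                     // rebuilt from the recorded lineage
    val afterRebuild = evaluations.sum

    (afterFirst, afterCached, afterRebuild)
  }

  def main(args: Array[String]): Unit = {
    // Local mode is an assumption so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("lineage-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

With no task retries, the counter should stay flat across the cached read and grow again only after the cached copy is dropped, because Spark replays the recorded transformations rather than restoring a replica.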
Interface
• Clean language-integrated API in Scala.
• Can be used interactively from Scala console.
8
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:
val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))        // transformed RDD: third tab-separated field
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count()
cachedMsgs.filter(_.contains("bar")).count()
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
9
Conclusion
• Spark offers a rich API to make data analytics fast: both fast to write and fast to run.
• Achieves speedups of up to 100× in real applications.
• Growing community with major Fortune 500 companies contributing.
• Details, tutorials, videos: www.spark-project.org
10
Thank You! Questions?