Spark-101

Apache Spark Introduction

Sameer Kohli

What is Apache Spark?

• Apache Spark is a general-purpose, high-performance in-memory execution engine.

• It provides high-level APIs in Scala, Java, and Python. Users can get started with the interactive shell provided for both Scala and Python.

• Spark serves as the foundation for many higher-level projects, such as Spark SQL, MLlib (machine learning), GraphX (graph), and Spark Streaming (streaming).

• Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern.

http://en.wikipedia.org/wiki/Directed_acyclic_graph
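The DAG pattern mentioned above can be illustrated with a toy model in plain Scala. This is only a sketch of the idea, not Spark's scheduler: the names `DagSketch`, `Step`, and `evaluate` are hypothetical, and the data is made up. Each step lists the steps it depends on, and evaluating the final step walks the graph so that a shared parent is computed once.

```scala
import scala.collection.mutable

object DagSketch {
  // A pipeline step: a name, the steps it depends on, and a function
  // that turns the outputs of those dependencies into this step's output.
  final case class Step(name: String, deps: List[Step], run: List[Seq[Int]] => Seq[Int])

  // Evaluate a step by evaluating its dependencies first (depth-first).
  // Finished results are stored by step name, so a parent shared by
  // several children is computed only once.
  def evaluate(step: Step, done: mutable.Map[String, Seq[Int]]): Seq[Int] =
    done.get(step.name) match {
      case Some(result) => result
      case None =>
        val inputs = step.deps.map(dep => evaluate(dep, done))
        val result = step.run(inputs)
        done(step.name) = result
        result
    }

  // A small diamond-shaped graph: source feeds two branches that are merged.
  val source  = Step("source", Nil, _ => Seq(1, 2, 3, 4))
  val evens   = Step("evens", List(source), in => in.head.filter(_ % 2 == 0))
  val squares = Step("squares", List(source), in => in.head.map(x => x * x))
  val merged  = Step("merged", List(evens, squares), in => in.flatten)
}
```

Calling `DagSketch.evaluate(DagSketch.merged, mutable.Map.empty)` runs all four steps in dependency order; nothing runs until the final step is requested.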

Role of Apache Spark in the Big Data Ecosystem

Spark Community Efforts

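Data Sharing in MapReduce

[Figure: iterative jobs (iter. 1, iter. 2, ...) read their input from HDFS and write their output back to HDFS between iterations; interactive queries (query 1, query 2, query 3) each read the input from HDFS to produce result 1, result 2, result 3.]

Slow due to replication, serialization, and disk I/O

Data Sharing in Spark

[Figure: after one-time processing, the input is held in distributed memory and shared across iterations (iter. 1, iter. 2, ...) and across queries (query 1, query 2, query 3).]

10-100× faster than network and disk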

Spark Programming Model

Key idea: Resilient Distributed Datasets (RDDs)

• Distributed collections of objects that can be cached in memory across cluster nodes.

• Manipulated through various parallel operators.

• Automatically rebuilt on failure.
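The "cached in memory" and "automatically rebuilt on failure" properties can be sketched with a toy model in plain Scala. This is an illustration under simplifying assumptions, not Spark's implementation: `LineageSketch`, `Dataset`, `evict`, and `computeCount` are hypothetical names, the data lives on a single machine, and every dataset keeps its result in memory once computed (in Spark, caching is requested explicitly).

```scala
object LineageSketch {
  // A dataset is described by how to compute it (its lineage),
  // not only by the data it currently holds in memory.
  final class Dataset[A](compute: () => Seq[A]) {
    private var cached: Option[Seq[A]] = None
    var computeCount = 0 // how many times this dataset was rebuilt from lineage

    // Derive new datasets; nothing is computed until collect() is called.
    def map[B](f: A => B): Dataset[B] = new Dataset[B](() => collect().map(f))
    def filter(p: A => Boolean): Dataset[A] = new Dataset[A](() => collect().filter(p))

    // Return the in-memory copy if present, otherwise rebuild it from lineage.
    def collect(): Seq[A] = cached.getOrElse {
      computeCount += 1
      val data = compute()
      cached = Some(data)
      data
    }

    // Simulate losing the in-memory copy, for example when a node fails.
    def evict(): Unit = { cached = None }
  }

  val source = new Dataset[Int](() => Seq(1, 2, 3, 4, 5))
  val doubledEvens = source.filter(_ % 2 == 0).map(_ * 2)
}
```

After `doubledEvens.collect()`, a second `collect()` is served from memory; after `doubledEvens.evict()`, the next `collect()` replays the recorded transformations and returns the same result.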

Interface

• Clean language-integrated API in Scala.

• Can be used interactively from the Scala console.

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

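// `spark` is assumed to be an existing SparkContext, as in the Spark shell of that era
val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))         // transformed RDD
val cachedMsgs = messages.cache()                   // keep messages in memory after first use

cachedMsgs.filter(_.contains("foo")).count()
cachedMsgs.filter(_.contains("bar")).count()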

Result: full-text search of Wikipedia in <1 sec (vs. 20 sec for on-disk data)

Result: scaled to 1 TB of data in 5-7 sec (vs. 170 sec for on-disk data)
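The filter/map/count logic of the log-mining example can be tried without a cluster on an ordinary Scala collection. This is a minimal sketch: the sample log lines and the tab-separated layout (level, date, message) are assumptions made for illustration, and `LogMiningSketch` is a hypothetical name. Note that a Scala `Seq` counts with a predicate (`count(p)`), whereas the RDD version filters first and then calls `count()`.

```scala
object LogMiningSketch {
  // Hypothetical sample log: tab-separated level, date, and message.
  val lines: Seq[String] = Seq(
    "ERROR\t2014-01-01\tfoo service timed out",
    "INFO\t2014-01-01\tbar service started",
    "ERROR\t2014-01-02\tbar service crashed",
    "ERROR\t2014-01-03\tfoo disk full"
  )

  // Keep only error lines, then extract the third tab-separated field.
  val errors: Seq[String] = lines.filter(_.startsWith("ERROR"))
  val messages: Seq[String] = errors.map(_.split('\t')(2))

  // Search the extracted messages for two patterns.
  val fooCount: Int = messages.count(_.contains("foo"))
  val barCount: Int = messages.count(_.contains("bar"))
}
```

With this sample data, two error messages mention "foo" and one mentions "bar"; the INFO line is dropped by the first filter.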

Conclusion

• Spark offers a rich API to make data analytics fast: both fast to write and fast to run.

• Achieves up to 100× speedups in real applications.

• Growing community with major Fortune 500 companies contributing.

• Details, tutorials, videos: www.spark-project.org

Thank You! Questions?
