Spark-101

Apache Spark Introduction Sameer Kohli

Transcript of Spark-101

Page 1: Spark-101

Apache Spark Introduction

Sameer Kohli

Page 2: Spark-101


What is Apache Spark?

• Apache Spark is a general-purpose, high-performance in-memory execution engine.

• It provides high-level APIs in Scala, Java, and Python. Users can get started with the interactive shells provided for Scala and Python.

• Spark serves as the foundation for many higher-level projects, such as Spark SQL (structured data), MLlib (machine learning), GraphX (graph processing), and Spark Streaming (stream processing).

• Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern.
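A multi-step pipeline of this kind can be sketched as a chain of RDD operations. The following is a minimal sketch, not code from the slides: the object name `WordCountPipeline`, the helper `wordCounts`, the sample sentences, and the `local[*]` master are all assumptions made for illustration, and it assumes Spark core is on the classpath. Each transformation adds a node to the DAG; nothing runs until `collect()` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; the names here are illustrative only.
object WordCountPipeline {

  // Builds a four-step pipeline and runs it with a single action.
  def wordCounts(sc: SparkContext, lines: Seq[String]): Map[String, Int] = {
    val counts = sc.parallelize(lines)        // step 0: distribute the input
      .flatMap(_.split("\\s+"))               // step 1: split lines into words
      .filter(_.nonEmpty)                     // step 2: drop empty tokens
      .map(word => (word.toLowerCase, 1))     // step 3: pair each word with 1
      .reduceByKey(_ + _)                     // step 4: sum counts per word (shuffle)
    counts.collect().toMap                    // action: triggers the whole DAG
  }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("dag-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val result = wordCounts(sc, Seq("spark is fast", "Spark is general"))
      result.toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

Because the steps are only recorded until the action runs, Spark can plan the whole pipeline at once, for example by running steps 1 to 3 together within each partition before the shuffle.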

Page 3: Spark-101


Role of Apache Spark in the Big Data Eco-system

Page 4: Spark-101


Spark Community Efforts

Page 5: Spark-101

Data Sharing in MapReduce

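[Figure: iterative jobs (iter. 1, iter. 2, ...) perform an HDFS read and an HDFS write around every iteration; interactive queries (query 1, query 2, query 3, ...) each perform a separate HDFS read of the same input to produce result 1, result 2, result 3, ...]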

Slow due to replication, serialization, and disk IO

Page 6: Spark-101

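Data Sharing in Spark

[Figure: iterative jobs (iter. 1, iter. 2, ...) pass data between iterations through distributed memory; for interactive queries (query 1, query 2, query 3, ...), the input is loaded into distributed memory by one-time processing and then shared by all queries.]

10-100× faster than network and disk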

Page 7: Spark-101


Spark Programming Model

Key idea: Resilient Distributed Datasets (RDDs)

• Distributed collections of objects that can be cached in memory across cluster nodes.

• Manipulated through various parallel operators.

• Automatically rebuilt on failure.
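The three properties above can be seen in a small local run. This is a minimal sketch under stated assumptions: the object name `RddLineageSketch`, the helper `evenSquares`, and the `local[*]` master are hypothetical, and Spark core is assumed to be on the classpath. `toDebugString` prints the lineage, that is, the chain of operators Spark records so that a lost partition can be recomputed from its parent data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names here are illustrative only.
object RddLineageSketch {

  // Parallel operators (map, filter) build a new RDD; cache() marks it
  // to be kept in memory after it is first computed.
  def evenSquares(sc: SparkContext, upTo: Int): RDD[Int] =
    sc.parallelize(1 to upTo)
      .map(n => n * n)
      .filter(_ % 2 == 0)
      .cache()

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val rdd = evenSquares(sc, 10)
      println(rdd.toDebugString)    // lineage used to rebuild lost partitions
      println(rdd.count())          // first action: computes and caches the data
      println(rdd.reduce(_ + _))    // second action: reads the cached partitions
    } finally {
      sc.stop()
    }
  }
}
```

Note that `cache()` itself computes nothing; the data is materialized in memory only when the first action runs.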

Interface

• Clean language-integrated API in Scala.

• Can be used interactively from the Scala console.

Page 8: Spark-101


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

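// `spark` is the SparkContext
val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))        // transformed RDD: third tab-separated field
val cachedMsgs = messages.cache()                  // keep the messages in memory

// Actions: each count() runs against the cached messages
cachedMsgs.filter(_.contains("foo")).count()
cachedMsgs.filter(_.contains("bar")).count()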

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
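To try the log-mining pattern without an HDFS cluster, the same steps can be run locally on a few in-memory lines. This is a sketch under stated assumptions, not the setup behind the results above: the object name `LogMiningSketch`, the helpers `errorMessages` and `countMatching`, the sample log lines, and the `local[*]` master are all hypothetical, and Spark core is assumed to be on the classpath. It also adds a length check, which the slide code omits, so that malformed lines do not cause an out-of-bounds error.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; the names here are illustrative only.
object LogMiningSketch {

  // Keeps ERROR lines, extracts the third tab-separated field, and caches it.
  def errorMessages(sc: SparkContext, lines: Seq[String]): RDD[String] =
    sc.parallelize(lines)
      .filter(_.startsWith("ERROR"))
      .map(_.split('\t'))
      .filter(_.length > 2)         // skip lines with fewer than three fields
      .map(fields => fields(2))
      .cache()

  // One interactive query: how many cached messages mention the pattern?
  def countMatching(messages: RDD[String], pattern: String): Long =
    messages.filter(_.contains(pattern)).count()

  def main(args: Array[String]): Unit = {
    // Assumption: run locally using all available cores.
    val conf = new SparkConf().setAppName("log-mining-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample data in the format: level <TAB> source <TAB> message
      val sample = Seq(
        "ERROR\tweb\tfoo failed",
        "INFO\tweb\tall good",
        "ERROR\tdb\tbar timed out",
        "ERROR\tdb\tfoo and bar both failed"
      )
      val msgs = errorMessages(sc, sample)
      println(countMatching(msgs, "foo"))
      println(countMatching(msgs, "bar"))
    } finally {
      sc.stop()
    }
  }
}
```

The first query pays the cost of filtering and parsing; later queries reuse the cached messages, which is the effect the timings above measure at scale.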

Page 9: Spark-101


Conclusion

• Spark offers a rich API to make data analytics fast: both fast to write and fast to run.

• Achieves 100x speedups in real applications.

• Growing community with major Fortune 500 companies contributing.

• Details, tutorials, videos: www.spark-project.org

Page 10: Spark-101


Thank You! Questions?