Map reduce vs spark

MapReduce vs/and Spark Tudor Lapusan BigData Romanian Tour - Timisoara


Page 1: Map reduce vs spark

MapReduce vs/and Spark Tudor Lapusan

BigData Romanian Tour - Timisoara

Page 2: Map reduce vs spark

History

Page 3: Map reduce vs spark

MapReduce basic functionalities

● Fault tolerance

● Monitoring & status updates

● Scalability

Page 4: Map reduce vs spark

Hadoop MapReduce

Input → Map → Reduce → Output

Page 5: Map reduce vs spark

Hadoop MapReduce

Input → Map → Shuffle → Reduce → Output
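The three phases on this slide can be illustrated with a word count over plain Scala collections. This is only a single-machine sketch of the data flow, not Hadoop's API; the object name `MapReduceSketch` and its helper functions are hypothetical names chosen for illustration.

```scala
object MapReduceSketch {
  // Map phase: emit a (word, 1) pair for every word in every input line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+").filter(_.nonEmpty)).map(word => (word, 1))

  // Shuffle phase: bring together all values that share the same key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: collapse the values of each key into one result.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(shufflePhase(mapPhase(lines)))

  def main(args: Array[String]): Unit = {
    val input = Seq("spark and hadoop", "hadoop mapreduce")
    // Prints the four word counts; "hadoop" is counted twice.
    println(wordCount(input))
  }
}
```

In a real cluster the map and reduce functions run on different machines and the shuffle moves data over the network; here all three are ordinary function calls.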

Page 6: Map reduce vs spark

MapReduce DAG

[Diagram: a DAG with nodes A, B, C, D, E and F]

Page 7: Map reduce vs spark

Spark

● RDD

● Operations: Transformations and Actions

Page 8: Map reduce vs spark

RDD - Resilient Distributed Dataset

An RDD is a fault-tolerant collection of elements distributed across many servers, on which we can perform operations in parallel.

Page 9: Map reduce vs spark

RDD

Scala code

val data = Array(1, 2, 3, 4, 5, 6, 7, 8)

val rddData = sc.parallelize(data)
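The snippet above assumes a ready-made `sc`, as provided by the interactive Spark shell. A minimal standalone sketch might create the context itself, as below. It assumes the spark-core library is on the classpath and runs in local mode; the app name, the object name `ParallelizeSketch` and the choice of 4 partitions are illustrative assumptions, not part of the slides.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeSketch {
  // Sums the elements of the slide's array through an RDD.
  def sumOfData(sc: SparkContext): Int = {
    val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
    // Second argument: number of partitions (4 is an arbitrary choice here).
    val rddData = sc.parallelize(data, 4)
    rddData.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode using all available cores.
    val conf = new SparkConf().setAppName("parallelize-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfData(sc)) // 36
    } finally {
      sc.stop()
    }
  }
}
```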

Page 10: Map reduce vs spark

RDD

Scala code

val rddFile = sc.textFile("data.txt")

Page 11: Map reduce vs spark

RDD persistence

MEMORY_ONLY

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER

DISK_ONLY

MEMORY_ONLY_2

MEMORY_AND_DISK_2

OFF_HEAP
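As a rough illustration of how one of these storage levels is applied, the sketch below persists an RDD with `MEMORY_AND_DISK` and then runs two actions on it. It assumes spark-core on the classpath and local mode; the object name `PersistSketch` and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (count, sum) of the squares of 1..8, computed on a persisted RDD.
  def countAndSum(sc: SparkContext): (Long, Int) = {
    val squares = sc.parallelize(1 to 8).map(x => x * x)
    // Only marks the RDD for reuse; nothing is stored until an action runs.
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    val n = squares.count()         // first action: computes and stores the partitions
    val total = squares.reduce(_ + _) // second action: can reuse the stored partitions
    squares.unpersist()
    (n, total)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(countAndSum(sc)) // (8,204)
    } finally {
      sc.stop()
    }
  }
}
```

For RDDs, calling `cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`.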

Page 12: Map reduce vs spark

Transformations

RDD 1 → RDD 2

Transformations are operations on RDDs that return new RDDs

Page 13: Map reduce vs spark

Transformations

InputRDD {1,2,3,4,5,6}

map (x => x + 1) on InputRDD gives MapRDD {2,3,4,5,6,7}

filter (x => x != 4) on InputRDD gives FilterRDD {1,2,3,5,6}
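The two transformations on this slide could be written roughly as follows. The values shown for FilterRDD only make sense if both transformations are applied to InputRDD, so the sketch assumes that. It also assumes spark-core on the classpath and local mode; `TransformationsSketch`, `addOne` and `dropFour` are hypothetical names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformationsSketch {
  // Each transformation returns a new RDD and leaves its input unchanged.
  def addOne(input: RDD[Int]): RDD[Int] = input.map(x => x + 1)
  def dropFour(input: RDD[Int]): RDD[Int] = input.filter(x => x != 4)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transformations-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
      val mapRDD = addOne(inputRDD)      // nothing is computed yet
      val filterRDD = dropFour(inputRDD) // nothing is computed yet
      // collect() is an action: it triggers the computation and returns an Array.
      println(mapRDD.collect().mkString(","))    // 2,3,4,5,6,7
      println(filterRDD.collect().mkString(",")) // 1,2,3,5,6
    } finally {
      sc.stop()
    }
  }
}
```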

Page 14: Map reduce vs spark

Actions

Actions are operations on an RDD that return a final value or write the data to an external storage system.

Page 15: Map reduce vs spark

Actions

InputRDD {1,2,3,4,5,6}

map (x => x + 1) on InputRDD gives MapRDD {2,3,4,5,6,7}

filter (x => x != 4) on InputRDD gives FilterRDD {1,2,3,5,6}

count() = 6, take(2) = {1,2}, saveAsTextFile()
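The three actions named on this slide might look like this in code. The slide does not say which RDD they run on; the values shown for `count()` and `take(2)` match InputRDD, so the sketch assumes that. It also assumes spark-core on the classpath and local mode, and writes to a temporary directory chosen only for illustration; `ActionsSketch` and `countAndFirstTwo` are hypothetical names.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionsSketch {
  // count() and take(n) return values to the driver program.
  def countAndFirstTwo(rdd: RDD[Int]): (Long, Seq[Int]) =
    (rdd.count(), rdd.take(2).toSeq)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("actions-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val inputRDD = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
      val (n, firstTwo) = countAndFirstTwo(inputRDD)
      println(n)                      // 6
      println(firstTwo.mkString(",")) // 1,2
      // saveAsTextFile fails if the target directory already exists,
      // so write to a fresh sub-directory of a temporary folder.
      val outDir = Files.createTempDirectory("actions-sketch").resolve("out").toString
      inputRDD.saveAsTextFile(outDir)
      println(s"wrote part files under $outDir")
    } finally {
      sc.stop()
    }
  }
}
```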

Page 16: Map reduce vs spark

Spark DAG

[Diagram: a DAG of RDD 1 through RDD 6. Legend: Transformation, Action, Stage]
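One way to look at the graph Spark builds is `toDebugString`, which prints the lineage of an RDD. The sketch below is an illustration under assumptions (spark-core on the classpath, local mode, made-up sample data, hypothetical object name `LineageSketch`), not the pipeline from the slide. It uses `reduceByKey` because that transformation needs a shuffle, which is where Spark starts a new stage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Builds a small chain of RDDs and returns a textual description of its lineage.
  def lineage(sc: SparkContext): String = {
    val words = sc.parallelize(Seq("a", "b", "a", "c"))
    val pairs = words.map(w => (w, 1))    // stays in the same stage
    val counts = pairs.reduceByKey(_ + _) // needs a shuffle, so a new stage begins
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Prints the chain of RDDs; indentation marks the shuffle boundary.
      println(lineage(sc))
    } finally {
      sc.stop()
    }
  }
}
```

No job runs in `lineage`: the three RDDs are only described, and work would start once an action such as `collect()` is called on `counts`.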

Page 17: Map reduce vs spark

Spark DAG vs MapReduce DAG

[Diagram: the Spark DAG (RDD 1 through RDD 6) shown next to the MapReduce DAG (nodes A through F)]

Page 18: Map reduce vs spark

Programming languages

MapReduce: Java, Ruby, Perl, Python, PHP, R, C++

Spark: Java, Scala, Python

Page 19: Map reduce vs spark

Ease of use

- Spark is easier to program and includes an interactive mode.

- Hadoop MapReduce is harder to program, but many tools are available to make it easier.

Page 20: Map reduce vs spark

Performance : Sort Benchmark 2013

Page 21: Map reduce vs spark

Performance : Sort Benchmark 2014

Page 22: Map reduce vs spark

Costs

Page 23: Map reduce vs spark

Costs : hardware recommendation

           Spark                     Hadoop MapReduce
Cores      8-16                      4
Memory     8 GB to hundreds of GB    24 GB
Disks      4-8                       4-6 one-TB disks
Network    10 Gigabit or more        1 Gigabit Ethernet
Source     Spark recommendation      Hortonworks recommendation

Page 24: Map reduce vs spark

Costs : developers