Introduction to Spark


Description

This is a short introduction to Spark presented at Exebit 2014 at IIT Madras.

Transcript of Introduction to Spark

Page 1: Introduction to Spark

Introduction to Spark

Sriram and Amritendu

DOS Lab, IIT Madras

“Introduction to Spark” by Sriram and Amritendu is licensed under a Creative Commons Attribution 4.0 International License.

Page 2: Introduction to Spark

Motivation

• In Hadoop, the programmer writes jobs using the MapReduce abstraction

• The runtime distributes work and handles fault tolerance

This makes analysis of large data sets easy and reliable

Emerging Class of Applications

• Machine learning, e.g. k-means clustering

• Graph algorithms, e.g. PageRank


Page 3: Introduction to Spark

Nature of the emerging class of applications

• Iterative computation

• Intermediate results are reused across multiple computations


Page 4: Introduction to Spark

Problem with Hadoop MapReduce

[Diagram: in Iteration 1, workers (W) read (R) input from HDFS and write results back to HDFS; Iteration 2's workers must read that output from HDFS again]

• Results are written to HDFS

• A new job is launched for each iteration

• Incurs substantial storage and job-launch overheads

Page 5: Introduction to Spark

Can we do away with these overheads?

• Persist intermediate results in memory

• Memory is 10-100X faster than disk/network

• What if a node fails?

Challenge: how to handle faults efficiently?

[Diagram: Iteration 1's workers (W) read (R) input from HDFS and keep their results in memory; Iteration 2's workers read from memory instead of HDFS; an X marks a failed node whose in-memory partitions are lost]


Page 6: Introduction to Spark

Approaches to handle faults

• Replication
  – With r replicas, can tolerate r-1 failures
  – Issues:
    – Requires more storage
    – More network traffic

• Using lineage
  – Log the operation
  – Re-compute lost partitions using lineage information
  – Issues: recovery time can be high if re-computation is very costly
    – high iteration time
    – wide dependencies

[Diagram: replication sends each write from the master (M) through a worker (W) to Replica 1 and Replica 2; an X marks a failed replica. Lineage: datasets D1, D2, D3 produce C1 and C2; when the node holding C2 fails (X), only C2 is re-computed from D2 and D3. A second panel illustrates wide dependencies]


Page 7: Introduction to Spark

Spark

• RDD – Resilient Distributed Datasets
  – Read-only, partitioned collection of records
  – Supports only coarse-grained operations
    • e.g. map and group-by transformations, reduce action
  – Uses the lineage graph to recover from faults

[Diagram: an RDD stored as 3 partitions D11, D12, D13]
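To make the coarse-grained model concrete, here is a minimal sketch (not from the slides) using Spark's Scala API; it assumes a SparkContext named sc is already in scope, and the data is invented for the example:

    // A minimal sketch, assuming a SparkContext `sc` is available.
    // parallelize builds an RDD with an explicit number of partitions (3 here).
    val d1 = sc.parallelize(1 to 9, 3)

    // Coarse-grained transformations apply one function to every record; they are lazy.
    val doubled = d1.map(_ * 2)
    val byParity = doubled.groupBy(_ % 2)

    // An action forces evaluation of the lineage graph.
    val total = doubled.reduce(_ + _)  // 90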


Page 8: Introduction to Spark

Spark contd.

• Control placement of an RDD's partitions
  – can specify the number of partitions
  – can partition based on a key in each record
    • useful in joins (see the sketch at the end of this slide)

• In-memory storage
  – Up to 100X speedup over Hadoop for iterative applications

• Spark can run on Hadoop YARN and read files from HDFS

• Spark is written in Scala
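As a hedged illustration of key-based partitioning (not from the slides; the data, partition count, and variable names are invented, and a SparkContext sc is assumed to be in scope):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair-RDD operations (pre-1.3 style)

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // Hash-partition by key into 4 partitions.
    val partitioned = pairs.partitionBy(new HashPartitioner(4))

    // An RDD partitioned with the same partitioner is co-located by key,
    // so this join avoids re-shuffling `partitioned` across the network.
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
      .partitionBy(new HashPartitioner(4))
    val joined = partitioned.join(other)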


Page 9: Introduction to Spark

Scala overview

• Functional programming meets object orientation

• “No side effects” aids concurrent programming

• Every variable is an object

• Every function is a value


Page 10: Introduction to Spark

Variables and Functions

var obj: java.lang.String = "Hello"
var x = new A()  // A is some user-defined class

def square(x: Int): Int = {  // ": Int" after the parameter list is the return type
  x * x
}


Page 11: Introduction to Spark

Execution of a function

scala> square(2)
res0: Int = 4

scala> square(square(6))
res1: Int = 1296

def square(x: Int): Int = {
  x * x
}


Page 12: Introduction to Spark

Nested Functions

def factorial(i: Int): Int = {
  def fact(i: Int, acc: Int): Int = {
    if (i <= 1)
      acc
    else
      fact(i - 1, i * acc)
  }
  fact(i, 1)
}
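For example, factorial(5) unwinds as fact(5, 1) → fact(4, 5) → fact(3, 20) → fact(2, 60) → fact(1, 120) → 120. Because the recursive call to fact is in tail position, the Scala compiler can compile it to a loop.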



Page 14: Introduction to Spark

Higher-order map functions

val add = (x: Int) => x + 1
val lst = List(1, 2, 3)

lst.map(add)         // List(2, 3, 4)
lst.map(x => x + 1)  // List(2, 3, 4)
lst.map(_ + 1)       // List(2, 3, 4)
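All three calls pass an equivalent function to map: a named function value, an inline anonymous function, and Scala's underscore placeholder syntax.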


Page 15: Introduction to Spark

Defining Objects

object Example {
  def main(args: Array[String]) {
    val logData = sc.textFile(logFile, 2).cache()  // assumes sc and logFile are defined
    // ...
  }
}

Example.main(Array("master", "noOfMap", "noOfReducer"))


Page 16: Introduction to Spark

Spark: Filter transformation in RDD

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))

Give me those lines which contain 'a'

Here is an example of the filter transformation: the predicate is applied to each line, and filter returns a new RDD containing only the matching lines.


Page 17: Introduction to Spark

Count

val logData = sc.textFile(logFile, 2).cache()
val numAs = logData.filter(line => line.contains("a"))
numAs.count()  // returns the number of matching lines, e.g. 5
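Note that filter is a lazy transformation: no work happens until an action such as count is invoked, at which point Spark evaluates the lineage and returns the result to the driver (the 5 above is an example value for some input file).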


Page 18: Introduction to Spark

flatMap

val logData = sc.textFile(logFile, 2).cache()
val words = logData.flatMap(line => line.split(" "))

Take each line, split it on spaces, and give me the array

"Here is a example of filter map" → (Here, is, a, example, of, filter, map)
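Unlike map, which would yield one array per line, flatMap flattens the per-line arrays into a single RDD of individual words.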


Page 19: Introduction to Spark

Wordcount Example in Spark

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://[input_path_to_textfile]")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://[output_path]")
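As a worked illustration on invented input: for a file containing the single line "to be or not to be", flatMap produces (to, be, or, not, to, be), map produces ((to,1), (be,1), (or,1), (not,1), (to,1), (be,1)), and reduceByKey(_ + _) sums the values per key, giving ((to,2), (be,2), (or,1), (not,1)).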


Page 20: Introduction to Spark

Limitations

• RDDs are not suitable for applications that require fine-grained updates to shared state

– e.g. a storage system for a web application


Page 21: Introduction to Spark

References

• http://www.slideshare.net/tpunder/a-brief-intro-to-scala

• Joshua D. Suereth, "Scala in Depth".

• Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation (NSDI '12), USENIX Association, Berkeley, CA, USA, 2012.

• Pictures:
  – http://www.xbitlabs.com/images/news/2011-04/hard_disk_drive.jpg
  – http://www.thecomputercoach.net/assets/images/256_MB_DDR_333_Cl2_5_Pc2700_RAM_Chip_Brand_New_Chip.jpg
