CS014 Introduction to Data Structures and Algorithms
Transcript of lecture slides: eldawy/18FCS226/slides/CS226-10-31-RDD.pdf
Resilient Distributed Datasets: a distributed query processing engine, the Spark counterpart to MapReduce
10/31/2018, 29 slides

Page 1

Page 2

Spark RDD

Page 3

Where are we?

Distributed storage in HDFS

MapReduce query execution in Hadoop

High-level data manipulation using Pig

Page 4

A Step Back to MapReduce

Designed in the early 2000s

Machines were unreliable

Focused on fault-tolerance

Addressed data-intensive applications

Limited memory

Page 5

MapReduce in Practice

[Diagram: two chained MapReduce jobs; each job runs parallel map tasks (M), a shuffle, then parallel reduce tasks (R), with intermediate results written to disk between jobs]

Can we improve on that?

Page 6

Pig

Slightly improves disk I/O by consolidating map-only jobs

[Diagram: a map-only job (M1 tasks) feeding a full MapReduce job (M2 map tasks, shuffle, R reduce tasks)]

Pages 7-9

[The same Pig slide, animated: step by step, the map-only job's M1 tasks are merged into the map phase of the following job (M2, shuffle, R), saving one round of disk I/O]

Page 10

Pig (at a higher level)

[Diagram: a Pig dataflow of logical operators — FILTER, FOREACH, JOIN, GROUP BY — chained together]

Page 11

RDD

Resilient Distributed Datasets

A distributed query processing engine

The Spark counterpart to MapReduce

Designed for in-memory processing

Page 12

In-memory Processing

The machine specs changed:
More reliable
Bigger memory

And the workload changed:
Analytical queries
Iterative operations (like ML)

The main idea: rather than storing intermediate results to disk, keep them in memory

How about fault tolerance?

Page 13

RDD Example

[Diagram: the same dataflow of operators (FILTER, FOREACH, JOIN, GROUP BY), but with intermediate results kept in memory (Mem) between operators]

Page 14

RDD Abstraction

An RDD is a pointer to a distributed dataset

It stores information about how to compute the data rather than where the data is

Transformation: converts an RDD into another RDD

Action: returns the result of an operation over an RDD

Page 15

Spark RDD Features

Lazy execution: collect transformations and execute them only on actions (similar to Pig)

Lineage tracking: keep track of the lineage of each RDD for fault tolerance
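Lazy execution is the same idea Java's own Stream API uses: intermediate operations only record a pipeline, and nothing runs until a terminal operation forces it. A minimal plain-Java sketch of that behavior (an analogy, not the Spark API):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

public class LazyDemo {
    // Counts how many times the predicate actually executes.
    static final AtomicInteger calls = new AtomicInteger();

    public static void main(String[] args) {
        Stream<String> filtered = Arrays.asList("map", "filter", "spark")
            .stream()
            .filter(w -> { calls.incrementAndGet(); return w.length() > 3; });
        // Like a Spark transformation: the pipeline is only recorded so far.
        System.out.println("after filter(): predicate ran " + calls.get() + " times");
        // Like a Spark action: count() forces the whole pipeline to execute.
        long n = filtered.count();
        System.out.println("after count(): predicate ran " + calls.get()
            + " times, kept " + n);
    }
}
```

After `filter()` the counter is still 0; only `count()` makes the predicate run over all three elements.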

Page 16

RDD

[Diagram: an Operation transforms one RDD into another RDD]

Page 17

Filter Operation

[Diagram: each partition of the input RDD maps to exactly one partition of the output RDD]

Similarly, the projection operation (ForEach in Pig)

Narrow dependency

Page 18

GroupBy (Shuffle) Operation

[Diagram: each partition of the output RDD depends on multiple partitions of the input RDD]

Similar operation: Join

Wide dependency

Page 19

Types of Dependencies

Narrow dependencies

Wide dependencies

Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies

Page 20

Examples of Transformations

map

flatMap

reduceByKey

filter

sample

join

union

partitionBy
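The key-based transformation in that list, reduceByKey, merges all values that share a key with a user-supplied function. Its semantics can be sketched locally in plain Java (a hypothetical helper, not the Spark API):

```java
import java.util.*;
import java.util.function.BinaryOperator;

public class ReduceByKeyDemo {
    // Local sketch of reduceByKey semantics: merge all values sharing a key.
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs,
                                        BinaryOperator<V> f) {
        Map<K, V> out = new HashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            // merge() inserts the value, or combines it with the existing one.
            out.merge(p.getKey(), p.getValue(), f);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = Arrays.asList(
            Map.entry("ok", 1), Map.entry("err", 1), Map.entry("ok", 1));
        // Summing per key is the classic word-count pattern.
        System.out.println(reduceByKey(pairs, Integer::sum));
    }
}
```

In Spark the same merge happens in parallel across partitions, with a shuffle bringing equal keys together (a wide dependency).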

Page 21

Examples of Actions

count

collect

save(path)

persist (strictly speaking, neither a transformation nor an action: it only marks the RDD to be cached when it is next computed)

reduce

Page 22

How RDD can be helpful

Consolidate operations:
Combine transformations

Iterative operations:
Keep the output of an iteration in memory till the next iteration

Data sharing:
Reuse the same data without having to read it multiple times

Page 23

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

Page 24

Examples

// Initialize the Spark context
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");

// Hello World! example: count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is " + count);

Page 25

Examples

// Count the number of OK lines
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
  @Override
  public Boolean call(String s) throws Exception {
    String code = s.split("\t")[5];
    return code.equals("200");
  }
});
long count = okLines.count();
System.out.println("Number of OK lines is " + count);
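The predicate inside that filter can be exercised without a cluster. A minimal sketch with a hypothetical sample line (the actual field layout of nasa_19950801.tsv may differ; the only assumption carried over from the slide is that the response code is tab-separated field index 5):

```java
public class PredicateDemo {
    // Same predicate as the Spark filter: keep lines whose 6th TSV field is "200".
    static boolean isOk(String line) {
        return line.split("\t")[5].equals("200");
    }

    public static void main(String[] args) {
        // Hypothetical log line with the response code in field index 5.
        String line = "host\tlogname\ttime\tmethod\turl\t200\tbytes";
        System.out.println(isOk(line));
    }
}
```

Testing the predicate separately like this is useful because the anonymous Function is shipped to executors and any bug only surfaces at job time.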

Page 26

Examples

// Count the number of OK lines
// Shorten the implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

Page 27

Examples

// Make it parametrized by taking the response code as a command-line argument
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
  @Override
  public Boolean call(String s) {
    String code = s.split("\t")[5];
    return code.equals(desiredResponseCode);
  }
});

Page 28

Examples

// Count lines by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {
  @Override
  public Tuple2<Integer, String> call(String s) {
    String code = s.split("\t")[5];
    return new Tuple2<Integer, String>(Integer.valueOf(code), s);
  }
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
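countByKey is an action that returns, for each key, how many pairs carry it. Its behavior can be sketched locally in plain Java (a hypothetical helper, not the Spark API):

```java
import java.util.*;

public class CountByKeyDemo {
    // Local sketch of countByKey: number of occurrences of each key.
    static <K> Map<K, Long> countByKey(List<K> keys) {
        Map<K, Long> out = new HashMap<>();
        for (K k : keys) {
            // Each occurrence of a key adds 1 to its running count.
            out.merge(k, 1L, Long::sum);
        }
        return out;
    }

    public static void main(String[] args) {
        // Keys standing in for response codes extracted by mapToPair above.
        System.out.println(countByKey(Arrays.asList(200, 200, 404)));
    }
}
```

Unlike the per-line filter examples, this produces the full histogram of response codes in one pass over the data.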
