CS014 Introduction to Data Structures and...



Spark RDD


Where are we?

Distributed storage in HDFS

MapReduce query execution in Hadoop

High-level data manipulation using Pig


A Step Back to MapReduce

Designed in the early 2000s

Machines were unreliable

Focused on fault-tolerance

Addressed data-intensive applications

Limited memory


MapReduce in Practice

[Diagram: map tasks (M) feed a shuffle into reduce tasks (R), with each job reading its input from disk and writing its output back to disk.]

Can we improve on that?

Pig

Slightly improves disk I/O by consolidating map-only jobs

[Diagram, built up over several slides: map-only phases (M1, M2) that would otherwise run as separate jobs are consolidated with the adjacent map or reduce phase (R, R1), avoiding extra rounds of disk I/O.]

Pig (at a higher level)

[Diagram: a Pig dataflow of operators: FILTER, FOREACH, JOIN, GROUP BY.]

RDD

Resilient Distributed Datasets

A distributed query processing engine

The Spark counterpart to MapReduce

Designed for in-memory processing


In-memory Processing

The machine specs changed

More reliable

Bigger memory

And the workload changed

Analytical queries

Iterative operations (like ML)

The main idea: rather than storing intermediate results on disk, keep them in memory

How about fault tolerance?

RDD Example

[Diagram: the same dataflow of operators (FILTER, FOREACH, JOIN, GROUP BY), but with intermediate results kept in memory (Mem) instead of on disk.]

RDD Abstraction

RDD is a pointer to a distributed dataset

Stores information about how to compute the data rather than where the data is

Transformation: converts an RDD into another RDD

Action: returns the result of an operation over an RDD
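For example, a minimal sketch in the Java API used later in these slides (the input file name here is a placeholder):

// Transformation vs. action (sketch; the file name is a placeholder)
JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");
JavaRDD<String> lines = spark.textFile("input.txt");       // defines an RDD, reads nothing yet

// Transformation: produces a new RDD, no computation happens here
JavaRDD<String> nonEmpty = lines.filter(s -> !s.isEmpty());

// Action: forces the computation and returns a value to the driver
long count = nonEmpty.count();
System.out.println("Non-empty lines: " + count);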

Spark RDD Features

Lazy execution: collect transformations and execute them on actions (similar to Pig)

Lineage tracking: keep track of the lineage of each RDD for fault tolerance
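A small sketch of both features, reusing the context and input file from the examples at the end of these slides; toDebugString is the standard RDD call that prints the tracked lineage:

// Lazy execution: these transformations are only recorded, not executed
JavaRDD<String> lines = spark.textFile("nasa_19950801.tsv");
JavaRDD<String> ok = lines.filter(s -> s.split("\t")[5].equals("200"));

// Lineage tracking: each RDD remembers how to recompute itself from its inputs
System.out.println(ok.toDebugString());

// Only this action triggers reading the file and running the filter
System.out.println("Number of OK lines is " + ok.count());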

RDD

[Diagram: an operation converts one RDD into another RDD.]

Filter Operation

[Diagram: a Filter operation applied to an RDD, producing a new RDD partition by partition.]

Similarly, the projection operation (FOREACH in Pig)

Narrow dependency

GroupBy (Shuffle) Operation

[Diagram: a GroupBy operation applied to an RDD; producing each output partition requires data from many input partitions (a shuffle).]

A similar operation: Join

Wide dependency

Types of Dependencies

Narrow dependencies

Wide dependencies


Credit: https://github.com/rohgar/scala-spark-4/wiki/Wide-vs-Narrow-Dependencies
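A sketch of the distinction using the same log file as the later examples (the assumption that column 0 holds the host name is illustrative): map and filter keep a narrow dependency, while reduceByKey introduces a wide dependency through a shuffle.

// Narrow dependencies: each output partition comes from a single input partition
JavaRDD<String[]> fields = spark.textFile("nasa_19950801.tsv").map(s -> s.split("\t"));
JavaRDD<String[]> okOnly = fields.filter(f -> f[5].equals("200"));

// Wide dependency: grouping by key shuffles data across partitions
JavaPairRDD<String, Integer> hitsPerHost = okOnly
    .mapToPair(f -> new Tuple2<>(f[0], 1))        // assumes column 0 is the host name
    .reduceByKey((a, b) -> a + b);

System.out.println(hitsPerHost.toDebugString()); // the printed lineage shows the shuffle boundary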

Examples of Transformations

map

flatMap

reduceByKey

filter

sample

join

union

partitionBy

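A sketch that chains a few of these transformations (Spark 2.x Java API, where flatMap returns an Iterator; Arrays, Tuple2, and HashPartitioner come from java.util, scala, and org.apache.spark respectively):

// Transformations only build up the execution plan; nothing runs until an action is called
JavaRDD<String> lines = spark.textFile("nasa_19950801.tsv");
JavaRDD<String> fields = lines.flatMap(s -> Arrays.asList(s.split("\t")).iterator());
JavaRDD<String> sampled = lines.sample(false, 0.1);            // ~10% sample, without replacement
JavaRDD<String> combined = lines.union(sampled);

JavaPairRDD<String, Integer> fieldCounts = fields
    .mapToPair(f -> new Tuple2<>(f, 1))
    .reduceByKey((a, b) -> a + b);                             // requires a shuffle

JavaPairRDD<String, Integer> repartitioned =
    fieldCounts.partitionBy(new HashPartitioner(4));           // explicit partitioning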

Examples of Actions

count

collect

save(path)

persist

reduce

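A sketch of these actions applied to the RDDs from the previous sketch (java.util.List and org.apache.spark.storage.StorageLevel assumed imported); save is shown as saveAsTextFile, the concrete call in the Java API, and the output path is a placeholder:

// Each action triggers execution of the pending transformations
long numLines = lines.count();                                 // number of elements
List<String> sampleRows = sampled.collect();                   // bring a (small) RDD to the driver
combined.saveAsTextFile("combined-output");                    // placeholder output directory
fieldCounts.persist(StorageLevel.MEMORY_ONLY());               // keep this RDD in memory once computed
Integer total = fields.map(f -> 1).reduce((a, b) -> a + b);    // aggregate down to a single value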

How RDD can be helpful

Consolidate operations

Combine transformations

Iterative operations

Keep the output of an iteration in memory till the next iteration

Data sharing

Reuse the same data without having to read it multiple times
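A sketch of the data-sharing point: parse the log file once, cache the result in memory, and reuse it for several computations instead of re-reading the file each time.

// Read and parse the file once, then keep the parsed RDD in memory
JavaRDD<String[]> records = spark.textFile("nasa_19950801.tsv")
    .map(s -> s.split("\t"))
    .cache();

long total = records.count();                                    // first action materializes the cache
long ok = records.filter(f -> f[5].equals("200")).count();       // reuses the in-memory data
long notFound = records.filter(f -> f[5].equals("404")).count(); // reuses the in-memory data
System.out.println(total + " total, " + ok + " OK, " + notFound + " not found");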

Examples

// Initialize the Spark context

JavaSparkContext spark = new JavaSparkContext("local", "CS226-Demo");


Examples


// Hello World! example: count the number of lines in the file
JavaRDD<String> textFileRDD = spark.textFile("nasa_19950801.tsv");
long count = textFileRDD.count();
System.out.println("Number of lines is " + count);

Examples

// Count the number of OK lines (response code 200)
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
  @Override
  public Boolean call(String s) throws Exception {
    String code = s.split("\t")[5];
    return code.equals("200");
  }
});
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

Examples

// Count the number of OK lines
// Shorten the implementation using lambdas (Java 8 and above)
JavaRDD<String> okLines = textFileRDD.filter(s -> s.split("\t")[5].equals("200"));
long count = okLines.count();
System.out.println("Number of OK lines is " + count);

Examples

// Make it parametrized by taking the input file and the response code as command-line arguments
String inputFileName = args[0];
String desiredResponseCode = args[1];
...
JavaRDD<String> textFileRDD = spark.textFile(inputFileName);
JavaRDD<String> okLines = textFileRDD.filter(new Function<String, Boolean>() {
  @Override
  public Boolean call(String s) {
    String code = s.split("\t")[5];
    return code.equals(desiredResponseCode);
  }
});

Examples

// Count by response code
JavaPairRDD<Integer, String> linesByCode = textFileRDD.mapToPair(new PairFunction<String, Integer, String>() {
  @Override
  public Tuple2<Integer, String> call(String s) {
    String code = s.split("\t")[5];
    return new Tuple2<Integer, String>(Integer.valueOf(code), s);
  }
});
Map<Integer, Long> countByCode = linesByCode.countByKey();
System.out.println(countByCode);
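An alternative sketch of the same result, using the reduceByKey transformation followed by a collectAsMap action instead of countByKey:

// Alternative: count lines per response code with reduceByKey, then collect the small result map
JavaPairRDD<Integer, Long> countsByCode = textFileRDD
    .mapToPair(s -> new Tuple2<>(Integer.valueOf(s.split("\t")[5]), 1L))
    .reduceByKey((a, b) -> a + b);
System.out.println(countsByCode.collectAsMap());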