Resilient Distributed DataSets

RDD – Overview(Resilient Distributed Datasets*)

Nov 1st 2014Oakland CABy Taposh Dutta Roy

* Source: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Contents

• What is RDD• Motivation Behind RDD• Use Cases for RDD• Challenges for RDD• RDD: Solve

What is RDD

“RDDs are fault tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operations. “

In a nutshell RDDs are a level of abstraction that enable efficient data reuse in a broad range of applications

Motivation behind RDD

Current frameworks like MapReduce & Dyrad provide a numerous abstractions for accessing a cluster’s computational resources but lack abstractions for leveraging the distributed memory !!!

Data reuse is common in many iterative machine learning algorithms such as – Page Rank, K-means Clustering & Logistic Regression.

Motivation behind RDD

Another use case is when an user runs multiple adhoc queries on the same subset of data. Unfortunately in current frameworks, the only way to reuse data between computations i.e between two jobs is to write to an external storage system e.g. a distributed file system such as Amazon S3.

Use cases for RDD

1. Solving Iterative problems

Existing Solution – Slow, needs high I/O

RDD - Fast, in memory

Use cases for RDD

Example: Suppose I have to look at the webserver access logs and look for an error_code or certain text.

Use cases for RDD

Example (cont’d) : I run the above code on server which returns a set of files with the words looked for grepped, closes the cluster and puts the file into an Amazon S3 location specified in the script.

Now we look at the result files and need to extract some other text from this file, we will need to write or use another set of map-reduce code. This might take extra time to fetch the files, process and provide the results.

Use cases for RDD

RDD solves this problem by storing the data in memory and providing a ability for the user to requery the subset.

Use cases for RDD

2. Solving Interactive ProblemsThe second use case is its usage in interactive algorithms such as logistic regression which need

the data to be re-used.

Challenge for RDD

The main challenge in designing RDD is defining a programming interface that can provide fault tolerance efficiently.

Challenges for RDD

Existing solutions such as distributed shared memory, key value stores, & databases offer an interface based on fine-grained updates.

With such systems, the only way to get fault tolerance is to replicate the data across machines or to log updates across machines. Both of these approaches are data intensive. They need high bandwidth to move the data over the cluster network and large storage.

RDD: Solve

RDD solves these probems by providing an interface based on coarse grained transformations such as map, filter and join. These transformations apply the same operations to many data items.This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (i.e. lineage) rather than actual data. If a partition of RDD is lost, the RDD has enough information about how it ..

RDD: Solve

(Cont’d) was derived from other RDD to recompute just that partition. The lost data can be recovered quickly, without costly replication.

Applications not suitable : RDD

RDDs would be less suitable for applications that make asynchronous fine grained updates to shared state, such as a storage system for a web application or an incremental web crawller. For such applications traditional update logging and data checkpointingsuch as databases.

Conclusion RDD

RDD's goal is to provide an efficient programming model for batchanalytics.

RDD has been implemented in a system called SPARK.

Resilient Distributed DataSets - Apache SPARK

Engineering

Transcript of Resilient Distributed DataSets - Apache SPARK

Resilient Distributed Datasets (NSDI 2012) A Fault-Tolerant Abstraction for In-Memory Cluster Computing Piccolo (OSDI 2010) Building Fast, Distributed.

Apache Spark: A Unified Engine for Big Data Processing€¦ · Resilient Distributed Datasets (RDDs) Apache Spark: A Unified Engine for Big Data Processing PAGE 10 § An RDD is a

Resilient Distributed Datasets: A Fault-Tolerant ... · PDF fileResilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf

Spark and Resilient Distributed Datasets · Resilient Distributed Datasets (RDD) (2/2) I AnRDDis divided into a number ofpartitions, which areatomic pieces of information. I Partitions

Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

RESILIENT DISTRIBUTED DATASETS: A FAULT ...ey204/teaching/ACS/R244_2018...RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING MATEIZAHARIA,

MAPREDUCE & RESILIENT DISTRIBUTED DATASETS CS6410Resilient Distributed Datasets (RDD) ... The need to process large data distributed across hundreds or thousands of machines in order

BUILDING RESILIENT MICROSERVICES with APACHE QPID … · 2017-12-14 · BUILDING RESILIENT MICROSERVICES ... Resilient system design. Microservices. Continuous delivery Design/build

Restoring and Maintaining Resilient Landscapes …forestry.scat-nsn.gov/publicweb/AZ_SCA_Report_Final.pdf · Restoring and Maintaining Resilient Landscapes on the San Carlos Apache

Resilient distributed datasets a fault-tolerant abstraction for in-memory cluster computing

Jia Yu (于嘉) · 2020-07-07 · Apache Spark [10] is an in-memory cluster computing system. Spark provides a novel data abstraction called resilient distributed datasets (RDDs)

Resilient Distributed Datasetspavlo/courses/fall2013/... · Resilient Distributed Datasets (RDDs) •Restricted form of distributed shared memory –read-only, partitioned collection

Resilient Messaging - Crunch Toolscrunchtools.com/files/2014/05/Resilient-Messaging-AMQ... · 2014. 5. 20. · @scottcranton 20+ years in Middleware software coding and sales Apache

Introduction to Apache Spark€¦ · •The Spark programming model •Language and deployment choices •Example algorithm (PageRank) Key Concept: RDD’s Resilient Distributed Datasets

Resilient New Zealand key geospatial resilience dataset Mr.Graeme... · Resilient New Zealand ... • Identify key datasets and implement an improvement plan ... – Lack of standards

Apache Spark - Thomas RoparsApache Spark Originally developed at Univ. of California Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, M.

Resilient Distributed Datasets: A Fault-Tolerant …ey204/teaching/ACS/R212_2015_2016/...Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Apache Spark Introduction - Seoul AIseoulai.com/presentations/ApacheSparkIntroduction.pdf · Apache Spark Introduction Motivation Resilient Distributed Dataset (RDD) Lineage chain

B14 Apache Spark with IMS and DB2 data€¦ · Using a fault-tolerant abstraction for in-memory cluster computing –Resilient Distributed Datasets (RDDs) Can be deployed on different