The Performance of MapReduce: An In-depth Study

Dawei Jiang, Beng Chin Ooi, Lei Shi, Sai Wu, School of Computing, NUS

Presented by Tang Kai

Introduction
Factors affecting Performance of MR
Pruning search space
Implementation
Benchmark

MapReduce-based systems are increasingly being used.
◦ Simple yet impressive interface: Map() and Reduce()
◦ Flexible: storage system independence
◦ Scalable
◦ Fine-grained fault tolerance

Introduction

Previous study
◦ Fundamental differences: schema support, data access, fault tolerance
◦ Benchmark: Parallel DB >> MR-based

Motivation

Is it not possible to have a flexible, scalable and efficient MapReduce-based system?

Work
◦ Identify several performance bottlenecks
◦ Manage the bottlenecks and tune performance with well-known engineering and database techniques

Conclusion
◦ Performance improves by 2.5x-3.5x

Objective

7 steps of a MapReduce job

Factors affecting Performance of MR

1) Map
2) Parse
3) Process
4) Sort
5) Shuffle
6) Merge
7) Reduce

I/O mode
Indexing
Parsing
Sorting

Factors affecting Performance of MR

Direct I/O
◦ Reads data from the disk directly
◦ Local only

Streaming I/O
◦ Streams data from the storage system through an inter-process communication scheme, such as TCP/IP or JDBC
◦ Local and remote

Direct I/O outperforms streaming I/O by 10%-15%.
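The difference between the two I/O modes can be sketched in a few lines of Python. This is an illustration only, not the Hadoop data node code: `read_direct` opens the file itself, while `read_streaming` receives the same bytes through a socket fed by a separate thread that stands in for the storage daemon. All function names are hypothetical, and `socket.socketpair()` is used here as a stand-in for a TCP/IP connection.

```python
import os
import socket
import tempfile
import threading


def read_direct(path, chunk_size=64 * 1024):
    """Direct I/O: the reader opens the local file and reads it itself."""
    parts = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            parts.append(chunk)
    return b"".join(parts)


def stream_file(path, conn, chunk_size):
    """Stand-in for a storage daemon: pushes the file over a connection."""
    with conn, open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            conn.sendall(chunk)


def read_streaming(path, chunk_size=64 * 1024):
    """Streaming I/O: every byte crosses an IPC channel before the reader sees it."""
    reader_end, server_end = socket.socketpair()
    server = threading.Thread(target=stream_file, args=(path, server_end, chunk_size))
    server.start()
    parts = []
    with reader_end:
        while True:
            chunk = reader_end.recv(chunk_size)
            if not chunk:  # the sender closed its end
                break
            parts.append(chunk)
    server.join()
    return b"".join(parts)


def make_sample_file(num_bytes=1 << 20):
    """Create a temporary file with random content and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(num_bytes))
    return path
```

Both functions return identical bytes; the streaming path simply adds an extra copy through the socket, which is the kind of overhead the 10%-15% figure refers to. Timing the two on a toy file will not reproduce that number.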

I/O mode

Input of a MapReduce job
◦ A set of files stored in a distributed file system, e.g. HDFS -> ranged indexes
◦ Input HDFS files are not sorted, but each data chunk in the files is indexed by keys -> block-level indexes
◦ Tables stored in database servers -> database indexed tables

Indexing

Indexing speeds up the selection task by 2x-10x, depending on the selectivity.
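As a rough illustration of why an index helps a selection, the sketch below keeps the smallest and largest key of every unsorted chunk and skips chunks whose key range cannot match the predicate. The layout (`ChunkIndexEntry`, integer keys, in-memory chunks) is an assumption made for the example and is not the index format used in the study.

```python
from dataclasses import dataclass


@dataclass
class ChunkIndexEntry:
    chunk_id: int
    min_key: int
    max_key: int


def build_block_index(chunks):
    """Record the smallest and largest key of every (unsorted, non-empty) chunk."""
    index = []
    for chunk_id, records in enumerate(chunks):
        keys = [key for key, _ in records]
        index.append(ChunkIndexEntry(chunk_id, min(keys), max(keys)))
    return index


def select_with_index(chunks, index, low, high):
    """Return records with low <= key <= high and the number of chunks scanned."""
    matches, scanned = [], 0
    for entry in index:
        if entry.max_key < low or entry.min_key > high:
            continue  # this chunk cannot contain a match, so it is never read
        scanned += 1
        matches.extend((k, v) for k, v in chunks[entry.chunk_id] if low <= k <= high)
    return matches, scanned


def select_full_scan(chunks, low, high):
    """Baseline without an index: every chunk is read."""
    matches = [(k, v) for records in chunks for k, v in records if low <= k <= high]
    return matches, len(chunks)
```

The more selective the predicate, the more chunks are skipped, which is consistent with the gain depending on selectivity.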

Raw data -> <k,v> pair

Immutable decoding
◦ Read-only records (set once)

Mutable decoding

The mutable decoder is 10x faster.
◦ Boosts the selection task by 2x overall
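A minimal sketch of the two decoding styles, assuming that mutable decoding means reusing one record object for every input line while immutable decoding allocates a fresh read-only object per record. The record layout (`sourceIP|adRevenue`) and all class and function names are hypothetical. The speedup reported on the slide comes from the Java implementations; this Python version only shows the difference in object lifetime.

```python
from collections import namedtuple

# Immutable record: fields are set once, at construction time.
ImmutableVisit = namedtuple("ImmutableVisit", ["source_ip", "ad_revenue"])


class MutableVisit:
    """Mutable record: one instance is overwritten for each input line."""
    __slots__ = ("source_ip", "ad_revenue")

    def __init__(self):
        self.source_ip = ""
        self.ad_revenue = 0.0

    def set_from(self, line):
        source_ip, ad_revenue = line.split("|")
        self.source_ip = source_ip
        self.ad_revenue = float(ad_revenue)
        return self


def decode_immutable(lines):
    """Allocate a new read-only object for every record."""
    for line in lines:
        source_ip, ad_revenue = line.split("|")
        yield ImmutableVisit(source_ip, float(ad_revenue))


def decode_mutable(lines):
    """Reuse a single object; callers must not keep references across records."""
    record = MutableVisit()
    for line in lines:
        yield record.set_from(line)


def total_revenue(records):
    return sum(r.ad_revenue for r in records)
```

The trade-off is visible in `decode_mutable`: a caller that collects the yielded records into a list ends up with many references to the same object, holding only the last line's values.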

Parsing

Map-side sorting affects the performance of aggregation.
◦ The cost of key comparison is non-trivial.

Example
◦ sourceIP in the UserVisits table
◦ Sort intermediate records.
◦ sourceIP is a variable-length string.
◦ String compare (byte-to-byte) vs. fingerprint compare (integer)

Fingerprint-based sorting is 4x-5x faster.
◦ 20%-25% faster overall
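The idea of fingerprint-based sorting can be sketched as follows: each string key gets an integer fingerprint, records are ordered by that integer, and the string itself is compared only when two fingerprints are equal. CRC32 is used here purely as an illustrative fingerprint function; the slides do not say which function the study used. The resulting order is not lexicographic, which is acceptable for aggregation because it only needs equal keys to end up adjacent.

```python
import zlib
from itertools import groupby


def fingerprint(key):
    """32-bit integer fingerprint of a string key (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8"))


def sort_by_fingerprint(records):
    """Sort (key, value) pairs by (fingerprint, key).

    The integer comparison decides most orderings; the string is compared
    only on a fingerprint tie, so colliding keys are never interleaved.
    """
    decorated = [(fingerprint(key), key, value) for key, value in records]
    decorated.sort(key=lambda item: (item[0], item[1]))
    return [(key, value) for _, key, value in decorated]


def aggregate_sorted(sorted_records):
    """Sum the values over each run of equal keys, as a reducer would."""
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(sorted_records, key=lambda kv: kv[0])
    }
```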

Sorting

Why
◦ 4 factors, resulting in a large search space (2*2*3*2 = 24 combinations)
◦ Budget limit on Amazon EC2

Greedy

Pruning search space

Greedy Strategy
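A small sketch of how a greedy, one-factor-at-a-time search shrinks the number of benchmark runs compared with trying every combination. The option counts (2, 2, 3, 2) come from the slide; the option names, the choice of which factor has three options, and the `toy_benchmark` cost model are assumptions made for the example and are not measured data.

```python
from itertools import product

# Hypothetical option names; only the counts 2, 2, 3, 2 are from the slide.
FACTORS = {
    "io_mode": ["streaming", "direct"],
    "indexing": ["none", "indexed"],
    "parsing": ["scheme_a", "scheme_b", "scheme_c"],
    "sorting": ["string", "fingerprint"],
}


def exhaustive_search(factors, run_benchmark):
    """Benchmark every combination of options."""
    names = list(factors)
    best, runs = None, 0
    for combo in product(*(factors[name] for name in names)):
        config = dict(zip(names, combo))
        runs += 1
        cost = run_benchmark(config)
        if best is None or cost < best[0]:
            best = (cost, config)
    return best[1], runs


def greedy_search(factors, run_benchmark):
    """Tune one factor at a time, fixing the best option before moving on."""
    config = {name: options[0] for name, options in factors.items()}
    runs = 0
    for name, options in factors.items():
        best_option, best_cost = None, None
        for option in options:
            trial = dict(config, **{name: option})
            runs += 1
            cost = run_benchmark(trial)
            if best_cost is None or cost < best_cost:
                best_option, best_cost = option, cost
        config[name] = best_option
    return config, runs


def toy_benchmark(config):
    """Made-up additive cost model with independent factors."""
    penalty = {"streaming": 3, "none": 5, "scheme_a": 4, "scheme_b": 2, "string": 1}
    return 10 + sum(penalty.get(value, 0) for value in config.values())
```

With these counts the exhaustive search needs 24 runs and the greedy search needs 2 + 2 + 3 + 2 = 9. The greedy result matches the exhaustive one here only because the toy cost model treats the factors as independent.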

Pruning search space

I/O mode
◦ Direct I/O
◦ Streaming I/O

Parser
◦ Hadoop Writable
◦ Google’s Protocol Buffers
◦ Berkeley DB

Different sort schemes in various architectures

3 datasets, 4 queries

Benchmark

Hadoop 0.19.2 as code base

Direct I/O
◦ Modification of the data node implementation

Text decoder
◦ Immutable: same as DeWitt
◦ Mutable: written by ourselves

Binary decoder
◦ Hadoop: immutable Writable decoder; mutable decoder written by ourselves using the Hadoop API
◦ Google Protocol Buffers: built-in compiler -> mutable; immutable decoder written by ourselves
◦ Berkeley DB: BDB binding API (mutable)

Implementation details

Amazon EC2 (Elastic Compute Cloud)
◦ 7.5 GB memory
◦ 2 virtual cores
◦ 64-bit Fedora 8

Tuning EC2 disk I/O by shifting peak time.

Hadoop settings
◦ Block size of HDFS: 512 MB
◦ Heap size of JVM: 1024 MB

Environment details

Benchmark for I/O

Results for different I/O modes
◦ Single node
◦ No-op job with map and without reduce

Results for record parsing
◦ Run in a Java process instead of a MapReduce job
◦ Timing starts after the data is loaded into memory

Mutable decoders outperform immutable decoders.
◦ The mutable text decoder outperforms the mutable binary decoder.

Benchmark for parsing

Among Hadoop-based systems
◦ Cache factor

Between Hadoop-based systems and Parallel DB
◦ Performance is close

Benchmark for Grep Task

Selection task -> scan -> index
Caching
Indexing

Benchmark for Selection Task

Parsing: 2x faster
Sorting: 20%-25% faster
◦ Not significant in the small aggregation task

Benchmark for Aggregation Task

Large: SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;

Small: SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
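The two queries map naturally onto a map function that emits (group key, adRevenue) and a reduce function that sums. The sketch below simulates that in memory; `run_job` and the dictionary record layout are hypothetical stand-ins for a real MapReduce job, and the shuffle is reduced to a dictionary of lists.

```python
from collections import defaultdict


def map_large(visit):
    """Large query: group by the full sourceIP."""
    yield visit["sourceIP"], visit["adRevenue"]


def map_small(visit):
    """Small query: SQL SUBSTR(sourceIP, 1, 7) is 1-based, i.e. the first 7 characters."""
    yield visit["sourceIP"][:7], visit["adRevenue"]


def reduce_sum(key, values):
    """SUM(adRevenue) for one group."""
    return key, sum(values)


def run_job(records, map_fn, reduce_fn):
    """In-memory stand-in for map, shuffle (group by key), and reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())
```

Because the small query collapses many addresses onto one 7-character prefix, it produces far fewer groups and far less intermediate data to sort.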

On decoding scheme
Comparison of tuned MR-based & Parallel DB

Benchmark for Join Task

Cons
◦ Needs to be committed to, or forked from, the Hadoop source code tree
◦ A complete framework is needed instead of miscellaneous patches.
◦ Support for various APIs: CLI and Web rather than Java only

Future work
◦ Provide a query parser, optimizer, etc. to build a complete solution
◦ Elastic power-aware data-intensive Cloud

http://www.comp.nus.edu.sg/~epic/download/MapReduceBenchmark.tar.gz

Cons & Future work
