MapReduce: Optimizations, Limitations, and Open Issues


MapReduce: Limitations, Optimizations and Open Issues

The 11th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA-13)

Vasiliki Kalavri, Vladimir Vlassov
{kalavri, vladv}@kth.se
17 July 2013, Melbourne, Australia

Outline

● MapReduce / Hadoop
  ○ background
  ○ current state
● Limitations and Existing Optimizations
  ○ performance
  ○ programming model
  ○ configuration and automation
● Trends, Open Issues, Future Directions

Big Data & Hadoop MapReduce


Motivation and Goal

● Numerous Hadoop variations and enhancements over the past few years
  ○ each branching out from vanilla Hadoop
  ○ hard to choose the appropriate tool
  ○ no categorization / classification exists

● In our survey
  ○ overview existing variations
  ○ classify the optimizations
  ○ identify trends and open issues


MapReduce Programming Model

MapReduce

● Key-Value Pairs
  ○ Partitioning functions
● 2nd Order Functions
  ○ User-Defined Map and Reduce
● Input / Output
  ○ Distributed, Fault-Tolerant File System
● Data-Centric Computation
  ○ Move the computation to the data
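To make the model concrete, below is the canonical word-count example written against the Hadoop Java API (org.apache.hadoop.mapreduce). It is a minimal sketch: the user supplies only the two second-order functions, map and reduce, while partitioning, shuffling, sorting, and fault tolerance are handled by the framework; the driver that configures and submits the Job is omitted.

```java
// Word count: map emits (word, 1) pairs, reduce sums the counts per word.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map: (line offset, line) -> (word, 1)
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, count)
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}
```

Input splits are read from the distributed file system and the final output is written back to it; the intermediate (word, 1) pairs are grouped by key before they reach the reducer.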


Hadoop MapReduce 1.0


Limitations
● Scalability
● Cluster Utilization
● No support for non-MR applications

YARN (MapReduce v.2)


● JobTracker => Resource Manager and Application Master

● Map/Reduce Slots => Resource Container

MapReduce Limitations

● Performance
  ○ initialization, scheduling, coordination
  ○ data materialization and intensive disk I/O
● Programming Model
  ○ single-input operators
  ○ fixed processing pipeline, job chaining
  ○ no support for iterations
● Configuration and Automation
  ○ sensitive to configuration parameters
  ○ complicated tuning (see the configuration sketch below)
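To illustrate the configuration point: even a simple job exposes dozens of performance-critical parameters that the user is expected to set by hand. The sketch below sets a few well-known Hadoop 2.x properties through the client-side Configuration API; the values are arbitrary examples rather than recommendations, and choosing good ones per workload is exactly what self-tuning systems such as Starfish try to automate.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // A few of the many knobs left to the user; the defaults are rarely
    // optimal, and good values depend on the cluster and the workload.
    conf.setInt("mapreduce.job.reduces", 32);        // number of reduce tasks
    conf.setInt("mapreduce.task.io.sort.mb", 256);   // map-side sort buffer (MB)
    conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f); // when reducers start
    conf.setInt("mapreduce.map.memory.mb", 2048);    // container size for map tasks

    Job job = Job.getInstance(conf, "hand-tuned job");
    // ... set mapper, reducer, input and output paths as usual ...
  }
}
```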

Performance Issues (1)


● job setup and initialization
● task scheduling, monitoring, and coordination


Performance Issues (2)

● Data materialization and replication
● Intensive disk I/O


Programming Model Issues (1)

Single-Input Operators: Hard to Join / Cross Datasets

● To combine two datasets (Input A, Input B), records must be tagged with their origin and merged into a single input in a pre-processing step.
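For illustration, here is a minimal sketch of that tag-and-merge workaround in the Hadoop Java API: each input gets its own mapper, which tags every record with its origin, and a single reducer separates the tags to produce the joined pairs. The record format (tab-separated join key and payload) is an assumption, and the driver code that wires each mapper to its input path (for instance with MultipleInputs) is omitted.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideJoin {

  // Mapper for input A: emit (joinKey, "A<tab>payload")
  public static class TagAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);  // joinKey <tab> payload
      if (fields.length < 2) return;                      // skip malformed lines
      context.write(new Text(fields[0]), new Text("A\t" + fields[1]));
    }
  }

  // Mapper for input B: emit (joinKey, "B<tab>payload")
  public static class TagBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t", 2);
      if (fields.length < 2) return;
      context.write(new Text(fields[0]), new Text("B\t" + fields[1]));
    }
  }

  // Reducer: separate values by tag and emit the cross product per join key.
  // Assumes a modest number of matching records per key, since they are buffered.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> fromA = new ArrayList<>();
      List<String> fromB = new ArrayList<>();
      for (Text v : values) {
        String[] tagged = v.toString().split("\t", 2);
        if ("A".equals(tagged[0])) fromA.add(tagged[1]); else fromB.add(tagged[1]);
      }
      for (String a : fromA) {
        for (String b : fromB) {
          context.write(key, new Text(a + "\t" + b));
        }
      }
    }
  }
}
```

All of this boilerplate exists only because a job has a single logical input; systems with native binary operators express the same join far more directly.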


Programming Model Issues (2)

Fixed, Static Processing Pipeline

Job Chaining
No Support for Iterations
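Because the pipeline of a single job is fixed (map, shuffle, reduce), anything iterative has to be expressed as a chain of independent jobs driven from client code, re-reading its input from and re-writing its output to the distributed file system at every step. Below is a hedged sketch of such a driver loop, with placeholder identity map and reduce steps standing in for real per-iteration logic; avoiding exactly this pattern is the motivation behind iteration-aware systems such as HaLoop.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

  // Placeholder per-iteration map step: pass records through unchanged.
  public static class StepMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] kv = value.toString().split("\t", 2);
      context.write(new Text(kv[0]), new Text(kv.length > 1 ? kv[1] : ""));
    }
  }

  // Placeholder per-iteration reduce step: emit values unchanged.
  public static class StepReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text v : values) {
        context.write(key, v);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("iter_0");   // hypothetical initial data set in HDFS
    int maxIterations = 10;

    for (int i = 0; i < maxIterations; i++) {
      Path output = new Path("iter_" + (i + 1));

      // Every iteration is a brand-new job: setup, scheduling, and
      // materialization costs are paid again on each pass.
      Job job = Job.getInstance(conf, "iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(StepMapper.class);
      job.setReducerClass(StepReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        break;                          // stop the chain on failure
      }
      // A real driver would also check a convergence condition here,
      // e.g. by reading a job counter.
      input = output;                   // output of step i becomes input of step i+1
    }
  }
}
```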

Performance Optimizations

● Operator Pipelining
● Approximate Results
● Indexing and Sorting
● Work Sharing
● Data Reuse
● Skew Mitigation
● Data Colocation
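As a point of reference for the skew-mitigation entry: in vanilla Hadoop the main built-in lever is a user-written Partitioner, which only helps if the heavy keys are known in advance; SkewTune, listed in the summary table below, instead mitigates skew automatically at run time. The sketch that follows shows the manual baseline, with a hypothetical hard-coded set of hot keys (this is not SkewTune's mechanism, just the approach it improves on).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Manual skew handling: scatter known heavy keys over all reducers.
// This only works for decomposable aggregates (e.g. sums), whose partial
// results for a hot key must be merged afterwards.
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

  // Hypothetical hot keys, assumed to be known before the job runs.
  private static final Set<String> HOT_KEYS =
      new HashSet<>(Arrays.asList("the", "and"));

  private final Random random = new Random();

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (HOT_KEYS.contains(key.toString())) {
      // Spread a hot key over all reducers instead of hashing it to one.
      return random.nextInt(numPartitions);
    }
    // Default behaviour: hash-partition everything else.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```

A driver would register it with job.setPartitionerClass(SkewAwarePartitioner.class).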


Programming Model Extensions

● High-Level Languages
  ○ Declarative, SQL-like
  ○ Semi-structured JSON data
  ○ Java / Scala libraries for complex processing flows
● Domain-Specific Systems
  ○ Iterations
  ○ Incremental Computations


Configuration and Automation

● Self-Tuning
  ○ dynamic configuration based on workload
  ○ learning performance models
  ○ data-flow sharing
● Disk I/O Minimization
  ○ dynamically setting the number of reducers
  ○ handling skew and batching I/O operations
● Data-Aware Optimizations
  ○ static code analysis
  ○ index creation and selective input scans

Trends

● In-memory processing
  ○ minimize disk I/O and communication
● Traditional database techniques
  ○ organize and structure data, indexing
● Caching
  ○ reuse of previous computations
● Relaxation of fault-tolerance
  ○ materialize less often


Summary of surveyed systems:

System          Major Contribution                  Open-Source / Available?  Transparent
MR Online       Pipelining, online aggregation      yes                       yes
EARL            Fast approximate results            yes                       no
Hadoop++, HAIL  Improve relational operations       no                        yes / no
MRShare         Concurrent work sharing             no                        no
ReStore         Reuse of previous computations      no                        yes
SkewTune        Automatic skew mitigation           no                        yes
CoHadoop        Data colocation                     no                        no
HaLoop          Iterations support                  yes                       no
Incoop          Incremental processing              no                        no
Starfish        Dynamic self-tuning                 no                        yes
Sailfish        I/O minimization, automatic tuning  no                        yes
Manimal         Automatic data-aware optimizations  no                        yes

Open Issues

● No standard benchmark
● No "typical" MapReduce workload
● Each system is evaluated using different
  ○ datasets
  ○ applications
  ○ deployments
  ■ impossible to compare, or only compared with vanilla Hadoop
● Application transparency

Future Directions

● Fault-tolerance adjustment mechanisms
● Standardize workloads and comparison metrics
● Support for interactive analysis
  ○ query optimization techniques
  ○ data reuse
  ○ fast approximate results


Conclusions

● MapReduce and Hadoop are very useful, successful and interesting tools

● There is still a lot of room for optimizations and research

● But MapReduce might not always be the right tool for the job
  ○ more flexible data-flows
  ○ relational operations
  ○ graph processing
  ○ machine learning
