February 2014 HUG : Tez Details and Insides

Apache Tez : Accelerating Hadoop Query Processing
Bikas Saha (@bikassaha)
© Hortonworks Inc. 2013

Tez – Introduction

• Distributed execution framework targeted towards data-processing applications.

• Based on expressing a computation as a dataflow graph.

• Highly customizable to meet a broad spectrum of use cases.

• Built on top of YARN – the resource management framework for Hadoop.

• Open-source Apache Incubator project, released under the Apache License.

Tez – Design Themes

• Empowering End Users
• Execution Performance

Tez – Empowering End Users

• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment

Tez – Empowering End Users

• Expressive dataflow definition APIs

[Figure: Distributed Sort. The dataflow has a Preprocessor Stage, a Partition Stage and an Aggregate Stage (Task-1 and Task-2 in each), plus a Sampler; the edges are labeled Samples and Ranges.]
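
A minimal sketch of how a dataflow such as the distributed sort could be declared as a graph. This is plain Python, not the Tez API: the `Dag` class, its method names, the direct edge from Preprocessor to Partition, and the parallelism of 1 for the Sampler are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class Dag:
    """Toy dataflow graph: named vertices plus labelled directed edges."""

    def __init__(self):
        self.parallelism = {}             # vertex name -> number of tasks
        self.edges = defaultdict(list)    # source -> [(target, label)]

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def add_edge(self, source, target, label):
        if source not in self.parallelism or target not in self.parallelism:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges[source].append((target, label))
        return self

    def topological_order(self):
        """Return the vertices in an order that respects every edge."""
        indegree = {v: 0 for v in self.parallelism}
        for targets in self.edges.values():
            for target, _ in targets:
                indegree[target] += 1
        ready = deque(v for v, d in indegree.items() if d == 0)
        order = []
        while ready:
            vertex = ready.popleft()
            order.append(vertex)
            for target, _ in self.edges[vertex]:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
        if len(order) != len(self.parallelism):
            raise ValueError("graph has a cycle")
        return order


# Hypothetical wiring of the distributed sort stages (edge layout is assumed).
sort_dag = (
    Dag()
    .add_vertex("Preprocessor", 2)
    .add_vertex("Sampler", 1)
    .add_vertex("Partition", 2)
    .add_vertex("Aggregate", 2)
    .add_edge("Preprocessor", "Sampler", "samples")
    .add_edge("Sampler", "Partition", "ranges")
    .add_edge("Preprocessor", "Partition", "data")
    .add_edge("Partition", "Aggregate", "partitioned data")
)
stage_order = sort_dag.topological_order()
```

`stage_order` is one valid order in which the stages can be started; a cycle in the declared edges raises `ValueError`.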

Tez – Empowering End Users

• Flexible Input-Processor-Output runtime model
  – Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
  – End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.

[Figure: example task compositions]
Mapper: HDFSInput → MapProcessor → FileSortedOutput
FinalReduce: ShuffleInput → ReduceProcessor → HDFSOutput
IntermediateJoiner: Input1 + Input2 → JoinProcessor → FileSortedOutput
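
The composition idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Tez code: `Task`, `list_input`, `sorted_output` and the two processor functions are hypothetical in-memory stand-ins for the inputs, processors and outputs named on the slide.

```python
def list_input(records):
    """Stand-in for an input such as HDFSInput: reads records from memory."""
    return lambda: iter(records)


def sorted_output(sink):
    """Stand-in for FileSortedOutput: appends records to `sink`, sorted by key."""
    return lambda records: sink.extend(sorted(records))


def map_processor(lines):
    """Emit (word, 1) for every word of every input line."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_processor(pairs):
    """Sum the values of each key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return sorted(totals.items())


class Task:
    """A task is just inputs + one processor + one output, wired at runtime."""

    def __init__(self, name, inputs, processor, output):
        self.name = name
        self.inputs = inputs
        self.processor = processor
        self.output = output

    def run(self):
        records = [record for read in self.inputs for record in read()]
        self.output(self.processor(records))


# Compose a "Mapper" and a "FinalReduce" from the same small library of parts.
shuffle_data = []
Task("Mapper", [list_input(["a b", "b c"])], map_processor,
     sorted_output(shuffle_data)).run()

final_result = []
Task("FinalReduce", [list_input(shuffle_data)], reduce_processor,
     final_result.extend).run()
```

Swapping one piece (for example, giving a task two inputs and a join function) yields a different task without touching the others, which is the point of keeping the three roles separate.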

Tez – Empowering End Users

• Data type agnostic
  – Tez is only concerned with the movement of data: files and streams of bytes.
  – Clean separation between the logical application layer and the physical framework layer. This design is important for Tez to be a platform for a variety of applications.

[Figure: A Tez Task moves bytes in and bytes out, as a File or a Stream; User Code interprets them as Key Value pairs or Tuples.]
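
A small sketch of that separation, assuming a hypothetical `run_task` function standing in for the framework: it only passes bytes along, while each piece of user code chooses its own encoding (JSON key-value pairs in one case, packed integers in the other). None of these names come from Tez.

```python
import json
import struct


def run_task(input_bytes: bytes, user_code) -> bytes:
    """Framework side: hands bytes to user code and returns bytes.

    It never inspects or decodes the payload.
    """
    output = user_code(input_bytes)
    if not isinstance(output, bytes):
        raise TypeError("user code must return bytes")
    return output


def key_value_user_code(raw: bytes) -> bytes:
    """User side, variant 1: the bytes are JSON-encoded [key, value] pairs."""
    pairs = json.loads(raw.decode("utf-8"))
    upper = [[key.upper(), value] for key, value in pairs]
    return json.dumps(upper).encode("utf-8")


def tuple_user_code(raw: bytes) -> bytes:
    """User side, variant 2: the bytes are two little-endian 32-bit integers."""
    a, b = struct.unpack("<ii", raw)
    return struct.pack("<i", a + b)


kv_out = run_task(json.dumps([["x", 1], ["y", 2]]).encode("utf-8"),
                  key_value_user_code)
tuple_out = run_task(struct.pack("<ii", 2, 3), tuple_user_code)
```

The same `run_task` serves both applications because the data model lives entirely in user code.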

Tez – Empowering End Users

• Simplifying deployment
  – Tez is a completely client-side application.
  – No deployment to do. Simply upload the Tez libraries to any accessible FileSystem and change the local Tez configuration to point to that location.
  – Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
  – Leverages YARN local resources.

[Figure: Two Client Machines, each running a TezClient; HDFS holds Tez Lib 1 and Tez Lib 2; NodeManagers run the TezTasks.]

Tez – Empowering End Users

• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage

With great APIs come great responsibilities

Tez is a framework on which end user applications can be built

Tez – Execution Performance

• Performance gains over Map Reduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions

Tez – Execution Performance

• Performance gains over Map Reduce
  – Eliminate replicated write barrier between successive computations.
  – Eliminate job launch overhead of workflow jobs.
  – Eliminate extra stage of map reads in every workflow job.
  – Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.

[Figure: Pig/Hive on MR compared with Pig/Hive on Tez]

Tez – Execution Performance

• Plan reconfiguration at runtime
  – Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
  – Advanced changes in dataflow graph structure.
  – Progressive graph construction in concert with user optimizer.

[Figure: HDFS Blocks and YARN Resources feed the plan. Stage 1 has 50 maps producing 100 partitions; Stage 2 is planned with 100 reducers. With only 10 GB of data, Stage 2 is changed from 100 to 10 reducers.]

Tez – Execution Performance

• Optimal resource management
  – Reuse YARN containers to launch new tasks.
  – Reuse YARN containers to enable shared objects across tasks.
  – TezSession to encapsulate all this for the user.

[Figure: The Tez Application Master, in its own YARN Container, exchanges Start Task, Task Done and Start Task messages with a TezTask Host running in another YARN Container. TezTask1 and TezTask2 run in that host and use Shared Objects.]
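
A toy illustration of container reuse with shared objects. The `Container` class and the lookup-table tasks are hypothetical; the sketch only shows why running a second task in an already-launched container saves both a launch and a reload of shared state.

```python
class Container:
    """Hypothetical stand-in for a YARN container that hosts Tez tasks."""

    launches = 0  # counts how many containers were started in total

    def __init__(self):
        Container.launches += 1
        self.shared_objects = {}  # survives across tasks run in this container

    def run(self, task):
        """Run one task, giving it access to the container's shared objects."""
        return task(self.shared_objects)


def load_lookup_table(shared):
    """Load an (imaginary) expensive table once per container, then reuse it."""
    if "table" not in shared:
        shared["table"] = {"a": 1, "b": 2}
        shared["loads"] = shared.get("loads", 0) + 1
    return shared["table"]


def task1(shared):
    return load_lookup_table(shared)["a"]


def task2(shared):
    return load_lookup_table(shared)["b"]


# The scheduler starts task2 in the same container once task1 is done.
container = Container()
results = [container.run(task) for task in (task1, task2)]
```

Both tasks complete with a single container launch, and the table is loaded only once.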

Tez – Execution Performance

• Dynamic physical data flow decisions
  – Decide the type of physical byte movement and storage on the fly.
  – Store intermediate data on distributed store, local store or in-memory.
  – Transfer bytes via blocking files or streaming and the spectrum in between.

[Figure: At runtime, a Producer with small output hands data to its Consumer In-Memory; otherwise the Producer hands data to the Consumer through a Local File.]
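
One way to picture the runtime decision is a producer that picks the transport from the size of its output. This is a sketch only: the `publish` and `consume` functions and the 1 KB threshold are assumptions, not Tez behavior.

```python
import os
import tempfile

IN_MEMORY_LIMIT = 1024  # bytes; assumed threshold for illustration


def publish(data: bytes, limit: int = IN_MEMORY_LIMIT):
    """Producer side: keep small outputs in memory, spill large ones to a file."""
    if len(data) <= limit:
        return ("memory", data)
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return ("file", path)


def consume(published) -> bytes:
    """Consumer side: read the bytes back regardless of how they were stored."""
    kind, ref = published
    if kind == "memory":
        return ref
    with open(ref, "rb") as handle:
        data = handle.read()
    os.remove(ref)  # the intermediate file is no longer needed
    return data


small = publish(b"x" * 10)
large = publish(b"x" * 5000)
small_kind, large_kind = small[0], large[0]
round_trip_ok = consume(small) == b"x" * 10 and consume(large) == b"x" * 5000
```

The consumer code is identical in both cases; only the physical movement of bytes changes.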

Tez – Automatic Reduce Parallelism

[Figure: Tasks of the Map Vertex send Data Size Statistics to the Vertex Manager in the App Master. Actions shown between the Vertex Manager, the Vertex State Machine and the Reduce Vertex: Set Parallelism, Cancel Task, Re-Route.]

Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.

Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
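
The event flow can be sketched as follows. This is a simplified model, not the Tez vertex manager interface: the class name, the extrapolation rule, the 1 GB-per-reducer target and the sample numbers are all assumptions chosen for illustration.

```python
class ReduceVertexManager:
    """Hypothetical pluggable logic that picks reducer parallelism from map stats."""

    def __init__(self, num_map_tasks, planned_reducers, bytes_per_reducer):
        self.num_map_tasks = num_map_tasks
        self.planned_reducers = planned_reducers
        self.bytes_per_reducer = bytes_per_reducer
        self.reported = {}  # map task id -> output bytes

    def on_statistics_event(self, map_task_id, output_bytes):
        """Called when a map task reports the size of its output."""
        self.reported[map_task_id] = output_bytes

    def decide_parallelism(self):
        """Extrapolate total output from the maps seen so far; never scale up."""
        if not self.reported:
            return self.planned_reducers
        seen = sum(self.reported.values())
        estimated_total = seen * self.num_map_tasks // len(self.reported)
        wanted = -(-estimated_total // self.bytes_per_reducer)  # ceiling division
        return max(1, min(self.planned_reducers, wanted))


def reroute_partitions(num_partitions, reducers):
    """Assign contiguous groups of partitions to the remaining reducers."""
    routing = {reducer: [] for reducer in range(reducers)}
    for partition in range(num_partitions):
        routing[partition * reducers // num_partitions].append(partition)
    return routing


GB = 10**9
manager = ReduceVertexManager(num_map_tasks=50, planned_reducers=100,
                              bytes_per_reducer=1 * GB)
# Half of the maps have finished, each reporting 200 MB of output.
for task_id in range(25):
    manager.on_statistics_event(task_id, 200 * 10**6)

reducers = manager.decide_parallelism()
routing = reroute_partitions(100, reducers)
```

With these sample numbers the estimate is 10 GB in total, so the manager advises 10 reducers instead of the planned 100, and each remaining reducer is routed 10 of the 100 partitions.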

Tez – Now and Next

Tez – Bridge the Data Spectrum

[Figure: Typical pattern in a TPC-DS query. A Fact Table is joined with Dimension Tables 1, 2 and 3 in turn, producing Result Tables 1, 2 and 3, using Broadcast Joins and a Shuffle Join. A second plan joins the Fact Table with the dimension tables using Broadcast Joins only: broadcast join for small data sets.]

Based on data size, the query optimizer can run either plan as a single Tez job.
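
A rough sketch of the size-based choice an optimizer might make, together with a tiny in-memory broadcast (hash) join. The 100 MB threshold and all function names are assumptions for illustration; they are not taken from Hive, Pig or Tez.

```python
BROADCAST_LIMIT_BYTES = 100 * 1024 * 1024  # assumed threshold


def choose_join(dimension_bytes, limit=BROADCAST_LIMIT_BYTES):
    """Small dimension tables are broadcast; large ones go through a shuffle."""
    return "broadcast" if dimension_bytes <= limit else "shuffle"


def plan_joins(dimension_sizes):
    """Pick a join strategy for each (table name, size in bytes) pair."""
    return [(name, choose_join(size)) for name, size in dimension_sizes]


def broadcast_join(fact_rows, dimension_rows, key):
    """Hash join: build a lookup from the small side, probe with the large side."""
    lookup = {row[key]: row for row in dimension_rows}
    return [{**fact, **lookup[fact[key]]}
            for fact in fact_rows if fact[key] in lookup]


plan = plan_joins([("dimension_1", 10 * 1024), ("dimension_2", 10**9)])
joined = broadcast_join(
    fact_rows=[{"k": 1, "amount": 5}, {"k": 2, "amount": 6}],
    dimension_rows=[{"k": 1, "name": "a"}],
    key="k",
)
```

Because both strategies are just different edges in one dataflow graph, either plan can be submitted as a single job rather than a chain of jobs.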

Tez – Current status

• Apache Incubator Project
  – Rapid development. Over 800 JIRAs opened. Over 600 resolved.
  – Growing community of contributors and users.
• Focus on stability
  – Testing and quality are highest priority.
  – Code ready and deployed on multi-node environments.
• Support for a wide variety of DAG topologies
  – Already functionally equivalent to Map Reduce. Existing Map Reduce jobs can be executed on Tez with few or no changes.
  – Hive retargeted to use Tez for execution of queries (HIVE-4660).
  – Pig to use Tez for execution of scripts (PIG-3446).

Tez – Roadmap

• Richer DAG support
  – Support for co-scheduling
  – Efficient iterations
• Performance optimizations
  – More efficiencies in transfer of data
  – Improve session performance
• Usability
  – Stability and testability
  – Recovery and history
  – Tools for performance analysis and debugging

Tez – Community

• Early adopters and code contributors welcome
  – Adopters to drive more scenarios. Contributors to make them happen.
  – Hive and Pig communities are on board and making great progress: HIVE-4660 and PIG-3446
• Tez meetup for developers and users
  – http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
  – http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/ (will soon be available on the Apache Wiki)
• Useful links
  – Work tracking: https://issues.apache.org/jira/browse/TEZ
  – Code: https://github.com/apache/incubator-tez
  – Developer list: [email protected]
  – User list: [email protected]
  – Issues list: [email protected]

Tez – Takeaways

• Distributed execution framework that works on computations represented as dataflow graphs
• Naturally maps to execution plans produced by query optimizers
• Customizable execution architecture designed to enable dynamic performance optimizations at runtime
• Works out of the box, with the platform figuring out the hard stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project: your use cases and code are welcome
• It works and is already being used by Hive and Pig

Tez

Thanks for your time and attention!

Deep dive on Tez video at http://www.infoq.com/presentations/apache-tez

Questions?

@bikassaha
