February 2014 HUG : Tez Details and Insides
© Hortonworks Inc. 2013 Page 1
Apache Tez : Accelerating Hadoop Query Processing
Bikas Saha, @bikassaha
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph.
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache incubator project and Apache licensed.
Tez – Design Themes
• Empowering End Users
• Execution Performance
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment
Tez – Empowering End Users
• Expressive dataflow definition APIs
[Figure: Distributed Sort expressed as a dataflow graph with a Preprocessor Stage, a Partition Stage and an Aggregate Stage, each running Task-1 and Task-2. Other labels: Sampler, Samples, Ranges.]
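The idea of expressing a computation as a dataflow graph can be illustrated with a small Python sketch. This is not the Tez API (Tez is a Java framework): the `Dag` class and its methods are hypothetical, the vertex names are borrowed from the Distributed Sort figure, and the edges between the stages are an assumption made for illustration.

```python
from collections import defaultdict, deque


class Dag:
    """Toy dataflow graph (hypothetical, not the Tez API): vertices plus directed edges."""

    def __init__(self):
        self.parallelism = {}            # vertex name -> number of tasks
        self.edges = defaultdict(list)   # producer name -> consumer names

    def add_vertex(self, name, parallelism):
        self.parallelism[name] = parallelism
        return self

    def add_edge(self, producer, consumer):
        if producer not in self.parallelism or consumer not in self.parallelism:
            raise ValueError("add both vertices before connecting them")
        self.edges[producer].append(consumer)
        return self

    def execution_order(self):
        """Order vertices so that every producer precedes its consumers (Kahn's algorithm)."""
        indegree = {name: 0 for name in self.parallelism}
        for consumers in self.edges.values():
            for consumer in consumers:
                indegree[consumer] += 1
        ready = deque(name for name, degree in indegree.items() if degree == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for consumer in self.edges.get(name, []):
                indegree[consumer] -= 1
                if indegree[consumer] == 0:
                    ready.append(consumer)
        if len(order) != len(self.parallelism):
            raise ValueError("graph contains a cycle")
        return order


# Assumed wiring: Preprocessor -> Partition -> Aggregate, two tasks per stage.
sort_dag = (
    Dag()
    .add_vertex("Preprocessor", 2)
    .add_vertex("Partition", 2)
    .add_vertex("Aggregate", 2)
    .add_edge("Preprocessor", "Partition")
    .add_edge("Partition", "Aggregate")
)
print(sort_dag.execution_order())
```

The point of the sketch is only that the application declares vertices and edges, and the framework derives a valid execution order from them.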
Tez – Empowering End Users
• Flexible Input-Processor-Output runtime model
– Construct physical runtime executors dynamically by connecting different inputs, processors and outputs.
– End goal is to have a library of inputs, outputs and processors that can be programmatically composed to generate useful tasks.
[Figure: example tasks composed from inputs, processors and outputs]
Mapper: HDFSInput, MapProcessor, FileSortedOutput
FinalReduce: ShuffleInput, ReduceProcessor, HDFSOutput
IntermediateJoiner: Input1 and Input2, JoinProcessor, FileSortedOutput
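The Input-Processor-Output model can be sketched as follows. All class names here (`ListInput`, `SortedOutput`, `WordCountMapProcessor`, `JoinProcessor`, `Task`) are hypothetical stand-ins written for this illustration; they only mimic the shape of the Mapper and IntermediateJoiner tasks in the figure and are not Tez classes.

```python
class ListInput:
    """Hypothetical input: serves records from an in-memory list."""

    def __init__(self, records):
        self.records = records

    def read(self):
        return iter(self.records)


class SortedOutput:
    """Hypothetical output: collects (key, value) pairs and sorts them by key on close."""

    def __init__(self):
        self.pairs = []

    def write(self, key, value):
        self.pairs.append((key, value))

    def close(self):
        self.pairs.sort(key=lambda pair: pair[0])


class WordCountMapProcessor:
    """Emits (word, 1) for every word read from its single input."""

    def run(self, inputs, output):
        for line in inputs[0].read():
            for word in line.split():
                output.write(word, 1)


class JoinProcessor:
    """Inner-joins two inputs of (key, value) pairs on the key."""

    def run(self, inputs, output):
        right = {}
        for key, value in inputs[1].read():
            right.setdefault(key, []).append(value)
        for key, value in inputs[0].read():
            for other in right.get(key, []):
                output.write(key, (value, other))


class Task:
    """A task is just a processor wired to some inputs and one output."""

    def __init__(self, inputs, processor, output):
        self.inputs = inputs
        self.processor = processor
        self.output = output

    def run(self):
        self.processor.run(self.inputs, self.output)
        self.output.close()
        return self.output


mapper = Task([ListInput(["b a", "a"])], WordCountMapProcessor(), SortedOutput())
joiner = Task(
    [ListInput([(2, "x"), (1, "y")]), ListInput([(1, "p"), (2, "q")])],
    JoinProcessor(),
    SortedOutput(),
)
print(mapper.run().pairs)
print(joiner.run().pairs)
```

The same `Task` wrapper serves both cases; only the combination of inputs, processor and output changes, which is the composition idea the slide describes.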
Tez – Empowering End Users
• Data type agnostic
– Tez is only concerned with the movement of data: files and streams of bytes.
– Clean separation between the logical application layer and the physical framework layer. This design is important for Tez to be a platform for a variety of applications.
[Figure: a Tez Task moves Bytes in and out as a File or Stream; User Code interprets them as Key Value pairs or Tuples.]
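A minimal sketch of this separation, under the assumption that the application chooses its own wire format: the function `move_bytes` stands in for the framework layer and never inspects the payload, while `encode_pairs` and `decode_pairs` are hypothetical user code that gives the bytes a key/value meaning. None of these names come from Tez.

```python
import struct


def move_bytes(payload: bytes) -> bytes:
    """Stand-in for the framework layer: transports bytes without interpreting them."""
    return bytes(payload)


def encode_pairs(pairs):
    """User code: write each (str, int) pair as a length-prefixed UTF-8 key and an 8-byte value."""
    out = bytearray()
    for key, value in pairs:
        raw = key.encode("utf-8")
        out += struct.pack(">I", len(raw)) + raw + struct.pack(">q", value)
    return bytes(out)


def decode_pairs(payload):
    """User code: inverse of encode_pairs."""
    pairs, pos = [], 0
    while pos < len(payload):
        (length,) = struct.unpack_from(">I", payload, pos)
        pos += 4
        key = payload[pos:pos + length].decode("utf-8")
        pos += length
        (value,) = struct.unpack_from(">q", payload, pos)
        pos += 8
        pairs.append((key, value))
    return pairs


pairs = [("tez", 1), ("yarn", 2)]
received = decode_pairs(move_bytes(encode_pairs(pairs)))
print(received)
```

Swapping in a different encoding (tuples, rows, anything else) changes only the user code; the transport function stays the same.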
Tez – Empowering End Users
• Simplifying deployment
– Tez is a completely client side application.
– No deployments to do. Simply upload to any accessible FileSystem and change local Tez configuration to point to that.
– Enables running different versions concurrently. Easy to test new functionality while keeping stable versions for production.
– Leverages YARN local resources.
[Figure: two Client Machines, each running a TezClient, use Tez Lib 1 and Tez Lib 2 stored on HDFS; NodeManagers run the TezTasks.]
Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying usage
With powerful APIs come great responsibilities
Tez is a framework on which end user applications can be built
Tez – Execution Performance
• Performance gains over MapReduce
• Optimal resource management
• Plan reconfiguration at runtime
• Dynamic physical data flow decisions
Tez – Execution Performance
• Performance gains over MapReduce
– Eliminate replicated write barrier between successive computations.
– Eliminate job launch overhead of workflow jobs.
– Eliminate extra stage of map reads in every workflow job.
– Eliminate queue and resource contention suffered by workflow jobs that are started after a predecessor job completes.
[Figure: Pig/Hive on MR compared with Pig/Hive on Tez.]
Tez – Execution Performance
• Plan reconfiguration at runtime
– Dynamic runtime concurrency control based on data size, user operator resources, available cluster resources and locality.
– Advanced changes in dataflow graph structure.
– Progressive graph construction in concert with user optimizer.
[Figure: HDFS Blocks and YARN Resources feed a plan with Stage 1 (50 maps, 100 partitions) and Stage 2 (100 reducers). With only 10 GB of data, Stage 2 is reconfigured from 100 to 10 reducers.]
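The drop from 100 to 10 reducers for 10 GB of data is consistent with simple sizing arithmetic. The sketch below assumes a target of roughly 1 GB per reducer; that target is an assumption chosen to match the figure, since the slide does not state one, and `estimate_reducers` is a hypothetical helper rather than a Tez function.

```python
GB = 1024 ** 3


def estimate_reducers(total_bytes, bytes_per_reducer, max_reducers):
    """Use enough reducers to keep each near bytes_per_reducer, clamped to [1, max_reducers]."""
    wanted = (total_bytes + bytes_per_reducer - 1) // bytes_per_reducer  # ceiling division
    return max(1, min(max_reducers, wanted))


# 10 GB of intermediate data, assumed 1 GB per reducer, planned maximum of 100 reducers.
print(estimate_reducers(10 * GB, 1 * GB, 100))
```

The planned reducer count acts as an upper bound; the measured data size decides how many are actually needed.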
Tez – Execution Performance
• Optimal resource management
– Reuse YARN containers to launch new tasks.
– Reuse YARN containers to enable shared objects across tasks.
– TezSession to encapsulate all this for the user
[Figure: a YARN Container holds a TezTask Host that runs TezTask1 and TezTask2 with Shared Objects; the Tez Application Master, in another YARN Container, exchanges Start Task and Task Done messages with it.]
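Container reuse with shared objects can be sketched as a pool of long-lived workers. Everything below is a toy model under stated assumptions: `Container`, `AppMaster`, `make_task` and the cached "lookup" table are hypothetical names, and real YARN containers are separate processes rather than Python objects.

```python
class Container:
    """Hypothetical long-lived container: runs tasks one after another and keeps shared objects."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}   # survives across tasks run in this container
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


def load_lookup_table(shared):
    """Build an expensive object once per container, then serve it from the shared cache."""
    if "lookup" not in shared:
        shared["lookup"] = {"a": 1, "b": 2}
        shared["loads"] = shared.get("loads", 0) + 1
    return shared["lookup"]


def make_task(key):
    def task(shared):
        return load_lookup_table(shared)[key]
    return task


class AppMaster:
    """Hands each task to an idle container instead of launching a new one."""

    def __init__(self, num_containers):
        self.idle = [Container(i) for i in range(num_containers)]

    def run_all(self, tasks):
        results = []
        for task in tasks:
            container = self.idle.pop()          # "Start Task"
            results.append(container.run(task))
            self.idle.append(container)          # "Task Done": container is reusable again
        return results


master = AppMaster(num_containers=1)
results = master.run_all([make_task("a"), make_task("b"), make_task("a")])
print(results, master.idle[0].tasks_run, master.idle[0].shared_objects["loads"])
```

Three tasks run in one container, and the lookup table is built only once, which is the saving the slide attributes to reuse.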
Tez – Execution Performance
• Dynamic physical data flow decisions
– Decide the type of physical byte movement and storage on the fly.
– Store intermediate data on distributed store, local store or in-memory.
– Transfer bytes via blocking files or streaming and the spectrum in between.
[Figure: decided At Runtime. A Producer with small output hands data to its Consumer In-Memory; otherwise the Producer hands data to the Consumer through a Local File.]
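A size-based choice between in-memory and local-file transfer might look like the sketch below. The threshold `IN_MEMORY_LIMIT`, the channel classes and `choose_channel` are all hypothetical, and the limit is deliberately tiny so the example is easy to run; the slide does not say how Tez makes this decision.

```python
import os
import tempfile

IN_MEMORY_LIMIT = 64  # bytes; hypothetical threshold, kept tiny for the demo


class InMemoryChannel:
    """Keeps the producer's output in memory for the consumer."""
    kind = "in-memory"

    def __init__(self, data):
        self._data = data

    def read(self):
        return self._data

    def close(self):
        self._data = b""


class LocalFileChannel:
    """Spills the producer's output to a local temporary file."""
    kind = "local-file"

    def __init__(self, data):
        fd, self.path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)

    def read(self):
        with open(self.path, "rb") as handle:
            return handle.read()

    def close(self):
        os.remove(self.path)


def choose_channel(data, limit=IN_MEMORY_LIMIT):
    """Pick the physical transport once the actual output size is known."""
    if len(data) <= limit:
        return InMemoryChannel(data)
    return LocalFileChannel(data)


for payload in (b"x" * 10, b"x" * 1000):
    channel = choose_channel(payload)
    print(channel.kind, len(channel.read()))
    channel.close()
```

The consumer sees the same `read()` interface either way, so the decision can be deferred until the producer's output size is known.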
Tez – Automatic Reduce Parallelism
[Figure: inside the App Master, a Map Vertex, a Reduce Vertex, its Vertex Manager and the Vertex State Machine. Labels: Data Size Statistics, Set Parallelism, Cancel Task, Re-Route.]
Event Model: Map tasks send data statistics events to the Reduce Vertex Manager.
Vertex Manager: Pluggable user logic that understands the data statistics and can formulate the correct parallelism. Advises the vertex controller on parallelism.
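The event flow described here can be sketched as a small simulation. `ReduceVertex` and `SizeBasedVertexManager` are hypothetical classes, not Tez's; the 1000 MB per-reducer target is an assumption, and the manager waits for every map task to report, which is a simplification made for this sketch.

```python
MB = 1024 * 1024


class ReduceVertex:
    """Hypothetical vertex controller: tracks parallelism and partition-to-task routing."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self.parallelism = num_partitions
        # Initially partition i is read by reduce task i.
        self.routing = {p: p for p in range(num_partitions)}

    def set_parallelism(self, requested):
        # Surplus tasks are cancelled; contiguous partition ranges are re-routed
        # to the remaining tasks. The result may be slightly below the request.
        per_task = -(-self.num_partitions // requested)  # ceiling division
        self.routing = {p: p // per_task for p in range(self.num_partitions)}
        self.parallelism = max(self.routing.values()) + 1


class SizeBasedVertexManager:
    """Pluggable logic: sums map output sizes, then advises the vertex on parallelism."""

    def __init__(self, vertex, num_map_tasks, bytes_per_reducer):
        self.vertex = vertex
        self.num_map_tasks = num_map_tasks
        self.bytes_per_reducer = bytes_per_reducer
        self.reported = 0
        self.total_bytes = 0

    def on_statistics_event(self, output_bytes):
        self.total_bytes += output_bytes
        self.reported += 1
        if self.reported == self.num_map_tasks:
            wanted = max(1, -(-self.total_bytes // self.bytes_per_reducer))
            if wanted < self.vertex.parallelism:
                self.vertex.set_parallelism(wanted)


vertex = ReduceVertex(num_partitions=100)
manager = SizeBasedVertexManager(vertex, num_map_tasks=50, bytes_per_reducer=1000 * MB)
for _ in range(50):
    manager.on_statistics_event(200 * MB)   # each map task reports its output size
print(vertex.parallelism, vertex.routing[0], vertex.routing[99])
```

With 50 map tasks reporting 200 MB each, the manager lowers the reduce parallelism and each remaining task reads a contiguous range of the original 100 partitions.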
Tez – Now and Next
Tez – Bridge the Data Spectrum
[Figure: typical pattern in a TPC-DS query. A Fact Table is joined in turn with Dimension Tables 1, 2 and 3, producing Result Tables 1, 2 and 3, using Broadcast Join or Shuffle Join. A second plan uses Broadcast Join throughout, for small data sets.]
Based on data size, the query optimizer can run either plan as a single Tez job
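The choice between the two join styles can be sketched as follows. The functions, the byte limit passed to `choose_join` and the sample rows are all hypothetical; the sketch also assumes dimension keys are unique, and it says nothing about how Hive's optimizer actually decides.

```python
def choose_join(dimension_bytes, broadcast_limit_bytes):
    """Broadcast the dimension table only when it is small enough to copy to every task."""
    return "broadcast" if dimension_bytes <= broadcast_limit_bytes else "shuffle"


def broadcast_join(fact_rows, dim_rows):
    """Each task holds the whole (small) dimension table as a hash map and probes it."""
    dim = dict(dim_rows)  # assumes unique dimension keys
    return [(key, value, dim[key]) for key, value in fact_rows if key in dim]


def shuffle_join(fact_rows, dim_rows, num_partitions):
    """Both tables are partitioned by key hash; each partition is joined on its own."""
    fact_parts = [[] for _ in range(num_partitions)]
    dim_parts = [[] for _ in range(num_partitions)]
    for key, value in fact_rows:
        fact_parts[hash(key) % num_partitions].append((key, value))
    for key, value in dim_rows:
        dim_parts[hash(key) % num_partitions].append((key, value))
    joined = []
    for fact_part, dim_part in zip(fact_parts, dim_parts):
        joined.extend(broadcast_join(fact_part, dim_part))
    return joined


fact = [(1, "a"), (2, "b"), (3, "c"), (1, "d")]
dim = [(1, "x"), (2, "y")]
strategy = choose_join(dimension_bytes=10, broadcast_limit_bytes=100)
rows = broadcast_join(fact, dim) if strategy == "broadcast" else shuffle_join(fact, dim, 4)
print(strategy, sorted(rows))
```

Both strategies produce the same rows; they differ only in how data moves, which is why either plan can be expressed as one dataflow graph.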
Tez – Current status
• Apache Incubator Project
– Rapid development. Over 800 JIRAs opened, over 600 resolved.
– Growing community of contributors and users
• Focus on stability
– Testing and quality are highest priority.
– Code ready and deployed on multi-node environments.
• Support for a vast topology of DAGs
– Already functionally equivalent to MapReduce. Existing MapReduce jobs can be executed on Tez with few or no changes.
– Hive retargeted to use Tez for execution of queries (HIVE-4660).
– Pig to use Tez for execution of scripts (PIG-3446).
Tez – Roadmap
• Richer DAG support
– Support for co-scheduling
– Efficient iterations
• Performance optimizations
– More efficiencies in transfer of data
– Improve session performance
• Usability
– Stability and testability
– Recovery and history
– Tools for performance analysis and debugging
Tez – Community
• Early adopters and code contributors welcome
– Adopters to drive more scenarios. Contributors to make them happen.
– Hive and Pig communities are on-board and making great progress - HIVE-4660 and PIG-3446
• Tez meetup for developers and users
– http://www.meetup.com/Apache-Tez-User-Group
• Technical blog series
– http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/ (will soon be available on the Apache Wiki)
• Useful links
– Work tracking: https://issues.apache.org/jira/browse/TEZ
– Code: https://github.com/apache/incubator-tez
– Developer list: [email protected]
– User list: [email protected]
– Issues list: [email protected]
Tez – Takeaways
• Distributed execution framework that works on computations represented as dataflow graphs
• Naturally maps to execution plans produced by query optimizers
• Customizable execution architecture designed to enable dynamic performance optimizations at runtime
• Works out of the box with the platform figuring out the hard stuff
• Spans the spectrum from interactive latency to batch
• Open source Apache project – your use cases and code are welcome
• It works and is already being used by Hive and Pig
Tez
Thanks for your time and attention!
Deep dive on Tez video at http://www.infoq.com/presentations/apache-tez
Questions?
@bikassaha