Apache Tez – Present and Future
© Hortonworks Inc. 2015 Page 1
Apache Tez – Present and Future
Jeff Zhang (@zjffdu)
Rajesh Balamohan (@rajeshbalamohan)
Agenda
• Tez Introduction
• Tez Feature Deep Dive
• Tez Improvement & Debuggability
• Tez Status & Roadmap
Example of MR versus Tez
[Figure: the same Hive query run on two engines.
Hive - MR: four separate jobs, Job 1 (Join a & b), Job 2 (Group by of a Join b), Job 3 (Group by of c), Job 4 (Join of S & R); the diagram marks two I/O Synchronization Barriers.
Hive - Tez: a Single Job containing Join a & b, Group by of a Join b, Group by of c, and Join of S & R.]
Tez – Introduction
• Distributed execution framework targeted towards data-processing applications.
• Based on expressing a computation as a dataflow graph (DAG).
• Highly customizable to meet a broad spectrum of use cases.
• Built on top of YARN – the resource management framework for Hadoop.
• Open source Apache project and Apache licensed.
What is DAG & Why DAG
• Directed Acyclic Graph
• Any complicated DAG can be composed of the following 3 basic paradigms
– Sequential (Projection, Filter, GroupBy, …)
– Merge (Join, Union, Intersect, …)
– Divide (Split, …)
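The three paradigms can be illustrated with a minimal sketch in plain Python. This is not the Tez API: the vertex names are made up for illustration, and the standard-library graphlib module (Python 3.9+) stands in for a scheduler that runs a vertex only after its inputs.

```python
from graphlib import TopologicalSorter

# Each key is a vertex; its value is the set of vertices it reads from.
# All vertex names are hypothetical.
dag = {
    "load": set(),
    "filter": {"load"},              # Sequential: one input, one output
    "split_a": {"filter"},           # Divide: "filter" feeds two vertices
    "split_b": {"filter"},
    "join": {"split_a", "split_b"},  # Merge: two inputs feed one vertex
}

# static_order() yields vertices so that every vertex comes after its inputs.
# It raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

The relative order of split_a and split_b is not fixed, which is the point: independent branches of a DAG can run in any order or in parallel.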
Expressing DAG in Tez API
• DAG API (Logic View)
– Allow user to build DAG
– Topological structure of the data computation flow
• Runtime API (Runtime View)
– Application logic of each computation unit (vertex)
– How to move/read/write data between vertices
DAG API (Logic View)
• Vertex (Processor, Parallelism, Resource, etc…)
• Edge (EdgeProperty)
– DataMovement
  – Scatter Gather (Join, GroupBy … )
  – Broadcast ( Pig Replicated Join / Hive Broadcast Join )
  – One-to-One ( Pig Order by )
  – Custom
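How the three built-in data movement types route records can be sketched in plain Python. This is an illustration only, not Tez code: the function names are hypothetical, and the CRC32-based partitioner is an assumption standing in for whatever partitioner an application configures.

```python
import zlib

def partition_of(key, num_consumers):
    # Assumed partitioner: a stable hash of the key, modulo the consumer count.
    return zlib.crc32(key.encode("utf-8")) % num_consumers

def scatter_gather(producer_outputs, num_consumers):
    """Every producer splits its output by key; consumer i gathers partition i."""
    consumers = [[] for _ in range(num_consumers)]
    for output in producer_outputs:
        for key, value in output:
            consumers[partition_of(key, num_consumers)].append((key, value))
    return consumers

def broadcast(producer_outputs, num_consumers):
    """Every consumer receives the full output of every producer."""
    everything = [record for output in producer_outputs for record in output]
    return [list(everything) for _ in range(num_consumers)]

def one_to_one(producer_outputs):
    """Producer task i feeds consumer task i, so the task counts match."""
    return [list(output) for output in producer_outputs]

outputs = [[("a", 1), ("b", 2)], [("a", 3)]]
print(scatter_gather(outputs, 2))
print(broadcast(outputs, 2))
print(one_to_one(outputs))
```

With scatter gather, both records for key "a" land on the same consumer, which is what a join or group-by needs.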
Runtime API (Runtime View)
[Figure: a Processor with its Input and Output]
• Input
– Through which processor receives data on an edge
– Vertex can have multiple inputs
• Processor
– Application Logic (One vertex one processor)
– Consume the inputs and produce the outputs
• Output
– Through which processor writes data to an edge
– One vertex can have multiple outputs
• Example of Input/Output/Processor
– MRInput & MROutput (InputFormat/OutputFormat)
– OrderedGroupedKVInput & OrderedPartitionedKVOutput (Scatter Gather)
– UnorderedKVInput & UnorderedKVOutput (Broadcast & One-to-One)
– PigProcessor/HiveProcessor
Benefit of DAG
• Easier to express computation in DAG
• No intermediate data written to HDFS
• Less pressure on NameNode
• No resource queuing effort & less resource contention
• More optimization opportunity with more global context
Container-Reuse
• Reuse the same container across DAG/Vertices/Tasks
• Benefit of Container-Reuse
– Fewer resources consumed
– Reduce overhead of launching JVM
– Reduce overhead of negotiating with Resource Manager
– Reduce overhead of resource localization
– Reduce network IO
– Object Caching (Object Sharing)
Tez Session
• Multiple Jobs/DAGs in one AM
• Container-reuse across Jobs/DAGs
• Data sharing between Jobs/DAGs
Dynamic Parallelism Estimation
• VertexManager
– Listen to the status of other vertices
– Coordinate and schedule its tasks
– Communication between vertices
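One way a vertex manager could turn upstream status into a parallelism decision is sketched below. This is an assumption-laden illustration, not the Tez VertexManager API: the function name, its parameters, and the rule (extrapolate total upstream output, then size tasks to a target input volume) are all hypothetical.

```python
import math

def estimate_parallelism(completed_output_bytes, total_source_tasks,
                         desired_bytes_per_task, configured_parallelism):
    """Pick a task count from the output sizes reported by finished source tasks.

    completed_output_bytes: sizes reported so far by completed upstream tasks.
    total_source_tasks: how many upstream tasks exist in total.
    desired_bytes_per_task: assumed target input volume for one downstream task.
    configured_parallelism: the task count the plan started with (upper bound).
    """
    if not completed_output_bytes:
        # Nothing observed yet: keep the configured value.
        return configured_parallelism
    average = sum(completed_output_bytes) / len(completed_output_bytes)
    projected_total = average * total_source_tasks
    needed = max(1, math.ceil(projected_total / desired_bytes_per_task))
    # Only shrink parallelism; never exceed what was configured.
    return min(configured_parallelism, needed)

# Two of ten source tasks have finished and reported 100 and 300 bytes.
print(estimate_parallelism([100, 300], 10, 500, 20))
```

Deciding after only part of the upstream vertex has finished trades accuracy for an earlier start of the downstream tasks.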
ATS Integration
• Tez is fully integrated with YARN ATS (Application Timeline Server)
– DAG Status, DAG Metrics, Task Status, Task Metrics are captured
• Diagnostics & Performance analysis
– Data Source for monitoring & diagnostics
– Data Source for performance analysis
Recovery
• AM can crash in corner cases
– OOM
– Node failure
– …
• Continue from the last checkpoint
• Transparent to end users
[Figure: AM Crash]
Order By of Pig
f = Load 'foo' as (x, y);
o = Order f by x;
[Figure: the Order By plan in two variants. The first lists Load, Sample (Calculate Histogram), HDFS, Partition, Sort. The second lists Load, Sample (Calculate Histogram), Partition, Sort. Edge labels: Broadcast, One-to-One, Scatter Gather (twice).]
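The Sample, Partition, and Sort stages of such an Order By can be sketched in plain Python. This is a simplified illustration, not Pig or Tez code: the function names, the every-second-record sampling, and the evenly spaced split points are assumptions.

```python
import bisect

def sample_boundaries(sample_keys, num_partitions):
    """Derive num_partitions - 1 split points from a non-empty sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]

def partition_for(key, boundaries):
    # Keys below the first boundary go to partition 0, and so on.
    return bisect.bisect_right(boundaries, key)

def order_by(records, num_partitions, sample_every=2):
    sample = records[::sample_every]                         # Sample
    boundaries = sample_boundaries(sample, num_partitions)   # broadcast to partitioners
    partitions = [[] for _ in range(num_partitions)]
    for key in records:                                      # Partition
        partitions[partition_for(key, boundaries)].append(key)
    return [sorted(p) for p in partitions]                   # Sort each partition

result = order_by([9, 3, 7, 1, 8, 2, 6, 4, 5, 0], num_partitions=3)
print(result)
```

Because the partition function is monotone in the key, concatenating the sorted partitions in order gives a globally sorted result without a final single-task sort.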
• Performance
– Speculation
– Intermediate File Improvements
– Better use of JVM Memory
– Shuffle Improvements
• Debuggability
– Tez UI
– Local mode
– Job Analysis Tools
– Shuffle Performance Analysis Tool
Speculation
• Good for clusters with a mix of fast and slow nodes or heterogeneous hardware.
• Maintains periodic runtime statistics of tasks
• Triggers speculative attempt when estimated runtime > mean runtime
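The trigger rule can be sketched as follows. This is a minimal illustration, not the Tez speculator: the function names are hypothetical, the estimate is a simple linear extrapolation from progress, and the mean is assumed to be taken over completed tasks of the same vertex.

```python
from statistics import mean

def estimated_runtime(elapsed_seconds, progress):
    """Extrapolate total runtime from elapsed time and fractional progress (0 < p <= 1)."""
    return elapsed_seconds / progress

def tasks_to_speculate(running, completed_runtimes):
    """Return ids of running tasks whose estimated runtime exceeds the mean runtime.

    running: {task_id: (elapsed_seconds, progress)}
    completed_runtimes: runtimes in seconds of tasks that already finished.
    """
    if not completed_runtimes:
        return []  # no statistics yet, so nothing to compare against
    threshold = mean(completed_runtimes)
    return [
        task_id
        for task_id, (elapsed, progress) in running.items()
        if progress > 0 and estimated_runtime(elapsed, progress) > threshold
    ]

running = {"task_1": (30, 0.5), "task_2": (30, 0.9)}
print(tasks_to_speculate(running, completed_runtimes=[40, 50]))
```

Here task_1 is projected to take 60 seconds against a mean of 45, so it would get a speculative attempt, while task_2 would not.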
Intermediate File Format Improvements
• Used for storing intermediate data in Tez
• Drawbacks of earlier format
– Needs larger buffer in memory (due to duplicate keys)
– Bigger file size in disk
– Not ideal for all use cases
• New Intermediate File Format
– Works based on (K, List<V>)
– Provides 57% memory efficiency and 23% improvement in disk storage
[Figure: a Task writes Spill 1, Spill 2, Spill 3, which become a Merged Spill]
Old IFile Format
Key Len | Value Len | K1 | V1
Key Len | Value Len | K1 | V2
Key Len | Value Len | K1 | V3
Key Len | Value Len | K2 | V1
…
Key Len | Value Len | K2 | V5
Key Len | Value Len | K2 | V6
New IFile Format (fields in the order they appear on the slide)
Key Len | K1 | Value Len | V1 | Value Len | V2 | V_END | RLE | Value Len | V3 | …
Key Len | K2 | Value Len | V1 | Value Len | V5 | V_END | RLE | Value Len | V6 | …
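The size effect of storing each key once can be seen in a small sketch. This is not the real IFile byte layout: the 4-byte length fields and the V_END value of -1 are assumptions chosen for illustration, and the RLE marker is left out.

```python
import struct
from itertools import groupby

V_END = -1  # hypothetical end-of-values marker

def encode_flat(sorted_pairs):
    """Old style: key length, value length, key, value for every record."""
    out = bytearray()
    for key, value in sorted_pairs:
        out += struct.pack(">ii", len(key), len(value)) + key + value
    return bytes(out)

def encode_grouped(sorted_pairs):
    """(K, List<V>) style: each key once, then its values, then an end marker."""
    out = bytearray()
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        out += struct.pack(">i", len(key)) + key
        for _, value in group:
            out += struct.pack(">i", len(value)) + value
        out += struct.pack(">i", V_END)
    return bytes(out)

pairs = [(b"K1", b"V1"), (b"K1", b"V2"), (b"K1", b"V3"), (b"K2", b"V1")]
flat = encode_flat(pairs)
grouped = encode_grouped(pairs)
print(len(flat), len(grouped))  # prints "48 44"
```

The saving grows with the number of values per key and with key length; for a key that has a single value, the end marker makes the grouped form slightly larger in this toy layout.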
Better use of JVM Memory
• BytesWritable Improvements
– Provides FastByteSerialization
– Saves 8 bytes per key-value pair
– Reduces IFile size by 25%
– Reduces SERDE costs
• PipelinedSorter can support > 2 GB sort buffers
– Containers with higher RAM no longer limited by 2 GB sort buffer limits
– Avoids unnecessary spills in large jobs
• Reduced key comparison costs in PipelinedSorter
[Figure: Key and Value serialization layouts, labeled "Serialize to memory" and "Serialize to disk". Rows shown: Key Size, Bytes, Val Size, Bytes; and Key Size, BytesSize, Val Size, BytesSize.]
Better use of JVM Memory - Contd
• Enabled RLE in reducer codepath
– Reduced key comparisons in merge codepath
– Improved Job Runtime (observed 10% improvement)
– Reduced CPU cost
[Figure: job runtime Without Fix: 691 seconds; With Fix: 621 seconds]
Better use of JVM Memory - Contd
• WeightedMemoryDistributor for better memory management in tasks
– Observed 26% runtime improvement in tasks
Broadcast Shuffle Improvements
[Figure: Before Fix versus After Fix. In each diagram a Source Task broadcasts to three groups of tasks (Task 1, Task 2, …, Task N); two of the paths are labeled "From local disk".]
PipelinedShuffle Improvements
• Final merge in source task is avoided.
– Less IO
• Consumers are informed about spill events in advance
– Better usage of network bandwidth
– Overlap CPU with network
– For sorted/unsorted outputs, send data to consumers in chunks
• Observed 20% runtime improvement in queries involving heavy skews
[Figure: Normal Shuffle Path versus Pipelined Shuffle Path. Source tasks (Task 1, Task 2) produce Spill 1, Spill 2, Spill 3 for Reduce Task 1 … Reduce Task N; one of the two diagrams adds a Merged Spill step before the reduce tasks fetch.]
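The difference between the two paths can be sketched with sorted lists standing in for spills. This is an illustration only, not Tez shuffle code: the function names are hypothetical, and a real consumer would fetch over the network rather than iterate in-process.

```python
import heapq

def normal_shuffle(spills):
    """Producer merges all sorted spills first; the consumer fetches one merged run."""
    merged = list(heapq.merge(*spills))
    return [merged]  # a single fetch, available only after the final merge

def pipelined_shuffle(spills):
    """Producer announces each sorted spill as soon as it exists; no final merge."""
    for spill in spills:
        yield list(spill)  # the consumer can start fetching this chunk right away

def consume(fetches):
    """Consumer merges whatever sorted runs it fetched."""
    return list(heapq.merge(*fetches))

spills = [[1, 4], [2, 5], [3, 6]]
print(consume(normal_shuffle(spills)))
print(consume(pipelined_shuffle(spills)))
```

Both paths deliver the same sorted data; the pipelined path moves the merge work to the consumer and lets transfer overlap with the producer still writing later spills.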
PipelinedShuffle Improvements
Job Runtime : 925 seconds
Job Runtime : 680 seconds
– 26% improvement
– Avoids final merge (less IO, CPU cost)
– Downstream can consume data whenever a spill is generated
Tez UI
[Figure: Tez UI screenshots, one labeled "Download data from ATS"]
Better Debuggability – Local Mode
• Test Tez Jobs without Hadoop Cluster
• Enables Fast Prototyping
• Fast Unit Testing
• Runs on Single JVM (easy for debugging)
• Scheduling / RPC invocations Skipped
Job Analysis Tools
• DAG Swimlane
– “$TEZ_HOME/tez-tools/swimlanes/sh yarn-swimlanes.sh <app_id>”
[Figure: swimlane chart annotated with Prewarm, Container Reuse, and Remote Reads]
Shuffle Performance Analysis Tools
• Analyze Tez logs in Hadoop
• Analyze shuffle performance between source / destination nodes
[Figure: data transferred from node 7 to the rest of the nodes is slow]
RoadMap
• Shared output edges
– Same output to multiple vertices
• Local mode stabilization
• Optimizing (include/exclude) vertex at runtime
• Partial completion VertexManager
• Co-Scheduling
• Framework stats for better runtime decisions
Tez – Adoption
• Apache Hive
– Starting with Hive 0.13
– set hive.execution.engine = tez
• Apache Pig
– Starting with Pig 0.14
– pig -x tez
• Cascading
• Flink
Tez Community
• Useful Links
– http://tez.apache.org/
– JIRA : https://issues.apache.org/jira/browse/TEZ
– Code Repository: https://git-wip-us.apache.org/repos/asf/tez.git
– Mailing Lists
  – Dev List: [email protected]
  – User List: [email protected]
  – Issues List: [email protected]
• Tez Meetup
– http://www.meetup.com/Apache-Tez-User-Group
Thank You!
Questions & Answers