February 2014 HUG : Hive On Tez


Transcript of February 2014 HUG : Hive On Tez

Page 1: February 2014 HUG : Hive On Tez

© Hortonworks Inc. 2013.

Hive on Tez

Gunther Hagleitner ([email protected])

Page 2: February 2014 HUG : Hive On Tez


Stinger Project (announced February 2013)

Batch AND Interactive SQL-IN-Hadoop

Stinger Initiative: A broad, community-based effort to drive the next generation of Hive

Coming Soon:
• Hive on Apache Tez
• Query Service
• Buffer Cache
• Cost Based Optimizer (Optiq)
• Vectorized Processing

Hive 0.11, May 2013:
• Base Optimizations
• SQL Analytic Functions
• ORCFile, Modern File Format

Hive 0.12, October 2013:

• VARCHAR, DATE Types
• ORCFile predicate pushdown
• Advanced Optimizations
• Performance Boosts via YARN

Goals:

• Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
• Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
• SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop

…all IN Hadoop

Page 3: February 2014 HUG : Hive On Tez


SQL: Enhancing SQL Semantics

Hive SQL Datatypes:
• INT
• TINYINT/SMALLINT/BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
• TIMESTAMP
• BINARY
• DECIMAL
• ARRAY, MAP, STRUCT, UNION
• DATE
• VARCHAR
• CHAR

Hive SQL Semantics:
• SELECT, INSERT
• GROUP BY, ORDER BY, SORT BY
• JOIN on explicit join key
• Inner, outer, cross and semi joins
• Sub-queries in FROM clause
• ROLLUP and CUBE
• UNION
• Windowing Functions (OVER, RANK, etc.)
• Custom Java UDFs
• Standard Aggregation (SUM, AVG, etc.)
• Advanced UDFs (ngram, XPath, URL)
• Sub-queries in WHERE, HAVING
• Expanded JOIN Syntax
• SQL Compliant Security (GRANT, etc.)
• INSERT/UPDATE/DELETE (ACID)

Status legend (color-coded on the original slide): Available / Hive 0.12 / Roadmap

SQL Compliance: Hive 0.12 provides a wide array of SQL data types and semantics so your existing tools integrate more seamlessly with Hadoop.
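For illustration, a minimal HiveQL sketch of two of the semantics listed above (windowing functions and ROLLUP); the sales table and its region/product/amount columns are hypothetical:

  -- Windowing function (OVER / RANK), available since Hive 0.11
  SELECT region, product, amount,
         RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
  FROM sales;

  -- ROLLUP aggregation
  SELECT region, product, SUM(amount) AS total
  FROM sales
  GROUP BY region, product WITH ROLLUP;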

Page 4: February 2014 HUG : Hive On Tez


Stinger: Hive performance


Feature / Description / Benefit:

• Tez Integration – Tez is a significantly better engine than MapReduce. Benefit: Latency
• Vectorized Query – Takes advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time. Benefit: Throughput
• Query Planner – Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of the input (beyond partition pruning). Benefit: Latency
• ORC File – Columnar, type-aware format with indices. Benefit: Latency
• Cost Based Optimizer (Optiq) – Join re-ordering and other optimizations based on column statistics, including histograms etc. (future). Benefit: Latency

Page 5: February 2014 HUG : Hive On Tez


Hive on Tez – Basics

• Hive/Tez integration in Hive 0.13 (Tez 0.2.0)
• Hive/Tez 0.3.0 available in branch
• Hive on MR works unchanged on Hadoop 1 and 2
• Hive on Tez is only available on Hadoop 2
• Turn on: hive.execution.engine=tez (see the snippet after this list)
• Hive looks and feels the same on both MR and Tez

– CLI/JDBC/UDFs/SQL/Metastore

• Explain/reporting is similar to MR, but reflects differences in plan/execution

• Deployment is simple (Tez comes as a client-side library)
• Job client can submit MR via Tez

– And emulate NES too (duck hunt anyone?)
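A minimal sketch of switching execution engines from the Hive CLI using the property named above (the query itself is just illustrative):

  set hive.execution.engine=tez;    -- run the next query on Tez
  select count(*) from store_sales;

  set hive.execution.engine=mr;     -- fall back to classic MapReduce
  select count(*) from store_sales;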


Page 6: February 2014 HUG : Hive On Tez


Query 88


select *
from (select count(*) h8_30_to_9
      from store_sales
      JOIN household_demographics ON store_sales.ss_hdemo_sk = household_demographics.hd_demo_sk
      JOIN time_dim ON store_sales.ss_sold_time_sk = time_dim.t_time_sk
      JOIN store ON store_sales.ss_store_sk = store.s_store_sk
      where time_dim.t_hour = 8
        and time_dim.t_minute >= 30
        and ((household_demographics.hd_dep_count = 3 and household_demographics.hd_vehicle_count <= 3+2)
          or (household_demographics.hd_dep_count = 0 and household_demographics.hd_vehicle_count <= 0+2)
          or (household_demographics.hd_dep_count = 1 and household_demographics.hd_vehicle_count <= 1+2))
        and store.s_store_name = 'ese') s1
JOIN (select count(*) h9_to_9_30
      from store_sales ...

• 8 full table scans

Page 7: February 2014 HUG : Hive On Tez


Query 88: M/R


Total MapReduce jobs = 29
...
Total MapReduce CPU Time Spent: 0 days 2 hours 52 minutes 39 seconds 380 msec
OK
345617  687625  686131  1032842  1030364  606859  604232  692428
Time taken: 403.28 seconds, Fetched: 1 row(s)

Page 8: February 2014 HUG : Hive On Tez


Query 88: Tez


Map 1: 1/1  Map 11: 1/1  Map 12: 1/1  Map 13: 1/1  Map 14: 1/1  Map 15: 1/1  Map 16: 241/241  Map 18: 1/1
Map 19: 1/1  Map 2: 1/1  Map 20: 1/1  Map 21: 1/1  Map 22: 1/1  Map 23: 1/1  Map 24: 241/241  Map 26: 1/1
Map 27: 1/1  Map 28: 1/1  Map 29: 1/1  Map 3: 241/241  Map 30: 240/240  Map 32: 241/241  Map 34: 1/1  Map 35: 1/1
Map 36: 1/1  Map 37: 1/1  Map 38: 1/1  Map 39: 241/241  Map 42: 1/1  Map 43: 1/1  Map 44: 240/240  Map 46: 241/241
Reducer 10: 1/1  Reducer 17: 1/1  Reducer 25: 1/1  Reducer 31: 1/1  Reducer 33: 1/1  Reducer 4: 1/1  Reducer 40: 1/1  Reducer 41: 1/1
Reducer 45: 1/1  Reducer 47: 1/1  Reducer 5: 1/1  Reducer 6: 1/1  Reducer 7: 1/1  Reducer 8: 1/1  Reducer 9: 1/1
Status: Finished successfully
OK
345617  687625  686131  1032842  1030364  606859  604232  692428
Time taken: 90.233 seconds, Fetched: 1 row(s)

Page 9: February 2014 HUG : Hive On Tez


Hive-on-MR vs. Hive-on-Tez

SELECT g1.x, g1.avg, g2.cnt
FROM (SELECT a.x, AVG(a.y) AS avg FROM a GROUP BY a.x) g1
JOIN (SELECT b.x, COUNT(b.y) AS cnt FROM b GROUP BY b.x) g2
  ON (g1.x = g2.x)
ORDER BY avg;

[Diagram: Hive-on-MR runs GROUP a BY a.x, GROUP b BY b.x, JOIN (a,b) and ORDER BY as separate map/reduce jobs, writing each intermediate result to HDFS; Hive-on-Tez chains the same operators into a single DAG with no intermediate HDFS writes.]

Tez avoids unnecessary writes to HDFS

HIVE-4660

Page 10: February 2014 HUG : Hive On Tez


Hive on Tez - Execution


Feature / Description / Benefit:

• Tez Session – Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster. Benefit: Latency
• Tez Container Pre-Launch – Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries. Benefit: Latency
• Tez Container Re-Use – Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out of box performance! Benefit: Latency
• Runtime re-configuration of DAG – Runtime query tuning by picking aggregation parallelism using online query statistics. Benefit: Throughput
• Tez In-Memory Cache – Hot data kept in RAM for fast access. Benefit: Latency
• Complex DAGs – Tez Broadcast Edge and the Map-Reduce-Reduce pattern improve query scale and throughput. Benefit: Throughput
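A hedged sketch of session-level settings related to some of these features; hive.prewarm.* and tez.am.container.reuse.enabled exist in the Hive 0.13 / Tez era, but the values shown are illustrative:

  set hive.execution.engine=tez;
  set hive.prewarm.enabled=true;               -- pre-launch ("prewarm") Tez containers
  set hive.prewarm.numcontainers=10;           -- how many containers to prewarm
  set tez.am.container.reuse.enabled=true;     -- finished tasks pick up more work instead of exiting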

Page 11: February 2014 HUG : Hive On Tez


Pipelined Splits

[Diagram: the Tez AM (tez.am.grouping.split-waves=1.4) starts launching map tasks while ORC split computation (hive.orc.compute.splits.num.threads=10) is still running.]

Reduce start-up latency by launching tasks as soon as they are ready.
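A minimal sketch of setting the two knobs shown above from a Hive session (values as on the slide; tune for your cluster):

  set hive.orc.compute.splits.num.threads=10;  -- parallel threads for ORC split computation
  set tez.am.grouping.split-waves=1.4;         -- how many "waves" of map tasks the Tez AM schedules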

Page 12: February 2014 HUG : Hive On Tez


Pipelined Early Exit (Limit)

[Diagram: the same split-generation pipeline (hive.orc.compute.splits.num.threads=10, tez.am.grouping.split-waves=1.4); with a LIMIT the query can report "Done!" as soon as enough rows have been produced, without processing the remaining splits.]

Page 13: February 2014 HUG : Hive On Tez


Statistics and Cost-based optimization


• Statistics:
– Hive has table and column level statistics
– Used to determine parallelism, join selection

• Optiq: Open source, Apache-licensed query execution framework in Java
– Used by Apache Drill, Apache Cascade, Lucene DB, …
– Based on the Volcano paper
– 20 man-years of dev, more than 50 optimization rules

• Goals for Hive:
– Ease of use – no manual tuning for queries, make choices automatically based on cost
– View chaining/ad hoc queries involving multiple views
– Help enable BI tools front-ending Hive
– Emphasis on latency reduction

• Cost computation will be used for:
– Join ordering
– Join algorithm selection
– Tez vertex boundary selection

HIVE-5775
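The optimizer is only as good as its statistics; a minimal HiveQL sketch of gathering them (the column names are illustrative, the ANALYZE syntax is standard Hive):

  ANALYZE TABLE store_sales COMPUTE STATISTICS;                                       -- table-level statistics
  ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_store_sk, ss_hdemo_sk;  -- column-level statistics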

Page 14: February 2014 HUG : Hive On Tez


Broadcast Join

• Similar to map-join w/o the need to build a hash table on the client

• Will work with any level of sub-query nesting
• Uses stats to determine if applicable
• How it works:
– Broadcast result set is computed in parallel on the cluster
– Join processors are spun up in parallel
– Broadcast set is streamed to the join processors
– Join processors build a hash table
– The other relation is joined with the hash table

• Tez handles:
– Best parallelism
– Best data transfer of the hashed relation
– Best scheduling to avoid latencies
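A hedged sketch of how a query takes this path: with automatic join conversion enabled and the small side under the size threshold, Hive converts the shuffle join into a broadcast (map) join. The properties below exist in Hive; the threshold value and query are illustrative:

  set hive.auto.convert.join=true;
  set hive.auto.convert.join.noconditionaltask=true;
  set hive.auto.convert.join.noconditionaltask.size=100000000;  -- ~100 MB limit for the broadcast side

  SELECT ss.ss_store_sk, count(*)
  FROM store_sales ss
  JOIN store s ON ss.ss_store_sk = s.s_store_sk   -- store is small enough to broadcast
  GROUP BY ss.ss_store_sk;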

Page 15: February 2014 HUG : Hive On Tez


1-1 Edge

• A typical star schema join involves a large number of tables
• Dimensions aren’t always tiny (Customer dimension)
• Might not be able to handle all dimensions in a single vertex as broadcast joins
• Tez allows streaming records from one processor to the next via a 1-1 Edge
– Transfer details (streaming, files, etc.) are handled transparently
– Scheduling/cluster capacity is worked out by Tez

• Allows Hive to build a pipeline of in-memory joins through which we can stream records

Page 16: February 2014 HUG : Hive On Tez


Dynamically partitioned Hash join

• Kicks in when the large table is bucketed
– Bucketed table
– Dynamic as part of query processing

• Will use a custom edge to match the partitioning on the smaller table
• Allows hash join in cases where the broadcast would be too large
• Tez gives us the option of building custom edges and vertex managers
– Fine-grained control over how the data is replicated and partitioned
– Scheduling and actual data transfer is handled by Tez
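A hedged sketch under the assumption that the large table is bucketed on the join key; the related hive.convert.join.bucket.mapjoin.tez flag exists in Hive 0.13 for converting joins to bucket map joins on Tez, and the table layout below is illustrative:

  CREATE TABLE store_sales_bucketed (
    ss_item_sk  BIGINT,
    ss_store_sk BIGINT,
    ss_net_paid DOUBLE
  )
  CLUSTERED BY (ss_item_sk) INTO 64 BUCKETS
  STORED AS ORC;

  set hive.convert.join.bucket.mapjoin.tez=true;  -- allow bucket map joins on Tez

  SELECT b.ss_store_sk, SUM(b.ss_net_paid)
  FROM store_sales_bucketed b
  JOIN item i ON b.ss_item_sk = i.i_item_sk
  GROUP BY b.ss_store_sk;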

Page 17: February 2014 HUG : Hive On Tez


Shuffle join and Grouping w/o sort

• In the MR model, joins, grouping and window functions are typically mapped onto a shuffle phase
• That means sorting will be used to implement the operator
• With Tez it’s easy to switch out the implementation
– Decouples partitioning, algorithm and transfer of the data
– The nitty-gritty details are still abstracted away (data movement, scheduling, parallelism)

• With the right statistics, this lets Hive pick the right implementation for the query at hand

Page 18: February 2014 HUG : Hive On Tez


Union all

• Common operation in decision support queries
• Caused additional no-op stages in MR plans
– Last stage spins up a multi-input mapper to write the result
– Intermediate unions have to be materialized before additional processing

• Tez has a union that handles these cases transparently w/o any intermediate steps
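For illustration, a small UNION ALL that Tez can run as one DAG without materializing the intermediate union (tables and columns are hypothetical):

  SELECT store, SUM(amount) AS total
  FROM (
    SELECT store, amount FROM sales_2013
    UNION ALL
    SELECT store, amount FROM sales_2014
  ) u
  GROUP BY store;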

Page 19: February 2014 HUG : Hive On Tez


Multi-insert queries

• Allows the same input to be split and written to different tables or partitions
– Avoids duplicate scans/processing
– Useful for ETL
– Similar to “Splits” in Pig

• In MR a “split” in the operator pipeline has to be written to HDFS and processed by multiple additional MR jobs

• Tez allows sending the multiple outputs directly to downstream processors
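The multi-insert form is standard HiveQL; a minimal sketch that scans the input once and writes two aggregates (target table names are illustrative):

  FROM store_sales ss
  INSERT OVERWRITE TABLE sales_by_store
    SELECT ss.ss_store_sk, SUM(ss.ss_net_paid)
    GROUP BY ss.ss_store_sk
  INSERT OVERWRITE TABLE sales_by_item
    SELECT ss.ss_item_sk, SUM(ss.ss_net_paid)
    GROUP BY ss.ss_item_sk;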

Page 20: February 2014 HUG : Hive On Tez


Putting it all together


TPC-DS Query 27 with Hive on Tez:
• Tez Session populates the container pool
• Dimension table calculation and HDFS split generation run in parallel
• Dimension tables are broadcast to Hive MapJoin tasks
• The final reducer is pre-launched and fetches completed inputs

Page 21: February 2014 HUG : Hive On Tez


TPC-DS 10 TB Query Times (Tuned Queries)


Data: 10 TB loaded into ORCFile using defaults.
Hardware: 20 nodes, each with 24 CPU threads, 64 GB RAM and 5 local SATA drives.
Queries: TPC-DS queries with partition filters added.

Page 22: February 2014 HUG : Hive On Tez


TPC-DS 30 TB Query Times (Tuned Queries)


Data: 30 TB loaded into ORCFile using defaults.
Hardware: 14 nodes, each with 2x Intel E5-2660 v2, 256 GB RAM and 23x 7.2k RPM drives.
Queries: TPC-DS queries with partition filters added.

Page 23: February 2014 HUG : Hive On Tez


Tez Concurrency Over 30 TB Dataset

• Test:
– Launch 20 queries concurrently and measure total time to completion.
– Used the 20 tuned TPC-DS queries from the previous slide.
– Each query used once.
– Queries hit multiple different fact tables with different access patterns.

• Results:
– The 20 queries finished within 27.5 minutes.
– Average 82.65 seconds per query.

• Data and Hardware details:
– 30 terabytes of data loaded into ORCFile using all defaults.
– Hardware: 14 nodes, each with 2x Intel E5-2660 v2, 256 GB RAM and 23x 7.2k RPM drives.
