Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Presented by Mithun Radhakrishnan and Chris Drome
June 10, 2015, Hadoop Summit, San Jose, California
About myself
Mithun Radhakrishnan
› Hive Engineer at Yahoo!
› Hive committer and long-time contributor: metastore scaling, integration, HCatalog
› [email protected], @mithunrk
About myself
Chris Drome
› Hive Engineer at Yahoo!
› Hive contributor
› [email protected]
Recap
At 1 TB scale:
› 6.2x speedup over Hive 0.10 (RCFile)
  • Between 2.5x and 17x, per query
› Average query time: 172 seconds
  • Between 5 and 947 seconds
  • Down from 729 seconds (Hive 0.10, RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
Explaining the speed-ups
Hadoop 2.x, et al.
Apache Tez
› Arbitrary-DAG-based execution engine
› “Playing the gaps” between map and reduce
  • Intermediate data and HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› Vectorized execution
› ORC
› Predicate pushdown (PPD)
Expectations with Hive 0.13 in production
› Tez would outperform M/R by miles
› Tez would enable better cluster utilization
  • Use fewer resources
› Tez (and dependencies) would be “production ready”
  • GUI for task logs, DAG overviews, swim-lanes
  • Speculative execution
› Similarly, ORC and vectorization
  • Support for evolving schemas
The Y!Grid
› 18 Hadoop clusters in the Y!Grid
  • 41,565 nodes; biggest cluster: 5,728 nodes
  • 1M jobs a day
› Hadoop 2.6+
› Large datasets
  • Daily, hourly, and minute-level frequencies
  • Thousands of partitions, hundreds of thousands of files, TBs of data per partition
  • 580 PB of data, total
› Pig 0.14 on Tez, Pig 0.11; Hive 0.13 on Tez; HCatalog for interoperability; Oozie for scheduling; GDM for data loading; Spark, HBase, Storm, etc.
Data processing use cases
› Grid usage
  • 30+ million jobs per month
  • 12+ million Oozie launcher jobs
› Pig usage
  • Handles the majority of data pipelines/ETL (~43% of jobs)
› Hive usage
  • A relatively smaller niche
  • 632,000 queries per month (35% on Tez)
› HCatalog for interoperability
  • Metadata storage for all Hadoop data, at Yahoo scale
  • Pig pipelines with Hive analytics
Business Intelligence Tools
› Tableau, MicroStrategy
› Power users
  • Tableau Server for scheduled reports
Challenges:
› Security
  • ACLs, authentication, encryption over the wire
› Bandwidth
  • Transporting results over ODBC
  • Limiting result sets to 1,000s-10,000s of rows
  • Aggregations
› Query latency
  • Metadata queries
  • Partition/table scans
  • Materialized views
Non-negotiables for Hive upgrade at Yahoo!
› The data producer owns the data (unlike traditional DBs)
› Multi-paradigm data access/generation: Pig/Hive/MapReduce using HCatalog
› Highly available metadata service
› UI for tracking/debugging jobs
› The execution engine should ideally support speculative execution
Yahoo! Hive-0.13
Based on Apache Hive-0.13.1 Internal Yahoo! Patches (admin web-services, data discovery, etc.) Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
Phased upgrades› Phase 1: 285 JIRAs› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
Hive deployment (per cluster)
› One remote Hive Metastore “instance”
  • 4 HCatalog servers behind a hardware VIP (an L3DSR load balancer)
  • 96GB-128GB RAM, 16-core boxes
  • Backed by Oracle RAC
› About 10 gateways
  • Interactive use of Hive (and Pig, Oozie, M/R)
  • hive.metastore.uris -> HCatalog
› About 4 HiveServer2 instances
  • Ad hoc queries, aggregation
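In other words, every gateway client points at the VIP rather than at any single metastore host. A minimal sketch (the hostname and port are hypothetical, and this would normally live in hive-site.xml rather than a session):

set hive.metastore.uris=thrift://hcat-vip.ygrid.example.com:9083;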
Evolution of grid services at Yahoo!
[Architecture diagram: gateway machines, browsers (via HUE), and BI tools (via HiveServer2) all reach the HCatalog servers, which are backed by an Oracle RAC, in front of the grid.]
Challenges experienced with Hive on Tez
› Query performance on very large data sets
  • HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
› Split generation on very large data sets
  • Tends to generate more splits (map tasks) compared to M/R
  • Long split-generation times
  • Hogging the Hadoop queues (wave factor vs. multi-tenancy requirements)
  • HIVE-10114: Split strategies for ORC
› Scaling problems with ATS
  • More of a problem with Pig workflows; 10K+ tasks/job are routine
  • AM progress reporting, heart-beating, memory usage
  • Hadoop 2.6.0.10+
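For the split-count side of this, Tez exposes grouping knobs that bound how aggressively splits are combined. The property names below are Apache Tez's own; the values are illustrative, not Yahoo's tuning:

set tez.grouping.min-size=16777216;    -- 16MB lower bound per grouped split
set tez.grouping.max-size=1073741824;  -- 1GB upper bound per grouped split
set tez.grouping.split-waves=1.7;      -- waves of tasks relative to available queue capacity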
Fast execution engines aren't the whole picture
At Yahoo! scale:
› 100s of databases per cluster
› 100s of tables per database
› 100s of columns per table
› 1000s of partitions per table
  • Larger tables: thousands of partitions per hour
  • Millions of partitions every few days
  • 10s of millions of partitions over a dataset's retention period
Problems:
› Metadata volume
  • Database/table/partition IO formats
  • Record serialization details
  • HDFS paths
  • Statistics, per partition and per column
Letters from the trenches
From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing container size, but it still fails. Please help!
Here are my settings:

set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m";
...

INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM (
...
)
...
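The deck leaves the diagnosis to the reader, but it is visible in the settings themselves: the containers were raised to 4GB, yet mapred.child.java.opts still caps each task JVM's heap at 1GB, so the JVM throws OutOfMemoryError long before the container limit matters (and the 16MB split max-size multiplies the number of tiny map tasks). A hedged sketch of corrected settings, with illustrative values:

set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;     -- keep the heap at roughly 80% of the container
set mapreduce.reduce.java.opts=-Xmx3276m;
-- ...and reconsider mapreduce.input.fileinputformat.split.maxsize=16777216 (16MB splits).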
From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?
Yours gigantically,
Project Grape Ape
Metadata volume and query execution time
Anatomy of a Hive query:
1. Compile the query to an AST.
2. Make a Thrift call to the metastore for the partition list.
3. Examine partitions, data paths, etc.; construct the physical query plan.
4. Run optimizers on the plan.
5. Execute the plan (M/R, Tez).
Partition pruner:
› Removes partitions that shouldn't participate in the query.
› In effect, removes input directories from the Hadoop job.
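For instance (the table and partition key here are hypothetical), a constant predicate on a partition key lets the pruner discard every non-matching partition before the job launches:

SELECT COUNT(*) FROM page_views WHERE dt = '20150610';  -- only the dt=20150610 directory remains in the job's input paths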
The problems of large-scale metadata
› The partition pruner is single-threaded
  • A query spanning a day touches hundreds of thousands of partitions
  • A query spanning a week? 2 million partitions
› Partition objects are huge:
  • HDFS paths
  • IO formats
  • Record deserializer info
  • Data column schema
› Datanucleus:
  • 1 partition = a join across 6 Oracle tables in the backend
› Thrift serialization/deserialization takes minutes. *Minutes*.
Immediate workarounds
“Hive wasn't originally designed for more than 10,000s of partitions, total…”
› Throw hardware at it
  • 4 HCatalog servers behind a hardware VIP
  • High-RAM boxes: 96GB-128GB metastore processes
  • Tune each to use 100 connections to the Oracle RAC
› Client-side tuning
  • Increase hive.metastore.client.socket.timeout
  • Increase heap size as needed (container size)
  • Multi-threaded fstat operations
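The timeout bump, concretely (the value is illustrative; in Hive 0.13 this property is an integer number of seconds):

set hive.metastore.client.socket.timeout=600;  -- let slow, very large partition fetches complete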
Fix the leaky/noisy bits
› The metastore frequently ran out of memory:
  • Disable the Hadoop FileSystem cache (HIVE-3098, HDFS-3545)
    – FileSystem.CACHE used UGI.hashCode(), which compared Subjects for equality, not equivalence
  • Fixed Thrift 0.9
    – TSaslServerTransport had circular references that the JVM couldn't detect for cleanup; WeakReferences are your friend
    – Fixed an incompatibility with L3DSR pings
› Data discovery from Oozie:
  • Use JMS notifications on publication
  • Oozie coordinators wake up on an ActiveMQ notification and kick off dependent workflows
  • Reduced polling frequency
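One way to express the FileSystem-cache workaround is through configuration; whether Yahoo used this property or a code change isn't stated in the deck, so treat this as a sketch:

set fs.hdfs.impl.disable.cache=true;  -- bypass the shared FileSystem.CACHE for hdfs:// URIs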
More fixes
› Metadata-only queries:
  • SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
  • Replace HiveMetaStoreClient::getPartitions() with getPartitionNames()
  • Local job, versus cluster (see the sketch after this list)
› Optimize the optimizer. The first step in some optimizers:
  • List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
  • Pray that the client and/or the metastore don't run out of memory. Take a nap.
  • Fixed PartitionPruner and MetadataOnlyOptimizer.
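A sketch of the metadata-only path: when a query touches only partition keys, the flag below (a real Hive setting) lets the result be computed from partition names alone, with no data scan. Assuming tstamp is a partition key of my_purple_table:

set hive.optimize.metadataonly=true;
SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;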
Long-term fixes
› DirectSQL short-circuits:
  • Datanucleus problems at scale (yes, we are aware of the irony that might result from extrapolation)
  • Specific to the backing DB
› Compaction of partition info (HIVE-7223, HIVE-7576, HIVE-9845, etc.):
  • Schemas evolve infrequently
  • Partition info rarely differs from table info, except HDFS paths (which are super-strings of the table path)
  • List<Partition> vs. Iterator<Partition>
    – A PartitionSet abstraction (the delight of inheritance in Thrift)
    – Reduced memory footprints
“The finest trick of The Devil was to persuade you that he does not exist.”
-- ???
From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive 0.12/0.13, it is unusably slow. Queries take too long to run, and time out.
We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
Analysis: The query
› Non-constant partition-key predicates, e.g.:
  WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
    AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
  • Solution: use constant expressions where possible (see the sketch after this list)
  • Fix: Hive 1.x supports dynamic partition pruning and constant folding
› Costly joins with partitioned dimension tables, e.g.:
  SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table)) …
  • Workaround: external “pointer” tables
  • Fix: dynamic partition pruning
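Before constant folding, one way to hand the pruner a literal (a sketch; the variable names and launch line are hypothetical) was Hive's variable substitution: compute the date range outside the query and pass it in, e.g. via hive --hiveconf start_dt=20150509 --hiveconf end_dt=20150608 -f report.hql, and then:

WHERE utc_time >= '${hiveconf:start_dt}'
  AND utc_time <= '${hiveconf:end_dt}'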
Analysis: The data
› Data stored in TEXTFILE
  • Solution: switch to columnar storage (ORC, dictionary encoding, vectorization, predicate pushdown)
› Over-partitioning:
  • Too many partition keys
  • Diminishing returns with partition pruning
  • Solution: eliminate partition keys; consider sorting
› Small part files
  • Hard-coded nReducers, e.g.:
    hive> dfs -count /projects/foo.db/foo_stats;
    9081 682735 1876847648672 /projects/foo.db/foo_stats
    (682,735 files spread over 9,081 directories, for roughly 1.7TiB of data)
  • Solution (see the sketch after this list):
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.tezfiles=true;
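Two companion thresholds (real Hive settings; the values shown are the usual defaults, given only for illustration) govern when the merge pass actually fires and how big the merged files become:

set hive.merge.smallfiles.avgsize=16000000;  -- run a merge pass when average output file size falls below ~16MB
set hive.merge.size.per.task=256000000;      -- target ~256MB per merged file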
We're not done yet
› Tez/ATS scaling
› Speed up split calculation
› Auto/offline compaction
› Abuse detection
› Better handling of schema evolution
› Skew joins in Hive
› UDFs with JNI and configuring LD_LIBRARY_PATH
Questions?
Backup
YHive configuration settings:
set hive.merge.mapfiles=false;    -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set hive.merge.tezfiles=false;    -- Except when producing data.

-- For ORC files:
-- dfs.blocksize=134217728;         -- hdfs-site.xml
set orc.stripe.size=67108864;       -- 64MB stripes.
set orc.compress.size=262144;       -- 256KB compress buffer.
set orc.compress=ZLIB;              -- Override to NONE, per table.
set orc.create.index=true;         -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC indexes.
set orc.row.index.stride=10000;
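The same ORC knobs can also be pinned per table rather than per session; a minimal sketch (the table name and columns are hypothetical):

CREATE TABLE foo_stats_orc (id BIGINT, metric STRING, val DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.stripe.size'='67108864',
               'orc.compress.size'='262144', 'orc.row.index.stride'='10000');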
YHive configuration settings (contd.)
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
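On the jute.maxbuffer note above: ZooKeeper reads that limit from a JVM system property expressed in bytes, so “24MB” becomes a numeric value on the launch line (a sketch; the wrapper variable is hypothetical):

-- export HADOOP_OPTS="$HADOOP_OPTS -Djute.maxbuffer=25165824"   (24MB, in bytes)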
Zookeeper Token Storage performance
Jute buffer size    Max delegation token count
4MB                 30K
8MB                 60K
12MB                90K
16MB                130K
20MB                160K
24MB                190K
Why Hive on Tez? (vs. Shark, Impala)
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
Shark
› A good proof of concept, but not production-ready
› Shuffle performance (at the time)
› Hive on Spark: under active development
Analysis: Tableau/ODBC driver
Tableau has come a long way, but:
› Schema discovery
  • SELECT * FROM my_large_table LIMIT 0;
  • SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
  • Depends on the vendor-specific driver name
› Schema metadata scans
  • 3 partition listings per query
› Miscellaneous problems:
  • “Custom SQL” rewrites
  • Trouble with quoting
tl;dr: Try to transition to Simba's 2.0.x drivers with Tableau 8.3.x.