Hadoop Summit 2015: Hive at Yahoo: Letters from the Trenches
Presented by Mithun Radhakrishnan and Chris Drome
June 10, 2015, Hadoop Summit, San Jose, California
About myself
Mithun Radhakrishnan
› Hive Engineer at Yahoo!
› Hive committer and long-time contributor: metastore scaling, integration, HCatalog
› [email protected], @mithunrk
About myself
Chris Drome
› Hive Engineer at Yahoo!
› Hive contributor
› [email protected]
Recap
At 1 TB scale:
› 6.2x speedup over Hive 0.10 (RCFile)
  • Between 2.5x and 17x, per query
› Average query time: 172 seconds
  • Between 5 and 947 seconds
  • Down from 729 seconds (Hive 0.10, RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes
Explaining the speed-ups
Hadoop 2.x, et al.
Apache Tez
› Arbitrary-DAG-based execution engine
› “Playing the gaps” between map and reduce
  • Intermediate data and HDFS
› Smart scheduling
› Container re-use
› Pipelined job start-up
Hive
› Statistics
› Vectorized execution
› ORC
› Predicate pushdown (PPD)
Expectations with Hive 0.13 in production
› Tez would outperform M/R by miles
› Tez would enable better cluster utilization
  • Use fewer resources
› Tez (and dependencies) would be “production ready”
  • GUI for task logs, DAG overviews, swim-lanes
  • Speculative execution
› Similarly, ORC and vectorization
  • Support for evolving schemas
The Y!Grid
› 18 Hadoop clusters in the Y!Grid
  • 41,565 nodes; biggest cluster: 5,728 nodes
  • 1M jobs a day
› Hadoop 2.6+
› Large datasets
  • Daily, hourly, and minute-level frequencies
  • Thousands of partitions, hundreds of thousands of files, TBs of data per partition
  • 580 PB of data, total
› Pig 0.14 on Tez, Pig 0.11; Hive 0.13 on Tez; HCatalog for interoperability; Oozie for scheduling; GDM for data loading; Spark, HBase, Storm, etc.
Data processing use cases
› Grid usage
  • 30+ million jobs per month
  • 12+ million Oozie launcher jobs
› Pig usage
  • Handles the majority of data pipelines/ETL (~43% of jobs)
› Hive usage
  • A relatively smaller niche
  • 632,000 queries per month (35% on Tez)
› HCatalog for interoperability
  • Metadata storage for all Hadoop data, at Yahoo scale
  • Pig pipelines with Hive analytics
Business Intelligence Tools
› Tableau, MicroStrategy
› Power users
  • Tableau Server for scheduled reports
Challenges:
› Security
  • ACLs, authentication, encryption over the wire
› Bandwidth
  • Transporting results over ODBC
  • Limiting result sets to 1,000s-10,000s of rows
  • Aggregations
› Query latency
  • Metadata queries
  • Partition/table scans
  • Materialized views
Non-negotiables for Hive upgrade at Yahoo!
› The data producer owns the data (unlike traditional DBs)
› Multi-paradigm data access/generation: Pig/Hive/MapReduce using HCatalog
› Highly available metadata service
› UI for tracking/debugging jobs
› The execution engine should ideally support speculative execution
Yahoo! Hive-0.13
Based on Apache Hive-0.13.1 Internal Yahoo! Patches (admin web-services, data discovery, etc.) Community patches to stabilize Apache Hive-0.13.1
› Tez
• HIVE-7544, HIVE-6748, HIVE-7112, …
› Vectorization
• HIVE-8163, HIVE-8092, HIVE-7188, HIVE-7105, HIVE-7514, …
› Failures
• HIVE-7851, HIVE-7459, HIVE-7771, HIVE-7396, …
› Optimizations
• HIVE-7231, HIVE-7219, HIVE-7203, HIVE-7052, …
› Data integrity
• HIVE-7694, HIVE-7494, HIVE-7045, HIVE-7346, HIVE-7232, …
Phased upgrades› Phase 1: 285 JIRAs› Phase 2: 23 JIRAs (HIVE-8781 and related dependencies)› Phase 3: 46 JIRAs (HIVE-10114 and related dependencies)
Hive deployment (per cluster)
› One remote Hive Metastore “instance”
  • 4 HCatalog servers behind a hardware VIP (an L3DSR load balancer)
  • 96GB-128GB RAM, 16-core boxes
  • Backed by Oracle RAC
› About 10 gateways
  • Interactive use of Hive (and Pig, Oozie, M/R)
  • hive.metastore.uris -> HCatalog
› About 4 HiveServer2 instances
  • Ad hoc queries, aggregation
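In other words, every gateway client points at the VIP rather than at any single metastore host. A minimal sketch (the hostname and port are hypothetical, and this would normally live in hive-site.xml rather than a session):

set hive.metastore.uris=thrift://hcat-vip.ygrid.example.com:9083;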
Evolution of grid services at Yahoo!
[Architecture diagram: gateway machines, browsers (via HUE), and BI tools (via HiveServer2) all reach the HCatalog servers, which are backed by an Oracle RAC, in front of the grid.]
Challenges experienced with Hive on Tez
› Query performance on very large data sets
  • HIVE-8292: Reading … has high overhead in MapOperator.cleanUpInputFileChangedOp
› Split generation on very large data sets
  • Tends to generate more splits (map tasks) compared to M/R
  • Long split-generation times
  • Hogging the Hadoop queues (wave factor vs. multi-tenancy requirements)
  • HIVE-10114: Split strategies for ORC
› Scaling problems with ATS
  • More of a problem with Pig workflows; 10K+ tasks/job are routine
  • AM progress reporting, heart-beating, memory usage
  • Hadoop 2.6.0.10+
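For the split-count side of this, Tez exposes grouping knobs that bound how aggressively splits are combined. The property names below are Apache Tez's own; the values are illustrative, not Yahoo's tuning:

set tez.grouping.min-size=16777216;    -- 16MB lower bound per grouped split
set tez.grouping.max-size=1073741824;  -- 1GB upper bound per grouped split
set tez.grouping.split-waves=1.7;      -- waves of tasks relative to available queue capacity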
Fast execution engines aren't the whole picture
At Yahoo! scale:
› 100s of databases per cluster
› 100s of tables per database
› 100s of columns per table
› 1000s of partitions per table
  • Larger tables: thousands of partitions per hour
  • Millions of partitions every few days
  • 10s of millions of partitions over a dataset's retention period
Problems:
› Metadata volume
  • Database/table/partition IO formats
  • Record serialization details
  • HDFS paths
  • Statistics, per partition and per column
Letters from the trenches
From: Another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow queries
YHive team,
My query fails with OutOfMemoryError. I tried increasing container size, but it still fails. Please help!
Here are my settings:

set mapreduce.input.fileinputformat.split.maxsize=16777216;
set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapred.child.java.opts="-Xmx1024m";
...

INSERT OVERWRITE TABLE my_table PARTITION( foo, bar, goo )
SELECT * FROM (
...
)
...
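The deck leaves the diagnosis to the reader, but it is visible in the settings themselves: the containers were raised to 4GB, yet mapred.child.java.opts still caps each task JVM's heap at 1GB, so the JVM throws OutOfMemoryError long before the container limit matters (and the 16MB split max-size multiplies the number of tiny map tasks). A hedged sketch of corrected settings, with illustrative values:

set mapreduce.map.memory.mb=4096;
set mapreduce.reduce.memory.mb=4096;
set mapreduce.map.java.opts=-Xmx3276m;     -- keep the heap at roughly 80% of the container
set mapreduce.reduce.java.opts=-Xmx3276m;
-- ...and reconsider mapreduce.input.fileinputformat.split.maxsize=16777216 (16MB splits).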
From: YET another ETL pipeline.
To: The Yahoo Hive Team
Subject: Slow UDF performance
YHive team,
Why does using a simple custom UDF cause queries to time out?
SELECT foo, bar, my_function( goo )
FROM my_large_table
WHERE ...
From: The ETL team
To: The Yahoo Hive Team
Subject: A small matter of size...
Dear YHive team,
We have partitioned our table using the following 6 partition keys: {hourly-timestamp, name, property, geo-location, shoe-size, and so on…}.
For a given timestamp, the combined cardinality of the remaining partition-keys is about 10000/hr.
If queries on partitioned tables are supposed to be faster, how come queries on our table take forever just to get off the ground?
Yours gigantically,
Project Grape Ape
Metadata volume and query execution time
Anatomy of a Hive query:
1. Compile the query to an AST.
2. Make a Thrift call to the metastore for the partition list.
3. Examine partitions, data paths, etc.; construct the physical query plan.
4. Run optimizers on the plan.
5. Execute the plan (M/R, Tez).
Partition pruner:
› Removes partitions that shouldn't participate in the query.
› In effect, removes input directories from the Hadoop job.
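For instance (the table and partition key here are hypothetical), a constant predicate on a partition key lets the pruner discard every non-matching partition before the job launches:

SELECT COUNT(*) FROM page_views WHERE dt = '20150610';  -- only the dt=20150610 directory remains in the job's input paths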
The problems of large-scale metadata
› The partition pruner is single-threaded
  • A query spanning a day touches hundreds of thousands of partitions
  • A query spanning a week? 2 million partitions
› Partition objects are huge:
  • HDFS paths
  • IO formats
  • Record deserializer info
  • Data column schema
› Datanucleus:
  • 1 partition = a join across 6 Oracle tables in the backend
› Thrift serialization/deserialization takes minutes. *Minutes*.
Immediate workarounds
“Hive wasn't originally designed for more than 10,000s of partitions, total…”
› Throw hardware at it
  • 4 HCatalog servers behind a hardware VIP
  • High-RAM boxes: 96GB-128GB metastore processes
  • Tune each to use 100 connections to the Oracle RAC
› Client-side tuning
  • Increase hive.metastore.client.socket.timeout
  • Increase heap size as needed (container size)
  • Multi-threaded fstat operations
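The timeout bump, concretely (the value is illustrative; in Hive 0.13 this property is an integer number of seconds):

set hive.metastore.client.socket.timeout=600;  -- let slow, very large partition fetches complete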
Fix the leaky/noisy bits
› The metastore frequently ran out of memory:
  • Disable the Hadoop FileSystem cache (HIVE-3098, HDFS-3545)
    – FileSystem.CACHE used UGI.hashCode(), which compared Subjects for equality, not equivalence
  • Fixed Thrift 0.9
    – TSaslServerTransport had circular references that the JVM couldn't detect for cleanup; WeakReferences are your friend
    – Fixed an incompatibility with L3DSR pings
› Data discovery from Oozie:
  • Use JMS notifications on publication
  • Oozie coordinators wake up on an ActiveMQ notification and kick off dependent workflows
  • Reduced polling frequency
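One way to express the FileSystem-cache workaround is through configuration; whether Yahoo used this property or a code change isn't stated in the deck, so treat this as a sketch:

set fs.hdfs.impl.disable.cache=true;  -- bypass the shared FileSystem.CACHE for hdfs:// URIs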
More fixes
› Metadata-only queries:
  • SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;
  • Replace HiveMetaStoreClient::getPartitions() with getPartitionNames()
  • Local job, versus cluster (see the sketch after this list)
› Optimize the optimizer. The first step in some optimizers:
  • List<Partition> partitions = hiveMetaStoreClient.getPartitions( db, table, (short)-1 );
  • Pray that the client and/or the metastore don't run out of memory. Take a nap.
  • Fixed PartitionPruner and MetadataOnlyOptimizer.
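A sketch of the metadata-only path: when a query touches only partition keys, the flag below (a real Hive setting) lets the result be computed from partition names alone, with no data scan. Assuming tstamp is a partition key of my_purple_table:

set hive.optimize.metadataonly=true;
SELECT DISTINCT tstamp FROM my_purple_table ORDER BY tstamp DESC LIMIT 1000;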
Long-term fixes
› DirectSQL short-circuits:
  • Datanucleus problems at scale (yes, we are aware of the irony that might result from extrapolation)
  • Specific to the backing DB
› Compaction of partition info (HIVE-7223, HIVE-7576, HIVE-9845, etc.):
  • Schemas evolve infrequently
  • Partition info rarely differs from table info, except HDFS paths (which are super-strings of the table path)
  • List<Partition> vs. Iterator<Partition>
    – A PartitionSet abstraction (the delight of inheritance in Thrift)
    – Reduced memory footprints
“The finest trick of The Devil was to persuade you that he does not exist.”
-- ???
From: A major reporting team
To: The Yahoo Hive Team
Subject: Urgent! Customer reports are borking.
Dear YHive team,
When we connect Tableau Server 8.3 to Y!Hive 0.12/0.13, it is unusably slow. Queries take too long to run, and time out.
We’d prefer not to change our query-code too much. How soon can Hive accommodate our simple queries?
Yours hysterically,
Project Zodiac
Analysis: The query
› Non-constant partition-key predicates, e.g.:
  WHERE utc_time <= from_unixtime(unix_timestamp() - 2*24*60*60, 'yyyyMMdd')
    AND utc_time >= from_unixtime(unix_timestamp() - 32*24*60*60, 'yyyyMMdd')
  • Solution: use constant expressions where possible (see the sketch after this list)
  • Fix: Hive 1.x supports dynamic partition pruning and constant folding
› Costly joins with partitioned dimension tables, e.g.:
  SELECT … FROM fact_table JOIN (SELECT * FROM dimension_table WHERE dt IN (SELECT MAX(dt) FROM dimension_table)) …
  • Workaround: external “pointer” tables
  • Fix: dynamic partition pruning
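Before constant folding, one way to hand the pruner a literal (a sketch; the variable names and launch line are hypothetical) was Hive's variable substitution: compute the date range outside the query and pass it in, e.g. via hive --hiveconf start_dt=20150509 --hiveconf end_dt=20150608 -f report.hql, and then:

WHERE utc_time >= '${hiveconf:start_dt}'
  AND utc_time <= '${hiveconf:end_dt}'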
Analysis: The data
› Data stored in TEXTFILE
  • Solution: switch to columnar storage (ORC, dictionary encoding, vectorization, predicate pushdown)
› Over-partitioning:
  • Too many partition keys
  • Diminishing returns with partition pruning
  • Solution: eliminate partition keys; consider sorting
› Small part files
  • Hard-coded nReducers, e.g.:
    hive> dfs -count /projects/foo.db/foo_stats;
    9081 682735 1876847648672 /projects/foo.db/foo_stats
    (682,735 files spread over 9,081 directories, for roughly 1.7TiB of data)
  • Solution (see the sketch after this list):
    set hive.merge.mapfiles=true;
    set hive.merge.mapredfiles=true;
    set hive.merge.tezfiles=true;
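Two companion thresholds (real Hive settings; the values shown are the usual defaults, given only for illustration) govern when the merge pass actually fires and how big the merged files become:

set hive.merge.smallfiles.avgsize=16000000;  -- run a merge pass when average output file size falls below ~16MB
set hive.merge.size.per.task=256000000;      -- target ~256MB per merged file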
We're not done yet
› Tez/ATS scaling
› Speed up split calculation
› Auto/offline compaction
› Abuse detection
› Better handling of schema evolution
› Skew joins in Hive
› UDFs with JNI and configuring LD_LIBRARY_PATH
Questions?
Backup
YHive configuration settings:
set hive.merge.mapfiles=false;    -- Except when producing data.
set hive.merge.mapredfiles=false; -- Except when producing data.
set hive.merge.tezfiles=false;    -- Except when producing data.

-- For ORC files:
-- dfs.blocksize=134217728;         -- hdfs-site.xml
set orc.stripe.size=67108864;       -- 64MB stripes.
set orc.compress.size=262144;       -- 256KB compress buffer.
set orc.compress=ZLIB;              -- Override to NONE, per table.
set orc.create.index=true;         -- ORC indexes.
set orc.optimize.index.filter=true; -- Predicate pushdown with ORC indexes.
set orc.row.index.stride=10000;
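The same ORC knobs can also be pinned per table rather than per session; a minimal sketch (the table name and columns are hypothetical):

CREATE TABLE foo_stats_orc (id BIGINT, metric STRING, val DOUBLE)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB', 'orc.stripe.size'='67108864',
               'orc.compress.size'='262144', 'orc.row.index.stride'='10000');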
YHive configuration settings (contd.)
-- Delegation Token Store settings:
set hive.cluster.delegation.token.store.class=ZooKeeperTokenStore;
set hive.cluster.delegation.token.renew-interval=172800000;
(Start HCat Server with -Djute.maxbuffer=24MB -> 190K+ tokens.)
-- Data Nucleus settings:
set datanucleus.connectionPoolingType=DBCP; -- i.e., not BoneCP.
set datanucleus.cache.level1.type=none;
set datanucleus.cache.level2.type=none;
set datanucleus.connectionPool.maxWait=200000;
set datanucleus.connectionPool.minIdle=0;
-- Misc.
set hive.metastore.event.listeners=com.yahoo.custom.JMSListener;
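On the jute.maxbuffer note above: ZooKeeper reads that limit from a JVM system property expressed in bytes, so “24MB” becomes a numeric value on the launch line (a sketch; the wrapper variable is hypothetical):

-- export HADOOP_OPTS="$HADOOP_OPTS -Djute.maxbuffer=25165824"   (24MB, in bytes)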
Zookeeper Token Storage performance
Jute buffer size    Max delegation token count
4MB                 30K
8MB                 60K
12MB                90K
16MB                130K
20MB                160K
24MB                190K
Why Hive on Tez? (vs. Shark, Impala)
› Pre-emption for in-memory systems
› Multi-tenant, shared clusters
› Heterogeneous nodes
› Existing ecosystem
› Community-driven development
Shark
› A good proof of concept, but not production-ready
› Shuffle performance (at the time)
› Hive on Spark: under active development
Analysis: Tableau/ODBC driver
Tableau has come a long way, but:
› Schema discovery
  • SELECT * FROM my_large_table LIMIT 0;
  • SELECT DISTINCT part_key FROM my_large_table;
› SQL dialect
  • Depends on the vendor-specific driver name
› Schema metadata scans
  • 3 partition listings per query
› Miscellaneous problems:
  • “Custom SQL” rewrites
  • Trouble with quoting
tl;dr: Try to transition to Simba's 2.0.x drivers with Tableau 8.3.x.