Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale


Transcript of Hadoop Summit 2014 : Benchmarking Apache Hive at Yahoo Scale

Benchmarking Hive at Yahoo Scale

Presented by Mithun Radhakrishnan | June 4, 2014

2014 Hadoop Summit, San Jose, California

About myself

HCatalog committer, Hive contributor
› Metastore, Notifications, HCatalog APIs
› Integration with Oozie, data ingestion

Other odds and ends
› DistCp

mithun@apache.org


About this talk

› Introduction to “Yahoo Scale”
› The use case at Yahoo
› The benchmark
› The setup
› The observations (and, possibly, lessons)
› Fisticuffs


The Y!Grid

16 Hadoop clusters in Y!Grid
› 32,500 nodes
› 750K jobs a day
› Hadoop 0.23.10.x, 2.4.x

Large datasets
› Daily, hourly, minute-level frequencies
› Terabytes of data, 1000s of files, per dataset instance

Pig 0.11
Hive 0.10 / HCatalog 0.5
› Moving to Hive 0.12


Data Processing Use cases

Pig for data pipelines
› Imperative paradigm
› ~45% of Hadoop jobs on production clusters
• M/R + Oozie = 41%

Hive for ad hoc queries
› SQL
› Relatively smaller number of jobs
• *Major* uptick

Use HCatalog for interop

Hive is Currently the Fastest Growing Product on the Grid

[Chart: all Grid jobs (in millions) alongside Hive jobs as a percentage of all jobs, monthly, Mar-13 through May-14.]

2.4 million Hive jobs

Business Intelligence Tools

{Tableau, MicroStrategy, Excel, …}

Challenges:
› Security
• ACLs, authentication, encryption over the wire, full-disk encryption
› Bandwidth
• Transporting results over ODBC
› Query latency
• Query execution time
• Cost of query “optimizations”
• “Bad” queries


The Benchmark

TPC-H
› Industry standard (tpc.org/tpch)
› 22 queries
› dbgen –s 1000 –S 3
• Parallelizable

Reynold Xin’s excellent work:
› https://github.com/rxin
› Transliterated queries to suit Hive 0.9
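For a flavor of the workload, this is roughly what TPC-H Q6 (referenced later in the results) looks like when transliterated for Hive; table and column names follow the standard TPC-H schema:

```sql
-- TPC-H Q6: revenue from discounted, small-quantity line items
-- over one year of shipments
SELECT SUM(l_extendedprice * l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= '1994-01-01'
  AND l_shipdate <  '1995-01-01'
  AND l_discount BETWEEN 0.05 AND 0.07
  AND l_quantity < 24;
```

Q6 touches a single table with a selective range predicate, which is why it benefits so visibly from ORC predicate push-down.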


Relational Diagram

[TPC-H relational schema diagram. Table sizes at scale factor SF:]
› PART (P_): SF × 200,000
› PARTSUPP (PS_): SF × 800,000
› SUPPLIER (S_): SF × 10,000
› CUSTOMER (C_): SF × 150,000
› ORDERS (O_): SF × 1,500,000
› LINEITEM (L_): SF × 6,000,000
› NATION (N_): 25 rows
› REGION (R_): 5 rows

The Setup

› 350-node cluster
• Xeon boxen: 2 sockets with E5530s => 16 logical CPUs
• 24GB memory (NUMA enabled)
• 6 SATA drives, 2TB, 7200 RPM Seagates
• RHEL 6.4
• JRE 1.7 (-d64)
• Hadoop 0.23.7+/2.3+, security turned off
• Tez 0.3.x
• 128MB HDFS block-size

› Downscale tests: 100-node cluster
• hdfs-balancer.sh


The Prep

Data generation:
› Text data: dbgen on MapReduce
› Transcode to RCFile and ORC: Hive on MR
• insert overwrite table orc_table partition( … ) select * from text_table;

› Partitioning:
• Only for the 1TB and 10TB cases
• Perils of dynamic partitioning

› ORC file:
• 64MB stripes, ZLIB compression
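The transcode step above can be sketched as follows. The table names and the abbreviated column list are hypothetical (the real lineitem table has 16 columns), but the ORC properties (64MB stripes, ZLIB) and the dynamic-partition settings match the slides:

```sql
-- Allow dynamic partitions for the INSERT below
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Hypothetical ORC target table (column list abbreviated)
CREATE TABLE lineitem_orc (
  l_orderkey      BIGINT,
  l_quantity      DOUBLE,
  l_extendedprice DOUBLE,
  l_discount      DOUBLE
)
PARTITIONED BY (l_shipdate STRING)
STORED AS ORC
TBLPROPERTIES ("orc.compress"    = "ZLIB",
               "orc.stripe.size" = "67108864");  -- 64MB stripes

-- Transcode from the text-format staging table
INSERT OVERWRITE TABLE lineitem_orc PARTITION (l_shipdate)
SELECT l_orderkey, l_quantity, l_extendedprice, l_discount, l_shipdate
FROM lineitem_text;
```

The “perils of dynamic partitioning”: every distinct value of the partition column in the SELECT spawns a partition directory, so a high-cardinality column can overwhelm the metastore and the filesystem.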

2014 Hadoop Summit, San Jose, California

Observations

100 GB

› 18x average speedup over Hive 0.10 (Textfile)
• Range: 6-50x
› 11.8x average speedup over Hive 0.10 (RCFile)
• Range: 5-30x
› Average query time: 28 seconds
• Down from 530 seconds (Hive 0.10, Text)
› 85% of queries completed in under a minute


1 TB

› 6.2x average speedup over Hive 0.10 (RCFile)
• Range: 2.5-17x
› Average query time: 172 seconds
• Per-query range: 5-947 seconds
• Down from 729 seconds (Hive 0.10, RCFile)
› 61% of queries completed in under 2 minutes
› 81% of queries completed in under 4 minutes


10 TB

› 6.2x average speedup over Hive 0.10 (RCFile)
• Range: 1.6-10x
› Average query time: 908 seconds (426 seconds excluding outliers)
• Down from 2129 seconds with Hive 0.10 RCFile (1712 seconds excluding outliers)
› 61% of queries completed in under 5 minutes
› 71% of queries completed in under 10 minutes
› Q6 still completes in 12 seconds!


Explaining the speed-ups

Hadoop 2.x, et al.

Tez
› (Arbitrary DAG)-based execution engine
› “Playing the gaps” between M&R
• Temporary data and HDFS
› Feedback loop
› Smart scheduling
› Container re-use
› Pipelined job start-up

Hive
› Statistics
› “Vectorized” execution

ORC
› Predicate push-down (PPD)


ORC File Layout

› Data is composed of multiple streams per column
› The index allows skipping rows (defaults to every 10,000 rows), keeps the position in each stream, and records min-max values for each column
› The footer contains a directory of stream locations, and the encoding for each column
› Integer columns are serialized using run-length encoding
› String columns are serialized using a dictionary for column values, plus the same run-length encoding
› The stripe footer is used to find the requested column’s data streams, and adjacent stream reads are merged
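The run-length encoding mentioned above can be illustrated with a toy sketch. This demonstrates the idea (collapsing repeated values into (value, count) pairs), not ORC's actual RLE wire format:

```python
def rle_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original column."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

col = [7, 7, 7, 7, 3, 3, 9]
encoded = rle_encode(col)             # [(7, 4), (3, 2), (9, 1)]
assert rle_decode(encoded) == col
```

Columns with long runs (sorted keys, status flags, dates) compress dramatically under this scheme, which is part of why sorted ORC data behaves so well.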

ORC Usage

CREATE TABLE addresses (
  name   string,
  street string,
  city   string,
  state  string,
  zip    int
)
STORED AS orc
LOCATION '/path/to/addresses'
TBLPROPERTIES ("orc.compress" = "ZLIB");

ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT orc;

SET hive.default.fileformat = orc;
SET hive.exec.orc.memory.pool = 0.50;  -- the ORC writer is allowed 50% of the JVM heap by default

The long-hand equivalent of STORED AS orc:

ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';

Key | Default | Comments
orc.compress | ZLIB | High-level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size | 262,144 (256 KB) | Number of bytes in each compression chunk
orc.stripe.size | 67,108,864 (64 MB) | Number of bytes in each stripe. Each ORC stripe is processed in one map task (try 32 MB to cut down on disk I/O)
orc.row.index.stride | 10,000 | Number of rows between index entries (must be >= 1,000). A larger stride increases the probability of not being able to skip a stride, for a given predicate
orc.create.index | true | Whether to create row indexes, used for predicate push-down. If data is frequently filtered on a certain column, sorting on that column and using index filters makes the filtering faster
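The mechanics behind orc.row.index.stride can be sketched in a few lines: the reader consults per-stride min/max entries and skips strides that cannot satisfy the predicate. This is a simplification of ORC's actual predicate push-down, for illustration only:

```python
def build_index(column, stride=10_000):
    """Record (start_row, min, max) per stride, like ORC's row index."""
    index = []
    for start in range(0, len(column), stride):
        chunk = column[start:start + stride]
        index.append((start, min(chunk), max(chunk)))
    return index

def scan_with_ppd(column, index, lo, hi, stride=10_000):
    """Read only strides whose [min, max] range overlaps [lo, hi]."""
    hits, strides_read = [], 0
    for start, stride_min, stride_max in index:
        if stride_max < lo or stride_min > hi:
            continue                      # whole stride skipped via the index
        strides_read += 1
        for v in column[start:start + stride]:
            if lo <= v <= hi:
                hits.append(v)
    return hits, strides_read

# Sorted data makes strides maximally skippable
# (cf. the orc.create.index comment above)
col = sorted(range(40_000))
idx = build_index(col)
hits, read = scan_with_ppd(col, idx, 25_000, 25_009)
# Only 1 of the 4 strides is actually read
```

This also shows why a larger stride hurts selectivity: the wider each [min, max] range, the more likely it overlaps any given predicate, and the less often a stride can be skipped.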


Configuring ORC

set hive.merge.mapredfiles=true
set hive.merge.mapfiles=true
set orc.stripe.size=67108864

› Half the HDFS block-size
• Prevents cross-block stripe reads
• Tangent: DistCp

set orc.compress=???
› Depends on data size and distribution
› Snappy compression hasn’t been explored

YMMV
› Experiment


Conclusions

Y!Grid is sticking with Hive

Familiarity
› Existing ecosystem

Community, scale, multi-tenancy

Coming down the pike:
› CBO
› In-memory caching solutions atop HDFS
• RAMfs à la Tachyon?


We’re not done yet

SQL compliance
Scaling up metastore performance
Better BI tool integration
Faster transport
› HiveServer2 result-sets


References

The YDN blog post:
› http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn

Code:
› https://github.com/mythrocks/hivebench (TPC-H scripts, datagen, transcode utils)
› https://github.com/t3rmin4t0r/tpch-gen (parallel TPC-H gen)
› https://github.com/rxin/TPC-H-Hive (TPC-H scripts for Hive)
› https://issues.apache.org/jira/browse/HIVE-600 (Yuntao’s initial TPC-H JIRA)

Thank You!
@mithunrk

mithun@apache.org

We are hiring!

Stop by Kiosk P9 or reach out to us at bigdata@yahoo-inc.com.

I’m glad you asked.

Sharky comments

Testing with Shark 0.7.x and Shark 0.8
› Compatible with Hive Metastore 0.9
› 100GB datasets: admirable performance
› 1TB/10TB: tests did not run to completion
• Failures, especially in the 10TB cases
• Hangs while shuffling data
• Scaled back to 100 nodes -> more tests ran through, but not all
› nReducers: not inferred

Miscellany
› Security
› Multi-tenancy
› Compatibility
