Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Transcript
Page 1: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Interactive SQL POC on Hadoop
Hive 13, Hive-on-Tez, Presto

Storage: RCFile, ORC, Parquet and Avro

Sudhir Mallem
Alex Bain

George Zhang

Page 2: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Team

Page 3: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Introduction

• Interactive SQL POC on HDFS
• Hive (version 13), Hive-on-Tez, Presto
• Storage formats:
  • RCFile
  • ORC
  • Parquet
  • Avro
• Compression:
  • Snappy
  • Zlib
  • Native gzip compression

Page 4: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Goal of Benchmark

The goals of this benchmark are:
• To provide a comprehensive overview and test of interactive SQL on Hadoop.
• To measure query response time on each platform across the different storage formats.
• To measure compression (data size) across different dimensions.
• To better understand the performance gain we may potentially see with queries on each of these platforms.
• Avro is widely used across LinkedIn, and testing it on the newer platform (Hive 13) with other tools (Tez, Presto, etc.) will give us a good understanding of the performance gain we may potentially see.

Page 5: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

System, Storage Formats and Compression

Systems chosen:
• Hive - version 13.1
• Hive 13 on Tez + YARN - Tez version 0.4.1
• Presto - versions 0.74, 0.79 and 0.80

Storage formats and compression:
• ORC + zlib compression
• RCFile + Snappy
• Parquet + Snappy
• Avro + Avro native compression (Deflate level 9)
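
For reference, tables for a format/compression pair can be declared along the following lines. This is a minimal HiveQL sketch with a hypothetical table name and column list (the actual DDLs used in the POC are not shown in this deck); hive.exec.compress.output, mapred.output.compression.codec and orc.compress are standard Hive settings.

-- Minimal sketch (hypothetical table/columns): ORC table with zlib compression
CREATE TABLE pageviewevent_orc (
  trackingcode STRING,
  memberid     BIGINT
)
PARTITIONED BY (datepartition STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');

-- RCFile table with Snappy compression enabled for the writing session
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE pageviewevent_rc (
  trackingcode STRING,
  memberid     BIGINT
)
PARTITIONED BY (datepartition STRING)
STORED AS RCFILE;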

Page 6: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

System, Storage Formats and Compression

• For Presto, the dataset was created in RCFile. This was the most recommended format for Presto at the time of the evaluation.
• At the time of the evaluation, Presto had issues working with Avro and Parquet: the queries either did not run or were not well optimized.
• With the release of Presto v0.80, we also tested it with the ORC file format.
• We flattened certain data (pageviewevent) to support the benchmark on Presto.
• Presto currently supports only the Map datatype among the complex types; Struct fields and similar data are read with the json_extract function, as sketched below.
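
For example, a struct-style field kept as JSON in the flattened data can be read in Presto roughly as follows. This is a minimal sketch with a hypothetical JSON column named requestheader; json_extract_scalar is the Presto function that returns the extracted value as a varchar (json_extract, mentioned above, returns it as JSON).

-- Minimal sketch (hypothetical JSON column and layout):
-- pull a nested "pagekey" value out of a JSON-encoded header column
SELECT
  json_extract_scalar(requestheader, '$.pagekey') AS pagekey,
  count(1) AS pv
FROM pageviewevent_flat
WHERE datepartition = '2014-07-15'
GROUP BY json_extract_scalar(requestheader, '$.pagekey');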

Page 7: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

About Hive 13

• This is the next version of Hive at LinkedIn.
• Hive is used heavily at LinkedIn for interactive SQL capability by users who are not Pig Latin savvy and prefer a SQL solution.
• Hive is generally slow because it runs on MapReduce and competes for mappers and reducers on the cluster along with Pig Latin and vanilla MapReduce jobs.

Page 8: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

About Hive 13-on-Tez

• Tez is a new application framework built on Hadoop YARN that can execute complex directed acyclic graphs (DAGs) of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor to the MapReduce framework, built by Hortonworks.

• It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks, which consist of a triplet of interfaces: input, output and processor. These tasks are the vertices in the execution graph. Edges (i.e. data connections between tasks) are first-class citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of how data is transferred between tasks.

• Tez also greatly extends the possible ways in which individual tasks can be linked together; in fact, any arbitrary DAG can be executed directly in Tez. In Tez parlance, a MapReduce job is basically a simple DAG consisting of a single map vertex and a single reduce vertex connected by a "bipartite" edge (i.e. the edge connects every map task to every reduce task). Map inputs and reduce outputs are HDFS inputs and outputs, respectively. The map output class locally sorts and partitions the data by a certain key, while the reduce input class merge-sorts its data on the same key.

• Tez also provides what is basically a MapReduce compatibility layer that lets one run MR jobs on top of the new execution layer by implementing the Map/Reduce concepts on the new execution framework.

Page 9: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

About Presto

• Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
• Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the data size of organizations like LinkedIn and Facebook.

Page 10: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

About the Dataset

• The input dataset was carefully chosen to cover not only the performance perspective of benchmarking, but also to gain better insight into each system. It gives a good understanding of the query patterns each supports, the functions, ease of use, etc.
• It includes different dimension tables, facts and aggregates.
• Data ranges anywhere from 20k rows to 80+ billion rows.
• Hive supports complex datatypes like Struct, Array, Union and Map. The data we chose has nested structures, key-value pairs and binary data.
• We flattened the data for use in Presto, as version 0.74 of Presto supports only the Array and Map datatypes. The underlying data is stored as JSON, so we have to use JSON functions to extract and refer to the data (a flattening sketch follows this list).
• One of the datasets is a flat table with 600+ columns, chosen specifically to test the columnar functionality of the Parquet, RCFile and ORC file formats.
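
As a rough illustration of the flattening step, something along these lines would produce the pageviewevent_flat table used by the Presto queries later in this deck. This is a minimal HiveQL sketch: the column list and storage clause are assumed for illustration, not taken from the actual flattening job.

-- Minimal sketch (assumed column list and storage clause)
CREATE TABLE pageviewevent_flat (
  memberid     BIGINT,
  pagekey      STRING,
  trackingcode STRING
)
PARTITIONED BY (datepartition STRING)
STORED AS RCFILE;

-- enable dynamic partitioning so each source partition lands in its own target partition
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE pageviewevent_flat PARTITION (datepartition)
SELECT
  header.memberid       AS memberid,
  requestheader.pagekey AS pagekey,
  trackingcode,
  datepartition
FROM pageviewevent;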

Page 11: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Evaluation Criteria

• We chose 15 queries for our testing and benchmarking. These SQLs are some of the commonly used queries users run in the DWH at LinkedIn.
• The queries test the following functionality:
  • date and time manipulations
  • nested SQLs, wildcard searches
  • filter predicates, partition pruning, full table scans and joins (3-way, 2-way, etc.)
  • exists, in, not exists, not in
  • aggregate functions like sum, max, count(distinct), count(1)
  • extracting keys from map and struct datatypes

Page 12: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 1 – simple groupby and count

select trackingcode, count(1)
from pageviewevent
where datepartition = '2014-07-15'
group by trackingcode
limit 100;

Page 13: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 2 - case expression with filter predicates

SELECT
  datepartition,
  SUM(CASE
        when requestheader.pagekey in ('pulse-saved-articles','pulse-settings','pulse-pbar','pulse-slice-internal',
                                       'pulse-share-hub','pulse-special-jobs-economy','pulse-browse','pulse-slice-connections') then 1
        when requestheader.pagekey in ('pulse-slice','pulse-top-news')
             and (trackingcode NOT LIKE 'eml-tod%' OR trackingcode IS NULL) then 1
        else 0
      end) AS TODAY_PV
FROM pageviewevent
where datepartition = '2014-07-15'
  and header.memberid > 0
group by datepartition;

Page 14: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 3 – check count(distinct) with wildcard search

SELECT
  a.datepartition,
  d.country_sk,
  COUNT(1) AS total_count,
  count(distinct a.header.memberid) as unique_count
FROM pageviewevent a
INNER JOIN dim_tracking_code b ON a.trackingcode = b.tracking_code
INNER JOIN dim_page_key c ON a.requestheader.pagekey = c.page_key AND c.is_aggregate = 1
left outer join dim_member_cntry_lcl d on a.header.memberId = d.member_sk
WHERE a.datepartition = '2014-07-18'
  AND ( LOWER(a.trackingcode) LIKE 'eml_bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
     OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt-bt1%' )
GROUP BY a.datepartition, country_sk;

Page 15: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 4 – Joins, filter predicates with count(distinct)

SELECT datepartition, coalesce(c.country_sk, -9), COUNT(DISTINCT a.header.memberid)
FROM pageviewevent a
inner join dim_page_key b
  on a.requestheader.pagekey = b.page_key
  and b.page_key_group_sk = 39
  and b.is_aggregate = 1
left outer join dim_member_cntry_lcl c on a.header.memberid = c.member_sk
where a.datepartition = '2014-07-19'
  and a.header.memberid > 0
group by datepartition, coalesce(c.country_sk, -9);

Page 16: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 5 – test map datatype with filter predicates

select
  substr(datepartition,1,10) as date_data,
  campaigntypeint, header.memberid, channelid, `format` as ad_format,
  publisherid, campaignid, advertiserid, creativeid,
  parameters['organicActivityId'] as activityid,
  parameters['activityType'] as socialflag,
  '0' as feedposition,
  sum(case when statusint in (1,4) and channelid in (2,1) then 1
           when statusint in (1,4) and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1
           else 0 end) as imp,
  sum(case when statusint = 1 and channelid in (2,1) then 1
           when statusint = 1 and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1
           else 0 end) as imp_sas,
  sum(case when channelid in (2000, 3000) and parameters['sequence'] > 0 then 1 else 0 end) as view_other,
  sum(case when statusint = 1 then cost else 0.0 end) as rev_imp
from adimpressionevent
where datepartition = '2014-07-20'
  and campaignTypeInt = 14
group by
  substr(datepartition,1,10),
  campaigntypeint, header.memberid, channelid, `format`,
  publisherid, campaignid, advertiserid, creativeid,
  parameters['organicActivityId'],
  parameters['activityType']
limit 1000;

Page 17: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 6 – 2 table join with count(distinct)

select count(distinct member_sk)
from dim_position p
join dim_company c
  on c.company_sk = p.std_company_sk
  and c.active = 'Y'
  and c.company_type_sk = 4
where end_date is null
  and is_primary = 'Y';

Page 18: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 7 – 600+ column table test

select
  om.current_company as Company,
  om.industry as Industry,
  om.company_size as Company_Size,
  om.current_title as Job_Title,
  om.member_sk as Member_SK,
  om.first_name as First_Name,
  om.last_name as Last_Name,
  om.email_address as Email,
  om.connections as Connections,
  om.country as Country,
  om.region as Region,
  om.cropped_picture_id as Profile_Picture,
  om.pref_locale as Pref_Locale,
  om.headline as Headline
from om_segment om
where om.ACTIVE_FLAG = 1
  and om.country_sk in (162,78,75,2,57)
  and om.connections > 99
  and om.pageview_l30d > 0
  and (
        (om.headline like '%linkedin%')
     or (om.current_title like '%linkedin%')
     or (
          (om.headline like '%social media%' or om.headline like '%social consultant%'
           or om.headline like '%social recruit%' or om.headline like '%employer brand%')
          and
          (om.headline like '%train%' or om.headline like '%consult%'
           or om.headline like '%advis%' or om.headline like '%recruit%')
        )
     or (
          (om.current_title like '%social media%' or om.current_title like '%social consultant%'
           or om.current_title like '%social recruit%' or om.current_title like '%employer brand%')
          and
          (om.current_title like '%train%' or om.current_title like '%consult%'
           or om.current_title like '%advis%' or om.current_title like '%recruit%')
        )
      );

Page 19: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 8 – 3 table joins with uniques

select distinct f.member_sk
FROM dim_education e
  join dim_member_flat f on (e.member_sk = f.member_sk)
  join dim_school s on (e.school_sk = s.school_sk)
WHERE f.active_flag = 'Y'
  and ( (e.country_sk = 167) OR (s.country_sk = 167) )
limit 1000;

Page 20: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 9 – wide table test (600+ columns - test columnar)

select member_sk
from om_segment
where (lss_decision_maker_flag like 'DM' or lss_decision_maker_flag like 'IC')
  and (lss_company_tier like 'Enterprise' or lss_company_tier like 'SMB' or lss_company_tier like 'SRM')
  and (lss_customer_status like 'Prospect' or lss_customer_status like 'Customer')
  and (lss_subscriber_status like 'Online Gen Subscriber' or lss_subscriber_status like 'Not a Subscriber')
  and country_sk in (14,194,174,95,154,167,227,37,102,78,162,163,70,193,21,132,59,101,2,242)
limit 1000;

Page 21: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 10 – using sub-queries joins – push down

select p.member_sk
from dim_position p
inner join (
  select position_sk, std_title_2_sk, member_sk
  from dim_position_std_title_2
) pt
  on p.position_sk = pt.position_sk
  and p.member_sk = pt.member_sk
inner join (
  select std_title_2_sk
  from dim_std_title_2
  where std_title_2_id in (17801,20923,11001,21845,8206,8136,22224,5204,13257,5642,8,16565,792,12949,13758)
) t on pt.std_title_2_sk = t.std_title_2_sk
inner join (
  select company_sk
  from dim_company
  where company_size_sk > 2
) c on p.std_company_sk = c.company_sk
where p.end_date is null
  and p.is_primary = 'Y'
limit 1000;

Page 22: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 11 – test union all

select distinct member_sk
from (
  select member_sk
  from dim_education
  where school_sk in (9873, 10065, 10388, 9872, 7916, 10241, 10242, 9900, 10377, 10719, 10637, 8534, 8535, 9906)
  union all
  select member_sk
  from dim_position
  where final_company_sk in (74701,74702,12831,159378,62771,67754,75480,79641,73975,87156,1895741,147775)
     or company_sk in (74701,74702,12831,159378,62771,67754,75480,79641,73975,87156,1895741,147775)
) x
limit 1000;

Page 23: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 12 – 3 table joins

create table u_smallem.retirement_members as
select distinct sds.member_sk
from u_smallem.v_retirement_dm sds
  inner join dim_member_flat mem on mem.member_sk = sds.member_sk and active_flag = 'Y'
  inner join dim_position pos on sds.member_sk = pos.member_sk
where (pos.final_seniority_2_sk in (6,7,9,10) OR pos.user_supplied_title like '%senior consultant%')
UNION
select distinct current_date, mem.member_sk, 739, 4
from dim_position pos
  inner join dim_member_flat mem on mem.member_sk = pos.member_sk and active_flag = 'Y'
where pos.final_company_sk in (12254,24672,12694,16583,21410,38641,145164,32346,20918,35083,96824,49506,159381,48201,45860,215432,53484,327842,63747,78721,139406,778800)
  and (final_std_title_2_sk in (select std_title_2_sk as final_st_title_2_sk from dim_std_title_2 where occupation_id = 235)
       or pos.user_supplied_title like '%benefit consultant%');

Page 24: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 13 – time based calculations

select distinct member_sk
from (
  select member_sk, start_date, end_date,
         cast(from_unixtime(unix_timestamp() - 24*3600*90, 'yyyyMM') as int) d1,
         cast(year(from_unixtime(unix_timestamp())) as int) * 100 d2,
         source_created_ts
  from dim_position
) x
where start_date >= d1
   or end_date >= d1
   or ((start_date = d2 or end_date = d2)
       and source_created_ts >= unix_timestamp() - 24*3600*90)
limit 1000;

Page 25: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 14 – many small table joins

create table u_smallem.vs_rti_ad_order as
select
  o.ad_order_sk,
  sum(r.ad_impressions) as impressions,
  sum(r.ad_clicks) as clicks
from agg_daily_ad_revenue r
  inner join dim_ad a on r.ad_sk = a.ad_sk
  inner join dim_ad_order o on r.ad_order_sk = o.ad_order_sk
  inner join dim_advertiser v on v.advertiser_sk = o.advertiser_sk
where r.datepartition >= '2014-07-01' and r.datepartition <= '2014-07-31'
  and r.ad_creative_size_sk in (6,8,17,29)
  and v.adv_saleschannel_name like 'Field%'
  and o.lars_sales_channel_name like 'Advertising Field'
  and r.ad_site_sk = 1
  and r.ad_zone_sk <> 1175
  and o.proposal_bind_id is not null
  and (coalesce(a.lars_product_type, 'n/a') not like 'Click Tracker'
       or coalesce(a.lars_product_type, 'n/a') not like 'inMail'
       or coalesce(a.lars_target_type, 'n/a') not like 'Partner Message'
       or coalesce(a.lars_target_type, 'n/a') not like 'Polls')
group by o.ad_order_sk
having sum(r.ad_impressions) > 9999;

drop table if exists u_smallem.vs_final;
create table u_smallem.vs_final as
select distinct i.member_sk
from (
  select member_sk, f.ad_order_sk, count(1) as impr
  from fact_detail_ad_impressions f
    join u_smallem.vs_rti_ad_order u on f.ad_order_sk = u.ad_order_sk
  where date_sk >= '2014-07-01' and date_sk <= '2014-07-07'
    and ad_creative_size_sk in (6,17)
    --and ad_order_sk in (select distinct ad_order_sk from u_smallem.vs_rti_ad_order)
  group by member_sk, f.ad_order_sk
  having count(1) > 10
) i
join om_segment o on i.member_sk = o.member_sk
where i.member_sk > 0
  and o.pageview_l30d < 3000;

Page 26: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 15 – check not exists

drop table if exists u_smallem.tmp_SDS_AU;
create table u_smallem.tmp_SDS_AU as
select distinct member_sk
from fact_bzops_follower f1
where company_id = 2584270
  and status = 'A'
  and not exists (
    select 1
    from fact_bzops_follower f2
    where company_id = 3600
      and status = 'A'
      and f2.member_sk = f1.member_sk
  );

Page 27: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 1 – Test concurrent users (Presto only)

• This exercise was performed for Presto only.
• Concurrency is measured by the number of users running the query in parallel.
• For simplicity's sake, we ran the same query with 1, 2, 4, 8 and 12 users at the same time.
• Queries 3 and 4 failed with multiple concurrent users, which clearly indicates that more memory is required on the system.
• Multiple big-table joins would fail on the system when run concurrently.

Query 3:

SELECT datepartition, coalesce(c.country_sk, -9), COUNT(DISTINCT a.memberid)
FROM pageviewevent_flat a
inner join dim_page_key b
  on a.pagekey = b.page_key
  and b.page_key_group_sk = 39
  and b.is_aggregate = 1
left outer join dim_member_cntry_lcl c on a.memberid = c.member_sk
where a.datepartition = '2014-07-11'
  and a.memberid > 0
group by datepartition, coalesce(c.country_sk, -9);

Page 28: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 1 – Test linear growth (7-day window)

Query:

select trackingcode, count(1)
from pageviewevent_flat
where datepartition >= '2014-07-15' and datepartition <= '2014-07-16'
group by trackingcode
limit 100;

• This exercise was performed on Presto and Hive-on-Tez.
• We chose Query 1 for this test.
• Query 1 was run with 1-, 2-, 4- and 7-day ranges.

Page 29: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Query 3 – Test linear growth (7-day window)

Query:

SELECT
  a.datepartition,
  d.country_sk,
  COUNT(1) AS total_count,
  count(distinct a.memberid) as unique_count
FROM pageviewevent_flat a
INNER JOIN dim_page_key c ON a.pagekey = c.page_key AND c.is_aggregate = 1
left outer join dim_member_cntry_lcl d on a.memberId = d.member_sk
WHERE a.datepartition >= '2014-07-18' and a.datepartition <= '2014-07-19'
  AND ( LOWER(a.trackingcode) LIKE 'eml_bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
     OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt-bt1%' )
GROUP BY a.datepartition, country_sk;

• This exercise was performed on Presto and Hive-on-Tez.
• We chose Query 3 for this test.
• Query 3 was run with 1-, 2-, 4- and 7-day ranges.

Page 30: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

All queries – Holistic view

Page 31: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Conclusion

Hive-on-Tez

• Pros:
  • Environments that run Hive only can benefit from having Hive-on-Tez.
  • Hive-on-Tez offers considerable improvement in query performance and provides an alternative to MapReduce.
  • In many cases we have seen queries speed up at least 3x-8x compared to Hive.
  • Switching to Hive-on-Tez is extremely simple (set hive.execution.engine=tez; see the sketch after this list).
• Cons:
  • For this POC, we had to tweak many Hive configuration properties to get optimal performance for queries running on Tez. We felt this was a drawback, as we had to tune parameters specific to certain queries. This may be a hindrance for ad-hoc queries.
  • There were a couple of queries that ran indefinitely and had to be terminated.
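
For reference, switching a Hive session to Tez is a single setting. The sketch below shows the switch plus the kind of per-query tuning mentioned in the Cons; the container-size values are placeholders for illustration, not the settings used in this POC.

-- Run the current Hive session on Tez instead of MapReduce
SET hive.execution.engine=tez;

-- Illustrative example of per-query tuning (placeholder values, not the POC settings)
SET hive.tez.container.size=4096;     -- MB per Tez container
SET hive.tez.java.opts=-Xmx3276m;

-- Switch back to MapReduce
SET hive.execution.engine=mr;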

Page 32: Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)

Conclusion

Presto

• Pros:
  • Presto proved to be fast and is a very good solution for ad-hoc analysis and faster table scans.
  • Presto was 3x to 10x faster than Hive on MapReduce in many queries.
  • SQL/query federation is an amazing feature for joining MySQL or Teradata tables to Hive tables using Presto (see the sketch after this list). This is similar to the Aster Data SQL-H feature.
• Cons:
  • Requires a separate installation.
  • Memory was a big issue with Presto. The concurrency test that we did with multiple users clearly indicates that memory was insufficient. Also, joining two big tables requires a lot of memory, and running such joins on Presto clearly indicates that this is not going to work, as it does not support distributed hash joins.
  • DDLs are not supported.
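
As an illustration of the federation feature, Presto addresses every table as catalog.schema.table, so a cross-source join looks roughly like the query below. This is a minimal sketch assuming a configured mysql catalog and hypothetical table and column names; it is not a query from this benchmark.

-- Minimal sketch (hypothetical catalogs, tables and columns):
-- join a Hive fact table to a MySQL dimension table in a single Presto query
SELECT d.country_name, count(1) AS pv
FROM hive.default.pageviewevent_flat f
JOIN mysql.refdata.dim_country d
  ON f.country_sk = d.country_sk
WHERE f.datepartition = '2014-07-15'
GROUP BY d.country_name;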