Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)


Interactive SQL POC on Hadoop
Hive 13, Hive-on-Tez, Presto
Storage: RCFile, ORC, Parquet and Avro

Sudhir Mallem, Alex Bain, George Zhang


Introduction
  • Interactive SQL POC on HDFS
  • Hive (version 13), Hive-on-Tez, Presto
  • Storage formats: RCFile, ORC, Parquet, Avro
  • Compression: Snappy, Zlib, native gzip compression

Goal of Benchmark
The goals of this benchmark are:
  • To provide a comprehensive overview and test of interactive SQL on Hadoop.
  • To measure query response time on each platform across the different storage formats.
  • To measure compressed data size across different dimensions.
  • To better understand the performance gain we may see with queries on each of these platforms.
Avro is widely used across LinkedIn, and testing it on the newer platform (Hive 13) with other tools (Tez, Presto, etc.) will give us a good understanding of the performance gain we may potentially see.

System, Storage formats and compression
Systems chosen:
  • Hive: version 13.1
  • Hive 13 on Tez + YARN: Tez version 0.4.1
  • Presto: versions 0.74, 0.79 and 0.80
Storage formats and compression:
  • ORC + zlib compression
  • RCFile + Snappy
  • Parquet + Snappy
  • Avro + Avro native compression (Deflate level 9)
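One way to organize such a comparison is to enumerate every system and storage-format pair and time the same query on each. The sketch below is illustrative only: `run_query` is a hypothetical callable supplied by the caller (it stands in for submitting a query through a Hive or Presto client), and taking the median over a few runs is an assumption, not the method used in this POC.

```python
import itertools
import statistics
import time

# Systems and storage/compression pairs as listed on the slide.
SYSTEMS = ["hive-13", "hive-13-on-tez", "presto"]
FORMATS = [
    ("ORC", "zlib"),
    ("RCFile", "snappy"),
    ("Parquet", "snappy"),
    ("Avro", "deflate-9"),
]


def benchmark(run_query, query, runs=3):
    """Time `query` on every (system, format) pair; return median seconds.

    `run_query(system, storage_format, query)` is a hypothetical callable
    supplied by the caller; it is not part of any Hive or Presto API.
    """
    results = {}
    for system, (fmt, codec) in itertools.product(SYSTEMS, FORMATS):
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            run_query(system, fmt, query)
            timings.append(time.perf_counter() - start)
        results[(system, fmt, codec)] = statistics.median(timings)
    return results
```

A real run would skip any pair that a given system cannot read, rather than timing the full cross product.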

System, Storage formats and compression
  • Presto: the dataset was created in RCFile, which was the most recommended format for Presto when this was evaluated. During the evaluation, Presto had issues working with Avro and Parquet: the queries either did not run or were not well optimized.
  • With the release of Presto v0.80, we have also tested it with the ORC file format.
  • We flattened certain data (pageviewevent) to support the benchmark on Presto. Currently Presto supports only Map datatypes; Struct and similar types are accessed with the json_extract function.
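The flattening step can be pictured as follows. This is a minimal Python sketch, not the POC's actual ETL: the sample row is made up (it only borrows the `header.memberid` and `requestheader.pagekey` field names from the benchmark queries), and `extract_scalar` is a hypothetical helper that merely imitates what a JSON-extraction function does for a simple `$.field` path.

```python
import json


def flatten_record(record):
    """Keep scalar columns as-is; serialize nested (struct-like) values to JSON text."""
    flat = {}
    for column, value in record.items():
        if isinstance(value, dict):
            flat[column] = json.dumps(value, sort_keys=True)
        else:
            flat[column] = value
    return flat


def extract_scalar(json_text, path):
    """Hypothetical stand-in for a JSON-extraction function.

    Supports only dotted paths of the form '$.a.b'; returns None if a key is missing.
    """
    node = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up sample row shaped like a pageviewevent record.
row = {
    "trackingcode": "eml-tod-1",
    "header": {"memberid": 42},
    "requestheader": {"pagekey": "pulse-top-news"},
}
flat_row = flatten_record(row)
member_id = extract_scalar(flat_row["header"], "$.memberid")
```

On the flattened table, a lookup of this kind takes the place of the dotted struct access (`header.memberid`) that the Hive queries use.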

About Hive 13
This is the next version of Hive at LinkedIn. Hive is used heavily at LinkedIn for interactive SQL by users who are not Pig Latin savvy and prefer a SQL solution. Hive is generally slow because it runs on MapReduce and competes for mappers and reducers on the Hadoop cluster with Pig Latin and vanilla MapReduce jobs.

About Hive 13-on-Tez
Tez is a new application framework, built by Hortonworks on Hadoop YARN, that can execute complex directed acyclic graphs (DAGs) of general data processing tasks. In many ways it can be thought of as a more flexible and powerful successor to the map-reduce framework.
It generalizes map and reduce tasks by exposing interfaces for generic data processing tasks, which consist of a triplet of interfaces: input, output and processor. These tasks are the vertices in the execution graph. Edges (i.e., data connections between tasks) are first-class citizens in Tez and, together with the input/output interfaces, greatly increase the flexibility of how data is transferred between tasks.
Tez also greatly extends the ways in which individual tasks can be linked together; in fact, any arbitrary DAG can be executed directly in Tez.
In Tez parlance, a map-reduce job is a simple DAG consisting of a single map vertex and a single reduce vertex connected by a bipartite edge (i.e., the edge connects every map task to every reduce task). Map inputs and reduce outputs are HDFS inputs and outputs respectively. The map output class locally sorts and partitions the data by a certain key, while the reduce input class merge-sorts its data on the same key.
Tez also provides what is essentially a map-reduce compatibility layer that lets one run MR jobs on top of the new execution layer by implementing map/reduce concepts on the new execution framework.
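To make the map-reduce-as-DAG description concrete, here is a small self-contained Python simulation. It does not use the Tez API; the function names (`map_vertex`, `bipartite_edge`, `reduce_vertex`) and the word-count workload are assumptions chosen for illustration. Each map task partitions and locally sorts its output by key, the edge hands partition r of every map task to reduce task r, and each reduce task merge-sorts its inputs before aggregating.

```python
import heapq
import zlib
from itertools import groupby


def partition_of(key, num_reducers):
    # Deterministic partitioner (the built-in hash() is salted for strings).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_vertex(lines, num_reducers):
    """One map task: emit (word, 1), then partition and locally sort by key."""
    partitions = [[] for _ in range(num_reducers)]
    for line in lines:
        for word in line.split():
            partitions[partition_of(word, num_reducers)].append((word, 1))
    return [sorted(p) for p in partitions]


def bipartite_edge(map_outputs, num_reducers):
    """Connect every map task to every reduce task: reducer r gets partition r of each map."""
    return [[out[r] for out in map_outputs] for r in range(num_reducers)]


def reduce_vertex(sorted_runs):
    """One reduce task: merge-sort the pre-sorted runs, then sum counts per key."""
    merged = heapq.merge(*sorted_runs)
    return {key: sum(count for _, count in group)
            for key, group in groupby(merged, key=lambda kv: kv[0])}


def run_dag(splits, num_reducers=2):
    """Run the two-vertex DAG: map tasks -> bipartite edge -> reduce tasks."""
    map_outputs = [map_vertex(split, num_reducers) for split in splits]
    reduce_inputs = bipartite_edge(map_outputs, num_reducers)
    result = {}
    for runs in reduce_inputs:
        result.update(reduce_vertex(runs))
    return result


# Two input splits, i.e. two map tasks feeding two reduce tasks.
counts = run_dag([["hive tez presto", "hive tez"], ["hive presto"]])
```

Because every map output is already sorted, the reduce side only needs a streaming merge rather than a full sort, which is the point of pairing a sorting output class with a merge-sorting input class.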

About Presto
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the data size of organizations like LinkedIn and Facebook.

About Dataset
The input dataset was carefully chosen not only to cover the performance perspective of benchmarking, but also to gain better insight into each of the systems. It gives a good understanding of the query patterns they support, their functions, ease of use, etc.
  • Different dimension tables, facts and aggregates.
  • Table sizes range from 20k rows to 80+ billion rows.
  • Hive supports complex datatypes like Struct, Array, Union and Map. The data that we chose has nested structures, key-value pairs and binary data.
  • We flattened the data for use in Presto, as the 0.74 version of Presto supports only the Array and Map datatypes. The underlying data is stored as JSON, so we have to use JSON functions to extract and refer to the data.
  • One of the datasets is a flat table with 600+ columns, specifically to test the columnar functionality of the Parquet, RCFile and ORC file formats.

Evaluation Criteria
We chose 15 queries for our testing and benchmarking. These SQL statements are some of the queries that users commonly run in the data warehouse (DWH) at LinkedIn. The queries test the following functionality:
  • Date and time manipulations
  • Nested SQL, wildcard searches, filter predicates, partition pruning, full table scans and joins (3-way, 2-way, etc.)
  • exists, in, not exists, not in
  • Aggregate functions like sum, max, count(distinct), count(1)
  • Extracting keys from map and struct datatypes

Query 1 - simple group by and count

select trackingcode, count(1)
from pageviewevent
where datepartition = '2014-07-15'
group by trackingcode
limit 100;

Query 2 - case expression with filter predicates

SELECT
  datepartition,
  SUM(CASE
        WHEN requestheader.pagekey IN ('pulse-saved-articles', 'pulse-settings', 'pulse-pbar',
                                       'pulse-slice-internal', 'pulse-share-hub',
                                       'pulse-special-jobs-economy', 'pulse-browse',
                                       'pulse-slice-connections') THEN 1
        WHEN requestheader.pagekey IN ('pulse-slice', 'pulse-top-news')
             AND (trackingcode NOT LIKE 'eml-tod%' OR trackingcode IS NULL) THEN 1
        ELSE 0
      END) AS TODAY_PV
FROM pageviewevent
WHERE datepartition = '2014-07-15'
  AND header.memberid > 0
GROUP BY datepartition;

Query 3 - count(distinct) with wildcard search

SELECT
  a.datepartition,
  d.country_sk,
  COUNT(1) AS total_count,
  COUNT(DISTINCT a.header.memberid) AS unique_count
FROM pageviewevent a
INNER JOIN dim_tracking_code b
  ON a.trackingcode = b.tracking_code
INNER JOIN dim_page_key c
  ON a.requestheader.pagekey = c.page_key
  AND c.is_aggregate = 1
LEFT OUTER JOIN dim_member_cntry_lcl d
  ON a.header.memberId = d.member_sk
WHERE a.datepartition = '2014-07-18'
  AND ( LOWER(a.trackingcode) LIKE 'eml_bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt_bt1%'
     OR LOWER(a.trackingcode) LIKE 'eml-bt1%'
     OR LOWER(a.trackingcode) LIKE 'emlt-bt1%' )
GROUP BY a.datepartition, country_sk;

Query 4 - joins, filter predicates with count(distinct)

SELECT
  datepartition,
  COALESCE(c.country_sk, -9),
  COUNT(DISTINCT a.header.memberid)
FROM pageviewevent a
INNER JOIN dim_page_key b
  ON a.requestheader.pagekey = b.page_key
  AND b.page_key_group_sk = 39
  AND b.is_aggregate = 1
LEFT OUTER JOIN dim_member_cntry_lcl c
  ON a.header.memberid = c.member_sk
WHERE a.datepartition = '2014-07-19'
  AND a.header.memberid > 0
GROUP BY datepartition, COALESCE(c.country_sk, -9);

Query 5 - map datatype with filter predicates

select
  substr(datepartition,1,10) as date_data,
  campaigntypeint,
  header.memberid,
  channelid,
  `format` as ad_format,
  publisherid,
  campaignid,
  advertiserid,
  creativeid,
  parameters['organicActivityId'] as activityid,
  parameters['activityType'] as socialflag,
  '0' as feedposition,
  sum(case when statusint in (1,4) and channelid in (2,1) then 1
           when statusint in (1,4) and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1
           else 0 end) as imp,
  sum(case when statusint = 1 and channelid in (2,1) then 1
           when statusint = 1 and channelid in (2000, 3000) and parameters['sequence'] = 0 then 1
           else 0 end) as imp_sas,
  sum(case when channelid in (2000, 3000) and parameters['sequence'] > 0 then 1 else 0 end) as view_other,
  sum(case when statusint = 1 then cost else 0.0 end) as rev_imp
from adimpressionevent
where datepartition = '2014-07-20'
  and campaignTypeInt = 14
group by
  substr(datepartition,1,10),
  campaigntypeint,
  header.memberid,
  channelid,
  `format`,
  publisherid,
  campaignid,
  advertiserid,
  creativeid,
  parameters['organicActivityId'],
  parameters['activityType']
limit 1000;

Query 6 - 2-table join with count(distinct)

select count(distinct member_sk)
from dim_position p
join dim_company c
  on c.company_sk=p.std_company_sk
  and'Y'
  and c.company_type_sk=4
where end_date is null
  and is_primary ='Y';

Query 7 - 600+ column table test

select
  om.current_company as Company,
  om.industry as Industry,
  om.company_size as Company_Size,
  om.current_title as Job_Title,
  om.member_sk as Member_SK,
  om.first_name as First_Name,
  om.last_name as Last_Name,
  om.email_address as Email,
  om.connections as Connections , as Country,
  om.region as Region,
  om.cropped_picture_id as Profile_Picture,
  om.pref_locale as Pref_Locale,
  om.headline as Headline
from om_segment om
where om.ACTIVE_FLAG = 1
  and om.country_sk in (162,78,75,2,57)
  and om.connections > 99
  and om.pageview_l30d > 0
  and (
    (om.headline like '%linkedin%')
    or (om.current_title like '%linkedin%')
    or (
      (om.headline like '%social media%' or om.headline like '%social consultant%'
       or om.headline like '%social recruit%' or om.headline like '%employer brand%')
      and
      (om.headline like '%train%' or om.headline like '%consult%'
       or om.headline like '%advis%' or om.headline like '%recruit%')
    )
    or (
      (om.current_title like '%social media%' or om.current_title like '%social consultant%'
       or om.current_title like '%social recruit%' or om.current_title like '%employer brand%')
      and
      (om.current_title like '%train%' or om.current_title like '%consult%'
       or om.current_title like '%advis%' or om.current_title