Hive and Presto for Big Data

Transcript of "Hive and Presto for Big Data Analytics in the Cloud" by Siva Narayanan, Qubole ([email protected], @k2_181)

  • HIVE AND PRESTO

    FOR BIG DATA

    ANALYTICS IN THE

    CLOUD

    Siva Narayanan

    Qubole

    [email protected]

    @k2_181

  • `WHOAMI`

    PhD in large-scale scientific data management

    Parallel query processing, Greenplum Parallel Database

    Hadoop, Hive, Presto at Qubole

    Niche: scientific simulation apps, Fortune companies, small and medium enterprises

  • WHAT'S NEW ABOUT BIG DATA, YOU SAY?

    Traditionally, analytics ran on data internal to an organization:

    Customer data

    ERP data

    Some pre-digested external data, like market research

    Now: sophisticated analytics using new data sources:

    Social data

    Website data

    Low density, fine grained, and massive

    Most EDWs are < 2 TB

  • LOW DENSITY, HIGH VOLUME DATA

    Amul comment data: 18,000 × 140 × 60 × 24 × 30 ≈ 100 GB per month

    Social media / website data, unique visitors by category:

    Category                 Unique visitors
    Retail: luxury goods     20 million
    Retail: consumer goods   30 million
    Retail: tickets          26 million

    Traditional technologies cannot handle this low-density, high-volume data
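The slide's back-of-envelope estimate can be checked in a few lines; a minimal sketch, assuming the factors mean 18,000 messages per minute at roughly 140 bytes per message, accumulated over a 30-day month:

```python
# Sanity check of the slide's volume estimate (interpretation of the
# factors is an assumption: msgs/min * bytes/msg * min * hours * days).
messages_per_minute = 18_000
bytes_per_message = 140          # roughly one tweet-sized comment
minutes_per_hour, hours_per_day, days_per_month = 60, 24, 30

bytes_per_month = (messages_per_minute * bytes_per_message
                   * minutes_per_hour * hours_per_day * days_per_month)
gb_per_month = bytes_per_month / 10**9
print(f"{gb_per_month:.1f} GB per month")  # ~108.9, i.e. roughly 100 GB
```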

  • SKELETON OF A BIG DATA PROJECT

    Internal data + external data (TBs - PBs)

    → Analytics workflow → Actionable report

  • HOW DO THE BIG GUYS DO IT?

    Build data centers

    Buy or build custom big-data software

    Hire ETL engineers who manage bringing data into the system

    Hire admins to keep it all running

    Hire data scientists to come up with interesting questions

    Hire developers who can translate questions into programs

  • A BIG DATA PROJECT ENTAILS

    Lots of upfront investment

    A long time to get started

    Lots of risk

  • LANDSCAPE IS CHANGING

    Advent of public clouds

    Cheap, reliable storage

    Provision 10-1000s of machines in a couple of minutes

    Pay as you go, grow as you please

    Free / inexpensive big-data software

    Hadoop, Hive, Presto

  • CLOUD PRIMITIVES

    Persistent object store e.g. AWS S3

    Reliability is basically solved for you (*)

    Ability to provision clusters with pre-built images in a couple of minutes

    Pay by the hour (or by the minute)

    Spot instances (AWS)

    Relational DB as a Service

    MySQL, PostgreSQL, etc.

  • THE CLOUD CAN HANDLE YOUR DATA

  • THE CLOUD'S COMPUTE FLEXIBILITY

    Analytics workloads tend to be bursty

    Most orgs struggle to predict usage 2-3 months down the line

    They tend to overprovision compute, provisioning for peak workload

    Result: < 30% utilization of their hardware (Chen et al., VLDB 2012)

    The cloud allows you to scale up and down

    Trickier for a big data system, but possible

  • BIG DATA SOFTWARE

    Many open source projects:

    Hadoop, based on Google's MR paper (Yahoo)

    Hive (SQL-on-Hadoop)

    Presto (fast SQL)

    Production ready, running at scale at Yahoo, FB, and many other environments

  • ENTER HADOOP

    Open-source implementation of MapReduce, which Google used to index trillions of web pages

    Allows programmers to write distributed programs using map and reduce abstractions

    Runs these programs over large amounts of data

    Uses a bunch of cheap hardware and can tolerate failures
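The map and reduce abstractions above can be sketched in a single process with the canonical example, word count; Hadoop's contribution is distributing exactly this shape of program across a cluster:

```python
from itertools import groupby

# Single-process sketch of the MapReduce programming model: word count.

def map_phase(line):
    # map: emit (key, value) pairs -- here (word, 1) for each word
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce: combine all values for one key -- here, sum the ones
    return word, sum(counts)

lines = ["big data on hadoop", "hadoop scales big"]
pairs = sorted(kv for line in lines for kv in map_phase(line))  # shuffle/sort
result = dict(reduce_phase(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'big': 2, 'data': 1, 'hadoop': 2, 'on': 1, 'scales': 1}
```

In Hadoop the sort/shuffle step is what moves all pairs with the same key to the same reducer; here `sorted` plus `groupby` plays that role.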

  • HADOOP SCALES!

  • HIVE: SQL ON HADOOP

    Facebook had a multi-petabyte warehouse

    Had 80+ engineers writing Hadoop jobs

    Files are insufficient data abstractions

    Need tables, schemas, partitions, indices

    SQL is highly popular

    So: implement SQL on top of Hadoop

    Allowed non-programmers to process all the data

    FB open-sourced it

    Production ready

    Processes 25 PB of data at FB

    Processes 20 PB of data at Qubole

  • HIVE ALLOWS YOU TO DESCRIBE DATA

    Example:

    My data lives in Amazon S3 in a specific location

    It is in delimited text format

    Please create a virtual table for me

    Many data formats: JSON, text, binary, Avro, ProtoBuf, Thrift

    Analytics is often a downstream process

    Conversion of data is time consuming and not productive

    CREATE EXTERNAL TABLE nation (
      N_NATIONKEY INT, N_NAME STRING,
      N_REGIONKEY INT, N_COMMENT STRING)
    ROW FORMAT DELIMITED
    STORED AS TEXTFILE
    LOCATION 's3n://public-qubole/datasets/tpch5G/nation';
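One detail worth knowing when declaring such a table: the slide doesn't show a field delimiter, and with ROW FORMAT DELIMITED but no FIELDS TERMINATED BY clause, Hive defaults to Ctrl-A ('\x01') as the separator. A sketch of producing one row in that default layout (the column values are illustrative, not the actual dataset):

```python
# Hive's default field separator for ROW FORMAT DELIMITED text files
# is Ctrl-A; a real dataset may declare a different one (e.g. '|').
FIELD_SEP = "\x01"

def to_hive_row(nationkey, name, regionkey, comment):
    # One text row matching the nation table's declared columns.
    return FIELD_SEP.join([str(nationkey), name, str(regionkey), comment])

row = to_hive_row(0, "ALGERIA", 0, "sample comment")
print(row.split(FIELD_SEP))  # ['0', 'ALGERIA', '0', 'sample comment']
```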

  • HIVE EXTENSIBILITY

    Connect to external data sources like MongoDB

    Write code to understand new data formats (SerDes)

    Custom UDFs in Java

    Plug in custom code in Python or any other language

    SELECT TRANSFORM (hosting_ids, user_id, d)
    USING 'python combine_arrays.py' AS (hosting_ranks_array, user_id, d)
    FROM s_table;
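The TRANSFORM protocol is simple: Hive streams each row to the script as one tab-separated line on stdin and parses tab-separated lines from stdout. A sketch in the style of the slide's combine_arrays.py (the real combine logic isn't shown, so the transform below is a stand-in):

```python
# Stand-in for a Hive TRANSFORM script: rows arrive as tab-separated
# columns; output rows must be tab-separated as well.

def transform(line):
    hosting_ids, user_id, d = line.rstrip("\n").split("\t")
    # Hypothetical logic: turn the comma-separated ids into a sorted array.
    hosting_ranks = ",".join(sorted(hosting_ids.split(",")))
    return "\t".join([hosting_ranks, user_id, d])

# In production the script is a pure stdin -> stdout filter:
#   for line in sys.stdin:
#       print(transform(line))
print(transform("h3,h1,h2\tu42\t2014-05-01"))
```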

  • HIVE ALLOWS YOU TO QUERY THE DATA

    SQL-like

    Queries are parallelized using Hadoop as the execution engine

    SELECT COUNT(*) FROM nation;

    (Diagram: per-task partial COUNT(*)s combined by a final SUM())

  • HIVE EXECUTION

    Split a Hive query into multiple Hadoop/MR jobs

    Run job 1, save intermediate output to HDFS

    Run job 2, ...

    Return results

    Data parallel, because every Hadoop job runs on a number of machines

    (Diagram: tasks T11 and T12, 100 MB each, reading 10 input files and writing 5 files apiece)

  • TASK PARALLELISM

    (Diagram: tasks T1, T2, T3 at 100 MB each over 10 input files)

  • EXECUTION MODEL 1

    (Diagram: tasks T1, T2, T3, 100 MB each, over 10 files, run one after another)

    Only 100 MB of memory required

    Can stop and resume

    Allows for multiplexing multiple pipelines

    Can tolerate failures

    Spilling can be expensive

    Time to first result is high

  • EXECUTION MODEL 2

    (Diagram: tasks T1, T2, T3, 100 MB each, over 10 files, running concurrently as a pipeline)

    Task parallelism

    Needs 3x the memory

    No spilling, hence much faster

    Early first results

    Stop and resume is trickier

    Multiplexing is more difficult

    Cannot tolerate failures
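The two execution models can be contrasted on a toy two-stage pipeline (filter, then square; the stages and names are illustrative): EM1 materializes each stage's full output before the next stage starts, while EM2 streams rows through all stages at once.

```python
import os
import tempfile

data = range(10)

def em1(rows):
    # EM1 (Hive/MR style): stage 1 writes its entire output to stable
    # storage (a temp file standing in for HDFS) before stage 2 begins.
    with tempfile.NamedTemporaryFile("w+", delete=False) as f:
        for r in rows:
            if r % 2 == 0:                    # stage 1: filter
                f.write(f"{r}\n")
        path = f.name
    with open(path) as f:
        out = [int(line) ** 2 for line in f]  # stage 2: square
    os.unlink(path)
    return out

def em2(rows):
    # EM2 (Presto style): stages are pipelined; each row flows through
    # both operators without touching disk, but intermediates live in memory.
    filtered = (r for r in rows if r % 2 == 0)  # stage 1: filter (lazy)
    return [r ** 2 for r in filtered]           # stage 2: square

print(em1(data), em2(data))  # same answer both ways
```

Both models compute the same result; the difference is where intermediate rows live (disk vs memory) and when downstream work can start, which is exactly the trade-off the two slides list.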

  • ENTER PRESTO

    Hive used EM1 and had the associated disadvantages

    Internal project at Facebook to implement EM2 (Presto)

    Use case was interactive queries over the same data

    Open sourced late 2013

    Promised much faster query performance

    In-memory processing, aggressive pipelining

    Supports all the data formats that Hive does

    Can't plug in user code at this point; vanilla SQL only

  • CONTRASTING HIVE AND PRESTO (Hive 0.11 vs Presto 0.60)

    Hive                                 Presto
    Uses Hadoop MR for execution (EM1)   Pipelined execution model (EM2)
    Spills intermediate data to FS       Keeps intermediate data in memory
    Can tolerate failures                Does not tolerate failures
    Automatic join ordering              User-specified join ordering
    Can join two large tables            One table must fit in memory
    Supports grouping sets               Does not support grouping sets
    Can plug in custom code              Cannot plug in custom code
    More data types                      Limited data types

  • PERFORMANCE COMPARISON

    Presto is 2.5-7x faster

    But some queries simply run out of memory

    This contrast follows directly from the execution models

  • IN A NUTSHELL

  • SAMPLE SETUP

    (Diagram: applications sync data into cloud storage, e.g. via Sqoop; Hive handles heavy-duty queries, Presto handles interactive queries)

  • CRYSTAL BALL

    Hive is actively working on task parallelism as part of the Stinger Initiative

    Presto is also making rapid progress in bridging some of its gaps

    There are other open source projects: Impala, Shark, Drill, Tajo

    Lots of goodies for users

  • CONCLUSION

    Big Data Analytics is becoming accessible and affordable

    Public clouds give flexibility and change economics

    Hive and Presto provide intuitive and powerful ways to interact with your data

  • Sign up for a free trial at Qubole.com

    Get access to Hive, Presto, Hadoop, Pig as a Service on Amazon and Google cloud services

    Siva [email protected] / @k2_181

  • QUESTIONS

    Where should data be stored?

    What formats are appropriate?

    What kinds of processing need to happen?

    What parts are expressible in ANSI-SQL?

    How can I plug in proprietary business logic?

    How much compute power is required?

    How do I put it all together?