Hive and Presto for Big Data

Transcript of "Hive and Presto for Big Data Analytics in the Cloud" by Siva Narayanan, Qubole ([email protected], @k2_181)

  • HIVE AND PRESTO

    FOR BIG DATA

    ANALYTICS IN THE

    CLOUD

    Siva Narayanan

    Qubole

    [email protected]

    @k2_181

  • `WHOAMI`

    PhD in large-scale scientific data management

    Parallel query processing, Greenplum Parallel Database

    Hadoop, Hive, Presto at Qubole

    Niche: scientific simulation apps, Fortune companies, small and medium enterprises

  • WHAT'S NEW ABOUT BIG DATA, YOU SAY?

    Traditionally, analytics ran on data internal to an organization:

    Customer data

    ERP data

    Some pre-digested external data, like market research

    Now: sophisticated analytics using new data sources:

    Social data

    Website data

    Low density, fine grained, and massive

    Most EDWs are < 2 TB

  • LOW DENSITY, HIGH VOLUME DATA

    Amul comment data: 18,000 × 140 × 60 × 24 × 30 ≈ 100 GB per month

    Social media / website data, unique visitors by category:

    Category                 Unique visitors
    Retail: luxury goods     20 million
    Retail: consumer goods   30 million
    Retail: tickets          26 million

    Traditional technologies cannot handle this low-density, high-volume data
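The slide's back-of-envelope estimate can be checked in a few lines; a minimal sketch, assuming the factors mean 18,000 messages per minute at roughly 140 bytes per message, accumulated over a 30-day month:

```python
# Sanity check of the slide's volume estimate (interpretation of the
# factors is an assumption: msgs/min * bytes/msg * min * hours * days).
messages_per_minute = 18_000
bytes_per_message = 140          # roughly one tweet-sized comment
minutes_per_hour, hours_per_day, days_per_month = 60, 24, 30

bytes_per_month = (messages_per_minute * bytes_per_message
                   * minutes_per_hour * hours_per_day * days_per_month)
gb_per_month = bytes_per_month / 10**9
print(f"{gb_per_month:.1f} GB per month")  # ~108.9, i.e. roughly 100 GB
```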

  • SKELETON OF A BIG DATA PROJECT

    Internal data + external data (TBs - PBs)

    → Analytics workflow → Actionable report

  • HOW DO THE BIG GUYS DO IT?

    Build data centers

    Buy or build custom big-data software

    Hire ETL engineers who manage bringing data into the system

    Hire admins to keep it all running

    Hire data scientists to come up with interesting questions

    Hire developers who can translate questions into programs

  • A BIG DATA PROJECT ENTAILS

    Lots of upfront investment

    A long time to get started

    Lots of risk

  • LANDSCAPE IS CHANGING

    Advent of public clouds

    Cheap, reliable storage

    Provision 10-1000s of machines in a couple of minutes

    Pay as you go, grow as you please

    Free / inexpensive big-data software

    Hadoop, Hive, Presto

  • CLOUD PRIMITIVES

    Persistent object store e.g. AWS S3

    Reliability is basically solved for you (*)

    Ability to provision clusters with pre-built images in a couple of minutes

    Pay by the hour (or by the minute)

    Spot instances (AWS)

    Relational DB as a Service

    MySQL, PostgreSQL, etc.

  • THE CLOUD CAN HANDLE YOUR DATA

  • THE CLOUD'S COMPUTE FLEXIBILITY

    Analytics workloads tend to be bursty

    Most orgs struggle to predict usage 2-3 months down the line

    They tend to overprovision compute, provisioning for peak workload

    Result: < 30% utilization of their hardware (Chen et al., VLDB 2012)

    The cloud allows you to scale up and down

    Trickier for a big data system, but possible

  • BIG DATA SOFTWARE

    Many open source projects:

    Hadoop, based on Google's MR paper (Yahoo)

    Hive (SQL-on-Hadoop)

    Presto (fast SQL)

    Production ready, running at scale at Yahoo, FB, and many other environments

  • ENTER HADOOP

    Open-source implementation of MapReduce, which Google used to index trillions of web pages

    Allows programmers to write distributed programs using map and reduce abstractions

    Runs these programs over large amounts of data

    Uses a bunch of cheap hardware and can tolerate failures
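The map and reduce abstractions above can be sketched in a single process with the canonical example, word count; Hadoop's contribution is distributing exactly this shape of program across a cluster:

```python
from itertools import groupby

# Single-process sketch of the MapReduce programming model: word count.

def map_phase(line):
    # map: emit (key, value) pairs -- here (word, 1) for each word
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # reduce: combine all values for one key -- here, sum the ones
    return word, sum(counts)

lines = ["big data on hadoop", "hadoop scales big"]
pairs = sorted(kv for line in lines for kv in map_phase(line))  # shuffle/sort
result = dict(reduce_phase(w, (c for _, c in grp))
              for w, grp in groupby(pairs, key=lambda kv: kv[0]))
print(result)  # {'big': 2, 'data': 1, 'hadoop': 2, 'on': 1, 'scales': 1}
```

In Hadoop the sort/shuffle step is what moves all pairs with the same key to the same reducer; here `sorted` plus `groupby` plays that role.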

  • HADOOP SCALES!

  • HIVE: SQL ON HADOOP

    Facebook had a multi-petabyte warehouse

    Had 80+ engineers writing Hadoop jobs

    Files are insufficient data abstractions

    Need tables, schemas, partitions, indices

    SQL is highly popular

    So: implement SQL on top of Hadoop

    Allowed non-programmers to process all the data

    FB open-sourced it

    Production ready

    Processes 25 PB of data at FB

    Processes 20 PB of data at Qubole

  • HIVE ALLOWS YOU TO DESCRIBE DATA

    Example:

    My data lives in Amazon S3 in a specific location

    It is in delimited text format

    Please create a virtual table for me

    Many data formats: JSON, text, binary, Avro, ProtoBuf, Thrift

    Analytics is often a downstream process

    Conversion of data is time consuming and not productive

    CREATE EXTERNAL TABLE nation (
      N_NATIONKEY INT, N_NAME STRING,
      N_REGIONKEY INT, N_COMMENT STRING)
    ROW FORMAT DELIMITED
    STORED AS TEXTFILE
    LOCATION 's3n://public-qubole/datasets/tpch5G/nation';
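One detail worth knowing when declaring such a table: the slide doesn't show a field delimiter, and with ROW FORMAT DELIMITED but no FIELDS TERMINATED BY clause, Hive defaults to Ctrl-A ('\x01') as the separator. A sketch of producing one row in that default layout (the column values are illustrative, not the actual dataset):

```python
# Hive's default field separator for ROW FORMAT DELIMITED text files
# is Ctrl-A; a real dataset may declare a different one (e.g. '|').
FIELD_SEP = "\x01"

def to_hive_row(nationkey, name, regionkey, comment):
    # One text row matching the nation table's declared columns.
    return FIELD_SEP.join([str(nationkey), name, str(regionkey), comment])

row = to_hive_row(0, "ALGERIA", 0, "sample comment")
print(row.split(FIELD_SEP))  # ['0', 'ALGERIA', '0', 'sample comment']
```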

  • HIVE EXTENSIBILITY

    Connect to external data sources like MongoDB

    Write code to understand new data formats (SerDes)

    Custom UDFs in Java

    Plug in custom code in Python or any other language

    SELECT TRANSFORM (hosting_ids, user_id, d)
    USING 'python combine_arrays.py' AS (hosting_ranks_array, user_id, d)
    FROM s_table;
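The TRANSFORM protocol is simple: Hive streams each row to the script as one tab-separated line on stdin and parses tab-separated lines from stdout. A sketch in the style of the slide's combine_arrays.py (the real combine logic isn't shown, so the transform below is a stand-in):

```python
# Stand-in for a Hive TRANSFORM script: rows arrive as tab-separated
# columns; output rows must be tab-separated as well.

def transform(line):
    hosting_ids, user_id, d = line.rstrip("\n").split("\t")
    # Hypothetical logic: turn the comma-separated ids into a sorted array.
    hosting_ranks = ",".join(sorted(hosting_ids.split(",")))
    return "\t".join([hosting_ranks, user_id, d])

# In production the script is a pure stdin -> stdout filter:
#   for line in sys.stdin:
#       print(transform(line))
print(transform("h3,h1,h2\tu42\t2014-05-01"))
```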

  • HIVE ALLOWS YOU TO QUERY THE DATA

    SQL-like

    Queries are parallelized using Hadoop as the execution engine

    SELECT COUNT(*) FROM nation;

    (Diagram: per-task partial COUNT(*)s combined by a final SUM())

  • HIVE EXECUTION

    Split a Hive query into multiple Hadoop/MR jobs

    Run job 1, save intermediate output to HDFS

    Run job 2, ...

    Return results

    Data parallel, because every Hadoop job runs on a number of machines

    (Diagram: tasks T11 and T12, 100 MB each, reading 10 input files and writing 5 files apiece)

  • TASK PARALLELISM

    (Diagram: tasks T1, T2, T3 at 100 MB each over 10 input files)

  • EXECUTION MODEL 1

    (Diagram: tasks T1, T2, T3, 100 MB each, over 10 files, run one after another)

    Only 100 MB of memory required

    Can stop and resume

    Allows for multiplexing multiple pipelines

    Can tolerate failures

    Spilling can be expensive

    Time to first result is high

  • EXECUTION MODEL 2

    (Diagram: tasks T1, T2, T3, 100 MB each, over 10 files, running concurrently as a pipeline)

    Task parallelism

    Needs 3x the memory

    No spilling, hence much faster

    Early first results

    Stop and resume is trickier

    Multiplexing is more difficult

    Cannot tolerate failures
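The two execution models can be contrasted on a toy two-stage pipeline (filter, then square; the stages and names are illustrative): EM1 materializes each stage's full output before the next stage starts, while EM2 streams rows through all stages at once.

```python
import os
import tempfile

data = range(10)

def em1(rows):
    # EM1 (Hive/MR style): stage 1 writes its entire output to stable
    # storage (a temp file standing in for HDFS) before stage 2 begins.
    with tempfile.NamedTemporaryFile("w+", delete=False) as f:
        for r in rows:
            if r % 2 == 0:                    # stage 1: filter
                f.write(f"{r}\n")
        path = f.name
    with open(path) as f:
        out = [int(line) ** 2 for line in f]  # stage 2: square
    os.unlink(path)
    return out

def em2(rows):
    # EM2 (Presto style): stages are pipelined; each row flows through
    # both operators without touching disk, but intermediates live in memory.
    filtered = (r for r in rows if r % 2 == 0)  # stage 1: filter (lazy)
    return [r ** 2 for r in filtered]           # stage 2: square

print(em1(data), em2(data))  # same answer both ways
```

Both models compute the same result; the difference is where intermediate rows live (disk vs memory) and when downstream work can start, which is exactly the trade-off the two slides list.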

  • ENTER PRESTO

    Hive used EM1 and had the associated disadvantages

    Internal project at Facebook to implement EM2 (Presto)

    Use case was interactive queries over the same data

    Open sourced late 2013

    Promised much faster query performance

    In-memory processing, aggressive pipelining

    Supports all the data formats that Hive does

    Can't plug in user code at this point; vanilla SQL only

  • CONTRASTING HIVE AND PRESTO (Hive 0.11 vs Presto 0.60)

    Hive                                 Presto
    Uses Hadoop MR for execution (EM1)   Pipelined execution model (EM2)
    Spills intermediate data to FS       Keeps intermediate data in memory
    Can tolerate failures                Does not tolerate failures
    Automatic join ordering              User-specified join ordering
    Can join two large tables            One table must fit in memory
    Supports grouping sets               Does not support grouping sets
    Can plug in custom code              Cannot plug in custom code
    More data types                      Limited data types

  • PERFORMANCE COMPARISON

    Presto is 2.5-7x faster

    But some queries simply run out of memory

    This contrast follows directly from the execution models

  • IN A NUTSHELL

  • SAMPLE SETUP

    (Diagram: applications sync data into cloud storage, e.g. via Sqoop; Hive handles heavy-duty queries, Presto handles interactive queries)

  • CRYSTAL BALL

    Hive is actively working on task parallelism as part of the Stinger Initiative

    Presto is also making rapid progress in bridging some of its gaps

    There are other open source projects: Impala, Shark, Drill, Tajo

    Lots of goodies for users

  • CONCLUSION

    Big Data Analytics is becoming accessible and affordable

    Public clouds give flexibility and change economics

    Hive and Presto provide intuitive and powerful ways to interact with your data

  • Sign up for a free trial at Qubole.com

    Get access to Hive, Presto, Hadoop, Pig as a Service on Amazon and Google cloud services

    Siva [email protected] / @k2_181

  • QUESTIONS

    Where should data be stored?

    What formats are appropriate?

    What kinds of processing need to happen?

    What parts are expressible in ANSI-SQL?

    How can I plug in proprietary business logic?

    How much compute power is required?

    How do I put it all together?