Data Discovery on Hadoop - Realizing the Full Potential of Your Data

Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014

2014 Hadoop Summit, San Jose, California

Introduction

Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms, Cloud Engineering Group

§  Manages Hadoop products team at Yahoo

§  Responsible for Product Management, Strategy and Customer Engagements

§  Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo

§  M.B.A. from UCLA and M.S. from Rensselaer (RPI)

701 First Avenue, Sunnyvale, CA 94089 USA @sumeetksingh

Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms, Cloud Engineering Group

§  Developer in the Hive-HCatalog team, and active contributor to Apache Hive

§  Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo

§  Loves mining the trove of Hadoop logs for usage patterns and insights

§  Bachelor's degree from Anna University

701 First Avenue, Sunnyvale, CA 94089 USA @thiruvel

Agenda

1 The Data Management Challenge

2 Apache HCatalog to Rescue

3 Data Registration and Discovery

4 Opening Up Adhoc Access to Data

5 Summary and Q&A

Hadoop Grid as the Source of Truth for Data


[Diagram, illustrative. Labels: TV, PC, Phone, Tablet; Pushed Data; Pulled Data; Web Crawl, Social, Email, 3rd Party Content; Data; Advertising; Content; User Profiles / No-SQL Serving Stores; Serving; Data Highway Feeds; Hadoop Grid; BI, Reporting, Adhoc Analytics]


34,000 servers, 478 PB, 1.25 billion files & dir

[Chart: Growth in HDFS (1), 2006 to 2014. Axes: Year; Raw HDFS Storage (in PB), 0 to 600; Number of Servers, 0 to 45,000.]

1 Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014

Processing and Analyzing Data with Hadoop…Then


HDFS

MapReduce (YARN)

Pig Hive Java MR APIs

InputFormat/ OutputFormat

Load / Store SerDe

MetaStore Client

Hive MetaStore

Hadoop Streaming

Oozie

Processing and Analyzing Data with HBase…Then


HDFS

HBase


TableInputFormat/ TableOutputFormat

HBaseStorage MetaStore Client

Hive MetaStore

HBaseStorage Handler

Oozie

Hadoop Jobs on the Platform Today


[Chart: Job Distribution (May 1 – May 26, 2014). All Jobs: 100% (21.5 M). Legend: Pig; Oozie Launcher; Java MR; Hive; GDM; Streaming, distcp, Spark. Shares shown: 45%, 31%, 10%, 9%, 4%, 1%.]

Challenges in Managing Data on Multi-tenant Platforms


Data Producers

Platform Services

Data Consumers

§  Data shared across tools such as MR, Pig, and Hive

§  Schema and semantics knowledge across the company

§  Support for schema evolution and downstream change communication

§  Fine-grained access controls (row / column) vs. all or nothing

§  Clear ownership of data

§  Data lineage and integrity

§  Audits and compliance (e.g. SOX)

§  Retention, duplication, and waste

Data Economy Challenges

Apache HCatalog & Data Discovery

Apache HCatalog in the Technology Stack at Yahoo


Services: Hive, Pig, Oozie, HDFS Proxy, GDM, HCatalog

Compute: YARN, MapReduce, Storm, Spark, Tez

Storage: HDFS, HBase

Infrastructure Services: Zookeeper, Support Shop, Monitoring, Starling, Messaging Service

HCatalog Facilitates Interoperability…Now


HDFS

MapReduce (YARN)


InputFormat/ OutputFormat

SerDe & Storage Handler

MetaStore Client

HCatalog MetaStore

HCatInputFormat / HCatOutputFormat

HCatLoader/ HCatStorer

HDFS

HBase Notifications

Oozie


Data Model

Database (namespace)

Table (schema)

Table (schema)

Partitions

Buckets

Skewed / Unskewed

Optional per table

Partitions, buckets, and skews facilitate faster, more direct access to data

Note on Buckets

§  It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align bucket counts for joins

§  Community is working on dynamic bucketing that would have the same benefit without the need for static partitioning

Sample Table Registration


Select project database

USE xyz;

Create table

CREATE EXTERNAL TABLE search (
  bcookie string COMMENT 'Standard browser cookie',
  time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
  uid string COMMENT 'User id',
  ip string COMMENT '...',
  pg_spaceid string COMMENT '...',
  ...)
PARTITIONED BY (
  locale string COMMENT 'Country of origin',
  datestamp string COMMENT 'Date in YYYYMMDD format')
STORED AS ORC
LOCATION '/projects/search/...';

Add partitions manually (if you choose to)

ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
LOCATION '/projects/search/...';

All your company’s data (metadata) can be registered with HCatalog irrespective of the tool used.

Getting Data into HCatalog – DML and DDL


LOAD files into tables

Load operations are copy/move operations from HDFS or the local filesystem that move data files into locations corresponding to HCat tables. The file format must agree with the table format.

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)];

INSERT data from a query into tables

Query results can be inserted into tables or file system directories by using the insert clause.

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
select_statement1 FROM from_statement;

HCat also supports multiple inserts in the same statement or dynamic partition inserts.

ALTER TABLE ADD PARTITIONS

You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside which the data files reside. If new partitions are added directly to HDFS, HCat will not be aware of them.

ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1';
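Because HCat only learns about partitions that are registered explicitly, a registrar script typically has to emit one ADD PARTITION statement per new directory. The following is a minimal sketch of that step, written in Python purely for illustration. The helper name `add_partition_statement` and the directory layout under `/projects/search/` are assumptions; the slides elide the actual paths.

```python
def add_partition_statement(table, partition_spec, location):
    """Build a HiveQL ALTER TABLE ... ADD PARTITION statement.

    partition_spec is an ordered list of (column, value) pairs.
    Identifiers are not validated here; this is an illustration only.
    """
    def quote(value):
        # Escape backslashes and single quotes so the value stays inside the literal.
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"

    spec = ", ".join(f"{col}={quote(val)}" for col, val in partition_spec)
    return f"ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION {quote(location)};"


# Hypothetical layout: one directory per locale and datestamp.
statements = [
    add_partition_statement(
        "search",
        [("locale", locale), ("datestamp", "20130201")],
        f"/projects/search/{locale}/20130201",
    )
    for locale in ("US", "UK")
]

for statement in statements:
    print(statement)
```

The generated statements would still need to be submitted through a Hive or HCat client; that part is deliberately left out.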

Getting Data into HCatalog – HCat APIs


Pig

HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used with Pig scripts to write data to HCatalog-managed tables.

A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
B = FILTER A BY $FILTER;
C = foreach B generate foo, bar;
store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');

MapReduce

The HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.

Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);

HCatalog Integration with Data Mgmt. Platform (GDM)


HCatalog MetaStore

Cluster 1 - Colo 1 HDFS

Cluster 2 – Colo 2 HDFS

Grid Data Management

Feed Acquisition

Feed Replication

HCatalog MetaStore

Feed datasets as partitioned external tables

Growl extracts schema for backfill

HCatClient.addPartitions(…), mark LOAD_DONE

HCatClient.addPartitions(…), mark LOAD_DONE

Partitions are dropped with HCatClient.dropPartitions(…) after retention expiration, with a drop_partition notification

add_partition event notification

add_partition event notification

HCatalog Notification


Namespace: E.g. “hcat.thebestcluster”

JMS Topic: E.g. “<dbname>.<tablename>”

Sample JMS Notification

{
  "timestamp" : 1360272556,
  "eventType" : "ADD_PARTITION",
  "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
  "servicePrincipal" : "hcat/thebestcluster-[email protected]",
  "db" : "xyz",
  "table" : "search",
  "partitions": [
    { "locale" : "US", "datestamp" : "20140602" },
    { "locale" : "UK", "datestamp" : "20140602" },
    { "locale" : "IN", "datestamp" : "20140602" }
  ]
}

§  HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database

§  Notifications can be extended for schema change notifications (proposed)
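A subscriber mostly needs to decide whether a message concerns a table it cares about and, if so, which partitions arrived. The sketch below, in Python for illustration, parses a trimmed copy of the sample notification (the server and principal fields are left out). The function name `new_partitions` is hypothetical, and receiving the message from the JMS topic is outside the scope of the example.

```python
import json

# Trimmed copy of the sample notification shown on the slide.
SAMPLE_MESSAGE = """
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}
"""


def new_partitions(message_text, db, table):
    """Return the partitions announced for db.table, or [] if the message is not relevant."""
    event = json.loads(message_text)
    if event.get("eventType") != "ADD_PARTITION":
        return []
    if (event.get("db"), event.get("table")) != (db, table):
        return []
    return [dict(partition) for partition in event.get("partitions", [])]


for partition in new_partitions(SAMPLE_MESSAGE, "xyz", "search"):
    print(partition["locale"], partition["datestamp"])
```

A downstream job could use the returned partition keys to decide which inputs are ready instead of polling HDFS.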

HCat Client

HCat MetaStore

ActiveMQ Server

Register Channel Publish to listener channels

Subscribers

Oozie, HCatalog, and Messaging Integration


[Diagram. Participants: Data Producer, HDFS, HCatalog, Message Bus, Oozie]

Data Producer: produce data (distcp, pig, M/R..) at /data/click/2014/06/02, then update metadata:

ALTER TABLE click ADD PARTITION(data='2014/06/02') location 'hdfs://data/click/2014/06/02'

1. Query/Poll Partition

2. Register Topic

3. Push notification <New Partition>

4. Notify New Partition

Start workflow

Data Discovery with HCatalog


§  HCatalog instances become a unifying metastore for all data at Yahoo

§  Discovery is about

o  Browsing / inspecting metadata

o  Searching for datasets

§  It helps to solve

o  Schema knowledge across the company

o  Schema evolution

o  Lineage

o  Ownerships

o  Data type – dev or prod

Data Discovery Physical View


Global View of

All Data in HCatalog

DC1-C1

DC1-C2

DCn-Cn

. . .

DC2-C1

DC2-C2

DCm-Cm

. . .

Discovery UI

Data Center 1 Data Center 2

HCat REST (Templeton)






ILLUSTRATIVE

Data Discovery Features


§  Browsing

o  Tables / Databases

o  Schema, format, properties

o  Partitions and metadata about each partition

§  Searches for tables

o  Table name (regex) or Comments

o  Column name or comments

o  Ownership, File format

o  Location

o  Properties (Dev/Prod)
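The search criteria listed on this slide can be pictured as filters over table metadata. The following Python sketch is an in-memory illustration only, not the Discovery UI's implementation: the `TableMeta` record, the `search_tables` function, and the owners, formats, and `env` property values in the sample catalog are all assumptions. Table names and comments are borrowed from the dashboard mock-up in this deck.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TableMeta:
    database: str
    name: str
    comment: str = ""
    owner: str = ""
    file_format: str = ""
    columns: dict = field(default_factory=dict)      # column name -> comment
    properties: dict = field(default_factory=dict)   # e.g. {"env": "prod"}


def search_tables(tables, name_or_comment=None, column=None,
                  owner=None, file_format=None, properties=None):
    """Return tables matching every filter that was supplied."""
    pattern = re.compile(name_or_comment, re.IGNORECASE) if name_or_comment else None
    matches = []
    for table in tables:
        if pattern and not (pattern.search(table.name) or pattern.search(table.comment)):
            continue
        if column and not any(column in name or column in comment
                              for name, comment in table.columns.items()):
            continue
        if owner and table.owner != owner:
            continue
        if file_format and table.file_format.lower() != file_format.lower():
            continue
        if properties and any(table.properties.get(k) != v for k, v in properties.items()):
            continue
        matches.append(table)
    return matches


# Hypothetical catalog contents for illustration.
CATALOG = [
    TableMeta("audience_db", "page_clicks", "Hourly clickstream table", "awesome_yahoo", "ORC",
              {"bcookie": "Standard browser cookie", "uid": "User id"}, {"env": "prod"}),
    TableMeta("audience_db", "ad_clicks", "Hourly ad clicks table", "me_too", "RCFile",
              {"uid": "User id"}, {"env": "dev"}),
    TableMeta("audience_db", "user_info", "User registration info", "me_too", "ORC",
              {"uid": "User id"}, {"env": "prod"}),
]

print([t.name for t in search_tables(CATALOG, name_or_comment=r"_clicks$")])
```

A real implementation would run such filters against the metastore rather than a Python list, but the combination of regex, column, owner, format, and property filters is the same idea.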

Discovery UI


Search Tables Search

The Best Cluster

audience_db

tumblr_db

user_db

adv_warehouse

flickr_db

page_clicks Hourly clickstream table

ad_clicks Hourly ad clicks table

user_info User registration info

session_info Session feed info

audience_info Primary audience table

GLOBAL HCATALOG DASHBOARD

Available Databases

Available Tables (audience_db)

Search the HCat tables

Browse the DBs by cluster

Search results or browse db results

1 2 Next 1 2 Next

ILLUSTRATIVE

Table Display UI


ILLUSTRATIVE

GLOBAL HCATALOG DASHBOARD

HCat Instance The Best Cluster

Database audience_db

Table page_clicks

Owner Awesome Yahoo

Schema

…more table information and properties (e.g. data format etc.)

Partitions

…list of partitions

Column Type Description

bcookie string Standard browser cookie

timestamp string DD-MON-YYYY HH:MI:SS (AM/PM)

uid string User id

. . .

Data Discovery Design Approach


§  A single web interface connects to all HCatalog instances (same and cross-colo)

§  Select an appropriate HCat instance and browse all metadata

o  Each HCatalog instance runs a webserver (Templeton / WebHCat) to read metadata

o  All reads audited

o  ACLs apply

§  Search functionality will be added to Templeton and HCatalog

o  New Thrift interface to support search

o  All searches audited

o  ACLs apply

§  Long term design

o  Read and Write HCatalog instances
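Browsing through Templeton / WebHCat comes down to issuing GET requests against its DDL resources. The Python sketch below only builds the URLs and makes no network calls. The host name `hcat.example.com` and the helper names are hypothetical; port 50111 is WebHCat's default, and the `user.name` parameter applies to clusters without Kerberos, so a secure deployment would authenticate differently.

```python
from urllib.parse import quote, urlencode

WEBHCAT_PORT = 50111  # WebHCat's default port


def ddl_url(host, path_parts, params=None, port=WEBHCAT_PORT):
    """Build a WebHCat DDL resource URL from path segments and query parameters."""
    path = "/".join(quote(part, safe="") for part in path_parts)
    url = f"http://{host}:{port}/templeton/v1/ddl/{path}"
    if params:
        url += "?" + urlencode(params)
    return url


def list_tables_url(host, db, user):
    # GET ddl/database/:db/table lists the tables in a database.
    return ddl_url(host, ["database", db, "table"], {"user.name": user})


def describe_table_url(host, db, table, user):
    # GET ddl/database/:db/table/:table describes one table.
    return ddl_url(host, ["database", db, "table", table], {"user.name": user})


def list_partitions_url(host, db, table, user):
    # GET ddl/database/:db/table/:table/partition lists the table's partitions.
    return ddl_url(host, ["database", db, "table", table, "partition"], {"user.name": user})


print(describe_table_url("hcat.example.com", "audience_db", "page_clicks", "alice"))
```

A discovery front end could call one such endpoint per HCatalog instance and merge the responses into a single view.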

Data Discovery Going Forward


§  Lineage

o  Source datasets

o  Derived datasets

§  Data Quality

o  Statistics help in heuristics instead of running a job

Table 1 / Partition 1

HBase

ORC Table Partition 1

Dimension Table

Statistics/ Agg. Table

Daily Stats Table

Copied by distcp / external registrar

Hourly

ILLUSTRATIVE

Data Discovery Going Forward (cont’d)


ILLUSTRATIVE

Schema

Column Type Description

bcookie string Standard browser cookie

timestamp string DD-MON-YYYY HH:MI:SS (AM/PM)

uid string User id

. . .

File Format: ORC

Table Properties: Compression zlib; Type External

§  User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’

§  User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’

§  User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’

. . .

HCatalog is Part of a Broader Solution Set


Product: Role in the Grid Stack

Hive

§  Data warehousing software that facilitates querying and managing large datasets in HDFS

§  Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL

HiveServer2

§  Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC

§  Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access

HCatalog

§  Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data

§  Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability

Starling

§  Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)

§  1 TB of raw logs processed / day, 24 TB of processed data


Deployment Layout

Tez and MapReduce

on YARN +

HDFS Oracle DBMS

LoadBalancer

HCatalog

Thrift HS2

ODBC/JDBC Launcher Gateway

LoadBalancer

Data Out Client

Client/ CLI

HiveQL

M/R Jobs Pig M/R

Cloud Messaging

ActiveMQ notifications

HiveServer2

Hadoop

Hive

HCatalog



Hive for Both Batch and Interactive Adhoc Analytics

Improved Latency and Throughput

Tez

§  Computation expressed as a dataflow graph with reusable primitives

§  No intermediate outputs to HDFS

§  Built on top of YARN

§  Hive generates Tez plans for lower latency

Query Engine Improvements

§  Cost-based optimizations

§  In-memory joins

§  Caching hot tables

§  Vectorized processing

Better Columnar Store

§  ORCFile with predicate pushdown

§  Built for both speed and storage efficiency

Tez Service

§  Always-on pool of AMs / container re-use

Extended Analytical Ability

Analytics Functions

§  SQL 2003 Compliant

§  OVER with PARTITION BY and ORDER BY

§  Wide variety of windowing functions:

o  RANK

o  LEAD/LAG

o  ROW_NUMBER

o  FIRST_VALUE

o  LAST_VALUE

o  Many more

§  Aligns well with BI ecosystem

Improving SQL Coverage

§  Non-correlated sub-queries using IN in WHERE

§  Expanded SQL types including DATETIME, VARCHAR, etc.
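To make the windowing functions on this slide concrete, the Python sketch below emulates what `RANK()` and `LAG(clicks)` return for `OVER (PARTITION BY locale ORDER BY clicks DESC)`. It illustrates the semantics only and says nothing about how Hive executes such queries; the sample rows, column names, and the function name `rank_and_lag` are made up for the example.

```python
from itertools import groupby
from operator import itemgetter


def rank_and_lag(rows, partition_key, order_key):
    """Emulate RANK() and LAG(order_key) OVER (PARTITION BY partition_key ORDER BY order_key DESC)."""
    out = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # Within each partition, order rows by the ordering column, highest first.
        part = sorted(group, key=itemgetter(order_key), reverse=True)
        rank = 0
        previous = None
        for position, row in enumerate(part, start=1):
            if previous is None or row[order_key] != previous[order_key]:
                rank = position  # ties share a rank, and RANK leaves gaps after them
            lag = previous[order_key] if previous else None
            out.append({**row, "rank": rank, "lag": lag})
            previous = row
    return out


# Made-up sample data.
rows = [
    {"locale": "US", "page": "a", "clicks": 30},
    {"locale": "UK", "page": "b", "clicks": 5},
    {"locale": "US", "page": "c", "clicks": 10},
    {"locale": "US", "page": "d", "clicks": 30},
]

for row in rank_and_lag(rows, "locale", "clicks"):
    print(row)
```

In the US partition the two pages with 30 clicks share rank 1 and the next row gets rank 3, which is the gap behavior that distinguishes RANK from ROW_NUMBER.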

HiveServer2 as ODBC / JDBC Endpoint

§  Gateway that Hive clients can talk to

§  Supports concurrent clients

§  User/ global session/configuration information

§  Support for secure clusters and encryption

§  DoAs support allows Hive queries to run as the requester



Data to Desktop (D2D) – BI and Reporting on ODBC

HiveServer2

Hive

Hadoop

Desktop Web

Intelligence Server

Metadata Database

Grid ODBC driver


DataOut – Data to Any Off-Grid Destination on JDBC

HiveSplit HiveSplit

HiveServer2 M

S

FS/DB

S

FS/DB

HiveSplit

S

FS/DB

Execute Query Prepare Splits

Fetch Splits

Legend: M – Master, S – Slave, FS/ DB – Filesystem/ Database

§  DataOut is an efficient method of moving data off the grid

§  Advantages:

o  API based on well-known JDBC interface

o  Works with HCatalog / Hive

o  Agnostic to the underlying storage format

o  Parts of the whole data can be pulled in parallel

SQL-based Authorization for Controlled Access


§  SQL-compliant authorization model (Users, Roles, Privileges, Objects)

§  Fine-grain authorization and access control patterns (row and column in conjunction with views)

§  Can be used in conjunction with storage-based authorization

Privileges

§  Objects consist of databases, tables, and views

§  Privileges are GRANTed on objects

o  SELECT: read access to an object

o  INSERT: write (insert) access to an object

o  UPDATE: write (update) access to an object

o  DELETE: delete access for an object

o  ALL PRIVILEGES: all privileges

Access Control

§  Roles can be associated with objects

§  Privileges are associated with roles

§  CREATE, DROP, and SET ROLE statements manipulate roles and membership

§  SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)

§  PUBLIC role includes all users

§  Prevents undesirable operations on objects by unauthorized users
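The relationship between users, roles, privileges, and objects described on this slide can be pictured with a small in-memory model. The Python sketch below is a toy illustration of that relationship, not Hive's authorization code. The class name `Authorizer`, its methods, and the user and role names are all hypothetical.

```python
class Authorizer:
    """Toy model: privileges are granted to roles on objects, and users belong to roles."""

    def __init__(self):
        self.members = {"PUBLIC": set()}  # role -> users; PUBLIC implicitly includes everyone
        self.grants = {}                  # (role, object) -> set of privileges

    def create_role(self, role):
        self.members.setdefault(role, set())

    def add_member(self, role, user):
        self.members[role].add(user)

    def grant(self, privilege, obj, role):
        self.grants.setdefault((role, obj), set()).add(privilege)

    def roles_of(self, user):
        return {role for role, users in self.members.items()
                if role == "PUBLIC" or user in users}

    def is_allowed(self, user, privilege, obj):
        for role in self.roles_of(user):
            granted = self.grants.get((role, obj), set())
            if privilege in granted or "ALL PRIVILEGES" in granted:
                return True
        return False


auth = Authorizer()
auth.create_role("analyst")
auth.add_member("analyst", "alice")
auth.grant("SELECT", "audience_db.page_clicks", "analyst")
auth.grant("SELECT", "audience_db.user_info", "PUBLIC")

print(auth.is_allowed("alice", "SELECT", "audience_db.page_clicks"))
print(auth.is_allowed("bob", "SELECT", "audience_db.page_clicks"))
```

Row- and column-level restrictions are not modeled here; per the slide, those are achieved by granting privileges on views rather than on the underlying tables.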

Starling (Log Warehouse) for Historical Analysis and Trends


Cluster 1 Cluster 2 Cluster 3 Cluster N

Oozie

HCatalog HDFS

Hive

Starling Dashboard

Discovery Portal

Query Server

Source Clusters

Warehouse Clusters


SQL on Hadoop the Fastest Growing Product on Grid

[Chart: All Grid Jobs (in millions, axis 0 to 30) and Hive jobs (% of all jobs, axis 0.0% to 10.0%) by month, Mar-13 to May-14. Legend: All Jobs; Hive (% of all jobs). Callout: 2.5 million queries.]

In Summary


✔  Data shared across tools such as MR, Pig, and Hive: Apache HCatalog

✔  Schema and semantics knowledge across the company: Data Discovery

✔  Support for schema evolution and downstream change communication: Apache HCatalog

✔  Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization

✔  Clear ownership of data: Data Discovery

✔  Data lineage and integrity: Data Discovery / Starling

✔  Audits and compliance (e.g. SOX): Data Discovery / Starling

✔  Retention, duplication, and waste: Data Discovery / Starling

Acknowledge


1 Apache Hive (and HiveServer2, HCatalog) Community

http://hive.apache.org/people.html

2 HCatalog and Hive Development Team at Yahoo

Olga Natkovich Annie Lin Fangyue Wang

Chris Drome Jin Sun Selina Zhang

Mithun Radhakrishnan Viraj Bhat

3 Oozie Development Team

Rohini Palaniswamy Ryota Egashira Purshotam Shah

Mona Chitnis Michelle Chiang

4 Grid Data Management (GDM) Team

Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar

Scott Preece Yan Braun

5 Service Engineering and Data Operations

Rob Realini David Kuder Chuck Sheldon

Rajiv Chittajallu Vineeth Vadrevu Andy Rhee

6 Product Management

Sid Shaik Amrit Lal Kimsukh Kundu

Thank You @thiruvel @sumeetksingh

We are hiring! Stop by Kiosk P9 or reach out to us at [email protected].
