Data Discovery on Hadoop @ Yahoo! Hadoop Summit 2014
Data Discovery on Hadoop - Realizing the Full Potential of Your Data
Presented by Thiruvel Thirumoolan, Sumeet Singh | June 3, 2014
2014 Hadoop Summit, San Jose, California
Introduction
Sumeet Singh
Senior Director, Product Management
Hadoop and Big Data Platforms, Cloud Engineering Group
§ Manages Hadoop products team at Yahoo!
§ Responsible for Product Management, Strategy and Customer Engagements
§ Managed Cloud Services products team and headed Strategy functions for the Cloud Platform Group at Yahoo
§ M.B.A. from UCLA and M.S. from Rensselaer (RPI)
701 First Avenue, Sunnyvale, CA 94089 USA @sumeetksingh

Thiruvel Thirumoolan
Principal Engineer
Hadoop and Big Data Platforms, Cloud Engineering Group
§ Developer in the Hive-HCatalog team, and active contributor to Apache Hive
§ Responsible for Hive, HiveServer2 and HCatalog across all Hadoop clusters and ensuring they work at scale for the usage patterns of Yahoo
§ Loves mining the trove of Hadoop logs for usage patterns and insights
§ Bachelor's degree from Anna University
701 First Avenue, Sunnyvale, CA 94089 USA @thiruvel
Agenda
1. The Data Management Challenge
2. Apache HCatalog to Rescue
3. Data Registration and Discovery
4. Opening Up Adhoc Access to Data
5. Summary and Q&A
Hadoop Grid as the Source of Truth for Data
Diagram labels (illustrative): TV, PC, Phone, Tablet; Pushed Data; Pulled Data; Web Crawl, Social, 3rd Party Content; Data; Advertising, Content; User Profiles / No-SQL Serving Stores; Serving; Data Highway Feeds; Hadoop Grid; BI, Reporting, Adhoc Analytics
Growth in HDFS [1]
Chart: raw HDFS storage (in PB) and number of servers, by year, 2006 to 2014
34,000 servers
478 PB
1.25 billion files & dir
[1] Across all Hadoop (16 clusters, 32,500 servers, 455 PB) and HBase (7 clusters, 1,500 servers, 23 PB) clusters, May 23, 2014
Processing and Analyzing Data with Hadoop…Then
HDFS
MapReduce (YARN)
Pig Hive Java MR APIs
InputFormat/ OutputFormat
Load / Store SerDe
MetaStore Client
Hive MetaStore
Hadoop Streaming
Oozie
Processing and Analyzing Data with HBase…Then
HDFS
HBase
Pig Hive Java MR APIs
TableInputFormat/ TableOutputFormat
HBaseStorage MetaStore Client
Hive MetaStore
HBaseStorage Handler
Oozie
Hadoop Jobs on the Platform Today
Job Distribution (May 1 – May 26, 2014)
All Jobs: 100% (21.5 M)
Chart legend: Pig; Oozie Launcher; Java MR; Hive; GDM; Streaming, distcp, Spark
Chart values: 1%, 4%, 9%, 10%, 31%, 45%
Challenges in Managing Data on Multi-tenant Platforms
Data Producers
Platform Services
Data Consumers
§ Data shared across tools such as MR, Pig, and Hive
§ Schema and semantics knowledge across the company
§ Support for schema evolution and downstream change communication
§ Fine-grained access controls (row / column) vs. all or nothing
§ Clear ownership of data
§ Data lineage and integrity
§ Audits and compliance (e.g. SOX)
§ Retention, duplication, and waste
Data Economy Challenges
Apache HCatalog & Data Discovery
Apache HCatalog in the Technology Stack at Yahoo
Compute
Services
Storage
Infrastructure Services
Hive Pig Oozie HDFS Proxy GDM
YARN MapReduce
HDFS HBase
Zookeeper Support Shop Monitoring Starling Messaging Service
HCatalog
Storm Spark Tez
HCatalog Facilitates Interoperability…Now
HDFS
MapReduce (YARN)
Pig Hive Java MR APIs
InputFormat/ OutputFormat
SerDe & Storage Handler
MetaStore Client
HCatalog MetaStore
HCatInputFormat / HCatOutputFormat
HCatLoader/ HCatStorer
HDFS
HBase Notifications
Oozie
Data Model
Database (namespace)
Table (schema)
Table (schema)
Partitions Partitions
Buckets Buckets
Skewed Unskewed
Optional per table
Partitions, buckets, and skews facilitate faster, more direct access to data
Note on Buckets
§ It is hard to guess the right number of buckets, which can also change over time, and hard to coordinate and align buckets for joins
§ Community is working on dynamic bucketing that would have the same benefit without the need for static partitioning
Sample Table Registration
Select project database
    USE xyz;
Create table
    CREATE EXTERNAL TABLE search (
      bcookie string COMMENT 'Standard browser cookie',
      time_stamp int COMMENT 'DD-MON-YYYY HH:MI:SS (AM/PM)',
      uid string COMMENT 'User id',
      ip string COMMENT '...',
      pg_spaceid string COMMENT '...',
      ...)
    PARTITIONED BY (
      locale string COMMENT 'Country of origin',
      datestamp string COMMENT 'Date in YYYYMMDD format')
    STORED AS ORC
    LOCATION '/projects/search/...';
Add partitions manually (if you choose to)
    ALTER TABLE search ADD PARTITION (locale='US', datestamp='20130201')
    LOCATION '/projects/search/...';
All your company’s data (metadata) can be registered with HCatalog irrespective of the tool used.
Getting Data into HCatalog – DML and DDL
LOAD files into tables
Load operations are copy/move operations from HDFS or the local filesystem that move data files into locations corresponding to HCat tables. The file format must agree with the table format.
    LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
      [PARTITION (partcol1=val1, partcol2=val2 ...)];
INSERT data from a query into tables
Query results can be inserted into tables or file system directories by using the insert clause.
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]]
      select_statement1 FROM from_statement;
    INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)]
      select_statement1 FROM from_statement;
HCat also supports multiple inserts in the same statement and dynamic partition inserts.
ALTER TABLE ADD PARTITION
You can use ALTER TABLE ADD PARTITION to add partitions to a table. The location must be a directory inside of which data files reside. If new partitions are added directly to HDFS, HCat will not be aware of them.
    ALTER TABLE table_name ADD PARTITION (partCol = 'value1') LOCATION 'loc1';
Getting Data into HCatalog – HCat APIs
Pig
HCatLoader is used with Pig scripts to read data from HCatalog-managed tables, and HCatStorer is used with Pig scripts to write data to HCatalog-managed tables.
    A = load '$DB.$TABLE' using org.apache.hcatalog.pig.HCatLoader();
    B = FILTER A BY $FILTER;
    C = foreach B generate foo, bar;
    store C into '$OUTPUT_DB.$OUTPUT_TABLE' USING org.apache.hcatalog.pig.HCatStorer('$OUTPUT_PARTITION');
MapReduce
HCatInputFormat is used with MapReduce jobs to read data from HCatalog-managed tables. HCatOutputFormat is used with MapReduce jobs to write data to HCatalog-managed tables.
    Map<String, String> partitionValues = new HashMap<String, String>();
    partitionValues.put("a", "1");
    partitionValues.put("b", "1");
    HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
    HCatOutputFormat.setOutput(job, info);
HCatalog Integration with Data Mgmt. Platform (GDM)
HCatalog MetaStore
Cluster 1 - Colo 1 HDFS
Cluster 2 – Colo 2 HDFS
Grid Data Management
Feed Acquisition
Feed Replication
HCatalog MetaStore
Feed datasets as partitioned external tables Growl extracts schema for backfill
HCatClient.addPartitions(…) Mark LOAD_DONE
HCatClient.addPartitions(…) Mark LOAD_DONE
Partitions are dropped with (HCatClient.dropPartitions(…)) after retention expiration with a drop_partition notification
add_partition event notification
add_partition event notification
HCatalog Notification
Namespace: E.g. “hcat.thebestcluster”
JMS Topic: E.g. “<dbname>.<tablename>”
Sample JMS Notification
    {
      "timestamp" : 1360272556,
      "eventType" : "ADD_PARTITION",
      "server" : "thebestcluster-hcat.dc1.grid.yahoo.com",
      "servicePrincipal" : "hcat/thebestcluster-[email protected]",
      "db" : "xyz",
      "table" : "search",
      "partitions": [
        { "locale" : "US", "datestamp" : "20140602" },
        { "locale" : "UK", "datestamp" : "20140602" },
        { "locale" : "IN", "datestamp" : "20140602" }
      ]
    }
§ HCatalog uses JMS (ActiveMQ) notifications that can be sent for add_database, add_table, add_partition, drop_partition, drop_table, and drop_database
§ Notifications can be extended for schema change notifications (proposed)
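A subscriber might unpack such a notification along the following lines. This is only a sketch in Python, assuming the message body arrives as JSON text shaped like the sample above; the function name `new_partitions` and the `key=value/key=value` rendering of a partition are illustrative choices, not part of HCatalog.

```python
import json

# Payload modeled on the sample ADD_PARTITION notification; the "server" and
# "servicePrincipal" fields are left out because the sketch does not use them.
SAMPLE_MESSAGE = """
{
  "timestamp": 1360272556,
  "eventType": "ADD_PARTITION",
  "db": "xyz",
  "table": "search",
  "partitions": [
    {"locale": "US", "datestamp": "20140602"},
    {"locale": "UK", "datestamp": "20140602"},
    {"locale": "IN", "datestamp": "20140602"}
  ]
}
"""


def new_partitions(message_text):
    """Return (topic, partition specs) for one notification.

    The topic follows the "<dbname>.<tablename>" convention. Event types
    other than ADD_PARTITION yield an empty list of partition specs.
    """
    event = json.loads(message_text)
    topic = "{}.{}".format(event["db"], event["table"])
    if event.get("eventType") != "ADD_PARTITION":
        return topic, []
    specs = []
    for partition in event.get("partitions", []):
        # Keep the key order of the message (hypothetical display format).
        specs.append("/".join("{}={}".format(k, v) for k, v in partition.items()))
    return topic, specs


topic, specs = new_partitions(SAMPLE_MESSAGE)
print(topic)
for spec in specs:
    print(spec)
```

A consumer could use the returned specs to decide which downstream jobs are ready to run.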
HCat Client
HCat MetaStore
ActiveMQ Server
Register Channel Publish to listener channels
Subscribers
Oozie, HCatalog, and Messaging Integration
Components: Oozie, Message Bus, HCatalog, Data Producer, HDFS
1. Query/Poll Partition
2. Register Topic
3. Push notification <New Partition>
4. Notify New Partition
Start workflow
Produce data (distcp, pig, M/R..): /data/click/2014/06/02
Update metadata: ALTER TABLE click ADD PARTITION (data='2014/06/02') LOCATION 'hdfs://data/click/2014/06/02'
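One reading of the four numbered steps is that Oozie first checks HCatalog for the partition, registers for the table's topic if the partition is missing, and starts the workflow once the message bus relays HCatalog's notification. The Python sketch below simulates that reading in a single process. The classes `MessageBus`, `Metastore`, and `Coordinator` and the topic name `mydb.click` are hypothetical stand-ins, not Oozie, ActiveMQ, or HCatalog APIs.

```python
from collections import defaultdict


class MessageBus:
    """Hypothetical in-process stand-in for the message bus."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def register(self, topic, callback):
        self._listeners[topic].append(callback)

    def publish(self, topic, partition):
        for callback in list(self._listeners[topic]):
            callback(partition)


class Metastore:
    """Hypothetical stand-in for HCatalog: records partitions, publishes adds."""

    def __init__(self, bus):
        self._bus = bus
        self._partitions = set()

    def has_partition(self, table, partition):
        return (table, partition) in self._partitions

    def add_partition(self, table, partition):
        self._partitions.add((table, partition))
        self._bus.publish(table, partition)  # step 3: push notification


class Coordinator:
    """Hypothetical stand-in for Oozie waiting on a single partition."""

    def __init__(self, bus, metastore, table, partition):
        self.started = False
        self._wanted = partition
        if metastore.has_partition(table, partition):  # step 1: query/poll
            self._start()
        else:
            bus.register(table, self._on_partition)    # step 2: register topic

    def _on_partition(self, partition):                # step 4: notified
        if partition == self._wanted and not self.started:
            self._start()

    def _start(self):
        self.started = True                            # "Start workflow"


bus = MessageBus()
metastore = Metastore(bus)
waiting = Coordinator(bus, metastore, "mydb.click", "data=2014/06/02")
before = waiting.started
metastore.add_partition("mydb.click", "data=2014/06/02")  # producer updates metadata
after = waiting.started
print(before, after)
```

The initial poll matters because a coordinator created after the partition already exists would otherwise wait for a notification that was sent before it registered.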
Data Discovery with HCatalog
§ HCatalog instances become a unifying metastore for all data at Yahoo
§ Discovery is about
o Browsing / inspecting metadata
o Searching for datasets
§ It helps to solve
o Schema knowledge across the company
o Schema evolution
o Lineage
o Ownership
o Data type – dev or prod
Data Discovery Physical View
Diagram (illustrative): a Discovery UI provides a global view of all data in HCatalog, with an HCat REST (Templeton) endpoint shown for each cluster in Data Center 1 (DC1-C1, DC1-C2, . . . DCn-Cn) and Data Center 2 (DC2-C1, DC2-C2, . . . DCm-Cm).
Data Discovery Features
§ Browsing
o Tables / Databases
o Schema, format, properties
o Partitions and metadata about each partition
§ Searches for tables
o Table name (regex) or Comments
o Column name or comments
o Ownership, File format
o Location
o Properties (Dev/Prod)
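The search criteria listed above can be pictured as filters over table metadata. The Python sketch below runs them against a small hand-written list of records. The records, their field names, and the function `search_tables` are assumptions made for illustration; an actual implementation would read metadata from HCatalog.

```python
import re

# Hypothetical metadata records; field names are illustrative only.
TABLES = [
    {"db": "audience_db", "name": "page_clicks", "comment": "Hourly clickstream table",
     "columns": ["bcookie", "uid"], "format": "ORC", "env": "prod"},
    {"db": "audience_db", "name": "ad_clicks", "comment": "Hourly ad clicks table",
     "columns": ["bcookie", "ad_id"], "format": "RCFile", "env": "prod"},
    {"db": "user_db", "name": "user_info", "comment": "User registration info",
     "columns": ["uid", "email"], "format": "ORC", "env": "dev"},
]


def search_tables(tables, name_regex=None, column=None, file_format=None, env=None):
    """Return "db.table" names matching every criterion that is given.

    name_regex is matched against both the table name and its comment.
    """
    pattern = re.compile(name_regex) if name_regex else None
    hits = []
    for table in tables:
        if pattern and not (pattern.search(table["name"]) or pattern.search(table["comment"])):
            continue
        if column and column not in table["columns"]:
            continue
        if file_format and table["format"] != file_format:
            continue
        if env and table["env"] != env:
            continue
        hits.append("{}.{}".format(table["db"], table["name"]))
    return hits


by_name = search_tables(TABLES, name_regex=r"clicks$")
by_column = search_tables(TABLES, column="uid", file_format="ORC")
prod_only = search_tables(TABLES, column="uid", file_format="ORC", env="prod")
print(by_name, by_column, prod_only)
```

Criteria combine with AND, so adding a filter can only narrow the result.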
Discovery UI
Mockup (illustrative): GLOBAL HCATALOG DASHBOARD
§ Search Tables [Search]: search the HCat tables
§ The Best Cluster: browse the DBs by cluster
§ Available Databases (1 2 Next): audience_db, tumblr_db, user_db, adv_warehouse, flickr_db
§ Available Tables (audience_db) (1 2 Next): search results or browse db results
    page_clicks     Hourly clickstream table
    ad_clicks       Hourly ad clicks table
    user_info       User registration info
    session_info    Session feed info
    audience_info   Primary audience table
Table Display UI
Mockup (illustrative): GLOBAL HCATALOG DASHBOARD
HCat Instance: The Best Cluster
Database: audience_db
Table: page_clicks
Owner: Awesome Yahoo
Schema
    Column      Type     Description
    bcookie     string   Standard browser cookie
    timestamp   string   DD-MON-YYYY HH:MI:SS (AM/PM)
    uid         string   User id
    . . .
…more table information and properties (e.g. data format etc.)
Partitions
…list of partitions
Data Discovery Design Approach
§ A single web interface connects to all HCatalog instances (same and cross-colo)
§ Select an appropriate HCat instance and browse all metadata
o Each HCatalog instance runs a webserver (Templeton / WebHCat) to read metadata
o All reads audited
o ACLs apply
§ Search functionality will be added to Templeton and HCatalog
o New Thrift interface to support search
o All searches audited
o ACLs apply
§ Long term design
o Read and Write HCatalog instances
Data Discovery Going Forward
§ Lineage
o Source datasets
o Derived datasets
§ Data Quality
o Statistics help in heuristics instead of running a job
Table 1 / Partition 1
HBase
ORC Table Partition 1
Dimension Table
Statistics/ Agg. Table
Daily Stats Table
Copied by distcp / external registrar
Hourly
ILLUSTRATIVE
Data Discovery Going Forward (cont’d)
Mockup (illustrative): table details with change history
Schema
    Column      Type     Description
    bcookie     string   Standard browser cookie
    timestamp   string   DD-MON-YYYY HH:MI:SS (AM/PM)
    uid         string   User id
File Format: ORC
Table Properties: Compression Type zlib, External
§ User ‘awesome_yahoo’ added ‘foo string’ to the table on May 29, 2014 at ‘1:10 AM’
§ User ‘me_too’ added table properties ‘orc.compress=ZLIB’ on May 30, 2014 at ‘9:00 AM’
§ User ‘me_too’ changed the file format from ‘RCFile’ to ‘ORC’ on Jun 1, 2014 at ‘10:30 AM’
. . .
HCatalog is Part of a Broader Solution Set
Product: Role in the Grid Stack
Hive
§ Data warehousing software that facilitates querying and managing large datasets in HDFS
§ Provides a mechanism to project structure onto HDFS data and query the data using a SQL-like language called HiveQL
HiveServer2
§ Server process (Thrift-based RPC interface) to support concurrent clients connecting over ODBC/JDBC
§ Provides authentication and enforces authorization for ODBC/JDBC clients for metadata access
HCatalog
§ Table and storage management layer that enables users with different tools (Pig, M/R, and Hive) to more easily share data
§ Presents a relational view of data in HDFS, abstracts where or in what format data is stored, and enables notifications of data availability
Starling
§ Hadoop log warehouse for analytics on grid usage (job history, tasks, job counters etc.)
§ 1 TB of raw logs processed / day, 24 TB of processed data
Deployment Layout
Tez and MapReduce on YARN + HDFS
Oracle DBMS
LoadBalancer
HCatalog
Thrift HS2
ODBC/JDBC Launcher Gateway
LoadBalancer
Data Out Client
Client/ CLI
HiveQL
M/R Jobs Pig M/R
Cloud Messaging
ActiveMQ notifications
HiveServer2
Hadoop
Hive
HCatalog
Hive for Both Batch and Interactive Adhoc Analytics
Improved Latency and Throughput
Tez
§ Computation expressed as a dataflow graph with reusable primitives
§ No intermediate outputs to HDFS
§ Built on top of YARN
§ Hive generates Tez plans for lower latency
Query Engine Improvements
§ Cost-based optimizations
§ In-memory joins
§ Caching hot tables
§ Vectorized processing
Better Columnar Store
§ ORCFile with predicate pushdown
§ Built for both speed and storage efficiency
Tez Service
§ Always-on pool of AMs / container re-use
Extended Analytical Ability
Analytics Functions
§ SQL 2003 Compliant
§ OVER with PARTITION BY and ORDER BY
§ Wide variety of windowing functions:
o RANK
o LEAD/LAG
o ROW_NUMBER
o FIRST_VALUE
o LAST_VALUE
o Many more
§ Aligns well with BI ecosystem
Improving SQL Coverage
§ Non-correlated sub-queries using IN in WHERE
§ Expanded SQL types including DATETIME, VARCHAR, etc.
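To make the windowing functions above concrete, the sketch below runs RANK, ROW_NUMBER, and LAG with OVER (PARTITION BY ... ORDER BY ...) on a tiny table. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine (it needs SQLite 3.25 or later), not Hive; the table `clicks` and its rows are made up for the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine; this is not Hive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (locale TEXT, uid TEXT, n_clicks INTEGER)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?, ?)",
    [("US", "a", 30), ("US", "b", 10), ("US", "c", 30), ("UK", "d", 5), ("UK", "e", 7)],
)

# Rank users by clicks within each locale, number them, and look back one row.
rows = conn.execute(
    """
    SELECT locale, uid, n_clicks,
           RANK()        OVER (PARTITION BY locale ORDER BY n_clicks DESC)      AS rnk,
           ROW_NUMBER()  OVER (PARTITION BY locale ORDER BY n_clicks DESC, uid) AS rn,
           LAG(n_clicks) OVER (PARTITION BY locale ORDER BY n_clicks DESC, uid) AS prev_clicks
    FROM clicks
    ORDER BY locale, rn
    """
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Users "a" and "c" tie on clicks, so both get rank 1 and the next rank is 3, while ROW_NUMBER still assigns distinct numbers because it also orders by uid.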
HiveServer2 as ODBC / JDBC Endpoint
§ Gateway that Hive clients can talk to
§ Supports concurrent clients
§ User/ global session/configuration information
§ Support for secure clusters and encryption
§ DoAs support allows Hive queries to run as the requester
Data to Desktop (D2D) – BI and Reporting on ODBC
HiveServer2
Hive
Hadoop
Desktop Web
Intelligence Server
Metadata Database
Grid ODBC driver
DataOut – Data to Any Off-Grid Destination on JDBC
HiveSplit HiveSplit
HiveServer2 M
S
FS/DB
S
FS/DB
HiveSplit
S
FS/DB
Execute Query Prepare Splits
Fetch Splits
Legend: M – Master, S – Slave, FS/ DB – Filesystem/ Database
§ DataOut is an efficient method of moving data off the grid
§ Advantages:
o API based on well-known JDBC interface
o Works with HCatalog / Hive
o Agnostic to the underlying storage format
o Parts of the whole data can be pulled in parallel
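The last advantage, pulling parts of the result in parallel, can be pictured with the Python sketch below: a result is divided into splits, and several workers fetch their splits concurrently. Everything here is a hypothetical in-memory simulation (`ROWS`, `prepare_splits`, `fetch_split`); it does not use the DataOut API, whose details are not given in the slides.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the rows produced by an executed query.
ROWS = ["row-%d" % i for i in range(10)]


def prepare_splits(row_count, split_size):
    """Divide row_count rows into contiguous (start, end) ranges."""
    return [
        (start, min(start + split_size, row_count))
        for start in range(0, row_count, split_size)
    ]


def fetch_split(split):
    """Fetch one split; here this is just a slice of the in-memory rows."""
    start, end = split
    return ROWS[start:end]


splits = prepare_splits(len(ROWS), split_size=4)

# Each worker pulls one split; map() returns results in split order.
with ThreadPoolExecutor(max_workers=3) as pool:
    parts = list(pool.map(fetch_split, splits))

fetched = [row for part in parts for row in part]
print(splits)
print(len(fetched))
```

Because the splits are disjoint and cover the whole result, concatenating them in order reproduces the full output.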
SQL-based Authorization for Controlled Access
§ SQL-compliant authorization model (Users, Roles, Privileges, Objects)
§ Fine-grain authorization and access control patterns (row and column in conjunction with views)
§ Can be used in conjunction with storage-based authorization
Privileges
§ Objects consist of databases, tables, and views
§ Privileges are GRANTed on objects
o SELECT: read access to an object
o INSERT: write (insert) access to an object
o UPDATE: write (update) access to an object
o DELETE: delete access for an object
o ALL PRIVILEGES: all privileges
Access Control
§ Roles can be associated with objects
§ Privileges are associated with roles
§ CREATE, DROP, and SET ROLE statements manipulate roles and membership
§ SUPERUSER role for databases can grant access control to users or roles (not limited to HDFS permissions)
§ PUBLIC role includes all users
§ Prevents undesirable operations on objects by unauthorized users
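The relationship between users, roles, privileges, and objects described above can be modeled in a few lines. The Python sketch below is a toy model only: the class `Authorizer`, its methods, and the user and object names are hypothetical, and it does not reflect how Hive stores or checks grants.

```python
class Authorizer:
    """Toy model: privileges are granted to roles, users are members of roles."""

    def __init__(self):
        self._role_privileges = {}  # role -> set of (privilege, object)
        self._user_roles = {}       # user -> set of roles

    def grant(self, privilege, obj, role):
        self._role_privileges.setdefault(role, set()).add((privilege, obj))

    def add_member(self, user, role):
        self._user_roles.setdefault(user, set()).add(role)

    def is_allowed(self, user, privilege, obj):
        # Every user implicitly belongs to the PUBLIC role.
        roles = self._user_roles.get(user, set()) | {"PUBLIC"}
        for role in roles:
            granted = self._role_privileges.get(role, set())
            if (privilege, obj) in granted or ("ALL PRIVILEGES", obj) in granted:
                return True
        return False


auth = Authorizer()
auth.grant("SELECT", "xyz.search", "analyst")
auth.grant("ALL PRIVILEGES", "xyz.search", "owner")
auth.grant("SELECT", "xyz.search_view", "PUBLIC")
auth.add_member("alice", "analyst")
auth.add_member("carol", "owner")

print(auth.is_allowed("alice", "SELECT", "xyz.search"))     # analyst can read
print(auth.is_allowed("alice", "INSERT", "xyz.search"))     # but not write
print(auth.is_allowed("bob", "SELECT", "xyz.search_view"))  # PUBLIC covers everyone
```

Granting SELECT on a view to PUBLIC while keeping the base table restricted mirrors the pattern of using views for row and column level control.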
Starling (Log Warehouse) for Historical Analysis and Trends
Cluster 1 Cluster 2 Cluster 3 Cluster N
Oozie
HCatalog HDFS
Hive
Starling Dashboard
Discovery Portal
Query Server
Source Clusters
Warehouse Clusters
SQL on Hadoop the Fastest Growing Product on Grid
Chart: All Jobs (all grid jobs, in millions; axis 0 to 30) and Hive (% of all jobs; axis 0.0% to 10.0%), by month from Mar-13 to May-14
2.5 million queries
In Summary
✔ Data shared across tools such as MR, Pig, and Hive: Apache HCatalog
✔ Schema and semantics knowledge across the company: Data Discovery
✔ Support for schema evolution and downstream change communication: Apache HCatalog
✔ Fine-grained access controls (row / column) vs. all or nothing: SQL-based Authorization
✔ Clear ownership of data: Data Discovery
✔ Data lineage and integrity: Data Discovery / Starling
✔ Audits and compliance (e.g. SOX): Data Discovery / Starling
✔ Retention, duplication, and waste: Data Discovery / Starling
Acknowledge
1 Apache Hive (and HiveServer2, HCatalog) Community
http://hive.apache.org/people.html
2 HCatalog and Hive Development Team at Yahoo
Olga Natkovich Annie Lin Fangyue Wang
Chris Drome Jin Sun Selina Zhang
Mithun Radhakrishnan Viraj Bhat
3 Oozie Development Team
Rohini Palaniswamy Ryota Egashira Purshotam Shah
Mona Chitnis Michelle Chiang
4 Grid Data Management (GDM) Team
Mark Holderbaugh Aaron Gresch Lawrence Prem Kumar
Scott Preece Yan Braun
5 Service Engineering and Data Operations
Rob Realini David Kuder Chuck Sheldon
Rajiv Chittajallu Vineeth Vadrevu Andy Rhee
6 Product Management
Sid Shaik Amrit Lal Kimsukh Kundu
Thank You @thiruvel @sumeetksingh
We are hiring! Stop by Kiosk P9 or reach out to us at [email protected].