Hive Percona 2009

Transcript
Page 1: Hive Percona 2009

Data Warehousing & Analytics on Hadoop

Ashish Thusoo, Prasad Chakka

Facebook Data Team

Page 2

Why Another Data Warehousing System?

Data, data and more data:
– 200GB per day in March 2008
– 2+ TB (compressed) raw data per day today

Page 3
Page 4
Page 5

Let's try Hadoop…

Pros:
– Superior in availability/scalability/manageability
– Efficiency not that great, but throw more hardware at it
– Partial availability/resilience/scale more important than ACID

Cons: Programmability and Metadata
– Map-reduce hard to program (users know SQL/bash/python)
– Need to publish data in well-known schemas

Solution: HIVE

Page 6

What is HIVE?

A system for managing and querying structured data built on top of Hadoop:
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files

Key Building Principles:
– SQL as a familiar data warehousing tool
– Extensibility: Types, Functions, Formats, Scripts
– Scalability and Performance

Page 7

Simplifying Hadoop

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'

$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'

$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1

$ bin/hadoop dfs -cat /tmp/largekey/part*

Page 8

Looks like this ..

[Cluster diagram: Nodes, each with local Disks; 1 Gigabit per node, 4-8 Gigabit uplinks; Node = DataNode + Map-Reduce]

Page 9

Data Warehousing at Facebook Today

[Architecture diagram: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]

Page 10

Hive/Hadoop Usage @ Facebook

Types of Applications:
– Reporting: e.g. daily/weekly aggregations of impression/click counts; complex measures of user engagement
– Ad hoc Analysis: e.g. how many group admins broken down by state/country
– Data Mining (assembling training data): e.g. user engagement as a function of user attributes
– Spam Detection: anomalous patterns for Site Integrity; application API usage patterns
– Ad Optimization
– Too many to count ..

Page 11

Hadoop Usage @ Facebook

Data statistics:
– Total Data: ~1.7PB
– Cluster Capacity: ~2.4PB
– Net Data added/day: ~15TB
  – 6TB of uncompressed source logs
  – 4TB of uncompressed dimension data reloaded daily
– Compression Factor: ~5x (gzip, more with bzip)

Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day

Page 12

In Pictures

Page 13
Page 14

HIVE Internals!!

Page 15

HIVE: Components

[Component diagram: Hive CLI and Web UI for DDL, Queries, and Browsing; Parser → Planner → Optimizer → Execution; MetaStore with Thrift API and DB; SerDe layer (Thrift, Jute, JSON, ..); execution via Map Reduce on HDFS]

Page 16

Data Model

Logical Partitioning and Hash Partitioning, shown for a clicks table:
– Table: /hive/clicks
– Logical partition: /hive/clicks/ds=2008-03-25
– Hash bucket: /hive/clicks/ds=2008-03-25/0

Data lives in HDFS; the Metastore DB records Tables, Data Location, Partitioning Cols, and Bucketing Info.
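The layout above (partition column values become directories, hash buckets become files inside a partition) can be sketched in Python; `bucket_file`, the plain-modulo hash, and the 32-bucket default are illustrative assumptions, not Hive internals:

```python
import posixpath

WAREHOUSE = "/hive"  # warehouse root, as on the slide

def partition_path(table, partition_cols):
    # Logical partitioning: each (column, value) pair becomes one
    # directory level under the table directory.
    parts = [f"{col}={val}" for col, val in partition_cols]
    return posixpath.join(WAREHOUSE, table, *parts)

def bucket_file(table, partition_cols, bucket_key, num_buckets=32):
    # Hash partitioning (bucketing): the bucketing column's hash picks
    # a file inside the partition directory. Plain modulo stands in for
    # Hive's hash function here.
    bucket = bucket_key % num_buckets
    return posixpath.join(partition_path(table, partition_cols), str(bucket))

print(partition_path("clicks", [("ds", "2008-03-25")]))
# /hive/clicks/ds=2008-03-25
print(bucket_file("clicks", [("ds", "2008-03-25")], 111))
# /hive/clicks/ds=2008-03-25/15
```

Because partition columns are encoded in the path, a query filtering on `ds` can prune whole directories without reading data.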

Page 17

Hive Query Language

SQL:
– Subqueries in from clause
– Equi-joins
– Multi-table Insert
– Multi-group-by

Sampling

Complex object types

Extensibility:
– Pluggable Map-reduce scripts
– Pluggable User Defined Functions
– Pluggable User Defined Types
– Pluggable Data Formats

Page 18

Map Reduce Example

Local Map:
– Machine 1: <k1, v1> <k2, v2> <k3, v3> → <nk1, nv1> <nk2, nv2> <nk3, nv3>
– Machine 2: <k4, v4> <k5, v5> <k6, v6> → <nk2, nv4> <nk2, nv5> <nk1, nv6>

Global Shuffle:
– Machine 1: <nk1, nv1> <nk3, nv3> <nk1, nv6>
– Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>

Local Sort:
– Machine 1: <nk1, nv1> <nk1, nv6> <nk3, nv3>
– Machine 2: <nk2, nv4> <nk2, nv5> <nk2, nv2>

Local Reduce (count):
– Machine 1: <nk1, 2> <nk3, 1>
– Machine 2: <nk2, 3>
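The map/shuffle/sort/reduce phases above can be simulated in Python with a word-count-style job; the function names and sample data are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def local_map(records, map_fn):
    # Each machine applies the map function to its local records.
    return [map_fn(k, v) for k, v in records]

def global_shuffle(*mapped_partitions):
    # Route every intermediate pair to the machine owning its new key.
    by_key = defaultdict(list)
    for partition in mapped_partitions:
        for nk, nv in partition:
            by_key[nk].append(nv)
    return by_key

def local_reduce(by_key, reduce_fn):
    # Iterating keys in sorted order stands in for the local sort phase.
    return {nk: reduce_fn(vals) for nk, vals in sorted(by_key.items())}

# Word-count-style job: map emits (word, 1), reduce sums the 1s.
m1 = local_map([("k1", "a"), ("k2", "b"), ("k3", "c")], lambda k, v: (v, 1))
m2 = local_map([("k4", "b"), ("k5", "b"), ("k6", "a")], lambda k, v: (v, 1))
counts = local_reduce(global_shuffle(m1, m2), sum)
print(counts)  # {'a': 2, 'b': 3, 'c': 1}
```

Every Hive query on the following slides compiles down to one or more jobs with exactly this shape.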

Page 19

Hive QL – Join

INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Page 20

Hive QL – Join in Map Reduce

Input tables:

page_view:
pageid | userid | time
1 | 111 | 9:08:01
2 | 111 | 9:08:13
1 | 222 | 9:08:14

user:
userid | age | gender
111 | 25 | female
222 | 32 | male

Map (emit join key → <table tag, value>):
– from page_view: 111 → <1, 1>, 111 → <1, 2>, 222 → <1, 1>
– from user: 111 → <2, 25>, 222 → <2, 32>

Shuffle/Sort (group by key):
– key 111: <1, 1>, <1, 2>, <2, 25>
– key 222: <1, 1>, <2, 32>

Reduce (cross the page_view values with the user values per key):
pageid | age
1 | 25
2 | 25
1 | 32
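The tag-and-shuffle scheme above (tag 1 = page_view, tag 2 = user) can be simulated in Python; this is a toy reduce-side join, not Hive's operator code:

```python
from collections import defaultdict
from itertools import product

page_view = [(1, 111), (2, 111), (1, 222)]   # (pageid, userid)
user      = [(111, 25), (222, 32)]           # (userid, age)

# Map phase: emit (join key, (table tag, payload)) for both inputs.
mapped = [(uid, (1, pageid)) for pageid, uid in page_view] \
       + [(uid, (2, age)) for uid, age in user]

# Shuffle: group all tagged values by userid.
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

# Reduce: within each key group, cross table-1 rows with table-2 rows.
pv_users = []
for key, tagged in sorted(groups.items()):
    pages = [p for tag, p in tagged if tag == 1]
    ages  = [a for tag, a in tagged if tag == 2]
    pv_users.extend(product(pages, ages))

print(pv_users)  # [(1, 25), (2, 25), (1, 32)]
```

The tag lets a single reducer distinguish which input table each value came from, which is why only one map-reduce job is needed for an equi-join.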

Page 21

Hive QL – Group By

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

Page 22

Hive QL – Group By in Map Reduce

pv_users input, split across two mappers:

pageid | age
1 | 25
2 | 25

pageid | age
1 | 32
2 | 25

Map (emit <pageid, age> → 1):
– Mapper 1: <1, 25> → 1, <2, 25> → 1
– Mapper 2: <1, 32> → 1, <2, 25> → 1

Shuffle/Sort (group by <pageid, age>):
– Reducer 1: <1, 25> → 1; <1, 32> → 1
– Reducer 2: <2, 25> → 1; <2, 25> → 1

Reduce (sum the counts):
pageid | age | count
1 | 25 | 1
1 | 32 | 1
2 | 25 | 2
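The two-mapper flow above can be simulated in Python; a toy sketch whose data split mirrors the slide:

```python
from collections import Counter

# pv_users rows as (pageid, age), split across two mappers.
mapper1 = [(1, 25), (2, 25)]
mapper2 = [(1, 32), (2, 25)]

# Map: emit ((pageid, age), 1) for every input row.
mapped = [((pageid, age), 1) for pageid, age in mapper1 + mapper2]

# Shuffle + reduce: sum the 1s for each (pageid, age) key.
counts = Counter()
for key, one in mapped:
    counts[key] += one

result = sorted((pageid, age, n) for (pageid, age), n in counts.items())
print(result)  # [(1, 25, 1), (1, 32, 1), (2, 25, 2)]
```

The group-by key becomes the map output key, so the shuffle does the grouping for free.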

Page 23

Group by Optimizations

Map side partial aggregations:
– Hash-based aggregates
– Serialized key/values in hash tables

Optimizations being worked on:
– Exploit pre-sorted data for distinct counts
– Partial aggregations and combiners
– Be smarter about avoiding multiple stages
– Exploit table/column statistics for deciding strategy
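The map-side hash aggregation above can be sketched in Python; `map_task` and `reduce_merge` are illustrative names, and `Counter` stands in for the serialized hash table the slide mentions:

```python
from collections import Counter

def map_task(rows):
    # Map-side partial aggregation: a hash table keyed on the group-by
    # columns absorbs duplicate keys before anything hits the network.
    partial = Counter()
    for pageid, age in rows:
        partial[(pageid, age)] += 1
    return partial  # one record per distinct key, not per input row

def reduce_merge(*partials):
    # The reducer only merges partial counts, so shuffle volume shrinks
    # from one record per row to one per distinct key per mapper.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

p1 = map_task([(1, 25), (2, 25), (2, 25), (2, 25)])
p2 = map_task([(1, 32), (2, 25)])
assert sum(p1.values()) == 4 and len(p1) == 2  # 4 rows collapsed to 2 pairs
total = reduce_merge(p1, p2)
print(dict(total))  # {(1, 25): 1, (2, 25): 4, (1, 32): 1}
```

The win depends on key skew: the fewer distinct group-by keys per mapper, the more the hash table collapses before the shuffle.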

Page 24

Inserts into Files, Tables and Local Files

FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY pv_users.gender
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY pv_users.age;

Page 25

Extensibility - Custom Map/Reduce Scripts

FROM (
  FROM pv_users
  MAP (pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY (dt)
) map
INSERT INTO TABLE pv_users_reduced
  REDUCE (map.dt, map.uid)
  USING 'reduce_script' AS (date, count);
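The MAP/REDUCE scripts in this query are plain line-oriented programs: tab-separated columns on stdin and stdout, with CLUSTER BY guaranteeing each reducer sees its keys grouped together. A Python sketch of what hypothetical `map_script` and `reduce_script` logic could look like, run here on in-memory lines rather than stdin (sample data is illustrative):

```python
from itertools import groupby

def map_script(lines):
    # Input columns: userid \t date; output columns: dt \t uid.
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    # CLUSTER BY dt delivers rows grouped by date; count rows per date.
    keyed = (line.split("\t") for line in lines)
    for dt, group in groupby(keyed, key=lambda cols: cols[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

raw = ["u1\t2009-04-01", "u2\t2009-04-01", "u3\t2009-04-02"]
shuffled = sorted(map_script(raw))   # stands in for CLUSTER BY dt
result = list(reduce_script(shuffled))
print(result)  # ['2009-04-01\t2', '2009-04-02\t1']
```

Because the contract is just lines on stdin/stdout, the scripts can be written in any language, which is the point of this extensibility hook.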

Page 26

Open Source Community

21 contributors and growing
– 6 contributors within Facebook

Contributors from:
– Academia
– Other web companies
– Etc..

7 committers
– 1 external to Facebook, and looking to add more here

Page 27

Future Work

– Statistics and cost-based optimization
– Integration with BI tools (through JDBC/ODBC)
– Performance improvements
– More SQL constructs & UDFs
– Indexing
– Schema Evolution
– Advanced operators: Cubes/Frequent Item Sets/Window Functions

Hive Roadmap– http://wiki.apache.org/hadoop/Hive/Roadmap

Page 28

Information

Available as a sub-project in Hadoop:
– http://wiki.apache.org/hadoop/Hive (wiki)
– http://hadoop.apache.org/hive (home page)
– http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
– ##hive (IRC)
– Works with hadoop-0.17, 0.18, 0.19

Release 0.3 is coming in the next few weeks

Mailing Lists: – hive-{user,dev,commits}@hadoop.apache.org