Hive Percona 2009


Description

Talk given at the Percona 2009 conference, held alongside the MySQL conference in Santa Clara, USA.

Transcript of Hive Percona 2009

Page 1: Hive Percona 2009

Data Warehousing & Analytics on Hadoop

Ashish Thusoo, Prasad Chakka

Facebook Data Team

Page 2: Hive Percona 2009

Why Another Data Warehousing System?

Data, data and more data:
– 200GB per day in March 2008
– 2+TB (compressed) raw data per day today

Page 3: Hive Percona 2009
Page 4: Hive Percona 2009
Page 5: Hive Percona 2009

Let's try Hadoop…

Pros:
– Superior availability/scalability/manageability
– Efficiency not that great, but throw more hardware at it
– Partial availability/resilience/scale more important than ACID

Cons: programmability and metadata
– Map-reduce hard to program (users know SQL/bash/python)
– Need to publish data in well-known schemas

Solution: HIVE

Page 6: Hive Percona 2009

What is HIVE?

A system for managing and querying structured data built on top of Hadoop:
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files

Key building principles:
– SQL as a familiar data warehousing tool
– Extensibility: types, functions, formats, scripts
– Scalability and performance

Page 7: Hive Percona 2009

Simplifying Hadoop

hive> select key, count(1) from kv1 where key > 100 group by key;

vs.

$ cat > /tmp/reducer.sh

uniq -c | awk '{print $2"\t"$1}'

$ cat > /tmp/map.sh

awk -F '\001' '{if($1 > 100) print $1}'

$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1

$ bin/hadoop dfs -cat /tmp/largekey/part*

Page 8: Hive Percona 2009

Looks like this ..

[Cluster diagram: racks of commodity nodes, each with local disks; Node = DataNode + Map-Reduce; 1 Gigabit links to each node, 4-8 Gigabit between racks]

Page 9: Hive Percona 2009

Data Warehousing at Facebook Today

[Architecture diagram: Web Servers → Scribe Servers → Filers → Hive on Hadoop Cluster, alongside Oracle RAC and Federated MySQL]

Page 10: Hive Percona 2009

Hive/Hadoop Usage @ Facebook

Types of applications:
– Reporting
  Eg: daily/weekly aggregations of impression/click counts; complex measures of user engagement
– Ad hoc analysis
  Eg: how many group admins, broken down by state/country
– Data mining (assembling training data)
  Eg: user engagement as a function of user attributes
– Spam detection
  Anomalous patterns for Site Integrity; application API usage patterns
– Ad optimization
– Too many to count ..

Page 11: Hive Percona 2009

Hadoop Usage @ Facebook

Data statistics:
– Total data: ~1.7PB
– Cluster capacity: ~2.4PB
– Net data added/day: ~15TB
  6TB of uncompressed source logs
  4TB of uncompressed dimension data reloaded daily
– Compression factor: ~5x (gzip, more with bzip)

Usage statistics:
– 3200 jobs/day with 800K map-reduce tasks/day
– 55TB of compressed data scanned daily
– 15TB of compressed output data written to HDFS
– 80 MM compute minutes/day

Page 12: Hive Percona 2009

In Pictures

Page 13: Hive Percona 2009
Page 14: Hive Percona 2009

HIVE Internals!!

Page 15: Hive Percona 2009

HIVE: Components

[Architecture diagram: a Hive CLI and Web UI handle DDL, queries and browsing; a Thrift API exposes the service; queries pass through Parser, Planner and Optimizer to Execution on Map Reduce; pluggable SerDe libraries (Thrift, Jute, JSON, ..) interpret the data; the MetaStore, backed by a DB, holds the metadata; storage is in HDFS]

Page 16: Hive Percona 2009

Data Model

[Diagram: the table clicks maps to the HDFS directory /hive/clicks; logical partitioning gives /hive/clicks/ds=2008-03-25; hash partitioning (bucketing) gives /hive/clicks/ds=2008-03-25/0; the Metastore DB records tables, data location, bucketing info and partitioning columns]
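A minimal sketch of declaring such a table; the column names (userid, url) are hypothetical, not from the slide:

  CREATE TABLE clicks (userid INT, url STRING)
  PARTITIONED BY (ds STRING)               -- logical partitioning: one HDFS subdirectory per ds value
  CLUSTERED BY (userid) INTO 32 BUCKETS;   -- hash partitioning: rows hashed into 32 bucket files per partition

Loading then targets a partition directory, e.g. INSERT OVERWRITE TABLE clicks PARTITION (ds='2008-03-25') ..., which writes under /hive/clicks/ds=2008-03-25.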

Page 17: Hive Percona 2009

Hive Query Language

SQL:
– Subqueries in FROM clause
– Equi-joins
– Multi-table insert
– Multi-group-by

Sampling (example below)

Complex object types

Extensibility:
– Pluggable map-reduce scripts
– Pluggable user-defined functions
– Pluggable user-defined types
– Pluggable data formats
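As a hedged illustration of the sampling feature, using Hive's TABLESAMPLE clause against the hypothetical bucketed clicks table from the data-model sketch:

  SELECT userid, url
  FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32 ON userid)   -- read only 1 of the 32 hash buckets
  WHERE ds = '2008-03-25';                                 -- prune to a single partition

Because the table is bucketed on userid, the sample can be served by reading a single bucket file instead of scanning the whole partition.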

Page 18: Hive Percona 2009

Map Reduce Example

[Diagram:

Input: Machine 1 holds <k1, v1> <k2, v2> <k3, v3>; Machine 2 holds <k4, v4> <k5, v5> <k6, v6>

Local Map: Machine 1 emits <nk1, nv1> <nk2, nv2> <nk3, nv3>; Machine 2 emits <nk2, nv4> <nk2, nv5> <nk1, nv6>

Global Shuffle: all nk2 pairs (<nk2, nv4> <nk2, nv5> <nk2, nv2>) go to one machine; the nk1 and nk3 pairs (<nk1, nv1> <nk3, nv3> <nk1, nv6>) go to the other

Local Sort: <nk2, nv4> <nk2, nv5> <nk2, nv2> and <nk1, nv1> <nk1, nv6> <nk3, nv3>

Local Reduce (count values per key): <nk2, 3> and <nk1, 2> <nk3, 1>]

Page 19: Hive Percona 2009

Hive QL – Join

INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Page 20: Hive Percona 2009

Hive QL – Join in Map Reduce

page_view:

pageid  userid  time
1       111     9:08:01
2       111     9:08:13
1       222     9:08:14

user:

userid  age  gender
111     25   female
222     32   male

Map: emit key = userid, value = <table tag, needed column>

From page_view (tag 1):

key  value
111  <1,1>
111  <1,2>
222  <1,1>

From user (tag 2):

key  value
111  <2,25>
222  <2,32>

Shuffle/Sort: group by key

key  value
111  <1,1>
111  <1,2>
111  <2,25>

key  value
222  <1,1>
222  <2,32>

Reduce: join the values from the two tables for each key

pageid  age
1       25
2       25

pageid  age
1       32

Page 21: Hive Percona 2009

Hive QL – Group By

SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;

Page 22: Hive Percona 2009

Hive QL – Group By in Map Reduce

pv_users (two input splits):

pageid  age
1       25
2       25

pageid  age
1       32
2       25

Map: emit key = <pageid, age>, value = 1

key     value
<1,25>  1
<2,25>  1

key     value
<1,32>  1
<2,25>  1

Shuffle/Sort: group by key

key     value
<1,25>  1
<1,32>  1

key     value
<2,25>  1
<2,25>  1

Reduce: sum the counts per key

pageid  age  count
1       25   1
1       32   1

pageid  age  count
2       25   2

Page 23: Hive Percona 2009

Group by Optimizations

Map-side partial aggregations:
– Hash-based aggregates
– Serialized key/values in hash tables

Optimizations being worked on:
– Exploit pre-sorted data for distinct counts
– Partial aggregations and combiners
– Be smarter about how to avoid multiple stages
– Exploit table/column statistics for deciding strategy
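A minimal sketch of turning on the map-side partial aggregation described above; hive.map.aggr is Hive's flag for hash-based aggregation in the mapper:

  set hive.map.aggr=true;
  -- each mapper now pre-aggregates groups in an in-memory hash table,
  -- so the shuffle carries partial counts rather than one record per input row
  SELECT pageid, age, count(1)
  FROM pv_users
  GROUP BY pageid, age;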

Page 24: Hive Percona 2009

Inserts into Files, Tables and Local Files

FROM pv_users
INSERT INTO TABLE pv_gender_sum
  SELECT pv_users.gender, count_distinct(pv_users.userid)
  GROUP BY(pv_users.gender)
INSERT INTO DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age)
INSERT INTO LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
  FIELDS TERMINATED BY ',' LINES TERMINATED BY '\013'
  SELECT pv_users.age, count_distinct(pv_users.userid)
  GROUP BY(pv_users.age);

Page 25: Hive Percona 2009

Extensibility - Custom Map/Reduce Scripts

FROM (
  FROM pv_users
  MAP(pv_users.userid, pv_users.date)
  USING 'map_script' AS (dt, uid)
  CLUSTER BY(dt)
) map
INSERT INTO TABLE pv_users_reduced
  REDUCE(map.dt, map.uid)
  USING 'reduce_script' AS (date, count);

Page 26: Hive Percona 2009

Open Source Community

21 contributors and growing
– 6 contributors within Facebook

Contributors from:
– Academia
– Other web companies
– Etc..

7 committers
– 1 external to Facebook, and looking to add more here

Page 27: Hive Percona 2009

Future Work

Statistics and cost-based optimization
Integration with BI tools (through JDBC/ODBC)
Performance improvements
More SQL constructs & UDFs
Indexing
Schema evolution
Advanced operators
– Cubes / frequent item sets / window functions

Hive Roadmap: http://wiki.apache.org/hadoop/Hive/Roadmap

Page 28: Hive Percona 2009

Information

Available as a subproject in Hadoop:
– http://wiki.apache.org/hadoop/Hive (wiki)
– http://hadoop.apache.org/hive (home page)
– http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
– ##hive (IRC)
– Works with hadoop-0.17, 0.18, 0.19

Release 0.3 is coming in the next few weeks

Mailing lists:
– hive-{user,dev,commits}@hadoop.apache.org