H base vs hive srp vs analytics 2-14-2012

HBase vs. Hive Philip Wickline Chief Technology Officer Hadapt

Transcript of H base vs hive srp vs analytics 2-14-2012

Page 1: H base vs hive   srp vs analytics 2-14-2012

HBase vs. Hive

Philip Wickline, Chief Technology Officer

Hadapt

Page 2: H base vs hive   srp vs analytics 2-14-2012

Goals

Brief introduction to the differences between transactional/operational and analytical systems

Understand when to use Hive and when to use HBase and why


Page 3: H base vs hive   srp vs analytics 2-14-2012

Databases


Page 4: H base vs hive   srp vs analytics 2-14-2012

Datastores


Page 5: H base vs hive   srp vs analytics 2-14-2012

Differences of Purpose : “Transaction Processing”

Operational systems

• Optimized for small, short, random-access reads and writes

• E.g. record that an employee invested $100 in an S&P 500 index fund in his 401(k) *or* record that a user posted something on another user's “wall”

Traditional DB examples

• Oracle

• MySQL

NoSQL Examples

• HBase

• MongoDB

• Cassandra


Page 6: H base vs hive   srp vs analytics 2-14-2012

Differences of Purpose: Analytics

Analytics

• Optimized for read-only computations about large amounts of data

• E.g. compute the average amount invested in bond funds and stock funds for all employees at all employers over the last 5 years

DB Examples

• Netezza

• Vertica

NoSQL Examples

• Hive

• Pig
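The two access patterns above can be contrasted in a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in; the `investments` table, its columns, and the helper names are assumptions made up for illustration and are not taken from the slides.

```python
import sqlite3

# Hypothetical schema for illustration only; names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE investments (employee_id INTEGER, fund_type TEXT, amount REAL)"
)

def record_investment(employee_id, fund_type, amount):
    """Operational pattern: one small write that touches a single record."""
    conn.execute(
        "INSERT INTO investments VALUES (?, ?, ?)",
        (employee_id, fund_type, amount),
    )

def average_by_fund_type():
    """Analytical pattern: a read-only aggregate computed over every row."""
    rows = conn.execute(
        "SELECT fund_type, AVG(amount) FROM investments GROUP BY fund_type"
    ).fetchall()
    return dict(rows)

record_investment(1, "stock", 100.0)
record_investment(2, "stock", 300.0)
record_investment(2, "bond", 50.0)
averages = average_by_fund_type()
print(averages)
```

The write touches one row and should return quickly no matter how large the table is; the aggregate has to read every row, which is the workload analytical systems are built around.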

[Charts: example analytics output, including a Plan vs. Actual chart by month (Oct to Mar) and charts comparing Acme, GM, Newco, Oldco and Bigcorp]

Page 7: H base vs hive   srp vs analytics 2-14-2012

HBase Data Model : Conceptual

From the BigTable paper:

“a sparse, distributed, persistent multi-dimensional sorted map”

(row : bytestring, column family : bytestring, column : bytestring, time : int64) -> bytestring


Page 8: H base vs hive   srp vs analytics 2-14-2012

HBase Map

{ "key_1" : {
    "columnfamily_a" : {
      "column_i" : {
        15 : "y",
        4 : "m"
      },
      "column_ii" : {
        15 : "d"
      }
    },
    "columnfamily_b" : {
      "column_other" : {
        6 : "w",
        3 : "o",
        1 : "w"
      }
    }
  }
}
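As a rough illustration, the same map can be written as nested Python dictionaries. The `get` helper below is a hypothetical toy lookup, not the HBase client API, and plain dictionaries do not model the sorted or distributed aspects of the real store.

```python
# Nested-dict sketch of the conceptual model:
# row -> column family -> column -> timestamp -> value.
table = {
    "key_1": {
        "columnfamily_a": {
            "column_i": {15: "y", 4: "m"},
            "column_ii": {15: "d"},
        },
        "columnfamily_b": {
            "column_other": {6: "w", 3: "o", 1: "w"},
        },
    },
}

def get(table, row, family, column, timestamp=None):
    """Return the value stored at `timestamp`, or the newest version if omitted.

    Returns None when the cell is absent. This is how sparseness shows up:
    a missing cell simply has no entry in the map.
    """
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    if timestamp is None:
        timestamp = max(versions)  # the highest timestamp is the newest version
    return versions.get(timestamp)

latest = get(table, "key_1", "columnfamily_a", "column_i")
older = get(table, "key_1", "columnfamily_a", "column_i", timestamp=4)
missing = get(table, "key_2", "columnfamily_a", "column_i")
```

Here `latest` is the value at timestamp 15, `older` is the value at timestamp 4, and `missing` is None because no such row exists.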


Page 9: H base vs hive   srp vs analytics 2-14-2012

Hive Data Model : Conceptual

Traditional Relational Tables

CUSTKEY  NAME     ADDRESS       NATIONKEY  PHONE         ACCTBAL     COMMENT
451234   NEWCORP  196 Broadway  1          111-555-1212  $1,231,285  NULL
887765   ACME     1 Main st.    2          222-555-1212  $46,945     “Top customer”

Page 10: H base vs hive   srp vs analytics 2-14-2012

HBase Data Model : Physical

Every cell stored with row, family, column and timestamp

Allows fast lookup with low copy overhead

BUT

Space inefficient (optional compression available) and inefficient to scan


“key_1” “cf_a” “c_i” 15 “foo”

“key_1” “cf_a” “c_ii” 15 “bar”

“key_2” “cf_a” “c_ii” 4 “baz”
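The trade-off described on this slide can be sketched with a sorted list of fully qualified cells. This is a simplified model under stated assumptions, not HBase's actual file format: the `lookup` and `overhead_ratio` helpers are hypothetical, each cell has a single version, and timestamps are counted as 8 bytes.

```python
from bisect import bisect_left

# Each cell carries its full coordinates, mirroring the three cells shown above.
cells = sorted([
    ("key_1", "cf_a", "c_i", 15, "foo"),
    ("key_1", "cf_a", "c_ii", 15, "bar"),
    ("key_2", "cf_a", "c_ii", 4, "baz"),
])

def lookup(cells, row, family, column):
    """Binary-search the sorted cells for the first one with these coordinates."""
    prefix = (row, family, column)
    i = bisect_left(cells, prefix)
    if i < len(cells) and cells[i][:3] == prefix:
        return cells[i][4]
    return None

def overhead_ratio(cells):
    """Bytes spent on coordinates divided by bytes spent on the values."""
    key_bytes = sum(len(r) + len(f) + len(c) + 8 for r, f, c, _, _ in cells)
    value_bytes = sum(len(v) for *_, v in cells)
    return key_bytes / value_bytes

found = lookup(cells, "key_1", "cf_a", "c_ii")
absent = lookup(cells, "key_1", "cf_b", "c_i")
ratio = overhead_ratio(cells)
```

Because every cell is self-describing and the list is sorted, a point lookup is a binary search. The cost is that the coordinates are repeated for every value, so in this tiny example they take several times more bytes than the values themselves, and a scan has to read all of them.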

Page 11: H base vs hive   srp vs analytics 2-14-2012

Hive Data Model : Physical

Depends on the underlying storage files

Can use flat text files, RCFiles, or even HBase for storage

Standard Row Storage


C_1 C_2 C_3 C_4

11 12 13 14

21 22 23 24

31 32 33 34

41 42 43 44

51 52 53 54

Page 12: H base vs hive   srp vs analytics 2-14-2012

Hive Data Model : RCFile

Break into row groups, and then store as columns


Row Group 1

C_1 11 21 31

C_2 12 22 32

C_3 13 23 33

C_4 14 24 34

Row Group 2

C_1 41 51

C_2 42 52

C_3 43 53

C_4 44 54
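The layout above can be reproduced with a short sketch that splits the five-row table into row groups and stores each group column by column. The group size of 3 is chosen only to match the slide, and the function names are illustrative assumptions; the real RCFile format also adds per-group metadata and compression, which this sketch leaves out.

```python
rows = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
    [51, 52, 53, 54],
]

def to_row_groups(rows, group_size):
    """Split rows into groups, then transpose each group into per-column lists."""
    groups = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        groups.append([list(col) for col in zip(*group)])
    return groups

def read_column(groups, index):
    """Scan one column, touching only that column's values in each row group."""
    return [value for group in groups for value in group[index]]

groups = to_row_groups(rows, group_size=3)
c_2 = read_column(groups, 1)
print(groups[0])  # columns of Row Group 1
print(c_2)        # every value of C_2 across both groups
```

A query that needs only C_2 reads one list per row group and skips the other three columns, which is why this layout suits analytical scans.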

Page 13: H base vs hive   srp vs analytics 2-14-2012

Informal Performance Comparison

                         Hive                             HBase
Insert speed             Batch                            Fast!
Update speed             N/A                              Fast!
Lookup speed             MR lower bound (10s of seconds)  Fast!
Data warehouse queries   15x faster on one test           Uh oh

Page 14: H base vs hive   srp vs analytics 2-14-2012

THANK YOU