HBase @ Hadoop Day Seattle

HBase. Amandeep Khurana, University of California, Santa Cruz. Twitter: @amansk. www.amandeepkhurana.com. Tuesday, August 17, 2010

Transcript of HBase @ Hadoop Day Seattle

Page 1: HBase @ Hadoop Day Seattle

HBase

Amandeep Khurana

University of California, Santa Cruz

Twitter: @amansk

Page 2: HBase @ Hadoop Day Seattle

How did it start?

• At Google

• Lots of semi structured data

• Commodity hardware

• Horizontal scalability

• Tight integration with MapReduce


Page 3: HBase @ Hadoop Day Seattle

Why NoSQL?

• RDBMS don’t scale

• Typically large monolithic systems

• Hard to shard

• Specialized hardware... expensive!

• Buzzword!


Page 4: HBase @ Hadoop Day Seattle

Google BigTable

• Distributed multi level map

• Fault tolerant, persistent

• Scalable

• Runs on commodity hardware

• Self managing

• Large number of read/write ops

• Fast scans


Page 5: HBase @ Hadoop Day Seattle

HBase

• Open source BigTable

• HDFS as underlying DFS

• ZooKeeper as lock service

• Tight integration with Hadoop MapReduce


Page 6: HBase @ Hadoop Day Seattle

HBase

• Data model

• Architecture, implementation

• Regions, Region Servers etc

• API

• Current status and future direction

• Use cases

• How to think HBase (or NoSQL)?


Page 7: HBase @ Hadoop Day Seattle

Data Model

• Sparse, multi-dimensional map: (row, column, timestamp) → cell

• Column = Column Family:Column Qualifier

[Figure: a table with three axes: Rows, Columns, Timestamps. Row AK, column Fam1:Qual1 holds value v1 at timestamp t1; the next build adds a second version v2 at timestamp t2, with t2 > t1]
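The map on this slide can be sketched in a few lines of Python. This is only a toy model of the idea, not the HBase API: the class name `SparseTable`, its methods, and the integer timestamps 1 and 2 (standing in for t1 and t2) are assumptions made for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of the (row, column, timestamp) -> cell map. Not the HBase API."""

    def __init__(self):
        # row -> column ("family:qualifier") -> {timestamp: value}
        # Only cells that were actually written take up space (sparse).
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = SparseTable()
table.put("AK", "Fam1:Qual1", 1, "v1")  # timestamp t1 = 1
table.put("AK", "Fam1:Qual1", 2, "v2")  # timestamp t2 = 2 > t1

latest = table.get("AK", "Fam1:Qual1")               # newest version
as_of_t1 = table.get("AK", "Fam1:Qual1", timestamp=1)  # version visible at t1
```

Writing v2 does not overwrite v1: both versions stay addressable by timestamp, and a read without a timestamp returns the newest one.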

Page 9: HBase @ Hadoop Day Seattle

Regions

• Region: Contiguous set of lexicographically sorted rows

• A region is split once it grows past hbase.hregion.max.filesize (default 256MB)

• Regions hosted by Region Servers
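Because each region holds a contiguous, lexicographically sorted range of rows, finding the region for a row key amounts to a binary search over region start keys. The sketch below assumes a hypothetical layout with two regions and made-up server names; it illustrates the idea only and is not how the HBase client is implemented.

```python
import bisect

# Hypothetical layout: a region is identified by its start key and covers rows
# from that key up to (not including) the next region's start key.
region_start_keys = ["", "row257"]               # sorted lexicographically
region_servers = ["rs1.example", "rs2.example"]  # made-up host names


def find_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1


def find_server(row_key):
    return region_servers[find_region(row_key)]
```

Note that the ordering is lexicographic, not numeric: `"row30"` sorts after `"row257"`, so `find_region("row30")` lands in the second region. Row keys are often zero-padded for this reason.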


Page 10: HBase @ Hadoop Day Seattle

Regions and Splitting

[Diagram, three builds: two regions covering row1 to row256 and row257 to row600; Writes arrive; the second region then splits at row400, giving row257 to row400 and row401 to row600]
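A minimal sketch of the splitting behaviour in the diagram, assuming a region splits around its middle row key once its total size passes a threshold. The `Region` class and `write` helper are hypothetical, sizes are plain byte counts, and the real split point selection in HBase is more involved than this.

```python
MAX_REGION_BYTES = 256 * 1024 * 1024  # mirrors the 256MB default quoted on the Regions slide


class Region:
    """Toy region: tracks only row keys and how many bytes each row holds."""

    def __init__(self, rows=None):
        self.rows = dict(rows or {})  # row key -> stored bytes

    def size(self):
        return sum(self.rows.values())

    def put(self, row_key, num_bytes):
        self.rows[row_key] = num_bytes

    def split(self):
        """Split into two daughter regions around the middle row key."""
        keys = sorted(self.rows)
        mid = len(keys) // 2
        left = Region({k: self.rows[k] for k in keys[:mid]})
        right = Region({k: self.rows[k] for k in keys[mid:]})
        return left, right


def write(regions, index, row_key, num_bytes, max_bytes=MAX_REGION_BYTES):
    """Apply a write to regions[index]; split that region if it grew past max_bytes."""
    region = regions[index]
    region.put(row_key, num_bytes)
    if region.size() > max_bytes and len(region.rows) > 1:
        regions[index:index + 1] = list(region.split())
    return regions


# Tiny threshold so the split is visible without writing 256MB.
regions = [Region({"row257": 100, "row600": 100})]
write(regions, 0, "row400", 100, max_bytes=250)
```

After the third write the single region exceeds the threshold and is replaced in the list by its two daughters, which together still cover the same sorted key range.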

Page 13: HBase @ Hadoop Day Seattle

System Structure

[Diagram components: Region Servers, Master, ZooKeeper, HDFS, MapReduce (Map, Reduce)]

Page 14: HBase @ Hadoop Day Seattle

Master

• Region splitting

• Load balancing

• Metadata operations

• Multiple masters for failover


Page 15: HBase @ Hadoop Day Seattle

ZooKeeper

• Master election

• Locate -ROOT- region

• Region Server membership


Page 16: HBase @ Hadoop Day Seattle

Where is my row?

• 3-level hierarchical lookup scheme

• ZooKeeper → -ROOT- → .META. → MyTable → MyRow

• -ROOT-: one row per .META. region

• .META.: one row per table region
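The lookup chain can be sketched as three successive range lookups. Everything concrete here is an assumption for illustration: the server names, the `"table,start_row"` key format, and the dictionary standing in for ZooKeeper are hypothetical and only mirror the shape of the scheme on the slide.

```python
import bisect


def locate(catalog, key):
    """catalog: list of (start_key, target) sorted by start_key.
    Return the target of the range that contains key."""
    starts = [start for start, _ in catalog]
    return catalog[bisect.bisect_right(starts, key) - 1][1]


# Hypothetical cluster state.
zookeeper = {"root-region-server": "rs0"}  # where the -ROOT- region lives
root_table = [("", "rs1")]                 # -ROOT-: one row per .META. region
meta_table = {                             # .META.: one row per table region
    "rs1": [("MyTable,", "rs2"), ("MyTable,row257", "rs3")],
}


def find_row(table, row):
    key = f"{table},{row}"
    root_server = zookeeper["root-region-server"]          # level 1: ZooKeeper -> -ROOT-
    meta_server = locate(root_table, key)                  # level 2: -ROOT- -> .META. region
    region_server = locate(meta_table[meta_server], key)   # level 3: .META. -> table region
    return root_server, meta_server, region_server
```

For example, `find_row("MyTable", "MyRow")` walks ZooKeeper, then -ROOT-, then .META., and returns the server for the region that holds MyRow. A real client would cache these locations so that most requests skip the walk.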

Page 21: HBase @ Hadoop Day Seattle

Region

• HLog: append-only WAL on HDFS (Sequence File, one per RS)

• Memstore

• HFile (on HDFS): immutable sorted map (byte[] → byte[]), (row, column, timestamp) → cell value

• Diagram builds: Write → Memstore → Flush → small HFile → Compaction → HFile (on HDFS)
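The write, flush and compaction steps in these builds can be sketched as a small in-memory model. This is a simplification under stated assumptions: `ToyRegion` is hypothetical, the flush threshold counts cells rather than bytes, "files" are just sorted tuples in a list, and the log is kept per region here although the slide notes there is one HLog per region server.

```python
class ToyRegion:
    """Toy write path: HLog append, Memstore buffer, flush to small files, compaction."""

    def __init__(self, flush_threshold=2):
        self.hlog = []      # append-only write-ahead log
        self.memstore = {}  # in-memory buffer: (row, column, timestamp) -> value
        self.hfiles = []    # immutable sorted files, oldest first
        self.flush_threshold = flush_threshold

    def write(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        self.hlog.append((key, value))  # 1. log the edit first
        self.memstore[key] = value      # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memstore out as a new small, sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def compact(self):
        """Merge all files into a single sorted file."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so newer entries win on key clashes
            merged.update(hfile)
        self.hfiles = [tuple(sorted(merged.items()))] if merged else []


region = ToyRegion(flush_threshold=2)
for ts, value in enumerate(["a", "b", "c", "d"], start=1):
    region.write("row1", "Fam1:Qual1", ts, value)
files_before_compaction = len(region.hfiles)
region.compact()
```

Four writes with a threshold of two produce two small files; compaction then merges them into one sorted file, which is the transition the last builds of the diagram show.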

Page 29: HBase @ Hadoop Day Seattle

Region

[Diagram: read path. A Read against a Region consults the Memstore and the HFiles (on HDFS); the HLog (append-only WAL on HDFS, Sequence File, one per RS) is also shown]
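A read has to consider both the Memstore and every HFile of the region, because the newest version of a cell may sit in any of them. The sketch below assumes each store is a plain dict keyed by (row, column, timestamp) and scans them linearly; real HFiles are sorted and indexed, so this shows only the merge logic, with all names being hypothetical.

```python
def read(memstore, hfiles, row, column):
    """Return the value with the highest timestamp for (row, column), or None."""
    best = None  # (timestamp, value) of the newest version seen so far
    for source in [memstore] + list(hfiles):
        for (r, c, ts), value in source.items():
            if (r, c) == (row, column) and (best is None or ts > best[0]):
                best = (ts, value)
    return None if best is None else best[1]


memstore = {("row1", "Fam1:Qual1", 3): "newest"}
hfiles = [
    {("row1", "Fam1:Qual1", 1): "old"},
    {("row1", "Fam1:Qual1", 2): "newer", ("row2", "Fam1:Qual1", 1): "other"},
]

value_row1 = read(memstore, hfiles, "row1", "Fam1:Qual1")
value_row2 = read(memstore, hfiles, "row2", "Fam1:Qual1")
```

The more HFiles a region accumulates, the more places a read has to look, which is one reason compaction matters for read performance.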

Page 31: HBase @ Hadoop Day Seattle

Ways to access

• Java

• REST

• Thrift

• Scala

• Jython

• Groovy DSL

• Ruby shell

• Java MR, Cascading, Pig, Hive


Page 32: HBase @ Hadoop Day Seattle

Java API

• Get

• Put

• Delete

• Scan

• IncrementColumnValue

• TableInputFormat - MapReduce Source

• TableOutputFormat - MapReduce Sink
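To make the semantics of these operations concrete, here is a toy in-memory stand-in written in Python. It is not the HBase Java client: the class `ToyTable` and its method names are assumptions that merely echo the operation names on the slide, and versions and column families are left out.

```python
class ToyTable:
    """In-memory stand-in for a table, showing what each operation does."""

    def __init__(self):
        self.cells = {}  # (row, column) -> value

    def put(self, row, column, value):
        self.cells[(row, column)] = value

    def get(self, row, column):
        return self.cells.get((row, column))

    def delete(self, row, column):
        self.cells.pop((row, column), None)

    def scan(self, start_row, stop_row):
        """Yield (row, column, value) for start_row <= row < stop_row, in sorted order."""
        for (row, column) in sorted(self.cells):
            if start_row <= row < stop_row:
                yield row, column, self.cells[(row, column)]

    def increment_column_value(self, row, column, amount):
        """Add amount to a counter cell (starting from 0) and return the new value."""
        new_value = self.cells.get((row, column), 0) + amount
        self.cells[(row, column)] = new_value
        return new_value


t = ToyTable()
t.put("row1", "Fam1:Qual1", "v1")
t.put("row2", "Fam1:Qual1", "v2")
t.put("row3", "Fam1:Qual1", "v3")
t.delete("row3", "Fam1:Qual1")
scanned = list(t.scan("row1", "row3"))
hits = [t.increment_column_value("row1", "Fam1:hits", 1) for _ in range(2)]
```

The scan returns rows in sorted key order with an inclusive start and exclusive stop, which is what makes range scans over adjacent row keys cheap.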


Page 33: HBase @ Hadoop Day Seattle

Other Features

• Compression

• In memory column families

• Multiple masters

• Rolling restart

• Bloom filters

• Efficient bulk loads

• Source and sink for Hive, Pig, Cascading


Page 34: HBase @ Hadoop Day Seattle

Things being worked on

• Master rewrite

• Move more stuff into ZooKeeper

• Column family based access control

• Inter cluster replication (managed by ZK)

• Store Lucene indexes (HBasene)


Page 35: HBase @ Hadoop Day Seattle

Use Cases


Page 36: HBase @ Hadoop Day Seattle

HBase @ StumbleUpon (SU)*

• Backend for su.pr

• Real time serving + MR analytics (separate clusters)

• 50% Cascading, 50% Java MR

• Prod cluster (~20 nodes) serves 20k requests/sec

• All new features are backed by HBase

• Hardware: 2x i7, 24GB RAM, 4x 1TB

*Source: Personal communication with J-D Cryans, StumbleUpon

Page 37: HBase @ Hadoop Day Seattle

HBase @ Mozilla*

• Socorro - crash reporting system

• Catch, process and present crash info for Firefox, Thunderbird, Fennec, Camino, SeaMonkey

• 1.5 million crash reports/day

• Earlier: NFS, PostgreSQL

• 17 node production cluster

• Dual Quad Core + 24GB RAM + 4x 1TB

• Some user facing reports still served by PostgreSQL. Being ported to HBase in next Socorro version

*Source: http://blog.mozilla.com/webdev/2010/07/26/moving-socorro-to-hbase/

Page 38: HBase @ Hadoop Day Seattle

Data Integration*

• Multiple heterogeneous data sources

• Notion of connected data

• Think RDF

• Graph connecting data elements across systems

• Store in HBase, build transitive closures

• Pattern mining

*Source: ClouDFuse - Scalable data integration in the cloud, MS Project, Amandeep Khurana, UC Santa Cruz
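One way to picture "store in HBase, build transitive closures" is a row per data element whose column qualifiers name the directly connected elements. That layout, the `links` family name, and the sample data below are assumptions for illustration; the slide does not describe the project's actual schema. A plain dict stands in for the table.

```python
from collections import deque

# Hypothetical layout: one row per data element; each column qualifier in a
# "links" family names a directly connected element.
rows = {
    "a": {"links:b"},
    "b": {"links:c"},
    "c": set(),
    "d": {"links:a"},
}


def neighbours(row_key):
    """Directly connected elements, read from the row's column qualifiers."""
    return [column.split(":", 1)[1] for column in sorted(rows.get(row_key, ()))]


def transitive_closure(start):
    """Return every element reachable from start by following links (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in neighbours(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


reachable_from_d = transitive_closure("d")
```

At scale the same traversal would more likely run as iterated MapReduce jobs over the table than as a single-process loop like this one.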

Page 39: HBase @ Hadoop Day Seattle

HBase @ Trend Micro*

• Store threat information - Smart Protection Network

• Open source cloud computing initiative - TCloud

• Primarily run off EC2

*Source: https://hbase.s3.amazonaws.com/hbase/HBase-Trend-HUG10.pdf

Page 40: HBase @ Hadoop Day Seattle

HBase @ Yahoo*

• Content optimization

• Meta-data about content stored in HBase

• Used for extracting item features

• Used in conjunction with PNUTS, Hadoop

• Process 100s of GB in each run

*Source: http://www.slideshare.net/ydn/7-online-contentoptimizationhadoopsummit2010

Page 41: HBase @ Hadoop Day Seattle

HBase @ Twitter*

• 7TB/day incoming data, increasing

• Analytics

• People search

• Building new solutions on HBase

• Part of a much larger scheme of things

• Scribe, Crane, Pig, MySQL, Cassandra, Oink, Elephant Bird, Birdbrain, Hadoop

*Sources: http://www.slideshare.net/kevinweil/nosql-at-twitter-nosql-eu-2010, http://www.slideshare.net/ydn/3-hadoop-pigattwitterhadoopsummit2010

Page 42: HBase @ Hadoop Day Seattle

Others

• Facebook

• Flurry

• Adobe

• Runa

• GumGum

• Openplaces

• Meetup.com

• Powerset

• WorldLingo

• Lily

• Drawn To Scale

• RapLeaf

• ...


Page 43: HBase @ Hadoop Day Seattle

How to think in HBase?


Page 44: HBase @ Hadoop Day Seattle

HBase v/s RDBMS

• Neither solves all problems

• It’s really a wrong comparison

• But puts things in context


Page 45: HBase @ Hadoop Day Seattle

HBase v/s RDBMS

HBase | RDBMS

Column oriented | Row oriented (mostly)

Flexible schema, add columns on the fly | Fixed schema

Good with sparse tables | Not optimized for sparse tables

No query language | SQL

Wide tables | Narrow tables

Joins using MR - not optimized | Optimized for joins (small, fast ones too!)

Tight integration with MR | Not really...

Page 46: HBase @ Hadoop Day Seattle

HBase v/s RDBMS

HBase | RDBMS

De-normalize your data | Normalize as you can

Horizontal scalability. Just add hardware | Hard to shard and scale

Consistent | Consistent

No transactions | Transactional

Good for semi structured data as well as structured data | Good for structured data

Page 47: HBase @ Hadoop Day Seattle

HBase v/s RDBMS

Rule: You probably don’t need HBase if your data can easily fit and be processed on a single RDBMS box.

But then, you are at Hadoop Day, so it probably can’t!

Page 50: HBase @ Hadoop Day Seattle

Q&A
