HBase Data Modeling and Access Patterns with Kite SDK

Adam Warrington, Sr. Manager, Customer Ops Tools Team

description

Speaker: Adam Warrington (Cloudera)

The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.

Transcript of HBase Data Modeling and Access Patterns with Kite SDK

Page 1: HBase Data Modeling and Access Patterns with Kite SDK

HBase Data Modeling and Access Patterns with Kite SDK
Adam Warrington
Sr. Manager, Customer Ops Tools Team

Page 2: HBase Data Modeling and Access Patterns with Kite SDK

©2014 Cloudera, Inc. All rights reserved.

Developing on top of Apache Hadoop

• Apache Hadoop is an incredibly powerful platform on which to develop data applications.

• Scale
  • It provides the infrastructure needed to process big data at scale.
• Flexibility
  • General-purpose platform on top of which one can build almost any type of big data application.
• Diverse Ecosystem
  • Multitude of storage engines, tools for ETL, machine learning, analysis, and data science.

• This comes at a cost…

Page 3: HBase Data Modeling and Access Patterns with Kite SDK

Developing on top of Apache Hadoop: The Cost

• The API is very basic and low level.
• Developers are required to build plumbing and infrastructure to create even a basic system.
• The process is repeated for every system you create.
• You have to understand the quirks of each system.
• The barrier to entry is high for many enterprise Java developers in the industry.

Page 5: HBase Data Modeling and Access Patterns with Kite SDK

What is Kite SDK?

• Kite SDK aims to solve this problem by building a higher-level API on top of the Hadoop ecosystem.
• Kite exists as a client-side library for writing Hadoop data applications.
• Modular
  • Datasets: standard storage
  • Morphlines: ETL as configuration
  • Data management tools
• Today’s talk will focus on the Datasets module.

Page 6: HBase Data Modeling and Access Patterns with Kite SDK

Kite Datasets

• Motivation
  • Focus on your data, not managing it
• Goals
  • Think in terms of data, not files
  • Describe your data and Kite does the right thing
  • Consistency: should work across the platform
  • Reliability

Page 7: HBase Data Modeling and Access Patterns with Kite SDK

Kite Datasets

At the heart of the Kite Datasets module is a unified storage interface.

• Dataset: a collection of entities
• DatasetRepository: physical storage location for datasets
• DatasetDescriptor: holds dataset metadata (schema, format)
• DatasetWriter: writes entities to a dataset in a stream
• DatasetReader: reads entities from a dataset

Page 8: HBase Data Modeling and Access Patterns with Kite SDK

Kite Partition Strategies

PartitionStrategy defines how to map an entity to partitions in HDFS or row keys in HBase.

    PartitionStrategy p = new PartitionStrategy.Builder()
        .year("timestamp")
        .month("timestamp")
        .day("timestamp")
        .build();

Resulting directory layout:

    /user/hive/warehouse/events
      /year=2014/month=05/day=05
        /FlumeData.1375659013795
        /FlumeData.1375659013796
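To make the mapping from an entity to a partition concrete, here is a minimal sketch of how a timestamp field could be turned into the year/month/day path shown above. This is not Kite's implementation: the class `PartitionPathSketch`, the method `partitionPath`, and the choice of UTC are assumptions made purely for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

public class PartitionPathSketch {
    // Hypothetical helper (not a Kite API): derives a year/month/day
    // partition path from an epoch-millisecond timestamp, assuming UTC.
    public static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 1399291200000L is 2014-05-05T12:00:00Z
        System.out.println(partitionPath(1399291200000L));
    }
}
```

Every entity whose timestamp falls on the same day lands in the same directory, which is what lets a reader skip whole partitions when scanning a date range.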

Page 9: HBase Data Modeling and Access Patterns with Kite SDK

Kite Datasets Example

Event.avsc

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example",
      "fields": [
        { "name": "id", "type": "long" },
        { "name": "timestamp", "type": "long" },
        { "name": "source", "type": "string" }
      ]
    }

Log4j Configuration

    log4j.appender.flume = org.kitesdk.data.flume.Log4jAppender
    log4j.appender.flume.Hostname = localhost
    log4j.appender.flume.Port = 41415
    log4j.appender.flume.DatasetRepositoryUri = repo:hive
    log4j.appender.flume.DatasetName = events

Page 10: HBase Data Modeling and Access Patterns with Kite SDK

Kite Datasets Example Continued

Dataset Creation

    DatasetRepository repo = DatasetRepositories.open("repo:hive");
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schema(Event.avsc)
        .build();
    repo.create("events", descriptor);

Java Code

    Logger logger = Logger.getLogger(...);

    Event event = new Event();
    event.setId(id);
    event.setTimestamp(System.currentTimeMillis());
    event.setSource(source);
    logger.info(event);

Page 11: HBase Data Modeling and Access Patterns with Kite SDK

Kite Datasets Example Continued

Resulting File Layout

    /user
      /hive
        /warehouse
          /events
            /FlumeData.1375659013795    (Avro file)
            /FlumeData.1375659013796    (Avro file)

Page 12: HBase Data Modeling and Access Patterns with Kite SDK

Kite HBase Module: Overview

Page 13: HBase Data Modeling and Access Patterns with Kite SDK

HBase Storage Format

HBase storage concepts are fundamentally different from file formats on HDFS:

• Ordered rows
• Column families
• Random access operations

Page 14: HBase Data Modeling and Access Patterns with Kite SDK

HBase Storage Format

New concepts added to the Dataset API:

• Composite keys: support for entity ordering with composite keys
• Column mapping: defines how data is split across column families and columns in a table
• Random access dataset methods: support for Get, Put, and Delete operations on the Dataset interface

Page 15: HBase Data Modeling and Access Patterns with Kite SDK

Composite Key Engineering

• Properly engineered row keys are crucial for optimizing HBase scans.
• HBase tables sort rows by the lexicographical ordering of the key byte arrays.
• Composite keys are a common use case, but hard to get right.

Page 16: HBase Data Modeling and Access Patterns with Kite SDK

Composite Key Engineering With Partition Strategies

• We already have a way to split records across storage buckets with a PartitionStrategy.
• Let’s re-use that concept.
• Example: define a PartitionStrategy optimized for historical web page scans.

Website.avsc

    {
      "type": "record",
      "name": "Website",
      "namespace": "com.example",
      "fields": [
        { "name": "url", "type": "string" },
        { "name": "timestamp", "type": "long" },
        { "name": "content", "type": "string" }
      ]
    }

Partition Strategy Builder

    PartitionStrategy p = new PartitionStrategy.Builder()
        .identity("url")
        .identity("timestamp")
        .build();

Page 17: HBase Data Modeling and Access Patterns with Kite SDK

Composite Key Engineering With Partition Strategies

Or with the Partition Strategy JSON format:

Website.avsc

    {
      "type": "record",
      "name": "Website",
      "namespace": "com.example",
      "fields": [
        { "name": "url", "type": "string" },
        { "name": "timestamp", "type": "long" },
        { "name": "content", "type": "string" }
      ]
    }

WebsitePartitionStrat.json

    [
      { "source": "url", "type": "id" },
      { "source": "timestamp", "type": "id" }
    ]

Page 18: HBase Data Modeling and Access Patterns with Kite SDK

Key Memcmp Encoding

• Encode composite key parts so that the serialized byte array sorts lexicographically by the key fields, in order.

    { "id": 1, "ts": 100, … }  <  { "id": 2, "ts": 50, … }  <  { "id": 2, "ts": 102, … }

Page 19: HBase Data Modeling and Access Patterns with Kite SDK

Key Memcmp Encoding (Integer and Long)

Standard integer and long serialization sorts negative and positive numbers in the wrong order when the bytes are compared:

    Value   Bytes
    1       0x00000001
    0       0x00000000
    -1      0xFFFFFFFF
    -2      0xFFFFFFFE

So we flip the sign bit when serializing an integer or long:

    Value   Bytes
    1       0x80000001
    0       0x80000000
    -1      0x7FFFFFFF
    -2      0x7FFFFFFE
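The sign-bit flip can be sketched in a few lines of plain Java. This is an illustration of the technique on the slide, not Kite's own encoder; the class and method names (`SignFlipSketch`, `encodeInt`, `decodeInt`, `encodeLong`) are made up for the example, and `Arrays.compareUnsigned` (Java 9+) stands in for the byte-wise comparison HBase performs on row keys.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class SignFlipSketch {
    // Big-endian 4-byte encoding with the sign bit flipped, so that
    // unsigned byte-wise order matches signed numeric order.
    public static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value ^ 0x80000000).array();
    }

    // Flipping the sign bit again restores the original value.
    public static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt() ^ 0x80000000;
    }

    // Same idea for 8-byte longs.
    public static byte[] encodeLong(long value) {
        return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
    }

    public static void main(String[] args) {
        int[] values = { -2, -1, 0, 1 };
        for (int i = 0; i + 1 < values.length; i++) {
            byte[] a = encodeInt(values[i]);
            byte[] b = encodeInt(values[i + 1]);
            // compareUnsigned mirrors a memcmp-style comparison of the keys
            boolean ordered = Arrays.compareUnsigned(a, b) < 0;
            System.out.println(values[i] + " sorts before " + values[i + 1] + ": " + ordered);
        }
    }
}
```

Because the transformation is a single XOR, decoding costs the same as encoding and no information is lost.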

Page 20: HBase Data Modeling and Access Patterns with Kite SDK

Key Memcmp Encoding (Variable Length Types)

Binary Avro encoding is length prefixed. This can sort composite keys in the wrong order, because the length byte is compared before the content:

    Value1   Value2   Bytes
    "foo"    "bar"    \x03foo\x03bar
    "foo"    "zr"     \x03foo\x02zr
    "zo"     "bar"    \x02zo\x03bar

So we end each string with a terminating character instead:

    Value1   Value2   Bytes
    "foo"    "bar"    foo\x00bar\x00
    "foo"    "zr"     foo\x00zr\x00
    "zo"     "bar"    zo\x00bar\x00

Page 21: HBase Data Modeling and Access Patterns with Kite SDK

Key Memcmp Encoding (Variable Length Types)

• How do we handle a \x00 byte present in the variable length type?
• Convert each \x00 byte to \x00\x01, and use \x00\x00 as the terminating sequence.

    Value1      Value2   Bytes
    "foo"       "bar"    foo\x00\x00bar\x00\x00
    "foo\x00"   "aa"     foo\x00\x01\x00\x00aa\x00\x00
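The escaping and terminating rule can be sketched as follows. This is an illustration of the scheme described on the slide rather than Kite's actual encoder; `TerminatedStringSketch`, `appendPart`, and `encodeKey` are hypothetical names, and UTF-8 is assumed for the string bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TerminatedStringSketch {
    // Appends one key part: every 0x00 byte is escaped as 0x00 0x01,
    // and the part is closed with the terminator 0x00 0x00.
    static void appendPart(ByteArrayOutputStream out, String part) {
        for (byte b : part.getBytes(StandardCharsets.UTF_8)) {
            out.write(b);
            if (b == 0) {
                out.write(1);
            }
        }
        out.write(0);
        out.write(0);
    }

    // Concatenates the encoded parts into one composite row key.
    public static byte[] encodeKey(String... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : parts) {
            appendPart(out, part);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] fooBar = encodeKey("foo", "bar");
        byte[] fooZr = encodeKey("foo", "zr");
        byte[] zoBar = encodeKey("zo", "bar");
        // Byte-wise comparison now follows the field-by-field string order.
        System.out.println(Arrays.compareUnsigned(fooBar, fooZr) < 0);
        System.out.println(Arrays.compareUnsigned(fooZr, zoBar) < 0);
    }
}
```

The terminator sorts below every escaped byte sequence, so a shorter string such as "foo" always sorts before a longer string that starts with it, such as "foo\x00".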

Page 22: HBase Data Modeling and Access Patterns with Kite SDK

Column Mappings

Defines how an Avro record’s fields are mapped to an HBase table row.

    Mapping Type   Description
    column         Maps a record field value directly to a column.
    counter        Similar to column, except that it supports atomic increment.
    keyAsColumn    Maps key/value field types to a column family, where each key entry is a column qualifier and each value entry is the cell value.
    key            The record field’s value is part of the composite key.
    occVersion     Enables optimistic concurrency control on the dataset.

Page 23: HBase Data Modeling and Access Patterns with Kite SDK

Column Mappings: Header Definition

Event.avsc

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example",
      "mapping": [
        { "source": "id", "type": "key" },
        { "source": "ts", "type": "key" },
        { "source": "source", "type": "column", "value": "meta:source" },
        { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
      ],
      "fields": [
        { "name": "id", "type": "long" },
        { "name": "ts", "type": "long" },
        { "name": "source", "type": "string" },
        { "name": "atts", "type": { "type": "map", "value": "string" } }
      ]
    }

• The mapping definition attribute can be added directly to the Avro record schema.
• It is still a valid Avro schema: Avro’s schema parser will ignore unknown attributes in the record header.

Page 24: HBase Data Modeling and Access Patterns with Kite SDK

Column Mappings: Field Definition

Event.avsc

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example",
      "fields": [
        { "name": "id", "type": "long",
          "mapping": { "type": "key" } },
        { "name": "ts", "type": "long",
          "mapping": { "type": "key" } },
        { "name": "source", "type": "string",
          "mapping": { "type": "column", "value": "meta:source" } },
        { "name": "atts", "type": { "type": "map", "value": "string" },
          "mapping": { "type": "keyAsColumn", "value": "atts:" } }
      ]
    }

• Mapping definition attributes can be defined directly on the Avro schema fields.
• It is still a valid Avro schema: Avro’s schema parser will ignore unknown attributes on fields.

Page 25: HBase Data Modeling and Access Patterns with Kite SDK

Column Mappings: External Definition

Event.avsc

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example",
      "fields": [
        { "name": "id", "type": "long" },
        { "name": "ts", "type": "long" },
        { "name": "source", "type": "string" },
        { "name": "atts", "type": { "type": "map", "value": "string" } }
      ]
    }

EventMapping.json

    [
      { "source": "id", "type": "key" },
      { "source": "ts", "type": "key" },
      { "source": "source", "type": "column", "value": "meta:source" },
      { "source": "atts", "type": "keyAsColumn", "value": "atts:" }
    ]

• Mapping definition attributes can be defined in an external file.
• Perfect if you don’t want to update existing Avro schemas.

Page 26: HBase Data Modeling and Access Patterns with Kite SDK

Column Mapping Types: “column”

• Maps a field to a fully qualified column.
• Fields are serialized using Avro binary encoding, except:
  • Integer is serialized as a 4-byte int
  • Long is serialized as an 8-byte long
  • String is serialized as UTF-8 bytes
• This allows atomic increment and append on these types, which length-prefixed and zig-zag encodings would not.

    Row Key                   | Column Family: meta | Column Family: atts
    Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
    1          | 1396322485   | server1             | 192.168.0.100 | ERROR

Event Instance:

    {
      "id": 1,
      "ts": 1396322485,
      "source": "server1",
      "atts": { "ip": "192.168.0.100", "level": "ERROR" }
    }

Page 27: HBase Data Modeling and Access Patterns with Kite SDK

Column Mapping Types: “keyAsColumn”

• Allowed for Map and Record types.
• Splits apart a Map by its entries, using keys as the qualifier and storing values in the cell.
• Splits apart a Record by its fields, using field names as the qualifier and storing the values in the cell.
• Fields are serialized using Avro’s binary encoding.
• Allows a pattern for atomic updates to the keyAsColumn field.

    Row Key                   | Column Family: meta | Column Family: atts
    Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
    1          | 1396322485   | server1             | 192.168.0.100 | ERROR

Event Instance:

    {
      "id": 1,
      "ts": 1396322485,
      "source": "server1",
      "atts": { "ip": "192.168.0.100", "level": "ERROR" }
    }
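As a rough illustration of the keyAsColumn idea, the sketch below expands a map field into one cell per entry, named by column family and qualifier. It is not Kite code: `KeyAsColumnSketch` and `toCells` are hypothetical, and cell values are kept as plain strings for readability, whereas the slide states that Kite serializes them with Avro's binary encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class KeyAsColumnSketch {
    // Hypothetical helper (not a Kite API): turns each map entry into a
    // cell addressed as "family:qualifier". Values stay as plain strings
    // here; Kite itself is described as using Avro binary encoding.
    public static Map<String, String> toCells(String family, Map<String, String> field) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : field.entrySet()) {
            cells.put(family + ":" + entry.getKey(), entry.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        Map<String, String> atts = new LinkedHashMap<>();
        atts.put("ip", "192.168.0.100");
        atts.put("level", "ERROR");
        // Each entry of the "atts" map becomes its own column in the row
        toCells("atts", atts).forEach((column, value) ->
                System.out.println(column + " = " + value));
    }
}
```

Because every map entry lives in its own cell, a single entry can be written or replaced without reading and rewriting the whole map.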

Page 28: HBase Data Modeling and Access Patterns with Kite SDK

Column Mapping Types: “key”

• Allowed for simple types: int, long, float, double, boolean, string, bytes.
• Can be defined on multiple fields to support multi-part keys.
• Rows are ordered lexicographically by the key mapping fields, in the order they are defined.

    Row Key                   | Column Family: meta | Column Family: atts
    Key Part 1 | Key Part 2   | Qualifier: source   | Qualifier: ip | Qualifier: level
    1          | 1396322485   | server1             | 192.168.0.100 | ERROR

Event Instance:

    {
      "id": 1,
      "ts": 1396322485,
      "source": "server1",
      "atts": { "ip": "192.168.0.100", "level": "ERROR" }
    }

Page 29: HBase Data Modeling and Access Patterns with Kite SDK

RandomAccessDataset

Adds a number of methods to the Dataset interface for random access operations.

    public E get(Key key);
    public boolean put(E entity);
    public long increment(Key key, String fieldName, long amount);
    public void delete(Key key);

Page 30: HBase Data Modeling and Access Patterns with Kite SDK

Random Access Dataset Example

Website.avsc

    {
      "type": "record",
      "name": "Website",
      "namespace": "com.example",
      "fields": [
        { "name": "url", "type": "string" },
        { "name": "timestamp", "type": "long" },
        { "name": "size", "type": "int" },
        { "name": "content", "type": "string" }
      ]
    }

WebsitesPartitionStrat.json

    [
      { "source": "url", "type": "id" }
    ]

WebsiteVersionsPartitionStrat.json

    [
      { "source": "url", "type": "id" },
      { "source": "timestamp", "type": "id" }
    ]

WebsiteColumnMapping.json

    [
      { "source": "url", "type": "column", "value": "meta:url" },
      { "source": "timestamp", "type": "column", "value": "meta:timestamp" },
      { "source": "size", "type": "column", "value": "meta:size" },
      { "source": "content", "type": "column", "value": "content:content" }
    ]

Page 31: HBase Data Modeling and Access Patterns with Kite SDK

Random Access Dataset Example

    private RandomAccessDataset<Website> websitesDataset = …;
    private RandomAccessDataset<Website> websiteVersionsDataset = …;

    public void calculateNextFetch(String url) {
      // Random access: fetch the current record for this URL by key
      Key key = new Key.Builder(websitesDataset).add("url", url).build();
      Website website = websitesDataset.get(key);

      // Scan: read every stored version of this URL
      DatasetReader<Website> websiteVersionReader =
          websiteVersionsDataset.with("url", url).newReader();

      // Compute the next fetch time and write the record back
      long ts = computeNextFetchTime(websiteVersionReader);
      website.setNextFetchTime(ts);
      websitesDataset.put(website);
    }

Page 32: HBase Data Modeling and Access Patterns with Kite SDK

Kite HBase Module: Advanced Features

Page 33: HBase Data Modeling and Access Patterns with Kite SDK

Concurrency Control

• HBase doesn’t have native support for transactions.
• This missing feature can be problematic for newcomers.
• Single-row Puts are atomic, so the best practice is to prefer denormalizing data into wide rows.
• This doesn’t help for Get-Update-Put operations, though…

Page 34: HBase Data Modeling and Access Patterns with Kite SDK

Optimistic Concurrency Control

• Prevents multiple processes performing row updates from colliding.
• Enabled with an “occVersion” column mapping type.

    {
      "type": "record",
      "name": "Event",
      "namespace": "com.example",
      "mapping": [
        { "source": "id", "type": "key" },
        { "source": "ts", "type": "key" },
        { "source": "source", "type": "column", "value": "meta:source" },
        { "source": "version", "type": "occVersion" }
      ],
      "fields": [
        { "name": "id", "type": "long" },
        { "name": "ts", "type": "long" },
        { "name": "source", "type": "string" },
        { "name": "version", "type": "long" }
      ]
    }

Page 35: HBase Data Modeling and Access Patterns with Kite SDK

Optimistic Concurrency Control Continued…

• The version field is used to track the version in the row.
• Uses checkAndPut under the hood to ensure the row hasn’t been updated.
• You can’t put to an existing row without first fetching it.
• If a conflict occurs, put() on RandomAccessDataset will return false.
• A successful put() increments the version.
• It is up to the developer how to handle a conflict.
• Enables data protection for long-running edits, like shared editing in a web application.
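The version check described above can be sketched with an in-memory map standing in for the HBase table. Nothing here is Kite or HBase API: `OccSketch`, `Row`, and the convention that a new row is written with expected version 0 are assumptions for illustration, and `ConcurrentHashMap.compute` plays the role that checkAndPut plays in HBase.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class OccSketch {
    // A stored row: the payload plus the version used for conflict checks.
    public static final class Row {
        public final String value;
        public final long version;

        Row(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final ConcurrentMap<String, Row> rows = new ConcurrentHashMap<>();

    public Row get(String key) {
        return rows.get(key);
    }

    // Writes only if the stored version still equals expectedVersion
    // (0 means "row does not exist yet", an assumption of this sketch).
    // Returns false on conflict, mirroring put() returning false.
    public boolean put(String key, String value, long expectedVersion) {
        boolean[] written = { false };
        rows.compute(key, (k, current) -> {
            long currentVersion = (current == null) ? 0L : current.version;
            if (currentVersion != expectedVersion) {
                return current; // someone else updated the row first
            }
            written[0] = true;
            return new Row(value, currentVersion + 1);
        });
        return written[0];
    }

    public static void main(String[] args) {
        OccSketch table = new OccSketch();
        table.put("row1", "first", 0L);
        Row fetched = table.get("row1");
        System.out.println(table.put("row1", "second", fetched.version)); // accepted
        System.out.println(table.put("row1", "stale", fetched.version));  // rejected
    }
}
```

A caller that gets false back would typically re-fetch the row, reapply its change, and try again, or surface the conflict to the user.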

Page 36: HBase Data Modeling and Access Patterns with Kite SDK

Other Notable Advanced Features

• Schema Migrations
  • Users have the ability to add or remove fields from the Avro record schemas.
  • Kite SDK keeps the historical set of Avro schemas in a specially designated HBase table.
  • Kite SDK will verify that only valid schema migrations can occur.
• Composite Datasets
  • Users can create multiple datasets for a single HBase table.
  • This allows developers to atomically Get and Put multiple types of Avro records to a single row.
  • Kite SDK will verify that dataset column mappings don’t clash.

Page 37: HBase Data Modeling and Access Patterns with Kite SDK

Adam Warrington
@adamwar