HBase Secondary Indexing


Implementation and analysis of secondary index in HBase

Gino McCarty, Ranjan Kumar

What’s next

● HBase Introduction

● HBase Scan

● Secondary index

● CoProcessors

● Secondary Index using CoProcessors

● Testing Infrastructure

● Benchmarks

● Challenges and conclusions

HBase

● Open source, non-relational, distributed database built on top of Hadoop.

● Columnar data store, also called Tabular data store.

● Architecture - Hadoop, HDFS, Row keys, Column Families

Image source: http://blogs.igalia.com/dpino/2012/10/31/introduction-to-hbase-and-nosql-systems/

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

Row-oriented

1,Smith,Joe,40000;

2,Jones,Mary,50000;

3,Johnson,Cathy,44000;

Column-oriented

1,2,3;

Smith,Jones,Johnson;

Joe,Mary,Cathy;

40000,50000,44000;
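The two layouts above can be reproduced with a small sketch. This is plain Java with no HBase dependency; the class and method names (`LayoutSketch`, `rowOriented`, `columnOriented`) are made up for illustration, and it only shows how the same records serialize in each orientation, not how HBase stores data on disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Serializes a small table row by row and column by column. */
class LayoutSketch {
    // The employee table from the slide: EmpId, Lastname, Firstname, Salary.
    static final String[][] ROWS = {
        {"1", "Smith", "Joe", "40000"},
        {"2", "Jones", "Mary", "50000"},
        {"3", "Johnson", "Cathy", "44000"},
    };

    /** Row-oriented: all fields of one record are written next to each other. */
    static List<String> rowOriented(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(String.join(",", row) + ";");
        }
        return out;
    }

    /** Column-oriented: all values of one column are written next to each other. */
    static List<String> columnOriented(String[][] rows) {
        List<String> out = new ArrayList<>();
        int columns = rows[0].length;
        for (int c = 0; c < columns; c++) {
            List<String> values = new ArrayList<>();
            for (String[] row : rows) {
                values.add(row[c]);
            }
            out.add(String.join(",", values) + ";");
        }
        return out;
    }
}
```

Reading a single column (for example, every Salary) touches one contiguous run in the column-oriented form, while the row-oriented form has to walk every record.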

HBase Scan

Scan - API for data retrieval

● Scan scan = new Scan(startRow, stopRow);

● scan.addFamily(f);

● scan.addColumn(c);

Filters - control the amount of data returned to the client

● scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("3")));

Optimizations - caching, batching (the scan still reads the entire table, risking client timeouts and expired scanner leases)

● scan.setCaching(1000);

● scan.setBatch(1000);
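Why a value filter does not avoid the scan can be seen in a minimal sketch. It assumes a sorted in-memory map as a stand-in for an HBase table; `ScanSketch` and its members are hypothetical names, not part of the HBase client API. The start row is inclusive and the stop row exclusive, as with `Scan(startRow, stopRow)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory stand-in for a table scan with a substring value filter. */
class ScanSketch {
    // Sorted map standing in for an HBase table: row key -> one cell value.
    final TreeMap<String, String> table = new TreeMap<>();
    // Counts every row the scan had to read, matching or not.
    int rowsExamined = 0;

    /** Range scan: start row inclusive, stop row exclusive. */
    List<String> scan(String startRow, String stopRow, String valueSubstring) {
        List<String> matches = new ArrayList<>();
        Map<String, String> range = table.subMap(startRow, true, stopRow, false);
        for (Map.Entry<String, String> e : range.entrySet()) {
            rowsExamined++;
            // The filter only decides what is returned; the row was still read.
            if (e.getValue().contains(valueSubstring)) {
                matches.add(e.getKey());
            }
        }
        return matches;
    }
}
```

The filter shrinks the result that is returned, but `rowsExamined` still grows with the size of the scanned range, which is the cost a secondary index tries to avoid.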

Secondary Index

● The way you design your row key affects everything

● Secondary index to avoid table scans. Indexes can be stored in another index table.

● Create the index on a column and store the row keys corresponding to that column in a separate table.

● RegionServers use the index table to perform a selective scan.
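The index-table idea above can be sketched without HBase. The example assumes one indexed column per row and uses two sorted maps, one for the data table and one for the index table; `IndexTableSketch`, `put` and `lookup` are illustrative names only.

```java
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** A data table plus a separate index table mapping column value -> row keys. */
class IndexTableSketch {
    // Data table: row key -> value of the indexed column (e.g. carrier code).
    final TreeMap<String, String> dataTable = new TreeMap<>();
    // Index table: column value -> sorted set of row keys holding that value.
    final TreeMap<String, TreeSet<String>> indexTable = new TreeMap<>();

    /** Writes a row and keeps the index table in step with it. */
    void put(String rowKey, String value) {
        String old = dataTable.put(rowKey, value);
        if (old != null) {
            // Drop the stale index entry left by the previous value.
            TreeSet<String> keys = indexTable.get(old);
            keys.remove(rowKey);
            if (keys.isEmpty()) {
                indexTable.remove(old);
            }
        }
        indexTable.computeIfAbsent(value, v -> new TreeSet<>()).add(rowKey);
    }

    /** Returns the row keys for a value without touching the data table. */
    SortedSet<String> lookup(String value) {
        TreeSet<String> keys = indexTable.get(value);
        if (keys == null) {
            return new TreeSet<>();
        }
        return Collections.unmodifiableSortedSet(keys);
    }
}
```

A lookup reads one index entry and then only the matching rows, at the price of the extra table and the extra write on every put.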

HBase and CoProcessor Architecture

Trigger-Based Observer System

● Region Observer

● Master Observer

● Log Observer

EndPoints (Stored Procedures)

● Server Side Execution

● Distributed

● Can be called by Clients or Observers
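The trigger-like behaviour of an observer can be illustrated with a small sketch. It does not use the HBase coprocessor classes: `PutObserver`, `RegionSketch` and `IndexingObserver` are hypothetical stand-ins, and the hook is simply called after each write so that an index stays current without any client involvement. Overwrites are not handled here, so a changed value would leave a stale index entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical stand-in for an observer hook; not the HBase interface. */
interface PutObserver {
    void postPut(String rowKey, String value);
}

/** A region that notifies its registered observers after every put. */
class RegionSketch {
    final TreeMap<String, String> rows = new TreeMap<>();
    private final List<PutObserver> observers = new ArrayList<>();

    void register(PutObserver observer) {
        observers.add(observer);
    }

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
        // Server-side trigger: runs inside the region, not in the client.
        for (PutObserver observer : observers) {
            observer.postPut(rowKey, value);
        }
    }
}

/** Observer that maintains index entries keyed by value + separator + row key. */
class IndexingObserver implements PutObserver {
    static final char SEP = '\0';
    final TreeMap<String, String> index = new TreeMap<>();

    @Override
    public void postPut(String rowKey, String value) {
        // Entries for one value sort next to each other in the index.
        index.put(value + SEP + rowKey, rowKey);
    }

    /** Prefix range scan over the index for a single value. */
    List<String> rowKeysFor(String value) {
        String from = value + SEP;
        String to = value + (char) (SEP + 1);
        return new ArrayList<>(index.subMap(from, true, to, false).values());
    }
}
```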

Secondary Indexing utilizing CoProcessors

● For every index we create

o Regions have their own assigned index

o Indexes and regions are kept together

o Immune to Splits and Server outages

Sample Code Implementation

● Optimizing for Queries based on Cartesian Products of Key Values

● Example: what was the air time of all carriers that flew on a Wednesday?

// Create a table whose descriptor carries an HIndex index specification.
HBaseAdmin admin = new IndexAdmin(conf);
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(tableName));
HColumnDescriptor hcd = new HColumnDescriptor(columnFamily);
htd.addFamily(hcd);

// Index one column qualifier as a String (10 is the maximum indexed value length).
IndexSpecification iSpec = new IndexSpecification(indexName);
iSpec.addIndexColumn(hcd, indexColumnQualifier, ValueType.String, 10);

// Attach the index definition to the table descriptor, then create the table.
TableIndices tableIndices = new TableIndices();
tableIndices.addIndex(iSpec);
htd.setValue(Constants.INDEX_SPEC_KEY, tableIndices.toByteArray());
admin.createTable(htd);

Testing Environment

● Testing Platforms:

o Hadoop-2.2.0

o HBase-0.98.8

● Single Laptop Benchmark: HBase Pseudo Distributed Mode

● QEMU Based 3 Node Virtual Cluster: 2GB RAM and 2 Core Intel i7 per node

● Data Set: 7 Million Rows, Airline On-Time Statistics and Delay Causes

- http://stat-computing.org/dataexpo/2009/the-data.html

- Roughly 80GB DataStore Size Per Node


HIndex

● Secondary Index for HBase

● Implementation by Huawei developers

● Uses coprocessors to inject code into master and region servers.

● Creates an additional index table for every column on which an index is desired.

HIndex conclusions

● Limited to HBase 0.94.

● Must be built together with the HBase source, i.e. it is not available independently as a jar.

● Very little documentation and support.

● Unstable in many edge cases

Benchmarks

Query1:

new RowFilter(CompareOp.EQUAL, new SubstringComparator("WN"));

38ms per record

Query2:

new ValueFilter(CompareOp.GREATER, new BinaryComparator(60));

3.9ms per record

Query    Time taken on single node (ms)   Total records on single node   Time taken on cluster (ms)   Total records on cluster
Query1   50910                            346435                         62291                        689409
Query2   207757                           3807652                        125777                       7785038

Average Performance over 3 Runs

Table Name: flight_data

Keys: year, month, dayofMonth, dayOfWeek, Departure Time, Carrier

Column Families and Values:

(flight) Month, dayOfWeek, Carrier, Flight Number, Origin, Destination

(trip) Distance, AirTime, Arrival Delay

Challenges & Conclusions

Indexing data across regions

● Co-locating index with data in the same region

● Make column family a part of the index

Handling region split

● Split the index table by having a custom splitter as per the row key distribution of the data table
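One way to keep index entries in the same key range as the data they point to is to prefix each index row key with the start key of the data row's region. The sketch below assumes that scheme and assumes the index table is split at the same start keys as the data table; `ColocatedIndexKeySketch` and its methods are hypothetical names, and the key format is an illustration rather than the layout of any specific implementation.

```java
import java.util.TreeSet;

/** Builds index row keys that sort into the same region range as their data rows. */
class ColocatedIndexKeySketch {
    static final char SEP = '\0';
    // Region start keys of the data table; "" is the start of the first region.
    final TreeSet<String> regionStartKeys = new TreeSet<>();

    ColocatedIndexKeySketch(String... startKeys) {
        regionStartKeys.add("");
        for (String key : startKeys) {
            regionStartKeys.add(key);
        }
    }

    /** Start key of the region that owns the given row key. */
    String regionFor(String rowKey) {
        // Greatest start key <= rowKey; never null because "" is always present.
        return regionStartKeys.floor(rowKey);
    }

    /** Index row key: region start key, then indexed value, then data row key. */
    String indexRowKey(String value, String dataRowKey) {
        return regionFor(dataRowKey) + SEP + value + SEP + dataRowKey;
    }
}
```

Because the index key begins with the region start key, looking it up against the same split points returns the same region, so a split of the data table can be mirrored on the index table.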

Secondary indexes improve query performance at the expense of extra space.

CoProcessors add extra overhead to the processing of each query.

Questions

References:

HBase - CoProcessors - https://blogs.apache.org/hbase/entry/coprocessor_introduction

HIndex - https://github.com/Huawei-Hadoop/hindex

HIndex Overview - http://www.slideshare.net/rajeshbabuchintaguntla/apache-con-hindex

