HBase Secondary Indexing

Implementation and analysis of secondary index in HBase. Gino McCarty, Ranjan Kumar

Transcript of HBase Secondary Indexing

Page 1: HBase Secondary Indexing

HBase Secondary Indexing

Implementation and analysis of secondary index in HBase

Gino McCarty, Ranjan Kumar

Page 2: HBase Secondary Indexing

What’s next

● HBase Introduction

● HBase Scan

● Secondary index

● CoProcessors

● Secondary Index using CoProcessors

● Testing Infrastructure

● Benchmarks

● Challenges and conclusions

Page 3: HBase Secondary Indexing

HBase

● Open source, non-relational, distributed database built on top of Hadoop.

● Columnar data store, also called Tabular data store.

● Architecture - Hadoop, HDFS, Row keys, Column Families

Image source: http://blogs.igalia.com/dpino/2012/10/31/introduction-to-hbase-and-nosql-systems/

EmpId  Lastname  Firstname  Salary
1      Smith     Joe        40000
2      Jones     Mary       50000
3      Johnson   Cathy      44000

Row-oriented

1,Smith,Joe,40000;
2,Jones,Mary,50000;
3,Johnson,Cathy,44000;

Column-oriented

1,2,3;
Smith,Jones,Johnson;
Joe,Mary,Cathy;
40000,50000,44000;
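The two layouts above can be reproduced with a few lines of plain Java. This is only an illustrative sketch of the serialization order, not of how HBase stores data on disk; the class and method names (`LayoutSketch`, `rowOriented`, `columnOriented`) are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class LayoutSketch {
    // Serialize record by record: all fields of one row are stored together.
    static List<String> rowOriented(String[][] rows) {
        List<String> out = new ArrayList<>();
        for (String[] row : rows) {
            out.add(String.join(",", row) + ";");
        }
        return out;
    }

    // Serialize field by field: all values of one column are stored together.
    static List<String> columnOriented(String[][] rows) {
        List<String> out = new ArrayList<>();
        int columns = rows[0].length;
        for (int c = 0; c < columns; c++) {
            StringJoiner joiner = new StringJoiner(",", "", ";");
            for (String[] row : rows) {
                joiner.add(row[c]);
            }
            out.add(joiner.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] employees = {
            {"1", "Smith", "Joe", "40000"},
            {"2", "Jones", "Mary", "50000"},
            {"3", "Johnson", "Cathy", "44000"},
        };
        rowOriented(employees).forEach(System.out::println);
        columnOriented(employees).forEach(System.out::println);
    }
}
```

Reading one column (for example all salaries) touches a single line in the column-oriented output, while in the row-oriented output it touches every record.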

Page 4: HBase Secondary Indexing

HBase

Page 5: HBase Secondary Indexing

HBase Scan

Scan - API for data retrieval

● Scan scan = new Scan(startRow, stopRow);

● scan.addFamily(f);

● scan.addColumn(c);

Filters - control the amount of data returned to the client

● scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("3")));

Optimizations - caching, batching (a filtered scan still reads the entire table, which can lead to client timeouts and expired scanner leases)

● scan.setCaching(1000);

● scan.setBatch(1000);
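Why a filter alone does not avoid the scan can be seen in a small stand-alone model. The sketch below is not the HBase client API: it assumes a table can be approximated by a sorted `TreeMap` of row key to a single value, and `ScanSketch`, `scan`, and `rowsExamined` are hypothetical names. The start row is inclusive and the stop row exclusive, as in an HBase scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

public class ScanSketch {
    // Rows read by the most recent scan, whether or not they passed the filter.
    static int rowsExamined;

    // Range scan over [startRow, stopRow); the filter is applied to every row that is read.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow,
                             Predicate<String> valueFilter) {
        rowsExamined = 0;
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> row : table.subMap(startRow, stopRow).entrySet()) {
            rowsExamined++;
            if (valueFilter.test(row.getValue())) {
                matches.add(row.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("row1", "13");
        table.put("row2", "20");
        table.put("row3", "35");
        table.put("row4", "41");
        // Substring filter on the value, similar in spirit to ValueFilter + SubstringComparator.
        List<String> matches = scan(table, "row1", "row4", v -> v.contains("3"));
        System.out.println(matches + " returned, " + rowsExamined + " rows examined");
    }
}
```

The filter shrinks what is sent back, but the number of rows examined is set by the row key range alone, which is the cost a secondary index tries to avoid.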

Page 6: HBase Secondary Indexing

Secondary Index

● The way you design your row key affects everything

● A secondary index avoids full table scans. Indexes can be stored in a separate index table.

● Create the index on a column and store, for each value of that column, the matching row keys in a separate table.

● RegionServers use the index table to perform a selective scan.
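The index-table idea in the bullets above can be sketched with two sorted maps. This is a toy model under stated assumptions, not HBase or HIndex code: the index row key is taken to be `value + separator + dataRowKey`, the separator is a NUL character assumed never to occur in values or row keys, and all names (`IndexTableSketch`, `rowKeysFor`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class IndexTableSketch {
    // Separator assumed never to appear in carriers or row keys.
    static final char SEP = '\0';

    final TreeMap<String, String> dataTable = new TreeMap<>();   // rowKey -> carrier
    final TreeMap<String, String> indexTable = new TreeMap<>();  // carrier SEP rowKey -> ""

    // Write to the data table and keep the index table in step with it.
    void put(String rowKey, String carrier) {
        String previous = dataTable.put(rowKey, carrier);
        if (previous != null) {
            indexTable.remove(previous + SEP + rowKey); // drop the stale index entry
        }
        indexTable.put(carrier + SEP + rowKey, "");
    }

    // Selective lookup: a short range scan on the index instead of a full data scan.
    List<String> rowKeysFor(String carrier) {
        String start = carrier + SEP;
        String stop = carrier + (char) (SEP + 1);
        List<String> rowKeys = new ArrayList<>();
        for (String indexKey : indexTable.subMap(start, stop).keySet()) {
            rowKeys.add(indexKey.substring(start.length()));
        }
        return rowKeys;
    }

    public static void main(String[] args) {
        IndexTableSketch tables = new IndexTableSketch();
        tables.put("r1", "WN");
        tables.put("r2", "AA");
        tables.put("r3", "WN");
        System.out.println(tables.rowKeysFor("WN"));
    }
}
```

Every write now costs an extra index write, which is the space and write overhead paid for the faster lookup.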

Page 7: HBase Secondary Indexing

HBase and CoProcessor Architecture

Trigger-Based Observer System

● Region Observer

● Master Observer

● Log Observer

Endpoints (Stored Procedures)

● Server-side execution

● Distributed

● Can be called by clients or observers
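The difference between the two coprocessor kinds can be illustrated with a toy region in plain Java. Nothing here is the HBase coprocessor API: `PutObserver`, `ToyRegion`, and `countRowsWithValue` are hypothetical names, and the sketch only mirrors the idea that observers are hooks fired around an operation while endpoints are computations run next to the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class CoprocessorSketch {
    // Hypothetical hook interface, loosely modeled on the observer idea; not the HBase API.
    interface PutObserver {
        void postPut(String rowKey, String value);
    }

    // Toy region: a sorted key-value store that notifies observers after each write.
    static class ToyRegion {
        final TreeMap<String, String> rows = new TreeMap<>();
        final List<PutObserver> observers = new ArrayList<>();

        void put(String rowKey, String value) {
            rows.put(rowKey, value);
            for (PutObserver observer : observers) {
                observer.postPut(rowKey, value); // trigger-like hook
            }
        }

        // Endpoint-style call: runs where the data lives and returns only the result.
        long countRowsWithValue(String value) {
            return rows.values().stream().filter(value::equals).count();
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        List<String> auditLog = new ArrayList<>();
        region.observers.add((rowKey, value) -> auditLog.add(rowKey + "=" + value));
        region.put("r1", "WN");
        region.put("r2", "AA");
        region.put("r3", "WN");
        System.out.println(auditLog);
        System.out.println(region.countRowsWithValue("WN"));
    }
}
```

A secondary index maintained by coprocessors would use the observer hook to write the index entry whenever a data row is written.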

Page 8: HBase Secondary Indexing
Page 9: HBase Secondary Indexing

Secondary Indexing utilizing CoProcessors

● For every index we create:

  o Regions have their own assigned index

  o Indexes and regions are kept together

  o Immune to splits and server outages
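One way to keep index entries together with their region is to prefix every index row key with the start key of the region that holds the data row. The sketch below assumes that scheme and assumes region start keys never contain the NUL separator; it is a plain-Java illustration with hypothetical names (`ColocatedIndexSketch`, `regionStartFor`, `indexRowKey`), not code from HBase or HIndex.

```java
import java.util.TreeSet;

public class ColocatedIndexSketch {
    // Separator assumed never to appear in region start keys, values, or row keys.
    static final char SEP = '\0';

    // Start key of the region whose range contains the key (greatest start key <= key).
    static String regionStartFor(TreeSet<String> regionStartKeys, String key) {
        return regionStartKeys.floor(key);
    }

    // Index row key = region start key + indexed value + data row key.
    static String indexRowKey(TreeSet<String> regionStartKeys, String indexedValue,
                              String dataRowKey) {
        String regionStart = regionStartFor(regionStartKeys, dataRowKey);
        return regionStart + SEP + indexedValue + SEP + dataRowKey;
    }

    public static void main(String[] args) {
        // Three regions; the first region starts at the empty key.
        TreeSet<String> regionStartKeys = new TreeSet<>();
        regionStartKeys.add("");
        regionStartKeys.add("2008-05");
        regionStartKeys.add("2008-09");

        String dataRowKey = "2008-06-14-WN";
        String indexKey = indexRowKey(regionStartKeys, "WN", dataRowKey);

        // Both keys fall into the key range that starts at the same region start key.
        System.out.println(regionStartFor(regionStartKeys, dataRowKey)
                .equals(regionStartFor(regionStartKeys, indexKey)));
    }
}
```

Because the index key sorts into the same key range as the data row, an index table split on the same start keys keeps each index entry next to the data it points to.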

Page 10: HBase Secondary Indexing

Sample Code Implementation

● Optimizing for queries based on Cartesian products of key values

● Example: what was the air time of all carriers that flew on a Wednesday?

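// Create a table whose descriptor carries an index specification (HIndex API).
HBaseAdmin admin = new IndexAdmin(conf);
HTableDescriptor htd = new HTableDescriptor(TableName.valueOf(tableName));
HColumnDescriptor hcd = new HColumnDescriptor(columnFamily);
htd.addFamily(hcd);

// Index one column qualifier as a String; 10 is taken here to be the maximum indexed value length.
IndexSpecification iSpec = new IndexSpecification(indexName);
iSpec.addIndexColumn(hcd, indexColumnQualifier, ValueType.String, 10);

// Attach the index definition to the table descriptor, then create the table.
TableIndices tableIndices = new TableIndices();
tableIndices.addIndex(iSpec);
htd.setValue(Constants.INDEX_SPEC_KEY, tableIndices.toByteArray());
admin.createTable(htd);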

Page 11: HBase Secondary Indexing

Testing Environment

● Testing platforms:

  o Hadoop-2.2.0

  o HBase-0.98.8

Single Laptop Benchmark

● HBase pseudo-distributed mode

QEMU-Based 3-Node Virtual Cluster

● 2 GB RAM and 2 Intel i7 cores per node

● Data set: 7 million rows, Airline On-Time Statistics and Delay Causes

  - http://stat-computing.org/dataexpo/2009/the-data.html

  - Roughly 80 GB data store size per node

Page 12: HBase Secondary Indexing

HIndex

● Secondary index for HBase

● Implementation by Huawei developers

● Uses coprocessors to inject code into master and region servers.

● Creates an additional index table for every column on which an index is desired.

Page 13: HBase Secondary Indexing

HIndex conclusions

● Limited to HBase version 0.94.

● Must be built together with the HBase source, i.e. it is not available independently as a jar.

● Very little documentation and support.

● Unstable in many edge cases

Page 14: HBase Secondary Indexing

Benchmarks

Query1:

new RowFilter(CompareOp.EQUAL, new SubstringComparator("WN"));

38ms per record

Query2:

new ValueFilter(CompareOp.GREATER, new BinaryComparator(60));

3.9ms per record

Query    Time taken on single node (ms)  Total records on single node  Time taken on cluster (ms)  Total records on cluster
Query1   50910                           346435                        62291                       689409
Query2   207757                          3807652                       125777                      7785038

Average Performance over 3 Runs
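For comparing the runs, the table totals can be turned into a time per returned record by dividing total time by total records. The snippet below does only that division on the four table rows; `BenchmarkArithmetic` and `msPerRecord` are names chosen for this sketch. How the per-record figures quoted next to each query were measured is not stated on the slide, so they are not reproduced here.

```java
public class BenchmarkArithmetic {
    // Average time per returned record, in milliseconds.
    static double msPerRecord(long totalMs, long records) {
        return (double) totalMs / records;
    }

    public static void main(String[] args) {
        // Values copied from the benchmark table (average over 3 runs).
        System.out.printf("Query1 single node: %.4f ms/record%n", msPerRecord(50910, 346435));
        System.out.printf("Query1 cluster:     %.4f ms/record%n", msPerRecord(62291, 689409));
        System.out.printf("Query2 single node: %.4f ms/record%n", msPerRecord(207757, 3807652));
        System.out.printf("Query2 cluster:     %.4f ms/record%n", msPerRecord(125777, 7785038));
    }
}
```

In both queries the cluster returns roughly twice as many records as the single node, so the per-record figure is a fairer basis for comparison than the raw totals.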

Table name: flight_data

Keys: year, month, dayofMonth, dayOfWeek, Departure Time, Carrier

Column families and values:

  (flight) Month, dayOfWeek, Carrier, Flight Number, Origin, Destination

  (trip) Distance, AirTime, Arrival Delay

Page 15: HBase Secondary Indexing

Challenges & Conclusions

Indexing data across regions

● Co-locating index with data in the same region

● Make column family a part of the index

Handling region split

● Split the index table with a custom splitter that follows the row key distribution of the data table

Secondary indexes improve query performance at the expense of extra space.

CoProcessors add extra overhead to the processing of each query.

Page 16: HBase Secondary Indexing

Questions

References:

HBase - CoProcessors - https://blogs.apache.org/hbase/entry/coprocessor_introduction

HIndex - https://github.com/Huawei-Hadoop/hindex

HIndex Overview - http://www.slideshare.net/rajeshbabuchintaguntla/apache-con-hindex