Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert A secondary indexing framework for BigTable-style databases with HIVE integration Ed Kohlwey Cloud Computing Team

description

Ed Kohlwey's presentation at 2011 Hadoop Summit. Secondary indexing is a common design pattern in BigTable-like databases that allows users to index one or more columns in a table. This technique enables fast search of records in a database based on a particular column instead of the row id, thus enabling relational-style semantics in a NoSQL environment. This is accomplished by representing the index either in a reserved namespace in the table or another index table. Despite the fact that this is a common design pattern in BigTable-based applications, most implementations of this practice to date have been tightly coupled with a particular application. As a result, few general-purpose frameworks for secondary indexing on BigTable-like databases exist, and those that do are tied to a particular implementation of the BigTable model. We developed a solution to this problem called Culvert that supports online index updates as well as a variation of the HIVE query language. In designing Culvert, we sought to make the solution pluggable so that it can be used on any of the many BigTable-like databases (HBase, Cassandra, etc.). We will discuss our experiences implementing secondary indexing solutions over multiple underlying data stores, and how these experiences drove design decisions in creating the Culvert framework. We will also discuss our efforts to integrate HIVE on top of multiple indexing solutions and databases, and how we implemented a subset of HIVE's query language on Culvert.

Transcript of Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Page 1: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert

A secondary indexing framework for BigTable-style databases with HIVE integration

Ed Kohlwey, Cloud Computing Team

Page 2: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Session Agenda

• Secondary Indexing
• The Solution: Culvert
• Culvert Design & Architecture
• How It Works
• API Examples
• Where to Get It & Credits

Page 3: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Secondary Indexing

• General design pattern for an inverted index
  – Maintain a map from each value to the locations of the records/documents that contain it
• Lots of different variations
  – Term-partitioned index
  – Document-partitioned index
• Solves the problem of BigTable-style databases only having one primary key for records

Page 4: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Sample Inventory Application

Foo Table

RowID  | contact:city | contact:phone  | inventory:count | order:Apples
Apples |              |                | 5               |
John   | Springfield  | (999)-888-7777 |                 | 3
Pears  |              |                | 10              |

Sample Term-Partitioned Index Table (order:Apples Index)

RowID
3 -> Dave
3 -> John
17 -> Paul
20 -> Sue
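The term-partitioned index above can be pictured as a sorted map from each indexed value to the row IDs that hold it. The following is a minimal in-memory sketch under that assumption; `TermIndexSketch`, `add`, and `range` are hypothetical names chosen for illustration and are not part of Culvert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a term-partitioned index table.
// Keys are indexed values; each key maps to the sorted row IDs holding that value.
class TermIndexSketch {
    private final NavigableMap<Integer, TreeSet<String>> entries = new TreeMap<>();

    // Record that the row identified by rowId holds the given value.
    void add(int value, String rowId) {
        entries.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs whose indexed value lies in [low, high], listed in index (value) order.
    List<String> range(int low, int high) {
        List<String> rowIds = new ArrayList<>();
        for (TreeSet<String> ids : entries.subMap(low, true, high, true).values()) {
            rowIds.addAll(ids);
        }
        return rowIds;
    }

    public static void main(String[] args) {
        TermIndexSketch index = new TermIndexSketch();
        index.add(3, "Dave");
        index.add(3, "John");
        index.add(17, "Paul");
        index.add(20, "Sue");
        System.out.println(index.range(3, 17));
    }
}
```

Because the map is sorted by value, a range predicate becomes a contiguous scan of the index, which is the property a sorted BigTable-style index table provides.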

Page 5: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Sample Inventory Application

Foo Table

RowID | contact:comments
John  | John likes apples.
Sue   | Sue likes pears.

Sample Document-Partitioned Index Table (contact:comments Index)

Columns: apples:john, john:John, likes:John, likes:Sue, pears:Sue, sue:Sue
RowID 0x178df: - - -
RowID 0x32da4: - - -
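One way to read the document-partitioned layout is that each index row (partition) holds the terms of only the documents assigned to it, so a term lookup has to consult every partition. The sketch below illustrates that idea in memory. The partitioning rule (hashing the row ID), the tokenizer, and all class and method names are assumptions made for illustration, not Culvert's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch: each partition indexes only the documents assigned to it.
class DocPartitionedIndexSketch {
    private final List<Map<String, TreeSet<String>>> partitions = new ArrayList<>();

    DocPartitionedIndexSketch(int partitionCount) {
        for (int i = 0; i < partitionCount; i++) {
            partitions.add(new HashMap<>());
        }
    }

    // Assumption: a document's partition is chosen by hashing its row ID.
    private int partitionOf(String rowId) {
        return Math.floorMod(rowId.hashCode(), partitions.size());
    }

    // Tokenize the text and store every term in the document's own partition.
    void add(String rowId, String text) {
        Map<String, TreeSet<String>> partition = partitions.get(partitionOf(rowId));
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                partition.computeIfAbsent(term, t -> new TreeSet<>()).add(rowId);
            }
        }
    }

    // A term may appear in any partition, so a lookup consults all of them.
    TreeSet<String> lookup(String term) {
        TreeSet<String> rowIds = new TreeSet<>();
        for (Map<String, TreeSet<String>> partition : partitions) {
            rowIds.addAll(partition.getOrDefault(term.toLowerCase(), new TreeSet<>()));
        }
        return rowIds;
    }
}
```

The trade-off compared with a term-partitioned index is that writes for one document touch a single partition, while reads fan out across all partitions.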

Page 6: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

We found ourselves implementing these ideas over and over for clients.

Why not make a library?

Page 7: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Solution: Culvert

Page 8: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Requirements

• Support secondary indexing
• Support an analyst query environment
• Database Extensibility
  – There are actually a lot of BigTable implementations out there (HBase, Cassandra, proprietary)
• Internal Extensibility
  – There are lots of ways to index records
  – There are lots of ways to retrieve records
  – Separate retrieval operations from index implementation

Page 9: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

What Culvert Does

• Indexing
• Interface for queries (Java and HIVE)
• Abstraction mechanism for multiple underlying databases

Page 10: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert Design & Architecture

• Use sorted iterators to retrieve values
  – Lots of algorithms can be expressed as sorting (like people tend to do in Map/Reduce)
  – Optional “dumping” feature can provide parallelism
• Decorator design pattern is intuitive to interact with
• Allows streaming of results as they become available
• Uses Coprocessors to implement parallel operations
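To make the combination of sorted iterators, decoration, and streaming concrete, here is a small sketch of a decorator that wraps any sorted iterator and removes adjacent duplicates lazily, emitting each result as soon as it is known. `DistinctSortedIterator` is a hypothetical class written for this illustration and does not come from the Culvert code base; it assumes the wrapped iterator yields non-null values in ascending order.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical decorator: wraps a sorted iterator and drops adjacent duplicates lazily.
class DistinctSortedIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String next; // next value to hand out, or null when exhausted

    DistinctSortedIterator(Iterator<String> source) {
        this.source = source;
        advance();
    }

    // Pull from the wrapped iterator until a value different from the current one appears.
    private void advance() {
        String previous = next;
        next = null;
        while (source.hasNext()) {
            String candidate = source.next();
            if (!candidate.equals(previous)) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }
}
```

Because the decorator is itself an `Iterator`, it can be wrapped by further decorators, and the caller never needs the whole result set in memory.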

Page 11: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

[Figure: Architecture Diagram. Labels: Hive; Java API; Culvert Client-Side Operation (TableAdapter, Constraint, Client); Culvert Region-Side Operation (LocalTableAdapter, RemoteOp), shown twice.]

Page 12: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Constraint Architecture

• Used to express query predicate operations
  – projection and selection (SELECT)
  – set operations (AND/OR)
  – joins
• Decoupled from Indices
  – Currently focused on term-partitioned indices
  – Future work includes expanding document-partitioned index functionality
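As an illustration of a set operation over index results, an OR can be expressed as a merge of two sorted row ID streams. This is a sketch only: `SortedUnionSketch.or` is a hypothetical helper, not Culvert's constraint class, and it assumes both inputs are ascending and free of duplicates.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class SortedUnionSketch {
    // Union of two ascending, duplicate-free row ID streams; the output stays ascending.
    static List<String> or(Iterator<String> left, Iterator<String> right) {
        List<String> out = new ArrayList<>();
        String l = left.hasNext() ? left.next() : null;
        String r = right.hasNext() ? right.next() : null;
        while (l != null || r != null) {
            // Treat an exhausted side as "greater" so the other side drains.
            int cmp = (l == null) ? 1 : (r == null) ? -1 : l.compareTo(r);
            if (cmp <= 0) {
                out.add(l);
                if (cmp == 0) {
                    // Same row ID on both sides: emit once, advance both.
                    r = right.hasNext() ? right.next() : null;
                }
                l = left.hasNext() ? left.next() : null;
            } else {
                out.add(r);
                r = right.hasNext() ? right.next() : null;
            }
        }
        return out;
    }
}
```

The same merge structure works for any constraint that returns row IDs in sorted order, which is why the constraint layer can stay decoupled from how each index stores its data.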

Page 13: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Index Architecture

• Index is an abstract type
  – Defines how to store and use the index
• One index per column
  – Didn’t see a performance reason to index over multiple columns
  – Multiple indices complicate framework code
  – Map of “logical fields” was more easily maintained in the application
  – May evolve in the future

Page 14: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Index Architecture (cont.)

• One index table per index
  – Allows Index implementations to assume they don’t share the index table
  – Don’t need to worry about other Indices clobbering their table structure
  – Tables are assumed to be cheap

Page 15: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Table Adapters

• TableAdapter and LocalTableAdapter are abstraction mechanisms, roughly equivalent to HTable and HRegion

• RemoteOp is roughly equivalent to CoprocessorProtocol and is handled by TableAdapter and LocalTableAdapter

• Gives implementers fine-grained control over parallelism + table operations

Page 16: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Using Culvert With HIVE

• Why HIVE?
  – Already very popular
  – Take advantage of upstream advances
  – Good framework to “optimize later”
• Culvert implements a HIVE StorageHandler and PredicateHandler
• Facilitates analyst interaction with database
• Reduces the “SQL Gap”

Page 17: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

HIVE Culvert Input Format

• Handles AND, >, < query predicates based on indices

• Each index can be broken up into fragments based on region start and end keys
  – We take the cross-product of each index’s regions to create input splits for AND
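The cross-product step can be sketched as follows: given the list of regions for each index taking part in an AND, every combination of one region per index becomes one candidate input split. `SplitCrossProductSketch` and `KeyRange` are hypothetical names used for illustration. The real input format would produce Hadoop split objects, which this sketch does not attempt to model.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCrossProductSketch {
    // Hypothetical key range of one index region: [startKey, endKey).
    static final class KeyRange {
        final String startKey;
        final String endKey;

        KeyRange(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + ")";
        }
    }

    // Each returned inner list is one split: one region picked from every index.
    static List<List<KeyRange>> crossProduct(List<List<KeyRange>> regionsPerIndex) {
        List<List<KeyRange>> splits = new ArrayList<>();
        splits.add(new ArrayList<>());
        for (List<KeyRange> regions : regionsPerIndex) {
            List<List<KeyRange>> extended = new ArrayList<>();
            for (List<KeyRange> partial : splits) {
                for (KeyRange region : regions) {
                    List<KeyRange> combined = new ArrayList<>(partial);
                    combined.add(region);
                    extended.add(combined);
                }
            }
            splits = extended;
        }
        return splits;
    }
}
```

With two regions in one index and three in another, this yields six splits, so the number of splits grows multiplicatively with the number of indices in the predicate.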

Page 18: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

How It Works

Overview of Indexing Operations

Page 19: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Indexing

• Indices are built via insertion operations on the client (i.e. Client.put(…))

• Whether a field is indexed is controlled by a configuration file

• In the future, will support indexing of arbitrary columns via Map/Reduce
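The idea of building indices at insertion time can be sketched with an in-memory client that writes each value to the primary table and, when the column is configured as indexed, to that column's own index table. Everything here is hypothetical: `IndexingClientSketch` is not Culvert's `Client`, the set of indexed columns stands in for the configuration file, and the removal of stale index entries on overwrite is an added assumption rather than something the slides describe.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical client: names and structure are illustrative, not Culvert's real classes.
class IndexingClientSketch {
    // Primary table: row ID -> (column -> value).
    final Map<String, Map<String, String>> table = new HashMap<>();
    // One index table per indexed column: column -> (value -> sorted row IDs).
    final Map<String, TreeMap<String, TreeSet<String>>> indexTables = new HashMap<>();
    private final Set<String> indexedColumns;

    // Assumption: this set stands in for the index definitions in the configuration file.
    IndexingClientSketch(Set<String> indexedColumns) {
        this.indexedColumns = new HashSet<>(indexedColumns);
    }

    // Write to the primary table, then maintain the index if the column is indexed.
    void put(String rowId, String column, String value) {
        String previous = table.computeIfAbsent(rowId, r -> new HashMap<>()).put(column, value);
        if (!indexedColumns.contains(column)) {
            return;
        }
        TreeMap<String, TreeSet<String>> index =
            indexTables.computeIfAbsent(column, c -> new TreeMap<>());
        if (previous != null && index.containsKey(previous)) {
            index.get(previous).remove(rowId); // assumption: drop the stale entry on overwrite
        }
        index.computeIfAbsent(value, v -> new TreeSet<>()).add(rowId);
    }

    // Row IDs currently indexed under the given value of the given column.
    List<String> lookup(String column, String value) {
        TreeMap<String, TreeSet<String>> index = indexTables.get(column);
        if (index == null || !index.containsKey(value)) {
            return new ArrayList<>();
        }
        return new ArrayList<>(index.get(value));
    }
}
```

The caller only issues ordinary puts; whether an index is updated depends entirely on the configured columns, which mirrors the configuration-driven behaviour described above.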

Page 20: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Retrieval

• Query API is exposed via HIVE and Java
  – HIVE API delegates to Java API
  – Java API is based on subclasses of Constraint
• Focused on providing parallel, real-time query execution

Page 21: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Walkthrough of Logical Operations on Indices

Page 22: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Logical Operations on Indices

• Logical operations can be represented as a merge sort if we return the keys from the original table in sorted order

• Example: AND

orders:Apples Index

1 -> Dean

3 -> Susan

4 -> John

8 -> Paul

14 -> Renee

33 -> Sheryl

orders:Oranges Index

4 -> Dean

5 -> Susan

5 -> Paul

6 -> George

12 -> Karen

19 -> Tom

Page 23: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• First query each index

orders:Apples Index

1 -> Dean

3 -> Susan

4 -> John

8 -> Paul

14 -> Renee

33 -> Sheryl

orders:Oranges Index

4 -> Dean

5 -> Susan

5 -> Paul

6 -> George

12 -> Karen

19 -> Tom

Page 24: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Then order results for each index
• Happens on the region servers

Apples results: 1 -> Dean, 3 -> Susan
Oranges results: 5 -> Susan, 5 -> Paul, 6 -> George, 12 -> Karen, 19 -> Tom

Page 25: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Then order results for each index
• Happens on the region servers

Apples results: Dean, Susan
Oranges results: Susan, Paul, George, Karen, Tom

Page 26: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Then order results for each index
• Notice this happens on the region servers*

Apples results: Dean, Susan (Done)
Oranges results: Susan, Paul, George, Karen, Tom

Page 27: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Then order results for each index
• Notice this happens on the region servers*

Apples results: Dean, Susan (Done)
Oranges results: George, Karen, Paul, Susan, Tom (Done)

Page 28: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Then merge the sorted results on the client

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom

Page 29: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Dean is lowest, Dean is not on the head of all the queues, discard

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom

Page 30: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• George is lowest, George is not on the head of all queues, discard

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom

Page 31: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Continue…

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom

Page 32: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Susan is on the head of all the queues, return Susan

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom
Returned: Susan ✔

Page 33: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Apples < 3 AND Oranges > 5

• Tom is discarded, now we’re finished

Apples queue: Dean, Susan
Oranges queue: George, Karen, Paul, Susan, Tom
Returned: Susan ✔
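The client-side merge walked through on the preceding slides can be sketched as an intersection over the heads of sorted queues: take the lowest head, return it if it sits at the head of every queue, otherwise discard it. `SortedIntersectionSketch` is a hypothetical helper written for this illustration. It assumes each queue is ascending and duplicate-free, and it stops as soon as any queue is empty rather than discarding the remaining entries one by one, which gives the same result. The sample queues follow the walkthrough, whose results include the boundary values 3 and 5.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class SortedIntersectionSketch {
    // Intersect ascending, duplicate-free queues of row IDs (the AND of their constraints).
    static List<String> and(List<Deque<String>> queues) {
        List<String> matches = new ArrayList<>();
        while (true) {
            // Find the lowest head; an empty queue means nothing else can match.
            String lowest = null;
            for (Deque<String> queue : queues) {
                if (queue.isEmpty()) {
                    return matches;
                }
                if (lowest == null || queue.peek().compareTo(lowest) < 0) {
                    lowest = queue.peek();
                }
            }
            if (lowest == null) {
                return matches; // no queues were supplied
            }
            // Pop the lowest value wherever it is at the head; keep it only if it was on all heads.
            boolean onAllHeads = true;
            for (Deque<String> queue : queues) {
                if (queue.peek().equals(lowest)) {
                    queue.poll();
                } else {
                    onAllHeads = false;
                }
            }
            if (onAllHeads) {
                matches.add(lowest);
            }
        }
    }

    public static void main(String[] args) {
        Deque<String> apples = new ArrayDeque<>(List.of("Dean", "Susan"));
        Deque<String> oranges = new ArrayDeque<>(List.of("George", "Karen", "Paul", "Susan", "Tom"));
        System.out.println(and(List.of(apples, oranges)));
    }
}
```

Each queue is consumed once from the front, so the merge can run over streamed results without holding any full result set on the client.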

Page 34: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Joins

• Numerous methods possible
• A few examples
  – Use sub-queries to fetch related records
  – Use merge sorting to simultaneously fetch records satisfying both sides of the join, filter those that don’t match
• Presently, Culvert has only one join (sub-queries method)

Page 35: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)

JoinConstraint

User performs joins with a constraint (decorator design pattern)

Page 36: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)

Left SubConstraint

…John…

JoinConstraint

Constraint receives row IDs from a left sub-constraint.

Page 37: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)

order:Apples
…
John 5
…

Left SubConstraint

…John…

JoinConstraint

Constraint looks up field values for the left side (if not already present in the results)

Page 38: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)

order:Apples
…
John 5
…

Left SubConstraint

…John…

JoinConstraint

order:Oranges
…
George 5
Jane 5
…

For each record in the left result set, the constraint creates a new right-side constraint to fetch indexed items matching the right side of the constraint.

Page 39: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Example: Join Apple Order Size on Orange Order Size (order:Apples = order:Oranges)

order:Apples
…
John 5
…

Left SubConstraint

…John…

JoinConstraint

order:Oranges
…
George 5
Jane 5
…

Joined results
…
John 5 George
John 5 Jane
…

Finally, the joined records are returned.
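The sub-query join described on the preceding slides can be sketched as a loop over the left result set that issues one index lookup per left record. This is an in-memory illustration under stated assumptions: `SubQueryJoinSketch.join` is a hypothetical method, the left column is modelled as a map from row ID to value, and the right-side index is modelled as a sorted map from value to row IDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class SubQueryJoinSketch {
    // Returns {leftRowId, joinValue, rightRowId} triples for rows whose values are equal.
    static List<String[]> join(List<String> leftRowIds,
                               Map<String, Integer> leftColumn,
                               TreeMap<Integer, TreeSet<String>> rightIndex) {
        List<String[]> joined = new ArrayList<>();
        for (String leftRow : leftRowIds) {
            // Look up the left field value for this row.
            Integer value = leftColumn.get(leftRow);
            if (value == null) {
                continue;
            }
            // The "right-side constraint": fetch indexed rows with the same value.
            TreeSet<String> rightRows = rightIndex.get(value);
            if (rightRows == null) {
                continue;
            }
            for (String rightRow : rightRows) {
                joined.add(new String[] {leftRow, String.valueOf(value), rightRow});
            }
        }
        return joined;
    }
}
```

The cost of this approach is one right-side lookup per left record, which is why a merge-based join is listed as an alternative when both sides can be read in sorted order.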

Page 40: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert Java API Examples

• Goal: to be intuitive and easy to interact with
• Provide a simple relational API without forcing a developer to use SQL

Page 41: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert API Example: Insertion

    Configuration culvertConf = CConfiguration.getDefault();
    // index definitions are loaded implicitly from the
    // configuration
    Client client = new Client(culvertConf);
    List<CKeyValue> valuesToPut = Lists.newArrayList();
    valuesToPut.add(new CKeyValue(
        "foo".getBytes(), "bar".getBytes(), "baz".getBytes()));
    Put put = new Put(valuesToPut);
    client.put("tableName", put);

Page 42: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert API Example: Retrieval

    Configuration culvertConf = CConfiguration.getDefault();
    // index definitions are loaded implicitly from the configuration
    Client client = new Client(culvertConf);
    Index c1Index = client.getIndexByName("index1");
    Constraint c1Constraint = new IndexRangeConstraint(
        c1Index, new CRange("abba".getBytes(), "cadabra".getBytes()));
    Index[] c2Indices = client.getIndicesForColumn(
        "rabbit".getBytes(), "hat".getBytes());
    Constraint c2Constraint = new IndexRangeConstraint(
        c2Indices[0], new CRange("bar".getBytes(), "foo".getBytes()));
    Constraint and = new And(c1Constraint, c2Constraint);
    Iterator<Result> results = client.query("tablename", and);

Page 43: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Future Work

• (Re)Building Indices via Map/Reduce
• More index types
  – Document-partitioned
  – Others?
• More retrieval operations
• Profiling + tuning
• Storing configuration details in a table or in Zookeeper

Page 44: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Where to Get It*

http://github.com/booz-allen-hamilton/culvert

*Available 6/29/2011

Where to Tweet It

#culvert

Page 45: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Culvert Team

• Ed Kohlwey (@ekohlwey)
• Jesse Yates (@jesse_yates)
• Jeremy Walsh
• Tomer Kishoni (@tokbot)
• Jason Trost (@jason_trost)

Page 46: Culvert: A Robust Framework for Secondary Indexing of Structured and Unstructured Data

Questions?