Alternatives to Apache Accumulo’s Java API
© Josh Elser 2015, Hortonworks
Alternatives to Apache Accumulo’s Java API
Josh Elser
@josh_elser (@hortonworks)
Or…
I’m really tired of having to write Java
code all the time and I want to use
something else.
Or…
OK, I’ll still write Java, but I’m 110%
done with re-writing the same
boilerplate to parse CLI args, convert
records into a standard format, deal
with concurrency and retry server-side
errors...
You have options
There is life after Accumulo’s Java API:
● Apache Pig
● Apache Hive
● Accumulo’s “Thrift Proxy”
● Other JVM-based languages
● Cascading/Scalding
● Spark
Lots of integration points, lots of considerations:
We avoid numbering each consideration because
each differs in importance depending on the application.
Every decision has an effect
Maturity
Stability
Performance
Extensibility
Ease of use
Maturity
How well-adopted is the code you’re using?
Where does the code live? Is there a structured community
or is it just sitting in a Github repository?
Can anyone add fixes and improvements? Are they
merged/accepted (when someone provides them)?
Are there tests and are they actually run?
Are releases made and published regularly?
Your own code is difficult enough to maintain.
Stability
Is there a well-defined user-facing API to use?
Cross-project integrations are notorious for making
assumptions about how you should use the code.
Does the integration produce the same outcomes that the
“native” components do?
Can users reliably expect code to work across versions?
Using some external integration should feel like using
the project without that integration. Code that worked
once should continue to work.
Performance
Does the code run quickly enough?
Can you saturate your physical resources with ease?
Do you have to spend days in a performance tool
reworking how you use the API?
Does the framework spend excessive amounts of time
converting types into its own representations?
Can you get an answer in an acceptable amount of time?
Each use case has its own set of performance
requirements. Experimentation is necessary.
Ease of Use
Can you write the necessary code in a reasonable
amount of time?
Goes back to: “Am I sick of writing verbose code (Java)?”
Choosing the right tool can drastically reduce the amount of
code to write.
Can the solution to your problem be reasonably expressed
in the required language?
Using a library should feel natural and enjoyable to
write while producing a succinct solution.
Extensibility
Does the integration support enough features of the
underlying system?
Can you use the novel features of the underlying system
via the integration?
Can custom parsing/processing logic be included?
How much external configuration/setup is needed before
you can invoke your code?
Using an integration should not require sacrifice in the
features of the underlying software.
Apply it to Accumulo!
Let’s take these 5 points and see how they apply to some
of the more well-defined integration projects.
We’ll use Accumulo’s Java API as the reference point for
how we judge other projects.
Accumulo Java API
Reference implementation on how clients use Accumulo.
Composed of Java methods and classes, each subject to
intense scrutiny of its value and effectiveness.
M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily
critiqued.
S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.
P: High-performance, typically limited by network and server-side impl.
EoU: Verbose and often pedantic. Implements low-level, key-value-centric
operations, not high-level application functions.
E: Provides well-defined building blocks for implementing custom libraries and
exposes methods for interacting with all Accumulo features.
Apache Pig
Apache Pig is a platform for analyzing large data sets that
consists of a high-level language for expressing data
analysis programs.[1]
Default execution runs on YARN (MapReduce and Tez).
Pig is often adored for its fast prototyping and data analysis abilities
with “Pig Latin”: functions which perform operations on Tuples.
Pig Latin allows for very concise solutions to problems.
Pig’s LoadFunc/StoreFunc interfaces enable AccumuloStorage
1. http://pig.apache.org
Apache Pig
-- Load a text file of data
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
-- Group records by student
B = GROUP A BY name;
-- Average GPAs per student
C = FOREACH B GENERATE group, AVG(A.gpa);
3 lines of Pig Latin; the same would take hundreds of lines of Java just to
read the data.
AccumuloStorage introduced in Apache Pig 0.13.0
Maps each tuple into an Accumulo row.
Very easy to both write/read data to/from Accumulo.
STORE flights INTO 'accumulo://flights?instance=...' USING
org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
'carrier_name,src_airport,dest_airport,tail_number');
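As a rough illustration (not the actual AccumuloStorage implementation), the tuple-to-row mapping can be sketched in a few lines of Python: the first tuple field becomes the Accumulo row ID, and each remaining field becomes one column/value pair under the configured column names. The flight record and helper below are made up for this sketch.

```python
def tuple_to_mutations(tup, columns):
    """Map a Pig-style tuple to (row, column, value) triples.

    The first field is treated as the row ID; remaining fields pair up
    with the configured column names, one key-value per field.
    """
    row, *values = tup
    return [(row, col, val) for col, val in zip(columns, values)]

# Hypothetical flight record matching the STORE example's column list
flight = ("UA1234", "United", "IAD", "SFO", "N12345")
cols = ("carrier_name", "src_airport", "dest_airport", "tail_number")

for triple in tuple_to_mutations(flight, cols):
    print(triple)  # e.g. ('UA1234', 'carrier_name', 'United')
```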
Apache Pig
Pig enables users to perform lots of powerful data
manipulation and computation tasks with little code, but
requires users to learn Pig Latin, which is unique to Pig.
M: Apache Pig is a very well-defined community with its own processes.
S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have
edge cases which are unsupported.
P: Often suffers from the under-optimization that comes with generalized
MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as
fast as well-architected, hand-written code.
EoU: Very concise and easy to use. Comes with most of the same
drawbacks of dynamic programming languages. Not straightforward to test.
E: Requires user intervention to create/modify tables with custom configuration
and splits. Column visibility on a per-cell basis is poorly represented because
Pig Latin doesn’t have the ability to support it well.
Apache Hive
Apache Hive is data warehouse software that facilitates
querying and managing large datasets residing in
distributed storage.[1]
One of the “old-time” SQL-on-Hadoop software projects.
Fought hard against the “batch-only” stigma recently, building on top of
Tez for “interactive queries”
Defines Hive Query Language (HQL) which is close to, but not quite,
compatible with the SQL-92 standard.
Defines extension points which allow for external storage engines known as StorageHandlers.
1. http://hive.apache.org
Apache Hive
# Create a Hive table from the Accumulo table “my_table”
> CREATE TABLE my_table(uid string, name string, age int, height int)
STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
WITH SERDEPROPERTIES ("accumulo.columns.mapping" =
":rowID,person:name,person:age,person:height"
);
# Run “SQL” queries
> SELECT name, height, uid FROM my_table ORDER BY height;
Like Pig, simple queries can be executed with very little code, and each
record maps into an Accumulo row.
Unlike Pig, generating these tables in Hive itself is often difficult and is reliant
upon first creating a “native” Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table.
AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the
use of Tez, “point” queries on the rowID can be executed extremely quickly:
> SELECT * FROM my_table WHERE uid = "12345";
Apache Hive
Using SQL to query Accumulo is a refreshing change, but
the write-path with Hive leaves a bit to be desired. Will
often require data ingest through another tool.
M: Apache Hive is a very well-defined community with its own processes.
S: HQL sometimes feels a bit clunky due to limitations of the
StorageHandler interface.
P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to
optimize query execution and reduce MapReduce overhead. Translating
Accumulo Key-Values to Hive’s types can be expensive as well.
EoU: HQL as it stands is close enough to make those familiar with SQL feel
at home. Some oddities to work around, but are typically easy to deal with.
E: Like Pig, Hive also suffers from the lack of an ability to represent features
like cell-level visibility. Some options, like table configuration, are exposed
through Hive, but most cases will require custom manipulation and
configuration of Accumulo tables before using Hive.
Accumulo “Thrift Proxy”
Apache Thrift is a software framework which combines a
software stack with a code generation engine to build
cross-language services.[1]
Thrift is the software that Accumulo builds its client-server RPC service
on.
Thrift provides desirable features such as optional message fields and
well-performing abstractions over the low-level details such as
threading and connection management.
Clients and servers don’t need to be implemented in the same
language as each other.
1. http://thrift.apache.org
Accumulo “Thrift Proxy”
Clients could implement the necessary code to speak
directly to the Accumulo Master and TabletServers, but
that is an extremely large undertaking.
Accumulo provides an optional “Proxy” process which
provides a Java API-like interface over Thrift instead of
the low-level RPC Thrift API.
Accumulo bundles Python and Ruby client bindings by
default. Generating other languages is simple when Thrift
is already installed.
Accumulo “Thrift Proxy”
Ruby:
unless proxy.tableExists(login, table)
  proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS)
end
update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1",
  'colQualifier' => "cq1", 'value' => "a"})
update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2",
  'colQualifier' => "cq2", 'value' => "b"})
proxy.updateAndFlush(login, table, {'row1' => [update1, update2]})
cookie = proxy.createScanner(login, table, nil)
result = proxy.nextK(cookie, 10)
result.results.each { |keyvalue|
  puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" }

Python:
if not client.tableExists(login, table):
    client.createTable(login, table, True, TimeType.MILLIS)
row1 = {'a': [ColumnUpdate('a', 'a', value='value1'),
              ColumnUpdate('b', 'b', value='value2')]}
client.updateAndFlush(login, table, row1)
cookie = client.createScanner(login, table, None)
for entry in client.nextK(cookie, 10).results:
    print entry
Accumulo “Thrift Proxy”
The first noticeable difference between implementations is
that a Python or Ruby client will perform much worse than
a native Java client.
Some of the performance loss is likely due to using a
dynamic language. Your experience in the language is
relevant, too.
Most of the performance loss is due to passing all requests
through the Proxy before it reaches TabletServers.
Proxy servers are not highly available and would require
manual load balancing. Single-client environments work
well, but many active clients will overload a Proxy.
Accumulo “Thrift Proxy”
The novelty of using languages like Python and Ruby to
interact with Accumulo is enjoyable. The Proxy’s
architecture will not scale well past a few clients.
M: The Proxy isn’t widely (publicly) used but is generally maintained by devs.
S: Because the Proxy server API isn’t in the Accumulo Public API, no
guarantees are made on its methods.
P: High availability and load balancing are left to users to solve. Will take
significant engineering effort to smartly scale to supporting many clients.
EoU: Thrift tends to generate decent code to work with for each supported
language which makes writing clients feel relatively natural.
E: The generated client code for each language could easily be extended to act
more like an ORM. The full spectrum of Accumulo’s Java API should be
exposed via the Proxy, which doesn’t impose limitations on its use.
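To illustrate the ORM point, here is a hypothetical Python sketch of such a wrapper. All names here are made up: `RecordingClient` stands in for the generated Thrift proxy client so the sketch is self-contained, and `Table` shows the kind of friendlier layer a user could build on top.

```python
class RecordingClient:
    """Stub standing in for the generated Thrift proxy client;
    it just records updateAndFlush calls so the sketch runs alone."""
    def __init__(self):
        self.flushed = []

    def updateAndFlush(self, login, table, cells):
        self.flushed.append((table, cells))


class Table:
    """Thin, ORM-ish convenience wrapper over a proxy client for one table."""
    def __init__(self, client, login, name):
        self.client, self.login, self.name = client, login, name

    def put(self, row, updates):
        # updates: {column: value} -> proxy-style per-row list of cell updates
        cells = {row: list(updates.items())}
        self.client.updateAndFlush(self.login, self.name, cells)


client = RecordingClient()
people = Table(client, "login-token", "people")
people.put("row1", {"person:name": "Josh"})
print(client.flushed)  # -> [('people', {'row1': [('person:name', 'Josh')]})]
```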
Cascading and Spark
Apache Spark has been causing big waves in the Hadoop
community for the past year, touted as everything from a
complete replacement for MapReduce to a complementary
technology.
Cascading (not at the ASF, but ASLv2-licensed) is an abstraction
layer on top of various Hadoop components. It’s been
around for quite some time now and is well-received.
Both suffer from a lack of well-defined upstream Accumulo
adoption within their respective communities. Snippets can
be found online, but they’re typically end-user developed
additions.
Lots of opportunities for users to step up and improve each!
Clojure and Scala
Clojure and Scala are both examples of languages which
run on the JVM that are not Java.
These languages should both natively support the
Accumulo Java API, although it’s somewhat uncharted
territory that may have subtle bugs (ACCUMULO-3718).
GitHub has a smattering of example code, but definitive
resources for both Clojure and Scala are lacking.
Lots of opportunity for users to step up and improve
support for these languages!
Concrete Comparison
Let’s do a comparison of the effort needed to analyze some real data. Stanford
hosts a collection of Amazon reviews (~35M records, ~14G gzipped) that are
available for use.[1] Reviews retain their category from Amazon (e.g. Books,
Music, Instant Video) as well as some metadata such as the user who made
the review, the score, and the review text. Scores are integer values
between 1 and 5, inclusive.
The steps taken were as follows:
1. Convert the raw files into CSV (custom Java code)
2. Insert the data into an Accumulo table (custom Java code)
3. Answer a query using the Accumulo Java API, Pig and Hive.
The question is relatively simple and (hopefully) representative of a practical
problem to solve: compute the average review score on books for each
identified user. If I made two book reviews with scores 1 and 5, the query
would return a value of 3 for me, as (1 + 5) / 2 = 3.
1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics:
understanding rating dimensions with review text. RecSys, 2013.
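The query logic itself is tiny: filter to book reviews, group scores by user, average each group. A toy Python sketch of just those steps (the in-memory records below are made up and stand in for the Accumulo scan):

```python
from collections import defaultdict

# Made-up (user, category, score) records standing in for the table scan
reviews = [
    ("user1", "Books", 1),
    ("user1", "Books", 5),
    ("user1", "Music", 2),   # filtered out: not a book review
    ("user2", "Books", 4),
]

# Filter to books and group scores by user
scores = defaultdict(list)
for user, category, score in reviews:
    if category == "Books":
        scores[user].append(score)

# Average each user's book scores
averages = {user: sum(s) / len(s) for user, s in scores.items()}
print(averages)  # -> {'user1': 3.0, 'user2': 4.0}
```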
Concrete Comparison
To answer the question, we need to scan Accumulo, apply two filters, group
reviews for the same user together and compute an average. I wrote a simple
parser and ingester in ~750 lines of Java (leveraging some libraries).
Accumulo Java API:
A single-threaded client which performs all of this in memory can be
achieved in 162 lines of code. Doesn’t use any custom iterators. Not a
MapReduce job so the grouping phase must fit in memory. More work is
needed to actually scale this solution.
Pig:
1 line of Pig Latin to define the relation (table), 4 lines which perform the
computations and 1 line to output the data to the console.
Hive:
1 line to register our Accumulo table as a Hive table, and 1 HQL
statement.
Both Pig and Hive also have the ability to run as MapReduce jobs, which
means they can handle much larger datasets automatically.
Takeaways
Take stock of your application needs and run your own
experiments!
Every approach has its pros and cons, with the Accumulo
Java API really only suffering from the verbosity and
boilerplate of Java applications themselves.
Because each application is different, it’s important to take
stock of which problems need to be solved, which can be
“hacked”, and which can be completely ignored.
Whatever you do choose, make an effort to contribute back
to the community in some way!
Credit where credit is due
Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J.
McAuley and J. Leskovec. Hidden factors and hidden topics: understanding
rating dimensions with review text. RecSys, 2013.
Other code used for the experiments:
● Parser, ingester, and query code: https://github.com/joshelser/as2015
● Library to help ingest the data: https://github.com/joshelser/cereal
Names (Apache, Apache $Project, and $Project) and logos are trademarks of the
ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and
Thrift.
The Cascading logo used was from http://www.cascading.org/
The Clojure logo used was from http://clojure.org/
The Scala logo used was copied from http://www.scala-lang.org/