Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API

© Josh Elser 2015, Hortonworks

Alternatives to Apache Accumulo’s Java API

Josh Elser, @josh_elser (@hortonworks)


Or…I’m really tired of having to write Java code all the time and I want to use something else.


Or…OK, I’ll still write Java, but I’m 110% done with re-writing the same boilerplate to parse CLI args, convert records into a standard format, deal with concurrency and retry server-side errors...


You have options

There is life after Accumulo’s Java API:

● Apache Pig

● Apache Hive

● Accumulo’s “Thrift Proxy”

● Other JVM-based languages

● Cascading/Scalding

● Spark


Lots of integration points, lots of considerations:

We want to avoid numbering the considerations because each differs in importance depending on the application.

Every decision has an effect

Maturity

Stability

Performance

Extensibility

Ease of use


Maturity

How well-adopted is the code you’re using?

Where does the code live? Is there a structured community or is it just sitting in a GitHub repository?

Can anyone add fixes and improvements? Are they merged/accepted (when someone provides them)?

Are there tests and are they actually run?

Are releases made and published regularly?

Your own code is difficult enough to maintain.


Stability

Is there a well-defined user-facing API to use?

Cross-project integrations are notorious for making assumptions about how you should use the code.

Does the integration produce the same outcomes that the “native” components do?

Can users reliably expect code to work across versions?

Using some external integration should feel like using the project without that integration. Code that worked once should continue to work.


Performance

Does the code run quickly enough?

Can you saturate your physical resources with ease?

Do you have to spend days in a performance tool reworking how you use the API?

Does the framework spend excessive amounts of time converting data into its own types?

Can you get an answer in an acceptable amount of time?

Each use case has its own set of performance requirements. Experimentation is necessary.


Ease of Use

Can you write the necessary code in a reasonable amount of time?

This goes back to: “Am I sick of writing verbose code (Java)?”

Choosing the right tool can drastically reduce the amount of code to write.

Can the solution to your problem be reasonably expressed in the required language?

Using a library should feel natural and enjoyable to write while producing a succinct solution.


Extensibility

Does the integration support enough features of the underlying system?

Can you use the novel features of the underlying system via the integration?

Can custom parsing/processing logic be included?

How much external configuration/setup is needed before you can invoke your code?

Using an integration should not require sacrifice in the features of the underlying software.


Apply it to Accumulo!

Let’s take these 5 points and see how they apply to some of the more well-defined integration projects.

We’ll use Accumulo’s Java API as the reference point for how we judge other projects.


Accumulo Java API

Reference implementation of how clients use Accumulo. Composed of Java methods and classes, with extreme assessment of the value and effectiveness of each.

M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily critiqued.

S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.

P: High-performance, typically limited by network and server-side impl.

EoU: Verbose and often pedantic. Implements low-level, key-value-centric operations, not high-level application functions.

E: Provides well-defined building blocks for implementing custom libraries and exposes methods for interacting with all Accumulo features.
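To make “key-value centric” concrete, here is a minimal Python sketch of the sorted key model that the Java API works with. The `Key` tuple and the `sort_key` helper are illustrative names, not part of any Accumulo client. The ordering rule used (row, column family, column qualifier and visibility ascending, then timestamp descending) is assumed to follow Accumulo's documented data model.

```python
from collections import namedtuple

# Illustrative stand-in for an Accumulo key; not a real client class.
Key = namedtuple("Key", ["row", "family", "qualifier", "visibility", "timestamp"])

def sort_key(key):
    # Ascending on the text fields, descending on timestamp so that the
    # newest version of a cell sorts first.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

entries = [
    (Key("row2", "person", "name", "", 5), "Bob"),
    (Key("row1", "person", "name", "", 3), "Alice (old)"),
    (Key("row1", "person", "age", "", 4), "31"),
    (Key("row1", "person", "name", "", 9), "Alice"),
]

# A scan would hand entries back in this order.
ordered = sorted(entries, key=lambda kv: sort_key(kv[0]))
values_in_scan_order = [value for _, value in ordered]
```

Anything an application needs beyond this sorted stream (records, joins, aggregates) has to be built on top of it, which is where the verbosity comes from.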


Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.[1]

Default execution runs on YARN (MapReduce and Tez).

Pig is often adored for its fast prototyping and data analysis abilities with “Pig Latin”: functions which perform operations on Tuples.

Pig Latin allows for very concise solutions to problems.

Pig’s load/store function interfaces (LoadFunc and StoreFunc) enable AccumuloStorage.

1. http://pig.apache.org


Apache Pig

-- Load a text file of data
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);

-- Group records by student
B = GROUP A BY name;

-- Average GPAs per student
C = FOREACH B GENERATE group AS name, AVG(A.gpa);

These 3 lines of Pig Latin would take hundreds of lines in Java just to read the data.
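For readers new to Pig Latin, the following Python sketch spells out what the relations A, B and C hold at each step. The sample records are made up for illustration, and this in-memory version says nothing about how Pig actually executes the script.

```python
from collections import defaultdict

# A: (name, term, gpa) tuples, as the LOAD step would produce them.
# The sample data is made up.
A = [
    ("alice", "fall", 3.5),
    ("alice", "spring", 4.0),
    ("bob", "fall", 3.0),
]

# B: the GROUP step, one bag of records per student name.
B = defaultdict(list)
for record in A:
    B[record[0]].append(record)

# C: the FOREACH step, averaging gpa within each group.
C = {name: sum(r[2] for r in bag) / len(bag) for name, bag in B.items()}
```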

AccumuloStorage introduced in Apache Pig 0.13.0

Maps each tuple into an Accumulo row.

Very easy to both write data to and read data from Accumulo.

STORE flights INTO 'accumulo://flights?instance=...' USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage( 'carrier_name,src_airport,dest_airport,tail_number');
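As a rough illustration of “each tuple becomes a row”, the Python sketch below pairs the first field of a tuple with the row ID and the remaining fields with the columns named in the storage argument. This mapping is an assumption made for illustration, and `tuple_to_cells` is a hypothetical helper, not Pig or Accumulo code; the AccumuloStorage documentation has the real rules.

```python
def tuple_to_cells(record, column_spec):
    """Map one tuple to (row, column, value) cells.

    Assumes the first field is the row ID and that each later field lines
    up with one comma-separated column name in column_spec.
    """
    columns = column_spec.split(",")
    row_id, fields = record[0], record[1:]
    if len(fields) != len(columns):
        raise ValueError("expected %d fields after the row ID" % len(columns))
    return [(row_id, column, str(value)) for column, value in zip(columns, fields)]

# Made-up flight record: row ID first, then one value per column.
cells = tuple_to_cells(
    ("flight-0001", "Example Air", "BWI", "SFO", "N12345"),
    "carrier_name,src_airport,dest_airport,tail_number",
)
```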


Apache Pig

Pig enables users to perform lots of powerful data manipulation and computation tasks with little code, but requires users to learn Pig Latin, a language of its own.

M: Apache Pig is a very well-defined community with its own processes.

S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have edge cases which are unsupported.

P: Often suffers from the under-optimization that comes with generalized MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as fast as well-architected, hand-written code.

EoU: Very concise and easy to use. Comes with most of the same drawbacks as dynamic programming languages. Not straightforward to test.

E: Requires user intervention to create/modify tables with custom configuration and splits. Column visibility on a per-cell basis is poorly represented because Pig Latin doesn’t have the ability to support it well.


Apache Hive

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.[1]

One of the “old-time” SQL-on-Hadoop software projects.

Has fought hard against the “batch-only” stigma, recently building on top of Tez for “interactive queries”.

Defines Hive Query Language (HQL) which is close to, but not quite, compatible with the SQL-92 standard.

Defines extension points which allow for external storage engines known as StorageHandlers.

1. http://hive.apache.org


Apache Hive

# Create a Hive table from the Accumulo table “my_table”
> CREATE TABLE my_table(uid string, name string, age int, height int)
  STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
  WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,person:name,person:age,person:height");

# Run “SQL” queries
> SELECT name, height, uid FROM my_table ORDER BY height;
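The `accumulo.columns.mapping` property in the statement above lines up positionally with the Hive columns. A small Python sketch of that pairing follows. `parse_mapping` is a hypothetical helper written for this explanation and is not how Hive itself parses the property.

```python
def parse_mapping(hive_columns, mapping):
    """Pair each Hive column with its Accumulo location.

    ':rowID' marks the column stored in the row ID; every other entry is
    read as 'family:qualifier'.
    """
    entries = mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    result = {}
    for column, entry in zip(hive_columns, entries):
        if entry.lower() == ":rowid":
            result[column] = ("ROW_ID", None, None)
        else:
            family, _, qualifier = entry.partition(":")
            result[column] = ("COLUMN", family, qualifier)
    return result

layout = parse_mapping(
    ["uid", "name", "age", "height"],
    ":rowID,person:name,person:age,person:height",
)
```

Read this way, `uid` lives in the row ID while `name`, `age` and `height` are qualifiers in the `person` column family of the same row.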

Like Pig, simple queries can be executed with very little code and each record maps into an Accumulo row.

Unlike Pig, generating these tables in Hive itself is often difficult and relies on first creating a “native” Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table.

AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the use of Tez, “point” queries on the rowID can be executed extremely quickly:

> SELECT * FROM my_table WHERE uid = '12345';
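Part of why a point query on the row ID can be fast is that Accumulo keeps keys sorted, so an equality predicate on the row can become a narrow range lookup instead of a full scan. The Python sketch below imitates that idea with a sorted list and binary search. It is an analogy only; `point_lookup` and the sample rows are made up.

```python
import bisect

# Sorted (row_id, value) pairs standing in for a sorted table.
table = sorted([
    ("12344", "carol"),
    ("12345", "alice"),
    ("12346", "bob"),
    ("20000", "dave"),
])
row_ids = [row for row, _ in table]

def point_lookup(row_id):
    """Return all values whose row ID equals row_id, using binary search."""
    start = bisect.bisect_left(row_ids, row_id)
    end = bisect.bisect_right(row_ids, row_id)
    return [value for _, value in table[start:end]]

match = point_lookup("12345")
```

A predicate on a column other than the row ID has no such shortcut in this model and would have to examine every row.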


Apache Hive

Using SQL to query Accumulo is a refreshing change, but the write-path with Hive leaves a bit to be desired. Will often require data ingest through another tool.

M: Apache Hive is a very well-defined community with its own processes.

S: HQL sometimes feels a bit clunky due to limitations of the StorageHandler interface.

P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to optimize query execution and reduce MapReduce overhead. Translating Accumulo Key-Values to Hive’s types can be expensive as well.

EoU: HQL as it stands is close enough to make those familiar with SQL feel at home. There are some oddities to work around, but they are typically easy to deal with.

E: Like Pig, Hive also suffers from the lack of an ability to represent features like cell-level visibility. Some options, like table configuration, are exposed through Hive, but most cases will require custom manipulation and configuration of Accumulo tables before using Hive.


Accumulo “Thrift Proxy”

Apache Thrift is a software framework which combines a software stack with a code generation engine to build cross-language services.[1]

Thrift is the software that Accumulo builds its client-server RPC service on.

Thrift provides desirable features such as optional message fields and well-performing abstractions over the low-level details such as threading and connection management.

Clients and servers don’t need to be implemented in the same language as each other.

1. http://thrift.apache.org



Clients could implement the necessary code to speak directly to the Accumulo Master and TabletServers, but that is an extremely large undertaking.

Accumulo provides an optional “Proxy” process which provides a Java API-like interface over Thrift instead of the low-level RPC Thrift API.

Accumulo bundles Python and Ruby client bindings by default. Generating bindings for other languages is simple when Thrift is already installed.



Ruby

proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table)

update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"})
update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"})
proxy.updateAndFlush(login, table, {'row1' => [update1, update2]})

cookie = proxy.createScanner(login, table, nil)
result = proxy.nextK(cookie, 10)
result.results.each do |keyvalue|
  puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}"
end

Python

if not client.tableExists(login, table):
    client.createTable(login, table, True, TimeType.MILLIS)

row1 = {'a': [ColumnUpdate('a', 'a', value='value1'),
              ColumnUpdate('b', 'b', value='value2')]}
client.updateAndFlush(login, table, row1)

cookie = client.createScanner(login, table, None)
for entry in client.nextK(cookie, 10).results:
    print(entry)



The first noticeable difference in implementations is that a Python or Ruby client will perform much worse than a native Java client.

Some of the performance loss is likely due to using a dynamic language. Your experience in the language is relevant too.

Most of the performance loss is due to passing all requests through the Proxy before they reach TabletServers.

Proxy servers are not highly available and would require manual load balancing. Single client environments work well, but many active clients will overload a Proxy.
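One way to picture “manual load balancing” is a client that rotates across several Proxy endpoints and skips the ones that fail. The Python sketch below shows only the rotation logic. The endpoint names, the `connect` callable and the `ProxyPool` class are all hypothetical; a real client would open a Thrift connection where the stub is called.

```python
import itertools

class ProxyPool:
    """Rotate across proxy endpoints, skipping ones that fail to connect."""

    def __init__(self, endpoints, connect):
        self._endpoints = list(endpoints)
        self._connect = connect  # callable: endpoint -> client
        self._cycle = itertools.cycle(self._endpoints)

    def get_client(self):
        # Try each endpoint at most once per call, continuing the rotation
        # from wherever the previous call stopped.
        for _ in range(len(self._endpoints)):
            endpoint = next(self._cycle)
            try:
                return self._connect(endpoint)
            except ConnectionError:
                continue
        raise ConnectionError("no proxy endpoint is reachable")

def fake_connect(endpoint):
    # Stand-in for opening a Thrift connection; "proxy2" pretends to be down.
    if endpoint == "proxy2:42424":
        raise ConnectionError(endpoint)
    return "client for " + endpoint

pool = ProxyPool(["proxy1:42424", "proxy2:42424", "proxy3:42424"], fake_connect)
picked = [pool.get_client() for _ in range(3)]
```

This spreads new connections across Proxies but does nothing for requests already in flight on a Proxy that dies, which is part of the engineering effort left to users.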




The novelty of using languages like Python and Ruby to interact with Accumulo is enjoyable. The Proxy’s architecture will not scale well past a few clients.

M: The Proxy isn’t widely (publicly) used but is generally maintained by devs.

S: Because the Proxy server API isn’t in the Accumulo Public API, no guarantees are made about its methods.

P: High availability and load balancing are left to users to solve. Will take significant engineering effort to smartly scale to supporting many clients.

EoU: Thrift tends to generate decent code to work with for each supported language which makes writing clients feel relatively natural.

E: The generated client code per language could easily be extended to act more like an ORM. The full spectrum of Accumulo’s Java API should be exposed via the Proxy which doesn’t impose limitations in use.


Cascading and Spark

Apache Spark has been causing big waves in the Hadoop community for the past year, touted as everything from a complete replacement for MapReduce to a complementary technology.

Cascading (not at the ASF, but licensed under ASLv2) is an abstraction layer on top of various Hadoop components. It’s been around for quite some time now and is well-received.

Both suffer from a lack of well-defined upstream Accumulo adoption within their respective communities. Snippets can be found online, but they’re typically end-user developed additions.

Lots of opportunities for users to step up and improve each!


Clojure and Scala

Clojure and Scala are both examples of languages which run on the JVM that are not Java.

These languages should both natively support the Accumulo Java API, although it’s somewhat uncharted territory that may have subtle bugs (ACCUMULO-3718).

GitHub has a smattering of example code, but definitive resources are lacking for both Clojure and Scala.

Lots of opportunity for users to step up and improve support for these languages!


Concrete Comparison

Let’s do a comparison on the effort needed to analyze some real data. Stanford hosts a collection of Amazon reviews (~35M records, ~14G gzip) that are available for use.[1] Reviews retain their category from Amazon (e.g. Books, Music, Instant Video) as well as some metadata such as the user who made the review, the score and the review text. Review scores are integer values between 1 and 5 inclusive.

The steps taken were as follows:

1. Convert the raw files into CSV (custom Java code)
2. Insert the data into an Accumulo table (custom Java code)
3. Answer a query using the Accumulo Java API, Pig and Hive.

The question is relatively simple and (hopefully) representative of a practical problem to solve: compute the average book review score given by each identified user. If I made two book reviews with scores 1 and 5, the query would return a value of 3 for me as (1 + 5) / 2 = 3.

1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

http://i.stanford.edu/%7Ejulian/pdfs/recsys13.pdf





Concrete Comparison

To answer the question, we need to scan Accumulo, apply two filters, group reviews for the same user together and compute an average. I wrote a simple parser and ingester in ~750 lines of Java (leveraging some libraries).

Accumulo Java API: A single-threaded client which performs all of this in memory can be achieved in 162 lines of code. Doesn’t use any custom iterators. Not a MapReduce job so the grouping phase must fit in memory. More work is needed to actually scale this solution.

Pig: 1 line of Pig Latin to define the relation (table), 4 lines which perform the computations and 1 line to output the data to the console.

Hive: 1 line to register our Accumulo table as a Hive table, and 1 HQL statement.

Both Pig and Hive also have the ability to run as a MapReduce job which means that they can handle much larger datasets automatically.
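The single-threaded approach described above (scan, apply two filters, group by user, average) can be sketched in a few lines of Python. The record layout and the two filters chosen here (category is Books, user is identified) are assumptions for illustration, as is the "unknown" marker for unidentified users. The actual Java code lives in the as2015 repository and is not reproduced here.

```python
from collections import defaultdict

# Made-up sample of (user, category, score) records standing in for a scan.
reviews = [
    ("me", "Books", 1),
    ("me", "Books", 5),
    ("me", "Music", 4),
    ("unknown", "Books", 2),
    ("you", "Books", 4),
]

def average_book_scores(records):
    """Average score per identified user, keeping only running totals."""
    totals = defaultdict(lambda: [0, 0])  # user -> [sum of scores, count]
    for user, category, score in records:
        if category != "Books":   # filter 1: only book reviews
            continue
        if user == "unknown":     # filter 2: skip unidentified users
            continue
        totals[user][0] += score
        totals[user][1] += 1
    return {user: total / count for user, (total, count) in totals.items()}

averages = average_book_scores(reviews)
```

Keeping a running sum and count per user uses less memory than holding every review, but the dictionary of users still has to fit in one process, which is the scaling limit noted for the Java client.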


Takeaways

Take stock of your application needs and run your own experiments!

Every approach has its pros and cons, with the Accumulo Java API really only suffering from the verbosity and boilerplate of Java applications themselves.

Because each application is different, it’s important to take stock of which problems need to be solved, which can be “hacked”, and which can be completely ignored.

Whatever you do choose, make an effort to contribute back to the community in some way!


Credit where credit is due

Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

Other code used for the experiments:

● Parser, ingester, and query code: https://github.com/joshelser/as2015

● Library to help ingest the data: https://github.com/joshelser/cereal

Names (Apache, Apache $Project, and $Project) and logos are trademarks of the ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift.

The Cascading logo used was from http://www.cascading.org/

The Clojure logo used was from http://clojure.org/

The Scala logo used was copied from http://www.scala-lang.org/







Thanks!

@josh_elser
