Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

© Josh Elser 2015, Hortonworks Alternatives to Apache Accumulo’s Java API Josh Elser @josh_elser (@hortonworks)

Post on 15-Jul-2015

Page 1: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

© Josh Elser 2015, Hortonworks

Alternatives to Apache Accumulo’s Java API

Josh Elser @josh_elser (@hortonworks)

Page 2: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Or…I’m really tired of having to write Java code all the time and I want to use something else.

Page 3: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Or…OK, I’ll still write Java, but I’m 110% done with re-writing the same boilerplate to parse CLI args, convert records into a standard format, deal with concurrency and retry server-side errors...

Page 4: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

You have options

There is life after Accumulo’s Java API:

● Apache Pig

● Apache Hive

● Accumulo’s “Thrift Proxy”

● Other JVM-based languages

● Cascading/Scalding

● Spark

Page 5: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Lots of integration points, lots of considerations:

We want to avoid numbering the considerations because each differs in importance depending on the application.

Every decision has an effect

Maturity

Stability

Performance

Extensibility

Ease of use

Page 6: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Maturity

How well-adopted is the code you’re using?

Where does the code live? Is there a structured community or is it just sitting in a GitHub repository?

Can anyone add fixes and improvements? Are they merged/accepted (when someone provides them)?

Are there tests and are they actually run?

Are releases made and published regularly?

Your own code is difficult enough to maintain.

Page 7: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Stability

Is there a well-defined user-facing API to use?

Cross-project integrations are notorious for making assumptions about how you should use the code.

Does the integration produce the same outcomes that the “native” components do?

Can users reliably expect code to work across versions?

Using some external integration should feel like using the project without that integration. Code that worked once should continue to work.

Page 8: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Performance

Does the code run quickly enough?

Can you saturate your physical resources with ease?

Do you have to spend days in a performance tool reworking how you use the API?

Does the framework spend excessive amounts of time converting data into its own types?

Can you get an answer in an acceptable amount of time?

Each use case has its own set of performance requirements. Experimentation is necessary.

Page 9: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Ease of Use

Can you write the necessary code in a reasonable amount of time?

Goes back to: “Am I sick of writing verbose code (Java)?”

Choosing the right tool can drastically reduce the amount of code to write.

Can the solution to your problem be reasonably expressed in the required language?

Code written against a library should feel natural and enjoyable to write, and should produce a succinct solution.

Page 10: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Extensibility

Does the integration support enough features of the underlying system?

Can you use the novel features of the underlying system via the integration?

Can custom parsing/processing logic be included?

How much external configuration/setup is needed before you can invoke your code?

Using an integration should not require sacrifice in the features of the underlying software.

Page 11: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apply it to Accumulo!

Let’s take these 5 points and see how they apply to some of the more well-defined integration projects.

We’ll use Accumulo’s Java API as the reference point for how we judge other projects.

Page 12: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo Java API

Reference implementation for how clients use Accumulo. Composed of Java methods and classes, with extreme assessment of the value and effectiveness of each.

M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily critiqued.

S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.

P: High-performance, typically limited by network and server-side impl.

EoU: Verbose and often pedantic. Implements low-level, key-value-centric operations, not high-level application functions.

E: Provides well-defined building blocks for implementing custom libraries and exposes methods for interacting with all Accumulo features.
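
As a rough illustration of what the SemVer guarantee means for client code, the sketch below checks whether moving from one version to another should preserve the public API. The function names are hypothetical and not part of Accumulo; the code only encodes the general SemVer rule (same major version, and the target is not older).

```python
def parse_version(version):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_api_compatible_upgrade(current, target):
    """Return True if code written against `current` should still work on
    `target` under SemVer: same major version, and target is not older."""
    cur = parse_version(current)
    tgt = parse_version(target)
    return cur[0] == tgt[0] and tgt >= cur


if __name__ == "__main__":
    print(is_api_compatible_upgrade("1.6.1", "1.7.0"))
    print(is_api_compatible_upgrade("1.7.0", "2.0.0"))
```

A downgrade within the same major version is rejected as well, since a minor release may add API that older releases lack.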

Page 13: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.[1]

Default execution runs on YARN (MapReduce and Tez).

Pig is often adored for its fast prototyping and data analysis abilities with “Pig Latin”: functions which perform operations on Tuples.

Pig Latin allows for very concise solutions to problems.

LoadStoreFunc interface enables AccumuloStorage

1. http://pig.apache.org

Page 14: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Pig

-- Load a text file of data
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);

-- Group records by student
B = GROUP A BY name;

-- Average GPAs per student ('group' holds the grouping key, i.e. the name)
C = FOREACH B GENERATE group, AVG(A.gpa);

3 lines of Pig Latin; the equivalent would take hundreds of lines in Java just to read the data.

AccumuloStorage introduced in Apache Pig 0.13.0

Maps each tuple into an Accumulo row.

Very easy to both write data to and read data from Accumulo.

STORE flights INTO 'accumulo://flights?instance=...' USING org.apache.pig.backend.hadoop.accumulo.AccumuloStorage( 'carrier_name,src_airport,dest_airport,tail_number');
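
To make the tuple-to-row mapping concrete, here is a small Python sketch of the idea; it is not AccumuloStorage's actual code. It assumes the first field of each tuple becomes the row ID and the remaining fields are paired, in order, with the columns listed in the STORE statement. The function name and the sample flight record are invented for illustration.

```python
def tuple_to_cells(record, columns):
    """Map one tuple onto (row, column, value) cells.

    Assumption: record[0] is the row ID and record[1:] line up
    positionally with the comma-separated column list.
    """
    column_names = [name.strip() for name in columns.split(",")]
    row_id, values = record[0], record[1:]
    if len(values) != len(column_names):
        raise ValueError(
            "expected %d values, got %d" % (len(column_names), len(values)))
    return [(row_id, column, value)
            for column, value in zip(column_names, values)]


if __name__ == "__main__":
    # Hypothetical flight record: row ID followed by four column values.
    flight = ("flight-0001", "Example Air", "BWI", "SFO", "N12345")
    spec = "carrier_name,src_airport,dest_airport,tail_number"
    for cell in tuple_to_cells(flight, spec):
        print(cell)
```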

Page 15: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Pig

Pig enables users to perform lots of powerful data manipulation and computation tasks with little code, but requires users to learn Pig Latin, which is unique.

M: Apache Pig is a very well-defined community with its own processes.

S: Use of Pig Latin with AccumuloStorage feels natural and doesn’t have edge cases which are unsupported.

P: Often suffers from the under-optimization that comes with generalized MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as fast as well-architected, hand-written code.

EoU: Very concise and easy to use. Comes with most of the same drawbacks as dynamic programming languages. Not straightforward to test.

E: Requires user intervention to create/modify tables with custom configuration and splits. Column visibility on a per-cell basis is poorly represented because Pig Latin doesn’t have the ability to support it well.

Page 16: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Hive

Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.[1]

One of the “old-time” SQL-on-Hadoop software projects.

Has fought hard against the “batch-only” stigma, recently building on top of Tez for “interactive queries”.

Defines Hive Query Language (HQL) which is close to, but not quite, compatible with the SQL-92 standard.

Defines extension points which allow for external storage engines known as StorageHandlers.

1. http://hive.apache.org

Page 17: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Hive

# Create a Hive table from the Accumulo table “my_table”
> CREATE TABLE my_table(uid string, name string, age int, height int)
  STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
  WITH SERDEPROPERTIES ("accumulo.columns.mapping" = ":rowID,person:name,person:age,person:height");

# Run “SQL” queries
> SELECT name, height, uid FROM my_table ORDER BY height;

Like Pig, simple queries can be executed with very little code and each record maps into an Accumulo row.

Unlike Pig, generating these tables in Hive itself is often difficult: it relies on first creating a “native” Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table.

AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the use of Tez, “point” queries on the rowID can be executed extremely quickly:

> SELECT * FROM my_table WHERE uid = "12345";
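
The accumulo.columns.mapping string pairs Hive columns with Accumulo columns by position. The Python sketch below illustrates that pairing for the my_table definition on this slide. It is a simplified illustration with a hypothetical helper name, not the parsing code used by AccumuloStorageHandler.

```python
def pair_columns(hive_columns, mapping):
    """Pair each Hive column with its Accumulo location, by position.

    ':rowID' marks the Hive column stored as the Accumulo row ID; other
    entries are 'family:qualifier' pairs.
    """
    entries = mapping.split(",")
    if len(entries) != len(hive_columns):
        raise ValueError("mapping and column list differ in length")
    paired = {}
    for hive_column, entry in zip(hive_columns, entries):
        if entry == ":rowID":
            paired[hive_column] = ("rowID", None)
        else:
            family, qualifier = entry.split(":", 1)
            paired[hive_column] = (family, qualifier)
    return paired


if __name__ == "__main__":
    columns = ["uid", "name", "age", "height"]
    mapping = ":rowID,person:name,person:age,person:height"
    for hive_column, location in pair_columns(columns, mapping).items():
        print(hive_column, "->", location)
```

Because uid is the column mapped to the row ID, a predicate on uid is the kind of “point” query described on this slide.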

Page 18: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Apache Hive

Using SQL to query Accumulo is a refreshing change, but the write-path with Hive leaves a bit to be desired. Will often require data ingest through another tool.

M: Apache Hive is a very well-defined community with its own processes.

S: HQL sometimes feels a bit clunky due to limitations of the StorageHandler interface.

P: Lots of effort here in Hive recently using Apache Calcite and Apache Tez to optimize query execution and reduce MapReduce overhead. Translating Accumulo Key-Values to Hive’s types can be expensive as well.

EoU: HQL as it stands is close enough to make those familiar with SQL feel at home. There are some oddities to work around, but they are typically easy to deal with.

E: Like Pig, Hive also suffers from the lack of an ability to represent features like cell-level visibility. Some options, like table configuration, are exposed through Hive, but most cases will require custom manipulation and configuration of Accumulo tables before using Hive.

Page 19: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo “Thrift Proxy”

Apache Thrift is a software framework which combines a software stack with a code generation engine to build cross-language services.[1]

Thrift is the software that Accumulo builds its client-server RPC service on.

Thrift provides desirable features such as optional message fields and well-performing abstractions over the low-level details such as threading and connection management.

Clients and servers don’t need to be implemented in the same language as each other.

1. http://thrift.apache.org

Page 20: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo “Thrift Proxy”

Clients could implement the necessary code to speak directly to the Accumulo Master and TabletServers, but that is an extremely large undertaking.

Accumulo provides an optional “Proxy” process which provides a Java API-like interface over Thrift instead of the low-level RPC Thrift API.

Accumulo bundles Python and Ruby client bindings by default. Generating other languages is simple when Thrift is already installed.

Page 21: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo “Thrift Proxy”

Ruby

# Create the table if it does not already exist
proxy.createTable(login, table, true, Accumulo::TimeType::MILLIS) unless proxy.tableExists(login, table)

# Write two columns to the row 'row1'
update1 = Accumulo::ColumnUpdate.new({'colFamily' => "cf1", 'colQualifier' => "cq1", 'value' => "a"})
update2 = Accumulo::ColumnUpdate.new({'colFamily' => "cf2", 'colQualifier' => "cq2", 'value' => "b"})
proxy.updateAndFlush(login, table, {'row1' => [update1, update2]})

# Scan the table and print up to 10 entries
cookie = proxy.createScanner(login, table, nil)
result = proxy.nextK(cookie, 10)
result.results.each { |keyvalue| puts "Key: #{keyvalue.key.inspect} Value: #{keyvalue.value}" }

Python

# Create the table if it does not already exist
if not client.tableExists(login, table):
    client.createTable(login, table, True, TimeType.MILLIS)

# Write two columns to the row 'a'
row1 = {'a': [ColumnUpdate('a', 'a', value='value1'),
              ColumnUpdate('b', 'b', value='value2')]}
client.updateAndFlush(login, table, row1)

# Scan the table and print up to 10 entries
cookie = client.createScanner(login, table, None)
for entry in client.nextK(cookie, 10).results:
    print(entry)

Page 22: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo “Thrift Proxy”

The first noticeable difference between implementations is that a Python or Ruby client will perform much worse than a native Java client.

Some of the performance loss is likely due to using a dynamic language. Your experience in the language is relevant too.

Most of the performance loss is due to passing all requests through the Proxy before they reach the TabletServers.

Proxy servers are not highly available and would require manual load balancing. Single client environments work well, but many active clients will overload a Proxy.
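
One simple form of the manual load balancing mentioned above is client-side round-robin across several Proxy processes. The sketch below is only an illustration under stated assumptions: the host names are made up, 42424 is assumed to be the configured Proxy port, and the `connect_fn` callable stands in for whatever opens a Thrift connection in your client.

```python
import itertools


class ProxyPool(object):
    """Hand out Proxy endpoints in round-robin order, skipping failures."""

    def __init__(self, endpoints):
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._endpoints = list(endpoints)
        self._cycle = itertools.cycle(self._endpoints)

    def connect(self, connect_fn):
        """Try each endpoint at most once, starting from the next in turn.

        `connect_fn(host, port)` is a caller-supplied function that returns
        a connection or raises an exception if the Proxy is unreachable.
        """
        last_error = None
        for _ in range(len(self._endpoints)):
            host, port = next(self._cycle)
            try:
                return connect_fn(host, port)
            except Exception as error:  # sketch: any error means "try the next one"
                last_error = error
        raise RuntimeError("no Proxy reachable: %s" % last_error)


if __name__ == "__main__":
    # Hypothetical endpoints; replace with your own Proxy hosts.
    pool = ProxyPool([("proxy1.example.com", 42424),
                      ("proxy2.example.com", 42424)])

    def fake_connect(host, port):
        # Stand-in for opening a real Thrift transport to the Proxy.
        if host == "proxy1.example.com":
            raise IOError("connection refused")
        return "connected to %s:%d" % (host, port)

    print(pool.connect(fake_connect))
```

This spreads new connections across Proxies and tolerates one being down, but it does nothing about a single Proxy being overloaded by many active clients.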

Page 23: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Accumulo “Thrift Proxy”

The novelty of using languages like Python and Ruby to interact with Accumulo is enjoyable. The Proxy’s architecture will not scale well past a few clients.

M: The Proxy isn’t widely (publicly) used but is generally maintained by devs.

S: Because the Proxy server API isn’t in the Accumulo Public API, no guarantees are made on its methods.

P: High availability and load balancing are left to users to solve. Will take significant engineering effort to smartly scale to supporting many clients.

EoU: Thrift tends to generate decent code to work with for each supported language which makes writing clients feel relatively natural.

E: The generated client code per language could easily be extended to act more like an ORM. The full spectrum of Accumulo’s Java API should be exposed via the Proxy which doesn’t impose limitations in use.

Page 24: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Cascading and Spark

Apache Spark has been causing big waves in the Hadoop community for the past year, touted as everything from a complete replacement for MapReduce to a complementary technology.

Cascading (not at the ASF, but licensed under ASLv2) is an abstraction layer on top of various Hadoop components. It’s been around for quite some time now and is well-received.

Both suffer from a lack of well-defined upstream Accumulo adoption within their respective communities. Snippets can be found online, but they’re typically end-user developed additions.

Lots of opportunities for users to step up and improve each!

Page 25: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Clojure and Scala

Clojure and Scala are both examples of languages which run on the JVM that are not Java.

These languages should both natively support the Accumulo Java API, although it’s somewhat uncharted territory that may have subtle bugs (ACCUMULO-3718).

GitHub has a smattering of example code, but definitive resources are lacking for both Clojure and Scala.

Lots of opportunity for users to step up and improve support for these languages!

Page 26: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Concrete Comparison

Let’s compare the effort needed to analyze some real data. Stanford hosts a collection of Amazon reviews (~35M records, ~14G gzip) that are available for use.[1] Reviews retain their category from Amazon (e.g. Books, Music, Instant Video) as well as some metadata such as the user who made the review, the score and the review text. Review scores are integer values between 1 and 5, inclusive.

The steps taken were as follows:

1. Convert the raw files into CSV (custom Java code)

2. Insert the data into an Accumulo table (custom Java code)

3. Answer a query using the Accumulo Java API, Pig and Hive.

The question is relatively simple and (hopefully) representative of a practical problem to solve: compute the average review score on books for each identified user. If I made two book reviews with scores 1 and 5, the query would return a value of 3 for me as (1 + 5) / 2 = 3.

1. http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
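
The query can be sketched in a few lines of Python to show the logic each tool has to express. This is an in-memory illustration only: the record layout, the field names, and the "unknown" placeholder for unidentified users are assumptions, not the dataset's exact schema.

```python
from collections import defaultdict


def average_book_scores(reviews):
    """Average the scores of book reviews for each identified user.

    `reviews` is an iterable of (user_id, category, score) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # user_id -> [sum of scores, count]
    for user_id, category, score in reviews:
        if category != "Books":      # filter 1: only book reviews
            continue
        if user_id == "unknown":     # filter 2: identified users (assumed marker)
            continue
        totals[user_id][0] += score
        totals[user_id][1] += 1
    return {user: float(total) / count
            for user, (total, count) in totals.items()}


if __name__ == "__main__":
    sample = [
        ("josh", "Books", 1),
        ("josh", "Books", 5),
        ("josh", "Music", 4),
        ("unknown", "Books", 2),
    ]
    print(average_book_scores(sample))  # {'josh': 3.0}
```

As with any single-process approach, the per-user totals must fit in memory.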

Page 27: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Concrete Comparison

To answer the question, we need to scan Accumulo, apply two filters, group reviews for the same user together and compute an average. I wrote a simple parser and ingester in ~750 lines of Java (leveraging some libraries).

Accumulo Java API: A single-threaded client which performs all of this in memory can be achieved in 162 lines of code. Doesn’t use any custom iterators. Not a MapReduce job, so the grouping phase must fit in memory. More work is needed to actually scale this solution.

Pig: 1 line of Pig Latin to define the relation (table), 4 lines which perform the computations and 1 line to output the data to the console.

Hive: 1 line to register our Accumulo table as a Hive table, and 1 HQL statement.

Both Pig and Hive can also run as MapReduce jobs, which means that they can handle much larger datasets automatically.

Page 28: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Takeaways

Take stock of your application needs and run your own experiments!

Every approach has its pros and cons, with the Accumulo Java API really only suffering from the verbosity and boilerplate of Java applications themselves.

Because each application is different, it’s important to take stock of which problems need to be solved, which can be “hacked”, and which can be completely ignored.

Whatever you do choose, make an effort to contribute back to the community in some way!

Page 29: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Credit where credit is due

Amazon Reviews: http://snap.stanford.edu/data/web-Amazon-links.html: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.

Other code used for the experiments:

● Parser, ingester, and query code: https://github.com/joshelser/as2015

● Library to help ingest the data: https://github.com/joshelser/cereal

Names (Apache, Apache $Project, and $Project) and logos are trademarks of the ASF and the respective Apache $Projects: Accumulo, Hive, Pig, Spark, and Thrift.

The Cascading logo used was from http://www.cascading.org/

The Clojure logo used was from http://clojure.org/

The Scala logo used was copied from http://www.scala-lang.org/

Page 30: Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API [API]

Thanks!

@josh_elser