Accumulo Summit 2015: Alternatives to Apache Accumulo's Java API


  • Josh Elser 2015, Hortonworks

    Alternatives to Apache Accumulo's Java API

    Josh Elser, @josh_elser (@hortonworks)


    Or: "I'm really tired of having to write Java code all the time and I want to use something else."


    Or: "OK, I'll still write Java, but I'm 110% done with re-writing the same boilerplate to parse CLI args, convert records into a standard format, deal with concurrency, and retry server-side errors..."


    You have options

    There is life after Accumulo's Java API:

    Apache Pig

    Apache Hive

    Accumulos Thrift Proxy

    Other JVM-based languages

    Cascading/Scalding

    Spark


    Lots of integration points, lots of considerations:

    We avoid numbering these considerations because each differs in importance depending on the application.

    Every decision has an effect

    Maturity

    Stability

    Performance

    Extensibility

    Ease of use


    Maturity

    How well-adopted is the code you're using?

    Where does the code live? Is there a structured community or is it just sitting in a Github repository?

    Can anyone add fixes and improvements? Are they merged/accepted (when someone provides them)?

    Are there tests and are they actually run?

    Are releases made and published regularly?

    Your own code is difficult enough to maintain.


    Stability

    Is there a well-defined user-facing API to use?

    Cross-project integrations are notorious for making assumptions about how you should use the code.

    Does the integration produce the same outcomes that the native components do?

    Can users reliably expect code to work across versions?

    Using some external integration should feel like using the project without that integration. Code that worked once should continue to work.


    Performance

    Does the code run sufficiently fast?

    Can you saturate your physical resources with ease?

    Do you have to spend days in a performance tool reworking how you use the API?

    Does the framework spend excessive time converting data into its own types?

    Can you get an answer in an acceptable amount of time?

    Each use case has its own set of performance requirements. Experimentation is necessary.


    Ease of Use

    Can you write the necessary code in a reasonable amount of time?

    Goes back to: am I sick of writing verbose code (Java)?

    Choosing the right tool can drastically reduce the amount of code to write.

    Can the solution to your problem be reasonably expressed in the required language?

    Using a library should feel natural and enjoyable to write while producing a succinct solution.


    Extensibility

    Does the integration support enough features of the underlying system?

    Can you use the novel features of the underlying system via the integration?

    Can custom parsing/processing logic be included?

    How much external configuration/setup is needed before you can invoke your code?

    Using an integration should not require sacrifice in the features of the underlying software.


    Apply it to Accumulo!

    Let's take these five points and see how they apply to some of the more well-defined integration projects.

    We'll use Accumulo's Java API as the reference point for how we judge other projects.


    Accumulo Java API

    The reference implementation for how clients use Accumulo. Comprised of Java methods and classes, with extreme assessment of the value and effectiveness of each.

    M: Evaluated/implemented by all Accumulo developers. Well-tested and heavily critiqued.

    S: Follows SemVer since 1.6.1. Is the definition of the Accumulo API.

    P: High-performance, typically limited by the network and the server-side implementation.

    EoU: Verbose and often pedantic. Implements low-level, key-value-centric operations, not high-level application functions.

    E: Provides well-defined building blocks for implementing custom libraries and exposes methods for interacting with all Accumulo features.


    Apache Pig

    "Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs." [1]

    Default execution runs on YARN (MapReduce and Tez).

    Pig is often adored for its fast prototyping and data analysis abilities with Pig Latin: functions which perform operations on Tuples.

    Pig Latin allows for very concise solutions to problems.

    Pig's LoadFunc/StoreFunc interfaces enable AccumuloStorage.

    1. http://pig.apache.org


    Apache Pig

    -- Load a text file of data
    A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);

    -- Group records by student
    B = GROUP A BY name;

    -- Average GPAs per student
    C = FOREACH B GENERATE group, AVG(A.gpa);

    Three lines of Pig Latin; the same would take hundreds of lines of Java just to read the data.

    AccumuloStorage introduced in Apache Pig 0.13.0

    Maps each tuple into an Accumulo row.

    Very easy to both write/read data to/from Accumulo.

    STORE flights INTO 'accumulo://flights?instance=...' USING
        org.apache.pig.backend.hadoop.accumulo.AccumuloStorage(
            'carrier_name,src_airport,dest_airport,tail_number');
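    The tuple-to-row mapping above can be sketched in a few lines. This is a hypothetical Python illustration of the behavior described on this slide (the first tuple field becomes the Accumulo row, and the remaining fields pair with the configured column list), not Pig's actual implementation; the `tuple_to_row` helper and the sample flight values are invented for the example.

```python
def tuple_to_row(tup, columns):
    """Sketch of AccumuloStorage's described mapping: the first tuple
    field becomes the Accumulo row ID, and each remaining field is
    written as a cell under the matching configured column."""
    row_id = tup[0]
    cells = list(zip(columns, tup[1:]))  # (column, value) pairs
    return row_id, cells

# A made-up flight tuple paired with the column list from the STORE example:
row, cells = tuple_to_row(
    ("DL1234", "Delta", "ATL", "JFK", "N123DL"),
    ["carrier_name", "src_airport", "dest_airport", "tail_number"])
# row is "DL1234"; cells pair each column name with its field value
```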


    Apache Pig

    Pig enables users to perform powerful data manipulation and computation tasks with little code, but requires users to learn Pig Latin, which is unique to Pig.

    M: Apache Pig is a very well-defined community with its own processes.

    S: Pig Latin with AccumuloStorage feels natural and doesn't have unsupported edge cases.

    P: Often suffers from the under-optimization that comes with generalized MapReduce. Will incur penalties for quick jobs (with MapReduce only). Not as fast as well-architected, hand-written code.

    EoU: Very concise and easy to use. Comes with most of the same drawbacks of dynamic programming languages. Not straightforward to test.

    E: Requires user intervention to create/modify tables with custom configuration and splits. Column visibility on a per-cell basis is poorly represented because Pig Latin doesn't have the ability to support it well.


    Apache Hive

    "Apache Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage." [1]

    One of the old-time SQL-on-Hadoop software projects.

    Recently fought hard against the batch-only stigma by building on top of Tez for interactive queries.

    Defines Hive Query Language (HQL) which is close to, but not quite, compatible with the SQL-92 standard.

    Defines extension points which allow for external storage engines known as StorageHandlers.

    1. http://hive.apache.org


    Apache Hive

    # Create a Hive table from the Accumulo table my_table
    > CREATE TABLE my_table(uid string, name string, age int, height int)
      STORED BY 'org.apache.hadoop.hive.accumulo.AccumuloStorageHandler'
      WITH SERDEPROPERTIES (
        "accumulo.columns.mapping" = ":rowID,person:name,person:age,person:height");

    # Run SQL queries
    > SELECT name, height, uid FROM my_table ORDER BY height;

    Like Pig, simple queries can be executed with very little code, and each record maps into an Accumulo row.

    Unlike Pig, generating these tables in Hive itself is often difficult and is reliant upon first creating a native Hive table and then inserting the data into an AccumuloStorageHandler-backed Hive table.

    AccumuloStorageHandler introduced in Apache Hive 0.14.0. With the use of Tez, point queries on the rowID can be executed extremely quickly:

    > SELECT * FROM my_table WHERE uid = 12345;
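    The accumulo.columns.mapping string is what ties each Hive column to an Accumulo column. As a rough Python sketch of that pairing (a hypothetical `parse_column_mapping` helper, not Hive's actual code): ':rowID' marks the Hive column that holds the Accumulo row ID, and every other entry names a family:qualifier pair.

```python
def parse_column_mapping(mapping):
    """Illustrative parse of an accumulo.columns.mapping string:
    ':rowID' marks the Hive column holding the Accumulo row ID, and
    every other entry is a 'family:qualifier' pair."""
    parsed = []
    for entry in mapping.split(","):
        if entry == ":rowID":
            parsed.append(("rowID", None, None))
        else:
            family, qualifier = entry.split(":", 1)
            parsed.append(("column", family, qualifier))
    return parsed

# The mapping from the CREATE TABLE example above:
columns = parse_column_mapping(":rowID,person:name,person:age,person:height")
# The first entry is the row ID; the rest are cells in the 'person' family.
```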


    Apache Hive

    Using SQL to query Accumulo is a refreshing change, but the write path with Hive leaves a bit to be desired. It will often require data ingest through another tool.

    M: Apache Hive is a very well-defined community with its own processes.

    S: HQL sometimes feels a bit clunky due to limitations of the StorageHandler interface.

    P: Lots of recent effort in Hive using Apache Calcite and Apache Tez to optimize query execution and reduce MapReduce overhead. Translating Accumulo Key-Values to Hive's types can be expensive as well.

    EoU: HQL as it stands is close enough to make those familiar with SQL feel at home. Some oddities to work around, but they are typically easy to deal with.

    E: Like Pig, Hive suffers from the inability to represent features like cell-level visibility. Some options, like table configuration, are exposed through Hive, but most cases will require custom manipulation and configuration of Accumulo tables before using Hive.


    Accumulo Thrift Proxy

    "Apache Thrift is a software framework which combines a software stack with a code generation engine to build cross-language services." [1]

    Thrift is the software that Accumulo builds its client-server RPC service on.

    Thrift provides desirable features such as optional message fields and well-performing abstractions over the low-level details such as threading and connection management.

    Clients and servers don't need to be implemented in the same language as each other.

    1. http://thrift.apache.org
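    To make the cross-language point concrete, a Python client talking to the proxy might look like the sketch below. The generated module names (accumulo, AccumuloProxy), host, port, credentials, and the table/column names are assumptions based on a typical Thrift proxy setup, not details from this deck; running it requires a live proxy server and the thrift-generated Python bindings.

```python
def write_one_cell(host="localhost", port=42424, user="root", password="secret"):
    """Sketch of writing a single cell through Accumulo's Thrift proxy
    from Python. Host, port, credentials, and the table/column names
    are illustrative assumptions; a proxy server must already be running."""
    # Imports are deferred so this sketch can be loaded without the
    # generated proxy bindings on the path.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TCompactProtocol
    from accumulo import AccumuloProxy
    from accumulo.ttypes import ColumnUpdate

    transport = TTransport.TFramedTransport(TSocket.TSocket(host, port))
    protocol = TCompactProtocol.TCompactProtocol(transport)
    client = AccumuloProxy.Client(protocol)
    transport.open()
    try:
        # Authenticate, then write one cell -- no Java involved.
        login = client.login(user, {"password": password})
        update = ColumnUpdate(colFamily="person", colQualifier="name", value="josh")
        client.updateAndFlush(login, "my_table", {"row1": [update]})
    finally:
        transport.close()
```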
