accumulo summit 2015: event-driven big data with accumulo - leveraging big data in motion...

Accumulo Summit - 4/28/2015 Event-Driven Big Data with Accumulo Leveraging Big Data in Motion… John Hebeler Lockheed Martin Inc. [email protected] “It is a capital mistake to theorize before one has data.” Sherlock Holmes

Author: accumulo-summit

Post on 15-Jul-2015






  • Accumulo Summit - 4/28/2015

    Event-Driven Big Data with Accumulo

    Leveraging Big Data in Motion...

    John Hebeler
    Lockheed Martin Inc.

    [email protected]

    "It is a capital mistake to theorize before one has data." - Sherlock Holmes

  • Plan

    Brief Event-Driven Overview
    Accumulo Event Management
    Demonstration/Access to EC2


  • Events

    Events drive our world; they are our context
    Data processing often reflects these events, but with batch latency, poor resolution, longitudinal conflicts, and pull-type architectures
      If you don't ask, no one hears
    Event consequences are delayed and possibly lost
      Especially true in context with related events
      Time plays a critical factor: before, after, simultaneous

    Focus on Accumulo's role and implementation


  • Event-Driven Architecture

    Events drive to consequences
      Multiple levels/iterations

    Clients (or downstream events) analyze the consequences in near real time
    Stateless except for Big Data (Accumulo), which makes it possible!

    Resolution, Fidelity, Query, ...

  • Accumulo Data Model

    Key: decomposable, flexible
      Lexicographical index (only), starting from the Row ID
      Family and Qualifier can be columns or Row/Key enrichment
      Visibility provides flexible security for each key-value entry
      Timestamp is usually automatic and allows versions

    Value: anything, but not really searchable

    Any of the above can be quite huge
    Atomic only at the row level

    Key layout: Row ID | Column (Family, Qualifier, Visibility) | Timestamp
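
    The sort order behind the lexicographical index can be sketched with plain JDK collections. This is a minimal illustration, not Accumulo code: SketchKey is a hypothetical stand-in for the real key class, visibility is left out for brevity, and Java string comparison stands in for Accumulo's byte-wise comparison (the two agree for ASCII). Row, family, and qualifier sort ascending; timestamps sort descending, so the newest version of an entry comes first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an Accumulo key; visibility is omitted for brevity.
class SketchKey {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    SketchKey(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    // Row, family, qualifier ascending; timestamp descending (newest first).
    static final Comparator<SketchKey> ORDER = new Comparator<SketchKey>() {
        public int compare(SketchKey a, SketchKey b) {
            int c = a.row.compareTo(b.row);
            if (c == 0) c = a.family.compareTo(b.family);
            if (c == 0) c = a.qualifier.compareTo(b.qualifier);
            if (c == 0) c = Long.compare(b.timestamp, a.timestamp);
            return c;
        }
    };

    public String toString() {
        return row + "/" + family + "/" + qualifier + "@" + timestamp;
    }
}

class KeyOrderSketch {
    // Sort a few sample keys and return them in index order.
    static List<String> sorted() {
        List<SketchKey> keys = new ArrayList<SketchKey>();
        keys.add(new SketchKey("event2", "meta", "source", 10L));
        keys.add(new SketchKey("event1", "meta", "source", 10L));
        keys.add(new SketchKey("event1", "meta", "source", 20L));
        keys.add(new SketchKey("event1", "body", "json", 10L));
        Collections.sort(keys, SketchKey.ORDER);
        List<String> out = new ArrayList<String>();
        for (SketchKey k : keys) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : sorted()) {
            System.out.println(s);
        }
    }
}
```

    Because everything for one Row ID is contiguous in this order, choosing what goes into the Row ID largely decides which queries are cheap.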

  • Events and Context

    Store events for easy retrieval
    Events continue to grow; context reaches a steady state
    Proper interpretation of an event happens within its context
    Idempotence


  • Categories

    1. Direct Accumulo Operations
    2. Event Programming
    3. Event Management with Accumulo

  • Direct Accumulo Operations

  • Query

    Key constructs: packed fields vs. column-based, your choice
    Lexicographical index only
    Index (another word for "build a new table")
    A scan for the prefix "a" finds "a.a.a.b"
    Not usually practical to search in the Value
    Query for past values (versions)
    Time

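    // imports: java.util.ArrayList, java.util.List, org.apache.accumulo.core.data.Range,
    //          org.apache.accumulo.core.client.BatchScanner, org.apache.accumulo.core.client.admin.TableOperations

    // Scan several ranges at once with a BatchScanner
    List<Range> ranges = new ArrayList<Range>();
    // Populate ranges
    // The slide leaves out the remaining arguments; the Connector call takes
    // authorizations and a query-thread count (auths and numQueryThreads are placeholder names)
    BatchScanner bs = conn.createBatchScanner(table, auths, numQueryThreads);
    bs.setRanges(ranges);

    // Keep up to N versions per key so past values can be queried (N is passed as a String)
    TableOperations to = conn.tableOperations();
    to.setProperty(tableName, "table.iterator.scan.vers.opt.maxVersions", N);
    to.setProperty(tableName, "table.iterator.majc.vers.opt.maxVersions", N);
    to.setProperty(tableName, "table.iterator.minc.vers.opt.maxVersions", N);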

    RowID Family Qualifier Value
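
    Why a prefix finds its matches while other searches do not can be shown with a sorted map. The sketch below is a JDK-only illustration under stated assumptions: a TreeMap stands in for a table sorted by Row ID, and PrefixScanSketch, scanPrefix, and the sample rows are hypothetical names, not Accumulo API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class PrefixScanSketch {
    // Return every row ID that starts with the prefix, in sorted order.
    static List<String> scanPrefix(TreeMap<String, String> table, String prefix) {
        List<String> hits = new ArrayList<String>();
        // Seek to the first key >= prefix, then walk forward while the prefix still matches.
        for (String row : table.tailMap(prefix, true).keySet()) {
            if (!row.startsWith(prefix)) {
                break;
            }
            hits.add(row);
        }
        return hits;
    }

    // Sample table: packed row IDs mapped to placeholder values.
    static TreeMap<String, String> sample() {
        TreeMap<String, String> table = new TreeMap<String, String>();
        table.put("a.a.a.b", "event 1");
        table.put("a.b", "event 2");
        table.put("b.a", "event 3");
        table.put("c.a.a", "event 4");
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scanPrefix(sample(), "a"));
        System.out.println(scanPrefix(sample(), ".a"));
    }
}
```

    A search for ".a" in the middle of a row ID returns nothing here because the sorted order only helps with leading characters. Serving that query means writing the same data under a different key, which is the "build a new table" point above.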


  • Event Update

    Store events for easy retrieval
    Maintain context surrounding the event
    Write with the same key to update the value

    RowID Family Qualifier Value

    RowID: EventID1, EventID2, EventID3
    Value: Event**

    ** JSON or serialized object
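
    The "same key updates the value" behavior can be sketched with an in-memory map. This is an illustration only: EventUpdateSketch and its methods are hypothetical, and a TreeMap stands in for the event table. In Accumulo itself, older values are kept as versions up to the configured maximum, and a default table returns only the newest one.

```java
import java.util.TreeMap;

class EventUpdateSketch {
    // Stand-in for the event table: Row ID (event ID) mapped to a JSON value.
    private final TreeMap<String, String> table = new TreeMap<String, String>();

    // Writing an existing key replaces the value; no read-before-write is needed.
    void write(String eventId, String json) {
        table.put(eventId, json);
    }

    String read(String eventId) {
        return table.get(eventId);
    }

    int size() {
        return table.size();
    }

    static EventUpdateSketch demo() {
        EventUpdateSketch t = new EventUpdateSketch();
        t.write("EventID1", "{\"state\":\"open\"}");
        t.write("EventID2", "{\"state\":\"open\"}");
        t.write("EventID1", "{\"state\":\"closed\"}"); // update in place
        t.write("EventID1", "{\"state\":\"closed\"}"); // replaying the same write changes nothing
        return t;
    }

    public static void main(String[] args) {
        EventUpdateSketch t = demo();
        System.out.println(t.size() + " rows, EventID1 = " + t.read("EventID1"));
    }
}
```

    Because a repeated write leaves the table in the same state, a retried or re-delivered event does no harm.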

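  • Event Cursor

    The Accumulo cursor automatically buffers responses to conserve memory
    Events constructed directly from an Accumulo row do not
    If not careful, out-of-memory exceptions follow (especially true in big data)

    RowID Family Qualifier Value

    // imports: java.util.Iterator, java.util.Map.Entry,
    //          org.apache.accumulo.core.client.Scanner,
    //          org.apache.accumulo.core.data.Key, org.apache.accumulo.core.data.Value
    class EventCursor {
        private final Iterator<Entry<Key, Value>> rowIterator;

        public EventCursor(Scanner s) {
            // The Scanner fetches results in batches, so only a buffer's worth is held in memory
            rowIterator = s.iterator();
        }

        public boolean hasNext() {
            return rowIterator.hasNext();
        }

        // Build one Event per call instead of materializing the whole result set
        public Event next() {
            return row2Event(rowIterator.next()); // row2Event: conversion helper, not shown on the slide
        }
    }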

  • A Word About Accumulo Visibility

    Different (part of the key)

  • Event Programming

  • Exception-Based Programming

    Don't ask for permission, but plan for exceptions
    Faster and more efficient
    Program to expect that exceptions won't happen and, if they do, handle them
    Watch out for thread contention: can use a Lock

    RowID Family Qualifier Value

    // Optional - openLock.lock();
    while (true) {
        try {
            wr = aClient.createBatchWriter(EVENT_CONTEXT_TABLE, new BatchWriterConfig());
            break;
        } catch (TableNotFoundException e) {
            // Create the table and retry
            try {
                aClient.tableOperations().create(EVENT_CONTEXT_TABLE);
            } catch (TableExistsException raced) {
                // Another thread created the table first; loop and retry the writer
            }
        }
    }
    // Optional - openLock.unlock();

  • Avoid Transactions

    Big data transactions are expensive (and difficult)
    Make the need rare and the solution lazy
    Distributed partial state dilemma

    Appending to and updating a single row does not require formal transactions
    Race conditions: lazy recognition and repair
    Accumulo only ensures row-level transactions (but these can still be of value, since each field can hold a lot of data)
    Event conclusions too close in time are just reprocessed or properly thread-bundled

    RowID Family Qualifier Value


  • Progressive Provenance

    Retrieve the origin of event combinations
    Maintain context surrounding the event
    Use the same key in different tables for rapid traversal

    RowID Family Qualifier Value


  • Test Events

    Test flag allows in-stream test and validation
      Availability
      Performance
      Quality
      What-ifs

    Flag indicates different storage table, queues, ...
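
    One way to picture the test flag is as a routing decision made once per event. The sketch below is an assumption-laden illustration: the Event class, the flag field, and both table names are hypothetical and only show test traffic flowing through the same code path while landing in separate storage.

```java
class TestRoutingSketch {
    // Hypothetical table names; a real deployment would choose its own.
    static final String EVENT_TABLE = "events";
    static final String TEST_EVENT_TABLE = "events_test";

    // Hypothetical event carrying a test flag alongside its ID.
    static class Event {
        final String id;
        final boolean test;

        Event(String id, boolean test) {
            this.id = id;
            this.test = test;
        }
    }

    // Test events take the same path as live events but are stored separately.
    static String tableFor(Event e) {
        return e.test ? TEST_EVENT_TABLE : EVENT_TABLE;
    }

    public static void main(String[] args) {
        System.out.println(tableFor(new Event("e1", false)));
        System.out.println(tableFor(new Event("e2", true)));
    }
}
```

    The same switch could select a separate queue, so in-stream tests never mix with production context.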

  • Event Management with Accumulo

  • Turning an Event Off

    Event assertion no longer supported (but was)

    RowID Family Qualifier Value


  • Forgetting an Event (Error)

    Store events for easy retrieval
    Maintain context surrounding the event

    RowID Family Qualifier Value


  • Time Travel

    Rerun (time) events due to corrupted data, out-of-order events, event error, event correction, or what-if scenarios
    Develop context surrounding the event
    Remixing the cake

    ** Need to run Topic X again since last October due to an error, then:
    // Collect all events for Topic X since October (already in time order)
    // Clear Topic X context
    // Rerun collected events in order (all corrected now!)

    RowID Family Qualifier Value
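
    The collect, clear, and rerun steps can be sketched with in-memory stand-ins. Everything here is hypothetical: a TreeMap keyed by timestamp plays the event table, a HashMap of running totals plays the context, and the sketch assumes the topic's context derives entirely from events at or after the chosen start time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TimeTravelSketch {
    // Hypothetical event: a topic plus an amount that the context accumulates.
    static class Event {
        final String topic;
        final int amount;

        Event(String topic, int amount) {
            this.topic = topic;
            this.amount = amount;
        }
    }

    // Stand-in for the event table: keyed by timestamp, so iteration is in time order.
    final TreeMap<Long, Event> events = new TreeMap<Long, Event>();
    // Stand-in for the context table: one running total per topic.
    final Map<String, Integer> context = new HashMap<String, Integer>();

    // Normal path: store the event, then fold it into the context.
    void process(long time, Event e) {
        events.put(time, e);
        apply(e);
    }

    private void apply(Event e) {
        Integer total = context.get(e.topic);
        context.put(e.topic, (total == null ? 0 : total) + e.amount);
    }

    // Rerun one topic from 'since' onward; returns the number of events replayed.
    int rerun(String topic, long since) {
        // 1. Collect the topic's events since the start time (already in time order)
        List<Event> collected = new ArrayList<Event>();
        for (Event e : events.tailMap(since, true).values()) {
            if (e.topic.equals(topic)) {
                collected.add(e);
            }
        }
        // 2. Clear the topic's context
        context.remove(topic);
        // 3. Rerun the collected events in order
        for (Event e : collected) {
            apply(e);
        }
        return collected.size();
    }

    static TimeTravelSketch demo() {
        TimeTravelSketch s = new TimeTravelSketch();
        s.process(1L, new Event("X", 5));
        s.process(2L, new Event("Y", 1));
        s.process(3L, new Event("X", 100));   // erroneous event
        s.events.put(3L, new Event("X", 10)); // correct the stored event in place
        s.rerun("X", 1L);
        return s;
    }

    public static void main(String[] args) {
        System.out.println(demo().context);
    }
}
```

    Only Topic X's context is rebuilt; Topic Y is untouched, and running the same rerun twice gives the same result.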


  • Future Events

    Future events (expiring state, travel plans, ...)
    May not happen, or may change

    RowID Family Qualifier Value

    Store event as always
    Schedule timer (or interval timer) to ignite future events
    Events easily removed due to update: the timer finds nothing
    Requires careful consideration of index/RowId
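
    The "timer finds nothing" idea can be sketched with a JDK scheduler and an in-memory map. This is a minimal illustration, not the prototype's code: FutureEventSketch, its methods, and the row ID format are hypothetical, and a ConcurrentSkipListMap stands in for the table that holds pending events.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FutureEventSketch {
    // Stand-in for the table of pending future events, keyed by Row ID.
    final Map<String, String> futureEvents = new ConcurrentSkipListMap<String, String>();
    // Payloads of the events that actually fired.
    final List<String> fired = Collections.synchronizedList(new ArrayList<String>());

    // Store the event as always.
    void store(String rowId, String payload) {
        futureEvents.put(rowId, payload);
    }

    // An update that cancels the plan simply removes the stored event.
    void cancel(String rowId) {
        futureEvents.remove(rowId);
    }

    // Timer callback: fire only if the event is still stored; otherwise do nothing.
    boolean onTimer(String rowId) {
        String payload = futureEvents.remove(rowId);
        if (payload == null) {
            return false; // removed or already fired: the timer finds nothing
        }
        fired.add(payload);
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        final FutureEventSketch s = new FutureEventSketch();
        final String rowId = "2015-05-01T09:00|trip-42"; // hypothetical time-prefixed Row ID
        s.store(rowId, "depart");

        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.schedule(new Runnable() {
            public void run() {
                s.onTimer(rowId);
            }
        }, 100, TimeUnit.MILLISECONDS);
        timer.shutdown(); // already-scheduled delayed tasks still run by default
        timer.awaitTermination(1, TimeUnit.SECONDS);
        System.out.println(s.fired);
    }
}
```

    The timer never needs to be cancelled: removing the stored event is enough. Prefixing the Row ID with the due time, as assumed here, is one way to let an interval timer scan only the events that are due.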

  • Extra Extra

    Analytics: events create a rich foundation for longitudinal analytics, but the data model must be designed for efficient queries (proper indexing)

    Backup/Recovery: take advantage of Accumulo clone and pause processing

    Hybrid Systems: Semantic Web, related NoSQL (MongoDB and Neo4J), MapReduce

    Gotcha: Accumulo is built upon Hadoop and ZooKeeper

  • Follow Up

    Email for EC2 Accumulo and event-driven prototype: [email protected]
    Questions any time
    Play: free micro computer for one year