-
Accumulo Summit - 4/28/2015
Event-Driven Big Data with Accumulo
Leveraging Big Data in Motion
John Hebeler, Lockheed Martin Inc.
"It is a capital mistake to theorize before one has data." - Sherlock Holmes
-
Plan
Brief Event-Driven Overview
Accumulo Event Management
Demonstration/Access to EC2
2
-
Events
Events drive our world - it is our context
Data processing often reflects these events, but with batch latency, poor resolution, longitudinal conflicts, and pull-type architectures
If you don't ask - no one hears; event consequences are delayed and possibly lost
Especially true in context with related events
Time plays a critical factor - before, after, simultaneous
Focus on Accumulo Role and Implementation
3
-
Event-Driven Architecture
Events drive to consequences
Multiple Levels/Iterations
Clients (or downstream events) analyze the consequences in near real-time
Stateless except for Big Data (Accumulo), which makes it possible!
Resolution, Fidelity, Query, ...
4
-
Accumulo Data Model
Decomposable, Flexible Key
Lexicographical index (only) from Row ID
Family and Qualifier can be Columns or Row/Key enrichment
Visibility controls row-level flexible security
Timestamp usually automatic and allows versions
Value: anything, but not really searchable
Any of the above can be quite huge
Atomic only at Row Level
Key = Row ID, Column (Family, Qualifier, Visibility), Timestamp -> Value
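A minimal write sketch (not from the slides; the Connector conn, the events table, and the field values are assumptions) showing how each part of the key is supplied on a put:

// Uses org.apache.accumulo.core.client.*, org.apache.accumulo.core.data.*, org.apache.accumulo.core.security.ColumnVisibility
BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
Mutation m = new Mutation("eventId-0001");                    // Row ID
m.put("context", "location", new ColumnVisibility("public"),  // Family, Qualifier, Visibility
      new Value("building-7".getBytes()));                    // Value; Timestamp is assigned automatically
writer.addMutation(m);
writer.close();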
-
Events and Context
Store events for easy retrieval
Events continue to grow; context reaches steady state
Proper interpretation of an event within its context
Idempotence
6
-
Categories
1. Direct Accumulo Operations
2. Event Programming
3. Event Management with Accumulo
-
Direct Accumulo Operations
-
Query
Key constructs - Packed fields vs Column based - your choice
Lexicographical Index Only
Index - (another word for "build a new table")
A scan for prefix a finds a.a.a.b (see the prefix-scan sketch after the code below)
Not usually practical to search in the Value
Query for the past values (versions)
Time
ArrayList<Range> ranges = new ArrayList<Range>();
// Populate ranges
BatchScanner bs = conn.createBatchScanner(tableName, auths, numQueryThreads);  // auths = caller's Authorizations
bs.setRanges(ranges);

TableOperations to = conn.tableOperations();
to.setProperty(tableName, "table.iterator.scan.vers.opt.maxVersions", "N");   // N = versions to keep
to.setProperty(tableName, "table.iterator.majc.vers.opt.maxVersions", "N");
to.setProperty(tableName, "table.iterator.minc.vers.opt.maxVersions", "N");
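A sketch of the prefix behavior noted above, assuming a Connector conn and the caller's Authorizations auths; because Row IDs are indexed lexicographically, a scan over the prefix a returns a.a.a.b:

Scanner scanner = conn.createScanner(tableName, auths);
scanner.setRange(Range.prefix("a"));                 // covers a, a.a.a.b, a.zzz, ...
for (Map.Entry<Key,Value> entry : scanner) {
  System.out.println(entry.getKey().getRow() + " -> " + entry.getValue());
}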
RowID Family Qualifier Value
9
-
Event Update
Store events for easy retrieval
Maintain context surrounding the event
Write with same key - updates value
RowID Family Qualifier Value
10
Example rows: EventID1, EventID2, EventID3 ... Value = Event (JSON or Serialized Object)
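A sketch of the same-key update, assuming a BatchWriter writer and JSON strings jsonV1/jsonV2; the second put uses an identical RowID/Family/Qualifier, so it becomes the current value on the next scan (older versions linger only up to the configured maxVersions):

Mutation m1 = new Mutation("EventID1");
m1.put("event", "payload", new Value(jsonV1.getBytes()));   // initial event state
writer.addMutation(m1);

Mutation m2 = new Mutation("EventID1");                      // same key...
m2.put("event", "payload", new Value(jsonV2.getBytes()));    // ...so this value wins on the next scan
writer.addMutation(m2);
writer.flush();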
-
Event Cursor
The Accumulo cursor automatically buffers responses to conserve memory
Events constructed directly from an Accumulo row do not
If not careful, out-of-memory exceptions (especially true in big data)
RowID Family Qualifier Value

class EventCursor {
  private Iterator<Map.Entry<Key,Value>> rowIterator = null;

  public EventCursor(Scanner s) {
    rowIterator = s.iterator();
  }

  public Event next() {
    return row2Event(rowIterator.next());   // one entry fetched at a time
  }
}
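A usage sketch, assuming a Connector conn, Authorizations auths, and an events table:

Scanner s = conn.createScanner("events", auths);
EventCursor cursor = new EventCursor(s);
Event event = cursor.next();   // entries stream from the tablet servers as needed, never all at once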
-
A Word About Accumulo Visibility
Different - visibility is part of the key itself
-
Event Programming
-
Exception-based Programming
Don't ask for permission, but plan for exceptions
Faster and more efficient
Program to expect that they won't happen, and if they do, handle it
Watch out for thread contention - can use a Lock
RowID Family Qualifier Value
// Optional - openLock.lock();
while (true) {
  try {
    wr = aClient.createBatchWriter(EVENT_CONTEXT_TABLE, new BatchWriterConfig());
    break;
  } catch (TableNotFoundException e) {
    // Create the table and retry - also need to catch TableExistsException
    aClient.tableOperations().create(EVENT_CONTEXT_TABLE);
  }
}
// Optional - openLock.unlock();
-
Avoid Transactions
Big data transactions are expensive (and difficult)
Make the need rare and the solution lazy
Distributed partial-state dilemma
Append and update of a single row does not require a formal transaction (see the sketch below)
Race condition: lazy recognition and repair
Accumulo only ensures row-level transactions (but that can still be of value, since each field can hold a lot of data)
Event conclusions too close in time are just reprocessed or properly thread bundled
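A sketch of leaning on row-level atomicity instead of a transaction (the qualifier names and BatchWriter writer are assumptions): every put bundled into one Mutation for the same row is applied atomically:

Mutation m = new Mutation("EventID1");
m.put("context", "status", new Value("CONCLUDED".getBytes()));
m.put("context", "concludedBy", new Value("rule-17".getBytes()));
m.put("context", "summary", new Value(summaryJson.getBytes()));
writer.addMutation(m);   // all three fields land together or not at all - no cross-row transaction required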
RowID Family Qualifier Value
15
-
Progressive Provenience
Retrieve origin of event combinations
Maintain context surrounding the event
Use same key in different tables for rapid traversal
RowID Family Qualifier Value
16
-
Test Events
Test Flag allows In-Stream Test and Validation: Availability, Performance, Quality, What Ifs
Flag indicates different storage table, queues, ...
-
Event Management with Accumulo
-
Turning an Event Off
Event assertion no longer supported (but it once was)
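One possible sketch (the asserted qualifier and BatchWriter writer are assumptions): keep the event on record but overwrite a flag so downstream context no longer counts it:

Mutation m = new Mutation("EventID1");
m.put("event", "asserted", new Value("false".getBytes()));   // event stays queryable, but is no longer asserted
writer.addMutation(m);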
RowID Family Qualifier Value
19
-
Forgetting an Event (Error)
Store events for easy retrieval
Maintain context surrounding the event
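A deletion sketch, assuming a Connector conn, Authorizations auths, and the event's RowID; a BatchDeleter removes every key/value under the row, so the event is physically forgotten rather than just turned off:

BatchDeleter deleter = conn.createBatchDeleter("events", auths, 4, new BatchWriterConfig());
deleter.setRanges(Collections.singleton(Range.exact("EventID1")));
deleter.delete();
deleter.close();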
RowID Family Qualifier Value
20
-
Time Travel
Rerun (time) events due to corrupted data, out-of-order events, event error, event correction, or what-if scenarios
Develop context surrounding the event
Remixing the cake

// Need to rerun Topic X since last October due to an error back then
// Collect all events for Topic X since October (already in time order)
// Clear Topic X context
// Rerun collected events in order (all corrected now!)
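A sketch of the collect-and-rerun step above, under the assumption that event RowIDs sort by time within a topic (e.g. topicX_2014-10-01...); clearContext, reprocess, and row2Event stand in for the application's own handlers:

Scanner scanner = conn.createScanner("events", auths);
scanner.setRange(new Range("topicX_2014-10-01", "topicX_9999-99-99"));  // Topic X since last October
clearContext("topicX");                                                  // wipe the topic's derived context
for (Map.Entry<Key,Value> entry : scanner) {
  reprocess(row2Event(entry));                                           // replay in time order to rebuild it
}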
RowID Family Qualifier Value
21
-
Future Events
Future Events (Expiring State, Travel Plans, ...) may not happen or may change
RowID Family Qualifier Value
Store event as always
Schedule a timer (or interval timer) to ignite future events
Events easily removed due to update; the timer finds nothing
Requires careful consideration of index/RowID
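A timer sketch, assuming a Connector conn, Authorizations auths, a future_events table, and fire(...)/row2Event(...) handlers; if the event row was updated or removed in the meantime, the timer simply finds nothing:

ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
timer.schedule(() -> {
  try {
    Scanner s = conn.createScanner("future_events", auths);
    s.setRange(Range.exact(futureEventRowId));
    Iterator<Map.Entry<Key,Value>> it = s.iterator();
    if (it.hasNext()) {
      fire(row2Event(it.next()));   // still scheduled - ignite the future event
    }                               // otherwise the event was updated away; nothing to do
  } catch (TableNotFoundException e) {
    // table missing - treat as nothing scheduled
  }
}, delayMillis, TimeUnit.MILLISECONDS);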
-
Extra Extra
Analytics: Events create a rich foundation for longitudinal analytics - but must consider the data model for efficient queries (proper indexing)
Backup/Recovery: Take advantage of Accumulo clone and pause processing
Hybrid Systems: Semantic Web related, NoSQL (MongoDB and Neo4j), MapReduce
Gotcha: Accumulo is built upon Hadoop and ZooKeeper
-
Follow Up
Email for the EC2 Accumulo and event-driven prototype: [email protected]
Questions any time
Play - free micro computer for one year