sqrrl
Secure. Scale. Adapt.
Sqrrl Data, Inc. All Rights Reserved
Apache Accumulo
Adam Fuchs, CTO
Sqrrl Data, Inc.
September 19, 2013
Friday, 20 September 13
Outline
Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance
Two Halves of Real-Time
Data-Driven: Stream Processing Engines. Real-time means reducing event-to-reaction time.
Query-Driven: NoSQL Databases and Business Intelligence Tools. Real-time means reducing ingest-to-query latency.
Data-Driven + Query-Driven Real-Time Ecosystem
[Diagram: Data, SPE, NoSQL+, Dashboards, Actions, and Interactive Analysis Tools (Discovery + Forensics), connected by flows numbered 1 to 6]
1. SPE queries NoSQL to enrich streaming data
2. SPE persists results in NoSQL for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. NoSQL provides context for dashboards
6. Analysis tools use NoSQL to search and manipulate historical data
This talk focuses on the database.
[Diagram: the ecosystem from the previous slide (Data, SPE, NoSQL+, Dashboards, Actions, Interactive Analysis Tools) with the same six numbered flows]
Trendulo – An Example Application
Developed by Jared Winick: see trendulo.com
Outline
Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance
Accumulo Data Format
An Accumulo key is a 5-tuple, consisting of:
Row: Controls Atomicity
Column Family: Controls Locality
Column Qualifier: Controls Uniqueness
Visibility Label: Controls Access
Timestamp: Controls Versioning

Accumulo Key/Value Example
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
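The 5-tuple key can be modeled in a few lines. The sketch below is not Accumulo's API; the `Key` class and `sort_key` helper are hypothetical names used for illustration. It assumes the usual Accumulo ordering: ascending on row, column family, column qualifier, and visibility, with newer timestamps sorting first. The second Cholesterol reading is made-up data added to show versioning.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Illustrative stand-in for the 5-tuple key (not the Accumulo class)."""
    row: str
    column_family: str
    column_qualifier: str
    visibility: str
    timestamp: int


def sort_key(key):
    # Ascending on every field except the timestamp, which sorts newest first.
    return (key.row, key.column_family, key.column_qualifier,
            key.visibility, -key.timestamp)


entries = [
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120801), "190"),  # hypothetical older reading
    (Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912), "183"),
    (Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912), "(note text)"),
]

sorted_entries = sorted(entries, key=lambda entry: sort_key(entry[0]))
for key, value in sorted_entries:
    print(key, value)
```

Because every entry for a row sorts together, and the newest version of a cell sorts first, a scan can stop at the first version it meets for each cell.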
Accumulo Tablets
Collections of KV pairs form Tables
Tables are partitioned into Tablets
Metadata tablets hold info about other tablets, forming a 3-level hierarchy
A Tablet is a unit of work for a Tablet Server

[Diagram: 3-level tablet hierarchy]
Well-Known Location (zookeeper)
  Root Tablet: -∞ to ∞
    Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
      Table: Adam’s Table
        Data Tablet: -∞ : thing
        Data Tablet: thing : ∞
      Table: Encyclopedia
        Data Tablet: -∞ : Ocelot
        Data Tablet: Ocelot : Yak
        Data Tablet: Yak : ∞
      Table: Foo
        Data Tablet: -∞ to ∞
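Finding the tablet that holds a row reduces to a search over the table's sorted split points. The sketch below uses the Encyclopedia splits from the diagram; `find_tablet` is a hypothetical helper, and the convention that a tablet excludes its start row and includes its end row is an assumption of this sketch.

```python
from bisect import bisect_left


def find_tablet(split_points, row):
    """Return the (start, end) range of the tablet that holds `row`.

    `split_points` must be sorted. None stands for an unbounded end
    (-infinity for start, +infinity for end).
    """
    index = bisect_left(split_points, row)
    start = split_points[index - 1] if index > 0 else None
    end = split_points[index] if index < len(split_points) else None
    return (start, end)


encyclopedia_splits = ["Ocelot", "Yak"]
print(find_tablet(encyclopedia_splits, "Aardvark"))
print(find_tablet(encyclopedia_splits, "Panda"))
print(find_tablet(encyclopedia_splits, "Zebra"))
```

The metadata tablets apply the same idea one level up: their rows describe data tablets, so a client resolves a row by walking from the root tablet to a metadata tablet to a data tablet.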
Accumulo Processes
[Diagram: Applications, Tablet Servers (each hosting Tablets), Master, Zookeeper nodes, and HDFS]
Applications and Tablet Servers: Read/Write
Tablet Servers and HDFS: Store/Replicate
Master and Tablet Servers: Assign/Balance
Zookeeper: Delegate Authority
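Tablet Data Flow
[Diagram: data flow inside a Tablet]
Writes: go to the In-Memory Map and the Write Ahead Log (For Recovery)
Minor Compaction: In-Memory Map, through an Iterator Tree, to a Sorted, Indexed File
Merging / Major Compaction: Sorted, Indexed Files, through an Iterator Tree, to a Sorted, Indexed File
Reads: In-Memory Map and Sorted, Indexed Files, through an Iterator Tree, to the Scan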
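The data flow above can be imitated with a toy in-process model. This is a simplified sketch, not Accumulo's implementation: the `Tablet` class and its method names are hypothetical, files are plain sorted lists, and a newer write to a key simply replaces the older one instead of being kept as a timestamped version.

```python
import heapq


class Tablet:
    """Toy model: write-ahead log, in-memory map, and sorted files."""

    def __init__(self):
        self.write_ahead_log = []  # append-only record, kept for recovery
        self.in_memory_map = {}    # recent writes that are not yet flushed
        self.files = []            # each file is a sorted list of (key, value)

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.in_memory_map[key] = value

    def minor_compact(self):
        """Flush the in-memory map to a new sorted file."""
        if self.in_memory_map:
            self.files.append(sorted(self.in_memory_map.items()))
            self.in_memory_map = {}
            self.write_ahead_log = []  # flushed data no longer needs the log

    def major_compact(self):
        """Merge all sorted files into a single sorted file."""
        if self.files:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        """Merged, sorted view over every file plus the in-memory map."""
        sources = self.files + [sorted(self.in_memory_map.items())]
        return list(self._merge(sources))

    @staticmethod
    def _merge(sources):
        # Later sources are newer. Tagging each entry with -age makes the
        # newest copy of a key come first, and older copies are skipped.
        tagged = [[(key, -age, value) for key, value in source]
                  for age, source in enumerate(sources)]
        last_key = object()
        for key, _, value in heapq.merge(*tagged):
            if key != last_key:
                yield key, value
                last_key = key


tablet = Tablet()
tablet.write("b", "1")
tablet.write("a", "2")
tablet.minor_compact()
tablet.write("b", "3")
print(tablet.scan())
```

The `_merge` step is where an iterator tree would sit in the real system: the same merged stream feeds scans, minor compactions, and major compactions.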
Outline
Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance
Iterator Framework
Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
Word Count: Summing Aggregating Iterator
Input Corpus
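The slide's figure did not survive extraction, so the following is only a sketch of the general idea behind a summing aggregating iterator, not the code shown in the talk. The names `ingest` and `summing_combiner` are hypothetical. Ingest writes one `(word, 1)` entry per occurrence without reading anything back, and the combiner sums adjacent entries for the same key as the sorted stream passes through it.

```python
from itertools import groupby


def ingest(corpus):
    """Blind writes: one (word, 1) entry per occurrence, sorted by key."""
    return sorted((word, 1) for line in corpus for word in line.split())


def summing_combiner(sorted_entries):
    """Collapse adjacent entries that share a key into a single sum."""
    for word, group in groupby(sorted_entries, key=lambda entry: entry[0]):
        yield word, sum(count for _, count in group)


corpus = ["the quick fox", "the lazy dog the end"]  # made-up input corpus
counts = dict(summing_combiner(ingest(corpus)))
print(counts)
```

The same function could run at scan time or at compaction time. Running it at compaction shrinks what is stored, while running it at scan keeps results correct for entries that have not been compacted yet.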
Accumulo Latencies
[Diagram: Ingesters (Input, BatchWriter) write to Tablet Servers; each Tablet Server holds an In-Memory Map and RFiles, with Compaction Iterators and Scan Iterators; Queriers (Scanner/Batch Scanner, Output) read from the Tablet Servers. Latency labels on the diagram: ~ms, ~ms, ~ms, and ms - min.]
Accumulo Throughput
[Diagram: Ingesters (Input, BatchWriter) write to Tablet Servers; each Tablet Server holds an In-Memory Map and RFiles, with Compaction Iterators and Scan Iterators; Queriers (Scanner/Batch Scanner, Output) read from the Tablet Servers. Latency labels on the diagram: ~ms, ~ms, ~ms, and ms - min.]
Scan: up to 1M entries/s per node
Ingest: up to 500K entries/s per node
Read-Modify-Write Latency: ~ms
>1K entries/s challenging with R-M-W
Iterator Overview
• Iterators extend the set of operations that are optimized to avoid Read-Modify-Write
• Iterators enable very high throughput for:
  – Upserts
  – Filtering
  – Aggregation
• Iterators also support server-side query operations
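One way to picture how iterators avoid Read-Modify-Write is as a stack of generators over a sorted stream. The sketch below is an illustration under assumptions, not Accumulo's iterator interface: `versioning_iterator` and `filter_iterator` are hypothetical names, and the entries are made-up data. An upsert becomes a blind write of a newer version, the versioning stage hides older versions at read time, and a filter stage runs on top of it.

```python
def versioning_iterator(source, max_versions=1):
    """Keep only the newest `max_versions` entries for each (row, column)."""
    current, seen = None, 0
    for row, column, timestamp, value in source:
        if (row, column) != current:
            current, seen = (row, column), 0
        seen += 1
        if seen <= max_versions:
            yield row, column, timestamp, value


def filter_iterator(source, predicate):
    """Pass through only the entries that satisfy `predicate`."""
    for entry in source:
        if predicate(entry):
            yield entry


# Entries as stored: sorted by row and column, newest timestamp first.
# The write at timestamp 3 "updated" row 1 without reading the old value.
stored = [
    ("1", "Sales", 3, 120),
    ("1", "Sales", 1, 100),
    ("2", "Sales", 2, 350),
]

latest = list(versioning_iterator(stored))
big_sales = list(filter_iterator(versioning_iterator(stored),
                                 lambda entry: entry[3] > 200))
print(latest)
print(big_sales)
```

Each stage only sees a sorted stream and emits a sorted stream, which is what lets stages be composed and run on the server side.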
Data-Centric Security
Definition: Data carries with it information that is required to make policy decisions on its releasability.
[Diagram: User 1 and User 2 each receive a different view of the same table from Sqrrl/Accumulo]

First view:
Row  Col    Value
1    Name   Jones
1    Sales  100
1    Age    28
2    Name   Smith
2    Sales  350
2    Age    25
2    Quota  1000

Second view:
Row  Col    Value
1    Name   Anon1
1    Sales  100
2    Name   Smith
2    Sales  350
2    Age    25
2    Quota  1000
Security
Example Accumulo Key/Value Pairs
Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
Accumulo is the only NoSQL database with cell-level access controls
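Visibility labels such as `JD|PCP_JD` are boolean expressions over authorization tokens, where `|` means "or" and `&` means "and". The evaluator below is a minimal sketch of that idea, not Accumulo's own parser: `is_visible` is a hypothetical helper, it accepts only letters, digits, and underscores in tokens, and it treats an empty label as visible to everyone.

```python
import re


def is_visible(expression, authorizations):
    """Return True if a reader holding `authorizations` may see the cell."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_expr():
        nonlocal pos
        result = parse_term()
        op = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if op is not None and tokens[pos] != op:
                raise ValueError("mixing & and | requires parentheses")
            op = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if op == "&" else (result or right)
        return result

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        return token in authorizations

    if not tokens:
        return True  # an empty label places no restriction on the cell
    return parse_expr()


primary_care = {"PCP_JD"}  # authorizations held by a hypothetical reader
print(is_visible("JD|PCP_JD", primary_care))
print(is_visible("JD|PSYCH_JD", primary_care))
```

With the example table, a reader holding only `PCP_JD` would see the Notes and Cholesterol cells but not the Mental Health or X-Ray cells.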
Data-Centric Security Ecosystem
[Diagram labels: Data, Lab, Sqrrl Enterprise, App, User Attributes, Audits, Policies, End Users, AutPoli, Key]
Outline
Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance
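Hierarchical Decomposition
[Diagram: a tree in which each row branches into column families, each column family into column qualifiers, and each column qualifier into a value]
Row:               <person>
Column Family:     attribute, purchases, returns
Column Qualifier:  age, discount, hat, sneakers
Value:             <age>, <rate>, <cost>, <cost>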
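Materialized Table
[Diagram: the hierarchy filled in for two rows, with one root-to-leaf path marked as a Key/Value Pair]
Row: george
  Column Family: attribute, purchases, returns
  Column Qualifier: age, hat, sneakers
  Value: 27, $83, $42
Row: bill
  Column Family: attribute, purchases
  Column Qualifier: age, discount, sneakers
  Value: 49, 40%, $100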
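Flattening a nested record into key/value pairs is a short recursive walk. The sketch below assumes a three-level nesting of row, column family, and column qualifier; `decompose` is a hypothetical helper, and the pairing of bill's qualifiers with values (age 49, discount 40%, sneakers $100) is inferred from the slide rather than stated by it.

```python
def decompose(records):
    """Yield ((row, column_family, column_qualifier), value) in sorted order."""
    for row, families in sorted(records.items()):
        for family, qualifiers in sorted(families.items()):
            for qualifier, value in sorted(qualifiers.items()):
                yield (row, family, qualifier), value


records = {
    "bill": {
        "attribute": {"age": "49", "discount": "40%"},  # pairing assumed
        "purchases": {"sneakers": "$100"},
    },
}

pairs = list(decompose(records))
for key, value in pairs:
    print(key, value)
```

Each root-to-leaf path in the tree becomes exactly one key/value pair, and the sorted order keeps everything about one person contiguous.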
Forward and Inverted Index
Table:             Forward Index  Inverted Index
Row:               <UUID>         <Term>
Column Family:     <Type>         <UUID>
Column Qualifier:  <Field>        <Type+Field>
Value:             <Term>         <Digest of Event>
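The two layouts can be produced from the same event in one pass. This is a sketch under assumptions: `index_entries` and `lookup` are hypothetical helpers, the `:` used to join type and field is an arbitrary choice, the event data is made up, and the column order follows the slide as extracted.

```python
def index_entries(uuid, event_type, fields, digest):
    """Build forward and inverted index entries for one event.

    Each entry is (row, column_family, column_qualifier, value).
    """
    forward, inverted = [], []
    for field, term in sorted(fields.items()):
        forward.append((uuid, event_type, field, term))
        inverted.append((term, uuid, event_type + ":" + field, digest))
    return forward, inverted


def lookup(inverted, term):
    """Return the UUIDs of events whose inverted index row equals `term`."""
    return sorted({uuid for row, uuid, _, _ in inverted if row == term})


forward, inverted = index_entries(
    "uuid-0001", "tweet", {"author": "adam", "lang": "en"}, "digest-0001")
print(forward)
print(lookup(inverted, "adam"))
```

A term query scans one row of the inverted index to collect UUIDs, then reads the matching rows of the forward index to fetch the full events.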
Forward and Inverted Index
Custom Indexing
Table:             Geo Index
Row:               <GeoHash>
Column Family:     <Event Type>
Column Qualifier:  <UUID>
Value:             <Digest of Event>

Latitude:   10110101001
Longitude:  00111010010
Depth:      11010110110
GeoHash (bits of the three dimensions interleaved): 101001110111010101011100001011100
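The combined bit string on the slide is what results from taking one bit at a time from latitude, longitude, and depth in turn. The sketch below reproduces it; `interleave` is a hypothetical helper, and how the three dimensions are quantized into bits in the first place is outside what the slide shows.

```python
def interleave(*bit_strings):
    """Interleave equal-length bit strings, one bit from each in turn."""
    return "".join("".join(bits) for bits in zip(*bit_strings))


latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

geohash = interleave(latitude, longitude, depth)
print(geohash)
```

Interleaving means points that are close in all three dimensions tend to share a long key prefix, so a range scan over the sorted rows returns spatially nearby events.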
Sqrrl Enterprise Built on Apache Accumulo
[Diagram: layered architecture]
Exploratory / Operational Apps and Bulk Processing Integration
Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Server: Graph + Document I/O
• Sqrrl proprietary
• Automated indexing
• Custom iterators
• Lucene integration
• Security extensions
Accumulo RPC (Sorted Key/Value I/O)
• Open source (including Sqrrl contributions)
Hadoop RPC (File I/O)
• Open source or commercial distributions
Outline
Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance
Accumulo with D4M 2.0 Schema Performance
Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013
Maximizing throughput on an 8-node, 192-core cluster:
Accumulo Scalability: Graph500 Benchmark
source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Atomic Increment Performance Comparison
Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)
Questions?
Adam Fuchs, CTO
Sqrrl Data, Inc.