apache accumulo - meetupfiles.meetup.com/1789394/bdl19-1-apache accumulo.pdf · sqrrl secure....


sqrrl Secure. Scale. Adapt.

Sqrrl Data, Inc. All Rights Reserved

Apache Accumulo

Adam Fuchs, CTO, Sqrrl Data, Inc.

September 19, 2013

Friday, 20 September 13


Outline

Accumulo
• Where it fits
• How it works
• Unique features
• How to use it
• Performance


Two Halves of Real-Time

Data-Driven: Stream Processing Engines. Real-Time: reduce event-to-reaction time.

Query-Driven: NoSQL Databases and Business Intelligence Tools. Real-Time: reduce ingest-to-query latency.


Data-Driven + Query-Driven Real-Time Ecosystem

[Diagram: Data, NoSQL + SPE, Dashboards, Actions, and Interactive Analysis Tools (Discovery + Forensics), connected by arrows numbered 1-6.]

1. SPE queries NoSQL to enrich streaming data
2. SPE persists results in NoSQL for future query
3. SPE takes action automatically
4. SPE issues data-driven alerts
5. NoSQL provides context for dashboards
6. Analysis tools use NoSQL to search and manipulate historical data


This talk focuses on the database.

[Diagram repeated from the previous slide: Data, NoSQL + SPE, Dashboards, Actions, and Interactive Analysis Tools (Discovery + Forensics), with the same six numbered interactions.]


Trendulo – An Example Application

Developed by Jared Winick: see trendulo.com


Accumulo Data Format

An Accumulo key is a 5-tuple, consisting of:
• Row: Controls Atomicity
• Column Family: Controls Locality
• Column Qualifier: Controls Uniqueness
• Visibility Label: Controls Access
• Timestamp: Controls Versioning

Accumulo Key/Value Example:

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
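As a rough illustration of the 5-tuple key, here is a minimal Python sketch (not Accumulo's API). It assumes the usual Accumulo ordering: row, column family, column qualifier and visibility sort ascending, while timestamps sort newest first. The `Key` class, `sort_key` helper and the older Cholesterol entry are hypothetical, added only for the example.

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Hypothetical stand-in for the 5-tuple key described on the slide."""
    row: str
    column_family: str
    column_qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key):
    # Ascending on the first four fields; negate the timestamp so that
    # the newest version of a cell sorts first (assumed ordering).
    return (k.row, k.column_family, k.column_qualifier, k.visibility, -k.timestamp)


entries = {
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute …",
    # Hypothetical older version of the same cell, not from the slide.
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120101): "190",
}

ordered = sorted(entries, key=sort_key)
for k in ordered:
    print(k, "->", entries[k])
```

Because all keys for one row are adjacent in this order, a row can be read or written as a unit, and the newest version of a cell is the first one a reader meets.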


Accumulo Tablets

• Collections of KV pairs form Tables
• Tables are partitioned into Tablets
• Metadata tablets hold info about other tablets, forming a 3-level hierarchy
• A Tablet is a unit of work for a Tablet Server

[Diagram of the hierarchy:]
Well-Known Location (zookeeper)
  Root Tablet: -∞ to ∞
    Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”
    Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞
      Table: Adam’s Table. Data Tablets: (-∞ : thing), (thing : ∞)
      Table: Encyclopedia. Data Tablets: (-∞ : Ocelot), (Ocelot : Yak), (Yak : ∞)
      Table: Foo. Data Tablet: (-∞ to ∞)
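The partitioning of a table into tablets can be sketched as a lookup over sorted split points. This is a minimal Python illustration, not Accumulo code. It assumes each tablet covers a range that excludes its start row and includes its end row; the function names are hypothetical, and the split points come from the Encyclopedia table in the diagram.

```python
from bisect import bisect_left


def tablet_ranges(split_points):
    """Return (start, end) row ranges for a table; None stands for -inf / +inf."""
    bounds = [None] + sorted(split_points) + [None]
    return list(zip(bounds[:-1], bounds[1:]))


def locate_tablet(split_points, row):
    """Index of the tablet whose range (start, end] contains the row (assumed convention)."""
    return bisect_left(sorted(split_points), row)


encyclopedia_splits = ["Ocelot", "Yak"]
ranges = tablet_ranges(encyclopedia_splits)

for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", ranges[locate_tablet(encyclopedia_splits, row)])
```

The metadata tablets play the same role one level up: they are themselves sorted tables whose rows point at data tablets, so finding a row takes a bounded number of such lookups.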


Accumulo Processes

[Diagram components: Applications, Tablet Servers (each hosting Tablets), Master, Zookeeper (three instances), HDFS. Arrow labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority (twice).]


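Tablet Data Flow

[Diagram of one Tablet:]
• Writes go to the In-Memory Map and to the Write Ahead Log (For Recovery)
• Minor Compaction: In-Memory Map → Iterator Tree → Sorted, Indexed File
• Merging / Major Compaction: Sorted, Indexed Files → Iterator Tree → Sorted, Indexed File
• Reads: In-Memory Map and Sorted, Indexed Files → Iterator Tree → Scan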
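The data flow above can be approximated in a few lines of Python. This is a toy sketch under simplifying assumptions: one value per key, no deletes, no versions, no real iterators. It is not how a Tablet Server is implemented, and the `Tablet` class and its method names are hypothetical.

```python
import heapq


def _merge(sources):
    """Merge sorted (key, value) sources; sources[0] is newest and wins on duplicate keys."""
    tagged = [[(key, age, value) for key, value in src] for age, src in enumerate(sources)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key


class Tablet:
    def __init__(self):
        self.write_ahead_log = []  # append-only record of writes, kept for recovery
        self.memory_map = {}       # in-memory map of recent writes
        self.sorted_files = []     # immutable sorted lists of (key, value), oldest first

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value

    def minor_compaction(self):
        """Flush the in-memory map to a new sorted file."""
        if self.memory_map:
            self.sorted_files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []  # flushed data no longer needs the log

    def major_compaction(self):
        """Merge all sorted files into one."""
        newest_first = list(reversed(self.sorted_files))
        self.sorted_files = [list(_merge(newest_first))]

    def scan(self):
        """Merged, sorted view over the in-memory map and all files."""
        sources = [sorted(self.memory_map.items())] + list(reversed(self.sorted_files))
        return _merge(sources)


tablet = Tablet()
tablet.write("b", "1")
tablet.write("a", "1")
tablet.minor_compaction()
tablet.write("b", "2")
tablet.write("c", "3")
print(list(tablet.scan()))
```

Writes never read existing data, and reads always see one sorted stream regardless of how many files back it; compactions only reduce the number of sources the merge has to visit.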


Iterator Framework

Iterator Operations:
• File Reads
• Block Caching
• Merging
• Deletion
• Isolation
• Locality Groups
• Range Selection
• Column Selection
• Cell-level Security
• Versioning
• Filtering
• Aggregation
• Partitioned Joins
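One way to picture the framework is as a stack of small stages, each consuming a sorted stream of entries and yielding a sorted stream. The Python generators below are a loose analogy, not Accumulo's iterator interface. The stage names, the entry layout (row, column, timestamp, value) and the sample data are assumptions for illustration only.

```python
def range_select(source, start_row, end_row):
    """Pass through entries whose row lies in [start_row, end_row]."""
    for entry in source:
        if start_row <= entry[0] <= end_row:
            yield entry


def versioning(source, max_versions=1):
    """Keep only the newest max_versions entries per (row, column).

    Assumes the source is sorted with the newest timestamp first."""
    current, seen = None, 0
    for row, column, timestamp, value in source:
        if (row, column) != current:
            current, seen = (row, column), 0
        seen += 1
        if seen <= max_versions:
            yield row, column, timestamp, value


def column_select(source, wanted_columns):
    """Pass through entries for the requested columns only."""
    for entry in source:
        if entry[1] in wanted_columns:
            yield entry


# Made-up sample data, sorted by row, column, then newest timestamp first.
data = [
    ("1", "Age", 2, "28"),
    ("1", "Age", 1, "27"),
    ("1", "Name", 1, "Jones"),
    ("2", "Name", 1, "Smith"),
    ("2", "Sales", 3, "350"),
    ("2", "Sales", 2, "300"),
]

stack = column_select(versioning(range_select(iter(data), "2", "2")), {"Sales"})
result = list(stack)
print(result)
```

Each stage is lazy and order-preserving, so stages compose freely and the whole stack can run close to the data.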


Word Count: Summing Aggregating Iterator

Input Corpus


Accumulo Latencies

[Diagram: Ingesters (Input → BatchWriter) → Tablet Servers (In-Memory Map, RFiles, Compaction Iterators, Scan Iterators) → Queriers (Scanner/BatchScanner → Output). Latency annotations: ~ms at three points, and ms - min at a fourth.]


Accumulo Throughput

[Diagram: Ingesters (Input → BatchWriter) → Tablet Servers (In-Memory Map, RFiles, Compaction Iterators, Scan Iterators) → Queriers (Scanner/BatchScanner → Output). Latency annotations: ~ms at three points, and ms - min at a fourth.]

• Scan: up to 1M entries/s per node
• Ingest: up to 500K entries/s per node
• Read-Modify-Write Latency: ~ms
• >1K entries/s challenging with R-M-W


Iterator Overview

• Iterators extend the set of operations that are optimized to avoid Read-Modify-Write
• Iterators enable very high throughput for:
  – Upserts
  – Filtering
  – Aggregation
• Iterators also support server-side query operations
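To make the "avoid Read-Modify-Write" point concrete, here is a small Python sketch of a summing aggregation applied to word counts. It is a simulation, not Accumulo's combiner API. The `CountingTable` class and its methods are hypothetical; the idea shown is that inserts are blind appends of partial values, and the sum is computed when data is scanned or compacted.

```python
from collections import defaultdict


class CountingTable:
    """Toy table whose values are combined by summation at scan/compaction time."""

    def __init__(self):
        self.entries = defaultdict(list)  # key -> list of un-combined partial values

    def insert(self, key, amount):
        # Blind write: no read of the existing value is needed.
        self.entries[key].append(amount)

    def scan(self):
        # The summing step runs on the read path.
        for key in sorted(self.entries):
            yield key, sum(self.entries[key])

    def compact(self):
        # The same summing step collapses partial values permanently.
        for key, parts in self.entries.items():
            self.entries[key] = [sum(parts)]


def ingest_corpus(table, text):
    """Emit a count of 1 per word occurrence, as in the word count example."""
    for word in text.lower().split():
        table.insert(word, 1)


table = CountingTable()
ingest_corpus(table, "the quick fox jumps over the lazy dog the end")
counts = dict(table.scan())
print(counts["the"])
```

An increment here costs one append rather than a round trip to read, add and write back, which is why this style of update can keep up with ingest rates that are hard to reach with Read-Modify-Write.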


Data-Centric Security

Definition: Data carries with it information that is required to make policy decisions on its releasability.

User 1 (view through Sqrrl/Accumulo):

Row  Col    Value
1    Name   Jones
1    Sales  100
1    Age    28
2    Name   Smith
2    Sales  350
2    Age    25
2    Quota  1000

User 2 (view through Sqrrl/Accumulo):

Row  Col    Value
1    Name   Anon1
1    Sales  100
2    Name   Smith
2    Sales  350
2    Age    25
2    Quota  1000


Security

Accumulo is the only NoSQL database with cell-level access controls

Example Accumulo Key/Value Pairs:

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…
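The visibility labels in the example (such as JD|PCP_JD) are boolean expressions over authorization tokens. Below is a minimal Python evaluator for such expressions, written as a sketch rather than a copy of Accumulo's implementation. It assumes `|` means OR, `&` means AND, an empty label is visible to everyone, and mixing `&` and `|` requires parentheses; the function name `is_visible` is hypothetical.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9_]+|[&|()]")


def is_visible(expression, authorizations):
    """Return True if a reader holding `authorizations` may see the cell."""
    if expression == "":
        return True  # assumed: unlabeled data is visible to all
    tokens = TOKEN.findall(expression)
    if "".join(tokens) != expression:
        raise ValueError("unexpected character in visibility expression")
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError("expected a label")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


# A primary care physician holding only PCP_JD:
print(is_visible("JD|PCP_JD", {"PCP_JD"}))    # cholesterol result
print(is_visible("JD|PSYCH_JD", {"PCP_JD"}))  # mental health result
```

In the patient example, a reader holding only PCP_JD would see the notes and the cholesterol result but not the mental health or X-ray entries.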


Data-Centric Security Ecosystem

[Diagram labels: Data Lab, Sqrrl Enterprise, App, End Users, User Attributes, Audits, Policies; the legend text ("AutPoli", "Key") is truncated in the source.]


Hierarchical Decomposition

[Tree diagram, one level per key component:]
Row: <person>
Column Family: attribute, purchases, returns
Column Qualifier: age, discount, hat, sneakers
Value: <age>, <rate>, <cost>, <cost>


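Materialized Table

[Tree diagram, one level per key component; one root-to-leaf path is labeled "Key/Value Pair".]

Row: george
  Column Family: attribute, purchases, returns
  Column Qualifier: age, hat, sneakers
  Value: 27, $83, $42

Row: bill
  Column Family: attribute, purchases
  Column Qualifier: age, discount, sneakers
  Value: 49, 40%, $100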
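The decomposition from a nested record into flat key/value pairs, and back, can be sketched as follows. This is an illustrative Python snippet, not Sqrrl or Accumulo code. The function names are hypothetical, and the record uses the values shown for bill, assuming age and discount sit under attribute and sneakers under purchases.

```python
def decompose(records):
    """Yield one (row, column_family, column_qualifier, value) tuple per leaf."""
    for row, families in records.items():
        for family, qualifiers in families.items():
            for qualifier, value in qualifiers.items():
                yield row, family, qualifier, value


def materialize(pairs):
    """Rebuild the nested structure from flat key/value pairs."""
    table = {}
    for row, family, qualifier, value in pairs:
        table.setdefault(row, {}).setdefault(family, {})[qualifier] = value
    return table


records = {
    "bill": {
        "attribute": {"age": "49", "discount": "40%"},
        "purchases": {"sneakers": "$100"},
    }
}

pairs = sorted(decompose(records))
for pair in pairs:
    print(pair)
```

Since the pairs sort by row first, everything about one person stays contiguous, and adding a new attribute or purchase is just one more key/value pair with no schema change.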


Forward and Inverted Index

Table:             Forward Index   Inverted Index
Row:               <UUID>          <Term>
Column Family:     <Type>          <UUID>
Column Qualifier:  <Field>         <Type+Field>
Value:             <Term>          <Digest of Event>
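A short Python sketch of writing one event into both tables is given below. It is only an illustration: the function names, the `+` used to join type and field, and the sample events are hypothetical, and the placement of <UUID> and <Type+Field> in the inverted index follows the order in which the labels appear in this transcript.

```python
def index_event(uuid, event_type, fields, digest):
    """Return (forward_entries, inverted_entries) for one event.

    Each entry is (row, column_family, column_qualifier, value)."""
    forward, inverted = [], []
    for field, term in fields.items():
        forward.append((uuid, event_type, field, term))
        inverted.append((term, uuid, f"{event_type}+{field}", digest))
    return forward, inverted


def lookup(inverted_entries, term):
    """Find the UUIDs of events containing the term.

    A real table would serve this with a scan over the row equal to `term`."""
    return sorted({uuid for row, uuid, _cq, _value in inverted_entries if row == term})


forward_table, inverted_table = [], []
sample_events = [
    ("e1", "tweet", {"text": "accumulo", "user": "adam"}, "digest-1"),
    ("e2", "tweet", {"text": "accumulo", "user": "jared"}, "digest-2"),
]
for uuid, event_type, fields, digest in sample_events:
    fwd, inv = index_event(uuid, event_type, fields, digest)
    forward_table.extend(fwd)
    inverted_table.extend(inv)

print(lookup(inverted_table, "accumulo"))
```

The forward index answers "what is in this event?" by row lookup on the UUID, while the inverted index answers "which events mention this term?" by row lookup on the term.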


Custom Indexing

Table: Geo Index
Row: <GeoHash>
Column Family: <Event Type>
Column Qualifier: <UUID>
Value: <Digest of Event>

GeoHash example (bits of each dimension interleaved):
Latitude:    10110101001
Longitude:   00111010010
Depth:       11010110110
Interleaved: 101001110111010101011100001011100
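The bit strings on this slide are consistent with a simple bit interleave of latitude, longitude and depth, taken in that order. The Python sketch below reproduces the slide's 33-bit value from its three 11-bit inputs. The function name `interleave` is hypothetical, and this shows only the interleaving step, not a full geohash encoding.

```python
def interleave(*dimensions):
    """Interleave equal-length bit strings: first bit of each, then second bit of each, ..."""
    if len({len(bits) for bits in dimensions}) != 1:
        raise ValueError("all dimensions must have the same number of bits")
    return "".join("".join(group) for group in zip(*dimensions))


latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

geohash = interleave(latitude, longitude, depth)
print(geohash)
```

Because the most significant bits of every dimension come first, points that are close in all three dimensions tend to share a long row prefix, so a region query becomes a small number of range scans over the sorted rows.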


Sqrrl Enterprise Built on Apache Accumulo

[Layered diagram, top to bottom:]

Exploratory / Operational Apps; Bulk Processing Integration
  Graph + Document I/O: Sqrrl API over Apache Thrift RPC (JSON, Graph, Aggregation, Search, etc.)
Sqrrl Server
  • Sqrrl proprietary
  • Automated indexing
  • Custom iterators
  • Lucene integration
  • Security extensions
  Sorted Key/Value I/O: Accumulo RPC
  • Open source (including Sqrrl contributions)
  File I/O: Hadoop RPC
  • Open source or commercial distributions


Accumulo with D4M 2.0 Schema Performance

Source: D4M 2.0 Schema: A General Purpose High Performance Schema for the Accumulo Database, Kepner et al., HPEC 2013

Maximizing throughput on an 8-node, 192-core cluster:


Accumulo Scalability: Graph500 Benchmark

source: http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf


Read/Modify/Write (HBase) vs. Iterators/Combiners (Accumulo)

Atomic Increment Performance Comparison


Adam Fuchs, CTO, Sqrrl Data, Inc.

Questions?
