Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data


Upload: yahoo-developer-network

Post on 15-Jan-2015


DESCRIPTION

Apache Accumulo, originally developed by the National Security Agency and now an Apache Software Foundation project, builds upon Google's Bigtable design to provide a scalable, lightly-structured database capability complementing the ubiquitous Hadoop environment. The core capabilities of Accumulo include cell-level security, flexible schemas, real-time analytics, bulk I/O, and linear scalability beyond trillions of entries and petabytes of data. These new capabilities lead to techniques that unlock the power of Big Data, but don't fit into traditional database design patterns. Learn about the advantages of Apache Accumulo and how it fits into the Hadoop and NoSQL ecosystem. Presenter: Adam Fuchs, CTO, sqrrl

TRANSCRIPT

Page 1: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

sqrrl data, INC.

Secure. Scale. Adapt.

[email protected] | @sqrrl_inc | 617.520.4375 sqrrl data, INC., All Rights Reserved

Adam Fuchs, Chief Technology Officer

Page 2: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Who We Are

sqrrl data, Inc. is the commercial provider of

Mature Database Technology - Apache Accumulo

Fine-Grained Access Controls - Data Integration and Sharing

Proven Performance - Petabytes and Beyond

Advanced Analytics - Search, Statistics, and Graphs

Page 3: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Contents

Core Philosophy

Technology

Techniques

Application APIs


Page 4: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Apache Accumulo Perspective

Integration across:

Multiple business lines

Multiple data sets

Multiple applications

Multiple security, privacy, legal, policy, regulatory, and compliance constraints

New demands

[Diagram: several Application boxes and several Data boxes]

Page 5: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Accumulo Design Drivers

1. Cell-Level Security: Express common security requirements in the infrastructure, not just in the application. Data-centric approach encourages secure sharing.

2. Scalability: Near-linear performance improvements at thousands of nodes. Durable and reliable under increased failures that come with scale.

3. Diverse, Interactive Analytics: Sorted key/value core performs well in a diverse set of domains: information retrieval, statistics, graph analysis, geo indexing, and more.

4. Flexible, Adaptive Schema: Start with universal structures and indexing. Refine the schema over time.

Page 6: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Contents

Core Philosophy

Technology

Techniques

Application APIs


Page 7: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Accumulo Key Structure

An Accumulo key is a 5-tuple, consisting of:

Row: Controls Atomicity

Column Family: Controls Locality

Column Qualifier: Controls Uniqueness

Visibility Label: Controls Access

Timestamp: Controls Versioning

Row       Col. Fam.     Col. Qual.     Visibility   Timestamp  Value
John Doe  Notes         PCP            PCP_JD       20120912   Patient suffers from an acute …
John Doe  Test Results  Cholesterol    JD|PCP_JD    20120912   183
John Doe  Test Results  Mental Health  JD|PSYCH_JD  20120801   Pass
John Doe  Test Results  X-Ray          JD|PHYS_JD   20120513   1010110110100…

Accumulo Key/Value Example
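
As a rough illustration of the 5-tuple, the sketch below models a key in Python and sorts entries the way a sorted key/value store would: ascending by row, column family, column qualifier, and visibility, then newest timestamp first. The `Key` class and `sort_order` helper are hypothetical names, not Accumulo's API, and the older cholesterol reading is made up to show versioning.

```python
from typing import NamedTuple

class Key(NamedTuple):
    row: str          # controls atomicity
    family: str       # controls locality
    qualifier: str    # controls uniqueness
    visibility: str   # controls access
    timestamp: int    # controls versioning

def sort_order(key):
    # Ascending on the first four parts, newest timestamp first
    # (assumed here to mirror a descending timestamp order).
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)

entries = {
    # Hypothetical older version of the cholesterol result.
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120801): "190",
    Key("John Doe", "Test Results", "Cholesterol", "JD|PCP_JD", 20120912): "183",
    Key("John Doe", "Notes", "PCP", "PCP_JD", 20120912): "Patient suffers from an acute ...",
}

for key in sorted(entries, key=sort_order):
    print(key, "->", entries[key])
```

Because all versions of a cell sort next to each other with the newest first, a reader that wants only the latest value can stop at the first entry for each row/column pair.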


Page 8: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Visibility Syntax & Semantics
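
A minimal sketch of how a visibility label such as JD|PCP_JD from the key/value example might be checked against a reader's authorizations. It assumes labels combine tokens with `&` (and), `|` (or), and parentheses; `tokenize` and `is_visible` are hypothetical names, and this is not Accumulo's own evaluator. Mixed operators are evaluated left to right here, so use parentheses when combining `&` and `|`.

```python
import re

def tokenize(expression):
    # Split a label into tokens, operators, and parentheses.
    return re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)

def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    pos = 0

    def parse_term():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    # An empty label is treated as visible to everyone (an assumption).
    return parse_expression() if tokens else True

print(is_visible("JD|PCP_JD", {"PCP_JD"}))    # the primary care physician can read
print(is_visible("JD|PCP_JD", {"PSYCH_JD"}))  # the psychologist cannot
```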


Page 9: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Tablets


Collections of KV pairs form Tables

Tables are partitioned into Tablets

Metadata tablets hold info about other tablets, forming a 3-level hierarchy

A Tablet is a unit of work for a Tablet Server

[Diagram: tablet hierarchy]

Well-Known Location (zookeeper)

Root Tablet: -∞ to ∞

Metadata Tablet 1: -∞ to “Encyclopedia:Ocelot”

Metadata Tablet 2: “Encyclopedia:Ocelot” to ∞

Table: Adam’s Table. Data Tablets: -∞ : thing, thing : ∞

Table: Encyclopedia. Data Tablets: -∞ : Ocelot, Ocelot : Yak, Yak : ∞

Table: Foo. Data Tablet: -∞ to ∞
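
To make the partitioning concrete, here is a small sketch of locating the tablet that holds a row by binary search over a table's split points. It assumes each tablet covers the range from the previous split (exclusive) to its own split (inclusive); `tablet_for` and `tablet_range` are hypothetical helpers, not part of any Accumulo client API.

```python
from bisect import bisect_left

def tablet_for(split_points, row):
    """Return the index of the tablet whose range contains `row`.

    Tablet i covers (split_points[i-1], split_points[i]], with -inf and
    +inf at the ends, so n split points define n + 1 tablets.
    """
    return bisect_left(split_points, row)

def tablet_range(split_points, index):
    low = split_points[index - 1] if index > 0 else "-inf"
    high = split_points[index] if index < len(split_points) else "+inf"
    return (low, high)

# Split points of the Encyclopedia table from the diagram.
splits = ["Ocelot", "Yak"]
for row in ["Aardvark", "Ocelot", "Panda", "Zebra"]:
    print(row, "->", tablet_range(splits, tablet_for(splits, row)))
```

The same lookup applies at each level of the hierarchy: the root tablet locates a metadata tablet, which in turn locates the data tablet.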


Page 10: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Accumulo Architecture

[Diagram: three Applications, three Tablet Servers (each hosting a Tablet), a three-node Zookeeper ensemble, a Master, and Hadoop. Edge labels: Read/Write, Store/Replicate, Assign/Balance, Delegate Authority.]

Page 11: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Tablet Data Flow

[Diagram: data flow inside a Tablet. Labels: Writes; In-Memory Map; Write Ahead Log (For Recovery); Minor Compaction; Sorted, Indexed File (three shown); Merging / Major Compaction; Reads; Scan; Iterator Tree (shown three times).]
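
The flow named in this diagram can be sketched as a toy log-structured store: writes land in a log and an in-memory map, a minor compaction flushes the map to a sorted file, a major compaction merges files, and a scan merges all sources. The `Tablet` class below is a simplified assumption for illustration (no iterators, deletes, or real files), not Accumulo's implementation.

```python
import heapq

class Tablet:
    def __init__(self):
        self.write_ahead_log = []  # replayed on recovery
        self.memory_map = {}       # newest, unsorted writes
        self.files = []            # sorted immutable "files", oldest first

    def write(self, key, value):
        self.write_ahead_log.append((key, value))
        self.memory_map[key] = value

    def minor_compact(self):
        # Flush the in-memory map to a new sorted file.
        if self.memory_map:
            self.files.append(sorted(self.memory_map.items()))
            self.memory_map = {}
            self.write_ahead_log = []

    def major_compact(self):
        # Merge all files into one, keeping the newest value per key.
        if self.files:
            self.files = [list(self._merge(self.files))]

    def scan(self):
        return list(self._merge(self.files + [sorted(self.memory_map.items())]))

    @staticmethod
    def _merge(sources):
        # Tag entries with source age so the newest source wins on ties.
        tagged = [[(k, -age, v) for k, v in src] for age, src in enumerate(sources)]
        previous = object()
        for key, _, value in heapq.merge(*tagged):
            if key != previous:
                yield (key, value)
                previous = key

tablet = Tablet()
tablet.write("a", 1)
tablet.write("b", 2)
tablet.minor_compact()
tablet.write("a", 3)
print(tablet.scan())
```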


Page 12: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Contents

Core Philosophy

Technology

Techniques

Application APIs


Page 13: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Hierarchical Decomposition

[Diagram: tree mapping the levels of a record onto key components]

Row: <person>

Column Family: attribute, purchases, returns

Column Qualifier: age, discount, hat, sneakers

Value: <age>, <cost>, <cost>, <40%>

Page 14: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Materialized Table

[Diagram: tree for two example rows, with a "Key/Value Pair" callout]

Row: george

Column Family: attribute, purchases, returns

Column Qualifier: age, hat, sneakers

Value: 27, $83, $42

Row: bill

Column Family: attribute, purchases

Column Qualifier: age, discount, sneakers

Value: 49, 40%, $100
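
One way to picture the decomposition is as flattening a nested record into sorted key/value pairs, one level of nesting per key component. The sketch below assumes a dictionary shaped as row, then column family, then column qualifier; `decompose` is a hypothetical helper and the record contents are illustrative.

```python
def decompose(records):
    """Flatten {row: {family: {qualifier: value}}} into sorted key/value pairs."""
    pairs = []
    for row, families in records.items():
        for family, columns in families.items():
            for qualifier, value in columns.items():
                pairs.append(((row, family, qualifier), value))
    return sorted(pairs)

records = {
    "george": {"attribute": {"age": "27"}},
    "bill": {
        "attribute": {"age": "49", "discount": "40%"},
        "purchases": {"sneakers": "$100"},
    },
}

for key, value in decompose(records):
    print(key, "->", value)
```

Each leaf of the tree becomes exactly one key/value pair, and sorting keeps everything about one person, and within that one family, physically adjacent.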


Page 15: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Forward and Inverted Index

Table           Row     Column Family     Column Qualifier  Value
Forward Index   <UUID>  <Type>            <Field>           <Term>
Inverted Index  <Term>  <Type> + <Field>  <UUID>            <Digest of Event>
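
A minimal sketch of maintaining both tables for one event, following the layout above. The `+` separator between type and field, the helper names `index_event` and `lookup`, and the sample events are assumptions for illustration.

```python
from bisect import bisect_left

def index_event(uuid, event_type, fields, digest=""):
    """Return (forward, inverted) entries as (row, family, qualifier, value)."""
    forward = [(uuid, event_type, field, term) for field, term in fields.items()]
    inverted = [(term, f"{event_type}+{field}", uuid, digest)
                for field, term in fields.items()]
    return forward, inverted

def lookup(inverted, term):
    """Range-scan a sorted inverted index for one term and return its UUIDs."""
    start = bisect_left(inverted, (term,))
    hits = []
    for row, _, uuid, _ in inverted[start:]:
        if row != term:
            break
        hits.append(uuid)
    return hits

forward_table, inverted_table = [], []
events = [("u1", {"color": "red", "size": "large"}), ("u2", {"color": "red"})]
for uuid, fields in events:
    forward, inverted = index_event(uuid, "product", fields)
    forward_table += forward
    inverted_table += inverted
forward_table.sort()
inverted_table.sort()

print(lookup(inverted_table, "red"))
```

A term query is a single range scan on the inverted index; the returned UUIDs are then row lookups in the forward index.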


Page 16: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Forward and Inverted Index


Page 17: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Graph Analysis

Table        Row        Column Family  Column Qualifier (Tuples)  Value
Graph Table  <Node ID>  “Node Info”    <Field>                    <Value>
Graph Table  <Node ID>  “Out Edges”    <Node ID>, <Edge ID>       <Edge Info>
Graph Table  <Node ID>  “In Edges”     <Node ID>, <Edge ID>       <Edge Info>
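
The graph table layout lends itself to a short sketch: each edge is written twice, once under the source node's "Out Edges" and once under the destination's "In Edges", so neighbors in either direction are a single row read. The function names and the in-memory list standing in for the table are assumptions; "Node Info" entries are omitted for brevity.

```python
def add_edge(table, source, destination, edge_id, info):
    # One entry per direction: (row, family, qualifier tuple) -> value.
    table.append(((source, "Out Edges", (destination, edge_id)), info))
    table.append(((destination, "In Edges", (source, edge_id)), info))

def neighbors(table, node, family="Out Edges"):
    return [qualifier[0] for (row, fam, qualifier), _ in table
            if row == node and fam == family]

def reachable(table, start, hops):
    """Breadth-first expansion over out edges, up to `hops` steps."""
    seen = {start}
    frontier = [start]
    for _ in range(hops):
        next_frontier = []
        for node in frontier:
            for neighbor in neighbors(table, node):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen

graph = []
add_edge(graph, "a", "b", "e1", "knows")
add_edge(graph, "b", "c", "e2", "knows")
add_edge(graph, "c", "d", "e3", "knows")
print(sorted(reachable(graph, "a", 2)))
```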


Page 18: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Geospatial Queries

Table      Row        Column Family  Column Qualifier  Value
Geo Index  <GeoHash>  <Event Type>   <UUID>            <Digest of Event>

Latitude: 10110101001

Longitude: 00111010010

Depth: 11010110110

Interleaved: 101001110111010101011100001011100
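
The bit strings on this slide are consistent with a bit-interleaved hash: taking one bit at a time from latitude, longitude, and depth, in that order, reproduces the long string. A minimal sketch follows; `interleave` is a hypothetical helper, and quantizing real coordinates into bit strings is left out.

```python
def interleave(*bit_strings):
    """Interleave equal-length bit strings, one bit from each in turn."""
    if len({len(bits) for bits in bit_strings}) != 1:
        raise ValueError("all bit strings must have the same length")
    return "".join("".join(group) for group in zip(*bit_strings))

latitude = "10110101001"
longitude = "00111010010"
depth = "11010110110"

geohash = interleave(latitude, longitude, depth)
print(geohash)  # 101001110111010101011100001011100
```

Because the most significant bits of every dimension come first, points that are close together tend to share a row prefix, which turns many bounding-box queries into a small number of range scans.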


Page 19: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Document Partitioning

Table        Row             Column Family  Column Qualifier (Tuples)  Value
Shard Table  <Partition ID>  “Docs”         <UUID>, <Field>            <Value>
Shard Table  <Partition ID>  “Inv. Index”   <Term>, <UUID>
Shard Table  <Partition ID>  “Field Index”  <Field:Term>, <UUID>
Shard Table  <Partition ID>  “Geo”          <Hash>, <UUID>
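
A sketch of building shard table entries for one document, so that the document and all of its index entries share a partition row. Choosing the partition by hashing the UUID with CRC32, the whitespace tokenizer, and the names `partition_id` and `shard_entries` are assumptions made for illustration; the "Geo" family is omitted.

```python
import zlib

def partition_id(uuid, num_partitions):
    # Stable hash of the document ID, zero-padded so rows sort cleanly.
    return f"{zlib.crc32(uuid.encode()) % num_partitions:04d}"

def shard_entries(uuid, fields, num_partitions=16):
    """Return sorted (row, family, qualifier tuple, value) entries for a document."""
    row = partition_id(uuid, num_partitions)
    entries = []
    for field, value in fields.items():
        entries.append((row, "Docs", (uuid, field), value))
        for term in value.lower().split():
            entries.append((row, "Inv. Index", (term, uuid), ""))
            entries.append((row, "Field Index", (f"{field}:{term}", uuid), ""))
    return sorted(entries)

for entry in shard_entries("doc-1", {"title": "Big Data"}, num_partitions=4):
    print(entry)
```

Keeping a document and its index entries in the same row means a query can be answered entirely inside each partition, in parallel across tablet servers.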


Page 20: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Document Partitioning


Page 21: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Intersecting Iterator

Query: ‘foo’ and (‘bar’ or ‘baz’)

Row: <Partition ID>

Column Family “Docs”: Column Qualifier <UUID>, <Field>; Value <Value>

Column Family “Inv. Index”: Column Qualifier <Term>, <UUID>
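
The query on this slide can be sketched as set operations over the sorted UUID lists that one partition's inverted index holds for each term. The two-pointer `intersect`, the `union` helper, and the sample postings are assumptions for illustration; the real iterator runs server side inside the tablet and is not shown here.

```python
def intersect(left, right):
    """Intersect two sorted UUID lists with a two-pointer merge."""
    i = j = 0
    result = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            result.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return result

def union(left, right):
    return sorted(set(left) | set(right))

def foo_and_bar_or_baz(postings):
    """Evaluate 'foo' and ('bar' or 'baz') within one partition."""
    return intersect(
        postings.get("foo", []),
        union(postings.get("bar", []), postings.get("baz", [])),
    )

# Hypothetical inverted index for a single partition: term -> sorted UUIDs.
partition = {"foo": ["d1", "d2", "d4"], "bar": ["d2"], "baz": ["d3", "d4"]}
print(foo_and_bar_or_baz(partition))
```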


Page 22: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Contents

Core Philosophy

Technology

Techniques

Application APIs


Page 23: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

acorn


Key/Value pairs are great! How do I construct a document partitioning key again?

Techniques should be built into an API

Let the people have polyglot

Lucene, SQL, SPARQL, JAQL, Matlab (not just Key, Value, Range)


Page 24: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Combined IR + Graph Search


Page 25: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Schema-less Stats


Page 26: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Get Involved

http://accumulo.apache.org

Help us make Accumulo even better!


Page 27: Oct 2012 HUG: Apache Accumulo: Unlocking the Power of Big Data

Contact

Adam Fuchs, CTO

sqrrl data, Inc.

617-520-4375

www.sqrrl.com

@sqrrl_inc

[email protected]