accumulo design

APACHE ACCUMULO From a design perspective

Upload: koverse-inc

Post on 03-Dec-2014

DESCRIPTION

Learn the fundamentals of Accumulo with this presentation by Koverse CTO Aaron Cordova (@aaroncordova)

TRANSCRIPT

Page 1: Accumulo design

APACHE ACCUMULO: From a design perspective

Page 2: Accumulo design

SCALABLE KEY-VALUE STORE BASED ON GOOGLE'S BIGTABLE

Page 3: Accumulo design

BIGTABLE FEATURES

• Distributes data across many commodity servers

• Sorts data by key for fast lookup of values by key

• Scans across multiple key-value pairs

• Highly consistent writes to a single row

• Support for MapReduce jobs

Page 4: Accumulo design

DATA MODEL

Key = Row ID + Column (Family, Qualifier) + Timestamp

Value

Page 5: Accumulo design

Row ID | Col Fam   | Col Qual   | Timestamp | Value
Bob    | Email     | id0023     | 20120301  | Hey joe, can you send ...
Bob    | Email     | id0024     | 20120302  | Re: next Thursday ...
Bob    | UserPrefs | Background | 20130101  | Grey
Fred   | Email     | id0001     | 20080302  | Welcome to gmail ...
Sarah  | Email     | id0004     | 20130201  | Hi again ...
Sara   | Videos    | ytid009    | 20100303  | nsu736:)jdudjdk$:)378;'$$)
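
To make the sorted key-value model concrete, here is a minimal sketch in plain Python, not Accumulo client code. It assumes each entry is a tuple of (row ID, column family, column qualifier, timestamp, value) and that sorting those tuples stands in for the key ordering. The name `scan_row` is hypothetical. The real system orders timestamps newest first, which this sketch does not reproduce.

```python
from bisect import bisect_left

# Each entry is (row_id, column_family, column_qualifier, timestamp, value).
entries = [
    ("Sarah", "Email", "id0004", 20130201, "Hi again ..."),
    ("Bob", "Email", "id0024", 20120302, "Re: next Thursday ..."),
    ("Fred", "Email", "id0001", 20080302, "Welcome to gmail ..."),
    ("Bob", "UserPrefs", "Background", 20130101, "Grey"),
    ("Bob", "Email", "id0023", 20120301, "Hey joe, can you send ..."),
]

# Sorting by the key fields puts all of a row's entries next to each other.
table = sorted(entries)
row_ids = [entry[0] for entry in table]


def scan_row(row_id):
    """Return every entry for one row: a binary search, then a short walk."""
    start = bisect_left(row_ids, row_id)
    result = []
    for entry in table[start:]:
        if entry[0] != row_id:
            break  # sorted order means the row's entries have ended
        result.append(entry)
    return result


bob = scan_row("Bob")
```

Because the entries are sorted, finding a row costs a binary search rather than a full pass, and everything belonging to that row is read sequentially.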

Page 6: Accumulo design

[Diagram labels: Tablet servers (Commit Layer); HDFS DataNodes (Replication Layer)]

Page 7: Accumulo design

SINCE 2006

• Several BigTable implementations

• Apache HBase

• Apache Cassandra

• Apache Accumulo

• others …

Page 8: Accumulo design

BIGTABLE IS BIGTABLE, RIGHT?

Page 9: Accumulo design

HBASE

Page 10: Accumulo design

HBASE

• Open source Apache project started by developers at Powerset, bought by Microsoft

• Now used at Facebook, StumbleUpon, other big web sites

• Fast reads

• Row-oriented API

• Each column family has its own set of files

Page 11: Accumulo design

CASSANDRA

Page 12: Accumulo design

CASSANDRA

• Apache project started at Facebook

• Combines elements of BigTable and Amazon's Dynamo into one system

• Used at Netflix, other web sites

• Fast writes

• Tunable consistency

Page 13: Accumulo design

[Diagram labels: Tablet servers (Commit and Replication Layer)]

Page 14: Accumulo design

CONSISTENCY

• Highly consistent means: writes in one place

• Eventually consistent: writes in > one place

• Writes in > one place: network partition tolerance

• Partition tolerance: geographically distributed servers

• *Google uses Spanner to synchronize multiple dbs

Page 15: Accumulo design

[Diagram labels: Tablet servers; Data Center A; Data Center B]

Page 16: Accumulo design

[Diagram labels: Data Center A; Data Center B; Tablet servers]

Page 17: Accumulo design

OVERVIEW

• Both highly scalable

• Used to build web applications that can serve millions of users at once

• Serves as a low-latency persistence layer for real time service of requests

• Available in single data center or cross data center options

Page 18: Accumulo design

USE CASE

• Most data comes from users

• Schema defined by the application

• Data builds up over time

Page 19: Accumulo design

[Diagram labels: Many Users; Db; Web application]

Page 20: Accumulo design

ACCUMULO

Page 21: Accumulo design

ACCUMULO

• Can support the web application use-case

• But what are those other extra features for?

Page 22: Accumulo design

ACCUMULO ‘EXTRAS’

• Dynamic Column Families

• Column Visibility

• Key-value oriented API

• Iterators

• Batch Scanners

Page 23: Accumulo design

BIG ORGANIZATIONS

• Missions other than internet services

• Various disparate operational systems that generate data

• Desire to look across and analyze that data

• Desire to deliver results to their own population

Page 24: Accumulo design

USE CASE IS DISCOVERING AND ANALYZING ALL DATA

Page 25: Accumulo design

ISSUES

• Scale

• Unknown / multiple schema

• Support for analysis without data movement

• Varying levels of sensitivity in the same system

• Support a high number of low-latency user requests

Page 26: Accumulo design

[Diagram labels: Many Users; Analyze; Db; Data sets]

Page 27: Accumulo design

SCALE?

Page 28: Accumulo design

CHECK (IT’S BIGTABLE)

Page 29: Accumulo design

NO CONTROL OVER THE SCHEMA, OR MANY DIFFERENT SCHEMAS?

Page 30: Accumulo design

MAP EXISTING FIELDS TO COLUMNS DYNAMICALLY

Page 31: Accumulo design

INCLUDING COLUMN FAMILIES
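
One way to picture mapping existing fields to columns on the fly is the sketch below. It is plain Python with hypothetical names (`record_to_entries`, the `people` data set), and it assumes the data set name is used as the column family and each field name as the column qualifier. That is one possible convention, not something the slides prescribe.

```python
def record_to_entries(row_id, dataset, record):
    """Turn one flat record into (row, family, qualifier, value) entries.

    The data set name becomes the column family and each field name becomes
    a column qualifier, so no schema has to be declared ahead of time.
    """
    return [
        (row_id, dataset, field, str(value))
        for field, value in sorted(record.items())
        if value is not None  # missing fields simply produce no entry
    ]


entries = record_to_entries(
    "RID00003", "people", {"name": "harry", "age": 43, "height": None}
)
```

Records with different fields can sit in the same table, since a record only produces entries for the fields it actually has.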

Page 32: Accumulo design

VARYING LEVELS OF DATA SENSITIVITY?

Page 33: Accumulo design

COLUMN VISIBILITY

Page 34: Accumulo design

DATA MODEL

Key = Row ID + Column (Family, Qualifier, Visibility) + Timestamp

Value

Page 35: Accumulo design

Row ID | Col Fam   | Col Qual   | Col Vis        | Timestamp | Value
Bob    | Email     | id0023     | personal comms | 20120301  | Hey joe, can you send ...
Bob    | Email     | id0024     | personal comms | 20120302  | Re: next Thursday ...
Bob    | UserPrefs | Background | prefs          | 20130101  | Grey
Fred   | Email     | id0001     | personal comms | 20080302  | Welcome to gmail ...
Sarah  | Email     | id0004     | personal comms | 20130201  | Hi again ...
Sara   | Videos    | ytid009    | public post    | 20100303  | nsu736:)jdudjdk$:)378;'$$)
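
The effect of the visibility column can be sketched as a filter applied on read. This is a simplified illustration in plain Python, not Accumulo's implementation. It assumes each entry carries exactly one label and the reader holds a set of authorizations, whereas real visibility expressions can combine several tokens with boolean operators. The function name `visible_entries` and the placeholder video value are made up for the example.

```python
def visible_entries(entries, authorizations):
    """Keep only entries whose visibility label the reader is authorized for."""
    return [entry for entry in entries if entry[3] in authorizations]


# (row_id, column_family, column_qualifier, visibility, timestamp, value)
entries = [
    ("Bob", "Email", "id0023", "personal comms", 20120301, "Hey joe, can you send ..."),
    ("Bob", "UserPrefs", "Background", "prefs", 20130101, "Grey"),
    ("Sara", "Videos", "ytid009", "public post", 20100303, "<video bytes>"),
]

# A reader who may see preferences and public posts, but not personal email.
visible = visible_entries(entries, {"prefs", "public post"})
```

All three entries live in the same table. What a reader gets back depends only on the authorizations presented with the read.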

Page 36: Accumulo design

DATA OF VARYING SENSITIVITY LEVELS CAN BE PHYSICALLY CO-LOCATED

Page 37: Accumulo design

FRAMEWORKS LIKE HADOOP MAPREDUCE LOVE IT WHEN DATA IS ALL TOGETHER

Page 38: Accumulo design

LOOK ACROSS DATASETS?

Page 39: Accumulo design

SECONDARY INDICES

Page 40: Accumulo design

SECONDARY INDICES

• Application-created data: known

• Pre-existing data? unknown

Page 41: Accumulo design

DATA DISCOVERY!

Page 42: Accumulo design

SECONDARY INDICES

RowID    | Col Qual | Value
RID00001 | age      | 54
RID00001 | name     | bob
RID00002 | name     | fred
RID00003 | age      | 43
RID00003 | height   | 5’9”
RID00003 | name     | harry
RID00004 | name     | carl
RID00005 | name     | evan

RowID | Col Fam | Col Qual
43    | age     | RID00003
54    | age     | RID00001
5’9”  | height  | RID00003
bob   | name    | RID00001
carl  | name    | RID00004
evan  | name    | RID00005
fred  | name    | RID00002
harry | name    | RID00003
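
The two tables above are related by a simple inversion, which the following sketch illustrates in plain Python. The function names `build_index` and `lookup` are hypothetical. The index puts the value in the row position, the field name in the column family, and the original record ID in the column qualifier, so a sorted lookup by value leads straight to the matching record IDs.

```python
from bisect import bisect_left

# Record table: (row_id, field, value)
records = [
    ("RID00001", "age", "54"),
    ("RID00001", "name", "bob"),
    ("RID00002", "name", "fred"),
    ("RID00003", "age", "43"),
    ("RID00003", "height", "5’9”"),
    ("RID00003", "name", "harry"),
    ("RID00004", "name", "carl"),
    ("RID00005", "name", "evan"),
]


def build_index(records):
    """Invert (row_id, field, value) into sorted (value, field, row_id)."""
    return sorted((value, field, row_id) for row_id, field, value in records)


def lookup(index, value, field):
    """Return the record IDs whose field holds the given value."""
    start = bisect_left(index, (value, field, ""))
    ids = []
    for v, f, row_id in index[start:]:
        if (v, f) != (value, field):
            break  # past the matching run in the sorted index
        ids.append(row_id)
    return ids


index = build_index(records)
harry_ids = lookup(index, "harry", "name")
```

Nothing in `build_index` depends on knowing the field names ahead of time, which is what makes this kind of index usable on pre-existing data.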

Page 43: Accumulo design

PARTIAL ROW SCANS

Page 44: Accumulo design

BATCH SCANNERS

Page 45: Accumulo design

RowID    | Col Qual | Value
RID00001 | age      | 54
RID00001 | name     | bob
RID00002 | name     | fred
RID00003 | age      | 43
RID00003 | height   | 5’9”
RID00003 | name     | harry
RID00004 | name     | carl
RID00005 | name     | evan

[Diagram label: Batch Scanner]
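
The idea of a batch scan is to fetch many non-adjacent rows in one request, for example the record IDs returned by an index lookup. The sketch below imitates that in plain Python with a thread pool. It is an illustration only: `scan_row` and `batch_scan` are hypothetical names, and a real batch scanner talks to many servers and does not promise any result order.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Record table kept in sorted order: (row_id, field, value)
table = sorted([
    ("RID00001", "age", "54"),
    ("RID00001", "name", "bob"),
    ("RID00002", "name", "fred"),
    ("RID00003", "age", "43"),
    ("RID00003", "height", "5’9”"),
    ("RID00003", "name", "harry"),
    ("RID00004", "name", "carl"),
    ("RID00005", "name", "evan"),
])


def scan_row(row_id):
    """Read all entries of one row from the sorted table."""
    start = bisect_left(table, (row_id,))
    rows = []
    for entry in table[start:]:
        if entry[0] != row_id:
            break
        rows.append(entry)
    return rows


def batch_scan(row_ids, threads=4):
    """Fetch several rows concurrently and return their entries in one list."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return [entry for rows in pool.map(scan_row, row_ids) for entry in rows]


hits = batch_scan(["RID00003", "RID00001"])
```

Callers should treat the result as an unordered collection, since the ordering produced here is an accident of the sketch.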

Page 46: Accumulo design

COLUMN VISIBILITY APPLIES TO INDEXES TOO

Page 47: Accumulo design

ANALYSIS?

Page 48: Accumulo design

MAPREDUCE: CHECK

Page 49: Accumulo design

SHUFFLE-SORTED?

• Between Map and Reduce phases is shuffle-sort

• Sorting by key is necessary so all the values for a given key end up next to each other …

• BigTable also sorts keys …

Page 50: Accumulo design

ITERATORS

Page 51: Accumulo design

Value combine(Iterator<Value> values)
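
The signature above takes all values that share a key and returns a single value. A minimal sketch of that idea in plain Python follows. It assumes a summing combiner and uses the hypothetical helper `apply_combiner`. It relies on the entries already being sorted by key, the same property that the shuffle-sort provides to reducers.

```python
from itertools import groupby


def combine(values):
    """Example combiner: sum every value seen for one key."""
    return sum(values)


def apply_combiner(sorted_entries, combine):
    """Collapse consecutive entries that share a key into one entry."""
    return [
        (key, combine(value for _, value in group))
        for key, group in groupby(sorted_entries, key=lambda kv: kv[0])
    ]


# (key, value) pairs; sorting places equal keys next to each other.
entries = sorted([("fred", 1), ("bob", 2), ("bob", 3), ("fred", 4), ("carl", 5)])
totals = apply_combiner(entries, combine)
```

Since equal keys are adjacent, the combiner needs only one pass and never holds more than one key's values at a time.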

Page 52: Accumulo design

PRE-COMPUTATION

Page 53: Accumulo design

[Diagram labels: Many Users; Analyze; Db; Data sets]

Page 54: Accumulo design

ACCUMULO