Introduction to Accumulo

Mario Pastorelli
[email protected]

March 7, 2016


History

To accommodate their needs for analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:

• GFS: distributed filesystem
• MapReduce: distributed data processing
• BigTable: distributed storage system for structured data

Accumulo is an open-source implementation of BigTable


Distributed Structured Data

• structured data should be
  – distributed for parallel processing
  – indexed for fast retrieval (“structured” means that it has some kind of “primary key”)
  – tabular for easy processing of complex data; each row can potentially have many columns
• databases offer indexes and tables but don’t scale without significant effort
• key-value stores can easily be distributed but have limited index support over keys and don’t support a tabular format out of the box


Accumulo

• Accumulo is a key-value store with support for tabular data
  – keys are column identifiers, i.e. they uniquely identify a column of a row
  – a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id


Example

EMAIL              NAME     LASTNAME   COMPANY
[email protected]  Olivia   Smith      Winsystems
[email protected]  Emily    Brown      Jones Inc.

⇓

KEY (composed of row id and column id)   VALUE
[email protected]  NAME                  Olivia
[email protected]  LASTNAME              Smith
[email protected]  COMPANY               Winsystems
[email protected]  NAME                  Emily
[email protected]  LASTNAME              Brown
[email protected]  COMPANY               Jones Inc.
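
The mapping above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Accumulo client API: the function names `flatten_row` and `group_rows` are hypothetical, and the placeholder row ids `row-1` and `row-2` stand in for the e-mail addresses of the example.

```python
# Illustrative sketch only: plain Python, not the Accumulo client API.
# `flatten_row` and `group_rows` are hypothetical helper names.

def flatten_row(row_id, columns):
    """Turn one tabular row into (key, value) pairs, with key = (row id, column id)."""
    return [((row_id, column_id), value) for column_id, value in columns.items()]


def group_rows(entries):
    """Rebuild rows by grouping key-value pairs on the row id prefix of the key."""
    rows = {}
    for (row_id, column_id), value in sorted(entries):
        rows.setdefault(row_id, {})[column_id] = value
    return rows


# Placeholder row ids stand in for the e-mail addresses of the example table.
table = {
    "row-1": {"NAME": "Olivia", "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    "row-2": {"NAME": "Emily", "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
}

entries = [kv for row_id, cols in table.items() for kv in flatten_row(row_id, cols)]

# Sorting by key keeps all the columns of a row next to each other.
for key, value in sorted(entries):
    print(key, "->", value)
```

Because the row id is the prefix of every key, sorting the pairs is enough to bring a row back together, which is what `group_rows` relies on.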


Composite Keys

Keys in Accumulo are composite and have the following components

• row id: the row to which the key belongs
• column family: the “column group” to which the key belongs
• column qualifier: the column id
• column visibility: who can access this column
• timestamp: the version of the key

A single key-value is stored as

KEY                                                    VALUE
row id   column                            timestamp
         family   qualifier   visibility
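
A minimal sketch of such a composite key, in plain Python rather than the Accumulo API. The `Key` class, the `contact` family and the `public`/`hr` visibility labels are made up for illustration. The ordering assumed here is component by component from left to right, with the timestamp descending so that the newest version of a key comes first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Hypothetical stand-in for a composite key; not the Accumulo Key class."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self):
        # Compare components left to right; the timestamp is negated so that
        # the newest version of the same column sorts first.
        return (self.row, self.family, self.qualifier, self.visibility, -self.timestamp)


entries = [
    (Key("row-1", "contact", "NAME", "public", 1), "Olivia"),
    (Key("row-1", "contact", "NAME", "public", 2), "Olivia S."),
    (Key("row-1", "contact", "COMPANY", "hr", 1), "Winsystems"),
    (Key("row-0", "contact", "NAME", "public", 1), "Emily"),
]

ordered = sorted(entries, key=lambda kv: kv[0].sort_key())
for key, value in ordered:
    print(key, "->", value)
```

With this ordering all the entries of `row-0` come before those of `row-1`, and the two versions of `NAME` in `row-1` are adjacent, newest first.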


Accumulo features

• range queries: keys are stored in lexicographical order, which allows querying “semantically close” data
  – e.g. temporal data can be stored such that aggregation of close days is local and fast
• fast: with a proper key schema a query can take milliseconds
• scalable: designed to store huge amounts of data over multiple tables
• built-in cache for recently queried data
• many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
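
Why sorted keys make range queries cheap can be sketched with a sorted list and binary search. This is plain Python, not Accumulo; the `station/date` key layout, the sample values and the `scan` helper are assumptions made for the illustration.

```python
import bisect

# Hypothetical temporal data: key = "<station>/<YYYY-MM-DD>", value = a daily count.
entries = sorted([
    ("station-a/2016-03-05", 10),
    ("station-a/2016-03-06", 12),
    ("station-a/2016-03-07", 9),
    ("station-a/2016-04-01", 20),
    ("station-b/2016-03-06", 7),
])
keys = [key for key, _ in entries]


def scan(start, end):
    """Return the entries with start <= key < end, located by binary search."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return entries[lo:hi]


# Close days of the same station are contiguous, so aggregating them
# only touches one small slice of the sorted data.
window = scan("station-a/2016-03-06", "station-a/2016-03-08")
total = sum(value for _, value in window)
print(window, total)
```

The key schema does the work here: putting the station first and an ISO-style date second makes "the same station, nearby days" a contiguous range.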


Example

We want to store and analyze tweets from all around the world.


Example: Tweets analysis

• A tweet has the following (simplified) fields
  – coordinate: geospatial information composed of longitude and latitude
  – created_at: UTC time of the tweet
  – id: unique identifier of the tweet
  – user information, such as
      • user.id: unique identifier of the user
      • user.screen_name: user name
      • . . .
  – entities such as hashtags, urls. . .
  – text: tweet content
  – . . .
• how do we store this data in Accumulo?


Example: Tweets analysis

• there is no single way to do it; it depends on the query
• two good practices
  – work with denormalized data
  – specialize tables for each kind of query
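
The two practices can be sketched as follows, in plain Python rather than the Accumulo API. The table names `timeline` and `by_hashtag`, the `/` separator, the flat tweet dictionary and both helper functions are hypothetical. The same tweet is written once per table, each time under a row id shaped for one kind of query, and the text is copied rather than referenced.

```python
def timeline_entries(tweet):
    """Entries for a hypothetical per-user table: the row id starts with the user id."""
    row = f"{tweet['user_id']}/{tweet['created_at']}/{tweet['id']}"
    return [((row, "text"), tweet["text"])]


def hashtag_entries(tweet):
    """Entries for a hypothetical per-hashtag table: one row per (hashtag, tweet)."""
    return [
        ((f"{tag}/{tweet['created_at']}/{tweet['id']}", "text"), tweet["text"])
        for tag in tweet["hashtags"]
    ]


# Made-up tweet with only the fields needed for the sketch.
tweet = {
    "id": "0001",
    "user_id": "u42",
    "created_at": "2016-03-07T10:00:00Z",
    "hashtags": ["accumulo", "bigdata"],
    "text": "hello",
}

# Denormalized: the text is duplicated in every table instead of being joined later.
tables = {
    "timeline": timeline_entries(tweet),
    "by_hashtag": hashtag_entries(tweet),
}
for name, entries in tables.items():
    print(name, entries)
```

The price is more storage and more writes per tweet; the gain is that each query becomes a range scan over a single table.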


Example: Twitter User Timeline

• schema

KEY                                                                              VALUE
row id                      column                                   timestamp
                            family         qualifier    visibility
user.id + created_at + id   "coordinate"                                         lon/lat
                            "entities"     "hashtags"                            hashtags
                            "entities"     "urls"                                urls
                            "text"                                               text

• Easy to process the entire timeline or a time interval for the same user
• Not good for other kinds of analysis
  – find all the tweets with a given hashtag
  – find all the tweets in New York
  – . . .
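
A sketch of why this row id makes a per-user time interval easy, in plain Python rather than the Accumulo API. The exact encoding is an assumption: numeric ids are zero-padded so that string order matches numeric order, `created_at` is taken to be a fixed-width ISO 8601 UTC string, and `/` is an arbitrary separator. The `row_id` and `user_interval` helpers and the sample tweets are made up.

```python
import bisect


def row_id(user_id, created_at, tweet_id):
    """Build 'user.id + created_at + id' so that lexicographic order is user, then time."""
    # Zero-padding keeps e.g. user 42 and user 420 from interleaving.
    return f"{user_id:012d}/{created_at}/{tweet_id:020d}"


def user_interval(rows, user_id, start, end):
    """Return the row ids of one user with start <= created_at < end."""
    lo = bisect.bisect_left(rows, f"{user_id:012d}/{start}")
    hi = bisect.bisect_left(rows, f"{user_id:012d}/{end}")
    return rows[lo:hi]


# Made-up (user id, created_at, tweet id) triples.
rows = sorted(
    row_id(user, created, tweet)
    for user, created, tweet in [
        (42, "2016-02-28T23:00:00Z", 1),
        (42, "2016-03-01T08:00:00Z", 2),
        (42, "2016-03-15T12:30:00Z", 3),
        (42, "2016-04-02T09:00:00Z", 4),
        (7, "2016-03-10T10:00:00Z", 5),
        (420, "2016-03-11T10:00:00Z", 6),
    ]
)

march = user_interval(rows, 42, "2016-03-01", "2016-04-01")
print(march)
```

The same layout gives no help for the hashtag or location queries listed above: matching rows are scattered across all users, so answering them would mean scanning the whole table.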


Summary

• Accumulo is great for storing large amounts of structured data
• Accumulo is good for interactive queries as well as for more batch-oriented queries
• Accumulo is a low-level system
  – NoSQL (that’s not good!), which means no high-level language to query the data
  – a lot of flexibility, which can easily backfire


Thank you

Questions?
