Introduction to Accumulo
History
To accommodate their needs for analysis of large amounts of data on commodity hardware, Google developed three main distributed systems:
• GFS: distributed filesystem
• MapReduce: distributed data processing
• BigTable: distributed storage system for structured data
Accumulo is an open-source implementation of BigTable
Distributed Structured Data
• structured data should be
  – distributed for parallel processing
  – indexed for fast retrieval (“structured” means that it has some kind of “primary key”)
  – tabular for easy processing of complex data; each row can potentially have many columns
• databases offer indexes and tables but don’t scale without significant effort
• key-value stores can easily be distributed but have limited index support over keys and don’t have support for tabular format out of the box
Accumulo
• Accumulo is a key-value store with support for tabular data
  – keys are column identifiers, i.e. each key uniquely identifies one column of one row
  – a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id
Example

EMAIL              NAME    LASTNAME  COMPANY
[email protected]  Olivia  Smith     Winsystems
[email protected]  Emily   Brown     Jones Inc.

⇓

KEY (composed of row id and column id)  VALUE
[email protected]                       Olivia
[email protected]                       Smith
[email protected]                       Winsystems
[email protected]                       Emily
[email protected]                       Brown
[email protected]                       Jones Inc.
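The flattening shown in the example can be sketched in a few lines of Python. This is only an illustration of the idea, not Accumulo code: the function name `flatten`, the tuple-shaped key, and the two e-mail addresses are placeholders invented for the sketch.

```python
from typing import Dict, List, Tuple

# A key is modeled here as (row id, column id); this is a simplification.
Pair = Tuple[Tuple[str, str], str]


def flatten(rows: List[Dict[str, str]], row_id_field: str) -> List[Pair]:
    """Turn tabular rows into ((row id, column id), value) pairs sorted by key."""
    pairs: List[Pair] = []
    for row in rows:
        row_id = row[row_id_field]
        for column, value in row.items():
            if column == row_id_field:
                continue  # the row id lives in the key, not in a value
            pairs.append(((row_id, column), value))
    # Sorting by key keeps all the columns of one row next to each other.
    return sorted(pairs)


# Hypothetical sample data; the addresses are placeholders.
rows = [
    {"EMAIL": "olivia@example.com", "NAME": "Olivia",
     "LASTNAME": "Smith", "COMPANY": "Winsystems"},
    {"EMAIL": "emily@example.com", "NAME": "Emily",
     "LASTNAME": "Brown", "COMPANY": "Jones Inc."},
]
pairs = flatten(rows, "EMAIL")
for key, value in pairs:
    print(key, value)
```

Each row of three columns becomes three key-value pairs, and because the row id is the key prefix, the pairs of one row stay contiguous after sorting.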
Composite Keys
Keys in Accumulo are composite and have the following components:
• row id: the row the key belongs to
• column family: the “column group” the key belongs to
• column qualifier: the column id
• column visibility: who can access this column
• timestamp: the version of the key
A single key-value pair is stored as
KEY = (row id, column (family, qualifier, visibility), timestamp) → VALUE
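A composite key can be modeled as a small record to see how the components interact when sorted. The sketch below is plain Python, not the Accumulo API: the `Key` class, `sort_key` function, and sample values are illustrative. It assumes the ordering Accumulo documents for its keys: components compared in turn, with timestamps in descending order so that the newest version of a cell comes first.

```python
from typing import NamedTuple, Tuple


class Key(NamedTuple):
    """Illustrative stand-in for a composite key (not the Accumulo class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(k: Key) -> Tuple[str, str, str, str, int]:
    # Ascending on the string components, newest timestamp first.
    return (k.row, k.family, k.qualifier, k.visibility, -k.timestamp)


# Hypothetical entries: two versions of one cell plus a cell of another row.
entries = {
    Key("user1", "entities", "hashtags", "public", 100): "#old",
    Key("user1", "entities", "hashtags", "public", 200): "#new",
    Key("user0", "text", "", "public", 150): "hello",
}
ordered = sorted(entries, key=sort_key)
for k in ordered:
    print(k, entries[k])
```

With this ordering, a reader that only wants the latest version of a cell can stop at the first entry it finds for that row and column.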
Accumulo features
• range queries: keys are stored in lexicographical order, which allows querying “semantically close” data
  – e.g. temporal data can be stored such that aggregation over nearby days is local and fast
• fast: with a proper key schema a query can take milliseconds
• scalable: designed to store huge amounts of data over multiple tables
• built-in cache for recently queried data
• many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
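The range-query feature can be illustrated with a sorted in-memory list. This is a minimal sketch, assuming row ids that are ISO-style dates, so lexicographic order equals chronological order; the function `range_scan` and the sample counts are made up for the illustration and do not use the Accumulo API.

```python
import bisect
from typing import List, Tuple


def range_scan(table: List[Tuple[str, int]], start: str, end: str) -> List[Tuple[str, int]]:
    """Return entries with start <= key < end; `table` must be sorted by key."""
    keys = [k for k, _ in table]  # rebuilt on every call, for brevity only
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return table[lo:hi]  # one contiguous slice: close keys are stored together


# Hypothetical daily counts keyed by date.
table = sorted([
    ("2015-03-02", 7),
    ("2015-02-28", 3),
    ("2015-03-01", 5),
    ("2015-04-01", 9),
    ("2015-03-15", 2),
])
march = range_scan(table, "2015-03-01", "2015-04-01")
total = sum(value for _, value in march)
print(march, total)
```

The scan never looks at entries outside the requested interval, which is why a key schema that places related data next to each other makes aggregation local and fast.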
Example
We want to store and analyze tweets from all around the world.
Example: Tweets analysis
• A tweet has the following (simplified) fields
  – coordinate: geospatial information composed of longitude and latitude
  – created_at: UTC time of the tweet
  – id: tweet unique identifier
  – user information, such as
    • user.id: unique identifier of the user
    • user.screen_name: user name
    • . . .
  – entities such as hashtags, urls . . .
  – text: tweet content
  – . . .
• how do we store this data in Accumulo?
Example: Tweets analysis
• there is no single way to do it; it depends on the query
• two good practices
  – work with denormalized data
  – specialize tables for each kind of query
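As an illustration of the two practices, the sketch below writes each tweet into two tables, one per kind of query. Everything here is hypothetical: the table names, the `|` separator, the helper functions, and the sample tweets. Tables are modeled as plain dictionaries rather than real Accumulo tables.

```python
from typing import Dict, List, Tuple

# (row id, column family, column qualifier) -> value
Table = Dict[Tuple[str, str, str], str]


def index_tweet(tweet: dict, by_user: Table, by_hashtag: Table) -> None:
    """Write the same tweet text into both tables (denormalized)."""
    by_user[(f"{tweet['user_id']}|{tweet['id']}", "text", "")] = tweet["text"]
    for tag in tweet["hashtags"]:
        # One copy per hashtag, so a hashtag query needs no join.
        by_hashtag[(f"{tag}|{tweet['id']}", "text", "")] = tweet["text"]


def rows_with_prefix(table: Table, prefix: str) -> List[str]:
    """Row ids starting with `prefix`, in sorted order."""
    return sorted({row for (row, _, _) in table if row.startswith(prefix)})


by_user: Table = {}
by_hashtag: Table = {}
tweets = [
    {"id": "t1", "user_id": "u1", "text": "hello #accumulo",
     "hashtags": ["accumulo"]},
    {"id": "t2", "user_id": "u2", "text": "#accumulo and #hadoop",
     "hashtags": ["accumulo", "hadoop"]},
]
for tweet in tweets:
    index_tweet(tweet, by_user, by_hashtag)

print(rows_with_prefix(by_hashtag, "accumulo|"))
print(rows_with_prefix(by_user, "u1|"))
```

The price of this design is duplicated data and extra writes; the benefit is that each query reads one contiguous key range in the table built for it.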
Example: Twitter User Timeline
• schema

KEY                                                                         VALUE
row id                     column                                timestamp
                           family        qualifier   visibility
user.id + created_at + id  “coordinate”                                     lon/lat
                           “entities”    “hashtags”                         hashtags
                                         “urls”                             urls
                           “text”                                           text

• Easy to process the entire timeline or a time interval for the same user
• Not good for other kinds of analysis
  – find all the tweets with a given hashtag
  – find all the tweets in New York
  – . . .
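The timeline row id can be sketched as follows. The slide only says that the row id combines user.id, created_at and id; the zero padding, the `_` separator, the timestamp format, and the function names are assumptions added so that lexicographic order matches numeric and chronological order.

```python
from datetime import datetime, timezone
from typing import List


def _stamp(moment: datetime) -> str:
    """UTC timestamp as a fixed-width, sortable string (expects an aware datetime)."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def timeline_row_id(user_id: int, created_at: datetime, tweet_id: int) -> str:
    # Fixed-width numbers sort the same way as strings and as integers.
    return f"{user_id:020d}_{_stamp(created_at)}_{tweet_id:020d}"


def user_interval(rows: List[str], user_id: int, start: datetime, end: datetime) -> List[str]:
    """Row ids of one user with start <= created_at < end, in time order."""
    lo = f"{user_id:020d}_{_stamp(start)}"
    hi = f"{user_id:020d}_{_stamp(end)}"
    return [r for r in sorted(rows) if lo <= r < hi]


utc = timezone.utc
# Hypothetical tweets: three by user 42, one by user 7.
rows = [
    timeline_row_id(42, datetime(2015, 3, 1, 12, 0, tzinfo=utc), 1001),
    timeline_row_id(42, datetime(2015, 3, 5, 8, 30, tzinfo=utc), 1002),
    timeline_row_id(42, datetime(2015, 4, 2, 9, 0, tzinfo=utc), 1003),
    timeline_row_id(7, datetime(2015, 3, 3, 10, 0, tzinfo=utc), 1004),
]
march = user_interval(
    rows, 42,
    datetime(2015, 3, 1, tzinfo=utc),
    datetime(2015, 4, 1, tzinfo=utc),
)
print(march)
```

A whole timeline or a time interval for one user is a single key range under this schema, whereas a hashtag or location query would have to read every row.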
Summary
• Accumulo is great for storing large amounts of structured data
• Accumulo is good for interactive queries as well as for more batch-oriented queries
• Accumulo is a low-level system
  – NoSQL (that’s not good!), which means no high-level language to query the data
  – a lot of flexibility, which can easily backfire
Thank you
Questions?