8. column oriented databases

Column-Oriented Databases In Depth
Dr. Fabio Fumarola

Upload: fabio-fumarola

Post on 18-Jul-2015



Outline
• Column-Oriented Introduction
• Bigtable:
  – Features
  – Data Model
  – Rows and Column Family
  – Timestamp
  – API
  – Implementation
• Other Open Source solutions


Column-Oriented Introduction
• We saw that key-value databases are simple hash tables, where:
  – all access is via the primary key and returns an object as the result.
• Column-oriented databases are motivated by the need to model something richer than opaque object values.


Bigtable
• It is used pervasively at Google in several projects:
  – Google Analytics, Google Finance, Google Plus,
  – Personalized Search, Google Docs, and Google Maps.
• These products use Bigtable for a variety of demanding workloads, ranging from:
  – batch-oriented processing to
  – low-latency serving of data to end users.


Bigtable: Main Characteristics
• It does not support a full relational data model.
• It provides clients with a simple data model that:
  – supports dynamic control over data layout and format;
  – allows clients to reason about the locality of the data in the underlying storage.
• Data is indexed using row and column names that can be arbitrary strings.
• Data is stored as uninterpreted strings, into which clients can serialize several kinds of data.


Data Model
• A Bigtable is a sparse, distributed, and persistent multi-dimensional sorted map.
• The map is indexed at the first level by a row key,
• and at the second and third levels by a column key and a timestamp, respectively.
• Each value in the map is an array of bytes.

(row:string, column:string, time:int64) → string
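The map above can be sketched in a few lines of Python. This is a toy in-memory model, not Bigtable's implementation; the class name `TinyTable`, its methods, and the sample row and column keys are illustrative assumptions.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model of the (row, column, timestamp) -> bytes map."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def row_keys(self):
        # Rows are exposed in lexicographic order of their keys.
        return sorted(self._rows)


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
table.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
```

The map is sparse in the sense that a row only stores the columns that were actually written to it.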


Example of mapping


Data Model
• The designers settled on this data model after examining a variety of potential uses of Bigtable.
• The example that drove their decisions was storing web crawl results and enabling analysis over them.
• The main goal is to “keep a copy of a large collection of web pages and related information that could be used by many different projects” (Cache).


Webtable example
• URLs are used as row keys,
• various aspects of web pages are used as column names,
• and the contents of the web pages are stored in the contents: column under the timestamps when they were fetched.


Data Model: Rows
• Row keys are arbitrary strings (up to 64 KB).
• Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written).
• Data is maintained in lexicographic order by row key.
• The ordered row keys are grouped into ranges, which are dynamically partitioned.


Data Model: Rows
• Each row range is called a tablet.
• Tablets are the unit of distribution and load balancing.
• As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines.
• For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs.
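The hostname reversal can be illustrated with a short sketch. The function name `webtable_row_key` and the sample URLs are assumptions made for this example; only the idea of reversing the hostname components comes from the slides.

```python
from urllib.parse import urlsplit


def webtable_row_key(url):
    """Build a row key whose hostname components are reversed, e.g. com.cnn.www/path."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


urls = [
    "http://www.cnn.com/index.html",
    "http://example.org/a",
    "http://money.cnn.com/markets",
]

# Sorting the keys lexicographically places both cnn.com pages next to each other.
keys = sorted(webtable_row_key(u) for u in urls)
```

With plain URLs as keys, `money.cnn.com` and `www.cnn.com` would be separated by every host that sorts between "money" and "www".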


Data Model: Column Families
• Column keys are grouped into sets called column families.
• All data stored in a column family is usually of the same type (values of the same family are compressed together).
• The design philosophy is to keep the number of column families small (in the hundreds at most).
• In contrast, a table may have an unbounded number of columns.


Data Model: Column Families
• A column key is named using the following syntax: family:qualifier.
• Column family names must be printable, but qualifiers may be arbitrary strings.
• An example column family for the Webtable is language.
• The Webtable uses only one column key in the language family, which stores each web page’s language ID.


Data Model: Column Families
• Another useful column family for this table is anchor.
• Each column key in this family represents a single anchor.
• The qualifier is the name of the referring site; the cell content is the link text.
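A small sketch of the family:qualifier naming, using the language and anchor families described above. The helper names, the sample cell values, and the choice of an empty qualifier for the single language column are assumptions for illustration.

```python
def split_column_key(column_key):
    """Split 'family:qualifier' at the first colon; the qualifier may be empty."""
    family, _, qualifier = column_key.partition(":")
    return family, qualifier


def columns_in_family(row, family):
    """Return the cells of one row that belong to the given column family."""
    return {key: value for key, value in row.items()
            if split_column_key(key)[0] == family}


# One Webtable row: column key -> cell content (sample values).
row = {
    "language:": b"EN",
    "anchor:cnnsi.com": b"CNN",
    "anchor:my.look.ca": b"CNN.com",
}

anchors = columns_in_family(row, "anchor")
```

The anchor family shows why the number of columns is unbounded: every referring site adds a new qualifier, while the set of families stays fixed.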


Timestamps
• Each cell in a Bigtable can contain multiple versions of the same data.
• These versions are indexed by timestamp, in microseconds.
• Bigtable timestamps are 64-bit integers.
• Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.


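Timestamps
• A column family can be configured to keep only the last n versions of a cell (the three most recent in the Webtable example).
• Stale versions are garbage-collected automatically.
• In the Webtable example, the timestamps of the crawled pages stored in the contents: column are the times at which the pages were crawled.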
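Version garbage collection can be sketched as follows. The function name and the `keep_last` parameter are assumptions; the sketch only illustrates the "keep the newest n versions, newest first" policy described above.

```python
def garbage_collect(versions, keep_last=3):
    """Keep only the `keep_last` newest (timestamp, value) pairs, newest first."""
    newest_first = sorted(versions, key=lambda pair: pair[0], reverse=True)
    return newest_first[:keep_last]


# Four versions of one cell, written out of order.
cell = [(1, b"a"), (4, b"d"), (2, b"b"), (3, b"c")]
kept = garbage_collect(cell)
```

After collection the cell holds timestamps 4, 3, and 2, in that order; the version written at timestamp 1 is dropped.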


API
• Supported operations:
  – put(key, columnFamily, columnQualifier, value)
  – get(key)
  – Scan(startKey, endKey)
  – delete(key)
• It also provides functions for changing cluster, table, and column family metadata, such as access control rights.


API: Get & Scan
• A Get is simply a Scan limited by the API to one row.
• A Scan fetches zero or more rows of a table.
• By default, a Scan reads the entire table from start to end.
  – Scan()
  – Scan(byte[] startRow)
  – Scan(byte[] startRow, byte[] stopRow)
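The relationship between Get and Scan can be sketched over a sorted key list. This is not the Java client API listed above: the class `SortedRows` and its methods are hypothetical, and treating the stop row as exclusive is an assumption that follows the HBase convention.

```python
from bisect import bisect_left


class SortedRows:
    """Toy table that keeps its row keys sorted, so range scans are cheap."""

    def __init__(self, rows):
        self._keys = sorted(rows)
        self._rows = dict(rows)

    def scan(self, start_row=None, stop_row=None):
        """Yield (key, value) for start_row <= key < stop_row, in key order."""
        lo = 0 if start_row is None else bisect_left(self._keys, start_row)
        hi = len(self._keys) if stop_row is None else bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

    def get(self, row):
        """A get is a scan restricted to a single row."""
        # row + b"\x00" is the smallest key that sorts strictly after `row`.
        return dict(self.scan(row, row + b"\x00")).get(row)


store = SortedRows({b"a": 1, b"b": 2, b"c": 3, b"d": 4})
middle = list(store.scan(b"b", b"d"))
```

Calling `scan()` with no arguments walks the whole table, which mirrors the default behaviour described above.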


API
• A scan can be limited in several ways:
  – Start and stop row
  – Caching size: to limit the number of rows
  – Batch size: to limit the number of columns
  – Max result size: to limit the number of bytes
  – Setting the column names and qualifiers
• Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.


Building Blocks
• Google File System (GFS)
• Google SSTable
• Distributed lock service called Chubby (ZooKeeper)


Google File System
• Bigtable uses the distributed Google File System (GFS) to store log and data files.

http://static.googleusercontent.com/media/research.google.com/it//archive/gfs-sosp2003.pdf


Google SSTable
• The SSTable file format is used internally to store Bigtable data.
• An SSTable provides a persistent, ordered, immutable map from keys to values.
• The provided operations are:
  – look up the value associated with a key;
  – iterate over all key/value pairs in a specified key range.
• Internally, each SSTable contains a sequence of blocks (typically 64 KB in size).


Google SSTable
• A block index (stored at the end of the SSTable) is used to locate blocks.
• The index is loaded into memory when the SSTable is opened.
• A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk.
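The single-seek lookup can be sketched with an in-memory list standing in for the file. `ToySSTable`, its tiny block size, and the `block_reads` counter (a stand-in for disk seeks) are assumptions made for illustration, not the real file format.

```python
from bisect import bisect_left


class ToySSTable:
    """Immutable sorted map split into blocks, with an in-memory block index."""

    def __init__(self, sorted_items, block_size=2):
        # Split the sorted (key, value) pairs into fixed-size "blocks".
        self._blocks = [sorted_items[i:i + block_size]
                        for i in range(0, len(sorted_items), block_size)]
        # The index holds the last key of every block and lives in memory.
        self._index = [block[-1][0] for block in self._blocks]
        self.block_reads = 0

    def _read_block(self, block_no):
        self.block_reads += 1  # stands in for one disk seek
        return self._blocks[block_no]

    def lookup(self, key):
        block_no = bisect_left(self._index, key)  # binary search, no disk access
        if block_no == len(self._blocks):
            return None  # key sorts after every stored key
        for stored_key, value in self._read_block(block_no):
            if stored_key == key:
                return value
        return None


sst = ToySSTable([("a", 1), ("c", 2), ("e", 3), ("g", 4), ("i", 5)])
value = sst.lookup("e")
```

Each lookup touches at most one block, however many blocks the table holds.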


Chubby
• It is a highly available and persistent distributed lock service.
• It consists of five active replicas, one of which is elected as master.
• Chubby provides a namespace that consists of directories and small files.

http://static.googleusercontent.com/media/research.google.com/it//archive/chubby-osdi06.pdf


Chubby and Bigtable
• Bigtable uses Chubby for a variety of tasks:
  – to ensure that there is at most one active master at any time;
  – to store the bootstrap location of Bigtable data;
  – to discover tablet servers and finalize tablet server deaths;
  – to store Bigtable schema information (the column family information for each table);
  – and to store access control lists.


Implementation
• It is composed of three major components:
  – a library that is linked into every client,
  – one master server,
  – and many tablet servers.
• Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workloads.


Implementation: Master
• It is responsible for:
  – assigning tablets to tablet servers,
  – detecting the addition and expiration of tablet servers,
  – balancing tablet-server load,
  – and garbage collection of files in GFS.
• In addition, it handles schema changes such as table and column family creations.


Implementation: Tablet Server
• Each tablet server manages a set of tablets (typically somewhere between ten and a thousand tablets per tablet server).
• The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.


Implementation: Clients
• Clients communicate directly with tablet servers for reads and writes.
• Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master.
• As a result, the master is lightly loaded in practice (unlike in HBase).


Implementation: Cluster
• A Bigtable cluster stores a number of tables.
• Each table consists of a set of tablets, and each tablet contains all data associated with a row range.
• Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.


Tablets


Tablet Location
• A three-level hierarchy, similar to a B+ tree, is used to store tablet location information.


Tablet Location
• The first level is a file stored in Chubby that contains the location of the root tablet.
• The root tablet contains the location of all tablets in a special METADATA table.


Tablet Location
• At the second level, each METADATA tablet contains the location of a set of user tablets.
• The METADATA table stores the location of a tablet under a row key that encodes the tablet's table identifier and its end row.


Tablet Location
• The third level holds the user tablets, which contain the data.
• The client library caches tablet locations.
• If the client does not know the location of a tablet, it recursively moves up the tablet location hierarchy.
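The lookup through the hierarchy, with client-side caching, can be sketched as below. Everything here is a simplification under stated assumptions: the Chubby path, tablet names, server names, and end rows are invented, and the cache is keyed by row key, whereas a real client would cache whole row ranges.

```python
from bisect import bisect_left

LAST = "\uffff"  # sentinel end row that sorts after every sample key

# A Chubby file (hypothetical path) names the root tablet.
chubby = {"/bigtable/root-location": "root"}

# Each tablet maps the END row of a range to where that range lives.
tablets = {
    "root":   [("m", "meta-1"), (LAST, "meta-2")],     # root tablet -> METADATA tablets
    "meta-1": [("f", "server-a"), ("m", "server-b")],  # METADATA tablet -> user tablets
    "meta-2": [("t", "server-c"), (LAST, "server-d")],
}


def pick(entries, row_key):
    """Return the target of the first range whose end row is >= row_key."""
    end_rows = [end for end, _ in entries]
    return entries[bisect_left(end_rows, row_key)][1]


class Client:
    def __init__(self):
        self.cache = {}   # row key -> tablet server
        self.lookups = 0  # number of walks through the hierarchy

    def locate(self, row_key):
        if row_key not in self.cache:
            self.lookups += 1
            root = tablets[chubby["/bigtable/root-location"]]
            metadata_tablet = tablets[pick(root, row_key)]
            self.cache[row_key] = pick(metadata_tablet, row_key)
        return self.cache[row_key]


client = Client()
first = client.locate("g")
second = client.locate("g")  # answered from the cache
```

Only the first call walks the hierarchy; repeated lookups for the same key are served from the cache.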


Tablet Location
• Secondary information is also stored in the METADATA table, including a log of all events pertaining to each tablet.
• This information is helpful for debugging and performance analysis.


Tablet Assignment and Serving


Tablet Assignment
• Each tablet is assigned to one tablet server at a time.
• The master keeps track of the set of live tablet servers and of the current assignment of tablets to tablet servers, including which tablets are unassigned.
• When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.


Tablet Assignment
• Bigtable uses Chubby to keep track of tablet servers.
• When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely named file in a specific Chubby directory.
• When a tablet server terminates, the master is responsible for reassigning its tablets as soon as possible.


Tablet Serving
• The persistent state of a tablet is stored in GFS.
• Updates are committed to a commit log that stores redo records.
• The memtable, a sorted in-memory buffer, stores the recently committed updates.
• A sequence of SSTables stores the older updates.


Tablet Recovery
• Recovery is done by a tablet server reading the tablet's metadata from the METADATA table.
• This metadata contains the list of SSTables that make up the tablet and a set of redo points, which are pointers into the commit logs that may contain data for the tablet.
• The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.


Tablet Serving
• When a read operation arrives at a tablet server, the operation is executed on a merged view of the sequence of SSTables and the memtable.
• This operation is efficient because SSTables and the memtable are lexicographically sorted data structures.
• Incoming read and write operations can continue while tablets are split and merged.
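Because every source is already sorted, the merged view can be produced with a streaming merge rather than a full sort. The sketch below assumes the memtable is a dict and each SSTable is a sorted list of (key, value) pairs given newest first; the function name and sample data are illustrative.

```python
import heapq


def merged_view(memtable, sstables):
    """Yield (key, value) in key order; on duplicate keys the newest source wins."""
    # Source 0 is the memtable (newest); SSTables follow, newest first.
    sources = [sorted(memtable.items())] + [list(s) for s in sstables]
    # Tag every entry with the age of its source so ties sort newest first.
    tagged = [[(key, age, value) for key, value in src]
              for age, src in enumerate(sources)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:  # the first hit per key comes from the newest source
            yield key, value
            last_key = key


memtable = {"b": "b-new", "d": "d1"}
sstables = [
    [("a", "a1"), ("b", "b-old")],
    [("b", "b-oldest"), ("c", "c1")],
]
view = list(merged_view(memtable, sstables))
```

The merge never loads more than one pending entry per source, which is what makes reads over many SSTables affordable.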


Compactions
• As write operations execute, the size of the memtable increases.
• When the memtable size reaches a threshold, the memtable is frozen (it becomes immutable), a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS.
• There are three types of compaction: minor, merging, and major.


Minor Compactions
A minor compaction has two goals:
• it shrinks the memory usage of the tablet server;
• it reduces the amount of data that has to be read from the commit log during recovery if this server dies.
Incoming read and write operations can continue while compactions occur.


Merging Compactions
• Every minor compaction generates a new SSTable.
• Read operations might therefore need to merge updates from an arbitrary number of SSTables.
• The number of such SSTables is bounded by periodically running a merging compaction.
• A merging compaction reads the contents of a few SSTables and the memtable, and writes out a new SSTable.


Major Compaction
• It is a compaction that rewrites all SSTables into exactly one SSTable.
• SSTables produced by non-major compactions can contain special deletion entries.
• A major compaction, on the other hand, produces an SSTable that contains no deletion information or deleted data.
• These major compactions allow Bigtable to reclaim resources used by deleted data.
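The difference between a minor and a major compaction can be sketched with tuples standing in for immutable SSTables. The `TOMBSTONE` marker and both function names are assumptions for this example; real deletion entries are encoded in the SSTable format.

```python
TOMBSTONE = object()  # stands in for a special deletion entry


def minor_compaction(memtable):
    """Freeze the memtable into an immutable, sorted SSTable (a tuple here)."""
    return tuple(sorted(memtable.items()))


def major_compaction(sstables):
    """Rewrite all SSTables (given newest first) into one, dropping deleted data."""
    merged = {}
    for sstable in reversed(sstables):  # oldest first, so newer entries overwrite
        merged.update(sstable)
    return tuple((key, value) for key, value in sorted(merged.items())
                 if value is not TOMBSTONE)


old = minor_compaction({"a": 1, "b": 2})
new = minor_compaction({"b": TOMBSTONE, "c": 3})  # "b" was deleted later
compacted = major_compaction([new, old])
```

Until the major compaction runs, the deleted value of "b" still occupies space in the older SSTable; afterwards neither the value nor the deletion entry remains.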


Refinements


Locality groups
• Clients can group multiple column families together into a locality group.
• A separate SSTable is generated for each locality group in each tablet.
• Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.


Compression
• A user-specified compression format is applied to each SSTable block.
• A commonly used approach is a two-pass scheme:
  – the first pass uses Bentley and McIlroy’s scheme, which compresses long common strings across a large window;
  – the second pass uses a fast compression algorithm that looks for repetitions in a small 16 KB window of the data.
• This scheme applied to Webtable achieved a 10-to-1 reduction in space.


Caching for read
To improve read performance, tablet servers use two levels of caching.
• The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code.
• The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.


Bloom filters
• The number of disk accesses is reduced by allowing clients to specify that Bloom filters should be created for SSTables.
• A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.

• http://billmill.org/bloomfilter-tutorial/
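A minimal Bloom filter sketch follows. The bit-array size, the number of hash functions, and the use of seeded SHA-256 digests are arbitrary choices for this example, not what Bigtable uses; the sample key is also invented.

```python
import hashlib


class BloomFilter:
    """Answers "definitely absent" or "possibly present" for a key."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit keeps the sketch simple

    def _positions(self, key):
        # Derive several bit positions from seeded hashes of the key.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("com.cnn.www/anchor:cnnsi.com")
present = bloom.might_contain("com.cnn.www/anchor:cnnsi.com")
```

A negative answer lets the tablet server skip reading that SSTable from disk entirely; a positive answer still requires the read, because false positives are possible.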


Commit-log
• All write and delete operations are persisted to a single commit log per tablet server, shared by all of the tablets it serves.
• Using one log provides significant performance benefits during normal operation, but it complicates recovery.
• To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server.


Commit-log
• To avoid duplicating log reads, the commit log entries are first sorted by the key ⟨table, row name, log sequence number⟩.
• In the sorted output, all mutations for a particular tablet are contiguous and can therefore be read efficiently with one disk seek followed by a sequential read.
• To parallelize the sorting, the log file is partitioned into 64 MB segments, and each segment is sorted in parallel on a different tablet server.
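The effect of the sort can be shown on a handful of log entries. The tuple layout and the sample mutations are assumptions; grouping by table is a simplification, since grouping by tablet would also need each tablet's row range.

```python
from itertools import groupby
from operator import itemgetter

# Interleaved commit log: (table, row name, log sequence number, mutation).
log = [
    ("users", "bob", 2, "set name"),
    ("web", "com.cnn.www", 1, "set contents"),
    ("users", "alice", 4, "set email"),
    ("users", "bob", 3, "delete name"),
]

# Sort by the key <table, row name, log sequence number>.
sorted_log = sorted(log, key=itemgetter(0, 1, 2))

# After sorting, the entries of one table form a single contiguous run.
by_table = {table: [entry[3] for entry in entries]
            for table, entries in groupby(sorted_log, key=itemgetter(0))}
```

Within a row, mutations stay in sequence-number order, so replaying a run reproduces the original write order for that row.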


Exploiting immutability
• The Bigtable system is simplified by the fact that all of the SSTables generated are immutable.
• As a result, no synchronization of accesses is needed when reading from SSTables.
• The only mutable data structure that is accessed by both reads and writes is the memtable.
• The immutability of SSTables enables us to split tablets quickly.
• Instead of generating a new set of SSTables for each child tablet, we let the child tablets share the SSTables of the parent tablet.


Lessons Learned


Lessons
• Distributed systems are vulnerable to many types of failures: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems.
• Add new features only once it is clear how they will be used.


Lessons
• A practical lesson learned from supporting Bigtable is the importance of proper system-level monitoring.
• Clarity of design. “Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we have found that code and design clarity are of immense help in code maintenance and debugging.”


Open Source Column Stores


Open Source Column Stores
• HBase (http://hbase.apache.org/)
• Hypertable (http://hypertable.org/)
• Cassandra (http://cassandra.apache.org/)
• Accumulo (http://accumulo.apache.org/)
• Parquet (http://parquet.io/)


HBase
Apache HBase™ is the Hadoop database to use when you need random, realtime read/write access to your data.
• Automatic and configurable sharding of tables
• Automatic failover support between RegionServers
• Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables
• Easy-to-use Java API for client access
• To be distributed, it has to run on top of HDFS
• Integrated with MapReduce


HBase Architecture


Data Location, Assignment and Serving


Coprocessors


https://blogs.apache.org/hbase/entry/coprocessor_introduction


Hypertable
• Hypertable is an open source database system inspired by publications on the design of Google's Bigtable.
• Hypertable runs on top of a distributed file system such as the Apache Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). It is written almost entirely in C++.


http://hypertable.com/documentation/


Cassandra
• Bigtable extension:
  – All nodes are similar.
  – Can be used as a distributed hash table, with an "SQL-like" language, CQL (but no JOIN!)
• Data can have an expiration (set on INSERT)
• Map/reduce possible with Apache Hadoop
• Rich data model (columns, composites, counters, secondary indexes, map, set, list)


Cassandra Scalability


Reading from Cassandra
[Figure: client, row cache, key cache, memtable, SSTables]


Writing to Cassandra
[Figure: client, commit log (disk), memtable (memory), flush to SSTables; replication factor: 3]


CQL


When not to use


When not to use
• Column-oriented databases are not suitable for systems:
  – that require ACID transactions for writes and reads;
  – that need to aggregate data (SUM and AVG), since the aggregation has to be done on the client side;
  – that are at an early stage of development: if the queries change, the column design may have to change too.
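The client-side aggregation mentioned above can be sketched as follows. The scan result shape (row key plus a dict of column values), the column name `stats:price`, and the function name are assumptions for illustration; no particular client library is implied.

```python
def client_side_avg(scan_results, column):
    """Average one numeric column over scanned rows; rows missing it are skipped."""
    total, count = 0.0, 0
    for _row_key, columns in scan_results:
        if column in columns:
            total += float(columns[column].decode())  # values arrive as raw bytes
            count += 1
    return total / count if count else None


# Rows as a scan might return them: (row key, {column key: value bytes}).
rows = [
    ("r1", {"stats:price": b"10"}),
    ("r2", {"stats:price": b"20"}),
    ("r3", {}),  # sparse row: the column was never written
]
average = client_side_avg(rows, "stats:price")
```

Every row has to travel to the client before it can be summed, which is the cost that makes aggregation-heavy workloads a poor fit.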
