nosql dbmses - uublogcn.files.wordpress.com · introduction to couchdb: main concepts, basic data...

NoSQL DBMSes

Luca Morandini
Data Architect – AURIN Project
University of Melbourne


Outline

This lecture, and the one on the 18th of April, deal with the topic of non-relational DBMSes (also known as NoSQL). In more detail, the program of the two lectures covers:
• Distributed DBMSes: definition of big data, the challenges DBMSes face, the solutions to concurrent updates in distributed databases
• Non-relational data models: the way data can be organized in forms other than tables and relationships among tables
• Introduction to CouchDB: main concepts, basic data operations, update conflict resolution
• CouchDB queries: how to select data from CouchDB, views, show, and list functions
• Advanced CouchDB: clustering, application development

About me

• I graduated in statistics in 1991, and have been working in software development ever since
• I joined the University of Melbourne in 2012, working as Data Architect in the Melbourne eResearch Group
• I have been working on relational DBMSes since 1991, and with NoSQL DBMSes (CouchDB and Accumulo) since 2011
• My experience with CouchDB has focused on storing large volumes of heterogeneous data in the context of urban research software platforms
• I have written a fair amount of open source software, have been an Apache committer, and have started some open source software projects of my own

Part 1: Distributed DBMSes

The rise of “Big data”

The rise of the Internet has massively increased the amount and heterogeneity of data that can be stored in databases. Computer logs, commercial transactions, world-wide street networks, Facebook posts and Twitter feeds are only some examples of the data that are currently stored in databases. The DBMSes managing these databases generally fall under the name of NoSQL, since they do not follow the traditional relational model. Examples of NoSQL DBMSes in use by the big Internet players are: Amazon DynamoDB, Google Spanner, Apache Cassandra (Facebook).

What does “Big data” mean?

Big data is not just about “bigness”; it is commonly described by the four “Vs”:
• Volume: yes, volume (Giga, Tera, Peta, Exa, …) is a criterion, but not the only one
• Velocity: the frequency at which new data are brought into the system and analysis is performed
• Variety: the variability and complexity of the data schema. The more complex the data schema(s) you have, the higher the probability of them changing along the way, adding more complexity
• Veracity: the level of trust in the accuracy of the data; the more diverse the sources you have, and the more unstructured they are, the less veracity you have

How To Tackle Big Data?

In a nutshell, the computing load is split among different computers (hence the name of distributed databases). The set of computers that co-operate to manage the database is known as a cluster, while each computer in a cluster is known as a node. The action of making a computing system more powerful by adding more computational nodes is called horizontal scaling, while making it more powerful by adding more resources to a node is called vertical scaling. Since the set of data is split among the nodes, the DBMS has to propagate changes across nodes, while keeping the whole database consistent.

Why NoSQL is a misnomer

SQL is the main query language of relational DBMSes, and it is by far the most widely used language to access databases. In relational databases, data are stored in tables and tables are connected using relationships. The NoSQL name caught the imagination of many, and signalled a move away from the traditional relational/SQL type of DBMS. However, you can have distributed, non-relational DBMSes that have an SQL-like query language. For instance, the Hadoop Hive query language (HiveQL) is SQL-like, but Hive is not a relational DBMS, nor can HiveQL be considered to fully support CRUD operations (support for transactions is missing). The “NoSQL” term is better used when it is intended as “Not Only SQL”.

A note about availability and consistency

Among the many qualities that a database should have, two stand out when considering distributed databases:
• Availability: a database always answers queries from clients
• Consistency: a database gives the same response to queries happening at the same time

To underline the importance of these two qualities, imagine a database of a cinema booking system that is neither available nor consistent: it would be overwhelmed by requests (leading to time-outs of the booking page), and it would return different results (one client may see that there are 5 seats available, while another, at the same time, sees the theatre as fully booked).

Availability and consistency do not scale

While a database that is both available and consistent would be great, the two qualities cannot both be guaranteed on a distributed database. In other words, these two qualities do not scale. The reason for the failure to scale up is intuitive: when the volume of data calls for a cluster of database servers, ensuring that every change is propagated to every node is a slow process, and it can be blocked by a partition of the cluster.

[Figure: a cluster of three nodes (Node 1, Node 2, Node 3) serving three clients (Client 1, Client 2, Client 3)]

Two-phase commit

This is the usual algorithm used in relational DBMSes to enforce consistency during transactions. A transaction is a set of changes in a database that is treated as one single change (say, a money transfer, which involves adding the sum to one bank account and subtracting it from another).

The two-phase commit works by:
• locking data that are within the transaction scope
• performing transactions on a temporary database
• completing transactions (commit) only when all nodes in the cluster have performed the transaction
• aborting transactions (rollback) when a partition is detected

This procedure entails the following:
• reduced availability (data lock, stop in case of partition)
• enforced consistency (every database is in a consistent state, and all are left in the same state)

Therefore, two-phase commit is a good solution when the cluster is co-located, less than good when it is distributed.
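The commit/rollback logic described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of any particular DBMS: the `Node` class, the `reachable` flag (which simulates a partition) and the "no negative balance" rule are all assumptions made for the example, and data locking is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical cluster node holding a copy of the account balances."""
    name: str
    balances: dict
    reachable: bool = True                      # False simulates a network partition
    staged: dict = field(default_factory=dict)  # the "temporary database"

    def prepare(self, changes):
        """Phase 1: apply the changes to a temporary copy and vote yes/no."""
        if not self.reachable:
            return False
        staged = dict(self.balances)
        for account, delta in changes.items():
            staged[account] = staged.get(account, 0) + delta
        if any(amount < 0 for amount in staged.values()):
            return False                        # assumed business rule: no overdrafts
        self.staged = staged
        return True

    def commit(self):
        """Phase 2 (success): make the staged state the current one."""
        self.balances = self.staged
        self.staged = {}

    def rollback(self):
        """Phase 2 (failure): discard the staged state."""
        self.staged = {}


def two_phase_commit(nodes, changes):
    """Commit only if every node votes yes, otherwise roll back everywhere."""
    votes = [node.prepare(changes) for node in nodes]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.rollback()
    return False
```

With two reachable nodes, `two_phase_commit(nodes, {"a": -30, "b": 30})` updates both copies; as soon as one node is marked unreachable, every transaction is rolled back, which is the loss of availability noted in the slide.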

Multi-Version Concurrency Control (MVCC)

• MVCC is a method to ensure availability (every node in a cluster always accepts requests), and some sort of recovery from a partition, by reconciling the single databases through revisions (data are not replaced, they are just given a new revision number)
• In MVCC, concurrent updates are possible without distributed locks (only the local copy of the object is locked), since the updates will have different revision numbers; the transaction that completes last will get a higher revision number, hence will be considered the “current” value
• Coarse-grained DBMS models, like document-oriented DBMSes, fit the optimistic locking of the MVCC method, since there are fewer transactions

Multi-Version Concurrency Control Example

An example of how two clients (Client A and Client B) avoid locking data and still avoid inconsistency by using a revision number to sequence transactions. The requests, in order:

POST obj1, {name: "a"}           ==> OK, rev:1
PUT  obj1, rev:1, {name: "b"}    ==> OK, rev:2
PUT  obj1, rev:1, {name: "c"}    ==> ERROR
GET  obj1                        ==> OK, rev:2, {name: "b"}
PUT  obj1, rev:2, {name: "b c"}  ==> OK, rev:3

The second PUT fails because it is based on revision 1, which is no longer the current one; the client has to read the current revision before updating again.

MVCC relies on monotonically increasing revision numbers and, crucially, on the preservation of old object versions to avoid read locks and ensure availability (i.e. when an object is updated, the old versions can still be retrieved).
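The revision check in the exchange above can be sketched as a small in-memory store. This is a simplified illustration under stated assumptions: `MVCCStore` and `ConflictError` are hypothetical names, and revisions are plain integers as in the slide (real systems such as CouchDB use richer revision identifiers).

```python
class ConflictError(Exception):
    """Raised when an update is based on a revision that is not the current one."""


class MVCCStore:
    """Hypothetical in-memory store that keeps every revision of each object."""

    def __init__(self):
        # object id -> list of values; the value at index i is revision i + 1
        self._versions = {}

    def post(self, obj_id, value):
        """Create a new object and return its first revision number."""
        if obj_id in self._versions:
            raise ConflictError(f"{obj_id} already exists")
        self._versions[obj_id] = [value]
        return 1

    def put(self, obj_id, rev, value):
        """Update an object only if `rev` is its current revision."""
        versions = self._versions[obj_id]
        if rev != len(versions):
            raise ConflictError(f"{obj_id}: revision {rev} is stale")
        versions.append(value)          # old versions are kept, never overwritten
        return len(versions)

    def get(self, obj_id, rev=None):
        """Return (revision, value); defaults to the current revision."""
        versions = self._versions[obj_id]
        rev = len(versions) if rev is None else rev
        return rev, versions[rev - 1]
```

Replaying the slide's sequence against this store, the `put("obj1", 1, {"name": "c"})` call raises `ConflictError`, while `get("obj1", rev=1)` still returns the original value, since no read ever has to wait for a writer.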

Examples of Distributed DBMSes

• Google Spanner: a mix of semi-relational tables (close to a key-value store like BigTable), SQL-like query language, interleaved tables (two relations in a one-to-many relationship can be co-located in the same set of rows), global timestamps thanks to advanced time-keeping, two-phase commit and Paxos, auto-sharded, versioned rows
• Cloudera's Impala: in-memory heterogeneous-format SQL database based on Hadoop, used for analytics
• Apache Drill: heterogeneous-format SQL, no metadata required, used for analytics
• CouchDB: document-oriented DBMS, with MVCC, sharding and replication

Sharding

Sharding is the partitioning of a database “horizontally”; that is, the rows are partitioned in subsets that are stored on different servers.

The advantage of a sharded database lies in the improvement of performance through the distribution of computing load across nodes.

There are different sharding strategies, most notably:
• Hash sharding: to distribute rows evenly across the cluster
• Range sharding: similar rows (say, tweets coming from the same area) are stored on the same node (or sub-set of nodes)
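The two strategies can be contrasted with a short sketch. The function names, the choice of SHA-256 as the hash and the idea of describing ranges by a sorted list of upper bounds are assumptions made for illustration; actual DBMSes differ in how they compute and store shard boundaries.

```python
import bisect
import hashlib


def hash_shard(key, n_shards):
    """Hash sharding: a stable hash of the key spreads rows over the shards."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def range_shard(key, upper_bounds):
    """Range sharding: keys that sort close together land on the same shard.

    `upper_bounds` is a sorted list of exclusive upper bounds, one for each
    shard except the last, which takes every remaining key.
    """
    return bisect.bisect_right(upper_bounds, key)
```

For instance, with `upper_bounds = ["h", "p"]`, tweets keyed by city name go to shard 0 for "adelaide", shard 1 for "melbourne" and shard 2 for "sydney", whereas `hash_shard` would scatter neighbouring keys across all shards.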

Replication and Sharding

Replication is the action of storing the same row on different nodes to make the database fault-tolerant. Replication and sharding can be combined with the objective of maximizing availability while maintaining a minimum level of data safety. For instance, in a cluster of 3 nodes, a replication factor of 2 and a number of shards equal to 4 would look like:

[Figure: copies of Shard 1 to Shard 4 spread across Node 1, Node 2 and Node 3]
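One simple way to combine sharding and replication is to place the copies of each shard on consecutive nodes, round-robin. The sketch below assumes that a replication factor of 2 means two copies of every shard in total; `place_replicas` is a hypothetical helper, and the layout it produces is not necessarily the one drawn in the slide.

```python
def place_replicas(n_nodes, n_shards, replication_factor):
    """Assign each shard to `replication_factor` distinct nodes, round-robin.

    Returns a dict mapping node number -> list of shard numbers (both 1-based).
    """
    if replication_factor > n_nodes:
        raise ValueError("cannot place more copies of a shard than there are nodes")
    placement = {node: [] for node in range(1, n_nodes + 1)}
    for shard in range(1, n_shards + 1):
        for copy in range(replication_factor):
            # consecutive copies of a shard go to consecutive nodes
            node = (shard - 1 + copy) % n_nodes + 1
            placement[node].append(shard)
    return placement
```

Under these assumptions, `place_replicas(3, 4, 2)` stores shards 1, 3 and 4 on node 1, shards 1, 2 and 4 on node 2, and shards 2 and 3 on node 3, so that losing any single node still leaves one copy of every shard.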

Part 2: Non-relational Data Models

Why DBMSes for Distributed Environments?

While relational DBMSes are extremely good at ensuring consistency and availability, and there is nothing preventing them from implementing partition-tolerant algorithms, the normalization that lies at the heart of a relational database model produces fine-grained data, which are less partition-tolerant than coarse-grained data.

Example:
• A typical contact database in a relational data model may include: a person table, a telephone table, an email table, an address table
• The same database in a document-oriented database would entail one document type only, with telephone numbers, email addresses, etc., nested as arrays in the same document

The document-oriented option needs less synchronization and is easily “shardable” (shards are partitions of a database based on an attribute, e.g. sharding tweets by city means every city has one database holding all the tweets that originated in that city).
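The difference in granularity can be made concrete with a short sketch that folds the rows of the four relational tables into a single nested document. The column names (`person_id`, `number`, `address`, `street`) and the `to_document` helper are hypothetical, chosen only for this illustration.

```python
def to_document(person, telephones, emails, addresses):
    """Fold the rows that reference one person into a single nested document."""
    pid = person["id"]

    def values_for(rows, column):
        # pick the rows whose foreign key points at this person
        return [row[column] for row in rows if row["person_id"] == pid]

    return {
        "id": pid,
        "name": person["name"],
        "phones": values_for(telephones, "number"),
        "emails": values_for(emails, "address"),
        "addresses": values_for(addresses, "street"),
    }
```

In the relational layout, updating a contact may touch rows in four tables that have to be kept in sync; in the document layout, the whole contact is one unit that can be versioned, replicated and assigned to a shard on its own.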

Non-relational Data Models #1

Key-value stores (Redis, PostgreSQL Hstore)

Key        Value
“Luca”     {phones: [411134, 034444], addresses: [“1 The Avenue”, “3 High St.”]}
“Richard”  {phones: [4312, 4011134455], addresses: []}

BigTable DBMSes (Google BigTable, Apache Accumulo)

Row        Column Family  Column    Values
“Luca”     phones         mobile    {041148965, 04984556}
                          landline  {0333754, 0255545}
           addresses      current   “1 The Avenue”
                          past      [“3 High St”]
“Richard”  phones         mobile    {04222, 0423422}
                          landline  {034564}
           addresses      current   “4 The Golf Course”
                          past      []

Non-relational Data Models #2

Graph DBMSes (Neo4J, OrientDB)

Nodes
“Luca”     {phones: [411134], addresses: [“1 The Avenue”]}
“Richard”  {phones: [4312], addresses: []}

Edges
colleagues: [{from: “Luca”, to: “Richard”}]

Document-oriented DBMSes (CouchDB, MongoDB)

Document
{
  id: 12233,
  name: “Luca”,
  phones: {mobile: [041148965, 04984556], landline: [0333754, 0255545]},
  addresses: {current: “1 The Avenue”, past: [“3 High St”]}
}

MapReduce algorithms

• This paradigm, pioneered by Google, is particularly suited to parallel computing of the Single-Instruction, Multiple-Data type (see Flynn's taxonomy).
• The first step (Map) distributes data across machines, while the second (Reduce) hierarchically summarizes them until the result is obtained. Apart from parallelism, its advantage lies in moving the processing to where the data are, greatly reducing network traffic.
• Example (unashamedly taken from Wikipedia):

    function map(name, document):
        forEach word w in document:
            emit(w, 1)

    function reduce(word, partialCounts):
        sum = 0
        forEach pc in partialCounts:
            sum = sum + pc
        emit(word, sum)

MapReduce word count in a picture

Input: “Oh, where have you Been, Billy Boy, Billy Boy?”

Map output: (Oh, 1), (Where, 1), (Have, 1), (You, 1), (Been, 1), (Billy, 1), (Boy, 1), (Billy, 1), (Boy, 1)

Reduce output: (Oh, 1), (Where, 1), (Have, 1), (You, 1), (Been, 1), (Billy, 2), (Boy, 2)
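The word count can be run on a single machine with the sketch below, which keeps the map and reduce steps separate and adds an explicit grouping ("shuffle") step between them. The function names, the regular expression used to split words and the capitalization of keys are choices made for this illustration; in a real MapReduce framework the grouping and the distribution across machines are handled by the framework itself.

```python
import re
from collections import defaultdict


def map_words(name, document):
    """Map step: emit a (word, 1) pair for every word in the document."""
    for word in re.findall(r"[A-Za-z']+", document):
        yield word.capitalize(), 1


def shuffle(pairs):
    """Group the emitted values by key, as a framework would between the steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, partial_counts):
    """Reduce step: sum the partial counts of one word."""
    return word, sum(partial_counts)


def word_count(documents):
    """Run map, shuffle and reduce over a dict of name -> document text."""
    pairs = [pair
             for name, text in documents.items()
             for pair in map_words(name, text)]
    return dict(reduce_counts(word, counts)
                for word, counts in shuffle(pairs).items())
```

Calling `word_count` on the single line "Oh, where have you Been, Billy Boy, Billy Boy?" gives a count of 2 for "Billy" and "Boy" and 1 for every other word.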

MapReduce and Distributed DBMSes

• Since it is horizontally scalable, MapReduce is the tool of choice when operations on big datasets are to be done
• Let's see how the staple of database queries, the inner join, works on MapReduce (fileL and fileR have to be joined on the userId field; the sketch assumes one line per userId in each file):

    function map(fileL, fileR):
        forEach line in fileL:
            emit(line.userId, {type: "L", line: line})
        forEach line in fileR:
            emit(line.userId, {type: "R", line: line})

    function reduce(key, values):
        forEach value in values:
            if value.type == "L":
                lineL = value.line
            else:
                lineR = value.line
        return {userid: key, left: lineL, right: lineR}
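A runnable, single-machine version of this reduce-side join is sketched below. Unlike the pseudocode, it allows several lines per userId on either side and drops keys that appear in only one file, which is what an inner join requires. The function names and the representation of lines as dictionaries are assumptions made for the example.

```python
from collections import defaultdict


def map_join(file_l, file_r):
    """Map step: tag every line with its side and emit it under its userId."""
    for line in file_l:
        yield line["userId"], ("L", line)
    for line in file_r:
        yield line["userId"], ("R", line)


def reduce_join(key, values):
    """Reduce step: pair every left line with every right line of one key."""
    lefts = [line for side, line in values if side == "L"]
    rights = [line for side, line in values if side == "R"]
    return [{"userId": key, "left": left, "right": right}
            for left in lefts
            for right in rights]


def inner_join(file_l, file_r):
    """Run map, group by key and reduce; keys missing on one side yield nothing."""
    groups = defaultdict(list)
    for key, value in map_join(file_l, file_r):
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined
```

Joining a list of users with a list of posts this way returns one record per matching (user, post) pair, while a post whose userId has no matching user is silently dropped.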

References

• Big Data: http://www.ibmbigdatahub.com/tag/587
• Consistency and availability at scale: http://robertgreiner.com/2014/08/cap-theorem-revisited/
• Multi-Version Concurrency Control: http://www.eecs.berkeley.edu/~brewer/cs262/concurrency-distributed-databases.pdf
• Big Table DBMS: http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
