
CLOUD COMPUTING: OPPORTUNITIES AND CHALLENGES Danilo Ardagna [email protected]

Outline • Cloud persistent data storage overview

• Amazon S3

• NoSQL systems

• Map-reduce

• Scalable RDBMS

Persistent data and available types of storage

Blobs Virtual volumes

Traditional RDBMS

Scalable RDBMS

NoSQL

BLOBS: Amazon S3
• S3: Simple Storage Service
• Stores binary data objects for private or public use
• The implementation is fault-tolerant and assumes that hardware failures are a common occurrence
• S3 automatically makes multiple copies of each object to achieve high availability and durability
• Object size: 1 B to 5 TB
• All objects reside in buckets
• S3 objects can be accessed by HTTP requests
• Other AWS services use S3 as a storage system for AMIs, access logs, and temporary files
• Amazon S3 charges for the amount of data stored, the amount of data transferred in and out of S3, and the number of requests made to S3

S3 - Pricing

Storage tier            Standard Storage    Reduced Redundancy Storage
First 1 TB / month      $0.125 per GB       $0.093 per GB
Next 49 TB / month      $0.110 per GB       $0.083 per GB
Over 5000 TB / month    $0.055 per GB       $0.037 per GB

Request Pricing

PUT, COPY, POST, or LIST Requests

$0.01 per 1,000 requests

GET and all other Requests

$0.01 per 10,000 requests
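As a rough worked example of how these charges add up, the sketch below applies the Standard storage tiers and the request prices listed above. It assumes that 1 TB equals 1,024 GB and that tiers are billed marginally (each GB is charged at the rate of the tier it falls into). Both are assumptions, not statements from the price list. The class and method names are illustrative only, the request volumes are hypothetical, and only the first two storage tiers are modeled.

```java
class S3CostSketch {
    // Standard storage tiers from the price list (USD per GB per month).
    // Assumption: 1 TB = 1024 GB; only the first two tiers are modeled.
    static final double FIRST_TIER_GB = 1 * 1024.0;
    static final double SECOND_TIER_GB = 49 * 1024.0;
    static final double FIRST_TIER_PRICE = 0.125;
    static final double SECOND_TIER_PRICE = 0.110;

    /** Monthly storage cost for 0 to 50 TB of Standard storage. */
    static double storageCost(double storedGb) {
        if (storedGb < 0 || storedGb > FIRST_TIER_GB + SECOND_TIER_GB) {
            throw new IllegalArgumentException("only 0..50 TB is modeled in this sketch");
        }
        double inFirstTier = Math.min(storedGb, FIRST_TIER_GB);
        double inSecondTier = storedGb - inFirstTier;
        return inFirstTier * FIRST_TIER_PRICE + inSecondTier * SECOND_TIER_PRICE;
    }

    /** $0.01 per 1,000 PUT/COPY/POST/LIST requests, $0.01 per 10,000 other requests. */
    static double requestCost(long putLikeRequests, long getLikeRequests) {
        return putLikeRequests / 1000.0 * 0.01 + getLikeRequests / 10000.0 * 0.01;
    }

    public static void main(String[] args) {
        double storage = storageCost(3 * 1024.0);           // 3 TB stored
        double requests = requestCost(200_000, 5_000_000);  // hypothetical volumes
        System.out.printf("storage: $%.2f, requests: $%.2f%n", storage, requests);
    }
}
```

Under these assumptions, 3 TB costs 1,024 × 0.125 + 2,048 × 0.110 = $353.28 per month for storage, and the example request volumes add $2.00 + $5.00 = $7.00.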

S3 - Pricing

Data Transfer Pricing IN

All data transfer in $0.000 per GB

Data Transfer Pricing OUT

First 1 GB / month $0.000 per GB

Up to 10 TB / month $0.120 per GB

Next 40 TB / month $0.090 per GB



NoSQL systems - Motivations
• RDBMSs do not scale horizontally:
  • Most databases use a shared-nothing architecture
  • User requests often involve related information
  • Data shipping kills database performance
• NoSQL:
  • Not Only SQL
  • Employed in public, massively scaled Web site scenarios, where traditional DB features matter less and fast fetching of relatively simple data sets matters most
  • With the focus on the Web and the constant thirst for performance among technologists, NoSQL databases are seen favorably and used by an enthusiastic population of developers

NoSQL systems

What is NoSQL?
• No use of SQL as query language:
  • Manage large volumes of data that do not necessarily follow a fixed schema
  • Data is partitioned among different machines and JOIN operations are not usable
• ACID guarantees may be relaxed:
  • Eventual consistency
  • Transactions limited to single data items
• Distributed, fault-tolerant architecture:
  • Data held in a redundant manner on several servers
  • Horizontal scalability

The CAP Theorem
• Databases may only excel at two of the following three attributes: Consistency, Availability and Partition tolerance
• Relational databases favor Consistency and Availability
• NoSQL databases favor Availability and Partition tolerance
  • In other words, NoSQL intentionally de-emphasizes the rules and functionality of consistency that many database administrators and developers think of as the very prerequisites of database management
• Real applications can benefit from both, possibly by adopting the two technologies side by side

Queries: other source of inefficiencies
• A typical SQL query (search through primary key) would be: "SELECT COLUMN1, COLUMN2, COLUMN3 FROM TABLE WHERE PRIMARY_KEY={X}"
• Let's analyze what happens when we execute the above SQL:
  • The query passes through the SQL engine
    • Lexical analysis and parsing of the SQL statement
    • SQL optimization to choose the optimal execution path for the statement
  • The index is searched for the primary key
  • Once the primary key is located, data retrieval is performed

Queries: other source of inefficiencies
• NoSQL use case: "GET [TABLE][PRIMARY_KEY]"
  • In NoSQL, the primary key and the corresponding columns are stored as a hash
  • Looking up the primary key is performed in constant time, and there is no need for lexical analysis, parsing and optimization
• Many large websites use MySQL with Memcached, where Memcached serves as an in-memory NoSQL store
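To make the contrast with the SQL path concrete, here is a minimal in-memory sketch of a "GET [TABLE][PRIMARY_KEY]" lookup. It is not how any particular product is implemented: the class name, the nested-map layout and the sample data are assumptions made for illustration, and hash lookups are constant time only on average.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class KeyValueTableSketch {
    // table name -> (primary key -> row), where a row maps column name -> value
    private final Map<String, Map<String, Map<String, String>>> tables = new HashMap<>();

    void put(String table, String primaryKey, Map<String, String> columns) {
        tables.computeIfAbsent(table, t -> new HashMap<>())
              .put(primaryKey, Map.copyOf(columns));
    }

    /** Equivalent of "GET [TABLE][PRIMARY_KEY]": two hash lookups, no parsing or planning. */
    Optional<Map<String, String>> get(String table, String primaryKey) {
        Map<String, Map<String, String>> rows = tables.get(table);
        return rows == null ? Optional.empty() : Optional.ofNullable(rows.get(primaryKey));
    }

    public static void main(String[] args) {
        KeyValueTableSketch store = new KeyValueTableSketch();
        store.put("users", "42", Map.of("name", "Ada", "city", "Milan")); // made-up row
        System.out.println(store.get("users", "42").map(row -> row.get("name")).orElse("not found"));
        System.out.println(store.get("users", "7").isPresent());
    }
}
```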

Consistency Management
• RDBMS:
  • Two-phase commit (2PC)
    • A distributed algorithm that coordinates all the processes that participate in a distributed atomic transaction on whether to commit or abort
    • The commit-request phase (or voting phase): a coordinator process attempts to prepare all the transaction's participating processes to take the necessary steps for either committing or aborting the transaction, and to vote
    • The commit phase: based on the votes of the participants, the coordinator decides whether to commit or abort the transaction and notifies all the participants of the result
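The two phases can be sketched as a small coordinator loop. This is a simplified illustration, not a production protocol: the `Participant` interface and the in-memory participant are hypothetical, and timeouts, logging and crash recovery are left out entirely.

```java
import java.util.List;

class TwoPhaseCommitSketch {
    /** A participant in the distributed transaction (hypothetical interface). */
    interface Participant {
        boolean prepare();   // vote: true = ready to commit, false = must abort
        void commit();
        void abort();
    }

    /** Runs both phases and returns true if the transaction was committed. */
    static boolean run(List<Participant> participants) {
        // Phase 1 (commit-request / voting): ask participants to prepare and vote.
        boolean allVotedYes = true;
        for (Participant p : participants) {
            if (!p.prepare()) {
                allVotedYes = false;
                break;               // one "no" vote is enough to abort
            }
        }
        // Phase 2 (commit): decide from the votes and notify every participant.
        for (Participant p : participants) {
            if (allVotedYes) {
                p.commit();
            } else {
                p.abort();
            }
        }
        return allVotedYes;
    }

    /** In-memory participant used only for the demonstration. */
    static class InMemoryParticipant implements Participant {
        private final boolean vote;
        String state = "INIT";
        InMemoryParticipant(boolean vote) { this.vote = vote; }
        public boolean prepare() { state = vote ? "PREPARED" : "REFUSED"; return vote; }
        public void commit() { state = "COMMITTED"; }
        public void abort() { state = "ABORTED"; }
    }

    public static void main(String[] args) {
        InMemoryParticipant a = new InMemoryParticipant(true);
        InMemoryParticipant b = new InMemoryParticipant(false);
        List<Participant> participants = List.of(a, b);
        System.out.println("committed: " + run(participants));
        System.out.println(a.state + " / " + b.state);
    }
}
```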

Consistency Management
• Multi-version concurrency control (MVCC):
  • Updates are implemented by marking the old data as obsolete and adding the newer version
  • Multiple versions are stored, but only one is the latest
  • Provides potential point-in-time consistent views
  • Read transactions use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data
  • Avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained rather than through locking
  • Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes occur at a later transaction ID
• Eventual consistency:
  • Changes made at one replica are transmitted asynchronously to the others (e.g., DNS)
  • Discrepancies in data state between replicas, and thus between users and locations, may occur for a temporary period
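The version-selection rule behind MVCC can be illustrated with a small single-threaded sketch. It only shows how a reader with a fixed transaction ID keeps seeing the versions that existed at that ID. The class, its methods and the global counter are assumptions for illustration, and real engines add concurrency control and garbage collection of old versions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class MvccStoreSketch {
    private long lastTxId = 0;
    // key -> (transaction ID of the write -> value written); old versions are kept
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    /** Writes never overwrite: they add a new version under a fresh transaction ID. */
    long write(String key, String value) {
        long txId = ++lastTxId;
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(txId, value);
        return txId;
    }

    /** A reader picks its snapshot ID once and uses it for all of its reads. */
    long beginRead() {
        return lastTxId;
    }

    /** Returns the newest version written at or before the snapshot, or null if none. */
    String read(String key, long snapshotTxId) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, String> entry = history.floorEntry(snapshotTxId);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        MvccStoreSketch store = new MvccStoreSketch();
        store.write("x", "1");
        long snapshot = store.beginRead();   // reader starts here
        store.write("x", "2");               // later write, invisible to the reader
        System.out.println(store.read("x", snapshot));          // prints 1
        System.out.println(store.read("x", store.beginRead())); // prints 2
    }
}
```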

NoSQL – Basic Concepts
• Tuple: a row in a relational table, where attribute names are pre-defined in a schema and the values must be scalar. The values are referenced by attribute name, as opposed to an array or list, where they are referenced by ordinal position
• Document: allows values to be nested documents or lists as well as scalar values, and the attribute names are dynamically defined for each document at runtime. A document differs from a tuple in that the attributes are not defined in a global schema and a wider range of values is permitted

NoSQL – Basic Concepts
• Extensible record: a hybrid between a tuple and a document, where families of attributes are defined in a schema, but new attributes can be added (within an attribute family) on a per-record basis. Attributes may be list-valued
• Object: analogous to an object in programming languages, but without the procedural methods. Values may be references or nested objects
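The difference between a tuple and a document can be sketched in a few lines. The tuple has a fixed set of scalar attributes declared up front, while the document is a map whose keys are chosen per document and whose values may be lists or nested documents. The class names and sample values are illustrative assumptions.

```java
import java.util.List;
import java.util.Map;

class DataModelSketch {
    /** Tuple: attribute names fixed by a schema (here, the class), scalar values only. */
    static final class PersonTuple {
        final String name;
        final int age;
        PersonTuple(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    /** Document: attribute names defined per document; values may be lists or nested documents. */
    static Map<String, Object> personDocument() {
        return Map.of(
            "name", "Chris",
            "phones", List.of("(212) 555-1212", "(212) 555-1213"),
            "address", Map.of("city", "Auckland", "country", "NZ"));
    }

    public static void main(String[] args) {
        PersonTuple tuple = new PersonTuple("Chris", 45);
        Map<String, Object> document = personDocument();
        Map<?, ?> address = (Map<?, ?>) document.get("address");
        System.out.println(tuple.name + " is " + tuple.age);
        System.out.println(document.get("name") + " lives in " + address.get("city"));
    }
}
```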

NoSQL – Basic Concepts
• Most of the systems allow horizontal partitioning of data, storing records on different servers according to some key; this is called sharding
• Some of the systems also allow vertical partitioning, where parts of a single record are stored on different servers
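Sharding by key can be sketched with a list of in-memory maps standing in for servers. The routing rule used here, hash of the key modulo the number of shards, is an assumption chosen for brevity. Real systems often use other schemes such as range partitioning or consistent hashing, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ShardedStoreSketch {
    private final List<Map<String, String>> shards = new ArrayList<>();

    ShardedStoreSketch(int shardCount) {
        for (int i = 0; i < shardCount; i++) {
            shards.add(new HashMap<>());   // each map stands in for one server
        }
    }

    /** Maps a key to a shard index; floorMod keeps the index non-negative. */
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shards.size());
    }

    void put(String key, String value) {
        shards.get(shardFor(key)).put(key, value);
    }

    String get(String key) {
        return shards.get(shardFor(key)).get(key);
    }

    public static void main(String[] args) {
        ShardedStoreSketch store = new ShardedStoreSketch(4);
        store.put("user:1", "Ada");
        store.put("user:2", "Grace");
        System.out.println("user:1 is on shard " + store.shardFor("user:1"));
        System.out.println(store.get("user:2"));
    }
}
```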

NoSQL sub-categories
• Key-Value Stores: store values and an index to find them, based on a programmer-defined key
• Document Stores: store documents. Indexes can be defined and a simple query mechanism is also provided
• Wide Column Stores: store extensible records that can be partitioned vertically and horizontally across nodes
• Graph Databases: provide efficient distributed storage and queries of a graph of nodes with references among them
• Each NoSQL subcategory serves certain scenarios best

Key-Value Stores
• The mother of all NoSQL database types
• Let’s consider an example:
  • A key-value pair might consist of a key like “Phone Number” that is associated with a value like “(212) 555-1212”
• Key-Value Stores contain records whose entire content is made up of such pairs
• The structure of one record can differ from the others in the same collection

Key-Value Stores


In his paper Amazon's Dynamo¹ (Dynamo is the online retailer’s foundational NoSQL database), Werner Vogels, Amazon.com’s Chief Technology Officer, describes why such an approach is appropriate: “Most of these services only store and retrieve data by primary key and do not require the complex querying and management functionality offered by an RDBMS.” In other words, various systems on the Web, many of which are consumer-facing, don’t have sophisticated database needs, but they nonetheless have a huge burden. They must carry out their simple needs very, very quickly.

NoSQL databases handle these workloads well, but they make serious concessions, to otherwise mainstream database needs, in order to do it. That is well-justified, but not always well-understood; in fact there exist NoSQL practitioners who advocate the usage of NoSQL as a general database technology applicable to the mainstream of application database needs. Such advocacy has caused some relational database customers to have concerns that they should perhaps switch to NoSQL databases even for line-of-business (LOB) applications.

Customers have these concerns despite the fact that most LOB apps require transactional guarantees, and are well-served by normalized design and formal schema. This can be a controversial state of affairs and we hope to sort out that controversy. For now though, let’s just say that NoSQL databases work well in certain scenarios, and that sketching out what those scenarios are, and what they are not, is an important goal of this paper.

To help enumerate those scenarios, it’s best that we discuss four subcategories that NoSQL databases tend to break down into. Enumerations of such subcategories tend to vary, but they usually include Key-Value Stores, Document Stores, Wide Column Stores and Graph Databases. Each NoSQL subcategory serves certain scenarios best. To understand core NoSQL scenarios as best as we can, let’s explore the various NoSQL subcategories and the specific types of applications and workloads they support most ably.

Key-Value Stores
The Key-Value Store subcategory (summarized graphically in Figure 1) is perhaps the mother of all NoSQL database types. Most NoSQL databases feature key-value mechanisms, even if only behind the scenes. NoSQL databases that belong to the explicit Key-Value Store category use their namesake construct as the basic unit of storage. A key-value pair might consist of a key like “Phone Number” that is associated with a value like “(212) 555-1212.” Key-Value Stores contain records whose entire content is made up of such pairs; the structure of one record can differ from the others in the same collection.

1 http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html

Figure 1: Key-Value Stores often use the nomenclature of tables and rows, but the latter simply contain collections of key-value pairs, which vary from row to row.

Key-Value Stores
• Collections, dictionaries and associative arrays in the programming world work on the same principle. Data caches work on the key-value principle as well
• Values can consist of long text content, not just numeric and short string data
• Data is often non-hierarchical, so the lack of relational logic or join constructs is acceptable

Key-Value Stores
• Examples of systems:
  • Azure Table Storage
  • MemcacheDB
  • Dynamo
  • Voldemort
  • Dynomite, Kai and Riak (open source derivatives of Dynamo)
• Other NoSQL database types build upon Key-Value Store principles. Therefore you should expect their applications to be more specialized than, but not wholly distinct from, those of Key-Value Stores themselves

Document Stores
• Each document consists of a set of keys and values, which can be compared to a relational table’s field names and values
• As with Key-Value Stores, each record can have a structure that differs widely from the others
• Frequently, Document Stores contain JSON objects, each of which has a schema-free set of properties and values
• Values may contain attachments, point to other documents, or directly contain them

Document Stores


with JavaScript programming and programmers. In fact, the native stored procedure/scripting language for both CouchDB and MongoDB is JavaScript itself.

Documents can also contain attachments, making document stores useful for content management. The fact that certain Document Stores feature versioning of their documents (i.e. old versions are retained and all versions are numbered) makes this all the more so.

CouchDB and MongoDB have been used for an array of public-facing Web application types including blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage and even Twitter clients.

An important facet of Document Stores is that the documents themselves can be addressed by unique URLs. And given the HTTP and URL orientation, document databases are automatically REST-friendly, as their APIs bear out. In the case of CouchDB, the HTTP orientation is developed to the point where the database can function as its own Web application server.

Here’s how: so-called Show Functions in CouchDB – JavaScript functions that render HTML with the return statement – can be stored in special documents called design documents, and each function within is accessible via URL. This means that entire Web applications can be implemented in a document database. Users visit a URL, code runs on the server and content is returned via the HTTP response stream, just as it would be with classic ASP, node.js, ASP.NET Web Pages or PHP.

This HTTP and application orientation distinguishes Documents Stores from Key-Value Stores, the latter of which are more general purpose in their implementation and application. That said, there are some NoSQL taxonomies which do not recognize the Document Store category and instead label its members as Key-Value Stores.

As you will see, the remaining two NoSQL subcategories utilize key-value technology as well.

Wide Column Stores Wide Column Stores, also known as Column Family Stores, manage key-value pairs, but they organize their storage in a semi-schematized and hierarchical pattern. Perhaps fittingly then, some of their nomenclature correlates with that of RDBMS technology. For example, the keys in a Wide Column Store are referred to as columns, and are stored in structures that are sometimes referred to as tables. In

Figure 2: Document Stores contain JSON objects, referred to as documents, each of which has a schema-free set of properties and values. Values may contain attachments, point to other documents, or directly contain them.

Document Stores
• Examples of systems:
  • Amazon SimpleDB
  • CouchDB
  • MongoDB

Amazon SimpleDB
• Provides Select, Delete, GetAttributes, and PutAttributes operations on documents (items in Amazon terminology)
• Main characteristics:
  • Nested documents are not allowed
  • Eventual consistency, not transactional consistency
  • Asynchronous replication
  • Supports more than one grouping in one database: documents are put into domains, which support multiple indexes
• Select operations are on one domain, and specify a conjunction of constraints on attributes, basically in the form:
  select <attributes> from <domain> where <list of attribute value constraints>

Amazon SimpleDB
• Different domains may be stored on different Amazon nodes
• Domain indexes are automatically updated when any document’s attributes are modified
• SimpleDB does not automatically partition data over servers
• Built-in constraints:
  • 10 GB maximum domain size
  • 250 active domains
  • 5 second limit on queries

A Java Example
    package net.nineapps.programmingec2.chapter7;

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.ArrayList;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.List;
    import java.util.TimeZone;

    import com.amazonaws.auth.AWSCredentials;
    import com.amazonaws.services.simpledb.AmazonSimpleDB;
    import com.amazonaws.services.simpledb.AmazonSimpleDBClient;
    import com.amazonaws.services.simpledb.model.Attribute;
    import com.amazonaws.services.simpledb.model.Item;
    import com.amazonaws.services.simpledb.model.PutAttributesRequest;
    import com.amazonaws.services.simpledb.model.ReplaceableAttribute;
    import com.amazonaws.services.simpledb.model.SelectRequest;
    import com.amazonaws.services.simpledb.model.SelectResult;
    import com.amazonaws.services.sqs.model.Message;

    public class SQSLogger {

        private AmazonSimpleDB simpleDB;
        private SimpleDateFormat format;

A Java Example
        public SQSLogger(AWSCredentials credentials) {
            // get the SimpleDB service
            simpleDB = new AmazonSimpleDBClient(credentials);
            format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
            format.setTimeZone(TimeZone.getTimeZone("GMT"));
        }

        /**
         * Log an SQS message into the SimpleDB domain "sqs_log". pickedUpTime will
         * be used to calculate latency. If the message has been successfully
         * processed we also save the time at which it was processed (deleted from
         * the queue).
         */
        public void logMessage(Message message, Date pickedUpTime, boolean succeeded) {
            String timestamp = message.getAttributes().get("SentTimestamp");
            String sentTimestamp = format.format(new Date(Long.parseLong(timestamp)));
            List<ReplaceableAttribute> attributes = new ArrayList<ReplaceableAttribute>();
            attributes.add(new ReplaceableAttribute("SentTimestamp", sentTimestamp, true));

A Java Example
            // All attributes are set to replace=true except for PickedUpTimestamp,
            // since the message could be picked up several times
            // until it is successfully processed.
            // For latency we need the earliest PickedUpTimestamp.
            attributes.add(new ReplaceableAttribute("PickedUpTimestamp",
                    format.format(pickedUpTime), false));
            attributes.add(new ReplaceableAttribute("MessageBody", message.getBody(), true));
            if (succeeded) {
                attributes.add(new ReplaceableAttribute("ProcessedTimestamp",
                        format.format(new Date()), true));
            }
            // create an item in SimpleDB for this message
            PutAttributesRequest request = new PutAttributesRequest(
                    "sqs_log",              // SimpleDB domain name
                    message.getMessageId(), // item name
                    attributes);
            simpleDB.putAttributes(request);
        }

A Java Example
        /**
         * Get the average latency of messages served from the given start datetime
         * to the given end datetime. Return value is a long expressed in
         * milliseconds.
         */
        public long getLatency(Date start, Date end) throws ParseException {
            long count = 0;
            long totalLatency = 0;
            String nextToken = null;
            // retrieve all the items which are in the date range we want
            SelectRequest request = new SelectRequest(
                    "select SentTimestamp, PickedUpTimestamp, MessageBody "
                    + "from sqs_log "
                    + "where SentTimestamp > '" + format.format(start) + "' "
                    + "and PickedUpTimestamp < '" + format.format(end) + "'");
            do {
                request.setNextToken(nextToken);
                SelectResult result = simpleDB.select(request);
                nextToken = result.getNextToken();

A Java Example
                for (Item item : result.getItems()) {
                    String sentTimestamp = null;
                    String pickedUpTimestamp = null;

                    for (Attribute attribute : item.getAttributes()) {
                        if ("SentTimestamp".equals(attribute.getName())) {
                            sentTimestamp = attribute.getValue();
                        } else if ("PickedUpTimestamp".equals(attribute.getName())) {
                            // we need the earliest PickedUpTimestamp
                            if (pickedUpTimestamp == null
                                    || pickedUpTimestamp.compareTo(attribute.getValue()) > 0) {
                                pickedUpTimestamp = attribute.getValue();
                            }
                        }
                    }
                    totalLatency += format.parse(pickedUpTimestamp).getTime()
                            - format.parse(sentTimestamp).getTime();
                    count++;
                }
            } while (nextToken != null);
            // return the average
            return (count != 0 ? totalLatency / count : 0);
        }

A Java Example
        /**
         * Helper method which returns the latency of the past given period of time,
         * in milliseconds.
         */
        public long getLatency(int seconds) throws ParseException {
            Date now = new Date();
            Calendar before = Calendar.getInstance();
            before.setTime(now);
            before.add(Calendar.SECOND, -seconds);
            return getLatency(before.getTime(), now);
        }

        /**
         * Returns the number of messages served from the given start timestamp to
         * the end timestamp.
         */
        public long getThroughput(Date start, Date end) {
            SelectRequest request = new SelectRequest(
                    "select count(*) "
                    + "from sqs_log "
                    + "where SentTimestamp > '" + format.format(start) + "' "
                    + "and ProcessedTimestamp < '" + format.format(end) + "'");
            SelectResult result = simpleDB.select(request);
            for (Attribute attribute : result.getItems().get(0).getAttributes()) {
                if ("Count".equals(attribute.getName())) {
                    return Long.parseLong(attribute.getValue());
                }
            }
            return 0;
        }

A Java Example
        /**
         * Helper method which returns the number of messages served in the past
         * given period of time.
         */
        public long getThroughput(int seconds) {
            Date now = new Date();
            Calendar before = Calendar.getInstance();
            before.setTime(now);
            before.add(Calendar.SECOND, -seconds);
            return getThroughput(before.getTime(), now);
        }
    }

CouchDB and MongoDB
• Use JavaScript data types for the values stored in their documents
• Documents can be thought of as JavaScript objects and can, in fact, be written and read in JSON format
• The native stored procedure/scripting language for both CouchDB and MongoDB is JavaScript
• Documents can also contain attachments, making document stores useful for content management
• Examples of applications: blog engines, event logs, appointment calendars, media stores, chat applications, cloud bookmark storage, Twitter clients, ...

Wide Column Stores
• Manage key-value pairs, but they organize their storage in a semi-schematized and hierarchical pattern
• Some of their nomenclature correlates with that of RDBMS technology
• Between the table and column level lie various intermediate structures that vary by product:
  • Apache Cassandra (originated by Facebook): Super Columns
  • Hypertable and Apache HBase: Column Families
  • Google’s BigTable: Tablets

Wide Column Stores


between the table and column level lie various intermediate structures that vary by product. For example, Apache Cassandra (originated by Facebook) features Super Columns. Hypertable and Apache HBase feature Column Families, and Google’s BigTable features Tablets. The hierarchical structure and some of the varying nomenclature of Wide Column Stores is summarized in Figure 3.

Although the schema within the intermediate structures can vary from row to row, tables and the intermediate structures themselves must be declared. Therefore, Wide Column Stores, while they tolerate schema variation at the “leaf” column level, are not completely schema-free. One could reasonably argue, in fact, that schema changes at the non-leaf level in Wide Column Stores are more disruptive than changes to table schemas in relational databases.

Wide Column Stores work well for a subset of requirements that Key-Value Stores accommodate and many adopters of this category of NoSQL database cite the performance factors, over the structural ones, as reasons they chose it. But, clearly, Wide Column Stores are best for semi-structured data, rather than data whose structure is completely variable from row to row.

As an example, in a product catalog, we may have a collection of items, each of which has a size and a rating associated with it, and we may want to store these items together in a table. But certain items’ sizes may be represented by height, width and depth, others by radius, and still others by weight. The rating may be a star rating on a 1-5 scale (e.g. for a book), or collection of sub-ratings on various attributes (e.g. freshness, flavor, color, moistness). Accommodating a grouping of entities with high-level characteristics in common, but with differing context-specific attributes, is one area where Wide Column Stores do well.

In the relational world, traditionally, such context-specific attributes would each need to be stored in separate tables, with a foreign key in the main table to link them2. Joins and application-level merging of the datasets might be necessary. But Wide Column Stores allow such differently nuanced data to comingle in the same tables and query result sets.

2 Recent versions of major RDBMS products offer new features to accommodate this requirement without resorting to separate attribute tables. Such features in SQL Server and SQL Azure will be discussed later in this paper.

Figure 3: Wide Column Stores contain tables (indicated above as “T”); Cassandra calls them “super-column families” (shown as “SCF”). These contain a key and columns (“C”) which consist of name/value pairs. Columns are subdivided into column families (“CF”), which are known as “super columns” (“SC”) in Cassandra. Columns are schema-free, but higher-level objects must be declared.

Wide Column Stores
• Although the schema within the intermediate structures can vary from row to row, tables and the intermediate structures themselves must be declared
• Tolerate schema variation at the “leaf” column level, but they are not completely schema-free
• Better for semi-structured rather than completely variable unstructured data

Wide Column Stores: A product catalog example
• Accommodating a grouping of entities with high-level characteristics in common, but with differing context-specific attributes, is one area where Wide Column Stores do well
• In the relational world:
  • Context-specific attributes would each need to be stored in separate tables, with a foreign key in the main table to link them
  • Joins and application-level merging of the datasets might be necessary

Wide Column Stores: A product catalog example
• We have a collection of items, each of which has a size and a rating associated with it, and we may want to store these items together in a table
• Certain items’ sizes may be represented by height, width and depth, others by radius, and still others by weight
• The rating may be a star rating on a 1-5 scale (e.g., for a book), or a collection of sub-ratings on various attributes (e.g., freshness, flavor, color, moistness)
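The product catalog can be sketched as rows whose column families ("size" and "rating") are declared up front, while the columns inside each family vary from row to row. This is an in-memory illustration under assumed names and made-up values, not the data model of any specific product.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class WideColumnSketch {
    // Column families must be declared; columns inside them are free-form per row.
    private final Set<String> declaredFamilies;
    // row key -> column family -> column name -> value
    private final Map<String, Map<String, Map<String, String>>> rows = new LinkedHashMap<>();

    WideColumnSketch(Set<String> declaredFamilies) {
        this.declaredFamilies = Set.copyOf(declaredFamilies);
    }

    void put(String rowKey, String family, String column, String value) {
        if (!declaredFamilies.contains(family)) {
            throw new IllegalArgumentException("undeclared column family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new LinkedHashMap<>())
            .computeIfAbsent(family, f -> new LinkedHashMap<>())
            .put(column, value);
    }

    Map<String, String> columns(String rowKey, String family) {
        return rows.getOrDefault(rowKey, Map.of()).getOrDefault(family, Map.of());
    }

    public static void main(String[] args) {
        WideColumnSketch catalog = new WideColumnSketch(Set.of("size", "rating"));
        // A book: size as height/width/depth, rating as stars (made-up values).
        catalog.put("book-1", "size", "height", "20");
        catalog.put("book-1", "size", "width", "13");
        catalog.put("book-1", "size", "depth", "3");
        catalog.put("book-1", "rating", "stars", "4");
        // A cake: size as weight, rating as sub-ratings.
        catalog.put("cake-1", "size", "weight", "750");
        catalog.put("cake-1", "rating", "freshness", "5");
        catalog.put("cake-1", "rating", "flavor", "4");
        System.out.println(catalog.columns("book-1", "size"));
        System.out.println(catalog.columns("cake-1", "rating"));
    }
}
```

Both items live in the same table even though their "size" and "rating" columns differ, which is the case that would need separate attribute tables and joins in a classic relational design.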

Graph Databases
• Recognize entities in a business or other domain, and explicitly track the relationships between them
• Entities are called nodes and the relationships between them are called edges
• Example of a graph database assertion:
    Chris city Auckland
  • Chris and Auckland are nodes and city is an edge
• Popular graph databases: AllegroGraph, Neo4j, and Twitter’s FlockDB
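An assertion such as "Chris city Auckland" can be modeled as a subject, an edge label and an object. The sketch below keeps assertions in a plain list and answers one kind of query; it is a toy illustration with assumed names, not the storage model of AllegroGraph, Neo4j or FlockDB. The second assertion (a "friend" edge to "Dana") is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

class GraphSketch {
    /** One assertion: subject node, edge label, object node. */
    static final class Assertion {
        final String subject;
        final String edge;
        final String object;
        Assertion(String subject, String edge, String object) {
            this.subject = subject;
            this.edge = edge;
            this.object = object;
        }
    }

    private final List<Assertion> assertions = new ArrayList<>();

    void addAssertion(String subject, String edge, String object) {
        assertions.add(new Assertion(subject, edge, object));
    }

    /** All nodes reachable from the subject over edges with the given label. */
    List<String> objectsOf(String subject, String edge) {
        return assertions.stream()
            .filter(a -> a.subject.equals(subject) && a.edge.equals(edge))
            .map(a -> a.object)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        GraphSketch graph = new GraphSketch();
        graph.addAssertion("Chris", "city", "Auckland");
        graph.addAssertion("Chris", "friend", "Dana");   // hypothetical extra edge
        System.out.println(graph.objectsOf("Chris", "city"));   // prints [Auckland]
    }
}
```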

Graph Databases


Graph Databases Graph databases recognize entities in a business or other domain, and explicitly track the relationships between them. In the graph database world, these entities are called nodes and the relationships between them are called edges; all of these terms come from mathematical graph theory as does this NoSQL database subcategory’s name. An example of a graph database assertion (the fundamental atomic unit of data expression) might be:

Chris city Auckland

Where Chris and Auckland are nodes and city is an edge.

From Relational to Relationships
As we try to orient ourselves to graph databases from a relational frame of reference, we could think of an edge in a graph database (a predicate) as a join, and the subject and the object of that predicate (the Chris node and the Auckland node, respectively, in the above case) as rows in a table. Attributes of a node that have scalar values (for example the attribute Age might have a value of 45) can also be represented using edges and nodes, or as properties and values, depending on the specific graph database in use. In the former case, an edge might be thought of as a column, in a broad sense, rather than as a join. A collection of assertions is kept together in a graph. The structure of Graph Databases is illustrated in Figure 4.

New edges can be added (or old ones removed) at any time, allowing one-to-many and many-to-many relationships to be expressed easily and avoiding anything like an intermediate relationship table that you might use in a relational database to accommodate many-to-many joins.

Social graphs fit into the graph database rubric nicely (as does the name). Constructs like friends, followers, degrees of separation, lists, endorsements, status messages and responses to them are very naturally accommodated in graph databases. Semantic Web data also maps quite nicely on to the graph database structure.

Graphs and ORM As we consider the concepts of properties, values and relationships, it starts to become clear that graph database theory has some alignment with object-relational modeling and ORM programming. This then

Figure 4: Graph databases, like those in other NoSQL subcategories, may be key-value based, but they excel at tracking relationships (edges) between entities (nodes), in addition to the entities, keys and values, themselves. Sometimes even the key-value pairs are represented as edges and nodes.

From Relational to Relationships
• We could think of an edge in a graph database (a predicate) as a join, and the subject and the object of that predicate (the Chris node and the Auckland node) as rows in a table
• Attributes of a node that have scalar values (e.g., Age) can also be represented using edges and nodes, or as properties and values, depending on the specific graph database in use. In the former case, an edge might be thought of as a column, in a broad sense, rather than as a join
• A collection of assertions is kept together in a graph

From Relational to Relationships
• New edges can be added (or old ones removed) at any time, allowing one-to-many and many-to-many relationships to be expressed easily and avoiding anything like an intermediate relationship table
• Examples of applications:
  • Social graphs fit into the graph database rubric nicely (as does the name). Constructs like friends, followers, degrees of separation, lists, endorsements, status messages and responses to them are very naturally accommodated
  • Semantic Web data also maps quite nicely onto the graph database structure

Graphs and ORM
• Graph database theory has some alignment with object-relational modeling and ORM programming
• Object databases typically are schema based (even if the schema describes a class rather than a table) and are focused on entities and their properties
• Graph databases are designed to accommodate slowly- or even rapidly-changing schemas and focus on relationships between entities more than the entities themselves

Shared Legacy: MapReduce, Hadoop, BigTable and HBase
• Two technologies underlie, or have provided inspiration for, many of the individual products in each NoSQL subcategory
• Google’s MapReduce and BigTable and their open source counterparts, Apache Hadoop and HBase:
  • MapReduce (Hadoop): generalized parallel job processing engines
  • BigTable (HBase): Wide Column Stores whose tables can serve as sources and destinations for the MapReduce and Hadoop jobs

Why are the job processing engines necessary?
• The less structured, less formal approaches employed by NoSQL databases make querying them less straightforward than in the relational world, and MapReduce/Hadoop help mitigate the burden
• Although explicit joins are not necessary in the NoSQL world, the permissive environment and resulting inconsistency across records/entities/documents makes for quite a bit more hunting and gathering in order to satisfy a query

Why are the job processing engines necessary?
• This is especially true for distributed NoSQL databases which store their data across various servers, typically using sharding
• The lack of query optimizers, and corresponding query efficiencies, in NoSQL databases cries out for some help:
  • Queries have to be broken up and executed across multiple repositories on different servers
  • At some point, the resulting segmented result sets need to be collected and unified
• Map-reduce works very well on that:
  • The process of distributing the query across multiple agents is the Map step
  • The process of coalescing the results into a single result set is the Reduce step
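The scatter and gather pattern described above can be sketched over in-memory "shards". Each shard evaluates the same filter on its own records (the Map step), and the partial result sets are then merged into one sorted result (the Reduce step). The shard representation, the method name and the use of a parallel stream are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class ScatterGatherSketch {
    /** Runs the filter on every shard, then merges the per-shard results. */
    static List<String> query(List<List<String>> shards, Predicate<String> filter) {
        // Map step: each shard evaluates the query on its own records.
        List<List<String>> partials = shards.parallelStream()
            .map(shard -> shard.stream().filter(filter).collect(Collectors.toList()))
            .collect(Collectors.toList());
        // Reduce step: coalesce the segmented result sets into a single sorted one.
        List<String> merged = new ArrayList<>();
        for (List<String> partial : partials) {
            merged.addAll(partial);
        }
        Collections.sort(merged);
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
            List.of("pear", "apple"),      // records held by server 1
            List.of("avocado", "kiwi"));   // records held by server 2
        System.out.println(query(shards, s -> s.startsWith("a")));   // prints [apple, avocado]
    }
}
```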

Map-reduce
• Map-reduce: a general algorithm, prevalent in functional programming languages, which support the notion of map and reduce functions
• MapReduce: the patented software framework from Google, which the company applies to managing large datasets over clusters or other distributed topologies

Motivation for MapReduce

• Large-scale data processing:
  •  Want to use 1000s of CPUs
  •  But don’t want the hassle of managing things
• MapReduce architecture provides:
  •  Automatic parallelization & distribution
  •  Fault tolerance
  •  I/O scheduling
  •  Monitoring & status updates

What is Map-reduce?

• Map-reduce:
  •  Programming model from LISP
  •  (and other functional languages)
• Many problems can be phrased this way
• Easy to distribute across nodes
• Nice retry/failure semantics
• Input: a set of key/value pairs
• User supplies two functions:
  •  map(k, v) → list(k1, v1)
  •  reduce(k1, list(v1)) → v2
• (k1, v1) is an intermediate key/value pair
• Output is the set of (k1, v2) pairs

An example: Word Count

• We have a large file of words
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs

[Figure: Map(<Key, Value>) transforms the <Key, Value> pairs of the input list into intermediate <Key, Value> pairs]

Input list:
  <"pippo", "A B C">
  <"pluto", "D">
  <"paperino", "E A">

Map(<Key, Value>):
  let map(String document_name, String document_content) =
    foreach Word word in document_content :
      emit(word, 1)

Intermediate values:
  <"A", 1>  <"B", 1>  <"C", 1>
  <"D", 1>
  <"E", 1>  <"A", 1>

[Figure: the intermediate <Key, Value> pairs emitted by the Map tasks are grouped by key (KeyA, KeyB, KeyC, KeyD), and each group is passed to a Reduce task]

[Figure: Reduce(Key, Iterator) transforms the intermediate values into the output list of <Key, Value> pairs]

Intermediate values → Reduce(Key, Iterator) → Output list:
  <"A", 1>  <"A", 1>  <"A", 1>  →  <"A", 3>
  <"B", 1>  <"B", 1>  <"B", 1>  →  <"B", 3>

let reduce(Word word, Iterator<int> occurrences) =
  int total_occurrences = 0;
  foreach int o in occurrences :
    total_occurrences += o;
  emit(word, total_occurrences);
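The map and reduce pseudocode for Word Count can be tried out without a cluster. The following is a minimal single-process sketch in plain Java, not Hadoop code: the class name `WordCountSketch` and the driver method `run` are assumptions made for illustration, and the grouping of values by key that a real framework performs between the two phases is emulated with a sorted map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory word count that follows the map/reduce pseudocode; not Hadoop code. */
class WordCountSketch {

    /** map(document_name, document_content): emit (word, 1) for every word. */
    static List<Map.Entry<String, Integer>> map(String documentName, String documentContent) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : documentContent.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return emitted;
    }

    /** reduce(word, occurrences): sum the occurrences of one word. */
    static int reduce(String word, List<Integer> occurrences) {
        int total = 0;
        for (int o : occurrences) {
            total += o;
        }
        return total;
    }

    /** Hypothetical driver: map every document, group values by key, then reduce. */
    static Map<String, Integer> run(Map<String, String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            for (Map.Entry<String, Integer> pair : map(doc.getKey(), doc.getValue())) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }
}
```

Calling `run` on the three sample documents ("pippo" with "A B C", "pluto" with "D", "paperino" with "E A") yields A=2, B=1, C=1, D=1, E=1.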

Combiners

• Often a map task will produce many pairs of the form (k, v1), (k, v2), … for the same key k
  •  E.g., popular words in Word Count
• Can save network time by pre-aggregating at the mapper
  •  combine(k1, list(v1)) → v2
  •  Usually the same as the reduce function
• Works only if the reduce function is commutative and associative
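The effect of a combiner on Word Count can be illustrated with a small in-memory sketch. The class `CombinerSketch` and its methods are hypothetical; the point is only that summing per map task first sends fewer pairs to the reducers and, because addition is commutative and associative, leaves the final counts unchanged.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of a combiner for word count. */
class CombinerSketch {

    /** One map task: emits a (word, 1) pair for every word of its input split. */
    static List<Map.Entry<String, Integer>> mapTask(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    /** Reduce function for word count: sums the values of each key. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    /** Combiner: the same summation, applied to the output of ONE map task only. */
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        return new ArrayList<>(reduce(pairs).entrySet());
    }
}
```

For the two splits "the cat the dog the" and "the end", the map tasks emit 7 pairs in total, while the combined output contains only 5 pairs; reducing either set gives the same counts. A function such as the arithmetic mean could not be reused as its own combiner in this way, because the mean of partial means is not in general the overall mean.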

[Figure: Combine(Key, Iterator) transforms intermediate <Key, Value> pairs into intermediate <Key, Value> pairs]

[Figure: each Map task is followed by a Combine step; the resulting <Key, Value> pairs are grouped by key (KeyA, KeyB, KeyC, KeyD) and passed to the Reduce tasks]

Hadoop MapReduce Implementation Example: WordCount

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    // Mapper (MapClass): emits (word, 1) for every token of the input line.
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
        String line = ((Text) value).toString();
        StringTokenizer itr = new StringTokenizer(line);
        Text word = new Text();
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, new IntWritable(1));
        }
    }

    // Reducer (ReduceClass, also used as the combiner): sums the counts of one word.
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += ((IntWritable) values.next()).get();
        }
        output.collect(key, new IntWritable(sum));
    }

    // Driver: configures and submits the job.
    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(ReduceClass.class);
        conf.setReducerClass(ReduceClass.class);

        conf.setNumMapTasks(40);
        conf.setNumReduceTasks(30);

        // setInputPath/setOutputPath belong to early Hadoop releases; later releases
        // use FileInputFormat.setInputPaths and FileOutputFormat.setOutputPath instead.
        conf.setInputPath(new Path("/shared/wikipedia_small"));
        conf.setOutputPath(new Path("/user/kheafield/word_count"));
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }

Example uses:
•  distributed grep
•  distributed sort
•  web link-graph reversal
•  term-vector / host
•  web access log stats
•  inverted index construction
•  document clustering
•  machine learning
•  statistical machine translation
•  ...

MapReduce Programs In Google Source Tree

Implementation Overview

• Typical cluster:
  •  100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
  •  Limited bisection bandwidth
  •  Storage is on local IDE disks
  •  GFS: distributed file system manages data (SOSP'03)
  •  Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
• Implementation is a C++ library linked into user programs

Execution Overview

[Figure, built up over three slides: the input data is divided into input splits; coordinated by the master, map workers (M) process the splits and reduce workers (R) write the output files]

Data flow

• Input and final output are stored on a distributed file system
• Scheduler tries to schedule map tasks “close” to the physical storage location of the input data
• Intermediate results are stored on the local FS of map and reduce workers
• Output is often input to another map-reduce task

Coordination

• Master data structures:
  •  Task status: (idle, in-progress, completed)
  •  Idle tasks get scheduled as workers become available
  •  When a map task completes, it sends the master the locations and sizes of its R intermediate files, one for each reducer
  •  Master pushes this info to reducers
• Master pings workers periodically to detect failures

Failures

• Map worker failure
  •  Map tasks completed or in-progress at the worker are reset to idle
  •  Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  •  Only in-progress tasks are reset to idle
• Master failure
  •  MapReduce task is aborted and the client is notified
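The coordination and failure rules can be summarized as a small state machine kept by the master. The sketch below is a hypothetical model, not the actual Google or Hadoop implementation: `MasterSketch`, `Task`, `schedule` and `workerFailed` are illustrative names, and worker identities are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the master's task table and its failure handling. */
class MasterSketch {
    enum Kind { MAP, REDUCE }
    enum Status { IDLE, IN_PROGRESS, COMPLETED }

    static class Task {
        final Kind kind;
        Status status = Status.IDLE;
        String worker; // null while the task is idle

        Task(Kind kind) {
            this.kind = kind;
        }
    }

    final List<Task> tasks = new ArrayList<>();

    Task addTask(Kind kind) {
        Task task = new Task(kind);
        tasks.add(task);
        return task;
    }

    /** Hands the next idle task, if any, to a worker that became available. */
    Task schedule(String worker) {
        for (Task task : tasks) {
            if (task.status == Status.IDLE) {
                task.status = Status.IN_PROGRESS;
                task.worker = worker;
                return task;
            }
        }
        return null;
    }

    void complete(Task task) {
        task.status = Status.COMPLETED;
    }

    /** Called when a worker stops answering the master's pings. */
    void workerFailed(String worker) {
        for (Task task : tasks) {
            if (!worker.equals(task.worker)) {
                continue;
            }
            boolean inProgress = task.status == Status.IN_PROGRESS;
            // Completed map output sits on the failed worker's local file system and
            // is lost; completed reduce output is already in the distributed file system.
            boolean lostMapOutput = task.kind == Kind.MAP && task.status == Status.COMPLETED;
            if (inProgress || lostMapOutput) {
                task.status = Status.IDLE;
                task.worker = null;
            }
        }
    }
}
```

The asymmetry between map and reduce tasks follows from the data flow: intermediate results live on local disks, whereas final output is written to the distributed file system.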

Map-reduce: considerations for data management

• As effective as these mechanisms can be, they also introduce extra work for the database developer:
  •  A declarative language over distributed storage is missing
  •  The declarative power of SQL provides productivity that most organizations count on


[...] approach called map-reduce acknowledges and addresses this conundrum. Specifically, the process of distributing the query across multiple agents is the Map step, and the process of coalescing the results into a single result set is the Reduce step.

Map-reduce is a general algorithm, and is prevalent in functional programming languages – including F# – which support the notion of map and reduce functions. MapReduce (without the hyphen) is the patented software framework from Google that the company applies in the realm of managing large datasets over clusters or other distributed topologies. Hadoop is the top-level Apache project which implements map-reduce as a generalized highly parallel, divide-and-conquer batch job task manager.

Google MapReduce/BigTable and Apache Hadoop/HBase have their fingerprints all over most NoSQL databases. For example, Apache CouchDB, one of the document store databases already discussed, is, according to its Web site on apache.org, “queried and indexed in a MapReduce fashion.” Some would argue that CouchDB’s map and reduce steps differ conceptually from those in MapReduce itself. Nonetheless, the overarching map-reduce approach is the inspiration for the design of many NoSQL products.

As effective as these mechanisms can be, they also introduce extra work for the database developer. That’s because instead of providing a declarative language over distributed storage that could then be implemented using map-reduce functionality under the covers, the architecture’s designers focused primarily on the raw processing approach and never added a language abstraction. In the world of line-of-business applications, the declarative power of SQL provides productivity that most organizations count on. Map-reduce based systems, by and large, cannot provide that productivity.

A summary of the various NoSQL database subcategories, and the suitability of each to different scenarios and requirements, including map-reduce, is presented in table form in Figure 5.

Figure 5: This chart shows the applicability of different NoSQL database types to different needs or scenarios. Notice that wide column stores are more special-purposed than are the other NoSQL subcategories, which are applicable in a variety of scenarios.

Frameworks supporting NoSQL Query to Map-reduce

• Apache Pig:
  •  Platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
  •  A compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject)
  •  Pig's language (Pig Latin) has the following key properties:
    •  Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain
    •  Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency
    •  Extensibility. Users can create their own functions to do special-purpose processing

Frameworks supporting NoSQL Query to Map-reduce

• Apache Hive:
  •  A data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems
  •  Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL
  •  At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL

Scalable RDBMS

Scalable RDBMS

• Use small-scope operations: operations that span many nodes, e.g. joins over many tables, will not scale well with sharding
• Use small-scope transactions: transactions that span many nodes are going to be very inefficient because of the communication and two-phase commit overhead
• NoSQL systems avoid these two problems by making it difficult or impossible to perform larger-scope operations and transactions
• Scalable RDBMSs do not need to preclude larger-scope operations and transactions: they simply penalize customers who use them

Current Scalable RDBMS

• MySQL Cluster
• VoltDB
• Clustrix
• ScaleDB
• ScaleBase
• NimbusDB
• Google Megastore

Scalable RDBMS – Technological solutions

• Data sharding over multiple database servers (shared-nothing architecture)
• Every shard is replicated to support recovery and fast access to read-mostly data
• Some systems allow customers to choose the sharding attribute
• Bidirectional geographic replication may also be supported
• Supports in-memory as well as disk-based data:
  •  In-memory storage allows real-time responses
  •  Indexes and record structures are designed for RAM rather than disk, and the overhead of a disk cache/buffer is eliminated as well

Scalable RDBMS – Technological solutions

• All SQL calls are made through stored procedures, with each stored procedure being one transaction
• Data is sharded so that transactions can be executed on a single node: no locks are required, there are no waits on locks, and transaction coordination is likewise avoided
• Stored procedures are compiled to produce code comparable to the access-level calls of NoSQL systems
• Some systems (e.g., Clustrix) are sold as rack-mounted appliances and use solid state disks for additional performance
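Whether a transaction can run on a single node depends on how rows are assigned to shards. The following sketch assumes simple hash-based routing on the sharding attribute; `ShardRouter`, `shardFor` and `isSingleShard` are hypothetical names and do not correspond to the API of any of the products listed above, which may use other partitioning schemes.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical hash-based shard router; not the API of any listed product. */
class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.shardCount = shardCount;
    }

    /** Maps a value of the sharding attribute to a shard index in [0, shardCount). */
    int shardFor(Object shardingKey) {
        // floorMod keeps the index non-negative even for negative hash codes.
        return Math.floorMod(shardingKey.hashCode(), shardCount);
    }

    /** A transaction is small-scope when all the keys it touches live on one shard. */
    boolean isSingleShard(List<?> keysTouched) {
        Set<Integer> shards = new HashSet<>();
        for (Object key : keysTouched) {
            shards.add(shardFor(key));
        }
        return shards.size() <= 1;
    }
}
```

With four shards and integer keys, keys 5 and 9 both map to shard 1, so a transaction touching only those two rows stays on one node, while a transaction touching keys 5 and 6 spans two shards and would need cross-node coordination.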

References

• J. Barr. Host Your Web Site In The Cloud: Amazon Web Services Made Easy: Amazon EC2 Made Easy. Sitepoint, 2010
• A. J. Brust. NoSQL and the Windows Azure Platform. http://blogs.msdn.com/b/sqlazure/archive/2011/05/04/10160671.aspx
• R. Cattell. Scalable SQL and NoSQL Data Stores. SIGMOD Record, December 2010 (Vol. 39, No. 4), 12-27
• T.K. Prasad. MapReduce architecture: Map Reduce Architecture. http://www.cs.wright.edu/~tkprasad/courses/cs707L06MapReduce.ppt
• J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. http://research.google.com/archive/mapreduce.html
• S. Ghemawat, H. Gobioff, S.T. Leung. The Google File System. http://research.google.com/archive/gfs.html