
IMT - Institutions Markets Technologies

Institute for Advanced Studies

Lucca

Cloud computing for

large scale data analysis

PhD Program in Computer Science and Engineering

XXIV Cycle

Gianmarco De Francisci Morales

[email protected]

Supervisor:

Dott. Claudio Lucchese

Co-supervisor:

Dott. Ranieri Baraglia

15 February 2010

Abstract

An incredible “data deluge” is currently drowning the world. Data sources are everywhere, from the Web 2.0 to large scientific experiments, from social networks to sensors. This huge amount of data is a valuable asset in our information society.

Data analysis is the process of inspecting data in order to extract useful information. Decision makers commonly use this information to drive their choices. The quality of the information extracted by this process greatly benefits from the availability of extensive data sets.

As we enter the “Petabyte Age”, traditional approaches for data analysis begin to show their limits. “Cloud computing” is an emerging alternative technology for large scale data analysis. Data oriented cloud systems combine both storage and computing in a distributed and virtualized manner. They are built to scale to thousands of computers, and focus on fault tolerance, cost effectiveness and ease of use.

This thesis aims to provide a coherent framework for research in the field of large scale data analysis on cloud computing systems. The goal of our research is threefold. I) To build a toolbox of algorithms and propose a way to evaluate and compare them. II) To design extensions to current cloud paradigms in order to support these algorithms. III) To develop methods for online analysis of large scale data.

In order to reach our goal, we plan to adopt principles from database research. We postulate that many results in this field are relevant also for cloud computing systems. With our work, we expect to provide a common ground on which the database and cloud communities will be able to communicate and thrive.


Contents

List of Figures

List of Tables

List of Acronyms

1 Introduction
  1.1 The “Big Data” Problem
  1.2 Methodology Evolution

2 State of the Art
  2.1 Technology Overview
  2.2 Comparison with PDBMS
  2.3 Case Studies
  2.4 Research Directions

3 Research Plan
  3.1 Research Problem
  3.2 Thesis Proposal

Bibliography


List of Figures

1.1 The Petabyte Age
1.2 Data Information Knowledge Wisdom hierarchy
2.1 Data oriented cloud computing architecture
2.2 The MapReduce programming model


List of Tables

2.1 Major cloud software stacks
2.2 Relative advantages of PDBMS and Cloud Computing


List of Acronyms

ACID Atomicity, Consistency, Isolation, Durability

BASE Basically Available, Soft state, Eventually consistent

DAG Directed Acyclic Graph

DBMS Database Management System

DHT Distributed Hash Table

DIKW Data Information Knowledge Wisdom

GFS Google File System

HDFS Hadoop Distributed File System

IaaS Infrastructure as a Service

MPI Message Passing Interface

MR MapReduce

OLAP Online Analytical Processing

OLTP Online Transaction Processing

PaaS Platform as a Service

PDBMS Parallel Database Management System

PDM Parallel Disk Model

PVM Parallel Virtual Machine

RAM Random Access Machine

SaaS Software as a Service

SOA Service Oriented Architecture

SQL Structured Query Language

UDF User Defined Function


Chapter 1

Introduction

How would you sort 1 GB of data? Today’s computers have enough memory to hold this quantity of data, so the best way is to use an in-memory algorithm like quicksort. What if you had to sort 100 GB of data? Even if systems with more than 100 GB of memory exist, they are by no means common or cheap. So the best solution is to use a disk based sort like mergesort or greedsort. However, what if you had 10 TB of data to sort? Today’s hard disks are usually 1 to 2 TB in size, which means that just to hold 10 TB of data we need multiple disks. In this case the bandwidth between memory and disk would probably become a bottleneck. So, in order to obtain acceptable completion times, we need to employ multiple computing nodes and a parallel algorithm like bitonic sort.
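To make the disk based approach concrete, the following sketch (in Python, with illustrative file paths and an arbitrary chunk size) performs a two-pass external merge sort: it splits the input into memory-sized sorted runs and then merges the runs with a heap. It is a minimal illustration of the idea under these assumptions, not a tuned implementation.

import heapq
import os
import tempfile

def external_sort(input_path, output_path, max_lines_in_memory=1_000_000):
    """Two-pass external merge sort for files larger than main memory."""
    run_paths = []
    # Pass 1: read memory-sized chunks, sort them, write sorted runs to disk.
    with open(input_path) as src:
        while True:
            chunk = [line for _, line in zip(range(max_lines_in_memory), src)]
            if not chunk:
                break
            chunk.sort()
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as run:
                run.writelines(chunk)
            run_paths.append(path)
    # Pass 2: k-way merge of the sorted runs using a heap.
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as dst:
        dst.writelines(heapq.merge(*runs))
    for r in runs:
        r.close()
    for p in run_paths:
        os.remove(p)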

This example illustrates a very general point: the same problem at different scales needs radically different solutions. In many cases, the model we use to reason about the problem needs to be changed. In the example above, the Random Access Machine (RAM) model [59] should be used in the former case, while the Parallel Disk Model (PDM) [67] in the latter. The reason is that the simplifying assumptions made by the models do not hold at every scale. Citing George E. P. Box, “Essentially, all models are wrong, but some are useful” [7], and arguably “Most of them do not scale”.

For data analysis, the scaling up of data sets is a “double edged” sword. On the one hand, it is an opportunity because “no data is like more data”. Deeper insights are possible when more data is available. On the other hand, it is a challenge. Current methodologies are often not suitable to handle very large data sets, so new solutions are needed.



1.1 The “Big Data” Problem

The issues raised by large data sets in the context of analytical applications are becoming ever more important as we enter the “Petabyte Age”. Figure 1.1 shows some of the data sizes for today’s problems [5]. The data sets are orders of magnitude greater than what fits on a single hard drive, and their management poses a serious challenge. Web companies are currently facing this issue, striving to find efficient solutions. The ability to analyze or manage more data is a distinct competitive advantage. This problem has been labeled in various ways: petabyte scale, web scale or “big data”.

Figure 1.1: The Petabyte Age

Currently, an incredible “data deluge” is drowning the world with data. The quantity of data we need to sift through every day is enormous; just think of the results of a search engine query. They are so many that we are not able to examine all of them, and indeed the competition on web search now focuses on ranking the results. This is just an example of a more general trend.

Data sources are everywhere. Web users produce vast amounts of text, audio and video content in the so-called Web 2.0. Relationships and tags in social networks create large graphs spanning millions of vertices and edges.

Scientific experiments are another huge data source. The Large Hadron Collider (LHC) at the European Organization for Nuclear Research (CERN) is expected to generate around 50 TB of raw data per day. The Hubble telescope has captured hundreds of thousands of astronomical images, each hundreds of megabytes in size. Computational biology experiments like high-throughput genome sequencing produce large quantities of data that require extensive post-processing.

In the future, sensors like Radio-Frequency Identification (RFID) tags and Global Positioning System (GPS) receivers will spread everywhere. These sensors will produce petabytes of data just as a result of their sheer numbers, thus starting “the industrial revolution of data” [33].

Figure 1.2: Data Information Knowledge Wisdom hierarchy

But why are we interested in data? It is common belief that data without a model is just noise. Models are used to describe salient features in the data, which can be extracted via data analysis. Figure 1.2 represents the popular Data Information Knowledge Wisdom (DIKW) hierarchy [56]. In this hierarchy data stands at the lowest level and bears the smallest level of understanding. Data needs to be processed and condensed into more connected forms in order to be useful for event comprehension and decision making. Information, knowledge and wisdom are these forms of understanding: relations and patterns that allow us to gain deeper awareness of the process that generated the data, and principles that can guide future decisions.

The large availability of data potentially enables more powerful analysis and unexpected outcomes. For example, Google can detect regional flu outbreaks up to ten days faster than the Centers for Disease Control and Prevention by monitoring increased search activity for flu-related topics [29]. Chapter 2 presents more examples of interesting large scale data analysis problems.

But how do we define “big data”? The definition is of course relative and evolves over time as technology progresses. Indeed, thirty years ago one terabyte would have been considered enormous, while today we commonly deal with such quantities of data.

According to Jacobs [37], big data is “data whose size forces us to look beyond the tried-and-true methods that are prevalent at that time”. This means that we can call “big” an amount of data that forces us to use or create innovative research products. In this sense big data is a driver for research. The definition is interesting because it puts the focus on the subject who needs to manage the data rather than on the object to be managed. The emphasis is thus on user requirements such as throughput and latency.

Let us highlight some of the requirements for a system used to perform data intensive computing on large data sets. Given the effort to find a novel solution and the fact that data sizes are ever growing, this solution should be applicable for a long period of time. Thus the most important requirement a solution has to satisfy is scalability.

Scalability is defined as “the ability of a system to accept increased input volume without impacting the profits”. This means that the gains from the input increment should be proportional to the increment itself. This is a broad definition, used also in other fields like economics, but more specific ones for computer science are provided in the next section.

For a system to be fully scalable, its scale should not be a design parameter. Forcing the system designer to take into account all possible deployment sizes for a system leads to a scalable architecture without fundamental bottlenecks. We define such a system “scale-free”.

However, apart from scalability, there are other requirements for a large scale data intensive computing system. Real world systems cost money to build and operate. Companies attempt to find the most cost effective way of building a large system because it usually requires a significant monetary investment. Partial upgradability is an important money saving feature, and is more easily attained with a loosely coupled system. Operational costs like personnel salaries (e.g., system administrators) account for a large share of the budget of IT departments. To be profitable, large scale systems must require as little human intervention as possible. Therefore autonomic systems are preferable: systems that are self-configuring, self-tuning and self-healing. In this respect, a key property is fault tolerance.

Fault tolerance is “the property of a system to operate properly in spite of the failure of some of its components”. When dealing with a large system, the probability that a disk breaks or a server crashes rises dramatically: it is the norm rather than the exception. A performance degradation is acceptable as long as the system does not halt completely. A denial of service usually has a negative economic impact, especially for Web-based companies. The goal of fault tolerance techniques is to create a highly available system.

To summarize, a large scale data analysis system should be scalable, cost effective and fault tolerant.

1.2 Methodology Evolution

Providing data for analysis is a problem that has been extensively studied. Many different solutions exist. However, the usual approach is to employ a Database Management System (DBMS) to store and manage the data.

DBMSs were born in the ’60s and used a navigational model. The so-called CODASYL approach, born together with COBOL, used a hierarchical or network schema for data representation. The user had to explicitly traverse the data using links to find what he was interested in. Even simple queries like “Find all people in India” were prohibitively expensive as they required scanning all the data.

The CODASYL approach was partly due to technological limits: the most common mass storage technology at the time was tape. Magnetic tapes have very high storage density and throughput for linear access, but extremely slow random access performance due to large seek times. Notice that this description fits current disk technology almost perfectly, and provides the rationale for some of the new developments in data management systems [31].

During the ’70s Codd [13] introduced the famous relational model that is still in use today. The model introduces the familiar concepts of tabular data, relation, normalization, primary key, relational algebra and so on. The model had two main implementations: IBM’s System R and Berkeley’s INGRES. All modern databases can be traced back in some way to these two common ancestors in terms of design and code base [69]. For this reason DBMSs bear the burden of their age and are not always a good fit for today’s applications.

The original purpose of DBMSs was to process transactions in business oriented processes, also known as Online Transaction Processing (OLTP). Queries were written in Structured Query Language (SQL) and run against data modeled in relational style. Currently, on the other hand, DBMSs are used in a wide range of different areas: besides OLTP, we have Online Analytical Processing (OLAP) applications like data warehousing and business intelligence, stream processing with continuous queries, text databases and much more [60]. Furthermore, stored procedures are used instead of plain SQL for performance reasons. Given the shift and diversification of application fields, it is not a surprise that most existing DBMSs fail to meet today’s high performance requirements [61, 62].

High performance has always been a key issue in database research. There are usually two approaches to achieve it: vertical and horizontal. The former is the simpler, and consists of adding resources (cores, memory, disks, etc.) to an existing system. If the resulting system is capable of taking advantage of the new resources, it is said to scale up. The inherent limitation of this approach is that the single most powerful system available on earth might not be enough. The latter approach is more complex, and consists of adding new separate systems in parallel to the existing one. If higher performance is achieved, the system is said to scale out. The multiple systems are treated as a single logical unit. The result is a parallel system, with all the (hard) problems of concurrency.

Typical parallel systems are divided into three categories according to their architecture: shared memory, shared disk or shared nothing. In the first category we find Symmetric Multi-Processors (SMPs) and large parallel machines. In the second one we find rack based solutions like Storage Area Network (SAN) or Network Attached Storage (NAS). The last category includes large commodity clusters interconnected by a local network and is usually deemed the most scalable [63].


Parallel Database Management Systems (PDBMSs) [21] are the result of these considerations. They are an attempt to achieve high performance in a parallel fashion. Almost all the designs of a PDBMS use the same basic dataflow pattern for queries and horizontal partitioning of the tables on a cluster of shared nothing machines [23].

Unfortunately, PDBMSs are very complex systems. They need fine tuning of many “knobs” and feature simplistic fault tolerance policies. In the end, they do not provide the user with adequate ease of installation and administration (the so-called “one button” experience), nor with flexibility of use (e.g., poor support for User Defined Functions (UDFs)).

To date, despite numerous claims about their scalability, PDBMSs have proven to be profitable only up to tens or hundreds of nodes. It is legitimate to question whether this is the result of a fundamental theoretical problem in the parallel approach.

Parallelism has some well known limitations. Amdahl [4] in his 1967 paper argued in favor of a single-processor approach to achieve high performance. Indeed, the famous “Amdahl’s law” states that the parallel speedup of a program is inherently limited by the inverse of its serial fraction, the non parallelizable part of the program. The law also defines the concept of strong scalability, in which the total problem size is fixed. Equation 1.1 specifies Amdahl’s law for N parallel processing units, where r_s and r_p are the serial and parallel fractions of the program measured on a single processor (r_s + r_p = 1):

SpeedUp(N) = 1 / (r_s + r_p / N)    (1.1)

Nonetheless, parallel techniques are justified from the theoretical point of view. Gustafson [32] in 1988 re-evaluated Amdahl’s law under different assumptions, namely that the problem size increases with the number of computing units, so that the problem size per unit is fixed. Under these assumptions the achievable speedup is almost linear, as stated in Equation 1.2, where r'_s and r'_p are the serial and parallel fractions measured on the parallel system instead of the serial one. The equation also defines the concept of scaled speedup, or weak scalability.

SpeedUp(N) = r'_s + r'_p * N = N + (1 - N) * r'_s    (1.2)


Even though the two formulations are mathematically equivalent [58], they make drastically different assumptions. In our case the size of the problem is large and ever growing. Hence it seems appropriate to adopt Gustafson’s point of view, which justifies the parallel approach.
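The practical difference between the two formulations is easy to see numerically. The following Python sketch evaluates Equations 1.1 and 1.2 for a program with an assumed 5% serial fraction; the fraction and the processor counts are arbitrary choices for illustration.

def amdahl(n, serial_fraction):
    """Strong scaling: fixed total problem size (Equation 1.1)."""
    rs, rp = serial_fraction, 1.0 - serial_fraction
    return 1.0 / (rs + rp / n)

def gustafson(n, serial_fraction):
    """Weak scaling: fixed problem size per processor (Equation 1.2)."""
    rs_prime = serial_fraction
    return n + (1 - n) * rs_prime

for n in (10, 100, 1000):
    print(n, round(amdahl(n, 0.05), 1), round(gustafson(n, 0.05), 1))
# With a 5% serial fraction, Amdahl's speedup saturates around 20x,
# while Gustafson's scaled speedup keeps growing almost linearly.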

There are also practical considerations that corroborate parallel methods. Physical limits for processors have been reached. Chip makers are no longer raising clock frequencies because of heat dissipation and power consumption concerns. As a result, they are already introducing parallelism at the CPU level with manycore processors [35]. In addition, the most cost effective option to build a large system is to use many inexpensive commodity components instead of deploying high-end servers. These issues have led to a renewed popularity of parallel computing.

Parallel computing has a long history. It has traditionally focused on “number crunching”. Common applications were tightly coupled and CPU intensive (e.g. large simulations or finite element analysis). Control-parallel programming interfaces like the Message Passing Interface (MPI) or the Parallel Virtual Machine (PVM) are still the de-facto standard in this area. These systems are notoriously hard to program. Fault tolerance is difficult to achieve and scalability is an art. They require explicit control of parallelism and are sometimes referred to as “the assembly language of parallel computing”.

In stark contrast with this legacy, a new class of parallel systems has emerged: cloud computing. Cloud systems focus on being scale-free, fault tolerant, cost efficient and easy to use.

Cloud computing is the result of the convergence of three technologies: grid computing, virtualization and Service Oriented Architecture (SOA). The aim of cloud computing is thus to offer services on a virtualized parallel back-end system. These services are divided into categories according to the resource they offer: Infrastructure as a Service (IaaS) like Amazon’s EC2 and S3, Platform as a Service (PaaS) like Google’s App Engine and Microsoft’s Azure Services Platform, and Software as a Service (SaaS) like Salesforce, OnLive and virtually every Web application.

Lately cloud computing has received a substantial amount of attention from industry, academia and the press. As a result, the term “cloud computing” has become a buzzword, overloaded with meanings. There is a lack of consensus on what is and what is not cloud, as even simple client-server applications on portable devices are sometimes included in the category [17]. As usual, the boundaries between similar technologies are fuzzy, so there is no clear distinction among grid, utility, cloud, and other kinds of computing technologies. In spite of the many attempts to describe cloud computing [48], there is no widely accepted definition.

Even without a clear definition, there are some properties that we think a cloud computing system should have. Each of the three aforementioned technologies brings some features to cloud computing. According to SOA principles, a cloud system should be a distributed system, with separate, loosely coupled entities collaborating with each other. Virtualization (not intended just as x86 virtualization but as a general abstraction of computing, storage and communication facilities) provides for location, replication and failure transparency. Finally, grid computing endorses scalability. All these properties are clearly well grounded in parallel and distributed systems research. Indeed, cloud computing can be thought of as the current step in this area.

Within cloud computing, there is a subset of technologies that is more geared towards data analysis. These systems are aimed mainly at I/O intensive tasks, are optimized for streaming data from disk drives and use a data-parallel approach. An interesting feature is that they are “dispersed”: computing and storage facilities are distributed, abstracted and intermixed. These systems attempt to move computation as close to data as possible, because moving large quantities of data is expensive. Finally, the burden of dealing with the issues caused by parallelism is removed from the programmer. This provides the programmer with a scale-agnostic programming model.

Data oriented cloud computing systems are a natural alternative to PDBMSs when dealing with large scale data. As such, a fierce debate is currently taking place, both in industry and academia, on which is the best tool.

The remainder of this document is organized as follows. Chapter 2 gives a technical and research overview of data oriented cloud systems. To set the frame for the thesis dissertation we show a comparison with PDBMSs. Finally, Chapter 3 describes the research proposal in greater detail.

Chapter 2

State of the Art

Large scale data challenges have spurred a large number of projects on data oriented cloud computing systems. Most of these projects feature important industrial partners alongside academic institutions. Big internet companies are the most interested in finding novel methods to manage and use data. Hence, it is not surprising that Google, Yahoo!, Microsoft and a few others are leading the trend in this area.

In the remainder of this document we will focus only on the data oriented part of cloud technologies, and will simply refer to them as cloud computing. In the next sections we will describe the technologies that have been developed to tackle large scale data intensive computing problems, and the research directions in this area.

2.1 Technology Overview

Even though existing cloud computing systems are very diverse, they share many common traits. For this reason we propose a general architecture of cloud computing systems that captures the commonalities in the form of a multi-layered stack. Figure 2.1 shows the proposed software stack of a cloud system.

At the lowest level we find a coordination level that serves as a basic building block for the distributed services higher in the stack. This level deals with basic concurrency issues.

The distributed data layer builds on top of the coordination one. This layer deals with distributed data access, but unlike a traditional distributed file system it does not offer standard POSIX semantics for the sake of performance.




Figure 2.1: Data oriented cloud computing architecture

The data abstraction layer is still part of the data layer and offers different, more sophisticated interfaces to data.

The computation layer is responsible for managing distributed processing. As with the data layer, generality is sacrificed for performance. Only embarrassingly data parallel problems are commonly solvable in this framework. The high level languages layer encompasses a number of languages, interfaces and systems that have been developed to simplify and enrich access to the computation layer.

Table 2.1 reports some of the most popular cloud computing software, classified according to the architecture of Figure 2.1. Google has described its software stack in a series of published papers, but the software itself is not available outside the company. Some independent developers, later hired by Yahoo!, implemented the stack described by Google and released the project under an open-source license. Hadoop [6] is now an umbrella project of the Apache Software Foundation and is used and developed extensively by Yahoo!. Microsoft has released part of its stack in binary format, but the software is proprietary.


Layer                  Google      Yahoo!         Microsoft          Others
High Level Languages   Sawzall     Pig Latin      DryadLINQ, SCOPE   Hive, Cascading
Computation            MapReduce   Hadoop         Dryad              -
Data Abstraction       BigTable    HBase, PNUTS   -                  Cassandra, Voldemort
Distributed Data       GFS         HDFS           Cosmos             Dynamo
Coordination           Chubby      Zookeeper      -                  -

Table 2.1: Major cloud software stacks

In the coordination layer we find two implementations of a consensus algorithm. Chubby [8] is a distributed implementation of Paxos [42] and Zookeeper is Hadoop’s re-implementation of it in Java. They are centralized services for maintaining configuration information, naming, and providing distributed synchronization and group services. All these kinds of services are used by distributed applications.

Chubby has been used to build a Domain Name System (DNS) alternative, to build a replicated distributed log and as a primary election mechanism. Zookeeper has also been used as a bootstrap service locator and a load balancer. The main characteristics of these services are very high availability and reliability, at the expense of raw performance. Their interface is similar to a distributed file system with whole-file reads and writes (usually of metadata), advisory locks and notification mechanisms. They have solid roots in distributed systems research.
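To give a flavor of how such services are used, the sketch below outlines the classic leader-election recipe built on ephemeral sequential nodes. The client interface (create_ephemeral_sequential, get_children, watch) is a hypothetical stand-in introduced only for illustration; it does not reproduce the actual Chubby or Zookeeper APIs.

def elect_leader(client, election_path="/election"):
    """Leader-election recipe on a Chubby/Zookeeper-style coordination service.

    `client` is a hypothetical handle assumed to expose:
      create_ephemeral_sequential(prefix) -> bare name of the created node
      get_children(path)                  -> bare names of all live candidates
      watch(path, callback)               -> invoke callback when that node disappears
    Ephemeral nodes vanish when their owner's session dies, so a crashed
    leader is detected and replaced automatically.
    """
    my_node = client.create_ephemeral_sequential(election_path + "/candidate-")
    candidates = sorted(client.get_children(election_path))
    if my_node == candidates[0]:
        return True                      # lowest sequence number wins
    # Watch only the candidate immediately preceding ours, to avoid a
    # thundering herd of notifications when the current leader fails.
    predecessor = candidates[candidates.index(my_node) - 1]
    client.watch(election_path + "/" + predecessor,
                 lambda: elect_leader(client, election_path))
    return False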

On the next level, the distributed data layer presents different kinds of data stores. A common feature in this layer is to avoid full POSIX semantics in favor of simpler ones. Furthermore, consistency is somewhat relaxed for the sake of performance.

Google File System (GFS) [27], Hadoop Distributed File System (HDFS) and Cosmos are all distributed file systems geared towards data streaming. They are not general purpose file systems. For example, in HDFS files can only be appended but not modified. They use large blocks of 64 MB or more, which are replicated for fault tolerance.

GFS uses a master-slave architecture where the master is responsible for metadata operations while the slaves (chunkservers) are used for data. It provides loose consistency guarantees; for example, a record might be appended more than once (at-least-once semantics). GFS is implemented as a user space process and stores the data chunks as lazily allocated files on a local file system. HDFS basically implements the same architecture. In both systems the master (namenode in HDFS terminology) is a single point of failure.

Cosmos [9] is Microsoft’s internal file system for cloud computing. There is not much public information available about this component. It is an append-only file system optimized for streaming of petabyte scale data. Data is replicated for fault tolerance and compressed for efficiency. From this scarce information it looks like a re-implementation of Google’s GFS.

Dynamo [20] is a storage service used at Amazon. It is a key-value store used for low latency access to data. It has a peer-to-peer (P2P) architecture that uses consistent hashing to spread the load and a gossip protocol to guarantee eventual consistency [68]. Amazon’s shopping cart and other services (e.g. S3) are built on top of Dynamo.
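The load-spreading technique mentioned above is simple to sketch. The Python fragment below shows the core idea of consistent hashing with virtual nodes: servers and keys are hashed onto the same ring, and each key belongs to the first server found clockwise from its position. This is a minimal sketch of the general technique, not Dynamo’s actual implementation (which adds preference lists and replication).

import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes."""
    def __init__(self, servers, vnodes=100):
        # Each server is placed on the ring at `vnodes` pseudo-random points.
        self.ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def server_for(self, key):
        # First ring position clockwise from the key's hash (wrapping around).
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.server_for("shopping-cart:42"))
# Adding or removing a server only remaps the keys adjacent to its virtual
# nodes, instead of rehashing the whole key space.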

The systems described above are, with the exception of Dynamo, mainly append-only and stream oriented file systems. However, it is sometimes convenient to access data in a different fashion, for example using a more elaborate data model or enabling read/write operations. For these reasons data abstractions are usually built on top of the aforementioned systems.

BigTable [10] and HBase [6] are non-relational databases. They are actually multidimensional, sparse, sorted maps, useful for semi-structured or unstructured data. As in many other similar systems, access to data is provided via primary key only, but each key can have multiple “columns”. The data is stored column-wise, which allows for a more efficient representation (“null” values are stripped out) and for better compression (data is homogeneous). The data is also partitioned into tablets and distributed over a large number of tablet servers. These systems provide efficient read/write access built on top of their respective file system (GFS and HDFS), with support for versioning, timestamps and single row transactions. Google Earth, Google Analytics and many other user-facing services use this kind of storage.
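Conceptually, the data model of these systems is just a sparse, sorted, multidimensional map from (row key, column, timestamp) to value. The toy Python sketch below mimics that interface with single-row put/get operations and a sorted range scan; it illustrates the data model only, and not the distributed implementation.

from collections import defaultdict

class SparseSortedMap:
    """Toy model of a BigTable/HBase table: (row, column, timestamp) -> value."""
    def __init__(self):
        # Missing columns are simply absent: no space is wasted on nulls.
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        self.rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        versions = self.rows[row][column]
        if not versions:
            return None
        ts = timestamp if timestamp is not None else max(versions)
        return versions.get(ts)

    def scan(self, start_row, end_row):
        # Rows are kept sorted by key, so range scans are cheap to express.
        for row in sorted(self.rows):
            if start_row <= row < end_row:
                yield row, self.rows[row]

table = SparseSortedMap()
table.put("com.example/index.html", "anchor:home", "Example", timestamp=1)
print(table.get("com.example/index.html", "anchor:home"))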

The fundamental data structure of BigTable and HBase is the SSTable. It is the underlying file format used to store data. SSTables are designed so that a data access requires a single disk access. An SSTable is never changed. If new data is added, a new SSTable is created, and SSTables are eventually “compacted”. The immutability of SSTables is at the core of the data checkpointing and recovery routines, and resembles the design of a log structured file system.

PNUTS [16] is a similar storage service developed by Yahoo! outside the Hadoop project. PNUTS allows key-value based lookup where values may be structured as typed columns or “blobs”. PNUTS users may control the consistency of the data they read using “hints” to require the latest version or a potentially stale one. Data is split into range or hash tablets by a tablet controller. This layout is soft-cached in message routers that are used for queries. The primary use of PNUTS is in web applications like Flickr.

PNUTS uses a “per-record timeline” consistency model which guarantees that every replica will apply operations to a single record in the same order. This is stronger than eventual consistency and relies on a guaranteed, totally-ordered-delivery message brokering service. Each record is geographically replicated across several data centers. The message broker does not guarantee in-order delivery between different data centers. Thus each record has a single master copy which is updated by the broker, committed and then published to the replicas. This makes sure that every update gets replayed in a canonical order at each replica. In order to reduce latency, the mastering of the record is switched according to changes in access location.

Cassandra [25] is a Facebook project (now open-sourced into the Apache Incubator) that aims to build a system with a BigTable-like interface on a Dynamo-style infrastructure. Voldemort [44] is an open-source non-relational database built by LinkedIn. It is basically a large persistent Distributed Hash Table (DHT). This area has been very fruitful and there are many other products, but all of them share in some way the features sketched in these paragraphs: distributed architecture, non relational model, no ACID guarantees, limited relational operations (no join).

In the computation layer we find paradigms for large scale data intensive computing. They are mainly dataflow paradigms with support for automated parallelization. We can recognize here the same pattern found in the previous layers: generality is traded off for performance.

MapReduce (MR) [19] is a distributed computing engine inspired by concepts of functional languages.


Figure 2.2: The MapReduce programming model

More specifically, MR is based on two higher order functions: Map and Reduce. The Map function reads the input as a list of key-value pairs and applies a UDF to each pair. The result is a second list of intermediate key-value pairs. This list is sorted and grouped by key and used as input to the Reduce function. The Reduce function applies a second UDF to every intermediate key with all its associated values to produce the final result. The two phases are non overlapping, as detailed in Figure 2.2.

The Map and Reduce functions are purely functional and thus without side effects. This is the reason why they are easily parallelizable. Furthermore, fault tolerance is easily achieved by just re-executing the failed function. The programming interface is easy to use and does not allow any explicit control of parallelism. Even though the paradigm is not general purpose, many interesting algorithms can be implemented on it. The most paradigmatic application is building the inverted index for Google’s search engine. Simplistically, the crawled and filtered web documents are read from GFS, and for every word the couple 〈word, doc id〉 is emitted in the Map phase. The Reduce phase then just needs to sort all the document identifiers associated with the same word 〈word, [doc id1, doc id2, ...]〉 to create the corresponding posting list.
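Expressed as UDFs, the indexing example is only a few lines long. The Python sketch below shows the Map and Reduce functions together with the grouping step that the framework performs automatically between the two phases; it mirrors the 〈word, doc id〉 example above rather than any production code.

from itertools import groupby
from operator import itemgetter

def map_fn(doc_id, text):
    # Emit one intermediate pair per word occurrence.
    for word in text.split():
        yield word, doc_id

def reduce_fn(word, doc_ids):
    # Build the posting list for this word.
    return word, sorted(set(doc_ids))

def run_mapreduce(documents):
    intermediate = [pair for doc_id, text in documents
                    for pair in map_fn(doc_id, text)]
    intermediate.sort(key=itemgetter(0))          # shuffle and sort by key
    return [reduce_fn(word, [d for _, d in group])
            for word, group in groupby(intermediate, key=itemgetter(0))]

docs = [(1, "cloud data analysis"), (2, "large scale data")]
print(run_mapreduce(docs))
# [('analysis', [1]), ('cloud', [1]), ('data', [1, 2]), ('large', [2]), ('scale', [2])]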


Hadoop [6] is a Java implementation of MapReduce, which is roughly equivalent to Google’s version, even though there are many reports about the performance superiority of the latter [30]. Like the original MR, it uses a master-slave architecture.

The Job Tracker is the master to which client applications submit MR jobs. It pushes tasks (job splits) out to available slave nodes, on which a Task Tracker runs. The slave nodes are also HDFS chunkservers, and the Job Tracker knows which node contains the data. The Job Tracker thus strives to keep the jobs as close to the data as possible. With a rack-aware filesystem, if the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces traffic on the main backbone network during the Map phase.

Mappers write intermediate values locally on disk. Each reducer in turn pulls the data from the various remote disks via HTTP. The partitions are already sorted by key by the mappers, so the reducer just merge-sorts the different partitions to bring the same keys together. These two phases are called shuffle and sort, and are also the most expensive in terms of I/O operations. In the last phase the reducer can finally apply the Reduce function and write the output to HDFS.

Dryad [36] is Microsoft’s alternative to MapReduce. Dryad is a distributed execution engine inspired by macro-dataflow techniques and massively data-parallel shader programming languages. Program specification is done by building a Directed Acyclic Graph (DAG) whose vertices are operations and whose edges are data channels. The programmer specifies the DAG structure and the functions (standard or user defined) to be executed in each vertex. The system is responsible for scheduling vertex execution on a shared nothing cluster, keeping track of dependencies, deciding which channels to use (shared memory queues, TCP pipes or files), deciding the degree of parallelism of operations and which vertices to aggregate in a single process. An interesting characteristic is that the programming paradigm of Dryad is more general than MapReduce. Indeed MR workflows can be expressed in Dryad, but not vice-versa.

At the last level we find high level interfaces to these computing systems. These interfaces are meant to simplify writing programs for cloud systems. Even though this task is easier than writing custom MPI code, MapReduce, Hadoop and Dryad still offer fairly low level programming interfaces, which require the knowledge of a full programming language (C++ or Java). The interfaces at this level usually allow even non-programmers to take advantage of the cloud infrastructure.

Sawzall [54] is a high level, parallel data processing scripting language built on top of MR. Its intended targets are filtering and aggregation scripts over record sets, akin to the AWK scripting language. The user can employ a set of predefined aggregators like counting, sampling and histograms. The script needs only to declare the desired aggregators, implement the filtering logic, which is applied record by record, and emit partial results to the aggregators. The resulting language is at the same time expressive and simple. It forces the programmer to think one record at a time, and allows the system to massively parallelize the computation without any programming effort.

Pig Latin [51] is a more sophisticated language for general data manipulation. It is an interpreted imperative language which is able to perform filtering, aggregation, joining and other complex transformations. Pig Latin statements produce Hadoop jobs that are executed in parallel. Pig Latin has a rich nested data model, fully supports UDFs and aims for the “sweet spot” between SQL and MR.

The authors claim, and report anecdotal evidence, that an imperative language (i.e. Pig Latin) is simpler to understand and program than a declarative one (i.e. SQL). At the same time Pig Latin uses relational-algebra-style primitives that can exploit the same kind of optimizations found in databases. Nevertheless the user retains more control over the way programs are executed. Somewhat disappointingly, Pig Latin has no loop or conditional statements, and thus suffers from the same problem of “awkward embedding” as SQL. The authors are working on a Pig-aware version of scripting languages like Perl and Python to overcome this problem.

DryadLINQ [71] is a set of language extensions and a corresponding compiler. The compiler transforms Language Integrated Query (LINQ) expressions into Dryad plans. LINQ is a query language integrated into Microsoft’s .NET framework. It is based on lambda-expressions and thus it is easily parallelized. It is strongly typed and has both an imperative (object-oriented) and a declarative (SQL-like) version.

LINQ expressions are compiled to Dryad execution plans lazily. The expressions are decomposed into subexpressions, each to be run in a separate Dryad vertex. The optimizer removes redundant operations, pushes aggregation upstream and merges multiple operators into pipelines where possible. The DAG runs in parallel and, once successfully completed, the data is available to the client through an iterator, akin to a normal SQL recordset.

SCOPE [9] is a declarative scripting language for Dryad based on SQL. It has the same tabular data model and implements most of its commands (select, group by, join, etc.). It also provides some more complex operations like preprocessing, reduction and combination of rows. Programmers can write UDFs and data adapters using C#. The code can be directly embedded into the scripts, and the scripts themselves can be imported and parameterized. A SCOPE script compiles to a DAG and executes on Dryad.

Hive [65] is a Facebook project to build a data warehousing system on top of Hadoop. It employs a tabular data model where the tables can be partitioned on columns and bucketed into different files. The files reside on HDFS, each table in a different directory. The metadata (schemas, location, statistics and auxiliary information) is kept separately in a relational DBMS called the Hive-Metastore. This speeds up metadata-only operations but complicates consistency. Hive employs a declarative query language called HiveQL, derived from SQL. However, custom MR scripts can be easily plugged in. Queries are compiled to a DAG of MR jobs. Hive is still a young project and still lacks many features like a cost-based optimizer and columnar storage.

Cascading [14] is a Java API for creating complex and fault tolerant data processing workflows on top of Hadoop. Cascading uses “tuples and streams” as data containers and a “pipe and filters” model for defining data processes. It supports splits, joins, grouping and sorting, but also custom MR operations.

2.2 Comparison with PDBMS

How do cloud technologies compare with their direct competitors, PDBMSs? To put this comparison in the right perspective we would like to recall the “CAP theorem” [26, 28] stated by Fox and Brewer. The theorem says that in every system we can get at most two out of these three properties: Consistency, Availability, and Partition tolerance. We can build systems with different features depending on which properties we choose. If we choose Consistency and Availability we can build a single database, or a cluster database using two phase commit (which is not partition tolerant). To have Consistency and Partition tolerance we can employ distributed locking and consensus algorithms to build a distributed database; Availability is reduced because of blocking and the unavailability of minority partitions. Finally, if we are ready to give up Consistency to have a highly available and partition tolerant system, we can build systems like the DNS. In this case we need to employ techniques that deal with inconsistency, like leases and conflict resolution.

Partition tolerance is required to build large scale distributed systems, where network failures are guaranteed to happen. Cloud systems usually prefer high availability for economic reasons, especially if they are used to power user-facing Web applications. On the other hand, PDBMSs traditionally prefer consistency, as they are the evolution of traditional databases.

The choice is then between Basically Available, Soft state, Eventually consistent (BASE) [55] and Atomicity, Consistency, Isolation, Durability (ACID), two complementary design philosophies. With BASE, the application requirements need to be deeply understood in order to address consistency issues that are not automatically solved by the system. The payback is increased scalability [45]. With ACID we need not worry about state or consistency issues, but the performance is constrained by total serializability.

Cloud computing and most PDBMSs also target different markets. Cloud technology is used mainly for OLAP tasks and PDBMSs for OLTP, but there are also niche databases targeted towards OLAP workloads.

PDBMSs employ the traditional relational model for data. Cloud systems are more flexible in this sense, providing more sophisticated data models (e.g. nested ones). The method to express queries is also quite different: parallel databases use SQL and select-project-join queries, while cloud systems are more focused on UDFs and custom transformations. This is a result of their respective goals, which are slightly different: parallel databases manage large scale data, cloud systems analyze large scale data.

The interaction with the system also differs. A cloud system usually does not require a data schema to be defined upfront. The user can avoid the burden of loading the data into rigid tables and start right away with the analysis. If the result is good, the analysis can be repeated and improved over time. This approach is more agile and responds better to change than the traditional database approach.

Not everyone agrees on the merits of these new cloud systems [22]. There is an ongoing fierce debate on which approach is the best [53].


Parallel Databases:
• multipurpose (analysis and data update, batch and interactive)
• high data integrity via ACID transactions
• many compatible tools (loading, management, reporting, mining, data visualization)
• standard declarative high-level query language: SQL
• automatic query optimization

Cloud Computing:
• designed for scaling on large commodity clusters
• very high availability and fault tolerance
• flexible data model, no need to convert data to a normalized schema at load time
• full integration with familiar programming languages
• full control over performance

Table 2.2: Relative advantages of PDBMS and Cloud Computing

Most of the critiques are directed towards the inefficiency of the brute force approach when compared to database technology [64]. More generally, cloud computing ignores years of database research and its results [21, 23].

Advocates of the cloud approach answer that not every problem involving data has to be solved with a DBMS [18]. Moreover, SQL and the relational model are not always the best answer. Actually, this belief is also held inside the database community [62].

For these reasons many non relational databases are flourishing on the Web. The so-called “NoSQL” movement synthesizes this trend: sacrifice the ability to perform joins and the ACID properties for the sake of performance and scalability, by going back to a simpler key-value design. Common practices in everyday database use also justify this trend. The application logic is often too complex to be expressed in relational terms (the impedance mismatch problem). Programmers thus commonly use “binary blobs” stored inside tables and access them by primary key only.

Declarative languages like SQL are only as good as the optimizer beneath them. Many times the programmer has to give “hints” to the optimizer in order to achieve the best query plan. This problem becomes more evident as the size of the data grows [37]. In these cases, giving the control back to the programmer is usually the best option.

Table 2.2 summarizes the relative advantages of each system. Each entry in the table can be seen as an advantage that the system has over its competitor, and thus a flaw in the latter.

On a final note, advocates of a hybrid system see value in both approaches and call for a new DBMS designed specifically for cloud systems [1].

2.3 Case Studies

Data analysis researchers have seen in cloud computing the perfect tool to run their computations on huge data sets. In the literature, there are many examples of successful applications of the cloud paradigm in different areas of data analysis. In this section, we review a subset of the most significant case studies.

Information retrieval has been the classical area of application for cloud technologies. Indexing is a representative application of this field, as the original purpose of MR was to build Google’s index for web search. In order to take advantage of increased hardware and input sizes, novel adaptations of single-pass indexing have also been proposed [46].

Pairwise document similarity is a common tool for a variety of problems such as clustering and cross-document coreference resolution. When the corpus is large, MR is convenient because it allows the computation to be decomposed efficiently. The inner products involved in computing similarity are split into separate multiplication and summation stages, which are well matched to disk access patterns across several machines [24].
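A hedged sketch of that decomposition, under the simplifying assumption that the corpus is represented as an inverted index of (document, weight) postings: the Map phase emits one partial inner product per pair of documents sharing a term, and the Reduce phase sums the partial products of each document pair. This follows the general scheme of [24] in spirit, not its exact implementation.

from collections import defaultdict
from itertools import combinations

def map_fn(term, postings):
    # postings: list of (doc_id, weight) pairs for this term.
    # Emit one partial inner product per pair of documents sharing the term.
    for (d1, w1), (d2, w2) in combinations(postings, 2):
        yield (min(d1, d2), max(d1, d2)), w1 * w2

def reduce_fn(pair, partial_products):
    # The similarity of a document pair is the sum of its partial products.
    return pair, sum(partial_products)

def pairwise_similarity(inverted_index):
    sums = defaultdict(list)
    for term, postings in inverted_index.items():
        for pair, product in map_fn(term, postings):
            sums[pair].append(product)
    return dict(reduce_fn(p, v) for p, v in sums.items())

index = {"data": [(1, 0.5), (2, 0.5)], "cloud": [(1, 0.25), (3, 0.5)]}
print(pairwise_similarity(index))   # {(1, 2): 0.25, (1, 3): 0.125}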

Frequent itemset mining is commonly used for query recommendation. When the data set size is huge, both the memory use and the computational cost can be prohibitively expensive. Parallel algorithms designed so that each machine executes an independent group of mining tasks are an ideal fit for cloud systems [43].

Search logs contain rich and up-to-date information about users’ preferences. Many data-driven applications rely heavily on online mining of search logs. A cloud based OLAP system for search logs has been devised in order to support a variety of data-driven applications [72].

Machine learning has been a fertile ground for cloud computing applications. For instance, MR has been employed for the prediction of user ratings of movies, based on accumulated rating data [11]. In general, many widely used machine learning algorithms can be expressed in terms of MR [12].

Graph analysis is a common and useful task. Graphs are ubiquitous and can be used to represent a number of real world structures, e.g. networks, roads and relationships. The size of the graphs of interest has been rapidly increasing in recent years. Estimating the graph diameter for graphs with billions of nodes (e.g. the Web) is a challenging task, which can benefit from cloud computing [38].

Similarly, frequently computed metrics such as the clustering coefficient and the transitivity ratio involve the execution of a triangle counting algorithm on the graph. Approximate triangle counting algorithms that run on cloud systems can achieve impressive speedups [66].

Co-clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. Co-clustering searches for sub-matrices of rows and columns that are interrelated. Even though powerful, co-clustering is not practical to apply to large matrices with several millions of rows and columns. A distributed co-clustering solution that uses Hadoop has been proposed to address this issue [52].

A number of different graph mining tasks can be unified via a generalization of matrix-vector multiplication. These tasks can benefit from a single, highly optimized implementation of the multiplication. Starting from this observation, a library built on top of Hadoop has been proposed as a way to perform large scale graph mining operations easily and efficiently [39].

Cloud technologies have also been applied to other fields, such as scientific simulations of earthquakes [11], collaborative web filtering [49] and bioinformatics [57].

2.4 Research Directions

The systems described in the previous sections are fully functional and commonly used in production environments. Nonetheless, many research efforts have been made in order to improve and evolve them. Most of the efforts have focused on MapReduce, also because of the availability and success of Hadoop. The research in this area can be divided into three categories: computational models, programming paradigm enrichments and online analytics.


2.4.1 Computational Models

Various models for MapReduce have been proposed. Afrati and Ullman [3] describe a general I/O cost model that captures the essential features of MapReduce systems. The key assumptions of the model are:

• Files are replicated recordsets stored on a GFS-like file system with a very large block size b, and can be read/written in parallel by processes;

• Processes are the conventional unit of computation but have limits on I/O: a lower limit of b (the block size) and an upper limit of s, a quantity that can represent the available main memory of processors;

• Processors, with a CPU, main memory and secondary storage, are available in infinite supply.

The authors present various algorithms for multiway join and sorting, and analyze the communication and processing costs for these examples. Differently from standard MR, an algorithm in this model is a Directed Acyclic Graph of processes, in a way similar to Dryad or Cascading. Additionally, the model assumes that keys are not delivered in sorted order to the Reduce. Because of these departures from the actual MR paradigm, the model is not appropriate for comparing real-world algorithms on Hadoop. However, an implementation of the augmented model would be a good starting point for future research. At the same time, it would be interesting to examine the best algorithms for various common tasks on this model.

Karloff et al. [40] propose a novel theoretical model of computation for MapReduce. The authors formally define the Map and Reduce functions and the steps of a MR algorithm. Then they define a new algorithmic class: MRC^i. An algorithm in this class is a finite sequence of Map and Reduce rounds with the following limitations, given an input of size n:

• each Map or Reduce is implemented by a RAM that uses sub-linear space and polynomial time w.r.t. n;

• the total space used by the output of each Map is less than quadratic in n;

• the number of rounds is O(log^i n).


The model makes a number of assumptions on the underlying infrastructure to justify the definition. The number of available processors is sub-linear. This restriction guarantees that algorithms in MRC are practical. Each processor has a sub-linear amount of memory. Given that the Reduce phase cannot begin until all the Maps are done, the intermediate results must be stored temporarily in memory. This explains the space limit on the Map output, which is given by the total memory available across all the machines. The authors give examples of algorithms for graph and string problems. The result of their analysis is an algorithmic design technique for MRC.

Lämmel [41] develops an interesting functional model of MapReduce based on Haskell. The author reverse-engineered the original papers on MapReduce and Sawzall to derive an executable specification. The analysis shows various interesting points. First, it shows the relationship between Google’s MR and the functional operators map and reduce. The user defined mapper is the argument of a map combinator. The reducer is likewise the argument of a map combinator over the intermediate data. The reducer is typically also an application of a reduce combinator, but it can also perform other operations like sorting.

Second, the author disambiguates the type definition of MR, which is very vague in the original paper. He decomposes MR into three phases, highlighting the shuffling and grouping operation that is usually performed silently by the system. The signatures are as follows:

map : Map(k1 : v1) → [(k2, v2)]

groupBy : [(k2, v2)] → Map(k2 : [v2])

reduce : Map(k2 : [v2]) → Map(k2 : v3)

The Map(key : value) data type is commonly called an associative array in PHP or a dictionary in Python; it is basically a set of keys with associated values. The square brackets [ ] and parentheses ( ) indicate a list and a tuple, respectively. These definitions are more rigorous in the way they use the terms “list” and “set”.
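These signatures translate almost directly into executable form. A minimal Python rendering of the same three-stage decomposition is given below, with a dictionary standing in for the Map(key : value) type; the word-count driver at the end is just an illustrative usage example.

from collections import defaultdict

def mr(mapper, reducer, inputs):
    """Three-stage decomposition of MR: map, groupBy, reduce."""
    # map : Map(k1 : v1) -> [(k2, v2)]
    intermediate = [pair for k1, v1 in inputs.items() for pair in mapper(k1, v1)]
    # groupBy : [(k2, v2)] -> Map(k2 : [v2])
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # reduce : Map(k2 : [v2]) -> Map(k2 : v3)
    return {k2: reducer(k2, v2s) for k2, v2s in groups.items()}

word_count = mr(
    mapper=lambda doc, text: [(w, 1) for w in text.split()],
    reducer=lambda word, counts: sum(counts),
    inputs={"d1": "big data", "d2": "big cloud"},
)
print(word_count)   # {'big': 2, 'data': 1, 'cloud': 1}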

Third, the author analyzes the combiner function of MR and finds that it is actually useless. If it defines a function that is logically different from the reducer, there is no guarantee of correctness. Otherwise there is no need to define two separate functions for the same functionality. However, we argue that there might be practical reasons to separate them, like performing output or sorting only in the last phase.

Finally, they analyze Sawzall's aggregators. They find that the aggregators identify the ingredients of a reduce (in the functional sense). They contend that the essence of a Sawzall program is to define a list homomorphism: a function to be mapped over the list elements as well as a “monoid” to be used for reduction. A monoid is a simple algebraic structure composed of a set, an associative operation and its unit. They conclude that the distribution model of Sawzall is both simpler and more general thanks to the use of monoids, and that it cannot be based directly on MR because of its multi-level aggregation strategy (machine-rack-global).
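As an illustration of reduction via a monoid (our own sketch, not Sawzall's aggregator API), the mean of a list is not associative by itself, but it becomes a list homomorphism once computed through the monoid of (count, sum) pairs:

    from dataclasses import dataclass
    from functools import reduce
    from typing import Any, Callable

    @dataclass
    class Monoid:
        unit: Any
        op: Callable[[Any, Any], Any]   # must be associative

    def list_homomorphism(f, monoid, items):
        # Map f over the elements, then fold with the monoid operation.
        # Associativity lets the fold be split per machine, per rack, then globally.
        return reduce(monoid.op, (f(x) for x in items), monoid.unit)

    # Mean via the (count, sum) monoid.
    pairs = Monoid(unit=(0, 0), op=lambda a, b: (a[0] + b[0], a[1] + b[1]))
    count, total = list_homomorphism(lambda x: (1, x), pairs, [3, 5, 10])
    print(total / count)  # 6.0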

2.4.2 Programming Paradigm Enrichments

A number of extensions to the base MR system have been developed. These works differ from the models above in that their result is a working system. Most of them focus on extending MR towards the database area.

Yang et al. [70] propose an extension to MR in order to simplify the implementation of relational operators, namely join, complex multi-table select and set operations. The normal MR workflow is extended with a third, final Merge phase. This function takes as input two different key-value pair lists and outputs a third key-value pair list. The model assumes that the Reduce produces key-value pairs that are then fed to the Merge. The signatures of the functions are as follows:

map : (k1, v1)α → [(k2, v2)]α

reduce : (k2, [v2])α → (k2, [v3])α

merge : ((k2, [v3])α , (k3, [v4])β) → [(k4, v5)]γ

where α, β, γ represent data lineages, k is a key and v is a value. The lineage is used to distinguish the source of the data, a feature necessary for joins. The implementation of the Merge phase is quite complex, so we will present only some key features. A Merge is a UDF like Map and Reduce. To determine the data sources for the Merge, the user needs to define a partition selector module. After being selected, the two input lists are accessed via two logical iterators. The user has to describe the iteration pattern by implementing a move function that is called after every merge. Other optional functions may also be defined for preprocessing purposes. Finally, the authors describe how to implement common join algorithms on their framework and provide some insight on the communication costs.
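To give an intuition of what the Merge phase buys, the following toy Python sketch joins the outputs of two Reduce phases on a common key; it only mimics the signatures above and ignores partition selectors and iterator logic (all names and data are ours):

    def merge_join(reduce_out_a, reduce_out_b):
        # Toy hash join of two reduce outputs keyed on the same attribute.
        # reduce_out_a, reduce_out_b: dict mapping join key -> list of values,
        # i.e. the (k2, [v3]) pairs of the two lineages alpha and beta.
        joined = []
        for key, left_values in reduce_out_a.items():
            for right_value in reduce_out_b.get(key, []):
                for left_value in left_values:
                    joined.append((key, (left_value, right_value)))
        return joined

    # Employees joined with departments on dept_id.
    employees = {10: ["alice", "bob"], 20: ["carol"]}
    departments = {10: ["engineering"], 30: ["sales"]}
    print(merge_join(employees, departments))
    # [(10, ('alice', 'engineering')), (10, ('bob', 'engineering'))]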

The framework described in the paper, even though efficient, is quite complex for the user. To implement a single join algorithm, the programmer needs to write up to five different functions for the Merge phase alone. Moreover, the system exposes many internal details that pollute the clean functional interface of MR. To be truly usable, the framework should provide a wrapper library with common algorithms already implemented. Even better, this framework could be integrated into high level interfaces like Pig or Hive for a seamless and efficient experience.

Another attempt to integrate database features into MR is proposed by Abouzeid et al. [2]. Their HadoopDB is an architectural hybrid between Hadoop and a DBMS (PostgreSQL). The goal is to achieve the flexibility and fault tolerance of Hadoop together with the performance and query interface of a DBMS. The architecture of HadoopDB comprises single instances of PostgreSQL as the data storage (instead of HDFS), Hadoop as the communication layer, a separate catalog for metadata and a query rewriter (SMS Planner) that is a modified version of Hive. The SMS Planner converts HiveQL queries into MR jobs that read data from the PostgreSQL instances. The core idea is to leverage the I/O efficiency given by indexes in the database to reduce the amount of data that needs to be read from disk.

The system is then tested with the same benchmarks used by Pavlo et al. [53]. The results show a performance improvement over Hadoop in most tasks. Furthermore, HadoopDB is relatively more fault tolerant than a common PDBMS, in the sense that crashes do not trigger expensive query restarts. HadoopDB is an interesting attempt to join distributed systems and databases. Nonetheless, we argue that reusing the principles of database research, rather than its artifacts, should be the preferred method.


2.4.3 Online Analytics

A different research direction aims to enable online analytics on large scale data. This would give substantial competitive advantages in adapting to changes, and the ability to process streaming data. Systems like MR are essentially batch systems, while BigTable provides low latency but is just a lookup table. To date, there is no system that provides low latency general querying.

Olston et al. [50] describe an approach to interactive analysis of web-scale data. The scenario entails interactive processing of a single negotiated query over static data. Their idea is to split the analysis phase in two: an offline and an online phase. In the former, the user submits a template to the system. This template is basically a parameterized query plan. The system examines the template, optimizes it and computes all the necessary auxiliary data structures to speed up query evaluation. In the process it may negotiate restrictions on the template. In the online phase the user instantiates the template by binding the parameters. The system computes the final answer using the auxiliary structures in “real-time”.

The offline phase works in a batch environment with large computing resources (i.e. an MR-style system). The online phase may alternatively be run on a single workstation, if enough data reduction happens in the offline phase. The authors focus on parameterized filters as examples and show how to optimize the query plan for the two-phase split. They try to push all the parameter-free operations into the offline phase, and create indexes or materialized views at phase junctures. Various types of indexing approaches are then evaluated, together with other background optimization techniques. This work looks promising, and there are still open questions, such as which kinds of templates are amenable to interactivity, how to help the user build the query template, and how to introduce approximate query processing techniques (e.g. online aggregation [34]).
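As a rough illustration of the two-phase split, and not of the actual system by Olston et al., consider a filter template with one parameter: the offline phase materializes an index on the parameterized column over the static data, so that the online phase reduces to a lookup (names and data are hypothetical):

    from collections import defaultdict

    # Offline phase: given a template such as "records WHERE country = ?",
    # precompute an index on the parameterized column over the static data.
    def build_index(records, column):
        index = defaultdict(list)
        for record in records:
            index[record[column]].append(record)
        return index

    # Online phase: binding the parameter is just an index lookup,
    # cheap enough to run interactively on a single workstation.
    def instantiate(index, parameter):
        return index.get(parameter, [])

    log = [{"country": "IT", "bytes": 10}, {"country": "US", "bytes": 7},
           {"country": "IT", "bytes": 3}]
    country_index = build_index(log, "country")
    print(instantiate(country_index, "IT"))   # the two IT records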

On this last issue we find a work by Condie et al. [15]. They modify Hadoop in order to pipeline data between operators, and to support online aggregation and continuous queries. They name their system Hadoop Online Prototype (HOP). In HOP a downstream dataflow element can begin consuming data before a producer element has finished execution. Hence they can generate and refine an approximate answer using online aggregation [34]. Pipelining also makes it possible to push data into a running job as it arrives, which in turn enables stream processing and continuous queries.

To implement pipelining, the simple pull interface between Reduce and Map has to be modified into a hybrid push/pull interface. The mappers push data to reducers in order to speed up the shuffle and sort phase. The pushed data is treated as tentative to retain fault tolerance: if one of the entities involved in the transfer fails, the data is simply discarded. For this reason mappers also write the data to disk as in normal MR, and the pull interface is retained.

Online aggregation is supported by simply applying the reduce function to all the pipelined data received so far. This aggregation can be triggered on specific events (e.g. 50% of the mappers completed). The result is a snapshot of the computation and can be accessed via HDFS. Continuous MR jobs can be run using both pipelining and aggregation on stream data. The reduce function is invoked at intervals defined by time or by number of inputs. For fault tolerance, the system keeps a rolling suffix of the history of Map output.
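A minimal sketch in the spirit of HOP's online aggregation, though not its actual API (trigger logic and names are ours): the reduce function is re-applied to all pipelined pairs received so far, and the result is exposed as a snapshot.

    from collections import defaultdict

    class OnlineAggregator:
        # Re-applies a reduce function to all pipelined pairs seen so far.

        def __init__(self, reducer, snapshot_every=2):
            self.reducer = reducer
            self.snapshot_every = snapshot_every
            self.received = defaultdict(list)
            self.pairs_seen = 0

        def push(self, key, value):
            # Called as mappers push intermediate (key, value) pairs downstream.
            self.received[key].append(value)
            self.pairs_seen += 1
            if self.pairs_seen % self.snapshot_every == 0:
                return self.snapshot()
            return None

        def snapshot(self):
            # Approximate answer over the data received so far; recomputed
            # from scratch, as noted below, unless the reducer is algebraic.
            return {k: self.reducer(k, vs) for k, vs in self.received.items()}

    agg = OnlineAggregator(lambda word, counts: sum(counts))
    for pair in [("rose", 1), ("cloud", 1), ("rose", 1), ("rose", 1)]:
        result = agg.push(*pair)
        if result is not None:
            print(result)
    # {'rose': 1, 'cloud': 1} then {'rose': 3, 'cloud': 1}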

HOP can also pipeline and perform online aggregation between jobs, but the authors describe this as “problematic”. For pipelining, they state that “the computation of the reduce function from the previous job and the map function of the next job cannot be overlapped”. This is actually not true, because the Map function does not need the full input before starting, but operates record-wise. For online aggregation, the problem is that the output of Reduce on 50% of the input is not directly related to the final output. This means that every snapshot must be recomputed from scratch, using considerable computing resources. This problem can be alleviated for Reduce functions that are declared to be distributive, associative or algebraic aggregates. This line of research is extremely promising and some of the shortcomings of this work are directly addressable. For example, we can enable overlapping of the computation for Reduce-to-Map pipelining. Moreover, we could use semantic clues to specify properties of the functions or the input. This could easily be implemented using Java annotations.
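While the text suggests Java annotations, the same idea can be sketched in Python with a decorator that tags a user-defined reduce function with declared algebraic properties a framework could exploit (hypothetical names, not an existing Hadoop or HOP interface):

    def reduce_properties(associative=False, commutative=False, algebraic=False):
        # Attach declared algebraic properties to a user-defined reduce function.
        def decorate(fn):
            fn.is_associative = associative
            fn.is_commutative = commutative
            fn.is_algebraic = algebraic
            return fn
        return decorate

    @reduce_properties(associative=True, commutative=True, algebraic=True)
    def sum_reducer(key, values):
        return sum(values)

    # A framework could check these flags to decide whether a snapshot can be
    # refined incrementally instead of being recomputed from scratch.
    if sum_reducer.is_algebraic:
        print("incremental snapshot refinement is safe for", sum_reducer.__name__)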

Chapter 3

Research Plan

In the previous Chapters we have shown the relevance of data intensive cloud computing and we have highlighted some of the currently pursued research directions. This research field is new and rapidly evolving. As a result, the initial efforts are not organic.

This thesis aims to fill this gap by providing a coherent research agenda in this field. We will address the emerging issues in employing cloud computing for large scale data analysis. We will build our research upon a large base of literature in data analysis, database and parallel systems research. As a result we expect to significantly improve the state of the art in the field, and possibly evolve the current computational paradigm.

3.1 Research Problem

Cloud computing is an emerging technology in the data analysis field, used to capitalize on very large data sets. “There is no data like more data” is a famous motto that epitomizes the opportunity to extract significant information by exploiting huge volumes of data.

Information represents a competitive advantage for companies operating in the information society, an advantage that is all the greater the sooner it is achieved. In the limit, online or real-time analytics will become an invaluable support for decision making.

To date, cloud computing technologies have been successfully employed for batch processing. Adapting these systems to online analytics is a challenging problem, mainly because of the fundamental design choices that favor throughput over latency [47].


Another driver for change is the shift towards higher level languages. These languages often incorporate traditional relational operators or build upon SQL. This means that the efficient implementation of relational operators is a key problem. Nevertheless, operators like join are currently an “Achilles' heel” of cloud systems [53].

However, we argue that this limit of the current unstructured and brute-force approach is not fundamental. This approach is just the easiest to start with, but not necessarily the best one. As research progresses, more sophisticated approaches will become available. High level languages will hide their complexity and allow their seamless integration.

Many data analysis algorithms spanning different application areas have been proposed for cloud systems. So far, speedup and scalability results have been encouraging.

Still, it is not clear to the research community which problems are a good match for cloud systems. More importantly, the ingredients and recipes for building a successful algorithm are still hidden. A library of “algorithmic design patterns” would enable faster and easier adoption of cloud systems.

Even though there have been attempts to address these issues, current research efforts in this area have been mostly “one shot”. The results are promising, but their scope and generality are often limited to the specific problem at hand. These efforts also lack a coherent vision and a systematic approach. For this reason they are also difficult to compare and integrate with each other.

This “high entropy” level is typical of new research areas whose agenda is not yet clear. One of the main shortcomings is that the available models for cloud systems are at an embryonic stage of development. This hampers the comparison of algorithms and the evaluation of their properties from a theoretical point of view.

Furthermore, up to now industry has been leading the research in cloud computing. Academia now has to follow and “catch up”. Nonetheless, this fact also has positive implications. The research results are easily adapted to industrial environments, because the gap between production and research systems is small.

The problems described up to this point are a result of the early stage of development in which this area of research currently is. For this reason we claim that further effort is needed to investigate this area.


3.2 Thesis Proposal

The goal of this thesis will be to provide a coherent framework for research in the field of large scale data analysis on cloud computing systems. More specifically, we aim to provide a framework that tries to answer these questions:

• Which are the best algorithms for large scale data analysis?

• How to support these algorithms on cloud computing systems?

• Is it possible to carry out online data analysis on such systems?

We will approach these problems according to the following methodology.

We will study the most common data analysis algorithms. A wide spectrum of analytical algorithms is being applied to large amounts of data. These algorithms range from typical text processing tasks, such as indexing, to pattern extraction, machine learning and graph mining. Such algorithms encompass different computational patterns, which might not properly match the popular paradigms available today. Therefore, the first step will be to define an analytical workload that is representative of their new, unsatisfied requirements.

This analysis will allow us to identify the weaknesses of existing systems, and to design a roadmap of contributions to the state of the art. To this end, we propose to selectively and coherently integrate into cloud systems principles from database research, as opposed to its artifacts. For this reason we will develop a comprehensive approach to integration. We will pick relevant principles from the database literature, which we can apply to satisfy the requirements. We will re-elaborate these principles in new forms and adapt them to the new context. If unable to find suitable results in the literature, we will formulate novel original approaches.

We plan to evaluate the quality of our proposed contributions both from an experimental and theoretical point of view. We will propose computational models for cloud computing that take into account the features of the system and its possible extensions. We will use these models to analyze our proposed solutions, and also to provide a common reference ground for the comparison of other solutions, algorithms and paradigms.

In addressing interactive and online analytics, we will evaluate different approaches to query answering, like sampling and approximate answers. In this respect, providing confidence intervals for approximate answers is an important feature, missing from current systems, that we plan to explore. We will also try to address the issue of skewed distribution of data, which often leads to straggler tasks that slow down the entire job. Furthermore, we will try to leverage properties of the input or of the computation in order to perform optimizations and reduce the job turnaround time.

We will generally use MapReduce as the target paradigm, because of its widespread adoption by the scientific community and of the open-source availability of its implementations (e.g. Hadoop). Nevertheless, we will strive to propose principles and approaches that are as general as possible. We hope that other cloud systems and the next generation of massively parallel systems will also be able to benefit from our work.

We expect to produce extensions to MapReduce and to the MapReduce paradigm in order to support online analytics and other common computation patterns. This will conceivably lead to a new paradigm, an evolution beyond the original MapReduce. In our activity we will give attention to the integration with high level languages, as this shields the user from the evolution of the underlying infrastructure.

As a byproduct of our research, we will develop a toolbox of algorithms for large scale data analysis. This toolbox will exploit our proposed extensions of the MapReduce paradigm, thus validating the value and soundness of our research efforts.

To conclude, the expected contributions of this thesis are:

• To build and evaluate a toolbox of algorithms for large scale data analysis on cloud computing systems

• To design extensions to existing programming paradigms in order to support these algorithms

• To develop methods to speed up these algorithms in order to support online processing

Our research is an effort to bring together the distributed systems and database communities. Cross-contamination of research areas is a great stimulus for scientific advancement. Our research is a step toward the creation of a common ground on which these communities will be able to communicate and thrive together.

Bibliography

[1] D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3–12, March 2009.

[2] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” in Proceedings of the VLDB Endowment, vol. 2, no. 1, August 2009, pp. 922–933.

[3] F. Afrati and J. Ullman, “A New Computation Model for Rack-Based Computing,” 2009, submitted to PODS 2010: Symposium on Principles of Database Systems.

[4] G. M. Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” in AFIPS ’67 (Spring): Proceedings of the April 18-20, 1967, spring joint computer conference. ACM, April 1967, pp. 483–485.

[5] C. Anderson, “The Petabyte Age: Because more isn’t just more—more is different,” Wired, vol. 16, no. 07, July 2008.

[6] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. (2005) Hadoop: A framework for running applications on large clusters built of commodity hardware. [Online]. Available: http://hadoop.apache.org

[7] G. E. P. Box and N. R. Draper, Empirical Model-Building and Response Surfaces. John Wiley & Sons, Inc., 1986, p. 424.

[8] M. Burrows, “The Chubby lock service for loosely-coupled distributed systems,” in OSDI ’06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, November 2006, pp. 335–350.


[9] R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou, “SCOPE: Easy and efficient parallel processing of massive data sets,” in Proceedings of the VLDB Endowment, vol. 1, no. 2. VLDB Endowment, August 2008, pp. 1265–1276.

[10] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “BigTable: A distributed storage system for structured data,” in OSDI ’06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, November 2006, pp. 205–218.

[11] S. Chen and S. Schlosser, “Map-Reduce Meets Wider Varieties of Applications,” Intel Research, Pittsburgh, Tech. Rep. IRP-TR-08-05, May 2008. [Online]. Available: http://www.pittsburgh.intel-research.net/~chensm/papers/IRP-TR-08-05.pdf

[12] C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, “Map-Reduce for Machine Learning on Multicore,” in Advances in Neural Information Processing Systems. MIT Press, 2007, vol. 19, pp. 281–288.

[13] E. F. Codd, “A relational model of data for large shared data banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970.

[14] Concurrent. (2008, January) Cascading: An API for executing workflows on Hadoop. [Online]. Available: http://www.cascading.org

[15] T. Condie, N. Conway, P. Alvaro, J. Hellerstein, K. Elmeleegy, and R. Sears, “MapReduce Online,” University of California, Berkeley, Tech. Rep. UCB/EECS-2009-136, October 2009. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html

[16] B. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, “PNUTS: Yahoo!’s hosted data serving platform,” in Proceedings of the VLDB Endowment, vol. 1, no. 2. VLDB Endowment, August 2008, pp. 1277–1288.

[17] M. Creeger, “Cloud Computing: An Overview,” Queue, vol. 7, no. 5, pp. 3–4, June 2009.


[18] J. Dean and S. Ghemawat, “MapReduce: a flexible data processing tool,” Communications of the ACM, vol. 53, no. 1, pp. 72–77, January 2010.

[19] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” in OSDI ’04: Proceedings of the 6th Symposium on Operating Systems Design and Implementation. USENIX Association, December 2004, pp. 137–150.

[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” in SOSP ’07: Proceedings of the 21st ACM SIGOPS symposium on Operating systems principles. ACM, October 2007, pp. 205–220.

[21] D. DeWitt and J. Gray, “Parallel database systems: the future of high performance database systems,” Communications of the ACM, vol. 35, no. 6, pp. 85–98, June 1992.

[22] D. DeWitt and M. Stonebraker. (2008, January) MapReduce: A major step backwards. [Online]. Available: http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards and http://databasecolumn.vertica.com/database-innovation/mapreduce-ii

[23] D. DeWitt, S. Ghandeharizadeh, D. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen, “The Gamma database machine project,” IEEE Transactions on Knowledge and Data Engineering, vol. 2, no. 1, pp. 44–62, March 1990.

[24] T. Elsayed, J. Lin, and D. W. Oard, “Pairwise document similarity in large collections with MapReduce,” in HLT ’08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies. ACL, June 2008, pp. 265–268.

[25] Facebook. (2008, August) Cassandra: A highly scalable, eventually consistent, distributed, structured key-value store. [Online]. Available: http://incubator.apache.org/cassandra

[26] A. Fox and E. A. Brewer, “Harvest, Yield, and Scalable Tolerant Systems,” in HOTOS ’99: Proceedings of the 7th Workshop on Hot Topics in Operating Systems. IEEE Computer Society, March 1999, pp. 174–178.

[27] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in SOSP ’03: Proceedings of the 19th ACM symposium on Operating systems principles. ACM, October 2003, pp. 29–43.

[28] S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” SIGACT News, vol. 33, no. 2, pp. 51–59, June 2002.

[29] J. Ginsberg, M. Mohebbi, R. Patel, L. Brammer, M. Smolinski, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2008.

[30] Google Blog. (2008, November) Sorting 1PB with MapReduce. [Online]. Available: http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html

[31] J. Gray and D. Patterson, “A Conversation with Jim Gray,” Queue, vol. 1, no. 4, pp. 8–17, July 2003.

[32] J. L. Gustafson, “Reevaluating Amdahl’s law,” Communications of the ACM, vol. 31, no. 5, pp. 532–533, May 1988.

[33] J. M. Hellerstein, “Programming a Parallel Future,” University of California, Berkeley, Tech. Rep. UCB/EECS-2008-144, November 2008. [Online]. Available: http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-144.html

[34] J. M. Hellerstein, P. J. Haas, and H. J. Wang, “Online aggregation,” in SIGMOD ’97: Proceedings of the 23rd ACM SIGMOD international conference on Management of data. ACM, May 1997, pp. 171–182.

[35] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,” IEEE Computer, vol. 41, no. 7, pp. 33–38, July 2008.

[36] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in EuroSys ’07: Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems. ACM, March 2007, pp. 59–72.


[37] A. Jacobs, “The pathologies of Big Data,” Communications of the ACM, vol. 52, no. 8, pp. 36–44, August 2009.

[38] U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec, “HADI: Fast diameter estimation and mining in massive graphs with Hadoop,” Carnegie Mellon University, Pittsburgh, Tech. Rep. CMU-ML-08-117, December 2008. [Online]. Available: http://reports-archive.adm.cs.cmu.edu/anon/anon/home/ftp/ml2008/CMU-ML-08-117.pdf

[39] U. Kang, C. E. Tsourakakis, and C. Faloutsos, “PEGASUS: A Peta-Scale Graph Mining System Implementation and Observations,” in ICDM ’09: Proceedings of the 9th IEEE International Conference on Data Mining. IEEE Computer Society, December 2009, pp. 229–238.

[40] H. Karloff, S. Suri, and S. Vassilvitskii, “A Model of Computation for MapReduce,” in SODA ’10: Symposium on Discrete Algorithms. ACM, January 2010.

[41] R. Lammel, “Google’s MapReduce programming model – Revisited,” Science of Computer Programming, vol. 70, no. 1, pp. 1–30, January 2008.

[42] L. Lamport, “The part-time parliament,” ACM Transactions on Computer Systems, vol. 16, no. 2, pp. 133–169, May 1998.

[43] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, “PFP: Parallel FP-Growth for Query Recommendation,” in RecSys ’08: Proceedings of the 2nd ACM conference on Recommender Systems. ACM, October 2008, pp. 107–114.

[44] LinkedIn. (2009, June) Project Voldemort: A distributed database. [Online]. Available: http://project-voldemort.com/

[45] F. Marinescu and C. Humble. (2008, March) Trading Consistency for Scalability in Distributed Architectures. [Online]. Available: http://www.infoq.com/news/2008/03/ebaybase

[46] R. M. C. McCreadie, C. Macdonald, and I. Ounis, “On single-pass indexing with MapReduce,” in SIGIR ’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in Information Retrieval. ACM, July 2009, pp. 742–743.

[47] M. K. McKusick and S. Quinlan, “GFS: Evolution on Fast-forward,” Queue, vol. 7, no. 7, pp. 10–20, August 2009.

[48] P. Mell and T. Grance. (2009, October) Definition of Cloud Computing. National Institute of Standards and Technology (NIST). [Online]. Available: http://csrc.nist.gov/groups/SNS/cloud-computing

[49] M. G. Noll and C. Meinel, “Building a Scalable Collaborative Web Filter with Free and Open Source Software,” in SITIS ’08: Proceedings of the 4th IEEE International Conference on Signal Image Technology and Internet Based Systems. IEEE Computer Society, November 2008, pp. 563–571.

[50] C. Olston, E. Bortnikov, K. Elmeleegy, F. Junqueira, and B. Reed, “Interactive Analysis of Web-Scale Data,” in CIDR ’09: Proceedings of the 4th Conference on Innovative Data Systems Research, January 2009.

[51] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A not-so-foreign language for data processing,” in SIGMOD ’08: Proceedings of the 34th ACM SIGMOD international conference on Management of data. ACM, June 2008, pp. 1099–1110.

[52] S. Papadimitriou and J. Sun, “DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining,” in ICDM ’08: Proceedings of the 8th IEEE International Conference on Data Mining. IEEE Computer Society, December 2008, pp. 512–521.

[53] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker, “A Comparison of Approaches to Large-Scale Data Analysis,” in SIGMOD ’09: Proceedings of the 35th SIGMOD international conference on Management of data, pp. 165–178, June 2009.

[54] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the data: Parallel analysis with Sawzall,” Scientific Programming, vol. 13, no. 4, pp. 277–298, October 2005.


[55] D. Pritchett, “BASE: An ACID Alternative,” Queue, vol. 6, no. 3, pp. 48–55, May 2008.

[56] J. Rowley, “The wisdom hierarchy: representations of the DIKW hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–180, April 2007.

[57] M. Schatz, “CloudBurst: highly sensitive read mapping with MapReduce,” Bioinformatics, vol. 25, no. 11, p. 1363, June 2009.

[58] Y. Shi, “Reevaluating Amdahl’s Law and Gustafson’s Law,” October 1996, Computer Sciences Department, Temple University. [Online]. Available: http://www.cis.temple.edu/~shi/docs/amdahl/amdahl.html

[59] S. S. Skiena, The Algorithm Design Manual, 2nd ed. Springer, 2008, p. 31.

[60] M. Stonebraker and U. Cetintemel, “One Size Fits All: An Idea Whose Time Has Come and Gone,” in ICDE ’05: Proceedings of the 21st International Conference on Data Engineering. IEEE Computer Society, April 2005, pp. 2–11.

[61] M. Stonebraker, C. Bear, U. Cetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. Zdonik, “One size fits all? Part 2: Benchmarking results,” in CIDR ’07: Proceedings of the 3rd Conference on Innovative Data Systems Research, January 2007.

[62] M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, N. Hachem, and P. Helland, “The End of an Architectural Era (It’s Time for a Complete Rewrite),” in VLDB ’07: Proceedings of the 33rd international conference on Very Large Data Bases. ACM, September 2007, pp. 1150–1160.

[63] M. Stonebraker, “The Case for Shared Nothing,” IEEE Database Engineering Bulletin, vol. 9, no. 1, pp. 4–9, March 1986.

[64] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin, “MapReduce and Parallel DBMSs: Friends or Foes?” Communications of the ACM, vol. 53, no. 1, pp. 64–71, January 2010.

[65] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: A warehousing solution over a map-reduce framework,” in Proceedings of the VLDB Endowment, vol. 2, no. 2. VLDB Endowment, August 2009, pp. 1626–1629.

[66] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “DOULION: Counting Triangles in Massive Graphs with a Coin,” in KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data mining. ACM, April 2009, pp. 837–846.

[67] J. Vitter and E. Shriver, “Algorithms for parallel memory, I: Two-level memories,” Algorithmica, vol. 12, no. 2, pp. 110–147, September 1994.

[68] W. Vogels, “Eventually Consistent,” Queue, vol. 6, no. 6, pp. 14–19, December 2008.

[69] Wikipedia. (2010, January) Database Management System. [Online]. Available: http://en.wikipedia.org/wiki/Database_management_system

[70] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, “Map-reduce-merge: simplified relational data processing on large clusters,” in SIGMOD ’07: Proceedings of the 33rd ACM SIGMOD international conference on Management of data. ACM, June 2007, pp. 1029–1040.

[71] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. Gunda, and J. Currey, “DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language,” in OSDI ’08: Proceedings of the 8th Symposium on Operating System Design and Implementation, December 2008.

[72] B. Zhou, D. Jiang, J. Pei, and H. Li, “OLAP on search logs: an infrastructure supporting data-driven applications in search engines,” in KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge Discovery and Data mining. ACM, June 2009, pp. 1395–1404.
